Cloudflare TV

Building the next generation edge servers/ Specialized HW Accelerators

Presented by Rebecca Weekly, Syona Sarma, Amy Leeland
Originally aired on 

Join us to learn more about how we are designing our next generation edge servers, methodologies for systems optimization, and transforming our design methodology for AI/ML as we prepare for long-term growth and the needs of our innovative product teams here at Cloudflare. We will also talk about how we came to be part of the Infrastructure team in honor of the last day of Women's History Month.

English
Women in Tech

Transcript (Beta)

Hello and welcome. Welcome to Cloudflare TV. My name is Rebecca Weekly. I am the VP of hardware systems here at Cloudflare and I am joined by two amazing engineering leaders today in honor of Women's History Month, Syona Sarma, who is the leader of our compute edge store and edge platform development, and Amy Leeland, who is our director of hardware systems engineering focused on systems optimization.

So I'm going to let them introduce themselves and talk a little bit about what they do here at Cloudflare.

So, Syona. Hey, everyone. Thank you for tuning in. My name is Syona Sarma.

And like Rebecca said, I lead the hardware compute systems engineering team here at Cloudflare.

And my team is responsible for the design and development of the next generation of edge servers.

Cloudflare's network is foundational to all of the services that run on it.

It's a huge responsibility and we take it really seriously.

I'm also looking forward to sharing a bit about a new work stream we introduced on accelerators and telling you a little bit more about that during the session.

Awesome. And Amy. Hello, everyone. I am rounding my sixth month here at Cloudflare.

So I'm still kind of a newbie, but I'm working on the infrastructure hardware systems engineering team, and my team works on leading a new function focused on hardware system optimization across compute, storage, and network.

And really, we're focused on helping to deliver proactive workload and architecture specific optimizations across our systems and services to make sure that our services and products are running best for our customers and customer experiences.

And also, so that we're really getting greater efficiency out of the infrastructure that our customers run on, that our products run on and services run on.

So I'm excited to be here with you today. No pressure, no pressure to either of you.

So, maybe I just wanted to start, you know, it is Women's History Month, it's an incredible time to sort of sit back and reflect on the journey we've all had here.

And I'd love to kind of get a little perspective on your journey to infrastructure at Cloudflare.

For those who haven't read, you know, the LinkedIn Pulse survey, women represent 28% of the roles in tech, but only 15% in engineering and actually even fewer in infrastructure.

So I think our whole audience would probably be curious about your journey to coming to infrastructure.

So maybe I'll start with you, Syona, and then Amy, how did you come to Cloudflare?

What was your journey, education and background? Yeah, some people have a misconception about a set path to get to where you want to go in your career.

But my experience has been very different and I find if you have an overall sense of direction, you can define what your next role is based on what you bring to the table and what you can learn and grow in.

So I started out at Intel with an EE degree, spent 14 years there, the first 10 of which were on the engineering side, building and designing CPUs, everything from RTL to backend design to integration.

And some of what I learned there, whether it was the uncore design or memory hierarchies and how that translates to latency and unbalanced configs, translates to what I'm doing today as part of a system design engineering team.

I also spent some time in the competitive analysis group, looking at the industry landscape, and the performance modeling I did in the machine learning space also translates to accelerators and what we'll be talking about in a couple of minutes.

And finally, I worked with some of Intel's cloud service provider customers as a vendor, understanding the qualification cycles, business strategy and the product development.

Now being on the other side in a cloud company and having a front row seat to customer workloads, there's a lot that I learned there that directly translates as well.

So it's been really cool to look back and connect the dots on the different skill sets and the teams that I've been a part of.

Love it. And absolutely. So a traditional sort of engineering background, but going from the bottom in some sense up now to operating at the cloud and building the systems that let us give our services to the rest of the world.

That's awesome. Amy, your journey is completely different. And I think a wonderful juxtaposition, maybe, of different paths can lead into this.

So do you want to give a little bit of your background?

Yeah. So, similarly, I actually came...

You're going to say similarly. I love it. I know, very different, very different.

But similarly to Syona, I actually did start my, I guess, technical journey and career path at Intel, actually, when I was finishing my degree.

Not similarly to Syona, I don't have an EE degree or, you know, a comp sci degree.

I love math. I have always loved math since I was a young child. And I formally got my degree in finance, actually.

And I say I went to the school of on-the-job learning to really build out my hardware and software knowledge and capabilities.

So very much hands-on learning in terms of, like, learning the technical skills and engineering skills coming from a different degree path.

When I started working at Intel, I actually started working on kind of more operations type projects.

I worked in the open source technology center at Intel, which very much exposed me to, you know, how the kernel works and a lot of operating system development.

So I worked on things like Moblin, MeeGo, and Tizen, which were these, like, mobile operating systems way back in the day.

And then actually moved on to building out a cloud operating system or leading development of a cloud operating system.

And that's where I got my hands a little bit more dirty in terms of technical development.

Something called Clear Linux. I also led development of something called Kata Containers, which is a secure container technology, and then orchestration.

And then kind of most recently, before joining Cloudflare, I was leading a team that worked with end customers to optimize their cloud deployments and kind of looking at their entire software stack and how they could really take advantage of hardware accelerators and features in their cloud deployments all the way to the application layer.

And I think for me, you know, again, not a traditional path and journey in terms of getting into tech and engineering, but I love the intersection of both hardware and software.

And we'll talk a little bit more about that, I think, potentially later in the episode.

But I think, you know, you never stop learning in this space.

Especially looking at kind of hardware configurations and software configurations, there are so many different knobs to turn.

It's a very kind of, I would call it an elegant puzzle.

And looking at, you know, all the different pieces, how you can plug them together, and how you can add value.

It's a dream job to be working here at Cloudflare and with both of you.

It's amazing to get to work with two phenomenal female leaders and mentors.

And so anyway, that's what brought me here. Well, that is awesome.

I feel completely excited to have both of you on the team. So what I really think is important about kind of both of your feedback is that your journey was completely different.

A more traditional engineering background bottoms up into systems, a totally different background in terms of finance, but then connecting the dots, working through open source software, leadership, development, and then with cloud partners to stitch through their software layers to underlying hardware acceleration and features.

That intersection of being able to talk about hardware and software, the people who can speak that language are actually few and far between.

So it created these two amazing leaders. And I'm so grateful for you.

Anyway, when I talk, when we talk a little bit about what we do, maybe let's dig one level deeper.

So Syona, you know, why don't you tell us a little bit about what you work on every day?

Yeah, so we're in the middle of generation 12 of the edge server design, and the work involves everything from the hardware architecture evaluation and selection in terms of components.

What CPU, what memory configurations and some golden ratios in terms of core to memory, core to storage, scaling with workloads, etc.

Through qualification, the EVT, DVT, PVT cycles where we test for functional correctness as well as the performance.

And finally, through production and making sure that any incident responses we are able to debug and respond to as quickly as possible.

There's also the aspect of the rack itself and the different bottlenecks in terms of the number of ports you have, the space, the power and the per node system requirements that you're designing to.

And this has implications to what TDPs you run at, how many nodes you can squeeze into a rack and so on.
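
To make the rack trade-off above concrete, here is a minimal sketch, not Cloudflare's actual sizing method. It assumes a rack is limited by three budgets (power, rack units, and switch ports) and that the tightest one sets the node count. The function name and every number in it are hypothetical.

```python
def nodes_per_rack(rack_power_w, rack_units, switch_ports,
                   node_power_w, node_units, ports_per_node):
    """Return how many nodes fit, limited by the tightest budget."""
    by_power = rack_power_w // node_power_w
    by_space = rack_units // node_units
    by_ports = switch_ports // ports_per_node
    return min(by_power, by_space, by_ports)


# Hypothetical rack: 12 kW, 42U, 48 switch ports; 1U nodes, one port each.
low_tdp = nodes_per_rack(12_000, 42, 48,
                         node_power_w=400, node_units=1, ports_per_node=1)
high_tdp = nodes_per_rack(12_000, 42, 48,
                          node_power_w=600, node_units=1, ports_per_node=1)
print(low_tdp, high_tdp)  # 30 20
```

In this made-up case power is the bottleneck, so raising per-node power from 400 W to 600 W cuts the rack from 30 nodes to 20, which is the kind of TDP implication described above.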

So it's the entire pipeline of design of a server. There are certain metrics that we have to measure ourselves against in order to be able to make the right choices.

And the three key pillars at Cloudflare have been performance, reliability and security.

And as you can imagine, performance is about requests per second.

We're trying to extend that to be more workload specific, introduce a latency aspect in terms of SLIs, and also perf per watt is becoming increasingly important as a metric that we measure against.

Of course, reliability, given the scale of the network, having a resilient network, understanding mean times to failures and how we make the most of the infrastructure we already have is another key feature that we need to keep in mind.

And last but not least, security. Cloudflare's security posture is possibly the most important thing our customers value.

And therefore, we're introducing features like DCSCM and memory encryption, which have become hard requirements in terms of the design cycles that we look at.

So one of the most interesting parts of this job is understanding the workloads themselves and how they scale with the hardware requirements.

And what this really translates to is which of the workloads are core bound versus memory bound versus IO bound.

How do we respond to larger L3 caches in our current designs or the SSD capacities that certain teams need, how we segment and things like that.
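
The core-bound versus memory-bound versus IO-bound distinction can be sketched roughly as follows. This is an illustrative simplification, assuming each workload is summarized by three utilization fractions and is labeled by whichever resource is closest to saturation. The workload names and numbers are hypothetical, not measured Cloudflare data, and real profiling would rely on hardware counters rather than three averages.

```python
def classify_workload(cpu_util, mem_bw_util, io_util):
    """Label a workload by whichever resource is closest to saturation.

    Each argument is a utilization fraction between 0.0 and 1.0.
    """
    utilization = {
        "core-bound": cpu_util,
        "memory-bound": mem_bw_util,
        "IO-bound": io_util,
    }
    return max(utilization, key=utilization.get)


# Hypothetical profiles: (CPU, memory bandwidth, IO) utilization.
profiles = {
    "tls_termination": (0.92, 0.35, 0.10),
    "cache_lookup": (0.40, 0.85, 0.30),
    "log_shipping": (0.20, 0.25, 0.90),
}
labels = {name: classify_workload(*p) for name, p in profiles.items()}
print(labels)
```

A label like this is only a starting point for deciding which hardware ratio (cores, cache and memory, or storage) a given workload is most sensitive to.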

The key aspect to this is really being able to translate what we test in the lab to what we see in production, which is a lot of benchmarking and making sure we are able to expose some of this to our vendors and ODMs when we start evaluating these systems with them.

And of course, we do all of this in partnership with our capacity planning teams and our sourcing teams in order to be able to deploy in the volumes that we need, with second-source supply and all of that in place.

So just a little orchestration and management.

For those who don't know the gobbledygook that is infrastructure lingo, DCSCM is the data center security and control module.

It's a disaggregated root of trust solution that also has a control solution, but you would have your BMC and a root of trust on a separate card.

And it's under this premise that having disaggregation and modularity allows us to have better control, consistency, reliability, but also security and have it disassociated from like, this is a known good actor that is allowed to do X percentage on these buses.

This is the flow that we support and the two should not mix.

And traditionally, this has not been so clean in all server design.

It's an OCP specification. And OCP, again for other people who don't do all of the TLAs of our life here, is the Open Compute Project, which is a standards organization, an open computing organization that works to create base specifications to make the adoption of hardware innovation a little easier for those of us who aren't of hyperscale size, which, you know, we are starting to be that size, but we're not quite there yet.

So, Amy, I think I just wanted to kind of throw it over to you.

What is it that you're working on every day? Because it's completely different working in software optimization, even though you're here in infrastructure.

So, you know, Syona talked a little bit more about capacity planning and sourcing and how we work together to make sure SLIs are sort of coordinated.

You're working with product and services groups every day to try and make sure it actually runs correctly.

So let's chat a little bit about that. Yeah, so, you know, this is, like you said, Rebecca, it's a completely new domain within the infrastructure team.

And, you know, initially, our team has been just focusing on kind of baselining where we are across what I call three key pillars, to help, you know, again, baseline, look at where we are, and then figure out how to prioritize where we can improve and optimize across our infrastructure.

So kind of looking at both our customers, so external workloads, and then also internal services and product workloads, our utilization across our infrastructure, our workload profiles, both internally and externally.

And then finally, also looking at, you know, the cost of, let's say, the cost to serve various products and services.

And so our team works within the infrastructure team and then also across the services and product teams. First, we're looking at understanding and exposing, you know, workload and resource trends, working closely with Syona's team to say this is what we're seeing in terms of workload profiles, and helping to guide our future hardware decisions as she and her team are making decisions about, you know, what specific silicon or accelerators, or what is that right ratio of compute to memory or storage based on those production workload profiles.

How can that assist in making these different hardware decisions?

The second piece is really helping to guide prioritization of kind of how we allocate and weigh our different infrastructure resources.

So, looking at those allocations, again, as, you know, as services and products and customers are using different products on our infrastructure, making sure that we're prioritizing, you know, what work needs to happen first across, you know, our resources, so compute, memory, storage, network, and ensure that those services and products are running efficiently and optimally to meet and hopefully exceed customer expectations.

The third piece is really to help and work alongside our, so we work very, very closely with the services and product teams, especially on like, you know, optimization efforts and looking at those workloads and profiles.

We also work with our kind of system software layer, so our Linux kernel team to do optimizations.

But then also, because we're working very closely with our services and product teams, we're also helping to inform not only our infrastructure team deciding the next generation hardware, but also working alongside our capacity team and Syona's team to help, you know, look and plan infrastructure at a deeper level.

So, looking at it from not just the higher level request per second and the higher level metrics, but how different products and services are growing over time and what we need to take into account for capacity planning in the future.

And, you know, identifying, you know, what workload sensitivities we have in production and how those workloads are changing over time also really help to guide how we plan for capacity and our infrastructure moving forward.

Finally, you know, we think about increasing the visibility of optimization efforts across the company, right?

So, part of the whole kind of business financial metrics is really to expose, like I said earlier, the cost to serve.

And the reason why we're looking at that is to make sure that we're inspiring performance optimization across the company and making sure that, again, we're driving better efficiency and utilization across our infrastructure with our product and services teams.

And this is all in an effort to make sure that we have the best customer experience across our infrastructure fleet.

I kind of love that little tagline because I think you might have gone into the details of things in a way that people who don't do this every day don't necessarily know.

But I want to reflect back to you: you actually framed the work of optimization as helping product teams understand the cost to serve, so that we actually end up with more reliable and better user experiences for our customers.

And I think that's such an infrastructure approach of thinking about things, right?

Like, by making the cost to serve efficient, I promise you, you get a better user experience.

But inevitably, it is true, right? If we can serve more capacity more efficiently in every region where it's needed more proactively versus reactively, we're going to have a better end user experience.

And that is absolutely, again, no pressure, your job.

So I want to ask you both, starting with you, Syona.

Just to ping pong it back and forth. What are you most excited about for the future?

What are you working on? What's coming up? That's the most exciting thing every day.

So I think the growth path for Cloudflare is really ambitious.

And we hit the $1 billion revenue mark last year. And our CEO has outlined a path to get to $5 billion.

And we're only at 1% market penetration in terms of the TAM that's been identified.

So translating this to a clear hardware strategy, I think is super interesting.

And really it comes down to two things.

One is the hardware efficiency piece, getting order of magnitude improvements.

And the second is reducing the infrastructure costs. And currently, we're in a cadence of about 20% gen over gen TCO on our next generation CPUs.

But we're not going to get to this level of growth just doing that.
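
One way to read that claim is as compounding arithmetic. The sketch below assumes, purely as an illustration, that "20% gen over gen" means the cost per unit of work falls by 20% with each server generation. That interpretation and the function name are assumptions, not something stated in the talk.

```python
import math


def generations_needed(saving_per_gen, target_factor):
    """Generations of compounding savings needed to cut the cost per unit
    of work by target_factor (for example, 10 for an order of magnitude)."""
    remaining_per_gen = 1.0 - saving_per_gen
    return math.ceil(math.log(1.0 / target_factor) / math.log(remaining_per_gen))


print(generations_needed(0.20, 5))   # 8
print(generations_needed(0.20, 10))  # 11
```

Under that assumption, a 5x improvement takes eight generations and an order of magnitude takes eleven, which is why a steady gen-over-gen cadence on its own falls short of the growth target.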

So what we envision is really a divergence in the edge architecture, where we have one lane for the performance edge servers.

Really workloads that scale with cores that serve our web traffic at the service level indicators that are expected.

This is the main lane.

But there's also a need for a more power-optimized lane, given the power costs are rising significantly in specific regions of the world, like Asia, like Europe.

And a different type of architecture, something like ARM, something like an open RISC-V, would serve the needs of these customers better.

Because OPEX becomes the overriding factor in terms of the decisions that they make.

So that's the more power-optimized lane that we need to start thinking through.

And finally, this is the most exciting thing to your question, Rebecca.

Specialized hardware acceleration.

So at Cloudflare, we've had this slogan about every server runs every service, which we still want to keep to.

But there is a subset of workloads that will benefit from having specialized hardware, whether it's crypto acceleration, whether it's for a product like Stream, where we've already launched something called AV1, or machine learning.

And the machine learning piece is what we've taken on as a project for this year.

And we're going to be focusing on our internal Cloudflare workloads.

And as you can imagine, network services lend themselves really well to machine learning type of use cases, whether it's bots management, threat mitigation, identifying malicious attachments.

There are a whole bunch of use cases, about 50 of them that we've collected.

And we're going to be starting with looking at the data analytics of the core part of our fleet to demonstrate the value of these hardware accelerators.

Before we move on to the edge and actually enable our customers to maybe run inference type workloads in the future.

And I think eventually we will come up with a list of workloads that could maybe justify a custom ASIC down the line.

But this is what I think is the future of where Cloudflare is headed.

Big surprise that the person who started in silicon gets excited about maybe building our own silicon someday.

I love it. I love it.

And I have a similar background. So, of course, I get very excited about this too.

Amy, you know, a lot is spoken about sort of broadly in the ecosystem about this hardware software co-design principle.

Certainly you've taken a lot of responsibility in matching and marrying the decisions we're making in infrastructure to the product and service experience and the cost of service.

What are your thoughts as someone sort of in that intersection point on what's exciting about the future?

So I think I will echo a little bit what Syona said. I mean, we can't just do this in hardware.

Right. So hardware-software co-design is imperative to enable, you know, specialized hardware and get more out of the, you know, gen over gen performance gains that we're getting.

And moving more towards specialization.

As we all know, the software design and development lifecycle is a lot shorter than the hardware one.

I mean, from design to production, it is definitely a lot shorter.

And so making sure that we are actually co-designing hardware and software, which is where I'm excited about helping to improve things, I think requires industry partnership.

So things like better low-level SOC visibility and, you know, better tools for us to evaluate and profile how our unique workloads are running in production and how the SOC behaves under them, and that give us control of how to tune for workloads that may not have existed when that hardware was designed.

I mean, that's imperative today. You know, from a workload perspective, we have been working very much on profiling and optimizing our workloads that are in production with the tooling that's exposed and available today.

But with new architectures and new accelerators and with different SOC vendors, we're seeing, you know, inconsistencies in terms of the tooling that's available for us to do that profiling work.

And in many cases, we're not getting the best profiling tools, like ones that expose PMUs and counters, in a way that lets us give our SOC vendors proxy workloads, reflective benchmarks, and open benchmarks for them to use and plan with in their design and development.

So I know that focused a little bit more on, I guess, tools. But because of this gap between how fast software development happens and, you know, the timeline from hardware design to hardware in production, I think we absolutely need to focus on getting better tooling, better proxy workloads, and better open benchmarking, so that we can build that in during the design phase, plan better for the future, and enable faster, specialized designs that work out of the box based on those workload profiles.

So hopefully that answers your question. No, no, I love this because, I mean, obviously I'm wearing my OCP shirt like I'm all about the open, open projects in this domain, right?

Whether it's about reliability, manageability, observability, what you're talking about, you know, every ASIC, whether it's a small startup or a large company, people think about performance first and foremost, and they don't realize performance when the system is down is zero.

Performance when you don't have supply is zero.

So you will fail on all of those metrics if you don't give us the ability to understand what's happening.

And if your schema is so specific to you, you've created a huge barrier to entry for us to even look at your part.

And I think that's brilliant. And I don't know that there's, you know, we're not going to buy something that doesn't map well on perf TCO.

But gosh, we can understand that so much better and actually make changes in post to your point, if we have better observability on how things are running, where the bottlenecks are, as they run in the fleet.

So huge fan of that. And gosh, there's probably a whole nother segment we could do just on observability and AI techniques for how we're tuning and working to optimize out in the fleet.

I am just so excited every day that I get to work with you both.

So that is probably my favorite part.

And as we are in our last minute, I will just say if you thought these two amazing leaders are amazing, which they are, you know, they're both hiring.

So please, please, please do click on hiring jobs at Cloudflare.

It's an amazing company, an amazing company that's trying to do everything we can to build a better Internet.

We've done a bunch of blogs on a lot of the topics that Syona and Amy hit on.

We actually didn't even talk about OpenBMC, some of the security and modularity features, sustainability work that we've done in this last couple of years here in the infrastructure group.

Please check out our blog. Please reach out to both of us, any of us, all of us, to talk through any of your questions.

We are so excited for Women's History Month, for women doing amazing things, and for what we get to do here at Cloudflare.

Thank you. Thank you.