Originally aired on December 20 @ 12:30 PM - 1:00 PM EST
Join host João Tomé and Cloudflare’s Head of Hardware Engineering, Syona Sarma, for a discussion on Cloudflare’s latest Generation 12 hardware innovations, broadcast from the Lisbon office. As Cloudflare expands its global network across over 330 cities and 120 countries, explore how the company is evolving its hardware infrastructure to meet the demands of modern technology, particularly in the AI era.
This episode offers insights into Gen 12’s key features, including a 2x performance improvement and 60% better power efficiency compared to earlier generations. Learn how Cloudflare’s hardware engineering team is addressing the AI revolution through innovative thermal design, GPU integration for AI inference, and modular architecture for future adaptability. From testing in the Austin lab to global deployment, gain a closer look at how Cloudflare’s hardware is advancing the future of internet technology.
Hello, everyone, and welcome to This Week in Net.
By the way, that's the radar display that we have in all of our offices.
This is an episode coming out close to Christmas.
Christmas is just around the corner and this week we're going to talk about hardware.
Hardware is such an important part here at Cloudflare, and for those who don't know, Cloudflare has a global network that spans more than 330 cities in over 120 countries and interconnects with approximately 13,000 networks.
That means it interconnects with major ISPs, cloud services, and enterprises.
It's important to note that over the last two decades, hardware innovation has evolved to meet the demands of modern technology.
For example, CPUs laid the groundwork for personal computers and early internet infrastructure, while the rise of mobile devices drove advancements in energy-efficient, ARM-based chips.
On the other hand, GPUs, initially designed for graphics, have become critical for AI and machine learning, excelling, for example, in parallel processing and powering advanced data analysis, such an important part of 2024 and chatbots.
As Cloudflare recently unveiled our Generation 12 hardware, let's explore how these advancements are making a difference.
I'm your host, João Tomé, based in Lisbon, Portugal, where I am right now.
And with me, I have Syona Sarma, Head of Hardware Engineering.
On to the episode.
Hello Syona, how are you?
Oh, I'm good. Thanks, happy to be here.
I'm excited to talk to you about Gen 12 and what's coming from a hardware perspective in the future.
Exactly.
A lot to talk about.
Before that, can you tell us a bit of your background?
Where are you based?
Where are you originally from?
A little bit of your background.
Sure. So I lead the hardware engineering team here at Cloudflare.
I've been here about two years.
I'm a hardware engineer by training. I worked at Intel, my first job out of college, in a bunch of different roles that ranged from CPU design to business and product from an Intel data center business perspective, before coming here.
We have a wonderful team of hardware systems engineers, and the team is responsible for the design and development of the next generation of servers, which we think is critical to all of Cloudflare's services.
It powers all of the new features and developments from a services standpoint.
And that shows just how important it is, right?
For those who don't know, can you explain it to us a bit?
What is Gen 12 and why does it matter, really?
Yeah, every time we think about building a new generation of servers, which we do about every 18 months or so, because we're trying to keep up with the industry and the landscape in terms of new CPUs available, new memory and storage technologies available, we have three goals.
The first is how can we deliver better performance and power efficiency to our customers?
The second is what new features and what new functionality can be introduced; the inference capability is the one that we went with this last round.
And finally, reliability and quality, which is very important for us to be able to run a global network and for customers to be able to count on us from a security, privacy, and no-incidents perspective.
Those are the three goals we look at, with the larger understanding of the cost of infrastructure, which is usually the biggest capex expenditure for any company that owns its own hardware infrastructure.
We have to keep that in mind as we make some of these design choices.
So with Gen 12 specifically, what does it deliver versus the prior generation?
So on the three metrics that I talked about: we were able to deliver 2x the performance, which we measure in terms of requests per second. Again, Cloudflare runs web workloads, which are basically 20% of the internet's traffic.
So our workloads are varied.
And it's a really fun challenge to be able to characterize these workloads and determine what the hardware requirements we need to build towards need to be.
So we start out by characterizing the workloads.
With Gen 12, we were able to deliver 2x the performance versus the prior gen through a couple of different methods, the first being the CPU choice itself.
We went with AMD's Genoa-X product, which has a larger cache, which we discovered was a benefit during the workload characterization process, DDR5 capabilities, PCIe Gen5, and adequate network connectivity, given that our edge network spans so many different colos, including remote regions.
And finally, we wanted to make sure that the performance is tuned well enough to meet the workload requirements.
So we were not only able to deliver 2x the performance in terms of requests per second, but also 60% better power efficiency compared to Gen 11.
Why is power efficiency important?
We find that our customers in Europe and Asia actually value power efficiency as the highest KPI, because of the cost of power, the cost of electricity, in some of these regions.
Not to mention the sustainability goals that Cloudflare has taken on.
Stay tuned for what's coming towards the end of the year in terms of an impact report related to power efficiency.
It's also really important from a rack density perspective, which basically means the more servers you can squeeze into a rack that is rated at a certain power level, the lower the data center expansion costs are, so it ties back to the cost perspective.
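To make that rack density point concrete, here's a minimal sketch of the arithmetic involved; the power and request figures are hypothetical placeholders for illustration, not Cloudflare's actual numbers.

```python
# Hypothetical illustration of rack density: how many servers fit under a
# rack's rated power budget, and what that means for throughput per rack.
# All numbers below are made up for the example.

def servers_per_rack(rack_budget_watts: float, server_watts: float) -> int:
    """Servers that fit within a rack's rated power budget."""
    return int(rack_budget_watts // server_watts)

def requests_per_rack(rack_budget_watts: float, server_watts: float,
                      requests_per_second_per_server: float) -> float:
    """Total requests per second a single rack can host."""
    n = servers_per_rack(rack_budget_watts, server_watts)
    return n * requests_per_second_per_server

# Example: a 12 kW rack, an older generation vs. a more power-efficient one.
prev_gen = requests_per_rack(12_000, server_watts=600, requests_per_second_per_server=1_000)
new_gen = requests_per_rack(12_000, server_watts=450, requests_per_second_per_server=2_000)
print(prev_gen, new_gen)  # the more efficient generation serves far more per rack
```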
So performance and power efficiency is number one. Number two is the feature capabilities of Gen 12.
As of last year, you might have read on some of our blogs, we launched a product called Workers AI, which has the inference capability.
We decided to expand upon that in Gen 12, listening to what customers were telling us about where the landscape is evolving.
The key feedback that we got was we want to be able to support better throughput on smaller models and larger models.
And we're able to do that with the GPU attached on Gen 12.
We call that a turbo SKU that caters to those use requirements.
We're also a security first company.
So some of the feature development focus for us is what kind of security features we can add. In terms of that, we have something called the secure control module, a disaggregated root of trust.
We're also looking at BMC capabilities for better board management.
These are things that we keep in mind when we think about what the new generation of servers should be capable of.
And then finally the reliability and the quality aspects.
We run about a three-quarter qualification cycle and make sure we're adequately testing, so that we understand the error rates of the components.
We understand our resiliency requirements, which vary by workload, and are able to deliver reliable, resilient infrastructure to our customers.
So these are some of the capabilities that Gen 12 offers.
You mentioned several points there. For those who are not familiar with that term: AI inference.
Why is that important, for AI specifically?
For AI use cases, like chatbots, our interaction with chatbots, or other models, someone is building something on top of a model somewhere in the world.
What does having good AI inference, in a way, mean for those folks?
Yeah, so when we started thinking about AI, there are so many players in the market, we started thinking about what Cloudflare can do that's different from everybody else.
And the answer was clear: we already have this global spread of edge infrastructure.
How can we increase the capabilities of our existing infrastructure, given our system constraints to introduce some sort of AI capability, right?
And the path forward for us was inference, because it requires lower power, and typically, given our spread of the network, power and space from a data center perspective is a huge constraint.
We don't build giant training clusters.
We made the choice not to build giant training clusters and instead go down the path of a workload that requires better latency and benefits from being closer to the customer from an eyeball latency perspective.
We've proven that better eyeball latency, and being able to optimize network latency, is a superpower that Cloudflare has, given the global infrastructure spread we have.
So leveraging that, inference workloads were clearly the answer for where we would have a competitive advantage.
So we started out with a points-of-presence strategy, trying to make sure we had GPUs all over the map, closer to the customer, because latency and throughput are the two measures that people look for when it comes to inference.
As we went along, we realized that we need to be able to support larger models, for which we need larger GPUs, which again ties back to what power and cooling you can accommodate within the system constraints that you're already given on the edge workloads that you're supporting.
So we now have two generations of servers: one that has sub-hundred-watt inference cards, and another that has higher-power GPUs that are capable of supporting 70 billion parameters and so on.
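As a rough illustration of that two-tier idea, here's a small sketch of selecting a hardware tier by model size; the function name and the parameter cutoff are assumptions for the example, not Cloudflare's actual scheduler or thresholds.

```python
# Hypothetical tier selection: small models can run on low-power inference
# cards, while large models (tens of billions of parameters) need the
# higher-power GPU SKU. The 8B cutoff is an assumed value for this sketch.

def pick_tier(model_params_billions: float) -> str:
    if model_params_billions <= 8:
        return "sub-100 W inference card"
    return "high-power GPU (turbo SKU)"

for size in (1, 8, 70):
    print(size, "B params ->", pick_tier(size))
```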
We're not done.
We're looking at how the landscape is evolving, and our strategy with AI continues to be to understand what workloads we would like to target that offer a benefit to our customers, and how that translates to hardware.
We might go the ASIC route if we decide that there is a specific segment of workloads, say image generation, that we want to be better at than anything else.
If streaming workloads are something we want to be able to combine with inference, we might go a different route with disaggregation of GPUs, where we pool a bunch of GPU resources to be able to support the larger models as well.
There are a bunch of experiments, a bunch of POCs that we're planning for the future, and I know we'll talk about those in just a second, to be able to keep up with what's happening in the AI inference landscape.
In my mind, the biggest win for us has been to be able to segment the market that we want to go after and not try to do everything at the same time, which allows us to leverage the benefits that Cloudflare already offers.
Makes sense.
You mentioned a lot there.
One of the use cases I typically find most interesting: a lot of us interact with chatbots.
And how quickly they respond to us is really important by now.
If they're a little bit quicker, we'll do more work, we'll be more effective potentially.
If it's really slow, it's not a good experience.
So the latency, how quick these models are, is important even to the general audience that uses chatbots, and of course for those building the apps as well.
But across the total landscape, those using AI and those creating platforms on top of AI will have better potential, in a way, with better latency, even when something needs more computing power, as AI does, right?
Yeah, definitely.
And the trade-off that we make with delivering better latency or throughput is power, right?
So the challenge from a hardware perspective is from a mechanical and thermal point of view.
How can we accommodate the extra power that GPU-like hardware consumes on a large, global scale?
We want every system to be capable of running these workloads.
So how do you trade that off with performance, which is why we've taken kind of this two tier strategy of we have GPUs everywhere.
Now we have a new generation of servers with high power GPUs and we do a lot behind the scenes in terms of traffic management and routing and optimizing for network latency to be able to deliver response times that you're talking about, which customers really see at their end.
In a way, I'm showing right now the Gen 12 server specifically, with some captions showing where the CPU is, where the memory is, or the fans.
Is there anything you think some of these images show that could be interesting to share with the general audience?
Most folks don't have like a visualization of some of these servers.
Here is one.
In a way, they are put into racks.
The rack could have several servers, right?
Yeah. So just to give you a sense of what our rack power limits are, they range from eight kilowatts to 15 kilowatts.
So they're not super high-powered racks, given the places that we have a presence in.
So you'll notice we're very open about our designs.
It's kind of the Cloudflare way of doing things, to be able to share our learnings and learn from our partners.
So one thing you'll notice here is the commitment to open standards.
DC-SCM basically stands for Data Center Secure Control Module.
It's something we worked on with the OCP team, the OCP workgroup, in developing, first to market, the 2.0 version of it.
So one thing that you can see clearly here is the dedication to open standards and open architectures.
I would expand this to what's coming in the future in terms of open architectures, such as ARM. We are exploring a bunch of ARM solutions to see how best we can integrate Cloudflare-required capabilities with maybe an ARM-type architecture.
I think we will still stick with x86, but add to that, right?
So I would say from a hardware perspective, the commitment to open standards is one thing you'll see here.
The other piece of this is the effort that it takes to be able to go through the hardware selection process.
And what I mean by that is what type of CPU do you choose?
How much memory do you need?
How much I/O? What is the ratio of cores to cache to memory to storage?
These are the aspects that our engineering team looks at very closely, given the larger constraint of power and the fans and the cooling solution that we are capable of integrating into our data centers.
As you mentioned, this is a node.
We put several such servers into the rack, and we make some design choices at the rack level. An example of this is moving from a 1U chassis to a 2U chassis, to improve the airflow and also to be able to design for the future.
When I mention inference and you want to insert a GPU, you need to have the leeway to be able to do that in a design that, you know, might not have the GPU for a year ahead.
So we're really designing for the future.
How do you keep the ability to accommodate the new workloads that you need to support, how do we keep our commitment to open standards, what are the design criteria, and how do you balance the needs of the workloads with the costs of building a new system, primarily around power?
You'll see all that as you read through the blog.
We talked to you about the challenges that we face and how we've chosen to address them.
I was mentioning something that we're showing visually, but those just listening will have to read the blog to see some of the images.
But you mentioned DC, SCM before something that also improves security and the need to be aware of open source projects and what's coming.
You mentioned ARM. Can you give us a glimpse of some of the things that we did to make this Gen 12 efficient, reliable, and working properly?
Some of the things that we did to enable Gen 12, or what's coming in the future.
Maybe starting with the enable part and then we'll address what's coming in the future.
Yeah, I mentioned the three metrics that were our goals: power efficiency and performance, feature improvements, and finally, the resiliency of the network.
So we designed a system.
We started out by looking at the hardware landscape in order to be able to select the right components, starting with the CPU and being able to tune the CPU to our workload needs, and looking at the memory technologies and storage technologies, building the right configuration for the system that could deliver the performance improvements that our customers demand.
In addition to the power efficiency aspects, one of the things that we've done is look at it from a rack level and understand how we can standardize our racks so that deployment is easier for our downstream teams.
That pairs with the capacity planning with our infrastructure operations teams, and our sourcing partners, to be able to have supply diversity and also to be able to deploy these systems as needed, where they're needed, as quickly as possible.
So the rack-level design criteria was an important thought exercise in designing Gen 12.
Finally, the cooling solution, which will continue to be something that we focus on in future generations, is becoming more and more important because CPU TDPs are going through the roof.
The CPU we chose for this generation was 400 watts.
The next generation is about 500 watts.
And once you've crossed that threshold, you start to think about whether liquid cooling-like technologies would benefit us longer term.
So it's kind of leading us down the path of how we can expand our capabilities and still keep power at acceptable limits, given where we are these days.
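As a rough illustration of that air-versus-liquid decision, here's a small sketch of a threshold check on CPU TDP; the cutoff value is an assumption for the example, not a stated Cloudflare design rule.

```python
# Illustrative only: the threshold is an assumed value, not Cloudflare's rule.
# It captures the idea that once CPU TDPs climb past a practical limit for
# air cooling, liquid-cooling options start getting evaluated.

AIR_COOLING_PRACTICAL_LIMIT_W = 500  # assumed cutoff for this sketch

def cooling_strategy(cpu_tdp_watts: int) -> str:
    if cpu_tdp_watts < AIR_COOLING_PRACTICAL_LIMIT_W:
        return "air cooling (heatsinks and fans)"
    return "evaluate liquid cooling (e.g. cold plates)"

for tdp in (400, 500, 600):  # roughly this generation, the next, and beyond
    print(f"{tdp} W -> {cooling_strategy(tdp)}")
```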
Makes sense.
You already mentioned specific elements of the process of selecting Gen 12, the different areas, including the design of the rack specifically.
But I'm curious and a lot of changes have been happening in the hardware space because of AI, as mentioned.
What is the Cloudflare perspective regarding hardware?
And also, how has it evolved in the last year, for example? Is it different now than it was two years ago?
Yeah, I think the AI landscape has made it such that we can't accommodate our design changes in the typical hardware product life cycle.
So what I mean by that is it takes about 18 months to come up with a new generation of server designs, right?
But with AI, you need to be able to move to market quicker, because the needs are changing at a much quicker pace.
What that means is when we build a Gen 12 system, for example, you have to think through what's coming with all of the uncertainty around it and figure out how to build that flexibility into your system design to be able to accommodate and pivot and integrate new capabilities along the way.
I would say that's one big change in terms of our mindset of hardware design.
The other thing that we've focused on for Gen 12, and we'll continue to focus on for Gen 13, is modularity.
So our servers support compute, they support storage workloads, and now they support AI inference, right?
So is there a better way to manage our infrastructure more efficiently and make it more modular, and be able to give the workloads that need more storage, more storage, and the workloads that are more compute-bound, fewer drives?
How do you enable this modularity in a system that looks the same wherever you deploy, which is one of Cloudflare's philosophies that every server should be capable of running every workload?
So modularity is another aspect of design that I think will be reinforced with Gen 13.
Just to give you some examples of what's coming and some of the experiments we're running: we're looking at methods like disaggregation, where we're able to pool the storage resources, pool the GPU resources, and use them where we need them, as we need them, in collaboration with our software engineering teams, who are able to route requests where they need to go and do a lot of the optimization behind the scenes.
So it'll be really interesting to see what comes out of some of these experiments and work, which will feed into Gen 13.
For sure.
I'm curious about that element, and I remember last year we actually did some episodes here on This Week in Net regarding that, about having our data centers worldwide be what we call AI inference capable.
So they were capable of doing AI inference.
As we mentioned before, how was the process of having most of the data centers by now, a lot of them with GPUs, with this Gen 12 capacity, with better AI inference latency, in a way?
How was that process, which by now has been going on for well over a year, right?
How was that process specifically?
Yeah, it was an evolution.
And we learned a lot along the way.
Like I mentioned, we started out with sub-100-watt inference cards, because that's what we thought we should target when we first started down this path.
We went with NVIDIA GPUs because they were the simplest to deploy, both from a hardware and software perspective, right?
So we try to deploy them globally everywhere we could.
What we learned as we went along from our customers is we need more performance on larger models and better throughput on smaller models in certain cases.
So that led us to a new generation of GPUs, which we integrated into our Gen 12 system to be able to support that.
So it's been an evolution: as the landscape evolves and the workloads evolve, we've been integrating that and trying to deliver servers at a much faster pace than we would have otherwise.
So it's been kind of an intercept, which we're able to do because we own our own infrastructure.
We see this as a reason to continue to invest in hardware at Cloudflare and continue to build our own hardware systems instead of renting instances, because it gives us the ability to pivot and move as the workloads change around us.
It makes sense. And for example, for those who don't know, we have a lab in Austin, right, where we test all of these things.
Can you guide us through a bit of the process, from the lab to deploying to the locations?
Can you guide us a bit on that process and what the lab is like?
Yeah. So in our production environment, given that we run all types of web workloads, it's really hard to benchmark each type of hardware solution that we're testing, right?
Because what benchmarks do you choose?
Does it reflect adequately what your production environment looks like?
So the lab environment is critical for us as a staging environment.
It's where we take all of the new hardware solutions available in the market and determine what the right fit for our specific workload needs are.
We have a tech readiness phase that we go through, and we're in the midst of that now for Gen 13, to understand what we're projecting in terms of the workload needs growing over the next five years, because that's how long the servers stay in our fleet.
Can we model that in our lab environment and try to benchmark them before we put these systems into production, where there's more adequate and thorough testing as part of our qualification cycle?
So typically our qualification cycle includes a tech readiness phase, an EVT, DVT, PVT phase, and finally mass production.
And tech readiness is where we make the decisions on the exact configuration for the system that we're planning to build, then work with our ODM partners to be able to build that along the way and intercept the qualification cycles that I just mentioned.
So it's critical. I would also mention we're running a bunch of POCs in the lab related to disaggregation, related to liquid cooling, related to changes in our architecture that would allow us to use our infrastructure more efficiently.
We want to prove these out in a lab environment before we can put that in production.
The last thing is, as I mentioned, we're looking at alternatives to GPUs, possibly inference accelerators, to know which of these suit the workload needs we're really trying to target. All of that testing is set up in our lab, which we call the AI jungle gym, where we're testing and trying to make sure we know exactly what we need before we move forward with integrating it into production.
So that's a good name.
Can you repeat the name?
It seems cool.
Yeah, it's the AI jungle gym.
We started out calling it the AI playground, but jungle gym seemed more apt, given the different types of experiments that we're running there.
So, yeah. Yeah. Specifically, we have several blog posts on our blog.
We already mentioned one of those.
There's also "Thermal design supporting Gen 12 hardware: cool, efficient and reliable," and also an analysis of the 140 percent performance gain in Cloudflare Gen 12 servers.
In any of these blogs, is there something you want to highlight that folks should take a look at to learn more specifically?
Yeah, I think the thermal aspects of the design are increasingly important.
So if you want to understand some of the challenges we're facing and how our thermal team is tackling them, the thermal design blog is super interesting.
I would also refer you to the Gen 12 design blog if you want to understand exactly what we use in our infrastructure and why we use it.
I think there's a lot to share and learn with the community about future hardware development.
Looking ahead to 2025, what excites you and worries you the most?
What is your prediction on this topic of hardware for 2025?
What I'm looking forward to is the changes from a hardware architecture perspective that we can make to make our designs more modular and more flexible because I don't think we know exactly what the future looks like in five years and we want to build hardware and systems that are capable of serving the infrastructure needs at that point, right?
So the flexibility in modularity aspects is a very interesting challenge that we're working through.
And what can we do from a hardware perspective, from a thermal solutions perspective, to have a solution that is capable of whatever is to come in the future?
It's a really interesting thought exercise.
We're in the tech readiness phase for that now, and we'll be able to implement it in Gen 13.
So really looking forward to the architectural changes of gen 13.
Also on the inference accelerator space, what can we do better?
What additional capabilities can be integrated into our infrastructure?
To continue down the inference path, is there an adjacent area, kind of like training elsewhere and running inference on Cloudflare, that can be integrated into a data pipeline, to bring to customers the value that inference on Cloudflare brings, in addition to just being able to run inference as a workload?
Inference is not a workload you run in isolation; it's really a feedback loop.
So if we're able to provide an example of a full end-to-end data pipeline, training through inference, with our R2 solution, which by the way is super important for inference-type applications, I think it will be a big win.
These are some of the experiments and POCs that we're engaged with, which you will see in the next year.
Regarding AI in particular, you mentioned already some of the flexibility needed to accommodate the changes that could be around.
If you had to make some predictions on what those changes could be, in terms of what the hardware landscape will go through or not regarding AI, what would that be?
Do you expect that it will continue to grow a lot and that changes will be needed regarding hardware and AI?
Yeah, I think it's anybody's guess at this point. My sense of things is that the parameter count of the models has sort of capped out around 100 billion parameters, and we see a change in the precision needs, moving towards lower precision, which could mean a smaller hardware footprint in terms of compute capabilities, memory, and so on.
And in-memory compute is one of the things that we see with the new products that are being launched, all of this with a focus on power efficiency.
You'll hear a lot about more power -efficient AI architectures, which I think is critical given where the GPUs have been going in terms of the TDPs and the demands on power that not everybody can accommodate.
So these are some trends that we're seeing with AI.
We pick and choose what makes the most sense for Cloudflare workloads, and our future roadmap will address some of those.
Before we go, I'm curious about a more general question, which is: if you had to explain what excites you the most about hardware and why this area is so interesting for you, how would you express that?
Yeah, I think hardware is foundational to everything Cloudflare does.
I'm obviously biased as a hardware person, but as a hardware team we have to think ahead before anybody else starts thinking about some of these problems, because our product life cycles are so long.
Software has a quicker turnaround, so product teams and software engineering teams can introduce a new feature in a quarter to two quarters, right?
But if you don't have the hardware capabilities to support that and deliver the performance our customers need, we're not going to be competitive.
So being such a critical part of the infrastructure team, and being able to deliver this to our product and services teams to enable them to build new things in the future, is super exciting.
And I have a wonderful team of hardware engineers here who are excited to keep working on that over the next few years.
You've partly answered this already, but I'll ask the question all the same.
What is the one thing about hardware that most people, even if they're not in technology, don't realize that they should?
I think the impact on the environment, given that the power consumption of hardware is continuing to rise, is a serious concern.
And we have some sustainability goals as a company that we've taken on and we're making slow progress towards.
So I think that's something that people may not completely internalize, especially with AI and machine learning on the horizon, right?
What does it mean in terms of compute power, in terms of the carbon footprint and impact on our environment, from a sustainability perspective?
I don't see it as something that's bleak; there is a focus on power efficiency and power-efficient architectures, both from an infrastructure cost savings perspective, but also from how we can reduce the impact on the environment and what types of new technologies and energy solutions can be looked at.
There's a lot happening in this domain that I think people would find interesting to learn more about.
Thanks, that makes perfect sense.
And you already mentioned it in a way: hardware is core to the cloud.
So the cloud is hardware too.
And you showed us a glimpse of that hardware perspective related to the cloud that we all use, which allows all of these interactions, critical infrastructure interactions as well, in terms of what you need, like planes, airplanes, et cetera.
So it's quite interesting to get a glimpse of all these hardware perspectives related to the cloud and how things work and AI's definitely relevant there for sure.
Yeah, and I'm definitely looking forward to what's to come in 2025.
Please read the Gen 12 blog if you want to understand the technical details of our hardware systems.
And we will continue to share what we learned from some of the experiments that we're running in our labs today.
Absolutely. And there's a lot to explore in those blogs, for sure.
Thank you so much, Syona.
And that's a wrap.