🌱 Greencloud: Infrastructure Sustainability Update
Presented by: Annika Garbers, Syona Sarma
Originally aired on July 24, 2023 @ 2:00 PM - 2:30 PM EDT
In light of Earth Week, join us to learn more about edge server design choices and the more power-efficient architectures we're exploring at Cloudflare to meet the challenge of energy efficiency!
English
Greencloud
Transcript (Beta)
Hello, everyone. Welcome to Cloudflare TV. My name is Annika and I'm on the product team.
Tomorrow is Earth Day, and in celebration, we are so excited to highlight some of the ways that Cloudflare is helping build a more sustainable Internet.
I am so stoked to be joined by Syona today, who's going to talk about our approach to sustainability in our network and infrastructure design, specifically with a focus on Cloudflare's hardware.
But before we get into it, let's just get introduced.
Syona, can you tell us a little bit about who you are, what you and your team do at Cloudflare?
Yeah, sure. Hey, everyone. Good morning. My name is Syona Sarma and I've been at Cloudflare about six months now and I lead the hardware compute and acceleration team here.
The key charter that our team has is to build the next generation of edge servers for Cloudflare's network, which all of our customers use to run their services.
A recent addition to this charter is introducing specialized hardware into our fleet.
We have primarily CPUs today, but there are cases where a subset of the workloads would get a benefit from using different types of architectures.
And that's an innovative new project that we've taken on.
Got it. I think your team has a really fascinating and pretty tough job because of the breadth of products and solutions that Cloudflare offers that we need to deliver on these servers that live across the edge of our network.
When you think about all of the different asks that come into your team and what's most important, what are some of the key priorities for you as you're thinking about our next generation of hardware design in our servers?
Yeah, well, the competitive advantage that Cloudflare has is that every service runs on every server, which is slightly different from how other cloud companies do things.
So any changes we make to the design has an impact across the scale of our network.
So in terms of priorities, there are three things that are front and center.
The first is we want to capture as much efficiency as we can on every single server node, which means power efficiency and which also means performance efficiency.
Power efficiency is something we'll focus on during the discussion today in the context of Earth Day.
But there's also the piece of costs for any infrastructure team, lowering costs while still maintaining reliability.
We have a presence in over 290 cities and 90 countries.
So having a reliable network which is secure is also one of the key priorities.
In terms of how we measure things, we have a performance measure which we call requests per second because we are an edge network.
But we also need to think about requests per second per watt as one of the performance metrics to keep in mind the power efficiency requirements.
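As a rough illustration of how requests per second and requests per second per watt can rank servers differently, here is a minimal Python sketch; all of the figures in it are made up for illustration and are not Cloudflare's actual numbers.

```python
# Requests per second (RPS) vs requests per second per watt:
# two servers can rank differently on the two metrics. All
# figures here are made up for illustration.

def rps_per_watt(rps: float, watts: float) -> float:
    """Efficiency metric: how many requests each watt buys."""
    return rps / watts

gen_a = {"rps": 100_000, "watts": 400}  # faster but hungrier
gen_b = {"rps": 90_000, "watts": 300}   # slower but leaner

for name, s in (("gen_a", gen_a), ("gen_b", gen_b)):
    print(f"{name}: {rps_per_watt(s['rps'], s['watts']):.1f} RPS/W")
# gen_a wins on raw RPS, but gen_b wins on RPS/W (300 vs 250),
# which is the metric that tracks power efficiency.
```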
And as part of what the team does, we're looking at CPUs, we're looking at memory, we're looking at SSDs and trying to determine what the golden ratio is for our customer workloads.
And from the data, it looks like data centers are about 1% of the total energy consumption and are growing consistently.
So any efforts we make in optimizing for power in our next generation designs will have a huge impact on sustainability efforts which range from everything from supply chain to raw materials to how we recycle these servers.
Makes a lot of sense. Yeah, we've had conversations with some of your colleagues on the infrastructure team in the context of sustainability over the past couple of years on GreenCloud TV.
And one of the things that has come up consistently that I see as a really positive and exciting aspect of the job that you're in is that a lot of the time those priorities, efficiency, reliability, cost optimization actually align with improvements in the resource usage and then the sustainability of our network.
So as you kind of think about the portfolio of things that your teams work on, can you tell me about one of those projects where maybe a hardware design decision or set of decisions has had positive impact on the efficiency and sustainability of our hardware, but also keeping those kind of other priorities that your team focus on in mind?
Sure. So the majority of our fleet today is on what we call the CISC architecture, typical CPU vendors being Intel and AMD.
That's changing and we'll talk a little bit about why that is going forward.
But what we noticed with these CPUs is that the thermal design power (TDP) is growing exponentially: the power consumption of these CPUs has gone from 150 watts about five years ago to about 240 watts in our current generation.
The design that we're looking at for the next gen hits about 350 to 400 watts of max TDP.
Looking forward, this is going to go up to 500 watts, where we're looking at system power of over one kilowatt, and cooling becomes a huge deal.
How do you cool a server as your CPU needs increase?
So currently we have an air cooling solution, but going forward we're going to start thinking about possible liquid cooling solutions.
So in terms of the air cooling, what we found when we were looking at the architecture for gen 12, which is our next-gen design going to mass production next year, was that the chassis height and the fan design were two knobs we could tune to actually get up to 15 percent better OpEx and save up to 150 watts of power.
The fan design part of it was honestly a surprise to us. What we found was that we were using a 1U chassis, which was limiting and constricting the airflow, and our racks could actually accommodate a taller chassis.
So we decided to move from a 1U to a 2U, which basically increases the height and allows for better airflow and also for better fan designs.
We used to have somewhere around seven fans.
So the thing with fans is each of the fans consumes a certain amount of power and the more fans you have, the better you're able to cool the overall system.
So it's really a trade-off between the fan power consumption and the system cooling that you need.
The placement of the fans, the model of the fans, and how fast they operate all have a huge impact on the overall cooling solution.
So we made a bunch of changes and landed on about four fans, which are 30 watts each.
The 2U chassis gives us a benefit and we also have bigger heat sinks.
And all of this, like I said, enables up to 150 watts of savings per node.
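As a back-of-the-envelope check on the fan numbers, here is a small sketch; the four-fan, 30-watt figure comes from the discussion, while the old design's fan count and per-fan power are assumptions for illustration.

```python
# Fan power trade-off: fewer, better-placed fans in a taller
# chassis. The new-design figures (4 fans at 30 W) are from the
# talk; the old-design figures are illustrative assumptions.

OLD_FAN_COUNT, OLD_FAN_WATTS = 7, 30   # assumed 1U design
NEW_FAN_COUNT, NEW_FAN_WATTS = 4, 30   # 2U design

old_power = OLD_FAN_COUNT * OLD_FAN_WATTS   # 210 W
new_power = NEW_FAN_COUNT * NEW_FAN_WATTS   # 120 W
print(f"Fan power saved per node: {old_power - new_power} W")
# Fans alone account for part of the savings; the rest of the
# up-to-150 W figure comes from airflow and larger heat sinks.
```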
The one thing to keep in mind is that rack design considerations are equally important, because you can make several design changes at a node level, but they only translate at scale when you're able to make sure the stranded power in a rack is as low as it can be.
And typical rack bottlenecks include space, power, and network connectivity.
And the reason we were able to make this change is because we were not space constrained, we were power constrained to begin with.
So we were able to move to a larger chassis and accommodate that in the existing rack power limits that we have.
That's amazing. I remember talking to you about this initially when we were kind of planning for this session, thinking about what we wanted to talk about.
And when you first told me that this idea of just rethinking where we place the fans on our servers, which model we use, and increasing the amount of space the entire chassis takes up led to that much of an improvement in our power efficiency, it blew my mind.
And sometimes it's the smallest design decisions, the things that would seem trivial compared to everything else you're thinking about and putting into the server design, that actually make the biggest difference.
So that's really cool. What else is on your mind? What's another project you've been working on that's having a significant impact in this space?
Yeah, the other part of what I think is relevant is that, like I mentioned, our roadmap is on CISC for the most part, but we're introducing ARM designs, a RISC-based architecture, going forward.
Currently, it's less than 5% of our fleet today, but we have a continued roadmap and the goal is to start increasing this generation over generation.
So why is there a need for this different type of architecture and a divergence in the edge roadmap?
The reason really is rising electricity costs, which went up by as much as 500% in certain regions of the world based on last year's data, and that hits specific regions more than others.
So Asia and Europe are impacted a whole lot more. And when you translate power to OpEx over the lifetime of a server, which is about four to five years, it becomes a bigger and bigger part of the TCO, which creates a need for a different type of architecture, which we're going to grow as a percentage of the fleet.
Now, why is ARM different and how is it different?
So it gives you overall about a 28% better perf per watt, which translates to about 20% better perf efficiency or OpEx savings.
And the reason that we're able to get this is because our workloads scale with cores, and ARM cores are basically stripped-down x86-like cores: they're much smaller and don't have all of the extra features that have now become part of CISC designs, whether for AI or networking types of use cases.
And it's a more throughput-based architecture with no L3 cache and a different type of memory design and architecture, which allows for better latency characteristics as we scale the load.
So all of these architectural components translate to better perf per watt and therefore better OpEx savings.
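The arithmetic linking the perf-per-watt gain to OpEx is simple enough to sketch. The 28% figure comes from the discussion above, while the baseline RPS/W value is an assumption for illustration: at a fixed throughput target, 1 − 1/1.28 ≈ 22% less power, in the same ballpark as the roughly 20% savings mentioned.

```python
# How a 28% perf-per-watt advantage becomes a power (and OpEx)
# saving at a fixed throughput target. Baseline RPS/W is made up.

def power_needed(target_rps: float, rps_per_watt: float) -> float:
    """Watts required to serve target_rps at a given efficiency."""
    return target_rps / rps_per_watt

x86_rps_per_watt = 250.0                     # assumed baseline
arm_rps_per_watt = x86_rps_per_watt * 1.28   # 28% better perf/W

target = 1_000_000  # RPS needed from a group of servers
saving = 1 - (power_needed(target, arm_rps_per_watt)
              / power_needed(target, x86_rps_per_watt))
print(f"Power saved at equal throughput: {saving:.1%}")
```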
Now, there are some challenges in terms of introducing a new type of architecture into the fleet.
And there have been a bunch of technical blockers that we've worked on over the last six months in order to be able to scale this to where we want to go.
And I'll just talk through a few of them to give more context on what it takes to ramp a whole new architecture.
The first is the actual workloads themselves and compiling them on an ARM64 instead of the x86 architecture.
This is not a small amount of effort and the software ecosystem around ARM is still very much growing.
So it takes a bit of time to make sure all the services that run on all x86 can be run on ARM.
The other part to this is just the technical expertise of our engineers also needs to grow as we try to translate these services to ARM.
We also have to watch the rate of incidents, and as we introduce new services, make sure they're running on ARM as well.
So this is a big part of the work and part of the effort that we've been investing in.
The other part of it is rack design. We talked about it slightly in the previous section.
Do the x86 rack power limits, sizes, distribution map to what the ARM needs are or do they need to change?
How do we determine which colos should have ARM based on, you know, like I mentioned, certain regions are more power sensitive than others and so on.
Aside from this, Cloudflare takes security really seriously.
So FedRAMP certification is a big part of the US colos that we manage, and there was an issue with BoringSSL.
It's a library which has cryptographic functions in it and there's something called FIPS compliance.
It's basically a stamp on the cryptographic modules that are called by these libraries to make sure they are up to the FedRAMP standards.
They were only validated on x86 and hadn't been validated on ARM. So there's a central forum where this approval needs to go through.
So we had to track that, and only once we got past that validation and imported it into our services and our packages could we move forward with a FedRAMP deployment of ARM.
This was one big issue that we worked through.
The other big thing is memory encryption, again in the security context. It's something our customers repeatedly talk to us about, and the x86, or CISC, roadmap is far ahead of ARM in terms of the memory encryption feature set, allowing for what we call multi-key total memory encryption today.
Whereas ARM has not prioritized this and the ARM vendors have not prioritized this so far.
So we don't have in our current generation of ARM even single key memory encryption.
So we had to be careful about distributing the ARM servers, isolating them to spines and making sure we were still meeting the customer requirements and have a forward-looking roadmap, which we do today.
The next generation of ARM will have single key memory encryption.
This is one of the other big technical blockers we had to work through.
And finally, there was Lot 9 compliance for Europe, which required a transition from a platinum-grade PSU to a titanium-grade PSU.
And if you step back and think about it, all of these factors go into designing in a whole new architecture, from hardware selection through qualification, through deployment.
So it is really a huge lift, but we are willing to invest in it because we see the OpEx benefit going forward, and that will only continue to grow as we move forward.
And listening to you just list out the number of different factors that your team has to keep in mind here is just wild.
And I'm sure that there's new things that we discover all the time as we're continuing this rollout.
I'm curious, for folks that are listening who might be in a similar position to you, or who are looking at their own infrastructure teams, folks within organizations that design and manage their own hardware footprint.
And they're thinking about, you know, if they wanted to take on a project like this, if they wanted to reconsider a new hardware architecture, that's going to be more power efficient for sustainability reasons.
But then also, like you said, for total cost of ownership over time, what are some of the things that you have experienced through this process so far?
And I know we're still in the early stages and have a long way to go, but what are some of the things that have been really successful and beneficial for your team in making this happen or getting those early learnings?
And what are some of maybe the things that we've learned from this that maybe we would do differently if we were starting it over again?
Just kind of advice generally for folks that might be in your position or adjacent to roles like yours.
Yeah, I think understanding the priorities from a customer perspective is key.
And like I mentioned, there's been a divergence in our edge roadmap, because we found this power constraint to be one of the pain points that came to us constantly from our customers.
So I think it comes down to understanding the customers, understanding the workloads, and being able to create categories for what is valuable to them.
An example would be web tier workloads versus AI ML versus more throughput focused workloads.
How do you use a tiering structure? Do you, like Cloudflare, have a policy where every server runs every service, or only where it matters?
So being able to segment the target use cases with the type of hardware architecture that would be most suitable, I think is the best place to start.
And that's what we found in our experience. It's also important to think through this.
There is a time investment. It is going to take some time to ramp.
So it's not going to be a couple of months before you can have ARM in your fleet.
So getting the buy-in from the larger product teams in the company, to make sure they will do the work from the engineering perspective because of the additional complexity that is introduced, is another key piece to keep in mind.
I think in terms of the drawbacks or the problems, I think all of us worry about second source.
And there aren't a whole bunch of ARM vendors today in the market.
We have a few we rely on. So that is something that, from a business perspective, we need to think through in terms of risk and risk mitigation.
This space is only growing.
So in the next couple of years, we should start seeing more and more competitors in the ARM space, which will make this problem easier to tackle.
Absolutely. Thinking about what you said about customers and customer priorities being the driving factor, one of the things that comes to mind for me is there are three groups of priorities that you almost have to think about and juggle.
One is, in a way, Cloudflare product and engineering are your customers.
It's the folks that are developing the things that our end customers want.
And then in some additional way, there's our end customers' customers that are looking into their services.
You've got to think about all of those different people's priorities in the full chain of what's the actual value that we're delivering at the end to your customers, who are our customers, and their end users.
But then also, I think a cool thing is that we've heard from our customers over time that not only do they have increasing expectations and things that they would like to see from Cloudflare from a product delivery perspective, new types of traffic workloads, new ways to use the services that we already deliver to them, but also that kind of network characteristics for us, for Cloudflare's network, are something that factors into their own decision-making.
Our customers come to us and say, hey, we want, if we're making a big bet on someone that we're going to use as effectively an extension of our network, as an extension of our own infrastructure to do so many of these really business-critical functions for us, we want you, Cloudflare, to have and be able to prove and back up that you're the most sustainable network in the world, that you're thinking about all of these things holistically from a hardware design perspective, from a co-location perspective, from accounting and offset perspective.
There's so many aspects of this that we think about, and our customers are really continuing to push us.
In addition to us having it be our own priority, our customers are really continuing to push us to show not only that we're thinking about this, we're delivering on it, but then also we can prove it with results and data over time as well.
So that by itself, it ties into these other priorities, but it's also something that's continuing to drive us to keep that focus on.
We know that it's important to our end customers as well.
Yeah, and what you said is actually very relevant to the memory encryption feature that we talked about slightly earlier, which is what is the exact requirement from a customer standpoint?
Is it that the server that gets hit with the request needs to be memory encrypted, or that every single server that gets pinged during the lifetime of the request needs to have memory encryption?
These are two very different problems to solve from an engineering and service development perspective, and something we've been juggling as we try to figure out what the scope and the resourcing need to be in order to enable some of these.
So yeah, very true. Thinking beyond just our customers to customers of customers is key to being able to ramp a whole new platform and architecture in the field.
Yeah, I think that's one of the lessons that I had to learn early on as a PM.
Sometimes you go into a conversation with a degree of naivete, and you're like, yeah, of course.
Those things sound really similar, and the difference between those is really subtle.
Is it the server where we actually decrypt the initial request, or any possible server that the traffic could land on during its lifetime in our network?
But they can have, as you're saying, huge implications for our actual requirements, what the ramp-up schedule of this new architecture has to look like, all of that.
So lots to keep in mind. Another aspect that might be worth mentioning is the traffic management piece of this.
So ARM servers today are supposed to have better latency characteristics with load, which we do observe in the benchmarks.
But how that translates to our production workloads is a big question mark, something we're still working through.
So if it is true that the latency is actually better at higher loads, which based on the architecture is totally realistic, then it could mean that the buffer you keep in the traffic manager for spikes in traffic can be lower, and your thresholds can be higher, because of the latency benefit that you're seeing.
Because ultimately, what customers care about are response times.
And as long as you're meeting the SLIs that you give them, they don't really care about the internal operations.
So in addition to divergence in actual architecture itself, it's possible to make these changes for load balancing and traffic management that could give you some extra benefit with architectures like ARM versus x86.
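The threshold idea above can be sketched numerically: if one platform's latency curve stays flatter under load, its traffic-manager threshold can sit higher before breaching the response-time target. The latency curves and the 40 ms target below are made-up illustrations, not benchmark data.

```python
# If a platform's latency stays within the response-time target
# at higher utilization, its spill-over threshold can sit higher,
# leaving less stranded headroom. Curves below are illustrative.

def max_safe_load(latency_at_load, target_ms: float, step: float = 0.01) -> float:
    """Highest utilization (0..1) whose modeled latency meets the target."""
    load, safe = 0.0, 0.0
    while load <= 1.0:
        if latency_at_load(load) <= target_ms:
            safe = load
        load += step
    return round(safe, 2)

# Toy latency models: both start at 10 ms; one degrades faster.
steep_latency = lambda u: 10 + 90 * u ** 3   # steeper knee
flat_latency = lambda u: 10 + 50 * u ** 3    # flatter under load

TARGET_MS = 40.0
print("steeper-curve threshold:", max_safe_load(steep_latency, TARGET_MS))
print("flatter-curve threshold:", max_safe_load(flat_latency, TARGET_MS))
# The flatter curve supports a higher threshold before breaching
# the target, i.e. more usable capacity per node.
```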
So yeah, it's an end-to-end pipeline, and priority calls have to be made all along the way with the resources that we have currently.
Yeah, that makes sense and emphasizes again why collaboration is really important here.
You and your team can't go off and just design our next generation of servers in a dark cave or a silo somewhere and then just pop them up and be like, okay, this is what we're using.
It really has to be an end-to-end, really holistic, really collaborative process with all the people that have, even down to the application logic, ability to control aspects of this where there's interactions between the hardware and software in the ways that you're describing.
Okay, so we've explored lots of things in this conversation. We talked about near-term efficiency improvements with our next generation of servers just by making our chassis footprint a little bit bigger and moving some fans around, which again, the ability to have those kind of power efficiency improvements by making some of those subtle changes continues to blow my mind.
And then we talked about the longer-term project with the ARM rollout.
If you're interested in learning more about that, there's, I believe, a Cloudflare blog about it that kind of talks about some of the early decision-making processes.
So if you're watching this and curious to dig in more, check that out.
If we think about even longer-term, maybe, if that's the near and the mid-term step, what other things are you excited about or on the horizon that as a leader in this space, you're keeping your eye on to make sure that Cloudflare continues to drive and be at the forefront of these kind of advancements?
Yeah, so since we were talking about ARM, I start there.
The second source issue is real. So RISC-V, and different products in the RISC-V architecture space, is something that we are watching.
It's not something we can productize yet; what we've seen is that 2025 is when these products will start to appear on the market.
And the work that we've done on the software side for ARM64 is useful to be able to enable these different types of architectures in the future.
So we're hoping it's going to be an incremental lift and will help solve our second source issue with more options to be able to choose from.
So that's super exciting to see the development of different products from different vendors come to market.
The other thing that I'm personally super excited about is the domain-specific accelerators.
So what we mean by this is, I mentioned we're trying to put in specialized hardware in our fleet.
And a big part of optimization is to be able to identify the right hardware for the right use case.
So if you take machine learning, for example, that means GPUs: a specific type of hardware that can deliver up to 10x the performance at a similar power envelope to what a CPU today would have, which gives you an overall TCO benefit and savings longer term.
So being able to right-size the hardware for the use case is key, I think, to be able to scale the power efficiency as we grow and scale our network.
And even within these workloads, there are several different types of custom accelerators that are in the market.
And I mentioned machine learning.
There's a big subset of training workloads. There's a big subset of inference workloads with very different characteristics, whether it's the power requirements, the latency requirements, and so on.
So it's not an easy problem to solve because every new SKU you introduce needs to be designed in and maintained over the course of the lifetime of the product, but can be a really big knob to turn in order to get more performance and therefore better power efficiency for those specific workloads.
So I'm super excited to see how we move forward in that direction at Cloudflare long-term.
Awesome. That's so much to be excited about.
Thank you so much, Syona, for taking the time for this conversation today.
Really fun and fascinating, as always. If you're watching this, there's more sessions coming this week and next to highlight more aspects of how Cloudflare is helping build a more sustainable Internet, plus lots of content on this topic on Cloudflare TV and on our blog from the past couple of years.
So if you search Greencloud or sustainability on the blog or in the Cloudflare TV archives, you'll be able to find lots of sessions with more folks from our infrastructure team, our public policy team, our places team.
We think about this stuff a lot and it's really important to us and we want to share our learnings and then gather some as well from you.
So please feel free to reach out if you have feedback or advice or topics you think we should cover in the future.
Syona, any closing thoughts? No, this was great.
Thank you. And please look for blogs related to any of the topics we've covered.
There's a ton of experiments that we've run and data if you're interested in digging deeper.
Thank you. Thanks so much. Bye, everybody. Bye.