🌱 Greencloud: Cloudflare's Hardware and the Environment
Presented by: Annika Garbers, Rebecca Weekly, Michael King
Originally aired on September 6, 2023 @ 1:30 AM - 2:00 AM EDT
An interview with Cloudflare's Rebecca Weekly, Head of Hardware, and Michael King, Manager of Data Center Strategy, about how Cloudflare thinks about the environmental impact of the full lifecycle of our hardware devices - from design to end of life.
Greencloud
Earth Day
Transcript (Beta)
Hello everyone. Welcome to Cloudflare TV. My name is Annika.
I am so excited to be here today with my colleagues to talk about Earth Day. A whole bunch of great content has been happening all week on Cloudflare TV, some of it sponsored by Greencloud, which is our sustainability-focused employee resource group.
We've been covering all kinds of topics, all the way from crypto to transportation to Cloudflare sustainability accounting and reporting.
Today, I'm super happy to be joined by some of my colleagues from Cloudflare's infrastructure team to talk about Cloudflare's hardware and the environment.
I would love for us to start out with a quick round of intros.
Rebecca, do you want to start out?
Can you tell our audience a little bit about you, who you are, what you do at Cloudflare?
Awesome.
I am Rebecca Weekly.
I run Hardware Systems Engineering here at Cloudflare, which is an awesome and amazing team that works on basically the design and development of our network, storage, and, obviously, compute servers and systems throughout the ecosystem.
So that's what I do all day.
Very cool. And Michael?
My name is Michael King.
I lead the data center strategy function at Cloudflare. Data Center Strategy encompasses our partnerships and commercial relationships with data center providers around the world.
And we also do quite a bit of design internal to those data centers to fit out for our requirements.
So basically I am immersed all day in the world of data center.
So Rebecca's team and mine kind of are two pieces of the puzzle that are extremely important in building the platform from a hardware point of view.
Very cool.
So glad that you're both here. And again, my name is Annika.
I'm on the product team here at Cloudflare and a member of Greencloud.
But we're going to talk about hardware today.
Let's start actually with the basics for folks that are not familiar, maybe with Cloudflare, with hardware at Cloudflare.
What does this look like? Rebecca, can you tell me a little bit about what your organization is responsible for, how you measure success?
Sure.
I'll start with: what does hardware look like? I mean, I think everyone who works with Cloudflare knows we offer a network; in some sense, that is our product.
But a network is nothing if not a bunch of servers distributed across the world, interconnected through fiber and various network components and gear.
So ultimately, that's our responsibility.
We have to have a physical footprint.
Michael can talk to you all about that and I'll let him do it.
We have to have that actual fiber connectivity, the power, everything, and then we also have to have the server racks, the systems, and the network cards that connect to our top-of-rack switches, which connect to our spine switches and on to our edge routers.
So that is our fundamental ecosystem. And when we talk about being distributed across 275 cities, maybe I'm low on that number.
I don't even know, it's constantly growing.
That footprint requires a huge amount of server design, development, network connectivity, the validation of that gear to make sure it's effective.
And ultimately our entire capacity plan as a company, and most of our capital expenditure, will come from us increasingly developing and providing more performant solutions per TCO dollar that we spend.
So it's something we focus on a lot.
We want to increase our compute density.
We want to look at and always evaluate alternate architectures.
We've been very loud and proud on the blog about some of the alternate architectures we've tested, in terms of both non-x86 ISAs as well as domain-specific accelerators.
We're super involved in open networking, whitebox switches, opportunities in that domain and just reducing costs at every point within our design cycle.
If we don't need the highest speed interconnect, we can reduce layer counts on our boards and save ten bucks in the process, right?
So we look for the little ones, the big ones, everything we can do to make sure we're improving our time to deployment, making sure we're delivering performance per TCO dollar at the most efficient CAGR that both our investors and our company expect.
Yeah, that makes sense. We'll talk about this a little bit more later, but a lot of those cost goals that you were just talking about dovetail with, or work very closely with, sustainability goals.
If we can get kind of more performance per watt and more out of our network to deliver products for our customers, that ends up being better for the world overall in our kind of carbon impact.
And then, Michael, what about you?
Can you tell me a little bit about what your team does, how that dovetails with the work that Rebecca is doing, and how you think about those goals together?
We're all sort of working toward the same thing.
Yeah. I mean, basically everything Rebecca just said applies to me as well; in my team, it just applies in a slightly different way.
We have to go out and secure power and space for housing all of Cloudflare's network equipment, all of Cloudflare's servers, all of Cloudflare's storage, all of the wonderful things that Rebecca's team generates for us.
We have to put them somewhere.
We're kind of the Cloudflare Housing Authority, basically.
So my job, and my team's job, is to go out and find the most advantageous TCO for space and power around the world.
And what we've determined over the years is that a lot of the market differentiation for data center pricing and TCO doesn't necessarily revolve around the technical characteristics of that hosting on the building side.
So it doesn't have to do with whether the cooling is better or the power is better somehow, or things like that.
Sometimes it has a little to do with that, but mostly it just has to do with prerequisites around connectivity.
So for example, you can't have a network without connectivity.
We do need to have equipment deployed in certain data centers around the world that are connectivity heavy.
But do we need to put all of our equipment there?
Do we have alternatives for large chunks of the platform to go into spaces that are a better TCO and potentially a better sustainability fit as well?
The answer is yes, we do.
That's my team's whole goal: to identify where we can improve TCO, improve the business, and also improve our overall sustainability positioning across the data center estate.
And we have lots and lots of data center estate to do that.
So it's still early days for us, really.
Yeah, that makes a lot of sense.
So when you're thinking about strategy for where Cloudflare should put all of those components that Rebecca started out by listing, it's not just where in the world do we need to be to be close to users, which is how we talk about it a lot of times in product world.
We're like, okay, where are the eyeballs? And then where are the applications that we need to connect to?
How do we be as close to them as possible?
But it's also all those other factors that you just listed and sustainability starts to play a part in that when we think about the characteristics of those facilities.
Let's talk a little bit more about that. So where does sustainability come into this equation?
Why is it important for us to think about the environment when we're designing our hardware and data center strategy?
Rebecca, I'll start with you.
So.
I feel so strongly about this. I'm such a hippie.
I may sound crazy. Full caveat.
But honestly, there's been a ton of research in this domain. Nature did a great article early 2020, I believe, that talked about sort of the current data center footprint in terms of global energy demand and then where it is projected to go by 2030.
And basically, when you think about it, the footprint that they were trying to calculate was data centers themselves, everything that Michael was just talking about in terms of connectivity and the power of those systems.
But it's also the embodied carbon and energy to produce and ship those products into those data centers.
And then it also tries to encompass the devices that are used to consume Internet traffic and the associated network connectivity and components to connect all of this.
So it's a big statement, but what they said was by 2030, it would be approximately 30% of the world's global energy demand.
This is because the Internet keeps exploding.
So there's been a huge explosion of Internet traffic.
They were using the CAGR over the timeline of 2017 to 2020, and that was over 1.7 zettabytes of Internet traffic in 2017, and continuing to grow.
1.7 zettabytes in 2017 cost about 40,000 terawatt hours of energy.
So if you scale that up to where we expect it to be, that was less than 10%, will be over 30%, you start to get a real picture of, oh my goodness, this is a huge amount of energy that we are looking at having to develop and drive just for us.
And it's not like the rest of the ecosystem stops, right?
Cars still drive.
You still buy your clothes.
You still buy everything else that makes up global energy demand.
You still heat your houses or air-condition them or heat your pool or whatever it is that you have.
You still take planes and go to other places.
So your footprint continues to grow as our population grows and the percentage of energy that we take as part of that whole pie continues to grow because of how much the Internet has grown over these last years since it began.
So ultimately, we have a huge responsibility in this ecosystem (the ICT ecosystem, or whatever you would prefer to call it; cloud is what I like to call it) to do that responsibly.
And honestly here at Cloudflare, I mean, I can just take stats from our compute density.
We've been able to grow our global footprint and increase our compute density over 10X in the last nine years, and we've been able to do that while reducing our TCO and reducing our energy consumption by about 80% per request per second served.
So that's a huge metric of efficiency in our compute density.
There's so much we still need to do.
That is the partnership of so many people and so many great engineers both here, as well as obviously with the companies we purchase our products from.
But it is only because of how much it's growing.
Yeah.
That makes a ton of sense. You're reminding me of, we had a session last year on Cloudflare TV for Earth Week with your colleague Michael Alward and we talked about this kind of interesting paradox, or there's a graph I remember he talked about with these sort of two lines over time.
One is the increasing percentage of total global energy use, and the associated carbon impact, that is due to the Internet and all of the servers that use energy to power the Internet every day. So that was growing over time.
Then the other one is basically that metric that you just referenced for our own network. Those types of efficiencies are happening everywhere, right?
Everyone's trying to think about how can we get more performance per watt. And so there's this interesting thing happening where for a given request served on the Internet or a given amount of data that we're processing, the total energy that that takes and the carbon impact associated with that energy is going down, but the Internet use and the amount of energy associated overall is growing.
And so at some point, we kind of run out of efficiencies that we can crunch out of these systems.
Right.
There are, like, Moore's Law effects of this stuff happening, and so we've got to think about all of these aspects holistically.
It's not just performance per watt and continuing to optimize that at smaller and smaller increments, maybe, of delivery.
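The two trends described above (falling energy per request, rising total energy) can be put side by side with a little arithmetic. The sketch below is illustrative only: the function name and the sample figures are assumptions chosen for the example, not Cloudflare measurements.

```python
def relative_total_energy(traffic_growth: float, energy_per_request_ratio: float) -> float:
    """Return total energy relative to a baseline of 1.0.

    traffic_growth: how many times more requests are served than at baseline.
    energy_per_request_ratio: energy per request relative to baseline
        (0.2 means each request now costs 20% of what it used to).
    """
    if traffic_growth < 0 or energy_per_request_ratio < 0:
        raise ValueError("inputs must be non-negative")
    # Total energy = number of requests x energy per request.
    return traffic_growth * energy_per_request_ratio


if __name__ == "__main__":
    # Hypothetical: 10x the traffic at one fifth of the energy per request.
    growth = relative_total_energy(traffic_growth=10.0, energy_per_request_ratio=0.2)
    print(f"Total energy relative to baseline: {growth:.1f}x")
```

With these made-up inputs, total energy still roughly doubles even though each request is five times cheaper, which is why per-request efficiency alone does not settle the question.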
Yeah, we're fighting physics.
Exactly.
Exactly. To your point, Moore's Law.
Right.
If we're going to scale the transistor density, then we start to see leakage currents.
I mean, just you can only get so much out of having general purpose solutions.
You do have to step into more domain-specific accelerators, more domain-specific options from a design perspective to get higher efficiencies.
You can't be homogenous forever, even if you are serving requests per second per watt in a lot of cases.
And that's also true as our service offerings get more diverse, as we are storing the world's data with R2, as we are growing as a compute platform with Workers, all of those capabilities also shape our compute requirements and our storage requirements in different ways.
And again, to continue serving that even more efficiently, we have to keep optimizing what we're building.
Yeah, sure.
So speaking of talking about all of this holistically: from the perspective of someone who's not familiar with hardware at Cloudflare, or maybe with the lifecycle of hardware at a company like ours at all, what does the lifecycle of a box look like, from the beginning of an idea in someone's mind that we should put a Cloudflare box somewhere, all the way to when it's been operating on our network, it's reaching its end of life, and it's time to swap it out for a new one?
What are kind of like those steps in there?
Let's break them down and then we can talk a little bit about the sustainability impact of each of those stages.
Sure, we can tag-team this one, Michael.
I can definitely start with the design.
Usually we start thinking about our edge.
Our edge is such a critical part of being close to the eyeballs, as you mentioned before, Annika.
So, making sure of, A, our footprint, and I'll let Michael talk about that one. But from a design perspective, what is the efficiency we have to achieve?
How do we make sure we're going to reduce our energy, reduce our carbon footprint as we serve more and more people?
So what are the product offerings? That might be general-purpose CPUs, that might be domain-specific accelerators like I mentioned. We have to look at each and every one of those and evaluate them against core metrics.
And then we work with our partners.
We work with key ODMs in the ecosystem, key silicon vendors in the ecosystem to get samples.
We test everything, we test it rigorously. There's a process we go through that is effectively EVT, DVT, PVT, and they all converge to ensure that not just the system is valid and meets our thermal constraints and meets all of our requirements from memory, efficiency and sustainability, but also functional requirements.
And then we have to make sure that it actually is serving our specific traffic well and it's getting us the kinds of results.
Someone changes a caching architecture, you see more cache misses or fewer cache misses.
So we really have to understand, in each of these different architectures that we're evaluating, what's happening for our workloads in our tradeoff over a long enough period of time.
And I would say one of the things I love about doing hardware at Cloudflare is that we can run such a plethora of workloads, and we're actually able to do it early in the design cycle, early in the phase, so we can really hammer on these things with production traffic, do that pen testing and see the interesting effects.
Trying to build a simulation that captured all of our different products running simultaneously in the interesting seasonality and time zone implications would be impossible.
I love simulations.
I love modeling.
There's no way we could possibly do it. So having an environment that is built for resiliency and reliability so we can test these things on real production traffic is critical to our process for design and validation.
And I'm going to toss it to Michael because he can definitely talk far more about the energy use and data centers, etc.
Sure. So I think where we go from there is: Rebecca's team does a great job of giving us a generational micro picture. What does the platform look like?
And then we kind of, through my team and some of the teams that I work with for planning, we kind of extrapolate that.
What does that look like in the macro?
How much resource from a compute and storage and network perspective do we need in location X?
That then drives a data center requirement. A data center requirement should fulfill some runway of demand for some time period.
It should accommodate the servers and the network gear that Rebecca's team has designed and that the planning team that I work with has scaled to a metro size, and it should then achieve a good TCO.
And one of the things we have working in our favor on TCO and sustainability: a lot of TCO in the data center world is driven by how efficient your building design is.
So there's something called power usage effectiveness, or PUE, which is a metric used to measure a data center's waste, basically, in delivering power from the street to the server rack and to the server, as well as cooling.
Basically, a lot of it is an overhead for how much power the cooling consumes.
Now that is a wide, wide array of possibility because in certain climates, you know, you are going to have vastly different cooling reactions than you're going to have in other climates.
But also your building design is really paramount at that point because if you don't have a good, efficient building design, you are not going to be able to sort of pass through good TCO to your customers because you're going to have to pass through this increased overhead of PUE.
So it's very, very important that we find, that we drive improvements in the sector through simply demanding good total cost of ownership, because that's an economic way of demanding that the sustainability picture is correct.
It's also really important that we continue to build new and efficient data centers out there for different types of use cases.
A lot of the data center ecosystem that constitutes a huge percentage of what Rebecca is talking about in terms of global power consumption has been designed and built from the ground up in the last 15 years by cloud hyperscalers.
There is a segment of people who use data center resources like us.
We're not quite at that scale yet.
We will be at some point, but we're not quite there yet.
We would like to have the kind of more wholesale and retail, as we call it in the industry, ecosystem of multi-tenant colocation.
We'd like to have that more approach the kinds of efficiencies that your major hyperscalers have.
And that's a journey that's in progress.
There are people who understand that use case very, very deeply and are working on it.
And our job in the meanwhile is to again advocate for the best economic terms, because we do feel like the best economic terms are broadly representative of the best sustainability terms.
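The PUE overhead Michael describes comes down to a simple ratio: total facility energy divided by the energy delivered to IT equipment. The sketch below illustrates that standard definition; the function names and the sample figures are hypothetical and are not numbers from any Cloudflare facility.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power usage effectiveness: total facility energy / IT equipment energy.

    A value of 1.0 would mean every kilowatt-hour reaches the IT equipment;
    anything above 1.0 is overhead such as cooling and power-delivery losses.
    """
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


def overhead_kwh(it_equipment_kwh: float, pue_value: float) -> float:
    """Energy spent on everything other than the IT load, for a given PUE."""
    return it_equipment_kwh * (pue_value - 1.0)


if __name__ == "__main__":
    # Hypothetical facility: 150 kWh drawn in total, 100 kWh reaching the servers.
    value = pue(total_facility_kwh=150.0, it_equipment_kwh=100.0)
    print(f"PUE: {value:.2f}")
    print(f"Overhead: {overhead_kwh(100.0, value):.1f} kWh")
```

Because a tenant typically pays for that overhead along with its own load, a lower PUE tends to show up as a better TCO, which is the link between economics and sustainability drawn above.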
Yeah, we were just chatting with Patrick Day from our public policy team this week about a similar topic.
He was mentioning, so once your team makes these decisions about where we're going to put hardware, we do the measuring of how much energy our hardware consumes at each of these locations, then we use that for our carbon accounting and then our carbon offsets.
He was talking about the fact that, from the position that we're in as this kind of wholesale buyer, we have the ability to influence the organizations that we're buying from. If we're making a strategic decision about which colo provider, maybe, we want to put more Cloudflare hardware in, we can use the power efficiency, and the related sustainability, as a buying decision.
Right? In the same way that customers that come to Cloudflare today ask us for the Cloudflare numbers in terms of the carbon impact of our network, so that they can build that into their own sustainability reporting.
We have the option to kind of like feed that back into our own buying and strategy decisions, which I think is really, really powerful.
And I'm definitely excited about a world one day, and I think we'll definitely get there, where, at the scale that Cloudflare is at, it'll make sense for us to build our own data centers and maybe build our own solar panels that power those data centers 100%.
But today the position that we're at is one of being able to work with the other tenants, maybe, of those facilities and advocate for the things that we think are important, again, both from a sustainability and cost perspective for us, which is super cool.
Yeah, I think you're right to be optimistic.
I think it's early days still.
Totally.
Rebecca, I was also thinking when you were talking about hardware design. You're like, oh, this is one of my favorite parts of working at Cloudflare and I have a lot of empathy for your position because it also sounds like a really hard part of being at Cloudflare.
The fact that we ship so much, the fact that the demands on our infrastructure change so much over time, and an engineer or a product manager can come up with the idea for R2 and be like, Yes, 100%, let's do this.
And then on your end you're like, Whoa, hang on, wait.
That's a completely different thing that we're talking about in terms of what we're asking our hardware to be able to do.
And then not only from a device profile perspective, but also the consequences of that for our energy use and things downstream of that, too.
So kudos for taking on that challenge and being excited about it because that's a gnarly one.
It's a great one, honestly.
I mean, that is... you shouldn't come to a cloud company if you're not excited about solving real challenges, and you shouldn't be in infrastructure if your mindset is not, How do I enable my product team to continue to wow our customers?
How do I make sure we're never in the way of that process? And it is running faster, and it is planning on such a different timeline.
I mean, if you work in software, you might do a quarterly plan, you might be doing an agile sprint in two weeks.
When you're in hardware, this is the physics-based economy.
We have real atoms we have to ship to places.
So it's really very different in terms of the lifecycle, and we've had massive supply chain challenges as an industry, really since COVID began, and substrate was so affected because its production was centered in Wuhan, China.
And we are still coming out of those challenges as an ecosystem, which honestly has been a good thing from a sustainability perspective, because it gave all of us the incentive to get more efficient on our hardware through design optimizations, and through more optimal software libraries and configurations, to get the most out of the hardware we have.
And then I would also argue, and I'm a big advocate for this, for open system firmware and open BIOS solutions, which sounds really low level, but I'm going to promise you, Annika, I'll tie it back to sustainability and why it matters.
When you buy a server and you buy a server from a big entity with all of their tools and support and everything else, that server has a lifecycle.
It gets end of support from that vendor, and then those systems tend to end up in a big junk pile somewhere, unfortunately, probably far too close to our oceans.
When you purchase a server that is built with open system firmware that has the ability for in-field upgrades and maintenance through open source software, and I say software and firmware as if they're the same thing... and they kind of are... it's just software that runs on top of the hardware to tell the hardware what to do, as opposed to application-level software that most of our software engineers think about more.
But anyway, that is actually something that enables us to second life our equipment.
So we talked about the design cycle, we talked about the footprint and the planning and trying to go for efficiency in our design.
That's really about our Scope two and Scope three emissions, right?
That's how we use our power today and how we manage our logistics in disseminating requests per second across the world.
But we also think quite a lot about our design waterfalling techniques and our second lifing techniques.
So when we decommission a server because we have identified more efficient ways of solving our problems, that server doesn't go into a waste pile.
We work really closely with key partners like ITRenew, which I guess is now Iron Mountain, but ITRenew was their name before they were acquired, and they come in and they help us decommission.
We do all of our safety and security protocols to ensure any user data is stripped from those systems.
And because we are deeply committed to open system firmware, they can actually continue the life of that.
So the embodied carbon that is a part of that server actually is, in some sense, amortized over a much longer period of time.
So you'll see ex-Cloudflare servers that are powering the Internet in parts of Nairobi.
It's kind of an amazing thing to think about. We do our best efficiency here, and then we also really work to make sure when we decommission parts, they're going back into an extended life cycle in an efficient fashion.
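The amortization point above can be made concrete with a small calculation. This is a sketch only: the function name and the figures are assumptions picked for illustration, not measured values for any Cloudflare server.

```python
def annual_embodied_kg(embodied_kg_co2e: float, service_years: float) -> float:
    """Spread a server's embodied carbon evenly across its years in service."""
    if service_years <= 0:
        raise ValueError("service_years must be positive")
    return embodied_kg_co2e / service_years


if __name__ == "__main__":
    embodied = 1000.0    # hypothetical kg CO2e to manufacture and ship one server
    first_life = 4.0     # hypothetical years in service with the original owner
    second_life = 4.0    # hypothetical extra years after refurbishment

    print(annual_embodied_kg(embodied, first_life))                # first life only
    print(annual_embodied_kg(embodied, first_life + second_life))  # with a second life
```

Under these assumed numbers, doubling the service life halves the embodied carbon attributed to each year of use; the total embodied carbon is unchanged, but it is spread over more useful work.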
That's awesome.
That's really amazing and it's so cool also that there are partners that are available that we can work with on this stuff, right?
A lot of this is teamwork and cooperation. It isn't like Cloudflare having to figure out all the answers to all this stuff on our own.
Just in the same way that we partner with people on our carbon accounting to make sure that we have our numbers straight and have our approach right, and we're accounting for the right stuff in the right places in the right areas of the world, we're able to do the same on the end-of-life side.
There are people that we can work with that have established best practices around, OK, what do you do with the server that's reached its end of life, or end of useful life for us as an organization?
That's super cool to hear.
Yeah, reduce, reuse, recycle.
100%. That's literally what popped to mind.
That's so cool.
Okay, so we've talked about design, talked a little bit about supply chain, talked about energy use in data centers and how we plan for the places for our hardware to end up in the first place, which is the strategy piece of this, and then talked about end of life.
What are we excited about and looking forward to? We talked about futures a little bit a minute ago, but think about the next five, 10, 15, 20 years at Cloudflare.
What are both of you most excited about in terms of how we think about our hardware influencing the direction of the Internet to be more sustainable overall?
Michael, do you want to go first?
Sure.
Well, what I'd like to do, I think there is a pending, not revolution necessarily, a pending evolution, let's say, in how data centers are constructed for clients like us.
So there was an evolution in hyperscale data center construction that started, as I mentioned, like 10 or 15 years ago.
That's on generation N by now; I mean, their designs have evolved and the needs have evolved.
I think we're on the cusp of doing something like that in the multi-tenant space.
And a lot of what's driving that is that the older data centers now are getting close to end of life.
You talk about sort of server hardware, network hardware being maybe on a 3 to 5 year CapEx cycle.
Well, data center equipment like generators would be on a 20 to 25 year cycle maybe.
So we're starting to see like investors and operators have to make decisions about are they going to reinvest in site X or does it make more economic sense and more sustainability sense to simply start greenfield somewhere else?
All of that, of course, has an emissions chain behind it, and there are pros and cons to both approaches.
What I'm most excited about is managing the fleet.
If you think about the fleet as all the data centers in the world that are open to multiple tenancy, managing the fleet in a more sustainability-positive direction and honestly, some of that is going to be managing stuff that's no longer fit to purpose out of the fleet, and that will go for the whole Internet.
It will also go for Cloudflare.
I will double down on that.
I think there's a whole world of smart tech that is just waiting to happen and waiting for edge compute users like us.
And we're not the only ones.
I mean, 49% of the community that is a part of Open Compute Project, which is one of my other passion projects, is building out their edge footprint as we speak because users are looking for more dynamic experiences, being close to those eyeballs.
And so edge is not... edge is the growth trajectory, it is our future. So we have to think more efficiently about it.
And just like the physical footprint was designed for hyperscalers, the edge footprint and the compute footprint that's being generated is often for those hyperscalers, doing massive search capabilities and ads, versus someone like us, who doesn't sell anybody's data.
We focus on how to make sure you have an amazing experience that's reliable and secure, so our workloads look different.
So being green compute for that, and making sure we have a way of reducing and working closely with partners as more and more people are building at the edge and it doesn't look exactly like the hyperscalers, I think that is a critical opportunity for us to lead. Super excited to do it.
Awesome.
That's amazing. There's a ton to be excited about.
Well, thank you both so much for your time.
This was a really fun segment.
Make sure you catch the rest of Cloudflare TV from this past week if you're interested in this topic and want to learn more.