Story Time
Presented by: John Graham-Cumming
Originally aired on March 5, 2023 @ 3:30 AM - 4:00 AM EST
Join Cloudflare CTO John Graham-Cumming as he interviews Cloudflare engineers and they discuss a "war story" of a problem that needed to be solved — and how they did it.
English
Interviews
Transcript (Beta)
Okay, good morning. Welcome to Storytime. I'm John Graham-Cumming, Cloudflare's CTO, and with me today is David Wragg, who is normally in our London office, if the London office wasn't closed.
And our plan today is to talk about load balancing at Cloudflare and perhaps a little bit of the history of load balancing.
But David, why don't you quickly introduce yourself?
You've been in Cloudflare for quite a while now.
What are the sorts of things you've been working on over the years? Yes, I'm an engineer at Cloudflare.
I've been at Cloudflare for about four years, and I've worked on a few different things.
I did some work on image optimization when I first joined Cloudflare, but for the last three years or so, I've mainly been working on how we manage CPU load within Cloudflare.
And CPU load is really the thing we spend most of our time managing, right?
I mean, it doesn't tend to be RAM. It doesn't tend to be local networks.
It's really CPU is the core of what we do. Yeah, there are other resources that do matter, but we have other ways of managing those.
CPU or processor load is the thing that's kind of easiest to turn on and off. It's the thing we have most control over, and in some ways, it's the most critical bottleneck.
All right, so we're going to talk about load balancing in Cloudflare and all that.
If you have questions, the email is livestudio at Cloudflare.tv, and we'll try and answer those questions during the show.
Otherwise, we will carry on, and if you hear something interesting, send us an email.
Okay, so when I joined Cloudflare, which is nine years ago almost now, we had a very small number of data centers.
I think we had four. I think we had Chicago, Bay Area, Amsterdam, and Tokyo, and I think Tokyo didn't really work, and Amsterdam kept crashing.
But we had those four data centers, but we fundamentally had some bits of the Cloudflare architecture already set up, which is that we were using any cast so that the same IP address appeared everywhere in the world, and you'd get routed to the nearest data center, and by nearest, I mean something in Internet geography nearest.
It wasn't necessarily physically closest, and then within a data center, you had a rack of machines, and we were doing very simple load balancing across those machines, typically using the external router as the mechanism for sharing connections across machines.
Super basic, but take us through, okay, that's when I joined.
I think that was probably the case when you joined. It was like that, but many, many more data centers.
So just explain for us the impact of any cast and then what was going on inside a data center in terms of load balancing.
Yeah, so as you say, there's two levels to the problem of load balancing.
There's the Internet routing part, the any cast part, how the traffic from one user, somebody visiting a website behind Cloudflare, how their traffic goes to some Cloudflare edge data center.
I'm not going to be concentrating on that part. The part we'll be talking about is what happens inside a data center, because even inside a data center, we can have many servers, so there's the question of where does the connection, where does the traffic from a user go to?
Which server does it go to? And every time somebody visits a website behind Cloudflare, one of our servers has to do a little bit of CPU work to deliver that page or whatever, and across millions of users or millions of websites, that all adds up.
So when I joined, yeah, I think it was kind of the same as you.
Within data centers, we were using a router feature called ECMP routing to spread load across servers.
ECMP routing is widely used.
It's a commodity feature of routers, but it doesn't give you a lot of control. It's quite basic, and another big problem with it is when you make changes to the configuration of the ECMP routing, it tends to break connections, and some of the connections that we deal with can be long lived.
If someone is using a website that's using web sockets, for example, it can have a connection that stays open for hours or even days.
Right, and websites obviously use things like chat apps and things that are very interactive over long periods, right?
Exactly. Right, and just to dig into ECMP, so the idea of equal cost multipath, right?
The idea is that was a very simple mechanism for saying, okay, there's these machines behind this router.
We will send traffic. We'll spread traffic across it based on what characteristics of the connection?
Yes, I think really ECMP originated not to be used in this way to spread load across servers.
It was being used when routing traffic in the Internet.
If you had two links between two paths between a couple of sites, and you wanted to spread traffic across those two sites, ECMP was introduced as a way that you could use both of those links and divide the traffic between them, but when you do that, it's important to make sure that the packets going down each of those links, a particular connection will just be using one of those links.
Right. And so then it was repurposed for spreading traffic across servers, and still we have the same characteristic that we want all of the packets that are relevant to a particular connection to go to a single server.
And the thing that defines the packets on a particular connection is what's called the four tuple.
For a TCP connection, it's the source IP address, the destination IP address, and the source and destination port numbers.
So there's a few headers on each TCP packet that define the connection that it's part of.
And ECMP works by taking those header fields, hashing them together to produce a number, and then using that number to decide which member of a group of servers that connection will go to.
And what happens if we take a machine out of production because we need to, you know, do maintenance on it, or it's failed in some way, a disk needs to get replaced or something, does it, this is where the connection breaking comes in?
Yeah, so one of the features of traditional ECMP is that if you make a change to a group of servers, if you add a server or remove a server, every connection going to that group gets broken.
So it's not just the connections that are relevant to the server that's going away or being added, it's all connections within a certain group of servers.
So there's a lot of collateral damage whenever you make, whenever you change a set of servers.
Right, and the other thing is that perhaps isn't obvious to everybody who doesn't work at Cloudflare is that within one of our data centers, there will be multiple generations of hardware.
So we keep around hardware, or we can still get used to life out of it.
So different amounts of memory, different CPU characteristics, maybe different networking, but all fundamentally running the same software stack.
So ECMP didn't take that into account either, right? So you got you got different load on different machines, even though you were theoretically spreading stuff out.
That's right. And that's really when I got involved, because I think in the first few years of Cloudflare, we had several different generations of servers.
So we have different generations of servers, different models of the server hardware that have different CPU capabilities, more powerful CPUs as we have newer generations of servers.
And I think for the first few years of Cloudflare, the increase in server performance was was not big enough that this really mattered.
But when I got involved towards the end of 2017, we were adding a new generation of servers to some of our data centers.
So we had more powerful servers sitting next to less powerful servers within the same data center.
And yeah, we didn't have the ability to send more traffic to those more powerful servers.
So we were making this investment in new hardware, and we weren't fully able to exploit it.
Right. And so the data side of things, right, which is you got uneven load.
So you got some machines are potentially overloaded, and some that are potentially somewhat idle.
You also have the issue of a lack of active control.
So if a machine somehow gets overloaded or fails in some way, ECMP doesn't know what to do directly, right?
Yeah, yeah, they can be. So not all connections are the same.
Not all users are the same. Not all websites behind Cloudflare are the same.
And this can mean that ECMP does spread load across the servers, but in a rough way.
And with ECMP, there's also a lot of variance, you find that just at certain points in time, one server within the group can have a lot more load than the other.
So it's also in order to be able to make sure that no servers are really underloaded and no servers are really overloaded, you need to kind of manage that variance across the servers.
Right. And if a server failed, we had to manually go in and remove it and then tell the system.
Yeah, yeah, that's right.
Right. So, okay. So you got involved in a project, which is called Unimog.
And in fact, there's a blog post coming out in about an hour from now that describes this to kind of try and fix these problems.
So what, what's Unimog and what does it do?
Well, so actually, initially, I got involved with trying to improve our ECMP groups.
So at the end of 2017, I got in some work to configure the ECMP groups more intelligently.
And we thought that this might at least buy us some time.
And that did work. But in early 2018, it had been clear that that was really only a temporary improvement.
And we needed some, some fundamental improvement.
We couldn't rely on ECMP groups alone going forwards. So that project was actually based on some measurements, right?
So you measure the performance of the machines and try to make based on historical data, like real data, okay, we'll make the groups like this that will spread the load out a little bit.
Yeah, it was based, it was based on data about the machines and about the load, but only in a in a fairly static way on long time scales.
So it wasn't able to make adjustments on scale of minutes or hours, right?
As loads shifted around. Yeah, we are every week or something.
I'm trying to remember now there was a quite slow update cycle for that.
Initially, it was it was less often. But yeah, gradually, we realized that if we wanted this to work, well, we had to do those updates more often.
And and yeah, it just became clear that that approach while it was an improvement, and it did allow us to use the newer servers more effectively, it wasn't a long term solution, right?
Okay, so the long term solution is Unimog. Yeah, so.
So we realized that ECMP groups alone were not a solution. And there is a category of software within the industry that does solve this problem.
And that's called a layer four load balancer.
These are these are widely used by by big cloud computing companies, that kind of thing.
Dig into that, what what does it mean by layer four load balancer?
So one of the if you so a load balancer is going to be something that that balances the load that spreads the load over the servers in some some somewhat intelligent way.
A layer, but when you do that, it's important that the CPU cost or the cost in general of the load balancer is actually low, compared to the useful work that you're doing, you don't want to add a load balancer, and suddenly the load balancing is taking up half of your CPU.
So layer four load balancers are have the advantage that they're very efficient.
And the reason they're very efficient is they just work on the level of each packet.
They're not terminating SSL, they're not terminating connections, that they're routing packets, they're just looking at the layer three and layer four information on the network packets.
And based on that information, they're making a decision about which server those packets should go to.
And that's just enough information for them to this for them to make sure that the same connection goes to the same server in a similar way to ECMP.
Right. And so there are a number of these things out there, right?
I think Facebook has one and GitHub was one and you know, various people have open source layer four load balancers.
And if you think about our architecture, originally, we had the router that was connected to the Internet really doing the load balance with this ECMP thing.
And then one option would have been to say, let's put in behind the router, depending on how you look at it, a layer of load balancers, which would then load balance onto the servers.
But we didn't do that because we actually have a view that we want to have everything running on same hardware, commodity hardware.
So the load balancers actually run on the same servers that serve customer traffic, right?
Yeah, so you can you can use commodity servers as load balancers.
But the way that layer four load balancers and load balancers in general are typically deployed is that you have a separate tier of servers that are dedicated to that load balancing function.
And then you have another tier of back end servers that are actually providing your your main services.
Right. Yeah, go ahead. But because Cloudflare has a lot of data centers, reconfiguring everything into that two tier architecture would be a huge amount of work and would have ongoing overheads in terms of managing that that separation managing which servers are in which roles.
And yeah, generally, it's it's, it simplifies things and avoids bottlenecks.
If we just say that every server within Cloudflare's data centers are going to do the same set of functions.
And we wanted to maintain that philosophy for load balancing.
And this is particularly true because we do DDoS mitigation in the same way.
I mean, some people may be familiar with DDoS mitigation services and use dedicated scrubbing hardware or dedicated scrubbing centers, which have to get sent to the Cloudflare way is spread the DDoS across the world with any cast and then spread the traffic, the DDoS traffic and all the normal traffic across servers and then drop on the server.
So you you get resilience by spreading things around the world and doing the same thing with load balancing makes a lot of sense because you actually got to do the DDoS mitigation before you do the load balancing.
And so if we'd had a layer of load balancers, you would have to move the DDoS mitigation to the load balancers becomes the architecture becomes in many ways more rigid.
Yeah, yeah. So it would have greatly complicated things.
And yeah, it would have pushed load balancing into a third tier.
Sorry, it would push DDoS mitigation into a third tier. And yeah, that that wasn't a great prospect.
Okay, so we decided we would put load balancing on commodity machines.
And just so we get the idea of how this works. The router is still doing ECMP.
It's sending connections to groups of machines and then inside, but then the machine is making the decision, oh, is this traffic for me?
And in most cases, it won't be right, it's going to say it'll actually forward the traffic over to another machine within the data center.
So you've got this sort of hop happening.
Yeah, so so it's, it's another example of computing solving things with a layer of indirection.
The router is sending a packet to one server, that server is acting when it initially receives the packet, it's acting in the role of a load balancer, making that decision about which other server, which might be itself, but is usually not which other server should be actually processing that packet.
And so and forwarding it on to that server. And so how does that this is one of those counterintuitive things because often you if you looked at the three tier architecture of like router load balancer servers, it sort of seems to make sense.
It's like, okay, yeah, you got the dedicated things. And packets are getting transmitted once, if you like.
And suddenly, here, now packets are actually getting transmitted twice, at least, right, they're going to some server, which is they say, No, it's not for me, send it over here.
But it turns out that that seems a little bit counterintuitive.
But it totally makes sense because of the CPU use and the impact on the network is pretty low, right?
Yeah, so there are overheads, there's, there's the overhead of, so the packet has to be arrive at one metal and be forwarded on.
So there's some network latency. And there's some additional traffic on the on the network within the data center.
And there's some CPU costs on the server that's that's doing that forwarding.
The big advantage of layer four load balances is that all of those costs are relatively modest.
So so when we deployed Unimog, it didn't result in in any overheads that we weren't prepared for.
And we see this dramatic change, right, where because it's, it's dynamically measuring CPU use, and dynamically updating, you see that all the CPUs in the data center, no matter what generation of hardware, they all come into some very narrow band, like we're all working at the same set utilization.
Exactly.
So so by by spreading the load evenly, you get rid of that variance. So you're able to, you don't need as much of a safety margin, you're not so worried about some servers becoming overloaded.
So you can, so the average usage of each server goes up.
So in the end, even though there is some overheads to to the to Unimog, it pays for itself.
It's if you imagine that the that the CPU overhead of Unimog is about 1%, but it's actually allowing us to get a few more percent of value out of our servers, then it pays for itself pays for itself easily.
Yeah. And it extends the life of some of those older machines.
There's some point we would have just said, you know, well, it's too difficult to manage them, they get overloaded, let's take them out.
And so we can keep those around for longer. And ultimately, a better customer experience, because you're not hitting machine that's potentially overloaded because of something else that's going on.
Yeah, so the consequence of an overloaded server is, if it's mild, then it's it's additional latency, which is not good for users or our customers.
And if it gets very overloaded, then then yeah, there can be some, some perceptible reduction in the quality of service.
Now, the other thing is interesting. So you get when I get the CPU load, all nice and under control.
Now, what happens if we want to take a machine out of service for maintenance, you know, disk has failed or pull out of service replace the disk.
Now, we actually don't have this connection breaking problem, right?
Yeah. So one of the things that Unimog achieves, so one of the things it achieves is this load balancing.
The other is it doesn't have the same issues with breaking connections when you make changes that ECMP groups did.
So yes, we can take a machine out of service.
Now, if we just abruptly take a machine out of service, of course, the connections that are going to that particular server will have to be broken.
If there's some long lived web sockets connection that's going on for hours and hours.
You can't just wait forever. But Unimog does allow us to drain connections to servers.
So we can say, okay, this server is going to be taken out of service.
We want to maintain the established connections to that and allow them to finish whatever requests are in flight.
And so we wait for some of those connections to die off naturally.
And then at some point we say, okay, now we're going to terminate all remaining connections to that server.
So we can gracefully remove a server from service.
And when we add a server, there's no collateral damage for the other server's connections to the other servers are not affected in any way.
Right, right. And in fact, in the blog post that's coming out in a bit, you describe in detail how this actually works for the AdBlue changes thing where you can, it's a clever design.
One of the things I was interested in was how do you actually move a packet from one machine to another?
So it's a TCP packet, but you end up encapsulating it in a UDP packet, right?
Yeah. So in order to get a packet from one machine to another, to forward it from one machine to another, we can't simply just put it back on the network because how would the network know which server it has to go to?
In order to make sure it gets to the server we want it to go to for load balancing purposes, we have to put some new addressing information onto that packet.
But we also need to preserve the addressing information that was already there.
So this, this leads you to doing encapsulation.
So you, you put some new network headers on the front of that packet, you retain the original packet as as the payload in this encapsulated packet, that there are some new headers on the front, which have the new addressing information.
So for example, the destination IP address will be the desk will refer to the server that you want that packet to go to.
So the original TCP packet is still there.
So that it can be processed when it arrives that it's at the right server.
But we, we wrap it inside a UDP packet in order to do the forwarding. Got it.
And all this packet processing being done in the Linux kernel, right with XDP.
Is that right? Yeah. So Cloudflare has long had an interest in XDP. It's the basis of much of our attack mitigation, our DDoS attack mitigation technology.
XDP allows you to run code within the Linux kernel, which it allows you to process packets within the Linux kernel, but with an arbitrary program.
So that's, that's a very flexible mechanism.
And it's also very efficient because that processing happens within the network driver.
So when you write a normal network application that runs in user space, when a network packet arrives, it has to go through many layers within the Linux kernel and then get passed on to user space.
But with XDP, the packet arrives really before very much code within the kernel is involved.
It gets passed on to our program, which can make decisions about what to do with that packet.
And if necessary, can encapsulate it for forwarding to another server in the Unimod case.
And the thing that's very clever about XDP and the eBPF technology that it's based on is that it allows all this to be done in a safe way.
Because when you load code into the kernel, you really don't want that to affect the stability of the kernel.
XDP allows you to do this in a way where you're, you can write arbitrary programs, but it's, it's kind of safe.
Right. And you're dealing with arbitrary data, right?
You're dealing with user control data, right? Packets coming in from the outside, they, you know, the data that's in these controls, you really want to be able to process those in a controlled environment.
In a controlled way, you want to make sure that there are no security vulnerabilities and, and, and XDP has an elaborate verification scheme that, that looks at the program that's being into the, that's being loaded into the kernel and, and make sure that it, even though it might be written in C or some language that's typically regarded as unsafe, it is in fact safe.
Yeah, that's right. We write in C and then we use Clang, right?
To produce the eBPF. Yeah. So, so we write these programs, we write the XDP programs in C.
It is a restricted form of C because of this verification scheme that XDP has.
There are limitations on what you can do. So you have to know about those limitations when you write the C, but otherwise it's normal C code.
And then we compile it to a kind of bytecode, the eBPF bytecode. And then we deploy that bytecode to our servers and it gets loaded, loaded into the kernel.
And one of the nice things is because it's getting loaded into the kernel, it's not like making a kernel module where you've got to recompile the kernel and there's all sorts of stuff.
It's actually dynamically loaded, right? So you can, you can make, you can make a change to these programs very rapidly.
Yeah. So, so yeah, distributing a new kernel module is, is possible, but yeah, it's, it's not necessarily an easy thing to do.
It's not easy to validate. XDP code, despite running in the kernel is, is basically as convenient to work with as user space programs.
Just like you can run a user space program on your laptop. You can, you can load an XDP program and work with that.
So it's very convenient from a developer perspective.
And obviously we're, we're very deeply involved with XDP and eBPF.
What have we open sourced around Unimog so far? So the, so part of Unimog deals with loading these XDP programs into the kernel.
It has to manipulate the program as it does that.
For, for a number of reasons, but basically we actually have a chain of XDP programs because we have some programs that are related to our DDoS mitigation, some programs that are related to load balancing, and we need to combine those and we do various other fix-ups on the program before we actually load it into the kernel.
So we have to manipulate these, these eBPF bytecode programs before we load them into the kernel.
And we collaborated on a new Go library, which allows you to do these kinds of manipulations.
We've also worked with the kernel and made some contributions around general XDP functionality so that it works better for us.
And we're, we're continuing to do that. There are some, there are some, XDP is a fairly new technology.
There are some rough edges, and I think we're, we're helping to smooth those off.
Great. And we also are helping to maintain the eBPF stuff as well, right?
And Linux as well. That's right. We're, we're, we're using it more and more widely.
We're using it for some other projects that are not related to load balancing.
And yeah, so, so we've got a very active interest. It's, it's, yeah, it's a very active area of innovation in, in, in the Linux world in general, and Cloudflare is playing its part.
Perfect. So I wanted to ask you about one last thing, because we're going to run out of time, which is that one of the things that the Unimog team anticipated was a situation in which a whole bunch of servers got overloaded for some reason.
Traffic got moved on to other servers, which then got overloaded and you end up in sort of an oscillation, right?
Where things go wrong and anticipated this might happen.
And then it really did. Right.
So, you know, well over a year ago now we had a, a CPU spin up to a hundred percent because of a regular expression problem, which you can all read about on the blog and Unimog had to cope.
So tell us about that and what it was like when it really happened.
Yeah. So one of the things Unimog can do, because it can remove servers from service without a lot of impact, you can do things like active health checking.
You can check that a server is healthy and the services running on that server are healthy.
And if they seem unhealthy, you can remove it from service.
The, it turns out this is a problem if, if one of our data centers gets overloaded as a whole, because if it gets completely overwhelmed, then it looks like some servers are unhealthy just because they're overwhelmed with the work arriving at them.
And so they look unhealthy and Unimog removes them from service.
That actually puts more load on the other servers. So it kind of compounds the problem, but then those servers, they, they had the, the, the load removed and they look healthy again.
So you end up with this situation, as you said, it oscillates where servers are being moved in and out of service.
And even this can be bad enough that even after the original load has gone away, this oscillation process still continues.
So we saw this early on happening in a data center early on in the development of Unimog when we were doing some trial deployments.
And we, we had to put in some things to say, okay, to detect this situation and cut off those oscillations to kind of pre-process that health information.
Yeah. As you said, then we had a few months later, we had this situation where every colo, every edge colo, every edge data center in the world was simultaneously overloaded.
And that was pretty scary.
And yes, we, we found that the cause was related to that regular expression engine.
When we fixed the problem. The, the, the scary part for the Unimog team was what would happen when the load went away?
Would everything recover?
Would this precaution we'd in place actually work? And I'm glad to say it did.
Yeah, it did. It was amazing. It was one of the, I mean, I was very involved in that outage and that was one of the most amazing things.
It was like, yeah, and it just worked.
It just went back to, yeah, that was, that was pretty cool.
Well, listen, we're about to run out of time. Thank you for talking to us about load balancing and thank you for working on Unimog, which is pretty fascinating technology.
Half an hour from now, we're going to publish an epic blog post. I think I should describe it as it's so long.
It took me a while to read it, which describes this technology, what we've open sourced, the results from doing it.
And in particular, the design, design decisions we made, because there are existing load balancers out there from other people that we could have used.
And we had some very specific reasons to write our own and we don't generally write new stuff unless we have to.
And in this case, it made sense. So I look forward to that. Thanks so much for writing it, David.
And I think we're done. Have a good day. Thank you. Bye. See ya.
Bye.