Cloudflare at Cloudflare
Presented by: Juan Rodriguez, Tom Lianza, Dimitris Antonellis
Originally aired on April 4, 2021 @ 7:30 AM - 8:00 AM EDT
This week we will discuss how we used our load balancing service across our API and services!
Transcript (Beta)
Welcome everyone to Cloudflare at Cloudflare. I'm your host Juan Rodriguez. I'm the CIO of Cloudflare.
And for new visitors or new viewers, this is a series that we do weekly where we talk about dogfooding Cloudflare.
And at Cloudflare, we aspire to be Cloudflare's first customer.
We use our own products to build solutions and solve problems internally, and even to build other products that we then sell to customers.
So today we're going to talk about a couple of things that are very exciting.
We're going to talk about our load balancing product, and then how we use our load balancing product internally in our core data centers, for our API, and things like that.
So I have two amazingly smart people with me today, Dimitris and Tom.
And Dimitris, do you want to introduce yourself and what you do at Cloudflare?
Sure. Hi everyone. I'm Dimitris.
I'm the engineering manager for the load balancing team here at Cloudflare.
I joined Cloudflare five years ago as a systems engineer, helped build the load balancing products.
And then when we decided to build a team around it, I stepped up and became the engineering manager.
I've been managing the team for the last three years and it's quite fun.
Excellent. And Tom, you want to introduce yourself?
Yeah, sure. So I'm Tom Lianza. I'm an engineering director at Cloudflare. I also joined about five years ago.
I joined as the engineering manager for the SSL products, and over time I've moved more into the core.
So our APIs and all of that. We'll talk about core and edge a lot in this episode, I'm sure, but the core is about how Cloudflare's control plane works, how you tell Cloudflare what you want to do.
And then the edge is the thing that does it. And Dimitris probably knows better than me about the edge and how they both work together.
But I'm core-centric.
Core-centric. All right, great. Well, thank you for joining me today. And I'm sure that we're going to have a lot of fun explaining to people what those things are.
So I thought we'll start with Dimitris to tell us a little bit about load balancing and what this product is.
I've used load balancing over my career before, back when there were physical boxes; obviously we don't do physical boxes, we're a cloud service.
So maybe Dimitris, you can tell a little bit about the product, how it started and what makes it special and all those things.
Sure, sure. Yeah, maybe I can go back five years when I joined.
My background before Cloudflare, I was working for Brocade, which is a big company that even sold load balancers as hardware boxes.
And the project I was working on there was migrating the code from a hardware box to a software-only solution.
So my background was in load balancing.
I joined Cloudflare and surprisingly, I realized we don't have a load balancing product.
It turns out they had plans to build one, but at that time it was like a small company, right?
150 people. So it didn't become a priority.
So after a while, I think a couple of months after I joined, we decided to kick off the project, and it was really fun initially.
Within a few months, we had the basic MVP version.
So what load balancing does, think of it like that. Let's say you have a website or an API and you can serve it from multiple locations.
You have multiple servers, perhaps in different data centers that can serve the content.
So what we do as Cloudflare, we sit between the clients or the browsers and the origins.
And whenever we receive a client request for your website or API, we know the health of the origins because we are running active health checks.
And based on this information and the specified configuration, we pick the best origin at that time and we forward the request to that one.
So that's the kind of basic model.
We support a number of different steering methods. As you can imagine, the most basic one is round robin.
We do weighted round robin. One really popular feature is geo steering, where customers can specify different regions and how each region should select an origin.
So you can say, let's say you have two data centers, one in San Jose, one in London.
You can specify that customers in the western US should go to the closest one, the San Jose one, while the European customers should go to the London one.
So that's a really popular feature. We also use it internally, I think, and we can chat more about the most advanced geo steering version, which is called dynamic steering.
I guess we can talk about that later.
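To make that concrete, here is a minimal Python sketch of the geo steering idea Dimitris describes: each region maps to an ordered list of origin pools, unhealthy pools are skipped, and requests round-robin across the origins in the chosen pool. The pool names, region codes, and addresses are illustrative only, not Cloudflare's implementation.

```python
# Illustrative sketch of geo steering with health-aware pool selection.
# Pool names, region codes, and origin addresses are made up for the example.
import itertools

POOLS = {
    "san-jose": {"healthy": True, "origins": ["198.51.100.10", "198.51.100.11"]},
    "london":   {"healthy": True, "origins": ["203.0.113.20", "203.0.113.21"]},
}

# Each region prefers its pools in this order.
REGION_POOLS = {
    "WNAM": ["san-jose", "london"],   # western North America -> San Jose first
    "EEU":  ["london", "san-jose"],   # Europe -> London first
}

# Round-robin iterator per pool.
_rr = {name: itertools.cycle(pool["origins"]) for name, pool in POOLS.items()}

def pick_origin(client_region: str) -> str:
    """Pick the next origin for a request coming from the given region."""
    for pool_name in REGION_POOLS.get(client_region, ["san-jose"]):
        if POOLS[pool_name]["healthy"]:
            return next(_rr[pool_name])
    raise RuntimeError("no healthy pool available")

print(pick_origin("EEU"))   # an origin from the London pool
print(pick_origin("WNAM"))  # an origin from the San Jose pool
```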
So that's kind of the basic product. Some of the benefits, and I would say differentiator, is first of all, ease of use.
When we built the API and UI, one of the main priorities was to make it really easy to use and configure.
I think we've done a good job there. We always try to keep it simple when we build new features.
So that's one. Then scale is another one. If you use Cloudflare Load Balancing, you take advantage of the scale of Cloudflare's network.
The points of presence, right? Yep. And I would say the last one is more like, this is an integrated solution.
So imagine if you're a big company and you want to purchase a load balancing solution, as well as a DDoS mitigation solution.
Potentially, let's say you're forward-thinking and you go and purchase two software-as-a-service solutions, independently.
These two most probably won't integrate with each other easily.
And on top of that, you'll be paying a latency penalty when you go from one provider's data center, for the DDoS mitigation, to the load balancing provider's one.
With Cloudflare, you don't pay that penalty. That's one. And these two products also integrate with each other.
So I would say these are kind of the key benefits here.
Normally, I mean, if you're using something like a load balancer to front a set of sites or APIs or things like that, it would be strange not to have a DDoS solution associated with that to protect it nowadays, right?
So there's that benefit of the integration, of having somebody that knows the whole stack, versus one engineer that knows this part and another engineer that knows the load balancer.
I mean, this is a significant advantage also from an operating perspective.
Correct, correct. It's so much easier to file a single support ticket, right?
If there's an issue, you let the support folks at Cloudflare figure it out, rather than going back and forth between the DDoS mitigation support folks and the load balancing ones.
It's a mess, yeah. Yeah, that's right.
That's great. Well, thank you for the overview of the product there.
So, Tom, you are the dogfooder. You and your team use Dimitris' product to solve interesting problems internally on the things that are under your responsibility.
So tell us a little bit about that, the API, the core and a little bit about the journey and how we start incorporating the load balancer and all those things to solve challenges.
Sure thing.
Well, it's funny. I've actually worked at Cisco and at F5 at different points in my career, companies known for their hardware load balancers.
Oh, I've worked with some F5 load balancers in my previous company.
Yeah. But I always worked in the control plane.
I worked on reporting at Cisco, and I worked on the UI for configuration on the BIG-IPs at F5.
So I'm always on the control side of whatever these applications are.
So at Cloudflare, our architecture, everybody knows about the edge.
I feel like that's our bread and butter: worldwide maps of the earth with pins in them.
That's Cloudflare, big and broad. And we do so much at the edge to protect you, load balance, DDoS, firewall, and on and on and on.
But the way we can do all of that so quickly is we don't have the edge make a lot of decisions.
We pre-make the decisions. We pre-define sort of what we want the edge to do so that at the time the edge receives a packet or request, it already knows what it's supposed to do.
It has the ability to do quick lookups and quick decisions.
And so that's sort of how we cheat and get the most out of every CPU cycle at the edge.
And so the advanced computations are all done at our core. So when you sign up for Cloudflare, you go to dash.Cloudflare.com and you create DNS records or enable the WAF, whatever it is.
When you're doing that, you are using Cloudflare's edge.
First of all, the Cloudflare dashboard is behind Cloudflare. It is an orange-clouded record itself, dash.Cloudflare.com.
And then when you click the save button, you are saving that.
You are now running that through a series of microservices in Cloudflare's core.
The API and Dash use the same exact pathways for accessing those microservices.
And then different teams at Cloudflare own different microservices for their control plane.
They're just all fronted by the same API and UI.
And so traditionally, like every startup before any of us joined Cloudflare, there was a monolith, a single application built in PHP, the hot technology of the time.
And over time, we distributed that monolith into microservices.
And over time, we moved those microservices beyond just one data center.
The way they all work is still they store things in databases.
We use good old-fashioned time-tested Postgres for most of that.
But we've moved to more sophisticated technologies in the control plane with Kubernetes.
Microservices are written in all different sorts of languages.
And we have cores in multiple places around the world. And up until recently, we had the notion of a primary core data center where all of those requests went with lots of redundancy within the data center, and then had a secondary core data center in the event of a disaster.
In the event of a disaster, we can...
So it was more of a hot-cold model, right, Tom? That's right. Yeah, primary standby.
And what was becoming really hard for me as a manager was getting people to continue...
Just the effort of reminding everyone, remember, keep the cold one updated.
We're going to keep testing it. We're going to try it every quarter. We're going to tap you on the shoulder.
How's the... You know, all this stuff. And it just felt like a chore to so many people.
And I have total empathy for them. We build tools to make sure they're deploying to both, make sure that you're thinking about both when you're designing solutions.
And then finally, we decided we are going to force the secondary data center, the secondary core into everyone's minds by using it all the time, finding a way to use it all the time.
And so then that's where we started to use dynamic steering, where right now, if you go to api.Cloudflare.com and you are closer to the United States or closer to North America, you'll be sent to that data center, that core.
And if you're in Europe, you'll be sent to that core.
And that steering decision is totally made by our own load balancing product as of this year.
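As a rough illustration, configuring something like this through the Load Balancing API could look roughly like the sketch below. The token, zone ID, and pool IDs are placeholders, and the exact field names and the "dynamic_latency" steering value are assumptions to check against the current API documentation.

```python
# Hypothetical sketch: create a load balancer that steers api traffic between
# two core data centers by measured latency. Token, IDs, and the exact
# steering_policy value are placeholders/assumptions, not verified values.
import requests

API = "https://api.cloudflare.com/client/v4"
HEADERS = {"Authorization": "Bearer <API_TOKEN>", "Content-Type": "application/json"}

payload = {
    "name": "api.example.com",                                   # hostname fronted by the LB
    "default_pools": ["<us-core-pool-id>", "<eu-core-pool-id>"],
    "fallback_pool": "<us-core-pool-id>",                        # last-resort pool
    "steering_policy": "dynamic_latency",                        # dynamic steering
    "proxied": True,
}

resp = requests.post(f"{API}/zones/<zone-id>/load_balancers",
                     headers=HEADERS, json=payload, timeout=10)
resp.raise_for_status()
print(resp.json()["result"]["id"])
```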
That is very cool.
And was there any, you know, from a volume perspective, I mean, I don't know if people know, you know, the amount of API calls and things like that and volume that we get, you know, into the core, right?
And now suddenly you're adding, you know, another layer in there, which is the load balancer.
Were there any concerns or anything like that that you guys had to work through?
There were definitely concerns about that, particularly the fact that, you know, when you think about a highly available topology as a traditional application developer, you normally think of two sort of discrete availability zones, whatever, but they're very close to each other.
So you just, you know, happily load balance between them, maybe turn on sticky sessions.
Our two data centers are very far apart.
They are two different regions in most parlance. So the big concerns were that once we brought them both into the fold, if you start to build software that needs to talk across the ocean, you're consuming bandwidth and tolerating latency in ways that teams weren't accustomed to having to think about before.
Before, it was just infinite transactional, low-latency storage right next to the application.
So basically the way I approached that problem, the way we approached it as a team of engineers was we're going to give the team some primitives.
I'm not going to dictate to every team; every team at Cloudflare is very autonomous.
They build things to their own liking with their own languages.
We're going to give them a set of building blocks so that they can leverage these two data centers at all times.
And to some extent at their own pace, they can figure out how and whether they can get their systems working, to what extent they can spread the load across both versus steering to one or the other, and we give them some tools for steering within the data centers after they go through dynamic steering.
So the way we sort of, I guess, sidestepped some of the concerns was start small and give people the tools so that they can ease into using both data centers.
But the big sort of stake in the ground was traffic is coming to both as of this date.
So get ready. Get ready. And these are the tools you have available and the whole data center now available to make use of production traffic at all times.
Yeah, they were really useful. Just to mention that, as you said, Tom, different teams could adopt independently.
And then I think that gave us a lot of flexibility.
So in a way, I guess it's not just that Tom is setting up a model, right, for how we're going to do the core; it's almost like every team with a set of services, for instance the billing team that reports to me, is doing their own configuration on how they're using load balancing within that overall model, right?
Yeah, correct. You know, some teams, for example, have optimized their queries to the database, right?
So they can tolerate paying that extra latency, for example when you go from one region to the other for your database queries. Some other teams have more complicated database queries, and they have to do something different.
So having all these tools available, I think, gave us really good flexibility; different teams could do things like, if you have to optimize your queries, you can do it at your own pace.
You can set up your system in a way that it works now.
And then in the meantime, you keep improving and then migrate to another version.
Yeah, sure. And have there been any particular challenges, some rocks along the way, that required us to file enhancement requests with Dimitris and his team, things like that?
Or has it been all this smooth sailing?
Oh, man, so many. I mean, you just zoom way out, like you're responsible for the billing team, for example.
They've got this great roadmap of features they want to give our customers, and new ways to pay and new ways to break up subscriptions and all this stuff.
And then they have me like, you have to do this, too.
You have to do this, too. I know. And so some of the primitives we gave teams are like, OK, we've got you.
We're going to activate this secondary data center at all times.
If you have the ability, we'll put a read replica for your config there, too. And we'll guarantee a certain tolerance of how far behind it is from your primary data store.
You can use it to get fast reads in that data center. Turns out that's work.
You know, they kind of have one database connection already. They're not accustomed to thinking about whether a given operation was necessarily a read or a write.
And so it wasn't an easy lift to encourage. That was like a primitive that we gave people.
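A minimal sketch of that read-replica primitive, assuming a Postgres primary in the home core and a replica in the second core. The hostnames, table, and use of psycopg2 are assumptions for illustration, not the actual internal setup.

```python
# Sketch: writes always cross to the primary; reads stay on the local replica,
# which may lag the primary within some tolerated bound. DSNs are placeholders.
import psycopg2

PRIMARY_DSN = "host=primary.core1.internal dbname=app"   # home core
REPLICA_DSN = "host=replica.core2.internal dbname=app"   # local read replica

primary = psycopg2.connect(PRIMARY_DSN)
replica = psycopg2.connect(REPLICA_DSN)

def get_zone(zone_id):
    # Read path: tolerate slightly stale data, avoid the cross-ocean hop.
    with replica.cursor() as cur:
        cur.execute("SELECT name, plan FROM zones WHERE id = %s", (zone_id,))
        return cur.fetchone()

def rename_zone(zone_id, new_name):
    # Write path: always go to the primary and pay the latency once.
    with primary.cursor() as cur:
        cur.execute("UPDATE zones SET name = %s WHERE id = %s", (new_name, zone_id))
    primary.commit()
```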
But it wasn't incredibly easy to use. Moreover, we basically found, we knew this, but it wasn't until people started using it that it really hit home.
A lot of the way you write code, or different teams write code, if you ever use an ORM or anything like this, you're quietly, without even seeing it, making a bunch of database calls back and forth.
Look at this object. Update this thing.
And the code looks nice and clean. And it runs quite fast when it's sitting next to the database.
You can make that database 200 milliseconds away and you're dead.
It just doesn't work. And so one of the other patterns we've had to adopt is we're going to expose two versions of running your system so that people who are calling your service can choose whether they're talking to a read-only version or a writeable version.
Because we can't put the latency between the application and the database.
Maybe for some apps we could, but for many apps, it just wasn't going to fly.
So we're going to put the latency between the client and the server, the calling service and the receiving service.
And that receiving service, if it needs to do 10 writes, that's fine.
It'll be real close to its database. And if we're just doing one RPC request-response cycle, the latency is completely tolerable.
But that was the thing I think we intuitively knew. And it wasn't until people actually did it and saw it and felt it that they were like, man, we're going to have to rethink our new topology.
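A small sketch of that pattern at the service layer, with hypothetical hostnames: callers choose between a nearby read-only deployment and the single writable deployment that sits next to the primary database, so the cross-region latency is paid once per request rather than once per query.

```python
# Sketch: two deployments of the same service; callers pick per request.
# Hostnames are hypothetical.
import requests

READ_ONLY_BASE = "https://billing-ro.core.internal"   # nearest core, replica-backed
WRITABLE_BASE  = "https://billing-rw.core.internal"   # primary core only

def fetch_invoice(invoice_id: str) -> dict:
    # Read: one request-response cycle to the local read-only deployment.
    r = requests.get(f"{READ_ONLY_BASE}/invoices/{invoice_id}", timeout=5)
    r.raise_for_status()
    return r.json()

def create_invoice(data: dict) -> dict:
    # Write: the writable deployment can do its many database writes right
    # next to the primary, so the caller pays the cross-region latency once.
    r = requests.post(f"{WRITABLE_BASE}/invoices", json=data, timeout=10)
    r.raise_for_status()
    return r.json()
```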
I remember when I was doing some of these designs for applications in more traditional data centers, one of the things that was always a little bit tricky is all these things around the health checks.
So the load balancer knows when to take something out of the pool, or whether to send it requests or not.
You can do basic things like pings or latency checks and things like that, but also much more sophisticated things.
It's almost like simulating a synthetic transaction, if you want to call it that way.
So what capabilities... have we used any of those more advanced health check capabilities to make that whole load balancing layer across data centers as robust as possible?
Because obviously, at some point in time, something is going to fail, right?
A service is going to get restarted, something will fail somewhere. Have we done any of that, Tom, with the load balancing service?
There's so much philosophy to this.
So maybe Dimitris can speak to what our products are capable of.
But in terms of applying them, here's the thing, you know, it's tempting.
You can make a health check that tests a ton of things and say, if any of it's not healthy, mark the whole thing as such because some part of it isn't working properly.
I don't know how to articulate this philosophy clearly, but if you do that too much, you're accidentally failing yourself out over something else that's unhealthy.
And in our case, with two simultaneous data centers, we need to be careful, because they are working together, that we don't mark them both unhealthy because one is unhealthy.
And we are allowing teams to, you know, have them both work together.
But they all have to be careful, and we need to be careful with the API layer, because part of the reason we're doing this is so that the cores, the entirety of the data centers, can fail independently and have dynamic steering keep the other one up.
But if a health check accidentally thinks because one failed, the other one's unhealthy, then we are shooting ourselves in the foot.
So that's a tightrope walk we've had to make with these two active cores.
Yeah. I mean, from the load balancing side, we try to build a product that is flexible.
We allow customers to configure all these different kinds of health check details: the path, expected response codes, expected strings, how many times you have to see an origin as down before you declare it unhealthy.
So we try to build it in a way that's flexible.
And on top of that, we try to put some things in place that can protect you.
For example, we have this notion of fallback pool.
You specify two pools. If the first one is unhealthy, we move on to the next pool of origins.
If that one is unhealthy too, we go to the fallback pool and we don't check health information.
So that's kind of the last resort. So for the case that Tom was describing, you would put the fallback pool there.
So at least you go somewhere.
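As a small illustrative sketch of that fallback behaviour (not the actual implementation): the configured pools are tried in order while their health checks pass, and if nothing looks healthy, traffic goes to the fallback pool anyway, without consulting health checks.

```python
# Sketch: fallback pool as a last resort that ignores health state.
def pick_pool(ordered_pools, fallback_pool, health):
    for pool in ordered_pools:
        if health.get(pool, False):
            return pool                 # first healthy pool wins
    return fallback_pool                # last resort: route here regardless

health = {"primary-pool": False, "secondary-pool": False, "fallback-pool": False}
print(pick_pool(["primary-pool", "secondary-pool"], "fallback-pool", health))
# -> "fallback-pool": requests still go somewhere even when every check fails
```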
The other most important thing we put in place is a feature we call zero downtime failover, which is a really, really cool feature.
What it does is, let's step back a bit. So we have the active health checks.
If you specify a probe interval of one minute, you will run your first health check at zero, the next health check at 60 seconds.
If your origin goes down at 61 seconds, you won't know until 120 seconds.
So there's a tradeoff there. You can specify more frequent health checks, but then you see more traffic to your origins because of the health checks.
So what we did: when we get the actual request, the eyeball request, we try to go to the origin that load balancing tells us to.
But if that one fails, we instantly try the second origin or second pool. So even if our health checks say the origin is up, if in reality we see the origin is down at the request time, we will try our best to go to the next healthy origin.
So I think that's a really cool feature.
We don't really talk a lot about that, but I think it has saved our customers a lot of downtime.
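A simplified sketch of that request-time behaviour (origin names are placeholders): even if the most recent health check said an origin was up, a failed eyeball request falls through to the next origin immediately instead of waiting for the next probe.

```python
# Sketch: retry the next origin at request time when the preferred one fails.
import requests

def forward(path: str, origins: list[str]):
    last_error = None
    for origin in origins:  # ordered by the load balancer's steering decision
        try:
            resp = requests.get(f"https://{origin}{path}", timeout=2)
            if resp.status_code < 500:
                return resp             # origin answered; done
        except requests.RequestException as exc:
            last_error = exc            # connection failed; try the next one
    raise RuntimeError(f"all origins failed: {last_error}")

# forward("/health", ["origin-a.example.com", "origin-b.example.com"])
```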
That's really cool. As I said, it reminds me of a lot of things that I had to deal with in traditional data centers and also with cloud solutions.
So we have four more minutes.
So Tom, for somebody that is looking at doing something like this, with all the scars that you've gone through, any particular words of wisdom? Anything that you'd say, make sure you keep these things in perspective, or keep it simple, any advice that you would give a friend that was getting into this?
Yeah, I mean, I think one nice thing about how we're using Cloudflare is it's not that different than how a customer would use it.
We have two different data centers, two different Kubernetes clusters running microservices and Docker.
This probably all sounds like familiar technology.
We put Cloudflare load balancing across them. I think where I would start... we started on hard mode, I feel like.
We're like, we're going to do two and we're going to make them really far apart.
I would start by creating a couple of availability zones close to each other, easing into this resilient world before you go broad.
But I do honestly think that Cloudflare is a great solution.
We're not just using it because we are Cloudflare; it solves the exact problem we have at the core.
Our api.Cloudflare.com could be api.example.com. It's very similar in most cases to any other customer.
And the fallback, the capabilities Dimitris was talking about, we're using them because they're useful, not just to demonstrate that they work.
So I don't know if that sounds like a sales pitch, but.
Well, I mean, you know, it's a little bit of reality, right?
I know that also internally, you know, we give pretty strong feedback on things, on enhancement requests and all those things, when we use our own products.
So, you know, I think that is something that's valuable for our customers.
You know, we try to take a look at, I always say that, you know, if it's not good enough for us, it's probably not going to be good enough for our customers.
So we always try to use our own solutions first, you know, whether it's like beta testers or alpha testers or anything like that before we put it out there.
So hopefully, I mean, everybody can see, you know, that we take that philosophy seriously.
And, you know, apart from talking the talk, we try to also walk the walk.
Well, I just want to wrap it up. Thank you so much to both of you for the time.
I appreciate it. And hopefully, maybe we'll see you guys in another episode when we have another project with load balancing.
Sounds good.
Have a good rest of the week. Talk to you later. Bye. Bye.