Latest from Product and Engineering
Presented by: Usman Muzaffar, Tim Bart, Tom Lianza
Originally aired on September 23, 2023 @ 7:30 PM - 8:00 PM EDT
Join Cloudflare's Head of Engineering, Usman Muzaffar, for a quick recap of everything that shipped in the last week. Covers both new features and enhancements on Cloudflare products and the technology under the hood.
Transcript (Beta)
All right. Good morning. Good afternoon, everybody. It's afternoon here. It's not morning.
Welcome to the Latest from Product and Engineering. I'm Usman Muzaffar.
I'm Cloudflare's Head of Engineering. Jen Taylor normally joins us, but she's out today, but that's no problem because I've got two great teammates that I want to talk about the projects that they work on.
Tom and Tim, take a second to say hello.
Hi, I'm Tom Lianza, Engineering Director at Cloudflare. I've been here six years and work mostly on our control plane.
Hi, I'm Tim Bart. I work on the Kubernetes team and I've been at Cloudflare for a little less than two years.
Excellent. Welcome, guys. So as you can guess, so Kubernetes control plane, what we're going to talk about today is some of the under the hood technology.
In particular, when we say the control plane of Cloudflare, we're talking about the API, the thing that customers actually use and the dashboard that customers actually log into.
And the interesting thing about the dashboard, Tom, you and I once talked about this.
I was a Cloudflare customer for years and I think I logged in one afternoon to sign up my startup's website and I set a couple of cache settings and uploaded a cert.
And I don't think I logged in again for another two years. So part of the interesting thing here is that you can go a long time without touching the API.
How is that? Why is that? This is not like Facebook. You're not scrolling a Cloudflare feed.
Why is it that you can get so much value out of Cloudflare without ever touching the API?
So what is the API and what is it controlling? How does this connect with what Cloudflare is doing at the edge?
Yeah, good question. So I am not a product manager and so luckily I'm not faced with the challenge of how do I measure engagement or something?
Because in my view as an engineer, if you can do exactly what you described, log in, click some buttons and never think about it again, we succeeded.
We win. That sounds great. Maybe we'll shoot you an email and let you know we stopped the attacks or whatever for you while you slept.
But hanging out and configuring things doesn't sound fun to me and I hope most of our customers don't have to do that.
But some of our big customers do integrate very tightly with Cloudflare's API and have robots click those buttons.
So instead of going to the dashboard and changing those cache settings or purging some assets, they'd integrate with the API directly and have robots say purge assets when they do a deployment of their website or whatever it is.
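As an illustration of the "robots" mentioned here, the following is a minimal sketch of a deploy script preparing a cache purge call. The endpoint path and the `files` body field follow the public Cloudflare API as commonly documented, but check the current API reference before relying on them. The zone ID, token, and URLs are placeholders, and the helper name `build_purge_request` is invented for this sketch.

```python
import json
import urllib.request
from typing import List

API_BASE = "https://api.cloudflare.com/client/v4"


def build_purge_request(zone_id: str, api_token: str, urls: List[str]) -> urllib.request.Request:
    """Build (but do not send) a request that purges specific URLs from cache."""
    body = json.dumps({"files": urls}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/zones/{zone_id}/purge_cache",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values; a real deploy script would read these from its own config.
request = build_purge_request(
    zone_id="ZONE_ID_PLACEHOLDER",
    api_token="API_TOKEN_PLACEHOLDER",
    urls=["https://example.com/app.js", "https://example.com/app.css"],
)
# urllib.request.urlopen(request) would send it; omitted so the sketch has no side effects.
```

A deployment pipeline would typically run something like this as its last step, after the new assets are live.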
So the API does get quite a bit of usage.
More than most of us probably when we ran Cloudflare for our personal blogs and such.
I don't think it's the star of the show at Cloudflare. The edge is the star of the show and the edge is what everyone knows is like proxying their traffic in real time.
But I like to think of us as the brain. If we weren't working, the edge wouldn't know what to do.
We are the ones who tell the edge what to do. When somebody tells us, I want this DNS record set to this value, it's us that goes on to tell the edge that's how we want it to behave.
And it's us that has to take that information extremely seriously.
We have to store that redundantly, disaster proof, et cetera, et cetera.
The edge is huge and ephemeral. We can restart any of those computers, but the brain needs to work and the brain can't lose information.
And so the control plane is really important in a different dimension than the edge, which I think makes it interesting and challenging in its own way.
And it's also like the number, is it accurate to say that the number one consumer, the number one client of the API is the dashboard also, right?
I mean, the interface that the customer actually uses, including the interface that the customer uses to sign up for Cloudflare.
Yeah, depending on how you measure it, like if it's by how many people access the API, yeah, the dashboard would be our number one client.
But if it was about raw volume, it's tough to beat those automated.
Those robots are hammering it. Yeah. Turning firewall rules on and off, reconfiguring things.
Yeah. So we have to satisfy both. So a couple of questions. So in some ways, we've often, sometimes we say like, in some ways, the Cloudflare control plane, the API is in fact the most normal, most recognizable part of the system.
It looks like what other SaaS companies are used to, right? There's a user interface, there's an API, there's databases, there's an actual endpoint, a single URL, or a host name, in this case api.cloudflare.com, that we are monitoring, as opposed to the edge, which is a very low level piece of the infrastructure.
So when we talk about measuring the success of the Cloudflare API, we can talk about it in traditional IT terms: its uptime, and how many successful requests is it processing?
How often is it doing what it's supposed to? So how did we think about when we said, you know what, we need to start watching our own API and monitoring it carefully?
What kinds of approaches did we take? Yeah, it's interesting that the API, the single host name is such an illusion, that this is as if this were one thing, the API.
And we do give our customers one front door, they have their own API tokens that they use to access the API.
But once they hit api.cloudflare.com, the rest of that sort of path determines what back end service handles that operation they're trying to perform, whether it's SSL or load balancing or caching, or what have you.
And so really, the API team is a gateway through which you access those back end services.
And most of those back end services are housed in Tim's team's Kubernetes platform.
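The idea that the path after the host name selects a back end service can be sketched as a longest-prefix lookup. This is only an illustration: the route prefixes and service names below are invented, and the real gateway is described in this conversation as Nginx based, not Python.

```python
from typing import Dict, Optional

# Hypothetical routing table: path prefix -> owning back end service.
ROUTES: Dict[str, str] = {
    "/ssl": "ssl-service",
    "/load_balancers": "load-balancing-service",
    "/cache": "cache-service",
    "/cache/purge": "purge-service",
}


def pick_backend(path: str) -> Optional[str]:
    """Return the service owning the longest matching prefix, or None."""
    matches = [p for p in ROUTES if path == p or path.startswith(p + "/")]
    if not matches:
        return None
    return ROUTES[max(matches, key=len)]
```

Matching on whole path segments (the `p + "/"` check) keeps `/sslx` from being routed to the SSL service by accident.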
And so when we monitor the API, we can look across all of these endpoints.
The API team takes a lot of accountability for global problems, and alerts when we see an across-the-board dip in availability, performance, latency.
But we also built a self service tool. So teams hook themselves up to the API.
And in doing so, they are telling us who they are.
And so we can monitor each team's service as well. So if we see a dip, we can then simply scroll down to the monitoring section that shows us which of the back end services is accountable for that dip, and determine if this is a platform issue that the API team or Tim's team needs to jump on immediately, or if it's one team having a problem, and it's isolated to sort of one corner of the universe.
So we have to sort of balance the fact that the API is a single monolithic sort of product that our customers, you know, see as one thing,
and the fact that behind that API, there's a lot of different moving parts.
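The triage step described here, deciding whether a dip is platform-wide or isolated to one team's service, can be sketched as follows. The thresholds, service names, and the rule "half or more of the services degraded means platform" are assumptions made for illustration, not the team's actual alerting logic.

```python
from typing import Dict, List, Tuple


def success_rates(counts: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """counts maps service name -> (successful requests, total requests)."""
    return {s: (ok / total if total else 1.0) for s, (ok, total) in counts.items()}


def classify_dip(
    counts: Dict[str, Tuple[int, int]],
    threshold: float = 0.99,
    platform_fraction: float = 0.5,
) -> Tuple[str, List[str]]:
    """Return ("healthy" | "isolated" | "platform", degraded service names)."""
    rates = success_rates(counts)
    degraded = sorted(s for s, r in rates.items() if r < threshold)
    if not degraded:
        return "healthy", []
    if len(degraded) / len(rates) >= platform_fraction:
        return "platform", degraded
    return "isolated", degraded


# Hypothetical one-minute window of request counts per back end service.
window = {
    "dns": (9990, 10000),
    "ssl": (4000, 5000),
    "cache": (7995, 8000),
    "load-balancing": (2999, 3000),
}
verdict = classify_dip(window)
```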
That's, that's excellent.
So a huge part of what we have to do is also a routing problem here of being able to identify very quickly, who, which team should be alerted if we see something that doesn't seem right, or which one is potentially, you know, using more resources than we expected to, or whatever the issue might be.
Yeah, that's right. I think our routing is not, not hard to understand. Not to say that I understand it in perfect detail.
But we aren't, we aren't reinventing the world here.
Our, our API gateway is Nginx based like a lot of our software.
And depending on where the upstream service is, we may hand it off to Tim's team, to the various Kubernetes ingress and routing fabric to get to the... Perfect segue into bringing Tim into this conversation.
So when I joined Cloudflare, several years back, Kubernetes was just a code name.
Kubernetes was just, it was a very small project that was just a couple of engineers were playing with.
We were using the Marathon Mesos framework to run microservices.
And even then, it was starting to show some problems.
And we wanted to, we wanted to get the benefit of Kubernetes.
I think we were all seeing that Kubernetes was accelerating in its adoption industry wide, and it was going to solve a lot of the problems, you know, security and management and dev tooling that we had.
But I think probably the most interesting thing, conversation I had at Cloudflare at the beginning was, I was like, okay, so, you know, Google's already released this thing.
So we just have to, we just download and install it, right?
That's, that's all we got to do here. It's, you know, it's no different than running anything else.
Is there, is, do we actually have to write any code here?
Like what, what, what do we actually, why do we, do we need a team here?
Do we even need an engineer here? So what is it, what was the big thing that I was missing, Tim, that, that you and your colleagues helped explain to me?
Yeah, I think the context of running Kubernetes on your own hardware is very different than running Kubernetes in the cloud, for which the cloud usually provides you with a bunch of other related services around identity, around storage, around networking.
And those pieces have to be either reinvented or reimplemented to run in the data centers that Cloudflare operates.
And so there is a lot of, you know, bare-metal underlying subsystems that need to be put in place before teams can leverage Kubernetes in the way that they expect to, which is a platform for resources, compute, and memory, where they can deploy their application in seconds and, and get to see the changes immediately.
So let's, let's take that, let's take what you just said apart just very carefully, one sort of step at a time.
So what is the first piece that is sort of missing that when you use Kubernetes in the public cloud, you can kind of take for granted, but if you're going to set up Kubernetes yourself on your own bare metal, that you need to build something here because there's a, there's a effectively a puzzle piece that's missing.
What is the first thing that, that we have to, we have to construct there?
I think networking is really like one of the most important underlying pieces of running Kubernetes and networking has, you know, multiple aspects in a Kubernetes-centric world.
One very important one is how do you receive traffic from outside of Kubernetes, inside Kubernetes?
We have, you know, one of my colleagues, Terran, has written a very good blog post about this recently on the blog.
And this is something that is usually provided for free by cloud providers.
That's something we cannot re-leverage internally. So we have to build our own L4 load balancer, to be able to bridge traffic between outside Kubernetes and inside the Kubernetes network subsystem.
For free, for free with your purchase.
For free with your purchases. Exactly. So let's talk about that particular piece for a second.
What were some of the trickier challenges that we had to solve there?
So the idea is being able to get services running in Kubernetes accessible from other parts of the Cloudflare infrastructure that don't run in Kubernetes and for which the networking is somewhat isolated.
And so you need something that can bridge and understands where to direct traffic and understands how to respond and return that traffic to wherever it was calling from.
And that bridge is actually tightly integrated with how Kubernetes provisions resources internally.
And the way you say you want this to be enabled is by setting an attribute on a Kubernetes service.
And all of a sudden we have software capable of creating a tunnel between servers running outside of Kubernetes and forwarding traffic, encapsulating it, and then dispatching it to the right process and the right container running in Kubernetes for you to take action.
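To make "setting an attribute on a Kubernetes service" concrete, here is a minimal sketch of a Service manifest (written as a plain Python dict) carrying an opt-in annotation, plus the check a bridging component might run. The annotation key `example.internal/external-bridge` is invented for this sketch; the conversation does not name the real attribute.

```python
from typing import Any, Dict, List

# Hypothetical annotation key, invented for this sketch.
BRIDGE_ANNOTATION = "example.internal/external-bridge"


def make_service(name: str, port: int, bridged: bool) -> Dict[str, Any]:
    """Return a minimal Kubernetes Service manifest as a plain dict."""
    annotations = {BRIDGE_ANNOTATION: "true"} if bridged else {}
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "annotations": annotations},
        "spec": {
            "selector": {"app": name},
            "ports": [{"port": port, "targetPort": port}],
        },
    }


def services_to_bridge(services: List[Dict[str, Any]]) -> List[str]:
    """Names of services that opted in to being reachable from outside the cluster."""
    return [
        s["metadata"]["name"]
        for s in services
        if s["metadata"].get("annotations", {}).get(BRIDGE_ANNOTATION) == "true"
    ]


services = [
    make_service("purge-api", 8080, bridged=True),
    make_service("internal-batch", 9000, bridged=False),
]
```

The point of the design is that the owning team only edits its own Service; the tunnel setup and forwarding happen elsewhere.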
That's great. And I guess, Tom, that lines up directly with what we were just talking about, the routing problem here of how do we make sure that it gets to the right spot?
And I think a particularly cool property of this whole system is that the individual teams do not need to be...
They can take this for granted. From their perspective, they've built a service, they've tested it, they deployed to the Kubernetes cluster, and it gets requests that are due to their service without them having to muck with some central thing, which wasn't true in the old days.
I remember a lot of PRs conflicting over the central routers that were...
Yeah, people claiming the same IP address and all this stuff.
Yeah, those days. So Tim, one of the things that that leads next to then is sort of a very interesting monitoring and logging and in general, the class of observability problem of, okay, so now all this stuff is very neatly partitioned in pods and subnets and all this.
So how do you... The Kubernetes team is an awfully bright and talented group of engineers, but it's a pretty small team.
So how do you keep an eye on this massive cluster? Sure. So one of the aspects that we've taken is on a multi-tenant approach where each team gets a set of namespaces, which are a boundary where their pods don't interact with other teams' pods and try to limit blast radius and or connectivity between things that should not be talking to one another.
And by building under those premises, we've built quite a few custom integrations that can automatically discover new pods, automatically scrape the metrics that those pods might be exposing, and then centralize all of that through our logging and monitoring system, in addition to tracing, where we can trace requests entering the cluster in one place and then going somewhere else and coming back.
And we provide this as a self-service application.
It's a bunch of annotations you put on your pods and it's there for you automatically, whether it's alerts, whether it's monitoring, whether it's tracing, you really don't have to do much to benefit from this.
And I think this is one of the values that we can provide as a team: a little bit of effort on our end gets the benefits for all of the engineering teams at Cloudflare.
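The annotation-driven discovery described here can be sketched as below. The `prometheus.io/*` annotation keys are a widely used community convention and are an assumption in this sketch; the conversation does not say which keys or which metrics system the team actually uses.

```python
from typing import Any, Dict, List


def scrape_targets(pods: List[Dict[str, Any]]) -> List[str]:
    """Build metrics URLs for pods that opted in through annotations."""
    targets = []
    for pod in pods:
        annotations = pod.get("metadata", {}).get("annotations", {})
        if annotations.get("prometheus.io/scrape") != "true":
            continue  # pod did not opt in
        ip = pod.get("status", {}).get("podIP")
        if not ip:
            continue  # pod not scheduled or not yet assigned an address
        port = annotations.get("prometheus.io/port", "9090")
        path = annotations.get("prometheus.io/path", "/metrics")
        targets.append(f"http://{ip}:{port}{path}")
    return targets


# Hypothetical pod objects, trimmed to the fields the sketch reads.
pods = [
    {
        "metadata": {"annotations": {"prometheus.io/scrape": "true", "prometheus.io/port": "8081"}},
        "status": {"podIP": "10.0.0.5"},
    },
    {"metadata": {"annotations": {}}, "status": {"podIP": "10.0.0.6"}},
    {"metadata": {"annotations": {"prometheus.io/scrape": "true"}}, "status": {}},
]
```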
Yeah, it's really fantastic because then once they add those annotations, they get that kind of visibility.
And I think the other very interesting property I realized is that the Kubernetes team is one of the first to notice any kind of networking hiccup in the system, almost as fast.
And in some cases before the networking team, just directly monitoring some of the systems, which are there.
And so how does that happen? How are we suddenly able to monitor so many different potentially point-to-point connections?
What are some of the challenges there that we set out to solve and what did we learn along the way?
Yeah, that's a very good question.
I think one of the aspects we realized is once you run multiple servers and multiple racks over multiple switches, it's very difficult to troubleshoot when something goes wrong.
And so we've invested quite a bit of time and resources into having probes, basically small pieces of software running on each of the nodes, each of the racks, pinging one another and being able to understand whether latency has increased, whether we see timeouts, packet drops.
And from those probes, we get usually very good sense as to what's happening at the hardware and networking level, which is most of the time invisible to app teams because the retries on the network take care of it.
Really, why not? Take it for granted, for sure.
That's the whole point of a network, right? But for us, it lets us know something is amiss on this one rack, on this one switch that we might need some maintenance and or a new release of software forgot to add a new IP to a firewall or some other stuff where we can catch those changes very quickly and we're alerted almost immediately when that happens.
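One way to turn rack-to-rack probe results into "something is amiss on this one rack" is sketched below. The sample format, the 5 ms limit, and the rule "more than half of the probes touching a rack failed" are assumptions chosen for the illustration, not the team's actual detection logic.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# One probe result: (source rack, destination rack, round trip in ms or None on timeout).
Sample = Tuple[str, str, Optional[float]]


def suspicious_racks(samples: List[Sample], latency_limit_ms: float = 5.0) -> Dict[str, float]:
    """Return racks where more than half of the probes touching them failed.

    A probe counts as failed if it timed out or exceeded the latency limit.
    Requiring a majority keeps one bad rack from implicating all of its peers.
    """
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for src, dst, rtt in samples:
        failed = rtt is None or rtt > latency_limit_ms
        for rack in (src, dst):
            totals[rack] += 1
            if failed:
                failures[rack] += 1
    return {
        rack: failures[rack] / totals[rack]
        for rack in totals
        if failures[rack] / totals[rack] > 0.5
    }


# Hypothetical probe round: rack "c" is dropping or delaying traffic.
samples = [
    ("a", "b", 0.4), ("b", "a", 0.5),
    ("a", "c", None), ("c", "a", None),
    ("b", "c", 9.0), ("c", "b", None),
]
flagged = suspicious_racks(samples)
```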
I almost misrepresented the API earlier.
I made it sound like we were in front of Kubernetes and that was true. We are now in Kubernetes.
We are just another... The API team itself could not...
Wanted the benefits of the rolling restarts and all that out-of-the-box observability and other ilities that come with Kubernetes.
So the API gateway is itself just a tenant.
And we should say, especially for the non-Cloudflare listeners, of course, all of Cloudflare's edge is in front of the API itself, right?
So that we are using all the power of our own product, of course we are, to defend it, to protect it, to ensure that it's performant, all that great stuff.
So... Yeah.
If you were drawing lines, actually, so we move API gateway into Kubernetes, then we show that the edge is worldwide in front of all of this, then there's actually multiple of us, multiple Kubernetes and API gateways.
And not only can the edge send traffic to any of those, but once the traffic hits the API gateway, we have the Consul integration with Kubernetes that allows the API gateway to potentially upstream to a local Kubernetes service or a remote Kubernetes service in a completely different Kubernetes cluster.
So we have lines going over the oceans. I love it.
And that's actually a perfect segue into the next thing I was going to ask you, Tom, which is that...
So now that things... One of the benefits of things being in Kubernetes, it's very easy to run multiple instances, they can catch each other, they can fail over.
We already had server level redundancies, sub-process level redundancies, rack level redundancy, colo...
And now we're... Multiple colos in a point of presence.
And we also have continental, I'd like to say, level. We're not quite yet at planetary redundancy.
I don't have a colo on the moon yet or on Mars, but ask me again in five, 10 years.
But when we talk about having multiple core data center, points of presence at different points around the globe, go into a little more detail, Tom, how does that work?
How does the system know when it should send a request to the other side of the planet?
Because it feels like that's the right thing to do for whatever reason.
So there's a couple of layers here. And the job of our platform teams is generally to give primitives and give advice.
But then there are 40, 50 other teams that either can take or not take our advice, but they do use our primitives.
And we try to make the primitives steer them in, in our opinion, the right direction.
So first and foremost, the entry point to Cloudflare being our own edge, we use the Cloudflare load balancers.
We have them configured for, I forget the name of the mode, if it's dynamic or geo or whatever, but we steer people towards the API gateway that is fastest to them.
Not necessarily closest.
Those distinctions, there's different... The load balancing product has all these configuration options.
We do fastest. So you get steered to the API gateway that will give you the best performance from our perspective.
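The distinction between "closest" and "fastest" can be sketched with two selection functions over the same candidates. The gateway names and measurements are invented, and this is not the load balancer's actual steering implementation, only an illustration of why the two choices can differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gateway:
    name: str
    distance_km: float  # geographic distance from the caller
    rtt_ms: float       # measured round trip time from the caller
    healthy: bool


def closest(gateways: List[Gateway]) -> Gateway:
    """Pick the healthy gateway with the smallest geographic distance."""
    return min((g for g in gateways if g.healthy), key=lambda g: g.distance_km)


def fastest(gateways: List[Gateway]) -> Gateway:
    """Pick the healthy gateway with the lowest measured round trip time."""
    return min((g for g in gateways if g.healthy), key=lambda g: g.rtt_ms)


# Hypothetical measurements: the nearby gateway sits behind a congested path.
candidates = [
    Gateway("gateway-near", distance_km=300, rtt_ms=48.0, healthy=True),
    Gateway("gateway-far", distance_km=1200, rtt_ms=21.0, healthy=True),
    Gateway("gateway-down", distance_km=100, rtt_ms=5.0, healthy=False),
]
```

With these numbers, `closest` returns `gateway-near` while `fastest` returns `gateway-far`, and the unhealthy gateway is never chosen.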
The API gateway then leverages Consul and our service discovery technologies to figure out where to upstream your traffic based on what it is you're asking us to do, whether you're storing a DNS record, asking us for a cache purge or whatever.
And then at that point, we're now talking about different teams at Cloudflare.
And those teams have the ability, depending on their product and how it works, to run sort of worldwide, active-active.
Maybe cache purge is a good example of a team that has been able to run in multiple cores at the same time all across the world.
And so you hit the fastest, closest one, and it can actually do the cache purge from there.
Some teams, I regret to inform you that we have not solved the CAP theorem at Cloudflare.
Nor have we made any progress on the speed of light.
I don't understand why we're having so much difficulty with these two problems.
It's just a hiring problem. So if the thing the customer is asking us to do is write something authoritative, system of record, I want you to change X property, we still send you to the single geography that has the active source of truth.
So when you get a 200 OK back from Cloudflare, it has been written authoritatively.
There's not an async, delay, eventual consistency thing. For those operations where a customer wants us to make a change that's persisted, they only get 200 OK when it is persisted, which means we may send you to one place on Earth, not just the closest, after you get through the API gateway.
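The rule described here, authoritative writes go to the single source-of-truth geography and the caller sees 200 OK only after the write is persisted, can be sketched as follows. The region names, the in-memory store, the 503 on failure, and the assumption that reads may be served from the nearest region are all illustrative.

```python
from typing import Dict

READ_METHODS = {"GET", "HEAD", "OPTIONS"}


def choose_region(method: str, nearest_region: str, primary_region: str) -> str:
    """Reads may stay near the caller; authoritative writes go to the primary."""
    return nearest_region if method.upper() in READ_METHODS else primary_region


class PrimaryStore:
    """Stand-in for the system of record; a real one would write durably."""

    def __init__(self) -> None:
        self.durable: Dict[str, str] = {}

    def persist(self, key: str, value: str) -> bool:
        self.durable[key] = value
        return True


def handle_write(store: PrimaryStore, key: str, value: str) -> int:
    """Return 200 only once the change has been persisted, never before."""
    persisted = store.persist(key, value)
    return 200 if persisted else 503


store = PrimaryStore()
status = handle_write(store, "dns:example.com:A", "192.0.2.10")
```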
That's great.
And it also means we can run increasingly sophisticated resilience tests because we can show things sending more of the traffic one place versus another.
And I think it's great to see so many. All those lines can make architecture diagrams hard to read, but it's really fantastic.
It's also really fantastic, like you alluded to, that we're using our own product.
It's Cloudflare load balancer configured exactly in the way for this use case, which is we've got multiple geographic origins that can service the same request.
So which one should it go to?
The one that's fastest, the one that's healthiest, the one that's most appropriate for whatever reason.
They've also added features to the Cloudflare load balancing product, which we have not yet taken advantage of ourselves, I confess.
But there is sophisticated rule writing procedures where we could do sort of a per URL routing to the origin.
And I think a lot of customers are using that with success, and we are looking forward to using it.
We just haven't gotten past the base features.
Yeah, absolutely. It's fun to take advantage of all the cool tech that we've built and leverage it for each other.
Tim, back to you for a second. So we got to the point where we unplugged Marathon.
We've got now the ability for... At this time, Cloudflare engineering is growing like a weed and the number of teams reporting to main engineering has gone from 20 to 30 to 40.
Now it's breaking off across 50.
So you've got a world where it's easy for teams to get onboarding.
Kubernetes onboarding is part of your day one, week one training of Cloudflare.
So this is very much part of how you build a control plane service. What's next?
So your team is still hard at work. We've still got bright engineers. So what's an example of the next class of problem that shows up and what did we have to solve for?
Yeah, I think storage is an important aspect of running in Kubernetes. The ability to provide persistent storage across multiple racks and machines and availability if one of those nodes or racks were to go away and for the applications to be oblivious to those changes is one of the challenges that we work on.
We have provided those primitives to all of our clusters and we're working on adding faster storage, larger, and the ability to do IOPS at a much higher level, leveraging SSDs and a bunch of other stuff, to give teams the ability to run applications like Elasticsearch or Redis that they leverage in their own application without having to understand where those pods actually run and on which node and which availability zone they're in.
And I think storage is like a difficult problem.
You don't want to get that wrong, but it's also one that enables a bunch of extra capabilities for teams to leverage in addition to the other data stores that we leverage like ClickHouse, Kafka and Postgres that don't currently run in Kubernetes.
So naively, can I think of the mission of the storage service that the Kubernetes team, that your team is providing the other developers is, can I naively think of it as you're going to give me a file system that I can just guarantee is going to be there and I can write to it even though my service might be rescheduled on an arbitrary number of different pods in a number of different places and that as far as I'm concerned, you've just given me slash usman and I can write to it whatever I want and it'll be there?
That's correct.
There's more to it than this. What are you saying? I oversimplified it? Snapshot backups and that kind of stuff, which are like capabilities we're still in a process of releasing, but eventually the idea is to abstract computation for you.
And what you say is I want 10 CPUs and I want two gigs of memory and one terabyte of storage and we give it to you and you don't have to know where it runs.
And if we can provide this abstraction for you, I think we're doing our job right.
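The request "10 CPUs, two gigs of memory, one terabyte of storage" maps naturally onto Kubernetes resource requests and a PersistentVolumeClaim. The sketch below builds both as plain dicts; the claim name `data` and the helper name are invented, and no storage class is set because the conversation does not name one.

```python
from typing import Any, Dict, Tuple


def workload_requests(cpus: int, memory: str, storage: str) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (container resources, volume claim) for the requested capacity."""
    container_resources = {"requests": {"cpu": str(cpus), "memory": memory}}
    volume_claim = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": "data"},  # hypothetical claim name
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": storage}},
        },
    }
    return container_resources, volume_claim


# Kubernetes quantity suffixes: "2Gi" is two gibibytes, "1Ti" is one tebibyte.
resources, claim = workload_requests(cpus=10, memory="2Gi", storage="1Ti")
```

Nothing in either object says which node, rack, or disk is used, which is the abstraction being described.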
Yeah, that's great.
So talk a little bit, like go into a little, one more level of detail there.
So what technology is really, because one of the things we don't do with Cloudflare is just buy gigantic, you know, we didn't buy any purpose-built special hardware for this or use any third-party special product.
This is all Cloudflare engineering using well, commonly, commonly available hardware, commonly available software components, open source components.
How did we build a resilient storage backend for arbitrarily complex, you know, literally hundreds of microservices that could have very different heterogeneous needs?
Yeah, I think there's two aspects to it.
There's one where the storage is outside of Kubernetes, where we leverage Ceph, which is an open source, basically file system block storage system.
And then we leverage Rook, which is a way to integrate Ceph into a Kubernetes cluster.
And this is a potentially more interesting aspect of it because it takes care of like dynamic sizing of the cluster as we add nodes and nodes going in and out of commission and capable of rebalancing the underlying storage, making copies and moving them from one disk to another to guarantee that we always have the right amount of data and we can guarantee durability.
And then Kubernetes itself offers quite a few settings and capabilities that are not present in many other scheduling or container orchestration systems that we leverage to make this transparent and seamless for teams.
That's great. So the Rook technology is something which bridges Ceph and Kubernetes and it helps us create the abstraction of a file system to the services running in Kubernetes.
That's fantastic.
That's great. Tom, just in the last couple of minutes here, I just want to shift gears and talk about the engineering, management and leadership challenges here.
So we've got a team that's built this service and we bend over backwards to make it easy for teams to use.
Every now and then we really do need everyone to change how they do things.
The first big one was get everybody on Kubernetes and every now and then we have to try to make it so that everyone upgrades or we've made a change to the thing.
You all need to do this. How do we approach that at Cloudflare?
How do we get to a world where when we need to make a change across so many teams, you've been on both sides of this, the team that's been asked to make a change and the team that's asking to make the change.
What do you feel has worked well and what have we learned over the years on that?
Yeah, I could definitely talk for hours on this problem.
I think most engineering teams think of their work in terms of planned and unplanned.
Like I started my week and this is what I knew I was going to do versus I started my week and then it took a left turn on Tuesday and I never got the week back, and teams tend to get bummed out with too much unplanned work.
I think that is a tough existence to sustain. When we ask people to do things across teams, especially platform teams, when we ask people to do work, we end up having to ask tens of teams to do work, not just a peer team.
I think the higher order concern is not creating unplanned work for teams. How much notice can we give you?
What benefits are you going to get once we do this or why are we doing this?
Do you understand the company wants X to happen and we need to do Y in order to enable that? And working our way into some team's schedule so that we're not part of the Tuesday event that steered their project, their week, off track, but we are part of: they started Monday knowing they had to do this thing we needed from them.
I think that's certainly one key part of it is working our way into their planned work schedule so that it's still an organized and peaceful existence for engineers.
We all would prefer to work that way. We've got enough things to interrupt us.
Yeah, absolutely. I think we do a pretty good job.
I think the other thing that I've noticed that this team does very well is that we start small.
We start with a couple of pilot teams and use that. The analogy of just buy them cupcakes, get them whatever we can to make sure that they can partner with us, help us smoke out the problems.
Then we build a dashboard that I can monitor.
Then the laggards, the last 10%, usually have a very good reason for why they're the last 10%.
They were under the gun for something else.
There was some other complicated, tricky thing. Then we can push the herd over the finish line and then help the teams that there's something tricky or complicated about it.
Guys, thank you so much for joining me on today's session. This was so much fun.
We are bang out of time, but it was such a great treat to talk to you both and Tim, especially to have you join us and talk about some of the details of how you implement all this.
That was really great. Thanks very much, everybody, for watching.
We'll see you again next week on Latest from Product and Engineering.
Bye, everyone. Thank you.