Story Time

Name: Story Time
Uploaded: 2020-07-21T13:30:00.000Z
Duration: 30 min
Description: Join Cloudflare CTO John Graham-Cumming as he interviews Cloudflare engineers and they discuss a "war story" of a problem that needed to be solved — and how they did it. This week's guest: Richard Boulton, Engineering Manager at Cloudflare.

Presented by: John Graham-Cumming, Richard Boulton

Originally aired on July 21, 2020 @ 9:30 AM - 10:00 AM EDT

Interviews

Transcript (Beta)

Okay, well, welcome to Story Time. I'm John Graham-Cumming, Cloudflare's CTO, and I have a guest this week. It is Richard Boulton. Richard is in the UK somewhere, and he's the engineering manager for something called FL in Cloudflare. And we're going to dive into what FL is. If you have ever visited a Cloudflare website or made an API call that goes through Cloudflare, then you've gone through code that he's responsible for. So Richard, welcome. Hey there, John. Nice to see you again. About a year ago, we used to have desks very near each other, then John moved, so it's nice to see you again. Yes, it's nice to see you. It used to literally be that I could look down and see the team, because I always used to stand up and everyone else was sitting in London. So why don't you tell us what FL is? Absolutely. So as you said, it's the part of the system that every request handled by Cloudflare goes through. So FL stands for Frontline, which is a clue that certainly at one point, it was the first sort of application service that requests go through. Actually, we've separated out part of that code now, so there's a separate process which handles TLS termination, which happens before FL now. And it's the process which has most of the complex routing logic and various bits of processing logic. So pretty much every product that Cloudflare has built, apart from those which operate at different layers of the stack, has some code in FL. It might be something to do with configuration loading. It might be to do with rewriting requests. It's where we do email obfuscation. We do all sorts of things. And it's a system which is built on NGINX, which is the web server that it runs in, with configuration and lots of custom logic written in Lua, which is a programming language which allows fairly safe modification of code without having to worry too much about memory safety and things like that. 
And it's also pretty efficient because it has a reasonably good JIT, which lets us write code quickly, which runs quickly. So it's a very interesting system. And when I joined Cloudflare two and a bit years ago, I came in as the engineering manager responsible for that. I joined Cloudflare, and you have an extremely interesting orientation session, which takes a week. You're given a deluge of information about the company and how it fits together. And then suddenly I found I have the responsibility for this service, and it's dealing with vast numbers of requests a second. And it's got an awful lot of complexity in it. But it's not just that it's a complex service, it's that whenever someone is wanting to make a change, that service probably needs to have changes made. So I think one of the things I thought was very interesting to talk about was how do you get up to speed in a system like that? How do you understand how you can be helpful as a manager? So the role of a manager is not to tell everyone what to do. It's to try and make sure that a team of trusted, experienced engineers can build a system, can maintain and run a system and keep it safe. So as you come in, the first thing I sort of try to do is work out what is it that I'm actually responsible for? Which sounds like it should be a straightforward question, but in a company which is as dynamic and changing as Cloudflare, it's never a simple question. Well, so I mean, I think one of the things that's significant is if you think about going back in the history of Cloudflare, so when I joined, FL already existed with a bunch of other stuff around it. Actually, when I joined, we used PHP instead of Lua. And one of the big things that was worked on very early on when I was at the company was moving from PHP to Lua. And we kind of had the luxury that we had tens of thousands of customers at that point. And now we're 27 million domain names, zones. 
So you obviously took on something which was a much more tricky system to modify at the time. But one of the things I think is interesting about FL, just in the history of the company, is that because it was so central to everything, everybody did stuff to it. And there was sort of an assumption that you could go in and modify it. And that was kind of okay at the beginning because the WAF needed to do something, or there was some business logic that was needed. And I think it's hard for people to grok exactly how fundamental this FL thing is to the whole company. When a request comes in for a customer, it has to figure out the configuration for that customer and then apply that configuration. And that configuration could be image resizing, it could be a redirect. There's all manner of stuff that is in there. And so, yeah, you're right in the heart of it. The DNS team obviously is doing sophisticated stuff with DNS, but they're sort of out of the game once the DNS query has happened. You're there trying to figure out exactly what configuration to run, load balancing, all this stuff is going on. So what was it like getting up to speed on that system that was live and running, when you didn't have the luxury that I had almost nine years ago of sort of screwing up more than you're allowed to now? So there's definitely no safety net where, if I make a change and I break things, it's fine. So it would have been incredibly hard if it wasn't for having a team that knew roughly what they were doing. They knew how to make changes. They knew how to make changes and deploy them in a safe way. What they needed help with was how do you actually make things better rather than just keep your head above water? So, I think this is the way of thinking about it: you have a stream of changes coming in from 20 different teams in the company. You have to make sure that those changes are safe. 
And that means you have to put some kind of process, some kind of review system in place, some kind of quality control system. And that doesn't mean you say everyone has to have a quality control process in the gateway. It means you have to make sure that you think about what are we trying to build? What are the principles we're following when we're building the system? And one of the exercises we did actually after quite a few months was trying to work out what is the architecture of the system? And if you're someone who writes packaged software, which is something I did for a while, you probably have a defined architecture for the system you're trying to produce. And then you produce code to that. And it won't quite meet the architecture, you'll change it, but you'll come up with something which is roughly the shape you thought it would be when you started working. The way I think of FL is it's an evolved system. It's something which has changed over eight, 10 years to meet what the needs of its ecosystem are. As I said, we split off sections, such as TLS termination, which is handled somewhere else. And there've been lots and lots of changes to what the responsibilities of FL are, things separating, things coming back together again. At any time, FL has met the needs of its ecosystem. And you can look at a system like that and say, well, I can see this bit is a mess. I want to rewrite that bit, but it's probably doing a whole lot of things you don't know it's doing. You can easily write down the top level requirements, but there are so many more hidden requirements that you can't see that the system is meeting. So rewriting is not the way to make a system easier to work on. But there is a process of trying to work out what is the vision for how this system could be easier to work on? How could it be safer to operate? How do we make it more visible what's going on inside it? 
One of the efforts which had started when I joined, but which we accelerated, was splitting FL into modules so that the code has defined pieces which can be deployed independently. And there's some kind of clear interface between each of those different modules. That's something that you can do without breaking the system. You define the boundaries and then you measure what's going through them. And you can evolve a system from being one system into a modular system. There was a phase of tech enthusiasm, maybe 10 years ago, where microservices were the big new thing. You don't have to put a process boundary in place, though you can, but it's always important to think about what are the encapsulation boundaries? What are the boundaries of responsibilities of the system? And that's something which has got markedly better over the two years since I've been at Cloudflare: how modular FL is. I think it's interesting, given Conway's law about your software having the shape of your organization, that FL was fundamentally the shape of all of Cloudflare because everybody dabbled in it and added their feature into it. Now, two years in, you have a team which is really now shaping the software to be something that is manageable. I'm curious how you even test something like FL, which has got all of the business logic of Cloudflare, all of the weird features, all of the weird exceptions. How does that work? So you have the standard practices of testing, which is that you test at multiple different levels. And we had some great engineers who pushed a lot of this work through, which is to make sure that we actually do as much testing as possible in unit tests, at the smallest units of the code. Because if you're testing things at higher levels, it's harder to write the test, and it's much slower to test them. 
It's the standard answer for how you test something: you have the unit tests to test individual units, then you put pieces together and you test how those are working together. And we have a Python test suite, which simulates an environment for a test and will run a request through FL put together to check that it's behaving as we expect in a whole load of different configuration scenarios. And then there's another really important test system, which is testing that FL works in the ecosystem that it fits into. So your functional tests, these Python tests, are testing that FL works correctly in the ecosystem we expect it to have. You then have to check that it's working in the ecosystem it's actually in. So if something has changed elsewhere in the company, they may not think it affects FL, they may not think to update tests in FL. So you have to make sure that that's going to be detected. And we have a system called the Edge Acceptance Tests, which run in every location around the world and particularly in our testing areas. And they run a fairly large series of simulated end to end tests. So this basically says, for a site which has this configuration, does the entirety of Cloudflare work in the right way? And without those layers of testing, you have to do things manually, you will miss things. That's one of the really important parts of getting resilience in place. And you'll always miss things. So one of the things which is really important is getting the loop between what customers are experiencing and what the engineering teams are seeing, getting that loop closed. So we have members of the customer support team come to our stand up every day. They know what things are changing. They keep an eye on what problems they are seeing. If there are any new types of issues being reported by customers, they're able to very quickly alert us to those. 
Sometimes it's a change that may happen and it's having an impact, which doesn't mean people can't do things, but maybe people are reporting a slowdown or they're reporting some feature doesn't work, it's easy to work around, but it's not quite working in the way it should do. So it might not be things which would be classed as this is a major outage, but there's all sorts of problems which can happen that can be very hard to see what the cause is. So one of the other things we put in place with the assistance of various people was to get groups going to diagnose some of the long running issues that people were reporting that we hadn't really been able to address. So we got- When you say long running, you mean issues that have been outstanding for a really long time. I remember there were some bugs in the JIRA where it was like, we can't reproduce it, it's weird. Sometimes occasionally a weird thing happens, right? So we had all sorts of things. Like once a day, we might be able to see this happens in the logs, but we can't see why. You might see error codes happening, which we don't expect should ever happen. And by the way, just to put it in context for people, once a day for us, you've got to realize that we're running tens of millions of requests per second. So you're talking about, maybe one request out of a trillion, something goes wrong. So this is the real long tail. When you're working on FL, you don't really feel like you're working on a massively distributed system handling vast amounts of traffic because the way FL is architected is that each system is entirely independent. So you have a process running which was handling a few thousand requests and that's a large scale system, but it's not a massively complicated distributed system. 
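As a rough check of the arithmetic behind "once a day" being roughly one request in a trillion, here is a minimal sketch. It assumes a flat rate of 10 million requests per second, which is an illustrative round number only; the conversation says "tens of millions" without giving a figure.

```python
# Illustrative only: 10 million requests/second is an assumed round number,
# not a figure quoted in the interview.
REQUESTS_PER_SECOND = 10_000_000
SECONDS_PER_DAY = 24 * 60 * 60

requests_per_day = REQUESTS_PER_SECOND * SECONDS_PER_DAY

# An event seen once a day is therefore about one request in this many.
print(f"{requests_per_day:,} requests per day")
```

At that assumed rate a day contains a little under a trillion requests, so a once-a-day event is on the order of one in a trillion.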
So you can forget that actually any edge case that possibly could happen will happen and it will happen at a reasonable frequency, enough that you're gonna see it happen in your logs, enough that you're gonna see customers who are having problems with it because you have so many customers, it's gonna break someone. So the experience of how do you narrow that down is very interesting. We have actually a really good logging system which lets us get an idea of the status codes for exceptional cases and to dig in in some detail, but you're not gonna have enough detail to work out the sequence of events or the state of FL at the time something happened. So there's a lot of detective work can go in. If you have a reproducible bug, it's generally not too hard to fix it. If you have a one in a 10 million or one in a trillion event, which is still could cause a problem, then it's a bit different. One of the things we also did was look at what does FL do in error cases? So people write code and when you're writing code, it's easy to think about the happy path. What happens when everything comes together as it should do? To make a robust system, you have to think about the unhappy paths as well. If every step of something can go wrong in some way, when we're loading the configuration, it might be that we have a bad disk and it gives us an error or it gives us a timeout or it gives us potentially an inconsistent result. What's gonna happen to the system in that situation? What should happen to the system? So when Cloudflare was very new, where it was a system which was generally trying to protect people against DDoS attacks, as I understand it. I mean, John knows better than me. But in that situation, if you have a request coming to your edge and you can't quite load the configuration, what's the right thing to do? The right thing is probably to send that one request through because you're not trying to filter traffic out. 
You're trying to make sure that that traffic is handled and that excess floods are thrown away. If you then fast forward to a world where Cloudflare is doing a whole load of extra things, it's doing WAF protection, it's essentially applying a security check on requests before they go through. The default behavior can't be to let requests through. But because the system had evolved two and a half years ago, there were cases where it might let a request through if an error happened at just the wrong time. And an effort that we spent quite a bit of effort on was going through the code, making sure that if something goes wrong, we fail in the safe way. So that failures to load configuration means that request, that's gonna be blocked, but that's better than letting a potential security attack go through. So there's ways to look at the behavior of the system. The number of requests we're failing to serve isn't the only metric. Are we behaving safely? Fascinating. If I remember well, we also had some situations where one of the things that FL did a lot of was cache API results, and it keeps stuff in memory to run as fast as possible. But you're working on a shared system with 27 million different domains going through it. And if I remember well, there was some situations in odd situations where the cache would be out of sync or you'd get a stale thing served up to you and therefore you would get an error for a customer on one request and it would disappear because the cache would get flushed and stuff like that. So this was probably, there's an incident which happened within my first sort of two months of being at Cloudflare, which was the baptism of fire incident. So one of the worst things that can happen is that we serve traffic for one customer to the wrong domain. This is considered a massive priority incident if we ever find any sense of that happening. And it's only happened once in my time here and it happened to be two months after I'd started. 
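The "fail in the safe way" idea described above can be sketched in a few lines. This is a hypothetical illustration written in Python, not FL's actual code (FL is described as Lua running in NGINX); the names `handle_request`, `ConfigLoadError`, and `Decision` are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable


class ConfigLoadError(Exception):
    """Hypothetical error raised when a customer's configuration cannot be loaded."""


@dataclass
class Decision:
    allowed: bool
    reason: str


def handle_request(
    hostname: str,
    load_config: Callable[[str], dict],
    passes_security_checks: Callable[[dict], bool],
) -> Decision:
    """Decide whether a request may proceed, failing closed on any load error."""
    try:
        config = load_config(hostname)
    except (ConfigLoadError, OSError) as exc:
        # Fail closed: without the configuration we cannot know which security
        # rules apply, so blocking is safer than letting the request through.
        # (TimeoutError is a subclass of OSError, so timeouts land here too.)
        return Decision(False, f"config unavailable: {type(exc).__name__}")

    if not passes_security_checks(config):
        return Decision(False, "blocked by security check")
    return Decision(True, "ok")


def broken_loader(hostname: str) -> dict:
    # Simulates the "bad disk gives us a timeout" case from the discussion.
    raise TimeoutError("disk timeout")


print(handle_request("example.com", broken_loader, lambda cfg: True))
print(handle_request("example.com", lambda host: {"waf": True}, lambda cfg: True))
```

The design choice is the one made in the conversation: a failed configuration load costs one blocked request, which is preferable to passing a request that should have been filtered.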
So that was the fun of trying to work out how to resolve this. And when we finally got to understand exactly what was going on here, it turned out it was something like one in 10 million requests at our edge that were seeing this behavior. And it was because, in order to achieve the scale we needed, we have a system where we cache pieces of configuration. And in an event where you serve something from stale cache because, at the time when we refreshed the cache, we found we couldn't get the value again, in that situation we could get out of sync with one of the other layers of cache. And the way we debugged that was essentially that we spent a weekend trying to dig down in the code to work out what are the possible parts of the system where this could be happening. And we built big spreadsheets of here are all the possible things and eliminating possibilities. And eventually we got to, it has to be within this cache subsystem. And we managed to find a way to reproduce it by sending vast numbers of requests. And after 20 minutes, it would fail. So we had a reproduction, but not an easy reproduction. And that's good enough if you have really skilled engineers who are gonna patiently look through the code and say, so I now know how it could happen. That's enough to actually deal with it. That's how we debugged the problem. But it's something where once we had that, we actually looked at, one of the nice things about Cloudflare is we take incidents like that really seriously. We don't see what happened there and say, how do we avoid letting people know about that? We're really open about that. And if your mindset is, how are we gonna reassure people this is never gonna happen again? It gives you the space to say, so what do we do to change our system so that this is much safer? One of the things we looked at there was, should we have three different cache systems in place? 
If we want to avoid this kind of problem, we have to unify our code, have one cache system, test it to the ends of the earth to make sure that we know all the behaviors it's gonna have. So essentially simplify our code base, makes it more maintainable, but it requires not leaving pieces unfinished. So the point I made earlier about FL being an evolved system, that's absolutely true. One of the key bits there is that if you just do the minimum to meet those requirements, then you end up with lots of loose ends. You end up with three cache systems, you end up with lots of unusual code paths, which they normally won't be taken, and it's there to support a legacy case, but you haven't done the work to tidy it up. The really important thing is to stop and take the time to get on top of things. And we actually spent a good three months in my first year, we stopped doing any new development and just spent the time getting systems in order. So that was things like organizing cache and catching up with that evolved legacy. I think this is a really important point because I sometimes see people on like Hacker News say, I could build Cloudflare in a weekend, it's just this, this, this, this, this. And it's absolutely true, but making it work at scale safely with all the features, but there is really what you would describe as an evolved system there. And I think there was some research from many years ago by IBM about how bugs tend to cluster. Like if you have a bug, you'll probably have other ones very close to it or in similar systems. And I think that approach that you took all that time ago, which is like, okay, we had this problem, we need to sort of look at all of the possible ways in which this can go wrong so that we actually really know that completely is pretty important to get keeping this system super, super robust. I remember this really, really well because I remember the 20 minute reproduction thing, which is incredibly painful for any engineers. 
Like now you instrument something, you gotta wait another 20 minutes. And it's interesting how much it takes to make a system really work reliably at scale with all the things that can go wrong. And this is another thing we haven't really talked about which is that the Internet itself is incredibly heterogeneous. It's like easy to test something with curl, but then the world's hackers are gonna throw weird URLs at you. You're gonna install it in Cairo and you're gonna get a lot of stuff now in Unicode, in Arabic, for example. You're gonna get this incredible variety of stuff coming in and you're gonna talk to an incredible variety of backends and do them as well. And FL has to do all of that too. Yeah, so I think that's really the point about scale. Scale isn't about volume. Scale is about dealing with the complexity that comes with volume. So as you say, if you're dealing with all the different possible web clients at one end and all the different possible web servers at the other end, you can get any valid or remotely valid HTTP requests coming to you from both ends. You have to handle that, you have to sequence them to make sure that you behave in a sensible way that there are no edge cases. And that's the really hard part and the bit where if you're trying to build it from scratch, you can get something working in a weekend. To get something working correctly, that's years. Yeah, I remember one of the things we did years before you joined, which was that we obviously used NGINX on the backend as a caching server, right? For the sort of CDN part of Cloudflare's business. And you look up something in the cache with the URL and it actually gets hashed and the thing is found on disk and then gets loaded. And very early on, we had a situation where the wrong thing came back from disk. And you sort of think, surely that can't be possible. But in fact, something had happened with the hashing, which had made the hashes all zero and we were getting the same file. 
This is years, probably six or seven years ago now. And so we added in, within the app, we actually modified the cache on disk format so that the cache file has a reference to itself in it and another hash. So it actually loads up from disk and we go, is that actually the thing we asked for? Because you get these weird situations where something goes wrong and you get the wrong file in some way when you are at this scale. But coming at it head on, if somebody had told me to do that, I would have said, that doesn't sound right. This hash could never be... You assume that the happy path is gonna happen and nothing's gonna go wrong, but actually really weird stuff happens at scale. Yeah, absolutely. And we've had that in other places as well. We've had it in our logging system. So one of the things we've added is when we generate a log message and put it onto our message system, we generate a hash of it to ensure that we can see if there's any corruption there. That's something which we implemented in response to a suspicion that there was a problem there. And it turned out that check actually fired a good six months later and detected a problem and allowed us to quickly identify that we were putting a message onto the system in one place and it was coming out changed. And this was a system which we thought would never change the messages. Also, I used to work on search engine technology a lot. One of the things that the Lucene Search Engine Library added in the last decade as its path to becoming a really robust system was adding checksums all the way through it. So a database can't trust that what it puts on disk is gonna come back. You have to have checks at multiple levels. And this is the story of every level of computing as it gets robust and it has to start assuming that errors happen at the very extreme cases. And also assuming that things that can't happen, happen. Right, you know, the NGINX cache gives us the wrong file. How can that possibly happen?
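The self-checking cache file and the hashed log message described above follow the same pattern: store the key and a checksum alongside the payload, and verify both on the way back out. The sketch below is a hypothetical illustration of that pattern only; the header layout, the use of JSON and SHA-256, and the function names `write_entry` and `read_entry` are assumptions, not Cloudflare's or NGINX's actual on-disk format.

```python
import hashlib
import json


def write_entry(key: str, body: bytes) -> bytes:
    """Serialize a cache entry with a header naming its own key and body hash."""
    header = {"key": key, "sha256": hashlib.sha256(body).hexdigest()}
    # json.dumps escapes newlines, so the header always fits on a single line.
    return json.dumps(header).encode("utf-8") + b"\n" + body


def read_entry(requested_key: str, raw: bytes) -> bytes:
    """Return the body only if the entry is the one asked for and is intact."""
    header_line, _, body = raw.partition(b"\n")
    header = json.loads(header_line)
    if header["key"] != requested_key:
        # The lookup returned some other object's file (for example, colliding hashes).
        raise ValueError("cache entry belongs to a different key")
    if hashlib.sha256(body).hexdigest() != header["sha256"]:
        # The bytes changed somewhere between being written and being read.
        raise ValueError("cache entry body is corrupted")
    return body


raw = write_entry("https://example.com/a.css", b"body { color: red }\n")
print(read_entry("https://example.com/a.css", raw))
```

Asking for a different key against the same bytes, or flipping a byte in the body, raises `ValueError` instead of silently returning the wrong content.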
You can read the code and, you know, no way can that happen. But then you've made some assumption about something always being perfect and then it turns out it's not. You know, we only have a couple of minutes left. I wanted to ask you about something, which is that you came from GDS, the Government Digital Service, into Cloudflare. What was that transition like? I imagine that was an enormous change in terms of what you were doing and what was similar. So it's fascinating. So it probably deserves a whole session to itself, but- Come back next week. We should invite someone from GDS, even better. But so the GDS, for those who don't know, is the Government Digital Service in the UK. The idea is that government historically was very bad at computing. And part of that is because it doesn't have expertise in government. So the idea of GDS, which started in 2010, was get the expertise into government. Trust that you can have people in government who are going to actually deliver projects end to end. So I joined GDS in 2014 as an engineer to work on their search engine technology to make sure that people who come to the central government website, which covers, I think we aggregated something like 1,000 plus sites into a single site, so that when people come to a UK government system, they have a consistent experience. And what I was trying to do was make sure they could find the right part of the government site. And that was fascinating. I learned so much stuff there about user research, about how to understand what people need out of a system. And after a while, I found that my skills, the search engine skills were great and also useful, but actually what was really important was trying to build teams and get teams together. One of the mantras of GDS is the unit of delivery is the team. So I moved into a role which was more helping people to work together. And I ended up after four years there being in charge of the management of 150 engineers. 
And building teams, but very far from actually building production systems. And part of the reason I moved to Cloudflare was that I really wanted to get back into directly producing production systems, and scale, and all the exciting technology things that Cloudflare had to bring. You brought up something there which is really interesting, which is that you managed a very, very large team. And then went back to managing something a lot smaller. And I think a lot of people think of career progression as bigger and bigger and bigger and bigger. And I know that we have many other members of the management team, directors, who had previously run bigger things and are now running smaller things. And I came to Cloudflare and wrote code after having been VP of engineering and things. And I think it's interesting that people want to come and work on those things at what seems like a smaller scale at the team level, but the actual challenge is a bigger challenge. And I think that's true of many of my management colleagues. They have managed what looks on paper like a bigger thing in the past. It's probably had less complexity. It's probably had less impact. And I think, yeah, I think it's a false picture to say that management is a progression to manage bigger and bigger things. It's actually, are you learning? Are you contributing? Are you making things better? I had to learn a whole new set of skills to manage the team at Cloudflare. And it's been extremely rewarding. All right, well, on that note, we are out of time. Thank you so much. I think we could have had an hour quite easily on that. Richard, have a good afternoon in London. And I hope that one day I'll be able to get on an airplane and see you in person again. Bye. I look forward to it, bye.