Story Time

Name: Story Time
Uploaded: 2020-06-15T09:00:00.000Z
Duration: 30 min 4 s
Description: Join Cloudflare CTO John Graham-Cumming as he interviews Cloudflare engineers and they discuss a "war story" of a problem that needed to be solved — and how they did it.

Presented by: John Graham-Cumming

Originally aired on June 15, 2020 @ 5:00 AM - 5:30 AM EDT



Transcript (Beta)

Hello Marek. You and I have worked together for quite a long time. I was obviously at Cloudflare before you were, but you turned up in the London office and worked on various things. How did you end up at Cloudflare? I had a gap year. I finished my previous job and then I was looking for exciting opportunities. And at the time, which was about 2013, I was really interested in security companies. I thought the world is going to be more and more integrated. There's going to be more computers. So maybe I should switch my career from being a usual, normal programmer and do something that could help improve security. And unfortunately, at least from what I found out, the security market wasn't that interesting, the pure security. There are generally two types of companies I was looking at. One is basically pentests or consultancy, which is very good. It works for some people, but it gets tedious after some point. And even if you are the best consultant in some area, you can only deliver so much expertise to one company, maybe two companies. So it basically doesn't scale. And then there's the second part of the market, which is the product companies. There are plenty of very interesting product companies that do very interesting security things. But again, they tend to do either snake oil things, so they don't deliver actual things, or they end up doing something which is very, very, very specific. Static code analysis is a good example. It's a very serious product, good product, but again, how much software can you improve by doing the hard work? So I stumbled upon Cloudflare, and this was exactly the right mix for me: networking, which I was very interested in before, and doing things at scale and actually meaningfully improving the Internet ecosystem. When you started at Cloudflare, we were doing, and we still do, although we have to do it virtually right now, this idea of orientation. 
We originally sent everybody to the San Francisco office. Was there anything interesting that happened during your orientation when you went to San Francisco? It was very early on, so our orientation was barely bootstrapped. We had a couple of lectures, but they were very rudimentary. So it was not really a real orientation like we have these days. Right. But I do remember that just after the orientation, there was the team retreat, which we are doing every year. I remember vividly seeing an SRE, an operations person, sitting in the back on the floor in the corner of a big conference room. There were about, I think, 70 or maybe 80, maybe 100 Cloudflare employees at that point. I remember an SRE sitting in the corner, looking at the screen and typing something vividly on the computer. I was thinking, he must be doing something interesting. It's probably important. What is he doing? It turned out he was doing, I think it was Ryan doing attack handling. What was he doing? Yeah, he was doing attack handling. But what was he actually doing, sitting at that machine? Exactly. Exactly. What could he possibly do? Back then, we had very rudimentary authentication tools. I think he was deploying some router configuration changes in response to some network events. It was very, very, very obscure to me. I didn't understand at all what they were talking about. Yeah. I mean, very early on, a lot of the DDoS stuff that was being done was sort of pseudo manual, right? We discovered someone was getting attacked, and then we would decide, on the router, to actually just kill that IP address. And of course, the IP addresses were shared with other customers, as they still are. And so that would have caused collateral damage. So because we ran the DNS, we could change the IP address of every customer who was potentially being attacked. And that was called scattering, which is spreading the customers around. 
And then the excitement was always whether the DDoS attacker spotted that we did that. And there was this chat room where people would say that the attack is following, which meant they had figured out that we changed the IP address. And so, yeah, it was a lot of manual fiddling around. But you didn't actually attack that problem right at the beginning, right? You did other stuff in London. Yes. So early on, I was put on the DNS team, which was me and Ray, the original author of our DNS software. So I was helping Ray. And early on, I was looking at what I could potentially contribute. So I noticed that our server was not very stable. I think every Friday, 5 p.m. London time, it was not working perfectly. So I was trying to peek here and there, like what could I do? We ran into the usual early Golang programming troubles. So the first issue I fixed, I think, was we were starting a new goroutine every single time a new request came in. And that worked fine if you have like 20 concurrent requests, maybe 100. But if you have 10,000, of course, the number of concurrent goroutines is just exploding and the whole system becomes unresponsive. I think I fixed that first. And then since I was at this code, I was looking around, it's like, hey, this packet parsing, that's completely ridiculous. Why do you copy this data 10 times before you actually process it? That's not right. It could be improved. It could be made better. So I improved that code. And again, this was very normal Golang engineering, like reusing memory, GC, making sure the data is not copied, using slices correctly, the basic things. So I did this code change. And I remember it was deployed. I think Dane did the deploys back then. Then it was reverted like five minutes later because all the systems broke. I was terrified. I was a junior employee sitting there. By God, it's impossible. It broke things. How could it come? I was just fixing, I was just making it faster. Like, why possibly could it be? 
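The goroutine-per-request problem mentioned above is usually addressed with a fixed pool of workers reading from a bounded channel. The following is only a minimal sketch of that general pattern, not the actual fix in Cloudflare's DNS server; `handle`, `serve`, and the worker count are hypothetical names and values chosen for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// handle stands in for real request processing (hypothetical).
func handle(req []byte) int { return len(req) }

// serve processes requests with a fixed number of workers instead of
// starting one goroutine per request. It returns the total bytes handled.
func serve(requests [][]byte, workers int) int {
	// Bounded channel: the producer blocks when all workers are busy,
	// so the number of in-flight requests cannot grow without limit.
	jobs := make(chan []byte, workers)

	var wg sync.WaitGroup
	var mu sync.Mutex
	total := 0

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for req := range jobs {
				n := handle(req)
				mu.Lock()
				total += n
				mu.Unlock()
			}
		}()
	}

	for _, r := range requests {
		jobs <- r
	}
	close(jobs)
	wg.Wait()
	return total
}

func main() {
	reqs := make([][]byte, 10000)
	for i := range reqs {
		reqs[i] = []byte("example.com")
	}
	// 10,000 requests are handled by 8 goroutines, not 10,000.
	fmt.Println(serve(reqs, 8))
}
```

With this shape, a burst of 10,000 requests waits in line rather than creating 10,000 concurrent goroutines.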
And how did you break the system? I mean, all of us, early on, all of us were breaking DNS pretty regularly. I know I broke it by multiplying something by time.Millisecond. How did you break it? So in the case of the problem I introduced, it was a very subtle design issue or design feature of our system. So because Golang doesn't do socket multiplexing, you cannot read from multiple receive queues on Linux in one goroutine. So we had many IP addresses and we decided not to use wildcards, so ANY, so 0.0.0.0, not to bind port 53 for all IPs, but just bind for every single IP that we are handling. And because we did that, we had hundreds of goroutines that were just idling, that were just listening for new packets and just spinning around, receiving packets from the network and putting them on a shared work queue. So by making the code that was parsing data faster, I made it able to push more data onto the shared work queue and this overloaded the whole system. So it turned out that the slowness of each single one of those goroutines was actually an implicit rate limit. By removing this rate limit, the whole system went berserk. I think this is a fairly common pattern. By improving a single item of a complex distributed system, you are exposing the next layer and if you don't realize what the problem is, like that the speed could be a problem, then the whole system breaks. Well, especially at the packet rates that we operate at, I think that's one of the things, which is that we get hit by a very high packet rate when there's a DDoS and it really strains every bit of the software stack. That's true. So at the end, I improved my system. So my code change to speed up the packet processing went in, but we had to introduce this rate limit explicitly. And this is what we are doing in many systems these days. We have explicit rate limits to make sure that even high network usage will not overwhelm either a whole machine or a process. Just go back in time a little bit. 
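One common way to make such a rate limit explicit is a token bucket placed in front of the shared work queue. The sketch below is an illustration only; the transcript does not say which algorithm was used, and `TokenBucket`, `enqueue`, `countAccepted`, and the rate and burst values are all assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// TokenBucket is a minimal token bucket for use from a single goroutine.
type TokenBucket struct {
	rate   float64 // tokens added per second
	burst  float64 // maximum number of stored tokens
	tokens float64
	last   time.Time
}

func NewTokenBucket(rate, burst float64, now time.Time) *TokenBucket {
	return &TokenBucket{rate: rate, burst: burst, tokens: burst, last: now}
}

// Allow reports whether one packet may pass at time now.
func (b *TokenBucket) Allow(now time.Time) bool {
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens += elapsed * b.rate
		if b.tokens > b.burst {
			b.tokens = b.burst
		}
		b.last = now
	}
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

// enqueue pushes pkt onto the shared queue only if the limiter allows it
// and the queue has room; otherwise the packet is dropped.
func enqueue(q chan []byte, b *TokenBucket, pkt []byte, now time.Time) bool {
	if !b.Allow(now) {
		return false
	}
	select {
	case q <- pkt:
		return true
	default:
		return false
	}
}

// countAccepted simulates n packets arriving at the same instant.
func countAccepted(n int, rate, burst float64) int {
	start := time.Unix(0, 0)
	b := NewTokenBucket(rate, burst, start)
	q := make(chan []byte, n)
	accepted := 0
	for i := 0; i < n; i++ {
		if enqueue(q, b, []byte{0}, start) {
			accepted++
		}
	}
	return accepted
}

func main() {
	// Only the burst allowance gets through when everything arrives at once.
	fmt.Println("accepted:", countAccepted(1000, 100, 5))
}
```

The point of the design is that the limit is a visible, tunable number rather than a side effect of slow parsing code.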
If a DDoS attack happened back then, what did it typically look like on a daily basis? So I wasn't in the operations team, so of course that was not my job, but I was looking at them very carefully. So the SREs were constantly looking at multiple charts. We had multiple dashboards that were doing various things. And they also had plenty of command line tools. So a typical day would be an SRE looking at the usual systems, opening plenty of consoles, and then looking at Nagios. Nagios was our alert system of choice back then. And then when something happened, the SRE would most likely have to log into a specific server to figure out what was happening. Is it a problem with the machine? Maybe there was some other network event. Maybe there's some amplification attack going to some other port address. Or is it just a DNS attack? Or is it just normal behavior, because our servers were very often quite normally overloaded? But sometimes this whole system broke apart because the operations person, the SRE, couldn't even log into the server because it was so busy. So then they were basically without any tools. It was very elementary. Right, although in the beginning there was some design in it to give us some robustness. It wasn't just, well, there's a server connected to the Internet. There was some thinking through about how we might make it robust. The original architecture of various components was brilliant. Sometimes you feel that you see the genius only 10 years later. This was one of those choices. A good example is the DNS system design. So as you mentioned, while the IPs, the name server IPs, are shared among many customers, we tried to distribute the pairs of name servers that we give to our customers in a way that they don't overlap. 
So that we could have some guarantee that even if we have to kill a whole IP address, remove the IP address of one of our name servers from the Internet, as long as we don't remove more than two, only one customer should be affected. This was very elementary, but it was very much good enough for those days. And it gave us some guarantees. But that's not bulletproof, right? It's definitely not bulletproof. So the SREs were flying blind. So since I had my hands dirty with the DNS codebase, I decided, can I help them? Maybe, you know, at least for the DNS attacks, maybe I can improve the system a bit. So we developed a heavy hitters, let's call it a chart. It was actually a very simple tool that looked like top, like the usual console top, which showed you basically what was happening on the system. So then the SRE was able to log into a machine, run this tool. And if the machine was responsive, the tool would tell them, okay, there is a DNS attack on this particular thing, and this is how you can mitigate it. But unfortunately, SREs didn't yet have the tooling to actually act on that. I know. So we were, if I remember well, we had a couple of things. We had that funny thing you hooked up in the office with the LEDs that lit up depending on the CPU utilization around the world. So you could kind of tell how loaded Cloudflare was. But the big problem was interrupt storms, right, on machines. Absolutely. So that's one of the issues when you cannot even log into the server. And this happens when there are just too many packets for the Linux kernel to handle. And early on, we didn't have enough tooling for that. So our only tooling was router-based firewall rules or null routing, which was very rudimentary. And we tried to fix it. So my thinking was, if the packets could potentially reach the server, if there isn't a networking problem in front, if the network isn't congested, there isn't really any reason to drop those packets. They reach the server, so we can process them. 
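The heavy hitters view described above boils down to counting traffic per key and showing the largest counters first. What follows is a rough sketch of that idea, assuming the key is a queried domain name; the real tool's inputs and output format are not described in the transcript, and `Hitter` and `TopN` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// Hitter is one row of the top-like display.
type Hitter struct {
	Key   string
	Count int
}

// TopN returns the n most frequent keys. Ties are broken alphabetically
// so the output is deterministic.
func TopN(keys []string, n int) []Hitter {
	counts := make(map[string]int)
	for _, k := range keys {
		counts[k]++
	}
	out := make([]Hitter, 0, len(counts))
	for k, c := range counts {
		out = append(out, Hitter{Key: k, Count: c})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Count != out[j].Count {
			return out[i].Count > out[j].Count
		}
		return out[i].Key < out[j].Key
	})
	if n < len(out) {
		out = out[:n]
	}
	return out
}

func main() {
	// Sampled query names; in practice these would come from live traffic.
	sample := []string{
		"victim.example", "a.example", "victim.example",
		"b.example", "victim.example", "a.example",
	}
	for _, h := range TopN(sample, 2) {
		fmt.Printf("%-16s %d\n", h.Key, h.Count)
	}
}
```

An exact map is fine for a sample this small; at high packet rates a sampling or sketch-based counter would likely be needed instead.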
So it took us a while. It took us a number of iterations. And at the end, we developed the system to be able to skip the Linux kernel, do a kernel bypass in order to handle load in those specific cases. But we still kept the normal Linux kernel in all the other situations. So it's not like we are always offloading stuff. We're offloading only on demand when that is really, really necessary. And this is one of the interesting things, right, because you came at this from the DNS world where we were seeing, obviously, DNS attacks. But there are all sorts of other kinds of high packet rate attacks. And if I remember well, we did some really ugly stuff really early on with, well, all the attack packets are this long. Let's drop packets of that length. This was exactly, this was very early on. This was before we had the programmable tooling on our servers, when we only had the very limited vocabulary that our router firewalls could express. One of the things we found was very efficient, effective, was instead of just killing the whole IP address, we could figure out, okay, the attack is exactly 55 bytes long. All the packets are 55 bytes. So, you know, the router, you don't have to drop everything. You can drop only those 55-byte packets. Or it might be 55 bytes to this particular IP address and that kind of stuff. And then we could just... Exactly. But this, even though it sounds like reasonable engineering, we can do better attack mitigation. Let's do it. It was very error prone. It turns out, if you're an SRE typing stuff on the router under stress because the systems are breaking, you really don't want this fine accuracy. You really want to say, can the problem please go away as quickly as possible. We had some famous mistakes by SREs typing slightly new commands that weren't really supported by our routers. And then, if you don't have all the same routers everywhere, then the commands look a bit different, and then you forget. It becomes a big mess. 
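The length-based rule described above, optionally narrowed to one destination address, can be modeled as a very small match function. This is a sketch only; `DropRule`, `newRule`, and `drops` are hypothetical helpers and do not correspond to any router's configuration syntax. The addresses are documentation examples.

```go
package main

import (
	"fmt"
	"net/netip"
)

// DropRule is a hypothetical stand-in for a router firewall rule.
type DropRule struct {
	Length int        // exact packet length in bytes
	Dst    netip.Addr // zero value means "any destination"
}

// newRule builds a rule; an empty dst matches every destination.
func newRule(length int, dst string) DropRule {
	r := DropRule{Length: length}
	if dst != "" {
		r.Dst = netip.MustParseAddr(dst)
	}
	return r
}

// drops reports whether a packet of the given length and destination
// would be dropped by the rule.
func drops(r DropRule, length int, dst string) bool {
	if length != r.Length {
		return false
	}
	if !r.Dst.IsValid() {
		return true
	}
	return r.Dst == netip.MustParseAddr(dst)
}

func main() {
	anyDst := newRule(55, "")
	oneDst := newRule(55, "192.0.2.1")

	fmt.Println(drops(anyDst, 55, "192.0.2.7")) // length matches, any destination
	fmt.Println(drops(oneDst, 55, "192.0.2.7")) // length matches, wrong destination
	fmt.Println(drops(oneDst, 55, "192.0.2.1")) // length and destination match
	fmt.Println(drops(oneDst, 56, "192.0.2.1")) // wrong length
}
```

The narrower the rule, the less collateral damage, but as the conversation notes, narrow rules are also easier to get wrong when typed by hand under stress.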
I just did this in London. You say, hey, I've got a grandiose fashion. I can hear you, John. Did I break up? I think so. All right. Do you hear me now? I can hear you well. Okay. I don't know what happened there. Something with a bad cable or something. I was saying, I distinctly remember you coming to me and saying, hey, I can solve the DDoS problem. Give me time to work on it. So, how were you planning to solve it? So, it was many stages, but after the early DNS saga, when we discovered I can do tooling that will show SREs things that are happening on networks, so give them more visibility, and then SREs could use that to make their job easier, we could repeat. Then after that, it was obvious that the routers were not good enough. So, then we did the offload part, when we were able to express more vocabulary on our servers to mitigate attacks more and more precisely, and this is what went in tandem. The more we could detect, the more we could mitigate, and it had to be kind of coupled, because otherwise it makes no sense. There's no sense in me reporting, oh, there is this very specific attack pattern, when you cannot react to that. So, it was obvious that we could repeat the same thing for other types of attacks. So, we repeated this for SYN packets, so the very common Internet problem of SYN floods, and then we also repeated that for part of the HTTP attacks, and nowadays, there are multiple systems doing all manner of things, but the general framework remains the same, which is increase visibility. As you increase visibility, it's obvious you need to increase the mitigation vocabulary, what you can mitigate, and then we kind of join them together. So, I think that particular conversation in the lobby of the office was about the middle part, about the automation part. It took me a really, really long time to figure out how that should work. 
So, while the visibility and the mitigations are fairly obvious, how do you join them together? Even though it might sound trivial, it was very, very hard. So, early on, this top tool that was showing you what attacks were happening, I designed it in a way that SREs could copy and paste stuff. I remember this. It was a little Python program, wasn't it? Early on, it was a Python program, but it was really literally like copy and paste from the top into the SRE console, and people were asking me, like, you know, this could be automated. Yes, but there are plenty of corner cases which, of course, are obvious to a human operator, which are not obvious to a machine. A good example is, say we have some rate limits based on whether the customer has a free plan or a paid plan. We could have that. We don't, but we could. So, what happens if the customer upgrades a plan during the attack? Do we remove the rule? Do we create a new rule? Like, how should the system behave? So, all this complex business logic had to be expressed in a way, and this is the GateBot part. This is the automation part, which I think is probably one of the most complex pieces of the whole machinery. Right, right, and I think early on, also, there was a little bit of concern about letting this thing go fully automatic, and it might shut down Cloudflare because it might decide something needed to be mitigated and switch us off or something. Absolutely. Everyone, including myself, was terrified. So, we did the rollout very slowly, but, yes, we did have a couple of mistakes. We did, right? We did a couple of times mitigate ourselves, right, and say, whoops. Yeah, there were multiple things. Some of them were predictable or just mistakes on our part, but some of them were completely like we just didn't know that we didn't know. 
A good example is we had a sanitizer written in BPF that was validating inbound DNS packets, and once we decided that the rate of packets is too high, we rolled in the sanitizer saying, no, this doesn't look like a DNS packet, so maybe we shouldn't give it to our application because it will not be processed anyway, and I was reading the RFCs. I was trying to figure out which bits must be set, which must be cleared, and I remember that there was one bit which was obviously not a valid bit that was supposed to go to our server. It's in the packet header, but it turned out that there exists some weird software that was only sending this stuff, and this software was used for monitoring by our customers, so once we deployed this mitigation, it did not affect real traffic at all. It allowed all the valid packets to go through, but the detection software that people were using was like shouting, no, Cloudflare servers are down, and there is something wrong. I remember this. It's a good example of why you should use real systems to do the absolute worst. Well, that's a good point, right, which is that the RFCs are one thing, but once you get out onto the real Internet, you see how the real Internet is actually implemented, and you find all the quirks, and at Cloudflare scale, the quirks really show up, right? Absolutely. This is one of the reasons why I have a strong opinion that mitigation should not be enabled by default. If we can handle the packet load in our applications or in our Linux kernel or whatever dimension you're using, that's fine. You should do that. You should only react to the systems. You should do the active mitigation only when this is absolutely necessary because there are just so many quirks that you basically cannot predict. Exactly. Now, one thing that's interesting, you've sort of glossed over this a little bit, is that most of what we're doing uses standard stuff, right? 
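A header sanitizer of the kind described above can be approximated with a few bit checks on the 12-byte DNS header defined in RFC 1035. The sketch below is written in Go for readability, not BPF, and it is not the actual sanitizer. The transcript does not say which bit caused the monitoring false alarms; the reserved Z bit is used here purely as an assumption, and `looksLikeQuery` and its `rejectZ` switch are hypothetical.

```go
package main

import "fmt"

// looksLikeQuery applies a few header checks to a raw DNS message.
// rejectZ controls whether a set reserved Z bit causes rejection.
func looksLikeQuery(msg []byte, rejectZ bool) bool {
	if len(msg) < 12 { // the DNS header is 12 bytes long
		return false
	}
	if msg[2]&0x80 != 0 { // QR bit set: this is a response, not a query
		return false
	}
	if msg[2]&0x78 != 0 { // opcode is not 0 (standard QUERY)
		return false
	}
	if rejectZ && msg[3]&0x40 != 0 { // reserved Z bit is set
		return false
	}
	qdcount := int(msg[4])<<8 | int(msg[5])
	return qdcount == 1 // exactly one question
}

func main() {
	// ID 0x1234, RD set, one question, no other records.
	query := []byte{0x12, 0x34, 0x01, 0x00, 0, 1, 0, 0, 0, 0, 0, 0}

	// The same header with the reserved Z bit set, as some unusual
	// client software might send it.
	odd := []byte{0x12, 0x34, 0x01, 0x40, 0, 1, 0, 0, 0, 0, 0, 0}

	fmt.Println(looksLikeQuery(query, true)) // ordinary query
	fmt.Println(looksLikeQuery(odd, true))   // strict check
	fmt.Println(looksLikeQuery(odd, false))  // lenient check
}
```

The `rejectZ` switch illustrates the trade-off from the story: a check that is correct according to the RFC can still reject traffic from real, if unusual, clients.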
Initially, we were using the routers to filter stuff out, and then we started using a lot of functionality that's built into Linux. Can you just take us through sort of the key features of Linux that we actually use? These days, we use everything, I would say. So, we are using basically everything, the whole stack, all the complexities. I think the most interesting story is the BPF story. So, we started with no BPF whatsoever, very simple iptables back in 2013. Then we used the iptables BPF module called xt_bpf, which allowed us to push some more complex logic into the firewall rules themselves. Since BPF is not a Turing-complete language, you can have some guarantees that you can process packets very, very quickly. Just in case somebody's not familiar with what we did with BPF, I mean, the idea was that within iptables, which is very good at routing and dropping and logging packets, we could actually build a little bit of code that would allow us to recognize a particular packet in a way that iptables itself doesn't have the language to express. Precisely. Another example is what we mentioned before, which is the packet sanitizer. So, you can write fairly sophisticated logic in this BPF bytecode saying only those packets should come in. You cannot express everything, but it's good enough for most of the packet validation, basic packet validation. So, that wasn't good enough for our needs because sometimes we just get too many packets, I don't know, more than two or three million packets per second per machine, which was making our machines very unhappy. So, to handle that, we did the kernel bypass thing, but I don't know if you remember, but it was also BPF. So, we just implemented the BPF runtime in another system, which was managed not by iptables, but by us. It was basically offload of the offload. So, yep. Nowadays, we have many other layers of BPF. We are using XDP on inbound for two or three use cases now. 
We are using TC, traffic control, a very, very, very obscure Linux utility, again with BPF. We are still using the iptables BPF, and we are using BPF in the sockets to filter closer to the application. So, even though it's all BPF, it's slightly different, owned by different teams doing slightly different jobs. And one interesting thing is how do we know what a DDoS attack is? And part of that is by sampling, so we can look at rate, but sort of the fun thing is there's these signatures, aren't there, of known attack tools? Absolutely. So, there isn't any definition of the attack. An attack is something malicious, but how do you know if it's a malicious thing or just a misconfiguration? There's no way of telling. So, how we define the attack is if the rate is just able to make our servers not happy. So, attack is anything. It could be a valid traffic spike, but if the rate is too much, then we need to react. There is a must. There is this urge. We just have to do something. And indeed, in our systems these days, we have a couple of well-known patterns, like the Mirai botnet has a very famously researched pattern. You can easily tell if the packet is coming from Mirai or not. But then we have a couple of open-ended boxes, which is like we think it is most likely an attack, but it could capture a bit of real traffic. It's still collateral damage, but it's better than letting our system die. Yep. And over time, we've gone up the stack. So, you're no longer the DDoS supremo at Cloudflare. We have an entire team working on this. But we went up beyond this kind of layer three, layer four into the HTTP level as well, similar kind of techniques. Absolutely. The current team is doing an amazing job on the higher levels, on layer seven. There's plenty of new developments we are doing, very interesting things around QUIC and around TLS. Yes, absolutely. Yeah, and we are now fingerprinting the whole layers, every layer through this thing, looking for attack types. 
And I think one of the interesting things we've done, the stuff you did originally was looking at very, very high volume. What the DDoS team has been doing now is looking at, does this high volume actually affect the origin server? So, they're actually looking for the back pressure coming from the server. So, oh, the server, the customer server has started serving up 500 errors. So, it must be under heavy load, and therefore, we'll use that signal. So, now we're actually going in the other direction and using the signal through the system. Yes. I did the easy part. In my case, early days, it was obvious whether something was affecting us or not. The signal was clear. Like, our server is dead. We have to react. Nowadays, the team has a much harder job of trying to figure out whether the origin is unhealthy because of the rate of packets going through us or for some other reason. But I think the current product is very good. I think it works for pretty much everyone. Well, I mean, that's, I think, one of the reasons why we gave it away, right? This is part of the unmetered mitigation product: we got so good at this that it was like, well, everyone should have this, and everyone deserves protection. I think that really was something we were able to do because of all this work. You've written a lot about this on the blog, and so I think anyone who's interested in understanding what we did, how GateBot works, a lot of the stuff to do with BPF, we've open-sourced pretty much everything, haven't we? Yes. So, we open-sourced all the interesting bits. So, the BPF part is open-sourced. It is actually reasonably maintained. There are multiple layers to that. So, there is still some glue needed, but the glue is always company-specific. So, there is really not much point in us publishing how we distribute iptables to our servers. Whatever you do, it's going to be fine. 
So, yes, there are a couple of parts which are closed, but the majority of the interesting bits are open. We didn't open-source the logic part, the GateBot part, but we did open-source the framework which is under it, and we did speak about how it actually works, what it does inside. So, we are trying to be as open as we can. And we're now a maintainer of the Linux BPF stuff as well, right? Yes, we went so deep into BPF that, yeah, we have two developers, Jakub and Lorenz, who are working on a very specific field in the BPF ecosystem which we think we can use to our advantage. If we can make it work, it will be amazing for our use cases. So, yes, they are maintaining the sockmap data structure inside BPF. This is a very interesting subject. And, you know, just talking about this in 30 minutes, it sounds like this was all smooth sailing over a long period, but I seem to remember something with Shakespeare's Globe in London. What was that all about? Yeah, it was, again, earlier. We had just discovered BPF in iptables, and the tooling was there, but it wasn't really tested by SREs. So, they weren't comfortable using that. And I remember there was a very big attack happening, and I was at Shakespeare's Globe watching, I don't remember which one, a Shakespeare drama with plenty of blood flying around. And here I have a call from my manager saying, you know, can you help us with deploying those rules? So, yeah, I had to skip the second part. Was I your manager at that point? Did I call you? I don't remember, really. I don't remember. But, yes, I had to skip the second part, which was okay, because I'm not very good at blood. So, I was sitting on a bench somewhere behind, typing on a computer, helping with the attack handling. Well, that's fantastic. I mean, anyone who wants to, you know, read more about this should go and look at the BPF stuff on the blog. And, you know, thanks very much for, you know, telling us. Bye. 
Thanks for having me.