Story Time

Name: Story Time
Uploaded: 2020-07-22T04:30:00.000Z
Duration: 30 min
Description: Join Cloudflare CTO John Graham-Cumming as he interviews Cloudflare engineers and they discuss a "war story" of a problem that needed to be solved — and how they did it.

Presented by: John Graham-Cumming, Andre Bluehs

Originally aired on June 24, 2020 @ 6:00 AM - 6:30 AM EDT

English

Interviews

Transcript (Beta)

All right, everybody, welcome to Story Time. This is my show where I interview typically someone from Cloudflare and we talk about something, building a piece of software, a debugging story, fixing something horribly broken. And today I've got Andre Bluehs, who is the manager of the WAF team in London, Web Application Firewall. And his story is that, well, he had to build a team and take over code that I originally wrote because it's probably not well known outside Cloudflare, but the original WAF that Cloudflare has, and it's been in production now for about eight years, is something that I put together. And if I say so myself, some of it's great, but some of it's a bit of a nightmare because I wrote it in a hurry. So Andre and I can talk about that. Welcome, Andre. Thank you very much for having me. Yeah, I'm not sure. I'll go easy on it. It served its purpose for a very long time. It has, right? And there's a little bit of work going on to actually maybe deprecate some of the stuff that I wrote what is now probably eight years ago. But let's just talk about the history. So I know some of the history that you probably don't, and I'll tell you about that. I just got here about six months ago, so you know quite a bit more than I do. Yeah. So when I joined Cloudflare in 2011, Cloudflare already had a WAF product. And the way that worked, and I hope you're all sitting comfortably and there are no small children in the room because this is a horrifying story. We had Apache, the Apache web server, running mod security. But because it crashed all the time, it actually had an instance of NGINX acting as a proxy in front of it, which was called NGINX WAF. And that was already proxying from another proxy. So there was proxy to proxy to proxy to Apache, and there was all sorts of communication going back and forth with headers. Multiple layers of security. It was multiple layers of A, horrible speed problems, B, all sorts of issues with how it should actually work. 
And mod security is fantastic for something that is not a shared service. When you suddenly have a shared service, one of the things you need to do is change configuration based on a customer. So at the time we had a fixed configuration. It was a real mess of stuff. And so when I actually joined the company, Matthew Prince, the CEO, he gave me a list of things that needed fixing. And one of them was the WAF. And it was actually the second thing that I worked on was I sat down and I said, I'm just going to start again and write a new one. And I did. We can talk about what that looks like. And we eliminated those proxy NGINXs. We eliminated Apache completely, and we got ourselves in a position where things could be configured. And so that's the world which you've been living in for a while. Now, one of the interesting things we did was at the time, we were also doing another big migration, which was to migrate from NGINX, sorry, from PHP, so fast PHP from NGINX to using Lua within NGINX. And that was happening. So I chose to write the WAF in Lua. So that's the world that you've taken over. So tell us a little bit about what it was like to take over that code, Andre. Sure. So there's actually a mix of both worlds still. So the actual expression of the rules of the WAF, each individual rule in our dashboard, you'll see a rule ID and then some description about it. The actual representation in our code base of those rules is still in ModSecurity. It's fantastic. It's used industry-wide. The kind of industry standard for how a WAF should behave and what it should protect. That's a thing called OWASP. I'm going to be totally honest. I don't remember what that stands for. But do you know what it stands for? I think I do. I suddenly thought I was about to correct you and say, I know what it stands for. And I was like, wait, is it the Web Application Security? Protocol? No, I don't think it's protocol. But it is the standard list. 
And I actually talked about it last week on one of the shows called ThreatWatch, which I was doing, about that OWASP Top 10 list, about how people should protect their applications. And that's one way in which WAFs get measured: how well do they stop the top 10 attacks? Yeah. Well, which is interesting because in the OWASP Top 10, one of them is about keeping things up to date, like using the most recent, up-to-date version of programs and libraries and that kind of stuff. If you think about what a WAF does, it's kind of a tangential thing for protecting against malicious requests, either coming or going, that kind of stuff. But yeah, you're right, the OWASP Top 10. One thing you said was we still use ModSecurity. So I just want to make sure one thing is clear. We use the ModSecurity language, but we don't use any of the components of ModSecurity itself. And the reason we did that, and this is what I did originally, was so many people have written their own rules in ModSecurity, and there was this thing called the Core Rule Set, which is the OWASP thing. So we wanted to be able to import using that language, so it was compatible with that. But it wasn't actually ModSecurity itself. And the compatibility layer, which is probably the most embarrassing piece of software I've ever foisted on someone else, was a monstrous couple of Perl programs, right? Yes. Those live in the memory and posterity and archives. Those, fortunately, aren't used in any kind of production deploy pipeline anymore. My goodness. It's all Python all the way down now. And actually, one of the things that my team has been working on with quite a lot of vigor recently is building a converter from ModSecurity to the wirefilter syntax that we open sourced a while back. That's in an effort to upgrade all of our systems. But, yeah, that's kind of in transit right now. 
But what our actual production pipeline looks like is, yeah, we take the ModSecurity format that we store in our repo and run it through a whole bunch of Python and then spit out some raw Lua. It's basically a transpilation layer, and the Lua then executes at the edge. Right. And that was the thing that I originally did in a big hurry. So if you go back in time, what I did was I basically reimplemented the ModSecurity runtime in Lua. So you had all of those matching functions and all the functionality that you needed to do the WAF stuff. And then as a really quick hack that was never meant to survive this long, I wrote this really, really ugly Perl program that took ModSecurity and, I'm going to say in quotes, "parsed" it, because there was no real parser, it was really sort of string matching and stuff, and spit out Lua code. So one thing that's interesting about the WAF in Cloudflare is that we actually transpile or generate Lua code, essentially as object code, and ship it to the edge and execute it. And each customer gets their own custom configuration, and those rules can be turned on and off. Entire sets of rules can be turned on and off. And that was actually how we got the flexibility that we have today. Now, what you're referring to is, obviously, you've got rid of the Perl, thank goodness for that, and replaced it with something a lot cleaner and more maintainable. You're still producing Lua code, but there's a move towards producing wirefilter. Tell us about wirefilter, what we're doing with it and why we're thinking about doing that. Sure. So wirefilter is inspired by the Wireshark protocol. So Wireshark is a way of monitoring network traffic, and the Wireshark protocol is a way of querying that huge number of packets that come through. 
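The pipeline described above (ModSecurity-format rules in, generated Lua out) can be pictured with a toy transpiler. This is only a sketch under loud assumptions: it handles a single `SecRule` with the `@contains` operator, and every `waf.*` function in the emitted Lua is a hypothetical runtime helper invented for illustration, not the API of Cloudflare's engine.

```python
import shlex


def lua_quote(text: str) -> str:
    """Render text as a double-quoted Lua string literal."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def transpile(rule: str) -> str:
    """Turn one simplified SecRule line into Lua source text.

    The emitted waf.get / waf.transform / waf.contains / waf.action
    names are hypothetical; they stand in for whatever runtime the
    generated code would call at the edge.
    """
    directive, variable, operator, actions = shlex.split(rule)
    if directive != "SecRule":
        raise ValueError("this sketch only handles SecRule")
    op_name, _, op_arg = operator.partition(" ")
    if op_name != "@contains":
        raise ValueError("this sketch only handles @contains")

    rule_id, action, transforms = None, "pass", []
    for item in actions.split(","):
        key, _, value = item.strip().partition(":")
        if key == "id":
            rule_id = int(value)
        elif key == "t":  # transformation, e.g. t:lowercase
            transforms.append(value)
        elif key in ("deny", "pass", "block"):
            action = key
    if rule_id is None:
        raise ValueError("rule has no id")

    transform_list = ", ".join(lua_quote(t) for t in transforms)
    return (
        f"-- rule {rule_id}\n"
        f"if waf.contains(waf.transform(waf.get({lua_quote(variable)}), "
        f"{{{transform_list}}}), {lua_quote(op_arg)}) then\n"
        f"  return waf.action({lua_quote(action)}, {rule_id})\n"
        "end\n"
    )


if __name__ == "__main__":
    print(transpile('SecRule ARGS "@contains <script" "id:100001,deny,t:lowercase"'))
```

Generating source text per rule is what makes per-customer configuration cheap in this model: turning a rule off simply means not emitting its block.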
And so since that is a very common thing to use in our industry, we decided, and again, I'm using the royal we, this was before I joined the team, to take that and use that as a common way of representing queries in our firewall rules specifically. So you can say, I want to match a request that looks like this and take a particular action on it. And so what we wanted to do was we wanted to make sure that we could do that internally. We like to do a lot of dogfooding. I'm sure you've talked about that before. And so we want to uplift the way that we are building our WAF to be compatible with that as well. And so that's kind of the push now, to move towards the wirefilter syntax and to really uplift all of our firewall products to use the same kind of syntax. So today, right? So we have these two worlds where we have the ModSecurity language, which gets translated into Lua and gets executed by the engine that I wrote all that time ago. But we also have this thing in Firewall Rules where customers can write using this wirefilter language, and those get executed by a separate engine, which is written in Rust. So there's now two different worlds going on. Is the ultimate goal that we finally deprecate all the code that I wrote and it all goes through the Rust code? That is indeed on the very near horizon. Yes, that is the goal. So what that gives us is, as you say, that the new engine is written in Rust and it's going to be the sole engine running all of these firewall-like syntaxes or these products. And so when we improve one, it's, you know, a rising tide lifts all ships. Everything gets faster. Right. Now, recently, though, you've actually done a bunch of work to improve the latency of the WAF itself, right? And actually reduce its CPU utilization on the edge. So why don't you tell us a little bit about that? Because we're not cut over yet to the new world. So how did you make the existing WAF faster and by how much? Sure. 
So the answer is we did it by not touching any Lua code at all. Because we are on the path to using the new Rust engine, we specifically didn't want to tailor anything in the legacy engine to gain these improvements. We wanted to do it once and for all. And so what we did was we actually rewrote a bunch of the ModSecurity-formatted rules to be more performant. One of the things that you actually wrote in your engine was the concept of memoization, which is you execute a particular bit of code and you save the intermediate value of that so you can reuse it. So if a particular function call, like a lowercase or a URL decode or something like that, executes, we don't have to do that again in any other part of the processing for that single request. And what we discovered was that in a lot of our ModSecurity rules, these were executed in different orders. And so we might have a lowercase first and then a URL decode, or it flip-flops. And the memoization that we had in the engine couldn't kind of reconcile that. And so it was almost as simple as making sure that those were in the correct order. And then we got a lot of these performance gains. So we did a whole bunch of memoization use. Another one of the things we did: we realized that when you are looking at objects that are array-like, or like a hash map, like headers, where there's, you know, a key and a value to those things, we were iterating over all of those things. And potentially, if we have a regular expression that we're trying to match on those, it can get pretty slow because we have to execute that regular expression over all of those. And so we found out that if, instead, we do it over a string blob, that is, before those get parsed, we get the same kind of matches because we're matching the same string, but we only have to execute it once. We don't have to execute it for each one of those headers. 
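The header change described above comes down to how many times the regular expression runs. The following is a minimal sketch, assuming a pattern that never spans a line break; the function names and the `SUSPICIOUS` pattern are made up for illustration and are not taken from the WAF's rules.

```python
import re
from typing import Dict, Tuple

# Illustrative pattern only; real rules are far more involved.
SUSPICIOUS = re.compile(r"<script", re.IGNORECASE)


def match_each_header(headers: Dict[str, str]) -> Tuple[bool, int]:
    """Run the regex once per parsed header value; count executions."""
    executions = 0
    for value in headers.values():
        executions += 1
        if SUSPICIOUS.search(value):
            return True, executions
    return False, executions


def match_raw_block(raw_headers: str) -> Tuple[bool, int]:
    """Run the regex once over the unparsed header text."""
    return SUSPICIOUS.search(raw_headers) is not None, 1


if __name__ == "__main__":
    headers = {"Host": "example.com", "User-Agent": "curl/7.68.0", "Accept": "*/*"}
    raw = "\r\n".join(f"{name}: {value}" for name, value in headers.items())
    print(match_each_header(headers))  # clean request: one execution per header
    print(match_raw_block(raw))        # clean request: a single execution
```

The clean request is the expensive case for the per-header version, because nothing stops the loop early. One caveat with this sketch: the raw text also contains the header names, so a pattern that could match a name would behave differently between the two functions.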
So those are kind of where we saw the big improvements. And we were able to get about a 40% decrease in our CPU usage for each one of the edge servers that we were running on. So in aggregate, that saved us a whole lot. You know, we're talking sub-second savings here. Our WAF doesn't take a whole lot of time to execute, but in aggregate it totaled up to quite a lot. Yeah. Yeah. So originally, when I worked on the WAF, one of the reasons why we have that memoization stuff is that Lua is not, in a way, the best language for doing string manipulation in. And in particular, you know, if you do the same string thing over and over again, it can get very expensive. And so I realized that if I could prevent us from doing something like lowercasing the same string multiple times, it would save a lot. And the original design goal was to get it down to one millisecond latency through the WAF. Because Cloudflare does security and performance at the same time, we wanted it absolutely to be the case that it would be fast and secure at the same time. So yeah, that was what that quite ugly memoization code was designed to do. And okay. So you discovered that it wasn't always doing the right thing or wasn't always doing as well as it could. Yeah. And so again, in the effort to try not to invest a whole lot of time in what we now consider to be our legacy engine, we still wanted to be able to use that, but, you know, not enhance it so that it was smarter. We wanted to make our rules conform to it, so that when we finish our conversion from ModSecurity to wirefilter, we get the same benefits in the new engine as well. We didn't want to go back to where we started with a less performant engine. We wanted to start fast. Yeah. Yeah. And you mentioned very briefly there the concept of streaming in this context. Do you want to just talk a little bit about that? Yeah, sure. 
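Why transformation order matters for memoization can be sketched with a per-request cache keyed on the exact ordered chain of transformations. This is an illustration only, assuming a cache design that the transcript does not spell out; `TransformCache` and its `misses` counter are hypothetical names, and `lowercase` and `urlDecode` stand in for the transformations mentioned above.

```python
from typing import Callable, Dict, Tuple
from urllib.parse import unquote

# Two transformations of the kind discussed above.
TRANSFORMS: Dict[str, Callable[[str], str]] = {
    "lowercase": str.lower,
    "urlDecode": unquote,
}


class TransformCache:
    """Per-request memo of transformed values (hypothetical design)."""

    def __init__(self) -> None:
        self._cache: Dict[Tuple[str, Tuple[str, ...]], str] = {}
        self.misses = 0  # how many times real work was done

    def apply(self, value: str, chain: Tuple[str, ...]) -> str:
        key = (value, chain)  # the ORDER of the chain is part of the key
        if key in self._cache:
            return self._cache[key]
        self.misses += 1
        result = value
        for name in chain:
            result = TRANSFORMS[name](result)
        self._cache[key] = result
        return result


if __name__ == "__main__":
    uri = "/Foo%20Bar"

    mixed = TransformCache()
    mixed.apply(uri, ("urlDecode", "lowercase"))
    mixed.apply(uri, ("lowercase", "urlDecode"))  # different order: no reuse
    print("mixed order misses:", mixed.misses)

    consistent = TransformCache()
    consistent.apply(uri, ("urlDecode", "lowercase"))
    consistent.apply(uri, ("urlDecode", "lowercase"))  # same order: cache hit
    print("consistent order misses:", consistent.misses)
```

Reordering is not automatically safe: lowercasing `%41` and then decoding yields `A`, while decoding and then lowercasing yields `a`, so any rule whose order changes has to be checked for unchanged matching behavior.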
So I'm not sure where you want me to go with this. Well, let me ask. Let me talk about it. One of the things that happened with the original WAF is you had to load the entire request into memory, do all the string manipulation and then run rules over it, which means you had to buffer everything and hold it all in memory. And I think we're moving towards a world where we can actually be streaming data through and making these matches as we go. Isn't that right? Yeah. So that's one of the things, particularly in the new engine, that they're focusing on: being able to do that. You know, there are other protocols too, outside of HTTP, that are almost exclusively streaming-focused. And one of the interesting things, as you say: take, for instance, WebRTC. If you had to load everything in memory all at once, you would only see that first handshake request. You wouldn't be able to protect against any of the other bits that are part of the stream. So that's kind of one of the desires: if you can operate on a continuous stream, then you can protect against those kinds of things as well. One of the other things that we also discovered was that, and it's kind of out of our control, the size of a request body has a huge influence on the performance as well. For the same reason, you know, you need to load it all into memory and you need to run a regex against this huge, massive thing. And in the worst case, which is you don't find any matches, you have to check the entire, you know, potentially many megabytes of data. Well, this is the thing that's terrible about WAFs, right? Which is that most of the time they do a large amount of work to do nothing, right? Because they have to check everything that's being sent against all of the rules to discover that, in fact, this wasn't an attack. 
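The difference between buffering a whole body and matching as the data streams through can be shown with a literal substring search over chunks. This is a minimal sketch, not a description of the new engine: `stream_contains` is a hypothetical helper, and real rules use regular expressions, which need more machinery than the fixed-size overlap used here.

```python
from typing import Iterable


def stream_contains(chunks: Iterable[bytes], needle: bytes) -> bool:
    """Search for needle across chunk boundaries without buffering the body.

    Only the last len(needle) - 1 bytes of what has been seen are kept,
    which is enough to catch a match that straddles two chunks.
    """
    if not needle:
        return True
    keep = len(needle) - 1
    tail = b""
    for chunk in chunks:
        window = tail + chunk
        if needle in window:
            return True  # early exit: the rest of the stream is never read
        tail = window[-keep:] if keep else b""
    return False


if __name__ == "__main__":
    body = [b"comment=hello <scr", b"ipt>alert(1)</script>"]
    print(stream_contains(body, b"<script>"))
```

Memory use stays bounded by the chunk size plus the needle length, whatever the size of the body. The worst case discussed above is unchanged, though: a clean request still has to be scanned from start to finish.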
And in fact, most, you know, most, most things aren't attacks, right? Most things aren't going to get blocked. So you actually have this terrible situation, which is you, you absolutely have to optimize for doing all the work. There's really very few shortcuts. There's a few in the CRS where they tried to jump over a bunch of rules thinking, well, you know, we're unlikely to match these ones, but it's really hard. And actually when the WAF does block something, it's often, you know, one of the first five rules or something catches it. The fastest thing is when we block something, right? So it's, it's, it's very annoying to work on WAFs because there are not a lot of shortcuts. Why did we decide to do this optimization now? Is this because of the COVID crisis? Yeah, that's exactly it. We are seeing so much growth very, very quickly condensed into a short amount of time. It was really becoming a hair on fire kind of situation. Because as you say, the most common case is that the WAF doesn't actually do anything. That means we do a lot of work every single time. And we we've known for a while that the WAF is kind of on the naughty list of the most, the hungriest CPU user on our edge. And so it's, you know, one of the things that we wanted to do was we wanted to, you know, I can kind of talk my team up a little bit. We wanted to set an example for the rest of the teams to say that, you know, there, there are some savings to be had and we want to do our part to make sure that, you know, we can continue to grow and we can continue to support this growth. As you know, more people are moving online more consistently and you know, just consuming generally more, more resources. Yeah. I mean, to put it in context, we saw in about a 12 week period, the amount of growth we'd see in a 12 month period. I mean, the, the, the load on the network went up extraordinarily. 
Now, the good news is we were already, you know, capacity planned for peak usage around the world in different locations, and the infrastructure team did a great job making sure our supply chain of hardware was intact and we could install new hardware, and we didn't run into a problem, but that's partly because of this incredible piece of work here on the WAF, which is you're knocking down its usage by 40%. It's absolutely enormous. I mean, it's rare you get a performance gain like that, especially on something that's been in production for many, many years. I mean, this is a pretty fantastic job. Yeah, we... sorry, go ahead. I was going to say we did an internal presentation right about the time we released this, and I saved that little bit for the end and I called it our mic drop slide: okay, here's a graph of these things and it goes and drops down to the bottom. Yeah. Those graphs are amazing. They're always fun to watch. But the problem was, in the next meeting that I had with Usman, our head of engineering, he asked, okay, so when are you going to give us the next big jump? Well, we hit all the low-hanging fruit we know about right now. We're not going to get another mic drop. Right. But, I mean, it's pretty amazing to get 40% down like that. It's wonderful. And you know, that obviously helped us with the COVID situation. It's also a huge savings for the company. If you think about, you know, how much CPU we can delay buying, if you can save it, you're talking millions of dollars in benefit to the company. So it's fantastic that this got done. And then of course we're going to replace my engine with the newer, faster, better one. Let's just talk about one of the things that went horribly wrong in my engine, which was about a year ago. We had an incident where we made a rule change. We do these all the time. 
They're constantly adding new rules to protect against new stuff and well, everything stopped working. Right. Yes. That was a very dark day. Fortunately or unfortunately, I'm not sure which I wasn't on the team yet. But yeah, it's, it's, it's a trade-off between being able to move fast and react to things and being safe and, and having the confidence that what you're shipping really isn't going to break anything. And the, the unfortunate reality was we were optimizing a little bit more towards being able to move fast previously. And so what happened was, you know, we've been talking about regular expressions being slow and taking a long time that we hit the pathologically worst case. You know, you wrote a fantastic blog kind of describing what, what happened on that day and then kind of a long diatribe into here's why regular expressions are terrible generally. Yeah. And one of the things that came out of that was a bunch of different process changes. We, we have much more confidence now about the things we ship before we ship them. And the other thing was we switched to a different regular expression engine that is kind of makes different trade-offs. You know, going from PCRE to RE2 it's, it's, I'm going to be honest, it's a little bit slower and it's a little bit more resource intensive, but it gives us that guarantee that we won't have the exact same kind of failure as we did previously. Yeah. And yeah, we can't repeat it. Yep. Yep. And that was, that was a fairly long process where I had to go and really examine the processes by which we did WAF releases, how things got tested and then these safety guarantees within the code itself. Yeah. And we got a lot of good experience that, that paid off when we wanted to go back and rewrite a whole bunch of our rules, because as part of that, we actually needed to convert some of our rules over to this new regular expression engine. 
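The "pathologically worst case" for a backtracking regular expression engine can be made concrete by counting attempts. The sketch below models a naive backtracking matcher for the textbook pattern `(a+)+$`, which is an assumption chosen for illustration and is not the rule from the incident, and compares it with a single linear pass that answers the same question.

```python
def count_backtracking_steps(text: str) -> int:
    """Count group attempts a naive backtracker makes for (a+)+$ from index 0."""
    steps = 0

    def match_groups(pos: int) -> bool:
        nonlocal steps
        steps += 1
        # Find how far the run of 'a' characters extends from pos.
        end = pos
        while end < len(text) and text[end] == "a":
            end += 1
        # Try every possible length for this (a+) group, longest first.
        for stop in range(end, pos, -1):
            if stop == len(text):   # '$' is satisfied
                return True
            if match_groups(stop):  # try one more (a+) repetition
                return True
        return False

    match_groups(0)
    return steps


def count_linear_steps(text: str) -> int:
    """One pass over the input decides the same match: is it all 'a'?"""
    steps = 0
    for ch in text:
        steps += 1
        if ch != "a":
            break
    return steps


if __name__ == "__main__":
    for n in (4, 8, 12, 16):
        text = "a" * n + "b"  # almost matches, so every split is tried
        print(n, count_backtracking_steps(text), count_linear_steps(text))
```

In this model each extra `a` doubles the work for the backtracker, while the linear pass grows by one step. Automaton-based engines such as RE2 guarantee matching time linear in the input size, at the cost of dropping features like backreferences, which is the kind of trade-off described above.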
And, and that's something that we're running into with, with the, we talked about the OWASP core rule sets. They support the old regular expression engine, because that's what comes with the actual mod security plugin in Apache, the PCRE2, which is, you know, if you want to trace it back, because PHP supports that and Perl supports that. Yeah, exactly. And that's why, that's why the original version of our WAF uses it. And so we are kind of putting all of those experiences together to be able to rewrite and have the confidence in the things that we write are going to match the same kinds of requests and you know, taking down to that, that measurement and confidence that we built as, as part of that original outage, putting all of that together and saying, you know, when we did a whole bunch of these rewrites to get that 40% gain, we were very, very confident in our ability to not have a change in behavior for all of those rules. Yep. Yep. And just to go back to the optimization part of it, how did you measure and figure out that the right thing to do was actually to adjust the, I think this is the order of the transformations, right? Sure. Yeah. So we, we have a whole bunch of internal metrics about exactly how long each rule takes. And so that was part of what we have that now, that was part of what we set up to be able to do this optimization. You know, you can't fix what you can't measure. And so we had to be able to measure all of these things first. And so we have some very scary looking graphs of this is exactly, you know, how many milliseconds or microseconds each rule takes and averaged over a lot of time. One of the interesting challenges that we have and one of the things that we are continuing to improve on is because we see so much traffic, we actually see different traffic in different regions, in different geographic regions, not all traffic looks the same in Cairo as it does in Miami. Right. 
And so one of the challenges that we have in writing a global WAF that's supposed to run everywhere is we have to write rules that work in both of those places, that can see these different kinds of traffic. And so that was part of the thing that we measured as well: you know, how long are these rules taking in different geographic regions? We couldn't break it up by each one of our data centers, but we were able to at least segment it a little bit, to say, you know, generally these are our top 25 worst offenders. And it actually wasn't that hard. The ones that we rewrote were pretty far outliers. And that's why we could kind of cap it at a hard number and see those kinds of gains, because they were so far out there on the how-long-it-takes-to-execute graph. You bring up an interesting point though. When you start operating something at Cloudflare scale, you really get exposed to how heterogeneous the Internet is. Like, it's not all the same everywhere. And I think, you know, sometimes you'll see on Hacker News or something, somebody say, I could build Cloudflare in a weekend because I could get, you know, ModSecurity in NGINX and blah, blah, blah. And I can do it. And I'm like, yeah, yeah. You definitely could put all those things together in a weekend. Now make it work at scale and with the weird, strange variety of the Internet. Well, you're exactly right. You can put it together in a weekend and you'd have what we had eight years ago, which isn't what we have today, which is generally a vastly better understanding of exactly, as you say, the differences in traffic and usage patterns. And it's pretty interesting. The top rules are generally the same worldwide, but in different geographic regions, the rules that get hits are slightly different. 
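The measurement step described above, per-rule timings segmented by region and ranked to find the worst offenders, can be sketched as a small aggregation. The sample format, rule IDs, and region names here are invented for illustration; the transcript does not describe how the internal metrics are actually stored.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# Hypothetical sample shape: (rule_id, region, execution time in microseconds)
Sample = Tuple[str, str, float]


def worst_offenders(
    samples: Iterable[Sample], top_n: int = 25
) -> Dict[str, List[Tuple[str, float]]]:
    """Return, per region, the top_n rules by mean execution time."""
    by_region: Dict[str, Dict[str, List[float]]] = defaultdict(
        lambda: defaultdict(list)
    )
    for rule_id, region, micros in samples:
        by_region[region][rule_id].append(micros)

    report: Dict[str, List[Tuple[str, float]]] = {}
    for region, rules in by_region.items():
        averages = [(rule_id, mean(times)) for rule_id, times in rules.items()]
        averages.sort(key=lambda pair: pair[1], reverse=True)  # slowest first
        report[region] = averages[:top_n]
    return report


if __name__ == "__main__":
    samples = [
        ("100001", "cairo", 50.0),
        ("100001", "cairo", 150.0),
        ("100002", "cairo", 900.0),
        ("100001", "miami", 40.0),
        ("100002", "miami", 35.0),
    ]
    for region, rules in worst_offenders(samples, top_n=2).items():
        print(region, rules)
```

Ranking by mean is one choice among several; when the slow rules are far outliers, as described above, a percentile-based ranking would surface the same candidates.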
Not necessarily how long they take to execute, but what they, what they hit are slightly different. And that's just based on the different kinds of traffic that come through. Yeah. That's interesting. Isn't it? And I wonder if it tells you that attackers around the world have their different favorite go-to attacks they start with mostly. And so there's some, maybe some interesting cultural difference in the way in which people hack things around the world. So, yeah. And the, the permeation of those kinds of attacks, like, you know, can, can you see it? I, I, we don't have this kind of data, but it would be interesting. Can you see it like travel into different places? Like it started here and then you can see it in this other region. Right, right, exactly. Now you came to this team from something quite different in Cloudflare, right? Yeah. I used to be on the marketing engineering team, which is a whole different kind of beast for our, our traffic patterns and, and exactly what we use. We were much more a consumer of, of Cloudflare and the WAF as opposed to, you know, writing the product. So once you took on this challenge of, you know, working on the WAF itself, were there things that surprised you about this type of software you're working on or the team? Yeah, genuinely the scale. So the working on the Cloudflare marketing site, you know, we, we owned www.Cloudflare.com and we get, you know, a large number of traffic. And I thought that was pretty cool. And then I came to the WAF team and I saw the, the actual number that the Cloudflare, the, the product. Yeah. And that was absolutely insane. And so that was, that was a pretty cool thing for me. And the, the, I knew that I, you know, was going to, was going to take this job in this team seriously, but the kind of realization moment of, Oh, wow, this, this affects a lot of requests per second. This is actually a serious deal. We need to do this right. Yeah. 
When we were doing the S-1 for the, you know, to go public last year, there were some statistics in there about blocking and the WAF and stuff. And there was one, which was something like, we block about 70 billion requests per day. And I didn't believe it. I was just like, no, no, no, no. And I know I drove the team mad by making them show me the query they'd done, repeat it for me. And then I'd go and do it myself and look at it. And I was like, no, no, no, that really is the scale. And I've seen days when it's been over a hundred billion blocks by the WAF or by the layer seven DDoS detector, literally a hundred billion HTTP requests that are being dropped on the floor. It's just hard to fathom that scale. Given that we're running out of time, thank you so much for coming on and talking about making the WAF 40% faster. That's pretty stunning. And you know, I'll shed a small tear when my code stops being in production, but you know, good luck with the new wirefilter implementation, because I think that sounds really, really cool. Thank you very much. We are sacrificing it for the greater good. Exactly. I think that's happened to quite a lot of code. You know, I think we finally got rid of all of Matthew's PHP code quite a few years ago. That was a thing that needed to be deleted quickly. He does actually have a minor in computer science, so he does know what he's doing, but he knows enough to be dangerous. So yeah, that's just enough to write some bad code. Yes, exactly. All right. Andre, thank you very much.