Cloudflare TV

Threat Watch

Presented by John Graham-Cumming
Originally aired on 

Join Cloudflare CTO John Graham-Cumming for a weekly look at the latest trends in online attacks, with insights derived from the billions of cyber threats Cloudflare blocks every day.

English
Security
Cyber Threats

Transcript (Beta)

All right. Good morning from Lisbon. I'm John Graham-Cumming, Cloudflare's CTO, and this is my little show called Threat Watch.

It was meant to happen yesterday at 5 p.m.

my time, but because of a personal problem, I had to delay it. So I'm very grateful to my guest who agreed to come on this morning.

And we're going to talk today in depth about recent trends in DDoS.

As you will well know, Cloudflare has a very well-known DDoS mitigation service which is unmetered, which means that we don't charge people for it.

We will block an attack whether you're a free customer or a paying customer.

And over the last few weeks of Threat Watch, I've talked about various DDoS trends that have been happening.

So we saw Akamai say that it had stopped a 1.44 terabit per second attack.

We saw Amazon say it stopped a 2.3 terabit per second attack using CLDAP.

We saw Akamai talk about stopping an attack which was more than 800 million packets per second, another type of DDoS attack.

So there's recently been some DDoS activity that's been quite large.

And that's interesting because there seemed to have been a lull in the ever-escalating size of DDoS attacks.

If you remember a few years back, you would have read news stories about GitHub going offline, which was 1.4 terabits per second.

And then there was an unnamed target which took 1.6 terabits per second. And then it all went quiet for a while as DDoS mitigation services did a great job.

But it seems like attackers recently are back trying things out, trying to knock stuff offline.

If those things are behind Cloudflare, they're unsuccessful. And one of the reasons they're unsuccessful is our dedicated DDoS team.

And my guest today is Omer, who is the product manager for DDoS.

Welcome, Omer. Thank you very much. Hi, John.

Good morning from London. So maybe I'll start with a quick introduction. So I'm the DDoS product manager based in London.

And I moved to London from Israel about a year and a half ago.

It's been quite the ride. Love it here. People are so polite.

Did you move for this job? Yeah. Yeah. I mean, I wanted the experience of living in a foreign country and everything.

And when the Cloudflare opportunity presented itself, it was a perfect match.

All the stars aligned. So you joined us running this in the DDoS team.

What have been the big surprises about handling DDoS at Cloudflare?

Well, I'd say, first of all, that there's a saying that I like that if you're the smartest person in the room, you're in the wrong room.

Well, so far, I feel like I've been in all the right rooms here.

So a lot of brilliant people from the engineering teams and all that build and manage these systems.

And it actually kind of surprises me, or it's nice to see every time again, how our systems actually do what they're supposed to do and block those attacks.

And an example of that is the 754 million packets per second attack that we talked about in the last few days.

So let's talk about that attack because that actually occurred a day before the very large attack that Akamai talked about, which I think is 803 million packets per second.

So really in the same ballpark, June 20th and June 21st. What we saw lasted for a few days, right?

Yes. So this was a four-day DDoS campaign. It ran from June 18th to the 21st, and it peaked just on the border between June 20th and the 21st.

The dates align also with the attack that Akamai published. I think it was 809 million packets per second.

But it was a different attack vector.

So the attack that we saw was over UDP, and the attack that they saw was in CLDAP.

That's pretty amazing, these packet rates. I mean, if anyone's not familiar with the different types of DDoS, we've switched from talking about terabits per second to millions of packets per second.

Do you want to just explain a little bit how those things are different and the different challenges there are in mitigating those attacks?

Yes, of course. So we can think of three types of DDoS attacks.

There are obviously a lot of types, a lot of attack vectors, a lot of methods, but we can break it down to what the attacker is trying to overwhelm or what is the approach that they're implementing in order to cause denial of service.

And it can either be a flood of many, many packets. In that case, what they're attempting to do is overwhelm the routers or the appliances handling and processing those incoming packets.

That's one of the options. And it can be the router that is overwhelmed, but potentially any other device in line, any other appliance that needs to process the packets can fail.

So that's one type, a packet-intensive flood with a high rate of packets.

And then another type is the bit-intensive. These are kind of the large numbers that we see in those headline DDoS attacks that make their way to all of the mainstream media.

And so as opposed to the packet-intensive flood, a bit-intensive flood tries to fill up your Internet link by just sending more traffic than what your service provider allows you to send according to your subscription or allowance.

And then when legitimate users try to access your website, for instance, they'll just be either throttled or blocked by the ISP that limits the pipe.

And another form of attack that we can think of is a request-intensive attack, an HTTP request.

And the goal of an HTTP request-intensive attack is to overwhelm the web server.

So the request rate might not be that large compared to the number of packets or bits in the relevant attacks, but what will happen is that the web server will be maxed out, maybe by additional processes that are generated by those HTTP requests.

It won't be able to handle legitimate requests by users.
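
To make the three categories concrete, here is a minimal sketch, in Python, of labeling a traffic burst by the resource it stresses. The thresholds are illustrative assumptions, not Cloudflare's detection logic.

```python
# Illustrative sketch only: label a traffic burst by which resource it
# stresses. The thresholds are made-up placeholders, not real detection logic.

def classify_flood(packets_per_sec: float, bits_per_sec: float, requests_per_sec: float) -> str:
    PPS_LIMIT = 1_000_000        # assumed router/appliance packet-processing budget
    BPS_LIMIT = 10_000_000_000   # assumed 10 Gbps uplink capacity
    RPS_LIMIT = 50_000           # assumed web server request capacity

    if packets_per_sec > PPS_LIMIT:
        return "packet-intensive: overwhelms routers and appliances processing each packet"
    if bits_per_sec > BPS_LIMIT:
        return "bit-intensive: saturates the Internet link itself"
    if requests_per_sec > RPS_LIMIT:
        return "request-intensive: exhausts the web server"
    return "within the normal envelope"
```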

Right. And you recently wrote a blog post about this, and I think there was a likening there to the bits-per-second type thing, the terabits-per-second type attacks.

It's like the pipe is full, that the connection is just full of stuff.

It's like a canal that's right up to its brink. And then the packets-per-second is kind of like an onslaught of mosquitoes or something where you've got to individually swat each one, and there's some expense to doing that, which overwhelms your networking equipment.

And then I'm not sure we had an analogy for the server one, but what's interesting, I think, is that there's basically you're attacking three different layers, right?

On one hand, you're attacking just the size of the pipe; you can just fill the pipe.

Then you're attacking the networking equipment itself with many packets per second that cause it to get stuck processing packets.

And in the requests-per-second one, you're attacking the server. So it's like these different types of DDoS.

And I think we see this a lot, right?

We see different types of attacks, sometimes simultaneously, different types of attacks happening.

And actually, even within each category, you'll see multiple sort of vectors of attack happening at the same time, right?

Yes, yes. For instance, we have a Magic Transit customer that was under a very prolonged or lengthy DDoS campaign.

And in those few days of the attack, the attacker or attackers utilized UDP floods, TCP-based floods.

So ACK, SYN, even reset floods, GRE floods, an Xmas (Christmas tree) flood.

So we've seen a variety of attack vectors being utilized over time. And I'd also just like to correct something I noticed: before, I said that the 754 million packets-per-second attack was UDP, but I was referring to something else.

That was a SYN-ACK flood.

Okay, a SYN-ACK flood. Okay. So that's just a single TCP packet with both the SYN and ACK flags set, and they send us literally millions of those per second.

Yes, exactly. And when you look specifically at the types of TCP-based attacks, each type of attack or each usage of a flag is meant to kind of take advantage of the statefulness of the TCP handshake.

So in a traditional TCP handshake, we have the client sending over a SYN packet or a packet with a SYN flag.

Then the data center, the server responds with a SYN-ACK and then following with an ACK.

So depending on the flag that is used, the effect may be different.

So a SYN flood would maybe try to cause an overflow of the connection tracking table in the router.

An ACK flood will also just try to overwhelm the resources of the router, and so on and so forth.
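
As a rough illustration of why skewed TCP flag ratios matter, here is a small, hypothetical heuristic in Python. It is only a sketch of the idea that a normal handshake roughly balances SYN, SYN-ACK and ACK packets while a flood does not; it is not how GateBot or DOSD actually work.

```python
# Hypothetical heuristic, not Cloudflare's implementation: in normal TCP
# traffic, SYN, SYN-ACK and ACK packets roughly balance out over time.
# A SYN or SYN-ACK flood skews that ratio badly within a sampling window.
from collections import Counter

def looks_like_tcp_flood(flags_seen: Counter, min_packets: int = 10_000) -> bool:
    if sum(flags_seen.values()) < min_packets:
        return False                       # too little traffic to judge
    syn = flags_seen.get("SYN", 0)
    synack = flags_seen.get("SYN-ACK", 0)
    ack = flags_seen.get("ACK", 0)
    # Far more SYNs (or SYN-ACKs) than completing ACKs suggests a flood.
    return syn > 3 * max(ack, 1) or synack > 3 * max(ack, 1)

window = Counter({"SYN-ACK": 750_000, "ACK": 1_200})   # e.g. a SYN-ACK flood sample
print(looks_like_tcp_flood(window))                     # True
```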

Right, right. And we mitigate this stuff using what exactly?

I know that if you're not, I mean, I know because I helped work on some of this stuff, but if you're outside Cloudflare, I think there's some different technologies in use here.

There's GateBot, which actually goes out and does some of this, and a thing called DOSD.

And there's a thing called FlowTrackD, which we're going to talk about at some point.

So just take us through these different technologies.

What do they do and how do they mitigate DDoS attacks?

Okay. So yeah, so we have three main systems that we use to detect and mitigate attacks.

And that's actually one of the nice things because you mentioned FlowTrackD and that's something that we will be talking about later.

And we will share more information about that capability.

But the nice thing is that because we built our own DDoS protection systems and they're software defined, we're just able to spin up, to write and spin up whatever software we need as opposed to other legacy providers using appliances.

So you're kind of limited to the capabilities of that appliance.

So as part of the evolution of CloudStore DDoS protection, we have the older brother, GateBot.

GateBot is a system, a software that runs in our network's core.

It collects samples from all of our data centers across the world.

So from over 200 locations around the world.

And it constantly analyzes for patterns, traffic anomalies, and it looks for DDoS attacks.

When it detects one, it will send instructions to the edge with the signature and what to block or rate limit and how to mitigate the attack.

GateBot also takes into consideration the origin's health.

So in the layer seven protection realm, GateBot also listens to the origin's error codes, and it starts looking for DDoS attacks and triggers mitigation once the origin signals, hey, I need some help, with error 500s.

Yeah, that's an interesting point, right?

So up until relatively recently, DDoS detection was kind of threshold-based. If it goes above a certain threshold, then we're taking a bunch more traffic than we would normally for this particular IP address or property; therefore go off and look for DDoS activity and go mitigate it.

But what you're just describing there, which is this origin protection, is the thing that actually looks at a signal from the actual customer's web server.

And the customer's web server starts getting into trouble, starts throwing errors.

Then GateBot says, wait a minute, these errors might be caused by a layer seven attack.

And therefore, I should look for an attack pattern.

I think that's an interesting new capability because that can protect even the smallest sites.

Like you host your thing on a Raspberry Pi and it gets a DDoS attack which a larger server would just handle; we can actually spot that the Raspberry Pi is in trouble and protect it.
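
A minimal sketch of that origin-health idea, with made-up names and thresholds rather than GateBot's real parameters: mitigation can kick in either on a pure traffic threshold or when the origin starts returning a high rate of errors.

```python
# Hedged sketch of the idea above: trigger mitigation either on a classic
# traffic threshold or when the origin itself signals distress via 5xx errors.
# Names and numbers are illustrative, not GateBot's real parameters.

def should_start_mitigation(requests_per_sec: float,
                            baseline_rps: float,
                            origin_5xx_ratio: float) -> bool:
    exceeds_threshold = requests_per_sec > 10 * baseline_rps   # classic threshold trigger
    origin_in_trouble = origin_5xx_ratio > 0.20                # origin asking for help
    return exceeds_threshold or (origin_in_trouble and requests_per_sec > baseline_rps)

# A small origin (say, a Raspberry Pi) can be protected even when the absolute
# request rate is modest, because the error signal carries the information.
print(should_start_mitigation(requests_per_sec=300, baseline_rps=50, origin_5xx_ratio=0.4))  # True
```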

Yes, exactly. And I think that kind of correlates to two trends that we've seen.

One is the changing or the evolving term of what is a DDoS attack?

What do we define as a DDoS attack? The classic definition is a malicious actor that targets your web property, your Internet property with traffic in order to cause denial of service or service disruption.

That's the classic definition.

But for the more modern definition, we can expand on the classic definition to include any type of unwanted traffic.

So this can be good bots that are just overly excited, sending more requests per second than the web server can handle.

It can also be a client bug where maybe there's some custom client and our customer released a version of that firmware, whether it's an IoT device or mobile application.

And just when the users update the version, then the clients start bombarding their origin.

In each one of those cases, if there is downtime, the user perceives it as a denial of service attack.

And we've seen in many cases how users hurry to Twitter and say, largest DDoS attack or this one is down and so on and so forth.

We saw some crazy activity like that the other week where it looked like T-Mobile was down and then people started saying, everything's down, America's under attack, kind of stuff.

And in fact, it was all a bit of a Twitter drama, really.

Yeah, that's a really good example. And so I think that what that kind of means is that our customers expect that we will protect them from any type of unwanted traffic.

And that's kind of the new modern definition of a DDoS attack, in my opinion, of course.

And so that's one of the contributing factors.

Customers expect us to mitigate things that are not necessarily attacks.

Right. And then, so you're talking about GateBot, right, which is sort of what you described as the traditional thing that Cloudflare does.

Actually, prior to GateBot, the traditional thing is we did this stuff manually.

So you have to go back in time to the tradition where we actually did it ourselves.

But GateBot is now, yes, a bit of software we've had for quite a long time.

But there's a new thing called DOSD. Tell us a little bit about DOSD. Okay. So DOSD is the younger, more energetic sibling of GateBot.

Wait, are you saying that my thing I worked on, is this an ageist thing?

Goodness, no. Go ahead with your younger, more energetic version.

Cool. DOSD is just a leaner piece of software.

It uses up much less CPU, much less memory. And the nice thing about it is that it runs on our edge.

So we don't need to send samples to the core again, and we don't have that dependency anymore.

So DOSD runs literally in every one of our servers and every one of the data centers.

And it works autonomously to detect and mitigate attacks.

So even if our core network is down or all of our data centers are down and only one server is left, that server will keep on mitigating.

Right. And this is a very good point, which is just to talk a little bit about history of Cloudflare.

When I joined Cloudflare and we were doing the manual DDoS mitigation, I wanted to build massive specialized servers to handle DDoS attacks.

And frankly, the reason I wanted to do that was that it sounded technically really exciting.

And I was dissuaded from this opinion by Matthew and Lee, that what we should do is we should distribute everything across all the servers everywhere.

So that meant that every server would do DDoS mitigation as well. And until DOSD, that's been true, but actually, the sort of detection part has been in the core based on sampling.

And so DOSD kind of finally brings that vision into play, right, where everything participates and everything is running independently.

Yes, exactly. And because of that, we were able to reduce the time to mitigate for the layer three, four attacks to under 10 seconds on average, and usually even just around three seconds.

So it's quite fast. Yeah. Yeah. It's pretty amazing how fast we'll detect something.

I'm always fascinated by those graphs internally where you see like a big peak and then it drops down again because the mitigation service figured it out and said, okay, yeah, now I'm dropping those packets.

You mentioned slightly earlier on a thing called Magic Transit. So can you just tell us what Magic Transit is?

Just because some of our customers will not have heard of it because it's a slightly unusual or newer product that Cloudflare provides.

Yeah. So Magic Transit is our layer three scrubbing service. Basically, Magic Transit protects entire network infrastructures against denial of service attacks.

And so as part of Magic Transit, we will announce the customer's IP ranges from our data centers around the world.

And using BGP Anycast, packets are attracted to the nearest data center, where they go through both our automatic DDoS detection and mitigation systems, such as GateBot and DoSD, and static firewall rules that the customer can define.

Then the clean traffic, we route it to the customer's data center over a tunnel.
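
As a rough sketch of that last step, here is what GRE encapsulation of a cleaned packet can look like using scapy. The addresses are hypothetical documentation ranges, and this illustrates the tunnelling concept, not Cloudflare's actual data path.

```python
# A minimal sketch, assuming scapy, of GRE-encapsulating a cleaned packet
# toward the customer's data center. Addresses are hypothetical documentation
# ranges; this illustrates the tunnelling concept, not Cloudflare's data path.
from scapy.all import GRE, IP, TCP, send

# The original client packet, destined for the customer's anycasted prefix.
inner = IP(src="203.0.113.10", dst="198.51.100.5") / TCP(dport=443, flags="S")

# After filtering, wrap it in GRE: the outer header runs from the Cloudflare
# edge to the customer's tunnel endpoint, with the original packet as payload.
outer = IP(src="192.0.2.1", dst="198.51.100.1") / GRE() / inner

send(outer)  # the customer's router decapsulates and delivers the inner packet
```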

And so we've always had the layer three, four protection, but with Magic Transit we basically had to expand our DDoS capabilities to cover many more attack vectors and methods.

And we really kind of improved the detection capabilities on many of those attack vectors because many of the Magic Transit customers, you know, they're running their full enterprise suite of products.

It can be SMTP servers, a DNS server, dedicated applications, TCP, UDP, web, VPN, and so on and so forth.

So it kind of made us really strengthen GateBot and DoSD to be able to detect those as well.

Yeah, that's a very good point about Magic Transit, which is that, you know, we've had obviously a long time capability to protect HTTP-based things and then some TCP and UDP-based things, usually in a proxying way.

So it comes to us and then we then do something and pass it on to the backend server or whatever.

Magic Transit is unusual, I think, for two reasons. One is that, as you say, it's all the traffic.

It's like literally layer three, right? So it's all the IP traffic.

So it could be, you know, voice calls over SIP, could be the email going in and out of a company, could be application stuff, screen sharing, video conferencing.

I mean, the whole thing is encompassed, which has new sets of challenges.

And the other thing that's interesting is it's only in one direction, right?

So we take the traffic in, we clean it up, we pass it back onto the backend server.

But then we go, there's this direct server return thing. Do you want to talk a little bit about that?

Because that's kind of a slightly different architecture from the proxying style.

Yes, of course. So with our WAF service and the Spectrum service for protecting TCP and UDP applications, we serve as a reverse proxy, meaning that we are in the middle, a connection is established with the client, and we establish another connection with the origin when needed, because with the WAF service, we could be potentially serving everything from cache as well.

And so those are kind of our traditional reverse proxy-based setups where we have visibility on both sides.

Now, remember that, because visibility is key for DDoS mitigation.

So we see both parts of the connection.

With Magic Transit, it's a little different, right? Like you said, we see the ingress traffic that traverses through us, and we pass that along through a tunnel, for instance, a GRE tunnel, to the customer's data center.

Because if we were to release that packet to the wild, it would have just been caught by our data centers again, and we would have ended up in a circle or loop.

And so we pass that to the customer over a tunnel, and then the data center responds directly in a direct server return to the client.

So we have this asymmetrical traffic topology, a unidirectional flow, if you will, and that poses its own set of challenges for the DDoS realm, right?
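
Purely as an illustration of the asymmetric, unidirectional flow just described, here is a toy state table that only ever sees the ingress direction. It is a hypothetical sketch, not FlowTrackD.

```python
# A toy, hypothetical state table that only ever sees the ingress direction,
# to illustrate why unidirectional flows make TCP attack detection harder.
# This is not FlowTrackD.

ingress_state = {}  # (src_ip, src_port, dst_ip, dst_port) -> last inbound flags seen

def observe_ingress(flow_key: tuple, flags: str) -> str:
    previous = ingress_state.get(flow_key)
    ingress_state[flow_key] = flags
    if flags == "S":
        return "client opening a connection"
    if "A" in flags and previous == "S":
        return "plausibly completing a handshake we saw the SYN for"
    if "A" in flags and previous is None:
        # We never saw the origin's SYN-ACK (it went back directly to the
        # client), so this could be a legitimate reply or an ACK-flood packet.
        return "ambiguous: half of the picture is missing"
    return "continuing flow"

print(observe_ingress(("203.0.113.10", 55000, "198.51.100.5", 443), "A"))
```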

Because you only have half of the picture. Right. And so we've obviously got you on a, I think, blog later on, maybe even today, about how we solve that problem with a thing called FlowTrackD.

So I won't reveal any secrets today.

Yeah. There is an interesting blog coming out, and I'm saying that it's interesting because I wrote it, so I'm a little biased.

There you go. Exactly right.

So tell me about something. So since you joined, what's been the most, I don't know, your favorite project you've worked on for the DDoS team?

I'd have to say that the network analytics dashboard has to be by far my favorite project.

Tell me about that. Why? What made it special for you? Well, first of all, it was kind of the first of its kind in Cloudflare.

It was the first. So let me start by saying what the network analytics dashboard is.

It's an analytics dashboard that gives you both high-level insights and also kind of deep-dive details to be able to investigate the layer three, four traffic that we see at our edge.

So this is traffic and DDoS attacks that we automatically mitigate.

This is especially important for Magic Transit customers because we don't, we're not in layer seven, right?

We don't terminate the SSL connection and we don't have visibility into HTTP attributes and so on.

We have a layer seven firewall dashboard for that. And so we needed to provide them visibility into what we see, the layer three, four, the packet details, the packet level dashboard.

And so working on that, it was a really wide cross team effort.

We had to build the backend first. Now, how do you store the data, right?

The attacks, in order to be able to present them efficiently, to be able to query them and retain them for long periods so we can show historical patterns, but also without raising any privacy concerns, right?

The source IP, for instance, is considered PII. And so the retention of those kinds of things is very strict and we do not retain them for over 30 days, for instance.

So we had to build that backend. And the way that we built it is that each attack is just one log, one summary.

And so an attack log would have a start date and time, an end date and time, the total packets, total bits, max, min, average, the attack vector, the target, and so on and so forth.

And that basically takes all the potential billions or trillions of packets that could form a DDoS attack and just summarizes it into one log.

So having that made it really simple to deliver that capability, right?

To be able to view a lot of attacks over time and to be able to analyze them as well.
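
A minimal sketch of what such a per-attack summary log might look like as a data structure, assuming Python; the field names are illustrative rather than the real schema.

```python
# A minimal sketch of a per-attack summary log, with illustrative field names;
# potentially billions of packets collapse into one record like this.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class AttackLog:
    start_time: datetime
    end_time: datetime
    attack_vector: str    # e.g. "SYN-ACK flood", "UDP flood"
    target: str           # the attacked prefix or zone
    total_packets: int
    total_bits: int
    max_pps: float
    min_pps: float
    avg_pps: float
```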

And then we also had the front end part, the design and the UI and the user research.

We worked really iteratively to show customers early stage mock-ups, even if they were quick and dirty, to get their feedback.

And we met with them multiple times along the development cycle until we arrived at the place where we wanted to be.

And we released Network Analytics in January this year.

And since then, we've been working with the design and LUI team, the London UI team, to constantly add more improvements and capabilities to be able to slice and dice the data easily.

That sounds great.

Listen, we've only got a few minutes left. What's next for DDoS team? Obviously, DDoS attackers aren't standing still.

So what are we doing about our next defenses or next features?

Yeah. So I'd say that there are two kind of themes to the DDoS protection.

One is to constantly add more coverage and protection and capabilities to detect more types of attacks.

An interesting type of attack that we can share is a QUIC-based DDoS attack.

So because the payload is encrypted and it's over a UDP packet, there's not a lot for the systems to identify signatures by.

So if anyone's listening and not sure what you mean by QUIC here, you're talking about the protocol that Google invented, QUIC, which I think is going to end up being called HTTP3, right?

And so it's UDP-based, it's encrypted, it introduces some challenges for us.

Yes, exactly. And so that can definitely be seen as a challenge in the DDoS landscape because by nature, UDP packets have fewer fields.

And if the payload is encrypted, it'll be randomized completely by design. And so identifying the patterns of the attack that can be used to form a signature to mitigate them can be much more challenging.

But because we've been involved with the QUIC protocol and we support it, we already have protections for those in place.
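
One way to see why an encrypted payload offers so little to build a signature from is to measure its entropy. This is an illustrative sketch, not a mitigation technique described here.

```python
# Illustrative only: Shannon entropy of a payload, in bits per byte. An
# encrypted QUIC payload looks close to uniformly random (entropy near 8),
# so the payload itself gives a mitigation system little to key a signature on.
import math
import os
from collections import Counter

def payload_entropy(payload: bytes) -> float:
    counts = Counter(payload)
    total = len(payload)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(payload_entropy(os.urandom(1200)))                                     # close to 8
print(payload_entropy(b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n" * 25))  # much lower
```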

Right. And I'd say that another thing is because our customers range from independent bloggers to large global enterprises and service providers, we're going to be adding much more control over our automatic systems to allow everyone to customize it and to allow the systems to adapt to the changing nature of traffic for every specific site, because no site is the same.

No website is the same.

Yes. And we handle 27 million different Internet properties right now. There's a huge amount of variety in terms of size, in terms of what they see, in terms of their backend infrastructure, the sophistication of the customer, and keeping them all online is a challenge for the team.

Well, listen, I think we're almost out of time.

And I think the thing that fascinated me most about the 754 million packets per second attack is that you told me about it as kind of an afterthought, like, oh, yeah, by the way, there was this massive attack.

Nobody had to do anything.

No one was woken up and no one was doing anything manually. It was just all 100% automatic.

And I think that's really a testament to what the team has done to make this stuff just sort of trivial to deal with.

So, you know, thanks very much.

Thank you for helping keep us all online. You know, hopefully I'll get to talk to you again.

And hopefully that blog post goes out today in a couple of hours and people can learn about FlowTrackD and Magic Transit.

Thank you very much for having me.

And cheers from London. Thanks very much. Yes, I know. We should say cheers from Lisbon as well.

Where it is stinking hot today. You have a good day.

Bye.