Latest from Product and Engineering
Presented by: Usman Muzaffar, Omer Yoachimik, Manish Arora
Originally aired on February 22, 2022 @ 5:00 AM - 5:30 AM EST
Join Cloudflare's Head of Engineering, Usman Muzaffar, for a quick recap of everything that shipped in the last week. Covers both new features and enhancements on Cloudflare products and the technology under the hood.
Product
Engineering
Transcript (Beta)
All right, hello and good morning from California. This is the latest from product and engineering.
My name is Usman Muzaffar. I'm Cloudflare's head of engineering.
Normally, Jen Taylor is with me as head of product, but she's out today, so I'll be hosting solo.
But I'm very pleased to have two of my colleagues here who are actually dialing in from abroad.
Omer and Manish. Omer, say hello. How long have you been at Cloudflare and what are you responsible for?
So, hi everyone.
I'm Omer, the product manager for DDoS protection at Cloudflare. I'm based in London and I've been at Cloudflare a little more than two years now.
Two years. I was just thinking, you've just crossed the two-year boundary.
What a flip year this has been. And Manish, nice to see you too. How have you been?
Doing good, Usman. Hi everyone. I'm Manish. I'm an engineering manager for DDoS protection systems and I joined just two months back, slightly over two months back.
Super excited to be here. That also doesn't feel right. It feels like you've been with us longer.
So, can't believe, Omer, that it's already been two years.
Can't believe, Manish, that it's only been two months. But as both of these gentlemen just said, that DDoS is in their job titles or their team titles, which is denial of service, which is a scary sounding thing.
Omer, let's just start right at the very beginning.
What is a denial of service attack? How do they happen?
What do we know about this very 21st century problem? This didn't exist. This didn't exist.
The generation of software engineers before me would never have had to deal with this.
So, what happened? How does this new form of attack even show up?
Let's start at the very basics. Yeah. So, imagine that you have a website or a mobile application or any type of online Internet property.
You want your users to be able to access it, to go online, to browse there, to buy something, to play a game.
And there are attackers, malicious actors that basically want to cause a denial of service for your users, meaning that they won't be able to access it.
This can be done by generating a flood of Internet traffic towards your website or your mobile app.
And basically, if there's too much traffic that your servers can't handle or the routers or anything in line, then your legitimate users won't be able to reach you.
Your servers won't be able to provide service to your legitimate users.
You can think of this as if someone was to push a bunch of cars into a highway or the road leading into your store.
And then someone that wants to come to your store to buy something will be just stuck.
They won't be able to get there because the road is blocked. Coming up with analogies for DOS attacks is a favorite hobby of mine.
I like this one, the idea of a bunch of cars.
The interesting thing is, the cars don't even need to be functional to take you, right?
They can just be pieces of cars. It doesn't even have to be things that have wheels on them.
It's just nonsense that's clogging the routes.
Or in this case, some of the things that we've seen, like the classic DOS attacks, they're not even complete requests.
They're not even complete.
They're just fragments of connections or fragments of requests. But that doesn't matter.
It still takes the web server time and energy to receive that connection and say, wait a minute, this isn't even an intelligible request.
Please move aside next.
And then the next thing is also an unintelligible request and so on and so forth until buried in there so deep is the actual...
The other analogy we like to use, that I've used, is as if you have a shop, just like you said, and rather than letting customers in the store, you just pay 100 five-year-olds a nickel each to just run loose in the store.
And they're not going to buy anything, but they're not going to let anyone else buy anything either.
So what are the magnitudes we're talking about here?
It's easy to lose sight of the numbers. Some sense of like, what are we talking about?
We're used to saying, oh, I have a website that gets a few hits a second, or if it's really popular, it's a few thousand a minute kind of thing.
What happens when you have a DOS attack?
What happens to a website? How much traffic are we talking about here? So it depends on the type of the attack.
So you can think of the attacks or we can bucket the attacks into three groups.
One is attacks that have high IP packet rates. And these are attacks that aim to take down your router or your server by just flooding it with a bunch of packets.
And these inline servers or just appliances, they allocate certain memory and CPU usage for every packet or every connection that they manage.
And given a large enough amount of those, they will just crash. They will have a memory overload or won't be able to handle the legitimate packets of legitimate users.
That's one type. Another type is an attack with a high bit rate.
So not necessarily a ton of packets, but just a ton of traffic in general. So like gigabits per second, terabits per second.
And those attacks usually attempt to overwhelm your Internet uplink, your pipeline.
So if you have a one gigabit per second connection in your home that you get from AT&T or from whichever ISP, imagine that you're trying to utilize more than that.
Your ISP will just throttle your connection basically.
And so legitimate connections won't be able to be established.
And so by saturating the Internet pipeline for that website, that's how you cause the denial of service.
And the third part, or the third category, is an HTTP request-intensive attack.
And this also aims to take down HTTP servers and cause them to crash and not be able to handle legitimate requests.
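As a rough illustration of the three categories just described (packet rate, bit rate, and HTTP request rate), here is a minimal Python sketch that checks which resource a given flood would exceed. The class name, function name, and all capacity figures are made up for the example; only the one gigabit per second uplink echoes the figure mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Traffic:
    packets_per_second: float
    bits_per_second: float
    http_requests_per_second: float


def overwhelmed_resources(observed: Traffic, capacity: Traffic) -> list[str]:
    """Return the resources whose capacity the observed traffic exceeds."""
    checks = [
        ("packet processing", observed.packets_per_second, capacity.packets_per_second),
        ("Internet uplink", observed.bits_per_second, capacity.bits_per_second),
        ("HTTP servers", observed.http_requests_per_second, capacity.http_requests_per_second),
    ]
    return [name for name, seen, limit in checks if seen > limit]


# Hypothetical site: 1M packets/s, a 1 Gbps uplink, 10k HTTP requests/s.
capacity = Traffic(1_000_000, 1e9, 10_000)

# Hypothetical high bit rate flood: few but large packets, 5 Gbps in total.
flood = Traffic(400_000, 5e9, 100)

print(overwhelmed_resources(flood, capacity))
```

In this made-up case only the uplink is saturated, which matches the second category: the servers could keep up, but legitimate traffic never reaches them.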
In terms of numbers, our systems automatically mitigate hundreds of thousands of these per month.
So this is quite a common thing. And in 2020, we saw the number of attacks double every quarter. Since the COVID lockdown and everything, the attacks increased, and we have two assumptions as to why this happened.
One is that people became more reliant on Internet services, ordering food with Deliveroo and online shopping and online studying and so on and so forth.
And the second thing is just cyber vandalism.
Just probably students at home bored playing around with tools, launching attacks, seeing what they can do.
And there's threats of things like, we've talked to customers who have threats of ransomware and things like this, but ultimately this all becomes a way for a malicious actor.
Let's be clear, this is not a bug.
This is an attack to try to disrupt your website. And it is, like you said, it doesn't take a lot of technical sophistication to start one of these, but it does take a lot of technical sophistication to stop one of them.
Manish, give us some rough idea, just in the highest sketch. So all this stuff is coming in through, incoming in, these half-formed requests or these malicious fake requests.
Just in the broader strokes, how do you defend against a DOS attack?
What do you have to do? I think Manish's connection might be a little stuck.
Omer, why don't you take a crack at that question?
You're the product manager, but when Manish's connection comes back live, we can ask for the authoritative engineering manager response.
Yeah, for sure.
So what we aim to do is to disrupt the economy of the attacks, because like you said, attacks can be launched very simply, very easily, with not a lot of cost to it.
You can also do it for free. You can find the source code of malware and botnets like Mirai online and just use that, or hire a botnet for as little as $10 for an attack.
Really disturbing. Yeah. Yeah. And the cost of attacks for the organizations that are targeted by it can be insanely high.
There are companies that a minute of downtime is worth tens of thousands of dollars, if not hundreds of thousands of dollars worth in revenue.
And this is not including the impact to reputation, users going online and ranting on Twitter and so on and so forth.
It's terrible.
Your site's down, just like any other reason that your site might be down. Yeah.
And even if someone causes a five-minute outage for your website, or it's knocked down for five minutes, the time it takes you to restore those services is much longer.
You need to relaunch everything, the services, so it's amplified much more.
And the way that we are able to help our customers is with our automated systems that automatically analyze, detect, and mitigate attacks as they happen within seconds.
And so that's the crux of it. I assume you got Manish back.
Let's see if he can come off mute and take a shot at answering this question.
So Manish, the question is, so in broad strokes, we have this malicious traffic, this traffic that's designed to take down an origin website.
So how does Cloudflare being in the middle, in the most general strokes without going into all the, without overwhelming, I know the amount of technology is there, how do you stop it?
How do you get in the way and disrupt this from taking over the origin? Yeah.
So let me try to demystify this whole thing into the three broad logical steps.
First is to sample packets. We don't analyze all the packets. So the first thing we do is sample the packets.
And once we've sampled packets, the second thing is we analyze those sampled packets against various filters to decide whether they're malicious or not.
And once we've identified a pattern, the third thing is we create different kinds of mitigations to protect the origins.
There'll be different types of mitigations.
It could be complete block. It could be rate limiting. It could be challenge-based mitigation.
Now, all this happens at the edge service analysis.
And then there are pieces of software that's running at the core.
So it has been doing this.
That's great. Manish, I think we're going to have to ask you to pause because you're breaking up.
But I think I got the gist of that. And I think the crux is that there's a sampling, there is an identification, and then there is a mitigation.
So part of it is that we are going to, because this is happening at such high volume, we're going to just put our thermometer in to understand enough of, OK, what are we actually seeing?
And then we have a selection of tools that are selectively applied to try to stop the attack.
And I think even right before Manish's connection started to break up on us, he did talk about how there's actually two parts to this.
And there's part that can run in the edge by itself. And then there's part that involves some coordination back in the core.
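The three steps summarized above (sample, identify, mitigate) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 1-in-N sampling, the fingerprint made of protocol, destination port, and packet length, the 50% threshold, and every function name are hypothetical and do not describe Cloudflare's actual systems. Only the "complete block" mitigation is shown; rate limiting and challenges are left out.

```python
from collections import Counter


def sample_packets(packets, one_in_n):
    """Step 1: look at only every Nth packet instead of all of them."""
    return packets[::one_in_n]


def fingerprint(packet):
    """Hypothetical fingerprint: protocol, destination port, packet length."""
    return (packet["proto"], packet["dst_port"], packet["length"])


def find_attack_fingerprints(sampled, min_share):
    """Step 2: flag any fingerprint that dominates the sampled traffic."""
    counts = Counter(fingerprint(p) for p in sampled)
    total = len(sampled)
    return {fp for fp, n in counts.items() if total and n / total >= min_share}


def block(packets, attack_fingerprints):
    """Step 3: drop every packet that matches a flagged fingerprint."""
    return [p for p in packets if fingerprint(p) not in attack_fingerprints]


# Made-up traffic: 9,000 identical flood packets plus 1,000 varied ones.
flood = [{"proto": "udp", "dst_port": 53, "length": 512} for _ in range(9000)]
legit = [{"proto": "tcp", "dst_port": 443, "length": 60 + i % 100} for i in range(1000)]
traffic = flood + legit

sampled = sample_packets(traffic, one_in_n=10)
attack = find_attack_fingerprints(sampled, min_share=0.5)
passed = block(traffic, attack)

print(attack)       # the single flood fingerprint
print(len(passed))  # only the varied packets remain
```

The point of sampling first is cost: the pattern is found from a tenth of the packets, and the resulting rule is then applied to all of them.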
So let's talk a little bit about that.
Omer, we've talked in the past about how FlowTrackD is something that we blogged about.
And just to sort of recap for our listeners, what was the problem that we were trying to solve with FlowTrackD, and how did it help?
Yeah. So for our viewers that are familiar with TCP connections and the TCP handshake, you can zoom out for a moment for the other ones.
So when a client's browser, when you log into some website, or when you go on some website or mobile app, usually a TCP handshake is established in order to form a connection between your browser or your phone and the server.
And this process or this handshake includes three steps.
And in order to identify these steps, you have to have a stateful connection tracker, basically, to know that this is a valid connection.
And this is all fine and easy to do with our more traditional services, such as the WAF service and Spectrum, which is our layer four service for TCP and UDP applications.
The WAF service is a reverse proxy, meaning that we are the middleman between the client and the server.
And so we're able to see both sides of the connection. So that's fine for us.
It's not trivial, but we're able to identify- It's not trivial, but at least it makes sense.
At least, like, okay, I'm supposed to block something.
I see it coming in and I see it going out. That's straightforward enough. I think I can at least wrap my head around how a program would be written to adjust it.
Let's introduce the wrinkle, though. What can get more complicated in this magic trick?
Yeah, exactly. So we have a service, a very popular one, called Magic Transit.
With that, we protect entire networks, entire IP prefixes, entire data centers, service providers, financial industries.
And so with that service, we don't proxy the traffic.
We just route packets one way. We advertise our customers' IP ranges from our data centers across the world with Anycast.
Then packets get routed to the closest data center using Anycast for the highest performance.
And then we analyze and scan for DDoS attacks, apply customer configurations and firewall rules to make sure we enforce their policies.
And then we route traffic over tunnels or in tunnels to their data centers.
So far, so good.
That didn't sound too different from what Cloudflare normally does. We sit in the middle of the traffic and we intercept and we block and we filter and we make sure that only good stuff makes it past us.
Yeah. And so the thing that differs from our traditional products is this part, where instead of the customer data center routing traffic back through us or to us, and then we talk with the client, the customer data center responds directly to the client using a direct server return, DSR in short.
We love acronyms, abbreviations. And this means that we only see half of the connection.
And so if there is some kind of attack that's abusing the TCP protocol, we don't have all the context that we need in order to identify that this is an attack.
And so this was the challenge that we had. And for this, we built a product or a component called FlowTrackD, which is short for flow tracking daemon or TCP flow tracking daemon.
It's quite a novel piece of work that our engineers created.
It's able to identify the state or to classify a TCP flow based just on the ingress traffic.
We've only seen one half of it. It's sort of one hand tied behind your back and still able to say, yep, this is legitimate or no, this looks like an attack, even though I'm only seeing one half of it.
It's a pretty cool trick. Exactly. And we classify the TCP flows and then we're able to drop, rate limit, or challenge packets that don't belong to a legitimate connection.
And this software is a good example of how Cloudflare is software-defined and can just spin up software when and where needed to solve the challenges that we have, as opposed to legacy scrubbing center providers that just use legacy hardware appliances for mitigating attacks.
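To make the idea of classifying a TCP flow from ingress traffic alone more concrete, here is a deliberately simplified Python sketch. It is not how FlowTrackD works internally, which the discussion above does not detail. The class, the two states, and the rules are assumptions for illustration: a flow is treated as known once the client's SYN has been seen, and packets for unknown flows are treated as out of state.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    DROP = "drop"


class IngressFlowTracker:
    """Hypothetical tracker that sees only client-to-server packets."""

    def __init__(self):
        # Flow key -> "syn_seen" or "established".
        self._state = {}

    def observe(self, src, sport, dst, dport, flags: set) -> Verdict:
        key = (src, sport, dst, dport)

        # A bare SYN opens a new flow. The server's SYN-ACK is never seen
        # here, because the server replies directly to the client.
        if "SYN" in flags and "ACK" not in flags:
            self._state[key] = "syn_seen"
            return Verdict.PASS

        # Anything else must belong to a flow whose SYN we already saw.
        if key not in self._state:
            return Verdict.DROP

        # The client closing or resetting ends the flow.
        if "RST" in flags or "FIN" in flags:
            del self._state[key]
            return Verdict.PASS

        # The client's ACK implies the handshake completed on its side.
        if "ACK" in flags:
            self._state[key] = "established"
            return Verdict.PASS

        return Verdict.DROP


tracker = IngressFlowTracker()
flow = ("198.51.100.4", 40000, "203.0.113.10", 443)

print(tracker.observe(*flow, {"ACK"}))         # no SYN seen yet: out of state
print(tracker.observe(*flow, {"SYN"}))         # opens the flow
print(tracker.observe(*flow, {"ACK"}))         # completes the handshake
print(tracker.observe(*flow, {"FIN", "ACK"}))  # client closes
print(tracker.observe(*flow, {"ACK"}))         # flow is gone: out of state
```

This sketch only catches out-of-state packets such as ACK floods. Because every SYN is accepted, a SYN flood would need a separate defense, and a real tracker would also have to expire idle flows and bound its memory.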
And this is a great example of what Manish was talking about just a few minutes ago, which is we are making the edge smarter.
So this is an example of the edge being able to make decisions to protect our customers and protect Cloudflare network without central coordination.
We even call that something, the autonomous edge. Talk a little bit more about how, what, not just sort of the technology behind that, but what does that let us do?
What is that decentralized architecture? What does that enable us to deliver?
Yeah. So historically, we've been doing DDoS for more than 10 years here, right?
And in 2017, we announced the unmetered DDoS protection for all customers, including the free plan.
And what this means is that we need to be able to scale.
We need to be able to provide this free unmetered DDoS protection without impacting performance and making it as cheap as possible for our systems to do it, right?
Without sacrificing CPU, memory, and bandwidth. And so historically, we had, or we still have, a centralized DDoS detection component that receives samples from all of our edge data centers.
It analyzes it, identifies patterns, attacks, known attack tools, botnets, protocol anomalies, and so on.
And then distributes mitigation instructions to the edge.
And this was fine when we had 30 data centers, 20, you know, back in the day.
But today we have more than 200 data centers around the world.
We process more than, or our capacity is more than 59 terabits per second.
We needed to have more of a delegated DDoS detection solution because we have various types of customers, from independent bloggers to Fortune 500 companies with insane amounts of traffic.
And so this is why we built a system that we call DOSD, the denial of service daemon.
And as opposed to the centralized DDoS protection or detection system, this piece of software is much more agile, consumes less CPU, needs less memory, and it's installed in every single server and every one of our edge data centers.
And it's basically, it runs autonomously in every one of those data centers.
Every server detects attacks by itself.
And what I mean by autonomous is that you don't need a centralized brain to make a decision on if this is an attack or not.
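The contrast with a centralized detector can be shown with a small Python sketch in which every server holds its own detector and its own rules. Everything here is hypothetical (the class, the per-window counting, the threshold of 1,000) and is meant only to illustrate "no centralized brain", not to describe DOSD's real detection logic.

```python
from collections import Counter


class LocalDetector:
    """One instance per server. It never consults a central service."""

    def __init__(self, max_packets_per_window):
        self.max_packets_per_window = max_packets_per_window
        self._counts = Counter()
        self.active_rules = set()

    def observe(self, fingerprint):
        """Count a sampled packet and decide locally whether to mitigate."""
        self._counts[fingerprint] += 1
        if self._counts[fingerprint] > self.max_packets_per_window:
            self.active_rules.add(fingerprint)

    def end_window(self):
        """Start a fresh counting window. Existing rules stay in place."""
        self._counts.clear()

    def should_drop(self, fingerprint):
        return fingerprint in self.active_rules


# Two servers, each with its own detector and no shared state.
server_a = LocalDetector(max_packets_per_window=1000)
server_b = LocalDetector(max_packets_per_window=1000)

flood = ("udp", 53, 512)  # made-up fingerprint
for _ in range(5000):
    server_a.observe(flood)  # only server A is being flooded

print(server_a.should_drop(flood))  # server A mitigates on its own
print(server_b.should_drop(flood))  # server B saw nothing and does nothing
```

Each server reaches its decision from what it sees itself, so detection keeps working even if a server cannot reach any central component.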
Which is an interesting thing because actually there's another thing we sort of glossed over at the very beginning.
It used to be called a DDoS attack.
And now we mostly just call it a DOS attack because the D is implied.
But the idea behind the initial D was that it was a distributed denial of service as opposed to just overwhelming.
And that's important to note because if it isn't distributed, if it's all coming from one place, it's relatively easy to identify and lock out.
And part of what makes a distributed denial of service so tricky is it is coming from everywhere.
And that's why you can sort of see why in the initial architecture many, many years ago, the idea is, well, let's get a centralized view.
And then from that centralized view, we'll be able to determine with enough accuracy what is an attack.
And so part of what makes the Autonomous Edge, FlowTrackD, DOSD, and all this technology so fascinating is its ability to act autonomously against attacks.
Yeah, exactly. And by the way, a fun fact, we have a long-running debate between engineering and product whether...
Should the leading D be there or not? Exactly. Because a DOS attack could also be seen as a distributed DOS attack where n equals one.
But you could also flip it around.
Yes. And if you're old enough, DOS also reminds you of the ancient original PC operating systems.
Very good. So Manish, I think I'm going to try one more time, but if your connection is unstable, no big deal.
But I wanted to ask you, there's a very interesting engineering trade-off here on resource utilization.
So talk a little bit about what is the interesting thing that you have to technically balance when you're building these kinds of systems?
Yes. As I think Omer was mentioning, the scale at which we're receiving packets and processing these packets, there are two broad things that we have to balance.
One is the heavy analysis that goes into identifying malicious packets, and the other is optimizing our software to consume as few resources as possible on the edge, primarily because there are multiple things happening within each edge server.
So these two things need to be balanced at a scale of terabytes per second.
And the interesting thing is you can get very accurate, but then you'll chew up too much CPU.
And if you chew up too much CPU, you're going to waste the time of the Cloudflare resources and slow down our process.
Yes. It's a tricky thing.
And yeah, go ahead, Omer. Oh, sorry. I just wanted to add that a good example of what Manish mentioned is a capability that we call IP Jails.
So the higher you are in the OSI stack, layer seven HTTP attacks, the more expensive it is to mitigate attacks.
The more CPU you use, you need to decrypt traffic, there's more processing.
And so during volumetric HTTP attacks or highly volumetric, we want to save or spare CPU cycles and memory and traffic consumption.
And so we have this capability called IP Jails, which mitigates layer seven attacks at layer four, meaning instead of responding with a challenge page or a block or a rate limit, 429, 409, all those error codes, we'll just drop the packets in iptables in the Linux kernel.
And this is much more efficient and it allows us to mitigate highly volumetric attacks at scale without impacting performance.
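The trade-off described above, paying for layer seven processing only until an offender is identified and then dropping its packets cheaply at layer four, can be modeled in Python. This is a toy model under stated assumptions: the real mechanism drops packets in the Linux kernel, whereas here a dictionary of jailed addresses stands in for it, and the class name, the request limit, and the jail duration are all hypothetical.

```python
class IpJail:
    """Hypothetical store of source IPs whose packets are dropped early."""

    def __init__(self, jail_seconds):
        self.jail_seconds = jail_seconds
        self._release_at = {}

    def jail(self, ip, now):
        self._release_at[ip] = now + self.jail_seconds

    def is_jailed(self, ip, now):
        release = self._release_at.get(ip)
        if release is None:
            return False
        if now >= release:
            del self._release_at[ip]  # sentence served
            return False
        return True


def handle(jail, request_counts, src_ip, now, limit):
    # Cheap layer four check first: no decryption, no HTTP parsing.
    if jail.is_jailed(src_ip, now):
        return "dropped at L4"

    # Expensive layer seven path: count the request and rate limit.
    request_counts[src_ip] = request_counts.get(src_ip, 0) + 1
    if request_counts[src_ip] > limit:
        jail.jail(src_ip, now)
        return "429 at L7"
    return "200 at L7"


jail = IpJail(jail_seconds=60)
counts = {}
attacker = "203.0.113.7"  # documentation-range address

results = [handle(jail, counts, attacker, now=0, limit=2) for _ in range(4)]
print(results)
```

Only the first request over the limit costs a full HTTP response. Every later packet from that address is discarded before any decryption or parsing takes place, until the jail period ends.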
It's amazing. Yeah. And one of the other things Manish mentioned was just the scale and scale comes up all the time when you start talking about DOS.
We see a lot of dashboards internally at Cloudflare, like there's charts and graphs that we are constantly throwing at each other.
We tried to blog about them, but one of the things that we really wanted to do, one of the original visions of the company was to give some of that visibility back to our customers.
That's what we released last year when we introduced a product called Radar.
Radar.Cloudflare.com is just Cloudflare's insights, what we know about what's happening on the Internet and what we can share.
What is the intersection between DOS and Radar, Omer? What have we been able to share with our customers about what's happening on the Internet at large?
Yeah, so we've been working extensively with the Radar team.
They're mostly based in Lisbon, one of our growing offices or mostly growing offices.
Can't wait to go there and visit them once we can.
Me too, actually. I was supposed to be there right when a lockdown happened last year.
I've been keen to go. Definitely. So yeah, they've been doing fantastic work on the Radar dashboard showing...
It's publicly available.
It's free for everyone. You can access it at Radar.Cloudflare.com. It has very interesting trends on, for instance, traffic to delivery websites and apps, and also a lot of COVID-related information, traffic to health websites, attacks around the world.
And you can filter it by country to see the trends in those countries.
And one of the interesting things is that we're able to show those data points, those insights, those trends without logging any user data.
So we're very privacy conscious.
And we recently partnered with the Radar team to deliver and announce a Radar DDoS page, which we launched just a few days ago, which provides a real deep dive into the DDoS trends around the world for the application layer and the network layer.
So for our Magic Transit and Spectrum customers.
One of the interesting trends is that we saw that the telecom industry, telecommunications, was the most targeted industry in Q1.
And this is a significant jump from the previous quarter, where they were, I think, fourth or sixth place.
After the telecom industry was the consumer services, the security investigations industry, and also cryptocurrency.
They managed to pass cryptocurrency as well.
So that's interesting. So that's Radar telling you that these are where we see attacks by category of destination of where we believe the attack was going.
And can also show you some of the magnitudes and the frequency.
Those also come up in the reports that we do on blog.Cloudflare.com. Guys, thank you so much.
I can't believe how fast the time went. We'll definitely check in again soon to hear about all the latest stuff.
But so the autonomous edge, the Radar tracking, we didn't even get to some of the more interesting stuff that we're talking about in network analytics.
It's so much fun talking to both of you about all the great work that we're doing to protect Cloudflare and protect our customers.
And so with that, we'll wrap up. Thanks for watching everyone. And I'll see you next time on Latest from Product and Engineering.
Thanks, Manish. Thanks, Omer. Thank you.
Bye-bye. Thanks. Bye-bye.
Bye-bye.
Bye-bye.
Bye-bye.