Latest from Product and Engineering
Presented by: Usman Muzaffar, Rajesh Bhatia, Alex Forster
Originally aired on November 12, 2021 @ 11:30 PM - 12:00 AM EST
Join Cloudflare's Head of Engineering, Usman Muzaffar, for a quick recap of everything that shipped in the last week. Covers both new features and enhancements on Cloudflare products and the technology under the hood.
Transcript (Beta)
All right, welcome everybody to another edition of the latest from Product and Eng.
I'm Usman Muzaffar, Cloudflare's Head of Engineering. Jen Taylor is not with us today, and neither are any of the product managers.
Instead, I'm really delighted to introduce two of my engineering teammates, Rajesh Bhatia and Alex Forster.
Rajesh, why don't you say a few words about yourself? Sure, hi my name is Rajesh Bhatia.
I'm an engineering manager and I've been at Cloudflare for just over two years now.
My team is called Application Services. We build customer-facing common abstractions and some engineering platform services.
Yeah, and we're gonna talk a little bit more about that as we get to some of the cool stuff that your team has shipped. And I was just telling you, I felt like you'd been with us for three years, Rajesh.
I feel like I've known you for a very long time and we've been working together for ages.
I think it's a testament to how much stuff we get done at Cloudflare.
I tell all my teammates that you become an old-timer pretty quick.
That's right, you become an old timer fast. You're only new for a week here.
Like there's a whole new class coming right behind. Alex, how about you say hi and introduce yourself?
Yeah, hi my name is Alex. I work on the DDoS mitigation team which sounds like what it is.
We run all of Cloudflare's classic DDoS mitigation systems, like GateBot, which is pretty famous from our blogs, the newer system DOSD, and now what we're here to talk about, FlowTrackD, which is something I've been working on for the past six months or so.
Excellent and how long have you been with us Alex?
So I've been here coming up on three years. Started off in DDoS and been on this team ever since.
Another cool thing about the DDoS team is it's got members from all of our offices. You are normally based out of our Austin office, and now, like all of us, you're based out of your home.
Yeah, yeah. So great, actually this is a perfect segue.
I wanted to spend a few minutes talking about DDoS and you know one of the first things is, is it DDoS?
Is it DOS? Like what do these letters even stand for?
Let's just start from the very beginning. Let's see if we can build ourselves up to the technology that you guys actually shipped and that you yourself worked on last week.
I'd say even internally we were kind of split.
I see it both ways, you know, even in our internal documentation. DOS is denial of service and the extra D is for distributed, so the difference there is really how many attacker bots are participating in the attack.
They're used pretty interchangeably though.
Most of the time I think I hear the team members say I'm on the DOS team.
I still have the old habit of saying distributed denial of service attack to distinguish it.
The truth is at this point they're all distributed.
I think that's part of what people are acknowledging that the world and the technology stack has become so distributed.
But let's talk about this for a second. Do we get more than one attack a day? How often does this happen?
Is this an unusual thing? Is this a rare thing to get an attack of this kind?
We get plenty of attacks a day. Across our customer base we probably handle hundreds of you know significant attacks that would otherwise you know do some real damage to our customers.
And again, the idea of these attacks: normally, if you think of a security attack, you think of somebody malicious trying to attack a web property.
I'm always picturing in my mind someone trying to break into the database, someone trying to deface the website.
You know, cases where they've found a vulnerability and they're trying to take advantage of that, exploit that.
Denial of service has a much more sort of basic idea which is just I want you offline.
I'm just gonna try to take your site offline, and the basic strategy is: let's just overwhelm the website, the server, with complete garbage to the point where it just can't do anything about it. And somehow Cloudflare is able to identify and block that.
And so yeah just expound on that for a second. Yeah exactly. So most of these attacks are relatively primitive.
It's more that you have control over lots of computers and you can use them to generate lots and lots of garbage, so that even if the victim is able to distinguish the garbage from the good themselves, it's just so much garbage that they can't keep up with it.
And so that means you need a service like Cloudflare to you know put many thousands of computers in front of your computer to filter the garbage for you.
Right and so some of these famous attacks that we read about where there's a botnet.
What they're talking about is that a piece of malware has managed to let an attacker remote-control the computers of many thousands of unwitting users.
Like people don't realize that their computer is part of the attack.
But their computer is gonna start sending requests as if they opened their browser and clicked on this website.
But even worse, it's not even legitimate language.
It's not like the human analogy. It's not like someone actually knocking on your door and saying hi I'd like to sell you something.
It's like someone just threw a rock at your door and ran away.
Like there's not even a real conversation going on there.
But it's enough that you have to answer the door.
And these days it's not just computers. Not just desktop computers or laptops.
It can be your smart light bulb. It can be your smart thermostat. It can be any number of smart devices that are becoming more and more popular.
Anything that can talk to the Internet is capable of participating in one of these attacks.
So let's talk about the attack itself. When we mentioned the attack, what is the metric that we have on it?
What are the units on an attack? So there are two ways to measure.
The one that people are more familiar with is probably bits per second.
So your home Internet connection is advertised as, like, 50 or 100 megabit Internet speed.
That is the size of the pipe, if you just think of the plumbing analogy of the Internet.
The Internet is a series of tubes. That's the diameter of the tube.
The other way to measure attacks is in packets per second. A packet is the smallest unit of communication on the Internet.
It represents a chunk of information, be it part of a web page, part of this audio conversation, anything going over the Internet as one discrete unit.
A packet is something that is of varying size.
So not all packets are the same size.
When you measure in bits per second, you're measuring the sum of all the packets: how big they are together, and how much of the pipe they fill.
What's the total size of the ocean?
What's the total size of the thing that is coming at me? Yeah, and so I would wonder why would you even want to measure in the other way?
The reason is because computers don't think so much in terms of the sizes of packets.
They think in the numbers of packets.
So if you have 10 small packets and one large packet, your computer actually does more work to handle the 10 small packets than it does to handle the one large packet.
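To put numbers on the two units, here is a minimal sketch in Python. The packet sizes (64 and 1500 bytes) and the one-second window are illustrative assumptions, not figures from the conversation.

```python
def rates(packet_sizes_bytes, seconds):
    """Return (packets_per_second, bits_per_second) for a traffic sample."""
    packets_per_second = len(packet_sizes_bytes) / seconds
    bits_per_second = sum(packet_sizes_bytes) * 8 / seconds
    return packets_per_second, bits_per_second


# Ten small packets versus one large packet, both observed over one second.
small_packets = [64] * 10   # assumed size of a small packet, in bytes
large_packet = [1500]       # assumed size of a large packet, in bytes

small_rates = rates(small_packets, 1.0)
large_rates = rates(large_packet, 1.0)
print(small_rates)  # more packets, fewer bits
print(large_rates)  # fewer packets, more bits
```

The ten small packets carry fewer bits than the one large packet, yet they count ten times higher in packets per second, which is the number that tracks how much per-packet work the receiver has to do.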
There are probably non-technical users who might be listening.
We're all joking, sometimes our parents like to listen to this, right?
So at least my parents aren't technical. Can I think of a packet as just like, it's a bunch of numbers.
It's almost like I fill out an index card. I love to use the post office analogy.
Post office analogy, let's hear it. Yeah, I think the Internet is very well modeled by the post office.
In fact, a lot of the terms come from post office terms.
A packet would be a letter, a letter that you put in your mailbox and someone comes and picks up and routes it through the mail sorting system and drives it to its destination.
It's got a source address and a destination address.
Exactly. The return address is on the envelope, all that stuff. It's just the Internet can send a letter in 10, 20, 100 milliseconds where it takes the post office a few days.
So one other thing I can offer as a metaphor, especially for some of the modern people in the audience, is Twitter. For example, you want to share an in-depth point or tell a long story.
You kind of break up that text into as many tweets as it takes and you post it as a series of tweets.
So yeah, I consider each one of those like a packet as well.
Exactly. And so the conversation requires a whole series of letters going back and forth.
This Zoom call right now is many, many packets all being assembled in the right order.
And in fact, getting back to the thing where we were saying, I think we were just recently blogging about how we saw one of our biggest DDoS attacks at 754 million packets per second.
So that's enough computers collectively.
The attacker controlled enough computers to try to marshal 754 million packets per second trying to come at Cloudflare.
And again, those packets are coming to us individually.
And what's the name of the technology here?
IP addresses are the source and destination address. And so I'm thinking these are IP packets or TCP packets.
Help me out here. So these are mostly UDP packets.
It depends on the types of attacks. This particular large attack was, I believe, mostly UDP.
And it was, yeah, it was hundreds of thousands of people's laptops, people's smart devices, all controlled by one attacker, sending as many packets as they could as quickly as they could in order to try to just overwhelm the target victim.
And so TCP and UDP, these terms, Transmission Control Protocol, User Datagram Protocol, these are different flavors of how packets can be structured that make them good for different applications.
And one of the things we like to joke about is that there's a very famous way computers say hello in TCP.
We were just joking about this on the call because this is such a famous thing that humans who invented the computer protocol use it sometimes when they talk to each other.
And so I'll tell the story and you jump in, which is basically two computers need to talk to each other.
So one of them says SYN, meaning synchronize. I would like to start talking to you.
Hey, Alex, hey Rajesh, listen to me. I want to start talking to you.
And you respond by going SYN ACK, which means ACK, I acknowledge your request to synchronize, and I in turn would like to synchronize with you.
So it goes SYN, SYN ACK, and then I respond with ACK.
So sort of this three-way dance. Would you like to dance with me? I certainly would.
Would you like to dance with me? Yes, I certainly would. Okay, then let's boogie, right?
And so- Exactly. The three-way handshake is the fundamental thing.
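The three steps described above can be sketched as a tiny state machine. This is a simplified illustration: the state labels and the `handshake_state` function are made up for this example, and real TCP also tracks sequence numbers, timeouts, and retransmissions.

```python
# Allowed steps of the three-way handshake: (state, sender, flag) -> next state.
# The state names are simplified labels for this sketch.
TRANSITIONS = {
    ("CLOSED", "client", "SYN"): "SYN_SENT",
    ("SYN_SENT", "server", "SYN-ACK"): "SYN_ACKED",
    ("SYN_ACKED", "client", "ACK"): "ESTABLISHED",
}


def handshake_state(segments):
    """Walk a list of (sender, flag) pairs and return the resulting state."""
    state = "CLOSED"
    for sender, flag in segments:
        step = (state, sender, flag)
        if step not in TRANSITIONS:
            return "INVALID"
        state = TRANSITIONS[step]
    return state


complete = [("client", "SYN"), ("server", "SYN-ACK"), ("client", "ACK")]
print(handshake_state(complete))             # full handshake
print(handshake_state([("client", "ACK")]))  # an ACK with no SYN before it
```

An ACK that shows up without the two earlier steps does not fit the dance, which is the property an ACK flood abuses.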
And of course, the joke is, only computers would spend so much time talking about the conversation before they actually have the conversation.
And of course, that's part of the reason why if you ask an engineer, hey, are you ready?
They might say ACK, A-C-K, that's it, just to imply.
And as we're preparing for this call, Rajesh did that.
And Amy was helping coordinate the call, goes, Rajesh, are you okay?
Yeah, it sounds like- Yeah, I was like, wow, I should have clarified that. Yeah, exactly.
So the ACK just- What is an attack then? For example, do you have an ACK packet flood?
What does that really- Well, so yeah. So TCP is, in general, the difference between TCP and UDP is that TCP is reliable.
It's kind of like first-class postage.
It will get there. You've got tracking numbers and all that kind of stuff.
So all this control on it to make sure it gets there. It's, yeah. There's a whole process to make sure it gets there.
Whereas UDP is just, you send the packet and hope it gets there.
Hopefully. Maybe it will. Most of the time it does, but sometimes it doesn't.
And so an ACK flood is an attack where a certain message in the TCP protocol is being sent over and over.
Thank you. Yes, I'm interested in talking to you, right?
Yeah, yeah. It's that part. It's saying, yes, I would like to talk to you in response to you asking to talk to me.
And flood is exactly as it sounds.
It's just a lot of them. Yeah. It's way more. It's saying, yes, I'd like to talk to you.
Yes, I'd like to talk to you. Yes, I'd like to talk to you.
And that's the interesting thing. In isolation, there's nothing wrong with saying yes, I'd like to talk to you, but sending 754 million of them a second, that's a problem.
And just responding to, okay, I heard you, I know you want to talk to me, is going to take down the victim's website.
So that brings us, we can fast forward now.
So we've been in the business of building denial of service products, sorry, anti-denial of service products that help defend against this stuff.
And we do that by watching the traffic and aggregating everything we can tell, and then having our edge computers block this kind of traffic.
And last year we shipped a new product called Magic Transit, which was very cool, which let our customers expose their servers, their Internet properties at a very low level.
Basically, we would serve their addresses.
So again, if we take the postal analogy, it's like, even though the letter still says 123 Main Street and the destination's house has an address, it hasn't changed.
The rest of the world thinks that 123 Main Street is now Cloudflare's.
So it's coming to us. So everything comes to us, which means your team, the DOS team, all of a sudden saw a whole new class of attacks.
Right. Yeah. So it's kind of like a forwarding address.
If you've ever moved and set up a forwarding address, we're the new destination, where your mail goes instead.
The return conversation doesn't go through you.
So it's one way only. So you only see half of the conversation.
And this part is where the postal service analogy breaks down a little.
So what was the engineering challenge? Now we've caught our audience up.
What was the engineering challenge in front of you when you started working on what we now call FlowTrackD?
Yeah. It's a pretty unique problem. It's not a problem that many companies have, where we sit in the middle of...
A lot of companies sit in the middle of their customers and their content.
But to sit in the middle and only be able to hear one side of the conversation is a unique challenge.
It would be as if literally you were observing someone had a conversation with someone else, but you couldn't hear the other person.
Basically the annoying thing of being on a bus and hearing half of a cell phone conversation.
Yeah. Yeah. Yeah. Perfect analogy.
And so the challenge for us is to, by only hearing that one side of the conversation, figure out whether something is an attack or not.
And there's a lot of context missing there if you can't hear both sides.
In the TCP analogy, if I hear someone come to us and say, yes, I agree to talk to you.
I never heard the other side say, I would like to talk to you.
I never heard the other guy say, please talk to me.
So there's a pretty tough problem there of figuring out whether...
Is this legitimate? Yes, I would like to talk to you now. Because if you block it, the conversation's over and correctly over, right?
Correct. Yeah. You're dropping legitimate traffic.
You got to get this right. Otherwise you're going to block the website incorrectly.
Yeah. And so that was the problem we were faced with.
And it turned out to be more solvable than honestly I expected going in.
We were initially afraid that some of these kinds of attacks, we were going to have to ask customers to just rate limit the number of packets that come in and things like that.
And just not really provide a great solution.
But with FlowTrackD, we've been able to come up with a bunch of different algorithms and heuristics to only listen to one side of the conversation and pretty reliably tell whether or not it's real.
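As a rough illustration of what "only listening to one side" can mean, here is a minimal sketch of one possible heuristic. It is an assumption made for this example and not FlowTrackD's actual algorithm: the `IngressFlowTracker` class and its rules are hypothetical. The idea is to remember connections whose opening SYN was seen on the inbound side and to treat later packets for unknown connections as suspect.

```python
class IngressFlowTracker:
    """Toy one-sided tracker: sees client-to-server packets only (hypothetical)."""

    def __init__(self):
        # Flows are identified by (source IP, source port, dest IP, dest port).
        self.known_flows = set()

    def observe(self, src_ip, src_port, dst_ip, dst_port, flag):
        """Return "pass" or "drop" for one inbound packet."""
        flow = (src_ip, src_port, dst_ip, dst_port)
        if flag == "SYN":
            # A client asking to start a conversation: remember the flow.
            self.known_flows.add(flow)
            return "pass"
        if flag in ("FIN", "RST"):
            # End of a conversation: forget the flow after this packet.
            was_known = flow in self.known_flows
            self.known_flows.discard(flow)
            return "pass" if was_known else "drop"
        # ACK or data: only plausible if we saw this flow start.
        return "pass" if flow in self.known_flows else "drop"


tracker = IngressFlowTracker()
client = ("198.51.100.7", 50000, "203.0.113.10", 443)  # documentation addresses
print(tracker.observe(*client, "SYN"))
print(tracker.observe(*client, "ACK"))
print(tracker.observe("198.51.100.99", 40000, "203.0.113.10", 443, "ACK"))
```

A real system would also need expiry of idle flows, handling of connections that started before tracking began, and bounded memory; this sketch ignores all of that.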
That's amazing. Fantastic work, Alex.
It's really cool stuff. And I encourage people who followed along enough here and want to understand how this works to read the blog that the DOS team put out last week.
Because it's really great stuff and it motivates the problem.
Rajesh, I'd like to switch over to you. Because like you mentioned at the beginning of the call, application services team is in the business of providing services for the other engineering teams at Cloudflare.
So like what?
Give me an example of some of the things that you guys own and have to manage.
Sure. So if you look at our dashboard today, we have services that are customer facing.
We kind of commonly abstract those services. So an example would be audit logs, where for your security as a customer, you can make changes to your account, but you need to be able to see who, when, where, and what was changed.
So those are the things that you can... So when I log into the Cloudflare dashboard and I go to the audit log section, that's your handiwork.
That's your team's handiwork.
That is my team's handiwork. So again, the dashboard is the control configuration panel that allows customers to make any configuration changes around.
So audit logs is one good example. The other one is notification center.
And we can talk more about that, which is the way customers are able to set up their own custom notifications and alerts, so that they're warned if a particular event they care about happens.
So that's the system that we've built. Yeah. So how does that work?
Well, give me an example of the kind of thing it might watch and what does it do when it notices something interesting?
So one example that we recently launched is an event that we implemented on it called usage-based billing alerts.
You can configure a threshold for your usage, and based on the alert that is configured, it will recognize when you have exceeded that threshold.
So if I want to say, I want to make sure I don't use more than 10 bucks of Cloudflare this month.
I can do that.
And as soon as I get to nine or something like that, I'll get notified in some way.
Yeah. You will then get notified during the billing period, and you have the flexibility to change the thresholds the way you want to. That's a good feature because that way you can keep track of how you want to manage your billing cycles for the products that you're using.
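The threshold idea can be sketched in a few lines. The function name `usage_alert`, the dollar units, and the message wording are assumptions made for illustration, not the product's actual behavior.

```python
def usage_alert(usage_dollars, threshold_dollars):
    """Return an alert message once usage reaches the threshold, else None."""
    if usage_dollars >= threshold_dollars:
        return (
            f"Usage ${usage_dollars:.2f} has reached "
            f"your ${threshold_dollars:.2f} threshold"
        )
    return None


# A customer who wants a warning at 9 dollars, ahead of a 10 dollar budget.
print(usage_alert(4.25, 9.00))
print(usage_alert(9.50, 9.00))
```

Setting the threshold a little under the real budget, as in the nine-versus-ten example above, leaves room to react before the limit is hit.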
Can it also tell me if something bad's about to happen?
Like Alex's team notices that, wait a second, you might be getting a DOS attack.
Let's tie all this together. Does Alex's team benefit from the kind of work you're doing?
Most definitely. It's so interesting that you're talking about it because we've been working on that exact alert right now, this quarter.
And basically it will be able to tell us when the attack started, when it ended, and if it breached a particular traffic threshold.
So yeah, those are the kind of use cases where you should definitely be able to use our system.
As a team, we're really excited about this because we do a lot for customers and right now it goes pretty unnoticed.
There's a lot of bad stuff that happens to our customers that we just swallow and don't really bother telling anybody about.
Now we can notify them so they can take action if necessary, or at least be aware.
And again, internally, as the person who's responsible for making sure all these teams work together, it's very important to me that Alex says, yeah, Rajesh has built something that is easy for me to use because that's how we make sure that all the different product teams at Cloudflare can easily put something in.
So Rajesh, tell us a little bit about the framework of this thing.
What is it capable of doing?
Yeah. So like I mentioned, there are different events that you can definitely configure.
And touching the point where you mentioned, Usman, that it's making it easy.
We've spent a lot of effort this quarter to make sure that we make the onboarding process much simpler.
And we've learned from some of our customer facing services that we own.
So that we've really spent a lot of time trying to improve that.
With respect to what it is capable of doing, you can set up different event types as we onboard new event types.
You can leverage the framework that we've already built.
And then you can also implement additional delivery types.
So for example, right now, we support email as a form of delivering. And we're working on some exciting new ones that we can't quite talk about.
But popular services, which is the reason Rajesh is here is that we're code complete on something.
But popular notification services, we want to be able to support them.
But do I have the right picture in my head, Rajesh? This thing is basically arbitrarily flexible on what you can feed it.
And arbitrarily flexible on how it can notify.
And you can mix and match. So I can say, I want an email for the DOS alert.
But I want a text message for the billing alert. Can I set it up that way? Yes, you can definitely set it up that way.
And the way we've been able to accomplish that is to build on top of services that our team already abstracts.
For example, message buses.
Think of it like a high performance data pipeline. And that's how we are able to push.
Alex's team, if he wants to push a particular event, they just have to talk to the message bus layer and not have to worry about the inner workings of this.
As far as they're concerned, they're just putting it in the pipe and closing the door.
And then they'll trust that it'll make its way through the system and not have to worry about what it actually can do.
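The mix-and-match picture can be sketched as a small in-process publish/subscribe bus. Everything here is hypothetical: the `MessageBus` class, the event names, and the `send_email` and `send_text` handlers stand in for whatever the real pipeline and delivery services look like.

```python
from collections import defaultdict


class MessageBus:
    """Toy in-process bus: producers publish events, handlers deliver them."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        # The producer does not know or care how the event gets delivered.
        for handler in self._handlers.get(event_type, []):
            handler(event_type, payload)


outbox = []  # stands in for real email and text-message delivery


def send_email(event_type, payload):
    outbox.append(("email", event_type, payload))


def send_text(event_type, payload):
    outbox.append(("text", event_type, payload))


bus = MessageBus()
bus.subscribe("ddos_attack", send_email)   # email for the DOS alert
bus.subscribe("billing_usage", send_text)  # text message for the billing alert

bus.publish("ddos_attack", {"status": "started"})
bus.publish("billing_usage", {"usage_dollars": 9.5})
print(outbox)
```

The producing team only calls `publish`; which delivery type fires for which event is decided entirely by the subscriptions.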
We never want to deal with SMTP.
You never want to deal with that. You got your own headaches to deal with.
You got 750 million packets per second to deal with. You got to trust Rajesh's team to make sure that thing gets to the right place.
In fact, we have the email service and we call it the Cloudflare postal service, just riffing on the analogy.
Is that right? I didn't even realize that. That's fantastic. That's great.
One other term comes up a lot when we talk about the architecture of a notification service, which is a webhook.
Can you tell us a little bit about what a webhook is and how Cloudflare's dashboard and API implement that?
Sure. Think of it like a mechanism to allow an external service to be notified of events in your account.
The other way to think of it, so webhooks are like a phone number that Cloudflare calls to notify you of activity in your account.
This helps because you are always aware of what you want to...
What you're interested in. Yeah. I think there's a software design principle here that says don't...
The Hollywood principle.
Hollywood principle. Don't call us, we'll call you. You're not important enough for you to call us.
Yeah. The idea is that it's interesting. Whereas the Hollywood principle, the joke is the arrogance.
Don't even bother calling us.
We'll let you know if there's anything important. It's quite the contrary here.
It's don't busy wait on a server because there's no point in you repeatedly asking over and over again, rather just register interest in something and we will call the hook back.
Give an example of a webhook that some of our customers have used.
Yeah. We have a webhook for SSL certificates. During renewal or expiry, you can listen to those webhooks and events and then make some decisions on your side based on...
We will definitely notify you when those events are happening and you can leverage that information to decide what you want to do.
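The "register interest and we will call you back" idea can be sketched as follows. The event name, the callback URL, and the payload shape are assumptions for illustration and do not describe Cloudflare's actual webhook format. The sketch only builds the HTTP requests; it does not send them.

```python
import json
import urllib.request

# Hypothetical registry: event type -> callback URLs the customer registered.
registrations = {}


def register(event_type, callback_url):
    """Record that callback_url wants to hear about event_type."""
    registrations.setdefault(event_type, []).append(callback_url)


def build_notifications(event_type, details):
    """Build one POST request per registered URL (requests are not sent here)."""
    body = json.dumps({"event": event_type, "details": details}).encode("utf-8")
    return [
        urllib.request.Request(
            url,
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        for url in registrations.get(event_type, [])
    ]


register("ssl_certificate.expiring", "https://example.com/hooks/cloudflare")
pending = build_notifications("ssl_certificate.expiring", {"days_left": 14})
for request in pending:
    print(request.get_method(), request.full_url)
```

The receiver never has to poll: it registers once, and the sender makes the call only when something actually happens.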
That's great.
That's great. Even with both of you guys from the engineering team, we've been talking about customer-facing things, but one of the things Cloudflare engineering of course does a lot of is worry about internal tools.
Neither of you is from the team that built this, but the Magic Transit team that we were alluding to earlier built an internal version of traceroute, and that is going to be available in a more useful way.
Alex, tell us a little bit about traceroute. What is traceroute?
When was the last time you tried traceroute? Is that something you use occasionally?
Is that something you only did in school? What's the deal with that?
Not so much these days. I definitely still use it at least a couple times a month.
Back in my network engineering days, it was indispensable. Traceroute is like GPS for the Internet.
It's how you figure out where you are in the Internet. The Internet is a massive interconnected web.
There's no central point. You need to, in an awful lot of situations, be able to figure out where you are in relation to others.
That's what traceroute lets you do. It's literally, I can traceroute from wherever...
If I'm on any kind of device that's on the Internet, I can say, can you show me, how would I get to...
I can type in any other web address. How do I get to nytimes.com?
How do I get to riotgames.com? It'll show me the path through routers.
Just like Google Maps, it gives you the path right there. The directions that the system is actually taking.
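For readers curious how the path gets discovered: traceroute sends probes with an increasing time-to-live (TTL), and each router that sees a probe expire reports back, revealing one hop at a time. The sketch below simulates that over a made-up path; the router names and the `simulated_traceroute` function are hypothetical, and no real network traffic is involved.

```python
def simulated_traceroute(path, max_hops=30):
    """Simulate TTL-based hop discovery over a known list of routers.

    `path` lists the routers in order, ending with the destination.
    A probe with TTL n expires at the n-th router, which identifies itself.
    """
    hops = []
    if not path:
        return hops
    for ttl in range(1, max_hops + 1):
        # The probe travels at most `ttl` hops before it expires or arrives.
        reached_index = min(ttl, len(path)) - 1
        hops.append((ttl, path[reached_index]))
        if reached_index == len(path) - 1:
            break  # the destination answered, so the trace is complete
    return hops


route = ["home-router", "isp-gateway", "transit-router", "destination"]
for ttl, router in simulated_traceroute(route):
    print(ttl, router)
```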
The cool thing is that Cloudflare affects traceroutes in all kinds of interesting ways because we become that final destination.
Suddenly, all those hops get reduced because you just have to get to the Cloudflare edge.
For that very same reason, you can imagine why the Magic Transit team, in particular, that's building these tunnels where you enter in one side and show up on the other one, has all kinds of needs to build a distributed traceroute that is able to run traceroute from any one of our edge centers and then tell us what the world looks like from that point of view.
That's one of the cool internal tools that they're building.
Another internal tool that we use heavily at Cloudflare is called Kubernetes.
Rajesh, you and I were just talking about how Kubernetes does a lot of magic for software engineering teams.
It came out of Google.
It's pretty popular. It's what powered the microservices revolution. You can define a small piece of software in a container and Kubernetes will take care of some of the magic of scaling it up and down.
Recently, the Kubernetes team at Cloudflare made it easier to tie arbitrary metrics to scalings.
What does that mean and why might that be useful to a team like you?
That is a lifesaver for us. We've been looking for that sort of feature and really thank the team that built that.
For example, just last week, we had a situation where our consumption lag really increased and we could have used auto-scaling of our pods, auto-scaling the services we run, so that we could have chewed through the lag much quicker.
With this feature that the team has released, we'll be able to definitely take a lot of advantage of that.
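Scaling on a metric such as consumer lag can be sketched with the proportional rule that the Kubernetes Horizontal Pod Autoscaler documentation describes: desired replicas = ceil(current replicas × current metric / target metric). Whether the internal feature works this way is an assumption; the lag numbers and the `max_replicas` cap below are made up for illustration.

```python
import math


def desired_replicas(current_replicas, current_lag, target_lag, max_replicas=20):
    """Proportional scaling rule: grow or shrink replicas with the metric ratio."""
    wanted = math.ceil(current_replicas * current_lag / target_lag)
    # Keep at least one replica and never exceed the configured ceiling.
    return max(1, min(wanted, max_replicas))


# Three consumers, lag at three times the target: scale out to nine.
print(desired_replicas(3, current_lag=30000, target_lag=10000))
# Lag well under target: scale back in.
print(desired_replicas(9, current_lag=2000, target_lag=10000))
```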
Excellent. Great. Well, I think we are close to time here. I'm just checking our time, 28 past the hour.
Guys, what a treat. So much fun to be able to talk to you.
I know I'm supposed to be going over all this stuff with Jen, but this was great.
I love being able to... Alex, thank you so much for joining from Austin and telling us all about the internals of this, and Rajesh, just great talking about the application services team.
Thank you, everyone, for watching. We'll see you next week.
Bye, all. Bye. Bye, everybody. Bye.
Bye.
Bye.