Latest from Product and Engineering
Presented by: Usman Muzaffar, Rajesh Bhatia, Alex Forster
Originally aired on August 14, 2020 @ 6:00 PM - 6:30 PM EDT
Join Cloudflare's Head of Engineering, Usman Muzaffar, for a quick recap of everything that shipped in the last week. Covers both new features and enhancements on Cloudflare products and the technology under the hood.
Transcript (Beta)
All right, welcome everybody to another edition of the latest from Product and Eng.
I'm Usman Muzaffar, I'm Cloudflare's Head of Engineering and Jen Taylor is not with us today and neither are any of the product managers.
Instead I'm really delighted to introduce two of my engineering teammates Rajesh Bhatia and Alex Forster.
Rajesh, why don't you say a few words about yourself? Sure, hi my name is Rajesh Bhatia.
I'm an engineering manager and I've been at Cloudflare for just over two years now.
My team is called Application Services, and we build customer-facing common abstractions and some engineering platform services.
Yeah, and we're gonna talk a little bit more about that as we get to some of the cool stuff that your team has shipped. And I was just telling you, I felt like you'd been with us for three years, Rajesh.
I feel like I've known you for a very long time and we've been working together for ages.
I think it's a testament to how much stuff we get done at Cloudflare.
I tell all my teammates that you become an old timer pretty quick.
That's right, you become an old timer fast. You're only new for a week here.
Like there's a whole new class coming right behind. Alex, how about you say hi and introduce yourself?
Yeah, hi my name is Alex. I work on the DDoS mitigation team which sounds like what it is.
We run all of Cloudflare's classic DDoS mitigation systems, like GateBot, pretty famous from our blogs, a newer system, DOSD, and now what we're here to talk about, FlowTrackD, which is something I've been working on for the past six months or so.
Excellent and how long have you been with us Alex?
So I've been here coming up on three years. Started off in DDoS and been on this team ever since.
Another cool thing about the DDoS team is it's got members from all of our offices, and so you are normally based out of our Austin office and now, like all of us, based out of your home.
Yeah, yeah. So great, actually this is a perfect segue.
I wanted to spend a few minutes talking about DDoS and you know one of the first things is, is it DDoS?
Is it DOS? Like what do these letters even stand for?
Let's just start from the very beginning. Let's see if we can build ourselves up to the technology that you guys actually shipped and that you yourself worked on last week.
I'd say even internally we were kind of split.
I see it both ways, you know, even in our internal documentation. DOS is denial of service and the extra D is for distributed, and so the difference there is really how many attacker bots are participating in the attack.
They're used pretty interchangeably though.
Most of the time I think I hear the team members say I'm on the DOS team.
I still have the old habit of saying distributed denial of service attack to distinguish it.
The truth is at this point they're all distributed.
I think that's part of what people are acknowledging that the world and the technology stack has become so distributed.
But let's talk about this for a second. Do we get more than one attack a day? Like, how often does this happen?
Is this an unusual thing? Is this a rare thing to get an attack of this kind?
We get plenty of attacks a day. Across our customer base we probably handle hundreds of you know significant attacks that would otherwise you know do some real damage to our customers.
And again, the idea of these attacks: normally, if you think of a security attack, you think of somebody malicious trying to attack a web property.
I'm always picturing in my mind someone trying to break into the database, someone trying to deface the website.
You know, things where they've found a vulnerability and they're trying to take advantage of that, exploit that.
Denial of service has a much more sort of basic idea which is just I want you offline.
I'm just gonna try to take your site offline. And the basic strategy is let's just overwhelm the website the server with complete garbage.
It's to the point where it just can't do anything about it, and somehow Cloudflare is able to identify and block that.
And so yeah just expand on that for a second. Yeah exactly.
So most of these attacks are relatively primitive. It's more that you have control over lots of computers and you can use them to generate lots and lots of garbage, so that even if the victim is able to distinguish the garbage from the good, it's just so much garbage that they can't keep up with it.
And so that means you need a service like Cloudflare to you know put many thousands of computers in front of your computer to filter the garbage for you.
Right and so some of these famous attacks that we read about where there's a botnet.
What they're talking about is a piece of malware that has managed to let an attacker remote-control the computers of many thousands of unwitting users.
Like people don't realize that their computer is part of the attack.
But their computer is gonna start sending requests as if they open their browser and clicked on this website.
But even worse they're not even legitimate language.
It's not like the human analogy. It's not like someone actually knocking on your door and saying hi I'd like to sell you something.
It's like someone just threw a rock at your door and ran away. Like there's not even a real conversation going on there.
But enough that you have to answer the door.
And these days it's not just computers. Not just desktop computers or laptops.
It can be you know it can be your smart light bulb. It can be your smart thermostat.
It can be, you know, any number of smart devices that are becoming more and more popular.
Anything that can talk to the Internet is capable of participating in one of these attacks.
So let's talk about the attack itself.
When we mentioned the attack, what is the metric that we have on it? Like what are the units on an attack?
So there are two ways to measure. The one that people are more familiar with is probably you know bits per second.
So your home Internet connection is advertised as like 50, 100 megabit Internet speed.
That is the size of the pipe.
That is, if you just think of the plumbing analogy of the Internet.
You know Internet is a series of tubes. That's the diameter of the tube.
That's right. The other way to measure attacks is in packets per second. And so of course you know a packet is the smallest unit of communication on the Internet.
It represents a chunk of information be it you know part of a web page, part of this audio conversation, anything going over the Internet as one discrete unit.
And a packet is something that is a varying size. So not all packets are the same size.
And so when you measure in bits per second, you're measuring the accumulation of all the packets, how big they are all together.
What's the total size of the ocean?
What's the total size of the thing that is coming at me?
Yeah, and so you might wonder, why would you even want to measure it the other way?
The reason is because computers don't think so much in terms of the sizes of packets.
They think in the numbers. So if you have ten small packets and one large packet, your computer actually does more work to handle the ten small packets than it does to handle the one large packet.
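To make the two units concrete, here is a small sketch of the arithmetic. The link speed and the two packet sizes are illustrative assumptions, not figures from any attack discussed here. The point is only that the same bits per second can mean very different packets per second.

```python
def packets_per_second(bits_per_second: float, packet_size_bytes: int) -> float:
    """Packets needed each second to fill a link at the given packet size."""
    bits_per_packet = packet_size_bytes * 8
    return bits_per_second / bits_per_packet


LINK_BPS = 100_000_000  # a 100 megabit connection (illustrative)

large = packets_per_second(LINK_BPS, 1500)  # assumed "large" packet size
small = packets_per_second(LINK_BPS, 64)    # assumed "small" packet size

print(f"1500-byte packets: {large:,.0f} per second")
print(f"64-byte packets:   {small:,.0f} per second")
```

Filling the same pipe with the small packets takes roughly 23 times as many packets, which is why packets per second says more about how hard the receiving computer has to work.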
For the non-technical users who might be listening.
We always joke that sometimes our parents like to listen to this, right?
So at least my parents aren't technical. So can I think of a packet as just, like, a bunch of numbers?
It's you know it's almost like I fill out an index card.
I love to use the post office analogy.
Yeah, I think the Internet is very well modeled by the post office.
In fact a lot of terms come from you know post office terms. A packet would be a letter.
A letter that you put in your mailbox and you know someone comes picks up and routes it through the mail sorting system and drives it to its destination.
Got a source address and a destination address. Exactly the you know the return address is on the envelope.
All that stuff. It's just that the Internet can send a letter in, you know, 10, 20, 100 milliseconds, where it takes the post office a few days.
So one other thing I can offer as a metaphor, especially for some of the more modern people in the audience, is Twitter, for example.
You want to share an in-depth point or tell a long story.
You kind of break up that text into however many tweets it takes and you post it as a series of tweets.
So yeah, I consider each one of those like a packet as well. Exactly, and so the conversation requires a whole series of letters going back and forth.
Like this Zoom call right now is many many packets all being assembled in the right order.
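The letter and tweet analogies can be sketched in a few lines. This is a toy model only: the `Packet` fields, the chunk size, and the addresses are assumptions made for illustration, and real protocols number and order their data differently.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    source: str       # the return address on the envelope
    destination: str  # where the letter is going
    sequence: int     # position of this chunk in the conversation
    payload: str      # the chunk of information itself


def split_into_packets(message: str, source: str, destination: str, size: int) -> list[Packet]:
    """Cut a message into fixed-size chunks, each carrying its own addresses."""
    return [
        Packet(source, destination, seq, message[start:start + size])
        for seq, start in enumerate(range(0, len(message), size))
    ]


def reassemble(packets: list[Packet]) -> str:
    """Put the chunks back in order, however they arrived."""
    return "".join(p.payload for p in sorted(packets, key=lambda p: p.sequence))


message = "You want to share an in-depth point or tell a long story."
packets = split_into_packets(message, "192.0.2.10", "198.51.100.20", size=16)
random.shuffle(packets)  # packets may arrive out of order
print(reassemble(packets) == message)
```

Each chunk travels on its own with a source and destination, and the receiver uses the sequence number to put the conversation back together.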
And in fact, getting back to the thing we were saying, I think we were just recently blogging about how we saw one of our biggest DDoS attacks at 754 million packets per second, right?
So that's enough computers collectively.
The attacker controlled enough computers to try to marshal 754 million packets per second trying to come at Cloudflare.
Yes, and again, all those packets are coming to us individually. And what's the name of the technology here? You know, IP addresses are the source and destination addresses.
And so I'm thinking these are IP packets, or TCP packets? Help me out here.
So these are mostly UDP packets. It depends on the types of attacks.
This particular large attack was, I believe, mostly UDP, and it was hundreds of thousands of, you know, people's laptops, people's smart devices, all controlled by one attacker, sending as many packets as they could as quickly as they could in order to try to just overwhelm the target victim.
And so TCP and UDP, these terms, you know, Transmission Control Protocol, User Datagram Protocol, these are sort of different flavors of how packets can be structured that make them good for different applications.
And one of the things we like to joke about is that there's a very famous way computers say hello in TCP.
We were just joking about this on the call, because this is such a famous thing that the humans who invented the computer protocol use it sometimes when they talk to each other.
And so I'll tell the story and you jump in. Basically, two computers need to talk to each other, so one of them says SYN, meaning synchronize: I would like to start talking to you. You know, hey Alex, hey Rajesh, listen to me, I want to start talking to you. And you respond by going SYN-ACK, which means: ACK, I acknowledge your request to talk synchronously, and I in turn would like to talk synchronously back to you.
So it goes SYN, SYN-ACK, and then I respond with ACK. So it's sort of this three-way dance.
Would you like to dance with me? I certainly would. Would you like to dance with me?
Yes, I certainly would. Okay, then let's boogie, right? And so exactly, the three-way handshake is the fundamental thing, and of course the joke is that only computers would spend so much time talking about the conversation before they actually have the conversation.
And of course that's part of the reason why, if you ask an engineer, hey, are you ready, they might say ACK, and that's it. And as we were preparing for this call, Rajesh did that. Amy, who was helping coordinate the call, goes, Rajesh, are you okay? Yeah, I was like, wow, I should have clarified that.
Yeah exactly.
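The three-way dance described above can be written down as a tiny state walk. This is a simplified sketch for illustration: the state names are borrowed loosely from TCP, and no real packets, sequence numbers, or timeouts are involved.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    SYN_SENT = "syn-sent"
    SYN_RECEIVED = "syn-received"
    ESTABLISHED = "established"


def three_way_handshake() -> list[str]:
    """Walk a client and a server through SYN, SYN-ACK, ACK."""
    client, server = State.CLOSED, State.CLOSED
    log = []

    # 1. The client asks to talk.
    client = State.SYN_SENT
    log.append("client -> server: SYN")

    # 2. The server acknowledges and asks to talk back.
    if client is State.SYN_SENT:
        server = State.SYN_RECEIVED
        log.append("server -> client: SYN-ACK")

    # 3. The client acknowledges the server's request.
    if server is State.SYN_RECEIVED:
        client = State.ESTABLISHED
        log.append("client -> server: ACK")
        server = State.ESTABLISHED

    assert client is State.ESTABLISHED and server is State.ESTABLISHED
    return log


for line in three_way_handshake():
    print(line)
```

Only after all three messages have been exchanged do both sides consider the conversation open.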
So what is an attack, then? For example, if you have an ACK packet flood, what does that really mean?
Well, so in general the difference between TCP and UDP is that TCP is reliable. It's kind of like first-class postage: it will get there, you've got tracking numbers and all that kind of stuff.
All this control on it to make sure it gets there. Yeah, there's a whole process to make sure it gets there, whereas with UDP you just send the packet and hope it gets there.
Maybe it will. Most of the time it does, but sometimes it doesn't. And so an ACK flood is TCP packets where, I guess, a certain message in the TCP protocol is being sent.
Yes, I'm interested in talking to you. Yeah, it's that part. It's saying, yes, I would like to talk to you, in response to you asking to talk to me. And flood is exactly as it sounds: it's just a lot of them. It's saying, yes, I'd like to talk to you, yes, I'd like to talk to you, yes, I'd like to talk to you.
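One simple way to picture why a flood of ACKs stands out is to remember which conversations were actually opened. The sketch below is a toy under stated assumptions: the `InboundAckChecker` class, its verdict strings, and the addresses are hypothetical, and it is not a description of how any Cloudflare system works. A real tracker would also need timeouts, memory limits, and handling for many more packet types.

```python
class InboundAckChecker:
    """Remembers which flows opened with a SYN and flags ACKs from flows that did not."""

    def __init__(self):
        # A flow is (source IP, source port, destination IP, destination port).
        self.opened = set()

    def observe(self, flow, flags):
        if flags == "SYN":
            self.opened.add(flow)  # someone asked to start talking
            return "allow"
        if flags == "ACK" and flow not in self.opened:
            return "suspicious"    # "yes, I'd like to talk" with no request before it
        return "allow"


checker = InboundAckChecker()
server = ("198.51.100.20", 443)

legit = ("203.0.113.5", 50000) + server
print(checker.observe(legit, "SYN"))  # the conversation starts normally
print(checker.observe(legit, "ACK"))  # this ACK belongs to a known conversation

# A flood: many ACKs from flows that never sent a SYN.
flood = [checker.observe((f"192.0.2.{i}", 40000) + server, "ACK") for i in range(5)]
print(flood)
```

In isolation each ACK looks harmless; it is the missing start of the conversation that makes it questionable.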
And that's the interesting thing: in isolation there's nothing wrong with saying yes, I'd like to talk to you, but sending 754 million of them a second, that's a problem.
And just responding to "okay, I heard you, I know you want to talk to me" is going to take down the victim's website.
So that brings us, we can sort of fast-forward now.
We've been in the business of building denial-of-service products, sorry, anti-denial-of-service products, that help defend against this stuff by watching for it, aggregating what we can tell, and then having our edge computers block this kind of traffic.
And last year we shipped a new product called Magic Transit, which was very cool, which let our customers expose their servers, their Internet properties, at a very low level.
Basically, we would serve their addresses.
So again, if we take the postal analogy, it's like even though the letter still says 123 Main Street, and the destination house's address doesn't change, the rest of the world thinks that 123 Main Street is now Cloudflare.
So everything comes to us, which means your team, the DOS team, all of a sudden saw a whole new class of attacks, right?
Yeah, so it's kind of like a forwarding address. If you've ever moved and set up a forwarding address, we're the new destination where your mail goes.
The return conversation doesn't go through you, so it's one way only. So you only see half of the conversation.
And so, yeah, this part's where the postal service analogy breaks down a little.
So now that we've caught our audience up, what was the engineering challenge in front of you when you started working on what we now call FlowTrackD?
Yeah, it's a pretty unique problem. It's not a problem that many companies have. A lot of companies sit in the middle of, you know, their customers and their content, but to sit in the middle and only be able to hear one side of the conversation is a unique challenge.
Yeah, it would be as if you were literally observing someone having a conversation with someone else, but you couldn't hear the other person. Basically the annoying thing of being on a bus and hearing half of a cell phone conversation.
Yeah, yeah, perfect analogy. And so the challenge for us is, by only hearing that one side of the conversation, to figure out whether something is an attack or not.
And there's a lot of context missing there if you can't hear both sides.
In the TCP analogy, if I hear someone come to us and say, yes, I agree to talk to you, I never heard the other side say, I would like to talk to you. Like, I never heard the other guy say, please talk to me, right?
So there's a pretty tough problem there of figuring out whether this is a legitimate "yes, I would like to talk to you."
Because if you block it, the conversation's over, right? Yeah, you're dropping legitimate traffic. Right, otherwise you're gonna block the website incorrectly.
Yeah, and so that was the problem we were faced with, and it turned out to be more solvable than, honestly, I expected going in.
We were initially afraid that for some of these kinds of attacks we were gonna have to ask customers to just rate limit the number of packets that come in, and things like that, and not really provide a great solution.
But with FlowTrackD we've been able to come up with a bunch of different algorithms and heuristics to only listen to one side of the conversation and pretty reliably tell whether or not it's real.
That's fantastic work, Alex. It's really cool stuff, and I encourage people who have followed along here and want to understand how this works to read the blog that the DOS team put out last week, because there's really great stuff in it and it motivates the problem.
But Rajesh, I'd like to switch over to you, because, like you mentioned at the beginning of the call, the Application Services team is in the business of providing services for the other engineering teams at Cloudflare.
So give me an example of some of the things that you guys own and have to manage.
Sure. So if you look at our dashboard today, we have services that are customer facing, and we commonly abstract those services.
An example would be audit logs, where, for your security as a customer, you can make changes to your account, but you need to be able to see who, when, where, and what was changed.
So when I log into the Cloudflare dashboard and I go to the audit log section, that's your handiwork? That is my team's handiwork.
So again, the dashboard is the control and configuration panel that allows customers to make any configuration changes. So audit logs is one good example.
The other one is Notification Center, and we can talk more about that, which is where customers are able to set up their own custom notifications and alerts, so that if a particular event happens that they want to be warned about, they're notified.
Yeah, so how does that work? Give me an example of the kind of thing it might watch, and what does it do when it notices something interesting?
So one example that we recently launched is an event type called usage-based billing alerts.
You can configure a threshold for your usage, and based on the alert that is configured, it will recognize that you have exceeded that threshold.
So if I want to make sure I don't use more than ten bucks of Cloudflare this month, I can do that, and as soon as I get to nine or something like that, I'll get notified in some way?
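The threshold idea can be sketched as follows. Everything here is hypothetical: the `UsageAlert` class, its method names, and the message text are invented for illustration and are not Cloudflare's API. The sketch assumes the alert should fire once per billing period, when usage first reaches the configured threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UsageAlert:
    """Hypothetical alert that fires once when usage reaches the threshold."""
    threshold_usd: float
    fired: bool = False

    def record_usage(self, usage_usd: float) -> Optional[str]:
        """Return a notification message the first time the threshold is reached."""
        if not self.fired and usage_usd >= self.threshold_usd:
            self.fired = True
            return f"Usage ${usage_usd:.2f} has reached your ${self.threshold_usd:.2f} threshold"
        return None

    def start_new_billing_period(self) -> None:
        """Re-arm the alert for the next billing period."""
        self.fired = False


alert = UsageAlert(threshold_usd=9.00)
for usage in [2.50, 7.00, 9.25, 9.80]:
    message = alert.record_usage(usage)
    if message:
        print(message)
```

Firing only once avoids sending a new notification for every bit of usage recorded after the threshold has already been crossed.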
Yeah, you will then get notified during a billing period, and you have the flexibility to change the thresholds the way you want to.
That's a good feature, because that way you are keeping track of how you want to manage your billing cycles for the products that you're using.
Can it also tell me if something's about to happen? Like, Alex's team notices that, wait a second, you might be getting a DOS attack. Let's tie all this together. Does Alex's team benefit from the kind of work you're doing?
Most definitely. It's so interesting that you're talking about it, because we've been working on that exact alert right now, this quarter. Basically, it will be able to tell us when the attack started, when it ended, and if it breached a particular traffic threshold.
So yeah, those are the kind of use cases where you should definitely be able to use our system.
As a team we're really excited about this, because we do a lot for customers and right now it goes pretty unnoticed. There's a lot of bad stuff that happens to our customers that we just swallow and don't really bother telling anybody about.
Notify them so they can take the action necessary, at least.
And again, internally, as the person who's responsible for making sure all these teams work together, it's very important to me that Alex says, yeah, Rajesh has built something that is easy for me to use, because that's how we make sure that all the different product teams at Cloudflare can easily put something in.
So Rajesh, tell us a little bit more about the framework of this thing. What is it capable of doing?
Yeah, so like I mentioned, there are different events that you can configure. And touching on the point you mentioned, Usman, about making it easy: we've spent a lot of effort this quarter to make sure that we make the onboarding process much simpler, and we have learned from some of the customer-facing services that we own. So we really spent a lot of time trying to improve that.
With respect to what it is capable of doing, you can set up different event types. As we onboard new event types, you can leverage the framework that we've already built, and then you can also implement additional delivery types.
So for example, right now we support email as a form of delivery. Yeah, and we're working on some exciting popular services, which is the reason Rajesh is here, is that we were code complete on something. But popular notification services, you know, we want to be able to support them.
But do I have the right picture in my head, Rajesh? This thing is basically arbitrarily flexible on what you can feed it and arbitrarily flexible on how it can notify, and you can mix and match. So I can say I want an email for the DOS alert but I want a text message for the billing alert. Can I set it up that way?
Yes, you can definitely set it up that way. And the way we've been able to accomplish that is to build on top of services that our team already abstracts. For example, message bus: think of it like a high-performance data pipeline.
Yeah. And that's how teams like Alex's team, if they want to push a particular event, they just have to talk to the message bus layer and not have to worry about the inner workings of this.
As far as they're concerned, they're just putting it in the pipe and closing the door, and then they'll trust that it'll make its way through the system and not have to worry about what happens after.
Definitely. Actually, we never want to deal with SMTP. You never want to deal with that. You've got your own headaches. You've got 750 million packets per second to deal with. You've gotta trust Rajesh's team to make sure that thing gets to the right place.
And I think you even call it, we have the email service and we call it the cloud postal service, just riffing on the analogy. That's fantastic. That's great.
Yeah. One other term that comes up a lot when we talk about the architecture of a notification service is a webhook. Can you tell us a little bit about what a webhook is and how Cloudflare's dashboard and API implement it?
Sure. So a webhook is, think of it like a mechanism to allow an external service to be notified of events in your account.
The other way to think of it is that webhooks are like a phone number that Cloudflare calls to notify you of activity in your account. So this helps because you are always aware of what you're interested in.
Yeah, right. I think there's a software design principle here, the Hollywood principle: don't call us, we'll call you. You're not important enough for you to call us.
Yeah, it's interesting. Whereas with the Hollywood principle the joke is the arrogance, like, don't even bother calling us, we'll let you know if there's anything important, it's quite the contrary here.
It's: don't busy loop, don't busy wait on a server, because there's no point in you repeatedly asking over and over again. Rather, just register interest in something and we will call the hook back.
And so give an example of a webhook that some of our customers have used.
So yeah, we have a webhook for SSL certificates. During renewal or expiry you can listen to those webhooks and events and then make some decisions on your site. We will definitely notify you when those events are happening, and you can leverage that information to decide what you want to do.
That's great. And even with both you guys from the engineering team, we've been talking about customer-facing things, but one of the things, of course, product engineering does a lot of is worry about internal tools.
And neither of you are from the team that built this, but the Magic Transit team that we were alluding to earlier built an internal version of traceroute, and that is going to be available in a more useful way.
Alex, tell us a little bit about traceroute. What is traceroute? When was the last time you used traceroute? Is that something you use occasionally? Is that something you only did in school? What's the deal with that?
Not so much these days. I definitely still use it at least a couple of times a month, but back in my network engineering days it was indispensable.
Traceroute is like GPS for the Internet. It's how you figure out where you are in the Internet.
The Internet is a massive interconnected web. There's no central point, and in an awful lot of situations you need to be able to figure out where you are in relation to others, and that's what traceroute lets you do.
So literally, if I'm on any kind of device that's on the Internet, I can type in any other web address, how do I get to nytimes.com, how do I get to riotgames.com, and it'll show me the path through routers, just like Google Maps?
Yeah, it gives you the path, the directions that the system is actually taking.
So the cool thing is that Cloudflare affects traceroutes in all kinds of interesting ways, right? Because we become that final destination, and so suddenly all those hops get reduced, because you just have to get to the Cloudflare edge.
And for that very same reason you can imagine why the Magic Transit team in particular, which is building these tunnels where you enter on one side and show up on the other, has all kinds of needs to build a distributed traceroute that is able to run traceroute from any one of our edge centers and then tell us what the world looks like from that point of view.
And so that's one of the cool internal tools that they're building.
And another internal tool that we use heavily at Cloudflare is called Kubernetes. Rajesh, you and I were just talking about how Kubernetes does a lot of magic for software engineering teams.
It came out of Google, it's pretty popular, and it's sort of what powered the microservices revolution. You can define a small piece of software in a container, and Kubernetes will take care of some of the magic of scaling it up and down.
And recently the Kubernetes team at Cloudflare made it easier to tie arbitrary metrics to scaling. What does that mean, and why might that be useful to a team like yours?
That is a lifesaver for us. We've been looking for that sort of feature, and we really thank the team that built that.
So for example, just last week we had a situation where our consumption lag really increased, and we could have used auto scaling of our pods and the services we run so that we could have chewed up the lag much quicker. With this feature that the team has released, we'll definitely be able to take a lot of advantage of that.
Excellent, great. Well, I think we are close to time here. Just checking, it's 28 past the hour. Guys, what a treat. So much fun to be able to talk to you.
I know I'm supposed to be going over all this stuff with Jen, but this was great.
Alex, thank you so much for joining from Austin and telling us all about the internals of this, and Rajesh, great talking about the Application Services team.
Thank you, everyone, for watching. We'll see you next week. Bye all. Bye bye, everybody.