Latest from Product and Engineering
Presented by: Usman Muzaffar, Brian Batraski, Dina Kozlov
Originally aired on February 24, 2021 @ 8:00 PM - 8:30 PM EST
Join Cloudflare's Head of Product, Jen Taylor and Head of Engineering, Usman Muzaffar, for a quick recap of everything that shipped in the last week. Covers both new features and enhancements on Cloudflare products and the technology under the hood.
Original Airdate: August 7, 2020
English
Transcript (Beta)
All right, good afternoon everyone from the San Francisco Bay Area. Welcome to the latest from product and engineering.
I'm Usman Muzaffar, Cloudflare's head of engineering.
Usually I'm joined by Jen Taylor, our head of product, to talk about everything that Cloudflare shipped in the last seven days.
But way better than Jen, we've got two of our star teammates here instead.
Jen's out of the office. I'm joined here with Rita Kozlov and Brian Batraski.
Rita, why don't you tell everyone who you are?
Rita. Sorry, I'm so sorry. Dina, Dina, I'm so sorry. Oh my gosh.
On live television. That was awful. Let's try this again. I'm Usman Muzaffar, head of engineering.
I'm here with Dina Kozlov, who is a product manager on Jen's team.
Dina, please introduce yourself. Hi, I'm Dina, not Rita. And I'm the product manager for DNS here at Cloudflare.
And Brian, I'll let you introduce yourself. Hi everybody.
My name is Brian Batraski. I am the product manager for the load balancing and health checks team.
I'm very happy to be here today. I'm so pleased to see both of you and a big apology for those of you who don't know.
There is in fact also another Kozlov at the company, but we're here with Dina today.
Dina, you're the product manager for DNS.
And in fact, your background there says it's not DNS. There's no way it's DNS.
It was DNS. So let's talk for a few seconds about DNS and why do these three letters show up so often when we're talking about the Internet?
What is DNS? What does DNS even stand for? Yeah, so DNS stands for domain name system.
And so every time you access a website, you go to google.com, easy. But what actually happens in the background is you're connecting to an IP address.
But as humans, we are not good at remembering IP addresses like 192.0.2.0.
Who can remember that?
But instead we can remember things like Cloudflare.com, google.com. And so DNS is what works in the background to map that hostname to an IP address so that you don't have to remember it.
Sounds a lot like a phone book. It's kind of like, I don't know your cell phone number, but I know I've spoken to you on the phone.
So like, I just typed Dina into my phone and I'm talking to you.
So like something knows.
I'm imagining basically a two column table with names on the left and numbers on the right.
And something has to somehow know that when I type in nytimes.com or if I type in Cloudflare.com, it turns into the magic number that means that website.
So good. Does that sound approximately right? Yeah, no, it is. And so Cloudflare is that phone book for the Internet.
For our customers, we store DNS records.
And so those are those exact mappings of when you go to Cloudflare.com, go to this IP address.
And so if I own Cloudflare and one day my address changes, then I change that on Cloudflare and then boom, the rest of the world knows, go to this new IP address.
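The phone book described above can be sketched as a small lookup table. This is only a minimal illustration, not how Cloudflare stores records: the hostnames, the addresses (taken from the reserved documentation ranges), and the `lookup` and `update_record` helpers are all assumptions made up for the example.

```python
# Hypothetical two-column "phone book": hostname -> IP address.
# The addresses come from ranges reserved for documentation.
records = {
    "example.com": "192.0.2.10",
    "www.example.com": "192.0.2.11",
}

def normalize(hostname):
    """DNS names are case-insensitive; drop the optional trailing dot."""
    return hostname.lower().rstrip(".")

def lookup(hostname):
    """Return the IP address for a hostname, or None if no record exists."""
    return records.get(normalize(hostname))

def update_record(hostname, new_ip):
    """Point a hostname at a new address; later lookups see the change."""
    records[normalize(hostname)] = new_ip

print(lookup("Example.com"))
update_record("example.com", "198.51.100.7")  # the owner's address changed
print(lookup("example.com"))
```

After the update, every later lookup for the same name returns the new address, which is the "boom, the rest of the world knows" step in miniature.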
Yeah. And so then there's another thing that's related to this.
So there's the phone book and I get that the phone book is also distributed, right?
Every company has their own little corner of the phone book that sort of says authoritatively, this is the number for the part of the Internet I control.
And then there's parts that are bigger and larger than that, that are more fundamental to the Internet.
And then we've also got something called a resolver, 1.1.1.1.
How does all this fit together? What are all these components? Yeah.
So DNS is like a tree. So if you think of something like Cloudflare.com, there's actually a dot at the end and that is the root server.
And so there are 13 organizations that run root servers.
And so when you think about it, when you're trying to find the IP address, you have this thing called the recursive resolver.
And that's what goes back and forth between all these different servers.
And so first it goes to the end.
And so it goes and asks the root, where is .com? And then it points to the server where .com is, the TLD.
And so then it goes to .com and it's like, Hey, where is Cloudflare.com?
And then it returns the server that the recursive resolver should go to.
Actually, it's not a one part conversation.
You kind of have to go back and forth between all the servers to say, what do you know?
Like, I only know the next clue. And then you have to go talk to that server to get the next clue.
So when we talk about this, we often say they or it, what is it?
What is the thing that is going back and forth doing all these queries for us?
So that's the resolver. That's what goes back and forth and gets the answers.
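The back and forth described above, where each server only knows the next clue, can be sketched as a loop that follows referrals from the root down. This is a toy model under stated assumptions: the server names, the `SERVERS` table, and the `resolve` function are hypothetical, and a real recursive resolver speaks the DNS wire protocol and caches answers.

```python
# Hypothetical delegation data: each server only knows the "next clue".
ROOT = "root-server"
SERVERS = {
    "root-server": {"referrals": {"com.": "com-tld-server"}},
    "com-tld-server": {"referrals": {"example.com.": "example-authoritative"}},
    "example-authoritative": {"answers": {"www.example.com.": "192.0.2.11"}},
}

def resolve(name):
    """Walk from the root down, following referrals until an answer appears."""
    if not name.endswith("."):
        name += "."  # the trailing dot stands for the root
    server, path = ROOT, []
    while True:
        path.append(server)
        zone = SERVERS[server]
        if name in zone.get("answers", {}):
            return zone["answers"][name], path
        # Look for a referral whose zone is a suffix of the queried name.
        next_server = None
        for suffix, target in zone.get("referrals", {}).items():
            if name == suffix or name.endswith("." + suffix):
                next_server = target
        if next_server is None:
            return None, path  # nobody further down knows this name
        server = next_server

address, path = resolve("www.example.com")
print(address, path)
```

The returned path lists every server the resolver had to ask, which is why keeping each of those steps fast matters before any HTTP request is even made.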
And so all this actually happens before you make an HTTP request to that site.
And so it is incredibly important for all of these steps to happen really, really quickly.
And so that is where Cloudflare comes in. We are actually the fastest authoritative DNS provider, and we have the fastest resolver.
And so if you put them together, you go through it incredibly quickly.
And so it's actually really cool because like, just for our authoritative DNS services, the lookup times are 12 milliseconds on average worldwide.
And when you pair that with our resolver, they actually help one another out.
And so for our customers who use Cloudflare as an authoritative DNS provider, their customers who use 1.1.1.1 as the resolver get faster query times because, for those records, all of that is done on the same server.
And we have our DNS software that runs in all 200 locations, which is what makes us so fast because we can be incredibly close to both the end user and the resolver.
So basically, we own the phone book, we own the phone book looker upper, the phone book looker upper and the phone book are all in the same house.
And that is how we have a ridiculously unfair advantage on how fast when something needs to get that number for a name how fast that lookup can happen.
Yep, that's it. So that's all. Cloudflare has been doing that for a long time before you and I joined the company.
We were already in the business of being an awesome authoritative DNS and the resolver was born in front of us.
So let's talk about DNS analytics, which is really the headline from last week.
So first of all, why would a customer need DNS analytics?
And what kind of analytics are they looking for?
Yeah, so a customer might want to answer different kinds of questions, they might want to know, how many DNS queries is my domain getting?
Where are these DNS queries coming from?
What are people querying for that doesn't exist?
And, for example, if you're looking to see whether no one's accessing some of your subdomains, you can use DNS analytics to see that.
And so recently, we've also added DNS analytics to our secondary DNS product.
And so that allows you to see how's my DNS traffic distributed between my two DNS providers.
So lots of questions, lots of answers.
It's not just sort of curiosity, like if you notice that nobody is accessing some corner of your DNS, that's basically like saying, maybe there's something broken about that.
Or like, maybe we don't have that site.
So it's not just curiosity, you can actually make business decisions based on these analytics.
And that's why it's so important to make them available to our customers.
So interestingly, the actual headline of the thing that we shipped last week is better DNS analytics for partners.
So let's parse that sentence. What's going on in there? Yeah, so DNS analytics are queried by our customers.
So that's either through the API, or when they look at the analytics on the dashboard.
But our partners, such as Datadog, are also heavy users of our DNS analytics.
And so Datadog runs their own analytics platform.
And so they're really focusing their attention on analytics. And so when you send this DNS information over to them, our customers can go there and see that information more granularly, they can modify it to fit their exact needs.
And so when you think of our regular customers who are querying the analytics API, or through the dashboard, those queries are usually for long periods of time, and those happen infrequently.
But our partners actually query our analytics API very frequently, and usually for short periods of time.
And so Armand on our DNS team, he set out to kind of make two big improvements.
He tried to make the query time faster, and to optimize for storage so that we can add more integrations later on.
Wow. So basically, if I'm following this, it's customers of ours who are actually representing whole sets of customers.
Like it's a whole service. It's not just one customer, it's potentially thousands of customers behind them.
And so they're not asking a single question.
They're not asking, is this subdomain there?
They're asking it across thousands of websites, what do the analytics look like?
And they're asking it all the time, every minute or two. How about now? How about now?
What's it look like now? What's it look like now? And our system was actually designed for the opposite kind of usage, which is a longer, more complex query over a single domain.
And so engineering side sort of said, for this to get faster for our partners, we have to restructure how this information is available.
And that's what the team worked on. And now it is in fact substantially faster, and is set up for more growth.
And I was just chatting with the engineers who are responsible for this, because it's always a great question.
How did you pull this off?
How did you make something just magically faster? We didn't buy more computers.
We didn't buy bigger hard drives. So what was the trick here? And the interesting thing is there's so many ways to shave performance off, to get time back.
In this case, it wasn't so much about making the query faster as it is about having it have to read less information to get the answer it wanted.
So what it was is really, at the time that we store the information, we store it with an eye to the fact that we know that our partners are going to be querying this in a particular way.
So we roll it up as we store it. So you can think of it as it's already tabulated.
It's already in a volume that makes sense. And we put that book on the shelf so that by the time the partner asked for it, we're like, actually, we've got that already for you, sir.
It's all packaged up. Here you go. And it's much faster and more scalable.
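The "roll it up as we store it" idea above can be sketched as counters that are updated at write time, so a short-window question only reads a handful of pre-built buckets. This is a minimal illustration under assumptions: the one-minute bucket size, the `rollups` table, and both function names are hypothetical and not Cloudflare's actual schema.

```python
from collections import defaultdict

# Hypothetical rollup table: (zone, start of minute) -> query count.
rollups = defaultdict(int)

def record_query(zone, timestamp):
    """Roll up at write time: bump the counter for the query's minute."""
    minute = int(timestamp) // 60 * 60
    rollups[(zone, minute)] += 1

def queries_between(zone, start, end):
    """Sum the pre-built buckets whose minute starts inside [start, end)."""
    first = int(start) // 60 * 60
    return sum(rollups.get((zone, m), 0) for m in range(first, int(end), 60))

# Four raw events collapse into three buckets at write time.
for ts in (5, 30, 65, 130):
    record_query("example.com", ts)

print(queries_between("example.com", 0, 120))
```

A frequent, short-window query then touches a few counters instead of scanning every raw event, which is the "read less information" trick described above.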
So that's great. That's what we shipped last week.
Thanks. Thanks, Dina. That was great. So this time, there's no way it's DNS.
It was DNS. All good news. Brian, with the Golden Gate background, how are you?
Nice to see you too. So Brian, you're in charge of LB. What does LB stand for? What does that mean?
Yeah, great question. So LB is load balancing. And so many customers will have traffic coming into their application, their website, whatever it may be.
And a lot of times, powering that application, that website is a server, or think of it as a computer.
And more times than not, you're going to need more than one of these computers to be able to handle or serve this traffic that's coming in.
And as you grow your company, you get more people interested. You need more of these servers.
And so you need something to help you make the decisions of, where do I send this traffic?
What computer does this go to, to make sure that the customer, the end user, when we want to go to Cloudflare.com, actually sees the right thing?
Yeah. And so is this complicated? If it's more than one, can't I just send 50% to server one and 50% to server two and call it a day?
Isn't that good enough?
What's so hard about this? I mean, it depends on your setup.
It depends on your infrastructure, right? It depends on the demands of your business.
And so in some cases, yeah, you can put half here, half there. You're good to go.
But in many other times, there are more items that come into play. Let's say you want to make sure you're giving the right content to people.
Make sure that the request, when you load Cloudflare.com, it loads immediately, right away.
And so if you get deep enough, it can get quite complex if needed.
But we at Cloudflare, we make it very easy.
We make it very easy to digest, to understand, to set up, to use.
So if we just go back to that performance thing, which is the same thing we were talking about with Dina just now a second ago, right?
If I'm just trying to get to the fastest one, how does Cloudflare have the ability to say, all right, so this customer's website is big enough that the customer has told us there's actually two different computers that are capable of serving that website.
And they've put those two different computers on opposite ends of the planet because they want to make sure that they're separate.
And there's one close to people in Asia, and there's one close to people in North America.
Is Cloudflare then able to say, wait a minute, I should go to the server that's closest to me?
Is that something it can do?
100%. And this is kind of the bread and butter of what we do with Cloudflare load balancing.
We are able to say, because of our massive Anycast global network, to say, hey, this request is coming from San Francisco.
Let's make sure we go to a pop in San Jose instead of going to a pop in South Africa or in Asia.
And so we have something called dynamic steering. And so we pick the path with the lowest latency, the lowest amount of time it's going to take to serve that request, to serve that traffic, to make sure that the website and user experience is as performant as possible.
So when you say the word dynamic there, that to me means it's allowed to change over time, almost like we have a stopwatch and we're measuring everything.
And we've got a leaderboard of who is the fastest.
And over time, maybe at 3 p.m., I pick this origin. But by 4 p.m., something's messed up on the way to that one.
So I'm going to pick a different one. Does that mean it can change its answer over time as it goes?
Absolutely. So we're doing this thing called active health checks, active health monitoring.
So we're always contacting the infrastructure, not only to see whether these computers are available to serve these requests, to load these websites, but also to measure the amount of time it takes to go from A to B and back.
And so by actively monitoring these different paths, these ever changing areas in the infrastructure on the Internet, we make sure that we're picking the path, the area that is fastest and most performant.
And you're absolutely right. It's ever changing. It's ever shifting.
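The leaderboard idea above, picking whichever available origin is currently fastest, can be sketched in a few lines. This is a simplified illustration, not Cloudflare's dynamic steering implementation: the `pick_origin` function, the origin names, and the latency numbers are all made up for the example.

```python
def pick_origin(health):
    """Return the healthy origin with the lowest measured round trip time."""
    healthy = {name: h for name, h in health.items() if h["up"]}
    if not healthy:
        return None  # nothing is available to serve traffic
    return min(healthy, key=lambda name: healthy[name]["rtt_ms"])

# Hypothetical health-check results, refreshed by periodic probes.
at_3pm = {
    "asia": {"up": True, "rtt_ms": 180.0},
    "north-america": {"up": True, "rtt_ms": 12.0},
}
at_4pm = {
    "asia": {"up": True, "rtt_ms": 180.0},
    "north-america": {"up": False, "rtt_ms": 12.0},  # probe failed
}

print(pick_origin(at_3pm))
print(pick_origin(at_4pm))
```

Because the choice is recomputed from the latest probe results, the answer at 4 p.m. can differ from the answer at 3 p.m. without anyone changing configuration.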
So that means buried inside the load balancing product is like the world's greatest network stethoscope, like something that actually can measure how fast it is to go from Cloudflare to all kinds of servers around the world.
And I imagine that that thing in and of itself was a useful product.
It's almost like as part of building a bigger product, we built an interesting piece of technology and we wanted to make that available.
And I think that's the basis of our health check product.
So let's talk a little bit about how customers have used that thing. Absolutely.
So it's incredibly important to know and have a very tight understanding of the health of your infrastructure, the health of these computers that are serving and empowering these different websites.
And so think about not having active health checking, not always pinging, always contacting these servers and saying, Hey, are you okay?
Are you available? And also, you know, what is the time it's taking to contact you?
Without that, our customers are left with a giant black hole when they need to make very critical decisions about the health of the infrastructure, to make sure that their customers are happy, and to be able to make quick decisions.
So they are utilizing their time as effectively as possible.
That's great. And so now we bring it back to the thing that your team shipped most recently, just a month ago, I think it was health check analytics.
So similar to the DNS analytics that Dina was just talking about, this is again, answering questions for our customers.
What kind of questions do we answer? What kind of questions can they now see in DNS health check analytics?
Sorry. Absolutely. Yeah.
It's a great question. So they're going to see a wide range of information. So today what normally happens, what typically happens is when something goes wrong, you know, some sort of alert notification goes off, some engineer or some technical individual has to go in and check what's going on.
And they have to go through these endless amounts of logs.
They have to go through these educated guesses to see where the issue is pinpointed at.
And that's a lot of time. And we felt we could do a little bit better there.
We could help give our insights to our customers so they can decrease this really critical metric, which is time to resolution.
And so we want to decrease that as much as possible to give time back to our customers, to focus on their table stakes and, you know, make sure their customers are happy.
And so with health check analytics, you're going to see, just moments after you load it up, Hey, is there actually a problem?
Validating that it's something that actually deserves your attention, honing in on the exact detail of what that issue is.
Maybe it's a timeout error.
Maybe we're not connecting to a router upstream in the infrastructure.
And so just within a few minutes, instead of having to comb through these endless logs and make educated guesses, you can not only pinpoint what the issue is, what the error is, and where it's isolated to, but also validate any fixes you make in real time.
Since our health analytics are always updating, like we said before, through this ever changing environment.
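The kind of question health check analytics answers, which origin is failing and why, can be sketched as a simple tally over health check events. This is a rough illustration only: the event tuples, the origin names, and the `summarize_failures` helper are hypothetical and do not reflect the product's real data model.

```python
from collections import Counter

# Hypothetical health-check events: (origin, status, failure reason).
events = [
    ("origin-a", "healthy", None),
    ("origin-b", "unhealthy", "timeout"),
    ("origin-b", "unhealthy", "timeout"),
    ("origin-c", "unhealthy", "connection refused"),
    ("origin-a", "healthy", None),
]

def summarize_failures(events):
    """Count failures per (origin, reason) so the worst offender stands out."""
    return Counter(
        (origin, reason)
        for origin, status, reason in events
        if status == "unhealthy"
    )

summary = summarize_failures(events)
worst, count = summary.most_common(1)[0]
print(worst, count)
```

Rerunning the same tally after a fix is deployed is the "validate any fixes in real time" step: the failing pair should stop accumulating new events.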
Yeah.
And it's, it's so great. You know, in the last couple of weeks, I think analytics has come up three times in a row.
And, you know, we talked last week about how we started creating zoomable analytics and across these analytics, you can literally just drag the mouse and it zooms in just like in the movies and you can pinpoint exactly what's going on.
And as we were preparing for this call, you two asked me to talk a little bit about the history of analytics at Cloudflare, which I think is a great question.
You know, it started off years ago when Cloudflare started to realize, like, we have so much data, we are witnessing so much information, we could make smarter products if we collected it and used that analysis to power what goes on in the future.
And in fact, we'll talk a little bit more about how Gatebot, which is the technology that powers our distributed denial of service stack, is basically doing that: it collects a lot of information, it thinks about it, and then it makes stuff that it exposes to the outside world.
And it's very interesting, the current generation, sort of the current revolution at Cloudflare.
And, you know, Cloudflare is gonna be around for a long time.
And we know, this is one truth is that we will constantly keep reinventing technology.
So some, some future version of us is going to be watching this.
And I don't know how long Brian's beard will be. It'll be a very grizzled, gray Brian, looking back on the young, youthful one, like, remember when we were talking about analytics in 2020? You know, it used to be that we were collecting all this stuff and putting it into relational data stores, you know, more traditional databases.
And around 2016, one of our engineers started to play with an open source project called Clickhouse. It's a big analytics database specifically designed to answer, if you have a lot of events of the same kind just pouring in, millions and millions of times a second, how quickly can you structure them in a way that makes it easy to ask complex questions?
That's really what it comes down to.
And it's very interesting, this engineer, just kind of quietly on the side, set up this database.
In fact, it was the DNS team. It was your team, Dina, that did this.
And we were in the middle of a mystery. As I joined Cloudflare, one of the first things I said was, hi, everybody, I'm Usman, what can I help with?
And they said, we're having this interesting problem with DNS every now and then it throws an error and we're having trouble locking it down.
It's tricky for us to figure out why this is happening. And so we're studying analytics and people are looking at logs, and this engineer, in the quiet, sort of raises his hand and says, actually, I have a new tool that might help here.
I've set up this database called Clickhouse. And I put a whole bunch of data there and maybe that'll help.
And it's exactly the story, Brian, that you just told, which is if you can get the information in the right spot and you can ask the question in an efficient way and get an answer fast, you can see insights that you just didn't see before.
It's just like any other scope or diagnosis tool or microscope.
If you can look at it the right way, it becomes very clear.
And all of a sudden it became clear what was wrong. And it was a revolution to us all.
And we were like, this Clickhouse thing, this has promise. We should really invest in this.
And so it went from this sort of side project, this Friday-afternoon, kind-of-messing-around-with-it technology, to basically powering the analytics revolution at Cloudflare, with multiple Clickhouse clusters, hundreds of nodes, whole teams dedicated to making sure that it's operational, and then putting interfaces on top of that so that customers and other product teams can add things like these analytics tools that you guys are talking about.
So it's really great. And you know, one of the things that continues to be our focus is performance.
So Dina, back to you.
One of the things we announced last week was Cloudflare Network Interconnect.
What's that? Yeah. So when you think of Cloudflare, you probably think of layer seven of the OSI model.
So that's where HTTP and DNS happen. And then eventually we extended down to layer four with Spectrum.
And then we went down to layer three with Magic Transit.
So now we're going even lower in the stack to layer two, which takes us closer to the physical layer.
It allows our customers who share the same data centers as us to physically connect to us.
So we are actually letting people wire into us directly, is part of what I'm understanding here, with private network interconnects straight into a Cloudflare data center.
So why would you want to do this?
If I want to make a cell phone call to you, we don't take two tin cans and put a line between you and me.
Like, why is this a good idea? What do we get for this?
Yeah. So there's a few benefits. One of them is performance. So for example, we saw that round trip time decreased from I think seven and a half milliseconds to one, which like when you scale that, that's a huge, huge impact.
Yeah. And then another reason is security: you don't want the rest of the Internet contacting your infrastructure and sending traffic there.
Instead, you want to create this private link of Cloudflare and my servers.
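To put the round trip numbers above in perspective, here is a small worked calculation. The 7.5 ms and 1 ms figures are the approximate ones quoted above; the count of one million sequential round trips is a made-up workload chosen only to show how the saving adds up at scale.

```python
before_ms = 7.5          # approximate round trip time quoted above, before
after_ms = 1.0           # approximate round trip time quoted above, after
round_trips = 1_000_000  # hypothetical workload, not a measured figure

saved_per_trip_ms = before_ms - after_ms
reduction = saved_per_trip_ms / before_ms
total_saved_s = saved_per_trip_ms * round_trips / 1000

print(f"saved per round trip: {saved_per_trip_ms} ms")
print(f"relative reduction: {reduction:.0%}")
print(f"saved over {round_trips} sequential round trips: {total_saved_s} s")
```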
That's right. And, you know, you were referencing the seven layers of the stack.
And I think, you know, it's something that a lot of us at Cloudflare and in the networking business sort of have tattooed on us, but just to remind our audience: whenever two machines talk, they have to have, at the very least, some kind of physical connection.
And, you know, that's layer one, either it's wired or wireless, but there's literally atoms moving through the air.
And then the simplest thing on top of that is, you know, what's called layer two, which is just: does my machine have, at the very lowest level, some way to send information to yours?
And then on top of that is the base foundation of the Internet.
On top of that is computers that can have a consistent conversation, which is partly what Brian was alluding to when that load balancer is there.
It knows they can keep that conversation going, that layer four conversation. And so on, all the way up. We don't talk about five and six because we basically acknowledged that five and six were a more academic exercise, but layer seven, the top of the stack, is the part of the stack with the applications people use all day long: browsing the web, playing Minecraft, getting on Zoom calls.
Usman, I have a question for you. Which Cloudflare products are now compatible with CNI, and how can you use them together?
Yeah, I think Magic Transit is the important one here.
Because Magic Transit is all about protecting customer infrastructure at that lower level, at layer three.
So that's, again, where customers say, I already have IP addresses.
And again, we're already in a world that's so beyond what most normal Internet companies need to think about.
Most companies think I need to buy a domain name.
And they don't think, wait a minute, who's got the IP addresses that that domain is on?
And you can get those too, it's just a bigger deal.
And our biggest customers have those IP addresses. And what they want to do is say, it's still my IP address, but I want Cloudflare to sit in front of it.
It's kind of like saying, I still want to live at 123 Main Street.
But I want all mail for 123 Main Street to go here and go somewhere else so that the rest of the world thinks that's my address, my address doesn't change, but it's still out there.
And so that is going to help here. And I think, you know, Workers, and private and serverless computing, are going to take huge advantage of this as well, because it's going to give us that direct connect there.
What else am I missing? I'm sure there are other products that can take advantage of it.
Spectrum, I guess, also, right? So, yeah, bring it into there.
That's great. While we're on the subject of lower levels and DDoS attacks, let's talk about DDoS attacks, Brian.
So another post we put out last week was around how we're doing better at defending against distributed denial of service attacks, and some of the new trends we've seen in a post-COVID world.
What are some of the things we've seen?
I'll ask you, and then you can ask me some questions about how we're defending against them.
Yeah, absolutely. You know, we live in a brand new world today, you know, we've never seen the likes of the influx of digital traffic, since shelter in place, and since COVID-19 has affected, you know, the world at large.
And so I want to ask you, like, you know, what is a DDoS attack?
You know, let's tell the world, you know, what, what are they? And what makes them different from other types of attacks?
Yeah, let's talk about it. So the interesting thing about a DDoS attack is, again, the example I always give: if you wanted to attack a store, you could hire a burglar to dress up in the dead of night and rob the cash register.
That's sort of a malicious attack, an infiltration attack.
The other way you can mess up the store is you could give a nickel to 500 kindergartners and just let them loose.
And, and they're going to make a mess of the store, no one's going to get any work done, no one's going to be able to buy anything.
They're not going to buy anything either.
But you've effectively disabled the store. And that's what a distributed denial of service attack is.
You send so much garbage to the website that the things that it's legitimately trying to do can't get through.
The interesting thing is, if the foyer, the anteroom of the mall that is accepting all this, is big enough, it can just absorb it, and no one even notices.
And in many cases, that's what Cloudflare is able to do. Having a network that can handle 37 terabits per second means that these attacks don't even affect our customers' properties in a meaningful way in many cases.
But we also can study the input, and we can block it. And we can be very smart about how we block it, and when we block it.
And one of our technologies is all that logging we were just talking about a second ago, which gets sent back to our data centers where we study all the logs, where we can analyze them, where all those Clickhouse clusters live.
And that's what our Gatebot product does: it turns that over, gets the exact right defense, and sends it back out to the edge.
So the edge blocks that kind of attack. That takes about 10 seconds, which sounds pretty fast.
But remember, every second counts on these kinds of things. So what we also want is something that's even faster.
And that's what we call dosd, where the defense is happening in real time at the data center that notices the problem.
And so that's a very, very interesting technology that allows us to react almost instantaneously to a problem.
But there's a third kind of attack that is very interesting, and that has jumped up dramatically in the COVID times.
Because what we saw was after everyone's sheltered in place, Internet traffic just skyrocketed, and the bad guys noticed.
And they're like, wow, with more stuff there, there's more gold for us to attack.
There's more content behind the web. Let's step up the attacks.
And that's why May of 2020 was one of the months where we saw the most, both in the number of attacks and the size of the attacks.
You measure an attack by how big it is.
And it's very interesting, because not all the biggest attacks are the most damaging ones: an attack can also be damaging if it's just big enough to overwhelm our customer's origin.
And that is something that we also built a real defense against by introducing something called flow tracking, which watches how traffic is moving across to a specific customer zone.
And if it notices this is going to be too much for this origin, it's out of proportion for what this website normally handles, it can put on the brakes.
And that's another great example of how we're doing things.
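The point above, that an attack only has to be out of proportion for a particular origin to do damage, can be sketched as a comparison against that origin's normal traffic. This is a deliberately simple illustration and not how flow tracking actually works: the `should_throttle` function, the 3x tolerance, and the request rates are all assumptions invented for the example.

```python
def should_throttle(current_rps, baseline_rps, tolerance=3.0):
    """Flag traffic that is out of proportion to what the origin normally sees."""
    return current_rps > baseline_rps * tolerance

# The same 400 requests per second means very different things
# to a small origin and a large one (hypothetical numbers).
small_site_baseline = 50
large_site_baseline = 10_000

print(should_throttle(400, small_site_baseline))
print(should_throttle(400, large_site_baseline))
```

The identical flood trips the brakes for the small site and is ignored for the large one, which is why the threshold has to be relative to each origin rather than a single global number.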
Wow, we're almost at the end of time. And I just wanted to thank both of you for joining me on this.
This was great. Thank you. It's my pleasure. Yeah. Without Jen, I think this was so much fun to talk about DNS and load balancing, two of our greatest products that have had a lot of success.
And just to finish up my bullets, other things we shipped: delayed cancellation means when customers hit the cancel button, we want to make sure that that service doesn't turn off right away, and you get to see it all the way through to the end.
We have other behind the scenes stuff.
It's supposed to be the latest from product and eng, but I couldn't resist talking about the product stuff.
But there are great internal tools that help our internal engineers defend against attacks and throttle things when they need to, and have all that visibility.
But thanks again, both of you for joining me on this.
And we'll see you again next week on the latest from product and eng. Thanks for watching everyone.
Bye. Thank you. Bye. Bye.