⚡️ Speed Week: Early Hints, Global Backbone, and Last Mile Networking

Presented by: David Tuber, Tom Paseka, Vasilis Giotsas, Tanner Ryan, Alex Krivit

Originally aired on March 31, 2023 @ 5:30 AM - 6:00 AM EDT

Join our product and engineering teams as they discuss our newest releases: Early Hints, Global Backbone, and Last Mile Networking.

Visit the Speed Week Hub for every announcement and CFTV episode — check back all week for more!

Speed Week

Transcript (Beta)

Hey everybody, and welcome to Cloudflare TV. Today is Thursday of Speed Week, as you can tell by our backgrounds, and I'm joined by a panel of Cloudflare experts. We're going to talk about some of the things that we've been rolling out to make Cloudflare faster on the network and in the browser. Before we get into it, let's do a quick round of introductions. I'll start. My name is Tubes. I'm a product manager at Cloudflare for network performance and availability. And I'll pass it off to Tom.

Hey, I'm Tom. I'm on Cloudflare's infrastructure team, doing interconnecting.

Tanner.

Hi, my name is Tanner. I'm a network engineer on the infrastructure team, helping to deploy POPs around the world.

Vasilis.

Hello everyone. I'm Vasilis Giotsas. I'm a research engineer, and I'm working on Internet measurements and the analysis of our routing and our topology.

Awesome. And Alex.

Hi, all the Cloudflare TV watchers out there. My name is Alex Krivit. I'm the product manager for cache and content delivery at Cloudflare.

Cool. So, between the five of us, we've published three blogs, and we're going to give a quick overview of each of them and then ask some questions to the experts here. The first one was Early Hints. That's about super fast page load times through a brand new technology, which Alex will explain. We had another blog post about our backbone, how it's really fast and how we see 45% latency improvements when we use it, and Tom and Tanner will talk about that. And then we talked about the last mile and how our job is to get closer to users. We'll talk about how we improve that, and we'll also talk about Last Mile Insights, which allows us to see when things go wrong on the last mile. But I want to take it back to the top and start with a question for Alex. Early Hints. Tell us a little bit about why Early Hints makes your browser faster.

Yeah, that's a good question.
So today, as part of Speed Week, we announced Early Hints. Early Hints is a way to make a browser and an origin server begin multitasking. They can both begin loading a page, instead of having your browser wait around for the instructions that it needs from the origin server. In the request-response cycle that you get with regular Internet browsing, you type something in, you push enter, and a bunch of stuff happens in the background that Cloudflare generally takes care of. The request for the content that you want is then routed to an origin server. The origin server takes that request and thinks about the response that you want to see. It sometimes has to make external calls to a database or an API, do some authorization on the request, or look at data to make sure that what you want to do is allowed and isn't fraud. So it's doing all these computations, and while it's doing that, it's not sending anything back to the browser yet.

The idea is that a browser needs instructions in order to start painting and loading assets on the page. What Early Hints does is take certain parts of the page that don't change frequently. The origin doesn't need to wait to send information about certain style sheets, fonts, favicons, things like that. It can focus on compiling the full response and talking to these external databases, while the browser works on fetching things that don't really change that much, like fonts, favicons, and scripts. So while the browser is loading those things from the early hints, the origin can continue to compile the full response and then send that through later. And so this multitasking helps pages load faster in our initial tests.
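As a rough illustration of the mechanism described above: Early Hints is HTTP status code 103, an interim response that carries `Link` headers naming assets the browser can start fetching before the final response arrives. The sketch below only formats such an interim response as bytes. The asset paths and the helper name `early_hints_response` are assumptions made for illustration, not anything shown in the episode.

```python
def early_hints_response(preloads):
    """Format an HTTP/1.1 103 Early Hints interim response as bytes.

    `preloads` is a list of (path, as_type) pairs. The paths used below
    are hypothetical examples, not taken from the transcript.
    """
    lines = ["HTTP/1.1 103 Early Hints"]
    for path, as_type in preloads:
        # Each Link header tells the browser it may start fetching this asset.
        lines.append(f"Link: <{path}>; rel=preload; as={as_type}")
    # An interim response has headers only, terminated by a blank line.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


hints = early_hints_response([("/style.css", "style"), ("/app.js", "script")])
print(hints.decode("ascii"))
```

The final `200 OK` response would follow on the same connection once the origin finishes compiling it, which is the "two responses to one request" behavior mentioned later in the conversation.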
This improvement can be greater than 30%, which is a massive improvement.

It's a huge deal. So how can our customers get access to Early Hints? Because a 30% load time improvement is huge, so how can they get that today?

Yeah, it's definitely an impressive figure, and something that we're working on being able to deliver to everybody for free in relatively short order. This project is a collaboration between ourselves and the who's who of the browser industry. We're working on developing this and we have a bunch of tests set up. We currently have a closed beta; you have to go to the Speed tab in the Cloudflare dashboard to sign up. As testing continues with these browsers, we'll continue to release people from the beta, and it will shortly be available to everybody, so everybody can see the improvements with Early Hints. That will be a really cool thing, and I'm excited to get more people using it and seeing the benefits.

That's really awesome. So, one more question for you. We got a 30% improvement, and we're doing a collaboration. Can you give us a little more insight into how we built this? Because it's a really interesting product and a really great feature. How did we get this off the ground?

Yeah, I think it's a good question. Generally, Cloudflare does a lot of routing and making sure that things go to the destinations they need to go, so we frequently route requests and responses back, things like that. A little bit of insight as to how we got this off the ground is that we were talking with a bunch of browsers, and we wanted to work on this cool, new, experimental status code, Early Hints, that a lot of people were saying had great promise.
So we really dug in and thought a lot about how we could add a little bit of magic to it, to try and remove the impediments that held back other technologies that promised the same thing, like HTTP server push, which never really saw wide adoption. We wanted to take the hurdles that server push had, which were adoption and relying on uncoordinated, disparate parties, and really centralize all of that. So we wanted to take as many of those changes as possible onto our stack and change some things around so that these hints could be sent to browsers without a lot of origin configuration or changes. Currently, in the testing iteration, we had to break a couple of things. Certain origins and browsers maybe only expect one response to one request, and now it's going to be two responses, so there's a bit of configuration to do there. But other than that and working with browsers, it was really focused on inspection of responses to make this work and to coordinate with the browsers.

It's awesome. It's a great cross-collaboration effort, and it's really great to see that live for our customers. So from Early Hints and the browsers, I want to go on to the backbone. Tom, Tanner, you just wrote this awesome post about the backbone. Bubbling things up a little bit for the people who didn't get into the post as much: Tom, why do we even need a backbone? What is it, and what does it do?

The backbone is great for us to connect our points of presence with each other, connecting our data centers.
And we do that simply because of the reliability of the Internet at large. The Internet may look good most of the time, but the short periods of time where a link may get full or something gets rerouted can bring a big impact to people's Internet browsing or their web access. And so by having our own connections, we can guarantee that performance and guarantee those paths across the Internet.

Awesome. So having a backbone is super awesome. Love to hear that. How much faster does our backbone make our customers?

So we actually recently ran tests on our backbone, to compare the performance of contacting an origin server over the public Internet versus using our private backbone, and in some cases we had a 45% improvement in response time when contacting an origin. Of course, this happens when you have routes on the Internet that are congested. If we're on the public Internet, we don't know if there's going to be latency, and we don't know if there's going to be packet loss, and this could really add additional delays to contacting an origin server. So if we have these dedicated paths between our different data centers, we can route packets knowing that we will receive the request and response from the origin on time and nice and fast.

Awesome. So, great improvements. I guess this question is for either Tom or Tanner: can you give us an example of how our backbone works with some of our existing products to really give that latency improvement?

So the backbone itself doesn't necessarily give us the fastest route. It gives us a route that is consistent, reliable, and under our control, but there may always be a shorter route between two certain locations that can get packets there the fastest.
And so we have products like Argo, which can actually sample the routes in real time. As we find routes that end up having lower latency, lower jitter, or lower packet loss, we can send packets through those routes. Likewise, as conditions on the public Internet change, we can make adjustments in real time to ensure that both our customers and people accessing Cloudflare websites always have the fastest routes available at that given time.

See, I knew the answer to that because I'm the Argo PM, but that was a test and you passed. Thanks. Good job. So, Tom and Tanner, our backbone obviously provides great improvement for performance and for reliability. How big is our backbone, and where are we growing to? Tom, do you want to take that one?

Sure. So we have multiple terabits of capacity across our backbone. If you go back to our blog post, we have a more recent map with some planned additions. We've got pretty good coverage across the US, across Western Europe, and going into Asia now. There are some expansions we've got in the works, and we have some things in South America live. And we're adding some spurs, some additional links onto the backbone, to bring more and more of our data centers into that dedicated connectivity.

Great. So it's great to see that our backbone is constantly growing and we're getting more and more of our traffic on it to provide our customers a better experience. Anything else you guys want to add about the backbone for the users at home?

I think the key message is that we're really growing our network to be as close to end users as possible. There are plenty of areas in the world where the Internet infrastructure just isn't there. And so we really want to extend our edge as close to users as possible so they get a secure and reliable connection to the greater Internet.

That's a great point.
So now we want to move from the backbone and talk about the last mile. My first question is for Vasilis. Vasilis, Tanner just spoke about getting close to our users. Can you talk about how we measure that today?

Right. So we measure proximity based on latency: how fast a user can establish a connection to one of our points of presence. And this does not depend just on geographical distance. Like driving, it also depends on the network. It depends on the routes, and it depends on the technology. For this reason, we opted to use latency as a measure of proximity, which also reflects the quality of experience that users get, not just how close they are in theory.

Yeah. And so, taking a step back on this last mile experience, what did we announce today? What improvement are we looking at? What is this last mile we're talking about?

So it's a great question. Tanner talked about getting close to our users, and Vasilis talked about how we measure that. I think a really important part of this is defining what the last mile is. The last mile is the part of the Internet from us all sitting at home to our ISPs, our Internet service providers, and how they connect up to the Internet. If you're on Comcast or Verizon or British Telecom or Reliance Jio or whomever, you're on a device, and that device connects to your Internet service provider because you pay them X amount of dollars a month. And those paths are very heterogeneous. There are so many different paths on every single last mile network to get to the Internet, and to get to not just Cloudflare, but to everybody.
And because there are so many last mile providers and there are so many different paths, there's a lot of ground to cover. And if you're a service, if you're running, you know, a mission critical service on the Internet or if you're running any service on the Internet, the ability for your users to be able to connect over these ISPs actually matters a whole lot. Right. If, you know, a user connecting to a service hosted on Cloudflare can't connect to Cloudflare, then your service will be down. And that will be very inconvenient for you. It will be very bad for you. It will be bad for us. It will be bad for your customers. It will be bad for the ISP. But the problem is that kind of last mile issues, things that happen on ISPs are kind of a black box. Right. Like you're essentially giving over trust that these networks are managing their networks correctly. And as Tom said, you know, the Internet is great, you know, 99.9% of the time, but the 0.1% of the times that it's not, it's pretty bad. And, you know, your users have a really bad experience. And, you know, it could be anything from, you know, there's a fiber cut in, you know, your last mile ISP. Or it could be, you know, a data center is a data center that hosts, you know, a whole bunch of different providers is down. It could be that one particular provider is down. It's kind of a black box. Like you can't tell. All of those things, all of those potential issues can look the same to a user. But you might not necessarily know which one is which. And you as a customer and you running a service, you actually need to know this. Right. Because the reason why this matters a lot is let's say that you're running a service and, you know, there's the last mile fiber cut. And all of a sudden, you know, you start getting a whole bunch of tweets and a whole bunch of down detector reports and a whole bunch of support tickets saying, I can't access, you know, your service. Like, please fix this for me. 
Like, why can't I do the thing that I need to do? And when you get a whole bunch of reports coming in at once, it can seem like a big mess. And it's a big mess for a couple of different reasons. The first reason is that you didn't catch a problem before your customers did, which looks bad for you. The second reason is that with all of this noise, it's difficult to pull a signal out of it.

And so the thing that we've released today is called Last Mile Insights. The point of Last Mile Insights is to make sense of these reports by automating them using built-in mechanisms in browsers. The way you can think about this is that the way Downdetector works today, if you're connecting to a website or a service and it goes down, you go to Downdetector and say, I can't connect. What if that process was automated for you, the user, so that you wouldn't have to go to Downdetector and press the button, but your browser just told the service, hey, I can't reach you?

And that's actually a really critical part of the request path. If you're talking about end-to-end requests and you, the user, can't connect to the service, the service will never actually know that you weren't able to connect, because it never got a TCP SYN, or it never got an ACK, or the connection dropped. It will look to the service like nothing went wrong. So being able to tell services and Cloudflare through an outside channel that you, the user, can't reach that service is super valuable. And that's what Last Mile Insights provides. It provides a side channel reporting mechanism for all of our customers and for all of our users around the world to tell us when they can't connect to the sites that they need to connect to.
And then we package that up, break it down by site and by customer, and give it back to you, so that you can see not only that your users are unable to connect, but also where they're not connecting from, which ISPs they're not connecting from, which regions they're not connecting from, and more importantly, why they're not connecting. And that could be so many different reasons. It could be that their TCP connections are just dropping because there's packet loss on the network. It could be that a provider is routing them through scrubbing centers and that scrubbing is breaking their traffic. It could be that a TLS certificate is not being properly served, so they'll start to see a whole bunch of errors. Last Mile Insights allows you to see these things and provide crisp feedback not only to you and your service, but also to your customers, so that everyone knows what's going on as fast as possible. And all of this is done in real time, without a user having to go to Downdetector and click on the button. So the time to detect is much faster and the time to resolve is much faster.

I think that data and that service is really valuable. It's still one of the things that is mind-blowing to me: when you're using the Internet, you push enter, and all of these different services from all of these different companies have a role in making sure that the request and response get to where they're going, and it all happens in milliseconds. So it's really powerful data that you're helping expose, and it provides some utility to customers here.
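The breakdown described above (by ISP, by region, and by failure reason) can be sketched as a simple aggregation over failure reports. This is a minimal sketch under stated assumptions: the report fields, ISP names, and error labels below are hypothetical and are not the format Last Mile Insights actually uses.

```python
from collections import Counter

# Hypothetical connection-failure reports; every field name and value
# here is made up for illustration.
reports = [
    {"site": "example.com", "isp": "ISP-A", "region": "US-East", "reason": "tcp_timeout"},
    {"site": "example.com", "isp": "ISP-A", "region": "US-East", "reason": "tcp_timeout"},
    {"site": "example.com", "isp": "ISP-B", "region": "EU-West", "reason": "tls_cert_error"},
    {"site": "example.org", "isp": "ISP-A", "region": "US-West", "reason": "connection_reset"},
]


def breakdown(records, key):
    """Count failure reports grouped by one field (e.g. 'isp' or 'reason')."""
    return Counter(record[key] for record in records)


by_isp = breakdown(reports, "isp")
by_reason = breakdown(reports, "reason")

# The most common ISP and reason hint at where and why users are failing.
print(by_isp.most_common(1))
print(by_reason.most_common(1))
```

Grouping by more than one field at once (for example ISP and reason together) would separate a single provider's outage from a certificate problem that affects everyone.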
And you mentioned a bit about the different things that can go wrong in the last mile, but is there anything in particular, like the diversity of providers, that makes last mile problems more difficult than other pieces of that request-response flow?

That's a great question. I think this is a two-parter. I'll answer half, and Vasilis can answer half, because he's done a bunch of research and a lot of great data gathering on it. At a high level, last mile networks are incredibly diverse. And that's because population centers are incredibly diverse. We all live in different places around the world, and our towns and our cities are all different. Because of that, the way that we connect to the Internet looks different. That means that no two providers are going to be exactly the same; their footprints are always going to be different. So the way that we connect to the Internet will change even network to network in the same region. And I want to turn that over to Vasilis, because he has done a really great job analyzing that, and he can talk a little bit about how even networks in the same small country have different footprints and different network quality. So I guess, Vasilis, the question to you is: number one, what is network quality, and number two, how does it differ network to network? And what does that look like to us?

Right. So we can define network quality in terms of not just latency, but also stability of latency: how stable the latency is, whether we have low jitter, and so on. And also in terms of path length: whether an ISP picks the shortest available path, or a path that is local to its Internet ecosystem, right?
So, as you said, we've observed and measured a lot of different ISPs within the same countries, within the same Internet ecosystems. And we've seen that there are two extremes of optimizing routes. One is optimizing in terms of performance, which is the best case scenario. The other is optimizing in terms of cost. So some ISPs may try to minimize the cost for them, which would increase latency. And we have a bunch of ISPs in between. Now, the thing is that obviously this is beyond our control. We can do our best to colocate with them, but it's up to them to pick the closest route, or a route that makes sense in terms of the user population. But we've seen cases where ISPs would route their traffic through other countries because it may be cheaper, or because they may be subsidiaries of other ISPs, and so on. In the blog post, we provide some examples. For instance, there are ISPs that are very keen to be present at Internet exchange points and establish direct connections, which we call peering, with other ISPs within the same city or within the same country. Typically these ISPs observe the best performance because their paths are so short. And there are other ISPs that suffer from high congestion. They have high packet loss. We see big variability with these, and also much longer paths, because they do not peer with their local Internet providers.

So to bubble that up, it feels like what you're saying is that the two factors that help drive performance improvement are shortening the paths, and doing that through peering with as many people as possible. Does that sound about right?

Yeah, exactly. These are the two very important factors in terms of performance. And obviously many ISPs care about that and do that.
And that's why we observe the gains that we observe by growing our backbone. So the growth of our backbone has this big improvement in our performance because most ISPs care about their performance. But again, there are, you know, a few cases where, you know, it's beyond our control because of suboptimal choices in terms of path length and, you know, congestion issues. So, actually this is perfect because we have the interconnection team here. So, if the answer to the problem is, get as close to the users as possible through peering and direct connection. Tanner, what's the process by which we interconnect directly with a provider? Can you and Tom kind of break that down for us a little bit? So, right now there's actually three models that we perform these connections. For networks that have typically a little bit lower amount of traffic, we can peer with them over a public Internet exchange. So, a facility where both networks are present, we'd be happy to establish peering sessions there to exchange traffic instead of going through some transit network or having a higher latency route. For networks that are higher in traffic, we do have private network interconnects that are available where we will actually physically connect in with another network at a common facility. And then the third actual option that we have is taking our network and extending it right into the core network of an Internet provider. So, we can actually supply our hardware to Internet providers where our servers and the content will be located directly within your network. So, even if there's a high latency connection back to our network, you can cache the resources locally. And so, an Internet provider's users will have a fast experience as all the content is hosted within the network. Awesome. And, you know, I would refer everybody to the Cloudflare TV and blog episodes and blogs from Monday, where John Rolfe talked a lot about that. 
And how we're ingraining ourselves further into the networks, into the Internet, and building that better performance. But it's super cool to be able to talk about that a little more and talk about the methodology that we use to measure how close we are. And one last question for Vasilis. So, we measure how close we are. How close are we to our users?

Yeah, we're very close. In terms of latency, which again is how we measure proximity, we are at most 50 milliseconds away from 95% of the Internet population. And just to put that into perspective, you don't have time to blink before you reach us. With the expansion that we've seen in the last few months to markets such as South America or Africa, we improved the latency we have there a lot. So, let's say in Latin America, with the new data centers that we deployed, we increased the proportion of Internet users that reach us within 10 milliseconds by 20%, right? So, we doubled, essentially, the amount of users that can reach us in 10 milliseconds. And, from that perspective, I think we are, if not the fastest, one of the fastest CDNs on the globe right now.

It's great. And it's also worth noting that the 5% that's over 50 milliseconds is a long tail that we're definitely going to keep driving down. And it's through the efforts of literally everybody on this call that we drive that down, by getting the peering where it needs to be and getting the interconnections and the data centers in the right places. So that number is always going to be driving down, and we're not satisfied with where we are. We're going to keep going and keep getting faster and faster until we're close to every single user on the planet. So, we are almost out of time.
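The proximity figures quoted above are shares of users whose connection latency falls under a threshold. A minimal sketch of that calculation follows, assuming a flat list of per-user latency samples. The sample values are invented so the arithmetic is easy to check by hand and are not Cloudflare measurements.

```python
def share_within(latencies_ms, threshold_ms):
    """Return the fraction of latency samples at or below the threshold."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    within = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return within / len(latencies_ms)


# Twenty invented samples, in milliseconds.
samples_ms = [3, 5, 7, 9, 12, 15, 18, 20, 22, 25,
              28, 30, 33, 36, 40, 42, 45, 48, 50, 130]

print(share_within(samples_ms, 50))  # 19 of 20 samples
print(share_within(samples_ms, 10))  # 4 of 20 samples
```

A real measurement would weight each sample by the user population it represents rather than counting samples equally.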
I just wanted to say thank you to Vasilis, Tanner, Tom, and Alex for hopping on this call and chatting with us about the things that make Cloudflare fast. If you're interested in helping us further this mission of building a faster, better Internet, we're always hiring. Specifically, the infrastructure team has a lot of open positions for making the Internet faster, but Cloudflare in general is always hiring. So, if you're interested, please apply on our careers page. We've got a lot of open positions, and we'd definitely love to have you if you're interested in helping to build a better Internet. And have a great day.

Speed Week

Relive Cloudflare's Speed Week with episodes showcasing how we keep everything fast, from lightning quick configuration updates and code deploys, to logs you don’t have to wait for, to ludicrously fast cache purges and real time analytics.
