Cloudflare TV

This Week In Net

Presented by John Graham-Cumming

A weekly review of stories affecting the Internet, brought to you by Cloudflare's CTO. We'll look at outages, trends, and new technologies — with special guests to help us explore these topics in greater depth.

Original Airdate: July 24, 2020

English
News
Interviews

Transcript (Beta)

I'm John Graham-Cumming, Cloudflare's CTO, and this is This Week in Net, where I typically talk about something that is news in net, as in Cloudflare, or net as in Internet, over the last week or so.

And today is hopefully Friday, July the 24th. And it's the afternoon, if you're watching this delayed, that's when this was recorded.

And typically, I do the show on my own, and talk about things that have happened.

But I have a guest this week, Lucas. Welcome, Lucas. Hello. Lucas works on the protocols team at Cloudflare, which works on, well, protocols, very clever name.

And those protocols are all the things that we provide, because one of the things we do is we try to support the latest IETF standards, be they on the HTTP side, TLS side, DNS side, to make sure we're the most compatible and most up-to-date network out there.

And Lucas, you have a show about this kind of stuff, right?

Yes. So since Cloudflare TV launched, I used that opportunity to start with effectively a rerun of a presentation I'd given to a few people.

And from that, some of the feedback was like, we want to know more, want to be able to figure out how this applies to our websites, how can we measure these things.

So from that, I've started just a weekly session on a Monday, basically.

So if maybe after this show, people are interested, do dial in on a Monday.

And there's a number of sessions also available on demand that go into some deep dives on things like using Wireshark or other debugging tools to kind of measure this stuff.

And Monday's episode should be very interesting.

We're going to be using WebPageTest and getting some of the experts there to show us how to look not at the high-level web performance stuff, but at the lower level, almost like the Internet plumbing aspect, which is pretty important for certain types of things.

What sort of plumbing aspects are you going to look at?

So, I mean, typically when you're looking at WebPageTest, the big thing is like a waterfall, right?

So you load a website, and a web page is constructed of a number of resources: you've got your index.html, say, and then the images and the scripts and the CSS. The waterfall is a nice visualization of, say, when you initiate a DNS request to find out what computer the resource is at, but then also, say, time to first byte, from when the browser sent the request at a point in time on a timeline to when it got some bytes back, and then how long it took to complete the whole thing.

When you've got technologies like HTTP2 or HTTP3, lots of things are happening in parallel.

So you can understand basically very quickly, almost like a Gantt chart, you know, what's the critical path in fetching all of these resources.
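As a rough illustration of what a waterfall encodes, here is a minimal Python sketch. The `Entry` record, the field names, and all of the timings are hypothetical and are not taken from WebPageTest; the sketch only shows how time to first byte and the last-finishing resource fall out of per-request timestamps.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    url: str
    request_sent_ms: int  # when the browser sent the request
    first_byte_ms: int    # when the first response byte arrived
    done_ms: int          # when the last response byte arrived


def time_to_first_byte(entry):
    """Waiting time between sending the request and the first byte back."""
    return entry.first_byte_ms - entry.request_sent_ms


def page_complete_ms(entries):
    """The page is only complete when the slowest resource has finished."""
    return max(e.done_ms for e in entries)


def render_waterfall(entries, ms_per_char=20):
    """Draw one row per resource: '-' is waiting, '#' is receiving."""
    rows = []
    for e in entries:
        lead = " " * (e.request_sent_ms // ms_per_char)
        wait = "-" * (time_to_first_byte(e) // ms_per_char)
        recv = "#" * ((e.done_ms - e.first_byte_ms) // ms_per_char)
        rows.append(f"{e.url:<12}|{lead}{wait}{recv}")
    return "\n".join(rows)


# Made-up timings: the sub-resources start once index.html has arrived.
ENTRIES = [
    Entry("index.html", 0, 100, 160),
    Entry("style.css", 160, 220, 260),
    Entry("app.js", 160, 240, 400),
    Entry("hero.jpg", 160, 260, 380),
]

print(render_waterfall(ENTRIES))
print("page complete at", page_complete_ms(ENTRIES), "ms")
```

In this made-up page, `app.js` finishes last, so it is the resource to look at first when shortening the overall load.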

But generally it's focused on a browser level visualization metrics to say, well, yeah, you fetch it in this time, but then say the browser tried to feed this into its JavaScript engine and parse the document or write bitmap to the graphics card.

All of that is really interesting and really cool, but what's nice with WebPageTest is that it's able to go a layer deeper into what they call the netlog and look at the next layer of information, say, breaking the response data down into the frames or the bytes that are coming in off the network.

So this is where it involves things like TCP or QUIC, which is the new transport protocol that HTTP3 runs on.

And so you can get an understanding of not just it took this long to get a blob, but I got the first few bytes and then something changed, and then, you know, maybe there's a priority shift or something that happened, the page scrolled, these kinds of things; user interaction relates to lots of other stuff that goes on.

So they're all interrelated, but we're going to focus on just how you can dissect, that's the wrong word, because dissect means something different, how you can just look at this and understand from the visualization what you're looking at.

I think that's the most important thing, communicating to people how they can do this on their own.

And this is Monday at what time? Monday at 5pm UTC, so that's six o'clock in the evening for you and I in our time zone.

Okay, great. And then morning time in California, for folks who want to get up and see something interesting on a Monday morning.

All right, let's come back to that in a minute.

I want to talk a little bit more about HTTP3 and some of the things that are going on in a minute.

But let's just do a quick review of two things, because on this show in the past, I've talked a little bit about trends in attacks and also outages around the world.

And if you've been following along, if you're a regular viewer, you will know that there has been a very long Internet outage in all of Ethiopia, which started at the beginning of the month, after the killing of a very popular singer.

And that caused a lot of unrest and the government decided to shut down the Internet completely.

And that Internet outage actually lasted more than three weeks, which is a very long complete outage.

And now we see that the Internet has been switched back on in Ethiopia, and traffic levels are back to where they were before.

So if you look at this graph, this shows the traffic that Cloudflare sees from clients in Ethiopia, starting on the 1st of May.

So there's sort of a fairly standard amount of traffic every day over time.

And then this very dramatic drop, where the Internet use dropped down to about 1% of the level it was normally.

That happened at the beginning of July, straight after the killing.

And you see there was still a little bit of Internet use, because some things were still connected, particularly some parts of the government.

And you'll see after sort of a week or so, the Internet use jumped up a little bit, to sort of 2 or 3%.

And what we heard from people in the country was that some other bits of critical infrastructure, things like banks, had been reconnected to the Internet.

But the real change didn't happen until the last few days when Internet use jumped up a lot.

That's because fixed broadband and other fixed Internet connectivity was re-enabled.

And you can see that where it jumps up initially.

And then the last few days, all Internet, so mobile Internet, which is incredibly important around the world, has been re-established.

So we are now back in Ethiopia, after about three weeks of outage, to being fully reconnected.

So we'll keep an eye on what happens there, but we're assuming this is now pretty stable.

The other thing I've talked about quite a few times in this is the trends in attacks.

And Cloudflare blocks a very large number of cyber attacks per day. In this sense, a cyber attack is an HTTP request that we blocked because maybe the WAF blocked it, maybe it was participating in a layer 7 DDoS, maybe a customer had decided a certain sort of traffic pattern was malicious and needed to be blocked, or bot detection decided to block it.

And in the first quarter, we blocked on average every day 44 billion such HTTP requests.

And in the second quarter, the one that's just ended, something like 72 billion and sort of averaging out.

And so what's happened is there's been this real growth, and you can actually see this as just a natural growth in the sort of number of cyber attacks that are happening.

So unfortunately, as 2020 has gone on, as well as the COVID crisis, we've been having this big mess of increasing numbers of cyber attacks.

And actually, there were some very large cyber attacks that occurred in June, where there were some very big attacks on certain websites.

So sadly, cyber attacks are getting more and more prevalent, and we're blocking a large number of them.

So that's it for the kind of the update, I'm going to get rid of this, the screen share, and I can go back to talking to Lucas directly.

So okay, so you were, let's just recap, if you're not familiar with HTTP 3, because if you're not really into HTTP, you might be wondering what version we're on and what's happening.

Just give us the quick rundown on 1.1, 2, 3.

What are all these versions? What do they mean? So, the entire history of HTTP in maybe 10 minutes.

There you go. Yeah. Fortunately, I wrote a blog post on this shortly after I joined the company, because I was having a focus on this work more than in my previous job.

And so it was almost like a research activity for me.

And I had such a great time looking at the timeline of all of these developments, because we kind of take it for granted.

HTTP is a layer seven protocol. It's an application protocol that sits on top of TCP, which just works.

We know in reality, it doesn't just work.

But you kind of, for many people, you just type in a web address, and you visit a web page.

But there's people who take this stuff very seriously, like the team that I'm on.

And so what they've been doing over time is, in the first instance, creating a textual protocol that is very simple to allow computers to communicate.

Alongside the birth of the web, and the concept of HTML as a way of describing a document, is the hypertext transfer protocol, which allows you to transfer that markup language from one computer to another.

And so you could just type this in, in ASCII text, and effectively delineate different aspects of it, just using carriage returns.

So just like you would type into Telnet or anything like this.

And that was great, and it was simple. HTTP 0.9, as this has kind of been retroactively talked about, was just a one line thing, which we now call the request line.

You would say, get a path, and then that was it. But over time, as always, people realize the power of simplicity, and want to augment it slightly.

So then you look at adding, say, something like metadata. The idea of adding HTTP header fields, which help describe aspects of the resource that you're fetching.

Either you want to augment the request in some way, to say, here's my user agent, here's who I am.

Maybe you want to factor that into your response or not.

Likewise, just adding a Server header back, or other information like the date that that resource was generated.

And so that took you into HTTP 1.0. And there's also HTTP 1.1, which is similar, but kind of did a few bug fixes.
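To make the progression concrete, here is a small Python sketch that builds the bytes of a one-line HTTP 0.9 request and of a later request with header fields. The function names, path, and header values are illustrative assumptions; the sketch only constructs the text and does not talk to a network.

```python
def http09_request(path):
    """HTTP 0.9: a single request line, method and path, nothing else."""
    return f"GET {path}\r\n".encode("ascii")


def http1_request(path, headers, version="1.0"):
    """HTTP 1.x: request line with a version, then header fields.

    Each line ends with CRLF and an empty line ends the header section.
    """
    lines = [f"GET {path} HTTP/{version}"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


print(http09_request("/index.html"))
print(http1_request("/index.html", {"User-Agent": "example-client/0.1"}))
# HTTP 1.1 additionally requires a Host header on every request.
print(http1_request("/index.html", {"Host": "example.com"}, version="1.1"))
```

The metadata described above travels in those `Name: value` lines, in both directions.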

I did go to great lengths to plot all of the timelines of these, and it's quite humorous to see that basically, the day that the 1.1 specification ships, the work starts to refactor it, and do editorial improvements, and fix other issues.

So it's this continuing process.

But actually, kind of what happened was 1.1 was good enough, and it worked really well for a lot of things for people.

And the standard kind of got stuck there for nigh on 20 years.

If you go through the dates with a fine tooth comb, you can see this.

But yeah, it was good enough. And during that time, what we had was massive improvements in consumer computing power, consumer access networks, this whole thing of increasing website capability being matched by advancements in broadband rollout and deployment.

And so actually, it was kind of a boom era of focusing on other areas, until it was kind of discovered that HTTP 1.1, along with TCP, and the way that those things interacted, were now becoming a bottleneck.

As always, as you're doing performance, or any kind of optimization, you pick the low hanging fruit there.

So I think like it had been recognized for a while that bandwidth wasn't going to be the silver bullet forever, that something like latency actually starts to play a bigger role when you are transferring web pages that are growing.

But in comparison to the amount of bandwidth and capacity that users have to download a page, it's still quite small.

So you look at round trip times, and you've got something like TCP, which is a, it requires a three way handshake, just to get to the point that you can make a HTTP request.

Like, those things are now, you can shift your laser focus to trying to address those aspects.
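The point about latency overtaking bandwidth can be shown with some back-of-the-envelope arithmetic. The model below is a deliberate simplification (time spent waiting on round trips plus time spent pushing bytes), and the page size, link speeds, and round-trip counts are hypothetical.

```python
def fetch_time_s(size_bytes, bandwidth_bps, rtt_s, round_trips):
    """Rough model: round-trip waiting time plus raw transfer time."""
    return round_trips * rtt_s + (size_bytes * 8) / bandwidth_bps


PAGE_BYTES = 2_000_000  # a hypothetical 2 MB page

slow_link = fetch_time_s(PAGE_BYTES, 20e6, rtt_s=0.1, round_trips=10)
fast_link = fetch_time_s(PAGE_BYTES, 400e6, rtt_s=0.1, round_trips=10)
fast_link_low_rtt = fetch_time_s(PAGE_BYTES, 400e6, rtt_s=0.05, round_trips=10)

print(f"20 Mbit/s, 100 ms RTT:  {slow_link:.2f} s")
print(f"400 Mbit/s, 100 ms RTT: {fast_link:.2f} s")
print(f"400 Mbit/s, 50 ms RTT:  {fast_link_low_rtt:.2f} s")
```

Under these assumptions, a twentyfold bandwidth increase saves less than half the time, while halving the round-trip time on the fast link nearly halves it again, because almost all of the remaining time is spent waiting on round trips.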

So web pages became a lot more complex in terms of all the stuff they were made of, right?

So it wasn't like, just give me this one page and get it back.

It was like images and CSS and JavaScript and interactive features.

And this all compounds this latency being a significant issue. Absolutely.

So you've got a, you know, a very complex tree of resources. And the 1.1 protocol is effectively a serial thing, you can make one request at a time.

And, you know, when you're talking to an application server, it might need to go and go to the file system and pull that resource from somewhere that takes time.

So something that would be quick to service gets blocked behind something else.

And so the general solution to those kinds of things in real life is to split, you know, divide and conquer and make more requests.

But because the way 1.1 worked, you couldn't do that within a single TCP connection.

So browsers came up with a strategy of making more connections, because HTTP is a stateless protocol. It's highly parallelizable, right?

Unfortunately, what that does is create huge demand on servers, because every client now wants to open a huge number of TCP connections, and they don't come for free, they cost memory and other system resources.

So the browser vendors kind of bounded that and came up with a de facto limit of maybe 10 or 6 connections per website.
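A toy simulation can show both the blocking and the effect of opening more connections. In this sketch each connection serves one request at a time, in order, and the next request goes to whichever connection frees up first; the service times are made up, and network transfer time and connection setup cost are ignored.

```python
import heapq


def completion_times(service_ms, connections):
    """Return when each request finishes, given N serial connections."""
    free_at = [0] * connections  # time at which each connection is idle
    heapq.heapify(free_at)
    finished = []
    for duration in service_ms:
        start = heapq.heappop(free_at)  # earliest available connection
        end = start + duration
        finished.append(end)
        heapq.heappush(free_at, end)
    return finished


# One slow resource (500 ms) requested ahead of three quick ones (20 ms).
REQUESTS = [500, 20, 20, 20]

print("1 connection: ", completion_times(REQUESTS, 1))
print("2 connections:", completion_times(REQUESTS, 2))
print("6 connections:", completion_times(REQUESTS, 6))
```

With a single connection the quick resources all wait behind the slow one; with six, each gets its own connection, which is exactly the cost to servers described above.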

And so clever web optimization experts worked around that by sharding, which is a clever name, really, for basically just splitting resources across more host names, as that connection limit applies per host name.

And so you develop a whole load of design or web performance patterns that worked, but they were working around a core issue of the protocol.
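A minimal sketch of sharding, assuming hypothetical shard host names and a per-host limit of 6: each asset path is mapped to one of several host names with a stable hash, so the same asset always lands on the same host and stays cacheable.

```python
import zlib

# Hypothetical shard host names, all serving the same content.
SHARDS = ("static1.example.com", "static2.example.com", "static3.example.com")


def shard_host(path, shards=SHARDS):
    """Pick a host for a path using a stable hash (CRC32 of the path)."""
    return shards[zlib.crc32(path.encode("utf-8")) % len(shards)]


def max_parallel_connections(shards=SHARDS, per_host_limit=6):
    """Upper bound on connections if the limit is applied per host name."""
    return len(shards) * per_host_limit


for asset in ("/img/logo.png", "/css/site.css", "/js/app.js"):
    print(f"https://{shard_host(asset)}{asset}")

print("connection ceiling:", max_parallel_connections())
```

A stable hash is used here rather than Python's built-in `hash()`, which is randomized per process for strings and would move assets between hosts on every run.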

And so with this kind of thing, you get enough groundswell to say, yes, it does work, but actually, it creates all these secondary effects on the ecosystem.

And the people who are feeling those effects are willing to try and come up with a better option here, which basically led to the development of HTTP2.

So you have people at, let's say, Google, but not just Google, elsewhere, pitching ideas for, if you could think of a way of improving HTTP 1.1, what would you do?

And so Google's proposal, SPDY, was effectively the strongest candidate of those, and factored in some other suggestions too, but they had a kind of tender process for which one should we adopt as a basis, and then continue to work in the IETF and take in multiple stakeholders from different places and work on.

So that was the genesis of HTTP2, which would allow, within a single TCP connection, multiplexed requests and responses, full concurrency, fix all the problems that we've ever had.
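The idea of multiplexing can be sketched in a few lines: each response is cut into frames tagged with a stream identifier, frames from different streams are interleaved on the one connection, and the receiver reassembles them per stream. This is a simplified illustration with made-up bodies and frame size; real HTTP2 frames carry a binary header with length, type, and flags as well as the stream identifier.

```python
from itertools import zip_longest


def to_frames(stream_id, body, frame_size):
    """Cut a body into (stream_id, chunk) frames of at most frame_size."""
    return [(stream_id, body[i:i + frame_size])
            for i in range(0, len(body), frame_size)]


def interleave(*frame_lists):
    """Round-robin frames from several streams onto one connection."""
    wire = []
    for group in zip_longest(*frame_lists):
        wire.extend(frame for frame in group if frame is not None)
    return wire


def reassemble(wire):
    """Rebuild each stream's body from the interleaved frames."""
    bodies = {}
    for stream_id, chunk in wire:
        bodies[stream_id] = bodies.get(stream_id, b"") + chunk
    return bodies


# Client-initiated streams use odd identifiers in HTTP2.
big = to_frames(1, b"AAAAAAAA", frame_size=4)
small = to_frames(3, b"bb", frame_size=4)
wire = interleave(big, small)

print(wire)
print(reassemble(wire))
```

The small response on stream 3 is complete after the second frame on the wire, without waiting for the larger response on stream 1 to finish.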

But as people know... Right.

But it wasn't the perfect solution. It wasn't, but they did do it like, there's problems which I'll come on to shortly, but the clever thing that was done was to avoid a need for, what's the word?

I've lost the exact word, but doing a complete switch from one system to another one.

What they wanted to do was encourage the rollout of this thing and have easy migration without developing, for example, a new URI scheme, which would require every webpage out there to rewrite itself or maintain two versions of things.

So it was a goal from the outset to develop a protocol that could fit in almost seamlessly.

You would require deployment of new software, but this was at a level below what we call HTTP semantics.

So the idea that you would make a request would be common across the versions, just how that request looked on the wire would be different.

So that kind of annoyed some people because there are other problems with HTTP, not just this head of line blocking that I mentioned.

Things like sending dates and times is really annoying because they're sent as text, which takes up a lot of space, whereas there's much more efficient encoding.
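To put a number on the cost of textual dates, the sketch below formats one moment in the textual form used by HTTP date headers and, purely for comparison, packs the same moment as a 32-bit count of seconds. The binary form is an illustrative assumption about what a more compact encoding could look like, not something any HTTP version specifies, and the chosen time of day is arbitrary.

```python
import struct
from datetime import datetime, timezone
from email.utils import formatdate

moment = datetime(2020, 7, 24, 16, 0, 0, tzinfo=timezone.utc)
seconds = int(moment.timestamp())

# Textual form, as carried in headers such as Date or Last-Modified.
as_text = formatdate(seconds, usegmt=True).encode("ascii")

# Hypothetical compact form: unsigned 32-bit seconds since 1970.
as_binary = struct.pack("!I", seconds)

print(as_text, len(as_text), "bytes")
print(as_binary.hex(), len(as_binary), "bytes")
```

The text form is 29 bytes against 4 for the packed integer, and a response can carry several such dates.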

Things like cookies are hugely problematic, but rather than fall into the trap of like a second system syndrome of let's fix everything, it was focused.

And I think that was a good thing.

It's had some side effects, but as always, this is the real world we live in. So I think on the whole, it's been a good thing.

But yeah, having then deployed HTTP2, you know, many different people, browsers, command line tools, different kinds of servers, it did address some of the problems.

But effectively, what it did was shine the torch on the real problem with TCP.

We can fix this HTTP issue. But TCP has this issue where it's a reliable transport.

And so on the Internet, packets go missing.

If you requested a resource, like a large image, say, and then something smaller, and the ordering of those things meant that the small thing came after the image, HTTP2 allows you to make the request in such a way that you could get the small thing before the image.

But when you then try and shove it down the TCP pipe, they get serialized in such an order that if you miss one packet, even if you continue to receive the TCP data for all of those things, you can't get unblocked from that, which was annoying for a number of reasons.

But for like a web perspective, it's annoying because you can't see it.

Because the TCP is implemented as like a kernel mode feature in the network stack.

Maybe, you know, if you were able to write some software with the specific aim of probing the TCP, you could find that information out.

But as an app web developer, say, you explicitly don't want to have that kind of access because of the security concerns and the browsers create a kind of sandbox secure environment.

And so the outcome, the manifestation of that, as we were talking about with the waterfall earlier, is that the time to complete the object is longer.

You don't know why.

Maybe the server was slow. Maybe it's the network. There's not much you can do to detect and mitigate that kind of thing.

So almost as soon as SPDY was adopted into H2 and taken on board the standardization process, the folks at Google had kind of identified this as a problem as well.

They started working on what is called QUIC, which is still called QUIC, but is different now.

But basically a new transport protocol designed to replace TCP and pull this concept of multiplexing and streaming down into the transport.
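The difference between ordering everything in one byte stream and ordering each stream separately can be sketched with a toy model. The packet layout, stream names, and the choice of which packet is lost are all made up; retransmission and timing are left out, so the functions only answer what the application can read while one packet is still missing.

```python
def readable_single_stream(packets, lost_seq):
    """TCP-like: one ordered byte stream, so nothing past the gap is readable."""
    out = []
    for seq, stream, data in packets:
        if seq == lost_seq:
            break
        out.append((stream, data))
    return out


def readable_per_stream(packets, lost_seq):
    """QUIC-like: ordering is per stream, so only the affected stream stalls."""
    stalled = set()
    out = []
    for seq, stream, data in packets:
        if seq == lost_seq:
            stalled.add(stream)
        elif stream not in stalled:
            out.append((stream, data))
    return out


# (sequence number, stream, payload): a large image interleaved with CSS.
PACKETS = [
    (0, "image", "img-0"),
    (1, "image", "img-1"),
    (2, "css", "css-0"),
    (3, "image", "img-2"),
    (4, "css", "css-1"),
]

print(readable_single_stream(PACKETS, lost_seq=1))
print(readable_per_stream(PACKETS, lost_seq=1))
```

With packet 1 missing, the single ordered stream hands over only the first image chunk even though the CSS packets have arrived, whereas per-stream ordering delivers the whole stylesheet while the image waits for its retransmission.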

And anyone who's familiar with transport protocols will say, yeah, we did that before.

It was called SCTP, which is a full transport protocol that was standardized in the IETF.

But unfortunately, it didn't get much deployment success.

It fell into the trap of saying, here's a brand new thing, everybody go and use it.

The Internet was designed that way, that you have this number space that you can identify different Internet protocols in.

Yes, there's IPv4 and IPv6, but the protocols that are on top of them, you can have TCP, you can have UDP.

SCTP is one of those. But what you find is effectively a focus on catering to mass deployment, and realistically, on the Internet, there are only two kinds of traffic that you see, TCP and UDP.

Requiring systems to support a new transport protocol that is neither of those, and to have it work on a heterogeneous Internet of all different devices, is basically doomed to failure, which is very sad because there's a lot of good things that SCTP did and got right.

So, having seen that as an example, QUIC made the decision to basically, in some ways, reinvent TCP on top of UDP.

So, they would take things that TCP did well, things that HTTP did well, combine them into effectively a user space transport protocol that ran on top of UDP because UDP was known to work on the Internet and had enough flexibility for that.

Meanwhile, with things like the Snowden revelations and realizing that pervasive monitoring was going on across the Internet and realizing that's not a good thing for people, there was a focus on trying to create a secure transport protocol by default.

Although we can achieve that with TCP and TLS, because those two things are layered and disjoint, it creates additional round trips and handshake problems.

It works fine, but there's, again, an opportunity to fix two things with one fell swoop effectively.
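The cost of layering the two handshakes can be tallied in round trips. The counts below are the commonly quoted figures for a fresh connection with no session resumption (one round trip for the TCP handshake, two more for TLS 1.2 or one more for TLS 1.3, and a single combined round trip for QUIC); the 50 ms round-trip time is a hypothetical value, and server processing time is ignored.

```python
# Round trips spent on setup before the first request can be sent.
SETUP_ROUND_TRIPS = {
    "TCP + TLS 1.2": 1 + 2,
    "TCP + TLS 1.3": 1 + 1,
    "QUIC": 1,
}


def first_response_ms(stack, rtt_ms):
    """Setup round trips plus one more for the request and response."""
    return (SETUP_ROUND_TRIPS[stack] + 1) * rtt_ms


for stack in SETUP_ROUND_TRIPS:
    print(f"{stack:<14} {first_response_ms(stack, rtt_ms=50)} ms")
```

On a long-distance or mobile path with a larger round-trip time, the gap between the rows grows in proportion.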

So, the QUIC transport protocol had encryption from the get-go, but it wasn't TLS.

It was a special thing that Google invented called QUIC Crypto. And so, they've been running that for a few years on their assets like Chrome and sites like Google or YouTube and basically collected a huge amount of data to say to people that actually you can deploy a heavily used user-focused protocol that isn't DNS, but something else that uses UDP and it does work.

Because a lot of people say, yeah, it's fine to do that, but it will fail for 25% of the Internet.

Google kind of captured this data and presented it into the IETF standardization body to say to people, like, look, here's what we've been doing.

We think this is a good solution and it has benefits to people, actual real-world benefits, and we would like to effectively contribute this in for other people to bring ideas into the table.

And that was basically, like, I don't know now, three or four years ago. And so, we've been working on that in the IETF, trying to take on board different use cases, different ways of implementing the same thing.

And effectively, the important thing in regards to HTTP3 is that Google's version of QUIC effectively baked, um, like, well, the HTTP aspects were very tightly coupled in.

So, it worked great for use cases that were the web or other HTTP things.

But in the IETF, we like reusability.

So, it was effectively a requirement to pull apart that tight coupling and create this layer of HTTP that was detached from QUIC, and therefore, you could throw it away and you could put other things on top of QUIC, like DNS.

So, you could have an encrypted version of DNS that isn't, um, isn't HTTP or isn't TLS.

It's its own thing, which might have some benefits, or you could do other stuff.

And so, then you've got this mapping of HTTP onto QUIC and what you call that.

Naming things is really hard. So, to start with, we called it HTTP over QUIC, which then is really confusing to people because it's like, well, is that HTTP1 or is it HTTP2?

And through, you know, some years of wishy-washy, we don't know what to do yet.

Like, call it three. It's the same, but it's different. At least it's like crystal clear.

Most people, they won't, um, they shouldn't notice.

It should just be like something that changes under the hood and improves stuff for them.

Obviously, there's a lot of tuning. This is a new protocol. We need to experiment and get this thing right.

So, while I've been at Cloudflare, we've been working on this and actually deploying it into our edge.

We made it generally available in September last year, and we've been continuing to interop with all the browser vendors out there who have experimental support in things like Chrome Canary or Mozilla Firefox Nightly, and continuing to both demonstrate it works, feed stuff back into the spec, and optimize things that we're doing on the host side, like congestion control, all of these things that we've been blogging about to say, like, here's what we're doing.

Here's what other people could do. The software that we've written that powers this on the edge is a library that's written in Rust, which I'm wearing a T-shirt of right now.

And this is an open source library on GitHub, and we've had people pick it up, use it, patch it, do some of their own performance testing, and say, we found that, you know, there's an optimization opportunity here.

So, it's been really, really cool to kind of collaborate with people in all these different ways.

So, we've got that deployed on our edge for our customers.

Do you have any sense for what percentage of our traffic is actually going over HTTP3 right now?

I'll be honest. It is low, because, you know, effectively, most people will be web browsing, say.

There's a lot of different uses for HTTP, like APIs or whatever, but the user agent, the software that would negotiate that version doesn't have, generally, HTTP3 enabled by default.

So, you have to go out of your way to go and enable this thing. And doing that varies by different things.
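One common way for a server to tell a client that HTTP3 is available is the Alt-Svc response header defined in RFC 7838; the conversation does not name the mechanism, so treating it as the relevant one here is an assumption. The sketch below is a simplified parser for that header, with a made-up header value using draft-version tokens such as `h3-29`. It does not handle every quoting rule in the specification.

```python
def parse_alt_svc(value):
    """Return a list of (protocol_id, authority, params) tuples."""
    if value.strip() == "clear":
        return []
    services = []
    for entry in value.split(","):
        parts = [p.strip() for p in entry.split(";")]
        protocol, _, authority = parts[0].partition("=")
        params = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
        services.append((protocol.strip(), authority.strip().strip('"'), params))
    return services


def offers_http3(value):
    """True if any advertised protocol is h3 or a draft h3-NN version."""
    return any(proto == "h3" or proto.startswith("h3-")
               for proto, _, _ in parse_alt_svc(value))


# Hypothetical header value advertising two draft versions on UDP port 443.
HEADER = 'h3-29=":443"; ma=86400, h3-28=":443"; ma=86400'

print(parse_alt_svc(HEADER))
print(offers_http3(HEADER))
```

A client that has HTTP3 switched on can use such an advertisement to try QUIC for later requests, while a client without it simply ignores the header and carries on over TCP.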

With cURL, you'd have to go and compile a special version.

It's not that hard, but you'd have to do something more than doing nothing, and a lot of people are busy and don't do that.

But for people who do want to experiment, like go and see the blogs or go and watch some of the other shows I've done where we kind of step through how to enable it, what to look for in your browser dev tools.

You know, we've had this enabled on blog.cloudflare.com for nigh on a year, like big real websites with complexity, like you mentioned, the different resources, and we're measuring things like metrics, like page load time, time to first byte, contentful paints, things like that.

But the cool thing that's happened this week is that we enabled it for Cloudflare TV.

So, right now, like this is possibly getting streamed to people via HTTP3 if they watch my Monday show, and if they miss that, then, yeah, the instructions to enable it apply to any kind of HTTP traffic.

So, I encourage people to try that out for Monday if they want.

That sounds great. And where are we in the timeline of real deployment at scale?

I mean, obviously, we've deployed it at scale for our customers, but the browser support needs to be there.

The standard needs to be nailed down.

What's the sort of future timeline look like for the IETF and for the rest of the Internet?

I mean, timelines are tricky. You don't want to ship something that's broken.

But over time, so I should make it clear, I'm also the co-chair of the QUIC working group in the IETF.

So, I started that role earlier in the year.

But in leading up to that point, they've been trying to nail down changes to the specification.

Everyone's got great ideas, but there comes a point where you need to say, is there really a good reason to change this thing?

And so, we've been working to that point.

Maybe a month or two ago, we issued what's called a last call, which is an indication that we think the document is ready now.

If you've been holding off, now is a great time to come and implement this thing.

When you decide to deploy it is ultimately up to you.

But what we're doing is working through a final set of issues that came in through that.

And it should be sometime this year that we can say it's done, and that we'll try and ship an RFC or a family of them.

And from that, I think that will be when we start to see more people turning this on from the user agent side.

And whenever that is, I mean, Cloudflare will be ready because we keep deploying, we keep updating with the latest versions of the protocol, right?

So, it's always the latest version, it's always available for testing.

And when it becomes real, it'll be available to everybody everywhere. Absolutely, yeah.

And the changes between the versions are very minimal. So, supporting that range is something that's very achievable.

All right, Lucas, we're out of time.

We talked about HTTP3 and QUIC for 30 minutes. It was great. Thank you so much for being on.

Thank you for deploying QUIC on our network, HTTP3. Thanks for your work with the IETF.

And I'm going to check in on Monday. I want to see what you talk about on Monday.

I want to see, I use WebPageTest, but I don't use it well enough.

So, I'm looking forward to that. Cheers. Thanks very much. Thank you.