This Week in Net: An outage in the UK, the importance of rollbacks for developers, and how we’re hiring!
Welcome to our weekly review of stories from our blog and other sources, covering a range of topics from product announcements, tools and features to disruptions on the Internet. João Tomé is joined by our CTO, John Graham-Cumming.
In this week's program, which is arriving earlier due to the Good Friday holiday in Europe, we start with Cloudflare's view of the Virgin Media outage in the UK. Then, we get a bit technical and go over the root cause analysis and solution for a bug found in Cloudflare's mTLS implementation. We also talk about the importance of our newest announcement, Rollbacks for Workers Deployments, for developers. There’s a deep dive on how we use Oxy (a Rust project) to deploy new versions of long-lived server software while maintaining a reliable experience. We also discuss how we’re hiring worldwide, but specifically in Portugal (remote included).
In our end-of-show “Around NET” segment, we go to Washington, D.C., where Zaid Zaid, from our Policy and Trust team, talks about the 2023 Summit for Democracy and Project Safekeeping (which serves under-resourced organizations that are vital to the basic functioning of our global communities).
Transcript (Beta)
Hello and welcome everyone to This Week in Net. It's the April 6th, 2023 edition. This week we're doing this on a Thursday because tomorrow is the Good Friday holiday here in Portugal.
I'm João Tomé based in Lisbon and with me I have as usual our CTO John Graham-Cumming.
Hello John, how are you? Hello, I'm very well. This is an unusual year because I believe that Ramadan is happening, Good Friday is happening, and it's also Passover.
Yesterday, right? Well, I'm not an expert on Passover, but it's over this period.
I think it starts, it's a few days. So all those things coincide, which they don't always.
Anyway, there are many religious celebrations happening.
And so we're doing this a day early. That's it. Let's do it. And on Sunday, it's Easter and my birthday.
There you go. We're celebrating that.
So it's nice in Portugal to give everybody the day off for your birthday. I didn't realize that was the thing.
It's not every company, but yeah, it's mostly common in some sense.
Oh, well, we had a bunch of blog posts and one big outage in the UK this week.
Where should we start? What about the Virgin Media outage in the UK?
Because that was actually kind of an interesting broadband outage because it was kind of a double one, right?
It happened and then it happened again. So we wrote a bit of a blog post up about that.
So now if you're looking at these charts and you are trying to find Virgin, you'll find that it's called NTL.
And the reason it's called NTL is that was actually what the company was called before it became Virgin Media.
Some of you will remember NTL World if you're in the UK. And the network, which is AS5089, had this interesting outage.
So you can see that there's an outage that starts, a pretty complete outage that happens in the night, basically.
And then the normal chart for NTL Virgin is that dotted line. And you can see there's a big dip.
Now, obviously, the Internet dips in the night because most people don't use the Internet in the night.
And normally, it comes up in the morning.
So you see that dip. And then you see another dip, which happened again. So there were actually two outages.
So let's take a look at the blog post for that, actually.
Exactly. David Belson wrote a blog post about this outage in specific, in particular.
It started a bit after midnight, UTC time. So we're using UTC here; it's an hour later in British Summer Time.
And you can see this just almost complete outage for Virgin Media.
And then it came back again, sort of early in the morning.
And then you can see things were running along pretty normally, the normal kind of traffic levels, following that dotted line, following the day before.
And then suddenly, there's another drop off where it broke again. So it's kind of interesting, because often you get outages that happen, and they get fixed, and then everything's stable again.
In this case, something happened at Virgin that caused the outage to happen again.
I imagine they did something to fix the outage in the morning.
And perhaps they were going back and making a permanent fix, or making sure things were stable, and unfortunately broke it again.
Now, one of the bad effects is that you couldn't actually get to the Virgin Media status page, because virginmedia.com, the website, uses Virgin Media, uses the same network.
And actually, this is quite interesting, because one of the things it tells you is that you couldn't even look up the DNS name, because the DNS server for Virgin Media is also on the Virgin Media network.
And everything is concentrated on that one AS 5089.
And so that AS actually kind of disappeared from the network. And you can see in the BGP data, lower down, what happens.
So BGP stands for, I think, Border Gateway Protocol.
It is a way that the networks that make up the Internet, because it's an inter-network, a network of networks, tell each other about their presence.
And you can see that there was a lot of activity. And normally, it's kind of quiet.
You see that line is normally flat. Normally, it's out there saying, hey, we're here.
We're Virgin Media. We're connected to the Internet. And it doesn't need to change anything.
But what you actually saw was a flurry of announcements and withdrawals, i.e., the network saying, I'm here, and then saying, I'm not here.
All happening during the night.
So this was with the outage. And the same thing happens again when the second outage happened.
Exactly. So I remember seeing this type of chart, BGP chart, for the Rogers outage last year.
And the Facebook outage too, actually, if you remember.
Facebook has their own network. That was in 2021, October 2021.
Yeah, those two outages. And those were outages reported all over the world because they impacted millions of customers. In one case, it's Facebook.
In the other case, one main ISP in Canada. And you could see like, in the case for Rogers, you could see they were having problems with a different scenario from this case.
But in the Rogers one, they were doing the withdrawal. Something was happening and there were withdrawals of prefixes.
And then they were trying to do the announcements.
In this case, it's similar. First withdrawals, and then announcements, and then a few more withdrawals.
And the second outage. Yeah, you make a very good point here, which is that if you look at this graph, the first thing that happens around the midnight time when it actually fell off, is a withdrawal.
The network disappeared, basically. And then about three hours later, something gets re-announced.
So you can almost imagine that someone inside Virgin at that point said, yeah, we've understood what the problem is.
We'll reconnect to the Internet.
And actually quite quickly after that, there's another withdrawal, right?
Then there's a gap. Presumably they're trying to fix it. Then they re-announce it.
Then they withdraw again, not long after. So obviously they realize something's wrong.
And finally, when it actually gets fixed, there's a flurry of announcements.
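The withdraw and re-announce sequence described above can be sketched in a few lines of Python. This is only an illustration: the prefix is a documentation address block rather than a real Virgin Media prefix, the update list simply follows the order described in the conversation, and the names Update and replay are made up for this sketch. It replays the updates and records whether the network is visible after each one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    kind: str    # "announce" or "withdraw"
    prefix: str  # the address block the update refers to


def replay(updates, initially_announced):
    """Apply updates in order and return the announced set after each one."""
    announced = set(initially_announced)
    snapshots = []
    for update in updates:
        if update.kind == "announce":
            announced.add(update.prefix)
        elif update.kind == "withdraw":
            announced.discard(update.prefix)
        else:
            raise ValueError(f"unknown update kind: {update.kind!r}")
        snapshots.append(frozenset(announced))
    return snapshots


# Hypothetical prefix from the documentation range, not a real Virgin Media prefix.
PREFIX = "192.0.2.0/24"

# Order follows the description: withdraw, re-announce, withdraw,
# re-announce, withdraw, then the final announcement once it is fixed.
updates = [
    Update(kind, PREFIX)
    for kind in ("withdraw", "announce", "withdraw",
                 "announce", "withdraw", "announce")
]

states = ["up" if PREFIX in snap else "down"
          for snap in replay(updates, {PREFIX})]
print(states)
```

Real BGP data carries many prefixes and many peers, so a chart like the one discussed counts updates over time rather than tracking a single prefix.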
And then as you say, the same thing happens again. And so for whatever reason, it may be to do with their network configuration, or maybe to do with the routers that connect them to the Internet.
They disappeared pretty much completely from the Internet at some points, or at least for many different connection points.
And you saw that particularly in the night, when it really went to close to zero.
And then in the afternoon, not quite so low. So something was happening.
We think something was happening at the network level, where it connects to the Internet.
Because you know, in other outages, we've seen things like a DNS server go offline.
And then the network might still be there, but you can't look up the IP address for that thing.
And so you get a different scenario.
But in this one, I think this looks like a network problem. And hopefully Virgin actually talks about it, because it's always fascinating to understand how these large systems have these kinds of problems.
And we've had our own set of these kinds of problems.
We've used BGP to shoot ourselves in the foot more than once.
Yeah, it happens. And these types of examples first show people how complex the Internet is, but also how sometimes things don't go very well.
And those learning curves help everyone in the industry, I think.
It's interesting. I mean, it also emphasizes how much we use the Internet all the time, right?
So we use it for so much stuff.
We were just reminiscing in the chat room internally about a company I started now about 20 years ago.
And when we first moved into the building, we didn't have an Internet connection.
So this was about 2002. We did have a dial up connection with a single 56k modem.
And we were trying to use that over Wi-Fi.
And you think, you know, how did that even work? And the truth is, it didn't work very well.
But we weren't so dependent on the Internet at that point. For example, we had a lot of our developer docs on CD.
So we can actually just look them up.
So it's a different world. And so these kinds of outages have an enormous effect on the users.
And obviously, in Virgin's case, you saw that in the UK with UK Twitter with people complaining about not being able to get access to anything on the Internet because their broadband was down.
Exactly. There's a lot of reports on Twitter.
And sometimes having an explanation out there helps also the user at least understand what is happening, hey, these types of problems, things like that.
So that's also interesting. And after the pandemic, it's a remote world, in the sense that a lot of people are using their own networks to access the Internet.
So there's also that. Yeah, working from home, being mobile, working remotely, there's all sorts of reasons why we now depend on the Internet so much.
Absolutely. So another example of that. Where should we go next?
Well, we can talk a little bit about this vulnerability, the mTLS client certificate verification vulnerability.
So this is actually just a retrospective on a vulnerability that actually we found in our own software for people who were using Cloudflare's firewall rules.
So this only applies to actually a very small number of customers.
But we like to write up, you know, when we make mistakes.
And so this looks at a particular bug. This wasn't a wide-ranging bug.
But basically what happens, we have this feature Cloudflare firewall rules, which allows a customer to define what traffic is allowed to get to their service, for example.
And one of the things that allows you to do is do it through a client certificate.
And so what happens is the customer can say that only certain pieces of software or machines or devices which have a TLS certificate on them, presenting that certificate are allowed access.
And it's quite a strong way of saying this is coming from, say, a genuine device or a genuine app or a genuine user.
And we terminate the TLS connection, and we check the client certificate.
All great, works fantastically. And this explains this process of what's called mutual TLS, because the normal sort of TLS we use, also sometimes called SSL, if you're going gray in the head like I am, SSL, is how we get a secure connection on the Internet.
But the normal way this happens is it's really done to, first of all, provide an encrypted channel.
So nothing gets, you know, eavesdropped on and also provide some proof that you're talking to genuinely, let's say, your bank when you go to it.
It really is your bank, because this certificate that the server presents to you and says, I can prove that I really am, I don't know, Santander in Portugal, for example.
The client certificate is the opposite.
It's the client, i.e., your phone, your web browser or whatever, proving to the other end that it is genuinely this thing.
So we allow, in firewall rules, a customer to enforce the use of these client certificates so they know it's coming from a genuine device, app, or whatever.
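As a rough illustration of the difference between ordinary TLS and mutual TLS, here is a minimal sketch using Python's standard ssl module. It is not how Cloudflare terminates TLS. The function name make_mtls_server_context and the ca_file parameter are made up for this example, and a real server would also need to load its own certificate and key.

```python
import ssl


def make_mtls_server_context(ca_file=None):
    """Build a server-side TLS context that demands a client certificate.

    ca_file is a hypothetical path to the CA bundle that issued the
    client certificates. A real server would also call load_cert_chain()
    with its own certificate and key before accepting connections.
    """
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # This is the "mutual" part: the handshake fails unless the client
    # presents a certificate that chains to a trusted CA.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if ca_file is not None:
        ctx.load_verify_locations(cafile=ca_file)
    return ctx


# Ordinary server-side TLS: only the server proves who it is.
plain_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)

# Mutual TLS: the client has to prove who it is as well.
mtls_ctx = make_mtls_server_context()

print(plain_ctx.verify_mode, mtls_ctx.verify_mode)
```

The only change between the two contexts is verify_mode, which is why client certificates are usually described as the same mechanism run in the opposite direction.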
We also allow them to revoke these certificates if they want to get rid of them and say, for example, let's suppose that one of the certificates has leaked in some way, someone decompiled the app or reverse engineered a device.
Maybe you want to get rid of the certificate. Usually for security reasons, right?
Yes, for security reasons, right. So you think, well, I'm going to issue new certificates and allow those, and this other certificate, which is now causing me a problem, I will block.
It's fine. That works great.
But TLS has a performance improvement called session resumption. And session resumption is if you go to your bank every day to check your bank account, which I hope you don't, because I think that's probably bad for your mental health, it will be quicker if your web browser or whatever could say, I've already been to this website before.
I already had a whole encrypted thing set up.
I'll just reuse that. I can just start again. I've already been through the proof of everything, got everything going.
And there are two ways this is done: session IDs or session tickets.
And there's detail in here how this works.
Unfortunately, there was a problem in our software, which is if a device had used a client certificate to get a connection, and we had allowed it through, and the certificate was revoked, and there was a session resumption, the same device came back and said, hey, by the way, I was using this five minutes ago, I can carry on using it.
It was possible for that connection to be allowed through.
So that was the bug. And so this blog post just talks about the bug, how long it existed, and how we remediated it.
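The shape of the bug described above can be shown with a toy model. This is not Cloudflare's code: the class Terminator, its methods, and the certificate serial "cert-A" are all invented for illustration. The point it tries to show is that a resumed session skips the full handshake, so the revocation check has to be repeated on the resumption path as well.

```python
from itertools import count


class Terminator:
    """Toy TLS terminator that remembers sessions for later resumption."""

    def __init__(self, recheck_on_resume):
        self.recheck_on_resume = recheck_on_resume
        self.revoked = set()   # serials of revoked client certificates
        self.sessions = {}     # session id -> client certificate serial
        self._ids = count(1)

    def full_handshake(self, cert_serial):
        """Check the client certificate and return a session id, or None."""
        if cert_serial in self.revoked:
            return None
        session_id = f"session-{next(self._ids)}"
        self.sessions[session_id] = cert_serial
        return session_id

    def resume(self, session_id):
        """Return True if the resumed connection is allowed through."""
        serial = self.sessions.get(session_id)
        if serial is None:
            return False
        if self.recheck_on_resume and serial in self.revoked:
            # Forget the session so it cannot be resumed again.
            del self.sessions[session_id]
            return False
        return True


def resume_after_revocation(terminator):
    """Connect with a certificate, revoke it, then try to resume."""
    session_id = terminator.full_handshake("cert-A")
    terminator.revoked.add("cert-A")
    return terminator.resume(session_id)


buggy_allows = resume_after_revocation(Terminator(recheck_on_resume=False))
fixed_allows = resume_after_revocation(Terminator(recheck_on_resume=True))
print(buggy_allows, fixed_allows)
```

In the first case the revoked certificate still gets through on resumption, and in the second it is refused. A fresh full handshake with the revoked certificate is refused in both.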
We discovered this. It wasn't discovered by an outside party.
Once we knew about it, we rolled out a fix in actually just a few hours, about four hours.
And we contacted all the customers. So if you're a customer of Cloudflare, you don't need to worry.
We have contacted you if you were in any way affected by this.
We have no evidence this was used in the wild. We found it ourselves.
And this just gives you a timeline of what happened. And this one's actually back in December.
Now, the reason we didn't write it up back in December is actually the truth is it got stuck in the queue of blog posts.
And because it was relatively low priority, it wasn't exploited, and all the customers had been informed, we hadn't actually blogged about it.
But we do like to actually blog these things and make sure that people can see the problems we've had.
And so this is it. And we blocked it back in December. And in January of this year, we actually rolled out a complete fix for all of this so that we had clean software that was there.
So that was a write-up of a vulnerability that happened. And we'll do more of these as these things occur.
Of course. And in this case, there wasn't, like you were saying, an urgency.
Because when those are there, we usually publish after a few minutes or hours really quickly.
Yeah. Well, if it had been exploited, or if it was wide ranging, yeah, we would have been like, now, now, now, now, now.
But we didn't want to just let it pass unblogged. So. I really like the handshake expression in TLS.
Yes, the handshake. Well, it is. I mean, TCP has a handshake too, right?
It's like, hey, I'm here. Oh, yeah, you're there. Yes, I'm here.
And then, you know, it's like that. So human-like giving a name like that. Well, it depends on your culture, right?
Because it might not be a handshake. That's true.
It might be a kiss on both cheeks, or it might be bowing, or it might be all sorts of things.
But we call it a handshake. Where should we go next? Why don't we do rollbacks?
Rollbacks is a fun one. For Workers deployments. So, I mean, I don't know about you developers out there, but I've certainly deployed things that didn't work more than once, let's just say.
And sometimes you just, you know, the most important thing is to be able to go backwards very quickly.
And what we did back in November is we introduced Deployments for Workers.
Deployments for Workers was a way you could have sort of a packaged up set of stuff you were deploying out to our network, and there can be different versions of it.
So it's like, you know, you can imagine you've got the live version, and then the next version is coming along, and you go in there and you say, right, I'm just going to make this one active, right?
This is the version I want out there.
And you can do this through the API and Wrangler and the dashboard, Terraform, all that kind of stuff.
It allows you to control which version, if you like, of the thing you're building on Cloudflare is actually deployed.
Well, that's great. And you can, you know, deploy those things. But one of the things that you might really, really want to do is suddenly go, oh, no, we broke it.
I need to roll back. I need to go back right now. And so we added the ability to add a rollback by clicking a button, just go, nope, change, go back to this version, add a quick description of, you know, you can imagine putting in here, if you want to, you don't have to, but put in here something like, oh, rolling back because of bug 6934, right?
And for history purposes, right, that's relevant.
Yep. And it happens just like that. So, and you can do it with Wrangler. You can just type wrangler rollback.
And one of the things that's really nice in Wrangler is if you don't give it an ID to roll back to, it rolls back the latest deployment.
So, it's really nice because it's kind of the undo, right?
It's like, I deployed something, I broke it.
You type wrangler rollback, and boom, it will do the rollback for you. Obviously, you can add a comment and all that kind of stuff like that.
But it's, you know, it's a way to quickly reverse.
And I think this is an important piece of functionality for us.
So, you know, do it as the way you want to. And hopefully, this will give you confidence in deploying and deploying fast because I don't know about you, but as a developer, you want to write code as fast as you can and deploy it and get it out there and get people testing it and constantly updating little things is the world we live in.
So, this gives you some safety where you can just go, oh, no, wrangler rollback, and we'll go back to the unbroken version.
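To make the undo idea concrete, here is a small model of a deployment history with a rollback operation. It does not use the Cloudflare API or Wrangler: the classes Deployment and DeploymentHistory are hypothetical, and recording a rollback as a new entry in the history is an assumption made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    id: int
    code: str
    message: str = ""


class DeploymentHistory:
    """Toy history of deployments; the newest entry is the active one."""

    def __init__(self):
        self._log = []

    @property
    def active(self):
        if not self._log:
            raise LookupError("nothing has been deployed yet")
        return self._log[-1]

    def deploy(self, code, message=""):
        deployment = Deployment(len(self._log) + 1, code, message)
        self._log.append(deployment)
        return deployment

    def rollback(self, target_id=None, message=""):
        """Re-activate an earlier deployment.

        With no target_id this goes back to the deployment just before
        the active one, which is the "undo" case.
        """
        if target_id is None:
            if len(self._log) < 2:
                raise LookupError("no earlier deployment to roll back to")
            target = self._log[-2]
        else:
            target = next((d for d in self._log if d.id == target_id), None)
            if target is None:
                raise LookupError(f"unknown deployment id: {target_id}")
        # Assumption: the rollback is itself recorded as a new entry,
        # so the optional message is kept for the history.
        return self.deploy(target.code, message)


history = DeploymentHistory()
history.deploy("v1")
history.deploy("v2 (broken)")
history.rollback(message="rolling back because of bug 6934")
print(history.active.code, "|", history.active.message)
```

The useful property is that the caller never has to rebuild the old version from source control. The history already knows what was running before.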
I'm curious about that, actually, in terms of before this, how many minutes or something like that a rollback could take, depending on the developer, right?
The experience and all that.
So, it could take a few minutes? Prior to us having, you know, deploying different versions, right, in the deployments, then what you would have to do is you would have to push out a new version of your software, so whatever that required.
And so, I think the really difficult thing in the workflow is if you imagine you're working on your production branch of your software, and you've pushed it out to the edge, you've deployed it out, and then you realize it's broken, how do you get back to the previous version?
And depending on your source code control system, maybe that's nice, maybe it's a label, and you can just go backwards to that, or maybe it's complicated, depending on the thing.
What deployments did was it said, okay, look, Cloudflare will track, okay, you deployed this bunch of software to the edge, and we'll call it this.
However you made it on your end, this is what you did.
And so, the deployments allows you to go back to one of those versions, and then what this does is just allows you to have like a sort of an emergency, a big red button where you're like, roll back, roll back.
And so, I think that's what's nice about this, is it takes away, it breaks the linkage, if you like, from you having to go back and figure out, okay, how do I get back to the previous version?
And I think, you know, with the best will in the world, all of us have had difficulty getting back to the previous version of software from time to time.
Makes sense, the command Z or control Z. Yeah, exactly. It's a good expression.
Undo. And there's a memory element, so it keeps a memory of how it was before, which makes the rollback faster.
Well, yeah, because you could imagine like, okay, wait, did we label that particular release?
Or, you know, what you want is to be able to iterate fast, and so this will allow you to do that.
Should we go to Oxy?
The Oxy proxy? Yes, let's talk about the Oxy proxy. So, for context, if you don't know what Oxy is, the first thing you probably need to do is click on the link that takes you back to, you know, Oxy, the proxy framework.
So, one of the things Cloudflare built in Rust is a thing called Oxy for proxy.
It rhymes. And what this thing does is it allows us to build proxies, because a lot of what Cloudflare does is, so traffic comes in, we need to do something to it, and it needs to get sent somewhere else.
Now, somewhere else might be someone else's server, it might be a caching thing, it might be Workers, but fundamentally, these proxies handle traffic going through us.
And traditionally, we have reverse proxies, which are sitting in front of a website, sitting in front of an API, and they are taking in requests from you and me using our apps or using our browsers, and sitting in front of a web server.
We also have forward proxies, like our Gateway product, where, you know, we're at work, you use WARP, and I use WARP, and our web browser traffic gets sent to the forward proxy, which makes sure we're not going to a website which is dangerous for us, or where there's malware or something like that.
So, we have proxies all over the place.
And Oxy is a framework for doing it. It's really interesting to read about the Oxy framework, and there's been a whole lot of blog posts, and there's going to be more talking about it in design and architecture, all in Rust, actually using Wasm as well.
And one of the things you have to do is graceful restarts, because guess what?
We were just talking about rollbacks in software. Your software is evolving, and when you roll out a new version of something, of a proxy or your code, you need a way to go from the old version to the new version.
Now, when you're really small, and you don't have much traffic, you might actually be able to stop the old version and start the new version, and there'd be a brief period where a website doesn't work, and nobody would really notice.
They might say, oh, I had to refresh, or, you know, something like that.
Well, that's great, except for a couple of things. First of all, at Cloudflare, there is no downtime.
The network is active all the time. I mean, you saw in the outage one, even those graphs at night, where before Virgin Media went offline, the Internet was still active, it wasn't asleep.
So there's no night, there's no point where you can stop doing that.
And also, the other thing is, there are lots of situations where things need to be long running, like, hey, we're on a Zoom call right here, and this thing better not just stop in the middle of it, because that would be annoying.
Or there are WebSocket connections, which need to be very long running.
So you need a way to be able to shut down an old piece of software and start a new piece of software gracefully and actually hand the work between them.
And that turns out to be really complicated. So for Go software, stuff written in Golang, we used a thing called TableFlip, which was a piece of software that allows us to move the sockets around and actually, you know, sort of drain the traffic from the old executable and give it to the new executable.
And we had to do a similar thing for Oxy. So this is about how we handle slowly shutting down stuff going to an old Oxy and sending it to a new Oxy so that nobody knows that we shut down and restarted stuff.
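The hand-off and draining idea can be sketched as a toy model. This is not how Oxy or TableFlip work internally. Real implementations pass listening sockets between operating system processes, whereas this sketch only models the bookkeeping, and the names ProxyProcess, Listener, and graceful_restart are invented for the example.

```python
class ProxyProcess:
    """Toy stand-in for one running version of a proxy."""

    def __init__(self, version):
        self.version = version
        self.accepting = True
        self.active = set()  # ids of connections still being served

    def accept(self, conn_id):
        if not self.accepting:
            raise RuntimeError(f"{self.version} is draining")
        self.active.add(conn_id)

    def finish(self, conn_id):
        self.active.discard(conn_id)

    @property
    def can_exit(self):
        # Safe to exit only after the hand-off and once every
        # in-flight connection has ended on its own.
        return not self.accepting and not self.active


class Listener:
    """Toy stand-in for the listening socket shared across versions."""

    def __init__(self, owner):
        self.owner = owner

    def new_connection(self, conn_id):
        self.owner.accept(conn_id)
        return self.owner.version


def graceful_restart(listener, new_process):
    """Hand the listener to the new version and start draining the old."""
    old_process = listener.owner
    listener.owner = new_process   # new connections go to the new version
    old_process.accepting = False  # the old version takes no new work
    return old_process


listener = Listener(ProxyProcess("old"))
listener.new_connection("long-call")        # a long-lived connection
old = graceful_restart(listener, ProxyProcess("new"))
served_by = listener.new_connection("page-load")
still_draining = not old.can_exit           # the long call is still active
old.finish("long-call")
print(served_by, still_draining, old.can_exit)
```

The long-lived connection stays with the old version until it ends, while everything that arrives after the hand-off is served by the new one, so nothing is cut off mid-call.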
So this is a component of the Oxy proxy work.
Go read all the blog posts if you're interested in how it works.
But it's a very nice piece of software and it has given us the ability to build many things.
It underlies stuff like WARP. It underlies the iCloud Private Relay service.
A very big part of what we do. Very fascinating. And there'll be this list, this tag of Oxy, I think is going to end up having probably eight blog posts in it about all of the stuff that's in there.
And if you're interested in Rust and Wasm at scale, again, take a look at this.
It's really interesting how a smooth experience takes a lot of work and Oxy is one of those cases.
Yep. It's one of those things where it grew over time and we absolutely had to grow it out.
And you're going to see all the things we do within there to make it really grow.
So I think what's interesting in here is we've really shown that Rust works extremely well for us.
And I think you'll see in there the Rust stuff.
The other place I was referring to about the Wasm is another place where we use Rust, and we did an old blog post about this, which is about a thing called BigPineapple.
And BigPineapple is our Rust-based server for 1.1.1.1, for which we wrote a new one.
And it actually uses Rust and Wasm together to run that service.
And so you see these things really coming together very, very well.
Really, really interesting to understand how these things work, to be honest.
We still have a few minutes. Should we highlight possibly that we're hiring?
Sure.
So, absolutely, go ahead and you tell that story. We are hiring, definitely.
And actually, since you and I are both in Portugal, let's just zoom in on Portugal and just be a little bit cheeky.
We're hiring all over the place, but we're hiring in Portugal.
And we're hiring in all of Portugal. So we have right now about 87 job openings in Portugal, across all sorts of departments, as you can see, including finance, including HR and all these things.
So huge number of jobs, but also not just in Lisbon. So we actually have expanded out into the rest of Portugal and we have people a little bit all over the place, I think from Braga to Faro, probably.
And you can find those by looking at remote Portugal jobs.
So, yeah, this is all Portugal. But if you're not, if you're unfortunate enough to not live in Portugal, I mean, I know there are some of you out there.
There are lots of other things in Cloudflare, both remote and within offices.
So please take a look. We are continuing to need people to help build a better Internet.
True. And at this moment for tech companies, it's quite amazing that we're still hiring. It also shows the progress and the growth of the company in a sense, which is interesting.
Yeah, we have. And, you know, I think being in Lisbon and being in Portugal has been a great win for us.
So come join us.
And I will just give you one shout out to one particular group of people, product managers.
One of the things we would like to build more of a function of in Lisbon and in Portugal in general is product managers.
We're building quite a lot of our important products here.
We need product management talent.
Please think about if you're a product manager interested in working on Internet things.
We have jobs for you. Please, please apply. True. So big office and big growth in Lisbon, Portugal, but also elsewhere.
Yep. It was great, John. Have a good Easter.
Yes. See you next week. And have a good Easter and Passover. And good luck to everybody observing Ramadan, too.
We're getting halfway through, I think.
We're getting close to halfway through now, I think. And so, yes, you have a good weekend.
And I guess happy birthday. Thank you. Let's see how it plays out.
Cheers. See you. Cheers. Bye-bye. Before we go, it's time for our short segment, Around NET.
And in this case, we're going to travel to Washington, D.C., in the U.S., to hear from Zaid Zaid. He's the head of U.S. Public Policy.
And he's based in Washington, D.C. He was at a conference this past week.
And let's hear from him. Hi, I'm Zaid Zaid. I'm the head of U.S. Public Policy at Cloudflare.
I was born in Washington, D.C., and I currently live in Washington, D.C., although I've lived in other places all around the world.
I'm actually coming to you live from the Summit for Democracy here in Washington, D.C., on the third day of the Summit. This week, I wrote a blog post about Cloudflare's commitment to the 2023 Summit for Democracy.
One of the reasons why the Summit for Democracy is hugely important is because it is dedicated to some of the values that we cherish at Cloudflare, like human rights and democracy, as well as our commitment to an open, secure, interoperable Internet.
The most interesting thing I'm working on right now is Project Safekeeping, which we launched back in December as part of Impact Week.
Project Safekeeping provides enterprise-level, Zero Trust cybersecurity protection to vulnerable infrastructure at no cost and with no time limit.
And we rolled it out in Asia as well as in Europe.
We'd like to roll it out to other parts of the world as well. Fun fact for me, when I was in grade school, I was in the National Spelling Bee, but I actually was out in the first round because I misspelled the first word that they gave to me.
I recovered though. This is a Himalayan cedar tree, which is about one block from my house.
And we love this park, this traffic circle. And we come here every year to take our pictures for our New Year's cards we send out to everyone.