Cloudflare TV

This Week in Net: Deep dive, Internet disruptions & Data Privacy Day

Presented by João Tomé, John Graham-Cumming
Welcome to our weekly review of stories from our blog and elsewhere, from products, tools and announcements to disruptions on the Internet.

João Tomé is joined by our CTO, John Graham-Cumming. In this week's program, we talk about a deep dive related to network debugging (A debugging story: corrupt packets in AF_XDP; a kernel bug or user error?), go over the three new winners of Project Jengo (and more defeats for the patent troll), and cover the Internet disruptions overview for Q4 2022 and Cloudflare’s January 24 incident. We end with a reading suggestion: our three blog posts related to Data Privacy Day (Jan 28), including the one about what Cloudflare’s Data Localization Suite is able to do.

Transcript (Beta)

Hello, everyone, and welcome to This Week in Net, our January the 23rd, 2023 edition.

I'm João Tomé coming to you from Lisbon, Portugal, and with me I have, as usual, our CTO, John Graham-Cumming, also in Lisbon.

Hello, John. How are you? I'm recovering from the flu, so if I sound a little different, that's because I'm still a bit blocked up.

Makes sense. My voice has been deep since I had a cold four weeks ago, which is kind of weird.

We're all suffering a little bit. Exactly. We didn't have an episode last week.

Our last This Week in Net episode was about CIO Week, with all sorts of solutions for those who try to keep a company's processes and systems running faster.

But we had a lot of blog posts in the past two weeks.

One of the blog posts we did last week was a deep dive, a debugging story. We have Data Privacy Day coming up tomorrow, actually, and three blog posts about that.

But why not start with a deep dive, this debugging story? I think one of the things the Cloudflare blog is famous for is these very technical stories.

This particular one, a debugging story about corrupt packets in AF_XDP, was a really interesting one because if you think about the scale at which Cloudflare operates, there are all sorts of errors that occur in the software across our systems, whether things that we've introduced, things that have been in the Linux kernel, or things that are in the hardware.

Famously, in the past, we've spent a long time looking at some crashes, which turned out to actually be an Intel microcode bug.

So we like to really drill down on errors that are happening.

Obviously, we handle a lot of packets.

We use a thing called XDP, Express Data Path, to move packets through the Linux kernel.

We were seeing random packets getting truncated, basically, or at least panics happening, saying that the packets were truncated.

Well, first of all, that's not good because that means there's something wrong at a really fundamental level, which isn't the network level.

Also, at our scale, we want to eliminate all of these things.

This blog post is really long, so if you want to read it, give yourself some time.

It was trying to understand, first of all, why we were getting a panic, and then why certain packets were getting truncated and causing the panic.

It's a really deep dive into how we do packet processing at a low level, because XDP is very fundamental to a lot of what we do; into how packet queuing works in the kernel; and into how a thing called flowtrackd, which we use for DDoS mitigation, works, because that's a fundamental part of what we do.

So if you're up for a debugging story, you have time, and you want to sit down and read all about the gory details, I do recommend this post.

It's by Shawn and Bastien.

On the 16th of January: A debugging story: corrupt packets in AF_XDP; a kernel bug or user error?

Yeah, we suspected that we had probably used AF_XDP in some incorrect manner, and it turned out that, nope, this was a real bug, and we ended up patching it, and you can read all about the patch as well.
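To make the "truncated packet" failure concrete, here is a minimal Python sketch. It is not the code from the blog post: it only compares the length an IPv4 header claims with the number of bytes actually received, and reports a recoverable error instead of panicking. The names `TruncatedPacket` and `ipv4_payload` are hypothetical and chosen only for this illustration.

```python
import struct


class TruncatedPacket(Exception):
    """Raised when fewer bytes arrived than the IPv4 header claims."""


def ipv4_payload(buf: bytes) -> bytes:
    """Return the payload of an IPv4 packet, checking lengths first."""
    # The fixed part of an IPv4 header is 20 bytes.
    if len(buf) < 20:
        raise TruncatedPacket(f"got {len(buf)} bytes, need at least 20")

    # First byte: version (high nibble) and header length in 32-bit words.
    # Bytes 2-3: total length of header plus payload, big-endian.
    version_ihl, _tos, total_length = struct.unpack_from("!BBH", buf, 0)
    if version_ihl >> 4 != 4:
        raise ValueError("not an IPv4 packet")

    header_len = (version_ihl & 0x0F) * 4
    if header_len < 20 or total_length < header_len:
        raise ValueError("malformed IPv4 header")

    # The check this sketch is about: the header promises more bytes
    # than the buffer actually holds.
    if len(buf) < total_length:
        raise TruncatedPacket(
            f"header claims {total_length} bytes, buffer has {len(buf)}"
        )
    return buf[header_len:total_length]


# Hypothetical sample packet: 20-byte header followed by 4 payload bytes.
sample_header = struct.pack(
    "!BBHHHBBH4s4s", 0x45, 0, 24, 0, 0, 64, 17, 0, bytes(4), bytes(4)
)
sample_packet = sample_header + b"abcd"
```

Calling `ipv4_payload(sample_packet)` returns the four payload bytes, while `ipv4_payload(sample_packet[:22])` raises `TruncatedPacket`, which a caller can count and log rather than crash on.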

Exactly, and one of the things that always surprises me is how much others doing the same kind of work at other companies learn from this.

So it's also about sharing a learning experience. That's something that is important for the Internet, right?

Well, you know, I think as programmers, debugging is such a fundamental thing we do.

In fact, it's both amusing and depressing: when the first computers were created in the 1940s and 1950s, there was a very famous computer in the UK called EDSAC, and one of the people who created it recalled something from shortly after they were able to start programming it. This is really early days, right?

This is the time when there were a handful of computers in the world. He recalled how he suddenly realized that a large percentage of the rest of his life was going to be spent not writing code but fixing errors (he didn't even call it debugging), often in his own code.

So I think as developers, we are very, very familiar with trying to figure out why the thing doesn't work.

So debugging stories, and especially debugging stories that end up in a kernel patch, well, they're great.

Enjoy it. Enjoying the two things there. We also had last week the announcement of three new winners of Project Jengo, and more defeats for the patent troll.

Can you explain a bit what Project Jengo is, how it came about, and tell us about these three new winners?

Yes, so Project Jengo is an effort by Cloudflare to essentially put patent trolls out of business.

About six years ago now, a patent troll came after us using a patent that they had acquired.

We looked at it and said, this is really abusive. The company had been, or the patent troll had been filing lawsuits against companies, in some cases getting them to pay up over a patent that we thought was simply not something that should have been patentable.

So we created this thing called Project Jengo, which was to crowdsource people submitting prior art, so as to demonstrate that the idea was already known before the patent was ever created.

We put up a lot of money here, where we would award people money if they found prior art for these patents.

It's worked great.

There were a couple of patent trolls, one called Sable Networks, and one called Blackbird, that we did this against.

We've given out $135,000 to individuals who have helped us find prior art, and then ultimately those are used to invalidate the patents and fight back against this abusive behavior.

The latest update is just three more people who have helped us by finding prior art.

It's interesting, these are people who just read about the project, they read the patent, they're like, well, maybe I worked in that area in the past, or maybe I'm just interested in searching the literature, maybe I'm just very technical.

These are three people who did this.

The first one, a chap called Chris Wheeler in Georgia, we awarded him $5,000 for finding prior art, which in his case was a paper.

It was a paper that had been published, and it is prior art to the particular patent that these Sable Networks folks are trying to say is something that we need to license from them.

He was able to find that description. A chap called Peter S. also got $5,000, again for finding prior art to a particular patent.

In this case, someone's thesis, which was like five years before the patent had ever been applied for, showed that this sort of thing was known about.

A chap called David, another $5,000 for finding basically a patent, which predates the patent.

I think we'll see this stuff go on. I think we have another $50,000 we're going to give out, and we keep searching.

The blog post also talks about patents that have been thrown out of the case and the way the claims are being whittled down as we fight back against what is ...

It's not genuinely inventing something.

It's just in an abusive way trying to milk money out of companies.

Exactly. In a sense, it's trying to overcome something that's not particularly helpful for innovation, right?

Yes. Absolutely. Absolutely.

We also have an Internet disruptions overview for the fourth quarter of 2022.

Yes. Obviously, this is part of an area you're very familiar with, which is Cloudflare Radar, which is radar.cloudflare.com, where we look at all sorts of traffic trends, performance, security, and availability across the Internet.

Of course, one of those areas is big disruptions. David Belson, who also works on this stuff, put together this summary for the fourth quarter, and he's doing this quarterly to look at the sorts of shutdowns there are of the Internet.

It's pretty interesting.

He starts out with things that are done by the governments around the world.

For example, there was a shutdown in Cuba in September when there were protests happening.

Then another one in October.

As you can see, the Internet is just being shut down. Sudan, again, first anniversary of a coup in Sudan, and the Internet was shut down for the day by the government.

Iran, I mean, obviously, the Mahsa Amini protests are continuing in Iran, and there are continuing shutdowns of the Internet.

There was a daily Internet curfew, which was happening between about 4pm and midnight local time for mobile Internet connectivity.

They carried on through October, and then there were other more sporadic things.

Actually, there was one just the other day, actually, in January.

Iran is continuing to do this kind of stuff.

Then you get power outages. The Internet needs electricity. In October, there was a big power outage, a grid failure in Bangladesh, and that shows up in Internet traffic.

You can really see that drop down. In Pakistan, in certain areas in the south, there were big power outages, which caused Internet traffic to drop.

In Kenya, in November, you see loss of connectivity.

Then you have the one in the US, in Moore County, North Carolina, where somebody shot at two electrical substations, which caused damage, which caused power outages, which took a long time to resolve.

They showed up as Internet being completely unavailable for a certain amount of time in those areas because of electricity.

Then, of course, you have Ukraine.

We are getting close now to a year after Russia's invasion of Ukraine.

As you will know, relatively recently, Russia has moved to attacking the country's infrastructure, and power infrastructure is part of that.

That shows up in outages in the Internet in different parts of Ukraine.

I know you follow this very closely.

Yeah. Even this week, energy infrastructure in Odessa was impacted.

It was yesterday, so Thursday, an 18-hour disruption. It has become customary for attacks on energy infrastructure to have an impact, of course, on the Internet as well.

We can see that since October, especially since October 10th, it has been pretty common to see those drops in traffic in several cities, sometimes in almost all the cities.

You can see sometimes in the whole country, a big Internet traffic change.

Sometimes it's only in a few areas.

Over October, November, December, and already in January, we've been seeing these types of clear disruptions.

Absolutely. It continues. As you say, sometimes it's a whole city, sometimes it's a network, sometimes it's a region.

It depends on what's happening.

What's striking to me is how fast they get it back online again.

Obviously, this stuff is going on, but then it's like, it's back again a few hours later or a day later or something like that.

A day later. Sometimes, you can see that there's an impact.

Traffic is just a bit lower than before.

It goes up. You can see that, for sure, there's disruption.

Even in terms of energy, they seem to be dealing with that. I think sometimes energy infrastructure has a bigger impact because they sometimes shut off power just to avoid using too much energy.

They're saving energy overnight and all that.

I feel like the Internet across the country has been resilient, especially during the day, which is in some cases quite surprising.

It's a good thing, of course.

Then we have cable cuts. We only had one this quarter to really talk about, which was the cable that connects the Shetland Islands, which are a group of islands off the north of Scotland, somewhere between Scotland and Norway.

They connect the Shetland Islands into Scotland. They're part of Scotland. The cable got damaged.

It was something that people reported a lot because although the Shetland Islands are tiny, and maybe it wouldn't have made the news outside of circles like ours, this didn't come long after the Nord Stream natural gas pipeline was blown up.

People were worried that this might be sabotage, but it looks like it was just the classic case of a fishing vessel dragging up the cable and damaging it, which happens.

They got it back online a day later. There's more attention to that area, like when there's cable cuts, there's more attention, of course, because of the war happening and the region.

Especially around the north of Scotland, because the north of Scotland is strategically important for Britain, because that's where we hide all our nuclear submarines somewhere up there.

So if there's anything that goes on up there, we always like to take a look.

And then natural disasters: earthquakes were a problem.

The Solomon Islands had a pretty big dip because of a very large magnitude 7 earthquake that occurred there in November.

And then things just go wrong sometimes, snafus. So in Kyrgyzstan, there was a three-hour Internet disruption on October 24th.

And well, at least the Ministry of Digital Development said it was an accident on one of the main lines that supplies the Internet.

So I'm not really sure. Maybe that was a cable cut.

Maybe somebody turned something off and needed to turn it back on again.

Hard to know. In Australia, there was quite a big outage for a broadband provider called Aussie Broadband in Victoria and New South Wales, where traffic dropped 40%.

That was pretty big. And Cloudflare had an outage earlier this week.

And I feel for Aussie Broadband, because in their discussion, they say a config change was made, which was pushed out through automation to the DHCP servers in those states.

And that caused the whole thing to go offline and they had to roll it back.

And so, yeah, Aussie Broadband, we've been there. Haiti, lots of problems in Haiti.

They blame this on outages of the international circuits. Hard to know what that means.

Could be something to do with the cable cut, could be something to do with the routers that are connected to the cables.

And, you know, these things happen.

And then sometimes things happen and we just don't know why. So a provider called Wide Open West in the US had an outage that lasted about an hour.

And well, not sure why.

Cuba had a long outage in November, seven hours. Not sure why. Starlink had an outage, short one, about 40 minutes in November.

Maybe we will find out one day what happened there.

Probably somebody disconnected something by accident and plugged it back in again.

So that's David's roundup of Internet disruptions.
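As a rough illustration of how drops like these show up in traffic data, here is a minimal Python sketch that flags time slots where observed traffic falls well below a baseline. The function name `find_drops`, the 60% threshold, and the sample numbers are all assumptions made for this example; it does not describe how Cloudflare Radar actually detects outages.

```python
from statistics import median
from typing import List, Sequence


def find_drops(
    baseline: Sequence[float],
    observed: Sequence[float],
    threshold: float = 0.6,
) -> List[int]:
    """Return indices of observed samples below threshold * median(baseline)."""
    # Use the median of a normal period so one odd sample does not skew it.
    floor = threshold * median(baseline)
    return [i for i, value in enumerate(observed) if value < floor]


# Hypothetical hourly request counts (arbitrary units).
normal_week = [100, 110, 90, 100]
today = [95, 58, 20, 99]
dropped_hours = find_drops(normal_week, today)
```

With these sample numbers the baseline median is 100, so any hour below 60 is flagged, which here means the second and third hours. A real system would also need to account for daily and weekly traffic cycles.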

We'll keep doing this, looking at the different situations. I think we will start to get into exam season in some countries and then we'll get the exam related Internet shutdowns, which always blows people's minds when they realize that in some countries switching off the Internet is a technique to stop people from cheating.

Exactly. That's pretty common as we are seeing year over year.

And this is the Internet outage map. For those who are listening in podcast format, there is a Cloudflare Radar page with an outage center that we internally call CROC.

And it shows the places in the world where recent outages or shutdowns happened.

The most recent are Ukraine, Iran already in January, and also Pakistan.

Yeah, power outage in Pakistan in January too. So that's Internet disruptions.

And you already mentioned that we had an incident this week, on January the 24th.

Some Cloudflare services became unavailable for 121 minutes.

Which is a long time. It's a long time. Yeah. And it was not all customers, right?

Not all customers were impacted. Well, so in big Cloudflare outages in the past, you haven't been able to get to things that use Cloudflare, like a website or the backend of an application you're using.

And those become very, very visible and become something that ends up in the news sometimes because Cloudflare has such a large footprint and a large number of customers.

A large share of websites use us. And so those things have a really big impact.

This had an impact on different services within our platform. So it depended on what you were doing.

For the most part, the websites kept working, but lots of other things stopped working.

So one big thing was a thing called WARP, which is our connector into the Cloudflare network, particularly for corporate users.

So we use that, right, to connect into Cloudflare, and also the Zero Trust products, which allow you to log in.

So this did have an effect on people being able to log into things in their company.

They might've been able to use the web in general.

And then there were some secondary effects on things like cache purge, so being able to delete stuff from Cloudflare; Images, the images product, where the question was whether you could upload new images; and R2, which is our object storage.

What went wrong was a software release. We use these things called service tokens, which allow services to talk to each other and have the authorization to do things.

And there was a fault which caused the service tokens to be incorrect.

In fact, it caused them to be empty. And because of that, services couldn't authenticate to each other and couldn't get permission to use each other.

And we had to go back and basically roll back the problem that caused them to be empty and then go and get the actual service tokens and put them back into place.
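The failure mode described here, a release that leaves service tokens empty, is the kind of thing a pre-deployment check can catch. The following is a minimal Python sketch under assumed names: `find_empty_tokens` and the `service-*` entries are hypothetical, and it does not describe Cloudflare's actual tooling.

```python
from typing import Dict, List, Optional


def find_empty_tokens(tokens: Dict[str, Optional[str]]) -> List[str]:
    """Return the names of services whose token is missing or blank."""
    return sorted(
        name
        for name, token in tokens.items()
        # Treat None, "" and whitespace-only strings as empty.
        if token is None or not token.strip()
    )


# Hypothetical configuration produced by a release.
candidate_config = {
    "service-a": "abc123",
    "service-b": "",
    "service-c": None,
    "service-d": "   ",
}
broken = find_empty_tokens(candidate_config)
```

A release pipeline could refuse to roll out a configuration for which this list is not empty, so that services do not lose the ability to authenticate to each other.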

So it took a while to get this under control. It was understood pretty quickly what was happening.

If you look at the timeline, what happened was that as these service tokens became invalid, things began, in a rolling fashion, to get worse and worse.

And so there's actually some really clear graphs in here where you see it sort of ramp up.

And then as we get things under control, come down.

So, yeah, that was not a fun time at all, particularly because this authorization, the need for services to be able to talk to each other, is so fundamental to us and to many, many systems that use lots of software that has to talk to each other.

And we did an update on how things were going after we wrote it, because we were quick to put the post out there.

Right. Yes. Yes, absolutely. I mean, we always try to do that, which is to very, very quickly explain what happened and give a detailed timeline and what the impact is, because, you know, our customers themselves would have been debugging.

Why can't I do this particular thing? And the, you know, they need to know the impact exactly on them.

They will have to, you know, they'll need to report internally.

They will need to know it's us, not them, because they might be debugging.

So we try to put stuff out very quickly. Obviously, the status page, we update very, very fast.

But then we want to put the post mortem up quickly so people understand the impact and understand that we care about it as well, because you do see companies that will have an incident and put up a very brief apology and it's kind of like, you know, let's move on.

But we try to be detailed. And if anybody's interested in the history of outages at Cloudflare, they can read the blog posts and understand what, you know, what went wrong and when.

It's possible that many customers weren't aware that this happened.

So we're putting it out there, but there were a lot of people that didn't see a problem per se.

Yes, sure. But it matters a lot to the people who did get affected.

And also, it's part of our promise, right?

So even if you were not affected, you can then have some confidence that if there was something that affected you, you would hear about it and you would get that kind of detail.

So I think this is an essential part of what we do.

Of course. We also have a blog post, let me try to find it, related to the Holocaust.

Before we go to Data Privacy Day: Holocaust Memorial Day is today, Friday.

Yeah, exactly. And we have a very specific blog post by Omer about cyber attacks on Holocaust educational websites and how they increased in 2022.

Yes. So this is, again, looking at the sort of data that we have. If you look at it from the perspective of the Holocaust on the Internet, there are quite a number of educational websites that explain what happened to the people who were killed in the Second World War in the Holocaust.

So that we don't forget what happened and so that we can prevent this from happening again.

And, you know, those websites come under cyber attack.

And in fact, Omer has taken a look at this and we see, you know, an increase here in cyber attacks and an increase in traffic that is cyber attack related over time.

So this is a reminder that even something which is commemorating such an awful event and is an essential piece of education for everybody gets attacked.

And it's just a reminder once again, you know, cyber attacks are relatively easy to do and people go after all manner of things, including something like this.

And I am sure that today some of those websites are getting attacked and we're helping to block them.

And a lot of websites like these ones, educational ones can be protected using our Project Galileo platform, right?

Yes. Project Galileo is, you know, open, you know, it's something that people can get referred into to help keep vulnerable websites online, particularly in the arts, human rights, civil society, journalism, democracy, you know, we will help and give our service to them free.

And, you know, if you click through on Project Galileo, you can take a look at what's available there as part of, you know, part of helping our mission, which is helping build a better Internet, keep those things online.

Of course. And tomorrow, Saturday, January the 28th, is Data Privacy Day.

It is. And we have three blog posts related to Data Privacy Day.

The last one, which we will publish today, Friday, is this one: Towards a global framework for cross-border data flows and privacy protection.

This is a technology problem, but also mainly a policy one, right?

Because it's all about, in the EU, trying to comply with the GDPR, which is sometimes trickier than people realize.

Well, I think the important thing is that, and I think Data Privacy Day helps really reflect this, is that if you look across the world, governments around the world are legislating how their citizens' data is used.

And this is not an EU GDPR problem. GDPR is just an example and a well-known example of privacy regulation.

There are all sorts of regulations around the world.

And if you go from country to country, you'll see these. And the challenge is that the Internet is across the world, is across countries and people, you know, you and I, we're free to go to websites all over the world.

And that's one of the beautiful things about it.

You know, the number of times within Europe, I have purchased items by visiting a German website or a Spanish website or a Portuguese website, right?

Where I am right now. We need to have frameworks for how you move the data around about the fact that I made that order in those different countries.

Now within Europe, GDPR applies. So that's very clear, but there is really this incredible variety of these laws around the world.

And so, you know, how that works from a regulation perspective and also how somebody can actually comply with all those laws is a very complicated situation.

So it covers policy, but it also covers technical solutions.

So as you're highlighting here, we have our Data Localization Suite, which is really to help our customers understand, you know, how they can use our system to, you know, protect encryption keys, because there may be rules about encryption keys, protect data or logs or metadata and put it in the places they need it to be.

And I think one of the beauties of having such a large network as ours is you can slice it into different regions and really, you know, it really helps someone build something and comply with the laws around the world.

So there are three blog posts, which talk about, you know, different aspects of the data sharing agreements between countries, but also the technical aspects of how you can use our service.

And actually, there's a very technical blog post about Geo Key Manager, which is about how we manage encryption keys around the world and are able to keep them, you know, safe and comply with our customers' desire for those keys to be inaccessible in certain locations.

And that's coming out today along with the other posts. Exactly. That's a very specific thing, but quite an important one, right?

And it was a very big piece of work for us to do that in a way that is scalable and works across the world.

Yes. Exactly. Here it is. Reimagining access control for distributed systems.

There you go. And this type of data localization, I think it's quite important for people in all sorts of countries, especially in the EU, to understand that technology can now provide a way of complying with Schrems II.

That's like a feature that's quite important to comply with these types of laws.

So it's technology in the service of policy, which I find interesting. Read all about it on the Cloudflare blog.

We have also another one, which is investing in security to protect data privacy.

So you can read here what we do and what we think about this area.

It's written by Emily Hancock and Alissa Starzak.

It's quite an interesting read, so I would invite everyone to read it. Absolutely.

And it's also in Portuguese, French and German, so different languages there.

Very good. And I think that's a wrap. Thank you, John, and see you next week.

Yeah, I'll see you next week. Bye-bye.
