Originally aired on January 28 @ 1:00 PM - 1:30 PM EDT
Welcome to our weekly review of stories from our blog and elsewhere, from products, tools and announcements to disruptions on the Internet.
João Tomé is joined by our CTO, John Graham-Cumming. In this week's program, we talk about a deep dive related to network debugging (A debugging story: corrupt packets in AF_XDP; a kernel bug or user error?), we go over the three new winners of Project Jengo (and defeats for the patent troll), and we also cover the Internet disruptions overview for Q4 2022 and Cloudflare’s January 24 incident. We end with a reading suggestion: our three blog posts related to Data Privacy Day (Jan 28), including the one about what Cloudflare’s Data Localization Suite is able to do.
Hello, everyone, and welcome to This Week in Net, our January 27th, 2023 edition. I'm João Tomé coming to you from Lisbon, Portugal, and with me I have, as usual, our CTO, John Graham-Cumming, also in Lisbon. Hello, John. How are you?
I'm recovering from the flu, so if I sound a little different, that's because I'm still a bit blocked up.
Makes sense. My voice is deep since I had a cold four weeks ago, which is kind of weird.
We're all suffering a little bit now.
Exactly. We didn't have an episode last week. The last This Week in Net episode was about CIO Week, with all sorts of solutions for those who try to keep processes and systems in a company fast and running, but we had a lot of blog posts in the past two weeks. One of the blog posts we did last week was a deep dive, a debugging story. We have Data Privacy Day coming up tomorrow, actually, and three blog posts about that, but why not start with a deep dive, this debugging story?
Yeah, I mean, I think the technical stories, and this particular one, a debugging story about corrupt packets in AF_XDP, are really interesting because if you think about the scale at which Cloudflare operates, there are all sorts of errors that occur in the software across our systems: things that we've introduced, or things that have been in the Linux kernel, or things that are in the hardware. Famously, in the past, we spent a long time looking at some crashes, which turned out to actually be an Intel microcode bug. We like to really drill down on errors that are happening. Obviously, we handle a lot of packets. We use a thing called XDP, Express Data Path, to move packets through the Linux kernel. We were seeing random packets getting truncated, basically, or these panics happening, saying that the packets were truncated. Well, first of all, that's not good because that means there's something wrong at a really fundamental level, which isn't the network level. 
Also, at our scale, we want to eliminate all of these things. This blog post is really long, so if you want to read it, give yourself some time. It was trying to understand, first of all, why we're getting a panic, and actually why certain packets were getting truncated and causing the panic. It's a really deep dive into how we do the packet processing at a low level, because XDP is very fundamental to a lot of what we do: how packet queuing works in the kernel, and how a thing called flowtrackd, which we actually use for DDoS mitigation, works, because it's a fundamental part of what we do. If you're up for a debugging story, you have time, and you want to sit down and read all about the gory details, I do recommend this post. It's by Shawn and Bastien, from the 16th of January: A debugging story: corrupt packets in AF_XDP; a kernel bug or user error? We suspected that we had probably used AF_XDP in some incorrect manner. It turned out that, nope, this was a real bug, and we ended up patching it. You can read all about the patch as well.
Exactly. One of the things that always surprises me is how others doing the same thing, other companies working on this type of thing, learn from this. It's also about sharing a learning experience. That's something that is important for the Internet, right?
Well, I think as programmers, debugging is such a fundamental thing we do. In fact, it's sort of both amusing and depressing that when the first computers were created in the 1940s, 1950s, there was a very famous computer in the UK that was created called EDSAC. One of the people who created that computer recalled, shortly after they were able to start programming it (this is really early days, right? This is the time when there were a handful of computers in the world), how he suddenly realized that a large percentage of the rest of his life was going to be spent debugging rather than writing code, fixing errors. 
He didn't even call it debugging, just fixing errors, and often in his own code. So, I think, as developers, we are very, very familiar with trying to figure out why the thing doesn't work. So, debugging stories, and especially debugging stories that end up in a kernel patch, well, they're great. Enjoy it.
They join the two things there. We also had, last week, the announcement of three new winners of Project Jengo, and more defeats for the patent troll. Can you explain a bit what Project Jengo is, how it came about, and about these three new winners?
Yes. So, Project Jengo is an effort by Cloudflare to essentially put patent trolls out of business. About six years ago now, there was a patent troll that came after us, using a patent that they had acquired. And we looked at it and said, you know, this is really abusive. The patent troll had been filing lawsuits against companies, in some cases getting them to pay up, over a patent that we thought was simply not something that should have been patentable. And so, we created this thing called Project Jengo, which was to crowdsource people submitting prior art, so, to demonstrate that the idea was already known before the patent was ever created. And we put up a lot of money here, where we would award people money if they found prior art for these patents. And it's worked great. There were a couple of patent trolls, one called Sable Networks and one called Blackbird, that we did this against. And we've given out $135,000 to individuals who have helped us find prior art, and then ultimately, those findings are used to invalidate the patents and fight back against this abusive behavior. And so, the latest update is just three more people who have helped us by finding prior art. 
And it's interesting, these are people who just read about the project, they read the patent, and they're like, well, maybe I worked in that area in the past, or maybe I'm just interested in searching the literature, maybe I'm just very technical. And so, this is three people who did this, right. So, for the first one, a chap called Chris Wheeler in Georgia, we awarded him $5,000 for finding prior art, which in his case was a paper. So, it was a paper that had been published. And it was prior art to the particular patent that these Sable Networks folks are trying to say is something that we need to license from them. And he was able to go and find that description. A chap called Peter S. also got $5,000, again for finding prior art to a particular patent, in this case, someone's thesis, which was like five years before the patent had ever been applied for, and showed that this sort of thing was known about. A chap called David got another $5,000 for finding basically a patent which predates the patent. And so, I think, we see this with stuff going on, we've given out, well, I think we just added another $50,000 we're going to give out. And we keep searching. And the blog post also talks about patents that have been thrown out of the case, and the way in which the claims are being whittled down as we fight back against what is not genuinely inventing something, it's just, in an abusive way, trying to milk money out of companies.
Exactly. So in a sense, it's trying to get past something that's not particularly helpful for innovation, right?
Yes, absolutely. Absolutely.
We also have the Internet disruptions overview for the fourth quarter of 2022.
Yes. 
I mean, obviously, this is part of an area you're very familiar with, which is Cloudflare Radar, which is radar.cloudflare.com, where we look at all sorts of traffic trends, performance, security, availability, across the Internet. And of course, one of those areas is big disruptions. And David Belson, who also works on this stuff, put together this summary for the fourth quarter. And he's doing this quarterly to look at the sorts of shutdowns there are of the Internet. And it's pretty interesting. So he starts out with things that are done by governments around the world. So, for example, there was a shutdown in Cuba, in September, when there were protests happening. And then another one in October, and you can see these, like the Internet's just being shut down. Sudan, again, the first anniversary of a coup in Sudan, and the Internet was shut down for the day by the government. Iran, I mean, obviously, the Mahsa Amini protests are continuing in Iran, and there are continuing shutdowns of the Internet. The Internet was having a sort of daily curfew, which was happening between about 4pm and midnight local time, for mobile Internet connectivity. They carried on through October, and then there were other more sporadic things. And actually, there was one just the other day, right in January. So Iran is, yeah, Iran is continuing to do this kind of stuff. Then you get power outages. The Internet needs electricity, right? So in October, there was a big power outage, a grid failure, in Bangladesh, and that shows up in Internet traffic, you can really see that drop down. In Pakistan, in certain areas in the south, there were big power outages, which caused the Internet to drop. In Kenya, in November, you see loss of connectivity. 
And then you have the one in the US in Moore County, North Carolina, where somebody shot at two electrical substations, which caused damage, which caused power outages, which took a long time to resolve. And they showed up as the Internet being completely unavailable for a certain amount of time in those areas because of electricity. And then of course, you have Ukraine, right? So we are getting close now to a year after Russia's invasion of Ukraine. And as you will know, relatively recently, Russia has moved to attacking the country's infrastructure, and power infrastructure is part of that. And that shows up in outages in the Internet in different parts of Ukraine. I know you follow this very closely.
Mm hmm. Yeah, even this week, we had in Odessa an energy infrastructure that was impacted. And it was yesterday, so Thursday, an 18-hour disruption. So it has been common for damage to energy infrastructure to have an impact, of course, also on the Internet. And we can see that since October, especially since October the 10th, it has been pretty common to see those drops in traffic in several cities, sometimes in almost all the cities. So you can see sometimes in the whole country a big Internet traffic change, and sometimes it is only in a few areas. But over October, November, December, and already in January, we've been seeing these types of clear disruptions.
Yeah, absolutely. And it continues. As you say, sometimes it's a whole city, sometimes it's a network, sometimes it's a region. So it depends on what's happening. I mean, what's striking to me is how fast they get it back online again, right? There's a loss of power or something. And then, obviously, this stuff is going on, but then it's like, oh, it's back again, a few hours later, or a day later, or something later.
Sometimes you can see that there's an impact, it's a bit lower than before.
Yeah, it goes up. 
So you can see that, for sure, there's disruption. But even in terms of energy, they seem to be dealing with that. And I think sometimes energy infrastructure has a bigger impact, because they shut down energy sometimes just so as not to spend too much energy, they're like saving energy overnight and all that. But I feel like the Internet, all over the country, has been resilient, especially during the day, which is in some cases quite surprising. And it's a good thing, of course.
Yep. And then we have cable cuts. We only had one this quarter to really talk about, which was the cable that connects the Shetland Islands, which are a group of islands off the north of Scotland, somewhere between Scotland and Norway. And it connects the Shetland Islands into Scotland; they're part of Scotland. And the cable got damaged. It was something that people reported a lot, because although the Shetland Islands are tiny, and maybe it wouldn't have made the news outside of circles like ours, this came not long after the Nord Stream natural gas pipeline was blown up. And so people were worried that this might be sabotage. But it looks like it was just the classic fishing vessel that dragged up the cable and damaged it, which happens. And they got it back online a day later.
So there's more attention to that area, like when there's cable cuts, there's more attention, of course, because of the war happening and the region.
Well, especially around the north of Scotland, because the north of Scotland is strategically important for Britain, because that's where we hide all our nuclear submarines, somewhere up there. So if there's anything that goes on up there, we always like to take a look. And then natural disasters: earthquakes were a problem. The Solomon Islands had a pretty big dip because of a very large earthquake, magnitude seven, that occurred there in November. 
And well, then there are just things that go wrong sometimes, snafus, right? So in Kyrgyzstan, there was a three-hour Internet disruption on October 24. And, well, the government, at least the Ministry of Digital Development, said it was an accident on one of the main lines that supplies the Internet. So not really sure, maybe that was a cable cut. Maybe that was somebody who turned something off and needed to turn it back on again. Hard to know. In Australia, there was quite a big outage for a broadband provider called Aussie Broadband in Victoria and New South Wales, where traffic dropped 40%. That was pretty big. And Cloudflare had an outage earlier this week, and I feel for Aussie Broadband, because in their discussion of it, they say a config change was made, which was pushed out through automation to the DHCP servers in those states. And that caused the whole thing to go offline and they had to roll it back. And so, yeah, Aussie Broadband, we've been there. Haiti, lots of problems in Haiti. They blame this on outages of the international circuits. Hard to know what that means. Could be something to do with a cable cut, could be something to do with the routers that are connected to the cables, and these things happen. And then sometimes things happen and we just don't know why. So a provider called Wide Open West in the US had an outage that lasted about an hour. And well, not sure why. Cuba had a long outage in November, seven hours. Not sure why. Starlink had an outage, a short one, about 40 minutes in November. Maybe we will find out one day what happened there. Probably somebody disconnected something by accident and plugged it back in again. So that's David's roundup of Internet disruptions. And we'll keep doing this, looking at the different situations. 
I think we will start to get into exam season in some countries and then we'll get the exam-related Internet shutdowns, which always blow people's minds when they realize that in some countries, switching off the Internet is a technique to stop people from cheating.
Exactly. That's pretty common, as we are seeing year over year. And this is the Internet outage map. For those who are listening in podcast format, there is a Cloudflare Radar page with an outage center that we internally call CROC. And it shows the places in the world where recent outages or shutdowns happened. The more recent ones are Ukraine, also Iran, already in January, and also Pakistan.
Yeah, power outage in January too.
So that's Internet disruptions. And you already mentioned we had an incident this week. So it was on January the 24th. And some Cloudflare services became unavailable for 121 minutes.
Which is a long time. Yeah.
And it was not all customers, right? Not all customers were impacted.
Well, so in big Cloudflare outages in the past, you haven't been able to get to things that use Cloudflare, like a website or the backend of an application you're using. And those become very, very visible and become something that ends up in the news sometimes, because Cloudflare has such a large footprint with a large number of customers. We think something like 20% of websites use us. And so those things have a really big impact. This had an impact on different services within our platform. So it depended on what you were doing. For the most part, the websites kept working, but lots of other things stopped working. So one big thing was a thing called WARP, which is our connector into the Cloudflare network, particularly for corporate users. So we use that, right, to connect into Cloudflare. And also the Zero Trust products, which allow you to log in. So this did have an effect on people being able to log into things in their company. They might've been able to use the web in general. 
And then there were some secondary effects on things like cache purge, so being able to delete stuff from Cloudflare; Images, the Images product, where you couldn't upload new images; and R2, which is our object storage. What went wrong was a software release. We use these things called service tokens, which allow services to talk to each other and have the authorization to do things. And there was a fault which caused the service tokens to be incorrect. In fact, it caused them to be empty. And because of that, services couldn't authenticate to each other, couldn't get permission to use each other. And we had to go back and basically roll back the problem that caused them to be empty and then go and get the actual service tokens and put them back into place. So it took a while to get this under control. It was understood pretty quickly what was happening. And if you look at the timeline, what happened was that as these service tokens became invalid, things began to get worse and worse in a rolling fashion. And so there are actually some really clear graphs in here where you see it ramp up, and then, as we get things under control, come down. So, yeah, that was not a fun time at all, because this authorization, the need for services to be able to talk to each other, is so fundamental to us and to many, many systems, which are using lots of software that has to talk to each other.
And we did an update on how things were going after we wrote it, because we were quick to put the post out there. And then we did an update on how it was going. Right.
Yes. Yes, absolutely. I mean, we always try to do that, which is to very, very quickly explain what happened, give a detailed timeline and what the impact is, because our customers themselves would have been debugging: why can't I do this particular thing? And they need to know the impact exactly on them. 
They'll need to report internally. They will need to know it's us, not them, because they might be debugging. So we try to put stuff out very quickly. Obviously, the status page, we update very, very fast. But then we want to put the postmortem up quickly so people understand the impact and understand that we care about it as well, because you do see companies that will have an incident and put up a very brief apology. And it's kind of like, you know, let's move on. But we try to be detailed. And if anybody's interested in the history of outages at Cloudflare, they can read the blog posts and understand what went wrong and when.
It's possible that many customers weren't aware that this happened. So we're putting it out there, but there were a lot of people that didn't see a problem per se.
Well, yes, sure.
Just the degradation, right?
But it matters a lot to the people who did get affected. And also, it's part of our promise, right? So even if you were not affected, you can then have some confidence that if there was something that affected you, you would hear about it and you would get that kind of detail. So I think this is an essential part of what we do.
Of course. We have a blog post also, let me try to find it, related to the Holocaust. Before we go to Data Privacy Day, there's Holocaust Memorial Day, because today, Friday...
Which is today.
Yeah, exactly. And we have a very specific blog post by Omer about cyber attacks on Holocaust educational websites and how they increased in 2022.
Yes. So this is, again, looking at the sort of data that we see. And if you look at it from the perspective of the Holocaust on the Internet, there are quite a number of educational websites that explain what happened to the people who were killed in the Second World War, in the Holocaust, so that we don't forget what happened and so that we can prevent this from happening again. And those websites come under cyber attack. 
And in fact, Omer is taking a look at this and we see an increase here in cyber attacks and an increase in traffic that is cyber attack related over time. So this is a reminder that even something which is commemorating such an awful event and is an essential piece of education for everybody gets attacked. And it's just a reminder once again that cyber attacks are relatively easy to do and people go after all manner of things, including something like this. And I am sure that today some of those websites are getting attacked and we're helping to block those attacks.
And a lot of websites like these ones, educational ones, can be protected using our Project Galileo platform, right?
Yes. Project Galileo is something that people can get referred into to help keep vulnerable websites online, particularly in the arts, human rights, civil society, journalism, democracy. We will help and give our service to them free. And if you click through on Project Galileo, you can take a look at what's available there as part of our mission, which is helping build a better Internet, to keep those things online.
Of course. And tomorrow, Saturday, January 28th, it's Data Privacy Day.
It is.
Where we have three blog posts related to Data Privacy Day. The last one, which we will publish today, Friday, is this one: Towards a global framework for cross-border data flows and privacy protection. This is a technology problem, but also mainly a policy one, right? Because it's all about, in the EU, trying to comply with the GDPR, which sometimes is more tricky than people realize.
Well, I think the important thing, and I think Data Privacy Day helps really reflect this, is that if you look across the world, governments are legislating how their citizens' data is used. And this is not just an EU GDPR problem. 
GDPR is just an example and a well-known example of privacy regulation. There are all sorts of regulations around the world. And if you go from country to country, you'll see these. And the challenge is that the Internet is across the world, is across countries, and people, you and I, we're free to go to websites all over the world. And that's one of the beautiful things about it. The number of times within Europe, I have purchased items by visiting a German website or a Spanish website or a Portuguese website, right? Where I am right now. We need to have frameworks for how you move the data around about the fact that I made that order in those different countries. Now, within Europe, GDPR applies. So that's very clear. But there is really this incredible variety of these laws around the world. And so, how that works from a regulation perspective, and also how somebody can actually comply with all those laws is a very complicated situation. So it covers policy, but it also covers technical solutions. So as you're highlighting here, we have our data localization suite, which is really to help our customers understand how they can use our system to protect encryption keys, because there may be rules about encryption keys, protect data or logs or metadata, and put it in the places they need it to be. And I think one of the beauties of having such a large network as ours is you can slice it into different regions. And it really helps someone build something and comply with the laws around the world. So there are three blog posts which talk about different aspects of the data sharing agreements between countries, but also the technical aspects of how you can use our service. And actually, there's a very technical blog post about Geo Key Manager, which is about how we manage encryption keys around the world and are able to keep them safe and comply with our customers' desire for those keys to be inaccessible in certain locations. 
And that's coming out today along with the other posts.
Exactly. That's a very specific thing, but quite an important one.
And it was a very big piece of work for us to do that in a way that is scalable and works across the world.
Yes, exactly. Here it is. Reimagining access control for distributed systems.
There you go.
And this type of data localization, I think it's quite important for people in all sorts of countries, especially in the EU, to understand that now technology can provide a way of complying with Schrems II. That's a feature that's quite important for complying with these types of laws. So it's technology and policy working together, which I find interesting.
Read all about it on the Cloudflare blog.
We also have another one, which is Investing in security to protect data privacy. So you can read here what we do and what we think about this area. It's written by Emily Hancock and Alissa Starzak. It's quite an interesting read, so I would invite everyone to read it.
Absolutely.
And it's also in Portuguese, French and German, so different languages there.
Very good.
And I think that's a wrap. Thank you, John, and see you next week.
All right. Yeah, I'll see you next week. Bye-bye.