Radar Bulletin: Q1 2024 Internet Disruptions
Presented by: David Belson, Dan York
Originally aired on August 29, 2024 @ 2:00 PM - 2:30 PM EDT
Join David Belson (Cloudflare Radar) and Dan York (Internet Society) as they review notable Internet disruptions observed around the world during the first quarter of 2024, including their underlying causes and the impact that they have on affected communities of users.
Transcript (Beta)
Hi, I'm David Belson, and I'm the head of data insight at Cloudflare, and I'm here with Dan York, a colleague and friend from the Internet Society.
Welcome, Dan. Hi, David.
Great to be here. Welcome to all the Cloudflare readers. I'm Dan York with the Internet Society, and delighted to be here again to talk about all these disruptions that happened in Q1.
Not delighted that the disruptions happened, but delighted to be able to talk about them.
We'd actually prefer a disruption-free- Thank you for doing it again.
What was that? We'd like it to be disruption-free, really, but you know- Right.
I'd like to actually, yeah, do one of these where we just sort of sit and stare at each other for a half hour because there's nothing to talk about.
Not going to happen now. No. Unfortunately, I think our respective organizations have not put us out of business yet.
Right. I think we both fight against Internet disruptions, whether they're shutdowns or outages, but unfortunately, they keep happening.
Yeah, so I wanted to join with Dan today to review some of the Internet disruptions that we saw around the world during the first quarter of 2024, to dig into what happened, and, more importantly, to talk about the impact that these disruptions had.
So let's jump right in.
I think Q1 was notable for several large cable cuts that I think both Cloudflare and the Internet Society covered.
The first one that we can talk about is the one that happened in East Africa, where the SEACOM, AAE-1, and EIG cables were cut.
Let me scroll down here. And they took out connectivity to about four East African countries.
This was believed to be caused by the anchor from the Rubymar.
That was a ship in the Red Sea that was hit by a ballistic missile, drifted for a bit, and then ultimately sank.
And the interesting thing about this one is that there wasn't really a significant loss of traffic observed in the impacted countries.
That's actually, I'd say, kind of unusual, but possibly these countries may have higher resilience in terms of paths and connectivity than others.
I think we'll touch on that more, I think, in a bit. And there's a lot of cables running through that area.
There's a significant number. That's one of the kind of single points of failure actually right now.
And if you look at the global submarine cable maps, it's going right there through the Red Sea.
And when you have a situation like you have right now, where there's armed warfare happening in that environment, and people, as you said, shooting up ships and everything else, it creates an environment which is ripe for this kind of thing.
Oh, there we go. Zooming in.
This is TeleGeography's map, which, if you haven't looked at it, is just submarinecablemap.com.
It's a great resource to go and look at. And yeah, right there.
Look at all those cables. It's interesting that so much of the infrastructure flows through such a geopolitically sensitive area, making repairs challenging when problems happen.
Right, right. And the challenge, of course, is if you want to avoid that, you have to go other routes, which are either around Africa.
Down and around, right. Or you go route other directions to connect to the rest of it.
It's, yeah, it is a challenge. Or even cross-continent. And Steve Song's Many Possibilities site has the AfTerFibre part of it, which has great perspective on the terrestrial connections there.
And if you look at that, you do see that it's fairly sparse, to be honest.
Yeah. And I don't know if that's a function, I don't know if I would expect that, or I wouldn't expect that.
You mean the overland?
Yeah, sorry, the overland routes. Yeah, well, I mean, you just, you can't get there from here, as we would say.
Whether it's jungle, or mountains, or desert, or whatever, it's not a very hospitable area to go and run anything.
To do, right, to lay lots of cable.
That makes sense. Yeah. Now, after the other event we're going to talk about, I just saw something from Steve Song, speaking of Steve, where he talked about efforts now under way to lay some overland cable going into some of the countries that were affected by that other event.
Okay.
Because, of course, while all this was happening on the eastern side of Africa, and those cables were down, and repair ships were en route, and everything else, we have the events that happened on March 14th.
14th, right. Which we're going to show.
Both of our organizations covered that. So, you know, we did a blog post, sort of, as it was underway, and the Internet Society also covered that.
You, in fact, you covered that.
I did, yes. We started during the day, and then started to iterate on that, and add as we got more news reports and pieces like that.
But the key point in the map you're showing is a great one, any of those maps that show that it was 13 countries that were affected by this outage, again, of submarine cables.
And so, you can see those right on that western part there. And then, but it also got down to the southern tip with South Africa as well, too, which was also affected because they were tied into the same cable systems that were going there.
Right. In this case, it was the ACE cable, the SAT-3/WASC cable (the South Atlantic 3/West Africa Submarine Cable), the West Africa Cable System, which they also call WACS, and the MainOne cable.
Yeah, all four of those.
We were talking earlier that, you know, there were some of these countries who thought they were resilient because they had connections to multiple cables, but when all those cables go down at the same time, then the resilience is sort of, you know, not as useful.
Right. You may think you have network resilience, and honestly, from a network point of view, you do: you are connected to three or four different cables.
But then it turns out that at the physical layer, three of those cables are going through the same basic canyon or something underneath.
And right now, the best information I've seen is that there was some kind of subsea rock slide.
Right. You know, the rocks wound up coming down and severing the cables.
So if you've got three cables running through the same area, and the rocks come down, there you go.
You thought you had network resilience, but you didn't at the physical layer.
You know, and I don't know, I'm sure there'll be reports coming out about why did they lay all three cables in there?
Probably it was an easy route to lay them, you know, if you're trying to get from one place to another, it's a logical place to do it.
And people may just not have thought that there was a risk.
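The point about logical versus physical resilience can be sketched in a few lines of Python. This is only an illustration: the cable names come from the conversation, but the corridor labels (`canyon-A`, `corridor-B`) and the assignment of cables to them are hypothetical, not real route data.

```python
# Hypothetical mapping of cables to the physical corridors they pass through.
# Cable names are from the discussion; the corridor labels are made up.
CABLE_CORRIDORS = {
    "ACE": {"canyon-A"},
    "SAT-3/WASC": {"canyon-A"},
    "WACS": {"canyon-A"},
    "MainOne": {"canyon-A"},
    "Equiano": {"corridor-B"},
}


def surviving_cables(cables, failed_corridor):
    """Return the cables that do not pass through the failed corridor."""
    return sorted(c for c in cables if failed_corridor not in CABLE_CORRIDORS[c])


def single_points_of_failure(cables):
    """Return the corridors whose loss would take down every listed cable."""
    corridors = set().union(*(CABLE_CORRIDORS[c] for c in cables))
    return sorted(k for k in corridors if not surviving_cables(cables, k))
```

With three cables that all share one corridor, `single_points_of_failure` reports that corridor; adding a cable on a different route makes the list empty, which is the kind of physical diversity being described.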
And I guess that the Equiano cable from Google, I think, is newer than the ones that were impacted.
So my assumption is that it may take a slightly different route, and we can, I guess, check the submarine cable map and see.
I think it did. This is more, this is more sort of, I guess. Well, if you zoom in on that part where Cote d'Ivoire is and that, yeah, if we go into that area right there.
Yeah, I think there, there we go.
That, this is the cable that stayed up. Right. And so a lot of, a lot of the providers that were impacted said, hey, we can try to get capacity on Equiano and, and sort of stay, try to stay available.
Right. And if you look at that, I guess we can't with Equiano highlighted, but you can see right there next to it in Cote d'Ivoire, there was a bunch of cables that were going in there.
Right. And so those were the cables that got hit. If you're listening to this on the audio podcast version, you'll just have to imagine what we're talking about, but there were other cables.
Yeah. And this was the other one: the Maroc Telecom cable was another one that I think saw additional use and remained available.
So I guess these two cables are on slightly different routes. They were not in the area where the rock slide fell.
Yes. Yeah. And then, you know, following some of the folks online who track these sorts of things, it looks like the repairs are either completed or near completion for the affected cables.
Yeah. They were able to get some, some of the repair ships.
Cause again, if you're, if you're watching this, there's only a handful of ships in the world that can do this kind of repair.
So, and some of those were, I think at least one was over on the Eastern side, dealing with those cables, but other ships come back around.
Yeah. Other ships were able to get there or work with this.
So, you know, I think at this point, the cables, as far as I'm aware, are either repaired or nearing completion, but that was a big, you know, a big effort.
And during that day, we watched in the charts that you have on Cloudflare Radar and the other measurement providers out there, and we could see just the complete drop-off for some of the countries like Cote d'Ivoire, which was right there.
They, you know, they had pretty much no traffic for that period of time.
And then others had had some and were able to come back up in some way.
Right. And Ghana saw a pretty big impact as well.
Yeah. Cote d'Ivoire, where are we? Yeah. Niger, some of these were all, they were all getting hit for a bit.
Yeah. I don't, for some reason, I don't have the Cote d'Ivoire graph here.
Oh. You have all the other graphs, David. You have all the other graphs.
Lots of graphs. I guess I didn't capture that one in this particular post.
Yeah. And so one of the other related things I think that I wanted to call out here was some work that was done by a Pulse fellow at the Internet Society around resilience and specifically resilience of the undersea cables.
So there was this email that was sent out as part of the Pulse newsletter that highlighted the Nautilus work and the measurements that were done and the findings.
And then there was a related blog post that, I guess, looks at this in more detail.
Yeah. This is really interesting work. So I should say that the Internet Society's Pulse program, which is at pulse.internetsociety.org, helps with some of this, funding researchers, et cetera.
And these folks were doing this work, which is really actually fascinating because what they were doing was trying to figure out how much traffic was actually going across these submarine cables and what was going on.
So this Pulse research fellow, as we refer to this person and his team, they were looking at this.
And what they did, using what they call their Nautilus framework, was run a significant number of traceroutes, basically, looking at where the traffic was going and what was going on.
They ran their system for a while, got close to 9 million different routes that they were looking at, and then trying to categorize those into which ones were known to be going across submarine cables and which ones were likely.
And really taking a look at all of this and trying to do some geolocation, all of these different parts.
And it was really just very interesting work to try to map this, not at a physical...
If you look at what TeleGeography does with their awesome work, they're looking at the physical connections of where the wires are going.
But what this team, what these research folks were looking at was, what's the traffic volume?
Where are the links actually going? What's going between these?
Trying to figure out beyond... TeleGeography shows you where the cables are, but these folks were trying to identify how much traffic was going across which cables.
And doing that kind of work. So it's really interesting work. I'll be curious to see where they take this, what they do with it.
If you look at this post, we'll have the links for all of this in your notes, I know.
And there's the paper that they actually presented that appeared at a conference that was here.
And then they also have the code base as well as the results.
They're open source. They're available.
They're available up on GitLab. They have a repository there. So people could take a look at that and see and provide additional information.
There is going to be another part of the blog post or another blog post that they intend to be posting up on our site, which we'll look at.
It's more of a resilience... I haven't seen it, but from what I gather, they're going to be using another resilience analysis tool to look at how do these all work together.
So we'll have to see.
Yeah. It's always a question I've gotten frequently is, can you tell if traffic is going over a particular submarine cable?
And oftentimes it's no, but I'll have to dig into this more as well.
Yeah. I mean, I think they're making an attempt at it and trying to figure out how did this go, trying to answer that question.
Which cable is this actually going across? Right. And that's harder to do, certainly, I think, when there's multiple cables that you can look at.
I know that Doug Madory has done some work in the past kind of watching latencies between point A and point B when there's the expectation that a new cable between those two points is going to be turned up.
And when the latency drops from most likely satellite to most likely fiber, you can kind of tell that they've started moving the traffic over.
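The latency-watching idea can be sketched as a simple step detector. This is a minimal illustration, not the method used in the work mentioned: the function name, the window size, and the 50% drop threshold are all assumptions, and the sample values in the note below are made up.

```python
from statistics import median


def detect_latency_drop(rtts_ms, window=5, min_drop_ratio=0.5):
    """Return the first index where round-trip time drops sharply, else None.

    A drop is flagged at index i when every one of the next `window` samples
    is at least `min_drop_ratio` lower than the median of the previous
    `window` samples. The thresholds are illustrative assumptions.
    """
    for i in range(window, len(rtts_ms) - window + 1):
        before = median(rtts_ms[i - window:i])
        after = rtts_ms[i:i + window]
        if max(after) <= before * (1 - min_drop_ratio):
            return i
    return None
```

Fed a series that sits around 600 ms and then settles near 180 ms, the function returns the index of the first low sample, which is the moment you would suspect traffic moved from a satellite path onto fiber. A steady series returns `None`.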
Yeah. Yeah. Actually, right above that map that you have on there with the lines, it says their data set is 235 million traceroutes collected by RIPE Atlas and CAIDA over 15 days in March 2022.
So that was what they- That's a good data set. Yeah. Yeah. That's fairly substantial.
So then, I think, just to close out the submarine cable section.
We talked a little bit about resilience. So part of the- There we go.
Part of the examination that you did with this West Africa outage was to look at the Internet Resilience Index scores for the various impacted countries.
Right. Exactly. So one of the things that we have on our Pulse site is what we call our Internet Resilience Index or IRI.
And if you go to pulse.internetsociety.org and you go to the resilience tab, you can go and see this and you can find the resilience score for your particular country or network or things like that.
And you can see that. And what we looked at is, if you look at this table there, you can see there are countries with very low resilience scores.
So if you look at Cote d'Ivoire there, Ghana, some of those have a low score in comparison to others.
Say, if you move on down to South Africa, it has a higher one. And, perhaps to no one's surprise, in the other column, one back, is the number of Internet exchange points, or IXPs, in the country.
So you can see that the countries with the higher resilience scores were the ones that were not as affected by the cable cuts.
They had other paths, they had other ways, they had other pieces working with it.
So it does come back to, what can happen?
How do you make this more resilient so that if you do have an issue like this where you have cables that are cut, what other methods can you have?
And we're seeing some of these different pieces that are going on where they are looking at some overland cables.
For example, Sudan and Djibouti agreed to lay a cable between them through Ethiopia.
So you're getting a connection through there. There is an effort called the Trans-Sahara Optical Fiber Backbone Cable, which is going to connect Niger, Algeria, Chad, and Nigeria.
So you're going to have that, which will then connect countries like Chad and Niger, which are landlocked, they have no access to subsea cables at all.
You're also seeing, of course, too, an increased usage of satellite connectivity.
You're seeing more places having Starlink, for instance, using a low-Earth-orbit (LEO) constellation.
So you're seeing some of that.
And I think this cable cut on the Western side of Africa only made people think more about that and say- Right.
I definitely saw a lot more coverage online of, is satellite the answer or is satellite an answer to some of these types of issues?
Right. Yeah. And I think it's a good complement. And certainly when you have issues like this, where you have no access, the amount of traffic you can send down a subsea cable is tremendously more than the capacity- Yeah.
Satellite is definitely not the replacement for it. Right. It's not the replacement, but it gets you connectivity.
And if I were certainly a government or anybody in there, I would be thinking about, can I have a LEO dish floating around as backup should we lose this kind of power, that kind of thing.
So- Right. And we saw that in Tonga, what, two years ago now, I think, where the earthquake, sorry, the volcano eruption severed the cables that were going to Tonga and they stayed somewhat online through satellite, but it was largely restricted to government and it wasn't Dan York surfing YouTube kind of connectivity.
It was like the government agencies and banking and whatnot.
Right. But this whole thing, right, when you talk about resilience, we talk about how do we have these multiple paths.
And as we look at a world, too, in which we're seeing more effects of climate change and greater storms and more extreme weather, extreme drought, extreme all of this, this whole issue is resilience, resilience, resilience.
How do you build more paths that are not, how do you find the points?
And this, the whole West Africa thing was a good lesson to people to say, Hey, figure out your physical resilience, because you may think you've got four paths, but if three of them go through the same thing and that can happen, or we're picking on a rock slide, but you could also see them coming into the same landing station.
And if you get the drifting ship that's been blown up and it's going along, dragging its anchor out of control, then you could equally see it kind of going across the area where the cables are coming up into a landing station.
Right. Or that landing station being a target.
Thankfully we haven't seen that to date. If you have that point of aggregation.
Right. Anywhere you have a single point of failure somewhere along there is a point where you need to say, okay, well, what's my plan B if this fails? If a typhoon comes in and wipes out that landing station, what's plan B, you know, or C or D?
Right. Yeah. Let's, let's jump to one more.
I think one more that we saw in, where'd it go? In early April. So, the earthquake in Taiwan.
So it was a really significant earthquake, but interestingly it had sort of a nominal impact on traffic.
It briefly disrupted traffic, at least through our measurements, but then, you know, traffic surged back to above normal.
So I think what that says is, okay, basically Taiwan has fairly resilient infrastructure.
And then very clearly people were going back online going, okay, what the hell just happened?
You know, I'm on the news.
I need to contact loved ones, you know. All of that. Well, in, if you're, you're showing the blog posts that we did around that.
And if you scroll down a little bit, you see a chart from TeleGeography, of course, which shows all of the different connectivity.
So in this case, I mean, Taiwan being the island that it is, it has lots of connectivity out to many different places.
So even as the earthquake hit in one area, whatever kind of things were going on, they were fine.
It might've had a minor disruption to traffic in that local area, but there's so much connectivity going into the island.
It doesn't just have three or four. It has, it has many, many cables going into it.
And the interesting thing too, from the TeleGeography perspective, is that it appears that the landing stations, where the cables are coming in, are distributed around the island.
So it's not, they're not all coming into just Taipei, which could be problematic.
And again, I'm terrible at geography. I don't really know how big it is,
you know, so these may all be fairly close to each other, but if you look at that map, you're seeing a ton of connectivity there, and very clearly it's, you know, South coast, East coast, West coast, North coast.
Yeah. That looks like a very resilient setup. Yeah. When the earthquake hit, in contrast to what had happened just, you know, two weeks earlier, there was really very little impact, because you had so much resilience.
So this is kind of that model that you want to see is, is, is what you're doing.
Interestingly, I will say Taiwan, of course, has its own geostationary satellites up above to provide Internet access.
It's also looking at launching some satellites into low Earth orbit, because, obviously, sitting right next to mainland China with all the geopolitical tensions there, they're very concerned about how to keep their connectivity up.
So they're looking at kind of, you know, extreme measures for, for resilience around that in so many different ways.
And you know, so in the, in the seven minutes we have left, I think let's, let's jump to resilience of another sort.
So some of the technical problems that we saw during the first quarter. So one of the ones that we can touch on is the, sorry, I'm just looking for the, here we go.
The, the DNSSEC issues that hit Russia in January. So I know that you, you, like you were saying earlier, you know, joined Internet Society, I don't know how many years ago.
2011, actually. 2011, actually. Oh, wow. Okay. To advocate for the adoption of DNSSEC among other things.
Right. So, you know, I guess ultimately the DNSSEC issues impacted the .ru top-level domain, which meant that users behind validating resolvers were not able to get to something.ru or whatever.
Yeah. And your data there shows that, you know, at the peak, about 68% of all the requests were getting rejected.
They were getting a SERVFAIL response from DNSSEC validation.
And unfortunately the .ru ccTLD, the country code top-level domain organization, hasn't really been very forthcoming about exactly what the issue was.
But we can surmise, you know, a couple of things, these things happen, unfortunately with, with DNSSEC.
If somebody was trying to do a change, it all has a chain of trust that goes back up to the very root of DNS.
And there's a chain of keys that basically say this is vouching for the next one down and the next one down.
So what could have happened here is that somebody didn't do things correctly in changing a key over. Or sometimes, and I've had this happen with one of my own domains, you don't update the key and the key expires or something, and you can wind up with an invalid key.
And so the chain of trust is broken.
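The chain-of-trust idea can be sketched with a toy model in Python. This is a deliberate simplification and an assumption-laden illustration: real DNSSEC uses DNSKEY, DS, and RRSIG records with signatures and validity periods, while this sketch only checks that each parent publishes a digest matching the child's key. The zone names and key values are hypothetical, and nothing here is a claim about what actually happened with .ru.

```python
import hashlib


def digest(key: bytes) -> str:
    """Stand-in for a DS record: a SHA-256 digest of the child's key."""
    return hashlib.sha256(key).hexdigest()


def chain_is_valid(trust_anchor: str, zones):
    """Walk the chain from the root down.

    `trust_anchor` is the digest of the root key that the validator trusts.
    `zones` is a list of (name, key, ds_for_child) tuples, where
    `ds_for_child` is the digest this zone publishes for the next zone's key
    (None for the last zone).
    Returns (True, None), or (False, name) for the first zone whose key does
    not match what its parent vouches for.
    """
    expected = trust_anchor
    for name, key, ds_for_child in zones:
        if digest(key) != expected:
            return False, name
        expected = ds_for_child
    return True, None
```

If a zone starts using a new key while its parent still publishes the digest of the old one, `chain_is_valid` fails at that zone, and everything beneath it fails with it. That is the general shape of the failure being described.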
So we don't know exactly what happened. It wasn't a huge outage.
I think it was about four hours. Yeah. That it was down. So it wasn't huge. They figured out what's going on.
They fixed it. But part of the problem, of course, with DNS is that you have time to live.
You have TTLs on the records. And so there's some period of time where it will take even after you fix it because those records are cached.
And so you're going to wind up with, with that happening somewhere along the line.
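The caching effect can be put as simple arithmetic. This is a minimal sketch under a simplifying assumption: a resolver that cached a bad answer keeps it for the record's full TTL. Real resolvers differ in how long they cache failures, and the function name and example values are hypothetical.

```python
from datetime import datetime, timedelta


def latest_stale_until(fixed_at: datetime, ttl_seconds: int) -> datetime:
    """Worst case: a resolver cached the bad record just before the fix,
    so it may keep serving it until the fix time plus the full TTL."""
    return fixed_at + timedelta(seconds=ttl_seconds)
```

For example, with a one-hour TTL, a fix applied at 19:00 could still leave some users seeing failures until 20:00, even though the zone itself is healthy again.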
So there's a number of organizations that have kind of best practices to help avoid this kind of thing.
And we don't know, but this is an example of the kind of outage that can happen when you're using, you know, DNSSEC: the upgraded security that you get with it also means you have to make sure you're following practices appropriately.
And that's where the resilience comes in, presumably is, is okay.
Follow these best practices. And if you do that, things shouldn't break.
And then you, in theory, are resilient. You know, and I think one, so another one in the last couple of minutes is talking about the Orange Espana issue.
So, you know, RPKI is ostensibly there to help improve the security, but the resilience as well, of the routing infrastructure across the Internet.
And in this particular case, you know- This is really about the resilience of humans.
Yes.
Right. Because what happened here was somebody compromised the user account: they had the username and password to be able to log into Orange Espana's account with RIPE NCC, one of the regional Internet registries.
And they were able to log in there and make some changes, do some things, which basically screwed up all of Orange Espana's routes, you know, for the period of time.
So this goes back to, you know, just basically password security 101: are you managing your passwords appropriately?
Do you have that kind of thing in place?
Two-factor authentication. Right. Multiple factor authentication.
Yeah. All of those kinds of things. So, yeah. Interesting times. I'm assuming they will, well, RIPE has now made two-factor authentication mandatory after this incident happened.
So now that will presumably not be as easy. So we'll see where that goes.
Ideally. Yeah. So, I mean, in talking to some colleagues, this one ultimately resulted in, at least from our vantage point, a traffic disruption, because these routes that were being announced were RPKI invalid. You know, we looked at them and said, yeah, these are bogus, these are no good.
We dropped that, dropped those routes, dropped that traffic. And then others across the Internet were doing that as well.
So which would create problems for the Orange Espana customers to be able to reach, you know, send emails or, you know, reach YouTube or what have you.
Right. Right. Yeah. Which the traffic that goes in there, all of this is, is basically, we could talk for a long time around routing security, but basically the whole idea is to make sure that you can in fact route the traffic to that, to that particular place.
It vouches for the authenticity of that route.
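How a network decides that a route is RPKI invalid can be sketched with Python's standard `ipaddress` module. This follows the general valid / invalid / not-found classification used in route origin validation, but it is a simplified illustration, not Cloudflare's implementation. The prefixes and AS numbers are documentation values, not the ones involved in this incident.

```python
import ipaddress


def validate_origin(prefix: str, origin_asn: int, roas):
    """Classify a route announcement as 'valid', 'invalid', or 'not-found'.

    `roas` is a list of (roa_prefix, max_length, asn) tuples standing in for
    published Route Origin Authorizations.
    """
    route = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_length, asn in roas:
        roa_net = ipaddress.ip_network(roa_prefix)
        # Skip ROAs for the other address family or for unrelated space.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        if asn == origin_asn and route.prefixlen <= max_length:
            return "valid"
    # Covered by at least one ROA but matched by none: invalid.
    return "invalid" if covered else "not-found"
```

The sketch also shows why registry account security matters: if someone with access publishes a ROA naming a different origin AS for a prefix, the legitimate announcement is classified as invalid, and networks that drop invalid routes will stop carrying it.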
Right. One of these days we'll do an unbounded Cloudflare TV segment.
We can go super into depth on all of these. I think we should maybe just end on talking about what's coming up in terms of exams.
Yes. Yes.
So yeah, we've got a little more than a minute, so we can talk exam shutdowns.
Let me bring that up. Here we go. So yeah. So, so Pulse published a post on this very recently.
Well, and last time we were together, we were talking about the output of various different exams that were happening in Q4, 2023.
And so, you know, we've had this pause and now we're about to gear up with coming into May and June.
You know, if we, if we do another one of these after Q2, we'll be able to see, will we, will we be talking about exam shutdowns or will the various countries have seen the light and not shut down their infrastructure for exams?
Yeah. I think Algeria, we saw in the past, I think has gone more to a sort of filtering approach.
Iraq and Syria, I think, were still shutting things down. And then Jordan, I don't recall seeing Jordan taking that kind of action in the past, but my memory could be failing.
Yeah. But anyway, stay tuned. Absolutely. So we've got about 10 seconds left.
So I just want to thank you again for joining me today.
I do enjoy these conversations. Thanks for having me. And I will talk to you next quarter.
Sounds good, David. Thank you very much.