Internet disruptions from cable failures to technical issues (and an RSA teaser)
In this week's episode, we discuss Internet disruptions.
Host João Tomé (based in Lisbon, Portugal) is joined by our Head of Data Insights, David Belson, based in Boston. We discuss our recent Q1 2024 Internet disruption summary blog post. There were submarine cable failures that impacted 13 countries in Africa. We also address technical issues with RPKI, DNS, and DNSSEC that disrupted connectivity for subscribers across multiple network providers.
Additionally, we give you a teaser from Ranee Bray, Chief of Staff from our Security team, about Cloudflare's presence at the cybersecurity-related RSA Conference next week in San Francisco.
Transcript (Beta)
Hello everyone and welcome to This Week in Net. It's the May 3rd, 2024 edition and this week we're going to talk about Internet disruptions.
We published our summary about Q1 2024 and this was an eventful start of the year.
I'm your host, João Tomé, based in Lisbon, Portugal and with me I have David Belson, our Head of Data Insights.
Hello David, how are you? Hey, João. Welcome to the show. For those who don't know, you're in the Massachusetts area, right?
In the US. Yep, Boston area.
Boston area. This is the blog post we usually do every quarter about disruptions.
Why not jump right into it? This was a quarter that had very specific disruptions, outages.
Some are government-directed, but this quarter we had a few types that we don't see as often as in other situations, right?
Right. Yeah. So there were a number of disruptions that were sort of the standard, I guess I'd call them disruptions, you know, cable cuts, government directed, power outages, things like that.
But there were a couple this quarter that were maybe a little bit more unusual.
One was related to a DNSSEC issue in Russia.
One was related to a DNS server issue. I believe that was, I believe in the UK.
And then there was an RPKI-related issue that caused problems for Orange Spain. Exactly.
And also the cable cuts in Africa in March. Yeah, definitely. Those were some pretty big, those were definitely had a pretty big blast radius.
Exactly. 13 countries, if we count them right, that had a clear impact in traffic, a drop in traffic, after those three cable failures specifically, right?
The repairs in each of the cables are either, you know, continuing, proceeding, or they are finishing up.
I think it depends on which cable you're talking about, or which, sorry, the different cables are being repaired at sort of different timeframes.
Exactly. I think the last time I checked, two cables were already repaired.
There was a third one that's still under repair. So I think so. Yeah, I think that's finishing up very soon.
So actually, I can share my screen and we can actually show the specific things we're mentioning.
These are the cables. In this case, it's one of the cables here represented in this chart specifically.
The Submarine Cable Map is a great resource.
It is, it is. Most people don't realize that there's a lot of cables that usually bring Internet to everyone in the world and connect everyone, right?
Right. So this is the blog post you wrote specifically. There's mentions here about the different types of attacks.
Like you were saying, one was actually claimed by Anonymous Sudan.
Cyberattacks also play a role. And also, of course, the government directed continue most quarters.
But this was not a very busy quarter there, right?
That's right. Not a very busy quarter for which? For government directed shutdowns.
Yeah, it was definitely, we saw a few, but not as many as we do.
And I think, you know, there were no exam-related shutdowns this quarter, which is good.
You know, I think we sort of generally hope that governments are moving away from doing that.
In fact, the Internet Society and SMEX and I believe Access Now are driving sort of an advocacy program around trying to convince governments not to shut down the Internet around exam time.
I think we also, you know, something you've been watching as well, elections, you know, there were definitely a number of elections during the first quarter.
And I think that we saw some issues in Pakistan around that, but we did not see many widespread Internet shutdowns related to elections.
Exactly. Pakistan is referenced here.
There was a situation in this case. And also there's Senegal here represented in this case, not election related, I think, but it was also government directed.
This is the government directed area. And Comoros. Comoros was following protests against the re-election in this case.
So, yeah, so sort of re-election related.
Yeah. Yeah. In a sense. Yeah. We have a bunch of things we can go over specifically.
You already mentioned some highlights there. Where should we start?
Do you want to go to the specific RPKI problems?
Yeah, we can jump down to the technical problems section. That's fine.
This was an outage in Spain specifically, Orange Spain. And this was not a very typical type of technical problem, right?
Right. This was definitely what I think people call a foot gun, where Orange Spain effectively shot themselves in the foot.
I think arguably in part with the help of RIPE, which is one of the regional Internet registries for Europe, or is the regional Internet registry for Europe.
It's one of the five global RIRs. So in this particular case, each provider has a RIPE account.
So an account with RIPE, and they use that to publish routing information and configurations and so on.
But the apparent issue here was that the credentials for Orange Spain had leaked onto the public Internet.
And because RIPE was not enforcing multi-factor authentication, a malicious actor was able to take those credentials and use them to log into Orange Spain's RIPE account.
And then once they did that, they published multiple ROAs, Route Origin Authorizations, with bogus origins.
So basically saying these routes for this autonomous system should be coming from here.
And because that didn't match what Orange Spain said should be the case, many providers, many carriers on the Internet, basically said, hey, these are invalid.
The RPKI checking on this is invalid.
So we're going to declare those routes to be invalid and drop them and basically drop the traffic associated with them.
So that caused issues sort of around the Internet for Orange, Spain.
And in our case as well, because we at Cloudflare also enforce RPKI validation, we rejected the invalid routes, which meant that their users couldn't get to us, which then manifested itself ultimately as a drop in traffic from that autonomous system, as seen by Cloudflare.
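As a rough sketch of the origin validation David describes, here's how an RPKI-enforcing network decides that a route is invalid and should be dropped. The ROA data, prefixes, and AS numbers below are made up for illustration; real validators work from the signed RPKI repositories, not a hardcoded list:

```python
from ipaddress import ip_network

# Hypothetical ROA set: (prefix, max length, authorized origin ASN).
# Illustrative values only, not Orange Spain's real routing data.
ROAS = [
    (ip_network("198.51.100.0/22"), 24, 64500),
]

def rpki_validate(prefix_str, origin_asn):
    """Classify a BGP announcement roughly per RFC 6811 origin validation:
    'valid', 'invalid', or 'not-found' (no covering ROA exists)."""
    prefix = ip_network(prefix_str)
    covered = False
    for roa_prefix, max_len, roa_asn in ROAS:
        if prefix.subnet_of(roa_prefix):
            covered = True  # at least one ROA covers this prefix
            if origin_asn == roa_asn and prefix.prefixlen <= max_len:
                return "valid"
    # Covered by a ROA but no match on origin or length: invalid, so
    # RPKI-enforcing networks drop the route (and its traffic with it).
    return "invalid" if covered else "not-found"

print(rpki_validate("198.51.100.0/24", 64500))  # valid
# The same prefix with a mismatched origin, the shape of the problem in
# the Orange Spain incident, comes back invalid:
print(rpki_validate("198.51.100.0/24", 64666))  # invalid
```

Note that "not-found" routes (no ROA at all) are still accepted by validators, which is why publishing ROAs with wrong data can be worse than publishing none.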
There was a big drop specifically there. And it reminds us that even for Internet experts, if you want to call them that, like this regional Internet registry for Europe, two-factor authentication makes a difference, even in those situations.
And I saw that RIPE announced earlier this month, actually on April 1st, that two-factor authentication is now mandatory for all accounts.
So they changed something. Yeah, I mean, unfortunately, it took that to change, to force them to adopt a best practice.
Exactly. So always something new in terms of disruptions, teaching us something specifically.
And that's actually good advice for everyone, the two-factor authentication.
It's not only for Orange or for RIPE. Right. If you're a provider of some sort of service, you should be enabling it.
You should be providing two-factor authentication capabilities for your end users and subscribers.
If whatever service you're using has multi-factor authentication available, turn it on, use it.
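Since the takeaway here is "turn on multi-factor authentication," here's a minimal sketch of how TOTP codes, the six-digit codes in authenticator apps, are generated per RFC 6238, using only the Python standard library:

```python
import base64
import hashlib
import hmac
import struct
import time

def totp(secret_b32, timestep=30, digits=6, now=None):
    """RFC 6238 TOTP: HMAC-SHA1 over the current 30-second time counter,
    dynamically truncated to a short numeric code."""
    key = base64.b32decode(secret_b32, casefold=True)
    counter = int((now if now is not None else time.time()) // timestep)
    digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                      # dynamic truncation offset
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10**digits).zfill(digits)

# RFC 6238 Appendix B test vector: ASCII secret "12345678901234567890",
# base32-encoded, at t=59 seconds yields "94287082" with 8 digits.
SECRET = "GEZDGNBVGY3TQOJQGEZDGNBVGY3TQOJQ"
print(totp(SECRET, now=59, digits=8))  # 94287082
```

Both sides share the secret once (the QR code at enrollment) and then derive matching codes from the clock, so a leaked static password alone, as in the Orange Spain case, is no longer enough to log in.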
Exactly. This chart, from the same situation, is showing something very specific that we added recently on Radar, which is the announced IP address space, which also dropped in that situation.
Yep. So we saw a drop around that same time, I think, because either Orange Spain wound up withdrawing the affected routes, or they appeared to be withdrawn because they weren't matching what they should have been.
Of course. While we're at it, there was also a Ukraine situation here, a nine-hour Internet outage in January specifically.
This one is a technical issue, but it was flooding.
Basically, it looks like it was data center flooding. So we've seen stuff like that before.
We've seen data center fires. Unfortunately, the key infrastructure is not always protected from those sorts of issues.
So when it does happen, and if it hits a key part of the infrastructure, obviously, then it will result in a loss of traffic like this.
And in this particular case, also, we can see with the IP address graph that it took the network effectively completely offline.
The announced IP space dropped to zero for about six hours. A few hours. Yeah, a few hours.
That's also a UK example here, from one of the... Yeah. So this one is...
We've seen this before, I think, where the provider has some sort of problem with their resolvers.
So an end user says, I want to go to www.foo.com. They make that request to the resolver.
The resolver goes out, talks to the authoritatives, gets the information back, caches it, and returns it to the end user.
The problem is that if there's a problem with the resolver, the end users generally are not able to get the IP addresses for whatever site they want to go to, which means that then they can't go to those sites, which then ultimately looks like a drop in traffic.
So because PlusNet end users were not able to get IP addresses for Cloudflare customers through no fault of our own, to us, it looks like a drop in traffic from PlusNet.
And one of the things when you looked at the conversations that were occurring when this problem was going on, some users said, hey, if you switch to a third-party resolver like Cloudflare's Quad One, then there's no issue.
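The resolver failover those users did by hand can be sketched like this. The resolver objects and addresses below are simulated stand-ins, not a real DNS client: a caching stub resolver tries upstreams in order, so when the ISP's resolver breaks, a third-party one like 1.1.1.1 keeps names resolving:

```python
import time

class StubResolver:
    """Toy caching stub resolver, mimicking the role described above:
    ask an upstream, cache the answer by TTL, return it to the user.
    Upstreams are callables returning (ip, ttl) or raising on failure."""
    def __init__(self, upstreams):
        self.upstreams = upstreams
        self.cache = {}  # name -> (ip, expiry time)

    def resolve(self, name):
        hit = self.cache.get(name)
        if hit and hit[1] > time.time():
            return hit[0]                    # served from cache
        for upstream in self.upstreams:      # failover: ISP first, then 1.1.1.1
            try:
                ip, ttl = upstream(name)
                self.cache[name] = (ip, time.time() + ttl)
                return ip
            except Exception:
                continue                     # this resolver is broken: try next
        raise RuntimeError("all resolvers failed for " + name)

# Simulated upstreams: the ISP resolver is down, the third-party one works.
def broken_isp(name):
    raise TimeoutError("resolver outage")

def third_party(name):
    return ("203.0.113.7", 300)              # illustrative answer and TTL

r = StubResolver([broken_isp, third_party])
print(r.resolve("www.example.com"))          # 203.0.113.7
```

Real operating systems don't fail over this gracefully by default, which is why an ISP resolver outage looks like the whole Internet being down to most subscribers.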
That's interesting, actually. How common are these ones? Usually, a lot of people say, hey, if there's a disruption, maybe it's BGP.
In this case, maybe it's DNS.
I mean, that's the joke. It's not DNS. There's no way it's DNS.
It was DNS. So these are, I don't think they're super frequent, but it's certainly something we've seen in the past.
At least I know I've heard of these in the past and probably covered them once or twice in these summaries.
Exactly. So sometimes it's not as common, but they continue to occur.
You already mentioned also the DNSSEC specific one that was in Russia, right?
Right. So there were issues, apparently, I don't recall the specifics, but something happened with the .ru zone, the Russian ccTLD, basically.
And when you have DNSSEC, the question when validating is: does this DNSSEC-signed domain have valid signatures?
If there's an issue somewhere in the chain, validation will fail. And in this particular case, that resulted in us seeing SERVFAILs.
SERVFAIL is a type of error that basically says, yeah, something's broken, in its most basic sense.
So we're not returning information, we're returning an error. So once that problem cleared up, things returned to normal.
The interesting thing is there are some tools used with DNSSEC where you can go and sort of explore the various bits and pieces, looking at the signatures and looking at the timestamps and whatnot.
So a lot of times when these sorts of things happen, you can go and look and say, oh, it looks like it failed here and it looks like they tried to do a key rollover.
Because here's the old key, here's the new key, and the timestamp is around the time that the error started.
So you can see where and when the problem occurred and generally what caused it.
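The timestamp check David describes, looking at whether an RRSIG's validity window covers the moment the errors started, can be sketched roughly like this. The timestamps below are illustrative, in the `YYYYMMDDHHMMSS` UTC format that tools like `dig +dnssec` display:

```python
from datetime import datetime, timezone

def rrsig_window_ok(inception, expiration, when=None):
    """Check whether an RRSIG's validity window (YYYYMMDDHHMMSS, UTC)
    covers a given moment. A signature outside its window fails to
    validate, which resolvers surface to users as SERVFAIL."""
    fmt = "%Y%m%d%H%M%S"
    start = datetime.strptime(inception, fmt).replace(tzinfo=timezone.utc)
    end = datetime.strptime(expiration, fmt).replace(tzinfo=timezone.utc)
    now = when if when is not None else datetime.now(timezone.utc)
    return start <= now <= end

# Illustrative check: a signature that expired before the moment of the
# failure, the kind of clue that points at a botched key rollover.
check = datetime(2024, 1, 30, 12, 0, tzinfo=timezone.utc)
print(rrsig_window_ok("20240101000000", "20240129000000", check))  # False: expired
print(rrsig_window_ok("20240101000000", "20240215000000", check))  # True
```

Comparing old and new key signatures against the time the SERVFAILs began is exactly how observers reconstruct where in the chain a rollover went wrong.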
Makes sense. Also a specific one in this case: AT&T in the US also had some problems, right?
Yeah, so I think they saw issues for about eight hours in a handful of cities on their mobile network.
I think there was obviously a lot of conspiracies around that, the cyber attack, blah, blah, blah.
AT&T, I don't know that they got out ahead of it, but they communicated, look, it was not a cyber attack.
We were doing maintenance, we screwed something up, and that was what happened.
Exactly. That also reminds us of, we live in a time where cyber attacks are mentioned, in this case, even when that's not the case.
So people are already expecting cyber attacks to impact even networks, right?
Yeah, I think part of it is sort of absent any communication, you assume the worst, A.
B, I think a lot of times the cyber attack claims are made absent any actual information, and frankly, absent any actual knowledge of the underlying technology.
Actually, a cyber attack was what happened specifically in Israel also, and it's represented here.
Yeah, so it definitely can take down providers. In some cases, it's the...
Bringing it up. In some cases, the attacker will target an attack against the provider's infrastructure, overwhelming that and taking it down.
And if you hit the right stuff, or you hit it with enough traffic, you can overwhelm it and take the provider offline until they can get some filters or blocks in place to drop the...
...offending traffic. This is the Israel example we were mentioning, HotNet in specific.
Apparently, it was an anonymous Sudan cyber attack.
In terms of impact, not a very long impact, two hours, around two hours?
Yeah, about two hours, traffic dipped. Yeah, so also a specific example here.
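The "filters or blocks in place to drop the offending traffic" step can be illustrated with a toy token-bucket rate limiter, one common building block of such filtering. The rates and counts here are made up:

```python
class TokenBucket:
    """Toy per-source rate limiter of the kind a provider might deploy as a
    first-line filter against flood traffic. Illustrative only."""
    def __init__(self, rate, burst, now=0.0):
        self.rate = rate          # tokens replenished per second
        self.burst = burst        # bucket capacity (max burst allowed)
        self.tokens = burst
        self.last = now

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True           # within the allowed rate: forward it
        return False              # bucket empty: drop the packet

# A source firing 1,000 requests in the same instant only gets the burst through.
bucket = TokenBucket(rate=10, burst=5)
accepted = sum(bucket.allow(now=0.0) for _ in range(1000))
print(accepted)  # 5
```

Real DDoS mitigation is far more involved (anycast dispersion, signature matching, upstream blackholing), but the basic idea of capping each source and dropping the excess is the same.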
We started this conversation talking about cable cuts in Africa, in this case. Cable cuts are frequent, but the impactful ones are not that frequent.
So in Africa, there was a few.
This is one of those, but not the most impactful one. This is the January the 10th one, right?
Right, so this one affected providers in Chad.
And the challenge is that, if I'm remembering correctly, Chad is... ...yeah, Chad has...they're landlocked.
So their Internet access comes through Sudan and Cameroon.
And I think the Sudan cable, as we talked about last quarter, already had some issues.
So I suspect that they were largely or heavily reliant on the Cameroon cable.
When there was a cut there, that took them offline or certainly impacted their connectivity.
Makes sense. More impactful than that, let's go directly to it, was the situation with multiple African countries that we started discussing.
These are the mentioned impacted countries.
A lot of them, most of those, are in West Africa. It went down to South Africa.
Although in South Africa was just one network, one Internet operator, in a sense.
Right, yep. But some of these countries had many days impact, right?
Yeah, so I think it depends on how many cables you're connected to, how many submarine cables you're connected to.
It depends on if you have terrestrial connectivity through other countries, and things like that.
It's often a question of planned resilience, to be honest.
So for countries that are heavily dependent on one cable, or maybe one or two of these cables, simultaneous cuts could be devastating.
There was a lot of discussion about providers failing over to Google's Equiano cable.
So if you had access to the Equiano cable, then you could fail most of your traffic over to that, or some large portion of it, and have a much shorter disruption, or a much less pronounced disruption.
And if you have terrestrial backups, like we were talking about earlier with Chad, if you have terrestrial cables that are going to other countries, other network providers, that are also ultimately not going back through these cables, then you have some resilience too.
If you're able to send your traffic eastbound, and have it go transit through maybe cables that are going out through Madagascar or something, that would help as well.
It would impact latency, obviously, but it would keep your connectivity generally available.
These are the impacted cables specifically that we wrote in another blog post.
As we mentioned in the beginning, apparently the repairs are almost over.
Ghana, specifically, is one of the countries where the impact was seen first and in a much deeper way initially.
But there were several days of continued impact.
You could see quite clearly the difference from before March the 14th up to late April.
It's quite interesting to see there.
Some operators were still impacted, apparently. Yeah, I'm just trying to bring up Ghana.
I'm trying to see what cables they're connected to. They're connected to one, two, three, four, five active cables right now.
And I think three or four of them were impacted by that cut.
So yeah, I think that would be certainly a reason that Ghana took much longer to recover, because I think their situation is sort of the worst of all possible cases.
Absolutely. Any other thing that we didn't mention specifically, on the others that you think you should?
There's a Ukraine situation here, also on February the 22nd, because of Russian airstrikes on critical infrastructure.
So in Ukraine, and in Sudan, we're seeing the impact of military actions, unfortunately.
We've seen the kinetic side of it, with bombings taking out infrastructure. That's the Internet impact of a kinetic action, not an actual sort of cyber military action.
Exactly. It is a weird term. But yeah, so it's military action that brings Internet problems.
In this case, even power outages in some situations. Yeah, so I think earlier in the conflict, we definitely saw some more sort of cyber actions, where there was routing changes, and there were network takeovers happening, and sort of forced routing through Russian providers.
In this particular case, it was more Russian airstrikes on other data centers, or I think this particular case was the airstrikes on power stations, which then ultimately impacted Internet connectivity.
And Kharkiv was the main target there, and still somewhat impacted, in a sense.
Also represented here is the Gaza Strip, which, since the conflict started, has been having problems, right?
Yeah, so there are a number of network providers in Gaza, or Palestine, that saw disruptions, either as the conflict started or shortly thereafter.
Many of them are still offline.
PalTel has been fairly resilient, and especially in the Gaza Strip. But there have been several instances.
There were a handful in Q4, and then a few more in Q1, where across four of the major governorates in the Gaza Strip area, PalTel basically was offline for some period of time.
That's one of the main operators there, specifically.
So that brings more impact, because of that, specifically in the different governorates.
That's what they call it there.
Did we miss something that you want to highlight specifically? I think we've hit most of the high points.
We hit government-directed, we hit military action, submarine cables.
Yeah, I think we've hit all the major points. Before we go, what does still surprise you in Internet disruptions, outages?
Of course, some are really recurrent.
And like you were explaining, government-directed ones during exams, those are really frequent and consistent.
But what surprises you still the most in these types of disruptions?
I think just sort of how frequent they are.
I mean, I think the breadth of causes continues to be pretty consistent.
I think, for better or for worse, I think that we'll continue to see submarine cable cuts that are going to cause these sorts of problems just because so much traffic transits them, just because of where they are.
In many cases, they go through or go near very geopolitically sensitive areas.
So there's always that big risk there.
But in the cases where it's technical problems, whether it's RPKI or DNS, or we were doing maintenance and we screwed something up, those are situations where you know what's happening, you know what the risk factors are, but there hasn't been, I guess, investment or attention to trying to address or prevent them.
Or I should say insufficient. I won't say there's been none, but there's been insufficient investment or attention to asking: if we're an ISP, how do we make sure our DNS resolvers don't go offline?
How do we make sure that when we're doing a key rollover for DNSSEC, we don't do something wrong and take our entire country-code top-level domain offline?
Or imagine if VeriSign, which I think operates .com, was doing a key rollover and screwed something up there; that would be catastrophic.
You'd see tons of sites become unavailable, including basically all the sites that one would normally use to talk about what happened.
No more X, no more Facebook, no more Instagram, no more anything .com.
Radar.cloudflare.com also. Yeah, and no more Cloudflare. You've got all these issues here where a DNSSEC key rollover for .com could be catastrophic.
And the same thing with cyberattacks.
I think there's ways of strengthening the barriers against those at various levels.
It's arguably maybe easier for a site owner to do that, something like Cloudflare, than it is for an ISP to do it.
But again, taking the time to invest money and energy and resources into protecting against the known risk factors is really important to ultimately preventing the potential for disruptions.
Yeah, it's quite interesting, especially in this quarter, to see, of course, cable cuts affect those who have less resilience, in a sense, like we saw in Africa specifically.
But also big operators like Orange in Spain or AT&T in the US, those really are technical problems.
BGP, as we were saying, or DNS, depending on the situation, or other things.
But I'm always surprised by the ones that aren't natural causes, like a hurricane, things like that.
Yeah, we didn't see any hurricanes or earthquakes or anything.
We didn't see extreme acts of God this quarter, which I guess is good. True.
But still impact in major Internet providers specifically. So that still happens.
So there's always a learning curve. And I think that's also related to the fact that these are complex areas that evolve over time.
There's a lot of legacy in some of these companies also.
It's done that way because it was done that way, even the two-factor situation.
So always a lesson learned experience there. Yeah, absolutely true.
Well, that's a wrap. Thank you, David. You're welcome. Good to see you again.
Good to see you again. That's a wrap. Thank you.
Hello, I'm Ranee Bray, Chief of Staff with the Security Organization. Next week is the RSA Conference in San Francisco.
It's one of the largest and most comprehensive cybersecurity conferences globally.
Our security team is excited to be there to learn, contribute, and have the opportunity to meet with customers in person.
We invite you to join us next week during RSA. The Cloudflare team will be in booth 327 in the South Moscone Center.
And we'll be at the Cloudflare Hub in the veranda, which is just above the Moscone Center.
Come see how the Cloudflare Connectivity Cloud helps connect and protect your cyber-resilient business.
We'll be hosting on-site demos and events. We look forward to seeing you there.