Originally aired on March 5 @ 4:30 AM - 5:00 AM EDT
Mingwei Zhang, Senior Systems Engineer at Cloudflare, and Vasilis Giotsas, Research Engineer also at Cloudflare, explain what route leaks are and their impact on the Internet, how we detect them, and how our new Cloudflare Radar route leak service works. João Tomé is hosting.
Don't miss the in-depth blog post:
How we detect route leaks and our new Cloudflare Radar route leak service
Hello and welcome to our segment about one of the most exciting things that are part of the Internet as we know it: route leaks. At least that is the case for Mingwei Zhang, Senior Systems Engineer on the Cloudflare Radar team, based in San Francisco, and Vasilis Giotsas, Research Engineer, based in London. They are both here with me. Hello Vasilis, hello Mingwei.
Hello.
You both wrote a blog post, together with Celso, all about route leaks, and in this segment we'll try to understand why route leaks are relevant on the Internet. I'm João Tomé, a storyteller and a route leak non-expert at your service, based in Lisbon, Portugal. Lisbon continues to be under heavy rain and floods, so hopefully that will be sorted out in the next few days. So let's dig right into route leaks: what are they, and why are they relevant on the Internet? Who wants to jump in and explain to the non-expert what route leaks are?
Maybe I can start.
Sure.
Route leaks are the propagation of a routing path beyond its intended scope. What that means is that on the Internet, paths have policies attached to them, because autonomous systems, such as Internet service providers, don't just want the shortest path; they want paths that are efficient and cost-effective. So they attach policies to the routes in the de facto routing protocol, BGP. Now, quite often due to a misconfiguration, or sometimes intentionally, a path may not be what the policy intended, and it may propagate to systems it was not meant to reach. That can cause all sorts of issues, from performance degradation to unavailability to security risks.
That's it in a nutshell. So, yeah, it's all related to Internet protocols and how the Internet works in essence, right? And BGP, the Border Gateway Protocol, is at the basis of how the Internet works.
And sometimes something that isn't supposed to happen happens, and it's called a route leak, right? The name already explains what it is: there's a problem with the route that packets, or Internet traffic in general, should take to the destination we want to reach. The route was not the preferred route that the user would have intended, right?
Exactly. It's not the preferred route, not just for the end user but also for the intermediate networks that propagate this route. You should remember that the Internet is not a monolithic network. It's a collection of individual networks, which are called autonomous systems. Each of these networks sets a routing policy based on its own interests, and they have to collaborate in order to achieve end-to-end reachability. But sometimes errors happen, or sometimes some of these networks are not benevolent. So we end up with a situation where the route is not legitimate, or where the route follows a path that is non-canonical.
Exactly. And that's when the problem happens. Do you remember, Vasilis or Mingwei, route leaks from the past few years that were better known, that created more problems for Internet users in a specific region, for example? Do you have examples to share here?
Yeah, I can try one. In the blog post, Vasilis mentioned the Nigerian ISP. I think it's in the last paragraph here, right. This one was particularly impactful because it involved Google's prefixes. What happened is that a small ISP, relatively on the edge of the Internet, misconfigured its routing and then propagated completely wrong routing information. Think of it like the Yellow Pages, where you basically spread information about how to reach Google.
If you give others wrong information about that, and other networks believe it, then you might cause issues. In this particular case, the issue was that everybody believed that through this small ISP you could get a better, or preferred, connection to Google, instead of paying other networks for transit, something like that. So the Internet was basically flooding the small ISP with traffic toward Google instead of sending it to its regular next hops or regular transit providers. As a result, this small ISP was overwhelmed, and Google was cut off for, I forgot the exact minutes or hours, just because everybody was reaching Google through a different...
It was a 74-minute outage, so that's a relevant outage for sure. This is the blog post that Tom Paseka wrote back in 2018.
Yes. So as we can see here, regardless of the size of the network, because of the collaborative nature of routing, and as Vasilis mentioned, there's a kind of federation of networks where we are basically putting relative trust in each other to spread the right information. Size doesn't really matter in terms of how we keep the network secure or performant. So, as network operators, we need to be careful about what we accept and validate the information, in order to prevent such events from happening again in the future.
Makes sense, yeah. I was showing here the specific routes that Google was using, in terms of the problem that happened in Nigeria with the ISP you were mentioning. This shows like...
Oh, sorry. This is a typical route that traffic would take under normal conditions. In the middle, we're looking at a Tier 1 AS, the big one that is able to handle the amount of traffic being sent to Google.
And then in the route leak event, what we're seeing is that the traffic that was supposed to go to the Tier 1 was actually sent to a small network, which basically flooded that network with requests. And that caused the outage.
Exactly. For me, it's really interesting to see how the Internet is a network of networks. This is all related to that, because networks need to be communicating with the right networks when they want to reach some place, in this case some website, some domain. And it's this communication, these routes, that sometimes have problems. It reminds us in a very clear way that the Internet is really a network of networks; they all have to communicate with each other to reach the intended destination. So it's a good reminder of that, I think. In this blog post, we're announcing how we detect route leaks, but also that we now have a new Cloudflare Radar route leak service that people can use to check route leaks. Usually, the users of these tools are engineers, people who work in this area, right? But if you're curious and want to learn more about the Internet, you can also check Radar as it is, right? Go ahead, sorry.
Let me note that in the previous example that Mingwei described, the error happened in an area very remote from Google, but it still affected Google and its users severely. Google engineers at that moment could not detect that something had happened until they started receiving reports that reachability had been affected. So it's important to have a tool to detect these things in a timely manner, before they start severely impacting connectivity and reachability, in order to also resolve them in a timely manner.
It's all about solving the problem, because sometimes it's not clear that the problem is there. This is also a way of stating that the problem is there, right?
Exactly.
And this tool that Mingwei will present does exactly that. It provides timely detection capabilities that, until today, were probably missing from most organizations.
There are different types of route leaks, right? We can also discuss this a little. It's laid out here in the blog post.
Yeah, of course we can. Before jumping to the types of route leaks, there's a topic I want to touch on: the reason why we're not able to stop route leaks at this point. Ideally, we would design and use a system that prevents operators from making mistakes and then propagating them. However, we are still at an early stage of developing and deploying systems that cryptographically sign routes, so that people can tell that a route is correct rather than a misconfiguration. Without the deployment of systems such as ASPA and a few other RFC proposals described in the blog, we cannot effectively prevent route leaks from happening. So the task of preventing a route leak from propagating falls entirely on the operators of the different networks, who need to vigilantly verify routes and define policies for their customers and for their neighbors, the networks they know, so that a misconfiguration does not propagate. And so...
That's what's being done there for it to be better, right?
Yes. There is work being done to prevent it from happening, but as of today we are still far from securing BGP, securing Internet routing. That's why a detection system, as Vasilis mentioned, is very important: it gives us situational awareness, so we know what's happening and can react, even though ideally we would want to stop it. But at the moment, we can't.
Exactly.
And even explaining that something happened will allow the different operators to be aware of it, even if it lasted five minutes. And hopefully they will also learn how to do better so it doesn't happen again, if they are more aware that it happened and, mostly, why it happened, right?
Right. So let's talk about the types of route leaks. These are derived from RFC 7908, the document describing the well-recognized types of route leak. We link to it in the blog post, and you can read more in that document. Here in the blog, we try to illustrate the different types with graphs. The key takeaway is that the relationships are described as provider-customer and peer-peer relationships, and route leaks are usually defined as a wrong turn in the route. In this Type 1 example, a provider propagates a route to a customer, and the customer takes that route and sends it on to another provider. This creates a leak, in other words a valley route. It is especially impactful because the other provider, which usually charges the customer for any traffic the customer sends through it, will prefer customer routes. In this case AS6, or network 6, will prefer the customer route over, say, a direct peering with AS4. We're then basically flooding AS5, the customer, with traffic that it was never intended to carry, which is similar to what happened in the Nigerian ISP case.
And it's really impactful, right, this one?
Yeah, this is impactful. This is a relatively high-visibility event, and it could potentially be a global event if filtering mechanisms were not in place.
Global in the sense that it has an impact on Internet access globally, right?
Correct.
It's similar to what happened on the Google side: if the route leak happens upstream of Google, it could potentially propagate globally and cause an outage for Google, or for any network, for an extended period of time.
And about the other types, what should we mention?
There are other types that are, I would say, potentially benign, like this case where we have multiple lateral peers sending traffic to each other. Conceptually, this is not a usual case; however, we often see it in the real world. In this particular case, if you see three really large networks that usually peer with each other and then see this pattern, it sometimes indicates a problem, right? But it's not as important as Type 1. There are also other types, like Type 3 and Type 4, where you receive routes from a provider and then propagate them to a peer, or you receive routes from a peer and then re-propagate them to a provider. This is, by definition, a route leak. We are monitoring all four of these types of route leaks, because we're basically looking at the AS paths. But in terms of impact on routing, our system design still focuses on Type 1.
Exactly. Makes sense. Just a question here. So the one that leaks, where the problem happens, is the one controlling a specific ASN. Usually it's the Internet service providers, ISPs, that are creating the route leaks, right?
Yes.
Okay.
And in all these cases we see that three ASes are involved. The way to understand it is that the middle one is usually the leaker: it takes the routes learned from one provider or neighbor and propagates them to another neighbor they were not intended for, through a misconfiguration or something similar. So in our system we also show the three ASes involved, and the actual leaker is the middle one in those cases.
Okay. Makes sense.
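As an editorial illustration of the "wrong turn" idea discussed above, here is a minimal Python sketch that walks a route three ASes at a time and labels the turn with the RFC 7908 type (1 to 4) when the middle AS passes a route from a provider or peer on to another provider or peer. It is not Cloudflare's detection code: the names `Rel`, `LEAK_TYPES`, `find_leaks`, and the `rel_of` lookup table are hypothetical, and the relationships are assumed to be known in advance, whereas a real system has to infer them.

```python
from enum import Enum


class Rel(Enum):
    """What a neighbour is to the AS in the middle of a three-AS window."""
    PROVIDER = "provider"
    CUSTOMER = "customer"
    PEER = "peer"


# RFC 7908 types 1-4, keyed by (role of the AS the route came from,
# role of the AS the route was sent to), both relative to the middle AS.
LEAK_TYPES = {
    (Rel.PROVIDER, Rel.PROVIDER): 1,  # hairpin turn: provider -> customer -> provider
    (Rel.PEER, Rel.PEER): 2,          # lateral leak: peer -> peer -> peer
    (Rel.PROVIDER, Rel.PEER): 3,      # provider route leaked to a peer
    (Rel.PEER, Rel.PROVIDER): 4,      # peer route leaked to a provider
}


def find_leaks(propagation, rel_of):
    """Return (from_asn, by_asn, to_asn, leak_type) for each wrong turn.

    propagation: ASNs in the order the route travelled, origin first.
    rel_of: hypothetical lookup {(asn, neighbour): Rel} giving what
            `neighbour` is to `asn`; missing pairs are treated as unknown.
    """
    leaks = []
    for frm, by, to in zip(propagation, propagation[1:], propagation[2:]):
        key = (rel_of.get((by, frm)), rel_of.get((by, to)))
        leak_type = LEAK_TYPES.get(key)
        if leak_type is not None:
            leaks.append((frm, by, to, leak_type))
    return leaks


# The Type 1 example from the discussion: AS5 is a customer of both AS4 and AS6.
rel_of = {(5, 4): Rel.PROVIDER, (5, 6): Rel.PROVIDER}
print(find_leaks([4, 5, 6], rel_of))  # [(4, 5, 6, 1)]
```

Routes learned from a customer may be sent anywhere, so any window whose first role is a customer (or unknown) is simply not reported. The middle AS of each reported window is the one the speakers call the leaker.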
You were saying something, Vasilis?
Yeah. I would like to briefly explain this provider-customer relationship. As we said, the Internet is composed of thousands of networks. Some of them are very big and may have a global footprint. Some of them are very small, local or regional. The small ones need to access the entire Internet, right? The way they do it is that they find a larger network and ask for access to that network, and in exchange they pay for the privilege. This is a customer-to-provider relationship: the smaller network gets access to the backbone of the larger network. Now, two networks may decide that it is beneficial for both of them to access each other's networks for free. This is what we call a peering relationship. In a peering relationship there is no exchange of money, but there is exchange of traffic. And if you think about route leaks, they violate the economic and financial interests of an AS, because when a customer receives a route from a provider and advertises it to another provider, which is the hairpin turn, the Type 1 that Mingwei mentioned, then it has to pay both of these providers, because it's a customer. It's the small network that needs to pay. So it pays both of them to transit traffic between each other, but it gets no benefit, right? So we can also think of a route leak as a violation of financial interest, which means that the leaker, the AS with the misconfiguration, also needs to be aware of the leak, because it may impact its finances, unless it's intentional, right?
Exactly. Sometimes it's intentional. Sometimes there's an intention to cause harm, right?
It can cause harm by attracting traffic to its own network for the purpose of intercepting it, and so on.
Yeah. And one of the things that makes route leak detection especially hard is, as you mentioned, intention.
Sometimes we cannot learn the intentions between networks or what their business relationships are. We have to infer them, and there's no good, centralized way to learn the ground truth. So in order to monitor the global Internet, we have to infer, and that is usually a problem. Vasilis has published several academic papers on this specific topic, and they're really good. We expanded on that and based our system on the existing research so that we can be more accurate in inferring the relationships.
Exactly. Mingwei, can you guide us through a little of what we already have on route leaks? And can you also explain a bit how we built this and how we use our global network to help?
Yes, of course. All right. I'm sharing my screen on this particular page. We're looking at radar.cloudflare.com, which is the main page for Radar. We are building the central hub for everything that we see on Radar. That includes the BGP information that we collect here, and also route leaks. If you go to the Security and Attacks page, without filtering to any specific ASN, we're looking at a global view of the Internet. Toward the bottom of the page, we have now added the global route leaks table, which represents everything we see in terms of route leaks, regardless of the network. We can see a lot of ASes involved, and the leaks don't happen that frequently. We are being very careful about removing false positives, and we're trying our best to be accurate in the detection results. But of course, if you see anything that particularly jumps out or is potentially incorrect, let us know. In this particular case, we can take one example and look at a small ISP here. I think this is a small ISP. Maybe. Sorry, I'm really bad at geography.
All right. So this one jumps out because it's on the front page. This is the AS-specific, or network, page for this particular AS. If you scroll down to the bottom, we can see that there are a few route leaks that were generated or initiated by this particular AS. If we look at the table specifically, we can see the three ASes involved: the From, By, and To. The way to interpret this is that From is where the routes are coming from, and By is the leaking AS, the leaker. It basically tells us that this AS learned a route from its provider 3356, Level 3, or I think it's Lumen now, and then propagated it to the next AS, which is its other provider. That's the To part. And we can see the starting and ending times.
Oh, but that To wasn't the AS it was supposed to propagate to, right? That's the leak, right?
This particular one is the leak, right. What we know for sure here is that this is another provider of this AS. So the route was not supposed to go to this provider, but it did. In this case, it actually happened yesterday, and it lasted about 30 minutes or so. Later on in the page, we can see that it involved three prefixes from about two origin networks. Globally, we observed about 700 BGP messages about this particular leak, and it was observed by 230 vantage points in our system.
And can you define BGP messages and vantage points, in terms of what they mean?
Right. The BGP messages we're looking at here are basically the messages where the leak happens, where the leaker is announcing the path, propagating the path of the routes, right? So you can consider the BGP messages as the individual messages for the propagated routes.
Right.
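To make the per-event figures mentioned above concrete (number of BGP messages, vantage points, prefixes, origin networks, and duration), here is a small Python sketch of how such a summary row could be computed from the individual leak messages of one event. The `LeakMessage` record and the `summarize` function are hypothetical names chosen for illustration; they are not the Radar data model, and the sample values use documentation-only prefixes and ASNs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class LeakMessage:
    """One BGP message in which the leaked path was observed (hypothetical record)."""
    seen_at: datetime
    vantage_point: str   # identifier of the route collector peer that saw it
    prefix: str
    origin_asn: int


def summarize(messages):
    """Collapse the messages of one leak event into the counts shown per event."""
    times = [m.seen_at for m in messages]
    return {
        "messages": len(messages),
        "vantage_points": len({m.vantage_point for m in messages}),
        "prefixes": len({m.prefix for m in messages}),
        "origins": len({m.origin_asn for m in messages}),
        "duration": max(times) - min(times),
    }


start = datetime(2022, 1, 1, 12, 0)
sample = [
    LeakMessage(start, "vp-a", "192.0.2.0/24", 64496),
    LeakMessage(start + timedelta(minutes=10), "vp-b", "198.51.100.0/24", 64496),
    LeakMessage(start + timedelta(minutes=30), "vp-a", "203.0.113.0/24", 64497),
]
print(summarize(sample))
```

Counting distinct vantage points rather than raw messages is what lets a reader judge how widely a leak was seen, since one collector peer can emit many messages for the same leak.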
So that's about 700 individual messages about this route, observed by 230 vantage points on the Internet.
And the vantage points are?
The vantage points are the route collectors. You can think of them as watchtowers, or eyeballs: routers or systems that are watching Internet routing and that allow us to tap into their watchtower and look at what routes are being propagated and received.
When the number of vantage points is higher, the certainty is higher, because more vantage points saw the route and the route was not the one it was supposed to be. It was a leak, right?
Correct. In this case, the more vantage points, the more global reach the leak has. In another example, we can see an event with very limited visibility. There were five messages, it lasted about 10 minutes, and it was only seen by three vantage points, which is about 0.5% of all the vantage points that we have. So that's a limited event, and it didn't have a global impact.
Makes sense. Anything more that we can explain here?
Not in this part. We're still expanding the capabilities of the Radar routing page here, for example bringing in more information about each AS so that we don't have to look at just numbers. We can look at names and other information.
That's amazing. That's really helpful. And I noticed you did that: when you hover the mouse over it, you can see the name of that ASN. When we're dealing with ISPs, big ISPs, sometimes you don't know the names. Of course, they're difficult to memorize. Just having that there is really helpful to get a sense of, is it Verizon? Which one of the bigger ones is it?
Right. For curious readers who are more interested in Radar, we recently introduced the information block here that tells us more about any particular AS that we're looking at.
And we're trying to bring more such information into the route leak table as well, so that users can more easily observe and digest it. In this particular case, I want to mention that this is a very large network. What you can expect with a very large network is that a lot of route leaks will involve it, but it's usually on the From side. That basically means there are leakers that take routes from this very large network, which is usually the provider of the leaker, and then propagate them to another provider. So you'll see a lot of route leaks involving large networks, but they are not the ones to blame. We're not trying to blame anyone.
It's the By, the ASN in the By column, that is to blame, right?
Yeah, the By is potentially to blame, but we're still working to increase the accuracy there.
Exactly.
So if you see a lot of route leaks involving a particular network and it appears on the From side of those leaks, that usually means it is a large provider that supplies a lot of transit to the potential leakers.
Makes sense. That was a good explanation. Anything we should mention about how we did this in terms of data, and how we are improving?
Yeah, of course. Before we jump into that, there's also something for developers. We have already opened up the APIs for this. For everything that you can see on the Radar page, you can create an account with Cloudflare and get the information from the API now. I believe it's open to all developers without any charge, and you can build your own systems with it. You can periodically query it, and it allows you to filter by, for example, the leak ASN or by the ASNs involved. For example, in that previous case, we have 174 involved in many events; you can search for that too.
There will be some more useful information there as well, like the AS names and a little more detail about the leaks. We're trying to bring more information into this API as we go, so that developers can build more powerful systems with it.
Exactly. They can use our information and build their own systems, right?
Right. Sharing the information.
Anything we should mention to wrap things up, in terms of next steps and making it better?
Right. In the blog post, we talked about how we built it, so we'll skip most of that description. But I want to mention that we basically break the system down into three components. First, we have the raw data collection component, which takes in information from different vantage points and turns it into a stream of BGP messages, or routes in our discussion. Then we have a system that examines each individual route to validate it and detect leaks. For that, we use multiple information sources: CAIDA at UCSD, IHR, our own home-brewed AS relationship data, and some public data as well. We combine different sources to reduce false positives and make sure that our alerts are accurate and actionable. Lastly, we have a system that presents information about the alerts. We have developed the API, and we have the UI on Radar. As next steps, we're trying to improve the system so that it can be more useful to operators. One of the upcoming features is alerts: people will be able to subscribe to alerts for certain ASes, so that every time a new event comes up involving ASes of interest to you, you can get alerts via email or chat notifications, things like that.
We already have internal chat notifications implemented, and we are trying to build more capable chatbots and email notifications into our system. We can open that up to everyone when it's ready. Other things that are coming up are related to BGP but are more on the security side. We should mention that there are cases where we are not particularly confident that the behavior is benign or a mistake. Sometimes there are actually intentional route leaks, or even hijacks. So we are currently in the process of building hijack detection as well, so we can bring route leak detection and BGP hijack detection together in the same place. If you're interested in more discussion, we have a Radar Discord room as well, so you can join the Discord and talk with us directly. I will be there, and I believe Vasilis and João will also be there.
I'm there. Actually, it's on the Cloudflare Developers Discord server, so you should search for Cloudflare Developers. I've already talked to them. But yeah, you can join and talk to us. Before we go, just one thing, Vasilis. You've been at Cloudflare for a while. In terms of route leaks, what evolution have you seen in this area in the past few years?
Cloudflare suffered a very serious route leak in the summer of 2019, which alerted not just us but the whole Internet to how severe route leaks can be. Since then, detecting route leaks and finding ways to prevent them has been one of our priorities. As Mingwei mentioned, one of the primary things we tried to do is accurately infer what the path should normally look like based on the expected AS relationships. But that's not easy. So part of the innovation here is the combination of different relationship data sets. The other thing that makes our current system unique is its efficiency: how fast it can collect this data and present it.
You can imagine that with tens of thousands of networks on the Internet and many millions of BGP messages being exchanged, the amount of data is massive, right? When trying to detect route leaks in a timely manner, you often hit this problem. A final thing that makes this work unique is the way route leaks are summarized. In the past, we may have had the detection capability, but these incidents happen all the time. You can imagine how big the Internet is and how easily a misconfiguration can happen. So we have messages being exchanged constantly, and if we don't have the right way of presenting them, people are overwhelmed with alerts. So we've been trying to find the right representation. I think we've now reached the point where all three of these things, the accuracy, the efficiency, and the representation, have come together in a really nice way.
Exactly. For us to show this in a public forum, or as you said before, for others in the industry who can benefit from that information, this was really clear. I learned a lot about route leaks. You had something to say? Sorry about that.
I think it's good for the whole Internet, right? Better paths, good for users, good for operators.
Exactly. I really think it's fascinating how putting the information out there helps build a better Internet, or at least helps the Internet be healthier, which goes back to our mission. It's interesting that it's just information made public, but that information can have a real-world impact on ASNs, on operators, on those who deal with the Internet. So it was really interesting to learn about that. And that's a wrap. Thank you.
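Editorial note: the summarization point made in this closing part of the conversation, turning a constant stream of per-message detections into a manageable number of events, can be sketched as follows. This is only one plausible approach and not a description of how Radar does it: the function name `group_into_events`, the five-minute `max_gap` threshold, and the input shape are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def group_into_events(detections, max_gap=timedelta(minutes=5)):
    """Group per-message leak detections into events.

    detections: iterable of (timestamp, (from_asn, by_asn, to_asn)).
    Detections with the same three-AS key that are at most `max_gap`
    apart (an assumed threshold) are merged into one event.
    """
    events = []
    latest = {}  # three-AS key -> index of its most recent event
    for ts, key in sorted(detections):
        idx = latest.get(key)
        if idx is not None and ts - events[idx]["end"] <= max_gap:
            events[idx]["end"] = ts
            events[idx]["messages"] += 1
        else:
            latest[key] = len(events)
            events.append({"key": key, "start": ts, "end": ts, "messages": 1})
    return events


t0 = datetime(2022, 1, 1, 12, 0)
leak = (64496, 64500, 64497)  # documentation-range ASNs, illustration only
stream = [
    (t0, leak),
    (t0 + timedelta(minutes=2), leak),
    (t0 + timedelta(minutes=4), leak),
    (t0 + timedelta(hours=1), leak),  # same ASes much later: a new event
]
for event in group_into_events(stream):
    print(event["key"], event["start"], event["end"], event["messages"])
```

With this grouping the four detections above become two events, which is the kind of reduction that keeps an operator from being flooded with one alert per BGP message.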