🎂 Radar Deep Dive: traffic anomalies and notifications

Presented by: David Belson, Carlos Azevedo

Originally aired on February 12, 2024 @ 11:00 AM - 11:30 AM EST

Welcome to Cloudflare Birthday Week 2023!

2023 marks Cloudflare’s 13th birthday! Each day this week we will announce new products and host fascinating discussions with guests including product experts, customers, and industry peers.

Tune in all week for more news, announcements, and thought-provoking discussions!


Visit the Birthday Week Hub for every announcement and CFTV episode — check back all week for more!


Birthday Week

Transcript (Beta)

Hello, everybody, and welcome to a Cloudflare TV Radar Deep Dive on Traffic Anomalies and Notifications on Cloudflare Radar. I'm David Belson, the Head of Data Insight for Cloudflare and leading the Cloudflare Radar product team. And with me today is Carlos Azevedo, key member of the Radar Data Science team. So I wanted to take you all through today two new features that we launched on Radar this morning. Pretty excited about those. So let me bring up the blog post. And we can take a look at what we've got. OK, my window here. So today we launched two new features. One is traffic anomalies. And those are basically defined as unexpected drops in traffic. And the other is notifications. So it's one thing to have an unexpected drop in traffic happen. But it's another to actually know that it happened and know that it happened in a timely fashion. So we published a blog post today called Traffic Anomalies and Notifications with Cloudflare Radar. And in it, we talked about the traffic anomalies, giving examples of the new feature that we launched on Radar. And we'll show that in a minute. But you'll now be able to find a curated list of anomalies on Radar, as well as insight into where they're occurring geographically, being able to break that down. And this data is also available not only on Radar, but also through the Radar API, which is fantastic for those folks who like to automate it and do their own thing. And then the notifications functionality. So it talks you through what to look for. So you're going to want to look for the megaphone icon on the traffic pages, the routing pages, and on the outage center on Radar. You click that, and that will take you through the process of setting up a notification. And you can do that at a sort of a global level, or for a given country or set of countries, or a given autonomous system or set of autonomous systems. 
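For readers who want to pull the anomaly list programmatically, as mentioned above, the following is a minimal sketch of building a request against the Radar API. The endpoint path `/radar/traffic_anomalies` and the query parameter names (`dateRange`, `limit`, `location`, `asn`, `format`) are assumptions that should be checked against the current Cloudflare API documentation; the helper names are hypothetical and are not part of any Cloudflare library.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.cloudflare.com/client/v4"
# Assumed endpoint path; verify against the Cloudflare API reference.
ANOMALIES_PATH = "/radar/traffic_anomalies"


def build_anomalies_request(token, location=None, asn=None,
                            date_range="7d", limit=10):
    """Build (but do not send) a GET request for recent traffic anomalies.

    Parameter names below are assumptions, not confirmed by the transcript.
    """
    params = {"dateRange": date_range, "limit": limit, "format": "json"}
    if location:
        params["location"] = location  # e.g. a two-letter country code
    if asn is not None:
        params["asn"] = asn
    url = f"{API_BASE}{ANOMALIES_PATH}?{urllib.parse.urlencode(params)}"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )


def fetch_anomalies(token, **filters):
    """Send the request and return the decoded JSON response."""
    request = build_anomalies_request(token, **filters)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `fetch_anomalies("<your API token>", location="CH")` would then return whatever JSON the API responds with; the response shape is not described in this episode, so it is left uninterpreted here.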
And as I mentioned, that works for the traffic anomalies that we launched today. It also works for the outages, which we've been publishing over the last year, since last year's Birthday Week, and for the BGP route leaks and BGP origin hijacks that we launched several weeks ago. And there were a couple of related blog posts about those. So let's take a look and see how these things look on Cloudflare Radar. And sorry, that one. So this is the Radar Outage Center; we call it the CROC, for obvious reasons. And what you'll see now is that we've added these orange circles to the graph, or excuse me, to the map. On the map, an orange circle now represents traffic anomalies. Where it's a single circle, with a single border on it, that represents an anomaly associated with a single country. Where you have a dual-bordered circle, like this, that represents some aggregation. So if you zoom in on the map, ultimately you'll see it break apart into its component countries or locations. So here you see that single circle broke apart into two anomalies that were observed in Dominica and one that was observed in Barbados, all over the past week. In addition, we've now got a new timeline below the map here. And that shows when, over the last week, observed outages or traffic anomalies, at either a location or an autonomous system level, have occurred. And you can mouse over any given anomaly, and it will show you the information about it. And then if you click on it, it will also take you to that anomaly on the traffic page. And we'll do that in a second. So in addition to the outages table, which we've had on the Outage Center for the last year, we've now added a standalone anomalies table. And within this table, we now show information about what type of anomaly it was. So was it a location or country?
Or was it an autonomous system was a network level anomaly, the entity where we observed the anomaly. So you can see over the last day or so we've we observed one in Dominica. And we observed one in Swisscom last night, as well as one in Time Warner Cable or Spectrum in the US, specifically the Carolinas. We provide a start timestamp, a duration, and Carlos can talk about how we are measuring those durations, verification. So one of the things that we try to do, or we will be doing, as we observe an anomaly is looking across multiple data sets, whether they're internal or external, or looking for information from an affected provider to say, like, yes, we know that there was an issue here, or this issue is visible in multiple, multiple data sets. And if that's the case, we will we will mark it as verified. And in fact, the Swisscom and TWC anomalies are verified. We're just trying to complete the back office functionality that will enable us to mark them as such. And then two actions. Now they're available. One is being able to view the anomaly on the traffic page. And the other is to sign up for notifications. So as an example, here, we can click the view traffic anomaly. And that takes us to the traffic page where you can see shaded in yellow here, if you mouse over it, says this is the traffic anomaly event in that given autonomous system. And you can see here that there was an obvious sort of unexpected drop in traffic that occurred. So enough of me talking. I wanted to dig in on the traffic anomalies now. And Carlos put together a blog post that talks in more detail about what the system he's built that detects anomalies. So I guess let's start. So Carlos, let's start off, you know, what is a traffic anomaly? How are you defining it? Okay, so basically, the way we can define traffic anomaly, I'll say is like, in this case, we can look at two weeks of data, let's say for a specific country or specific areas, or even worldwide. 
And you can get the sense of what it should look like in the next hour, for instance. So we can see it as a typically, it is a weekly pattern, but it can be more like a daily pattern as well. So you know, more or less what to expect. And when this, what we observe is not actually what we expect, we can say that we are facing an anomaly in the data. In this particular case, we are mainly interested in drops in traffic. So we also look at anomalies that are spikes, there's a huge, a sudden increase in traffic. But for now, we are ignoring it, maybe future use case. But here we are interested in sudden drops, because it's really connected with detecting outages. So we call it an anomaly, but ultimately, it will be connected to some outage. Right, yeah, you're thinking about it as a, an early warning signal for potential outage, or worse, actually, a shutdown, you know, intentional severing of Internet traffic there. Exactly. And there are several reasons for this to happen. So the same things that cause outages, it can be power outages, cable cuts, natural disasters, government orders, so all these things. But then there's another thing more that is actually related with the way we collect data. So what we observe is what we see in our data. But it doesn't mean directly that there's an outage out there, it might be some data artifact on our side as well. And we are reporting this. So if we go to radar, and we look at the signal, we'll be able to see what we are detecting as an anomaly in the signal. But then it needs some sort of validation. That's what right. So yeah, it's like a, it's an alert for us to investigate further and try to find if it's actually an outage or not, if we find the root cause, someone so the AS itself can be reporting something about it, something in the news, we end up creating an outage, because now we are sure it's not only connected with what we see, but it's actually what, what is happening. Right. 
So somebody has put their hand up, essentially, and said, Yes, you know, that was we saw that with Swisscom and spectrum this morning. Both of them on social media acknowledged issues with their networks. Yeah, they didn't provide any root causes, unfortunately. But yeah, they both said, Yeah, you know, we know we're having problems. So it's also, so just to finish, it's also a way for us to show that we saw some anomaly, even if no one is talking about it, because maybe it was enough, but still, it happened, it might be interesting for someone if they have visibility over it. Right. Yeah. And I think I think one of the things we actually probably should have talked about, we didn't was the importance of detecting those anomalies. You know, like you said, you know, in some cases, the Internet outages are related to, you know, power outages or severe weather or earthquakes, or, you know, sort of natural disaster type stuff. And in those cases, you almost expect Internet traffic to drop. It's the countries where that have had historical issues, you know, in terms of Internet stability, it's the countries that may have more authoritarian governments that are, you know, more likely to implement Internet shutdowns. I think, you know, those are one of the big reasons that we've built this functionality is because we want to understand, we want to help the community and the industry understand, hey, something's going on over here. And like you said, you know, maybe it's almost expected, like, you know, yeah, so we can correlate that drop in traffic with the earthquake or the, you know, the hurricane or whatever that just hit this region. But in some cases, like, okay, there's no hurricane, there's no earthquake, there's no power outage, something weird is going on here. And those are, I think, the ones we want to be most concerned about. So Cloudflare has, you know, a number of different services, obviously, you know, all of which generate traffic. 
Can you talk more about the different data sources, the different traffic sources that you're using to identify the anomalies? Yeah, so in this case, I'm focusing mostly on two different data sources, although I explored the other ones a little bit. The ultimate objective is to cover as many entities, I mean locations and ASs, as possible, and using more data sources will help us with that. But currently, we are using the number of HTTP requests over time. And we are using packet sampling data as well. And the reason why we are using these two data sources is, like I said before, data collection might be something that triggers an anomaly. But if we are looking at several data sources, it allows us to check on both sides. So if something goes wrong with collecting data on one side, the other one is probably all right. So we avoid creating false positives, let's say. And another thing is, it allows us to look into more entities, because maybe the data quality from one data source for a specific entity is better than from another one. So if we play with these two together, and if we keep adding more, maybe we'll be able to cover more scenarios. Right. I know that network error logging is another data source that we've talked about in the past that we can use to try to validate or verify an outage or an anomaly. Yeah, I mean, I think, to your point, the more data sources that we can call on, the broader the potential coverage of our monitoring, because what we've certainly seen is that with some of the smaller countries and some of the smaller network providers, autonomous systems, the signal oftentimes is not the greatest. It's not good enough to really get a great picture of what traffic should be like and to predict where it should be. So it makes it harder to detect those anomalies there.
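The idea of cross-checking the two data sources can be sketched as follows. This is only an illustration of "flag a drop when independent sources agree", assuming each source already provides an observed and an expected value for the same interval; the `SourceReading` type, the function names, and the 50% threshold are hypothetical and are not taken from the system discussed in this episode.

```python
from dataclasses import dataclass


@dataclass
class SourceReading:
    name: str        # e.g. "http_requests" or "packet_sampling"
    observed: float  # traffic actually seen in the interval
    expected: float  # traffic expected for the same interval


def is_drop(reading, max_ratio=0.5):
    """True if observed traffic fell below max_ratio of the expectation."""
    if reading.expected <= 0:
        return False  # no usable baseline for this source
    return reading.observed / reading.expected < max_ratio


def sources_agree_on_drop(readings, max_ratio=0.5):
    """Flag a drop only when at least two usable sources all show it.

    A collection problem in one source then does not, on its own,
    produce a false positive.
    """
    usable = [r for r in readings if r.expected > 0]
    return len(usable) >= 2 and all(is_drop(r, max_ratio) for r in usable)
```

With this rule, a drop that appears in the HTTP request counts but not in the packet sampling data is treated as a likely data artifact rather than an anomaly.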
And also for locations that don't have a lot of users, or ASs without a lot of users, we don't get a lot of traffic. But if you're able to use a lot of data sources, maybe you don't need the signal to be so clean. Because we have this assumption that all these signals must agree with what they are seeing. And if all the data sources see the same thing, it's probably that something is happening there. So yeah, it allows us to cover more entities. So given these multiple data sources, can you talk us through the algorithm, or the techniques, that you've implemented to do the anomaly detection? Yeah, so in the blog post, I have this in much more detail. I also give some examples of what the data looks like in general; if you go through the first few images, let's say, we can get a sense. So this is, for example... I forgot that I had it shared. Yeah, so this is Australia, which looks like a very sort of... Yeah, clean signal. Exactly. But not all the signals look like this; this would be ideal. If we scroll down a little bit, we can skip this next one. Yeah, we start seeing that we have some shifts in the data, and there's some variance in the data as well. So with all these things, if we look at these spikes, which are actually the variability of the signal, we could trigger anomalies. So we need to pay attention to this. And we are tracking a lot of entities, a lot. It's around 700. We'll talk about how we got there. But each of these will have its own time series, with its own patterns and its own signal. Its own nuances. Yeah, exactly. So the thing is, I tried at first to use some algorithm that would rule them all, let's say, but we always need to do some fine-tuning for each of the time series. Then we need to take into consideration that we need to clean the data and remove outliers, and there are holidays in the middle as well. So we have to take all of this into account.
And in the end, what I decided to do is okay, I'm able to look at the signal and say something about this, you know, so I want to do something that will mimic it. And also that is easy to understand, because when I look at it, I want to understand immediately why the anomaly is being triggered. And if I can do something about it. So imagine that I know it was triggered, I know why, but I also know it's, it shouldn't trigger, it's a false positive, I want to know exactly what do I need to do to control it. So I want to have full control over the algorithm. I want to skip these preprocessing steps that I just mentioned. So basically, I did something very custom. And it's easy to understand. So basically, I will get the last 24 hours, if we scroll a little bit down, maybe we have some figures that can help us understand this. So here we have the example of Madagascar, for instance, what we see is that we have different patterns for weekdays and weekends, we definitely see a strong weekday, weekday, weekday weekend pattern there. Exactly. And we have a holiday like with this green shadow, which looks like the weekend, we have an outage in this red shadow, which also looks like the weekend and the holiday. So this is the example that I use in the blog post to, to go through the process, because I have all these scenarios that I need to, to have some control over it. And I don't want to, so for instance, I don't want to trigger an anomaly when there's a, an holiday. And this holiday, for instance, is on Tuesday, if I compare different Tuesdays, of course, this traffic is below a certain threshold, then probably I will trigger an anomaly. So I want to be able to differentiate this. And I don't want my notifications blowing up on Christmas when you know, every single country triggers an anomaly. Yeah, and we can think like, okay, it's an holiday, it's probably I can get this data somewhere and I can feed it into the algorithm. 
But again, there are holidays that change every year, so it's not always the same date; there are holidays that maybe won't have any impact on the data, and some other ones will. So for me, this was one more layer of complexity that I didn't want to play with. So I wanted to have an algorithm that takes care of all of this without me even knowing that it's going to be a holiday. Take the human behavior out of the loop. Exactly. And for instance, even if I do know the day, it means that I need to know the time zone to adjust what I should expect. And if the country has several time zones, what do I do? I would have to have a time zone for each AS as well, and that's going to be a problem. So I really wanted to do something that allows me to skip all these processes of cleaning the data, because I would have to do this for each time series, basically. And so now, how it works: I look at the last 24 hours of data. And this already gives me a sense of where I am. So this is what gives me context. If I'm in a transition from Sunday to Monday, looking at this example, I know what the data looks like. And then I look at past data and match these 24 hours with the most similar data that I have in the past. So in an ideal scenario, if I'm in a transition from Sunday to Monday, I will match with other Sunday-to-Monday transitions. And then we can say, okay, if that's the approach, why not just match it to the past week? But then again, we have this holiday problem. Right, maybe last week wasn't normal. Yeah. So you probably need to check with more; you need to build up a set of Sunday-to-Monday transitions to sort of check against, I assume. Exactly. And in this case of the holiday, it's a transition from Monday to Tuesday, but it actually looks like Friday to Saturday.
And I know that it will be able to match with past data with Friday to Saturday, and it will still look like a normal signal instead of comparing only with Tuesdays, which will trigger an anomaly in this case. Right. Okay. Yeah, I match all this data. And then it's quite simple. So I compute the, I look at what should I expect to be the next hour. So I'm using 15 minutes aggregation. So it's four data points. Okay. And I compute the median. The median, why? Just because I want to skip all the outliers that might be spikes or drops in traffic. Right. And yeah, so That makes sense. That was the idea, to be able to skip a lot of these pre-processing steps. Also, another thing that is important is we didn't have labeled data. So it also didn't help in the beginning. I had to Labeled in what sense? I know it's a machine learning term. Yeah. So label it, I mean, to know if data that is labeled as outage or not outages or as false positive, let's say. Oh, so something for the machine learning to sort of learn from to say, like, when it looks like this, this smells like an outage, or this is when the data looks this way, we can be confident it's an outage. Yeah. We had a few examples that I collected manually, so not the most efficient way. And also, I didn't have examples for each one of the ASs and countries that I'm trying to check because I don't know that it's, so I didn't have this label data. So I had to get some algorithm that will be still able to play with all this. No, that makes sense. And I think, I mean, I believe we have it set up such that when we have the back office, we'll be able to provide some of that feedback into the system that says, this one was a false positive, this one was legit, was verified, it was all good. Exactly. The thing is, of course, we are, so we are labeling this for our users, but we are also going to use it internally to try to improve the algorithm and try different strategies once we start collecting this kind of data. Yeah. 
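The matching approach described above can be sketched roughly as follows: take the last 24 hours at 15-minute resolution, find the most similar 24-hour windows in the past, and use the median of the hour that followed each match as the expectation for the next hour. This is a simplified illustration rather than the production algorithm; the squared-error distance, the number of matches `k`, the one-hour sliding step, and the 50% drop threshold are all assumptions that are not stated in the episode.

```python
from statistics import median

POINTS_PER_HOUR = 4            # 15-minute aggregation
WINDOW = 24 * POINTS_PER_HOUR  # last 24 hours = 96 points


def distance(a, b):
    """Sum of squared differences between two equal-length windows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def expected_next_hour(history, k=3):
    """Estimate the next hour's traffic from the k most similar past days.

    history: values at 15-minute resolution, oldest first, ending now.
    Returns the median of the points that followed the best matches.
    """
    recent = history[-WINDOW:]
    candidates = []
    # Slide over past 24-hour windows that end before the recent one begins.
    for start in range(0, len(history) - 2 * WINDOW + 1, POINTS_PER_HOUR):
        window = history[start:start + WINDOW]
        following = history[start + WINDOW:start + WINDOW + POINTS_PER_HOUR]
        candidates.append((distance(recent, window), following))
    if not candidates:
        raise ValueError("need at least 48 hours of history")
    candidates.sort(key=lambda item: item[0])
    next_points = [v for _, following in candidates[:k] for v in following]
    # The median keeps isolated spikes or drops in past data from skewing it.
    return median(next_points)


def is_anomalous_drop(observed_hour, expected, max_ratio=0.5):
    """True if the observed hour's median falls below max_ratio * expected."""
    return median(observed_hour) < max_ratio * expected
```

Because the match is made on the shape of the last 24 hours rather than on the day of the week, a holiday that falls on a Tuesday can match past Friday-to-Saturday transitions, as described in the conversation, without any holiday calendar or time zone information.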
Right. Cool. And then you mentioned 700 entities that we're looking at. So that's countries and locations; we use the term locations, but we're generally referring to countries, and then autonomous systems. And is that basically because that's where we feel that we currently have, effectively, the strongest signal and the highest confidence? So, yeah, we are tracking around 700 different entities. And the way it was done: we are interested mostly in human Internet traffic, and there are ASs that are mainly not human Internet traffic, so we want to exclude those. And to avoid doing this process manually, what we did is use a list of estimated users under each AS per country that is provided by APNIC. Okay, great. And yeah, for instance, for a specific AS in a specific country, they tell you what percentage of that country's users are under that AS. And I just put in a threshold of 1%. I want at least 1% of users. And after this process, we get around 1,400 ASs. But then we have another issue, which is that we need to guarantee that we have enough data quality to... Right. We need enough traffic signal from a given AS to be able to do that. Exactly. Cool. So then I had to actually look at the volume of data that we have for each of these. And then I ended up excluding the ones that are not reliable enough for me to trigger anomalies. Right. Well, that makes sense. Okay. So we'll certainly continue to try to expand that list going forward as we refine the process. Yeah. So thank you. I think this was a great explanation. For folks that are interested in digging in further, we've got Carlos's blog post that was published today, entitled "Gone Offline: How Cloudflare Radar Detects Internet Outages". Just in the last few minutes, I think one of the things I would like to do would be to show folks how the notification setup process works.
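The two-step entity selection described above (a 1% user-share threshold, then a data-quality check) can be sketched like this. The 1% threshold comes from the conversation; the `AsnCandidate` fields, the use of average request volume as the data-quality measure, and the `min_avg_requests` value are hypothetical stand-ins, since the episode does not say how reliability is actually judged.

```python
from dataclasses import dataclass


@dataclass
class AsnCandidate:
    asn: int
    country: str
    user_share_pct: float          # estimated % of the country's users on this AS
    avg_requests_per_15min: float  # hypothetical proxy for signal quality


def select_entities(candidates, min_user_share_pct=1.0,
                    min_avg_requests=1000.0):
    """Keep ASs with enough users and enough traffic to monitor reliably."""
    selected = []
    for c in candidates:
        if c.user_share_pct < min_user_share_pct:
            continue  # too few human users in this country
        if c.avg_requests_per_15min < min_avg_requests:
            continue  # signal too weak to trigger anomalies reliably
        selected.append(c)
    return selected
```

For example, an AS holding 12% of a country's users with a steady traffic volume would be kept, while one with 0.4% of users, or one with a large user share but very little observed traffic, would be dropped.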
So we were looking earlier at an anomaly in the anomalies list here. So if we want to, say, get notified any time that Swisscom has a traffic anomaly, we can go to the table here and click the "notify me of anomalies like this" action. That will take you to the Cloudflare dash. So as we note in the blog post, in order to subscribe to anomalies, you need to have a Cloudflare dash login. All you really need for that is an email address. No purchase necessary, as they say. But if you click on the notification icon on Radar, that will take you through the login process, which I did before this. And then it will take you to the add notification screen. So here I can modify my notifications. Let's say I want to say Dominica anomalies. I hope I can spell. So, excuse me, this is Swisscom, actually. So Swisscom anomalies; I can add a description here if I want. So I want to get notified for traffic anomalies. If I wanted to add notifications around outages, BGP leaks, or BGP hijacks, I can select those here as well. I can add a location if I want. So let's say I wanted to limit this to just Switzerland. I can do that here. I can add that. And then I can go through and add an email recipient, or multiple. So let's say test@example.com. Or maybe I should use xmaple.com. And then you click save. And then that will take you through to a list of outages. Or excuse me, not a list of outages, a list of notifications that are set up. And in this case, here's my Swisscom anomalies that I just set up. I can go through and I can edit it and say, actually, I don't want to do Switzerland. I don't want to look at outages. And then I can go and save that edited anomaly. And then again, as I mentioned, you can set up notifications from the Radar Outage Center. So here, it would be anomalies like the given anomaly there. You can set up a notification for all traffic anomalies if you click the notification icon that's in the description there.
Same thing with Internet outages. If you want to get notified about all Internet outages, you'd click the icon up here. Or you can get notified about outages like the one in a given row. So in this particular case, this is Iraq, where they've been doing the regular Internet shutdowns to prevent cheating on exams. So here, it pre-fills with Iraq outage, and then all of the impacted autonomous systems. Similarly, you can do it from the traffic page. So if I want to look at, let's say, Albania, and I want to get an alert about anomalies or outages for Albania, I click it here. Similarly, looking at the routing page, we can drill down, and we're probably not going to see much for Albania. So let's go up to the top and go to routing. And again, here, you can get notified about all route leaks, or you can get notified, for instance, about route leaks that involve just the autonomous systems that were shown in that particular row. So in this particular case, it was route leaks that involve autonomous systems that are in Russia and the United Kingdom. And these were the two related autonomous systems. Similarly, you can also do it for route hijacks. So again, if we click the action button for the first row here, it would set up a notification for Ukraine, because both autonomous systems are in Ukraine. And it looks like just one of the ASNs there. I'm not quite sure why it only showed one of the ASNs. But yeah, so it's a very rich notification capability, which is, I think, pretty great. And we're looking forward to seeing how folks use it. So in the minute or two we have left, I want to stop sharing. So I'd like to thank Carlos for joining me today on this Cloudflare TV segment to go over the new Traffic Anomalies and Notifications functionality on Radar. There are the two blog posts that we published. So again, the one I wrote, called Traffic Anomalies and Notifications on Cloudflare Radar. And Carlos's, Gone Offline.
And as I mentioned, look for the megaphone icons on Cloudflare Radar to start setting up notifications. You can follow Radar on social media. So we are @CloudflareRadar on Twitter or X. We are @radar@cloudflare.social on Mastodon and the Fediverse. And we are radar.cloudflare.com on Bluesky. And if you have any questions, you can also email the team at radar@cloudflare.com. So thank you again for joining us today. This was live now and will be available as an on-demand video shortly after it airs. So thank you again. And we hope to talk to you soon.

Birthday Week

For Cloudflare's annual birthday, we like to give presents back to the Internet. Each day during Birthday Week, we announce new things that further our mission: to help build a better Internet. Be sure to head to the Cloudflare Birthday Week...
