Flow-based monitoring for Magic Transit
Presented by: Annika Garbers, Erich Heine
Originally aired on March 9, 2022 @ 2:30 PM - 3:00 PM EST
An overview of flow-based monitoring, a new feature for Magic Transit that allows on-demand customers to send flow data to Cloudflare for analysis and get alerted when they're under attack so they can activate DDoS protection.
Transcript (Beta)
Hello, Cloudflare TV. My name is Annika, I'm a product manager at Cloudflare, and I'm here with Erich.
Hi, everybody.
We're here to talk about flow-based monitoring for Magic Transit, a new feature that we just recently shipped.
I'm the product manager who works on Magic Transit.
Erich is an engineer on the Magic Transit team who helped build flow-based monitoring, and we're super excited to tell you all about it.
Yeah, let's get to it right away.
First of all, we should probably have you tell us a little bit about what Magic Transit is, and about the entire family of Magic products.
Sure. Cloudflare currently has three products in the Magic family, all organized around one question: how can we make customers' entire networks more secure, faster, and more reliable?
We have Magic WAN, a recent release that helps customers route private traffic across Cloudflare's network.
Magic Firewall is a network firewall delivered as a service, so customers can enforce firewall rules right at our edge.
And then there's Magic Transit, the first of the Magic family and what we're going to focus on mostly today.
Magic Transit allows Cloudflare's network to sit in front of customers and protect them from DDoS attacks and other kinds of malicious traffic coming into their network from the external Internet.
Yeah, it's really cool. The DDoS stuff is probably my favorite thing that we do here at the company.
What we do is look at how much traffic is going through your network and what it consists of, decide whether it's good traffic or bad traffic, and block the bad traffic.
And when it's a DDoS attack, it's really helpful to use Cloudflare. There's so much traffic coming at us that we have this giant edge network, set up both to serve web pages in our cache products and to block bad traffic coming at ourselves.
So we've extended it to be available to our customers for their networking needs as well.
Totally. I'm actually interested in digging a little deeper there too, Erich. Would you mind telling us a little more about what a DDoS attack actually is, and why our customers should be worried about it?
Yeah, good question.
DDoS stands for distributed denial of service. It's a type of attack where a lot of traffic gets sent from a lot of different computers to a single server, or a set of servers in a data center, to try to take it offline by sending so much traffic that the network connection is overwhelmed.
So if you have a one gigabit per second connection and the attacker is trying to send you 10 gigabits per second of traffic, it just doesn't fit through that pipe.
The distributed part is interesting too. If a single computer under the attacker's control were sending all that traffic, it would be very easy to say, block all traffic from this computer, and we're done.
But in a distributed denial of service attack, the attackers have taken control of a bunch of computers on the Internet, either maliciously or just by renting them, and they send the traffic from hundreds, thousands, or even hundreds of thousands of IPs. So it's hard to just place a rule in your firewall saying block this IP. You could never keep up; a human couldn't do that.
We call it a volumetric attack: it sends so much traffic that it overwhelms your resources.
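The arithmetic behind that point can be sketched in a few lines. This is only an illustration using the 1 Gbps and 10 Gbps figures from the conversation; the function names, the even split across sources, and the botnet size of 100,000 are assumptions made for the example.

```python
# Figures from the discussion above: a 1 Gbps link facing a 10 Gbps attack.
LINK_CAPACITY_BPS = 1_000_000_000
ATTACK_RATE_BPS = 10_000_000_000


def deliverable_fraction(offered_bps: int, capacity_bps: int) -> float:
    """Largest share of the offered traffic that the link can carry."""
    if offered_bps <= 0:
        return 1.0
    return min(1.0, capacity_bps / offered_bps)


def per_source_rate(total_bps: int, sources: int) -> float:
    """Average rate per source, assuming the attack is spread evenly."""
    return total_bps / sources


fraction = deliverable_fraction(ATTACK_RATE_BPS, LINK_CAPACITY_BPS)
print(f"At most {fraction:.0%} of the offered traffic fits through the link")

# Hypothetical botnet size, in the "hundreds of thousands" range mentioned above.
rate = per_source_rate(ATTACK_RATE_BPS, 100_000)
print(f"Each of 100,000 sources only needs to send {rate / 1000:.0f} kbps")
```

Each individual source sends very little, which is why per-IP blocking by hand does not scale.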
It's interesting because there are a lot of different ways to do this.
You can send just IP packets, which are the low layer of how traffic moves around the Internet: it gets chopped up into little parcels and sent around. The attack can come at that layer, or it can come at a higher layer, like your HTTP server. There, they send too many HTTP requests that look like good traffic at the IP layer, but at what we call layer seven in the OSI model, the HTTP layer, they're not good requests. They're trying to get lol.txt instead of your web page.
So these attacks can hit and affect a lot of different parts of computing.
But that's some pretty deep technical stuff on DDoS. What about Magic Transit as a whole?
Maybe we should talk about how that works, Annika.
Can you take it away for that?
Yeah, sure. You mentioned a couple of different types of DDoS attacks, at different layers of the OSI stack, which is a framework for thinking about networks and applications. Magic Transit protects customers' networks at the IP layer.
So let me go ahead and pull up a quick diagram that we can use to talk through how this actually works.
I mentioned that Cloudflare's network sits in front of a customer's network when they're using Magic Transit, and you can see that in this picture. We've got this big cloud, and all of the traffic trying to get to our end customer's network comes through Cloudflare first, both from good users over here, who we can call our eyeballs, and from attackers.
The way that we do that is by advertising customers' IP addresses to the Internet with a mechanism called BGP.
So, through the magic of BGP, all of the traffic shows up at Cloudflare's network before it gets to our customer. Then we can filter out DDoS attack traffic, block other kinds of traffic with the Magic Firewall product that I mentioned earlier, and send only the clean traffic back to our customer's network over an Anycast GRE tunnel or Cloudflare Network Interconnect, which is a physical or virtual connection to Cloudflare's network, even further down in the OSI stack.
The return traffic then goes straight back to the eyeball, without Cloudflare in the path. In doing this, we're able to attract traffic and scrub it at the place closest to where the user is, because Cloudflare's network is really big and we're all over the place. Wherever the traffic is, we're really close by; we drop the bad stuff and send just the good back to the customer.
So that's the high level of how Magic Transit actually works.
Yeah, and there's an interesting feature that we have, well, two different modes of operation really. One is called always on and the other is called on demand.
Always on means that whenever your computers are connected to the Internet, their traffic is going through Magic Transit. With on demand, normally your traffic goes through the Internet and comes to your data centers just like it would without Magic Transit. But then, if there's an attack or any other volumetric event, say you go viral, you can press a button to turn on Magic Transit, and all the traffic starts flowing through the Cloudflare system, because our BGP servers will advertise those IPs out to the Internet. So on demand is really cool because you can selectively choose when you want to use Magic Transit and when you want to just leave it off.
Now, I'm pretty technical and I just want cool tech running on my stuff all the time, so I can't imagine why we'd want a difference between always on and on demand. But you talk to customers a lot, so why don't you tell us a little bit about why they would make that choice?
Sure. There are a couple of different reasons that customers would choose on demand versus always on.
One is just how often they get attacked.
We have a wide range of customers using Magic Transit. Some of them are very high profile on the Internet, with a big web presence, so they get attacked all the time from all over the place. It makes sense to always have their traffic going through Cloudflare, always scrubbing those attacks out.
Some customers, however, almost never get attacked, but they still want DDoS mitigation as an option to turn on in case that happens. They're using Magic Transit essentially as an insurance policy: if they're under attack, they want to be able to push the button and get the mitigation, but they don't really expect that it'll happen.
Another reason is cost. If traffic is not always going through our network, we're not really doing much to drop or mitigate it, so on demand is slightly cheaper than always on, and sometimes customers go that route for that reason.
A third one might be network control in general. Customers sometimes want to keep their network set up as it is today, with what they're comfortable with, and only put Cloudflare in the path once they're actually under attack.
So those are some of the reasons that customers might choose on demand. But as you, the audience, might be able to see from this explanation so far, there's an interesting gap in this product. If a customer is using Magic Transit on demand, attack traffic could show up at their network at any time while Magic Transit isn't on. So how does a customer know when they're under attack, in order to be able to activate Magic Transit? It's kind of a chicken and egg problem, and that is where flow-based monitoring, this new feature we're talking about today, comes in. What flow-based monitoring essentially allows us to do is take flow data from customers' infrastructure: they send us information about the traffic that they're receiving, we analyze it for DDoS attacks, and then we can alert customers when they're under attack so that they can choose to activate Magic Transit.
So that's flow-based monitoring, and the high-level problem it solves is this gap: how do I know when I'm under attack, so that I can activate Magic Transit?
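To make the idea concrete, here is a minimal sketch of the simplest possible check: add up the bytes reported in flow data for each destination over a time window and flag any destination whose rate crosses a threshold. This is not Cloudflare's analyzer. The `FlowSummary` type, the function name, the threshold, and the sample traffic are all hypothetical, chosen only to illustrate "analyze flow data, then alert".

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowSummary:
    """Hypothetical, simplified flow record; real flow exports carry more fields."""
    src_ip: str
    dst_ip: str
    byte_count: int


def destinations_over_threshold(
    flows: list[FlowSummary], window_seconds: float, threshold_bps: float
) -> dict[str, float]:
    """Return each destination whose inbound rate in the window exceeds the threshold."""
    bytes_per_dst: dict[str, int] = defaultdict(int)
    for flow in flows:
        bytes_per_dst[flow.dst_ip] += flow.byte_count

    over: dict[str, float] = {}
    for dst, total_bytes in bytes_per_dst.items():
        rate_bps = total_bytes * 8 / window_seconds
        if rate_bps > threshold_bps:
            over[dst] = rate_bps
    return over


# Made-up sample: 1,000 flows of 7.5 MB each toward one address within 60 seconds,
# plus one small flow toward another. Addresses are from documentation-only ranges.
flows = [FlowSummary(f"198.51.100.{i % 250}", "203.0.113.10", 7_500_000) for i in range(1000)]
flows.append(FlowSummary("198.51.100.1", "203.0.113.20", 1_000_000))

for dst, rate in destinations_over_threshold(flows, 60, 500_000_000).items():
    print(f"ALERT: {dst} is receiving {rate / 1e9:.1f} Gbps; consider activating protection")
```

A fixed threshold is the crudest rule available; it is used here only because it keeps the example short.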
Erich, how does flow-based monitoring actually work? You helped build it.
Yeah, I'd love to talk about it a little bit. We have a slide here that describes the architecture, and I'd like to show that, please. Great, thank you. So this is a picture of how flow-based monitoring works.
First of all, I want to point out a few things to get everybody oriented to this picture. At the bottom left there's the customer router. That's our customer's router that talks to the Internet; we call it an edge router. That edge router takes all the traffic that comes off the Internet and sends it to their data center, or the corporate office, or wherever that happens to live. That router sees all the traffic, so it can tell us about it, and it sends our Cloudflare edge something called NetFlow.
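For a concrete picture of what an edge router exports, here is a minimal sketch that decodes a single flow record, assuming the fixed 48-byte NetFlow version 5 record layout. The talk does not say which NetFlow versions the feature accepts, so the choice of v5, the function name `parse_v5_record`, and the sample values are assumptions for illustration only.

```python
import ipaddress
import struct

# One NetFlow v5 flow record is 48 bytes, in network byte order.
# (Assumption: v5 is used here only because its fixed layout is easy to show.)
V5_RECORD = struct.Struct("!IIIHHIIIIHHBBBBHHBBH")


def parse_v5_record(data: bytes) -> dict:
    """Decode one 48-byte NetFlow v5 record into the fields most useful for monitoring."""
    (src, dst, _nexthop, _in_if, _out_if, packets, octets, first_ms, last_ms,
     src_port, dst_port, _pad1, tcp_flags, protocol, _tos,
     _src_as, _dst_as, _src_mask, _dst_mask, _pad2) = V5_RECORD.unpack(data)
    return {
        "src_ip": str(ipaddress.IPv4Address(src)),
        "dst_ip": str(ipaddress.IPv4Address(dst)),
        "src_port": src_port,
        "dst_port": dst_port,
        "protocol": protocol,      # 6 = TCP, 17 = UDP
        "tcp_flags": tcp_flags,
        "packets": packets,
        "bytes": octets,
        "duration_ms": last_ms - first_ms,
    }


# Build a made-up record so the sketch runs without a real router.
sample = V5_RECORD.pack(
    int(ipaddress.IPv4Address("198.51.100.7")),   # source address
    int(ipaddress.IPv4Address("203.0.113.10")),   # destination address
    0, 1, 2,                                      # next hop, input/output interface
    1200, 1_500_000,                              # packets, bytes
    10_000, 70_000,                               # first/last seen (router uptime, ms)
    53124, 443,                                   # source/destination port
    0, 0x02, 6, 0,                                # pad, TCP flags (SYN), protocol (TCP), ToS
    0, 0, 24, 24, 0,                              # AS numbers, prefix masks, pad
)
print(parse_v5_record(sample))
```

Note that the record describes a conversation (who, to whom, how much, for how long) rather than carrying any of the packets themselves.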
NetFlow records are essentially metadata about what's going on in your network. They tell us where the traffic is coming from, how much of it there is, and some other statistics like that. They go to a software server on our Cloudflare edge called Flow Pump, which receives them at any one of our edge network computers and sends them back to our processing core, where they get analyzed. If there's a problem, the analyzer talks to a notification service that tells the customer and our security operations center, hey, we've detected an attack. We can send that with emails or webhooks, or we can even set up a PagerDuty notification for you if you'd like.
There are some other parts to it, the notification service itself and some configuration through an API so customers can adjust what they'd like to analyze, but the interesting part is the analyzer itself. It takes all of this data from our customer and says, we're seeing traffic in your network, but how do we know if it's good or not? We have a bunch of different rules that we can analyze with. Some of them are very simple, like how much traffic is currently happening. Some are a little more complex: are these flows for a server or flows for a desktop? Those should have different traffic profiles. We can also look at time of day, compare against normal traffic, and determine whether this is elevated or within bounds.
We can do quite a bit with the different analysis rules that allow us to determine if there's an attack. Basically, we look at what's happening and determine, hey, is there an attack? But it's super configurable for the customer, which is the exciting part to me: every customer has a different set of networking needs, and we can customize for them.
Awesome, thanks for the overview. This is super cool, and I love that we were able to show the architecture and walk through, piece by piece, how this works. But I have a question. A lot of this sounds very familiar, right? Cloudflare's been doing DDoS detection for a long time; we can tell you when you're under attack, we can send you an alert. What pieces of this were actually new, or had to be built from scratch, for flow-based monitoring, and why?
Yeah, great question. To start off, this goes back to the always on versus on demand distinction. When you're always on, your traffic is always flowing through Cloudflare, which means our systems can see that traffic and make decisions based on what we see. But an on-demand customer's traffic isn't usually flowing through Cloudflare, so our systems have no access to it. So the entire pipeline coming in is new, the parts I just described.
It's also using a different technology than we use internally. Internally we focus heavily on something called sFlow, which is similar to NetFlow but contains a little more data. It also includes sample packets that we've seen, so with the sFlow data we can actually take a deeper look to determine what the attack is and how it's working.
The problem with using sFlow generally is that it's a fairly new technology, and, this surprised me to learn, a lot of corporations have 10 or 15 year life cycles on their networking gear. So a lot of corporate networks out there have gear that doesn't even support sFlow or related technologies that do packet sampling. We had to build an entirely new analysis pipeline that's focused on NetFlow.
There are some trade-offs with using NetFlow versus sFlow. I mentioned you get a deeper look with sFlow, so with our NetFlow analyzers we have a slightly less in-depth look at what's going on, and we can't do some types of detection. But it gives our customers a way to get Cloudflare-level DDoS protection, or at least a big chunk of it, in their own hands without having to send all of their traffic through Cloudflare, if that's something they don't want to do.
The analyzers that we're building here, though, we're hoping to use to make a seamless transition into our internal DDoS protection. So we can say: we've detected an attack with flow-based monitoring and automatically turned on Magic Transit for you, so you're advertising, and your traffic flows through Magic Transit. Then we can hand it off to our DDoS team to say, hey, we know this is an attack, could you use our internal tools to figure out more about it and then automatically block it, et cetera. So it's not different so much as it's a new on-ramp into our DDoS protection.
Got it, that makes sense. So different types of flow data need different kinds of systems to analyze them, because the actual data in those packets that we're getting from the customer routers looks different, even though it's the same kind of traffic going to our customers' routers as it is when it arrives at our edge.
That's exactly it. So that's a low-level technical description of how FBM works and what its primary use case is, which is to stop DDoS attacks and monitor customer traffic. But that's not really a practical example, right? What do customers actually want this for and use it for in their day-to-day operations?
Sure. The main use case is really the one we talked about with always on versus on demand: just tell me when I'm under attack, so I know I can activate Magic Transit. That one's the most obvious. But there are also lots of other ways customers can use flow-based monitoring to understand more about what's going on in their network, or to help them through the process of onboarding with Magic Transit and starting to use it in the first place.
One example of this: customers often come to us when they're actually under active DDoS attack, or under the threat of one. Maybe they got a letter from a ransom attacker that said, hey, if you don't send us a bunch of Bitcoin in the next 24 hours, we're going to attack you. That sometimes puts network operations teams into scramble mode if they don't have DDoS mitigation ready, because they need to get their whole network protected in a short period of time. The onboarding process is somewhat involved, because we're basically saying we're going to put Cloudflare's whole network in front of your network, we're going to advertise your IP addresses out to the Internet, and we're going to ingest all the traffic that was destined for you. There's some mechanism involved in that, and it's a little tricky. So while customers are doing this setup and getting onboarded with Magic Transit under pressure, they can also, in parallel, start sending us their NetFlow data. That way we can start getting a sense of the volume of attacks that are hitting them and how we're going to be able to stop them once the customer is onboarded. So that's one example.
Sorry, I want to jump in here. I've been part of some of these onboardings as an engineer, trying to make sure everything's working right, and it seems like it would have been really nice to have this for some of those, especially the early ones. Customers are coming on board, we're making sure everything's set up right, and we're trying to figure out what the heck is going on so we can stop it for them, but until we see the traffic, we don't know. It probably would have made it a lot faster.
Yeah, that's a good point, exactly. It's going to be super valuable for every under-attack scenario that we're in going forward, because we can start getting that intel at the same time. When you're an organization that's being targeted by these kinds of attacks, and maybe feeling the impact from them, your websites are down and your users can't get to the special or high-importance critical internal apps that they need access to, every minute matters, right? So the more we're able to do while we onboard you with Magic Transit, the better your experience will be.
Another type of use case for flow-based monitoring is that we can get a baseline of your traffic, which allows us to tune more things about how the DDoS mitigation system works. The majority of our DDoS mitigations are automatic at the edge, so once your traffic's flowing through us, we're able to detect and mitigate a vast library of different types of attacks with multiple layers of mitigation systems. But some customers also want to really, really lock things down. Maybe they have really limited ingress bandwidth to their infrastructure, and so they say, I can't have anything leak at all, I need to have it locked down. Getting a sense of what a customer's traffic looks like, both at baseline and when they're under attack, can help us make suggestions that let customers implement additional rules on top of the automatic mitigations, to really lock their infrastructure down and make sure nothing leaks through to them.
Sure. I can also see a good use case there. In a previous role, I worked with some networking companies whose customers were the result of a lot of corporate mergers, and sometimes, after a few mergers, the network engineers don't even know what's going on in their own network. So another use case I can imagine is just, hey, what does my network look like? Because we've been through this crazy merging process and we want to set up Magic WAN to simplify everything, but we don't even know what rules we need. I can see it being really useful for that too.
Totally, totally. Sometimes we're working with customers hand in hand, especially in these under-attack, high-pressure situations, to go through that process and help them figure out, here's how I actually need to set this up for my network right now. And if it's a big organization, that can be teams from all over the globe, following the sun, trying to get this configuration set up in a short period of time. So having the ability to get that flow data ahead of time, or in parallel, get the baseline, and make sure that we're going to be all set once we turn it on is super useful.
Yeah. Another thing that we didn't really talk too deeply about, but that I want to highlight a little because it's pretty neat, is automatic advertisement of your prefix when we detect an attack. We keep saying, and then you can turn on Magic Transit, but we also have a feature for automatic enablement. If we detect an attack, optionally we can set it up so it doesn't just notify the customer that, hey, you're under attack, but it also says, hey, there's an attack going on and they're signed up for Magic Transit, they probably want to be protected, let's just turn it on for them. That's going to be really cool, because how long a customer's website is down is critical. There are some sites that cost thousands of dollars a second, or more, to be down. So if we can say, we've detected the attack and we'll turn it on for you, and it just takes a few minutes, that saves the customer a lot of money and a lot of headache. The alternative is notifying the customer, hoping they see the email, or calling the customer if they have that set up with us, getting somebody on the phone, having them remember how to turn on Magic Transit, and then having them go turn it on. That can take twice as long, three times as long, and cost the company a lot of money. So that's something I just wanted to highlight: it's a really cool feature that allows us, if you want us to, to make decisions about what's an attack, turn it on, and get that mitigated as soon as we can.
Totally. And that's especially important for customers, like I mentioned at the top of our segment, that are using Magic Transit as kind of an insurance policy, right? They don't ever expect to get attacked. Especially if they have really small network engineering or security teams, an attack might hit them at three in the morning, and it might take a little time for the poor person who gets paged to wake up and be like, oh my gosh, then figure out where their runbook is, log into the Cloudflare dashboard, and push the button. Even though the button-pushing process itself is really fast and easy, doing that under pressure, especially if you got woken up in the middle of the night, is not ideal. So we want to give customers the option to just have that happen seamlessly and automatically; they don't even have to think about it. They still get the notification, so they know they're under attack and we've got them covered, but they can sleep better at night knowing that they're covered in case that does happen.
And hopefully we've done a good enough job with it that they don't even get that page at three in the morning. Instead, when they wake up the next morning, they say, oh look, there was an attack that Cloudflare handled for us.
We're approaching the end of our time, but we've only been talking about what exists, and Cloudflare is a company that's pretty big on pushing forward and making new things. So what do we have in the pipeline for this?
Yeah, there's a lot that we want to build on top of what we have today for flow-based monitoring. We've heard a lot of great feedback from customers so far, and we want to continue to invest in this product, both for Magic Transit customers and maybe for other kinds of customers as well.
One thing is support for different types of flow data. Erich, you mentioned that today we support NetFlow. Some customers do have equipment that supports sFlow or other types of flow data, and we want to give them the flexibility to send that to us as well. So that's one.
Another is more detailed analytics, reports, and alerting based on what we're seeing in customers' flow data. Today we have a beautiful network analytics dashboard that shows you the breakdown of all the traffic coming into Cloudflare destined for your network while Magic Transit is activated. It has statistics on the attacks that we're seeing and what Cloudflare is doing about them, and you can export it as a report or get periodic ones. We want to have that available for customers that are using us on demand with flow-based monitoring as well. So even if you're just sending us NetFlow and you haven't needed to turn Magic Transit on and you haven't been under attack, you'll still get a report that says, here's what your traffic looked like for the past month, and maybe some additional insights on top of that.
And then the last thing, the one that I'm personally most excited about, is that we want to offer a version of flow-based monitoring to our free and pay-as-you-go customers. Say you're a hobbyist, or a network engineer interested in side projects, or you have a smaller company that doesn't own its own IP space, or you're maybe not a great fit for Magic Transit yet but you still want to test out Cloudflare's networking products. You'll be able to configure flow-based monitoring, send us flows from your home router, your cloud router, your office router, whatever, and then see the kind of analytics and reports I was talking about that give you insight into what's going on. From there, maybe you could also learn about other things we might be able to help you with: if your network is under attack a lot, maybe Magic Transit is a good fit, or maybe you could block some malicious-looking traffic with things like our web application firewall. So we want to give this same functionality to customers and allow them to do this self-serve, on their own. I'm super jazzed about being able to send flows to Cloudflare from my home Ubiquiti router and go into the dashboard and see analytics on that as well.
Awesome, that sounds great. Now I want to send flows from my router too, to my own product. That'll be great.
Erich, what are you most excited about developing from that list? I obviously have some product and customer-facing things that are close to my heart, but what's the coolest part about this technically for you?
Technically, the coolest part for me is probably going to be some combination of analytics and self-serve. Mostly that's my own interests getting scratched, because I would love to have pretty graphs for my home network. But the different types of data are also really cool. NetFlow is an old technology; it's not interesting anymore, so to speak, because everyone knows what to do with it. The new types of flow data will allow us to discover and unlock new types of attack detection, and maybe even new types of mitigation. So that's going to be some pretty cool tech when we get into it, and it will be a pretty fun month when we start building.
Awesome, I'm super excited about all this stuff coming up. Okay, well, I think that's pretty much all that we had to share. Maybe you listened to this segment and got something out of it: maybe you're a Magic Transit customer today and you're looking at using flow-based monitoring, maybe you're looking at Magic Transit and considering what an on-demand setup would look like, or maybe you're in that last category of users that we talked about wanting to support, and you're interested in self-serve flow-based monitoring and how Cloudflare could provide it for you. We'd love to hear from you. My inbox is open, I'm annika at Cloudflare, so reach out to us if you have any questions about how any of this works. We're super excited to ship more in this area and just keep rolling. Thanks so much, Erich, for your time.
Yeah, you're welcome. Thank you for getting this going. Also, I didn't hear you say it, so I just want to throw out that the Cloudflare technical blog has a lot of the nitty-gritty details about how this works, written up not by me but by some of my team members.
Yeah, absolutely. I totally forgot about the blog, I don't know how I could. If you search flow-based monitoring on the Cloudflare blog, there's a good overview of it, as well as lots of information about our automatic DDoS mitigation systems and what we're seeing in terms of DDoS trends overall in the marketplace. My colleague Omar, who's the product manager for our DDoS team, publishes a report every quarter that breaks down what we're seeing on our network across our whole customer base, and what you should be thinking about or prepared for as an organization that might be facing DDoS attacks.
Awesome, great. Thanks so much for your attention and for watching. Thanks, Erich, for your time. Have a great rest of your day or night, everybody, and we'll see you next time.