Magic Transit
Presented by: Usman Muzaffar, Nick Wondra, Rustam Lalkaka, Annika Garbers
Originally aired on October 20, 2021 @ 8:00 AM - 8:30 AM EDT
An introduction to Magic Transit, one of Cloudflare's most exciting new products, from some of the experts who are building it.
Magic Transit delivers network functions at Cloudflare scale — including DDoS protection, traffic acceleration, and much more from every Cloudflare data center — for on-premise, cloud-hosted, and hybrid networks.
Transcript (Beta)
Welcome to Cloudflare TV. I'm really privileged today to be joined by some truly stellar members of Cloudflare's engineering and product teams.
And today we're going to talk about Magic Transit, which is a new product that we built and shipped last summer and that we're continuing to develop and innovate on.
And joining me today are Annika Garbers, who is a product manager; Rustam Lalkaka, who is a director of product; and Nick Wondra, an engineering manager who helped build this.
And so what I'd like to do today is just spend a little bit of time talking to these individuals who are fundamental to building Magic Transit and figuring out what it should do and how it should do it and how it works on the inside and what our ultimate vision is.
And so, Annika, I thought I would start with you.
What problem does Magic Transit actually solve? What is this thing?
What's the itch it scratches? Yeah, so anyone that operates a network, meaning they're saying, I have these IP addresses, you can send traffic to me, is often under DDoS attack, and also gets a lot of traffic that they are just not interested in receiving.
So illegitimate traffic, attack traffic, malicious traffic.
And the way that network engineers traditionally solve problems like this is they buy hardware boxes to do DDoS mitigation and network firewall and those other kinds of network functions, and then they install them in their data centers.
And this is a not optimal solution because those boxes are expensive to buy in the first place and then also to maintain and then you have to replace them over time.
And so Magic Transit solves this problem by removing the need for those boxes.
And we place our network, Cloudflare's network, in front of your network and do the DDoS mitigation as a service.
Yeah, and when we talk about DDoS, this is distributed denial of service attacks, where you have an address on the Internet, you have some numbers, and now all this stuff is showing up at your front door that you're trying to get rid of.
So how does one turn this on? Is there a switch?
Where do I start? This seems so low level in the stack.
Where do I actually flip the switch? This feels like I've given my home address to some other company to tell the world, okay, now this is where my mail should get delivered.
But where did I do that? There's no button in my ISP that says give my address to Cloudflare.
How does this work?
How do you turn this on? Yeah, it's a little bit more complicated than onboarding with some of Cloudflare's Layer 7 products, because as you said, it is so low level in the stack.
So the first step is to get letters of authorization, which are documents, actual letters, that say it's cool for Cloudflare to advertise your addresses on your behalf.
So we have to make sure we're allowed to do that. And then you tell us about your data centers, the routers that you have, and the service providers that you're paying for transit of packets into and out of your data center, so that we can configure GRE tunnels, which are the link between Cloudflare and your data center.
And then after we do that, and we check and make sure that the tunnels are working, we make some network changes on our side.
And then we're able to start advertising your IP prefixes to the Internet.
And so that's the go live moment.
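The ordering Annika describes (authorization first, then working tunnels, then advertising the prefix) can be sketched as a simple readiness gate. This is only an illustration: the `Onboarding` class, its field names, the tunnel names, and the example prefix (taken from a documentation address range) are all hypothetical and are not Cloudflare's actual onboarding tooling.

```python
from dataclasses import dataclass, field
from ipaddress import IPv4Network, ip_network


@dataclass
class Onboarding:
    """Hypothetical record of one customer's onboarding state."""
    prefix: IPv4Network
    loa_received: bool = False                          # letter of authorization on file
    tunnel_healthy: dict = field(default_factory=dict)  # tunnel name -> passed checks?

    def ready_to_advertise(self) -> bool:
        # Go live only when we are authorized AND every configured tunnel works.
        return (
            self.loa_received
            and len(self.tunnel_healthy) > 0
            and all(self.tunnel_healthy.values())
        )


customer = Onboarding(prefix=ip_network("203.0.113.0/24"))
print(customer.ready_to_advertise())   # no LOA and no tunnels yet

customer.loa_received = True
customer.tunnel_healthy = {"tunnel-a": True, "tunnel-b": False}
print(customer.ready_to_advertise())   # one tunnel is still failing its checks

customer.tunnel_healthy["tunnel-b"] = True
print(customer.ready_to_advertise())   # all preconditions met
```

The point of the gate is that advertising the prefix is the last step: once the Internet starts sending the customer's traffic to Cloudflare, there has to be a verified path to hand it back.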
So it's not, it's definitely not a light switch. There's a fair amount involved here.
But I think that you said part of the magic word there, which is that I own IP addresses to begin with.
I mean, I don't own any IP addresses. I've used the Internet for the last two decades.
I don't think I own any IP addresses.
What does this even mean? Like, why would a company own their own IP addresses?
And if they went through all the trouble of owning them, why are they giving them to us?
Yeah, that's a good question. So there's a couple of reasons that you would want to bring your IP to Cloudflare.
Magic Transit is one of the biggest ones.
But you can think of it as kind of like a phone number. So if you have your phone number, and you've been saying for forever before Magic Transit, like call me at this phone number, and you'll be able to reach me.
When you put Magic Transit in front of your phone number, what you're essentially saying is like all of the calls that would go to you, go to Cloudflare first.
And then we're able to drop spam calls, we're able to say, oh, you want to block this person.
So we're not going to send that call to you.
And then we'll send the calls to you. But the reason that we want to basically own your phone number and have the calls come to us first is because if we didn't do that, if maybe we said, we're going to give you some of Cloudflare's IPs and then put you on Magic Transit that way and forward the traffic, then the people that had your phone number before the Magic Transit world would still probably try to call you at that old phone number.
So this is sort of like handing over your phone number or giving us permission to use your phone number instead of maybe like a Google Voice situation, which just forwards calls through a different number.
Yeah, that makes total sense. So it's about this lower level, like way below, you said layer seven.
And so just to remind our audience, we're talking about the different layers of abstraction of a service on the Internet.
It goes all the way down to layer one, which is literally the physical wire up to the application that we think of.
And normally, Cloudflare operates in the higher levels.
And some of the things you said, like, okay, we will look for bad traffic, we'll look for DDoS, we'll look for attack traffic, and we can discard it, we can make sure that anything that gets to the customer is actually clean and it'll go through this tunnel.
And in a second, I'll ask Nick to explain what kind of tunnel this is.
This is not a tunnel that you've ever seen in the real world.
But that sounds a lot like some of Cloudflare's other products, right? I mean, after all, we've been building proxies since the inception of our company, long before any of the four of us were hired.
And so what's different?
What is it about this that's different? Or is it doing the same thing, but at a different level?
Talk to me a little bit about how Magic Transit is different from our firewall or our WAF.
Yeah, so you mentioned the layers of the OSI model.
And I think that's a good way to frame this question. So most of Cloudflare services live at layer seven, the application layer.
So you're thinking about web stuff, web requests, HTTP requests here.
And under the covers, we are doing the layer three protection for those kinds of requests too.
But there's other things besides just websites that run on the Internet.
So at layer four, we have Spectrum, which allows you to proxy any TCP and UDP traffic.
So if you have like a Minecraft server and you want to protect it with Cloudflare, you can do that.
Other than a website. Other than a website. Yeah, exactly. And then Magic Transit is going even one layer down to layer three, where it's any IP packets.
So any packets of information that are being sent over the Internet can be protected from DDoS attacks.
That's awesome. So how is this different? You mentioned that other customers, if they have to do this, they've either got to buy boxes, or they've got to sign up with some other kind of service like this. How is Cloudflare's Magic Transit different from some of the other solutions in the market?
Yeah, there are basically two buckets that I would put other providers' solutions for this type of problem in.
One is the boxes, which we sort of already talked about, those are expensive, you have to maintain them, you have to replace them, you probably have a human being that's expensive that's doing all of that work for you.
And then the other bucket is other sort of SaaS models. And the difference between Cloudflare and a lot of our competitors is that a lot of the way that these other products work is they'll have a data center or a couple of data centers that are dedicated to scrubbing.
So the attack mitigation actually happens at those couple of data centers, which means that if I'm in LA trying to get to your website in San Francisco, that traffic might have to go all the way to a scrubbing center in New York or somewhere far away.
That sounds slow. It is slow, yeah.
So the great thing about Magic Transit is that it runs in every single Cloudflare data center.
And so whatever Cloudflare data center you're hitting with your traffic as an eyeball trying to get to the Magic Transit customer's data center eventually, the processing happens right there.
So we don't have to send the traffic around to different places to get it processed.
It just happens as the traffic is coming into Cloudflare's network, we make sure it's good, we make sure it's clean, and then we send it on to the origin where it's actually destined to go.
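To put rough numbers on the Los Angeles to San Francisco example, here is a back-of-the-envelope sketch comparing the direct path with a detour through a scrubbing center in New York. The city coordinates are approximate, and the figure of about 200 km per millisecond for light in fiber is an assumed rule of thumb; real routes are longer than great-circle distance, so these are lower bounds rather than measurements.

```python
from math import asin, cos, radians, sin, sqrt

# Approximate city coordinates (latitude, longitude) in degrees.
CITIES = {
    "Los Angeles": (34.05, -118.24),
    "San Francisco": (37.77, -122.42),
    "New York": (40.71, -74.01),
}

EARTH_RADIUS_KM = 6371.0
FIBER_KM_PER_MS = 200.0  # assumed: light in fiber covers roughly 200 km per millisecond


def great_circle_km(a: str, b: str) -> float:
    """Haversine distance between two named cities."""
    lat1, lon1 = map(radians, CITIES[a])
    lat2, lon2 = map(radians, CITIES[b])
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def one_way_ms(path: list) -> float:
    """Lower-bound propagation delay along a sequence of cities."""
    km = sum(great_circle_km(a, b) for a, b in zip(path, path[1:]))
    return km / FIBER_KM_PER_MS


direct = one_way_ms(["Los Angeles", "San Francisco"])
detour = one_way_ms(["Los Angeles", "New York", "San Francisco"])
print(f"direct: {direct:.1f} ms, via New York scrubbing center: {detour:.1f} ms")
```

Even before any processing time, the cross-country detour costs an order of magnitude more propagation delay than the direct path, which is the slowness being described.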
Yeah, exactly. Awesome. So are we done? That sounds great. If it's already working, can we wrap it up?
Are you just going around on Cloudflare TV episodes telling everyone how awesome this product is that you helped build?
Or do we have anything else left to do?
What's the status of this? Yeah, we have a lot left to do. Or I really hope that we have a lot left to do because otherwise it wouldn't make very much sense for me to have a job anymore.
Magic Transit's still new and we have a lot more stuff that we want to build.
Some of the stuff that I'm really excited about is more APIs and accessibility for customers to be able to understand what's going on with their network traffic.
We're going to introduce PNI, which means a physical network connection to Magic Transit so it can be even more secure.
And then we're also...
Wired in directly to the Cloudflare Edge. Yeah, which is very exciting.
And then we're also introducing flow-based monitoring, which is going to allow customers that are using Magic Transit on demand and saying, okay, only turn Magic Transit on if you think I'm under attack to be able to see and understand their network patterns and their attacks as well.
So those are just a couple of things in the pipeline, but there's so many more that I'm really excited about too.
That's really cool. Thanks, Annika. Nick, I want to swivel my chair to you, as the engineering leader who's responsible for this really interesting product.
So we already had a proxy, right? I mean, you and your team are brilliant, Nick, but there was something here already, right?
We did have something.
We didn't invent the transistor on this. So we started from something. We had a whole bunch of stuff that was already there.
We're in the business of proxying traffic.
We have PoPs all around the world. What wasn't there? What is the key piece of the tech that just didn't exist that we had to dream up, design, spec, implement, and put into production?
Yeah, absolutely. I mean, I think one of the things that's really been exciting about working on Magic Transit and helping build it is seeing how much actually we could leverage of what already existed.
On one hand, you were talking earlier about how we typically operate at layer seven and layer four.
We sit as a TCP proxy in, sort of, the networking stack.
And when you get down to handling just bare IP packets, there's a lot that's different, but there's actually a lot that's also the same.
So some of the problems we actually didn't have to solve were the network itself.
We were able to deploy and leverage Cloudflare's network presence in several hundred cities around the world.
The servers, the infrastructure to run our software and to do our processing already existed, as well as the automation to manage it and deploy new software, the configuration pipelines to deploy new configuration.
And most importantly here is our DDoS mitigation stack.
We've been developing that for layer seven and layer four products over the past decade plus.
And we were just pleasantly surprised to see how easy it was to apply that to this layer three data flow.
So back to your question, what didn't exist? What did we have to build? You just listed a lot of things that did exist.
That sounds great. So the first message for our audience is don't try this at home because you need to have a massive network with an existing DDoS solution that's already amazing.
But let's say you have all that like you did, what did you still have to build?
Yeah. So all of that stuff that we use really dealt with, how do we get customers' traffic into Cloudflare's network?
And then what do we do with it once we have it? Let's get an IP packet destined for one of our customers' servers and take a look at it, identify whether it's good traffic or bad traffic, and drop it if it's bad.
But then we have to figure out how do we actually get this back to the customer?
Back to the analogy of Cloudflare being a proxy for your phone number.
If someone is phoning up our customer and they dial their phone number and send their IP packet, we're going to receive that.
And then let's say that that passes through our DDoS mitigation stack and we decide this should go to the customer.
We've got to figure out then how do we actually get that to the customer?
Because we can't just pick up the phone and dial that same phone number because guess what?
Our phone is going to ring, right?
We're the ones who are receiving calls for that. So this is where we start talking about GRE tunnels, and we can dig a little bit into what that actually means, but we have to figure out how do we take this phone call and effectively push it through another phone line to get all the way to the customer?
How do we even know what the real number is?
Yeah. So there's a lot of coordination and configuration that we need to do with the customer.
They need to tell us, oh, here's another phone number that you can call us to send our real phone calls through.
And so we establish what's called a GRE tunnel.
This is kind of interesting: we say the word tunnel and I think that evokes a certain kind of idea.
It's a metaphor that's very useless for this. Right. Yeah. So a tunnel in the real world is like there's a hole and there's another hole somewhere else and you just pass through it, right?
And it's a physical metaphor. And so that metaphor kind of works here, but really what we're saying is, you know, if we've got this IP packet that has a phone number, has a destination IP address that actually we are advertising out to the Internet, we can't just send that packet out.
What we need to do is we need to package that IP packet up in another IP packet with a different destination address.
And so it's kind of like, well, phone is maybe not the best analogy here, so we're switching analogies.
It's not going to cut it anymore. Let's go with packages.
So I get a package. Sometimes I get a package from Amazon and it says Usman Muzaffar with my address underneath it.
What did you just do to that package?
Yeah. So what we do is, we now get that package for you.
We get that and it says, hey, destination is Usman. And so we take that and we put that in another, slightly bigger package and we ship that out.
And it's got some secret address that you and I know, no one else knows.
Right. And so we send that out into the Internet and the Internet delivers that bigger package to your front door. You can open it up, take a look inside and say, aha, this is a package that was...
And it's as if I got the package all along. I still got the package that said to Usman Muzaffar at my address, and nobody knows that actually it went through you.
Absolutely.
The thing I got at the end literally looks like something that came from the e-commerce site I bought it from.
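The "package inside a bigger package" idea maps fairly directly onto bytes. Below is a minimal sketch, assuming plain IPv4 with no options and the basic 4-byte GRE header (no checksum, key, or sequence fields). The addresses come from documentation ranges and the function names are made up for illustration; this is not Cloudflare's implementation.

```python
import socket
import struct

GRE_PROTO = 47            # IP protocol number for GRE
ETHERTYPE_IPV4 = 0x0800   # GRE "protocol type" value for an IPv4 payload


def checksum(data: bytes) -> int:
    """Standard Internet checksum (one's complement sum of 16-bit words)."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def ipv4_header(src: str, dst: str, proto: int, payload_len: int) -> bytes:
    """Minimal 20-byte IPv4 header with no options."""
    fields = [0x45, 0, 20 + payload_len, 0, 0, 64, proto, 0,
              socket.inet_aton(src), socket.inet_aton(dst)]
    header = struct.pack("!BBHHHBBH4s4s", *fields)
    fields[7] = checksum(header)          # fill in the header checksum
    return struct.pack("!BBHHHBBH4s4s", *fields)


def gre_encapsulate(inner: bytes, tunnel_src: str, tunnel_dst: str) -> bytes:
    """Wrap an IPv4 packet, unchanged, in an outer IPv4 + GRE header."""
    gre = struct.pack("!HH", 0, ETHERTYPE_IPV4)   # no flags, version 0
    outer = ipv4_header(tunnel_src, tunnel_dst, GRE_PROTO, len(gre) + len(inner))
    return outer + gre + inner


def gre_decapsulate(outer: bytes) -> bytes:
    """Strip the outer IPv4 header (using its IHL) and the 4-byte GRE header."""
    ihl = (outer[0] & 0x0F) * 4
    return outer[ihl + 4:]


# The original package: addressed to the customer's advertised IP (header only, for brevity).
inner = ipv4_header("198.51.100.7", "203.0.113.10", 6, 0)
# The bigger package: addressed to the tunnel endpoint only the two sides know about.
outer = gre_encapsulate(inner, "192.0.2.1", "192.0.2.2")
print(gre_decapsulate(outer) == inner)   # True: the customer sees the original packet
```

The inner packet is carried byte for byte, which is why what arrives at the customer looks exactly like what the original sender put on the wire.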
Yeah, absolutely. So what we had to build here was really a fairly slim configuration plane that takes and understands that customer's configuration.
What is the address we put on that outer package?
And when we get an IP packet that's passed through all of our DDoS filtering and our firewall filtering and all of that, and we need to send it, how do we construct that bigger package and send it along?
And one of the things that's kind of interesting here too is that, let's say we send that package out over FedEx or something, and the FedEx truck near your house is broken.
It's busted, right? We can't actually deliver that package to you via FedEx.
You might tell us here's an address to ship it to me over FedEx. Here's another address to ship it to me over the postal service or something.
So we now have multiple different ways to deliver that package to you.
And we're constantly kind of monitoring the health of both of those different routes.
We're actually sending ping messages through the tunnel, kind of testing package delivery to you, figuring out which of those tunnels are healthy, and making a routing decision dependent on the health of that.
Actually, we're making that decision at every one of our network locations around the world.
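The health-based choice between delivery routes can be sketched in a few lines. This assumes each tunnel has a configured priority and that a single probe result decides health; both are simplifications for illustration, and the `Tunnel` class and its names are hypothetical rather than the production design.

```python
from dataclasses import dataclass


@dataclass
class Tunnel:
    name: str
    priority: int          # lower number = preferred route (an assumption of this sketch)
    healthy: bool = True   # result of the most recent probe


def record_probe(tunnel: Tunnel, reply_received: bool) -> None:
    """Update health from one probe; a real system would likely smooth over several."""
    tunnel.healthy = reply_received


def pick_tunnel(tunnels):
    """Return the most-preferred healthy tunnel, or None if all are down."""
    healthy = [t for t in tunnels if t.healthy]
    return min(healthy, key=lambda t: t.priority) if healthy else None


fedex = Tunnel("fedex", priority=1)
postal = Tunnel("postal-service", priority=2)
routes = [fedex, postal]

print(pick_tunnel(routes).name)           # fedex
record_probe(fedex, reply_received=False) # the truck near your house is broken
print(pick_tunnel(routes).name)           # postal-service
```

Because each network location runs its own probes and makes its own choice, two locations can legitimately pick different tunnels at the same moment.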
And then how does it come back? Does it come back through the tunnel the same way?
Yeah. So as of today, no. And this is another place where the tunnel analogy kind of breaks down, right?
So, as Annika talked about, the customer value right now, and what we're really trying to do, is filter out traffic that you don't want to see.
That really only applies in this use case in the forward direction of Internet users or the Internet at large sending traffic to your network.
Any packets that you send out, you implicitly trust, right? Like my server sends it out.
It doesn't need to go back through Cloudflare again. That's right.
Yeah. And so as of today, our customers are all configured to do what's called direct server return.
Their servers directly return that packet back to the Internet user that has sent it in the first place.
So we're not seeing both sides of the traffic as we would if we were a proxy.
In that case, the customer's infrastructure, their servers, would send packets back to Cloudflare, where we would then transform them into packets back out to the Internet user.
But in this case, we're only seeing traffic in one direction, not the other.
Yeah. So it's a very weird one-way tunnel.
And yet when you look at it from the outside, it doesn't look like there's a tunnel there at all.
And if this is making your brain hurt, I want to tell everyone that Nick was the one who first explained this to me as they were first making this.
And I had to ask him a million questions, slow it down, and draw me pictures with a lot of squiggly lines for me to understand exactly how this works, let alone how it was built.
Just understand conceptually what was going on.
I want to go back to something Annika said, which is now that we have the customers, we're telling the rest of the world, this is the phone number.
We're suddenly picking up a lot of stuff that used to all go to...
That's the whole reason we're there, is to be a filter, to be the surface area that collects all the unwanted stuff that comes our way.
And of course, Cloudflare has been in the business of protecting our higher level applications against malicious website traffic, or in the case of Spectrum, even malicious layer 4 traffic.
But now we're one step lower in the stack.
So we're seeing all kinds of new stuff. What kind of new stuff did we see?
And what did we have to do to now deal with it? Yeah. It's really interesting now.
I think one of the things that I learned a lot along the process of building this is developing a better model of how to think about our DDoS mitigations.
And so one insight there that's been really valuable to pick up is the reason that a customer might use Cloudflare for DDoS mitigation at layer 7 or layer 4. From a customer's perspective, they put Cloudflare in front and it blocks that traffic.
Same thing at layer 3. But what's actually happening and how we think about that is at layer 7 and layer 4, we're running our own web servers, our own TCP servers that sit as, again, as proxies.
We're not just passing through packets.
And so really the way to think about that is you put Cloudflare in front and then we build DDoS mitigations to protect Cloudflare.
And so that means we only have to understand in detail the types of attacks that would hit our proxies.
The defense is for the whole castle, right? Yeah, exactly. Yeah. And so when we start passing through packets directly to customers, there are actually two classes of things that we're seeing.
So one is it's not just TCP traffic anymore or UDP traffic.
It could be anything. Give me an example of something that's not one of those.
I'm a software developer. What are you even talking about?
Yeah. I mean, GRE is actually a good example. We might have customers who have their own sort of GRE infrastructure, where they're bridging across their own disparate networks and sending GRE over the Internet.
They put a tunnel in our tunnel.
Exactly. Yeah. It's kind of mind boggling, right? And so, yeah, there's these sort of GRE or other types of protocols that you'd run over a layer 3 network, ICMP, these sorts of things.
So yeah, it does open up kind of an attack surface for these other classes of traffic to pass through Cloudflare.
And so our team that builds and operates our DDoS mitigation pipeline has had to do some work to look and see, as we've started onboarding more Magic Transit customers, what different types of traffic are we seeing?
And then they've gone in and built filters and pattern matching engines on top of our existing DDoS mitigation platform to identify those types of attacks and then put in the rules to block them.
Are we able to block an attack for a specific, like down to specific customers and specific kinds of attacks?
Yeah, not only that, but we have the ability to do things like if we see some very sophisticated, unmitigated attack, we can actually take packet captures and look at the contents of that attack traffic and go down to the byte level and say, aha, we see a pattern.
Byte at offset 10 is always this specific value.
Let's put in a filter to say anything that matches that signature, we're going to drop it or we're going to rate limit it.
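A signature of the form "the byte at offset 10 is always this value" is easy to express as data plus a predicate. The sketch below shows only the matching logic in user-space Python; the `Signature` class and the 0xAB fingerprint are invented for illustration, and the filters described in the conversation run inside the kernel, not in Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    offset: int   # byte position within the packet
    value: int    # byte value expected at that position (0-255)


def matches(packet: bytes, sig: Signature) -> bool:
    """True when the packet is long enough and carries the expected byte."""
    return len(packet) > sig.offset and packet[sig.offset] == sig.value


def verdict(packet: bytes, signatures) -> str:
    """Drop a packet that matches any known attack signature."""
    return "drop" if any(matches(packet, s) for s in signatures) else "pass"


signatures = [Signature(offset=10, value=0xAB)]   # hypothetical attack fingerprint
attack = bytes(10) + b"\xab" + bytes(5)
benign = bytes(16)
print(verdict(attack, signatures))   # drop
print(verdict(benign, signatures))   # pass
```

Rate limiting instead of dropping would replace the returned verdict with a counter check, but the matching step stays the same.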
Let's keep that here for a second.
What level of the operating system stack is that filter running at?
Where are you actually inspecting and filtering packets? Yeah, so the inspecting and filtering happens at a couple of different places.
The places that are most pertinent to what we're talking about here, where we install a lot of these types of rules, are, first, in what's called XDP.
This is a mechanism in the Linux kernel that allows raw packet inspection and decision making.
So we're in the kernel.
We're in the kernel. We're running code in the kernel as well.
We've got code that's running, that's doing that packet inspection and then matching that against the filters and dropping it.
Then we also leverage iptables and nftables, which are other kernel mechanisms, which have a configuration schema above them.
So if we're able to identify signatures that we can specify in iptables or nftables configuration, we can insert those rules again to say, when we see a packet that matches the signature, drop it.
And you guys, our audience will all notice when I first asked Nick, what did you have to build?
He said, and a very lightweight control plane. The typical Cloudflare engineering answer: a lightweight control plane is also necessary.
And then you scratch it a little bit and there are literally rules being injected into the kernel to make decisions on a per-packet basis.
The first time I had that explained to me, I learned that there are programs running constantly, examining what the packets flowing through the system look like.
And then, just like a JavaScript virtual machine or the Java HotSpot compiler or whatever, we're watching for patterns.
We generate a program on the fly, compile it and inject it into the kernel all within a couple seconds.
And that program then starts dropping traffic, right? So like that's wild.
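The "watch for patterns, generate a program, start dropping" loop can be illustrated at toy scale. The sketch below looks for byte offsets whose value is nearly constant across sampled packets and builds a drop predicate from them. It is a stand-in only: the real system described here compiles a program and loads it into the kernel, whereas this returns a Python closure, and the function names and the 95% threshold are assumptions.

```python
from collections import Counter


def find_constant_bytes(samples, threshold=0.95):
    """Return {offset: value} for offsets where one value dominates the samples."""
    if not samples:
        return {}
    shortest = min(len(p) for p in samples)
    constants = {}
    for offset in range(shortest):
        value, count = Counter(p[offset] for p in samples).most_common(1)[0]
        if count / len(samples) >= threshold:
            constants[offset] = value
    return constants


def build_filter(constants):
    """Turn the learned pattern into a drop predicate (the generated 'program')."""
    def should_drop(packet: bytes) -> bool:
        return bool(constants) and all(
            len(packet) > off and packet[off] == val for off, val in constants.items()
        )
    return should_drop


# Synthetic attack samples: varying bytes everywhere except offset 10.
samples = [bytes([i] * 10 + [0xAB] + [(i * 7) % 256] * 5) for i in range(50)]
pattern = find_constant_bytes(samples)
print(pattern)                                  # {10: 171}
should_drop = build_filter(pattern)
print(should_drop(bytes(10) + b"\xab"))         # True
print(should_drop(bytes(11)))                   # False
```

A practical version would also compare against normal traffic, since legitimate packets share constant header bytes too; that baseline step is omitted here.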
Great stuff. Nick, thank you so much. Rustam, now that we're turning to you, you first approached me with a document that you had drafted and you said, what do you think of this?
And I remember going, where are we going with this?
And like, if we start building this, where does this take Cloudflare? So talk to us a little bit about why did we build this?
How does this fit into the larger story?
Cloudflare has always said, our mission is to build a better Internet. I get we're deep in the Internet now as part of this.
You know, we're literally taking over part of customers' footprints of the Internet.
We are the only trace left that they're on the Internet.
When Annika and the team finish building PNI, it's literally hardwired into the Cloudflare edge.
How does this fit? What's the vision here, Rustam?
Yeah, it's a great question. First off, I remember that meeting well. It was like 5 or 6 p.m.
People were heading home for the day and I was like, Usman, I need to talk to you.
And I pulled you into a dark conference room and I started scribbling on the whiteboard.
And you were like, what are you talking about?
Start over. Go from the beginning. What is this? And I remember even then, there was some big picture thing brewing and this was not just DDoS mitigation.
This was a lot more than that.
In terms of how we actually got here and how we built to that vision, which I'll talk about in a second, everything we do at Cloudflare is really rooted in talking to customers.
As a product management team, as an engineering team, as a sales team, and all the other teams at Cloudflare that interact with customers, we had had a ton of conversations with folks.
And consistently, we heard, we love Cloudflare.
We love the security tools you offer at Layer 7 and more recently Layer 4 with Spectrum.
I just want to be able to apply the same sorts of techniques you guys have at these higher levels to Layer 3, my data centers.
And when I first heard this, I was like, does anyone still run a data center?
Everything is in the cloud, right? And it turns out that's just not true.
That's really not true. And so, that planted the seed and we started doing more intense discovery.
I was working closely with other folks on the product team. Pat Donahue had a big role here.
And we just literally started writing a document that said, here's what customers are telling us.
Here's what our research on this problem space looks like.
And at first, if you scope the problem down, it was just build a cloud-based DDoS mitigation tool.
And I was like, that's easy. That's almost boring, right?
Should we do this? I don't know, maybe. But then as we looked at this more, we realized that DDoS mitigation was one of the first network functions to go from something delivered via a box in a rack in a data center, as Annika was talking about, to the cloud.
And that's for all the reasons Annika talked about around maintenance and refresh cycles and all that, but also because DDoS mitigation in the cloud actually buys you some advantages.
If you're mitigating DDoS attacks in your data center, you risk saturating the links into your data center before you can even scrub the traffic.
So, doing the scrubbing at arm's length from your physical presence reduces... I mean, I would imagine that means I can dramatically reduce my investment in my on-prem center because it's not responsible for all that anymore.
Yeah, yeah, 100%. So, that's one of the reasons why you see folks ripping out mitigation hardware and moving toward the software-based, cloud-based things.
But as we were sort of squinting at this traffic flow diagram, very similar to the one I drew for you that night, we realized that it's not just DDoS mitigation that we can deliver from our network, right?
It's all these network functions that people traditionally buy hardware for.
So, that includes a firewall, load balancers, WAN optimizers, things doing NAT, etc., right?
All those functions can be delivered from our edge if we meet the precondition that we are close enough to our customers, that our network is expansive enough that we can handle traffic without adding all these crazy hairpins and U-turns and things that Annika was talking about earlier, right?
So, we realized that we could apply the breadth of our network.
We could apply our software engineering expertise and solve a lot of problems for customers all by starting with DDoS mitigation via the Cloudflare network and Magic Transit.
And that was actually, I'm going to preempt a question you're going to ask, or I always get asked, like, why did we call it Magic Transit?
There you go. And I do want to throw in one quote because I think it's relevant.
Arthur C. Clarke has this line that says, any sufficiently advanced technology is indistinguishable from magic.
So, play on that, if you will. Yeah.
So, I mean, if you rewind 30, 40 years, the Internet was magic, right? And now the Internet is, you know, we're early days yet in the history of this thing, but it's become routine.
It's become expected that you operate things on the Internet. And what people have realized over time is that putting things on the Internet is actually a real pain in the butt, right?
You have to, if you're, say, you're running a data center and, you know, the cloud addresses some of these things, but not all of these things.
Say you're running a data center, you know, if you imagine this as sort of like Maslow's hierarchy of needs, right?
At the very base of that, you have, you need power, you need cheap power, if you're running a lot of servers and stuff there.
You need water for cooling. And then you need Internet connectivity, right?
Transit or similar. And then if you go up one level, if you were to plug in a data center with all those things, yes, you could literally power it on, but not much more than that, right?
Then you need a firewall, then you need a load balancer, then you need all these things to actually glue things together and get traffic flowing properly.
And actually installing, buying, maintaining, blah, blah, blah, blah, blah, all these things, and then plugging them all together is really painful, right?
And what Magic Transit does is it allows customers to sort of check a lot of those boxes all at the same time, and then pay Cloudflare instead of a giant team, potentially, to operate those things for them in a reliable, secure, and performant way.
And there it is. As we're close on time here, I think that actually answers the last question I had, which is how does this make the Internet better?
It lets the people who have original content focus on that content, not the plumbing and the infrastructure, and it gives Cloudflare the responsibility and the mandate to make sure that every packet gets to the right place as fast and as securely as possible.
And the better job we do there, the better Internet we'll have for everyone, and the cheaper it'll be for everyone.
Yep, absolutely. So I think that's all. I am so excited about this product, and I'm so proud to be working with all of you, and so I'm looking forward to talking to all of you guys about this more soon.
Thanks, everyone.
Thank you.