Latest from Product and Engineering
Presented by: Usman Muzaffar, Jen Taylor, Annika Garbers, Nick Wondra
Originally aired on October 16, 2021 @ 12:00 AM - 12:30 AM EDT
Join Cloudflare's Head of Product, Jen Taylor and Head of Engineering, Usman Muzaffar, for a quick recap of everything that shipped in the last week. Covers both new features and enhancements on Cloudflare products and the technology under the hood.
Transcript (Beta)
Hi, I'm Jen Taylor. Hi, and I'm Usman Muzaffar, Head of Engineering at Cloudflare. Jen, it's great to see you.
Happy New Year. Happy New Year, Usman. And happy Latest from Product and Engineering.
I have to say I have missed this over the course of the past few weeks, lots of quality time with my family, but not enough quality time with our products, with our product and our engineers.
And the cool thing about this is that when we first started this, like six or seven months ago, we wanted to bring our Magic Transit product leaders on.
And I'm thrilled to welcome Annika and Nick back to the show to talk about everything that we've done.
So Annika, please take a second to introduce yourself, followed by Nick.
Yeah, I'm Annika, I'm the Product Manager for Magic Transit. Super happy to be back.
Yeah, I'm Nick. I'm the Engineering Manager for Magic Transit. So remind us again, Magic Transit is an interesting name, but what is Magic Transit?
Yeah, Magic Transit is Cloudflare's product to protect customer networks.
So a lot of the time when we talk about Cloudflare products, we're talking about websites, HTTP requests.
Magic Transit is actually a couple layers lower in the OSI stack.
And we actually protect customers' entire networks, any IP traffic on the Internet from DDoS attacks and other kinds of bad traffic that they don't want to receive back to their network.
So we take Cloudflare's entire network, and we put it in front of our customers to protect them from bad stuff out there.
Got it. That's cool. And yeah, I remember when we talked about this before, we talked about the fact that Cloudflare started with solutions for the application layer, and that was the classic DDoS offerings and our firewall offerings.
And then we brought that down to non-HTTP-based protocols and really did that with what we call Spectrum for UDP and TCP-based protocols.
And now we've gone one level deeper into Magic Transit.
Yeah, exactly. That is where we're at. So what's going on?
What's new in Magic Transit world? Specifically, what are you guys seeing from customers and from the market?
Yeah, the last year has been kind of a wild ride.
Magic Transit launched in August 2019, and we've seen a ton of growth since then.
But obviously, in the middle of all of that stuff, the pandemic started happening, and the way that people thought about the Internet and their reliance on the Internet totally changed.
And not only companies realized this very suddenly, but also attackers.
And so people out there on the Internet that are trying to attack companies' networks and websites suddenly started realizing, hey, the Internet is more important than ever for us to be able to do our jobs on a day-to-day basis.
And so the stakes are higher for me if I'm trying to attack these companies.
And so organizations have seen more DDoS attacks, new kinds of DDoS attacks, and also new ways that the attackers are actually communicating back with the companies that they're attacking.
We've seen a huge surge in what's called ransom attacks.
So attackers will reach out to companies and say, hey, I'm going to do a test attack so you can see that I'm not kidding around.
And if you don't pay me a bunch of Bitcoin or something, I'll come back and hit you and take your company down.
Not only sort of things that face your users, like your websites, but also the connectivity to the services that your employees need to be able to do their job on a day-to-day basis.
So the stakes here are really high, and customers really need solutions like Cloudflare and Magic Transit that can help protect them.
That's amazing.
And the problem just keeps getting harder and scarier for our customers, too, because it's a really...
I mean, we should also say, we've done seminars on this as well.
This is something... It's illegal. You need to report this.
But having said that, you still want to protect yourself. And so, Nick, we said it goes lower in the stack, but in as few sentences as possible.
How do we do this?
How is it that we are able to accept traffic for what are, up until you turn on Magic Transit, a customer's own internal IP ranges?
It's like their footprint on the Internet.
And now at this low level, we are intercepting and we're inserting ourselves in there without changing those IP addresses.
So we're not asking them to move to our IP address.
How does this actually work? What's the technology behind this?
Yeah, absolutely. And this comes down to the way that at the lowest layer, the Internet kind of communicates about what IP addresses exist and where, and how those different parts of the Internet interconnect with one another to pass traffic across each disparate network.
So there's a protocol called BGP that everyone who's on the Internet needs to speak BGP.
And it's this way of these different network owners who actually own physical infrastructure of the Internet to tell other people who own parts of the physical infrastructure of the Internet, here's how you get to specific IP addresses.
I have a path to 1.2.3.4.
It's through this number of hops in these autonomous systems. And I'm telling you that so that you know, if you get a packet for 1.2.3.4, you can send it my way and I'll get it there.
And so this is the way that the Internet kind of learns and understands how to get traffic to a destination.
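As a rough illustration of the idea described above, the sketch below models a network that has heard two advertisements for the same prefix and picks the one with the fewest autonomous-system hops. The Advertisement class, the best_route function, the peer names, and the AS numbers (taken from the range reserved for documentation) are all hypothetical, and real BGP route selection considers many more attributes than path length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Advertisement:
    prefix: str      # destination prefix being advertised
    as_path: tuple   # autonomous systems the route passes through
    neighbor: str    # hypothetical name of the peer that told us


def best_route(advertisements, prefix):
    """Pick the advertisement for `prefix` with the fewest AS hops.

    Simplifying assumption: path length is the only tie-breaker.
    """
    candidates = [a for a in advertisements if a.prefix == prefix]
    if not candidates:
        return None
    return min(candidates, key=lambda a: len(a.as_path))


# Two peers each claim to know a path to the same prefix.
ads = [
    Advertisement("1.2.3.0/24", (64500, 64501, 64502), "peer-a"),
    Advertisement("1.2.3.0/24", (64510, 64502), "peer-b"),
]
chosen = best_route(ads, "1.2.3.0/24")
print(chosen.neighbor)
```

Here the shorter path through peer-b is chosen, which is the sense in which a packet for 1.2.3.4 gets sent "my way" by whoever hears the advertisement.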
And so what Cloudflare is doing here is we have physical infrastructure in hundreds of data centers around the world in over a hundred countries.
And in each of those data centers, we connect to other parts of the Internet, other physical networks that speak BGP and we speak BGP to them.
So up until Magic Transit or around that time, we were only really telling our network peers and our transit providers about Cloudflare owned IP addresses.
Because that was the only thing that we were serving. It was like, to get to 1.2.3.4? Hey, that's not us.
You got to go somewhere else. But now if the customer who owns 1.2.3.4 wants to get their traffic to Cloudflare, we'll actually advertise that IP space from all of our data centers around the world out to the Internet over that BGP protocol.
So now the Internet learns, oh, the path to this IP space is through Cloudflare.
And that's what sends all of that traffic to us.
Even though they still own those addresses. It's not like we do, right. It's still there.
We're telling the world that's us. Yeah. But this is really no different if you think about it from even without Cloudflare, if that customer is advertising that IP space through BGP from their own network, they're telling that to their Internet service provider, who's then telling it to their service providers, who's then telling it to their service providers.
And so everyone is telling everyone about IP addresses that they don't own.
It's this way that the network kind of builds information about these paths and hops through these disparate systems, many of whom don't actually own the IP space they're advertising.
And this is really where the thought around Magic Transit, and the transit part of that, comes in: we're connecting down to the lowest layer of the Internet and participating in that same conversation that all the other ISPs are participating in.
That's great. So Annika, now that we have this ability to be the front door on our customers' behalf, why does someone sign up for this?
Like you said, like, so some of them are like, there's just, I want to protect my network, but there's some interesting new use cases that are showing up also from why people want to, you know, so it's the DDoS, and you alluded to some of this with the ransomware and like, and some of the things that are happening with increased usage around COVID, like, what are other reasons, what are the other benefits of putting Cloudflare in front of your IP space and us receiving everything?
Yeah, sure.
So blocking, blocking DDoS attacks is a big one, but there's other kinds of bad traffic out there on the Internet, or just not necessarily bad, but just like traffic that you don't want to get that's not meaningful for your organization.
And so the way that organizations traditionally manage this kind of problem or have in the past is they buy a network firewall box or as many network firewall boxes as they have locations where they have people with traffic that they want to protect.
And as you can imagine, if you're a company that has a lot of locations, that can get really expensive, really fast, it can get hard to manage.
And then the thing about boxes is that they don't last forever. You have to maintain them, you have to replace them.
If you have a whole bunch all over the place, it can get really difficult to manage.
And so we heard from customers around the same time as they were telling us, hey, we would like you to be able to do DDoS protection in this way, offered as a service from your network all over the place, not as a physical box in our data center or our office: we'd also like you to be able to do firewall for us this way.
And so Magic Firewall is a part of Magic Transit.
If you're a Magic Transit customer, you have all the magic stuff.
Yeah, exactly. And it's magic in the same way that Magic Transit is, which is that it runs in every single Cloudflare data center, on every one of our servers.
And so traffic that comes into any Cloudflare data center goes through a set of firewall rules that the customers can define to say, hey, allow this traffic that I do want to receive to my data center or block this traffic that I don't.
And then that traffic is allowed or blocked right there at the edge, very close to where the source of it is.
So we can send just the clean traffic back to our customers.
And that's the only stuff that they have to worry about.
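To make the allow/block idea concrete, here is a minimal sketch of evaluating customer-defined rules against a packet's source address, protocol, and destination port. The Rule fields, the first-match-wins ordering, and the default action of "block" are assumptions made for illustration and are not a description of how Magic Firewall actually expresses or evaluates rules.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    action: str                      # "allow" or "block"
    source: str                      # source network in CIDR notation
    protocol: Optional[str] = None   # None matches any protocol
    dest_port: Optional[int] = None  # None matches any port


def evaluate(rules, src_ip, protocol, dest_port, default="block"):
    """Return the action of the first matching rule, else the default."""
    addr = ipaddress.ip_address(src_ip)
    for rule in rules:
        if addr not in ipaddress.ip_network(rule.source):
            continue
        if rule.protocol is not None and rule.protocol != protocol:
            continue
        if rule.dest_port is not None and rule.dest_port != dest_port:
            continue
        return rule.action
    return default


# Hypothetical policy: drop one unwanted network, allow HTTPS from anywhere.
rules = [
    Rule("block", "203.0.113.0/24"),
    Rule("allow", "0.0.0.0/0", protocol="tcp", dest_port=443),
]
print(evaluate(rules, "198.51.100.9", "tcp", 443))
```

Because the same rule list would be evaluated wherever the traffic arrives, only packets that come out as "allow" would need to be sent on to the customer.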
So wait, hold on a second, Nick, you've got a full-time job and then some building Magic Transit, right?
Like, do you guys just like, and then Annika's like, you know, I think we should put a firewall on top of this.
Or you're like, oh, great.
Now I got to like triple my team. Like how, like, we've only been doing this.
We've only been operating at this layer in the Internet for what, 18 months at this point.
Like, how do we have both a DDoS solution and a firewall solution?
Like, how did you get Magic Firewall out the door without having to triple your team and add 18 months to your development cycle?
Yeah, that's a great question. I mean, one of the easiest parts of that answer is we're building on top of a great infrastructure and great architecture that already exists on the DDoS side.
You know, all of the ways that we stop DDoS attacks for Magic Transit traffic are the same ways we stop DDoS attacks for our Spectrum or our HTTP traffic.
It's all the same pipelines that process this traffic. And so we really have been building for over the past decade, all the infrastructure that we've really built on top of.
And that's really, you know, what's helped us to, we've got a really great foundation to build on top of.
You know, the other really interesting thing about how we've chosen to build Magic Transit is what we really leverage inside of all of our data centers.
We have all these servers that are running Linux as an operating system, and Cloudflare is really good at working in the Linux kernel, particularly for doing networking things.
And so we've really leveraged a lot of the tools and techniques that already exist in the Linux kernel to stitch together these really powerful tools that already exist.
And the fact that we can take an existing tool and apply it to our existing infrastructure with this great deployment of hundreds of data centers around the world means that we've got a pretty, you know, great set of tools for us to put together and then figuring out how do we present that to customers?
How do we give them access?
How do we give them visibility and control on top of that? So that they can then at a click of a button or a hit of a keystroke, deploy a new firewall rule to thousands of servers around the world almost instantaneously.
It's amazing.
It's great. It's giant standing on the shoulders of other giants is how I like to describe it.
It's amazing, amazing work. And I know future generations of product are going to be built on all the amazing stuff, Annika and Nick, that you guys are building.
And they'll be able to answer Jen's question like, well, we built it on Magic Transit.
That was amazing. This is this incredible platform that we were able to build on top of.
So I want to talk about some of the interesting challenges we've had as we scaled up this year.
Like we went, you know, it was like, used to be like, oh, we have, you know, six customers, seven customers.
And now it's like way past that. And a whole bunch of new class of challenges and operational things showed up.
And one of them that I think is really interesting, that we made great strides on, is visibility.
So, Annika, what is the visibility problem?
Like, why did we have intern projects and whole new systems in the dashboard?
What was the, what was busted and hard?
And then Nick, how did we solve it? Yeah. So I mentioned earlier, the way that companies used to solve problems like this is with boxes.
They put a box in and it's very easy actually to see what's going on. If you have a box that sees all the traffic going into and out of your one location.
But, but that's not how things work anymore.
Companies aren't just in one single building with one single box that can filter all of their traffic.
And in fact, for Magic Transit, it's really important actually that we're all over the world, as close to the source of the traffic as possible.
And so that means that suddenly now, if you want to see what's going on with the traffic that's coming into your network, if you want to see what the path between Cloudflare, who we're sitting here doing all the filtering for you and all the way back to your network is, you don't just need to see that from one location.
It's not like you can just go to, you know, Cloudflare in Atlanta, where I am and see like, what's the view from Cloudflare Atlanta, you actually have to be able to, to have that visibility from everywhere that Cloudflare is.
And so we've worked really hard over the past few months to put a lot of tools in our customers' hands that give them really that, that better degree of visibility.
So they can see, hey, from everywhere that Cloudflare is in the world, what is the path between Cloudflare back to me look like?
Literally, they can run an actual traceroute, which is a tool that network engineers use to see the path between points on a network, from Cloudflare to them and say, okay, I can see that if there's an issue getting from Cloudflare to me, it's with this provider that's right in the middle.
So we've given customers the ability to see more of this information than they've had before.
Yeah, that's a great, that's a great project. Yeah.
And the challenges Annika described there are that we have hundreds of network locations, and we need to provide visibility at any one of those, because again, the Internet is a very disparate place, and the path from Atlanta to a customer's data center might be very different from the path from Hong Kong to that customer's data center.
And while Hong Kong might be healthy and traffic may be flowing just fine, Atlanta may be broken in some way that the customer needs to take some action on.
And so the tool that Annika talked about, traceroute, gives customers the ability to actually run a traceroute from any one of our data centers, or all of our data centers, and collect all that information and make decisions about it.
This is us putting control in customers' hands: the same type of tools that a network operator would use to debug their own network, with the ability to run those in Cloudflare's network.
So how do we do that?
It's one of the things that we talk about some with Cloudflare's architecture as well: the difference between our edge network and our core network.
Our edge is all these data centers around the world where we're running all the software systems that actually touch our customers' network traffic and do our great DDoS filtering and all of that.
We have a core network where a lot of our big data systems, you know, we feed a lot of data there.
We do a lot of processing and we have you know, platforms there where we can easily spin up software services that we can build customer facing APIs and, and, you know, ingest API requests.
And so we had a team of interns over the summer who helped us by building one of these systems in our core data center.
We exposed an API to customers to allow them to specify where they wanted to collect this traceroute information from.
And then that system in our core data center has to go out and ask all of our edge data centers, hey, can you please run a traceroute for me?
And then collect that information and ship it back.
And we do some things to add extra information. We want to make sure customers understand what autonomous systems their traffic is flowing through.
And so there's a lot of complexity there, in that we've got this one customer who logically just wants to say, give me a traceroute from Cloudflare.
And we have to turn that into hundreds of such requests and send them out to all corners of the earth and collect that information.
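The fan-out described here, where one logical request becomes many per-location requests whose results are gathered back together, can be sketched roughly as follows. The function names, the location names, and the canned result returned by run_traceroute are all hypothetical stand-ins; a real system would send a request to each edge location instead of returning a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor


def run_traceroute(location, target):
    """Stand-in for asking one edge location to run a traceroute.

    Returns a canned result so the sketch is runnable; the "hops" list
    would normally hold the routers seen along the path.
    """
    return {"location": location, "target": target, "hops": []}


def distributed_traceroute(locations, target, max_workers=16):
    """Fan one request out to every location and gather results by name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda loc: run_traceroute(loc, target), locations)
        return {r["location"]: r for r in results}


# One customer request turns into one request per location.
report = distributed_traceroute(["atlanta", "hong-kong"], "192.0.2.1")
print(sorted(report))
```

Keying the combined report by location is one way to let a customer compare, say, the path from Atlanta with the path from Hong Kong side by side.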
It's great. Yeah, absolutely. It was, you know, it was a tough year for so many reasons as the whole world has noticed, but like one of the real bright spots was that Cloudflare was, we were able to double our intern class.
And so this Magic Transit team is one of the direct beneficiaries of that.
This was the direct benefit of that, which is that interns at Cloudflare get to work on real problems that directly benefit our customers.
And so I was thrilled to see that work that your team led, to really develop a distributed traceroute at scale that works arbitrarily well. It's fantastic.
It wasn't, NetVisor wasn't the only thing though, right? There was also a health, another different kind of health check that we were working on.
So let's talk about that.
Yeah, there's kind of two ways that we thought about this problem of what's happening in the Internet between Cloudflare and the customer.
And one is, let's give our customers the tools so that when there is a problem, they can go and dig deeper.
But the fundamental question is how do you know that there is a problem?
And so one of the things that's been true since Magic Transit launched is that we are continually running health check probes between all of Cloudflare's network locations and our customers' data centers to monitor for things like packet loss or connectivity failures.
One thing we didn't talk about in how Magic Transit works is how we actually get traffic back to the customer.
So when a customer signs up for Magic Transit, they bring us their IP space and we advertise that, as we discussed earlier on, and that lets us get their traffic.
We scrub it for DDoS attacks and apply firewall filters.
And then when we've got a good packet we want to send back to them, we use what's called a GRE tunnel.
We talked more in depth about this the last time we talked.
So, you know, for the curious folks here can go back and find the old recording and listen to the technical details there.
But suffice to say that this is a way for us to take that packet that's destined for the customer's IP space, which Cloudflare is advertising on the Internet.
And we put it inside another packet.
And another envelope and change the address. Exactly. Yeah.
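The "envelope" idea can be sketched with the basic 4-byte GRE header from RFC 2784: two bytes of flags and version (all zero here, meaning no checksum and version 0) followed by a two-byte protocol type saying what is inside. The function names are hypothetical, the inner packet is placeholder bytes, and the outer IP header that would carry the result to the customer's tunnel endpoint (using IP protocol number 47 for GRE) is left out to keep the sketch short.

```python
import struct

GRE_PROTO_IPV4 = 0x0800  # EtherType indicating an IPv4 payload


def gre_encapsulate(inner_packet: bytes) -> bytes:
    """Prepend a basic 4-byte GRE header (no checksum, version 0)."""
    flags_and_version = 0x0000
    header = struct.pack("!HH", flags_and_version, GRE_PROTO_IPV4)
    return header + inner_packet


def gre_decapsulate(gre_packet: bytes) -> bytes:
    """Strip the header added above and return the original packet."""
    flags_and_version, proto = struct.unpack("!HH", gre_packet[:4])
    if flags_and_version != 0 or proto != GRE_PROTO_IPV4:
        raise ValueError("unsupported GRE header in this sketch")
    return gre_packet[4:]


# Placeholder bytes standing in for the customer-bound IP packet.
inner = b"original IP packet bytes"
wrapped = gre_encapsulate(inner)
print(wrapped[:4].hex())
```

The inner packet is carried unchanged, which is why the customer's own addresses never have to change: only the outer envelope is addressed to the tunnel endpoint.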
Yeah. And one of the things that we encourage all of our customers to do is set up multiple such GRE tunnels, especially in the cases where they have network diversity.
Let's say that customer at their data center, they have multiple Internet service providers.
They can set up two GRE tunnels, one that traverses each of those Internet service providers, which then means that any one of our hundreds of Cloudflare network locations now has multiple ways to get traffic back to that customer.
So we're continually probing over every one of those GRE tunnels to monitor for the relative health.
And then let's say that the customer has multiple GRE tunnels, and one of their tunnels from Atlanta has a network problem, but the other one doesn't.
We then on our edge are dynamically monitoring the health of these tunnels and picking the healthiest one to send that traffic over.
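A minimal sketch of that selection step, assuming each location keeps a recent packet-loss figure per tunnel from its health check probes. The pick_tunnel function, the tunnel names, and the use of packet loss as the only health signal are assumptions for illustration; the actual health metrics and selection logic are not described in this conversation.

```python
def pick_tunnel(health):
    """Return the tunnel name with the lowest observed packet loss.

    `health` maps tunnel name -> packet loss fraction (0.0 to 1.0),
    or None if the tunnel did not answer probes at all.
    """
    reachable = {name: loss for name, loss in health.items() if loss is not None}
    if not reachable:
        return None
    return min(reachable, key=reachable.get)


# Hypothetical view from one location: the tunnel over ISP A is lossy.
atlanta_view = {"tunnel-isp-a": 0.05, "tunnel-isp-b": 0.0}
print(pick_tunnel(atlanta_view))
```

Each location would run this against its own measurements, so Atlanta and Hong Kong could legitimately pick different tunnels at the same moment.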
And so we've been running these health checks and collecting this data and acting on it since Magic Transit started.
But one of the things that customers really want to understand is, Hey, like, let me know when I have a problem.
Maybe I do have like one of my tunnels is healthy, but if the other one is failing right now, I want to get that fixed just in case the backup gets unhealthy as well.
Right. And so we have the challenge then of how do we take this health check data that we're collecting on our globally distributed edge, and put that in a place where customers can ask us about it and go to their dashboard.
And we populate the dashboard with, here's the status of all your tunnels, not just overall, but from every one of our network locations.
And looking at that data over time, because maybe that customer experienced a network blip half an hour ago, and they really want to get to the bottom of it and understand why that happened.
If it's not happening right now, guess what? Running a traceroute right now isn't going to help them, but going back to our health check data and seeing, oh yeah, Cloudflare saw that I was experiencing 5% packet loss.
Exactly.
Absolutely. And so this might not sound like that big of a challenge, but when you start to think about the size of Cloudflare's network and all the locations that we're collecting this data, not to mention customers who may have, you know, large networks, and maybe they have dozens of GRE tunnels that we're collecting this data for, that starts to become a pretty large number of data points that we're having to ship back to our core data center, where we do a lot of data collection and big data analytics, and then expose that information in an API that a customer can come and query, or that we can embed in our dashboard.
So the customer can just come to a dashboard, look and see a view of all of their GRE tunnels from all of our network locations, what's the relative health, and if one is having a network problem, they can click a button, run the traceroute, get that information back to them so they can understand what's happening right now in my network.
Yeah, it's a whole different Latest from Product and Eng, but one of the things that we've been working on kind of consistently across the organization over the course of the past couple years is the fact that we do have so much insight as to what's happening with our customers' networks in ways that they often don't, because we just have such a bird's-eye view, that the focus on providing that insight back to our customers, as a value-added benefit on top of the service, is something we're starting to think about from a lot of different angles.
But I was really excited specifically to see it for this one, just because I know how critical this infrastructure is for these customers, and how difficult it is for them to keep track of and get visibility into those tunnels.
So Nick, so that sounds like it went well, but one of the things we talked about earlier was now we're kind of doing some stuff we've never done before, kind of at scale, which is back in the olden days, it used to be that Cloudflare had the IPs for the customers that lived in our service, and we advertised them, and that was the way our world worked.
But clearly, as we provide this offering, it's a totally different game.
What are some of the challenges we've encountered as we sort of move deeper and deeper into that part of the solution and that part of the architecture?
Yeah, absolutely. And actually, one of the big challenges that we faced and fortunately overcame since we talked last was going back to that BGP protocol.
One of the things that's interesting about, if you think about Cloudflare before Magic Transit, we were advertising just our own IP space, and that changed pretty infrequently.
IP addresses tend to be pretty expensive to acquire, and we had sufficient number of them to satisfy our needs.
But as we bring in more and more customer prefixes, suddenly there's a scaling factor in our infrastructure that we hadn't necessarily planned for when we built it initially.
We're taking sometimes dozens of new customer prefixes a week and needing to advertise them from all of our network locations.
And so that posed a lot of challenges just around automation, how we manage the deployment of that IP space, and how we get the right traffic to the right servers in each of our data centers.
And it also actually put a lot of strain on how we were thinking about and using some of our routing hardware in our edge networks.
This is a thing that started to put stress on some of the ways we were using those pieces of hardware that forced us to really step back and take a look and examine how can we change the way that we're using this hardware because we're hitting physical memory limitations on that chip embedded in the NIC in that one device.
And it became this problem where we were fearful that when we would onboard new customer prefixes, it might take one or more of our network locations down.
And we're really fortunate to say that we took a look at the way that our internal, the way our servers interact over BGP with our routers.
We found some really interesting ways to change how we kind of slice and dice which servers advertise which IPs to the router to overcome that specific case of that overburdening the memory on that chip, on that network interface card on that one router.
And so it's really interesting: the way we thought about our network when we first built it, the choices we made when we bought that specific network hardware and tested it for its scaling factor.
We've changed the world for Cloudflare in a way that means we've now got different constraints we're working under.
And so again, going back to, we've got such a great collection of sharp engineers here.
This one was really a cross-disciplinary effort to solve that one problem, requiring the coordination and thoughts and brainstorming and creativity of all those people.
I remember when I first came to Cloudflare, sales reps came to me and said, this customer wants to bring their own addresses.
And it was sort of this really wacky off-the-wall request. And as the engineering leader, I was like, that's a snowflake, absolutely not.
Pushing back, come on, figure out how to sell the product we have.
And then slowly this weirdo request became this mainstream thing.
And like all good software companies, we didn't automate it until we needed to.
That would have been a mistake, but as it became more important, let's automate this.
And now suddenly this thing that happened, like Nick said, so infrequently is happening all the time.
It's happening with customer control, not our controls.
It's happening globally on multiple time shifts.
And now it's triggering corners of corner case issues. And so some really fantastic work to bring this all together.
Jen, a question for you.
One of the things we've talked about is how we're tying some of these products together in a cohesive offering.
So how does all this fit together in some of the new things I keep reading about at blog.cloudflare.com?
Well, so one of the big announcements we had this past fall, during our Teams week, was this notion of Cloudflare One, which is really leveraging the growing model that we're seeing in the industry, where people are using the edge of the network as a secure access gateway, effectively, to be able to control everything that is happening.
And so what we're starting to see, back to the point that Nick made earlier when I was kind of picking on Magic Firewall a little bit, is customer demand for, and a desire to solve this from a security perspective in, a much nimbler way than the boxes that Annika indicated people have used in the past.
And people are really looking at the network as a way of doing it.
And we're fortunate enough that we have some of those building blocks already in the shop here that we're able to sort of bring together.
So specifically, Annika, I know you've been working really hard with the team more recently to think about how Magic Transit integrates with the Teams stuff to bring that Cloudflare One vision to fruition.
Can you talk a little bit about kind of what the assembly of those Legos kind of looks like?
Yeah, sure. So we talked about how Magic Transit protects customer networks.
On the Teams side, we're talking about protecting largely devices, so laptops and phones that employees have.
And sometimes customer networks mean data centers, but sometimes they could also mean offices and people sitting in offices.
And really what we keep hearing from customers is like, hey, wherever traffic to my network is coming from or going to, I want it to be able to go through Cloudflare so that I can set policies in the same place.
Whether somebody is sitting in an office on their laptop and goes to a website, or they have their phone at home or their laptop because they're working remotely and goes to the website, I want the policy about whether or not they can go to that website, just as one example, to be the same.
And I want to be able to just set it in one place and not have to manage it in all kinds of different places.
And so the Magic Transit team and the Cloudflare for Teams team and lots of other folks within Cloudflare are working on kind of the integration between these services and how we can actually execute on that, how we can give customers the ability to set one rule in one place in the Cloudflare dashboard that gets applied everywhere, no matter where traffic comes from.
And I'm really excited about that. Customers are getting really excited about it.
There's lots and lots of use cases that we can kind of build on top of this.
And if you start thinking about what a rule could mean, what a policy could mean, there's all kinds of things outside of the scope of sort of the traditional definition of, I think, what people think of as a firewall.
So lots of cool stuff to come. That's so great. Again, as often, the 30 minutes just blazes by.
I feel like I could spend the rest of the day talking to you guys about Magic Transit, what we're doing, how we solved all these problems.
We will absolutely have you back again, Annika and Nick. Thank you so much for joining us.
And thanks for everyone watching. We will see you next time on the latest from Cloudflare product and engineering.
Bye everybody. Thanks all.
No one likes being stuck in traffic in real life or on the Internet.
Apps, APIs, websites, they all need to be fast to delight customers.
What we need is a modern routing system for the Internet.
One that takes current traffic conditions into account and makes the highest performing, lowest latency routing decision at any given time.
Cloudflare Argo does just that. I don't think many people understand what Argo is and how incredible the performance gains can be.
It's very easy to think that a request just gets routed a certain way on the Internet, no matter what.
But that's not the case. There's network congestion all over the place, which slows down requests as they traverse the world.
And Cloudflare's Argo is unique in that it is actually polling what is the fastest way to get all across the world.
So when a request comes into Zendesk now, it hits Cloudflare's POP, and then it knows the fastest way to get to our data centers.
There's a lot of advanced machine learning and feedback happening in the background to make sure it's always performing at its best.
But what that means for you, the user, is that enabling it and configuring it is as simple as clicking a button.
Zendesk is all about building the best customer experiences, and Cloudflare helps us do that.