Originally aired on November 13, 2020 @ 6:00 PM - 6:30 PM EDT
Join Cloudflare's SVP & Head of Engineering, Usman Muzaffar, in conversation with the phenomenal teams that build our products. Covers both new features and enhancements on Cloudflare products and the technology under the hood.
Hi, and welcome to another episode of the latest from product and engineering. I'm Usman Muzaffar, Head of Engineering at Cloudflare, and I'm very thrilled to welcome two of my teammates here today, Eric Reeves and David Tuber. Eric, why don't you say hi? Hey, hi. And Eric, how long have you been at Cloudflare? What are you responsible for? I've been at Cloudflare for about three and a half years. I'm an engineering manager for the Spectrum and Argo smart routing teams. Excellent. And also with me is David Tuber, nicknamed Tubes. David, how long have you been with us, and what are you responsible for? Hey, Usman. Hey, everybody. I'm Tubes. I've been here since June. I don't know how many months that's been. It's all felt like a blur, but I've been here since June. I am responsible for Cloudflare's network, Argo, and availability. That's perfect. Thanks, guys. You can guess today's topic with the words Argo and network and product networking showing up. That's what we want to talk about today. I thought I would just start by reminding our audience just how vast and impressive, let's brag for a second here, with the numbers. Tubes, how big is our network? How many data centers do we have, and what do we know? What are some of the latest bragging rights we have on that? Yeah, we're in about 250 locations around the world, which covers most of the big cities in the world. It's very impressive. The stat we like to throw around is that we're within 100 milliseconds of 99% of the population, which is pretty impressive. A population of human beings on planet Earth. Yeah. I mean, 100 milliseconds is a pretty large radius, but that's still pretty impressive. That's awesome. Yeah. I remember when we were talking about some of this stuff, we were drawing CDN graphs. Yeah, what is it going to look like when CDNs go interplanetary? We're going to have to worry about latency problems on a different scale. 
But right now, yeah, that's how we think of it: how close are we to where there are physically eyeballs. One of the things that we've talked about in the background is that it is a network. There are the data centers themselves, and part of what we want to talk about today is that they're not just independent; they actually talk to each other, and some of our products are dependent on how they talk to each other. So tell us about how the POPs are connected to each other. Yeah, so our POPs talk to each other in a lot of different ways. We have our normal network transit, so that would be our transit providers who we peer with in a lot of different places. Basically, we go to them and say, hey, you're around the world. Can we have a place with you? Can we peer with you, get a PNI? So we hook up there. The other way we do it is we have a private backbone. So basically, for a certain percentage of our POPs, we interconnect private links and basically can funnel traffic over those. The reason why we do that is just because there's some traffic that it's better if it doesn't traverse the public Internet. And it's faster, it's more efficient for us, so all those great reasons. So the original reason when people say, well, why do you have these data centers all around the world? It's like, well, because that's where the human beings are, right? And so we put our caches there. But today, we're not going to talk about CDNs and caches. So Eric, what is Argo? When we talk about what is dynamic routing as opposed to static routing, what do we mean by dynamic routing? Argo Smart Routing is a platform within our network, and it optimizes the path from Cloudflare to the customer's network. And Cloudflare carries the traffic from visitors to our customers. So things like the CDN and Spectrum and Warp, these are all products that use Argo. They rely on Argo to make the traffic from Cloudflare's network back to their services as secure and reliable as possible. 
So one of the analogies that one of our colleagues likes to throw around is that Argo is kind of like Waze for the Internet. And so actually, maybe, Tubes, you could take that one. What does that mean? Why would we say Waze for the Internet? Like, what is it that Argo is doing that the regular Internet, so to speak, isn't capable of? Yeah, so at a very high level, Argo basically monitors all of the paths that you would traverse and picks the fastest one. And that fastest one is measured by RTT, round-trip time. So a good example is, if you land on a Cloudflare POP in, let's say, LA, and the origin that you have to connect to is in Chicago, and there are four different paths, Argo will measure all of those different paths and then say, hey, this one's the best, and so I'll send you down that path. Perfect. So it is like Waze for the Internet. We are literally looking at multiple different ways to get to a destination and then figuring out, of these choices, this is where we should go at this moment in time. That answer can change minute to minute, is that correct? Yes, it can definitely change minute to minute. So Argo is collecting all this information from the edge data centers, constantly studying what is the fastest way to go from point A to point B, and then presumably sending all this information back to the edge so that the edge, in your example, that computer in Los Angeles can go, I know I need to get to Chicago, but I'm going to take this route instead of that one because Argo has told me that it is faster or that it is even available. I mean, the other possibility is that slow could mean down in the context of the Internet because of... So Eric, back to you, how does this all work? Like, what did we have to build to pull this rabbit out of a hat here? Yeah, well, you alluded to some of that just now. As with many of Cloudflare's products, there's three main components to Argo. 
One is this stateless, high-performance data plane that runs on our edge data centers. This is just responsible for carrying traffic, routing it as securely and as fast and as reliably as possible. But like you said, there's also another component that's constantly monitoring all of the origins with which our customers and our visitors interact and constantly measuring the network. And that's what we call our control plane. And so these are a set of services that reside within Cloudflare's network, constantly taking network measurements, looking at information from all the requests and the traffic that traverse our network, and using that to compute these optimal routes, these optimal paths from entry points into Cloudflare's network to the exit points from Cloudflare's network. It's a constantly running system that's instructing our data plane how to carry this traffic. It's like the world's greatest set of rulers and measures, constantly looking at all the different permutations. And then it needs to not just do that, but it needs to put that information in a format that the edge can consume very quickly and act on, because it doesn't know upfront what questions are going to be asked. So it almost needs to prepackage this information so that a packet that is coming to our edge data centers knows, okay, this is what Argo is suggesting I use. And so shipping that back, I'm assuming, goes through the same way that Cloudflare's core is connected to the edge. We've talked about this on the show before, what's called our Quicksilver technology. We've also blogged about it. And so all that information is getting packaged into Quicksilver and shipped to the edge. And how fast do we do that? How often is it sort of sending new information up to the edge? Well, we're calculating new routes. 
Every couple of minutes, we're calculating new routes, and we're using network measurements from this other system that is constantly probing origins and probing paths between Cloudflare data centers. And as it continues to crunch this information, there's another set of services that are using this information to provision new routes or new optimal paths to the Internet. That's constantly happening. That's awesome. Tubes, do we have stats on how much faster this makes a typical origin? It makes it pretty darn fast, I was about to say. Typically, it makes it about 20% to 30% faster than just using the traditional paths. That's because we do all the measurements beforehand. And actually, when you use Argo, you can see this verification yourself, because we actually send a small amount of traffic through the normal public Internet routing paths, so that we can validate that Argo is actually doing stuff that's beneficial to your traffic. And so, the other aspect of this is that we split your traffic across at least two different paths, and if another path starts to be faster, then we move the traffic over. So, it's a really good way, and we're always trying to make sure that we go through the fastest path, and we split your traffic up specifically so we can do that. I love that feature. I remember when it was first explained to me, I thought that was so clever, because it's the idea that we're keeping ourselves honest with raw data on both sides. It's the control in a good experiment to constantly measure against, well, what would you do if this wasn't on in the first place? And then showing that to our customers, so that they can actually see for themselves, this is how much Argo saved you, and if it didn't, then we know we should be thinking about how the route computation missed, or what else it could have missed there. 
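As a rough illustration of the loop described above (probe several candidate paths, prefer the one with the lowest round-trip time, and compare it against a control sent over normal Internet routing), here is a minimal Python sketch. The path names, the RTT samples, and the use of the median as the summary statistic are all assumptions made for the example; it is not a description of how Argo is actually implemented.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class PathSample:
    """RTT probes for one candidate path (names and values are made up)."""
    name: str
    rtts_ms: list


def typical_rtt(path):
    # Assumption: summarize the probes with the median to damp outliers.
    return median(path.rtts_ms)


def pick_fastest(paths):
    """Return the candidate path with the lowest typical RTT."""
    return min(paths, key=typical_rtt)


def improvement_pct(smart_ms, control_ms):
    """Percent of RTT saved relative to the control (default routing)."""
    return (control_ms - smart_ms) / control_ms * 100


# Hypothetical Los Angeles -> Chicago candidates, including the control.
control = PathSample("default-internet", [62.0, 64.0, 61.0])
candidates = [
    control,
    PathSample("via-denver", [48.0, 50.0, 47.0]),
    PathSample("via-dallas", [55.0, 58.0, 54.0]),
    PathSample("via-seattle", [70.0, 73.0, 69.0]),
]

best = pick_fastest(candidates)
saved = improvement_pct(typical_rtt(best), typical_rtt(control))
print(f"{best.name}: {saved:.1f}% faster than the control path")
```

Keeping the control path in the candidate list mirrors the "keeping ourselves honest" point: if nothing beats default routing, the sketch picks default routing and reports a 0% saving.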
So, Argo, the other thing I wanted to go back to is, so, we talked at the top of the hour about how part of making the best route here means using how Cloudflare data centers are connected to each other. So, how does that fit into this? Like, we're sort of being very hand wavy with, okay, so there's a different route, but tie that back to the fact that, you know, what David was talking about at the top of the hour, which is the data centers are connected to each other. So, what is actually going on? If I were to put a microscope inside, or a magnifying glass inside a Cloudflare edge data center, and it's got this hint from Argo, Argo's like whispering in its ear, like, by the way, you know, don't take the normal route, don't take 101, it's jam packed, use 280. Like, what happens next? How does this use other POPs to get the packet there as fast as possible? Yeah, good question. So, when we publish routes to our edge network, we're not just publishing a single route, we're publishing several routes that highlight and describe several paths through Cloudflare's network. And so, the data plane is instructed to try those paths in the order that those routes are provisioned. And so, it has primary routes and secondary and tertiary routes. And if those paths all fell over, there's actually smarts built into the data plane to be able to not try to reuse those paths, if it knows that there's some sort of transient error condition that the control plane hasn't detected. But then we expect that the control plane will pick up those changes and then propagate new routes back out to the edge. That's really cool. And so, not only are we actually finding out what Cloudflare data center to route the traffic through, but we're also thinking about which transit provider to pick. So, this is the very interesting v1.1 optimization that we built. And when we first built Argo, it didn't have this functionality, right? 
It was just, it's like, okay, I'm going to bounce the traffic through Nevada because that's the right way to go. But now it can be like, no, wait a minute, on my way to a data center, I'm going to use one of these two different connections because a lot of our Cloudflare data centers are peered and connected to the Internet with multiple providers. So, talk a little bit about that. How did that work? And why was that important? What do we mean by Argo transit selection? Well, as David mentioned, every data center has connectivity via one or more transits. And we know that some transits are more reliable and some more performant than others. And so, the v1.1 optimization that you're referring to is to make Argo aware not just of the fact that there is a link between two of our data centers, but that there are multiple links between two data centers. And we enhanced our control plane to understand the difference between the various transits and use those transits when probing and collecting network measurement data between those two data centers. And we use that information to calculate routes. So, what it means by transit selection is on the path to the origin, don't just choose the path across Cloudflare's network between POPs, but use the transits that are the most performant and the most reliable. And this has a really good side effect of actually improving the origin reachability for our customers. And it's funny because Argo is a performance product. So, Argo is supposed to optimize your performance and your latency and make you faster. But also, because this intelligence is built in, it actually makes you more reachable and more reliable through our network because we're picking the transits that are up. There are some really great examples of some customers who we talked with. One of my favorites was a customer who had one origin and it was in Dublin. And they had customers in Djibouti who were trying to talk back to their origin. 
And before they had Argo, they were saying, oh man, like we're just completely, we're timing out and our origin connections are totally failing all over the place. And then, we talked to them and said, just turn on Argo and see what happens. And then they turned it on and all of their problems instantly went away. That's awesome. It was because- This is a perfect example of, I'm assuming it's because it found a better path for the- That's fantastic. It found a better path back to the origin. And that's because it does it automatically. It's a built-in part of the product. It's great. And that part of the product is something that we really love and it makes our customers better and it makes us better. And our customers love that. That's great. It reminds me of this quote that my old boss used to say, which is, the most important performance improvement is from the non-working to the working state. And so, it's in keeping with the spirit of performance that if it's down, how about we pick a link that's up, which is infinitely more performant than the one that's down. Eric, I want to go nerdy here for a second. How does a process actually use a different transit selection? Like, what does this actually matter? If I was literally on the box that has decided, oh, shoot, I need to use a different transit provider. How do we pull this off? Is it a different network interface? Is it a different socket? Like, how does a user-level process that is holding on to a little bit of a customer packet, how is it going to make- How does it literally decide? Well, the decision's been made, but how does it actually put the packet onto a different transit? Yeah, well, one decision is that any product that's using Argo doesn't have to worry about transit selection. Argo just uses transit selection for that product. So, the way that we were able to distinguish transits in Argo's data plane was a creative use of IPv6 addresses. 
So, all of our links between the various colos have IPv6 connectivity. And because we have sufficient bit space in the IPv6 address, we're able to use that to encode certain information about the colo or the data center and the transit, and then create connections from our network measurement system to probe the colo-to-colo links using a particular source address, which then instructs our network to use a particular transit. And so, we're able to distinguish multiple transits in a single data center. That's awesome. So, buried in that 128-bit, that very generous addressing space, it's like moving into a new house, right? Like, you've got so much room to spread out really comfortably, is the ability for us to have actually part of the IP address be what transit provider we're using, what destination it is. And from the point of view of a user-level program, it's just deciding where am I going to send it to. That's right. And there's a whole class of systems within Cloudflare that keep track of this information. And so, every time Argo's control plane starts to probe, to run its network measurements again, it learns about new transits or transits that are added or transits that are removed, and then the routes that we provision to the edge are reflective of that. That's fantastic. That's great. You know, you mentioned that other products at Cloudflare are leveraging the Argo data plane technology. You mentioned Spectrum, you mentioned Warp. How did we do that? What was involved in making sure that those other products could leverage that? That's a great question. So, the original version of Argo, the first version of Argo was made to accelerate HTTP traffic, traffic for our CDN. Newer products started to emerge like Spectrum and Warp, which were operating at slightly lower layers of the stack. And hence, the Argo team had to build a data plane or enhance this data plane to be able to accelerate traffic at layer four. 
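The IPv6 source-address idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `2001:db8::/64` documentation prefix, the two 16-bit fields, and the `colo_id` and `transit_id` values are all hypothetical, and the conversation does not say how Cloudflare actually lays out the bits.

```python
import ipaddress
import socket

# Assumption: documentation prefix (RFC 3849), not a real Cloudflare range.
PREFIX = ipaddress.IPv6Network("2001:db8::/64")
FIELD_BITS = 16  # assumed width of each of the two encoded fields


def encode_source(colo_id, transit_id):
    """Pack a colo ID and a transit ID into the low 32 bits of an address."""
    limit = 1 << FIELD_BITS
    if not (0 <= colo_id < limit and 0 <= transit_id < limit):
        raise ValueError("colo_id and transit_id must fit in 16 bits")
    return PREFIX.network_address + ((colo_id << FIELD_BITS) | transit_id)


def decode_source(addr):
    """Recover (colo_id, transit_id) from an address built by encode_source."""
    low = int(addr) & ((1 << (2 * FIELD_BITS)) - 1)
    return low >> FIELD_BITS, low & ((1 << FIELD_BITS) - 1)


def connect_from(source, dest_host, dest_port):
    """Open a TCP connection whose source address is chosen explicitly.

    Only works on a host where `source` is actually configured; the network
    would then be responsible for mapping that source address to a transit.
    """
    sock = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    sock.bind((str(source), 0))  # port 0 lets the OS pick the source port
    sock.connect((dest_host, dest_port))
    return sock


src = encode_source(colo_id=42, transit_id=3)
print(src, decode_source(src))
```

From the point of view of the user-level process, the only decision is which source address to bind to before connecting, which matches the "it's just deciding where am I going to send it" framing in the discussion. `connect_from` is defined but not called here because it needs that address to exist on the machine.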
And so, what I mean by layer four is there's this thing called the OSI model, and it's a mental model that describes how computers communicate over the Internet. And so, Argo had to build an interface to the other processes that run on our edge data centers and some instructions on what to tell it when it wants to insert a connection onto the data plane. And so, both Warp and Spectrum use the same interface. And any new product on the edge that deals with edge connectivity can do the same. That's great. You mentioned Spectrum. So, Spectrum here is now the ability for Cloudflare to proxy any traffic on the Internet, any port, and that's where it gets its name. Argo, by the way, if I'm not mistaken, is a constellation, and the code name came from the constellation. I remember the engineering team's original act had a cool shot of the Argo constellation and the Wayfinder parts of the project. So, Spectrum is this idea that while so-called layer seven is for a particular application like web traffic, HTTP, which by convention tends to be on port 80 and other ports that start with 80, like 8080, the idea of Spectrum is that it can handle a whole bunch of different kinds of traffic. And you're one of the people who worked on Spectrum. So, talk a little bit about that. What were some of the key challenges we had to solve with Spectrum? Yeah, great question. Going back to Eric, I can see the big smile there. It's like, how long do you want this answer to be? We can talk about this all afternoon if you want. Yeah, so, I joined Cloudflare as a software engineer, and about four months in, the head of the product strategy team, which is now called Emerging Technology and Incubation, said, hey, we're working on this new thing called Proxy Anything, and I think your background might be a fit. Are you interested? And of course, I said yes. 
And so, I was one of the original members who helped build that product, and it launched in April of 2018, and it's done very well since then. What's an interesting challenge? I can talk about anything you want. Yeah, you mentioned Port 8080, Port 443. That's a very constrained list of ports on which our CDN operates, right? With Spectrum, one of the largest challenges we had was how to handle the scalability issues associated with proxying any TCP or any UDP port. Kind of the dimensions of 65,000 ports was a major challenge for us, and early on, we stumbled upon a creative use of the Linux firewall that would allow us to address these scalability issues. Our minds were blown when we discovered that, and it really was a key to the product's success. That's fantastic. Zooming out another level, Tubes, back to you, what are some of the key things that you're keeping your eye on? Some of the other things that we're doing to make Cloudflare's network more resilient and have better reach. You mentioned that Argo is almost a secret weapon that can improve our reachability as well as our performance in all these places. What are some of the other interesting challenges we're seeing? What are some of the other great success stories we've had with Cloudflare's network technology? Yeah, we're definitely looking to use that Swiss Army knife of Argo a lot more in our service to make not just Argo customers, but all of our customers, benefit. The mantra that we say is, if one colo can reach an origin, then every colo can reach an origin. That's a great line. Love that, yeah. That's the tenet behind Argo, and it's really funny that you mentioned that Argo was the constellation, because for me, Argo is the ship that Jason has. I've called that initiative Orpheus, and it took some explanations to the team to get them to understand that, oh, Orpheus is the wayfinder. He was the musician who was on the Argo. 
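The mantra "if one colo can reach an origin, then every colo can reach an origin" can be read as a relay rule, sketched below in Python. This is a hypothetical illustration only: the colo names, the `reachable` map, and the `relay_preference` ordering are invented for the example, and the conversation does not describe how the Orpheus initiative is actually built.

```python
def choose_egress(ingress, reachable, relay_preference):
    """Pick which colo should open the connection to the origin.

    ingress          -- colo where the visitor's request landed
    reachable        -- dict of colo name -> True if it can reach the origin
    relay_preference -- other colos to try, best candidate first (assumed
                        to be ordered by some measurement, e.g. RTT)
    Returns the chosen colo name, or None if no colo can reach the origin.
    """
    if reachable.get(ingress, False):
        return ingress  # direct path works, no relay needed
    for colo in relay_preference:
        if reachable.get(colo, False):
            return colo  # relay through the first colo that can get there
    return None


# Hypothetical state: the ingress colo cannot reach the origin directly.
reachable = {"djibouti": False, "london": False, "frankfurt": True}
egress = choose_egress("djibouti", reachable, ["london", "frankfurt"])
print(egress)
```

The point of the sketch is the fallback order: the ingress colo is used when it works, and a relay is only chosen when it does not, so customers with healthy direct paths see no change.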
What we need is a real Cloudflare team just dedicated to codenames, so we're all in sync on our stories. I know I'm going to get texted from the Argo team, that's not what it was. Where'd you get that from? So this is good. So Orpheus is a codename for amplifying Argo technology. That's great. The other thing I wanted to ask you about, Tubes, was network error logging and how we're using that to improve our products and our reliability. In the last couple of minutes here, talk a little bit about NEL and what we're doing there. Yeah, sure. So at a very high level, Cloudflare is pretty good at monitoring, or at least at monitoring and seeing problems that happen from Cloudflare back. And what that means is that we can understand when a request lands on us, if it fails. We can understand if we try to send something to an origin, if it fails. But what we can't really do is we can't see if a user tries to connect to Cloudflare and they fail. That's a really hard problem for us to solve. And that's a hard problem, for the record, for any network to solve. How do you detect requests that never reach you, especially if you're the only person in the service? Some people have opted for synthetic probing, so they buy external products. Good examples are ThousandEyes and Cedexis, things that basically are VMs in different cloud providers, and they make requests to Cloudflare's network. That's one way of looking at it. But Cloudflare's network is actually too big. We're a victim of our own success here, because we're in so many locations that providers like ThousandEyes and Cedexis actually don't address all of the networks that we connect to. So we won't be able to see all of the things that we need to be able to see to make a determination as to whether or not something is wrong on the Internet. So we've decided to utilize a browser technology called network error logging. 
And essentially what network error logging is, is it's a basic version. I like to call it, I like to call it virtual Matthew. So my favorite thing is every time there's a problem, JGC or Matthew will basically post in one of the chats being like, hey, you know, there's something wrong on the Internet. And, you know, our Twitter and use and our customers, their, their users. If something goes wrong, Cloudflare knows about it first because we are so, so, so well connected and such a part of infrastructure. Exactly. And our users are very, very, very passionate, and they want us to be up and we want us to be up. And so we want to beat them to the punch. We want to be able to tell customers, hey, you know, we're sorry, we're down, but we know what's wrong, we're fixing it. So between us and them, that is our ability to move around. Exactly. So what NEL is, is NEL is basically a function of a user's browser that allows them to upload failures to an endpoint that we maintain that basically says, hey, if you fail to connect to us, just let us know through this endpoint. And then we'll aggregate that and we'll alert on it ourselves. So NEL is basically a user's browser telling us that there's a problem, maybe even before the user sees an error, or a lot of times, maybe the user's application will swallow the error and will do a retry and maybe that will succeed. Or maybe they're caching something or they're doing something that allows them to stay performant, so that we can still get the error and the user's experience will be less interrupted. And so NEL is probably the first time in Cloudflare's history that we have a really good view from the eyeball networks, from everywhere, right? Like a global view, simultaneous, real time from everybody. Which I think is, it's so cool. 
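For readers unfamiliar with network error logging, the mechanism described above (the browser is told where to upload failures, and the operator aggregates what arrives) can be sketched in Python. The `Report-To` and `NEL` header fields and the report shape follow the W3C Network Error Logging draft; the endpoint URL, the group name `nel`, the sample reports, and both function names are assumptions for illustration, not Cloudflare's actual configuration.

```python
import json
from collections import Counter


def nel_headers(report_url, max_age=86400, failure_fraction=1.0):
    """Build the two response headers that opt a browser into NEL."""
    report_to = {
        "group": "nel",
        "max_age": max_age,
        "endpoints": [{"url": report_url}],
    }
    nel = {
        "report_to": "nel",  # must match the Report-To group name
        "max_age": max_age,
        "failure_fraction": failure_fraction,  # share of failures to report
    }
    return {"Report-To": json.dumps(report_to), "NEL": json.dumps(nel)}


def count_error_types(reports):
    """Tally uploaded reports by error type so spikes can be alerted on."""
    return Counter(
        r["body"]["type"] for r in reports if r.get("type") == "network-error"
    )


headers = nel_headers("https://nel.example.com/report")  # hypothetical URL

# Made-up reports in the shape browsers upload to the endpoint.
sample_reports = [
    {"type": "network-error", "url": "https://example.com/",
     "body": {"type": "tcp.timed_out", "phase": "connection"}},
    {"type": "network-error", "url": "https://example.com/a",
     "body": {"type": "tcp.timed_out", "phase": "connection"}},
    {"type": "network-error", "url": "https://example.com/b",
     "body": {"type": "dns.name_not_resolved", "phase": "dns"}},
]
counts = count_error_types(sample_reports)
print(headers["NEL"])
print(counts.most_common())
```

Because the browser stores the policy after one successful response, later failures to connect can still be reported once the browser is able to reach the reporting endpoint, which is what gives the operator a view of requests that never arrived.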
And you get to see a lot of things, and the best part from an availability perspective is you get to see how the actions that we take impact our users. And that's really the thing that makes it so exciting, because we get to make sure that the actions that we're taking don't impact our users or make our users' experience better. Right. It's a huge responsibility, but honestly, a great honor as well. Thanks, guys. I am always impressed by how fast this stuff goes. We're at just a couple minutes left, but I really want to thank you, Tubes, for coming on and talking about all things network on today's show. And Eric, big thanks for talking about Argo, how it works on the inside, how we built it, all your work building Spectrum, it's been fantastic. And I'm really looking forward to all the new stuff that is in the hopper, and to being back on with both of you guys soon, telling the world about all the new stuff we built and how much bigger the network has gotten. So thanks everybody for watching and we'll see you next week. Bye all.
Hi, we're Cloudflare. We're building one of the world's largest global cloud networks to help make the Internet more secure, faster, and more reliable. Meet our customer Wongnai, an online food and lifestyle platform with over 13 million active users in Thailand. Wongnai is a lifestyle platform. So we do food reviews, cooking recipes, travel reviews, and we do POS software that we launched last year. Wongnai uses the Cloudflare content delivery network to boost the performance and reliability of its website and mobile app. The company understands that speed and availability are important drivers of its good reputation and ongoing growth. Three years ago, we were expanding into new services like a chatbot. We are generating images dynamically for the people who are querying the chatbot. 
Now, when we generate images dynamically, we need to cache them somewhere so it doesn't overload our server. We turned to a local CDN provider. They can give us caching service in Thailand for a very cheap price. But after using that service for about a year, I found that the service was not so reliable, so we turned to Cloudflare. And for the one year that we have been using Cloudflare, I would say that they achieved the reliability goals that we were expecting. With Cloudflare, we can cache everything locally and the site would be much faster. Wongnai also uses Cloudflare to boost their platform security. Cloudflare has blocked several significant DDoS attacks against the platform and allows Wongnai to easily extend protection across multiple sites and applications. We also use web application firewalls for some of our websites, which allows us to run open source CMSs like WordPress and Drupal in a secure fashion. If you want to make your website available everywhere in the world and you want it to load very fast and you want it to be secure, you can use Cloudflare. With customers like Wongnai and over 25 million other Internet properties that trust Cloudflare with their performance and security, we're making the Internet fast, secure, and reliable for everyone. Cloudflare. Helping build a better Internet.