🎂 Traffic Manager: How Cloudflare intelligently routes traffic
Presented by: David Tuber
Originally aired on September 26, 2023 @ 11:00 PM - 11:30 PM EDT
Welcome to Cloudflare Birthday Week 2023!
2023 marks Cloudflare’s 13th birthday! Each day this week we will announce new products and host fascinating discussions with guests including product experts, customers, and industry peers.
Tune in all week for more news, announcements, and thought-provoking discussions!
Transcript (Beta)
Good morning, good afternoon, and good night. And welcome to Cloudflare TV. If this is your first time watching, welcome.
If this is not your first time watching, well, then welcome back.
My name is Tubes, David Tuber. I'm a product manager at Cloudflare focused on network and performance.
And as you can see, I currently don't have anybody else with me.
And today we're going to be talking about traffic manager.
And, you know, in order to talk about traffic manager, usually the way these sessions work is, you know, questions go back and forth, but since I don't have anybody here, you're going to need some help determining who is asking a question, or what is actually a question and what is a response.
So I've brought on David Tuber, who is also going to be asking questions and explaining what's going on.
I'm trying to make sense of these questions and this topic.
So it's very nice to be here as well. And I'm very excited to see where this goes.
Now, you know, David, you sound a lot like Jimmy Stewart. Is there any particular reason for that?
Well, no, that's just how I sound. You know, I don't really know what you're going for.
But thank you for pointing that out. And let's get right into it.
So, Tubes, traffic management is kind of a weird thing to be talking about for birthday week.
Can you give us a little bit of insight as to why you're talking about traffic management for some kind of week that's all about new innovations and announcements?
That's a really good question, Tubes. And thank you for raising that.
You know, traffic management at its first glance is not exactly the most creative or crazy thing we could be talking about during birthday week, especially for the first day.
And it's a real honor for this blog post to come out about that.
But really what we wanted to highlight is how we are using bleeding edge innovations to improve our traffic management, which in turn improves user performance.
And we do a whole bunch of things that are kind of unparalleled or unmatched when it comes to traffic management.
And we really wanted to highlight those and talk about those in a new and innovative way.
And we thought that, you know, birthday week was a really good time for that.
And so, we're really excited to be sharing that with you.
Well, thank you very much, Tubes. And that's a very good answer.
And the blog as written, which if you haven't read, I highly recommend, talks a lot about improving user performance.
And I think the biggest question that I have is how do I, a normal Internet user, interact with traffic manager on a regular basis?
It's a really good question. And the answer is you shouldn't.
You know, traffic manager is kind of the silent superhero, like almost like if Batman didn't dress up in a cape and just did the job of a normal police officer.
You know, traffic manager works in the shadows and it works silently. And you should never know that traffic manager is actually acting and saving traffic and redirecting traffic.
And traffic manager itself makes hundreds of moves every day. And if we're doing our job correctly, you don't notice a thing and your performance doesn't suffer and you get an amazing highly available performance service all the time.
And that's really why we've gone through all of the iterations that we go through with traffic manager because we want to make sure that no matter what happens on the Internet, no matter what happens with the underlying fabric, your traffic is prioritized and saved and moved around so that you can get the best possible experience.
Well, thank you for answering. That's a very, very interesting take on that.
And, you know, I personally don't think that I've ever interacted with traffic manager, but, you know, now that I'm hearing that and reading the blog, I'm sure that it happens all the time.
Hundreds of moves a day. How does that happen?
You know, hundreds of moves a day, a lot of them are very small. And, you know, as the blog says, we started out with very large moves, like hundreds of thousands of requests per second of traffic.
And now we've gotten it to be quite granular, so that a move could be only, you know, a little bit of traffic.
Maybe it's only a couple of users that we need to move either between locations in a data center or between data centers.
Our goal is to move as little as possible so that we can make sure that everybody is getting the most performant experience they could possibly be getting.
Well, thanks for the update there. And, you know, one thing that I got the feeling of when I was reading the blog is that traffic manager kind of feels like a traffic cop, you know, like letting some people go and making other people wait.
And how accurate is that metaphor of a traffic cop?
You know, when you think about it, traffic manager is actually not like a traffic cop.
You know, a traffic cop and the way that traffic is often directed, you know, with people getting to cross before others implies that there is some waiting.
But in reality, there is no waiting with traffic manager.
It's not that requests that get moved have to wait to be served.
They're just served from somewhere else.
That's why what we spoke about at the beginning, the airport queues, is probably the best metaphor there.
There's enough space for everyone to be served in the event that something goes wrong.
It just might not be at the closest possible data center, but we make sure that even when we move, we move to the closest possible data center outside of the first impact.
Well, thanks for that clarification. You know, that really helps.
I really, really loved reading the blog. It was very, very insightful. But one thing that stood out to me, and you know, I think a lot of people are kind of wondering this, is how does this version of traffic manager differ from other versions of traffic managers?
Like other companies and other people who run networks, obviously must do some sort of traffic management.
And I guess my question is, how do they get their traffic to be managed and what sets Cloudflare apart?
That's a really good question. And honestly, it's something that I think we could have talked about more in the blog, but I want to give a quick highlight on that.
So for the record, Cloudflare is an Anycast network.
And what that means from a networking point of view is that we've got prefixes and IP addresses that we advertise everywhere around the world, the same prefixes we're going to be advertising.
So basically, if you've got an IP address, like 1.2.3.4, every single machine is going to be advertising that in every single data center.
So all data centers will be advertising and you, an Internet user, will pick the closest one to you.
And what that means, from a performance standpoint, is that you're more or less guaranteed the best performance because you're not traveling very far to get where you need to be.
Other networks, by comparison, will do something called DNS-based traffic management.
And the way that that will work is that instead of, you know, resolving to Cloudflare.com and getting the IP 1.2.3.4, you might resolve to competitor.com or network.com, and you might get a series of IP addresses.
And those IP addresses will be given to you in a specific order.
And the order in which they are given is generally determined by what is the best possible data center for you to connect to that can take you.
And the way that they figure that out is by something called round robin.
Basically, they'll round robin the data centers based on how much capacity is available in a given data center.
And this makes a couple of assumptions. The first assumption is that even though DNS, or maybe not even DNS, even though you might get to a certain data center to receive a DNS response, either via Anycast or Unicast, the TCP HTTP connections that you will make subsequently will be Unicasted.
And the way that you can think about this is that if Anycast means that every single Cloudflare data center advertises the same set of IP ranges, Unicast data centers each advertise different IP ranges, and you connect to those to do the same things that you would do at Cloudflare, but the IPs that you are connecting to are different.
So for example, if we're looking at Seattle and San Jose, in Cloudflare terminology, the same prefixes are being advertised from Seattle as they're being advertised from San Jose.
But if you're looking at a Unicast or DNS-based traffic management system, the IP ranges from Seattle are going to be different than the IP ranges from San Jose.
And actually, you can tell that they're different, because you'll connect to different locations and their performance will be different.
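The Seattle and San Jose comparison can be sketched in a few lines of Python. This is a toy model, not how Cloudflare or any other network routes traffic: the prefixes are documentation ranges, the round-trip times are made up, and "closest" is reduced to lowest round-trip time, whereas real Anycast routing follows BGP path selection.

```python
# Hypothetical round-trip times (ms) from one user to each data center.
RTT_MS = {"Seattle": 12, "San Jose": 31}

# Anycast: every data center advertises the same prefix (illustrative value).
ANYCAST_PREFIXES = {"Seattle": "192.0.2.0/24", "San Jose": "192.0.2.0/24"}
# Unicast: each data center advertises its own prefix (illustrative values).
UNICAST_PREFIXES = {"Seattle": "198.51.100.0/24", "San Jose": "203.0.113.0/24"}


def anycast_destination(rtt_ms, advertising):
    """Pick the closest site that is still advertising the shared prefix."""
    candidates = [site for site in rtt_ms if site in advertising]
    return min(candidates, key=lambda site: rtt_ms[site])


def unicast_destination(dns_answer):
    """The client connects to whichever site the DNS answer lists first."""
    return dns_answer[0]


# One shared prefix under Anycast, one prefix per site under Unicast.
print(len(set(ANYCAST_PREFIXES.values())), len(set(UNICAST_PREFIXES.values())))
# Anycast: the user lands on the nearest advertising site.
print(anycast_destination(RTT_MS, {"Seattle", "San Jose"}))
# Unicast: the user lands wherever the DNS answer points first.
print(unicast_destination(["San Jose", "Seattle"]))
```

In this model, steering traffic under Anycast means changing who advertises the prefix, and steering it under Unicast means changing the DNS answer.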
So here's a question for you. Doesn't that implicitly mean that performance is going to be worse in a Unicast based traffic management system?
Well, not necessarily, right? Like, ideally, everybody's got enough capacity to serve everybody in their closest possible location.
And there's really no problems at steady state, right?
At a steady state, there is no inherent difference between a Unicast system and an Anycast system, because if everybody's got enough capacity to serve everybody, then everybody should be going to their closest location.
And whether you're using DNS to do traffic management, or you're using Anycast to do traffic management, it shouldn't matter.
The problem really gets to when something goes wrong, how do those systems respond?
And the answer is different.
That's a really good question. So how is it that a Unicast system responds?
And how is that different than the Plurimog and the Anycast withdrawals that you were mentioning in the blog?
It's a really good question. And let's get into that a little bit.
So the way that it works in a Unicast based or a DNS based system is that every five minutes, or according to the DNS TTL of the HTTP session, you will receive a new set of addresses and you'll reconnect.
And let's say, for example, as most people will do that, they'll use a five minute TTL.
So that means that every five minutes, you're going to need to connect to the DNS server and get a new set of IP addresses, or maybe the IP addresses will be the same.
But let's say that between T and T plus five, there was a capacity problem.
Let's say that the Unicast system in Seattle lost a bunch of machines and no longer can serve all of the traffic.
Okay, cool.
So what the Unicast system is going to do is the Unicast system is going to determine that there isn't enough capacity to serve all of the requests in Seattle, and they're going to serve DNS requests for San Jose.
And when users connect again to do the DNS, they'll get a new address, and they'll move over there.
And that will happen relatively without incident, in that users in the Unicast system probably won't notice that there is a problem, unless the issue happens really early in the TTL.
So basically, if between T and T plus five, there is a problem at T plus one, say exactly.
If something happens like right on the TTL, then you as a user who just got a DNS response, who is connecting to Seattle, which is now having capacity problems, you may have four minutes of really bad performance, then you'll re up your DNS, and then you'll go to San Jose, and then you'll be fine.
But you'll have four minutes of really bad performance, you can really only control the performance based on the TTL.
And that means that either you have to have a really short TTL and continuously re request DNS, or you put your users at risk for having poor performance.
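The TTL arithmetic in that example can be written down directly. This is a minimal sketch that assumes the client only learns about the new data center at its next DNS lookup; the function name is made up for illustration.

```python
def stale_answer_seconds(ttl_seconds, failure_offset_seconds):
    """Seconds a client keeps using its old DNS answer after a failure.

    failure_offset_seconds is how long after the client's last DNS lookup
    the data center started having problems.
    """
    if not 0 <= failure_offset_seconds <= ttl_seconds:
        raise ValueError("the failure must fall inside the TTL window")
    return ttl_seconds - failure_offset_seconds


# Five-minute TTL, failure one minute after the lookup: four bad minutes.
print(stale_answer_seconds(300, 60))  # 240
```

Shortening the TTL shrinks that window, at the cost of clients re-requesting DNS more often.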
So how does that differ from Cloudflare's approach?
So this is a really interesting kind of segue into how our Anycast system has evolved.
And before, you know, with traffic manager, let's call it traffic manager 1.0.
We had a very kind of similar approach to the way that Unicast systems work that if a data center was overloaded and couldn't take requests, we would just withdraw the prefixes and traffic just wouldn't come in.
And that's actually really effective, right?
Like that's essentially replacing what a network engineer would do.
Because in the old days, we would just, you know, wait until there was a problem.
And then the network engineer would get paged for an alert, and then they would go in and they would withdraw the prefixes.
And then something would get fixed, and then they would put the prefixes back, which is fine.
But this, A, takes a long time for us to notice, B, it caused a lot of problems for the engineers, and C, during that time period, users are impacted for a really long time.
And not just like some users, but pretty much all users who are connecting on a prefix.
So as the blog kind of outlines, a slash 24 prefix, which is the smallest block of IP addresses that you can advertise to the Internet, is 256 IP addresses.
The blog does the actual math, but let's assume that every website is mapped to exactly one IP address, and that's 256 requests per second.
So if you withdrew that slash 24, from advertising in Seattle, then you would save 256 requests from going to Seattle, which is broken, and send them to San Jose, which is not broken.
But that's not how the math actually works.
The math is actually closer to thousands and tens of thousands of requests per second on a slash 24.
Because we don't map IP addresses to websites, we don't believe in that.
So when you withdraw a slash 24 from a data center, you're withdrawing tens of thousands of requests per second, and you're sending them other places, which is fine, if the only goal is to relieve pressure in a data center that is impacted by capacity issues or hardware issues.
But that impacts users. That has a material impact on the performance users get on the Internet.
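The slash 24 arithmetic can be checked with Python's standard `ipaddress` module. The prefix below is a documentation range and the per-address request rates are hypothetical, chosen only to show how quickly the naive 256 figure turns into tens of thousands.

```python
import ipaddress


def addresses_in(prefix):
    """Number of IP addresses covered by a prefix."""
    return ipaddress.ip_network(prefix).num_addresses


def rps_moved_by_withdrawal(prefix, rps_per_address):
    """Requests per second that leave a data center when a prefix is withdrawn."""
    return addresses_in(prefix) * rps_per_address


print(addresses_in("192.0.2.0/24"))                  # 256
print(rps_moved_by_withdrawal("192.0.2.0/24", 1))    # 256 (one site per IP)
print(rps_moved_by_withdrawal("192.0.2.0/24", 100))  # 25600 (shared IPs)
```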
And you mentioned before that the real goal is not necessarily to impact user performance, but to seem, you know, completely transparent.
Yes, exactly. We don't want to make it so that you will notice or see a noticeable drop in performance when you move or when we move your traffic.
So withdrawing a slash 24, it'll work, but it's very heavy handed. We want to get even finer and finer grained.
And so we developed Plurimog.
Plurimog sounds a lot like Unimog, your layer four load balancer.
It is. And it's basically an extension of our layer four load balancer.
And it works in kind of two additional levels.
It works between data centers, and it works within data centers, between logical groupings of machines.
So if Unimog works within one logical grouping, Duomog works between those logical groupings, and Plurimog works between data centers.
And so what that allows you to do is it allows you to not just move whole prefixes, but you can specify specific amounts of requests to move, and then you move only those requests.
And by doing that, you basically make sure that your data centers are running as much or receiving as much traffic as they possibly can, while making sure that everybody's got really good performance.
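To see why moving a specific amount of requests matters, compare it with withdrawing whole prefixes. The sketch below is only illustrative: the function names and the numbers are assumptions, not values from the blog or from Cloudflare's systems.

```python
import math


def excess_rps(current_rps, capacity_rps):
    """Requests per second above what the data center can serve."""
    return max(0, current_rps - capacity_rps)


def moved_by_prefix_withdrawal(current_rps, capacity_rps, rps_per_prefix):
    """Whole prefixes only: round the excess up to a multiple of one prefix."""
    prefixes = math.ceil(excess_rps(current_rps, capacity_rps) / rps_per_prefix)
    return prefixes * rps_per_prefix


def moved_by_fine_grained(current_rps, capacity_rps):
    """Fine-grained move: shift exactly the excess and nothing more."""
    return excess_rps(current_rps, capacity_rps)


# Hypothetical data center: 105,000 rps of demand, 100,000 rps of capacity,
# and 20,000 rps riding on each prefix.
print(moved_by_prefix_withdrawal(105_000, 100_000, 20_000))  # 20000
print(moved_by_fine_grained(105_000, 100_000))               # 5000
```

In this example the prefix withdrawal moves four times more traffic than needed, which is the heavy-handedness described above.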
Well, that's really good.
That's a really good approach. And you know, I think that is really kind of the core of Traffic Manager.
So how does that compare to your Unicast-based methods?
That's a really good question. And I think that the answer is, I think it best approximates a layer five or a DNS based approach.
Because when you're using DNS, you actually get a little bit more control over how many requests or how many users you're returning IP addresses to, right?
You can set that as granular as you want. But the problem is you're bound by the TTL.
So with Plurimog, we're able to completely adjust the percentage of requests down to a very fine grain.
But because it's not DNS based, it's completely real time. These changes can take effect in seconds.
And that allows us to be better than these other DNS based systems, because it really sets us apart.
It allows us to be performant, as well as preserving availability, and making sure that it's completely transparent to you.
And that's really the goal here.
Well, that's really cool. And I really think that that's a really big innovation.
So now I'm starting to see a little bit why we're talking about this for birthday week.
But one of the things that you spend a lot of time on is talking about Traffic Predictor.
So Traffic Predictor is really cool. And I really love the graph that you showed of all the places that traffic will be sent.
Why is traffic prediction so important, especially with the advent of Plurimog, which allows you to move traffic deterministically, as opposed to using BGP?
So that's a really good point. And you know, Tubes, you said some things in that, that I'd actually like to kind of expand upon.
We've got about 10 minutes left.
So I want to be brief, because we've got a couple more questions. Traffic Predictor is very important, because it allows us to understand the dynamics of the Internet, and where traffic will go if we remove it from data centers.
And for Traffic Manager 1.0, when we were only advertising and withdrawing prefixes, like you said, following the natural flow of BGP is really important, because it allows us to determine the blast radius.
So when you say blast radius, what exactly is it that you're talking about?
Well, you know, when you're talking about blast radius, what you're really talking about is how much traffic will move from the initial data center to the target.
And then if that target has to move traffic, where does that go, right?
We do first and second order operations in Traffic Predictor, so that we can understand the flow of traffic on the Internet.
And so that we can understand where users are naturally going to go, if we have to move traffic around.
And believe it or not, that actually helps inform Plurimog, right, as well.
Because if we have to move traffic, we want to move it to where it's closest.
And where it's closest is not always as intuitive as you think.
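A first and second order prediction can be sketched as a small simulation. Everything here is assumed for illustration: the fallback shares, the spare capacities, and the function itself are hypothetical and are not Traffic Predictor's real data or algorithm.

```python
def predict_moves(source, rps, fallback, spare_capacity):
    """Estimate where traffic lands when it is removed from `source`.

    fallback[dc] maps a data center to the share of its traffic that would
    naturally flow to each other data center. Only two orders are modeled.
    """
    # First order: traffic leaves the source along its natural fallbacks.
    landed = {}
    for target, share in fallback[source].items():
        landed[target] = landed.get(target, 0) + rps * share

    # Second order: any target pushed past its spare capacity spills onward.
    result = dict(landed)
    for target, arriving in landed.items():
        overflow = arriving - spare_capacity.get(target, 0)
        if overflow > 0:
            result[target] -= overflow
            for nxt, share in fallback.get(target, {}).items():
                result[nxt] = result.get(nxt, 0) + overflow * share
    return result


# Hypothetical topology and numbers.
FALLBACK = {
    "Singapore": {"Hong Kong": 0.5, "Los Angeles": 0.5},
    "Hong Kong": {"Tokyo": 1.0},
}
SPARE = {"Hong Kong": 300, "Los Angeles": 1000, "Tokyo": 1000}

print(predict_moves("Singapore", 1000, FALLBACK, SPARE))
```

With these inputs, Hong Kong keeps only the 300 requests per second it has room for and the remaining 200 spill to Tokyo, which is the kind of knock-on effect a blast radius estimate needs to capture.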
So when you're talking about intuitive, like, can you give some examples about, like, when users would go from one location to another, and that's not always 100%, you know, obvious?
Sure. It's a really good question. And, you know, the Internet's kind of funky like that.
There's definitely a lot of places where, like, if you're peering locally, and that local peering goes down, you might go somewhere that you don't necessarily expect.
And so here are some really good examples of that.
So let's say that you're in Johannesburg, South Africa, and you know, there's peering issues in Johannesburg, you're likely going to fail over to London.
If you're in Sao Paulo, and you have peering issues, you're going to fail over to Miami, Florida.
If you're in, you know, Seoul, South Korea, and you have peering issues, you might fall back to Tokyo, or even Seattle.
And if you're in Singapore, or Hong Kong, and you have peering issues, you might fall back to the United States, which is, you know, obviously a whole continent away, and that adds an extra 120 milliseconds of latency.
So despite what you might think about where traffic would fall back naturally, if Singapore fails, you might want to route it to Hong Kong.
But by doing that, you might actually make the user's experience worse.
And so Traffic Predictor serves the purpose of allowing us to understand where the traffic is naturally bound to flow, so that we can point Plurimog in those directions to keep performance about as close as we can possibly get to being the best.
And yeah, it's not going to be perfect every time.
And in the examples that I gave, if we have to do failover out of those data centers, you know, we run into problems.
But in those locations, we've actually been augmenting our data centers to remove single points of failure and building what we call Multi-Colo PoPs, or MCPs, so that we can keep traffic as local as possible all the time.
So you just used a new term there. That's a really interesting term, Multi-Colo PoP.
Can you tell us a little bit about how exactly that helps keep traffic local?
That's a really good question, Tubes, and let's get into it.
A Multi-Colo PoP is basically a whole bunch of small PoPs combined into one kind of super PoP.
And so for the rest of this discussion, the smaller logical groupings of capacity will be called colos, and the larger overarching grouping will be called a PoP, or a point of presence.
Our colos, we have many of them, obviously, in a data center.
This is like, I feel like I'm just explaining the acronym here.
But each of those colos is logically segmented. And what that means is that they can operate independently from each other.
That means one colo can go down, and the other colo can stay up.
And you might think, well, obviously, like, that makes total sense.
Why wouldn't we be able to do that? You know, not all data centers are built the same.
And some data centers that are smaller don't need to operate like that.
For example, we have many data centers that have, like, only a handful of machines.
And those data centers are very valuable, or those locations are very valuable.
But, you know, they don't need to operate in an MCP fashion.
They don't need to operate with those colos being independent.
They can just have one colo. And so what MCP allows us to do is remove single points of failure, so that if one colo has a problem, we just take it offline.
And then the traffic moves from that colo into the other colos with Duomog, like we talked about before.
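The colo failover described here can be sketched as a redistribution of load inside one PoP. The proportional-to-spare-capacity rule, the colo names, and the numbers are all assumptions made for illustration. The source does not say how Duomog actually splits traffic.

```python
def drain_colo(load, capacity, failed):
    """Move the failed colo's load onto the other colos in the same PoP,
    in proportion to each colo's spare capacity (an assumed policy)."""
    spare = {c: capacity[c] - load[c] for c in load if c != failed}
    total_spare = sum(spare.values())
    to_move = load[failed]
    if to_move > total_spare:
        # In the transcript's terms, this is where traffic would have to
        # move between data centers instead of staying inside the PoP.
        raise ValueError("not enough spare capacity inside the PoP")
    new_load = {failed: 0}
    for colo, room in spare.items():
        share = room / total_spare if total_spare else 0
        new_load[colo] = load[colo] + to_move * share
    return new_load


# Hypothetical PoP with three colos, each able to serve 100 units of load.
load = {"colo-a": 60, "colo-b": 40, "colo-c": 70}
capacity = {"colo-a": 100, "colo-b": 100, "colo-c": 100}
print(drain_colo(load, capacity, failed="colo-a"))
```

Here colo-b has twice the spare room of colo-c, so it takes twice as much of colo-a's traffic, and no colo ends up above its capacity.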
Well, it's all coming back full circle there.
It is. And once they move between data centers, you know, those data centers, you know, maybe can take on more traffic, because we always build these data centers to have enough capacity to handle not just the current load, but additional load.
These MCP locations are very, very large, and they are built in our metro, in our major metro areas.
Specifically so that those scenarios that we talked about, where traffic gets moved from, let's say, Singapore to Los Angeles or Seattle, don't happen, and we can keep all of our traffic as local as possible.
So that's a really good example of how we're using even our infrastructure and our infrastructure design to keep traffic performant and keep traffic local, because at the end of the day, keeping it local is what keeps it performant.
Well, that's a really good point, and I feel like we're getting a little bit off topic, but I want to say the magic words, because we haven't said the magic word in, you know, pretty much the entire session.
I want to use the phrase AI and ML, because one of the big themes of birthday week is AI and ML.
How is traffic management and traffic manager using AI or machine learning to be successful?
I mean, obviously, it is, but, you know, can you give a little bit more insight into how that's happening?
Well, this is where it gets really interesting, right?
Because, you know, we've talked about, like, traffic getting moved between data centers, but a really good question that kind of, that the blog answers, obviously, and you should read the blog if you haven't, but one question that the blog answers is, how does traffic manager know when it's time to move traffic?
And the answer, initially, was we would set CPU limits, and we would collect aggregate CPU metrics from our machines, and then basically say, when the CPU limit reaches this threshold, then we move traffic out.
But those limits are static.
Those limits in the before times were static, which meant that, like, every data center had the same limit, and that limit was set.
But what we're learning, or what we've learned, is that not all data centers are the same, not all traffic patterns are the same, and sometimes it might behoove us to have a higher threshold.
Our machines can take more requests before we need to start moving them, or that threshold should be lower.
Whatever it is, we want that threshold to be dynamic.
And so, how do we know what that threshold is?
Well, when our user performance actually starts to degrade. So, we look at this metric called CF check latency.
And CF check latency is basically a determination of the time spent in all of the components of the Cloudflare stack, and if that latency starts to go up, that means that we're hitting a threshold that is starting to impact user performance, so we should move traffic out.
And so, essentially, what we do is we've got a system and a service that watches this latency and makes inferences based on the changes from the previous days, weeks, months, and then sets the threshold dynamically.
And it does that using machine learning.
And so, yeah, that's essentially how we do that, and it allows us to be very adaptable to whatever user load is placed on our service, and allows us to be smart, allows us to be fast, and allows us to be responsive.
So, it's a really great application of machine learning, and we're really excited to be introducing that, and really excited to be talking about it.
Like, it's super duper cool.
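The idea of letting latency decide the threshold can be illustrated with a deliberately simple rule. To be clear, this is a stand-in and not the machine learning system described above: the samples, the 20 percent tolerance, and the function are all hypothetical.

```python
def dynamic_cpu_threshold(samples, baseline_latency_ms, tolerance=1.2):
    """Highest CPU utilization seen before latency degrades.

    samples is a list of (cpu_utilization, latency_ms) pairs observed in one
    data center. Latency counts as degraded once it exceeds the baseline
    multiplied by `tolerance`. Returns None if even the lowest sample is
    already degraded.
    """
    limit = baseline_latency_ms * tolerance
    threshold = None
    for cpu, latency in sorted(samples):
        if latency > limit:
            break
        threshold = cpu
    return threshold


# Hypothetical observations: latency is flat until about 70 percent CPU.
observed = [(0.5, 20), (0.6, 21), (0.7, 23), (0.8, 30), (0.9, 45)]
print(dynamic_cpu_threshold(observed, baseline_latency_ms=20))  # 0.7
```

A data center with a different traffic mix would produce different samples and therefore a different threshold, which is the point of not using one static limit everywhere.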
It really is super duper cool, Tubes. And so, we've got about 90 seconds left.
So, one question that's been on my mind is you could really just be doing this with a normal person.
Why do you feel the need to be going so far deep into this space?
Why even bother?
It's a really good question. At its core, Traffic Manager is a solution to something called the bin packing problem, and you've done the bin packing problem before if you've played Tetris, right?
Like, you want to keep the Tetris blocks as low as possible, and if you're like me, you get that sweet satisfaction when you do. And Traffic Manager, through all of its iterations, has just driven that Tetris block baseline lower and lower and lower.
And by doing that, not only do we improve our infrastructure, but we improve user performance.
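For readers who have not met the bin packing problem, here is a minimal sketch of one classic heuristic for it, first-fit decreasing. It is offered only to make the Tetris analogy concrete. Nothing in the source says Traffic Manager uses this heuristic, and the item sizes are arbitrary.

```python
def first_fit_decreasing(items, bin_capacity):
    """Pack items into as few bins as this heuristic manages.

    Items are placed largest first, each into the first bin with room.
    """
    bins = []
    for item in sorted(items, reverse=True):
        if item > bin_capacity:
            raise ValueError("item is larger than a bin")
        for contents in bins:
            if sum(contents) + item <= bin_capacity:
                contents.append(item)
                break
        else:
            # No existing bin had room, so open a new one.
            bins.append([item])
    return bins


# Arbitrary chunks of load packed into bins of capacity 10.
print(first_fit_decreasing([5, 4, 3, 3, 2, 2, 1], bin_capacity=10))
# [[5, 4, 1], [3, 3, 2, 2]]
```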
And really, at the end of the day, that's what it's about. It's not about reducing on-call pain, although it does that.
It's not about producing really cool blogs and building really cool stuff, although we do do that.
It's about improving the user experience and making sure that you, the user, can have the best possible experience on Cloudflare for as much as possible, and that you don't even know that any of this is happening.
So, this is a really cool way to peel back the curtain, but at the end of the day, you should come out of this knowing that we've got your back, and all of this is just going to keep happening, and no matter what happens on the Internet, we'll be there for you.
We will be there for you. Thanks so much, Tubes.
It was really great talking to you. It was really great talking to you, Tubes.
Thank you so much, and thank you for watching, and I hope you had a really good time, and I'll see you later.
Bye!