Cloudflare TV

Hardware at Cloudflare (Ep 4)

Presented by: Rob Dinh, Brian Bassett, Ben Ritter, Tom Strickx
Originally aired on May 18, 2021 @ 6:30 PM - 7:30 PM EDT

Learn about the routing hardware that makes up Cloudflare's Edge Network.

Transcript (Beta)

Thank you, everybody, for tuning in. Welcome back to Hardware at Cloudflare. This is episode four.

This is your host Rob from SRE and Brian from Hardware. Like every other episode, we are here to talk about what makes up Cloudflare and, more importantly, who makes up Cloudflare.

So this episode we're talking to our network engineers.

So we have Ben Ritter from Austin and Tom Strickx from London. Welcome, guys.

How are you doing? Hey thank you for having us. Yeah thanks for having us.

Tom I gotta know what room are you in? That looks like it's from the future. This is my dad's office.

I've currently emigrated to Belgium for a week so I'm currently staying at my parents.

Very cool. I'm taking good advantage of the office, including the not-so-nicely patched network rack next to me.

Okay. Family business I see.

Yes. Yeah you got it. Either you have a really good voice or you have a really nice mic.

Every time I hear you. It's probably going to be a little bit of both.

Okay. Okay. All right. So yeah we can get on with it. We have a program here.

It looks like we're going to have a little bit of overview about what routing hardware is and how we use it here at Cloudflare.

For any questions please feel free to email us at livestudio at Cloudflare.tv.

Looks like currently we already have our slideshow up and open and sharing for everybody.

I'll go ahead and take it away.

Let's see. So we're going to go down the talking points here, or...

Yeah, Rob, if you want to share.

Yep. There you go. So I guess one of the things that we get asked pretty frequently is what the network team does at Cloudflare.

And you know the largest thing that our team does is we influence the routing to all the various Cloudflare POPs.

And that sort of begets the question well what's a router and why is it kind of important from the scope of Cloudflare.

And a router is much like any other machine or computer out there on the Internet.

And it's responsible for interconnecting different networks.

Those networks could be run by you know public institutions, private businesses, Internet service providers, and all these different entities come together on the Internet and interconnect with routers.

And that's kind of what builds up the Internet.

It's a network of networks. And one of the difficult things that a router has to decide is, if it has a packet or a piece of information which is destined for Cloudflare, it needs to figure out what the best way is to actually get that traffic to Cloudflare.

It may have multiple paths perhaps going through different cities or different independent links.

And our network team focuses on how we attract or pull traffic to different Cloudflare POPs around the world.

I think the best way, like a friend of mine keeps saying this, like every single time we see him and we have a conversation about the Internet or how networks work, he just constantly keeps saying the Internet is just another abstraction layer.

And I think that's a very good way of explaining this is basically for all intents and purposes for end users and for most application developers the Internet is not really a thing, right?

It's just this utter abstraction thing where you make sure that if I can go from A to B that's fine.

I don't care how you get from A to B but that's the thing that the network team is responsible for.

We make sure that you get from A to B preferably as fast and as smoothly as possible but unfortunately that's not always the case.

But that's what the network team does I think is a good way of describing it.

Yeah, and I like that, you know, people look at the network as a utility, or at least they want to, right?

They expect it to be fast and they expect it to be reliable and people really only kind of gripe about the network if they think there's a problem.

And you know most of the time I feel like we get it right and things are kind of smooth sailing.

But yeah, you know, it's a really interesting layer to work on. There's a lot more complexity, I think, than a lot of people realize sometimes, and it allows us to really solve some interesting problems.

Yeah complexity. We forgot to mention that if folks want to ask a question for Ben or Tom they can email livestudio at Cloudflare.tv and we'll get them in real time and we'll try and answer them while we're on the air.

Yeah, I'm just looking at the shared screen that we have now, so there's a couple of dots and, you know, we've got those arrows coming in here and there.

When I think about a network, I'm thinking of like some kind of a spider web, or maybe sort of like a ripple, where you have one dot that kind of spreads out everywhere.

But in a way you know that packet has to get from you know that one point, point A to point B.

But it looks like it could be many different ways that you could get to it, right?

So do we always think that we have to find a straight-line path, or could there be other, alternative ways that make, I guess, for a better user experience?

So I think it's important to make two distinctions here, right?

It's whether this is an internal network or an external network, because there's two very important distinctions you can make there, because with an internal network you own the end-to-end path.

Like you own that entire thing, right? That would basically mean, if this was an internal network, which it's not, but if it were, then Cloudflare for example would own all the dots and Cloudflare would be responsible for choosing the path between those dots to get from point A to point B.

But if it's the public Internet for example then Cloudflare would only own that orange dot on the map and everything else is dependent on other networks.

So that also means that for all intents and purposes Cloudflare is only responsible for their little, you know, it's our little like sandbox within the major sandbox that is a beach for example, right?

And we're only responsible for that tiny bit. But that still means if we want to get from one point to another point we're reliant on other networks on doing the right thing.

But the right thing for us may be something completely different than the right thing for another network because we may care about latency but another network may care about port utilization or may care about I want as much money from my customers as possible, right?

So that's a thing we don't control.

That's a really good point which is, you know, outside of your own network boundaries, you know, we use a protocol called BGP of course to speak in between these different routers and we use it to speak to other networks, you know, to networks that we don't control or have any sort of say over the routing.

And we can use mechanisms within that routing protocol to kind of suggest or ask that other networks route to us a certain way.

But at the end of the day it ultimately comes down to the operators at the other end.

They can override sort of the hints and suggestions that we put into that protocol.
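
One common mechanism for those hints (my example here; the speakers don't name a specific one at this point) is AS-path prepending: a network repeats its own AS number in the path it advertises, making that route look longer and therefore less attractive, while remote operators remain free to ignore the hint. A minimal sketch:

```python
# AS-path prepending, sketched in plain Python. AS 13335 is Cloudflare's ASN;
# the rest is illustrative only and not how any router actually stores routes.
CLOUDFLARE_ASN = 13335

def prepend(as_path, times):
    """Advertise the route with our ASN repeated, to discourage (not forbid) its use."""
    return [CLOUDFLARE_ASN] * times + as_path

normal_path = [CLOUDFLARE_ASN]
discouraged_path = prepend([CLOUDFLARE_ASN], 3)
print(len(normal_path), len(discouraged_path))  # 1 vs 4: most, but not all, networks prefer the shorter path
```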

And that protocol, as you alluded to, Rob, you said, well, how do we pick the best path throughout the network?

You know, it's not tuned necessarily to latency or physical distance to try to minimize propagation delay or anything like that.

The BGP, you know, route selection process is kind of a policy language that allows all these different businesses to make suggestions.

But at the end of the day each organization is kind of going to do what's best for their own routing.

Are those policies figured out all by the OS that's running on the router or do the servers have any say in that matter?

Tom, you want to take this one?

Yeah, I mean, that's the thing, right, is routers can be servers and servers can be routers.

We're starting this journey within the networking space called disaggregation, where right now and in previous years you had very dedicated hardware which does routing specifically.

We're getting more and more into a space where you just install enough network ports in a server and you make it act like a router.

You start using open source utilities like BIRD or Free Range Routing or a bunch of other platforms that you can start using to start speaking BGP, basically.

Because, again, BGP is just another communication protocol for all intents and purposes and some people have already used BGP to start playing battleships, for example.

You can do a lot of interesting things with BGP that don't necessarily have anything to do with routing, but that also means that you can run a routing daemon on any commodity hardware and make it act like a router.

And Brian, to kind of approach your question from a different angle, different vendors implement BGP slightly differently, but from the grand scheme of things, they more or less tend to choose or use the same criteria to evaluate and pick the best path.

The key, what distinguishes one network from another in how they choose to route, is going to be based on the policy that the administrators of those networks put in and how they tune the network.

So, you know, many, as Tom mentioned, many networks, you know, let's say they're an Internet service provider and they bill their customers based on the amount of bandwidth that the customer uses throughout the month.

Then it's in that Internet service provider's best interest to try to attract as much traffic as they can so they can overall, you know, provide the customer with more service.

How dynamic are those policies?

Are they changing daily, like hourly? About how often? Yeah, so someone likes to say that in many ways the BGP or the global routing table, also known as like the default free zone, meaning it's a routing table which covers the entire Internet without a default route, meaning there are specific routes for every network on the Internet, is like a database in many ways.

And they say it's a database that never, like, converges, meaning there's always networks coming and going, Internet links coming and going, which are, you know, going down and coming back up and being restored to service.

And as a result, you know, the routes that you use to send a packet across the Internet vary, you know, throughout the day.

They even vary throughout, like at the same point in time, depending on where you are in the world, that will be a different view.

So it's like you basically run this massively distributed system, and like Ben said, that never really converges into a singular state.

So the view that you have from the default free zone in Asia Pacific or the US West Coast, for example, can be wildly different.

And that depends, again, that is a property of those policies that Ben talked about, because it can be that, for example, a customer only wants their routes advertised in Asia, but that then means that their routes aren't being seen in Europe, for example.

And a bunch of changes like that mean that that default free zone is such a variable thing that it's very hard to quantify.

So I kind of wanted to like steer us back into routing hardware a little bit.

Sure. One of the things that, you know, we'll get back to this, but one of the things that's unique about a router is that, you know, it has many interfaces connected to many different networks.

You know, your average host, like a laptop perhaps, maybe has, you know, an Ethernet interface or a wireless interface, or your mobile phone might have a couple radios in it, you know, for Wi-Fi and different cellular networks.

But a router, you know, can have many interfaces on it, you know, anywhere from 36 to potentially a few hundred interfaces interconnecting to different networks.

And then the other thing that also kind of makes a router interesting is that most of the traffic that comes into an interface on a router isn't destined to that router.

The router is only a middleman. So it's responsible for taking that packet, figuring out what the best location to send that packet is, and sending it out the right interface.

And it has to do this all very quickly. So that's kind of a different paradigm from a host or a server or a client, because most of the network traffic that you see on your standard computer is destined to that computer or sourced from that computer.

And, Rob, if you could advance to the slide of the packet.

What we've done is just taken a real quick packet capture using Wireshark on one of our machines here.

And what we see here is a, you know, a simple packet capture of a three-way handshake using TCP and then a TLS negotiation.

And this is my laptop here connecting to Cloudflare.com using a wireless connection over IPv4.

And the piece of information that I've circled here is the destination IP address, which is really the one piece of information that the router really needs to look at.

And the router looks at the destination IP address on every single packet that comes through and is able to very quickly make a decision based upon that destination address as to which interface the router should send that packet.

And it's able to do this, you know, hundreds of thousands or millions of times per second using some very specialized hardware called an ASIC.
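
As a rough illustration of the lookup Ben is describing (a minimal sketch, not Cloudflare's implementation; the prefixes and interface names are made up), the ASIC is essentially doing a longest-prefix match on the destination address and mapping it to an egress interface, just in dedicated hardware and millions of times per second:

```python
import ipaddress

# A toy forwarding table: destination prefix -> egress interface.
forwarding_table = {
    ipaddress.ip_network("0.0.0.0/0"): "ethernet-1",        # default route
    ipaddress.ip_network("104.16.0.0/13"): "ethernet-7",
    ipaddress.ip_network("104.16.132.0/24"): "ethernet-12",
}

def lookup(dst_ip):
    """Return the egress interface for the longest (most specific) matching prefix."""
    addr = ipaddress.ip_address(dst_ip)
    matches = [net for net in forwarding_table if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return forwarding_table[best]

print(lookup("104.16.132.229"))  # -> ethernet-12: the /24 wins over the /13 and the default
```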

I guess the ASIC would be this one here.

Yeah, yeah, perfect, perfect transition. Yeah, so, you know, a router is actually very similar to an off-the-shelf computer, except it has an ASIC, right?

So, one of the ways I kind of like to explain what a router is to people who are familiar with computing, but perhaps not with networking, is that, you know, if you're going to build a gaming PC and you want stellar graphics, chances are that you wouldn't want to use the built-in or integrated graphics.

You would rather go and install a dedicated graphics processing unit or GPU, which you could use to offload a lot of the graphics processing.

And the reason is because there's a special instruction set which is tailored to do all the graphics processing, which has been built into that particular system.

The ASIC in a router is very similar. On any router, you have a standard interface called a management interface, which you can see in the upper right hand of this diagram.

And that interface is similar to a network interface on any other computer, because traffic that comes in that interface is ultimately going to be influenced by the CPU.

On the other hand, all these interfaces that you see at the bottom, those Ethernet 1 through 36, for example, those interfaces don't go up to the CPU.

Those interfaces all connect to each other through the ASIC.

And the reason is because the ASIC can ultimately forward traffic that much faster than the CPU.

If we tried to forward packets from 36 different interfaces at line rate through the CPU, the CPU would simply bottleneck, because it's a general purpose CPU and it has instructions and capabilities to do so many different things aside from forwarding packets.

Whereas the ASIC is a specially designed chip, which has a very reduced feature set, but is very good and accelerated at doing one thing really well, and that's packet forwarding.

Right, so the CPU is kind of your server part of a router, right?

Everything that you do in terms of management and configurations or firmware versioning, it goes through the CPU.

And then the ASIC, it's sort of like your specialized CPU or transistor, I guess you could call it.

And all these Ethernets is what I'm imagining, the ports basically on your router, right?

Yeah, so I'll go back to my analogy. When you connect a GPU to your CPU, ultimately it usually connects on like a PCI bus within your computer.

The ASIC in most routing and switching platforms actually does the same thing.

So the path from the ASIC to the CPU is very small, and it's very rare that we have traffic coming in one of the ASIC connected interfaces that needs to go to the CPU.

The exception would be stuff called control plane traffic. That's stuff that's destined to the router itself, and that would include like our BGP protocol, which we use to speak to our different BGP neighbors.

But generally the goal is to be able to promote all your packet forwarding such that it never needs to get what we call punted up to the CPU.

The goal is that a packet will be able to come in one ASIC interface and egress another ASIC interface because it's very high speed and doesn't have to do an interrupt to go to the CPU itself.
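
A rough way to picture that punt decision (my own simplification, not vendor behaviour): only traffic addressed to the router itself, like a BGP session, goes up to the CPU; everything else is forwarded ASIC to ASIC. The addresses below are documentation examples, not real ones:

```python
ROUTER_OWN_IPS = {"192.0.2.1", "198.51.100.1"}  # addresses configured on this router

def handle(dst_ip):
    """Decide whether a packet is control plane (punt) or transit (forward in hardware)."""
    if dst_ip in ROUTER_OWN_IPS:
        return "punt to CPU (control plane: BGP, SSH, SNMP, ...)"
    return "forward in ASIC (data plane, never interrupts the CPU)"

print(handle("198.51.100.1"))    # e.g. a BGP neighbour talking to us
print(handle("104.16.132.229"))  # transit traffic passing through
```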

Okay, so I'm trying to imagine something like, maybe I'm thinking of traffic or like entropy, where maybe we hit a node, or a packet hits this node because it was meant to hit that node, and it might end up somewhere else.

But there might be congestion, right? Like I might hit a highway intersection and there's gonna be like a congestion there.

Do all these flow of packets kind of stop or lag in, I guess, inside this ASIC before it gets to be anywhere?

Or does it get lost? You know what I mean? I'm trying. I'll let you take the congestion question.

I mean, you basically just hit on one of the massive pain points and interesting points of network engineering, which is buffers.

Because that's kind of what you get into with congestion as well: what a router can do, or what a switch can do as well, is buffer that packet.

So if you have enough memory, what you basically do is you just store that packet in memory somewhere until you're able to, like until you have the signaling that that port has available bandwidth, so that you can actually send that packet down the wire.

So what you can see sometimes is when there's congestion is you can start seeing that your latency can start varying quite rapidly.

There'll be like standard deviation of like 100 or 200 milliseconds.

That's buffering somewhere along the path or along multiple points in the path where basically there's going to be a router or a hop holding that packet for still like hundreds of milliseconds or tens of milliseconds, which is very negligible for us.

But over the longer period, that becomes a bigger and bigger thing.

So you can, but you need like, if you want to store a packet for a long time and you have a lot of ports at a high speed, you need a whole lot of memory, right?

It's like, if you want to store a hundred gig at like a hundred milliseconds, you need like eight gigs worth of memory.

I'm not doing the math. But it's a mad amount of money.

It's a mad amount of money. It's a mad amount of memory as well.

But the thing with that is that memory still needs to be super, super, super, super fast as well, because the moment it's available, you need to be able to get it out of that memory onto that port as soon as possible.
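
For a rough sense of the numbers Tom is gesturing at (he explicitly wasn't doing the math on air, so these are my figures): buffering a 100 Gbit/s port for 100 ms means holding one bandwidth-delay product's worth of data in very fast memory.

```python
link_bps = 100e9       # 100 Gbit/s port
buffer_time_s = 0.100  # held for 100 ms

buffer_bits = link_bps * buffer_time_s
buffer_gigabytes = buffer_bits / 8 / 1e9
print(f"{buffer_bits / 1e9:.0f} Gbit buffered = {buffer_gigabytes:.2f} GB of memory")
# -> 10 Gbit buffered = 1.25 GB, and that's for a single port at a single hop.
```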

Yeah. And there's a few times when, you know, buffering can happen.

And that can be like, if you have a speed change.

So you go from one speed of interface, like a high speed interface, like a hundred gig interface down to, for example, like a 10 gig interface, you have a lot of traffic coming in on a big, you know, high speed interface.

Maybe that hundred gig interface isn't full and it needs to kind of be stepped down to go to another slower interface.

Or the other one is if simply the interface is full, meaning it's congested.

Like we have a hundred gig interface and we have 99 gigs of traffic and it's, you know, teetering on hitting a hundred gigabits per second or something.

And, you know, I want to be clear, like both of these situations, both the oversubscription and the congestion are things that we avoid here within our network.

You know, we have alerts and things that can fire if we have some sort of congestion event that happens, but, you know, out on sort of the wild west of the Internet, these things happen.

And it's kind of a debate within network engineering, sometimes whether it's better to buffer a packet or whether it's better to drop that packet and let it get retransmitted because anytime you buffer a packet, you ultimately end up adding more latency to the end to end connection.

All right. Brian, I think you're on mute. Yes. I'm on mute, yeah.

Did you unmute yourself again, Brian? Yes, sounds good. I hear you, Brian.

Is it possible for these buffers to run out like to where you actually drop packets on the router?

Yes. Yeah. So, you know, depending on the router hardware or the switch hardware, different platforms allocate the memory, it's called TCAM, to the different interfaces of the router differently.

So some have what's called a shared buffer, excuse me, some have individual buffers, where each port has a fixed amount of the overall memory, and some are shared, where if one port's kind of hungry and using a lot of bandwidth, it can kind of borrow from that memory pool and share it over.

But we start getting into the realm of like queuing theory and how you allocate this memory.

And ultimately, like, yeah, we very quickly get into the weeds on this.

And it's probably a little deep for this session.
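
Without going any deeper than the conversation does, a toy model of the two allocation schemes Ben mentions (the sizes are arbitrary and not from any real switch) looks something like this:

```python
TOTAL_BUFFER_MB = 64
PORTS = 32

# Dedicated buffers: every port gets an equal, fixed slice, idle or not.
dedicated_per_port_mb = TOTAL_BUFFER_MB / PORTS  # 2 MB each, no borrowing

# Shared buffer: ports draw from one pool until it runs out.
class SharedBuffer:
    def __init__(self, total_mb):
        self.free_mb = total_mb

    def reserve(self, mb):
        """A congested port asks for buffer space; once the pool is empty, packets drop."""
        if mb <= self.free_mb:
            self.free_mb -= mb
            return True
        return False

pool = SharedBuffer(TOTAL_BUFFER_MB)
print(dedicated_per_port_mb)  # 2.0 MB per port, fixed
print(pool.reserve(10))       # True: one hungry port borrows far more than its "fair" 2 MB
print(pool.free_mb)           # 54 MB left for everyone else
```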

That was kind of like my next question, actually, if it's just going to be sort of how I understand, you know, server memory.

You could kind of play around with LRU and all that kind of stuff.

And yeah, I could see how buffers like it adds up, right?

Because, you know, at least in hardware, we try to minimize the number of points in between, you know, a request, or I guess a packet, inside the server.

So we have your CPU, your memory, and, you know, we try to avoid this, but, you know, we have it there for just in case.

So, like, there could be, I guess, in your BGP, just too many dots.

So that could be, I guess, detrimental, right, to the path?

Yeah. You know, most of the time on the Internet, what determines the path of the route that gets used from one end to the other is the number of autonomous systems or independent businesses that that packet has to traverse through.

And to be honest, looking at an AS path, a list of different businesses that that packet goes through, you can't always necessarily be confident that, you know, hopping through two businesses is perhaps better than four, because you don't really know the scope of the network that it's traversing.

You know, if we're talking about a packet traversing a big tier one carrier, we could potentially, you know, have a packet ingressing in Washington State and egressing in Miami, you know, on the opposite side of the continent.

And from a BGP perspective, it's going to look similar, it will look like one single AS hop if it happens to stay on that provider's network the whole time.

So, yeah, you have kind of like two different big factors for delay on a packet. You have your propagation delay, like how far the packet has to travel, like as the crow flies, which is more or less a factor of the speed of light; you multiply that by some constant, and you can more or less figure out what the ultimate latency is for a packet to, you know, traverse any distance.

And then the other big contributing factor of all the different compound latencies is going to be based on buffering, depending on, you know, which different routers you're going through and how congested those routers are at that moment.
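
To put a number on the propagation-delay part (a back-of-the-envelope calculation of my own, using the common approximation that light in fibre travels at roughly two thirds of c):

```python
SPEED_OF_LIGHT_KM_S = 300_000
FIBRE_FACTOR = 2 / 3  # light is slower in glass than in a vacuum

def one_way_delay_ms(distance_km):
    return distance_km / (SPEED_OF_LIGHT_KM_S * FIBRE_FACTOR) * 1000

# e.g. Washington State to Miami is very roughly 4,400 km as the crow flies
print(f"{one_way_delay_ms(4400):.0f} ms one way, ~{2 * one_way_delay_ms(4400):.0f} ms round trip")
# -> about 22 ms one way, ~44 ms RTT, before any buffering is added on top
```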

Yeah, yeah, I could see how, I mean, we're trying to send a packet and, you know, chances are it goes to like another AS, and we just have no idea what their infrastructure is like, right, like how many dots they have, or whether each of those dots has the buffers or anything like that.

So I guess it's a bit of a feedback loop, at least once it goes back to our own network, or to another dot in our BGP.

And it's not really, like, BGP doesn't really have a feedback mechanism; the feedback loop is on your TCP stack, because TCP has this principle called the scaling window, right, which allows you to basically send more packets, which allows you to then send more data through without getting an acknowledgement, with that acknowledgement being that feedback loop, getting that acknowledgement back.

But the thing is, with packet loss, for example, or high latencies, that window remains very, very small, which means that your throughput is going to go like right through the floor, like there's just not going to be any throughput.

So that's, that's a bit of that feedback loop. But within layer three, like where the router sits, there's no feedback loop at all; normally we don't get any feedback on, like, distance or latency or anything like that.

There are ways that can be communicated; it's just that usually most networks don't listen to it, or ignore it, or explicitly set it to zero, for example.
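
The throughput collapse Tom describes falls out of a simple relationship: with a window's worth of data in flight per round trip, throughput is roughly window size divided by RTT. A quick illustration with made-up values:

```python
def max_throughput_mbps(window_bytes, rtt_ms):
    """Rough ceiling on TCP throughput: one window per round trip."""
    return (window_bytes * 8) / (rtt_ms / 1000) / 1e6

print(max_throughput_mbps(65_535, 20))     # ~26 Mbit/s: a classic 64 KB window, 20 ms RTT
print(max_throughput_mbps(65_535, 200))    # ~2.6 Mbit/s: same window, 200 ms of buffering added
print(max_throughput_mbps(4_000_000, 20))  # ~1600 Mbit/s: a scaled-up window, 20 ms RTT
```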

Yeah, and this requires building, because of the fact that these routing protocols, you know, in choosing what the best route is between multiple independent networks, don't have any sort of feedback loop, as Tom said, to tell them about things like packet loss or delay from one end to the other, you as a content network have to be able to build some sort of metrics and telemetry.

So that when we see, you know, for example, higher latency, or we see a lot of packet loss associated with a path out of the network, that we ultimately have a way to influence our BGP outbound, so that we can send those packets another way on the network.

But as Tom said, most networks don't do this, you know, they don't have the tooling and the capabilities.

And it's just not something built into the routing protocols.

It's not just BGP.

But, you know, none of them really have any knowledge or awareness of what's happening higher up in the stack, they don't know that you have a, you know, a latency sensitive application, and that you prefer your packet to go, you know, the performant route versus the cheaper route across the Internet.

And as a result, you kind of have to push a lot of this logic down into the routing layer from the information that you learn from the upper layers.

Kind of makes me feel like we're, yeah, we're strapped, or there's only so much we can control, I guess, is what I'm trying to say, on something that happens out there.

And I'm just looking at it like, on a day-to-day basis, sometimes we lose connectivity here, or, like, we have some lag here and there.

And I just, you know, escalate to network and just try to figure out what's actually happening.

So I guess it's almost, it literally just means, you know, we'll have to send an email to those guys or something, and just hope for a response or something.

Yeah, you know, it's funny, with Cloudflare having so many different locations or points of presence around the world, and all of our Cloudflare POPs being connected to multiple transit providers, who are effectively our Internet service providers, private or public Internet exchanges, as well as some private network interconnects that we have, you know, we see faults consistently throughout the day.

Something somewhere is always failing, you know; with 200 or so POPs connected to an innumerable number of networks, we see errors all day.

And, you know, it's not humanly possible for us to respond to them all, all day long, without having software and other mechanisms to be able to triage and prioritize.

And, you know, it's not uncommon for us to have an issue with one of our transit providers, for example, where we'll have an automated system that will detect the problem very quickly, shut down our access to that transit provider and rely on our redundant connectivity at the location.

And then it automatically fires out, for example, an email, including a trace route at the time of the problem to the provider and opens up a support ticket.

And this whole thing, from a Cloudflare perspective, is automated, you know, so we may not even be aware of it from a human perspective.

And then when the provider replies with their ticket number, it gets associated, and we have a human follow up with it if it's not automatically resolved by the provider within a certain period of time.

And, you know, implementing a workflow like that is really critical, just because of the fact that as we continue to scale and turn up more locations and connect to more networks, the number of problems we're going to see is actually going to increase.

And we need to come up with a workflow that, you know, uses software first.

Do any of the transit providers have anything similar or are they pretty much relying on humans on their end?

Most transits have like automated systems as well and automated remediation systems.

And like quite a few of the higher tier ones, not all of them, but quite a few, are deploying like very state-of-the-art systems and building their own, like, network controllers to start steering paths and start sending packets through different paths when they know, for example, like those tier one providers, the reason why they're tier ones is they have this massive spanning backbone across the entire globe with which they can reach the entire Internet.

Right. But they have a lot of like redundant connections.

And it's a bit like Rob said at the beginning, it's like that spider web kind of situation where you have multiple paths to go to the same endpoint.

So if that provider wants to do maintenance on one of those segments, on one of those links, they can proactively steer traffic away or drain traffic from the link they're going to do maintenance on without any impact to their end users or their end customers.

So they build like they have a bunch of automated systems to do remediation or to do all of these things.

But the end result will likely still be that if there is an unknown fault, which can happen at Cloudflare as well, there will be a point of impact.

Right. It's like that point of impact may just be a minute or 30 seconds or 10 seconds.

But it's then unfortunately very likely for our automated systems to notice that impact, disable that transit provider, go through all the steps that Ben just explained, which also then includes sending an email to that provider and telling them, hey, you had a failure.

Usually the response we get back is like, oh yeah, we had to reboot a line card or we had a segment failure or whatever, but traffic is already failed over.

And that's the case because then we see in our automated system, it's literally just a blip of a single measurement point where we noticed, oh, things are broken.

So there is definitely quite a lot of automation in place.

It's just, unfortunately, due to the nature of networks, I think Ben will agree, that a lot of it is reactive.

You can only be as proactive as possible. I think it's the same thing in SRE where if a server fails, it's not going to tell you five minutes in advance of, yo, I'm not okay.

Right. It's like you don't have the five minutes to drain the traffic.

So it's the same thing in networks. And there's kind of like two different types of problems there.

There's times where we can see, for example, because Cloudflare pushes out so much traffic, we've seen situations where we've pushed so much traffic down a path where we've caused congestion in a provider's network.

And that's something we can avoid. That's a problem that in some ways we can contribute to and avoid by changing our routing.

But on the other hand, if an interface is going to fail because an optic is going to fail or a router is going to lose power somewhere because of some cooling or heating issue within a provider's telco cabinet, that's something we can't predict.

The provider probably didn't even know that was going to happen. And as a result, we definitely need software there to be reactive in those cases and to react very quickly so that the rest of our software, which relies on the network, operating at the higher layers, is not impacted by that fault.

Right.

And I'm not trying to put those ISPs on the spot or anything, but we should have like a certain maybe level of expectations, right?

Yeah, absolutely. Buying an Internet circuit isn't like buying an MPLS circuit.

If you're like a US enterprise company or European enterprise company, you can go to a service provider and say, I want to be able to connect to my other offices within my company.

And I want to have this guaranteed amount of bandwidth and minimal latency so I can do things like video conferencing.

In the Internet, you don't have quite the same guarantees.

We can go to different transit providers and they'll tell us how they're interconnected with different networks and tell us about the state of their network and how they're built out.

But from a contractual perspective, right, we don't have any real guarantees.

And one of the ways we sort of mitigate that risk on the Internet is to have a diversity of connections.

We connect to many different transits and have different pieces of software in our stack to allow us to influence how we route over different transits based on performance and cost and all those other sorts of different attributes.

Right. And we, well, I don't know if it's we, like maybe as an industry, or also in engineering, we kind of have an idea of what those alternative routes could be, right?

Like for example, it's seven o'clock in the morning, I need to have my coffee, and, you know, my taste in coffee says that I need to have, you know, some Starbucks in me.

I'm going to go drive to Starbucks, but, you know, there's construction on the way, and, well, maybe the roads are closed and all, and I just can't get to the regular Starbucks that I need to go to.

So I need to go out to find a way, a different way to get around it or whatever.

And, you know, Google Maps says, you know, we'll have to take a different route.

But, you know, I kind of trust that it knows what is the next or the second in line best path.

So is this sort of all calculated, you know, reactively? Like, you know, do the routers talk to each other?

Hey, I know the second best path is going to be this way. Or do we set that already?

Yeah. So BGP itself, as a routing protocol, isn't aware of sort of like the gray-out scenarios.

So there can be scenarios where you have a route to a location where you have poor performance getting to that location and BGP will never do anything about it really, unless the route disappears for whatever reason.

BGP can have a second or alternative path.

So let's say, if you want to pop to the very last slide.

Yeah. And in this example, it just kind of shows how your average Cloudflare point of presence or colo connects to the Internet. We generally have multiple transits, more than two, but I wanted to kind of keep the diagram simple.

And then we often connect to an Internet exchange, where we hear from anywhere from a handful to potentially thousands of peers on the Internet exchange.

And if our Cloudflare edge router here was connecting off to a network, you know, a third party network, let's say an ISP network, which has a lot of what we call like eyeballs, meaning, you know, human beings who are sitting either at home or at their business and navigating to different locations on the web.

If someone is reaching one of those networks by means of transit one, and the interface to transit one went down, let's say that that transit provider's router crashed, then BGP would have another path that it could ultimately use; you know, it would rerun the best path algorithm within BGP and probably choose another transit in this particular scenario.

So yeah, you know, the router has the capability to store multiple routes, including multiple copies of routes and have what are called different next hops or different alternative routes that it can use.

And it has those ready if, you know, the primary route were to fail. Okay.
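
A highly simplified sketch of that fallback behaviour (real BGP has a much longer list of tie-breakers, and the attribute values here are invented): the router keeps every path it has learned, so when the best one's next hop disappears it can immediately rerun the selection over what's left.

```python
routes_to_eyeball_isp = [
    {"next_hop": "transit-1", "local_pref": 200, "as_path_len": 2, "up": True},
    {"next_hop": "transit-2", "local_pref": 100, "as_path_len": 2, "up": True},
    {"next_hop": "ix-peer",   "local_pref": 300, "as_path_len": 1, "up": True},
]

def best_path(routes):
    usable = [r for r in routes if r["up"]]
    # Prefer the highest local preference, then the shortest AS path.
    return max(usable, key=lambda r: (r["local_pref"], -r["as_path_len"]))

print(best_path(routes_to_eyeball_isp)["next_hop"])  # ix-peer while everything is healthy

routes_to_eyeball_isp[2]["up"] = False               # the preferred session goes down
print(best_path(routes_to_eyeball_isp)["next_hop"])  # transit-1 takes over immediately
```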

And then we have the IX, which is essentially kind of like a bundle of transits, right?

Yeah, so what makes the IX a little different is that the IX, generally speaking, is free or closer to free than a transit, right?

So transits, you know, we are customers of different Internet transit providers.

And there's many of them out there, you know, Telia, GTT, Level 3, Cogent, HotDog, so on.

And we, you know, we pay them based upon ultimately, how much traffic we send them.

At Internet exchange locations, these are common points around the world where we can show up with our router and run an interface off to the Internet exchange, where the Internet exchange usually has a switch of some sort.

And then any other network that wants to show up at the Internet exchange can voluntarily also show up at that exchange.

And we can peer with them by means of BGP. So these are, this is known as peering.

And the benefit here is that if we both happen to be like in the same building, or at least at the same Internet exchange, then we can cut out the middleman, we don't need to pay for transit to get to them.

They don't need to pay for transit to get to us. So it's mutually beneficial from a cost or a business perspective.

But it's also beneficial, often, from a, you know, reliability and latency perspective, because of the fact that they can eliminate multiple transit hops in the middle.

So, you know, there's commercial as well as nonprofit IXs out there.

I know Tom's involved with an IX. I don't know if you want to speak about that, kind of from that angle.

Yeah, so I'm a member of the board at INEX, which is the Irish Internet exchange that has a presence in Cork and Dublin.

I do think it's quite important to like note that sending traffic over an IX is not free, right?

It's like the peering itself is settlement free.

But there are exceptions, like for example the Seattle Internet Exchange, which runs entirely on donations, where you don't have a monthly recurring fee, but they're unfortunately the exception.

A great, amazing exception, but an exception nonetheless.

So what usually happens at these Internet exchanges is that you pay a monthly fee, a monthly port fee.

So, and usually that is variable depending on the port speed you get.

So you pay less, like quite a lot of Internet exchanges, for example, now give one gig ports away for free because it's easy enough.

10 gig you pay 100 a month, for example, 100 gig you pay 1000 a month, for example, right?

These are made up numbers. But kind of the thing.

So the interesting thing, what changes the economics of Internet exchanges, is that the hotter you run your pipe, the cheaper your bandwidth becomes, because you have that flat fee of, I'm paying $1,000 for 100 gig.

If I do 100 gig constantly through that pipe, then your bandwidth is going to get very, very cheap.

If you only do 50 megs or 50 Mbit on that 100 gig pipe, it's going to be a very expensive 50 Mbit, right?

So that's important to keep in mind as well.
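
Using Tom's deliberately made-up prices, the economics he's describing reduce to a one-line calculation: a flat monthly port fee divided by the traffic you actually push.

```python
def cost_per_mbps(monthly_port_fee_usd, average_mbps):
    """Effective price per megabit on a flat-fee IX port."""
    return monthly_port_fee_usd / average_mbps

# A hypothetical 100 Gbit/s IX port at $1,000/month:
print(cost_per_mbps(1000, 100_000))  # $0.01 per Mbit/s if you run the port hot
print(cost_per_mbps(1000, 50))       # $20.00 per Mbit/s if you only push 50 Mbit/s
```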

And at the end of the day, the IX is generally going to be a lot cheaper than paying for a metered transit interface, even if it is, or happens to be, a for-profit IX.

The for-profit IXs, as you can imagine, are more expensive than some of the nonprofits.

But from a cost perspective, if that was the only thing we optimized for, and it's not, but from a cost perspective, if we could just send everything over IX as opposed to transit, generally speaking, it would be cheaper than transit.

Granted, there's times where transit may have the better route to a network.

Maybe they have more bandwidth, or lower latency, or are just more reliable.

And of course, we can't expect every single network to show up on every single Internet exchange around the world.

But this is a very cool tenet, I think, about Cloudflare, is that we are the most widely peered BGP network at the moment, because of the fact that we have so many locations.

And so many of these Cloudflare POPs were built at a location with an IX, especially when we were building some of our first POPs.

And as a result, we have this open peering policy.

So anyone who wants to show up at that IX can shoot us an email at peering at Cloudflare.com.

And we will peer with anyone who emails us, provided they have the ability to speak on behalf of their autonomous system or their organization.

And we will build a BGP peering session with you if you're at the IX.

And as a result, we can kind of cut out the middleman of having to use transit to get to each other.

And it's a big benefit for a lot of organizations because of the fact that we have so many different websites that use Cloudflare services.

By, you know, peering with Cloudflare, a lot of these companies are able to save a lot of bandwidth because of the fact that a lot of their traffic ultimately ends up coming to Cloudflare.

And it's that much cheaper for them to peer with us because they can send us the traffic over their IX interface as opposed to a transit interface.

Yeah, yeah. It's almost like a no-brainer, really. I've always been thinking of some sort of like a chicken and egg problem.

Do we, you know, are we going to wait for the ISP to set up shop?

Or are we going to set up shop? But the IX has already been there, so there's going to be a number of those ISPs that are connected there.

I've always looked at colos as sort of like microcosms of many different ISPs. And it's nice to hear that we as Cloudflare, we can actually just say, hey, we're here, and we've got everybody that wants to connect to us.

What's really cool is, because as we grow, Cloudflare actually becomes an attraction point for Internet exchanges.

So what will happen is when someone tries to set up a new Internet exchange in a new location where there currently isn't an Internet exchange, one of the first networks they'll email is Cloudflare.

And they'll ask, hey guys, do you want to join our IX?

Because they're aware that having Cloudflare on that Internet exchange platform is actually an incentive for other networks like ISPs to connect to the IX, because that's a way for them to reach Cloudflare.

But when I joined, that wasn't really that big a thing yet. It's like, that wasn't really a thing.

And then now we see, like almost three years later, that's becoming more and more of a thing, where we'll get emails to peering at, or noc at, or any of our addresses, where there's going to be people trying to set up a new Internet exchange.

And they'll ask like, hey guys, we know you guys peer at a lot of locations.

Are you interested in setting up shop in San Jose, Costa Rica, or anywhere else?

And that's a really, really cool thing because usually we're more than happy to do so.

A lot of independent networks in, like, far corners of the world like Cloudflare, because of the fact that right now, for them to connect and use a lot of resources on the Internet, they have to pay for transit.

And they also have to connect to those websites, which may be really far away from them.

So, by encouraging Cloudflare to install a local POP, we're ultimately able to save them money, because they don't have to pay for transit, but we're also able to give them much better performance, because we are localizing those requests to their area.

And as you kind of watch the global deployment map of all the different Cloudflare pops all over, you can definitely see, sometimes we're in some rather obscure locations where you can imagine the connectivity generally isn't all that good, but installing a Cloudflare pop can be a major upgrade for the telecom infrastructure in a lot of these areas.

And we'll see other businesses follow us into some of these IXs, which is a really cool thing to see.

Yeah, yeah, yeah.

I remember, Tom, back in the old days, I used to be in DC ops, and it was actually exciting to have some sort of quarterly goal to connect this much bandwidth.

We've had like 50 to 75 connections to make, and it's nice to hear that nowadays people actually come to us, and it makes it, I guess, really easy, right?

We don't have to market ourselves out there anymore. People are just coming to us, and it's really up to us to have enough fiber and enough optics to make things work.

And really, like literally just come meet me at the meet me room, really, right?

Just do that. Yeah, I don't know how many, I don't know what's the bandwidth that we have nowadays.

Like we have to be like over a terabit by now, right?

Yeah, so we have public facing stats. I'm not sure what we say, so I'm not gonna commit to anything.

The last thing I saw was I think 35 terabits of network capacity.

Yeah, I'm not certain, so I'm not gonna comment, but you can imagine with 200 plus locations we have quite a lot of connectivity.

Yeah, to me it's just a whole bunch of fibers and a whole bunch of optics and we don't have enough switches.

And does that ever come to a point where, like, we're in an IX and there's way too many people that want to connect to us, and we have to add more switches and more ports and more line cards, maybe?

Or is it something that we do? Yeah, that's a really great question.

One of the things that's really beautiful about the IX is everyone on the IX connects into Cloudflare through a single port on the router.

So we have one port, let's say it's a 100 gig port or a 400 gig interface, which connects into the IX switch.

Then these hundreds or potentially thousands of other entities can then connect into us through that one Internet exchange port.

Where the router interface has become, I guess, a constraint is if we want to do a PNI, a private network interconnect, with one of those businesses.

And the situation would be like this, let's say we have a, I don't know, let's say we have a 200 gigabit per second Internet exchange port where we're peering with a couple hundred businesses.

And on that 200 gigabit per second bundle or port we happen to have a lot of traffic with one Internet service provider which uses, let's say, 100 gigs of that traffic.

So half of our IX port is used by traffic to that one ISP. What we might do is call them or email them and say, hey, we have so much traffic with you in this location, it might make sense for us to move our traffic off the IX.

Why don't we send you a letter of authority and, you know, or have them send us a letter of authority because we happen to be in the same location.

We know this because we're both at the IX and we'll pay to run fiber or a cross-connect between our two racks.

We'll connect our routers and then we'll start routing traffic between our routers, kind of routing around the IX.

And that's beneficial because, you know, we pull that traffic off that shared interface and it allows all the smaller entities that we peer with on the IX to communicate with us.

Whereas the traffic for all those other entities might be so small or negligible that it doesn't make sense to burn or utilize an interface on the router, you know.

I may not want to burn a 10 gig interface if, you know, we normally only send them 500 megabits per second of traffic, for example.

So it's better if they stay on the IX, whereas we do a PNI with the higher bandwidth entities in those locations.
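
That judgment call can be sketched as a simple rule of thumb (the thresholds below are invented for illustration, not Cloudflare policy): a peer big enough to crowd out the shared IX port, and big enough to justify a dedicated router port, gets offered a PNI; everyone else stays on the exchange.

```python
def worth_a_pni(peer_traffic_gbps, ix_port_gbps, pni_port_gbps=100):
    uses_big_share_of_ix = peer_traffic_gbps > 0.25 * ix_port_gbps  # crowding out other peers
    would_fill_own_port = peer_traffic_gbps > 0.10 * pni_port_gbps  # not wasting a router port
    return uses_big_share_of_ix and would_fill_own_port

print(worth_a_pni(100, 200))  # True: 100G of a 200G IX port, move it to a private interconnect
print(worth_a_pni(0.5, 200))  # False: 500 Mbit/s stays on the shared IX port
```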

Yeah, kind of like a fast lane to connect. And I see us having it more controlled too.

Like I can see from end to end, you know, our port and their port, you know, we get the optics, and at least here in San Jose, SJC, I can walk to whoever we'll connect with directly, basically.

Like, you know, they're about 200 meters away or something.

So, okay. Yeah, that makes a lot of sense now, actually.

I've learned a lot today, for sure. There was a lot less that I knew back then.

Like I knew how to clean fiber and I knew how to read ports and if the lights were green, that's good.

If they're yellow, then it's not good. Yeah, but Tom, Ben, thank you very much.

This is pretty much the end of our segment today. You know, Brian, if you have any additional questions, I think we have a couple more minutes.

So, Tom, you mentioned server hardware.

It made me think of something like on servers, we have the hard failures that are easy to detect.

Like say a server turns itself off while it's running servicing requests.

And then there's sort of like the creeping death failures, where, like, say it has a bad disk and its requests serviced per second just keep gradually going down over time.

And those are obviously like with automation, those are really hard to detect.

Is there an equivalent on network hardware?

Kind of, right? Like, for example, like we've been talking about ports this entire segment.

You can have individual port failures. So, like, if a single port dies, then you still have 35 other ports that you can use, right?

If you have a 36 port router.

Or what can happen is a router can be made up of multiple line cards with multiple ASICs.

An ASIC can fail or a line card can fail. And again, that's like one of those creeping failures where you have, for example, like a router with four line cards in it.

If you lose a line card, you reshuffle some ports.

But then you have three line cards and you still have capacity.

But obviously, you can't keep doing that because at some point, like the same thing with disks, like if you have a RAID 1 with a bunch of disks in it, at a certain point, that's going to run out of like available capacity, right?

It's the exact same thing with line cards or with ASICs or ports.

So, that's one of those like failures.

We'll see that be an instant failure on that individual node being the individual port or the individual ASIC or the individual line card.

But as a whole entity, the router is still fine. Yeah. When a server is in the situation like I'm thinking of, unless you're looking at very specific SMART attributes on that disk and realize that there's a failure coming, the operating system may not even know.

So, it's just the request per second just starts to slow down, down, down.

And so, I think the SREs are constantly trying to build automation to think of those sort of corner cases and expand on them.

But it sounds like in the network world, like, the ports are either on or off, right?

It doesn't tend to slow down or is that not true?

It's not that they slow down, but there are times where an interface might start logging errors, like one every 100,000 packets or something, meaning that packet gets dropped.

But we have alerts and things to catch those sorts of scenarios and can administratively shut the port down if an optic needs to be cleaned or something of the like.

And at the end of the day, routers are very similar to server hardware with the exception of this custom hardware.

So, they have a lot of the same categories of problems that computers do, software problems.

But we have other pieces of software to monitor them at the same time.

Cool.

All right. Well, this concludes everything. It was a joy, guys. Thank you very much.

Thanks for coming on. Great. Thanks for having us. Yeah, it was really good fun.

Thanks, guys.

...and connecting securely to websites, but your DNS traffic may still be unencrypted.

When Mozilla was looking for a partner for providing encrypted DNS, Cloudflare was a natural fit.

The idea was that Cloudflare would run the server piece of it and Mozilla would run the client piece of it.

And the consequence would be that we protect DNS traffic for anybody who used Firefox.

Cloudflare was a great partner with this because they were really willing early on to implement the protocol, stand up a trusted recursive resolver, and create this experience for users.

They were strong supporters of it.

One of the great things about working with Cloudflare is their engineers are crazy fast.

So, the time between we decide to do something and we write down the barest protocol sketch and they have it running in their infrastructure is a matter of days to weeks, not a matter of months to years.

There's a difference between standing up a service that one person can use or 10 people can use and a service that everybody on the Internet can use.

When we talk about bringing new protocols to the web, we're talking about bringing it not to millions, not to tens of millions.

We're talking about hundreds of millions to billions of people.

Cloudflare has been an amazing partner in the privacy front. They've been willing to be extremely transparent about the data that they are collecting and why they're using it and they've also been willing to throw those logs away.

Really, users are getting two classes of benefits out of our partnership with Cloudflare.

The first is direct benefits. That is, we're offering services to the user that make them more secure and we're offering them via Cloudflare.

So, that's like an immediate benefit these users are getting.

The indirect benefit these users are getting is that we're developing the next generation of security and privacy technology and Cloudflare is helping us do it and that will ultimately benefit every user, both Firefox users and every user of the Internet.

We're really excited to work with an organization like Mozilla that is aligned with the user's interests and in taking the Internet and moving it in a direction that is more private, more secure and is aligned with what we think the Internet should be.
