Why the Internet Works (Most of the Time): A Conversation with Tom Strickx
Presented by: Tom Strickx, João Tomé
Originally aired on November 7 @ 10:00 AM - 11:00 AM EST
Cloudflare Principal Network Engineer Tom Strickx joins This Week in NET to explain what really keeps the Internet running. From anchors cutting submarine cables to automation detecting bad Internet weather, Tom shares an inside look at how one of the world’s largest networks operates — and why human trust still matters in keeping the Internet alive.
We talk about:
- How Cloudflare’s global network evolved since 2017
- The hidden fragility of the Internet (and why it still works)
- Routing leaks, Anycast, and automation
- AI’s growing role in network reliability
- What it’s like inside real data centers
Subscribe for more weekly conversations on Internet trends, infrastructure, and technology: ThisWeekinNET.com
Transcript (Beta)
Hi, I'm Tom Strickx. I'm a Principal Network Engineer here at Cloudflare. I joined the company in 2017 and I'm part of the team that's responsible for the day-to-day operations of the Cloudflare network, as well as some really cool engineering that we do in making the Internet faster and more secure.
The interesting thing about networks is the thing that people don't realize is that computer networks are entirely built on human networks.
Everything on the Internet is built on human relationships and interpersonal relationships.
Hello everyone and welcome to This Week in NET.
It's the November the 7th, 2025 edition.
Last week we explored Internet measurement and resilience with our Chief Scientist Marwan Fayed.
Today we shift to Internet weather: the networks, cables, and apparent chaos that somehow keep everything running.
To help us make sense of it, we're joined by Cloudflare's Principal Network Engineer Tom Strickx in a conversation we recorded a few weeks ago here in our Lisbon office.
I'm your host João Tomé, based in Lisbon, and this week we're also celebrating our new website, ThisWeekinNET.com, with direct links to subscribe to our podcast on Apple or Spotify, and some of our best guests and episodes about policy, privacy, security, demos, Internet history, Cloudflare's history, and more.
There are also some Easter eggs on that website, and you can subscribe to new episodes via the newsletter; we promise we'll only use that for new episodes, nothing more.
Before we start, here are a few quick Internet updates from this week from the Cloudflare blog.
First, BGP zombies and excessive path hunting: how stuck routes can rise from the dead and cause Internet instability. That was published on Friday, October 31st, right after we posted last week's episode, so it was a Halloween blog, but all related to BGP and the Internet, of course.
Another blog was Fresh insights from old data: what Cloudflare found while investigating new firewall tests in Turkmenistan.
Also, Building a better testing experience for Workflows, all about developers in this case: Cloudflare introduces end-to-end testing tools for complex applications.
Also in the Workers and developers space, Workers VPC Services is now in open beta, connecting global Workers to private networks anywhere in the world. And tokio-quiche goes open source: our async QUIC and HTTP/3 library that powers services like iCloud Private Relay. That's also a blog post from the week.
Also this week, extract audio from videos with Cloudflare Stream, how developers can now easily pull audio tracks from uploaded videos.
And today, Friday, the new post is Do-it-yourself BYOIP: a self-serve API that lets customers bring and manage their own IP prefixes on Cloudflare. So also a cool one.
A few other headlines from around the web: this week in 1988, the Morris worm infected about 10% of the Internet within 24 hours, one of the first major Internet security incidents. Many of us at Cloudflare weren't alive at the time; I actually was. Also, security researcher Troy Hunt reported that 2 billion email addresses were exposed in a massive data leak. That's it for the news. Without further ado, here's my conversation with Tom Strickx.
Hello everyone and welcome to This Week in Net, hello Tom.
Hi, welcome. You're in Lisbon this week, why are you in Lisbon?
I'm here for a network automation on-site. I'm part of the Edge network team and one of our sister teams is the network automation team, and quite a few of the engineers are here in Lisbon, so it's always super exciting to come to the Lisbon office and enjoy the wonderful sunshine and the delicious food, especially coming from a city like London, where we do not have as much sunshine, but we do have delicious food.
We want to have a conversation about how Cloudflare's global network powers the Internet in a sense and your views there, you've been at Cloudflare since 2017, can you give us a run-through of your background?
I grew up in Belgium, I went to university in Belgium, kind of did my own thing for a bit, both during my uni years as well as after my uni years, but in 2017 I saw the job opening at Cloudflare and I was like, okay, sure, let's give this a go.
This was still at the time when Cloudflare was purely in office, at that point in time Cloudflare had a couple of offices, we had the San Francisco office, we had the Austin office, the London office and the Singapore office and that was basically it, so a much smaller kind of sized company than we are today and yeah, applied for the role, interviewed and got the job and I think it's like I started interviewing in April and by September I had moved to London.
So you moved to London to work at Cloudflare specifically?
Yeah, exactly. What was your college degree there in terms of?
Computer science. Computer science, specifically. Did you have a passion for the network part already or?
Yes, definitely. I mean, that was, I think, one of the things that set me apart a bit from the other students: that interest in networking, in how do you interconnect all of these computers and make them do funky things. I was basically the only student in my year that went to the network research group; most of the other folks went to either the AI working group or the software engineering working group, so I was kind of the odd one out when it came to the networking component.
What drove you to that area specifically, even before Cloudflare? That's a good question, I mean it's, I started uni in 2008, right?
So this was very much in the, like the Internet was already a pretty big thing at that time, I think this was about a year after YouTube launched, I think YouTube launched in 2007, 2006 maybe?
So it was already a pretty big thing, but it's this interesting thing where, you know, we're just like interconnected world for everything, right?
And everything relies on, you know, optimal communication between basically computers at this point, like everything that we do and live and go through as people on earth is all driven by computers that communicate with each other, like the ability to keep in touch with family on the other side of the world is driven by networks, it's all, at the end of the day, it's like the technology as an entire thing stands or falls with reliable communication, and that was something that I always found really interesting, right?
Is that that specific part of being such a key component of everything, but so little talked about, and most people don't know or care, right?
Like the experience they have with networking is the Wi-Fi working at home, and that's about it, right?
And that still occasionally comes up within the family: Tom is coming home, so Tom is going to fix the Wi-Fi, because Tom is a network engineer, so surely that's what you do, right?
It's the same thing with people that come home or that go to visit family, it's like, well, yeah, you're a computer engineer, surely you can fix my printer.
It's like, that's not what I do.
It's like, but you work on computers all day. It's like, I guess, yeah, sure.
And then the worst part is you then genuinely actually fix either the Wi-Fi or the printer, and then you reinforce that idea of, like, well, surely, like, you're, you know, you're a computer engineer, so you can fix my printer.
It's like, I don't, but yeah, okay, fine. It's, yeah, it's something, yeah, I've always had an interest in networking or an interest in computers from a pretty early age.
Specifically about you joining Cloudflare in London, a smaller office at the time, for sure, a smaller company before the IPO.
What was joining the network team at Cloudflare at the time in 2017? It was great.
I still have very fond memories of the early days. We were a pretty tiny team at the time.
We had... In London or in general? In general, really. We had, I think, two engineers in Singapore.
The largest team at the time was in London. That was, I think, we were five or six of us.
And then a small team in San Francisco.
And yeah, that was always really good fun, right? Is you have this really tight team that, like, we got along pretty well.
And then being responsible for this already pretty big network.
I joined at around Colo 120, I think. So we had 120 different data centers live at that point.
We're now at, I think, what, like 350 cities almost?
And 600-plus data centers. So it's scaled up quite a bit since.
But yeah, in the early days you're still a pretty small team, all things considered, responsible for running a global network 24-7, you know, the Internet does not stop, right?
Like the weekends don't exist for the Internet.
That was pretty exciting, right? And then being there at kind of on the ground and then building that out and then being part of that team that gets to do these things and gets to make decisions on, like, what are we doing?
Where are we going next? What are we building out? That's pretty exciting.
For those who don't know, what is the principal network engineer at Cloudflare?
What is it really? Even when you started, what was the job in terms of network, of dealing with not only the data centers, how they communicate with each other, in a sense?
What is the role and the job? So when I started, my official title was network automation engineer.
And I was the first network automation engineer that was hired.
I have Jerome Fleury to thank for that. He's the VP of network now.
But yeah, at the time, that was, you know, Jerome was very prescient in that kind of sense.
He understood very quickly that the only way that you're going to scale a company like Cloudflare is with automation, right?
From a management perspective, as well as a rollout perspective.
So that was kind of the original goal for me, was to both do operational things, like day-to-day operations, is making sure that the network can run reliably.
And in cases of issues, making sure that we fix those problems as quickly as possible.
At the same time, also investing in automation and tools, and trying to make sure that most of the kind of boring day-to-day tasks of, you know, like deployment and things like that are automated away.
So that was kind of the original intent of the role.
And over the last eight years, we've built out the NetDev team, right? The NetDev and the NetSys teams.
Those didn't exist at the time. That was basically just the three of us.
It was Luke, Mircea, and me, and Louis Poinsignon, kind of faffing around more than anything else, and then building pieces of software that were automating specific parts of the job, and trying to make sure that it either became easier to do specific tasks, or making sure that things were done automatically.
Like a really good example is, one of the things that we're heavily dependent on is Internet health, right?
Is overall, like we do a lot of communication with other networks.
And Internet weather is a thing, right? Like the Internet is not, unfortunately, an infallible thing.
Although, to be fair, it not being infallible is my job security.
But one of the problems there is that sometimes things fail, right?
Like you can have submarine cuts, you can have router failures, you can have a wide variety of things can go wrong on the Internet at any given time.
So the team built this automation system that automatically detected these issues within the Internet, and then disabled the link that was seeing those issues.
Because before that system was in place, we got alerts or notifications from either angry customers, or automated alerting on our side that's like, hey, there's bad Internet weather, please fix this.
And then someone from the network team got engaged, and they were like, okay, it's like I can try to figure out what was going on.
And then they'd identify the broken link, and then they'd disable the broken link manually.
So instead, what the team built is this automation system that can automatically detect it, and automatically disable that link to kind of make life easier.
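To make that concrete, here is a minimal sketch of that kind of automation, assuming a simple per-link loss probe. The link names, thresholds, and the probe/disable functions are hypothetical stand-ins, not Cloudflare's actual system:

```python
# Hypothetical sketch of "bad Internet weather" automation: probe each
# external link, and drain any link whose packet loss stays above a
# threshold. Names, thresholds, probe_packet_loss() and disable_link()
# are illustrative assumptions, not the real implementation.
import random
import time

LOSS_THRESHOLD = 0.05   # act on more than 5% sustained packet loss
SAMPLES = 5             # consecutive bad samples before acting

def probe_packet_loss(link: str) -> float:
    """Stand-in for an active probe (e.g. pings sent across the link)."""
    return random.uniform(0.0, 0.1)

def disable_link(link: str) -> None:
    """Stand-in for pushing config that drains traffic off the link."""
    print(f"draining traffic away from {link}")

def monitor(links: list[str]) -> None:
    bad_counts = {link: 0 for link in links}
    while True:
        for link in links:
            loss = probe_packet_loss(link)
            # count consecutive bad samples, reset on a healthy one
            bad_counts[link] = bad_counts[link] + 1 if loss > LOSS_THRESHOLD else 0
            if bad_counts[link] >= SAMPLES:
                disable_link(link)      # automatic mitigation instead of paging a human
                bad_counts[link] = 0
        time.sleep(60)

if __name__ == "__main__":
    monitor(["transit-provider-a", "peer-b-100g"])
```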
Yes. It interconnects, of course, with data centers and hardware, but in that sense it's software that is being built to make sure that all the networks interconnected with us communicate with us properly.
Yeah, I mean, everything at the end of the day, like I like saying whenever I do the new hire orientation sessions, is that at the end of the day, we're an infrastructure as a service company, right?
We build and provide infrastructure to our customers with some really cool products, and then we build products.
But at the end of the day, we're dependent on the weather gods of the Internet and the weather gods of hardware.
So everything that we do is built on top of hardware.
So we need to make sure that we can run it as reliably as we can.
There are some terms like weather or even plumbing, network plumbing, tubes.
There's actually a book there in this library called Tubes that is quite popular explaining how the Internet works in terms of tubes.
But in what way are those analogies actually accurate in terms of how the connections between networks, the tubes, let's say, actually work?
And the Internet being the network of networks, there's not one company that can make the difference on its own.
It needs to be all the ecosystems. It's a collaboration for sure.
And yeah, it's like all of these analogies are pretty accurate. Really, it's like the weather thing, for example, is mostly the bad weather, right?
It's like we don't talk about the sunny weather on the Internet.
It's about the bad weather on the Internet.
It's the same thing where sometimes there's a wide variety of things that can go wrong, like I said, but it's also a wide variety of kind of impact scope, right?
Where it can range from bad weather being a tiny bit of rain kind of thing to thunderstorms, hurricanes kind of situation.
Can you give specific examples, what a route leak is, things like that?
Can you explain what happens? So it's like from the regular kind of stuff that we deal with on a day-to-day basis is basically just packet loss, right?
So most of the time that packet loss can be induced by congestion.
So again, analogy, I guess, is there's a limited availability of capacity in the global Internet.
And from time to time, you need to go from Europe to Asia, for example, or from Europe to the US.
And that depends on the available capacity that you have between those kind of regions.
And sometimes you lose some of that capacity. The easiest or the best example of that is currently kind of the Europe to Asia capacity.
90% to 99% of all of that capacity goes through the Suez Canal, and then the Red Sea, and then basically goes through or past kind of the Indian Ocean, India, and then Singapore.
The submarine cables.
Yes, submarine cables. But you have this choke point at the Suez Canal, which also means that quite frequently, there's submarine cuts.
And most of them usually tend to happen around kind of that Suez Canal, Red Sea area, because there's just a whole lot of shipping active there.
And if you have ships, you have anchors.
And anchors have this really annoying tendency of finding submarine cables and then cutting them.
So then you get congestion.
And with that congestion, you get packet loss. Basically, it just means that not every single packet that you're sending is going to get to the destination that you want it to.
And you need to steer around it. And that's what we kind of do on a day-to-day basis.
The rerouting. Exactly. So that's what we kind of do on a day-to-day basis.
For the most part, that's kind of part and parcel of the job. But then there's worse things.
Like you mentioned, route leaks, for example. Those can get really, really bad.
So what a route leak is, it's a network that shouldn't get the traffic that it is receiving.
So from time to time, we're a really well-peered network.
I think we're now at like 30,000 different networks, something like that.
Peer with a lot of different networks. And sometimes those networks will tell other networks, like, hey, you can reach Cloudflare through me.
Which is true, because they are directly connected with Cloudflare.
But we don't want other networks to immediately reach Cloudflare through that network, because they don't have the capacity for it.
We have a 100 gigabit link, for example, with them.
And all of a sudden, 300 gigabit or 400 gigabit of traffic goes through that.
That doesn't work. So that's a route leak. And it can get very impacting very quickly, because all of a sudden, all of this traffic that you were receiving across the world gets kind of pinned into a single location, goes through a single link, and it's just dreadful for everyone involved.
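As a rough worked example of why that hurts, with invented numbers: a 100 gigabit peering link that is suddenly offered several hundred gigabits of leaked traffic has to drop most of it.

```python
# Toy illustration of a route leak's impact: far more traffic converges on
# one peering link than it can carry. Numbers are invented for illustration.
LINK_CAPACITY_GBPS = 100.0                        # our direct link with that network

# traffic that suddenly follows the leaked route instead of its usual paths
leaked_demand_gbps = [120.0, 90.0, 150.0, 40.0]   # from various regions

offered = sum(leaked_demand_gbps)                 # 400 Gbps of demand
dropped = max(0.0, offered - LINK_CAPACITY_GBPS)  # everything beyond 100 Gbps is lost
print(f"offered {offered:.0f} Gbps, link carries {LINK_CAPACITY_GBPS:.0f} Gbps, "
      f"~{dropped / offered:.0%} of packets dropped")
```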
And those are pretty difficult to kind of mitigate as well sometimes. Like some of the earlier route leaks that we experienced, sorry, one of the earliest route leaks that we've experienced was, I think we even blogged about it.
It was the Allegheny route leak of Quad One, I think it was.
I remember that. And that took us quite a bit to kind of resolve, because at the end of the day, what we needed to do was we needed to contact or really call the network engineer for that company to like, hey, you guys are doing this.
Can you please stop? Because we tried everything at that point in our kind of capacity of making sure that they no longer received that prefix to then send that on, and it just wasn't working.
So we had to like call them and actually just talk to them on the phone.
It's like, hey, you're currently doing this.
Can you please stop doing this? And sometimes they don't know that they're doing it.
I mean, making mistakes is human, right? Like that's what makes us human is the ability to make mistakes and learn from them.
But that also just sometimes means that you're making something that feels like a business as usual configuration change on your network that you're not fully understanding or grasping.
It's like, oh, you know, damn, I made this mistake. And you only realize it when somebody tells you.
It's like, hey, you did this thing. Can you, you know, please.
It's impacting everyone around you. Like, can you please stop doing that?
And then at that point, it takes a bit of convincing sometimes, right?
It's like where you need to like have that conversation. It's like, hey, look, here's the data or here's these third-party websites, for example, that you can go to and that will show it's like you're doing this thing that you're not supposed to be doing.
Can you please stop doing that? And that's, again, that's basically, yeah, that's the role.
That's the job, right, is having those communications and with other network engineers or with other companies and then trying to understand and then trying to help build a better Internet, right, is trying to get to that point where we no longer run into these problems, right?
So that's something that I'm somewhat involved in is the routing security space is trying to make sure that we don't run into these route leaks anymore, is that we don't run into these route hijacks anymore, that we have systems in place, not just processes, but actual just software systems that prevent this from happening.
Because like I said, you know, like human, like failing is human and like processes and then they're there to be followed and you can have the tightest process in the world.
But at the end of the day, if a human can bypass that process, mistakes can be made, right?
True. So what we're trying to do now, as the entire Internet community really, or networking community, is make that safer and more secure, so that even if someone makes a mistake, even if someone bypasses whatever processes are in place, it can't actually damage the rest of the Internet, right?
It's trying to make sure that we have some security in place and making things a bit better.
One of the things that surprised me since I joined, as I understood a bit better how the Internet and networks work, is, as you were saying, first how fragile the Internet is.
I didn't think it was so fragile, in the sense that someone who isn't an expert, just trying to do their job without knowing too much about it, makes a mistake and the Internet as a whole is sometimes impacted, or at least big chunks of the Internet are impacted.
Having this sense of also trust our network, trust the other networks to make the Internet work, that element of trust is also interesting as part of how the Internet works.
Were you surprised with some of those elements? Not really. You knew about them?
Yeah, pretty early on. It's the same thing in software engineering in general or even computer science in general.
A lot of our entire industry is built on the really smart ideas of a very small subset of people.
At the end of the day, the C programming language, Unix, Linux, networking in general, it's a very small core component of people.
That also just means that you can't be like a precog, so to speak, that knows everything and can predict every single outcome and can predict every single possibility.
Whenever we built these systems or whenever these systems were invented, you're like, well, I'm going to keep it to this specific scale because I understand that specific component of the problem, and then you build that out.
Then years later, you have the realization, oh, we didn't account for this possibility, or we didn't account for this possibility, or we didn't account for that possibility.
Then you just patch it. That's the day-to-day job for software engineering is, well, you have a bug, you patch the bug.
It's the same thing with protocols. It's the same thing with the Internet.
It's the same thing with networking. You build something, and then a year later, a month later, six years later, 10 years later, you have the realization, oh, this isn't working the way that we need it to, or this isn't working the way that we want it to.
Then you patch it. That's always been the case. Then for networking especially, the fragility is, it's fragile, sure, but we're still making it work.
We're still making it work pretty well. It's, I think, a credit to the community, really, that we're as successful at building networks because I think a lot of people kind of looking from the outside in, they kind of look at it, and it's like, you sure that's working?
But we're making it work. A lot of it is computers talking to computers, but it's also people talking to people.
That's a pretty big component is these interpersonal relationships and having these contacts and knowing who to talk to, when to talk to them, those kind of things is pretty important.
You spoke about route leaks and sharing alerts and things like that.
We have Cloudflare Radar, radar.cloudflare.com, which has a network perspective as well. We can get alerts if something is going on there, which is also an interesting way of sharing the data that we have with others, specifically.
For those who don't know, can you give us a run through on in what way Cloudflare is unique compared with other CDNs, compared with cloud providers?
What makes Cloudflare unique? You've been here several years.
You've helped build that unique part. What would you say there?
It's hard to say.
I don't work at other CDNs, but I think definitely one of the unique parts of Cloudflare is just the amount of incredibly clever engineers that we have.
We try to look at problems with a kind of different perspective and try to solve them in a different perspective.
The Anycast network that we've built is one of the largest ones out there.
It is a pretty big differentiator. It means that we're capable of mitigating the largest DDoS attacks without even getting alerted.
I think the biggest evolution over time is that ability now that whenever we get a DDoS attack, in the earlier days, that was all hands on deck.
Definitely, everyone was aware that these were happening.
Now, we basically just sleep through them. The only reason why we know is because there's this alert firing.
It's like, hey, we are currently getting, I think, the largest now is like 20, 30 terabits per second.
29.7.
Yeah. We just sleep through that. There's nothing. The network is perfectly fine.
There's no blips. There's nothing. The only way or the only reason why we know is because our DDoS people have set up some monitoring to make sure that we're aware of these numbers.
And that's it. And that's, I think, the biggest change. But that's also, that's largely due to the software engineering work that our DDoS folks have done, obviously.
But it's also the Anycast network component that makes it a whole lot more bearable to deal with these attacks in a sensible way.
I think that's definitely the Anycast component is also the thing that attracted me to Cloudflare.
Can you explain for those who don't know what is really Anycast? What makes it specific?
Yeah. I mean, so Anycast is this kind of, it's not really technology.
It's a use of technology, I guess, is a better way of saying it. What we're doing is we're advertising.
So we're telling all of these networks around the world, it's like, hey, we have this prefix.
So let's say 1.1.1.0/24. We have this prefix.
You can reach us for that prefix because we own it. So that's what we're advertising.
And we do that from every single location we have in the globe. What that means is at that point, we're relying on the other networks to steer traffic to different locations.
So with all of these locations that we have around the world and all of these peer-to-peer relationships that we have around the world, in normal conditions and in ideal conditions, all of the other networks will send that traffic to the closest Cloudflare location that exists.
There is a bunch of caveats here that I'm not going to go into.
That's probably a bit too technical.
But that means that we get all of that traffic just in the closest location.
But that means if there's a large botnet out there, that botnet is going to be spread across multiple ISPs, across multiple countries, multiple continents.
And all of that traffic, sure, they can generate a whole lot of traffic, like almost 30 terabits per second worth of traffic.
But it's sharded across the entire globe.
So because we're basically saying from every single location, it's like, hey, you can get me here, all of that traffic is just kind of dissipated across our entire network.
And that makes it work. And it's a really interesting thing.
It's nothing unique to Cloudflare. It's not a proprietary system. No, exactly. It relies entirely on BGP, the Border Gateway Protocol.
That's the protocol that the entire Internet runs on, basically.
And it just runs on that. That's the only thing it relies on.
It's just some clever use of the way that the protocols interact with each other that makes it work.
And this was well-documented and well-used before Cloudflare did something with it.
But yeah, we started using that as our core technology to build the CDN on, and that kind of makes us unique.
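To picture why an Anycast footprint dilutes a globally distributed attack, here is a toy sketch. The PoP names, distances, and traffic figures are invented, and real route selection depends on BGP policy rather than a simple distance table:

```python
# Toy illustration of Anycast sharding: every PoP advertises the same prefix,
# and each source network hands traffic to its nearest PoP, so no single site
# absorbs the whole attack. All data below is made up for illustration.

# hypothetical "distance" (think AS-path length or latency) from each source
# network to each Cloudflare PoP it can reach
DISTANCE = {
    ("isp-portugal", "lisbon"): 1, ("isp-portugal", "london"): 3,
    ("isp-uk", "london"): 1, ("isp-uk", "chicago"): 4,
    ("isp-australia", "sydney"): 1, ("isp-australia", "singapore"): 2,
}

def nearest_pop(source: str) -> str:
    candidates = {pop: d for (src, pop), d in DISTANCE.items() if src == source}
    return min(candidates, key=candidates.get)

def shard_attack(traffic_gbps: dict[str, float]) -> dict[str, float]:
    """Each source's traffic lands at its closest PoP."""
    per_pop: dict[str, float] = {}
    for source, gbps in traffic_gbps.items():
        pop = nearest_pop(source)
        per_pop[pop] = per_pop.get(pop, 0.0) + gbps
    return per_pop

# a 2.8 Tbps "attack" spread over three source regions ends up split per PoP
print(shard_attack({"isp-portugal": 900.0, "isp-uk": 1200.0, "isp-australia": 700.0}))
```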
And also the way it's spread out in terms of data centers in different cities around the world.
Yeah, I mean, we're a pretty large network at this point.
One of the things that we pride ourselves in is that we're present at a whole lot of Internet exchanges.
So I think we're now the network with the most amount of Internet exchange point connections out there.
And that's something that we pride ourselves in, because it allows us to connect with all these networks and make sure that we get the end users as close to Cloudflare as possible.
That makes sites faster, more performant.
There's also the security element you spoke about spreading out, the DDoS and things like that.
There's a bunch of things that help having those types of connections.
Yeah, for sure. I mean, it's also one of those parts where being present in more locations, yes, it's like the latency and everything else, but also it really just means that if there's a big issue or a big outage, it's like, yeah, sure.
But we have a wide variety of other locations that we could fall back to without massive impact to our end users.
A really good example of this is the Iberian Peninsula had this, a tiny bit of a problem with the power a couple of months ago.
April 28th. Yep. And we lost some of our Madrid capacity and I think we lost Lisbon as well.
I'm not 100% sure, actually. But for the people that were on their mobile phones and still had Internet, that traffic just went to primarily Marseille, Paris, London, all of those kind of locations that we have active.
And they could keep using the Internet and keep using the Cloudflare websites without seeing massive error pages or anything else like that.
And that's the big benefit of the network that we run.
Makes sense. You go to many conferences, you speak a lot with different folks from other companies, researchers, etc.
Where would you say the network area is? You spoke about the automated systems that we have and also some security elements now to make it better.
What would, in a nutshell, be the landscape of the situation right now? I mean, that's a pretty big question.
I think that's going to be a conversation for about the next six hours.
But it heavily depends. Cloudflare is in a very unique position.
So there's not that many companies that do what we do at the scale that we do it.
So that means that the landscape for us is vastly different from the landscape for a bunch of other networks, or the situation or the needs of another network.
If you're a small ISP in Ireland, for example, your needs are vastly different than the needs that we have.
I need to deal with 400 gig and 800 gig individual links. For that tiny ISP, maybe two or three 10 gig links to INEX, the Internet exchange there.
That's enough. And that's your network. And then you build that out and then you provide your services.
So what I think is important and what my co-workers and colleagues at other companies at our scale think is important tends to not be hyper relevant to a bunch of the other companies out there.
And it also depends heavily on what the services are that you provide. An ISP has very different requirements and needs than a network like Cloudflare does.
Even though, especially with some of the products that we offer now with Magic Transit or Warp, we're becoming more and more of an ISP.
We do provide some of the same services, but our requirements are still vastly different from what we do versus what MEO, for example, does here in Portugal.
So yeah, I think from my perspective, what I see today, the biggest thing on our plate is it's automation, but also it's just vendor interoperability.
With automation, that's great. And then being able to automate your network is important.
There's still some role, some factors.
There's a lot of work still left to be done, primarily because you have all of these network vendors, and you have network operating systems, and then you have SONiC, and you have Junos, and you have NX-OS, and you have IOS, and you have IOS XR, and there's a massive sprawl of different operating systems.
And they all have different methods or different means of interacting with that operating system.
The language, it's in English most of the time, but it's slightly different.
There's a slightly different syntax.
There's slightly different grammar. There's all of these different things, which means that if I write some automation that works really well for, say, SONiC, that might not work for an NX-OS box, for example, those kind of things.
So you end up in doing quite a lot of double work or even triple work, where if I want to know what my interfaces are doing on five different operating systems, I'm stuck writing automation for five different operating systems.
And that kind of component of getting to a point where, hey, wouldn't it be nice if we had one shared language that we could talk with all of these vendors to?
That's definitely something that is still actively discussed and worked on and super important for networks like ours, whereas for smaller networks, you stick with a single vendor, you automate for the single vendor, and you're pretty solid.
It's not really a problem that you encounter or not encounter as much.
So that's definitely one thing.
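As an illustration of that interoperability problem, here is a hypothetical sketch of the kind of abstraction layer one might write. The parsers are stubs and the overall structure is an assumption, not Cloudflare's automation:

```python
# Illustration of the vendor-sprawl problem: one logical question ("what are
# my interfaces doing?") maps to a different command and a different parser
# on every network OS. The commands are the familiar CLI forms; the parsers
# are stubs that only show the shape of the idea.
from typing import Callable

def parse_junos(raw: str) -> dict:
    return {"os": "junos", "raw": raw}     # stub: would parse "terse" output

def parse_sonic(raw: str) -> dict:
    return {"os": "sonic", "raw": raw}     # stub: would parse SONiC output

def parse_nxos(raw: str) -> dict:
    return {"os": "nxos", "raw": raw}      # stub: would parse NX-OS output

INTERFACE_COMMANDS: dict[str, tuple[str, Callable[[str], dict]]] = {
    "junos": ("show interfaces terse", parse_junos),
    "sonic": ("show interfaces status", parse_sonic),
    "nxos":  ("show interface brief", parse_nxos),
}

def get_interfaces(os_name: str, run_command: Callable[[str], str]) -> dict:
    """Hide the per-vendor syntax behind one call site."""
    command, parser = INTERFACE_COMMANDS[os_name]
    return parser(run_command(command))

# usage sketch: get_interfaces("junos", lambda cmd: my_session.run(cmd)),
# where my_session is whatever device transport you already have (hypothetical)
```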
The other is routing security. Like I mentioned, security is not binary.
It's not an on or off kind of component. It's this sliding scale where what you're trying to do is you just try to get a bit better over time and trying to get to a point where it's like, okay, we're now pretty happy or confident in the security of the systems that we're providing.
But at the end of the day, there's always something new to either account for or to deal with or to kind of fix.
So that's something that we're actively working with in the community in general, within the IETF specifically, is building out the kind of routing security space and then trying to figure out, okay, what are some of the problems that we encounter in our industry on a day-to-day basis, and how do we make that better, or how do we make sure that we secure that?
But that's definitely something that's active.
I think that's probably, like, routing security is definitely one of those things that is important for everyone, right?
The security of the Internet as a whole is relevant for anyone running a network on the Internet.
That doesn't matter if you're a super small operator in Ireland or if you're a super big hyperscaler like Google.
I'm interested into understanding a bit the evolution in the past few years.
Since you joined, for example, even in that space, the routing security, are things much better now than they were?
A hundred percent.
I'd be lying if I said that wasn't the case. I think the community as a whole has made massive strides in just fixing quite a lot of the problems, right?
RPKI, when I joined, was, I think, just ratified at the IETF, so that just became an RFC and has been actively rolled out, and now we're at a point where the vast majority of really big networks are doing route origin validation, so that's a subcomponent that's been enabled by having the RPKI.
That's not a protocol, but it is a method.
Yes, so very briefly, what you're doing with route origin validation within the RPKI is, say for Cloudflare, for example, we create a route origin authorization, a ROA, that basically says, hey, this prefix 1.1.1.0/24 has an origin of AS13335, so that's the Cloudflare autonomous system number, 13335.
You publish that ROA, that gets cryptographically signed, that validates the veracity of us writing that ROA.
We're the only ones, because we own, well, we're an authorized party for 1.1.1.0/24, so we can sign that ROA, properly done.
That goes into the RPKI. RPKI validators will pull the repositories, will do all of the validation, and they'll come up basically with this tuple, which is 1.1.1.0/24 and AS13335, and that's the only allowed originating party.
What you then do with route origin validation is, whenever you receive BGP updates, you verify that with the information that you're getting from the RPKI, and if it's valid, you mark it as valid.
If it's unknown, as in no such tuple exists, you mark it as unknown.
If a tuple exists, but it's not the tuple that you just received, then you mark it as invalid.
What you then do with a route origin validation is, whenever you have invalid tuples that you receive as BGP update, you don't install them in your router, you drop them, and then you improve the security of the Internet.
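A minimal sketch of that valid / unknown / invalid decision, assuming the ROAs have already been fetched and cryptographically verified by an RPKI validator. Real implementations also handle max-length and covering prefixes; this only shows the classification step:

```python
# Minimal sketch of RPKI route origin validation against an already-validated
# set of ROAs. Only exact-prefix matching is shown here for simplicity.
import ipaddress

# (prefix, authorized origin ASN) tuples produced by the validator
ROAS = {
    ("1.1.1.0/24", 13335),   # Cloudflare's 1.1.1.0/24, origin AS13335
}

def validate(prefix: str, origin_asn: int) -> str:
    announced = ipaddress.ip_network(prefix)
    covering = [(p, asn) for p, asn in ROAS
                if ipaddress.ip_network(p) == announced]
    if not covering:
        return "unknown"      # no ROA exists for this prefix
    if any(asn == origin_asn for _, asn in covering):
        return "valid"        # announced origin matches an authorized origin
    return "invalid"          # a ROA exists, but for a different origin: drop it

print(validate("1.1.1.0/24", 13335))    # valid
print(validate("1.1.1.0/24", 64512))    # invalid -> do not install the route
print(validate("192.0.2.0/24", 64512))  # unknown -> no ROA published
```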
You avoid problems, outages, things like that. With that, the validation, the trust part.
The really big thing is what we've been seeing over the last couple of years with the rise of crypto is the use of route hijacks to steal crypto.
That's something that we keep seeing, and then there's definitely multiple events now where route hijacks were a key critical component of a pretty large crypto theft.
With RPKI ROV, you don't entirely mitigate it, but you mitigate at least a part of it.
You do make it a bit harder to get that going, whereas especially in 2017, that was not the case.
That did not basically exist. It wasn't being done.
We've improved massively over the last couple of years. There's a lot of talk about AI actually starting to work in automation.
That's AI in a nutshell.
In what way even LLMs, which are more recent in terms of usage, are helping in some of the things that you do specifically?
LLMs are really good at understanding text.
That's what we use them for in the network team now as well. We're a global company.
That also means that we have global providers. We have providers all over the globe that give us services.
Either they provide us transit or they'll provide us backbone or they provide us physical connectivity or all those kind of things.
We need to run maintenances from time to time in either rebooting our servers or rebooting our network kit or upgrading our network kit.
Our providers need to do the same thing.
They will need to do maintenances. They'll need to do repairs.
They'll need to do all of these things. We get a lot of maintenance notifications.
But as with anything in our industry, none of that is standardized.
We get a maintenance notification from Zayo. We'll get a maintenance notification from euNetworks, for example.
None of these two are going to be identical in any way or form.
They're not going to have the same formatting.
They'll have the same raw information. They'll have the same core components, which is, well, this is your circuit ID.
This is the location of that circuit ID, and we're going to do a maintenance on that date at that time.
Hopefully, we'll be done with the maintenance at that time.
That's the core component of what a maintenance notification is.
But the formatting of it is so vastly different.
What the AI enablement team here at Cloudflare did for us was building a system on Cloudflare email routing that ingests all of these maintenance notifications and then runs an AI worker through it that then parses all of these emails and then gets a formatted structured text out of it that basically just says, this provider, that circuit ID, start time, end time, and then inserts that into our maintenance calendar.
That's a really awesome use of AI to make sure that we get better at tracking these maintenances and understanding how these maintenances work.
That's been a pretty cool use case for AI, and it's entirely built on Cloudflare itself, so that's also pretty cool.
Building Cloudflare with Cloudflare.
Yes. That is the core mandate, right, is build Cloudflare on Cloudflare, or at least try to build Cloudflare as much on Cloudflare as possible.
That's definitely one of the pretty cool use cases.
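A small sketch of that pipeline's shape, with a placeholder standing in for the actual model call and for whatever schema the AI enablement team really uses:

```python
# Sketch of the maintenance-notification flow described above: take a
# free-form provider email and reduce it to one structured record for a
# maintenance calendar. extract_fields() is a placeholder for the real
# LLM call; the field names and prompt are assumptions for illustration.
from dataclasses import dataclass
from datetime import datetime

PROMPT = (
    "Extract provider, circuit ID, start time and end time from this "
    "maintenance notification. Reply as JSON."
)

@dataclass
class Maintenance:
    provider: str
    circuit_id: str
    start: datetime
    end: datetime

def extract_fields(email_body: str) -> dict:
    # Placeholder for the model call: a real pipeline would send PROMPT plus
    # the email body to an LLM and parse its JSON reply. Canned output here
    # just shows the target shape.
    return {
        "provider": "example-provider",
        "circuit_id": "CIRCUIT-1234",
        "start": "2025-11-10T01:00:00+00:00",
        "end": "2025-11-10T05:00:00+00:00",
    }

def to_calendar_entry(email_body: str) -> Maintenance:
    fields = extract_fields(email_body)
    return Maintenance(
        provider=fields["provider"],
        circuit_id=fields["circuit_id"],
        start=datetime.fromisoformat(fields["start"]),
        end=datetime.fromisoformat(fields["end"]),
    )

print(to_calendar_entry("Dear customer, we will perform emergency maintenance ..."))
```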
There's a lot of talk also about AI inference making LLMs better in terms of performance, more capable.
They need a lot of GPUs, as we know.
In what way is the network team helping with that effort? The AI inference, optimization, reduced latency in these models?
Yes. For us, the really big thing is providing the capacity to deal with this.
AI models are pretty chunky.
They're pretty big. Those models need to be transferred to the GPUs as quickly as possible.
That's definitely something that we offer or that we actively contribute to, right, is trying to make sure that we get to that component.
It's enabling our other teams to do the maintenances to install more GPUs, for example.
That's just something that we're actively involved in. I consider myself lucky or fortunate that we don't yet do AI training.
AI training is a whole set of challenges there that we currently don't have to run through with just doing inference at the edge.
For us, from a network perspective, the thing that we do that works for AI is also the things that we do for the rest of our product set.
As long as we can get our end users to the closest location as quickly as possible, as performantly as possible, you immediately get all of those gains for the AI side of things as well.
If we get a fast network and if we get a good network going for our original CDN use case or the Zero Trust use cases, they all get those benefits and the AI side of things gets those benefits as well.
It's pretty easy for us to make those decisions and then make those enhancements because anything that we do, everybody wins.
All of our products immediately win and that's pretty cool to be able to do that on the network.
One of the things that I've been doing recently, because Cloudflare turned 15 and we had Birthday Week, is watching older clips of Cloudflare presentations from 2010, when the company started.
One of the things that surprised me is that in many of the things we spoke about today, about what the Cloudflare network is, the ideas were already there, the strategy was already there.
What makes Cloudflare actually pretty good for the AI age that we talk about a lot today are the elements, the basic elements, that were there in the beginning.
In what way were you surprised, maybe, with, hey, now we have Zero Trust at Cloudflare as a product and we have the tools in terms of network to support that, or Workers, Workers AI, new things that appear in our products, and then the fact that the network is always growing in terms of data centers and locations, actually for the purpose of enabling all of the other parts of the company?
The really cool thing here is that the network at Cloudflare is an enabler for everything else.
It's always been amazing to me to see some of the really cool things that some of our engineers build.
A couple of years ago, one of the engineers that's no longer here that you interviewed a while back, Marek Majkowski, came up with this idea.
It's like, well, what if we used Anycast for origin connectivity and working through that and then building that out and that became kind of now our egress product.
It's mostly used internally, but building things like that, it was really interesting because I never had considered, it's like, oh, yeah, that makes sense.
You can do that, but I'd never thought of that.
Again, it's the network that enables that, where all of the core, like you said, all of the core pieces were there in 2010 when Cloudflare started.
All of these core building blocks that Cloudflare is built on and that we've now been building this massive suite of products on were already there, because the network itself is this enabler, this flywheel for all of our other products, which means that we're going to hire an engineer in the next year that's going to build this amazing product in the next five years that nobody today can think of, like, oh, we can do that.
Yeah, I guess we can do that. And that's really cool. That's one of the superpowers, I would say, for Cloudflare, is also from a leadership perspective that our leaders let us do that, that our leaders really highly encourage this as well.
We're just kind of going, it's like, well, yeah, it's like, here's the network, go wild, find something insane.
And it doesn't even have to scale, just whatever, do cool things.
And then once you've done the cool thing, we'll figure out how we can either productize it, make it scale and get it to customers.
And that's amazing. And it's, yeah, it's the network itself that enables us to do all of these things, which also means that if I make a mistake, it has a pretty big impact.
But it also means that if we do cool things, or if we do like great things with the network, all of our products immediately get to benefit.
So that's pretty cool.
Pretty cool. In terms of the big trends you see shaping Internet infrastructure over the next year, I won't say five, because maybe it's too much, one year, two years, what are the things you see having an impact here?
It's definitely going to be more AI growth, right?
I think that's the big talking point, has been the big talking point for about the last year, last two years, is kind of the massive increase, massive surge in AI.
It's still going to be, it's that, and Cloudflare is kind of part of that, right, is that centralization of the Internet in one way or another, where more and more of the Internet is just kind of becoming available through a small subset of networks, right, is people, they'll watch their TV shows on Netflix, and they'll go to the websites hosted on AWS, and that is the Internet for them, right, and they'll chat with their friends on Facebook, right, or Signal, or whatever, right?
You have this very small core component, or very small subset of networks that are basically responsible for the vast majority of people, their day-to-day use of the Internet, and that's going to become more and more of a thing.
I think from, as we progress, we're definitely, what we're seeing from a policy perspective is really interesting, is that countries are getting more and more interested in data, data sovereignty, and data locality, where countries have started understanding that data is virtual gold, right, whether we like that or not.
I prefer to treat data like nuclear waste more than anything else.
I think it was Filippo, also a former Cloudflare employee, that coined that phrase, data is nuclear waste: you have it, you don't want it, you just want to put it away, and also generate as little of it as possible.
And keep it really safe. Yes, and keep it really safe, but at the same time, there are definitely companies out there that treat data like gold, and they're generating silly amounts of revenue out of it, and governments have started taking note of that, and they started noticing, so what they're now trying to do, and the European Union is very heavily involved in that, is they're trying to keep that data local, right?
They're trying to prevent that kind of data exfiltration, or that data moving, especially to the U.S., right?
The Internet is run, or a lot of the Internet is run by U.S.
companies, so they're trying to kind of prevent that from happening.
So definitely in the next year, next two years, that's going to be, that's what we're going to see more and more of, is seeing legislators trying to police, or trying to enable policies that may be restrictive for the Internet in general, or can basically be destructive for the Internet in general, where they're trying to get involved in the way that the Internet works, has always worked, but they're trying to legislate it, and they're trying to control something that they really shouldn't be controlling.
There's two elements there, actually.
One is the worry of breaking the Internet, of making several Internets, as there's China, of course, but there's the worry of that spreading to more areas; and there's the other part, which is keeping the data inside that region or country.
Especially the second one, potentially, gives you some challenges, or are we prepared for that specifically?
Yeah, there's definitely challenges to this, right?
I mean, a really good example of this is that our engineering teams are working pretty hard to get us to FedRAMP High, for example, right?
That's data localization for the US, primarily, or uniquely, but we see the same thing in Europe, we see the same thing in India, there's a bunch of different countries that have data localization requirements, and it's difficult, right?
Sometimes the needs of a customer are slightly different from the systems, the data localization suite that we offer today, so that sometimes means that we might need to make some minor tweaks to how we're doing things, but we do have the product set for this, right?
We do have the data localization suite, that's an entire set of products that we offer to our customers that do care about where is data handled, where is data decrypted, those kind of things.
It's super important, but we're going to see this happen more and more, where European companies will go, it's like, well, even if our employees are in the US or if our employees are in somewhere else that isn't Europe, we want them, when they're accessing our websites, we only want them to access those websites in Europe, for example, so then we need to figure out how do we make our Anycast network, which we're trying to get the end user as close as possible, so they land in, say, Atlanta or Chicago, but the data decryption can only happen in Europe.
It's like, how do we make that work?
How do we get those kind of things working? How do we communicate that and make that work?
And that's an interesting challenge, right? And so we have a team kind of working on those challenges and trying to figure out how do we make that work.
But yeah, we're going to see that more and more, and we're going to see data localization where it's even specific countries, where you end up in a position where, well, we'll need a data localization suite for, say, South Korea, or we need a data localization suite for the Kingdom of Saudi Arabia, for example.
And the core component of that requirement of, well, we only want data decryption in that specific country, is almost diametrically opposed to the way that Anycast works.
So figuring out the solution to that that still works well with our Anycast network while also complying with those requirements is an interesting challenge.
There's a lot of, because of AI, there's a lot of talk about energy efficiency in the hardware side, but maybe even in the network side as well.
Is there some elements in the network side of making things more efficient in general, even the edge compute part?
Yeah, I mean, so for us, I mean, there's two components to this, right?
You have the network hardware, and that's always been a thing about efficiency.
That's always been the thing: how do we get more bang for our buck, and how can we get more bits out of the same amount of power?
So, I mean, at the end of the day, data centers have limited power supply, which also means that the racks that we have in those data centers have limited power supply, which means that we need to work within a constrained envelope of the amount of power that we have available, also for our network kit.
So there's always this very heavy push to more efficient hardware in the same way that you have that push with CPUs, right?
Every year, the vendors come out with a new technology set or a new way of building a CPU or a new CPU architecture that they claim is like, oh yeah, well, Moore's law is, I think, is it like doubling of capacity every two years?
You see the same thing, like, the same thing applies within the network world, right?
But that also still means that we're, especially with AI, but the advent of streaming, for example, was kind of the precursor to this, is the demand for bandwidth is ever growing.
Like, there is, it's very much a non-linear growth of just more and more and more bits that need to be pushed out.
So that's definitely something that we see is just, yeah, like, you know, when I started at Cloudflare, all of our interconnects were 10 gig.
The vast majority of interconnects were 10 gig. And now we're looking at interconnects of 400 gig, right?
So it's just this stepping of just silly amounts of bandwidth at this point.
And, you know, the 800 gig MSA is getting ratified, I think, or has been ratified.
So it's just, yeah, you see these just individual links just growing in capacity and doubling at rates that are mind bogglingly large.
It's just, yeah, it's a bit mad. Quite interesting. I have a few fire round questions specifically.
One actually is related to the blog. You wrote a few blogs, especially about outages, explaining those from the network side.
There's a lot of interest in those, like when there's a Facebook outage. When I joined, it was that blog post people were reading to understand what happened to Facebook.
There was a big outage and we were in a position where we could explain what we were seeing from our side.
In what way the blog, even in that explanation part, has been important?
It's important to reach, like, I find it interesting because it allows me to write about matters or technical subjects that I'm interested in.
And it forces me to write in a way that I can explain something to a technical audience, but that might not be an expert in that specific subject matter.
And John, our previous CTO, was incredibly good at pushing you to explain things starting from scratch, making sure that whenever someone brand new with some technical understanding read that blog post, they could come out of it understanding what you were talking about and having a sense of, I understand something a bit better.
So doing so and then doing that writing is really nice, is really good because it forces you to think about something that you might consider yourself an expert in.
And then you're asked, it's like, basically it's like, explain this to a five-year-old and you realize, like, I don't actually know this as well as I thought I did because I can't explain this to a five-year-old.
It's like, you have all of these constructs in your head and all of these concepts in your head and it's like, okay, it's like, I know how all of these concepts fit together, but then there's this one specific thing of it, it's like, well, now I need to explain that concept.
And then you realize, like, well, I don't actually fully know how that works.
So then you need to start looking into that, need to start learning yourself.
And so you get to a point where it's like, now I can explain this.
And then all of a sudden you're building this entire kind of graph of different technical concepts and needing to explain them.
It's really good fun. It's quite interesting to see, especially how even expert engineers sometimes leaving their comfort zone in terms of that my team understand everything I speak about, make them understand things in a different way, outside of the box in a way, which is kind of interesting.
Sharing what we know with a general audience sometimes has that effect, which is interesting.
Yeah, I love learning in public, is I think what that's called, is making mistakes in public and learning in public is really important because the things I'm going through or the things that Cloudflare is going through as a company with engineering challenges, we're likely not the only ones experiencing those engineering challenges.
So then having that ability to share that with this community of engineers that are reading the blog and can learn from this is great because you get this amazing feedback from engineers.
Oh yeah, you're experiencing the same problem that we did and your approach was vastly different from what we were doing because we were doing it one way, you're doing it another way, and that enabled us to do some really cool things.
It's really cool, it's really good fun. There's a bit of science there in terms of how science goes further.
We share with others and others give their input and we build from that, which is interesting.
I'm also curious on, you had many experiences going to data centers, things like that.
What is the biggest, most amazing experience you had in one of those?
What is the data center where people don't know about but probably should?
I mean, the biggest thing is data centers are loud.
It's really loud, yeah. Something that people underestimate is just how loud data centers are because 80 millimeter fans running at 100% is...
So that's always the recommendation I give to new folks in the space or in the data center spaces, please buy a good set of noise cancelling headphones.
I mean, there's so many experiences in data centers, especially in my early days, in my early career.
I did a lot more data center work than I do today. I don't think I've seen the inside of a data center in the last five years, which is probably a good thing.
But yeah, in the early days, it was probably when I ran some servers in a data center on the outskirts of Paris, while I was still living in Brussels.
That meant putting about 30 or 40 kilograms' worth of servers in my hiking backpack, getting the high-speed train from Brussels to Paris at six o'clock in the morning, hauling that through the Paris metro, and then walking half an hour to the data center to rack and stack a bunch of servers.
It's an experience I'll hold dear to my heart, but also one I definitely never want to repeat.
And there are some pretty cool pictures from the early days of Cloudflare doing similar things, sending engineers all around the world to rack and stack equipment.
Those are the fun parts of being an early startup and doing cool things in data centers.
You look back and it's kind of mad that we did that.
But yeah. In the network space, there are other buildings too. I remember you sharing images of one in the US, a very large building where there's a lot of interconnection between networks.
Those are still data centers.
It's just that some data centers are interconnection points more than anything else.
And I think the picture I shared was One Wilshire. One Wilshire in LA has historically been one of the largest interconnection points in the Western Hemisphere.
And yeah, it's a kind of weird organic growth, right?
The fact that some networks are present there causes other networks to become present there, and then you get this critical mass, which means that at one point or another, if you want to interconnect with all of these networks, you also have to be present in that same data center location.
So you get this subset of data centers in basically every single city that are the large interconnection points, right?
And in Seattle, for example, that's the Westin Building.
In London you have a couple; Telehouse North, for example, is one of those where all of these networks are present and that's where you interconnect.
Cables everywhere. Cables, yes... If you have any kind of OCD about neatness, don't go into a data center.
It is dreadful.
Even for me, as a seasoned network engineer, I sometimes still see pictures of data center spaces and it's like...
But yeah, it's part and parcel of the job.
Also, the biggest network outage lesson you've learned? It's really hard to undo a mistake you've made if you lose access to the box you made the mistake on.
That's probably the biggest lesson I've learned. We had an outage, and I think we blogged about it, in 2019 or maybe 2020, where we made a configuration change in Atlanta which basically meant that we were attracting all of the traffic from some of our largest locations.
Instead of handling that traffic locally, we were sending it across our backbone into Atlanta.
That's not great. Atlanta is definitely not equipped to deal with that.
But because of that, we had this massive thundering herd of bandwidth hitting our edge router in Atlanta, which made it impossible for us to get a connection to that edge router to roll back the change.
So that was complicated. We did have an out-of-band network and that's how we managed to fix it.
But it's definitely one of those things that's easy to forget.
If you break the network sufficiently, that's it. That's the end of it.
And some of the biggest outages, even outside of Cloudflare, are caused by the fact that you lose inbound access to the rest of your network.
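One common guard against exactly this failure mode is a "commit confirmed" pattern: the change reverts automatically unless the operator, who still has to be able to reach the device, explicitly confirms it. Below is a minimal sketch of the idea in Python; apply_config and rollback are hypothetical stand-ins for whatever a real router API or automation system exposes, and the timeout is arbitrary.

```python
import threading

# Minimal sketch of a "commit confirmed" guard: apply a change, then roll it
# back automatically unless the operator confirms within a timeout.
# apply_config() and rollback() are hypothetical callables standing in for
# whatever a real router API or automation system exposes.

class ConfirmedCommit:
    def __init__(self, apply_config, rollback, timeout_s=600):
        self._apply = apply_config
        self._rollback = rollback
        self._confirmed = False
        self._timer = threading.Timer(timeout_s, self._revert)

    def commit(self):
        self._apply()        # push the candidate configuration
        self._timer.start()  # start the dead man's switch

    def confirm(self):
        self._confirmed = True  # operator can still reach the box: keep it
        self._timer.cancel()

    def _revert(self):
        if not self._confirmed:
            self._rollback()  # nobody confirmed: assume access was lost, undo
```

If the change cuts off your own management path, confirm() never runs and the device undoes the change by itself, so you're not left depending entirely on the out-of-band network.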
Interesting. Most underrated part of Cloudflare's infrastructure? Our engineers, I think.
Infrastructure, sure, it's hardware and it's cables, but it's also the software stack that you run on it.
And I think it's our incredibly smart engineers who build that software who are an incredibly valuable part of that infrastructure.
And without them, we wouldn't be able to do the things that we do today.
Favorite tool or dashboard you use daily? Tom's little shop of broken things.
I built my own dashboard that basically captures all of these metrics that indicate the health of a data center.
That goes from the number of requests per second we're doing, to whether the links are congested, to how many file descriptors we have: this wide set of metrics that, through the experience of multiple incidents, I and a couple of other engineers have learned are really solid indicators of something being broken.
And I realized there was no dashboard that captured all of these.
So I created a dashboard called Tom's little shop of broken things.
And that is something I keep open on a day-to-day basis.
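As a rough illustration of what a dashboard like that boils down to, here is a small sketch. The metric names, thresholds and the fetch_metric helper are made up for the example; they are not Cloudflare's actual indicators, just the kind of signals Tom lists (request rate, link congestion, file descriptors).

```python
# Sketch of a "one page of health indicators" view, with made-up metrics and
# thresholds. fetch_metric() is a hypothetical helper that would query
# whatever metrics backend you run.

def fetch_metric(name: str, colo: str) -> float:
    raise NotImplementedError("wire this up to your metrics backend")

# Each indicator maps to a predicate that says "this looks healthy".
HEALTH_CHECKS = {
    "requests_per_second":     lambda v: v > 0,        # serving traffic at all
    "link_utilisation_pct":    lambda v: v < 90,       # uplinks not congested
    "open_file_descriptors":   lambda v: v < 900_000,  # not near the fd limit
    "tcp_retransmit_rate_pct": lambda v: v < 2,        # packet loss staying low
}

def broken_things(colos: list[str]) -> dict[str, list[str]]:
    """Return, per data center, only the indicators that currently fail."""
    report: dict[str, list[str]] = {}
    for colo in colos:
        failed = [name for name, healthy in HEALTH_CHECKS.items()
                  if not healthy(fetch_metric(name, colo))]
        if failed:
            report[colo] = failed
    return report
```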
Interesting. AI: hype, evolution, or both? Both. It's both. AI definitely has amazing use cases.
Like I said, for example, the parsing component that I talked about earlier is a really cool evolution.
And sure, we could have done this before, right?
At the end of the day, everything's a regular expression away.
Well, it's not quite, but even if it were, that's not fun, it's not great, and it's pretty sensitive to changes and everything else.
And so having an LLM parse those emails is cool.
And it's definitely enabled the team to just not have to deal with that anymore. It's great.
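As a rough idea of what "have an LLM parse those emails" can look like, here is a small sketch. The fields, the prompt and the call_llm helper are hypothetical; this is not the pipeline Tom is describing, just the general shape of swapping brittle per-provider regular expressions for a model that returns structured data.

```python
import json

# Hypothetical sketch: extract structured fields from a provider's maintenance
# notification email with an LLM instead of per-provider regular expressions.
# call_llm() stands in for whichever model or API you actually use.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("call your LLM of choice here")

FIELDS = ["provider", "circuit_ids", "start_utc", "end_utc", "impact"]

def parse_maintenance_email(raw_email: str) -> dict:
    prompt = (
        "Extract the following fields from this maintenance notification and "
        f"reply with JSON only, using exactly these keys: {', '.join(FIELDS)}.\n\n"
        + raw_email
    )
    parsed = json.loads(call_llm(prompt))
    missing = [f for f in FIELDS if f not in parsed]
    if missing:  # never trust the model blindly: validate the shape
        raise ValueError(f"LLM reply is missing fields: {missing}")
    return parsed
```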
At the same time, I think we're trying to shoehorn AI into use cases that it has no business being in, right?
Or the risks are not worth the rewards? Sure. But even then, I don't think that's a fair conversation.
That's not a problem with AI. That's a problem with humanity more than anything else.
Humanity has this really annoying tendency: if we can do something, we will, even if it leads to bad outcomes.
There's always going to be a bad guy who will do something that shouldn't be done.
That's just the overall nature of things, and it's the same with AI.
Like, yes, the face-swapping capability, for example, carries massive risks.
But whether we do that in public or in the open doesn't matter, because the fact that the technology exists is sufficient for someone to abuse it in a way you don't want.
That's not inherently a problem with AI. That's inherently a problem with technology.
So if we don't want that to happen, we should go back to sticks and stones and just deal with that, right?
Just deal with an abacus.
An abacus you can't abuse, right? The only thing it will tell you is that one plus one equals two.
That's it, right? So if we don't want things to be abused, then we should go back to that.
That's not inherently a fault of AI.
I think the biggest issue, or the biggest hype, with AI is pushing it into use cases where it has no need to be.
And the really important thing, which I will very much get on my high horse about, is that creativity and artistry should not be automated away, right?
We should keep paying artists for making beautiful things.
We should keep doing the things that make life fun and have AI do the boring and annoying bits. I love the Terminator movies, but I don't think we'll get an AI uprising anytime soon.
So I'm not too worried about robots coming back and going, you made me do the dishes.
That's not a problem we'll have, I think.
But that's definitely what we should be focusing on: make AI do the boring parts so that we, the people, have more time for the fun parts.
Last but not least, give us a book, or maybe a talk, that someone who wants to understand networks should read.
The one book I keep recommending to people is Computer Networks by Tanenbaum.
I think it's now in its sixth or seventh edition.
I highly recommend reading it, if only for the single phrase in there: never underestimate the bandwidth of a 16-wheeler filled with tapes hurtling down the highway.
It's this very specific example that just sticks in my brain and truly explains the difference between bandwidth and latency.
And it just works, right? The 16-wheeler is going to have a whole lot of bandwidth but super high latency, whereas some of the pipes we have today have much lower bandwidth but much better latency.
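To put rough numbers on that trade-off: the figures below are illustrative assumptions (a thousand tapes of roughly LTO-9 capacity, a ten-hour drive), not anything from the book, but they show why the truck wins on throughput and loses badly on latency.

```python
# Back-of-the-envelope illustration of the tapes-on-a-truck point.
# All numbers are illustrative assumptions, not figures from the book.

TAPES            = 1_000   # tapes loaded onto the truck
TAPE_CAPACITY_TB = 18      # roughly one modern (LTO-9-class) cartridge
DRIVE_HOURS      = 10      # length of the drive

payload_bits  = TAPES * TAPE_CAPACITY_TB * 1e12 * 8
drive_seconds = DRIVE_HOURS * 3600

truck_gbps = payload_bits / drive_seconds / 1e9
print(f"Truck: ~{truck_gbps:,.0f} Gbps of effective throughput, "
      f"but {DRIVE_HOURS} hours before the first bit arrives")

fiber_gbps = 100
fiber_hours = payload_bits / (fiber_gbps * 1e9) / 3600
print(f"{fiber_gbps} Gbps link: ~{fiber_hours:,.0f} hours to move the same data, "
      f"but the first bits arrive in milliseconds")
```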
So that is definitely something I can recommend. But in general, one of my favorite talks is called WAT, W-A-T.
I forgot who did that talk, but it's about the intricacies of JavaScript as a language.
And it's still one of my favorite talks to refer people to.
These are the silly things we get up to in computer science.
Interesting. Thank you so much, Tom. This was great.
It's always my pleasure. And that's a wrap.
