Cloudflare just got faster and more secure, powered by Rust
Presented by: Steve Goldsmith, Matthew Bullock, Maurizio Abba
Originally aired on September 26 @ 1:00 PM - 1:30 PM EDT
Welcome to Cloudflare Birthday Week 2025!
This week marks Cloudflare’s 15th birthday, and each day this week we will announce new things that further our mission: To help build a better Internet.
Tune in all week for more news, announcements, and thought-provoking discussions!
Visit the Birthday Week Hub for every announcement and CFTV episode — check back all week for more!
Transcript
Hi, everybody. Welcome to Cloudflare Tech Talk. Today, we're diving deep into a major engineering feat: rebuilding the core part of Cloudflare's network to make the Internet faster.
We're talking about the transition from FL1 to FL2. To get us started, we're going to go through a little bit of an elevator pitch and get some explanation from the folks who work directly on the project.
I'm joined here today by Matt Bullock and Maurizio Abba, from the product and engineering teams that have been driving this transition.
Matt, I'm going to start with you. What's the big picture of this project? So really simply put, FL is the brain of Cloudflare.
So any request that comes into Cloudflare's network, all of your configuration that you put in your dashboard or API, FL is what is applying that configuration.
It's really been the brain of the CDN since Cloudflare was born, so 15 years ago.
I think the first commit for FL was actually even before that, 16 years ago.
So it's been around about that long.
So it's going from the old brain to the new brain, which is FL2, and moving that traffic from FL1 to FL2 seamlessly and quickly without impact, but just delivering a quicker service for all of our customers.
Rizio, maybe you can jump in here.
So the original system has been around for a long time, as Matt just mentioned.
What are some of the key changes?
What challenges do you find when you're replacing a 15-year-old system that runs the whole business?
The main thing here is that this is a matter of evolution, right?
So we started, the company started with Nginx, and that's great.
Nginx is an amazing software, so that was absolutely brilliant.
However, over the course of time, we increased and increased the size of this software with multiple features, multiple products, as we should, as every company does, and that created sort of a mesh of things intersecting with each other. Frankly, the side effects were coming to a point where our velocity and our ability to deliver new features for customers were heavily impacted. So it was time for a fresh start.
That does make sense. We do talk in detail in the article about FL1's reliance on both NGINX and Lua, and LuaJIT. Can you explain in a little simpler terms what particular combination of things in these technologies started to make the pipeline show its age?
Yeah, absolutely. So as I was saying, NGINX is a great product, but you know, it's 20-year-old technology.
And the same thing again: we were using NGINX as the base capability. NGINX has a scripting engine, which is Lua.
Lua is a scripting language, an interpreted language, famous enough; there are multiple pieces of software globally running it.
And these started to become slower and slower, not because the architecture got worse (the improvements to NGINX and Lua were all good), but simply because we were adding products and features on top of each other. So we introduced LuaJIT. LuaJIT is nothing else than a just-in-time compiler for an interpreted language; think about Java's just-in-time compilation, and Python has JIT compilation too. It's something we usually do to speed things up.
And it effectively, it did speed up stuff.
But again, it's 15-year-old technology.
Over the course of time, new frameworks, new technologies, and new patterns have been invented and created.
And that's why we could not rely on an old technology to deliver amazing new stuff.
So Matt, kind of back over to you.
As an end user or a customer, what are some of the things you think that a customer might notice as we move to this new pipeline?
Some of the things that were difficult in how we operated FL that are now more graceful and easier for end users?
Well, the first thing I'll say is, I hope they sort of notice nothing.
It's a seamless change.
They're going from one piece of software to another.
Like, we will talk about testing frameworks, etc., how we achieve that.
What we hope to see first is performance: your website, your application just gets a speed boost immediately. You do not have to click a button, you do not have to change any configuration; we've just got more efficient and better, as Maurizio talked about, with the technologies we're using. Typically we're seeing requests being 10 milliseconds quicker, and if you're loading hundreds of requests on a page, that's saving really valuable time. So that's the first thing you'll see.
Then we obviously get feature requests over time, and these can be as simple as "I need a new field in the Rulesets engine to be able to send to my origin." Really simple asks, valuable for customers, that would take us weeks to implement: we'd have to push one bit of software, and when we change something over here, something over there breaks, and then we have to work out how to rebalance that. So there are things customers require that seem simple that we haven't been able to do. We'll be able to get that velocity back up and start delivering those things. Looking further out, I'm really excited about being able to run the WAF on different streaming protocols. gRPC is a really powerful technology, on the protocol side and across all of Cloudflare; being able to inspect gRPC streams and implement WAF rules and things like that, I think, is going to be really powerful. The way I've been pitching this is: NGINX set us up for the first 15 years of Cloudflare, and what we're doing now is setting us up for the next 15 years. It's going to allow us to be a lot more creative and deliver a lot more features that customers are asking for, without the "oh, this is going to take us a long time, because we've got to unravel this rat's nest of wires to be able to plumb it in."
Thanks, Matt. All right, so let's move away from the rat's nest and into the new, elegantly designed everything.
We talk in the blog about two technologies, Rust and Oxy. These seem like significant technical decisions.
Maurizio, maybe you can walk us through the main reasons for the choice of the new language and platform, and what specific benefits the Rust language provides, especially at our scale?
Yeah, 100%. So as you said, and as the blog post says, we decided to move to Rust, and we use Oxy, which is an internal framework for a web proxy (in fact, Oxy as in proxy) that is based on Tokio, a very famous asynchronous runtime, probably the most famous asynchronous runtime for Rust.
First and foremost, why did we use Rust?
Rust, in our opinion, when we compare different potential solutions, had three main advantages.
The first one is that, as Matt said, it's fast. I mean, it's very fast.
It's like driving a racing car.
You'd notice the difference.
The second, which is probably one of the main pillars of Rust (and I'm not saying that it's the solution to every memory violation in the world), is memory safety.
That does not mean Rust is a perfectly memory-safe language; there are still ways you can introduce memory vulnerabilities in Rust, but not from the get-go, unless you do something that you actively want to do.
In terms of memory violations, it makes creating programs easy, from a safety and security perspective, for most of our developers.
And this is what we want, because there are teams that create the platform and there are teams that create features on the platform.
And the teams that create features on the platform should not really have to think about, oh my God, now I need to address this memory safety issue, because otherwise it would be bad.
They're writing products. This is where we come in as a platform team, and that's why we chose Rust.
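To make that "memory safety by default" point concrete, here is a tiny illustrative sketch (not Cloudflare code): the kind of use-after-move bug that would be undefined behavior in C is simply rejected by the Rust compiler, so feature teams rarely need to think about it.

```rust
// Illustrative only: `normalize_header` is an invented example function.
// It takes ownership of the buffer and returns a new owned String, so
// no dangling reference can escape.
fn normalize_header(raw: String) -> String {
    raw.trim().to_ascii_lowercase()
}

fn main() {
    let raw = String::from("  X-Forwarded-For  ");
    let name = normalize_header(raw);
    // `raw` was moved into normalize_header; using it here would be a
    // compile-time error ("borrow of moved value"), not a runtime crash:
    // println!("{}", raw); // <- does not compile
    assert_eq!(name, "x-forwarded-for");
}
```

In C, the analogous mistake (reading a buffer after handing it off and freeing it) compiles cleanly and corrupts memory at runtime; here the compiler refuses the program outright.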
Finally, Rust is cool. There is nothing else. It's a new language.
It's cool. And we care a lot at Cloudflare about hiring the new generation of developers.
This week, we announced that we are hiring 1,111 interns.
I know that hundreds of these interns are going to write Rust, because Rust is one of the languages of the future. If in the '90s that was Java, then one of the languages of the future in 2025 is Rust, and we are right in the middle of that innovation trend.
That's awesome, and it's so exciting to be on the front edge of this stuff. Matt, back to you. One of the features we talked about is graceful restart, which relates to most of the things you just mentioned. Why does graceful restart matter in Oxy? What would it look like, and what benefits do end users or products get from that?
And why is that so crucial for a service like Cloudflare?
I think as a PM, one of the customer calls I unfortunately get brought on most is around, why did my WebSocket or why did this connection suddenly get terminated and the customer get dropped?
And unfortunately, in sort of the NGINX FL world, we used to be in...
So when we released software, it would be a bi-weekly release where everybody's changes went out in one large code release.
What that would mean is when we released that, the FL process would restart.
We give a short time window before we actually interrupt that service and bring it back up.
But that sometimes isn't long enough for long-lived connections, such as WebSockets or streaming long-polling HTTP requests.
With FL, to help us actually accelerate delivery of features, instead of us being the bottleneck of these weekly rollouts, we've moved to a modular approach.
So think of the challenges team that run the Cloudflare challenge.
They have their own repo that lives as a module within FL, which allows them to push their code changes whenever they want.
And we did this 10, 20, 30 times.
So each team was then deploying their code at different times of the day across our network, which meant these restarts were happening a lot more frequently, and they still do.
So what does Oxy actually allow us to do? One, it saves us a ton of CPU. But the main thing it allows us to do is run multiple instances of FL2. The idea that we're looking to implement (this isn't there yet, but it's definitely being implemented right now) is: we have FL2, let's call it version 1.
Requests come in and when we want to release a new piece of FL2 software, version 2 will get deployed and spin up as its own service.
The first version will stop accepting connections but will continue to process the connections it's handling until we decide to kill that process, after, let's say, 24 hours.
Version 2 is then running; new connections coming into Cloudflare get handled by version 2.
24 hours later, we release another FL2 update: version 3 is then spun up, version 1 finally gets killed (so that's now 48 hours later), version 2 stops accepting new connections, and version 3 is what accepts new connections. This sort of rolling restart actually allows us to keep connections open for a lot longer in FL2, so we are hoping that the situations where connections suddenly get abruptly closed, with WebSockets or these long-poll connections, slowly go away.
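The rolling-restart scheme described above could be sketched roughly like this. The type and function names are invented for illustration; this is not Oxy's actual API, just the lifecycle logic: each deployed FL2 instance accepts new connections, then drains existing ones when a newer version arrives, then is killed after a grace period.

```rust
// States an FL2 instance moves through during a rolling restart.
#[derive(Debug, PartialEq, Clone, Copy)]
enum InstanceState {
    Accepting, // newest version: takes all new connections
    Draining,  // older version: finishes in-flight connections only
    Killed,    // grace period elapsed: process terminated
}

struct Instance {
    version: u32,
    state: InstanceState,
}

struct Fleet {
    instances: Vec<Instance>,
}

impl Fleet {
    fn new() -> Self {
        Fleet { instances: Vec::new() }
    }

    // Deploying a new version: it starts accepting, the previously
    // Accepting instance starts draining, and anything older is killed.
    fn deploy(&mut self, version: u32) {
        for inst in &mut self.instances {
            inst.state = match inst.state {
                InstanceState::Accepting => InstanceState::Draining,
                _ => InstanceState::Killed,
            };
        }
        self.instances.push(Instance { version, state: InstanceState::Accepting });
    }

    // New connections always go to the single Accepting instance.
    fn route_new_connection(&self) -> Option<u32> {
        self.instances
            .iter()
            .find(|i| i.state == InstanceState::Accepting)
            .map(|i| i.version)
    }
}

fn main() {
    let mut fleet = Fleet::new();
    fleet.deploy(1);
    fleet.deploy(2); // v1 drains, v2 accepts
    fleet.deploy(3); // v1 killed, v2 drains, v3 accepts
    assert_eq!(fleet.route_new_connection(), Some(3));
    assert_eq!(fleet.instances[0].state, InstanceState::Killed);
    assert_eq!(fleet.instances[1].state, InstanceState::Draining);
}
```

The point of the sketch is that a long-lived WebSocket opened against version 1 keeps running on version 1's draining process for the whole grace window, while brand-new connections land on the newest version.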
Now, this is just one thing that we need to focus on.
Obviously, we talk about different parts of our proxy service.
There's a lot of work being done all around this to make it a lot better, with fewer restarts that impact traffic.
But I'm really excited about this.
It is a pain point for our customers, and I think it's something we can finally, definitely improve: make it so that really only a few requests are affected, because I know sometimes something happens and we have to terminate the service. But yeah, that's why I think this will be a really impactful improvement.
Maurizio, anything to add there? Any benefits from the framework, in terms of the compute side or platform side, that we're getting in that way?
Matt is absolutely right from the perspective of the customer: long-lived connections stay up for much longer. On the other side, there is also a benefit to the whole platform, in the sense that we can manually tune the memory allocation reserved to each process. That allows us to confine older versions of FL2 to a smaller memory allocation, because there are few requests coming in, while giving more memory to the new ones. That substantially means we can run the metals hotter, as we call it: we can process as many requests as we can, all at the same time, without having any impact on customers.
And this translates into, well, natural efficiency. So that's good for us.
And also, anyway, into environmental energy reduction, because we simply consume fewer watts to process the same amount of requests.
Well, let's switch gears and talk about migration. How do we get from here to there?
The system was running the whole time, running every day, every second, globally, and we're making these platform changes across the spine of the whole thing. It's a massive undertaking. So can you walk us through, Matt, a little bit of the strategy and techniques of how we migrated things, specifically around fallback and gradual rollouts, and how we introduced this to our customers without any interruption of service?
Yeah, so one thing, when Maurizio and I were talking about how we'd roll this out: safety is important, but we wanted to be able to get this out quickly, safely, and efficiently. Matthew, our CEO, talks about our superpower being our free plan users. They have a great community, community.cloudflare.com, and I'll give that a shout-out because the MVPs there have been really helpful in this journey. When we build this software, we obviously roll it out slowly via our gradual deployments, but we roll it out safely via plan type first. So our free plans would have got the FL2 service; I think they got it earlier in the year, about six months ago, maybe even more, so they've been on it for a long time. That is around 20 to 25 million zones on the Internet. It's a hell of a lot of, not canaries, but real-world zones, and it gives us really good visibility into how our software changes behave as we roll them out. So first we rolled it out to ourselves; we dogfooded it.
We rolled it out to our free plans.
We monitored the community.
We worked with our community to see if there were any issues.
There were honestly a few little niggles when rolling this software out that we caught with the community's help, investigated, and fixed.
Then we went to our next tranches of paying customers: Pro plans, into Biz plans, and then into our Enterprise plans, rolling out gradually at different percentages.
So nothing was big bang, nothing was immediate. I will let Maurizio talk about our testing framework later, but I will say we're really good at rolling out software.
We roll out software probably 10, 20 times a day.
We are really good at monitoring how the software releases go.
We have a very rigorous software release program across our different Cloudflare Colos, even to a metal level called Slivers.
And again, We then have the alerts, the analytics, everything sitting on top of that to be able to see if there's any issues rather than just going big bang and you're suddenly on this new version of FL2.
So that's sort of how we did this in different tranches.
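As a rough illustration of percentage-based gradual rollout (the real system also gates by plan type, data center, and sliver, none of which is shown here), a stable hash bucket per zone is one common way to do it. Everything below is hypothetical, not Cloudflare's actual mechanism.

```rust
// Hashing the zone ID gives each zone a stable bucket in 0..100 for the
// lifetime of the process, so raising the rollout percentage only ever
// adds zones; it never flips a zone back and forth between FL1 and FL2.
// Note: DefaultHasher is deterministic within a build but not across
// Rust releases; a production system would use a stable hash.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn rollout_bucket(zone_id: &str) -> u64 {
    let mut h = DefaultHasher::new();
    zone_id.hash(&mut h);
    h.finish() % 100
}

fn uses_fl2(zone_id: &str, rollout_percent: u64) -> bool {
    rollout_bucket(zone_id) < rollout_percent
}

fn main() {
    let zone = "example.com";
    // A zone enabled at 10% stays enabled at every higher percentage.
    if uses_fl2(zone, 10) {
        assert!(uses_fl2(zone, 50));
        assert!(uses_fl2(zone, 100));
    }
    // At 100% every zone is on FL2; at 0% none are.
    assert!(uses_fl2(zone, 100));
    assert!(!uses_fl2(zone, 0));
}
```

The design choice worth noting is monotonicity: because the bucket is a pure function of the zone, dialing the percentage up from 5 to 25 to 100 moves each zone to the new pipeline exactly once.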
I will say CJ, our sort of head of engineering and product, was like, you know, we need to be obviously careful when rolling out new software.
He compared it to a Tesla: you wake up in the morning and the software has been installed successfully overnight, so you can drive the car.
I wanted to tell CJ: we're doing the software update as you're driving down the highway at 100 miles an hour. So we've got all of these safety things in place, again, to make sure this is as smooth as possible, and it has been. We're almost at the other end of completing this, and everybody is getting FL2. I will hand it over to Maurizio on the fallbacks and how we did this at a product-to-product level.
Thanks, Matt. I can take it from there. So as Matt was saying, safety is a key feature, which needs to be paired with observability.
So in short, we can move from one side to the other only thanks to knowing what is happening in the old thing and in the new thing, and having a way to switch between the old thing and the new thing at a moment's notice.
So we built exactly that. So the mechanism of fallback is very simple.
A request comes in, it goes to FL2, which runs everything in a shadow mode, and then the request falls back and is processed as normal in NGINX FL1.
Now, because the request comes in and then comes out in exactly the same way, doing exactly the same path in reverse, that means we can monitor the actions, or anything that happened in FL2 in shadow mode, and compare it with what NGINX FL1 did in real life. For real.
This allowed us to say, okay, we have a discrepancy in terms of, to give you an example, the bot score.
Okay, let's adjust that. And we could proceed like this for all of our products, everything that runs through FL, case by case.
Of course, this is not the work of the FL team.
This is the work of 30-plus teams inside Cloudflare that handled it one by one and said, okay, this product does not need to fall back any longer, because we implemented it correctly.
And that's when we can switch to doing the actual work in FL2, and no longer in NGINX FL1.
Proceed like this for hundreds of things, and we get our results.
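The shadow-mode comparison could be sketched like this. The types, the stub pipelines, and the bot-score example are illustrative stand-ins, not the real FL1/FL2 code: FL2 processes the request in shadow, FL1's verdict takes effect, and any discrepancy is recorded so each product team can chase it down before FL2 becomes authoritative.

```rust
// Invented verdict type standing in for whatever FL actually computes.
#[derive(Debug, PartialEq)]
struct Verdict {
    bot_score: u8,
    waf_blocked: bool,
}

// Placeholder pipelines standing in for the real FL1 and FL2 logic.
fn fl1_process(path: &str) -> Verdict {
    Verdict { bot_score: if path.contains("bot") { 2 } else { 90 }, waf_blocked: false }
}

fn fl2_shadow_process(path: &str) -> Verdict {
    // Imagine FL2 scores one path slightly differently during migration.
    Verdict { bot_score: if path.contains("bot") { 5 } else { 90 }, waf_blocked: false }
}

// Shadow mode: FL1's verdict is always the one that takes effect;
// FL2's is only compared, and any mismatch is counted.
fn handle_request(path: &str, discrepancies: &mut u32) -> Verdict {
    let authoritative = fl1_process(path);
    let shadow = fl2_shadow_process(path);
    if shadow != authoritative {
        *discrepancies += 1; // in production: a metric per product/team
    }
    authoritative
}

fn main() {
    let mut discrepancies = 0;
    let v = handle_request("/bot-probe", &mut discrepancies);
    assert_eq!(v.bot_score, 2); // FL1 still decides
    assert_eq!(discrepancies, 1); // but the mismatch was observed
    handle_request("/index.html", &mut discrepancies);
    assert_eq!(discrepancies, 1); // matching verdicts add nothing
}
```

Once a product's discrepancy counter stays at zero, the flag can flip so FL2's verdict becomes the authoritative one, which is exactly the per-product cutover described above.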
Now, that sounds amazingly complex.
Let's talk a little bit about the system you guys have built to enable this testing at scale and how you do this steady state ongoing forever.
I think we've called it Flamingo; there's probably a great story behind the name.
Mauricio, do you want to walk us through what this is and sort of some of the background on how it works?
Yeah, so first and foremost, there is a convention: everything that FL uses needs to start with "FL".
There are not that many words in English that start with "FL", so that's an advantage.
And Flamingo is one of them.
Flamingo is the largest distributed testing system in the world, probably at this point.
Substantially, Flamingo is a small piece of software whose role is to constantly throw requests at FL everywhere in the world, 24/7, 365 days a year, with a set of things to check, verifying that each request did the right things, where the right things are, of course, defined in the test itself.
That means that every team in Cloudflare needs to write what we call acceptance tests, and an acceptance test is nothing else than a Flamingo test where you specify the request, you specify the output, and you specify this is what needs to happen when you send the input.
This runs everywhere, globally, in every data center, on every single metal, constantly.
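The shape of a Flamingo-style acceptance test might look something like this. The field names and the stubbed proxy are invented for illustration; a real test would issue an actual HTTP request against the machine it runs on.

```rust
// Invented shape of an acceptance test: a request plus an expected outcome.
struct AcceptanceTest {
    name: &'static str,
    request_path: &'static str,
    expected_status: u16,
}

// Stub standing in for the proxy under test.
fn send_request(path: &str) -> u16 {
    if path.starts_with("/blocked") { 403 } else { 200 }
}

// Runs the suite and returns the names of failing tests, i.e. the
// regressions that would page someone before customers notice.
fn run_suite(tests: &[AcceptanceTest]) -> Vec<&'static str> {
    tests
        .iter()
        .filter(|t| send_request(t.request_path) != t.expected_status)
        .map(|t| t.name)
        .collect()
}

fn main() {
    let suite = [
        AcceptanceTest { name: "homepage ok", request_path: "/", expected_status: 200 },
        AcceptanceTest { name: "waf blocks probe", request_path: "/blocked", expected_status: 403 },
    ];
    let failures = run_suite(&suite);
    assert!(failures.is_empty());
}
```

The key property is that the same declarative suite runs continuously on every machine, so a regression introduced by any release surfaces as a named failing test on a specific metal.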
The reason why we do that is because we want to detect regressions.
We want to detect issues with patterns that we know need to behave a specific way, immediately, not when customers start to notice, but as soon as one metal in the world starts reporting it. Because of our release, because of whatever, we need to know before anyone else.
I mean, that's fantastic, and to be able to execute that at scale, globally, 24/7, is an incredible achievement all by itself. Well, let's talk about the results and what's next, guys. The numbers so far are impressive: the things we mentioned in the blog, websites, as Matt mentioned, becoming 10 milliseconds faster, and Maurizio, as you mentioned, using less than half the CPU of FL1, much more efficient memory management, things like that. What does that mean, Matt, for the average internet user, and what should we expect to see as we go forward from here?
For the average internet user, hopefully you will see that the Internet has got faster. That is what you should see. Things on your device, when you are walking around town and you're just googling and going to a website, that site should load faster. With regards to the CPU metric, that becomes really interesting for, again, the next 15 years of Cloudflare: how do we scale our vast network? There's only a finite amount of colo space, only a finite amount of CPUs, only a finite amount of GPUs, and everyone's trying to buy all of the resources at the moment and scale up. So if we can't buy more, we need to find a way of using less, or using it more efficiently, I should say.
So that's, again, what FL2 allows us to do.
It allows us to run the multiple versions of FL2 with their connections.
That's, again, because we're saving CPU and utilising memory smartly.
But it also means we're allowed to, as our networking grows and goes beyond 100 million requests per second, 150 million, 200 million, allows us to keep scaling.
And we don't have to keep scaling the racks or keep buying the hardware to handle that. It means that requests are being served globally in the local data center; they don't have to be, say, routed to a massive data center because there's more capacity there. We're able to keep it in the same country, so again, speed and efficiency of sites loading, which is really exciting.
Okay, Maurizio, over to you. In terms of network architecture for Cloudflare, there's more to go here. We talk a little bit at the very end about how we're also doing TLS, and maybe some further Spectrum or other protocols. Can you give us a sense of what happens from here at the architecture level?
So as I said, at the moment we are moving FL, which is, as Matt was saying at the beginning, the brain of the CDN. Realistically, that's not the only proxy that is active in Cloudflare for the CDN; there are other components in the proxy chain that we use. So the first bit is: let's migrate all of them onto a new technology that is standardized, where every single piece has points in common in terms of using Rust, using Oxy, using Tokio, and so on. This allows us, very simply, to move stuff around between the different services in case we need to, but also to move engineers around from one team to another, which is something that Cloudflare likes to do. If a person wants to have a different experience, they can just move to another team; the base technologies are the same, therefore they can do a much more fun job, and it's not "I need to relearn everything from scratch." So that's the first step.
Then the second step: as we were saying before, we are operating a chain here. So: A, then B, then C. Why is that so?
Well, at the moment, we know why that is so, because there are pieces that need to be done in A, then B, then C.
But traffic does not look all the same.
Traffic is different according to things, where things is defined by product, of course.
Therefore, a possible idea here, rather than having a chain, is to have a mesh network of components that handle traffic.
Different traffic types then take different paths inside the CDN, to get, very simply, the same results we are aiming for with FL2: faster, so better performance, and more CPU and memory efficiency.
So these are the two main metrics that we care about.
How performant are we, and how efficient are we?
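The chain-versus-mesh idea could be sketched as per-traffic-type path selection. The stages and traffic types below are invented examples, not Cloudflare's actual component names: today every request walks the same fixed chain, whereas in a mesh each request type gets only the stages it needs.

```rust
// Invented stages standing in for components in the proxy chain.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Stage {
    Tls,
    Waf,
    Cache,
    Origin,
}

// Invented traffic classes used to pick a path.
enum TrafficType {
    StaticAsset,
    ApiCall,
    WebSocket,
}

// Mesh routing: each traffic type gets its own path instead of the
// single chain A -> B -> C that every request pays for today.
fn path_for(traffic: TrafficType) -> Vec<Stage> {
    match traffic {
        TrafficType::StaticAsset => vec![Stage::Tls, Stage::Cache, Stage::Origin],
        TrafficType::ApiCall => vec![Stage::Tls, Stage::Waf, Stage::Origin],
        TrafficType::WebSocket => vec![Stage::Tls, Stage::Origin],
    }
}

fn main() {
    // A static asset never pays the WAF stage's cost, and a WebSocket
    // skips the cache entirely: the efficiency win of a mesh.
    assert!(!path_for(TrafficType::StaticAsset).contains(&Stage::Waf));
    assert!(!path_for(TrafficType::WebSocket).contains(&Stage::Cache));
    assert!(path_for(TrafficType::ApiCall).contains(&Stage::Waf));
}
```

The payoff is exactly the two metrics named above: skipped stages are CPU and memory not spent, and shorter paths are latency not added.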
Well, Matt, Maurizio, thanks for joining me here today, and thanks to everybody watching.
These are kind of the new opportunities that this platform is offering Cloudflare as we go.
This is a reliable, fast, and efficient infrastructure that powers our entire company, all the way up to edge computing that we're known for.
Our storage solutions, everything else built on top of this continues to scale on this platform.
This is what we do at Cloudflare.
We want to help the Internet become a better place. If this sounds like your idea of fun, check out our careers page. And thanks for joining us here today.