This Week in Net: From Rust to Python and a glimpse of what to expect from Security Week
Welcome to our weekly review of stories from our blog and elsewhere, covering a range of topics from product announcements, tools and features to disruptions on the Internet. João Tomé is joined by our CTO, John Graham-Cumming.
In this week's program, we go over our recent blog posts related to two relevant programming languages, Rust and Python. The related posts are:
- Oxy is Cloudflare's Rust-based next generation proxy framework
- How we built an open-source SEO tool using Workers, D1, and Queues
- How Rust and Wasm power Cloudflare's 1.1.1.1
- Keeping the Cloudflare API 'all green' using Python-based testing
We also discuss how the new Zero Trust navigation is coming soon (and we need your feedback), how to build resiliency into systems with Cloudflare Workers, and why you should read our blog post on the White House’s National Cybersecurity Strategy.
Last but not least, we give you a teaser on what to expect from the upcoming Security Week (one of Cloudflare’s Innovation Weeks). And Andie Goodwin and Angela Huang share some insights about our blog post on International Women’s Day and give some suggestions.
Further reading
- Deploying firmware at Cloudflare-scale: updating thousands of servers in more than 285 cities
- Embrace equity on International Women’s Day (and every day)
- Accelerate building resiliency into systems with Cloudflare Workers
- The White House’s National Cybersecurity Strategy asks the private sector to step up to fight cyber attacks. Cloudflare is ready
- New Zero Trust navigation coming soon (and we need your feedback)
- How Cloudflare runs Prometheus at scale
- How we built an open-source SEO tool using Workers, D1, and Queues
- Cloudflare's network expansion in Indonesia
Transcript (Beta)
Hello, everyone, and welcome to This Week in Net. It's the March 10th, 2023 edition, and our Security Week, full of announcements, is coming up, so we'll have a teaser at the end about it.
I'm João Tomé, based in Lisbon, and with me, I have, as usual, our CTO, John Graham-Cumming.
Hello, John. How are you? Hello. I'm fine, thank you.
How are you? I'm good, too. It's been like two weeks since we did This Week in Net.
I know. I got a cold. I got a cold last week, and I was feeling kind of bleh.
It happens. It happens to all of us. It's still winter here in Lisbon, so it's happening.
Honestly, I was promised better weather in Lisbon than this.
I want my money back. Disappointed. It's coming. It's coming, the good weather.
It's coming. You promise? I promise. I'm going to go see Marcelo and ask him about the weather.
I was promised more days of sunshine. President Marcelo.
Actually, President of Portugal Marcelo was at our Cloudflare booth that we had here in Lisbon a few days ago, last week, actually.
So that was a moment.
I will share a photo of that. You should do that. It's worth saying the event we were at is called Sinfo, and it's a student recruiting event held at a very well-known university here in Portugal that everybody calls Técnico.
I gave a little speech about what Cloudflare does, and then Celso Martinho on the team did a workshop on building stuff using Cloudflare.
It was a good event.
It was, it was. Let's dig right into it. We had, actually, last week, a bunch of Rust-related blog posts.
Why not start there? It's not every week that we have more than one Rust-related blog post, so that's always interesting.
Actually, let's start with why Rust is important for us, really, as a coding language.
Yeah, as a language, yeah.
Well, if you dial back to 2017, so six years ago, in February of that year, we had this awful security bug that got known as Cloudbleed.
And one of the reasons that Cloudbleed happened was that we were using what are called memory-unsafe languages.
So, in particular, we were using C, and a memory problem in C caused us to leak memory, which caused the Cloudbleed thing.
And after that, we did a whole bunch of stuff to prevent that from happening again.
One of the things was a drive towards using memory-safe languages, i.e. languages in which it is impossible to do the sorts of things that happened in the C code that caused Cloudbleed.
And so, one of those languages is Rust.
Rust has a very specific way of handling memory, which means that it's not garbage collected, but it makes sure that memory accesses are safe.
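To make the memory-safety point concrete, here is a minimal Python simulation of the class of bug being described: a read that runs past the end of one buffer and returns whatever sits next to it in memory. It is not Cloudflare's code, and it is not the actual Cloudbleed bug. The names `MEMORY`, `unchecked_read`, and `checked_read` are invented for illustration. The unchecked version behaves like raw pointer arithmetic in C; the checked version behaves the way a memory-safe language does, refusing the access.

```python
# Toy simulation of a buffer over-read. One flat bytearray stands in for
# process memory: a 5-byte request buffer sits next to unrelated private data.
REQUEST = b"hello"
SECRET = b"session-key"
MEMORY = bytearray(REQUEST + SECRET)
REQUEST_START, REQUEST_LEN = 0, len(REQUEST)


def unchecked_read(offset: int, length: int) -> bytes:
    """Mimics raw pointer arithmetic: nothing stops the read from
    running past the end of the request buffer."""
    start = REQUEST_START + offset
    return bytes(MEMORY[start:start + length])


def checked_read(offset: int, length: int) -> bytes:
    """Mimics a memory-safe language: every access is checked against the
    buffer's own bounds, so an over-read fails instead of leaking."""
    if offset < 0 or length < 0 or offset + length > REQUEST_LEN:
        raise IndexError("read outside the request buffer")
    start = REQUEST_START + offset
    return bytes(MEMORY[start:start + length])


# Asking for 16 bytes from a 5-byte buffer leaks the neighbouring data.
leaked = unchecked_read(0, 16)
print(leaked)
```

Calling `checked_read(0, 16)` raises `IndexError` instead of returning the neighbouring bytes. Rust gets its guarantees through its ownership rules and bounds-checked indexing rather than through a garbage collector, but the visible effect for this kind of access is the same: the out-of-bounds read does not silently succeed.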
And so, Rust became a language that we were keen on using. It also is a language which is, in some ways, closely tied to Wasm.
And Wasm, WebAssembly, underlies our developer platform, Workers.
And so, the fact that Rust compiles so easily to Wasm means that that becomes kind of interesting, and you can actually run Rust Wasm code.
And so, the third thing is that a number of developers at Cloudflare were interested in using Rust and were highly motivated to use it.
So, it was memory-safe, good stuff with Wasm, people were highly motivated, and the performance is very, very good with Rust.
And we obviously want very low latency through our systems.
And so, Rust has grown like rust on us, basically. It's made us go all rusty.
So, we have a lot of things that are written in Rust now. And this week, we put out a number of blog posts about things that are written in Rust that people perhaps hadn't heard of.
This particular blog post you brought up is actually rather interesting because it encompasses Rust and Wasm.
So, Cloudflare runs this DNS resolver, public DNS resolver, 1.1.1.1, which has been around now for about five years.
And originally, we started out using a thing called Knot Resolver.
And this thing has grown enormous, right?
You're talking about trillions of DNS requests going through this thing.
And so, over time, like any system, once you get to know what functionality you need, you start to want to build things yourselves.
And 1.1.1.1 fits in the whole Cloudflare architecture, something called the supercloud, where we have stuff running all over the world.
And Knot was using Lua, which we're quite familiar with, because it was part of Nginx and OpenResty when we were using that.
And we were able to customize it through Lua modules. But over time, a lot of things became difficult.
Lua isn't that popular a language, unfortunately.
So, people have to get up to speed on it and learn about it.
There are issues with performance, there's issues with memory. There's all sorts of things that made us say, we need to own our own destiny here.
And we've done this in the past, right?
We did this with our own authoritative DNS server, rrdns, which we wrote and which is used for the authoritative DNS that is out there.
And we wrote this thing, which is called BigPineapple.
And I actually have forgotten to ask the team why it's called BigPineapple.
But anyway, BigPineapple. It's a good name.
It's a very good name. Yeah. It's a very interesting name, but I don't know why.
I'm curious now. I'll have to learn. But BigPineapple is written in Rust and becomes the DNS resolver, which everyone is using for 1.1.1.1.
And in addition, and this is where it gets kind of interesting, when you scroll down in this blog, you'll see that Wasm raises its head.
Now, the reason is you want to have some customization.
So, you want to be able to do modifications and dynamically add and remove functionality from these things.
And especially when you're operating on scale, it's very useful to do that.
So, here it is. It has this plugin functionality.
Now, originally we used Lua, but now actually we're using Wasm. So, within this program, there's actually Wasm running.
And so, you can actually send a Wasm program in here and it can run within a sandboxed environment inside this other program.
And we distribute those using Quicksilver, which is our very fast global key value store, which allows us to push stuff out.
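The plugin idea described here, adding and removing functionality while the resolver keeps running, can be sketched in a few lines. This is only an illustration of the shape of such a mechanism under assumptions of my own: the `Resolver` class, the plugin signature, and the `"RECURSE"` and `"REFUSED"` placeholders are all hypothetical. The real system is a Rust program that runs Wasm modules in a sandbox; a plain Python callable is not sandboxed.

```python
from typing import Callable, Dict, Optional

# A plugin inspects a query name and returns an override answer, or None
# to let the query continue. (Hypothetical interface for illustration.)
Plugin = Callable[[str], Optional[str]]


class Resolver:
    def __init__(self) -> None:
        self._plugins: Dict[str, Plugin] = {}

    def load(self, name: str, plugin: Plugin) -> None:
        """Add or replace a plugin while the resolver keeps running."""
        self._plugins[name] = plugin

    def unload(self, name: str) -> None:
        """Remove a plugin; unknown names are ignored."""
        self._plugins.pop(name, None)

    def resolve(self, qname: str) -> str:
        # Plugins run in load order; the first one to answer wins.
        for plugin in self._plugins.values():
            answer = plugin(qname)
            if answer is not None:
                return answer
        return "RECURSE"  # placeholder for normal recursive resolution


def block_example(qname: str) -> Optional[str]:
    """Example plugin: refuse anything under a made-up blocked zone."""
    return "REFUSED" if qname.endswith(".blocked.example") else None


resolver = Resolver()
before = resolver.resolve("a.blocked.example")   # no plugin loaded yet
resolver.load("blocklist", block_example)
during = resolver.resolve("a.blocked.example")   # plugin overrides the answer
resolver.unload("blocklist")
after = resolver.resolve("a.blocked.example")    # back to default behaviour
print(before, during, after)
```

In the setup described in the conversation, the step that corresponds to `load` would be triggered by a new Wasm module arriving through Quicksilver rather than by a direct function call.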
So, this is a great example of Rust and Wasm together for one of the largest DNS resolver things in the world, which is what we're running.
So, that was one of the Rusty future things that we talked about.
Exactly. And for those who don't know, 1.1.1.1 is our DNS resolver, and it has millions of users.
So, millions of people are using it for privacy, safety online.
So, this impacts a lot of users there, which is interesting. Yeah.
I mean, to give you an idea of scale, we do about 25 million DNS queries per second.
So, just over 2 trillion per day. And that's all running through code that we wrote, be it rrdns, which is our authoritative, or 1.1.1.1, which is our public resolver.
So, this code is really running at scale, and globally as well.
And what are the advantages of writing your own code, more specifically? Well, I mean, one thing you can do is you can take an existing project and modify it.
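As a quick sanity check on that scale, 25 million queries per second multiplied by the number of seconds in a day comes to roughly 2.16 trillion queries per day:

```python
QUERIES_PER_SECOND = 25_000_000      # figure quoted in the conversation
SECONDS_PER_DAY = 24 * 60 * 60       # 86,400

queries_per_day = QUERIES_PER_SECOND * SECONDS_PER_DAY
print(f"{queries_per_day:,} queries per day")
```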
And we've done that over the years. I mean, for example, before we wrote Quicksilver, which is this way of distributing configuration and data around our entire network, we used to use a thing called Kyoto Tycoon.
And that worked out pretty well for us for a long time.
And underneath, it uses a thing called Kyoto Cabinet to store the data.
But over time, you start to add functionality to it.
And for example, in the case of the Kyoto Tycoon world, we added TLS connections to it.
And then we started to get issues of scale. So, what happens is maybe it works great at some particular number of key values stored in it.
But at some point, it doesn't scale so well.
And in particular, we ran into a problem of systems getting out of sync on the network.
And actually, how long it would take one system to catch up with another one.
So, one option is, well, can I modify the existing thing I have?
And that's often a really good option if you can, because you haven't got to write everything from scratch, and always writing from scratch is hard work.
But at some point, you end up forked away maybe from the original project, or you end up with so much customization that you're kind of working around the architecture of the piece of software that you built this thing on.
And so, at some point, and we saw this way back actually with our DNS. So, originally, we were using PDNS, PowerDNS, which was fantastic.
But at some point, you've got enough modifications and things you're doing, you're like, it'd be really better if we had a different architecture, maybe for IO performance, or maybe something to do with memory.
Like many reasons why you might want to do that.
And so, what Cloudflare has always done is kind of gone as far as we can with an existing open source project.
And once we've reached a point where we're hamstrung, perhaps by performance characteristics or the architecture of that project, it makes sense to write something new.
I mean, for example, in the case of rrdns, people often ask us if we could open source it, which we could.
The problem is rrdns is so bound to Cloudflare's own architecture and Cloudflare's own business logic, you'd have to extract most of it for it to be interesting to anybody else.
Otherwise, it's like very specifically tied in. So, these things you have to sort of make a decision on, how hard is it for us to make real progress?
And in the case of 1.1.1.1, we knew that we'd really pushed the Knot/Lua combination as far as we could.
And as we're scaling, and as we also want to optimize the performance, it just becomes a good idea to write something new.
We've got some more, right?
Bring up some more. So, I mean, Oxy. So, the post by Ivan about Oxy, this is also another really interesting thing.
So, a lot of what Cloudflare does is around passing traffic, web traffic, and other sorts of traffic from clients to servers, right?
And doing something to that connection. It might be like caching in the case of the CDN, or in the case of the WAF, it might be blocking a request because we've seen that there's something malicious in that.
Or in the case of the Zero Trust Gateway, it might be blocking access to a domain that the organization doesn't want people to visit.
Or in the case of iCloud Private Relay, it's about jumping between hops that provide privacy.
All of those things have in common that they are proxies.
They do something on behalf of the client. So, you come to Cloudflare, you say, hey, I need to get this webpage, for example, or I need to go send this SSH connection somewhere.
And we, in the middle, we'll take that, maybe we're inspecting it for a DDoS or WAF or something like that, and then pass it on.
So, this idea of proxying is really, really important.
When Cloudflare started, we used Nginx, which is a well-known web server and also can be a proxy to do this.
And over time, we have needed more and more and more sophisticated proxies.
And so, the cache team built a thing called Pingora, which is a proxy that gets from the backend of Cloudflare to our customers' servers.
And another team built this thing called Oxy.
And Oxy is actually almost a framework for building proxies. And what is perhaps most fascinating to me is that it allows a proxy to operate dynamically at different layers of the OSI stack.
So, it might be proxying an IP, a stream of IP traffic, because we might be handling that for somebody.
But there might be a necessity to look at the TCP layer or UDP if it's in there, or go up to the HTTP layer.
So, it can actually operate at different levels of the stack, which gives you an enormous amount of power to handle different types of traffic in the same framework.
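A toy model can show what operating at different layers means in practice: hooks attach to a layer, and a flow is only processed as far up the stack as the highest registered hook requires. This is not Oxy's API (Oxy is written in Rust, and its real interfaces are described in the blog post); the `Layer`, `LayeredProxy`, `on`, and `handle` names here are hypothetical.

```python
from enum import IntEnum
from typing import Callable, Dict, List


class Layer(IntEnum):
    IP = 3
    TRANSPORT = 4   # TCP or UDP
    HTTP = 7


Hook = Callable[[dict], None]


class LayeredProxy:
    """Toy proxy: hooks attach to a layer, and traffic is only processed
    as far up the stack as the highest registered hook requires."""

    def __init__(self) -> None:
        self._hooks: Dict[Layer, List[Hook]] = {layer: [] for layer in Layer}

    def on(self, layer: Layer, hook: Hook) -> None:
        self._hooks[layer].append(hook)

    def handle(self, flow: dict) -> List[str]:
        """Run hooks bottom-up and return the names of the layers visited."""
        needed = [layer for layer in Layer if self._hooks[layer]]
        top = max(needed) if needed else Layer.IP
        visited = []
        for layer in Layer:          # iterates IP, TRANSPORT, HTTP in order
            if layer > top:
                break
            visited.append(layer.name)
            for hook in self._hooks[layer]:
                hook(flow)
        return visited


proxy = LayeredProxy()
proxy.on(Layer.IP, lambda flow: flow["seen"].append("ip"))
ip_only = proxy.handle({"seen": []})

proxy.on(Layer.HTTP, lambda flow: flow["seen"].append("http"))
http_flow = {"seen": []}
with_http = proxy.handle(http_flow)
print(ip_only, with_http)
```

With only an IP-level hook registered, the flow never leaves the IP layer; once an HTTP hook is added, the same proxy walks up through the transport layer to HTTP.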
And so, the team built this thing called Oxy.
And it's in Rust. There is a sequence of different blog posts coming out.
So, two have come out already, I believe. I mean, Oxy is a foundation for a very large amount of what Cloudflare does.
Actually, between Oxy and Pingora, I mean, an enormous amount of what Cloudflare does is being actually handled in Rust.
So, if you think that DNS is happening in Rust, or at least the resolver is, a lot of the proxying is happening in Rust as well.
The Edge Rules engine is in Rust.
I mean, there's a bunch of stuff. So, this one is going to turn out to be the first of, I don't know, five blog posts.
So, if you're interested in Rust, well worth looking at.
We've just done a bit of Prometheus there. And then there's one about testing.
So, this post about keeping the Cloudflare API all green by Ellie is really interesting, because Cloudflare is based on an API: everything is being done through Cloudflare's API.
And if we want to make sure that that API is running correctly, we need to have tests.
And we need to run those tests in a realistic environment.
And so, actually, the team built this thing called Scout, which allows them to automatically test our APIs in an environment that's production-like, so that we can be sure that when we go to production with an API change, we haven't broken anything.
And, you know, many people depend on an API, its results being correct, its results being unchanging.
And, you know, if we break something, then, you know, we have to be very, very careful about it.
So, Scout is actually a nice piece of work done in Python to make sure that we can absolutely test our API before we put it out to production use.
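The kind of check such a test suite might run can be sketched as follows. This is not Scout's code, which is described in the blog post; it is a minimal, hypothetical contract check. The `check_response` function, the envelope keys in `REQUIRED_KEYS`, and the sample payloads are assumptions made for illustration. The idea is the one stated above: a response's shape should not change unexpectedly between releases.

```python
from typing import Any, Dict, List, Set

# Assumed response shape for illustration: a JSON envelope with these keys.
REQUIRED_KEYS = {"success": bool, "errors": list, "messages": list, "result": dict}


def check_response(body: Dict[str, Any], expected_result_keys: Set[str]) -> List[str]:
    """Return a list of contract violations; an empty list means 'green'."""
    problems = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in body:
            problems.append(f"missing key: {key}")
        elif not isinstance(body[key], expected_type):
            problems.append(f"wrong type for {key}")
    if isinstance(body.get("result"), dict):
        # Fields that clients rely on must not disappear.
        missing = expected_result_keys - set(body["result"])
        problems.extend(f"result lost field: {name}" for name in sorted(missing))
    return problems


# Hand-written stand-ins for responses from a production-like environment.
good = {"success": True, "errors": [], "messages": [],
        "result": {"id": "abc", "name": "example.com"}}
broken = {"success": True, "errors": [], "result": {"id": "abc"}}

good_report = check_response(good, {"id", "name"})
broken_report = check_response(broken, {"id", "name"})
print(good_report, broken_report)
```

In a real suite the payloads would come from live requests against the test environment, and a non-empty report would fail the run before the change reaches production.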
Makes sense. Another technical one, a different language, and a different purpose.
So, it's doing a different thing, in this case, related to the API.
It's always interesting to see that different languages give you different possibilities in terms of what to build, right?
Well, sure.
I mean, like, you know, here in Lisbon, we've got a lot of people actually working with Python because of all the data science stuff we're doing for Cloudflare Radar.
So, you know, we see a lot of Python here, and here you see it for testing.
And so, I think, within Cloudflare, you have a lot of different languages being used for different purposes.
Makes sense. Also, a blog post related to the new Zero Trust navigation that's coming soon.
And, in a sense, we're trying to get feedback from people to make it better.
Yeah. So, if you're a Zero Trust customer, then you will have noticed, probably, that the Zero Trust dashboard looks a little bit different from the main dashboard of Cloudflare or the traditional dashboard.
And there's work being done to unify those experiences.
And this blog post teases some changes we're doing to make sure that things are really integrated really, really well together, and also is asking for feedback.
So, if you are a Zero Trust customer, we would love to know about the usability, because I think one thing that Cloudflare really cares about is that it's a truly integrated experience.
I mean, there are companies, competitors of Cloudflare, that have grown by acquisition.
So, what they do is they buy a bunch of companies, they tell a great marketing story about, we have all these services, but behind the scenes, those services are not really heavily integrated.
And what you end up with is what I think of as like Frankenstein's monster-type companies, where they've sewn together a bunch of bits, they tell you it's alive, and it's kind of moving around, but the UI experience isn't good.
And, in fact, the overall experience isn't good, because stuff isn't integrated.
So, we've always really, really focused on deep integration, whether it's something we bought or something new that we built.
And the Zero Trust thing was a little bit odd, because the dashboard looked a bit different than the rest of it.
And we're now correcting that, because we really want to stay on this single pane of glass, everything is available everywhere model, because I think it's a promise we made to our customers.
And in a sense, it's about consistency: the different dashboards look similar to one another.
So, you'll feel at home in the Zero Trust dashboard if you know Cloudflare.
Absolutely. And there's a short survey that people can fill in for us.
Please tell us, if you're a Zero Trust customer, we would love to hear from you.
Yeah, that's also always a good thing to have feedback from customers. We have a few more blog posts, one about Women's Day, International Women's Day.
Actually, at the end of this segment, we'll have Andie and Angela giving us some input there.
That's great. But also, one related to Cloudflare Workers, in this case. Yeah, can we talk about this one?
It's one of those interesting uses of Cloudflare Workers.
So, what we're writing about is, we send emails to our customers. So, if you ask to reset your password, we have to email you the link to click on, right?
How do we do that?
We use what are called ESPs, Email Service Providers. So, we use external companies to actually do the email sending.
The reason we do that is they're the ones who maintain the reputation of their service, and so the email gets delivered.
Because if you send email and you don't have a good reputation, it goes in the spam box.
And we can't have a "your password needs to be changed" email, or some other critical email we're sending you, go in spam.
So, we send through ESPs. But what if an ESP has an outage and we can't send email and some things are super critical?
For example, if it's a one-time password login email, that needs to arrive in five minutes, right?
There's usually a timeout on that. Sometimes less. Sometimes less, right?
It should arrive as quickly as possible. And so, critical emails really matter to us.
And so, what the team did was they built this very simple thing in Cloudflare Workers, which allows them to move traffic from one ESP to another by putting an API in front of it, because most of these things use an HTTP API.
You say, okay, send an email, and it will make a decision about which ESP it goes to.
And there's just a configuration parameter where they can adjust the percentage that goes to different ESPs.
But it's not necessarily a one or a zero thing, because you actually want to keep email flowing through both of them, first of all, to build your own reputation of sending email, but also it allows you to flip to one or zero if there's a problem.
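The percentage-based routing described here can be sketched as a weighted random choice. The real implementation is a Cloudflare Worker and is shown in the blog post; the version below uses Python purely for illustration, and the provider names `esp_a` and `esp_b`, the weights, and the `pick_esp` function are all hypothetical.

```python
import random
from typing import Dict, Optional


def pick_esp(weights: Dict[str, float], rng: Optional[random.Random] = None) -> str:
    """Choose a provider with probability proportional to its weight.
    A weight of 0 removes that provider from rotation."""
    rng = rng or random.Random()
    live = {name: w for name, w in weights.items() if w > 0}
    if not live:
        raise ValueError("no email provider has a non-zero weight")
    names = list(live)
    return rng.choices(names, weights=[live[n] for n in names], k=1)[0]


# Normal operation: keep mail flowing through both (hypothetical) providers.
normal = {"esp_a": 70, "esp_b": 30}
# During an outage at esp_a, one configuration change moves everything to esp_b.
failover = {"esp_a": 0, "esp_b": 100}

rng = random.Random(42)  # seeded only so the demonstration is repeatable
normal_sample = [pick_esp(normal, rng) for _ in range(1000)]
failover_sample = [pick_esp(failover, rng) for _ in range(1000)]
print(normal_sample.count("esp_a"), failover_sample.count("esp_a"))
```

Keeping the weights in configuration rather than in code is what makes the switch fast: flipping a provider to zero is a data change, not a deployment.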
And it's really nice. And so, the code is really simple.
You can look at the example of how it works in here, and they've actually used this in the real world.
And I think right down at the bottom, there's a thing where they say, an actual example of how fast they flipped over between things.
Oh, there's an outage here represented in terms of ESP. Yeah, it was a real thing.
So, it was basically five minutes. And that included humans making a decision.
And they have this lovely graph here where the load on one email provider went down to zero, the other one ramped up as we switched over.
And what was nice about this is, this isn't an example of something which is deep computer science or AI or any magical stuff like this, but this is incredibly valuable and so easy to do in Cloudflare Workers.
The code is so simple. And so, I think, this is one of those great, lovely examples of there's so much you can do on a thing like Cloudflare, you can deploy something like this.
It's not costing you anything when you're not using it, or it's very inexpensive.
And when it needs to be used, bam, just go make it happen.
Yeah. And there's that example of simple code and the help that it is giving, which is also interesting.
Absolutely. We also have, actually, from today, a blog post, still not published.
Sneak preview. Yeah, we can give a sneak preview.
Although when this will be out... Yeah, by the time this is out, it's no longer a sneak preview.
So, you have to go back in time, and it's a sneak preview.
So, it's all about deploying firmware at a global scale, which is the Cloudflare scale.
Yeah. One of the things about anything at scale is that you're like, when you scale something up, things that are relatively simple actually become a real pain.
So, for example, we have 285 cities where we have our hardware, and we can upgrade the software on those machines, the things we built, like Quicksilver.
We use Salt, as we've written about in the past, for configuration.
We're able to upgrade the software stack that's on there. But sometimes you need to upgrade the firmware on those machines, or some subset of machines.
And so, there's a whole world of, okay, so how do I upgrade thousands and thousands and thousands of machines at the firmware level?
And what Chris has written about here is how we do that.
Which, you know, he gives you an idea of how we update, because we have to update the BMC, we have NICs, we have other cards in there, like all sorts of stuff that needs to be updated.
So, he gives an actual idea of how we do that on our fleet.
So, coming soon, by the time you see this, we will see how we do those updates of the BIOS, et cetera, on the machines.
And this once turned out to be really important, and we're going to bring it back to Cloudbleed: after Cloudbleed, one of the things we wanted to do was get down to zero segfaults, zero crashes of our software per day on our network, because crashing often happens if there's a memory problem.
So, it was a way of saying, look, if we eliminate all these problems, there's no crashes happening.
And we really drove that down, and there were still some crashes.
And there's a really great blog post by David Wragg about the search for those last few crashes, which turned out to be a problem in the microcode of a particular set of Intel processors.
It was literally a CPU fault, and we actually had to update the microcode.
Again, something, you know, when you suddenly are at scale, you suddenly discover all sorts of interesting things.
And there's a very interesting effect, which is that as you get a large number of machines, the long tail of problems becomes interesting, and there are interesting things to find in it.
And that was one of them.
I really found interesting also that Chris does explain that you should update the firmware of your laptop, of your desktop, of your IoT device.
And in this case, it's the larger-scale version of that typical "every user should do it" type of thing, which is always interesting.
Yeah. And we have to do this for security reasons, for performance reasons, for stability reasons.
At our scale, it's necessary. Exactly. We also have a blog post that I invite everyone to see about the White House's National Cybersecurity Strategy, asking the private sector to step up to fight cyberattacks.
And Zaid Zaid from our team explains how Cloudflare is ready in this matter.
Yeah. I mean, this is really worth reading.
I mean, actually, the National Cybersecurity Strategy is worth reading because although it's written by the US government, I mean, it's very applicable, cyberspace is everywhere, to the rest of the world in terms of how do you protect yourselves against cyber threats.
And this lays out a strategy for dealing with it.
There's a lot of interesting stuff in here, not just around critical infrastructure, which is power and hospitals, all this kind of stuff, but also around the general use of the Internet.
So my sense is that you should at least read our blog post, which will give you sort of a taste of it, but it's well worth reading the actual National Cybersecurity Strategy.
True. And I think this is a good segment for what's coming.
We have Security Week next week, one of Cloudflare's Innovation Weeks.
We already had CIO Week back in January, and now Security Week is coming up.
What can we say to everyone about security week? There's a lot of blog posts.
I think there's 41 at this point. So eight a day, basically.
So a lot of stuff. I think perhaps the big theme of next week is going to be machine learning.
Cloudflare is using a tremendous amount of machine learning across our systems, and we have a bunch of announcements.
So we use it for bot management, detecting whether a human is a human or not.
We use it in Turnstile, which is our CAPTCHA replacement, to figure out whether we think you're likely to be a bot or not.
We use it in our WAF to detect unknown threats. We use it in Zero Trust to find lookalike domains, to find domain generation algorithms, to find DNS tunneling.
We're using it in support in order to figure out what replies people need and get the right data to our support clients.
I think the really big thing that's happened is that machine learning is everywhere in Cloudflare now.
Realistically, a few years ago, it was primarily the bot management side of the things that was doing it, and now it's across the team.
And the reason is that our network is so large that we have so many signals about what's happening on the Internet.
We can use that data collectively to protect our clients, to provide better performance, et cetera, because we can use it to spot new things, spot new patterns, and train on that data, train on those signals to really build, I think, probably one of the most exciting machine learning things.
I know it's not ChatGPT, and you can't chat to it, but we're in there able to figure out what is anomalous on the Internet, and I think that's hugely powerful.
True, and so I think what I found fascinating is that that impacts and is present in different products.
Our Zero Trust suite, Cloudflare One in general, so a lot of improvements are embedded in those products that companies are using, that people are using, really.
And we're also keeping an important thing in mind, which is that most governments are issuing warnings and adopting security strategies, like we saw recently with the White House, but also in the UK, in Germany, all sorts of countries.
So that area is building up, and I think we have a good approach there in terms of what we're doing.
Also in the email area, phishing, an old tactic, but still pretty much present and increasing, which is alarming.
Well, yeah, I mean, the phishing side, we use machine learning too, right?
Detecting the domains, and also if we're looking at the websites themselves, bringing out what is phishing and what is not.
And you spoke about ChatGPT. ChatGPT-type tools can help hackers be more convincing in terms of phishing emails and all that.
So sophistication is coming. Yeah, it's definitely, I mean, those tools are definitely helping the attackers and the defenders.
So that's our time, and let's invite everyone to see what we have in place for the next week about Security Week.
Yeah, and you and I can get together next Friday and talk about all those announcements.
It'll be a bumper selection. Yeah, exactly. Thank you so much, John, and that's a wrap.