🎂 Instant Purge: invalidating cached content in under 150ms
Presented by: Alex Krivit, Zaidoon Abd Al Hadi, Tim Kornhammar
Originally aired on September 25, 2024 @ 12:00 PM - 12:30 PM EDT
Welcome to Cloudflare Birthday Week 2024!
2024 marks Cloudflare’s 14th birthday, and each day this week we will announce new things that further our mission: to help build a better Internet.
Tune in all week for more news, announcements, and thought-provoking discussions!
Visit the Birthday Week Hub for every announcement and CFTV episode — check back all week for more!
Transcript (Beta)
Hello, everybody, and welcome to Cloudflare's 14th birthday. My name is Alex Krivit.
I'm a product manager here at Cloudflare. I'm joined today by Tim and Zaidoon from the Cache team.
Tim is a product manager and Zaidoon works on the engineering team.
And today, we're really, really excited to talk to you about some performance improvements that we built into our new Purge pipeline.
Before we get into that, and before we get into the announcement, I'd like to take a few steps back and just sort of maybe start at the very beginning here.
Maybe, Tim, what sorts of things happen when customers cache content?
Yeah, a very good question, and an important part of a CDN, where customers leverage us for scale and bandwidth control so that they can have high availability.
What really happens is that we get a piece of content that we store for a certain amount of time. Say the customer told us, store this for a minute.
We will store it for a minute and then ask again for an update after that minute has passed.
So, it's very reactive in that sense, but it could save millions of requests going back to your origin or web service by hosting it on our services instead of our customers'.
So, sort of getting that, we temporarily store a piece of content closer to visitors.
How does that help Internet traffic, Zaidoon?
So, the key thing that customers are looking for is speed.
So, by storing or caching the content in our giant network, we're able to cache this content closer to what we call the eyeballs or the clients.
So, that's people on their browsers, Google Chrome, phones, things like that.
So, the customers of our customers are happy as they're able to get their content faster and pages just load magically in a blink of an eye.
And customers are happy because they're getting their content out very quickly.
And also, we're removing a ton of load on the origin servers.
So, from their perspective, they're also saving lots of money on bandwidth costs.
That's really fascinating, really great. And so, what we're doing is we're moving the content from their origin server to our servers closer to their visitors, so that when the requests come in, we can serve the traffic from those points of presence, those data centers, rather than the request needing to go all the way back to the origin.
But that sort of kind of brings up the next question, maybe what this conversation is about, is, okay, what happens when that content changes?
How do I make sure that the stuff that we have all over the globe in all these different data centers isn't out-of-date content and showing somebody an incorrect piece of information?
Yeah, I'll take that one.
With Cloudflare being available in over 330 cities globally, we store a lot of content for a lot of customers.
And as we started with saying, if we store content for 60 seconds, a minute, and that's reactive, then customers need a way to say, hey, this changed more suddenly.
Customers can send us a request, called a purge or invalidation request, to say, hey, this one needs to be removed instantly, it's incorrect.
And so what happens is that the customer sends that request, we receive it, we remove the content, and we don't serve it to any users anymore.
And then, on the next request after that has been processed, we go back to the customer's origin and say, hey, we need a new piece of content.
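The store-for-a-minute and purge behaviour described above can be sketched in a few lines of Python. This is only an illustrative toy, not Cloudflare's implementation: the `TTLCache` class, the `fetch_from_origin` callback, and the fake clock are all hypothetical names introduced for the example.

```python
import time


class TTLCache:
    """Toy cache: an entry is served until its TTL expires or it is purged."""

    def __init__(self, fetch_from_origin, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch_from_origin   # hypothetical origin callback
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}                  # url -> (body, expires_at)

    def get(self, url):
        entry = self._store.get(url)
        if entry is not None and self._clock() < entry[1]:
            return entry[0]               # fresh hit: the origin is not contacted
        body = self._fetch(url)           # miss or expired: ask the origin again
        self._store[url] = (body, self._clock() + self._ttl)
        return body

    def purge(self, url):
        """Drop the entry now; the next get() refetches from the origin."""
        return self._store.pop(url, None) is not None


# Fake clock so the example runs without sleeping.
now = [0.0]
origin_hits = []


def fetch_from_origin(url):
    origin_hits.append(url)
    return f"{url} v{len(origin_hits)}"


cache = TTLCache(fetch_from_origin, ttl_seconds=60, clock=lambda: now[0])
url = "https://domain.com/image"
first = cache.get(url)     # miss: fetched from the origin
second = cache.get(url)    # hit: served from cache
cache.purge(url)           # the customer says the content changed
third = cache.get(url)     # miss again: refetched from the origin
```

Without the purge, the stale copy would keep being served until the 60-second TTL ran out, which is the reactive behaviour the purge request is meant to short-circuit.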
That's interesting. And so it's sort of like, we're storing something and then we're just trying to remove it and then replace it immediately.
And that's what's called a purge, right? Yes. How do you make sure that I'm purging the thing that I actually want to purge?
So, we offer a few different ways.
And the most traditional way is that you would take a URL on a website, say domain.com/image.
And by sending us that URL, we would know to purge that file.
And a customer can have thousands of files, different dialects, that could be static or live longer, shorter times.
But then also in other types, if you have a set of news articles, maybe you want to tag it, all the news articles could be a tag, or all the writers could be another tag.
And you can organize content by that.
So, instead of going one news article, you could say all the news articles that this writer has touched, you need to remove those and update that because we have a new name for the person.
That could be an example of how and why you send that to us.
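As a rough illustration of the writer example, purge by tag can be thought of as an index from tag to the cached URLs carrying it. The sketch below is a minimal in-memory version under that assumption; `TaggedCache`, its methods, and the tag names are hypothetical and say nothing about how Cloudflare's index is actually built.

```python
from collections import defaultdict


class TaggedCache:
    """Toy cache with a tag index: tag -> set of URLs carrying that tag."""

    def __init__(self):
        self._objects = {}                    # url -> body
        self._tags_by_url = {}                # url -> set of tags
        self._urls_by_tag = defaultdict(set)  # tag -> set of urls

    def put(self, url, body, tags=()):
        self._remove(url)                     # replace any older copy cleanly
        self._objects[url] = body
        self._tags_by_url[url] = set(tags)
        for tag in self._tags_by_url[url]:
            self._urls_by_tag[tag].add(url)

    def get(self, url):
        return self._objects.get(url)

    def _remove(self, url):
        if url not in self._objects:
            return False
        del self._objects[url]
        for tag in self._tags_by_url.pop(url):
            urls = self._urls_by_tag[tag]
            urls.discard(url)
            if not urls:                      # keep the index free of empty tags
                del self._urls_by_tag[tag]
        return True

    def purge_url(self, url):
        """Single file purge: remove exactly one object."""
        return self._remove(url)

    def purge_tag(self, tag):
        """Purge by tag: remove every object carrying the tag; returns the count."""
        urls = list(self._urls_by_tag.get(tag, ()))
        for url in urls:
            self._remove(url)
        return len(urls)


articles = TaggedCache()
articles.put("https://news.example/a1", "A1", tags=["news", "writer:kim"])
articles.put("https://news.example/a2", "A2", tags=["news", "writer:lee"])
articles.put("https://news.example/a3", "A3", tags=["news", "writer:kim"])
purged = articles.purge_tag("writer:kim")     # one request, several objects
```

A single purge for the writer's tag removes every article that writer touched, without the customer having to list the URLs one by one.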
Okay. So, there's a lot of different request types for purges that I can send.
I can target a URL, which should presumably remove things that match that URL in all of the data centers. And with the tags you were describing, I can find anything that has that tag in a header, remove that content from all of the data centers as well, and have it replaced on the next request.
So, I mean, I guess presumably those types of purges, and that concept of purging content, have been around since the beginning of caching, the beginning of CDNs.
What change are we talking about for Birthday Week this year? So, the change that we've made here is we've made our purge instant.
We've made it really fast.
Historically, I thought that a purge wasn't the slowest. It was in the seconds, which was still fairly fast.
However, as Cloudflare got bigger and we grew and we got more customers, we wanted to challenge ourselves to do better, to set the standard instead of just being a follower.
So, we wanted to see how fast we can make our system and we wanted to also increase our scalability.
So, we spent the last couple of years focusing on revamping our entire purge system that has been there since day one of Cloudflare almost.
And we've made it much more scalable, reliable, and much faster.
So, this change is mostly focused on performance, that instant aspect. What sort of work did we need to do to change how quickly we're able to remove all this content globally from seconds down to milliseconds?
All right. So, lots of work had to be done.
The building blocks of all of this is our API gateway. Historically, our API gateway was in what we call a core data center or some really big data centers that don't serve edge traffic.
So, you know, it doesn't cache content or anything like that.
It serves control plane traffic where people change zone settings and set firewall rules and things like that.
So, that system has historically been all in those core data centers, which are few and far between, and they're not near every single visitor.
So, as one part of getting to our millisecond purge, we needed to have what we're calling a coreless gateway.
So, our API gateway is on the edge, in every single data center that is serving traffic.
This way, when the customer issues a purge request, it hits whichever data center is closest to them, and the API gateway is in charge of routing the requests to our purge services as well as doing the auth, which was also historically core-based and was moved to the edge.
Because without those two pieces, it's impossible for us to get to the millisecond instant purge.
If each purge a customer issues has to go to the nearest core data center, which could be in the US while they could be somewhere in Japan or something like that, that would already push the latency higher than milliseconds.
That's so interesting, and such an interesting way to think about that, I guess, bottleneck.
And so, what you're saying is that previously, if somebody was in Japan and they sent a purge request, it would have to go somewhere else, maybe the US, to that core data center to do the authorization and authentication and all of the distribution management that needed to happen to go to all of the data centers on the globe.
And so, what this change is architecturally is that we took all of those processes out of the core and sort of put them around the world so that we eliminate all of that network time, the time that it takes for the request in Japan to go to the US and then get blasted out everywhere else.
That's correct. So, the first piece, we had the API gateway that was moved.
We have the authentication logic and services that were moved to the edge.
Next step was our own purge systems because everything was in those core data centers.
Historically, our purge services also lived in those core data centers, which relied heavily on what we've introduced since the start of Cloudflare, which is called Quicksilver.
It's our very fast key value store that's used to propagate configurations to the edge very quickly.
Quicksilver is extremely reliable and we got sub-millisecond reads on the edge and we got sub-second propagation around the globe.
So, we were very happy with it, especially as Cloudflare was starting.
However, because we started to grow and our purge traffic started to grow, the amount of incoming purge requests that had to go and funnel their way into the core data center through Quicksilver, which then had to propagate to all of the data centers on the edge, ended up slowing us down because Quicksilver was not able to push that much data to our edge data centers from that one particular location.
So, we had an overloading issue.
So, to combat that, we started introducing queues in front of it to do batching and regulate traffic flow.
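To make the batching idea concrete, here is a minimal sketch of a queue that releases purge requests in bounded batches. The `PurgeBatcher` class and the batch size are hypothetical; the talk does not describe how the real queues were implemented.

```python
from collections import deque


class PurgeBatcher:
    """Toy queue that hands out pending purges in batches of bounded size."""

    def __init__(self, max_batch_size):
        if max_batch_size < 1:
            raise ValueError("max_batch_size must be at least 1")
        self._max = max_batch_size
        self._pending = deque()

    def enqueue(self, purge):
        self._pending.append(purge)

    def next_batch(self):
        """Return up to max_batch_size purges in arrival order."""
        batch = []
        while self._pending and len(batch) < self._max:
            batch.append(self._pending.popleft())
        return batch


batcher = PurgeBatcher(max_batch_size=2)
for purge_url in ["/a", "/b", "/c"]:
    batcher.enqueue(purge_url)
first_batch = batcher.next_batch()    # the first two purges go out together
second_batch = batcher.next_batch()   # the third purge has to wait its turn
```

Batching smooths the load on whatever sits behind the queue, but a request that misses a batch waits for the next one, which is one way a queue like this can add latency as traffic grows.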
That all worked fine and nice for the past 10 or so years, but then we realized that started to slow us down and we realized that's bottlenecking us and we're getting to a point where it's not scalable, which for us is not acceptable as we get more and more customers.
So, that's when we started to look at the individual purge services to see, okay, now that the API gateway has moved to the edge and we have auth at the edge, what can we do to our purge services to move them to the edge so that we don't have this funneling problem where we have to send purges to a particular data center to be processed and then sent to all other data centers?
And that's when the Coreless Purge project started a few years ago. We started by moving single file purge first, so that's purge by URL, as that was much easier. It's much easier in the sense that for single file purge, as we've talked about, when you send us a URL telling us to purge this, we are able to know exactly where this cached content can be in each data center.
So, all we had to do was send this purge to each data center, and then within the data center, the cache key hash can be calculated to figure out the exact disk that contains this content, and we just delete that.
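The reason single file purge is the easy case can be sketched as follows: if the disk is chosen by hashing the cache key, a purge can recompute the same hash and go straight to the one disk that may hold the object. In this sketch the cache key is assumed to be just the URL and the disk is picked with a simple modulo. Both are simplifications for illustration, and `cache_key`, `disk_for`, and `DataCenter` are hypothetical names.

```python
import hashlib


def cache_key(url):
    """Assumed cache key: a SHA-256 hex digest of the URL alone."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def disk_for(url, num_disks):
    """Map a URL to a disk index deterministically from its cache key."""
    return int(cache_key(url), 16) % num_disks


class DataCenter:
    """Toy data center: each disk is a dict from cache key to body."""

    def __init__(self, num_disks):
        self.disks = [{} for _ in range(num_disks)]

    def _disk(self, url):
        return self.disks[disk_for(url, len(self.disks))]

    def store(self, url, body):
        self._disk(url)[cache_key(url)] = body

    def lookup(self, url):
        return self._disk(url).get(cache_key(url))

    def purge_url(self, url):
        """Single file purge: touch only the disk the hash points at."""
        return self._disk(url).pop(cache_key(url), None) is not None


dc = DataCenter(num_disks=8)
image_url = "https://domain.com/image"
dc.store(image_url, b"v1")
was_purged = dc.purge_url(image_url)   # one hash, one disk, one delete
```

The lookup path and the purge path share the same hash, so no search across disks is needed for a purge by URL.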
Once that was done, go ahead Alex. Oh no, I was going to say that sounds like a very big re-architecture and it's been years in the making.
What did you guys do next after that piece was done?
So, once we got single file purge done, and that was built on top of Workers and we were using Durable Objects to store the purge tasks that we are going to distribute to all of the data centers, we basically got the distribution to be coreless.
We accepted purges at the edge, in the nearest data center for the customer, and were able to propagate those purges to all data centers.
Next up is what we are calling the flexible purges.
So, those are the purge by tag, purge by host, purge by prefixes.
Those were still in core until recently and the reason why they remained in core is because they're more complicated.
Earlier, I said, you know, given a URL, we just, you know, we can calculate the cache key and find out exactly which disk contains the content.
That's true because that's what happens on the edge.
When customers visit a website, we get the URL and then we know exactly where to go to look in cache.
The problem with things like purge by host name is all you're giving us is a host name and we need to find all content that happens to share the same host name.
So, knowing the host name is not enough for us to know which disks contain the content.
So, we can send the purge to each data center, but then we have the problem of figuring out how to purge content from each disk in each data center.
Yeah. So, to solve this problem, we updated our purge propagation pipeline so that now we have two layers.
Instead of just sending the purges to all data centers, we also have a second layer where, once the data center receives the purge request, it then sends the purge to every single disk in the data center because, you know, content can be on any disk.
So, this was the distribution change we had to make to make FlexPurge work.
That's the first change we had to make.
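The two-layer distribution described here can be sketched as a fan-out: first to every data center, then from each data center to every disk. In this toy version each disk simply scans its own objects for a matching hostname. That scan is an assumption made to keep the example short, and the `Disk`, `FanoutDataCenter`, and `purge_host_everywhere` names are hypothetical.

```python
from urllib.parse import urlsplit


class Disk:
    """Toy disk: maps URL to body and can drop everything for a hostname."""

    def __init__(self):
        self.objects = {}

    def purge_host(self, host):
        doomed = [u for u in self.objects if urlsplit(u).hostname == host]
        for u in doomed:
            del self.objects[u]
        return len(doomed)


class FanoutDataCenter:
    def __init__(self, num_disks):
        self.disks = [Disk() for _ in range(num_disks)]

    def purge_host(self, host):
        # Layer 2: a hostname does not say which disk holds the content,
        # so the purge is sent to every disk in the data center.
        return sum(disk.purge_host(host) for disk in self.disks)


def purge_host_everywhere(data_centers, host):
    # Layer 1: send the purge to every data center.
    return sum(dc.purge_host(host) for dc in data_centers)


data_centers = [FanoutDataCenter(num_disks=3) for _ in range(2)]
for dc in data_centers:
    dc.disks[0].objects["https://blog.example/a"] = "a"
    dc.disks[2].objects["https://blog.example/b"] = "b"
    dc.disks[1].objects["https://shop.example/c"] = "c"

removed = purge_host_everywhere(data_centers, "blog.example")
```

Compared with a purge by URL, which can be routed to a single disk by its cache key hash, the hostname purge has to visit every disk because matching content can be anywhere.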
Yeah. I mean, and a lot of this is just the architecture and the content that we get sort of for free when we were doing core-based purges and using that architecture in those pipelines.
And we had to sort of rebuild and re-architect a lot of these things.
And I think as we're learning and as we're seeing today, it's like those assumptions change based on the type of purge and what's happening.
And there are storage implications and sort of speed and correctness things to think about.
In terms of the announcement that we're having today, Tim, from this re-architecture and from this work that we've done and that Zaidoon and team have been doing for the last years, what's the big number, the blinking light that customers should see when they're doing a certain type of purge here?
Yeah. A great question. And really the interesting part is that we can now purge instantly, in less than 150 milliseconds at P50 globally.
And that applies to our tags, our hosts, and prefixes where, as in the example with the author or writer, if you send a purge command now from, say, Japan, it would propagate globally at a P50 of less than 150 milliseconds.
And as you pointed out really well in the beginning, this also makes the local purge, the more regional ones, faster.
Mainly your bulk of users, if you're a Japanese business, likely a lot of your users will be Japanese or regional.
So not only are we doing the global purge less than 150 milliseconds, we're also doing the regional much, much faster, which is pretty amazing.
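For readers unfamiliar with the term, P50 is the median: half of the measured purges complete at or below that latency. A minimal sketch of checking a set of measurements against the 150 millisecond figure is shown below. The sample values are invented purely for illustration and are not Cloudflare measurements; `p50` is a hypothetical helper.

```python
import statistics


def p50(latencies_ms):
    """P50 is the median: half of the samples are at or below this value."""
    return statistics.median(latencies_ms)


# Invented sample latencies in milliseconds, for illustration only.
samples_ms = [90, 110, 120, 130, 140, 160, 400]

median_ms = p50(samples_ms)
meets_target = median_ms < 150      # the P50 target quoted in the talk
slowest_ms = max(samples_ms)        # a P50 target says nothing about the tail
```

With these made-up numbers the median is under 150 milliseconds even though the slowest sample is well above it, which is why a P50 figure describes typical rather than worst-case behaviour.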
Yeah. That's mind-blowing.
When we do spread out these functionalities and make them less dependent upon the core and more capable regionally, what does that mean in terms of our throughput as well?
How many more purges can we see? So the new system that we have that's distributed, based on the way we have to architect it to make the difference between single and tags, hosts, and prefixes, our indexing model that we have scales a lot, lot more.
So even if we would have the same number of requests in, the number of tags that you can attach will be increased orders of magnitude.
So that's something we're looking to roll out.
We've already rolled out the instant purge, 150 milliseconds, to all customers. And the throughput will come in the next couple of months, where we hope to see increased throughput, but there's also something very exciting for Birthday Week.
We always celebrate free users.
And so once we have this capability, all plan types will be able to have some sort of tag, hostname, and prefix purge that they didn't have before.
So we're going to make the functionality available to everyone. That's massive.
I mean, I've been at Cloudflare now for maybe seven years or something like that.
And for as long as I can remember, purge by tag and by prefix and all of these purges that have the opportunity to invalidate a lot of content were always reserved for enterprise customers.
And that's probably because of needing to make sure that they were able to purge as much as they wanted.
But what you're saying is with this new architecture that now not only can enterprises purge as much as they want, but also lower plan levels can use some of the capacity as well, which is really, really exciting.
Yeah, absolutely. And when you think about the scale of it, we have millions upon millions of free sites and being able to scale to such an enormous volume is pretty cool.
I think it's really an engineering feat.
Wow, I'm really excited to see that in production and being able to purge on my free zones with a purge by tag is going to be really, really exciting.
So very much looking forward to that. In terms of final thoughts here, when do we think that these new capabilities will be available for all Free, Pro, and Business customers?
So in some parts already today, the capacity will come likely early 2025.
But that's the target.
Wow, that's really soon. I'm really excited about that.
Are there any final thoughts about this project, Zaidoon, before we kick it back to some other Birthday Week announcements?
The only thing I would say is I really hope customers notice the difference and I'm sure they will in performance.
And we are definitely still working on our purge system to make it even more scalable, even faster, introduce new ways to purge.
So we are not giving up.
We are far from done and we're just getting started. I love to hear it.
And I'm really excited to see what's next for the purge pipeline. Well, everybody watching at home, thank you very much.
Very excited that we got to be part of this Birthday Week and have a really good result to show to you all.
I'm excited for you all to experience it soon. And we will turn it back now for some additional announcements.
Thanks, everyone. Thank you.