Serverless Storage Strategies

Presented by: Steve Klabnik

Originally aired on June 9, 2020 @ 9:00 PM - 9:30 PM EDT

Best of: Cloudflare Connect 2019 — NYC

A session with Steve Klabnik, Product Manager of Storage at Cloudflare, on approaches for storing and retrieving data at the edge.


Cloudflare Connect

Transcript (Beta)

All righty. Hey, everybody. I'm Steve. This is Serverless Storage Strategies. I'm the product manager of the storage team at Cloudflare, which we think is kind of slightly weirdly named because of some stuff we're working on, but I'll get into that later. My team basically deals with anything that involves storing data persistently on Cloudflare's edge. And so we're going to talk a bunch about that here. So the first thing I want to talk about specifically is, why is this an interesting and kind of difficult sort of problem? And this is not just like, oh, my job is awesome and I work on really cool stuff. This also pertains to you all, running applications on top of Cloudflare: you are going to need data as well as code if you're doing something more ambitious. And so there are some interesting problems in that space. And so I want to share with you the way that I and my team think about these kinds of things, talk a little bit about what we've already built, and then talk a little teeny bit about some of the stuff that we're doing in the future as well. So that's kind of the setup of this talk. So I only joined Cloudflare back in like March. It's been about seven months, two quarters; now that I'm a product manager, I think in quarters. And this is something that got repeated a bunch during my orientation that I never really thought about before. I've had a long history in web development. I used to work on Ruby on Rails as a job for a long time. I did PHP and Perl and all this kind of stuff. But, you know, I didn't think that much about the physical nature of the Internet. And I don't think that a lot of people really do, especially developers. It's like, there's computers and they live somewhere and I don't have to think about it at all.
But Cloudflare being an infrastructure company and dealing with the actual pipes that go from, you know, place to place, it turns out that the Internet is actually physical. We don't think about it as physical, but there are actual wires somewhere and there's electricity that makes those bits happen and they have to go from one place to another. And so unfortunately, unlike our imaginations and what we can do in code, the physical world has laws, and it seems like they're immutable and hard and we can't change them. So take the speed of light, for example. That's a physical constant that our universe is sort of bounded by. And we're doing lots of ambitious things at Cloudflare, but I don't think changing and upping the speed of light is exactly inside of the mission. So the problem is that there is this sort of limit to getting information from one part of the globe to the other, and it inherently takes time. So that might seem a little trivial, but this is why it matters a little bit more, specifically for this problem of storage. So let's talk a little bit about the edge. You've heard a bunch about it already today, but, you know, I'm gonna give you my own little spin here. So in a sort of traditional application, you have your user, or as we sometimes say, the eyeball. I think that term is kind of a little weird, but it seems to be industry standard. So whatever, I will say user. Your user makes an HTTP request to a server somewhere, and it's running some code and some data that exists, and that's your application, right? Like you maybe have a Rails app and it talks to Postgres or whatever. And so, you know, it does some computation, it does some operations with the data, it sends the information back. So this is actually an old map.
And another funny thing about my orientation: we add data centers so fast at Cloudflare that all the different presenters throughout the day had the wrong number of pops on their slides, and they're like, oh yeah, it says 146 data centers, but actually it's like 180. I'll fix my slides later. And the next person would be like, it says 160 on my slides, but it's actually like 180. Now it's 194. I think there's like 175 on this image. I didn't want to retake another picture, but the point is that we have a lot of data centers all over the world. That's kind of Cloudflare's thing. We keep adding more of them all the time. But, you know, if you think about how this works with this regular architecture, the idea of Workers, now that you can run your application's code in any one of our pops, you know, is that it's going to be within a hundred milliseconds of your users. And so what that means is you now have this. So your user is now much closer to where the code runs, which is at the edge. And so that means that computation can happen and you can return a response. And it's much faster just because it's physically located closer to where the user is. And the speed of light, you know, is defeated or whatever, because you're physically closer. So it can be faster. So the problem, though, is that this really only works for code, because if you need to add data as well, and you can't store data on the edge, then maybe your code is running close to the user. But if it has to go back to fetch from a centralized database, then your diagram sort of looks like this. We've moved the code. We haven't moved the data. And so we haven't really actually improved your effective latency. We just shifted it around. Now it's not between, you know, the user and where your code is running. But, you know, now the code runs and it has to wait on the network time to fetch from the database.
So I haven't really solved this problem. And so this is kind of what my team is attempting to solve: how do we get your data to the edge as well, so that everything can be closer to your user. And now we've actually truly sort of eliminated that latency. So this is kind of the core, fundamental challenge, and sort of the way that we think about the problems that we're trying to tackle. So, yeah. So the first product that the storage team has made, and currently the only public product, is Workers KV. And Workers KV, as you might expect, my previous slide posed a problem. So this slide is the answer. Workers KV gives you some data actually at the edge. And so this is a very simple key-value store. We've added a couple of interesting features lately that I'm going to talk about in a little bit. But basically, you know, you can put stuff in with a key and a value and then you can pull stuff back out. You ask for a certain key, it gives you the value back. It's all, you know, the pretty classic key-value kind of architecture. And so here's sort of what this lets you do. So KV, unfortunately, has not yet solved the ability to have all of your data everywhere, all the time, perfectly consistent and at absolute speed. I'm not a magician. The physical laws of the universe still exist. But we wanted to give you something that would be helpful in at least some circumstances. And so KV actually works like this. KV is actually itself a worker. Oh, thank you. KV is itself a worker. And so we actually use the Workers platform to build Workers KV. Any of you in theory could go out and build a KV competitor if you wanted to. We think that's really important for dogfooding the platform. But it's also how we get to take advantage of the fact that everything lives everywhere. So KV does have a centralized store in cloud storage. But the idea is that we thought about, you know, OK, what are the different kinds of workloads that people do?
And one very common workload that people are using workers for is stuff where you write data really infrequently, but you read it a lot. And I'm gonna give some examples of that in a second. But basically, you know, you can think about like Workers KV is supporting these kinds of workloads. So if you write, you end up writing the central location. But if you read, we can pull that data out to the edge and then keep it on the particular pop. And so now you don't need to go the whole way back to whatever the origin is to make stuff work. It's like a fancy cache, basically. But it's a cache you don't have to think about. It just kind of like works. And so, you know, if you're doing this kind of read heavy workloads, we handle that problem of like pulling your data out to the edge. And now you can sort of process everything relatively quickly. And there's some details here. But this is kind of just like basically how it works. And we wanted to sort of go with the simplest possible thing at first to get our hands on the problem. And we're working on some more ambitious stuff in the future. But a lot of people have already built a ton of really cool stuff on Workers KV. You know, you heard the story about DoorBot earlier. And I'm constantly just finding new fun things that our users are actually building on top of KV. So even with a relatively simple primitive, we've ended up adding a lot of value to the Workers platform. So when I think about the ways that people use storage and the way that people use Workers, I'm gonna kind of give you my take on this. Rita had sort of like three kinds of applications. I tend to think about it in sort of two. I tend to think of them as originless and edge enhanced. I did not like coordinate with marketing on these terms. These are not like official Cloudflare terms. These are like when I think about it inside my brain kinds of situations. And so these are two different sort of models for building apps. 
And so when you're thinking about using KV or building stuff on Workers, these are kind of the two modes that I see people tending to do stuff in. The one that I think is interesting and slightly more ambitious is applications that have no origin whatsoever. And so these are applications that live entirely on the edge and have no origin server whatsoever. So if you can put your entire business logic into Workers and you can put your entire data into either KV or any other sort of cloud-based storage platform (maybe it doesn't have as great performance as KV does, but it still can work), then I think this is a really interesting idea, because scale becomes sort of no longer an issue. Your code is running everywhere all the time and you don't even have a centralized server. So as an example of this, you've heard a little bit about this product called Workers Sites. And we sort of built this out with static site hosting on top of Workers. So this actually started when I started working at Cloudflare. The first thing you need to do whenever you become a product manager is get real familiar with the product that you're about to manage. I knew that Workers KV existed and I thought it was very interesting, but I hadn't really used it a whole ton myself. And so I was like, well, what can I do with this? And I think one of the best parts, as a programmer, of having your own personal domain is you can kind of do whatever you want with it. And what I tend to do with it is just rewrite everything every couple of years. The content stays exactly the same, but the architecture is just totally whatever the new fancy thing is that I want to play with. And so I was like, clearly I need to put my personal website on Cloudflare and use Cloudflare Workers to power that kind of thing. And so I wrote some Rust-generated WebAssembly, because I actually used to work on Rust at Mozilla with Ashley before this job.
And so I was like, I can put Rust and WebAssembly and make my website work and that seems cool. And I'll use KV to like store the HTML and just sort of return it out of KV and that'll be kind of a cool thing. So I had this amazing, ridiculous hack, but it turned out it was really useful. And so I ended up rewriting it in JavaScript because as much as I love Rust, way more people know JavaScript than Rust as it turns out. JavaScript being one of the most popular programming languages basically to ever exist. And I want it to be useful and editable by other people. So we were working on the new Cloudflare Workers documentation, which has since launched. And I decided to kind of like, what if we could actually host the Workers docs on a worker because that just, there's something about dogfooding that like really satisfies your inner programmer desire for recursion and like wordplay and tricks. I don't know what it is, but it's always really appealed to me. And it's like, if we have a thing that we say is good for hosting websites and we have a website, it should be hosted on it or else we're like hypocrites or something. That may be a little strong, but like, you know, that's how you sort of think about things sometimes. And so I was like, what if we could, what if we could host this entirely out of a worker? I think that would be cool. And so this is, this is an example of the original code that I wrote in JavaScript from this like sort of port. I'm not going to go through sort of all the details, but basically the HTML and CSS and JavaScript files were all stored in Workers KV. And basically what happens is a request comes in and the worker checks the path, fetches the file out of KV and then serves it. So if you see the, like we, this code is a little small and like it's not that important. I want to talk about it really briefly, but basically this kind of gives you an idea of like a little bit of a more logic in a worker. 
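The slide's code is not captured in this transcript. As a rough sketch of the kind of worker this passage describes (take the path, work out a content type, fetch the file from KV, return a 404 on a miss), something like the following could work. The `STATIC_CONTENT` object is a hypothetical in-memory stand-in for a KV namespace binding so that the sketch runs outside the Workers runtime, and the handler returns a plain object where a real worker would build a `Response`.

```javascript
// Hypothetical stand-in for a Workers KV namespace binding, so the sketch
// runs outside the Workers runtime. Real KV exposes an async get(key).
const STATIC_CONTENT = {
  files: new Map([
    ["/index.html", "<h1>Hello from the edge</h1>"],
    ["/style.css", "h1 { color: orange; }"],
  ]),
  async get(key) {
    return this.files.has(key) ? this.files.get(key) : null;
  },
};

// Map a file extension to a Content-Type header value.
const CONTENT_TYPES = {
  html: "text/html",
  css: "text/css",
  js: "application/javascript",
};

function contentTypeFor(pathname) {
  const extension = pathname.split(".").pop();
  return CONTENT_TYPES[extension] || "text/plain";
}

// Take the request URL, grab the pathname, look the file up in KV,
// and describe the response (a real worker would return a Response).
async function handleRequest(requestUrl) {
  let pathname = new URL(requestUrl).pathname;
  if (pathname.endsWith("/")) pathname += "index.html";

  const body = await STATIC_CONTENT.get(pathname);
  if (body === null) {
    // Nothing stored under this path: log it and answer with a 404.
    console.error(`no KV entry for ${pathname}`);
    return { status: 404, headers: {}, body: "Not found" };
  }
  return {
    status: 200,
    headers: { "content-type": contentTypeFor(pathname) },
    body,
  };
}
```

Calling `handleRequest("https://example.com/")` resolves to the stored `index.html` with a `text/html` content type; an unknown path resolves to a 404.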
So we take the request URL and then we grab the path name. We figure out what the content type is. And then we fetch that out of a KV namespace called static content. And then we return the response with the body and set the content type. And if not, then we log the error and, you know, return a 404. Now, this code is not actually what's running as Workers Sites, but the developer experience team basically saw how this was working and thought that was really cool. And they decided to make an actual real thing out of it, with good engineering and tests and everything that's actually awesome. So we took this idea that was originally a terrible hack and made it a really cool actual thing that works. And now we use Workers Sites to, you know, power our docs. I then, of course, rewrote my personal site for the second time and threw away my silly hack and actually use this to serve my stuff now. And I think this is a really cool example of dogfooding and trying out the platform. But it's also sort of in this idea of these originless applications; this is sort of a JAMstack-y kind of thing. I don't know if you're all familiar with that movement in web development. But I personally think that's a really interesting way to build applications. And I think that Workers is sort of well-suited to that. And so this is kind of just, you know, I don't know. I'm really excited about Workers Sites. It's always cool to have a fun project turn into something real that's actually useful for folks. So I'm super hyped about it. Okay. So before I talk about enhancing existing applications, I want to talk a little bit about some tradeoffs for Workers KV. I mentioned a couple times that Workers KV is sort of a simple data store. And I think that's really good, because that's part of why it's fast. The fastest code is no code at all. And so the less you do, the faster things are.
But that also means there's some architectural tradeoffs in KV. And one of the things I think is important as my job as a product manager is to communicate to you what those tradeoffs are so you can make an informed decision about your product. So I actually had a customer call yesterday morning where they had two different interesting use cases for KV. And they told me their first one. And I was like, awesome. That sounds like a great fit. And they told me the second one. And I was like, actually, that won't really work. Sorry. And so I spend a lot of my time trying to help people understand when KV is appropriate and when KV doesn't work. So I want to talk a little bit about the positives but also some of the negatives. Because I don't want you to build a thing if it's not going to actually work for you. And I think that's important. So as I mentioned a little bit earlier, I talked about this briefly. But this is now a little more structured with some extra detail. We design KV for read frequently, write infrequently. And so if you are writing to the data store very often, then KV might not be the best choice. And if you're reading infrequently, it also might not be the best choice. It really, really shines when you write sometimes once. There are some customers who upload once and then only read inside of their worker. Or maybe you write occasionally and read all the time. And it turns out there's a surprising number of applications that actually fit that profile. But I think that's at first the high level way to think about KV. It's very good for fast reads. So the reason this is true is because we store very infrequently read values centrally. So KV is a persistent durable store. Your data is not going to be lost. It will live in KV for a very, very long time. But we store all that stuff centrally for stuff that's infrequent. And then for things that are read relatively often, we store them on every single pop. And that's how you get the speed benefits. 
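The read path described above (values live centrally, and frequently read values end up cached close to the user) behaves a lot like a read-through cache. The sketch below is only an illustration of that idea, not how KV is implemented: `CentralStore` and `EdgeLocation` are hypothetical names, and eviction, expiry, and propagation of writes to other locations are all left out.

```javascript
// Hypothetical central store: durable, but far away from most users.
class CentralStore {
  constructor() {
    this.data = new Map();
    this.reads = 0; // counts how often a read had to travel to the center
  }
  async get(key) {
    this.reads += 1;
    return this.data.has(key) ? this.data.get(key) : null;
  }
  async put(key, value) {
    this.data.set(key, value);
  }
}

// Hypothetical edge location with a local cache in front of the central store.
class EdgeLocation {
  constructor(central) {
    this.central = central;
    this.cache = new Map();
  }
  // Reads are served locally when possible; a miss goes to the center once
  // and the value is then kept at this location.
  async get(key) {
    if (this.cache.has(key)) return this.cache.get(key);
    const value = await this.central.get(key);
    if (value !== null) this.cache.set(key, value);
    return value;
  }
  // Writes always go to the central location.
  async put(key, value) {
    await this.central.put(key, value);
  }
}
```

With one write followed by many reads from the same `EdgeLocation`, `central.reads` stays at 1, which is why write-rarely, read-often workloads fit this shape so well.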
But while we have a lot of data centers, it's not like all of those data centers are super massive. And so we can't store literally all the data all the time. And so for this first product, we basically made it so that things that are accessed often get stored everywhere and things that aren't, don't. And that's sort of a reasonable tradeoff both for us and, you know, for the kinds of applications we think KV works well for. But secondly, there's one other aspect that's kind of interesting. And that's this concept called eventual consistency. So when we talk about databases, there's this thing called the CAP theorem. And it's basically these three properties of database systems. And you can only get some combinations of those letters to work. The C part is for consistency. And Workers KV has what's called eventual consistency, which means you'll get the right answer eventually, but not necessarily in the immediate moment. And the kind of eventual consistency we have is called last write wins. And what that means is, say that you're issuing a write from here in New York, but we're also issuing a write from San Francisco, and they're both going to the same key. Literally, whichever one happens to hit that central store last is the one that's actually stored, which means that you can sometimes lose intermediate writes, which is, again, a major tradeoff in some use cases and not in others. So for example, this particular aspect was the reason I unfortunately had to tell someone that KV wasn't great for them, because they needed really strong consistency. But we were looking for high read performance. And so we decided to sacrifice write consistency a little bit for that purpose. But there's tons of cases in which this is still super great.
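To make the lost-write case concrete, here is a small simulation of last write wins, under the assumption that a central store simply applies writes in the order they arrive. `LastWriteWinsStore` is a hypothetical name used only for this illustration; the two "clients" stand in for the New York and San Francisco writers from the example above.

```javascript
// Hypothetical central store that applies writes in arrival order.
class LastWriteWinsStore {
  constructor() {
    this.values = new Map();
  }
  // A later-arriving write simply replaces whatever was stored before.
  write(key, value) {
    this.values.set(key, value);
  }
  read(key) {
    return this.values.get(key);
  }
}

const store = new LastWriteWinsStore();
store.write("visits", 10);

// Both clients read the same starting value before either has written.
const seenInNewYork = store.read("visits");
const seenInSanFrancisco = store.read("visits");

// Each client increments its own copy and writes it back.
store.write("visits", seenInNewYork + 1); // arrives first
store.write("visits", seenInSanFrancisco + 1); // arrives last, so it wins

// Two increments happened, but only one is reflected in the stored value.
const finalVisits = store.read("visits");
console.log(finalVisits); // 11, not 12
```

This is the kind of read-modify-write pattern (counters, balances, inventory) where a store with this consistency model is a poor fit.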
And so where KV really doesn't shine is if you need some sort of atomic operations or transactional multiple reads and writes in a single step; that's where it sometimes just doesn't work. So I really just want to be clear about that part, because I think it's important. And all these things are definitely drawbacks, and we are working on other products to provide other kinds of consistency models and do other sorts of things that you might want to build. So we're sort of only getting started. OK, so to get back to this originless and edge enhanced idea, we talked about originless a little bit, but we're seeing a lot of people use KV for this edge enhanced use case. So I want to talk a little bit about that as well. An edge enhanced application is one where you have your worker and KV at the edge, but you also have some sort of application running at an origin somewhere. And so you're kind of using Cloudflare Workers to enhance an already existing application that you're building. And so you kind of have the no-origin aspect of Workers, but also some sort of central place located somewhere, maybe multiple places even. And so people tend to use this architecture whenever they're trying to make some sort of decision quickly. So I'm going to go over a couple different aspects of this, but if you need to make some sort of decisions, doing them closer to your user, on the edge, can provide enhanced performance and sometimes decrease load and do all these other sorts of things. And so one example of this that we've seen some customers build out and that, you know, I think is a thing that we would like to eventually put in the Workers platform in some capacity ourselves, is blue green deployments. So I don't know if any of you are ops folks, but we've seen people build entire blue green deployment systems on top of Workers and KV for their workers.
So sort of the idea here is that you want a deployment strategy that minimizes risk of new, you know, rolling out bits of your application. And so you have these like sort of two production environments that are traditionally named blue and green, which is why it's blue green deployments. But the idea is that blue is your current live application and green is sort of like your up next application. So what you do is you deploy to green and then you slowly move traffic over from blue to green. So at first you roll 10% of your traffic over and then 20% and then 80% or, you know, whatever, you get to kind of pick the strategy. But the idea is that you can watch metrics and if something goes wrong, you can roll entirely back to blue if there's any problems and not have to, you know, sort of like make everything, you know, not work out. And eventually all your traffic is green and now green is your live environment and blue is your staging environment. And so in this kind of architecture, you know, you will have this deployment logic kind of like living inside of a worker and which of the two environments is sort of active and what percentage is happening inside of KV. And so you have your, your origin servers, you have a blue and a green version and kind of like in the worker, you direct your users to which environment they need to go to based on, you know, whichever thing. And so this means that like only the traffic that you want gets sent to that new server, new version of the service and vice versa. So like you get total control over how stuff hits your origin and it happens immediately at the edge as opposed to like waiting until it gets to your infrastructure and then sort of moving on. And so we've had people do this where stuff is like totally origin based. We've also had people where they spin up blue and green workers and then also the edge worker and like put them all together and stuff, which is a little more complicated of course. 
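As a rough sketch of the routing decision described above, the worker can read a rollout percentage from KV and map each user to a stable bucket, so the same user keeps landing on the same environment while the percentage ramps up. Everything named here is an assumption made for illustration: `DEPLOY_CONFIG` is an in-memory stand-in for a KV namespace, the `rollout` key and its JSON shape are invented, and hashing a user ID is only one of many possible strategies.

```javascript
// FNV-1a-style 32-bit hash over UTF-16 code units: turns a string into a
// stable unsigned integer, so routing is deterministic per user.
function hashString(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Users whose bucket (0..99) falls below greenPercent go to green.
function pickEnvironment(userId, greenPercent) {
  const bucket = hashString(userId) % 100;
  return bucket < greenPercent ? "green" : "blue";
}

// Hypothetical stand-in for a KV namespace holding the rollout settings.
const DEPLOY_CONFIG = {
  entries: new Map([
    [
      "rollout",
      JSON.stringify({
        greenPercent: 10,
        blueOrigin: "https://blue.example.com",
        greenOrigin: "https://green.example.com",
      }),
    ],
  ]),
  async get(key) {
    return this.entries.has(key) ? this.entries.get(key) : null;
  },
};

// Decide which origin a given user's request should be forwarded to.
async function originFor(userId) {
  const config = JSON.parse(await DEPLOY_CONFIG.get("rollout"));
  const environment = pickEnvironment(userId, config.greenPercent);
  return environment === "green" ? config.greenOrigin : config.blueOrigin;
}
```

Moving traffic over is then just a matter of writing a new `greenPercent` into the stored settings; rolling back means setting it to 0.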
And so yeah, so you start off with all of your traffic going to one and then the other, and then eventually you change the percentages. So you finally move over to the other side. And so this is kind of one example of using a worker to enhance an existing application rather than building something totally new, and something we've seen a lot of people do before. And so the pros and cons here are that, as I said already, you get to completely customize how you do that transition. So programmers love to argue about stuff on the Internet. So you'd think something as simple as, sometimes you talk to this server and sometimes you talk to that server, would be simple, but no: what percentages should we roll out, and how long should we wait? Should it be a random coin flip if someone goes to one or the other? Should we pick it so that all the people in San Francisco go to blue first and all the people in New York go to green second? You can invent all kinds of ways to do these strategies. A lot of people do stuff where maybe, if you're in the beta, you get to go to a more advanced version of the service. There's a ton of different ways this pattern kind of fits. And what's nice about this is you don't need to actually change your application at all to manage this configuration. You can do it entirely within a running worker. Another example that people do, that almost every application has to deal with, is authorization. Authorization is another example of asking a question at the edge. And in this case, the question is, should this user be able to access whatever resource this is? And so you need to check some sort of method to see if your user is allowed to talk to your origin or not. The reason this is useful, and to be clear, while NPM is a customer, they were not using Workers at this time.
At least I don't believe so. This is sort of a slightly old story, but a kind of hilarious one, about Visual Studio Code, Microsoft's editor. So you're building your JavaScript application in VS Code and it will be like, oh, you've imported the lodash package. I'm going to go see if there's a, you know, package of the lodash types for TypeScript on NPM. And the problem was that NPM didn't have all the right caching in place, for various reasons that are sort of out of scope. I don't want to talk about them right now, but they had a situation where scoped packages were not cached, whereas non-scoped packages were, and all of these types were in the @types namespace on NPM. So they weren't cached. And so Microsoft deploys this new version of Visual Studio Code, and suddenly, simultaneously around the world, every single user of VS Code was hitting an uncached endpoint on the NPM registry and actually brought it down completely, because basically Microsoft accidentally DDoSed NPM, which kind of happens sometimes. But the reason this happened was because scoped packages could be private, and so they needed to do an authorization check, and so they didn't want to cache the results, because each user may be authorized or may not be. That's kind of the reason why. And so this meant that all of these auth checks hitting their origin was ultimately what caused NPM to go down at the time. And so if you're able to do these kinds of auth checks at the edge, you can significantly reduce the load on your origin, because only authenticated requests actually make it through to your origin. And so the way this looks with my good old user and database and code diagrams is that if you're doing traditional authorization at your origin, all the unauthorized and authorized traffic ends up going the whole way there.
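The edge auth check discussed in this part of the talk can be sketched roughly as follows: tokens are stored with an expiry, the worker looks up the caller's token, and only matching requests are allowed through to the origin. `ExpiringStore` is a hypothetical in-memory stand-in whose `put` mirrors the `expirationTtl` (seconds) option that Workers KV accepts; the `token:` key prefix, the injectable clock, and the 401 status are choices made for this illustration.

```javascript
// Hypothetical in-memory stand-in for a KV namespace with expiring keys.
// The clock is injectable so expiry can be exercised without waiting.
class ExpiringStore {
  constructor(now = () => Date.now()) {
    this.now = now;
    this.entries = new Map();
  }
  // expirationTtl is in seconds; without it the entry never expires here.
  async put(key, value, { expirationTtl } = {}) {
    const expiresAt = expirationTtl
      ? this.now() + expirationTtl * 1000
      : Infinity;
    this.entries.set(key, { value, expiresAt });
  }
  async get(key) {
    const entry = this.entries.get(key);
    if (!entry) return null;
    if (entry.expiresAt <= this.now()) {
      // The entry has timed out: drop it and behave as if it never existed.
      this.entries.delete(key);
      return null;
    }
    return entry.value;
  }
}

// Decide at the edge whether a request may continue to the origin.
async function checkAuth(tokens, userId, presentedToken) {
  const expected = await tokens.get(`token:${userId}`);
  if (expected === null || expected !== presentedToken) {
    // Rejected here, so the origin never sees this request.
    return { allowed: false, status: 401 };
  }
  return { allowed: true };
}
```

A worker built this way would forward the request to the origin only when `allowed` is true, and return the error status directly otherwise.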
But if you are able to do your auth check at the edge, you can reject all of the unauthorized traffic immediately. And not only do people get a quicker response, but also they won't end up sending any traffic to your origin, which can significantly reduce load, and reducing load reduces costs. Cause, you know, if you're paying for a cloud provider, the more cycles you're actually running, the more money you're paying, generally speaking. And so, you know, those two things are sort of aligned. This is a use case we've also seen a lot of people use Workers for, and we actually added a really interesting feature to Workers KV to support this use case. And that is expirations. So what people will do is they'll have a namespace where they include tokens for folks. Like, here's a valid API token and it's going to expire 60 minutes from now. And so they have a worker where, whenever you hit the worker in the first place, they read that user's ID out of KV and they say, hey, do they have a valid token? They check it against the token and if it's okay, they'll pass the request through to the origin. And if it's not, they'll return some sort of error code. And what's nice about this with the automatic expiration is that, you know, your session will kind of time out on its own and KV will handle that for you, so you don't have to do that yourself. And so we've seen a lot of people build these kinds of, you know, authorization situations, which is definitely a really cool use case to enhance an existing application that way. And now, you know, none of that logic has to hit your origin at all. One last interesting thing before I go, cause I'm almost done here. This is the boilerplate; Rita actually showed this earlier. This is a boilerplate, doesn't-really-do-a-whole-ton worker. Like it just returns a response to the origin.
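The slide is not captured in the transcript, but the two-function shape being discussed looks roughly like the sketch below: a synchronous listener that hands a promise to `respondWith`, and a separate `async` function that produces the response. To keep the sketch runnable outside the Workers runtime, the handler returns a plain object (a real worker would return a `Response`), the listener is named `onFetch` for illustration, and the registration call is shown only as a comment.

```javascript
// The async half: free to use await, and calling it returns a promise.
async function handleRequest(request) {
  return { status: 200, body: `hello from ${request.url}` };
}

// The listener half: a plain synchronous function. It does not await
// anything itself; it passes the promise from handleRequest to
// respondWith, which accepts a promise.
function onFetch(event) {
  event.respondWith(handleRequest(event.request));
}

// In the Workers runtime the listener would be registered like this:
//   addEventListener("fetch", onFetch);
```

Splitting the code this way keeps the listener synchronous while all of the `async`/`await` logic lives in `handleRequest`.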
I always wondered this, and no one explained this to me when I started at Cloudflare. So I'm going to share this hard-earned wisdom with you. Okay. So we have the event listener, which listens to the fetch events, which is the part of the Service Worker spec that we have. But then we made the second function, handleRequest, and it takes the request in and we do this. Every single example had this in it. And I was like, why is it two functions? Like, I know I like to split up my code sometimes into tons of tiny little functions, but I don't know why this is. I'm just going to write all my logic in the event listener because, I don't know, it's kind of weird. Maybe you already know the answer and I'm just dumb. But if you also wondered about this like I did, the reason is that respondWith takes a promise, but the event handler is not itself async. So if you want that awesome async/await JavaScript goodness, which is definitely super good, then basically this is a little hack around that, because now we can have an async function, handleRequest. And so that kind of all works out. So I didn't really understand why this boilerplate existed in literally every single worker example. So, you know, I figured I would just pass that little tidbit along to you all. And maybe I'll make a pull request to the docs someday. But I was like, oh, that's so obvious, after I learned about it. But it took a little bit of time. Okay. So, future stuff in KV and some other things that we're working on. KV is really great for a lot of use cases, but as I mentioned, there's some drawbacks and we'd like to release some things to address them. It's good for read-heavy and write-light workloads with this last write wins, eventual consistency style. We have some future plans for KV.
In the last six months, and we actually put out a blog post this morning on the Cloudflare blog, we've upped the value size from two megabytes to 10 megabytes. We've done some other related infrastructure work, so KV is a lot more robust. We've added list support, so you can list your keys and list them with a prefix and some other sorts of things. So we're constantly releasing these new updates to KV. We're also working on some more ambitious projects. I said they're not public and I'd like to talk to you about your needs. I still would love to talk to you if you have interesting things about data at the edge, but I will tell you one thing that we're working on, even though it's not totally public, cause this was in some of the slides this morning. My team is working on a job queue system for Workers. And again, that sounds weird. Like, wait, the database team is working on a job queue? Job queues need to be persistent, and we need persistence on the edge. So it's sort of under my team's purview. So if you have thought about queuing systems or that kind of stuff in Workers, I would also really love to talk to you about that stuff, because we're currently building it. I cannot promise anything about when it will actually be released, but that's sort of one of the next things that we're working on. So, yeah, please, I would love to chat today, or I have this convenient slide with my email and a QR code on it, if you would rather get in touch that way. Thank you so much for listening to my talk. And, yeah, sklabnik at Cloudflare.com is my email address. And I would love to chat with you about all things Workers, but especially storage.
