Building Workers-KV: Typescript systems programming at The Edge of the Internet
Presented by: Kyle Kloepper
Originally aired on March 29, 2021 @ 11:30 PM - 12:00 AM EDT
Learn how we built Workers-KV (an edge Key-Value store) using Cloudflare Workers writing in Typescript.
English
Cloudflare Workers
Transcript (Beta)
Hello, welcome to building workers KV typescript programming at the edge of the Internet.
I'm Kyle Kloepper, and I was one of the initial engineers who helped to build and create workers KV and joining me is Cody.
I'm the engineering manager on the team that built KV and then is also working on other distributed data.
conference edge.
So as we go through the presentation, hold up your questions will be time for answering any that you may have at the end.
And we're excited to get started and share this story with you.
So building workers KV So what is a Cloudflare worker.
So many of you may already be familiar with are using Cloudflare Workers.
And so it is a isolate That runs on the edge of our network on our edge network compute runs JavaScript or more generally was wasm.
Cody has just recently given a talk on how many languages, you can get to run that compile the JavaScript.
So the graphic on this page shows a number of our colos all around the world where your code can run very close to end consumers.
So a worker is a way to do computing at the edge of the Internet.
And so initially So, so when we first created workers, there was no way to read and write data from the edge.
And that was a gap in what we are offering. And so we created a product called KV key value store that allows very simple.
Reading and writing of data from the edge of the workers, so that the experience, you can have durable state.
So using workers KV within a script look similar to this where you have a namespace that you KV namespace that you embed within your worker.
This is called first KV namespace and you can get the values from a request.
And you can modify and put values into the store as well.
One of the differentiators of KV as opposed to edge storage products at the time was that it could both read and write values from all Edge compute code.
So in addition, so we prioritize read latency, but we still allowed right ability from the edge, which gives opens up a wide range of possibilities and applications using the product.
So the timeline for the creation of workers KV started in January 2018 we had a brainstorm.
We kicked it off officially in March of 2018 We then in Austin, we planned the execution and a month later, we had a prototype put together written in JavaScript that kind of demo the initial Components of it to get early internal feedback and realize how quickly we could put the different pieces together.
One of the What allowed the product to progress so quickly between the initial planning and the initial prototype was we built workers KV using Cloudflare Workers.
And so all of the deploys anything we do with workers KV any customer on Cloudflare can do using a worker.
Which demonstrates the power of the platform and also the development speed.
So previously when we released edge products, we would have to do a stage release globally.
Deploy bits and pieces of server code that runs on our edge. Make sure it's stable and then gradually roll that out across our fleet using a worker, we can deploy that code.
Within seconds or or a minute to all of our edge machines and have a global update that is tremendously powerful for deploying these sorts of solutions.
And so all KV requests and use this worker. So initially we implemented the worker in JavaScript.
The prototype worked great for that we decided to do a rewriting TypeScript to take advantage of the benefits of strong typing in the language.
One of the guarantees that types system provides is that you can catch errors in your program as early in the process as possible, ideally at the design phase, but if not, then But when you compile it and before you deploy diagnosing and triaging errors at each subsequent stage of development and release is By rule of thumb, about 10 times more expensive.
So if we can catch errors at compile time That's about 10 times less expensive than catching them during testing and then 10 times again less expensive than catching them once they're deployed out in the field because once it's out in the field.
You have lots of people touching it involves Customer issues and the cost of that is much higher.
So as much as we could do to push the design the defect. Discovery earlier in the process, the better.
So switching from JavaScript to TypeScript was a fantastic investment.
I am a huge fan of TypeScript as a language, particularly because it somehow is able to balance both the dynamic parts of the language and JavaScript.
What's good there with the static parts of type system in a very Strong way.
So the culmination of the prototype and the rewrite was a public announcement of KB in September at our Birthday week, which I believe is Coppler's eighth birthday week where we do a number of announcements.
There was significant user interest in the product.
As we expected and hundreds of requests came in for a closed beta.
So we began the closed beta in October, based on the feedback we continue to refine and improve the solution targeting Better stability and right speed and throughput and finally launched an open beta in March, followed shortly by general availability in May.
So this product timeline. Is pretty aggressive and not uncommon among the new products that Cloudflare launches and a lot of it has to do with a lot of iteration and feedback stages.
So let's carry on.
So that's this timeline. So these are the various technologies that we use for building workers.
So TypeScript ECMAScript we use rollup JS to bundle and tree shake to have a small Output bundle for development.
We use node JS only for our Pre release testing what goes out as a worker.
We don't use any node modules, but we all all the code is in our repository and primarily that's because Node modules tend to be heavier weight than what we want to push out and but we do use a lot of those modules to get poly fill for the runtime testing and just And so the poly fill allows us to emulate the workers edge runtime, like the fetch libraries and the subtle crypto libraries effectively to catch those errors.
Before release and then on our edge we use react for the demo we use entry for our thing and errors.
And then of course we use Cloudflare Workers to power the whole tool chain.
So this is a very high level architecture diagram for how KV works.
So we have at the heart of the system, the storage gateway worker.
And so it's a simple worker. Well, it is a worker that's available, just like it is to anyone else.
So we run using the same Options that are available to everyone and we take all of the code that's in our various Modules in TypeScript and we bundle it all together using world JS and we use the serverless framework.
Because we started before Wrangler was available. So we use serverless to upload that to our edge.
And so within the storage gateway worker, it exposes an internal API.
We does indexing and encryption of the data and the indexing is done through formatting the names.
It has a cache that is Communicates with the Colo web cache and then it also communicates and I would call it a storage gateway to the back end storage providers which are the source of record.
So some of. So just to give you an example of what a back end storage provider is.
So we have Google Cloud Alibaba cloud. Azure Tencent backblaze IBM.
So these are all fall into the category of public cloud storage. And so these are all commodity object storage.
And so what KB does is it allows those object storage is to be harnessed in a commodity way transparently to the user and And we have performance make those faster by executing transaction transactions within A worker itself at the edge and we take advantage of the existing Cola web cache to greatly accelerate reads On the edge and because of the API integration between our web cache and workers were able to have Local updates from rights apply very quickly within a particular Colo and then propagate to the rest of the world.
Over a delay. And so part of the guarantees is that we've targeted the consistency model for KB to be available and prioritize reads, but have a delayed consistency.
Model so that it can take up to 60 seconds for value to propagate globally on right but local rights and reads are very fast and reads in particular are very fast.
And so the storage gateway worker runs on top of the workers platform.
So that means that it can spin up across any number of machines on our edge.
One of the amazing Properties of the workers ecosystem is not having to worry about scaling recently we had one customer increase their usage by 10 x on our platform.
And other than making our Usage charts look a little bit spiky. There was no general disruption to the platform.
And so being able to absorb those sort of loads using the workers platform is a key advantage that we internally decided to go with, but it's also available to our customers.
So that's kind of a high level overview of the product.
I'm going to open it up for any questions that we might have.
So one question.
You mentioned building on top of both Cloud players proprietary cash, but also public cloud storage.
Can you talk a little bit about the motivation behind that decision.
Yeah, so the reason we built it on public cloud storage is Wanted to take advantage of the reliability of existing infrastructure.
So instead of building on our own storage network right away.
We can harness public cloud storage, which is highly durable storage, if not the highest performance right away.
It also allows us to Move fast with the key part of our product being the integration to workers, so that we could focus on that integration in advance and that gives us the time as as Cody knows about the distributed data team still being able to work on a lot of other cool up and coming projects related to to storage.
So you've Done some work recently on redundancy. They got announced. Could you talk a little bit more about what the The benefits of that are not only in terms of like availability, but is there a benefit in terms of reader right speed.
Mm hmm.
Yeah, that's a good question. So initially when we launched with GA with workers.
We used a single cloud back end to provide the reads and writes And one of our identified concerns was that if there was an outage with that cloud provider, then we too would be subject to an outage with KB and Beyond.
So there's a lot of internal services that cloud for provides that are also dependent on KB and We can get back to why that was really important in a minute.
But so in order to provide that level of availability.
We decided to Read and write data from not one back end, but at minimum to cloud back ends for all transactions and One of our design priorities was to not decrease read performance and increase availability on all transactions and we had a lot of flexibility and how we chose to approach and solve that problem because of our eventual consistency model.
So the way that we chose to address this was to Every time there's a read we read from both providers.
Every time there's a right we right to both providers or all providers, as the case may be.
And as those and we return on first success. So the other advantage of this is it allows us to locate our providers in different parts of the world to improve speed globally.
Particularly right speed. It also has a cold read speed boost as well for fast reads for hot reads where you're reading a data.
Again, it's already present in our edge cache. Or if you've written a value the read of that is already present in our edge cache and so that that doesn't see a benefit, but for cold reads and for all rights, the performance is greatly improved, we can take advantage of the Execution after request returns to await the various completion of other reads Additionally, using read of both values allows us to if there was an inconsistent right say some request request failed to write to one back end provider, but succeeded to the other on the next read, there'll be a Synchronization and resolution of that inconsistent right which transparently keeps the system fast and consistent Beyond that, we have other systems to synchronize the values and just ensure that everything that we don't miss any application.
So, so one of the unexpected benefits of doing this was our speed improved that was anticipated the benefit was better than anticipated.
What we didn't anticipate is our error rate went way down.
And so it turns out that if one provider is experiencing server load and or we have a network issue to one provider.
That the other private provider, especially if it's located in a different world geography is typically unaffected.
And so our errors dropped by an order of magnitude with the back end.
Which is definitely a benefit to customers, because then they don't have to deal with error handling exceptions as much as they would otherwise.
So we're really excited to announce that feature and that was deployed earlier this year.
So in any kind of distributed system, you're always going to have these trade offs about consistency versus availability or latency, depending on how you think Can you say a little bit about why you chose the eventual consistency trade offs, you did and like what kind of applications are expected to be a good fit for that trade off.
Yeah, so our initial. So a lot of the decisions we make are guided based on user feedback and requests, because you always want to build products that are useful and people want to use right away.
And the Initial identified need was adding configuration state to workers.
And so in order to do that.
And so this was to be able to take that data instead of hard coding engines worker script itself extracting that from the worker and having it Run side by side, which allows more data more access patterns and it's just a nicer model.
So to do that, we prioritize read speed above all else.
So we wanted fast reads and we wanted reasonable right And so in order to get fast reads, we decided to skew.
So, and we wanted durability or availability.
And so the cap theorem says you can pick two So we decided to have fast reads and high availability and because of that we suffer on our consistency, which means that when you write a value.
It takes a while for that to propagate to all places and Because a lot of our users are building web applications, the geo distributed the eventual consistency model fits very well with most applications initially And and so that that's why we chose the performance trade offs that we did.
So one thing I've seen Customers or even occasionally other internal cloud player teams that use can be Like conceptually a little fuzzy on is the idea of of propagation.
Right. So like when I hear eventual consistency and then it may take like up to a minute for things to propagate Does that literally mean that like when I do a right from, I mean, I'm in Austin right now.
So if I was doing a right my laptop, it would, it would go to, you know, a colon in the southwest.
Does that mean that that right then gets pushed to both coasts and to Africa, Asia, like how does this propagation thing actually happen.
Yeah, no, that's a really good question. So some of the confusion comes with other storage deployment technologies that collapse.
I use like, for instance, when you deploy a new worker.
It's distributed using a An internal storage product that pushes actively pushes it to all the co lows.
And so we don't use that technology to deploy KV values because We wanted to be able to store unlimited size and unlimited unlimited number of KV entries.
Each KV value is limited in size to 10 megabytes but unlimited number of entries and in order to do that, we couldn't have an active push all colos in the same way as our previous technology.
And so when we talk about eventual consistency and KV we're not discussing An active pushed all colos.
What we do is we provide Stale while revalidate caching in places where it's used.
So when a value is read it will stay resident in the cache for up to 60 seconds.
In practice, it's a little bit lower than that. But what we guarantee with how the cache is tuned that when you read a value.
It won't be more than 60 seconds out of date.
So For value for locations that have never read the key before.
It's a brand new key any place that it's read subsequently will only have the cold read latency, it won't actually take a minute for that value to propagate It's in the cases where a key is actively being read in one colo.
There's a right from another colo that it will take up to 60 seconds for that value to be switched in that update to be registered in the colo that were the values active Okay, so it's more of like a, is it fair to think of it more of like a poll system than a push system.
Absolutely. We've explored various ways to do push system. And that's not off the table for certain values.
But right now it just the the poll system fits the economics and the needs of our users very well.
So does the fact that it's a poll system have any implication on how you might want to design your use of it like if I'm, if I'm thinking about the difference between like I could put My data in 10 separate keys or I might put my data in one key and I'm thinking about how frequently those keys are access to what the size of them is, does that, does that have any impact on the kind of trade offs, I should consider Yeah, um, so the, the way that you think about structuring your keys is is pretty important.
One way. I mean, just like you think you can think apply a lot of cash rules so that you have spatial locality and temporal locality.
So, so just like any caching system KB will work better if you can have Both spatial locality and temporal locality to your keys to an extent.
So if You have related values and they're all going to be loaded together and they're all pretty small and they all update together having them be one key as opposed to a dozen or 100 keys is potentially important, particularly if if one changes, they all change and that allows The values to be pulled all together.
Conversely, if you have a very large value and only bits of it change at a time, but you don't read it all.
It might be better to fragment that value into many keys that are per account so that you're not holding on to stale values that could otherwise be fetched when needed.
Also, smaller values have better latency properties on load.
And so the and we're small is, you know, around No more than a kilobyte.
It's kind of sort of where you switch from different sizes. So KB happily accommodates 10 megabyte value.
So you can start store large assets there as well.
And while perfectly fine it it takes time when we're in time right now we're measuring time in milliseconds.
So our orders of magnitude, you know, it takes milliseconds to load.
Megabytes and microseconds to load by so so you just have to account for that in your designs.
So, Were there any IQ. You spoke a little bit about the importance of TypeScript.
Were there any other lessons you feel like you learned while working on a pretty meaningfully large workers project that might be helpful or generalizable to other other workers projects, people might be considering Yeah, um, that's a good question.
So we've Having a really good Test framework that mirrors your production edge as much as possible was very key so Early challenges.
Had to do with because so so because we wanted to maintain consistency, our unit test infrastructure around inconsistent rights had to be rock solid.
And in addition to a broad initial coverage. We also wanted to have an easy place to very quickly at regression tests to prevent any future issues.
And so This is true around encryption, because we do strong encryption as GCM encryption on all the values we wanted to make sure that that was 100% covered in our test coverage.
And so having a strong unit test framework that can near edge state, particularly when you need to test.
Conflicting order of operations is very important.
So when we did the redundant back end work we invested heavily in unit test infrastructure to validate that The state machine we use to replicate keys and bring values into consistency.
All of the conditions were tested effectively.
And so it would have been easy to do that. Partly because It's not fully in the ecosystem.
And so we found a lot of issues when we go from.
So if you just test the library, but then you don't test this it's integration. Like you don't use a different crypto library in the background.
So we use poly fills to get the interactions as close as possible to what we're going to see on the edge.
There's still a number of few times we've been bitten by the edge having slightly different behavior than in Testing and we've worked to converge our testing towards that, for instance, the IP addresses that we get from the requests or Anything else and and largely that was a design flaw on our end when we let web information leak into the underlying architecture, as opposed to keeping that at a boundary layer.
So a good early example is we propagated HTTP errors from sub request to cloud back ends up through our stack and that caused a lot of confusion.
Because we couldn't test for those properly. And so the way that we separated that as we pushed HTTP to the edge of the system.
So that there's interfaces on both the back end and front end that interpret those errors that we get from our back end providers or the front end into a known TypeScript.
Language.
And so that allowed us to very rigorously test the internal framework. And so as long as the inputs are what we expect and we can validate those inputs at runtime.
We know what's happening on the inside of the box. And so this takes a relatively very complicated system and makes it trackable or tractable so that when we do get defect reports which come in from time to time.
Fortunately, not as often as we would have As we could have expected, we're able to isolate those very quickly and deploy fixes very quickly.
Um, so we've kind of, we've kind of emphasized gets by key.
We've kind of talked through what the right path and the corresponding root path looks like a little bit But there's also the ability to do with to do lists with prefix and could you maybe talk about an example of when you might want to consider using that in your design.
Yeah, so list with prefix is a very powerful feature that allows us to list a set of related keys.
You can use it to step through values.
Additionally, a feature we launched Metadata user customized metadata makes this even more powerful where you can embed data within each key that comes back with the list that allows you to do updates and track state.
So the, the places that we've seen list used are for managing related data in an account for looking up values.
And and so you can use lists and you step through them both on the edge and through the API to implement certain solutions that you wouldn't be able to easily do otherwise.
So if I, if I do a list with a prefix of say 100 keys. Is that going to be like, roughly speaking, faster or slower than doing like get that's on all hundreds of those keys.
That's a great question. It's much faster because of how we can use the cloud backends to do the list.
The results come back all at once, as opposed to one by one.
If you didn't need to do the gets to get the values you can Run all those async and will issue all the requests in parallel, but at the same time lists still are much faster.
Well, thank you everybody for coming to the talk.
It's been wonderful to explain the story of KB and answer some questions.
We're happy to see how you use the product and what excellent applications you build with it.
Appreciate your time.