🎂 The Future of Database
Presented by: Brendan Irvine-Broque, Dawn Parzych, Matt Silverlock
Originally aired on September 15, 2024 @ 2:00 AM - 2:30 AM EDT
Welcome to Cloudflare Birthday Week 2023!
2023 marks Cloudflare’s 13th birthday! Each day this week we will announce new products and host fascinating discussions with guests including product experts, customers, and industry peers.
Tune in all week for more news, announcements, and thought-provoking discussions!
Read the blog posts:
- D1: open beta is here
- Cloudflare Integrations Marketplace introduces three new partners: Sentry, Momento and Turso
- A Socket API that works across JavaScript runtimes — announcing a WinterCG spec and Node.js implementation of connect()
- Running Serverless Puppeteer with Workers and Durable Objects
- Hyperdrive: making databases feel like they’re global
Visit the Birthday Week Hub for every announcement and CFTV episode — check back all week for more!
Transcript (Beta)
Hello and welcome to this special Birthday Week installment episode. Today we're going to be talking about the future of databases.
I am Dawn Parzych. I am the director of product marketing for the developer platform, and I'm joined by Matt and Brendan.
Brendan, can you introduce yourself? Yeah, so I'm Brendan, product manager for the Cloudflare Workers Runtime.
And Matt, how about you? Yeah, I'm Matt, product director for databases and messaging for Workers.
Excellent.
Well, thank you both for joining me today to talk about a host of announcements we've made today and some yesterday on databases and data within the workers and developer ecosystem.
Matt, can you give a real quick synopsis of a couple of the announcements?
Yeah, so there's sort of three, I would say, big database announcements and there's tons of other ones on the side, but I'll touch on the big three.
And so D1, obviously, most folks hopefully watching have heard a lot about D1.
D1's now in open beta out of open alpha, so you've always been able to use it.
But like always, database limits are larger, it's faster, continue to add more databases.
We'll talk about them more in a sec. Hyperdrive, which is brand new. And so Hyperdrive is, in a nutshell, really a way to make connecting to your existing databases way faster, right?
You've got something in another cloud, you want to connect to it from a Worker.
Before Hyperdrive, it wasn't easy to make that fast, particularly if it was further away.
And so again, we'll talk about more of the details, but I'm really, really excited about that, right?
If you've got databases, you're just not going to migrate them.
Hyperdrive makes that easy. And then the last one, and we announced this yesterday, is Vectorize.
And so Vectorize is our vector database.
And so they work a little bit differently than traditional databases.
They really sort of align with a lot of these sort of ML, AI announcements we made yesterday.
And in a nutshell, again, the best way to think about a vector database is like, how do you scale an AI powered application?
Once you kind of experiment and it's awesome, but how do you actually scale that?
How do you give context to machine learning models?
How do you tell them about the stuff that's in your documentation or your product inventory without being an expert?
And so they really sort of help with that.
But those are sort of the big three database announcements.
Yeah. So Matt, I'm curious. You've got some like ancillary, like associated ones about like standards and integrations and other ways that we can work with databases, correct?
Yeah. Yeah. I mean, I think Brendan, sorry, go ahead.
No, I was just going to jump into kind of Hyperdrive. We can kind of run through that first because I know that there's some really interesting challenges specific to serverless there.
You know, like everybody's excited about things around query caching, but as we've talked about this, there's lots of interesting challenges specific to serverless functions.
And I know you've thought a lot about that.
Yeah. So it's a good place to start. I think if you go, well, hey, I can already connect to a database from my existing node application.
Well, yeah. But when you're thinking about serverless environments, particularly distributed serverless environments like Cloudflare Workers, right?
You have potentially hundreds or thousands of functions that are being invoked that are very short-lived, which is awesome.
It means you don't pay more than you need to. We obviously keep them really fast.
It keeps that scope really small, which is really powerful.
But database protocols are just never designed to be started, like opened and closed, you know, for a couple of seconds for two milliseconds to make a query and then shut down again.
That's just never the way they were designed. And so yes, although there are some serverless database providers, they're all still kind of running the real database engine under the hood.
And they've done some work on top to make it a little bit faster, but it is still just not that fast.
It can take on a good day, a quarter of a second, half a second or more just to open the connection before you even send the query.
And so huge problem, right? Just every time I get a request, I'm waiting a half a second or so to actually just query the database that has all my data.
You know, again, this big monolithic database that has my products kind of makes it hard to want to adopt serverless, particularly sort of, again, distributed serverless like Cloudflare, if that's the kind of penalty you're paying.
And so there's kind of two pieces of Hyperdrive, right? There's this part of, you know, how do I pool and reuse these existing connections?
And then there's also the query caching piece, which is really interesting.
Maybe say more about, break those two pieces down a little bit for us.
Yeah, perfect. So, you know, hopefully now there's sort of a bit of an understanding of how much time is spent just setting up connections.
And so the connection pooling part, which I think is one of the most important parts, equally important, but certainly one of the most important parts when you sort of think about this problem in the beginning is, well, great.
I don't have to keep re-initiating these connections. On the Workers side, we keep connection pools.
We keep those, so your Worker can just establish a connection to Hyperdrive in a few milliseconds, often even less.
And then Hyperdrive keeps pools of connections open to your existing database.
That's awesome. Obviously reduces latency in just getting connections and making queries, also reduces load on your database.
Because even if it could, even if you could create all of those connections really, really, really quickly, you could have thousands of them, right?
You want a way to sort of turn all of that user, all those user requests into few database connections.
Because again, database protocols traditionally, especially Postgres were not designed to sort of be spun up and spun down so quickly and certainly not designed to sort of operate at that scale, even a really large database on a big cloud provider typically tops out at about 500 simultaneous connections and often even less than that as well.
So that's sort of the connection pooling side in a nutshell, really, really powerful.
It's important to have it on this side of the Worker, like the client side from the database perspective. Having pooling on the server side is powerful, but it doesn't avoid that latency problem.
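As a rough illustration of the pooling idea described above, here is a minimal single-threaded sketch in Python. It is not Hyperdrive's implementation: `FakeConnection` and `ConnectionPool` are hypothetical names, and the "database" is simulated so the number of connections actually opened can be counted.

```python
import queue
from contextlib import contextmanager


class FakeConnection:
    """Hypothetical stand-in for a real, expensive-to-open database connection."""

    opened = 0  # how many connections have ever been opened

    def __init__(self):
        FakeConnection.opened += 1

    def query(self, sql):
        return f"result of {sql}"


class ConnectionPool:
    """Keeps at most max_size connections open and hands them out for reuse.

    Sketch only: the creation counter is not protected by a lock.
    """

    def __init__(self, max_size):
        self._idle = queue.LifoQueue(maxsize=max_size)
        self._max_size = max_size
        self._created = 0

    @contextmanager
    def connection(self):
        try:
            conn = self._idle.get_nowait()  # reuse an idle connection if one exists
        except queue.Empty:
            if self._created < self._max_size:
                self._created += 1
                conn = FakeConnection()  # pay the setup cost only when needed
            else:
                conn = self._idle.get()  # otherwise wait for one to be returned
        try:
            yield conn
        finally:
            self._idle.put(conn)  # hand the connection back instead of closing it


pool = ConnectionPool(max_size=5)
results = []
for request_id in range(1000):
    with pool.connection() as conn:
        results.append(conn.query(f"SELECT {request_id}"))

print(len(results), "requests served by", FakeConnection.opened, "connection(s)")
```

The point of the sketch is the ratio: many short-lived requests share a small, bounded number of long-lived connections, so the setup cost is paid rarely and the database never sees more than `max_size` connections from this pool.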
And so the second part is, sorry, go ahead.
In workers, I mean, you might have many, many hundreds or even thousands of different workers running all over the world.
It's very different than if you have your application running right next to your database with a few instances.
Right. If you've got a VM in EC2, right. It's just going to keep typically a pretty long-lived connection and that's fine, but that's one VM, or it's 10 VMs sitting in a data center in Virginia, not distributed close to users around the world.
So yeah, it's one of the cool things I think that we sort of thought about as well.
We solved the compute part, getting that distributed, getting that close to the user.
And I think as people have seen sort of, I will say every year, but really it's more like every three months we're trying to ship things and build things that make it easy to kind of get that data closer to users.
Or if we can't do that like this, where you've got existing stuff, how do we make it faster to go back to it as well?
And so then there's cases where your worker might be halfway around the world and your database is centralized in a particular region on earth, you kind of have that centralization, but you want to make things fast.
And there's some piece that Hyperdrive is solving. Tell us more about that.
Yeah. So in practice, 90%, it's actually even more than that in most cases of web applications, the queries they make are actually just read queries.
They're just reading the same data over and over again, or they're just reading user profiles, product pages, things like that.
So it's a tremendous amount of read queries. Secondly, kind of in part of that 90% is a lot of that is just the same query again.
Think of every product page you visit.
A large part of that is the same database query to bring up that same SKU.
It's the recommendations at that front page of a news publication.
It's the same query all the time. And if you can avoid making that back to the database, if you just have to make it once or twice, and this is the cool part of Hyperdrive, if you can cache those results, then first thing is you never have to go back there.
So even with connection pooling, it's still going to take some time.
It's still the speed of light to go back to the database from, I'm in Lisbon at the moment, but if my database was again, infamous US East one on a cloud provider, it's going to take some time.
It'd be great if, well, hey, I've rendered the front page.
I know the query that has been done. I can just store that locally near Lisbon or in Lisbon.
And now my worker, my client that's actually talking back to my database doesn't have to take 110 milliseconds.
In the best case, it can take five milliseconds to get that cached query.
And so we think that's actually a really, really cool part.
It's not just again about that connection pooling.
It's like, once you have that in place, how do you then even avoid having to go back to that database as much as you were before?
So much faster. And I think, again, a really powerful thing is it hopefully reduces database load.
And again, if a lot of your load is read queries, like most folks' is, that might even let you resize or downsize your database instance to save some dollars as well, if you can halve the size of the CPU you need.
That's obviously really powerful.
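To make the query caching idea concrete, here is a minimal Python sketch of a read-query cache keyed by the SQL text and its parameters, with a time-to-live. The transcript does not describe Hyperdrive's actual caching rules, so `QueryCache`, `run_on_origin`, the 60-second TTL and the sample query are all assumptions made for illustration.

```python
import time


class QueryCache:
    """Caches results of read queries, keyed by (sql, params), for ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # (sql, params) -> (expires_at, rows)

    def get_or_run(self, sql, params, run_query):
        key = (sql, tuple(params))
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # fresh cached result: no trip to the origin database
        rows = run_query(sql, params)  # miss or expired: go back to the database
        self._entries[key] = (now + self._ttl, rows)
        return rows


origin_calls = []


def run_on_origin(sql, params):
    """Hypothetical stand-in for the slow round trip to the origin database."""
    origin_calls.append((sql, tuple(params)))
    return [("sku-123", "Blue Widget")]


cache = QueryCache(ttl_seconds=60)
for _ in range(100):
    rows = cache.get_or_run(
        "SELECT sku, name FROM products WHERE sku = ?", ["sku-123"], run_on_origin
    )

print(len(origin_calls), "origin query for 100 page views")
```

Only the first of the 100 identical product-page queries reaches the origin; the rest are answered from the cache. A real cache would also need to decide which statements are safe to cache (reads, not writes) and how stale a result is allowed to be.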
And so, I'm thinking about moving my application onto a serverless platform.
I'm thinking about moving onto workers. I have this API that I already have.
I could just cache API responses. What does this unlock for me? What's the difference between I can cache a response from a server that's somewhere that's a layer in front versus I can cache the actual database query?
What does this let me do?
Yeah. I mean, I don't think they are mutually exclusive. But I think in one case, you can sort of think of a lot of those API queries as tending to be pretty small scoped, right?
They also will tend to be very close to a particular user. Maybe they have some kind of personalization.
The database queries might actually be, and it might be actually advantageous with Hyperdrive in some cases, to be a little bit more generic and to sort of cache that query, cache those results.
And then you can do some more parsing in the application side, right?
There's a lot of sort of strategies you can do there as well. But not having to go back to the database is obviously particularly important as well.
But I would say they're not always mutually exclusive, right?
Caching those API requests can make sense.
But if you can cache that database connection, that really expensive and slow part when you do have to make those, then that's just, again, a huge amount of time and compute power saved.
Yeah. You can move whole applications out to workers.
It's really cool. So maybe moving on, we also announced that D1 is in open beta.
And so there's one side of this that's working with your existing databases, but then also when you're building from scratch and building new things, and we see lots of people trying D1 in exciting ways.
What's the team been working on over the past summer to get to this milestone?
Yeah. So one of the really, really cool things about D1 just to kind of bring the audience up to speed is it's really easy to spin up D1 databases.
There's not this database server that has to boot.
You don't have to go and configure it. I'm not going in.
And for those of you that know, setting up pg_hba.conf and making sure all my settings are correct and reading like seven different blogs to make sure that I've got it all right, that it's actually going to scale.
D1 just takes care of that.
But one of the things we've been working with D1 is how do we get D1 databases larger and how do we give you more of them so you can sort of move to this database per user model?
We've had a couple of folks, including the folks at Ronin who are building this really interesting sort of data platform, want to sort of have this concept of a database per user.
And so over sort of the last three, six months since sort of our last big D1 announcement in May, we've been working on, again, on like horizontal scaling in D1.
So we've got folks with thousands or more databases already.
And we know we can scale sort of above 100,000 or more.
I think sort of as I said in the blog post, technically unlimited, there's not really a technical limit there.
But every time someone says unlimited, nobody believes you.
So I will say 100,000 to a million. And if you think you get close to that and actually hit that limit, I guarantee we can do more.
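The database-per-user model mentioned above can be sketched locally in a few lines of Python. This uses the standard library's `sqlite3` with in-memory databases purely as a stand-in; it is not the D1 API, and `PerUserDatabases` and the `notes` table are hypothetical names.

```python
import sqlite3


class PerUserDatabases:
    """Lazily creates one small, isolated database per user."""

    def __init__(self):
        self._dbs = {}  # user_id -> sqlite3.Connection

    def for_user(self, user_id):
        db = self._dbs.get(user_id)
        if db is None:
            # Each ":memory:" connection is its own separate database.
            db = sqlite3.connect(":memory:")
            db.execute(
                "CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT NOT NULL)"
            )
            self._dbs[user_id] = db
        return db


dbs = PerUserDatabases()
dbs.for_user("alice").execute("INSERT INTO notes (body) VALUES (?)", ("alice's note",))
dbs.for_user("bob").execute("INSERT INTO notes (body) VALUES (?)", ("bob's note",))

alice_notes = [row[0] for row in dbs.for_user("alice").execute("SELECT body FROM notes")]
bob_notes = [row[0] for row in dbs.for_user("bob").execute("SELECT body FROM notes")]
print(alice_notes, bob_notes)
```

Because each user has a separate database, there is no shared table to filter by user ID, and one user's data, schema changes or restore cannot touch another's. The trade-off is that the platform has to make thousands of small databases cheap, which is the horizontal scaling being described.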
But also just making them sort of larger vertically, right?
So more storage per database. And the really sort of challenging part there is how do you make that really fast, right?
If your database has a cold start, just like your serverless function, that's not great either.
And a database cold start can be much larger because it is just more bytes to load in.
And so we do a lot of work to make that bigger. We basically went from 100 meg to 500 meg pretty quickly.
We're at two gig per database now, and you can have multiple databases.
And by the end of the year, we continue to sort of increase that.
One of the things we want to keep doing, just keep dialing up those limits as often as we can, right?
We announced that GA for D1 is early next year.
But there's no reason to hold back a lot of the increase between now and then.
And just like from alpha to beta, where we shipped things like Time Travel and increased other things.
We just want to keep delivering those rather than holding users back.
And it's great that we're building all of these components into the developer platform.
We've got a database, but what if somebody already has a database?
How do they bring that into the Cloudflare ecosystem? Yeah. So one of the things I think, and this is I think a fairly non-controversial view is it's really hard to migrate your database.
It is super high risk. If the first thing you have to do is migrate data or your database to start experimenting and building things on a distributed cloud provider, Cloudflare Workers, you might not actually get there, right?
You might not even get approval from your boss, right?
It may not be your data or your database to go and actually move, right? It may be in another team in your organization, especially if you're not a small team.
And so one of our goals is, well, how do we give people these really native serverless tools, this stuff that's really, really easy, that really fits into, again, this concept of a distributed cloud like Workers?
So that's things like D1, that's things like KV and Queues, and making sure, particularly with Durable Objects, that our model is sort of the best happy path.
But how do we make sure you can still grab that stuff you already have?
And so again, where we see the difference between, say, D1 and Hyperdrive is they're very complementary, right?
I might want to go and augment or build applications on existing core data I have.
Well, Hyperdrive is going to be that way I can connect in.
I'm building a greenfield application or a new use case.
Well, then D1 is the thing I can reach for. In fact, what we already see is folks wanting to use both, right?
There might be cases where parts of the application can use D1, other parts of the application can use the existing database.
You could use that as a way also to potentially migrate some data, to de-risk some of those moves as well, to think about keeping that existing database.
But I've explained to a few folks, I see D1 as this fantastic fit for a lot of teams, particularly larger teams, for greenfield applications.
They should start building and experimenting.
Obviously, we'd love to talk to them. And then for brownfield applications, where you want to augment an existing app and get all the benefits of workers, but just can't migrate that database tonight or tomorrow or next week, that's where Hyperdrive comes in.
It helps meet in the middle.
And you mentioned earlier time travel with D1, which is a really cool feature.
We actually, for one of the things we're working on, used it the other day when we broke something in production with the D1 database and we were able to quickly go back.
And it seems really powerful with this model of per-user databases and being able to experiment.
Tell us a little bit how that's different than a traditional database model.
Do other databases do this? Is this unique to D1?
Yes and no. So I think a lot of folks probably know something like time travel as point-in-time recovery.
And so I think almost universally, it is something you have to turn on, something you have to pay for.
It's often actually significantly more expensive than just regular backup every hour.
And so it's awesome when that backup happened.
It's terrible when you make an error 58 minutes into the hour and your periodic backup isn't there for you.
Yeah. Well, you're right, it's not going to save you.
You've made a lot of changes since then. Maybe you get lucky and it happened 30 seconds ago, but that's often not the case, right?
It's going to be in that middle ground of probably about 30 minutes.
And you're going to remember those really bad times that it was 58 minutes.
But with time travel, it's on by default.
You do not pay extra for it. It doesn't eat into your storage costs. We just take care of that for you as well.
So you make a lot of changes to your database.
We don't count the time travel storage behind the scenes against your storage limits in any way.
But it's about having it on by default and not cost extra, right?
Because it's really easy to go, this is kind of expensive. I don't want to turn it on.
And then something goes down and now your database is hosed or you've got this problem.
You really wish you turned it on, but it's too late. And so if it's just on by default, we don't make you make that choice when you are setting things up, right?
You're making these choices you don't even maybe realize the repercussions of.
It makes it so much easier, right? It's one command to roll back your database.
And in the near future, we'll have additional commands and features that let you clone and sort of fork databases.
So you can branch out, experiment, make some schema changes, make sure you kind of get that right.
You could also potentially sort of copy a database back over an existing database as well.
So it gives you an opportunity to make these ad hoc snapshots and changes and de-risk your own database migrations.
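The difference between an hourly backup and point-in-time recovery can be shown with a small Python sketch that rebuilds state by replaying a change log up to a chosen moment. This is only an illustration of the general technique, not how D1 implements Time Travel; `Change`, `restore_to` and the sample log (timestamps in minutes) are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    timestamp: int  # minutes since the start of the hour
    key: str
    value: object  # None means the key was deleted


def restore_to(log, target_time):
    """Rebuild the state as of target_time by replaying changes in order."""
    state = {}
    for change in sorted(log, key=lambda c: c.timestamp):
        if change.timestamp > target_time:
            break  # ignore everything after the requested point in time
        if change.value is None:
            state.pop(change.key, None)
        else:
            state[change.key] = change.value
    return state


log = [
    Change(0, "plan", "free"),
    Change(20, "plan", "pro"),
    Change(58, "plan", None),  # the mistake, 58 minutes into the hour
]

before_mistake = restore_to(log, target_time=57)  # point-in-time recovery
hourly_backup = restore_to(log, target_time=0)    # what the last hourly backup saw

print(before_mistake, hourly_backup)
```

Restoring to minute 57 keeps the upgrade made at minute 20, while falling back to the top-of-the-hour backup loses it. That gap is the 58 minutes of changes discussed above.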
That's really exciting for experimentation. It's going to be really cool when you can fork and try things like that.
I know we also, yesterday, Matt's been one of the crazy, busiest people at Cloudflare these days.
We announced Vectorize, which is our new vector database.
And I want to make sure we touch on that because it feels like everybody right now is building AI applications, working with embeddings, working with different models.
Give us just the high level.
Why do you need a vector database in the first place? What's it helpful for?
Yeah. So there's a couple of use cases, but I'll sort of talk about one of the more obvious ones.
And so the best thing I think about when we talk about vector databases is understand that machine learning models only know what they know.
They only know what they've been trained on. And that is itself only a representation, right?
It's a synthesis of what they've been trained on, right? They aren't copying and just storing texts.
They are understanding relationships between data or images, things like that.
And so it's cool. What if I want to use a large language model, an LLM, something like Llama or ChatGPT, to search my own documentation, to give users context back, internal kind of product search, Q&A, anything else like that?
Well, I can't retrain those models, right? I can't fine tune them.
It's tremendously expensive. I may not even actually have the models or be able to do that.
To do those on those kind of models, it would take hundreds of hours of GPU time.
It'd be tremendously expensive. And then every time my documentation updates, I've got to do it all over again.
And so it ends up really painful.
And so a vector database gives you a way to feed in context or keep context or capture the state, in some ways, of a machine learning model.
So maybe a good example is like, hey, I want to build a product recommendation engine.
And so what I'll do is I will feed in all of my existing products and their descriptions, those product representations, into the machine learning model.
I'll get the vector representations, the embeddings back, which is effectively the way that the machine learning model represents that data, its features, what it sees in that data.
And different models do that differently.
But in the most naive sense, it's really just an array of numbers. And those numbers mean something to the model.
To the rest of us, it's 1,000 numbers in an array.
It just kind of looks meaningless. But the closer those numbers are to each other in those arrays, it means that the model thinks those things are more similar.
So if I can store those in a vector database, and then someone comes in and goes, hey, I'm looking for a product like these things, or I'm building an e-commerce site, and I want to be able to give them related products.
We've all seen that, right?
Some of those recommendations are really poor because they're done really poorly.
Using something like a vector database and vector search can make that far more successful, obviously, actually giving more relevant recommendations based on content and history.
And so I can say, hey, for the product that this user is querying for that they're actually looking at right now on the page, show me the other products that are the nearest to it.
Show me the ones that are the most similar based on the properties I said, or the model's determination of what's similar.
That is way faster.
You can't actually do that otherwise. You can't just keep feeding that data back in to a machine learning model every time you want to get recommendations out because you'd have to give it your entire catalog every single time.
It's not actually practical. Machine learning models have input limits. Even if they didn't, in some cases, it would take a lot of time, a lot of compute to go and do that.
And so it's a way of keeping the memory of the machine learning model and then asking it again and again, which is way faster, way cheaper, and also just much easier to build.
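The "nearest products" lookup described above comes down to comparing arrays of numbers. Here is a minimal Python sketch using cosine similarity and a brute-force scan. The three-number vectors and product names are invented for illustration; real embeddings come from a model and have hundreds or thousands of dimensions, and a vector database would use an index rather than scanning everything.

```python
import math


def cosine_similarity(a, b):
    """Similarity of two equal-length vectors: closer to 1.0 means more alike."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, index, top_k):
    """Return the ids of the top_k stored vectors most similar to query."""
    scored = [(cosine_similarity(query, vec), item_id) for item_id, vec in index.items()]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:top_k]]


# Hypothetical embeddings for three products.
index = {
    "running-shoes": [0.9, 0.1, 0.0],
    "trail-shoes": [0.8, 0.2, 0.1],
    "coffee-maker": [0.0, 0.1, 0.9],
}

# "Show me the products nearest to the one this user is looking at."
viewing = "running-shoes"
others = {item_id: vec for item_id, vec in index.items() if item_id != viewing}
similar = nearest(index[viewing], others, top_k=1)
print(similar)
```

The embeddings are computed once and stored, so each recommendation is a lookup over stored vectors rather than another pass of the whole catalog through the model.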
That's awesome. And so I know we're thinking a little differently with our vector database about what you should pay, or costs, or different things.
Say a little more about that. Yeah. So I think one thing that we've seen a lot of is people are kind of worried about the cost.
They're really excited about building things with AI.
And then they're really worried about bill shock when they've seen people spend a lot of money and they've seen their bills go up and they built on things.
I've seen the cost of running a GPU per hour.
And it kind of like turns them away from wanting to build stuff. And so particularly in the vector database space, it's really expensive to get started.
There's really high minimums. In some cases, even what's free gets deleted after a period of time.
So all your experiments are gone. You've got to kind of start from scratch.
And that's not particularly fun either. Sometimes we all like to experiment with things on weekends.
You come back next Sunday and it's gone, or you take a break.
It's not really going to want me to take that to work because I never got that far.
And a lot of us, I think, bring the technology we play with to work as well.
And so all that aside, we just wanted to make it way easier to understand how much it's going to cost, way more accessible.
I think that's really important.
If only large organizations or those of us in big cities can go and play with AI.
That's not particularly fun. And so again, I think not just for Vectorize, but for other things we've been obviously talking about as well, it's just making the pricing more accessible and again, more predictable.
I kind of want to know at the end of the month, what's it going to cost me for this workload?
What's it going to look like? What's kind of the upper bounds?
I think those things are really, really helpful for folks. Again, otherwise, what you thought was going to cost you $40 a month is now costing you $2,000 a month, or you've got to make this impossible choice between just shutting it all down.
That's not particularly fun. Yeah. And when somebody on a team that you're working with has an idea, there should never be this penalty for experimentation.
There should never be this idea of like, oh, we would really love to try this thing in production, but there's going to be this huge infra cost.
We're going to think about instances and capacity, and we're going to pay on a kind of per database basis.
There's such a tax at companies to just trying a thing. And it feels like right now, everybody's trying new ideas in AI.
Every engineer probably has five of them going concurrently.
And how do you take those and make them into real demos that you get into production?
Yeah. I think that is really actually important as well.
I think Vectorize, with things like D1, Durable Objects.
If you've got one database, you've got 10 databases, you've got 100, or 1, 10 and 100 indexes in Vectorize, it doesn't matter to us under the hood.
If you've got one that has all the traffic and the other 99 don't see anything, that's fine.
We don't charge you a fixed fee for the other 99 that you're experimenting with, that are throwaways, that are copies, that are just failed experiments, whatever happened.
That doesn't seem fair. That actually puts you as a developer into the role of running the manual garbage collection, where you're going through all the time, or your boss is coming down to you and saying, hey, didn't you clean up these four vector databases that are costing us $400 a month each?
You're like, oh, I didn't realize. That's not fun. And that's just kind of- Suddenly companies start locking down who can create resources in a cloud provider, and then it's really hard.
And you wake up one day and everybody goes like, nobody has permissions to do the work that they need to do, all because of pricing.
Exactly.
Yep. Well, that's awesome. I'm excited about Hyperdrive and all these databases share this common set of properties of enabling that kind of quick experimentation.
It's really cool. So maybe wrapping up a little bit, we've talked about Vectorize, we've talked about D1.
Where are we going with databases?
I know it's a leading question, because you and I talk about this all the time, but maybe one specific thing to bring up that we hear a lot about is, how do I connect to a database that's on another cloud, that's behind a VPC somewhere?
I know we're thinking a lot about ideas in that space. Yeah, that's actually a really good segue.
So today with Hyperdrive in the beta, obviously, we can connect to any database exposed publicly, which is a lot, right?
That is most databases on a lot of the new providers, things like Neon and stuff like that as well, Materialize, Timescale, all of those just work.
You can connect directly and you can expose existing databases like Aurora, but not everybody can or wants to do that.
There are cases where you just technically can't, there are policy decisions in your organization, right, where those databases only talk inside the VPC, sort of the virtual private container network inside that or compute, sorry, network on your cloud provider.
And so what we're doing, we have this really cool set of products on the other side of Cloudflare, what we call Magic WAN and things like Cloudflare Tunnel, that can bridge those connections.
And so we're going to teach Hyperdrive and teach workers how to talk to resources, databases or instances or any other services you have over Magic WAN.
So they know how to dial a private IP address is probably the best way to put it.
I can put in the 10.5.6.7 address of my Amazon RDS database sitting in, I don't know, AWS Frankfurt, right?
I don't have to open up the Internet, Hyperdrive knows how to talk to it, it talks to whatever private network, and it's a bridge between Cloudflare and AWS over say an IPsec, like a VPN-style tunnel, or a GRE tunnel, or a Cloudflare Tunnel, like a reverse tunnel, whichever obviously works.
And that just makes it work.
I don't have to open a firewall, I don't have to change any firewall rules, it just works.
And so I think that's gonna be really powerful. And I think, again, as you said, Brendan, you and I have been talking about that a lot, because I think we see a lot of customers wanting to kind of be able to do that, right, or use things like Hyperdrive, even if they technically don't actually see a huge risk in opening these things up.
Because certainly, like, again, a lot of databases are already, it's just not a conversation that they want to have, or a battle they want to pick, or a policy they want to fight.
I would like to give the tools to talk to your security team about, hey, this, and make them feel comfortable, even if a decision may have otherwise been safe.
Like, there's ways that people do things, and we want to fit with those established patterns.
So you can talk directly from a worker, and it's smooth sailing for you and convincing everybody on your team.
Exactly, yeah. If you had to go, and if all of the teams that wanted to use Hyperdrive, and things like workers, had to have those conversations with the security teams, they would collectively spend hundreds of hours, you know, every quarter having those conversations, where I think we can probably save them a ton of time and just go and build it into the product.
Wanna wrap up here.
Thank you both for joining today. This is very interesting to hear where we're going with databases.
In addition to all of these announcements, we've also launched a number of integrations.
So as Matt was saying, if you already have an existing database, and you don't want to migrate that, we've got an integration marketplace.
We are constantly adding new database providers to there.
A lot of work we're doing in the standards work to make sure that we're following open standards and building those open standards, so that things become more portable and flexible as you're building the applications that you want.
So we've got about 50 seconds left. Any last minute things you want to plug? I think we just want to hear from people, like, you know, if you're using this, talk to us, send us feedback along the way.
Like, tell us how this fits inside of the architecture that you already have.
You know, as Matt was saying earlier, bringing your data into workers is, like, a big focus for us, but it's also the hard and the scary part.
And so we want to hear from people on, like, what are the challenges along the way?
Because we're really focused on solving those. And if you want to share your feedback, the best place to do that is in our developers Discord.
You can go to discord.cloudflare.com and join. Matt, Brendan and I all hang out there periodically, and we would love to hear from you.
Thank you all very much.