💻 What Launched Today - Wednesday, April 3
Presented by: Phillip Jones, Matt Silverlock
Originally aired on April 3, 2024 @ 1:00 PM - 1:30 PM EDT
Welcome to Cloudflare Developer Week 2024!
Cloudflare Developer Week April 1-5, our week-long series of new product announcements and events dedicated to enhancing the developer experience to fuel productivity!
Tune in all week for more news, announcements, and thought-provoking discussions!
Read the blog posts:
- How Picsart leverages Cloudflare's Developer Platform to build globally performant services
- Data Anywhere with Pipelines, Event Notifications, and Workflows
- Improving Cloudflare Workers and D1 developer experience with Prisma ORM
- R2 adds event notifications, support for migrations from Google Cloud Storage, and an infrequent access storage tier
Visit the Developer Week Hub for every announcement and CFTV episode — check back all week for more!
Transcript (Beta)
Hi, folks. It's Wednesday here at Cloudflare. Third day of Developer Week. And obviously, we just had a bunch of announcements go out this morning.
We're going to talk about those in a moment.
Super, super excited, particularly sort of around R2 and how we're sort of thinking about data.
And so, yeah, you know, let's just do some brief introductions for those of you that don't know us.
So, I'm Matt. I head up database system storage on the product side here at Cloudflare.
And I'm joined by Phillip, who I'll let introduce himself.
Hey, I'm Phillip. I'm a product manager.
I work on R2 and some of our data migration products, which we have a few exciting announcements about.
Cool. Thanks a lot. And so, maybe we can actually start there.
You mentioned data migration. We said there are a few things around R2, and I know we just announced support for another cloud provider here.
So, maybe tell people watching a little bit more about what that means.
Maybe even some background on what SuperSlurper is and what data migration is.
Yeah. So, before jumping into the exciting announcement that we had today, you know, for those who aren't familiar, I can kind of give an overview of, you know, some of the data migration products that we have and also, I guess, a little bit of the why around why we built them.
So, I guess the why first. We talk to a lot of customers, and they look at something like R2 and say, hey, it's S3 compatible, it has good performance, it's a global object store, and people like to take advantage of us not charging for data transfer and egress.
But one of the challenges is how do you get your data onto the platform?
Or more broadly, how do folks in a more and more multi-cloud world get data from one cloud provider to another?
It's definitely possible today, but previously folks would have to go spin up a VM in a cloud provider of their choice to do a lot of this migration.
And then what happens when an object fails, right? What happens when something goes wrong?
Well, a lot of times, you know, people are kind of stuck with these migrations that are, you know, half complete and there's not really a lot of visibility.
And especially if you have like millions or even billions of objects, that kind of stuff is really challenging to manage.
So, you know, and there are other things too, right?
Just getting it to be performant, right? Sometimes if you go spin up a single VM on a cloud provider, it potentially doesn't have access to that much network bandwidth, or it's not doing that many transfers in parallel.
And, you know, if you have billions of objects, it can take, you know, really a long time.
So those are some of the challenges that we really set off to solve with our data migration products.
The one that we talked about today is called SuperSlurper.
And this essentially allows folks to, you know, take a bucket in a cloud provider of their choice, or even another R2 bucket, and, you know, fill out information about your bucket, the source and the destination.
You can provide options, like, do you want to overwrite files in the destination?
Do you want to migrate a specific prefix or, you know, a given set of objects?
And then with a single click of a button, we can go ahead and actually fire off a migration for you.
So you don't need, you know, to take your engineering team and have them spend a few weeks building a robust or a performant data migration tool.
If there are any failures, we retry.
So it's very reliable, very fast, secure. And if there are issues, we have an easy to digest migration log at the end.
So that's kind of a high level of, you know, what SuperSlurper is specifically.
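The options described above (source and destination, an optional prefix, an overwrite flag, retries, and a migration log at the end) can be sketched as a small in-memory model. This is only an illustration: `migrate`, `MigrationOptions`, and `CopyResult` are hypothetical names, not SuperSlurper's actual API, and plain `Map`s stand in for buckets.

```typescript
// Hypothetical sketch: none of these names come from SuperSlurper itself.
type CopyResult = { key: string; status: "copied" | "skipped" | "failed"; attempts: number };

interface MigrationOptions {
  prefix?: string;      // only migrate keys under this prefix
  overwrite: boolean;   // overwrite keys that already exist in the destination
  maxAttempts: number;  // attempts per object before it is logged as failed
}

function migrate(
  source: Map<string, string>,
  destination: Map<string, string>,
  options: MigrationOptions,
  // The copy function is injected so transient failures can be simulated.
  copy: (key: string, value: string) => void,
): CopyResult[] {
  const log: CopyResult[] = [];
  for (const [key, value] of source) {
    if (options.prefix !== undefined && !key.startsWith(options.prefix)) continue;
    if (!options.overwrite && destination.has(key)) {
      log.push({ key, status: "skipped", attempts: 0 });
      continue;
    }
    let attempts = 0;
    let copied = false;
    while (attempts < options.maxAttempts && !copied) {
      attempts++;
      try {
        copy(key, value);
        copied = true;
      } catch {
        // Treat the error as transient and try again on the next loop pass.
      }
    }
    log.push({ key, status: copied ? "copied" : "failed", attempts });
  }
  return log;
}

// Usage: one object already exists, one copy fails once before succeeding.
const source = new Map([["img/a.png", "A"], ["img/b.png", "B"], ["logs/x.txt", "X"]]);
const destination = new Map([["img/a.png", "old"]]);
let failuresLeft = 1;
const migrationLog = migrate(
  source,
  destination,
  { prefix: "img/", overwrite: false, maxAttempts: 3 },
  (key, value) => {
    if (failuresLeft > 0) {
      failuresLeft--;
      throw new Error("transient network error");
    }
    destination.set(key, value);
  },
);
console.log(migrationLog);
```

The per-object log is what makes a half-finished migration inspectable: every key in scope ends up with a status and an attempt count.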
Last year, when we first announced it, it supported migrating from R2 buckets to R2 buckets within an account.
It also supported AWS S3 into R2. And today, we're so excited to announce that we're supporting Google Cloud Storage.
So you can now migrate your Google Cloud Storage buckets directly to R2.
And it's really the same easy flow. You click a couple of buttons, the migration runs, you can see the status.
And you can go drink some coffee, do what you want, and come back and find out what was successful and all that stuff.
I mean, I actually still had some stuff kicking around on GCP, so I kicked that off this morning and brought some of the remaining stuff I had in GCS over to R2 immediately, which was great.
So I think and even going back to what you were saying around the pain of like migrating your own data, right?
I'd worked with customers in the past who wanted to migrate data to other object storage providers, and they set up VMs, and they're like, great, well, the VM's only got eight gigabits a second of bandwidth, because it actually needs more CPU cores to get more bandwidth.
And it turns out you end up spending hundreds to thousands of dollars a month.
And that's not to mention the cost of actually running the software and snapshotting your state and your progress; all of that has to be tracked.
If something goes wrong, you have to restart, then you're paying a lot of data transfer repeatedly as well.
But like, manually migrating, you know, at scale is hard, right?
Like, for the reasons you said, and it ends up just being really, really challenging.
So yeah, like you said, for those of you at home, having built-in tools that support all the different storage providers, being able to track and replay that, and making sure that those jobs don't fail, ends up being huge.
And so even personally, again, I find that really useful in terms of migrating datasets I had.
Yeah. And I guess we talked a little bit about SuperSlurper and migrating data all at once.
But that's kind of half of the picture in terms of the data migration tools that we offer.
So I could talk a little bit about that, too. So we have SuperSlurper, which is copying all the data at once.
One of the other things that we have is an on-demand migration tool called Sippy, which is pretty incredible.
So, SuperSlurper is great.
You can migrate a ton of data at once. It's really fast. But there are some cases, you know, it could be potentially for cost reasons or other that you kind of want to just migrate the data as it's accessed, right?
And that's what we call incremental on-demand migration.
We have a tool called Sippy, which allows you to do just that.
And actually, one of the cool things about that is, we didn't talk about it in this announcement today and we didn't have a blog post about it either.
But a few weeks ago, we actually launched support for Google Cloud Storage for Sippy as well.
So essentially what this allows users to do is they can go in the dashboard, the Cloudflare dashboard, go to R2, select a specific bucket they have.
They can go to the settings and enable Sippy, or on-demand migration, for any buckets they would like.
And just to talk a little bit about the behavior of that: once you enable it, if you have a public bucket with a custom domain, or you're using the S3 API or Workers bindings to access it, we first check whether the data is in R2.
And if it's not in R2, then we actually go ahead and fetch it from Google Cloud Storage or AWS or your source storage provider.
So in effect, one of the cool things about it is it kind of can remove the need to even migrate your data at all, right?
We see people who create empty R2 buckets, link it to an S3 bucket and now start routing traffic to it and everything.
Then you go back the next day, your users were not impacted.
Everything was still working perfectly.
And you go back and check your bucket and now all your data is already there, right?
So it kind of effectively eliminates the burden of migrating data. Yeah.
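The on-demand behavior described above (check R2 first, fall back to the source provider on a miss, and end up with the object in R2 afterwards) is a read-through pattern. A minimal sketch, assuming in-memory `Map`s in place of real buckets; `ReadThroughBucket` is a hypothetical name, not how Sippy is implemented:

```typescript
// Hypothetical read-through migration: Maps stand in for the two buckets.
class ReadThroughBucket {
  // Counts how often the source provider had to be consulted.
  sourceFetches = 0;
  private readonly destination: Map<string, string>;
  private readonly source: Map<string, string>;

  constructor(destination: Map<string, string>, source: Map<string, string>) {
    this.destination = destination;
    this.source = source;
  }

  get(key: string): string | undefined {
    // 1. Serve from the destination bucket if the object is already there.
    const local = this.destination.get(key);
    if (local !== undefined) return local;

    // 2. Otherwise fetch from the source provider.
    this.sourceFetches++;
    const remote = this.source.get(key);
    if (remote === undefined) return undefined;

    // 3. Copy into the destination so the next read never leaves it.
    this.destination.set(key, remote);
    return remote;
  }
}

// Usage: an empty destination bucket linked to a populated source bucket.
const legacyBucket = new Map([["video.mp4", "bytes"]]);
const r2Bucket = new Map<string, string>();
const bucket = new ReadThroughBucket(r2Bucket, legacyBucket);

const firstRead = bucket.get("video.mp4");  // fetched from the source, then copied
const secondRead = bucket.get("video.mp4"); // served from the destination
console.log(firstRead, secondRead, bucket.sourceFetches);
```

Each object crosses the source provider's egress boundary once, on first access, which is why the cost is spread over time rather than paid up front.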
I mean, I've spoken to so many customers that have done these sort of one-off migrations, and that can be really good for a lot of non-live data, analytics data, right?
There, you can do a migration and stop people from writing so that you don't fall out of sync.
But when you're serving live user content or media, getting that migration done is kind of hard.
It's also like, unfortunately, not that we're charging egress at Cloudflare, but you're still going to have to pay egress from your cloud provider to get that to you.
And for a lot of customers, smoothing that cost out over a period of maybe a couple of months, maybe a little bit longer, really helps. Maybe they've got a cloud commit to burn down, and they're committed to spending that on the cloud provider before they can get away and move to Cloudflare in full.
Yeah. So I think, as you've obviously seen firsthand, Phillip, that sort of incremental migration is great.
I don't think I'm aware of anyone else that really sort of provides this in that way.
It's all single-shot migration or do it yourself.
Yeah. And some of the, kind of to bring these things together, I think some of the coolest stories that we hear about customers who are migrating their data actually use both together.
So you can imagine situations where maybe you have different sort of clients writing data.
And if you do a single-shot migration, you may be concerned that you're missing data, et cetera.
So we have customers who woke up one morning, said, hey, I want to move over to R2, and just enabled Sippy on their bucket.
And then they started moving over the data because that way they can ensure that even if the one-time migration is missing something, their end bucket is still consistent, still up to date.
Yeah. Nothing worse than you think you've done a migration, something didn't come across, content got added after the fact, and you're serving up a bunch of 404s or something like that as well.
So being able to always go back to the original source, be confident, check your metrics, right?
And go, okay, great. I'm actually really happy now.
I've been able to see this hasn't had any production impact. And then being able to pull the cord on your respective legacy cloud provider is great.
So yeah, I'm really excited for those announcements that have gone out today.
So one more thing, obviously, keeping the theme of R2 is infrequent access.
And so maybe sort of tell folks a little bit around what does infrequent access mean?
What is infrequent?
How infrequently is that? And why it's actually useful? Yeah. So as you mentioned today in our blog post, we also announced the private beta of our infrequent access storage class.
So a little bit of context on that. So effectively, I guess the definition of it is it's a different storage class.
So R2 has traditionally had one storage class, which works for really a variety of workloads.
But we have some customers who maybe are building websites or applications with user-generated content or potentially other cases where a lot of their data may not be used every month, right?
We have a lot of customers where data is used 10,000 times a second.
But there are some objects or some use cases where it's used very frequently at first.
And then maybe 30 days later, not many people are accessing the data, right?
So the infrequent access tier is meant for those kind of use cases.
So if you have a data set that over time becomes less and less popular, it could be user-generated content.
It could be things like logs that you need to keep for compliance reasons or others.
Infrequent access tier is perfect for those use cases because it allows you to essentially get lower storage prices.
So that's really the goal of it, right?
So if you have data that's not frequently accessed, the infrequent access tier allows you to benefit from lower storage prices.
And that's ultimately what we want to be giving to our users, giving to customers as ways to save money.
And this is one great way to do it. Yeah. I think particularly this, obviously, when you have more than one storage class, making sure things are in the right class, having them kind of move around and optimizing, right?
That's obviously important as well. Maybe, Phillip, it might be worth talking a little bit more about what's coming next on the roadmap for infrequent access, and how we're thinking about managing costs?
Yeah, that's a great question. So today, typically, the way that folks who are joining our private beta will be using these tiers is with something called object lifecycle rules.
And for the folks who aren't familiar with object lifecycle rules, effectively, those are policies you can set on buckets and you can kind of say like, hey, well, it's been 30 days since this object has been uploaded and maybe in a given prefix, and you can add some specification there.
But after that time period, I want to go ahead and delete this data because maybe it's an ephemeral use case.
And now you'll be able to say, hey, actually, I want to go ahead and move those objects to the infrequent access tier, right?
After a month, if you have a pretty good idea of the usage patterns, and maybe it's a month, maybe it's 60 days, whatever it is, you can go ahead and do that and start benefiting from those prices.
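To make the rule semantics above concrete, here is a minimal sketch of how a lifecycle policy could be evaluated: each rule has a prefix, an age threshold in days, and an action (delete, or move to the infrequent access tier). The types and the `applyRules` function are hypothetical and only model the behavior described; they are not R2's configuration format.

```typescript
// Hypothetical model of object lifecycle rules; names are illustrative only.
type StorageClass = "Standard" | "InfrequentAccess";

interface LifecycleRule {
  prefix: string;                  // rule applies to keys under this prefix
  afterDays: number;               // age since upload at which the rule fires
  action: "delete" | "transition"; // delete, or move to infrequent access
}

interface StoredObject {
  key: string;
  uploaded: Date;
  storageClass: StorageClass;
}

function applyRules(objects: StoredObject[], rules: LifecycleRule[], now: Date): StoredObject[] {
  const msPerDay = 24 * 60 * 60 * 1000;
  const result: StoredObject[] = [];
  for (const obj of objects) {
    const ageDays = (now.getTime() - obj.uploaded.getTime()) / msPerDay;
    // The first rule whose prefix matches and whose age threshold is met wins.
    const rule = rules.find((r) => obj.key.startsWith(r.prefix) && ageDays >= r.afterDays);
    if (rule === undefined) {
      result.push(obj); // no rule fired: the object is left untouched
    } else if (rule.action === "transition") {
      result.push({ ...obj, storageClass: "InfrequentAccess" });
    }
    // For "delete", the object is simply not carried over.
  }
  return result;
}

// Usage: keep logs but move old ones to the cheaper tier; delete old temp files.
const rules: LifecycleRule[] = [
  { prefix: "logs/", afterDays: 30, action: "transition" },
  { prefix: "tmp/", afterDays: 30, action: "delete" },
];
const now = new Date("2024-04-03T00:00:00Z");
const afterLifecycle = applyRules(
  [
    { key: "logs/a.log", uploaded: new Date("2024-01-01T00:00:00Z"), storageClass: "Standard" },
    { key: "logs/b.log", uploaded: new Date("2024-03-25T00:00:00Z"), storageClass: "Standard" },
    { key: "tmp/c.bin", uploaded: new Date("2024-02-01T00:00:00Z"), storageClass: "Standard" },
  ],
  rules,
  now,
);
console.log(afterLifecycle);
```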
But I think looking into the future, ultimately, the goal here is to kind of, you know, take a little bit of that burden from our customers and users as well, right?
So if you don't, if you're someone who doesn't necessarily have a great sense of after 30 days, or after 60 days, or after 90 days, we want to actually do that automatically for you and go ahead and just, you know, with a click of a button, give people the lowest price and optimize, you know, optimize costs for them, given the data that we have, you know, so that that's ultimately where we want to go is just being able to automatically do these, do these for folks.
But right now, people can use object lifecycle rules to get that behavior. Awesome.
Yeah, I think, having seen takes on this before, you've got analytics data, things like that, which you're largely not running any queries over, and that tends to be the bulk of the data, terabytes or petabytes of it.
The cost savings from having that infrequent access tier when your teams aren't querying data from last year's records, or sales data from three years ago, and having that be significantly cheaper, is definitely worth it.
So yeah, I'm really excited for that as well. Yeah, and it's really it's at the end of the day, it's all about giving, giving people the choice, you know, to kind of make these trade offs and do what's ultimately best and most cost effective for the data.
So we're really excited about it. Cool. So I think there's, at least today, one more thing on R2.
And so, you know, maybe this sort of ties into some of the stuff that we've also announced today that's sort of adjacent to R2.
And so yeah, maybe we can talk a little bit about event notifications.
Maybe I can talk about that one for you, Phillip. Sounds good.
Sounds like a plan. Cool. So, to take over from Phillip, one of the things we've been talking about a lot is how to help customers build better multi-step workflows on their data, right?
You've written data into R2.
As you've seen a lot with customers, Phillip, right? I think what I'm about to talk about is probably one of the biggest asks we've had across the platform: you've got user uploads, or you're writing data, and you want to go and trigger something based on that.
And so we've talked a lot about this, but basically what we announced today is what we call event notifications.
And the first kind of practical implementation of that is for R2.
And so when you write to an R2 bucket, or change data in that or any of your buckets (you can think of these as policies), we generate a notification into a queue, and then you can consume that in your queue consumer Worker.
You can pull from that queue from outside Cloudflare, or anything Queues supports. It's just a batch of messages telling you, inside the payload, what changed, what the key of the object is, what the bucket is, and what the timestamp is.
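As a rough sketch of what consuming such a batch can look like: the handler below walks a batch of messages, processes each notification, and acknowledges or retries per message. The `ObjectEvent` shape simply mirrors the fields mentioned above (what changed, key, bucket, timestamp) and is an assumption, as is the locally defined `Message` interface; check the current documentation for the real payload and queue consumer types.

```typescript
// Assumed payload shape, based only on the fields mentioned in the talk.
interface ObjectEvent {
  action: "create" | "delete";
  bucket: string;
  key: string;
  eventTime: string;
}

// Minimal stand-in for a queue message that can be acknowledged or retried.
interface Message<T> {
  body: T;
  ack(): void;
  retry(): void;
}

function handleBatch(messages: Message<ObjectEvent>[], handler: (event: ObjectEvent) => void): void {
  for (const message of messages) {
    try {
      handler(message.body);
      message.ack(); // processed: do not deliver again
    } catch {
      message.retry(); // failed: ask the queue to redeliver just this message
    }
  }
}

// Usage with fake messages that record what happened to them.
const acked: string[] = [];
const retried: string[] = [];
const fakeMessage = (event: ObjectEvent): Message<ObjectEvent> => ({
  body: event,
  ack: () => { acked.push(event.key); },
  retry: () => { retried.push(event.key); },
});

const thumbnails: string[] = [];
handleBatch(
  [
    fakeMessage({ action: "create", bucket: "uploads", key: "cat.png", eventTime: "2024-04-03T17:00:00Z" }),
    fakeMessage({ action: "create", bucket: "uploads", key: "broken.png", eventTime: "2024-04-03T17:00:01Z" }),
    fakeMessage({ action: "delete", bucket: "uploads", key: "old.png", eventTime: "2024-04-03T17:00:02Z" }),
  ],
  (event) => {
    if (event.action !== "create") return; // only new uploads need a thumbnail
    if (event.key === "broken.png") throw new Error("cannot decode image");
    thumbnails.push(`thumbs/${event.key}`);
  },
);
console.log({ acked, retried, thumbnails });
```

Handling messages individually means one bad object does not force the whole batch to be reprocessed.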
And that, you know, for many, many customers unlocks a whole ton of workloads, right?
It's not really reasonable to go and poll an object storage bucket for changes, right?
You have to track state, you have to know what you polled before, and you'd be running a lot of operations to make that work; at millions of objects or more, it doesn't really work.
And so you have notifications automatically generated on either any change or a subset of changes: maybe just creation of objects, maybe modification, maybe also deletes.
It's like, great, I can go and process that user image that's been uploaded, right?
I can scan or OCR these PDFs, I can run them through an AI model, I can take this text and generate text embeddings from Workers AI, insert them into a Vectorize index, and then go and run queries against that. I can run that every time data is changing or refreshing, and update associated columns in my database to say this data has been processed and the user can access it.
And so all of that stuff becomes way more possible. And for us, like I said, R2 is just the initial practical take. What we're really thinking about is how we build an event notifications framework.
And so the other day, ahead of launching this, one of the other teams at Cloudflare was like, hey, we'd love to be able to get notifications when we change a key, like a write to a key in Cloudflare KV, which is our key-value storage, often used for configuration or authentication.
You know, we want to be able to write to a key because we've got a lot of systems that are potentially writing, get a notification on that key change, potentially be able to see the previous value as well and go and compare, you know, allows them to do auditing and audit logs as well, which is really powerful.
And, you know, I linked them basically to what we had written up around what we were launching around, you know, event notifications for R2.
And it's obviously very, very similar, right? Being able to define a policy, what change events, maybe what namespace, maybe a sub policy, so that you know, you're just capturing just the events you want and not having to kind of cover everything and then writing them to a queue.
And the reason we write to a queue is that it keeps it durable, right?
It means that we can write that.
We know that it's going to persist. It doesn't matter if the other side is maybe not keeping up or that it's falling behind.
It's not having to keep up with live traffic.
It's not just answering HTTP requests. You can then batch them too, so that every event isn't just one to one.
It can be a hundred or a thousand events batched up into one, which is a huge optimization and cost saving as well.
You're only having to pull something off a queue once or handle a request once.
But that KV use case came up. We've been talking about potentially doing this for D1.
We talked a little bit in the blog post as well about generating these on other things, like, say, Workers builds, where a build completing triggers a notification.
These are kind of all ideas that we've just had, you know, these are the things that we obviously aren't just shipping today, but we'd obviously love to hear from folks if you're watching, you know, you can sort of see our tweets or just reach out, you know, what are the things you're kind of thinking about where we can sort of plumb through event notifications?
You know, can you just sort of roll this out?
Actually, a big effort for us internally is making sure that other teams at Cloudflare can basically implement event notifications directly into a queue, offer a set of event types and do that really easily, sort of actually making the internal developer experience easy so that then we can ship things to customers faster.
Yeah. And yeah, I think it's really exciting because just, you know, talking to customers, talking to other teams at Cloudflare, you know, these kinds of event notifications really unlock new types of applications to be built on the platform and make them a lot easier to build.
Like, you know, it's pretty, you know, pretty frequently we hear customers who are, you know, maybe uploading media objects to R2 who say, hey, like, you know, my application isn't done just because the data landed.
And now I want to go ahead and go and process that data.
Maybe I want to go do audio transcription, you know, or maybe, you know, more and more people who are using R2 to store, you know, event data, things like that.
Now they can go ahead and load that into a data warehouse or continue to process it, things like that.
So I think it's really exciting.
And event notifications is something that everyone can use today.
So we encourage you to do it. If you have any feedback, please let us know.
We also have a tutorial for how you can get started. So it's pretty easy to do using Wrangler, just, you know, kind of a single command and get started.
So we'd love to hear more from you.
So, but I guess kind of transitioning from event notifications and some of the things we want to build on this really like event notification platform, there were a couple of new products that are kind of upcoming that we announced in your blog post, Matt.
So can you talk to us a little bit about, I guess, maybe first workflows?
Yeah.
And so, you know, one of the things, you know, I think a bit like you said, event notifications is kind of the beginning of some of this.
But one of the things we really want to help solve is make it easier for developers and the teams to operate on the data.
Right. And so you've got data somewhere, or there's an event coming in, right?
How do you, how do you act on that? How do you do that reliably?
How do you do that durably? How do you make sure that that task runs to completion?
How do you make sure that's retried? How do you handle transient errors? All of that is genuinely pretty hard at scale, especially when you have business-critical workflows, right?
Things you want to run regularly, debugging them and tracing them can be hard, right?
Right. Just error checking and catching is, you know, is a lot of boilerplate.
And so there is a kind of broad concept across the industry.
It's, I would say it's, it's not well known to a lot of folks, but it's sort of called durable execution.
I mean, really the concept is how do you run a task truly durably?
And how do you, again, avoid rerunning large parts of it, which can be non-idempotent?
It can be costly. It's wasteful, right?
It can cause a bunch of, you know, stateful issues or loops. And so, you know, in a nutshell, workflows is our durable execution engine.
That's something we're hopefully launching into open beta in the next sort of two, three months.
We wanted to kind of preview it to folks so they can sort of get a feel for what the API looks like and how we're thinking about this, how it's going to integrate with other services.
But at its core, you write a series of steps that form a workflow and each step can act on its own, be retried on its own.
If the service was to crash or something was to fail, we can pick it up from the last successful step and restore all the state, everything ready to go, right?
Like we basically sort of snapshot the program as it runs through.
And then you can break down things into further steps, right?
That can be really powerful for you to kind of isolate more state, ensure against errors, emit metrics.
Every time you have a step, we can tell you as far as it's progressed or what steps it's failing on.
You don't have to wire that up yourself, which is really, really important.
And again, that's really powerful.
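The core idea described above, that each step's result is recorded so a restarted run resumes from the last successful step instead of redoing earlier work, can be sketched in a few lines. This is a toy, in-memory illustration of durable execution; `WorkflowRun` and `step` are hypothetical names and not the Workflows API previewed in the blog post.

```typescript
// Toy durable-execution sketch: completed step results are checkpointed by name.
class WorkflowRun {
  private readonly completed = new Map<string, unknown>();
  // Names of steps whose function body actually ran, in order.
  executed: string[] = [];

  step<T>(name: string, fn: () => T): T {
    // A step that already succeeded is never re-run; its saved result is reused.
    if (this.completed.has(name)) return this.completed.get(name) as T;
    const result = fn(); // if this throws, nothing is recorded for the step
    this.completed.set(name, result);
    this.executed.push(name);
    return result;
  }
}

// A three-step workflow whose middle step fails on the first attempt.
let resizeShouldFail = true;
function processUpload(run: WorkflowRun): string {
  const object = run.step("fetch object", () => "cat.png");
  const resized = run.step("resize", () => {
    if (resizeShouldFail) throw new Error("resize service unavailable");
    return `resized-${object}`;
  });
  return run.step("record in database", () => `stored:${resized}`);
}

const run = new WorkflowRun();
let output = "";
try {
  output = processUpload(run);
} catch {
  // Simulate the engine restarting the workflow once the dependency recovers.
  resizeShouldFail = false;
  output = processUpload(run);
}
console.log(output, run.executed);
```

On the second pass, "fetch object" is served from its checkpoint, so only the failed step and the steps after it run. A real engine would persist those checkpoints durably rather than in memory.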
You know, a huge use case of that, just like we talked about of say transforming like user uploads or data.
Well, you know, you could write that, but how do you want that to be reliable, right?
Like you probably want that upload to be processed reliably, no matter what other transient issues or other issues are happening, right?
And you want to know when it fails and exactly what step it failed on, right?
You can write a ton of boilerplate. You can try to store state transiently, or store it in your own database, and try to get that right.
Or, you know, ideally you can use workflows and write a step that says, oh, great.
Every time I get an upload notification from R2, and that workflow reads the notification, it grabs those objects, it passes them to the image resizing API.
All of the ones that work, it captures those, any of them that might have failed or were too large or there's something wrong with the image.
You can store those too, and those can be pushed into a different queue, for example, to come and retry later and trigger another workflow, right?
I can then store those remaining objects in, say, R2 in a future step.
Then in my final step, I can go write to my database and say, yeah, here's the file path to the transformed, resized, rescaled, cropped image or the series of images, right?
Or if it's PDFs, right? All those kind of things.
I can write each of those steps that I want to occur once. I want to retry just that step, but I don't want to go and retry further steps.
I don't want a bunch of wasteful and redundant operations on R2 or image resizing.
I've probably done that work. I want to keep that. That's one really good example.
I think there's a lot of AI examples where data's coming in and I want to turn it into text embeddings, or other embeddings, or audio transcriptions.
Again, things that need to happen in a particular path, right?
And even further out, we're thinking about how this would even work with human approvals, right?
How you can sleep a task, how you can in between steps actually trigger the request for a manual API-driven approval, right?
Where maybe it gets halfway through and requires a human to check a certain process or validate, there are a lot of valid use cases for that, such as deploying software.
We have a similar system at Cloudflare that follows those rules as well: each step progresses, there are a series of automated checks, and then there are some key checkpoints, some human intervention steps under certain conditions, so that a human has to go and click a button, which is an API call, to say, great, I've approved.
Now you can continue to the next step, right?
And that's recorded and measured as well. And so really excited for that.
I think importantly, as we ship more and more data products, more and more AI products across the platform, right?
The more we can do to help customers avoid having to be the glue across all these products, by gluing them together automatically, making them easy to combine, and making them more fault tolerant, instead of customers having to handle errors or transform data between services, the easier it is to get stuff done.
Yeah, thanks for explaining that. And then, in the last few minutes we have, could you share a little bit more about pipelines?
Yes. So when we think about data and storage, this is a gross simplification, but there are sort of three big parts, right?
There's the ingestion, how you kind of get data in.
There's the actual storage, so how you persist it, how it's structured, how it's accessible.
For us, that is R2. For most organizations, that's their object storage clusters.
And then on the third part is how you query that data, how you actually kind of derive value from it, how you actually extract meaning from it, how you pull it into other systems.
Again, AI being a huge one there for a lot of data these days.
And so with pipelines, it's really about how do I get data in?
How do I get data into Cloudflare? How do I get it into the right format so that it's ready for that third part, my query engine, my ETL tools, my BI tools, my analytics tooling?
How do I get into the right format? How do I filter it? Maybe how do I batch it correctly?
I think an example you brought up the other day, Phillip, was like be able to batch or segment based on user location, split data into different locations or buckets based on the source country or other metadata in those payloads as well.
That stuff ended up being really useful. Otherwise, I've got to do all of that processing after the fact.
I've still got to ingest it, I've still got to pay for that.
I've got to write it, I've still got to pay for that. Maybe I'm writing more data than I actually need.
I'm not filtering it or being able to generate it.
And now, in its unoptimized form, all my queries are going to be slow.
My query engine does not want to be reading a billion one kilobyte files because I've just ingested them one by one.
I want to be able to batch them, transform them, filter again, maybe redact PII as part of that process as well.
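A small sketch of the kind of ingest-time processing described above: records are partitioned by source country, PII is dropped, and records are grouped into batches so that storage ends up with a few larger objects rather than one tiny file per event. The record shape, the `email` field, and `batchByCountry` are assumptions for illustration; this is not the Pipelines API.

```typescript
// Hypothetical record shape; `email` stands in for any PII field.
interface EventRecord {
  country: string;
  email?: string;
  payload: string;
}

// Groups records per country into batches of at most `maxBatchSize`,
// dropping the PII field along the way.
function batchByCountry(records: EventRecord[], maxBatchSize: number): Map<string, EventRecord[][]> {
  const batches = new Map<string, EventRecord[][]>();
  for (const record of records) {
    // Redact: copy only the non-PII fields.
    const redacted: EventRecord = { country: record.country, payload: record.payload };
    const forCountry = batches.get(record.country) ?? [];
    const last = forCountry[forCountry.length - 1];
    if (last !== undefined && last.length < maxBatchSize) {
      last.push(redacted); // room left in the current batch
    } else {
      forCountry.push([redacted]); // start a new batch
    }
    batches.set(record.country, forCountry);
  }
  return batches;
}

// Usage: three US events and one German event, batches of at most two.
const batches = batchByCountry(
  [
    { country: "US", email: "a@example.com", payload: "click" },
    { country: "US", email: "b@example.com", payload: "view" },
    { country: "DE", email: "c@example.com", payload: "click" },
    { country: "US", payload: "purchase" },
  ],
  2,
);
console.log(JSON.stringify(Array.from(batches.entries())));
```

A time-based flush (write out a partial batch after N seconds) would sit alongside the size limit in a real ingestion path; it is omitted here to keep the sketch deterministic.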
And so with pipelines, really thinking about how do we get data in at a high rate.
HTTP-based ingestion is obviously really common.
We've also talked a lot about, and are looking at, how we can support the ability to just repoint a Kafka producer and write to a pipeline, or what we'd probably call a stream, without having to change anything about your client.
A topic is a stream, we can kind of remap to that and then you can consume or rewrite to an R2 bucket as you wish.
And a little bit further out, we're experimenting and thinking about how we allow more programmatic transforms.
And so a lot of what will make Pipelines powerful is that you don't need to write a lot of code.
You don't have to write any code to get started, to get an ingestion endpoint, to do some basic batching on JSON keys based on size, based on time parameters.
A lot of that should just be out of the box.
I can drive that through command line, I can drive that through the dashboard or Terraform.
But if I want to transform in a more advanced way, if I do want to write code over the top of that, or I want to filter on particular objects or structured formats, that's a larger task.
And so that's something we're definitely thinking about. In the blog,
we show an example of what that API might actually look like.
We're still evolving that as well. Sounds good. Hey, well, I think that's all we had to share today.
And if you want to follow more along with Developer Week, we encourage you to do so.
That's it.