Announcing the Cloudflare Data Platform
Presented by: Marc Selwan, Micah Wylde
Originally aired on September 26 @ 12:30 PM - 1:00 PM EDT
Welcome to Cloudflare Birthday Week 2025!
This week marks Cloudflare’s 15th birthday, and each day this week we will announce new things that further our mission: To help build a better Internet.
Tune in all week for more news, announcements, and thought-provoking discussions!
Read the blog posts, and visit the Birthday Week Hub for every announcement and CFTV episode — check back all week for more!
Transcript
All righty. Hey everybody, I'm Marc Selwan. I'm a senior product manager at Cloudflare working on basically all the stuff we're about to talk about.
Micah? Hi, I'm Micah Wylde.
I'm a principal engineer at Cloudflare working on the data platform.
Awesome.
So Micah... We kind of announced a pretty broad suite of products this week, and I think this is actually pretty important for Cloudflare.
Do you want to just really quickly touch upon what is this big, massive thing that we just launched?
Yeah, so what we announced today is something we're calling the Cloudflare Data Platform.
You can think of it as complementing the Cloudflare Developer Platform that everyone knows and loves. The Developer Platform basically enabled anyone to deploy, operate, and scale applications on the web by providing a really easy, serverless suite of technologies. The Data Platform is basically doing that for the analytical data world.
So it's really made up of three different products that track the lifecycle of data.
So data starts as events that are produced by mobile apps, servers, logs, IoT devices.
Those events have to kind of flow in somewhere in order to be processed.
And that product is Cloudflare Pipelines.
This is built on top of a stream processing engine that I've been working on for the past three years called Arroyo.
And in front of Arroyo, we give you basically an endpoint that you can send events to, either via HTTP or via a Workers binding if you're building on top of Cloudflare Workers.
You just send us those events as JSON, and then they flow into the pipelines that you define.
They're defined as SQL queries, so you're able to transform them, filter them, rewrite, schematize, or do whatever you need to do with that data in order to turn it into nice, analytical, columnar data.
From there, we'll write it as Parquet files into R2, which is our object storage system. We can do that either as raw Parquet files or, I think even more excitingly, as Apache Iceberg tables.
For those of you who aren't familiar with Iceberg, it's what's called a table format.
It turns a bunch of random data files on object storage into something that looks like, you know, a Postgres table or a MySQL table: something that has a schema associated with it and that can be queried.
It supports things like schema evolution.
It supports indexing. Basically, it solves the whole problem of how you find your data, how you organize it, and how you evolve it over time. Normally, if you want to do Iceberg, you'd have to run what's called an Iceberg catalog, which is the metadata layer that stores where all those files are, how you access them, and how you efficiently query them. But thankfully, Cloudflare has thought of that as well. We've launched a product called R2 Data Catalog. This is a managed Iceberg catalog on top of R2, so with one switch in the dashboard or one command in Wrangler, you can get a fully managed Iceberg catalog that you're able to ingest into from Pipelines.
So having data in the catalog is great, but you actually need to do something with it.
That might be hooking it up to any number of query engines that support Iceberg, like DuckDB, or Spark, or PyIceberg, or Snowflake.
But we want you to be able to solve this whole data problem end-to-end directly within Cloudflare.
So I think even more impressively, and maybe the thing that's going to be the biggest surprise to people, is that we launched a query engine.
It's called R2 SQL, and it runs across our entire distributed infrastructure.
It natively supports R2.
It's incredibly easy to use.
You just give it a SQL query, and you get back results.
And I think, you know, we're really excited to be solving this end-to-end data problem in what is probably the easiest way anyone has ever solved it.
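To give a flavor of what "just give it a SQL query" looks like, here's a hypothetical R2 SQL query; the table name, columns, and values are all illustrative, not a real schema:

```sql
-- A hypothetical R2 SQL query: find recent server errors for one customer.
-- Table, columns, and values are illustrative, not from a real schema.
SELECT event_time, url, status
FROM http_requests
WHERE status >= 500
  AND customer_id = 'acme'
  AND event_time > '2025-09-25T00:00:00Z'
LIMIT 100;
```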
Yeah, I think that that part is actually super, super critical.
So let's maybe take a few steps back to the top.
We'll start, I guess, from the start of the life cycle of an event, right?
So starting from ingesting data, right? Because to your point, Cloudflare has had this developer platform for a while now. That's actually one of the things that attracted me to coming to Cloudflare in the first place: there's a whole bunch of developer-facing services, so you can start to build applications on top of Cloudflare infrastructure and free yourself from the paradigms of other, more traditional cloud providers, right?
And so it's really interesting, because there are a lot of different products within our ecosystem, and other tools that people use, when they're bringing their data through Cloudflare: telemetry, clickstream data, HTTP logs, whatever. And that starts as an event, a log, something.
And then, to your point, that flows through this new ingest endpoint we call a stream, which, simply enough, is effectively an immutable, distributed log, right? So the data comes in, and the stream buffers those messages up.
And then what happens from there?
Yeah, so when you're setting up this data pipeline, you start by creating the stream, as you mentioned.
And under the hood, a stream is a partitioned, buffered, durable log, something you might think of like Kafka.
You don't really have to think about that.
We manage that for you, but it's a place you can send data.
We'll ack back to you that we've gotten it.
And you know, at that point it's been durably committed and it's eventually going to end up in its destination.
Once you've created a stream, the next step is to create what we call a pipeline.
This is basically defined as a SQL query. As I was saying earlier, it lets you transform, filter, unroll, redact, whatever you need to do with that data to get it into kind of an analytical data format.
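As a rough sketch, a pipeline query might look something like this; the stream and sink names here are hypothetical:

```sql
-- A sketch of a pipeline definition: read events from a stream, filter and
-- normalize them, and write them to a sink. Stream and sink names are made up.
INSERT INTO clickstream_sink
SELECT
  user_id,
  lower(event_type) AS event_type,  -- normalize casing in flight
  event_time
FROM clickstream_stream
WHERE event_type IS NOT NULL;       -- drop malformed events before they land
```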
So, you know, we both come from this traditional streaming world. And one of the things that was always kind of tricky from my perspective is stuff like enforcing schemas, data quality rules, and things like that. Can you talk a little bit about how that part of the pipeline works?
So we give you a few options here. In an ideal world, on your data platform, the teams who are producing this data are thinking about schemas at the start of the whole data lifecycle. They're really careful about emitting exactly the right fields in each event, and you have a whole taxonomy. If that's the case, we'll let you basically just give us those schemas, and we'll ingest the data already schematized and ready to be nicely queried with SQL.
If you don't live in that world, if you live in a world where your developers are maybe not quite so careful to structure the data in a reliable, standard way, you can also just ingest this as what we call raw JSON. That's just a raw JSON object that you're able to decompose into fields however you'd like, using SQL JSON functions. But ideally, at the end of this process, before you're ingesting into an actual table, you've done that work of schematizing the data, because that makes it much more efficient and easier to query down the line.
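For instance, decomposing raw JSON might look roughly like this; the JSON function names are illustrative (check the Pipelines SQL reference for the exact ones), and the stream and sink names are hypothetical:

```sql
-- A sketch of schematizing raw JSON in a pipeline. The JSON function names
-- here are illustrative; the actual Pipelines SQL dialect may differ.
INSERT INTO pageviews_sink
SELECT
  json_extract_string(event, '$.user_id')                     AS user_id,
  json_extract_string(event, '$.page')                        AS page,
  CAST(json_extract_string(event, '$.duration_ms') AS BIGINT) AS duration_ms
FROM raw_pageviews_stream;
```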
And Marc, as you said, we both come from this streaming world, and a big focus in the streaming world over the last few years has been something that people talk about as shifting left. This basically means pushing more of that validation, schematization, and data quality work into the beginning of the pipeline, so that by the time it gets into the actual data table, you don't have to worry about it.
You just kind of have good, clean, schematized data to work with.
And Cloudflare Pipelines enables you to do that.
Yeah, that's awesome.
And when I was running through creating a pipeline end-to-end, it was kind of shockingly simple, I think. I'm just used to stitching things up. There are a lot of great products out there for these kinds of use cases, but they still require you to stitch a lot of different components together, and this just worked seamlessly. I really liked the fact that, for example, your pipelines can inherit the schema you defined at the streaming portion, or you can explicitly set your own to make sure the conversion happens the way you expect and the data lands in the table exactly as you want.
So that part's really cool.
And then, so I guess that's the pipeline part, which we don't have time to cover in all its detail.
It's actually gone through a pretty big overhaul.
So definitely recommend checking out all the new stuff that's there because there's quite a bit.
But then, as you were mentioning, we now support R2 Data Catalog as a sink, which is awesome, because now you can ingest these events directly into this structured table format that virtually every compute engine today can talk to, right? And we expose that through an Iceberg REST endpoint that's 100% compliant with the Iceberg REST spec.
And one of the new things we announced this week is that we now have support for automatic compaction.
You can just enable compaction in your warehouse or your bucket, and we will automatically start combining those small files into larger ones.
And for those who aren't aware of what compaction is: if you think about these streaming use cases we're talking about, there's data flowing in constantly, and oftentimes you want that data to show up in your table sooner rather than later; you typically want lower latency. You can configure events to roll from the stream into the table as aggressively as you want, basically, but by doing that you end up creating a lot of small files. And when you want to query the data, your query engine has to go through all of those small files, which creates metadata overhead and increases I/O. So now we run this process called compaction, which takes all those small files and combines them into fewer, larger files, aiming for a target file size that you can configure.
And so now we handle that automatically, which is great.
Again, it's one less thing that a user has to think about. It's kind of a set-it-and-forget-it operation.
And then from there, Micah, as you mentioned, we released our own SQL engine. Let's talk a little bit about that, because I think this is the one that really comes out of left field: wait a minute, Cloudflare has a distributed, serverless query engine? Can you talk a little bit more about some of the use cases? What are some of the killer apps, and why does it make sense for us to release this as a product?
Yeah, well, historically, running big data systems has been really hard.
Back in the day, you would basically need your own infrastructure.
You would have to run something like Hadoop to query all of your data.
You have to write these Java programs.
More recently, the industry has kind of moved to SQL, which is something that's pretty easy for anyone who works with data to use.
You don't have to be an expert in these systems. But you still had to run this vast, distributed infrastructure if you wanted to query all of your data.
So the big innovation of the last, I would say, several years has been cloud-hosted versions of these things.
There have been some pretty good products out there in the space, but I think a challenge people still have is that many of these cloud SQL engines are still very tied to a server model.
You still have to stand up a cluster.
You have to size it.
You have to share that across your team.
And when we started thinking about this problem, we knew that's not the direction we wanted to go.
That's not the Cloudflare model.
What we wanted was like a truly serverless experience where you don't think about clusters.
You don't think about sizing.
You just think about your query, and you trust us to run it across our infrastructure in the most efficient way.
And in terms of use cases, we're really starting with use cases that look like search, where you need to find some relevant data in all of your tables. This actually evolved out of a product we built called Log Explorer, where we're basically enabling Cloudflare customers to search their logs.
But we're going to be, I think, moving pretty quickly here to a wider set of analytical data use cases that involve things like aggregations, eventually joins, basically anything where you need to kind of make sense of all of your data you have in Iceberg.
So these often come up in cases where you have financial data you need to understand, or you need to understand the behavior of your users.
Perhaps you're building data sets for model training or for AI.
These cases are pretty vast once you give people easy access to all of their data.
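As a rough illustration of where that's headed, an aggregation over an Iceberg table might look like this; the table and columns are hypothetical, and per the above, aggregation support is still rolling out:

```sql
-- Illustrative only: the kind of aggregation query this opens up once those
-- capabilities land. Table and column names are hypothetical.
SELECT
  date_trunc('day', event_time) AS day,
  count(*)                      AS errors
FROM http_requests
WHERE status >= 500
GROUP BY 1
ORDER BY 1;
```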
And I guess, from my perspective, just for some context and transparency: this is week five for me at Cloudflare, so I've been drinking from the pipelines, if you will, or the streams. But the thing that stood out to me the most is that the team focused on the foundational aspects of this query engine first.
The impression I got, or the way I describe it to myself, is that you give a bunch of really sharp distributed-storage engineers a global network of compute, intelligent routing, and free rein, and say: hey, if you were building a distributed, serverless SQL engine that doesn't care about regions or compute time or provisioning warehouses or whatever, how would you build it, right? And what they built is the foundation of exactly that.
There was a segment earlier today that talked about R2 SQL in more detail, which I really encourage you to watch, and there's a very technical blog post too. I think there are just some really cool aspects to it.
Like, for example, it uses some of Cloudflare's smart network-routing tech to figure out where the best open pool of compute is to actually distribute this work.
They've implemented this thing they call the streaming execution pipeline, where some compute is working to plan the query, and as that metadata passes through, it fires off executors, or workers, that immediately start processing the data while the metadata is still being processed. So you can start getting data back more quickly than if you had to wait for the full query run, or the full query plan, for example.
So there's a lot of really interesting innovation here that I think anybody who's passionate about distributed data or data infrastructure will really appreciate. And to your point, Micah, the team is basically working at the speed of light.
Lots of new stuff is going to be coming pretty rapidly. So I encourage everyone to keep an eye on the change log, the docs.
We'll have more tutorials and examples as time moves on.
But yeah, I'm super excited about where this data platform is going.
And yeah. Yeah, I think it's an entirely new area for Cloudflare.
So it's pretty exciting to see what people are going to do with this.
And we basically enable all of these developers who have not been well-served by data infrastructure to start building on top of these systems.
Yeah, and the last point I'll make, again coming from a more traditional cloud provider model: the other thing that was, I think, the most eye-opening for me, as a product manager, someone who spends a lot of time in spreadsheets and cost models and pricing and so on, was the fact that I didn't have to spend hours and hours trying to calculate network transfer costs when factoring margins and pricing models into all this stuff, because of R2's free egress model. Absolutely mind-blowing and game-changing.
So again, a lot of exciting stuff to come here.
Yeah, and I think it's actually worth hammering on that point. The thing that really makes this all possible is R2, which is just an incredible technology: a globally distributed object storage system.
And the thing that makes it work financially for users is that we don't charge egress fees.
This means that even if you don't want to use pipelines, if you don't want to use R2 SQL, you can get immense value just by storing your data in R2 with R2 Data Catalog.
You can bring any query engine, running in any cloud, from any vendor, and you can query your data in R2 without paying a ransom to get the data out.
And that just gives you incredible flexibility to use the query engine that makes sense for your use case.
If you have some team that uses Snowflake, some team that uses Databricks, or you just want to run a DuckDB query on a server somewhere, you can do that.
You don't have to be locked into a single vendor or a single cloud.
And I think that that is amazingly powerful.
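As a concrete flavor of that flexibility, here's a sketch of what the DuckDB path might look like; the bucket path, table location, and credentials are placeholders, and it assumes DuckDB's iceberg and httpfs extensions:

```sql
-- A sketch of querying Iceberg data in R2 from DuckDB. The bucket path,
-- table location, and credentials below are placeholders.
INSTALL iceberg; LOAD iceberg;
INSTALL httpfs;  LOAD httpfs;

-- DuckDB has a built-in R2 secret type for S3-compatible access to R2.
CREATE SECRET r2 (
    TYPE R2,
    KEY_ID 'my-access-key-id',
    SECRET 'my-secret-access-key',
    ACCOUNT_ID 'my-account-id'
);

-- Point iceberg_scan at the table's location (or a specific metadata file).
SELECT count(*) FROM iceberg_scan('r2://my-bucket/warehouse/my_table');
```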
Yeah, and this is a really good point for the tail end of this segment, but it's actually a really important one: especially with the rise of these open table formats like Apache Iceberg, users are now running what I've heard called headless data infrastructure (I'm not sure if that's the industry term for it), where they're separating the storage of the data from the compute vendors. And by doing that, I've been seeing a large increase in use cases where the storage might be in one service provider but the compute might be in another. So these cross-region, cross-cloud, cross-compute-engine use cases are coming up more now than ever. I think this is really important and something that no one should take for granted.
Cool. So I think we ran through the announcements here. There's a lot more to talk about and dive into in detail; we'll have more blogs, examples, and tutorials, as you mentioned. If anyone's interested in chatting with us, we're in the Discord, so head into the Cloudflare Developers Discord and give us a ping, or chat in any of the R2 Data Catalog, R2 SQL, or Pipelines channels; we're all in there.
And yeah, I'm looking forward to hearing what you think. Yeah, and we're also hiring.
We have open reqs across all the teams working on the data platform. If working on the largest data systems in the world sounds exciting, we'd love to chat.
So please reach out.