R2 SQL: a deep dive into our new distributed query engine
Presented by: Nikita Lapkov, Jérôme Schneider
Originally aired on September 25 @ 12:00 PM - 12:30 PM EDT
Welcome to Cloudflare Birthday Week 2025!
This week marks Cloudflare’s 15th birthday, and each day this week we will announce new things that further our mission: To help build a better Internet.
Tune in all week for more news, announcements, and thought-provoking discussions!
Visit the Birthday Week Hub for every announcement and CFTV episode — check back all week for more!
Transcript
Hi, everyone. My name is Nikita Lapkov. I'm a senior engineer on the R2 SQL team, and I'm here with my colleague, Jerome.
Yeah.
Hi, I'm Jerome. I'm a colleague of Nikita, principal engineer at Cloudflare, working on the same team.
We're here to talk about R2 SQL.
And...
Before we dive into the nitty-gritty details of it, I think we should just explain what R2 SQL is first.
And for that, we need to understand what is R2 Data Catalog.
So, earlier this year, we launched R2 Data Catalog, which is a way for you to organize data on R2.
So you can already upload all kinds of files, table files included, onto R2, but they're just that: files scattered around the R2 bucket.
It's more natural for us as, I guess, human beings to think about them as tables and to organize them into tables.
And that's where the Apache Iceberg standard comes in.
And it's a way to organize a bunch of files on any object storage into a neat table, which you can query with SQL.
And that's what R2 Data Catalog does.
It organizes all these files into tables and presents them in a machine-readable format to any query engine.
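To make that concrete, here is a rough Rust sketch of the kind of machine-readable metadata an Iceberg catalog exposes for a table: a list of data files plus per-column statistics. The type and field names are illustrative only, not the actual Iceberg or R2 SQL types.

```rust
use std::collections::HashMap;

// Illustrative shape of the per-table metadata an Iceberg catalog exposes.
// Field names are made up for this sketch; they are not the real Iceberg types.
struct ColumnStats {
    min: i64, // lower bound of this column's values within the file
    max: i64, // upper bound of this column's values within the file
}

struct DataFile {
    path: String,                        // object key inside the R2 bucket
    row_count: u64,                      // rows stored in this file
    stats: HashMap<String, ColumnStats>, // column name -> bounds
}

struct Table {
    name: String,
    partition_by: Vec<String>, // e.g. ["day"]
    files: Vec<DataFile>,      // data files making up the current snapshot
}

fn main() {
    let table = Table {
        name: "logs".into(),
        partition_by: vec!["day".into()],
        files: vec![DataFile {
            path: "data/00001.parquet".into(),
            row_count: 10_000,
            stats: HashMap::from([("customer_id".to_string(), ColumnStats { min: 1, max: 9 })]),
        }],
    };
    println!("{} exposes {} data file(s) to the query engine", table.name, table.files.len());
}
```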
Before R2 SQL, you needed to spin up your own infrastructure to query data on R2, and R2 SQL solves exactly this problem. It's a serverless query engine that you don't pay for. Well, it isn't billed right now, so you don't pay for it at all, but in any case you won't pay for it if it's not used, and you don't need to maintain any infrastructure on your side. It all runs on the Cloudflare network, and it allows you to send SQL queries to R2 Data Catalog. So if you have any tables in your warehouses, in your R2 buckets, you can query them with SQL using R2 SQL. So yeah, that's kind of the whole thing, the simple description of a not-so-simple problem, and we do hope it simplifies the lives of people who use R2 Data Catalog and provides some insights to analysts and developers alike.
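As a hypothetical illustration of what that looks like from the user's side, this is the general shape of a query you could point at a catalog table. The table and column names here are made up, and how you actually submit it (via Wrangler or the API) is covered in the docs rather than in this sketch.

```rust
fn main() {
    // Made-up table and column names; only the shape of the query matters here.
    let query = "SELECT timestamp, customer_id, status \
                 FROM http_logs \
                 WHERE customer_id = 5 \
                 ORDER BY timestamp DESC \
                 LIMIT 10";
    println!("{query}");
}
```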
And, I mean, we're both engineers who wrote the thing, and we are, of course, excited to talk about the technical stuff and what it took to actually build R2 SQL.
And that's where I want to pass to Jerome to talk about some of the technical details.
Thanks, Nikita.
Yeah, so for sure, writing to R2 Data Catalog, you end up with, as you can imagine, a bunch of files in your R2 bucket.
This is one of the biggest challenges for a query engine like R2 SQL.
How do we maintain good performance while reading through countless files, making sure that we find every single result, every single row that the SQL query expects us to return, not missing anything, in the shortest time possible?
So that's really one of the challenges we have to solve.
The second one is that the amount of data we have to crunch through can be massive.
How do we make sure we can do it with latency that is acceptable to the user?
That's the compute problem.
We are going to address both, with Nikita and me presenting what we did.
I'm going to talk about the I/O problem first. So yeah, as I said, reading and scanning through countless files, that's a huge challenge.
And not only because the data can be massive in itself, but also because you can have a massive number of files.
And that's because these warehouses, as we call them, these R2 Data Catalogs really, can be used to aggregate many, many streams in a distributed fashion at Cloudflare, which can end up with a myriad of files being written concurrently to your bucket.
And as we all know, reading many, many different things, it takes more time than reading a single thing.
So that's one of the challenges.
And to be efficient, we need to find a way to not have to read everything.
This is the core optimization that happens when we are trying to process a SQL query.
The first question we ask ourselves is what can I do in order to not have to read every single byte of the warehouse?
We call that pruning the search space, and it relies on techniques that are based on statistics about your warehouse and the data inside it.
Those statistics are handed out by the R2 Data Catalog that Nikita described briefly. Basically, for every file that's written in your warehouse, R2 Data Catalog, through the Iceberg format, tells R2 SQL: this file contains data that ranges from this to that. So for every column of your tables, we get the minimum value and the maximum value. And let's say you are searching for data that matches customer ID 5.
We will be able to ask the metadata that's inside R2 Data Catalog:
Show me only the files that contain, or have a chance of containing, customer ID 5.
So we can eliminate all those files for which, for instance, the minimum value of the customer ID column is above five, because we know for sure it's never going to be in there.
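Here is a minimal Rust sketch of that min/max pruning check for a predicate like customer_id = 5. The struct and field names are illustrative, not R2 SQL's actual code.

```rust
// Minimal sketch of min/max file pruning for a predicate like `customer_id = 5`.
// The stats shape and names are illustrative, not the real R2 SQL code.

#[derive(Debug)]
struct FileStats {
    path: &'static str,
    customer_id_min: i64,
    customer_id_max: i64,
}

/// A file can only contain `customer_id = target` if the target falls inside
/// the file's [min, max] range for that column.
fn may_contain(file: &FileStats, target: i64) -> bool {
    file.customer_id_min <= target && target <= file.customer_id_max
}

fn main() {
    let files = vec![
        FileStats { path: "data/a.parquet", customer_id_min: 1, customer_id_max: 4 },
        FileStats { path: "data/b.parquet", customer_id_min: 3, customer_id_max: 9 },
        FileStats { path: "data/c.parquet", customer_id_min: 6, customer_id_max: 20 },
    ];

    // Only files whose range could hold customer_id = 5 need to be scanned at all.
    let candidates: Vec<&FileStats> = files.iter().filter(|f| may_contain(f, 5)).collect();
    println!("{candidates:?}"); // only data/b.parquet survives pruning
}
```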
So this is one of the core optimizations that we do, ensuring we don't read what we don't need to read, and we combine several layers of pruning.
We have what we call partition pruning as well.
For instance, when your warehouse is partitioned by day, like you have a timestamp column in there and you ask R2 Data Catalog to partition by day, and you have a predicate in your SQL query saying I want data between today and yesterday, we can eliminate all the days that came before and not even consider them in the query planning, just by looking at the stats that R2 Data Catalog gives us about your data.
There is no point reading those files.
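And a similarly hedged sketch of the partition-pruning layer, assuming a table partitioned by day; the partition representation here is simplified to a plain day number.

```rust
use std::ops::RangeInclusive;

// Hypothetical sketch of partition pruning for a table partitioned by day:
// partitions whose day lies outside the query's time range are dropped before
// any of their files are even considered. Names are illustrative only.
struct Partition {
    day: u32,           // partition key, e.g. days since an epoch
    files: Vec<String>, // data files belonging to this day
}

/// Keep only the partitions whose day falls inside the WHERE clause's time range.
fn prune_partitions(parts: &[Partition], wanted: RangeInclusive<u32>) -> Vec<&Partition> {
    parts.iter().filter(|p| wanted.contains(&p.day)).collect()
}

fn main() {
    let parts = vec![
        Partition { day: 100, files: vec!["old.parquet".into()] },
        Partition { day: 199, files: vec!["yesterday.parquet".into()] },
        Partition { day: 200, files: vec!["today.parquet".into()] },
    ];
    // "Between yesterday and today": only days 199..=200 survive.
    let kept = prune_partitions(&parts, 199..=200);
    println!("{} partition(s) left to plan", kept.len()); // 2
}
```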
So the most efficient way to process data is to not read it in the first place.
That's essentially what we are trying to do: make sure we read exactly what we need and not more.
And yeah, so I mentioned briefly the compute problem as well.
So this was the I/O problem. There is a compute problem as well.
So I'm going to hand it over to Nikita now to explain in a bit more detail what it is.
Yeah, thank you, Jerome.
IO is definitely a big problem.
You know, making sure that we do not read extra stuff is important.
But once we kind of know which files contain stuff, there can still be a lot of them.
Processing this on a single machine will take a lot of time.
So that's why at R2 SQL we leverage what we call a distributed query engine.
So what we implemented is a way to distribute the work that the planner has written out for us across various machines.
So we have many machines in the Cloudflare network.
We can leverage them for compute resources. And on each of them, we basically use Apache DataFusion, which is an open-source analytical query engine used by many other companies.
Well, each of these machines, which we call a query worker, receives a bit of data: the small piece that the query planner has deemed as potentially containing rows, maybe not, but each query worker receives a small piece of work. And then, once they finish with it, they stream the results to a so-called coordinator, which aggregates all the results from the different machines and returns them to the user. So this allows us to horizontally scale the system as much as we need, but it creates additional complexity and a lot of fun problems with actually scheduling and splitting the work of a SQL query onto multiple machines, which is relatively straightforward for simple filter queries and much more complicated for other kinds of queries.
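As a toy illustration of that fan-out/fan-in shape, here is a Rust sketch where the "workers" are just threads streaming partial results to a "coordinator" over a channel. In the real system these are DataFusion-based workers on separate machines, so treat this purely as a sketch of the data flow, with all names made up.

```rust
use std::sync::mpsc;
use std::thread;

// Toy sketch of the worker/coordinator split: each worker processes one unit
// of work (a subset of candidate files) and streams its partial results back
// to a coordinator over a channel.
fn main() {
    // Units of work the planner carved out: each is a batch of candidate files.
    let units: Vec<Vec<&'static str>> = vec![
        vec!["data/a.parquet", "data/b.parquet"],
        vec!["data/c.parquet"],
    ];

    let (tx, rx) = mpsc::channel::<String>();

    // Fan out: one "query worker" per unit of work.
    let mut handles = Vec::new();
    for unit in units {
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            for file in unit {
                // Stand-in for "scan the file and apply the filter".
                tx.send(format!("rows from {file}")).unwrap();
            }
        }));
    }
    // Drop the original sender so the receive loop ends once all workers finish.
    drop(tx);

    // The "coordinator" aggregates streamed results and returns them to the user.
    for partial in rx {
        println!("coordinator received: {partial}");
    }
    for h in handles {
        h.join().unwrap();
    }
}
```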
So yeah, this is how we solve the compute part of the problem.
Use as many machines as possible. But it's important to note that we don't only do it because we get many more CPUs that way; we also get much, much higher network bandwidth, because the network bandwidth of a single machine can be quite limited, especially when we're talking about thousands of files.
And this is where distributing the work across multiple workers helps quite a bit because we have all this additional bandwidth from other machines.
And still, you know, thinking about it, the planner can, as I said before, still find a lot of files for us to work on.
And no matter how many machines we throw at it, it can be kind of inefficient.
It would be very inefficient to just process all the data we got if we only need, like, the first 10 rows.
Because, I mean, if the user issues a query, it is very unlikely that they will scan, like, 10,000 rows with their eyes.
They're very likely to put a limit on there, and it would be very wasteful to read all the data in that case. So we have an optimization called the streaming planning pipeline. Jerome, can you talk more about it?
Sure, yeah. It's very interesting stuff. So the I/O problem that we described and talked about how we solved, and the compute problem, those are not separate things that happen sequentially. Basically, the pruning-the-search-space thing and the distributed-compute thing both work concurrently.
And what we are doing is that as soon as we determine that something needs to be processed first, or I mean in the order of processing, we dispatch that to the distributed compute layer.
So we start processing data as soon as possible.
And...
And that's why we call it a streaming planning pipeline, really: we start work even though we don't know when it's going to end, but we know that we already have something to be processed, and why wait?
We are going to process it immediately. And so results are coming back, and the planner emits units of work in the order that's expected by the query.
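A small sketch of that overlap: a "planner" thread hands units of work to the execution side the moment each one is ready, rather than after planning finishes. Again, names and timings are made up, not R2 SQL's actual code.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Toy sketch of the streaming planning pipeline: the planner discovers and
// orders units of work incrementally and hands each one off as soon as it is
// ready, while execution consumes them concurrently.
fn main() {
    let (tx, rx) = mpsc::channel::<&'static str>();

    // "Planner": keeps pruning and ordering files, emitting work as it goes.
    let planner = thread::spawn(move || {
        for unit in ["unit-1", "unit-2", "unit-3"] {
            thread::sleep(Duration::from_millis(50)); // pretend planning takes time
            tx.send(unit).unwrap(); // dispatch as soon as this unit is ready
        }
        // Planner finishes; dropping the sender closes the channel.
    });

    // "Execution": starts as soon as the first unit arrives, not after planning ends.
    for unit in rx {
        println!("executing {unit} while the planner keeps working");
    }
    planner.join().unwrap();
}
```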
This is the optimization that Nikita hinted at.
And we do that by ordering all the files that we found could be matching, could have the data.
We are not going to dispatch them for execution in an arbitrary order.
What we are doing is we are trying to find the ones that are most likely to contain the rows that will end up in the top 10, or whatever the user is asking for, like select from whatever table, limit 10, right?
We take the order by, we order all the files in the same order using those stats I talked about before.
And then we are going to dispatch in that same order with the idea that if we have to find the top 10, we don't need to read 12 million rows.
Maybe we are fine by reading just 10 of them, right?
So that's the core idea.
But how do we know that the 10 we have are actually the 10 best of all the data that's in the R2 bucket?
So that's where the stats save us again, because we know up to which point we've read the data.
We read it in order of the query, and we know up to which point we read the data, and we know the minimum and the maximum value for the order by columns of what we've already read.
And when we can prove that what's left to be read and processed cannot possibly make it into the top 10 that we already have, just by looking at the stats of the files that remain to be read, we know for certain that all those files have, at best, a value that is worse than the worst performer in my top 10, so there is no point reading them. And this is how we avoid scanning everything for queries that have some limit and some order by applied to them.
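Here is a compact sketch of that stopping rule for a query like ORDER BY ts DESC LIMIT k: files are visited in descending order of their maximum timestamp, and scanning stops once no remaining file can possibly beat the worst row already in the top k. The scan function is a stand-in and all names are illustrative.

```rust
// Sketch of the early-stop rule for `ORDER BY ts DESC LIMIT k`.
// Each file carries min/max stats for the ORDER BY column; files are visited
// in descending order of ts_max, and we stop once no remaining file can beat
// the worst row currently in the top k. Purely illustrative, not R2 SQL code.

struct FileStats {
    ts_min: i64,
    ts_max: i64,
}

// Stand-in for actually scanning a file: here we just pretend every row's ts
// equals one of the file's bounds. A real scan would return the matching rows.
fn scan(file: FileStats) -> Vec<i64> {
    vec![file.ts_max, file.ts_min]
}

fn top_k_desc(mut files: Vec<FileStats>, k: usize) -> Vec<i64> {
    // Visit the most promising files first for a descending ORDER BY.
    files.sort_by_key(|f| std::cmp::Reverse(f.ts_max));

    let mut top: Vec<i64> = Vec::new();
    for file in files {
        // If the top k is full and even this file's best possible value cannot
        // beat our current worst entry, nothing that follows can either: stop.
        if top.len() >= k && file.ts_max <= *top.last().unwrap() {
            break;
        }
        top.extend(scan(file));
        top.sort_by_key(|ts| std::cmp::Reverse(*ts)); // keep best-first
        top.truncate(k);
    }
    top
}

fn main() {
    let files = vec![
        FileStats { ts_min: 100, ts_max: 200 },
        FileStats { ts_min: 150, ts_max: 180 },
        FileStats { ts_min: 10, ts_max: 90 }, // never read: 90 is worse than the current top 2
    ];
    println!("{:?}", top_k_desc(files, 2)); // [200, 180]
}
```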
So it's pretty much a universal thing.
The only exception is aggregates, for now.
R2 SQL, I think we've not mentioned it, in its current version doesn't support aggregates.
This is something we're going to work on soon, but Nikita is going to talk about it next, I think.
So I hand it over to you, Nikita. Yeah, thank you, Jerome.
And indeed, we are planning to support aggregations and complex aggregations in the future.
As I said, it's a complicated but solvable task to compute aggregations in a distributed manner, and we are working on this as we speak.
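To give a flavour of why distributed aggregation is tractable, here is a hypothetical sketch of the usual partial-aggregation pattern, where each worker computes per-group counts and sums over its slice of the data and a coordinator merges the partials. This is not R2 SQL's actual design, which is still being built; all names are made up.

```rust
use std::collections::HashMap;

// Hypothetical sketch of distributed aggregation via partial aggregates:
// each worker aggregates its own slice, the coordinator merges the results.
type Partial = HashMap<String, (u64, i64)>; // group -> (count, sum)

fn worker_aggregate(rows: &[(&str, i64)]) -> Partial {
    let mut out = Partial::new();
    for &(group, value) in rows {
        let entry = out.entry(group.to_string()).or_insert((0, 0));
        entry.0 += 1;     // count
        entry.1 += value; // sum
    }
    out
}

fn coordinator_merge(partials: Vec<Partial>) -> Partial {
    let mut merged = Partial::new();
    for partial in partials {
        for (group, (count, sum)) in partial {
            let entry = merged.entry(group).or_insert((0, 0));
            entry.0 += count;
            entry.1 += sum;
        }
    }
    merged
}

fn main() {
    // Two workers each aggregate their own files, then the coordinator merges.
    let p1 = worker_aggregate(&[("eu", 10), ("us", 3)]);
    let p2 = worker_aggregate(&[("eu", 7)]);
    println!("{:?}", coordinator_merge(vec![p1, p2])); // eu -> (2, 17), us -> (1, 3)
}
```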
But not only that: we also want to provide more visibility to our users in terms of query execution and how best to write queries for R2 SQL.
We already have some information about this in the documentation, but we want to give you more tools for using R2 SQL efficiently.
And we also have a bunch of other stuff, which is kind of more long-term, but no less exciting.
So do you want to talk about it, Jerome?
So what comes to mind is a better UI, something in a dashboard that will allow you to...
interactively work on your tables.
Right now, R2 SQL is a Wrangler and API product, so it inevitably happens in some terminal, which gives kind of a hackish feel to the product.
But the long-term vision is obviously to have something visual in the dashboard, the Cloudflare dashboard, so that you can work with your tables interactively. So that's mostly what we want to see in the dashboard.
There are also probably going to be improvements around the kind of metadata that we can use in order to optimize the execution of queries.
We talked about statistics.
Statistics are a very powerful, very efficient way to prune the search space.
But that only goes so far. For some other types of queries, the slightly more analytical ones, or queries on data that is a bit less ordered on storage and more homogeneously spread across all your files, the min and max value of a column doesn't give you that much information, because every min is basically the absolute min that you have in your database and every max is close to the absolute max you have in the database.
You cannot prune much with that.
There are other types of metadata we could use.
Bloom filters come to mind, as well as other novel ideas for indexes that we have.
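As a hypothetical illustration of why something like a Bloom filter could help where min/max stats do not: for a point lookup, a per-file filter can answer "definitely not in this file" even when the file's min/max range spans the whole key space. This toy filter is purely a sketch of the idea, not anything R2 SQL ships today.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy per-file Bloom filter: "false" means the value is definitely absent,
// so the file can be pruned; "true" only means it is possibly present.
struct BloomFilter {
    bits: Vec<bool>,
    hashes: u64,
}

impl BloomFilter {
    fn new(size: usize, hashes: u64) -> Self {
        Self { bits: vec![false; size], hashes }
    }

    fn index(&self, value: &str, seed: u64) -> usize {
        let mut h = DefaultHasher::new();
        seed.hash(&mut h);
        value.hash(&mut h);
        (h.finish() as usize) % self.bits.len()
    }

    fn insert(&mut self, value: &str) {
        for seed in 0..self.hashes {
            let i = self.index(value, seed);
            self.bits[i] = true;
        }
    }

    fn might_contain(&self, value: &str) -> bool {
        (0..self.hashes).all(|seed| self.bits[self.index(value, seed)])
    }
}

fn main() {
    let mut per_file_filter = BloomFilter::new(1024, 3);
    per_file_filter.insert("customer-42");
    println!("{}", per_file_filter.might_contain("customer-42")); // true
    println!("{}", per_file_filter.might_contain("customer-5"));  // almost certainly false: prune the file
}
```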
Stay tuned.
Stuff is likely going to happen in the next month for us on that front.
And obviously...
We talked about Apache Iceberg.
It's a vast topic. There is a lot of configuration, a lot of settings, and a lot of nuances.
Right now, we are doing the sensible thing, which is reading the latest version of the table and expecting sane defaults on the table.
Reality is more complex than that, obviously.
And we are going to improve on that front as well in order to broaden the scope.
of the warehouses and the queries that we can support with R2 SQL.
So yeah, that pretty much wraps it.
I had a blast working with you, Nikita, on the project.
And I hope it's going to last for many, many more months and quarters, because yeah, the roadmap doesn't seem to thin out.
That's for sure, yeah.
And I enjoyed working on this as well, Jerome.
Like, it was just very interesting.
All these whiteboarding sessions were really, really nice as well.
But yeah, we're not stopping.
We're only continuing to work on this, and if you folks are interested in the more nitty-gritty details, feel free to read the blog post which accompanies this.
So it's R2 SQL: how we built the distributed query engine.
And yeah, watch this space. We are excited to work on more features and excited to hear about your use cases.