Cloudflare Analytics Demo and Q&A
Demo the new Cache Analytics experience (launching early June), talk about plans for Cloudflare Analytics, and general Q&A.
All right.
Hi, everyone. My name is John Levine. And I'm Rustam. Awesome. We're product managers at Cloudflare.
And you might recognize us from yesterday. We're here to continue our segment of how we built this.
And today, we're going to be talking about analytics, which is my area here at Cloudflare.
And I know it says on the title, we're going to do a demo and a Q&A.
I definitely want to do the Q&A. We're definitely going to do a demo.
But really, what we want to do is talk about how did we get here?
What got built? Why did it get built? Who built it? What choices did we make?
What would we want to do better? What's coming next? We want to dig into all that about what tradeoffs did we make?
What constraints were there? Talk about all the kind of hard stories that don't always make it out into the blog post or into the press release.
Rustam, why don't you tell us more about the series we're trying to do?
Yeah. Sure. So, how we built this, we did our first episode yesterday.
And this is product managers and engineering managers and other folks at the company who have, you know, really built this.
And so, yesterday, we did our intro episode, which was meta; it was about how we decided to do the show, and it gave a high-level intro to how we build products at Cloudflare.
And then, John graciously joined me as a guest yesterday, and then we're swapping today.
And he's going to walk through analytics at Cloudflare and how we've thought about sort of evolving it over time.
And I am here as community co-lead. Cool. So, we're going to talk about analytics.
Like we said, I want to just talk about why this is hard. So, hopefully, everyone can see my screen now.
I'm presenting the basic zone analytics view; if you're a Cloudflare customer, you've probably seen this.
I first want to just talk about, like, you know, what makes analytics hard?
Because I think on its surface, this can seem pretty easy.
And I think the simplest way to explain why it's hard is just the scale that we run at.
And it's kind of a cliche to talk about our scale or for companies to talk about why scaling is hard.
But I just want to give a couple of numbers.
At peak, we handle more than 17 million requests per second.
Every second, 17 million requests. And we have over 27 million customers.
Which is pretty crazy when you think about it. And what's also interesting about those customers is some of them are actually serving about a million requests per second.
And some of them are serving basically zero requests per second, maybe a handful of requests per day.
And what's hard is not just the scale that we operate at, but the fact that our analytics have to work for customers across those different scales.
So, the analogy I sometimes use is, like, if I were trying to count, like, I don't know, the number of guitars on my wall or something, there would be one technique, like, one, two, three.
But if I was trying to count the number of guitars that, like, had ever been made and existed in the world, I would probably need, like, a qualitatively different technique to know how to count that.
And so, even with something as simple as counting, how you do it really changes with the scale that you work at.
So, yeah: if we want to count whether a customer has seven or 15 requests, or whether they have 7 million or 15 million, those are really different problems.
Cool. So, one thing I want to do in this episode is give a little bit of a history lesson of kind of how analytics have evolved.
So, I mentioned counting. And right now, we're kind of seeing our zone analytics today, which we've had for a long time.
It's been fairly stable for a while.
And we're going to talk about new products that we build and how this is going to change over the course of the episode.
But let's come back.
Let's talk about this idea of counting things. So, zone analytics, primarily, you know, we count stuff.
We count the number of requests you have. We count the number of cached requests and uncached requests.
We can go through everything on this page.
We can make counts by country. But I think the important thing that people may not realize about this is that we have to decide ahead of time what all the things are that we're counting.
And this UI is pretty fixed for all of our customers.
There's a couple other things on here. So, by the way, I'm presenting the burrito bot, which is one of our staging zones.
So, everyone's welcome to check it out.
You can order virtual burritos from there. It gets a modest amount of traffic.
We can count bytes instead of requests. We can count, you know, how many status codes we got.
Oh, I got an error. But most of the time, we can count how many status codes we got.
But we have, we think of it like fixed buckets, right?
We have like, okay, I'm going to count this many 200 requests, this many 404 requests, this many requests from the UK, this many requests from India.
But... I guess the way to think about this is that product managers, or the powers that be, have chosen ahead of time which dimensions we count things along.
In this context, right? So, we've decided, you know, prior to any request flowing through the system that number of requests matters, number of cache requests matters, number of uncached requests matters.
But then if you wanted to know how many uncached requests there were in Azerbaijan, if we hadn't made a decision that we need to count along those two dimensions, then that question becomes hard to answer.
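To make that concrete, here's a toy sketch of the fixed-bucket model being described; the dimension names are purely illustrative, not Cloudflare's actual schema:

```python
from collections import Counter

# Dimensions chosen ahead of time: each request increments a few
# independent, pre-aggregated counters.
requests_by_country = Counter()
requests_by_cache_status = Counter()

def record(country: str, cached: bool) -> None:
    requests_by_country[country] += 1
    requests_by_cache_status["cached" if cached else "uncached"] += 1

record("AZ", cached=False)
record("AZ", cached=True)
record("GB", cached=False)

# The questions chosen in advance are cheap to answer:
print(requests_by_country["AZ"])             # 2
print(requests_by_cache_status["uncached"])  # 2

# But "how many uncached requests came from Azerbaijan?" cannot be
# answered: the (country, cache_status) cross-product was never counted.
```

The combinatorics are the problem: counting every pair of dimensions ahead of time blows up quickly, which is exactly why the cross-dimension questions become hard.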
Is that the way to summarize? That's exactly right. So, going back to the process we talked about a little bit yesterday, you know, we were talking about how we approach talking to customers.
And one thing that comes up again and again is exactly that.
Well, I don't want to just see the number of cached and uncached requests.
I want to see the number of cached requests in Azerbaijan.
I want to know what is my cache hit rate for 404s versus 200s. I want to know my cache hit rate for Android versus iPhone users.
People have other questions.
Actually, the most common we get is just by hostname, which is interesting.
We can talk about why that's hard. But, unfortunately, in this view it can be hard to tell you how many requests salsaverde.burritobot.com had versus www.burritobot.com.
Wait, so can we dig in on why counting things is hard? Of all the things that computers are good or bad at, that seems like an obvious one.
Yeah, counting seems really obvious.
Well, there's a couple of things that are hard about counting.
One thing about counting, and this is going to sound really reductive, but counting gets harder when you have more things to count.
For computer scientists out there, you can't always...
Not everything just scales up linearly, right?
As many people know, we have over 200 data centers around the world that are all handling traffic.
But, at some point, you have to load this dashboard at one time, in one place, and get a number that reflects what all those different places counted.
Each of those data centers has to collect counts and then that has to be centralized in one place.
Now, again, coming back to this issue of whether the counts are fixed, that's okay.
If we've decided on a fixed number of things to count, that isn't so bad to do, but the problem is now we want to make the counts more flexible.
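A toy sketch of why the fixed-count case isn't so bad: per-datacenter roll-ups of pre-chosen buckets can be merged centrally with plain addition (the datacenter names and bucket keys here are illustrative):

```python
from collections import Counter

# Each data center keeps its own local roll-up of the fixed buckets...
lhr = Counter({"requests": 1200, "cached": 900})
sin = Counter({"requests": 800, "cached": 500})
sfo = Counter({"requests": 2000, "cached": 1600})

# ...and merging fixed counters centrally is just addition, which is why
# the pre-aggregated model is comparatively easy to operate across many
# locations.
total = lhr + sin + sfo
print(total["requests"])  # 4000
print(total["cached"])    # 3000
```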
So, yeah, does that kind of get at maybe what's hard about scaling up counting?
So, I think that after talking to a lot of customers, what we realized is that, like we were saying, people need more flexibility.
We realized we were never going to have enough buckets.
We were never just going to predict all the things we wanted to ask and we needed a different, just a different way of doing things.
And so, we needed to go towards... The way I think about it is we needed to expose like a SQL model.
So, if anyone's ever written a SQL query, it's pretty simple.
There's a SELECT statement, where you name the metrics you want from the table.
A WHERE clause that filters to the things you want to see or don't want to see, and then a GROUP BY.
So, you can say, I want to count things by country or by hostname.
And the idea is that when you think about SQL, you think about what the rows are and what the columns are.
And so, going back to SQL here, in our old roll-up model a row is basically like one customer, at one time, for one hour.
But really what you want is a model that says that one row is one event, one request that happened in the world.
And this model is really powerful because now I can run that, conceptually run that SQL query that says, all right, count the number of requests in a given place.
And that's a really powerful model that we can expose to people. And so, yeah.
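A small sketch of that event-based model, with a filter-and-group-by written out in Python to mirror the conceptual SQL (the field names and hostnames are illustrative):

```python
from collections import Counter

# One row per event: any dimension can be filtered or grouped after the fact.
events = [
    {"country": "AZ", "cached": False, "host": "www.burritobot.com"},
    {"country": "AZ", "cached": True,  "host": "www.burritobot.com"},
    {"country": "GB", "cached": False, "host": "salsaverde.burritobot.com"},
]

# Conceptual equivalent of:
#   SELECT country, count(*) FROM events
#   WHERE cached = false GROUP BY country
uncached_by_country = Counter(
    e["country"] for e in events if not e["cached"]
)
print(uncached_by_country)  # Counter({'AZ': 1, 'GB': 1})
```

Because the raw events are kept, the cross-dimension questions that stumped the fixed buckets ("uncached requests in Azerbaijan") fall out of the same query shape for free.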
Let's go back for a second. So, we talked about why counting is hard, and we talked about some different conceptual models to make counting things at scale easier or more feasible.
What problems are customers actually trying to solve with this data, right?
Yeah. So, the most common one is to see things by hostname, I think, is what I see.
And then, beyond that one dimension, from an operational standpoint, what is the business problem our customers are trying to solve?
Yeah, there's a few that are really important. So, we're going to talk about caching more in this segment because we have something really cool coming up with caching analytics.
But right now we show cached and uncached requests.
So, caching, why is caching important? So, caching really matters because caching is how you make your site faster, right?
Because you get content physically close to people, and because it doesn't have to be re-computed on the server.
And caching reduces costs, because it's just cheaper for Cloudflare to serve content from our edge than from an origin server.
And we know that costs are, you know, like origin data transfer is one of the biggest costs that people face when running a website.
So, optimizing caching is really important and it's actually kind of hard to like configure caching.
It may seem like, oh, yeah, well, we'll just cache stuff and Cloudflare will handle it.
But it turns out you actually have to make all these decisions because the example I use with caching is, you know, you sign into your bank account, right?
And like, maybe the bank account is like homepage and some of the assets, maybe some of the images in there can be cached.
But hopefully, they're not caching your bank balance in a public cache like Cloudflare, right?
Because even if it's like bankofamerica.com/account, that URL may be the same for everyone, but there's probably very different content displayed on that page.
So, we have to be really careful that we're not accidentally caching stuff that's not meant for, you know, just one person.
So, that's an example where, in order to really understand what's happening, the business owner or the operations person needs to see what the cache hit rate is, you know, by URL or by hostname.
That's one category. Security is another really important example.
We're going to do this kind of historically.
So, when we looked at, you know, kind of revamping our analytics experience, the place we actually started was on the security side and on the firewall.
And the reason we did that is that we have a pretty good security platform.
There's a lot that people can do to actually configure their firewall rules and understand where all that's running.
And we wanted to make sure that people could actually see what the impact of those rules was, including the rules that Cloudflare sets up.
And in order to do that, you often have to know, specific URLs, maybe specific IP addresses, where things are actually running and ask these really granular questions.
So, let's talk about that. We're going to do a little bit of a historical tour.
Let's go to the firewall tab. This is kind of where it started for us in terms of rethinking analytics.
Oh, no. There's an error.
Oh, man. Let's see if refreshing helps. Now, we may have to just use our imaginations.
Ah, okay. So, I'm signed out here. Luckily, my password is all dots.
That is my password. It's five dots. Like the bash.org joke.
Yeah, exactly. Yeah. Hunter2. Hunter2. Yeah. Cool. So, this is the firewall events view.
So, this is, so, in addition to just looking different, this is a really different experience from what we saw before.
And the reason is that this is interactive.
So, I can actually hover over things here. But actually, let's talk about what I'm looking at here first.
So, we're looking at firewall events, which we can talk about why that's different.
And these are all the different actions that can be taken by the firewall.
And if I want, I can just, you know, filter down to one of these.
Now, the graph actually changes.
Like, ah, except for when there's an error. It's okay. This is our staging site.
So, it's fine. It's fine that this has errors. The graph actually changes in response to me clicking on things.
And this is a really different model from analytics because it's dynamic because it's responding to what you do.
So, I can filter to a specific action.
Imagine I did that and then I can actually see things like top IP addresses, top paths, countries, user agents.
And I could, like, let me try filtering to one of these.
Let me filter to just like my home page.
Maybe I want to understand only what's happening on my home page.
If I filter... ah! Too bad. Live demos are tough. The demo gods are just not with us today.
But you can imagine they can filter to a single path and see how the page updates in response to that.
Is there something we can show?
I'm just trying to think how we can how we can do a good demo of this.
Oh, man. Maybe it's because I just keep getting signed out. Hmm.
This is our staging site, where someone may be in the middle of pushing an update.
Let me try it one more time to show it because it is very exciting when that works and when all the graphs change.
Let's try filtering to just the homepage here.
So I'm going to hit that filter button. Yes, and it happens pretty fast.
Everything updates. Now we're just looking at the homepage, and we can see everything we saw before, but focused down to that view, which is kind of an amazing ability.
So this is really cool. There's a ton of stuff we're doing here that I think is really exciting.
One is the ability to just create a rule right from this page.
This one's pretty simple. So let's say you're slicing and dicing.
You know, a common one is maybe you're seeing one IP one user agent that's really suspicious.
So here we're getting a lot of requests from Pingdom.
In this case, I know what Pingdom is but let's pretend you know, Pingdom is suspicious to us.
We can actually filter just to their user agent, and now I can actually create a firewall rule, which is really cool, right?
So, from this view, I can actually go from seeing what's going on to taking action on that traffic: I can create that rule, and then a minute or two later actually see the lines in the graph change based on it.
So that's something that's really cool that's possible here. Russ, any other use cases that you've seen customers use firewall analytics for?
No, I mean, I think the example you showed is... contrived is the wrong word, but really simple, right?
I think one of the coolest things about watching customers use this interface is just how expressive you can be, right?
And, like, the number of dimensions you can drill down on... infinite is too strong a word, but you can do a lot, right?
And I think it's just a fundamental change in a lot of things about analytics. You know, compared with other analytics tools out there, the number of degrees of freedom you have in exploring the data through a tool like this is pretty radical, right?
Totally. One thing that's cool, too, is I think we get a lot of requests from customers to add new stuff.
It becomes much easier to add to this model. So, to give an example in Firewall, someone wants to add... well, now I'm going to be stumped coming up with one. Do we have ASN in here?
Oh, we already have ASN. Maybe not the best example, but I'm not sure we had ASN when we launched.
If someone had asked for that, it's relatively easy to add eventually because it's basically adding a column.
And then we have patterns that can adapt to that and grow. We can actually add a lot more of these dimensions in here, and this pattern continues to work at scale pretty far, which is cool.
Something that's coming really soon which I'm excited about is the ability to actually change what I think of as the group by functionality.
So, right now, conceptually, we're grouping by action, but you might just as well want to group by service, or by any of these top Ns: by IP address, by user agent, whatever, and then see the time series graph, which we'll have as well.
I think the other key thing with building analytics products is the question you should always be asking yourself as a PM or designer or whatever is what action do I expect the user to take when they see this information, right?
Like, what is the path from here's-a-graph to here's-the-action-they-took? And one of the really neat things about Firewall Analytics and this UI is that you build your filter, you've sliced and diced, you've identified something you want to stop, and then you just click to create a rule and stop it, right?
So like the path from evidence to result is extremely short, right?
And if only all analytics could be this straightforward and user friendly. Yeah.
I want to come back later to the future stuff that we're going to build on top of this, but first I want to call out a few of the technologies behind the scenes that I think are really interesting, and talk about how they've made a difference for us in analytics.
So we've talked about how we moved from this roll-up to this event-based model and kind of a SQL interface.
Well, behind the scenes, we are using ClickHouse as our data store. ClickHouse is a column-oriented database, it supports a SQL interface, and it's pretty amazing: the scale that it runs at, and its ability to scale up to run these massive data sets for all kinds of different customers.
We've been super happy with it.
There's certainly a learning curve, for sure. I was talking to friends at other companies who use ClickHouse: you know, when you add another zero at the end, it's possible that ClickHouse can handle it, but you do have to work for it. We've learned a lot along the way, you know, from other teams.
There's no such thing as a free lunch in this context. Yeah, yeah. But it does work; we've done a lot with it.
One thing one challenge we've had with Clickhouse is managing the read load that comes into it.
So, one thing about customer-facing analytics is you don't really know how people are going to use it, right?
And I should also add the write load as well. Like, so, we handle attacks, right?
Where there could be a giant sudden spike in traffic where we might get again orders of magnitude more traffic for one customer and then that can subside.
And then they may be trying to do queries where they read out that attack over and over and over again and need to read lots and lots of rows and so we have to come up with ways to smooth that out, right?
Because we can't just, you know... we run all of our analytics servers in-house, in on-premise data centers where we manage the hardware ourselves, so we can't just scale up and, like, quickly get a lot more machines.
And so there's a bunch of tools that have made that possible.
The first one I want to talk about is GraphQL. So, GraphQL is interesting because GraphQL is a query language, and it's maybe not obvious how that relates to anything, but what it's allowed us to do is constrain SQL a little bit.
So, SQL is powerful, but it's maybe too powerful to just expose in an analytics dashboard or in an API.
And GraphQL is like an interface on top of that that constrains it a little bit: it still gives a lot of the power of data analysis, but it doesn't let you do everything; it isn't Turing complete, for example.
We wrote a really cool interface between GraphQL and the ClickHouse engine, where we can make sure that, as queries come in, we can optimize those queries and make sure they run efficiently every time.
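One way to picture that constraint layer, purely as an illustration and not Cloudflare's actual implementation: the query layer only fills in a fixed, parameterized SQL template from an allow-list of dimensions, so arbitrary SQL can never reach the database.

```python
# Purely illustrative: clients can ask flexible questions (any allowed
# group-by dimension), but only pre-approved SQL shapes are ever built.
ALLOWED_DIMENSIONS = {"country", "action", "host"}

def build_query(group_by: str, limit: int = 10) -> str:
    if group_by not in ALLOWED_DIMENSIONS:
        raise ValueError(f"unsupported dimension: {group_by}")
    return (
        f"SELECT {group_by}, count() AS n "
        f"FROM events GROUP BY {group_by} "
        f"ORDER BY n DESC LIMIT {int(limit)}"
    )

print(build_query("country"))
# SELECT country, count() AS n FROM events GROUP BY country ORDER BY n DESC LIMIT 10
```

The design win is that every query hitting the database is one the operators have already reasoned about and optimized, which is hard to guarantee with raw SQL access.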
And we encourage folks to go to our developer site. I'll risk opening a new tab here; I'm not sure if that works.
Yeah, so you can actually go to analytics and check out our documentation on the GraphQL API, which is pretty neat, and we expose a lot of datasets through there.
The other thing I want to call out on GraphQL is, you know, it was developed by Facebook originally to access their graph, and there are a whole bunch of existing tools out there that already speak GraphQL, right?
So you could just plug them into the Cloudflare analytics APIs and quickly get really rich group-by, segmenting, filtering capability on top of that, without actually having to write any SQL.
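For a flavor of what a query against that API looks like, here's a sketch of a firewall-events request. The dataset and field names ("firewallEventsAdaptive", "action", "clientCountryName") follow Cloudflare's public GraphQL Analytics docs, and the zone tag is a placeholder; check the current schema before relying on any of this.

```python
import json

# Sketch of a Cloudflare GraphQL Analytics query; field names follow the
# public docs but may have changed, so treat this as illustrative.
QUERY = """
query ($zone: String!, $start: Time!, $end: Time!) {
  viewer {
    zones(filter: { zoneTag: $zone }) {
      firewallEventsAdaptive(
        filter: { datetime_geq: $start, datetime_leq: $end }
        limit: 100
      ) {
        action
        clientCountryName
        clientRequestPath
      }
    }
  }
}
"""

payload = json.dumps({
    "query": QUERY,
    "variables": {
        "zone": "YOUR_ZONE_TAG",  # placeholder: your zone's tag
        "start": "2020-06-01T00:00:00Z",
        "end": "2020-06-02T00:00:00Z",
    },
})
# To run it, POST `payload` to https://api.cloudflare.com/client/v4/graphql
# with an Authorization: Bearer <API token> header.
print("firewallEventsAdaptive" in payload)  # True
```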
It's amazing how quickly we can build these new analytics experiences, which we'll talk about in a second, based on some of these building blocks that we have.
The other thing I want to talk about... I think we can't have an episode about analytics without talking about sampling, which is always a favorite topic.
Sampling is essential to how we're able to do analytics.
So I talked about how it's so hard to measure things that are different orders of magnitude.
Key to doing that is the ability to sample appropriately for what you're trying to measure, to come up with, you know, accurate results. And so the key thing about sampling is, again, choosing the right sample rate for the query, right?
I think of sampling one way to think of sampling is like putting glasses on.
There are times when something is written on a chalkboard sign in really big letters, and maybe you don't need glasses to read it; but if you want to read the fine print, you have to put your glasses on. So, you know, different sampling rates are appropriate for different kinds of data sets.
Another analogy I like to use about why sampling works: I think about the probability class I took in college. Our professor was trying to teach us about sampling, and the question he asked was, well, how much of the earth's surface is covered in land?
How would you measure that?
How would you think about measuring that? And he had a beach ball with a globe printed on it.
What's that? The earth is flat, so we can trace the outline in the comments, we can fit some polygons. So, he took the beach ball and said, okay, look at this: how do we know how much of it is covered in land? And what he did was he actually just tossed the beach ball in the air a few times, caught it, and noted where his right pointer finger landed. He did that ten times, and three of the times his finger was on land. We can go back into the math and the statistics behind this, but the summary is basically: you can get accurate results by sampling.
If you do a good job sampling, you can come up with a data set that's representative of the population.
This is how, you know, Nate Silver, except for a couple of well-publicized times, is actually pretty good at predicting, you know, election results, NBA games, and so forth.
So, sampling is here; when we do sample, we indicate that in the UI so you can see it. Actually, I think in this case there aren't a ton of events, so we don't have to show sampled data.
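The beach-ball idea carries straight over to request counting: keep a known fraction of events and scale the sampled count back up. A minimal sketch of that sample-and-extrapolate approach (the rate and totals are made up for illustration):

```python
import random

# Sample-and-extrapolate: keep roughly 1 event in 100, then multiply the
# sampled count by the sample rate to estimate the true total.
random.seed(42)  # fixed seed so the run is reproducible

SAMPLE_RATE = 100
true_total = 1_000_000

# Simulate a stream of events, keeping ~1 in SAMPLE_RATE of them.
sampled = sum(1 for _ in range(true_total)
              if random.randrange(SAMPLE_RATE) == 0)
estimate = sampled * SAMPLE_RATE

# With this many events, the estimate lands very close to the truth.
print(abs(estimate - true_total) / true_total < 0.05)
```

The practical trade-off is that the error is relative to the sampled count, which is why a single fixed rate can't serve both the seven-request customer and the 17-million-requests-per-second peak: the sample rate has to be chosen for the scale of what's being measured.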