Cloudflare Analytics Demo and Q&A

Name: Cloudflare Analytics Demo and Q&A
Uploaded: 2020-06-14T21:00:00.000Z
Duration: 30 min
Description: Demo the new Cache Analytics experience (launching early June), talk about plans for Cloudflare Analytics, and general Q&A.

Presented by: Jon Levine

Originally aired on June 11, 2020 @ 6:30 PM - 7:00 PM EDT

English

Analytics

Demo

Q&A

Transcript (Beta)

All right, hi, everyone. My name is Jon Levine.
I'm Rustam.
Awesome. We're product managers at Cloudflare, and you might recognize us from yesterday. We're here to continue our segment, How We Built This. Today we're going to be talking about analytics, which is my area here at Cloudflare. I know the title says we're going to do a demo and a Q&A. I definitely want to do the Q&A, and we're definitely going to do a demo. But really what we want to do is talk about how we got here. What got built? Why did it get built? Who did we build it for? What choices did we make? What would we want to do better? What's coming next? We want to dig into all of that: what tradeoffs did we make, and what constraints were there? We want to talk about all the hard stories that don't always make it into the blog post or the press release. Rustam, why don't you tell us more about the series we're trying to do?
Yeah, sure. So, How We Built This: we did our first episode yesterday. This is product managers, engineering managers, and other folks at the company who have built a lot of the products that you've come to know and love as Cloudflare customers and users, shedding light on the decisions, ideas, and process behind how products went from a whiteboard to real life. Yesterday we did our intro episode, which was meta: it talked about how we decided to do the show and gave a high-level intro to how we build products at Cloudflare. Jon graciously joined me as a guest yesterday, and we're swapping today. He's going to walk through analytics at Cloudflare and how we've thought about evolving it over time. And I'm here as community leader.
Cool. So we're going to talk about analytics. Like we said, I want to talk about why this is hard. Hopefully everyone can see my screen now.
I'm presenting the basic zone analytics view; if you're a customer, you've probably seen this at Cloudflare. I first want to talk about what makes analytics hard, because on its surface this can seem pretty easy. I think the simplest way to explain why it's hard is the scale that we run at. It's kind of a cliche for companies to talk about their scale or why scaling is hard, but I just want to give a couple of numbers. At peak we handle more than 17 million requests per second: every second, 17 million requests. And we have over 27 million customers, which is pretty crazy when you think about it. What's also interesting about those customers is that some of them are serving about a million requests per second, and some of them are serving basically zero requests per second, or zero requests per day. What's hard is not just the scale that we operate at, but the fact that our analytics have to work for customers across those different scales. The analogy I sometimes use is: if I were trying to count the number of guitars on my wall, there would be one technique, like one, two, three. But if I were trying to count the number of guitars that had ever been made and existed in the world, I would probably need a qualitatively different technique. So even for something as simple as counting, how you do it really changes with the scale that you work at. If we want to count whether a customer has 7 or 15 requests, or whether they have 7 million or 15 million, those are really different things. Cool. So one thing I want to do in this episode is give a little bit of a history lesson on how analytics have evolved. I mentioned counting, and right now we're seeing our zone analytics as they are today, which we've had for a long time.
It's been fairly stable for a while, and over the course of the episode we're going to talk about new products that we've built and how this is going to change. But let's come back to this idea of counting things. Zone analytics primarily counts stuff. We count the number of requests you have. We count the number of cached requests and uncached requests. We can go through everything on this page. We can make counts by country. But I think the important thing that people may not realize is that we have to decide ahead of time what all the things are that we're counting. And this UI is pretty fixed for all of our customers. There are a couple of other things on here. By the way, I'm presenting Burrito Bot, which is one of our staging zones. Everyone's welcome to check it out. You can order virtual burritos from there. Just an honest amount of traffic. We can count bytes instead of requests. We can count how many status codes we got. Oh, I'm getting an error. But most of the time, we can count how many status codes we got. We think of it like fixed buckets, right? Okay, I'm going to count this many 200 requests, this many 404 requests, this many requests from the UK, this many requests from India. But...
I guess the way to think about this is that product managers, or the powers that be, have chosen the dimensions along which we're counting things in this context, right? So we've decided, prior to any requests flowing through the system, that number of requests matters, number of cached requests matters, number of uncached requests matters. But then if you wanted to know how many uncached requests there were in Azerbaijan, and we hadn't made a decision that we need to count along those two dimensions, that question becomes hard to answer. Is that the way to summarize?
That's exactly right.
So going back to the process we talked about a little bit yesterday, we were talking about how we approach talking to customers. And one thing that comes up again and again is exactly that. Well, I don't want to just see the number of cached and uncached requests. I want to see the number of cached requests in Azerbaijan. I want to know my cache hit rate for 404s versus 200s. I want to know my cache hit rate for Android versus iPhone users. People have other questions. Actually, the most common one we get is just by hostname, which is interesting. We can talk about why that's hard, but unfortunately, in this view it can be hard to tell you how many requests I had for salsaverde.burritobot.com versus for www.burritobot.com.
Wait, so can we dig in on why counting things is hard? Of all the things that computers are good or bad at, that seems like an obvious one.
Yeah. Counting seems really obvious. Well, there are a couple of things that are hard about counting. One thing, and this is going to sound really reductive, is that counting gets harder when you have more things to count. For the computer scientists out there: not everything scales up linearly, right? As many people know, we have over 200 data centers around the world that are all handling traffic. But at some point, you have to load this dashboard at one time, in one place, and get one number from all these different places that are counting. So each of those data centers has to collect counts, and then that has to be centralized in one place. Now, coming back to this issue of whether the counts are fixed: if we've decided on a fixed number of things to count, that isn't so bad to do. The problem is that now we want to make the counts more flexible.
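To make the fixed-bucket limitation concrete, here is a minimal Python sketch. Everything in it is hypothetical: the bucket names, the event fields, and the `merge_rollups` and `count_events` helpers are illustrative assumptions, not Cloudflare's actual pipeline. It shows that counters chosen ahead of time merge easily across data centers, but cannot answer a question that crosses two dimensions, while raw per-request events can.

```python
from collections import Counter


def merge_rollups(rollups):
    """Sum per-data-center counters into one global set of fixed buckets."""
    total = Counter()
    for rollup in rollups:
        total.update(rollup)
    return total


def count_events(events, **filters):
    """Count raw request events whose fields match every given filter."""
    return sum(
        1 for event in events
        if all(event.get(field) == value for field, value in filters.items())
    )


# Hypothetical raw events: one dict per request.
EVENTS = [
    {"country": "AZ", "cache_status": "hit"},
    {"country": "AZ", "cache_status": "miss"},
    {"country": "GB", "cache_status": "miss"},
    {"country": "AZ", "cache_status": "miss"},
]

# Hypothetical rollups from two data centers, using buckets chosen ahead of time.
ROLLUPS = [
    Counter({"cached": 1, "uncached": 1, "country:AZ": 2}),
    Counter({"uncached": 2, "country:AZ": 1, "country:GB": 1}),
]

if __name__ == "__main__":
    totals = merge_rollups(ROLLUPS)
    # The rollups answer the questions that were planned for...
    print("uncached:", totals["uncached"], "from AZ:", totals["country:AZ"])
    # ...but only the raw events can answer "uncached requests from AZ".
    print("uncached from AZ:", count_events(EVENTS, country="AZ", cache_status="miss"))
```

The merged rollup knows the total uncached count and the total from one country, but the combination of the two was never stored, so it cannot be recovered from the counters alone.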
So yeah, does that kind of get at what's hard about scaling up counting? After talking to a lot of customers, what we realized is that people need more flexibility. We realized we were never going to have enough buckets. We were never going to predict all the questions people wanted to ask. We needed a different way of doing things. The way I think about it is that we need to expose something like a SQL model. If anyone's ever written a SQL query, it's pretty simple. There's a SELECT statement where you name the metrics you want from the table, a WHERE clause that filters for what you do or don't want to see, and then a GROUP BY, so you can say, I want to count things by country or by hostname. And when you think about SQL, you think about what the rows are and what the columns are. Thinking in SQL terms, a row today is basically one customer at one time, for one hour, right? But really what you want is a model where one row is one event: one request that happened in the world. This model is really powerful because now I can conceptually run that SQL query that says, all right, count the number of requests in a given place. That's a really powerful model that we can expose to people. And so, yeah.
Let's go back for a second. We talked about why counting is hard, and we talked about some different conceptual models to make counting things at scale easier or more feasible. What problems are customers actually trying to solve with this data?
Yeah. So the most common one I see is wanting to see things by hostname.
And from an operational standpoint, what is the business problem our customers are trying to solve?
Yeah.
There are a few that are really important. We're going to talk about caching more in this segment because we have something really cool coming out for caching analytics. Right now we show cached and uncached requests. So why is caching important? Caching really matters because caching is how you make your site faster, right? You get content physically close to people, and it doesn't have to be recomputed on the server. And caching reduces costs, because it's just cheaper for Cloudflare to serve content from our edge than from an origin server. We know that origin data transfer is one of the biggest costs that people face when running a website. So optimizing caching is really important. And it's actually kind of hard to configure caching. It may seem like, oh yeah, we'll just cache stuff and Cloudflare will handle it. But it turns out you actually have to make all these decisions. The example I use with caching is: you sign into your bank account, right? Maybe the bank's homepage and some of the assets, maybe some of the images in there, can be cached, but hopefully they're not caching your bank balance in a public cache like Cloudflare, right? Because even if it's bankofamerica.com/account, that URL may be the same for everyone, but there's probably very different content displayed on that page. So we have to be really careful that we're not accidentally caching stuff that's meant for just one person. That's an example where, in order to really understand what's happening, the business owner or the operations person needs to see what the cache rate is by URL or by hostname. That's one category. Security is another really important example. So we're going to do this kind of historically.
So when we looked at revamping our analytics experience, the place we actually started was on the security side, on the firewall. The reason we did that is that we have a pretty good security platform. There's a lot that people can do to configure their firewall rules and understand where all of that is running. We wanted to make sure that people could actually see what the impact of those rules was, and see the impact of the rules that Cloudflare set up. In order to do that, you often have to know, again, specific URLs, maybe specific IP addresses, where things are actually running, and ask these really granular questions.
Cool. So let's talk about that. We're going to do a little bit of a historical tour. Let's go to the firewall tab. This is kind of where it started for us in terms of rethinking analytics. Oh no, there's an error. Oh man. Let's see if refreshing helps. We may have to just use our imaginations. Ah, okay. So I'm signed out here. Luckily my password is all dots. That is my password. It's five dots.
Like the bash.org one.
Yeah, exactly. hunter2.
Cool. So these are firewall events. In addition to just looking different, this is a really different experience from what we saw before, and the reason is that this is interactive. I can actually hover over things here. Let's talk about what I'm looking at first. We're looking at firewall events, and we can talk about why that's different. These are all the different actions that can be taken by the firewall. If I want, I can filter down to one of these. Now the graph actually changes. Ah, well, except for when there's an error. It's okay. This is our staging site, so it's fine that this has errors.
The graph actually changes in response to me clicking on things. This is a really different model for analytics because it's dynamic: it's responding to what you do. So I can filter to a specific action. Imagine I did that. Then I can see things like top IP addresses, top paths, countries, and user agents. Let me try filtering one of these. Let me filter to just my homepage. Maybe I want to understand only what's happening on my homepage. If I filter... ah, too bad. Live demos are tough. The demo gods are not with us today. But you can imagine that I can filter to a single path and see how the page updates in response to that. Is there something we can show? I'm just trying to think how we can do a good demo of this. Oh man, maybe it's because I keep getting signed out. Hmm. This is our staging site, where we might sometimes be in the middle of pushing an update. Let me try it one more time and show you what it looks like. It is very exciting when it works and all the graphs change. Let's try filtering to just the homepage here. I'm going to hit that filter button. Yes. And it happens pretty fast. Everything updates. Now we're just looking at the homepage, and we can see everything we saw before, but focused down to that view, which is kind of an amazing ability. So this is really cool. There's a ton of stuff we're doing here that I think is really exciting. One is the ability to create a rule right from this page. This one's pretty simple. Let's say you're slicing and dicing, and a common case is that you're seeing one IP or one user agent that's really suspicious. Here we're getting all the requests from Pingdom. In this case, I know what Pingdom is, but let's pretend Pingdom is suspicious to us.
We can filter just to their user agent. And now I can actually create a firewall rule, which is really cool. So from this view, I can go from seeing what's going on to taking action on that traffic and watching it change: click on that rule, and then a minute or two later, see the lines in the graph change based on that rule. That's something really cool that's possible here. Any other use cases that you've seen customers use firewall analytics for?
No, I mean, the example you showed is, contrived is the wrong word, but really simple, right? I think one of the coolest things about watching customers use this interface is just how expressive you can be. The number of dimensions you can drill down on is, infinite is too strong a word, but you can do a lot. I think it's a fundamental change in the way one thinks about analytics. Compared with other tools out there, the number of degrees of freedom you have in exploring the data through a tool like this is pretty radical, right?
Totally. One thing that's cool too is that we get a lot of requests from customers to add new stuff, and it becomes much easier to add to this model. To give an example from firewall, say someone wants to add... well, now I'm going to be stumped coming up with one. Do we have ASN in here? Oh, we already have ASN. So not the best example, but I'm not sure we had ASN when we launched. If someone had asked for that, it's relatively easy to add a dimension, because it's basically adding a column. And then we have patterns that can adapt to that and grow. We can actually add a lot more of these dimensions in here.
And this pattern continues to work and scale pretty far, which is cool. Something that's coming really soon, which I'm excited about, is the ability to change what I think of as the group-by functionality. Right now, conceptually, we're grouping by action, but you might just as well want to group by service or by any of these top Ns, by IP address, by user agent, whatever, and then see the time series graph, which we'll have as well.
I think the other key thing with building analytics products is that the question you should always be asking yourself as a PM or designer or whatever is: what action do I expect the user to take when they see this information? What is the path from "here's a graph" to "here's the action that they took"? One of the really neat things about firewall analytics and this UI is that you build your filter, you've sliced and diced, you've identified something you want to stop, and then you just click add filter and stop it. So the path from evidence to result is extremely short. If only all analytics could be this straightforward and user friendly.
Yeah. I want to come back to the future stuff that we're going to build off of this, but first I want to call out a few of the technologies behind the scenes that I think are really interesting, and talk about how they've made a difference for our analytics. We've talked about how we moved from this rollup model to this event-based model and kind of a SQL interface. Well, behind the scenes, we are using ClickHouse as our data store. ClickHouse is a column-oriented database. It does support a SQL interface. And it's pretty amazing.
The scale that it runs at and its ability to scale up to run these massive data sets for all kinds of different customers are impressive. We've been super happy with it. It's a learning curve, for sure. I was talking to friends at other companies who use ClickHouse: when you add another zero at the end, it's possible that ClickHouse can handle it, but we've learned a lot along the way about, you know... what's that?
It's not a free lunch in this context.
Yeah. But we've done a lot with it. One challenge we've had with ClickHouse is managing the read load that comes into it. One thing about customer-facing analytics is that you don't really know how people are going to use it, right? And I should also add the write load as well. We handle attacks, right? There could be a giant, sudden spike in traffic where we might get, again, orders of magnitude more traffic for one customer, and then that can subside. And then they may be running queries where they read out that attack over and over again, and need to read lots and lots of rows. So we have to come up with ways to smooth that out. We run all of our analytics services in-house, in on-premise data centers where we manage the hardware ourselves, so we can't just scale up and quickly get a lot more machines. There's a bunch of tools that have made that possible. The first I want to talk about is GraphQL. GraphQL is interesting because it's a query language. It's maybe not obvious how that relates to anything, but what it's allowed us to do is constrain SQL a little bit.
So SQL is powerful, but it's maybe too powerful to just expose for analytics in a dashboard or an API. GraphQL is like an interface on top of that which constrains things a little bit. It still gives a lot of the power of data analysis, but, for example, it isn't Turing complete. We wrote a really cool interface between GraphQL and the ClickHouse engine where we can make sure that, as queries come in, we can optimize those queries and make sure they run efficiently every time. And we encourage folks to go to our developer site. I'll risk opening a new tab here; I'm not sure if that works. Yeah. So you can go to analytics and check out our documentation on the GraphQL API, which is pretty neat. We expose a lot of data sets through there. The other thing I want to mention about GraphQL...
Yeah, it was developed by Facebook to access their graph, right? And there are a whole bunch of existing tools out there that already support GraphQL. So you could just plug those into the Cloudflare analytics APIs and quickly get really rich group-by, segmenting, and filtering capability on top of that, without actually having to write any SQL.
It's amazing how quickly we can build these new analytics experiences, which we'll talk about in a second, based on some of these building blocks that we have. The other thing I want to talk about, because I think we can't have an episode about analytics without it, is sampling, which is always a favorite topic. Sampling is essential to how we're able to do analytics. I talked about how it's so hard to measure things that are different orders of magnitude.
Key to doing that is the ability to sample appropriately for what you're trying to measure, and to come up with accurate results. The key thing about sampling is, again, choosing the right sample rate for the query. One way to think of sampling is like putting glasses on. There are times when something is written on a chalk sign in really big letters, and maybe you don't need glasses to read it, but if you want to read the fine print, you have to put your glasses on. So different sampling rates are appropriate for different kinds of datasets. Another analogy I like to use about why sampling works comes from the probability class I took in college. Our professor was trying to teach us about sampling, and the question he asked was: how much of the earth's surface is covered in land? How would you think about measuring that? And he had a beach ball with a globe on it. What's that?
The earth is flat.
So the earth is flat. We can trace the outline of the continents. We can fit some polygons. So he took the beach ball and said, okay, look at this. How do we know how much of it is covered in land? What he did was toss the beach ball in the air a few times, catch it, and note where his right pointer finger was. He did that 10 times, and about three of the times his finger was on land. We can go back into the math and the statistics behind this, but the summary is basically that you can get accurate results by sampling. If you do a good job sampling, you can come up with a dataset that's representative of the population.
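The beach-ball experiment can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how Cloudflare samples traffic: the land fraction of roughly 0.29 is supplied as an input so the estimate has something to converge to, and the function names `estimate_land_fraction` and `estimate_total_requests` are hypothetical. The second helper shows the related idea of scaling a sampled count back up by the sample rate.

```python
import random


def estimate_land_fraction(tosses, true_land_fraction=0.29, seed=0):
    """Estimate the land fraction by sampling random points (the 'beach ball toss')."""
    rng = random.Random(seed)
    # Each toss lands on land with probability true_land_fraction (an assumed input).
    hits = sum(1 for _ in range(tosses) if rng.random() < true_land_fraction)
    return hits / tosses


def estimate_total_requests(sampled_count, sample_rate):
    """Scale a sampled count up: at a 1% sample rate, each kept event stands for 100."""
    return sampled_count / sample_rate


if __name__ == "__main__":
    # More tosses give an estimate that settles closer to the underlying fraction.
    for tosses in (10, 1_000, 100_000):
        print(tosses, "tosses:", round(estimate_land_fraction(tosses), 3))
    # 150 events kept at a 1% sample rate suggest about 15,000 events in total.
    print(estimate_total_requests(sampled_count=150, sample_rate=0.01))
```

With only 10 tosses the estimate is coarse, which matches the glasses analogy: a small sample is enough for the big letters, while fine-grained questions need a higher sample rate.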
This is how someone like Nate Silver, except for a couple of publicized times, is actually pretty good at predicting election results, NBA games, and so forth. So sampling is key. When we do sample, we indicate that in the UI. I think in this case there aren't a ton of events, so we don't actually have to show sampled data. Cool. There's probably one other thing I should call out, which is our amazing design team, and not just design, but the front-end engineering team. They built this whole interface in a way where these are all components, and it's very reusable. So when it came time to look at other analytics experiences, and the next one we'll talk about is cache, we could take so much of the UI off the shelf, take a lot of the GraphQL framework, plug that in, and get something totally different while using a lot of the same pieces. So let's click over to cache and see what that looks like. This is going to be a little preview. This is going to be released for everyone next week, on Tuesday the 16th, in fact. I'm really excited about this. We're basically bringing the firewall experience to all HTTP traffic. We can talk through what's in here, but I think you'll see it looks pretty similar to firewall, with a similar layout of the graphs. Instead of firewall actions, we have cache statuses. We still have top Ns here, but we're looking at all the traffic that comes into the website, not just the requests that triggered firewall events. And I do want to say there were lots of challenges getting to this point. We're really excited to be launching this so soon.
Going back to the question I asked earlier: what are some questions that a user could use this tool to answer, and what are the outcomes that they would drive by using this tool effectively?
Totally. Cache analytics is all about understanding how to optimize your cache. Like we mentioned earlier, maybe you want to make sure that the assets on your bank's home page are getting cached, but not the account page. That's kind of an extreme example. One of the most common things is that, in general, Cloudflare doesn't cache HTML by default. For HTML pages, you do have to create page rules to cache them, but you can't always see what the impact of those page rules is. With this, I can actually filter, for example, to my HTML pages. Oh, look, we have an issue with content type detection here. I can filter to my homepage and see specifically whether these requests are getting cached.
Got it. And then potentially make changes through page rules to do exactly that.
Yeah. And how you use this can be really different, because you can also use it to see, for example, which images are resulting in the largest amount of data transfer. So this is a really general framework. One thing I'm excited about is that this dataset we built for caching really applies to all of the traffic in your zone. So coming back to the first page I showed, we are hoping over the next few months to think about how we can take these patterns and apply them to Cloudflare analytics overall, which I think is going to be really exciting and really different. Cool. So I don't know if we have... oh yeah.
Yeah. You mentioned cache analytics going out next week.
Where does this go from here? If you were to describe what analytics looks like at Cloudflare in a year or two, or whatever longer timeframe, where is it headed?
Totally. I think there are a few things. When I talk about analytics today, I talk about using the metadata generated by our customers' traffic on Cloudflare to help them see the value that Cloudflare provides, optimize their use of Cloudflare, and also improve their own infrastructure, to protect it or to cache it better or whatever else. But I think what's really powerful is that the analytics we have are kind of a general-purpose analytics tool that you could use for so many things. There are so many other services that people use that just tell them how many unique visitors they have, what their most popular web pages are, what their bounce rates are. Using the data we collect at our edge, I think we can give people an experience that's just as good or better than that. The edge is really interesting. Sitting at the edge, we have this really unique perspective on what's happening. We didn't talk about Browser Insights in this session, but we do have a JavaScript beacon that we can put into web pages, or a sample of web pages, automatically on behalf of customers, and collect a lot of useful performance information there. We see every request to origin, so we know exactly what your origin is doing and we can pinpoint where things are slow. And if our customers choose, with their consent and with their end users' consent, we can see the behavior of an end user, what we call an eyeball, over the course of a session. I think being able to tie in all these sorts of data is really powerful. It's so much stronger than having any one of these pieces.
If you just had detection on your origin, you would miss what's happening in cache. If you just had a beacon, well, there are people who, for legitimate reasons, block those beacons, and then you don't see what's happening at the edge. So I think that holistic view is super powerful.
Cool. Excited to see where this goes. Unfortunately, I don't think we have enough time for you to play a song, but that was my secret. How about one chord to play us out?
We'll do one chord. There we go.
You've played guitar live on Cloudflare TV.
My live performance studio.
All right. Super excited to see all the work that's going on in the analytics world. And yeah, as you said, we're putting more and more of this stuff that we've been working on for a long time in the hands of customers. So, awesome.
Cool. Thanks, Rustam.
Thanks, Jon. Thanks, everyone, for watching. See ya. Bye.