⚡️ Speed Week: Live-Updating Analytics and Instant Logs

Presented by: Ben Yule, Cole MacKenzie, Jon Levine, Tanushree Sharma

Originally aired on September 14, 2021 @ 10:30 PM - 11:00 PM EDT

Join our product and engineering teams as they discuss Live-Updating Analytics and Instant Logs.

Transcript (Beta)

All right. Hello, everybody. Welcome to Cloudflare TV. Welcome to Speed Week. My name is Jon Levine, and I'm super excited to be here with my colleagues today to tell you about our latest releases, which are live-updating analytics and Instant Logs. Very, very cool stuff that we've been working on. So, I'm a product manager here on the data team, and I want to let my co-workers introduce themselves. So, Cole, let's start with you.

Hey, guys. I'm Cole. I'm a software engineer on the data team as well. I recently joined Cloudflare this year.

Awesome. Tanushree?

Hi, everyone. I'm Tanushree. I'm the logs product manager, also fairly new to Cloudflare. I joined just this past summer.

Awesome. And already launching things. Very cool. Ben, you go next.

Yeah. My name is Ben. I'm an engineering manager here at Cloudflare focused on solving problems with distributed data. And just like Cole and Tanushree, I'm also relatively new. I joined Cloudflare in February.

Very cool. A fresh team with a very exciting launch under their belt already. So, Tanushree, why don't you give me the lowdown? What did we launch today?

Yeah. Really excited about the launch of live-updating analytics and Instant Logs. Basically, live-updating analytics allows you to see data coming into our analytics platform in real time. And it's out now for all Pro, Business, and Enterprise customers. So, feel free to give it a shot. And Instant Logs allows customers to basically see a live stream of traffic from their zone on the Cloudflare dashboard. And we're going to start a beta very soon for this. The signup link is in the Instant Logs blog post if anyone watching is interested.

Awesome. Yes. Technically, yes, Instant Logs will be available very soon. We will have a demo coming up, which we've been hard at work on. But live-updating analytics is live right now for all folks on our Pro, Business, and Enterprise plans to try out in the dashboard. Tanushree, what's cool about that?
Why are people fired up about live-updating analytics?

Yeah. So, we've had drill-down and filtering capabilities on our dashboards previously, but live updating really makes them all the more powerful. Analytics are useful for identifying trends over time and, more specifically, for spotting when something odd is happening on your network. And so now, with live updating, if there are spikes or drops in traffic, you can see them right away and then slice into the why all the faster.

Jon, you're on mute.

Yeah. Thanks. That's super cool. Yeah. So, one thing I think about a lot is that people use Cloudflare for so many different use cases. And our analytics really have a lot of power in that they let you slice and dice. You can filter on URL, on status code, on service, on all kinds of things. And with live-updating analytics, you can just park on the view that you care about and almost leave it up like a dashboard. You can just have it on a screen and look at it and see when something interesting happens, which is something I think is super neat. And yeah, it's a really effective tool. Cool. But I think we're going to spend probably most of the time today talking about Instant Logs, which we're very excited about. Tanushree, do you want to also tell me, maybe, what's the difference between these things? What does Instant Logs let you do that you couldn't do otherwise?

Yeah. I'll start off with our vision for Instant Logs. In a very cliche sense, we wanted to be able to see logs flowing in, sort of Matrix style. And it's very useful in time-sensitive situations like an attack, or if you're troubleshooting, or saving the human race. You need to be able to see very, very specific event-level information, which Instant Logs allows you to do. An example of a case where you would use it: let's say you're a developer, you've recently deployed a new version of some software, and you want to do some testing to make sure it's working right.
And you want to see any errors and find out what's causing them. And Instant Logs is the perfect place to do that, because data flows in in real time. And so really, the primary use cases for Instant Logs are troubleshooting, debugging, and getting to the root cause super quickly and easily.

That's awesome. So I think we've had enough talk for the start of this. Ben, are you ready to roll the live demo? Let's do it live on television. Okay, everyone, you're about to see a real live demo of logs on, I believe, blog.cloudflare.com.

All right, so we're going to go see how the post we just released today is doing. And I'm going to switch my view here. So hopefully, folks will be able to see the presentation. Please shout out if you can't see the presentation. All right. So we have a new tab here under the Analytics subcategory. Next to our existing Logpush product, we now have an Instant Logs tab. And it's a really simple UI that allows you to start streaming. There's a basic table view, and the best demo is to just watch it go. So you simply hit start streaming on whatever zone you want, and your logs literally stream in real time, almost as fast as they come in. So for a normal-size to medium-size zone, like blog.cloudflare.com, you really will see just about every single log that comes in. And it doesn't really start to sample your data until you have tens to hundreds of thousands of messages coming in.

This is amazing. First of all, number one: live on-air demo success. Yes. So tell me, what are we looking at here? What's streaming in?

Yeah, so it's a live stream of every single request that's either coming into your network or going out by a Worker. And you can see the basic URLs, the method, the status code, and the date or the time at which the message came in. And this should be basically what the real wall clock is at any given time.
Because this product really has around one to two seconds of latency. And that is specifically the latency between when something hits our edge network and when you actually see it here; it should be within one or two seconds. And we'll demonstrate that in a second. If you click in on one of these log lines, you can actually see all of the fields that we have historically provided for logs. So you can do pretty deep analysis.

Very cool. So it seems like if you hear about a problem, you've heard a rumor of it, maybe there are very few examples, like you're not seeing a lot of these data points in analytics, but you want to capture one in the wild, you can kind of start the streaming. It looks like there are some filters. So yeah, the add filter, do you want to show that quickly?

Yeah, let's dive in. So let's say you're trying to find a specific event. We have some really basic filters right now, and we are going to add more over time. But perhaps you want to zoom in on status codes, maybe 404s, right? Where am I seeing errors? You set the basic filter: specify status code and an equals condition. There are other conditions as well: you can do greater than, less than, any sort of condition that you can dream up. Specify your code, apply, and start the stream. And then ideally, we should just start to see 404 errors.

And there's prism.js. I don't know what that is, but it seems to be something that people want that they're not finding.

That's right.

Very cool. And what's cool is that now, if you have this, you can actually click on that. And you can actually see what that request is and where it's coming from, the referer information, right? You have these actual examples that are kind of tangible. Very cool.

That's right. And we can also add compound filters.
So I could specify, maybe there's some specific path that I'm really interested in seeing more information on, and type something that I maybe expect. So any path that contains this key will also be explicitly filtered for. We can start the stream, and you shouldn't see any data coming in, because there's nobody actually...

Yeah, well, so I'm going to pause you right there. So we have this name, a somewhat grandiose name, Instant Logs. And I mean, we spent some time debating what to call it. And, you know, what does instant really mean? So what does instant mean to you, Ben? How fast is instant?

Within one to two seconds, within one to two seconds.

All right. So one to two seconds is what we're aiming for. We're going to do a live on-air demo of this. Just to set up what we've done here: we have this filtered, so we're only going to show events that come in if they contain this path and have status code 404. And it doesn't look like anyone's requesting that. So we're going to make a request that's going to trigger this. You're going to watch Ben make the request, and we're going to see how long it takes to appear.

Yeah, and then get your stopwatch ready. I'm actually not sure if it's going to show up as a 404.

Okay. Maybe in the meantime, someone's added that page. Okay, stopwatch is ready. Virtual stopwatch.

Let's do it. So this is the real Cloudflare blog. Production traffic. One, two, three.

Yes. Amazing. Bravo. I would say that's about one to two seconds. One Mississippi, two Mississippi.

I think it was a little faster than two seconds. It was closer to one. We'll have to replay.

Yeah, I'll have to have the slow-mo replay going. Very cool. That's amazing. Cool. Well, yeah, actually, just because we've just been bragging about how fast it is, and the filters, and the ability to zoom in, this is pretty amazing.
Like, Cole, do you want to kind of just give a quick overview? How did we approach building this? How does all this work?

Yeah, for sure. So this is kind of an extension of a product we already have. If you're an Enterprise customer, you're probably familiar with Logpush or log stream. And we're kind of utilizing some of that architecture that already exists and the infrastructure around that. But we decided to build Instant Logs on the Cloudflare Workers platform. So we've actually got an agent running across our edge network, on all those servers, and they're constantly submitting the logs for certain zones that are subscribed to a Workers endpoint. And our Workers endpoint actually accepts all these incoming logs. And we kind of do a large map-reduce across Cloudflare's edge network. So the map phase is the log forwarder, which is our service, our agent, sending stuff to us. And then in our reduce phase, we collect all of these from all of our edge servers and reduce them down to a stream that's manageable for the client. So as we saw in the web UI there, that's one stream coming in, but from potentially hundreds of different locations around the world. And so we use a concept in Workers called Durable Objects. Durable Objects give us a uniqueness guarantee that there's only one of them in the world. And we basically forward requests from the edge, through Workers, into a Durable Object. And the Durable Object then coalesces everything and sends it to the user, as we saw on the web there.

Cool. So when I open up that web browser, there's a WebSocket, right, that's streaming it. So it's connecting to one Durable Object. Is that right?

Yeah. Yeah. It's connecting to one Durable Object on Workers.
So, you know, some people have personal websites on Cloudflare that maybe get a handful of requests per day, but we have customers that get, you know, millions of requests per second in the limit. So what happens if someone starts sending millions of requests to that Durable Object? I don't think my browser can handle millions of logs per second.

Right. Yeah. So especially for some of the larger zones that receive a lot of traffic, we use a technique called reservoir sampling. Ben can probably talk a little bit more about that. But the idea is that as traffic starts to increase and we see a lot more load for a specific WebSocket, we start randomly sampling the data that's coming in, and we reduce the maximum output rate to the client to something more manageable in the web. So, Ben, maybe you could give an example of what reservoir sampling is.

Yeah. So reservoir sampling is this really interesting technique that allows us to buffer a sample and send our data to a client. And specifically, it deals with the fundamental problem of having an incoming stream of unknown size and wanting to reduce that into a stream of known size.

So we have this very thin pipe to the client, right? It can handle tens of requests per second, and then you could have anywhere from zero to millions of requests per second. So you have to get this completely unknown input into a fixed output, right?

Exactly. And so for a typical web client, for example, you may only really be able to display 10, 20, or 30 messages per second on the screen at any given time for it to be manageable for the user. If you're streaming hundreds or thousands of messages per second through someone's browser, they won't even be able to see them. It'll just be a blur, and it'll provide no value. And so we really want to make that stream manageable for the user to consume. So we use this reservoir sampling technique.
The reservoir sampling technique is really interesting because it doesn't matter how much volume we put into the top of it. We always get the same output, or at least we don't get above a certain output on the back end.

It sounds like magic.

It is magic, yeah.

How does it work?

So it works by creating a buffer of some fixed size. And as this stream of unknown length starts to enter our Durable Object, we take every single one of those messages and apply a randomization formula to decide whether it should belong in the buffer or not. Some messages end up entering the buffer. Some messages end up getting discarded. Some messages that enter the buffer end up getting displaced by another message that comes in, and you're left with essentially a buffer's length of data that we'll send to the client. And we do this on a fixed time interval. I think we use around 500 milliseconds right now. So we essentially buffer the data for 500 milliseconds, collect our fixed number of samples, and send them on to the client. And it's a really interesting technique for a number of reasons, because the naive approach to taking a random sample of known length from some population is that you collect your entire population and then pull your random samples out. But in this case, we don't necessarily have the memory required to store what could be hundreds of thousands of lines or events at any one time. And so this allows us to effectively do that without having to store all of those in memory at any given time. So it works really well for this sort of application.

So I'm trying to visualize this. We have hundreds of data centers around the world and thousands of servers in those data centers that are processing requests. And then are they all just connecting to one object, which then does the reservoir sample? Or how does that work?

Yeah.
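The buffering scheme described above can be sketched in a few lines of Python. This is only an illustration under assumptions: it uses the textbook form of reservoir sampling (often called Algorithm R), and the transcript does not say which variant the team actually uses. The class name, the capacity of 10, and the simulated message stream are all hypothetical, and the roughly 500 millisecond timer is represented here by a single call to `flush`.

```python
import random


class Reservoir:
    """Keep a uniform random sample of at most `capacity` items from a stream."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity
        self.rng = rng or random.Random()
        self.items = []   # the fixed-size buffer
        self.seen = 0     # messages offered since the last flush

    def offer(self, message):
        """Decide whether an incoming message enters the buffer."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(message)        # buffer not full yet: always keep
            return
        slot = self.rng.randrange(self.seen)  # uniform integer in [0, seen)
        if slot < self.capacity:
            self.items[slot] = message        # displaces an earlier message
        # otherwise the new message is discarded

    def flush(self):
        """Return the current sample and how many messages it stands for, then reset."""
        batch, seen = self.items, self.seen
        self.items, self.seen = [], 0
        return batch, seen


# One simulated interval (hypothetical numbers): 100,000 messages in, at most 10 out.
reservoir = Reservoir(capacity=10, rng=random.Random(0))
for i in range(100_000):
    reservoir.offer({"id": i})
batch, seen = reservoir.flush()
```

Memory use stays bounded by `capacity` no matter how many messages arrive in an interval, which is the property the speakers are relying on. Returning `seen` alongside the sample is a design choice of this sketch, so that a consumer could tell how heavily the stream was sampled.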
So certainly, I think we were actually talking about just today that we're now in over 250 cities and even more data centers than that. And so there are a lot of servers scattered throughout the planet. And each one of those could be receiving one request, zero, or thousands. You don't know. If you have a global footprint and do a decent amount of traffic, it's very likely that you will hit almost every server at some point in time. And so all of those servers have to get their logs to a central location. And because we want to do this instantly, so within one or two seconds, that means that they are essentially constantly pushing data. And one of the ways that a Durable Object works, with this concept of global uniqueness, also means that it is inherently single-threaded. So you have one CPU thread to process data. And there's only so much that a single CPU, a single thread, can actually process in a given amount of time.

So there's a bottleneck in that client, but then there's a bottleneck upstream in that Durable Object. So you have to do some kind of filtering even before you get into that.

That's right. And that was one of the challenges that we actually had to solve. And the way that we solved it was by creating multiple layers of these Durable Objects. So we have our first layer, which is the single point where data is aggregated and sent to the client. But then in front of that, we introduced an entire other layer that could include up to hundreds of these Durable Objects, which are then aggregating data from even more servers. And then we can create these layers so the data gets smaller and smaller, in chunks, before it ultimately ends up at the client. And that allows us to distribute that sampling technique across a lot of different nodes effectively.

It's like a serious distributed systems problem. Okay.
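One way to picture the layering described above is as a merge of samples: each upstream node reports a small sample plus the number of messages it saw, and the next layer combines those into a single sample of the same size. The sketch below is an assumption about how such a merge could be done while keeping the result uniform over all traffic; the transcript does not describe the actual merge logic. The function name, the capacity, and the simulated node traffic are hypothetical, and `random.sample` stands in for each node's own reservoir.

```python
import random
from functools import reduce


def merge(left, right, capacity, rng):
    """Merge two (sample, seen) pairs into one uniform sample of the combined stream.

    Assumes each sample holds min(capacity, seen) items, which is what a
    reservoir of that capacity would produce.
    """
    a, a_left = list(left[0]), left[1]
    b, b_left = list(right[0]), right[1]
    total = a_left + b_left
    merged = []
    while len(merged) < capacity and a_left + b_left > 0:
        # Pick a side in proportion to how many unsampled messages it still represents.
        if rng.randrange(a_left + b_left) < a_left:
            merged.append(a.pop(rng.randrange(len(a))))
            a_left -= 1
        else:
            merged.append(b.pop(rng.randrange(len(b))))
            b_left -= 1
    return merged, total


CAPACITY = 5
rng = random.Random(1)

# Hypothetical traffic: 8 first-layer nodes, each seeing a different volume (node 0 sees none).
node_streams = [[f"node{n}-msg{i}" for i in range(n * 100)] for n in range(8)]

# Stand-in for each node's reservoir: a uniform sample plus the count it saw.
first_layer = [
    (rng.sample(stream, min(CAPACITY, len(stream))), len(stream))
    for stream in node_streams
]

# The final layer folds all upstream samples into one client-sized sample.
final_sample, total_seen = reduce(
    lambda acc, nxt: merge(acc, nxt, CAPACITY, rng), first_layer, ([], 0)
)
```

Weighting each pick by the remaining message count is what keeps a busy node from being underrepresented relative to a quiet one. Without the counts, every node's sample would carry equal weight regardless of how much traffic it stood for.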
So I have to ask, this sounds really complicated. You have multiple layers and this distributed network, and there's locks and threading. When did we first write code for what became the demo that we just saw?

I think we started this conversation probably about three weeks ago.

About three weeks ago. Yeah. Pretty cool. So that makes no sense. How is that possible?

Yeah. That's a great question. It's all a little bit of a blur, but to be honest, the Workers platform really enabled us to do this really quickly for a number of reasons. One, we weren't Workers experts or platform experts prior to this. We had never really written code for it. We're not JavaScript experts. Most of what we've done in the past is Golang and other languages like that. And we were able to honestly onboard really, really quickly. We were able to deploy things to the edge literally in seconds. Just as fast as you can receive a message, we can essentially deploy a Worker to Cloudflare's global edge network. And that honestly allowed us to create the first prototype in almost 24 hours. That was just really basic: single server, single Durable Object. And we were able to prove that that worked literally the same day.

Amazing.

And that was motivating and let us know that there was a real path to this. And yeah, we got to the point where we are able to test fairly large Internet properties and get data reliably to a client. And we were able to get it all together, build up the UI, and here we are today.

Yeah, that's super cool. I mean, I think it's so cool that he did mention... oh, and we have Bharat joining us too. He's another member of the team. Very cool. Hi, Bharat. So yeah, I think one thing that's really interesting is we do have some custom software, the log forwarder, which is deployed at our edge, which is like Cloudflare secret stuff. But I don't know what percent, 80%, 90% of this is really the public Workers and Durable Objects, which anyone can use.
So I don't know, if you were not working at Cloudflare and starting a new data pipeline, maybe just use Workers and Durable Objects. Is that the conclusion here?

There was no secret sauce. There were no tools that we had access to that others don't. Essentially, our agent that runs at the edge is just like any of your own services or endpoints: it could be an IoT device, could be your own web server, could literally be anything that produces data at a high rate. And the rest of the pipeline is Workers, using all the same primitives that everyone has access to. And that's it. So anyone can build this. We don't have any secret tools here that we use to do it.

Super cool. Wow. That's amazing. Awesome. Well, just running towards the end here, I realized I was so excited to get to the demo, and so excited to talk about how we built the demo, that I kind of glossed over some things. So we have analytics, and we talked about our live-updating analytics. And we have Logpush and Logpull. And now we have Instant Logs. And I know some folks are probably looking at our platform wondering, well, when do I use these things? What are they appropriate for? So I do want to spend a minute to understand where Instant Logs sits in this, and how it's different from some of the other products. And then we'll talk about where it's going. But yeah, Tanushree, do you want to give an overview of how these products relate to each other?

Yeah, let's start with Logpush. Logpush is really great at getting customer data super reliably from our network to a customer destination, whether that be an analytics platform or just a storage destination. But one of the areas where we see room for improvement with Logpush is the fact that you don't get that real-time component, because it's really built for reliability.
And so that's where Instant Logs comes in: it really allows you to get that piece of real-time capability. You want to just be able to see traffic, troubleshoot, and diagnose without having to set up another platform to do that. So that's the differentiation there.

And we use a lot of Logpush to build this too, right? They share a lot of fundamental components to be able to do this. So, yeah, very cool. Maybe tell me a little bit about what you see as coming up next for Logpush, for Instant Logs, for anything on the logs roadmap.

Yeah. So for Instant Logs specifically, we want to get this out in testing, get this in customer hands, get some feedback, and iterate on it. And then also expand from there: add more datasets, add more functionality. Something that we're really excited about is the idea of real-time metrics, aggregating that and showing that as well. So those are the next big things on the roadmap for Instant Logs.

That's awesome. And I'll plug one more time: if folks have not yet, please do sign up for our waitlist. We'd love to have you be one of the first people to try it. Tanushree, do you know the URL off the top of your head?

I think it's Cloudflare.com slash instant dash log slash waitlist. We'll double check that for you while we're pulling that up. Sorry, should have found that before.

While we're pulling that up, Tanushree, you were mentioning actually taking the logs and using them to produce metrics in real time, which is something that's super cool. That's actually kind of how we produce our analytics. And in the past we've talked about how we use ABR, which is fundamentally about making metrics out of logs in real time. We realized we could do something similar in the client, where we're pushing these raw events and we can turn them into metrics.
And that was something we thought about from the very early days. And Cole, you know, we mentioned that first demo we had: one server, one WebSocket, one client. I thought you could show us, just to close out, a quick demo showing what that might look like. And I'll let everyone know this is going to be a command-line demo. So it's very much a proof of concept, but I thought it just looked so cool that we wanted to share it with everyone. Cole, do you want to bring that up?

So yeah, when we were first building, we didn't have a UI yet. We didn't think it was going to be a product in the console. We thought it was more just going to be a dev tool at the command line. And it's kind of branched out into two different areas. So I'll share my screen here. Hopefully everyone can see it. And as Jon and Ben mentioned earlier, we're using WebSockets to communicate from our edge network, using Workers, to the client. So in the case of the UI, as you saw on the Instant Logs UI page, those were coming in on a WebSocket, but now we can also do it at the command line. So I have an example here where we can just connect to a WebSocket using something like a version of curl, but for WebSockets. And we're going to pipe our data into jq, the JSON query tool, which you're probably very familiar with if you work with JSON. We're just going to get the client request path. So basically this is real-time traffic. People are visiting these searches, requesting these assets, and this is coming in on the fly.

Will people be able to use this when Instant Logs is available?

Yeah. So once we launch Instant Logs, we'll have the ability to use it from the command line as well, given you have the correct tools, like a WebSocket client. But there are also some other things that make the command line a lot more powerful.

Well, one more question about this.
What's the rate of requests per second that we've seen people see in the command line?

Yeah. So when we were testing, we got up to around 6,000 requests per second coming in to us on the command line. Obviously that's dependent on your bandwidth. I have gigabit at home, so it's a little bit special. And on how much traffic your zone gets. Unfortunately, we're not quite that popular yet.

We're going to get there. Next blog post.

Next blog post, you'll see an update on that. But the other nice thing about being in the command line here is there are a lot of utilities in the Bash ecosystem for slicing and dicing. So jq was one of them. And there's another one out there called angle-grinder that we stumbled upon, and it kind of made us think in this direction. I'll paste the command here and quickly walk through it. So at the start here, we're just connecting to the WebSocket address. And we're piping the data through jq. I'm just removing some sensitive fields so we don't accidentally release any of that. And then there's this tool called angle-grinder. angle-grinder is like a streaming aggregation command-line tool. And we're basically saying: take all the data coming in on standard in as JSON. We want to filter for our blog post URL on the client request path. And we just want to take a sum of how many people are accessing the blog post and group it by country. And so if we go ahead and run that, we'll see there's no data right now. But I'm sure if Ben or Tanushree or Jon or anybody else visits the blog, we can see it update.

All our viewers, see what country you're coming in from. Come refresh our blog post. I'm sure you have it up already.

Another good one too: maybe you just made a blog post and you want to see where all of the traffic is coming from. Maybe you decided to onboard onto Facebook and social media marketing.
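The aggregation walked through above (filter on the request path, then count per group) can be approximated in Python for readers who do not have the command-line tools at hand. This is a sketch, not the command shown in the demo: the function name is hypothetical, the sample events are made up for illustration, and the field names `ClientRequestPath` and `ClientCountry` are assumed to match the JSON keys in the stream.

```python
import json
from collections import Counter


def count_by(lines, path_substring, group_field):
    """Count events whose request path contains `path_substring`, grouped by `group_field`."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(event, dict):
            continue
        path = str(event.get("ClientRequestPath") or "")
        if path_substring in path:
            counts[event.get(group_field, "unknown")] += 1
    return counts


# Made-up events standing in for the live stream.
sample_lines = [
    '{"ClientRequestPath": "/instant-logs/", "ClientCountry": "us"}',
    '{"ClientRequestPath": "/instant-logs/", "ClientCountry": "ca"}',
    '{"ClientRequestPath": "/other-post/", "ClientCountry": "us"}',
    '{"ClientRequestPath": "/instant-logs/", "ClientCountry": "us"}',
    "not json",
]
by_country = count_by(sample_lines, "/instant-logs", "ClientCountry")
```

Because `count_by` accepts any iterable of lines, passing `sys.stdin` would let it sit at the end of a shell pipeline in place of the aggregation step. Changing `group_field` to a referer field would give the second grouping mentioned in the demo.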
You can also group it by the referer, and we can see where our users are coming from. So maybe you just want to gather a quick insight. Is your Facebook post working? Is it attracting users? We can see a lot of users are coming in from Google. Going to the URL directly.

That's awesome.

URL directly. And I'll open it up in another tab here that you can't see. But basically, you could visit it even from LinkedIn, where we shared the post a few times.

Nice.

Twitter. You'll basically see it pop up here. So I just opened that up. We can see people are coming from Android.

Very nice. Very cool. Just a live feed there. That's amazing. And yeah, I think this is such a cool demo because this is how we think about just trying out the product functionality among ourselves. It's something our viewers will be able to do soon. And it's also something we hope to bring to the dashboard, so that you have this kind of ability right at your fingertips in the UI really soon. So thank you so much, Cole. Super cool demo. Thank you so much, everyone. To Tanushree, Ben, and the whole team. Shout out to Bharat, our teammate who also dropped in. He made a cameo. And everyone else who worked on this. It was a lot of fun to make this and to talk about it with you all.

I have the beta link pulled up. Just to clarify, it's Cloudflare.com slash instant-logs-beta to sign up. Cloudflare.com slash instant-logs-beta.

Got it. Thank you, Tanushree. Thank you, everyone. Have a great rest of your day. Bye.