🚚 Logpush: lower cost and more reliable
Presented by: Sohei Gallagher, Duc Nguyen
Originally aired on September 23, 2022 @ 5:00 PM - 5:30 PM EDT
Join Duc Nguyen, Engineering Manager at Cloudflare, and Systems Engineer Sohei Gallagher for a discussion on how Logpush is now lower cost and more reliable.
Transcript (Beta)
Hi, everyone. Welcome to the session on logs. My name is Duc Nguyen. I work at Cloudflare on the data team, mainly focusing on logs and the infrastructure and applications around them. I'll pass it over to my colleague to introduce himself.
Hi, my name is Sohei Gallagher. I'm a systems engineer on the data team, working on the logs product.
Cool. Today we're going to talk about some new announcements that we've made regarding logs and our Logpush product.
So just to give an overview: logs are critical to every single application out there.
There are many uses for logs. You may need them for debugging, for understanding performance so that you can optimize, and for metrics so that you can alert and notify on the health of your systems.
And sometimes you need them for compliance reasons: you have to store certain logs to meet the rules and regulations of your country or region.
In some cases, logs might even be critical to your business decisions.
For example, I used to work in gaming, and without logs we basically wouldn't have been able to make a successful game.
We made a lot of our business decisions based on just the logs and seeing what our users are doing in their sessions and what decisions they're making.
So definitely logs are critical, not just to Cloudflare, but to our customers.
And our solution has always been Logpush, where we gather logs from our many data centers around the world, aggregate them, batch them up into files, and send them off to the destinations that customers want, like S3, Google Cloud Storage, or even your own HTTPS endpoint.
The problem has always been that we generate massive amounts of logs.
If you're big enough, there's a complex infrastructure that you have to maintain just to keep up with the amount of logs that we're sending you.
In some cases, that is just overwhelming. You have to maintain a massive, complex infrastructure just to be able to look at things once in a while, because you're not always looking at logs.
But, you know, you do need the logs when certain incidents happen.
So we're announcing a few new features today to help around that part of the problem, to reduce your infrastructure and to make things easier for you.
The first of them is filtering. And I'll let Sohei talk more about that.
Yeah, so we are adding a new feature, Logpush filtering. Without it, you get all the logs for a domain or zone, which can be a massive amount, as Duc said.
With Logpush filtering, we can apply filters on the fields that are generated. Based on the field type and the conditions, we can reduce the number of logs substantially.
One example is extracting only logs with errors, or with certain status codes, for example greater than 400, or another specific error code that the customer is interested in.
Then we don't need to store or send all the 200 status code logs when everything is normal.
Another example is the bot score field, which is an integer.
We can set a criterion to only send logs for requests that are more likely to be human, based on the Bot Management feature.
Or we may want to log only the requests that we know for sure were made by a bot.
We can also exclude some bots, for example verified bots such as Googlebot, which is there for search engine optimization.
So we can have multiple criteria and only send the logs that customers are interested in.
Another example is subdomain hostnames.
Some customers may have dozens or hundreds of subdomains and are only interested in a certain set of hostnames. They can add a filter to get only the logs they are interested in.
That reduces the volume and makes later analysis easier, including in third-party engines like a SIEM.
They may not need all the logs, and a lot of logs can incur a lot of cost.
So reducing the number of logs is also a cost optimization for customers.
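To make the three filtering ideas above concrete (errors only, likely human traffic, selected hostnames), here is a minimal Python sketch that applies them to a handful of made-up records. The field names, the bot score threshold of 30, and the hostnames are assumptions chosen for illustration; this is not how Logpush evaluates filters internally.

```python
# Hypothetical sample records; field names and values are illustrative only.
records = [
    {"EdgeResponseStatus": 200, "BotScore": 95, "ClientRequestHost": "www.example.com"},
    {"EdgeResponseStatus": 404, "BotScore": 2, "ClientRequestHost": "test.example.com"},
    {"EdgeResponseStatus": 503, "BotScore": 88, "ClientRequestHost": "api.example.com"},
    {"EdgeResponseStatus": 200, "BotScore": 1, "ClientRequestHost": "api.example.com"},
]


def errors_only(record):
    # Status 400 and above (the talk's "greater than 400" example,
    # taken here to include 400 itself).
    return record["EdgeResponseStatus"] >= 400


def likely_human(record):
    # Assumed threshold: higher scores are treated as more likely human.
    return record["BotScore"] >= 30


def wanted_hosts(record):
    # Only keep the subdomains we care about.
    return record["ClientRequestHost"] in {"api.example.com"}


def keep(all_records, predicate):
    """Return only the records that satisfy the predicate."""
    return [r for r in all_records if predicate(r)]


for name, predicate in [("errors", errors_only),
                        ("human", likely_human),
                        ("hosts", wanted_hosts)]:
    kept = keep(records, predicate)
    print(f"{name}: kept {len(kept)} of {len(records)} records")
```

Each predicate keeps a subset of the sample, which is the whole point of filtering: fewer records to store, ship, and analyze.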
So now I want to show an example of how to create Logpush jobs, and how a customer can set up filtering.
So here's my personal domain, for example.
This is the dashboard that most customers will see, and there's an Analytics tab.
If you go down there, there's a Logs tab.
There you can set up Logpush jobs. Here it's only showing one, but you can create new jobs with different datasets and different destinations.
I'm going to show an example with HTTP requests. So basically you just select the HTTP requests dataset.
Here you can select the specific fields that you're interested in, or you can select all of them, depending on the purpose.
And here's the filter feature, which is in beta. Right now nothing is set, and I'm going to create a new job without filtering.
If you want to get all logs, you can simply skip that part and create the job.
Right now I'm going to just use Amazon S3.
I created a bucket for it. Here you can specify the bucket name with any path that you want, plus subfolders for a daily breakdown if you wish, and then set the region for the bucket you created.
Depending on the destination, it may require you to prove ownership.
And so that's what it's asking right now.
So you can go to the bucket that you specified, and there's the ownership token. Let's see, I may need to get a new one right now.
Give me a second.
Don't worry. Yeah. So yeah, I'll skip that part, but yeah.
For some reason, it's not giving the right token.
I'm not getting the right token for the path. Anyway, once you create a job, it will show up here as an existing job, and you can edit its fields and filters.
I'm going to show you what kind of filters there are. Under the filter, there are the fields that we think are most used, and you can specify which field you want.
As I mentioned previously, there's bot score or bot tags to filter on particular bot types or scores, or sometimes you want to filter based on the hostname or the status code.
So right now I'm going to just use the status code, for example.
I want to get all logs that are not, for example, 200. You just pick the field and the operator, specify the value, and save the change.
The change may take a little while to apply, usually five to 10 minutes. Previously the job might have been pushing 100% of the logs, and once the filter applies that may drop to something like 10 times less, depending on the type of requests and the filter. It can reduce the volume significantly, so you only get the logs that you're interested in or need for a SIEM or other log-based processing.
Now I want to show an example from a test zone that we set up, using the internal dashboard to demonstrate what kind of reduction we can get.
So here I set up a job with no filtering first.
Before filtering, I was getting about 1,000 to 1,500 requests per second, at about 2.5 megabytes. After the filter, which was applied around here, it dropped to about five times less.
So the bytes went from 2.5 megabytes to 300 kilobytes, and the number of records went from 1,500 to about 200 or 300 requests per second after filtering.
That's a significant reduction, and I only get the logs that I'm interested in.
It's much easier to observe and analyze. So that's the main reason for, and advantage of, using Logpush filtering.
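As a quick check on the figures quoted from the demo dashboard, the reduction factors can be worked out directly. This is only arithmetic on the numbers mentioned in the talk, assuming decimal units (1 megabyte = 1,000 kilobytes).

```python
def reduction_factor(before, after):
    """How many times smaller the 'after' value is than the 'before' value."""
    return before / after


# Figures as quoted in the talk; decimal units are an assumption.
bytes_factor = reduction_factor(2.5e6, 300e3)      # 2.5 MB down to 300 KB
records_factor_low = reduction_factor(1500, 300)   # 1,500 down to 300 per second
records_factor_high = reduction_factor(1500, 200)  # 1,500 down to 200 per second

print(f"bytes: about {bytes_factor:.1f}x smaller")
print(f"records: between {records_factor_low:.1f}x and {records_factor_high:.1f}x smaller")
```

So the quoted numbers correspond to roughly a five to eight times reduction, depending on which measure you look at.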
Some other example filters: you could use the request source to get only eyeball requests or only edge worker requests.
Another is using the client IP or origin IP, so you can filter out, or only get, the requests with specific IPs.
So yeah, that's pretty much about Logpush filtering.
And the main reason is to reduce the number of records and the storage bytes, which reduces cost and makes things easier for the SIEM. It can simplify alert integrations as well.
If you have a third-party tool that only needs logs matching certain criteria, then it's much easier to set up alerts as well.
So can you expand a little bit more on the advanced filtering options, like arrays, string matching, and so on?
Yes. The available operators depend on the field type. Bot score, for example, is an integer, so you can specify a specific number to match, or use something like greater than or equal to.
Or you can provide a list. So there are several different options for filtering, and other fields work differently.
For example, bot tags is an array of tags, such as verified bots, machine learning, or other types, and you can match a specific tag or use contains or does not contain.
So it does advanced matching, and we support regex on certain fields as well.
Hostname is text, so it has equals and does not equal, but you can also use starts with, contains, or does not start with.
So there are different types of operations that you can use to match exactly how you want to filter.
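The idea that the operator depends on the field type (integer, array, text) can be sketched as a small local evaluator. The operator spellings (`eq`, `geq`, `in`, `contains`, `startsWith`, with a leading `!` for negation), the field names, and the tag values are assumptions for illustration; check the developer docs for the operators each field actually supports.

```python
from typing import Any, Callable, Dict

# Illustrative operator table: each entry compares a record's field to a value.
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "eq": lambda field, value: field == value,
    "geq": lambda field, value: field >= value,                 # integers
    "in": lambda field, value: field in value,                  # field is one of a list
    "contains": lambda field, value: value in field,            # arrays or text
    "startsWith": lambda field, value: field.startswith(value), # text
}


def matches(record: Dict[str, Any], condition: Dict[str, Any]) -> bool:
    """Evaluate a single {key, operator, value} condition against one record."""
    operator = condition["operator"]
    negate = operator.startswith("!")  # "!contains" means "does not contain"
    result = OPERATORS[operator.lstrip("!")](record[condition["key"]], condition["value"])
    return not result if negate else result


# Hypothetical record with an integer, an array, and a text field.
record = {"BotScore": 12, "BotTags": ["verified", "ml"], "ClientRequestHost": "test.example.com"}

print(matches(record, {"key": "BotScore", "operator": "geq", "value": 10}))
print(matches(record, {"key": "BotTags", "operator": "!contains", "value": "verified"}))
print(matches(record, {"key": "ClientRequestHost", "operator": "startsWith", "value": "test"}))
```

Keeping the operators in a table makes it easy to see which comparisons make sense for which field type.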
Awesome.
Yeah, so one benefit of filtering is obviously that you only get the logs that you're interested in, but you can also create multiple jobs for different things.
That makes your analysis much easier too.
And then for forensics, you might set up a job with a broader filter that stores a larger amount of logs, but maybe in colder storage.
So with the ability to filter, you can separate your jobs based on what's important to you.
And I think that's really going to be very beneficial if you're processing a lot of logs.
Yeah. And here's an example of using multiple conditions. I put hostname starts with test, to match subdomains starting with test, for example. Then you can specify request method equals, and it populates the possible options for you, so I can specify, for example, all the POST requests where the hostname starts with test.
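The two conditions from this demo (hostname starts with "test" and request method equals POST) could be written down as a single filter expression. The sketch below builds one as JSON; the `where` / `and` / `key` / `operator` / `value` layout and the field names are assumptions modeled on the dashboard choices described here, so confirm the exact format in the developer docs before using it.

```python
import json


def condition(key, operator, value):
    """One field comparison, e.g. hostname startsWith "test"."""
    return {"key": key, "operator": operator, "value": value}


def all_of(*conditions):
    """Combine conditions so that every one of them must match."""
    return {"where": {"and": list(conditions)}}


filter_expression = all_of(
    condition("ClientRequestHost", "startsWith", "test"),
    condition("ClientRequestMethod", "eq", "POST"),
)

# Compact JSON string, the form in which a filter would typically be submitted.
filter_json = json.dumps(filter_expression, separators=(",", ":"))
print(filter_json)
```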
Is there a limit to how many filters you can add for a single job?
Yes. You can set up a lot of fields and operators, but there's a limit of approximately 30 tokens and 1,000 bytes, I believe.
We prevent huge filters just to reduce the load, but it's very flexible and you can set up a dozen fields, as long as it's within the limit.
Yeah. But I think if you have that many fields, you probably won't catch anything anyway.
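A filter can be checked against these approximate limits before it is submitted. In the sketch below, the 1,000 byte figure is applied to the serialized JSON, and the "about 30 tokens" figure is approximated by counting individual conditions; the talk does not define what a token is, so that mapping is an assumption, as is the filter layout.

```python
import json
from typing import Any, Dict, List

MAX_FILTER_BYTES = 1000  # approximate figure quoted in the talk
MAX_CONDITIONS = 30      # stand-in for "about 30 tokens"; the mapping is an assumption


def count_conditions(node: Dict[str, Any]) -> int:
    """Count leaf conditions in a nested and/or filter tree."""
    if "key" in node:
        return 1
    return sum(count_conditions(child)
               for children in node.values()
               for child in children)


def check_filter(filter_expression: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the filter looks acceptable."""
    problems = []
    size = len(json.dumps(filter_expression, separators=(",", ":")).encode("utf-8"))
    if size > MAX_FILTER_BYTES:
        problems.append(f"filter is {size} bytes, limit is about {MAX_FILTER_BYTES}")
    conditions = count_conditions(filter_expression["where"])
    if conditions > MAX_CONDITIONS:
        problems.append(f"filter has {conditions} conditions, limit is about {MAX_CONDITIONS}")
    return problems


small_filter = {"where": {"and": [
    {"key": "ClientRequestHost", "operator": "startsWith", "value": "test"},
    {"key": "ClientRequestMethod", "operator": "eq", "value": "POST"},
]}}
print(check_filter(small_filter))
```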
Yeah. Cool. So let's move on. The next thing we want to talk about is monitoring the health of your Logpush jobs and getting analytics on them.
The reason we added this is because, like we said before, logs are extremely critical to your daily operations.
Even though it's not something that you look at all the time, it is something that you want to work reliably 100% of the time.
And if something does go wrong, you want to know about it. You don't want to wake up on a Sunday getting paged and find out that your logs haven't been pushed to your SIEM for the past couple of days.
And then you have no way to debug the incident that's going on.
Previously, in our Logpush system, there was no way to know if something was going wrong.
You would basically just have to monitor it yourself and look at your destination.
For example, in S3, you'd look and see that no new files have been pushed for the last six hours or something like that.
And that's a huge problem for a lot of customers.
We do get a lot of requests to improve on that. What we've done now is add an alerting system for your Logpush jobs.
It's integrated into the dashboard. You can set up these notifications and get an email, for now, when your Logpush jobs are failing.
When a Logpush job has been failing for 24 hours, we normally disable it and then send you an alert so that you can look at it.
I mean, Logpush jobs could fail for many reasons.
It could be network issues. It could be that someone deleted the bucket in S3, or someone changed the credentials, rotated the keys, or changed permissions on them.
But normally you wouldn't know about this until you get this kind of alert.
And once you get these alerts, you can go look into them, see what the problem is, and fix it.
So I can show a quick demo of how you would set up one of these alerts.
One second.
Sorry.
Okay.
Is my screen showing? Yes. Okay. So in your normal Cloudflare dashboard, when you go to your account, there's a Notifications section here.
Go here and then you can click on add.
And one of them is Logpush: a failing Logpush job that's been disabled.
So anytime we fail to push logs to you for 24 hours, the job will be disabled and you'll get an alert.
So we can set one up here. The notification name and description are up to you.
They're just for you to recognize what the alert is; they're not important.
The main thing that's important is the email recipient.
And you can use any email here. You can add multiple recipients.
And once you hit save, this will be in effect.
And whenever something goes wrong and you do get an alert, it will look like the following.
Let me reshare it. So you will get an email that looks like this telling you a job has been failing and it's been disabled.
It has the job ID and the destination name. Usually the destination name is what's most important in figuring out which job it is, because you could have many zones on Cloudflare.
You could have many log push jobs, but most likely they'll be pushing to different destinations or different bucket names.
And using the destination name, it will be much easier for you to figure out which one is failing and go in there and figure it out.
And that's what an email looks like.
We hope it will be beneficial, and we definitely recommend setting this up for every single job that you have.
Cool.
And aside from alerting, the other thing that we added was an API around the performance and analytics regarding your log push jobs.
A lot of times, customers are unaware of how many logs they're actually getting,
in terms of the number of bytes and the number of records being pushed.
And it's good to have this kind of API because now you can set up metrics and set up dashboards.
You can pull in this data, create dashboards, and you can also monitor the health of these things.
That way you don't have to wait on an alert from Cloudflare.
You know right away if things have stopped pushing, if the bytes have dropped, or if the number of failed pushes has started increasing.
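A monitoring script built on this kind of data might decide when to raise its own alert along the following lines. This is a sketch only: it assumes the metrics arrive as rows with a `count` and an HTTP-like `status` under `dimensions`, and the 5% failure threshold is an arbitrary choice for illustration.

```python
def failure_ratio(groups):
    """Share of pushes in the window whose status was not 200."""
    total = sum(group["count"] for group in groups)
    failed = sum(group["count"] for group in groups
                 if group["dimensions"]["status"] != 200)
    return failed / total if total else 0.0


def needs_attention(groups, max_failure_ratio=0.05):
    """Flag a job when nothing was pushed at all, or too many pushes failed."""
    if not groups:
        return True  # no pushes in the window: things may have stopped
    return failure_ratio(groups) > max_failure_ratio


# Hypothetical rows for one job over the last 15 minutes.
window = [
    {"count": 90, "dimensions": {"status": 200}},
    {"count": 10, "dimensions": {"status": 403}},
]
print(failure_ratio(window), needs_attention(window))
```

Treating an empty window as a problem covers the "things have stopped pushing" case, which a pure failure ratio would miss.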
And that's where this API comes in as part of our GraphQL API.
If you've ever used our GraphQL API before, it's exactly like that.
It's just a different node on there. But if you haven't, then please visit our developer docs.
They're very useful, with a lot of examples that you can use.
You can basically just copy and paste, add in your zone tag and API key, and get it working very quickly.
Next, I want to show a little bit of a demo for using the API and what you can do with the API.
Is that too big?
Too small? That's good. Right. So I'm showing a demo in the GraphiQL client here, but you can also do this using curl.
And I'll show that later. The new node that we've added to our GraphQL API is this logpushHealthAdaptiveGroups node.
The reason I use the GraphiQL client here is so you can hover over this and look at the documentation for it.
So you click on this and you can see the kinds of things that you can pass along to filter and get the metrics you want.
There's average, count, and sum, and then certain dimensions that you can filter by.
Date, datetime. You can look at the last 15 minutes, last five minutes, hour, or minute, and the destination type.
So, for example, if you have different destinations for different Logpush jobs, you can filter specifically for that.
S3 or GCS, which is Google Cloud Storage here. Then job ID and status.
Status is probably pretty important, and something that you want to monitor if you're going to use this.
For example, you can get all the jobs, or how many times they've been failing in the last 15 minutes or hour.
So you can say status not equal to 200, if you're using a destination that's HTTP based.
So here I have an example query.
You can filter and specify your zone tag. And if any of this looks weird to you, definitely check out the developer docs.
It's very helpful.
So you can specify your filter here. I'm specifying datetime greater than September 15, which is roughly the last seven days.
And the destination is R2, and the status is 200.
So basically all the successful ones. And give me the bytes, bytes compressed, and the number of records that were pushed.
So you run this, and you get exactly what you're asking for.
So these are all successful jobs, with the total bytes, total compressed bytes, and total records that were pushed.
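A query along the lines of the one described in this demo, plus a small helper that totals the result, might look like the sketch below. The node name comes from the talk; the exact field names (`bytes`, `bytesCompressed`, `records`), the filter keys, the response shape, and the sample numbers are assumptions, so compare them against the schema documentation in the GraphiQL client.

```python
# Query text as it might be pasted into a GraphQL client.
# Field and filter names are assumptions based on the talk.
QUERY = """
{
  viewer {
    zones(filter: {zoneTag: "YOUR_ZONE_TAG"}) {
      logpushHealthAdaptiveGroups(
        filter: {datetime_gt: "2022-09-15T00:00:00Z", destinationType: "r2", status: 200}
        limit: 10
      ) {
        sum { bytes bytesCompressed records }
      }
    }
  }
}
"""


def total_pushed(response):
    """Add up bytes, compressed bytes, and records across all returned groups."""
    totals = {"bytes": 0, "bytesCompressed": 0, "records": 0}
    for zone in response["data"]["viewer"]["zones"]:
        for group in zone["logpushHealthAdaptiveGroups"]:
            for name in totals:
                totals[name] += group["sum"][name]
    return totals


# Made-up response in the assumed shape, for illustration only.
sample_response = {"data": {"viewer": {"zones": [{"logpushHealthAdaptiveGroups": [
    {"sum": {"bytes": 5000, "bytesCompressed": 1200, "records": 40}},
    {"sum": {"bytes": 3000, "bytesCompressed": 800, "records": 25}},
]}]}}}
print(total_pushed(sample_response))
```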
And definitely play around with this because there's a lot here that you can use. I'm sure you can make at least a few useful graphs and alerts to add to your monitoring system, which would be very helpful.
Yeah, so I can show the part about using curl for that.
I mean, it's very much the same.
You basically have to format your JSON a little more specifically and pass in variables to curl.
This is also in the Cloudflare developer docs, so it's very straightforward.
It's the same query, but I'm passing in variables here just because of the string interpolation in the terminal.
And then you run that and you get the same result here.
So, something like this is very easy to add to scripts and alerting systems.
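For a script, the same request can be assembled in Python instead of curl, with the zone tag and start time passed as GraphQL variables rather than interpolated into the query text. This sketch only builds the request; the variable type names (`string`, `Time`), the field names, and the helper names `build_request` and `run` are assumptions to verify against the developer docs.

```python
import json
import urllib.request

GRAPHQL_ENDPOINT = "https://api.cloudflare.com/client/v4/graphql"

# Variable types and field names are assumptions; check them against the schema.
QUERY = """
query ($zoneTag: string, $since: Time) {
  viewer {
    zones(filter: {zoneTag: $zoneTag}) {
      logpushHealthAdaptiveGroups(
        filter: {datetime_gt: $since, destinationType: "r2", status: 200}
        limit: 10
      ) {
        sum { bytes bytesCompressed records }
      }
    }
  }
}
"""


def build_request(api_token, zone_tag, since):
    """Build (but do not send) the POST request carrying the query and its variables."""
    payload = json.dumps({
        "query": QUERY,
        "variables": {"zoneTag": zone_tag, "since": since},
    }).encode("utf-8")
    return urllib.request.Request(
        GRAPHQL_ENDPOINT,
        data=payload,
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def run(request):
    """Send the request and decode the JSON response (needs network and a valid token)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


request = build_request("YOUR_API_TOKEN", "YOUR_ZONE_TAG", "2022-09-15T00:00:00Z")
print(request.get_method(), request.full_url)
```

Passing values as variables avoids quoting problems, which is the same reason the terminal demo used them.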
Cool.
I think that's all we wanted to show, so maybe let's wrap it up. We hope that these features are very useful to you, and definitely take advantage of them.
I think filtering is a very big thing that most users should use because, with logs in general, no one looks at all of them.
No one wants trace logs unless you're debugging in a certain specific environment.
Most of the time you're interested in very specific conditions: errors, failures, bots detected, and so on.
And it would definitely reduce your infrastructure down to a fraction of what's currently needed, and also reduce your costs.
And if you're on S3, definitely check out R2 for a much improved pricing experience.
So, do you want to add anything to that? No, yeah, I think it's a very neat feature.
Please play around with it. There's documentation on the filtering and on GraphQL in the developer docs as well.
So, there's a lot of resources there.
So, yeah, have fun and enjoy. Yep. And for the alerting, right now you definitely have to set it up manually.
In the future, we might set it up by default for you. But for now, you have to set it up manually.
Cool. And thanks for tuning in and catch you in the next session.
Thank you.