Actionable Decisions via Insights and Logs
Presented by: Filipp Nisenzoun, Frank Schlesinger
Originally aired on February 13, 2021 @ 11:30 PM - 12:00 AM EST
Best of: Cloudflare Connect 2019 - London
A session on leveraging your data to make better decisions, presented by:
- Filipp Nisenzoun - Product Manager, Data & Analytics, Cloudflare
- Frank Schlesinger - CTO, orderbird AG
Transcript (Beta)
Hi, everyone. Thank you for being here. I'm Filipp, the product manager for data and analytics, just here on the left-hand side of the slide.
And I'll soon be joined by Frank Schlesinger, who's the CTO of Orderbird.
Orderbird is one of our customers in Germany who's focused on providing point-of-sale systems to independent restaurants.
So for this presentation today, I wanted to start with some background on data at Cloudflare for those of you that may not be familiar.
I'll talk a little bit about recent customer-facing products that we shipped.
And then Frank will talk to you about how Orderbird has used Cloudflare data to make decisions regarding their firewall strategies, scale of their infrastructure, and some other factors.
And then at the end, we'll close by looking ahead at our roadmap, share a little bit about where we're going, and then also have time for questions.
So hopefully you'll come away from this session with a good sense of what Cloudflare data is available and how you can use it to make business decisions and, hopefully, make your business more successful.
So let's begin. If I can just advance the slides.
Apologies.
Okay.
Let's see if that works. Yeah. All right. Hopefully that works. All right. So what does the data team at Cloudflare do?
So when one of our customers signs up for Cloudflare, we have the frontline view of their traffic.
So we need to help them understand where their requests are coming from, what their end users are seeking, and what Cloudflare is doing on their behalf.
Essentially, are we behaving as customers would expect?
So we collect data from our global network, and then we make it available as request logs, just raw data that customers can consume.
And then we also process that data and turn it into analytics that are then available using our API or our dashboard UI.
So this is challenging because of our network scale, which is already very large and growing as both the number of our customers grows and as the reach of our network expands.
So about three years ago, we had about 4 million requests per second going through our global network.
And the request per second is how we measure the scale of our traffic.
That grew about 70% within the next year and a half to 7 million.
And then another 40% to the current peak of about 12 million per second right now.
Just for context, if you compare this to what some of the largest Internet properties would see, they would get about a few million requests per minute.
So our volume is about 60 times greater.
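The "60 times" figure follows from putting a per-second rate and a per-minute rate on the same unit. Here is a minimal sketch of that arithmetic; the per-minute number is a placeholder of the same magnitude chosen for illustration, not a figure from the talk.

```python
SECONDS_PER_MINUTE = 60


def per_minute_to_per_second(requests_per_minute: float) -> float:
    """Convert a per-minute request rate to a per-second rate."""
    return requests_per_minute / SECONDS_PER_MINUTE


def volume_ratio(ours_per_second: float, theirs_per_minute: float) -> float:
    """How many times larger our per-second rate is than theirs."""
    return ours_per_second / per_minute_to_per_second(theirs_per_minute)


# 12 million requests per second is the peak quoted in the talk.
cloudflare_peak_rps = 12_000_000
# Assumption: a large property receiving the same count, but per minute.
large_property_rpm = 12_000_000

print(volume_ratio(cloudflare_peak_rps, large_property_rpm))
```

With equal counts, the ratio is simply the number of seconds in a minute.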
So it's great, we collect so much data in such scale, but what do customers want it for?
What problems are we trying to help them solve? As you might expect, one of the main things is they're just looking to use our data to tune how they've configured Cloudflare.
So for example, they may want to adjust which assets are cached or which IPs they're blocking.
And over time, they'll want to run reports to understand what value Cloudflare has provided.
So for example, how many threats have we blocked over the last several months?
And then they'll want to use data at a particular moment in time to understand how our network is performing.
So for example, they may want to trace and debug a request from our network to their origin.
And this is particularly to solve problems like their users reporting that they're seeing errors when they load their website or applications.
So as we've asked customers how we can help solve these problems, we find that they'll often be using third-party tools and infrastructure and have their own processes already in place.
So in that case, they're just looking for us to collect the data and then make it easy for them to get.
But sometimes they'll want us to do the processing and they just want us to provide a tool, especially a visual UI, where they can easily explore the data for themselves and just solve problems.
So we've followed both of these approaches. And here in the screenshots, I'm showing the same data available using Sumo Logic, which is one of our partners that you'll see in the hall.
This is showing firewall events.
And then this is the same firewall event information in our own dashboard. So in each one, you can find a particular request, understand where attacks are coming from, explore, filter, and so on.
Now, interestingly, there isn't necessarily a strong correlation between the size and technical ability of a customer and whether they want to do it for themselves.
We get plenty of large, highly technical customers that do want us to provide tools, and plenty of smaller customers that are looking to do it for themselves, which has made it important that we stay on both of these tracks.
So we've shipped a number of products so far this year in both of these categories.
Under third-party tools and infrastructure, early in the year, we released LogPush, which makes it easier for customers to get their logs from us to their cloud storage provider, so someone like Google Cloud Storage, for example.
We've also worked on analytics and log partnerships, which I mentioned, such as the Sumo Logic dashboard that you saw on the previous slide.
Through those, we provide pre-built dashboards and some direct analytics.
And then finally, in beta, we've released the GraphQL Analytics API.
So this is our new API for analytics that's especially powerful and flexible and allows you to query multiple datasets through one API.
So this API has underpinned the release of Firewall and DDoS analytics by our Firewall team.
And through those, you're able to explore Firewall events, filter, really understand in depth how your Firewall is working.
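To make "multiple datasets through one API" concrete, here is a hedged sketch of one request that asks for firewall events and hourly request totals together. The endpoint, the dataset names (`firewallEventsAdaptive`, `httpRequests1hGroups`), the field names, and the variable types are written from memory of Cloudflare's public GraphQL documentation and may differ from the beta shown in the talk, so check them against the live schema. The token and zone tag are placeholders.

```python
import json
import urllib.request

# Assumed endpoint and schema names; verify against the current documentation.
GRAPHQL_ENDPOINT = "https://api.cloudflare.com/client/v4/graphql"

QUERY = """
query FirewallAndTraffic($zoneTag: string, $since: Time, $until: Time) {
  viewer {
    zones(filter: {zoneTag: $zoneTag}) {
      firewallEventsAdaptive(
        filter: {datetime_geq: $since, datetime_leq: $until}
        limit: 10
        orderBy: [datetime_DESC]
      ) {
        action
        clientCountryName
        clientIP
        datetime
      }
      httpRequests1hGroups(
        filter: {datetime_geq: $since, datetime_leq: $until}
        limit: 24
      ) {
        dimensions { datetime }
        sum { requests cachedRequests }
      }
    }
  }
}
"""


def build_request(api_token: str, zone_tag: str, since: str, until: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the query and its variables."""
    payload = {
        "query": QUERY,
        "variables": {"zoneTag": zone_tag, "since": since, "until": until},
    }
    return urllib.request.Request(
        GRAPHQL_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder credentials; nothing is sent over the network here.
request = build_request("YOUR_API_TOKEN", "YOUR_ZONE_TAG",
                        "2019-05-01T00:00:00Z", "2019-05-02T00:00:00Z")
print(request.get_method(), request.full_url)
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen` and decoding the JSON body; that step is left out so the sketch runs without credentials.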
So we're pretty proud of everything that we ship, but of course, the key thing is, is it working?
Is it actually addressing customer problems? To help answer that, I wanted to take a look at a case study for Orderbird, which I mentioned previously.
So Orderbird provides point-of-sale solutions to independent restaurants, and their mission is to make those restaurants more successful.
To tell you more about what they do and how they're using Cloudflare data, I'm excited to welcome Frank Schlesinger, Orderbird CTO.
Orderbird was founded about eight years ago in Berlin, Germany, with this purpose of helping independent restaurants become successful.
You heard that our core product is a point-of-sale system, which sounds pretty boring, but the purpose is not boring at all, at least not to me.
If you think about independent restaurants, those are created or founded by people who want to cook some cool food, serve customers, guests, have some concept in mind that they want to implement, how to run a good restaurant.
They don't understand technology at all, generally speaking, and most of them don't even like technology.
And the pity here is that they miss out on a lot of the upside of technology.
If you think about the bigger chains like the Starbucks and McDonald's in the world, they all invest literally billions of whatever currency you think of in technology to make sure that they gain some efficiencies, to make sure that they provide the same experience to their guests all over the world, and in the end to get some nice turnovers from that.
This is not what our customers are thinking about. So eight years ago we at Orderbird thought: we understand technology, so we need to give our independent restaurants technology and help them benefit from it.
So, just to give you an impression of what it actually looks like: there's an iPad, a normal iPad, I think. You can basically use any iOS device. Usually you have the iPad on the desk and iPod Touch or iPhone devices at the table.
And this slide is only important to show you that what we're doing is serious.
So we have over 10,000 customers by now, in total having 18,000 devices which are running, and those need to be running all the time.
But that wasn't always the case, because we got into trouble, and I think this is a very typical story of how most of you got in touch with Cloudflare at some point.
For us it was 2016 when a heavy DDoS attack brought us nearly out of business as a company.
What happened was that, especially during Christmas time of 2016, our back end was DDoSed and it was really down for a couple of days.
And for a restaurant, Christmas is usually a very important time.
You have all the bookings, you prepare for the festivities, and they couldn't use their point of sale at all for a couple of days.
So as a result, we lost a significant amount of our customer base back then.
Even the customers which decided to stay with us wanted some cash back on that, so for sure.
But the most tragic part was that it took us about two years to repair our image.
So for the next two years, our sellers would always be confronted with, yeah, I like your features, and you're cheap, and it's easy to use, but you're not stable.
And the competition really was bragging around how stable they were because they just weren't DDoSed, but we were.
So what we did back then was that we got some help.
First of all, we moved our back end, which was data center operated, into the cloud.
We improved our offline capability of the point of sale, so at least for a limited amount of time, some functionality would keep on working.
And we got in touch with Cloudflare and put it in front of our back end, like opening an umbrella.
And then for the next two years, we didn't even have to think about Cloudflare anymore.
There was just no attack visible to us, nothing which had any customer impact.
And when I joined the company, like two and a half years ago as a CTO, for sure I looked through all the vendors and prioritized when to engage with each vendor.
And only recently, a couple of months ago, I was approached by my key account manager, Valentin, telling me about all the new cool things at Cloudflare, and that was when I realized I have that thing.
And it apparently works very well for us.
So, talking about new features: the security thing just works, as I told you. What I found pretty interesting a couple of months ago was everything about data insights.
I'm personally pretty data driven, or so I believe, at least I want to be data informed.
And so I looked around in the platform and was immediately impressed with all the in-platform dashboards and analytics that you can see.
Basic things like threats over the course of a couple of weeks, and oh here was a peak and I can drill down, and I can see over the course of a week the ups and downs of requests day and night.
I have this kind of world map showing me that I have no idea why I have so many North American requests, because I don't have a single North American customer.
So I better start looking around what's going on in my data.
And that is exactly what I did.
This is what we use at Orderbird to collect and present data from various sources.
In the end it's a Datadog dashboard, but that doesn't even matter.
For us it's just a way to present the data and just briefly show you what we do.
Basically, the columns are always time ranges: this is always last hour, this is always last day, last week, last month, last year, you get the idea.
And in the rows you have some sort of service.
So you could see for instance, ah here are the requests coming from Cloudflare as they report them to us.
Last hour, last day, last week, last month, you get it.
And this was very easily done because there is an integration provided by Cloudflare. If you're a Datadog customer, or a customer of one of the others that Filipp pointed out, they have those integrations and you can just click together your dashboards and see the data.
Let me show you what I read in those things and how useful I find it.
So zoom into that box. This is basically all the requests coming through the Cloudflare front over a course of a 24-hour day.
And immediately what I can see here is that at peaks I get about 200 requests per second, and even at night time it's still 20 requests per second.
I can see when the peak is: it's the afternoon, I guess lunchtime, and traffic stops at around, I think, 10 p.m. So a safe time for deploying is in the middle of the night, things like that.
I can also easily see, on a particular day, that if my graph doesn't look like that, something's wrong.
So if I have a peak in the middle of the night, something's wrong.
It's just what it gives me immediately. One other thing, if I take the yearly graph, it's not filled yet because I just started doing this a couple of months ago, but you see the purple line there, it's a trend line.
And it gives me a very rough estimate on how my overall traffic develops, which I can use to answer questions about our scaling strategies or even do such things like put a number into my cloud budget for next year.
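A trend line like the purple one can be approximated with an ordinary least-squares fit. The sketch below is an illustration under that assumption, not necessarily how Datadog draws its trend line, and the monthly figures are invented sample data.

```python
def fit_trend(values):
    """Least-squares line through (0, values[0]), (1, values[1]), ...

    Returns (slope, intercept).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points to fit a trend")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def project(values, periods_ahead):
    """Extrapolate the fitted line periods_ahead steps past the last point."""
    slope, intercept = fit_trend(values)
    return intercept + slope * (len(values) - 1 + periods_ahead)


# Invented monthly request totals, in millions.
monthly_requests = [100, 110, 120, 130]
print(project(monthly_requests, periods_ahead=2))
```

A straight line is the roughest possible model; it ignores seasonality such as the Christmas peak mentioned earlier, so it is only good for the kind of ballpark budgeting described here.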
I have to. So I have some data to prove that or make a point.
One more interesting thing happened lately, two weeks ago, and it will be those three red boxes which I zoom in now.
And what it shows you is basically a last-month view of three metrics: the requests of last month, the traffic by country of origin on a 100% stacked chart for last month, and cached versus uncached for last month.
And what you can see, I also drew a thin black line there, there was a change.
So around the 7th, 8th of May, you can see that, first of all, the total number of requests went up since then.
You can see that some new country source was requesting us, like the yellow spikes on each day.
And you can also see that the amount of uncached traffic is getting lower.
So I know what happened. But do you have any idea what could have happened? Okay, then I'll tell you: a couple of weeks before that, we had developed and deployed new services.
And at that point in time, at that day, we also opened the Cloudflare umbrella over those new services.
So the whole backend had always been protected, or at least for the last two years, but we launched an API and we launched a new backend microservice.
And on the 8th of May, we said, we want to have Cloudflare protection for those two.
And this changed the picture. Clearly, requests, sure, more requests are collected.
But also, what I could see now is that for whatever reason, those new services, the API and the other one, which is called the cashbook backend, were getting requests from a new country source.
And actually, this yellow country source is XX, which means coming from the Tor network in the end.
So someone was continuously botting, attacking, probing me, my endpoints from the Tor network.
I wasn't even seeing that before I opened the Cloudflare umbrella.
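The 100% stacked country chart boils down to computing each country's share of requests. Here is a small sketch over log records; `ClientCountry` is written from memory of the LogPush HTTP request fields, and the sample records are invented.

```python
from collections import Counter


def country_share(records):
    """Return each country's percentage of all requests, largest first."""
    counts = Counter(r.get("ClientCountry", "unknown") for r in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {country: 100.0 * n / total for country, n in counts.most_common()}


# Invented sample: mostly German traffic plus one unexpected source.
sample = [
    {"ClientCountry": "de"},
    {"ClientCountry": "de"},
    {"ClientCountry": "de"},
    {"ClientCountry": "xx"},
]
print(country_share(sample))
```

A country code that suddenly takes a visible share where there are no customers is exactly the kind of signal described above.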
But to put that into comparison, this is also a Datadog dashboard, which just shows you the load balancers, the AWS load balancers that we're using.
So this is data from like sort of the inside where Cloudflare is sort of the outside.
And there's one particularly interesting thing here. Same black line on the same date: you see the errors counted by the ELB, all the 400 and 500 errors, and on that day they stopped.
So there was some malicious traffic coming in all the time, probing the load balancer, maybe requesting some URLs that are not registered there, resulting in errors, noise, nothing severe happened, right?
Maybe we were lucky, maybe they didn't find a weakness.
But in the end, there was always noise.
And I found it very hard to set up an alarm threshold, when the picture is there.
And since the 8th of May, I have now the noise at Cloudflare where I want to have it.
So basically, this is what you see, nothing here anymore.
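Why constant noise makes an alarm threshold hard to set can be shown with a simple rule: alert when the error count exceeds the baseline mean plus a few standard deviations. This rule and all the numbers are assumptions made for illustration, not Orderbird's actual alerting setup.

```python
from statistics import mean, pstdev


def alarm_threshold(baseline_counts, k=3.0, floor=1.0):
    """Threshold = baseline mean + k standard deviations, never below floor."""
    return max(floor, mean(baseline_counts) + k * pstdev(baseline_counts))


def breaches(counts, threshold):
    """Indexes of the intervals whose error count exceeds the threshold."""
    return [i for i, count in enumerate(counts) if count > threshold]


# Invented error counts per interval at the load balancer.
noisy_baseline = [40, 55, 30, 70, 45, 60]   # constant probing noise
quiet_baseline = [0, 1, 0, 0, 2, 0]         # noise absorbed upstream

incident = [0, 1, 25, 0]                    # a small but real error burst

print(breaches(incident, alarm_threshold(noisy_baseline)))  # burst hidden in noise
print(breaches(incident, alarm_threshold(quiet_baseline)))  # burst stands out
```

With the noisy baseline the threshold lands well above the burst, so nothing fires; with the quiet baseline the same burst is flagged.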
And maybe the last thing that I want to show you is that Cloudflare also produces logs for everything that happens, basically for every request.
And also there, you can basically integrate it to whatever log facility you have.
Here it's again Datadog logging, because we have the tool. And so you can then very easily follow up on some metric, some spike, whatever, some error, and drill down into the log and really look into what's going on.
It's pretty convenient to have that.
We have done that with a Lambda. Filipp mentioned that Cloudflare is pushing logs to S3.
This is where we pick them up in the end, and push them to the next API.
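The pick-up-and-forward step could look roughly like the sketch below. It assumes the pushed objects are gzip-compressed, newline-delimited JSON and that the function is triggered by an S3 event notification. `fetch_object` and `forward` are hypothetical callables injected for illustration, so the sketch runs without AWS; this is not Orderbird's actual Lambda.

```python
import gzip
import json
from urllib.parse import unquote_plus


def parse_logpush_object(body: bytes):
    """Decode one gzip-compressed, newline-delimited JSON object into records."""
    text = gzip.decompress(body).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def handle_s3_event(event, fetch_object, forward):
    """Read every object named in an S3 event and hand its records to forward().

    fetch_object(bucket, key) -> bytes and forward(records) are hypothetical
    hooks: one reads from storage, the other ships to a logging service.
    """
    forwarded = 0
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        entries = parse_logpush_object(fetch_object(bucket, key))
        forward(entries)
        forwarded += len(entries)
    return forwarded


# Local demonstration with an in-memory "bucket" and invented log lines.
lines = [
    {"ClientIP": "203.0.113.7", "EdgeResponseStatus": 200},
    {"ClientIP": "198.51.100.9", "EdgeResponseStatus": 502},
]
fake_body = gzip.compress("\n".join(json.dumps(l) for l in lines).encode("utf-8"))
event = {"Records": [{"s3": {"bucket": {"name": "logs"}, "object": {"key": "2019/05/08/a.log.gz"}}}]}
shipped = []
print(handle_s3_event(event, lambda bucket, key: fake_body, shipped.extend))
```

In a real Lambda, `fetch_object` would wrap an S3 read and `forward` would post to the log vendor's intake API; both are left abstract here because the talk does not describe them.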
If you want to learn more about how we did the dashboards and how we did the Lambda, whatever, approach me.
It's not what I want to show you here. I just want to motivate you on the use cases that we did.
And I think that's it. All right.
Thank you very much, Frank. So we just have about 10 minutes. I wanted to briefly look ahead, and then we'll move to questions.
So primarily, I wanted to share a roadmap, some of the things that we're planning in the near term, and then some other ideas that we're thinking about.
And here, I'm following the same framework as you saw earlier.
So in terms of third-party tools and infrastructure, we're going to be GAing the GraphQL Analytics API that I mentioned.
And as part of that, we'll make available user and account scope querying.
So this is something that a number of customers have requested.
It would allow you to get data for a number of your zones or domains at once, and then also roll up your data at the account level.
And that API would then underpin ongoing developments to firewall analytics.
So the team has made a number of updates and will continue to do so. And then we'll be following that model to release more product-specific analytics.
So cache, load balancer, products like that will get their own analytics as well.
We'll also do a redesign of web traffic analytics to provide these account-level views in our dashboard in addition to the API.
And then we'll continue to be developing logging, and especially log push in a number of ways.
So one, we'll expand the number of cloud providers supported.
With Sumo Logic, we'll complete an integration this month that would basically provide an end-to-end integration.
So you can push your logs to Sumo Logic and then use one of our pre-built dashboards to visualize them.
So that should make it really easy if you use that tool.
We'll also be adding support for Azure. And with that, across Azure, AWS, and Google Cloud, we'll have coverage of the major cloud platforms.
Then we also want to expand LogPush to be more of a platform.
So right now, as Frank mentioned, you can get your request logs there.
But if you're using a product like Access, which you might have heard us mention, it's actually not tied to requests.
And we want to still make those logs available for you. So through LogPush, you can just configure what datasets you'd like and get those logs.
And then finally, another thing that's been requested for a little while, we want to add support for custom fields.
So within our logs, we want to allow you to enrich our logs with your information, perhaps a header that we don't log or a custom header that you pass like a cookie.
Then, some of the things that we're considering: within LogPush, we also want to allow log filtering.
So rather than getting all of your logs or just selecting certain fields, you can filter on particular events.
For example, when you've been attacked by a particular IP or the logs only for certain errors, if you want to get those.
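Until filtering exists on the LogPush side, the same selection can be done after the logs arrive. Here is a minimal client-side sketch; the field names `ClientIP` and `EdgeResponseStatus` are written from memory of the HTTP request log fields, and the records are invented.

```python
def filter_logs(records, client_ip=None, min_status=None):
    """Keep records matching every condition that was given.

    client_ip:  keep only requests from this IP address.
    min_status: keep only responses with at least this HTTP status code.
    """
    selected = []
    for record in records:
        if client_ip is not None and record.get("ClientIP") != client_ip:
            continue
        if min_status is not None and record.get("EdgeResponseStatus", 0) < min_status:
            continue
        selected.append(record)
    return selected


# Invented sample records.
logs = [
    {"ClientIP": "203.0.113.7", "EdgeResponseStatus": 200},
    {"ClientIP": "198.51.100.9", "EdgeResponseStatus": 503},
    {"ClientIP": "203.0.113.7", "EdgeResponseStatus": 500},
]
print(filter_logs(logs, client_ip="203.0.113.7"))
print(filter_logs(logs, min_status=500))
```

Filtering at the source, as proposed in the roadmap, would avoid transferring and storing the records that this sketch throws away.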
And then building on the destination support, we're thinking about how to provide LogPush for other destinations.
So outside of the major cloud providers, what can we do in terms of custom endpoints, either on-prem or S3-compatible endpoints perhaps.
Under the tools, our focus for the future, and this is something that Jen alluded to in her presentation, is really in terms of operational data.
So Frank was showing you a number of dashboards that you can now create within third-party tools.
We want to make those same kind of dashboards available within our tool if you don't want to set up something on your own.
And here, those dashboards will be focused on monitoring and alerting so you can see what's happening right now with your traffic.
And you can also trace requests through our network to your origin. And this is to answer questions like, when one of your end users is reporting a problem, is it a problem with Cloudflare, with your origin, or somewhere else on the network?
So I'll close here. Thank you very much for your time and attention.
And then we now have some time for questions, either for me or for Frank, who I'll invite up on stage as well, on anything at all that you saw here or anything related to data or analytics.
Yes, please.
I'm not sure about it, but I think I've read that LogPush is only available for business or enterprise plans, right?
Yes, I can repeat the question.
So yeah, the question was, is LogPush available only for business or enterprise plans?
Yes, so right now, it is only available for enterprise plans. It's included as part of your enterprise subscription.
We are looking, that's something I didn't have on the roadmap, we are looking to make that available for other plans, probably as a paid add-on.
It's more of a question just structuring the rollout, so we're confident that we can support expanded usage and get the right kind of pricing in place.
But yeah, that will be coming likely later this year.
Anything else at all?
Yes, so question in the back there on the left-hand side. Yeah, just nice graphs.
Always enjoy a good graph. What tooling did you use for those dashboards?
We use Datadog at Orderbird because we have it, and if you start using one of those tools, it makes sense to have it all in one place and set your thresholds and do the alarm management there, so this is what works for us.
Yeah, I guess I kind of went by this quickly on the slide, so the current main third-party platforms that we support are Datadog of course, there's also Sumo Logic, which I mentioned, Splunk, Elastic, and Chronicle Security, which is one of Google's new products.
The other thing that we have available is a Grafana plugin.
Right now it's just for DNS analytics, but we'll be looking to provide that for the GraphQL analytics API, so essentially all the data sets will be available through Grafana as well.
So as far as how we decide those, primarily it's based on what we hear our customers ask for most often, and who's also well-established, has good market share, all those kind of factors.
So if there's any that you're using that you think we don't know about yet, please come speak with me or talk to your account manager or anyone within Cloudflare.
As we mentioned in one of the earlier presentations, we are looking at those feature requests all the time, and we use that as a way to assess who we should be supporting next.
Yes, just in the back there.
Just fill the void, but a bit of a repetition. You're talking about the roadmap to bringing log access, maybe not the full log push suite, to the business tier rather than the enterprise tier.
Let's make a feature request of that.
I'm not necessarily pinning you to the stage, but yeah. You don't have to put the whole thing there, but you could definitely put more into the business plan, access to more data.
I mean, literally the 30-day limit is what anybody gets from dashboard at this point, right?
Yes, it does. Yeah, so the limit does depend on the plan, but yeah, it starts to get limited as you go down from enterprise.
I mean, the way we're looking at log push for other plans right now is that there wouldn't be any limits in functionality.
You should be able to use it in the same way as a customer on the enterprise plan.
All the same datasets should be available, so it's more of a matter of pricing it appropriately.
That's actually something, if you have a view on that, please come talk to me after.
We're thinking about maybe either volume-based in terms of the amount of traffic or maybe request-based, something like that.
Not being an enterprise customer, I've not even ever seen it.
Obviously, volume would be one way to limit, but just the amount of accessible analytics that's offered into the business package is, as I said, tremendously limited at the moment, probably because it doesn't have access to the plan you're talking about, but our limit is around 30 days.
I can see the last 30 days.
I don't think I can see the 31st, so month-to-month comparison is something I have to take off-site and store for myself to do.
Any of that is a good incremental pitch out this way.
Or am I being told I'm just using the dashboard wrong?
No, no, you're totally correct. Yeah. Yeah.
Any other questions? Otherwise, we can end a few minutes early. Thank you very much again for your time.