Cloudflare TV

🔒 Security Week Product Discussion: Cloudflare Observability

Presented by Tanushree Sharma, Natasha Wissmann, Ashcon Partovi, Michael Tremante
Originally aired on 

Join Cloudflare's Product Management team to learn more about the products announced today during Security Week.

Tune in daily for more Security Week at Cloudflare!

#SecurityWeek

Transcript (Beta)

Hello everyone and welcome to Cloudflare TV. My name is Michael Tremante, I'm the product manager here at Cloudflare, and today we're going to talk about our vision for Cloudflare observability.

Of course, I'm not going to be doing most of the talking.

I'm just here to be the host.

I'm joined by some great guests. Natasha, Tanushree and Ashcon.

And before we kick it off, I guess.

Natasha, do you mind quickly introducing yourself?

Absolutely.

Good morning, everybody. My name's Natasha.

I'm the product manager for the Application Services team here at Cloudflare.

And we're the team that owns alerts and notifications.

Awesome. Tanushree, over to you.

Hi everyone.

My name is Tanushree. I'm the product manager for the Logs team. I also cover some of the analytics side of things, and I'm excited to chat about what we're working on and what we want to deliver to our customers.

And then finally, but not least, Ashcon.

Hi everyone.

My name is Ashcon. I'm a product manager on the Workers team and I focus on many of our developer tools.

Great.

Awesome. Welcome, everyone.

So, of course, as many of you know, today is Monday and we're just over Security Week.

And we had the blog post on Cloudflare Observability posted on Saturday. So that's going to be the topic for today.

Before we jump in, a little housekeeping item.

If anyone listening today has any questions, please do send an email to livestudio@cloudflare.tv.

We do see the questions as they come in, and if you do have one, we'll try to answer it towards the end of the session.

And if we don't get to it, we'll follow up offline.

Or if you come up with further questions, of course, reach out to the support team or to your account team and I'm sure they'll be able to answer or forward to us.

With that, let's kick it off.

So observability is a pretty heavily loaded word.

It could mean a lot of things.

But in the context of Cloudflare at a very high level, what is observability?

Tanushree.

Yeah, simply speaking, it just means getting better visibility into things.

I actually like to use an analogy, thinking about something that we use every day, like cars.

When you're in a car, you don't have a lot of visibility, don't have a lot of range of motion when you're sitting in that seat.

And so you have tools to help you like a rearview mirror, like side mirrors, like fancy new gadgets, cameras and sensors and stuff like that.

And that's sort of like, drawing from that, what we want to bring more and more to Cloudflare itself.

Lots of observability tools in the tech industry have their own different philosophies.

Tools like Prometheus and application performance monitoring tools, and we've been discussing recently, how can we bring more of that stuff to Cloudflare?

So at a high level, observability at Cloudflare has three different components.

We have monitoring, we have analytics, and we have forensics, and at Cloudflare we address those things using a few different components, such as notifications, our analytics platform, and logs and tracing.

And today we're going to sort of dive into those and tell you all about how we can tackle observability with those different components that we have at Cloudflare.

And really the vision is we want our customers to be able to see a single pane of glass.

So if you're a customer that's using Cloudflare, you should be able to see everything that's going on with your systems, with our systems in one spot.

Yeah.

And hopefully, you'll see things that you couldn't see before as well, right?

I always... I am a Cloudflare user for my little personal blog, and once I onboarded onto Cloudflare, I got access to all the traffic things that I didn't have access to before.

So it's really, really nice to see how it all comes together. So if I were a customer, then, how do you see this being important for, you know, broadly speaking?

I'm an individual, of course, but we've got customers of all sizes using us, you know, businesses, large enterprises.

Why does this matter to those specific organizations?

Yeah, totally.

So at Cloudflare, we scale everywhere from developers, like one person who's onboarding on their own, to big, big customers with multiple accounts that have hundreds of thousands of domains within them.

And our goal is to be able to make a platform that scales, that provides observability into everything from each single request to the aggregate level.

If you're a big company, you want to see things like where is your traffic coming from?

Which websites are the most popular? And you can kind of tie all of that stuff to revenue as well and track business goals.

So it's not just being able to see visitors or being able to see things that are blocked.

It's also being able to draw conclusions from that raw data to actually impact your business and to...

we want our customers to grow as well. So yeah, the goal is to provide observability from everywhere, from one person to hundreds of large corporations.

We have a lot of teams at Cloudflare that also use our products, and it's pretty cool to chat with the security team, for example, and see how they use our Cloudflare Access logs.

Or recently, there was a very interesting use case where we have our developer docs that our product content team runs and they have migrated to a different backend and they were noticing that their customers were having some issues.

So it's cool that they were able to use our analytics to figure out where the 404 errors were coming from, like what are the top URLs?

And then they asked to get access to the logs as well for the developer docs so they could make those tweaks and look into each and every single request that was having issues.

So it was a very cool experience being able to be a part of that and help that team out.
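To make the "top URLs behind the 404s" step concrete, here is a minimal Python sketch. It assumes each log record has already been parsed into a dict with hypothetical `url` and `status` keys; those names are illustrative and are not Cloudflare's actual log schema.

```python
from collections import Counter


def top_404_urls(records, limit=3):
    """Return the `limit` most frequently requested URLs that answered 404."""
    # Hypothetical record shape: {"url": str, "status": int}
    counts = Counter(r["url"] for r in records if r["status"] == 404)
    return counts.most_common(limit)
```

The aggregate answers "which pages are broken?", and the matching individual records are what you would then inspect request by request.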

And actually I have a little fun story.

So of course we've just finished Security Week and anyone who runs a website normally would look at Google Analytics to see which post is performing best or what other traffic we're getting.

But on our side, we don't need to do that. I literally just logged in to the Cloudflare Dashboard.

We have Cloudflare in front of Cloudflare, so very meta and got all of those insights.

And then more importantly, given we're security-related, it was really interesting to see all sorts of traffic and crawlers and things that sometimes, you know, are not necessarily wanted traffic and how we're reacting or mitigating or allowing, etc., etc.

So let's assume...

you mentioned observability has three core components, and kicking off with the first one, monitoring and alerting.

Especially in security, that's super important.

If something's happening, I'm assuming people would want to know.

And Natasha, this is something I think you've been working on.

Do you mind giving us maybe some insights and examples on how that all comes to play in Cloudflare?

Absolutely.

Very near and dear to my heart. So basically, we have all of these great logs, we have all these great analytics and you can see what's happening on your site at any given time.

You can see traffic, you can see errors.

You don't want to sit on the Cloudflare page and click refresh on the dash over and over again.

That's a really terrible experience for you. You want to be told, "Hey, we're seeing something funky, please go to the Cloudflare Dash," and then you can look into it more, so you don't have to sit there all of the time.

So alerts and notifications kind of provide that bridge of saying, "We think something's a little bit weird, we think you should go check on it." There's kind of two different groups of alerts and notifications that we have.

There's something that's very event-driven.

So, for example, your SSL certificate is expiring and you're going to have to go and renew it or your secondary DNS records transferred.

Just so you know, that did happen.

It was successful, so don't worry about it.

So those are very point-in-time, something definitely happened, we can tell you about it.

Our second group of alerts and notifications is "We think something's acting weird." So that's really where we look at time series data.

So any of the analytics pages that we have, they have graphs on them that show, for a period of time, here's all of your traffic or here's exactly what's going on.

And as humans, when we see a spike, we can say, "All right, that looks weird.

Something is definitely happening here.

I should investigate." It gets a little bit more complicated for a computer to say, "That's definitely a spike."

So we have a lot of different ways that we can do that, a lot of different ways that different companies do that.

None of them are perfect, unfortunately.

So we're experimenting all the time with different ways to actually detect spikes in graphs and say like, okay, "This is definitely something that's an issue.

This is definitely abnormal for your traffic.

Please go check it." We can't really say this is exactly what's causing it.

There's a lot of things that could be causing it, but we want to at least kind of try to help you get there and be like, "Here's a jumping off point for your investigation."

Got it.

And then actually to that point, we did recently release some specific alerts related to security, right?

This was actually slightly before Security Week. I think it was towards the end of last year.

Close enough, I guess.

So customers can now create or set up alerts for their firewall events.

Mind giving us a little, second to that point, a little overview of what customers can do now in that context?

Absolutely.

So we have two different types of security events or excuse me, security event alerts, which will track the security events that you have in the security tab on the Cloudflare dash.

So again, that Overview page has a very nice analytics graph that you can look at, see what's going on.

We want to look at that for you.

So the first alert that we have, the security events alert.

That will look at all of your security events overall for whatever zones you select.

So you can select which domains you want to monitor.

And we'll say, "Okay, for each of these domains, we're seeing a spike right now.

Please go check it." Domain selection is really important to our customers because our customers have some domains that they care about a lot more than others.

So if you run a larger enterprise organization, you probably have a test domain that you use to test out all of your various security tools that you have.

You don't care if that domain is getting a lot of security events because you're probably the one that's simulating those.

You're probably the one that's trying to attack your website to see what happens and make sure that your security events work.

You don't want to be alerted on those.

That's very fair.

You only want to be alerted on the domains that you actually care about.

So again, you've got your domain selection. For our advanced security alerts, those are the ones that are available to our enterprise customers.

You can select your domain and you can also select which security tool is the one that is, which service is the one that is blocking those events or is logging those events.

So I can say, "I want my rate-limiting events to be the ones that I'm... that's the one that I care about right now.

I don't actually care about WAF events as much." And so you can select exactly what you want to be alerted on.

Those are the two configurations you have right now.

We want to give you more, right?

So we want to be able to say not only are there zones that you might not care about, there's maybe source IP addresses that you don't care about because again, those are your source IP addresses.

Those are IP addresses you're using to test things.

So if there's a spike in events from those IP addresses, you don't actually want to be alerted based on them.

You don't care about them.

So you want to be able to sort of filter out the events that you have within your alerts similarly to how you can do on the analytics page today, where you can choose certain, you can choose filters on your analytics to see what's happening.

We want alerts to be in a similar place, so we're trying to get feature parity between those two things and give you more customization for all the alerts that you have.
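As a rough illustration of the filtering described here (pick the zones you care about, optionally pick the service, and ignore your own test IPs), the following Python sketch models an alert filter. The class, its field names, and the event shape are all hypothetical; this is not how Cloudflare's alert configuration is actually represented.

```python
from dataclasses import dataclass, field
from ipaddress import ip_address, ip_network


@dataclass
class AlertFilter:
    """Hypothetical alert configuration; every field name is illustrative."""
    zones: set[str]                                   # domains to monitor
    services: set[str] = field(default_factory=set)   # empty set means "any service"
    excluded_sources: list[str] = field(default_factory=list)  # CIDR ranges to ignore

    def matches(self, event: dict) -> bool:
        """Return True when the event should count toward the alert."""
        if event["zone"] not in self.zones:
            return False
        if self.services and event["service"] not in self.services:
            return False
        source = ip_address(event["source_ip"])
        # Drop events that originate from ranges the customer uses for testing.
        return not any(source in ip_network(cidr) for cidr in self.excluded_sources)
```

Only events that pass `matches` would be fed into whatever spike detection sits behind the alert, so a noisy test domain or a known scanner range never triggers a notification.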

Yeah.

And actually, there's a... because when we launched the WAF alerts back in December-ish, just as you said, one of the first things some of the customers I spoke to gave us feedback on was, you know, there are some rules that are allowing traffic rather than blocking.

And currently the way we calculate, or attempt to look for, spikes in charts accounts for any firewall event.

And depending on who you ask, people...

some folks will say allow events are also security events because there's a specific filter that says, I want this traffic to go through.

But if you ask someone else, they might say, well, actually, no, that's an allow event.

So I'm doing a positive security model. But currently all of those events count for alerts.

And exactly as you said, filtering out some things versus others seems to be a common theme moving forward.

Yeah, go ahead.

Kind of goes back to what you were talking about earlier, right, where we have a ton of different types of customers that use Cloudflare and everybody wants something a little bit different because everyone has a different use case and a different security model.

So our customers need the ability to sort of choose what they want to be alerted on.

And what I was saying earlier with there are different types of algorithms that you can use to calculate a spike.

Again, those have positives and negatives depending on who you ask.

Some of our customers want to be able to say, I have very specific thresholds that I know I want to be alerted on and I want to be able to configure those thresholds.

And some of our customers don't want to configure anything at all.

They want to say, if there's an anomaly, let me know.

But you should know what the anomaly is.

So we're trying to suit both customers.

Right.

And one last thing before we move over to the second part, which is analytics, because for the WAF alerts, we're using something known as the z-score.

But for things like origin monitoring, it's actually slightly different, right?

So we're already adopting different algorithms depending on use case?

Correct.

So origin monitoring is a big one for us. As a customer, you want to know when there's issues at your origin.

We can see your traffic, so we can say, "We've been pinging your origin and we have been getting bad responses.

We don't have any control over that.

There's nothing we can do.

But you should probably check what's going on." Customers really want to know that because they want to be able to, number one, fix it so that their end users can actually access their Internet properties.

And then number two, they just want to tell their end users that something is going on.

So even if it's something that's going to take a little while to fix, you still want to tell your end users, "We know there's a problem.

We're actively working on fixing it right now," just so that your end users have some visibility into what's going on, some observability, some might say.

And so origin monitoring, really big for us, we care a lot about it, and we use what's called an SLO to actually calculate those.

So SLOs are really good for...

SLO means service level objective. What it basically means is we expect most of your traffic to give us good response codes, good HTTP response codes.

If it's giving us above a certain level of bad response codes, we know that there's something wrong and we can use our service level objective to basically say, "This is breaking the service level objective.

You should know about that."

It's a little bit harder for our security events because every customer has different security traffic, and so there's no baseline where we can say, "For all of our customers, we definitely know that you should be here." We actually have to calculate that baseline individually for each of our customers, which is why it gets a little bit more complicated, and which is why we've been using z-scores, which basically look at the expected range of traffic historically for your site, compare it to your current traffic, and say, "There's a big gap between those two. You should probably check something out."

Got it.
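The contrast between the two approaches just described can be sketched in a few lines of Python. This is a sketch only: the 99% target, the threshold of 3 standard deviations, and the function names are assumptions for illustration, not Cloudflare's actual parameters or implementation.

```python
import math
from statistics import mean, stdev


def breaches_slo(status_codes, target=0.99):
    """Fixed baseline: flag when the share of non-5xx responses drops below `target`."""
    if not status_codes:
        return False  # no traffic observed, nothing to judge
    good = sum(1 for code in status_codes if code < 500)
    return good / len(status_codes) < target


def z_score(history, current):
    """Per-site baseline: distance of `current` from the historical mean, in standard deviations."""
    if len(history) < 2:
        raise ValueError("need at least two historical data points")
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        # Perfectly flat history: any change at all is infinitely unusual.
        return 0.0 if current == mu else math.copysign(math.inf, current - mu)
    return (current - mu) / sigma


def is_spike(history, current, threshold=3.0):
    """Only upward deviations count as spikes in this sketch."""
    return z_score(history, current) > threshold
```

The first function needs no history because "most responses should be good" holds for every origin. The second has to learn each site's normal range first, which is why the same event count can be a spike for one customer and ordinary traffic for another.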

Okay. So moving on with the observability story.

So we've done monitoring and alerting.

And then the second piece, Tanushree, I think you mentioned was analytics.

Maybe let's start, Natasha, very quickly, staying with you for a moment.

Are we bringing in alerting and sort of monitoring in the dashboard as well and sort of what's our thinking there?

So that's the goal.

So right now, you got this alert and it has a link in it and it says go to your dashboard and go look at what's wrong.

It takes you to the analytics page, the relevant analytics page with the filters that you want.

So for Origin Monitoring alerts, it will take you to the overall analytics page, the traffic page filtered down to only responses that are 5xx errors from your origin.

So this is the traffic that we alerted you on, this is what you care about. So you can investigate it from there.

We do want to get better about that.

We want to have more indicators in the analytics page of...

this is when you actually got alerted and you can see that you got alerted or this is the expected traffic that we have based on your historic traffic.

So now that we've calculated it, we actually want to show it to you.

So we're working in the background on all of those features to get closer with alerts and analytics.

Cool, and Tanushree, anything else from your side in terms of analytics in the dashboard?

Yeah, yeah.

I'm excited to be working with Natasha's team on that and bringing notifications to the dashboard as well.

I will say that one big customer theme that we've been observing with analytics similar to alerting is we want more customization.

Customers are saying that there's different things that are important to them, there's different types of users.

So either people from the security team or analysts or network engineers that are using the Cloudflare dashboard and it's not a one-size-fits-all where they can just navigate to one tab and see everything they need.

But oftentimes it's multiple different things.

So what we want to focus on bringing to our analytics dashboards as well is more customization.

The ability to do analysis over multiple Cloudflare products in one spot is really important to bring.

And so that's something that we're working on and we're testing.

And then another couple of things that are on the horizon are bringing more account-level features to our analytics.

So currently we have zone analytics where you can view, get a 100-foot view of your zone or you can slice and dice, add specific filters and time ranges.

And we want to bring that same capability to account analytics.

So more customization, you can pick and choose specific zones that are important to you and then view your traffic over all of those zones versus having to go into every single tab for a specific zone.

And then we've also been hearing some asks about performance and how...

and our goal is to expose more of that on the dashboard.

So we've been doing some testing and ways to figure out how we can expose more performance-related metrics to our customers as well.

So those have been sort of the higher-level, overarching themes.

I know, Michael, you also have specific security use cases.

You've talked to those customers every day.

Do you want to add in, chime in there?

Yeah, for sure.

No, and everything you've been discussing already, I think it's really exciting from a security perspective.

I think, Natasha, you said a moment ago, bringing some of those alerting indicators directly in the dashboard.

Whenever I speak to some of the security analyst teams, maybe working at some of the larger organizations, it's like, okay, I want to see when something triggered so I can start doing correlation of events.

And that will lead us onto the last topic for observability, which is forensics and tracing.

But if I can see when something alerted on the dashboard and maybe there's a label, first of all, that's the first, an excellent place to start.

You know, okay, let me start drilling down.

I know we recently allowed for the timeframe to be dynamically zoomed in from the analytics dashboard, and then all of the top-N analytics sort of update dynamically after that.

And it becomes really powerful to start seeing if the malicious actor maybe actually started attacking before the event triggered, right?

Now, there's this IP that's doing something. What did that IP try to access before the alert triggered?

And you start from IP, you might then expand to see what did it do on the entire hostname or is there some other indicator or obvious pattern that comes out?

And as you mentioned, you know, we're bringing all of these things into the dash.

It's going to open the world for anyone using Cloudflare to investigate issues.

I'll mention one other thing that I think becomes super important as analytics become more powerful, and Tanushree, I think you said, bring it to the account level as well.

If we're able to run, and it's something we're trying to do in the security teams, all of our detections all the time in an ideal scenario, with no latency impact, you essentially see a full view of what Cloudflare would have mitigated before you necessarily make a decision on, "Okay, I want to block this traffic," which is extremely powerful from a security standpoint, because in security, there's always a problem with false positives.

But if you're able to filter down on these OWASP scores, just like the Bot Management Dashboard already does to some extent, you can see what filter you want to apply before you write your rule.

And if you can also do that at the account level, it becomes an extremely powerful tool in the security space, especially as you can mix and match OWASP with Bot Management with Cloudflare Managed Rules or any other product we build in the future.

So I guess this does bring us to, you know we were speaking about security analysts, forensics and tracing and, you know, something happens, I want to figure out in detail what happens afterwards.

And Ashcon, I think you can help us, but before I jump over to you, what... Tanushree, I think you've been looking at forensics a little as well, a quick overview of what we're thinking in that space?

And then we go over to Ashcon.

Yeah, yeah.

This is something I'm really excited about. I'm a bit biased.

This is like I'm just super, super excited about forensics and bringing that capability to our customers.

I will start by saying that Cloudflare as a whole, we're always looking to add more datasets, and I have product managers across the org coming to me every week saying, "Hey, we want to add a dataset for this new product we're releasing." So that capability is always happening in the background.

And as we're developing new products, we're always thinking about both analytics as well as forensics and logging use cases.

What's next with the forensic side of things is we're very close to bringing logs on R2 for customers to be able to store their data on Cloudflare.

If you're a customer and you use a product like Logpull, it'll be very similar to that where you don't have to worry about setting something else up on a third-party service.

Cloudflare will take care of that for you. But unlike Logpull, this will be applicable to more datasets and will have more features that we're continually adding to it.

So really excited for that as well as the fact that we're adding, we'll be adding more querying functionality.

So if you want to see logs for a very specific time range or matching specific IPs, things like that is stuff that we're working on.
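As an illustration of the kind of query mentioned here (a specific time range, optionally narrowed to specific IPs), this Python sketch filters newline-delimited JSON log lines. The `timestamp` and `client_ip` field names are hypothetical, and the function is a stand-in for the idea rather than any Cloudflare API.

```python
import json
from datetime import datetime


def query_logs(lines, start, end, client_ips=None):
    """Yield parsed records with start <= timestamp < end, optionally limited to some IPs."""
    for line in lines:
        record = json.loads(line)
        # Hypothetical schema: ISO 8601 timestamps with an explicit UTC offset.
        ts = datetime.fromisoformat(record["timestamp"])
        if not (start <= ts < end):
            continue
        if client_ips is not None and record["client_ip"] not in client_ips:
            continue
        yield record
```

Both `start` and `end` are expected to be timezone-aware `datetime` values so they compare cleanly against the parsed timestamps.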

And then also something that is next on the horizon is bringing logs to the dashboard as well.

So we've heard really good feedback from customers on the way that our firewall analytics handles this, where you can view high-level trends of analytics, but then also see the log lines that make that up.

And as we add more logging capability on Cloudflare, we'll be able to bring that capability to our other dashboards as well.

And that being said, we also want to continue to support our partners.

We have SIEMs and observability tools that we work with that our customers use.

So in addition to building up our own capability, we want to make sure that our relationships with our partners are strong and we're bringing more features to them as well.

We're not deprecating SIEMs just yet is what you're saying.

No, no, no.

Okay.

And then for tracing, Ashcon. What are we... What's happening in that space?

I know there's some exciting vision there as well.

Yeah.

So I think one of the things we can think about when we kind of look at forensics and tracing is about answering the question, "What happened?" and kind of being able to make anyone become essentially a detective.

So you might imagine you're looking at your analytics and you see a spike in 500 response codes from your origin, and you can correlate that to a specific time.

You're going to be able to zoom in and take a look at different other data fields that came with that.

But let's say you get a user that actually says, "Hey, I experienced one of those 500 errors when I was trying to log in." And now you want to try to be able to tie that spike in analytics to a very specific event.

And I think that's very much where tracing comes into play. The whole idea of tracing is that for a series of events, we essentially chain them together with a bunch of metadata.

So that could be your logs, kind of response and request HTTP codes and various different metadata.

So when someone says, "Here's an event ID or a Cloudflare Ray ID that I have.

What happened?" We can take a look at the trace and see kind of all the series of events and sub-events that happened.
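The chaining idea can be sketched in Python as grouping events by a shared identifier and ordering each group in time. The event shape (`ray_id`, `ts`, `step`) is hypothetical and chosen only for illustration; it is not the format of Cloudflare's logs or traces.

```python
from collections import defaultdict


def build_traces(events):
    """Group events by trace ID and order each group by timestamp."""
    traces = defaultdict(list)
    for event in events:
        # Hypothetical event shape: {"ray_id": str, "ts": number, "step": str}
        traces[event["ray_id"]].append(event)
    for chain in traces.values():
        chain.sort(key=lambda e: e["ts"])
    return dict(traces)
```

Given an ID from a user report, looking up `build_traces(events)[that_id]` returns the ordered series of events for that one request, which is the "what happened?" view described above.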

And essentially what this does, particularly for any customer, but especially for developers who are starting to use the Cloudflare Developer platform, Workers, to kind of build more complex applications, it's really important to be able to deep dive into a specific event because ultimately, unlike an origin where you're kind of managing it on your own, with our developer products, you're actually building it on top of Cloudflare.

And so that means we're able to build a lot more intelligence, a lot more observability as a whole into what you can see and actually addressing issues faster.

So that's really one of the core principles of tracing.

And if you kind of think about it at a high level, any complex system, right, most complex systems start off as very simple systems.

And so in the beginning, maybe you're going to be satisfied with just logs or just analytics.

But over time, as you build out more features and more complexity into that system, you're going to need a very detailed, essentially ledger of what happened.

And so that's really where we see kind of tracing, particularly with Workers and our developer products, being able to dive into those details.

That's awesome.

And I, again, when I think about security, to me maybe this is obvious, but we're actually starting to look at session tracking.

So in the event of a specific behavior or someone's logged in with a compromised credential, we have an ID which is likely the session ID for the user journey on the app, and if we were to have a tool that you click on it and you see the entire journey of the user, that would be wonderful for sure.

With that, first of all, thank you very much, Natasha, Tanushree and Ashcon.

It was a pleasure speaking to you today.

Anyone who was listening, please head over to the blog. There was one last blog post today on Security Week, so please take a look.

Thank you for listening.

There are more Cloudflare TV sessions.

Also look at our schedule and with that, have a good day or good evening depending on where you are in the world and see you soon.

Thank you, everyone.

