Sentinel: An Alerting Platform for Premium Customers

Name: Sentinel: An Alerting Platform for Premium Customers
Uploaded: 2021-02-08T08:00:00.000Z
Duration: 30 min
Description: Join Jon Levine, Nerio Musa, and Natasha Wissmann as they discuss a new Cloudflare feature, Sentinel, an alerting platform for Cloudflare Premium customers.

Presented by: Jon Levine, Nerio Musa, Natasha Wissmann

Originally aired on January 27, 2021 @ 7:30 PM - 8:00 PM EST

Product

Transcript (Beta)

All right. Hi, everyone. Welcome to Cloudflare TV. My name is Jon Levine. I'm the product manager for data products and analytics here at Cloudflare, and I'm joined by Nerio and Natasha. Why don't you introduce yourselves? Natasha, do you want to start? Hi, everybody. I'm Natasha. I'm the product manager for application services at Cloudflare, and that includes our alerting platform. So that'll be what we're talking about today. Nerio? Hi, Jon. I'm Nerio. I am part of the support team here in Singapore and look after all of APAC for customer support. Awesome. So yeah, we're joining you today from three time zones. We were just saying, for Nerio in Singapore, it's 8:30 in the morning. I'm in San Francisco, where it's 4:30. And for Natasha, it's getting into the evening. So it's like we're spanning two days of the week here. But yeah, we're going to talk all about alerting and what we're doing now with alerting and where we're going. Having seen this start from when we weren't doing very much, it's really great to see the progress, and I'm excited to talk about that. So yeah, Nerio, you've been doing this from the very beginning. Tell us, how did this project get started? How did we decide we need to do more with alerting? So when I first joined Cloudflare, I was completely blown away by the scale: the number of data centers, the number of customers we had, and the sheer volume of traffic. And also, as part of that, I look after all the customer issues that come our way. And these customers can be small business customers, individuals, all the way to very large enterprises. And they're all equally important to us. And so I began sitting in on these customer calls when they had issues, just to listen and to figure out what sort of issues they had, what their pain points were. And a couple of things came up. And I noticed that there was a gap in what we can do for our customers. 
And they would talk about how we were not detecting issues with their network or their infrastructure, or sometimes even our infrastructure that was going wrong, but only affecting a small number of their sites. And the next thing they would ask me about is that when they did come to us with an issue, it would take my team, the support team, quite a while to spin up, understand and investigate, and try to isolate the issue: figure out what was going on and where the problem was. Because we just have so many moving parts. And to be able to isolate exactly where the problem is, under pressure, is quite a big task. So I would get on these calls, they would talk to me about all of this. And then I realized that we were really good at alerting and detecting issues at a macro level, at a colo level, or even if there's something wrong with DNS or something wrong with network congestion. I get all these notifications. There's so many things on the Internet that are outside of our control, right? There could be connections dropped on ISPs or connections dropped to origins that we don't control. Our operations team does get alerts when that happens, right? But tell us, why wasn't that enough? Yeah. So yeah, as you said, every few seconds there's some traffic change, some rerouting, some kind of congestion, anything. And we get notified of all this, but the problem is that it's at a macro level. And a single customer who experiences an issue may not register in the noise of all the other issues. Because an impact, even if it's not our fault, like the impact could be just localized to one place or one customer's website, right? Yeah. But for this customer, it's very important because suddenly 5%, 10% of requests are throwing 5xx errors, for example. And at a macro level, we can't even see that. But for that customer, it's very important. It's a big deal. It really matters. Yeah. 
So I just had to go away and think, okay, now maybe there's a way to track all this. There's got to be, we have such great tools. We have such great stuff that we've built in-house and our SRE teams are constantly monitoring. And how can we adapt that to looking at the customer level, and what information can I get at that customer level? So there's three things I set out to do, right? So the first thing is detect customer-specific issues that I can have visibility on. And the second thing is create a set of alerts which are meaningful. And what I mean by that is there's a lot of FYI alerts: something has gone above a certain threshold, FYI. And those alerts are really dangerous too, right? Because if you send people alerts, you're the boy who cried wolf and people just ignore them. So it's actually really dangerous to send an alert that isn't meaningful. Exactly. Exactly. And people get fatigued, right, if you keep doing that. And so I want to make them meaningful and actionable. When you get an alert, look, there's something up, right? And so that was a big challenge that we'll talk about more later. And the third thing, again, goes back to the complaint: why are we not letting customers know that there's something wrong? And so I want to be able to reach out to my customers when something is wrong, through alerts, yes. But in some cases, if it's a problem that's caused by Cloudflare at our edge, or if there's some software defect or whatever, and that triggers a certain condition, I want to be able to reach out to my customers and say, look, we know there's a problem here. We have opened a ticket for you. Let's work on this together. And do it in a proactive manner, rather than them having to open a support ticket. Because what happens in that is like, I call it like the ticket dodge, right? So a customer sees a problem, and then they investigate at their end, a few minutes go by. Then they decide that, okay, we need help. 
So they open the support ticket. The support ticket comes in, but it takes time for a support engineer to pick up that ticket. So we have an SLA, right? And so now the ticket's picked up. Now the engineer has to go back and forth with the customer asking for more information, clarifying the issue, scoping the issue, trying to narrow it down. And then at some point, there's an investigation that takes place. And this can take like an hour. Meanwhile, their site's down. And we're like, are you sure the site's down? How do you know the site's down? How do you prove it to us? Exactly, exactly. So I want to get rid of all of that and detect the alert, make them meaningful, detect the issues, rather, make them meaningful, and then proactively reach out to my customers and offer them help. I love the vision. Like we should tell people when there's a problem. They shouldn't have to tell us. And I totally hear you, you brought up some of the constraints, like the alerts have to be good. Like you can't just spam people. And then there's so many things, like you were mentioning, like DNS could fail, and like connections to origin could fail. There's so many classes of problems. And it has to work for all of our customers. So this sounds like a crazy problem. How did you start on this? How did we break this down? Yeah. And then the issues are like so widespread, as you said. It's like sometimes the customer has a latency problem. And other times it's an availability problem where they're seeing a bunch of 5xx errors. Other times there's like a caching issue where stuff's not being cached or it's too stale or something. Yeah. There's so many kinds of problems. Exactly. And so how do you even know where to start on these three things that I spoke about? So we went back and started from a very holistic approach. It's like, okay, let's look at the entire life, as we call it. We get trained on this. 
Every support engineer when they join Cloudflare, they get trained on it. It's like the life of a request. And it's from, we call the eyeball, which is end user, or what happens when they click a domain name. And then this whole entire process will happen. We have a DNS lookup, then we have a... And at different levels. So we have at the layer three, layer four level, at the layer seven level, everything that happens until a website or it starts being displayed into the browser. And you might think it's very simple, but it's not. There's a lot of moving parts. Yeah. There's a lot. Yeah. So look at it from a very holistic level. Look at it from a life of a request. So the first thing, and break those things down into chunks. So first thing is about, can the eyeball and the end user actually reach our Cloudflare edge? Can that request actually reach our edge? And that's reachability. So previously we were completely blind to that. We don't know. We only record stuff when the edge received a request, but what if that request never gets there? What if we miss it? What if we just don't see the request at all? How do you know about that? That technology out there that can... Browser-based technology. And we can actually use it in the apps as well, because if you use the correct framework, the network framework. So what happens is the browser makes a request. If that request fails to get to our edge, the browser will actually log an error and send us that error. Now it doesn't send us to a Cloudflare network, but it sends us to a third party network. So we can actually detect stuff when Cloudflare is down as well. I want to come back to that. It's super cool. That's NEL, right? Network Error Logging, or NEL. That's really cool. Let's talk more about that. So what else in that life of a request are we going on? So now the request has reached our edge. Now there's an availability problem. 
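The browser-side mechanism described here is Network Error Logging (NEL). As a rough sketch, a site opts in by sending two response headers: `Report-To`, which names a reporting endpoint, and `NEL`, which points at that endpoint group. The collector URL, group name, and `max_age` below are illustrative assumptions, not Cloudflare's actual configuration; the important detail from the discussion is that the collector sits on a separate network so reports still arrive when the edge itself is unreachable.

```python
import json


def nel_headers(report_url: str, max_age: int = 86400,
                group: str = "network-errors") -> dict:
    """Build the two response headers that opt a site in to NEL.

    report_url is a hypothetical collector hosted on a different
    network from the site being monitored.
    """
    report_to = {
        "group": group,
        "max_age": max_age,
        "endpoints": [{"url": report_url}],
    }
    nel = {"report_to": group, "max_age": max_age}
    return {"Report-To": json.dumps(report_to), "NEL": json.dumps(nel)}


if __name__ == "__main__":
    for name, value in nel_headers("https://reports.example.net/nel").items():
        print(f"{name}: {value}")
```

A browser that supports NEL remembers this policy for `max_age` seconds and later uploads a report to the endpoint whenever a request to the site fails at the DNS, connection, or application phase.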
So say, for example, that either our edge or the origin server, the customer's web server, returns a 5xx error, right? And so that means that the server cannot serve the request correctly. So we have to look at availability as a separate pillar for alerting at our edge. That's potentially a Cloudflare-related problem, or at the origin, which is... That's like, the network worked, but did the software work and was it able to actually operate correctly and return a response? Exactly. So we've just talked about reachability. We've talked about availability. Next thing is latency. So sometimes a request does get there, but it's too slow. And it causes end users to abandon the site. And that's not good. That's the whole point of Cloudflare, to make everything faster and more reliable, right? So we need to measure latency across all the different steps that the request goes through. And that's very important, as I found out when I first joined, for our e-commerce sites that we have. And e-commerce customers measure latency to... At some point, if it's too slow, it's almost like it doesn't work at all, right? Because people will just leave. They won't check out. They won't buy things. And then after that, the next pillar will be about attack or security. So Cloudflare has some amazing protection, different layers of protection for the layer 3, layer 4 traffic, layer 7 traffic, for DDoS, for bots. We have the WAF OS rules, everything. So we are protecting and always vigilant about requests that come in to make sure that they are valid requests, not malicious requests, right? And we're doing this at a scale that's unimaginable. So we need to alert and identify any spikes in attack traffic or security events and let the customer know. Now, you might think that, okay, this is stuff that's already been mitigated. So why is it important? Sometimes it's false positive as well. So sometimes this traffic should be allowed through, but it's being blocked. 
And so we need to let customers decide if that's the case. Customers want to know if we're blocking traffic and they just want to be aware when those blocks happen to know what that is. Cool. So great. That sounds like we kind of almost started from first principles. We talked to customers. We've got a sense of these pillars to alert on. Just to fast forward a little bit, it sounds like you and the team really worked on iterating on a lot of those metrics to alert on. And we onboarded a small number of customers, but even, let's say, for 10 customers, if you had five things to alert on, that's like 50 thresholds to pick. Like for this customer, if there are this many errors for this many minutes, then alert. And then you're constantly tuning them. Even for 10 customers, that doesn't sound scalable, let alone millions of customers on our network. So I think there's a really cool technique here that's used to maybe do this in a more sustainable way. Natasha, do you want to tell us about alerting and SLOs and kind of how that works? Definitely. So basically what we want to do is we want to make sure that you're not getting too many alerts, but you are getting alerted when something important happens. So what we do is we take the number of requests that you have in total, and then the number of requests that you have that failed. So looking at HTTP errors, for example, and we'll say, here's your SLO. So we're going to say it's 99.9%: 99.9% of the time, things are going to be completely okay, and for that other 0.1% of the time, we're just going to accept that sometimes things happen. And the SLO is like, it's a target that we kind of agree on with the customer, right? It's like, we want things to work 99.9% of the time, which is very good. Exactly. And then we say, great, how do we measure that? 
We essentially take multiple different windows of time and we measure them and we say, we're going to have kind of a different threshold for each of those windows and say, if we've got an hour and you're getting a certain rate of errors, that's probably not good. Also, if you've got five minutes and you're getting a certain rate of errors, that's also probably not good. If you combine those, that's really not good. We feel very certain that that's a bad thing, we should be alerting you. So at that point, that's when we send the alert. I think this is such a profound idea, this multi-window, multi-burn rate thing. I think it came from the Google SRE workbook, which is really cool. And it's like, it's not intuitive, but when you think about it more, it's like, okay, if something spikes really quickly with a really high rate of errors, you need to know, but if something's burning slowly, but still elevated, you still need to know. And especially when those two things happen at once, it's like a spring with a dampener and you need to get notified. Yeah, it's really cool. And so, is that like, we do that in practice? Does that work? We do. That's one of the things that we do in practice. And Nerio, you can talk about all of the different ways that we alert. Yeah. Yeah, absolutely. It does work, Jon, the SLO method. And actually one more thing I want to add is that we thought we would have to change the SLO for different customers and fine tune and all that. And what we actually found was, pretty much most of the time, 99.9% is right on the money. And that's the one that's most useful to all the customers. It's kind of miraculous. It's like a mathematical proof. You can just say, given this target, here's how you alert. And it works for almost anything, which is cool. Exactly. So SLO is like one methodology that we use for certain types of alerts. There are other types of alerts where we use something called entropy. 
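The multi-window, multi-burn-rate idea discussed above can be sketched in a few lines. This is a minimal illustration, not Sentinel's implementation: the one-hour and five-minute windows come from the conversation, while the burn-rate threshold of 14.4 is the example value the Google SRE Workbook pairs with those windows, used here as an assumption. The burn rate is the observed error rate divided by the error budget (1 minus the SLO), and the alert fires only when both windows exceed the threshold.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one look-back window."""
    total: int
    failed: int


def burn_rate(window: Window, slo: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    if window.total == 0:
        return 0.0
    error_budget = 1.0 - slo  # e.g. 0.001 for a 99.9% SLO
    return (window.failed / window.total) / error_budget


def should_alert(long_window: Window, short_window: Window,
                 slo: float = 0.999, threshold: float = 14.4) -> bool:
    """Fire only if the long AND the short window both burn too fast.

    The long window shows the problem is significant; the short window
    shows it is still happening right now.
    """
    return (burn_rate(long_window, slo) >= threshold
            and burn_rate(short_window, slo) >= threshold)


if __name__ == "__main__":
    last_hour = Window(total=100_000, failed=2_000)    # 2% errors
    last_5_min = Window(total=8_000, failed=400)       # 5% errors
    print(should_alert(last_hour, last_5_min))
```

Requiring both windows is what keeps the alert quiet for a brief blip (short window high, long window low) and for an incident that has already recovered (long window high, short window low).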
And that is to detect, we use entropy primarily to detect unmitigated attacks, potential unmitigated. Why does an SLO not work? Why do you need a different approach for those? Well, because for entropy, basically the advantage of that, it looks at the state of the system. It looks at the state of how many IPs are connecting, what the packet size for all of the packets that are coming in, what ports are being used in the source, what ports are being used in the destination. And so it looks at the system state as a whole. And then it detects shifts in that system state. So if, for example, we have an attack that comes in suddenly from a single IP address that's trying to basically come in like a brute force attack, that'll actually result in a lower, it'll lower the entropy because now it's going from a state of high chaos, which is many, many source IP addresses to a state of low chaos, which is- That's really interesting because there's no real baseline. There's no correct amount of attack to have. There's no agreed upon amount. We just have to say, if there's a change, though, if an attack starts or stops or something, we do need to know that. Exactly. And then you can go from changes either from a current state of entropy to low or to high. So we can't just say, okay, it's now high entropy means alert. It can be any shift. And so that's why it works better than SLO for that. And we use this to detect any kind of shifts in traffic patterns that come through after being scrubbed by all the protection that we have. And so we use that intelligence and we compare it to a certain time window in the past, so the median over the last hour. And if there's a shift from that, and if there's a volumetric increase, then we start to think, okay, maybe there's something wrong here. And so then our security engineers will reach out and will investigate the issue and then reach out to customers for activity. So that's the second thing. 
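The entropy approach described above can be illustrated with Shannon entropy over one traffic attribute, such as source IP. This is a hedged sketch: the shift tolerance, the volume factor, and the function names are assumptions for illustration, and a real system would track several attributes (packet sizes, source and destination ports) rather than one. It compares the current value with the median of recent history and flags a shift in either direction, combined with a volumetric increase.

```python
import math
from collections import Counter
from statistics import median


def shannon_entropy(values) -> float:
    """Shannon entropy in bits of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def is_suspicious(current_entropy: float, entropy_history: list,
                  current_volume: float, volume_history: list,
                  max_shift: float = 1.0, volume_factor: float = 2.0) -> bool:
    """Flag an entropy shift (up or down) that coincides with a volume surge.

    History lists hold recent samples, for example from the last hour.
    max_shift and volume_factor are illustrative tolerances.
    """
    if not entropy_history or not volume_history:
        return False
    shifted = abs(current_entropy - median(entropy_history)) > max_shift
    surged = current_volume > volume_factor * median(volume_history)
    return shifted and surged


if __name__ == "__main__":
    normal = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
    attack = ["10.0.0.9"] * 4  # one source dominates, so entropy collapses
    print(shannon_entropy(normal), shannon_entropy(attack))
```

Many distinct sources give high entropy, while a single dominant source drives it toward zero, which is why a change in either direction is treated as the signal rather than any fixed level.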
The third methodology we have is simple threshold-based alerting. So if X happens over this amount, over this time, then fire the alert. But that's a very basic, simplistic approach. But we still have to use that. When all else fails, there's always thresholds. Yeah, that works. Cool. Well, I think the story of how we built this is really interesting too. So we started out making alerts work. We didn't know, we're saying all this like it's obvious or something, but this was a lot of trial and error. It was a lot of experimentation to learn about this, and it involved working really closely with customers. So maybe Nerio, if you could maybe tell a story about some of our early customers using Sentinel and what that was like for them. What was that experience like? Yeah. I mean, we received so much feedback, right? So the customers are very open with us, and we take all their feedback very seriously. And before I go into actual success stories, I want to talk about all the improvements that we've made. So customers were concerned about the amount of time it takes to onboard. So we've improved that process. We've allowed customers to configure their own notifications, and Natasha's team is instrumental in helping with that. Then we've improved the email templates so that the alerts are meaningful. When you get an alert, it's very clear what's happening, and there's no ambiguity, there's no weird message or anything like that. It's a crystal clear message. And deduplication as well, so you just don't keep getting spammed for the same issue over and over again. So these are the things that customers told us they were concerned about, and we've gone and improved on them, right? So each one of us talks to customers very regularly, and the feedback we've received from them is really positive. 
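The threshold rule ("if X happens over this amount, over this time, then fire the alert") and the deduplication mentioned above can be combined in one small sketch. The class name, parameters, and the cooldown-based deduplication are assumptions chosen for illustration; the source does not describe how Sentinel deduplicates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdAlert:
    """Fire when a metric stays above a threshold, at most once per cooldown."""
    threshold: float   # value the metric must exceed
    sustain: int       # consecutive samples required before firing
    cooldown: float    # seconds to stay quiet after firing (deduplication)
    _streak: int = 0
    _last_fired: Optional[float] = None

    def observe(self, value: float, now: float) -> bool:
        """Record one sample taken at time `now` (seconds); True means notify."""
        self._streak = self._streak + 1 if value > self.threshold else 0
        if self._streak < self.sustain:
            return False
        if self._last_fired is not None and now - self._last_fired < self.cooldown:
            return False  # same ongoing issue, already notified recently
        self._last_fired = now
        return True


if __name__ == "__main__":
    alert = ThresholdAlert(threshold=0.05, sustain=3, cooldown=600)
    samples = [0.10, 0.10, 0.10, 0.10, 0.01, 0.10, 0.10, 0.10]
    for minute, error_rate in enumerate(samples):
        if alert.observe(error_rate, now=minute * 60):
            print(f"alert at minute {minute}")
```

The `sustain` count filters out one-off spikes, and the cooldown keeps a long-running incident from generating a new notification on every sample.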
So one customer said that we detected issues with their origin before they could, and that Sentinel was able to pick this stuff up 10-15 minutes earlier than they themselves did, and we reached out to them. And even that 10-15 minute window is super critical for them, right? Because it's their customers experiencing issues. And then we had another customer who was very confident in their robust network, an incredible network, incredible infrastructure, very foolproof. And they were very confident in it, and they had incredible visibility into it, and they were very impressed when, despite all of that, we came to them and we alerted them of an issue with their origin that they weren't aware of, and that they didn't anticipate. And they were so grateful that we were able to partner with them. It's really cool, because before we had this problem where we weren't even able to sort of notice problems on our side, and now the system is sensitive enough to detect problems that the customers have on their side. Yeah, and this was the same customer that, you know, rewind back one year ago, and they were on the phone with me complaining that, okay, why are you not seeing this? Why are you not telling us? But it was the same customer. And then we have another customer in a country where the network infrastructure and the ISPs are kind of poor. And so what happens there is a lot of congestion, and this congestion affects our customer's customers. So it's an e-commerce site, and when their customers go to visit, there's high latency, congestion. So we partnered with them to make our alerts suitable so that we notify them when there's congestion, and then we notify them when we're able to reroute traffic. And so they're super grateful, and they actually said that they love working with Cloudflare as a partner to solve this problem together, and that's something quite profound. Super cool. 
So, you know, Nerio, we've mostly been talking about what we refer to as Sentinel, which is our service that we built that alerts for just premium customers. It's a fairly small group of customers for us. And I know we want to bring this power to everyone. We want to democratize alerting. And I know, Natasha, that's what you're working on. So tell us, like, why is it hard to now take this and then build this for the whole customer base? What's kind of different about this? How are we thinking about that approach, about bringing these alerts to all Cloudflare customers? Definitely. So what we're trying to do right now is really put the power in the end users' hands. So you want to set up an SLO-based alert. Great. You get to choose what the SLO is. You get to choose where the alert goes to. Is it an email that goes to an internal team? Is it a webhook that plugs into something? Is it PagerDuty? Because it's an important thing, you want it to page whoever's on call. You can do any of those things. And what that means is that you don't have to reach out to us. You don't have to ask us to do anything for you or set anything up for you. You don't have to wait on us to be able to do something. You can do it all on your own. We'll set it up. We'll monitor it. So we're starting to build out all of these great alerts that we've learned so much about from Sentinel in the notifications tab on the dash. And hopefully, we'll be seeing a lot more of those. That's so cool. I'm really proud of how we built this. I think in Cloudflare's DNA, we really value being self-service, that customers can come to us and solve their own problems. They don't need to get on the phone. But we're also easy to use. And making something that's self-service and easy to use is really hard. It takes a lot of trial and error and experimentation. 
But I think what's cool about this is we're able to work with a small number of customers to learn what we needed to learn to make something easy to use. And OK, now that we've figured that out, now we're able to take that to everyone, which I think is really cool. So tell me about a big milestone we're working towards right now. What's the first thing you're going to be able to do with alerting? First thing you'll be able to do is monitor your origin for errors. So all the HTTP errors that you have, you'll be able to tell when there is an increase using that cool SLO stuff that we were talking about earlier. You'll be able to pick what SLO you want. 99.9% has been working for everybody. But just in case, if you don't want to use that, then you are able to choose another one. And then you'll also be able to choose which domain you want to monitor so that it can be a little bit more specific. And you could potentially send different domains to different people or different groups or, like I said, different ways that you monitor, whether that's Webhooks or PagerDuty. The customization is really valuable, right? Because we also see people may want to alert differently based on not just the domain, but maybe the kind of alert it is or the sensitivity of the alert, right? Exactly. And there's a lot of different alerts that have different customizations that you could potentially want. So we're investing pretty heavily in making that easier for us to even develop so that now that we have this ability, we can give you even more alerts and it'll be easier for us. Yeah. It's like investing in the machine that makes the machines. Yeah. Very cool. Yeah. So tell me about, do you know what's coming next after that? Yeah. So right after this, we're going to immediately start on the edge monitoring. And that goes back to some of the NEL stuff that we were talking about earlier. But for the customer, it'll be pretty similar. Tell me. 
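The self-service setup described above (pick an SLO, pick the domains, pick where the alert goes) can be pictured as a small data model. Everything here is hypothetical: the field names, the channel list, and the example targets are assumptions for illustration and do not reflect the actual Cloudflare dashboard or API.

```python
from dataclasses import dataclass
from typing import Literal

Channel = Literal["email", "webhook", "pagerduty"]


@dataclass(frozen=True)
class Destination:
    channel: Channel
    target: str  # address, URL, or service name; all values are made up


@dataclass
class AlertPolicy:
    name: str
    domains: set        # domains this policy monitors
    slo: float          # e.g. 0.999, the default discussed in the session
    destinations: list  # where matching alerts are delivered


def route(policies: list, domain: str) -> list:
    """Return every destination that should hear about an alert on `domain`."""
    return [dest
            for policy in policies if domain in policy.domains
            for dest in policy.destinations]


if __name__ == "__main__":
    policies = [
        AlertPolicy("checkout", {"shop.example.com"}, 0.999,
                    [Destination("pagerduty", "checkout-oncall")]),
        AlertPolicy("marketing", {"www.example.com"}, 0.999,
                    [Destination("email", "web-team@example.com")]),
    ]
    print(route(policies, "shop.example.com"))
```

Keeping destinations on the policy rather than on the account is what lets different domains, or different sensitivities of alert, reach different people.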
So when you're starting to explain how it works, tell me more about NEL and what's so cool about that. Yeah. So NEL is cool because we can monitor the edge, right? So we can tell what's actually going on. We have monitoring for the edge right now, but like we were talking about earlier, it's for all of Cloudflare. So it's super broad. It's not really user-friendly. You can't monitor a particular zone. So that's what we'll be building out. Nerio, can you talk a little bit more about NEL itself? Yeah. And again, previously we were completely blind, right, to requests that never actually reach us. And this is very important for customers still, because this means that real users, their users, are getting an error or having an issue with the network and nobody knows about it. This is like such a mystery, like how do you measure something that doesn't happen? How do we know if we don't see the request? How does that work? Yeah. So we change where we measure it. So we normally measure at the edge and now we're measuring at the browser. And the browser is telling us, hey, I have a problem. I can't get to where I'm supposed to go, right? And that's really cool. That's just mind blowing. And we're actually using the same technology in aggregate for our colos, for our data centers. But now we can actually do it for individual customers and individual zones. It's super cool. Yeah. And so this technology, I think, was first built by the Chrome team and built into the Chrome browser, and they kind of developed the spec to do this, which is great. And one of our product managers, who we lovingly call Tubes, David Tuber, told us about this and encouraged us to use this, and it's been really great. So we're going to expect lots more customers to be able to get this and get analytics about NEL. So they'll be able to see when customers aren't able to reach our edge, which is pretty amazing. It's pretty amazing to have that level of visibility. Cool. 
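Moving from colo-wide aggregates to per-zone visibility, as described above, comes down to grouping the browser's NEL reports by the hostname they refer to. The sketch below assumes reports shaped like the W3C Network Error Logging format (a top-level `type` and `url`, plus a `body` carrying `phase` and `type`); the sample data and the function name are invented for illustration, and this is not a description of Cloudflare's pipeline.

```python
from collections import Counter
from urllib.parse import urlsplit


def failures_by_host(reports: list) -> Counter:
    """Count failed requests per (hostname, phase, error type)."""
    counts = Counter()
    for report in reports:
        if report.get("type") != "network-error":
            continue  # not a NEL report
        body = report.get("body", {})
        if body.get("type") == "ok":
            continue  # sampled successful request, not a failure
        host = urlsplit(report.get("url", "")).hostname
        counts[(host, body.get("phase"), body.get("type"))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        {"type": "network-error", "url": "https://shop.example.com/cart",
         "body": {"phase": "dns", "type": "dns.name_not_resolved"}},
        {"type": "network-error", "url": "https://shop.example.com/",
         "body": {"phase": "connection", "type": "tcp.timed_out"}},
        {"type": "network-error", "url": "https://www.example.com/",
         "body": {"phase": "application", "type": "ok"}},
    ]
    for key, count in failures_by_host(sample).items():
        print(key, count)
```

Counts like these could then feed the same SLO or threshold logic used for edge-observed errors, with the difference that they cover requests the edge never saw.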
Nerio, anything else you're excited to share with us about what's next for our premium customer alerting that's on your horizon? Yeah. So again, the idea is, for premium customers, we will continue to provide like a very customizable set of alerts. We listen to them, we help them and work with them to fine tune what they need. And also they give us so much feedback on how to improve as well. So it's a two-way process. And also we find better ways to reach out to them when something happens. So we are also in the process of using Sentinel, which is the engine that we have, for more network security related things. And then we hope that we can have almost like a virtual SOC where we are working with customers to monitor, detect and mitigate security events that come through, to fine tune the WAF rules, to look at layer three, layer four traffic as well, and detect shifts. We spoke about entropy, and then work proactively on that level. So it's not just looking at stuff that we're mitigating, but also stuff that we're potentially not mitigating despite all our protection in place. And it doesn't happen very often, but sometimes it does. And we can reach out to customers when this happens. So it's really an entire platform and an engine that we can use to improve the customer experience and improve customer satisfaction. Awesome. So cool. All right. With a minute and a half left, Natasha, tell us about, what are you excited about? What's on the horizon for our whole alerting platform for everyone? I'm excited about a lot, but we're going to start by riding Nerio's coattails and taking all of the learnings that his team has with Sentinel and giving them out en masse. So making sure that we take what's worked for these customers and building it out in our notification center and making sure that that's really robust. 
We want to make sure that for all of that reachability, availability and latency, if there's issues with any of that, you get alerted or you're able to get alerted. And then like we were talking about earlier, we want to make sure that it's customizable and it works for you. So if there's additional ways that you want to get alerted that you can't right now, we're looking into building that out. So potentially integrating with Datadog or Splunk or something like that. We'll be keeping an eye out for those and building those out. And then any alerts that we also have at a product level. So anything that's specific for Magic Tunnels or Workers or anything else that we have, you should be able to go to one place on the dash and manage all of that. So that's what we're working on, on the notification side. That's so cool. And that just becomes more and more valuable as we have so many products. Like you mentioned, we cover so much of the stack from DNS to L3 protection with Magic Transit. And then of course, like your serverless applications, right? If your Workers are throwing errors. So I think it's really cool. Tons of great platforms out there for visibility, but what we're starting to see is more and more that Cloudflare has visibility into all these parts of the stack. So really exciting. Awesome. Nerio, Natasha, thank you so much. I learned a lot. It's great to have you today. Talk to you later. Thanks, Jon. It's been a pleasure. Thank you.