Meet HelperBot

Presented by: Junade Ali

Originally aired on July 7, 2020 @ 3:30 PM - 4:00 PM EDT

At Cloudflare, Support Operations Engineers use a combination of Artificial Intelligence, Site Reliability Engineering principles and high-integrity software engineering to help solve customer problems ever more efficiently. Junade Ali describes the technology and principles which drive these efforts.

English

Customer Support

Transcript (Beta)

Hello. Sorry about that. I was just trying to make sure the technology was working all right. So hello. It's afternoon here in London, I'm Junade, and today I'm going to be talking to you a bit about what I spend my day job, to a large extent, doing at Cloudflare. You may have watched some of my previous talks over the past few weeks, where I've spoken a lot about Pwned Passwords, Troy Hunt, and some of the work we did on password security and things like that. But it's quite nice now to be able to talk you through some of the stuff I spend my time doing on a day-to-day basis. At Cloudflare I'm the engineering manager of a team called the Support Operations team, and we work on a few different kinds of problems. I'll guide you through some of them. So here we are. Cloudflare deals with a very, very large customer base, varied across different types of offering: free customers, self-serve customers, big enterprise customers. We do things like pro bono enterprise-grade protection for public interest groups. You may have heard some of the stories over the past few weeks about Project Galileo and things like that. And to sustain that, we have an ever-growing amount of support requests and things which our customers need help on. And it's really about how we scale this, to a very large extent, as one side of this kind of problem. We have a large volume of customer support requests, very complex and varied hosting environments for customers, different types of infrastructure they're using, lots of different kinds of challenges like that. And of course, against the backdrop of that, we have to offer 24/7 coverage. So these customers, again, vary from florists to Fortune 1000 type companies. So a few years ago, we set about dealing with this problem and the amount of load we came across in the customer support queue.
We initially rolled out a service we called stateless help. It's effectively a diagnostics API which troubleshoots common networking issues that our customers come up against. It's deployed in many different contexts; internally, different services read off it to expose it in different ways. It's something we use as part of our inbound ticket processing pipeline. We expose it via an external API gateway, so people can gain access to that information in the dashboard as well. There are communication webhooks, so we were able to email customers and engage with them using this data as well. And behind all this, it is fed off multiple different data sources. Lots of different types of active tests initially drove this data: lots of different internal environments, and lots of actual testing where we'd go out into the rest of the world through the Cloudflare network and diagnose different issues. You may see some of this, for example, if you go through the Cloudflare ticket submission workflow, where diagnostics for common issues are presented up front. Some of this data is exposed there to help deflect customers in the workflow. This may not be entirely unheard of from some of the other product solutions you've dealt with before, where diagnostics and these types of things were offered to customers, but where we took this was ultimately into quite a big solution as part of the HelperBot suite. And this tool really exploded and delivered continuous business value, and we built upon this. And this story talks about how we went from building a tool like this into building something completely, kind of, revolutionary for the customer support industry, and indeed for scaling developer products to the millions of customers we have at the moment.
One of the earlier examples, when I talk about how this was exposed in different communication channels for customers, was a few years ago when Chrome 68 came out. For those of you who remember, Chrome 68 changed the indication in the browser bar. So when you connected to a secure, HTTPS-enabled website, it would default to not having anything; it wouldn't have the secure label on. But if you didn't connect securely, it would indicate your connection is insecure. And in the months running up to this, Cloudflare was instrumental in delivering Universal SSL across the industry. This was something we really had to be careful about, because we had lots of customers who were facing mixed content issues and these types of challenges, redirect loops with the HTTPS configuration. And this was something we wanted to make sure there wasn't any impact on when this went live. So in order to do this, we started proactively messaging customers, asking them to resolve these issues. We also had information in our ticket submission form, and we exposed more diagnostics in the dashboard of our site as well. We actually did a lot of work as well with different open source projects to try and resolve a lot of these issues upstream, before they came to our customers. And this was a very successful campaign. We ended up managing to mitigate the spike of this, and we managed to deflect a lot of customer contacts. Every day for the, say, two or three months this campaign was running, we'd run almost 100,000 tests a day. And this would count as something like a month of manual human testing for our agents, which we were able to deflect earlier upstream. And nowadays we've actually got this optimized when it comes to these issues, mixed content and redirect loops.
They aren't really issues that are predominant for our customer base anymore, which is unique, given our position in driving an encrypted web. So that was one of the initial experiences we had, going from this very straightforward initial service to deploying something of real value. So, the need for automation. What we've just spoken about is that we initially deployed tooling which was self-serve for our customer base, so they could mitigate issues, and we made that proactive. And this is the point: it's better to have self-serve tooling for customers than to have tooling for support agents. But tooling isn't exactly automation, and automation is better than customer tooling. It's best to be able to resolve these issues upstream, and we actually did some of that work with the SSL issue, but we went even further in a lot of this, and this is the driving principle we had in mind for a lot of the efforts we drove towards. First of all, on the side of automation, and with regard to that middle point I raised here, that tooling is not automation, related to customer support interactions: when we looked into natural language processing, it wasn't ideally suitable at the time. On very academic data sets you would get 70 to 80% accuracy. With our taxonomy as it was at the time, our structure of data, we would get at best about 50%, and when we looked at commercial vendors we would get about 50% as well. And the tolerance for false positives varies; it really depends on the context in which it's done. Is it a sensitive issue? Is it a general issue? How long is the customer going to be left waiting for a human?
Is it best that we give them a risky automated response up front, which may not match their issue, but they can come back? Or is it better just to have them wait? These types of issues come to mind in terms of natural language processing. An example of NLP accuracy: this is to do with language classification. Language classification is a fairly widespread technology, and usually, excuse me, usually it's based on the language in which a customer engages with you. It's based on the text, effectively. And we can see here, this was initially in the context of chat tickets, which are raised through live chat for our business and enterprise customers, and the routing systems behind that. What we've got along the x-axis is the length of the message, and on the y-axis we've got the percentage where the classified language matches both the language of the country they came from and the Accept-Language header presented by the browser. The data is a bit all over the place here, but you can see a general trend. You can see the confidence line as well: generally, as a message gets longer, it gets more accurate, but the majority of cases were densely clustered around a small area. And this is where we developed something which was kind of new to enhance the safety of this system. We basically took three distinct fingerprints, put them into a kind of neural network, and we were able to drive improvements. So you're able to see on the left-hand chart, Accept-Language header versus classified language, you get a big overlap, which is going horizontally down, and this can really be used to enhance safety in the context of language classification.
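
As a rough illustration of combining the three signals mentioned here (classified language, country language, and the browser's Accept-Language header), the sketch below uses a plain majority vote. This is a deliberately simpler stand-in for the neural network described in the talk, and the function names and the two-of-three rule are assumptions made for the example.

```python
from typing import Optional


def preferred_language(accept_language: str) -> Optional[str]:
    """Return the primary language subtag with the highest q-value in the header."""
    best, best_q = None, -1.0
    for part in accept_language.split(","):
        tag, _, params = part.strip().partition(";")
        tag = tag.strip().lower()
        if not tag or tag == "*":
            continue
        q = 1.0  # a tag without an explicit q-value has weight 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                continue  # skip malformed weights
        if q > best_q:
            best, best_q = tag.split("-")[0], q
    return best


def agreed_language(classified: Optional[str],
                    country_language: Optional[str],
                    accept_language: str) -> Optional[str]:
    """Hypothetical safety rule: trust a language only if two of three signals agree."""
    votes = [classified, country_language, preferred_language(accept_language)]
    for candidate in votes:
        if candidate is not None and votes.count(candidate) >= 2:
            return candidate
    return None  # no agreement: leave routing to a human
```

For example, `agreed_language("de", "de", "en-US,en;q=0.9")` returns `"de"`, while three disagreeing signals return `None`, which a routing system could treat as "do not auto-route".
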
And you can see some of the data on the pie chart on the right as to where the biggest matches are, and the instances where there's no match at all, that 13%, which in all likelihood are wrongly classified by the language classifier. That's an interesting use case of classification with languages, but it really doesn't go into where things are with automation and how we can drive a lot of that forward. So our natural language processing pipeline for automated responses goes through a few different things. There is named entity recognition. Named entity recognition strips out some of the technical information or the email footers, those types of things, before you process them. We then have a multi-classifier system; I'll talk about how this works a bit later. And then we have some safety engineering. This is important for the various different contexts we look at, and I'll talk a bit more about them later, but we have an over-engineered solution, which is designed for where there needs to be some greater element of risk sensitivity, and then we have formal contracts where high grades of accuracy are required. So you can see the false positive rates. With the multi-classifier, initially this was about 21%. When we introduced over-engineering, and I'll describe how it works in a bit more detail, it adds additional safety, which gets that false positive rate down so we can use it in contexts where customers are potentially paying more, where the issues are potentially different, and we can do so without harming customer satisfaction. And then formal contracts, in the case where we need to make configuration-changing style requests. So, safety engineering approaches. Take the majority of, say, our low-risk free customer user base.
Some failure is tolerable. If they get the wrong response, they can write back and get through to a human agent if the response didn't answer their question. In higher-risk situations we have this binary classification system, where it isn't a sensitive issue but there is potentially higher risk for customer satisfaction. And then there are formally defined safety checks for sensitive contexts, where we're automatically doing things like configuration changes, where no margin for error is basically permitted in the design of the system. And sensitive contexts may require customer validation actions. I'll talk a bit more about that going forward. So, initially, extracting things like error information from customer tickets: we compared a few different approaches, really. Initially there was no real literature review. One of the first things I did when we got a scientist on the team was say we should really look through the literature and get a formal understanding of the mathematics behind different string similarity techniques. Because it used to be the case that a lot of people would just pick the Levenshtein algorithm off the bat. It's computationally quite expensive, O(n times m) for two strings, whereas cosine is like O(n plus m). And we actually found that cosine, with an n-gram size of two, gave a better false positive to true positive ratio than we got out of other solutions like, for example, the Levenshtein algorithm, which are actually more computationally expensive. So that was one of the original developments we were able to work on. If you want to look at the paper, there's a link to it on screen. And we could take this a step further with threshold identification: we can actually intelligently identify the threshold at which we should trigger these things.
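
To make the comparison above concrete, here is a minimal Python sketch of cosine similarity over character bigrams next to a textbook Levenshtein distance. It is an illustration under stated assumptions rather than the implementation from the talk or the paper: the function names, the sample error string, and the 0.8 trigger threshold are all hypothetical.

```python
from collections import Counter
from math import sqrt


def ngrams(text: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams (n=2 is the size mentioned in the talk)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a: str, b: str, n: int = 2) -> float:
    """Cosine of the angle between the two n-gram count vectors, between 0 and 1."""
    va, vb = ngrams(a, n), ngrams(b, n)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance; time grows with len(a) * len(b)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


KNOWN_ERROR = "error 525: ssl handshake failed"  # hypothetical known error string
THRESHOLD = 0.8                                  # hypothetical trigger threshold


def mentions_known_error(ticket_line: str) -> bool:
    """Trigger only when the bigram similarity clears the chosen threshold."""
    return cosine_similarity(ticket_line, KNOWN_ERROR) >= THRESHOLD
```

Building the two bigram counters takes one pass over each string, which is where the roughly linear cost comes from, while the edit-distance table has to visit every pair of characters. In practice the threshold would be chosen from labeled data, as the talk describes, rather than fixed by hand as it is here.
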
Then another bit of technology we developed for natural language processing was our chain of responsibility system, where a neural network would handle how things would be classified into individual classifications, and we would use a cosine-based approach as a filter on that, and it would basically direct how the classification would go. So that was one aspect. The other aspect is that after this classification is done, we can have a second-line good/bad classification. This is basically the over-engineered solution. It would give highly accurate classifications in this specific context. So we have the binary classification, where you have a second-line classifier behind this multi-classifier. It would decrease the true positive rate, but the false positive rate would remain good. We make use of diagnostics in here to inform that process. So cascading failure is often seen as a bad thing. If we think about it in the case of servers, if each server or hard drive is reliable, say, 99% of the time and you've got 10 of them chained together, the overall failure rate goes up. But here we really use that to our benefit. So adding binary classification on, as you can see, yields a significant benefit to the accuracy. And that was something we were able to develop: rather than considering accuracy as one metric, considering where the true positives and false positives really matter. In the cases where configuration changes needed to be made automatically, we have formally defined runtime contracts. The way these work is: there are contracts, there's data stored, the customer completes some validation attributes, the contracts are revalidated, and downstream APIs revalidate too. This basically means we're able to introduce a practice known as design by contract. Design by contract is effectively a way of formally verifying applications.
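
The runtime contract idea can be sketched in a few lines of Python. This is only an illustration of the pattern: the `ChangeRequest` type, the allowed settings, and the `apply_change` function are hypothetical names invented for the example, not Cloudflare APIs.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class ChangeRequest:
    """A hypothetical configuration change extracted from a support ticket."""
    zone: str
    setting: str
    value: str
    customer_confirmed: bool  # set once the customer completes validation


# A contract is a predicate that must hold before the change may run.
Contract = Callable[[ChangeRequest], bool]

ALLOWED_SETTINGS = {"ssl_mode": {"flexible", "full", "strict"}}  # illustrative only

CONTRACTS: List[Contract] = [
    lambda r: r.customer_confirmed,
    lambda r: r.setting in ALLOWED_SETTINGS,
    lambda r: r.value in ALLOWED_SETTINGS.get(r.setting, set()),
]


def validated(request: ChangeRequest) -> Optional[ChangeRequest]:
    """Return the request only if every contract holds; otherwise return None."""
    return request if all(check(request) for check in CONTRACTS) else None


def apply_change(request: ChangeRequest) -> Optional[str]:
    """Stand-in for a downstream API: it revalidates instead of trusting the caller."""
    checked = validated(request)
    if checked is None:
        return None  # empty result, so nothing is executed
    return f"{checked.zone}: {checked.setting} -> {checked.value}"
```

The design choice worth noting is that the downstream function runs the same contracts again, so a request that skipped validation upstream still cannot be executed.
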
So if a downstream API failed this test, the result would be empty, so nothing could be run. So there are some formal elements of safety. You can formally verify this through other types of software engineering tactics. And this really helped us go into other areas where we want to automate things which are very important to the customer. There were lots of different elements to this. We had to simplify the taxonomy to encourage greater accuracy for classifications. We used classification to fill in the gaps, so some sub-level classification using the cosine approach I spoke about earlier. Attaching configuration change items to JIRA, so the product team could make them self-serve, is something else which is quite important for us. And this is the result a Cloudflare support agent sees when an automatic response is provided. They get the diagnostics, they get the automated response. Based on this ticket here, where someone's asking about something, we're able to accurately identify it. We're able to respond and hopefully deal with it without the need for a human, but the human is there if the customer needs it. And then, considering this data in multiple dimensions, if you like: this is a different kind of problem we dealt with recently, where we had to identify different types of customer requests based on different pieces of data. And you can start to see here how all these bits of data fit together to give us a picture. We're able to pinpoint the issues we want using different systems: different natural language processing systems, and different types of customer modeling systems as well.
And, yeah, so that's a walkthrough of some of the automated diagnostic components we've worked on, some of the automated response components we've worked on, and how data is really fed into that process. What I want to talk about towards the end of this talk is how these tactics can be used to build a kind of next-generation security operations center environment. We've already shown how proactive messaging can be used for our self-serve customer base, but how could this be applied to a security operations center? There's active testing as part of this, and there's analysis of passive traffic data flowing from our network, which is important. And this is really a solution we rolled out for some particularly high-grade, sensitive customers, where we need to have this particular piece of functionality. So I'll just walk you through how some of the data looks for a DDoS attack. This is a very simple representation. On the left-hand side, along the x-axis is the number of requests, and along the y-axis is the error rate. In red, I've indicated the times at which, from support data, customers were under attack, and you can see the general trend: as the request volume increases dramatically, so does the error volume, and you can see the attack manifest itself there. On the right-hand side, I've just added two extra dimensions. The color indicates something called path ratio, which is about what pages are being accessed by an attacker. And the scatter size is the user agent ratio. Immediately from that you can see we get four-dimensional visibility into what an attack really looks like. So this is one particular type of tool which can be used to analyze, in aggregate, how DDoS attacks look.
And why this is important in aggregate is because often, when we deploy software to the network edge to mitigate attacks, it's doing things on a request-by-request basis, or on a far smaller sample of what it's dealing with. So this gives a holistic view of where attacks are, what they look like, and the overall picture. And then there are additional properties which are really useful in doing this. On the left here you see HTTP 5XX. This is a differential measure, so this is a differential from what we would expect to happen in a given time period, basically. And then the y-axis is 499s. 499s are basically where the client has disconnected before the web server has completed what it wants to do. And you can see a kind of L shape. The bottom side of it is really ordinary. The L shape basically represents issues which are not really attacks but are something else going wrong with the site, whereas you see a linear distribution of the plots which are attacks. And this really helps identify why this customer is undergoing this issue. Is it an attack? Is it something security related, or is it something else? And this is particularly useful for low-traffic sites. And you can see another example here with a differential multiplier of the request count versus uncached requests; again, you're able to get a very clear differentiation between what is an attack and what isn't. So to extend upon this, we can go into intelligent threat fingerprinting. We've spoken so far about DDoS attacks. DDoS attacks are quite crude, in the sense that they're about overwhelming a server. They're arguably much easier to identify than other types of attacks. This example here is about intelligent threat fingerprinting of credential stuffing attacks.
And basically, there are a few different aspects you can see here. On the left-hand chart, there is the unique number of zones being attacked: basically, how many times is a bot attempting to attack a given site before it moves on? One means it's normalized, so it's only attacking one site at a time and it's slowly trying to go through all of them. You've got the success ratio on the y-axis. Then on the right-hand side, on the y-axis you've still got the success ratio of the attack, but you've also got what we call a variety ratio. The variety ratio is abnormal status codes. Abnormal status codes are usually a good indication of something going wrong: requests being blocked, maybe, or an attack failing for some other reason. But what's really powerful, and this really shows the power of looking at this data in multiple dimensions, is this. On the plot on the right-hand side, we've taken these specific types of credential stuffing attacks, the most high-volume ones we've seen here, and we've plotted them in three dimensions. The three dimensions are the ones we've just spoken about: unique zones, variety ratio, and success rate of the attack. And what we then do with this jumble of data is apply unsupervised learning to it. What the unsupervised learning takes out of this is three attack clusters. This is quite a straightforward data science approach to use, where you're able to do unsupervised clustering. And we're able to find, basically, a very, very successful attack: normally we're talking about naught point something percent, or, for the other two clusters, maybe a three to three and a half percent success ratio. So we're able to fingerprint this kind of example here.
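
A minimal sketch of this kind of unsupervised clustering is shown below. The talk does not name the algorithm used, so k-means is an assumption here, and the nine data points are synthetic values invented for illustration, already scaled so that each of the three dimensions (unique zones, variety ratio, success ratio) sits between 0 and 1.

```python
from math import dist
from statistics import fmean

# Synthetic, pre-scaled points: (unique_zones, variety_ratio, success_ratio).
ATTACKS = [
    (0.10, 0.20, 0.002), (0.12, 0.22, 0.003), (0.11, 0.18, 0.001),
    (0.80, 0.70, 0.030), (0.82, 0.72, 0.032), (0.78, 0.69, 0.031),
    (0.45, 0.95, 0.035), (0.47, 0.93, 0.034), (0.44, 0.96, 0.033),
]


def kmeans(points, k, iterations=20):
    """Plain k-means; returns one cluster label per point plus the centroids."""
    # Deterministic farthest-point initialisation so repeated runs agree.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda i: dist(p, centroids[i])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centroids[i] = tuple(fmean(axis) for axis in zip(*members))
    return labels, centroids


labels, centroids = kmeans(ATTACKS, k=3)
```

With real traffic data the number of clusters would not be known up front, and the features would need scaling first, since a raw zone count would otherwise dominate the two ratios in the distance calculation.
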
But what we also see is that the majority of these requests are coming from the same user agent and the same country. This indicates to us that we're right in our assumption that this cluster is all related; it's quite good at fingerprinting this. And where this comes in really useful is with other types of clusters of attack data. These are things attackers don't normally think about, and they're things which are, in aggregate, generally quite hard to deal with. So if you were purely looking at information like country information for a very distributed attack, you would miss this type of granularity, you would miss these types of clusters, and you wouldn't be able to mitigate them as effectively. And that's effectively the intelligent threat fingerprinting approach we're able to use to mitigate some of these credential stuffing attacks. And so, the current state of affairs. We've spoken about three different areas of work, really. We spoke originally about building an automation system, or a diagnostics API rather, which you can expose to customers in contact channels and can also use internally as well. Then we moved to a different aspect, and that aspect was really around automation of technical support inquiries and how natural language processing built upon that: both how we're able to build higher-accuracy language classification and, on top of that, build very safety-engineered natural language processing approaches which deliver very, very good accuracy. And then we also went into the security operations center, which is the service we have deployed called HelperBot Vigilant, which forms part of the six services. These six services do everything from chatbots to anomaly detection, etc. And we're able to do this very, very well.
Some of the metrics are displayed here. There's always more to do. And we learned a lot of principles from this. I speak more about them in a USENIX SREcon talk I did, which goes through the main learnings we developed for support operations engineering as a field. So that's basically a high-level view of what support operations engineers spend their time doing at Cloudflare. And slowly the sun is setting here, so I will use this as an opportunity to hand over to our next session. Thank you very much for joining me today, and Zoom is having trouble with the custom backgrounds at the moment. Thank you very much.