🔒 Security Week Product Discussion: Behind the Scenes - WAF ML
Presented by: Nicholas Robert, Vikram Grover, Andre Bluehs
Originally aired on May 17 @ 4:30 PM - 5:00 PM EDT
Join Cloudflare's Product Management team to learn more about the products announced today during Security Week.
Read the blog posts:
- A new WAF experience
- Improving the WAF with Machine Learning
- Security for SaaS providers
- Cloudflare Zaraz supports CSP
- WAF for everyone: protecting the web from high severity vulnerabilities
Tune in daily for more Security Week at Cloudflare!
Transcript (Beta)
All right. Well, hello and welcome, everyone, to another segment in Cloudflare Security Week.
Today we have with us Vikram and Nick, and we're going to be talking about WAF ML: web application firewall machine learning.
And these are two of the data scientists that have helped us do that.
I'll be your host today.
I'm Andre.
I am an engineering manager here in London. And if you would like to both introduce yourself.
Nick, why don't you go ahead.
Hi there.
My name is Nicholas Robert. I'm a Canadian who lives here in London working as a data scientist for Cloudflare.
My interest and specialty is basically machine learning models that are very, very high performance and low latency.
Over my career, I've been moving more and more in that direction, and I've had the incredible privilege of working on a very challenging problem here at Cloudflare in the last year: how can you make a machine learning model which is able to act as a web application firewall, and make that not only work, but make it fast enough to run at our edge?
Buttering us up, teasing all of the audience for the next thing we're going to talk about.
That's great. That's a good setup. Thanks.
And Vikram.
Hi, everyone.
My name is Vikram. I am a data scientist working as part of cloud based management team.
We are focused on augmenting our Cloudflare Support capabilities using machine learning.
I mean, I had the privilege of starting this project. We identified a problem statement and we wanted to fix it using machine learning.
And currently we are at a stage where we have cool stuff and we will talk about that in the segment.
More teasing.
Great. It's great.
You're setting me up for the next bit here. Let's take a step back a little bit.
And so we've already used some terms that may not be necessarily intuitive.
So let's talk about what a web application firewall is. Vikram, can you kind of walk us through generally what that does?
Yeah.
So a web application firewall, in layman's terms: we have the Internet, and we have customers.
Let's say somebody visits a website. The web application firewall looks at all the traffic, identifies whatever looks malicious, and then we can block that traffic using our systems.
Great.
What makes something malicious? Nick, can you talk a little bit about what kinds of things a WAF would attempt to block?
Right.
So there are many different types of applications running on servers around the world.
A malicious request would be some kind of content that a person sends to a server which yields kind of undesirable consequences at that destination.
Maybe it allows the malicious actor to extract customer information, sensitive information they shouldn't have privileges to.
Maybe it allows them to crash that machine, reducing availability for other people around the world or in general just some kind of bad thing.
So people sending data that shouldn't be sent, or that is specially crafted in order to cause these negative, undesired outcomes at the receiving server.
Interesting.
So how does a WAF typically accomplish that? We're here to talk about machine learning.
How does a non-machine-learning WAF accomplish that?
Well, so what we try to do is we look at the request data.
We try to identify the content of the request.
And the current WAF uses signatures or patterns to identify whether there is malicious content in the request.
And then it basically monitors that and blocks the traffic or logs the traffic, depending on the kind of attacks we cover.
Let's say XSS or SQL injection, or other kinds of attacks as well.
So generally, I mean, these are the two main variants we cover as part of the existing WAF.
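To make the signature approach described above concrete, here is a minimal Python sketch. It is not Cloudflare's rule set: the two patterns and the function name `match_signatures` are hypothetical and chosen only for illustration. It also shows how a small variation in the payload can slip past a fixed pattern.

```python
import re

# Hypothetical signatures for illustration only; a real WAF rule set is far
# larger and more carefully written.
SIGNATURES = {
    "xss": re.compile(r"<script\b", re.IGNORECASE),
    "sqli": re.compile(r"\bunion\s+select\b", re.IGNORECASE),
}


def match_signatures(payload: str) -> list[str]:
    """Return the names of all signatures whose pattern occurs in the payload."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(payload)]


# An obvious cross-site scripting payload matches the "xss" signature.
print(match_signatures("<script>alert(1)</script>"))
# An obvious SQL injection payload matches the "sqli" signature.
print(match_signatures("id=1 UNION SELECT password FROM users"))
# Replacing the whitespace with an SQL comment defeats this particular pattern.
print(match_signatures("id=1 UNION/**/SELECT password FROM users"))
```

The third call returns an empty list: the pattern expects whitespace between the two keywords, so the comment-separated variant is not detected by this toy rule.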
Interesting.
Okay, great. So we've got a system where we have these rules that are trying to identify specific malicious things in various categories.
What were the initial challenges of taking something like that and building an analogous machine-learning-powered version?
What were the first problems that you had to overcome?
And Vikram, you were talking about getting this up and running.
What were the first problems that you came across?
Yeah, to start with, I think the biggest challenge for any machine learning problem is data: the right set of data and the right amount of data.
So I think that was one of the biggest challenges for us.
When we started on this problem, we could see a lot of external data sets as well as some of our own internal data sets.
And we were trying to identify or understand whether those examples were trivial or simple and how we can build a model which would generalize well on not just trivial examples, but could also look at some variations or some real world examples which are not part of this trivial dataset.
So I think that was the biggest challenge.
We had to identify and find the right set of data, the right amount of data, and how homogeneous the data needed to be in order to build the right model.
I think that was the biggest challenge.
And then the other challenge was to start with the right testing framework and evaluation framework.
So, I mean, to start any machine learning model-building exercise, we need an evaluation framework for how we want to identify whether the model is doing the right thing.
So generally for a classification problem, we look at false positives, or precision and recall.
And in this project we also wanted to look at those, but we also needed an evaluation framework which doesn't look at just single metrics.
It also helps us evaluate whether the model is robust to certain kinds of attacks, and whether it has certain kinds of properties or not.
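For reference, the standard classification metrics mentioned above can be computed as in the following sketch. The labels and predictions are made-up illustration data, not results from the model discussed here, and `precision_recall` is a hypothetical helper name.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = attack, 0 = benign)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # attacks correctly flagged
    fp = sum(1 for t, p in pairs if not t and p)      # benign requests flagged
    fn = sum(1 for t, p in pairs if t and not p)      # attacks missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up example: 4 attacks and 4 benign requests.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]
print(precision_recall(y_true, y_pred))
```

These numbers summarize performance on one particular labeled data set, which is why the speakers go on to argue that they are not enough on their own.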
Okay.
So we talked about a lot right there, and you've covered a lot of ground.
I want to zoom in on a couple of different things. You talked specifically about trivial data.
Nick, what's the difference between trivial and non-trivial data?
It sounds like non-trivial data is the better kind of data to have.
What kind of data is that?
So I think in the context of web content, most of the malicious activity that's going on is automated.
So in the same way that bot content makes up a pretty significant portion of internet traffic.
Most malicious traffic is generated by individuals running large, indiscriminate kind of scans against many, many target zones at once.
And they might generate, say, a billion requests per second or something.
Going after low hanging fruit.
It's going after low-hanging fruit, servers that aren't doing any kind of protecting.
So a lot of these attacks are not really obscured or hidden in any way. They're just out in the open, looking for servers that might not be secured at all.
And I guess that kind of makes sense from a cost benefit point of view from the attacker, but it creates a challenge from a data set point of view because the fact that these samples are so kind of obvious in a way, you can build a model that will easily classify them, but it hasn't really learned anything interesting.
So when you said that, yes, non trivial examples are the ones that we're kind of more interested in.
We wanted to not only find the more sophisticated attacks, but also try to construct data which might have some similarities with an attack but is actually completely benign.
And that was kind of the real challenge: how do we really find this balance, this parsimony, between catching the sophisticated attacks and not blocking legitimate traffic that might just have a few unfortunate, possibly coincidental similarities with an attack?
Interesting.
So really kind of focusing on our ability to ignore those false positives because that's in a regular WAF, that's something that we have to tune over a lot of time of testing a rule and seeing what kinds of things it catches and doing that.
So is the process similar to that, more like guess and check, or can it be scaled up?
So.
Originally, I guess in the very early iterations, it was much more manual before we had a clear understanding of the problem.
As time has gone on, we've developed and automated more and more of our testing suite and our validation framework.
This is something that is becoming more empirical, more quantitative, which is, of course, as the end goal is to kind of make everything as quantitative as possible.
As you said, like one of the big challenges is trying to avoid blocking too much legitimate traffic.
And I think one of the important aspects is also having a process in place for making amendments to your security solution when new classes of false positives are discovered, as happens with our rules-based WAF as well.
Yeah.
Great. And so once we've got more robust data than this trivial, easy-to-come-by data, since people are hitting us with trivial attacks all the time,
once we've got something like that and we're working on training the model and getting it into a more robust state, what kind of output are we looking for from that model specifically, and what are the evaluation criteria?
Can you talk a little bit more about that?
Yep, sure.
So, I mean, what we are trying to do is classify whether a particular request contains, let's say, XSS or SQLi.
And we want to generate scores for individual requests, or for individual fields in a request.
And we try to identify the likelihood of a request being, let's say, XSS or SQLi or a benign request.
And how we do it currently is that we have NLP-like models where we look at tokens and we identify whether certain tokens...
NLP means?
Natural language processing.
And what we try to do is identify a set of tokens which could be related to a malicious request. Let's say for XSS,
maybe we can say that "alert" is one token which could tell us that there could be something malicious in a request.
So we look at malicious tokens in a request and then identify whether a particular request has a high likelihood of being XSS or SQLi or benign.
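As a rough illustration of scoring a request from its tokens, here is a toy Python sketch. It is not the model described in this conversation: the token weights, the bias, and the function names are hypothetical, and a real model would learn its parameters from data rather than use a hand-written table.

```python
import math
import re

# Hypothetical per-token weights (an assumption for this sketch).
TOKEN_WEIGHTS = {"alert": 1.5, "script": 2.0, "onerror": 2.0, "<": 0.5, ">": 0.5}
BIAS = -3.0  # hypothetical offset so that content with no suspicious tokens scores low


def tokenize(payload: str) -> list[str]:
    """Split into lowercase word tokens and single punctuation characters."""
    return re.findall(r"[a-z0-9_]+|[^\sa-z0-9_]", payload.lower())


def xss_likelihood(payload: str) -> float:
    """Sum the weights of the tokens and squash the total into the range (0, 1)."""
    logit = BIAS + sum(TOKEN_WEIGHTS.get(tok, 0.0) for tok in tokenize(payload))
    return 1.0 / (1.0 + math.exp(-logit))


print(tokenize("<b>"))
print(round(xss_likelihood("<script>alert(1)</script>"), 3))
print(round(xss_likelihood("hello world"), 3))
```

The payload with `script` and `alert` tokens ends up close to 1, while plain text stays close to 0; values in between are what make a threshold meaningful.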
And in terms of the evaluation framework, I think the biggest challenge we have seen in evaluating a WAF, like Nick pointed out, is that usually we look at false positives. In general, in the cybersecurity domain, we have seen that just looking at false positives does not give the right picture of how the model is performing, because real-world data could be completely different from what we have used for training our models.
So that.
So what kinds of things do you look for if it's not just false positives?
Yes.
So we have identified a set of properties we want a model to have, which includes, let's say, robustness as one of the properties.
We want a model to be robust to variations, because with existing signature-based rules, when you tweak the request a bit, like you add brackets or change brackets in the content, it will bypass the rule-based WAF, whereas with the machine-learning-based solutions,
we want these solutions to be robust against variations. Or if we add extra content to the request, let's say benign content in a malicious request, we look at whether that would change the overall score for the request or not.
So we look at a certain set of properties, and that's part of the evaluation framework.
We have identified a couple of properties we want in our model.
And then when we train our model, we evaluate the model based on those properties rather than just looking at simple metrics.
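A property-style check of the kind described above could look like the following sketch. Everything here is an assumption made for illustration: the perturbations, the tolerance, the stand-in scorers, and the names `toy_score`, `brittle_score`, and `robustness` are not taken from the framework discussed in this segment.

```python
def toy_score(payload: str) -> float:
    """Stand-in scorer: fraction of a few suspicious tokens present (case-insensitive)."""
    suspicious = ("script", "alert", "onerror")
    text = payload.lower()
    return sum(tok in text for tok in suspicious) / len(suspicious)


def brittle_score(payload: str) -> float:
    """Stand-in for an exact-match signature: all or nothing."""
    return 1.0 if "<script>alert(" in payload else 0.0


def perturbations(payload: str) -> list[str]:
    """Small variations that should not change whether the payload is malicious."""
    return [
        payload.upper(),                   # change letter case
        payload.replace("(", " ("),        # insert whitespace before brackets
        "name=alice&comment=" + payload,   # prepend benign content
        payload + "&page=2",               # append benign content
    ]


def robustness(score, payload: str, tolerance: float = 0.1) -> float:
    """Fraction of perturbed variants whose score stays within tolerance of the original."""
    base = score(payload)
    variants = perturbations(payload)
    stable = sum(abs(score(v) - base) <= tolerance for v in variants)
    return stable / len(variants)


attack = "<script>alert(1)</script>"
print(robustness(toy_score, attack))      # stable under all four variations
print(robustness(brittle_score, attack))  # breaks on case and bracket changes
```

Reporting a number like this per property, alongside precision and recall, is one way to compare models on behavior that a single held-out data set does not capture.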
And is that something that we can expose, that kind of robustness, as an output of the model?
Yes.
I mean, to a certain extent, yes, we can expose that as an output of the model.
But when we talk about robustness, we have certain categories where we want our model to be robust.
And when we train the model to induce the right set of properties, yes, we can expose that.
Okay.
And so in terms of what we actually can give to customers, like things to play with.
Can you talk a little bit about what we're giving to customers?
Right.
So presently we're exposing a couple of scores which indicate the degree of likelihood, as the model sees it, that a piece of content is XSS or SQL injection, or contains these malicious elements.
And we also have a pooled score, which is basically the likelihood that the request contains XSS or SQLi.
Essentially, this is something that can be integrated into a firewall rule in a way that's familiar to a customer, which we think is really exciting because these scores vary between 1 and 99, like the bot management scores.
So it's a bit more smooth than the kind of binary block or don't block that you might get from a traditional rule set.
And we're hoping that there will be some creative uses of these scores as well, which will give our customers a little bit more flexibility in managing traffic, going to particular routes on their zones and whatnot.
So it will give we give people the ability to say, I have a certain pain tolerance, as it were, of I want to be super sure the model is super sure that this is cross-site scripting before I take an action on it, as opposed to, say, someone who might be more sensitive to that and saying, well, if the model thinks it might possibly be, maybe block it.
It sounds like we give that kind of flexibility in there.
That's the idea.
It's basically just to make it a little bit more of a continuous scale of how sure you have to be.
I think there are definitely website owners out there who might have particular paths which they expect the content to look very much like, you know, SQL or something like that.
And they can obviously still protect those routes by just having very, very tight thresholds.
Maybe only the absolute most certain possible scores from the model should then yield a log or an alert or a CAPTCHA, whereas someone who's quite certain that nothing SQL-like should be going through there at all might have much more relaxed scores.
It just gives more flexibility.
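The idea of per-route thresholds can be sketched as follows. This is an illustration only, not Cloudflare's firewall rule syntax. The conversation gives the 1 to 99 range but not which end means "attack", so the direction used here (lower score means the model is more certain of an attack) is an assumption, as are the paths, the threshold values, and the function name `action_for`.

```python
# Assumption for this sketch: scores run from 1 to 99, and a LOWER score means
# the model is MORE certain that the request is an attack.
PATH_THRESHOLDS = {
    "/admin/query": 5,  # hypothetical route that legitimately receives SQL-like text
    "/": 40,            # hypothetical default for every other route
}


def action_for(path: str, sqli_score: int) -> str:
    """Return "block" or "allow" using the threshold of the longest matching path prefix."""
    if not 1 <= sqli_score <= 99:
        raise ValueError("score must be between 1 and 99")
    prefix = max(
        (p for p in PATH_THRESHOLDS if path.startswith(p)), key=len, default="/"
    )
    return "block" if sqli_score <= PATH_THRESHOLDS[prefix] else "allow"


print(action_for("/blog/post", 20))    # relaxed default threshold
print(action_for("/admin/query", 20))  # tight threshold on the SQL-like route
print(action_for("/admin/query", 3))   # only the most certain scores are acted on
```

The same score of 20 leads to different actions on the two routes, which is the flexibility the speakers describe.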
Right?
So that's what I was going to ask about. In SQL injection, or really in cross-site scripting as well,
is the threshold kind of between "this looks like SQL" and "this looks like SQL injection", those being two very different things?
So I think, yes, although one of the challenges we encountered is that there's an almost philosophical difference in these cases: is it JavaScript, or is it XSS?
Well, I guess it depends on whether you want it to be running or not at the end.
What's your pain threshold?
Right. Yeah.
And there are a lot of, I guess, domain-specific languages on the web that look a lot like SQL. Prometheus query language and other sorts of database query languages have a very similar structure, and they might even share some keywords.
So if you're expecting that kind of content on a path, you might really want to have only very tight thresholds.
And then, of course, there is the question: if you simply type out a valid SQL command, but there's no context in which it can execute,
is it an injection, or is it just some text that happens to be a valid SQL command?
So yeah, it's a little bit philosophical in places.
So what were some of the ways that you exposed the model to these SQL-like or cross-site-scripting-like things? Did you spend 3 hours just banging your forehead on a keyboard, like, oh, it's probably JavaScript, since you can actually execute just about anything?
What was the method of telling the model that this is JavaScript-like, but it's not cross-site scripting?
So we have a couple of different ways of doing this.
The primary way is that we generate a great deal of synthetic data.
We generate data that has varying degrees of structural similarity to real attacks.
And we use that to help teach the model that there are specific differences between content that might look like XML and content that is HTML and JavaScript, or between content that might look like Prometheus query language and content that is SQL, even though they're very similar.
So we do a lot of synthetic data generation in order to try to supplement the data sets.
And that's proven to be a really critical part of generating or inducing the right kind of properties and the right kind of score calibrations, because we do want the model to produce a kind of middle ground level certainty score if the content does kind of look like SQL, right?
Because then the customer has the opportunity to say, Well, I expect this kind of content or I don't.
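A very small sketch of template-based synthetic data generation is shown below. The templates, vocabulary, and labels are hypothetical and far simpler than anything a production data pipeline would use; they only illustrate the idea of pairing attack samples with benign samples that share keywords or structure.

```python
import random

# Hypothetical templates: attack samples, and benign samples that deliberately
# share keywords or structure with attacks.
ATTACK_TEMPLATES = [
    "' OR 1=1 --",
    "1 UNION SELECT {col} FROM {table}",
]
BENIGN_LOOKALIKE_TEMPLATES = [
    "please select {col} from the {table} menu",   # natural language with SQL keywords
    "sum(rate({col}_total[5m])) by ({table})",     # PromQL-style query
]
COLUMNS = ["name", "price", "requests"]
TABLES = ["users", "orders", "region"]


def generate(n: int, seed: int = 0) -> list[tuple[str, int]]:
    """Return n (text, label) pairs, where label 1 = attack and 0 = benign lookalike."""
    rng = random.Random(seed)  # seeded so the data set is reproducible
    samples = []
    for _ in range(n):
        label = rng.randint(0, 1)
        templates = ATTACK_TEMPLATES if label else BENIGN_LOOKALIKE_TEMPLATES
        text = rng.choice(templates).format(
            col=rng.choice(COLUMNS), table=rng.choice(TABLES)
        )
        samples.append((text, label))
    return samples


for text, label in generate(4):
    print(label, text)
```

Training on both kinds of sample is what pushes a model toward middle-ground scores for content that merely resembles an attack.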
As opposed to being really, really strict, like either SQL or bust.
Because one of the biggest challenges in web security is the fact that malicious attackers are coming up with incredibly sophisticated ways to obscure the content, or perhaps to send content that will only at some destination be transformed back into its really dangerous form.
So we kind of want the model to have that little bit of uncertainty about itself.
And even though that sounds bad, it's actually kind of good, because that way you can write rules against those scores.
Interesting.
Okay. So let's let's move on to some of the other challenges that we face.
And I know that at Cloudflare we have a really interesting problem, in that we run a very large distributed network and we handle a lot of requests per second.
And we want this model to be extremely performant, and to be able to detect things and provide value to our customers without adding a burdensome amount of latency, or even adding a lot of CPU time for us.
So what was one of the fundamental first challenges of getting this model to execute at our edge, our edge being what we call our data centers and our servers all across the world?
Can you talk a little bit about that?
Yeah.
So I think one of the biggest challenges, I mean, before getting to execution time, was agreeing on a model which could handle that amount of requests.
So I think that was the biggest challenge. In terms of architecture, we had to find the best architecture which could handle that amount of requests, as well as, I think, perform well in terms of execution.
Like Nick said earlier, we were looking at a lot of fine-tuning approaches as well, in terms of how we could make the model smaller so that it could be used at the edge and cover that volume of requests, which, as we say, is more than a billion requests.
So I think that's how we decided on a model.
And then in terms of execution performance, we agreed on a certain set of tools we wanted to use, like replacing TensorFlow with TensorFlow Lite, which is a lighter version of TensorFlow, and then using that to build our models.
And that's how we are able to reduce the latency at the edge as well.
Were there any models or approaches or architectures that we just had to completely ignore, even though they're best in class?
And this would give us a really great answer and solution to our problem, but we just couldn't use it.
Yeah, absolutely.
I mean, for us, we could not use the state-of-the-art transformer models which are used for most NLP problems.
So we could not use anything around that.
There are other architectures as well that we could not use, ones that lean heavily on a sequential component.
So we had to think of architectures which could do the job and are scalable as well.
And that's how we arrived at our approach. I won't go into the architecture in detail, but we have a model which looks at tokens, and we use those tokens and embeddings as part of the model and then train an attention-like, transformer-like model, but a different kind of transformer, I would say.
So now it's my turn to tease, because I know we're going to be talking about our process and some more in-depth technical details about our models in blog posts that are coming up soon and that we just released.
We'll be releasing some more information about this on our blog today.
And we're going to go a little bit more in depth, with both Vikram and Nick doing a technical deep dive on it.
So one of the things you mentioned was model size.
What do you mean when you say we have to go for a smaller model versus a larger model? And how does that impact the performance on the edge?
And also we've talked about robustness or that kind of thing.
Nick, can you talk a little bit about what kind of tradeoffs we had to make there?
Right.
So the primary things that you have to worry about are the memory footprint of a particular model when it's loaded and, well, the latency as well.
So the size affects the latency in the sense that for a deep learning solution, for example, the more complex it is, the more sophistication and the more information it contains, the larger the model is in terms of the actual number of bytes it requires to load.
But it also generally will require more CPU ops to execute, which is where the latency comes from.
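As a back-of-the-envelope illustration of how parameter count drives the loaded size, consider the arithmetic below. The parameter count is hypothetical, and the comparison of 32-bit floats with 8-bit integers is an assumption made for illustration; the conversation does not say which techniques were used to shrink the model.

```python
def model_bytes(num_params: int, bytes_per_param: int) -> int:
    """Approximate weight storage: one fixed-width value per parameter."""
    return num_params * bytes_per_param


params = 2_000_000  # hypothetical parameter count

float32_size = model_bytes(params, 4)  # 32-bit floating point weights
int8_size = model_bytes(params, 1)     # 8-bit integer weights

print(f"float32: {float32_size / 1e6:.1f} MB")
print(f"int8:    {int8_size / 1e6:.1f} MB")
print(f"ratio:   {float32_size // int8_size}x")
```

This only counts the weights; the working memory needed during inference and the number of operations per request depend on the architecture as well.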
So when we said that we were trying to drive down the size, this kind of takes many forms.
It takes the form of we want to reduce the memory footprint when everything is loaded up on the edge server.
And we also wanted to make it easier for the CPUs that are running at our edge to essentially do as much as possible in as few ops as possible.
Modern server CPUs have some special hardware on them which allows them to do very, very efficient computations of some types of mathematical operations.
And one of the really cool things about a framework like TensorFlow Lite is that it's optimized to make the most use of those like the best use possible.
So, once we had found a solution we were happy with, we spent a lot of time trying to aggressively shrink it and tune it against the hardware that we thought it would be running on, in order to get the best possible performance statistics.
And even things like heat thrown off by the CPUs can be a big factor because some of these special operations on these server CPUs are very power hungry and we have to keep all this stuff in mind as well to make sure we don't melt servers down the hall.
We laugh a little bit about that, but fortunately we have really great infrastructure teams that are worrying about that all the time.
And that is a concern.
But fortunately, there are a lot of limitations in place so that we can't actually turn our servers into a crying heap of metal and silicon somewhere.
Yeah. So what are what are some of the things that we've done?
We've taken a different approach for the way that we've built our machine learning.
What are some things that we're doing differently as opposed to how this has been tried before?
Vikram, can you talk a little bit about the things that we tried to learn from others' experiences, and what we did a little differently?
Sure.
So I think one of the major things we have done differently is this: a lot of the existing research work has been around looking at some trivial examples.
Like I mentioned before, they had external datasets, open source datasets, which they were using to train their models.
For us, the advantage is that we generated a lot of synthetic data to make the data more homogeneous.
And that's how we are training the model, so that it works well on the set of properties we want to induce in a model.
And I think that's what we have done differently.
And then I think in terms of the evaluation framework, like I've said, a lot of the existing research reports false positives, or precision and recall, in their papers.
And to be honest, I mean, that's overestimating the performance of the model.
They are not giving the right picture, because if you use that same model for inference on real-world data, it will definitely perform poorly.
And that's one of the main reasons why we came up with a certain set of properties which we wanted to induce in our model, and a testing framework which could help us test the model on those properties.
So I think that's how we went differently.
And then the other thing I think we did differently was that we wanted a scalable solution, a solution which could handle 1 million plus requests.
And you've said this a couple of times, like 1 billion requests. So across our edge we handle over 30 million requests per second.
We've talked about that in a couple of blog posts that we released today.
And so that's that's kind of on average.
And so when we're talking about these models, that's what we want them to be able to handle, plus or minus some percentage of that, obviously not all on one server.
But that's kind of the average of the kind of scale that we're dealing with.
And so you're kind of saying that's slightly exceptional compared with a lot of people who have tried to do this before.
Yes, absolutely.
So I think most of the models they've built are not scalable to the number of requests which we receive at our scale.
Like we said, I mean, there are state of the art models like Transformers, which might perform better than some of the models which we are also trying to build.
But then they are not scalable and cannot be used in the real world production environment.
So I think we had to make a tradeoff to get a solution which could do the job and is also scalable.
So I think either they have a solution which will perform poorly in a production environment, or a solution which could do well but is not scalable.
Great.
Well, unfortunately, that is all the time we have. Thank you very much, both Nick and Vikram, for hanging out with us today. Look forward to the following updates, and pay attention to the blog.