Machine learning: getting to more effective security postures
Presented by: John Cosgrove, Daniel Gould, Jesse Kipp
Originally aired on September 1, 2023 @ 10:00 AM - 10:30 AM EDT
Welcome to Cloudflare Security Week 2023!
During this year's Security Week, we'll make Zero Trust even more accessible and enterprise-ready, better protect brands from phishing and fraud, streamline security management, deliver dynamic machine learning protections and more.
In this episode, tune in for a conversation with Cloudflare's John Cosgrove, Dan Gould, and Jesse Kipp.
Tune in all week for more news, announcements, and thought-provoking discussions!
Read the blog posts:
- Detecting API abuse automatically using sequence analysis
- Using the power of Cloudflare’s global network to detect malicious domains using machine learning
- Automatically discovering API endpoints and generating schemas using machine learning
For more, don't miss the Cloudflare Security Week Hub
Transcript (Beta)
Hello, everyone. Hello. Welcome to Security Week 2023 and naturally welcome to Cloudflare TV.
We're here actually to talk about how we're using machine learning in really interesting new applications for security to help our customers get to more effective security postures.
My name is Dan Gould. I'm on the product marketing team and I'm joined by a couple of esteemed colleagues, Jesse and John.
Jesse, do you want to introduce yourself?
Sure. Hi. Yeah, my name is Jesse Kipp and I'm the engineering manager of the Threat Intelligence engineering team here at Cloudflare.
And our responsibility is to take in threat intelligence data from a variety of different sources and make sure it gets delivered to all the products in Cloudflare that need it.
Exciting stuff and very, very important.
John, we're glad to have you with us, too. Do you want to say hi?
Hey, everybody.
My name is John Cosgrove. I'm the product manager for API Gateway here at Cloudflare.
Securing, managing, and protecting APIs.
Indeed, indeed. So, you know, thrilled to have you both with us today. It's Wednesday of Security Week.
And, you know, for those who've seen the blog today, you'll notice there's a really powerful common thread in many of the pieces of news.
And that's how we're using machine learning in interesting ways to help our customers stay safer, help their employees stay safer.
And so we're actually going to talk about a couple of those use cases here today.
And one is, of course, protecting users and employees and doing that with sophisticated machine learning models, which Jesse will talk about.
And then John will walk us through how we're helping organizations get to stronger API security postures, really better understanding abuse by using machine learning.
Both, you know, very powerful, essential use cases.
So with that said, Jesse, why don't we dive in with you?
You know, we've seen the blog today really setting the stage.
How do the machine learning models you cover in the blog today fit into that broader notion of keeping users and employees safer?
Sure.
Yeah. So usually when we see a cyber attack happening, you know, what doesn't happen is it doesn't jump straight from, you know, a user accidentally downloads malware or enters their credentials into a phishing site straight to some extreme outcome where, like, the Colonial Pipeline on the East Coast of the United States is shut down for a week.
There's some intermediate step here that has to take place.
Right? And the attacker has gained access to a device or a computer or a set of credentials.
But next, they have to figure out, like what now?
Like what computer am I on? What's the layout of this network? You know, how can I exploit my access?
What can I do? And in order to do that, they need to get information in and out of the network and, you know, communicate with whatever piece of exploit they have on the inside, a piece of malware that's running on a user's device, for example.
Yeah, indeed.
And you think about, like, I know you mentioned the MITRE ATT&CK framework.
I think a lot of that is at play here. When they get that initial foothold, they need to know, like, what's next?
Where can I go? Who am I? Right, running all sorts of commands.
Now, you know, when they're doing this, Jesse, I mean, it seems to me that they're doing everything they can to be discreet and avoid being detected.
Right? Or blocked, it seems to me. Right? Does that sound right?
Yes, exactly.
So they're operating inside the network and they're trying to avoid being blocked and detected.
And as, you know, network administrators, network defenders, we have a variety of tools on our hands to help be on the other side and detect or block that.
So we've got firewalls, you know, cloud firewalls, on-prem firewalls, our network security software.
And then we've got DNS. And DNS is a useful tool both for observation as well as control.
So we can use threat intelligence to block access to threats.
So, you know, if you're using Cloudflare's Secure Web Gateway product, there are lists of phishing sites, you know, threat intelligence about phishing sites in there that will block access to those phishing sites for users.
But then there's also, you know, threat information in there about command and control and malware sites as well, which can help detect when a piece of malware is communicating with the outside world, attempting to contact its operators to receive those commands that determine what to do on the inside of the network.
And... Keep going, sorry.
And DNS is a useful tool because, you know, we can both block as well as log, and we can access those logs and see what's happening over DNS.
So, you know, thinking about DNS, it can almost, though, be a bit of a two way street, right?
This is where attackers can also use DNS to their advantage to avoid being blocked, detected, etc.
How do they tend to go about doing that?
Right. Yeah. So in addition to being sort of a point of control and a point of observation for us as defenders and network administrators, it can also be used by attackers as a way to transmit information or, you know, as part of schemes to attempt to avoid that detection and control.
So there's a couple of different techniques that the machine learning models that we're talking about today address, and one is domain generation algorithms and the other is DNS tunneling.
Yeah, indeed.
So two interesting areas that I think you've covered in the blog today.
So let's talk about each of those individually and maybe we'll start with domain generation algorithms.
So can you tell us, you know, what is it, you know, from a high level and how do attackers tend to use it?
Right.
So the idea behind the domain generation algorithm starts with how a piece of malware normally behaves: after installation, it goes out and communicates with a single domain to do whatever it's going to do, exfiltrate credentials, receive commands, download second stage malware or whatnot.
You know, if it's just a single domain name or a single IP address that it's going to go communicate with, once defenders or, you know, the security community learns what that address or domain name is, it can be blocked both in firewalls and in DNS.
And so the attackers have, you know, come up with schemes to attempt to avoid this.
And one way to do that is to randomly generate a set of domain names on a given day and go try to communicate with all of them.
So they use a pseudorandom number generator that generates numbers that are predictable if you have the random seed, but not predictable just from the outside observing them.
Right? And then it attempts to go out and you know, sometimes these can generate dozens or hundreds of different domain names a day.
So the malware will generate its set of random domain names for that day and go out and try to communicate to each one of them.
And if the attacker wants to gain control of the malware and communicate with them, they will register one of those domain names and begin responding to the malware on that particular domain name.
And so that prevents us, as defenders, from being able to block that malware's communication just on one static point of control.
And to sort of take it to the next level, oftentimes the algorithm will go use some piece of information that is found out on the Internet so that we can't predict, you know, for the next month out, what are all the domain names that this thing is going to generate.
So there have been examples of malware using, for example, the trending topic on Twitter or the price of Bitcoin that day from a particular website as the random seed.
So both the malware and the controller, the attacker go get the random seed, compute the domain names for that day.
The attacker registers one of the domain names and begins communicating with the malware.
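The daily domain generation described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `generate_domains`, the `.example` TLD, the label length, and the way the seed is combined with the date are all made up for the sketch and do not describe any particular malware family.

```python
import random
import string
from datetime import date


def generate_domains(seed, day, count=5, length=12, tld=".example"):
    """Derive a repeatable list of random-looking domains for one day.

    Anyone who knows `seed` and `day` gets the same list; an outside
    observer without the seed cannot predict it.
    """
    # Seeding the PRNG with a string makes the output deterministic.
    rng = random.Random(f"{seed}|{day.isoformat()}")
    return [
        "".join(rng.choice(string.ascii_lowercase) for _ in range(length)) + tld
        for _ in range(count)
    ]


# Both sides compute the same candidates from a shared public seed
# (hypothetically, that day's trending topic).
today = date(2023, 3, 15)
malware_side = generate_domains("trending-topic", today)
attacker_side = generate_domains("trending-topic", today)
```

Because both lists are identical, the attacker only needs to register one of the candidates, while a defender who blocks yesterday's domain has blocked nothing that matters today.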
Wow.
So they become very shifty, almost a moving target. Right? It's just impossible to, very tough to pin down.
Right? Right.
Yeah. So the target is moving all the time and, you know, on a given day, we don't necessarily know before that day what the random seed is going to be and what the domain names are that are going to be generated.
And so we've built a machine learning model that attempts to solve this.
Great problem for machine learning.
So, yeah. Tell us about the model.
So we built this model by getting a set of previously known domain generation algorithm domains and then a list of known good domains.
And you know, those are our two sets of, of examples.
And, you know, the purpose of domain names, right, is to have some human-friendly domain that you can just go in and type into your computer, cloudflare.com.
And so domain names that are used by us as humans tend to have words and common patterns in them that occur often.
Right? Words, abbreviations, human language stuff, whereas domain generation algorithm domains, particularly the ones targeted by this model, end up being just this random string of characters.
So we built a model that is based on the latest machine learning technology in natural language processing called Transformers.
And it's a special type of neural network that encodes a lot of information about language in the neural network itself.
And so we take the domain names.
So we train the model to be able to predict, you know, DGA domain or not DGA domain.
And then we take domain names that we see in our new domain feed coming from our 1.1.1.1 resolver, and we apply this model and it gives us a prediction of whether we think this is human generated or not human generated.
And then when it's not human generated, we apply a label of, you know, potential DGA domain to it so that we can block it on our edge.
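To make the "train on two sets, then label new domains" idea concrete, here is a deliberately tiny stand-in. It is not the Transformer model described in the conversation: it swaps in a character-bigram likelihood comparison with add-one smoothing, and the training lists, the `ALPHABET_SIZE` constant, and all function names are assumptions invented for the sketch.

```python
import math
from collections import Counter

ALPHABET_SIZE = 38  # assumption: a-z, 0-9, '-' and '.'
VOCAB = ALPHABET_SIZE ** 2  # number of possible character bigrams


def bigrams(label):
    label = label.lower()
    return [label[i:i + 2] for i in range(len(label) - 1)]


def train(examples):
    """Count character bigrams over one class of example labels."""
    counts = Counter()
    for name in examples:
        counts.update(bigrams(name))
    return counts, sum(counts.values())


def log_likelihood(label, model):
    counts, total = model
    # Add-one smoothing so unseen bigrams do not zero out the score.
    return sum(math.log((counts[bg] + 1) / (total + VOCAB)) for bg in bigrams(label))


def looks_generated(label, good_model, dga_model):
    """True if the label is more likely under the DGA model than the good one."""
    return log_likelihood(label, dga_model) > log_likelihood(label, good_model)


# Toy training data; a real system would use large labeled sets.
good_model = train(["cloudflare", "google", "wikipedia", "example"])
dga_model = train(["xkqzvjwq", "qzxjvkwp", "zqjxkvwq"])
```

With these toy sets, `looks_generated("cloudflare", good_model, dga_model)` comes out false and `looks_generated("xkqzvjwq", good_model, dga_model)` comes out true. The point is only the shape of the pipeline: two labeled sets go in, and a per-domain prediction comes out that can then be turned into a label.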
Interesting.
Interesting. So when we think about like deploying this for better security, you know, how does that work?
Right.
So, here at Cloudflare, we run the model inside our infrastructure and we take in the logs of the queries that we see in the 1.1.1.1 DNS resolver, and we first filter them down to the set of new and newly seen domains for that day, and then we apply the model to them.
And when we've detected a domain generation algorithm domain, we label that domain and that is basically automatically actioned in, you know, the products at Cloudflare that use that data. Particularly, you know, Secure Web Gateway, both the DNS and the web gateway versions of the product, uses that. It's automatically provided by the product.
You know, you as an administrator go in and you configure what types of security threats you want to block or detect.
You write your firewall rules and then that data is just automatically updated continuously.
Powerful stuff, right?
You know, using our machine learning to better protect people and really keep pace with attackers and their techniques, like domain generation algorithms.
In the few minutes we have left, let's also talk about the other area, DNS tunneling, and detecting that.
Jesse, can you tell us a little bit about what is DNS tunneling and how attackers use it before we talk about the model?
Sure.
Yeah. So DNS tunneling is a somewhat related but a little bit different technique.
And the idea behind DNS tunneling is not just that we're going to use the domain name system to go out and find what IP address to communicate with, but we're actually going to use that lookup itself to transmit information in and out of the network.
So about 18 months ago, during the SolarWinds incident, where the SUNBURST malware was embedded in the SolarWinds product, a security incident that affected many companies globally, we saw exactly this sort of behavior happen.
So the malware would do a lookup to a particular domain, and part of that host name contained information about the host that it infected, including some of the host name and network information that it extracted from that host so that the attacker would then be able to decode that and see like, okay, I've installed this malware on, you know, thousands of computers around the world.
What are they? What are their host names?
What network are they on? Is this one of the targets that I'm targeting?
And then the response that would come to that DNS lookup in this particular case was an IP address, and that IP address encoded what command to take.
So if the attacker, the threat actor in this case wanted to, for example, list the processes that were running on a particular server, they would send back one IP address and then that would get transmitted back to the server.
And then they had a variety of different commands that they could control the malware with in this way.
But that's just one example, and it's actually a pretty compact set of data that's transmitted back and forth.
It's actually possible to build an entire VPN tunnel over DNS, as if you're using, you know, WARP or any other VPN software.
All the network traffic that your computer generates is encoded in DNS queries and transmitted via DNS to the authoritative DNS server for that domain, which then decodes that data and transmits it onto the internet, onto the network destination.
And then it encodes the response in the DNS records that come back.
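The compact exchange described for SUNBURST, with host details carried in the query name and a command carried in the returned IP address, can be illustrated with a short Python sketch. Everything specific here is hypothetical: the `tunnel.example` domain, the base32 encoding choice, and the IP-to-command table are assumptions for illustration and are not taken from the actual malware.

```python
import base64

TUNNEL_DOMAIN = "tunnel.example"  # hypothetical attacker-controlled zone

# Hypothetical mapping from answer IP (documentation range) to a command.
COMMANDS = {
    "203.0.113.1": "list_processes",
    "203.0.113.2": "sleep",
}


def encode_query(host_info):
    """Pack data into the leftmost label of a DNS query name."""
    label = base64.b32encode(host_info.encode()).decode().rstrip("=").lower()
    if len(label) > 63:  # a single DNS label is limited to 63 octets
        raise ValueError("payload too long for one DNS label")
    return f"{label}.{TUNNEL_DOMAIN}"


def decode_query(qname):
    """Recover the data on the authoritative server side."""
    label = qname.split(".")[0].upper()
    padding = "=" * (-len(label) % 8)  # restore stripped base32 padding
    return base64.b32decode(label + padding).decode()


query_name = encode_query("host42.corp")
recovered = decode_query(query_name)
command = COMMANDS.get("203.0.113.1")
```

The encoded labels are long and look nothing like words, which is the kind of signal a tunneling detector can learn from.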
Very surreptitious.
Now in the couple of minutes we have left, how are we using machine learning here to, you know, keep customers safer?
Yeah.
So we've built a machine learning model that takes the DNS requests coming from Gateway and detects DNS tunneling and, you know, similar to the DGA domains, it applies the DNS tunneling label to that traffic.
And, you know, Gateway will, both at the DNS layer and at layer seven with the Secure Web Gateway product, block those domains from, you know, transmitting.
It will stop resolution of those domain names which cuts off that communication channel.
So instead of going to the authoritative DNS server, our product will just start returning a DNS blocked response.
Really cool stuff.
Really cool stuff. And you know, we are indeed using machine learning to keep, you know, our customers and their employees safer.
So, Jesse, you know, great work this week.
Keep it up. Now, with that said, John, I want to bring you in here because we've only hit on sort of one half of the sort of machine learning innovations we wanted to cover today.
And so let's talk about how we're using machine learning to help organizations better understand their APIs and, you know, effectively predict or see abuse.
Can you give us a brief overview of what we're talking about today with respect to the news?
Yeah, for sure, Dan.
So today we announced sequence analytics for APIs and it's a brand new way of visualizing and understanding your API traffic and your user behavior.
It's a new feature for us with API Gateway, and what it does is it looks at requests made during user sessions, identifies the important sequences of requests that your teams should actively protect, and surfaces all of that in a dashboard.
Sounds good.
So I hear the word sequence and I think I know what it means, but, you know, maybe it makes sense to chat through that.
What is a sequence and what are some examples?
For sure.
Plainly, it's just an ordered list of API requests. It's as simple as that.
Doesn't matter if it's Web or if it's mobile or if it's like a B2B partner API.
Any of that is still in scope here.
So for instance, think about using a mobile app to order pizza.
There's an order to the API requests that leave the mobile app and go to the API origin, right?
Something like Create a new order, Add pizza to your order, Add toppings to the pizza, Add the pizza to the checkout cart, Apply any coupons that you might have.
Everybody loves pizza coupons. Select pickup or delivery, and then finally, Checkout.
So are you tracking, right? Like we're not reinventing the wheel here with the idea of a sequence just yet or anything.
Okay.
So we'd be able to detect essentially something that's out of order, something that's odd and suspicious in the way, you know, a communication sort of hits these APIs.
Is that what I'm hearing?
Yeah, we'll certainly get there.
Okay.
Got it. Got it. And how do we think about, like, how this can better inform security?
I think there's two main ways.
And first, customers tell me really often that they just have way too much traffic to analyze manually.
They need help prioritizing where to set up their security, what to focus on first.
So with sequence analytics, customers get to see their most correlated sequences of requests that represent how their APIs are actually used by their users.
And that's really useful information when you're searching for what to protect first.
And then second, it sets the groundwork for future protection of APIs like you just mentioned, right?
In the previous example with the pizzas, you're not going to add coupons before there's food in your cart.
So if you're an API security administrator and you see that kind of behavior, it'd be suspicious and you might want to stop it, right?
Somebody is trying to guess all your coupon codes or something, right?
So sequence analytics will let us know like what's expected today and future releases like abusive sequence detection and sequence mitigation will let us stop that suspicious or unexpected behavior in the future.
And like the pizza app is a toy example, right?
But just think of it like all of this is generalizable to any API use case where there's an expected order of user behavior.
That's interesting.
Right, if we see somebody going straight to the checkout page without all the steps that should precede it, that's suspicious.
Right? And that's for e-commerce for any, any application.
So how is this going to work? The sequence analytics.
Can you talk about that for a couple of minutes?
Yeah. So things like how long of sequences will customers see or how many sequences will they see and how are they ranked?
So right now we're surfacing the top 20 sequences inside the API Gateway dashboard, and each sequence can have a maximum of nine API calls inside of it.
And what do I mean by a top sequence? Well, after we look at the user sessions and create the sequences out of their requests, we have to score them and we use what we call a correlation score to rank and order the sequences, meaning that the highest scoring sequences will contain API requests which are likely to occur together in that order.
So higher correlation means that there's a higher chance that those specific API calls are only seen in that specific order.
I think you'll hear me talk about more of it later.
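The conversation does not spell out how the correlation score is computed, so the following is only a toy proxy for the intuition that a high score means the requests are mostly seen together in that order. The formula (occurrences of the full sequence divided by occurrences of its most common member request), the function names, and the request strings are all assumptions made for this sketch.

```python
def count_sequence(sessions, sequence):
    """Count contiguous occurrences of `sequence` across all sessions."""
    n = len(sequence)
    return sum(
        1
        for session in sessions
        for i in range(len(session) - n + 1)
        if session[i:i + n] == sequence
    )


def toy_correlation(sessions, sequence):
    """Toy score in [0, 1]: 1.0 means every request in the sequence
    only ever shows up as part of this exact ordered sequence."""
    together = count_sequence(sessions, sequence)
    busiest = max(count_sequence(sessions, [request]) for request in sequence)
    return together / busiest if busiest else 0.0


# Hypothetical user sessions, each an ordered list of API requests.
sessions = [
    ["login", "cart", "checkout"],
    ["login", "cart", "checkout"],
    ["login", "browse"],
]
tight = toy_correlation(sessions, ["cart", "checkout"])
looser = toy_correlation(sessions, ["login", "cart", "checkout"])
```

In this toy data, `cart` followed by `checkout` scores higher than the three-step sequence, because `login` also appears in a session that never reaches the cart. A frequency count alone would not separate the two, since both sequences occur twice.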
Okay.
Got it. Got it, got it. So something you know, John, you and I have certainly talked about in the past is the idea of a positive security model for APIs.
Right? Almost like a default deny, you understand what's legitimate and only allow that.
How does this fit into the notion of positive security?
Yeah, for sure.
With API Gateway, customers could already create positive security models in two main areas.
One was around volumetric abuse protection and the other one was around schema validation.
Volumetric abuse protection suggests intelligent rate limits per API endpoint based on the user's session identifier, and that really helps reduce false positives by having per-endpoint, session-based rate limits instead of something really broad like an IP-based rate limit set for the entire API.
And then the second way that we're doing it already is with schema validation.
Schema validation enforces the known specifics when communicating with API endpoints, making sure things like the required fields are there and the required formats for those required fields are there in that format and nothing else.
So with sequence analytics, we're starting on a third pillar.
Today there's visibility into the sequences that users are sending in on the API, and in the future you'll have the ability to set the expected usage of your API to prevent unwanted behavior.
Customers tell me all the time that they know the expected usage of their app and in the future they'll be able to enforce that usage.
So let's go back to the pizza example real quick.
You'll be able to say that customers have to add items to their cart before they can test coupon codes, for instance.
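The kind of precedence rule described here, where items must be in the cart before a coupon is tried, can be sketched as a simple check over one session's ordered requests. This is a hypothetical illustration rather than the product's rule syntax: the function name and the endpoint strings are assumptions.

```python
def violates_precedence(session, required_first, then):
    """True if `then` is requested before any `required_first` request."""
    seen_required = False
    for request in session:
        if request == required_first:
            seen_required = True
        elif request == then and not seen_required:
            return True
    return False


# Hypothetical pizza-app sessions, each an ordered list of API requests.
normal = ["POST /orders", "POST /cart/items", "POST /coupons/apply", "POST /checkout"]
guessing = ["POST /orders", "POST /coupons/apply", "POST /coupons/apply"]

normal_flagged = violates_precedence(normal, "POST /cart/items", "POST /coupons/apply")
guessing_flagged = violates_precedence(guessing, "POST /cart/items", "POST /coupons/apply")
```

Only the second session is flagged: it tries coupon codes with nothing in the cart, which matches the coupon-guessing behavior mentioned earlier in the conversation.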
Got it.
Yep. That's expected behavior, and you can enforce on that and only allow that. So I know, obviously, look, building products is hard, and I think, you know, you both know that.
What are some of the challenges that have come up as we've thought about this, John?
Yeah, well, sequence analytics presented some difficult technical challenges in a few different areas.
Sessions can be really long lived and they can contain a lot of different requests.
And we've got this giant global network here at Cloudflare with tons of API requests coming in every day.
And so as a result, it's not sufficient to define sequences by like the session identifier alone.
You'd have way too many since there's just so many users out there and so many sessions.
So instead it was necessary to develop a solution where we could automatically identify multiple sequences within a certain session and sort of collapse them into a general usage pattern.
And then additionally, because these important sequences are not necessarily categorized by volume, and the set of potential sequences is large, it's necessary to develop a solution capable of identifying important sequences.
So back to that correlation score I mentioned earlier.
We're not just surfacing frequent sequences.
The important sequences are the ones with the highest correlation score.
Interesting.
Okay. Yeah, indeed. So as we're thinking about what's next for the product, John, anything you want to share there?
Yeah.
So like I mentioned before, sequence analytics is the first step of visibility here.
We already are working on ways that customers can tell us that specific behavior, that precedence and what's expected instead of us telling them.
So sequence analytics today is us telling customers, Hey, here's what to focus on first and in the future, we want to allow customers to tell us, Hey, I want to enforce on this specific behavior.
And that's exactly what we're building. So customers will be able to create that positive security model around expected behavior and then stop the unexpected behavior on their APIs.
And we're really looking forward to putting that functionality into customers' hands.
Very powerful, very differentiated.
Well, hey, you know, to wrap us up, John, I think a picture is worth a thousand words, as they say.
Can we see this or get a sense for what this looks like?
For sure.
Let me take you through an example here taken from our dashboard. Look at that.
So in this screenshot you're seeing the main sequences page.
I've expanded a row here in the sequences, so I mentioned before top 20 sequences.
This is just one, and each sequence can have up to nine requests in it.
This one just has three.
But you can see the beginning of your sequence all the way to the end, which methods are used, filter by the host name, and most importantly, see this correlation score.
This sequence has a super high correlation score, which means that these specific requests are seen in this order mostly on their own.
So these requests aren't mixed in with any other sequences.
That means you should really pay attention to this and secure it with any available tools that you have inside the Cloudflare dashboard.
So the things I mentioned earlier, like rate limiting and schema validation. You're already using API Gateway, so you might as well use our JWT validation as well that we just released.
And of course Cloudflare's standard tools, WAF, DDoS, all of that stuff, right?
So these are, these are important for you to focus on. Take all those actions and then take these sequences back to your App Dev team.
Say, Hey, look, is this the specific expected behavior on our APIs or is this anomalous?
Is there an attacker out there that's abusing us in a specific way that normal users are not?
It's really important to note that that behavior will show up here as well because nobody else is doing it.
So those anomalous sequences have the potential to have high correlation scores as well.
The frequency here is a metric for how often the sequence is called, but the correlation score is the star of the show here for sure.
Very powerful stuff.
Very powerful stuff. And this is something that's really unique to allow organizations to better understand their APIs and potential abuse that's occurring.
And so, you know, great work on this, John. Appreciate you sharing it.
So I think that actually brings us towards the end of our session. Jesse, thank you for walking through how we're using machine learning to keep our organizations and their employees safer.
And John, again, congrats on the hard work on the sequence analytics.
So with that, gang, I think we can wrap up, you know, and I encourage those who haven't checked out the blog this week, go to the blog.
There's a ton of Security Week news across the board, from Zero Trust to application security to APIs.
You name it, we're probably working on it this week.
So thank you so much for tuning in. We appreciate it.
And we'll see you in a future segment. Thank you.
Bye. Bye.