🔒 Security Week Product Discussion: A New Cloudflare WAF
Presented by: Patrick Donahue, Michael Tremante, Nafeez Ahamed
Originally aired on October 17, 2021 @ 1:30 AM - 2:30 AM EDT
Join Cloudflare's Product Management team to learn more about the products announced today during Security Week.
Tune in daily for more Security Week at Cloudflare!
Transcript (Beta)
Hello and welcome back to Cloudflare TV. We're here during security week or security week plus.
This is week number two. We had so much stuff that we wanted to extend it a few more days and so I'm joined here by a couple of folks on the Cloudflare team.
We're going to get to them in a second and we're going to talk about some of the product announcements we made.
Like a lot of days, we've got both product managers and security engineers and we work really closely together.
So excited to chat about some of that stuff.
So why don't we start by introductions. Michael, Nafeez, can you introduce yourself and tell us what you do here and what you're working on?
Yeah, sure. Thank you. Thank you, Patrick. Hello everyone. My name is Michael Tremante.
I'm a product manager here at Cloudflare. I specifically work on the managed rules / web application firewall team, and I'm on Patrick's team.
I've actually been at Cloudflare for quite some time even before being a PM.
I was part of the solutions engineering team, and that's given me quite a bit of insight into what our customers are looking for.
So looking forward to the conversation today.
Thanks Michael. Hi Pat and hi everyone. I'm Nafeez. I work in the product security team at Cloudflare based in San Francisco.
Mostly I'm working on security products, helping other security engineering teams build awesome security products and making sure that Cloudflare's infrastructure is safe and secure.
Great.
Glad to have you here Nafeez. I know you've been really helpful as we think about building products and that consultation approach has been really useful for us over the years.
So Michael, first off, I think it's really cool that you transitioned from one role to another and I was fortunate to be able to convince you to come work in product management.
But just before we jump in, what has it been like?
What do you think about product management? Was it what you expected? Yeah, no, it's been really fun so far.
I've been at Cloudflare for nearly six years now.
It feels like things have changed a lot since the first day I joined.
And for the first five years, speaking to customers was my day-to-day job, helping customers onboard onto Cloudflare.
And sometimes, as part of the SE team, we're trying to close the gap between what the customer is asking for and what we can deliver as part of the Cloudflare product.
And I think that's actually helped me get some pretty good insights in some of the updates that we're now bringing into the product.
In terms of how it's going, it's been great. I like Cloudflare. I joined Cloudflare because of its technical mindset, and being part of the product team has brought me a little closer to the engineering team as well, which is what I was also looking for.
And getting to understand a little better how the actual infrastructure works, how the platform works, how we scale applications up to serve millions of customers on the platform has been really, really, really awesome.
I still get to speak to customers daily. I think many product managers would agree, especially at Cloudflare, probably having several conversations per day on some days.
So I still get to have good conversations, get feedback, both constructive and positive feedback.
And now I just have a little bit of impact on some of the directions we take and the decisions we make when building new features for the WAF.
So, so far, yeah, so good. Really excited about it. I think you've got that direct conduit now to the engineering team.
You don't have to go through someone.
You can have a customer conversation and then turn around and talk to your engineering counterparts about what you're building and why.
And I want to get to what you're building now.
But it's also, I think, just what you said about that experience of helping customers on board.
We talk a lot on the product team about making the product incredibly easy to use and self-serve.
Even our largest customers, if they want to just go it on their own, should be able to do so.
And we talk about API-first development and everything being accessible and usable.
There are very, very few things, I think it's down to low single digits now, where you'd have to actually ask for some help to do something.
We feel very strongly that you should be able to do this without paying somebody professional service fees or anything like that.
So let's talk about what you announced this morning.
So web application firewall, what is a WAF? How does it work?
Right. Yeah. So big, big day for us in the WAF team. For those of you who follow our blog, you may have noticed we made a pretty big statement regarding our WAF.
We're launching a new web application firewall. But stepping back, what is a WAF?
A WAF is essentially a software application that helps you keep your website secure.
In the context of Cloudflare, the WAF operates as part of the reverse proxy.
So anytime a user tries to access your application or your website, or maybe it's a bot that's trying to access your API, that traffic will go via the proxy.
And that's where we operate our WAF. And it's essentially keeping the bad stuff out and making sure only the good stuff reaches your origin.
WAFs historically are, in the simplest cases, just a set of rules that we run against traffic to make sure we're matching against bad traffic and trying to reduce false positives to a minimum, right?
Because we don't want the WAF or any firewall-type application to stop your actual users from reaching your application.
That would be a complete opposite effect from what we're looking for.
The nice thing about the Cloudflare WAF, of course, is that because it's running on our edge, you don't have to think about deploying the WAF, maintaining it, keeping the software up to date, scaling it if your application grows.
All those problems sort of go away and we handle that for you.
But in essence, it's a, yeah, keep the bad stuff out, let the good stuff through.
And so we've had a web application firewall for quite some time, right?
We've made improvements to the existing code base.
I think it's great to consolidate onto an existing code base. And we'll get to that in a second and the advantages that you get from using a shared code base.
But who wrote the original Cloudflare WAF? What technologies did they use to write it?
So this is funny. I was drafting the blog post announcement and I was talking all about the new WAF, and then John, who is our CTO and who reviews most of the announcements we make, was like, let's talk about the fact that the original code is finally leaving the Cloudflare edge.
He was one of the authors, along with the original engineers back in London who were working on the firewall, who wrote our first WAF product, which was actually based on a LuaJIT module for NGINX, which is how Cloudflare started growing from the start.
As a reverse proxy, NGINX was our technology of choice. And initially it was just a set of simple rules running on traffic that you could turn on with a simple click once your website was deployed on the platform.
And finally, as of today, we're bringing a big new iteration, and once we've finished rolling out the new WAF to everyone, we're going to start deprecating the old code.
And I think John would like to open source some of that code as well, so people can see how it worked.
Yeah, that is really cool. And so what about the rules though that run there?
And actually, before I get to the rules, I just came from a session where we were talking about replacing John's code on something called Keyless SSL.
And so it's really fun to see how a lot of these products have evolved. And we solved real customer problems at the time.
And I think over time you find things that don't work quite as well as you anticipated as the uses of the product change over time and the scale changes.
And so it's been great to see you revisit that and challenge some of the assumptions that we made, also on the design side.
And we'll talk about that in a second.
But what are the rules that actually run there?
What are these managed rule sets? What are those exactly? Yeah. So the WAF, when we say WAF, normally we're referring to the engine.
The engine is something that can accept incoming traffic and apply some rules to the traffic to decide good from bad.
But that's only half of it. The other half is the actual rules themselves. I would think of it as the actual content that the WAF uses to filter out good from bad.
As part of the WAF, we provide sets of rules that you can configure according to your requirements.
And we make it easy to actually deploy all the default rules with one click.
And the rules themselves, depending on which firewall you're using, can be written in many, many different syntaxes out there.
A very popular syntax is still the ModSecurity syntax. ModSecurity is probably the best known because it was a web application firewall module, which you would add on top of the Apache web server.
Those of you who have been working on the web for quite some time may remember that before NGINX got all of its popularity, Apache was the dominant web server, and it is still a very popular and powerful one.
And ModSecurity was the default WAF for Apache.
And that WAF module implemented the ModSecurity syntax.
And our old WAF, which has served us really well so far, also implemented the ModSecurity syntax.
And our rules were written in that language. The reason we did that, of course, is so we could actually pull down a lot of open source libraries and rules and simply port them to the WAF as is without having to do any additional magic or convert them.
So there was a big simplicity in terms of deploying new rules as they came along.
And what are those rule sets that are accessible to customers today?
So if I sign up and I'm a pro customer, I'm paying $20 a month, what do I get access to from a rule set perspective?
I know I can write my own rules, but what are the ones that you're providing specifically?
There's two rule sets.
We take one rule set from the public domain, open source community, which is the OWASP core rule set.
And we've actually taken that and implemented it into the Cloudflare platform and provide it as is.
And officially, it's called the Cloudflare OWASP core rule set.
The rules for that rule set are geared towards protecting against generic web application attacks.
So think of XSS-type payloads, SQL injection payloads, remote code execution attack payloads, and similar.
That's one of them.
The other one is built and maintained by our own security team. And we call that rule set the Cloudflare managed rule set.
And within that rule set, we have a subset of rules that we used to call, and still call, Cloudflare specials.
These are what we think are the best rules out there. Those rules are maintained by us, by the security team.
We update them constantly. There's a whole review cycle, and we may be able to go into that in some more detail.
And we're trying to protect and build rules based on what we see from our customers' traffic proxying through our network.
Because we have a really big visibility advantage, given we run a very large multi-tenant platform across the web.
And we take that insight.
And when we see something new, a new payload comes up, we write the rules and then provide them as part of the Cloudflare managed rule set.
In the future, we may have more rule sets.
And in fact, we are going to be adding some very soon.
But for now, Cloudflare managed rules and OWASP are the two. I heard you may have some more announcements in store for tomorrow.
Maybe. So no pressure there, but excited to see that come out.
Talk a little bit about how they run differently. So I know that the OWASP core rule set, those are great.
We pull those in, we evaluate them.
But how does that work from an evaluation perspective versus the Cloudflare specials?
Yeah, they work substantially differently. But luckily, the fundamentals of the engine remain the same.
So we actually have a very neat, clean solution in that regard.
So let's talk about OWASP first. Whenever a rule executes, and it's part of the OWASP rule set, the individual rule that matches against the payload doesn't actually trigger a block event or a challenge event or some other event that would change the course of the request.
Rather, whenever an OWASP rule executes, it will increment a score, which we keep for that single request.
So out of the OWASP rule set, let's say five rules match, and each rule is incrementing by five, the total score for that request would be 25.
Then the last rule of the OWASP rule set, which always runs last, will check the aggregate score and make a decision based on it.
So the OWASP rule set is intended to be a score-based system, whereby no individual rule, unless you set the threshold to be very, very, very low, will ever actually trigger an action on its own.
The idea here is to reduce false positives: rather than looking for very specific payloads, we look for a lot of little commonalities across malicious payloads, and by combining those together, you get an easier-to-configure and easier-to-deploy system with fewer false positives.
And then, of course, you can still deploy the OWASP rule set with a lower threshold score, and maybe deploy it in a log mode to see what you might be missing, and adjust your settings accordingly.
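To make the scoring model described above concrete, here is a minimal Python sketch. It is not the Cloudflare engine or the actual OWASP rules: the `ScoringRule` type, the three toy predicates, the per-rule score of 5, and the threshold of 10 are all assumptions chosen for illustration. The point is only that individual rules add to a per-request score and a single final step compares the total to a threshold.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ScoringRule:
    name: str
    matches: Callable[[str], bool]  # hypothetical predicate over the request payload
    score: int                      # how much a match adds to the request score


def evaluate(payload: str, rules: List[ScoringRule], threshold: int) -> Tuple[int, str]:
    """Sum the scores of all matching rules, then decide once at the end."""
    total = sum(rule.score for rule in rules if rule.matches(payload))
    action = "block" if total >= threshold else "allow"
    return total, action


# Toy rules for illustration only; real rules are far more specific.
RULES = [
    ScoringRule("sql-keyword", lambda p: "union select" in p.lower(), 5),
    ScoringRule("sql-comment", lambda p: "--" in p, 5),
    ScoringRule("script-tag", lambda p: "<script" in p.lower(), 5),
]

print(evaluate("id=1 UNION SELECT password FROM users --", RULES, threshold=10))
print(evaluate("q=fish -- chips", RULES, threshold=10))
```

In this sketch the second request matches one rule but stays under the threshold, which is the false-positive benefit described above: a single weak signal is not enough to block.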
You mentioned false positives.
I think that's probably something that you and Nafeez can talk about in far more depth than I can.
But what is a false positive? Yeah. A false positive is when we block or match against a request and take an action when that request was actually a good request.
So when the WAF makes a mistake, that's a false positive.
The opposite is a false negative, which is when we let through something that we should have blocked.
And those are two common terms that get used in the security industry when talking about things like web application firewalls.
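The two terms can be pinned down with a small Python sketch. The `classify` helper and the sample observations are hypothetical. Each observation pairs what the WAF did with whether the request was really malicious.

```python
from collections import Counter


def classify(blocked: bool, malicious: bool) -> str:
    """Name the outcome of one WAF decision given the ground truth."""
    if blocked and malicious:
        return "true positive"
    if blocked and not malicious:
        return "false positive"   # a good request was stopped
    if not blocked and malicious:
        return "false negative"   # a bad request got through
    return "true negative"


# (blocked, malicious) pairs; made-up data for illustration.
observations = [(True, True), (True, False), (False, True), (False, False), (False, False)]
counts = Counter(classify(blocked, malicious) for blocked, malicious in observations)
print(counts["false positive"], counts["false negative"])
```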
Yeah. And I think we'll get to this when we talk about how we think about creating new rules and testing them and rolling them out.
But one of the things that you also announced, not today, but I think a month or so back, was some ways for customers to drill into some of those potential false positives, and investigate them, and do some kind of deep discovery there to figure out should they adjust the rules and tune them as part of that onboarding process or iterative configuration process that you handled from a solutions engineering perspective.
So tell us about what that was that was announced, and why is that?
Yeah. No. And this actually ties in really well with how the Cloudflare managed rule set functions as well.
So we talked about OWASP being a score-based system.
On the other hand, the Cloudflare managed rule set is built in such a way that every rule has sufficient logic that, if it matches, we know that the request is malicious, right?
So if a single rule matches, you can immediately perform a disruptive action such as a block.
Now, that's all good, of course, but sometimes we end up hitting false positives, or maybe you're using the Cloudflare WAF and a customer writes in and says, my request is blocked.
Why was it blocked? By default, whenever there's an event triggered by the WAF, Cloudflare will log and provide via the dashboard a set of common request fields, which are sufficiently generic and sufficiently precise to let you know which request was actually blocked, but we don't actually store the entire payload of the request.
We don't store the payload because there's potential for capturing a lot of sensitive data.
For example, if we're matching against a request and the payload is in a cookie header, we might actually log the session ID. If you're the application owner, that's not necessarily a big problem, because you have access to the application anyway, but if you're managing a large team, there might be security controls in place.
And then on top of that, the last people you want to have access to that data are, for example, Cloudflare staff.
We do not want access to that stuff. So we built a way that you can turn on additional logging for security events triggered by the firewall, but those additional fields, which will contain the relevant payload that triggered the event, are encrypted with a public key, which you provide yourself.
What that essentially means is that only you or whoever configured the WAF has access to decrypt those payloads.
No one else, unless they have access to the private key, of course, will be able to look at potentially sensitive data.
Go ahead. Sorry. And we call this feature encrypted WAF payload logging.
I think it uses some really cool cryptography, hybrid public key encryption.
And so we won't dive into all of that today, but there's a blog post that goes into some of those specifics if you're interested in learning more.
Nafeez, why is it important for... Michael mentioned that this payload that Cloudflare employees don't have access to.
Why is it important to restrict the payloads and encrypt them?
How does that help the product security? Sure. Customers have a lot of their important assets behind Cloudflare, and their customers are sending requests, which usually contain personally identifiable information.
There is a lot of context related to their users' data. So we don't want anyone to be able to easily access that.
But at the same time, we want to make the relevant metadata available and useful for protecting the requests that come to Cloudflare.
So what we do is use HPKE. It's a generic standard that was put out by one of the Cloudflare Research team members, and it helps ensure that only the customer can decrypt the data, because they have the private key and they give us the public key.
So I used to call this bring your own key, BYOK, where you upload the public key, Cloudflare encrypts the payload and sends it back to the customer, and they can look at the logs, and only they can look at them.
So this is a verifiable and provable way for customers to ensure that no one else can look at it.
And I think in general, most of the features that Cloudflare has been pushing out have moved from, hey, we won't look at your data, to we cannot look at it, and here's the proof, because that's the way the crypto works.
I think that's a really important point you just made in terms of how we think about data.
And I think this definitely separates us from a lot of folks in the industry: we see this data as our customers' data, as your data.
We don't want access to it. We want the absolute minimum to operate our service.
And so if you need additional things for troubleshooting, we want to make that accessible in a way that only you have access to it.
And Nafeez, as you mentioned, we're going from a world of here's some things that we do to actually demonstrating that.
And so similar, both from a security as well as a privacy perspective, where we're doing audits that are being published on our 1.1.1.1 browser, sorry, Resolver.
And so it's great to see that we're continuing that and that it's permeating throughout the whole product stack.
Michael, I want to get back to the controls that someone has. So you mentioned that you can take different actions based on what is flagged from a request.
And so what are some of those actions you can take and where do you kind of see that headed over time?
Yeah, that's a good question. So out of the box, we provide quite a few actions that you can configure the rules to execute in the event of a match.
The standard one, which any firewall supports, is to block the request, right? So whenever someone sends a payload and the WAF triggers, we stop the request from proxying through to your origin.
Instead, we display a 403 or a page that essentially says, you've been blocked by the WAF.
Other actions that we've built into the WAF out of the box, and that have been available for some time, include the ability, for example, to challenge a request with a JavaScript payload.
This essentially increases the effort required for any potentially automated tool to get through to your origin.
So things like simple command line scripts will no longer be able to pass through because they are required to implement a full JavaScript engine to be able to solve the problem and move forward.
All the way to more sophisticated challenges, such as CAPTCHA challenges, which essentially require human intervention, right?
So I think everyone here has probably seen a CAPTCHA while browsing the World Wide Web more than once.
So you're required to click on the relevant images to detect specific image types before you're let through.
We do have other actions, and we're looking at implementing more sophisticated logic. For example, we've already had rate limiting for quite some time.
This is whereby you don't necessarily block a request incoming to the system, but you start slowing it down based on the threshold you configure.
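As an illustration of threshold-based rate limiting, here is a minimal fixed-window counter in Python. It is a generic sketch, not how Cloudflare implements the feature: the class name, the per-client key, and the choice of a fixed window are assumptions, and what happens to requests over the threshold (slowing down, challenging, blocking) is left to the caller.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Count requests per client in fixed time windows and flag the excess."""

    def __init__(self, threshold: int, window_seconds: int):
        self.threshold = threshold
        self.window_seconds = window_seconds
        # (client, window index) -> requests seen; old windows are never
        # evicted in this sketch, which a real limiter would need to do.
        self.counts = defaultdict(int)

    def allow(self, client: str, now: float) -> bool:
        """Return True while the client is at or under the threshold."""
        window = int(now // self.window_seconds)
        self.counts[(client, window)] += 1
        return self.counts[(client, window)] <= self.threshold


limiter = FixedWindowLimiter(threshold=3, window_seconds=10)
# Four requests in the first window, then one in the next window.
print([limiter.allow("203.0.113.7", t) for t in (0, 1, 2, 3, 10)])
```

The timestamp is passed in rather than read from the clock so the behaviour is easy to reason about and test. A sliding window or token bucket would smooth out the burst that a fixed window allows at window boundaries.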
Because the new WAF engine is actually a more generic rule engine compared to the old one, you can now, and we're going to start opening up some of these features as we get customer feedback, perform any manipulation you want on the request.
So there are some very cool use cases here that we've seen some folks already implement with things like Workers, which is a separate product that exists on the Cloudflare platform, such as maybe sending a request to a honeypot, right?
So not necessarily blocking the action at the edge, but letting it through and then monitoring the activity of that request, as with any other honeypot to see what the attacker is potentially trying to do or what are they trying to extract from the application.
But given it's a generic engine running on a proxy, you can think of adding headers to send information back to the origin, so you can perform an action on the origin based on that information, or redirecting outright to another service you might be hosting elsewhere.
The sort of capabilities start becoming unlimited.
And then we try to, when we see a lot of usage for a specific use case, that's when we try to encapsulate that into a one-click pre-built action that we make available by UI or the API.
Talk a little bit about that engine, right? So you mentioned LuaJIT previously.
What is the new engine it's running on? What technologies is it built in?
And what stuff is sort of already there and what's moving there across related products, including maybe things like Page Rules?
Yeah. So the underlying technology of the new engine is Rust. Rust is a programming language.
For us, it's worked really well. We've already adopted it for a lot of other products at Cloudflare.
It's a memory-safe, thread-safe technology, and it's also very, very performant.
So at the scale we're at, we need to write performant code that scales quickly.
So Rust fit really, really well, especially given it's a multi-tenant platform.
We want to reduce to a minimum the chances of memory issues across all the different processes running at the edge.
The actual engine itself is also leveraging something that we're already using in our custom firewall rules. Internally we call it segments, but essentially we now have our own syntax, which is a lookalike of the Wireshark filtering syntax.
We call it wirefilter internally, and it's very expressive.
It's a JSON-like syntax, which allows you to essentially define a filter that the engine will then match against incoming requests, right?
So you can filter against specific HTTP variables.
You can then specify arrays. If this is a form payload coming through and you want to pinpoint to a specific form component, et cetera, et cetera, et cetera.
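To give a feel for matching a filter expression against request fields, here is a deliberately tiny Python sketch. It is not wirefilter and covers only a fraction of what such a syntax can express: it supports just `eq`, `ne` and `contains` on string fields, joined by `and`. The field names are modeled on those used in Cloudflare's rules language, but here they are simply keys in a dictionary, and `compile_filter` is a hypothetical helper.

```python
import re

# One clause looks like: <field> <op> "<value>"
CLAUSE = re.compile(r'^\s*([\w.]+)\s+(eq|ne|contains)\s+"([^"]*)"\s*$')

OPS = {
    "eq": lambda actual, expected: actual == expected,
    "ne": lambda actual, expected: actual != expected,
    "contains": lambda actual, expected: expected in actual,
}


def compile_filter(expression: str):
    """Turn 'a eq "x" and b contains "y"' into a predicate over a dict of fields."""
    clauses = []
    # Naive split: a quoted value containing " and " is not supported here.
    for part in expression.split(" and "):
        m = CLAUSE.match(part)
        if m is None:
            raise ValueError(f"cannot parse clause: {part!r}")
        clauses.append(m.groups())

    def predicate(request: dict) -> bool:
        # Every clause must hold; a missing field is treated as an empty string.
        return all(OPS[op](request.get(field, ""), value) for field, op, value in clauses)

    return predicate


is_login_probe = compile_filter(
    'http.request.method eq "POST" and http.request.uri.path contains "/wp-login.php"'
)
request = {"http.request.method": "POST", "http.request.uri.path": "/blog/wp-login.php"}
print(is_login_probe(request))
```

Compiling the expression once and reusing the resulting predicate mirrors the general idea of a rules engine, where the filter is parsed ahead of time and then evaluated against many requests.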
The nice thing is, given we're a proxy, a lot of our configuration is essentially rule-based configuration, and we're already using this engine as more of a platform, of which the WAF is the first adopter, and it's going to be used across all of the other rule-based products we have at the Cloudflare edge.
Patrick, you mentioned Page Rules, which is the obvious example of this, but even beyond that, you can think of load balancing itself, which is deployed via rules.
We've talked about worker deployment happens via rules.
There's a lot of common themes across the entire platform, which are going to start leveraging the underlying platform that the new Rust-based WAF is built on.
Yeah, I think that's what's really exciting for me, obviously.
This is a huge, important step and a tool that many, many people rely on, but as we start adding those other capabilities, so if you think about the honeypot, you're essentially sending traffic to somewhere else.
You're sending it away from your critical infrastructure once you detect that it's an attack, and you're sending it to a system that's designed to receive this and to be able to help study what the attack is.
That rewrite capability was built by the team that manages page rules, so Sam and team have built the ability to rewrite and send things elsewhere.
As we add those things to that engine, it's becoming accessible for other products.
You may mix and match stuff that is not necessarily security per se, but can be used in a security way.
Workers is another great example.
If you're matching on something that looks suspicious, maybe you want to have a full-blown programming environment where you can interrogate the request and write whatever custom logic you want, including logging.
It's cool to see that come in. The other thing: we talked to the bots team at the end of last week, and they're working on increasing the threat intelligence available in the web application firewall.
Beyond what is the source IP of this request, and what are the HTTP request headers, or a full regex, which we can do on the body, what are the actual scores that we've computed based on the knowledge of all the data that passes through the Cloudflare network at large?
Things like combining a score of is this likely automated traffic, so the machine learning models for bot management and the heuristics they use, as well as is the IP on a list of open proxies that we've seen as we crawl the Internet and figure out what's out there, and you may want to block that.
It's really fun for me to see all this stuff getting combined together and allowing you to have more control over that.
Go ahead. The bots team recently announced Super Bot Fight Mode. They actually beat us to announcing a product that uses the platform.
Super Bot Fight Mode itself is built on the same engine that powers the new WAF, for example, to back up what you just said.
From a managerial perspective, is that right?
Yes. I think the bot score is actually interpolated into a firewall rule, if I remember correctly, and that's where the bot rules are actually deployed at the enterprise level.
One thing I forgot to ask you about as we were talking about the different rule sets that we build and maintain, how do we think about when to add a rule and what does that process look like from like, hey, we should add a rule to it being live and available?
I know that this is a question that, from your solutions engineering days, I'm sure you got asked more than once.
If you can give us now the definitive answer, maybe we can cut this up as a clip and then send it to- I'm sure Nafeez has a lot of opinions on this one.
It's always going to be a challenge, right?
If you think about it, there are thousands of applications out there, every customer environment is slightly different.
There's a few very well-known applications, think about WordPress as the number one example, Joomla and Drupal, but once you go slightly below that, it's a whole set of slightly different technologies and applications, and everyone has their own flavor.
So it gets very difficult for us internally when we get reports of vulnerabilities coming in, where do we want to focus our time?
Because on one side, it's impossible, of course, to cover everything.
We would end up likely with thousands and thousands of rules within a couple of weeks.
But at the same time, we want to put our effort where we have the biggest impact.
We do have a process internally, just to give you a few of the highlights.
Nafeez has actually helped with some of this.
We have a feed of vulnerabilities that get fed into our internal ticketing system.
We've signed up to open online databases that post vulnerabilities, like the CVE database.
Some of the large products like WordPress and Drupal have their own feeds of vulnerabilities that you can sign up to.
And every time a notification comes in, it creates a ticket into our internal system.
And that's one input.
Other input, of course, comes from our customers. Very often, enterprise customers will be notified in advance of some vulnerabilities and they may warn us and ask us about specific signatures we may or may not have at that time.
And then we do proactive searching.
So we have a really good security engineering team.
The firewall team themselves are looking out there on the web.
We do monitor forums and similar. And whenever we spot something of interest, we then create the ticket.
So we keep track of what we have to look up to see if we can write a rule for.
Now, we have this huge queue. And of course, as the team grows, we'll be able to write more and more rules, but we need to be selective.
The way we try to be selective is by looking at potential impact.
Now, for example, a high-severity vulnerability for WordPress would go immediately at the top of the list.
A single WordPress rule could help a vast number of customers behind the Cloudflare web application firewall.
If a vulnerability comes in and it's high severity, but it's from a very small software application that has a negligible amount of users, we'll try to get to it.
Don't get me wrong. But of course, in terms of priority, we need to make a decision there on where we spend our effort.
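The prioritization idea described here, severity weighed against how many customers are affected, could be sketched in Python as below. This is an assumption about how such a queue might be ordered, not Cloudflare's actual process: the `VulnTicket` fields, the simple product used as an impact score, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VulnTicket:
    name: str
    severity: float       # assumed CVSS-style score from 0 to 10
    affected_sites: int   # assumed estimate of protected sites running the software


def impact(ticket: VulnTicket) -> float:
    """Rough impact estimate: how bad it is times how many sites it touches."""
    return ticket.severity * ticket.affected_sites


# Made-up tickets for illustration only.
queue = [
    VulnTicket("niche-cms-rce", severity=9.8, affected_sites=40),
    VulnTicket("wordpress-plugin-rce", severity=9.0, affected_sites=250_000),
    VulnTicket("wordpress-low-info-leak", severity=3.1, affected_sites=250_000),
]

for ticket in sorted(queue, key=impact, reverse=True):
    print(ticket.name, round(impact(ticket)))
```

With these numbers the high-severity WordPress issue sorts first, and the high-severity issue in the small application sorts last even though its severity is the highest, which matches the trade-off described above.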
As we grow the team, we are looking at expanding the coverage of what we write rules for, of course.
Another approach we're looking at, and I'm actively reaching out to security teams out there,
so if anyone's listening, please listen carefully, is to get involvement from third-party security researchers or security teams, in some cases, to potentially open source rule sets for vulnerabilities they know about.
So then we can get those open sourced and provide them for free as part of the web application firewall to our customers.
So we scale faster, and the actual experts are writing rules for the vulnerabilities in their applications.
Yeah. It's interesting to see. I know we look at the data in terms of what percent of malicious requests get blocked by Cloudflare specials versus OWASP.
And we are trying to write things as efficiently as possible at the edge because of the CPU or processing cost, and as we talk to customers and industry analysts, we really mean it when we say we don't want to incur any performance hit.
And so you should be deploying security products and actually increasing your performance.
And we really believe that about everything that we build and equip people to use.
And so I think it's really important to keep CPU to a minimum.
And so when we can detect stuff more easily or even higher up the stack, so if there's attacks that are happening at the network level, the DDoS team is handling that.
And if something makes it down to the application level, the quicker and the cheaper we can stop it, the better because that translates to performance and ability to run additional layers of security.
As we reduce CPU, there's more CPU for other things.
And so it's great to see getting more specialized rule sets to be able to block some of those attacks.
So let's talk about though like an actual example of something that's come up.
Let's talk about something generically about how we do it.
And then I want to drill into a specific example. So you get from this queue, you get the idea that we want to write a rule and it's going to have pretty substantial impact.
Maybe we get some advanced notice as we sometimes do from vendors of, hey, this is a real problem.
And you're protecting these 25-ish million websites and we want to get a rule to you to push it out.
How do we actually go about pushing that out?
Yeah. So of course, you mentioned a really good point there.
We always need to keep an eye on performance. When we write rules, we're very selective on how we write them.
Every time we write a rule, it's going to be running on a lot of traffic.
So that's a lot of CPU cycles which are matching against traffic.
If we get a notification, first step, of course, is to try and understand the POC and be very specific with the sort of rule we're implementing to reduce what we said earlier, the number of false positives.
We actually have a pretty good internal system to do that.
So we actually have a way internally to stage rules before we release them to the public and customers can turn them on: in an internal log-only mode, which only we have access to, we can match them against a percentage of traffic.
And even the rollout of that individual rule doesn't actually happen all in one go.
Rather, we start with a subset of the network.
We have this new concept, for example, of canary POPs.
So we can deploy a rule to a canary POP and then observe how it performs, observe the logs, is it matching against too many false positives or is the matching good?
And can we see actually malicious payloads that are being blocked? And then we sort of roll it out a little more.
The whole process in the average case, we try to stick to a week because we commit to a well-thought out delivery process according to our change log.
So customers have time to look at the changes that are coming out and potentially configure those rules once they become available according to their settings.
But once we're confident and we've rolled it out across the entire network and it's not causing false positives, we always aim for zero false positives if we can.
And we have some pretty sophisticated internal analytical tools to review potential payloads.
We then switch the rule to our official rule sets so that it becomes available as a configurable rule to our customers.
And also, we update the change log, of course, so customers who are signed up to our RSS feed would receive a notification and then it becomes generally available.
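The staged rollout described above (internal log-only matching, then canary POPs, then the whole network, then release into the official rule sets) can be sketched roughly as follows. This is a minimal illustration only: the stage names, the `StagedRule` class and the promotion threshold are hypothetical and are not Cloudflare's internal tooling.

```python
from dataclasses import dataclass

# Hypothetical stage names, loosely following the process described above.
STAGES = ["log_only", "canary", "global", "released"]


@dataclass
class StagedRule:
    rule_id: str
    stage: str = "log_only"
    matches: int = 0
    false_positives: int = 0

    def record(self, is_false_positive: bool) -> None:
        """Record one reviewed match observed at the current stage."""
        self.matches += 1
        if is_false_positive:
            self.false_positives += 1

    def false_positive_rate(self) -> float:
        return self.false_positives / self.matches if self.matches else 0.0

    def promote(self, max_fp_rate: float = 0.0) -> str:
        """Advance one stage only if there is evidence and it looks clean."""
        if self.stage == "released":
            return self.stage
        if self.matches == 0 or self.false_positive_rate() > max_fp_rate:
            return self.stage  # stay at the current stage
        self.stage = STAGES[STAGES.index(self.stage) + 1]
        # Each stage starts with fresh counters.
        self.matches = 0
        self.false_positives = 0
        return self.stage
```

The default threshold of zero mirrors the stated aim of zero false positives; a rule that has not matched anything yet is also held back, since there is nothing to judge it on.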
Very often, though, the one thing I'll say when it's a generic vulnerability, which is not software specific, rather than creating a new rule, and there's a lot of misinterpretation here, we try to improve existing rules.
So the rule sets themselves are not huge, they don't have thousands of rules, but that doesn't mean that we're not blocking or we're not efficient.
Rather, we're trying to be very optimal in the way we think about these and try to improve.
Like if it's a new XSS attack vector or a new XSS bypass, we already have quite a few XSS rules and probably a tweak to an existing one will fix against a specific payload, compared to a brand new rule that might be unnecessary.
Yeah. And I think I've seen this play out time and time again, and you've written about it. For example, shortly after you took on the role of product manager for the team, there was an F5 BIG-IP remote code execution vulnerability, and we've seen quite a few of those.
Or I think there was a recent F5 one that was on the response phase, which we couldn't do a whole lot about given where we sit in the request flow.
But I think writing about these major ones is important to raise awareness for customers.
And Nafeez, we can patch at the edge, but I would imagine you also want to patch on premise as well.
And so talk a little bit about how you think about that as a security professional, in terms of that gap and what is involved there.
Sure.
So let me talk through a little bit about a generic vulnerability, the lifetime of a vulnerability, and then let's go deep with a very specific example and what we do at Cloudflare and how we look at it.
So in general, security researchers, scanners, and targeted teams go and find specific vulnerabilities in products that are super interesting.
And what happens at that time is either if it gets leaked, it goes over to some malicious...
In the hands of malicious people, that's not great.
But the moment the vendor gets notified, they would have to roll out a patch.
So they go ahead and roll the patch and distribute it to all their vendors, all their customers who use them, and the customer starts patching.
On most of your client devices, like when you update your iPhone or Chrome, most of the time auto-updates are there and you can just click a button and patch it.
Unfortunately, for most vendor devices out there, it's not that easy.
To begin with, they run a lot of different versions.
There are these heterogeneous versions running everywhere.
You don't have visibility into how many of these servers you have in your whole network, asset management problem, and a few other things.
So when a vulnerability comes up, you as a security engineer or an IT administrator or software developer running this unpatched software, your immediate response should be to negate the vulnerability or make it ineffective.
And regardless of that, you still have to go and patch the vulnerability, but you would have to reduce the time.
So generally you have this thing called the patch gap, which is the time between the moment a vendor releases a patch and the point where most of the systems are patched.
And in general, this has been decreasing, but it's not enough because the attackers are still winning and they could just go and scan the Internet for the specific vulnerability and just put a backdoor.
And that's exactly what happened with recent Microsoft Exchange zero days.
Let's talk about that a little bit.
So back in October, there's this team called DEVCORE, which is an excellent vulnerability research team. I'm familiar with one of their researchers, Orange Tsai, who has been very prolific in terms of all these SSL VPN bugs, all these pre-authentication bugs on the Internet, which are super scary.
You could just, if you have one of these bugs, you could just scan the Internet and just exploit all those vulnerable boxes.
And what happened was they found four different zero days.
And the most interesting one was the first one, which is used to start the chain of attack, which is an SSRF or a server-side request forgery attack.
And what happens with that is usually if you have an Exchange server, you have your list of authenticated users and anyone from the Internet cannot come and authenticate to it.
But you have a bug like this. We usually call these pre-authentication bugs.
A lot of these bugs have been happening recently. Pre-auth bug allows you to authenticate and get some context within the vulnerable application.
Not exactly exploit, but it gives you the same context as a logged in user.
So this allows you to exploit this in the wild without having to target a specific organization's user.
You could just go and run this on all the servers that's out there.
So once you have this authenticated context within the Exchange server, they found another, I think, three zero days, which allow something known as an arbitrary file write.
So in terms of systems security, if you can write to arbitrary files and if those files are being reused or there's a cron job, which runs that as an authenticated user, say root or system in the Microsoft world, what happens is you effectively have remote code execution.
So this is a very good example of using web logic bugs to get reliable code execution.
Back in the days we had, and we still have, memory corruption bugs.
But over the years, people have made memory corruption bugs less reliable because of ASLR and a lot of defense in depth that has gone over the years.
So I used to talk with my security friends about how SSRF is the new RCE. In a way, it is the most effective RCE out there today.
And more and more web applications are vulnerable to similar things.
So anyway, getting back to Exchange, what we did at Cloudflare when this whole thing happened, and Patrick, I remember you also had a blog post about this, was figure out the entry point of the kill chain, in this case, the SSRF.
And blocking it did not necessarily kill the entire functionality of Exchange Server: we had a very specific path, and we could just write a WAF rule that goes out to all customers and kills off the exploit chain.
So this, for instance, if someone has this vulnerability or an exploit code, which for the most part is available or was available on GitHub, I guess, and Microsoft took it down, that's another story, a controversial one.
But what we have effectively done is kill the chain of these bugs.
So without having this SSRF, you can't exploit any of the further bugs down in the path.
And those are also very hard to mitigate using just WAF rules, because they look at more into the actual payload content.
And when it comes to WAF, I talk about these two things.
There are things that you can do at a binary level and there are things that are on a spectrum.
So when you look at the XSS, SSRF, or the CSRF attacks, all these things that require some form of a... and using some regular expressions, these can be bypassed because, as Michael was talking about, the backend applications do a lot of different things.
You can't have a single generic rule that will stop all XSS attacks in the world.
But there is a common language which everyone uses, regardless of what application or what stack you build your application on.
The HTTP semantics are going to be common, right? So there's the HTTP method, the HTTP path; eventually you're going to translate them all into your language-specific semantics.
But in general, on the network level, these are standard.
And that's what our WAF engine is super powerful at.
We've been adding rules. We've been adding more capability to our WAF to inspect more details about the HTTP request.
And I remember a couple of years ago, there was this vulnerability with HTTP request smuggling.
And if you look at most of the WAFs out there, you can't look into very specific details about the content type and how it interacts with the content length.
And those were capabilities that we have been adding over the years, which really translates into how powerful our WAF is in general, and in the market as well.
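The idea of cutting an exploit chain at its entry point by matching on common HTTP semantics (method and path) rather than on payload content can be sketched as below. The method, the path prefix and the `should_block` helper are hypothetical placeholders chosen for illustration; they are not the actual Exchange rule and not Cloudflare's rule syntax.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Request:
    method: str
    url: str


# Hypothetical entry point of an exploit chain (placeholder values).
BLOCKED_METHOD = "POST"
BLOCKED_PATH_PREFIX = "/hypothetical/vulnerable/endpoint"


def should_block(req: Request) -> bool:
    """Binary decision on method + path only; the body is never inspected."""
    path = urlsplit(req.url).path.lower()
    return (
        req.method.upper() == BLOCKED_METHOD
        and path.startswith(BLOCKED_PATH_PREFIX)
    )
```

Because the check is a yes/no match on standard request fields, it avoids the regular-expression bypass problem that payload-based rules face, at the cost of only working when the entry point has a narrow, well-defined shape.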
Yeah, sorry.
I think the thing that was interesting is, you mentioned Exchange, we had some early reports of what the vulnerability was, right?
The start of that chain to exploit. And so, as Michael mentioned before, a process of creating a new rule and rolling it out and testing it, especially when you think things are being exploited in the wild, we got something out really fast.
And I think what was interesting was we refined that rule, as Michael mentioned before, when we were able to get our hands on the actual POC before it got published.
And so, I think it was that Saturday. So, we had a rule out very, very quickly after the CVEs were published to offer some protection.
And then once we got our hands on what was actually being used from our threat intelligence group, we were then able to refine that rule further and stop these active exploitations, right?
And so, it was cool to be able to do that. And this was a rare circumstance, Michael, where we actually went right to enforcement mode, right?
We have this emergency class of rules for things that are hugely impactful to customers' networks and data, where we balance, right?
The false positive risk there versus the potential risk if it's exploited.
Yeah. And I was waiting for the right moment to say that, of course.
The standard process is we get a bypass or something generic, then we try to follow a procedure.
But no, Pat, you're right.
We have an emergency release process as well. The really cool thing is that we can deploy a rule at the edge within seconds.
Our customers are used to this.
Whenever they click something in the dashboard, the change propagates globally. The same system is what we use internally to propagate rules because rules, at the end of the day, are just config.
It actually takes more time to write the rule rather than propagating it.
So, as long as we have the POC and if the vendor makes the POC available with an example, then we can be very specific with what we're matching against.
And if the compromised software is widely adopted, you mentioned the F5 example, Microsoft Exchange servers, I dread to think how many thousands of Exchange servers are out there.
And it's a severe vulnerability, which leads to total system compromise.
That's when we would definitely consider an emergency release.
And I love seeing that what you're doing with the team too, in terms of hiring data scientists directly on the team to augment that with all of the attacks.
The thing that I love about our product and our pay-as-you-go offering for self-serve customers is that if you're paying us $20 a month, the attacks facing your network that we're defending you against help inform the protection for customers that are paying us significantly more for much larger deployments.
And so, there's a lot of that shared intelligence that we can bring out. And I'm loving seeing how you're drilling more into that data to surface this, to augment it, not just with reports that are published, but also producing more and more of these ourselves.
And I think we'll probably, no pressure, probably start writing about those a bit more.
Very soon, very soon. In the future, yeah. So, I want to spend a few minutes on, just to go back to the announcement today, I want to talk a little bit about the new features shipped, and then I want to talk to Nafeez about how we are using Cloudflare's WAF to protect our own network.
I think it's really important that everything we're building here, our security team is signing off on and saying, hey, we would use this ourselves, right?
And if not, you need to add these features to it.
So, tell me a little bit more about what we announced today.
Yeah. So, it's a brand new engine at the basis of it all. But of course, besides the engine itself, where we've tried to make a lot of performance and security improvements around the way the rules get matched against traffic, we also tried to heavily overhaul the flexibility of configuring and deploying the WAF on your application.
The old engine served us really well. It actually scaled to the point we are at today, which is great to think about, considering we have 25 plus million applications on the network.
But as we get more and more customers, we get more and more diverse applications behind the web application firewall.
And there are always some exceptions whereby you may want your WAF to run on a portion of your traffic, but for another portion of your traffic, you know that you're expecting what looks like a malicious payload, but you've done proper validation on the backend, or you want to exclude specific paths from the WAF, or whatever other security reasons you may have.
That's been a major focus with this new implementation is flexibility of deployment, right?
Out of the box, you still hit a button and the WAF runs on top of all of your traffic, but it's now very easy to exclude portions of your traffic or to customize the WAF settings on portions of your traffic.
When I say portion of your traffic, what I mean is a filter which could match, for example, whatever you define as an application in your environment, right?
Which could match against a specific subdomain only, it could match against a specific path.
For example, a lot of companies I've seen would build their APIs either under api.domain.com or under domain.com/api.
Whatever it is, you know, it's just a filter at the end of the day and we can customize that behavior.
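To make the "portion of your traffic" idea concrete, here is a small sketch of a filter that treats both api.domain.com and domain.com/api as the API application and applies a different WAF setting to it. The function names and the mode strings ("skip", "block") are hypothetical; the real product expresses this through its own rule configuration, not Python.

```python
from urllib.parse import urlsplit


def is_api_traffic(url: str) -> bool:
    """Match the two API layouts mentioned above: subdomain or path prefix."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path
    if host == "api.domain.com":
        return True
    return host == "domain.com" and (path == "/api" or path.startswith("/api/"))


def waf_mode_for(url: str) -> str:
    """Hypothetical policy: exclude API traffic, run the WAF everywhere else."""
    return "skip" if is_api_traffic(url) else "block"
```

The path check deliberately requires `/api` to be a whole path segment, so that something like `/apiary` is not excluded from the WAF by accident.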
The other big focus, and I'm going to talk a little bit around the UI itself.
The old WAF UI was very simple, but it took a little time to sort of work through the rules and configure them properly.
And for example, even simply looking at which rules are not running at any given moment, you'd have to go look for them by deep diving into each individual rule group.
And if you wanted to switch all of those disabled rules to log so you can get some visibility around them, that would normally also require you to perform quite a few clicks in that journey, unless you were using the API or something like Terraform.
But not everyone, you know, is using those tools. So with the new engine, new configuration flexibility, there's also brand new UI and we focus around some common tasks to filter down and deploy bulk settings to your WAF rules.
So that what used to take quite some time, now it's literally down to a couple of clicks.
Those are the first two immediate things that people will notice.
There's a whole lot, I don't know if we'll have time to go through all of it, of additional first-class concepts in the new WAF.
Things that, you know, for example, versioning and rollback are things that have been asked quite a lot, especially from our enterprise customers.
I've deployed WAF config, but I've made a mistake. How do I go back with a click, right?
A lot of these fundamentals are now baked into the engine itself.
And as we start exposing them, we're going to make, you know, WAF deployments a lot easier as well.
Yeah, I think, you know, for those that have used our WAF and deployed it, you know, at a really large scale, your fingers are probably strong from, you know, clicking a lot of those different things, but we've really streamlined a lot of that.
And so it was cool to see how you worked very closely with our product design team.
I think there's a blog post in the works about how we went about that process and that partnership, and investigating those common use cases hopefully will result in a lot less time spent in the interface and managing it.
And so it's great to see that partnership. And I look forward to that blog post.
I'll mention one last thing, which I think is one thing that actually excites me a lot.
I think security, honestly, there's no silver bullet for security, right?
A big component of it is visibility. Now, one thing we've done with the new rule sets is exposed what we call rule tags.
There used to be rule groups in the old system.
Now every rule has a tag. A tag can identify the software application the rule is trying to protect, a CVE, for example, or an attack vector.
And very soon we're going to start using those tags to populate the analytics as well.
So you'll be able to, for example, observe what sort of attack vectors are coming through to your traffic, which we think is going to be very powerful to identifying signatures of things that may be passing under the hood prior.
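As a rough illustration of how rule tags could be used both to look up rules and to summarize attack vectors in analytics, consider the sketch below. The rule IDs, the tag names and the placeholder CVE tag are all invented for the example and do not correspond to real Cloudflare rules.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_id: str
    tags: frozenset


# Hypothetical rules; "cve-0000-0000" is a placeholder, not a real CVE.
RULES = [
    Rule("rule-1", frozenset({"wordpress", "xss"})),
    Rule("rule-2", frozenset({"exchange", "ssrf", "cve-0000-0000"})),
    Rule("rule-3", frozenset({"sqli"})),
]


def rules_with_tag(rules, tag):
    """Return the IDs of all rules carrying the given tag."""
    wanted = tag.lower()
    return [r.rule_id for r in rules if wanted in r.tags]


def tag_counts(matched_rule_ids, rules):
    """Count how often each tag appears across a list of rule matches."""
    by_id = {r.rule_id: r for r in rules}
    counts = Counter()
    for rule_id in matched_rule_ids:
        counts.update(by_id[rule_id].tags)
    return counts
```

Calling `rules_with_tag(RULES, "cve-0000-0000")` answers the "am I protected against this CVE?" question in one lookup, and `tag_counts` is the kind of aggregation that tag-based analytics would rely on.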
Yeah. And the cool thing, to go back to the engine for a second and the syntax, wirefilter, as we call it, which is based on the Wireshark display filter syntax, as we discussed, the cool thing there is that those components from the UI side, as well as the API, are being reused by other teams, right?
So if you, I think it was a week ago today, we announced Magic Firewall, right?
Which is taking these controls and visibility that are at layer seven and bringing those up to layer three and layer four, right?
And so if you know how to use the Cloudflare firewall rules engine and web application firewall, it's an easy shift, with a not very steep learning curve, to get over to protecting your network-level traffic.
And so that's been pretty cool to see as well as some of the additions of threat intelligence that can be used at all layers.
So I want to go back to the versioning for a second. And actually versioning, and the other point you made about the tagging.
So in the attack that Nafeez talked about, the Exchange server one we were seeing actively exploited, you could go in presumably and type in that CVE number, right?
And so if you were a security administrator at home wanting to know, am I protected against this?
That's a very quick thing to do, right? Rather than poring through release notes and so on.
But what about the versioning and I know that customers will test their applications against specific versions.
And what are your plans there?
I know a lot of this has been built on the backend, but what are your plans there to expose that in your interface?
Yeah. And the nice thing is mostly all built already.
We're just keeping it hidden for a little while to see how customers start using our new WAF.
There's two sides of versioning, really.
There's the versioning of our managed rule sets. So the rule sets we provide to our customers.
For those, we're going to provide two different types of versioning.
Number one, what we could call releases, which is a common concept applied with many software development life cycles.
So you'll be able to deploy the latest version immediately, or for example, deploy the stable version, which could be a couple of iterations behind, but allows you enough time to integrate with your change management process, for example.
And at the same time, we're going to actually eventually allow you to pin to a specific version of the managed rule sets.
Although eventually we are going to push people up as versions get deprecated, because we don't want customers on a very, very old version: that would be bad for us, because we need to maintain them all, and bad for our customers as well, because they're not getting the best protection.
The other side is your own custom configurations.
Given we have this concept of rule sets now, customers, you may be used to firewall rules.
Firewall rules themselves are going to be soon encapsulated into a rule set.
And every time you make a change to your own rules, you're essentially going to be increasing the version number of the underlying rule set.
And that's up to, we'll see what sort of nice use cases customers come up with, but we do expect all the standard deployment on a staging environment first, specifying your staging filter, and then maybe after you monitor your traffic, moving that rule set into your production environment, and then potentially rolling back, which under the hood is as simple as fetching an old version and applying it on top, which is actually like code version control systems.
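The versioning model described here, where every change bumps the version of the underlying rule set and a rollback is just fetching an old version and applying it on top, can be sketched like this. The `VersionedRuleset` class and the rule strings are hypothetical and only illustrate the mechanism.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedRuleset:
    name: str
    # History of rule tuples; version N is stored at index N - 1.
    _versions: list = field(default_factory=lambda: [()])

    @property
    def version(self) -> int:
        return len(self._versions)

    @property
    def rules(self) -> tuple:
        return self._versions[-1]

    def update(self, rules) -> int:
        """Any change creates a new version; history is never rewritten."""
        self._versions.append(tuple(rules))
        return self.version

    def rollback(self, to_version: int) -> int:
        """Re-apply an old version on top, which itself is a new version."""
        if not 1 <= to_version <= self.version:
            raise ValueError(f"unknown version {to_version}")
        self._versions.append(self._versions[to_version - 1])
        return self.version
```

Treating a rollback as a new version on top of the history, rather than deleting later versions, is the same append-only approach a revert takes in a code version control system.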
That sounds great. And I love the ability to group sites together and applications and slice and dice them, and really say this policy for this set of applications across the entire account, right?
We were coming from a world that was very much zone or domain-based, and now to the account level.
So I think we have like two minutes left. And so Nafeez, just in a minute, minute and a half, can you tell us how we use this, and how we think about the web application firewall for our own infrastructure?
Sure. Yeah.
We are a fan of the whole WAF product within Cloudflare for our own use cases.
So the security team, whenever we receive bug reports, external research that come in through our security research programs, we immediately find a way to just patch it without having to roll out an entire...
And the whole problems that we talked about, about the problems in patching.
And a recent example was a few weeks ago: we had a SQL injection vulnerability that came in on a very specific part of the app that was not really customer-related, but just an external app that we had, and we had to fix it over the weekend.
And we didn't want to ping developers over the weekend to do that because there's a lot of stuff going on.
And we just pushed out a WAF rule, just to block that part. So it was a very obscure part, not used much.
And we blocked it and it's all done. And the vulnerability was patched within like 10 minutes of the report.
And I thought that was great.
That's the power of WAF. And also, the whole Rust rewrite in the new engine is great because there are no more remote code executions on Cloudflare.
So the whole platform itself is secure.
So that's great. Terrific. And thank you for staying under time.
We're right at the end here. So I want to thank everyone for joining.
It was a really useful, informative discussion, and I look forward to seeing what you're shipping in the future.
So thanks so much for your time. Thank you. Bye everyone.
Bye.