Latest from Product and Engineering
Presented by: Jen Taylor, Usman Muzaffar, Michael Tremante, Andre Bluehs
Originally aired on December 17, 2021 @ 1:30 PM - 2:00 PM EST
Join Cloudflare's Head of Product, Jen Taylor and Head of Engineering, Usman Muzaffar, for a quick recap of everything that shipped in the last week. Covers both new features and enhancements on Cloudflare products and the technology under the hood.
Transcript (Beta)
Hi, I'm Jen Taylor. And I'm Usman Muzaffar. Head of Product at Cloudflare. Yes, you are Head of Product at Cloudflare.
I'm Usman Muzaffar, Head of Engineering at Cloudflare.
It's an earlier session for us, Jen. It's in the morning here in San Francisco, Bay Area.
And I'm really thrilled to welcome two of our colleagues who are based in London, Michael and Andre.
Michael, you want to introduce yourself?
Sure. Hello, everyone. My name is Michael. I'm a product manager at Cloudflare and I've been with Cloudflare for about five and a half years now.
And hi, I'm Andre. And yes, despite my American accent, I am based here in our London office.
I'm the engineering manager for our managed rules team. And I moved over here to our London office in January.
I was previously in our San Francisco office on the marketing engineering team.
Well, I'm super excited to have you guys on.
So as some people may know, you know, one of the things that's great about Cloudflare is we actually build software around the globe.
And our team in London is actually responsible for a lot of our application security features.
So can you talk a little bit about the work that you guys focus on? Sure, I guess I can take that one.
So, specifically, internally we're known as the managed rules team.
I guess from a customer's perspective, we work on our web application firewall, which is a great tool, right?
It's easy to use. You can click and protect your applications.
But the engine itself is not the whole part of it.
We also want to provide an easy set of configurations and rules that, you know, customers can deploy, ensure that it doesn't, you know, create too many false positives, but stops all the bad traffic from hitting their app.
And as the managed rules team, we are responsible for making sure that the Cloudflare managed rule set, in this case, is as good as it can be and as easy to use as it can be, of course.
So Michael, rules make sense.
I know what a WAF rule is. WAF and rule go hand in hand. What's up with the word managed?
Why is it, what's so managed about it?
Who's managing it? And why did we feel the need to literally make it the name of the team, the name of the internal product? We call it managed rules.
This is, you know, what are we trying to distinguish or disambiguate there?
Well, going back to making our customers' lives easy, we don't even want to make them focus on writing their own rules, right?
So if we gave them the WAF on its own, without any rules, they would have to spend time writing them.
So we focus a lot of our time doing that on behalf of our customers.
And that's what we mean by managed.
Ideally, you know, someone uses our product, they hit a big on button, and then it sort of deploys all this configuration under the hood, which we've, by the way, battle tested across our entire network, right?
So it's... Yeah.
And so the other thing that we want to provide is a minimum level of protection for our customers.
And so we have, there's a couple of industry standards that our team keep up to date with and make updates for our managed rule sets.
We have a couple of them. One of them is the OWASP core rule set. So we handle basically keeping that up to date for our customers.
We're working on upgrading to the newest version now.
And then we have a secondary, kind of supplementary managed rule set that sits on top of that or works in tandem with that to give our customers a base level that they can then write their own rules on top of.
But for the vast majority of things, we just kind of protect them for them.
And so that's the part that we manage for them. Got it. So you guys are providing kind of base level security protection for things that are global challenges that customers, not just individual customers, but customers of all shapes and sizes may benefit from.
So we're sort of creating that and making that available to everyone.
Right. Yeah. And that's kind of our barrier to entry for when we want to add a new rule to our managed rules: what percentage of our customers is actually going to use it?
How many people will this affect?
Does it make sense to add it as a global rule or is this something that individual users should do themselves?
Yeah. Okay. So you mentioned OWASP as a standard, but there are things that go outside the standards.
How do you guys decide and how do those rules get created for things that aren't part of these standard rule sets?
Right. Yeah. Michael, you want to talk about the decision making process?
Yeah, for sure. And I was going to bring up the example, you know, if you think about it, anyone can sign up for Cloudflare services.
We even have a freemium model, right? A lot of our customers will be using commonly used open source software.
I'm sure everyone is aware of WordPress, for example, and there's a couple of other big ones out there like Joomla, Drupal, Plone, et cetera.
They're all content management systems to help you manage. It's to help you easily set up a website and the kind of thing you'd get from any web hosting provider.
Right, right. Yes. So, you know, a lot of companies out there offer sort of one-click WordPress installs.
So anyone, you don't need technical knowledge to start your blog, for example.
And whenever we see, you know, an opportunity to help our customers secure those applications, we look at, as Andre said, you know, how common is this problem for our customer base?
And then we sort of judge the priority: okay, this specific vulnerability is affecting WordPress, for example, and we know that if we write a rule for this, we're going to protect a good percentage of our customers.
And that's one of the main criteria we, you know, drive into our decision-making to decide what goes into our managed rule sets.
There's many more, of course, the severity of the issue is also a big one we take into account.
It's actually, when I first heard about Cloudflare from our CTO, John Graham-Cumming, I think Cloudflare was only like maybe 10 people and you could count them on two hands.
Like, and I kept asking him, so what does this new company do?
And one of the ways he described it to me is we're trying to patch the Internet.
And I didn't know what he meant by that, but Michael, I think you perfectly summed it up there, right?
Which is like, we want to be able to write rules that can block vulnerabilities or issues or security issues in these really popular packages that are being run by not just a few customers, but literally thousands and millions of customers.
So that when Andre, when your team actually writes a rule for that, that means everybody benefits.
And that's a fantastic sort of network effect there.
But Jen was asking you a different question and I got all excited about where we went with that.
So yeah, back to you on the decision-making of what kind of rule goes into, what kind of decision-making goes into the writing of a rule?
Right. So yeah, go ahead. Sorry, go ahead, Michael. The popularity of the software application is one.
The other one I was going to definitely mention is the severity of the issue, right?
If something is going to cause what we call remote code execution.
So a hacker, if they exploit the vulnerability, they essentially get into your application and do whatever they want.
Something like that would get higher priority in terms of, we should write a rule for this quickly.
The other thing to note about security is speed at which we can deploy and provide these rules to customers is important.
Andre and I have worked together with a team on writing rules for vulnerabilities that get announced.
Before they're announced, we call those zero day because no one knows that they exist.
Potentially the hacker knows, but they're not publicly announced. As soon as they become announced, we immediately see lots of people trying to trigger them, trying to exploit them, and searching for applications that might be vulnerable.
So between the popularity and the severity, there's some other criteria we take into account.
For example, the freshness of the vulnerability. We get customers reporting, do you have a rule for X and Y and Z?
And maybe those are things that are very, very, very old in terms of exposure and affecting old platforms.
Those are the ingredients we take into account. One of the things that is so interesting to me is how responsive the team is and how responsive the platform is to being able to author and deploy those rules so quickly.
Now, I've come to understand and appreciate in my career as a product manager that that magic does not happen just by some wizard sitting in an office someplace.
I know that this is an area where we've spent a lot of time investing over the course of the past couple of years to make sure that we're better, faster, stronger with the technology that underlies this so that we can be more responsive.
Can you talk to me a little bit about like when you guys walked in the door, what you found and sort of kind of what is the evolution of that platform been to really achieve that kind of speed and performance?
Sure. So I joined the team back in January of 2020 and the team was actually reeling from quite a large incident that happened last summer where there was a large amount of CPU consumed because of a bad regex.
We blogged about it.
We talked about the kinds of things that we wanted to do to prevent that going forward in the future.
And the team was still not super motivated because of those kinds of things.
And one of the consequences and some of the fallout was we needed to move a little slower intentionally.
We were moving a little bit too fast.
And so we wanted to take a step back and look at, well, why do we need to move slowly?
What is the reasoning behind, well, we need to pump the brakes and we need to kind of have a stage rollout and those kinds of things.
And so I sat down with the team and generally the complaint was, well, it takes us three days to come up with a rule, whether it's regex-based or looking for a particular string or something like that in a rule or whatever the combination of the rule is, and then be able to gather enough information about it so that we can test it against enough traffic to give us the confidence that it's going to catch what we want it to catch.
And also it's not going to repeat the same problem of an exponential CPU issue.
And then slowly roll it out so we don't break people's integrations if they were relying on some kind of something, false positives basically, checking to make sure that we don't block too many things.
And so that whole process would take the better part of a week.
And so we've spent quite a bit of time and effort and planning and engineering time to get us back to a place where we can deploy things quickly, but also extremely safely.
Exactly. And so the kinds of things that we've built, we've talked about this before, and I was on with our CTO, John Graham-Cumming, a few months ago, talking about this.
We're moving to an entirely new engine.
And so the actual engine that executes what we call an expression, which is the, does this header look like a particular thing, does it match a particular regex?
We're starting from scratch with that. And one of the first things that the larger kind of firewall and managed rules teams worked together on was putting limits on how much CPU usage any particular given rule could use.
And so we could calculate the max amount of CPU time or resources that a particular rule set could use and put hard limits into the engine.
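As a rough illustration of the hard per-rule limit described here, the following is a minimal Python sketch. It assumes an abstract "work unit" counter in place of real CPU time, and every name in it (`Budget`, `contains_with_budget`, `evaluate_rule`) is hypothetical; it is not how Cloudflare's engine is implemented.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a rule uses more work units than it is allowed."""


@dataclass
class Budget:
    remaining: int  # abstract work units, standing in for CPU time

    def spend(self, units: int = 1) -> None:
        self.remaining -= units
        if self.remaining < 0:
            raise BudgetExceeded()


def contains_with_budget(haystack: str, needle: str, budget: Budget) -> bool:
    """Naive substring search that charges one unit per character comparison."""
    if not needle:
        return True
    for start in range(len(haystack) - len(needle) + 1):
        matched = True
        for offset, ch in enumerate(needle):
            budget.spend()
            if haystack[start + offset] != ch:
                matched = False
                break
        if matched:
            return True
    return False


def evaluate_rule(haystack: str, needle: str, max_units: int) -> str:
    """Return 'match', 'no_match', or 'budget_exceeded' for one rule."""
    try:
        hit = contains_with_budget(haystack, needle, Budget(max_units))
    except BudgetExceeded:
        # The rule is cut off instead of being allowed to run unbounded.
        return "budget_exceeded"
    return "match" if hit else "no_match"
```

The point of the design is that a pathological rule fails on its own, with a distinct outcome, rather than consuming resources that every other request needs.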
But that also got us down this path of, well, if we have a whole different engine, we should rethink our deploy process.
And so we've gone through that exercise now too, of rethinking our deploy process and being able to not just be a staged rollout, but build ourselves a sandbox and staging environment that doesn't run on anything in production.
We don't have to test in production anymore. And so that's been one of the most valuable things that we are finishing up very soon here is this sandbox environment that takes hundreds of millions of requests, runs our rules through them, makes sure that we aren't moving the needle too much on using too many more resources or false positives.
We have a consistent benchmark that we can use of, well, we added one new rule.
We probably shouldn't block 20% more traffic. And then eventually we want to get to a point where we can look at a very, very small subset of production traffic and point it at the sandbox and run real live traffic through it.
There's a whole host of considerations that we have to take into account there of PII and data crossing borders and those kinds of things that we need to work through before we can actually put that into practice.
But the ability for us to test things and have that confidence, pre-release confidence, the ability for us to have that is going to dramatically speed up our ability to release these kinds of things.
Like Michael talked about, the speed at which we can deploy rules is going to increase dramatically.
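The benchmark idea above (adding one rule should not suddenly block far more traffic) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: requests are plain strings, a rule is any function that returns True to block, and the function names and the 1% threshold are made up for the example.

```python
from typing import Callable, List

Rule = Callable[[str], bool]  # returns True when the request should be blocked


def block_rate(rules: List[Rule], requests: List[str]) -> float:
    """Fraction of recorded requests that at least one rule would block."""
    if not requests:
        return 0.0
    blocked = sum(1 for req in requests if any(rule(req) for rule in rules))
    return blocked / len(requests)


def safe_to_deploy(
    baseline: List[Rule],
    candidate: List[Rule],
    requests: List[str],
    max_increase: float = 0.01,  # hypothetical threshold
) -> bool:
    """Compare the candidate rule set against the baseline on the same traffic."""
    increase = block_rate(candidate, requests) - block_rate(baseline, requests)
    return increase <= max_increase


# A tiny stand-in for the recorded traffic sample.
SAMPLE = [
    "/index.html", "/blog?page=2", "/../../etc/passwd", "/search?q=cats",
    "/about", "/cart?item=1", "/login", "/img/logo.png", "/feed", "/contact",
]

traversal = lambda req: "../" in req
too_broad = lambda req: "?" in req            # would block ordinary query strings
narrow = lambda req: "/wp-login.php" in req   # matches nothing in this sample
```

Here `safe_to_deploy([traversal], [traversal, too_broad], SAMPLE)` is rejected because the new rule blocks ordinary query strings, while adding `narrow` leaves the block rate unchanged and passes.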
That's awesome. The name of the game at Cloudflare is innovation at speed, but with safety.
And that incident was notable because it was a bad rule, but that rule was not actually live on traffic.
It was in simulate mode. It was following a normal process. And it didn't look broken in staging and it didn't look broken in test.
And yet only at scale, it's like, wait a minute, this is consuming enough to effectively lock up the edge on request processing.
And for 21 minutes, it made a big mess. And so this, all the work that the team has done to make it possible to test slices of that traffic in a realistic way is really fascinating and interesting work.
Andre, one of the other things I think your team worked on recently, if we go to some of the, it is called the latest from product and engineering.
We started from the beginning of time.
And what did this look like when John and the team first decided to build Cloudflare?
It's a really slick thing, which sort of preempts what a lot of customers would do.
If I was, I remember being, I had a startup long ago and I signed up Cloudflare specifically to give me protection.
Cause I was also using one of those, one of those popular things.
And I wasn't paying attention to every CVE that comes out, like Michael was just talking about.
And so the first thing that I did was sort of download one of the very popular pen test tools and let it rip and see if Cloudflare will block it.
So maybe actually, Michael, can you start like, what do customers do when they sign up Cloudflare WAF and how do they put it through its paces?
And then after you answer that, Andre, why don't you tell us, like, so how do we get ahead of that interesting challenge?
Yeah, no, that's a great question.
Ideally customers would like all of it to just work magically, right?
So that's the, that's the end goal. So they don't have to do anything.
Realistically though, the first thing that anyone who is trying to use the WAF wants to do is turn on the WAF and deploy those rules.
And normally, and you implied this, Usman, we have this really nice feature, which we use internally, even in our deploy process, where we can deploy the rule sets and the individual rules in what we call a simulate or log mode, where we're just silently looking at traffic and warning, hey, I would have blocked this specific request because it looks malicious because of X, Y, and Z.
And, and that is, you know, normally customers would sign up and that's the first thing they do.
They turn on the rule sets, the managed rule set, maybe they write a couple of their own rules as well.
And they deploy them in log or simulate mode.
And then they sort of observe the traffic for a couple of weeks to see, you know, how it would react on, on their application.
Once they're sure that everything looks good, they would turn the rules to a more aggressive mode, such as blocking the traffic or challenging it.
If on the other hand, there's false positives, which is the rules are triggering on traffic that is legitimate, then they would go in and try and configure those specific rules by either turning them off or potentially leaving them in simulate mode.
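The workflow described here (deploy in log or simulate mode, watch what would have been blocked, then switch to blocking or turn a noisy rule off) can be pictured with a small Python sketch. The `Rule` class, the action names, and the rule IDs below are hypothetical and only illustrate the idea; they are not Cloudflare's actual configuration model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Rule:
    rule_id: str
    matches: Callable[[str], bool]
    action: str  # "log" (simulate), "block", or "disabled"


def handle(request: str, rules: List[Rule]) -> Tuple[str, List[str]]:
    """Return the verdict and the IDs of rules that fired on this request."""
    fired: List[str] = []
    for rule in rules:
        if rule.action == "disabled" or not rule.matches(request):
            continue
        fired.append(rule.rule_id)  # always recorded, even in log mode
        if rule.action == "block":
            return "block", fired
    # Rules in log mode never change the outcome; they only leave a trace.
    return "allow", fired


RULES = [
    Rule("sqli-1", lambda req: "union select" in req.lower(), "log"),
    Rule("traversal-1", lambda req: "../" in req, "block"),
]
```

With these rules, a request containing `UNION SELECT` is still allowed but shows up under `sqli-1`, which is the signal a customer would review before changing that rule's action to `"block"`.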
So that's the general path. And then, depending on how much effort they put into it, you could have a very well configured WAF, even in a day.
But depending on how careful or how risk averse a customer might be, it could take anywhere from a day to a couple of weeks of testing and fine tuning.
And so, so one of the things, how did your team start to think like, if that's what this is going to happen, when you give this to someone, what are you, what are you doing to sort of get into that?
The customer is going to try to take it apart and break it and, and like, put it back together again in a way that it's not supposed to work.
Like, how do you make sure that like, it, the customer is going to be able to kind of put it through its paces and, and disassemble it and reassemble it in a way that is actually going to still deliver value.
I think what, Jen, you're trying to get us to talk about is something that we just blogged about on Friday, which is, we just gave customers the ability, actually, I think it's releasing this week, we first talked about it on Friday, of actually being able to look at what we see in the request and what our rules match against, because up until we released this, it's just kind of a black box.
We say, you know, we, for, for various reasons up to and including, we don't want to give attackers more surface.
We don't give you the actual rules.
We don't say what the secret sauce is making up those rules. But what we are comfortable with is giving you, okay, here's the bit of the request that matched.
We think it's legitimate. If you think otherwise, you know, let us know, give us, give us this bit of the request back and we'll, we'll fine tune it a little bit more.
And so we, we are giving the customers the ability to, to enhance this configuration process to say, all right, this, this rule blocked.
I definitely want to leave it in simulate mode because this is legitimate traffic, but you're still going to block it.
I can't use this rule. And kind of the technology behind that is pretty cool too.
Yeah. But, but that's a little tricky, right?
Because sometimes the things that we match on are to your point a few minutes ago, like we might've been triggering something that is potentially PII.
Like how, how do we, how do we, how do we provide security to our customers while also ensuring that those customers can continue to provide the right level of privacy to their users?
It's a tricky bit. And, and that, that, that's the magic, I guess, that we, and we, we might even be the first ones doing this.
You know, logging the payload is something we've always wanted to let customers do, right?
But the second we log something, it means that we might be storing, especially if the rule is triggering a false positive, you know, someone might be submitting their credit card details, but there might be a string in that request that causes the WAF to trigger.
And if we log the payload, then all of a sudden we've got credit card details in our logs, which is basically a no-go, especially in terms of user privacy.
So, I mean, Andre can definitely talk about the secret sauce behind it, but the general idea is we can now allow anyone to log whatever the WAF matched on, including the payload, but only if they provide an encryption key, which allows them to decrypt it and observe and sort of do their analysis.
But me or Andre or anyone in the team, we cannot see what the contents of those payloads are.
So we give the keys to our customers.
We don't have the keys ourselves. So, so, you know, nobody at Cloudflare can actually see that.
Correct. And in fact, they, they, customers can either generate their, their keys on their own.
We do provide an easy way in the dashboard so they can generate the private and public key.
You know, the encryption works with two components: there is the public key, which we use to encrypt the payload, and the private key, which is what they would use to decrypt the payload.
We might generate it for them, but then they download it and we never store it on our side.
Yeah. That's pretty much in keeping with the philosophy of Cloudflare, which is it's, it's their data.
It's not our data. And, and that's a, it's a very key point and it's something we all hold.
We all hold dear. One thing I, I, I wanted to also ask you about, Michael, is, you know, we've been saying the word rules without giving any indication of what the order of magnitude is.
There's a lot of these and some of them, the customers can write themselves. And so there's, there's, there's a couple of things I want to talk about.
One is how, how a customer can, can also write their own rules and enhance that.
And some of that falls into firewall and how we've enabled a different syntax.
You know, Andre has talked a lot about ModSec and regular expressions.
And there's, you know, there's a really old, famous quote in engineering.
It says, you know, sometimes when a, when a programmer is faced with a problem, they think, I know I'll use a regular expression and now they have two problems.
And the, the, you know, cause they can be, they can be tricky and they can get thorny and, and you can make mistakes like the one that led to the incident last July.
And so, so one of the other things that we're talking about is using a different syntax, a very familiar and common syntax that network engineers use all the time called wire filter.
And, and, and then also around how we make it easy for customers to keep track and organize and decide which rule sets are on and off.
I just wanted to, if you could give a little, a little bit of background on those problems and how we've addressed them.
Yeah, I'll, I'll start and then I'll hand over to Andre, but for sure. So we are, you know, as we mentioned earlier, we're rebuilding the entire engine that runs our security features.
It's a big undertaking. It's like switching the engines on the rocket while it's taking off, in that regard.
Right.
Right. But the, the old engine, you know, didn't use what, what the now custom firewall rules are using, which is a wire filter syntax, which basically we're exposing, and we use the same syntax internally now.
A very simple to use syntax where customers can define the rules if they're building their own, for example, and, and then can, they can deploy it.
And the other thing we've done with the new engine as well is the ability to group these rules into what we call rule sets.
And, and these rule sets can be sort of deployed, configured on specific portion of traffic as the customer sees fit.
So the configurability of the firewall and WAF engine is, is much greater than what it used to be before.
Yeah, I'll let Andre maybe go into detail on the syntax per se, but, you know, going back, the OWASP rule set, for example, is written in ModSecurity.
And some of the things we've had to do is, for example, write a converter from the popular ModSecurity syntax for Apache into our new syntax, which is now, we think at least, a lot easier to use, even for customers that are not too familiar with how to write firewall rules.
Yeah. The wire filter syntax, as you said, Usman, it's based on the Wireshark query syntax.
I don't know what the exact name is for it. Wireshark filter syntax. Right. That's how most people would describe it.
Yeah. Inspired by Wireshark. And so, yeah, Michael alluded there to a converter.
We, we, we built a, an in-house converter. We plan on making it available.
There's no particular secret sauce in there. It just takes a modsec rule and converts it to a wire filter.
And one of the reasons we did that is because we do have, you know, on the order of hundreds of rules and we didn't want to convert them by hand, cause, you know, that way lies dragons and heartache and tears.
And we've, we've already seen benefits from this because about halfway through, after we've, after we'd converted all the rules and started using them in, in, in our testing, we realized that there was an optimization that applied to a little over half the rules.
And all we had to do was change a couple of characters in our converter and then reconvert everything and it was done.
And so that's, that's kind of what we wanted and accomplished with that converter.
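To make the converter idea concrete, here is a deliberately tiny Python sketch that turns one simple ModSecurity `SecRule` line into a wire filter style expression. It is an assumption-laden toy, not the in-house converter mentioned above: it handles only two variables and two operators, the mapping tables are illustrative, and real ModSecurity rules (transformations, chained rules, escaped quotes) are not supported.

```python
import re

# Illustrative mapping tables; a real converter would cover far more.
VARIABLES = {
    "REQUEST_URI": "http.request.uri",
    "REQUEST_HEADERS:User-Agent": "http.user_agent",
}
OPERATORS = {"@contains": "contains", "@rx": "matches"}

# SecRule VARIABLE "@operator argument" "actions"
SECRULE = re.compile(r'^SecRule\s+(\S+)\s+"(@\w+)\s+(.*?)"\s+"[^"]*"$')


def convert(secrule: str) -> str:
    """Convert one simple SecRule line into a filter expression string."""
    m = SECRULE.match(secrule.strip())
    if m is None:
        raise ValueError("unsupported rule shape: " + secrule)
    variable, operator, argument = m.groups()
    if variable not in VARIABLES or operator not in OPERATORS:
        raise ValueError("unsupported variable or operator: " + secrule)
    # Escape backslashes and double quotes for the quoted string literal.
    escaped = argument.replace("\\", "\\\\").replace('"', '\\"')
    return f'{VARIABLES[variable]} {OPERATORS[operator]} "{escaped}"'
```

For example, `convert('SecRule REQUEST_URI "@contains /wp-admin" "id:1001,deny"')` yields `http.request.uri contains "/wp-admin"`. Because every rule is generated, an improvement to the mapping takes effect across the whole rule set simply by re-running the conversion, which is the benefit described above.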
And we, we want to give this to our customers too. We want to make it super easy to say, if you have a legacy WAF that you're using or an in-house WAF that is configured with mod security, you can convert it and easily just drop your rules into our wire filter.
We want to make that as easy as possible. And, and all those changes, all those generated rules, we're also feeding those against all that test infrastructure that you just talked about as well.
Right. So like all, and including things like third-party pen test tools.
So we're sort of testing it from all sides. We're creating these rules, running simulated slices of traffic against them as well as simulations from pen testing tools, then giving them to customers to put them in simulate mode and launch them and explore them, and then they can turn them on and turn them off in different ways.
So it's a massive undertaking, but it's a big deal because of the different ways things can be exploited and how much surface area there is. We're trying to patch the Internet, like John said back in 2012.
Yeah. And actually I want to talk about that. Oh, sorry. Go ahead, Jen.
Go ahead, go ahead. Well, so I was going to say, I wanted to talk about the pen testing a little bit, because we actually just completed a project this quarter, kind of the first iteration of it, where we are running those kinds of pen tests against ourselves that our customers are doing.
And if I could plug a little bit, because we're hiring for people to help us with this we're hiring security analysts to kind of take that information that we're gathering from the, you know, we're, we're not at a hundred percent as, as Michael said, but we want to be as close as possible.
We want to spin up a team dedicated to, to doing this, to actively scanning our, our WAF and our rules to make sure that we have as much coverage as possible.
That's fantastic. So, okay. So here's the thing now, like I hear you guys, I hear hundreds of rules and I, and I'm like, that's amazing.
And that's great. We got phenomenal coverage. And then I think of two problems. One is how do we make it understandable for customers to manage hundreds of rules in a way that, like, mortals can grok?
And then more importantly, like, how do we make sure that that actually scales?
Like, I think running hundreds of rules like that to me sounds like it would take a long time.
So like, how do we solve the customer usability problem?
How do we solve the performance problem? Especially as it sounds like we're on deck to keep adding hundreds and hundreds more.
Yeah. And that's, that's going to go faster and faster as well.
I'll answer the usability one.
We, you know, tackling, exploring, you know, very large rule set sounds like a daunting task, but what we're doing now is we're providing very easy categories to go with those rules.
And actually we're thinking of building, you know, rule suggestions, because we do already have the technology that scans back-end origins and tells us, you know, this customer is running WordPress.
Well, if you actually browse the rule set, you will find that, you know, all the rules that are applicable to WordPress are tagged with WordPress.
And that's, that's sort of the, make it easy to deploy.
Even if the rule set at a first glance is very large and complex, the categories are, are a, you know, a bridge to make it very easy to deploy.
And then the other thing to note is with all of the traffic that we're testing our rule sets on with the deploy process we recently talked about, we, we wanted to provide very good default settings, right?
So between the categories and the default settings, which are optimized to not cause false positives, we believe, you know, the WAF is actually very, very easy to use and configure.
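The combination of tags and default settings described here could look roughly like the following Python sketch. Everything in it is hypothetical: the rule IDs, the tag names, and the `suggest_rules` helper are invented for illustration, and the "detected software" set stands in for whatever the origin scanning would report.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class ManagedRule:
    rule_id: str
    tags: FrozenSet[str]
    default_action: str  # the default a customer gets without tuning


# Made-up rule set; the IDs and tags are placeholders.
RULESET = [
    ManagedRule("100001", frozenset({"wordpress"}), "block"),
    ManagedRule("100002", frozenset({"drupal"}), "block"),
    ManagedRule("100003", frozenset({"wordpress", "sqli"}), "log"),
    ManagedRule("100004", frozenset({"joomla"}), "block"),
]


def suggest_rules(
    detected_software: Iterable[str], ruleset: List[ManagedRule]
) -> List[ManagedRule]:
    """Pick the rules whose tags overlap with the software seen at the origin."""
    wanted = {name.lower() for name in detected_software}
    return [rule for rule in ruleset if rule.tags & wanted]
```

A customer whose origin looks like WordPress would be shown rules `100001` and `100003`, each with its default action already set, instead of having to browse the whole rule set.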
Yeah.
And from a speed perspective, that's really where the new engine shines. So previously we, we had a converter.
This is what I've talked with our CTO about quite a bit.
The, the, the previous workflow was actually, it converted mod security into Lua code that, that ran.
And then we, yeah, it was gnarly. It worked very well, but it was time eventually, like all software, to move on.
Yeah. Yeah. It did work for a long time and it, it was, it was starting to, starting to crack under the strain a little bit.
And so we, we knew that we needed to move on as you say. And so we, we have rebuilt this engine with performance from, from the first go.
We, we decided to pick Rust because we know it well internally at Cloudflare.
We had a lot of resources to help us do things as, as well as possible.
And it's, it's a great language to turn to in terms of safety and performance as well.
And so we, we are writing the new engine in, in Rust with an eye towards performance as, as our first kind of priority.
We, we make a lot of effort to cache as much as possible as part of the execution.
We have budgets that we are trying to keep to as far as how much execution time should a, should a rule take.
And because we do have so much testing data and rules just ourselves, we have a pretty wide gamut of the kinds of rules that not only we're going to use, but our customers are probably going to use too.
And so we can kind of fine tune as we go and say, well, this particular, you know, field or function is used by 98% of customers.
We should probably tune this one a lot and spend a lot of time optimizing this one.
Fantastic.
I, I'm, I'm always amazed by how quickly these 30 minutes go. So we're, we're at time and Michael and Andre, thanks so much for joining us today to talk about the WAF and all the exciting projects and the testing and the management and the infrastructure that's behind it.
We'll definitely look forward to having you guys back to tell us, you know, all the newest things as we do another Latest from Product and Engineering.
But for now, we're going to, we're going to say goodbye.
Thanks so much both for being on. Thanks everyone for watching. Thanks so much, everybody.
Thank you very much.