Latest from Product and Engineering
Presented by: Jen Taylor, Usman Muzaffar, Richard Boulton, Daniele Molteni
Originally aired on March 29, 2022 @ 3:00 AM - 3:30 AM EDT
Join Cloudflare's Head of Product, Jen Taylor and Head of Engineering, Usman Muzaffar, for a quick recap of everything that shipped in the last week. Covers both new features and enhancements on Cloudflare products and the technology under the hood.
English
Product
Engineering
Transcript (Beta)
Hi, I'm Jen Taylor, Chief Product Officer at Cloudflare, and I am so thrilled to be joining you and Usman again for another segment of Latest from Product and Eng.
Hi, Jen.
It's great to see you. We've missed you in the last couple of shows. I'm Usman Muzaffar, Cloudflare's Head of Engineering.
And we're very thrilled today to welcome two of our teammates, Daniele and Richard.
Daniele, why don't you introduce yourself and say hello?
Hi, everyone. I'm Daniele Molteni and I'm a Product Manager here at Cloudflare since May.
So I joined just recently and I'm the Product Manager for Firewall Rules.
Hey, and I'm Richard Boulton. I'm the Engineering Manager for the Firewall Team.
I've been at Cloudflare for a couple of years now, working on Edge and Firewall stuff.
And that's the magic word for today, team, is Firewall.
And we're going to talk about all things Firewall. So actually, Jen, I'll let you start since you haven't been on the show for a while.
I know.
So I got to say, I have been looking forward specifically to doing this episode since we actually started the show.
The Firewall product is something that is very near and dear to my heart and something that we've done a lot of innovation in in the three and a half years that I've been here.
But Daniele, can you, just at the highest level, explain what our Firewall is and what it does for our customers?
Yes, definitely. So basically, the Firewall engine is a rule-based system that allows customers to design their security posture.
For example, they can design rules and decide whether to block certain traffic or allow other traffic that they consider good.
So it's really the basis for our whole security product suite, which includes the WAF, of course, but also other products like IP access rules and rate limiting, for example.
And we do work closely with other teams, like for example, bot protection and DDoS as well.
The reason, part of the reason why I've been so excited to do this session is really kind of twofold.
I mean, one is, I mean, our WAF, our Firewall is basically almost as old as the company.
I think, you know, John Graham-Cumming, our CTO, actually wrote it himself.
And then the second is exactly what Daniele highlighted, is that the Firewall really has become kind of this core fundamental platform for all of our application security.
And it enables us to really do and deliver kind of a really deeply integrated solution in a really unique way.
But just sort of stepping back for a second, you know, Richard, I just sort of mentioned, you know, JGC wrote this thing, like, talk to me a little bit, like, how does the WAF actually work?
And kind of what are some of the improvements that we've been making over the course of the past couple of years to make it work even better?
Yeah, well, as you say, JGC wrote the early version. We're still running some of his code.
We're working quickly on getting rid of that. Yeah, if you go deep enough in Git and you like, you run annotate in the right place, you will still see his handwriting, I think.
But we're really close to getting rid of that basically, completely.
Really close. So I remember when I was interviewed two and a bit years ago, one of the questions that I was, not so much questions, but topics for discussion was, how do you go about building an engine which can filter through the vast amount of traffic we get efficiently enough, use a little enough CPU, but do really deep inspection of what's going on?
And that's essentially what the engine has to do.
So the system we had originally, actually, I think a few years back, there was a version in PHP.
Then there was a much better version written in Lua, which is the one JGC wrote, which essentially generates Lua from ModSecurity rules, a security rule language that's quite widely used.
Amazing, actually.
It was really, really amazing work. And it's one of those amazing pieces of technology that serves you really well for a long while, but doesn't necessarily go forever.
So we've got lots of pain with that. It's very hard to maintain.
It's very hard to get the visibility and the control. And really the key thing is that the rules written with that, they're run using an arbitrary language on our edge.
We have to be really careful to make sure that that's safe and stable.
And that's one thing we can't give to customers. So we couldn't give customers the ability to write arbitrary rules because we couldn't do a security review that would make that system safe for anyone to write arbitrary rules on our edge.
So instead, what we have done, we've built an engine, and this has been worked on for the last three years, that allows people to define really flexible expressions.
And we took inspiration from an existing product, Wireshark, which is a tool people use at the network layer to filter through traffic, to understand it, and to write filter expressions.
And we took that language and we've essentially, we would say, been inspired by the language, which is very compatible in some ways.
It's the same field names, lots of the same syntax, but it works in a very robust and efficient way, and it's essentially a very safe way.
And that's the system we've built into our edge that you can essentially write arbitrary rules, customers can write them, and we are now using that same language to write our firewall rules as well.
So that's the piece which we're working on to replace John's code, to move entirely from the system where he translates the rules into ones where we write the rules ourselves.
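As a rough illustration of why a restricted expression language is safer to expose than arbitrary code, here is a minimal sketch in Python. It is not the actual engine (which the speakers say is largely written in Rust); the field names, operator names, and helper functions are assumptions chosen for illustration. The point is that a rule is validated data built from a fixed set of fields and operators, so nothing a customer writes is ever executed as code.

```python
import operator

# Hypothetical field and operator names, for illustration only.
ALLOWED_FIELDS = {"http.host", "http.request.uri.path", "ip.geoip.country"}
OPERATORS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "contains": lambda field_value, needle: needle in field_value,
}

def compile_condition(field, op, value):
    """Validate one (field, operator, value) triple and return a matcher."""
    if field not in ALLOWED_FIELDS:
        raise ValueError(f"unknown field: {field}")
    if op not in OPERATORS:
        raise ValueError(f"unknown operator: {op}")
    compare = OPERATORS[op]
    # A missing field is treated as an empty string rather than an error.
    return lambda request: compare(request.get(field, ""), value)

def compile_rule(conditions):
    """Combine several conditions with a logical AND."""
    matchers = [compile_condition(*condition) for condition in conditions]
    return lambda request: all(matcher(request) for matcher in matchers)

rule = compile_rule([
    ("ip.geoip.country", "eq", "IT"),
    ("http.request.uri.path", "contains", "/admin"),
])

request = {
    "http.host": "example.com",
    "http.request.uri.path": "/admin/login",
    "ip.geoip.country": "IT",
}
print(rule(request))
```

Because unknown fields and operators are rejected when the rule is compiled, a bad rule fails before it ever sees traffic.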
It's really fantastic. I remember as, you know, I joined just a couple years, a year and a half before you, and one of the real big missions was, we have to let customers, give customers the ability to write their own rules.
And at the time, and I'm talking years and years ago, you had to write in and file a support ticket to get someone to write a rule. We had a flexible way to do it.
It wasn't that we couldn't push rules very quickly, but an engineer had to actually write them.
And so part of it was first getting to the maturity where there was a big enough group of people in Cloudflare who could write those on behalf of customers, because we were growing so quickly.
And then it was like, no, we've got to come up with a way to let customers write this, but we can't just let them write in the raw Lua or even in the raw ModSec, because we know that's not safe.
And so I think, exactly as Richard just said, taking the inspiration from Wireshark, which is the tool that all network engineers use (you can probably see it on the average Cloudflare engineer's desk at any time, running in a corner as they're analyzing a packet trace or something), was perfect, because that's exactly the language that our customers and our network engineers want to use and write in.
And so then building that so that it can run at scale, was, was really interesting.
We'll come back to, to some of that more, but, you know, I think probably a question back to the product side of the house, Daniele, like, what are some of the challenges that customers want to solve with a firewall?
What are some of the things that, I mean, the easy one is block the bad traffic, but like how, and like what, what, what's involved in, in, in customizing that and giving them that control?
Yeah, I think the first thought I have is that we have to basically develop a product that covers a range of different use cases, right?
From customers that really have a basic need of turning on a security portfolio, if you want, of solutions, with just a click of a button, to more and more sophisticated customers.
They want to control what really goes through our firewall, but also integrate with their backend and their infrastructure.
So I guess that's the first challenge for us, like making a product and offering this product that is easy, easy to use, but at the same time, flexible enough that covers all of these use cases.
And then, I mean, going back to your question as well, I mean, what are the, these use cases?
These can be very basic use cases like filtering out traffic, for example, from a geographic perspective, and narrowing down to what you know is going to be good traffic.
And this can be done with simple rules, but you can also apply managed rules, like OWASP, for example, which offers a pre-built set of really sophisticated rules that protect from the most common attacks, if you want, like SQL injection, for example.
These are sets of rules that, let's say, people who are not used to writing those specific rules would find really challenging to implement and write.
So we can offer this out of the box to them with just a simple click of a button.
And then, of course, you can improve or, let's say, get more and more sophisticated solutions.
For example, if you have to protect API traffic or, for example, if you still want to allow traffic to reach your origin, but perhaps manage the throughput that you are letting through, for example, with solutions like rate limiting.
So you can really have more granular control on who gets to your origin, through your servers, and try to also limit whether there's attacks or people perhaps they want to...
And I'm using the same language for all this.
I'm still defining everything in this simple, does it match this, with this long list of fields that I can match on, and then take some action or integrate with all these products.
So it's not just even block, it's not just firewall, it's actually enable and turn on other, effectively, other products that are inside Cloudflare.
And that's the platform aspect of what... So I think a really interesting thing is, back in 2018, we introduced a slew, in 2017, 18, we introduced a slew of different products to meet specific needs.
So we introduced things like the user agent block, which will allow you to block particular user agents, and the zone lockdown, which will allow you to say, for these IP addresses, I only want to allow these IP addresses to get to this part of my site, because that's where my admin panel is, or something like that.
We did that because those were the top needs we needed to address.
But then we look back at the system we had, and it's a system where you've got all these different ways to configure all these different use cases.
And then there's all the gaps between those that you just can't meet.
Say someone wants to do rate limiting for their lockdown, or they want to do something along those lines.
We just can't do that with the system we built.
And that's one of the real drivers behind moving to a single engine, where you have a single way that you can define all your rules, and you can get the product of all the capabilities.
And there's a landscape of user needs we may not know about that people can meet.
The long tail is long. Well, and it's one of the things, if people hear me talk about product, and hear me talk a lot, especially about our security solutions, is really the power of that seamless integration.
So enabling us to create this engine where you can basically do single pass inspection enables it to be incredibly performant.
But then from the other side, part of what we've really done is started taking what Richard, I guess, kind of just identified as these little silos of information, and started making them more universally available.
Like the work that we did recently on IP lists to enable customers to build effectively these corpuses of information that they have from their use over time, to be able to kind of effectively codify that, and use that deeply within the rules itself is so cool.
And it's so powerful.
Yeah, I think IP list is a great example. So I remember when Richard, when you first started showing me the expression, and we should talk about the UI behind this thing too, right?
Because effectively, it's turned into an IDE for expression.
It's as rich as you would expect any development environment to be in which you can describe arbitrary input.
It's probably complete, and describe an arbitrary action to be taken.
And I think one of the things that we started was like, this thing needs the equivalent of variables.
It needs the equivalent of constants.
And so then we, because otherwise people started writing really long expressions, like if it matches IP address one, or two, or three, or four, or five, or six, or seven, and basically actually, if it matches set X, and now you need the ability to define set X.
So Richard, what was involved in that? I mean, like this thing is now turning into like a mini programming language at the edge.
How did we start to evolve this so that it could do things like lists? Let's just talk about lists for starters.
So exactly. So IP list is probably a really good example to look at.
So we have a product. We have a product called the IP firewall, which allows people to say, these IPs, I want to block them, or I want to allow them.
And that's another one of those little local peaks. We've got this product is really useful.
It meets a lot of use cases, but you can't combine it. People want to say, I only want to apply that, but I want to exclude this part of my site because I know that actually there's going to be a whole load of traffic, which is really important.
And I don't care if a bot hits it. So various different ways in which people want to separate that traffic.
So I mean, the conversation is, well, we clearly want to combine that with our rules, but we need to find the syntax.
The way we address this is we have a really good process in the team of thinking of all the different ways we could write a syntax for that.
I think we came up with at least six or seven different syntaxes and different ways we could address this.
And we built prototypes. We built prototypes of something to work out, well, this one might be useful.
This one might not be. And that's a process which takes effort and time, but resulted in, well, actually the one we came up with and implemented in the end was simply saying, instead of an IP, you can say dollar and then the name of a list.
So you have a variable. I love it, right?
It turned into a metasyntactic variable and programming schematics. And in all our prototypes, that was the one where you had, this is actually going to be the hardest to implement and the hardest to make work in our system.
But the reason we took that choice is that it's the most useful and the most flexible.
And there's still quite a lot of things that would have to happen behind the scenes in the engine to make that syntax work.
But it's proven really popular. We've got lots of people using it.
And the idea is that everyone can migrate from their firewall, their IP firewall into that system without losing functionality and gaining a whole lot of power and gaining the ability to do all these different things as a product of the features.
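To make the "dollar and then the name of a list" idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the engine's implementation: the list name `office`, the `IP_LISTS` store, and the helper names are all hypothetical. An operand that starts with `$` is looked up as a named list; anything else is treated as a literal IP address or CIDR range.

```python
import ipaddress

# Hypothetical named list, using documentation-only address ranges.
IP_LISTS = {
    "office": ["192.0.2.0/24", "198.51.100.7"],
}

def resolve_operand(operand, lists):
    """Return the networks an operand refers to.

    '$name' is looked up in the named lists; anything else is parsed
    as a single literal IP address or CIDR range.
    """
    if operand.startswith("$"):
        name = operand[1:]
        if name not in lists:
            raise KeyError(f"unknown list: {name}")
        entries = lists[name]
    else:
        entries = [operand]
    return [ipaddress.ip_network(entry, strict=False) for entry in entries]

def ip_in(client_ip, operand, lists=IP_LISTS):
    """Check whether client_ip falls inside the list or literal operand."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in resolve_operand(operand, lists))

print(ip_in("192.0.2.55", "$office"))
print(ip_in("203.0.113.9", "$office"))
```

The same matching function serves both the literal and the list case, which is one way a list can be combined with any other condition in a rule.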
But I think just like you had me at metasyntactic variable, right? Like you got me.
But at the same time, Daniele, most of our users are not like, God, I'm so happy they added a metasyntactic variable because now I can include it in this really robust language.
Like talk to me actually, the thing that I also like when you hear me talk about a lot of our security solutions and specifically the work that this team has done that I am so grateful for and so passionate about is the simplicity of the experience.
Can you talk to me a little bit about the way that we've thought about creating the rule builder and the discipline that the team keeps their eye on as we continue to add in these new capabilities?
Yeah, definitely. So I think this is something we keep in mind for every new product release, right?
So how this is going to interact with the customer.
So let's start from the dashboard. In the flow that a customer goes through, the first thing we show is actually the analytics or, if you want, the traffic that went through the firewall on its way to your server.
And then from there, you can see the breakdown of, for example, of all the traffic that was blocked or allowed.
And then, for example, if you see there is a spike, for example, of traffic there, you could isolate the spike and automatically write a rule that blocks exactly the spike of traffic.
So, for example, you see, oh, there's a peak of traffic from Italy, let's say.
And then once you say you find that, you can just simply write the rule.
Or conversely, if you want to, you know what you want to do, the flow of writing and deploying a rule is really streamlined and it's really flexible.
You can add fields and add conditions. Let's say, if a request comes from a country and perhaps has a particular header, then take an action, which can be block, challenge, or allow, or you name it.
And within this flexibility and simplicity, I think, you can also add more complex products, for example IP lists.
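The spike-to-rule flow described above can be sketched roughly as follows. This is only an illustration: the event shape, the baseline input, the `factor` threshold, and the generated expression string are assumptions, not the dashboard's actual logic.

```python
from collections import Counter

def suggest_rule(events, baseline_per_country, factor=3.0):
    """Suggest a block expression for the country that spikes the most.

    A country counts as spiking when its event count is at least
    `factor` times its baseline. Returns None if nothing spikes.
    """
    counts = Counter(event["country"] for event in events)
    worst_country, worst_ratio = None, factor
    for country, count in counts.items():
        # Guard against a zero or missing baseline.
        baseline = max(baseline_per_country.get(country, 1), 1)
        ratio = count / baseline
        if ratio >= worst_ratio:
            worst_country, worst_ratio = country, ratio
    if worst_country is None:
        return None
    # Hypothetical expression syntax, for illustration only.
    return f'ip.geoip.country eq "{worst_country}"'

events = [{"country": "IT"}] * 90 + [{"country": "US"}] * 10
baseline = {"IT": 10, "US": 10}
print(suggest_rule(events, baseline))
```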
It's really amazing because it's actionable analytics, right?
It's like as you're looking at the graph, you can turn that into control plane changes that actually shape your traffic.
It's a very streamlined experience.
It's really slick. Well, and also just a great evolution, right?
Like when we went from like, first, like we released the rule builder and it was like this, if this, then that, like email filter sort of point and click thing, right?
And then we added the ability to kind of add the syntax itself. But then we added the analytics and even kind of taking point and click to one step further of like, you don't even need to look at the rule builder.
You can literally point and click on the analytics and use that to help craft the rule itself.
Yeah. One of the really interesting things is to look at actually what's needed in the company to support that.
So we have a whole load of data. How do you come up with analytics that can handle gazillions of requests a second?
And it's, it's absolutely, we talk about the firewall here.
There's a firewall team.
That's not the only team who's needed to make this work. So we have a really good data team.
They essentially think about how we handle the volumes of data.
How do we process it? How do we provide real-time analytics experiences like this?
And actually the firewall analytics work was one of the things pioneering a new technique we've blogged about called ABR, sort of adaptive bit rate, where we store the data at multiple different resolutions.
If you're doing a difficult query, you'll get a less accurate answer, but we'll be able to answer the question in a good time.
And they work really closely with our engineering, with our site reliability engineering team to actually say, well, how do we handle the servers we need to run this?
How do we do capacity planning? We essentially, we need to watch those graphs as they change to check that we're not on a trajectory where we're going to have difficulty with space and take action in time.
So they all work together to support these kinds of analytics experiences.
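The multi-resolution idea behind ABR mentioned above can be sketched in a few lines of Python. This is a simplified illustration, not the data team's implementation: the sample intervals, the `row_budget` parameter, and the every-n-th-row sampling are assumptions (a production system would more likely sample randomly). The same events are stored at several sample intervals, and a query reads the finest copy that fits its budget, scaling the count back up by the interval.

```python
def build_resolutions(events, intervals=(1, 10, 100)):
    """Store the same events at several sample intervals (every n-th row)."""
    return {interval: events[::interval] for interval in intervals}

def estimate_count(resolutions, predicate, row_budget):
    """Estimate how many events match, reading at most row_budget rows.

    Uses the finest resolution that fits the budget and scales the
    result by the sample interval. Returns (estimate, interval_used).
    """
    chosen = max(resolutions)  # fall back to the coarsest copy
    for interval in sorted(resolutions):
        if len(resolutions[interval]) <= row_budget:
            chosen = interval
            break
    matches = sum(1 for row in resolutions[chosen] if predicate(row))
    return matches * chosen, chosen

# Synthetic events: every third request is blocked.
events = [{"action": "block" if i % 3 == 0 else "allow"} for i in range(10_000)]
resolutions = build_resolutions(events)

def is_block(row):
    return row["action"] == "block"

print(estimate_count(resolutions, is_block, row_budget=20_000))
print(estimate_count(resolutions, is_block, row_budget=1_000))
```

A generous budget returns the exact count from the full data, while a tight budget returns a nearby estimate from a sampled copy, which is the trade-off described: a less accurate answer, but in good time.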
And then at the other end, we have the user interface teams who work on building, building these interfaces and pulling them together.
And they have to understand the details of the firewall product to be able to do that.
And then one of the things which has been really interesting over the period we've been developing this is it's really easy from a technical side to design an API that feels good and consistent and makes sense.
And then you try and tie it to a user interface and to the user needs you're trying to meet, and it just doesn't work.
So we, we have regular, really regular meetings with our product design team who are thinking about the whole experience that customers need to go through.
And we are not releasing new APIs until we are confident that those APIs can be built into the right user interface and can meet the systems.
So it's all those things working together.
And that's one of the things that I've really been enjoying over the last, last years is having all those different teams with their specialisms, pulling together to make sure that we're actually making something that is, is a smooth, it's not one, one team's effort.
It's not one person's effort. It's so many people pulling together.
And I think, just to double click on the design piece for a moment, Daniele, I know one of the things that you've been very involved in is working with the design team and working with customers as we're iterating on some of these things.
Can you talk to me a little bit about sort of that process and sort of how you guys have, have tackled that?
Yes, definitely.
So I think as Richard mentioned, we tend to involve design very early on.
So when you're developing something from an API perspective, once we have something ready that we can start to release to customers, let's say a new API call that, for example, triggers a new product, for example for IP lists, then we take this to design and we say, okay, how can we make the experience of the customer exploring and using this product streamlined and integrated in our dashboard?
And usually the process is iterative, right? So we have conversations internally across the teams, as Richard mentioned, but then when we have something ready that we can show to customers, we do show it to customers and collect their feedback, and sometimes we try not to even prime them or say, oh, this is the way you should use it, right?
It's like, okay, this is a new toy. How would you go about it? How would you, yeah, play with it and tell me what's going on in your mind, right?
So there is, then they, they talk about it.
It's like, oh, I want to do this in this app, but I don't see where the button is.
And then you realize you just hid it in the wrong part of the dashboard.
Building easy is hard. Building easy is very hard. And it's different.
I mean, so like, just to touch back on something Richard said earlier, right?
Like the reason there's a very talented data team that's worrying about this and that adaptive bit rate idea is the data team, when they first presented this problem to me, they're like, do you realize that our customers, the firewall events frequency spans nine orders of magnitude.
Like some people will have literally a single digit, maybe the firewall triggers once in 24 hours, and other people can have it trigger hundreds of millions or tens of billions of times.
And it's hard to even conceptualize that. That's like the difference between the size of Mount Everest and a grain of sand.
I mean, that's the range that the team is dealing with. So it's very easy for a product manager to put in a spec.
Yeah. And it should also show you the top N hits or, you know, the top rules that it hits on.
And yet, if you think about the engineering challenges that's behind that, that's phenomenally difficult to come up with a system that is going to give you the right answer over nine orders of magnitude range of how many, how, how big and small of all sizes and shapes in between.
And then to design a UI like Daniele is talking about that is intuitive for everybody who's working in those different spaces.
It's great.
One thing, Daniele, I wanted to ask you about. You know, when Cloudflare started, literally the flare in Cloudflare is an allusion to part of this firewall aspect. This was part of the original vision of the company, and it's extraordinary that we're still nailing this.
Right. But the web of 10 years ago was much more HTTP and HTML based, you know, single-page apps were just coming on.
And now APIs are way more of the traffic. Much more of the traffic over HTTP is actually an HTTP-based API rather than raw HTML.
So talk a little bit about that.
How has that evolved for the firewall, and what have you had to start to think about?
Yeah, I think this is something we realized recently. So we are in a really privileged position, right?
Cloudflare sees a lot of the Internet's traffic flowing through its systems.
And if we look at the traffic, we see that more than 50% is actually API based or related to APIs.
So a lot of our products already work with APIs and can protect APIs. The firewall, rate limiting, they're all fine with APIs. But we don't have anything that has really been designed for APIs, with a UI tailored to them and a streamlined deployment for API security.
And so this is something we started thinking about last year already.
And in the last few months, one of the projects culminated in a new product at the beginning of October, API Shield.
And this is something that sits on top of the firewall.
And what it does at the moment is takes care of the encryption and the authentication of API traffic.
So basically it relies on mTLS. And what it does, in very basic words, is check that every request has a valid certificate.
And if it doesn't, then you block it. And this is the, if you want the basic way to really remove a lot of the noise that is out there.
That's great. And instead of the client worrying that the server is secure, this is the server worrying that the client is who it says it is, right?
So it's actually the client presenting a certificate, and the question is, for example, are you supposed to be talking to me, rather than the other way around?
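For readers who have not configured mutual TLS before, here is a minimal sketch of the server side using Python's standard `ssl` module. It only illustrates the general technique of requiring a client certificate; it says nothing about how API Shield itself is implemented, and the function name and `client_ca_file` parameter are assumptions for the example.

```python
import ssl

def make_mtls_server_context(client_ca_file=None):
    """Build a server-side TLS context that requires a client certificate."""
    context = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    # With CERT_REQUIRED the handshake fails unless the client presents a
    # certificate that chains to a trusted CA, so the request is rejected
    # before any application code runs.
    context.verify_mode = ssl.CERT_REQUIRED
    if client_ca_file is not None:
        # CA bundle that issued the client certificates (hypothetical path).
        context.load_verify_locations(cafile=client_ca_file)
    # A real server would also call context.load_cert_chain(...) to load
    # its own certificate and key before accepting connections.
    return context

context = make_mtls_server_context()
print(context.verify_mode == ssl.CERT_REQUIRED)
```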
Yeah. There are like a couple of big use cases like IoT and mobile apps, right?
IoT is becoming massive out there, mobile apps as well.
And, and the, the challenge there is like, this is kind of automatic traffic as well.
It's automated. So it's also where is the line between a bot and, and an IoT device.
And this, I think, is becoming a big challenge, because you cannot use the same paradigms you were using before in a different context.
It's not just expressions matching over little snippets of HTML.
That's right. But from an engineering side, one of the really interesting things about that product is we didn't have to build it from scratch.
We had all the pieces we needed to put together to make it a new product.
The mutual TLS work exists with our Access product. We could build on the whole matching engine that I've been talking about for actually checking those.
So there was not a huge amount of, of integration work to make those things work together.
And that's where we're getting to.
We're trying to build a powerful generic system where we can build new products quickly.
And we're getting there. And that's part of what I love about building product at Cloudflare is right.
We have all of these interesting, I would call them pieces of innovation, that when we launch them are solutions and products in and of themselves, but as we build them they become effectively these Legos, these building blocks that we can pull down off the imaginary shelf and combine in new ways.
And it enables us to solve problems that the industry has in new and novel ways.
Like, you know, most of the people today from a mobile SDK, a mobile problem, they're like, well, I'll just bake some code into my SDK.
And it'll do the checking on the client. And that falls apart really quickly. But the solution that this team has taken for mobile and for API, being able to effectively do the exact same for that kind of traffic that we're doing for other traffic, to do that authentication and inspection on the edge, is a game changer in API security and in mobile security.
And I'm just super excited about it. Great. And plus, you know, those components, when we've used existing components, that means they've already been built for scale, they're tested for scale.
So as an engineering leader, that makes me very happy.
We only have a couple minutes left. I want to bring up one last point, because I think it's very interesting.
Richard, you noted that the original version, you know, was long ago was in PHP.
The one that Cloudflare built, that has run for a decade and still runs, is in Lua.
One of the interesting things that we did with the firewall engine is that a big chunk of it is written in Rust.
And it actually runs in Rust at the control plane and at the edge.
So talk just in the last couple minutes, what are some of our experiences of, of working with Rust and how has, how has that unlocked some, some, some advantages?
So one of the great things about Rust is that it's a low level language with safety.
So you can, you can really get into the details of performance.
And also you avoid a lot of the problems where you have to spend a huge amount of your effort, making sure that your code is safe and runs correctly.
We still have to do a lot of work on that. We still have a lot of fuzz testing and careful regression testing to make sure that things are working correctly.
But there's a whole load of areas you need to worry about with other languages that you don't have to with Rust.
And then in terms of performance, we've been able to build an engine which can handle large amounts of traffic, handle very quickly.
And we've got really good control over the memory usage and the resource usage.
And that's something which is utterly critical. So we run our systems to utilize our resources as effectively as possible.
And that means we actually have to have control.
So things like knowing how much memory is being used to do something that's really, really critical.
We've also done a lot of work to build a really good sort of performance benchmarking system.
To the extent that one, one thing I was discussing with an engineer today was that we were trying the latest version of Rust, upgrading our Rust compiler to a new version.
Every time we do that, we run it through our regression system. And we noticed in this case, there's a regression of 10% in one of our benchmarks.
So that's something that we can then dig into, to see which Rust commit caused that problem.
We can feed that back to the Rust community. So it's really nice: we rely on this community, who are building a very good language and a very good compiler.
And we are able to feed back to it with, probably we've got a better benchmark on parts of the compiler than anyone else.
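The kind of check described here, comparing benchmark results before and after a compiler upgrade, can be sketched as follows. This is an illustrative Python sketch, not the team's tooling: the benchmark names, the timings, and the 10% default threshold are made up for the example.

```python
def find_regressions(baseline_ns, candidate_ns, threshold=0.10):
    """Return benchmarks whose time grew by at least `threshold`.

    Both inputs map benchmark name to time per iteration in nanoseconds.
    The result maps each regressed benchmark to its fractional slowdown.
    """
    regressions = {}
    for name, old in baseline_ns.items():
        new = candidate_ns.get(name)
        if new is None:
            continue  # benchmark was not run with the candidate compiler
        change = (new - old) / old
        if change >= threshold:
            regressions[name] = change
    return regressions

# Made-up timings for two hypothetical benchmarks.
baseline = {"parse_expression": 1000, "match_request": 2000}
candidate = {"parse_expression": 1120, "match_request": 2010}
print(find_regressions(baseline, candidate))
```

Once a benchmark is flagged, the usual next step is to bisect between the two compiler versions to find the change responsible, which matches the "dig into" step described above.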
That's great. Really working at scale.
Testing it in a way that no one else is. I always love this. I'm always amazed how quickly we run through our 30 minute budget.
It's so much fun to talk to our engineering and product leaders about everything they've built.
Daniele and Richard, thank you so much for joining us.
Jen, always great to see you too. And we will be back next week with, actually we won't be back next week.
Next week is US Thanksgiving, but in two weeks we'll be back with the latest from product and engineering to talk to another team about all the great stuff we're building.
Thanks everyone for watching.
And thanks Daniele and Richard for joining us. Bye everyone.
Thanks everyone. Thanks Usman. Bye. Bye. Bye.