Latest from Product and Engineering
Presented by: Jen Taylor, Usman Muzaffar, Richard Boulton, Daniele Molteni
Originally aired on March 29, 2023 @ 2:30 AM - 3:00 AM EDT
Join Cloudflare's Head of Product, Jen Taylor and Head of Engineering, Usman Muzaffar, for a quick recap of everything that shipped in the last week. Covers both new features and enhancements on Cloudflare products and the technology under the hood.
Transcript (Beta)
Hi, welcome to another episode of Latest in Product and Engineering. I'm Jen Taylor, Chief Product Officer at Cloudflare.
Hi Jen, it's nice to see you again in the morning, California time, because our colleagues joining us are based in Europe. I'm Usman Muzaffar, Cloudflare's Head of Engineering.
Daniele and Richard, thank you for joining us this morning.
Daniele, can you introduce yourself?
Just say who you are, what you do and how long you've been with Cloudflare?
Yeah, sure. Hello everyone, I'm Daniele Molteni. I've been with Cloudflare since May last year, so almost a year, and I'm the Product Manager for Firewall Rules.
So Richard and I work together. Yeah, and hi, I'm Richard Boulton.
I'm the Engineering Manager for the Firewall System and I've got the distinction of just having passed my three-year anniversary.
I was just thinking that Richard, it's three years since I remember interviewing you, so that's so great.
The exciting thing about three years is you can order a new laptop, so I'm just asking for my laptop refresh right now.
Approval's heading your way, Usman.
Excellent. Yeah, yeah. Well, so it's fantastic to talk to you all. This is one of my favorite areas of the product.
It's actually one of the oldest areas of our product.
It's part of some of our origin story. Daniele, just backing up, when we talk about a firewall, what are we talking about here?
What is the problem we're trying to solve?
What is the capability we provide here? Yeah, sure. So in this context, we usually talk about a web application firewall, a specific type of firewall that filters, monitors, and blocks HTTP traffic to and from a web application, covering both web and API traffic.
So essentially, it's a tool for controlling traffic in general: it gives customers the ability to put rules in place that block malicious requests, or requests that are not supposed to reach the origin server in the first place.
And I know that one of the things that this team has been doing very aggressively over the course of the past year, past couple years, is really, as I mentioned, this is one of the first features and products we launched when we launched the company.
The team has been really leaning in and looking more deeply at some of the customer use cases and really evolving the product based on that.
What are some of the key use cases you have kind of top of mind today, and how is that influencing your roadmap?
Yeah, sure. So I think, yeah, the security and the firewall in general is evolving very fast.
And I think the main driver, the main change that we see in the space, is essentially the adoption of API-based applications.
So the growth is driven by the, if you want, the interconnected world of IoT and mobile apps.
So we see more and more traffic in that space, and the web application firewall is traditionally more catered towards web traffic.
So we have seen a shift in focus in more like API and automated traffic itself and the need for security solutions to protect and inspect API traffic.
What makes inspecting API traffic different from inspecting and thinking about web traffic?
What are some of the things that the team has had to kind of bring to the foreground that they might not have thought about before?
Yeah, I think, well, API traffic is automated by default, for example.
With web traffic, there is a human component to the majority of it.
With APIs, there isn't.
And many of the tools we developed for web traffic, like bot management, for example, don't really apply perfectly to API traffic.
So there are a lot of false positives, for example.
And there are patterns in the way the requests are sent, and in the responses as well, that are very peculiar and typical of APIs.
And because of that, new features and new products are required to really look for specific attack patterns, for example, and new threats and vulnerabilities that are coming up with more API endpoints being present on the Internet.
One of the things that we had to do here was, those rules were there from the beginning, right?
And we originally used to write rules for our customers, and then we let our customers write rules.
And then we decided, okay, because one of the things that Cloudflare does is when we find a rule that can block something, an attack, we can apply that rule for all the zones behind Cloudflare.
And it's one of the most powerful things, right?
So we have a team that's also studying and thinking of vulnerabilities, thinking of how to do that.
But what we also wanted to build here was a new firewall engine.
And I think that was to attack some of the other use cases here, and to see how that same thing could be applied to some of the other problems.
So I think it was very interesting to see the trajectory of, it starts off just as customizations, no different than people were doing in the 90s and the early 2000s.
It becomes a system, and then we need to start scaling it because we start thinking about things like you just said, Daniele, like thinking about APIs now and the automated traffic that comes from them, and how are we going to block those?
Sorry, Jen, I was just so excited by what Daniele was saying. You were about to ask something as well.
Well, no, because one of the things that's been really interesting for me in talking to you, because you spend a lot of time with customers on this topic, and that I really hadn't kept track of, is that it was really easy for people thinking about their web interface to have a good inventory and, for the most part, a good understanding of what their web inventory looks like.
But one of the biggest challenges for many of our customers as they move to thinking about better API level security is just, there's a lot of undocumented stuff happening at the API layer.
And basically, rule one of protecting anything is identify the thing that you're protecting.
How have you guys thought about that?
And what are some of the things we're doing to help customers with that?
Yeah, I think you raise a very interesting point. So visibility is a big problem.
So let alone stopping malicious traffic, the first thing you need to do is provide visibility on the active endpoints.
Are the endpoints maintained? Is there an active development effort to maintain these endpoints or not?
Because we've seen that many customers have different applications on the same infrastructure.
Some of the applications are actively maintained and developed right now, but they also have old applications that still run and that they don't maintain anymore.
And so it's basically a blind spot. So logging, and being able to provide a view of the traffic on these endpoints, is becoming key and is probably one of the first sources of value for our customers.
Knowledge is power.
Yeah. So the HTTP analytics that we provide are probably a key part of some of the answers to that.
But we're looking at all sorts of ways in which we can use that data to help structure things for customers.
Maybe next time we talk, we'll be able to go into a lot more detail on it.
I'm looking forward to that. One of the things that comes up a lot when you talk about firewalls is rules.
It's almost impossible to say the word firewall without the word rule coming out right after it.
You say firewall and then rule, because a rule is basically the model we have for: this is a behavior I want the system to have.
I want it to log something, to block something, to let something through, to accept it.
And once you have enough rules, we need more sophisticated ways of managing them.
And so now you start talking about sets of rules and there's a new noun that's been introduced more recently, rule sets.
So I was wondering, Daniele, can you talk a little bit about that?
How do customers think of groups of rules and how they can be turned on and off as a set?
How do you think about how we manage those? What kinds of levels of control do we need to start to give to our customers?
And then Richard, perhaps you could talk about: okay, that sounds great when a PM writes it in a PRD, but what does it take for us to actually turn that into something that works?
But I thought let's sort of hear it from Daniele. What is the customer looking for here?
Yeah. So I would say, let's take a step back before my time at Cloudflare.
I think, if the story has been told to me correctly, we launched Firewall Rules first, right?
So Firewall Rules is essentially a bunch of rules, like a list of rules.
And customers were super excited, like, oh, I can finally write my own rule, my own custom rule.
And that was the first step. But of course, as things evolve and infrastructure also becomes more complex, our customers become more sophisticated.
They have multiple zones, multiple applications. Then having the ability to granularly define a list of rules that applies only to a subset of the traffic becomes really important, right?
For example, one of the typical use cases we have heard recently is the ability to define rules that apply only to a portion of traffic, let's say the base path of an API endpoint, possibly across different hostnames.
So really, this wasn't possible with the firewall rules we developed to start with a few years ago.
And this is what actually motivated a new development effort in what is called rulesets right now.
And I'm sure Richard can talk a lot about that. Yeah, so I mean, the history is right.
So we started with, actually, we started before that. We started with custom firewall products, which were written manually by us.
We wrote the code in Lua on the edge, generally, to do specific things.
Then we had Firewall Rules, which is still sort of coordinated by Lua, but there are these expressions in the wirefilter syntax, which is a nice, easy syntax to write expressions in and is safe for anyone to write them in.
We did a lot of engineering work to guarantee that those things can be evaluated safely when customers write them.
Give me an example of what you mean by safety, Richard.
Whose safety are we talking about here? So there's all sorts of things here.
So one is it's going to have a well-defined behavior.
You're going to be able to look at it and understand what it's going to do.
And it's not going to suddenly change as our systems evolve under you. It's not going to change and start behaving differently.
Another is safety of our edge in terms of resource usage.
So we want to make sure that we know that we can evaluate that thing quickly.
It's not going to delay the request. It's not going to use too many resources on our infrastructure.
And then there's safety in terms of it's not going to be allowed to do things we don't want it to do.
So if you were able to write an arbitrary piece of Lua code, a customer could write that.
They could do anything.
They could access our disk. We could put sandboxes around it.
But you can never be completely confident, unless you've built it from the ground up or inspected it from the ground up, that it can't do unexpected things.
Whereas with the wirefilter syntax, it's not a programming language.
It's a way to evaluate some conditions.
And it's much more constrained what it can do. So it's a very sort of pure form of here's some conditions I want to test and see whether I want to do something for my request.
So we use that to build firewall rules. And that was always expected to be just a first step.
And we met a lot of use cases with that.
But we are now well into the next generation of that, which is taking that as the fundamental filter structure, but actually we've built this concept of a rule, which is a filter and an action.
And if you have the ability to do fairly interesting actions, combined with fairly complex filters, you can do a huge amount of stuff.
So what we're beginning to see more and more in sort of behind the scenes is this system called rule sets.
We have a standard way to represent a list of rules that will be executed as part of the processing of your request.
And those will then be evaluated.
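To make the rule idea concrete, here is a minimal Python sketch of a rule as a filter plus an action, with a ruleset as an ordered list of rules. It is an illustration only, not Cloudflare's implementation: the `Rule` class, the `evaluate` function, the request fields, the action names, and the first-match-wins behavior are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Request = Dict[str, str]  # hypothetical flat view of a request's fields


@dataclass
class Rule:
    description: str
    filter: Callable[[Request], bool]  # the condition to test
    action: str                        # hypothetical action name, e.g. "block" or "log"


def evaluate(ruleset: List[Rule], request: Request) -> Optional[str]:
    """Return the action of the first matching rule, or None if no rule matches.

    First-match-wins is a simplifying assumption for this sketch.
    """
    for rule in ruleset:
        if rule.filter(request):
            return rule.action
    return None


# A ruleset scoped to an API base path, regardless of hostname.
api_ruleset = [
    Rule("block the admin API on any hostname",
         lambda r: r["path"].startswith("/api/admin"), "block"),
    Rule("log everything else under the API base path",
         lambda r: r["path"].startswith("/api/"), "log"),
]

print(evaluate(api_ruleset, {"host": "a.example.com", "path": "/api/admin/users"}))
```

Because the filter only tests conditions and the action is picked from a fixed set, the same list-of-rules shape can be reused for very different products, which is the point being made here.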
And instead of having to write lots of custom Lua code, we're more and more seeing people use that system to start building new products.
So Magic Firewall is a good example of that.
They're using the rule sets APIs behind the scenes to apply filters in Magic Firewall, in Magic Transit.
So you're using exactly the same syntax externally to write something which works in a completely different part of our stack.
And there's a whole load more things we're beginning to see here.
Our managed rule system is all being rebuilt on this.
And I can talk about that because that is public, but we'll be launching a lot of that very, very shortly.
And there's a whole load more things. And actually, when it comes to engineering, we build a lot of stuff at Cloudflare.
One of the reasons that we can go so fast is we invest effort in how do we take advantage of our technology and build things once to be used many times.
So we don't want everyone to have to build a brand new API every time they want to build a new piece of software.
We want to have a standard way of representing these things and a standard way to do these on the edge.
So there'll be new pieces of code each time, but you don't want to have to start from scratch.
And the rule sets concept is a really nice general concept, which is allowing us to build more and more things more and more quickly.
Well, and I just want to insert myself, because I know Usman is going to ask you a technical question, but just a plus one from a customer-facing perspective.
First, coming back to the fact that you say you've been here almost three years, which blows my mind.
And you really have been an integral part of this entire move to the definition of rules and the user experience.
And part of what's so phenomenal about Cloudflare is that we build these building blocks to make our own work and our own systems easier, more efficient, and reusable.
But part of the brilliant work that this team has done, and probably doesn't get enough credit for, is that you've created not only great APIs, but great user patterns that we're able to roll out.
I was just about to say it's the pattern is as much the innovation here as the tech.
Because it's like from a usability perspective, what you guys have really been able to help us drive is a pattern that we can share with our customers, with our users, that they can learn.
And then as we extend it to new things like Magic Firewall, or potentially something like rate limiting, they're like, I got this.
I know how to do it.
And it enables them to come up the learning curve of Cloudflare. So I just wanted to give you guys some props on that.
And now you can ask your technical question as well.
Before we do that, I'm going to bounce back to that, actually. So I run the engineering team, which is mostly about backend engineering, but we work very closely with the product design team, the user interface experts, and other teams.
And those patterns, they're not just at the API. As you say, when we built Magic Firewall, we did not need to start from scratch and say, how do we build a complex rule editor?
We had one off the shelf, all ready to use.
Our shelf, but yes. But with that, what we are now seeing is we are able to reuse those patterns in many, many places in the user interface as well.
So we can have a very powerful dashboard, which is still, you don't have to learn each part of it from scratch because you've got familiar patterns.
On the subject of learning things, the original firewall was famously in what's called ModSec, which takes its name from the Apache web server for the old timers in the audience who remember that every extension was module this, module security, mod pro, mod at, mod fast.
And that was all regular expressions, which are pretty esoteric and honestly not super flexible.
And Richard, one of the things you mentioned is that the firewall engine lets its users write in the wirefilter syntax, the Wireshark syntax that we're all used to.
Anyone who's ever had to study a packet trace knows how to pull up Wireshark.
And in fact, we've even presented at Wireshark conferences because we've been such a close partner with that.
And one of the fun things was to be able to say that we're adopting that language, and that language now naturally extends all the way through the stack.
And so I just wanted to ask a little bit around how did we implement that?
Like what languages do we use to implement that? Because it's not like we took Wireshark code and embedded that into the product.
So what did we actually do on the implementation side to make that possible? And then to get those guarantees of safety that you were just talking about that are so important.
So there's an interesting story, which goes back to when we first started building and validating expressions.
The thinking was that we'd write that in the language we wrote most of our code in, which was Lua.
So we'll do that in Lua. And very quickly, the team working on it before my time found, hang on, this isn't going to work.
We need a consistent way to build this across our stacks.
It's going to behave the same way. You're not going to have synchronization problems.
And the language they picked at the time was Rust, which was perhaps more of a controversial choice then than it would be now, but it's a very powerful language for writing high, highly efficient code in a safe way.
Early in my career, I did a lot of C programming. And when you're working in C, you're very able to break things in all sorts of exciting ways.
It's the best language for writing insecure code. You can do a lot of great stuff in it, but it lets you do anything you want, including shoot yourself very cleanly.
Whereas Rust gives you a lot of the same power, but with so many tools built into the language to allow you to ensure that your code is safe. You have to explicitly tell it, here's a bit where I'm going to tell you what's right.
Here's an unsafe part. And we basically don't do that in our code. So we have built this engine in Rust.
It's a highly efficient part of it. There are some blog posts on our blog about how we've implemented this.
And there are probably quite a few more we'll need to write soon to keep that up to date, but we've done a lot of extensions to it.
And we've done a lot of work on performance evaluations of it. So we've got a lot of infrastructure behind the scenes: every time we make a change, we evaluate all the existing rules to check they still work the same.
We run a whole battery of performance tests to check that our performance is improving over time.
So we have a lot of systems we've built behind the scenes, to not just trust that this language is going to automatically give us safety.
But we really can do a lot more than we would be able to in other languages.
Rust has really helped us here to be able to be sure that things are working as we need.
That's so great. You know, I remember 14, 15 years ago, first hearing, you know, Google's built this new language called Go.
And when I first heard about Cloudflare, you know, the CTO, John Graham-Cumming, is an old friend.
And he said, yeah, we're using Go a lot.
And I was like, you're using this crazy new language? Aren't you used to the tried and true?
And so I remember when the team first proposed Rust, you know, there's an instinct to be like, wait, are you just going after the trendy new language?
Or is this really going to give us dividends?
And I think it really did because it's been able to also be used at the core where we build the APIs.
And so that means that we can use the same language at the control plane that we do at the edge.
And that's really given us, you know... What Jen said was how many teams have built on top of you, but you've also built on top of other teams.
Can you talk a little bit about how you've leveraged some of the platform work that other teams have built, like Workers, to build the firewall?
Absolutely. So as I was saying, one of the things we need to do to go as fast as we can is to take advantage of all the other things that Cloudflare is building to help developers.
To some extent, we have not built on Workers that much in the past for the firewall, because we have extremely tight latency guarantees that we're worried about.
Essentially, we need to be able to defend the Workers product rather than...
That's right. It's in front of Workers.
Yeah. But that said, there are a lot of times where we want to do more complex things than we can currently do in the filter expression.
Sometimes that means that we write more Rust code.
We're doing a lot of interesting things there as well, but sometimes it's something which can well be expressed by writing something in Workers.
So there's a pattern we're beginning to see where we can run a filter.
And if that filter matches, then we'll evaluate some more complex expression, perhaps with some third-party service.
So we've built the technology to enhance our filters by pushing some of the processing to Workers when we know that the filter fundamentally matches.
And then there are other situations where the wider teams that we build on rely on systems built on Workers.
So our challenge platform is something else that we've done.
So we're doing a lot of work to try and reduce the number of CAPTCHAs we show to people, and to find better ways to ensure that we are better identifying malicious bots that are coming at us.
But also to reduce the number of times we annoy eyeballs.
And that needs an extremely fast iteration cycle.
There's to some extent an arms race going on there. We need to make sure that we can respond to changes in the ecosystem very quickly.
And having that code encapsulated in the Workers platform has allowed us to do a lot of stuff there.
So there's a wider team than mine. I can't take full credit for it, but there are a lot of things where the general pattern is we do the hot path processing in sort of fairly bare-metal compiled Rust code.
And then for the more complex cases, we can hand over to something written in Workers where we can more flexibly adapt.
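The hand-over pattern described here, a cheap filter on every request and a more flexible check only when the filter matches, can be sketched roughly as follows in Python. This is only an illustration of the control flow: `handle`, `is_login`, `challenge_if_no_cookie`, the request fields, and the action names are hypothetical, and the real system runs the hot path in compiled Rust and the flexible part in Workers rather than in one process.

```python
from typing import Callable, Dict, List

Request = Dict[str, str]  # hypothetical flat view of a request


def handle(request: Request,
           fast_filter: Callable[[Request], bool],
           slow_check: Callable[[Request], str]) -> str:
    """Run the cheap filter on every request; only matches pay for the slow check."""
    if not fast_filter(request):
        return "allow"          # hot path: nothing more to do
    return slow_check(request)  # rarer path: hand over to the flexible layer


slow_calls: List[str] = []  # records which requests reached the slow check


def is_login(request: Request) -> bool:
    # Stand-in for a compiled filter expression.
    return request.get("path", "").startswith("/login")


def challenge_if_no_cookie(request: Request) -> str:
    # Stand-in for logic that lives in a separate, quickly updatable layer.
    slow_calls.append(request["path"])
    return "allow" if request.get("cookie") else "challenge"


print(handle({"path": "/static/app.js"}, is_login, challenge_if_no_cookie))
print(handle({"path": "/login"}, is_login, challenge_if_no_cookie))
```

The design point is that the expensive or fast-changing logic is only reached after the filter has matched, so most requests never leave the hot path.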
That's really great. And it's very much in keeping with the power of the platform.
Well, and again, like the great thing here is, you know, we're using, you know, everybody's like, yeah, we're a platform.
We build API first, you know?
And what Workers has enabled us to do is sort of take that to the next level.
And, you know, one of the things we talk a lot about externally is, you know, we use Workers to build our solutions and our products, and it's just a great proof point of how that is possible.
The other thing I really like about the Workers piece as it relates to everything, but specifically with firewall, is, you know, you guys are doing things that customers can then also go do, right?
I mean, one of the biggest challenges, you know, Daniele, I know you live this every day, is, as we're building the product, what is the thing that only Cloudflare can do?
And then what are the capabilities on top of that that we want to leave open to our customers to further customize for those edge use cases that we can't solve?
And the combination of the work that you guys are doing with the firewall plus Workers gives our ecosystem of developers, customers, users, a phenomenally customizable and flexible platform.
Daniele, how do you think about some of that flexibility as you start thinking about, you know, some of the work that you're doing with APIs and stuff like that?
What are some of the angles of flexibility that you'd like to see us leave open for our customers?
Yeah. I think, for example, if you look at APIs, or let's say rate limiting, that's another great example.
So we've seen patterns where customers require a specific way of rate limiting traffic, for example.
So our current rate limiting applies to a URL. You can define your URL, you can count on specific IPs, and then it's going to apply a mitigation action when you exceed a certain threshold.
This is kind of a standard pattern for customers.
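As a rough illustration of that standard pattern (count requests per IP for a given URL, then apply a mitigation action once a threshold is exceeded), here is a minimal fixed-window sketch in Python. The class name, the fixed-window counting strategy, the parameters, and the action names are assumptions for the example, not a description of how Cloudflare's rate limiting is built.

```python
from collections import defaultdict
from typing import DefaultDict, Tuple


class FixedWindowRateLimiter:
    """Counts requests per (IP, URL) in fixed time windows."""

    def __init__(self, threshold: int, period_seconds: int) -> None:
        self.threshold = threshold    # requests allowed per window
        self.period = period_seconds  # window length in seconds
        self.counts: DefaultDict[Tuple[str, str, int], int] = defaultdict(int)

    def check(self, ip: str, url: str, now: float) -> str:
        """Record one request and return "allow" or the mitigation action "block"."""
        window = int(now // self.period)  # index of the window this request falls in
        key = (ip, url, window)
        self.counts[key] += 1
        return "block" if self.counts[key] > self.threshold else "allow"


limiter = FixedWindowRateLimiter(threshold=2, period_seconds=60)
for t in (0, 1, 2):
    print(limiter.check("203.0.113.7", "/login", now=t))
```

The sketch never discards old windows, so a real implementation would also need to expire counters. Counting on something other than the IP and URL, as discussed next, would amount to changing what goes into `key`.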
And there are, of course, cases that require further customization, where you need more granularity in how you define the traffic before taking any action.
So we're going beyond just the URL.
So when we identify this case, then we can say, okay, cool. We can develop a product that allows customers to define the traffic as they wish, and then apply rate limiting.
So we essentially provide a tool that allows this flexibility, but at the same time covers the basic use cases that the majority of customers want.
So it's about finding that trade-off between covering the happy path, or if you want, what the majority of customers want, but then still leaving open that door and that opportunity for customization.
And this is, of course, something that Richard and I are currently working on, thinking about how we expand the functionality of a product like that, this bread and butter, if you want, of security, to secure applications broadly on the Internet.
Yeah. With rate limiting in the conversation, if I could, I'd almost want to put a little asterisk next to it and be like, watch this space.
We've talked in the course of this conversation about a lot of the work that the team has done to modernize the platform and to create rules and stuff like that.
And in some ways, the metaphor I have in my mind is that we're in the process of renovating, you know, effectively a castle, and there are different rooms and different sections that we've gotten to, and we've got some interesting things possibly coming.
Yeah. Yeah.
Yeah. Can't wait to be able to share more. A lot of cryptic smiles from the product leaders.
Excellent. We only have a couple of minutes left. I want to just go, let's go way under the hood here.
Richard, one of the things that got a lot of attention, and that had nothing to do with anything customer-facing, was the infrastructure work that your team did. Just like every other software company, we've got to build; you know, something has to turn code into ones and zeros.
And so what was it that we spent a few cycles on this last quarter, with a sort of dramatic effect coming out of it?
Tell that story for me as we close.
So I've already talked about how important it is.
Our development velocity is kind of key to being able to do the things we do.
We have a system called a continuous integration system.
This is essentially every time we make a change, we run a whole battery of tests.
We compile the code. We then build packages and we push them to the environments, first of all to the staging environment to test it.
And then to a production environment, once we're happy with it. And we noticed that it was taking us half an hour to get to the staging environment and another half an hour to get to production.
And that's not terrible maybe, but it means that you're limited to making maybe four changes a day.
We want to be far faster than that.
So the goal I set to Usman, which I haven't quite met, was that we would take no more than 10 minutes from merging to production. Ten minutes, ten minutes is the magic number.
Yeah. And we'll get there, but we're not there quite yet.
But what we have done is we introduced a system called Bazel.
So we'd been using Make, and I certainly won't knock Make, not in the company where John Graham-Cumming is the CTO. It's a very good system, but it's one that's very old and doesn't integrate as well with the kind of layout we have in our code.
We have a monorepo. We have lots of different projects in the same repository.
And essentially, whenever you made a change to anything, it had to rebuild everything.
And what Bazel, this new build system, does much better for us is allow us to cache partly done things.
This is caching old build artifacts. So we can say we don't have to build the whole thing.
We haven't changed this, this code. So we know the tests are going to pass.
So we don't have to spend nearly so much time. So that cut the time down.
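The idea of skipping work when the inputs have not changed can be sketched as a cache keyed by a hash of the source contents. The Python below is only a toy model of that idea, not how Bazel itself is implemented or configured: `input_digest`, `BuildCache`, and the stand-in compile function are hypothetical names invented for this example.

```python
import hashlib
from typing import Callable, Dict


def input_digest(sources: Dict[str, bytes]) -> str:
    """Hash file names and contents in a stable order to get a cache key."""
    h = hashlib.sha256()
    for name in sorted(sources):
        h.update(name.encode())
        h.update(b"\0")
        h.update(sources[name])
        h.update(b"\0")
    return h.hexdigest()


class BuildCache:
    """Reuses an old artifact whenever the inputs hash to a key seen before."""

    def __init__(self) -> None:
        self.artifacts: Dict[str, str] = {}
        self.builds_run = 0  # how many times we actually had to build

    def build(self, sources: Dict[str, bytes],
              compile_fn: Callable[[Dict[str, bytes]], str]) -> str:
        key = input_digest(sources)
        if key not in self.artifacts:  # inputs changed, or never built
            self.builds_run += 1
            self.artifacts[key] = compile_fn(sources)
        return self.artifacts[key]


def fake_compile(sources: Dict[str, bytes]) -> str:
    # Stand-in for a real compile-and-test step.
    return "artifact(" + ",".join(sorted(sources)) + ")"


cache = BuildCache()
project = {"firewall/lib.rs": b"pub fn eval() {}"}
cache.build(project, fake_compile)
cache.build(dict(project), fake_compile)  # unchanged inputs: served from the cache
print(cache.builds_run)
```

In a monorepo, applying this per project rather than to the whole repository is what lets a change to one project leave the others untouched.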
Also, we had huge numbers of different builds kicked off to parallelize this.
We've been able to throw them all away. So one of the headline numbers was a build which took seven minutes.
We got it down to taking under five seconds.
There you go. A lot of the seven minutes was setting up the environments and then tearing them down again.
Yeah. Very good. All right, folks. Thank you so much. We're right at the end of time.
I wanted to end with Richard being able to brag about that infrastructure optimization, but Daniele and Richard, thank you so much for joining us.
We will definitely have you back soon, when all those twinkles in those product managers' eyes turn into headlines that we can brag more about.
But Jen, always great to see you and we will see you next week.
Latest from Product and Eng. Thanks everybody.