Latest from Product and Engineering
Presented by: Jen Taylor, Usman Muzaffar, Ben Solomon, Andrew Li
Originally aired on August 15, 2021 @ 6:30 AM - 7:00 AM EDT
Join Cloudflare's Head of Product, Jen Taylor and Head of Engineering, Usman Muzaffar, for a quick recap of everything that shipped in the last week. Covers both new features and enhancements on Cloudflare products and the technology under the hood.
Transcript (Beta)
Hi, I'm Jen Taylor.
Welcome to Latest from Product and Engineering. Hi Jen, I'm Usman, Cloudflare's Head of Engineering.
Nice to see you again, Jen. Good to see you as well.
I'm back here for our favorite time of the week. It is the best time of the week.
And I'm as always super excited to introduce two of our product and engineering leaders who are joining us today.
Ben, why don't you say hi? How long have you been at Cloudflare and what do you do here?
Hi everyone. My name is Ben. I am a product manager on our bots team.
I've been at Cloudflare for about nine months now and I am based out of San Francisco.
Excellent. Hey guys. My name is Andrew. I'm the engineering manager running the bot management engineering team.
I joined Cloudflare 18 months ago today, exactly.
18 to the day. That is amazing. It feels like you guys both have been here forever because we were like, so the topic of the day, if you haven't figured it out from Ben's background, is bots.
And we were chatting just before we were getting ready to kind of like, what do you want to talk about and stuff like that?
And we realized that we could probably fill three episodes of Latest from Product and Engineering with the work that we've been doing on bots.
So I can't believe you guys have not been here for like 400 years, because I don't know how you got all this done in the time you've been here.
Okay, but stepping back. Ben, what is a bot?
Why are they good or why are they bad? So here's the deal. When I typically think about the Internet and the way that we use it, I think about it from my perspective, right?
I wake up in the morning, I go on, I check my email, maybe I read some articles, maybe I get on a Zoom call.
I behave on the Internet as I'd expect a human to behave, right, just a bunch of manual actions.
And if you zoom out, you'd expect to see a lot of people just like me.
A lot of people just moving throughout the Internet and kind of making a couple of clicks per minute.
But it turns out that a huge percentage of Internet traffic doesn't come from humans.
It comes from bots or automated services.
And there are both good bots and there are bad bots.
The good bots power a lot of important services, things that are like Google, for example, search engine, we all rely on it every day.
The bad bots do really bad things.
And we can talk about those in a minute, but they have all sorts of really, really consequential impacts on big businesses, which often cost people a lot of money.
So the team that I work on and the team that Andrew works on manage a particular solution for bigger customers, and sometimes for smaller customers, that helps them manage these two groups of traffic, the bots and the humans, and then decide what to do when those bots and humans show up at their websites.
That's great. So a huge part of this is actually trying to identify the thing that is knocking on the door of my website.
First of all, is this a bot or is this a human?
Because, after all, there are bots that are, I'm assuming, trying their hardest to look human.
So they look like Ben checking his email, when in fact they're trying to potentially do something more nefarious.
And then to give that control back to a customer.
So Ben, what does the control model actually look like here?
And then I'm gonna turn to Andrew and ask him, how do we even tell these two things apart?
But once we've given someone the ability to even have a knob here, what do they do with it?
What's an example of something a customer would wanna do with the knowledge that this request is coming from a bot versus Ben checking his Twitter feed in the morning?
Well, there are a lot of options here.
So the most simple thing you can do is just block the bot, right? And this works for most people.
They decide a bot's showing up at their doorstep, they just wanna block it, because it's gonna eat up their origin resources, or it's potentially gonna do something worse.
And you can think about this for a bank, for example, right?
We all have online bank accounts. When I go to sign into my bank, I enter a username and a password.
But if a bot is showing up, particularly a bad one, that's trying to somehow compromise a banking system, you may wanna do something else to that bot.
You might wanna block it, but you might not know that it's a bot.
And if you don't know that it's a bot, the other control option you have is to issue some sort of a challenge or a captcha.
We've all seen these captchas on the Internet.
You click on the little images of planes and trains. And the idea is you prove whether or not you're a human.
So a lot of our customers, rather than just block bots up front, they say, you know what?
Just to not fully ruin the customer experience and not to accidentally block a human, I'm gonna issue one of these captchas.
And that way I can get a really good measure on whether or not this is actually a bot or a human.
Got it. Andrew, how does this work? Like even at the beginning, like the bots don't come in, some of them are trying their hardest to look like a human.
So without going into all the magic, like just even like at the highest level, how do we even approach an engineering problem like this?
Yeah, excellent question. Identifying a bot is no different from identifying a bad guy in the real world.
So you identify them by identity and you identify them by behavior.
That's literally what we do on the engineering side.
We identify by known bot identity. And so things like what type of browser are you using?
Some types of browsers are well known as the things that people tend to use for bad things.
So we have a rule-based identification process to identify you based on who you are.
And as I said, we also identify based on behavior. So bots tend to do things humans generally don't do.
So for example, I can probably visit one page per second, at top speed.
A bot can do it a lot more frequently, right? So they exhibit behaviors that are very different from a human's for the most part, especially the bots that are less sophisticated.
So by and large, we identify bots in those two axes, who you are and what you do.
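The two axes Andrew describes, who you are and what you do, can be sketched in a few lines of Python. This is an illustration only, not Cloudflare's detection logic: the marker list, the 10-second window, and the one-request-per-second ceiling (borrowed from his "one page per second" remark) are all assumptions.

```python
from collections import deque
from typing import Iterable

# Hypothetical substrings for the identity axis; a real rule set would be far richer.
KNOWN_AUTOMATION_MARKERS = ("curl", "python-requests", "headlesschrome")


def identity_flag(user_agent: str) -> bool:
    """Identity axis: does the client announce itself as a known automation tool?"""
    ua = user_agent.lower()
    return any(marker in ua for marker in KNOWN_AUTOMATION_MARKERS)


def behavior_flag(request_times: Iterable[float],
                  window_seconds: float = 10.0,
                  max_per_second: float = 1.0) -> bool:
    """Behavior axis: did the client ever exceed the assumed human request rate?

    request_times must be ascending timestamps in seconds.
    """
    window = deque()
    for now in request_times:
        window.append(now)
        # Forget requests that are older than the sliding window.
        while now - window[0] > window_seconds:
            window.popleft()
        if len(window) / window_seconds > max_per_second:
            return True
    return False


def looks_automated(user_agent: str, request_times: Iterable[float]) -> bool:
    """Flag the client if either axis fires."""
    return identity_flag(user_agent) or behavior_flag(request_times)
```

A client clicking once every two seconds stays under the assumed ceiling, while one sending a hundred requests in a second trips the behavior axis even with an ordinary-looking user agent.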
And underneath the covers, are there two completely different engineering problems and engineering systems that are trying to...
One is an identification and a classification, and the other one is sort of watching longer-term behavior and trying to aggregate that into...
Okay, so anything that can click through an e-commerce website at a hundred clicks a second is probably not a human, even though I have seen professional video gamers, they're pretty impressive.
But that's a good signal that this is almost certainly an automated system on the other side.
Yeah, totally. And so we have engineering systems that translate exactly what I described, identity and behavior, into the implementation.
Namely, supervised machine learning versus unsupervised machine learning.
Supervised machine learning is essentially a model where we feed in the known attributes of a bot and let the model run at our edge on thousands and thousands of machines to detect those bots.
The unsupervised one is sort of a statistical inference model where we don't feed in specific attributes and identity, but we tell the model, hey, here is what normal human behavior would be, how frequently you would actually refresh a page, versus what you actually see from a bot.
A high-frequency refresh, a crawler, spoofing the page.
Those are the types of things the unsupervised machine learning does a very good job at.
So we have some other engines, but those are two of the main components of our bot management detection today.
And here's why that's so important. I mean, Andrew's talking about really, really sophisticated ways of detecting bots.
So machine learning, this requires a lot of different inputs. And if you look online, if you start researching bots, people will call them bad actors.
The truth is they're actually great at acting, right?
They blend in with human traffic all the time.
I love that. That's awesome. If they were bad actors, we wouldn't have a problem.
The problem is they're good at it. And the Oscar for the best bot goes to...
Oh, that's great. This is the problem, right? They're great actors, but they have really bad intentions.
And the machine learning model that's behind all of this is designed to sort of find that where we can.
It's not always an easy thing to do.
So what are some of the things we do then to figure out or get them to reveal their true nature?
So we do a number of different things. Andrew mentioned a couple of them.
Being able to perform machine learning is really important.
We also, just within the last year, have launched a brand new detection engine that we call JavaScript detections.
And the idea with this is we wanna go after a very specific group of bots.
We call them headless browsers. It's kind of a violent name.
People hear headless browsers and assume there's a guillotine involved or something, but that's not the case.
Headless browsers are just like the regular browsers we have like Google Chrome, but they don't have an address bar.
And for that reason, they're driven by the command line and they're really good at making lots of requests at once.
So instead of me just clicking on a website a few times per minute, like Andrew said, they can go hundreds, if not thousands of times.
So what we do with JavaScript detections is we inject just a really small amount of JavaScript onto the client side of a connection, but we do it in a way that honors our privacy standards.
So we're not gathering personally identifiable information.
We do it really responsibly, gather information about the GPU, about the browser, and then use that to spot headless browsers on the Internet.
So that's one of the things we're doing. One thing I should really drive home here is we are very focused on the privacy element.
I actually had one of the engineers on our team pull a little bit of information about this.
It's, let me see exactly.
For one particular hash that was gathered from JavaScript detections, normally you'd worry, oh no, this is gonna identify a particular person.
That hash was associated with 66,000 different requests across 1,400 different Cloudflare sites.
And that's over 12,000 unique IPs. So when we're gathering data to actually detect these bots- It's awfully blurry.
Yeah, it's awfully blurry.
There's no way to track it back to one person and we're not trying to do that either.
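One way to check the kind of "blurriness" Ben describes is to count how many distinct IPs and sites share a single fingerprint hash. The sketch below assumes a hypothetical log of (hash, IP, site) tuples; the record shape and function name are illustrative, not Cloudflare's.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (fingerprint_hash, client_ip, site): a hypothetical record shape.
Record = Tuple[str, str, str]


def fingerprint_spread(records: Iterable[Record]) -> Dict[str, Tuple[int, int, int]]:
    """Return {hash: (request_count, unique_ips, unique_sites)}."""
    requests = defaultdict(int)
    ips = defaultdict(set)
    sites = defaultdict(set)
    for fp, ip, site in records:
        requests[fp] += 1
        ips[fp].add(ip)
        sites[fp].add(site)
    return {fp: (requests[fp], len(ips[fp]), len(sites[fp])) for fp in requests}
```

A hash that maps to thousands of IPs across many sites, as in the example quoted above, describes a class of client rather than an individual.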
Yeah, blurry by design. That's so important. That's great. So let's talk a little bit more about some of the other analytics we give people, right?
So like now that we're, we become a website's front door when you sign up for Cloudflare.
And you know, one of the things that, when I, long ago, when I first signed up for Cloudflare, I had this sort of feeling like, wait, now how do I know what's going on on my website?
Like, I don't have the Apache log anymore. I'm not seeing the request to the server that I have root on.
Like, how am I supposed to know what is actually going on?
And, you know, we've talked in other episodes of Latest from Product and Engineering about other analytics, but just, you know, Ben, from the perspective of bot analytics, what does that mean?
Like, what is an analytic we would show our customer and what are they supposed to do with that information when we show it to them?
Yeah, so we generally, at a really high level, try to group bots into different categories, right?
And there's a couple of different ways we can do this, but the categories we actually break it down to in our own analytics tool are definitely automated.
These are the requests that definitely come from automated sources.
Then there's likely automated, requests that we tend to think come from bots.
We've got likely human and then verified bots. Verified bots are bots like Google and Bing on the Internet.
And so we can talk more about this. Andrew's helped us build up this really, really impressive bot analytics platform, but being able to break traffic down into these four groups lets us then show customers in the dashboard what their traffic looks like.
And we let them pair these different buckets with all sorts of different filters, IP addresses, user agents, so that customers who have had bot management working for a while now on the backend can finally see what we're actually doing behind the curtain.
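The four groups can be expressed as a simple mapping from a per-request score. The sketch assumes the 1 to 99 bot score described later in this conversation, where low means automated; the exact cut-offs used here (a score of 1, and anything below 30) are assumptions for illustration, since the speakers do not state them.

```python
from collections import Counter
from typing import Iterable, Tuple


def traffic_group(bot_score: int, verified_bot: bool = False) -> str:
    """Map one request to one of the four analytics groups."""
    if verified_bot:
        return "verified bot"
    if not 1 <= bot_score <= 99:
        raise ValueError("bot score must be between 1 and 99")
    if bot_score == 1:          # assumed cut-off
        return "definitely automated"
    if bot_score < 30:          # assumed cut-off
        return "likely automated"
    return "likely human"


def summarize(requests: Iterable[Tuple[int, bool]]) -> Counter:
    """Count requests per group; each item is (bot_score, verified_bot)."""
    return Counter(traffic_group(score, verified) for score, verified in requests)
```

The resulting counts are the raw material for the kind of breakdown a dashboard can then filter by IP address or user agent.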
Literally gives them almost like a cockpit where they can slice and dice all this information, pull up the pie charts and say, what section of this is from what side or the other.
So Andrew, I'm guessing that's a completely different part of the system, right?
So like the model was trained in one place, executed in another, and then the analytics are back in a central place where it can process this.
What are some of the challenges of setting up a system that's got like so many different pieces that have got to connect to each other?
How do we approach that problem?
Yeah, at the high level, I think you're exactly right. You mentioned the different major components in the Cloudflare network, and you have a core, you have an edge, and you also have the data plane.
So performance, scalability, and reliability are all critical considerations in how we engineer our systems.
So for example, performance. Our models, when they detect on HTTP requests, actually sit on the path of the HTTP request.
It's absolutely critical that we render a judgment on the request very, very fast.
Today, we do so in a microsecond, not even milliseconds, so it's really, really fast.
And I would even say, our model is not just one version of the model.
We have multiple of them, all of them making a decision on the path of the request, and we have to make sure that adding a new version of the model does not add a latency penalty.
So that's just one example of performance we're considering.
And another thing is scalability, right? So again, as I said, we have thousands and thousands of machines running all over the world.
And with that one model, if you make one mistake, it'd be amplified so many times.
And when we ship the model from core to edge, we have to make sure the model is agile enough and small enough to fit onto the machine that it's running on.
And again, there are multiple models running on there, right?
And also the data that we get, right? There's a lot of data in our network, and we have to feed that data into our data pipeline, retain it, analyze it, and then filter it down to the data the customers are actually interested in and feed it back to the dashboard.
And all of that requires a lot of engineering teams, including ours, the data team, and the firewall team, to build our pipelines and give the customer a seamless experience, where when something happens at the edge, you can see it on the dashboard almost instantaneously.
So that's a really amazing part of what we do.
That's really great. So when you talk about having to pay attention to things, you have both a time component and a space component, because it's got to be efficient, but also can't be too big, because this thing actually has to run on real computers with real RAM, that's not infinite, and still manage to process that request, and then talk to the logging pipeline to say, by the way, here's what I actually did.
You know, Ben, a question for you is like, one of the other teams Andrew mentioned was the firewall team.
How do bots fall into the, you know, Cloudflare has already been in the firewall business for a long time.
It was like one of the original things we did was we put up a firewall, we like to say in the oldest days we were patching the Internet, you know, and here's this really advanced thing, you know, machine learning, and headless browsers, and JavaScript challenges, and it's feeding into this classic product, the firewall. Like, how do these two products fit together, and how do we make it so we don't confuse our customers with a million options, and buttons, and switches?
Well, look, it's really interesting.
I mean, we've kind of figured it out as we've gone along here.
The firewall is a really, really powerful tool. This idea that you can set specific rules that govern the traffic that comes into your website, that can get very complex very quickly, and bot management is just a portion of that.
So for a lot of folks who have had firewall rules for a while now, you may have had block rules in place, right, for a particular country, for example, right?
Let's say you block all traffic that comes out of the US to your site, and you just have a rule that says, anything from the United States, block, really simple.
But what we've allowed other customers to start doing is to actually pair these rules with bot insights, or bot elements, and we generate a bot score.
So that bot score is one through 99, and the idea is that for every request, we'll generate a bot score.
Those lower scores indicate a higher degree of certainty that a request was made by a bot.
So a score of one means we are really confident that the request is coming from a bot, and a score of 99, as you can probably guess, means we're really confident the request comes from a human.
And as you're writing those firewall rules, you can start to pair those bot scores with other elements.
So I might write a rule now that says, for any request that comes both from the United States, and is scored below a bot score of 30, I'd like to block it, or I'd like to issue a CAPTCHA or some other type of challenge.
But that's kind of how it sort of blends into the firewall.
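The rule Ben describes, matching on country and bot score together, might look like this as a plain predicate. The `Request` type, its field names, and the action strings are hypothetical; this is not Cloudflare's rule syntax.

```python
from dataclasses import dataclass


@dataclass
class Request:
    country: str    # two-letter country code, e.g. "US"
    bot_score: int  # 1 (almost certainly a bot) to 99 (almost certainly human)


def firewall_action(req: Request,
                    country: str = "US",
                    score_threshold: int = 30,
                    action: str = "challenge") -> str:
    """Apply `action` only when BOTH conditions match; otherwise allow."""
    if req.country == country and req.bot_score < score_threshold:
        return action
    return "allow"
```

Setting `action="block"` gives the stricter variant; a human-looking request from the same country, or a low-scoring request from elsewhere, passes through untouched.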
Although you're right, bot management is increasingly becoming a bigger area.
It's something we're investing more in, and who knows, it may eventually break out of the pipeline.
But it's great, I mean, yeah, go ahead, Jen.
No, I was gonna say, you know, one of the things that we hear a lot from customers is just the concern they have over, quote unquote, false positives, right?
When bot management kind of is too good, or, you know, and kind of gets it wrong.
And, you know, we talked a little bit here in this conversation about sort of the things and the signals we use to create the score, but what are the things that we're doing to help our customers identify potential false positives?
And how can, you know, how do we help them sort of mitigate the risk of false positives?
Yeah, it's probably the biggest challenge in the area of bots. And we've talked about one of the solutions already, which is bot analytics.
If you go into bot analytics in the Cloudflare dashboard, we have this one really cool graph.
And what we do is left to right, we show you the amount of requests for each individual bot score.
So what this looks like is you get 99 individual bars that will show you how many requests you've had that were scored one, that were scored two, three, et cetera.
And if you look at this graph, just left to right, it's hard without seeing it in front of you, you may see outliers, right?
You may see that for whatever reason, the score of 37 is triggering all the time, right?
And you've got like 30,000 requests that are getting scored 37.
If you look at this graph, you can start to find those particular areas and then tune your firewall rules around them.
It may be that your website just gets a particular type of traffic and being able to spot where those nuances or those niches in your traffic are, will then help you set a firewall rule that says, I'm gonna affect all traffic that is maybe one through 36 and 38 through 45, right?
You can jump over those little problems and eventually you'll get to a point where you can kind of set it and forget it.
Just the nature of your traffic will work around the false positives. That's cool, that's cool.
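A rough sketch of that workflow: build the per-score histogram, look for a bar that towers over the rest, then write a rule that steps around it. The outlier test (more than ten times the median non-empty bar) and all names are assumptions chosen for illustration.

```python
from collections import Counter
from statistics import median
from typing import Collection, Iterable, List


def score_histogram(scores: Iterable[int]) -> Counter:
    """Count requests per bot score, i.e. one bar per score."""
    return Counter(scores)


def outlier_scores(hist: Counter, factor: float = 10.0) -> List[int]:
    """Scores whose bar is more than `factor` times the median non-empty bar."""
    non_empty = [count for count in hist.values() if count > 0]
    if not non_empty:
        return []
    baseline = median(non_empty)
    return sorted(s for s, count in hist.items() if count > factor * baseline)


def should_act(score: int, upper: int = 45, skip: Collection[int] = ()) -> bool:
    """Act on scores 1..upper, except those explicitly skipped."""
    return 1 <= score <= upper and score not in skip
```

With `skip={37}` and `upper=45`, the rule covers 1 through 36 and 38 through 45, matching the example in the conversation.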
One of the things that I also really like about the work that you guys have done is, we just talked a lot about a lot of complexity and a lot of configuration and a lot of like writing rules and thinking about scores and optimizing stuff like that.
But one of the things I love is one of the first things you guys all did when you showed up is you created the easy button, right?
I mean, bot fight mode is basically, it could not be any easier and it's available to anybody.
What is bot fight mode and how does that fit into this big picture of bot mitigation that we're talking about here?
Yeah, I'll give you the product side and Andrew can tell you about, I guess what was involved building it.
Bot fight mode is a much simpler version of what I just talked about.
So this idea of actually setting firewall rules works really well for larger companies that want a more sophisticated approach to handling bots.
But there's a lot of us out there who have smaller websites, right?
I don't use firewall rules for my website. I just want something I can click and it works.
And so bot fight mode is the simple version of bot management.
It's a toggle. And as soon as you flip that toggle, we start detecting bots on your website.
And the thing that's really cool is no one else does this.
It's totally free. And we basically let you use all of the detection engines that are powering so many other parts of Cloudflare, which is really important.
And we do one other thing here, which is instead of just blocking or challenging that traffic, we issue something called a computationally expensive challenge.
Now this confuses some people, but it's actually, it's a really cool concept.
So when a request comes into Cloudflare, instead of just blocking it, we will actually respond with some sort of really difficult thing for the server to- The Rubik's Cube for you to solve 100 times over.
No, actually it's probably Wendy's jokes.
They're trying to get, see if the bot can solve Wendy's jokes.
We'll give them a massive file, right? And let the server just go to work on that.
And it costs money, right? This ends up costing the server operator a lot of money, which then disincentivizes bot operators in the first place.
So the idea of Bot Fight Mode is we are fighting those bots whenever we detect them and you can turn it on.
If you've got a Cloudflare account, you just go to the dashboard and flip that toggle on.
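To make the economics concrete, here is a generic hash-based proof-of-work puzzle. It is a stand-in, not the challenge Bot Fight Mode actually serves (the conversation does not specify it): the point is only that the client must burn CPU to find an answer while the server can check it with a single hash. The function names and difficulty value are hypothetical.

```python
import hashlib
from itertools import count


def _digest_value(challenge: bytes, nonce: int) -> int:
    """SHA-256 of challenge + nonce, interpreted as a 256-bit integer."""
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return int.from_bytes(digest, "big")


def solve(challenge: bytes, difficulty_bits: int) -> int:
    """Client side: brute-force a nonce whose hash has `difficulty_bits` leading zero bits."""
    target = 1 << (256 - difficulty_bits)
    for nonce in count():
        if _digest_value(challenge, nonce) < target:
            return nonce


def verify(challenge: bytes, nonce: int, difficulty_bits: int) -> bool:
    """Server side: one hash is enough to check the client's answer."""
    return _digest_value(challenge, nonce) < (1 << (256 - difficulty_bits))
```

Each extra difficulty bit roughly doubles the expected work for the solver and costs the verifier nothing, which is the asymmetry that makes automated traffic more expensive to run.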
Andrew, you had a big smile as Ben was talking. What are some of the technical things that we had to do there?
And how did we approach a challenge that was effectively just giving the bot more work to slow it down and waste its time so that it leaves legitimate websites alone?
Yeah, it was an excellent product idea.
I remember when I first joined Cloudflare, I was asked to implement this. And my first reaction was, how will I ever be able to do this?
But this is one of the things that gets me really excited coming to work every day: there are so many smart people working at Cloudflare who have these genius ideas.
And at the time you think, oh, this is the kind of thing we can never do.
But once we started doing it, you see other competitors sort of following our trend and industry analysts saying, hey, you guys are actually the game changer here.
And Bot Fight Mode is a prime example of that, because really we are changing the economic incentives of a bot operator.
And as Ben said, the intent of Bot Fight Mode is, we're actually making you spend more money.
It's more costly to be a bot than otherwise.
So it's very different from the JavaScript challenges, the JavaScript detections that we've had traditionally for a long time, where we're in a passive detection mode.
In Bot Fight Mode, we're sort of in an offensive mode where we force your browser to do a lot more CPU- and memory-intensive work, to the point, especially in the cases where we're a hundred percent sure you're a bot, that your browser will just actually stop working and crash.
Obviously we do so very judiciously, very carefully.
So we don't have false positives; to Jen's point, that would be really bad.
So a lot of engineering work goes into the design to make sure that the if-else condition is designed to the point where only in the very, very limited cases where we're a hundred percent sure are they being punished.
So there's a lot of engineering ingenuity going into the design, allowing us to identify the bots and allowing us to pinpoint the type of JavaScript challenges that they're able to do.
So for example, there are browsers that can be run by bots, but there are some other, you know, JavaScript-incompatible user agents that are not able to run Bot Fight Mode.
And we have to think of a way to inject Bot Fight Mode so that they can still be challenged.
So it's not an easy job.
It's not easy work, but again, very excited. We were able to launch the feature recently, actually about a year ago.
And we're starting to add more computationally heavy challenges to the platform.
It's really cool. It's really cool. What do we do with all the learning from that?
Like we must see and learn so much. Like how do we, how do we, how do we use, how do we like use the signals here?
Yeah. Yeah.
I have five minutes left. I don't want to take the rest of the time, but I, for 30 minutes.
In three minutes or less, explain how all the machine learning works, Andrew.
So, you know, we have 23 million requests per second and that translates to 1.5 to 2 trillion requests a day.
And that's the unique advantage Cloudflare has.
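As a quick sanity check on those figures, a sustained 23 million requests per second lands at the top of the quoted daily range:

```python
SECONDS_PER_DAY = 24 * 60 * 60       # 86,400
requests_per_second = 23_000_000     # figure quoted in the conversation

requests_per_day = requests_per_second * SECONDS_PER_DAY
print(f"{requests_per_day:,}")       # 1,987,200,000,000, i.e. just under 2 trillion
```

The lower end of the range would correspond to an average rate somewhat below that peak figure.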
And combined with the signals from false positive report from customers, those are offline signals, but more importantly, the real-time signals that we have.
So when I arrived at Cloudflare, the engineering teams were building a platform called Gagarin, and you can find out more on our public blog.
And that platform is so powerful.
What it does is, not only are we training a static model, but what we also do in real time, every few minutes, is gather all the signals we see on the live traffic from the thousands of machines we're running on the edge and then recalculate and recalibrate our model.
And our score is infinitely more accurate. All right. So just give you some comparison.
When we first started, you know, we started adding rules, manual rules, to block the sort of bots that we call false negatives.
The customer would complain that we weren't detecting them.
Today, I rarely see those requests at all because our machine learning model does a good job of using the signals precisely and actively blocking.
That's great. Last question I have for you, Ben. One of the things that I always have to remind myself is that CAPTCHA solve rate, which sounds like a good thing, like solve it.
Like the higher the CAPTCHA solve rate, the more your eyebrow goes up.
Like, wait a minute, why? So like, what is kind of counterintuitive about that CAPTCHA solve rate?
Why do we show it to customers and what are they supposed to do with that information?
Yeah. So here's the thing. We've all solved CAPTCHAs and I know we've talked about them a little bit, but these are the puzzles that show up on the Internet.
You got to click them. And if you solve one of those, you've effectively proven that you're not a bot.
Now at Cloudflare, most of the time we only show those CAPTCHAs when we are unsure of a particular request.
We don't know if you're a bot. We don't know if you're a human.
This rarely happens, but we show the CAPTCHA to be able to discern, right? To be able to put people into different buckets and say, okay, this is really a human.
And the reason for that, the reason for the CSR, the challenge solve rate, is we want to be able to measure how well we are actually doing with our detections.
So we want to see that the solve rate of these CAPTCHAs is very, very low.
If it's close to 0%, it means most of the time we're showing these CAPTCHAs to bots and we're doing our job.
We're not disrupting the human experience on the Internet, right? And one other thing I should call out there is it started as CSR, CAPTCHA solve rate.
It's now becoming challenge solve rate. We were fortunate that the word challenges also starts with a C.
And we're starting to roll out these other types of challenges that are less disruptive.
So instead of just rolling out something like a CAPTCHA, we can do something like a JS challenge.
You'll see those three little orange bubbles when you go to a website.
Instead of making you click on images, we're able to do all of the same work, sometimes even doing a better job of detecting bots, without requiring any clicking at all.
Kind of leveraging some of those same types of challenges we're using in bot fight mode behind the scenes, I would imagine then to not have to surface the CAPTCHA.
Yeah, except we're not unloading a hundred Rubik's cubes.
But other than that, it's the same challenge concept.
You're right. It's the same type of thing. Yeah. That's great. And I love that that means like if we challenge somebody and they successfully, they met the challenge, like, whoops, we shouldn't have challenged them.
Like that signal back to Andrew and his team, like, hey, make this thing smarter.
We shouldn't have had to challenge those people.
So I thought it was interesting that the higher the challenge solve rate, the more work we need to do to get it back down low.
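The metric itself is just a ratio of challenges solved to challenges issued. A minimal sketch, with a hypothetical 5% review threshold that is not a Cloudflare figure:

```python
def challenge_solve_rate(issued: int, solved: int) -> float:
    """Fraction of issued challenges that were solved (0.0 when none were issued)."""
    if issued < 0 or solved < 0 or solved > issued:
        raise ValueError("need 0 <= solved <= issued")
    if issued == 0:
        return 0.0
    return solved / issued


def needs_review(issued: int, solved: int, threshold: float = 0.05) -> bool:
    """A high solve rate suggests humans are being challenged, so detection needs tuning."""
    return challenge_solve_rate(issued, solved) > threshold
```

A rate near zero means challenges are landing almost entirely on bots; a rising rate is the signal to revisit the detections rather than a sign of success.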
Guys, thank you so much for joining. Once again, we've hit the 30 minute mark so fast.
And I can't wait to have you back here to tell us more about all the new stuff that you'll be working on.
So thanks very much for joining us and we'll see you at the end of next week.
Thanks everyone. Thanks, everyone.
Bye. Bye.