Latest from Product and Engineering

Presented by: Jen Taylor, Usman Muzaffar, Sergi Isasi

Originally aired on December 13, 2021 @ 11:30 PM - 12:00 AM EST

Join Cloudflare's Head of Product, Jen Taylor and Head of Engineering, Usman Muzaffar, for a quick recap of everything that shipped in the last week. Covers both new features and enhancements on Cloudflare products and the technology under the hood.

Original Airdate: October 2, 2020

Transcript (Beta)

All right. Hello everyone and welcome to the latest from product and engineering at Cloudflare. My name is Usman Muzaffar. I'm Cloudflare's head of engineering. Jen Taylor will hopefully join us halfway through the show, but that doesn't matter because I'm very thrilled to introduce our guest today, Sergi Isasi. Sergi, say hello and what are you responsible for? And Jen just showed up right on cue. How about that? Hot off the fireside chat. If you were just watching me, I was just chatting with John Collison, the founder and president of Stripe and literally we ended about one second before I came on. Transitioning. I had to run all the way from Studio A. I'm exhausted. Have we had a back-to-back presenter yet in the time of Cloudflare TV? We have now. Yeah, man, I'm just like, I'm like, I'm out of breath. Anyway, sorry, Sergi, you were introducing yourself. Ah, yes. So my name is Sergi Isasi. I am one of Jen's directors of product. I run a few of the product teams here at Cloudflare. I don't really have a theme that I can use for my teams. It's not quite that clean. So I do a lot of the infrastructure things, DNS, load balancing, IP address management, and our TLS, everything that has to do with a certificate. And then I also work on our bots product. So on the security side, and then I work with the research team. So we provide product management support for the formerly known as crypto team, but the kind of team that's looking forward on what's next on the Internet. That's amazing. Yeah. And it's a great portfolio of projects. And so what I thought we might do, Sergi, is just talk about, just touch on some of them, because we've got some amazing developments on each of those axes. And one of the newer ones that is very exciting for all of us at Cloudflare is bots. So what do we mean? Why do we have a bots team? And what is that doing? What are we doing? And what are some of our projects and activities there? Sure. So why do we have a bots team? 
We have a bots team because our customers have identified a problem and have been using a number of our products to try to help solve it. And usually the primary product for our enterprise customers is the WAF. So our customers have automated traffic, some that they do want on their site. So things like Google crawler, things that are helpful, their own monitoring systems, as an example. But a lot of automated traffic they don't want. So as an example, our own contact Cloudflare page gets the vast majority of the traffic there is automated. And it's trying to kind of stuff the form and take down the access. And so we noticed that our customers were writing pretty complex firewall rules a couple years ago, you know, lots of user agents and ASNs, and it's just really, really difficult firewall rules. And we look to see, well, could we do better? And so a little less than two years ago, we launched an enterprise bot management product that distills all of the things that our customers were trying to do. So all of those insights that they were seeing on their own traffic, we take what we see on the 25 million sites on Cloudflare and distill that down just into a score. So our customers can say, you know what, Cloudflare says that anything below a score of 30 is extremely likely to be automated. I'm going to go ahead and not allow that onto my site in some way. What's interesting is, is sometimes that's a block. And usually our customers, when they go to lower scores, they tend to block because we're, it's a score of one, for example, we're positive that's automated. But there's areas in between where they might not want to block. And either because there might be a chance of a false positive, or they just want to make sure that the rule that they wrote is catching automated traffic versus humans. And so we've had challenges, various types of challenges in the history of Cloudflare, things from JavaScript to CAPTCHAs. 
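As a rough illustration of the score-based rules described above, here is a minimal Python sketch. The function name, the two cutoff parameters, and the action names are assumptions made for illustration and are not Cloudflare's rule syntax. The only details taken from the conversation are that low scores mean likely automated traffic, that 30 is a common cutoff, and that a score of 1 is treated as certainly automated.

```python
def choose_action(bot_score: int, block_below: int = 2, challenge_below: int = 30) -> str:
    """Map a bot score (low = likely automated) to a hypothetical action.

    block_below and challenge_below are illustrative defaults: with them,
    a score of 1 is blocked and scores from 2 to 29 are challenged.
    """
    if bot_score < block_below:
        return "block"      # near-certain automation
    if bot_score < challenge_below:
        return "challenge"  # unsure: ask the visitor to prove they are human
    return "allow"


if __name__ == "__main__":
    for score in (1, 15, 30, 99):
        print(score, choose_action(score))
```

A customer with a lower tolerance for false positives could pass a smaller `challenge_below`, which is the "moving the line" idea from the discussion.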
And so earlier this year, we rolled a few engineers who were working on challenges into the bots team kind of combined a bots and challenge platform team. And so now we're looking at it from both detection, do we know you're automated? And then mitigation, what tools do we give our customers to do on that type of traffic? That's great. And so like the CAPTCHA tool we're all familiar with, and I can never remember what this stands for, completely automated, tested, distinct capture, human and automaton, something like that. More letters than CAPTCHA. We're skipping some there. The way I remember it is like, has the Internet ever asked you to identify traffic lights or buses or motorcycles? That's CAPTCHA, right? But talk a little bit about that. Because there's all sorts of people watching Cloudflare TV. Why are we asking people to identify those things? Why does that help protect the website? There's different reasons to do CAPTCHAs. Our customers will choose to write rules for various reasons. Sometimes it's just a kind of a form of rate limiting, right? So you make someone do something before they gain access to your site. It's not a great form, but that is an action. But really, if we go again back down to the bot management customers, what they're doing is, I am not sure whether you are human or not. So do something to prove that you are. And that's something that is really annoying to humans. For the most part, no one really goes, oh, yay, I get to do a CAPTCHA. But it's kind of that edge of a really hard problem, which is that right when you're unsure, how can you tell? And so earlier this year, we talked about when we changed our CAPTCHA providers. But the really interesting part of when we did that is we took a big step back and said, okay, what if we just wanted to serve no CAPTCHAs? Which is kind of a crazy thing to say. Can I put a penny in that box? Can I vote for that one, please? 
By the way, to the audience, I did not know, sir, you were going to beautifully tie these together. But that is exactly where I was hoping you would go. So, yeah, it was kind of a crazy question. But there's a few guys on the engineering team in London who said, okay, I like the question. But let me just, again, take a really big step back. What do we need to build to get to that goal? And I kind of alluded to, we had multiple challenge options for customers. And prior to 2020, those were multiple challenge systems. So we had a JavaScript challenge system, we had a CAPTCHA page, we had I'm Under Attack, which kind of fused those things together. And it made it really hard to gain intelligence from the combined stack. And the systems were really rigid. They were built to solve a specific problem at the time. We really wanted room to experiment, to say, what if we did this thing to this subset of traffic? And what would that look like? Would that, for example, would everyone who did solve a CAPTCHA, would they have solved it if they matched certain criteria? And so we spent the first half of this year combining all of our challenge systems into one platform and giving our engineers and data team the ability to experiment and to gather information. To do things like, what if we want to use different challenges for some connections versus another one? What if we want to start inserting different challenges based off of the way that someone's actually solving them? What about, could we do things that would mess with solvers and bots, things that exist on the Internet to try to get around our challenge systems? And so we built that platform. We built a lot of extensibility and metrics behind it. And then in the last quarter, so the one we just finished, we rolled it out globally for all of our pay-as-you-go customers. You don't really notice because it looks the same. Everything that you were experiencing was largely the same. 
But now we have the ability to really take the remainder of the year and experiment and learn. We don't use our ML system in our challenges at all today. What if we did? And what will that look like? And really what we want to do is, if you're likely to solve a CAPTCHA, we should never show it to you. And so I've been at Cloudflare for three years now, so it's been a long time. And I think this is the thing that I'm most excited about, because imagine if we can make CAPTCHAs go away for real users. Exactly. And then we can actually see a path there, right? Yeah. Very exciting. And so this quarter, we're going to really go full headlong into experimentation. We have a whole list of hypotheses, and we'll see what those look like and what comes true, and then roll that into production. Sergi, what does experimentation at Cloudflare look like? How do we do it? And when you say the team's got a bunch of experiments they're going to run, they're wearing white lab coats, and how are they tackling that? So we have passive experimentation and active. And passive is really easy. Just look at the patterns and the data, right? So we'll use the bot score as an example. We think that if you have a score of 99, what is your challenge solve rate? We can look at that, because we know the score of every single challenge that we issued, and whether it was solved, and we can very easily see whether a score of 99 is likely to result in a solve. What we want to do is active experimentation. So if you had a score of one, what if we gave you the hardest challenges? Would you even attempt them, or would you just go away? If that's the case, should we just block that type of traffic? That is actually impacting users and traffic. So when we do things like that, we do them very slowly. So we roll them into our canary colos, so colos that look like the rest of the world. So they have a mix of traffic that looks very, very similar, but are relatively small. 
And then we can start sending small amounts of traffic to the new system or with the experiment versus not. So 1%, then up to 5%, then up to 10% as we see. Monitoring the whole way, like monitoring the metrics every step of the way, so we know if we're... Yeah, with the big goal of not impacting users. Right, right. As soon as we start impacting users, turn it down. Yeah, yeah. Yeah, and so it's such an exciting thing. So what does this look like from... Before we tap off the bot subject, what does this look like for some of our customers now who came to us with some of those... Who were faced with writing some of those really complex firewall rules, trying to figure out, can I match against all these crazy user agents? Can I match against ASNs? What does the world look like for them now that it's been coming on two years for our bot setup? So it's really quite good. So we have a really high attach rate from WAF to bots. Almost every bots customer uses the WAF. Some use Workers to get a little more advanced in their mitigation, but the rules are really simple. They're just score lower than some threshold. Usually it's 30, but some customers have a little higher tolerance for false positives and some have lower, so they have that option of where they want to move that line. And it's usually one or two rules for an entire domain. You might have a different rule for, say, a mobile endpoint versus your standard login endpoint. But for the most part, they're no longer having to go find the user agent that's attacking them today, block that, and then figure out whatever was changed tomorrow because the system kind of just does that for them and distills it down to a score. It's great. Really exciting stuff. So Sergi, when are you going to be done with bots? Probably never. Well, I do think if we look at something like DDoS, and we used to see a lot more DDoS than we do. 
And the reason for that is a lot of times now if you're an attacker and you see Cloudflare in front of something, you just know you're not getting through it. I think we can get there with bots. I think as we make this product even better, I wanted to talk a little bit about some specific bot detection enhancements that we've done. Eventually, that just goes away. And that is another goal of mine. I think when I was asking that question, I was thinking about the enhancement you guys have made to those detection engines, because a big part of what I think we're doing with this experimentation, this learning, and this innovation is putting ourselves on a trajectory where we keep getting smarter, keep getting stronger, and keep making it more and more difficult. Can you talk a little bit about some of the innovation, the iterations that have been going into some of those models? Yeah. What's interesting is we don't really talk about iterations for machine learning. It's not a new release. It's not a feature. It just works. And I think we should probably try harder to talk about that. We're actually on our third version of our supervised machine learning models. We have two. We have an unsupervised model that I think we're on our fourth version of that, and a third version of our supervised model. The third iteration actually came very quickly. We had a V2 engine that we released in April, and that was a massive change from our V1 engine. Really, really big step up. And what we wanted to do when we came up with the V2 engine is to take what we learned of how people would attack a machine learning detection engine and get around the edges of it. And the V2 engine did that. So it actually tracks connections over time and looks for changes in patterns. So as a bot gets blocked, it will change things to try to get around the block. And now we can actually track that change over time. And so that's been great with V2. 
But very quickly, we saw even more that we could do from the V2 to V3 model. We found a few more edges that we could iterate on quickly. And so last quarter, we released a V3 iteration. And that one actually smooths out some of the false positive edges from the V2 model when there are changes. And so it's not a huge step change like V2 was, but it's still really significant. It has an extremely high accuracy ratio. And now we're starting on a V4. Yeah. One other topic I wanted to make sure you had a chance to just tell us a little bit about is, how are we reporting all this to the customer? Where does this show up for them? Does it show up in their logs? The beautiful thing is we've made it down to a simple dial. You can turn it all the way down. It'll be really conservative. You will only block things that are definitely bots. Or you can go to the other end and you might get some false positives. But regardless of how you set that, how does the customer know how the world looks? How do they get to see how the bots product is performing against their traffic? Sure. So there's a bit of history there. We first really wanted to show customers how effective the system was in their firewall rules. So in each firewall rule, when you issue a challenge, we show you the number of detections that rule had, and then the solve rate. So you can say, I have a 0.2% solve rate. This rule is doing great. Or I have a 1% solve rate maybe, and maybe I want to tune it down a little bit to be less aggressive. Really up to the customer on, again, their tolerance for those solves. We then had a lot of our customers say, okay, great, but I want the score. And I want the why you think that, and I want to incorporate that into my logs. So they have their own dashboarding systems and they want to see what we think of 100% of their traffic. So not just the things that we challenged, but everything. And so we built that and gave that to them. 
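The solve rate mentioned above can be read as solved challenges divided by issued challenges. The short Python sketch below shows that arithmetic. The function name and the example counts are assumptions chosen so the result matches the 0.2% and 1% figures quoted in the conversation; it does not describe how Cloudflare computes the number internally.

```python
def solve_rate_percent(issued: int, solved: int) -> float:
    """Return the share of issued challenges that were solved, as a percentage."""
    if issued == 0:
        return 0.0  # no challenges issued, so there is nothing to measure
    return 100.0 * solved / issued


if __name__ == "__main__":
    # Hypothetical counts for two firewall rules.
    print(solve_rate_percent(issued=1000, solved=2))   # the "doing great" case
    print(solve_rate_percent(issued=1000, solved=10))  # may be worth tuning down
```

A low solve rate suggests the rule is mostly catching automated traffic, since few of the challenged visitors turned out to be humans who completed the challenge.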
But very recently, our operators, the folks that are in here every day, they say, okay, I see the solve rate, that's helpful. But what I now want to do is get into my firewall analytics and see the score there and be able to filter and look and see exactly what Cloudflare is doing. So then when we have, and I think some of this also comes from our customer facing teams, when we block an attack for them, they can say, really filter down in the analytics and then export that out into PDF and show it to the non-technical user that says, look at what we've done here. We did. Yeah. And we're going to soon expand even further on those analytics. So I guess watch this space. We're pretty excited for what we're doing there. Yeah. The hardest part about great and effective security products is if they're doing their job well, you don't even know that they're doing their job. Yeah. So there's always this healthy tension of sort of like, hey, by the way, lights are still on, lights are still on. You have no idea how many meteors we stopped from crashing into earth. Exactly. Exactly. Exactly. That's the, you know, it's sort of like taking the atmosphere for granted. Let's talk about like, so at the top of the call, you mentioned like you helped lead the direction of four or five different teams at Cloudflare. Another team that's way older than the bots team has been around and is one of our key teams is the SSL team, the one that's actually designed to provision certificates at the edge. And, you know, one of the big ones we've done is SSL for SaaS v2. And before we talk about SSL for SaaS v2, like what is SSL for SaaS v1? Why does this product need this? Or one step upstream. What is SSL and why do we have a team? So let's, you know, bring us up to speed here. So how do we think through all this? What are the problems Cloudflare is trying to solve over these years? Sure. In 10 minutes. So SSL is foundational. 
And the reason for that is that at a Birthday Week, I can't remember which year, Cloudflare decided to make SSL foundational and deliver free certificates to every site on Cloudflare, which is amazing. I was a free customer at the time and I could not believe it. I remember thinking, this is unbelievable. What a birthday gift. You want to help build a better Internet? It better be encrypted. And what does that mean from a product delivery perspective? It means that you have to go and get a certificate for every single domain on Cloudflare. So let's just, that's really hard. Which used to cost 300 bucks a certificate. So how is Cloudflare doing this for free was my first question. You have to go just get the certificate. And there's things that you need to do to issue a certificate through a proper CA, like prove you control the domain. So we have to build that bit, which is just, I have this domain. I want to go to a certificate authority. I want to get the certificate and I prove control over it. So that's step one. That would be hard if you were installing it on one server. Now we have to take that certificate and ship it out to over 200 locations. You probably better know than me on the number of actual servers that are sitting out there. Almost 10,000 today. It was less than that then, but it is literally tens of thousands. It's awesome. A copy of that certificate everywhere. And then we have to make sure that we know when that certificate expires. Because certificates don't live in perpetuity, they have an expiration. And when that expires, we have to go get another one and then ship that thing out. So that's what the SSL team does, which is an extremely difficult technical problem that has a group of extremely talented engineers that just don't get enough credit for how foundational they are to the Internet. So incredibly important. And that's available for any customer. You get a certificate. 
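The renewal step described above, knowing when a certificate expires and fetching a new one in time, can be sketched in a few lines of Python. This is only an illustration: the function name and the 30-day renewal window are assumptions, and the conversation does not say how far ahead of expiry Cloudflare renews.

```python
from datetime import datetime, timedelta, timezone


def needs_renewal(not_after: datetime, now: datetime,
                  renew_before: timedelta = timedelta(days=30)) -> bool:
    """Return True once 'now' falls inside the renewal window before expiry.

    renew_before is a hypothetical safety margin, not a documented value.
    """
    return now >= not_after - renew_before


if __name__ == "__main__":
    expiry = datetime(2021, 1, 1, tzinfo=timezone.utc)
    print(needs_renewal(expiry, datetime(2020, 10, 2, tzinfo=timezone.utc)))
    print(needs_renewal(expiry, datetime(2020, 12, 15, tzinfo=timezone.utc)))
```

In a real system this check would run for every certificate on a schedule, and a positive result would trigger the same issue, validate, and distribute steps as the first issuance.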
We built in 2017 something called SSL for SaaS, which solves a similar but different problem, which is if you are a SaaS provider, and basically everybody's a SaaS provider, you need to go get certificates for your customers. So you have that problem, but it's once removed. And that makes it even harder. And then you want to issue that certificate for probably the root of the domain. See, the customer doesn't want cloudflare.zendesk.com, we want support.cloudflare.com. I want support.cloudflare.com. So then where's Zendesk in that name? But it's Zendesk's product there. So how does this work? We got three players, right? We've got the company, the one that's providing the service, and then Cloudflare in the middle of it, which is actually terminating the connection on the edge for Zendesk. Right. And if you think about it now, if you are a SaaS provider, you would potentially think about building the exact same system or a very similar system to the one we built for getting those certificates, issuing them and renewing them. So we made an extension of our product called SSL for SaaS that did all that for those customers. And it was a really, really successful product. You can kind of listen to that type of name. It's a very unique product. And over the last few years, we got a lot of really large customers, very successful deployments, and they all had very, very good product feedback, good feature requests. But because this product is so unique and special, we couldn't do some of those feature requests on the platform that we built, the decisions that you made. It's a classic engineering problem, right? So I made all these architectural decisions. I didn't think I wanted to do this other thing, and now I can't. So we made a v2, which is a bit different than Cloudflare. Before I talk about v2, I do have to call out our internal code name for it, which is 2SSL2SaaS, which is a nod to one of the greatest movie series of all time. 
I promised the team I would do that. We had at some point a requirement that that's what you had to refer to it as in the internal team chat. But we released a new one, and it allows us to be more flexible in how we manage IP addresses, both our own and the ones our customers bring to us, which is something that we didn't think about back in 2017, which is what if customers want to use their IPs? And it allows us to do things like Apex proxying, which is put an IP address or give a customer an IP address to put as their A record on a different DNS provider. So we have all these new features on this brand new platform that has all of the robustness of the previous and actually more robust, but now allows us to be a lot more flexible and a lot more efficient with our own resources. So we launched that last quarter, we actually have most of our really big customers transitioned on to the new service as some of their SSL for SaaS zones, which is really exciting. And all new customers today so if you sign up for SSL for SaaS today you get the v2 product. Music to the ears of the engineering team. Yeah, goal is to eventually retire the v1 product and we think we'll do that early next year. That's fantastic. Congratulations. Great, five minutes left, a couple words on load balancing. Sure. So load balancing is, was my first product at Cloudflare. And so it's, it's got a kind of a special place for me right. And the product manager on load balancing now. He has, he and the team came up with, you know, they've identified a problem and realize that their solution, or things that they've already built can be remolded to solve this problem. It's a waiting room. And we're in early access for this so it's a brand new product. And it allows a Cloudflare customer to quickly and easily build a waiting room queue. And this is for, you know, if you have some sort of event that is not like the rest of your year. 
So I used to use concerts as the example here but that kind of isn't a great one in 2020. Toilet paper. Toilet paper. I was going to go with video games, like you're selling a brand new video game console, or in the bots world if you're selling a new sneaker, right, and you only have an amount. And you want to make sure it's fair and everyone kind of gets access to it, and that amount is, you know, different than what you can already handle. We can put these users in a queue, and then let them out as the customer decides, so whether it's kind of randomly or in an orderly fashion. We can let them out of the queue, and so that's in early access. We're really excited to get our early or first adopter customers on board and kind of iterate into a GA version that will launch later on. And the control that customers will have, like they'll be able to control how that page looks and what the thresholds are around it. What happens when you get past it and like, you know, queue depth and some of those, you know, those kinds of things, because the alternative is you overwhelm the origin, right, which is nobody wins. So that is everybody has a problem. Yeah. So, yes. So I'm at the risk of asking a question I've never asked you before in front of a large global audience. Can I run bot management on my waiting room? Like, can I make sure that I'm not queuing bots? That's, yes. That's the idea. Yeah. So there you go. It's like it all kind of comes together. Right. Yeah. So you will reject all the bots or make them do a challenge, hopefully no, no. And then you get into the waiting room, and then you go through to the origin. Yeah, or you could route them all to like Rick roll. Right. Yeah. Queue them up, queue them up for Rick roll. A fun Cloudflare session will be things that our customers have done to bots. That would be a fun one. 
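The waiting room idea discussed above, holding visitors in a queue when the origin is at capacity and releasing them as room frees up, can be sketched as a small in-memory Python class. The class name, method names, and the fixed capacity are assumptions for illustration only. The sketch releases visitors in arrival order, which is just one of the two options mentioned (orderly versus random), and it says nothing about how the actual product is built.

```python
from collections import deque
from typing import Deque, List, Set


class WaitingRoom:
    """Toy first-in, first-out waiting room in front of a capacity-limited origin."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity            # hypothetical max concurrent visitors
        self.active: Set[str] = set()       # visitors currently on the origin
        self.queue: Deque[str] = deque()    # visitors waiting, oldest first

    def arrive(self, visitor: str) -> bool:
        """Admit the visitor if there is room; otherwise queue them."""
        if len(self.active) < self.capacity:
            self.active.add(visitor)
            return True
        self.queue.append(visitor)
        return False

    def leave(self, visitor: str) -> List[str]:
        """Free a slot and return the visitors released from the queue."""
        self.active.discard(visitor)
        released: List[str] = []
        while self.queue and len(self.active) < self.capacity:
            nxt = self.queue.popleft()
            self.active.add(nxt)
            released.append(nxt)
        return released


if __name__ == "__main__":
    room = WaitingRoom(capacity=2)
    print([room.arrive(v) for v in ("a", "b", "c")])
    print(room.leave("a"))
```

Screening bots before queueing, as suggested in the exchange above, would amount to checking the visitor before calling `arrive`.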
Well, and the thing I love about the waiting room is it represents so much of what I love about the way we do innovation at Cloudflare is that like great ideas come from everywhere. And so many of the best ideas we have are people taking pieces of things we already have on the shelf and combining them in new ways to provide new solutions to new problems that our customers already have. The fact that the load balancing team was able to pick a lot of the Legos they already have and reconfigure the Millennium Falcon into, you know, a garage, a gas station. It's kind of cool. Yeah. And that team, I mean, they've delivered quite a lot over the last three years. Really have. Longer, really. That wasn't the only one, like this idea that the ingredients are standalone dishes. You know, the other thing that is fundamental to a load balancer is this idea of something that can check your origin. And, you know, that turns into health checks and which has just had some great innovation done against it. So the last thing I do want to talk about is this is the first time that a feature I shipped from one of my teams that I actually didn't know was coming. And I was a little sad about it when I first saw it, when I saw the email. But then I got really, that's actually great. But the feature is so cool. So you use the health check and you want to make a change, but you don't know. You want to be sure that change doesn't suddenly mark everything as unhealthy. So the team just, you know, this is a quality of experience feature that says, well, run the health check once or as many times, preview it and see what things look like, whatever. I'm like, I definitely didn't think of that, but it's such a good feature and it will say it's going to save someone. It's probably already has saved someone from taking down their site. Yeah. When I was talking to the product manager about this, we were going back and forth on analogies. It's like, yeah, a health check is literally a stethoscope. 
What if you accidentally put the stethoscope on the elbow and conclude the patient is in cardiac arrest? That is not the right answer. And so we need to make sure that people have the ability, before you say trust everything this stethoscope tells you, to check that it's actually reading a pulse. Sergi, it was so great to have you on. It's such a privilege to work with you and your teams. So thanks very much. We will be sure to have you on again soon to talk about all the great stuff from bots coming up and from SSL for SaaS and load balancing. And research and research. Yeah, there's so much to cover. They have something big coming. All right. We'll be back. See you next week. Thanks, everybody.