Latest from Product and Engineering
Presented by: Jen Taylor, Usman Muzaffar, Nathan Disidore, Natasha Wissmann
Originally aired on January 14, 2022 @ 6:00 PM - 6:30 PM EST
Join Cloudflare's Head of Product, Jen Taylor and Head of Engineering, Usman Muzaffar, for a quick recap of everything that shipped in the last week. Covers both new features and enhancements on Cloudflare products and the technology under the hood.
In this episode, Natasha Wissmann and Nathan Disidore join to discuss Application Services.
English
Product
Engineering
Application Services
Transcript (Beta)
Hi, I'm Jen Taylor, Chief Product Officer at Cloudflare and welcome to Latest from Product and Engineering.
I'm joined by my partner-in-crime Usman. Hi Jen, nice to see you.
I'm Usman Muzaffar, Cloudflare Head of Engineering. We haven't done one of these in a while.
Happy 2022. I have not done one of these in a while. I'm very happy 2022 to you as well.
Oh man, like still recovering from 2021. We shipped an awful lot of stuff in 2021.
Yeah, and I'm very pleased to welcome members of the application services team.
It's a generic team name because they do all kinds of stuff.
There was a massive internal debate over the name of this team, which ultimately I said application services and everyone's like really?
And I was like, yes, because they are providing services to applications.
Natasha is our product manager.
Natasha, why don't you introduce yourself? Hi everybody. My name is Natasha.
I'm the product manager for the application services team based out of San Francisco.
Been at Cloudflare for over a year now and joined by Nathan today, senior engineer on our team.
Nathan. Hey everybody. Nathan Disidore on the application services team.
And as they've mentioned, based out of the Austin branch down here.
I'm one of the pre-pandemic folk here at Cloudflare. So I've been with them for just over- Special merit badge, pre-pandemic.
Pre and post. Pre or during, I should say.
We're not post yet. I was so thrilled when I heard we were having you guys on today.
You really are. Working with the amount of stuff that you do and the impact of the stuff that you do, both for our internal teams and our customer facing teams is just, it's phenomenal.
And the impact of the work is great. I think, as I reflect back on 2021, one of the things that I feel like we could call 2021 is really, it was a year of alerts, right?
I mean, you guys took what was a very sort of nascent kind of alerts platform at the beginning of the year and really over the course of the year, blew it out.
So Natasha, can you walk us through a little bit, what are you doing with alerts?
How are you thinking about Notification Center?
Yeah, absolutely. So Notification Center is one of the tabs that we have on our dashboard and it's kind of a centralized hub for any product that wants alerts or for alerts that actually aren't associated with any product at all.
We started the year with five alerts.
We ended with 25 different types of alerts that customers can configure across all types of different products that we have.
We have some alerts for secondary DNS for when your records are changed over.
We have our script monitoring product, which is mostly based on letting you know when there's new JavaScript on your page.
You obviously wanted to get alerted about that.
If something is going wrong with your site, so we have a lot of origin error rate and origin monitoring alerts.
So if we can't reach your origin, your customers can't reach your origin and you probably want to know about that.
So we've got alerts for that.
Like I said, we've got 25 that kind of run the gamut. Yeah, because one of the things we talk about when customers put Cloudflare in front of their web properties, it's a tremendous handing over of trust, right?
Because we are now their front door.
This is not just an internal application. This is their entire web presence.
In some cases, their whole business, which means we also are the ones with the best visibility.
So the moment we notice something is up, obviously if it's a Cloudflare issue, the fire brigade is paged immediately and we're all on top of it.
We're proud of the incident response there. But these are often issues which are specific to that customer.
And with millions, literally millions of websites behind Cloudflare, we wanted to make sure that customers could customize their alerts.
Like Natasha, give us an example of one. So when you talked about some of these origin error alerts and these kinds of things, what are some of the kinds of problems that customers came to us and said, hey, Cloudflare, can you help us notice something like this?
And then what are the tools we gave them in that toolbox?
Absolutely. So the big thing is we have really good analytics. So you can go onto your Cloudflare dashboard and you can take a look at what's happening with all of the requests going to your Internet property.
Super great. And then you can look and you can see there's a spike in your graph.
You can say like, all right, that's not supposed to be there.
Something is definitely going wrong.
And as humans, we can look, but no one wants to sit on their dashboard pressing refresh over and over again.
This is why we invented computers. Exactly. Exactly.
So we built this kind of internal anomaly system called our anomaly detection alerter or ADA is what we're calling it.
And basically it does that job for you of saying, we think there's a spike and that feels wrong compared to the rest of your traffic.
And so you should know about it. So the first thing we did with that was origin error rates.
So again, that's HTTP requests to your origin.
Is there a spike in your 5XX errors? And if there is, there's probably something going wrong.
You should probably take a look at it. This could be the customer's origin has a problem and Cloudflare is just faithfully replaying that error back to the eyeball, but we can see it.
So what does it mean to tell the customer?
How do we actually reach them? What's the bat phone, as Natasha calls it? We're talking to a virtual Natasha right now.
I'm just a projection. We've got a couple of different ways.
The classic way that you think about is email, right? So like something goes wrong, you get an email, but we all know people who have 3000 emails sitting in their inbox.
And if I send an email to them, it's not going to be helpful.
It's going to get drowned out. And these days, there's kind of a couple other ways that companies want to be alerted.
PagerDuty is pretty classic, right?
Most companies that are pretty large use PagerDuty for on-call shifts. So we have a direct integration with PagerDuty.
But a lot of companies want other things.
Here at Cloudflare, we use GChat for a lot of our internal messaging. And so when something goes wrong, we want to be alerted in our GChat channel.
That's where we want to get it.
So there's a super powerful thing called webhooks, which is basically just one API talking to another API.
So Cloudflare talking to something else.
And that something else can kind of be whatever. Usually it's messaging systems.
So it's GChat. It's Microsoft Teams. It's Slack. It's all of those kind of things.
But actually, it can go into anything else too. So that webhook can go to any sort of workflow starting.
So instead of going to a person, it can actually kick off a separate workflow.
You can do that. We've got customers that actually don't even have any workflow involvement.
They just have it hit directly to their internal APIs.
And so then something happens on their side. As soon as they get that webhook, something is updated internally for them.
And you can build straight off of that.
So it's super flexible, super great to use. And you can plug into kind of anything there.
Most websites at this point support webhooks.
Pro tip. You can actually plug that into workers itself. So you can actually invoke a worker from that webhook.
And then if you can do that, you can do anything you can do with Workers.
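As a rough illustration of the pattern Natasha describes, a receiver on the customer's side just parses the JSON body that the webhook POSTs and forwards it wherever it's needed. The field names below are illustrative placeholders, not the documented Cloudflare schema:

```python
import json

# Field names here are illustrative, not the documented Cloudflare schema.
def format_chat_message(payload: dict) -> str:
    """Turn a notification webhook payload into a one-line chat message."""
    name = payload.get("name", "unknown alert")
    text = payload.get("text", "")
    return f"[Cloudflare alert] {name}: {text}"

# Simulate the JSON body a webhook POST might carry
body = json.dumps({"name": "Origin Error Rate Alert",
                   "text": "5xx spike on example.com"})
message = format_chat_message(json.loads(body))
```

The same handler could just as easily post to a chat room, open a ticket, or kick off a Worker.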
That's awesome. Back up a second on that notion. So webhooks, when I go to webhooks, it's what we used to call the Hollywood principle.
Don't call us.
We'll call you. So the system is going to call you rather than you having to poll. So what internally, what do we have to build to make that possible?
How does Cloudflare know what to reach out to when it could be anything?
It's Kafka all the way down.
I joke, but that's actually fairly true. As you can imagine, so Natasha mentioned that we went from five different alert types to 25 in the last year, which is quite literally an exponential growth in our alert types.
And there's obviously some scaling concerns that could crop up with something like that whenever you're designing a system that starts off as an idea and becomes something that's almost mission critical at some point.
So one of the ways that we basically achieved that is through making sure that we have some fault tolerant pipelines in place to handle those alerts.
And Kafka is a huge piece of the puzzle. And one of the things that application services is really kind of known for here at Cloudflare is kind of managing our own sort of Kafka cluster.
But we use that to basically do some ingestion.
The teams that want to produce the alerts will end up basically letting us know that alert-worthy criteria have been met and that an alert should actually be dispatched.
And then we'll do some internal routing logic to kind of handle that.
And that'll flow through to a couple of different microservices to get fanned out to basically process that information through Kafka and eventually through workers as well when we actually do the dispatch to end user sites.
That's amazing. And so just to nerd out here for a second, what happens if the destination endpoint that the customer has configured is not responding?
Do we keep trying? Give us a retry? What are some of the other things you have to worry about as the engineer behind the system?
Webhooks is a whole can of worms whenever you're calling external APIs.
It's tricky to do right and even to do securely.
So there are a couple of ways that we kind of have in place to help with that.
We do retry the webhooks if something goes wrong. We've also got a lot of monitoring in place to kind of look at status codes and metrics on those endpoints that get called.
And one of the common patterns you'll see a lot with Kafka and sort of like dispatch -based queues is that there'll be what we call retry queues, which is basically if something fails at one stage, then we'll just kind of back off a little bit and then forward to another pipeline where it gets retried a couple more times before it finally is marked as failed.
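The tiered retry-queue pattern Nathan describes can be sketched like this; the backoff tiers and the dead-letter fallback here are illustrative, not Cloudflare's actual configuration:

```python
# Simulated backoff tiers (e.g. minutes); a real system would re-consume
# from dedicated Kafka retry topics after each delay.
RETRY_TIERS = [0, 1, 5, 30]

def dispatch(message, send, tiers=RETRY_TIERS):
    """Try each tier in order; report which tier delivered, or dead-letter."""
    for backoff in tiers:
        if send(message):
            return f"delivered after {backoff}m tier"
    # Exhausted all retry queues: park the message for inspection.
    return "dead-letter"

# usage: a sender that fails twice before succeeding
attempts = {"n": 0}
def flaky(msg):
    attempts["n"] += 1
    return attempts["n"] >= 3

result = dispatch("origin error alert", flaky)
```

Pushing each failure to a more patient queue keeps the hot path fast while still giving transient endpoint outages a chance to recover.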
Amazing. And this literally lets customers say, Cloudflare should call me back on this system, and then I'll take it from there and plug it into my chat or my email or have a lava lamp go off, which literally...
You could. Someone's going to think that we're in the lava lamp business, Jen.
Yeah, we are in the lava lamp. But okay, so hold on a second because I think one of the things that I also really appreciated that what you guys did is part of what's powerful about alerts and notifications is that in the moment you get them, but I do also know that like to your point, Natasha, go back and you're looking at your analytics, you're like, remind me again, what was that spike?
And one of the big things you guys have also done this year is really enabled customers to create the notion of a history or a log of these notifications.
Yeah.
And that kind of plays into what Usman was saying about what if your endpoint is down?
Like, do you just not know that something happened? That's not helpful.
What if it's still happening? So we made what we call alert history or notification history.
So you can pull the past 30 days of the notifications that have been sent to you and you can see exactly what we've tried to send to you and what we actually have sent to you.
And there's a couple of different use cases. It's not just when your webhook goes down; say someone left your company and we're still sending emails to them because you forgot to update your alert.
And now you can see the kind of things that they were getting and you can tell what's been going wrong.
And then it's also good for finding patterns, actually, for saying, one of our zones keeps having issues and we keep sending out this origin error rate alert for this one specific Internet property.
Maybe we should take a look at what's powering the Internet property and make it more stable on our side.
Right. Right. Well, and then I know we're digging deep on alerts, but I am really excited.
I think the other thing that I'm really excited about is like the last time we chatted with you guys, we talked about origin error alerts, but the other one that you guys nailed in the last year was the WAF alerts.
Talk a little bit about that. Same concept, right? You have a graph and there's a spike and you want to know when there's a spike.
And for the firewall, a lot of times that means you're getting attacked and maybe there is a spike in one type of rule that you have, or maybe one of our services is catching it, but there could still be stuff getting through.
And you probably do want to take a look at all of the requests that you're getting and all of the logs that you have in your firewall page and say, okay, we should actually change something about this.
And again, that's something you want to know as early as possible, because that's really important for running your site.
Well, and I think that's just such a great example of the power of what you guys do as a platform, right?
That you built this, this fundamental service that has some underlying kind of methods and principles and ways that it handles and processes data.
And that we're able to take that and apply that flexibly to so many different use cases that are critical for our customers.
And again, in that privileged position that Usman mentioned. Yeah.
I want to talk about the, exactly what Jen just said there, like the different use cases.
So one of the fun things about troubleshooting in the modern era is that there are a lot of metrics, and after a problem, engineers can usually find a graphing system, Grafana or whatever, and with the right filter and the right X-axis, zoom in and say, see, there's the spike. But it doesn't necessarily look like a spike until you look at it that way; you have to focus the microscope at exactly that level, and then you can see it.
So how did we teach our systems to recognize that what might be a spike for one zone's traffic might be absolutely, utterly irrelevant for another?
And, you know, because the other way you can mess up alerting is not having too few alerts, it's having too many alerts, and it's just too much noise.
So what were some of the interesting challenges and approaches we took to try to figure out what is truly an anomaly for this website, as opposed to, you know, all the websites?
I think if you, if you end up solving that problem, a hundred percent, you'll have, you'll have another million dollar startup on your hands.
But for the audience listening there, like interesting problems to work on.
Yeah, man.
Yeah, it's interesting. It's definitely a problem that is solved in a couple of different ways out there.
And a lot of the companies are doing something similar here in like the anomaly detection side of things.
But the way that we kind of went about this ourselves is, so we have some historical data there that you can kind of see it all in the plots and all that.
And that helps establish usually a baseline of what kind of normal traffic patterns are over some sort of like fixed window or threshold.
And then, throwback to the college stats class here. I don't know how much you remember from that, but standard deviation is a notion there where you're looking for outliers in your traffic patterns and stuff like that.
And one of the common ways to approach something like anomaly detection is through the use of what's called the Z-score, where you're looking at current traffic patterns and your standard deviations over short windows and long windows against normal traffic patterns to identify any sort of deviations.
And this means taking into account things like diurnal variation, like, you know, like just because it's, it's, it spikes at 9am on Monday morning, that's normal.
It's a business, it's a business website and it gets traffic when people wake up and log in in the morning.
And, but this, the kinds of algorithms and techniques you're talking about, take that into account because they have an idea of what normal is for a given moment in time.
Yeah, we look at normal for that, that site specifically, right?
So not normal across Cloudflare, normal for you. That's really amazing.
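A heavily simplified, single-window sketch of the Z-score check Nathan describes. The real system compares short and long windows against per-zone baselines; the threshold and the sample numbers here are illustrative:

```python
from statistics import mean, stdev

def z_score(baseline, value):
    """How many standard deviations `value` sits from the baseline window."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:  # perfectly flat traffic: any change counts as a deviation
        return 0.0 if value == mu else float("inf")
    return (value - mu) / sigma

def is_anomaly(baseline, value, threshold=3.0):
    """Flag `value` if it deviates more than `threshold` sigmas from baseline."""
    return abs(z_score(baseline, value)) > threshold

# e.g. requests-per-minute over the recent baseline window for one zone
baseline = [100, 102, 98, 101, 99, 103, 97, 100]
spike_detected = is_anomaly(baseline, 500)   # far outside normal variation
normal_ok = is_anomaly(baseline, 101)        # within normal variation
```

Because the baseline is per zone, the same absolute number can be a screaming anomaly for one site and business as usual for another.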
And so, does this stuff actually work, Natasha?
What kinds of alerts have customers come back to us about, and what feedback have we heard?
We've heard good feedback. We've heard, you know, that their, their origins were having issues.
And we told them before even their internal systems figured out that they were, that their origins were having issues.
So, you know, saved them a bunch of time figuring out what's going wrong and that something actually is going wrong.
We have a lot of requests for like, this works well for the series that we're using.
Can we do this for more things?
And so that's what we're working on is let's expand this even more. Well, and I think about just expanding, right?
I mean, kind of building on what Nathan was saying and Natasha, you were just touching on too, and is really, you know, part of the secret sauce of getting this alerting system to work and work so effectively for each site is the capacity that this team has built to process huge volumes of data very quickly to basically find the needle in the haystack and let somebody know, by the way, you have a needle.
And, and so, you know, kind of a leading question, here's a kind of an answer, but you know, what are some of the other ways in which this team has applied those special powers to other problems?
Nathan actually probably wants to talk a little bit about large data processing that we did for our Crawler Hints project.
What a perfect segue. Yes, yes, let's talk about Crawler Hints, which was one of the main projects application services shipped last quarter. You may have seen a couple blog posts on it recently on the Cloudflare blog.
So a quick high-level summary, the elevator pitch for what Crawler Hints is: typically, for search indexers out there, the way that they update their own content caches to know what the content looks like is that they have some algorithm in place to periodically crawl these sites to actually go get the data.
But Cloudflare, being the CDN and the middleman here, is in a privileged position to know when content has changed, by looking at a couple of different incoming pieces of data from client websites.
And the goal here is that we can give search indexers the ability to make an informed decision on when to crawl client resources, by keeping tabs on this data and then periodically letting them know that the content has changed and that it's worthy of a recrawl.
So Crawler Hints, one of the main driving factors here, this kind of came out of Impact Week.
One of the things that we noticed was that there was a large amount of wasted crawls out there on content that hadn't changed and didn't need to be refreshed.
Basically computers asking each other, has it changed yet?
Has it changed yet? Has it changed yet?
That's what it is. It's like me asking, Usman, has it shipped yet?
That's what I need. I need, I need a Jen hint.
But you know, you multiply this out and that's tremendous, like a monumental waste of the web's infrastructure, wasting time.
And yet here's Cloudflare sitting on, actually we know the answer, or at least we have a hint of what we think the answer is.
And you should listen to us. Spot on, spot on. Yeah.
Some analytics were done, and I think we found that, on average, only about 20% of the time had content actually changed between those crawls from bots.
And, you know, I'm sure anybody could write their own basic site crawler in a day in Python if they wanted to.
But what we wanted to do, working with a lot of partners out there, some really big search indexers, was come up with a common API. We were part of a partnership that produced a joint API called IndexNow, an agreed-upon standard to get a new API in place to talk to these indexers and let them know: I've explicitly changed this content, and I am ready for a recrawl on my side.
And so is this something where the indexers reach out to Cloudflare? Have I got the direction of information correct, or is it the other way around?
Other way around, actually.
Yeah. Yeah. So we're actually, we're actually letting them know.
We let them know. So we're proactively poking them; we're the client in this conversation.
We're telling them, we have the answer.
We're telling you, here's what you need to know: this has changed or has not changed, or you should consider a recrawl.
That's amazing. A hundred percent. I should mention, this is all an opt-in process.
So this, this doesn't happen unless you explicitly turn it on.
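For illustration, a minimal IndexNow batch submission could look like the following. The endpoint and payload shape follow the public indexnow.org documentation, but the host, key, and URLs are placeholders:

```python
import json
import urllib.request

# Endpoint per the public indexnow.org documentation; the key and URLs
# below are placeholders.
INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"

def build_payload(host, key, urls):
    """Assemble the IndexNow batch-submission body."""
    return {"host": host, "key": key, "urlList": list(urls)}

def submit(host, key, urls):
    """POST a batch of changed URLs; returns the HTTP status code."""
    req = urllib.request.Request(
        INDEXNOW_ENDPOINT,
        data=json.dumps(build_payload(host, key, urls)).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

payload = build_payload("example.com", "0123abcd",
                        ["https://example.com/changed-page"])
```

The point of the batch shape is exactly what Nathan describes: one proactive notification replaces an open-ended stream of "has it changed yet?" crawls.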
Yeah. But it is kind of astounding, the volume. So I think we launched in November of last year, and I just checked today: we're processing about 500,000 URL resources a minute.
And we're up to over 10 billion total.
And it's one of those awesome things where it's literally a win, win, win.
I mean, like everybody wins here. The crawlers spend less energy. They get more accurate results.
The origins get less useless traffic and bandwidth. And Cloudflare has an interest in helping this and seeing less useless traffic on our networks as well.
So it's like, it's, it's really an example of connecting the dots in a way that everybody, everybody benefits.
It's, it's really, it's really fantastic.
Natasha, I want to shift gears to one of the more classic products. Crawler Hints is new and exciting, but one of the most important things that application services has owned from the beginning is audit logs, which is really fundamental to all of this. You know, the Cloudflare dashboard has dozens and dozens of products, and they give you hundreds of different settings.
And as you make changes to things, some of our customers, it's not just one admin, there's whole teams of people who are logging into the Cloudflare dashboard.
And they often ask the question, wait a second, who changed this?
When did this change? Why did this change? And they just need to ask the question, what was up with this setting?
Who changed it? And that's sort of the problem statement for audit logs.
So take it from there. How did Cloudflare think about that problem?
What did we make available to our customers? And what are some of the cool things we've done recently there?
Definitely. So our customers use this, use audit logs kind of in two ways.
There's, I want to do a yearly check of what's changing, who's changing things, are the right teams working on the right things?
I want to take all of this data and analyze it. And then the second one, which is actually our bigger use case is there's something wrong with my site.
What has changed recently? And what happened that I need to now revert this change so that it goes back to normal and then we can readdress.
But to see what changed recently, you go to audit logs.
In audit logs, we have a couple of main things.
It's who made the change? When did they make the change? What thing was changed?
And then how was it changed? So it used to be what, and it is now what. So all of those things are answered in audit logs.
We have a lot of data. And one of the things we want to make sure is that customers are able to access that data.
So we have a UI, but there's a lot of data for these larger accounts.
Are you going to create a UI with page one of 99,000?
And actually, Natasha is going to read it for you.
It's one of the longest data retentions that we have at Cloudflare as well.
We store audit logs for 18 months.
And so you have those for a while. So UI isn't always the most helpful.
You can ping our API every five minutes and gather that data, but then you have to write whatever system is going to ping our API every five minutes to get the data.
That's not always great either. So one of our more recent projects is integrating with another great Cloudflare product, LogPush.
So you can already LogPush a bunch of different types of data sets to whatever data ingestion site you want.
And now you can also LogPush your audit logs. So your audit logs will go, you can store them somewhere, you can alert based on them from your preferred data manager, and you can keep them for longer than 18 months as well.
Yeah. So just put on your virtual Jon Levine hat here and your data product manager colleague's hat.
What is LogPush? What does it solve? And what are some of the neat bags of tricks that it has?
And how does integrating with LogPush suddenly give audit logs, your product, a huge advantage?
Audit logs is actually very similar to alerts, right?
Instead of you having to pull our APIs all the time, we're going to tell you when something is happening.
So we are directly integrated with a bunch of different data management systems, Splunk, Datadog, all of these normal ones that all of our customers use. We're integrated, and we will send the information directly to that system.
So you as the customer set up this job, you say, here's the data set I want pushed.
Here's the destination that I want it pushed to.
Here's my data bucket. And then we will just send that data to you. So then you can log into whatever data bucket you have and you can just see it.
You don't have to do anything.
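As a sketch of what setting up such a job can look like against the Cloudflare API, assuming the `audit_logs` dataset name and an S3-style destination string; check the current LogPush API reference before relying on the exact fields:

```python
import json

# Dataset name and destination string follow the documented shape at the
# time of writing; verify against the current LogPush API reference.
def logpush_job(destination: str, dataset: str = "audit_logs") -> dict:
    """Body for POST /client/v4/accounts/{account_id}/logpush/jobs."""
    return {"dataset": dataset,
            "destination_conf": destination,
            "enabled": True}

body = json.dumps(logpush_job("s3://my-log-bucket/audit?region=us-east-1"))
```

Once the job is created, Cloudflare pushes new audit log records to the configured bucket on its own schedule; there is no polling loop to maintain.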
So this can go straight into security teams, logging systems, event systems.
Literally, it's pushing it right into the place where they really care about it.
And the information is being pushed as structured information, right?
So there's actual discrete fields, who did what, what, when, things that a machine can read, which means your machine can query.
So I'm not paging through 99,000 UI screens and I can get anything I want.
And I think we even also give you the ability to just literally download some of this stuff directly.
And so that's great.
So tell us some stories around that, like how have customers used that stuff and what have they been able to do with it?
So actually, some of the best stories I see are on the Cloudflare community board.
So various users will post and say, there's something wrong, what's going on?
And the answer is usually check your audit logs.
And their response is usually like, oh, I see that something was changed.
And one of their big concerns there is when you enable a third-party website to access your site or to access your Cloudflare account, they are able to make changes.
And sometimes those changes include editing your DNS records, so that your site no longer works.
And what we've seen recently is sometimes you don't just want to know when someone did something, you want to know when someone saw something.
So it's not just, I went through and I actually made a change, it's someone logged in and maybe they saw my firewall configuration.
And so they know how to make an attack that kind of gets around it.
And that's really dangerous information.
So what we've done and is in beta right now is the ability to see everyone on your account that logged in.
So I logged in, which means I could have seen something.
And that's really important and really useful information for our customers.
And again, it's amazing because some of our customers are enormous organizations with whole divisions where everybody doesn't even know each other.
And so knowing who logged in is in many cases utterly innocuous and in other cases needs further investigation.
Well, and I think that's why this is such a critical backbone; there are so many compliance efforts, right?
I mean, it's critical.
Especially, yeah, with the increasingly important part that Cloudflare serves: it's the front door of your website.
And so any change there affects them.
This is amazing. I cannot believe how quickly this 30 minutes went.
I feel like I can talk to you guys for another two hours about all the interesting stuff.
Because you keep doing all this amazing stuff. There's always so much good stuff to talk about with you guys.
It's incredible. And the impact of it is so deep and so broad.
It's really cool. Yeah. I just wanted to thank both of you for spending some time with us, talking through us.
This was a lot of fun.
Can't wait to have you back on soon and talk about all the next generation stuff.
I know you're both super busy with another whole bucket of features and awesome new stuff.
And we'll have you again on soon to talk about that. Thanks for having us.
Thanks all. Have a great weekend. Thanks everybody. Thanks for watching.
Bye.
A DDoS, or distributed denial of service attack, is a malicious attempt to disrupt the normal functioning of your service.
There are people out there with extensive computer knowledge whose intentions are to breach or bypass Internet security.
They want nothing more than to disrupt the normal transactions of businesses like yours.
They do this by infecting computers and other electronic hardware with malicious software or malware.
Each infected device is called a bot. Each one of these infected bots works together with other bots in order to create a disruptive network called a botnet.
Botnets are created for a lot of different reasons, but they all have the same objective.
Taking web resources like your website offline in order to deny your customers access.
Luckily, with Cloudflare, DDoS attacks can be mitigated and your site can stay online no matter the size, duration, and complexity of the attack.
When DDoS attacks are aimed at your Internet property, instead of your server becoming deluged with malicious traffic, Cloudflare stands in between you and any attack traffic like a buffer. Instead of allowing the attack to overwhelm your website, we filter and distribute the attack traffic across our global network of data centers using our Anycast network.
No matter the size of the attack, Cloudflare's advanced DDoS protection can guarantee that you stay up and running smoothly.
Want to learn about DDoS attacks in more detail? Explore the Cloudflare learning center to learn more.