Latest from Product and Engineering
Presented by: Jen Taylor, Usman Muzaffar, Natasha Wissmann, Rajesh Bhatia
Originally aired on October 12, 2022 @ 2:00 AM - 2:30 AM EDT
Join Cloudflare's Head of Product, Jen Taylor and Head of Engineering, Usman Muzaffar, for a quick recap of everything that shipped in the last week. Covers both new features and enhancements on Cloudflare products and the technology under the hood.
Product
Engineering
Transcript (Beta)
Hi, welcome to Latest from Product and Engineering at Cloudflare. I'm Jen Taylor. I lead the product team.
Hi, Jen. I'm Usman Muzaffar. I lead the engineering team.
It's been a while. I feel like it's been a few weeks since we've got everyone together.
I'm very excited to have Natasha and Rajesh join us and tell us what they're working on.
Natasha, say hi. How long have you been at Cloudflare? Hi, everybody.
I'm Natasha. I'm the product manager for our application services team here.
I've been at Cloudflare for about six months now. Six months. And we were just joking like six months.
That's it. It feels like you've been already a key part of this team.
She's running the place, basically. That's right. Absolutely. And a pure post-pandemic hire.
So, like, someone we've had completely virtual interactions with.
I have no idea how tall she is. Yes, that's right. I've never seen her in person.
It's a surprise. Rajesh, always good to see you, man. Yeah. I'm Rajesh Bhatia.
I'm the engineering manager on the application services team. I'm based out of San Francisco and have now been at Cloudflare for over three years.
And I agree, you become an old timer pretty quickly here.
You are an old timer. That's fantastic.
Legit old timer. So the term you heard both Natasha and Rajesh say is application services.
And even that name had a vigorous debate, Rajesh. I remember it was almost as much intrigue as naming a new company.
It was like multiple names and like, well, how does this roll off the tongue?
And like, are people going to confuse it?
Are we going to know what this does? So Natasha, I'm going to swivel my chair and ask you, what is application services?
What is it that your team owns?
Why did we call it that? Great question. So we are a platform team. So basically things that multiple other product teams want, we build.
So an example of that is alerts.
A lot of other product teams want to send alerts about their products.
It doesn't make sense for every single team to build out their own alerting system and a way to send you emails and all of that fun stuff.
So build one system, one place on the dashboard, you can find it.
And then we own that. And similar with audit logs, a lot of different teams produce audit logs and want to show those to the customers.
And we want there to be a single system that's maintaining all of that super fun stuff.
And then we also have a bunch of engineering platforms that we maintain that Rajesh can probably talk more about.
Yeah.
We were called internal tools and the naming thing that Usman is referring to is because we weren't just doing internal stuff.
We weren't just doing cool stuff.
So we figured application services kind of captured that really well because we do have a customer facing side of common abstractions that we build, like what Natasha is talking about, audit logs, alert notifications.
We handle all transactional emails; we have a service for that.
And on the engineering platform side, we build services like a message bus, which is a Kafka pipe and a feature flagging system.
And we have an internal tool called that we manage and we build a platform around that.
So yeah, I always say we have a two piece charter. One is more customer facing abstractions.
One is engineering platform. And we really work with a cool tech stack at Cloudflare.
One of the things I really appreciate about the work that this team does is that you have these two pieces, but you are also integral in tying them together.
Part of what happens at Cloudflare is we see a phenomenal amount that is happening for our customers and their accounts.
We see a lot of it internally, and we want to be able to aggregate that and share that with customers.
And then the other way, we want to be able to turn around and give our internal customers the ability to see and configure that as well.
And you guys are a critical part to that glue.
And this is integral to running the service and helping demonstrate the value that we're delivering.
One of the ways you guys do this is actually through the notification service.
So Natasha, what is the notification service?
Why did we build it? Great question. So Cloudflare sits in front of all of your Internet properties and we can see everything.
And it's a really integral piece to your company and what you're building.
So what we want to do is we want to be able to tell you when we see something that needs your attention.
So that could be for a product specifically, that could be more at a platform level.
We've got notifications that will tell you if your origin is seeing errors.
We have notifications that will tell you if your secondary DNS records are having issues transferring.
A lot of different things that we can alert you or notify you based on.
And again, we want it to be kind of one place you can go and manage all of these notifications that you have.
Well, and the cool thing for me with the work that you guys have been doing is that you've not only been adding support for notifications from different services within Cloudflare, but you've also been adding more configurability for the end user.
Can you talk a little bit about that, kind of how you've thought about it and kind of where we're at today?
Absolutely.
So the notification service actually exists on the account level, and you can have a lot of different domains on your account.
And it's not always super helpful to get a notification for every single domain that you have.
So what we want to do is we want to give our end users more customization, more configurability.
Maybe you have a test domain that you don't want to get notified on.
You don't expect your origin to always be up for that. So we're trying to allow you to select more of what you actually want to be notified on and make those alerts more useful for you.
That's awesome. And then, I mean, where do you send them? Today we've got support for email.
What platforms are we sending to today?
We do email, webhooks, and PagerDuty today. So email's pretty standard.
You get it to your inbox. It's not always the most helpful. I don't always look at my emails right away personally.
So if it's a really important notification, you probably don't want it to just be an email.
So then we also have PagerDuty.
So whoever's on call on your schedule right then can get paged immediately.
And then webhooks we also have, which is my personal favorite. It's super configurable.
It basically sends a message to any sort of other system that you have.
So you can send a message to Slack. You can plug into Slack through webhooks, or Gchat, or any SIEMs that you have; you can plug those into webhooks as well.
So that will send those notifications wherever they need to go. And then you can actually build your own systems on top of those notifications as well.
So you can, you know, instead of actually having someone that has to go in and look at the notification, you can have it trigger something within your system without any human interaction.
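As a rough illustration of a notification triggering something without human interaction, here is a minimal sketch of a webhook receiver on the customer side. Everything in it is an assumption made for illustration: the `alert_type` field, its values, and the action names are hypothetical and are not taken from Cloudflare's actual payload format.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def choose_action(notification: dict) -> str:
    """Map an incoming notification to an automated action.

    The "alert_type" field, its values, and the returned action names are
    hypothetical placeholders, not a documented payload schema.
    """
    alert_type = notification.get("alert_type", "")
    if alert_type == "origin_error_rate":
        return "fail_over_to_backup_origin"
    if alert_type == "secondary_dns_transfer_failed":
        return "open_ticket_for_dns_team"
    # Anything unrecognized just gets forwarded to a chat channel.
    return "post_to_chat"


class WebhookHandler(BaseHTTPRequestHandler):
    """Accepts POSTed JSON notifications and replies with the chosen action."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        try:
            notification = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            notification = None
        if not isinstance(notification, dict):
            # Reject bodies that are not a JSON object.
            self.send_response(400)
            self.end_headers()
            return
        body = json.dumps({"action": choose_action(notification)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Run the receiver until interrupted (not called automatically)."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Calling `serve()` would start the listener; in a real setup the returned action name would be replaced by a call into whatever automation the team already runs.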
Yeah, that's fantastic. This is all super important because like, it's one of the things we're always, we're always very, you know, humbled by and honored by at Cloudflare is that our product is part of their core infrastructure, of our customer's core infrastructure.
It's not an application in a traditional sense that it might be a really important application that's used internally to, you know, track issues or track sales deals or something.
It's literally the front door of their service.
And the other day, I was actually looking at one of our customers' websites, and they had a status page that they keep to tell everybody what's going on. It had their own status, and right below it was Cloudflare's status, because if anything's wrong with Cloudflare, it is just as impactful as if it's their own origin.
So not only is it an easy way for them to know what's going on, but like they need to know what's going on, right?
They have to be able to be alerted right away if there's any kind of problem.
The webhooks one is really interesting. So just to implement a webhook, you actually have to know where you want to send this stuff.
And now we're in the business of sending something and worrying about backoffs and worrying about retries.
And so talk a little bit about that. And also a generic, I guess, asterisk: it's easy to say let's do webhooks, but there's some complexity involved there.
What are some of the challenges there?
Totally. So there are two modes in which you can access this data.
There are APIs available where people can request and get that data.
Sometimes people call webhooks reverse APIs, but perhaps more accurately, a webhook is something that lets you skip a step.
So for example, with most APIs, there's a request followed by a response, whereas no request is required for webhooks.
The data is just sent when it is available. So you are able to register a URL with our service, because we are the providing service, and that URL points to a place within your application that will accept data.
So whenever an alert happens, we are able to look at the registration information for that webhook and send the alert directly to what is configured, without the customer needing a step where they request that we send the alert data.
So this is super helpful because we can configure these alerts to fire. And like Natasha was mentioning, it can go into like a Slack channel.
It can go into a Google Chat channel or any other application on the customer side, depending on how they want to consume it.
And then they can use that information to drive any analysis and any next actions that they want to take based on that.
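The sending side, including the backoffs and retries Usman asks about, might look something like the sketch below. It is not Cloudflare's implementation: the function names, the retry count, and the doubling delay are all assumptions chosen to illustrate pushing an alert to a registered URL with exponential backoff.

```python
import json
import time
import urllib.request
from typing import Callable


def post_json(url: str, payload: dict, timeout: float = 5.0) -> int:
    """POST a JSON payload to the registered URL and return the HTTP status."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def deliver(
    url: str,
    payload: dict,
    send: Callable[[str, dict], int] = post_json,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to deliver an alert, doubling the wait after each failure.

    max_attempts and base_delay are illustrative defaults, not values
    taken from the source.
    """
    for attempt in range(max_attempts):
        try:
            if 200 <= send(url, payload) < 300:
                return True
        except OSError:
            # urllib's URLError and HTTPError, and socket timeouts, are
            # all subclasses of OSError, so they are treated as retryable.
            pass
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return False
```

Passing `send` and `sleep` as parameters is only a convenience so the retry logic can be exercised without a network; a production sender would more likely also add jitter and a dead-letter path for alerts that never get through.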
Yeah, it's really powerful.
What I remember for webhooks, and this is an older metaphor, is the Hollywood principle.
Don't call us, we'll call you. You don't have to keep making requests to Cloudflare.
Is there an alert for me? Is there an alert for me? We'll let you know.
Just tell us where you want the call to go and we'll call you. Leave us your number and we'll tell you when there's something interesting.
Yeah. And another metaphor I've seen people use is like, call me when he's warmed up.
Baseball managers, you can see them picking up a phone from a dugout and calling a bullpen to see.
Because especially when a manager wants to make a change in the pitcher, he has to make a phone call to see if the pitcher can start to get ready.
But if he had a webhook... That's right. Call me when he's warmed up. That's what that is all about.
Rajesh just accidentally leaked Cloudflare's next product, which is a...
Major League Baseball. Major League Baseball bullpen as a service.
Excellent. So other things that matter here are being able to monitor our customers' origins and alerting on them.
So because we're in the position of being in the middle of the conversation between what we call the eyeball, our customers' customers, and our customers' origins, which is the part of their infrastructure that is wholly under their control, we're also the first to notice if anything is funny looking there.
And a lot of our customers, I remember, have said to us like, could you let us know if you see something funny looking?
So Natasha, how did those conversations come to us?
What were customers looking for and what have we made available to them?
Great question. So we have a lot of different options right now, actually, and they kind of each serve their own individual purpose.
So we've had passive origin monitoring for a while.
It's a notification that we have where if we are only seeing 521 errors from your origin for a period of time, we're going to let you know.
We know that there's definitely something wrong at that point.
It's not necessarily as sensitive as a lot of our customers need, though. A lot of our customers were saying we're seeing some errors from our origin, but not every single request is an error from our origin.
And we still want to know about that.
And we want a good way for you to be able to tell us that. So we've implemented this new alert type, which is the origin error rate alert, where we actually look at the levels of 5xx errors that we're seeing from your origin, compare them against your origin's historical performance, and say, OK, we think that there's something wrong.
So let me make sure I understand that.
So basically, what you're telling me is Cloudflare watches the error rate, kind of baselines the error rate.
And then when we detect that you've gone above that, that's when we alert.
Sort of. So that's a pretty simple threshold that you can do pretty easily, say, like if more than five percent of the requests to my origin are errors, let me know.
But setting that threshold is really difficult.
Yeah. And it's different for a lot of different websites. So instead, we're using this really cool concept outlined in the Google SRE workbook, actually, of using SLOs to set what are the level of errors I'm OK with having from my origin at any point, because we know it's never going to be perfect.
There's going to be some one offs.
We don't want to alert you for every single thing that's happening.
Worse than the no alert is an alert that fires. Exactly, exactly.
So we want to be able to say, kind of, you're burning through this budget that we have too quickly.
You say that we're OK with a certain budget of errors. You're burning through that pretty quick.
You probably want to take a look. Yeah. What is important are concepts like service level indicators. Think of an SLI as a ratio of two numbers: the number of happy events over the total number of events.
And in the context of Cloudflare, like Natasha was mentioning, it's the number of HTTP requests with a status of less than 500 over the total HTTP requests that you get.
And ideally, the service level indicator number should be 100 percent.
That means you never have any errors.
But service level objectives, the SLOs that we're talking about, are where you define what your target SLI is for a particular service.
So suppose you decide ninety nine point nine percent of delivery success over the last 30 days is your SLO.
That means you have a point one percent error budget.
And so, for example, if you get like three million requests over a period of 30 days, you have a budget of three thousand errors.
And if you're running through your coupon book awfully quick, say there was a single outage that was responsible for fifteen hundred errors, you've used up fifty percent of your budget.
We know that we can capture that. We can let the customers define that and fire alerts based on what we see there.
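The arithmetic in this example can be written down in a few lines. This is only a sketch of the SLI and error budget definitions as described in the conversation; the function names are made up for illustration and say nothing about how Cloudflare computes them internally.

```python
def sli(good_requests: int, total_requests: int) -> float:
    """Service level indicator: good (status < 500) requests over all requests."""
    if total_requests == 0:
        return 1.0  # No traffic means nothing has failed.
    return good_requests / total_requests


def error_budget(total_requests: int, slo: float) -> float:
    """Number of errors the SLO tolerates over the window."""
    return total_requests * (1.0 - slo)


def budget_used(errors: int, total_requests: int, slo: float) -> float:
    """Fraction of the error budget already consumed."""
    budget = error_budget(total_requests, slo)
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return errors / budget


# The numbers from the conversation: 3 million requests in 30 days,
# a 99.9% SLO, and one outage that caused 1,500 errors.
budget = error_budget(3_000_000, 0.999)
used = budget_used(1_500, 3_000_000, 0.999)
print(f"{budget:.0f} errors allowed, {used:.0%} of the budget used")
# prints "3000 errors allowed, 50% of the budget used"
```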
It's really cool. What's great is that this is a well understood system; it was documented about a decade ago.
And it's been really instructive because it allows for not just a baseline error problem, but a spiky level of problems. Even a spike is OK unless there are too many of them in a row.
And yet it's still just a ratio.
It's still just a single metric. Right. Which means it's easy to calculate and it's easy.
OK, I know how to do this like on a piece of paper.
Right. Like I can do this myself on a piece of paper, but like I'm not a scalable system that's going to be able to service twenty five million.
I could try. Yeah.
But you're pretty clever. You might come up with something. I've seen how many notebooks you fill when you were in the office.
I can write pretty fast. But how do we architect a system like this?
Like, how do we build a system that can deal with this volume of aggregation and counting?
Yeah. So the good part is we have a big base of requests, and we do profile them.
We capture analytical data about a particular origin.
And so we are able to tap into that data.
And based on that data, and on the SLO configuration that you have set for a particular alert, we can figure out how quickly you are going to use up your error budget given the number of errors that have happened on your domain.
Say you will use it up in 15 days even though you have a period of 30 days. You're going through it pretty quickly.
And that's when we can alert you.
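One common way to express "you will use up a 30-day budget in 15 days" is a burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below assumes that formulation and a single fixed alert threshold; the names and the threshold are hypothetical, and the source does not say which windows or thresholds the real alert uses.

```python
def burn_rate(errors: int, requests: int, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A burn rate of 1.0 spends the budget exactly over the SLO window;
    2.0 spends it in half the window.
    """
    allowed = 1.0 - slo
    if allowed <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    if requests == 0:
        return 0.0
    return (errors / requests) / allowed


def days_until_exhausted(rate: float, window_days: float = 30.0) -> float:
    """Project how long the budget lasts if the current burn rate continues."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


def should_alert(errors: int, requests: int, slo: float, max_rate: float = 1.0) -> bool:
    """Alert when the budget is being spent faster than max_rate (an assumed threshold)."""
    return burn_rate(errors, requests, slo) > max_rate
```

With a 99.9% SLO, 200 errors in 100,000 recent requests is a burn rate of about 2, which projects to roughly 15 days of budget in a 30-day window. Real burn-rate alerting often checks more than one lookback window to balance speed against noise; this sketch deliberately keeps to one.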
So the way the system is set up is that we tap into the analytical data and we definitely make use of it.
And then on the service side, we've been able to build high performance data pipelines based on Kafka that we leverage to move a lot of this data around and to calculate this in near real time, so that we can give data that is as accurate as possible to our customers.
One of the technologies you're mentioning there, Rajesh, is Kafka, which was actually in the news recently because the team that originally productized it, Confluent, went public last weekend.
But we use the open source version of it, and a big instance that's processing a lot of the analytics that you were just referring to.
That's how all the data pipelines work.
But there's also one specifically that is largely under the custody of your team, to worry about message passing.
And message passing covers all the information that is going between these systems, including what underpins our audit logs.
Before we go into more about how audit logs work, Natasha, back to you.
What are we auditing?
What are we logging? Who is this log for? What do they expect to see in it? And why did we build this?
And are accountants involved? Thankfully, no, at the moment.
So basically, as a customer, you want to know when something on your account changes and you want to be able to trace anything back to that change.
So you want to be able to say, like, I set up this firewall rule at this time and date, and here's exactly who set it up so that I can know we're starting to block things.
This is why.
You know exactly why everything's happening. And that's really important for a lot of these systems that you set up so that you know exactly why things are happening, when they happened, when they started, and also who did them.
And that has a lot of legal implications for a lot of our customers in terms of, you know, who is accessing their Cloudflare dashboard and who's seeing their data.
They want to be able to tell all of that. And who's actually making changes?
Are the right people making changes? If something goes wrong on their side, what happened?
Can they trace it back? And so the audit log system allows you to see all of the configuration changes across the products that we have.
So again, it's a lot of different product teams working together to produce these events.
We have an internal name for it. We call it Watchtower. The main purpose is to provide a high, safe place from which a guard or a sentinel is watching and making sure things are all going okay.
Yeah. And it's interesting, you know, when you're a one-person startup, this feature feels like complete overkill.
The only person who's allowed in is, you know, Usman's the only one who ever messes with the firewall rules.
So if it's messed up, it's Usman's fault, right? But once you become a giant corporation, some of our customers are huge with multiple divisions and even making sure that the right people have access, let alone make the change.
Like it is so important to be able to study that log of changes.
It's exactly as important as engineers using, you know, Git log or Git blame to be able to see what, wait a second, who made this change and why did they make that change?
What were they trying to do? And can I go talk to that person?
I understand it. And so again, the thing here is to actually make sure that it is complete and has everything and yet it's legible.
It can't be, it's got to actually become events that were meaningful enough, not low-level things, not, you know, that we changed record 17 to value 294.
It's got to be something that's actually meaningful to a customer as a discrete change that they made in the UI or the API that they can then study and reference.
And actually, Usman, a lot of, even if we do have just one user on an account, there's a lot of third-party integrations on websites that can make changes.
And that will also show up in your audit logs.
Well, and in many ways, those are the ones that are often the most difficult because you're like, did you change this?
No, I didn't change it. You're like, Oh, third-party application changed it.
Ah, you know, it would have been impossible to find that without the audit log.
Now, like how do we make our audit logs accessible to our customers and how do they, how do they consume them?
How do they use them?
Great question. So there's kind of two different use cases. There's why is something happening right now?
I want to check my most recent audit logs, what has just happened.
And then there's kind of like the quarterly or yearly overview kind of piece.
I want to see a summary of everything that's changed.
So audit logs also exist at the account level of the dashboard. You can go into the audit logs tab.
You can see everything on the UI. And then you can also pull them from our APIs to get a higher level summary.
Got it. Got it. So I could kind of like with some of the other stuff we've been talking about, pull it into my, you know, we're talking about notifications.
You know, you guys built a lot of services that we provide in a variety of different ways so that customers can ingest them and integrate them with whatever their operational platforms are internally.
Exactly. That's awesome. That's awesome. On the inside, say I just joined Cloudflare. You know, I'm not an old timer like Natasha, who has been here since December.
I joined just three days ago. I've been part of a brand new team.
We're building a brand new product. And suddenly Jen Taylor comes knocking and says, Hey, you are also going to make sure that all your actions are logged in the central audit log.
How does, what happens then? Like now here's where the application services part of your team kicks into play.
Like how, what is the service you've provided internally that lets our developers.
Yeah. So yeah, clearly, I mean, we need to make sure that onboarding onto audit logs is as easy and straightforward as possible, so that there is less friction, or no friction. No friction.
Yeah. And for as long as we have been able to focus on that aspect of our technology, we have really strived to do that.
We provide an SDK that different teams can integrate with, and we have various other mechanisms as well.
We also provide hooks where you can directly publish your events as they come to a particular Kafka topic.
And we listen to that topic. So we try to be as mindful as possible and provide as many avenues as we can for different product teams, who might be working in completely different languages and repos, to be able to leverage client SDKs in whatever languages they work in, and make it as simple as possible for people to keep publishing these events.
And it's very critical for us that we don't lose any of the data that is coming in.
It is definitely very important from a compliance point of view.
And we take really good care in how we have designed our system to ensure that there is a lot of backup and reliability built into the service.
There is also some asynchronous nature built into it, so that eventually it will get written and it will be the system of record for all the events that have happened.
And that's another thing: a surprising amount of effort goes into making sure that it is cryptographically secure, that the records can't be tampered with, right?
So that's the, there's no point in an audit log if you're...
Yeah, we don't, we don't allow edits or deletes to any of that because it is a system of record.
And the other thing is sometimes there might be errors that we do see.
And the way we've built the system, we need to propagate some of these errors back to the producing teams.
And so we have tried to implement some features in our system to be able to do that.
And we're trying to push for those sorts of changes to go ahead in our system as well.
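The conversation does not say how the no-edits, no-deletes property is enforced. One widely used technique for making a log tamper-evident is a hash chain, where each entry's hash covers the previous entry's hash. The sketch below illustrates that technique only; it is an assumption, not a description of Watchtower, and the class and field names are invented.

```python
import hashlib
import json

GENESIS = "0" * 64  # Placeholder "previous hash" for the first entry.


class AppendOnlyLog:
    """A toy append-only log in which each entry is chained to the one before it."""

    def __init__(self) -> None:
        self._entries: list[dict] = []

    def append(self, event: dict) -> str:
        """Store the event and return its chained SHA-256 digest."""
        previous = self._entries[-1]["hash"] if self._entries else GENESIS
        body = json.dumps(event, sort_keys=True)
        digest = hashlib.sha256((previous + body).encode()).hexdigest()
        self._entries.append({"event": body, "prev": previous, "hash": digest})
        return digest

    def verify(self) -> bool:
        """Recompute every digest; any edited or removed entry breaks the chain."""
        previous = GENESIS
        for entry in self._entries:
            expected = hashlib.sha256((previous + entry["event"]).encode()).hexdigest()
            if entry["prev"] != previous or entry["hash"] != expected:
                return False
            previous = entry["hash"]
        return True
```

The class exposes no update or delete method, and `verify()` fails if stored entries are altered behind its back. A chain like this only detects tampering; preventing it still depends on access controls around the storage itself.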
Natasha, what are some of the kinds of events, like what's a typical, what's an event and like how much detail is in one of these records and does it vary by team?
And like, how do we decide if the team says, look, I want to stuff way more information into my audit log.
What happens next? You're the product manager. What, what, how many knobs and sliders are there in this system?
There are honestly a lot of knobs and sliders in this system.
And we do want to get a little bit better at making that as standardized as possible.
So when a producer team goes in, another product team goes in and they want to make new audit logs, they have a better idea than we do of what their customer wants to see because they know their product the best and they know what their, what is actually changing about it.
So the information that we give is who made the change?
Did you make it via the API or the UI?
When was the change made? Obviously what was changed? And then we have other kind of open fields for, give us more information.
So tell us details about your DNS record changed.
How did it change? Which record changed? What was it before? What was it after?
As much information as we think is going to be helpful, but at the same time, like you said, we don't want to drown our users in useless information.
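The fields Natasha lists (who, API or UI, when, what, plus open fields for detail such as before and after values) suggest a record shaped roughly like the one below. The field names and the example values are hypothetical; this is a sketch of the idea, not the actual audit log schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class AuditEvent:
    """One audit log record; all field names are illustrative."""

    actor: str        # Who made the change.
    interface: str    # How it was made: "api" or "ui".
    when: datetime    # When the change was made.
    action: str       # What was changed, as a customer-meaningful action.
    details: dict[str, Any] = field(default_factory=dict)  # Open, product-specific fields.

    def __post_init__(self) -> None:
        if self.interface not in ("api", "ui"):
            raise ValueError("interface must be 'api' or 'ui'")


# A hypothetical DNS record change, with before and after values in the open fields.
event = AuditEvent(
    actor="user@example.com",
    interface="ui",
    when=datetime(2022, 10, 12, 6, 0, tzinfo=timezone.utc),
    action="dns_record_update",
    details={"record": "www.example.com", "before": "192.0.2.1", "after": "192.0.2.2"},
)
```

Keeping the common fields fixed and pushing product-specific data into `details` is one way to get the standardization customers ask for while still letting each producing team decide what extra context is useful.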
Well, that actually kind of brings me to the kind of customer perspective.
So, obviously these are integral for our customers. What are some of the common asks that we get from customers and how do you guys think about the roadmap for audit logs?
Most of the asks that we get from customers are to make things more standardized.
So I want to look at all of my audit logs across all of my products and be able to tell exactly who did everything.
And we want to make sure that that information is shown the same across all of our different types of audit logs.
So that's what we're working pretty hard on these days. And then like we had mentioned before, just making sure that they're as accessible as possible and they can get sent to as many of our customers' downstream systems as they need.
Yeah, that's great.
I think there's an interesting product and engineering management problem here as well: you've got a company, Cloudflare, which is exploding, and it's got so many teams and so many new products and services.
And here's something they all need to do.
It's an important requirement. Otherwise, they'll be like, wait a minute, I made a change.
Where is that? And I think that's another interesting challenge here is like Rajesh, making that on-ramp as easy as possible and then building the system so that everybody knows this is part of the product template, of the PRD template.
And don't forget, you've got to also integrate with audit logs, but don't worry, Natasha and Rajesh are there to help you. That is, I think, one of our great success stories and a pattern we've seen repeat over and over again, Jen, as we've grown.
Well, and my product management brain, when Rajesh was talking about those on-ramps, was just kind of going off because I think that it's, you know, obviously we spend a lot of time talking about it and focusing and celebrating publicly the things that we ship and put in the hands of our customers.
I think, you know, it is sort of something that's not necessarily always transparent to the outside, but critical to our success is building the tools and the platforms internally to make the development and the integration as easy as possible.
And that's something I really applaud about the work that this team has done, is really like taking the needs of those internal customers and really thinking about how do we make this seamless, simple, and scalable.
And I think audit logs is a great example of that journey.
Yeah, literally the services for the applications team, which is, Rajesh, ultimately why I felt like, okay, I've figured it out, let's just call the team application services, because at the end of the day, everything it does falls into that.
I know there was still a lot of debate.
There's a lot to a name. There's a lot to a name. That's okay. That's okay.
Thank you both for joining us. I can't believe how fast the time went.
This was so much fun. Can't wait to have you back in a couple months to tell us all the new things and new tricks, new services, new modalities, how we're going to tell people how they can see what's going on and what other services you've both brought.
Thank you. Thank you both for joining us. Thank you for having us. All right.
Bye, everybody. Have a good weekend, everyone.