Latest from Product and Engineering
Presented by: Usman Muzaffar, Jen Taylor, Jamie Herre, Jon Levine
Originally aired on July 8, 2022 @ 4:30 AM - 5:00 AM EDT
Join Cloudflare's Head of Product, Jen Taylor and Head of Engineering, Usman Muzaffar, for a quick recap of everything that shipped in the last week. Covers both new features and enhancements on Cloudflare products and the technology under the hood.
Transcript (Beta)
Hi, I'm Jen Taylor.
Welcome to another installment of Latest from Product and Engineering.
Hi, Jen. I'm Usman Muzaffar, Cloudflare's head of engineering. Nice to see you again.
Nice to see you again as well. Some of our favorite employees here. We don't have favorites.
We love everyone equally, but we're very excited. These guys are great.
So, Jon and Jamie, can you take a second to introduce yourselves? Hi, yeah, I'm Jon Levine and I'm product manager for our data products.
Hi, I'm Jamie Herre.
I'm a director of engineering and I also work on data products. Okay, so hold on a second.
What the heck's a data product? Yeah, so Cloudflare does a ton of stuff.
We make websites, anything on the Internet faster and more secure. We have a firewall.
We have a CDN. We have a secure web gateway. We do so much stuff. And all of these products generate metadata about what happened on our network.
And Cloudflare needs that data so we can manage our network, send people bills, protect against threats.
And our customers need that data for the same things, right? They need to make the best use of Cloudflare.
They need to understand their own infrastructure.
And they do that through the metadata that our team helps provide to them.
And we do that in a bunch of different ways, which we'll talk about. So, analytics, which you might see in our dashboards or in our GraphQL APIs.
And we also have logs and alerts and lots of other data products.
Okay, hold on a second, though.
What do we serve? 25 million Internet properties around the globe?
I mean, how much data are we talking about here? Yeah, so at peak, 28 million events per second flow through our system, which is kind of a...
Per second. Per second.
Second. Per second, yeah. It is a mind-boggling number. And, you know, Jamie, we'll talk more about that.
But even just at the roughest sketch, like, how?
How does that even work? Like, the data is being collected and we're noticing these events in our edge servers all around the world.
So, like, how do we even sum?
Like, where's the query that knows that we even... How did we even count them when it's spread over 10,000 computers around the world?
What is the rough answer to, like, how did you do this?
Well, with great difficulty, I guess, is the answer. It is surprisingly harder than it sounds.
And when we get into the scale that Cloudflare operates, some of the standard operating procedures that, like, everyone does, it turns out to be challenging.
So, in general terms, what we do is we collect this metadata on every machine everywhere in the world.
And then we have software that we wrote ourselves that packages it and sends it to our core data center.
And there we receive it, right?
I mean, it's a very simple word. We actually have software called Log Receiver that receives these logs.
And it operates at a very high throughput.
And it's very highly optimized to be able to do this. And then we send it through a Kafka cluster.
And then we do all sorts of crazy things with it. I mean, what we greatly improved in our architecture is the ability to kind of put these pipes going to different places.
And so, we really leveraged this open source software called Clickhouse to do the kinds of aggregation where you say, like, just how many of them were there?
We made a lot of effort to remove as many unnecessary pieces from this pipeline as possible.
And that also helps us optimize it quite a bit.
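To make the counting question concrete (how do you answer "how many were there?" when events come from thousands of machines), here is a minimal Python sketch. It is not Cloudflare's pipeline, which the speakers describe as Log Receiver, Kafka, and ClickHouse. The function name, field names, and sample batches are hypothetical, and the sketch only shows the general idea of counting events per key as batches arrive from many machines.

```python
from collections import Counter
from typing import Iterable


def aggregate(batches: Iterable[list[dict]]) -> Counter:
    """Count events per (zone, cache_status) across batches from many machines."""
    totals: Counter = Counter()
    for batch in batches:
        for event in batch:
            totals[(event["zone"], event["cache_status"])] += 1
    return totals


# Hypothetical batches, as if packaged by two different edge machines.
batch_from_machine_a = [
    {"zone": "example.com", "cache_status": "hit"},
    {"zone": "example.com", "cache_status": "miss"},
]
batch_from_machine_b = [
    {"zone": "example.com", "cache_status": "hit"},
    {"zone": "example.org", "cache_status": "hit"},
]

totals = aggregate([batch_from_machine_a, batch_from_machine_b])
print(totals[("example.com", "hit")])  # 2
```

Because counts simply add up, the same answer comes out no matter which machine a given event arrived from or in what order the batches are processed.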
So, we don't have that many moving pieces, but they move really fast. Yeah. I think one of your senior engineers once said to me, like, the crux of this is, like, how much can we strip this thing down?
It is just, you know, there's no body panel.
It is just a pure engine that receives this firehose of information. Firehose is the wrong analogy.
It's way beyond that. Waterfalls. Niagara Falls. Yeah.
It's coming in and then actually sorting it and putting it into systems so that the rest of the engineering teams at Cloudflare can build all these great visualizations and all these great insights.
And, you know, whether it's analytics or logs or bills, like, they're all there.
And that's the answer to data products.
Okay. Hold on a second. Like, you guys might have done a great job, like, stripping down the car and getting it, like, built for speed.
But in order for the things on the other side to be able to take that data, right, we need, like, how do you kind of gear the machines, I'm mixing my analogies now, to a place and a way in which other product teams can actually take this and use it.
Because I can't imagine what, you know, drawing charts with, you know, 28 million, you know, messages per second would look like.
Yeah. So one thing Jamie and I spend a lot of time on is actually helping other teams design their schema, which might sound boring, but it's super important and I find it a fascinating part of this job, which is to- You're the Picasso of schemas?
I like to think of myself. Would you tell Picasso to sell his guitars?
So, you know, we spend a lot of time thinking, it requires you to think really deeply about what these products are trying to do.
So for our CDN, like, what is our CDN for? What do customers need to know about our CDN, about our WAF, about our secure web gateway?
When you think about all these products, you can, and you think deeply about the questions people are trying to answer, and the hard-won lessons of building many of these systems, we can come up with good schemas.
And a good schema is amazing because it lets you serve very efficient analytics, it lets you deliver logs to customers, and then when they get the logs, they understand them, and they know how to make use of them.
And it's never perfect, it's always a work in progress, but that's a big part of it.
And a big part of that is making sure that not only can we receive all the information, but that we can serve it back. Because it's one thing to have an awesome system for getting all these little pieces of data and, you know, carefully sorting all these M&Ms into all the jars, but unless you can actually pull it back and serve it to the customer in a way that makes sense, you know, what was the point?
So, one of the technologies that your team helped invent, Jamie, is adaptive bit rate, which is a term that I used to hear more when I thought of streaming video.
To me, that's when I'm watching Dark Knight and my net connection blips off for a second, then, you know, it might get a little blurry, but I'm still watching the movie, I'm not taken out of the film, and, you know, I'm still, the thrust of what I need is there.
And so, what is the problem that ABR was trying to solve and what is the analogy there with what are we adapting to and what is the rate that we are trying to optimize?
Yeah, we chose that acronym, you know, exactly to evoke that analogy to video.
The cool thing about that, if you think about it as a user, is that it doesn't fail.
So, like, you might be walking through some crowded hallway and your cell phone reception is bad, and instead of just failing completely, it degrades the video.
But depending on what you're watching, like, if you're watching us, maybe, it's not that important that it be super high resolution.
And so, it's better to do it that way.
We realized, by looking at use cases, that if we thought about both the schema, as Jon said, and the questions that people are trying to answer, then we could make intelligent decisions about the best way to do that.
For instance, we said just a few minutes ago, Cloudflare does some, you know, 25 million, 28 million requests a second.
Well, like, I didn't say, like, 28 million, 342,000, 100 and, like, that doesn't really matter.
All that matters is, like, hey, it's 28, like, and so, in order to derive that number, we don't actually need to count every one.
And using proper statistics and some math and maybe a little hand-waving, we can get to the point where we can answer the questions in tens of milliseconds that are good enough from the point of view of the customer.
Like, how many requests did I have? How many cache misses?
What was the ratio? Well, those we can answer really fast without having full precision.
And one thing I want to say to you is, like, this difference between precision and accuracy.
Like, sometimes people come to me and they say, oh, you're sampling, but that means the data is not accurate, right?
Well, no, it can be extremely accurate.
We actually put, like, formal, like, mathematical bounds.
Like, we're 99% confident that the accuracy is bounded by this much. Now, to be clear, like, we should do a better job, maybe, of communicating and providing those bounds.
But, like, to two significant, we're often quite sure to two significant figures or three significant figures that this is the right answer.
So, it is accurate.
We might be missing the 10 digits of precision of counting every one. And the fact that the precision was traded to ensure low latency and getting that information to the customer in a very accurate, predictable way, like, that's a great thing to have because it means our customers can see what's going on as their traffic is going through the net.
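The precision-versus-accuracy point can be sketched with a few lines of Python. This is not the ABR implementation, whose details are not given here. It assumes the simplest case, a fixed sampling rate where every event is kept independently with the same probability, and uses a normal approximation for the 99% bound. The function name and the numbers are hypothetical.

```python
import math
import random

Z_99 = 2.576  # two-sided 99% z-value for the normal approximation


def estimate_total(sampled_count: int, sample_rate: float) -> tuple[float, float]:
    """Return (estimate, margin) for the true event count.

    Assumes each event was kept independently with probability sample_rate.
    """
    estimate = sampled_count / sample_rate
    # Binomial sampling: the variance of the sampled count is roughly
    # sampled_count * (1 - sample_rate); dividing by the rate scales it up.
    std_error = math.sqrt(sampled_count * (1.0 - sample_rate)) / sample_rate
    return estimate, Z_99 * std_error


# Simulate: keep about 5% of 200,000 events, then estimate the total.
rng = random.Random(42)
true_total = 200_000
sample_rate = 0.05
sampled = sum(1 for _ in range(true_total) if rng.random() < sample_rate)

estimate, margin = estimate_total(sampled, sample_rate)
print(f"estimate = {estimate:,.0f} +/- {margin:,.0f} (99% confidence)")
```

With around 10,000 sampled events the margin is a few percent of the total, which is the "two or three significant figures" level of confidence described above, obtained from a twentieth of the data.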
Yeah. Speed is really underrated. Just really quick, speed is underrated as a feature.
Like, it helps us scale that we can have a lot of people, like, access the data.
It also helps it that, like, you can use it.
Like, that means if a query loads in a second, that means I can run more queries.
I can be like, oh, let me change this, right? Yeah. And it's like when people have questions of their data, it's not like, you know, calling customer service and it's like, well, I want to give you hold music while we calculate it.
It's like when people are having questions of their data, like, something urgent, they have an urgent need.
And so, making sure that we have the ability to be responsive and give that to them quickly, either within the context of the dashboard itself or via our logs is critical.
Absolutely. Yeah. Awesome. Let's talk about, so we talked about streaming, we talked about logs, let's talk about log streaming, which is actually, those two words are leading us in the direction, but when we say log streaming, we're talking about something quite different and quite innovative.
So, I'll just leave the question at that. Jon, what is log streaming and how does that change the picture?
Yeah. So, behind the scenes, we were just talking about the importance of speed.
We've been making a bunch of under-the-hood improvements to get our raw logs to customers much, much, much faster than we did before.
And I also want to add in a way that's much more scalable and more resilient and more flexible as well.
And what I mean by that is, so I'll start with flexible.
I keep talking about, we keep adding products, we keep adding schema.
We want our product teams to work really fast and to deliver new things and to add more schema as our products change.
And with our old system, it was hard to do that.
It was hard to keep up with the changes. And then when I say scalable and resilient, what I mean is we want to, this system needs to work.
Like when customers want to get their logs, like they really rely on their logs and we don't want one customer to hog all the resources.
We don't want an outage in one part of our network or just one part of our stack to maybe spill over.
We want to really contain problems.
And then, you know, I think the speed we just talked about, it kind of speaks for itself.
So, we have a new system called Logstream that kind of helps us do all of that.
Maybe, I don't know, Jamie, you want to talk more about how that works, how we built that?
Sure. So, there's a, we have an analogy for this too.
And I'll tell you what it is. In the old days, when stores needed new goods, they would send an order and then there would be a warehouse and whatever, like the big rolls of toilet paper would be sent out to big box stores all over the country from some warehouse where they had been delivered earlier.
But in the 1980s, some of these retailers figured out like they didn't need to do that.
And actually, the goods arrive at a facility from a container ship or someplace, and then they're loaded directly onto the trucks to the store.
So, there is no warehouse.
And that's the analogy for what we've done with Logstream.
We used to get all this data in our core data center, and then organize it, and then get ready to send it to the customer, and then send it to the customer.
And now what we do is we just, as soon as it comes in the one door, we push it out the other door.
And this is better for everyone, we think, because you wind up getting the logs much faster.
I mean, much faster. And then the other issue is like, we actually don't have them.
So, for many of our customers, maybe this contains important information, they might be concerned like what's happening to this information.
We get it in their hands as soon as possible, and we don't have it.
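The warehouse versus loading-dock analogy maps onto two ways of writing a delivery loop. The Python sketch below is only an illustration of that contrast, not Logstream itself: both function names are hypothetical, `send` stands in for whatever pushes logs to a customer's destination, and sorting stands in for the "organize" step.

```python
from typing import Callable, Iterable


def warehouse_delivery(
    incoming: Iterable[list[str]], send: Callable[[list[str]], None]
) -> None:
    """Old style: receive everything, store it, organize it, then send."""
    stored = [line for batch in incoming for line in batch]  # held at rest
    send(sorted(stored))


def cross_dock_delivery(
    incoming: Iterable[list[str]], send: Callable[[list[str]], None]
) -> None:
    """New style: forward each batch as soon as it arrives, keeping nothing."""
    for batch in incoming:
        send(batch)


delivered: list[list[str]] = []
cross_dock_delivery([["log-1", "log-2"], ["log-3"]], delivered.append)
print(delivered)  # [['log-1', 'log-2'], ['log-3']]
```

In the second function the first batch is on its way before the second one has arrived, and no copy remains once the loop finishes, which is the "in one door, out the other" property described above.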
This is better for all of us. So, you're basically helping Costco deliver toilet paper more quickly.
Exactly. The essentials of life. She almost said it with deadpan delivery.
I was very impressed. So serious. But there are two other things that I think come on top of what you were talking about.
I think the first is, and I'm thinking about it from things that this phenomenal team has done lately.
I mean, the first is, you sort of touched on already, is being able to be more respectful around the information that we are capturing ourselves and getting visibility into ourselves, and the information we're not looking at.
And this is obviously increasingly a concern for companies that are operating across the globe, trying to comply with basically the patchwork of different regulatory challenges.
Let's just start there. Talk to me about how this fits together, right? Because data is at once this powerful asset.
It's also this toxic nuclear material if it contains PII or things that can basically create liability.
Yeah, I'm glad you asked about that.
And of course, what's top of mind for a lot of folks, especially if you're tuning in from Europe, is the Schrems II decision last year, which really just changes the way that personal data about Europeans can be transferred back to the US.
So this is a recent change in Europe, but it's resonated around the world.
More and more and more jurisdictions have requirements around how this data can be handled.
So I think the first and most important principle here is just actually what our business model is.
We need to help protect our customers.
We need to make their websites and all their Internet properties faster and more secure.
But we're just not in the business of selling personal information.
We don't have an ad network. It's just not what we do. And that's kind of baked into our DNA.
So I think that's really the bedrock. The second thing is what you might think of as privacy by design.
So making sure that when we design these schemas and we think about the metadata that we collect, we're really collecting the bare minimum needed to do our job.
This is something that's not abstract.
We really do talk about this all the time when we design schemas and we say, we can get into nerd stuff like, how many bits of entropy here?
Could someone be re-identified based on having these high cardinality fields?
And we think about this stuff a lot.
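For readers curious what "bits of entropy" and "high cardinality" look like in practice, here is a small Python sketch. It is a generic illustration, not a description of Cloudflare's schema review process: the field names and values are made up, and the two helper functions are hypothetical. It measures the Shannon entropy of a field and the size of the rarest combination of field values (the k in k-anonymity).

```python
import math
from collections import Counter


def entropy_bits(values: list[str]) -> float:
    """Shannon entropy of a field's observed values, in bits."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def smallest_group(rows: list[tuple]) -> int:
    """Size of the rarest combination of field values across all rows."""
    return min(Counter(rows).values())


# Hypothetical log fields for four requests.
countries = ["US", "US", "DE", "DE"]
client_ids = ["c1", "c2", "c3", "c4"]

print(entropy_bits(countries))   # 1.0 (low cardinality)
print(entropy_bits(client_ids))  # 2.0 (every value is unique)

# Logging only the country leaves each request in a group of two.
print(smallest_group([(c,) for c in countries]))  # 2
# Adding the high-cardinality field makes every row unique.
print(smallest_group(list(zip(countries, client_ids))))  # 1
```

A group size of one means a single row can be singled out, which is why a high-cardinality field deserves a second look before it goes into a schema.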
And so careful schema design is a big part of that. So for folks who don't know, at our edge, we do see a lot of sensitive stuff.
There's sensitive traffic that flows through our network.
And we often need to decrypt it to prevent threats.
But then we're not logging someone's bank account information.
We're just logging metadata about that request, which just doesn't contain that much fundamentally based on what we see at the edge.
So I would say those are the most important principles of how we handle that.
I would add a caveat. We don't mean to imply that we didn't used to take this seriously.
We've always taken it seriously. We've always tried to take good care.
We have no interest in doing anything otherwise. It's always been like that.
But we are realizing that there are certain advantages kind of across the board.
So it's good for us. It's good for the customer.
It's good for the people out there on the Internet. It makes the Internet better.
And they leverage each other. So by making our logs faster, we also made them more private.
That's great. Everybody wins. Yeah. Just to build on it, because I don't want people to think there's a contradiction between Jamie saying maybe...
I said, hey, we don't collect things. And Jamie said, hey, but customers don't want us to see it.
So there's maybe two examples of why that might sound in tension.
There are people who, even though we don't have sensitive stuff, it doesn't really matter.
And our customers say the data just needs to be in this place, this geographic location.
And Logstream opens the door for us to align with that, where we say, it's okay.
We get it. The logs need to live in this jurisdiction.
We'll just get you the logs there. The second thing is I said we don't want to log things that are very sensitive.
But actually, sometimes our customers do.
So we don't do it by default. But a customer may want to be like, hey, there's this cookie that I need for me.
And if our customers are using it, that's really up to them.
And so we want to make sure that they have access to that data and that we don't.
Well, I think that that's one of the things I really appreciate about the way that we build products at Cloudflare.
And I think in particular, the way that you all have approached this problem space, which is, honestly, it's not up to us.
It's not up to us to have an opinion one way or the other.
What we do is we give you the tools that empower you to basically take action on the decisions that you make.
Regulations change. People have different interpretations of them.
People have different needs, operate in different businesses, operate in different jurisdictions.
We don't claim to have the market cornered on an opinion or a decision here.
And so it's really about giving you those tools.
And fundamentally, in my mind, that's the heart and soul of what it means for us to be a data platform.
Absolutely. Absolutely.
We were just saying the raw logs aren't very, they're not very useful, like on their own, right?
You have to do something. The analogy, like, sorry if they're vegetarians, but like the analogy is that if the, you know, if the data, if the insights you want are the food, like the hamburger, then the raw logs are like the ground beef, you still have to cook it.
Or even more like the grain that you feed the cattle.
We can extend that further. We spent a lot of time working on the analogy.
We're workshopping them, folks. It's okay. We might need to work on this one.
Okay. That's how I think about it. So customers still need to cook their meal. How's that?
And so we're proud. We do a lot of great stuff, I think, to do that with our analytics products.
But we also, we know customers, there's lots of stuff they want to do.
They often just want to see the data in one place with all the other data that they have from their origin servers, from other providers that they have, and totally support that, want to help them do that.
And so there was actually a relatively small set of tools that people used to do this.
So Splunk is one of our partners, super popular, Datadog, Sumo Logic, New Relic, LogDNA, QRadar.
I can list about like 10 of these, and that covers most of what people want to do.
And so what we've been doing recently is we've just been trying to make our customers' lives easier.
We've said, we've had relationships with a lot of these partners for a while, and we have some nice prebuilt dashboards.
We want to make it even easier.
So the goal is for, you go into your Cloudflare dashboard, click a couple buttons, type in your Splunk credentials, and the logs should just appear in Splunk.
We don't want you to have to do data engineering work to get the logs there.
We do the data engineering work. It's actually really hard. This is what we do.
We think about it, we want to take care of that so customers don't have to do it.
So we just announced or just released a direct Splunk integration and a direct Datadog logs integration.
We've had Sumo Logic for a while and expect more of these to come out.
And we are hard at work on a UI for these as well. And so this means that these are first-class integrations that you can configure directly from our dashboard or from our API.
Cloudflare will send this information, exactly the information you want.
Just like Jen was talking about, the stuff that you care about through us into your system, giving you exactly the control and visibility you want.
It's really a great, powerful story. There's one more question I want to talk about, because this came up a couple of years ago, and it required a lot of thought.
And at the time, I was looking after the data team directly myself.
And that is, what is the right way to ask questions of a giant data store? In some ways, this is a very old problem.
There's Structured Query Language; SQL has been around for probably 40 years now, 35 years at least.
And so that's sort of the default answer as well.
Let's make a database. But there's a new kid on the block that showed up four or five years ago, which is a smarter, more intelligent interface: GraphQL.
I think it originally came from Facebook. And so talk a little bit about what were some of the decisions that went in there?
And what happened after we made a GraphQL interface available for teams inside Cloudflare, for teams outside Cloudflare?
How did that change how people interacted with their data?
Jamie, do you want to take a stab? I can try, yeah. You know, when we talk about GraphQL, it's really exciting.
It's a big improvement, but we don't know if it's the best or the only or the ultimate way of accessing data.
What's interesting about it is that, and for those people who may not be familiar with GraphQL, it's sort of like, if you're familiar with REST APIs, GraphQL is not that.
It's actually like kind of competes with that, and it's a different way of organizing APIs, particularly in cases where the underlying data is structured.
And so this is how we tie into the schema.
So using GraphQL, we can expose the universe of Cloudflare data as a schema in a hierarchical layer of nodes that makes sense to people or makes sense to the computers that they're using anyway.
The way I try to describe it is, I'm going to ask a question, and I want your answer to take this form.
So you can sort of say, by the way, when you respond to me, make sure it looks like this.
And that's a very powerful thing to be able to ask somebody. I want it to come back in this form because I'm going to glue it into a UI or an app or whatever.
The underlying data is complex. There's a lot of schema. There's a lot of fields.
These fields can have a lot of values. Having, frankly, just a simple REST API, it's very, very hard to build that expressiveness in without just pasting in SQL in a kind of janky way.
GraphQL is a structured way that lets you get the richness of the data.
It's pretty cool. And then very cleverly, the GraphQL design requires those questions that you ask to be very regular.
So you can't ask any question.
You can only ask the questions that fit into this schema, which is basically all the use cases that we know about anyway.
But because it's so regular and you can't ask any question in the world, it can be very fast.
And so we've tried to optimize this, and then we integrated it with our ABR technology.
So it's both clever and fast.
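Two ideas from this part of the conversation, "I want your answer to take this form" and "you can only ask the questions that fit into this schema", can be sketched in plain Python. This is not GraphQL and not Cloudflare's API: nested dictionaries stand in for the query language, and the schema, field names, data, and `select` function are all hypothetical.

```python
# A hypothetical schema: nested dicts are branches, types are leaf fields.
SCHEMA = {
    "zone": {
        "name": str,
        "requests": {"total": int, "cached": int},
    }
}


def select(data: dict, selection: dict, schema: dict) -> dict:
    """Return only the requested fields, in the requested shape."""
    result = {}
    for field, sub in selection.items():
        if field not in schema:
            # Questions that do not fit the schema are rejected up front.
            raise ValueError(f"unknown field: {field}")
        if isinstance(schema[field], dict):
            if not isinstance(sub, dict):
                raise ValueError(f"{field} needs a sub-selection")
            result[field] = select(data[field], sub, schema[field])
        else:
            result[field] = data[field]
    return result


data = {
    "zone": {
        "name": "example.com",
        "requests": {"total": 1200, "cached": 900},
    }
}

# The caller describes the shape of the answer it wants back.
query = {"zone": {"requests": {"total": True, "cached": True}}}
answer = select(data, query, SCHEMA)
print(answer)  # {'zone': {'requests': {'total': 1200, 'cached': 900}}}
```

The answer mirrors the query's structure, so it can be glued straight into a UI, and because every legal question is known from the schema in advance, a server has room to optimize for exactly those questions.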
Are people familiar with the expression dog food?
So we dogfooded this and our own dashboard uses it.
And so front-end developers, people are familiar with that environment.
It works really well with GraphQL and this way of seeing the world. And so you can put together a dashboard quickly, and we think our customers could use it the same way.
So it's not a drop-in replacement for the old REST APIs. It's like a new and better world.
A new and better way. It just unlocked a renaissance of this explosion of new analytics dashboards in the Cloudflare product.
Because the front-end team were cheering this technology the whole way with how easily and quickly they can build really interesting new, all the way full-blown cockpits that let our users slice and dice the data in so many different ways.
Turning the corner, one of the things that's cool about all this innovation is it allows us to retire yesterday's innovation.
And one of the things you guys were just recently able to do was actually retire the Zone Analytics API.
Monday. Monday.
I got ahead of myself. It's been years in the making. Talk to me a little bit about that.
Talk to me about why do we deprecate things, and how do we go through that process, and what went into this one?
I just want to start by saying it's not a light decision, and it took me a while to really come on board, honestly.
Because we know that our customers build on these APIs, and they depend on them.
It's a lot of work to build against one API, let alone to have to shift to a new one.
And as we said, GraphQL is a new way of thinking about things.
The problem is that in order to support all the cool new stuff that we're doing, there are limits that we have to impose on the way people read data.
And we need to discourage certain usage patterns and encourage other patterns.
So this interactive querying is awesome.
When I say interactive querying, I'm like, what do you do in a dashboard?
Do I want to add a filter? Do I want to change the time range?
I'm doing it live, and I want to make a lot of queries. But the automated, periodic scraping might have to happen in a different way.
Or you might have to space out your queries instead of doing them all at the same time.
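For an automated job, "spacing out your queries" can be as simple as computing evenly spread start times instead of firing everything at once. The Python sketch below is a generic illustration: the actual limits are not stated in this conversation, and the function name, query count, and window length are hypothetical.

```python
def spaced_offsets(num_queries: int, window_seconds: float) -> list[float]:
    """Evenly spread query start times across a window instead of bursting."""
    if num_queries <= 0:
        return []
    interval = window_seconds / num_queries
    return [i * interval for i in range(num_queries)]


# Four periodic queries spread over one minute rather than sent together.
print(spaced_offsets(4, 60.0))  # [0.0, 15.0, 30.0, 45.0]
```

A scheduler would then wait until each offset before sending the corresponding query, keeping the job's request rate flat.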
And so we've changed the limits.
And that's actually really the most fundamental thing in that shift to GraphQL.
And what that means is we can't continue to support both patterns.
If we're committed to the new thing, it's time to retire the old thing so that we can spend those resources, our time and our computing resources.
And one of the things we think through as we think through this at Cloudflare is we don't take these kinds of decisions lightly.
Our focus is making sure we do not allow the deprecation of any existing technology unless we are able to provide a replacement that is better than what we have to offer.
And then we make sure that we have a time and a process by which customers can switch over seamlessly.
But I think based on what we've seen just in the way that we've transitioned from using this analytics API to GraphQL and stuff internally, I think we believe that we are leaps and bounds ahead of where we've been in the past.
Yeah, that's great. So, data team, I think you guys should get T-shirts.
Clever and fast. Yes. What about Picasso of schemas?
I love that one too. If they get team jerseys, then JPL gets to get that on the back of his jersey.
That's right. And with an asterisk, still working on the right analogies.
Jon and Jamie, thank you so much for joining.
It's always a pleasure to talk to you guys.
I'm so proud to be part of your team and all the amazing stuff. Jen, always great to see you.
Thanks everyone for watching. We will see you next week on Latest from Product and Engineering.
Thanks everyone. Bye.
Bye.