Latest from Product and Engineering
Presented by: Jen Taylor, Jon Levine, Ben Yule
Originally aired on February 18, 2022 @ 1:00 AM - 1:30 AM EST
Join Cloudflare's Head of Product, Jen Taylor, for a quick recap of everything that shipped in the last week. Covers both new features and enhancements on Cloudflare products and the technology under the hood.
English
Product
Engineering
Transcript (Beta)
All right. Hi everyone. Welcome to Cloudflare TV. I'm Jon Levine. I'm our product manager for data analytics.
Ben and Jen, do you want to introduce yourselves?
Yeah. Hi everyone. My name is Ben Yule. I'm an engineering manager here at Cloudflare on the data team.
I'm Jen Taylor.
Welcome to Latest from Product and Eng. I'm thrilled to have the data team in the house.
This team has just been killing it on a lot of different fronts.
Sad to not be joined by my co-pilot Usman, but he's out on some very well-deserved PTO.
So if you're a regular watcher, you can just sort of doodle him in on your screen there.
So welcome, data team. You know, JPL, we've had you on before, but Ben, you're new.
That's right. Yeah. I joined about five months ago now.
Oh my God. Where'd you join us from? I was previously the CTO of a digital health startup, much smaller than Cloudflare, before this.
Got it. So you're going from digital health to healthy data.
Exactly. Yeah. A little bit different. A lot of data, a lot of data, a lot of data, but yeah.
Yeah. And I had a chance actually to meet one of your, your new team members this week.
So you're also growing your team.
That's right. We're always hiring, especially right now. So engineers primarily.
Awesome. Awesome. Okay. But so here's the other thing, like, I was really excited to have you guys on because like, it's very rare that we actually get a team on that's just had like a press release.
Like you guys have just been like, you're making news.
Like what's the news? Yeah. The news is we have some big changes to our logging products and they just got a whole lot more powerful because we've added a number of new data sets, which we'll talk about in a minute.
So new kinds of data you can push from our products out to wherever you want to go.
And we also have new integrations with new partners. So you can more easily than ever get your data into Splunk now, into Datadog, into Azure Sentinel, and into actually any storage partner that has an S3 compatible API, which makes things much easier for customers.
And that just unlocks a whole lot of new possibilities about what customers can do with the data that we generate.
That's awesome. That's awesome.
So talk to me a little bit about like, okay, hold on, like backing up, like what are customers doing?
Like why does this matter? Yeah. I think one thing that's really interesting about Cloudflare is folks who are listening probably know we run a massive global network, incredible scale, but there's just a huge number of kinds of services that we have on our network.
So we have a CDN or an HTTP reverse proxy, people make HTTP requests, they go through us.
That's probably what we're best known for.
But we also have a WAF, a web application firewall, and that's generating a lot of data.
But we also have, we use technology called NEL, which stands for network error logging.
And we collect reports that browsers send back to our network if they're having trouble reaching a website.
And we have data about our Teams products, right?
Our forward proxy products like Gateway or Access, which is a Zero Trust solution.
And all these things are generating data.
They're giving you visibility into what's happening with our products.
And so any one of these data sets has a ton of use cases. So very commonly for CDN, customers want to know, how much traffic do I have?
That's the most fundamental question.
Are there errors? What's my cache rate like? And getting into the firewall, you might want to know, fundamentally, am I under attack?
One of the things that's interesting about these integrations with partners is if you pull the data into something like Splunk, what you see people do is, we have analytics in our dashboard, but what you may want to do in something like Splunk is say, hey, I'm seeing an attack from this IP address, and I want to double click on that IP address and see where else is that.
Is that sending me attack traffic through any other service?
So if you have your origin application logs going in, you can see it in the same place there.
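As a rough illustration of the "double click on that IP address" workflow described above, here is a minimal sketch that groups records from two datasets under the client IP they mention. The record shapes and field names (`client_ip`, `action`, `path`, `status`) and the helper names are assumptions made up for this example; they are not Cloudflare's log schema or Splunk's query language.

```python
from collections import defaultdict

# Hypothetical, simplified log records; real field names will differ.
firewall_events = [
    {"client_ip": "203.0.113.7", "action": "block"},
    {"client_ip": "198.51.100.2", "action": "allow"},
]
origin_logs = [
    {"client_ip": "203.0.113.7", "path": "/login", "status": 401},
    {"client_ip": "203.0.113.7", "path": "/admin", "status": 403},
    {"client_ip": "192.0.2.44", "path": "/", "status": 200},
]


def index_by_ip(*datasets):
    """Group records from every (name, records) dataset by client IP."""
    by_ip = defaultdict(list)
    for name, records in datasets:
        for record in records:
            by_ip[record["client_ip"]].append((name, record))
    return by_ip


def activity_for(ip, by_ip):
    """Return every record, across all datasets, that involves this IP."""
    return by_ip.get(ip, [])


by_ip = index_by_ip(("firewall", firewall_events), ("origin", origin_logs))
for dataset, record in activity_for("203.0.113.7", by_ip):
    print(dataset, record)
```

Having both datasets in one place is what makes the lookup a single step: the same IP shows up once in the firewall data and twice in the origin logs.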
Got it. So that's a lot of different data sets going to a lot of different places.
Ben, how do we make that happen? Yeah. I think that's the interesting challenge that we have here at Cloudflare.
And it's one of the reasons why our data pipeline is very unique, in that, I think a lot of what Jon is really talking about here is that every single data point that we collect is actually really valuable and important to a customer.
And it's very often that we get a customer that says, hey, this weird thing happened somewhere at the edge, and I need you guys to help me figure out what was going on.
And we handle literally tens of millions of requests every single second that we need to provide visibility to our customers over.
And so that's really the hard part, and we make all that happen right now.
Got it. So to get all these different data sets to all these different destinations, what are some of the key things that we had to make happen that we didn't have before?
Yeah. There's a number of them. I think really the biggest one is traditionally we've had some basic destinations, but it's really creating all the logic that it actually takes to interface with the partner's APIs and making sure that we can do that in a way that, A, makes it really simple for our customers to click a few buttons and get the integration to actually work.
And so it was really just fine-tuning all the parameters and everything else that it takes to actually get it into the partner's system and make sure that that can work reliably and the customer doesn't have to constantly worry about monitoring it and making sure that it's not throwing errors and everything else.
And so I think that's really the magic of what we were able to do there.
Right. Taking those kind of complicated dynamics and making it simple and seamless for people to pull in.
I think that's the biggest thing is because you could get your data into Splunk before and you could push it into your S3 bucket or something, but it was just a little bit.
It was kind of brittle and hard to do that. Also, I want to give shout-outs to two groups who helped out a lot here.
One is the other teams at Cloudflare.
In some ways, it's their data that we're taking and we couldn't do it without those teams.
And the first question I always like to say is, what is the schema?
What is the data we're pushing? And so a lot of the hard work we do is actually just working with those teams on defining that.
And if you have a good schema, everything flows nicely from that.
Everything's really clear at that point.
The other thing I think that was really valuable is just working with the partners to actually, now that the data is in Splunk, well, it's really nice if you can just have some dashboards that are ready made for you about how to use the data.
So folks in our partnerships team work with teams at these other companies. And actually, a special shout-out to the Azure Sentinel team.
I didn't even know they were working on this.
We got on a call with them before the announcement and I looked at their dashboards.
I was like, this is amazing. And they really produce a lot of great content and the community can build on that.
So they can improve things about that.
They can suggest changes, and then those will get improved for everyone rather than all of our customers separately trying to make a dashboard and keep it up to date.
That's super cool. Where do you guys want to take this from here?
What do you think is the next phase of this? Yeah. So I think it's great. And I think we're always going to have integrations.
We're always going to keep expanding integrations.
You're going to see more partners that we're going to do this with.
But we're also looking at what are people doing when the data gets pushed to Datadog?
We're looking at how do we make those capabilities happen natively on our own platform, just because it's so valuable.
So my favorite example of this is probably alerting.
So I think we just announced alerts on origin health, super cool technology.
Yeah. We had Natasha and Rajesh on last time. And that was one of the things you were talking about there.
And that's just like, boom. It's like, oh my God, that's amazing.
It's so cool. And the tech that we use to build that, I hope is going to...
And it's something that our team collaborated with Natasha and Rajesh on.
And I think it's something that we're going to see a lot more of that, a lot more kinds of alerts, a lot more flexibility, the kinds of alerts you can make.
So that's something I'm really excited about, just to pick one example. Another thing that I think is really unique we can do is, as we see customers who start to use us, not just for CDN, not just for WAF, not just for DDoS, but all those things, and they're protecting their employees with Teams, they're protecting their network with Magic Transit.
What can you do if we have the power of all those different data sets on Cloudflare?
And that's something that I think we're just at the start of, of like, hey, actually, you want to know what that IP address is in the WAF and in Magic Transit?
And it's amazing, don't get me wrong, and I'm thrilled, but I kind of think about the volume of data and the growth, the growth of the data and the growth of the services, like, you know, Ben, what are some of the things that you're thinking about in terms of making sure that this thing continues to be scalable and performant?
Because it's like, a customer wants more data in more places, but they're not willing to compromise, you know, kind of completeness of the data or the scale of the performance.
Like, how do you, what are, what's some of the magic you're running behind the scenes to kind of make that happen?
Yeah, that's a great question. And I think there's a couple different aspects.
And so the first one is, you know, obviously, I think one of the biggest challenges that we have is just, you know, how do you, how do you sort of store all this data?
And how do you index all this data in a way that, you know, really makes it such that you can look at the data and analyze it at a very high level or a very granular level?
Like, sometimes you just need, you know, a few data points a day to say, how many requests did I do today, right?
And you really need to summarize this at a very, very high level. But at the same time, and I think this is a very unique challenge of dealing with our data set, sometimes you want to find the smallest little detail in that haystack.
And so it's not that uncommon that, you know, a customer will come to us and say, hey, I have this very specific Ray ID that really ties to an exact request that happened, and I need to know why it failed, right?
And they need to dive all the way in, or they think that they're under attack.
And they need to really be able to hone in on that specific thing.
And so one of the big challenges for us is, you know, how do you index, catalog, and store this data in such a way that really lets you do both of those things in a way that doesn't cost the customer, you know, astronomical amounts of money?
There are, you know, existing database solutions, enterprise grade solutions that will let you do this if you want to spend millions and millions of dollars a year at the scales that a lot of our customers actually operate at.
But in our case, you know, we, we don't want them to have to spend all that money.
And we think we can do this much more efficiently.
And so that's really, I would say, the biggest challenge of a lot of our data pipeline: how do you enable them to look at that fine granularity when they need to, in an order of magnitude of time that, you know, makes it useful, right?
If it takes, you know, six or seven hours to get an answer to some kind of question that you're asking of your data, or a query that you run, you know, nine times out of 10 you've already moved on to something else, right?
It's not valuable. And so how to do that in a few seconds is the challenge that we're really working on right now.
Yeah. Do you want to talk about the bit about the magic of sampling, adaptive sampling?
I know we talked about it earlier, but it's worth another, I never get enough of adaptive.
I could talk adaptive sampling all day long. Exactly. Yeah.
Yeah. Adaptive sampling is this great innovation that some folks at Cloudflare actually came up with before I was even here.
And so I just, I guess I just get to have fun talking about it.
But really it's the sort of idea of as this data comes into our system there are really sort of two types of use cases that we think of.
You know, the first one is when, you know, how do you enable queries to that data in a very fast and efficient way, just like we were actually talking about.
But sometimes you don't need a very granular answer, right? You just want an estimate of how many queries came in today.
We'll actually write that data at multiple different levels of granularity in real time as it comes into our system.
And so when an analyst will go in later and actually query that data and ask questions of it, depending on the time range that they're looking at and sort of, you know, the granularity of response that they actually need, the adaptive sampling system will actually choose the table or the level of the data stream that is best going to give them the answer that they need within the margin of error that they would expect it in, in a timely manner.
And so, you know, if you're querying across, you know, days, months, years and need something that's, that's very, very fine, it's actually going to look at a data set that is specifically stored and it's going to be very, very quick for that type of sample.
And then if you need something, Hey, this thing happened two hours ago, and it's this, this little tiny detail, it's actually going to have to look at a different data stream.
And because the time range that you're actually looking at is so limited, you're actually not scanning as much data under those conditions.
So that's really kind of the magic of what we do there.
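The idea described above, writing the same stream at several levels of granularity and picking a level at query time, can be sketched roughly as follows. This is a simplified illustration, not Cloudflare's implementation: the sample intervals, the row budget, and the function names are all assumptions chosen for the example.

```python
import random

# Hypothetical sample intervals: each table keeps roughly 1 of every N events.
SAMPLE_INTERVALS = [1, 10, 100]


def ingest(events, rng):
    """Write each event to every table whose sampling coin flip it passes."""
    tables = {n: [] for n in SAMPLE_INTERVALS}
    for event in events:
        for n in SAMPLE_INTERVALS:
            if rng.random() < 1.0 / n:
                tables[n].append(event)
    return tables


def choose_interval(range_seconds, max_rows_scanned, events_per_second):
    """Pick the most detailed table that keeps the scan under a row budget."""
    for n in SAMPLE_INTERVALS:  # ordered from most to least detailed
        if range_seconds * events_per_second / n <= max_rows_scanned:
            return n
    return SAMPLE_INTERVALS[-1]


def estimate_count(tables, n, start, end):
    """Count sampled rows in [start, end) and scale up by the interval."""
    rows = [e for e in tables[n] if start <= e["ts"] < end]
    return len(rows) * n


events = [{"ts": t} for t in range(100_000)]
tables = ingest(events, random.Random(0))

# A short range fits the row budget at full detail; a long one does not.
short = choose_interval(range_seconds=60, max_rows_scanned=10_000, events_per_second=100)
long_ = choose_interval(range_seconds=86_400, max_rows_scanned=10_000, events_per_second=100)
print(short, long_)
```

The trade-off is the one discussed above: a narrow time range can afford the unsampled table and return exact rows, while a wide range reads a heavily sampled table and returns a scaled estimate with some margin of error.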
It's so cool. Cause it's just like, I just always think about like the, the kind of trying to find the grain of sand on a, on a beach and sort of like the different things you could be doing to sort of adapt the strategies and adapt the way that you sort of dive in.
And the way, the fact that the team has built, not only the system, but they've done it in a way that it works so seamlessly and flawlessly across so many different use cases at the same time is, is incredibly powerful.
Yeah.
It's something that's cool to me whenever we want to light up like a new product or a new data set.
And we've developed these patterns where we've gotten pretty good at it now, just adding in, like, oh yeah, we have this set of development patterns on the infrastructure side, but also on the UI side, where there's, I think, a common UI language for analytics, and it works really well for what I think of as high-dimensionality data sets.
So you don't know if you're going to filter on country or IP address or URL, like you don't even know ahead of time what you're going to want to filter on.
And we can answer these questions.
Like, you know, is iOS or Android gaining in popularity in India right now?
We can answer that kind of question.
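To make the "you don't know ahead of time what you're going to filter on" point concrete, here is a small sketch of a query helper where both the filter and the grouping dimensions are chosen at query time. The event fields (`country`, `os`, `day`) and the `count_by` helper are hypothetical and exist only for this example.

```python
from collections import Counter

# Hypothetical events; any field can serve as a filter or a grouping key.
events = [
    {"country": "IN", "os": "Android", "day": 1},
    {"country": "IN", "os": "Android", "day": 1},
    {"country": "IN", "os": "iOS", "day": 1},
    {"country": "IN", "os": "Android", "day": 2},
    {"country": "IN", "os": "iOS", "day": 2},
    {"country": "IN", "os": "iOS", "day": 2},
    {"country": "US", "os": "iOS", "day": 2},
]


def count_by(events, group_keys, **filters):
    """Count events per combination of group_keys, after applying filters."""
    counts = Counter()
    for event in events:
        if all(event.get(key) == value for key, value in filters.items()):
            counts[tuple(event[key] for key in group_keys)] += 1
    return counts


# "Is iOS or Android gaining in India?": filter on country, group by os and day.
trend = count_by(events, ("os", "day"), country="IN")
print(trend)
```

Nothing about the helper is specific to country or operating system, which is the property that matters for high-dimensionality data: the same call shape works for IP address, URL, or any other field.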
Yeah. Yeah. You know, we're, we're talking about looking at data and visibility into data.
And, you know, it's really interesting for me when I think about what we're talking about now and all the work that the team is doing to give the customers a great deal of visibility and control into the data and putting the data where they want it.
But at the same time, the team has a huge book of work that you're doing around data sovereignty to really manage and control who can access what.
So it's almost like a diametrically opposed force of work. Can you answer, kind of like, first of all, what is data sovereignty?
What is that problem? And why are we looking at it?
Yeah, sure. So something we're seeing as a trend around the world, but we're especially seeing, especially sharp right now in Europe, is that our customers expect to be able to control the region where the data that we handle is processed and stored.
And primarily that includes this sort of actual application data that flows on our network.
But increasingly, it also includes the metadata about that traffic.
So even just the, this is what we've been talking about, the metadata, the record of who did what, when, or what computer did what, when, that record folks want stored, you know, maybe exclusively in the U.S. or in Turkey or in Japan or in Brazil.
And Ben and I and many other folks are working on making this a reality that we can say, you know, totally concretely, yes, 100% of the data that is generated by your traffic will be stored in a specific place.
And because we've been talking about all the ways the data gets used, we've been talking about the kinds of questions you can answer, the granularity of it, and just how hard it is.
This is a tricky project.
There's a lot of different systems that are touched here. You know, it's like, we love that analogy of like, we're changing out the jet engines while it's flying.
Totally applies. Like everything has to change, but everything has to keep running, running smoothly while we do this.
Yeah. But, you know, I mean, we've got presence in what, over 200 cities around the globe.
So, you know, why is it not simply as easy as just pointing them to just say, just store it there?
Like, why is this hard?
Yeah. Well, maybe I'll back up just one second and mention the data localization suite, something we announced in December and has a number of components to it because it's a suite of products that are all about, again, giving customers control of where the data is stored.
One part of that is a product called regional services and actually very interesting and complex in and of itself.
I don't want to downplay it, but that is kind of what regional services does.
It says, we have this huge network, but we know you only want the data processed in Europe.
So you're only like, essentially like we're only going to use the EU component of our network.
And that works because like, we actually have a big presence in the EU with a lot of data centers there.
We can handle that traffic. The challenge we have is that although we have a big distributed network, fundamentally to answer these questions, the data has to be collected in one place to answer them right now.
And the data somehow has to get aggregated and rolled up.
Now, I would love to be in a future, and we're working towards a future, where maybe any one of our data centers can kind of take the lead and be really smart and be that aggregator.
But right now, that isn't how it works. It's a big change. I don't know if Ben, you want to say more about that shift?
Yeah. No, I think that's a great summary.
And I think more technically speaking, the real big challenge is we have tens of millions of requests that are generating some sort of data every single second at the edge.
And we essentially have to make decisions on every single one of those messages that comes in over where that data is allowed to go.
So if that data arrives at one of our data centers in Europe and you're a European customer who really wants your data to stay in Europe, we need to make sure that data doesn't go to the US and it actually stays within Europe.
And now we have these dashboards and other analytics products and you're a customer and you want to ask questions of your data.
Now we need to know based on which customer you are, which data center are we actually going to go to try and answer those questions.
And that might sound like a simple problem in some ways, but it's actually very complex and very difficult, especially when you operate at the scale that we do.
And so that's a very big part of what we're working through. So is it basically just sort of trying to help the data remember where it left its keys, right?
That sort of metaphor of like, I got to go look up my data. That data is going to be coming from this colo.
And so it feels to me almost like you need kind of a new aspect of sort of logic or memory or routing.
How is that working? Yeah, that's exactly it.
So I think one of the... Actually, this is a big challenge that we have across Cloudflare.
It isn't really just specific to data, but it's how do you actually manage all this customer configuration?
And you have all these different customers that want us to do slightly different things based on their configurations and every single server that we have literally all over the planet needs to be aware of that configuration.
And actually needs to be aware of that configuration really quickly.
So when a customer makes a change, they don't want it to take like three weeks to propagate across the entire planet.
They actually want it within seconds.
And so we have an internal system that we use to actually propagate this called Quicksilver.
I'm sure it's written about all over the blog, but we utilize this system to propagate this configuration that tells all of our servers exactly how to handle customers' data.
And there's a lot of considerations to make sure in terms of like how we handle failures.
Like what happens if something goes wrong?
We want to fail closed, in the sense that if we don't know what the right decision is to make for some reason, we want to make sure that we don't accidentally send the data to the wrong place.
That's actually the worst thing that could happen.
And so we have to take all these things into consideration as we design those systems.
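The per-message decision and the failure handling described above (when the right decision is unknown, never send the data somewhere it might not be allowed) can be sketched as below. This is an illustrative assumption about the shape of the logic, not how Quicksilver or Cloudflare's pipeline actually works; the configuration layout, customer names, and the `route` function are hypothetical.

```python
from enum import Enum


class Decision(Enum):
    FORWARD = "forward"
    HOLD = "hold"


# Hypothetical per-customer configuration, as it might look once propagated
# to a server: the set of regions where each customer's data may be stored.
ALLOWED_REGIONS = {
    "customer-a": {"eu"},
    "customer-b": {"eu", "us"},
}


def route(customer_id, destination_region, config):
    """Decide whether a message may be sent to destination_region."""
    allowed = config.get(customer_id)
    if allowed is None:
        # Configuration is missing or unknown: withhold rather than guess.
        return Decision.HOLD
    if destination_region in allowed:
        return Decision.FORWARD
    return Decision.HOLD


print(route("customer-a", "eu", ALLOWED_REGIONS))
print(route("customer-a", "us", ALLOWED_REGIONS))
print(route("customer-z", "eu", ALLOWED_REGIONS))
```

The design choice worth noting is the default: the only path that forwards data is the one where the configuration is present and explicitly allows the destination.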
Yeah. It's sort of like the great news about Cloudflare is we're this phenomenally powerful, anycast, edge-based network and we're lightning fast and everything is distributed everywhere, except when you don't want it to be distributed everywhere.
You only want it to be in a few places and you need to remember how to aggregate those things.
Yeah. Yeah. Usman has this great analogy of we're very distributed, but there's still a brain.
There still needs to be a way that we can identify this is a DDoS attack. This is whatever that is.
And it's actually hard to have multiple brains. So one thing we're actually, we're kind of trying to have multiple brains right now.
Yeah.
It's putting it up. It's putting it up. So the other thing you guys have also had to do is not only think about data and the distribution of data, but also kind of, I'm not going to call it the destruction of data, but making sure that we're leveraging kind of the best and the best technologies to make sure the information we're storing is also compliant with sort of the shifting landscape.
I mean, how are we doing that and why? Yeah. Well, that could be a good opportunity to just talk about the work we're doing for FedRAMP.
Yes. So folks may be aware that Cloudflare is working towards getting FedRAMP certified, meaning that we could provide services for the U.S.
federal government and also for companies that serve the U.S.
federal government and companies that serve those companies and so on and so forth.
So it's a very important set of security standards that kind of cover everything about how we operate.
And so these two things, FedRAMP and these data sovereignty changes, are these kind of really seismic foundational changes that we're going through.
It's making us rethink, as Jen said, retention and destruction and how long are you keeping the data for exactly and what happens when it expires and can it get deleted sooner?
And even just our systems, like how do all of our systems talk to each other?
So Ben and his team are doing a lot of work right now to get us in ship shape for FedRAMP.
But what do you have to do?
What do you have to do for Fed? What from a data perspective, what are the critical pieces of the work that we have to do for FedRAMP?
Yeah, so I can take that. I think one of the most important pieces is really just making sure that we send this data between all of our different data centers and when we send that, we want to make sure that it's encrypted for various reasons.
We've always encrypted the data that gets sent between our data centers, but what makes FedRAMP uniquely challenging is that they have very specific algorithms and ciphers that they actually require us to use that I'm sure they've done all sorts of auditing on and things like that.
And actually a lot of what we use is a little bit more modern than some of those standards.
And so a really big challenge for us has actually been making sure that we comply with those standards everywhere.
And in some cases, actually adapting different encryption techniques to make sure this data is secure in a way that complies with their regulations as it transmits between all of our different data centers, from the edge to the core and vice versa.
And then there's even the same requirements for what happens when the data sits at rest as well as in transit.
And so we keep a lot of data in our data centers and we need to make sure that all of that is stored and encrypted in a FIPS compliant way.
So that's probably the biggest challenge for the data team.
Yeah. What is it about the encryption that makes it difficult for you guys?
I think it's mostly just that a lot of the algorithms and cipher suites that meet compliance are not necessarily readily available in a lot of the systems that we actually use.
We use the Go language. It's very common. It's one of the main programming languages that we use here.
And the sort of out of the box modules are not considered FIPS compliant for various reasons.
And so a big part of what we had to do is actually port some of that work into the Go language and then compile all of our services and binaries specifically using that version.
And that's non-trivial because now we're actually maintaining our own sort of internal distribution of a programming language which comes with its own pitfalls.
And so there's been a lot of work that's had to go into actually making sure that that happens.
I think the second bit is because we have so much data and it was originally encrypted using non-compliant mechanisms, we have to essentially rewrite all of that data.
And when you have literally petabytes of data that you're storing, you can't just like- Hundreds of computers, racks and racks and racks of servers, physical things that you need to use.
To Jon's analogy on the airplanes in the air right now, we're actively writing to those machines at the same time.
Those machines aren't idle, right? And so we have to effectively rewrite every single data point that we have while new data is coming in.
And that's hard and that takes a long time.
And we have to set up jobs and we have to monitor them and it takes days, weeks to rewrite all this data.
And so those are the big challenges.
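The "rewrite every data point while new data is coming in" problem can be sketched as a batched backfill that re-checks each record just before rewriting it. This is a toy model under stated assumptions: the in-memory `store`, the scheme labels, and especially `encode`/`decode`, which merely reverse a string as a stand-in and are not real encryption of any kind.

```python
OLD, NEW = "legacy", "compliant"


def encode(plaintext, scheme):
    """Placeholder for real encryption: NOT cryptography, just a tagged transform."""
    return {"scheme": scheme, "payload": plaintext[::-1]}


def decode(record):
    """Inverse of the placeholder transform above."""
    return record["payload"][::-1]


def backfill(store, batch_size):
    """Rewrite legacy records in batches, yielding progress after each batch
    so that live writes can interleave between batches."""
    keys = [key for key, record in store.items() if record["scheme"] == OLD]
    for i in range(0, len(keys), batch_size):
        for key in keys[i:i + batch_size]:
            record = store.get(key)
            # Re-check: a live write may already have replaced this record.
            if record is not None and record["scheme"] == OLD:
                store[key] = encode(decode(record), NEW)
        yield min(i + batch_size, len(keys))


store = {f"k{i}": encode(f"value-{i}", OLD) for i in range(5)}
job = backfill(store, batch_size=2)
next(job)                            # first batch runs
store["k4"] = encode("fresh", NEW)   # a live write lands mid-backfill
for done in job:                     # remaining batches
    print("rewritten so far:", done)
```

Working in small batches and re-checking at write time is what lets the job run for days alongside live traffic without overwriting newer data.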
That's amazing. That's amazing. So I mean, from a FedRAMP perspective, obviously there's the getting ready for the assessment and the attestation and stuff like that.
But if you think about sort of data going forward, being compliant going forward, what are some of the things you think we're going to have to keep in mind as we continue to sort of build our systems, build them in a way that is FedRAMP compliant, gives our team kind of at once compliance with these complex encryption standards, but at the same time gives our teams and our customers the transparency and control that they need.
What is that going to ask of us that we're going to have to think about doing differently?
Yeah.
The big topic that I think about a lot is, and we're just kind of getting started on, is really access control, making sure that the right people have access to the right data to do their jobs, but also that they don't have access to too much.
Obviously, access to data at Cloudflare is of course controlled, but it's about making sure that people don't have access to the data for longer than they need.
Or that when you do request access to something, you just get it for just the set of things you need it for, not a bunch of other stuff too, that you've made it really granular.
And to me, the ultimate version of this is, there's exceptions to this about our own systems, but for the most part, I want our customers to see the data that we see.
And so there shouldn't be this distinction between data our customers have, data that Cloudflare has.
There will be a few things, but for most of it, it should be data that our customers have, that a customer can grant someone at Cloudflare access to their data.
And it goes both ways, that customers will have all the tools that we have access to, and that they know exactly who Cloudflare is.
If they're trying to stop an attack, our customers should know about that and should see that.
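The access-control goals described above, access that expires and that covers only the specific data requested, can be sketched with a small grant model. The `Grant` type, dataset names, and `can_read` check are assumptions invented for this example and do not describe Cloudflare's internal tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    user: str
    datasets: frozenset   # only the datasets that were requested
    expires_at: float     # Unix timestamp after which the grant is void


def can_read(grants, user, dataset, now):
    """Allow a read only under an unexpired grant that names this dataset."""
    return any(
        grant.user == user and dataset in grant.datasets and now < grant.expires_at
        for grant in grants
    )


grants = [
    Grant(user="engineer-1", datasets=frozenset({"firewall_events"}), expires_at=1_000.0),
]

print(can_read(grants, "engineer-1", "firewall_events", now=500.0))    # within scope and time
print(can_read(grants, "engineer-1", "http_requests", now=500.0))      # outside the granted scope
print(can_read(grants, "engineer-1", "firewall_events", now=1_500.0))  # after expiry
```

Because every grant carries its own scope and expiry, the same record could also be shown to the customer, which fits the goal of customers seeing who has access to their data.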
What kind of challenges is this going to present to you, Ben, as you think about innovation and the rate of innovation in your team and how your team innovates and works with data?
Yeah, that's a great question.
And I think even before we get too far into it, I would just say that as Cloudflare gets larger and a lot of these access controls are now being put in place, I think as an engineer who just likes to be able to go to a command line and just start ripping data apart when they have a question or they're trying to solve a bug, having these access controls, for lack of a better word, it does suck as an engineer, but it's just so incredibly necessary.
You just can't have thousands of people who readily have access to customers' data.
That would be unheard of in a lot of other circles.
So I think the pain is just something that ultimately we have to put up with.
But the question, and I think the challenge for our team, is how do we develop things in a way that causes the least amount of pain possible?
And a big part of that is just really starting to separate our datasets. And so you could think historically, years and years ago, we maybe had one single dataset that has every piece of sensitive information in it that you could possibly imagine.
And a big part of what we're doing is we're starting to separate things into datasets that we can be a little bit more open with and datasets that really have sensitive customer information that engineers just don't have default access to.
And so there's a lot of places where we're actively improving this and slowly pulling some of those metrics out that help give our engineering team and our operations team visibility of what's actually happening with our network, what's happening at our edge, how many requests per second are we doing?
And they can answer those questions without having to understand where are my client eyeballs?
What are my client IPs? Which customer's data is this? You don't actually need to know which customer's data it is.
For a lot of these operations, you just need to know what event happened, maybe how many bytes went through the network, and that's about it.
So there's a lot of these kind of low level technical metrics that can be very useful to engineers, but aren't violating customer's privacy and aren't real concerns.
And so really that's the puzzle that we're engineering right now is how do we pull as much of this out without violating privacy at the same time?
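One way to picture the dataset separation described above is a projection that copies only operational fields out of each event, so the resulting dataset can be opened up more widely. The field names and the allowlist below are assumptions for illustration, not Cloudflare's actual schema.

```python
from collections import Counter

# Hypothetical allowlist: fields considered safe for broad engineering access.
OPERATIONAL_FIELDS = ("event_type", "bytes", "timestamp")


def to_operational(event):
    """Keep only allowlisted fields; anything else (client IP, customer) is dropped."""
    return {field: event[field] for field in OPERATIONAL_FIELDS if field in event}


def requests_per_second(events):
    """Count events per whole second using only the operational view."""
    return dict(Counter(int(event["timestamp"]) for event in events))


raw_events = [
    {"event_type": "request", "bytes": 512, "timestamp": 10.2,
     "client_ip": "203.0.113.7", "customer_id": "customer-a"},
    {"event_type": "request", "bytes": 2048, "timestamp": 10.9,
     "client_ip": "198.51.100.2", "customer_id": "customer-b"},
    {"event_type": "request", "bytes": 128, "timestamp": 11.1,
     "client_ip": "192.0.2.44", "customer_id": "customer-a"},
]

operational = [to_operational(event) for event in raw_events]
print(requests_per_second(operational))
```

An allowlist is used here rather than a blocklist so that a newly added sensitive field stays out of the open dataset by default.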
It's amazing. I mean, just the dimensions on which the data team is innovating right now are really incredibly powerful.
Ben, what roles are you hiring for right now?
What do you need? Yeah, so we are primarily hiring for data engineering.
So folks that really want to work on one of the largest and I think most interesting data pipelines in the world.
What we do is not simple. And folks that want to help us continue to grow.
Cloudflare is growing really, really quickly and that data volume is going up and up and up.
And we're going to have to continue to innovate to make sure that we can actually keep up with it.
And then also, we're also looking for folks who have a little bit of sort of a statistical background or want to be somewhat involved in machine learning and actually taking this data that we're generating and providing insight to customers with it.
And so at Cloudflare, we don't just want to give customers data points.
We also want to make sense of that data and make it actionable and useful.
And it's great if you can send a firehose of billions of events to a customer, but really they may only want one or two answers.
And if you can give them that instead, that's much more effective.
So folks that want to help give those answers to customers. Awesome. Thanks all.
Thank you.