Introducing the Data Localization Suite
Presented by: Jon Levine, Achiel van der Mandele
Originally aired on June 4, 2022 @ 10:30 AM - 11:00 AM EDT
In this session, Cloudflare Product Managers Jon Levine and Achiel van der Mandele do a deep dive on Cloudflare's new Data Localization Suite offering.
Read the Blog: Introducing the Data Localization Suite
Privacy Week
Data Localization
Transcript (Beta)
All right. Hello, everyone. Welcome back to Cloudflare TV. My name is Jon Levine. I'm a product manager here at Cloudflare working on data and analytics.
And with me, I have Achiel.
Do you want to introduce yourself? Hi, I'm Achiel. I'm also on the product team focusing on a variety of things, but more recently, like protocols and regional services.
Hey, Jon. So kind of to kick us off, we have privacy week and like we have compliance and we have privacy and all of that stuff.
But kind of like what's driving a lot of this, or like specifically, what is this Schrems thing?
People always talk about Schrems and like what is that? Yeah, it's funny.
So yeah, we're here. Like why is the second week of December privacy week?
So it actually probably goes back to earlier this summer. It's a recent court case, the Schrems II decision.
We were just talking before like this term Schrems II gets thrown around.
Max Schrems is just a guy. He's an activist and a lawyer from Austria and he's won a few really high profile successful court cases.
The Schrems II decision was a court case.
It was a lawsuit of Max Schrems versus Facebook, basically saying that their data was not being protected under the GDPR.
And the upshot of the Schrems II decision, as I'm sure many folks know, is that it invalidated this earlier thing called the Privacy Shield.
And now that it's invalidated, it's not so important to know the details of that.
But basically the upshot is it's now much more complicated for U.S.
companies to transfer personal data about Europeans back to the U.S.
And so since this summer, there's been a lot of questions about what are the implications of all this.
Just in the last few weeks, European data protection agencies released some draft guidance about how to comply with this new ruling.
But this is a pretty active issue in Europe.
And I want to say it's not just in Europe.
Europe, I think, is kind of on the leading edge here. But we're hearing about this in India, in Brazil, in Australia, even in the U.S.
with things like FedRAMP. People have a lot of these questions about where is my data going?
Cool. So there's a lot of companies, and you mentioned Facebook, but there's all sorts of companies that have global footprints.
And Cloudflare also has a global footprint, right?
We're over 200 data centers, and we always tell people like, hey, we run the exact same software everywhere.
So how does that tie into Cloudflare? How do we think as a company about all of this?
Yeah. Well, I think one thing that's great about Cloudflare is we have a pretty big footprint.
We're a pretty significant-sized company, but privacy is really core to our business, right?
None of our services are about selling or renting or anything about the data of the visitors who come to our network, right?
Our customers are paying us to provide performance and security services.
And we can only do that if we have their trust and if they know exactly what we're doing with the data that we do have.
And so what we want to do is build tools for those customers to protect the data of their visitors and to stay compliant.
And this is something where we think we have a lot that we can do to help here.
We're pretty mature in this space. We've been GDPR compliant from the moment GDPR existed, right?
And that wasn't even, frankly, that hard for us because we'd already been doing most of this stuff from the very beginning of our company, right?
Privacy is really fundamental to our DNA. And I should also add, we have whole teams of lawyers and policy and compliance experts who follow this space really closely and understand this, and technologists like us who meet with these folks every day and try to understand the rules and also go way beyond that to just have the most private service that we can offer.
We want to give our customers the same capability that we have.
To be honest, that's actually really great.
And one of the things that I really, really appreciated about Cloudflare when I joined, I think in a previous job, we also had to deal with some GDPR stuff.
And then actually, my boss at some point, he got a little bit agitated.
He's like, F Europe, because he was so upset about all these regulations. I'm originally Dutch, so I didn't really know how to deal with that because I personally actually do care a lot about privacy.
And it does feel a little bit nebulous, right?
This cloud stuff and where is my data going? Why are you even collecting it?
Are you collecting it to sell it or to offer services? So joining Cloudflare was actually a breath of fresh air that from the get-go, we have no interest in collecting this data or selling it or anything because our core business is to offer security and performance benefits.
We might have to collect something to be able to offer those services, but it's very direct to do that.
It's something I've really, really liked working here. It's pretty interesting too, because we're at a scale where we can offer free consumer services.
I'm really proud this week, we just announced a free analytics service.
We've had our free 1.1.1.1 DNS resolver, a super fast, private DNS resolver that anyone can use, and the WARP app.
And most companies, there's this bargaining with these services that when you use them, you know you're kind of giving up.
There's a saying like, you're the product, right?
You're giving up your data. But for us, it's really different, right?
We do these things because we have really powerful network effects.
We can actually give away free consumer products, and it makes our paid offering smarter.
So I think it's really great that our values in our business are just really aligned on this.
And we don't really feel any conflict of like, oh, but if we can only bend this rule, we don't have any conflicts like that, which is really nice.
Yeah, definitely been really great. So kind of moving over to when you look at all of this compliance and all these regulations, obviously different countries have different things, but what does that even mean to be compliant?
Like literally, how do these policy makers look at these things?
I mean, just saying the data is not allowed to leave the country or something, it's like a very vague thing, right?
So what are they literally, what is the type of language and stuff and the core tenets of the policies that really represent?
Totally. Yeah, well, so spoiler alert, we're here to talk about our new announcement this week, the data localization suite.
And I'll tell you a little bit about how that came about.
So it's accelerated since this summer, but customers have always asked us, well, what is the data that you do have?
How is all this data flowing around your network?
Help us understand that, help us understand the landscape.
So what we wanted to do first was just kind of map that out and explain that.
And then we realized we needed to give people controls over every part of that.
Let's first talk about kind of what all the data is that's kind of flowing around.
So I think the first and most important thing that we'll talk about is your application data when it's in flight.
I'm going to stop you right there.
What's application data here? That sounds like a very technical thing.
Yeah, let's unpack that jargon a little bit. Well, application, it can mean tons of stuff.
It could mean everything from this Zoom call, right?
Like the content of the video. It could be an order I make on DoorDash. What's the content of the order that I'm making?
What is DoorDash telling me about where my driver is or how long my food will take to arrive?
Yeah, or maybe bank statements when I'm doing e-banking on my banking site.
I actually kind of care about where that data ends up.
That sounds good. Telemedicine, right? If I'm viewing forms.
So anything that's like the meat of what you're trying to do on the Internet, that's usually what we refer to as application data.
And the predominant protocol that we talk about is usually HTTP.
Usually what we're talking about is stuff in the HTTP requests and responses at the application layer, or L7, as we'll sometimes call it.
Yeah. So one thing I'll say, and we'll get to this in a second, is to make those services faster, to make them more secure, we do need to be able to kind of inspect those and say, hey, is there an attacker trying to send something malicious?
Is there content here that's cacheable that we can store a copy of at our edge, right?
We don't want to cache your bank statement, but we might want to cache your website's logo to make it load faster.
Cool. So that's one really important pillar.
Another important pillar is storage. So where is that stuff stored? How is it stored?
So I mentioned caching briefly, right? One of the major ways we make things faster is we store copies of things.
But there's other stuff too, like we do have a video service called Stream.
We're not going to talk too much about that, but we do store videos for that.
For Workers, we do store the contents of the Worker scripts that customers run on us.
It's a little bit less of a concern for Cloudflare because we're not as much in the storage business primarily, but it's something that is definitely concerning for folks.
Yeah, plus that's not really my bank account, right?
If I'm a European citizen, I really care about literally my personal information.
So what other personal information do you track? Yeah, I think for people, in the context of Cloudflare, when it comes to personal data, the most sensitive things are often around metadata.
So we'll talk about this a little bit.
We keep coming back to the fact that people are buying Cloudflare for what?
Well, for speed and security. So there are things we need to know to be able to do that.
And so a really good example of this is bot detection. There's some basic amount that we have to learn about traffic on the Internet in order to understand the difference between a bot, a computer trying to request stuff, either to stuff passwords or maybe to try to scrape the website, who knows what, versus a human accessing that.
So there's some information we need to collect, and that falls in this incredibly broad category called metadata.
And the thing about metadata is it could be anything, right?
And we'll talk a little bit about this, but we always want to have really the absolute minimum that we need to be able to offer the performance and security products that we do.
Yeah. Does that answer your question, Achiel?
Yeah, I think so. We have two things. Where do we inspect traffic?
Where could Cloudflare theoretically see your bank account if you're going to a banking website?
And then the other is, well, what types of data does Cloudflare really collect, and where does it store this data, right?
I don't want Cloudflare to track me and see like, oh, I just visited this website.
Yeah. Yeah, but they can't see it. Cool. Well, should we transition to talk about how we're thinking about solving some of these problems and- Yeah, let's do it.
Yeah. So this week we announced the data localization suite. And what this is, is just trying to help tie our products together, which help customers with these problems about where is this data inspected and stored?
And it's really about giving our customers the control to really say where that's all being done.
So I think the most important piece of this, and Achiel, I want to ask you about this, is regional services.
So we did announce regional services this year, and I think it's really the linchpin of the data localization suite.
So Achiel, do you want to tell me a little bit about what this is, how it works?
Why did we build this?
Yeah, for sure. So we touched on this earlier. We have data centers all across the globe, and whenever someone visits a website on Cloudflare, they automatically get routed to the closest one, right?
But a lot of times you're a little bit hesitant to do that because you don't want a machine maybe in certain countries to be able to see that bank account, right?
So an example is a European customer who might not want a US PoP or data center to decrypt and look at the traffic when someone in the US is visiting their website.
So with regional services, we're giving you really fine grained control over where we actually decrypt and can look at that bank account number or those requests and perform security features such as WAF.
We localize that for EU to only the EU member states, and we offer a bunch of other regions.
So US is a popular one for people that don't want anything decrypted outside of the US, and we're looking to add more in the future.
The key thing here is we still use our global network for DDoS protection.
We just only apply decryption and the WAF inspection and bot management and all that stuff inside the geo region of your choosing.
That was going to be one of my first questions about this. We know these DDoS attacks are huge, and maybe it's interesting, we talked about how maybe they're not quite as big even as they once were, but huge global DDoS attacks are not a thing of the past.
So how do you deal with that? I think you answered that, which is we're still able to use our whole network without which- We fend off like the large volumetric network style attacks using our global network.
We just use a subset of our network for the HTTP inspection.
So let me walk through a few kind of edge cases here.
So let me understand this. So if I have a site and I'm mainly serving, I don't know, let's say car dealerships in Denmark, but then one of my car dealers goes on vacation in Hawaii, which I don't think is part of the EU anymore anyway.
They're European, the website, but they're in Hawaii, they're kind of on the other side of the world.
Help me understand what's going to happen in that situation. Yeah.
So at that point, when we're seeing that traffic come in, we can't immediately determine whether this person is Danish or not.
We can't do a quick passport check. That's not the way technology works, but we can see that that's from an IP address in Hawaii.
And we know that that data center is obviously in Hawaii.
So if we see that traffic going to a certain customer that doesn't want us to inspect HTTP traffic inside the Hawaiian PoP, but only in the EU, we look at whether it's a DDoS attack. Obviously not.
There's a dealer guy just visiting a website.
We kind of tunnel just the raw packets over to a data center inside Europe, where we know that it's good to decrypt and we do the inspection there.
And that kind of gets you the best thing, because if we have a huge attacker in Hawaii, that stuff does actually get blocked there.
And we can use our entire network capacity for that while still making sure that we only process the data locally.
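To make the flow just described concrete, here is a minimal Python sketch of the decision an ingress data center might make. It is an illustration of the idea only, not Cloudflare's implementation: the `Packet` class, the `handle_at_ingress` function, the region table, and the airport-style data center codes are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical region membership for a few data centers (illustration only).
REGION_OF_DATA_CENTER: Dict[str, str] = {
    "HNL": "US",  # Honolulu
    "CPH": "EU",  # Copenhagen
    "FRA": "EU",  # Frankfurt
}


@dataclass
class Packet:
    customer_region: str   # region the customer chose for decryption, e.g. "EU"
    looks_like_ddos: bool  # verdict of network-layer checks that need no decryption


def handle_at_ingress(packet: Packet, ingress_dc: str,
                      in_region_target: Dict[str, str]) -> str:
    """Return what the ingress data center does with an incoming packet."""
    # Volumetric attacks are dropped wherever they arrive, anywhere on the network.
    if packet.looks_like_ddos:
        return "drop"
    # Ingress is inside the chosen region: decrypt and inspect HTTP locally.
    if REGION_OF_DATA_CENTER[ingress_dc] == packet.customer_region:
        return f"decrypt-and-inspect@{ingress_dc}"
    # Ingress is outside the region: forward the still-encrypted packets
    # to a data center inside the region, where decryption is allowed.
    return f"tunnel-encrypted-to@{in_region_target[packet.customer_region]}"


# The Danish car dealer on vacation in Hawaii, visiting an EU-only site:
print(handle_at_ingress(Packet("EU", False), "HNL", {"EU": "FRA"}))
```

In this sketch the DDoS check runs before any regional logic, which mirrors the point made above: attack traffic is blocked where it lands, and only legitimate traffic is tunneled to the chosen region for decryption.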
That's awesome. I want to ask really quick, something we haven't touched on is encryption.
We know encryption is so essential to doing all this. That stuff's encrypted when it's in flight across the network, and it's encrypted when it's stored on disk.
But the keys, the keys to the kingdom, it's so important, who has access to those keys and where are those stored?
So we have Keyless SSL and Geo Key Manager, and do you want to maybe help me understand how that kind of works together with regional services to keep your private key material local?
Yeah, for sure.
So with regional services, we just make sure that we only decrypt certain things.
But then there's a question that's like, well, where could we decrypt? Because that's another thing that a lot of people are afraid of, like, hey, we don't want an engineer randomly going in and stealing some keys and decrypting all this traffic.
So that is literally the key. Before you decrypt traffic, you need the private key and the certificate to be able to establish that encrypted traffic and be able to read all that data.
So what Geo Key Manager does is it makes sure that those keys are only usable inside certain data centers.
Normally, we would just have those keys everywhere using secure transport mechanisms, but you could theoretically use them everywhere.
With Geo Key Manager, we just reduce that to a certain set of data centers.
So if you're a European customer, you would probably want regional services to make sure that the decryption only takes place there.
But we also can literally promise you that the keys will only be useful there.
Yeah, it would be cryptographically impossible for us to do it anywhere else, which is really neat.
Yeah. Cool. Great. So should we talk about storage a little bit? We've mostly been talking about data flying around the network.
I touched briefly on services like Stream and Workers, but I think mostly when we're talking about storage, we're talking about metadata, right?
And about what is the metadata we collect?
We touched a little bit briefly about the bots use case, but- Yeah, that was going to be my question because you used one of those techno words, which is metadata, which is super vague.
What does that mean? That sounds as nebulous as the cloud, which we're all trying to figure out.
So what does metadata here even mean?
Maybe we can use an example. Let's say I'm a banking customer, that bank is on Cloudflare, and I visit that website.
So I go through Cloudflare.
What does Cloudflare, like, see and store about me personally? That's a really good question.
So we talked about application data, and I think that's the stuff we definitely don't want to store: any details about your username or password or your address, your name, certainly not the bank statements they're going to return back to you.
We certainly don't want any of that, but there is stuff about that- That passes through the network, but then we instantly forget it, right?
We don't care about it. We need to check at the time when that comes in that you're trying to log in again, that you're not trying to do some kind of, maybe there's a zero day, right?
Where there's some kind of attack against the database server, some crazy combination of keystrokes that can cause a problem.
We got to protect our customers against that. Or someone's trying to get my password.
I definitely want you to stop those. Credential stuffing, right? We have to stop that.
But the flip side is like, there are a few things we do want to know. Like we want to know how fast we are.
We want to know how long it takes, for example, to connect to an origin server.
So that connection time, how long it takes us to get data back from the origin server, that data is essential to building Argo.
So Argo is like our Waze for the Internet. It's our smart routing product.
What it helps us do is literally get data across the Internet faster by picking the least congested, fastest paths across the Internet.
We're basically outsmarting BGP.
It's crazy sci-fi stuff. It only works because we have visibility about how long it takes to connect back to origins.
So that's the kind of thing we're talking about.
So that way we know with Argo, maybe we should take this path instead of this path because we know that will get us the fastest connection to your bank server, make that site load faster, just make that user experience better.
We don't want to know anything about your bank statement.
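As a rough illustration of why timing metadata alone is enough for this kind of routing decision, the Python sketch below picks the path with the lowest median origin-connect time. It is a toy model under stated assumptions, not how Argo actually works: the `pick_fastest_path` function, the path names, and the sample timings are all made up for this example.

```python
from statistics import median
from typing import Dict, List


def pick_fastest_path(observed_ms: Dict[str, List[float]]) -> str:
    """Pick the path whose median observed origin-connect time is lowest.

    The only input is timing metadata; no request or response content is used.
    """
    return min(observed_ms, key=lambda path: median(observed_ms[path]))


# Hypothetical connect-time samples in milliseconds for three candidate paths.
samples = {
    "direct": [180.0, 210.0, 195.0],
    "via-ams": [120.0, 135.0, 400.0],
    "via-lhr": [150.0, 148.0, 152.0],
}

print(pick_fastest_path(samples))
```

The median is used here so that one slow outlier (the 400 ms sample) does not disqualify an otherwise fast path; that is a design choice of this sketch, not something stated in the session.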
Yeah, that just sounds like some extra data that gives you like some timing information.
But I don't think I really care if you store that, especially if you can then make the website go faster for me.
I mean, as long as you can't tie it back to me in any way or know what my bank statement is.
So metadata, it's funny because I think what we learned sort of in the Snowden era is like, if you have a lot of metadata from a lot of people, even stuff that seems innocuous in large volumes can be possible to de-anonymize people, which is really a concern.
And so that really leads us, I think, to one of the products that we just announced this week called Edge Log Delivery.
So one of the most valuable things we do for our customers is we get them their traffic logs from our Edge and help them understand what it is that Cloudflare is doing, but also what's happening at their origin server and also what are their visitors doing?
Because we're really the front door for all our customer services.
And so we had a few problems with the way logs have been delivered in the past.
So typically all of our logs go through one of our core data centers today.
And we have two core data centers, one in Portland in the US, another one in Luxembourg in the EU.
And there are a few issues here. One is just reliability, first of all.
If one of these does go offline, we don't want there to be any delay in getting logs to customers.
We want that to just continue flowing. Trying to back that up can be problematic.
Another one is just efficiency. If your traffic is being served, maybe far from where we have a core data center and people like to bring up South Africa, maybe latency of a couple hundred milliseconds on your logs, that may not be the end of the world.
But actually just from an efficiency perspective, the amount of bandwidth required to ship even a relatively small amount of metadata around the world, when you're talking about high volumes of traffic, can be a lot. Data transfer can be expensive too for customers to re-ingest that.
What we're increasingly seeing is that customers don't even want their metadata stored in other locations, especially when we're talking about the complete set of logs that they have.
And so the idea of edge log delivery is that if you're, say, a customer in Germany, and at the end of the day, you want your logs stored and analyzed in Germany, they're your logs, right?
You can do whatever you want with them.
We're going to send them from the data centers where we serve the traffic to where you store logs.
So going back to the regional services example, if you're using, we call it a map, the European map.
So your traffic is all getting funneled back to the EU and being inspected there.
Just those pops that are doing that final step of serving and inspecting traffic, those are the data centers that are going to send logs.
So this really helps your logs stay more local.
It's more resilient. It's more efficient. It helps keep your data just where you want to keep it.
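The idea can be sketched in a few lines of Python: group log lines by the data center that served the request, and have each of those data centers send its own batch straight to the customer's destination, with no core data center in between. This is a simplified illustration, not the product's actual design; `plan_log_delivery`, the record fields, and the destination name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def plan_log_delivery(requests: List[Dict[str, str]], destination: str) -> List[dict]:
    """Build one delivery batch per serving data center, sent directly to destination."""
    batches = defaultdict(list)
    for req in requests:
        # Logs stay with the data center that served and inspected the request.
        batches[req["served_by"]].append(req["log_line"])
    return [
        {"from": dc, "to": destination, "lines": lines}
        for dc, lines in sorted(batches.items())
    ]


# Hypothetical requests for an EU-only site, all served inside the EU.
requests = [
    {"served_by": "FRA", "log_line": "GET /index.html 200"},
    {"served_by": "CPH", "log_line": "GET /logo.png 200"},
    {"served_by": "FRA", "log_line": "POST /login 403"},
]

for batch in plan_log_delivery(requests, "customer-log-store-de"):
    print(batch)
```

Because only the serving data centers appear in the plan, combining this with a regional map means the set of senders is limited to data centers inside the chosen region.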
So that tackles both of the things that a lot of these regulatory compliance regimes care about, right?
We give you some more control over where we kind of service traffic, and we also give you a lot more control of where we store and how data is stored.
Cool.
So I think, what do we think is kind of unique about what we're doing here? What have you been hearing about some of the problems that customers have faced with kind of other solutions they've seen in this space?
That's a great question. I think there's kind of two ways that a lot of these vendors approach this. Either they just use their global footprints and it's like, okay, we do it everywhere, but, I don't know, we promise not to do anything too nebulous, which a lot of people are obviously a little bit upset about.
And then on the flip side, you have a whole set of vendors that just offer like totally regionalized services, literally just only put data warehouses inside of certain countries, which is kind of fine if you're looking to operate on the smaller scale, but not really great if you're looking to fend off like super large DDoS attacks or want to use a large network like Cloudflare.
Also, a lot of times it's just really hard to manage services in different regions.
You have these dropdowns where you have to click through regions, and with Cloudflare, it's just one dashboard and you just go.
With a lot of services, like not everything runs in every region, right?
So, and you may have things deployed in different regions, but those could even be in like different versions and like there'd be separate like instances of a thing that you have to maintain.
I think we really thought we wanted to really simplify that and make it incredibly simple to just kind of fence things and control where the data lives and just really simplify that process.
Yeah. So, that's really cool that we've launched these products, but like, so what's next?
Have we just like solved all privacy now? It's all done.
We're all going to just retire and go home. We ship a lot here and we're definitely not slowing down after this.
So, I think one of the first questions we get is really about what regions are we in?
And so many of our examples today have been about the US or about the EU, especially again, because of Schrems, because that's what's on folks' minds.
This impacts global companies because a lot of companies have customers in the EU, even if they're not based in the EU, right?
We know there are also customers who are primarily serving folks just in Brazil or just in Australia or just in Canada.
And we see this really proliferating.
So, actually, maybe Achiel, do you want to tell me about kind of how we're thinking next about expanding the regions that we're in?
Yeah.
So, we're definitely looking at adding more regions in the future. And like one example that we're also very excited about is we're actively pursuing FedRAMP, and this is a core part of that because for FedRAMP, you really don't want any data decrypted outside of the US.
So, this is a really good fit for there. We're really excited about that.
We're hoping to be able to share more about that soon. Cool.
Another thing I wanted to bring up is, you know, one thing at Cloudflare is even though we have this incredible edge and we're in like over 100 cities around the world and we run the same services everywhere, there are certain things about Cloudflare, we sometimes call them like the brain of Cloudflare, that do tend to run in these centralized places, what I've referred to as our core data centers.
I picked some examples, like I picked how we do our bot management product, how our Argo product works, and all these things require kind of having this view of what's happening around the world and kind of processing that in one place, which is really valuable.
And I think one thing we're working towards is splitting up those processes, splitting up that brain, so we can be multi-brained, you might think of it as, right?
So, we can really silo off even better all of the services that we offer, so there's no questions about any metadata, anything flowing back to a single core.
The way I think about this in like techie or computer science terms is sharding, not replication.
I think we're good at replicating and now we're going to work on sharding and making sure that things actually can stay more separate.
And the challenge will be to do this in a way that still gives people, again, global benefits, right?
Making sure that we can still offer DDoS protection globally, that we still have Argo, that we don't only detect the European bots.
Like, that's not a good outcome, right?
Your threats are coming from a global audience, from the whole world, and you have to be able to recognize those.
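For readers less familiar with the sharding versus replication distinction mentioned above, here is a small Python sketch of the difference. It is a generic illustration of the two techniques, not a description of Cloudflare's systems; the `replicate` and `shard` functions and the record layout are hypothetical.

```python
from typing import Dict, List


def replicate(records: List[dict], regions: List[str]) -> Dict[str, List[dict]]:
    """Replication: every region holds a full copy of every record."""
    return {region: list(records) for region in regions}


def shard(records: List[dict], regions: List[str]) -> Dict[str, List[dict]]:
    """Sharding: each record lives only in the region it belongs to."""
    shards: Dict[str, List[dict]] = {region: [] for region in regions}
    for record in records:
        shards[record["region"]].append(record)
    return shards


# Hypothetical metadata records tagged with the region they originated in.
records = [
    {"id": 1, "region": "EU"},
    {"id": 2, "region": "US"},
    {"id": 3, "region": "EU"},
]

print(replicate(records, ["EU", "US"]))  # EU data is also present in the US copy
print(shard(records, ["EU", "US"]))      # EU data is present only in the EU shard
```

The trade-off raised in the conversation shows up directly: with sharding, no single place sees all records, so any global view (for example, of worldwide bot activity) has to be assembled some other way.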
Yeah, so another cool one that I also thought was worth mentioning is, we touched on this earlier, but at Cloudflare, we're super-duper privacy-focused, and that allows us to do a lot of cool things.
Because when you look at JPL and myself, we're actively trying to become more privacy-focused.
I mean, even without the GDPR stuff, it's just very core to our being.
A really cool example of that is we have a cookie called CFDUID, which is...
Wait, can you explain to me why do we have this cookie even?
It goes back to the early days of Cloudflare. So I mentioned part of our mission is to help stop malicious traffic on the web, and very early on, 10 or more years ago, we realized that if we added a cookie to websites, this could be one signal that could be useful for basically telling if traffic is malicious or not.
Now, we always go to... We actively track people to be able to tell if they're bots?
That's what you're saying? So we did a lot of things, actually.
We can talk about... I mean, the punchline is that CFDUID is going away, but we can talk about...
We did a lot to really make sure when we use CFDUID that it was and is, at this moment, very private.
So for example, it's not used across different websites in our network.
We're not storing it for very long.
I think one... I will say one benefit of a cookie versus other things is that you can see cookies, you can delete cookies, you can block cookies, which is an important thing about them.
But at the same time, we don't want there to be any confusion that we could be using this to track individuals or to be doing something with the data that we're not.
And so we challenged the engineering team, our leadership challenged the engineering team, like, can we offer security services and make them really accurate without this?
Can we just get smarter? And that was something we actually were able to do.
It's like, how can we actively minimize the data we collect and have a service that's just as good?
And we think we're able to do that, which is why we're able to announce the deprecation of the CFDUID cookie this week.
We plan to be removing that later this spring. That's great because it doesn't make it any easier for us to keep bots out, right?
Yeah. I think...
This is a really great example of how I think Cloudflare operates, that we're making it harder for ourselves because we value privacy so much.
Exactly. Yeah. I was gonna say privacy by design, it's sort of an abused, overused, overloaded term.
I think it is really how we think about it, that we think about building privacy in our products, building products where privacy is kind of the headline, compliance is kind of the headline.
And then not just resting on that, but continuing to work on that.
Cool. Well, this is a really interesting chat. Thanks so much, Achiel.
Please do, if you're interested in the data localization suite, check our blog.
We have a blog post that went out on Monday. We have a product page up and please do reach out if you have any questions.
Any parting thoughts, Achiel? No, it's just been great and fun chatting with you on this.
It's one of those topics that always seems really dry and boring, but when you get into it, it's actually kind of fascinating.
So it's been fun chatting. Thanks for having me. Awesome. All right.
Thanks so much, Achiel.