🔒 Building Privacy Into Products at Cloudflare
Presented by: Emily Hancock, Tara Whalen, Edo Royker, Jon Levine
Originally aired on April 28, 2022 @ 5:30 AM - 6:00 AM EDT
A discussion on how we build privacy into our products.
Privacy Week
Transcript (Beta)
Hi, welcome to our Cloudflare TV segment on how we build privacy into our products here at Cloudflare.
I'm Emily Hancock, head of legal for product, privacy, and IP and our data protection officer at Cloudflare, and I'm joined today by my colleagues Tara Whalen, Jon Levine, and Edo Royker.
Why don't you guys go ahead and introduce yourselves.
Tara, you go first. Sure. Hi, I'm Tara Whalen. I am the privacy research lead here at Cloudflare and I mainly work on advancing privacy enhancing technologies throughout the Internet.
In my previous life, I worked at Google and at Apple helping to engineer privacy in their products.
Before that, I worked as a technical expert at the office of the Privacy Commissioner of Canada.
So I'm also familiar with the regulatory side of privacy as well.
Jon. Hi, I'm Jon Levine.
I'm a product manager at Cloudflare for data and analytics products.
So I spend a lot of time thinking about how we can take the most private possible approach to all of our data products, which we'll be talking about today.
Hi, I'm Edo Royker. I'm a Cloudflare senior managing counsel for product and intellectual property, which means I spend a lot of time counseling product engineering, research, and our special projects teams on a range of legal issues, including a lot of privacy work.
Great. Thanks. Well, Tara, let's start with you.
You've been thinking about how to build privacy into products for much of your career.
Can you give us a few key principles that product developers should be thinking about if they're going to be designing their products with privacy in mind?
Sure thing. I think we'll start off with data minimization. People talk about this a lot, but there's a good reason for it because it really is this central pillar of privacy.
This idea that you really just want to collect and use precisely what you need for a very specific purpose.
So you think about what it is you need the data for, you collect that, you use it for that purpose.
It's very, very focused work.
So what you want to avoid is collecting a bunch of data just in case.
We're not sure what we're going to need it for. Maybe in the future, this could be useful.
So let's just have it. That is the complete opposite of what you should be doing for data minimization.
You also want to pay attention to how long you hold onto this data.
So you want to think about your retention. So you want to clean it up as soon as you can and keep that amount of data small, compact.
If you want to do kind of a rough analogy, what you don't want to be is the hoarder person who is collecting everything they ever got back from their middle school, like report cards and everything, every gift they ever received and every receipt and put it in a great big warehouse.
And that's really the opposite of what we want. We want the Marie Kondo minimization where everything should spark joy.
You should be being thoughtful about your data.
Because of course, the more you have, you have to maintain that.
You have to keep track of it. People will ask what information you have.
You have to know these things. So there's a burden for you taking care of the data that you collect.
So that's another good practical reason why you want to have minimization around your data.
And it absolutely makes your other processes easier.
If you're trying to keep track of where your data is and what's happening, then the less you have, then the easier your job is.
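As a rough illustration of the two habits described here, collecting only the fields a stated purpose needs and cleaning data up once a retention window has passed, here is a minimal Python sketch. The field names, the `ALLOWED_FIELDS` allow-list, and the 30-day window are all assumptions made up for the example; none of them come from the discussion.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical allow-list: the only fields this (made-up) purpose needs.
ALLOWED_FIELDS = {"page_url", "status_code", "timestamp"}

# Hypothetical retention window; choose whatever the purpose justifies.
RETENTION = timedelta(days=30)


def minimize(event: dict) -> dict:
    """Keep only allow-listed fields; everything else is never stored."""
    return {k: v for k, v in event.items() if k in ALLOWED_FIELDS}


def purge_expired(records: list[dict], now: datetime) -> list[dict]:
    """Drop records whose timestamp is older than the retention window."""
    cutoff = now - RETENTION
    return [r for r in records if r["timestamp"] >= cutoff]
```

The point of the allow-list is that a "just in case" field has to be added deliberately, with a purpose attached, rather than being stored by default.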
Another principle, I would say: be careful around your communication and your transparency.
So you want to communicate to users what you're doing with data.
So we talked about you have a purpose in mind.
Well, you want to talk about why you're collecting data and what you're providing with that data.
Like what's the benefit to people? And you can communicate this in a lot of ways.
So it can be part of your design. So you might have a UI that's saying, here's what we're going to do.
We're collecting this piece and here's what we're using it for.
Here's what your location is being used for to deliver you this navigation service, for example.
It could be your supplementary communication as well.
So you may have a knowledge base article. You may be putting out blog posts or communications from the company.
And those are other ways in which you can communicate about what you're doing.
So people think about the privacy policies and they say, oh, I've got this really dense, long document that nobody wants to read.
And is that really what I'm expected to do to find out what's happening?
And of course, privacy policies have a use. But I wouldn't say that's supposed to be the way in which you clearly communicate to people about what you're doing with data.
And to link to the data minimization idea, if you have a much more straightforward, simple data story, then this makes your communications easier.
It makes your descriptions of what you're doing a lot more straightforward when you've actually been careful and thoughtful about your data in the first place.
I'd also say you have to think about the entire lifecycle of your product and your data.
These things have a beginning and a middle and an end in a way.
They have a story. You come up with an idea at the beginning. You're maybe brainstorming about something.
It may be very early. That may be a good time for you to start saying, well, what's the data story around this?
What's the privacy story around this?
Very, very early on before you've really launched fully into the development stage.
You may then make it more mature. You may go back and rethink things.
You may have lots of redesigns. There are lots of waypoints along the way for you to work in those thoughtful privacy conversations.
The product may have a sun setting.
You may deprecate something. Well, what's the plan at the end there?
Is it going to mature into something else? Are people going to take their data out?
Is all of that going to be deleted? You need to have an end-of-life conversation as well.
And you want to think about what ways your organization does its development.
So when do these conversations take place? Are there lots of bursts of time in which you do heavy design work?
Are they more iterative in nature?
And then how do you fit your privacy design work into those ongoing conversations that fit the way in which you actually do your design and development?
And the last guideline I'd like to mention is think about the fact that you're in a big ecosystem that is evolving.
So you're coming up with designs.
You're coming up with ideas. And the world is moving while you're doing all of those things.
And you're moving along with it. So you may end up, for example, having new opportunities you didn't think about in the first place.
There may have been new ways of doing things that weren't available when you first made a bunch of decisions about how you could use your data.
Maybe there's something that's even more clever.
Maybe you can work smarter with less data. And then you now have new opportunities that you can take advantage of.
And the world changes in what users expect around privacy as well.
So the landscape is changing. And it's good to stay in touch with the expectations around users and their privacy and find out what it is that they want and need.
I think that's a really important point to think about the evolution because privacy is never just done.
It's always an ongoing journey.
But I do think that there is a tendency to say, OK, well, we've done this one thing with our product.
And now we can forget about the privacy issues and move on.
But you're right. As products evolve and privacy regulations evolve, we have to keep an eye on that kind of thing.
And I'll just kind of also make the point that privacy-by-design actually right now is required by Article 25 of the EU's General Data Protection Regulation.
So that went into effect in 2018. So keeping up with that is kind of another thing that folks have to keep in mind.
And Article 25 emphasizes using technical and organizational measures to implement data protection principles like data minimization, as you were talking about, and making sure that personal data is only used for the specific purpose for which it was collected.
And that goes back to the transparency and communication part you were talking about, that when you tell people this is why we're collecting your data and here's how we're going to use it, you can't just flip and change it and use it for something else just because you want to or just because it's there.
So Jon, Tara gave us a lot of really great principles to think about.
And I know these are things that you kind of think about in your day to day.
So I'd like to get your perspective as a PM on how you're putting some of these principles into practice.
Yeah, totally. So I think a really great example of this is the web analytics product that we just launched this week.
So if folks aren't familiar, we're really excited about this.
So for a long time, we've had analytics in the Cloudflare dashboard for our existing customers.
People really love this because privacy is one of our values that we build into all our products, and because we can be very accurate.
It's very simple because we collect data at our edge. But of course, the challenge is that that is only available for customers who proxy their traffic already through Cloudflare.
They've already onboarded, changed their name servers.
We know that not everyone can do that. And we felt like there was a big opportunity to make a free and privacy-first analytics product for really anyone on the web that works the way that other analytics products work out there, which is that you take a little snippet of JavaScript and you add it to your web page.
So there's nothing inherently wrong about collecting analytics on the client using JavaScript.
But we know that there are a lot of folks out there who are concerned about this because of some of the things those analytics products are doing.
And so we wanted to make sure that we really explained what privacy-first means.
So what did that mean to us?
How do we build this into the product? And so I think data minimization was really one of the core things, I think, that we thought about in terms of what we minimized or what we did or didn't need.
I think a lot of people assume that analytics means tracking.
It means tracking a person, an individual across many requests, maybe many websites, maybe over a long period of time.
But we thought, is that really necessary?
That's going to be important if you're trying to understand an ad campaign.
If the thing you're trying to understand is that someone sees this ad and then buys this product, yeah, you might need that kind of tracking.
But if that's not your goal, if your goal is just to instrument your website, you don't need that.
So we started from the beginning and said, well, what do we need?
And it's not that much. We need to know, OK, was the web page loaded?
What was the URL so we can see what those web pages were? Then rather than drop a cookie and try to track a user, we wanted some other metrics, some other way of saying, how many people, how many visits were there?
We realized what we're really getting at is not how many people are navigating within your site.
If someone's clicking around URLs in your site, that's fine.
You already know about that.
But what about someone who's coming to your site for the first time? So we came up with this new definition of a visit.
And it's very simple. A visit is just a page view, but with a referrer from another website.
So it could be an empty referrer or a different URL.
It could be a search engine. The idea being, if someone's navigating to your site from somewhere else, you'll see that in the referrer.
It's actually OK if some browsers don't populate the referrer. Generally, they do for same-site navigation.
You'll see that. But it's OK, right? It's very privacy-first and allows us to really minimize the data we collect while still showing something useful.
So this is an example of how we kind of built privacy into one of our products.
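The visit definition described here can be sketched in a few lines of Python. This is only an illustration of the idea as stated in the conversation (a page view counts as a visit when the referrer is empty or points at a different site); comparing full hostnames is an assumption of this sketch, and the function names are hypothetical.

```python
from urllib.parse import urlsplit


def is_visit(page_url: str, referrer: str) -> bool:
    """Return True when a page view counts as a visit.

    A page view is treated as a visit if the referrer is empty or if its
    hostname differs from the hostname of the page being viewed.
    """
    if not referrer:
        return True
    page_host = urlsplit(page_url).hostname or ""
    ref_host = urlsplit(referrer).hostname or ""
    return ref_host != page_host


def count_visits(page_views: list[tuple[str, str]]) -> int:
    """Count visits in a list of (page_url, referrer) pairs."""
    return sum(is_visit(url, ref) for url, ref in page_views)
```

Nothing here needs a cookie or any per-user identifier: each page view is classified on its own, using only the page URL and the referrer.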
Yeah, that's great. And it's a little bit timely, I think, that not only are we doing this during our Privacy and Compliance Week, but the CNIL, which is the French Data Protection Authority, just levied some pretty big fines against Google and Amazon for placing cookies on user devices without the proper notice and communication.
So the idea that we can help our websites do some kind of analytics about their users without having to do that kind of cookie is, I think, really fantastic and super privacy preserving.
Totally. And one thing I do want to jump in and say, because we're talking about cookies, and cookies have a bad rap, and I totally get why.
I think cookies are preferable, in my mind, to fingerprinting and doing secret tracking that you can't control.
At least cookies, if you're tech-savvy, you can see them, you can remove them, you can block them.
You can't do that if someone's fingerprinting you.
So we've been really careful that it's not part of our analytics.
We're not using fingerprinting to track you either. Just an important point I wanted to make about that as well.
Just not using cookies is not necessarily privacy-first.
Yeah, that's true. And if you are using cookies, then the important thing is to make sure you're giving notice.
And then that's where those cookie banners come in that nobody really likes.
But at least you do see what's happening with your data, and you know that a cookie is going to be on your device.
So at least you've got that transparency, which goes back to what Tara was talking about earlier.
So switching a little bit, Edo, you're a product counsel at Cloudflare, and maybe you could give a little bit of an idea to folks about what a product counsel does.
But one of the products that I know that you worked on a lot is our public resolver, 1.1.1.1.
So wondering if you can describe a little bit about how that product was built with privacy in mind?
Because I know there was a lot of work there to make sure that it was built with privacy at the ground floor.
Yeah, yeah, of course.
Happy to help with that question. And I think it's important to first take a step back for folks who may not know.
So 1.1.1.1 is Cloudflare's public recursive DNS resolver.
And what a recursive DNS resolver does is when you're typing a human-readable domain into your web browser, your machine needs to find the IP address for the actual website.
And the first thing it's going to do is reach out to a recursive resolver to try and get that information.
And as a result, the recursive resolver is going to know your device's IP address.
So they'll know who you are.
And historically, these weren't done over encrypted connections. So other players over the Internet might have seen what websites you were accessing over the Internet, which for privacy folks, that definitely is not desirable.
And it really kind of went against some of Cloudflare's core values.
And we thought we could do something better than that, which was the impetus for building the 1.1.1.1 resolver, where we made sure that DNS queries could be encrypted with DNS over HTTPS or DNS over TLS to prevent snooping of your requests.
The thing that came to mind, though, is even if we were encrypting the connection, Cloudflare as a recursive resolver would still know the IP address of the users.
So there was still someone who had information on what websites you were trying to access.
So we realized the business objectives were really pretty strong, that we wanted to have a privacy-first recursive resolver where we could really assure our users that we would never sell their data, never use it for retargeted marketing in any way, and really do something different here.
So it really actually fit pretty neatly into the principles that Tara laid out on approaching it with the practice of data minimization and with good communication and transparency.
So when I was talking to our team, the first thing that we started talking about was whether we actually needed the IP address from the device.
And when we spoke about it some more, we started coming to the conclusion that we didn't actually necessarily need it.
And we could use a known technique that's talked about in RFC 6235 of IP truncation of the device IP.
And we ended up deciding to do that actually at our edge.
So it was to address certain issues of not moving IPs back to our main data center, but it ended up accomplishing a lot more than that.
And I remember chatting with our engineering manager, Olafur, about that when we first rolled it out.
And we decided at that point to do the anonymization at the edge, which, as I'll talk about in a little bit, ended up being very helpful because by never writing the full IP address to permanent storage, it made our public statements much stronger and much easier to stand behind.
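To make the IP truncation idea concrete, here is a minimal Python sketch using the standard `ipaddress` module. The prefix lengths (keeping the first 24 bits of an IPv4 address and the first 48 bits of an IPv6 address) are assumptions chosen for the example, and `truncate_ip` is a hypothetical helper, not the resolver's actual code.

```python
import ipaddress


def truncate_ip(addr: str, v4_prefix: int = 24, v6_prefix: int = 48) -> str:
    """Zero out the low-order bits of an IP address.

    Only the leading `v4_prefix` (IPv4) or `v6_prefix` (IPv6) bits are
    kept, so the result identifies a network range, not a single device.
    """
    ip = ipaddress.ip_address(addr)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    # strict=False lets us pass a host address and get its enclosing network.
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)
```

Applying a function like this before anything is logged, rather than as a later cleanup job, is what allows the statement that the full address was never written to permanent storage.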
And we decided to take transparency to an even further level.
Matthew, our CEO, is a real visionary. And he'd committed to having our privacy practices audited.
And some of our main commitments were that we will never sell your personal data or use it for retargeting marketing.
Now, what made the audit so successful was the fact that we really applied data minimization practices.
And we didn't have the personal data to begin with. We had never logged it to permanent storage.
And as a result of doing that, we were able to do the audit much more successfully.
And the audit, it was a lot of work. And it's something that we're very proud of.
But it involved even looking at code lines to confirm that throughout the entire audit period, our IP truncation logic was running.
And so that was pretty impressive. I think it was even before the GDPR went into effect, and we were already applying those concepts, which is nice.
One small pitch, though, is that at Cloudflare, we're continuing to take things further.
And I think it was this week, Tara could probably talk more about it.
We announced we're working on ODoH.
And that, in some respects, is taking this whole implementation even further in that no two parties will have both pieces of information and know both the requester and the destinations, which is really nice to see moving forward.
Yeah, I think that's really important.
And I think one of the things that we talk about a lot, too, and this goes both to the web analytics point, especially to 1.1, is there's an ocean of data out there in the world right now on the Internet.
And our goal is to really shrink that ocean.
So rather than, you know, there's a lot of important work being done to try to put rules and protections around how that ocean of data is being used.
But one really great way to kind of help with privacy is to just not put data into that data ocean or to minimize what amount of data is in there.
So I love that we keep doing things that try to accomplish that. Tara, do you want to jump in a little bit and talk about ODoH?
Sure. I was very happy to see support go out for this.
As a person working in the research team, I was very excited about new developments.
I think Edo set it up pretty well, so I'm going to keep it at a pretty high level.
But the idea of this, ODoH, Oblivious DNS over HTTPS, really is this idea that you have a query, you have a question: I have a domain name, I have something I want to look up to find the IP address.
So there's where the information is, and there's the person who wants this answer.
But you sort of try to separate these things so that you can have one party who says, someone has asked a question and I want to get them the answer, but maybe I don't need to see what's inside to know what that is.
And the other party who's trying to answer the question, who doesn't actually need to know the identity of the person asking the question in the first place.
So you sort of separate those out into two different parties.
And so you send your communication through a proxy, who then will send information to a target, which may or may not be the resolver, but the resolver function may live there and answer the question.
But you keep those two separated from each other. And so between the use of the separation through the protocol, through the use of encryption, you're then able to just send, in a sense, the correct level of information to the right parties to fulfill that functionality.
And I think there were a lot of things that worked really well for this.
As Edo laid out, you can set up all sorts of things in policy and you ought to do that.
But when there are places where you can also put in a layer of sort of a technical limitation, then that's just great when you can do that.
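The separation described here can be modelled as a small data-flow sketch in Python. This is a toy model of who gets to see what, not the ODoH protocol: `SealedQuery` is a placeholder that stands in for real encryption, and every class name, address, and key identifier below is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SealedQuery:
    """Placeholder for an encrypted query. NOT real cryptography: it only
    models "the target can open this, the proxy cannot"."""
    key_id: str
    _plaintext: str = field(repr=False)

    def open_with(self, key_id: str) -> str:
        if key_id != self.key_id:
            raise PermissionError("this party does not hold the key")
        return self._plaintext


@dataclass
class Target:
    key_id: str = "target-key"  # hypothetical key identifier
    seen: list = field(default_factory=list)

    def answer(self, sender: str, sealed: SealedQuery) -> str:
        name = sealed.open_with(self.key_id)
        self.seen.append((sender, name))  # knows the question, not the client
        return f"answer-for:{name}"       # placeholder answer


@dataclass
class Proxy:
    address: str
    target: Target
    seen: list = field(default_factory=list)

    def forward(self, client_ip: str, sealed: SealedQuery) -> str:
        self.seen.append((client_ip, "<opaque>"))  # knows the client, not the question
        return self.target.answer(self.address, sealed)
```

After one query, the proxy's record holds the client address with an opaque blob, and the target's record holds the domain name with only the proxy's address. In the real protocol the answer is also encrypted back to the client, which this sketch leaves out.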
And this was a big collaborative effort.
I mean, these are Internet-wide initiatives. You're part of the big ecosystem.
So this was very collaborative. We had people on the team working on a standard in the IETF, the Internet Engineering Task Force.
It was coauthored with Apple and Fastly to get this out. And this is a really big initiative.
There was source code made available for this. So this is put out into the community.
So some of what we do is we say, we're working on these things.
We want to bring privacy out throughout the ecosystem and bring everybody else into this world of more advanced privacy as well.
So all of those things kind of work together to help build a better Internet.
That's fantastic. So one of the things that I think about, I mean, we talk a lot about kind of how you build privacy in.
But I think there's also a challenge to figure out the logistics of doing that.
And so, Edo, maybe back to you a little bit. One of the biggest challenges that product and privacy lawyers face in working with product teams is that we don't want to discourage innovation.
We don't want to slow product teams down.
But sometimes there are legal requirements, or sometimes there are privacy practices and principles that we want to make sure the product teams and the engineering teams are focusing on when they're rolling out products.
How do you work with product teams to get them to think about privacy from the start if they aren't already?
I mean, we've got great PMs like Jon who kind of always are thinking about it, but not everybody is always thinking about privacy as the very first thing all the time.
How do you do that? And Jon, please chime in too about what you think works well when you're working with product counsel.
Yeah, of course.
So first of all, I think we're really lucky at Cloudflare because we really do have privacy built into the DNA of our culture.
And at times, I think people are even more cautious and concerned about doing anything that may be a violation.
So in that respect, we're very lucky. I think the first thing I'm always trying to do is figure out what the actual business objective is that we're trying to accomplish.
Usually, it's in the PRD, but you have to dive in a little bit deeper sometimes and really connect with the product manager that you're supporting.
And then you need to figure out what exactly are we trying to accomplish for our customers and what data do our customers need?
Because at the end of the day, we're building services for our customers.
And then the second piece, we also need to understand what data we need in order to successfully provide our services to our customers.
And those are two pieces of the same coin, but I think that's the first step.
And then once you have a clear understanding of the business objective and what each party needs in order to provide the service successfully, that's when you start really diving into the specific data elements.
Now, one thing that I've noticed over time is our PMs, they all know the products that they're working on very, very well.
But Cloudflare systems have become more and more complex, and there are many systems that data's flowing through throughout.
So it's always really important to be working with product and identifying the different systems where data may be flowing through different types of audit logs, error logs, and figuring out what types of retention periods and access controls are for all those systems.
Well, I'll just say, yeah, it helps to have great partners in the legal and privacy team that can help us with all this as PMs.
It also helps that our business goals and our privacy goals are really well aligned because we have paying enterprise customers.
That makes it so much easier than if you're trying to make your advertising network really private.
So that's a nice help that we have starting out.
But yeah, certainly, I'll say, not every product starts out where the headline goal of the feature is to be private.
But I still think as PMs, we have to have a little bit of that Spidey sense of like, okay, is this going to be sensitive?
Does this relate to personal data? So a really simple example I can give in my world where we deal with analytics and the logs about metadata generated at our edge, which we'll talk more about.
All I'm doing is helping customers.
Recently, we've been working on an integration to send our logs to Splunk.
So instead of folks sending their logs to their Azure storage and then writing code to send things from Azure to Splunk, we've made it simple.
We've cut out the middleman.
We just help send them to Splunk. I don't think I'm going to have to have a really long conversation with Edo about that.
I think it's pretty well understood.
We're just doing plumbing, connecting things. It's going to be okay. That's where the logs were going anyway.
But adding new fields, new stuff that we're collecting, right?
That's a totally different topic.
And Spidey sense tingles. Now I know, okay, we should have a conversation.
So I think developing that sense of when something is interesting from a privacy perspective is really important.
And let me give you a specific example of that.
So, to tell a story: one of our most important products is our WAF, our web application firewall.
Idea is that customers can write rules and we provide some default rules to help detect malicious things that people put into their HTTP requests.
So famously like SQL injection attacks, right?
We didn't want to just drop that stuff at our edge.
So when customers write rules that drop this traffic, we provide metadata about like, okay, how many requests were blocked by what rule and things like that.
We've had that for a long time and that's pretty straightforward.
But some customers come to us and they say, that's not enough.
They want to know why did we block something?
And specifically, what are people trying to attack me with? You want to know, you want to know your enemy, right?
You want to know what they're sending at you that Cloudflare's stopping.
You know, maybe it's a false positive, but maybe it's just interesting, might help you to learn how people are trying to get around your defenses.
The problem is, how do you log the bad stuff and not have anything that might be really sensitive, right?
If you're talking about things that might ultimately be making a database query, right?
That's maybe some of the most sensitive data that's flying through our network, right?
But our goals are, we thought about this and we said, you know, when it comes to, we call these like WAF matches, right?
WAF match payloads. We don't want, like Cloudflare doesn't want this data.
Like we don't want to know what's in there.
We want our customers to know. We want to get it safely to their storage provider, to their Splunk instance or whatever, but we don't want to see it.
And so what we realized we could do is we could let customers give us the encryption key or have them just encrypt it and have us not see the encryption key actually at all.
So to us, it's just opaque data flowing through our system. And it's just a blob that we can't see until they get it.
And that's actually worked really well.
I think it's a great example of how, okay, we need this, we need this thing for our product to work.
We didn't say, oh, well that's personal data. So we can't touch it.
We won't do it. We thought of a technical solution that still keeps it private.
Thanks. And Tara, this is something that you and I have talked about a bunch before is this idea that personal data can be in some of the metadata.
Do you want to talk a little bit about that? Because I think when we think about privacy, a lot of times we're thinking of the really common stuff like your name and your address or your social security number or credit card number.
But there's a lot of other data that can tell a story about a person that you may not realize because it's in log data.
I was wondering if you could add a little bit of information about that.
Sure. So, I mean, the metadata tends to be the data about data.
So there can be different definitions as to what that is. But a lot of the time, if you're making connections between communicating parties, people are really talking often about the confidentiality of the communication inside.
So your messages aren't able to be eavesdropped upon. But there's also just the general idea of who is communicating with whom and how often and when.
And maybe you can find out a little bit about the types of devices that they might be using or a location that that conversation might be taking place from.
That's not the content. That's not the inside of the message.
But those things that wrap around it may be able to be used to provide information about who a party might be or, to some degree, why some groups might be communicating, and that leaks a bunch of private information.
So you really have to think a lot about different layers. And there's definitely work that we do at Cloudflare around, for example, trying to encrypt some of the handshake information when setting up TLS connections to try to keep that information from leaking out, because we're trying to think about the entirety of these communications and keeping things as private as possible.
Cool. And, you know, one other thing, speaking of metadata, Jon, this just sort of reminded me that one of the other things that we're announcing this week is about our __cfduid cookie.
Do you want to talk about that?
Totally. Yeah, I think this is something that's pretty unique. You know, we talk about data minimization.
We're always looking at ways to just do more with less data because we don't want it.
And so, you know, the __cfduid cookie has been something that we've only ever used for helping customers with their security.
But, you know, it is a cookie. We do set it on a lot of websites and it's certainly raised questions.
We don't want people to have questions about that.
So we looked at whether we can provide those same security services without that cookie.
We think we have a good road map to letting us be able to implement those and be able to stop populating this cookie in a few months.
So that was something that was really driven by privacy concerns.
Yeah. And that's one of the things where, you know, there is some metadata in there that potentially could identify an end user.
However, that's not what we use it for.
And we certainly have never used that to track anybody. But yeah, getting rid of another cookie feels kind of good.
Well, I want to thank everybody for spending some time today talking about how we build privacy into products at Cloudflare.
I mean, I think, you know, hearing about data minimization as a really key way to start and then making sure you communicate about it transparently and openly.
I think, you know, as long as you've got those two things going, you're on a really good path to building privacy into your product.
So thanks, everyone.
Thank you, Emily. Thank you.