Lukasz Olejnik & John Graham-Cumming Fireside Chat
Presented by: John Graham-Cumming, Lukasz Olejnik
Originally aired on January 28, 2021 @ 6:00 AM - 6:30 AM EST
In this Cloudflare TV Data Privacy Day segment, John Graham-Cumming will host a fireside chat with Lukasz Olejnik, Independent Cybersecurity & Privacy Researcher and Advisor.
Transcript (Beta)
Good morning. Good morning and welcome to Data Privacy Day, International Data Privacy Day.
I'm John Graham-Cumming, Cloudflare's CTO, and my guest for the second talk this morning is Lukasz Olejnik, who is an independent privacy researcher with a very, very interesting background.
Lukasz, why don't you tell us a little bit more about some of the things you've done in your career?
Yeah, that's a pretty broad thing to discuss.
So I have a PhD in computer science privacy, so I have a technology background.
I worked on various things from quantum communication protocols that would be privacy preserving to web privacy and privacy of advertising technologies.
I also devoted a considerable amount of time to cyber security studies and also work.
I was, for example, the cyber warfare advisor at the International Committee of the Red Cross.
Closer to our talk, I was providing advice at the European Data Protection Supervisor in Brussels.
And perhaps much closer to the talk, I am the former World Wide Web Consortium Technical Architecture Group member.
In fact, standardization privacy was also among my research interests.
Okay, great. So a very, very relevant background to what I wanted to talk to you about, which is the state of privacy on the web in particular.
I think this is a sort of a complex and misunderstood situation.
Could you talk to us a little bit about when I use the web, what's happening to my data as I surf the web?
When you use the app or when you browse the web?
Well, you can tell me it's different things, right? It's a pretty complex problem, but let's focus on the web.
When you visit the web, as you know, a lot of things happen.
There is a name resolution, connections, routing.
When it comes from the perspective of your box, of your device, of your search device.
Typically, when you browse the web, the web browser gathers a lot of components and parses them, displays them.
Very often, most of the components are, of course, related to the content of the page you are interested in.
For example, when you might browse some blog or New York Times page or another page.
There are also considerations whether...
I mean, what's the distinction between this particular content you are interested in, which is typically the first-party content, such as the content of the articles you read, and also the third-party content, which typically might be content such as messages targeted to the user, such as advertisements.
There is a tremendously complex infrastructure in place that makes sure that the user receives the right message, the right from the perspective of the one that serves the message, of course.
And that those advertising technology players are able to gather data and process it.
And this gives rise to a lot of concerns. For example, what happens to the data?
Who has access to this? One of the things that perhaps is not understood is, if I go to a website, what is it that the website and an advertiser can learn about me?
I mean, obviously, perhaps my IP address at home, but there are lots of other technologies right around fingerprinting.
How does that all work? Fingerprinting is definitely something that came to the public eye recently, but the concept in the research circles is not that new.
I mean, it's been very actively researched for more than 10 years now.
So there are indeed a plethora of ways of identifying what the users are doing on the web.
The primary element that allows for this context, for this processing to happen, is, of course, cookies.
So cookies are the primary mechanism that makes it possible to discover what the user is doing on the web, on different websites, and also possibly to link the actions of users across different websites, which gives a lot of information.
Because, for example, there is this detail, and it's confirmed by scientific research, that the mere list of visited websites is very sensitive data.
So my research indicates that it's unique, it might be biometric, but it also can be converted to discover the user's state of mind.
So, for example, extract whether the user is an introvert or those matters.
So you might discover the behavioural profile of the user, psychometric profile of the user.
This is pretty sensitive data. Right. And just to dig into cookies a little bit.
So, obviously, in Europe, we have these cookie pop-ups, which are asking, if I go to a new website, for me to select which cookies I'm going to allow, which is, I think, for a non-technical user quite hard.
But broadly, are there good cookies? Are there bad cookies? How should we be thinking about that?
Oh, yeah, there are different cookies. You might have cookies that are used to track.
You might have cookies that are used simply to make the website work better.
So just to improve how it works. You might have cookies to study, to see how many users are visiting your site.
So, analytics. The simplest example is still the case where cookies simply facilitate the operation of an e-commerce cart.
When you buy something, there needs to be a way to link this activity to your future requests.
Because HTTP requests, the requests for websites, are stateless.
So cookies are the things that introduce this state.
So for market participants or website operators, it simply makes it possible to link one visit to another visit.
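As an illustration of the point about stateless requests, here is a minimal sketch in Python of how a cookie can tie two otherwise unrelated requests to the same shopping cart. It is not taken from the talk: the `CARTS` store, the `handle_request` function, and the `session_id` cookie name are all hypothetical, and no real network traffic is involved.

```python
import secrets
from http.cookies import SimpleCookie

# Hypothetical in-memory store: session id -> list of cart items.
CARTS = {}

def handle_request(headers, add_item=None):
    """Handle one stateless request and return (response_headers, cart)."""
    # Parse the Cookie header, if the client sent one.
    cookie = SimpleCookie(headers.get("Cookie", ""))
    morsel = cookie.get("session_id")
    session_id = morsel.value if morsel else None

    response_headers = {}
    if session_id not in CARTS:
        # Unknown visitor: create a new session and ask the browser to store it.
        session_id = secrets.token_hex(16)
        CARTS[session_id] = []
        out = SimpleCookie()
        out["session_id"] = session_id
        out["session_id"]["httponly"] = True
        response_headers["Set-Cookie"] = out["session_id"].OutputString()

    if add_item is not None:
        CARTS[session_id].append(add_item)
    return response_headers, list(CARTS[session_id])

# First visit: no cookie yet, so the server issues one.
first_response, _ = handle_request({}, add_item="book")
cookie_pair = first_response["Set-Cookie"].split(";")[0]

# Second visit: the browser sends the cookie back, and the cart is still there.
_, cart = handle_request({"Cookie": cookie_pair}, add_item="pen")
print(cart)
```

Without the cookie, the second request would look like a brand-new visitor and start with an empty cart. The same linking mechanism, when used by a third party across many sites, is what enables tracking.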
Of course, there are also other mechanisms that are different than cookies, such as fingerprinting.
Right. And perhaps one of the most obvious ways in which people, I think, see all this tracking going on is when they visit one website.
And then subsequently, maybe the next day, they see an ad for a thing they were looking at before.
And I mean, in particular, if this happens, you look at something on the web and then you go to Instagram on your phone and you see an ad for the same holiday location or something.
That retargeting thing feels quite creepy to people. Well, it might freak people out, indeed.
But the use case you mentioned is remarketing. So targeting ad content based on the content the user visited previously.
For example, the previously visited websites or maybe even web searches.
So I think we can both agree that they are very often not well targeted to our needs.
So, for example, there is a popular joke that when you buy something on Amazon, then you will see the ads for this for the rest of your life.
Yes. But in general, we may ridicule it. But if this is happening, I would say that someone is finding this useful.
For example, the advertisers wouldn't be doing this if there were no sense in doing it.
Right. Right. And as this problem or at least this situation has become more obvious, to a certain extent, the public is worrying now about privacy and some companies are making moves to make things more private.
So Apple has been talking about changing the IDs that are in phones so that advertisers cannot tell what's going on.
And Google was introducing something, at least an experiment, called Privacy Sandbox.
Can you explain what are these attempts to make the web more private all about?
Yes. This has been going on for five, six or seven years. It seems there is a competition on the grounds of privacy.
It was first picked up by Apple, and they even made it a crucial point in their keynotes when they proposed new stuff.
They speak about privacy. At least they used to.
And I remember when this happened for the first time. So I was pretty amazed because this is not really something that I would expect.
Or I remember when technical terms such as differential privacy started to be mentioned in these kinds of addresses.
So there's definitely a competition on the grounds of privacy, perhaps even human rights.
Go ahead.
So the most renowned company known for doing this is, as of now, Apple. Others are trying to join the party and others are deciding to do the inevitable.
For example, to improve the state of some areas of web or mobile activity.
And this is related to this proposal recently put forward by Google in the form of a bunch of standards, the privacy sandbox.
But people do not realize that privacy was actually not built into the Internet when the core standards were being designed in the 1980s.
The significant interest in privacy, we see this perhaps over the previous two decades.
And it was only in 2013 that the Internet Engineering Task Force issued an informational Request for Comments on privacy considerations for Internet protocols.
And it is since then that more and more Internet protocols or web standards contain privacy consideration sections.
So privacy is since then becoming something that you design for, you include it in design considerations.
And this is a major change.
Can you give an example of something that's been designed to include privacy in its fundamental nature?
The most well-known technology available to users is end-to-end encryption in communication.
But we can probably agree that this is a somewhat different state of affairs.
Of course, end-to-end encryption is being developed and deployed in many new uses, many new standards.
But I think what counts more is not a particular, I mean, there are particular proposals put forward to improve privacy specifically.
For example, recently there was a proposal to phase out what is probably a small thing, the user agent string.
When we browse the web, the web browser informs the website about the user agent.
So the browser, with all its version numbers.
And this might help in fingerprinting.
So this is phased out and this will be simplified to reduce the fingerprinting surface.
But I think what is most important is the actual gradual adoption of privacy-aware thinking in many standards, at scale.
So there's gradual change.
So I can, for example, mention specifications or features that were phased out completely by web browsers or which received significant changes.
In some of those I had my share, and I expect this will continue to happen.
But that points to this being quite a long process to embed privacy in the protocols of the Internet.
It's not going to happen quickly.
I understand. But the same happened with security. To arrive at mature security engineering, it took how long?
30 years? And we are still basically at the beginning of this process when it comes to privacy.
So there are processes designed, flows designed, standards designed.
I worked on many standards and I know that it takes time to change things, to find a consensus.
Then it takes time for implementers to actually change things as well.
So it's a complex and often tiresome experience.
So one of the things that's going on right now is that Google, who is fundamentally an advertising company and needs to understand what people are doing to sell them ads, is proposing something called Privacy Sandbox.
It's a very large project.
But in the remaining 12 minutes, let's talk a little bit about this. What are they trying to do?
Yeah, that's a pretty catchy name, isn't it? Privacy Sandbox.
Yes. When it comes to the PR, it's good. It's too bad that they do not continue with this privacy-aware communication and PR with their further communication.
Sometimes they even omit to mention some important things. But back to the story, back to the core.
Privacy Sandbox is basically a bunch of specifications, standards that would change how some features behave, how they work.
We actually, these days, have a conversation about privacy on the web and it touches the very details of web architecture.
So web standardization is at the core.
And a testament to that is that Google put forward for consideration a bunch of standards within an open venue such as the World Wide Web Consortium.
And allowed people to provide feedback, provide opinions, criticize, even provide ideas on how to change or improve something.
So there was a really clever move by Google from the very beginning, and many people missed this fact.
They say that indeed, yes, they will phase out the mechanism of third-party cookies, much criticized for the risk of abuse and misuse.
But they will only do this under a condition that satisfactory changes to the web platform are made.
And you might respect this because it's a fair way of demonstrating how something is subjected to something, is conditioned on something.
Some people might also respect it because, from a market competition perspective, you have many businesses that are using cookies or other identification methods to provide services.
And this has actually been actively seen now.
For example, you have the Competition and Markets Authority in the UK, which is conducting an investigation.
But back to the privacy sandbox, it's a bunch of bricks.
So there are a number of standards and specifications that will describe how features are used.
And taken together, only taken together, they would make it possible to maintain a way of directing ads to users.
So some patterns of use like that, but also preserving user privacy.
I would say that some of those proposals would also contribute to the improvement of privacy in general.
But the aim is to arrive at a solution in advertisement technologies that maintains the ability of running such businesses.
It also reduces the privacy cost. But presumably does improve the individual privacy to a certain extent while still allowing advertising.
Because that's the trade off, right? That's the catch. And there are two phases of this problem.
First, whether it improves things. And second, whether users would actually understand if things are actually improved, assuming they are improved.
So let's say, yes, they do.
And the correct implementation and use of this stack would improve privacy with respect to today's concerns and some of the harms.
But there is a massive catch because we need to understand and we need to track how those proposals are actually deployed in the end.
Because we do not know yet what will be the final vision, what will be the final project.
It's being developed.
It's in the phase of some gentle implementation and tests. But we actually do not know the full picture and the end results.
So answering this in general would not be possible.
What is clear is that such proposals need to undergo public vetting, but also assessments from the privacy point of view, including the changes.
So it's not that you make an assessment now. There are changes later and we do not change our position.
Agreed. When you look at what Google has proposed today, are there technical things that you find particularly interesting?
I did have a look, indeed.
And there are a bunch of proposals. Some are more interesting to our discussion.
Others are less interesting to our discussions. So some of the proposals would definitely improve things.
For example, as I already mentioned, the simplification of the user agent string.
So there is less potential for using this for fingerprinting because, I don't know if people realize, but today considering only the IP address and this user agent string may actually create a unique fingerprint in between 50% and maybe 70% of cases.
So users. So we have a unique fingerprint not based on cookies.
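To make the cookie-less fingerprint idea concrete, here is a small Python sketch. It is only an illustration under stated assumptions: the visitor list is made-up toy data (documentation-range IP addresses and an invented `ExampleBrowser` user agent), the helper names are hypothetical, and the resulting numbers say nothing about the 50% to 70% figure mentioned above.

```python
import hashlib
from collections import Counter

def fingerprint(ip, user_agent):
    """Combine IP address and user agent string into one short identifier."""
    raw = f"{ip}|{user_agent}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]

def unique_fraction(visitors):
    """Fraction of visitors whose (IP, user agent) fingerprint is unique."""
    prints = [fingerprint(ip, ua) for ip, ua in visitors]
    counts = Counter(prints)
    return sum(1 for p in prints if counts[p] == 1) / len(prints)

def simplify_user_agent(user_agent):
    """Hypothetical simplification: keep only the browser name and major version."""
    return user_agent.split(".")[0]

# Toy data: two users behind each shared IP address.
visitors = [
    ("192.0.2.10", "ExampleBrowser/88.0.4324.96 (OS 10.15.7)"),
    ("192.0.2.10", "ExampleBrowser/88.0.4324.150 (OS 11.1)"),
    ("198.51.100.7", "ExampleBrowser/88.0.4324.96 (OS 10.15.7)"),
    ("198.51.100.7", "ExampleBrowser/88.0.4324.96 (OS 10.15.7)"),
]

detailed = unique_fraction(visitors)
simplified = unique_fraction(
    [(ip, simplify_user_agent(ua)) for ip, ua in visitors]
)
print(detailed, simplified)
```

In this toy data, the detailed user agent string is what separates the two users who share the first IP address. Once the string is reduced to the major version, they fall into the same bucket, which is the intuition behind reducing the fingerprinting surface.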
So, yes, there is a need to fix that.
And I would say that this will unconditionally improve things.
There are also other things, such as the attempt to decouple real-time bidding, which is server-side bidding based on the heavy processing of user data, bidding for ad spaces.
This is totally non-transparent. So one idea is to do this in the browser. So run transparent bids, run a transparent auction for ad spaces.
But the devil is always in the details.
So for example, whether tracking is really curbed, whether leaks are fixed.
For example, the proposal I'm talking about is called TURTLEDOVE because, for some reason, Google chose a nomenclature of bird-like names.
So turtle dove, or they don't use dogs, but they're right.
So TURTLEDOVE allows the identification of user interest groups, a segment of user interest groups, and bidding.
This bidding will no longer happen on the non-transparent server side, as it happens today in real-time bidding, but it will happen on the user side, which means that the algorithms will be available to audit.
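The idea of moving the auction to the user side can be sketched loosely in Python. This is not the TURTLEDOVE specification or any browser API. It is a simplified illustration in which the `Bid` type, the `run_local_auction` function, and the sample advertisers are all hypothetical, and it assumes the candidate bids have already been delivered to the device.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Bid:
    advertiser: str
    interest_group: str
    price_cents: int

def run_local_auction(local_interest_groups, candidate_bids):
    """Pick the highest bid whose interest group matches one stored on the device.

    The interest groups are only read locally; nothing in this function
    sends them to a server.
    """
    eligible = [
        bid for bid in candidate_bids
        if bid.interest_group in local_interest_groups
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda bid: bid.price_cents)

# Interest groups kept by the browser (toy data).
my_groups = {"hiking", "cameras"}

# Bids made available to the browser ahead of time (toy data).
bids = [
    Bid("advertiser-a", "hiking", 120),
    Bid("advertiser-b", "cars", 300),
    Bid("advertiser-c", "cameras", 150),
]

winner = run_local_auction(my_groups, bids)
print(winner)
```

Because the selection logic runs on the device, it can in principle be inspected. Whether a real deployment leaks data through other channels is exactly the open question raised in the discussion.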
But the problem is whether the final solution deployed would improve privacy, because we do not know yet what the final solution will look like.
For example, it is already known that this current solution, even now, it has some data leaks, which are not addressed.
So another question is whether the designers of the specification would want to address the known data leaks.
Fair enough. Earlier today, you sent me an email about something called IP blindness.
Should we just talk about that a little bit?
A little bit, please? You sent me an email earlier about a proposal around IP addresses.
I think that's an interesting thing, because people often think about IP addresses as very sensitive.
What is this proposal around?
So now we are discussing privately sent emails. What a time to be alive.
So this particular proposal is a very initial solution.
So it's not totally not being discussed by anyone.
It's an initial solution, but at the core of this proposal is masking the IP address and possibly also the user agent string, which, as I said, would immediately address a big fingerprinting vector.
I mentioned it before, that only processing IP address and the user agent string allows for the creation of a strong fingerprint.
So the proposal at the center is to mask IP addresses by setting up a bunch of proxy servers.
So the user would connect to the website through some proxy server in a way that the communication content is protected, perhaps with end-to-end encryption.
And the source address will be masked by this proxy server.
But as always in the case of such solutions, we would have another problem.
So we exchange the problem of revealing the IP address of the user.
And it so happens that IP addresses may be used for persistent tracking.
So we change this problem to another. We replace it with a problem of trusting trust.
So can we trust those servers? So can we, for example, maintain some level of auditability?
In general, trusting trust is not an easy problem to solve.
And this is a problem that the Tor project has really faced, which is how do you trust untrusted parties to forward your traffic?
Yes, it's a little bit different. But I would imagine that trusting such third-party trust.
And I understand that even Cloudflare maintains a fleet of very strong, powerful CDN servers.
And I agree that using this masking solution might even reduce the risk of denial of service to some websites.
But the problem that would be left is how should users approach this need to trust the supplier, the provider of the servers?
All right. Well, Lukasz, listen, we are out of time.
We're going to have to leave it at that point. Thank you for this discussion.
Thanks for introducing some of these ideas to the audience.
And thank you for taking the time today to talk with me. It's been great.
Thank you very much. Have a nice afternoon. Thank you very much.