🔒 Oblivious DoH Deep Dive with Sudheesh Singanamalla

Presented by: Sudheesh Singanamalla

Originally aired on August 4, 2021 @ 6:00 PM - 6:30 PM EDT

DNS is the foundation of a human-usable Internet, but the traditional DNS protocol is unencrypted and leaks user information. Recent efforts to secure DNS with DoT and DoH have been gaining traction and are securing user DNS traffic. However, these protocols still allow DNS resolvers to associate query contents with client identities in the form of IP addresses. Oblivious DNS over HTTPS (ODoH) addresses this problem and was proposed in an IETF draft.

In this segment, we'll go through the protocol and some research measurements showing how the protocol is a practical way forward to improving privacy in DNS.

Privacy Week

Transcript (Beta)

Hello, everyone. Thank you for being here today. I'm Sudheesh, a part of the research team here at Cloudflare, and I'll be taking you through a deep dive of a new privacy-enhancing protocol for DNS, which the Cloudflare Resolver now supports. I'll be sharing some of the measurements we did for this protocol. This work is the result of a collaborative effort between lots of incredible people at Cloudflare, and was a part of my internship work this summer. This segment will focus on Oblivious DNS over HTTPS, or ODoH, a standard proposal co-authored by Cloudflare, Apple, and Fastly, and we're committed to moving this forward. Please do send in your questions, and I'll try to answer them towards the end of this presentation. So before we dive into any of the details, let's start by understanding what DNS really is. DNS is the Domain Name System and is the foundation of a human-usable Internet today. It's like a phone book containing the IP addresses and other information corresponding to host names. The DNS resolver is a server, like 1.1.1.1, that responds to client queries, like those sent from your computers connected to the Internet, with the corresponding IP addresses and records from the phone book. Typically, as you can see here, the client makes a request to the DNS resolver, like Cloudflare DNS, for a website, example.com. The resolver then communicates with the root server requesting the information for .com, and then speaks to the TLD server to ask for information about example.com, after which it speaks to an authoritative name server belonging to example.com, which can respond with the IP address. As the final step, the DNS resolver sends it back to the client who requested it. The client then uses this response to connect to the server and obtains a response from example.com. This type of resolver is called a recursive resolver.
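Editor's sketch: the resolution walk described above (root, then TLD, then authoritative name server, then back to the client) can be modelled as a toy lookup. This is a minimal illustration and not how a production resolver is implemented; the server names, the zone tables, and the address 192.0.2.10 (from a documentation-only range) are all hypothetical.

```python
# Toy model of recursive resolution. All zone data is hypothetical;
# 192.0.2.0/24 is a documentation-only address range.
ROOT = {"com": "tld-com"}                          # root knows who serves .com
TLD = {"tld-com": {"example.com": "ns-example"}}   # TLD knows each domain's name server
AUTHORITATIVE = {"ns-example": {"example.com": "192.0.2.10"}}


def resolve(hostname, cache):
    """Walk root -> TLD -> authoritative, as a recursive resolver would."""
    if hostname in cache:                  # answered from the resolver's cache
        return cache[hostname], ["cache"]
    steps = []
    tld_label = hostname.rsplit(".", 1)[-1]
    tld_server = ROOT[tld_label]           # ask the root about ".com"
    steps.append("root")
    name_server = TLD[tld_server][hostname]    # ask the TLD about the domain
    steps.append("tld")
    address = AUTHORITATIVE[name_server][hostname]  # ask the authoritative server
    steps.append("authoritative")
    cache[hostname] = address              # remember the answer for next time
    return address, steps


cache = {}
first = resolve("example.com", cache)      # full walk
second = resolve("example.com", cache)     # served from cache
print(first)
print(second)
```

The second call skips the walk entirely, which is why the later discussion distinguishes a cache hit from a cache miss at the resolver.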
For the rest of this presentation, we'll be considering only what happens between the client and the resolver, which are steps one and eight that are shown here. Traditionally, the DNS protocol is not encrypted and it uses UDP, which continues to be a majority of the traffic that's received by the public recursive resolver that Cloudflare operates, making up to 92% of the traffic that we see. The usage of non-encrypted DNS actually leaks user information, like the website that you're trying to visit and your identity, to the network operators and also to on-path adversaries who are observing the network, and it even allows active attackers to modify the request from the client or the response from the server. But wait, who are these network entities who can see my request and the associated information? Well, you can perform a traceroute to the resolver you use and it would show you the number of hops your message goes through before reaching the resolver. For example, here, I'm trying to find a route for my client to reach Cloudflare DNS, and I see seven hops before the request is delivered to Cloudflare's 1.1.1.1 service. Each of these routers can store your information and view which websites you're visiting, creating a profile of you for targeted advertising. To overcome some of these problems, there have been a lot of recent efforts to secure DNS, called DNS over TLS, or DoT, and DNS over HTTPS, or DoH. These have been gaining popularity and have been integrated into various web browsers, like Firefox and Chrome, and operating systems to protect the client traffic from being observed by onlookers or on-path adversaries. This also prevents the traffic from being intercepted or changed by these attackers. But there's a problem that still remains. The resolver operator can continue to associate the queries with the clients' IP addresses and build a profile around their browsing patterns.
Over the last few years, we've seen active measurement research trying to understand and measure the impacts of encrypted protocols like DoH and DoT. Many large-scale measurements have shown that the performance of encrypted protocols actually varies by the choice of the resolver, but it does not impact the page load time and it improves user security. There have also been various attempts to improve page load times using something called prefetching. But while DoH, or DNS over HTTPS, actually did improve the security of DNS queries for the clients, it also received a lot of criticism for the small number of publicly available services, essentially centralizing the Internet and giving these organizations a lot of control. Additionally, these resolver operators can also associate all of your queries with you using your IP address and geolocate you. To maintain privacy guarantees, Cloudflare and some other operators actively purge data older than 24 hours, and this gives users privacy policy-based guarantees. One way to address this is by bringing together organizations to agree on a common set of privacy practices and maintain user privacy, but this requires a lot of explicit negotiation and effort between these organizations. Mozilla, for example, actively defined criteria around data retention, data aggregation, and enabling frequent audits before a DoH service can be configured as a default in the Firefox browser. There's also heavy regulation in place to prevent monetization attempts from the usage of DNS. And while some users might be comfortable with this policy-driven approach to privacy, these policies are difficult to enforce and are often very time-consuming, making users want a system that can technologically guarantee their privacy. Today, I will focus on the privacy critiques and the steps that we're taking at Cloudflare to prevent these resolvers from being able to profile their clients.
This is exactly where the new protocol, Oblivious DNS over HTTPS, or ODoH, kicks in. At a high level, there are three main components in ODoH. The first are the clients, who prepare the query for which they would like a response. The goal of the clients in ODoH is to be able to successfully send encrypted messages, receive valid encrypted responses to decrypt, and in the process be able to identify if there are malicious actors and take corrective actions if they recognize an attack on their requests. The clients in this protocol prepare the DNS queries and send them to our second component, which is the proxy. The proxy's main role is to relay these encrypted queries and responses to and from what is essentially the third component of the system, which are the targets. The proxy does this while removing the client's IP address from what it sends on. The third component is the target instance, or what I will sometimes also refer to as the target resolver. Targets receive encrypted queries from the clients, decrypt the query, and obtain a DNS response for the corresponding client query. The target's role is to then find the result by looking it up in the phone book, prepare an encrypted response, and send it back to the proxy, with no ability to identify the client by their IP address. But when is this useful? The design of the protocol assumes what we usually refer to as a Dolev-Yao-style attacker, an extremely powerful attacker who can monitor and observe all the requests on a specific network channel. Let me put the power of this attacker in perspective. We're assuming an attack model where the attacker can observe all the traffic between the client, that is you, and the proxy, and also between the proxy and the target resolver. This kind of adversary is actually quite a powerful one. But what exactly do we want to achieve in these settings?
First, we want to ensure that the protocol achieves confidentiality guarantees in the presence of an adversary like the one we just discussed. This means that the adversary should not be able to read your request or the response that you're receiving. Secondly, we want to ensure that the queries that you send are actually received by the target you want to speak to, and that the contents of the message that you send are not changed in the process, which is the integrity guarantee that we would like. Thirdly, we want the client to be able to verify the correctness of the response that it receives and ensure that even if the responses for two clients were swapped by this attacker, the information about the client's other queries isn't leaked anyway. Finally, and most importantly, we want to achieve the goal where the resolvers don't know the identity or the IP address of the client. With ODoH, we achieve all of these guarantees and try to evaluate the impact of providing these guarantees on performance. At a high level, the design of ODoH is similar to that of DoH, but it injects an intermediate proxy node, which terminates the client query Q, which is the encrypted query for example.com, and performs the query on the client's behalf. The communication between the client and the proxy is established using HTTPS and provides the same security properties of confidentiality and integrity as HTTPS itself. The guarantees for privacy, however, make this slightly different from a proxied variant of DoH, where, for example, the client makes a DoH query to the proxy, and the proxy relays the DoH query to the target. That is because we want to ensure that the proxy cannot read the information and it's only the target which can read this information. So the main goal of ODoH is to prevent the recursive resolver, or the target here, and the ISPs running such resolvers from being able to link the clients to their requests.
So a client encrypts the DNS query using a hybrid public key encryption scheme with a validated public key from the target resolver and sends this query to the proxy. The proxy then forwards this query to the target. In ODoH, the proxy instance can see the IP address of the client, but not the content of the DNS query, and it forwards this message to the oblivious target, which decrypts the query and obtains the response from the resolver. The response is encrypted by the oblivious target and is sent back to the client through the proxy. In this process, the target only sees the DNS content and not the actual client's IP address. We often show the target and the resolver as independent entities, but this need not really be true. Ideally, in practice, for performance reasons, the oblivious target and the recursive resolver could be co-located to avoid any additional network messages between the oblivious target and the recursive resolver. But it is still possible to maintain these as individual services without co-location. So you might be wondering, well, how do you get this public key from the resolver which you want to use for encryption? And that's a great question. Your client can make an HTTPS record query to the resolver to see if the protocol is actually supported and then obtain the key that is configured by the resolver or the target operator in the HTTPS bindings. The clients can then perform DNSSEC validation on the response to ensure that the response is actually valid and has not been tampered with by the on-path adversary that we discussed. Here's an example. We can perform a dig query for type 65, or the HTTPS service record, with DNSSEC validation enabled, to retrieve the public key and a valid signature from the ODoH service on Cloudflare DNS. But what is DNSSEC?
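Editor's sketch: the type 65 lookup mentioned above is an ordinary DNS question with the record type set to 65 and the DNSSEC OK bit set in an EDNS0 OPT record. The following builds such a query in wire format using only the standard library. It is an illustration of the message layout, not the code the ODoH clients use; the host name `odoh.cloudflare-dns.com` and the 4096-byte UDP size are assumptions made for the example.

```python
import struct

HTTPS_RR_TYPE = 65      # the "type 65" / HTTPS record mentioned in the talk
OPT_RR_TYPE = 41        # EDNS0 pseudo-record
DNSSEC_OK = 0x8000      # DO bit in the EDNS0 flags


def encode_name(name):
    """Encode a host name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(name, qtype=HTTPS_RR_TYPE, query_id=0):
    """Build a DNS query with one question and an OPT record with DO set."""
    # Header: id, flags (RD=1), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=1
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 1)
    question = encode_name(name) + struct.pack("!HH", qtype, 1)  # class IN
    # OPT: root name, type 41, UDP size, ext-rcode 0, version 0, flags, RDLEN 0
    opt = b"\x00" + struct.pack("!HHBBHH", OPT_RR_TYPE, 4096, 0, 0, DNSSEC_OK, 0)
    return header + question + opt


# Assumed host name for the ODoH configuration record.
packet = build_query("odoh.cloudflare-dns.com")
print(len(packet), packet.hex())
```

A real client would send these bytes to a resolver and then validate the RRSIG records in the answer before trusting the returned key.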
DNSSEC, or what is known as the DNS Security Extensions, uses signatures based on public key cryptographic schemes to protect the DNS data itself rather than the query that we have been discussing so far. This ensures that the data that's being sent from the resolver has not been tampered with. But why is this important? With DNSSEC, we achieve the properties of authentication and integrity, where the client receiving the data and the signature from the server it communicates with can verify that the data was actually sent by the server it was supposed to be coming from. It also verifies that the content has not been modified since the time it was signed by the actual owner. This brings us to the next question. How do we perform this encryption? What is this HPKE thing? HPKE is the hybrid public key encryption scheme, and it allows both symmetric and asymmetric cryptographic schemes to be put together. The clients retrieve the public key from the target, like in a public key encryption scheme. Then they encapsulate a secret along with the actual content of the DNS query, which is then sent to the target, as we discussed before, through the proxy. The target receives this message, decrypts it, and obtains the decapsulated secret. The target then runs the secret through a key derivation function to obtain a shared session secret, which can be used like a symmetric cryptographic scheme. The client also does the same, and both end up with the same information necessary to perform the encryption and decryption. Here, the target encrypts the response with an authenticated encryption scheme, and the client verifies and decrypts the response, obtaining the final value, or the answer to the query that it actually asked for. But this procedure of retrieving keys should not be a one-time operation.
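Editor's sketch: the key derivation step described above, where the client and the target each run the same secret through a key derivation function and end up with the same session key, can be illustrated with plain HKDF over SHA-256 (RFC 5869). This is not HPKE itself, which adds a key encapsulation step and its own labelled inputs; the fixed `shared_secret` stands in for the decapsulated secret, and the context label is hypothetical.

```python
import hashlib
import hmac


def hkdf_extract(salt, ikm):
    """HKDF-Extract: condense the input secret into a pseudorandom key."""
    return hmac.new(salt, ikm, hashlib.sha256).digest()


def hkdf_expand(prk, info, length):
    """HKDF-Expand: stretch the pseudorandom key to `length` bytes."""
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_session_key(shared_secret, context, length=16):
    """Derive a 16-byte key (the size AES-128-GCM needs) from a shared secret."""
    prk = hkdf_extract(b"\x00" * 32, shared_secret)  # all-zero salt of hash length
    return hkdf_expand(prk, context, length)


# Placeholder for the secret that encapsulation/decapsulation would produce.
shared_secret = bytes(range(32))
context = b"illustrative response context"   # hypothetical label

client_key = derive_session_key(shared_secret, context)   # computed by the client
target_key = derive_session_key(shared_secret, context)   # computed by the target
print(client_key == target_key)
```

Because both sides feed identical inputs into a deterministic function, no key ever has to travel over the network after the encapsulated secret is delivered.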
So the resolver returning these keys also sends something called a TTL, or time-to-live, value indicating when the key expires, after which the clients have to request a new or more recent copy of the key and see if the key has changed. That was quite a lot of information. With that understanding of the protocol in place, there are three main research questions that we set out to understand. The first one is, what is the impact of ODoH on the DNS response times? The second one is, how does ODoH actually affect page load time or user experience? And third, how does ODoH compare to the other privacy-enhancing protocols that are out there today? This helped us understand the cost of privacy for the user while maintaining the security guarantees that are provided by these encrypted DNS protocols. This slide pretty much sums up the results of our measurement study and is probably of the most interest to all of you. I'll leave this here as a reference to come back to in case anyone wants to refer to the slides or pause and watch the recording later. In the next few slides, I'll talk about these measurements and the results in more detail. So let's get to the measurements. We implemented and deployed the oblivious targets and proxies using Google Cloud and a serverless platform like Cloudflare Workers. We physically separated the oblivious target instances from the resolvers and randomized the queries across three public resolvers, Cloudflare DNS, Google DNS, and Quad9 DNS, which we used in our measurements.
We used nine Google Cloud data center locations, seven across the United States, one in Montreal in Canada, and one in Sao Paulo in Brazil, with 10 client instances running at each of these vantage points, performing an experiment, sending DNS requests at a rate of about 15 requests per minute, which is the average number of DNS queries that are sent by client devices with very high Internet usage. The average bandwidth for all the clients running on the single-core Intel Xeon machine in each of these data centers during the experiment was 480 megabits per second. The clients perform DNS response time measurements by choosing pairs of available proxies and targets. By choosing a low-latency proxy-target pair for the measurement, which is shown in the orange line here, we find that the average response time improves by 22.8% compared to only choosing a low-latency proxy, which is shown by the green line. This hints towards the fact that having an intermediate proxy on the same network path to the target will actually improve the response time performance. And this path can be quite different from the path that a UDP-based DNS packet, which we discussed earlier and saw in the traceroute, might actually take. So choosing a low-latency proxy-target pair actually does make a significant impact on performance. But what about connection reuse? Connection reuse is an optimization that can enable clients to improve their performance by at least 46% on average. This is because it avoids unnecessary TCP and TLS handshakes for every HTTPS request and reuses the same HTTPS connection for subsequent queries. So we encourage all clients to reuse connections whenever possible, and doing so would still maintain the privacy guarantees. In our experimental setup, we evaluated for the worst-case performance and incurred an additional network latency between the target instance and the resolver in the architecture that we saw before.
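Editor's sketch: the difference between picking only a low-latency proxy and picking a low-latency proxy-target pair, as described above, can be shown with a small selection routine. The proxy and target names and every latency figure below are made up for illustration and are not taken from the measurement study.

```python
from statistics import mean

# Hypothetical client-to-proxy round-trip times in milliseconds.
proxy_rtt = {"proxy-a": 30.0, "proxy-b": 5.0}

# Hypothetical end-to-end response times per (proxy, target) pair, in ms.
pair_samples = {
    ("proxy-a", "target-x"): [48.0, 52.0, 50.0],
    ("proxy-a", "target-y"): [90.0, 95.0, 88.0],
    ("proxy-b", "target-x"): [70.0, 66.0, 71.0],
}


def nearest_proxy(rtts):
    """Strategy 1: pick the proxy closest to the client, ignoring the target."""
    return min(rtts, key=rtts.get)


def best_pair(samples):
    """Strategy 2: pick the pair with the lowest mean end-to-end time."""
    return min(samples, key=lambda pair: mean(samples[pair]))


print(nearest_proxy(proxy_rtt))
print(best_pair(pair_samples))
```

In this made-up data the nearest proxy is not part of the fastest pair, which is the situation where measuring the whole path pays off.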
The target instances, which are located in Google Cloud and perform queries to the three open public resolvers, have faster response times for Google DNS, which you can see on the green line here, compared to other services. These measurements hint towards the fact that integrating the oblivious target into the recursive resolver can reduce the network latency currently incurred to that of a cache hit for the answer, with a cache miss resulting in the actual network cost being incurred by the recursive resolver to communicate with the other name servers, like the TLD server and the root server that we've seen. So we find that co-locating these services actually results in much better performance. And that's exactly what we've enabled in the recent update to Cloudflare DNS. We co-located the target and the resolver and enabled the protocol for ODoH. But what about performance? To understand the performance of ODoH, we compared it to other protocols offering similar privacy and security guarantees and used DoH as the baseline protocol, which is shown on the indigo line on the left. DoH over Tor is a variant of DoH, which is shown here on the beige line on the right. Both provide security and privacy guarantees of some sort. When compared to ODoH, which is the red line here, we noticed that ODoH with no service co-location, which means that the target and the resolver are kept separate, achieves this interesting position roughly in between DoH and DoH over Tor. Note that these lines might look really close, but they're on a logarithmic scale, indicating they're orders of magnitude different. These results get even better and more interesting as we start to co-locate the target and the resolver together, which is shown by the dashed blue line in this figure. What you see here is that eventually ODoH behaves equivalently to the standard DoH queries, which are widely used today and serve as our baseline.
On average, ODoH increases the response time by 50% when the services are co-located and by 100% when the services are not co-located. DNS protocols with message encryption, like DNSCrypt, tend to have much larger compute overheads; they use non-encrypted channels but do message-level encryption. These protocols have a much higher response time compared to those of ODoH and DoH over Tor, as shown by the dotted line that's over here, and they lie in this specific range. So the performance in response times is a middle ground that we achieve between DoH and privacy-enhancing variants, such as using DoH with Tor or Anonymized DNSCrypt, all while achieving the same privacy guarantees. This brings us to a crucial measurement that we're really interested in, the page load time impact. To do this, we establish a measurement node in a lab network with an available on-path proxy and a randomly chosen off-path proxy in the same geographic area. The node runs a local stub resolver, which is configured to use the DoH or the ODoH protocol for various runs. In each run, we browse the same set of webpages, chosen randomly from the top 2000 websites of the Internet by the amount of traffic or by ranking, purging all the local cache entries and doing page loads through Selenium. In the process, we also store the W3C navigation events, like the page load events, the DNS response time events, and so on. The results that we present here are really the pessimistic figures, and they use the worst-case network architecture, where the target and resolver are kept separate. These results would actually get much better with the recent release yesterday, where we have co-located both the target and the resolver in the Cloudflare DNS service.
Over here, we're also looking at the loadEventEnd page event, which is the most pessimistic event that we can look at, marking when the entire page load has occurred, compared to time to first byte or the first useful paint, which are more frequently used by measurement researchers. There are also a lot of browser artifacts that come into play, like caching, which result in the top graph and the plots looking very different compared to the plots that we've seen before. We find that using ODoH with an on-path proxy increases the page load time by 20% compared to the baseline, and using it with an off-path proxy increases it by 25%. To put these in perspective of numbers, the average page load time changes from 1.319 seconds with the unencrypted Do53 protocol to 1.6 seconds if you use ODoH. These results are still preliminary, but we are very optimistic that, with the recent release and by monitoring these results, we can improve the performance of this system. So you might be thinking, well, all these look good. Where are the proxies? How do I use this system? And you're absolutely right. The protocol becomes possible because of the ecosystem of proxies that become available. We currently have partners who are running proxies, which helps us make this protocol possible. We implemented the protocol libraries for interoperability in both Golang and Rust, along with benchmarking tools to be able to reproduce the results of our experiments. The HPKE implementations by default use the X25519 key encapsulation mechanism, with SHA-256 for the key derivation function and AES-128-GCM for the authenticated encryption. The library also supports the other variants.
We performed extensive microbenchmarks on the libraries and tools to find that the ODoH encryption and decryption computation overheads are really minimal, with the 99th percentile of the compute overheads taking roughly 300 microseconds. However, what does increase, and what contributes to the network overhead and the increase in the response time, is the size of the payload that's actually being sent. So what was originally 33 bytes of query information after encryption increases by four times to 107 bytes, while the answer size that you retrieve from the resolver increases by 1.2 times, resulting in 16-byte responses. With the integration into the Cloudflare resolver, you can use the ODoH protocol by downloading our client and configuring it with a proxy run by our partner SURF by running this command from the Golang client, for example. The query here uses the ODoH protocol and queries the Cloudflare resolver for the IPv6 address, or the AAAA query type, and proxies the request through the partner proxy hosted by SURF. To conclude, ODoH is really a practical privacy-enhancing protocol for DNS with minimal total page load time impact, and the performance impacts of these protocols are purely network topology effects. We make a lot of recommendations for ideal usage in production systems and have released this as a part of the update yesterday to the Cloudflare DNS service. All of our code and implementation is open source and is available at these links on GitHub under the Cloudflare organization, and we're committed to moving the standard forward in the IETF for standardization. We are also hoping more partners join us in providing support for the protocol, either by running the proxies or running the targets for privacy. I would like to give a special shout out to Tanya, to Chris, Marwan, Nick, Peter, Bob, Marek, Curtis, Angban, Tommy, Patrick, Eric, Jonathan, Wesley, and countless others who actually made this work possible.
Please do send in your questions to ask-research at Cloudflare.com and thank you so much for tuning in.