Can Encrypted DNS Be Fast?

In this paper, we study the performance of encrypted DNS protocols and conventional DNS from thousands of home networks in the United States, over one month in 2020. We perform these measurements from the homes of 2,693 participating panelists in the Federal Communications Commission's (FCC) Measuring Broadband America program. We found that clients do not have to trade DNS performance for privacy. For certain resolvers, DoT was able to perform faster than DNS in median response times, even as latency increased. We also found significant variation in DoH performance across recursive resolvers. Based on these results, we recommend that DNS clients (e.g., web browsers) should periodically conduct simple latency and response time measurements to determine which protocol and resolver a client should use. No single DNS protocol nor resolver performed the best for all clients.


Introduction
The Domain Name System (DNS) is responsible for translating human-readable domain names (e.g., nytimes.com) to IP addresses. It is a critical part of the Internet's infrastructure that users must interact with before almost any communication can occur. For example, web browsers may require tens to hundreds of DNS requests to be issued before a web page can be loaded. As such, many design decisions for DNS have focused on minimizing the response times for requests. These decisions have in turn improved the performance of almost every application on the Internet.
In recent years, privacy has become a significant design consideration for the DNS. Research has shown that conventional DNS traffic can be passively observed by network eavesdroppers to infer which websites a user is visiting [2,25]. This attack can be carried out by anyone that sits between a user and their recursive resolver. As a result, various protocols have been developed to send DNS queries over encrypted channels. Two prominent examples are DNS-over-TLS (DoT) and DNS-over-HTTPS (DoH) [8,10]. DoT establishes a TLS session over port 853 between a client and a recursive resolver. DoH also establishes a TLS session, but unlike DoT, all requests and responses are encoded in HTTP packets, and port 443 is used. In both cases, a client sends DNS queries to a recursive resolver over an encrypted transport protocol (TLS), which in turn relies on the Transmission Control Protocol (TCP). Encrypted DNS protocols prevent eavesdroppers from passively observing DNS traffic sent between users and their recursive resolvers. From a privacy perspective, DoT and DoH are attractive protocols, providing confidentiality guarantees that DNS lacked.
Past work has shown that DoT and DoH query response times are typically marginally slower than DNS [3,9,14]. However, these measurements were performed from university networks, proxy networks, and cloud data centers, rather than directly from homes. It is crucial to measure DNS performance from home networks in situ, as they may be differently connected than other networks. An early study on encrypted DNS performance was conducted by Mozilla at scale with real browser users, but they did not study DoT, and they did not explore the effects of latency to resolvers, throughput, or Internet service provider (ISP) choice on performance [15]. Thus, the lack of controlled measurements prevents the networking community from fully understanding how encrypted DNS protocols perform for real users.
In this work, we provide a large-scale performance study of DNS, DoT, and DoH from thousands of home networks dispersed across the United States. We perform measurements from the homes of 2,693 participating panelists in the Federal Communications Commission's (FCC) Measuring Broadband America program from April 7th, 2020 through May 8th, 2020. We measure query response times and connection setup times using popular, open recursive resolvers, as well as resolvers provided by local networks. We also study the effects of latency to resolvers, throughput, and ISP choice on query response times.

Method
In this section, we describe the measurement platform we used to study DNS, DoT, and DoH performance and outline our analyses. We then describe the experiments we conduct and their limitations.

Measurement Platform
The FCC contracts with SamKnows [20] to implement the operational and logistical aspects of the Measuring Broadband America (MBA) program [6]. SamKnows is a company that specializes in developing custom software and hardware (also known as "Whiteboxes") to evaluate the performance of broadband access networks. Whiteboxes act as Ethernet bridges that connect directly to existing modems/routers, which enables us to control for poor Wi-Fi signals and cross-traffic. In accordance with MBA program objectives, SamKnows has deployed Whiteboxes to thousands of volunteers' homes across the United States. We were granted access to the MBA platform through the FCC's MBA-Assisted Research Studies program (MARS) [5], which enables researchers (generally from the United States) to run measurements from the Whiteboxes. We utilize the platform to evaluate how DNS, DoT, and DoH perform from home networks.
We perform measurements from each Whitebox using SamKnows' DNS query tool. For each query, the tool reports a success/failure status (and failure reason, if applicable), the DNS resolution time excluding connection establishment (if the query was successful), and the resolved record [19]. For DoT and DoH, the tool separately reports the TCP connection setup time, the TLS session establishment time, and the DoH resolver lookup time. For this study, we only study queries for 'A' and 'AAAA' records. We note that queries for DNS and DoT are sent synchronously, i.e., they must each receive a response before the next query can be sent. On the other hand, DoH queries are sent asynchronously, functionality that is enabled by the underlying HTTP protocol.
The query tool handles failures in several ways. First, if a response with an error code is returned from a recursive resolver (e.g., NXDOMAIN or SERVFAIL), then the matching query is marked as a failure. Second, if the tool fails to establish a DoT or DoH connection, then all queries in the current batch (explained in Section 2.3) are marked as failures. Third, the query tool times out conventional DNS queries after three seconds, at which point it re-sends them. If three timeouts occur for a given query, the tool marks the query as a failure. Finally, the query tool marks DoT/DoH queries as failures if either five seconds have passed or if TCP hits the maximum number of re-transmissions allowed by the operating system's kernel (Linux 4.4.79). The Whiteboxes we measure use the default TCP settings configured by the kernel (e.g., net.ipv4.tcp_frto = 2, net.ipv4.tcp_retries1 = 3, and net.ipv4.tcp_retries2 = 15).
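The retry rule for conventional DNS described above can be sketched as follows. This is a minimal illustration, not the SamKnows tool's implementation: the `send` hook and its return convention are hypothetical, and the separate five-second and TCP re-transmission limits for DoT/DoH are not modeled.

```python
from typing import Callable, Optional

DNS_TIMEOUT_S = 3.0    # per-attempt timeout stated in the text
DNS_MAX_TIMEOUTS = 3   # timeouts after which the query is marked as failed


def classify_dns_query(send: Callable[[str, float], Optional[str]], name: str) -> str:
    """Return "success" or "failure" for one conventional DNS query.

    `send` is a hypothetical transport hook: it returns the response code
    as a string (e.g., "NOERROR", "NXDOMAIN"), or None if no response
    arrived within the timeout.
    """
    for _ in range(DNS_MAX_TIMEOUTS):
        rcode = send(name, DNS_TIMEOUT_S)
        if rcode is None:
            continue  # timed out: re-send the query
        # Any response carrying an error code counts as a failure.
        return "success" if rcode == "NOERROR" else "failure"
    return "failure"  # three timeouts for the same query
```

Passing the transport as a function keeps the sketch independent of any particular DNS library.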
In total, we collected measurements from 2,804 Whiteboxes, each of which uses the latest generation of hardware and software (8.0) [21]. Our measurements were performed continuously from April 7th, 2020 through May 8th, 2020 in collaboration with SamKnows and the FCC. We filtered out certain Whiteboxes from our analysis in several ways. First, we filtered out 56 Whiteboxes that we did not have any network configuration information about (e.g., ISP speed tier, ISP name, and access technology). Second, we filtered out 25 Whiteboxes that were connected by satellite. Third, we filtered out 30 Whiteboxes for which we did not know the access technology or ISP speed tier. This left us with 2,693 Whiteboxes to analyze, with 96% of queries marked as successful. The Whiteboxes were connected to 14 ISPs over cable, DSL, and fiber.

Analyses
We studied DNS, DoT, and DoH performance across several dimensions: connection setup times, query response times for each resolver and protocol, and query response times relative to latency to resolvers, throughput, and ISPs. Our analyses are driven by choices that DNS clients are able to make (e.g., which protocol and resolver to use) and how these choices affect DNS performance.
Connection Setup Times. Before any query can be issued for DoT or DoH, the client must establish a TCP connection and a TLS session. As such, we measure the time to complete a 3-way TCP handshake and a TLS handshake. Additionally, for DoH, we measure the time to resolve the domain name of the resolver itself. The costs associated with connection establishment are amortized over many DoT or DoH queries as the connections are kept alive and used repeatedly once they are open. We study connection setup times in Section 3.1.
DNS Response Times. Query response times are crucial for determining the performance of various applications. Before any resource can be downloaded from a server, a DNS query often must be performed to learn the server's IP address (assuming a response is not cached). As such, we study query response times for each resolver and protocol in Section 3.2. We remove TCP and TLS connection establishment time from DoT and DoH query response times. The DNS query tool we use closes and re-establishes connections after ten queries (detailed in Section 2.3). As this behavior is unlikely to mimic that of stub resolvers and web browsers [7,16,17], we remove connection establishment times to avoid negatively biasing the performance of DoT and DoH.
DNS Response Times Relative to Latency and Throughput. Conventional DNS performance depends on latency, as the protocol is relatively lightweight; therefore, latency to the DNS resolver can have a significant effect on overall performance. Furthermore, encrypted DNS protocols may perform differently than conventional DNS in response to higher latency, as they are connection-oriented protocols. We study the effect of latency on query response times for each open resolver and protocol in Section 3.3. SamKnows also provides us with the subscribed downstream and upstream throughput for each Whitebox. We use this information to study the effect of downstream throughput on query response times in Section 3.3.
DNS Response Times Relative to ISP Choice. Lastly, SamKnows provides us with the ISP for each Whitebox. We study query response times for a selection of ISPs in Section 3.4.

Experiment Design
We describe below which recursive resolvers and domain names we perform measurements with and how we arrived at these choices.

DNS Resolvers.
For each Whitebox, we perform measurements using three popular open recursive DNS resolvers (anonymized as X, Y, and Z), as well as the recursive resolver automatically configured on each Whitebox (the "default" resolver). Typically, the default resolver is set by the ISP that the Whitebox is connected to. Resolvers X, Y, and Z all offer public name resolution for DNS, DoT, and DoH. However, the default resolver typically only supports DNS. As such, for the default resolver, we only perform measurements with conventional DNS. If a Whitebox has configured Resolver X, Y, or Z as its default resolver, then we leave its default resolver measurements out of our analysis.
In Table 1, we include the latency to each resolver across all Whiteboxes. We measure latency by running five ICMP ping tests for each resolver at the top of each hour and computing the average. We separate latency to DoH resolvers from latency to DNS and DoT resolvers because the domain names of DoH resolvers must be resolved in advance. As such, the IP addresses for the DoH resolvers are not always the same as DNS and DoT resolvers. We note that the latencies for the default resolvers are particularly low because these resolvers are often DNS forwarders configured on home routers. We exclude measurements with five failures or with an average latency of zero (0.7% of the total measurements).
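The latency computation and exclusion rule above might look like the following sketch. It assumes that failed pings are represented as `None` and that the average is taken over the successful pings only; the text does not specify either detail, and the function name is illustrative.

```python
from statistics import mean
from typing import Optional, Sequence


def hourly_latency_ms(pings: Sequence[Optional[float]]) -> Optional[float]:
    """Average the round-trip times of one set of ICMP pings.

    Returns None when the measurement should be excluded: every ping
    failed, or the computed average is zero.
    """
    ok = [rtt for rtt in pings if rtt is not None]
    if not ok:
        return None  # all pings failed
    avg = mean(ok)
    return avg if avg > 0 else None  # drop zero-latency measurements
```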
We identified 41 Whiteboxes with median latencies to Resolvers X, Y, and Z DNS of up to 100 ms, despite median query response times of less than 1 ms. We consulted with SamKnows, and based on their experience, they believed this behavior could be attributed to DNS interception by middleboxes between Whiteboxes and recursive resolvers. For example, customer-premises equipment (CPE) can run DNS proxies (e.g., dnsmasq) that can cache DNS responses to achieve such low query response times. Furthermore, previous reports from the United Kingdom indicate that ISPs can provide customer-premises equipment that is capable of passively observing and interfering with DNS queries [11]. We found that 29 of these 41 Whiteboxes are connected to the same ISP. We also identified two Whiteboxes with median latencies to X, Y, and Z DoH of less than 1 ms. Lastly, we identified one Whitebox with median latencies to X, Y, and Z DoT of up to 100 ms, despite median query response times of less than 1 ms. We analyze the data for these Whiteboxes for completeness.
Domain Names. Our goal was to collect DNS query response times for domain names found in websites that users are likely to visit. We first selected the top 100 websites in the Tranco top-list, which averages the rankings of websites in the Alexa top-list over time [13]. For each website selected, we extracted the domain names of all included resources found on the page. We obtained this data from HTTP Archive Objects (or "HARs") that we collected from a previous study [9].
Importantly, we needed to ensure that the domain names were not sensitive in nature (e.g., pornhub.com) so as to not trigger DNS-based parental controls. As such, after we created our initial list of domain names, we used the Webshrinker API to filter out domains associated with adult content, illegal content, gambling, and uncategorized content [24]. We then manually reviewed the resulting list. In total, our list included 1,711 unique domain names.
Measurement Protocol. The steps we take to measure query response times from each Whitebox are as follows:
1. We randomize the input list of 1,711 domain names at the start of each hour.
2. We compute the latency to each resolver with a set of five ICMP ping tests.
3. We begin iterating over the randomized list by selecting a batch containing ten domain names.
4. We issue queries for all ten domain names in the batch to each resolver/protocol combination. For DoT and DoH, we re-use the TLS connection for each query in the batch, and then close the connection. If a batch of queries has not completed within 30 seconds, we pause, check for cross-traffic, and retry if cross-traffic is present. If there is no cross-traffic, we move to the next resolver/protocol combination.
5. We select the next batch of ten domain names. If five minutes have passed, we stop for the hour. Otherwise, we return to step four.
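The hourly measurement loop can be sketched roughly as below. This is an illustration under stated assumptions, not the SamKnows tool: `query_batch` is a hypothetical hook that issues one batch to one resolver/protocol combination, and the ping step and the 30-second cross-traffic check are omitted for brevity. The clock is injectable so the five-minute budget can be exercised without waiting.

```python
import random
import time
from typing import Callable, List, Optional, Sequence, Tuple

BATCH_SIZE = 10          # domain names per batch
HOUR_BUDGET_S = 5 * 60   # stop issuing batches after five minutes

Combo = Tuple[str, str]  # (resolver, protocol), e.g. ("X", "DoT")


def run_hour(
    domains: Sequence[str],
    combos: Sequence[Combo],
    query_batch: Callable[[Combo, List[str]], None],  # hypothetical hook
    clock: Callable[[], float] = time.monotonic,
    rng: Optional[random.Random] = None,
) -> int:
    """Run one hourly cycle and return the number of batches issued."""
    order = list(domains)
    (rng or random.Random()).shuffle(order)   # randomize the input list
    start = clock()
    batches = 0
    for i in range(0, len(order), BATCH_SIZE):
        batch = order[i:i + BATCH_SIZE]       # next batch of ten names
        for combo in combos:                  # every resolver/protocol pair
            query_batch(combo, batch)
        batches += 1
        if clock() - start >= HOUR_BUDGET_S:  # five minutes have passed
            break
    return batches
```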
Limitations. Due to bandwidth usage concerns and limited computational capabilities on the Whiteboxes, we do not collect web page load times while varying the underlying DNS protocol and resolver. Additionally, while we conducted our measurements, the COVID-19 pandemic caused many people to work from home. We did not want to perturb other measurements being run with the Measuring Broadband America platform or introduce excessive strain on the volunteers' home networks. Due to these factors, we focus on DNS response times.

Results
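This section presents the results of our measurements. We organize our results around the following questions: (1) How much connection overhead does encrypted DNS incur? (2) How does encrypted DNS perform compared with conventional DNS? (3) How does network performance affect encrypted DNS performance? (4) Does encrypted DNS resolver performance depend on broadband access ISP? Our results show that for certain resolvers, to our surprise, DoT had lower median response times than conventional DNS, even as latency to the resolver increased. We also found significant variation in DoH performance across resolvers.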

How Much Connection Overhead Does Encrypted DNS Incur?
We first study the overhead incurred by encrypted DNS protocols, due to their requirements for TCP connection setup and TLS handshakes. Before any batch of DoT queries can be issued with the SamKnows query tool, a TCP connection and TLS session must be established with a recursive resolver. In the case of DoH, the resolver's domain name is also resolved (e.g., resolverX.com). In Fig. 1, we show timings for different aspects of connection establishment for DoT and DoH.
The results show that lookup times were similar for all three resolvers (Fig. 1(a)). This result is expected because the same default, conventional DNS resolver is used to look up the DoH resolvers' domain names; the largest median DoH resolver lookup time was 17.1 ms (Resolver X). Depending on the DNS time to live (TTL) of the DoH resolver lookup, resolution of the DoH resolver may occur frequently or infrequently. Next, we study the TCP connection establishment time for DoT and DoH for each of the three recursive resolvers (Fig. 1(b)). For each of the three individual resolvers, TCP establishment times for DoT and DoH are similar. Resolvers X and Y are similar; Z experienced longer TCP connection times. The largest median TCP connection establishment time across all resolvers and protocols (Resolver Z DoH) was 30.8 ms.
Because DoT and DoH rely on TLS for encryption, a TLS session must be established before use. Fig. 1(c) shows the TLS establishment time for the three open resolvers. Again, Resolver Z experienced higher TLS setup times compared to X and Y. Furthermore, DoT and DoH performed similarly for each resolver. The largest median TLS connection establishment time across all recursive resolvers and protocols (Resolver Z DoH) was 105.2 ms. As with resolver lookup overhead, the cost of establishing a TCP and TLS connection to the recursive resolver for a system would ideally occur infrequently, and should be amortized over many queries by keeping the connection alive and reusing it for multiple DNS queries. Connection-oriented, secure DNS protocols will incur additional latency, but these costs can be (and are) typically amortized by caching the DNS name of the DoH resolver, as well as multiplexing many DNS queries over a single TLS session to a DoH resolver. Many browser implementations of DoH implement these practices. For example, Firefox establishes a DoH connection when the browser launches, and it leaves the connection open [16,17]. Thus, the overhead for DoH connection establishment in Firefox is amortized over time.
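To make the amortization argument concrete, the sketch below spreads a one-time TCP and TLS setup cost over the number of queries sent on the same connection. It uses the largest median setup times reported above (Resolver Z DoH); adding two medians is only a rough approximation of the combined cost, and the function name is illustrative.

```python
def amortized_setup_ms(tcp_ms: float, tls_ms: float, queries_per_connection: int) -> float:
    """Connection setup cost attributed to each query on a reused connection."""
    if queries_per_connection < 1:
        raise ValueError("need at least one query per connection")
    return (tcp_ms + tls_ms) / queries_per_connection


# Largest median TCP and TLS setup times reported in the text (Resolver Z DoH).
TCP_MS, TLS_MS = 30.8, 105.2

for n in (1, 10, 100):
    # Per-query share of the setup cost when n queries reuse one connection.
    print(n, round(amortized_setup_ms(TCP_MS, TLS_MS, n), 2))
```

With ten queries per connection, as in the batches used here, the per-query share is about a tenth of the full setup cost, and it keeps shrinking for long-lived connections such as the one Firefox holds open.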
In the remainder of this paper we do not include connection establishment overhead when studying DNS query response times. We omit connection establishment time for the rest of our analysis because the DNS query tool closes and re-opens connections for each batch of queries. Thus, inclusion of TCP and TLS connection overheads may negatively skew query response times.

How Does Encrypted DNS Perform Compared With Conventional DNS?
We next compare query response times across each protocol and recursive resolver. Fig. 2 shows box plots for DNS response times across all Whiteboxes for each resolver and protocol. "Default" refers to the resolver that is configured by default on each Whitebox (which is typically the DNS resolver operated by the Whitebox's upstream ISP).

DNS performance varies across resolvers.
First of all, conventional DNS performance varies across recursive resolvers. For the default resolvers configured on Whiteboxes, the median query response time using conventional DNS is 24.8 ms.
For Resolvers X, Y, and Z, the median query response times using DNS are 23.2 ms, 34.8 ms, and 38.3 ms, respectively. Although X performs better than the default resolvers, Y and Z perform at least 10 ms slower. This variability could be attributed to differences in deployments between open resolvers.
DoT performance nearly matches conventional DNS. DoT lookup times are close to those of conventional DNS. For Resolvers X, Y, and Z, the median query response times for DoT are 20.9 ms, 32.2 ms, and 45.3 ms, respectively. For X and Y, we find that DoT performs 2.3 ms and 2.6 ms faster than conventional DNS, respectively. For both of these resolvers, the best median DNS query performance could be attained using DoT. For Z, the median DoT response time was 7 ms slower than conventional DNS. The performance improvement of DoT over conventional DNS in some cases is interesting because conventional wisdom suggests that the connection overhead of TCP and TLS would be prohibitive. On the other hand, various factors, including transport-layer optimizations in TCP, as well as differences in infrastructure deployments, could explain these discrepancies. It may also be the case that DoT resolvers have lower query loads than conventional DNS resolvers, enabling comparable (or sometimes faster) response times. Investigating the causes of these discrepancies is an avenue for future work.
DoH response times were higher than those for DNS and DoT. DoH experienced higher response times than conventional DNS or DoT, although this difference in performance varies significantly across DoH resolvers. For Resolvers X, Y, and Z, the median query response times for DoH are 37.7 ms, 46.6 ms, and 60.7 ms, respectively. Resolver Z exhibited the biggest increase in response latency between DoH and DNS (22.4 ms). Resolver Y showed the smallest difference in performance between DoH and DNS (11.8 ms). Median DoH response times between resolvers can differ greatly, with X DoH performing 23 ms faster than Z DoH. The performance cost of DoH may be due to the overhead of HTTPS, as well as the fact that DoH implementations are still relatively nascent, and thus may not be optimized. For example, an experimental DoH recursive resolver implementation by Facebook engineers terminates DoH connections to a reverse web proxy before forwarding the query to a DNS resolver [4].
Table 2: Coefficients, intercepts, and errors for ridge regression models.
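The per-resolver differences quoted in this subsection follow directly from the reported medians. The short sketch below recomputes them; the dictionary layout and function name are illustrative, and the values are the medians stated in the text.

```python
# Median query response times (ms) as reported in the text.
MEDIANS_MS = {
    "X": {"DNS": 23.2, "DoT": 20.9, "DoH": 37.7},
    "Y": {"DNS": 34.8, "DoT": 32.2, "DoH": 46.6},
    "Z": {"DNS": 38.3, "DoT": 45.3, "DoH": 60.7},
}


def delta_vs_dns(resolver: str, protocol: str) -> float:
    """Median difference (protocol minus DNS) in ms; negative means faster."""
    medians = MEDIANS_MS[resolver]
    return round(medians[protocol] - medians["DNS"], 1)


for resolver in MEDIANS_MS:
    # Difference of each encrypted protocol relative to conventional DNS.
    print(resolver, {p: delta_vs_dns(resolver, p) for p in ("DoT", "DoH")})
```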

How Does Network Performance Affect Encrypted DNS Performance?
We next study how network latency and throughput characteristics affect the performance of encrypted DNS.
DoT can meet or beat conventional DNS despite high latencies to resolvers, offering privacy benefits for no performance cost. Fig. 3 shows that DoT can perform better than DNS as latency increases for Resolvers X and Y; in the case of Resolver Z, DoT nearly matches the performance of conventional DNS. We observe similar behavior with the linear ridge regression models shown in Fig. 4. As discussed in Section 3.2, these results could be explained by transport-layer optimizations in TCP, differences in infrastructure deployments, and lower query loads on DoT resolvers compared to conventional DNS resolvers.
DoH performs worse than conventional DNS and DoT as latencies to resolvers increase. Fig. 3 shows that DoH performs substantially worse when latency between the client and recursive resolver is high; Fig. 4 shows a similar result with a ridge regression model. As discussed in Section 3.2, this result could be explained by either HTTPS overhead, nascent DoH implementations and deployments, or both.
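For readers unfamiliar with ridge regression, the sketch below fits a one-feature linear model of response time against latency with an L2 penalty on the slope. It is only an illustration of the technique: the penalty strength `alpha`, the single-feature form, and the data points are assumptions made here, not the models or measurements behind Fig. 4.

```python
from statistics import mean
from typing import Sequence, Tuple


def ridge_fit_1d(x: Sequence[float], y: Sequence[float], alpha: float = 1.0) -> Tuple[float, float]:
    """Fit y ~ slope * x + intercept, penalizing the slope (not the intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / (sxx + alpha)   # alpha = 0 reduces to ordinary least squares
    return slope, my - slope * mx


# Made-up (latency, response time) pairs in ms, for illustration only.
latency = [10.0, 20.0, 30.0, 40.0, 50.0]
response = [22.0, 33.0, 41.0, 52.0, 62.0]

slope, intercept = ridge_fit_1d(latency, response, alpha=1.0)
print(round(slope, 3), round(intercept, 2))
```

Comparing the fitted slopes across protocols is one way to express how quickly response times grow as latency to the resolver increases.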

Does Encrypted DNS Resolver Performance Vary Across ISPs?
Fig. 6 shows how encrypted DNS response times vary across different resolvers and ISPs. In short, the choice of resolver matters, and the "best" encrypted DNS resolver also may depend on the user's ISP. For instance, while ISP C is comparable to the other ISPs for queries sent to Resolver X, ISP C has significantly lower query response times to Resolver Y, and is one of the poorest performing ISPs on Resolver Z. The difference in median query response times between Resolver X DoH and X DNS was 20.9 ms for Whiteboxes on ISP D, and 8.9 ms for Whiteboxes on ISP E; for Z DoH, the difference in median times was 34.5 ms for Whiteboxes on ISP D, and 47.9 ms for Whiteboxes on ISP E. Resolver performance can also differ across ISPs. For ISP B, the median query response time for Z DoT is 11.1 ms faster than Z DNS. However, for ISP C, Z DoT is significantly slower than DNS, with a difference in median query response times of 30.6 ms. We attribute this difference in performance to higher latency to Resolver Z via ISP C. The median latency to Z DNS and DoT across Whiteboxes on ISP C was 50 ms, compared to 18.5 ms on ISP B.

Related Work
Researchers have compared the performance of DNS, DoT, and DoH in various ways. Zhu et al. proposed DoT to encrypt DNS traffic between clients and recursive resolvers [25]. They modeled its performance and found that DoT's overhead can be largely eliminated with connection re-use. Böttger et al. measured the effect of DoT and DoH on query response times and page load times from a university network [3]. They find that DNS generally outperforms DoT in response times, and DoT outperforms DoH. Hounsel et al. also measure response times and page load times for DNS, DoT, and DoH using Amazon EC2 instances [9]. They find that despite higher response times, page load times for DoT and DoH can be faster than DNS on lossy networks. Lu et al. utilized residential TCP SOCKS networks to measure response times from 166 countries and found that, in the median case with connection re-use, DoT and DoH were slower than conventional DNS over TCP by 9 ms and 6 ms, respectively [14].
Researchers have also studied in depth how DNS influences application performance. Sundaresan et al. used an early MBA deployment of 4,200 home gateways to identify performance bottlenecks for residential broadband networks [22]. This study found that page load times for users in home networks are significantly influenced by slow DNS response times. Wang et al. introduced WProf, a profiling system that analyzes various factors that contribute to page load times [23]. They found that queries for uncached domain names at recursive resolvers can account for up to 13% of the critical path delay for page loads. Otto et al. found that CDN performance was significantly affected by clients choosing recursive resolvers that are far away from CDN caches [18]. As a result of these findings, Otto et al. proposed namehelp, a DNS proxy that sends queries for CDN-hosted content directly to authoritative servers. Allman studied conventional DNS performance from 100 residences in a neighborhood and found that only 3.6% of connections were blocked on DNS with lookup times greater than either 20 ms or 1% of the application's total transaction time [1].
Past work studied the performance impact of "last mile" connections to home networks in various ways. Kreibich et al. proposed Netalyzr as a Java applet that users run from devices in their home networks to test and debug their Internet connectivity. Netalyzr probes test servers outside of the home network to measure latency, IPv6 support, DNS manipulation, and more. Their system was run from over 99,000 public IP addresses, which enabled them to study network connectivity at scale [12]. Dischinger et al. measured bandwidth, latency, and packet loss from 1,894 hosts and 11 major commercial cable and DSL providers in North America and Europe. This work found that the "last mile" connection between an ISP and a home network is often a performance bottleneck, which they could not have captured by performing measurements outside of the home network. However, their measurements were performed from hosts located within homes, rather than the home gateway. This introduces confounding factors between hosts and the home gateway, such as poor Wi-Fi performance.

Conclusion
In this paper, we studied the performance of encrypted DNS protocols and DNS from 2,693 Whiteboxes in the United States, between April 7th, 2020 and May 8th, 2020. We found that clients do not have to trade DNS performance for privacy. For certain resolvers, DoT was able to perform faster than DNS in median response times, even as latency increased. We also found significant variation in DoH performance across recursive resolvers. Based on these results, we recommend that DNS clients (e.g., web browsers) measure latency to resolvers and DNS response times to determine which protocol and resolver a client should use. No single DNS protocol nor resolver performed the best for all clients.
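One way a client might act on this recommendation is sketched below: given recent response-time samples for each resolver and protocol, pick the combination with the lowest median. This is a hypothetical illustration rather than a mechanism evaluated in this paper; the function name and data layout are assumptions.

```python
from statistics import median
from typing import Dict, List, Tuple

Choice = Tuple[str, str]  # (resolver, protocol)


def pick_fastest(samples_ms: Dict[Choice, List[float]]) -> Choice:
    """Return the (resolver, protocol) pair with the lowest median response time."""
    # Ignore combinations that produced no successful measurements.
    candidates = {choice: median(times) for choice, times in samples_ms.items() if times}
    if not candidates:
        raise ValueError("no successful measurements")
    return min(candidates, key=candidates.get)
```

A client that requires confidentiality could pass in only the DoT and DoH candidates, so that the comparison is made among encrypted options.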
There were some limitations to our work that point to future research. First, due to bandwidth restrictions, we were unable to perform page loads from Whiteboxes. Future work could utilize platforms of similar scale to SamKnows to measure page loads, such as browser telemetry systems. Second, future work should perform measurements from mobile devices. DoT was implemented in Android 10, but to our knowledge, its performance has not been studied "in the wild." Finally, future work could study how encrypted DNS protocols perform from networks that are far away from popular resolvers. This is particularly important for browser vendors that seek to deploy DoH outside of the United States.