Performance evaluation of DNS servers to build a benchmarking system of DNS64 implementations

DNS64 is an important IPv6 transition technology that facilitates the communication of an IPv6-only client with an IPv4-only server, which is becoming an increasingly common scenario. Several different DNS64 implementations exist, and their performance is a relevant decision factor for network operators. RFC 8219 has defined a benchmarking methodology for DNS64 servers, which requires the operation of an authoritative DNS server at 220% of the query rate used for DNS64 benchmarking. In this paper, we aim to build an authoritative DNS server that operates at a rate of 2.2 million qps (queries per second), and thus facilitates DNS64 benchmarking up to a rate of 1,000,000 qps. To that end, we compare the performance of BIND, YADIFA, NSD, Knot DNS and FakeDNS (a special purpose software) to find the most suitable one. We fully disclose the details of our measurements, including the configuration of the DNS implementations, the usage of our improved software tester called dns64perf++, and the details of the hardware and software measurement environment in the NICT StarBED, Japan. We perform a series of measurements to examine how the performance of the tested solutions scales up with the number of active CPU cores from 1 to 32. Besides their performance, we also measure their memory consumption and zone load time. We present and discuss all the results. In addition to successfully building an authoritative DNS server with the required performance, we also make recommendations as to which solutions suit different special needs.


Introduction
Currently, we are in a transition from IPv4 to IPv6 [1]. Unfortunately, these two versions of the Internet Protocol are not compatible with each other, and the IETF has standardized several IPv6 transition technologies to facilitate their cooperation in various communication scenarios [2]. For example, DNS64 [3] servers and NAT64 [4] gateways are used to enable IPv6-only clients to communicate with IPv4-only servers, which is going to be an increasingly common scenario due to the deployment of IPv6 on the client side and the fact that some servers will remain IPv4-only for the foreseeable future. Network operators will need to choose the DNS64 implementation that best suits their purposes from among several candidates, and performance is one of the key decision factors. We published a paper about the performance comparison of four DNS64 implementations in 2016 [5]. Then, in 2017, RFC 8219 [6] defined a benchmarking methodology for IPv6 transition technologies including DNS64. The benchmarking procedure for DNS64 servers, which we have described in more detail in [7], requires the usage of a high performance authoritative DNS server during the benchmarking of DNS64 servers (we give more details in Sect. 2.1). The first author of this paper had the opportunity to measure the performance of three DNS64 implementations in an RFC 8219 compliant way as a guest researcher in Japan [8]. The tested implementations showed a moderate performance, which is very far from the performance of the Google public DNS server, which is about 810,000 qps (queries per second) on average [9]. Therefore, we set our goal to develop a high performance DNS64 server and to be able to benchmark DNS64 implementations up to a rate of 1 million queries per second.

The aim of our current effort is to build a high performance authoritative DNS server that facilitates the benchmarking of DNS64 implementations up to a rate of 1,000,000 qps. To that end, we evaluate the performance of different authoritative DNS server implementations, namely BIND, YADIFA, NSD, Knot DNS and FakeDNS (a special purpose software), to find the best candidate for our purpose. We also disclose the details of their configuration as well as the settings of our test environment that are necessary for receiving several million packets per second.
The remainder of this paper is organized as follows: In Sect. 2, we summarize all background information, including the highlights of DNS64 benchmarking, as well as the operation of our testing tool and its improvements in a nutshell. In Sect. 3, we introduce the tested authoritative DNS server implementations and disclose their settings. In Sect. 4, we explain the details of the measurements. In Sect. 5, we report and discuss our results. In Sect. 6, we unfold our plans for future research. In Sect. 7, we give our conclusions.

Benchmarking methodology for DNS64 servers
We have defined a benchmarking methodology for DNS64 servers in Section 9 of RFC 8219 [6] and elaborated its details in [7]. Now we present only a very brief summary of it. The compulsory test of the DNS64 benchmarking procedure follows the test and traffic setup shown in Fig. 1. This is the "worst case" scenario, when all the following six messages are used:
1. The Measurer subsystem of the Tester sends a query for an IPv6 address ("AAAA" record) of the domain name in question to the tested DNS64 server, which is the DUT (Device Under Test).
2. The DUT sends a query for an IPv6 address of the same domain name to the AuthDNS (Authoritative DNS Server) subsystem of the Tester.
3. The AuthDNS subsystem of the Tester replies with an empty "AAAA" record.
4. The DUT sends a query for an IPv4 address ("A" record) of the same domain name to the AuthDNS subsystem of the Tester.
5. The AuthDNS subsystem of the Tester replies with a valid "A" record.
6. The DUT synthesizes an IPv4-embedded IPv6 address [10] using either the NAT64 well-known prefix or a network specific prefix plus the received "A" record, and then it returns the synthesized "AAAA" record to the Measurer subsystem of the Tester.
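The address synthesis of the last message can be illustrated by a minimal Python sketch. It assumes a /96 prefix, so that the IPv4 address simply occupies the last 32 bits of the IPv6 address; the function name synthesize_aaaa and the sample addresses are hypothetical and serve only as illustration, whereas 64:ff9b::/96 is the NAT64 well-known prefix.

```python
import ipaddress


def synthesize_aaaa(ipv4: str, prefix: str = "64:ff9b::/96") -> str:
    """Embed an IPv4 address into the last 32 bits of a /96 IPv6 prefix."""
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 96:
        # Other prefix lengths place the IPv4 bits differently;
        # this sketch deliberately does not cover them.
        raise ValueError("this sketch only handles /96 prefixes")
    v4_bits = int(ipaddress.IPv4Address(ipv4))
    return str(ipaddress.IPv6Address(int(net.network_address) | v4_bits))


if __name__ == "__main__":
    # Well-known prefix with a documentation IPv4 address.
    print(synthesize_aaaa("192.0.2.1"))
    # A network specific prefix taken from the documentation range.
    print(synthesize_aaaa("0.1.2.3", "2001:db8::/96"))
```

Python prints the result in compressed hexadecimal notation, thus the second address appears as 2001:db8::1:203, which is the same address as 2001:db8::0.1.2.3.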
To eliminate caching, a different domain name must be used in every query during the test, which lasts at least 60 s. The Tester sends the queries at a constant rate and checks whether the replies arrive within the required timeout and contain a valid "AAAA" record. If so, the test is successful, otherwise it has failed. The DNS64 performance is the highest rate at which the test is successful. In practice, a binary search is used to find the highest such rate. The binary search is to be executed at least 20 times, and the final result is the median of the at least 20 results, whereas their first and 99th percentiles are used to show the stability of the results.
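The binary search can be sketched in Python as follows. This is only an illustration under the assumption that the outcome of the elementary test is monotonic in the rate (it passes at every rate up to some limit and fails above it); the names highest_passing_rate and test_passes are hypothetical, and the predicate stands in for one 60 s long elementary test.

```python
def highest_passing_rate(test_passes, low, high):
    """Return the highest integer rate in [low, high] at which
    test_passes(rate) is True, or None if the test fails everywhere.

    Assumes monotonic behaviour: if the test passes at a given rate,
    it also passes at every lower rate.
    """
    best = None
    while low <= high:
        mid = (low + high) // 2
        if test_passes(mid):
            best = mid        # mid is feasible, look for a higher rate
            low = mid + 1
        else:
            high = mid - 1    # mid is too fast, look for a lower rate
    return best


if __name__ == "__main__":
    # Simulated device that can serve exactly 1,234,567 qps.
    simulated_limit = 1_234_567
    rate = highest_passing_rate(lambda r: r <= simulated_limit, 1, 2_236_962)
    print(rate)
```

With a search interval of about 2.2 million rates, the loop needs at most 22 elementary tests, which is why a single binary search already takes more than 20 min when every test lasts 60 s.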
To certify the Tester (including both of its subsystems: Measurer and AuthDNS) for testing up to rate r with timeout t, a self-test must be performed at a rate of 2.2*r and with a timeout of 0.25*t. In short, the rationale for 220% of the rate is that the AuthDNS subsystem has to serve two queries for each query to the DUT, and 10% is the performance reserve. As for the timeout, the AuthDNS subsystem serves two queries, thus it can use up at most half of the time in total, and the other half remains for the DUT. (Please refer to [6] and [7] for further details.) As for our measurements, the 220% of the query rate means that the Authoritative DNS server subsystem has to sustain a rate of 2,200,000 qps to be able to test DNS64 servers up to a rate of 1,000,000 qps. As for the timeout, we have shown in [7] that it should be 1 s for the DNS64 servers, and thus the timeout should be 250 ms for the Authoritative DNS server used to support DNS64 benchmarking.
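The two self-test requirements are simple arithmetic, which the following Python sketch makes explicit. The function name self_test_requirements is hypothetical; integer arithmetic is used to avoid floating point rounding.

```python
def self_test_requirements(dut_rate_qps, dut_timeout_ms):
    """Return (rate, timeout) required for the self-test of the Tester.

    rate:    220% of the DNS64 rate (two queries per DUT query + 10% reserve)
    timeout: 25% of the DNS64 timeout (two queries may use half of it)
    """
    self_test_rate_qps = dut_rate_qps * 220 // 100
    self_test_timeout_ms = dut_timeout_ms * 25 // 100
    return self_test_rate_qps, self_test_timeout_ms


if __name__ == "__main__":
    rate, timeout = self_test_requirements(1_000_000, 1000)
    print(f"self-test rate: {rate} qps, timeout: {timeout} ms")
```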

Fig. 1
Test and traffic setup for benchmarking DNS64 servers [11]

The operation of dns64perf++ in a nutshell
The operation of the original version of dns64perf++ is described in detail in our open access paper [11]; therefore, we only give a very short summary of it.
To be able to generate a high number of requests for all different domain names, dns64perf++ uses a name space that can be described systematically: the first label of each domain name encodes an IPv4 address as four three-digit decimal numbers (e.g. 000-001-002-003 stands for 0.1.2.3). However, for the self-test of the tester, the domain names have to be resolved to IPv4-embedded IPv6 addresses by the DNS server, because dns64perf++ can send only requests for "AAAA" records. (The domain name in the above example can be mapped to 2001:db8::0.1.2.3.) The task of the dns64perf++ program is to perform one elementary test, and a bash shell script is used to perform the binary search and its 20 repetitions.
Originally, dns64perf++ used only two threads (one thread for sending the queries and another thread for receiving the replies), and it was capable of testing up to a rate of 200,000 qps [11]. When we tested its accuracy, we discovered a bug in its timing algorithm, which made it unreliable above a rate of about 50,000 qps; we corrected the bug and rechecked the accuracy of the program [12]. We have also enabled dns64perf++ for benchmarking the caching performance of DNS64 servers [13], which is an optional test of RFC 8219.
Dániel Bakai, the author of dns64perf++, has made several improvements to it so that it can be used for benchmarking up to several million queries per second. The "multiport" feature [14] (latest commit d6fa119 on Oct 8, 2018) includes all of the following:
1. Usage of 2*n threads (n threads for sending queries and another n threads for receiving and processing the replies).
2. Usage of IPv4 as transport protocol. (For DNS64 testing, it has to communicate over IPv6, but we experienced higher self-test performance over IPv4, therefore we used IPv4 in our experiments.)
3. Usage of multiple source port numbers. We need this for RSS (Receive-Side Scaling); please refer to Sect. 4.2.
Besides his development, we also added a further feature that pins the threads to given CPU cores using the function pthread_setaffinity_np(), to avoid their wandering among CPU cores. Our modified source code is available from [15].

Size of the name space
The size of the name space was determined to be "/5", which corresponds to 2^27 = 134,217,728 different domain names, and it is enough up to a rate of 2,236,962 qps, if the test lasts for the required 60 s.
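The relationship between the size of the name space and the highest testable rate can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function names are hypothetical.

```python
def namespace_size(prefix_len):
    """Number of distinct domain names in a "/prefix_len" name space,
    where every name corresponds to one IPv4 address of the prefix."""
    return 2 ** (32 - prefix_len)


def max_rate_qps(prefix_len, duration_s=60):
    """Highest constant query rate at which no domain name has to be
    reused during a test that lasts duration_s seconds."""
    return namespace_size(prefix_len) // duration_s


if __name__ == "__main__":
    print(namespace_size(5))   # number of names in a "/5" name space
    print(max_rate_qps(5))     # highest rate for a 60 s long test
```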

DNS implementations and their settings
In this section, we enumerate the selected authoritative DNS server implementations, give the reasons for their selection and disclose their settings. We have considered only free software [16] implementations for the same reasons as given in Sect. 3 of [5].

BIND
BIND of ISC [17] is the most popular DNS server implementation, therefore it was a must to be included in our benchmarking.
We have tested two different versions of it: the 9.10.3-P4-Debian version, which was included in our Debian distribution, and version 9.12.4, which was downloaded as source code and compiled with the --with-tuning=large option, to test their performance difference.
Its configuration was very simple; the following lines were appended to the /etc/bind/named.conf.local file:

YADIFA
YADIFA of EURid is said to be one of the authoritative DNS servers with the lowest memory footprint [18]. We have successfully used it to support DNS64 benchmarking, and we experienced that it outperformed BIND [8]. For our current project, we used its 2.2.3-6237 version, which was included in our Debian distribution. Its configuration was very simple; we added the following lines to /etc/yadifa/yadifad.conf:
In addition to that, we had to copy the zone file to /var/lib/yadifa.

NSD
NSD of NLnet Labs was optimized for serving a high number of requests per second [19]. We used its 4.1.14 version, which was included in our Debian distribution.
NSD supports the SO_REUSEPORT socket option [20], which improves performance of the network stack on a multi-core computer if server-count is set to higher than 1 [21].
Its zone file was configured by adding the following lines to the /etc/nsd/nsd.conf file:
Unlike BIND and YADIFA, NSD can utilize only a single CPU core, unless the required number of processes is explicitly specified by the server-count option. Therefore, we always set this value to the number of active CPU cores.

Knot DNS
Knot DNS of CZ.NIC is a modern, high performance DNS server [22]. Knot DNS also supports the SO_REUSEPORT socket option, and it does not need to be enabled in the configuration file, as it is enabled by default [23].
Its configuration was very simple; the following lines were added to /etc/knot/knot.conf:

FakeDNS
FakeDNS is a special purpose program developed by Dániel Bakai using the code base of the mtd64-ng DNS64 server to eliminate the need for an authoritative DNS server when performing DNS64 benchmarking [24]. FakeDNS does not use a zone file, and it can serve only the name space used by dns64perf++: it simply takes the information from the first label (e.g. 000-001-002-003) to calculate the appropriate IPv4 address (e.g. 0.1.2.3). As it does not use a zone file, it starts very fast and uses only a very low amount of memory. Similarly to mtd64-ng, FakeDNS is also multithreaded. As it did not provide the required performance during our preliminary measurements, Dániel Bakai has developed an experimental feature called "moreproc". It starts a separate process for every single CPU core, and a modified version of iptables is used to distribute the requests among the processes.
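The mapping from the first label to the IPv4 address can be illustrated by the following Python sketch. It is not the code of FakeDNS, only a minimal illustration of the idea; the function name label_to_ipv4 and its input validation are our assumptions.

```python
import ipaddress


def label_to_ipv4(label):
    """Convert a first label such as "000-001-002-003" to an IPv4 address."""
    parts = label.split("-")
    if len(parts) != 4 or any(
        len(p) != 3 or not (p.isascii() and p.isdigit()) for p in parts
    ):
        raise ValueError("expected four 3-digit decimal groups: " + label)
    octets = [int(p) for p in parts]
    if any(o > 255 for o in octets):
        raise ValueError("octet out of range in label: " + label)
    # Pack the four octets into one unsigned 32-bit integer.
    value = (octets[0] << 24) | (octets[1] << 16) | (octets[2] << 8) | octets[3]
    return ipaddress.IPv4Address(value)


if __name__ == "__main__":
    address = label_to_ipv4("000-001-002-003")
    print(address, int(address))
```

The conversion needs no lookup table at all, which is why such a server can start immediately and needs very little memory, at the cost of some computation for every query.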
FakeDNS was configured in /etc/fakedns.conf as follows:
As dns64perf++ sends its requests to port 53, we used a special kernel module and iptables patch prepared by Dániel Bakai, which could rewrite the destination port numbers in the requests and the source port number in the replies. The requests were distributed equally among the fakedns processes using the nth mode of the statistic module of iptables.

Hardware and software environment
The measurements were carried out using the resources of the NICT StarBED, Japan. The measurement setup is shown in Fig. 2. Dell PowerEdge R430 servers were used both as the Tester and as the DUT (Device Under Test). They had two 2.1 GHz Intel Xeon E5-2683 v4 CPUs having 16 cores each, 384 GB 2400 MHz DDR4 RAM and Intel 10G dual port X540 network adapters. They were interconnected by a 10 Gbps VLAN.
Debian Linux 9.6 with kernel version 4.9.0-8-amd64 was installed on both computers.
Based on our benchmarking experience [8] and [25], we switched off Hyper-Threading on both computers to achieve consistent results. Turbo Boost was left enabled on the Tester, but it was switched off on the DUT to avoid the influence of the power budget on the scale-up tests (see below). Thus the CPU clock frequency of the Tester could theoretically vary from 1.2 to 3 GHz, but the power budget limited it to 2.6 GHz when all cores were used by dns64perf++. The CPU clock frequency of the DUT could vary from 1.2 to 2.1 GHz. The CPU clock frequency scaling governor was set to "performance" for all active CPU cores on both computers.
We wanted to know how the performance of each DNS implementation scales up with the number of CPU cores. To set the required number of active CPU cores, we used the maxcpus=N kernel parameter to activate N CPU cores at boot time. We tested with 1, 2, 4, 8, 16 and 32 active CPU cores.

Receive-side scaling
In the past, network interfaces used only a single queue for forwarding the packets from the hardware to the operating system kernel. This solution limited the processing of the received packets to a single CPU core, which became a bottleneck. RSS (Receive-Side Scaling) [26], also called multi-queue receiving, can efficiently distribute the incoming packets among the CPU cores, thus increasing the performance of the networking stack. In practice, the NICs (Network Interface Cards) use a hash value to distribute the incoming packets into the queues, that is, to assign them to the CPU cores. By default, only the source and destination IP addresses are used to compute the hash value. In real life, a high number of different source IP addresses can be found in the requests arriving at a DNS server. However, in our case, they were all the same. The Intel X540-AT2 adapters of our servers facilitated the usage of the four tuple (source IP address, destination IP address, source port number, destination port number). Therefore, we enabled this feature both on the Tester and on the DUT by the following command:
This setting was a prerequisite for being able to test up to a rate of 2.2 million queries per second [27].
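The effect of including the port numbers in the hash can be demonstrated by a small Python simulation. It is only a sketch under loud assumptions: CRC32 is a stand-in for the hash function of the NIC (which is a different function in real hardware), and the IP addresses, the port range and the function names are hypothetical values chosen for illustration.

```python
import zlib


def queue_index(fields, n_queues):
    """Map the selected header fields to a receive queue.
    CRC32 is only a stand-in for the real hash of the NIC."""
    key = "|".join(str(f) for f in fields).encode()
    return zlib.crc32(key) % n_queues


def queues_used(src_ports, n_queues=32, use_ports=True):
    """Return the set of queues hit by packets that differ only
    in their source port numbers."""
    src_ip, dst_ip, dst_port = "198.51.100.1", "198.51.100.2", 53
    used = set()
    for src_port in src_ports:
        if use_ports:
            fields = (src_ip, dst_ip, src_port, dst_port)  # four tuple
        else:
            fields = (src_ip, dst_ip)                      # addresses only
        used.add(queue_index(fields, n_queues))
    return used


if __name__ == "__main__":
    ports = range(10000, 10000 + 16 * 2048)  # 16 threads, 2048 ports each
    print("queues used, addresses only:", len(queues_used(ports, use_ports=False)))
    print("queues used, four tuple:    ", len(queues_used(ports, use_ports=True)))
```

When only the addresses are hashed, every packet of a single tester lands in the same queue, no matter how many source ports are used; this is the single core bottleneck described above.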

Execution of the measurements
A single execution of the dns64perf++ program can be used to decide whether the system can serve all the requests at the specified rate during the 60 s long time interval required by RFC 8219. The highest such rate can be determined by a binary search. As for DNS64 measurements, RFC 8219 requires the binary search to be executed at least 20 times, and the final result is the median of the 20 results, whereas the first percentile and the 99th percentile are used as indices of dispersion; they coincide with the minimum and the maximum if the number of repetitions of the binary search is less than 100.
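The summarizing function can be sketched in Python as follows. It covers only the case of 20 to 99 repetitions, for which the first and 99th percentiles are the minimum and the maximum as stated above; the function name summarize and the sample values are hypothetical and not measurement results.

```python
import statistics


def summarize(results):
    """Summarize the results of the repeated binary searches.

    Handles 20 to 99 repetitions only, where the 1st and the 99th
    percentiles coincide with the minimum and the maximum.
    """
    if not 20 <= len(results) < 100:
        raise ValueError("this sketch expects 20 to 99 results")
    return {
        "median": statistics.median(results),
        "p1": min(results),
        "p99": max(results),
    }


if __name__ == "__main__":
    # Made-up sample values, only to show the usage.
    sample = [2_000_000 + 1000 * i for i in range(20)]
    print(summarize(sample))
```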
The binary search and its 20 repetitions were implemented by a bash shell script. The upper limit of the binary search was set to 2,236,962 qps, and the 8.0.0.0/5 name space was used for the measurements.
Independently of the number of active CPU cores of the DUT, all 32 cores were always used on the Tester, and the number of sending threads and the number of receiving threads were both always 16. The number of source port numbers per sending thread was set to 2048 to facilitate an even distribution of the arriving packets among the CPU cores.

Measuring further important quantities
DNS resolution performance is definitely the most important decision factor, but some other quantities may also be important when selecting the most suitable implementation to support DNS64 benchmarking. So as not to interfere with the benchmarking tests, we have measured separately the memory consumption of the different authoritative DNS server implementations as well as the load time of a "/5" size zone file.

Results
First, we disclose and evaluate the performance of the different implementations one by one focusing on the scale up of their performance, and then we make a comparison.
In addition to that, we also expose the memory consumption and start up time of the different DNS implementations.

BIND
The domain name resolution performance of BIND 9.10.3-P4-Debian is shown in Fig. 3. (The columns show the median of the 20 results, and the error bars show the first and 99th percentiles.) As for the scale-up of the performance of BIND, there is a strange phenomenon at four cores. Unlike with any other number of CPU cores, at four cores the performance is not doubled compared to two cores, but it is even lower than that. We have found its root cause by checking the number of started UDP listeners in the /var/log/syslog file. At one and two cores, the number of UDP listeners was equal to the number of active CPU cores; however, from four cores on, the number of UDP listeners was only half of the number of active CPU cores. Thus the number of listeners became a bottleneck at four active CPU cores.
The domain name resolution performance of BIND compiled with the --with-tuning=large option is shown in Fig. 4. On the one hand, the single core performance increased drastically from 10,564 to 57,670 qps compared with the previous case; on the other hand, there are problems with the scale-up. The phenomenon that the performance did not increase from one to two active CPU cores can be explained by the fact that BIND used a single UDP listener in both cases. From four active CPU cores on, BIND used one fewer UDP listener than the number of active CPU cores. Unfortunately, the performance decreased from 16 to 32 cores. The high difference between the first percentile and the 99th percentile of the results from 4 to 32 cores is another issue. (We need to consider the first percentile for DNS64 benchmarking.) We did not do any further performance tuning and we did not go into a deeper analysis of the behavior of BIND, as our purpose was to find an implementation that suited our needs, and BIND was definitely far from it.

YADIFA
The domain name resolution performance of YADIFA 2.2.3-6237 is shown in Fig. 5. On the one hand, YADIFA showed a good single core performance (154,800 qps), but on the other hand its performance scaled up very poorly with the number of active CPU cores, which is a fundamental problem in the case of our current multi-core CPUs.
(It also showed significantly scattered results at two active CPU cores.) Thus YADIFA is definitely not a candidate for our purposes.

NSD
We have performed two measurement series with NSD. In the first case, RSS was set to include the source and destination port numbers in the hash value, as described in Sect. 4.2. In addition to that, we have also performed another measurement series excluding the port numbers from the hash to demonstrate the difference. The domain name resolution performance of NSD 4.1.14 with properly set RSS is shown in Fig. 6. NSD has shown an excellent performance, having both a high single core performance (188,872 qps) and also a very good scale-up. We mention that its median performance has reached the upper limit of the binary search at 32 active CPU cores, but its DNS64 benchmarking relevant performance (that is, the first percentile) was not limited by the size of the zone file.
The domain name resolution performance of NSD 4.1.14 with default RSS is shown in Fig. 7. As expected, its performance was limited by the packet receiving performance of a single CPU core. We note that the CPU core, which processed all the packet arrivals (interrupts) was not excluded from the operation of NSD. (We did not want to do any tuning, our only purpose with this experiment was to demonstrate the need for the properly set RSS.)

Knot DNS
The domain name resolution performance of Knot DNS 2.4.0 is shown in Fig. 8. It has shown an excellent performance with any number of CPU cores, and its median performance at 32 cores was significantly limited by the size of the zone file, but the first percentile was "only" 2,236,824 qps. Even though its results were significantly scattered at 16 active cores, the first percentile of its results was definitely higher than our targeted 2.2 Mqps at 32 cores.

FakeDNS
The performance of FakeDNS is shown in Fig. 9. It showed both a good single core performance and a fairly good scale-up, though it produced significantly scattered results (especially at 16 active CPU cores).

Performance comparison
We have compared the DNS64 benchmarking relevant performance of the different solutions, that is, the first percentiles of the results of their 20 tests in Fig. 10.
In general, Knot DNS and NSD have shown the best performance and the third one was FakeDNS. Knot DNS has outperformed NSD at 32 cores, and it was the only implementation that achieved 2.2 Mqps, thus it is our number one recommendation. But NSD was somewhat better at lower numbers of cores, and NSD with default RSS is definitely the best choice when only a single CPU core can be used.
Both BIND and YADIFA have shown low performance. BIND suffered from very low single core performance (its bar is hardly visible at a single CPU core in Fig. 10), and the performance of YADIFA scaled up poorly.
For our final recommendation, we need to consider also some other decision factors, which we address in the next subsection.

Memory consumption and zone load time
The memory consumption of the tested authoritative DNS servers with a "/5" size zone is shown in Fig. 11. YADIFA has shown the lowest memory consumption (23 GB), which is in accordance with the claim of its developers that YADIFA is a low memory footprint server. However, it cannot utilize the entries of such a large zone file in a 60 s long test. On the other hand, NSD required the highest amount of memory, 38.1 GB, which is still affordable if the server has at least 64 GB of RAM.
The load time of the tested authoritative DNS servers with a "/5" size zone is shown in Fig. 12. It is important to mention that NSD (with its default settings) prepares a binary zone file at the time of its first start. Using a "/5" size

Recommendation for DNS64 benchmarking
All in all, we recommend Knot DNS as authoritative DNS server to support DNS64 benchmarking up to 1,000,000 qps rate.
When somewhat lower rates are satisfactory, NSD and FakeDNS can also be good choices. FakeDNS is the best choice, if there is not enough memory in the server, or if very fast start is needed. (Please see further explanation in Sect. 5.7.) When only a single CPU core is available for the authoritative DNS server, NSD with default RSS can provide the highest performance.

Discussion of FakeDNS
First of all, we note that FakeDNS is only a byproduct of mtd64-ng, an experimental DNS64 server [24].
Our result that the performance of FakeDNS is lower than that of NSD or Knot DNS can be explained by the fact that FakeDNS has to convert the 4 times 3 decimal digits of the first label of the DNS query to an unsigned 32-bit integer (IPv4 address), whereas the real authoritative DNS servers can read the IPv4 address directly from the memory. NSD and Knot DNS are good examples that the latter solution may be faster. BIND and YADIFA are good examples for the opposite case.
We believe that there is room for FakeDNS in DNS64 benchmarking. If the servers used for benchmarking are shipped with only 32 GB of memory, the zone file cannot be loaded into the memory without swapping, but swapping is unacceptable due to its performance penalty. Even if there is enough memory, the fast start of the test system may be attractive, and thus FakeDNS can be a good choice, when its performance is enough.

Construction of a high performance DNS64 benchmarking system
Following the method of using three devices for DNS64 benchmarking (that is, the Measurer and the AuthDNS subsystems of the Tester are implemented by two separate devices), originally invented and disclosed in Fig. 6 of [7] and also used in Fig. 2 of [8], we have constructed a measurement setup for benchmarking DNS64 servers as shown in Fig. 13. The selected authoritative DNS server implementation is executed by node p093, and it is to be configured as described in Sect. 3 of our current paper. It is important to set RSS so that it also considers the source and destination port numbers for IPv4 and/or IPv6 on all interfaces used for measurements. We note that the "link" on the right side of the figure serves only the purpose of checking the performance of the authoritative DNS server, and it is not used during DNS64 benchmarking. (RFC 8219 requires the self-test of the Tester to be performed before it is used for benchmarking DNS64 servers.)

Plans for future research
Our plans for future research include the further development of mtd64-ng, our tiny DNS64 proxy [24], and its benchmarking, which has become feasible thanks to our current results.

Conclusion
We have selected the following free software authoritative DNS server implementations as potential candidates for supporting DNS64 benchmarking: BIND, YADIFA, NSD, Knot DNS, plus a special purpose software called FakeDNS. We have built a suitable test system for benchmarking the selected solutions, and we have compared their performance, memory consumption, and zone file load time.
We have found that Knot DNS can be used to support DNS64 benchmarking up to 1,000,000 qps rate.
When lower rates are satisfactory, NSD and FakeDNS are also good choices. FakeDNS is the best choice, if there is not enough memory in the server, or if very fast start is needed.
When only a single CPU core is available for the authoritative DNS server, NSD with default RSS can provide the highest performance.

Attila Pivoda received his BSc in electrical engineering from the Széchenyi István University, Győr, Hungary in 2019. He has had experience in managing wireless ISP networks with MikroTik, Ubiquiti and Cisco devices since 2010, and he is currently working for an ISP company as a network engineer. Part-time, he manages Linux based web hosting servers and does programming in HTML, PHP and MySQL.
Keiichi Shima is a deputy director at the Research Laboratory of IIJ Innovation Institute, Inc. His research field is the Internet, including designing and implementing communication protocols, computer networking technologies, computer network security, AI-based anomaly detection, and so forth. He also works as a board member of the WIDE project, operating a nationwide research network in Japan.