Abstract
To keep up with demand, servers must scale to handle hundreds of thousands of clients simultaneously. Much of the community's focus has been on scaling servers in terms of aggregate traffic intensity (packets transmitted per second). However, bottlenecks caused by the increasing number of concurrent clients, and the resulting large number of concurrent flows, have received little attention. In this work, we focus on identifying such bottlenecks. In particular, we define two broad categories of problems: admitting more packets into the network stack than can be handled efficiently, and increasing per-packet overhead within the stack. We show that these problems contribute to high CPU usage and degrade network performance in terms of aggregate throughput and RTT. Our measurement and analysis are performed in the context of the Linux networking stack, the most widely used publicly available networking stack. Further, we discuss the relevance of our findings to other network stacks. The goal of our work is to highlight considerations required in the design of future networking stacks to enable efficient handling of large numbers of clients and flows.
Notes
- 1.
HD and SD videos consume up to 5 Mbps and 3 Mbps, respectively [9].
- 2.
The number of packets is typically much smaller than in the worst-case scenario due to imperfect pacing. Delays in dispatching packets, resulting from imperfect pacing, require sending larger packets to maintain the correct average rate, leading to a lower packet rate. However, the CPU cost of autosizing increases with the number of flows even with imperfect pacing.
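The arithmetic behind this note can be sketched as follows; the function name and the example rate and intervals are illustrative, not taken from the paper.

```python
# Illustrative sketch: under pacing, the bytes dispatched per opportunity
# must equal rate * interval. Delayed dispatch (imperfect pacing) therefore
# forces larger per-dispatch packets, i.e. fewer, bigger packets.

def bytes_per_dispatch(pacing_rate_bps, dispatch_interval_s):
    """Bytes that must go out per dispatch to sustain pacing_rate_bps."""
    return pacing_rate_bps / 8 * dispatch_interval_s

# Perfect 1 ms pacing at 12 Mbps: one 1500-byte (MTU-sized) packet per dispatch.
perfect = bytes_per_dispatch(12e6, 1e-3)   # 1500.0 bytes
# Dispatch delayed to every 4 ms: 6000 bytes per burst to keep the same
# average rate, i.e. a quarter of the dispatch rate with larger packets.
delayed = bytes_per_dispatch(12e6, 4e-3)   # 6000.0 bytes
print(perfect, delayed)
```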
References
High-performance, feature-rich NetXtreme® E-Series dual-port 100G PCIe Ethernet NIC. https://www.broadcom.com/products/ethernet-connectivity/network-adapters/100gb-nic-ocp/p2100g
Intel DPDK: Data plane development kit (2014). https://www.dpdk.org/
IEEE standard for ethernet - amendment 10: Media access control parameters, physical layers, and management parameters for 200 Gb/s and 400 Gb/s operation. IEEE Std 802.3bs-2017 (Amendment to IEEE 802.3-2015 as amended by IEEE 802.3bw-2015, 802.3by-2016, 802.3bq-2016, 802.3bp-2016, 802.3br-2016, 802.3bn-2016, 802.3bz-2016, 802.3bu-2016, 802.3bv-2017, and IEEE 802.3-2015/Cor1-2017), pp. 1–372 (2017)
Microprocessor trend data (2018). https://github.com/karlrupp/microprocessor-trend-data
IEEE 802.3 Industry Connections Ethernet Bandwidth Assessment Part II (2020)
dstat-Linux man page (2020). https://linux.die.net/man/1/dstat
FlowQueue-Codel (2020). https://tools.ietf.org/id/draft-ietf-aqm-fq-codel-02.html
neper: a Linux networking performance tool (2020). https://github.com/google/neper
Netflix Help Center: Internet Connection Speed Recommendations (2020). https://help.netflix.com/en/node/306
netstat-Linux man page (2020). https://linux.die.net/man/8/netstat
Perf Manual (2020). https://www.man7.org/linux/man-pages/man1/perf.1.html
ss-Linux man page (2020). https://linux.die.net/man/8/ss
Belay, A., Prekas, G., Klimovic, A., Grossman, S., Kozyrakis, C., Bugnion, E.: IX: a protected dataplane operating system for high throughput and low latency. In: 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pp. 49–65 (2014)
Benvenuti, C.: Understanding Linux Network Internals. O’Reilly Media, Inc. (2006)
Brouer, J.D.: Network stack challenges at increasing speeds. In: Proceedings of the Linux Conference, pp. 12–16 (2015)
Cardwell, N., Cheng, Y., Gunn, C.S., Yeganeh, S.H., Jacobson, V.: BBR: congestion-based congestion control. Queue 14(5), 20–53 (2016)
Cavalcanti, F.R.P., Andersson, S.: Optimizing Wireless Communication Systems, vol. 386. Springer, Cham (2009). https://doi.org/10.1007/978-1-4419-0155-2
Checconi, F., Rizzo, L., Valente, P.: QFQ: efficient packet scheduling with tight guarantees. IEEE/ACM Trans. Networking 21(3) (2013)
Chen, Q.C., Yang, X.H., Wang, X.L.: A peer-to-peer based passive web crawling system. In: 2011 International Conference on Machine Learning and Cybernetics, vol. 4, pp. 1878–1883. IEEE (2011)
Dumazet, E., Corbet, J.: TCP small queues (2012). https://lwn.net/Articles/507065/
Dumazet, E., Corbet, J.: TSO sizing and the FQ scheduler (2013). https://lwn.net/Articles/564978/
Firestone, D., et al.: Azure accelerated networking: SmartNICs in the public cloud. In: 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2018), pp. 51–66 (2018)
Geer, D.: Chip makers turn to multicore processors. IEEE Computer 38 (2005)
Hedayati, M., Shen, K., Scott, M.L., Marty, M.: Multi-queue fair queuing. In: 2019 USENIX Annual Technical Conference (USENIX ATC 2019) (2019)
Hock, M., Veit, M., Neumeister, F., Bless, R., Zitterbart, M.: TCP at 100 Gbit/s - tuning, limitations, congestion control. In: 2019 IEEE 44th Conference on Local Computer Networks (LCN), pp. 1–9. IEEE (2019)
Jeong, E., et al.: mTCP: a highly scalable user-level TCP stack for multicore systems. In: 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2014), pp. 489–502 (2014)
Kalia, A., Kaminsky, M., Andersen, D.: Datacenter RPCs can be general and fast. In: 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2019), pp. 1–16 (2019)
Kaufmann, A., Peter, S., Sharma, N.K., Anderson, T., Krishnamurthy, A.: High performance packet processing with flexnic. In: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 67–81 (2016)
Kaufmann, A., Stamler, T., Peter, S., Sharma, N.K., Krishnamurthy, A., Anderson, T.: TAS: TCP acceleration as an OS service. In: Proceedings of the Fourteenth EuroSys Conference, 2019, pp. 1–16 (2019)
Li, Y., Cornett, L., Deval, M., Vasudevan, A., Sarangam, P.: Adaptive interrupt moderation. US Patent 9,009,367, 14 April 2015
Marty, M., et al.: Snap: a microkernel approach to host networking. In: Proceedings of the 27th ACM Symposium on Operating Systems Principles. SOSP 2019, pp. 399–413 (2019)
Mogul, J.C., Ramakrishnan, K.: Eliminating receive livelock in an interrupt-driven kernel. ACM Trans. Comput. Syst. 15(3), 217–252 (1997)
Moon, Y., Lee, S., Jamshed, M.A., Park, K.: AccelTCP: accelerating network applications with stateful TCP offloading. In: 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2020), pp. 77–92 (2020)
Ousterhout, A., Fried, J., Behrens, J., Belay, A., Balakrishnan, H.: Shenango: achieving high CPU efficiency for latency-sensitive datacenter workloads. In: Proceedings of USENIX NSDI 2019 (2019)
Radhakrishnan, S., et al.: SENIC: scalable NIC for end-host rate limiting. In: 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2014), pp. 475–488 (2014)
Rizzo, L.: netmap: a novel framework for fast packet I/O. In: 21st USENIX Security Symposium (USENIX Security 2012), pp. 101–112 (2012)
Rotaru, M., Olariu, F., Onica, E., Rivière, E.: Reliable messaging to millions of users with MigratoryData. In: Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference: Industrial Track, pp. 1–7 (2017)
Saeed, A., Dukkipati, N., Valancius, V., Lam, T., Contavalli, C., Vahdat, A.: Carousel: scalable traffic shaping at end-hosts. In: Proceedings of ACM SIGCOMM 2017 (2017)
Saeed, A., et al.: Eiffel: efficient and flexible software packet scheduling. In: Proceedings of USENIX NSDI 2019 (2019)
Shrivastav, V.: Fast, scalable, and programmable packet scheduler in hardware. In: Proceedings of the ACM Special Interest Group on Data Communication, SIGCOMM 2019 (2019)
Stephens, B., Akella, A., Swift, M.: Loom: flexible and efficient NIC packet scheduling. In: 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), pp. 33–46 (2019)
Stephens, B., Singhvi, A., Akella, A., Swift, M.: Titan: fair packet scheduling for commodity multiqueue NICs. In: 2017 USENIX Annual Technical Conference (USENIX ATC 2017), pp. 431–444 (2017)
Sun, L., Kostic, P.: Adaptive hardware interrupt moderation. US Patent App. 13/534,607, 2 January 2014
Yasukata, K., Honda, M., Santry, D., Eggert, L.: StackMap: low-latency networking with the OS stack and dedicated NICs. In: 2016 USENIX Annual Technical Conference (USENIX ATC 2016), pp. 43–56 (2016)
Zhang, T., Wang, J., Huang, J., Chen, J., Pan, Y., Min, G.: Tuning the aggressive TCP behavior for highly concurrent http connections in intra-datacenter. IEEE/ACM Trans. Networking 25(6), 3808–3822 (2017)
Zhao, Y., Saeed, A., Zegura, E., Ammar, M.: ZD: a scalable zero-drop network stack at end hosts. In: Proceedings of the 15th International Conference on Emerging Networking Experiments and Technologies, pp. 220–232 (2019)
Appendices
A Linux Stack Overview
Packet transmission at an end-host is the process of a packet traversing from user space, through kernel space, and finally to the NIC. The application generates a packet and copies it into the kernel-space TCP buffer. Packets from the TCP buffer are then queued into the Qdisc. There are two ways to dequeue a packet from the Qdisc to the driver buffer: 1) dequeue the packet immediately, or 2) schedule the packet to be dequeued later through a softirq, which calls net_tx_action to retrieve packets from the Qdisc (Fig. 10).
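The two dequeue paths can be sketched with a toy model. The name `net_tx_action` follows the kernel function mentioned above, but the code below is an illustrative simulation, not kernel code:

```python
from collections import deque

class Qdisc:
    """Toy queueing discipline between the TCP buffer and the driver."""
    def __init__(self):
        self.q = deque()
    def enqueue(self, pkt):
        self.q.append(pkt)
    def dequeue(self):
        return self.q.popleft() if self.q else None

driver_buffer = []         # stands in for the NIC driver's TX ring
softirq_pending = deque()  # qdiscs scheduled for deferred dequeue

def xmit(qdisc, pkt, driver_busy=False):
    """Enqueue pkt, then take path 1 (immediate) or path 2 (softirq)."""
    qdisc.enqueue(pkt)
    if not driver_busy:
        driver_buffer.append(qdisc.dequeue())  # path 1: dequeue immediately
    else:
        softirq_pending.append(qdisc)          # path 2: defer to softirq

def net_tx_action():
    """Softirq handler: retrieve packets from qdiscs scheduled earlier."""
    while softirq_pending:
        pkt = softirq_pending.popleft().dequeue()
        if pkt is not None:
            driver_buffer.append(pkt)

q = Qdisc()
xmit(q, "pkt1")                    # delivered on the immediate path
xmit(q, "pkt2", driver_busy=True)  # deferred to the softirq path
net_tx_action()                    # softirq drains the pending qdisc
print(driver_buffer)               # ['pkt1', 'pkt2']
```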
B Parameter Configuration
Table 2 shows all the parameters we have used in our setup.
C Overall Stack Performance
We find that the trends shown in Fig. 2 remain the same regardless of packet rate. In particular, we disable TSO, forcing the software stack to generate MTU-sized packets. This ensures that the packet rate remains relatively constant across experiments. Note that we perform experiments with a maximum of 100k flows. We try two values for the MTU: 1500 bytes and 9000 bytes. As expected, the performance of the server saturates at a much lower number of flows when generating 1500-byte packets (Fig. 11). This is because the packet rate increases compared to the experiments discussed in Sect. 3. On the other hand, the performance of the server when using 9000-byte packets is similar to that discussed in Sect. 3 (Fig. 12).
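The packet-rate effect of the MTU choice can be checked with simple arithmetic; the 10 Gbps figure below is an assumed line rate for illustration, not a number from the paper:

```python
def packet_rate_pps(throughput_bps, mtu_bytes):
    """Packets per second needed to sustain a byte throughput at a given MTU."""
    return throughput_bps / 8 / mtu_bytes

# At the same throughput, 1500-byte packets require 6x the packet rate
# (and roughly 6x the per-packet CPU work) of 9000-byte packets.
pps_1500 = packet_rate_pps(10e9, 1500)  # ~833k pps
pps_9000 = packet_rate_pps(10e9, 9000)  # ~139k pps
print(pps_1500 / pps_9000)              # ~6
```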
D FQ vs. PFIFO
We compare the fq and pfifo_fast qdiscs in terms of enqueueing latency (Fig. 13). The time to enqueue a packet into the pfifo_fast queue is almost constant, while the enqueue time for fq increases with the number of flows. This is because fq uses a tree structure to keep track of every flow, and the complexity of an insertion operation is \(O(\log (n))\). Cache misses when fetching flow information from the tree also add to the latency at large numbers of flows.
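The asymptotic difference can be sketched with stand-in data structures. Here a binary heap stands in for fq's per-flow tree; this is an illustration of the O(1) vs. O(log n) enqueue costs, not the actual qdisc code:

```python
import heapq
from collections import deque

# pfifo_fast: a plain FIFO -- enqueue is O(1) regardless of flow count.
pfifo = deque()
def pfifo_enqueue(pkt):
    pfifo.append(pkt)

# fq: active flows kept in a time-ordered structure -- enqueue must place
# the flow by its next transmit time, an O(log n) insertion in flow count.
fq_flows = []
def fq_enqueue(next_tx_time, flow_id, pkt):
    heapq.heappush(fq_flows, (next_tx_time, flow_id, pkt))

for i in range(100_000):
    pfifo_enqueue(f"pkt{i}")            # constant-time append
    fq_enqueue(i * 1e-6, i, f"pkt{i}")  # log-time heap insertion

# The earliest-deadline flow sits at the heap root, as fq's pacing needs.
print(len(pfifo), fq_flows[0][1])
```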
E Packet Rate with Zero Drops
We verified that BBR and CUBIC have similar CPU usage when the packet rate (PPS) is fixed (Fig. 14). We disable TSO and GSO to fix the packet size and set the MTU to 7000 bytes to eliminate the CPU bottleneck. We also observe that with more than 200k flows, CUBIC consumes slightly more CPU than BBR because CUBIC reacts to packet drops by reducing the packet size, thus generating more packets.
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Zhao, Y., Saeed, A., Ammar, M., Zegura, E. (2021). Scouting the Path to a Million-Client Server. In: Hohlfeld, O., Lutu, A., Levin, D. (eds) Passive and Active Measurement. PAM 2021. Lecture Notes in Computer Science(), vol 12671. Springer, Cham. https://doi.org/10.1007/978-3-030-72582-2_20
DOI: https://doi.org/10.1007/978-3-030-72582-2_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-72581-5
Online ISBN: 978-3-030-72582-2