Abstract
To keep up with demand, servers must scale to handle hundreds of thousands of clients simultaneously. Much of the community's focus has been on scaling servers in terms of aggregate traffic intensity (packets transmitted per second). However, bottlenecks caused by the increasing number of concurrent clients, and the resulting large number of concurrent flows, have received little attention. In this work, we focus on identifying such bottlenecks. In particular, we define two broad categories of problems: admitting more packets into the network stack than can be handled efficiently, and increasing per-packet overhead within the stack. We show that these problems contribute to high CPU usage and degrade network performance in terms of aggregate throughput and RTT. Our measurement and analysis are performed in the context of the Linux networking stack, the most widely used publicly available networking stack. Further, we discuss the relevance of our findings to other network stacks. The goal of our work is to highlight considerations required in the design of future networking stacks to enable efficient handling of large numbers of clients and flows.
Notes
- 1.
HD and SD videos consume up to 5 Mbps and 3 Mbps, respectively [9].
- 2.
The number of packets is typically much smaller than in the worst-case scenario due to imperfect pacing. Delays in dispatching packets, resulting from imperfect pacing, require sending larger packets to maintain the correct average rate, leading to a lower packet rate. However, the CPU cost of autosizing increases with the number of flows even with imperfect pacing.
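The arithmetic behind this note can be sketched as follows; the function name and the example rate and intervals are illustrative, not taken from the paper.

```python
# Illustrative sketch: under pacing, the bytes dispatched per opportunity
# must equal rate * interval. Delayed dispatch (imperfect pacing) therefore
# forces larger per-dispatch packets, i.e. fewer, bigger packets.

def bytes_per_dispatch(pacing_rate_bps, dispatch_interval_s):
    """Bytes that must go out per dispatch to sustain pacing_rate_bps."""
    return pacing_rate_bps / 8 * dispatch_interval_s

# Perfect 1 ms pacing at 12 Mbps: one 1500-byte (MTU-sized) packet per dispatch.
perfect = bytes_per_dispatch(12e6, 1e-3)   # 1500.0 bytes
# Dispatch delayed to every 4 ms: 6000 bytes per burst to keep the same
# average rate, i.e. a quarter of the dispatch rate with larger packets.
delayed = bytes_per_dispatch(12e6, 4e-3)   # 6000.0 bytes
print(perfect, delayed)
```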
References
High-performance, feature-rich NetXtreme® E-Series dual-port 100G PCIe Ethernet NIC. https://www.broadcom.com/products/ethernet-connectivity/network-adapters/100gb-nic-ocp/p2100g
Intel DPDK: Data plane development kit (2014). https://www.dpdk.org/
IEEE standard for ethernet - amendment 10: Media access control parameters, physical layers, and management parameters for 200 Gb/s and 400 Gb/s operation. IEEE Std 802.3bs-2017 (Amendment to IEEE 802.3-2015 as amended by IEEE 802.3bw-2015, 802.3by-2016, 802.3bq-2016, 802.3bp-2016, 802.3br-2016, 802.3bn-2016, 802.3bz-2016, 802.3bu-2016, 802.3bv-2017, and IEEE 802.3-2015/Cor1-2017), pp. 1–372 (2017)
Microprocessor trend data (2018). https://github.com/karlrupp/microprocessor-trend-data
IEEE 802.3 Industry Connections Ethernet Bandwidth Assessment Part II (2020)
dstat-Linux man page (2020). https://linux.die.net/man/1/dstat
FlowQueue-Codel (2020). https://tools.ietf.org/id/draft-ietf-aqm-fq-codel-02.html
neper: a Linux networking performance tool (2020). https://github.com/google/neper
Netflix Help Center: Internet Connection Speed Recommendations (2020). https://help.netflix.com/en/node/306
netstat-Linux man page (2020). https://linux.die.net/man/8/netstat
Perf Manual (2020). https://www.man7.org/linux/man-pages/man1/perf.1.html
ss-Linux man page (2020). https://linux.die.net/man/8/ss
Belay, A., Prekas, G., Klimovic, A., Grossman, S., Kozyrakis, C., Bugnion, E.: IX: a protected dataplane operating system for high throughput and low latency. In: 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pp. 49–65 (2014)
Benvenuti, C.: Understanding Linux Network Internals. O’Reilly Media, Inc. (2006)
Brouer, J.D.: Network stack challenges at increasing speeds. In: Proceedings of the Linux Conference, pp. 12–16 (2015)
Cardwell, N., Cheng, Y., Gunn, C.S., Yeganeh, S.H., Jacobson, V.: BBR: congestion-based congestion control. Queue 14(5), 20–53 (2016)
Cavalcanti, F.R.P., Andersson, S.: Optimizing Wireless Communication Systems, vol. 386. Springer, Cham (2009). https://doi.org/10.1007/978-1-4419-0155-2
Checconi, F., Rizzo, L., Valente, P.: QFQ: efficient packet scheduling with tight guarantees. IEEE/ACM Trans. Networking 21(3) (2013)
Chen, Q.C., Yang, X.H., Wang, X.L.: A peer-to-peer based passive web crawling system. In: 2011 International Conference on Machine Learning and Cybernetics, vol. 4, pp. 1878–1883. IEEE (2011)
Dumazet, E., Corbet, J.: TCP small queues (2012). https://lwn.net/Articles/507065/
Dumazet, E., Corbet, J.: TSO sizing and the FQ scheduler (2013). https://lwn.net/Articles/564978/
Firestone, D., et al.: Azure accelerated networking: SmartNICs in the public cloud. In: 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2018), pp. 51–66 (2018)
Geer, D.: Chip makers turn to multicore processors. IEEE Computer 38 (2005)
Hedayati, M., Shen, K., Scott, M.L., Marty, M.: Multi-queue fair queuing. In: 2019 USENIX Annual Technical Conference (USENIX ATC 2019) (2019)
Hock, M., Veit, M., Neumeister, F., Bless, R., Zitterbart, M.: TCP at 100 Gbit/s - tuning, limitations, congestion control. In: 2019 IEEE 44th Conference on Local Computer Networks (LCN), pp. 1–9. IEEE (2019)
Jeong, E., et al.: mTCP: a highly scalable user-level TCP stack for multicore systems. In: 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2014), pp. 489–502 (2014)
Kalia, A., Kaminsky, M., Andersen, D.: Datacenter RPCs can be general and fast. In: 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2019), pp. 1–16 (2019)
Kaufmann, A., Peter, S., Sharma, N.K., Anderson, T., Krishnamurthy, A.: High performance packet processing with flexnic. In: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 67–81 (2016)
Kaufmann, A., Stamler, T., Peter, S., Sharma, N.K., Krishnamurthy, A., Anderson, T.: TAS: TCP acceleration as an OS service. In: Proceedings of the Fourteenth EuroSys Conference, 2019, pp. 1–16 (2019)
Li, Y., Cornett, L., Deval, M., Vasudevan, A., Sarangam, P.: Adaptive interrupt moderation. US Patent 9,009,367, 14 April 2015
Marty, M., et al.: Snap: a microkernel approach to host networking. In: Proceedings of the 27th ACM Symposium on Operating Systems Principles. SOSP 2019, pp. 399–413 (2019)
Mogul, J.C., Ramakrishnan, K.: Eliminating receive livelock in an interrupt-driven kernel. ACM Trans. Comput. Syst. 15(3), 217–252 (1997)
Moon, Y., Lee, S., Jamshed, M.A., Park, K.: AccelTCP: accelerating network applications with stateful TCP offloading. In: 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2020), pp. 77–92 (2020)
Ousterhout, A., Fried, J., Behrens, J., Belay, A., Balakrishnan, H.: Shenango: achieving high CPU efficiency for latency-sensitive datacenter workloads. In: Proceedings of USENIX NSDI 2019 (2019)
Radhakrishnan, S., et al.: SENIC: scalable NIC for end-host rate limiting. In: 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2014), pp. 475–488 (2014)
Rizzo, L.: netmap: a novel framework for fast packet I/O. In: 21st USENIX Security Symposium (USENIX Security 2012), pp. 101–112 (2012)
Rotaru, M., Olariu, F., Onica, E., Rivière, E.: Reliable messaging to millions of users with MigratoryData. In: Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference: Industrial Track, pp. 1–7 (2017)
Saeed, A., Dukkipati, N., Valancius, V., Lam, T., Contavalli, C., Vahdat, A.: Carousel: scalable traffic shaping at end-hosts. In: Proceedings of ACM SIGCOMM 2017 (2017)
Saeed, A., et al.: Eiffel: efficient and flexible software packet scheduling. In: Proceedings of USENIX NSDI 2019 (2019)
Shrivastav, V.: Fast, scalable, and programmable packet scheduler in hardware. In: Proceedings of the ACM Special Interest Group on Data Communication, SIGCOMM 2019 (2019)
Stephens, B., Akella, A., Swift, M.: Loom: flexible and efficient NIC packet scheduling. In: 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), pp. 33–46 (2019)
Stephens, B., Singhvi, A., Akella, A., Swift, M.: Titan: fair packet scheduling for commodity multiqueue NICs. In: 2017 USENIX Annual Technical Conference (USENIX ATC 2017), pp. 431–444 (2017)
Sun, L., Kostic, P.: Adaptive hardware interrupt moderation. US Patent App. 13/534,607, 2 January 2014
Yasukata, K., Honda, M., Santry, D., Eggert, L.: StackMap: low-latency networking with the OS stack and dedicated NICs. In: 2016 USENIX Annual Technical Conference (USENIX ATC 2016), pp. 43–56 (2016)
Zhang, T., Wang, J., Huang, J., Chen, J., Pan, Y., Min, G.: Tuning the aggressive TCP behavior for highly concurrent http connections in intra-datacenter. IEEE/ACM Trans. Networking 25(6), 3808–3822 (2017)
Zhao, Y., Saeed, A., Zegura, E., Ammar, M.: ZD: a scalable zero-drop network stack at end hosts. In: Proceedings of the 15th International Conference on Emerging Networking Experiments and Technologies, pp. 220–232 (2019)
Appendices
A Linux Stack Overview
Packet transmission at an end-host is the process of a packet traversing from user space, through kernel space, and finally to the NIC. The application generates a packet and copies it into the kernel-space TCP buffer. Packets from the TCP buffer are then queued into the Qdisc. There are two ways to dequeue a packet from the Qdisc to the driver buffer: 1) dequeue the packet immediately, or 2) schedule the packet to be dequeued later through a softirq, which calls net_tx_action to retrieve packets from the Qdisc (Fig. 10).
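The two dequeue paths can be sketched with a toy model. The name `net_tx_action` follows the kernel function mentioned above, but the code below is an illustrative simulation, not kernel code:

```python
from collections import deque

class Qdisc:
    """Toy queueing discipline between the TCP buffer and the driver."""
    def __init__(self):
        self.q = deque()
    def enqueue(self, pkt):
        self.q.append(pkt)
    def dequeue(self):
        return self.q.popleft() if self.q else None

driver_buffer = []         # stands in for the NIC driver's TX ring
softirq_pending = deque()  # qdiscs scheduled for deferred dequeue

def xmit(qdisc, pkt, driver_busy=False):
    """Enqueue pkt, then take path 1 (immediate) or path 2 (softirq)."""
    qdisc.enqueue(pkt)
    if not driver_busy:
        driver_buffer.append(qdisc.dequeue())  # path 1: dequeue immediately
    else:
        softirq_pending.append(qdisc)          # path 2: defer to softirq

def net_tx_action():
    """Softirq handler: retrieve packets from qdiscs scheduled earlier."""
    while softirq_pending:
        pkt = softirq_pending.popleft().dequeue()
        if pkt is not None:
            driver_buffer.append(pkt)

q = Qdisc()
xmit(q, "pkt1")                    # delivered on the immediate path
xmit(q, "pkt2", driver_busy=True)  # deferred to the softirq path
net_tx_action()                    # softirq drains the pending qdisc
print(driver_buffer)               # ['pkt1', 'pkt2']
```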
B Parameter Configuration
Table 2 shows all the parameters we have used in our setup.
C Overall Stack Performance
We find that the trends shown in Fig. 2 remain the same regardless of packet rate. In particular, we disable TSO, forcing the software stack to generate MTU-sized packets. This ensures that the packet rate remains relatively constant across experiments. Note that we perform experiments with a maximum of 100k flows. We try two values for the MTU: 1500 bytes and 9000 bytes. As expected, the performance of the server saturates at a much lower number of flows when generating 1500-byte packets (Fig. 11). This is because the packet rate increases compared to the experiments discussed in Sect. 3. On the other hand, the performance of the server when using 9000-byte packets is similar to that discussed in Sect. 3 (Fig. 12).
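The packet-rate effect of the MTU choice can be checked with simple arithmetic; the 10 Gbps figure below is an assumed line rate for illustration, not a number from the paper:

```python
def packet_rate_pps(throughput_bps, mtu_bytes):
    """Packets per second needed to sustain a byte throughput at a given MTU."""
    return throughput_bps / 8 / mtu_bytes

# At the same throughput, 1500-byte packets require 6x the packet rate
# (and roughly 6x the per-packet CPU work) of 9000-byte packets.
pps_1500 = packet_rate_pps(10e9, 1500)  # ~833k pps
pps_9000 = packet_rate_pps(10e9, 9000)  # ~139k pps
print(pps_1500 / pps_9000)              # ~6
```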
D FQ vs. PFIFO
We compare the fq and pfifo_fast qdiscs in terms of enqueueing latency (Fig. 13). The time to enqueue a packet into the pfifo_fast queue is almost constant, while the enqueue time for fq increases with the number of flows. This is because fq uses a tree structure to keep track of every flow, and the complexity of an insertion operation is \(O(\log (n))\). Cache misses when fetching flow information from the tree also add to the latency at large numbers of flows.
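The asymptotic difference can be sketched with stand-in data structures. Here a binary heap stands in for fq's per-flow tree; this is an illustration of the O(1) vs. O(log n) enqueue costs, not the actual qdisc code:

```python
import heapq
from collections import deque

# pfifo_fast: a plain FIFO -- enqueue is O(1) regardless of flow count.
pfifo = deque()
def pfifo_enqueue(pkt):
    pfifo.append(pkt)

# fq: active flows kept in a time-ordered structure -- enqueue must place
# the flow by its next transmit time, an O(log n) insertion in flow count.
fq_flows = []
def fq_enqueue(next_tx_time, flow_id, pkt):
    heapq.heappush(fq_flows, (next_tx_time, flow_id, pkt))

for i in range(100_000):
    pfifo_enqueue(f"pkt{i}")            # constant-time append
    fq_enqueue(i * 1e-6, i, f"pkt{i}")  # log-time heap insertion

# The earliest-deadline flow sits at the heap root, as fq's pacing needs.
print(len(pfifo), fq_flows[0][1])
```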
E Packet Rate with Zero Drops
We verified that BBR and CUBIC have similar CPU usage when the packet rate (PPS) is fixed (Fig. 14). We disable TSO and GSO to fix the packet size and set the MTU to 7000 bytes to eliminate the CPU bottleneck. We also observe that with more than 200k flows, CUBIC consumes slightly more CPU than BBR because CUBIC reacts to packet drops by reducing the packet size, thus generating more packets.
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Zhao, Y., Saeed, A., Ammar, M., Zegura, E. (2021). Scouting the Path to a Million-Client Server. In: Hohlfeld, O., Lutu, A., Levin, D. (eds) Passive and Active Measurement. PAM 2021. Lecture Notes in Computer Science(), vol 12671. Springer, Cham. https://doi.org/10.1007/978-3-030-72582-2_20
DOI: https://doi.org/10.1007/978-3-030-72582-2_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-72581-5
Online ISBN: 978-3-030-72582-2