Scouting the Path to a Million-Client Server

Conference paper in: Passive and Active Measurement (PAM 2021)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 12671)
Abstract

To keep up with demand, servers will scale up to handle hundreds of thousands of clients simultaneously. Much of the focus of the community has been on scaling servers in terms of aggregate traffic intensity (packets transmitted per second). However, bottlenecks caused by the increasing number of concurrent clients, resulting in a large number of concurrent flows, have received little attention. In this work, we focus on identifying such bottlenecks. In particular, we define two broad categories of problems; namely, admitting more packets into the network stack than can be handled efficiently, and increasing per-packet overhead within the stack. We show that these problems contribute to high CPU usage and network performance degradation in terms of aggregate throughput and RTT. Our measurement and analysis are performed in the context of the Linux networking stack, the most widely used publicly available networking stack. Further, we discuss the relevance of our findings to other network stacks. The goal of our work is to highlight considerations required in the design of future networking stacks to enable efficient handling of large numbers of clients and flows.

Notes

  1. HD and SD videos consume up to 5 Mbps and 3 Mbps, respectively [9].

  2. The number of packets is typically much smaller than the worst-case scenario due to imperfect pacing. Delays in dispatching packets, resulting from imperfect pacing, require sending larger packets to maintain the correct average rate, leading to a lower packet rate. However, the CPU cost of autosizing increases with the number of flows even with imperfect pacing (a toy calculation illustrating this trade-off is sketched below).
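To make the interplay between pacing and autosizing concrete, the following toy calculation sizes each burst to roughly one millisecond's worth of the flow's pacing rate: when dispatch slips, the burst grows and the packet rate falls, yet the per-flow sizing work remains. This is our own illustration under those assumptions, not the paper's measurement code nor the kernel's actual autosizing logic.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy model of pacing-driven segment autosizing (not the kernel's code).
 * The sender targets roughly 1 ms worth of the flow's pacing rate per
 * burst; if dispatch slips by delay_ms, the next burst must be larger to
 * keep the same average rate, so the packet rate drops (Note 2). */
static uint32_t autosize_bytes(uint64_t pacing_rate_Bps, uint32_t mss,
                               uint32_t delay_ms)
{
    uint64_t bytes = pacing_rate_Bps * (1 + delay_ms) / 1000;
    if (bytes > 65536)
        bytes = 65536;            /* cap at the 64 KB TSO burst limit */
    uint64_t segs = bytes / mss;
    if (segs < 2)
        segs = 2;                 /* keep at least two MSS per burst */
    return (uint32_t)(segs * mss);
}

int main(void)
{
    /* A 100 Mb/s flow (12.5 MB/s) with a 1448-byte MSS. */
    printf("perfect pacing:     %u bytes per burst\n",
           autosize_bytes(12500000, 1448, 0));
    printf("4 ms dispatch slip: %u bytes per burst\n",
           autosize_bytes(12500000, 1448, 4));
    return 0;
}
```

With perfect pacing the flow sends small bursts at a high rate; after a 4 ms slip it must send one burst several times larger to hold the same average rate, which lowers the packet rate while the autosizing computation is still performed per flow.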

References

  1. High-performance, feature-rich NetXtreme® E-Series dual-port 100G PCIe Ethernet NIC. https://www.broadcom.com/products/ethernet-connectivity/network-adapters/100gb-nic-ocp/p2100g

  2. Intel DPDK: Data Plane Development Kit (2014). https://www.dpdk.org/

  3. IEEE Standard for Ethernet - Amendment 10: Media Access Control Parameters, Physical Layers, and Management Parameters for 200 Gb/s and 400 Gb/s Operation. IEEE Std 802.3bs-2017 (Amendment to IEEE 802.3-2015 as amended by IEEE 802.3bw-2015, 802.3by-2016, 802.3bq-2016, 802.3bp-2016, 802.3br-2016, 802.3bn-2016, 802.3bz-2016, 802.3bu-2016, 802.3bv-2017, and IEEE 802.3-2015/Cor1-2017), pp. 1–372 (2017)

  4. Microprocessor trend data (2018). https://github.com/karlrupp/microprocessor-trend-data

  5. IEEE 802.3 Industry Connections Ethernet Bandwidth Assessment Part II (2020)

  6. dstat - Linux man page (2020). https://linux.die.net/man/1/dstat

  7. FlowQueue-CoDel (2020). https://tools.ietf.org/id/draft-ietf-aqm-fq-codel-02.html

  8. neper: a Linux networking performance tool (2020). https://github.com/google/neper

  9. Netflix Help Center: Internet connection speed recommendations (2020). https://help.netflix.com/en/node/306

  10. netstat - Linux man page (2020). https://linux.die.net/man/8/netstat

  11. perf manual (2020). https://www.man7.org/linux/man-pages/man1/perf.1.html

  12. ss - Linux man page (2020). https://linux.die.net/man/8/ss

  13. Belay, A., Prekas, G., Klimovic, A., Grossman, S., Kozyrakis, C., Bugnion, E.: IX: a protected dataplane operating system for high throughput and low latency. In: 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2014), pp. 49–65 (2014)

  14. Benvenuti, C.: Understanding Linux Network Internals. O'Reilly Media, Inc. (2006)

  15. Brouer, J.D.: Network stack challenges at increasing speeds. In: Proceedings of the Linux Conference, pp. 12–16 (2015)

  16. Cardwell, N., Cheng, Y., Gunn, C.S., Yeganeh, S.H., Jacobson, V.: BBR: congestion-based congestion control. Queue 14(5), 20–53 (2016)

  17. Cavalcanti, F.R.P., Andersson, S.: Optimizing Wireless Communication Systems, vol. 386. Springer, Cham (2009). https://doi.org/10.1007/978-1-4419-0155-2

  18. Checconi, F., Rizzo, L., Valente, P.: QFQ: efficient packet scheduling with tight guarantees. IEEE/ACM Trans. Networking 21(3) (2013)

  19. Chen, Q.C., Yang, X.H., Wang, X.L.: A peer-to-peer based passive web crawling system. In: 2011 International Conference on Machine Learning and Cybernetics, vol. 4, pp. 1878–1883. IEEE (2011)

  20. Dumazet, E., Corbet, J.: TCP small queues (2012). https://lwn.net/Articles/507065/

  21. Dumazet, E., Corbet, J.: TSO sizing and the FQ scheduler (2013). https://lwn.net/Articles/564978/

  22. Firestone, D., et al.: Azure accelerated networking: SmartNICs in the public cloud. In: 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2018), pp. 51–66 (2018)

  23. Geer, D.: Chip makers turn to multicore processors. IEEE Computer 38 (2005)

  24. Hedayati, M., Shen, K., Scott, M.L., Marty, M.: Multi-queue fair queuing. In: 2019 USENIX Annual Technical Conference (USENIX ATC 2019) (2019)

  25. Hock, M., Veit, M., Neumeister, F., Bless, R., Zitterbart, M.: TCP at 100 Gbit/s - tuning, limitations, congestion control. In: 2019 IEEE 44th Conference on Local Computer Networks (LCN), pp. 1–9. IEEE (2019)

  26. Jeong, E., et al.: mTCP: a highly scalable user-level TCP stack for multicore systems. In: 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2014), pp. 489–502 (2014)

  27. Kalia, A., Kaminsky, M., Andersen, D.: Datacenter RPCs can be general and fast. In: 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2019), pp. 1–16 (2019)

  28. Kaufmann, A., Peter, S., Sharma, N.K., Anderson, T., Krishnamurthy, A.: High performance packet processing with FlexNIC. In: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 67–81 (2016)

  29. Kaufmann, A., Stamler, T., Peter, S., Sharma, N.K., Krishnamurthy, A., Anderson, T.: TAS: TCP acceleration as an OS service. In: Proceedings of the Fourteenth EuroSys Conference 2019, pp. 1–16 (2019)

  30. Li, Y., Cornett, L., Deval, M., Vasudevan, A., Sarangam, P.: Adaptive interrupt moderation. US Patent 9,009,367 (14 April 2015)

  31. Marty, M., et al.: Snap: a microkernel approach to host networking. In: Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP 2019), pp. 399–413 (2019)

  32. Mogul, J.C., Ramakrishnan, K.: Eliminating receive livelock in an interrupt-driven kernel. ACM Trans. Comput. Syst. 15(3), 217–252 (1997)

  33. Moon, Y., Lee, S., Jamshed, M.A., Park, K.: AccelTCP: accelerating network applications with stateful TCP offloading. In: 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2020), pp. 77–92 (2020)

  34. Ousterhout, A., Fried, J., Behrens, J., Belay, A., Balakrishnan, H.: Shenango: achieving high CPU efficiency for latency-sensitive datacenter workloads. In: Proceedings of USENIX NSDI 2019 (2019)

  35. Radhakrishnan, S., et al.: SENIC: scalable NIC for end-host rate limiting. In: 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2014), pp. 475–488 (2014)

  36. Rizzo, L.: netmap: a novel framework for fast packet I/O. In: 21st USENIX Security Symposium (USENIX Security 2012), pp. 101–112 (2012)

  37. Rotaru, M., Olariu, F., Onica, E., Rivière, E.: Reliable messaging to millions of users with MigratoryData. In: Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference: Industrial Track, pp. 1–7 (2017)

  38. Saeed, A., Dukkipati, N., Valancius, V., Lam, T., Contavalli, C., Vahdat, A.: Carousel: scalable traffic shaping at end-hosts. In: Proceedings of ACM SIGCOMM 2017 (2017)

  39. Saeed, A., et al.: Eiffel: efficient and flexible software packet scheduling. In: Proceedings of USENIX NSDI 2019 (2019)

  40. Shrivastav, V.: Fast, scalable, and programmable packet scheduler in hardware. In: Proceedings of the ACM Special Interest Group on Data Communication (SIGCOMM 2019) (2019)

  41. Stephens, B., Akella, A., Swift, M.: Loom: flexible and efficient NIC packet scheduling. In: 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2019), pp. 33–46 (2019)

  42. Stephens, B., Singhvi, A., Akella, A., Swift, M.: Titan: fair packet scheduling for commodity multiqueue NICs. In: 2017 USENIX Annual Technical Conference (USENIX ATC 2017), pp. 431–444 (2017)

  43. Sun, L., Kostic, P.: Adaptive hardware interrupt moderation. US Patent App. 13/534,607 (2 January 2014)

  44. Yasukata, K., Honda, M., Santry, D., Eggert, L.: StackMap: low-latency networking with the OS stack and dedicated NICs. In: 2016 USENIX Annual Technical Conference (USENIX ATC 2016), pp. 43–56 (2016)

  45. Zhang, T., Wang, J., Huang, J., Chen, J., Pan, Y., Min, G.: Tuning the aggressive TCP behavior for highly concurrent HTTP connections in intra-datacenter. IEEE/ACM Trans. Networking 25(6), 3808–3822 (2017)

  46. Zhao, Y., Saeed, A., Zegura, E., Ammar, M.: ZD: a scalable zero-drop network stack at end hosts. In: Proceedings of the 15th International Conference on Emerging Networking Experiments and Technologies, pp. 220–232 (2019)

Appendices

A Linux Stack Overview

Fig. 10. Packet transmission

Packet transmission in an end-host refers to the process of a packet traversing from user space, through kernel space, and finally to the NIC. The application generates data and copies it into the kernel-space TCP buffer. Packets built from the TCP buffer are then queued into the Qdisc. There are two ways to dequeue a packet from the Qdisc to the driver buffer: 1) dequeue the packet immediately, or 2) schedule the packet to be dequeued later through a softirq, which calls net_tx_action to retrieve packets from the Qdisc (Fig. 10). A simplified model of these two paths is sketched below.
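The sketch below is a deliberately simplified user-space illustration of the flow in Fig. 10, assuming a single queue and a flag standing in for the driver ring. The function names are illustrative stand-ins for the kernel's dev_queue_xmit/qdisc_run/net_tx_action entry points, not their actual implementations.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Simplified user-space model of the transmit path in Fig. 10, not kernel
 * code: the real entry points (dev_queue_xmit, qdisc_run, net_tx_action)
 * are far more involved. Ordering inside the queue is ignored for brevity. */

struct packet { size_t len; };

#define QDISC_CAP 128
static struct packet qdisc[QDISC_CAP];
static size_t qdisc_len;
static bool driver_busy;          /* models an exhausted driver (NIC) ring */
static bool tx_softirq_pending;   /* models a scheduled NET_TX softirq */

static bool qdisc_enqueue(struct packet p)
{
    if (qdisc_len == QDISC_CAP)
        return false;             /* Qdisc full: the packet would be dropped */
    qdisc[qdisc_len++] = p;
    return true;
}

static void driver_xmit(struct packet p)
{
    printf("xmit %zu bytes\n", p.len);
}

/* Path 1: dequeue immediately after enqueue, straight into the driver. */
static void qdisc_run_model(void)
{
    while (qdisc_len > 0 && !driver_busy)
        driver_xmit(qdisc[--qdisc_len]);
    if (qdisc_len > 0)
        tx_softirq_pending = true; /* Path 2: defer the rest to a softirq */
}

/* Path 2: the softirq handler later drains what the direct path could not. */
static void net_tx_action_model(void)
{
    if (!tx_softirq_pending)
        return;
    tx_softirq_pending = false;
    driver_busy = false;           /* assume the driver ring freed up */
    qdisc_run_model();
}

int main(void)
{
    /* The application writes into the TCP buffer; the stack then queues a
     * (here, already-built) packet into the Qdisc and tries to send it. */
    struct packet p = { .len = 1500 };
    if (qdisc_enqueue(p))
        qdisc_run_model();
    net_tx_action_model();         /* later: the softirq retries the Qdisc */
    return 0;
}
```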

Fig. 11. Overall performance of the network stack as a function of the number of flows, with TSO disabled (fixed packet size) and a 1500-byte MTU

Fig. 12. Overall performance of the network stack as a function of the number of flows, with TSO enabled and a 9000-byte MTU

B Parameter Configuration

Table 2 shows all the parameters we have used in our setup.

Table 2. Tuning parameters

C Overall Stack Performance

We find that the trends shown in Fig. 2 remain the same regardless of packet rate. In particular, we disable TSO, forcing the software stack to generate MTU-sized packets. This ensures that the packet rate remains relatively constant across experiments. Note that we perform experiments with up to 100k flows. We try two MTU values: 1500 bytes and 9000 bytes. As expected, the performance of the server saturates at a much lower number of flows when generating 1500-byte packets (Fig. 11), because the packet rate increases compared to the experiments discussed in Sect. 3. On the other hand, the performance of the server when using 9000-byte packets is similar to that discussed in Sect. 3 (Fig. 12). A sketch of how such an interface configuration can be applied is shown below.
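In practice TSO and the interface MTU are usually configured from user space, e.g. with `ethtool -K <dev> tso off` and `ip link set <dev> mtu <bytes>`. The C sketch below performs the equivalent ioctl calls; the device name eth0 and the MTU value are illustrative assumptions, not the authors' actual setup scripts.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

/* Hedged sketch: disable TSO and set the MTU on an interface.
 * Equivalent to `ethtool -K eth0 tso off` and `ip link set eth0 mtu 9000`.
 * Requires CAP_NET_ADMIN; "eth0" and 9000 are illustrative values. */
int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct ifreq ifr;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);

    /* Turn off TCP segmentation offload (legacy ethtool ioctl). */
    struct ethtool_value ev = { .cmd = ETHTOOL_STSO, .data = 0 };
    ifr.ifr_data = (char *)&ev;
    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
        perror("ETHTOOL_STSO");

    /* Set the interface MTU (e.g., 1500 or 9000 bytes). */
    ifr.ifr_mtu = 9000;
    if (ioctl(fd, SIOCSIFMTU, &ifr) < 0)
        perror("SIOCSIFMTU");

    close(fd);
    return 0;
}
```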

D FQ vs. PFIFO

We compare the fq and pfifo_fast qdiscs in terms of enqueueing latency (Fig. 13). The time to enqueue a packet into the pfifo_fast queue is almost constant, while the enqueue time for fq increases with the number of flows. This is because fq uses a tree structure to keep track of every flow, and the complexity of the insertion operation is \(O(\log (n))\). Cache misses when fetching flow information from the tree also contribute to the latency for large numbers of flows. A toy model contrasting the two enqueue paths is sketched below.
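The following toy model is our own sketch, not the qdisc implementations: a constant-time append stands in for pfifo_fast, and a per-flow tree lookup/insert stands in for fq. The kernel's fq keys a red-black tree by socket; the unbalanced BST and synthetic flow ids used here are illustrative assumptions that only demonstrate the logarithmic descent per enqueue.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy model of the gap measured in Fig. 13 (not the kernel's code).
 * pfifo_fast: one shared FIFO, so enqueue is a constant-time append.
 * fq: per-flow state is found in a tree keyed by the flow, so every
 *     enqueue pays an O(log n) descent plus the associated cache misses. */

struct flow_node {
    uint64_t flow_id;
    struct flow_node *left, *right;
};

/* MurmurHash3 finalizer, used only to scatter sequential flow ids. */
static uint64_t mix64(uint64_t x)
{
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}

/* fq-like enqueue: locate (or insert) the flow's node, returning the
 * number of tree levels touched as a proxy for enqueue cost. */
static unsigned fq_enqueue(struct flow_node **root, uint64_t flow_id)
{
    unsigned depth = 0;
    while (*root) {
        depth++;
        if (flow_id == (*root)->flow_id)
            return depth;                   /* existing flow found */
        root = flow_id < (*root)->flow_id ? &(*root)->left
                                          : &(*root)->right;
    }
    *root = calloc(1, sizeof(**root));      /* new flow: insert a node */
    if (!*root)
        exit(1);
    (*root)->flow_id = flow_id;
    return depth + 1;
}

/* pfifo_fast-like enqueue: no per-flow lookup at all. */
static unsigned pfifo_enqueue(void) { return 1; }

int main(void)
{
    struct flow_node *root = NULL;
    unsigned long total = 0;
    const unsigned flows = 100000;

    for (uint64_t n = 1; n <= flows; n++)
        total += fq_enqueue(&root, mix64(n));
    printf("fq:         avg levels touched per enqueue = %.1f\n",
           (double)total / flows);
    printf("pfifo_fast: levels touched per enqueue     = %u\n",
           pfifo_enqueue());
    return 0;
}
```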

Fig. 13. Enqueue time

Fig. 14. BBR vs. CUBIC

E Packet Rate with Zero Drops

We verified that BBR and CUBIC have similar CPU usage when the packet rate (PPS) is fixed (Fig. 14). We disable TSO and GSO to fix the packet size, and set the MTU to 7000 bytes to eliminate the CPU bottleneck. We also observe that with more than 200k flows, CUBIC consumes slightly more CPU than BBR because CUBIC reacts to packet drops by reducing the packet size, thus generating more packets.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Zhao, Y., Saeed, A., Ammar, M., Zegura, E. (2021). Scouting the Path to a Million-Client Server. In: Hohlfeld, O., Lutu, A., Levin, D. (eds) Passive and Active Measurement. PAM 2021. Lecture Notes in Computer Science, vol. 12671. Springer, Cham. https://doi.org/10.1007/978-3-030-72582-2_20

  • DOI: https://doi.org/10.1007/978-3-030-72582-2_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-72581-5

  • Online ISBN: 978-3-030-72582-2

  • eBook Packages: Computer Science, Computer Science (R0)
