Open problems in queueing theory inspired by datacenter computing

Queueing Systems

Abstract

Datacenter operations today provide a plethora of new queueing and scheduling problems. The notion of a “job” has become more general and multi-dimensional. The ways in which jobs and servers can interact have grown in complexity, involving parallelism, speedup functions, precedence constraints, and task graphs. The workloads are vastly more variable and more heavy-tailed. Even the performance metrics of interest are broader than in the past, with multi-dimensional service-level objectives in terms of tail probabilities. The purpose of this article is to expose queueing theorists to new models, while providing suggestions for many specific open problems of interest, as well as some insights into their potential solution.

Notes

  1. In the multiserver job model, we assume FCFS scheduling, which is what is used in datacenters. This is not to be confused with the virtual machine (VM) packing problem, where the literature has focused on packing jobs into VMs based on the number of resources that they request, so as to achieve throughput optimality (see [77, 85, 109, 110, 118]). However, even in the VM packing problem, waste can occur.

  2. In the above example, we are thinking of the job as being run alone on the \(k\) servers. If two jobs are time-sharing the same \(k\) servers, then the service time of each will double.

  3. If \(k < 1\), it is common to assume that \(s(k) = k\), which is consistent with the intuition that if a job is allocated half a server, then it runs at half speed.

  4. Note that SRPT and FCFS are equivalent in the case where all jobs have the same size.

  5. The optimal allocation is derived both for the case where the goal is to minimize mean response time and the case where the goal is to minimize mean slowdown. The slowdown metric is discussed in Sect. 7.1.3.

  6. Gittins becomes SRPT when job sizes are known.

  7. A job’s “rank” is its priority, where lower rank is better, and where ties are broken in FCFS order. Rank is a function of age, but can also depend on a job’s size or class [129].
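
The equivalence stated in Note 4 can be checked numerically. The following is a minimal sketch, not taken from the article, under these assumptions: a single server, preemptive SRPT with ties broken in arrival order, and arrival times supplied in nondecreasing order. The helper names `fcfs_departures` and `srpt_departures` are hypothetical.

```python
import heapq


def fcfs_departures(arrivals, sizes):
    """Departure times in a single-server FCFS queue (arrivals sorted)."""
    t = 0
    out = []
    for a, s in zip(arrivals, sizes):
        t = max(t, a) + s  # start when both the server and the job are ready
        out.append(t)
    return out


def srpt_departures(arrivals, sizes):
    """Departure times under preemptive SRPT on a single server.

    Assumption: ties in remaining size are broken by arrival order,
    which the (remaining, index) heap key enforces.
    """
    n = len(arrivals)
    out = [None] * n
    heap = []  # entries are (remaining size, job index)
    t = 0
    i = 0  # index of the next job that has not yet arrived
    while i < n or heap:
        if not heap:
            t = max(t, arrivals[i])  # server idles until the next arrival
        while i < n and arrivals[i] <= t:
            heapq.heappush(heap, (sizes[i], i))
            i += 1
        rem, j = heapq.heappop(heap)  # job with least remaining size
        next_arrival = arrivals[i] if i < n else float("inf")
        if t + rem <= next_arrival:
            t += rem  # job j finishes before anyone else arrives
            out[j] = t
        else:
            # Serve job j until the next arrival, then reconsider priorities.
            heapq.heappush(heap, (rem - (next_arrival - t), j))
            t = next_arrival
    return out


# Equal sizes: both policies produce the same departure times.
equal_fcfs = fcfs_departures([0, 1, 2, 10, 11], [3] * 5)
equal_srpt = srpt_departures([0, 1, 2, 10, 11], [3] * 5)

# Unequal sizes: SRPT lets the short jobs overtake the long one.
mixed_fcfs = fcfs_departures([0, 1, 2], [5, 1, 1])  # [5, 6, 7]
mixed_srpt = srpt_departures([0, 1, 2], [5, 1, 1])  # [7, 2, 3]
```

With equal sizes, a job that has received any service has strictly less remaining work than every waiting job or new arrival, so it is never preempted, and the FCFS tie-breaking rule orders the waiting jobs. This is why the two schedules coincide in the first example and differ in the second.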

References

  1. Amazon EC2. http://aws.amazon.com/ec2/. Accessed 15 Nov 2020

  2. Azure Public Dataset (2019). https://github.com/Azure/AzurePublicDataset. Accessed 15 Nov 2020

  3. Google Compute Engine. http://cloud.google.com/products/compute-engine.html. Accessed 15 Nov 2020

  4. Windows Azure. http://www.windowsazure.com/. Accessed 15 Nov 2020

  5. Datacenter Spending (2020). https://www.cbronline.com/news/data-centre-spending. Accessed 15 Nov 2020

  6. Flexera: State of the Cloud Report (2020). https://www.flexera.com/blog/industry-trends/trend-of-cloud-computing-2020/. Accessed 15 Nov 2020

  7. Aalto, S., Ayesta, U., Righter, R.: On the Gittins index in the M/G/1 queue. Queueing Syst. 63(1), 437–458 (2009)

  8. Aalto, S., Ayesta, U., Righter, R.: Properties of the Gittins index with application to optimal scheduling. Probab. Eng. Inf. Sci. 25(3), 269–288 (2011)

  9. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: a system for large-scale machine learning. In: Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’16), pp. 265–283 (2016)

  10. Abate, J., Choudhury, G.L., Whitt, W.: Asymptotics for steady-state tail probabilities in structured Markov queueing models. Stoch. Mod. 10(1), 99–143 (1994)

  11. Abate, J., Choudhury, G.L., Whitt, W.: Waiting-time tail probabilities in queues with long-tail service-time distributions. Queueing Syst. 16, 311–338 (1994)

  12. Abate, J., Choudhury, G.L., Whitt, W.: An introduction to numerical transform inversion and its application to probability models. In: Grassmann, W.K. (ed.) Computational Probability, pp. 257–323. Springer, Boston (2000)

  13. Abate, J., Whitt, W.: A unified framework for numerically inverting Laplace transforms. INFORMS J. Comput. 18(4), 408–421 (2006)

  14. Acar, U., Blelloch, G.E., Blumofe, R.: The data locality of work stealing. Theory Comput. Syst. 35(3), 321–347 (2002)

  15. Afanaseva, L., Bashtova, E., Grishunina, S.: Stability analysis of a multi-server model with simultaneous service and a regenerative input flow. Methodol. Comput. Appl. Probab. 22, 1439–1455 (2020)

  16. Afanaseva, L., Grishunina, S.: Stability conditions for a multiserver queueing system with a regenerative input flow and simultaneous service of a customer by a random number of servers. Queueing Syst. 94, 213–241 (2020)

  17. Agrawal, K., Li, J., Lu, K., Moseley, B.: Scheduling parallel DAG jobs online to minimize average flow time. In: Proceedings of the 27th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA ’16), pp. 176–189 (2016)

  18. Agrawal, K., Li, J., Lu, K., Moseley, B.: Scheduling parallelizable jobs online to minimize the maximum flow time. In: Symposium on Parallel Algorithms and Architectures (SPAA’16), pp. 195–205 (2016)

  19. Agrawal, K., Li, J., Lu, K., Moseley, B.: Scheduling parallelizable jobs online to maximize throughput. In: LATIN 2018: Theoretical Informatics—13th Latin American Symposium, Buenos Aires, Argentina, pp. 755–776 (2018)

  20. Ahmad, N., Greenberg, A.G., Lahiri, P., Maltz, D., Patel, P.K., Sengupta, S., Vaid, K.V.: Distributed load balancer. Google Patents. U.S. Patent App. 12/189,438 (2008)

  21. Anton, E., Ayesta, U., Jonckheere, M., Verloop, I.M.: On the stability of redundancy models (2019). arXiv:1903.04414

  22. Anton, E., Ayesta, U., Jonckheere, M., Verloop, I.M.: Improving the performance of heterogeneous data centers through redundancy (2020). arXiv:2003.01394

  23. Arora, N.S., Blumofe, R.D., Plaxton, C.G.: Thread scheduling for multiprogrammed multiprocessors. In: 10th Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 119–129 (1998)

  24. Arthurs, E., Kaufman, J.: Sizing a message store subject to blocking criteria. In: IFIP Performance Conference, pp. 547–564 (1979)

  25. AWS. Netflix & AWS Lambda Case Study. https://aws.amazon.com/solutions/case-studies/netflix-and-aws-lambda/. Accessed 15 Nov 2020

  26. AWS. Step Functions. https://docs.aws.amazon.com/step-functions/latest/dg/welcome.html. Accessed 15 Nov 2020

  27. Baccelli, F., Foss, S.: Poisson hail on a hot ground. J. Appl. Probab. 48(A), 343–366 (2011)

  28. Baccelli, F., Makowski, A.M.: Simple computable bounds for the fork–join queue. Technical Report RR-0394, INRIA (1985)

  29. Baccelli, F., Makowski, A.M., Shwartz, A.: The fork–join queue and related systems with synchronization constraints: stochastic ordering and computable bounds. Adv. Appl. Probab. 21, 629–660 (1989)

  30. Barroso, L.A., Hölzle, U.: The case for energy-proportional computing. Computer 40(12), 33–37 (2007)

  31. Bean, N.G., Gibbens, R.J., Zachary, S.: Asymptotic analysis of single resource loss systems in heavy traffic, with applications to integrated networks. Adv. Appl. Probab. 27(1), 273–292 (1995)

  32. Bekker, R., Borst, S., Núñez-Queija, R.: Performance of TCP-friendly streaming sessions in the presence of heavy-tailed elastic flows. Perform. Eval. 61(2), 143–162 (2005)

  33. Benameur, N., Fredj, S. Ben, Delcoigne, F., Oueslati-Boulahia, S., Roberts, J.W.: Integrated admission control for streaming and elastic traffic. In: International Workshop on Quality of Future Internet Services, pp. 69–81 (2001)

  34. Berg, B., Dorsman, J.-P., Harchol-Balter, M.: Towards optimality in parallel job scheduling. Proc. ACM Meas. Anal. Comput. Syst. (POMACS/SIGMETRICS) 1(2), 1–30 (2017). Article 40

  35. Berg, B., Harchol-Balter, M., Moseley, B., Wang, W., Whitehouse, J.: Optimal resource allocation for elastic and inelastic jobs. In: Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA’20), pp. 75–87, Philadelphia, PA (2020)

  36. Berg, B., Vesilo, R., Harchol-Balter, M.: heSRPT: Parallel scheduling to minimize mean slowdown. In: 38th International Symposium on Computer Performance, Modeling, Measurement, and Evaluation (IFIP PERFORMANCE 2020), Milan, Italy (2020)

  37. Berger, D., Berg, B., Zhu, T., Sen, S., Harchol-Balter, M.: Robinhood: Tail latency aware caching—dynamic reallocation from cache-rich to cache-poor. In: 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2018), pp. 195–212, Carlsbad, CA (2018)

  38. Bienia, C., Kumar, S., Singh, J. P., Li, K.: The PARSEC benchmark suite: characterization and architectural implications. In: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT’08), pp. 72–81, New York, NY (2008)

  39. Blelloch, G., Gibbons, P., Matias, Y.: Provably efficient scheduling for languages with fine-grained parallelism. J. ACM 46(2), 281–321 (1999)

  40. Blelloch, G.E., Fineman, J.T., Gibbons, P.B., Simhadri, H.V.: Scheduling irregular parallel computations on hierarchical caches. In: Proceedings of the 23rd Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA’11), pp. 355–366, San Jose, California (2011)

  41. Blumofe, R.D., Joerg, C.F., Kuszmaul, B.C., Leiserson, C.E., Randall, K.H., Zhou, Y.: Cilk: an efficient multithreaded runtime system. J. Parallel Distrib. Comput. 37(1), 55–69 (1996)

  42. Blumofe, R.D., Leiserson, C.E.: Scheduling multithreaded computations by work stealing. In: IEEE Symposium on Foundations of Computer Science, pp. 356–368 (1994)

  43. Blumofe, R.D., Leiserson, C.E.: Scheduling multithreaded computations by work stealing. J. ACM 46(5), 720–748 (1999)

  44. Blumofe, R.D., Papadopoulos, D.: Hood: a user-level threads library for multiprogrammed multiprocessors. Technical Report, University of Texas at Austin (1999)

  45. Bonald, T., Proutière, A.: On performance bounds for the integration of elastic and adaptive streaming flows. In: Joint International ACM SIGMETRICS/Performance Conference on Measurement and Modeling of Computer Systems, pp. 235–245 (2004)

  46. Borst, S., Núñez-Queija, R., Zwart, B.: Sojourn time asymptotics in processor-sharing queues. Queueing Syst. 53(1–2), 31–51 (2006)

  47. Borst, S.C., Boxma, O.J., Núñez-Queija, R., Zwart, B.: The impact of the service discipline on delay asymptotics. Perform. Eval. 54(2), 175–206 (2003)

  48. Boxma, O.J., Deng, Q., Zwart, B.: Waiting-time asymptotics for the M/G/2 queue with heterogeneous servers. Queueing Syst. 40(1), 5–31 (2002)

  49. Boxma, O.J., Zwart, B.: Tails in scheduling. SIGMETRICS Perform. Eval. Rev. 34(4), 13–20 (2007)

  50. Brill, P.H., Green, L.: Queues in which customers receive simultaneous service from a random number of servers: a system point approach. Manag. Sci. 30(1), 51–68 (1984)

  51. Cera, M.C., Georgiou, Y., Richard, O., Maillard, N., Navaux, P.O.A.: Supporting malleability in parallel architectures with dynamic CPUSETs mapping and dynamic MPI. In: Kant, K., Pemmaraju, S.V., Sivalingam, K.M., Wu, J. (eds.) International Conference on Distributed Computing and Networking (ICDCN’20), pp. 242–257 (2010)

  52. Chowdhury, R.A., Ramachandran, V., Silvestri, F., Blakeley, B.: Oblivious algorithms for multicores and networks of processors. J. Parallel Distrib. Comput. 73(7), 911–925 (2018)

  53. Crovella, M., Harchol-Balter, M., Murta, C.: Task assignment in a distributed system: Improving performance by unbalancing load. In: Proceedings of the ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, pp. 268–269. Poster Session (1998)

  54. Dasylva, A., Srikant, R.: Bounds on the performance of admission control and routing policies for general topology networks with multiple call centers. In: Eighteenth Annual IEEE INFOCOM’99 International Conference on Computer Communications, pp. 505–512 (1999)

  55. Dean, J., Barroso, L.A.: The tail at scale. Commun. ACM 56(2), 74–80 (2013)

  56. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)

  57. Delimitrou, C., Kozyrakis, C.: Quasar: resource-efficient and QoS-aware cluster management. In: ASPLOS’14, pp. 127–144, Salt Lake City, Utah (2014)

  58. den Iseger, P.: Numerical transform inversion using Gaussian quadrature. Probab. Eng. Inf. Sci. 20, 1–44 (2006)

  59. den Iseger, P., Gruntjes, P., Mandjes, M.: A Wiener–Hopf based approach to numerical computations in fluctuation theory for Lévy processes. Math. Methods Oper. Res. 78(1), 101–118 (2013)

  60. Dubner, H., Abate, J.: Numerical inversion of Laplace transforms by relating them to the finite Fourier cosine transform. J. ACM 15(1), 115–123 (1968)

  61. Fan, Z., Sen, R., Koutris, P., Albarghouthi, A.: Automated tuning of query degree of parallelism via machine learning. In: Proceedings of the 3rd International Workshop on Exploiting Artificial Intelligence Techniques for Data Management (2020)

  62. Filippopoulos, D., Karatza, H.: An M/M/2 parallel system model with pure space sharing among rigid jobs. Math. Comput. Model. 45(5), 491–530 (2007)

  63. Foss, S., Konstantopoulos, T., Mountford, T.: Power law condition for stability of Poisson hail. J. Theor. Probab. 31, 684–704 (2018)

  64. Foss, S., Korshunov, D.: Heavy tails in multi-server queue. Queueing Syst. Theory Pract. 52, 31–48 (2006)

  65. Foss, S., Korshunov, D., Zachary, S.: An Introduction to Heavy-Tailed and Subexponential Distributions, 2nd edn. Springer, New York (2013)

  66. Fouladi, S., Wahby, R.S., Shacklett, B., Balasubramaniam, K.V., Zeng, W., Bhalerao, R., Sivaraman, A., Porter, G., Winstein, K.: Encoding, fast and slow: low-latency video processing using thousands of tiny threads. In: 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pp. 363–376, Boston, MA (2017)

  67. Frigo, M., Leiserson, C.E., Randall, K.H.: The implementation of the Cilk-5 multithreaded language. In: ACM PLDI, pp. 212–223 (1998)

  68. Gandhi, A., Doroudi, S., Harchol-Balter, M., Scheller-Wolf, A.: Exact analysis of the M/M/k/setup class of Markov chains via Recursive Renewal Reward. Queueing Syst. Theory Appl. 77(2), 177–209 (2014)

  69. Gandhi, A., Gupta, V., Harchol-Balter, M., Kozuch, M.: Optimality analysis of energy-performance trade-off for server farm management. Perform. Eval. 67(11), 1155–1171 (2010)

  70. Gandhi, A., Harchol-Balter, M., Adan, I.: Server farms with setup costs. Perform. Eval. 67(11), 1123–1138 (2010)

  71. Gandhi, A., Harchol-Balter, M., Raghunathan, R., Kozuch, M.: AutoScale: dynamic, robust capacity management for multi-tier data centers. ACM Trans. Comput. Syst. 30(4), 1–26 (2012)

  72. Gardner, K., Harchol-Balter, M., Scheller-Wolf, A., Van Houdt, B.: A better model for job redundancy: decoupling server slowdown and job size. ACM/IEEE Trans. Netw. 25(6), 3353–3367 (2017)

  73. Gardner, K., Harchol-Balter, M., Scheller-Wolf, A., Velednitsky, M., Zbarsky, S.: Redundancy-d: the power of d choices for redundancy. Oper. Res. 65(4), 1078–1094 (2017)

  74. Gardner, K., Zbarsky, S., Doroudi, S., Harchol-Balter, M., Hyytiä, E., Scheller-Wolf, A.: Queueing with redundant requests: exact analysis. Queueing Syst. Theory Appl. 83(3), 227–259 (2016)

  75. Gardner, K., Zbarsky, S., Doroudi, S., Harchol-Balter, M., Hyytiä, E., Scheller-Wolf, A.: Reducing latency via redundant requests: exact analysis. In: ACM Sigmetrics 2015 Conference on Measurement and Modeling of Computer Systems, pp. 347–360 (2015)

  76. Gavish, B., Schweitzer, P.J.: The Markovian queue with bounded waiting time. Manag. Sci. 23(12), 1349–1357 (1977)

  77. Ghaderi, J.: Randomized algorithms for scheduling VMs in the cloud. In: 35th Annual IEEE International Conference on Computer Communications, INFOCOM 2016, San Francisco, CA, USA, April 10–14, 2016, pp. 1–9 (2016)

  78. Gittins, J.C., Glazebrook, K.D., Weber, R.: Multi-armed Bandit Allocation Indices. Wiley, New York (2011)

  79. Glynn, P.W., Whitt, W.: Logarithmic asymptotics for steady-state tail probabilities in a single-server queue. J. Appl. Probab. 31(A), 131–156 (1994)

  80. Goldstein, S.C., Schauser, K.E., Culler, D.E.: Lazy threads: implementing a fast parallel call. J. Parallel Distrib. Comput. 37(1), 5–20 (1996)

  81. Graham, R.L., Lawler, E.L., Lenstra, J.K., Rinnooy Kan, A.H.G.: Optimization and approximation in deterministic sequencing and scheduling: a survey. Ann. Discrete Math. 5, 287–326 (1979)

  82. Grosof, I., Harchol-Balter, M., Scheller-Wolf, A.: Stability for two-class multiserver-job systems (2020). arXiv:2010.00631

  83. Grosof, I., Scully, Z., Harchol-Balter, M.: SRPT for multiserver systems. Perform. Eval. 127–128, 154–175 (2018)

  84. Grosof, I., Scully, Z., Harchol-Balter, M.: Load balancing guardrails: keeping your heavy traffic on the road to low response times. Proc. ACM Meas. Anal. Comput. Syst. (POMACS/SIGMETRICS) 3(2), 1–31 (2019). Article 42

  85. Guo, M., Guan, Q., Ke, W.: Optimal scheduling of VMs in queueing cloud computing systems with a heterogeneous workload. IEEE Access 6, 15178–15191 (2018)

  86. Gupta, A., Acun, B., Sarood, O., Kale, L.: Towards realizing the potential of malleable jobs. In: IEEE International Conference on High Performance Computing (HiPC’14) (2014)

  87. Harchol-Balter, M.: Network analysis without exponentiality assumptions. Ph.D. thesis, University of California at Berkeley (1996)

  88. Harchol-Balter, M.: The effect of heavy-tailed job size distributions on computer system design. In: Proceedings of ASA-IMS Conference on Applications of Heavy Tailed Distributions in Economics, Engineering and Statistics, Washington, DC (1999)

  89. Harchol-Balter, M.: Task assignment with unknown duration. J. ACM 49(2), 260–288 (2002)

  90. Harchol-Balter, M.: Performance Modeling and Design of Computer Systems: Queueing Theory in Action. Cambridge University Press, Cambridge (2013)

  91. Harchol-Balter, M., Crovella, M., Murta, C.: On choosing a task assignment policy for a distributed server system. In: Lecture Notes in Computer Science, No. 1469: 10th International Conference on Modeling Techniques and Tools for Computer Performance Evaluation, pp. 231–242 (1998)

  92. Harchol-Balter, M., Downey, A.: Exploiting process lifetime distributions for dynamic load balancing. In: Proceedings of ACM SIGMETRICS, pp. 13–24, Philadelphia, PA (1996)

  93. Harchol-Balter, M., Downey, A.: Exploiting process lifetime distributions for dynamic load balancing. ACM Trans. Comput. Syst. 15(3), 253–285 (1997)

  94. Harchol-Balter, M., Schroeder, B., Bansal, N., Agrawal, M.: Size-based scheduling to improve web performance. ACM Trans. Comput. Syst. 21(2), 207–233 (2003)

  95. Hill, M.D., Marty, M.R.: Amdahl’s law in the multicore era. Computer 41, 33–38 (2008)

  96. Horvath, G., Horvath, I., Almousa, S.A.-D., Telek, M.: Numerical inverse Laplace transformation using concentrated matrix exponential distributions. Perform. Eval. 137, 1–22 (2019)

  97. Hunt, P.J., Kurtz, T.G.: Large loss networks. Stoch. Process. Appl. 53(2), 363–378 (1994)

  98. Hyytiä, E., Aalto, S., Penttinen, A.: Minimizing slowdown in heterogeneous size-aware dispatching systems. In: Proceedings of the 2012 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (2012)

  99. Jonas, E., Pu, Q., Venkataraman, S., Stoica, I., Recht, B.: Occupy the cloud: distributed computing for the 99%. In: Proceedings of the 2017 Symposium on Cloud Computing, pp. 445–451, New York, NY (2017)

  100. Jonas, E., Schleier-Smith, J., Sreekanti, V., Tsai, C., Khandelwal, A., Pu, Q., Shankar, V., Carreira, J., Krauth, K., Yadwadkar, N.J., Gonzalez, J.E., Popa, R.A., Stoica, I., Patterson, D.A.: Cloud programming simplified: a Berkeley view on serverless computing (2019). CoRR, arXiv:1902.03383

  101. Joshi, G., Soljanin, E., Wornell, G.: Efficient replication of queued tasks for latency reduction in cloud systems. In: Allerton Conference on Communication, Control, and Computing, University of Illinois, Urbana-Champaign (2015)

  102. Kim, S.S.: M/M/s queueing system where customers demand multiple server use. Ph.D. thesis, Southern Methodist University (1979)

  103. Lee, K., Shah, N.B., Huang, L., Ramchandran, K.: The MDS queue: analysing the latency performance of erasure codes. IEEE Trans. Inf. Theory 63(5), 2822–2842 (2017)

  104. Leonardi, S., Raz, D.: Approximating total flow time on parallel machines. In: Proceedings of the Annual ACM Symposium on Theory of Computing (STOC), pp. 110–119 (1997)

  105. Li, H., Groep, D., Wolters, L.: Workload characteristics of a multicluster supercomputer. In: 10th International Conference on Job Scheduling Strategies for Parallel Processing (IPPS’04), pp. 176–193. Springer (2004)

  106. Lin, S.-H., Paolieri, M., Chou, C.F., Golubchik, L.: A model-based approach to streamlining distributed training for asynchronous SGD. In: MASCOTS 2018, pp. 306–318 (2018)

  107. Lu, Y., Xie, Q., Kliot, G., Geller, A., Larus, J.R., Greenberg, A.: Join-idle-queue: a novel load balancing algorithm for dynamically scalable web services. Perform. Eval. 68(11), 1056–1071 (2011)

  108. Madni, S.H.H., Latiff, M.S.A., Abdullahi, M., Abdulhamid, S.M., Usman, M.J.: Performance comparison of heuristic algorithms for task scheduling in IaaS cloud computing environment. PLoS ONE 12(5), 1–26 (2017)

  109. Maguluri, S.T., Srikant, R.: Scheduling jobs with unknown duration in clouds. IEEE/ACM Trans. Netw. 22(6), 1938–1951 (2014)

  110. Maguluri, S.T., Srikant, R., Ying, L.: Stochastic models of load balancing and scheduling in cloud computing clusters. In: Proceedings of IEEE INFOCOM, pp. 702–710 (2012)

  111. Massoulie, L., Roberts, J.W.: Bandwidth sharing and admission control for elastic traffic. Telecommun. Syst. 15, 185–201 (2000)

  112. Melikov, A.: Computation and optimization methods for multiresource queues. Cybern. Syst. Anal. 32(6), 821–836 (1996)

  113. Mok, A.: Fundamental design problems of distributed systems for the hard real-time environment. Ph.D. thesis, MIT, Department of EE and CS (1983)

  114. Morozov, E., Rumyantsev, A.S.: Stability analysis of a MAP/M/s cluster model by matrix-analytic method. In: Fiems, D., Paolieri, M., Platis, A.N. (eds.) Computer Performance Engineering—13th European Workshop, EPEW 2016, Chios, Greece, October 5–7, 2016, Proceedings, volume 9951 of Lecture Notes in Computer Science, pp. 63–76. Springer (2016)

  115. Narlikar, G.J.: Scheduling threads for low space requirement and good locality. Theory Comput. Syst. 35(2), 151–187 (2002)

  116. Nelson, R.D., Tantawi, A.N.: Approximate analysis of fork/join synchronization in parallel queues. IEEE Trans. Comput. 37(6), 739–743 (1988)

  117. Ponomarenko, L., Kim, C.S., Melikov, A.: Performance Analysis and Optimization of Multi-traffic on Communication Networks. Springer, Berlin (2010)

  118. Psychas, K., Ghaderi, J.: On non-preemptive VM scheduling in the cloud. Proc. ACM Meas. Anal. Comput. Syst. 1(2), 1–29 (2017). Article 35

  119. Raaijmakers, Y., Borst, S., Boxma, O.: Delta probing policies for redundancy. Perform. Eval. 127–128, 21–35 (2018)

  120. Raaijmakers, Y., Borst, S., Boxma, O.: Redundancy scheduling with scaled Bernoulli service requirements. Queueing Syst. 93(1–2), 67–82 (2019)

  121. Raaijmakers, Y., Borst, S., Boxma, O.: Stability of redundancy systems with processor sharing. In: Proceedings of the 13th International Conference on Performance Evaluation Methodologies and Tools (VALUETOOLS’20), pp. 120–127 (2020)

  122. Rizk, A., Poloczek, F., Ciucu, F.: Stochastic bounds in fork–join queueing systems under full and partial mapping. Queueing Syst. 83(3), 261–291 (2016)

  123. Rumyantsev, A., Morozov, E.: Stability criterion of a multiserver model with simultaneous service. Ann. Oper. Res. 252(1), 29–39 (2017)

  124. Schrage, L.E.: A proof of the optimality of the shortest remaining processing time discipline. Oper. Res. 16, 678–690 (1968)

  125. Schrage, L.E., Miller, L.W.: The queue M/G/1 with the shortest remaining processing time discipline. Oper. Res. 14, 670–684 (1966)

  126. Schroeder, B., Harchol-Balter, M.: Evaluation of task assignment policies for supercomputing servers: the case for load unbalancing and fairness. Clust. Comput. J. Netw. Softw. Tools Appl. 7(2), 151–161 (2004)

  127. Scully, Z., Grosof, I., Harchol-Balter, M.: The Gittins policy is nearly optimal in the M/G/k under extremely general conditions. Proc. ACM Meas. Anal. Comput. Syst. (POMACS/SIGMETRICS) 3(4), 1–29 (2020). Article 43

  128. Scully, Z., Grosof, I., Harchol-Balter, M.: Optimal multiserver scheduling with unknown job sizes in heavy traffic. In: 38th International Symposium on Computer Performance, Modeling, Measurement, and Evaluation (IFIP PERFORMANCE 2020), Milan, Italy (2020)

  129. Scully, Z., Harchol-Balter, M., Scheller-Wolf, A.: SOAP: one clean analysis of all age-based scheduling policies. Proc. ACM Meas. Anal. Comput. Syst. (POMACS/SIGMETRICS) 2(1), 1–30 (2018). Article 16

  130. Scully, Z., Harchol-Balter, M., Scheller-Wolf, A.: Simple near-optimal scheduling for the M/G/1. Proc. ACM Meas. Anal. Comput. Syst. (POMACS/SIGMETRICS) 4(1), 1–29 (2020). Article 11

  131. Shankar, V., Krauth, K., Pu, Q., Jonas, E., Venkataraman, S., Stoica, I., Recht, B., Ragan-Kelley, J.: Numpywren: serverless linear algebra (2018). CoRR, arXiv:1810.09679

  132. Shneer, S., Stolyar, A.: Large-scale parallel server system with multi-component jobs (2020). arXiv:2006.11256

  133. Sigman, K.: Appendix: a primer on heavy-tailed distributions. Queueing Syst. 33(1/3), 261–275 (1999)

  134. Simhadri, H.V., Blelloch, G.E., Fineman, J.T., Gibbons, P.B., Kyrola, A.: Experimental analysis of space-bounded schedulers. In: Proceedings of the 26th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA’14), pp. 30–41, Prague, Czech Republic (2014)

  135. Smith, W.L.: On the distribution of queueing times. Math. Proc. Camb. Philos. Soc. 49(3), 449–461 (1953)

  136. Snyder, B.: Server virtualization has stalled, despite the hype (2010). InfoWorld. https://www.infoworld.com/article/2624771/server-virtualization-has-stalled--despite-the-hype.html. Accessed 15 Nov 2020

  137. Sreekanti, V., Chenggang, W., Lin, X.C., Schleier-Smith, J., Gonzalez, J., Hellerstein, J.M., Tumanov, A.: Cloudburst: stateful functions-as-a-service. Proc. VLDB Endow. 13(11), 2438–2452 (2020)

  138. Sun, Y., Zheng, Z., Koksal, C.E., Kim, K.-H., Shroff, N.B.: Provably delay efficient data retrieving in storage clouds. In: Proceedings of IEEE INFOCOM (2015)

  139. Talbot, A.: The accurate numerical inversion of Laplace transforms. IMA J. Appl. Math. 23(1), 97–120 (1979)

  140. Tang, C., Yu, K., Veeraraghavan, K., Kaldor, J., Michelson, S., Kooburat, T., Anbudurai, A., Clark, M., Gogia, K., Cheng, L., Christensen, B., Gartrell, A., Khutornenko, M., Kulkarni, S., Pawlowski, M., Pelkonen, T., Rodrigues, A., Tibrewal, R., Venkatesan, V., Zhang, P.: Twine: a unified cluster management system for shared infrastructure. In: 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI’20) (2020)

  141. Thomasian, A.: Analysis of fork/join and related queueing systems. ACM Comput. Surv. 47(2), 1–71 (2014)

  142. Tian, H., Zheng, Y., Wang, W.: Characterizing and synthesizing task dependencies of data-parallel jobs in Alibaba cloud. In: 10th ACM Symposium on Cloud Computing (SoCC’19), Santa Cruz, CA (2019)

  143. Tikhonenko, O.M.: Generalized Erlang problem for service systems with finite total capacity. Probl. Inf. Transm. 41(3), 243–253 (2005)

  144. Tirmazi, M., Barker, A., Deng, N., Haque, M.E., Qin, Z.G., Hand, S., Harchol-Balter, M., Wilkes, J.: Borg: the next generation. In: Proceedings of the 15th European Conference on Computer Systems (EuroSys’20), pp. 1–14, Greece (2020)

  145. Trueman, C.: Why data centres are the new frontier in the fight against climate change. Computerworld (2019)

  146. Van Dijk, N.M.: Blocking of finite source inputs which require simultaneous servers with general think and holding times. Oper. Res. Lett. 8(1), 45–52 (1989)

  147. Vandevoorde, M.T., Roberts, E.S.: WorkCrews: an abstraction for controlling parallelism. Int. J. Parallel Program. 17(4), 347–366 (1988)

  148. Verma, A., Pedrosa, L., Korupolu, M., Oppenheimer, D., Tune, E., Wilkes, J.: Large-scale cluster management at Google with Borg. In: Proceedings of the 10th European Conference on Computer Systems, p. 18 (2015)

  149. Wang, D., Joshi, G., Wornell, G.W.: Efficient straggler replication in large-scale parallel computing. Proc. ACM Meas. Model. Comput. Syst. (ACM SIGMETRICS 2019) 4(2), 1–23 (2019). Article 7

  150. Wang, W., Harchol-Balter, M., Jiang, H., Scheller-Wolf, A., Srikant, R.: Delay asymptotics and bounds for multi-task parallel jobs. Queueing Syst. Theory Appl. 91(3), 207–239 (2019)

  151. Wang, W., Xie, Q., Harchol-Balter, M.: Zero queueing for multi-server jobs (2020). arXiv:2011.10521

  152. Wardley, S.: Why the fuss about serverless? (2016). https://blog.gardeviance.org/2016/11/why-fuss-about-serverless.html. Accessed 15 Nov 2020

  153. Welch, P.D.: On a generalized M/G/1 queueing process in which the first customer of each busy period receives exceptional service. Oper. Res. 12, 736–752 (1964)

  154. Weng, W., Wang, W.: Dispatching parallel jobs to achieve zero queueing delay (2020). arXiv:2004.02081

  155. Whitt, W.: Understanding the efficiency of multi-server service systems. Manag. Sci. 38(5), 708–723 (1992)

  156. Whitt, W.: Blocking when service is required from several facilities simultaneously. AT&T Bell Lab. Tech. J. 64, 1807–1856 (1985)

  157. Wilkes, J.: More Google cluster data. Google research blog (2011). http://googleresearch.blogspot.com/2011/11/more-google-cluster-data.html. Accessed 15 Nov 2020

  158. Wilkes, J.: Google cluster-usage traces v3 (2019). http://github.com/google/cluster-data. Accessed 15 Nov 2020

  159. Xu, Y., Musgrave, Z., Noble, B., Bailey, M.: Bobtail: avoiding long tails in the cloud. In: Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation (NSDI’13), pp. 329–342, USA (2013)

  160. Zhan, X., Bao, Y., Bienia, C., Li, K.: PARSEC3.0: a multicore benchmark suite with network stacks and SPLASH-2X. ACM SIGARCH Comput. Arch. News 44, 1–16 (2017)

  161. Zhang, W., Fang, V., Panda, A., Shenker, S.: Kappa: A programming framework for serverless computing. In: ACM Symposium on Cloud Computing (SoCC’20), pp. 328–343 (2020)

  162. Zhu, T., Berger, D., Harchol-Balter, M.: SNC-Meister: admitting more tenants with tail latency SLOs. In: ACM Symposium on Cloud Computing (SoCC’16), pp. 374–387, Santa Clara, CA (2016)

  163. Zhu, T., Tumanov, A., Kozuch, M.A., Harchol-Balter, M., Ganger, G.R.: PriorityMeister: tail latency QoS for shared networked storage. In: ACM Symposium on Cloud Computing 2014 (SoCC’14), pp. 1–14, Seattle, WA (2014)

Acknowledgements

We would like to thank Sem Borst, Onno Boxma, and Isaac Grosof for their helpful suggestions and careful proof-reading.

Funding

Funding was provided by National Science Foundation (Grant numbers CMMI-1938909, CSR-1763701, XPS-1629444) and Google (Grant number 2020 Faculty Research Award).

Author information

Corresponding author

Correspondence to Mor Harchol-Balter.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was supported by: NSF-CMMI-1938909, NSF-CSR-1763701, NSF-XPS-1629444, and a Google 2020 Faculty Research Award.

Cite this article

Harchol-Balter, M. Open problems in queueing theory inspired by datacenter computing. Queueing Syst 97, 3–37 (2021). https://doi.org/10.1007/s11134-020-09684-6

  • DOI: https://doi.org/10.1007/s11134-020-09684-6
