Hugo: A Cluster Scheduler that Efficiently Learns to Select Complementary Data-Parallel Jobs

Part of the Lecture Notes in Computer Science book series (LNTCS, volume 11997)

Abstract

Distributed data processing systems like MapReduce, Spark, and Flink are popular tools for analysis of large datasets with cluster resources. Yet, users often overprovision resources for their data processing jobs, while the resource usage of these jobs also typically fluctuates considerably. Therefore, multiple jobs usually get scheduled onto the same shared resources to increase the resource utilization and throughput of clusters. However, job runtimes and the utilization of shared resources can vary significantly depending on the specific combinations of co-located jobs.

This paper presents Hugo, a cluster scheduler that continuously learns how efficiently jobs share resources, considering metrics for the resource utilization and interference among co-located jobs. The scheduler combines offline grouping of jobs with online reinforcement learning to provide a scheduling mechanism that efficiently generalizes from specific monitored job combinations yet also adapts to changes in workloads. Our evaluation of a prototype shows that the approach can reduce the runtimes of exemplary Spark jobs on a YARN cluster by up to 12.5%, while resource utilization is increased and waiting times can be bounded.
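The abstract describes a scheduler that learns online which job combinations share resources well. As an illustration only, the sketch below shows one simple way such online learning could look: an epsilon-greedy bandit that scores pairs of job *groups* by an observed co-location reward and prefers complementary groups when dequeuing. All names, the reward definition, and the group abstraction are assumptions for this sketch, not the authors' actual algorithm.

```python
import random
from collections import defaultdict, namedtuple

# Hypothetical job descriptor: each job belongs to a group of jobs with
# similar resource usage (the paper's offline grouping step is assumed here).
Job = namedtuple("Job", "name group")

class CoLocationScheduler:
    """Toy epsilon-greedy learner over (running group, candidate group) pairs."""

    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        # Running average reward per group pair; unseen pairs default to 0.0.
        self.q = defaultdict(float)
        self.n = defaultdict(int)

    def select(self, running_group, queued):
        """Pick the queued job whose group best complements the running group."""
        if random.random() < self.epsilon:
            return random.choice(queued)  # explore an arbitrary co-location
        return max(queued, key=lambda job: self.q[(running_group, job.group)])

    def update(self, running_group, job, reward):
        """Fold an observed reward (e.g. normalized speedup versus running
        the job alone) into the running average for this group pair."""
        key = (running_group, job.group)
        self.n[key] += 1
        self.q[key] += (reward - self.q[key]) / self.n[key]
```

In this toy version, after a few observations that I/O-bound jobs co-locate well with a CPU-bound group, `select` would start preferring them; the exploration rate `epsilon` keeps the learner adaptive to workload changes, mirroring the paper's stated goal of generalizing from monitored combinations while adapting online.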

Keywords

  • Data-parallel processing
  • Cluster scheduling
  • Resource management
  • Distributed dataflows
  • Reinforcement learning



Acknowledgments

This work has been supported through grants by the German Ministry for Education and Research (BMBF; funding mark 01IS14013A and 01IS18025A).

Author information


Correspondence to Lauritz Thamsen.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Thamsen, L., et al. (2020). Hugo: A Cluster Scheduler that Efficiently Learns to Select Complementary Data-Parallel Jobs. In: Euro-Par 2019: Parallel Processing Workshops. Lecture Notes in Computer Science, vol 11997. Springer, Cham. https://doi.org/10.1007/978-3-030-48340-1_40

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-48339-5

  • Online ISBN: 978-3-030-48340-1

  • eBook Packages: Computer Science (R0)