ScSF: A Scheduling Simulation Framework

  • Conference paper
Job Scheduling Strategies for Parallel Processing (JSSPP 2017)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10773)

Abstract

High-throughput and data-intensive applications, often composed as workflows, are increasingly present in the workloads of current HPC systems. At the same time, trends for future HPC systems point towards more heterogeneous systems with deeper I/O and memory hierarchies. However, current HPC schedulers are designed to support classical, large, tightly coupled parallel jobs over homogeneous systems. Therefore, there is an urgent need to investigate new scheduling algorithms that can manage future workloads on HPC systems. Yet there is a lack of appropriate models and frameworks to enable the development, testing, and validation of new scheduling ideas.

In this paper, we present an open-source scheduler simulation framework (ScSF) that covers all the steps of scheduling research through simulation. ScSF provides capabilities for workload modeling, workload generation, system simulation, comparative workload analysis, and experiment orchestration. The simulator is designed to be run over a distributed computing infrastructure, facilitating large-scale tests. We demonstrate ScSF through a case study that develops new techniques to manage scientific workflows in a batch scheduler. The evaluation consisted of 1728 experiments, equivalent to 33 years of simulated time, which were run over two months in a deployment of ScSF on a distributed infrastructure of 17 compute nodes. The experimental results were analyzed using the ScSF framework to demonstrate that our technique minimizes workflow turnaround time without over-allocating resources. Finally, we discuss lessons learned from our experiences to inform future large-scale simulation studies using ScSF and other similar frameworks.

Source code available to download at: http://frieda.lbl.gov/download.

G. P. Rodrigo—Work performed while working at the Lawrence Berkeley National Lab.

Acknowledgments

This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research (ASCR) and uses resources at the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility, supported by the Office of Science of the U.S. Department of Energy, both under Contract No. DE-AC02-05CH11231. Financial support has been provided in part by the Swedish Government’s strategic effort eSSENCE and the Swedish Research Council (VR) under contract number C0590801 (Cloud Control). Special thanks to Stephen Trofinoff and Massimo Benini from the Swiss National Supercomputing Centre, who shared with us the code base of their Slurm Simulator. We would also like to thank the members of the DST department at LBNL and the distributed systems group at Umeå University, who administered and provided the compute nodes supporting our case study.

Author information

Corresponding author

Correspondence to Gonzalo P. Rodrigo.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Rodrigo, G.P., Elmroth, E., Östberg, P.O., Ramakrishnan, L. (2018). ScSF: A Scheduling Simulation Framework. In: Klusáček, D., Cirne, W., Desai, N. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2017. Lecture Notes in Computer Science, vol 10773. Springer, Cham. https://doi.org/10.1007/978-3-319-77398-8_9

  • DOI: https://doi.org/10.1007/978-3-319-77398-8_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-77397-1

  • Online ISBN: 978-3-319-77398-8

  • eBook Packages: Computer Science, Computer Science (R0)
