ScSF: A Scheduling Simulation Framework

Rodrigo, Gonzalo P.; Elmroth, Erik; Östberg, Per-Olov; Ramakrishnan, Lavanya

doi:10.1007/978-3-319-77398-8_9

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10773)

Included in the following conference series:

Workshop on Job Scheduling Strategies for Parallel Processing

Abstract

High-throughput and data-intensive applications, often composed as workflows, are increasingly present in the workloads of current HPC systems. At the same time, trends for future HPC systems point towards more heterogeneous systems with deeper I/O and memory hierarchies. However, current HPC schedulers are designed to support classical, large, tightly coupled parallel jobs over homogeneous systems. Therefore, there is an urgent need to investigate new scheduling algorithms that can manage the future workloads on HPC systems. Yet there is a lack of appropriate models and frameworks to enable development, testing, and validation of new scheduling ideas.

In this paper, we present an open-source scheduler simulation framework (ScSF) that covers all the steps of scheduling research through simulation. ScSF provides capabilities for workload modeling, workload generation, system simulation, comparative workload analysis, and experiment orchestration. The simulator is designed to be run over a distributed computing infrastructure, facilitating large-scale tests. We demonstrate ScSF through a case study to develop new techniques to manage scientific workflows in a batch scheduler. The evaluation consisted of 1728 experiments, equivalent to 33 years of simulated time, which were run over two months in a deployment of ScSF on a distributed infrastructure of 17 compute nodes. The experimental results were analyzed using the ScSF framework to demonstrate that our technique minimizes workflow turnaround time without over-allocating resources. Finally, we discuss lessons learned from our experiences to inform future large-scale simulation studies using ScSF and other similar frameworks.

Source code available to download at: http://frieda.lbl.gov/download.

G. P. Rodrigo: work performed while at the Lawrence Berkeley National Lab.

Acknowledgments

This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research (ASCR) and uses resources at the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility, supported by the Office of Science of the U.S. Department of Energy, both under Contract No. DE-AC02-05CH11231. Financial support has been provided in part by the Swedish Government’s strategic effort eSSENCE and the Swedish Research Council (VR) under contract number C0590801 (Cloud Control). Special thanks to Stephen Trofinoff and Massimo Benini from the Swiss National Supercomputing Centre, who shared with us the code base of their Slurm Simulator. Also, we would like to thank the members of the DST department at LBNL and the distributed systems group at Umeå University who administrated and provided the compute nodes supporting our case study.

Author information

Authors and Affiliations

Department of Computing Science, Umeå University, 901 87, Umeå, Sweden
Gonzalo P. Rodrigo, Erik Elmroth & Per-Olov Östberg
Lawrence Berkeley National Lab, Berkeley, CA, 94720, USA
Lavanya Ramakrishnan

Corresponding author

Correspondence to Gonzalo P. Rodrigo.

Editor information

Editors and Affiliations

CESNET, Prague, Czech Republic
Dalibor Klusáček
Google, Mountain View, California, USA
Walfredo Cirne
Google, Seattle, Washington, USA
Narayan Desai

About this paper

Cite this paper

Rodrigo, G.P., Elmroth, E., Östberg, PO., Ramakrishnan, L. (2018). ScSF: A Scheduling Simulation Framework. In: Klusáček, D., Cirne, W., Desai, N. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2017. Lecture Notes in Computer Science, vol 10773. Springer, Cham. https://doi.org/10.1007/978-3-319-77398-8_9

DOI: https://doi.org/10.1007/978-3-319-77398-8_9
Published: 28 February 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77397-1
Online ISBN: 978-3-319-77398-8
eBook Packages: Computer Science, Computer Science (R0)
