Probabilistic Job History Conversion and Performance Model Generation for Malleable Scheduling Simulations

Comprés, Isaías; Arima, Eishi; Schulz, Martin; Rotaru, Tiberiu; Machado, Rui

doi:10.1007/978-3-031-40843-4_7

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13999)

Included in the following conference series:

International Conference on High Performance Computing

Abstract

Malleability support in supercomputing requires several updates to system software stacks. Applications, libraries, and the runtime systems of distributed-memory programming models must also be updated. Because of this, relatively few applications have been extended or developed with malleability support. As a consequence, no job histories from production systems include enough malleable job submissions for scheduling research. In this paper, we propose a solution: a probabilistic job history conversion. This conversion allows us to evaluate malleable scheduling heuristics via simulations based on existing job histories. Based on a configurable probability, job arrivals are converted into malleable versions and assigned a malleable performance model. The simulator uses this model to evaluate how a job changes at runtime as an effect of malleable operations being applied to it.
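The conversion idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Job` fields, the function names, the seed handling, and the range from which the serial fraction is drawn are all hypothetical, and Amdahl's law is used here only as a stand-in for whatever performance model the paper actually generates.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    """One job record from a history (hypothetical, simplified fields)."""
    job_id: int
    nodes: int          # nodes requested in the original, rigid submission
    runtime: float      # recorded runtime in seconds on `nodes` nodes
    malleable: bool = False
    serial_fraction: Optional[float] = None  # set only for malleable jobs


def convert_history(jobs: List[Job], probability: float, seed: int = 0) -> List[Job]:
    """Return a copy of the history where each job is made malleable
    with the given probability."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    rng = random.Random(seed)  # seeded so a conversion is reproducible
    converted = []
    for job in jobs:
        if rng.random() < probability:
            # Assumption: an Amdahl-style model with a randomly drawn
            # serial fraction; the range is arbitrary, for illustration.
            s = rng.uniform(0.01, 0.2)
            converted.append(Job(job.job_id, job.nodes, job.runtime, True, s))
        else:
            converted.append(Job(job.job_id, job.nodes, job.runtime))
    return converted


def predicted_runtime(job: Job, new_nodes: int) -> float:
    """Estimate the runtime of a job after resizing it to `new_nodes`."""
    if not job.malleable:
        return job.runtime  # rigid jobs keep their recorded runtime
    s = job.serial_fraction
    # Amdahl: T(n) = T1 * (s + (1 - s) / n); recover T1 from the recorded run.
    t1 = job.runtime / (s + (1.0 - s) / job.nodes)
    return t1 * (s + (1.0 - s) / new_nodes)
```

A simulator built along these lines would call `predicted_runtime` whenever a scheduling heuristic grows or shrinks a malleable job. A probability of 0 reproduces the original rigid history, and a probability of 1 makes every job malleable.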

This work has received funding under the European Commission’s EuroHPC and Horizon 2020 programmes under grant agreements no. 955606 (DEEP-SEA) and 956560 (REGALE).

Notes

1. We do not take the impact of manufacturing process variations into account here.

Author information

Authors and Affiliations

Technical University of Munich, Garching, Germany
Isaías Comprés, Eishi Arima & Martin Schulz
Fraunhofer ITWM, Kaiserslautern, Germany
Tiberiu Rotaru & Rui Machado

Corresponding author

Correspondence to Isaías Comprés.

Editor information

Editors and Affiliations

University of New Mexico, Albuquerque, NM, USA
Amanda Bienz
University of Edinburgh, Edinburgh, UK
Michèle Weiland
Université Paris-Saclay, Gif sur Yvette, France
Marc Baboulin
CERFACS, Toulouse, France
Carola Kruse

About this paper

Cite this paper

Comprés, I., Arima, E., Schulz, M., Rotaru, T., Machado, R. (2023). Probabilistic Job History Conversion and Performance Model Generation for Malleable Scheduling Simulations. In: Bienz, A., Weiland, M., Baboulin, M., Kruse, C. (eds) High Performance Computing. ISC High Performance 2023. Lecture Notes in Computer Science, vol 13999. Springer, Cham. https://doi.org/10.1007/978-3-031-40843-4_7

DOI: https://doi.org/10.1007/978-3-031-40843-4_7
Published: 25 August 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-40842-7
Online ISBN: 978-3-031-40843-4
eBook Packages: Computer Science, Computer Science (R0)
