Probabilistic Job History Conversion and Performance Model Generation for Malleable Scheduling Simulations

  • Conference paper
High Performance Computing (ISC High Performance 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13999)

Abstract

Malleability support in supercomputing requires several updates to system software stacks, as well as updates to applications, libraries, and the runtime systems of distributed-memory programming models. Consequently, relatively few applications have been extended or developed with malleability support, and no job histories from production systems include enough malleable job submissions for scheduling research. In this paper, we propose a solution: a probabilistic job history conversion that allows us to evaluate malleable scheduling heuristics via simulations based on existing job histories. Based on a configurable probability, job arrivals are converted into malleable versions and assigned a malleable performance model. The simulator uses this model to evaluate how a job changes at runtime as malleable operations are applied to it.
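The conversion step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the `Job` record, the `convert_history` function, and the use of an Amdahl-style scaling curve as the performance model are all assumptions made here for illustration, since the abstract does not specify the model or the job history format.

```python
import random
from dataclasses import dataclass, replace
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Job:
    """Hypothetical job record; field names are assumptions, not the paper's format."""
    job_id: int
    submit_time: float
    nodes: int
    runtime: float
    malleable: bool = False
    # Maps a node count to a predicted runtime; only set for malleable jobs.
    model: Optional[Callable[[int], float]] = None


def amdahl_model(base_nodes: int, base_runtime: float,
                 serial_fraction: float) -> Callable[[int], float]:
    """Assumed stand-in model: T(n) = T1 * (s + (1 - s) / n).

    T1 is calibrated so that T(base_nodes) equals the recorded runtime.
    """
    scale = serial_fraction + (1.0 - serial_fraction) / base_nodes
    t1 = base_runtime / scale

    def runtime_on(nodes: int) -> float:
        return t1 * (serial_fraction + (1.0 - serial_fraction) / nodes)

    return runtime_on


def convert_history(jobs: List[Job], probability: float, seed: int = 0,
                    serial_fraction: float = 0.05) -> List[Job]:
    """Convert each job arrival to a malleable version with the given probability."""
    rng = random.Random(seed)  # seeded so a conversion is reproducible
    converted = []
    for job in jobs:
        if rng.random() < probability:
            model = amdahl_model(job.nodes, job.runtime, serial_fraction)
            converted.append(replace(job, malleable=True, model=model))
        else:
            converted.append(job)  # job stays rigid, unchanged
    return converted


# Small made-up history, for illustration only.
history = [Job(job_id=i, submit_time=60.0 * i, nodes=4, runtime=1000.0)
           for i in range(10)]
malleable_history = convert_history(history, probability=0.5, seed=42)
```

A simulator could then call `job.model(new_nodes)` after a resize to re-estimate a job's runtime. Fixing the seed keeps a given conversion repeatable, which helps when comparing scheduling heuristics on the same converted history.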

This work has received funding under the European Commission’s EuroHPC and Horizon 2020 programmes under grant agreements no. 955606 (DEEP-SEA) and 956560 (REGALE).

Notes

  1. We do not take the impact of manufacturing process variations into account here.

Author information

Corresponding author

Correspondence to Isaías Comprés.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Comprés, I., Arima, E., Schulz, M., Rotaru, T., Machado, R. (2023). Probabilistic Job History Conversion and Performance Model Generation for Malleable Scheduling Simulations. In: Bienz, A., Weiland, M., Baboulin, M., Kruse, C. (eds) High Performance Computing. ISC High Performance 2023. Lecture Notes in Computer Science, vol 13999. Springer, Cham. https://doi.org/10.1007/978-3-031-40843-4_7

  • DOI: https://doi.org/10.1007/978-3-031-40843-4_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-40842-7

  • Online ISBN: 978-3-031-40843-4

  • eBook Packages: Computer Science (R0)
