On the Convergence of Malleability and the HPC PowerStack: Exploiting Dynamism in Over-Provisioned and Power-Constrained HPC Systems

Arima, Eishi; Comprés, A. Isaías; Schulz, Martin

doi:10.1007/978-3-031-23220-6_14

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13387)

Included in the following conference series:

International Conference on High Performance Computing

Abstract

Recent High-Performance Computing (HPC) systems face important challenges, such as massive power consumption combined with significantly under-utilized system resources. Given the power consumption trends, future systems will be deployed in an over-provisioned manner, where more resources are installed than the system can afford to power simultaneously. In such a scenario, maximizing resource utilization and energy efficiency while staying within a given power constraint is pivotal. Driven by this observation, in this position paper we first highlight recent trends in resource management techniques, with a particular focus on malleability support (i.e., dynamically scaling resource allocations/requirements for a job), co-scheduling (i.e., co-locating multiple jobs within a node), and power management. Second, we consider combining them, assess their relationships and synergies, and discuss the functionality required of each software component in future over-provisioned and power-constrained HPC systems. Third, we briefly introduce our ongoing efforts to integrate software tools, which will ultimately lead to the convergence of malleability and power management, as designed in the HPC PowerStack initiative.

Acknowledgements

We would like to express our sincere gratitude to the anonymous reviewers for their constructive suggestions. This work has received funding under the European Commission’s EuroHPC and H2020 programmes under grant agreement no. 955606 and no. 956560.

Author information

Authors and Affiliations

Technical University of Munich, Garching, Germany
Eishi Arima, A. Isaías Comprés & Martin Schulz
Leibniz Supercomputing Centre, Garching, Germany
Martin Schulz

Corresponding author

Correspondence to Eishi Arima.

Editor information

Editors and Affiliations

University of Tennessee, Knoxville, TN, USA
Hartwig Anzt
University of New Mexico, Albuquerque, NM, USA
Amanda Bienz
University of Tennessee, Knoxville, TN, USA
Piotr Luszczek
Université Paris-Saclay, Orsay, France
Marc Baboulin

About this paper

Cite this paper

Arima, E., Comprés, A.I., Schulz, M. (2022). On the Convergence of Malleability and the HPC PowerStack: Exploiting Dynamism in Over-Provisioned and Power-Constrained HPC Systems. In: Anzt, H., Bienz, A., Luszczek, P., Baboulin, M. (eds) High Performance Computing. ISC High Performance 2022 International Workshops. ISC High Performance 2022. Lecture Notes in Computer Science, vol 13387. Springer, Cham. https://doi.org/10.1007/978-3-031-23220-6_14

DOI: https://doi.org/10.1007/978-3-031-23220-6_14
Published: 04 January 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-23219-0
Online ISBN: 978-3-031-23220-6
eBook Packages: Computer Science, Computer Science (R0)
