Abstract
A primary concern when deploying a High-Performance Computing (HPC) system is its high energy consumption. Typical HPC systems consist of hundreds to thousands of compute nodes that consume huge amounts of electrical power even when idle. One way to increase energy efficiency is to apply the backfilling method to the First Come First Serve (FCFS) job scheduler (FCFS+Backfilling). Backfilling allows jobs that arrive later than the first job in the queue to be executed earlier, provided that the starting time of the first job is not delayed, thereby increasing the throughput and the energy efficiency of the system. Nodes that are idle for a specific amount of time can also be switched off to further improve energy efficiency. However, switching off nodes based only on their idle time can impair energy efficiency and throughput instead of improving them. For example, new jobs may arrive immediately after nodes are switched off, missing the chance to execute those jobs directly via backfilling. This paper proposes a Deep Reinforcement Learning (DRL)-based method to predict the most appropriate timing to switch nodes on and off. A DRL agent is trained with the Advantage Actor-Critic algorithm to decide which nodes must be switched on or off at each timestep. Our simulation results on the NASA iPSC/860 historical HPC job dataset show that the proposed method can reduce the total energy consumption compared to most conventional timeout policies, which switch off nodes after they have been idle for some period of time.
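The backfilling condition described above (a later job may jump the queue only if it cannot delay the reserved start of the queue head) can be sketched as a small predicate. This is an illustrative EASY-backfilling test, not the paper's implementation; the names `shadow_time` (reserved start time of the head job) and `extra_nodes` (free nodes the head job will not need) are conventional terms assumed here:

```python
from dataclasses import dataclass

@dataclass
class Job:
    id: int
    nodes: int       # number of nodes requested
    walltime: float  # user-estimated runtime

def can_backfill(job: Job, free_nodes: int, shadow_time: float,
                 extra_nodes: int, now: float) -> bool:
    """A queued job may start now if it fits in the currently free nodes
    AND either finishes before the head job's reserved start (shadow_time)
    or only uses nodes the head job will not need (extra_nodes)."""
    if job.nodes > free_nodes:
        return False
    if now + job.walltime <= shadow_time:
        return True
    return job.nodes <= extra_nodes
```

For instance, a short 2-node job fits ahead of a head job reserved to start at t=100, while an 8-node job that exceeds the free nodes, or a long job that would run past the reservation on nodes the head job needs, would be rejected.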
Acknowledgement
This work was partially supported by the grant of Penelitian Dosen Dana Masyarakat Alokasi Fakultas MIPA-UGM under Contract No. 91/J01.1.28/PL.06.02/2022, Grant-in-Aid for Challenging Research (Exploratory) #22K19764, and Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures jh220025.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Khasyah, F.R., Santiyuda, K.G., Kaunang, G., Makhrus, F., Amrizal, M.A., Takizawa, H. (2023). An Advantage Actor-Critic Deep Reinforcement Learning Method for Power Management in HPC Systems. In: Takizawa, H., Shen, H., Hanawa, T., Hyuk Park, J., Tian, H., Egawa, R. (eds) Parallel and Distributed Computing, Applications and Technologies. PDCAT 2022. Lecture Notes in Computer Science, vol 13798. Springer, Cham. https://doi.org/10.1007/978-3-031-29927-8_8
Print ISBN: 978-3-031-29926-1
Online ISBN: 978-3-031-29927-8