
An Advantage Actor-Critic Deep Reinforcement Learning Method for Power Management in HPC Systems

  • Conference paper
Parallel and Distributed Computing, Applications and Technologies (PDCAT 2022)

Abstract

A primary concern when deploying a High-Performance Computing (HPC) system is its high energy consumption. Typical HPC systems consist of hundreds to thousands of compute nodes that consume a huge amount of electrical power even when idle. One way to increase energy efficiency is to apply the backfilling method to the First Come First Serve (FCFS) job scheduler (FCFS+Backfilling). Backfilling allows jobs that arrive later than the first job in the queue to be executed earlier, provided the starting time of the first job is not delayed, thereby increasing the throughput and the energy efficiency of the system. Nodes that are idle for a specific amount of time can also be switched off to further improve energy efficiency. However, switching off nodes based only on their idle time can impair energy efficiency and throughput instead of improving them; for example, new jobs may arrive immediately after nodes are switched off, missing the chance to execute them directly via backfilling. This paper proposes a Deep Reinforcement Learning (DRL)-based method to predict the most appropriate timing to switch nodes on and off. A DRL agent is trained with the Advantage Actor-Critic algorithm to decide which nodes should be switched on or off at each timestep. Our simulation results on the NASA iPSC/860 historical job dataset show that the proposed method reduces total energy consumption compared to most conventional timeout policies, which switch off nodes after they have been idle for a fixed period.
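The FCFS+Backfilling policy described in the abstract can be sketched as follows. This is a minimal illustration of EASY backfilling (the common aggressive variant), not the authors' implementation; the `Job` fields and the `easy_backfill_step` helper are hypothetical names chosen for the example.

```python
from dataclasses import dataclass

@dataclass
class Job:
    arrival: int   # time step the job entered the queue
    nodes: int     # number of compute nodes requested
    runtime: int   # (user-estimated) runtime in time steps

def easy_backfill_step(queue, running, free, now):
    """One FCFS+Backfilling pass at time `now` (EASY variant).

    queue:   jobs waiting, in FCFS order
    running: list of (end_time, nodes) for executing jobs
    free:    number of currently idle nodes
    Returns (started, remaining_queue, free).
    """
    started, queue = [], list(queue)
    # 1. Plain FCFS: launch queue heads while they fit.
    while queue and queue[0].nodes <= free:
        job = queue.pop(0)
        free -= job.nodes
        running.append((now + job.runtime, job.nodes))
        started.append(job)
    if not queue:
        return started, queue, free
    # 2. Reserve nodes for the blocked head job: find the earliest
    #    time ("shadow time") at which enough nodes become free.
    head, avail, shadow = queue[0], free, now
    for end, n in sorted(running):
        avail, shadow = avail + n, end
        if avail >= head.nodes:
            break
    extra = avail - head.nodes   # nodes the head leaves unused then
    # 3. Backfill: a later job may start now only if it cannot
    #    delay the head's reserved start time.
    for job in list(queue[1:]):
        harmless = (now + job.runtime <= shadow) or (job.nodes <= extra)
        if job.nodes <= free and harmless:
            queue.remove(job)
            free -= job.nodes
            running.append((now + job.runtime, job.nodes))
            started.append(job)
            if now + job.runtime > shadow:
                extra -= job.nodes
    return started, queue, free
```

On a 4-node system with one 3-node job running until t = 10, a blocked 2-node head job, and a 1-node, 5-step job behind it, this pass leaves the head waiting but backfills the small job onto the single idle node, since it finishes before the head's reserved start. Deciding when to power the remaining idle nodes off is the part this sketch leaves open, and it is exactly the decision the paper's A2C agent learns.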




Acknowledgement

This work was partially supported by the grant of Penelitian Dosen Dana Masyarakat Alokasi Fakultas MIPA-UGM under Contract No. 91/J01.1.28/PL.06.02/2022, Grant-in-Aid for Challenging Research (Exploratory) #22K19764, and Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures jh220025.

Corresponding author

Correspondence to Muhammad Alfian Amrizal.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Khasyah, F.R., Santiyuda, K.G., Kaunang, G., Makhrus, F., Amrizal, M.A., Takizawa, H. (2023). An Advantage Actor-Critic Deep Reinforcement Learning Method for Power Management in HPC Systems. In: Takizawa, H., Shen, H., Hanawa, T., Hyuk Park, J., Tian, H., Egawa, R. (eds) Parallel and Distributed Computing, Applications and Technologies. PDCAT 2022. Lecture Notes in Computer Science, vol 13798. Springer, Cham. https://doi.org/10.1007/978-3-031-29927-8_8


  • DOI: https://doi.org/10.1007/978-3-031-29927-8_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-29926-1

  • Online ISBN: 978-3-031-29927-8

  • eBook Packages: Computer Science, Computer Science (R0)
