Theory and Practice of Efficient Supercomputer Management

Conference paper

Abstract

Modern supercomputer systems are used at very low efficiency because of their high complexity. Controlling the state of a supercomputer is getting harder, and the cost of low efficiency can be very significant. Solving this problem requires software for efficient supercomputer management. This paper describes a set of tools being developed at the Research Computing Center of Lomonosov Moscow State University (RCC MSU) that is intended to provide a holistic approach to efficiency analysis from different points of view. The tools address the efficiency of particular user applications and of the whole supercomputer job flow, the efficiency of computational resource utilization, supercomputer reliability, and HPC facility management.

Acknowledgements

This material is based upon work supported in part by the Russian Foundation for Basic Research (grant No. 16-07-00972) and a Russian Presidential study grant (SP-1981.2016.5).

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Research Computing Center of Lomonosov Moscow State University, Moscow, Russia
