Lobachevskii Journal of Mathematics, Volume 40, Issue 11, pp 1817–1830

Driving a Petascale HPC Center with Octoshell Management System

  • D. A. Nikitenko
  • Vad. V. Voevodin
  • S. A. Zhumatiy

Abstract

Running any computing center is a complex task, and as scale and cost grow, such tasks become real challenges. Top supercomputer sites, being large in every respect, have always required special approaches to management, control, and maintenance. Today, large HPC centers can host a variety of very different systems containing up to millions of components and serve thousands of users worldwide who run a full range of complicated applications. Large volumes of data therefore have to be managed in a coordinated way to keep such an information factory functioning. This paper shares the design principles, some implementation details, and the roadmap for Octoshell, an HPC center management system that has been developed for and is used in the everyday practice of the Moscow State University supercomputer center. This open-source system currently manages the Lomonosov and Lomonosov-2 supercomputers, with a combined peak performance of over 5 PFlops, and provides multiple tools for the most typical workflow tasks of both regular users and system administrators in a single shell.

Keywords and phrases

large-scale system administering; automation of administering routines; managing HPC systems; user support; fault-tolerant administering; HPC center workflow

Notes

Acknowledgments

This work is partially funded by the Russian Foundation for Basic Research, grant nos. 17-07-00719 and 18-29-03230. The results of Section 7 (related to the quality of HPC centers) were obtained in part with the financial support of a grant from the Russian Federation President’s Fund (MK-2330.2019.9).

Copyright information

© Pleiades Publishing, Ltd. 2019

Authors and Affiliations

  1. Research Computing Center, Moscow State University, Moscow, Russia
