Driving a Petascale HPC Center with Octoshell Management System
Running any computing center is a complex task, and as scale and cost grow, it becomes a challenge. The top supercomputer sites, large in every respect, have always required special approaches to management, control, and maintenance. Today, large HPC centers can host a variety of entirely different systems containing up to millions of components, and serve thousands of users worldwide running a full range of complicated applications. Large volumes of data must be managed in a coordinated way to keep such an information factory functioning. This paper shares the design principles, some implementation details, and the roadmap for Octoshell, an HPC center management system that has been developed for, and is currently used in the everyday practice of, the Moscow State University supercomputer center. This open-source system currently manages the Lomonosov and Lomonosov-2 systems, with a total peak performance of over 5 PFlops, and provides multiple tools for the most typical workflow tasks of both regular users and system administrators in a single shell.
Keywords and phrases: large-scale system administering, automation of administering routines, managing HPC systems, user support, fault-tolerant administering, HPC center workflow
The work is partially funded by the Russian Foundation for Basic Research, grants no. 17-07-00719 and 18-29-03230. The results of Section 7 (related to the quality of HPC centers) were obtained in part with the financial support of a grant from the Russian Federation President’s Fund (MK-2330.2019.9).
- 25.D. Nikitenko, A. Antonov, P. Shvets, S. Sobolev, K. Stefanov, Vl. Voevodin, Vad. Voevodin, and S. Zhumatiy, “JobDigest detailed system monitoring-based supercomputer application behavior analysis,” in Supercomputing, Proceedings of the 3rd Russian Supercomputing Days, RuSCDays 2017, Moscow, Russia, Sept. 25–26, 2017, pp. 516–529. Springer, Cham (2017).CrossRefGoogle Scholar
- 26.P. Shvets, A. Antonov, D. Nikitenko, S. Sobolev, K. Stefanov, Vad. Voevodin, Vl. Voevodin, and S. Zhumatiy, “An approach for ensuring reliable functioning of a supercomputer based on a formal model,” in Parallel Processing and Applied Mathematics, Proceedings 11th International Conference, PPAM 2015, Krakow, Poland, Sept. 6–9, 2015, Lect. Notes Comput. Sci. 9573, 12–22 (2016).MathSciNetGoogle Scholar