Sustained Simulation Performance 2017, pp. 3–14
Theory and Practice of Efficient Supercomputer Management
Abstract
Modern supercomputer systems are used with very low efficiency because of their high complexity. Controlling the state of a supercomputer is getting harder, while the cost of low efficiency can be very significant. Solving this issue requires software for efficient supercomputer management. This paper describes a set of tools being developed at the Research Computing Center of Lomonosov Moscow State University (RCC MSU) that is intended to provide a holistic approach to efficiency analysis from different points of view. The described tools address the efficiency of particular user applications and of the whole supercomputer job flow, the efficiency of computational resource utilization, supercomputer reliability, and HPC facility management.
Notes
Acknowledgements
This material is based upon work supported in part by the Russian Foundation for Basic Research (grant No. 16-07-00972) and a Russian Presidential study grant (SP-1981.2016.5).