On-the-Fly Calculation of Performance Metrics with Adaptive Time Resolution for HPC Compute Jobs
Performance monitoring is a method for debugging performance issues in various types of applications. It relies on performance metrics collected from the servers an application runs on, and may also use metrics produced by the application itself. The common approach to building performance monitoring systems is to store all the data in a database, then retrieve the portion corresponding to a specific job and analyze it. This approach works well when the data stream is modest; for large performance monitoring data streams, it incurs heavy I/O and imposes high requirements on the storage systems that process the data.
In this paper we propose an adaptive on-the-fly approach to performance monitoring of High Performance Computing (HPC) compute jobs which significantly reduces the volume of data written to storage. We used this approach to implement a performance monitoring system for an HPC cluster that monitors compute jobs. The output of our performance monitoring system is a time-series graph of aggregated performance metrics for a job. The time resolution of the resulting graph is adaptive and depends on the duration of the analyzed job.
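To make the idea of adaptive time resolution concrete, the following sketch shows one way an on-the-fly aggregator might bound its output size: metrics are folded into time bins as they arrive, and when the number of bins reaches a limit, adjacent bins are merged and the bin width doubles, so longer jobs are summarized at a coarser resolution. The class and method names and the pairwise-merge policy are assumptions for illustration, not the paper's actual algorithm.

```python
class AdaptiveSeries:
    """Streaming metric aggregator with adaptive time resolution.

    Keeps at most `max_points` time bins; when the limit is reached,
    adjacent bins are merged pairwise and the bin width doubles.
    Illustrative sketch only. Samples must arrive in non-decreasing
    time order; `max_points` must be even so pairwise merging works.
    """

    def __init__(self, max_points=100, step=1.0):
        self.max_points = max_points
        self.step = step            # current bin width, in seconds
        self.bins = []              # closed bins as (sum, count)
        self._cur_sum = 0.0         # accumulator for the open bin
        self._cur_cnt = 0
        self._cur_t0 = 0.0          # start time of the open bin

    def add(self, t, value):
        """Fold one sample (timestamp, value) into the series."""
        while t >= self._cur_t0 + self.step:
            self._close_bin()       # also closes empty bins over gaps
        self._cur_sum += value
        self._cur_cnt += 1

    def _close_bin(self):
        self.bins.append((self._cur_sum, self._cur_cnt))
        self._cur_sum, self._cur_cnt = 0.0, 0
        self._cur_t0 += self.step
        if len(self.bins) == self.max_points:
            # Merge neighbours: half as many points, twice the width.
            self.bins = [(a[0] + b[0], a[1] + b[1])
                         for a, b in zip(self.bins[0::2], self.bins[1::2])]
            self.step *= 2

    def averages(self):
        """Per-bin mean values, including the still-open bin."""
        all_bins = self.bins + ([(self._cur_sum, self._cur_cnt)]
                                if self._cur_cnt else [])
        return [s / c for s, c in all_bins if c]


# With max_points=4, the fifth one-second bin triggers a merge,
# so six samples come out as three two-second averages:
series = AdaptiveSeries(max_points=4, step=1.0)
for t, v in [(0.5, 1), (1.5, 2), (2.5, 3), (3.5, 4), (4.5, 5), (5.5, 6)]:
    series.add(t, v)
print(series.averages())  # -> [1.5, 3.5, 5.5]
print(series.step)        # -> 2.0
```

Because merging happens as the data streams in, only the bounded set of bins is ever held in memory or written out, which is what keeps the storage requirements independent of job duration.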
Keywords: Performance · Performance monitoring · Adaptive performance monitoring · Supercomputer · HPC
The work is supported by the Russian Foundation for Basic Research, grant 16-07-01121. The research is carried out using the equipment of the shared research facilities of HPC computing resources at M.V. Lomonosov Moscow State University. This material is based upon work supported by the Russian Presidential study grant (SP-1981.2016.5).