An Approach for Detecting Abnormal Parallel Applications Based on Time Series Analysis Methods
Low efficiency of parallel program execution is one of the most serious problems in high-performance computing. Many studies and software tools aim to analyze and improve the performance of a particular program, but the task of detecting which applications need such analysis is still far from solved.
In this research, we develop methods for detecting abnormal program behavior in the overall supercomputer task flow. There are no clear criteria for anomalous behavior, and such criteria can differ significantly between computing systems, so machine learning methods are used. These methods take system monitoring data as input, since such data provide the most complete information about the dynamics of program execution.
In this article we propose a method based on time series analysis of dynamic characteristics that describe program behavior. In this method, the time series is divided into a set of intervals, and the anomalous intervals among them are detected. The entire application is then classified based on the results of the interval classification. The developed method is being tested on real-life data from the petaflops-level Lomonosov-2 supercomputer.
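The interval-based scheme described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the interval length, the interval classifier, or the aggregation rule, so the fixed-length split, the mean-below-threshold check (a simple stand-in for a trained machine learning model), the majority-share rule, and all names and parameter values below are assumptions made purely for illustration.

```python
from statistics import mean


def split_into_intervals(series, interval_len):
    """Cut a series into consecutive, non-overlapping intervals.

    A shorter trailing interval is kept rather than dropped.
    """
    return [series[i:i + interval_len] for i in range(0, len(series), interval_len)]


def is_anomalous_interval(interval, low_threshold):
    """Hypothetical interval classifier.

    Flags an interval whose mean value is below a threshold. In practice a
    trained model would replace this rule.
    """
    return mean(interval) < low_threshold


def classify_application(series, interval_len=4, low_threshold=10.0,
                         min_anomalous_share=0.5):
    """Classify a whole job from its per-interval results.

    Returns (is_abnormal, per_interval_flags). The job is called abnormal
    when the share of anomalous intervals reaches min_anomalous_share
    (an assumed aggregation rule).
    """
    if not series:
        raise ValueError("series must contain at least one sample")
    intervals = split_into_intervals(series, interval_len)
    flags = [is_anomalous_interval(iv, low_threshold) for iv in intervals]
    share = sum(flags) / len(flags)
    return share >= min_anomalous_share, flags


# Made-up CPU load samples (percent) for one job: busy start, then near idle.
cpu_load = [80, 85, 90, 88, 5, 4, 6, 5, 3, 2, 4, 3]
abnormal, flags = classify_application(cpu_load)
print(abnormal, flags)
```

With these made-up samples, two of the three intervals fall below the assumed threshold, so the job as a whole is flagged. Separating the interval classifier from the aggregation step mirrors the two-stage structure of the method and lets either stage be swapped independently.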
Keywords: High-performance computing, Efficiency analysis, Parallel program, Task flow, Time series analysis, Anomaly detection, Machine learning
This work was funded in part by the Russian Foundation for Basic Research (grant 16-07-00972) and a Russian Presidential study grant (SP-1981.2016.5).