Evalix: Classification and Prediction of Job Resource Consumption on HPC Platforms

  • Conference paper
  • In: Job Scheduling Strategies for Parallel Processing (JSSPP 2015, JSSPP 2016)

Abstract

In the context of a desired (or forced) convergence between High Performance Computing (HPC) platforms, stand-alone accelerators and virtualized resources from Cloud Computing (CC) systems, this article unveils the job prediction component of the Evalix project. This framework aims at improving the efficiency of the underlying Resource and Job Management System (RJMS) within heterogeneous HPC facilities through the automatic evaluation and characterization of the submitted workload. The objective is not only to better adapt the scheduled jobs to the available resource capabilities, but also to reduce energy costs. For that purpose, we collected the resource consumption of all the jobs executed on a production cluster over a period of three months. Based on the analysis and subsequent classification of the jobs, we computed a resource consumption model. The objective is to train a set of predictors based on this model that estimate the CPU, memory and I/O used by the jobs. The analysis of the resource consumption highlighted that different classes of jobs have different kinds of resource needs, and the classification of the jobs made it possible to characterize several application patterns of the users. We also discovered that several users whose resource usage on the cluster is considered too low are responsible for a loss of CPU time on the order of five years over the considered three-month period. The predictors, trained with a supervised learning algorithm, were able to correctly classify a large set of data. We evaluated them with three performance indicators that gave an information retrieval rate of 71% to 89% and a probability of accurate prediction between 0.7 and 0.8. The results of this work will be particularly helpful for designing an optimal partitioning of the considered heterogeneous platform that takes into account the real application needs, thus leading to energy savings and performance improvements. Moreover, apart from the novelty of the contribution, the accurate classification scheme offers new insights into user behavior that are of interest for the design of future HPC platforms.
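
The pipeline summarized above combines a supervised multiclass classifier with several performance indicators. As a purely illustrative sketch of that kind of setup, the Python snippet below trains an RBF-kernel SVM (the classifier family suggested by the cited references [14, 16, 17]) on hypothetical submission-time job features and scores it with Cohen's kappa and a one-vs-rest multiclass ROC AUC, two indicator families also cited below [19, 23]. The feature names, class labels and synthetic data are assumptions for illustration only, not the authors' actual feature set or implementation.

```python
# Minimal sketch (not the authors' implementation): predict a job's
# resource-consumption class from submission-time features with a
# multiclass SVM, then evaluate with Cohen's kappa and multiclass ROC AUC.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, cohen_kappa_score, roc_auc_score

rng = np.random.default_rng(0)
n_jobs = 2000

# Hypothetical submission-time features; in the paper's setting these would
# come from RJMS accounting logs (requested cores, walltime, memory, ...).
X = np.column_stack([
    rng.integers(1, 129, n_jobs),        # requested cores
    rng.integers(600, 86_400, n_jobs),   # requested walltime (s)
    rng.integers(512, 65_536, n_jobs),   # requested memory (MB)
]).astype(float)

# Hypothetical target: discretized CPU-usage class (0 = low, 1 = medium,
# 2 = high); synthetic labels here, measured consumption in the real study.
y = rng.integers(0, 3, n_jobs)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# RBF-kernel SVM; probability=True exposes per-class probabilities for the AUC.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, probability=True))
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)

print(classification_report(y_test, y_pred))
print("Cohen's kappa:", cohen_kappa_score(y_test, y_pred))
print("Multiclass ROC AUC (one-vs-rest, macro):",
      roc_auc_score(y_test, y_prob, multi_class="ovr", average="macro"))
```

The snippet only illustrates the structure of such a train-and-evaluate loop; the actual features, class definitions and indicator values are those reported in the paper.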

Notes

  1. Later on, another monitoring tool named Colmet [11] will be used.

References

  1. Lublin, U., Feitelson, D.: The workload on parallel supercomputers: modeling the characteristics of rigid jobs. J. Parallel Distrib. Comput. 63, 1105–1122 (2001)

  2. Feitelson, D.G.: Workload modeling for performance evaluation. In: Calzarossa, M.C., Tucci, S. (eds.) Performance 2002. LNCS, vol. 2459, pp. 114–141. Springer, Heidelberg (2002). doi:10.1007/3-540-45798-4_6

  3. Feitelson, D.G., Jette, M.A.: Improved utilization and responsiveness with gang scheduling. In: Feitelson, D.G., Rudolph, L. (eds.) JSSPP 1997. LNCS, vol. 1291, pp. 238–261. Springer, Heidelberg (1997). doi:10.1007/3-540-63574-2_24

  4. Cao, J., Zimmermann, F.: Queue scheduling and advance reservations with cosy. In: Parallel and Distributed Processing Symposium, p. 63 (2004)

  5. Emeras, J., Ruiz, C., Vincent, J.-M., Richard, O.: Analysis of the jobs resource utilization on a production system. In: Desai, N., Cirne, W. (eds.) JSSPP 2013. LNCS, vol. 8429, pp. 1–21. Springer, Heidelberg (2014). doi:10.1007/978-3-662-43779-7_1

  6. Varrette, S., Bouvry, P., Cartiaux, H., Georgatos, F.: Management of an academic HPC cluster: the UL experience. In: Proceedings of the 2014 HPCS Conference (2014)

  7. Capit, N., Costa, G.D., Georgiou, Y., et al.: A batch scheduler with high level components. In: CCGrid, pp. 776–783 (2005)

  8. Wolter, N., McCracken, M.O., Snavely, A., et al.: What’s working in HPC: Investigating HPC user behavior and productivity. CTWatch Q. 2, 9–17 (2006)

  9. Feitelson, D.G., Tsafrir, D., Krakov, D.: Experience with using the parallel workloads archive. J. Parallel Distrib. Comput. 74(10), 2967–2982 (2014)

  10. Feitelson, D.: Parallel Workloads Archive

  11. Colmet. https://github.com/oar-team/colmet

  12. Linux Kernel: https://www.kernel.org/, Taskstats: https://www.kernel.org/doc/Documentation/accounting/taskstats.txt, Cgroups: https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt

  13. Bailey, D.H.: NAS parallel benchmarks. In: Padua, D. (ed.) Encyclopedia of Parallel Computing. Springer, Heidelberg (2011)

  14. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)

  15. Duan, R., Nadeem, F., Wang, J., Zhang, Y., Prodan, R., Fahringer, T.: A hybrid intelligent method for performance modeling and prediction of workflow activities in grids. In: Proceedings of the 2009 CCGRID Conference, pp. 339–347 (2009)

  16. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 27:1–27:27 (2011)

  17. Hsu, C.W., Lin, C.J.: A comparison of methods for multiclass support vector machines. IEEE Trans. Neural Netw. 13(2), 415–425 (2002)

  18. Szollosi, D., Denes, D.L., Firtha, F., Kovacs, Z., Fekete, A.: Comparison of six multiclass classifiers by the use of different classification performance indicators. J. Chemometr. 26(3–4), 76–84 (2012)

  19. Ben-David, A.: Comparison of classification accuracy using Cohen's weighted kappa. Expert Syst. Appl. 34(2), 825–832 (2008)

  20. Provost, F.J., Fawcett, T., et al.: Analysis and visualization of classifier performance: comparison under imprecise class and cost distributions. In: KDD, pp. 43–48 (1997)

  21. Uebersax, J.S.: A generalized kappa coefficient. Educ. Psychol. Meas. 42(1), 181–183 (1982)

  22. Feinstein, A.R., Cicchetti, D.V.: High agreement but low kappa: I. The problems of two paradoxes. J. Clin. Epidemiol. 43(6), 543–549 (1990)

  23. Hand, D., Till, R.: A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach. Learn. 45(2), 171–186 (2001)

  24. Bradley, A.P.: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 30(7), 1145–1159 (1997)

  25. Duan, K., Keerthi, S., Poo, A.N.: Evaluation of simple performance measures for tuning SVM hyperparameters. Neurocomputing 51, 41–59 (2003)

  26. Guyon, I.: A Scaling Law for the Validation-Set Training-Set Size Ratio. AT&T Bell Laboratories (1997)

  27. Matsunaga, A., Fortes, J.A.B.: On the use of machine learning to predict the time and resources consumed by applications. In: CCGrid (2010)

  28. Tsafrir, D., Etsion, Y., Feitelson, D.: Backfilling using system-generated predictions rather than user runtime estimates. IEEE Trans. Parallel Distrib. Syst. 18(6), 789–803 (2007)

  29. Smith, W., Foster, I., Taylor, V.: Predicting application run times using historical information. In: Feitelson, D.G., Rudolph, L. (eds.) JSSPP 1998. LNCS, vol. 1459, pp. 122–142. Springer, Heidelberg (1998). doi:10.1007/BFb0053984

  30. Gibbons, R.: A historical application profiler for use by parallel schedulers. In: Feitelson, D.G., Rudolph, L. (eds.) JSSPP 1997. LNCS, vol. 1291, pp. 58–77. Springer, Heidelberg (1997). doi:10.1007/3-540-63574-2_16

  31. Zhang, J., Figueiredo, R.: Application classification through monitoring and learning of resource consumption patterns. In: IPDPS, April 2006


Acknowledgments

The experiments presented in this paper were carried out using the HPC facility of the University of Luxembourg. Many thanks are also due to all those who participated in collecting and distributing the logs available through the PWA and used in Table 1.

Author information

Corresponding author

Correspondence to Joseph Emeras.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Emeras, J., Varrette, S., Guzek, M., Bouvry, P. (2017). Evalix: Classification and Prediction of Job Resource Consumption on HPC Platforms. In: Desai, N., Cirne, W. (eds.) Job Scheduling Strategies for Parallel Processing. JSSPP 2015, JSSPP 2016. Lecture Notes in Computer Science, vol. 10353. Springer, Cham. https://doi.org/10.1007/978-3-319-61756-5_6

  • DOI: https://doi.org/10.1007/978-3-319-61756-5_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-61755-8

  • Online ISBN: 978-3-319-61756-5
