Evalix: Classification and Prediction of Job Resource Consumption on HPC Platforms

  • Conference paper
  • In: Job Scheduling Strategies for Parallel Processing (JSSPP 2015, JSSPP 2016)

Abstract

In the context of a desired (or forced) convergence between High Performance Computing (HPC) platforms, stand-alone accelerators and virtualized resources from Cloud Computing (CC) systems, this article unveils the job prediction component of the Evalix project. This framework aims at improving the efficiency of the underlying Resource and Job Management System (RJMS) within heterogeneous HPC facilities through the automatic evaluation and characterization of the submitted workload. The objective is not only to better adapt the scheduled jobs to the available resource capabilities, but also to reduce energy costs. For that purpose, we collected the resource consumption of all the jobs executed on a production cluster over a period of three months. Based on the analysis and subsequent classification of the jobs, we computed a resource consumption model. The objective is to train a set of predictors based on this model that estimate the CPU, memory and I/O used by the jobs. The analysis of the resource consumption highlighted that different classes of jobs have different kinds of resource needs, and the classification of the jobs made it possible to characterize several application patterns of the users. We also discovered that several users whose resource usage on the cluster is considered too low are responsible for a loss of CPU time on the order of five years over the considered three-month period. The predictors, trained with a supervised learning algorithm, were able to correctly classify a large set of data. We evaluated them with three performance indicators that gave an information retrieval rate of 71% to 89% and a probability of accurate prediction between 0.7 and 0.8. The results of this work will be particularly helpful for designing an optimal partitioning of the considered heterogeneous platform that takes into account the real application needs, thus leading to energy savings and performance improvements. Moreover, apart from the novelty of the contribution, the accurate classification scheme offers new insights into user behavior that are of interest for the design of future HPC platforms.
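
The pipeline summarized above combines a supervised multiclass classifier with several performance indicators. As a purely illustrative sketch of that kind of setup, the Python snippet below trains an RBF-kernel SVM (the classifier family suggested by the cited references [14, 16, 17]) on hypothetical submission-time job features and scores it with Cohen's kappa and a one-vs-rest multiclass ROC AUC, two indicator families also cited below [19, 23]. The feature names, class labels and synthetic data are assumptions for illustration only, not the authors' actual feature set or implementation.

```python
# Minimal sketch (not the authors' implementation): predict a job's
# resource-consumption class from submission-time features with a
# multiclass SVM, then evaluate with Cohen's kappa and multiclass ROC AUC.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, cohen_kappa_score, roc_auc_score

rng = np.random.default_rng(0)
n_jobs = 2000

# Hypothetical submission-time features; in the paper's setting these would
# come from RJMS accounting logs (requested cores, walltime, memory, ...).
X = np.column_stack([
    rng.integers(1, 129, n_jobs),        # requested cores
    rng.integers(600, 86_400, n_jobs),   # requested walltime (s)
    rng.integers(512, 65_536, n_jobs),   # requested memory (MB)
]).astype(float)

# Hypothetical target: discretized CPU-usage class (0 = low, 1 = medium,
# 2 = high); synthetic labels here, measured consumption in the real study.
y = rng.integers(0, 3, n_jobs)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# RBF-kernel SVM; probability=True exposes per-class probabilities for the AUC.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, probability=True))
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)

print(classification_report(y_test, y_pred))
print("Cohen's kappa:", cohen_kappa_score(y_test, y_pred))
print("Multiclass ROC AUC (one-vs-rest, macro):",
      roc_auc_score(y_test, y_prob, multi_class="ovr", average="macro"))
```

The snippet only illustrates the structure of such a train-and-evaluate loop; the actual features, class definitions and indicator values are those reported in the paper.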

Notes

  1. Later on, another monitoring tool named Colmet [11] will be used.

References

  1. Lublin, U., Feitelson, D.: The workload on parallel supercomputers: modeling the characteristics of rigid jobs. J. Parallel Distrib. Comput. 63, 1105–1122 (2001)

  2. Feitelson, D.G.: Workload modeling for performance evaluation. In: Calzarossa, M.C., Tucci, S. (eds.) Performance 2002. LNCS, vol. 2459, pp. 114–141. Springer, Heidelberg (2002). doi:10.1007/3-540-45798-4_6

  3. Feitelson, D.G., Jette, M.A.: Improved utilization and responsiveness with gang scheduling. In: Feitelson, D.G., Rudolph, L. (eds.) JSSPP 1997. LNCS, vol. 1291, pp. 238–261. Springer, Heidelberg (1997). doi:10.1007/3-540-63574-2_24

  4. Cao, J., Zimmermann, F.: Queue scheduling and advance reservations with cosy. In: Parallel and Distributed Processing Symposium, p. 63 (2004)

  5. Emeras, J., Ruiz, C., Vincent, J.-M., Richard, O.: Analysis of the jobs resource utilization on a production system. In: Desai, N., Cirne, W. (eds.) JSSPP 2013. LNCS, vol. 8429, pp. 1–21. Springer, Heidelberg (2014). doi:10.1007/978-3-662-43779-7_1

  6. Varrette, S., Bouvry, P., Cartiaux, H., Georgatos, F.: Management of an academic HPC cluster: the UL experience. In: Proceedings of the 2014 HPCS Conference (2014)

  7. Capit, N., Costa, G.D., Georgiou, Y., et al.: A batch scheduler with high level components. In: CCGrid, pp. 776–783 (2005)

  8. Wolter, N., McCracken, M.O., Snavely, A., et al.: What’s working in HPC: Investigating HPC user behavior and productivity. CTWatch Q. 2, 9–17 (2006)

  9. Feitelson, D.G., Tsafrir, D., Krakov, D.: Experience with using the parallel workloads archive. J. Parallel Distrib. Comput. 74(10), 2967–2982 (2014)

  10. Feitelson, D.: Parallel Workloads Archive

  11. Colmet. https://github.com/oar-team/colmet

  12. Linux Kernel: https://www.kernel.org/, Taskstats: https://www.kernel.org/doc/Documentation/accounting/taskstats.txt, Cgroups: https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt

  13. Bailey, D.H.: NAS parallel benchmarks. In: Padua, D. (ed.) Encyclopedia of Parallel Computing. Springer, Heidelberg (2011)

  14. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)

  15. Duan, R., Nadeem, F., Wang, J., Zhang, Y., Prodan, R., Fahringer, T.: A hybrid intelligent method for performance modeling and prediction of workflow activities in grids. In: Proceedings of the 2009 CCGRID Conference, pp. 339–347 (2009)

  16. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 27:1–27:27 (2011)

  17. Hsu, C.W., Lin, C.J.: A comparison of methods for multiclass support vector machines. IEEE Trans. Neural Netw. 13(2), 415–425 (2002)

  18. Szollosi, D., Denes, D.L., Firtha, F., Kovacs, Z., Fekete, A.: Comparison of six multiclass classifiers by the use of different classification performance indicators. J. Chemometr. 26(3–4), 76–84 (2012)

  19. Ben-David, A.: Comparison of classification accuracy using Cohen's weighted kappa. Expert Syst. Appl. 34(2), 825–832 (2008)

  20. Provost, F.J., Fawcett, T., et al.: Analysis and visualization of classifier performance: comparison under imprecise class and cost distributions. In: KDD, pp. 43–48 (1997)

  21. Uebersax, J.S.: A generalized kappa coefficient. Educ. Psychol. Meas. 42(1), 181–183 (1982)

  22. Feinstein, A.R., Cicchetti, D.V.: High agreement but low kappa: I. The problems of two paradoxes. J. Clin. Epidemiol. 43(6), 543–549 (1990)

  23. Hand, D., Till, R.: A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach. Learn. 45(2), 171–186 (2001)

  24. Bradley, A.P.: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 30(7), 1145–1159 (1997)

  25. Duan, K., Keerthi, S., Poo, A.N.: Evaluation of simple performance measures for tuning SVM hyperparameters. Neurocomputing 51, 41–59 (2003)

  26. Guyon, I.: A Scaling Law for the Validation-Set Training-Set Size Ratio. AT&T Bell Laboratories (1997)

  27. Matsunaga, A., Fortes, J.A.B.: On the use of machine learning to predict the time and resources consumed by applications. In: CCGrid (2010)

  28. Tsafrir, D., Etsion, Y., Feitelson, D.: Backfilling using system-generated predictions rather than user runtime estimates. IEEE Trans. Parallel Distrib. Syst. 18(6), 789–803 (2007)

  29. Smith, W., Foster, I., Taylor, V.: Predicting application run times using historical information. In: Feitelson, D.G., Rudolph, L. (eds.) JSSPP 1998. LNCS, vol. 1459, pp. 122–142. Springer, Heidelberg (1998). doi:10.1007/BFb0053984

  30. Gibbons, R.: A historical application profiler for use by parallel schedulers. In: Feitelson, D.G., Rudolph, L. (eds.) JSSPP 1997. LNCS, vol. 1291, pp. 58–77. Springer, Heidelberg (1997). doi:10.1007/3-540-63574-2_16

  31. Zhang, J., Figueiredo, R.: Application classification through monitoring and learning of resource consumption patterns. In: IPDPS, April 2006


Acknowledgments

The experiments presented in this paper were carried out using the HPC facility of the University of Luxembourg. Many thanks are also due to all those who participated in collecting and distributing the logs available through the PWA and used in Table 1.

Author information

Corresponding author

Correspondence to Joseph Emeras.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Emeras, J., Varrette, S., Guzek, M., Bouvry, P. (2017). Evalix: Classification and Prediction of Job Resource Consumption on HPC Platforms. In: Desai, N., Cirne, W. (eds.) Job Scheduling Strategies for Parallel Processing. JSSPP 2015, JSSPP 2016. Lecture Notes in Computer Science, vol. 10353. Springer, Cham. https://doi.org/10.1007/978-3-319-61756-5_6

  • DOI: https://doi.org/10.1007/978-3-319-61756-5_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-61755-8

  • Online ISBN: 978-3-319-61756-5
