Data-Driven Job Dispatching in HPC Systems

  • Cristian Galleguillos
  • Alina Sîrbu
  • Zeynep Kiziltan
  • Ozalp Babaoglu
  • Andrea Borghesi
  • Thomas Bridi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10710)

Abstract

As High Performance Computing (HPC) systems get closer to exascale performance, job dispatching strategies become critical for keeping system utilization high while keeping waiting times low for jobs competing for HPC system resources. In this paper, we take a data-driven approach and investigate whether better dispatching decisions can be made by transforming the log data produced by an HPC system into useful knowledge about its workload. In particular, we focus on job duration, develop a data-driven approach to job duration prediction, and analyze the effect of different prediction approaches in making dispatching decisions using a real workload dataset collected from Eurora, a hybrid HPC system. Experiments on various dispatching methods show promising results.
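To make the idea of feeding predicted job durations into a dispatcher concrete, here is a minimal sketch. It is not the method evaluated in the paper: the predictor (the mean of a user's most recent job durations), the shortest-job-first ordering, and every name in it (`DurationPredictor`, `shortest_job_first`, `history_size`, `default`, the job dictionary layout) are assumptions chosen purely for illustration.

```python
from collections import defaultdict, deque
from statistics import mean


class DurationPredictor:
    """Hypothetical predictor: mean of a user's most recent job durations."""

    def __init__(self, history_size=2, default=3600.0):
        # `default` (seconds) is an assumed fallback for users with no history.
        self.default = default
        self._recent = defaultdict(lambda: deque(maxlen=history_size))

    def observe(self, user, duration):
        """Record the real duration (seconds) of a finished job from the log."""
        self._recent[user].append(duration)

    def predict(self, user):
        """Return the predicted duration (seconds) of the user's next job."""
        past = self._recent.get(user)
        return mean(past) if past else self.default


def shortest_job_first(queue, predictor):
    """Order waiting jobs by predicted duration, shortest first (stable sort)."""
    return sorted(queue, key=lambda job: predictor.predict(job["user"]))


# Toy log of (user, duration in seconds); these values are made up.
predictor = DurationPredictor()
for user, duration in [("alice", 120.0), ("bob", 7200.0), ("alice", 180.0)]:
    predictor.observe(user, duration)

queue = [
    {"id": 1, "user": "bob"},
    {"id": 2, "user": "carol"},  # no history, so the default is used
    {"id": 3, "user": "alice"},
]
order = [job["id"] for job in shortest_job_first(queue, predictor)]
print(order)
```

In this toy setup the job from the user with the shortest recent history is dispatched first, and a user with no history falls back to the assumed default. A real dispatcher would also have to account for resource requests and availability, which this sketch ignores.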

Notes

Acknowledgments

We thank Dr. A. Bartolini, Prof. L. Benini, Prof. M. Milano and Dr. M. Lombardi for fruitful discussions on the work presented here and for providing access to the Eurora data, together with the SCAI group in Cineca. We acknowledge the Cineca PM-HPC award allowing access to HPC resources. C. Galleguillos has been supported by Postgraduate Grant PUCV 2017. A. Sîrbu has been partially funded by the E.U. project SoBigData Research Infrastructure—Big Data and Social Mining Ecosystem (grant agreement 654024).

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  • Cristian Galleguillos (1, 2), corresponding author
  • Alina Sîrbu (3)
  • Zeynep Kiziltan (1)
  • Ozalp Babaoglu (1)
  • Andrea Borghesi (1)
  • Thomas Bridi (1)

  1. Department of Computer Science and Engineering, University of Bologna, Bologna, Italy
  2. Escuela de Ing. Informática, Pontificia Universidad Católica de Valparaíso, Valparaíso, Chile
  3. Department of Computer Science, University of Pisa, Pisa, Italy
