Learning-Based Approaches to Estimate Job Wait Time in HTC Datacenters

Gombert, Luc; Suter, Frédéric

doi:10.1007/978-3-030-88224-2_6

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12985)

Included in the following conference series:

Workshop on Job Scheduling Strategies for Parallel Processing

Abstract

High Throughput Computing (HTC) datacenters are a cornerstone of scientific discoveries in the fields of High Energy Physics and Astroparticle Physics. These datacenters provide thousands of users from dozens of scientific collaborations with tens of thousands of computing cores and petabytes of storage.

The scheduling algorithm used in such datacenters to handle the millions of (mostly single-core) jobs submitted every month ensures a fair sharing of the computing resources among user groups, but may also cause unpredictably long job wait times for some users. How long a job waits depends on many entangled factors and configuration parameters and is thus very hard to predict. Moreover, batch systems implementing a fair-share scheduling algorithm cannot provide users with any estimation of the job wait time at submission time.

Therefore, we investigate in this paper how learning-based techniques applied to the logs of the batch scheduling system of a large HTC datacenter can be used to estimate job wait time. First, we illustrate why users need such an estimation. Then, we identify some intuitive causes of this wait time from the information found in the batch system logs. We also formally analyze the correlation between job and system features and job wait time. Finally, we study several Machine Learning algorithms to implement learning-based estimators of both job wait time and job wait time ranges. Our experimental results show that a regression-based estimator can predict job wait time with a median absolute percentage error of about 54%, while a classifier that combines regression and classification assigns nearly 77% of the jobs to the right wait time range or to an immediately adjacent one.
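
The two evaluation measures quoted in the abstract can be made concrete with a small sketch. This is not the authors' code: the range boundaries, the function names, and the sample wait times below are all assumptions chosen for illustration, and the abstract does not state how the paper defines its wait time ranges.

```python
from bisect import bisect_right
from statistics import median

# Hypothetical wait-time range boundaries in seconds (not taken from the paper):
# [0, 30 min), [30 min, 2 h), [2 h, 12 h), [12 h, +inf)
RANGE_BOUNDS = [1800, 7200, 43200]


def median_ape(actual, predicted):
    """Median absolute percentage error, in percent; actual values must be > 0."""
    return 100.0 * median(abs(p - a) / a for a, p in zip(actual, predicted))


def to_range(wait_seconds, bounds=RANGE_BOUNDS):
    """Index of the wait-time range that contains wait_seconds."""
    return bisect_right(bounds, wait_seconds)


def within_one_range(actual, predicted, bounds=RANGE_BOUNDS):
    """Fraction of jobs placed in the right range or an immediately adjacent one."""
    hits = sum(
        abs(to_range(a, bounds) - to_range(p, bounds)) <= 1
        for a, p in zip(actual, predicted)
    )
    return hits / len(actual)


# Made-up wait times in seconds, for illustration only.
actual = [600, 3600, 10000, 50000]
predicted = [900, 1800, 20000, 1000]

print(median_ape(actual, predicted))        # median of the per-job percentage errors
print(within_one_range(actual, predicted))  # share of jobs off by at most one range
```

Taking the median rather than the mean keeps a few badly mispredicted jobs from dominating the score, which matters when wait times span several orders of magnitude.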

Acknowledgments

The authors would like to thank Wataru Takase and his colleagues from the Japanese High Energy Accelerator Research Organization (KEK) for providing the initial motivation for this work.

Author information

Authors and Affiliations

IN2P3 Computing Center/CNRS, Lyon-Villeurbanne, France
Luc Gombert & Frédéric Suter

Corresponding author

Correspondence to Frédéric Suter.

Editor information

Editors and Affiliations

CESNET, Prague, Czech Republic
Dalibor Klusáček
Google, Mountain View, CA, USA
Walfredo Cirne
Apple, Cupertino, CA, USA
Gonzalo P. Rodrigo

About this paper

Cite this paper

Gombert, L., Suter, F. (2021). Learning-Based Approaches to Estimate Job Wait Time in HTC Datacenters. In: Klusáček, D., Cirne, W., Rodrigo, G.P. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2021. Lecture Notes in Computer Science, vol 12985. Springer, Cham. https://doi.org/10.1007/978-3-030-88224-2_6

DOI: https://doi.org/10.1007/978-3-030-88224-2_6
Published: 06 October 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-88223-5
Online ISBN: 978-3-030-88224-2
eBook Packages: Computer Science, Computer Science (R0)