A Self-optimized Job Scheduler for Heterogeneous Server Clusters

Yom-Tov, Elad; Aridor, Yariv

doi:10.1007/978-3-540-78699-3_10

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4942)

Included in the following conference series:

Workshop on Job Scheduling Strategies for Parallel Processing

Abstract

Heterogeneous clusters and grid infrastructures are becoming increasingly popular. In these computing infrastructures, machines have different resources, including memory sizes, disk space, and installed software packages. These differences give rise to a problem of over-provisioning, that is, sub-optimal utilization of a cluster due to users requesting resource capacities greater than what their jobs actually need. Our analysis of a real workload file (LANL CM5) revealed differences of up to two orders of magnitude between requested memory capacity and actual memory usage. This paper presents an algorithm to estimate actual resource capacities used by batch jobs. Such an algorithm reduces the need for users to correctly predict the resources required by their jobs, while at the same time managing the scheduling system to obtain superior utilization of available hardware. The algorithm is based on the Reinforcement Learning paradigm; it learns its estimation policy on-line and dynamically modifies it according to the overall cluster load. The paper includes simulation results which indicate that our algorithm can yield an improvement of over 30% in utilization (overall throughput) of heterogeneous clusters.
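
The abstract does not spell out the learning rule, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes an epsilon-greedy bandit that learns by how much a job's requested memory can be scaled down before the job no longer fits. The class name, the candidate scale factors, the reward function, and the job figures are all hypothetical, and the sketch ignores the adaptation to overall cluster load that the abstract mentions.

```python
import random


class MemoryEstimator:
    """Epsilon-greedy bandit over hypothetical memory scale-down factors."""

    def __init__(self, factors=(1.0, 0.5, 0.25, 0.1), epsilon=0.1, seed=0):
        self.factors = list(factors)          # assumed candidate factors
        self.epsilon = epsilon                # exploration probability
        self.rng = random.Random(seed)        # seeded for repeatability
        self.counts = [0] * len(self.factors)
        self.values = [0.0] * len(self.factors)  # running mean reward per factor

    def choose(self):
        """Return the index of the factor to try for the next job."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.factors))  # explore
        return max(range(len(self.factors)), key=lambda i: self.values[i])

    def update(self, arm, reward_value):
        """Fold one observed reward into the running mean for this factor."""
        self.counts[arm] += 1
        self.values[arm] += (reward_value - self.values[arm]) / self.counts[arm]

    def best_factor(self):
        """Factor with the highest mean reward seen so far."""
        best = max(range(len(self.factors)), key=lambda i: self.values[i])
        return self.factors[best]


def reward(requested, actual, factor):
    """Assumed reward: fraction of requested memory freed, or -1 on failure."""
    estimate = requested * factor
    if estimate < actual:
        return -1.0                      # job would not fit in its allocation
    return 1.0 - estimate / requested    # memory released for other jobs


def train(jobs, steps=2000, seed=0):
    """Replay (requested, actual) pairs and learn on-line which factor works."""
    estimator = MemoryEstimator(seed=seed)
    for t in range(steps):
        requested, actual = jobs[t % len(jobs)]
        arm = estimator.choose()
        estimator.update(arm, reward(requested, actual, estimator.factors[arm]))
    return estimator


# Hypothetical jobs in MB: each requests far more memory than it uses.
JOBS = [(1000, 200), (2000, 450)]
```

Calling `train(JOBS).best_factor()` returns the factor the estimator currently prefers. A real scheduler would also need to recover from failed allocations, for example by resubmitting the job with its original request.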

Author information

Authors and Affiliations

IBM Haifa Research Lab, Haifa, 31905, Israel
Elad Yom-Tov & Yariv Aridor

Editor information

Eitan Frachtenberg, Uwe Schwiegelshohn

Cite this paper

Yom-Tov, E., Aridor, Y. (2008). A Self-optimized Job Scheduler for Heterogeneous Server Clusters. In: Frachtenberg, E., Schwiegelshohn, U. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2007. Lecture Notes in Computer Science, vol 4942. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78699-3_10

DOI: https://doi.org/10.1007/978-3-540-78699-3_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78698-6
Online ISBN: 978-3-540-78699-3
eBook Packages: Computer Science, Computer Science (R0)