A Self-optimized Job Scheduler for Heterogeneous Server Clusters

  • Conference paper
Job Scheduling Strategies for Parallel Processing (JSSPP 2007)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4942)

Abstract

Heterogeneous clusters and grid infrastructures are becoming increasingly popular. In these computing infrastructures, machines have different resources, including memory sizes, disk space, and installed software packages. These differences give rise to a problem of over-provisioning, that is, sub-optimal utilization of a cluster due to users requesting resource capacities greater than what their jobs actually need. Our analysis of a real workload file (LANL CM5) revealed differences of up to two orders of magnitude between requested memory capacity and actual memory usage. This paper presents an algorithm to estimate actual resource capacities used by batch jobs. Such an algorithm reduces the need for users to correctly predict the resources required by their jobs, while at the same time managing the scheduling system to obtain superior utilization of available hardware. The algorithm is based on the Reinforcement Learning paradigm; it learns its estimation policy on-line and dynamically modifies it according to the overall cluster load. The paper includes simulation results which indicate that our algorithm can yield an improvement of over 30% in utilization (overall throughput) of heterogeneous clusters.
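
The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea: learning on-line how far a user's memory request can be scaled down. It swaps in a plain epsilon-greedy multi-armed bandit, which is not necessarily the authors' method, and it does not model the adaptation to overall cluster load that the abstract mentions. The names MemoryEstimator, reward, and simulate, the candidate scaling factors, the reward definition, and the synthetic workload are all assumptions made for illustration.

```python
import random


class MemoryEstimator:
    """Epsilon-greedy bandit over scaling factors applied to requested memory.

    Hypothetical illustration only; not the algorithm from the paper.
    """

    def __init__(self, factors=(1.0, 0.5, 0.25, 0.1), epsilon=0.1, seed=0):
        self.factors = list(factors)              # candidate fractions of the request
        self.epsilon = epsilon                    # exploration probability
        self.rng = random.Random(seed)
        self.counts = [0] * len(self.factors)     # times each factor was tried
        self.values = [0.0] * len(self.factors)   # running mean reward per factor

    def choose(self):
        """Return the index of the factor to use for the next job."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.factors))   # explore
        return max(range(len(self.factors)), key=lambda i: self.values[i])

    def update(self, arm, r):
        """Fold the observed reward into the running mean for this factor."""
        self.counts[arm] += 1
        self.values[arm] += (r - self.values[arm]) / self.counts[arm]


def reward(requested, actual, factor):
    """Assumed reward: fraction of the request freed, or -1 if the job fails."""
    estimate = requested * factor
    if estimate < actual:
        return -1.0          # job was given too little memory
    return 1.0 - factor      # share of the requested memory left for other jobs


def simulate(n_jobs=2000, seed=0):
    """Run the estimator on synthetic jobs that use 5-30% of what they request."""
    rng = random.Random(seed)
    est = MemoryEstimator(seed=seed)
    for _ in range(n_jobs):
        requested = 32.0                               # assumed request, in MB
        actual = requested * rng.uniform(0.05, 0.3)    # synthetic actual usage
        arm = est.choose()
        est.update(arm, reward(requested, actual, est.factors[arm]))
    return est
```

Calling simulate() and inspecting the returned estimator's values and counts shows which scaling factors the sketch came to prefer on this synthetic workload. A real scheduler would also need to resubmit failed jobs with a larger allocation, which is omitted here.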

Editor information

Eitan Frachtenberg, Uwe Schwiegelshohn

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Yom-Tov, E., Aridor, Y. (2008). A Self-optimized Job Scheduler for Heterogeneous Server Clusters. In: Frachtenberg, E., Schwiegelshohn, U. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2007. Lecture Notes in Computer Science, vol 4942. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78699-3_10

  • DOI: https://doi.org/10.1007/978-3-540-78699-3_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-78698-6

  • Online ISBN: 978-3-540-78699-3

  • eBook Packages: Computer Science, Computer Science (R0)
