A Self-optimized Job Scheduler for Heterogeneous Server Clusters

  • Elad Yom-Tov
  • Yariv Aridor
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4942)

Abstract

Heterogeneous clusters and grid infrastructures are becoming increasingly popular. In these computing infrastructures, machines have different resources, including memory sizes, disk space, and installed software packages. These differences give rise to a problem of over-provisioning, that is, sub-optimal utilization of a cluster due to users requesting resource capacities greater than what their jobs actually need. Our analysis of a real workload file (LANL CM5) revealed differences of up to two orders of magnitude between requested memory capacity and actual memory usage. This paper presents an algorithm to estimate the actual resource capacities used by batch jobs. Such an algorithm reduces the need for users to correctly predict the resources required by their jobs, while at the same time enabling the scheduling system to obtain better utilization of the available hardware. The algorithm is based on the Reinforcement Learning paradigm; it learns its estimation policy on-line and dynamically modifies it according to the overall cluster load. The paper includes simulation results which indicate that our algorithm can yield an improvement of over 30% in utilization (overall throughput) of heterogeneous clusters.
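As a rough illustration of the idea of learning on-line how much of a requested capacity a job really needs, the following minimal sketch uses an epsilon-greedy multi-armed bandit. It is not the algorithm from the paper: the class `MemoryEstimator`, the candidate scaling factors, the reward function, and the fixed 30% usage ratio are all assumptions chosen for illustration, and the sketch does not adapt to cluster load.

```python
import random


class MemoryEstimator:
    """Epsilon-greedy bandit over scaling factors applied to requested memory.

    Hypothetical illustration only; not the estimator described in the paper.
    """

    def __init__(self, factors=(1.0, 0.5, 0.25, 0.1), epsilon=0.2, seed=0):
        self.factors = list(factors)      # candidate fractions of the request to grant
        self.epsilon = epsilon            # probability of exploring a random factor
        self.rng = random.Random(seed)    # private RNG so runs are reproducible
        self.counts = [0] * len(self.factors)
        self.values = [0.0] * len(self.factors)  # running mean reward per factor

    def choose(self):
        """Return the index of the factor to try for the next job."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.factors))
        return max(range(len(self.factors)), key=lambda i: self.values[i])

    def update(self, arm, reward_value):
        """Fold one observed reward into the running mean for this factor."""
        self.counts[arm] += 1
        self.values[arm] += (reward_value - self.values[arm]) / self.counts[arm]

    def best_factor(self):
        """Factor with the highest mean reward observed so far."""
        best = max(range(len(self.factors)), key=lambda i: self.values[i])
        return self.factors[best]


def reward(requested, actual, factor):
    """Assumed reward: -1 if the job would fail, else the fraction of the request freed."""
    granted = requested * factor
    if granted < actual:
        return -1.0                       # too little memory granted: job fails
    return 1.0 - granted / requested      # memory released for other jobs


def simulate(rounds=1000, usage_ratio=0.3, seed=0):
    """Run jobs that each use a fixed fraction of what they request (an assumption)."""
    est = MemoryEstimator(seed=seed)
    for _ in range(rounds):
        requested = 100.0
        actual = usage_ratio * requested
        arm = est.choose()
        est.update(arm, reward(requested, actual, est.factors[arm]))
    return est
```

Calling `simulate()` and then `best_factor()` on the result shows which scaling factor the sketch settles on. A real scheduler would also have to account for the cost of a failed job and for the current cluster load, both of which this sketch ignores.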

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Elad Yom-Tov (1)
  • Yariv Aridor (1)
  1. IBM Haifa Research Lab, Haifa, Israel
