Abstract
Heterogeneous clusters and grid infrastructures are becoming increasingly popular. In these computing infrastructures, machines differ in their resources, including memory size, disk space, and installed software packages. These differences give rise to the problem of over-provisioning, that is, sub-optimal utilization of a cluster because users request resource capacities greater than what their jobs actually need. Our analysis of a real workload file (LANL CM5) revealed differences of up to two orders of magnitude between requested memory capacity and actual memory usage. This paper presents an algorithm that estimates the actual resource capacities used by batch jobs. Such an algorithm reduces the need for users to correctly predict the resources required by their jobs, while at the same time enabling the scheduling system to achieve better utilization of the available hardware. The algorithm is based on the Reinforcement Learning paradigm; it learns its estimation policy on-line and dynamically modifies it according to the overall cluster load. The paper includes simulation results which indicate that our algorithm can yield an improvement of over 30% in utilization (overall throughput) of heterogeneous clusters.
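The abstract does not spell out the estimation algorithm, so the following is only a minimal sketch of the general idea of learning a resource estimate on-line. It assumes an epsilon-greedy multi-armed bandit over a handful of candidate scaling factors applied to the user's requested memory. The class `MemoryEstimator`, the function `reward`, the factor values, and the reward definition are all hypothetical choices for illustration, not the authors' method, and the sketch ignores the adaptation to overall cluster load that the abstract mentions.

```python
import random


class MemoryEstimator:
    """Hypothetical epsilon-greedy learner over request scaling factors."""

    def __init__(self, factors=(1.0, 0.5, 0.25, 0.1), epsilon=0.2, seed=0):
        self.factors = list(factors)      # assumed candidate scaling factors
        self.epsilon = epsilon            # exploration probability
        self.rng = random.Random(seed)
        self.counts = [0] * len(self.factors)
        self.values = [0.0] * len(self.factors)  # running mean reward per factor

    def choose(self):
        # Explore a random factor with probability epsilon,
        # otherwise exploit the factor with the best mean reward so far.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.factors))
        return max(range(len(self.factors)), key=lambda i: self.values[i])

    def estimate(self, requested_mb, arm):
        # Estimated actual need = requested capacity scaled by the chosen factor.
        return requested_mb * self.factors[arm]

    def update(self, arm, r):
        # Incremental update of the running mean reward for this factor.
        self.counts[arm] += 1
        self.values[arm] += (r - self.values[arm]) / self.counts[arm]


def reward(estimate_mb, actual_mb):
    # Assumed reward: 0 if the estimate is too small (the job would fail),
    # otherwise the fraction of the allocated memory that is actually used.
    if estimate_mb < actual_mb:
        return 0.0
    return actual_mb / estimate_mb


if __name__ == "__main__":
    est = MemoryEstimator()
    # Simulated jobs that request 1024 MB but use only 200 MB.
    for _ in range(2000):
        arm = est.choose()
        est.update(arm, reward(est.estimate(1024, arm), 200))
    best = max(range(len(est.factors)), key=lambda i: est.values[i])
    print("learned scaling factor:", est.factors[best])
```

In this toy setting the learner settles on the smallest factor that still covers actual usage, which is the behaviour an estimator needs in order to counter over-provisioning without starving jobs.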
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yom-Tov, E., Aridor, Y. (2008). A Self-optimized Job Scheduler for Heterogeneous Server Clusters. In: Frachtenberg, E., Schwiegelshohn, U. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2007. Lecture Notes in Computer Science, vol 4942. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78699-3_10
DOI: https://doi.org/10.1007/978-3-540-78699-3_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78698-6
Online ISBN: 978-3-540-78699-3
eBook Packages: Computer Science (R0)