Scheduling in HPC Resource Management Systems: Queuing vs. Planning
Abstract
Nearly all existing HPC systems are operated by resource management systems based on the queuing approach. With the increasing acceptance of grid middleware like Globus, new requirements arise for the underlying local resource management systems. Features like advanced reservation or quality of service are needed to implement high-level functions like co-allocation. However, it is difficult to realize these features with a resource management system based on the queuing concept, since it considers only the present resource usage.
In this paper we present an approach which closes this gap. By assigning a start time to each resource request, a complete schedule is planned. Advanced reservations are now easily possible. Based on this planning approach, functions like diffuse requests, automatic duration extension, or service level agreements are described. We think they are useful to increase the usability, acceptance, and performance of HPC machines. In the second part of this paper we present a planning-based resource management system which already covers some of the mentioned features.
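The planning idea can be illustrated with a minimal sketch. It is not the system presented in the paper; the names `Request`, `plan_request`, and `reserve`, the single pool of identical nodes, and the integer time units are all assumptions made for illustration. Each request receives a start time when it is submitted, so a reservation for a fixed future start is just another entry in the plan.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    name: str
    nodes: int                    # number of nodes requested
    duration: int                 # estimated run time (arbitrary time units)
    start: Optional[int] = None   # assigned by the planner or fixed by a reservation


def nodes_in_use(plan: List[Request], t: int) -> int:
    """Nodes occupied at time t; each request covers [start, start + duration)."""
    return sum(r.nodes for r in plan if r.start <= t < r.start + r.duration)


def fits(plan: List[Request], req: Request, start: int, total_nodes: int) -> bool:
    """True if req can run from `start` without exceeding total_nodes."""
    # Usage only rises when a planned request starts, so it suffices to check
    # `start` itself and every planned start inside the requested window.
    end = start + req.duration
    points = [start] + [r.start for r in plan if start < r.start < end]
    return all(nodes_in_use(plan, t) + req.nodes <= total_nodes for t in points)


def plan_request(plan: List[Request], req: Request, total_nodes: int, now: int = 0) -> int:
    """Assign the earliest feasible start time to req and add it to the plan."""
    if req.nodes > total_nodes:
        raise ValueError("request is larger than the machine")
    # The earliest feasible start is either `now` or the end of a planned request.
    ends = {r.start + r.duration for r in plan if r.start + r.duration > now}
    for t in sorted({now} | ends):
        if fits(plan, req, t, total_nodes):
            req.start = t
            plan.append(req)
            return t
    raise RuntimeError("unreachable: the machine is empty after the last planned end")


def reserve(plan: List[Request], req: Request, start: int, total_nodes: int) -> bool:
    """Advanced reservation: accept req at a fixed start time if the plan allows it."""
    if not fits(plan, req, start, total_nodes):
        return False
    req.start = start
    plan.append(req)
    return True


# Hypothetical 4-node machine.
TOTAL = 4
plan: List[Request] = []
accepted = reserve(plan, Request("R", nodes=4, duration=5), start=10, total_nodes=TOTAL)
start_a = plan_request(plan, Request("A", nodes=2, duration=6), TOTAL)   # starts at 0
start_b = plan_request(plan, Request("B", nodes=2, duration=12), TOTAL)  # must wait until R ends: 15
start_c = plan_request(plan, Request("C", nodes=2, duration=4), TOTAL)   # fills the gap next to A: 0
rejected = reserve(plan, Request("X", nodes=1, duration=2), start=12, total_nodes=TOTAL)
```

Because the plan covers the future as well as the present, request C is placed in the gap before the reservation without any separate backfilling step, and the conflicting reservation X is refused at submission time. A queuing system, which looks only at the present resource usage, has no plan against which to check either decision.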
Keywords
High Performance Computing; Service Level Agreement; Grid Resource; Resource Management System; Advanced Reservation