Scheduling in HPC Resource Management Systems: Queuing vs. Planning

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2862)


Nearly all existing HPC systems are operated by resource management systems based on the queuing approach. With the increasing acceptance of grid middleware like Globus, new requirements arise for the underlying local resource management systems. Features like advanced reservation or quality of service are needed to implement high-level functions like co-allocation. However, these features are difficult to realize with a resource management system based on the queuing concept, since it considers only the present resource usage.

In this paper we present an approach which closes this gap. By assigning a start time to each resource request, a complete schedule is planned. Advanced reservations are now easily possible. Based on this planning approach, functions like diffuse requests, automatic duration extension, or service level agreements are described. We think they are useful to increase the usability, acceptance, and performance of HPC machines. In the second part of this paper we present a planning-based resource management system which already covers some of the mentioned features.
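
The core idea of the planning approach, assigning a start time to every request so that a full schedule exists, can be illustrated with a small sketch. This is not the authors' system: the names `Planned`, `place`, `fits`, and `nodes_in_use`, the integer time units, and the single pool of identical nodes are all assumptions made for illustration. Each request is placed at the earliest time at which enough nodes are free for its whole duration, and an advance reservation is modelled simply as a request with a later earliest start that is planned first.

```python
from dataclasses import dataclass


@dataclass
class Planned:
    """One entry of the schedule (hypothetical structure for this sketch)."""
    name: str
    nodes: int
    start: int
    end: int  # exclusive


def nodes_in_use(plan, t):
    """Number of nodes occupied at time t by already planned entries."""
    return sum(p.nodes for p in plan if p.start <= t < p.end)


def fits(plan, total_nodes, nodes, start, duration):
    """True if `nodes` are free during the whole interval [start, start + duration)."""
    end = start + duration
    # Usage can only rise at the start of a planned entry, so checking the
    # interval start plus every planned start inside the interval is enough.
    points = [start] + [p.start for p in plan if start < p.start < end]
    return all(nodes_in_use(plan, t) + nodes <= total_nodes for t in points)


def place(plan, total_nodes, name, nodes, duration, earliest=0):
    """Assign the earliest feasible start time to a request and add it to the plan."""
    if nodes > total_nodes or duration <= 0:
        raise ValueError("request can never be satisfied")
    # The earliest feasible start is either `earliest` or the end of a planned entry.
    candidates = sorted({earliest} | {p.end for p in plan if p.end > earliest})
    for start in candidates:
        if fits(plan, total_nodes, nodes, start, duration):
            entry = Planned(name, nodes, start, start + duration)
            plan.append(entry)
            return entry
    raise RuntimeError("no feasible start found")  # not expected to happen


# Example on an assumed 8-node machine.
plan = []
resv = place(plan, 8, "reservation", nodes=8, duration=2, earliest=10)
a = place(plan, 8, "A", nodes=4, duration=6)
b = place(plan, 8, "B", nodes=6, duration=5)
c = place(plan, 8, "C", nodes=2, duration=3)

for entry in sorted(plan, key=lambda p: (p.start, p.name)):
    print(f"{entry.name}: nodes={entry.nodes} start={entry.start} end={entry.end}")
```

Because every request receives a start time, the scheduler knows future resource usage as well as present usage. In the example, request B cannot run alongside A and would collide with the reservation if it started when A ends, so it is planned after the reservation, while the smaller request C still fits next to A at time 0. A queuing system that looks only at currently free nodes has no comparable view of the future.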


Keywords: High Performance Computing; Service Level Agreement; Grid Resource; Resource Management System; Advanced Reservation
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  1. Paderborn Center for Parallel Computing, University of Paderborn, Paderborn, Germany
  2. Faculty of Computer Science, Electrical Engineering and Mathematics, University of Paderborn, Paderborn, Germany
