Scheduling in HPC Resource Management Systems: Queuing vs. Planning

Conference paper, Job Scheduling Strategies for Parallel Processing (JSSPP 2003)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2862)


Abstract

Nearly all existing HPC systems are operated by resource management systems based on the queuing approach. With the increasing acceptance of grid middleware such as Globus, new requirements arise for the underlying local resource management systems. Features such as advance reservation and quality of service are needed to implement high-level functions like co-allocation. However, these features are difficult to realize in a resource management system based on the queuing concept, since such a system considers only the present resource usage.
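To make this limitation concrete, here is a minimal sketch of a first-come-first-served queue. It is our own illustration, not code from the paper: the class `QueuingScheduler`, the `Job` fields, and the integer time units are hypothetical. The only input to each start decision is the number of nodes free at the current instant, so the scheduler cannot tell a user when a waiting job will start, nor accept a request to hold nodes for a future interval.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int     # number of nodes requested (hypothetical field)
    duration: int  # estimated runtime in abstract time units (hypothetical field)


class QueuingScheduler:
    """Hypothetical FCFS queue: decisions depend only on currently free nodes."""

    def __init__(self, total_nodes: int):
        self.free_nodes = total_nodes
        self.queue: deque[Job] = deque()
        self.running: list[tuple[int, Job]] = []  # (end_time, job)

    def submit(self, job: Job) -> None:
        # A queue records only an order, not a start time.
        self.queue.append(job)

    def tick(self, now: int) -> list[str]:
        # Release the nodes of every job that has finished by `now`.
        still_running = []
        for end, job in self.running:
            if end <= now:
                self.free_nodes += job.nodes
            else:
                still_running.append((end, job))
        self.running = still_running

        # Start jobs from the head of the queue while they fit right now.
        started = []
        while self.queue and self.queue[0].nodes <= self.free_nodes:
            job = self.queue.popleft()
            self.free_nodes -= job.nodes
            self.running.append((now + job.duration, job))
            started.append(job.name)
        return started
```

With 4 nodes, submitting A (3 nodes, 10 units) and then B (2 nodes, 5 units), `tick(0)` starts A, `tick(5)` starts nothing, and `tick(10)` starts B. B's start time becomes known only when it happens, and there is no structure in which a reservation for a future interval could be recorded.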

In this paper, we present an approach that closes this gap. By assigning a start time to each resource request, a complete schedule is planned, which makes advance reservations straightforward. Building on this planning approach, we describe functions such as diffuse requests, automatic duration extension, and service level agreements, which we believe can increase the usability, acceptance, and performance of HPC machines. In the second part of this paper, we present a planning-based resource management system that already covers some of these features.
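The core idea of planning, assigning every request a start time at submission, can be sketched as follows. This is an illustration under stated assumptions, not the paper's system: the names `Planner`, `submit`, and `reserve` are hypothetical, time is in integer units, and placement uses a simple first-fit policy chosen for brevity, which may differ from the policies the paper's resource management system applies. An advance reservation is then just a request with a fixed start time, accepted only if it fits into the existing plan.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Allocation:
    name: str
    start: int
    end: int    # exclusive
    nodes: int


@dataclass
class Planner:
    """Hypothetical planning-based scheduler: every request gets a start time."""

    total_nodes: int
    plan: list[Allocation] = field(default_factory=list)

    def _used(self, t: int) -> int:
        # Nodes allocated at time t according to the current plan.
        return sum(a.nodes for a in self.plan if a.start <= t < a.end)

    def _fits(self, start: int, duration: int, nodes: int) -> bool:
        # Usage only rises at allocation starts, so checking `start` and each
        # allocation start inside [start, end) covers the peak in the window.
        end = start + duration
        points = {start} | {a.start for a in self.plan if start < a.start < end}
        return all(self._used(t) + nodes <= self.total_nodes for t in points)

    def submit(self, name: str, nodes: int, duration: int, earliest: int = 0) -> int:
        # First fit: free capacity can only grow at an allocation's end, so the
        # earliest feasible start is `earliest` or some end time after it.
        candidates = sorted({earliest} | {a.end for a in self.plan if a.end >= earliest})
        for t in candidates:
            if self._fits(t, duration, nodes):
                self.plan.append(Allocation(name, t, t + duration, nodes))
                return t
        raise ValueError("request needs more nodes than the machine has")

    def reserve(self, name: str, nodes: int, duration: int, start: int) -> bool:
        # Advance reservation: the start time is fixed by the user and the
        # request is accepted only if it fits into the existing plan.
        if self._fits(start, duration, nodes):
            self.plan.append(Allocation(name, start, start + duration, nodes))
            return True
        return False
```

On a 4-node machine, A (3 nodes, 10 units) is planned at time 0 and B (2 nodes, 5 units) at time 10. A reservation of all 4 nodes at time 20 is accepted, while one for 3 nodes at time 12 is rejected because B holds 2 nodes then. A later 1-node, 8-unit job C is planned at time 0, in the gap beside A, without moving anything already in the plan.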

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hovestadt, M., Kao, O., Keller, A., Streit, A. (2003). Scheduling in HPC Resource Management Systems: Queuing vs. Planning. In: Feitelson, D., Rudolph, L., Schwiegelshohn, U. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2003. Lecture Notes in Computer Science, vol 2862. Springer, Berlin, Heidelberg. https://doi.org/10.1007/10968987_1

  • DOI: https://doi.org/10.1007/10968987_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-20405-3

  • Online ISBN: 978-3-540-39727-4

  • eBook Packages: Springer Book Archive
