Job management requirements for NAS parallel systems and clusters

  • Conference paper
Job Scheduling Strategies for Parallel Processing (JSSPP 1995)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 949)

Abstract

A job management system is a critical component of a production supercomputing environment, permitting oversubscribed resources to be shared fairly and efficiently. Job management systems that were originally designed for traditional vector supercomputers are not appropriate for the distributed-memory parallel supercomputers that are becoming increasingly important in the high performance computing industry. Newer job management systems offer new functionality but do not solve fundamental problems. We address some of the main issues in resource allocation and job scheduling we have encountered on two parallel computers: a 160-node IBM SP2 and a cluster of 20 high performance workstations located at the Numerical Aerodynamic Simulation facility. We describe the requirements for resource allocation and job management that are necessary to provide a production supercomputing environment on these machines, prioritizing according to difficulty and importance, and advocating a return to fundamental issues.

Editor information

Dror G. Feitelson, Larry Rudolph

Copyright information

© 1995 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Saphir, W., Tanner, L.A., Traversat, B. (1995). Job management requirements for NAS parallel systems and clusters. In: Feitelson, D.G., Rudolph, L. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 1995. Lecture Notes in Computer Science, vol 949. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-60153-8_37

  • DOI: https://doi.org/10.1007/3-540-60153-8_37

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-60153-1

  • Online ISBN: 978-3-540-49459-1

  • eBook Packages: Springer Book Archive