Job management requirements for NAS parallel systems and clusters

Saphir, William; Tanner, Leigh Ann; Traversat, Bernard

doi:10.1007/3-540-60153-8_37

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 949)

Included in the following conference series:

Workshop on Job Scheduling Strategies for Parallel Processing

Abstract

A job management system is a critical component of a production supercomputing environment, permitting oversubscribed resources to be shared fairly and efficiently. Job management systems originally designed for traditional vector supercomputers are not appropriate for the distributed-memory parallel supercomputers that are becoming increasingly important in the high performance computing industry. Newer job management systems offer new functionality but do not solve fundamental problems. We address some of the main issues in resource allocation and job scheduling that we have encountered on two parallel computers: a 160-node IBM SP2 and a cluster of 20 high performance workstations located at the Numerical Aerodynamic Simulation (NAS) facility. We describe the requirements for resource allocation and job management that are necessary to provide a production supercomputing environment on these machines, prioritizing them according to difficulty and importance, and advocating a return to fundamental issues.

Author information

Authors and Affiliations

NAS Scientific Computing Branch, NASA Ames Research Center, Mail Stop 258-6, Moffett Field, CA 94035-1000
William Saphir, Leigh Ann Tanner & Bernard Traversat

Editor information

Dror G. Feitelson, Larry Rudolph

About this paper

Cite this paper

Saphir, W., Tanner, L.A., Traversat, B. (1995). Job management requirements for NAS parallel systems and clusters. In: Feitelson, D.G., Rudolph, L. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 1995. Lecture Notes in Computer Science, vol 949. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-60153-8_37

DOI: https://doi.org/10.1007/3-540-60153-8_37
Published: 02 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-60153-1
Online ISBN: 978-3-540-49459-1
eBook Packages: Springer Book Archive
