Resource management for high-performance PC clusters

  • Axel Keller
  • Matthias Brune
  • Alexander Reinefeld
Track C2: Computational Science
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1593)

Abstract

With the recent availability of cost-effective network cards for the PCI bus, researchers have been tempted to build large compute clusters from standard PCs. Many of these clusters are operated with workstation cluster management software in high-throughput or single-user mode.

For very large clusters with more than 100 PEs, however, it becomes necessary to deploy full-fledged resource management software that allows the system to be partitioned for multi-user access.

In this paper, we present our Computing Center Software (CCS), which was originally designed for managing massively parallel high-performance computers and has now been adapted to modern workstation clusters. It provides
  • partitioning of exclusive and non-exclusive resources,

  • hardware-independent scheduling of interactive and batch jobs,

  • open, extensible interfaces to other resource management systems,

  • a high degree of reliability.

Keywords

  • Service Description
  • Resource Management System
  • Configuration Manager
  • Queue Manager
  • Machine Manager
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag 1999

Authors and Affiliations

  • Axel Keller
    • 1
  • Matthias Brune
    • 2
  • Alexander Reinefeld
    • 2
  1. Paderborn Center for Parallel Computing, Paderborn, Germany
  2. Konrad-Zuse-Zentrum für Informationstechnik Berlin, Berlin, Germany
