Multi-User System Management on SCI Clusters

  • Matthias Brune
  • Axel Keller
  • Alexander Reinefeld
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1734)

Abstract

The growing maturity of hardware and software components has tempted researchers to build very large SCI clusters with several hundred processors that are operated as high-performance compute servers in multi-user mode.

In this chapter, we present the Computing Center Software (CCS), resource management software for user access and system administration of high-performance compute clusters. It has been in day-to-day use since 1992 on various parallel systems and has recently been adapted to the management of SCI clusters. CCS provides pluggable schedulers, optimal space partitioning for multiple users, reliable user access, and powerful tools for specifying resources and services by means of a specification language and a graphical user interface.

After a brief introduction in the remainder of this section, we describe the CCS system architecture and the characteristics of its resource description facilities.

Copyright information

© Springer-Verlag Berlin Heidelberg 1999

Authors and Affiliations

  • Matthias Brune (1)
  • Axel Keller (2)
  • Alexander Reinefeld (1)

  1. Konrad-Zuse-Zentrum für Informationstechnik, Berlin
  2. Paderborn Center for Parallel Computing, Paderborn
