Abstract

Bringing clusters of computers into the mainstream as general-purpose computing systems requires better facilities for transparent remote execution of parallel and sequential applications. While much research has been done in this area, most of this work remains inaccessible for clusters built using contemporary hardware and operating systems. Existing implementations are too old or not publicly available, require operating systems that modern hardware does not support, or simply do not meet the functional requirements of practical use in real-world settings. To address these issues, we designed REXEC, a decentralized, secure remote execution facility. It provides high availability, scalability, transparent remote execution, dynamic cluster configuration, decoupled node discovery and selection, a well-defined failure and cleanup model, parallel and distributed program support, and strong authentication and encryption. The system is implemented and currently in use on a 32-node cluster of 2-way SMPs running the Linux 2.2.5 operating system.

Keywords

Clusters; Remote execution; Distributed systems; Decentralized control

Copyright information

© Springer-Verlag Berlin Heidelberg 2000

Authors and Affiliations

  • Brent N. Chun (1)
  • David E. Culler (1)
  1. Computer Science Division, University of California at Berkeley, Berkeley, USA
