Migration and rollback transparency for arbitrary distributed applications in workstation clusters

  • Stefan Petri
  • Matthias Bolze
  • Horst Langendörfer
Workshop on Run-Time Systems for Parallel Programming. Matthew Haines, University of Wyoming, USA; Koen Langendoen, Vrije Universiteit, The Netherlands; Greg Benson, University of California at Davis, USA
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1388)

Abstract

Programmers and users of compute-intensive scientific applications often do not want to code load balancing and fault tolerance into their programs, or cannot do so.

The Beam system [18] uses a global virtual name space to provide migration and rollback transparency in user space for distributed groups of processes on workstations. System calls are interposed and their parameters are translated between the name spaces. Unlike other migration mechanisms, Beam does not require the applications to be written for a specific programming model or communication library.
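The role of a virtual name space can be illustrated with a minimal sketch. It is not Beam's implementation: the class `NameSpace`, its methods, and the numeric IDs are hypothetical, and process IDs are used only as one plausible kind of name that might need translating. The point is that the application keeps the same virtual identifier while the real one changes after a migration or rollback.

```python
class NameSpace:
    """Hypothetical two-way map between virtual and real identifiers."""

    def __init__(self):
        self._to_real = {}
        self._to_virtual = {}
        self._next_virtual = 1000  # arbitrary starting point for virtual IDs

    def register(self, real_id):
        """Assign a fresh virtual ID to a real one and return it."""
        virtual_id = self._next_virtual
        self._next_virtual += 1
        self._to_real[virtual_id] = real_id
        self._to_virtual[real_id] = virtual_id
        return virtual_id

    def rebind(self, virtual_id, new_real_id):
        """Point an existing virtual ID at a new real ID (e.g. after a restart)."""
        old_real = self._to_real[virtual_id]
        del self._to_virtual[old_real]
        self._to_real[virtual_id] = new_real_id
        self._to_virtual[new_real_id] = virtual_id

    def to_real(self, virtual_id):
        return self._to_real[virtual_id]

    def to_virtual(self, real_id):
        return self._to_virtual[real_id]


ns = NameSpace()
vpid = ns.register(4711)   # assumed real PID on the first host
before = ns.to_real(vpid)
ns.rebind(vpid, 815)       # assumed real PID after restart on another host
after = ns.to_real(vpid)
print(vpid, before, after)
```

In this sketch the application would only ever be handed `vpid`, so nothing it has stored becomes stale when the underlying real identifier changes.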

In this paper we describe the design and implementation of a separate system call interposition process [3] that accesses the application through the debugging interface. The main advantage of this approach is that it can handle even unmodified (e.g. commercially bought) application programs. We compare measured performance figures with those of previous similar approaches [15, 20].
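The control flow of such an interposition process can be sketched as a simulation. Everything here is an assumption made for illustration: a Python generator stands in for the unmodified application, `fake_kernel` stands in for the host operating system, and only two calls (`getpid`, `kill`) are modelled. A real interposer would instead stop and resume the application through a debugging interface such as `ptrace` or `/proc`; the abstract does not say which one is used.

```python
def application():
    """Stand-in for an unmodified program: it yields system call requests
    and only ever sees virtual process IDs."""
    my_pid = yield ("getpid", ())
    status = yield ("kill", (my_pid, 0))
    return my_pid, status


def fake_kernel(name, args, real_pid=4711):
    """Stand-in for the host kernel, which only knows real PIDs."""
    if name == "getpid":
        return real_pid
    if name == "kill":
        target, _signal = args
        return 0 if target == real_pid else -1
    raise ValueError(name)


def interpose(app, virtual_pid, real_pid):
    """Run the app, translating PIDs at each system call entry and exit."""
    to_real = {virtual_pid: real_pid}
    to_virtual = {real_pid: virtual_pid}
    trace = []  # what the kernel actually saw
    try:
        request = next(app)                     # app stops at its first call
        while True:
            name, args = request
            if name == "kill":                  # entry: virtual -> real
                args = (to_real[args[0]],) + args[1:]
            result = fake_kernel(name, args, real_pid)
            trace.append((name, args, result))
            if name == "getpid":                # exit: real -> virtual
                result = to_virtual[result]
            request = app.send(result)          # resume the app
    except StopIteration as done:
        return done.value, trace


seen_by_app, kernel_trace = interpose(application(), virtual_pid=1000, real_pid=4711)
print(seen_by_app)
print(kernel_trace)
```

The application code contains no translation logic at all, which mirrors the stated advantage of the approach: the program itself does not have to be modified or relinked.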

References

  1. A.D. Alexandrov, M. Ibel, K.E. Schauser, and C.J. Scheiman. Extending the Operating System at the User Level: the Ufo Global File System. In USENIX Technical Conference Proceedings, pages 77–90, Anaheim, CA, January 1997.
  2. D. Andres, C. Elford, B. Fin, and L. Smith. Dynamic load balancing in PVM. Technical report, University of Illinois at Urbana-Champaign, April 1993.
  3. M. Bolz. Transparent Redirection of System Calls for Unmodified Programs in Beam. Master's thesis, Institut für Betriebssysteme und Rechnerverbund, TU Braunschweig, November 1997. (In German).
  4. J. Cargille and B.P. Miller. Binary Wrapping: A Technique for Instrumenting Object Code. ACM Sigplan Notices, 27(6):17–18, June 1992.
  5. J. Casas, D.L. Clark, R. Konuru, S.W. Otto, R.M. Prouty, and J. Walpole. MPVM: A migration transparent version of PVM. Computing Systems, 8(2):171–216, 1995.
  6. CCS Annual Report. WWW page, Center for Computational Sciences, Oak Ridge National Laboratory, 1995. http://www.ccs.ornl.org/AnRep95/CCS95.html.
  7. R. Faulkner and R. Gomes. The Process File System and Process Model in UNIX System V. In USENIX Technical Conference Proceedings, pages 243–252, Dallas, TX, January 1991.
  8. Al Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM: Parallel Virtual Machine — A Users' Guide and Tutorial for Networked Parallel Computing. The MIT Press, Cambridge, Massachusetts, 1994.
  9. M.B. Jones. Transparently Interposing User Code at the System Interface. PhD thesis, CMU, September 1992.
  10. A.H. Karp, M. Heath, and Al Geist. 1995 Gordon Bell Prize Winners. IEEE Computer, 29(1):79–85, January 1996.
  11. J. León, A.L. Fisher, and P. Steenkiste. Fail-safe PVM: A portable package for distributed programming with Transparent Recovery. Report CMU-CS-93-124, Carnegie Mellon University, February 1993.
  12. M. Litzkow, T. Tannenbaum, J. Basney, and M. Livny. Checkpointing and Migration of UNIX Processes in the Condor Distributed Processing System. Report 1346, University of Wisconsin-Madison Computer Sciences, April 1997.
  13. M.J. Litzkow and M. Solomon. Supporting Checkpointing and Process Migration Outside the UNIX Kernel. In USENIX Technical Conference Proceedings, pages 283–290, San Francisco, CA, January 1992.
  14. D. Long, J. Caroll, and C. Park. A Study of the Reliability of Internet Sites. In Proceedings of the 10th Symposium on Reliable Distributed Systems, pages 177–186, 1991.
  15. K.I. Mandelberg and V.S. Sunderam. Process Migration in UNIX Networks. In USENIX Technical Conference Proceedings, pages 357–363, Dallas, TX, February 1988.
  16. Message Passing Interface Forum MPIF. MPI-2: Extensions to the Message-Passing Interface. Technical report, University of Tennessee, Knoxville, July 1997. http://www.mpi-forum.org.
  17. S. Petri, M. Bolz, and H. Langendörfer. Transparent Migration and Rollback for Unmodified Applications in Workstation Clusters. Informatik-Bericht 98-02, TU Braunschweig, April 1998. To appear.
  18. S. Petri and H. Langendörfer. Load Balancing and Fault Tolerance in Workstation Clusters — Migrating Groups of Communicating Processes. Operating Systems Review, 29(4):25–36, October 1995.
  19. S. Petri, B. Schnor, M. Becker, B. Hinrichs, T. Tschamtke, and H. Langendörfer. Evaluation of Multicast Methods to Maintain a Global Name Space for Transparent Process Migration in Workstation Clusters. In Kommunikation in Verteilten Systemen, pages 224–234. GI/ITG Fachtagung KIVS'97, Springer, February 1997.
  20. S. Petri, B. Schnor, H. Langendörfer, and J. Steinborn. Consistent Global Checkpoints for Distributed Applications on Clusters of Unix Workstations. In Paralleles und Verteiltes Rechnen — Beiträge zum 4. Workshop über Wissenschaftliches Rechnen, pages 77–86, Aachen, October 1996. TU Braunschweig, Shaker.
  21. T. Shirakihara, H. Hirayama, K. Sato, and T. Kanai. ARTEMIS: Advanced Reliable disTributed Environment Middleware System. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, PDPTA'97, pages 97–106, Las Vegas, NV, July 1997.
  22. G. Stellner. CoCheck: Checkpointing and Process Migration for MPI. In Proceedings of the 10th International Parallel Processing Symposium (IPPS '96), Honolulu, Hawaii, April 1996.
  23. Sun Microsystems. SunOS Reference Manual, 1990. Revision A.
  24. J. Trinitis. An External Checkpointing Technique for Integration into a Parallel Tool Environment. In preparation. trinitis@informatik.tu-muenchen.de, 1998.
  25. J.J.J. Vesseur, R.N. Heederik, B.J. Overeinder, and P.M.A. Sloot. Experiments in Dynamic Load Balancing for Parallel Cluster Computing. In Proceedings of the Workshop on Parallel Programming and Computation (ZEUS'95) and the 4th Nordic Transputer Conference (NTUG'95), pages 189–194, Amsterdam, June 1995. IOS Press.

Copyright information

© Springer-Verlag 1998

Authors and Affiliations

  • Stefan Petri
    • 1
  • Matthias Bolze
    • 2
  • Horst Langendörfer
  1. Institute for Computer Engineering, Medical University at Lübeck, Germany
  2. Institute for Operating Systems and Computer Networks, Technical University Braunschweig, Germany
