Migration and rollback transparency for arbitrary distributed applications in workstation clusters
- First Online:
Programmers and users of compute intensive scientific applications often do not want to (or even cannot) code load balancing and fault tolerance into their programs.
The Beam system  uses a global virtual name space to provide migration and rollback transparency in user space for distributed groups of processes on workstations. The system calls are interposed and their parameters translated between the name spaces. Unlike other migration mechanisms, Beam does not require the applications to be written for a specific programming model or communication library.
In this paper we describe design and implementation of a separate system call interposition process  that accesses the application via the debugging interface. The main advantage of this approach is that it can handle even unmodified (e. g. commercially bought) application programs. We compare measured performance figures with previous similar approaches [15, 20].
Unable to display preview. Download preview PDF.
- 1.A.D. Alexandrov, M. Ibel, K.E. Schauser, and C.J. Scheiman. Extending the Operating System at the User Level: the Ufo Global File System. In USENIX Technical Conference Proceedings, pages 77–90, Anaheim, CA, January 1997.Google Scholar
- 2.D. Andres, C. Elford, B. Fin, and L. Smith. Dynamic load balancing in PVM. Technical report, University of Illinois at Urbanna-Champaign, April 1993.Google Scholar
- 3.M. Bolz. Transparent Redirection of System Calls for Unmodified Programs in Beam Master's thesis, Institut für Betriebssysteme und Rechnerverbund, TU Braunschweig, November 1997. (In German).Google Scholar
- 4.J. Cargille and B.P. Miller. Binary Wrapping: A Technique for Instrumenting Object Code. ACM Sigplan Notices, 27(6):17–18, June 1992.Google Scholar
- 5.J. Casas, D.L. Clark, R. Konuru, S.W. Otto, R.M. Prouty, and J. Walpole. MPVM: A migration transparent version of PVM. Computing Systems, 8(2):171–216, 1995.Google Scholar
- 6.CCS Annual Report. WWW page, Center for Computational Sciences, Oak Ridge National Laboratory, 1995.http://www.ccs.ornl.org/AnRep95/CCS95.html.Google Scholar
- 7.R. Faulkner and R. Gomes. The Process File System and Process Model in UNIX System V. In USENIX Technical Conference Proceedings, pages 243–252, Dallas, TX, January 1991.Google Scholar
- 8.Al Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM: Parallel Virtual Machine — A Users' Guide and Tutorial for Networked Parallel Computing. The MIT Press, Cambridge, Massachusetts, 1994.Google Scholar
- 9.M.B. Jones.Transparently Interposing User Code at the System Interface. PhD thesis, CMU, September 1992.Google Scholar
- 10.A.H. Karp, M. Heath, and Al Geist. 1995 Gordon Bell Prize Winners. IEEE Computer, 29(1):79–85, January 1996.Google Scholar
- 11.J. León, A.L, Fisher, and P. Steenkiste. Fail-save PVM: A portable package for distributed programming with Transparent Recovery. Report CMU-CS-93-124, Carnegie Mellon University, February 1993.Google Scholar
- 12.M. Litzkow, T. Tannenbaum, J. Basney, and M. Livny. Checkpointing and Migration of UNIX Processes in the Condor Distributed Processing System. Report 1346, University of Wisconsin-Madison Computer Sciences, April 1997.Google Scholar
- 13.M.J. Litzkow and M. Solomon. Supporting Checkpointing and Process Migration Outside the UNIX Kernel. In USENIX Technical Conference Proceedings, pages 283–290, San Francisco, CA, January 1992.Google Scholar
- 14.D. Long, J. Caroll, and C. Park. A Study of the Reliability of Internet Sites. In Proceedings of the 10th Symposium on Reliable Distributed Systems, pages 177–186,1991.Google Scholar
- 15.K.I. Mandelberg and V.S. Sunderam. Process Migration in UNIX Networks. In USENIX Technical Conference Proceedings, pages 357–363, Dallas, TX, February 1988.Google Scholar
- 16.Message Passing Interface Forum MPIF. MPI-2: Extensions to the Message-Passing Interface. Technical report, University of Tennessee, Knoxville, July 1997. http://www.mpi-forum.org.Google Scholar
- 17.S. Petri, M. Bolz, and H. Langendörfer. Transparent Migration and Rollback for Unmodified Applications in Workstation Clusters. Informatik-Bericht 98-02, TU Braunschweig, April 1998. To appear.Google Scholar
- 19.S. Petri, B. Schnor, M. Becker, B. Hinrichs, T. Tschamtke, and H. Langendörfer. Evaluation of Multicast Methods to Maintain a Global Name Space for Transparent Process Migration in Workstation Clusters. In Kommunikation in Verteilten Systemen, pages 224–234. GI/ITG Fachtagung KIVS'97, Springer, February 1997.Google Scholar
- 20.S. Petri, B. Schnor, H. Langendbrfer, and J. Steinborn. Consistent Global Checkpoints for Distributed Applications on Clusters of Unix Workstations. In Paralleles und Verteiltes Rechnen — Beiträge zum 4. Workshop über Wissenschaftliches Rechnen, pages 77–86, Aachen, October 1996. TU Braunschweig, Shaker.Google Scholar
- 21.T Shirakihara, H. Hirayama, K. Sato, and T. Kanai. ARTEMIS: Advanced Reliable disTributed Environment Middleware System. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, PDPTA'97, pages 97–106, Las Vegas, NV, July 1997.Google Scholar
- 22.G. Stellner. CoCheck: Checkpointing and Process Migration for MPI. In Proceedings of the 10th International Parallel Processing Symposium (IPPS '96), Honolulu, Hawaii, April 1996.Google Scholar
- 23.Sun Microsystems. SunOS Reference Manual, 1990. Revision A.Google Scholar
- 24.J. Trinitis. An External Checkpointing Technique for Integration into a Parallel Tool Environment. In preparation. firstname.lastname@example.org, 1998.Google Scholar
- 25.J.J.J. Vesseur, R.N. Heederik, B.J. Overeinder, and P.M.A. Sloot. Experiments in Dynamic Load Balancing for Parallel Cluster Computing. In Proceedings of the Workshop on Parallel Programming and Computation (ZEUS'95) and the 4th Nordic Transputer Conference (NTUG'95), pages 189–194, Amsterdam, June 1995. IOS Press. *** DIRECT SUPPORT *** A0008D07 00007Google Scholar