Migration and rollback transparency for arbitrary distributed applications in workstation clusters
Programmers and users of compute-intensive scientific applications often do not want to (or even cannot) code load balancing and fault tolerance into their programs.
The Beam system uses a global virtual name space to provide migration and rollback transparency in user space for distributed groups of processes on workstations. System calls are interposed, and their parameters are translated between the virtual and the real name spaces. Unlike other migration mechanisms, Beam does not require applications to be written for a specific programming model or communication library.
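The parameter translation described above can be illustrated with a minimal sketch. It is not Beam's implementation: the `NameSpace` class, the `interpose_kill` and `interpose_getpid` helpers, and the host names are hypothetical, and the sketch only assumes that each process has a stable virtual identifier that is mapped to whatever real (host, pid) pair it currently occupies.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

RealId = Tuple[str, int]  # (host name, local process id); hypothetical representation


@dataclass
class NameSpace:
    """Hypothetical table mapping virtual process ids to real (host, pid) pairs."""
    to_real: Dict[int, RealId] = field(default_factory=dict)
    to_virtual: Dict[RealId, int] = field(default_factory=dict)
    next_vpid: int = 1

    def register(self, host: str, pid: int) -> int:
        """Assign a fresh virtual id to a real process and record both directions."""
        vpid = self.next_vpid
        self.next_vpid += 1
        self.to_real[vpid] = (host, pid)
        self.to_virtual[(host, pid)] = vpid
        return vpid

    def migrate(self, vpid: int, new_host: str, new_pid: int) -> None:
        """Rebind a virtual id after migration; the virtual id itself never changes."""
        del self.to_virtual[self.to_real[vpid]]
        self.to_real[vpid] = (new_host, new_pid)
        self.to_virtual[(new_host, new_pid)] = vpid


def interpose_kill(ns: NameSpace, vpid: int, sig: int) -> Tuple[str, str, int, int]:
    """Translate the virtual pid argument of a kill-like call to its real target."""
    host, pid = ns.to_real[vpid]
    return ("kill", host, pid, sig)


def interpose_getpid(ns: NameSpace, host: str, real_pid: int) -> int:
    """Translate a real pid result back into the virtual name space."""
    return ns.to_virtual[(host, real_pid)]


ns = NameSpace()
v = ns.register("hostA", 4711)
before = interpose_kill(ns, v, 15)   # addressed to hostA
ns.migrate(v, "hostB", 1234)
after = interpose_kill(ns, v, 15)    # same virtual id, now addressed to hostB
```

In this sketch the application only ever sees the virtual identifier, so the same call keeps working after the process has moved to another workstation.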
In this paper we describe the design and implementation of a separate system call interposition process that accesses the application via the debugging interface. The main advantage of this approach is that it can handle even unmodified (e.g., commercially bought) application programs. We compare measured performance figures with previous similar approaches [15, 20].
- A.D. Alexandrov, M. Ibel, K.E. Schauser, and C.J. Scheiman. Extending the Operating System at the User Level: the Ufo Global File System. In USENIX Technical Conference Proceedings, pages 77–90, Anaheim, CA, January 1997.
- D. Andres, C. Elford, B. Fin, and L. Smith. Dynamic load balancing in PVM. Technical report, University of Illinois at Urbana-Champaign, April 1993.
- M. Bolz. Transparent Redirection of System Calls for Unmodified Programs in Beam. Master's thesis, Institut für Betriebssysteme und Rechnerverbund, TU Braunschweig, November 1997. (In German).
- J. Cargille and B.P. Miller. Binary Wrapping: A Technique for Instrumenting Object Code. ACM Sigplan Notices, 27(6):17–18, June 1992.
- J. Casas, D.L. Clark, R. Konuru, S.W. Otto, R.M. Prouty, and J. Walpole. MPVM: A migration transparent version of PVM. Computing Systems, 8(2):171–216, 1995.
- CCS Annual Report. WWW page, Center for Computational Sciences, Oak Ridge National Laboratory, 1995. http://www.ccs.ornl.org/AnRep95/CCS95.html.
- R. Faulkner and R. Gomes. The Process File System and Process Model in UNIX System V. In USENIX Technical Conference Proceedings, pages 243–252, Dallas, TX, January 1991.
- Al Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM: Parallel Virtual Machine — A Users' Guide and Tutorial for Networked Parallel Computing. The MIT Press, Cambridge, Massachusetts, 1994.
- M.B. Jones. Transparently Interposing User Code at the System Interface. PhD thesis, CMU, September 1992.
- A.H. Karp, M. Heath, and Al Geist. 1995 Gordon Bell Prize Winners. IEEE Computer, 29(1):79–85, January 1996.
- J. León, A.L. Fisher, and P. Steenkiste. Fail-safe PVM: A portable package for distributed programming with Transparent Recovery. Report CMU-CS-93-124, Carnegie Mellon University, February 1993.
- M. Litzkow, T. Tannenbaum, J. Basney, and M. Livny. Checkpointing and Migration of UNIX Processes in the Condor Distributed Processing System. Report 1346, University of Wisconsin-Madison Computer Sciences, April 1997.
- M.J. Litzkow and M. Solomon. Supporting Checkpointing and Process Migration Outside the UNIX Kernel. In USENIX Technical Conference Proceedings, pages 283–290, San Francisco, CA, January 1992.
- D. Long, J. Carroll, and C. Park. A Study of the Reliability of Internet Sites. In Proceedings of the 10th Symposium on Reliable Distributed Systems, pages 177–186, 1991.
- K.I. Mandelberg and V.S. Sunderam. Process Migration in UNIX Networks. In USENIX Technical Conference Proceedings, pages 357–363, Dallas, TX, February 1988.
- Message Passing Interface Forum MPIF. MPI-2: Extensions to the Message-Passing Interface. Technical report, University of Tennessee, Knoxville, July 1997. http://www.mpi-forum.org.
- S. Petri, M. Bolz, and H. Langendörfer. Transparent Migration and Rollback for Unmodified Applications in Workstation Clusters. Informatik-Bericht 98-02, TU Braunschweig, April 1998. To appear.
- S. Petri and H. Langendörfer. Load Balancing and Fault Tolerance in Workstation Clusters — Migrating Groups of Communicating Processes. Operating Systems Review, 29(4):25–36, October 1995.
- S. Petri, B. Schnor, M. Becker, B. Hinrichs, T. Tschamtke, and H. Langendörfer. Evaluation of Multicast Methods to Maintain a Global Name Space for Transparent Process Migration in Workstation Clusters. In Kommunikation in Verteilten Systemen, pages 224–234. GI/ITG Fachtagung KIVS'97, Springer, February 1997.
- S. Petri, B. Schnor, H. Langendörfer, and J. Steinborn. Consistent Global Checkpoints for Distributed Applications on Clusters of Unix Workstations. In Paralleles und Verteiltes Rechnen — Beiträge zum 4. Workshop über Wissenschaftliches Rechnen, pages 77–86, Aachen, October 1996. TU Braunschweig, Shaker.
- T. Shirakihara, H. Hirayama, K. Sato, and T. Kanai. ARTEMIS: Advanced Reliable disTributed Environment Middleware System. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, PDPTA'97, pages 97–106, Las Vegas, NV, July 1997.
- G. Stellner. CoCheck: Checkpointing and Process Migration for MPI. In Proceedings of the 10th International Parallel Processing Symposium (IPPS '96), Honolulu, Hawaii, April 1996.
- Sun Microsystems. SunOS Reference Manual, 1990. Revision A.
- J. Trinitis. An External Checkpointing Technique for Integration into a Parallel Tool Environment. In preparation. email@example.com, 1998.
- J.J.J. Vesseur, R.N. Heederik, B.J. Overeinder, and P.M.A. Sloot. Experiments in Dynamic Load Balancing for Parallel Cluster Computing. In Proceedings of the Workshop on Parallel Programming and Computation (ZEUS'95) and the 4th Nordic Transputer Conference (NTUG'95), pages 189–194, Amsterdam, June 1995. IOS Press.
- Book Title: Parallel and Distributed Processing
- Book Subtitle: 10 IPPS/SPDP'98 Workshops Held in Conjunction with the 12th International Parallel Processing Symposium and 9th Symposium on Parallel and Distributed Processing Orlando, Florida, USA, March 30 – April 3, 1998 Proceedings
- Pages: 159-170
- Series Title: Lecture Notes in Computer Science
- Publisher: Springer Berlin Heidelberg