Heterogeneous process state capture and recovery through Process Introspection

Cluster Computing

Abstract

The ability to capture the state of a process and later recover that state in the form of an equivalent running process is the basis for a number of important features in parallel and distributed systems. Adaptive load sharing and fault tolerance are well-known examples. Traditional state capture mechanisms have employed an external agent (such as the operating system kernel) to examine and capture process state. However, the increasing prevalence of heterogeneous cluster and “metacomputing” systems as high-performance computing platforms has prompted investigation of process-internal state capture mechanisms. Perhaps the greatest advantage of the process-internal approach is the ability to support cross-platform state capture and recovery, an important feature in heterogeneous environments. Among the perceived disadvantages of existing process-internal mechanisms are poor performance in multiple respects, and difficulty of use in terms of programmer effort. In this paper we describe a new process-internal state capture and recovery mechanism: Process Introspection. Experiences with this system indicate that the perceived disadvantages associated with process-internal mechanisms can be largely overcome, making this approach to state capture an appropriate one for cluster and metacomputing environments.
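The process-internal idea described above can be illustrated with a minimal sketch. This is not the Process Introspection API from the paper: the function `run_sum`, the `checkpoint_at` parameter, and the `STATE_FORMAT` layout are hypothetical names chosen for illustration. The sketch assumes the computation's logical state is just a loop index and an accumulator, and that a fixed big-endian layout is an acceptable stand-in for a platform-independent state format.

```python
import struct

# Hypothetical architecture-neutral layout (an assumption, not the paper's format):
# big-endian 64-bit signed integer followed by a big-endian IEEE 754 double.
STATE_FORMAT = "!qd"


def run_sum(n, state=None, checkpoint_at=None):
    """Sum i * 0.5 for i in range(n), with an optional self-initiated capture.

    Returns ("done", total) when the computation finishes, or
    ("captured", state_bytes) when the process captures its own state.
    """
    # Recovery: rebuild the local variables from a previously captured state.
    if state is not None:
        i, total = struct.unpack(STATE_FORMAT, state)
    else:
        i, total = 0, 0.0

    while i < n:
        # Poll point: the process itself checks whether a capture is requested,
        # rather than an external agent inspecting its memory image.
        if checkpoint_at is not None and i == checkpoint_at:
            return "captured", struct.pack(STATE_FORMAT, i, total)
        total += i * 0.5
        i += 1

    return "done", total


# Capture partway through, then recover and finish in a "new" process.
status, blob = run_sum(10, checkpoint_at=4)
resumed_status, resumed_total = run_sum(10, state=blob)
```

Because the captured bytes describe logical values rather than a raw memory image, the same `blob` could in principle be unpacked on a machine with a different byte order or word size, which is the cross-platform property the abstract highlights.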



Cite this article

Ferrari, A., Chapin, S.J. & Grimshaw, A. Heterogeneous process state capture and recovery through Process Introspection. Cluster Computing 3, 63–73 (2000). https://doi.org/10.1023/A:1019067801346

  • DOI: https://doi.org/10.1023/A:1019067801346
