Cluster Computing, Volume 3, Issue 2, pp 63–73

Heterogeneous process state capture and recovery through Process Introspection

  • Adam Ferrari
  • Steve J. Chapin
  • Andrew Grimshaw


The ability to capture the state of a process and later recover that state in the form of an equivalent running process is the basis for a number of important features in parallel and distributed systems. Adaptive load sharing and fault tolerance are well-known examples. Traditional state capture mechanisms have employed an external agent (such as the operating system kernel) to examine and capture process state. However, the increasing prevalence of heterogeneous cluster and “metacomputing” systems as high-performance computing platforms has prompted investigation of process-internal state capture mechanisms. Perhaps the greatest advantage of the process-internal approach is the ability to support cross-platform state capture and recovery, an important feature in heterogeneous environments. Among the perceived disadvantages of existing process-internal mechanisms are poor performance in multiple respects, and difficulty of use in terms of programmer effort. In this paper we describe a new process-internal state capture and recovery mechanism: Process Introspection. Experiences with this system indicate that the perceived disadvantages associated with process-internal mechanisms can be largely overcome, making this approach to state capture an appropriate one for cluster and metacomputing environments.


Mobile Agent; Function Call; Memory Block; State Recovery; Code Location
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Kluwer Academic Publishers 2000

Authors and Affiliations

  • Adam Ferrari (1)
  • Steve J. Chapin (2)
  • Andrew Grimshaw (1)

  1. University of Virginia, Charlottesville, USA
  2. Syracuse University, Syracuse, USA
