Distributed Peer-to-Peer Control in Harness

  • C. Engelmann
  • S. L. Scott
  • G. A. Geist
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2330)

Abstract

Harness is an adaptable, fault-tolerant virtual machine environment for next-generation heterogeneous distributed computing, developed as a follow-on to PVM. It additionally enables the assembly of applications from plug-ins and provides fault tolerance. This work describes the distributed control, which manages global state replication to ensure high availability of service. Group communication services achieve agreement on an initial global state and on a linear history of global state changes at all members of the distributed virtual machine. This global state is replicated to all members so that the system can recover easily from single, multiple, and cascaded faults. A peer-to-peer ring network architecture and tunable multi-point failure conditions provide heterogeneity and scalability. Finally, the integration of the distributed control into the multi-threaded kernel architecture of Harness offers a fault-tolerant global state database service for plug-ins and applications.
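
The idea of replicating a global state through a single ordered history of changes can be illustrated with a small toy model. The sketch below is not the Harness protocol or its API: the names Member, Ring, broadcast, and fail are hypothetical, the ring is simulated sequentially in one process, and failure handling is reduced to dropping a member from the ring. It only shows why applying the same changes in the same order keeps every replica identical, including after a member fails.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    """One member holding a full replica of the global state (hypothetical)."""
    name: str
    state: dict = field(default_factory=dict)
    history: list = field(default_factory=list)  # ordered (key, value) changes

    def apply(self, change):
        # Record the change in the linear history, then update the replica.
        key, value = change
        self.history.append(change)
        self.state[key] = value


class Ring:
    """Toy ring of members; each change travels once around the ring."""

    def __init__(self, names):
        self.members = [Member(n) for n in names]

    def broadcast(self, origin, change):
        # Start at the origin and hand the change to each successor in turn.
        # Broadcasts are serialized here, so all members see the same order.
        start = next(i for i, m in enumerate(self.members) if m.name == origin)
        n = len(self.members)
        for step in range(n):
            self.members[(start + step) % n].apply(change)

    def fail(self, name):
        # Assumption: a failed member is simply removed and its neighbours
        # become adjacent; real failure detection is out of scope.
        self.members = [m for m in self.members if m.name != name]

    def consistent(self):
        # All surviving replicas must share the same history and state.
        first = self.members[0]
        return all(m.history == first.history and m.state == first.state
                   for m in self.members)


ring = Ring(["a", "b", "c", "d"])
ring.broadcast("a", ("plugin", "loaded"))
ring.fail("c")
ring.broadcast("d", ("plugin", "unloaded"))
```

After these calls the three surviving members hold the same two-entry history, so any of them could serve the current global state. A real system would also need concurrent senders, message loss handling, and membership agreement, none of which this sketch attempts.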

Keywords

Virtual Machine; Global State; Message Passing Interface; Execution Result; Vote Algorithm
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • C. Engelmann (1)
  • S. L. Scott (1)
  • G. A. Geist (1)
  1. Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, USA