Abstract
Harness is an adaptable, fault-tolerant virtual machine environment for next-generation heterogeneous distributed computing, developed as a follow-on to PVM. It additionally enables the assembly of applications from plug-ins and provides fault tolerance. This work describes the distributed control, which manages global state replication to ensure high availability of service. Group communication services achieve agreement on an initial global state and on a linear history of global state changes at all members of the distributed virtual machine. This global state is replicated to all members so that the system can recover easily from single, multiple, and cascaded faults. A peer-to-peer ring network architecture and tunable multi-point failure conditions provide heterogeneity and scalability. Finally, the integration of the distributed control into the multi-threaded kernel architecture of Harness offers a fault-tolerant global state database service for plug-ins and applications.
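The idea of replicating a global state through a linear history of changes passed around a ring can be illustrated with a small sketch. This is not the Harness protocol: the names `Member`, `Ring`, `update`, and `fail` are hypothetical, everything runs in one process, and failure detection, concurrency, and the agreement step itself are assumed away. It only shows why every surviving member can still serve the full state after several others have failed.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    """One member of a hypothetical distributed virtual machine."""
    name: str
    state: dict = field(default_factory=dict)    # replicated global state
    history: list = field(default_factory=list)  # linear history of changes
    alive: bool = True

    def apply(self, seq, key, value):
        # Accept a change only if it is the next one in the agreed order.
        if seq != len(self.history):
            raise ValueError(
                f"{self.name}: expected seq {len(self.history)}, got {seq}"
            )
        self.history.append((key, value))
        self.state[key] = value


class Ring:
    """Members arranged in a ring; failed members are skipped (assumption)."""

    def __init__(self, names):
        self.members = [Member(n) for n in names]

    def live(self):
        return [m for m in self.members if m.alive]

    def update(self, origin, key, value):
        """Pass one state change once around the ring, starting at origin."""
        live = self.live()
        start = next(i for i, m in enumerate(live) if m.name == origin)
        seq = len(live[start].history)
        for step in range(len(live)):
            live[(start + step) % len(live)].apply(seq, key, value)
        return seq

    def fail(self, name):
        # Mark a member as failed; its replica is simply no longer used.
        for m in self.members:
            if m.name == name:
                m.alive = False
```

In this sketch, a ring of four members that loses two of them still holds identical state and history at both survivors, because each change was applied everywhere in the same order before the failures occurred.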
This research was supported in part by an appointment to the ORNL Postmasters Research Participation Program which is sponsored by Oak Ridge National Laboratory and administered jointly by Oak Ridge National Laboratory and by the Oak Ridge Institute for Science and Education under contract numbers DE-AC05-84OR21400 and DE-AC05-76OR00033, respectively.
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Engelmann, C., Scott, S.L., Geist, G.A. (2002). Distributed Peer-to-Peer Control in Harness. In: Sloot, P.M.A., Hoekstra, A.G., Tan, C.J.K., Dongarra, J.J. (eds) Computational Science — ICCS 2002. ICCS 2002. Lecture Notes in Computer Science, vol 2330. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46080-2_76
DOI: https://doi.org/10.1007/3-540-46080-2_76
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43593-8
Online ISBN: 978-3-540-46080-0
eBook Packages: Springer Book Archive