Gossip-Style Failure Detection and Distributed Consensus for Scalable Heterogeneous Clusters

Ranganathan, Sridharan; George, Alan D.; Todd, Robert W.; Chidester, Matthew C.

doi:10.1023/A:1011494323443


Published: July 2001

Volume 4, pages 197–209, (2001)

Cluster Computing


Abstract

Gossip protocols provide a means by which failures can be detected in large, distributed systems in an asynchronous manner without the limits associated with reliable multicasting for group communications. However, in order to be effective with application recovery and reconfiguration, these protocols require mechanisms by which failures can be detected with system-wide consensus in a scalable fashion. This paper presents three new gossip-style protocols supported by a novel algorithm to achieve consensus in scalable, heterogeneous clusters. The round-robin protocol improves on basic randomized gossiping by distributing gossip messages in a deterministic order that optimizes bandwidth consumption. Redundant gossiping is completely eliminated in the binary round-robin protocol, and the round-robin with sequence check protocol is a useful extension that yields efficient detection times without the need for system-specific optimization. The distributed consensus algorithm works with these gossip protocols to achieve agreement among the operable nodes in the cluster on the state of the system featuring either a flat or a layered design. The various protocols are simulated and evaluated in terms of consensus time and scalability using a high-fidelity, fault-injection model for distributed systems comprised of clusters of workstations connected by high-performance networks.
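The difference between the round-robin and binary round-robin orderings can be illustrated with a small simulation. The sketch below is not taken from the article: the abstract gives no destination formulas, so it assumes that in round r a node with ID s gossips to (s + r) mod n under round-robin and to (s + 2^(r-1)) mod n under binary round-robin. It further assumes synchronous rounds, no failures, and a node count that is a power of two. All function names are hypothetical.

```python
def round_robin_dest(src: int, rnd: int, n: int) -> int:
    # Assumed rule (not from the abstract): the offset grows by one each round.
    return (src + rnd) % n


def binary_round_robin_dest(src: int, rnd: int, n: int) -> int:
    # Assumed rule (not from the abstract): the offset doubles each round.
    return (src + 2 ** (rnd - 1)) % n


def rounds_to_full_dissemination(n: int, dest_fn) -> int:
    """Count synchronous, failure-free rounds until every node has
    heard (directly or indirectly) from every other node."""
    # known[i] is the set of node IDs whose state node i has learned so far.
    known = [{i} for i in range(n)]
    rnd = 0
    while any(len(k) < n for k in known):
        rnd += 1
        # All nodes send at once, so each message carries pre-round state.
        snapshot = [set(k) for k in known]
        for src in range(n):
            known[dest_fn(src, rnd, n)] |= snapshot[src]
    return rnd


if __name__ == "__main__":
    n = 8
    print("round-robin rounds:", rounds_to_full_dissemination(n, round_robin_dest))
    print("binary round-robin rounds:",
          rounds_to_full_dissemination(n, binary_round_robin_dest))
```

Under these assumptions, each binary round-robin message carries only information the receiver has not yet seen, so an 8-node system is fully informed after log2(8) = 3 rounds. The simulation models only information spread; it does not model timeouts, suspicion, or the consensus step described in the abstract.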



Author information

Authors and Affiliations

High-performance Computing and Simulation (HCS) Research Laboratory, Department of Electrical and Computer Engineering, University of Florida, PO Box 116200, Gainesville, FL, 32611-6200, USA
Sridharan Ranganathan, Alan D. George, Robert W. Todd & Matthew C. Chidester


About this article

Cite this article

Ranganathan, S., George, A.D., Todd, R.W. et al. Gossip-Style Failure Detection and Distributed Consensus for Scalable Heterogeneous Clusters. Cluster Computing 4, 197–209 (2001). https://doi.org/10.1023/A:1011494323443


Issue Date: July 2001
DOI: https://doi.org/10.1023/A:1011494323443
