Abstract
Gossip protocols detect failures in large distributed systems asynchronously, without the limits associated with reliable multicasting for group communications. However, to support application recovery and reconfiguration effectively, these protocols require mechanisms by which failures can be detected with system-wide consensus in a scalable fashion. This paper presents three new gossip-style protocols supported by a novel algorithm to achieve consensus in scalable, heterogeneous clusters. The round-robin protocol improves on basic randomized gossiping by distributing gossip messages in a deterministic order that optimizes bandwidth consumption. The binary round-robin protocol eliminates redundant gossiping completely, and the round-robin with sequence check protocol is a useful extension that yields efficient detection times without the need for system-specific optimization. The distributed consensus algorithm works with these gossip protocols to achieve agreement among the operable nodes in the cluster on the state of the system, whether it has a flat or a layered design. The various protocols are simulated and evaluated in terms of consensus time and scalability using a high-fidelity, fault-injection model for distributed systems composed of clusters of workstations connected by high-performance networks.
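The deterministic ordering behind the round-robin and binary round-robin protocols can be illustrated with a small sketch. The abstract does not state the ordering rules, so the two destination formulas below are assumptions: in round r, a node sends to its own ID plus r (round-robin) or plus 2^(r-1) (binary round-robin), modulo the node count n. The function names, the synchronous-round simplification, and the power-of-two cluster sizes are all hypothetical choices made for illustration, not the paper's simulation model.

```python
from typing import Callable, List, Set


def round_robin_dest(source: int, rnd: int, n: int) -> int:
    """Assumed rule: in round rnd, node `source` gossips to source + rnd (mod n)."""
    return (source + rnd) % n


def binary_round_robin_dest(source: int, rnd: int, n: int) -> int:
    """Assumed rule: the offset doubles every round, source + 2**(rnd - 1) (mod n)."""
    return (source + 2 ** (rnd - 1)) % n


def rounds_to_full_knowledge(n: int, dest: Callable[[int, int, int], int]) -> int:
    """Count synchronous rounds until every node has heard from every other node.

    Each node starts out knowing only itself. In every round, all nodes send
    what they knew at the start of that round to the destination chosen by `dest`.
    """
    if n < 2:
        raise ValueError("need at least two nodes")
    known: List[Set[int]] = [{i} for i in range(n)]
    for rnd in range(1, n):
        # Snapshot so that messages carry start-of-round state only.
        snapshot = [set(s) for s in known]
        for src in range(n):
            known[dest(src, rnd, n)] |= snapshot[src]
        if all(len(s) == n for s in known):
            return rnd
    raise RuntimeError("information did not reach every node")


if __name__ == "__main__":
    for size in (8, 16):
        print(size,
              rounds_to_full_knowledge(size, round_robin_dest),
              rounds_to_full_knowledge(size, binary_round_robin_dest))
```

Under these assumptions, the doubling offset means each message a node receives covers a block of nodes it has not heard about yet, which is one way to read the claim that redundant gossiping is eliminated. With the plain incrementing offset, successive messages can overlap in content.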
Cite this article
Ranganathan, S., George, A.D., Todd, R.W. et al. Gossip-Style Failure Detection and Distributed Consensus for Scalable Heterogeneous Clusters. Cluster Computing 4, 197–209 (2001). https://doi.org/10.1023/A:1011494323443
DOI: https://doi.org/10.1023/A:1011494323443