Gossip-Style Failure Detection and Distributed Consensus for Scalable Heterogeneous Clusters

Cluster Computing

Abstract

Gossip protocols provide a means by which failures can be detected in large, distributed systems in an asynchronous manner without the limits associated with reliable multicasting for group communications. However, in order to be effective with application recovery and reconfiguration, these protocols require mechanisms by which failures can be detected with system-wide consensus in a scalable fashion. This paper presents three new gossip-style protocols supported by a novel algorithm to achieve consensus in scalable, heterogeneous clusters. The round-robin protocol improves on basic randomized gossiping by distributing gossip messages in a deterministic order that optimizes bandwidth consumption. Redundant gossiping is completely eliminated in the binary round-robin protocol, and the round-robin with sequence check protocol is a useful extension that yields efficient detection times without the need for system-specific optimization. The distributed consensus algorithm works with these gossip protocols to achieve agreement among the operable nodes in the cluster on the state of the system featuring either a flat or a layered design. The various protocols are simulated and evaluated in terms of consensus time and scalability using a high-fidelity, fault-injection model for distributed systems comprised of clusters of workstations connected by high-performance networks.
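The abstract does not state the exact message schedules, so the following is only a minimal sketch of how a deterministic gossip order might behave. It assumes, hypothetically, that in round r node i sends to node (i + r) mod n for round-robin, and to node (i + 2^(r-1)) mod n for binary round-robin. It also assumes synchronous rounds and no failures. The function names are illustrative and are not taken from the paper.

```python
from typing import Callable, List, Optional, Set


def round_robin_destination(source: int, round_no: int, n: int) -> int:
    """Assumed schedule: the offset grows by one each round."""
    return (source + round_no) % n


def binary_round_robin_destination(source: int, round_no: int, n: int) -> int:
    """Assumed schedule: the offset doubles each round (1, 2, 4, ...)."""
    return (source + 2 ** (round_no - 1)) % n


def rounds_to_full_dissemination(
    n: int,
    destination: Callable[[int, int, int], int],
    max_rounds: int,
) -> Optional[int]:
    """Count rounds until every node has heard (directly or indirectly)
    from every other node, in a failure-free, synchronous model."""
    # known[i] is the set of node IDs whose state node i has learned.
    known: List[Set[int]] = [{i} for i in range(n)]
    for round_no in range(1, max_rounds + 1):
        # All messages in a round carry the state from the start of the round.
        snapshot = [set(k) for k in known]
        for source in range(n):
            target = destination(source, round_no, n)
            known[target] |= snapshot[source]
        if all(len(k) == n for k in known):
            return round_no
    return None


if __name__ == "__main__":
    for name, schedule in [
        ("round-robin", round_robin_destination),
        ("binary round-robin", binary_round_robin_destination),
    ]:
        print(name, rounds_to_full_dissemination(8, schedule, max_rounds=8))
```

Under these assumptions, the doubling offset means each message carries only state the receiver has not yet seen, which is one way to read the abstract's claim that redundant gossiping is eliminated. With 8 nodes the doubling schedule finishes in 3 rounds, while the unit-step schedule needs 4.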



Cite this article

Ranganathan, S., George, A.D., Todd, R.W. et al. Gossip-Style Failure Detection and Distributed Consensus for Scalable Heterogeneous Clusters. Cluster Computing 4, 197–209 (2001). https://doi.org/10.1023/A:1011494323443
