The Heard-Of model: computing in distributed systems with benign faults

Abstract

Problems in fault-tolerant distributed computing have been studied in a variety of models. These models are structured around two central ideas: (1) degree of synchrony and failure model are two independent parameters that determine a particular type of system, (2) the notion of faulty component is helpful and even necessary for the analysis of distributed computations when faults occur. In this work, we question these two basic principles of fault-tolerant distributed computing, and show that it is both possible and worthwhile to renounce them in the context of benign faults: we present a computational model based only on the notion of transmission faults. In this model, computations evolve in rounds, and messages missed in a round are lost. Only information transmission is represented: for each round r and each process p, our model provides the set of processes that p “hears of” at round r (heard-of set), namely the processes from which p receives some message at round r. The features of a specific system are thus captured as a whole, just by a predicate over the collection of heard-of sets. We show that our model handles benign failures, be they static or dynamic, permanent or transient, in a unified framework. We demonstrate how this approach leads to shorter and simpler proofs of important results (non-solvability, lower bounds). In particular, we prove that the Consensus problem cannot be generally solved without an implicit and permanent consensus on heard-of sets. We also examine Consensus algorithms in our model. In light of this specific agreement problem, we show how our approach allows us to devise interesting new solutions.
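To make the round structure described in the abstract concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not code from the article: the names `kernel`, `is_uniform` and `run_rounds`, the three-process example, and the "take the minimum" update rule are all hypothetical choices. The sketch represents each round as a map from a process p to its heard-of set, evaluates two example predicates over those sets, and runs a toy round-based computation in which a message from q reaches p only if q is in p's heard-of set for that round.

```python
from typing import Dict, List, Set

Process = int
# One round of the computation: maps each process p to HO(p, r),
# the set of processes p hears of in that round.
Round = Dict[Process, Set[Process]]


def kernel(ho_round: Round) -> Set[Process]:
    """Processes heard of by every process in this round (example predicate input)."""
    return set.intersection(*ho_round.values())


def is_uniform(ho_round: Round) -> bool:
    """True when all processes have the same heard-of set in this round."""
    sets = list(ho_round.values())
    return all(s == sets[0] for s in sets)


def run_rounds(init: Dict[Process, int], ho: List[Round]) -> Dict[Process, int]:
    """Toy update rule (an assumption): each process keeps the minimum of its
    own value and the values received from its heard-of set. Messages from
    processes outside the heard-of set are simply lost for that round."""
    state = dict(init)
    for ho_round in ho:
        sent = dict(state)  # every process sends its current value
        state = {
            p: min([sent[p]] + [sent[q] for q in ho_round[p]])
            for p in state
        }
    return state


# Hypothetical three-process run: round 1 loses some messages, round 2 loses none.
init = {0: 5, 1: 3, 2: 8}
ho = [
    {0: {0, 1}, 1: {1}, 2: {0, 2}},
    {0: {0, 1, 2}, 1: {0, 1, 2}, 2: {0, 1, 2}},
]
final = run_rounds(init, ho)
```

In this run the first round has an empty kernel and is not uniform, while the second round is uniform, and all three processes end with the same value. Nothing in the sketch refers to a faulty process or a synchrony assumption: the whole behaviour of the system is described by the list `ho`, which is the point the abstract makes about predicates over heard-of sets.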

Author information

Corresponding author

Correspondence to Bernadette Charron-Bost.

Additional information

A. Schiper’s research funded by the Swiss National Science Foundation under grant number 200021-111701 and Hasler Foundation under grant number 2070.

About this article

Cite this article

Charron-Bost, B., Schiper, A. The Heard-Of model: computing in distributed systems with benign faults. Distrib. Comput. 22, 49–71 (2009). https://doi.org/10.1007/s00446-009-0084-6

Keywords

  • Consensus
  • Benign fault
  • Computational model
  • HO (Heard-Of) model
  • Transmission fault