Dependable Systems

  • André Schiper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4028)


Improving the dependability of computer systems is a critical task. In this context, the paper surveys techniques for achieving fault tolerance in distributed systems through replication. The main replication techniques are explained first. Group communication is then introduced as the communication infrastructure on which the different replication techniques can be implemented. Finally, the difficulty of implementing group communication is discussed, and the most important algorithms are presented.


Keywords: Group Communication, Correct Process, Failure Detector, Dependable System, Active Replication





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • André Schiper, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
