
Part of the book series: Lecture Notes in Computer Science (LNCS, volume 4028)

Abstract

Improving the dependability of computer systems is a critical task. In this context, the paper surveys techniques for achieving fault tolerance in distributed systems through replication. The main replication techniques are explained first. Group communication is then introduced as the communication infrastructure on which the different replication techniques can be implemented. Finally, the difficulty of implementing group communication is discussed, and the most important algorithms are presented.

A nearly identical paper appears under the title "Group Communication: from practice to theory" in Proceedings of SOFSEM 2006: Theory and Practice of Computer Science, Merin, Czech Republic, January 2006, Springer, LNCS 383, pages 117-137, 2006.

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Schiper, A. (2006). Dependable Systems. In: Kohlas, J., Meyer, B., Schiper, A. (eds) Dependable Systems: Software, Computing, Networks. Lecture Notes in Computer Science, vol 4028. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11808107_2

  • DOI: https://doi.org/10.1007/11808107_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-36821-2

  • Online ISBN: 978-3-540-36823-6

  • eBook Packages: Computer Science, Computer Science (R0)
