A Data-Centric Approach for Scalable State Machine Replication

  • Gregory Chockler
  • Dahlia Malkhi
  • Danny Dolev
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2584)


Data replication is a key design principle for achieving reliability, high availability, survivability, and load balancing in distributed computing systems. The common denominator of all existing replication systems is the need to keep replicas consistent. The main paradigm for supporting replicated data is active replication, in which replicas execute the same sequence of methods on the object in order to remain consistent. This paradigm led to the definition of State Machine Replication (SMR) [29.8], [29.13]. The necessary building block of SMR is an engine that delivers operations at each site in the same total order without gaps, thus keeping the replica states consistent.
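
The gap-free, totally ordered delivery described above can be illustrated with a minimal sketch. It is not the authors' protocol: it assumes that some ordering engine (not modeled here) has already assigned a sequence number to every operation, and it shows only how a replica applies operations contiguously. The names `Replica`, `deliver`, `next_seq`, and `run_demo` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

# Assumption: an operation is a deterministic function from state to state.
Operation = Callable[[int], int]


@dataclass
class Replica:
    """Hypothetical replica that applies operations in sequence order, without gaps."""
    state: int = 0
    next_seq: int = 0  # sequence number of the next operation to apply
    pending: Dict[int, Operation] = field(default_factory=dict)

    def deliver(self, seq: int, op: Operation) -> None:
        """Buffer an ordered operation, then apply every operation that is now contiguous."""
        if seq < self.next_seq:
            return  # duplicate of an operation that was already applied
        self.pending[seq] = op
        # Apply strictly in order; stop at the first missing sequence number.
        while self.next_seq in self.pending:
            self.state = self.pending.pop(self.next_seq)(self.state)
            self.next_seq += 1


def run_demo() -> Tuple[Replica, Replica]:
    """Deliver the same ordered operations to two replicas in different arrival orders."""
    ops = {0: lambda s: s + 5, 1: lambda s: s * 3, 2: lambda s: s - 4}
    a, b = Replica(), Replica()
    for seq in (0, 1, 2):   # replica a receives the operations in order
        a.deliver(seq, ops[seq])
    for seq in (2, 0, 1):   # replica b receives them out of order
        b.deliver(seq, ops[seq])
    return a, b


if __name__ == "__main__":
    a, b = run_demo()
    print(a.state, b.state)
```

Both replicas end in the same state because each one applies the operations by sequence number rather than by arrival order, and a replica that is missing an earlier operation buffers later ones instead of skipping ahead.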




References

  [29.1] R. Boichat, P. Dutta, S. Frolund and R. Guerraoui. Deconstructing Paxos. Technical Report DSC ID:200106, Communication Systems Department (DSC), École Polytechnique Fédérale de Lausanne (EPFL), January 2001. Available at http://dscwww.epfl.ch/EN/publications/documents/tr01006.pdf.
  [29.2] T. D. Chandra and S. Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM 43(2):225–267, March 1996.
  [29.3] G. Chockler and D. Malkhi. Active disk Paxos with infinitely many processes. In Proceedings of the 21st ACM Symposium on Principles of Distributed Computing (PODC’02), July 2002. To appear.
  [29.4] G. Chockler, D. Malkhi and M. K. Reiter. Backoff protocols for distributed mutual exclusion and ordering. In Proceedings of the 21st International Conference on Distributed Computing Systems, pages 11–20, April 2001.
  [29.5] M. J. Fischer, N. A. Lynch, and M. S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM 32(2):374–382, April 1985.
  [29.6] E. Gafni and L. Lamport. Disk Paxos. In Proceedings of the 14th International Symposium on Distributed Computing (DISC’2000), pages 330–344, October 2000.
  [29.7] P. Jayanti, T. Chandra, and S. Toueg. Fault-tolerant wait-free shared objects. Journal of the ACM 45(3):451–500, May 1998.
  [29.8] L. Lamport. Time, clocks, and the ordering of events in distributed systems. Communications of the ACM 21(7):558–565, July 1978.
  [29.9] L. Lamport. The part-time parliament. ACM Transactions on Computer Systems 16(2):133–169, May 1998.
  [29.10] W. K. Lo and V. Hadzilacos. Using failure detectors to solve consensus in asynchronous shared-memory systems. In Proceedings of the 8th International Workshop on Distributed Algorithms (WDAG), Springer-Verlag LNCS 857:280–295, Berlin, 1994.
  [29.11] D. Malkhi and M. K. Reiter. An architecture for survivable coordination in large-scale systems. IEEE Transactions on Knowledge and Data Engineering 12(2):187–202, March/April 2000.
  [29.12] J. P. Martin, L. Alvisi and M. Dahlin. Minimal Byzantine Storage. In Proceedings of the 16th International Conference on DIStributed Computing (DISC’02), pages 311–325, October 2002.
  [29.13] F. B. Schneider. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Computing Surveys 22(4):299–319, December 1990.

Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Gregory Chockler (1, 2)
  • Dahlia Malkhi (1)
  • Danny Dolev (1)
  1. School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel
  2. IBM Haifa Research Labs (Tel-Aviv Annex), Haifa
