A Protocol for Reconciling Recovery and High-Availability in Replicated Databases

  • J. E. Armendáriz-Iñigo
  • F. D. Muñoz-Escoí
  • H. Decker
  • J. R. Juárez-Rodríguez
  • J. R. González de Mendívil
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4263)


We describe a recovery protocol that boosts availability, fault tolerance and performance by enabling failed network nodes to resume an active role immediately after they start recovering. The protocol is designed to work in tandem with middleware-based eager update-everywhere strategies and related group communication systems. The latter provide view synchrony, i.e., knowledge about currently reachable nodes and about the status of messages delivered by faulty and alive nodes. This enables a fast replay of missed updates that defines dynamic database recovery partitions, thus speeding up the recovery of failed nodes. These nodes, together with the rest of the network, may seamlessly continue to process transactions even before their recovery has completed. We specify the protocol in terms of the procedures executed with every message and event of interest and outline a correctness proof.
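
To make the idea of a dynamic recovery partition concrete, the following Python sketch shows one possible reading of the abstract; it is not the protocol specified in the paper. It assumes that each update carries a global version number and a single item identifier, that a rejoining node obtains the log of committed updates at the view change, and that the recovery partition is simply the set of items touched by updates the node has missed. The names `Update`, `RecoveringNode`, `on_view_change`, `can_access` and `replay_next` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Update:
    version: int   # position in the global commit order (assumed)
    item: str      # identifier of the data item written (assumed granularity)
    value: object


@dataclass
class RecoveringNode:
    last_applied: int = 0                          # last version applied before failing
    data: dict = field(default_factory=dict)       # local copy of the database
    partition: set = field(default_factory=set)    # items still awaiting replay
    pending: list = field(default_factory=list)    # missed updates, in version order

    def on_view_change(self, log):
        """On rejoining, derive the recovery partition from the missed updates."""
        self.pending = sorted(
            (u for u in log if u.version > self.last_applied),
            key=lambda u: u.version,
        )
        self.partition = {u.item for u in self.pending}

    def can_access(self, items):
        """A transaction may run now if it touches nothing inside the partition."""
        return self.partition.isdisjoint(items)

    def replay_next(self):
        """Apply one missed update; release the item once no update for it remains."""
        if not self.pending:
            return False
        u = self.pending.pop(0)
        self.data[u.item] = u.value
        self.last_applied = u.version
        if all(p.item != u.item for p in self.pending):
            self.partition.discard(u.item)
        return True
```

Under these assumptions, a node that failed after version 1 and rejoins with a log containing versions 1 to 3 on items `x`, `y`, `x` starts with the partition `{x, y}`. Transactions on other items can proceed at once, and each call to `replay_next` shrinks the partition until the node is fully up to date.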


Recovery Protocol, View Change, Local Transaction, Snapshot Isolation, Alive Node
These keywords were added by machine and not by the authors.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • J. E. Armendáriz-Iñigo (1)
  • F. D. Muñoz-Escoí (2)
  • H. Decker (2)
  • J. R. Juárez-Rodríguez (1)
  • J. R. González de Mendívil (1)

  1. Universidad Pública de Navarra, Pamplona, Spain
  2. Instituto Tecnológico de Informática, Valencia, Spain
