
Semantics of recovery lines for backward recovery in distributed systems

SÉMANTIQUE DES ÉTATS DE REPRISE POUR LE RÉTABLISSEMENT DANS LES SYSTÈMES DISTRIBUÉS

Annales Des Télécommunications

Abstract

This paper addresses the definition of recovery lines in the context of backward recovery, whose aim is to cope with failures in distributed systems. A general framework that allows for several semantics of recovery lines is introduced. Key notions such as missing messages and orphan messages are precisely defined, and their impact on the definition of consistency of recovery lines is carefully analyzed. Basic mechanisms such as local checkpointing, message identification and (optimistic or pessimistic) message logging are then discussed as an illustration of (coordinated or uncoordinated) checkpointing protocols.
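The notions of orphan and missing messages can be illustrated with a small sketch. It does not reproduce the paper's formal framework, which the abstract does not spell out. It assumes the commonly used definitions: a message is orphan with respect to a set of local checkpoints if its receipt is recorded but its sending is not, and missing if its sending is recorded but its receipt is not. The names `Message`, `classify` and `is_consistent`, and the encoding of a checkpoint as a count of recorded local events, are hypothetical choices made for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    send_pos: int   # index of the send event in the sender's local history
    receiver: str
    recv_pos: int   # index of the receive event in the receiver's local history


def classify(messages, line):
    """Return (orphan, missing) messages relative to a candidate recovery line.

    `line` maps each process name to the number of local events that its
    checkpoint records (an assumed encoding, not taken from the paper).
    """
    orphan, missing = [], []
    for m in messages:
        send_recorded = m.send_pos < line[m.sender]
        recv_recorded = m.recv_pos < line[m.receiver]
        if recv_recorded and not send_recorded:
            orphan.append(m)    # received in the line, but never sent in it
        elif send_recorded and not recv_recorded:
            missing.append(m)   # sent in the line, but not yet received in it
    return orphan, missing


def is_consistent(messages, line, logged=frozenset()):
    """One possible semantics: no orphan message, and every missing
    message is available in a message log so that it can be replayed."""
    orphan, missing = classify(messages, line)
    return not orphan and all(m in logged for m in missing)


# Two processes p and q exchanging two messages.
m1 = Message("p", 0, "q", 0)
m2 = Message("q", 1, "p", 1)
msgs = [m1, m2]

line_with_orphan = {"p": 2, "q": 1}    # p recorded the receipt of m2, q did not record its sending
line_with_missing = {"p": 1, "q": 0}   # p recorded the sending of m1, q did not record its receipt
```

Under these assumptions, `line_with_orphan` is rejected whatever is logged, whereas `line_with_missing` becomes acceptable once `m1` is logged. This is one way to see why different logging mechanisms lead to different semantics of recovery lines.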

Résumé

This article is devoted to the definition of recovery lines in the context of backward recovery techniques, which are used to handle failures in distributed systems. A general framework is introduced that allows several semantics of recovery lines to be considered. Key notions such as missing messages and orphan messages are precisely defined, and their influence on the definition of the consistency of a recovery line is carefully analyzed. Basic tools, such as local checkpoints, (optimistic or pessimistic) message logging, and re-execution, are introduced to illustrate recovery algorithms, whether coordinated or not.




Additional information

This work has been (partially) supported by CNET-France Télécom as part of the Cesarne project under grant 92 1B 178.


About this article

Cite this article

Brzeziński, J., Helary, JM. & Raynal, M. Semantics of recovery lines for backward recovery in distributed systems. Ann. Télécommun. 50, 874–887 (1995). https://doi.org/10.1007/BF03005244


  • DOI: https://doi.org/10.1007/BF03005244
