Semantics of recovery lines for backward recovery in distributed systems

Brzeziński, Jerzy; Helary, Jean-Michel; Raynal, Michel

doi:10.1007/BF03005244

SÉMANTIQUE DES ÉTATS DE REPRISE POUR LE RÉTABLISSEMENT DANS LES SYSTÈMES DISTRIBUÉS

Published: November 1995

Volume 50, pages 874–887, (1995)
Annales Des Télécommunications

Jerzy Brzeziński¹,
Jean-Michel Helary² &
Michel Raynal²

Abstract

This paper addresses the definition of recovery lines in the context of backward recovery, whose aim is to cope with failures in distributed systems. A general framework that allows for several semantics of recovery lines is introduced. Key notions such as missing messages and orphan messages are precisely defined, and their impact on the definition of consistency of recovery lines is carefully analyzed. Basic mechanisms such as local checkpointing, message identification and (optimistic or pessimistic) message logging are then discussed as an illustration of (coordinated or uncoordinated) checkpointing protocols.
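As a rough illustration of the two key notions, the sketch below classifies messages against a candidate recovery line. It assumes the definitions common in the checkpointing literature, which may differ in detail from the paper's own: a message is an orphan when its receipt is recorded in the line but its sending is not, and it is missing when its sending is recorded but its receipt is not. The `Message` type, the `classify` function and the event-index encoding are hypothetical names chosen for this example only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    send_seq: int   # index of the send event in the sender's local history
    receiver: str
    recv_seq: int   # index of the receive event in the receiver's local history


def classify(messages, line):
    """Split messages into (orphans, missing) relative to a recovery line.

    `line` maps each process name to the number of local events covered by
    its checkpoint: an event with index i is recorded iff i < line[process].
    This encoding is an assumption made for the sketch.
    """
    orphans, missing = [], []
    for m in messages:
        sent = m.send_seq < line[m.sender]
        received = m.recv_seq < line[m.receiver]
        if received and not sent:
            orphans.append(m)    # receipt recorded, sending rolled back
        elif sent and not received:
            missing.append(m)    # sending recorded, receipt rolled back
    return orphans, missing


# Three messages exchanged between processes p and q.
m1 = Message("p", 0, "q", 1)
m2 = Message("q", 2, "p", 3)
m3 = Message("p", 4, "q", 5)

orphans_a, missing_a = classify([m1, m2, m3], {"p": 4, "q": 2})
orphans_b, missing_b = classify([m1, m2, m3], {"p": 5, "q": 2})
```

With the line `{"p": 4, "q": 2}`, `m2` is an orphan because p's checkpoint records its receipt while q's checkpoint precedes its sending. Moving p's checkpoint to 5 additionally makes `m3` a missing message. Whether a line containing such messages counts as consistent depends on the chosen semantics, for example on whether message logging allows missing messages to be replayed.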

Résumé

This article is devoted to the definition of recovery lines in the context of backward recovery techniques, used to handle failures in distributed systems. A general framework is introduced that allows several semantics of recovery lines to be considered. Key notions such as missing messages and orphan messages are precisely defined, and their influence on the definition of the consistency of a recovery line is carefully analyzed. Basic tools, such as local checkpoints, (optimistic or pessimistic) message logging and re-execution, are introduced to illustrate recovery algorithms, whether coordinated or not.

Author information

Authors and Affiliations

Institute of Computing Science, Poznan University of Technology, 60-965, Poznań, Poland
Jerzy Brzeziński
IRISA, Campus de Beaulieu, F-35042, Rennes Cedex, France
Jean-Michel Helary & Michel Raynal

Additional information

This work was partially supported by CNET-France Télécom as part of the Cesarne project under grant 92 1B 178.

About this article

Cite this article

Brzeziński, J., Helary, JM. & Raynal, M. Semantics of recovery lines for backward recovery in distributed systems. Ann. Télécommun. 50, 874–887 (1995). https://doi.org/10.1007/BF03005244

Received: 25 April 1995
Accepted: 06 October 1995
Issue Date: November 1995
DOI: https://doi.org/10.1007/BF03005244
