Abstract
This paper addresses the definition of recovery lines in the context of backward recovery, whose aim is to cope with failures in distributed systems. A general framework that allows for several semantics of recovery lines is introduced. Key notions such as missing messages and orphan messages are precisely defined, and their impact on the definition of consistency of recovery lines is carefully analyzed. Basic mechanisms such as local checkpointing, message identification and (optimistic or pessimistic) message logging are then discussed as an illustration of (coordinated or uncoordinated) checkpointing protocols.
Résumé (translated from French)
This article is devoted to the definition of recovery lines in the context of backward recovery techniques, which are used to handle failures in distributed systems. A general framework is introduced that allows several semantics of recovery lines to be considered. Key notions such as missing messages and orphan messages are defined precisely, and their influence on the definition of the consistency of a recovery line is carefully analyzed. Basic tools, such as local checkpoints, (optimistic or pessimistic) message logging and replay, are introduced to illustrate coordinated and uncoordinated recovery algorithms.
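As a rough illustration of the two key notions named in the abstract, the following Python sketch classifies messages relative to a candidate recovery line. It assumes the commonly used reading of the terms: a message is an orphan when its receipt is recorded in the recovery line but its sending is not, and missing when its sending is recorded but its receipt is not. The paper's own definitions may be more refined. The names `Message`, `classify`, the process names and the event positions are all hypothetical and do not come from the article.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Message:
    """A message with the positions of its send and receive events.

    Positions are 1-based indices into each process's local event history
    (an assumption made for this sketch).
    """
    msg_id: str
    sender: str
    send_pos: int
    receiver: str
    recv_pos: int


def classify(messages: List[Message],
             line: Dict[str, int]) -> Tuple[List[str], List[str]]:
    """Return (orphans, missing) for a candidate recovery line.

    `line` maps each process to the number of local events retained by
    the local checkpoint chosen for that process.
    """
    orphans: List[str] = []
    missing: List[str] = []
    for m in messages:
        sent = m.send_pos <= line[m.sender]        # send event is kept
        received = m.recv_pos <= line[m.receiver]  # receive event is kept
        if received and not sent:
            orphans.append(m.msg_id)   # received, but never sent after rollback
        elif sent and not received:
            missing.append(m.msg_id)   # sent, but receipt is undone by rollback
    return orphans, missing


# Hypothetical two-process execution.
messages = [
    Message("m1", "P1", 1, "P2", 1),
    Message("m2", "P1", 2, "P2", 3),
    Message("m3", "P2", 2, "P1", 3),
]

orphans, missing = classify(messages, {"P1": 3, "P2": 1})
print(orphans, missing)
```

With the line `{"P1": 3, "P2": 1}`, `m3` is reported as an orphan and `m2` as missing. Whether such a line counts as consistent depends on the semantics chosen; a missing message can typically be replayed from a message log, which is where the logging mechanisms mentioned in the abstract come in.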
Additional information
This work has been (partially) supported by CNET-France Télécom as part of the Cesarne project under grant 92 1B 178.
Cite this article
Brzeziński, J., Helary, JM. & Raynal, M. Semantics of recovery lines for backward recovery in distributed systems. Ann. Télécommun. 50, 874–887 (1995). https://doi.org/10.1007/BF03005244