Efficient message logging for uncoordinated checkpointing protocols

  • Achour Mostefaoui
  • Michel Raynal
Session 8 Replication and Distribution
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1150)


A message is in-transit with respect to a global state if its sending is recorded in this global state, while its receipt is not. Checkpointing algorithms have to log such in-transit messages in order to restore the state of the channels when a computation has to be resumed from a consistent global state after a failure. Coordinated checkpointing algorithms log exactly those in-transit messages on stable storage. Because they lack synchronization, uncoordinated checkpointing algorithms conservatively log more messages.
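The definition of an in-transit message can be illustrated with a minimal sketch. It is not the protocol of the paper: the names `Message`, `Checkpoint`, and `in_transit` are hypothetical, and a global state is assumed to be one checkpoint per process, each recording the identifiers of the messages sent and received before it was taken.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    # Hypothetical message record: an identifier plus its endpoints.
    msg_id: int
    sender: str
    receiver: str


@dataclass
class Checkpoint:
    # Assumption: a checkpoint records the ids of the messages whose
    # send / receipt occurred before the checkpoint was taken.
    sent: set = field(default_factory=set)
    received: set = field(default_factory=set)


def in_transit(messages, global_state):
    """Return the messages whose send is recorded in global_state
    while their receipt is not."""
    result = []
    for m in messages:
        send_recorded = m.msg_id in global_state[m.sender].sent
        receipt_recorded = m.msg_id in global_state[m.receiver].received
        if send_recorded and not receipt_recorded:
            result.append(m)
    return result


# Two processes p and q; message 3 was sent by p before its checkpoint
# but had not yet been received by q when q took its checkpoint.
messages = [Message(1, "p", "q"), Message(2, "q", "p"), Message(3, "p", "q")]
global_state = {
    "p": Checkpoint(sent={1, 3}, received={2}),
    "q": Checkpoint(sent={2}, received={1}),
}
in_transit_ids = [m.msg_id for m in in_transit(messages, global_state)]
print(in_transit_ids)  # prints [3]
```

In this example only message 3 would have to be logged to restore the channel from p to q on recovery; messages 1 and 2 are already reflected in both checkpoints.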

This paper presents an uncoordinated checkpointing protocol that logs all in-transit messages and the smallest possible number of non-in-transit messages. As a consequence, the protocol saves stable storage space and enables quicker recoveries. Appropriate tracking of the causal dependencies between messages constitutes the core of the protocol.


Keywords: Distributed Systems, Backward Recovery, Consistent Global Checkpoints, Optimistic Sender-Based Logging





Copyright information

© Springer-Verlag Berlin Heidelberg 1996

Authors and Affiliations

  • Achour Mostefaoui (1)
  • Michel Raynal (1)
  1. IRISA, Rennes Cedex, France
