Abstract
In this chapter, we present a message-optimal, non-intrusive checkpointing protocol for nondeterministic message-passing distributed computing systems that does not require global time. Checkpointing in distributed systems can be coordinated, independent, or quasi-synchronous. Coordinated checkpointing is attractive because of its simple recovery, freedom from the domino effect, and optimal stable-storage requirement. Quasi-synchronous checkpointing is also domino-free but may force processes to take multiple checkpoints. Independent checkpointing requires multiple local checkpoints of each node to be kept on stable storage and is subject to the domino effect. Coordinated checkpointing has therefore been found preferable to independent checkpointing: it is domino-free and has minimum storage and performance overheads. Many coordinated checkpointing protocols have been proposed in the literature for distributed computing systems. They can be broadly classified as minimum-process intrusive checkpointing protocols and non-intrusive checkpointing protocols. The protocol presented here is a non-intrusive coordinated checkpointing protocol with the least failure-free overhead. It has optimal communication and storage overheads and requires only O(n) extra messages to take a globally consistent checkpoint. We introduce the concept of “odd” and “even” checkpoint intervals, which replace the checkpoint sequence numbers that are generally piggybacked on each message to avoid orphan messages.
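The idea of tagging messages with an interval parity instead of a full sequence number can be illustrated with a minimal sketch. This is not the chapter's protocol, whose details the abstract does not give. The class names `Message` and `Process`, the forced-checkpoint rule in `receive`, and the assumption that two processes are never more than one checkpoint interval apart are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    payload: str
    parity: int  # 0 = sent in an "even" interval, 1 = sent in an "odd" interval


@dataclass
class Process:
    name: str
    interval: int = 0  # number of checkpoints taken so far (kept for inspection)
    checkpoints: list = field(default_factory=list)
    delivered: list = field(default_factory=list)

    @property
    def parity(self) -> int:
        # Only this single bit is piggybacked on outgoing messages.
        return self.interval % 2

    def take_checkpoint(self) -> None:
        # Save a copy of the current state, then move to the next interval,
        # which flips the parity from even to odd or from odd to even.
        self.checkpoints.append(list(self.delivered))
        self.interval += 1

    def send(self, payload: str) -> Message:
        return Message(payload, self.parity)

    def receive(self, msg: Message) -> None:
        # Assumption: processes are at most one interval apart, so a parity
        # mismatch means the sender has already checkpointed. Checkpointing
        # before delivery keeps the message from becoming an orphan.
        if msg.parity != self.parity:
            self.take_checkpoint()
        self.delivered.append(msg.payload)


p, q = Process("p"), Process("q")
q.receive(p.send("a"))   # same parity: delivered without a checkpoint
p.take_checkpoint()      # p moves from an even to an odd interval
q.receive(p.send("b"))   # parity differs: q checkpoints, then delivers
print(q.checkpoints, q.delivered, q.parity)
```

In this sketch the checkpoint taken by `q` records only message "a", so the receipt of "b" is not captured in a checkpoint that precedes its send. A single bit is enough only under the stated one-interval assumption; without it, a full sequence number would be needed to tell intervals apart.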
Copyright information
© 2002 Springer Science+Business Media New York
About this chapter
Cite this chapter
Kumar, L., Mishra, M., Joshi, R.C. (2002). Checkpointing in Distributed Computing Systems. In: Ezhilchelvan, P., Romanovsky, A. (eds) Concurrency in Dependable Computing. Springer, Boston, MA. https://doi.org/10.1007/978-1-4757-3573-4_14
DOI: https://doi.org/10.1007/978-1-4757-3573-4_14
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4419-5278-3
Online ISBN: 978-1-4757-3573-4
eBook Packages: Springer Book Archive