Checkpointing in Distributed Computing Systems

Kumar, Lalit; Mishra, Manoj; Joshi, Ramesh Chander

doi:10.1007/978-1-4757-3573-4_14

Abstract

In this chapter, we present a message-optimal, non-intrusive checkpointing protocol for nondeterministic message-passing distributed computing systems that does not require global time. Checkpointing in distributed systems can be coordinated, independent, or quasi-synchronous. Coordinated checkpointing is attractive because recovery is simple, it is free of the domino effect, and it needs the least stable storage. Quasi-synchronous checkpointing is also domino-free but may force processes to take multiple checkpoints. Independent checkpointing requires several local checkpoints of each node to be kept on stable storage and is susceptible to the domino effect. Coordinated checkpointing is therefore preferred over independent checkpointing: it is domino-free and has the lowest storage and performance overheads. Many coordinated checkpointing protocols have been proposed in the literature for distributed computing systems. They can be broadly classified as minimum-process intrusive checkpointing protocols and non-intrusive checkpointing protocols. The non-intrusive coordinated checkpointing protocol presented here has the least failure-free overhead, and its communication and storage overheads are optimal: it requires only O(n) extra messages to take a consistent global checkpoint. We introduce the concept of “odd” and “even” checkpoint intervals, which replace the checkpoint sequence numbers that are generally piggybacked on each message to avoid orphan messages.
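The abstract does not give the protocol itself, so the following is only a minimal sketch of how a one-bit "odd/even" interval tag could stand in for a piggybacked checkpoint sequence number. The `Process` class, its fields, and the `tentative` flag are hypothetical names introduced for illustration, and the sketch assumes that checkpoint rounds never overlap and that a round is committed only after every process has taken its checkpoint. It should not be read as the authors' algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical process that tags messages with a one-bit interval parity."""
    pid: int
    parity: int = 0            # 0 = even interval, 1 = odd interval
    tentative: bool = False    # True between taking a checkpoint and its commit
    checkpoints: list = field(default_factory=list)
    delivered: list = field(default_factory=list)

    def take_checkpoint(self):
        # Save the state accumulated so far, then enter the next interval.
        self.checkpoints.append(list(self.delivered))
        self.parity ^= 1
        self.tentative = True

    def commit(self):
        # Assumption: called once every process has checkpointed in this round.
        self.tentative = False

    def send(self, payload):
        # Piggyback one bit instead of a full checkpoint sequence number.
        return (self.parity, payload)

    def receive(self, message):
        msg_parity, payload = message
        if msg_parity != self.parity and not self.tentative:
            # The sender is already in the next interval: checkpoint first so
            # the receive is never recorded without its matching send.
            self.take_checkpoint()
        # If the parities differ while tentative, the message was sent before
        # the sender's checkpoint; it is delivered without a new checkpoint.
        self.delivered.append(payload)


p, q = Process(0), Process(1)
q.receive(p.send("m1"))      # same interval: plain delivery
p.take_checkpoint()          # p starts a round and moves to the odd interval
q.receive(p.send("m2"))      # q checkpoints before delivering m2
print(q.checkpoints, q.parity)
```

In this sketch `q`'s checkpoint contains `m1` but not `m2`, so `m2` cannot become an orphan: its receive falls after `q`'s checkpoint just as its send falls after `p`'s. One bit suffices only under the stated no-overlap assumption; with overlapping rounds a parity mismatch would be ambiguous.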

Author information

Authors and Affiliations

Department of Computer Science & Engineering, Regional Engineering College, Hamirpur, HP, India
Lalit Kumar
Department of Electronics and Computer Engineering, Indian Institute of Technology Roorkee, Roorkee, India
Manoj Mishra & Ramesh Chander Joshi

Editor information

Editors and Affiliations

University of Newcastle-upon-Tyne, UK
Paul Ezhilchelvan & Alexander Romanovsky

About this chapter

Cite this chapter

Kumar, L., Mishra, M., Joshi, R.C. (2002). Checkpointing in Distributed Computing Systems. In: Ezhilchelvan, P., Romanovsky, A. (eds) Concurrency in Dependable Computing. Springer, Boston, MA. https://doi.org/10.1007/978-1-4757-3573-4_14

DOI: https://doi.org/10.1007/978-1-4757-3573-4_14
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4419-5278-3
Online ISBN: 978-1-4757-3573-4
eBook Packages: Springer Book Archive