Checkpointing in Distributed Computing Systems

  • Chapter
Concurrency in Dependable Computing

Abstract

In this chapter, we present a message-optimal, non-intrusive checkpointing protocol for nondeterministic message-passing distributed computing systems; the protocol does not require global time. Checkpointing in distributed systems can be coordinated, independent, or quasi-synchronous. Coordinated checkpointing is attractive because recovery is simple, it is free of the domino effect, and it has an optimal stable-storage requirement. Quasi-synchronous checkpointing is also domino-free but may force processes to take multiple checkpoints. Independent checkpointing requires each node to keep multiple local checkpoints on stable storage and is susceptible to the domino effect. Coordinated checkpointing has therefore been found preferable to independent checkpointing: it is domino-free and has minimum storage and performance overheads. Many coordinated checkpointing protocols have been proposed in the literature for distributed computing systems; they can be broadly classified as minimum-process intrusive checkpointing protocols and non-intrusive checkpointing protocols. The non-intrusive coordinated protocol presented here has the least failure-free overhead and optimal communication and storage overheads, requiring only O(n) extra messages to take a globally consistent checkpoint. We introduce the concept of “odd” and “even” checkpoint intervals, which replace the checkpoint sequence numbers that are generally piggybacked on each message to avoid orphan messages.
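
The abstract does not spell out how the odd/even intervals are used, so the following is only an illustrative sketch of the general idea, not the chapter's protocol. It assumes that a single parity bit stands in for the piggybacked sequence number, that processes are never more than one checkpoint interval apart, and that any message from the previous interval arrives before the round commits. The names `Message`, `Process`, `tentative`, and `commit` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    payload: str
    parity: int  # sender's interval parity when sent: 0 = even, 1 = odd


@dataclass
class Process:
    parity: int = 0          # parity of the current checkpoint interval
    tentative: bool = False  # checkpointed this round, not yet committed (assumed flag)
    delivered: list = field(default_factory=list)
    checkpoints: list = field(default_factory=list)

    def take_checkpoint(self) -> None:
        # Snapshot the (toy) state, then enter the next interval.
        self.checkpoints.append(list(self.delivered))
        self.parity ^= 1
        self.tentative = True

    def commit(self) -> None:
        # End of the round: the tentative checkpoint becomes permanent.
        self.tentative = False

    def send(self, payload: str) -> Message:
        # Piggyback one bit instead of a full sequence number.
        return Message(payload, self.parity)

    def receive(self, msg: Message) -> None:
        if msg.parity != self.parity and not self.tentative:
            # The sender is already in the next interval. Checkpoint before
            # delivering, so the checkpoint never records a message whose
            # send is missing from the sender's checkpoint (an orphan).
            self.take_checkpoint()
        # Otherwise: same interval, or an older message arriving mid-round.
        self.delivered.append(msg.payload)


p0, p1 = Process(), Process()
old = p1.send("old")   # sent in the even interval
p0.take_checkpoint()   # p0 starts a round and enters the odd interval
new = p0.send("new")
p1.receive(new)        # parity mismatch forces p1 to checkpoint first
p0.receive(old)        # p0 is mid-round, so it only delivers
p0.commit()
p1.commit()
```

Under these assumptions one bit suffices because a parity mismatch can only mean "exactly one interval apart"; without the one-interval bound, a full sequence number would be needed to tell the cases apart.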

Copyright information

© 2002 Springer Science+Business Media New York

About this chapter

Cite this chapter

Kumar, L., Mishra, M., Joshi, R.C. (2002). Checkpointing in Distributed Computing Systems. In: Ezhilchelvan, P., Romanovsky, A. (eds) Concurrency in Dependable Computing. Springer, Boston, MA. https://doi.org/10.1007/978-1-4757-3573-4_14

  • DOI: https://doi.org/10.1007/978-1-4757-3573-4_14

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4419-5278-3

  • Online ISBN: 978-1-4757-3573-4

  • eBook Packages: Springer Book Archive
