Checkpointing Systems

Abstract

Checkpointing applies to large software systems that are subject to failures. In the absence of failures, the software system continuously serves requests, performs transactions, or executes long-running batch processes. If the execution time of a task and the time at which processing starts are known, then the moment of completion of the task is known as well. If failures can happen, the completion time of a task depends strongly on the underlying fault model. The fault model typically employed in checkpointing assumes that faults are detected immediately as they happen. This implies that only crash faults are considered, not transient or Byzantine faults, which would require fault-detection mechanisms. Some checkpointing models instead assume that faults are detected only at the end of the software module [152].
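
The dependence of completion time on the fault model can be illustrated with a small simulation. The sketch below is not taken from the chapter; it assumes crash faults that arrive as a Poisson process, are detected the instant they occur, and force a rollback to the most recent checkpoint. The function name `completion_time` and all of its parameters (`work`, `interval`, `checkpoint_cost`, `restart_cost`, `failure_rate`) are hypothetical names chosen for illustration, and failures during the restart itself are ignored for simplicity.

```python
import random


def completion_time(work, interval, checkpoint_cost, restart_cost,
                    failure_rate, rng, start=0.0):
    """Simulate one run and return the time at which the task completes.

    Assumptions (illustrative, not from the source): failures form a
    Poisson process with rate `failure_rate`, are detected immediately
    (crash faults), and a checkpoint is taken after every `interval`
    units of work except after the final segment.
    """
    clock = start
    done = 0.0  # work already secured by the latest checkpoint
    while done < work:
        segment = min(interval, work - done)
        is_last = done + segment >= work
        cost = segment + (0.0 if is_last else checkpoint_cost)
        # Exponential inter-failure times are memoryless, so drawing a
        # fresh time-to-failure for every attempt is valid.
        if failure_rate > 0:
            time_to_failure = rng.expovariate(failure_rate)
        else:
            time_to_failure = float("inf")
        if time_to_failure >= cost:
            # Segment and its checkpoint finished without a failure.
            clock += cost
            done += segment
        else:
            # Crash detected at once: the segment is lost, the restart
            # cost is paid, and the segment is attempted again.
            clock += time_to_failure + restart_cost
    return clock


if __name__ == "__main__":
    rng = random.Random(42)
    # Without failures the completion time is deterministic.
    print(completion_time(100.0, 10.0, 1.0, 5.0, 0.0, rng))
    # With failures it becomes a random variable; estimate its mean.
    runs = [completion_time(100.0, 10.0, 1.0, 5.0, 0.02, rng)
            for _ in range(1000)]
    print(sum(runs) / len(runs))
```

With a failure rate of zero the result is simply the start time plus the work plus the checkpoint overhead, which matches the observation that the moment of completion is known when no failures occur. With a positive failure rate, each run yields a different completion time, and its distribution depends on the assumed fault model.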

Correspondence to Katinka Wolter.

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Wolter, K. (2010). Checkpointing Systems. In: Stochastic Models for Fault Tolerance. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-11257-7_8

  • DOI: https://doi.org/10.1007/978-3-642-11257-7_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-11256-0

  • Online ISBN: 978-3-642-11257-7

  • eBook Packages: Computer Science (R0)
