Equidistant Checkpoint Placement for Checkpointing and Rollback Recovery

Conference paper
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 238)


To derive a suitable equidistant checkpoint interval for log-based checkpointing and rollback recovery mechanisms, a directed state transition model of system execution is presented under the assumption that inter-failure times follow an exponential distribution. The model takes the relevant factors into account jointly. Using the Laplace transform, the fault-tolerant overhead ratio is derived by evaluating the expected total execution overhead of a single checkpoint interval, from which the optimal equidistant checkpoint interval is obtained. The results indicate that the derived formula is more practical for determining checkpoint placement when optimizing log-based fault-tolerant performance, and that its degenerate form agrees with the previous model.
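The abstract does not reproduce the derived formula, so the following is only a minimal sketch of the general idea, using a simplified classic model rather than the log-based model of the paper. It assumes exponentially distributed failures with rate `lam`, a checkpoint cost `C`, a recovery cost `R`, and no failures during recovery. Under those assumptions the expected time to complete `T` units of work plus one checkpoint is `(1/lam + R) * (exp(lam * (T + C)) - 1)`. All function and parameter names are hypothetical. The sketch minimizes the resulting overhead ratio numerically and compares the result with Young's well-known first-order approximation `sqrt(2 * C / lam)`.

```python
import math


def expected_interval_time(T, C, R, lam):
    """Expected wall-clock time to finish T units of work plus one checkpoint.

    Assumes exponential failures (rate lam), rollback to the start of the
    interval on failure, and no failures during the recovery time R.
    """
    return (1.0 / lam + R) * math.expm1(lam * (T + C))


def overhead_ratio(T, C, R, lam):
    """Expected extra time per unit of useful work for interval length T."""
    return expected_interval_time(T, C, R, lam) / T - 1.0


def young_interval(C, lam):
    """Young's first-order approximation of the optimal interval."""
    return math.sqrt(2.0 * C / lam)


def optimal_interval(C, R, lam, tol=1e-6):
    """Golden-section search for the T that minimizes the overhead ratio.

    In this simplified model the ratio is unimodal in T and its minimizer
    lies in (0, 1/lam), so that range is used as the search bracket.
    """
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = 1e-9 / lam, 1.0 / lam
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while (b - a) > tol * b:
        if overhead_ratio(c, C, R, lam) < overhead_ratio(d, C, R, lam):
            b = d
        else:
            a = c
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return 0.5 * (a + b)


if __name__ == "__main__":
    # Illustrative values only: 60 s checkpoint, 120 s recovery, MTBF of one day.
    C, R, lam = 60.0, 120.0, 1.0 / 86400.0
    print("numerical optimum:", optimal_interval(C, R, lam))
    print("Young's estimate: ", young_interval(C, lam))
```

In this simplified model the recovery cost `R` only scales the overhead ratio, so it does not shift the optimal interval; a model that also accounts for message logging, as the paper does, would be expected to behave differently.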





The authors would like to thank the anonymous reviewers and the editor for carefully reading the chapter and for their great help in improving it.



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. Jiangsu Automation Research Institute, Lianyungang, China
