The Checkpoint-Timing for Backward Fault-Tolerant Schemes

  • Conference paper
Advanced Computer Architecture (ACA 2018)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 908)

Abstract

To improve the performance of backward fault-tolerant schemes in long-running parallel applications, a general checkpoint-timing method is proposed that determines unequal checkpointing intervals for an arbitrary failure rate, so as to reduce the total execution time. First, a new model is introduced to evaluate the mean expected execution time. Second, the optimality condition for a constant failure rate is derived from this model, from which the optimal equal checkpointing interval is easily obtained. Subsequently, a general method is derived to determine the checkpoint timing for other failure rates. The results show that the proposal offers a practical trade-off between the re-processing overhead and the checkpointing overhead in backward fault-tolerant schemes.
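As a rough illustration of the constant-failure-rate case mentioned in the abstract, the sketch below uses Young's classic first-order approximation (Commun. ACM, 1974), not the model introduced in this paper, whose equations are not reproduced on this page. It assumes a fixed checkpoint cost `C`, a mean time between failures `M`, and a first-order wasted-time fraction of `C/T + T/(2M)` for an equal interval `T`. The function names and the example values are hypothetical.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order optimum for an equal interval: T = sqrt(2 * C * M).

    Assumes a constant failure rate (1 / mtbf) and a checkpoint cost that is
    small relative to the MTBF. This is NOT the paper's own model.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def overhead_fraction(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order wasted-time fraction for a given equal interval.

    C / T   : time spent writing checkpoints per unit of useful work
    T / 2M  : expected re-processing after a failure (half an interval on
              average), weighted by the failure rate 1 / M
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    # Hypothetical values: 60 s per checkpoint, one failure per day on average.
    cost, mtbf = 60.0, 86_400.0
    t_opt = young_interval(cost, mtbf)
    print(f"interval: {t_opt:.1f} s")
    print(f"overhead: {overhead_fraction(t_opt, cost, mtbf):.4f}")
```

The two terms of `overhead_fraction` are the checkpointing overhead and the re-processing overhead that the abstract describes trading off. Shortening the interval raises the first term and lowers the second. For a failure rate that varies over time, an equal interval is generally no longer optimal, which is the case the paper's unequal-interval method addresses.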

Author information

Corresponding author

Correspondence to Min Zhang.

Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Zhang, M. (2018). The Checkpoint-Timing for Backward Fault-Tolerant Schemes. In: Li, C., Wu, J. (eds) Advanced Computer Architecture. ACA 2018. Communications in Computer and Information Science, vol 908. Springer, Singapore. https://doi.org/10.1007/978-981-13-2423-9_16

  • DOI: https://doi.org/10.1007/978-981-13-2423-9_16

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-2422-2

  • Online ISBN: 978-981-13-2423-9

  • eBook Packages: Computer Science (R0)
