Abstract
To improve the performance of backward fault-tolerant schemes in long-running parallel applications, a general checkpoint-timing method is proposed that determines unequal checkpointing intervals for an arbitrary failure rate, with the goal of reducing the total execution time. First, a new model is introduced to evaluate the expected execution time. Second, the optimality condition is derived from this model for a constant failure rate, from which the optimal equal checkpointing interval is easily obtained. Subsequently, a general method is derived to determine the checkpoint timing for other failure rates. The final results show that the proposal is a practical way to trade off the re-processing overhead against the checkpointing overhead in backward fault-tolerant schemes.
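The trade-off between checkpointing overhead and re-processing overhead under a constant failure rate can be illustrated with a small sketch. It does not reproduce the model of this paper, which is not given in the abstract. Instead it assumes the classical first-order approximation due to Young, in which the overhead per unit of useful work is roughly C/tau + lambda*tau/2 and the equal interval that minimizes it is tau = sqrt(2C/lambda). The function names and the numeric values (a 60 s checkpoint cost, one failure per day) are hypothetical and chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost: float, failure_rate: float) -> float:
    """Equal checkpoint interval from Young's first-order approximation.

    Assumption: constant failure rate (exponential failures) and a
    checkpoint cost that is small compared with the mean time between
    failures. This is not the model derived in the paper.
    """
    if checkpoint_cost <= 0 or failure_rate <= 0:
        raise ValueError("checkpoint_cost and failure_rate must be positive")
    return math.sqrt(2.0 * checkpoint_cost / failure_rate)


def overhead_fraction(interval: float, checkpoint_cost: float,
                      failure_rate: float) -> float:
    """First-order overhead per unit of useful work.

    checkpoint_cost / interval       : time spent writing checkpoints
    failure_rate * interval / 2      : expected re-processing after a
                                       failure (half an interval on average)
    """
    return checkpoint_cost / interval + failure_rate * interval / 2.0


# Hypothetical values: 60 s per checkpoint, one failure per day on average.
C = 60.0
lam = 1.0 / 86400.0

tau = young_interval(C, lam)
for label, t in (("tau / 2", tau / 2), ("tau", tau), ("2 * tau", 2 * tau)):
    print(f"{label:8s} interval = {t:9.1f} s, "
          f"overhead = {overhead_fraction(t, C, lam):.4%}")
```

Checkpointing more often than tau raises the checkpointing term, while checkpointing less often raises the re-processing term, so both alternatives printed above show a larger overhead than tau itself. For a failure rate that varies over time, a single equal interval of this kind is no longer sufficient, which is the case the unequal-interval method in the abstract addresses.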
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Zhang, M. (2018). The Checkpoint-Timing for Backward Fault-Tolerant Schemes. In: Li, C., Wu, J. (eds) Advanced Computer Architecture. ACA 2018. Communications in Computer and Information Science, vol 908. Springer, Singapore. https://doi.org/10.1007/978-981-13-2423-9_16
DOI: https://doi.org/10.1007/978-981-13-2423-9_16
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-2422-2
Online ISBN: 978-981-13-2423-9
eBook Packages: Computer Science, Computer Science (R0)