Abstract
This work revisits the assumptions currently adopted in checkpoint modelling and evaluates their impact on the predicted optimal period for coordinated single-level checkpointing. An accurate a priori estimate of the optimal checkpoint period for a given computing facility is necessary because the period determines the overhead incurred by checkpointing, and checkpointing too frequently lowers the steady-state availability of the resource. The study discusses the impact of the order of approximation used in single-level coordinated checkpoint models and extends previous results on the optimal checkpoint period to explore how the checkpoint rate affects cluster performance under total-execution-time and energy-consumption policies, and in terms of resource availability. One consequence of a prescribed checkpoint rate with current technology is a critical cluster size above which the attained availability is too poor for the platform to be cost-effective. Some guidelines for cluster sizing are therefore given.
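As a rough illustration of what "order of approximation" means here, the sketch below compares two well-known closed-form estimates of the optimal checkpoint period: Young's (1974) first-order formula and Daly's (2006) higher-order estimate. It is not the model developed in this article. The function names, the assumption of independent exponentially distributed node failures (so that the system MTBF is the node MTBF divided by the node count), and the first-order waste expression are illustrative assumptions only.

```python
import math


def system_mtbf(node_mtbf, nodes):
    """System MTBF, assuming independent exponential node failures (assumption)."""
    return node_mtbf / nodes


def young_period(checkpoint_cost, mtbf):
    """Young's first-order estimate: sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def daly_period(checkpoint_cost, mtbf):
    """Daly's higher-order estimate; falls back to M when C >= 2M."""
    if checkpoint_cost >= 2.0 * mtbf:
        return mtbf
    x = checkpoint_cost / (2.0 * mtbf)
    correction = 1.0 + math.sqrt(x) / 3.0 + x / 9.0
    return math.sqrt(2.0 * checkpoint_cost * mtbf) * correction - checkpoint_cost


def first_order_waste(period, checkpoint_cost, mtbf):
    """First-order wasted fraction C/T + T/(2M); ignores downtime and recovery."""
    return checkpoint_cost / period + period / (2.0 * mtbf)


if __name__ == "__main__":
    node_mtbf = 10 * 365 * 24 * 3600.0  # hypothetical: 10 years per node, in seconds
    cost = 600.0                        # hypothetical: 10-minute checkpoint
    for nodes in (1_000, 10_000, 100_000):
        m = system_mtbf(node_mtbf, nodes)
        t_young = young_period(cost, m)
        t_daly = daly_period(cost, m)
        waste = first_order_waste(t_young, cost, m)
        print(f"N={nodes:>7}  M={m:10.0f} s  Young={t_young:9.0f} s  "
              f"Daly={t_daly:9.0f} s  waste~{waste:.2f}")
```

Under these assumptions the first-order waste at Young's period equals sqrt(2C/M), so it grows with the square root of the node count. This gives a simple picture of why, for a fixed checkpoint cost, availability degrades beyond some cluster size.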
Acknowledgements
This work was supported by the COST Action NESUS (IC1305) and partially funded by the Spanish Ministry of Economy and Competitiveness Project CODEC2 (TIN2015-63562-R) with FEDER funds, the RICAP Network (517RT0529) with CYTED funds, and EU H2020 Project HPC4E (Grant Agreement No. 689772).
Cite this article
Moríñigo, J.A., Rodríguez-Pascual, M. & Mayo-García, R. On the modelling of optimal coordinated checkpoint period in supercomputers. J Supercomput 75, 930–954 (2019). https://doi.org/10.1007/s11227-018-2621-1
DOI: https://doi.org/10.1007/s11227-018-2621-1
Keywords
- Coordinated checkpoint
- Cluster availability
- Optimal checkpoint period
- Single-level checkpoint