The Journal of Supercomputing, Volume 75, Issue 2, pp 930–954

On the modelling of optimal coordinated checkpoint period in supercomputers

  • José A. Moríñigo
  • Manuel Rodríguez-Pascual
  • Rafael Mayo-García


This work reviews the assumptions currently adopted in checkpointing models and evaluates their impact on the predicted optimal period for coordinated single-level checkpointing. An accurate a priori estimate of the optimal checkpoint period for a given computing facility is necessary because the period determines the overhead incurred: checkpointing too frequently raises that overhead and, as a result, lowers the steady-state availability of the resource. The study discusses the impact of the order of approximation used in single-level coordinated checkpoint modelling and extends previous results on the optimal checkpoint period to explore the effect of the checkpoint rate on cluster performance under total-execution-time and energy-consumption policies, and in terms of resource availability. With current technology, a prescribed checkpoint rate implies a critical cluster size above which the attained availability is too low for the platform to be cost-effective. Some guidelines for cluster sizing are therefore given.
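To make the role of the order of approximation concrete, the following minimal sketch compares two widely quoted closed-form estimates of the checkpoint period: Young's first-order formula and Daly's higher-order formula. It is an illustration only and is not taken from this paper's model. The function names, the parameter values, the assumption that node failures are independent and exponentially distributed (so that the system MTBF is the node MTBF divided by the node count), and the first-order waste expression are all assumptions made here.

```python
import math


def system_mtbf(node_mtbf, n_nodes):
    """System MTBF, ASSUMING independent, exponentially distributed node failures."""
    return node_mtbf / n_nodes


def young_period(checkpoint_cost, mtbf):
    """Young's first-order estimate of the checkpoint period: sqrt(2*C*M)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def daly_period(checkpoint_cost, mtbf):
    """Daly's higher-order estimate; it falls back to M when C >= 2*M."""
    if checkpoint_cost >= 2.0 * mtbf:
        return mtbf
    x = checkpoint_cost / (2.0 * mtbf)
    base = math.sqrt(2.0 * checkpoint_cost * mtbf)
    return base * (1.0 + math.sqrt(x) / 3.0 + x / 9.0) - checkpoint_cost


def first_order_waste(period, checkpoint_cost, mtbf):
    """Rough fraction of time lost: checkpoint overhead plus expected rework.

    Ignores restart cost and downtime, so it is only indicative.
    """
    return checkpoint_cost / period + period / (2.0 * mtbf)


if __name__ == "__main__":
    # Hypothetical values: 60 s per checkpoint, 5-year MTBF per node.
    cost = 60.0
    node_mtbf = 5 * 365 * 24 * 3600.0
    for nodes in (1_000, 10_000, 100_000):
        m = system_mtbf(node_mtbf, nodes)
        ty = young_period(cost, m)
        td = daly_period(cost, m)
        print(nodes, round(ty), round(td),
              round(first_order_waste(ty, cost, m), 3))
```

Under these assumptions the two estimates stay close while the checkpoint cost is small compared with the system MTBF, and they drift apart as the node count grows. The waste fraction also rises with the node count, which is the qualitative effect behind the critical cluster size mentioned above.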


Coordinated checkpoint; Cluster availability; Optimal checkpoint period; Single-level checkpoint



This work was supported by the COST Action NESUS (IC1305) and partially funded by the Spanish Ministry of Economy and Competitiveness Project CODEC2 (TIN2015-63562-R) with FEDER funds, the RICAP Network (517RT0529) with CYTED funds, and EU H2020 Project HPC4E (Grant Agreement No. 689772).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Technology, CIEMAT, Madrid, Spain
