Multi-criteria Checkpointing Strategies: Response-Time versus Resource Utilization

  • Aurelien Bouteiller
  • Franck Cappello
  • Jack Dongarra
  • Amina Guermouche
  • Thomas Hérault
  • Yves Robert
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8097)


Failures increasingly threaten the efficiency of HPC systems, and current projections for Exascale platforms indicate that rollback recovery, the most convenient method for providing fault tolerance to general-purpose applications, reaches its limits at such scales. One reason for this unnerving situation is the focus that has been given to per-application completion time rather than to platform efficiency. In this paper, we discuss the case of uncoordinated rollback recovery where the idle time spent waiting for recovering processors is used to progress a different, independent application from the system batch queue. We then propose an extended model of uncoordinated checkpointing that can discriminate between idle time and wasted computation. We instantiate this model in a simulator to demonstrate that, with this strategy, the per-application completion time of uncoordinated checkpointing is unchanged, while the platform reaches near-perfect efficiency.
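
The distinction between idle time and wasted computation can be illustrated with a toy accounting of a single recovery episode. The sketch below is not the extended model or the simulator from the paper. It assumes, purely for illustration, that during a recovery the failed processors re-execute lost work (wasted computation) while all other processors of the application wait (idle time), and that checkpoint overhead and the cost of switching to another application are zero. The names `RecoveryEpisode`, `lost_processor_seconds`, and `platform_efficiency` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecoveryEpisode:
    """One failure and its recovery, in a deliberately simplified setting."""
    n_procs: int             # processors allocated to the application
    n_recovering: int        # processors that roll back and re-execute
    recovery_seconds: float  # time until the recovering processors catch up


def lost_processor_seconds(ep: RecoveryEpisode, fill_idle_time: bool) -> float:
    """Processor-seconds the platform loses during the episode.

    Assumption: re-execution on the recovering processors is always lost;
    the waiting processors are lost only if nothing else runs on them.
    """
    wasted_computation = ep.n_recovering * ep.recovery_seconds
    idle_time = (ep.n_procs - ep.n_recovering) * ep.recovery_seconds
    if fill_idle_time:
        # Idle processors progress an independent job from the batch queue,
        # so their time is useful to the platform (switch cost assumed zero).
        return wasted_computation
    return wasted_computation + idle_time


def platform_efficiency(ep: RecoveryEpisode, window_seconds: float,
                        fill_idle_time: bool) -> float:
    """Fraction of processor-seconds in the window spent on useful work."""
    total = ep.n_procs * window_seconds
    return 1.0 - lost_processor_seconds(ep, fill_idle_time) / total


# Hypothetical numbers: 1000 processors, one failure, 60 s recovery, 1 h window.
episode = RecoveryEpisode(n_procs=1000, n_recovering=1, recovery_seconds=60.0)
eff_waiting = platform_efficiency(episode, 3600.0, fill_idle_time=False)
eff_filled = platform_efficiency(episode, 3600.0, fill_idle_time=True)
```

In this toy setting the application is delayed by the same 60 seconds in both cases, so its completion time does not change. Only the platform-level accounting differs, because the waiting processors no longer count as lost.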


Idle Time; Mean Time Between Failure; High Performance Computing System; High Performance Computing Application; Checkpoint Interval
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Aurelien Bouteiller (1)
  • Franck Cappello (2, 3)
  • Jack Dongarra (1)
  • Amina Guermouche (4)
  • Thomas Hérault (1)
  • Yves Robert (1, 5)

  1. University of Tennessee Knoxville, USA
  2. University of Illinois at Urbana Champaign, USA
  3. INRIA, France
  4. Univ. Versailles St Quentin, France
  5. Ecole Normale Supérieure de Lyon, France
