Fault Tolerance Techniques for High-Performance Computing

  • Jack Dongarra
  • Thomas Herault
  • Yves Robert
Chapter
Part of the Computer Communications and Networks book series (CCN)

Abstract

This chapter provides an introduction to resilience methods. The emphasis is on checkpointing, the de facto standard technique for resilience in High Performance Computing. We present the two main protocols, namely coordinated checkpointing and hierarchical checkpointing. Then we introduce performance models and use them to assess the performance of these protocols, including the Young/Daly formula for the optimal checkpointing period. Next we explain how the efficiency of checkpointing can be improved via fault prediction or replication. Then we move to application-specific methods, such as algorithm-based fault tolerance (ABFT). We conclude the chapter by discussing techniques to cope with silent errors (or silent data corruption).
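
As a rough illustration of the optimal-period formula mentioned in the abstract, the following sketch uses the commonly stated first-order form, in which the period is sqrt(2 * C * mu) for a checkpoint cost C and a platform mean time between failures mu. The function names, and the simplified waste model that ignores downtime and recovery time, are assumptions made for this example and are not taken from the chapter.

```python
import math


def young_daly_period(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order optimal checkpointing period: sqrt(2 * C * mu).

    Assumes the checkpoint cost C is small compared with the platform MTBF mu.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def first_order_waste(period_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Simplified fraction of time lost: C / T for checkpoints, T / (2 * mu) for re-execution.

    Downtime and recovery time are deliberately left out (an assumption of this sketch).
    """
    return checkpoint_cost_s / period_s + period_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    c, mu = 2.0, 100.0  # hypothetical values, in seconds
    t_opt = young_daly_period(c, mu)
    for t in (t_opt / 2, t_opt, 2 * t_opt):
        print(f"period={t:6.1f}s  waste={first_order_waste(t, c, mu):.3f}")
```

Under this simplified model, checkpointing twice as often or half as often as the computed period both increase the waste, which is the trade-off the formula captures.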

Notes

Acknowledgments

Yves Robert is with the Institut Universitaire de France. The research presented in this chapter was supported in part by the French ANR (Rescue project) and by contracts with the DOE through the SUPER-SCIDAC project, and the CREST project of the Japan Science and Technology Agency (JST). This chapter has borrowed material from publications co-authored with many colleagues and PhD students, and the authors would like to thank Guillaume Aupy, Anne Benoit, George Bosilca, Aurélien Bouteiller, Aurélien Cavelan, Franck Cappello, Henri Casanova, Amina Guermouche, Saurabh K. Raina, Hongyang Sun, Frédéric Vivien, and Dounia Zaidouni.

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. University of Tennessee, Knoxville, USA
  2. Oak Ridge National Laboratory, Oak Ridge, USA
  3. University of Manchester, Manchester, UK
  4. Ecole Normale Supérieure de Lyon, Lyon, France