Coping with Silent Errors in HPC Applications

  • Guillaume Aupy
  • Anne Benoit
  • Aurélien Cavelan
  • Massimiliano Fasi
  • Yves Robert
  • Hongyang Sun
  • Bora Uçar
Part of the Emergence, Complexity and Computation book series (ECC, volume 24)


This chapter describes a unified framework for the detection and correction of silent errors, which constitute a major threat for scientific applications at extreme scale. We first motivate the problem and explain why checkpointing must be combined with some verification mechanism. Then we introduce a general-purpose technique based upon computational patterns that periodically repeat over time. These patterns interleave verifications and checkpoints, and we show how to determine the pattern that minimizes the expected execution time. Then we move to application-specific techniques and review dynamic programming algorithms for linear chains of tasks, as well as ABFT-oriented algorithms for iterative methods in sparse linear algebra.

Thanks to Selim Akl, by Yves Robert: I have a vivid memory of Selim’s visit to Lyon in the early 90s. Selim had obtained a Louis Néel fellowship devoted to promoting exchanges between Canada and the Rhône-Alpes area in France, and he spent 6 months in Lyon with his family. Michel Cosnard was the head of the LIP laboratory at that time. Selim gave a course on parallel algorithms, mainly sorting and PRAM, that sparked a lot of interest among both our students and the researchers in the lab. During his stay, Selim initiated several collaborations with Jean Duprat, Afonso Ferreira and Pierre Fraigniaud. Although I never collaborated with him, I would like to thank him for his vision. I was then a young professor in LIP, and I felt as though I were meeting a star, but a very kind one. His two books, Parallel Sorting Algorithms and The Design and Analysis of Parallel Algorithms, had a huge influence on many researchers at LIP (including myself), as they helped shape our view of parallel complexity. Later on we all took different research directions (PRAM, hypercubes, systolic arrays, scheduling, routing, ...) but Selim laid the foundations of the field for us, and we are grateful to him.


Dynamic Programming Algorithm; Preconditioned Conjugate Gradient; Detection Latency; Single Error; Vector Operation
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer International Publishing Switzerland 2017

Authors and Affiliations

  • Guillaume Aupy (1), corresponding author
  • Anne Benoit (2)
  • Aurélien Cavelan (2)
  • Massimiliano Fasi (3)
  • Yves Robert (2, 4)
  • Hongyang Sun (5)
  • Bora Uçar (6)

  1. Penn State University, State College, USA
  2. ENS Lyon, Lyon, France
  3. The University of Manchester, Manchester, UK
  4. University of Tennessee, Knoxville, USA
  5. ENS Lyon & INRIA, Lyon, France
  6. CNRS & ENS Lyon, Lyon, France
