Abstract
The resilience of three prototype GMRES implementations (with incomplete LU, flexible, and randomized-SVD-based preconditioners) has been analyzed with a soft-error injection approach. A low-level fault injector is inserted into the GMRES solvers and randomly selects the program locations at which a fault is injected across multiple executions. This approach combines the configurability of high-level techniques with the accuracy of low-level ones, so the effect of faults can be closely emulated. To gather enough statistical data, a set of eighteen sparse linear systems Ax = b has been solved and monitored with these GMRES implementations in the injection experiments. The results of this prototype-based fault injection suggest improved error resilience of the randomized-SVD-based preconditioned GMRES version for many of the analyzed matrices, which makes it of interest for supercomputing applications, where silent errors are more prominent.
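As a rough illustration of the kind of soft error such experiments emulate, the sketch below flips a single randomly chosen bit in a randomly chosen entry of a vector of IEEE-754 doubles, once per run. It is not the injector used in the study, which works at a low level inside the compiled solver; the names `flip_bit` and `inject_fault`, the single-bit fault model, and the choice of a data vector as the injection target are all assumptions made for this example.

```python
import random
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE-754 double encoding flipped.

    Bit 0 is the least significant mantissa bit, bit 63 is the sign bit.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in [0, 63]")
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    # Toggle the chosen bit and reinterpret the result as a double again.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def inject_fault(vector, rng):
    """Copy `vector` and corrupt one random entry with one random bit flip.

    Hypothetical helper: returns the corrupted copy, the entry index,
    and the flipped bit position so that each run can be logged.
    """
    index = rng.randrange(len(vector))
    bit = rng.randrange(64)
    corrupted = list(vector)
    corrupted[index] = flip_bit(corrupted[index], bit)
    return corrupted, index, bit


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the runs are reproducible
    x = [1.0, 2.0, 3.0, 4.0]
    for run in range(3):
        corrupted, index, bit = inject_fault(x, rng)
        print(f"run {run}: entry {index}, bit {bit}: "
              f"{x[index]!r} -> {corrupted[index]!r}")
```

The position of the flipped bit matters a great deal: `flip_bit(1.0, 63)` only changes the sign and gives -1.0, `flip_bit(1.0, 52)` alters the exponent and gives 0.5, while a flip in the low mantissa bits perturbs the value only slightly. This is one reason why many randomized runs are needed before drawing statistical conclusions about a solver's resilience.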
Acknowledgment
This work was partially funded by the Spanish Ministry of Science, Innovation, and Universities CODEC-OSE project (RTI2018-096006-B-I00) and the Comunidad de Madrid CABAHLA-CM project (S2018/TCS-4423), both with support from the European Regional Development Fund (ERDF). It also benefited from funding received by the H2020 co-funded projects Energy oriented Centre of Excellence for computing applications II (EoCoE-II, No. 824158) and Supercomputing and Energy in Mexico (Enerxico, No. 828947). Finally, the authors thank the cluster administrators at CIEMAT, Pablo García-Muller and Antonio J. Rubio-Montero, for their support.
Cite this article
Moríñigo, J.A., Bustos, A. & Mayo-García, R. Error resilience of three GMRES implementations under fault injection. J Supercomput 78, 7158–7185 (2022). https://doi.org/10.1007/s11227-021-04148-x
DOI: https://doi.org/10.1007/s11227-021-04148-x