Abstract
An extensive survey of the literature shows that the Lanczos eigensolver is a popular iterative method for approximating a few maximal eigenvalues of a real symmetric matrix, particularly if the matrix is large and sparse. In recent years, graphics processing units (GPUs) have become a popular platform for scientific computing applications, many of which are based on linear algebra, and are increasingly being used as the main computational units in supercomputers. This trend is expected to continue as the number of computations required by scientific applications reaches the petascale and exascale range. In this paper, building on our earlier work [22], we investigate in detail the error checking mechanism for the Lanczos eigensolver. We identify a low-cost invariant for efficient error checking and, through mathematical analysis, determine the efficiency of our mechanism when used by the Lanczos eigensolver. We evaluate the proposed fault-tolerant scheme using an open-source sparse eigensolver on a GPU platform, with and without the injection of faults. We use a large number of sparse matrices from real applications to determine the efficiency and efficacy of our method. Our implementation shows that the proposed fault-tolerant method has good error coverage and low overhead. To the best of our knowledge, we are the first to introduce such a scheme for the Lanczos method.
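The general idea of invariant checking in a Lanczos iteration can be illustrated with a minimal sketch. The abstract does not state which invariant the paper identifies, so the check below is an assumption chosen for illustration: in exact arithmetic, each new Lanczos vector is orthogonal to the two preceding ones. The function name `lanczos_checked`, the tolerance `tol`, and the `fault_step` hook that corrupts one entry of the matrix-vector product are all hypothetical and are not taken from the paper or from any eigensolver library.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def matvec(A, x):
    # Dense product for illustration; a real solver would use a sparse kernel.
    return [dot(row, x) for row in A]


def lanczos_checked(A, start, steps, tol=1e-8, fault_step=None):
    """Plain Lanczos on a symmetric matrix A given as a list of rows.

    Returns (alphas, betas, failed_step). failed_step is None when every
    orthogonality check passed, otherwise the iteration that failed.
    fault_step is a hypothetical hook that simulates a transient fault.
    """
    norm = math.sqrt(dot(start, start))
    q = [v / norm for v in start]
    q_prev = [0.0] * len(q)
    beta = 0.0
    alphas, betas = [], []
    for j in range(steps):
        w = matvec(A, q)
        if j == fault_step:
            w[0] += 1.0  # simulated transient fault in the product
        scale = math.sqrt(dot(w, w))
        alpha = dot(q, w)
        # Three-term recurrence: remove components along q_j and q_{j-1}.
        w = [wi - alpha * qi - beta * pi for wi, qi, pi in zip(w, q, q_prev)]
        # Assumed invariant: w is orthogonal to q_j and q_{j-1}.
        if max(abs(dot(q, w)), abs(dot(q_prev, w))) > tol * scale:
            return alphas, betas, j
        beta = math.sqrt(dot(w, w))
        alphas.append(alpha)
        betas.append(beta)
        if beta == 0.0:
            break  # invariant subspace found
        q_prev, q = q, [wi / beta for wi in w]
    return alphas, betas, None


A = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 3.0, 1.0, 0.0],
     [0.0, 1.0, 2.0, 1.0],
     [0.0, 0.0, 1.0, 1.0]]
start = [1.0, 1.0, 1.0, 1.0]
print(lanczos_checked(A, start, 3)[2])                # fault-free run
print(lanczos_checked(A, start, 3, fault_step=1)[2])  # run with injected fault
```

The check costs two extra dot products per iteration. In this sketch it cannot flag a fault injected at the very first iteration, because there is no previous Lanczos vector to compare against and the component along the current vector is removed by construction.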
References
Agerwala T (2010) Exascale computing: The challenges and opportunities in the next decade. In Proc. of the International Symposium on High Performance Computer Architecture (HPCA)
Arnoldi W (1951) The principle of minimized iterations in the solution of the matrix eigenvalue problem. Quart Appl Math 9:17–29
Balay S, Abhyankar S, Adams M, Brown J, Brune P, Buschelman K, Dalcin L, Eijkhout V, Gropp W, Kaushik D, Knepley M, May D, McInnes L, Rupp K, Sanan P, Smith B, Zampini S, Zhang H, Zhang H (2017) PETSc users manual. Technical Report ANL-95/11 - Revision 3.8, Argonne National Laboratory
Balay S, Gropp W, McInnes L, Smith B (1997) Efficient management of parallelism in object oriented numerical software libraries. In E. Arge, A. M. Bruaset, and H. P. Langtangen, editors, Modern Software Tools in Scientific Computing, pages 163–202. Birkhäuser Press
Braun C, Halder S, Wunderlich HJ (2014) A-ABFT: Autonomous algorithm-based fault tolerance for matrix multiplications on graphics processing units. In Proc. of the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 443–454
Bronevetsky G, de Supinski B (2008) Soft error vulnerability of iterative linear algebra methods. In Proc. of the International Conference on Supercomputing, pages 155–164
Chen Z (2013) Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods. In Proc. of the Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 167–176
Chen J, Liang X, Chen Z (2016) Online algorithm-based fault tolerance for Cholesky decomposition on heterogeneous systems with GPUs. In Proc. of the IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Davis TA, Hu Y (2011) The University of Florida sparse matrix collection. ACM Trans Math Soft 38(1):1:1–1:25
Elliott J, Hoemmen M, Mueller F (2014) Evaluating the impact of SDC on the GMRES iterative solver. In Proc. of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 1193–1202
Golub GH, van Loan CF (1996) Matrix Computations, 3rd edn. Johns Hopkins University Press, Baltimore, MD
Hakkarinen D, Wu P, Chen Z (2015) Fail-stop failure algorithm-based fault tolerance for Cholesky decomposition. IEEE Trans Par Distr Sys 26(5):1323–1335
Hernandez V, Roman JE, Tomas A, Vidal V (2006) Lanczos methods in SLEPc. Technical Report STR-5, Universitat Politècnica de València. Available at http://slepc.upv.es
Hernandez V, Roman JE, Vidal V (2005) SLEPc: A scalable and flexible toolkit for the solution of eigenvalue problems. ACM Trans Math Soft 31(3):351–362
Heroux MA (2009) Software challenges for extreme scale computing: Going from petascale to exascale systems. Int J High Perf Comput Appl 23(4):437–439
Huang KH, Abraham JA (1984) Algorithm-based fault tolerance for matrix operations. IEEE Trans Comp C-33(6):518–528
Kim H, Vuduc R, Baghsorkhi S, Choi J, Hwu W (2012) Performance analysis and tuning for general purpose graphics processing units (GPGPU). Synthesis Lectures on Computer Architecture
Knyazev A (2001) Toward the optimal preconditioned eigensolver: Locally optimal block preconditioned conjugate gradient method. SIAM J Sci Comput 23(2):517–541
Lanczos C (1950) An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. J Res Nat Bur Stand 45(4):255–282
Loh F, Ramanathan P, Saluja KK (2015) Transient fault resilient QR factorization on GPUs. In Proc. of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale, FTXS ’15, pages 63–70
Loh F, Saluja KK, Ramanathan P (2016) Fault tolerance through invariant checking for iterative solvers. In Proc. of the International Conference on VLSI Design and International Conference on Embedded Systems (VLSID), pages 481–486
Loh F, Saluja KK, Ramanathan P (2020) Fault tolerance through invariant checking for the Lanczos eigensolver. In Proc. of the International Conference on VLSI Design and International Conference on Embedded Systems (VLSID), pages 13–18
Nie B, Tiwari D, Gupta S, Smirni E, Rogers JH (2016) A large-scale study of soft-errors on GPUs in the field. In Proc. of the International Symposium on High Performance Computer Architecture (HPCA), pages 519–530
NVIDIA (2016) NVIDIA GeForce GTX 1080. White Paper
Oboril F, Tahoori MB, Heuveline V, Lukarski D, Weiss JP (2011) Numerical defect correction as an algorithm-based fault tolerance technique for iterative solvers. In Proc. of the IEEE Pacific Rim International Symposium on Dependable Computing (PRDC), pages 144–153
Shivakumar P, Kistler M, Keckler SW, Burger D, Alvisi L (2003) Modeling the impact of device and pipeline scaling on the soft error rate of processor elements. Technical Report 2002-19, Dept. of Computer Sciences, The University of Texas at Austin
Scholl A, Braun C, Kochte MA, Wunderlich H (2015) Low-overhead fault-tolerance for the preconditioned conjugate gradient solver. In Proc. of the IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFTS), pages 60–65
Shantharam M, Srinivasmurthy S, Raghavan P (2012) Fault tolerant preconditioned conjugate gradient for sparse linear system solution. In Proc. of the International Conference on Supercomputing, pages 69–78
Siefert N, Jahinuzzaman S, Velamala J, Ascazubi R, Patel N, Gill B, Basile J, Hicks J (2015) Soft error rate improvements in 14-nm technology featuring second-generation 3D tri-gate transistors. IEEE Trans Nucl Sci 62(6):2570–2577
Sloan J, Kumar R, Bronevetsky G (2012) Algorithmic approaches to low overhead fault detection for sparse linear algebra. In Proc. of the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 1–12
Wu P, Guan Q, DeBardeleben N, Blanchard S, Tao D, Liang X, Chen J, Chen Z (2016) Towards practical algorithm based fault tolerance in dense linear algebra. In Proc. of the 25th International Symposium on High-performance Parallel and Distributed Computing, HPDC ’16, pages 31–42
Additional information
Responsible Editor: V. D. Agrawal.
About this article
Cite this article
Loh, F., Saluja, K.K. & Ramanathan, P. Fault Tolerant Lanczos Eigensolver via an Invariant Checking Method. J Electron Test 37, 409–422 (2021). https://doi.org/10.1007/s10836-021-05945-1
DOI: https://doi.org/10.1007/s10836-021-05945-1