Fault Tolerant Lanczos Eigensolver via an Invariant Checking Method

Loh, Felix; Saluja, Kewal K.; Ramanathan, Parameswaran

doi:10.1007/s10836-021-05945-1

Published: 30 April 2021

Volume 37, pages 409–422, (2021)
Journal of Electronic Testing

Abstract

An extensive survey of the literature shows that the Lanczos eigensolver is a popular iterative method for approximating a few maximal eigenvalues of a real symmetric matrix, particularly when the matrix is large and sparse. In recent years, graphics processing units (GPUs) have become a popular platform for scientific computing applications, many of which are based on linear algebra, and are increasingly being used as the main computational units in supercomputers. This trend is expected to continue as the number of computations required by scientific applications reaches the petascale and exascale range. In this paper, building on our earlier work [22], we investigate in detail the error checking mechanism for the Lanczos eigensolver. We identify a low-cost invariant for efficient error checking and, through mathematical analysis, determine the efficiency of our mechanism when used by the Lanczos eigensolver. We evaluate the proposed fault tolerant scheme using an open-source sparse eigensolver on a GPU platform, with and without the injection of faults. We use a large number of sparse matrices from real applications to determine the efficiency and efficacy of our method. Our implementation shows that the proposed fault tolerant method has good error coverage and low overhead. To the best of our knowledge, we are the first to introduce such a scheme for the Lanczos method.
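
The abstract does not say which invariant the authors check, so the Python sketch below only illustrates the general idea of invariant checking inside a Lanczos iteration. It assumes, purely for illustration, that the invariant is the local orthogonality of each new Lanczos vector to the two preceding ones, a property that holds in exact arithmetic. The names lanczos_with_check, tol, and fault_at (a hook that corrupts one matrix-vector product) are hypothetical and are not taken from the paper or from any library.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    # Dense product for brevity; a real solver would use a sparse kernel.
    return [dot(row, v) for row in A]


def lanczos_with_check(A, q0, m, tol=1e-8, fault_at=None):
    """Run up to m Lanczos steps on a symmetric matrix A (list of rows).

    Returns (alphas, betas, flagged), where flagged lists the steps at
    which the assumed invariant (local orthogonality) was violated.
    """
    norm0 = math.sqrt(dot(q0, q0))
    q = [x / norm0 for x in q0]
    q_prev = [0.0] * len(q)
    beta = 0.0
    alphas, betas, flagged = [], [], []
    for j in range(m):
        w = matvec(A, q)
        if fault_at == j:
            w[0] += 1.0  # hypothetical injected fault in the product A*q
        alpha = dot(w, q)
        # Three-term recurrence: remove components along q and q_prev.
        w = [wi - alpha * qi - beta * pi for wi, qi, pi in zip(w, q, q_prev)]
        beta_next = math.sqrt(dot(w, w))
        alphas.append(alpha)
        if beta_next < 1e-12:
            break  # invariant subspace found; nothing left to check
        q_next = [wi / beta_next for wi in w]
        # Assumed invariant: q_next is orthogonal to q and to q_prev.
        if abs(dot(q_next, q)) > tol or abs(dot(q_next, q_prev)) > tol:
            flagged.append(j)
        betas.append(beta_next)
        q_prev, q, beta = q, q_next, beta_next
    return alphas, betas, flagged


# Small diagonal test matrix with distinct eigenvalues 1..8.
A = [[float(i + 1) if i == j else 0.0 for j in range(8)] for i in range(8)]
q0 = [1.0] * 8
print("clean run, flagged steps:", lanczos_with_check(A, q0, 5)[2])
print("faulty run, flagged steps:", lanczos_with_check(A, q0, 5, fault_at=2)[2])
```

In this sketch the clean run should report no flagged steps, while the run with fault_at=2 should flag step 2, because the corrupted product leaves a component along the previous Lanczos vector. The check costs two extra dot products per iteration. It is not complete: at step 0 there is no previous vector to test against, so a fault there passes unnoticed.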

References

Agerwala T (2010) Exascale computing: The challenges and opportunities in the next decade. In Proc. of the International Symposium on High Performance Computer Architecture (HPCA)
Arnoldi W (1951) The principle of minimized iterations in the solution of the matrix eigenvalue problem. Quart Appl Math 9:17–29
Balay S, Abhyankar S, Adams M, Brown J, Brune P, Buschelman K, Dalcin L, Eijkhout V, Gropp W, Kaushik D, Knepley M, May D, McInnes L, Rupp K, Sanan P, Smith B, Zampini S, Zhang H, Zhang H (2017) PETSc users manual. Technical Report ANL-95/11 - Revision 3.8, Argonne National Laboratory
Balay S, Gropp W, McInnes L, Smith B (1997) Efficient management of parallelism in object oriented numerical software libraries. In E. Arge, A. M. Bruaset, and H. P. Langtangen, editors, Modern Software Tools in Scientific Computing, pages 163–202. Birkhäuser Press
Braun C, Halder S, Wunderlich HJ (2014) A-ABFT: Autonomous algorithm-based fault tolerance for matrix multiplications on graphics processing units. In Proc. of the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 443–454
Bronevetsky G, de Supinski B (2008) Soft error vulnerability of iterative linear algebra methods. In Proc. of the International Conference on Supercomputing, pages 155–164
Chen Z (2013) Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods. In Proc. of the Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 167–176
Chen J, Liang X, Chen Z (2016) Online algorithm-based fault tolerance for Cholesky decomposition on heterogeneous systems with GPUs. In Proc. of the IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Davis TA, Hu Y (2011) The University of Florida sparse matrix collection. ACM Trans Math Soft 38(1):1:1–1:25
Elliott J, Hoemmen M, Mueller F (2014) Evaluating the impact of SDC on the GMRES iterative solver. In Proc. of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 1193–1202
Golub GH, van Loan CF (1996) Matrix Computations, 3rd edn. Johns Hopkins University Press, Baltimore, MD
Hakkarinen D, Wu P, Chen Z (2015) Fail-stop failure algorithm-based fault tolerance for Cholesky decomposition. IEEE Trans Par Distr Sys 26(5):1323–1335
Hernandez V, Roman JE, Tomas A, Vidal V (2006) Lanczos methods in SLEPc. Technical Report STR-5, Universitat Politècnica de València. Available at http://slepc.upv.es
Hernandez V, Roman JE, Vidal V (2005) SLEPc: A scalable and flexible toolkit for the solution of eigenvalue problems. ACM Trans Math Soft 31(3):351–362
Heroux MA (2009) Software challenges for extreme scale computing: Going from petascale to exascale systems. Int J High Perf Comput Appl 23(4):437–439
Huang KH, Abraham JA (1984) Algorithm-based fault tolerance for matrix operations. IEEE Trans Comp C-33(6):518–528
Kim H, Vuduc R, Baghsorkhi S, Choi J, Hwu W (2012) Performance analysis and tuning for general purpose graphics processing units (GPGPU). Synthesis Lectures on Computer Architecture
Knyazev A (2001) Toward the optimal preconditioned eigensolver: Locally optimal block preconditioned conjugate gradient method. SIAM J Sci Comput 23(2):517–541
Lanczos C (1950) An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. J Res Nat Bur Stand 45(4):255–282
Loh F, Ramanathan P, Saluja KK (2015) Transient fault resilient QR factorization on GPUs. In Proc. of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale, FTXS ’15, pages 63–70
Loh F, Saluja KK, Ramanathan P (2016) Fault tolerance through invariant checking for iterative solvers. In Proc. of the International Conference on VLSI Design and International Conference on Embedded Systems (VLSID), pages 481–486
Loh F, Saluja KK, Ramanathan P (2020) Fault tolerance through invariant checking for the Lanczos eigensolver. In Proc. of the International Conference on VLSI Design and International Conference on Embedded Systems (VLSID), pages 13–18
Nie B, Tiwari D, Gupta S, Smirni E, Rogers JH (2016) A large-scale study of soft-errors on GPUs in the field. In Proc. of the International Symposium on High Performance Computer Architecture (HPCA), pages 519–530
NVIDIA (2016) NVIDIA GeForce GTX 1080. White Paper
Oboril F, Tahoori MB, Heuveline V, Lukarski D, Weiss JP (2011) Numerical defect correction as an algorithm-based fault tolerance technique for iterative solvers. In Proc. of the IEEE Pacific Rim International Symposium on Dependable Computing (PRDC), pages 144–153
Shivakumar P, Kistler M, Keckler SW, Burger D, Alvisi L (2003) Modeling the impact of device and pipeline scaling on the soft error rate of processor elements. Technical Report 2002-19, Dept. of Computer Sciences, The University of Texas at Austin
Scholl A, Braun C, Kochte MA, Wunderlich H (2015) Low-overhead fault-tolerance for the preconditioned conjugate gradient solver. In Proc. of the IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFTS), pages 60–65
Shantharam M, Srinivasmurthy S, Raghavan P (2012) Fault tolerant preconditioned conjugate gradient for sparse linear system solution. In Proc. of the International Conference on Supercomputing, pages 69–78
Siefert N, Jahinuzzaman S, Velamala J, Ascazubi R, Patel N, Gill B, Basile J, Hicks J (2015) Soft error rate improvements in 14-nm technology featuring second-generation 3D tri-gate transistors. IEEE Trans Nucl Sci 62(6):2570–2577
Sloan J, Kumar R, Bronevetsky G (2012) Algorithmic approaches to low overhead fault detection for sparse linear algebra. In Proc. of the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 1–12
Wu P, Guan Q, DeBardeleben N, Blanchard S, Tao D, Liang X, Chen J, Chen Z (2016) Towards practical algorithm based fault tolerance in dense linear algebra. In Proc. of the 25th International Symposium on High-performance Parallel and Distributed Computing, HPDC ’16, pages 31–42

Author information

Authors and Affiliations

University of Wisconsin-Madison, 1415 Engineering Drive, Madison, WI, USA
Felix Loh, Kewal K. Saluja & Parameswaran Ramanathan

Corresponding author

Correspondence to Felix Loh.

Additional information

Responsible Editor: V. D. Agrawal.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Loh, F., Saluja, K.K. & Ramanathan, P. Fault Tolerant Lanczos Eigensolver via an Invariant Checking Method. J Electron Test 37, 409–422 (2021). https://doi.org/10.1007/s10836-021-05945-1

Received: 03 January 2021
Accepted: 12 April 2021
Published: 30 April 2021
Issue Date: June 2021
DOI: https://doi.org/10.1007/s10836-021-05945-1
