The Journal of Supercomputing, Volume 63, Issue 2, pp 443–466

GPU-accelerated preconditioned iterative linear solvers



This work is an overview of our preliminary experience in developing a high-performance iterative linear solver accelerated by GPU coprocessors. Our goal is to illustrate the advantages and difficulties encountered when deploying GPU technology to perform sparse linear algebra computations. We discuss techniques for speeding up sparse matrix-vector product (SpMV) kernels and for finding suitable preconditioning methods. Our experiments with an NVIDIA Tesla M2070 show that, for unstructured matrices, SpMV kernels can be up to 8 times faster on the GPU than Intel MKL on the host Intel Xeon X5675 processor. Overall, the GPU-accelerated CG method preconditioned with incomplete Cholesky (IC) factorization can outperform its CPU counterpart by a smaller factor, up to 3, and the GPU-accelerated GMRES method preconditioned with incomplete LU (ILU) factorization can achieve a speed-up nearing 4. However, with preconditioning techniques better suited to GPUs, this performance can be further improved.
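To make the two building blocks named in the abstract concrete, the following is a minimal CPU-side sketch in plain Python of an SpMV in compressed sparse row (CSR) format driving a preconditioned CG iteration. It is not the authors' implementation: the function names `csr_spmv` and `jacobi_pcg` are hypothetical, and a simple Jacobi (diagonal) preconditioner is swapped in for the IC factorization purely to keep the example short. The row loop in `csr_spmv` is the part a GPU kernel would typically parallelize, since each row is an independent dot product.

```python
import math


def csr_spmv(indptr, indices, data, x):
    """Return y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):  # rows are independent of each other
        acc = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        y[row] = acc
    return y


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def jacobi_pcg(indptr, indices, data, b, tol=1e-10, max_iter=200):
    """Solve A x = b for symmetric positive definite A with preconditioned CG.

    Assumption: the preconditioner is Jacobi, M = diag(A), used here as a
    stand-in for incomplete Cholesky. Returns (x, iterations).
    """
    n = len(b)
    inv_diag = [0.0] * n
    for row in range(n):
        for k in range(indptr[row], indptr[row + 1]):
            if indices[k] == row:
                inv_diag[row] = 1.0 / data[k]

    x = [0.0] * n
    r = list(b)                                  # r = b - A @ 0
    z = [d * ri for d, ri in zip(inv_diag, r)]   # z = M^{-1} r
    p = list(z)
    rz = dot(r, z)
    if rz == 0.0:                                # b is zero, so x = 0 solves it
        return x, 0
    b_norm = math.sqrt(dot(b, b))

    for it in range(1, max_iter + 1):
        Ap = csr_spmv(indptr, indices, data, p)  # dominant cost per iteration
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if math.sqrt(dot(r, r)) <= tol * b_norm:
            return x, it
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Small SPD example: tridiagonal matrix with 4 on the diagonal, -1 off it.
indptr = [0, 2, 5, 7]
indices = [0, 1, 0, 1, 2, 1, 2]
data = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]
b = csr_spmv(indptr, indices, data, [1.0, 2.0, 3.0])
solution, iterations = jacobi_pcg(indptr, indices, data, b)
```

In this sketch the preconditioner application is an elementwise scaling, which parallelizes trivially. An IC or ILU preconditioner instead requires sparse triangular solves at each iteration, which are far more sequential; this is one plausible reading of why the abstract reports a smaller overall speed-up for the preconditioned solvers than for SpMV alone.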


Keywords: GPU computing; Preconditioned iterative methods; Sparse matrix computations



Copyright information

© Springer Science+Business Media New York 2012

Authors and Affiliations

  1. Department of Computer Science & Engineering, University of Minnesota, Minneapolis, USA
