A Jacobi_PCG solver for sparse linear systems on multi-GPU cluster

Lin, Shaozhong; Xie, Zhiqiang

doi:10.1007/s11227-016-1887-4

Published: 11 October 2016

Volume 73, pages 433–454, (2017)
The Journal of Supercomputing

Shaozhong Lin^1,2 &
Zhiqiang Xie^1,2

Abstract

The general-purpose graphics processing unit (GPGPU or GPU) offers powerful floating-point performance and is well suited to compute-intensive tasks such as solving large linear systems. The Jacobi preconditioned conjugate gradient method (Jacobi_PCG or JPCG), a preconditioned iterative method for the numerical solution of large sparse linear systems, is highly parallel and especially well suited to implementation on GPUs. On a multi-GPU cluster, the matrix–vector multiplication in each PCG iteration needs vector entries generated by both the current GPU and the other GPUs, so communication between GPUs becomes a major performance bottleneck. In this paper, we study the implementation of JPCG on a multi-GPU cluster. Considering the coarse-grained parallelism between GPUs and the sparsity of matrices arising from the finite element method (FEM), we present a simple and fast node reordering method that reduces the bandwidth of the sparse matrices and thereby the communication between GPUs. This novel reordering method is based on the integerized nodal coordinates of the FEM mesh and the counting sort algorithm. Additionally, computation and communication are overlapped using CUDA asynchronous memory transfers and MPI_sendrecv communication to further reduce the communication cost. A JPCG solver for multi-GPU clusters is developed in CUDA Fortran. Tests show that the solver has high efficiency and strong scalability.
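For orientation, the JPCG iteration named in the abstract can be sketched on a single process as follows. This is a minimal illustration of the textbook Jacobi-preconditioned conjugate gradient method, not the authors' CUDA Fortran solver: the CSR storage, the function names (`csr_matvec`, `dot`, `tridiagonal_csr`, `jacobi_pcg`) and the stopping rule are assumptions made for the example. The call to `csr_matvec` marks the step that, on a multi-GPU cluster, would need entries of the direction vector held by other GPUs.

```python
import math


def csr_matvec(indptr, indices, data, x):
    """Return y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def tridiagonal_csr(n, diag, off):
    """Hypothetical helper: CSR arrays of an n x n tridiagonal test matrix."""
    indptr, indices, data = [0], [], []
    for i in range(n):
        for j, value in ((i - 1, off), (i, diag), (i + 1, off)):
            if 0 <= j < n:
                indices.append(j)
                data.append(value)
        indptr.append(len(indices))
    return indptr, indices, data


def jacobi_pcg(indptr, indices, data, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A (CSR) with Jacobi PCG.

    Returns the approximate solution and the number of iterations performed.
    """
    n = len(b)
    # Jacobi preconditioner: the inverse of the diagonal of A.
    inv_diag = [0.0] * n
    for row in range(n):
        for k in range(indptr[row], indptr[row + 1]):
            if indices[k] == row:
                inv_diag[row] = 1.0 / data[k]

    x = [0.0] * n
    r = list(b)  # residual b - A x for the zero initial guess
    z = [d * ri for d, ri in zip(inv_diag, r)]
    p = list(z)
    rz = dot(r, z)
    # Assumed stopping rule: residual norm relative to the norm of b.
    stop = tol * (math.sqrt(dot(b, b)) or 1.0)

    for iteration in range(max_iter):
        if math.sqrt(dot(r, r)) <= stop:
            return x, iteration
        # Sparse matrix-vector product: the communication-heavy step
        # when the rows of A are distributed across GPUs.
        ap = csr_matvec(indptr, indices, data, p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_next = dot(r, z)
        beta = rz_next / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_next
    return x, max_iter
```

As a usage example, `tridiagonal_csr(5, 4.0, -1.0)` builds a small symmetric positive definite matrix, and calling `jacobi_pcg` with a right-hand side computed by `csr_matvec` from a known vector should recover that vector to within the tolerance. Every operation other than the matrix-vector product is either elementwise or a reduction, which is why the method maps well onto GPUs.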

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant Nos. 51539002, 51509020) and the Fundamental Research Funds for Central Public Welfare Research Institutes (Grant No. CKSF2015033/CL).

Author information

Authors and Affiliations

Changjiang River Scientific Research Institute, Wuhan, 430010, China
Shaozhong Lin & Zhiqiang Xie
Research Center on Water Engineering Safety and Disaster Prevention of MWR, Wuhan, 430010, China
Shaozhong Lin & Zhiqiang Xie

Corresponding author

Correspondence to Shaozhong Lin.

About this article

Cite this article

Lin, S., Xie, Z. A Jacobi_PCG solver for sparse linear systems on multi-GPU cluster. J Supercomput 73, 433–454 (2017). https://doi.org/10.1007/s11227-016-1887-4

Published: 11 October 2016
Issue Date: January 2017
DOI: https://doi.org/10.1007/s11227-016-1887-4
