Abstract
The general-purpose graphics processing unit (GPGPU or GPU) has powerful floating-point computation ability and is well suited to compute-intensive tasks such as solving large linear systems. The Jacobi preconditioned conjugate gradient method (Jacobi_PCG or JPCG), a preconditioned iterative method for the numerical solution of large sparse linear systems, is highly parallel and is especially appropriate for implementation on GPUs. On a multi-GPU cluster, the matrix–vector multiplication in each PCG iteration needs vector entries generated by the current GPU as well as by other GPUs, so communication between GPUs becomes a major performance bottleneck. In this paper, we study the implementation of JPCG on a multi-GPU cluster. Considering the coarse-grained parallelism between GPUs and the sparsity of matrices arising from the finite element method (FEM), we present a simple and fast node reordering method that reduces the bandwidth of the sparse matrices and thereby the communication between GPUs. This novel reordering method is based on integerized nodal coordinates of the FEM mesh and the counting sort algorithm. Additionally, computation and communication are overlapped using CUDA asynchronous memory transfers and MPI_sendrecv communication to further reduce the communication cost. A JPCG solver for multi-GPU clusters is developed using CUDA Fortran. Tests show that this solver has high efficiency and strong scalability.
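For readers unfamiliar with the method, the textbook Jacobi-preconditioned conjugate gradient iteration can be sketched as follows. This is a minimal serial illustration in plain Python, not the CUDA Fortran solver described above: the function name `jacobi_pcg`, the dense list-of-rows matrix storage, and the relative-residual stopping test are all assumptions made for the example. In a multi-GPU setting each device would presumably hold a block of rows, and the matrix–vector product would be the step that needs vector entries from neighbouring devices.

```python
def jacobi_pcg(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A.

    A is a dense list of rows (an assumption for brevity; a real solver
    would use a sparse format). Returns (x, iterations_used).
    """
    n = len(b)

    def matvec(v):
        # The communication-heavy kernel on a cluster: row i needs v[j]
        # for every nonzero column j, wherever that entry lives.
        return [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    # Jacobi preconditioner: M^{-1} = diag(A)^{-1}.
    inv_diag = [1.0 / A[i][i] for i in range(n)]

    x = [0.0] * n                      # initial guess x0 = 0
    r = list(b)                        # residual r0 = b - A x0 = b
    z = [inv_diag[i] * r[i] for i in range(n)]
    p = list(z)
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5 or 1.0   # avoid dividing by zero when b = 0

    for k in range(max_iter):
        # Stop once the relative residual norm is small enough.
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, k
        Ap = matvec(p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


if __name__ == "__main__":
    # Small SPD system whose exact solution is [1, 2, 3].
    A = [[4.0, 1.0, 0.0],
         [1.0, 3.0, 1.0],
         [0.0, 1.0, 2.0]]
    b = [6.0, 10.0, 8.0]
    print(jacobi_pcg(A, b))
```

Apart from the matrix–vector product and the two dot products, every step is an element-wise vector update, which is why the method maps well onto GPUs; the preconditioner application itself is a single element-wise multiply.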
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Grant Nos. 51539002, 51509020) and the Fundamental Research Funds for Central Public Welfare Research Institutes (Grant No. CKSF2015033/CL).
Cite this article
Lin, S., Xie, Z. A Jacobi_PCG solver for sparse linear systems on multi-GPU cluster. J Supercomput 73, 433–454 (2017). https://doi.org/10.1007/s11227-016-1887-4