A Jacobi_PCG solver for sparse linear systems on multi-GPU cluster

The Journal of Supercomputing

Abstract

The General Purpose Graphics Processing Unit (GPGPU or GPU) has powerful floating-point computation ability and is suitable for intensive computing, such as solving large linear systems. The Jacobi Preconditioned Conjugate Gradient method (Jacobi_PCG or JPCG), a preconditioned iterative method for the numerical solution of large sparse linear systems, is highly parallel and is especially well suited to implementation on GPUs. On a multi-GPU cluster, the matrix–vector multiplication in each PCG iteration needs vector entries generated by both the current GPU and other GPUs, so communication between GPUs becomes a major performance bottleneck. In this paper, we study the implementation of JPCG on a multi-GPU cluster. Considering the coarse-grained parallelism between GPUs and the sparsity of matrices arising from the finite element method (FEM), we present a simple and fast node reordering method that reduces the bandwidth of the sparse matrices and thereby the communication between GPUs. This novel reordering method is based on integerized nodal coordinates of the FEM mesh and the counting sort algorithm. Additionally, computation and communication are overlapped using CUDA asynchronous memory transfer and MPI_sendrecv communication to further reduce the communication cost. A JPCG solver for multi-GPU clusters is developed using CUDA Fortran. Tests show that this solver has high efficiency and strong scalability.
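For readers unfamiliar with the method named in the abstract, the following is a minimal single-process sketch of the textbook Jacobi-preconditioned conjugate gradient iteration on a matrix in CSR storage. It is written in plain Python for readability and is not the authors' CUDA Fortran solver; the function names (`csr_matvec`, `jacobi_pcg`, `laplacian_1d_csr`), the CSR layout, the tolerance, and the small test matrix are all assumptions made for illustration. The comment on the matrix–vector product marks the step that, on a multi-GPU cluster, would require vector entries owned by other GPUs.

```python
import math


def csr_matvec(n, indptr, indices, data, x):
    """Return y = A x for an n-by-n matrix A stored in CSR form."""
    y = [0.0] * n
    for i in range(n):
        s = 0.0
        for k in range(indptr[i], indptr[i + 1]):
            s += data[k] * x[indices[k]]
        y[i] = s
    return y


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def jacobi_pcg(n, indptr, indices, data, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A; returns (x, iterations)."""
    # Jacobi preconditioner: M = diag(A), so applying M^-1 is an
    # elementwise scaling that needs no communication.
    inv_diag = [0.0] * n
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            if indices[k] == i:
                inv_diag[i] = 1.0 / data[k]

    x = [0.0] * n
    b_norm = math.sqrt(dot(b, b))
    if b_norm == 0.0:
        return x, 0                      # trivial system: x = 0

    r = list(b)                          # r = b - A x with x = 0
    z = [inv_diag[i] * r[i] for i in range(n)]
    p = list(z)
    rz = dot(r, z)

    for it in range(1, max_iter + 1):
        # Sparse matrix-vector product: in a distributed setting this is the
        # step that needs entries of p held by other processes or GPUs.
        q = csr_matvec(n, indptr, indices, data, p)
        alpha = rz / dot(p, q)
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * q[i] for i in range(n)]
        if math.sqrt(dot(r, r)) <= tol * b_norm:
            return x, it
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x, max_iter


def laplacian_1d_csr(n):
    """Hypothetical test matrix: tridiagonal (-1, 2, -1) in CSR form."""
    indptr, indices, data = [0], [], []
    for i in range(n):
        for j, v in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
            if 0 <= j < n:
                indices.append(j)
                data.append(v)
        indptr.append(len(indices))
    return n, indptr, indices, data
```

As a usage example, calling `jacobi_pcg(*laplacian_1d_csr(4), [0.0, 0.0, 0.0, 5.0])` solves a small tridiagonal system whose exact solution is `[1, 2, 3, 4]`. In this sketch the dot products and the matrix–vector product are the only operations that would need data from other processes in a distributed version, which is consistent with the abstract's focus on reducing matrix bandwidth and overlapping communication with computation.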

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant Nos. 51539002, 51509020) and the Fundamental Research Funds for Central Public Welfare Research Institutes (Grant No. CKSF2015033/CL).

Author information

Corresponding author

Correspondence to Shaozhong Lin.

About this article

Cite this article

Lin, S., Xie, Z. A Jacobi_PCG solver for sparse linear systems on multi-GPU cluster. J Supercomput 73, 433–454 (2017). https://doi.org/10.1007/s11227-016-1887-4

  • DOI: https://doi.org/10.1007/s11227-016-1887-4
