
SpMV and BiCG-Stab optimization for a class of hepta-diagonal-sparse matrices on GPU


The abundant data parallelism available in many-core GPUs has been of key interest for improving the accuracy of scientific and engineering simulations. In many cases, most of the simulation time is spent in a linear solver built around sparse matrix–vector multiplication. In forward petroleum oil and gas reservoir simulation, applying a stencil relationship to a structured grid leads to a family of generalized hepta-diagonal solver matrices with some regularity and structural uniqueness. We present a customized storage scheme that exploits the generalized hepta-diagonal sparsity pattern and stencil regularity by optimizing both storage and matrix–vector computation. We also present an in-kernel optimization for implementing sparse matrix–vector multiply (SpMV) and the biconjugate gradient stabilized (BiCG-Stab) solver. The in-kernel approach avoids the multiple kernel invocations associated with the use of numerical library operators. To stay in-kernel, a lock-free inter-block synchronization is used in which completed thread blocks are assigned independent computations to avoid repeatedly polling global memory. Other optimizations combine reductions and collective write operations to memory. The in-kernel optimization is particularly useful for the iterative structure of BiCG-Stab: it preserves vector data locality and avoids saving vector data back to memory and reloading it on each kernel exit and re-entry. The evaluation uses generalized hepta-diagonal matrices derived from a range of structured grids for forward reservoir simulation. Results show the benefit of the proposed generalized hepta-diagonal custom storage scheme over standard library formats such as compressed sparse row, hybrid sparse, and diagonal storage. With the proposed optimizations, SpMV and BiCG-Stab are noticeably faster than implementations that incur multiple kernel exits and re-entries when the solver is built from numerical library operators.
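The storage idea above can be illustrated with a diagonal-style (DIA-like) layout: a 7-point stencil on an nx × ny × nz structured grid produces one nonzero band per neighbor offset, so the matrix can be stored as seven dense arrays, one per offset. The sketch below is our own simplified illustration, not the paper's exact scheme; the names `hepta_offsets` and `hepta_spmv` are hypothetical, and boundary handling on the grid faces is simplified to range clipping.

```python
# Illustrative sketch (assumed names, simplified boundaries): DIA-like
# storage for the hepta-diagonal matrices a 7-point stencil generates
# on a flattened nx x ny x nz structured grid.
import numpy as np

def hepta_offsets(nx, ny):
    # Flattened-index offsets of the 7 stencil neighbors:
    # down/up one z-plane, one y-row, one x-cell, plus the diagonal.
    return [-nx * ny, -nx, -1, 0, 1, nx, nx * ny]

def hepta_spmv(diags, offsets, x):
    # y = A @ x with A stored as 7 dense diagonals (diags[d][i] is the
    # coefficient coupling cell i to cell i + offsets[d]); neighbors
    # that fall outside [0, n) contribute nothing.
    n = x.size
    y = np.zeros(n)
    for d, off in zip(diags, offsets):
        lo, hi = max(0, -off), min(n, n - off)
        y[lo:hi] += d[lo:hi] * x[lo + off:hi + off]
    return y
```

One reason such a layout suits GPUs: each diagonal is a contiguous array traversed in index order, so consecutive threads read consecutive elements and memory accesses coalesce, unlike CSR's indirect column lookups.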






This project was funded by the National Plan for Science, Technology, and Innovation (MAARIFAH), King Abdulaziz City for Science and Technology, through the Science & Technology Unit at King Fahd University of Petroleum and Minerals (KFUPM), Kingdom of Saudi Arabia, award number 12-INF3008-04. Thanks to King Fahd University of Petroleum & Minerals (KFUPM) for computing support.

Corresponding author

Correspondence to Mayez A. Al-Mouhamed.


About this article


Cite this article

Al-Mouhamed, M.A., Khan, A.H. SpMV and BiCG-Stab optimization for a class of hepta-diagonal-sparse matrices on GPU. J Supercomput 73, 3761–3795 (2017).



  • Reservoir simulation
  • BiCG-Stab
  • SpMV
  • GPU
  • CUDA optimization
  • Structured grid
  • Synchronization
  • Coalesced memory