
Performance modeling and optimization of sparse matrix-vector multiplication on NVIDIA CUDA platform


In this article, we discuss the performance modeling and optimization of Sparse Matrix-Vector Multiplication (SpMV) on NVIDIA GPUs using CUDA. SpMV has a very low computation-to-data ratio, and its performance is mainly bound by memory bandwidth. We propose optimizations of SpMV based on ELLPACK from two aspects: (1) enhancing performance for the dense vector by reducing cache misses, and (2) reducing the accessed matrix data by index reduction. With matrix bandwidth reduction techniques, both cache usage enhancement and index compression can be enabled. For GPUs with better cache support, we propose a differentiated memory access scheme to avoid contamination of caches by matrix data. Performance evaluation shows that the combined speedups of the proposed optimizations are 16% (single-precision) and 12.6% (double-precision) for the GT-200 GPU, and 19% (single-precision) and 15% (double-precision) for the GF-100 GPU.
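To make the ELLPACK layout and the idea of index reduction concrete, the following is a minimal CPU-side sketch in Python, not the article's CUDA implementation. It assumes a square matrix and assumes that index reduction can be illustrated by storing each column index as an offset from its row index, so that a matrix with small bandwidth needs only a few bits per index. The function names `to_ellpack`, `spmv_ellpack`, and `offset_bits` are hypothetical and chosen for illustration only.

```python
def to_ellpack(dense):
    """Pack a square dense matrix (list of lists) into ELLPACK arrays.

    Every row is padded to the same width. Column indices are stored as
    offsets (col - row); this offset encoding is an illustrative assumption.
    """
    n = len(dense)
    width = max(sum(1 for v in row if v != 0) for row in dense)
    vals = [[0.0] * width for _ in range(n)]
    offs = [[0] * width for _ in range(n)]  # padding: value 0.0, offset 0
    for i, row in enumerate(dense):
        k = 0
        for j, v in enumerate(row):
            if v != 0:
                vals[i][k] = float(v)
                offs[i][k] = j - i
                k += 1
    return vals, offs


def spmv_ellpack(vals, offs, x):
    """Compute y = A * x from the ELLPACK arrays."""
    y = []
    for i, (row_vals, row_offs) in enumerate(zip(vals, offs)):
        acc = 0.0
        for v, d in zip(row_vals, row_offs):
            acc += v * x[i + d]  # padded slots multiply by 0.0
        y.append(acc)
    return y


def offset_bits(offs):
    """Bits needed to store every offset as a signed integer."""
    m = max(abs(d) for row in offs for d in row)
    return m.bit_length() + 1  # one extra bit for the sign


# A tridiagonal matrix: its bandwidth is 1, so offsets lie in [-1, 1].
A = [[2, -1, 0, 0],
     [-1, 2, -1, 0],
     [0, -1, 2, -1],
     [0, 0, -1, 2]]
x = [1.0, 2.0, 3.0, 4.0]

vals, offs = to_ellpack(A)
y = spmv_ellpack(vals, offs, x)
print(y)                  # [0.0, 0.0, 0.0, 5.0]
print(offset_bits(offs))  # 2
```

In this toy case each index fits in 2 bits instead of a 32-bit integer, which hints at why reducing the matrix bandwidth first makes index compression possible. A GPU kernel would typically store the ELLPACK arrays column by column so that consecutive threads read consecutive addresses; the row-wise lists here are used only for readability.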



Author information

Correspondence to Shiming Xu.

Rights and permissions

Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.


About this article

Cite this article

Xu, S., Xue, W. & Lin, H.X. Performance modeling and optimization of sparse matrix-vector multiplication on NVIDIA CUDA platform. J Supercomput 63, 710–721 (2013). https://doi.org/10.1007/s11227-011-0626-0



  • Sparse matrices-vector multiplication
  • GPU
  • CUDA
  • Matrix permutation
  • Cache optimization