Abstract
In this article, we discuss the performance modeling and optimization of Sparse Matrix-Vector Multiplication (SpMV) on NVIDIA GPUs using CUDA. SpMV has a very low computation-to-data ratio, and its performance is mainly bound by memory bandwidth. We propose optimizations of SpMV based on the ELLPACK format from two aspects: (1) improving access performance for the dense vector by reducing cache misses, and (2) reducing the amount of matrix data accessed through index reduction. Matrix bandwidth reduction techniques enable both the cache usage enhancement and the index compression. For GPUs with better cache support, we propose a differentiated memory access scheme to avoid contamination of the caches by matrix data. Performance evaluation shows that the combined speedups of the proposed optimizations are 16% (single-precision) and 12.6% (double-precision) on the GT-200 GPU, and 19% (single-precision) and 15% (double-precision) on the GF-100 GPU.
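To make the ELLPACK layout and the idea of index reduction concrete, the following is a minimal CPU reference sketch in Python, not the authors' CUDA kernel. It assumes a square matrix, and it assumes that index reduction means storing each column index as a 16-bit offset from the row index, which only fits when the matrix bandwidth is small (for example after a reordering such as Cuthill-McKee). The abstract does not spell out the exact compression scheme, so this encoding and the names `to_ell_delta` and `ell_spmv` are illustrative assumptions.

```python
from array import array


def to_ell_delta(rows):
    """Build ELLPACK arrays for a square matrix given as rows of (col, val).

    Storage is slot-major (entry k of row i sits at k * n + i), the layout
    that lets consecutive GPU threads read consecutive addresses.
    Column indices are stored as 16-bit offsets (col - row): an assumed
    form of index reduction, valid only for small matrix bandwidth.
    """
    n = len(rows)
    width = max((len(r) for r in rows), default=0)
    vals = array("d", [0.0]) * (width * n)   # padded slots hold 0.0
    deltas = array("h", [0]) * (width * n)   # padded slots point at the diagonal
    for i, row in enumerate(rows):
        for k, (col, val) in enumerate(row):
            d = col - i
            if not -32768 <= d <= 32767:
                raise ValueError("bandwidth too large for 16-bit offsets")
            vals[k * n + i] = val
            deltas[k * n + i] = d
    return vals, deltas, width


def ell_spmv(vals, deltas, width, x):
    """Compute y = A @ x from the delta-indexed ELLPACK arrays."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):                # in CUDA, one thread would own row i
        acc = 0.0
        for k in range(width):
            idx = k * n + i
            # Reads of x stay near position i when the bandwidth is small,
            # which is what makes the dense vector cache-friendly.
            acc += vals[idx] * x[i + deltas[idx]]
        y[i] = acc
    return y


# Small tridiagonal example (hypothetical data, bandwidth 1).
rows = [
    [(0, 2.0), (1, -1.0)],
    [(0, -1.0), (1, 2.0), (2, -1.0)],
    [(1, -1.0), (2, 2.0), (3, -1.0)],
    [(2, -1.0), (3, 2.0)],
]
vals, deltas, width = to_ell_delta(rows)
y = ell_spmv(vals, deltas, width, [1.0, 2.0, 3.0, 4.0])
```

With 16-bit offsets in place of 32-bit column indices, the index traffic per stored entry is halved. In a bandwidth-bound kernel this kind of saving is what the second optimization targets, and the same small bandwidth keeps accesses to the dense vector local.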
Rights and permissions
Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
About this article
Cite this article
Xu, S., Xue, W. & Lin, H.X. Performance modeling and optimization of sparse matrix-vector multiplication on NVIDIA CUDA platform. J Supercomput 63, 710–721 (2013). https://doi.org/10.1007/s11227-011-0626-0
DOI: https://doi.org/10.1007/s11227-011-0626-0