LightSpMV: Faster CUDA-Compatible Sparse Matrix-Vector Multiplication Using Compressed Sparse Rows

Published in: Journal of Signal Processing Systems

Abstract

Compressed sparse row (CSR) is one of the most frequently used sparse matrix storage formats. However, the efficiency of existing CUDA-compatible CSR-based sparse matrix-vector multiplication (SpMV) implementations is relatively low. We address this issue by presenting LightSpMV, a parallelized CSR-based SpMV implementation programmed in CUDA C++. This algorithm achieves high speed by employing atomic and warp shuffle instructions to implement fine-grained dynamic distribution of matrix rows over vectors/warps as well as efficient vector dot-product computation. Moreover, we propose a unified cache hit rate computation approach to consistently investigate the caching behavior of different SpMV kernels, which may deploy data differently in the hierarchical memory space of CUDA-enabled GPUs. We have assessed LightSpMV using a set of sparse matrices and further compared it to the CSR-based SpMV kernels in the top-performing CUSP, ViennaCL and cuSPARSE libraries. Our experimental results demonstrate that LightSpMV is superior to CUSP, ViennaCL and cuSPARSE on the same Kepler-based Tesla K40c GPU, running up to 2.63× and 2.65× faster than CUSP, up to 2.52× and 1.96× faster than ViennaCL, and up to 1.94× and 1.79× faster than cuSPARSE for single and double precision, respectively. In addition, for the acceleration of the PageRank graph application, LightSpMV remains consistently superior to the three aforementioned counterparts. LightSpMV is open-source and publicly available at http://lightspmv.sourceforge.net.
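
A minimal CUDA C++ sketch of the warp-level idea described above may help make it concrete: a global atomic counter hands out matrix rows to warps on demand, and each warp computes the dot product of its row with the input vector through a shuffle-based reduction. This is an illustration written for this page, not code taken from LightSpMV; the kernel name csrSpmvWarpDynamic, the rowCounter parameter and the use of the CUDA 9+ *_sync shuffle intrinsics are assumptions, and LightSpMV itself additionally distributes rows over smaller "vectors" (sub-warps) with a runtime-selected vector size.

#include <cuda_runtime.h>

#define WARP_SIZE 32

// Sketch only: y = A * x for a matrix A stored in CSR form (rowPtr, colIdx, vals).
// rowCounter must point to a device integer initialized to 0 before each launch.
__global__ void csrSpmvWarpDynamic(int numRows,
                                   const int*   __restrict__ rowPtr,     // numRows + 1 row offsets
                                   const int*   __restrict__ colIdx,     // column index of each nonzero
                                   const float* __restrict__ vals,       // value of each nonzero
                                   const float* __restrict__ x,          // dense input vector
                                   float*       __restrict__ y,          // dense output vector
                                   int*         __restrict__ rowCounter) // global atomic row counter
{
    const int lane = threadIdx.x & (WARP_SIZE - 1);
    int row = 0;

    for (;;) {
        // Fine-grained dynamic scheduling: lane 0 atomically fetches the next row index.
        if (lane == 0)
            row = atomicAdd(rowCounter, 1);
        // Broadcast the fetched row index from lane 0 to the whole warp.
        row = __shfl_sync(0xffffffff, row, 0);
        if (row >= numRows)
            return;

        // Each lane accumulates a strided partial sum of the row's dot product.
        float sum = 0.0f;
        for (int j = rowPtr[row] + lane; j < rowPtr[row + 1]; j += WARP_SIZE)
            sum += vals[j] * x[colIdx[j]];

        // Warp-level reduction of the partial sums using shuffle instructions.
        for (int offset = WARP_SIZE / 2; offset > 0; offset >>= 1)
            sum += __shfl_down_sync(0xffffffff, sum, offset);

        if (lane == 0)
            y[row] = sum;
    }
}

Launched with a block size that is a multiple of 32 (for example, csrSpmvWarpDynamic<<<blocks, 128>>>(...)), each warp keeps pulling rows until the counter passes numRows, which balances load across irregular row lengths without any preprocessing of the matrix.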

Acknowledgments

We acknowledge funding from the Center for Computational Sciences (SRFN), Johannes Gutenberg University Mainz, and the Carl-Zeiss-Foundation.

Author information

Corresponding author

Correspondence to Yongchao Liu.

Additional information

Selected Paper from the 26th IEEE International Conference on Application-specific Systems, Architectures and Processors.

About this article

Cite this article

Liu, Y., Schmidt, B. LightSpMV: Faster CUDA-Compatible Sparse Matrix-Vector Multiplication Using Compressed Sparse Rows. J Sign Process Syst 90, 69–86 (2018). https://doi.org/10.1007/s11265-016-1216-4
