LightSpMV: Faster CUDA-Compatible Sparse Matrix-Vector Multiplication Using Compressed Sparse Rows

Published in: Journal of Signal Processing Systems

Abstract

Compressed sparse row (CSR) is one of the most frequently used sparse matrix storage formats. However, the efficiency of existing CUDA-compatible CSR-based sparse matrix-vector multiplication (SpMV) implementations is relatively low. We address this issue by presenting LightSpMV, a parallelized CSR-based SpMV implementation programmed in CUDA C++. This algorithm achieves high speed by employing atomic and warp shuffle instructions to implement fine-grained dynamic distribution of matrix rows over vectors/warps as well as efficient vector dot-product computation. Moreover, we propose a unified cache hit rate computation approach to consistently investigate the caching behavior of different SpMV kernels, which may deploy data differently in the hierarchical memory space of CUDA-enabled GPUs. We have assessed LightSpMV using a set of sparse matrices and further compared it to the CSR-based SpMV kernels in the top-performing CUSP, ViennaCL and cuSPARSE libraries. Our experimental results demonstrate that LightSpMV is superior to CUSP, ViennaCL and cuSPARSE on the same Kepler-based Tesla K40c GPU, running up to 2.63× and 2.65× faster than CUSP, up to 2.52× and 1.96× faster than ViennaCL, and up to 1.94× and 1.79× faster than cuSPARSE for single and double precision, respectively. In addition, for the acceleration of the PageRank graph application, LightSpMV remains consistently superior to the three aforementioned counterparts. LightSpMV is open-source and publicly available at http://lightspmv.sourceforge.net.
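
A minimal CUDA C++ sketch of the warp-level idea described above may help make it concrete: a global atomic counter hands out matrix rows to warps on demand, and each warp computes the dot product of its row with the input vector through a shuffle-based reduction. This is an illustration written for this page, not code taken from LightSpMV; the kernel name csrSpmvWarpDynamic, the rowCounter parameter and the use of the CUDA 9+ *_sync shuffle intrinsics are assumptions, and LightSpMV itself additionally distributes rows over smaller "vectors" (sub-warps) with a runtime-selected vector size.

#include <cuda_runtime.h>

#define WARP_SIZE 32

// Sketch only: y = A * x for a matrix A stored in CSR form (rowPtr, colIdx, vals).
// rowCounter must point to a device integer initialized to 0 before each launch.
__global__ void csrSpmvWarpDynamic(int numRows,
                                   const int*   __restrict__ rowPtr,     // numRows + 1 row offsets
                                   const int*   __restrict__ colIdx,     // column index of each nonzero
                                   const float* __restrict__ vals,       // value of each nonzero
                                   const float* __restrict__ x,          // dense input vector
                                   float*       __restrict__ y,          // dense output vector
                                   int*         __restrict__ rowCounter) // global atomic row counter
{
    const int lane = threadIdx.x & (WARP_SIZE - 1);
    int row = 0;

    for (;;) {
        // Fine-grained dynamic scheduling: lane 0 atomically fetches the next row index.
        if (lane == 0)
            row = atomicAdd(rowCounter, 1);
        // Broadcast the fetched row index from lane 0 to the whole warp.
        row = __shfl_sync(0xffffffff, row, 0);
        if (row >= numRows)
            return;

        // Each lane accumulates a strided partial sum of the row's dot product.
        float sum = 0.0f;
        for (int j = rowPtr[row] + lane; j < rowPtr[row + 1]; j += WARP_SIZE)
            sum += vals[j] * x[colIdx[j]];

        // Warp-level reduction of the partial sums using shuffle instructions.
        for (int offset = WARP_SIZE / 2; offset > 0; offset >>= 1)
            sum += __shfl_down_sync(0xffffffff, sum, offset);

        if (lane == 0)
            y[row] = sum;
    }
}

Launched with a block size that is a multiple of 32 (for example, csrSpmvWarpDynamic<<<blocks, 128>>>(...)), each warp keeps pulling rows until the counter passes numRows, which balances load across irregular row lengths without any preprocessing of the matrix.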

Acknowledgments

We acknowledge funding from the Center for Computational Sciences (SRFN), Johannes Gutenberg University Mainz, and the Carl-Zeiss-Foundation.

Author information

Corresponding author

Correspondence to Yongchao Liu.

Additional information

Selected Paper from the 26th IEEE International Conference on Application-specific Systems, Architectures and Processors.

About this article

Cite this article

Liu, Y., Schmidt, B. LightSpMV: Faster CUDA-Compatible Sparse Matrix-Vector Multiplication Using Compressed Sparse Rows. J Sign Process Syst 90, 69–86 (2018). https://doi.org/10.1007/s11265-016-1216-4
