Abstract
Compressed sparse row (CSR) is one of the most frequently used sparse matrix storage formats. However, the efficiency of existing CUDA-compatible CSR-based sparse matrix-vector multiplication (SpMV) implementations is relatively low. We address this issue by presenting LightSpMV, a parallelized CSR-based SpMV implementation programmed in CUDA C++. This algorithm achieves high speed by employing atomic and warp shuffle instructions to implement fine-grained dynamic distribution of matrix rows over vectors/warps as well as efficient vector dot product computation. Moreover, we propose a unified cache hit rate computation approach to consistently investigate the caching behavior of different SpMV kernels, which may place data differently in the hierarchical memory space of CUDA-enabled GPUs. We have assessed LightSpMV using a set of sparse matrices and further compared it to the CSR-based SpMV kernels in the top-performing CUSP, ViennaCL and cuSPARSE libraries. Our experimental results demonstrate that LightSpMV is superior to CUSP, ViennaCL and cuSPARSE on the same Kepler-based Tesla K40c GPU, running up to 2.63× and 2.65× faster than CUSP, up to 2.52× and 1.96× faster than ViennaCL, and up to 1.94× and 1.79× faster than cuSPARSE in single and double precision, respectively. In addition, for the acceleration of the PageRank graph application, LightSpMV maintains its consistent superiority over the aforementioned three counterparts. LightSpMV is open-source and publicly available at http://lightspmv.sourceforge.net.
Acknowledgments
We acknowledge funding by the Center for Computational Sciences (SRFN), Johannes Gutenberg University Mainz, and the Carl-Zeiss-Foundation.
Selected Paper from the 26th IEEE International Conference on Application-specific Systems, Architectures and Processors.
Liu, Y., Schmidt, B. LightSpMV: Faster CUDA-Compatible Sparse Matrix-Vector Multiplication Using Compressed Sparse Rows. J Sign Process Syst 90, 69–86 (2018). https://doi.org/10.1007/s11265-016-1216-4