Design Principles for Sparse Matrix Multiplication on the GPU

  • Carl Yang
  • Aydın Buluç
  • John D. Owens
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11014)


Abstract

We implement two novel algorithms for sparse-matrix dense-matrix multiplication (SpMM) on the GPU. Our algorithms expect the sparse input in the popular compressed-sparse-row (CSR) format and thus do not require expensive format conversion. While previous SpMM work concentrates on thread-level parallelism, we additionally focus on latency hiding with instruction-level parallelism and load balancing. We show, both theoretically and experimentally, that the proposed SpMM is a better fit for the GPU than previous approaches. We identify a key memory-access pattern that allows efficient access to both input and output matrices and is crucial to excellent SpMM performance. By combining these two ingredients—(i) merge-based load balancing and (ii) row-major coalesced memory access—we demonstrate a 4.1× peak speedup and a 31.7% geomean speedup over state-of-the-art SpMM implementations on real-world datasets.
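As a point of reference only (not the paper's GPU implementation), the CSR storage the algorithms consume and the row-major access pattern the abstract highlights can be sketched on the CPU. In the inner loop, one nonzero of the sparse matrix scales an entire contiguous row of the dense matrix, so reads from B and writes to C both walk consecutive addresses—the analogue of coalesced access on the GPU:

```python
def spmm_csr(row_ptr, col_idx, vals, B):
    """Reference C = A @ B, where A is sparse in CSR form and B is a
    row-major dense matrix (list of rows).

    Each nonzero vals[k] of row i scales the whole contiguous row
    B[col_idx[k]] into row i of C, so the innermost loop over j touches
    consecutive elements of both B and C (row-major access).
    """
    m = len(row_ptr) - 1                    # number of sparse rows
    n = len(B[0])                           # number of dense columns
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):                      # in spirit: one warp per sparse row
        for k in range(row_ptr[i], row_ptr[i + 1]):
            a, col = vals[k], col_idx[k]
            for j in range(n):              # consecutive j -> consecutive addresses
                C[i][j] += a * B[col][j]
    return C

# Sparse A = [[1,0,2],[0,3,0],[4,0,5]] in CSR form:
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [1.0, 2.0, 3.0, 4.0, 5.0]
B = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
print(spmm_csr(row_ptr, col_idx, vals, B))  # [[8.0, 11.0], [6.0, 9.0], [20.0, 29.0]]
```

On a GPU the j loop would be spread across the threads of a warp, which is what makes the row-major layout of B and C pay off.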


Keywords

Sparse matrix multiplication · Parallel · GPU
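The first ingredient named in the abstract, merge-based load balancing, assigns each thread an equal share of the total work (rows plus nonzeros) instead of a fixed number of rows, so a single very long row is split across several threads. A CPU-side sketch of the per-thread diagonal binary search (a hypothetical helper in the spirit of merge-path partitioning, not the paper's CUDA code):

```python
def merge_path_partition(row_ptr, num_threads):
    """Split the work of a CSR matrix evenly across threads, merge-path style.

    The row-end offsets row_ptr[1:] and the nonzero slots 0..nnz-1 are
    treated as one merged sequence of length (#rows + #nonzeros). Each
    thread starts on an equally spaced diagonal of that merge grid; a
    binary search finds how many row boundaries precede its diagonal.
    Returns one (row, nonzero) start coordinate per thread.
    """
    num_rows = len(row_ptr) - 1
    nnz = row_ptr[-1]
    total = num_rows + nnz                   # total merge items to consume
    starts = []
    for t in range(num_threads):
        diag = t * total // num_threads      # start of this thread's diagonal
        # Largest `row` whose first `row` row-ends fit within `diag` items,
        # i.e. largest row with row + row_ptr[row] <= diag.
        lo, hi = 0, num_rows
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if mid + row_ptr[mid] <= diag:
                lo = mid
            else:
                hi = mid - 1
        starts.append((lo, diag - lo))       # nonzeros consumed = diag - rows consumed
    return starts

# Same 3x5-nonzero example matrix as above: 3 rows + 5 nonzeros = 8 work items.
print(merge_path_partition([0, 2, 3, 5], 2))  # [(0, 0), (1, 3)]
```

Each thread then sweeps its segment of the merge path, accumulating nonzeros and closing rows as it crosses row boundaries, so no thread ever owns more than its equal share of work regardless of how skewed the row lengths are.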



Acknowledgments

We appreciate the funding support from the National Science Foundation (Award # CCF-1629657), the DARPA XDATA program (US Army award W911QX-12-C-0059), and the DARPA HIVE program. For HIVE support, this material is based on research sponsored by Air Force Research Lab (AFRL) and the Defense Advanced Research Projects Agency (DARPA) under agreement number FA8650-18-2-7836. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air Force Research Lab (AFRL) and the Defense Advanced Research Projects Agency (DARPA) or the U.S. Government.

This manuscript has been authored by an author at Lawrence Berkeley National Laboratory under Contract No. DE-AC02-05CH11231 with the U.S. Department of Energy. The U.S. Government retains, and the publisher, by accepting the article for publication, acknowledges, that the U.S. Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for U.S. Government purposes.

This research was supported in part by the Applied Mathematics program of the DOE Office of Advanced Scientific Computing Research under Contract No. DE-AC02-05CH11231, and in part by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration.



Copyright information

© 2018. This is a U.S. government work and its text is not subject to copyright protection in the United States; however, its text may be subject to foreign copyright protection.

Authors and Affiliations

  1. University of California, Davis, USA
  2. Lawrence Berkeley National Laboratory, Berkeley, USA
  3. University of California, Berkeley, USA