Register-Aware Optimizations for Parallel Sparse Matrix–Matrix Multiplication

  • Junhong Liu
  • Xin He
  • Weifeng Liu
  • Guangming Tan


General sparse matrix–matrix multiplication (SpGEMM) is a fundamental building block of a number of high-level algorithms and real-world applications. In recent years, several efficient SpGEMM algorithms have been proposed for many-core processors such as GPUs. However, their implementations of sparse accumulators, the core component of SpGEMM, mostly use low-speed on-chip shared memory and global memory, while high-speed registers remain seriously underutilised. In this paper, we propose three novel register-aware SpGEMM algorithms, one for each of three representative sparse accumulators: sort, merge and hash. We fully utilise GPU registers to fetch data, perform computations and store results. In our experiments, the algorithms deliver excellent performance on a benchmark suite of 205 sparse matrices from the SuiteSparse Matrix Collection. Specifically, on an Nvidia Pascal P100 GPU, our three register-aware sparse accumulators achieve on average 2.0× (up to 5.4×), 2.6× (up to 10.5×) and 1.7× (up to 5.2×) speedups over their original implementations in the libraries bhSPARSE, RMerge and NSPARSE, respectively.
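To give intuition for what a sparse accumulator does, the following is a minimal serial sketch (not the paper's GPU implementation) of Gustavson's row-by-row SpGEMM using a hash-based accumulator, the simplest analogue of the "hash" variant named in the abstract. The function name `spgemm_hash` and the use of a Python dict as the accumulator are illustrative assumptions; the CSR input layout (`ptr`, `idx`, `val` arrays) is standard.

```python
def spgemm_hash(a_ptr, a_idx, a_val, b_ptr, b_idx, b_val, n_rows):
    """Serial sketch of Gustavson's SpGEMM with a hash accumulator.

    A and B are given in CSR form; returns C = A * B in CSR form.
    On a GPU, the per-row accumulator would live in registers or
    shared memory rather than in a Python dict.
    """
    c_ptr, c_idx, c_val = [0], [], []
    for i in range(n_rows):
        acc = {}  # hash accumulator: column index -> partial sum for row i
        for k in range(a_ptr[i], a_ptr[i + 1]):
            j, v = a_idx[k], a_val[k]
            # Scale row j of B by A[i, j] and merge it into the accumulator
            for t in range(b_ptr[j], b_ptr[j + 1]):
                col = b_idx[t]
                acc[col] = acc.get(col, 0.0) + v * b_val[t]
        for col in sorted(acc):  # emit row i of C in ascending column order
            c_idx.append(col)
            c_val.append(acc[col])
        c_ptr.append(len(c_idx))
    return c_ptr, c_idx, c_val
```

The sort and merge accumulators differ only in how the per-row partial products are combined: sort-based variants sort the (column, value) pairs and reduce duplicates, while merge-based variants merge already-sorted rows of B pairwise.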


Keywords: Sparse matrix · Sparse matrix–matrix multiplication · GPU · Register



We would like to express our gratitude to all reviewers for their constructive comments, which helped us polish this paper. This work is supported by the National Key Research and Development Program of China (2017YFB0202105, 2016YFB0201305, 2016YFB0200803, 2016YFB0200300), the National Natural Science Foundation of China under Grant Nos. 61521092, 91430218, 31327901, 61472395 and 61432018, and the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie project (Grant No. 752321).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China
  2. Department of Computer Science, Norwegian University of Science and Technology, Trondheim, Norway
