Register-Aware Optimizations for Parallel Sparse Matrix–Matrix Multiplication

Liu, Junhong; He, Xin; Liu, Weifeng; Tan, Guangming

doi:10.1007/s10766-018-0604-8

Register-Aware Optimizations for Parallel Sparse Matrix–Matrix Multiplication

Published: 01 January 2019

Volume 47, pages 403–417, (2019)
Cite this article

International Journal of Parallel Programming Aims and scope Submit manuscript

Junhong Liu ORCID: orcid.org/0000-0001-7596-7011¹,
Xin He¹,
Weifeng Liu² &
…
Guangming Tan¹

739 Accesses
12 Citations
Explore all metrics

Abstract

General sparse matrix–matrix multiplication (SpGEMM) is a fundamental building block of a number of high-level algorithms and real-world applications. In recent years, several efficient SpGEMM algorithms have been proposed for many-core processors such as GPUs. However, their implementations of sparse accumulators, the core component of SpGEMM, mostly use low speed on-chip shared memory and global memory, and high speed registers are seriously underutilised. In this paper, we propose three novel register-aware SpGEMM algorithms for three representative sparse accumulators, i.e., sort, merge and hash, respectively. We fully utilise the GPU registers to fetch data, finish computations and store results out. In the experiments, our algorithms deliver excellent performance on a benchmark suite including 205 sparse matrices from the SuiteSparse Matrix Collection. Specifically, on an Nvidia Pascal P100 GPU, our three register-aware sparse accumulators achieve on average 2.0 \(\times \) (up to 5.4 \(\times \)), 2.6 \(\times \) (up to 10.5 \(\times \)) and 1.7 \(\times \) (up to 5.2 \(\times \)) speedups over their original implementations in libraries bhSPARSE, RMerge and NSPARSE, respectively.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Design Principles for Sparse Matrix Multiplication on the GPU

Block-wise dynamic mixed-precision for sparse matrix-vector multiplication on GPUs

Article Open access 11 March 2024

Systematic Approach in Optimizing Numerical Memory-Bound Kernels on GPU

Notes

This should not be confused with sparse matrix-vector multiplication (SpMV), which multiplies a sparse matrix with a dense vector and obtains a dense vector.

References

Bell, N., Dalton, S., Olson, L.: Exposing fine-grained parallelism in algebraic multigrid methods. SIAM J. Sci. Comput. 34(4), C123–C152 (2012)
Article MathSciNet MATH Google Scholar
D’Alberto, P., Nicolau, A.: R-kleene: a high-performance divide-and-conquer algorithm for the all-pair shortest path for densely connected networks. Algorithmica 47(2), 203–213 (2007)
Article MathSciNet MATH Google Scholar
Zhang, F., Lin, H., Zhai, J., Cheng, J., Xiang, D., Li, J., Chai, Y., Du, X.: An adaptive breadth-first search algorithm on integrated architectures. J. Supercomput. 74(11), 6135–6155 (2018)
Article Google Scholar
Azad, A., Pavlopoulos, G.A., Ouzounis, C.A., Kyrpides, N.C., Buluç, A.: Hipmcl: a high-performance parallel implementation of the markov clustering algorithm for large-scale networks. Nucleic Acids Res. 46(6), e33 (2018)
Article Google Scholar
Mattson, T.G., Yang, C., McMillan, S., Buluç, A., Moreira, J.E.: GraphBLAS C API: Ideas for future versions of the specification. In: IEEE High Performance Extreme Computing Conference (HPEC) (2017)
Davis, T.A.: Graph algorithms via SuiteSparse: GraphBLAS: triangle counting and k-truss. In: IEEE High Performance Extreme Computing Conference (HPEC) (2018)
Dalton, S., Olson, L., Bell, N.: Optimizing sparse matrix-matrix multiplication for the gpu. ACM Trans. Math. Softw. 41(4), 25:1–25:20 (2015)
Article MathSciNet MATH Google Scholar
Demouth, J.: Sparse matrix-matrix multiplication on the GPU. GTC ’12 (2012)
Kunchum, R., Chaudhry, A., Sukumaran-Rajam, A., Niu, Q., Nisa, I., Sadayappan, P.: On improving performance of sparse matrix-matrix multiplication on gpus. In: Proceedings of the International Conference on Supercomputing (ICS ’17), pp. 14:1–14:11 (2017)
Akbudak, K., Aykanat, C.: Exploiting locality in sparse matrix-matrix multiplication on many-core architectures. IEEE Trans. Parallel Distrib. Syst. 28(8), 2258–2271 (2017)
Article Google Scholar
Nagasaka, Y., Nukada, A., Matsuoka, S.: High-performance and memory-saving sparse general matrix-matrix multiplication for nvidia pascal gpu. In: 2017 46th International Conference on Parallel Processing (ICPP), pp. 101–110 (Aug 2017)
Buluç, A., Gilbert, J.R.: Parallel sparse matrix–matrix multiplication and indexing: implementation and experiments. SIAM J. Sci. Comput. 34(4), C170–C191 (2012)
Article MathSciNet MATH Google Scholar
Azad, A., Ballard, G., Buluç, A., Demmel, J., Grigori, L., Schwartz, O., Toledo, S., Williams, S.: Exploiting multiple levels of parallelism in sparse matrix–matrix multiplication. SIAM J. Sci. Comput. 38(6), C624–C651 (2016)
Article MathSciNet MATH Google Scholar
Ballard, G., Druinsky, A., Knight, N., Schwartz, O.: Hypergraph partitioning for sparse matrix–matrix multiplication. ACM Trans. Parallel Comput. 3(3), 18:1–18:34 (2016)
Article Google Scholar
Yuster, R., Zwick, U.: Fast sparse matrix multiplication. ACM Trans. Algorithms 1(1), 2–13 (2005)
Article MathSciNet MATH Google Scholar
Liu, J., He, X., Liu, W., Tan, G.: Register-based implementation of the sparse general matrix-matrix multiplication on gpus. In: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’18), pp. 407–408 (2018)
Liu, W.: Parallel and Scalable Sparse Basic Linear Algebra Subprograms. PhD thesis, University of Copenhagen (2015)
Liu, W., Vinter, B.: A framework for general sparse matrix–matrix multiplication on gpus and heterogeneous processors. J. Parallel Distrib. Comput. 85(C), 47–61 (2015)
Article Google Scholar
Edwards, H.C., Trott, C.R., Sunderland, D.: Kokkos: enabling manycore performance portability through polymorphic memory access patterns. J. Parallel Distrib. Comput. 74(12), 3202–3216 (2014)
Article Google Scholar
Nagasaka, Y., Matsuoka, S., Azad, A., Buluç, A.: High-performance sparse matrix-matrix products on intel knl and multicore architectures. In: Proceedings of the 47th International Conference on Parallel Processing Companion (ICPP ’18) Workshop, pp. 34:1–34:10 (2018)
Gremse, F., Kpper, K., Naumann, U.: Memory-efficient sparse matrix–matrix multiplication by row merging on many-core architectures. SIAM J. Sci. Comput. 40(4), C429–C449 (2018)
Article MathSciNet MATH Google Scholar
Buluç, A., Gilbert, J.R.: Challenges and advances in parallel sparse matrix–matrix multiplication. In: 2008 37th International Conference on Parallel Processing, pp. 503–510 (Sept 2008)
Deveci, M., Trott, C., Rajamanickam, S.: Multithreaded sparse matrix–matrix multiplication for many-core and gpu architectures. Parallel Comput. 78, 33–46 (2018)
Article MathSciNet Google Scholar
Gustavson, F.G.: Two fast algorithms for sparse matrices: multiplication and permuted transposition. ACM Trans. Math. Softw. 4(3), 250–269 (1978)
Article MathSciNet MATH Google Scholar
Zhang, F., Wu, B., Zhai, J., He, B., Chen, W.: FinePar: Irregularity-aware fine-grained workload partitioning on integrated architectures. In: Proceedings of the 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 27–38 (2017)
Zhang, F., Zhai, J., He, B., Zhang, S., Chen, W.: Understanding co-running behaviors on integrated cpu/gpu architectures. IEEE Trans. Parallel Distrib. Syst. 28(3), 905–918 (2017)
Article Google Scholar
Tan, G., Liu, J., Li, J.: Design and implementation of adaptive spmv library for multicore and many-core architecture. ACM Trans. Math. Softw. 44(4), 46:1–46:25 (2018)
Article MathSciNet MATH Google Scholar
Liu, W., Vinter, B.: CSR5: An efficient storage format for cross-platform sparse matrix-vector multiplication. In: Proceedings of the 29th ACM on International Conference on Supercomputing (ICS ’15), pp. 339–350 (2015)
Liu, W., Vinter, B.: An efficient gpu general sparse matrix-matrix multiplication for irregular data. In: 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS ’14), pp. 370–381 (May 2014)
Gremse, F., Höfter, A., Schwen, L.O., Kiessling, F., Naumann, U.: Gpu-accelerated sparse matrix–matrix multiplication by iterative row merging. SIAM J. Sci. Comput. 37(1), C54–C71 (2015)
Article MathSciNet MATH Google Scholar
Gilbert, J., Moler, C., Schreiber, R.: Sparse matrices in matlab: design and implementation. SIAM J. Matrix Anal. Appl. 13(1), 333–356 (1992)
Article MathSciNet MATH Google Scholar
Hou, K., Liu, W., Wang, H., Feng, W.c.: Fast segmented sort on gpus. In: Proceedings of the International Conference on Supercomputing (ICS ’17), pp. 12:1–12:10 (2017)
Anh, P.N.Q., Fan, R., Wen, Y.: Balanced hashing and efficient gpu sparse general matrix-matrix multiplication. In: Proceedings of the 2016 International Conference on Supercomputing (ICS ’16), pp. 36:1–36:12 (2016)
Li, A., Song, S.L., Liu, W., Liu, X., Kumar, A., Corporaal, H.: Locality-aware cta clustering for modern gpus. In: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’17), pp. 297–311 (2017)
Li, A., Liu, W., Wang, L., Barker, K., Song, S.L.: Warp-Consolidation: A Novel Execution Model for GPUS (ICS ’18) (2018)
Rawat, P.S., Rastello, F., Sukumaran-Rajam, A., Pouchet, L.N., Rountev, A., Sadayappan, P.: Register optimizations for stencils on gpus. In: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’18), pp. 168–182 (2018)
Blelloch, G.E., Heroux, M.A., Zagha, M.: Segmented operations for sparse matrix computation on vector multiprocessors. Technical report, CMU (1993)
Liu, W., Vinter, B.: Speculative segmented sum for sparse matrix-vector multiplication on heterogeneous processors. Parallel Comput. 49(C), 179–193 (2015)
Article MathSciNet Google Scholar
Davis, T.A., Hu, Y.: The university of florida sparse matrix collection. ACM Trans. Math. Softw. 38(1), 1:1–1:25 (2011)
MathSciNet MATH Google Scholar
Dalton, S., Baxter, S., Merrill, D., Olson, L., Garland, M.: Optimizing sparse matrix operations on gpus using merge path. In: 2015 IEEE International Parallel and Distributed Processing Symposium (IPDPS ’15), pp. 407–416 (May 2015)
Li, A., Liu, W., Kristensen, M.R.B., Vinter, B., Wang, H., Hou, K., Marquez, A., Song, S.L.: Exploring and analyzing the real impact of modern on-package memory on hpc scientific kernels (SC ’17), pp. 26:1–26:14 (2017)
Xie, X., Liang, Y., Li, X., Wu, Y., Sun, G., Wang, T., Fan, D.: Enabling coordinated register allocation and thread-level parallelism optimization for gpus. In: 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 395–406 (Dec 2015)
Yuan, L., Liu, J., Luo, Y., Tan, G.: Locality of computation for stencil optimization. In: Carretero, J., Garcia-Blas, J., Ko, R.K., Mueller, P., Nakano, K. (eds.) Algorithms and Architectures for Parallel Processing, pp. 449–456. Springer International Publishing, Cham (2016)
Chapter Google Scholar
Liu, J., Tan, G., Luo, Y., Li, J., Mo, Z., Sun, N.: An autotuning protocol to rapidly build autotuners. ACM Trans. Parallel Comput. (2018)
Li, J., Tan, G., Chen, M., Sun, N.: SMAT: An input adaptive auto-tuner for sparse matrix-vector multiplication. In: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’13), pp. 117–126 (2013)

Download references

Acknowledgements

We would like to express our gratitude to all reviewers constructive comments for helping us polishing this paper. This work is supported by the National Key Research and Development Program of China (2017YFB0202105, 2016YFB0201305, 2016YFB0200803, 2016YFB0200300), National Natural Science Foundation of China, under Grant Nos. (61521092, 91430218, 31327901, 61472395, 61432018), and the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie project (Grant No. 752321).

Author information

Authors and Affiliations

State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing, China
Junhong Liu, Xin He & Guangming Tan
Department of Computer Science, Norwegian University of Science and Technology, Trondheim, Norway
Weifeng Liu

Authors

Junhong Liu
View author publications
You can also search for this author in PubMed Google Scholar
Xin He
View author publications
You can also search for this author in PubMed Google Scholar
Weifeng Liu
View author publications
You can also search for this author in PubMed Google Scholar
Guangming Tan
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Junhong Liu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Liu, J., He, X., Liu, W. et al. Register-Aware Optimizations for Parallel Sparse Matrix–Matrix Multiplication. Int J Parallel Prog 47, 403–417 (2019). https://doi.org/10.1007/s10766-018-0604-8

Download citation

Received: 21 September 2018
Accepted: 01 October 2018
Published: 01 January 2019
Issue Date: 15 June 2019
DOI: https://doi.org/10.1007/s10766-018-0604-8

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Register-Aware Optimizations for Parallel Sparse Matrix–Matrix Multiplication

Abstract

Access this article

Similar content being viewed by others

Design Principles for Sparse Matrix Multiplication on the GPU

Block-wise dynamic mixed-precision for sparse matrix-vector multiplication on GPUs

Systematic Approach in Optimizing Numerical Memory-Bound Kernels on GPU

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Register-Aware Optimizations for Parallel Sparse Matrix–Matrix Multiplication

Abstract

Access this article

Similar content being viewed by others

Design Principles for Sparse Matrix Multiplication on the GPU

Block-wise dynamic mixed-precision for sparse matrix-vector multiplication on GPUs

Systematic Approach in Optimizing Numerical Memory-Bound Kernels on GPU

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation