Abstract
Although matrix multiplication plays a vital role in computational linear algebra, few efficient solutions exist for multiplying near-sparse matrices. The Sparse Approximate Matrix Multiply (SpAMM) is an algorithm designed to fill the performance gap left by traditional optimizations for dense and sparse matrix multiplication. However, existing SpAMM implementations fail to exploit the performance potential of GPUs. In this paper, we present cuSpAMM, the first parallel SpAMM algorithm optimized for multiple GPUs. We propose several performance optimizations, including a redesign of the algorithm to adapt to GPU thread parallelism, blocking strategies for memory access optimization, and acceleration with tensor cores. In addition, we scale cuSpAMM to multiple GPUs with an effective load balancing scheme. We evaluate cuSpAMM on both synthesized and real-world datasets, and the experimental results show that cuSpAMM achieves significant speedup over the vendor-optimized cuBLAS and cuSPARSE libraries.
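For readers unfamiliar with SpAMM, the core idea is a recursive blocked multiplication that skips sub-products whose contribution is provably small: a quadrant product is computed only when the product of the Frobenius norms of its operands exceeds a tolerance. The following NumPy sketch illustrates this pruning rule; it is a minimal serial illustration, not the paper's CUDA implementation, and the function name `spamm`, the tolerance `tau`, and the leaf size of 64 are illustrative assumptions.

```python
import numpy as np

def spamm(A, B, tau, leaf=64):
    """Minimal SpAMM sketch: recursive 2x2 blocking with norm-based pruning.

    Assumes square matrices whose dimension is a power-of-two multiple
    of `leaf`. A sub-product A_ik @ B_kj is skipped whenever
    ||A_ik||_F * ||B_kj||_F <= tau, which bounds its contribution
    since ||A_ik @ B_kj||_F <= ||A_ik||_F * ||B_kj||_F.
    """
    n = A.shape[0]
    if n <= leaf:  # base case: small dense multiply
        return A @ B
    h = n // 2
    C = np.zeros((n, n), dtype=A.dtype)
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):
                Aik = A[i*h:(i+1)*h, k*h:(k+1)*h]
                Bkj = B[k*h:(k+1)*h, j*h:(j+1)*h]
                # Prune sub-products with provably small contribution.
                if np.linalg.norm(Aik) * np.linalg.norm(Bkj) > tau:
                    C[i*h:(i+1)*h, j*h:(j+1)*h] += spamm(Aik, Bkj, tau, leaf)
    return C
```

With tau = 0 the recursion reduces to an exact blocked multiplication; increasing tau trades accuracy for skipped sub-products, which is where near-sparse matrices with decay gain their speedup.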
Notes
cuSpAMM is open-sourced at https://github.com/buaa-hipo/cuSpAMM
Acknowledgements
This work was supported by the National Key Research and Development Program of China (No. 2020YFB1506703), the National Natural Science Foundation of China (No. 62072018), and the State Key Laboratory of Software Development Environment (No. SKLSDE-2021ZX-06). Hailong Yang is the corresponding author.
Cite this article
Liu, X., Liu, Y., Yang, H. et al. Accelerating approximate matrix multiplication for near-sparse matrices on GPUs. J Supercomput 78, 11464–11491 (2022). https://doi.org/10.1007/s11227-022-04334-5