Abstract
Although matrix multiplication plays a vital role in computational linear algebra, few efficient solutions exist for multiplying near-sparse matrices. The Sparse Approximate Matrix Multiply (SpAMM) is an algorithm designed to fill the performance gap left by traditional optimizations for dense and sparse matrix multiplication. However, existing SpAMM implementations fail to exploit the performance potential of GPUs. In this paper, we present cuSpAMM, the first parallel SpAMM algorithm optimized for multiple GPUs. We propose several performance optimizations, including a redesign of the algorithm to adapt to GPU thread parallelism, blocking strategies for memory access optimization, and acceleration with tensor cores. In addition, we scale cuSpAMM to multiple GPUs with an effective load balancing scheme. We evaluate cuSpAMM on both synthesized and real-world datasets, and the experimental results show that cuSpAMM achieves significant speedup over the vendor-optimized cuBLAS and cuSPARSE libraries.
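For readers unfamiliar with SpAMM, the core idea is a recursive blocked multiplication that skips sub-products whose contribution is provably small: a quadrant product is computed only when the product of the Frobenius norms of its operands exceeds a tolerance. The following NumPy sketch illustrates this pruning rule; it is a minimal serial illustration, not the paper's CUDA implementation, and the function name `spamm`, the tolerance `tau`, and the leaf size of 64 are illustrative assumptions.

```python
import numpy as np

def spamm(A, B, tau, leaf=64):
    """Minimal SpAMM sketch: recursive 2x2 blocking with norm-based pruning.

    Assumes square matrices whose dimension is a power-of-two multiple
    of `leaf`. A sub-product A_ik @ B_kj is skipped whenever
    ||A_ik||_F * ||B_kj||_F <= tau, which bounds its contribution
    since ||A_ik @ B_kj||_F <= ||A_ik||_F * ||B_kj||_F.
    """
    n = A.shape[0]
    if n <= leaf:  # base case: small dense multiply
        return A @ B
    h = n // 2
    C = np.zeros((n, n), dtype=A.dtype)
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):
                Aik = A[i*h:(i+1)*h, k*h:(k+1)*h]
                Bkj = B[k*h:(k+1)*h, j*h:(j+1)*h]
                # Prune sub-products with provably small contribution.
                if np.linalg.norm(Aik) * np.linalg.norm(Bkj) > tau:
                    C[i*h:(i+1)*h, j*h:(j+1)*h] += spamm(Aik, Bkj, tau, leaf)
    return C
```

With tau = 0 the recursion reduces to an exact blocked multiplication; increasing tau trades accuracy for skipped sub-products, which is where near-sparse matrices with decay gain their speedup.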
Notes
cuSpAMM is open-sourced at https://github.com/buaa-hipo/cuSpAMM
Acknowledgements
This work was supported by the National Key Research and Development Program of China (No. 2020YFB1506703), the National Natural Science Foundation of China (No. 62072018), and the State Key Laboratory of Software Development Environment (No. SKLSDE-2021ZX-06). Hailong Yang is the corresponding author.
Cite this article
Liu, X., Liu, Y., Yang, H. et al. Accelerating approximate matrix multiplication for near-sparse matrices on GPUs. J Supercomput 78, 11464–11491 (2022). https://doi.org/10.1007/s11227-022-04334-5