Accelerating approximate matrix multiplication for near-sparse matrices on GPUs

Abstract

Although matrix multiplication plays a vital role in computational linear algebra, there are few efficient solutions for multiplying near-sparse matrices. The Sparse Approximate Matrix Multiply (SpAMM) is one of the algorithms that fills the performance gap neglected by traditional optimizations for dense/sparse matrix multiplication. However, existing SpAMM algorithms fail to exploit the performance potential of GPUs. In this paper, we present cuSpAMM, the first parallel SpAMM algorithm optimized for multiple GPUs. We propose several performance optimizations, including an algorithm redesign adapted to thread parallelism, blocking strategies for memory access optimization, and acceleration with tensor cores. In addition, we scale cuSpAMM to multiple GPUs with an effective load balancing scheme. We evaluate cuSpAMM on both synthesized and real-world datasets, and the experimental results show that it achieves significant speedup over the vendor-optimized cuBLAS and cuSPARSE libraries.
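
To make the idea concrete, the sketch below (a minimal illustration, not the authors' cuSpAMM implementation) shows the general SpAMM recursion: the matrices are partitioned into quadrants, and a quadrant product is skipped whenever the product of the two operands' Frobenius norms falls below a tolerance tau, which is where the savings come from for near-sparse matrices with decaying entries. The function and parameter names (spamm, tau, leaf) are illustrative, and the leaf-level multiplication stands in for the tiled GEMM that cuSpAMM maps to tensor cores.

```python
import numpy as np

def spamm(A, B, tau, leaf=64):
    """Minimal recursive SpAMM sketch (square matrices, power-of-two size).

    A quadrant product is skipped when ||A_blk||_F * ||B_blk||_F < tau,
    so contributions that are provably small are never computed.
    """
    n = A.shape[0]
    # Prune: this product contributes less than the tolerance allows.
    if np.linalg.norm(A) * np.linalg.norm(B) < tau:
        return np.zeros((n, n))
    # Leaf case: multiply exactly (stand-in for the GPU GEMM kernel).
    if n <= leaf:
        return A @ B
    h = n // 2
    C = np.zeros((n, n))
    for i in (0, 1):
        for k in (0, 1):
            acc = np.zeros((h, h))
            for j in (0, 1):
                acc += spamm(A[i*h:(i+1)*h, j*h:(j+1)*h],
                             B[j*h:(j+1)*h, k*h:(k+1)*h],
                             tau, leaf)
            C[i*h:(i+1)*h, k*h:(k+1)*h] = acc
    return C

if __name__ == "__main__":
    n = 512
    # Synthetic "near-sparse" matrix with exponential off-diagonal decay.
    idx = np.arange(n)
    A = np.exp(-0.5 * np.abs(idx[:, None] - idx[None, :]))
    C_exact = A @ A
    C_approx = spamm(A, A, tau=1e-3)
    print("relative error:",
          np.linalg.norm(C_exact - C_approx) / np.linalg.norm(C_exact))
```

In a real implementation, the per-block norms would typically be computed once up front so the pruning test costs far less than the multiplication it avoids; tau trades accuracy for speed, and tau = 0 recovers the exact product.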

Notes

  1. cuSpAMM is open-sourced at https://github.com/buaa-hipo/cuSpAMM

Acknowledgements

This work was supported by the National Key Research and Development Program of China (No. 2020YFB1506703), the National Natural Science Foundation of China (No. 62072018), and the State Key Laboratory of Software Development Environment (No. SKLSDE-2021ZX-06). Hailong Yang is the corresponding author.

Author information

Corresponding author

Correspondence to Hailong Yang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Liu, X., Liu, Y., Yang, H. et al. Accelerating approximate matrix multiplication for near-sparse matrices on GPUs. J Supercomput 78, 11464–11491 (2022). https://doi.org/10.1007/s11227-022-04334-5
