Skip to main content
Log in

H2Opus: a distributed-memory multi-GPU software package for non-local operators

  • Published:
Advances in Computational Mathematics Aims and scope Submit manuscript

Abstract

Hierarchical \({\mathscr{H}}^{2}\)-matrices are asymptotically optimal representations for the discretizations of non-local operators such as those arising in integral equations or from kernel functions. Their O(N) complexity in both memory and operator application makes them particularly suited for large-scale problems. As a result, there is a need for software that provides support for distributed operations on these matrices to allow large-scale problems to be represented. In this paper, we present high-performance, distributed-memory GPU-accelerated algorithms and implementations for matrix-vector multiplication and matrix recompression of hierarchical matrices in the \({\mathscr{H}}^{2}\) format. The algorithms are a new module of H2Opus, a performance-oriented package that supports a broad variety of \({\mathscr{H}}^{2}\) matrix operations on CPUs and GPUs. Performance in the distributed GPU setting is achieved by marshaling the tree data of the hierarchical matrix representation to allow batched kernels to be executed on the individual GPUs. MPI is used for inter-process communication. We optimize the communication data volume and hide much of the communication cost with local compute phases of the algorithms. Results show near-ideal scalability up to 1024 NVIDIA V100 GPUs on Summit, with performance exceeding 2.3 Tflop/s/GPU for the matrix-vector multiplication, and 670 Gflop/s/GPU for matrix compression, which involves batched QR and SVD operations. We illustrate the flexibility and efficiency of the library by solving a 2D variable diffusivity integral fractional diffusion problem with an algebraic multigrid-preconditioned Krylov solver and demonstrate scalability up to 16M degrees of freedom problems on 64 GPUs.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. FMM3D: Flatiron Institute Fast Multipole Libraries. https://github.com/flatironinstitute/FMM3D

  2. H2Lib: http://www.h2lib.org/

  3. H2Opus: A performance-oriented library for hierarchical matrices. https://github.com/ecrc/h2opus

  4. MAGMA: matrix algebra on GPU and multicore architectures. https://icl.utk.edu/magma/index.html

  5. STRUMPACK: STRUctured Matrices PACKage, v3.3. http://portal.nersc.gov/project/sparse/strumpack/

  6. Thrust library documentation: https://docs.nvidia.com/cuda/thrust/

  7. Aliaga, J.I., Carratalá-Sáez, R., Kriemann, R., Quintana-Ortí, E.S.: Task-parallel LU factorization of hierarchical matrices using OmpSs. In: IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 1148–1157 (2017)

  8. Alzahrani, H., Turkiyyah, G., Knio, O., Keyes, D.: Space-fractional diffusion with variable order and diffusivity:, Discretization and direct solution strategies. arXiv:2108.12772 (2021)

  9. Ambartsumyan, I., Boukaram, W., Bui-Thanh, T., Ghattas, O., Keyes, D., Stadler, G., Turkiyyah, G., Zampini, S.: Hierarchical matrix approximations of hessians arising in inverse problems governed by pdes. SIAM J. Sci. Comput. 42(5), A3397–A3426 (2020)

    Article  MathSciNet  Google Scholar 

  10. Ambikasaran, S., Foreman-Mackey, D., Greengard, L., Hogg, D.W., O’Neil, M.: Fast direct methods for Gaussian processes. IEEE Trans. Pattern Anal. Machine Intell. 38(2), 252–265 (2016)

    Article  Google Scholar 

  11. Ambikasaran, S., Singh, K.R., Sankaran, S.S.: HODLRLib: a library for hierarchical matrices. J. Open Source Softw. 4(34), 1167 (2019). https://doi.org/10.21105/joss.01167

    Article  Google Scholar 

  12. Baboulin, M., Demmel, J., Dongarra, J., Tomov, S., Volkov, V.: Enhancing the performance of dense linear algebra solvers on GPUs (in the MAGMA project). In: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC08 (2008)

  13. Balay, S., Abhyankar, S., Adams, M.F., Brown, J., Brune, P., Buschelman, K., Dalcin, L., Dener, A., Eijkhout, V., Gropp, W.D., Karpeyev, D., Kaushik, D., Knepley, M.G., May, D.A., McInnes, L.C., Mills, R.T., Munson, T., Rupp, K., Sanan, P., Smith, B.F., Zampini, S., Zhang, H., Zhang, H.: PETSc users manual. Tech. Rep. ANL-95/11 - Revision 3.13, Argonne National Laboratory. https://www.mcs.anl.gov/petsc (2020)

  14. Balay, S., Abhyankar, S., Adams, M.F., Brown, J., Brune, P., Buschelman, K., Dalcin, L., Dener, A., Eijkhout, V., Gropp, W.D., Karpeyev, D., Kaushik, D., Knepley, M.G., May, D.A., McInnes, L.C., Mills, R.T., Munson, T., Rupp, K., Sanan, P., Smith, B.F., Zampini, S., Zhang, H., Zhang, H.: PETSC Web page https://www.mcs.anl.gov/petsc (2020)

  15. Bienz, A., Gropp, W.D., Olson, L.N.: Node aware sparse matrix–vector multiplication. Journal of Parallel and Distributed Computing 130, 166–178 (2019)

    Article  Google Scholar 

  16. Börm, S.: Efficient numerical methods for non-local operators: \({\mathscr{H}}^{2}\)-matrix compression, algorithms and analysis, vol. 14 European Mathematical Society (2010)

  17. Börm, S., Bendoraityte, J.: Distributed \({\mathscr{H}}^{2}\)-matrices for non-local operators. Comput. Vis. Sci. 11(4), 237–249 (2008)

    Article  MathSciNet  Google Scholar 

  18. Bȯrm, S., Christophersen, S., Kriemann, R.: Semi-automatic task graph construction for \({\mathscr{H}},\)-matrix arithmetic. arXiv:1911.07531 (2019)

  19. Boukaram, W., Turkiyyah, G., Keyes, D.: Hierarchical matrix operations on GPUs: matrix-vector multiplication and compression. ACM Transactions on Mathematical Software 45(1), 3:1–3:28 (2019)

    Article  MathSciNet  Google Scholar 

  20. Boukaram, W., Turkiyyah, G., Keyes, D.: Randomized GPU algorithms for the construction of hierarchical matrices from matrix-vector operations. SIAM J. Sci. Comput. 41(4), C339–C366 (2019)

    Article  MathSciNet  Google Scholar 

  21. Boukaram, W., Turkiyyah, G., Ltaief, H., Keyes, D.: Batched QR and SVD algorithms on GPUs with applications in hierarchical matrix compression. Parallel Comput. 74, 19–33 (2018)

    Article  MathSciNet  Google Scholar 

  22. Boukaram, W., Zampini, S., Turkiyyah, G., Keyes, D.: H2OPUS-TLR:, High performance tile low rank symmetric factorizations using adaptive randomized approximation. arXiv:2108.11932 (2021)

  23. Elafrou, A., Goumas, G., Koziris, N.: Conflict-free symmetric sparse matrix-vector multiplication on multicore architectures. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’19. Association for Computing Machinery, New York, NY, USA (2019)

  24. Erlandson, L., Cai, D., Xi, Y., Chow, E.: Accelerating parallel hierarchical matrix-vector products via data-driven sampling. In: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS). https://doi.org/10.1109/IPDPS47924.2020.00082, pp 749–758. IEEE Computer Society, USA (2020)

  25. Ghysels, P., Li, X.S., Gorman, C., Rouet, F.: A robust parallel preconditioner for indefinite systems using hierarchical matrices and randomized sampling. In: IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 897–906 (2017)

  26. Gillman, A., Barnett, A.H., Martinsson, P.G.: A spectrally accurate direct solution technique for frequency-domain scattering problems with variable media. BIT Numer. Math. 55(1), 141–170 (2015)

    Article  MathSciNet  Google Scholar 

  27. Gillman, A., Martinsson, P.: An O(N) algorithm for constructing the solution operator to 2D elliptic boundary value problems in the absence of body loads. Adv. Comput. Math. 40(4), 773–796 (2014). https://doi.org/10.1007/s10444-013-9326-z

    Article  MathSciNet  Google Scholar 

  28. Grasedyck, L., Hackbusch, W.: Construction and arithmetics of \({\mathscr{H}}\)-matrices. Computing 70, 295–334 (2003)

    Article  MathSciNet  Google Scholar 

  29. Greengard, L., O’Neil, M., Rachh, M., Vico, F.: Fast multipole methods for the evaluation of layer potentials with locally-corrected quadratures. Journal of Computational Physics X 10, 100092 (2021). https://doi.org/10.1016/j.jcpx.2021.100092

    Article  MathSciNet  Google Scholar 

  30. Guo, D., Gropp, W., Olson, L.N.: A hybrid format for better performance of sparse matrix-vector multiplication on a GPU. The International Journal of High Performance Computing Applications 30(1), 103–120 (2016)

    Article  Google Scholar 

  31. Hackbusch, W.: Hierarchical matrices: algorithms and analysis. Springer, Berlin (2015)

    Book  Google Scholar 

  32. Hackbusch, W., Börm, S.: Data-sparse approximation by adaptive \({\mathscr{H}}^{2}\)-matrices. Computing 69 (1), 1–35 (2002). https://doi.org/10.1007/s00607-002-1450-4

    Article  MathSciNet  Google Scholar 

  33. Halko, N., Martinsson, P., Tropp, J.A.: Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 53(2), 217–288 (2011)

    Article  MathSciNet  Google Scholar 

  34. Hao, S., Barnett, A.H., Martinsson, P.G., Young, P.: High-order accurate methods for nyström discretization of integral equations on smooth curves in the plane. Adv. Comput. Math. 40(1), 245–272 (2014). https://doi.org/10.1007/s10444-013-9306-3

    Article  MathSciNet  Google Scholar 

  35. Ho, K.L.: FLAM: Fast Linear algebra in MATLAB –algorithms for hierarchical matrices. Journal of Open Source Software 5(51), 1906 (2020). https://doi.org/10.21105/joss.01906

    Article  Google Scholar 

  36. Huang, H., Xing, X., Chow, E.: H2pack: High-performance \({\mathscr{H}}^{2}\) matrix package for kernel matrices using the proxy point method. ACM Trans. Math. Softw 47(1). https://doi.org/10.1145/3412850 (2020)

  37. Ida, A., Iwashita, T., Mifune, T., Takahashi, Y.: Parallel hierarchical matrices with adaptive cross approximation on symmetric multiprocessing clusters. Journal of Information Processing 22(4), 642–650 (2014)

    Article  Google Scholar 

  38. Jolivet, P., Roman, J.E., Zampini, S.: KSPHPDDM and PCHPDDM: extending PETSc with advanced Krylov methods and robust multilevel overlapping Schwarz preconditioners. Computers & Mathematics with Applications 84, 277–295 (2021)

    Article  MathSciNet  Google Scholar 

  39. Marple, G.R., Barnett, A., Gillman, A., Veerapaneni, S.: A fast algorithm for simulating multiphase flows through periodic geometries of arbitrary shape. SIAM J. Sci. Comput. 38(5), B740–B772 (2016). https://doi.org/10.1137/15M1043066

    Article  MathSciNet  Google Scholar 

  40. Massei, S., Robol, L.: Kressner, d.: hm-toolbox: MATLAB software for HODLR and HSS matrices. SIAM J. Sci. Comput. 42(2), C43–C68 (2020). https://doi.org/10.1137/19M1288048

    Article  Google Scholar 

  41. Merrill, D., Garland, M.: Merge-based parallel sparse matrix-vector multiplication. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’16. IEEE Press (2016)

  42. Mills, R.T., Adams, M.F., Balay, S., Brown, J., Dener, A., Knepley, M., Kruger, S.E., Morgan, H., Munson, T., Rupp, K., Smith, B.F., Zampini, S., Zhang, H., Zhang, J. arXiv:2011.00715(2020)

  43. Minden, V., Ying, L.: A simple solver for the fractional laplacian in multiple dimensions. SIAM J. Sci. Comput. 42(2), A878–A900 (2020)

    Article  MathSciNet  Google Scholar 

  44. Ohshima, S., Yamazaki, I., Ida, A., Yokota, R.: Optimization of hierarchical matrix computation on GPU. In: Yokota, R., Wu, W. (eds.) Supercomputing Frontiers, Lecture Notes in Computer Science, vol. 10776. Springer International Publishing, pp. 274–292 (2018)

  45. Rebrova, E., ChÁvez, G., Liu, Y., Ghysels, P., Li, X.S.: A study of clustering techniques and hierarchical matrix formats for kernel ridge regression. In: IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 883–892 (2018)

  46. Rouet, F.H., Li, X.S., Ghysels, P., Napov, A.: A distributed-memory package for dense hierarchically semi-separable matrix computations using randomization. ACM Transactions on Mathematical Software 42(4), 27:1–35 (2016)

    Article  MathSciNet  Google Scholar 

  47. Smigaj, W., Betcke, T., Arridge, S., Phillips, J., Schweiger, M.: Solving boundary integral problems with BEM++. ACM Trans. Math. Softw 41(2). https://doi.org/10.1145/2590830 (2015)

  48. Wu, B., Martinsson, P.G.: Zeta correction: a new approach to constructing corrected trapezoidal quadrature rules for singular integral operators. Adv. Comput. Math. 47(3), 45 (2021). https://doi.org/10.1007/s10444-021-09872-9

    Article  MathSciNet  Google Scholar 

  49. Yamazaki, I., Abdelfattah, A., Ida, A., Ohshima, S., Tomov, S., Yokota, R., Dongarra, J.: Performance of Hierarchical-Matrix BiCGStab Solver on GPU Clusters. In: IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 930–939 (2018)

  50. Yamazaki, I., Ida, A., Yokota, R., Dongarra, J.: Distributed-memory lattice \({\mathscr{H}}\)-matrix factorization. The International Journal of High Performance Computing Applications 33(5), 1046–1063 (2019)

    Article  Google Scholar 

  51. Yu, C.D., March, W.B., Biros, G.: An \(N \log N\) parallel fast direct solver for kernel matrices. In: IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 886–896 (2017)

  52. Yu, C.D., March, W.B., Xiao, B., Biros, G.: INV-ASKIT: a parallel fast direct solver for kernel matrices. In: IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 161–171 (2016)

  53. Yu, C.D., Reiz, S., Biros, G.: Distributed-memory hierarchical compression of dense SPD matrices. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC ’18. IEEE Press (2018)

  54. Yu, C.D., Reiz, S., Biros, G.: Distributed O(N) linear solver for dense symmetric hierarchical semi-separable matrices. In: IEEE 13Th International Symposium on Embedded Multicore/Many-Core Systems-On-Chip (MCSoc), pp. 1–8 (2019)

  55. Zaspel, P.: Algorithmic patterns for \({\mathscr{H}}\)-matrices on many-core processors. J. Sci. Comput. 78(2), 1174–1206 (2019)

    Article  MathSciNet  Google Scholar 

  56. Zhang, J., Brown, J., Balay, S., Faibussowitsch, J., Knepley, M., Marin, O., Mills, R.T., Munson, T., Smith, B.F., Zampini, S.: The petscSF scalable communication layer IEEE Transactions on Parallel and Distributed Systems (2021)

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Stefano Zampini.

Additional information

Communicated by: Michael O’Neil

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Advances in Computational Integral Equations Guest Editors: Stephanie Chaillat, Adrianna Gillman, Per-Gunnar Martinsson, Michael O’Neil, Mary-Catherine Kropinski, Timo Betcke, Alex Barnett

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Zampini, S., Boukaram, W., Turkiyyah, G. et al. H2Opus: a distributed-memory multi-GPU software package for non-local operators. Adv Comput Math 48, 31 (2022). https://doi.org/10.1007/s10444-022-09942-6

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s10444-022-09942-6

Keywords

Mathematics Subject Classification (2010)

Navigation