H2Opus: a distributed-memory multi-GPU software package for non-local operators

Zampini, Stefano; Boukaram, Wajih; Turkiyyah, George; Knio, Omar; Keyes, David

doi:10.1007/s10444-022-09942-6

Published: 10 May 2022

Advances in Computational Mathematics, Volume 48, article number 31 (2022)

Stefano Zampini (ORCID: orcid.org/0000-0002-0435-0433)¹,
Wajih Boukaram¹,
George Turkiyyah¹,
Omar Knio¹ &
David Keyes¹


Abstract

Hierarchical \({\mathscr{H}}^{2}\)-matrices are asymptotically optimal representations for the discretizations of non-local operators such as those arising in integral equations or from kernel functions. Their O(N) complexity in both memory and operator application makes them particularly suited for large-scale problems. As a result, there is a need for software that provides support for distributed operations on these matrices to allow large-scale problems to be represented. In this paper, we present high-performance, distributed-memory GPU-accelerated algorithms and implementations for matrix-vector multiplication and matrix recompression of hierarchical matrices in the \({\mathscr{H}}^{2}\) format. The algorithms are a new module of H2Opus, a performance-oriented package that supports a broad variety of \({\mathscr{H}}^{2}\) matrix operations on CPUs and GPUs. Performance in the distributed GPU setting is achieved by marshaling the tree data of the hierarchical matrix representation to allow batched kernels to be executed on the individual GPUs. MPI is used for inter-process communication. We optimize the communication data volume and hide much of the communication cost with local compute phases of the algorithms. Results show near-ideal scalability up to 1024 NVIDIA V100 GPUs on Summit, with performance exceeding 2.3 Tflop/s/GPU for the matrix-vector multiplication, and 670 Gflop/s/GPU for matrix compression, which involves batched QR and SVD operations. We illustrate the flexibility and efficiency of the library by solving a 2D variable diffusivity integral fractional diffusion problem with an algebraic multigrid-preconditioned Krylov solver and demonstrate scalability to problems with up to 16M degrees of freedom on 64 GPUs.
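To make the marshaling idea in the abstract concrete, the following is a minimal sketch, not H2Opus code. It assumes a complete binary cluster tree stored in heap order and shows only the upward pass of a nested-basis matrix-vector product: leaf coefficients are formed from leaf bases, then transfer matrices accumulate them level by level. Every name here (`marshal_level`, `batched_transfer`, `upward_pass`, `V`, `E`) is hypothetical, and a plain Python loop stands in for what would be a batched GPU kernel.

```python
# Illustrative sketch only; names and data layout are assumptions, not H2Opus API.

def matT_vec(M, v):
    """Return M^T v for a dense matrix M stored as a list of rows."""
    return [sum(M[i][j] * v[i] for i in range(len(M))) for j in range(len(M[0]))]

def marshal_level(level):
    """Flatten one tree level into a list of (parent, child) node index pairs."""
    first, count = 2**level - 1, 2**level
    return [(p, c) for p in range(first, first + count) for c in (2 * p + 1, 2 * p + 2)]

def batched_transfer(xhat, E, batch):
    """Stand-in for one batched kernel: xhat[p] += E[c]^T xhat[c] for each pair."""
    for p, c in batch:
        contrib = matT_vec(E[c], xhat[c])
        xhat[p] = [a + b for a, b in zip(xhat[p], contrib)]

def upward_pass(x, V, E, depth, leaf_size, rank):
    """Compute coefficient vectors for all nodes, one batch per tree level."""
    n_nodes = 2**(depth + 1) - 1
    first_leaf = 2**depth - 1
    xhat = [[0.0] * rank for _ in range(n_nodes)]
    # Leaf batch: project each block of x onto its leaf basis V[leaf].
    for leaf in range(first_leaf, n_nodes):
        j = leaf - first_leaf
        xhat[leaf] = matT_vec(V[leaf], x[j * leaf_size:(j + 1) * leaf_size])
    # Inner levels, deepest first: each level is one marshaled batch.
    for level in range(depth - 1, -1, -1):
        batched_transfer(xhat, E, marshal_level(level))
    return xhat

# Tiny example: all-ones bases and transfers, so the root coefficient is sum(x).
depth, leaf_size, rank = 2, 2, 1
n_nodes = 2**(depth + 1) - 1
V = {leaf: [[1.0]] * leaf_size for leaf in range(2**depth - 1, n_nodes)}
E = {c: [[1.0]] for c in range(1, n_nodes)}
x = [float(i) for i in range(1, 9)]
xhat = upward_pass(x, V, E, depth, leaf_size, rank)
print(xhat[0])  # prints [36.0]
```

The point of the sketch is the separation of concerns: `marshal_level` turns a tree traversal into a flat list of operand indices, so the work on each level can be issued as a single batch of small, independent dense operations rather than one call per node.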


Author information

Authors and Affiliations

King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Stefano Zampini, Wajih Boukaram, George Turkiyyah, Omar Knio & David Keyes

Corresponding author

Correspondence to Stefano Zampini.

Additional information

Communicated by: Michael O’Neil

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Advances in Computational Integral Equations. Guest Editors: Stephanie Chaillat, Adrianna Gillman, Per-Gunnar Martinsson, Michael O’Neil, Mary-Catherine Kropinski, Timo Betcke, Alex Barnett

About this article

Cite this article

Zampini, S., Boukaram, W., Turkiyyah, G. et al. H2Opus: a distributed-memory multi-GPU software package for non-local operators. Adv Comput Math 48, 31 (2022). https://doi.org/10.1007/s10444-022-09942-6

Received: 01 September 2021
Accepted: 07 March 2022
Published: 10 May 2022
DOI: https://doi.org/10.1007/s10444-022-09942-6
