Abstract
Hierarchical \({\mathscr{H}}^{2}\)-matrices are asymptotically optimal representations for the discretizations of non-local operators such as those arising in integral equations or from kernel functions. Their \(O(N)\) complexity in both memory and operator application makes them particularly well suited to large-scale problems. There is therefore a need for software that supports distributed operations on these matrices, so that large-scale problems can be represented. In this paper, we present high-performance, distributed-memory, GPU-accelerated algorithms and implementations for matrix-vector multiplication and matrix recompression of hierarchical matrices in the \({\mathscr{H}}^{2}\) format. The algorithms form a new module of H2Opus, a performance-oriented package that supports a broad variety of \({\mathscr{H}}^{2}\) matrix operations on CPUs and GPUs. Performance in the distributed GPU setting is achieved by marshaling the tree data of the hierarchical matrix representation so that batched kernels can be executed on the individual GPUs. MPI is used for inter-process communication. We optimize the communication data volume and hide much of the communication cost by overlapping it with the local compute phases of the algorithms. Results show near-ideal scalability up to 1024 NVIDIA V100 GPUs on Summit, with performance exceeding 2.3 Tflop/s/GPU for the matrix-vector multiplication and 670 Gflop/s/GPU for matrix compression, which involves batched QR and SVD operations. We illustrate the flexibility and efficiency of the library by solving a 2D variable-diffusivity integral fractional diffusion problem with an algebraic multigrid-preconditioned Krylov solver, and demonstrate scalability to problems with up to 16M degrees of freedom on 64 GPUs.
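The marshaling idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the H2Opus API: the tree layout, the names `marshal_level`, `batched_matvec`, and `apply_by_level`, and the uniform 2x2 block size are all assumptions made for illustration, and a plain Python loop stands in for a batched GPU kernel. The point is only that gathering the operands of all nodes on one tree level into flat batches replaces one small call per node with one batched call per level.

```python
def batched_matvec(mats, vecs):
    """Stand-in for one batched GPU kernel launch over equally sized blocks."""
    return [[sum(a * x for a, x in zip(row, vec)) for row in mat]
            for mat, vec in zip(mats, vecs)]


def marshal_level(tree, level):
    """Gather the operands of all nodes on one tree level into flat batches."""
    node_ids = [i for i, node in enumerate(tree) if node["level"] == level]
    mats = [tree[i]["block"] for i in node_ids]
    vecs = [tree[i]["x"] for i in node_ids]
    return node_ids, mats, vecs


def apply_by_level(tree):
    """Run one batched call per level instead of one small call per node."""
    depth = max(node["level"] for node in tree)
    results = [None] * len(tree)
    launches = 0
    for level in range(depth + 1):
        node_ids, mats, vecs = marshal_level(tree, level)
        outputs = batched_matvec(mats, vecs)
        launches += 1
        for i, y in zip(node_ids, outputs):  # scatter results back to the nodes
            results[i] = y
    return results, launches


# Hypothetical toy data: a complete binary tree with 3 levels (7 nodes),
# where every node owns a 2x2 block and a length-2 input vector.
tree = [{"level": lvl, "block": [[1.0, 2.0], [0.0, 1.0]], "x": [float(k), 1.0]}
        for lvl in range(3) for k in range(2 ** lvl)]

results, launches = apply_by_level(tree)
print(launches, "batched calls for", len(tree), "nodes")
```

In this toy setting the number of kernel launches grows with the tree depth rather than with the number of nodes, which is the property that makes many small dense operations efficient on a GPU.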
Additional information
Communicated by: Michael O’Neil
This article belongs to the Topical Collection: Advances in Computational Integral Equations. Guest Editors: Stephanie Chaillat, Adrianna Gillman, Per-Gunnar Martinsson, Michael O’Neil, Mary-Catherine Kropinski, Timo Betcke, Alex Barnett
Cite this article
Zampini, S., Boukaram, W., Turkiyyah, G. et al. H2Opus: a distributed-memory multi-GPU software package for non-local operators. Adv Comput Math 48, 31 (2022). https://doi.org/10.1007/s10444-022-09942-6
DOI: https://doi.org/10.1007/s10444-022-09942-6
Keywords
- Hierarchical matrices
- Matrix-vector multiplication
- Matrix compression
- Distributed-memory
- GPU
- Integral equations