Algorithmic Patterns for \(\mathcal {H}\)-Matrices on Many-Core Processors


Abstract

In this work, we consider the reformulation of hierarchical (\(\mathcal {H}\)) matrix algorithms for many-core processors, with a model implementation on graphics processing units (GPUs). \(\mathcal {H}\) matrices approximate specific dense matrices, e.g., from discretized integral equations or kernel ridge regression, reducing the time complexity of the dense matrix–vector product to log-linear. The parallelization of \(\mathcal {H}\) matrix operations on many-core processors is difficult due to the complex nature of the underlying algorithms. While previous algorithmic advances for many-core hardware focused on accelerating existing \(\mathcal {H}\) matrix CPU implementations with many-core processors, we here aim to rely entirely on that processor type. As our main contribution, we introduce the parallel algorithmic patterns necessary to map the full \(\mathcal {H}\) matrix construction and the fast matrix–vector product to many-core hardware. Crucial ingredients are space-filling curves, parallel tree traversal and the batching of linear algebra operations. To the best of the author's knowledge, the resulting model GPU implementation hmglib is the first entirely GPU-based open-source \(\mathcal {H}\) matrix library of this kind. We investigate application examples from kernel ridge regression, Gaussian process regression and kernel-based interpolation. In this context, an in-depth performance analysis and a comparative performance study against a standard multi-core CPU \(\mathcal {H}\) matrix library highlight profound speedups of our many-core parallel approach.



Notes

  1. In the following, we stick to model problem (1) with collocation matrices of type (2), where the evaluation of each matrix entry is cheap. In contrast, the evaluation of system matrix entries in, e.g., a boundary element method (BEM) discretization is much more expensive. That is, we here focus on the implementation and optimization of the (many-core) parallel \(\mathcal {H}\) matrix algorithms rather than on optimizing the performance of the kernel matrix entry evaluation, which is a different objective, cf. [11].

  2. Note that this choice may not be stable. However, in our practical experiments with the discussed model problems, we did not observe instabilities. Nevertheless, other applications might require a better, and thus more expensive, heuristic for the choice of \(j_r\).

  3. We here assume that the reader has some knowledge about the construction of Morton codes; for details, see, e.g., [8]. A minimal sketch of the bit-interleaving idea is given after these notes.

  4. Exact pricing for the Tesla P100 SXM2 was hard to find at the time of writing this article, since it is not a discrete graphics card. The discrete Tesla P100 version with 16 GB is priced identically to the two Intel CPUs.

  5. Note that we stick to low dimensions d, since it is well known that these can be handled well by \(\mathcal {H}\) matrices. The application of \(\mathcal {H}\) matrices in high dimensions is ongoing research.

  6. We did not observe perfect scalability of the OpenMP-parallel code on the CPU systems. This should be kept in mind when comparing CPU to GPU results.
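
To illustrate what is assumed in note 3: a Morton code interleaves the bits of the quantized point coordinates, so that sorting by code orders the points along the Z-order curve. The following minimal 2D sketch in CUDA C++ uses the classic bit-spreading trick; the function names are illustrative and not taken from hmglib.

    // Spread the lower 16 bits of x, inserting one zero bit between any
    // two consecutive bits (the classic "Part1By1" helper).
    __host__ __device__ inline unsigned int spread_bits_2d(unsigned int x)
    {
        x &= 0x0000ffffu;
        x = (x ^ (x << 8)) & 0x00ff00ffu;
        x = (x ^ (x << 4)) & 0x0f0f0f0fu;
        x = (x ^ (x << 2)) & 0x33333333u;
        x = (x ^ (x << 1)) & 0x55555555u;
        return x;
    }

    // 2D Morton code: interleave the bits of the quantized coordinates
    // (ix, iy); nearby points receive nearby codes on the Z-order curve.
    __host__ __device__ inline unsigned int morton_code_2d(unsigned int ix,
                                                           unsigned int iy)
    {
        return (spread_bits_2d(ix) << 1) | spread_bits_2d(iy);
    }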

References

  1. Abdelfattah, A., Haidar, A., Tomov, S., Dongarra, J.: Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs. In: Proceedings of the International Conference on Supercomputing, ICS ’17, pp. 5:1–5:10. ACM, New York (2017)

  2. Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M., Takahashi, T.: Task-based FMM for multicore architectures. SIAM J. Sci. Comput. 36(1), C66–C93 (2014)

  3. Bebendorf, M.: AHMED: another software library on hierarchical matrices for elliptic differential equations. https://github.com/xantares/ahmed. Accessed 31 Aug 2018

  4. Bebendorf, M.: Hierarchical Matrices—A Means to Efficiently Solve Elliptic Boundary Value Problems, Lecture Notes in Computational Science and Engineering, vol. 63. Springer, Berlin (2008)

  5. Bebendorf, M., Kunis, S.: Recompression techniques for adaptive cross approximation. J. Integr. Equ. Appl. 21(3), 331–357 (2009)

  6. Bebendorf, M., Rjasanow, S.: Adaptive low-rank approximation of collocation matrices. Computing 70(1), 1–24 (2003)

  7. Bell, N., Hoberock, J.: Thrust: a productivity-oriented library for CUDA. GPU Comput. Gems Jade Ed. 2, 359–371 (2011)

  8. Bern, M., Eppstein, D., Teng, S.H.: Parallel construction of quadtrees and quality triangulations. Int. J. Comput. Geom. Appl. 09(06), 517–532 (1999)

  9. Börm, S.: \(\cal{H}^2\)-matrices—multilevel methods for the approximation of integral operators. Comput. Vis. Sci. 7(3), 173–181 (2004)

  10. Börm, S.: H2Lib, A Library for Hierarchical Matrices (2017). http://www.h2lib.org. Accessed 31 Aug 2018

  11. Börm, S., Christophersen, S.: Approximation of BEM Matrices Using GPGPUs. ArXiv e-prints (2015)

  12. Börm, S., Grasedyck, L., Hackbusch, W.: Introduction to hierarchical matrices with applications. Eng. Anal. Bound. Elem. 27(5), 405–422 (2003)

  13. Boukaram, W., Ltaief, H., Litvinenko, A., Abdelfattah, A., Keyes, D.E.: Accelerating matrix–vector multiplication on hierarchical matrices using graphical processing units

  14. Boukaram, W.H., Turkiyyah, G., Ltaief, H., Keyes, D.E.: Batched QR and SVD Algorithms on GPUs with Applications in Hierarchical Matrix Compression. ArXiv e-prints (2017)

  15. Charara, A., Keyes, D., Ltaief, H.: Batched triangular dense linear algebra kernels for very small matrix sizes on GPUs. ACM Trans. Math. Softw. 9(4), Art. No 39 (2017)

  16. Fasshauer, G.F.: Meshfree Approximation Methods with MATLAB. World Scientific Publishing Co., Inc, River Edge (2007)

  17. Garanzha, K., Pantaleoni, J., McAllister, D.: Simpler and faster HLBVH with work queues. In: Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics, HPG ’11, pp. 59–64. ACM, New York (2011)

  18. Ghysels, P., Li, X.S., Rouet, F., Williams, S., Napov, A.: An efficient multicore implementation of a novel HSS-structured multifrontal solver using randomized sampling. SIAM J. Sci. Comput. 38(5), S358 (2016)

  19. Grasedyck, L., Kriemann, R., Le Borne, S.: Parallel black box-LU preconditioning for elliptic boundary value problems. Comput. Vis. Sci. 11(4), 273–291 (2008)

  20. Greengard, L., Rokhlin, V.: A new version of the fast multipole method for the Laplace equation in three dimensions. Acta Numer. 6, 229–269 (1997)

  21. Hackbusch, W.: Hierarchical Matrices: Algorithms and Analysis, Springer Series in Computational Mathematics, vol. 49. Springer, Berlin (2015)

  22. Hackbusch, W.: Survey on the technique of hierarchical matrices. Vietnam J. Math. 44(1), 71–101 (2016)

  23. Hackbusch, W., Börm, S.: \(\cal{H}^2\)-matrix approximation of integral operators by interpolation. Appl. Numer. Math. 43(1–2), 129–143 (2002)

  24. Hackbusch, W., Khoromskij, B., Sauter, S.A.: On \(\cal{H}^2\)-matrices. In: Lectures on Applied Mathematics: Proceedings of the Symposium Organized by the Sonderforschungsbereich 438 on the Occasion of Karl-Heinz Hoffmann's 60th Birthday, Munich, 30 June–1 July 1999, p. 9. Springer (2000)

  25. Hackbusch, W., Nowak, Z.P.: On the fast matrix multiplication in the boundary element method by panel clustering. Numer. Math. 54(4), 463–491 (1989)

  26. Kriemann, R.: Parallel \(\cal{H}\)-matrix arithmetics on shared memory systems. Computing 74(3), 273–297 (2005)

  27. Kriemann, R.: \(\cal{H}\)-LU factorization on many-core systems. Comput. Vis. Sci. 16(3), 105–117 (2013)

  28. Kriemann, R.: \(\cal{H}\)-\(\text{Lib}^{\text{ pro }}\) (website) (2017). http://www.hlibpro.com. Accessed 31 Aug 2018

  29. Lauterbach, C., Garland, M., Sengupta, S., Luebke, D., Manocha, D.: Fast BVH construction on GPUs. Comput. Graph. Forum 28(2), 375–384 (2009)

  30. March, W.B., Xiao, B., Yu, C., Biros, G.: ASKIT: an efficient, parallel library for high-dimensional kernel summations. SIAM J. Sci. Comput. 38, S720–S749 (2016)

  31. Merrill, D., Garland, M., Grimshaw, A.: Scalable GPU graph traversal. In: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’12, pp. 117–128. ACM, New York (2012)

  32. Morton, G.: A computer oriented geodetic data base and a new technique in file sequencing. Technical Report, IBM Ltd., Ottawa, Ontario, Canada (1966)

  33. Poulson, J.: DMHM—Distributed-Memory Hierarchical Matrices. https://bitbucket.org/poulson/dmhm. Accessed 31 Aug 2018

  34. Rasmussen, C., Williams, C.: Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). MIT Press, Cambridge (2005)

  35. Rouet, F.H., Li, X.S., Ghysels, P., Napov, A.: A distributed-memory package for dense hierarchically semi-separable matrix computations using randomization. ACM Trans. Math. Softw. 42(4), Art. No 27 (2016)

  36. Sheng, Z., Dewilde, P., Chandrasekaran, S.: Algorithms to Solve Hierarchically Semi-separable Systems, pp. 255–294. Birkhäuser, Basel (2007)

  37. Szuppe, J.: Boost.Compute: a parallel computing library for C++ based on OpenCL. In: Proceedings of the 4th International Workshop on OpenCL, IWOCL ’16, pp. 15:1–15:39. ACM, New York (2016)

  38. Valiant, L.G.: A bridging model for parallel computation. Commun. ACM 33(8), 103–111 (1990)

  39. Vovk, V.: Kernel Ridge Regression, pp. 105–116. Springer, Berlin (2013)

  40. Wendland, H.: Scattered Data Approximation. Cambridge University Press, Cambridge (2004)

  41. Yalamanchili, P., Arshad, U., Mohammed, Z., Garigipati, P., Entschev, P., Kloppenborg, B., Malcolm, J., Melonakos, J.: ArrayFire—a high performance software library for parallel computing with an easy-to-use API (2015). https://github.com/arrayfire/arrayfire. Accessed 31 Aug 2018

  42. Yokota, R., Barba, L.: FMM-based vortex method for simulation of isotropic turbulence on GPUs, compared with a spectral method. Comput. Fluids 80, 17–27 (2013)

  43. Yokota, R., Barba, L., Knepley, M.G.: PetRBF: a parallel O(N) algorithm for radial basis function interpolation with Gaussians. Comput. Methods Appl. Mech. Eng. 199(25), 1793–1804 (2010)

  44. Zaspel, P.: MPLA—Massively Parallel Linear Algebra (2017). https://github.com/zaspel/MPLA. Accessed 31 Aug 2018

  45. Zaspel, P.: hmglib—Hierarchical Matrices on GPU(s) Library (2017). https://github.com/zaspel/hmglib. Accessed 31 Aug 2018

Acknowledgements

The author would like to thank the anonymous referees for their enlightening remarks. This work is funded by the Swiss National Science Foundation (SNF) under Project No. \(407540\_167186\). Furthermore, code development tasks in this research were done on resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the US Department of Energy under Contract No. DE-AC05-00OR22725. The IBM POWER8 system with the NVIDIA Tesla P100 SXM2 used in the benchmarks for this research was donated by the NVIDIA PSG Cluster. All funding and support is gratefully acknowledged.

Author information


Correspondence to Peter Zaspel.

A Batched Bounding Box Computation

As part of the traversal of the block cluster tree, we have to evaluate the admissibility condition (3) for index blocks \(\tau \times \sigma \) involving the bounding boxes of \(\tau \) and \(\sigma \) in each node. In the following, we will discuss an algorithm to concurrently compute the bounding boxes for clusters \(\tau \), \(\sigma \) in all nodes on a given level l of the block cluster tree. The algorithm is based on batching, cf. Sect. 4.3.1.
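
To fix ideas, the following sketch shows how such a per-node admissibility check could be evaluated from two axis-aligned bounding boxes. We assume here the common \(\eta \)-admissibility criterion \(\min \{\mathrm{diam}(B_\tau ), \mathrm{diam}(B_\sigma )\} \le \eta \, \mathrm{dist}(B_\tau , B_\sigma )\); whether this matches condition (3) exactly, as well as the struct layout and names, are assumptions for illustration.

    #include <math.h>

    #define DIM 2  // spatial dimension d (illustrative)

    // Axis-aligned bounding box of a cluster.
    struct bounding_box { double lo[DIM]; double hi[DIM]; };

    // Euclidean diameter of a box.
    __host__ __device__ double bb_diameter(const bounding_box &b)
    {
        double s = 0.0;
        for (int k = 0; k < DIM; ++k) {
            double e = b.hi[k] - b.lo[k];
            s += e * e;
        }
        return sqrt(s);
    }

    // Euclidean distance between two boxes (zero if they overlap).
    __host__ __device__ double bb_distance(const bounding_box &a,
                                           const bounding_box &b)
    {
        double s = 0.0;
        for (int k = 0; k < DIM; ++k) {
            double gap = fmax(fmax(a.lo[k] - b.hi[k], b.lo[k] - a.hi[k]), 0.0);
            s += gap * gap;
        }
        return sqrt(s);
    }

    // eta-admissibility check for the index block tau x sigma.
    __host__ __device__ bool is_admissible(const bounding_box &bb_tau,
                                           const bounding_box &bb_sigma,
                                           double eta)
    {
        return fmin(bb_diameter(bb_tau), bb_diameter(bb_sigma))
               <= eta * bb_distance(bb_tau, bb_sigma);
    }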

We collect the set of nodes on a level l of the block cluster tree, i.e. \(V_{I\times I}(l)\), in the array node_data of length \(|V_{I\times I}(l)|\), composed of structs work_item, and store the input points \(\mathcal {Y}\) in an instance of the struct point_set, cf. Sect. 4.1. As a simplification, we only consider the concurrent computation of the bounding boxes for one cluster set, e.g. \(\tau \), in each node/block cluster.
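
The following declarations merely illustrate the data layout implied by this description; the actual definitions in hmglib (cf. Sect. 4.1) may differ in naming and layout.

    // Illustrative declarations only; not the actual hmglib definitions.
    struct work_item
    {
        // One node of the block cluster tree, i.e. an index block tau x sigma;
        // each cluster is a contiguous index range in the Morton-ordered points.
        int tau_lower,   tau_upper;    // index bounds of cluster tau
        int sigma_lower, sigma_upper;  // index bounds of cluster sigma
    };

    struct point_set
    {
        int      dim;     // spatial dimension d
        int      size;    // number of points
        double **coords;  // coords[k]: device array of the k-th coordinates
    };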

Note that standard implementations of \(\mathcal {H}\) matrix algorithms first build a cluster tree and then construct the block cluster tree. In that case, it is natural to compute the bounding boxes of each cluster during the cluster tree construction. Here, however, we choose a different approach: we construct the block cluster tree without first building a cluster tree. This is efficient with respect to the clustering, since the Morton ordering of the point sets makes clustering cheap. Moreover, we prefer to do just one tree traversal (for the block cluster tree) instead of several tree traversals (for two cluster trees and the block cluster tree) to maximize the utilization of the many-core processor within a single tree traversal. This, however, comes at the price that the bounding box computation becomes part of the block cluster tree traversal, which is quite unusual. To avoid computing the bounding box of a single cluster many times, we divide the bounding box calculation into a pre-processing step and a bounding box assignment step. In the pre-processing step, we extract from all block clusters the list of clusters. This list potentially contains multiple copies of a cluster, so we unify it to obtain a unique list of clusters. For each cluster in that unified list, we compute the bounding box and store it in a lookup table. This corresponds to computing the bounding boxes on the cluster tree. In the bounding box assignment step, we use a lookup table to assign the pre-computed bounding boxes to the clusters in the block cluster tree. Thereby, we avoid recomputing bounding boxes for identical clusters.

Fig. 17: The bounding box computation is sped up by pre-computing the bounding box for each cluster once. The boxes are stored in bb_lookup_table and accessed via a map between work items and the lookup table.

Since the algorithmically demanding step is the pre-processing step, we restrict the description to this step. We first identify the set of unique clusters, as defined before. We then create a lookup table bb_lookup_table that stores the bounding box information for each unique cluster. In addition, we need a map from each node in node_data to the corresponding entry in the lookup table. Figure 17 exemplifies this idea.

Algorithm 7 (compute_bounding_box_lookup_table)

Algorithm 7 describes our approach to compute the entries of the lookup table bb_lookup_table. The function compute_bounding_box_lookup_table takes as input the coordinate array coords of the input point set \(\mathcal {Y}\), the nodes \(V_{I\times I}(l)\) on level l in node_data, and further size information. First, the lower index bounds \(i_{l,1}\) and upper index bounds \(i_{u,1}\) are extracted from each node and stored in the arrays lower_index_bounds and upper_index_bounds. By construction, the (block) cluster tree traversal based on Z-order curves only creates clusters that do not overlap in the point set array and that, for a given lower index bound, always have the same upper bound. Therefore, we can use parallel sorting and unification methods to identify the set of unique clusters. The unique clusters are collected (by their lower and upper index bounds) in unique_lower_index_bounds and unique_upper_index_bounds. The final step is to compute the coordinate minima and maxima in each cluster. This step follows the idea of batching, cf. Sect. 4.3.1: the batched array is the array of coordinates, the bounds for the batches are given by the unique lower and upper index bounds, and the keys for the batches are the sequence of numbers \(\{1,2,\ldots \}\). Results of the batched computation that are associated to points in \(\mathcal {Y}\) not belonging to any cluster are finally removed by dropping all batched results associated to the key 0.
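
These steps map naturally onto Thrust primitives (cf. [7]). The following sketch shows the unification and the batched minima for a single coordinate direction; the per-point key array batch_keys is assumed to have been filled beforehand from the unique index bounds (key i > 0 for points of the i-th unique cluster, key 0 for points belonging to no cluster), and all names are illustrative rather than the actual hmglib code.

    #include <thrust/device_vector.h>
    #include <thrust/functional.h>
    #include <thrust/iterator/zip_iterator.h>
    #include <thrust/reduce.h>
    #include <thrust/remove.h>
    #include <thrust/sort.h>
    #include <thrust/tuple.h>
    #include <thrust/unique.h>

    // Predicate selecting batched results with key 0 (points in no cluster).
    struct key_is_zero
    {
        template <typename Tuple>
        __host__ __device__ bool operator()(const Tuple &t) const
        {
            return thrust::get<0>(t) == 0;
        }
    };

    void bounding_box_minima_sketch(
        thrust::device_vector<int> &lower_index_bounds,    // per node, in/out
        thrust::device_vector<int> &upper_index_bounds,    // per node, in/out
        const thrust::device_vector<int> &batch_keys,      // per point
        const thrust::device_vector<double> &coords_x,     // per point
        thrust::device_vector<int> &unique_keys,           // output
        thrust::device_vector<double> &bb_min_x)           // output
    {
        // 1. Unique clusters: clusters never overlap and a lower index bound
        //    determines the upper bound, so sorting plus unification of the
        //    (lower, upper) pairs yields the unique cluster list.
        auto first = thrust::make_zip_iterator(thrust::make_tuple(
            lower_index_bounds.begin(), upper_index_bounds.begin()));
        auto last = thrust::make_zip_iterator(thrust::make_tuple(
            lower_index_bounds.end(), upper_index_bounds.end()));
        thrust::sort(first, last);
        size_t n_unique = thrust::unique(first, last) - first;
        lower_index_bounds.resize(n_unique);
        upper_index_bounds.resize(n_unique);

        // 2. Batched minimum per key segment; segments with key 0 may occur
        //    several times between clusters.
        unique_keys.resize(batch_keys.size());
        bb_min_x.resize(batch_keys.size());
        auto ends = thrust::reduce_by_key(
            batch_keys.begin(), batch_keys.end(), coords_x.begin(),
            unique_keys.begin(), bb_min_x.begin(),
            thrust::equal_to<int>(), thrust::minimum<double>());
        size_t n_out = ends.first - unique_keys.begin();

        // 3. Remove all batched results associated to the key 0.
        auto zipped = thrust::make_zip_iterator(thrust::make_tuple(
            unique_keys.begin(), bb_min_x.begin()));
        n_out = thrust::remove_if(zipped, zipped + n_out, key_is_zero()) - zipped;
        unique_keys.resize(n_out);
        bb_min_x.resize(n_out);
        // The maxima and the remaining coordinate directions are computed
        // analogously with thrust::maximum<double>().
    }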

Algorithm 8 (computation of the map)

Fig. 18: Creating the map between a work item and an entry in the lookup table requires sorting, a compute kernel for bounds assignment, a scan operation and a permutation operation.

Our approach to compute the map between the nodes in node_data and the lookup table is summarized in Algorithm 8. Again, we first extract the lower and upper index bounds. Then, without loss of generality, we sort the lower bounds of the clusters and keep the applied permutation in permutation. Next, we create a global array map of length \(|V_{I\times I}(l)|\) and initialize it to “0”. The parallel kernel set_bounds_for_map with \(|V_{I\times I}(l)|\) threads then sets a “1” in map wherever two subsequent entries in the sorted lower_bounds differ. By an inclusive scan on map, we create growing indices in map that mark identical entries in lower_bounds. The result is exemplified in Fig. 18. We finally permute map back using the kernel permutation with \(|V_{I\times I}(l)|\) threads, which yields the required map.
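
In Thrust, the kernels set_bounds_for_map and permutation correspond to a transform and a scatter, respectively. The following is a sketch under the same illustrative naming as before, not the actual hmglib code.

    #include <thrust/device_vector.h>
    #include <thrust/functional.h>
    #include <thrust/scan.h>
    #include <thrust/scatter.h>
    #include <thrust/sequence.h>
    #include <thrust/sort.h>
    #include <thrust/transform.h>

    // Maps each node (by its lower cluster bound) to its entry in
    // bb_lookup_table; lower_bounds is passed by value since it gets sorted.
    void compute_map_sketch(thrust::device_vector<int> lower_bounds,
                            thrust::device_vector<int> &map)
    {
        const int n = lower_bounds.size();
        if (n == 0) return;

        // Sort the lower bounds, remembering the applied permutation.
        thrust::device_vector<int> permutation(n);
        thrust::sequence(permutation.begin(), permutation.end());
        thrust::sort_by_key(lower_bounds.begin(), lower_bounds.end(),
                            permutation.begin());

        // set_bounds_for_map: a "1" wherever two subsequent sorted entries
        // differ (the first entry keeps its initial "0").
        thrust::device_vector<int> flags(n, 0);
        thrust::transform(lower_bounds.begin() + 1, lower_bounds.end(),
                          lower_bounds.begin(), flags.begin() + 1,
                          thrust::not_equal_to<int>());

        // Inclusive scan: flags become growing lookup-table indices that are
        // constant over identical entries of lower_bounds.
        thrust::inclusive_scan(flags.begin(), flags.end(), flags.begin());

        // permutation kernel: permute the indices back to the original node
        // order, i.e. map[permutation[i]] = flags[i].
        map.resize(n);
        thrust::scatter(flags.begin(), flags.end(), permutation.begin(),
                        map.begin());
    }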


Cite this article

Zaspel, P. Algorithmic Patterns for \(\mathcal {H}\)-Matrices on Many-Core Processors. J Sci Comput 78, 1174–1206 (2019). https://doi.org/10.1007/s10915-018-0809-4

