Algorithmic Patterns for \(\mathcal {H}\)-Matrices on Many-Core Processors


Abstract

In this work, we consider the reformulation of hierarchical (\(\mathcal {H}\)) matrix algorithms for many-core processors, with a model implementation on graphics processing units (GPUs). \(\mathcal {H}\) matrices approximate specific dense matrices, e.g., from discretized integral equations or kernel ridge regression, reducing the time complexity of the dense matrix–vector product to log-linear. The parallelization of \(\mathcal {H}\) matrix operations on many-core processors is difficult due to the complex nature of the underlying algorithms. While previous algorithmic advances for many-core hardware focused on accelerating existing \(\mathcal {H}\) matrix CPU implementations with many-core processors, we here aim to rely entirely on that processor type. As our main contribution, we introduce the parallel algorithmic patterns necessary to map the full \(\mathcal {H}\) matrix construction and the fast matrix–vector product to many-core hardware. Crucial ingredients are space-filling curves, parallel tree traversal and the batching of linear algebra operations. To the best of the author's knowledge, the resulting model GPU implementation hmglib is the first entirely GPU-based open-source \(\mathcal {H}\) matrix library of this kind. We investigate application examples from kernel ridge regression, Gaussian process regression and kernel-based interpolation. In this context, an in-depth performance analysis and a comparative performance study against a standard multi-core CPU \(\mathcal {H}\) matrix library highlight profound speedups of our many-core parallel approach.



Notes

  1. In the following, we stick to model problem (1) with collocation matrices of type (2), where the evaluation of each matrix entry is cheap. In contrast, the evaluation of system matrix entries in, e.g., a boundary element method (BEM) discretization is much more expensive. That is, we here focus on the implementation and optimization of the (many-core) parallel \(\mathcal {H}\) matrix algorithms rather than on optimizing the performance of the kernel matrix entry evaluation, which is a different objective, cf. [11].

  2. Note that this choice may not be stable. However, in our practical experiments with the discussed model problems, we did not observe instabilities. Nevertheless, other applications might require a better, and thus more expensive, heuristic for the choice of \(j_r\).

  3. We here assume that the reader has some knowledge about the construction of Morton codes; for details, see, e.g., [8]. A minimal sketch of the bit-interleaving idea is given after these notes.

  4. Exact pricing for the Tesla P100 SXM2 was hard to find at the time of writing this article, since it is not a discrete graphics card. The discrete Tesla P100 version with 16 GB is priced identically to the two Intel CPUs.

  5. Note that we stick to low dimensions d, since it is well known that these can be handled well by \(\mathcal {H}\) matrices. The application of \(\mathcal {H}\) matrices in high dimensions is ongoing research.

  6. We did not observe perfect scalability of the OpenMP-parallel code on the CPU systems. This should be kept in mind when comparing CPU to GPU results.
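
To illustrate what is assumed in note 3: a Morton code interleaves the bits of the quantized point coordinates, so that sorting by code orders the points along the Z-order curve. The following minimal 2D sketch in CUDA C++ uses the classic bit-spreading trick; the function names are illustrative and not taken from hmglib.

    // Spread the lower 16 bits of x, inserting one zero bit between any
    // two consecutive bits (the classic "Part1By1" helper).
    __host__ __device__ inline unsigned int spread_bits_2d(unsigned int x)
    {
        x &= 0x0000ffffu;
        x = (x ^ (x << 8)) & 0x00ff00ffu;
        x = (x ^ (x << 4)) & 0x0f0f0f0fu;
        x = (x ^ (x << 2)) & 0x33333333u;
        x = (x ^ (x << 1)) & 0x55555555u;
        return x;
    }

    // 2D Morton code: interleave the bits of the quantized coordinates
    // (ix, iy); nearby points receive nearby codes on the Z-order curve.
    __host__ __device__ inline unsigned int morton_code_2d(unsigned int ix,
                                                           unsigned int iy)
    {
        return (spread_bits_2d(ix) << 1) | spread_bits_2d(iy);
    }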

References

  1. Abdelfattah, A., Haidar, A., Tomov, S., Dongarra, J.: Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs. In: Proceedings of the International Conference on Supercomputing, ICS ’17, pp. 5:1–5:10. ACM, New York (2017)

  2. Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M., Takahashi, T.: Task-based FMM for multicore architectures. SIAM J. Sci. Comput. 36(1), C66–C93 (2014)

  3. Bebendorf, M.: AHMED: another software library on hierarchical matrices for elliptic differential equations. https://github.com/xantares/ahmed. Accessed 31 Aug 2018

  4. Bebendorf, M.: Hierarchical Matrices—A Means to Efficiently Solve Elliptic Boundary Value Problems, Lecture Notes in Computational Science and Engineering, vol. 63. Springer, Berlin (2008)

  5. Bebendorf, M., Kunis, S.: Recompression techniques for adaptive cross approximation. J. Integr. Equ. Appl. 21(3), 331–357 (2009)

  6. Bebendorf, M., Rjasanow, S.: Adaptive low-rank approximation of collocation matrices. Computing 70(1), 1–24 (2003)

  7. Bell, N., Hoberock, J.: Thrust: a productivity-oriented library for CUDA. GPU Comput. Gems Jade Ed. 2, 359–371 (2011)

  8. Bern, M., Eppstein, D., Teng, S.H.: Parallel construction of quadtrees and quality triangulations. Int. J. Comput. Geom. Appl. 09(06), 517–532 (1999)

  9. Börm, S.: \(\cal{H}^2\)-matrices—multilevel methods for the approximation of integral operators. Comput. Vis. Sci. 7(3), 173–181 (2004)

  10. Börm, S.: H2Lib, A Library for Hierarchical Matrices (2017). http://www.h2lib.org. Accessed 31 Aug 2018

  11. Börm, S., Christophersen, S.: Approximation of BEM Matrices Using GPGPUs. ArXiv e-prints (2015)

  12. Börm, S., Grasedyck, L., Hackbusch, W.: Introduction to hierarchical matrices with applications. Eng. Anal. Bound. Elem. 27(5), 405–422 (2003)

  13. Boukaram, W., Ltaief, H., Litvinenko, A., Abdelfattah, A., Keyes, D.E.: Accelerating matrix–vector multiplication on hierarchical matrices using graphical processing units

  14. Boukaram, W.H., Turkiyyah, G., Ltaief, H., Keyes, D.E.: Batched QR and SVD Algorithms on GPUs with Applications in Hierarchical Matrix Compression. ArXiv e-prints (2017)

  15. Charara, A., Keyes, D., Ltaief, H.: Batched triangular dense linear algebra kernels for very small matrix sizes on GPUs. ACM Trans. Math. Softw. 9(4), Art. No 39 (2017)

  16. Fasshauer, G.F.: Meshfree Approximation Methods with MATLAB. World Scientific Publishing Co., Inc, River Edge (2007)

  17. Garanzha, K., Pantaleoni, J., McAllister, D.: Simpler and faster HLBVH with work queues. In: Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics, HPG ’11, pp. 59–64. ACM, New York (2011)

  18. Ghysels, P., Li, X.S., Rouet, F., Williams, S., Napov, A.: An efficient multicore implementation of a novel HSS-structured multifrontal solver using randomized sampling. SIAM J. Sci. Comput. 38(5), S358 (2016)

  19. Grasedyck, L., Kriemann, R., Le Borne, S.: Parallel black box-LU preconditioning for elliptic boundary value problems. Comput. Vis. Sci. 11(4), 273–291 (2008)

  20. Greengard, L., Rokhlin, V.: A new version of the fast multipole method for the Laplace equation in three dimensions. Acta Numer. 6, 229–269 (1997)

  21. Hackbusch, W.: Hierarchical Matrices: Algorithms and Analysis, Springer Series in Computational Mathematics, vol. 49. Springer, Berlin (2015)

  22. Hackbusch, W.: Survey on the technique of hierarchical matrices. Vietnam J. Math. 44(1), 71–101 (2016)

  23. Hackbusch, W., Börm, S.: \(\cal{H}^2\)-matrix approximation of integral operators by interpolation. Appl. Numer. Math. 43(1–2), 129–143 (2002)

  24. Hackbusch, W., Khoromskij, B., Sauter, S.A.: On \(\cal{H}^2\)-matrices. In: Lectures on Applied Mathematics: Proceedings of the Symposium Organized by the Sonderforschungsbereich 438 on the Occasion of Karl-Heinz Hoffmann's 60th Birthday, Munich, 30 June–1 July 1999, p. 9. Springer (2000)

  25. Hackbusch, W., Nowak, Z.P.: On the fast matrix multiplication in the boundary element method by panel clustering. Numer. Math. 54(4), 463–491 (1989)

  26. Kriemann, R.: Parallel \(\cal{H}\)-matrix arithmetics on shared memory systems. Computing 74(3), 273–297 (2005)

  27. Kriemann, R.: \(\cal{H}\)-LU factorization on many-core systems. Comput. Vis. Sci. 16(3), 105–117 (2013)

  28. Kriemann, R.: \(\cal{H}\)-\(\text{Lib}^{\text{ pro }}\) (website) (2017). http://www.hlibpro.com. Accessed 31 Aug 2018

  29. Lauterbach, C., Garland, M., Sengupta, S., Luebke, D., Manocha, D.: Fast BVH construction on GPUs. Comput. Graph. Forum 28(2), 375–384 (2009)

  30. March, W.B., Xiao, B., Yu, C., Biros, G.: ASKIT: an efficient, parallel library for high-dimensional kernel summations. SIAM J. Sci. Comput. 38, S720–S749 (2016)

  31. Merrill, D., Garland, M., Grimshaw, A.: Scalable GPU graph traversal. In: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’12, pp. 117–128. ACM, New York (2012)

  32. Morton, G.: A computer oriented geodetic data base and a new technique in file sequencing. Technical Report, IBM Ltd., Ottawa, Ontario, Canada (1966)

  33. Poulson, J.: DMHM—Distributed-Memory Hierarchical Matrices. https://bitbucket.org/poulson/dmhm. Accessed 31 Aug 2018

  34. Rasmussen, C., Williams, C.: Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). MIT Press, Cambridge (2005)

  35. Rouet, F.H., Li, X.S., Ghysels, P., Napov, A.: A distributed-memory package for dense hierarchically semi-separable matrix computations using randomization. ACM Trans. Math. Softw. 42(4), Art. No 27 (2016)

  36. Sheng, Z., Dewilde, P., Chandrasekaran, S.: Algorithms to Solve Hierarchically Semi-separable Systems, pp. 255–294. Birkhäuser, Basel (2007)

  37. Szuppe, J.: Boost.Compute: a parallel computing library for C++ based on OpenCL. In: Proceedings of the 4th International Workshop on OpenCL, IWOCL ’16, pp. 15:1–15:39. ACM, New York (2016)

  38. Valiant, L.G.: A bridging model for parallel computation. Commun. ACM 33(8), 103–111 (1990)

  39. Vovk, V.: Kernel Ridge Regression, pp. 105–116. Springer, Berlin (2013)

  40. Wendland, H.: Scattered Data Approximation. Cambridge University Press, Cambridge (2004)

  41. Yalamanchili, P., Arshad, U., Mohammed, Z., Garigipati, P., Entschev, P., Kloppenborg, B., Malcolm, J., Melonakos, J.: ArrayFire—a high performance software library for parallel computing with an easy-to-use API (2015). https://github.com/arrayfire/arrayfire. Accessed 31 Aug 2018

  42. Yokota, R., Barba, L.: FMM-based vortex method for simulation of isotropic turbulence on GPUs, compared with a spectral method. Comput. Fluids 80, 17–27 (2013)

  43. Yokota, R., Barba, L., Knepley, M.G.: PetRBF: a parallel O(N) algorithm for radial basis function interpolation with Gaussians. Comput. Methods Appl. Mech. Eng. 199(25), 1793–1804 (2010)

  44. Zaspel, P.: MPLA—Massively Parallel Linear Algebra (2017). https://github.com/zaspel/MPLA. Accessed 31 Aug 2018

  45. Zaspel, P.: hmglib—Hierarchical Matrices on GPU(s) Library (2017). https://github.com/zaspel/hmglib. Accessed 31 Aug 2018

Acknowledgements

The author would like to thank the anonymous referees for their enlightening remarks. This work is funded by the Swiss National Science Foundation (SNF) under Project No. \(407540\_167186\). Furthermore, code development tasks in this research were done on resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the US Department of Energy under Contract No. DE-AC05-00OR22725. The IBM POWER8 system with the NVIDIA Tesla P100 SXM2 used in the benchmarks for this research was donated by the NVIDIA PSG Cluster. All funding and support is gratefully acknowledged.

Author information


Correspondence to Peter Zaspel.

A Batched Bounding Box Computation

As part of the traversal of the block cluster tree, we have to evaluate the admissibility condition (3) for index blocks \(\tau \times \sigma \) involving the bounding boxes of \(\tau \) and \(\sigma \) in each node. In the following, we will discuss an algorithm to concurrently compute the bounding boxes for clusters \(\tau \), \(\sigma \) in all nodes on a given level l of the block cluster tree. The algorithm is based on batching, cf. Sect. 4.3.1.
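
To fix ideas, the following sketch shows how such a per-node admissibility check could be evaluated from two axis-aligned bounding boxes. We assume here the common \(\eta \)-admissibility criterion \(\min \{\mathrm{diam}(B_\tau ), \mathrm{diam}(B_\sigma )\} \le \eta \, \mathrm{dist}(B_\tau , B_\sigma )\); whether this matches condition (3) exactly, as well as the struct layout and names, are assumptions for illustration.

    #include <math.h>

    #define DIM 2  // spatial dimension d (illustrative)

    // Axis-aligned bounding box of a cluster.
    struct bounding_box { double lo[DIM]; double hi[DIM]; };

    // Euclidean diameter of a box.
    __host__ __device__ double bb_diameter(const bounding_box &b)
    {
        double s = 0.0;
        for (int k = 0; k < DIM; ++k) {
            double e = b.hi[k] - b.lo[k];
            s += e * e;
        }
        return sqrt(s);
    }

    // Euclidean distance between two boxes (zero if they overlap).
    __host__ __device__ double bb_distance(const bounding_box &a,
                                           const bounding_box &b)
    {
        double s = 0.0;
        for (int k = 0; k < DIM; ++k) {
            double gap = fmax(fmax(a.lo[k] - b.hi[k], b.lo[k] - a.hi[k]), 0.0);
            s += gap * gap;
        }
        return sqrt(s);
    }

    // eta-admissibility check for the index block tau x sigma.
    __host__ __device__ bool is_admissible(const bounding_box &bb_tau,
                                           const bounding_box &bb_sigma,
                                           double eta)
    {
        return fmin(bb_diameter(bb_tau), bb_diameter(bb_sigma))
               <= eta * bb_distance(bb_tau, bb_sigma);
    }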

We collect the set of nodes on a level l of the block cluster tree, i.e. \(V_{I\times I}(l)\), in the array node_data of length \(|V_{I\times I}(l)|\), composed of structs work_item, and store the input points \(\mathcal {Y}\) in an instance of the struct point_set, cf. Sect. 4.1. As a simplification, we only consider the concurrent computation of the bounding boxes for one cluster set, e.g. \(\tau \), in each node/block cluster.
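
The following declarations merely illustrate the data layout implied by this description; the actual definitions in hmglib (cf. Sect. 4.1) may differ in naming and layout.

    // Illustrative declarations only; not the actual hmglib definitions.
    struct work_item
    {
        // One node of the block cluster tree, i.e. an index block tau x sigma;
        // each cluster is a contiguous index range in the Morton-ordered points.
        int tau_lower,   tau_upper;    // index bounds of cluster tau
        int sigma_lower, sigma_upper;  // index bounds of cluster sigma
    };

    struct point_set
    {
        int      dim;     // spatial dimension d
        int      size;    // number of points
        double **coords;  // coords[k]: device array of the k-th coordinates
    };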

Note that standard implementations of \(\mathcal {H}\) matrix algorithms first build a cluster tree and then construct the block cluster tree. In that case, it is natural to compute the bounding boxes of each cluster during the cluster tree construction. Here, however, we choose a different approach: we construct the block cluster tree without first building a cluster tree. This is efficient with respect to the clustering, since the Morton ordering of the point sets makes clustering cheap. Moreover, we prefer to do just one tree traversal (for the block cluster tree) instead of several tree traversals (for two cluster trees and the block cluster tree) to maximize the utilization of the many-core processor within a single tree traversal. This, however, comes at the price that the bounding box computation becomes part of the block cluster tree traversal, which is quite unusual. To avoid computing the bounding box of a single cluster many times, we divide the bounding box calculation into a pre-processing step and a bounding box assignment step. In the pre-processing step, we extract from all block clusters the list of clusters. This list potentially contains multiple copies of a cluster, so we unify it to obtain a unique list of clusters. For each cluster in that unified list, we compute the bounding box and store it in a lookup table. This corresponds to computing the bounding boxes on the cluster tree. In the bounding box assignment step, we use a lookup table to assign the pre-computed bounding boxes to the clusters in the block cluster tree. Thereby, we avoid recomputing bounding boxes for identical clusters.

Fig. 17: The bounding box computation is sped up by pre-computing the bounding box for each cluster once. The boxes are stored in bb_lookup_table and accessed via a map between work items and the lookup table.

Since the algorithmically demanding step is the pre-processing step, we restrict the description to this step. We first identify the set of unique clusters, as defined before. We then create a lookup table bb_lookup_table that stores the bounding box information for each unique cluster. In addition, we need a map from each node in node_data to the corresponding entry in the lookup table. Figure 17 exemplifies this idea.

Algorithm 7 (compute_bounding_box_lookup_table)

Algorithm 7 describes our approach to compute the entries of the lookup table bb_lookup_table. The function compute_bounding_box_lookup_table takes as input the coordinate array coords of the input point set \(\mathcal {Y}\), the nodes \(V_{I\times I}(l)\) on level l in node_data, and further size information. First, the lower index bounds \(i_{l,1}\) and upper index bounds \(i_{u,1}\) are extracted from each node and stored in the arrays lower_index_bounds and upper_index_bounds. By construction, the (block) cluster tree traversal based on Z-order curves only creates clusters that do not overlap in the point set array and that, for a given lower index bound, always have the same upper bound. Therefore, we can use parallel sorting and unification methods to identify the set of unique clusters. The unique clusters are collected (by their lower and upper index bounds) in unique_lower_index_bounds and unique_upper_index_bounds. The final step is to compute the coordinate minima and maxima in each cluster. This step follows the idea of batching, cf. Sect. 4.3.1: the batched array is the array of coordinates, the bounds for the batches are given by the unique lower and upper index bounds, and the keys for the batches are the sequence of numbers \(\{1,2,\ldots \}\). Results of the batched computation that are associated to points in \(\mathcal {Y}\) not belonging to any cluster are finally removed by dropping all batched results associated to the key 0.
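
These steps map naturally onto Thrust primitives (cf. [7]). The following sketch shows the unification and the batched minima for a single coordinate direction; the per-point key array batch_keys is assumed to have been filled beforehand from the unique index bounds (key i > 0 for points of the i-th unique cluster, key 0 for points belonging to no cluster), and all names are illustrative rather than the actual hmglib code.

    #include <thrust/device_vector.h>
    #include <thrust/functional.h>
    #include <thrust/iterator/zip_iterator.h>
    #include <thrust/reduce.h>
    #include <thrust/remove.h>
    #include <thrust/sort.h>
    #include <thrust/tuple.h>
    #include <thrust/unique.h>

    // Predicate selecting batched results with key 0 (points in no cluster).
    struct key_is_zero
    {
        template <typename Tuple>
        __host__ __device__ bool operator()(const Tuple &t) const
        {
            return thrust::get<0>(t) == 0;
        }
    };

    void bounding_box_minima_sketch(
        thrust::device_vector<int> &lower_index_bounds,    // per node, in/out
        thrust::device_vector<int> &upper_index_bounds,    // per node, in/out
        const thrust::device_vector<int> &batch_keys,      // per point
        const thrust::device_vector<double> &coords_x,     // per point
        thrust::device_vector<int> &unique_keys,           // output
        thrust::device_vector<double> &bb_min_x)           // output
    {
        // 1. Unique clusters: clusters never overlap and a lower index bound
        //    determines the upper bound, so sorting plus unification of the
        //    (lower, upper) pairs yields the unique cluster list.
        auto first = thrust::make_zip_iterator(thrust::make_tuple(
            lower_index_bounds.begin(), upper_index_bounds.begin()));
        auto last = thrust::make_zip_iterator(thrust::make_tuple(
            lower_index_bounds.end(), upper_index_bounds.end()));
        thrust::sort(first, last);
        size_t n_unique = thrust::unique(first, last) - first;
        lower_index_bounds.resize(n_unique);
        upper_index_bounds.resize(n_unique);

        // 2. Batched minimum per key segment; segments with key 0 may occur
        //    several times between clusters.
        unique_keys.resize(batch_keys.size());
        bb_min_x.resize(batch_keys.size());
        auto ends = thrust::reduce_by_key(
            batch_keys.begin(), batch_keys.end(), coords_x.begin(),
            unique_keys.begin(), bb_min_x.begin(),
            thrust::equal_to<int>(), thrust::minimum<double>());
        size_t n_out = ends.first - unique_keys.begin();

        // 3. Remove all batched results associated to the key 0.
        auto zipped = thrust::make_zip_iterator(thrust::make_tuple(
            unique_keys.begin(), bb_min_x.begin()));
        n_out = thrust::remove_if(zipped, zipped + n_out, key_is_zero()) - zipped;
        unique_keys.resize(n_out);
        bb_min_x.resize(n_out);
        // The maxima and the remaining coordinate directions are computed
        // analogously with thrust::maximum<double>().
    }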

Algorithm 8 (computation of the map)

Fig. 18: Creating the map between a work item and an entry in the lookup table requires sorting, a compute kernel for bounds assignment, a scan operation and a permutation operation.

Our approach to compute the map between the nodes in node_data and the lookup table is summarized in Algorithm 8. Again, we first extract the lower and upper index bounds. Then, without loss of generality, we sort the lower bounds of the clusters and keep the applied permutation in permutation. Next, we create a global array map of length \(|V_{I\times I}(l)|\) and initialize it to “0”. The parallel kernel set_bounds_for_map with \(|V_{I\times I}(l)|\) threads then sets a “1” in map wherever two subsequent entries in the sorted lower_bounds differ. By an inclusive scan on map, we create growing indices in map that mark identical entries in lower_bounds. The result is exemplified in Fig. 18. We finally permute map back using the kernel permutation with \(|V_{I\times I}(l)|\) threads, which yields the required map.
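
In Thrust, the kernels set_bounds_for_map and permutation correspond to a transform and a scatter, respectively. The following is a sketch under the same illustrative naming as before, not the actual hmglib code.

    #include <thrust/device_vector.h>
    #include <thrust/functional.h>
    #include <thrust/scan.h>
    #include <thrust/scatter.h>
    #include <thrust/sequence.h>
    #include <thrust/sort.h>
    #include <thrust/transform.h>

    // Maps each node (by its lower cluster bound) to its entry in
    // bb_lookup_table; lower_bounds is passed by value since it gets sorted.
    void compute_map_sketch(thrust::device_vector<int> lower_bounds,
                            thrust::device_vector<int> &map)
    {
        const int n = lower_bounds.size();
        if (n == 0) return;

        // Sort the lower bounds, remembering the applied permutation.
        thrust::device_vector<int> permutation(n);
        thrust::sequence(permutation.begin(), permutation.end());
        thrust::sort_by_key(lower_bounds.begin(), lower_bounds.end(),
                            permutation.begin());

        // set_bounds_for_map: a "1" wherever two subsequent sorted entries
        // differ (the first entry keeps its initial "0").
        thrust::device_vector<int> flags(n, 0);
        thrust::transform(lower_bounds.begin() + 1, lower_bounds.end(),
                          lower_bounds.begin(), flags.begin() + 1,
                          thrust::not_equal_to<int>());

        // Inclusive scan: flags become growing lookup-table indices that are
        // constant over identical entries of lower_bounds.
        thrust::inclusive_scan(flags.begin(), flags.end(), flags.begin());

        // permutation kernel: permute the indices back to the original node
        // order, i.e. map[permutation[i]] = flags[i].
        map.resize(n);
        thrust::scatter(flags.begin(), flags.end(), permutation.begin(),
                        map.begin());
    }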


Cite this article

Zaspel, P. Algorithmic Patterns for \(\mathcal {H}\)-Matrices on Many-Core Processors. J Sci Comput 78, 1174–1206 (2019). https://doi.org/10.1007/s10915-018-0809-4

