1 Introduction

The concept of samplets has been introduced in [25] by generalizing the wavelet construction from [45] to discrete data sets in Euclidean space. A samplet basis is a multiresolution analysis of discrete signed measures, where stability is a direct consequence of the orthogonality of the basis. Samplets are data-centric and can be constructed such that their measure integrals vanish for all polynomials up to a certain degree. Thanks to this vanishing moment property in ambient space, kernel matrices, as they arise in scattered data approximation, become quasi-sparse in the samplet basis. This means that these kernel matrices are compressible in samplet coordinates, S-compressible for short, and can be replaced by sparse matrices. We call the resulting sparsity pattern the compression pattern. The latter has been characterized in [25, Section 5.3]. Given a quasi-uniform data set of cardinality N, i.e., the distance between neighboring points is uniformly bounded from below and above by constant multiples of \(N^{-1/d}\), where \(d \ge 1\) is the spatial dimension of the data, the S-compressed kernel matrix contains only \({\mathcal {O}}(N\log N)\) relevant entries, even for kernels of low regularity. A similar multiresolution approach in the reproducing kernel Hilbert space context was suggested in [30], while a geometry-oblivious compression based on local degenerate kernel expansions is considered in [51].

In this article, we develop fast arithmetic operations for S-compressed kernel matrices. Fixing the sparsity pattern, we can perform addition and multiplication of kernel matrices with high precision at essentially linear cost. The derived cost bounds assume quasi-uniformity of the data points. Even so, all algorithms can still be applied if the quasi-uniformity assumption does not hold; in this case, however, the established cost bounds may become invalid. Similar approaches for realizing arithmetic operations for nonlocal operators exist by means of hierarchical matrices, see [11, 14, 17, 21, 22], and by means of wavelets, see [6, 7, 42].

We prove that the inverses of regularized kernel matrices are compressible with respect to the original compression pattern. We can thus employ the selected inversion algorithm proposed in [36] to efficiently approximate the inverse. Our concrete implementation is based on a supernodal left-looking \(LDL^\intercal\)-factorization of the underlying matrix, which is available in the sparse direct solver Pardiso, see [31, 40]. The selected inversion computes, in the absence of rounding errors, the exact matrix inverse of the S-compressed matrix on its matrix pattern. Likewise, matrix addition and matrix multiplication are performed exactly on the prescribed compression pattern. This means that the relevant matrix coefficients are computed exactly when adding, multiplying, and inverting S-compressed kernel matrices. The only error introduced is the matrix compression error arising from the restriction to the compression pattern.

Having fast formatted matrix addition and fast matrix inversion at hand enables the fast approximate evaluation of holomorphic functions of S-compressed matrices via contour integrals, and hence the computation of more complicated matrix functions. This has been envisioned in [7] (“We conjecture and provide numerical evidence that functions of operators inherit this property”) and suggested in [23]. In the present paper we prove, using the multiresolution kernel matrix algebra under consideration, that, up to (exponentially small) contour quadrature errors, these contour integrals are computed exactly on the prescribed pattern. This is in contrast to previously proposed formats, such as hierarchical matrices, see [17].

Many applications require only the computation of a subset of the entries of a given matrix inverse. Important examples are sparse inverse covariance matrix estimation in \(\ell ^1\)-regularized Gaussian maximum likelihood estimation, see, e.g., [9, 29], or integrated nested Laplace approximations for approximate Bayesian inference, see, e.g., [48] and the references therein. Other examples of computing a subset of the inverse are electronic structure calculations of materials utilizing multipole expansions, where only the diagonal and, occasionally, sub-diagonals of the discrete Green’s function are required to determine the electron density [33, 35].

We provide a rigorous theoretical underpinning of the algorithms under consideration by means of pseudodifferential calculus [28, 46]. To this end, we focus on kernels of reproducing kernel Hilbert spaces and assume that the associated integral operators correspond, via the Schwartz kernel theorem, to classical, elliptic pseudodifferential operators from the Hörmander class \(S^m_{1,0}\), cp. [28]. A prominent example of such kernels is the Matérn class of kernels, see [38], also called Sobolev splines [16]. The latter are known to generate the Sobolev spaces of positive order and correspond to fractional powers of the shifted Laplacian. We prove that such pseudodifferential operators are S-compressible, meaning that for their numerical representation, only the coefficients in the associated compression pattern need to be computed. Admissible classes comprise in particular the smooth Hörmander class \(S^m_{1,0}\), but also considerably larger kernel classes of finite smoothness, which admit Calderón-Zygmund estimates and an appropriate operator calculus, see, e.g., [1, 47]. The corresponding operator calculus implies that sums, compositions, powers, and holomorphic functions of self-adjoint, elliptic pseudodifferential operators yield again pseudodifferential operators. As a consequence, the respective operations on kernel matrices in samplet coordinates result again in compressible matrices.

The rest of this article is structured as follows. In Sect. 2, we introduce the scattered data framework under consideration and recall the relevant theory of reproducing kernel Hilbert spaces. The construction of samplets and the samplet matrix compression from [25] are summarized in Sect. 3. The main contribution of this article is Sect. 4, where we develop and analyze arithmetic operations for compressed kernel matrices in samplet coordinates. In Sect. 5, we perform numerical experiments in order to qualify and quantify the matrix algebra. Beyond benchmarking experiments, we consider the computation of an implicit surface from scattered data using Gaussian process learning. Finally, the required details from the theory of pseudodifferential operators, especially the associated calculus, are collected in Appendix A.

Throughout this article, in order to avoid the repeated use of generic but unspecified constants, by \(C\lesssim D\) we indicate that C can be bounded by a multiple of D, independently of any parameters on which C and D may depend. Moreover, \(C\gtrsim D\) is defined as \(D\lesssim C\), and \(C\sim D\) as \(C\lesssim D\) and \(D\lesssim C\).

2 Reproducing kernel Hilbert spaces

Let \(({\mathcal {H}},\langle \cdot ,\cdot \rangle _{\mathcal {H}})\) be a Hilbert space of functions \(h:\Omega \rightarrow {\mathbb {R}}\) with dual space \({\mathcal {H}}'\). Herein, \(\Omega \subset {\mathbb {R}}^d\) is a given bounded domain or a lower-dimensional manifold. Furthermore, let \(\kappa \) be a symmetric and positive definite kernel, i.e., \([\kappa ({\varvec{x}}_i,{\varvec{x}}_j)]_{i,j=1}^N\) is a symmetric and positive semi-definite matrix for every \(N\in {\mathbb {N}}\) and any point selection \({\varvec{x}}_1,\ldots ,{\varvec{x}}_N\in \Omega \). We recall that \(\kappa \) is the reproducing kernel for \({\mathcal {H}}\), iff \(\kappa ({\varvec{x}},\cdot )\in {\mathcal {H}}\) for every \({\varvec{x}}\in \Omega \) and \(h({\varvec{x}})=\langle \kappa ({\varvec{x}},\cdot ),h\rangle _{\mathcal {H}}\) for every \(h\in {\mathcal {H}}\). In this case, we call \(({\mathcal {H}},\langle \cdot ,\cdot \rangle _{\mathcal {H}})\) a reproducing kernel Hilbert space (RKHS).

Let \(X\mathrel {\mathrel {\mathop :}=}\{{\varvec{x}}_1,\ldots ,{\varvec{x}}_N\}\subset \Omega \) denote a set of N mutually distinct points. With respect to the set \(X\), we introduce the subspace

$$\begin{aligned} {\mathcal {H}}_X\mathrel {\mathrel {\mathop :}=}{{\,\textrm{span}\,}}\{\kappa ({\varvec{x}}_1,\cdot ),\ldots ,\kappa ({\varvec{x}}_N,\cdot )\} \subset {\mathcal {H}}. \end{aligned}$$
(1)

Corresponding to \({\mathcal {H}}_X\), we consider the subspace

$$\begin{aligned} {\mathcal {X}}\mathrel {\mathrel {\mathop :}=}{{\,\textrm{span}\,}}\{\delta _{{\varvec{x}}_1}, \ldots ,\delta _{{\varvec{x}}_N}\}\subset {\mathcal {H}}', \end{aligned}$$

which is spanned by the Dirac measures supported at the points of \(X\), i.e.,

$$\begin{aligned} \delta _{{\varvec{x}}_i}(A)\mathrel {\mathrel {\mathop :}=}{\left\{ \begin{array}{ll} 1,&{}\text {if }{\varvec{x}}_i\in A,\\ 0,&{}\text {otherwise} \end{array}\right. } \end{aligned}$$

for any subset \(A\subset \Omega \). For a continuous function \(f\in C(\Omega )\), we use the notation

$$\begin{aligned} (f,\delta _{{\varvec{x}}_i})_\Omega \mathrel {\mathrel {\mathop :}=}\int _{\Omega }f({\varvec{x}})\delta _{{\varvec{x}}_i}({\text {d}}\!{\varvec{x}}) =f({\varvec{x}}_i). \end{aligned}$$

As the kernel \(\kappa ({\varvec{x}},\cdot )\) is the Riesz representer of the point evaluation \((\cdot ,\delta _{\varvec{x}})_\Omega \), we particularly have

$$\begin{aligned} (h,\delta _{\varvec{x}})_\Omega =\langle \kappa ({\varvec{x}},\cdot ),h\rangle _{\mathcal {H}}\quad \text {for every } h\in {\mathcal {H}}. \end{aligned}$$

Thus, the space \({\mathcal {X}}\) is isometrically isomorphic to the subspace \({\mathcal {H}}_X\) from (1) and we identify

$$\begin{aligned} u'=\sum _{i=1}^Nu_i\delta _{{\varvec{x}}_i}\in {\mathcal {X}}\quad \text {with}\quad u=\sum _{i=1}^Nu_i\kappa ({\varvec{x}}_i,\cdot )\in {\mathcal {H}}_X. \end{aligned}$$

Later on, we endow \({\mathcal {X}}\) with the inner product

$$\begin{aligned} \langle u',v'\rangle _{\mathcal {X}}\mathrel {\mathrel {\mathop :}=}\sum _{i=1}^N u_iv_i,\quad \text {where } u'=\sum _{i=1}^Nu_i\delta _{{\varvec{x}}_i},\ v'=\sum _{i=1}^Nv_i\delta _{{\varvec{x}}_i}. \end{aligned}$$
(2)

This inner product is different from the restriction of the canonical one in \({\mathcal {H}}\) to \({\mathcal {H}}_X\). The latter is given by

$$\begin{aligned} \langle {u},{v}\rangle _{\mathcal {H}}={\varvec{u}}^\intercal {\varvec{K}}{\varvec{v}} \end{aligned}$$

with the symmetric and positive semi-definite kernel matrix

$$\begin{aligned} {\varvec{K}}\mathrel {\mathrel {\mathop :}=}\left[ \kappa ({\varvec{x}}_i,{\varvec{x}}_j)\right] _{i,j=1}^N\in {\mathbb {R}}^{N\times N} \end{aligned}$$
(3)

and \({\varvec{u}}\mathrel {\mathrel {\mathop :}=}[u_i]_{i=1}^N\) and \({\varvec{v}}\mathrel {\mathrel {\mathop :}=}[v_i]_{i=1}^N\).

Due to the duality between \({\mathcal {H}}_X\) and \({\mathcal {X}}\), the \({\mathcal {H}}\)-orthogonal projection of a function \(h\in {\mathcal {H}}\) onto \({\mathcal {H}}_X\) is given by the interpolant

$$\begin{aligned} s_h\mathrel {\mathrel {\mathop :}=}\sum _{i=1}^N\alpha _i\kappa ({\varvec{x}}_i,\cdot ), \end{aligned}$$

which satisfies \(s_h({\varvec{x}}_i) = h({\varvec{x}}_i)\) for all \({\varvec{x}}_i\in X\). The associated coefficients \({\varvec{\alpha }}=[\alpha _i]_{i=1}^N\) are given by the solution to the linear system

$$\begin{aligned} {\varvec{K}}{\varvec{\alpha }}={\varvec{h}} \end{aligned}$$

with right hand side \({\varvec{h}}=[h({\varvec{x}}_i)]_{i=1}^N\).
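To make the interpolation step concrete, the following sketch (Python with NumPy; all function names are ours and not part of [25] or its implementation) assembles the kernel matrix (3) for a radial kernel and solves the system \({\varvec{K}}{\varvec{\alpha }}={\varvec{h}}\) for the coefficients of the interpolant.

```python
import numpy as np

def kernel_matrix(kappa, X):
    # K = [kappa(||x_i - x_j||)]_{i,j} for points X of shape (N, d)
    r = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return kappa(r)

def interpolate(kappa, X, h):
    # coefficients alpha of the interpolant s_h = sum_i alpha_i kappa(x_i, .)
    return np.linalg.solve(kernel_matrix(kappa, X), h)

# usage: exponential kernel and samples of h(x) = sin(2 pi x_1)
X = np.random.rand(200, 2)
alpha = interpolate(lambda r: np.exp(-r), X, np.sin(2.0 * np.pi * X[:, 0]))
```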

From [49, Corollary 11.33], we have the following approximation result.

Theorem 1

Let \(\Omega \subset {\mathbb {R}}^d\) be a bounded Lipschitz domain satisfying an interior cone condition. Suppose that the Fourier transform of the kernel \(\kappa ({\varvec{x}}-{\varvec{y}})\) satisfies

$$\begin{aligned} {\widehat{\kappa }}({\varvec{\varvec{\xi }}})\sim \left( 1+\Vert {\varvec{\varvec{\xi }}}\Vert _2^2\right) ^{-\tau }, \quad {\varvec{\varvec{\xi }}}\in {\mathbb {R}}^d. \end{aligned}$$
(4)

Then for \(0\le t < \lceil \tau \rceil -d/2-1\), the error between \(f\in H^\tau (\Omega )\) and its interpolant \(s_{f,X}\) satisfies the bound

$$\begin{aligned} \Vert f-s_{f,X}\Vert _{H^{t}(\Omega )}\lesssim h_{X,\Omega }^{\tau -t}\Vert f\Vert _{H^\tau (\Omega )} \end{aligned}$$

for a sufficiently small fill distance

$$\begin{aligned} h_{X,\Omega } \mathrel {\mathrel {\mathop :}=}\sup _{{\varvec{x}}\in \Omega }\min _{{\varvec{x}}_i\in X} \Vert {\varvec{x}}-{\varvec{x}}_i\Vert _2. \end{aligned}$$
(5)

One class of kernels satisfying the conditions of Theorem 1 are the isotropic Matérn kernels, also called Sobolev splines, see [16]. These kernels play an important role in applications, such as spatial statistics [41]. They are given by

$$\begin{aligned} \kappa _\nu (r)\mathrel {\mathrel {\mathop :}=}\frac{2^{1-\nu }}{\Gamma (\nu )} \bigg (\frac{\sqrt{2\nu }r}{\ell }\bigg )^\nu K_\nu \bigg (\frac{\sqrt{2\nu }r}{\ell }\bigg ) \end{aligned}$$

with \(r\mathrel {\mathrel {\mathop :}=}\Vert {\varvec{x}}-{\varvec{y}}\Vert _2\), smoothness parameter \(\nu >0\) and length scale parameter \(\ell >0\), see [38, 41]. Furthermore, \(K_\nu \) denotes the modified Bessel function of the second kind. Specifically, property (4) holds with

$$\begin{aligned} {\widehat{\kappa }}_\nu ({\varvec{\varvec{\xi }}}) = \alpha \bigg (1+\frac{\ell ^2}{2\nu }\Vert {\varvec{\varvec{\xi }}}\Vert _2^2\bigg )^{-\nu -d/2}, \end{aligned}$$
(6)

where \(\alpha \) is a scaling factor depending on \(\nu \), \(\ell \) and d, see [38]. The Matérn kernels are the reproducing kernels of the Sobolev spaces \(H^{\nu +d/2}({\mathbb {R}}^d)\), see also [49].

For half integer values of \(\nu \), i.e., for \(\nu =p+1/2\) with \(p\in {\mathbb {N}}_0\), the Matérn kernels have an explicit representation given by

$$\begin{aligned} \kappa _{p+1/2}(r)=\exp \bigg (\frac{-\sqrt{2\nu }r}{\ell }\bigg ) \frac{p!}{(2p)!} \sum _{q=0}^p\frac{(p+q)!}{q!(p-q)!} \bigg (\frac{\sqrt{8\nu }r}{\ell }\bigg )^{p-q}. \end{aligned}$$

The limit case \(\nu \rightarrow \infty \) gives rise to the Gaussian kernel

$$\begin{aligned} \kappa _\infty (r) = \exp \bigg (\frac{-r^2}{2\ell ^2}\bigg ). \end{aligned}$$
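For concreteness, here is a short sketch of the Matérn family, using scipy.special.kv for the modified Bessel function \(K_\nu \); the function name matern is ours, and the half-integer identity above serves as a consistency check.

```python
import numpy as np
from scipy.special import gamma, kv

def matern(r, nu, ell):
    # Matern kernel kappa_nu(r) with smoothness nu and length scale ell
    r = np.asarray(r, dtype=float)
    scaled = np.sqrt(2.0 * nu) * r / ell
    safe = np.where(scaled > 0.0, scaled, 1.0)       # avoid K_nu(0) = inf
    val = 2.0 ** (1.0 - nu) / gamma(nu) * safe ** nu * kv(nu, safe)
    return np.where(scaled > 0.0, val, 1.0)          # kappa_nu(0) = 1

# consistency check: nu = 1/2 gives the exponential kernel exp(-r / ell)
r = np.linspace(0.0, 3.0, 7)
assert np.allclose(matern(r, 0.5, 1.0), np.exp(-r))
```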

Our subsequent analysis covers the Matérn family, but has considerably wider scope. Indeed, rather large classes of pseudodifferential operators will be admissible. As suitable classes of such operators are known to form an algebra, properties of arithmetic expressions of the underlying kernels, such as off-diagonal coefficient decay and matrix compressibility, can directly be inferred. Equally important, we show that these algebra properties transfer, to some extent, to the corresponding finitely represented structures, i.e., the corresponding matrix representations likewise form algebras in the compressed format. We refer to Appendix A for the details on pseudodifferential operators and their properties used in this article.

3 Samplet matrix compression

We recall in this section the concept of samplets as it has been introduced in [25].

3.1 Samplets

Samplets are defined based on a sequence of spaces \(\{{\mathcal {X}}_j\}_{j=0}^J\) forming a multiresolution analysis, i.e.,

$$\begin{aligned} {\mathcal {X}}_0\subset {\mathcal {X}}_1\subset \cdots \subset {\mathcal {X}}_J = {\mathcal {X}}. \end{aligned}$$
(7)

Rather than using a single scale from the multiresolution analysis (7), the idea of samplets is to keep track of the increment of information between two consecutive levels j and \(j+1\). Since we have \({\mathcal {X}}_{j}\subset {\mathcal {X}}_{j+1}\), we may decompose

$$\begin{aligned} {\mathcal {X}}_{j+1} ={\mathcal {X}}_j\overset{\perp }{\oplus }{\mathcal {S}}_j \end{aligned}$$
(8)

by introducing the detail space \({\mathcal {S}}_j\), where orthogonality is to be understood with respect to the (discrete) inner product defined in (2).

Let \({\varvec{\Sigma }}_j\) denote a basis of the detail space \({\mathcal {S}}_j\subset {\mathcal {X}}_{j+1}\); we call such a basis a samplet basis. By choosing a basis of scaling distributions \(\varvec{\Phi }_0\) of \({\mathcal {X}}_0\) and recursively applying the decomposition (8), we see that the set

$$\begin{aligned} \mathbf \Sigma _J = {\varvec{\Phi }}_0\cup \bigcup _{j=0}^{J-1}{\varvec{\Sigma }}_j \end{aligned}$$

forms a basis of \({\mathcal {X}}_J={\mathcal {X}}\). A visualization of a scaling distribution and two samplets on different resolution levels on a spiral data set is displayed in Fig. 1. The \((x,y)\)-components indicate the supports of the associated Dirac measures, while the z-component reflects the size of the corresponding coefficient.

Fig. 1: A scaling distribution on the coarsest scale (left plot) and samplets on levels 2 and 3 (middle and right plot)

To employ samplets for the compression of kernel matrices, it is desirable that the signed measures \(\sigma _{j,k}\in {\mathcal {X}}_j\subset {\mathcal {H}}'\) have isotropic convex hulls of their supports, are localized with respect to the corresponding discretization level j, i.e.,

$$\begin{aligned} {{\,\textrm{diam}\,}}({{\,\textrm{supp}\,}}\sigma _{j,k})\sim 2^{-j/d}, \end{aligned}$$
(9)

and that they are stable with respect to the inner product defined in (2), i.e.,

$$\begin{aligned} \langle \sigma _{j,k},\sigma _{j',k'}\rangle _{\mathcal {X}}=0 \quad \text {for }(j,k)\ne (j',k'). \end{aligned}$$

Furthermore, an essential ingredient is the vanishing moment condition of order \(q+1\), i.e.,

$$\begin{aligned} (p,\sigma _{j,k})_\Omega = 0\quad \text {for all}\ p\in {\mathcal {P}}_q(\Omega ), \end{aligned}$$
(10)

where \({\mathcal {P}}_q(\Omega )\) is the space of all polynomials of total degree at most \(q\). We then say that the samplets have vanishing moments of order \(q+1\).

Remark 1

Associated to each samplet

$$\begin{aligned} \sigma _{j,k} = \sum _{\ell =1}^N\beta _\ell \delta _{{\varvec{x}}_{i_\ell }}, \end{aligned}$$

we find a uniquely determined function

$$\begin{aligned} {\hat{\sigma }}_{j,k}\mathrel {\mathrel {\mathop :}=}\sum _{\ell =1}^N\beta _\ell \kappa ({\varvec{x}}_{i_\ell },\cdot )\in {\mathcal {H}}_X, \end{aligned}$$

which also exhibits vanishing moments, i.e.,

$$\begin{aligned} \langle {\hat{\sigma }}_{j,k},h\rangle _{\mathcal {H}}=0 \end{aligned}$$

for any \(h\in {\mathcal {H}}\) which satisfies \(h|_{O}\in {\mathcal {P}}_q(O)\) for some open set \(O\) with \({{\,\textrm{supp}\,}}\sigma _{j,k}\subset O\subset \Omega \).

3.2 Construction of samplets

The starting point for the construction of samplets is the multiresolution analysis (7). Its construction is based on a hierarchical clustering of the set \(X\).

Definition 1

Let \({\mathcal {T}}=(V,E)\) be a binary tree with vertices V and edges E. We define its set of leaves as

$$\begin{aligned} {\mathcal {L}}({\mathcal {T}})\mathrel {\mathrel {\mathop :}=}\{\nu \in V:\nu ~\text {has no children}\}. \end{aligned}$$

The tree \({\mathcal {T}}\) is a cluster tree for the set \(X=\{{\varvec{x}}_1,\ldots ,{\varvec{x}}_N\}\), iff the set X is the root of \({\mathcal {T}}\) and each \(\nu \in V{\setminus }{\mathcal {L}}({\mathcal {T}})\) is the disjoint union of its two children.

The level \(j_\nu \) of \(\nu \in {\mathcal {T}}\) is its distance from the root, i.e., the number of edges that are required for traveling from X to \(\nu \). The depth \(J\) of \({\mathcal {T}}\) is the maximum level of all clusters. We define the set of clusters on level j as

$$\begin{aligned} {\mathcal {T}}_j\mathrel {\mathrel {\mathop :}=}\{\nu \in {\mathcal {T}}:\nu ~\text {has level}~j\}. \end{aligned}$$

The cluster tree is balanced, iff \(|\nu |\sim 2^{J-j_{\nu }}\) holds for all clusters \(\nu \in {\mathcal {T}}\).

To bound the diameter of the clusters, we introduce the separation radius

$$\begin{aligned} q_X\mathrel {\mathrel {\mathop :}=}\frac{1}{2}\min _{i\ne j}\Vert {\varvec{x}}_i-{\varvec{x}}_j\Vert _2 \end{aligned}$$
(11)

and require \(X\) to be quasi-uniform.

Definition 2

The set \(X\subset \Omega \) is quasi-uniform if the fill distance (5) is proportional to the separation radius (11), i.e., there exists a constant \(c = c(X,\Omega )\in (0,1)\) such that

$$\begin{aligned} 0<c\le \frac{q_X}{h_{X,\Omega }} \le c^{-1}. \end{aligned}$$

Roughly speaking, the points \({\varvec{x}}\in X\) are equispaced if \(X\subset \Omega \) is quasi-uniform. This immediately implies the following result.

Lemma 1

Let \({\mathcal {T}}\) be a cluster tree constructed by hierarchical longest edge bisection of the bounding box \(B_{X}\), where \(B_{\nu }\), \(\nu \in {\mathcal {T}}\), is the smallest axis-parallel cuboid that contains all points of \(\nu \). If \(X\subset \Omega \) is quasi-uniform, then there holds

$$\begin{aligned} \frac{|B_\nu |}{|\Omega |} \sim \frac{|B_\nu \cap X|}{N} \end{aligned}$$

with the constant hidden in \(\sim \) depending only on the constant \(c(X,\Omega )\) in Definition 2. In particular, we have \({{\,\textrm{diam}\,}}(\nu )\sim 2^{-j_\nu /d}\) for all clusters \(\nu \in {\mathcal {T}}\).
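For illustration, the following is a minimal sketch of the cluster tree from Definition 1, built by hierarchical longest edge bisection of bounding boxes as assumed in Lemma 1 (Python/NumPy; the class and function names are ours, and the leaf size of 32 points is an arbitrary choice).

```python
import numpy as np

class Cluster:
    """Node of a binary cluster tree: point indices, bounding box, and level."""
    def __init__(self, indices, bbox, level):
        self.indices, self.bbox, self.level = indices, bbox, level
        self.children = []

def build_cluster_tree(X, indices, bbox, level=0, leaf_size=32):
    node = Cluster(indices, bbox, level)
    if len(indices) > leaf_size:
        axis = int(np.argmax(bbox[1] - bbox[0]))        # longest edge of the box
        mid = 0.5 * (bbox[0, axis] + bbox[1, axis])     # bisect it
        lbox, rbox = bbox.copy(), bbox.copy()
        lbox[1, axis], rbox[0, axis] = mid, mid
        mask = X[indices, axis] <= mid
        for child_idx, child_box in ((indices[mask], lbox), (indices[~mask], rbox)):
            node.children.append(
                build_cluster_tree(X, child_idx, child_box, level + 1, leaf_size))
    return node

X = np.random.rand(10000, 2)
root_box = np.stack([X.min(axis=0), X.max(axis=0)])
tree = build_cluster_tree(X, np.arange(len(X)), root_box)
```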

Samplets with vanishing moments are obtained recursively by employing a two-scale transform between basis elements on a cluster \(\nu \) of level j. To this end, we represent scaling distributions \(\mathbf {\Phi }_{j}^{\nu } = \{ \varphi _{j,k}^{\nu } \}\) and samplets \(\mathbf {\Sigma }_{j}^{\nu } = \{ \sigma _{j,k}^{\nu } \}\) as linear combinations of the scaling distributions \(\mathbf {\Phi }_{j+1}^{\nu }\) of \(\nu \)’s child clusters. This results in the refinement relation

$$\begin{aligned}{}[ \mathbf {\Phi }_{j}^{\nu }, \mathbf {\Sigma }_{j}^{\nu } ] \mathrel {\mathrel {\mathop :}=}\mathbf {\Phi }_{j+1}^{\nu } {\varvec{Q}}^{\nu }= \mathbf {\Phi }_{j+1}^{\nu } \big [ {\varvec{Q}}_{j,\Phi }^{\nu },{\varvec{Q}}_{j,\Sigma }^{\nu }\big ]. \end{aligned}$$

The transformation matrix \({\varvec{Q}}^{\nu }\) is computed from the QR decomposition

$$\begin{aligned} ({\varvec{M}}_{j+1}^{\nu })^\intercal = {\varvec{Q}}{\varvec{R}} \mathrel {=\mathrel {\mathop :}}\big [{\varvec{Q}}_{j,\Phi }^{\nu }, {\varvec{Q}}_{j,\Sigma }^{\nu }\big ]{\varvec{R}} \end{aligned}$$

of the moment matrix

$$\begin{aligned} {\varvec{M}}_{j+1}^{\nu }\mathrel {\mathrel {\mathop :}=}\big [({\varvec{x}}^{\varvec{\alpha }},\mathbf {\Phi }_{j+1}^{\nu })_\Omega \big ]_{|\varvec{\alpha }|\le q}, \end{aligned}$$

whose rows are indexed by the \(m_q\) monomials \({\varvec{x}}^{\varvec{\alpha }}\), \(|\varvec{\alpha }|\le q\), with

$$\begin{aligned} m_q\mathrel {\mathrel {\mathop :}=}\sum _{\ell =0}^q\left( {\begin{array}{c}\ell +d-1\\ d-1\end{array}}\right) \le (q+1)^d \end{aligned}$$

being the dimension of \({\mathcal {P}}_q(\Omega )\). There holds

$$\begin{aligned} \begin{aligned} \big [{\varvec{M}}_{j,\Phi }^{\nu }, {\varvec{M}}_{j,\Sigma }^{\nu }\big ]&= \left[ ({\varvec{x}}^{\varvec{\alpha }},[\mathbf {\Phi }_{j}^{\nu }, \mathbf {\Sigma }_{j}^{\nu }])_\Omega \right] _{|\varvec{\alpha }|\le q}\\&= \left[ ({\varvec{x}}^{\varvec{\alpha }},\mathbf {\Phi }_{j+1}^{\nu }[{\varvec{Q}}_{j,\Phi }^{\nu }, {\varvec{Q}}_{j,\Sigma }^{\nu }])_\Omega \right] _{|\varvec{\alpha }|\le q} \\&= {\varvec{M}}_{j+1}^{\nu } [{\varvec{Q}}_{j,\Phi }^{\nu }, {\varvec{Q}}_{j,\Sigma }^{\nu } ] = {\varvec{R}}^\intercal . \end{aligned} \end{aligned}$$

As \({\varvec{R}}^\intercal \) is a lower triangular matrix, the first \(k-1\) entries in its k-th column are zero. This corresponds to \(k-1\) vanishing moments for the k-th distribution generated by the transformation \([{\varvec{Q}}_{j,\Phi }^{\nu }, {\varvec{Q}}_{j,\Sigma }^{\nu } ]\). By defining the first \(m_{q}\) generated distributions as scaling distributions and the remaining ones as samplets, we obtain samplets with vanishing moments of order at least \(q+1\).
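The following sketch illustrates the QR-based construction for a single leaf cluster, where the scaling distributions are the Dirac measures at the cluster's points (Python/NumPy; all names are ours, and a full implementation would recurse through the cluster tree, treating the children's scaling distributions via the refinement relation above).

```python
import numpy as np
from itertools import combinations_with_replacement

def moment_matrix(points, q):
    # rows: monomials x^alpha with |alpha| <= q; columns: Dirac measures at the points
    cols = []
    for deg in range(q + 1):
        for combo in combinations_with_replacement(range(points.shape[1]), deg):
            col = np.ones(len(points))
            for axis in combo:
                col = col * points[:, axis]
            cols.append(col)
    return np.array(cols)

def samplet_transform(points, q):
    # QR of the transposed moment matrix; the first m_q columns of Q define the
    # scaling distributions, the remaining columns are samplets with q+1 vanishing moments
    M = moment_matrix(points, q)
    Q, _ = np.linalg.qr(M.T, mode='complete')
    return Q[:, :M.shape[0]], Q[:, M.shape[0]:]

pts = np.random.rand(50, 2)
Q_phi, Q_sigma = samplet_transform(pts, q=3)
assert np.allclose(moment_matrix(pts, 3) @ Q_sigma, 0.0)   # vanishing moments
```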

For leaf clusters, we define the scaling distributions by the Dirac measures supported at the points \({\varvec{x}}_i\in X\), i.e., \(\mathbf {\Phi }_J^{\nu }\mathrel {\mathrel {\mathop :}=}\{ \delta _{{\varvec{x}}_i}: {\varvec{x}}_i\in \nu \}\), to make up for the lack of child clusters that could provide scaling distributions. The scaling distributions of all clusters on a specific level j then generate the spaces

$$\begin{aligned} {\mathcal {X}}_{j}\mathrel {\mathrel {\mathop :}=}{{\,\textrm{span}\,}}\{ \varphi _{j,k}^{\nu }: k\in \Delta _j^\nu ,\ \nu \in {\mathcal {T}}_{j} \}, \end{aligned}$$
(12)

while the samplets span the detail spaces

$$\begin{aligned} {\mathcal {S}}_{j}\mathrel {\mathrel {\mathop :}=}{{\,\textrm{span}\,}}\{ \sigma _{j,k}^{\nu }: k\in \nabla _j^\nu ,\ \nu \in {\mathcal {T}}_{j} \} = {\mathcal {X}}_{j+1}\overset{\perp }{\ominus }{\mathcal {X}}_j. \end{aligned}$$
(13)

Combining the scaling distributions of the root cluster with all clusters’ samplets amounts to the basis

$$\begin{aligned} \mathbf {\Sigma }_{N}\mathrel {\mathrel {\mathop :}=}\mathbf {\Phi }_{0}^{X} \cup \bigcup _{\nu \in {\mathcal {T}}} \mathbf {\Sigma }_{j_{\nu }}^{\nu }. \end{aligned}$$
(14)

By construction, samplets satisfy the following properties, which are collected from [25, Theorem 3.6, Lemma 3.9, Theorem 5.4].

Theorem 2

The spaces \({\mathcal {X}}_{j}\) defined in equation (12) form the desired multiresolution analysis (7), where the corresponding detail spaces \({\mathcal {S}}_{j}\) from (13) satisfy

$$\begin{aligned} {\mathcal {X}}_{j+1}={\mathcal {X}}_j\overset{\perp }{\oplus }{\mathcal {S}}_{j}\quad \text {for all}\quad j=0,1,\ldots , J-1. \end{aligned}$$

The associated samplet basis \(\mathbf {\Sigma }_{N}\) defined in (14) constitutes an orthonormal basis of \({\mathcal {X}}\) and we have:

  1. The number of all samplets on level j behaves like \(2^j\).

  2. The samplets have vanishing moments of order \(q+1\), i.e., there holds (10).

  3. Each samplet is supported on a specific cluster \(\nu \). If the points in X are quasi-uniform, then the diameter of the cluster satisfies \({{\,\textrm{diam}\,}}(\nu )\sim 2^{-j_\nu /d}\) and there holds (9).

  4. The coefficient vector \({\varvec{\omega }}_{j,k}=\big [\omega _{j,k,i}\big ]_i\) of the samplet \(\sigma _{j,k}\) on the cluster \(\nu \) fulfills

     $$\begin{aligned} \Vert {\varvec{\omega }}_{j,k}\Vert _{1}\le \sqrt{|\nu |}. \end{aligned}$$

  5. Let \(f\in C^{q+1}(\Omega )\). Then, there holds for a samplet \(\sigma _{j,k}\) supported on the cluster \(\nu \) that

     $$\begin{aligned} |(f,\sigma _{j,k})_\Omega |\le \bigg (\frac{d}{2}\bigg )^{q+1} \frac{{{\,\textrm{diam}\,}}(\nu )^{q+1}}{(q+1)!}\Vert f\Vert _{C^{q+1}(\Omega )} \Vert {\varvec{\omega }}_{j,k}\Vert _{1}. \end{aligned}$$

Remark 2

Each samplet is a linear combination of the Dirac measures supported at the points in X. The related coefficient vectors \({\varvec{\omega }}_{j,k}\) in

$$\begin{aligned} \sigma _{j,k} = \sum _{i=1}^{N} \omega _{j,k,i} \delta _{{\varvec{x}}_i} \end{aligned}$$

are pairwise orthonormal with respect to the inner product (2).

The dual samplet basis in \({\mathcal {H}}_X\), which exhibits the Lagrange property, cp. [49], is given by

$$\begin{aligned} {\tilde{\sigma }}_{j,k}=\sum _{i=1}^N{\tilde{\omega }}_{j,k,i}\kappa ({\varvec{x}}_i,\cdot ), \quad \text {where}\quad \tilde{\varvec{\omega }}_{j,k}\mathrel {\mathrel {\mathop :}=}{\varvec{K}}^{-1}{\varvec{\omega }}_{j,k}, \end{aligned}$$

as there holds

$$\begin{aligned} \langle {\tilde{\sigma }}_{j,k},{\hat{\sigma }}_{j',k'}\rangle _{{\mathcal {H}}}= ({\tilde{\sigma }}_{j,k},\sigma _{j',k'})_\Omega&=\sum _{i,i'=1}^N{\tilde{\omega }}_{j,k,i}{\omega }_{j',k',i'} \big (\kappa ({\varvec{x}}_i,\cdot ),\delta _{{\varvec{x}}_{i'}}\big )_\Omega \\&=\tilde{\varvec{\omega }}_{j,k}^\intercal {\varvec{K}}{\varvec{\omega }}_{j',k'} =\delta _{(j,k),(j',k')}. \end{aligned}$$

3.3 Matrix compression

For the compression of the kernel matrix \({\varvec{K}}\) from (3), with samplets of vanishing moment order \(q+1\), for some integer \(q\ge 0\), we suppose that the kernel \(\kappa \) is “\(q+1\)-asymptotically smooth”. This is to say that there are constants \(c_{\kappa ,\varvec{\alpha },\varvec{\beta }}>0\) such that for all \({\varvec{x}},{\varvec{y}}\in \Omega \) with \({\varvec{x}}\ne {\varvec{y}}\) there holds

$$\begin{aligned} \bigg |\frac{\partial ^{|\varvec{\alpha }|+|\varvec{\beta }|}}{\partial {\varvec{x}}^{\varvec{\alpha }} \partial {\varvec{y}}^{\varvec{\beta }}} \kappa ({\varvec{x}},{\varvec{y}})\bigg | \le c_{\kappa ,\varvec{\alpha },\varvec{\beta }} \Vert {\varvec{x}}-{\varvec{y}}\Vert _2^{-(|\varvec{\alpha }|+|\varvec{\beta }|)}, \quad |\varvec{\alpha }|, |\varvec{\beta }| \le q+1. \end{aligned}$$
(15)

Note that such an estimate can only be valid for continuous kernels as considered here, but not for singular ones. However, we observe in passing that this condition is considerably weaker than the usual notion of asymptotic smoothness of kernels in \({{\mathcal {H}}}\)-matrix theory, cp. [21]. The condition there would correspond to infinite differentiability in (15) with analytic estimates on the constants \(c_{\kappa ,\varvec{\alpha },\varvec{\beta }}\).

Due to (15), we have in accordance with [25, Lemma 5.3] the decay estimate

$$\begin{aligned} \begin{aligned}&\big |(\kappa ,\sigma _{j,k}\otimes \sigma _{j',k'})_{\Omega \times \Omega }\big |\\&\qquad \le c_{\kappa ,q}\frac{{{\,\textrm{diam}\,}}(\nu )^{q+1}{{\,\textrm{diam}\,}}(\nu ')^{q+1}}{{{\,\textrm{dist}\,}}(\nu ,\nu ')^{2(q+1)}} \Vert {\varvec{\omega }}_{j,k}\Vert _{1}\Vert {\varvec{\omega }}_{j',k'}\Vert _{1} \end{aligned} \end{aligned}$$
(16)

for two samplets \(\sigma _{j,k}\) and \(\sigma _{j',k'}\), with the vanishing moment property of order \(q+1\) and supported on the clusters \(\nu \) and \(\nu '\) such that \({{\,\textrm{dist}\,}}(\nu ,\nu ') > 0\).

Estimate (16) holds for a wide range of kernels that obey the so-called Calderón-Zygmund estimates. It immediately results in the following compression strategy for kernel matrices in samplet representation, cp. [25, Theorem 5.4], which is well known in the context of wavelet compression of operator equations, see, e.g., [39].

Theorem 3

(\({\varvec{S}}\)-compression) Set to zero all coefficients of the kernel matrix

$$\begin{aligned} {\varvec{K}}^\Sigma \mathrel {\mathrel {\mathop :}=}\big [(\kappa ,\sigma _{j,k} \otimes \sigma _{j',k'})_{\Omega \times \Omega } \big ]_{j,j',k,k'} \end{aligned}$$

for which the supporting clusters satisfy the \(\eta \)-admissibility condition

$$\begin{aligned} {{\,\textrm{dist}\,}}(\nu ,\nu ')\ge \eta \max \{{{\,\textrm{diam}\,}}(\nu ),{{\,\textrm{diam}\,}}(\nu ')\},\quad \eta >0, \end{aligned}$$
(17)

where \(\nu \) is the cluster supporting \(\sigma _{j,k}\) and \(\nu '\) is the cluster supporting \(\sigma _{j',k'}\), respectively. Then, the resulting S-compressed matrix \({\varvec{K}}^\eta \) satisfies

$$\begin{aligned} \big \Vert {\varvec{K}}^\Sigma -{\varvec{K}}^\eta \big \Vert _F \le c {\eta ^{-2(q+1)}} N{\log (N)} \end{aligned}$$

for some constant \(c>0\) depending on the polynomial degree \(q\) and the kernel \(\kappa \).

Remark 3

We remark that Theorem 3 uses the Frobenius norm for measuring the error rather than the operator norm, as it gives control on each matrix coefficient. Estimates with respect to the operator norm would be similar.

The \(\eta \)-admissibility condition (17) is reminiscent of the one used for hierarchical matrices, compare, e.g., [11] and the references therein. However, in the present context, the clusters \(\nu \) and \(\nu '\) may also be located on different levels, i.e., \(j_\nu \ne j_{\nu '}\) in general. As a consequence, the resulting block cluster tree is the Cartesian product \({\mathcal {T}}\times {\mathcal {T}}\) rather than the level-wise Cartesian product considered in the context of hierarchical matrices.
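For reference, here is a small sketch of the admissibility check (17), evaluated on axis-parallel bounding boxes of the clusters as a common proxy for their true diameters and distances (Python/NumPy; the function names are ours).

```python
import numpy as np

def bbox_diam(bbox):
    # diameter of an axis-parallel box given as [min_corner, max_corner]
    return np.linalg.norm(bbox[1] - bbox[0])

def bbox_dist(bbox_a, bbox_b):
    # Euclidean distance between two axis-parallel boxes
    gap = np.maximum(0.0, np.maximum(bbox_a[0] - bbox_b[1], bbox_b[0] - bbox_a[1]))
    return np.linalg.norm(gap)

def is_admissible(bbox_a, bbox_b, eta=1.25):
    # eta-admissibility (17): the corresponding matrix block is dropped
    return bbox_dist(bbox_a, bbox_b) >= eta * max(bbox_diam(bbox_a), bbox_diam(bbox_b))
```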

The error bounds for S-compression hold for kernel functions \(\kappa \) with finite differentiability (especially, with derivatives of order \(q+1\), cp. [25, Lemma 5.3]), as opposed to the usual requirement of asymptotic smoothness which appears in the error analysis of the \({\mathcal {H}}\)-format, see [21] and the references therein.

For sets \(X = \{{\varvec{x}}_i\}_{i=1}^N\) that are quasi-uniform in the sense of Definition 2, there holds

$$\begin{aligned} \frac{1}{N^2}\big \Vert {\varvec{K}}^\Sigma \big \Vert _F^2 = \frac{1}{N^2}\sum _{i=1}^N\sum _{j=1}^N |\kappa ({\varvec{x}}_i,{\varvec{x}}_j)|^2 \sim \int _\Omega \int _\Omega |\kappa ({\varvec{x}},{\varvec{y}})|^2{\text {d}}\!{\varvec{x}}{\text {d}}\!{\varvec{y}}, \end{aligned}$$

i.e., \(\big \Vert {\varvec{K}}^\Sigma \big \Vert _F\sim N\). Thus, we can refine the above result, see also [25, Corollary 5.5].

Corollary 1

In case of quasi-uniform points \({\varvec{x}}_i\in X\), the S-compressed matrix \({\varvec{K}}^\eta \) has only \({\mathcal {O}}(N\log N)\) nonzero coefficients, while it satisfies the error estimate

$$\begin{aligned} \frac{\big \Vert {\varvec{K}}^\Sigma -{\varvec{K}}^\eta \big \Vert _F}{\big \Vert {\varvec{K}}^\Sigma \big \Vert _F} \le c \eta ^{-2(q+1)}. \end{aligned}$$
(18)

Here, the constant c depends on the kernel \(\kappa \) and on q, but is independent of \(\eta \) and N. In [25], an algorithm has been proposed which provides a numerical realization of the compressed matrix \({\varvec{K}}^\eta \) in work and memory \({\mathcal {O}}(N\log N)\). The key ingredient to achieve this is the use of an interpolation-based fast multipole method and \({\mathcal {H}}^2\)-matrix techniques [3, 11, 20].

4 Samplet matrix algebra

4.1 Addition and multiplication

To bound the cost for the addition of two compressed kernel matrices represented with respect to the same cluster tree, it is sufficient to assume that the points in X are quasi-uniform. Then it is straightforward to see that the cost for adding such matrices is \({\mathcal {O}}({N\log N})\). The multiplication of two compressed matrices, in turn, is motivated by the composition \({\mathcal {C}}={\mathcal {A}}\circ {\mathcal {B}}\) of two pseudodifferential operators \({\mathcal {A}}\) and \({\mathcal {B}}\). In suitable algebras, \({\mathcal {C}}\) is again a pseudodifferential operator and, hence, compressible. The respective kernel \(\kappa _{{\mathcal {C}}}(\cdot ,\cdot )\) is given by

$$\begin{aligned} \kappa _{{\mathcal {C}}}({\varvec{x}},{\varvec{y}}) = \int _\Omega \kappa _{{\mathcal {A}}}({\varvec{x}},{\varvec{z}}) \kappa _{{\mathcal {B}}}({\varvec{z}},{\varvec{y}}){\text {d}}\!{\varvec{z}}. \end{aligned}$$
(19)

Since \(\Omega \subset {\mathbb {R}}^d\) is bounded by assumption, we may without loss of generality assume \(\Omega \subset [0,1)^d\). Moreover, we assume that the data points in \(X=\{\varvec{x}_i\}_{i=1}^N\subset \Omega \) are uniformly distributed modulo one, i.e.,

$$\begin{aligned} \lim _{N\rightarrow \infty }\frac{|\Omega |}{N}\sum _{i=1}^N(f,\delta _{{\varvec{x}}_i})_\Omega = \int _\Omega f({\varvec{x}}){\text {d}}\!{\varvec{x}} \end{aligned}$$
(20)

for every Riemann integrable function \(f:\Omega \rightarrow {\mathbb {R}}\), cp. [32, Chap. 2.1]. Then, we may interpret the matrix product as a discrete version of the integral (19). In view of (20), we conclude

$$\begin{aligned} \bigg |\kappa _{{\mathcal {C}}}({\varvec{x}},{\varvec{y}}) -\frac{|\Omega |}{N}\sum _{k=1}^N \kappa _{{\mathcal {A}}}({\varvec{x}},{\varvec{x}}_k) \kappa _{{\mathcal {B}}}({\varvec{x}}_k,{\varvec{y}})\bigg |\rightarrow 0\ \text {as}\,N\rightarrow \infty . \end{aligned}$$
(21)

Consequently, the product of two kernel matrices

$$\begin{aligned} \varvec{K}_{{\mathcal {A}}} = [\kappa _{{\mathcal {A}}}(\varvec{x}_i,\varvec{x}_j)]_{i,j=1}^N,\quad \varvec{K}_{{\mathcal {B}}} = [\kappa _{{\mathcal {B}}}(\varvec{x}_i,\varvec{x}_j)]_{i,j=1}^N \end{aligned}$$

yields an S-compressible matrix \(\varvec{K}_{{\mathcal {A}}}\cdot \varvec{K}_{{\mathcal {B}}}\in {\mathbb {R}}^{N\times N}\).

Theorem 4

Let \(X=\{\varvec{x}_i\}_{i=1}^N \subset \Omega \) be uniformly distributed modulo one, see (20), and denote by \(\varvec{K}_{{\mathcal {C}}}\) the corresponding kernel matrix

$$\begin{aligned} \varvec{K}_{{\mathcal {C}}} = \frac{N}{|\Omega |} [\kappa _{{\mathcal {C}}}(\varvec{x}_i,\varvec{x}_j)]_{i,j=1}^N \end{aligned}$$

with \(\kappa _{{\mathcal {C}}}(\cdot ,\cdot )\) from (19). Then, there holds

$$\begin{aligned} \frac{\Vert \varvec{K}_{{\mathcal {C}}}-\varvec{K}_{{\mathcal {A}}}\varvec{K}_{{\mathcal {B}}}\Vert _F}{ \Vert \varvec{K}_{{\mathcal {C}}}\Vert _F} \rightarrow 0 \ \hbox { as}\ N\rightarrow \infty . \end{aligned}$$

Proof

On the one hand, we conclude from (21) that, as \(N\rightarrow \infty \),

$$\begin{aligned}&\Vert \varvec{K}_{{\mathcal {C}}}-\varvec{K}_{{\mathcal {A}}}\varvec{K}_{{\mathcal {B}}}\Vert _F^2\\&\qquad = \sum _{i,j=1}^N \bigg [\frac{N}{|\Omega |}\kappa _{{\mathcal {C}}}({\varvec{x}}_i,{\varvec{x}}_j) - \sum _{k=1}^N \kappa _{{\mathcal {A}}}({\varvec{x}}_i,{\varvec{x}}_k)\kappa _{{\mathcal {B}}}({\varvec{x}}_k,{\varvec{x}}_j) \bigg ]^2\\&\qquad \sim N^4 \int _\Omega \int _\Omega \bigg [\kappa _{{\mathcal {C}}}({\varvec{x}},{\varvec{y}}) -\frac{|\Omega |}{N}\sum _{k=1}^N \kappa _{{\mathcal {A}}}({\varvec{x}},{\varvec{x}}_k)\kappa _{{\mathcal {B}}}({\varvec{x}}_k,{\varvec{y}}) \bigg ]^2{\text {d}}\!\varvec{x}{\text {d}}\!\varvec{y}\\&\qquad = o(N^4). \end{aligned}$$

On the other hand, we find likewise

$$\begin{aligned} \Vert \varvec{K}_{{\mathcal {C}}}\Vert _F^2\sim \int _\Omega \int _\Omega N^2 \kappa _{{\mathcal {C}}}({\varvec{x}},{\varvec{y}})^2{\text {d}}\!\varvec{x}{\text {d}}\!\varvec{y} \sim N^4. \end{aligned}$$

This implies the assertion. \(\square \)

Remark 4

We mention that the consistency bound in the preceding theorem is rather crude. Under stronger regularity assumptions on the kernel functions, higher convergence rates can be achieved, provided that X is an appropriate higher-order quasi-Monte Carlo point set, see, e.g., [13] and the references therein.

Let \(\varvec{K}_{{\mathcal {A}}}^\eta ,\varvec{K}_{{\mathcal {B}}}^\eta ,\varvec{K}_{{\mathcal {C}}}^\eta \) be compressed with respect to the same compression pattern. We assume for given \(\varepsilon (\eta )>0\) that \(\eta \) in (18) is chosen such that

$$\begin{aligned} \big \Vert {\varvec{K}}^\Sigma -{\varvec{K}}^\eta \big \Vert _F \le \varepsilon (\eta ){\big \Vert {\varvec{K}}^\Sigma \big \Vert _F},\quad \text {for }{\varvec{K}}\in \{\varvec{K}_{{\mathcal {A}}},\varvec{K}_{{\mathcal {B}}},\varvec{K}_{{\mathcal {C}}}\}. \end{aligned}$$

Then, a repeated application of the triangle inequality yields

$$\begin{aligned}&\Vert \varvec{K}_{{\mathcal {C}}}^\eta -\varvec{K}_{{\mathcal {A}}}^\eta \varvec{K}_{{\mathcal {B}}}^\eta \Vert _F\\&\ \le \Vert \varvec{K}_{{\mathcal {C}}}^\Sigma -\varvec{K}_{{\mathcal {C}}}^\eta \Vert _F + \Vert \varvec{K}_{{\mathcal {A}}}^\Sigma \Vert _F\Vert \varvec{K}_{{\mathcal {B}}}^\Sigma -\varvec{K}_{{\mathcal {B}}}^\eta \Vert _F +\Vert \varvec{K}_{{\mathcal {B}}}^\eta \Vert _F\Vert \varvec{K}_{{\mathcal {A}}}^\Sigma -\varvec{K}_{{\mathcal {A}}}^\eta \Vert _F\\&\ \le \varepsilon (\eta )\big (\Vert \varvec{K}_{{\mathcal {C}}}\Vert _F +\Vert \varvec{K}_{{\mathcal {A}}}\Vert _F\Vert \varvec{K}_{{\mathcal {B}}}\Vert _F +\big (1+\varepsilon (\eta )\big )\Vert \varvec{K}_{{\mathcal {A}}}\Vert _F\Vert \varvec{K}_{{\mathcal {B}}}\Vert _F\big )\\&\ \lesssim \varepsilon (\eta )\big (\Vert \varvec{K}_{{\mathcal {C}}}\Vert _F +\Vert \varvec{K}_{{\mathcal {A}}}\Vert _F\Vert \varvec{K}_{{\mathcal {B}}}\Vert _F\big ). \end{aligned}$$

This means that we only need to compute \({\mathcal {O}}(N\log N)\) matrix entries to determine an approximate version \((\varvec{K}_{{\mathcal {A}}}^\eta \varvec{K}_{{\mathcal {B}}}^\eta )^\eta \) of the product \(\varvec{K}_{{\mathcal {A}}}^\eta \cdot \varvec{K}_{{\mathcal {B}}}^\eta \). We would like to stress that this S-formatted matrix multiplication is exact on the given compression patterns. The next theorem gives a cost bound for the matrix multiplication.

Theorem 5

Consider two kernel matrices

$$\begin{aligned} \varvec{K}_{{\mathcal {A}}}^\eta =[a_{(j,k),(j',k')}], \quad \varvec{K}_{{\mathcal {B}}}^\eta =[b_{(j,k),(j',k')}]\in {\mathbb {R}}^{N\times N} \end{aligned}$$

in samplet coordinates which are S-compressed with respect to the compression pattern induced by the \(\eta \)-admissibility condition (17).

Then, computing the matrix \(\varvec{K}_{{\mathcal {C}}}^\eta =[c_{(j,k),(j',k')}]\in {\mathbb {R}}^{N\times N}\) with respect to the same compression pattern, where the nonzero entries are given by the discrete inner product

$$\begin{aligned} c_{(j,k),(j',k')} = \sum _{\ell =0}^J\sum _{m\in \nabla _\ell } a_{(j,k),(\ell ,m)} b_{(\ell ,m),(j',k')}, \end{aligned}$$
(22)

is of cost \({\mathcal {O}}(N\log ^2 N)\).

Proof

To estimate the cost of the matrix multiplication, we shall make use of the compression rule (17). We assume for all clusters that \({{\,\textrm{diam}\,}}(\nu )\sim 2^{-j/d}\) if \(\nu \) is on level j. Thus, the samplet \(\sigma _{j,k}\) has a support of diameter approximately \(2^{-j/d}\) and, therefore, only \({\mathcal {O}}(2^{\ell -j})\) samplets \(\sigma _{\ell ,m}\) of diameter \(\sim 2^{-\ell /d}\) are found in its nearfield if \(\ell \ge j\), while only \({\mathcal {O}}(1)\) are found if \(\ell <j\). For fixed level \(0\le \ell \le J\) in (22), we thus have at most \({\mathcal {O}}(\max \{2^{\ell -\max \{j,j'\}},1\})\) nonzero products to evaluate per coefficient \(c_{(j,k),(j',k')}\). We assume without loss of generality that \(j\ge j'\) and sum over \(\ell \), which yields the cost \({\mathcal {O}}(\max \{2^{J-j},j\})\). Per target block matrix \({\varvec{C}}_{j,j'} = [c_{(j,k),(j',k')}]_{k,k'}\), we have \({\mathcal {O}}(2^{\max \{j,j'\}}) = {\mathcal {O}}(2^j)\) nonzero coefficients. Hence, the cost for computing the desired target block is \({\mathcal {O}}(2^j \max \{2^{J-j},j\})\). We shall next sum over j and \(j'\)

$$\begin{aligned} \sum _{j=0}^J \sum _{j'=0}^j {\mathcal {O}}(2^j\max \{2^{J-j},j\})&= \sum _{j=0}^J \sum _{j'=0}^j {\mathcal {O}}(\max \{N,j2^j\})\\&= \sum _{j=0}^J {\mathcal {O}}(j\max \{N,j 2^j\}) = {\mathcal {O}}(N\log ^2 N). \end{aligned}$$

\(\square \)
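As a dense illustration of the S-formatted multiplication, the entries (22) of the product can be computed only on a prescribed pattern, discarding everything off the pattern. The sketch below (Python with scipy.sparse; names are ours) is a direct, unoptimized realization; an efficient implementation exploits the level-wise structure as in the proof above.

```python
import numpy as np
import scipy.sparse as sp

def formatted_multiply(A, B, pattern):
    # entries of A @ B on the positions of the given sparsity pattern only
    A, B = sp.csr_matrix(A), sp.csc_matrix(B)
    P = sp.coo_matrix(pattern)
    data = np.array([A.getrow(i).dot(B.getcol(j)).toarray()[0, 0]
                     for i, j in zip(P.row, P.col)])
    return sp.coo_matrix((data, (P.row, P.col)), shape=P.shape).tocsr()

# usage (hypothetical matrices): restrict the product to the pattern of the first factor,
# C = formatted_multiply(K_A_eta, K_B_eta, abs(K_A_eta) > 0)
```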

4.2 Sparse selected inversion

Having addition and multiplication of kernel matrices at our disposal, we consider the matrix inversion next. To this end, observe that the inverse \({{\mathcal {A}}}^{-1}\) of a pseudodifferential operator \({{\mathcal {A}}}\) from a suitable algebra of pseudodifferential operators, provided that it exists, is again a pseudodifferential operator, see Appendix A. However, if \({{\mathcal {A}}}\) is a pseudodifferential operator of negative order, as in the present RKHS case, the operator \({{\mathcal {A}}}^{-1}\) is of positive order and hence gives rise to a singular kernel which does not satisfy condition (15). In the context of kernel matrices, however, we are rather interested in inverting regularized pseudodifferential operators, i.e., \({{\mathcal {A}}}+\mu {I}\), where \(I\) denotes the identity. For such operators, we have the following lemma.

Lemma 2

Let \({{\mathcal {A}}}\) be a pseudodifferential operator of order \(s \le 0\) with symmetric and positive semidefinite kernel function.

Then, for any \(\mu >0\), the inverse of \({{\mathcal {A}}}+\mu {I}\) can be decomposed into \(\frac{1}{\mu }{I}-{{\mathcal {B}}}\) with

$$\begin{aligned} {{\mathcal {B}}} = \frac{1}{\mu }({{\mathcal {A}}}+\mu {I})^{-1}{{\mathcal {A}}}. \end{aligned}$$
(23)

Especially, \({{\mathcal {B}}}\) is also a pseudodifferential operator of order s, which admits a symmetric and positive semidefinite kernel function.

Proof

In view of (23), we infer that

$$\begin{aligned} ({{\mathcal {A}}}+\mu {I})\bigg (\frac{1}{\mu }{I}-{{\mathcal {B}}}\bigg ) = \frac{1}{\mu }{{\mathcal {A}}} +I - ({{\mathcal {A}}}+\mu {I}){{\mathcal {B}}} = I + \frac{1}{\mu }{{\mathcal {A}}}- \frac{1}{\mu }{{\mathcal {A}}} = I. \end{aligned}$$

Therefore, \(\frac{1}{\mu }{I}-{{\mathcal {B}}}\) is the inverse operator to \({{\mathcal {A}}}+\mu {I}\). Since \({{\mathcal {A}}}+\mu {I}\) is of order 0, \(({{\mathcal {A}}} +\mu {I})^{-1}\) is of order 0, too, and thus \(({{\mathcal {A}}}+\mu {I})^{-1} {{\mathcal {A}}}\) is of the same order as \({{\mathcal {A}}}\). Finally, the symmetry and nonnegativity of \({\mathcal {B}}\) follows from the symmetry and nonnegativity of \({\mathcal {A}}\). \(\square \)

As a consequence of this lemma, the inverse \(({\varvec{K}}_{{\mathcal {A}}}+\mu {\varvec{I}})^{-1} \in {\mathbb {R}}^{N\times N}\) of the associated kernel matrix \({\varvec{K}}_{{\mathcal {A}}}+\mu {\varvec{I}} \in {\mathbb {R}}^{N\times N}\) is S-compressible with respect to the same compression pattern as \({\varvec{K}}_{{\mathcal {A}}}\). In [24], strong numerical evidence was presented that a sparse Cholesky factorization of a compressed kernel matrix can efficiently be computed by means of nested dissection, cf. [18, 37]. This suggests the computation of the inverse \(({\varvec{K}}_{{\mathcal {A}}}+\mu {\varvec{I}})^{-1}\) in samplet basis on the compression pattern of \({\varvec{K}}_{{\mathcal {A}}}\) by means of selected inversion [36] of a sparse matrix. The approach is outlined below.

Assume that \({\varvec{A}}\in {\mathbb {R}}^{N\times N}\) is symmetric and positive definite. The inversion algorithm consists of two steps. The first step factorizes the input matrix \(\varvec{A}\) into \(\varvec{A}=\varvec{LDL}^\intercal \). The factors \(\varvec{L}\) and \(\varvec{D}\) are used in the second step to compute the selected components of \(\varvec{A}^{-1}\). The first step will be referred to as factorization in the following and the second step as selected inversion. To explain the second step, let \({\varvec{A}}\) be partitioned according to

$$\begin{aligned} {\varvec{A}} = \begin{bmatrix} {\varvec{A}}_{11} &{} {\varvec{A}}_{12}\\ {\varvec{A}}_{12}^\intercal &{} {\varvec{A}}_{22} \end{bmatrix}. \end{aligned}$$

In particular, the diagonal blocks \({\varvec{A}}_{ii}\) are also symmetric and positive definite. The selected inversion is based on the identity

$$\begin{aligned} {\varvec{A}}^{-1} = \begin{bmatrix} {\varvec{A}}_{11}^{-1}+{\varvec{C}}{\varvec{S}}^{-1}{\varvec{C}}^\intercal &{} {\varvec{C}}{\varvec{S}}^{-1}\\ {\varvec{S}}^{-1}{\varvec{C}}^\intercal &{} {\varvec{S}}^{-1} \end{bmatrix} \end{aligned}$$

(24)

with the Schur complement \({\varvec{S}}\mathrel {\mathrel {\mathop :}=}{\varvec{A}}_{22}+{\varvec{A}}_{12}^\intercal {\varvec{C}}\), where \({\varvec{C}}\mathrel {\mathrel {\mathop :}=}-{\varvec{A}}_{11}^{-1}{\varvec{A}}_{12}\). For sparse matrices, this block algorithm can efficiently be realized based on the observation that for the computation of the entries of \({\varvec{A}}^{-1}\) on the pattern of \({\varvec{L}}\), only the entries on the pattern of \({\varvec{L}}\) are required, as is well known from the sparse matrix literature, cp. [15, 19, 36]. In particular, the pattern of \({\varvec{A}}\) is contained in the pattern of \({\varvec{L}}\).
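The following is a dense recursive sketch of the block identity (24) (Python/NumPy; names are ours): the inverse is assembled from \({\varvec{A}}_{11}^{-1}\), \({\varvec{C}}\), and the Schur complement \({\varvec{S}}\). In the sparse setting, only the entries on the pattern of \({\varvec{L}}\) would be formed, and all products would be carried out in S-format.

```python
import numpy as np

def block_inverse(A, min_size=64):
    # recursive inversion of a symmetric positive definite matrix via identity (24)
    n = A.shape[0]
    if n <= min_size:
        return np.linalg.inv(A)
    m = n // 2
    A11, A12, A22 = A[:m, :m], A[:m, m:], A[m:, m:]
    inv11 = block_inverse(A11, min_size)
    C = -inv11 @ A12                          # C = -A_11^{-1} A_12
    S = A22 + A12.T @ C                       # Schur complement
    invS = block_inverse(S, min_size)
    top_left = inv11 + C @ invS @ C.T
    top_right = C @ invS
    return np.block([[top_left, top_right], [top_right.T, invS]])
```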

4.3 Algorithmic aspects

A block selected inversion algorithm has at least two advantages: First, because \(\varvec{A}\) is sparse, blocks can be specified in terms of supernodes [36]. Second, this allows us to use level-3 BLAS to construct an efficient implementation by leveraging the memory hierarchy of current microprocessors. A supernode is a group of nodes with the same nonzero structure below the diagonal in their respective columns of the factor \(\varvec{L}\). The supernodal approach for sparse symmetric factorization represents the factor \(\varvec{L}\) as a set of supernodes, each of which consists of a contiguous set of columns of \(\varvec{L}\) with identical nonzero patterns, and each supernode is stored as a dense submatrix to take advantage of level-3 BLAS operations.

Given these considerations, it is natural to employ the selected inversion approach presented in [48] and available in [40] in order to directly compute the entries of the inverse matrix on the pattern. For the particular implementation of the selected inversion, we rely on Pardiso. For larger kernel matrices, whose number of nonzero entries can no longer be indexed by 32-bit integers, we combine the selected inversion with a divide-and-conquer approach based on the identity (24). The inversion of the \({\varvec{A}}_{11}\) block and of the Schur complement \({\varvec{S}}\) are performed with Pardiso (exploiting symmetry), while the remaining arithmetic operations, i.e., addition and multiplication, are performed in a formatted way, compare Theorem 5.

4.4 Matrix functions

Based on the S-formatted multiplication and inversion of operators represented in the samplet basis, certain holomorphic functions of an S-compressed operator also admit S-formatted approximations with essentially the same approximation accuracy.

To illustrate this, we recall the method in [23]. This approach employs the contour integral representation

$$\begin{aligned} f({\varvec{A}}) = \frac{1}{2\pi i} \int _\Gamma f(z) (z{\varvec{I}}-{\varvec{A}})^{-1}{\text {d}}\!z, \end{aligned}$$
(25)

where \(\Gamma \) is a closed contour contained in the region of analyticity of f and winding once around the spectrum \(\sigma ({\varvec{A}})\) in counterclockwise direction. As is well known, analytic functions f of elliptic, self-adjoint pseudodifferential operators yield again pseudodifferential operators in the same algebra, see, e.g., [46, Chap. XII.1]. Hence, \({\varvec{B}} \mathrel {\mathrel {\mathop :}=}f({\varvec{A}})\) is S-compressible provided that f is analytic. In particular, the S-compressed representation \(\big (f({\varvec{A}}^\eta )\big )^{\eta }\) satisfies

$$\begin{aligned} \begin{aligned} \big \Vert {\varvec{B}}^\Sigma -\big (f({\varvec{A}}^{\eta })\big )^{\eta }\big \Vert _F&\le \Vert {\varvec{B}}^\Sigma -{\varvec{B}}^{\eta }\Vert _F +\big \Vert \big (f({\varvec{A}}^\Sigma )-f({\varvec{A}}^{\eta })\big )^{\eta }\big \Vert _F\\&\le \varepsilon \Vert {\varvec{B}}\Vert _F + L\Vert {\varvec{A}}^\Sigma -{\varvec{A}}^{\eta }\Vert _F\\&\le \varepsilon \big (\Vert {\varvec{B}}\Vert _F+L\Vert {\varvec{A}}\Vert _F\big ). \end{aligned} \end{aligned}$$
(26)

Herein, \(L\) denotes the Lipschitz constant of the function \(f\). In other words, estimate (26) implies that the error of the S-formatted matrix function \(\big (f({\varvec{A}}^{\eta })\big )^{\eta }\) is rigorously controlled by the sum of the input error \(\Vert {\varvec{A}}^\Sigma -{\varvec{A}}^{\eta }\Vert _F\) and the compression error of the exact output \(\Vert {\varvec{B}}^\Sigma -{\varvec{B}}^{\eta }\Vert _F\). The latter is under control if the underlying pseudodifferential operator is of order \(s < -d\), since then the kernel is continuous and satisfies (15). In the other cases, some analysis is needed to control this error (see below).

For the numerical approximation of the contour integral (25), one has to apply an appropriate quadrature formula. As an example, we consider the matrix square root, i.e., \(f(z) = \sqrt{z}\) for \(\textrm{Re}\, z > 0\). This occurs, for example, in the efficient path simulation of Gaussian processes in spatial statistics. We apply here the approximation, see [23, Eq. (4.4) and the comments below it],

$$\begin{aligned} \begin{aligned} {\varvec{A}}^{-1/2}&\approx \frac{2 E \sqrt{{\underline{c}}}}{\pi K} \sum _{k=1}^K\frac{{\text {dn}} \left( t_k | 1-\varkappa _{\varvec{A}}\right) }{{\text {cn}}^2\left( t_k | 1 - \varkappa _{\varvec{A}}\right) } \left( {\varvec{A}} + w_k^2{\varvec{I}}\right) ^{-1},\\ {\varvec{A}}^{1/2}&= {\varvec{A}}\cdot {\varvec{A}}^{-1/2}. \end{aligned} \end{aligned}$$
(27)

Herein, \({\text {sn}}, {\text {cn}}\) and \({\text {dn}}\) are the Jacobian elliptic functions [2, Chapter 16], E is the complete elliptic integral of the first kind associated with the complementary parameter \(1-\varkappa _{\varvec{A}}\), where \(\varkappa _{\varvec{A}}:= {\underline{c}}/{\overline{c}}\) [2, Chapter 17], and, for \(k\in \{1,\ldots , K \}\),

$$\begin{aligned} w_k\mathrel {\mathrel {\mathop :}=}\sqrt{{\underline{c}}}\,\frac{ {\text {sn}}\left( t_k | 1 - \varkappa _{\varvec{A}}\right) }{{\text {cn}}\left( t_k | 1 - \varkappa _{\varvec{A}}\right) } \quad \text {and} \quad t_k\mathrel {\mathrel {\mathop :}=}\frac{E}{K}\big (k-\tfrac{1}{2}\big ). \end{aligned}$$

The quadrature approximation (27) of the contour integral (25) for the matrix square root is known to converge root-exponentially in the number K of quadrature nodes, see, e.g., [10, Lemma 3.4]. Hence, approximate representations with algebraic (with respect to N) consistency order can be achieved with \(K\sim |\log \varepsilon (\eta )|^2\), resulting in an overall log-linear complexity of the numerical realization of (27) in S-format. We also remark that the quadrature shifts \(w_k^2\) in the inversions occurring in (27) act as regularizing “nuggets” for a possibly ill-conditioned \({\varvec{A}}\). The input parameters \(0< {\underline{c}} < {\overline{c}}\) shall provide bounds on the spectrum of \({\varvec{A}}\), i.e., \({\underline{c}}\approx \lambda _{\min }({\varvec{A}})\) and \({\overline{c}}\approx \lambda _{\max }({\varvec{A}})\). Note that we also assume here that \({\varvec{A}}\) is symmetric and positive definite. Moreover, we should mention that, except for the quadrature error, (27) computes the inverse square root \(({\varvec{A}}^\eta )^{-1/2}\) of the compressed input \({\varvec{A}}^\eta \) exactly on the compression pattern when we use the selected inversion algorithm from Sect. 4.2.
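For illustration, here is a dense sketch of the quadrature (27) (Python/SciPy; all names are ours). We assume \({\varvec{A}}\) is symmetric positive definite with spectral bounds \({\underline{c}}\) and \({\overline{c}}\), and we evaluate E as the complete elliptic integral of the first kind at the parameter \(1-\varkappa _{\varvec{A}}\) via scipy.special.ellipk, which is the reading of (27) under which the nodes \(t_k\) cover the full quadrature interval.

```python
import numpy as np
from scipy.special import ellipj, ellipk

def inv_sqrt_quadrature(A, c_lo, c_hi, K=15):
    # approximate A^{-1/2} by the elliptic-function quadrature (27) with K nodes
    n = A.shape[0]
    m = 1.0 - c_lo / c_hi                      # elliptic parameter 1 - kappa_A
    E = ellipk(m)                              # quadrature interval length
    t = (np.arange(1, K + 1) - 0.5) * E / K    # midpoint nodes t_k
    sn, cn, dn, _ = ellipj(t, m)
    w = np.sqrt(c_lo) * sn / cn                # quadrature shifts w_k
    X = np.zeros_like(A, dtype=float)
    for wk, cnk, dnk in zip(w, cn, dn):
        X += dnk / cnk**2 * np.linalg.inv(A + wk**2 * np.eye(n))
    return 2.0 * E * np.sqrt(c_lo) / (np.pi * K) * X

# A^{1/2} is then recovered as A @ inv_sqrt_quadrature(A, c_lo, c_hi)
```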

That \(({\varvec{A}}^\eta )^{-1/2}\) is indeed S-compressible is a consequence of the following lemma.

Lemma 3

Let \({{\mathcal {A}}}\) be a pseudodifferential operator of order \(s \le 0\) with symmetric and positive semidefinite kernel function. Then, for any \(\mu >0\), the inverse square root of \({{\mathcal {A}}}+\mu {I}\) can be written as \(\frac{1}{\sqrt{\mu }}{I}-{{\mathcal {B}}}\) with \({{\mathcal {B}}}\) being also a pseudodifferential operator of order s, which admits a symmetric and positive semidefinite kernel function.

Proof

Straightforward calculation shows that the ansatz

$$\begin{aligned} ({\mathcal {A}}+\mu I)^{-1/2} = \frac{1}{\sqrt{\mu }} I-{\mathcal {B}}\end{aligned}$$
(28)

is equivalent to

$$\begin{aligned} ({\mathcal {A}}+\mu I)\bigg (\frac{1}{\mu } I-\frac{2}{\sqrt{\mu }}{\mathcal {B}}+{\mathcal {B}}^2\bigg ) = I. \end{aligned}$$

Thus,

$$\begin{aligned} {\mathcal {B}}\bigg (\frac{2}{\sqrt{\mu }}I-{\mathcal {B}}\bigg ) = \frac{1}{\mu }({\mathcal {A}}+\mu I)^{-1}{\mathcal {A}}, \end{aligned}$$

which in view of (28) is equivalent to

$$\begin{aligned} {\mathcal {B}}\bigg (\frac{1}{\sqrt{\mu }}I+({\mathcal {A}}+\mu I)^{-1/2}\bigg ) = \frac{1}{\mu }{\mathcal {A}}({\mathcal {A}}+\mu I)^{-1}. \end{aligned}$$

As both \(\frac{1}{\sqrt{\mu }}I+({\mathcal {A}}+\mu I)^{-1/2}\) and \(({\mathcal {A}}+\mu I)^{-1}\) are pseudodifferential operators of order 0, \({\mathcal {B}}\) must have the same order as \({\mathcal {A}}\). \(\square \)

An alternative to the contour integral for computing the matrix exponential of a (possibly singular) matrix \({\varvec{A}}\) is given by the direct evaluation of the power series

$$\begin{aligned} \exp ({\varvec{A}})=\sum _{k=0}^\infty \frac{1}{k!}{\varvec{A}}^k. \end{aligned}$$

As we show in the numerical results, this series converges very fast for the matrices presently under consideration which stem from reproducing kernels, since they correspond to compact operators.
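A sketch of the truncated power series follows (Python/NumPy; names are ours); in the S-formatted setting, each product would additionally be re-truncated to the prescribed compression pattern.

```python
import numpy as np

def expm_series(A, tol=1e-14, max_terms=100):
    # exp(A) via the truncated power series sum_k A^k / k!
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, max_terms):
        term = term @ A / k                    # in S-format: re-truncate to the pattern here
        result = result + term
        if np.linalg.norm(term, 'fro') <= tol * np.linalg.norm(result, 'fro'):
            break
    return result
```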

5 Numerical results

The computations in this section have been performed on a single node with two Intel Xeon E5-2650 v3 @2.30GHz CPUs and up to 512GB of main memory. To achieve consistent timings, all computations have been carried out using 16 cores. The samplet compression is implemented in C++11 and relies on the Eigen template library for linear algebra operations. Moreover, the selected inversion is performed by Pardiso. Throughout this section, we employ samplets with \(q+1=4\) vanishing moments. The parameter for the admissibility condition (17) is set to \(\eta =1.25\). Together with the a priori pattern, which is obtained by neglecting admissible blocks, we also consider an a posteriori compression by setting all matrix entries smaller in modulus than \(\tau =10^{-5}/N\) to zero, resulting in the a posteriori pattern. In view of (18), there is a tradeoff between the number \(q+1\) of vanishing moments and \(\eta \): increasing either results in higher accuracy, but also in more densely populated matrices. The chosen setting results in compression errors of about \(10^{-5}\) for all shown examples. For a comprehensive study of the compression errors, we refer to [25].

5.1 S-formatted matrix multiplication

To benchmark the multiplication, we consider uniformly distributed random points on the unit hypercube \([0,1]^d\). As kernel, we consider exclusively the exponential kernel (which is the Matérn kernel with smoothness parameter \(\nu =1/2\) and correlation length \(\ell =1\))

$$\begin{aligned} \kappa ({\varvec{x}},{\varvec{y}})=\frac{1}{N}e^{-\Vert {\varvec{x}}-{\varvec{y}}\Vert _2}. \end{aligned}$$
(29)

Note that we impose the scaling 1/N of the kernel function in order to keep the largest eigenvalue of the kernel matrix bounded, as its trace then stays uniformly bounded. We would like to stress that the present approach also works for smoother kernels than (29). However, in this case, single-block low-rank approximation techniques are competitive, too, see [4, 5, 12, 26].

We compute the matrix product \({\varvec{K}}^\eta \cdot \tilde{\varvec{K}}^\eta \), where \(\tilde{\varvec{K}}^\eta \) is obtained from \({\varvec{K}}^\eta \) by perturbing each nonzero entry by 10% relative noise, uniformly distributed in \([0,1]\). This way, we rule out symmetry effects, as \(\tilde{\varvec{K}}^\eta \) will not be symmetric in general.

Fig. 2 (S-formatted matrix multiplication) Computation times for matrix multiplication (left) and multiplication errors (right)

To measure the multiplication error, we consider the estimator

$$\begin{aligned} e_F({\varvec{A}})\mathrel {\mathrel {\mathop :}=}\frac{\Vert {\varvec{A}}{\varvec{U}}\Vert _F}{\Vert {\varvec{U}}\Vert _F}, \end{aligned}$$

where \({\varvec{U}}\in {\mathbb {R}}^{N\times 10}\) is a random matrix with uniformly distributed independent entries. The left-hand side of Fig. 2 shows the computation time for a single multiplication. The dashed lines correspond to the asymptotic rates \({\mathcal {O}}(N\log ^\alpha N)\) for \(\alpha =0,1,2,3\). It can be seen that the multiplication time for \(d=2\) perfectly reflects the expected essentially linear behavior. Though the graph is steeper for \(d=3\), we expect it to flatten further for larger \(N\). The right-hand side of the figure shows the multiplication error \(e_F({\varvec{K}}^\eta \cdot \tilde{\varvec{K}}^\eta - {\varvec{K}}^\eta \boxdot \tilde{\varvec{K}}^\eta )\), where the formatted multiplication \(\boxdot \) is performed on the a posteriori pattern. Taking into account that the compression errors for \({\varvec{K}}^\eta \) are approximately \(5.6\cdot 10^{-6}\) for \(d=2\) and \(1.6\cdot 10^{-5}\) for \(d=3\), the obtained matrix product can be considered very accurate.
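A minimal sketch of this estimator for a dense error matrix, using Eigen's uniform random matrices (entries in \([-1,1]\)), may look as follows; in the benchmark, the difference \({\varvec{K}}^\eta \cdot \tilde{\varvec{K}}^\eta -{\varvec{K}}^\eta \boxdot \tilde{\varvec{K}}^\eta \) is of course applied to \({\varvec{U}}\) as a sparse operator rather than formed explicitly.

```cpp
#include <Eigen/Dense>

// e_F(A) = ||A U||_F / ||U||_F with a random test matrix U of 10 columns.
double errorEstimatorF(const Eigen::MatrixXd& A) {
  Eigen::MatrixXd U = Eigen::MatrixXd::Random(A.cols(), 10);
  return (A * U).norm() / U.norm();  // .norm() is the Frobenius norm
}
```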

5.2 S-formatted matrix inversion

In order to assess the numerical performance of the matrix inversion, we again consider uniformly distributed random points on the unit hypercube \([0,1]^d\). Since the separation radius \(q_X\) ranges between \(4.7\cdot 10^{-5}\) (\(N=5000\)) and \(2.8\cdot 10^{-7}\) (\(N=1\, 000\, 000\)) for \(d=2\), and between \(3.8\cdot 10^{-4}\) (\(N=5000\)) and \(3.2\cdot 10^{-5}\) (\(N=1\, 000\, 000\)) for \(d=3\), we do not expect \({\varvec{K}}^\eta \) to be numerically invertible. We therefore consider the regularized version \({\varvec{K}}^\eta +\mu {\varvec{I}}\) for a ridge parameter \(\mu >0\).

As our theoretical results suggest that the inverse has the same a priori pattern as the matrix itself, we first consider the inversion on the a priori pattern for \(d=2\).

Fig. 3 (S-formatted matrix inversion, \(d=2\), a priori pattern) Left panel: computation times for compressed matrix assembly and selected inversion on the a priori pattern. Dashed lines indicate linear (\(\alpha =1\)) and super-linear (\(\alpha =1.5\)) scaling, respectively. Right panel: inversion errors for ridge parameters \(\mu =10^{-6},10^{-4},10^{-2}\)

The left-hand side of Fig. 3 shows the computation times for the inverse matrix employing Pardiso. The dashed lines show the asymptotic rates \({\mathcal {O}}(N^\alpha )\) for \(\alpha =1,1.5\). For \(N=1\,000\,000\), due to the large number of nonzero entries, we use the block inversion with one subdivision, which explains the bump in the computation time caused by the formatted matrix multiplications. Apart from this, Pardiso perfectly exhibits the expected rate of \(N^{1.5}\). The right-hand side of the figure shows the error of the selected inversion on the pattern of \({\varvec{K}}^\eta \) for the ridge parameters \(\mu =10^{-6},10^{-4},10^{-2}\). The choice of the ridge parameters is exemplary; it starts at about the compression error and spans four orders of magnitude. The error decreases significantly with increasing ridge parameter, since the matrix \({\varvec{K}}^\eta +\mu {\varvec{I}}\) becomes spectrally closer to a multiple of the identity matrix.

As the a posteriori pattern typically exhibits significantly fewer nonzero entries than the a priori pattern, we also investigate the inversion on the a posteriori pattern. The corresponding results are shown in Fig. 4.

Fig. 4 (S-formatted matrix inversion, \(d=2\), a posteriori pattern) Left panel: computation times for compressed matrix assembly and selected inversion on the a posteriori pattern. Dashed lines indicate linear (\(\alpha =1\)) and super-linear (\(\alpha =1.5\)) scaling, respectively. Right panel: inversion errors for ridge parameters \(\mu =10^{-6},10^{-4},10^{-2}\)

As can be seen on the left-hand side of the figure, the selected inversion now even exhibits linear behavior, which is explained by the fixed threshold \(\tau \) resulting in successively fewer nonzero entries for increasing \(N\). On the other hand, the errors for the different ridge parameters, depicted on the right-hand side of the same figure, asymptotically exhibit the same behavior as in the a priori case.

Fig. 5 (S-formatted matrix inversion, \(d=3\), a posteriori pattern) Left panel: computation times for compressed matrix assembly and selected inversion on the a posteriori pattern. Dashed lines indicate linear (\(\alpha =1\)) and quadratic (\(\alpha =2\)) scaling, respectively. Right panel: inversion errors for ridge parameters \(\mu =10^{-6},10^{-4},10^{-2}\)

Motivated by the results for \(d=2\), we consider only the inversion on the a posteriori pattern for \(d=3\). The corresponding results are shown in Fig. 5. The left-hand side of the figure again shows the computation times. The dashed lines show the asymptotic rates \({\mathcal {O}}(N^\alpha )\) for \(\alpha =1,2\). Up to \(N=100\,000\), the expected quadratic rate is matched perfectly. Due to the large number of nonzero entries in the case \(d=3\), we have employed the block inversion with three recursion steps for \(N>100\,000\), which explains the nearly linear behavior of the corresponding values in the graph. The errors depicted on the right-hand side show a behavior similar to the case \(d=2\), with a slightly slower decay.
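For reference, recursive block inversion schemes of this kind typically rest on the standard \(2\times 2\) block-inverse identity involving the Schur complement; the formatted variant used in the experiments additionally performs all intermediate products as formatted multiplications on the compression pattern:

$$\begin{aligned} \begin{pmatrix} {\varvec{M}}_{11} &{} {\varvec{M}}_{12}\\ {\varvec{M}}_{21} &{} {\varvec{M}}_{22} \end{pmatrix}^{-1} = \begin{pmatrix} {\varvec{M}}_{11}^{-1}+{\varvec{M}}_{11}^{-1}{\varvec{M}}_{12}{\varvec{S}}^{-1}{\varvec{M}}_{21}{\varvec{M}}_{11}^{-1} &{} -{\varvec{M}}_{11}^{-1}{\varvec{M}}_{12}{\varvec{S}}^{-1}\\ -{\varvec{S}}^{-1}{\varvec{M}}_{21}{\varvec{M}}_{11}^{-1} &{} {\varvec{S}}^{-1} \end{pmatrix},\qquad {\varvec{S}}\mathrel {\mathrel {\mathop :}=}{\varvec{M}}_{22}-{\varvec{M}}_{21}{\varvec{M}}_{11}^{-1}{\varvec{M}}_{12}, \end{aligned}$$

provided \({\varvec{M}}_{11}\) and \({\varvec{S}}\) are invertible. Each recursion step thus replaces one large inverse by two smaller ones, \({\varvec{M}}_{11}^{-1}\) and \({\varvec{S}}^{-1}\), plus a fixed number of formatted matrix multiplications.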

Fig. 6 Data points from a 3D scan of the head of Michelangelo’s David. The scan is provided by the Statens Museum for Kunst under the Creative Commons CC0 license

5.3 S-formatted matrix functions

We compute the matrix square root \({\varvec{A}}^{1/2}\) and the matrix exponential \(\exp ({\varvec{A}})\) for the exponential kernel, now with correlation length \(\ell =1/2\),

$$\begin{aligned} \kappa ({\varvec{x}},{\varvec{y}})=\frac{1}{N}e^{-2\Vert {\varvec{x}}-{\varvec{y}}\Vert _2}. \end{aligned}$$

This time, the data points are randomly subsampled from a 3D scan of the head of Michelangelo’s David, cp. Fig. 6. The bounding box of the point cloud is \([-0.52,0.42]\times [-0.47,0.46]\times [-0.18,0.78]\). All other parameters are set as in the previous examples. Moreover, we set the ridge parameter to \(\mu =10^{-4}\), corresponding to the middle value from Sect. 5.2. The smallest eigenvalue of the regularized matrix is bounded from below by the ridge parameter, while the largest eigenvalue is bounded from above by 1. For the contour integral method for the computation of the matrix square root, we found that the error stagnates for \(K\ge 7\) quadrature points. The corresponding errors for different values of \(N\) are tabulated in Table 1.

Table 1 Errors for the contour integral method for \(({\varvec{K}}^\eta +\mu {\varvec{I}})^{1/2}\)
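For orientation, the following dense sketch illustrates the quadrature-of-shifted-inverses structure underlying such methods, using the elementary integral representation \({\varvec{A}}^{1/2} = \frac{2}{\pi }{\varvec{A}}\int _0^{\pi /2}\big (\sin ^2(t)\,{\varvec{I}}+\cos ^2(t)\,{\varvec{A}}\big )^{-1}\,\textrm{d}t\) for symmetric positive definite \({\varvec{A}}\) and a composite midpoint rule. This is not the contour quadrature used in the experiments, which typically converges considerably faster in the number of quadrature points; the node count below is purely illustrative.

```cpp
#include <Eigen/Dense>
#include <cmath>

// Dense, illustrative sketch of a matrix square root via the integral
// representation A^{1/2} = (2/pi) A \int_0^{pi/2} (sin^2 t I + cos^2 t A)^{-1} dt
// for symmetric positive definite A, discretized by the composite midpoint
// rule with K nodes. Each node requires one shifted solve, just as each node
// of a contour quadrature requires one (selected) inversion.
Eigen::MatrixXd sqrtmQuadrature(const Eigen::MatrixXd& A, int K = 32) {
  const int n = static_cast<int>(A.rows());
  const double pi = std::acos(-1.0);
  const double h = pi / (2.0 * K);
  const Eigen::MatrixXd I = Eigen::MatrixXd::Identity(n, n);
  Eigen::MatrixXd S = Eigen::MatrixXd::Zero(n, n);
  for (int k = 0; k < K; ++k) {
    const double t = (k + 0.5) * h;
    const double s2 = std::sin(t) * std::sin(t);
    const double c2 = std::cos(t) * std::cos(t);
    S += h * (s2 * I + c2 * A).ldlt().solve(I);  // shifted inverse at node t
  }
  return (2.0 / pi) * (A * S);
}
```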

Finally, Table 2 shows the approximation error of the matrix exponential for different values of \(N\). The reference matrix exponential is computed by a power series of length \(30\) applied directly to the matrix \({\varvec{X}}\). Here, we found that the error starts to stagnate for more than 8 terms in the expansion. The largest eigenvalue satisfies \(\Vert {\varvec{K}}^\eta \Vert _2\approx 0.337\) (estimated by a Rayleigh quotient iteration with 50 iterations), which explains the rapid convergence. Note that no regularization is required here, as only matrix products are computed.

Table 2 Errors for the approximation of \(\exp ({\varvec{K}}^\eta )\) by the power series of the exponential

5.4 Gaussian process implicit surfaces

We consider Gaussian process learning of implicit surfaces. In accordance with [50], we describe a closed surface \(S=\partial \Omega \) of dimension \(d-1\) as the 0-level set of the function

$$\begin{aligned} f:{\mathbb {R}}^d\rightarrow {\mathbb {R}},\quad f({\varvec{x}}){\left\{ \begin{array}{ll}=0,&{} {\varvec{x}}\in S,\\ >0, &{}{\varvec{x}}\in \Omega ,\\ <0,&{}{\varvec{x}}\in {\mathbb {R}}^d\setminus {\overline{\Omega }}, \end{array}\right. } \end{aligned}$$

i.e.,

$$\begin{aligned} S=\{{\varvec{x}}\in {\mathbb {R}}^d:f({\varvec{x}})=0\}. \end{aligned}$$

For the function \(f\), we impose a Gaussian process model with covariance function given by the exponential kernel

$$\begin{aligned} \kappa ({\varvec{x}},{\varvec{y}})=\frac{1}{N}e^{-6\Vert {\varvec{x}}-{\varvec{y}}\Vert _2} \end{aligned}$$

and prior mean zero. Then, given the data sites \(X\) of size \(N\mathrel {\mathrel {\mathop :}=}|X|\) and the noisy measurements \({\varvec{y}}=f(X)+{\varvec{\varepsilon }}\), where \({\varvec{\varepsilon }}\sim {\mathcal {N}}({\varvec{0}},\mu {\varvec{I}})\), the posterior distribution at the evaluation points \(Z\subset {\mathbb {R}}^3\) is determined by

$$\begin{aligned} {\mathbb {E}}[f(Z)|X,{\varvec{y}}]&={\varvec{K}}_{ZX}({\varvec{K}}_{XX}+\mu {\varvec{I}})^{-1}{\varvec{y}},\\ {\text {Cov}}[f(Z)|X,{\varvec{y}}]&= {\varvec{K}}_{ZZ}-{\varvec{K}}_{ZX}({\varvec{K}}_{XX}+\mu {\varvec{I}})^{-1}{\varvec{K}}_{ZX}^\intercal . \end{aligned}$$

Herein, setting \(M\mathrel {\mathrel {\mathop :}=}|Z|\), we have \({\varvec{K}}_{XX}=[\kappa (X,X)] \in {\mathbb {R}}^{N\times N}\), \({\varvec{K}}_{ZX}=[\kappa (Z,X)]\in {\mathbb {R}}^{M\times N}\), \({\varvec{K}}_{ZZ}=[\kappa (Z,Z)]\in {\mathbb {R}}^{M\times M}\).
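A dense reference implementation of these two formulas, illustrative only and with hypothetical names, could read as follows; the actual computation replaces the dense factorization by the S-formatted selected inversion of \({\varvec{K}}_{XX}+\mu {\varvec{I}}\) and keeps all matrices in samplet coordinates.

```cpp
#include <Eigen/Dense>

// Dense reference sketch of the posterior mean and covariance (illustrative).
struct Posterior {
  Eigen::VectorXd mean;        // E[f(Z) | X, y],   size M
  Eigen::MatrixXd covariance;  // Cov[f(Z) | X, y], size M x M
};

Posterior gpPosterior(const Eigen::MatrixXd& Kxx, const Eigen::MatrixXd& Kzx,
                      const Eigen::MatrixXd& Kzz, const Eigen::VectorXd& y,
                      double mu) {
  const int N = static_cast<int>(Kxx.rows());
  const Eigen::LDLT<Eigen::MatrixXd> ldlt(
      Kxx + mu * Eigen::MatrixXd::Identity(N, N));
  Posterior p;
  p.mean = Kzx * ldlt.solve(y);
  p.covariance = Kzz - Kzx * ldlt.solve(Kzx.transpose());
  return p;
}
```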

The matrix \({\varvec{K}}_{ZX}\) can efficiently be computed by using one samplet tree for \(Z\) and a second samplet tree for \(X\), while \(({\varvec{K}}_{XX}+\mu {\varvec{I}})^{-1}\) can be computed as in the previous examples. Hence, the computation of the posterior mean \({\mathbb {E}}[f(Z)|X,{\varvec{y}}]\) is straightforward. For \(X\), we use samplets with \(q+1=4\) vanishing moments, while samplets with \(q+1=3\) vanishing moments are applied for \(Z\). Moreover, we use an a posteriori threshold of \(\tau =10^{-4}/N\) for \({\varvec{K}}_{ZX}^{\eta }\).

Fig. 7 (Gaussian process implicit surfaces) Left panel: data points for the surface reconstruction. Red corresponds to a value of 1, green to a value of 0, and blue to a value of \(-1\). Middle panel: 0-level set of the posterior expectation evaluated at a regular grid. Right panel: standard deviation of the reconstruction (blue is small, red is large). The figure is reproduced in color in the digital version only

Similarly, we can evaluate the covariance in samplet coordinates. The evaluation of the standard deviation \(\sqrt{{\text {diag}}({\text {Cov}}[f(Z)|X,{\varvec{y}}])}\) requires more care: here, we transform \({\varvec{K}}_{ZX}\) only with respect to the points in \(X\) and evaluate the diagonal directly, resulting in a computational cost of \({\mathcal {O}}\big (MN\log N\big )\).
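The following dense sketch (again illustrative; names hypothetical) shows the diagonal-only evaluation: instead of forming the full \(M\times M\) covariance, only \([{\varvec{K}}_{ZZ}]_{ii}-{\varvec{k}}_i^\intercal ({\varvec{K}}_{XX}+\mu {\varvec{I}})^{-1}{\varvec{k}}_i\), with \({\varvec{k}}_i\) the \(i\)-th row of \({\varvec{K}}_{ZX}\), is accumulated. The stated \({\mathcal {O}}(MN\log N)\) cost refers to the samplet-compressed realization, not to this dense sketch.

```cpp
#include <Eigen/Dense>

// Diagonal-only posterior standard deviation (dense, illustrative sketch).
Eigen::VectorXd posteriorStdDev(const Eigen::MatrixXd& Kxx,
                                const Eigen::MatrixXd& Kzx,
                                const Eigen::VectorXd& KzzDiag, double mu) {
  const int N = static_cast<int>(Kxx.rows());
  const Eigen::LDLT<Eigen::MatrixXd> ldlt(
      Kxx + mu * Eigen::MatrixXd::Identity(N, N));
  Eigen::VectorXd var(Kzx.rows());
  for (Eigen::Index i = 0; i < Kzx.rows(); ++i) {
    const Eigen::VectorXd ki = Kzx.row(i).transpose();
    var(i) = KzzDiag(i) - ki.dot(ldlt.solve(ki));
  }
  return var.cwiseMax(0.0).cwiseSqrt();  // clamp round-off negatives
}
```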

The left panel of Fig. 7 shows the initial setup: 240 data points with value \(-1\) are located on a sphere within the point cloud, \(15\,507\) points with value 0 are located on its surface, and 1200 points with value 1 are located on a box enclosing it. This results in \(N=16\,947\) data points in total. The ridge parameter was set to \(\mu =2\cdot 10^{-5}\). The conditional expectation and the standard deviation have been computed on a regular grid with \(M=8\,000\,000\) points. The middle panel of Fig. 7 shows the 0-level set, while the right panel shows the standard deviation. As expected, the standard deviation is lowest close to the data sites (blue is small, red is large).

6 Conclusion

We have presented a sparse matrix algebra for kernel matrices in samplet coordinates. This algebra allows for the rapid addition, multiplication, and inversion of (regularized) kernel matrices, and its operations mimic the algebras of the corresponding pseudodifferential operators. The proposed arithmetic operations extend to S-formatted, approximate representations of holomorphic functions of S-formatted approximations of self-adjoint operators, which are likewise realized at log-linear cost. While the addition is straightforward, we have derived an error and cost analysis for the multiplication and for the approximate evaluation of holomorphic operator functions, again at log-linear cost. The S-formatted approximate inversion is realized by selected inversion for sparse matrices, which also enables the computation of general matrix functions by the contour integral approach. The numerical benchmarks corroborate the theoretical findings for data sets in two and three dimensions. As a relevant example from computer graphics, we have considered Gaussian process learning for the computation of a signed distance function from scattered data.

We expect the presently developed fast kernel matrix algebra to impact various areas in machine learning and statistics, where kernel-based approximations appear, see, e.g., [8, 34] and the references there.