Multiresolution kernel matrix algebra

We propose a sparse algebra for samplet compressed kernel matrices to enable efficient scattered data analysis. We show that the compression of kernel matrices by means of samplets produces optimally sparse matrices in a certain S-format. For kernels of finite differentiability, the compression, as well as the addition and multiplication of S-formatted matrices, can be performed in cost and memory that scale essentially linearly with the matrix size $N$. We prove and exploit the fact that the inverse of a kernel matrix, if it exists, is compressible in the S-format as well. Selected inversion allows one to directly compute the entries in the corresponding sparsity pattern. The S-formatted matrix operations enable the efficient, approximate computation of more complicated matrix functions such as ${\bm A}^\alpha$ or $\exp({\bm A})$. The matrix algebra is justified mathematically by pseudodifferential calculus. As an application, we consider efficient Gaussian process learning algorithms for spatial statistics. Numerical results are presented to illustrate and quantify our findings.


Introduction
The concept of samplets has been introduced in [24] by abstracting the wavelet construction from [49] to general discrete data sets in Euclidean space. A samplet basis is a multiresolution analysis of discrete signed measures, where stability is entailed by the orthogonality of the basis. Samplets are data-centric and can be constructed such that their measure integrals vanish for polynomials up to a predefined degree. Thanks to this vanishing moment property in ambient space, kernel matrices, as they arise in scattered data approximation, become quasi-sparse in the samplet basis. This means that these kernel matrices are compressible in samplet coordinates, S-compressible for short, and can be replaced by sparse matrices. We call the resulting sparsity pattern the compression pattern. The latter has been characterized in [24, Section 5.3]. Given a quasi-uniform data set of cardinality $N$, i.e., the distance between neighboring points is uniformly bounded from below and above by a multiple of $N^{-1/d}$, with $d \geq 1$ being the spatial dimension of the data, the S-compressed kernel matrix contains only $O(N \log N)$ relevant entries, for kernels of possibly low regularity.
In this article, we develop fast arithmetic operations for S-compressed kernel matrices. By fixing the sparsity pattern, we can perform addition and multiplication of kernel matrices with high precision in essentially linear cost. The derived cost bounds assume quasi-uniformity of the data points. Even so, all algorithms can still be applied if the quasi-uniformity assumption does not hold; in this case, however, the established cost bounds may become invalid. Similar approaches for realizing arithmetic operations of nonlocal operators exist by means of hierarchical matrices, see [10,12,16,20,21], and by means of wavelets, see [4,6,46].
We prove that the inverses of (regularized) kernel matrices are compressible with respect to the original compression pattern. We can thus employ the selected inversion algorithm proposed in [37] to efficiently approximate the inverse. Our concrete implementation is based on a supernodal left-looking $LDL^\top$-factorization of the underlying matrix, which is available in the sparse direct solver Pardiso, see [43]. The selected inversion computes (in the absence of rounding) the exact matrix inverse of the S-compressed matrix on its matrix pattern. Likewise, matrix addition and matrix multiplication are performed exactly on the prescribed compression pattern. This means that the relevant matrix coefficients are computed exactly when adding, multiplying, and inverting S-compressed kernel matrices. The only error introduced is the matrix compression error arising from the restriction to the compression pattern.
Having a fast formatted matrix addition and fast matrix inversion at hand enables the fast approximate evaluation of holomorphic operator functions via contour integrals in order to derive more complicated matrix functions. This has been envisioned in [6] ("We conjecture and provide numerical evidence that functions of operators inherit this property") and suggested in [22]. In the present paper we prove, using the samplet algebra, that, up to (exponentially small) contour quadrature errors, these contour integrals are computed exactly on the prescribed pattern. This is in contrast to previously proposed methods. In addition, many applications particularly require the computation of a subset of the elements of a given matrix inverse. Important examples are sparse inverse covariance matrix estimation in $\ell_1$-regularized Gaussian maximum likelihood estimation, see [9,29], or integrated nested Laplace approximations for approximate Bayesian inference, cp. [53]. Other examples of computing a subset of the inverse are electronic structure calculations of materials utilizing multipole expansions, where the diagonal and occasionally subdiagonals of the discrete Green's function are required to determine the electron density [35,36]. We provide a rigorous theoretical underpinning of the algorithms under consideration by means of pseudodifferential calculus [28,50]. To this end, we focus on kernels of reproducing kernel Hilbert spaces and assume that the associated integral operators correspond, via the Schwartz kernel theorem, to classical, elliptic pseudodifferential operators from the Hörmander class $S^m_{1,0}$, cp. [28]. A prominent example of such kernels is the Matérn class of kernels, see [40], also called Sobolev splines [15]. The latter are known to generate the Sobolev spaces of positive order, and correspond to fractional powers of the shifted Laplacian. We prove that such pseudodifferential operators are compressible in samplet coordinates, meaning that for their numerical representation, only the coefficients in the associated compression pattern need to be computed. Admissible classes comprise in particular the smooth Hörmander class $S^m_{1,0}$, but also considerably larger kernel classes of finite smoothness, which admit Calderón-Zygmund estimates and an appropriate operator calculus, see, e.g., [1,51]. The corresponding operator calculus implies that sums, concatenations, powers, and holomorphic functions of self-adjoint, elliptic pseudodifferential operators yield again pseudodifferential operators. As a consequence, the corresponding operations on kernel matrices in samplet coordinates result again in compressible matrices.
The rest of this article is structured as follows. In Section 2, we briefly introduce the scattered data framework under consideration and recall the relevant theory for reproducing kernel Hilbert spaces. The construction of samplets and the samplet matrix compression from [24] are summarized in Section 3. The main contribution of this article is Section 4, where we develop and analyze arithmetic operations for compressed kernel matrices in samplet coordinates. In Section 5, we perform numerical experiments in order to qualify and quantify the matrix algebra. Beyond benchmarking experiments, we consider here the computation of an implicit surface from scattered data using Gaussian process learning. Finally, the required details from the theory of pseudodifferential operators, especially the associated calculus, are collected in Appendix A.
Throughout this article, in order to avoid the repeated use of generic but unspecified constants, by $C \lesssim D$ we indicate that $C$ can be bounded by a multiple of $D$, independently of parameters which $C$ and $D$ may depend on. Moreover, $C \gtrsim D$ is defined as $D \lesssim C$, and $C \sim D$ as $C \lesssim D$ and $D \lesssim C$.

Reproducing kernel Hilbert spaces
Let $(\mathcal{H}, \langle\cdot,\cdot\rangle_{\mathcal{H}})$ be a Hilbert space of functions $h\colon \Omega \to \mathbb{R}$ with dual space $\mathcal{H}'$. Herein, $\Omega \subset \mathbb{R}^d$ is a given bounded domain or a lower-dimensional manifold. Furthermore, let $\kappa$ be a symmetric and positive definite (SPD) kernel, i.e., $[\kappa(x_i, x_j)]_{i,j=1}^N$ is a symmetric and positive semi-definite matrix for each $N \in \mathbb{N}$ and any point selection $x_1, \ldots, x_N \in \Omega$. We recall that $\kappa$ is the reproducing kernel for $\mathcal{H}$, iff $\kappa(x, \cdot) \in \mathcal{H}$ for every $x \in \Omega$ and $h(x) = \langle \kappa(x, \cdot), h \rangle_{\mathcal{H}}$ for every $h \in \mathcal{H}$. In this case, we call $(\mathcal{H}, \langle\cdot,\cdot\rangle_{\mathcal{H}})$ a reproducing kernel Hilbert space (RKHS).
Let $X := \{x_1, \ldots, x_N\} \subset \Omega$ denote a set of $N$ mutually distinct points. With respect to the set $X$, we introduce the subspace
$$\mathcal{H}_X := \operatorname{span}\{\kappa(x_1,\cdot), \ldots, \kappa(x_N,\cdot)\} \subset \mathcal{H}. \tag{1}$$
Corresponding to $\mathcal{H}_X$, we consider the subspace $\mathcal{X} := \operatorname{span}\{\delta_{x_1}, \ldots, \delta_{x_N}\} \subset \mathcal{H}'$, which is spanned by the Dirac measures supported at the points of $X$. For a continuous function $f \in C(\Omega)$, we use the notation $(f, \delta_{x_i})_\Omega := f(x_i)$. As the kernel $\kappa(x,\cdot)$ is the Riesz representer of the point evaluation $(\cdot, \delta_x)_\Omega$, we particularly have $\langle \kappa(x,\cdot), h\rangle_{\mathcal{H}} = (h, \delta_x)_\Omega$ for every $h \in \mathcal{H}$. Thus, the space $\mathcal{X}$ is isometrically isomorphic to the subspace $\mathcal{H}_X$ from (1), and we identify the two. Later on, we endow $\mathcal{X}$ with the inner product
$$\langle \hat{u}, \hat{v} \rangle_{\mathcal{X}} := \boldsymbol{u}^\top \boldsymbol{v} \quad \text{for } \hat{u} = \sum_{i=1}^N u_i \delta_{x_i}, \ \hat{v} = \sum_{i=1}^N v_i \delta_{x_i}. \tag{2}$$
This inner product is different from the restriction of the canonical one in $\mathcal{H}'$ to $\mathcal{H}_X$. The latter is given by $\langle \hat{u}, \hat{v} \rangle_{\mathcal{H}'} = \boldsymbol{u}^\top \boldsymbol{K} \boldsymbol{v}$ with the symmetric and positive semi-definite kernel matrix
$$\boldsymbol{K} := [\kappa(x_i, x_j)]_{i,j=1}^N. \tag{3}$$
A consequence of the duality between $\mathcal{H}_X$ and $\mathcal{X}$ is that the $\mathcal{H}$-orthogonal projection of a function $h \in \mathcal{H}$ onto $\mathcal{H}_X$ is given by the interpolant $s_h := \sum_{i=1}^N \alpha_i \kappa(x_i, \cdot)$, whose coefficients $\boldsymbol{\alpha} = [\alpha_i]_{i=1}^N$ are given by the solution to the linear system
$$\boldsymbol{K}\boldsymbol{\alpha} = \boldsymbol{h} \tag{4}$$
with right-hand side $\boldsymbol{h} = [h(x_i)]_{i=1}^N$. From [54, Corollary 11.33], we have the following approximation result.
Theorem 2.1. Let $\Omega \subset \mathbb{R}^d$ be a bounded Lipschitz domain satisfying an interior cone condition. Suppose that the Fourier transform of the kernel $\kappa(x-y)$ satisfies
$$c_\kappa (1 + \|\xi\|_2^2)^{-\tau} \leq \hat{\kappa}(\xi) \leq C_\kappa (1 + \|\xi\|_2^2)^{-\tau}, \quad \xi \in \mathbb{R}^d, \tag{5}$$
for some $\tau > d/2$ and constants $0 < c_\kappa \leq C_\kappa$. Then, there holds $\|h - s_h\|_{L^2(\Omega)} \lesssim h_{X,\Omega}^\tau \|h\|_{\mathcal{H}}$ for every $h \in \mathcal{H}$ and a sufficiently small fill distance
$$h_{X,\Omega} := \sup_{x \in \Omega} \min_{x_i \in X} \|x - x_i\|_2. \tag{6}$$

One class of kernels satisfying the conditions of Theorem 2.1 are the isotropic Matérn kernels, also called Sobolev splines, see [15]. These kernels play an important role in applications, such as spatial statistics [44]. They are given by
$$\kappa_\nu(r) := \frac{2^{1-\nu}}{\Gamma(\nu)} \bigg(\frac{\sqrt{2\nu}\,r}{\ell}\bigg)^{\nu} K_\nu\bigg(\frac{\sqrt{2\nu}\,r}{\ell}\bigg) \tag{7}$$
with $r := \|x - y\|_2$, smoothness parameter $\nu > 0$, and length scale parameter $\ell > 0$, see [40,44]. Here, $K_\nu$ denotes the modified Bessel function of the second kind. Specifically, property (5) holds with $\tau = \nu + d/2$, up to a scaling factor $\alpha$ depending on $\nu$, $\ell$, and $d$, see [40]. Particularly, the Matérn kernels are the reproducing kernels of the Sobolev spaces $H^{\nu+d/2}(\mathbb{R}^d)$, see also [54].
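As a small illustration of the interpolation system (4), the following sketch (hypothetical helper names; the exponential kernel, i.e., the Matérn kernel with $\nu = 1/2$, serves as the SPD kernel) assembles the kernel matrix and solves for the coefficients of the interpolant:

```python
import numpy as np

# Hypothetical sketch: kernel interpolation via the system (4), K alpha = h.
# The exponential kernel below is an SPD example kernel (Matern, nu = 1/2).
def kernel(x, y, ell=1.0):
    return np.exp(-np.abs(x - y) / ell)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 20))      # mutually distinct points X
h = np.sin(2.0 * np.pi * x)                 # data vector h = [h(x_i)]

K = kernel(x[:, None], x[None, :])          # kernel matrix (3)
alpha = np.linalg.solve(K, h)               # solve (4)

def interpolant(t):
    # s_h(t) = sum_i alpha_i kappa(x_i, t): H-orthogonal projection onto H_X
    return kernel(t[:, None], x[None, :]) @ alpha

node_error = np.max(np.abs(interpolant(x) - h))
```

By construction, the interpolant reproduces the data exactly at the nodes, which the final line verifies.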
For half-integer values of $\nu$, i.e., for $\nu = p + 1/2$ with $p \in \mathbb{N}_0$, the Matérn kernels have an explicit representation given by
$$\kappa_{p+1/2}(r) = \exp\bigg(-\frac{\sqrt{2\nu}\,r}{\ell}\bigg) \frac{p!}{(2p)!} \sum_{i=0}^{p} \frac{(p+i)!}{i!\,(p-i)!} \bigg(\frac{2\sqrt{2\nu}\,r}{\ell}\bigg)^{p-i}. \tag{8}$$
The limit case $\nu \to \infty$ gives rise to the Gaussian kernel $\kappa_\infty(r) = \exp\big(-r^2/(2\ell^2)\big)$. Our subsequent compression analysis covers the Matérn kernels, but has considerably wider scope. Indeed, rather large classes of pseudodifferential operators will be admissible. As suitable classes of such operators are known to define an algebra, properties of arithmetic expressions of the underlying kernels, such as off-diagonal coefficient decay and matrix compressibility, can directly be inferred. Equally important, we show that these properties of the operator algebras are to some extent transferred to the corresponding finitely represented structures, i.e., we show that the corresponding matrix representations likewise form algebras in the compressed format. We refer to Appendix A for the details and properties of the pseudodifferential operators appearing in this article.
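The half-integer representation can be evaluated without Bessel functions; the sketch below (hypothetical function name, implementing the standard closed form for $\nu = p + 1/2$) recovers, e.g., the familiar exponential kernel for $p = 0$:

```python
import math

# Hypothetical helper: explicit Matern kernel for half-integer smoothness
# nu = p + 1/2, using the standard closed-form representation.
def matern_half_integer(r, p, ell=1.0):
    nu = p + 0.5
    s = math.sqrt(2.0 * nu) * r / ell
    pref = math.exp(-s) * math.factorial(p) / math.factorial(2 * p)
    return pref * sum(
        math.factorial(p + i) / (math.factorial(i) * math.factorial(p - i))
        * (2.0 * s) ** (p - i)
        for i in range(p + 1)
    )

# p = 0 (nu = 1/2) yields the exponential kernel exp(-r/ell), while
# p = 1 (nu = 3/2) yields (1 + sqrt(3) r/ell) exp(-sqrt(3) r/ell).
v0 = matern_half_integer(0.7, 0)
v1 = matern_half_integer(0.7, 1)
```

Note that every kernel of this family equals one at $r = 0$, consistent with a correlation function.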

Samplet matrix compression
For the reader's convenience, we recall in this section the concept of samplets as it has been introduced in [24].
3.1. Samplets. Samplets are defined based on a sequence of spaces $\{\mathcal{X}_j\}_{j=0}^J$ forming a multiresolution analysis, i.e.,
$$\mathcal{X}_0 \subset \mathcal{X}_1 \subset \cdots \subset \mathcal{X}_J = \mathcal{X}. \tag{9}$$
Rather than using a single scale from the multiresolution analysis (9), the idea of samplets is to keep track of the increment of information between two consecutive levels $j$ and $j+1$. Since we have $\mathcal{X}_j \subset \mathcal{X}_{j+1}$, we may decompose
$$\mathcal{X}_{j+1} = \mathcal{X}_j \oplus^\perp \mathcal{S}_j \tag{10}$$
by using the detail space $\mathcal{S}_j$, where orthogonality is to be understood with respect to the (discrete) inner product defined in (2). Let $\Sigma_j$ be a basis of the detail space $\mathcal{S}_j$ in $\mathcal{X}_{j+1}$. By choosing a basis $\Phi_0$ of $\mathcal{X}_0$ and recursively applying the decomposition (10), we see that the set
$$\Phi_0 \cup \bigcup_{j=0}^{J-1} \Sigma_j$$
forms a basis of $\mathcal{X}_J = \mathcal{X}$, which we call a samplet basis.
In order to employ samplets for the compression of kernel matrices, it is desirable that the signed measures $\sigma_{j,k} \in \mathcal{X}_j \subset \mathcal{H}'$ have isotropic convex hulls of supports, and are localized with respect to the corresponding discretization level $j$, i.e.,
$$\operatorname{diam}(\operatorname{supp} \sigma_{j,k}) \sim 2^{-j/d}, \tag{11}$$
and that they are stable with respect to the inner product defined in (2), i.e., $\langle \sigma_{j,k}, \sigma_{j',k'} \rangle_{\mathcal{X}} = 0$ for $(j,k) \neq (j',k')$.
Furthermore, an essential ingredient is the vanishing moment condition of order $q+1$, i.e.,
$$(p, \sigma_{j,k})_\Omega = 0 \quad \text{for all } p \in \mathcal{P}_q(\Omega), \tag{12}$$
where $\mathcal{P}_q(\Omega)$ is the space of all polynomials of total degree at most $q$. We then say that the samplets have vanishing moments of order $q+1$.

3.2. Construction of samplets.
The starting point for the construction of samplets is the multiresolution analysis (9). Its construction is based on a hierarchical clustering of the set $X$.

Definition 3.2. Let $T = (P, E)$ be a binary tree with vertices $P$ and edges $E$. We define its set of leaves as $\mathcal{L}(T) := \{\nu \in P : \nu \text{ has no sons}\}$.
The tree T is a cluster tree for the set X = {x 1 , . . ., x N }, iff the set X is the root of T and all ν ∈ P \ L(T ) are disjoint unions of their two sons.
The level $j_\nu$ of $\nu \in T$ is its distance from the root, i.e., the number of son relations required for traveling from $X$ to $\nu$. The depth $J$ of $T$ is the maximum level of all clusters. We define the set of clusters on level $j$ as $T_j := \{\nu \in T : j_\nu = j\}$. The cluster tree is balanced, iff $|\nu| \sim 2^{J - j_\nu}$.
To bound the diameter of the clusters, we introduce the separation radius
$$q_X := \frac{1}{2} \min_{i \neq j} \|x_i - x_j\|_2 \tag{13}$$
and require $X$ to be quasi-uniform.
Definition 3.3. The data set $X \subset \Omega$ is quasi-uniform if the fill distance (6) is proportional to the separation radius (13), i.e., there exists a constant $c = c(X, \Omega) \in (0,1)$ such that $c\, h_{X,\Omega} \leq q_X \leq h_{X,\Omega}$.

Roughly speaking, the points $x \in X$ are equispaced if $X \subset \Omega$ is quasi-uniform. This immediately implies the following result.

Lemma 3.4. Let $T$ be a cluster tree. For $\nu \in T$, the bounding box $B_\nu$ of $\nu$ is the smallest axis-parallel cuboid that contains all points of $\nu$. If $X \subset \Omega$ is quasi-uniform, then there holds $\operatorname{diam}(B_\nu) \sim (|\nu|/N)^{1/d}$. In particular, we have $\operatorname{diam}(\nu) \sim 2^{-j_\nu/d}$ for all clusters $\nu \in T$.
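A minimal sketch of such a hierarchical clustering (hypothetical names; median splits along the longest bounding-box edge, which yields a balanced binary cluster tree in the sense of Definition 3.2) might read:

```python
import numpy as np

# Hypothetical sketch of a balanced binary cluster tree: each cluster is split
# into two sons of equal cardinality along the longest edge of its bounding box.
def build_cluster_tree(idx, pts, level=0, leaf_size=4):
    node = {"idx": idx, "level": level, "sons": []}
    if len(idx) > leaf_size:
        axis = np.argmax(pts[idx].max(axis=0) - pts[idx].min(axis=0))
        order = idx[np.argsort(pts[idx, axis])]      # sort along longest edge
        mid = len(order) // 2                         # median split -> balanced
        node["sons"] = [build_cluster_tree(order[:mid], pts, level + 1, leaf_size),
                        build_cluster_tree(order[mid:], pts, level + 1, leaf_size)]
    return node

def leaves(node):
    return [node] if not node["sons"] else [l for s in node["sons"] for l in leaves(s)]

rng = np.random.default_rng(1)
pts = rng.uniform(size=(64, 2))
root = build_cluster_tree(np.arange(64), pts)
leaf_idx = np.sort(np.concatenate([l["idx"] for l in leaves(root)]))
```

For $64$ points and a leaf size of $4$, every leaf sits on level $4$ and holds exactly $4$ points, reflecting $|\nu| \sim 2^{J - j_\nu}$.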
Samplets with vanishing moments are obtained recursively by employing a two-scale transform between basis elements on a cluster $\nu$ of level $j$. To this end, we represent scaling distributions $\Phi_j^\nu = \{\varphi_{j,k}^\nu\}$ and samplets $\Sigma_j^\nu = \{\sigma_{j,k}^\nu\}$ as linear combinations of the scaling distributions $\Phi_{j+1}^\nu$ of $\nu$'s son clusters. This results in the refinement relation
$$[\Phi_j^\nu, \Sigma_j^\nu] = \Phi_{j+1}^\nu \boldsymbol{Q}_j^\nu, \tag{14}$$
where the orthogonal matrix $\boldsymbol{Q}_j^\nu$ is computed from a QR decomposition related to the moment matrix
$$\boldsymbol{M}_{j+1}^\nu := \big[(x^{\boldsymbol{\alpha}}, \Phi_{j+1}^\nu)_\Omega\big]_{|\boldsymbol{\alpha}| \leq q} \in \mathbb{R}^{m_q \times |\Phi_{j+1}^\nu|}, \tag{15}$$
with $m_q := \dim \mathcal{P}_q(\Omega)$. There holds
$$\boldsymbol{M}_{j+1}^\nu \boldsymbol{Q}_j^\nu = \boldsymbol{R}. \tag{16}$$
As $\boldsymbol{R}$ is a lower triangular matrix, the first $k-1$ entries in its $k$-th column are zero. This corresponds to $k-1$ vanishing moments for the $k$-th function generated by the transformation $\boldsymbol{Q}_j^\nu$. By defining the first $m_q$ functions as scaling distributions and the remaining ones as samplets, we obtain samplets with vanishing moments at least up to order $q+1$.
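The construction on a single leaf cluster can be sketched as follows (a hypothetical, simplified variant in one spatial dimension: the moment matrix of the Dirac measures is a Vandermonde matrix, and the trailing columns of the orthogonal factor of a QR decomposition of its transpose serve as samplet coefficient vectors with vanishing moments in the sense of (12)):

```python
import numpy as np

# Hypothetical one-cluster sketch in d = 1: scaling distributions on a leaf are
# Dirac measures, so the moments (x^a, delta_{x_i}) = x_i^a for a <= q form a
# Vandermonde matrix.  The QR decomposition of its transpose yields an
# orthogonal transform whose trailing columns annihilate all moments.
rng = np.random.default_rng(2)
pts = rng.uniform(size=10)                    # points of one leaf cluster
q = 3
M = np.vander(pts, q + 1, increasing=True).T  # (q+1) x 10 moment matrix
Q, R = np.linalg.qr(M.T, mode="complete")     # orthogonal 10 x 10 transform

m_q = q + 1                                   # dim P_q(Omega) for d = 1
samplets = Q[:, m_q:]                         # samplet coefficient vectors
moment_residual = np.max(np.abs(M @ samplets))
```

The columns of `samplets` are orthonormal and have $q + 1 = 4$ vanishing moments; the first $m_q$ columns of `Q` play the role of the scaling distributions.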
For leaf clusters, we define the scaling distributions by the Dirac measures at the points $x_i$, i.e., $\Phi_J^\nu := \{\delta_{x_i} : x_i \in \nu\}$, to make up for the lack of son clusters that could provide scaling distributions. The scaling distributions of all clusters on a specific level $j$ then generate the spaces
$$\mathcal{X}_j := \operatorname{span}\{\varphi_{j,k}^\nu : k \in \Delta_j^\nu,\ \nu \in T_j\}, \tag{17}$$
while the samplets span the detail spaces
$$\mathcal{S}_j := \operatorname{span}\{\sigma_{j,k}^\nu : k \in \nabla_j^\nu,\ \nu \in T_j\}. \tag{18}$$
Combining the scaling distributions of the root cluster with all clusters' samplets amounts to the final samplet basis
$$\Sigma_N := \Phi_0^X \cup \bigcup_{\nu \in T} \Sigma_{j_\nu}^\nu. \tag{19}$$
A visualization of a scaling distribution and different samplets on a spiral data set is found in Figure 1. By construction, samplets satisfy the following properties, which are collected from [24, Theorem 3.6, Lemma 3.9, Theorem 5.4].
Theorem 3.5. The spaces $\mathcal{X}_j$ defined in equation (17) form the desired multiresolution analysis (9), where the corresponding complement spaces $\mathcal{S}_j$ from (18) satisfy $\mathcal{X}_{j+1} = \mathcal{X}_j \oplus^\perp \mathcal{S}_j$. The associated samplet basis $\Sigma_N$ defined in (19) constitutes an orthonormal basis of $\mathcal{X}$, and we have:
(1) The number of all samplets on level $j$ behaves like $2^j$.
(2) The samplets have vanishing moments of order $q+1$, i.e., (12) holds.
(3) Each samplet is supported in a specific cluster $\nu$. If the points in $X$ are quasi-uniform, then the diameter of the cluster satisfies $\operatorname{diam}(\nu) \sim 2^{-j_\nu/d}$ and there holds (11).
(4) The coefficient vector $\boldsymbol{\omega}_{j,k} = [\omega_{j,k,i}]_i$ of the samplet $\sigma_{j,k}$ on the cluster $\nu$ fulfills $\|\boldsymbol{\omega}_{j,k}\|_1 \leq \sqrt{|\nu|}$.
(5) Let $f \in C^{q+1}(\Omega)$. Then, there holds for a samplet $\sigma_{j,k}$ supported on the cluster $\nu$ that
$$|(f, \sigma_{j,k})_\Omega| \lesssim \operatorname{diam}(\nu)^{q+1}\, \|f\|_{C^{q+1}(\Omega)}\, \|\boldsymbol{\omega}_{j,k}\|_1.$$

Remark 3.6. Each samplet is a linear combination of the Dirac measures supported at the points in $X$,
$$\sigma_{j,k} = \sum_i \omega_{j,k,i}\, \delta_{x_i}. \tag{20}$$
The related coefficient vectors $\boldsymbol{\omega}_{j,k}$ in (20) are pairwise orthonormal with respect to the inner product (2). The dual samplet in $\mathcal{H}_X$ is given by the Riesz representer $\tilde{\sigma}_{j,k} := \sum_i \omega_{j,k,i}\, \kappa(x_i, \cdot)$, as there holds $\langle \tilde{\sigma}_{j,k}, h \rangle_{\mathcal{H}} = (h, \sigma_{j,k})_\Omega$ for all $h \in \mathcal{H}$.

3.3. Matrix compression.
For the compression of the kernel matrix $\boldsymbol{K}$ from (3), with samplets of vanishing moment order $q+1$ for an integer $q \geq 0$, we suppose that the kernel $\kappa$ is "$(q+1)$-asymptotically smooth". This is to say that there are constants $c_{\kappa,\alpha,\beta} > 0$ such that for all $x, y \in \Omega$ with $x \neq y$ there holds
$$\big|\partial_x^{\boldsymbol{\alpha}} \partial_y^{\boldsymbol{\beta}} \kappa(x, y)\big| \leq c_{\kappa,\alpha,\beta}\, \|x - y\|_2^{-(|\boldsymbol{\alpha}| + |\boldsymbol{\beta}|)} \quad \text{for all } |\boldsymbol{\alpha}|, |\boldsymbol{\beta}| \leq q+1. \tag{21}$$
Note that such an estimate can only be valid for continuous kernels as considered here, but not for singular kernels. However, we observe in passing that this condition is considerably weaker than the notion of asymptotic smoothness of kernels in H-matrix theory, cp. [20]. The condition there would correspond to infinite differentiability in (21) with analytic estimates on the constants $c_{\kappa,\alpha,\beta}$.
Due to (21), we have in accordance with [24, Lemma 5.3] the decay estimate
$$\big|(\kappa, \sigma_{j,k} \otimes \sigma_{j',k'})_{\Omega \times \Omega}\big| \lesssim \frac{2^{-(j+j')(q+1)/d}}{\operatorname{dist}(\nu, \nu')^{2(q+1)}} \tag{22}$$
for two samplets $\sigma_{j,k}$ and $\sigma_{j',k'}$ with the vanishing moment property of order $q+1$, supported on the clusters $\nu$ and $\nu'$ such that $\operatorname{dist}(\nu, \nu') > 0$.
Estimate (22) holds for a wide range of kernels that obey the so-called Calderón-Zygmund estimates. It immediately results in the following compression strategy for kernel matrices in samplet representation, cp. [24, Theorem 5.4], which is well known in the context of wavelet compression of operator equations, see, e.g., [41].

Theorem 3.7 (S-compression). Set all coefficients of the kernel matrix in samplet coordinates,
$$\boldsymbol{K}^\Sigma := \big[(\kappa, \sigma_{j,k} \otimes \sigma_{j',k'})_{\Omega \times \Omega}\big]_{(j,k),(j',k')},$$
to zero which satisfy the $\eta$-admissibility condition
$$\operatorname{dist}(\nu, \nu') \geq \eta \max\{\operatorname{diam}(\nu), \operatorname{diam}(\nu')\}, \quad \eta > 0, \tag{23}$$
where $\nu$ is the cluster supporting $\sigma_{j,k}$ and $\nu'$ is the cluster supporting $\sigma_{j',k'}$, respectively. Then, the resulting S-compressed matrix $\boldsymbol{K}^\eta$ satisfies
$$\|\boldsymbol{K}^\Sigma - \boldsymbol{K}^\eta\|_F \leq c\, \eta^{-2(q+1)} N \tag{24}$$
for some constant $c > 0$ dependent on the polynomial degree $q$ and the kernel $\kappa$.

Remark 3.8. We remark that Theorem 3.7 uses the Frobenius norm for measuring the error rather than the operator norm, as it gives control on each matrix coefficient. Estimates with respect to the operator norm would be similar.
The $\eta$-admissibility condition (23) is reminiscent of the one used for hierarchical matrices, compare, e.g., [10] and the references there. However, in the present context, the clusters $\nu$ and $\nu'$ may also be located on different levels, i.e., $j_\nu \neq j_{\nu'}$ in general. As a consequence, the resulting block cluster tree is the Cartesian product $T \times T$ rather than the level-wise Cartesian product considered in the context of hierarchical matrices.
The error bounds for S-compression hold for kernel functions $\kappa$ with finite differentiability (especially, with derivatives of order $q+1$, cp. [24, Lemma 5.3]), as opposed to the usual requirement of asymptotic smoothness which appears in the error analysis of the H-format, see [20] and the references therein.
For point sets $X = \{x_i\}_{i=1}^N$ that are quasi-uniform in the sense of Definition 3.3, there holds $\|\boldsymbol{K}^\Sigma\|_F \sim N$. Thus, we can refine the above result, see also [24, Corollary 5.5].

Corollary 3.9. In case of quasi-uniform points $x_i \in X$, the S-compressed matrix $\boldsymbol{K}^\eta$ has only $O(N \log N)$ nonzero coefficients, while it satisfies the error estimate
$$\|\boldsymbol{K}^\Sigma - \boldsymbol{K}^\eta\|_F \leq c\, \eta^{-2(q+1)} \|\boldsymbol{K}^\Sigma\|_F.$$

In [24], an algorithm has been proposed which computes the compressed matrix $\boldsymbol{K}^\eta$ in work and memory $O(N \log N)$. The key ingredient to achieve this is the use of an interpolation-based fast multipole method and $\mathcal{H}^2$-matrix techniques [3,10,19].
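To illustrate how sparse the a priori pattern induced by (23) is, the following sketch (hypothetical, one-dimensional dyadic cluster hierarchy) counts the retained, i.e., inadmissible, cluster pairs:

```python
# Hypothetical sketch: count the retained (inadmissible) cluster pairs of the
# a priori pattern induced by (23) for the dyadic intervals of a balanced
# one-dimensional cluster hierarchy of depth J.
eta, J = 1.25, 6

def clusters(j):
    h = 2.0 ** (-j)
    return [(k * h, (k + 1) * h) for k in range(2 ** j)]

def admissible(c1, c2):
    dist = max(c2[0] - c1[1], c1[0] - c2[1], 0.0)
    diam = max(c1[1] - c1[0], c2[1] - c2[0])
    return dist >= eta * diam            # eta-admissibility condition (23)

kept = sum(
    1
    for j in range(J + 1) for jp in range(J + 1)
    for c1 in clusters(j) for c2 in clusters(jp)
    if not admissible(c1, c2)
)
total = sum(2 ** j for j in range(J + 1)) ** 2   # all cluster pairs
```

Only near-diagonal and coarse-level blocks survive; the kept fraction shrinks with growing depth, in line with the $O(N \log N)$ bound on the number of relevant entries.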

Samplet matrix algebra
4.1. Addition and multiplication. To bound the cost for the addition of two compressed kernel matrices represented with respect to the same cluster tree, it is sufficient to assume that the points in $X$ are quasi-uniform. Then it is straightforward to see that the cost for adding such matrices is $O(N \log N)$. The multiplication of two compressed matrices, in turn, is motivated by the concatenation $C = A \circ B$ of two pseudodifferential operators $A$ and $B$. In suitable algebras, the product $C$ is again a pseudodifferential operator and, hence, compressible. The respective kernel $\kappa_C(\cdot,\cdot)$ is given by
$$\kappa_C(x, y) = \int_\Omega \kappa_A(x, z)\, \kappa_B(z, y)\, \mathrm{d}z. \tag{25}$$
Since $\Omega \subset \mathbb{R}^d$ is bounded by assumption, we may without loss of generality assume $\Omega \subset [0,1)^d$. Then, if the distribution of the data points in $X = \{x_i\}_{i=1}^N \subset \Omega$ satisfies the stronger assumption of being asymptotically uniform modulo one, there holds
$$\lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^N f(x_i) = \int_\Omega f(x)\, \mathrm{d}x \tag{26}$$
for every Riemann integrable function $f\colon \Omega \to \mathbb{R}$, cp. [39]. Hence, we may interpret the scaled matrix product $\frac{1}{N} \boldsymbol{K}_A \boldsymbol{K}_B$ as a discrete version of the convolution (25). In view of (26), we conclude
$$\bigg[\frac{1}{N} \boldsymbol{K}_A \boldsymbol{K}_B\bigg]_{i,j} = \frac{1}{N} \sum_{\ell=1}^N \kappa_A(x_i, x_\ell)\, \kappa_B(x_\ell, x_j) \to \kappa_C(x_i, x_j) \quad \text{as } N \to \infty. \tag{27}$$
Consequently, the product of two kernel matrices yields an S-compressible matrix.

Theorem 4.1. Let $X = \{x_i\}_{i=1}^N \subset \Omega$ be asymptotically distributed uniformly modulo one, cp. (26), and denote by $\boldsymbol{K}_C$ the corresponding kernel matrix with $\kappa_C(\cdot,\cdot)$ from (25). Then, there holds
$$\max_{i,j} \bigg|\bigg[\frac{1}{N} \boldsymbol{K}_A \boldsymbol{K}_B - \boldsymbol{K}_C\bigg]_{i,j}\bigg| \to 0 \quad \text{as } N \to \infty.$$
Proof. On the one hand, we conclude from (27) that, as $N \to \infty$, each coefficient of $\frac{1}{N} \boldsymbol{K}_A \boldsymbol{K}_B$ converges to the value of $\kappa_C$ at the corresponding pair of points. On the other hand, we find likewise that the coefficients of $\boldsymbol{K}_C$ are exactly these values. This implies the assertion.
Remark 4.2. We mention that the consistency bound in the preceding theorem is rather crude. Under provision of stronger kernel-function regularity, correspondingly higher convergence rates can be achieved, given that $X$ satisfies appropriate higher-order quasi-Monte Carlo designs, see, e.g., [11] and the references there.
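The consistency statement of Theorem 4.1 is easy to check numerically. The sketch below (hypothetical smooth kernels $\kappa_A$, $\kappa_B$ on $[0,1)$, equidistributed points) compares an entry of the scaled product $\frac{1}{N}\boldsymbol{K}_A\boldsymbol{K}_B$ with a fine quadrature of the convolution integral (25):

```python
import numpy as np

# Numerical check of (27) with hypothetical example kernels kappa_A, kappa_B:
# for equidistributed points in [0,1), the scaled product (1/N) K_A K_B
# approximates the kernel matrix of the convolution kernel kappa_C from (25).
kappa_A = lambda x, y: np.exp(-np.abs(x - y))
kappa_B = lambda x, y: 1.0 / (1.0 + (x - y) ** 2)

N = 1500
x = (np.arange(N) + 0.5) / N                     # equidistributed points
KA = kappa_A(x[:, None], x[None, :])
KB = kappa_B(x[:, None], x[None, :])
prod = (KA @ KB) / N                             # discrete convolution

# reference: kappa_C(x_i, x_j) = int_0^1 kappa_A(x_i, z) kappa_B(z, x_j) dz
z = (np.arange(200_000) + 0.5) / 200_000
i, j = 17, 1234
ref = np.mean(kappa_A(x[i], z) * kappa_B(z, x[j]))
err = abs(prod[i, j] - ref)
```

Already for $N = 1500$ the entrywise discrepancy is far below the plotting accuracy of typical benchmarks, reflecting the Riemann-sum interpretation (26).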
Let now the matrices $\boldsymbol{K}_A$, $\boldsymbol{K}_B$, and $\boldsymbol{K}_C$ be compressed with respect to the same compression pattern. We assume for given $\varepsilon(\eta) > 0$ that $\eta$ in (24) is chosen such that
$$\|\boldsymbol{K}_A^\Sigma - \boldsymbol{K}_A^\eta\|_F,\ \|\boldsymbol{K}_B^\Sigma - \boldsymbol{K}_B^\eta\|_F,\ \|\boldsymbol{K}_C^\Sigma - \boldsymbol{K}_C^\eta\|_F \leq \varepsilon(\eta).$$
Then, a repeated application of the triangle inequality yields
$$\bigg\|\boldsymbol{K}_C^\Sigma - \frac{1}{N} \boldsymbol{K}_A^\eta \boldsymbol{K}_B^\eta\bigg\|_F \leq \bigg\|\boldsymbol{K}_C^\Sigma - \frac{1}{N} \boldsymbol{K}_A^\Sigma \boldsymbol{K}_B^\Sigma\bigg\|_F + \frac{\varepsilon(\eta)}{N} \big(\|\boldsymbol{K}_A^\Sigma\|_F + \|\boldsymbol{K}_B^\eta\|_F\big).$$
This means that we only need to compute $O(N \log N)$ matrix entries to determine an approximate version of the matrix product. We like to stress that this formatted matrix multiplication is exact on the given compression pattern. The next theorem gives a cost bound on the matrix multiplication.
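The formatted multiplication on a fixed pattern can be sketched as follows (hypothetical dense stand-in for the S-format: the entries are computed exactly, but only on a prescribed pattern):

```python
import numpy as np

# Hypothetical dense stand-in for formatted multiplication: given factors A, B
# and a prescribed sparsity pattern P, compute C = (A @ B) restricted to P.
# On the pattern the entries are exact; off the pattern they are dropped.
rng = np.random.default_rng(3)
N = 60
A = rng.standard_normal((N, N)) * (rng.random((N, N)) < 0.1)
B = rng.standard_normal((N, N)) * (rng.random((N, N)) < 0.1)
P = (A != 0) | (B != 0) | np.eye(N, dtype=bool)  # fixed compression pattern

C = np.zeros((N, N))
for i, j in zip(*np.nonzero(P)):
    C[i, j] = A[i, :] @ B[:, j]    # discrete inner product, only on the pattern

exact = A @ B
```

The only error of the formatted product is the truncation to the pattern; on the pattern itself the result coincides with the exact product.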
Theorem 4.3. Let $\boldsymbol{A}^\eta, \boldsymbol{B}^\eta \in \mathbb{R}^{N \times N}$ be matrices in samplet representation which are S-compressed with respect to the compression pattern induced by the $\eta$-admissibility condition (23). Then, computing with respect to the same compression pattern the matrix $\boldsymbol{C}^\eta = [c_{(j,k),(j',k')}]$, where the nonzero entries are given by the discrete inner product
$$c_{(j,k),(j',k')} = \sum_{\ell, m} a_{(j,k),(\ell,m)}\, b_{(\ell,m),(j',k')}, \tag{28}$$
can be accomplished in $O(N \log^2 N)$ operations.

Proof. To estimate the cost of the matrix multiplication, we shall make use of the compression rule (23). We assume for all clusters that $\operatorname{diam}(\nu) \sim 2^{-j/d}$ if $\nu$ is on level $j$. Thus, the samplet $\sigma_{j,k}$ has approximately the diameter $2^{-j/d}$ and, therefore, only $O(2^{\ell-j})$ samplets $\sigma_{\ell,m}$ of diameter $\sim 2^{-\ell/d}$ are found in its nearfield if $\ell \geq j$, while only $O(1)$ are found if $\ell < j$. For fixed level $0 \leq \ell \leq J$ in (28), we thus have at most $O(\max\{2^{\ell - \max\{j,j'\}}, 1\})$ nonzero products to evaluate per coefficient $c_{(j,k),(j',k')}$. We assume without loss of generality that $j' \geq j$ and sum over $\ell$, which yields the cost $O(\max\{2^{J-j'}, j'\})$. Per target block matrix $\boldsymbol{C}_{j,j'} = [c_{(j,k),(j',k')}]_{k,k'}$, we have $O(2^{\max\{j,j'\}}) = O(2^{j'})$ nonzero coefficients. Hence, the cost for computing the desired target block is $O(2^{j'} \max\{2^{J-j'}, j'\})$. Summing over $j$ and $j'$ yields the assertion.

4.2. Sparse selected inversion. Having addition and multiplication of kernel matrices at our disposal, we consider the matrix inversion next. To this end, observe that the inverse $A^{-1}$ of a pseudodifferential operator $A$ from a suitable algebra of pseudodifferential operators, provided that it exists, is again a pseudodifferential operator, see Appendix A. However, if $A$ is a pseudodifferential operator of negative order, as in the present RKHS case, the operator $A^{-1}$ is of positive order and hence gives rise to a singular kernel which does not satisfy condition (21). Even so, in the regime of kernel matrices we are rather interested in inverting regularized pseudodifferential operators, i.e., $A + \mu I$, where $I$ denotes the identity. For such operators, we have the following lemma.
Lemma 4.4. Let $A$ be a pseudodifferential operator of order $s \leq 0$ with symmetric and positive semidefinite kernel function. Then, for any $\mu > 0$, the inverse of $A + \mu I$ can be decomposed into $\frac{1}{\mu} I - B$ with
$$B := \frac{1}{\mu} (A + \mu I)^{-1} A. \tag{29}$$
Especially, $B$ is also a pseudodifferential operator of order $s$, which admits a symmetric and positive semidefinite kernel function.
Proof. In view of (29), we infer that
$$\Big(\frac{1}{\mu} I - B\Big)(A + \mu I) = \frac{1}{\mu}(A + \mu I) - \frac{1}{\mu}(A + \mu I)^{-1} A (A + \mu I) = I + \frac{1}{\mu} A - \frac{1}{\mu} A = I.$$
Therefore, $\frac{1}{\mu} I - B$ is the inverse operator to $A + \mu I$. Since $A + \mu I$ is of order $0$, $(A + \mu I)^{-1}$ is of order $0$, too, and thus $(A + \mu I)^{-1} A$ is of the same order as $A$. Finally, the symmetry and nonnegativity of $B$ follow from the symmetry and nonnegativity of $A$.
As a consequence of this lemma, the inverse $(\boldsymbol{K}_A + \mu \boldsymbol{I})^{-1} \in \mathbb{R}^{N \times N}$ of the associated kernel matrix $\boldsymbol{K}_A + \mu \boldsymbol{I} \in \mathbb{R}^{N \times N}$ is S-compressible with respect to the same compression pattern as $\boldsymbol{K}_A$. In [23], strong numerical evidence was presented that a sparse Cholesky factorization of a compressed kernel matrix can efficiently be computed by means of nested dissection, cf. [17]. This suggests computing the inverse $(\boldsymbol{K}_A + \mu \boldsymbol{I})^{-1}$ in the samplet basis on the compression pattern of $\boldsymbol{K}_A$ by means of selected inversion [37] of a sparse matrix. The approach is outlined below.
Assume that $\boldsymbol{A} \in \mathbb{R}^{N \times N}$ is symmetric and positive definite. There are two steps in the inversion algorithm. The first stage involves factorizing the input matrix $\boldsymbol{A}$ into $\boldsymbol{A} = \boldsymbol{L}\boldsymbol{D}\boldsymbol{L}^\top$. The $\boldsymbol{L}$ and $\boldsymbol{D}$ matrices are used in the second phase to compute the selected components of $\boldsymbol{A}^{-1}$. The first step will be referred to as factorization in the following and the second step as selected inversion. To explain the second step, let $\boldsymbol{A}$ be partitioned according to
$$\boldsymbol{A} = \begin{bmatrix} \boldsymbol{A}_{11} & \boldsymbol{A}_{12} \\ \boldsymbol{A}_{12}^\top & \boldsymbol{A}_{22} \end{bmatrix}.$$
In particular, the diagonal blocks $\boldsymbol{A}_{ii}$ are also symmetric and positive definite. The selected inversion is based on the identity
$$\boldsymbol{A}^{-1} = \begin{bmatrix} \boldsymbol{A}_{11}^{-1} + \boldsymbol{C}\boldsymbol{S}^{-1}\boldsymbol{C}^\top & \boldsymbol{C}\boldsymbol{S}^{-1} \\ \boldsymbol{S}^{-1}\boldsymbol{C}^\top & \boldsymbol{S}^{-1} \end{bmatrix}, \tag{30}$$
where $\boldsymbol{S} := \boldsymbol{A}_{22} + \boldsymbol{A}_{12}^\top \boldsymbol{C}$ is the Schur complement with $\boldsymbol{C} := -\boldsymbol{A}_{11}^{-1} \boldsymbol{A}_{12}$. For sparse matrices, this block algorithm can efficiently be realized based on the observation that, for the computation of the entries of $\boldsymbol{A}^{-1}$ on the pattern of $\boldsymbol{L}$, only the entries on the pattern of $\boldsymbol{L}$ are required, as is well known from the sparse matrix literature, cp. [13,18,37]. We note that the pattern of $\boldsymbol{A}$ is particularly contained in the pattern of $\boldsymbol{L}$.
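The identity (30) is readily verified numerically; the following sketch (hypothetical small SPD matrix) assembles the inverse blockwise from $\boldsymbol{A}_{11}^{-1}$, $\boldsymbol{C}$, and $\boldsymbol{S}^{-1}$:

```python
import numpy as np

# Numerical check of the block-inverse identity (30): with C := -inv(A11) A12
# and Schur complement S := A22 + A12^T C, the blocks of inv(A) follow directly.
rng = np.random.default_rng(4)
n1, n2 = 5, 3
M = rng.standard_normal((n1 + n2, n1 + n2))
A = M @ M.T + (n1 + n2) * np.eye(n1 + n2)        # symmetric positive definite
A11, A12, A22 = A[:n1, :n1], A[:n1, n1:], A[n1:, n1:]

C = -np.linalg.solve(A11, A12)
S = A22 + A12.T @ C                              # Schur complement
Sinv = np.linalg.inv(S)
Ainv = np.block([
    [np.linalg.inv(A11) + C @ Sinv @ C.T, C @ Sinv],
    [Sinv @ C.T, Sinv],
])
block_error = np.linalg.norm(Ainv - np.linalg.inv(A))
```

In the sparse setting, applying (30) recursively on a nested-dissection ordering is exactly what keeps the work confined to the pattern of the factor $\boldsymbol{L}$.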

4.3. Algorithmic aspects. A block selected inversion algorithm has at least two advantages: first, because $\boldsymbol{A}$ is sparse, blocks can be specified in terms of supernodes [37]; second, this allows us to use level-3 BLAS to construct an efficient implementation by leveraging the memory hierarchy in current microprocessors. A supernode is a group of nodes with the same nonzero structure below the diagonal in their respective columns (of their $\boldsymbol{L}$ factor). The supernodal approach for sparse symmetric factorization represents the factor $\boldsymbol{L}$ as a set of supernodes, each of which consists of a contiguous set of columns of $\boldsymbol{L}$ with identical nonzero patterns, and each supernode is stored as a dense submatrix to take advantage of level-3 BLAS calculations.
Taking these considerations as a starting point, it is natural to employ the selected inversion approach presented in [53] and available in [43] in order to directly compute the entries on the pattern of the inverse matrix. For the particular implementation of the selected inversion, we rely on Pardiso. For larger kernel matrices, which cannot be indexed by 32-bit integers due to the comparatively large number of non-zero entries, we combine the selected inversion with a divide-and-conquer approach based on the identity (30). The inversion of the $\boldsymbol{A}_{11}$ block and of the Schur complement $\boldsymbol{S}$ are performed with Pardiso (exploiting symmetry), while the other arithmetic operations, i.e., addition and multiplication, are performed in a formatted way, compare Theorem 4.3.

4.4. Matrix functions. Based on the S-formatted multiplication and inversion of operators represented in the samplet basis, certain holomorphic functions of an S-compressed operator also admit S-formatted approximations with, essentially, corresponding approximation accuracies.
To illustrate this, we recall the method from [22]. This approach employs the contour integral representation
$$f(\boldsymbol{A}) = \frac{1}{2\pi \mathrm{i}} \oint_\Gamma f(z) (z\boldsymbol{I} - \boldsymbol{A})^{-1}\, \mathrm{d}z, \tag{31}$$
where $\Gamma$ is a closed contour contained in the analyticity region of $f$ and winding once around the spectrum $\sigma(\boldsymbol{A})$ in counterclockwise direction. As is well known, analytic functions $f$ of elliptic, self-adjoint pseudodifferential operators yield again pseudodifferential operators in the same algebra, see, e.g., [50, Chap. XII.1]. Hence, $\boldsymbol{B} := f(\boldsymbol{A})$ is again S-compressible, and there holds
$$\|\boldsymbol{B}^\Sigma - (f(\boldsymbol{A}^\eta))^\eta\|_F \lesssim L\, \|\boldsymbol{A}^\Sigma - \boldsymbol{A}^\eta\|_F + \|\boldsymbol{B}^\Sigma - \boldsymbol{B}^\eta\|_F. \tag{32}$$
Herein, $L$ denotes the Lipschitz constant of the function $f$. In other words, estimate (32) implies that the error of the approximation of the S-formatted matrix function $(f(\boldsymbol{A}^\eta))^\eta$ is rigorously controlled by the sum of the input error $\|\boldsymbol{A}^\Sigma - \boldsymbol{A}^\eta\|_F$ and the compression error for the exact output, $\|\boldsymbol{B}^\Sigma - \boldsymbol{B}^\eta\|_F$. The latter is under control if the underlying pseudodifferential operator is of order $s < -d$, since then the kernel is continuous and satisfies (21). In the other cases, some analysis is needed to control this error (see below).
For the numerical approximation of the contour integral (31), one has to apply an appropriate quadrature formula. Exemplarily, we consider the matrix square root, i.e., $f(z) = \sqrt{z}$ for $\operatorname{Re} z > 0$. This occurs, for example, in the efficient path simulation of Gaussian processes in spatial statistics. We shall here apply, see [22, Eq. (4.4) and comments below], an approximation of the form
$$(\boldsymbol{A}^\eta)^{-1/2} \approx \sum_{k=1}^K \gamma_k\, (\boldsymbol{A}^\eta + w_k^2 \boldsymbol{I})^{-1}, \tag{33}$$
where the quadrature weights $\gamma_k$ and shifts $w_k$, $k \in \{1, \ldots, K\}$, are given in terms of the Jacobian elliptic functions $\operatorname{sn}$, $\operatorname{cn}$ and $\operatorname{dn}$ [2, Chapter 16] and the complete elliptic integral $E$ of the second kind associated with the parameter $\kappa_A := c/\bar{c}$ [2, Chapter 17]. The quadrature approximation (33) of the contour integral (31) for the matrix square root is known to converge root-exponentially (e.g., [8, Lemma 3.4]) in the number $K$ of quadrature nodes. Hence, approximate representations with consistency orders that are algebraic with respect to $N$ can be achieved with $K \sim |\log \varepsilon(\eta)|$, resulting in overall log-linear complexity of the numerical realization of (33) in S-format. We also remark that the quadrature shifts $w_k^2$ in the inversions which occur in (33) act as regularizing "nuggets" of a possibly ill-conditioned $\boldsymbol{A}$. The input parameters $0 < c < \bar{c}$ shall provide bounds to the spectrum of $\boldsymbol{A}$, i.e., $c \approx \lambda_{\min}(\boldsymbol{A})$ and $\bar{c} \approx \lambda_{\max}(\boldsymbol{A})$. Note that we also assume here that $\boldsymbol{A}$ is symmetric and positive definite. Moreover, we should mention that, except for the quadrature error, (33) computes the square root $(\boldsymbol{A}^\eta)^{-1/2}$ of the compressed input $\boldsymbol{A}^\eta$ in an exact way on the compression pattern when we use the selected inversion algorithm from Subsection 4.2.
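The structure of (33), a weighted sum of shifted inversions, can be illustrated with a simpler rule. The sketch below discretizes the elementary integral representation $\boldsymbol{A}^{-1/2} = \frac{2}{\pi}\int_0^{\pi/2} \big(\sin^2(t)\,\boldsymbol{I} + \cos^2(t)\,\boldsymbol{A}\big)^{-1}\,\mathrm{d}t$ by Gauss-Legendre quadrature; this is not the spectrally optimized elliptic-function rule of [22], but each node likewise costs one shifted inversion:

```python
import numpy as np

# Simplified stand-in for (33): Gauss-Legendre discretization of
#   A^{-1/2} = (2/pi) * int_0^{pi/2} (sin(t)^2 I + cos(t)^2 A)^{-1} dt,
# i.e., a weighted sum of shifted inversions (not the elliptic-function rule).
rng = np.random.default_rng(5)
n = 8
M = rng.standard_normal((n, n))
A = M @ M.T / n + np.eye(n)                  # SPD with moderate spectrum

nodes, weights = np.polynomial.legendre.leggauss(48)
t = 0.25 * np.pi * (nodes + 1.0)             # map [-1, 1] -> [0, pi/2]
w = 0.25 * np.pi * weights

I = np.eye(n)
Ainv_sqrt = (2.0 / np.pi) * sum(
    wk * np.linalg.inv(np.sin(tk) ** 2 * I + np.cos(tk) ** 2 * A)
    for tk, wk in zip(t, w)
)

lam, V = np.linalg.eigh(A)                   # reference via eigendecomposition
ref = (V / np.sqrt(lam)) @ V.T
quad_error = np.linalg.norm(Ainv_sqrt - ref)
```

Since $\sin^2(t)\boldsymbol{I} + \cos^2(t)\boldsymbol{A} = \cos^2(t)\,(\boldsymbol{A} + \tan^2(t)\boldsymbol{I})$, each node indeed amounts to one inversion of $\boldsymbol{A}$ shifted by a positive "nugget"; in the S-format these inversions would be carried out by selected inversion on the fixed pattern.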
That $(\boldsymbol{A}^\eta)^{-1/2}$ is indeed S-compressible is a consequence of the following lemma.
Lemma 4.5. Let $A$ be a pseudodifferential operator of order $s \leq 0$ with symmetric and positive semidefinite kernel function. Then, for any $\mu > 0$, the inverse square root of $A + \mu I$ can be written as $\frac{1}{\sqrt{\mu}} I - B$ with $B$ being also a pseudodifferential operator of order $s$, which admits a symmetric and positive semidefinite kernel function.
Proof. Straightforward calculation shows that the ansatz
$$(A + \mu I)^{-1/2} = \frac{1}{\sqrt{\mu}} I - B \tag{34}$$
leads to $\big(\frac{1}{\sqrt{\mu}} I + (A + \mu I)^{-1/2}\big) B = \frac{1}{\mu} I - (A + \mu I)^{-1}$, which in view of (34) is equivalent to
$$B = \frac{1}{\mu} (A + \mu I)^{-1} A \Big(\frac{1}{\sqrt{\mu}} I + (A + \mu I)^{-1/2}\Big)^{-1}.$$
As both $\frac{1}{\sqrt{\mu}} I + (A + \mu I)^{-1/2}$ and $(A + \mu I)^{-1}$ are pseudodifferential operators of order $0$, $B$ must have the same order as $A$.
An alternative to the contour integral for computing the matrix exponential of a (possibly singular) matrix $\boldsymbol{A}$ is given by the direct evaluation of the power series
$$\exp(\boldsymbol{A}) = \sum_{k=0}^\infty \frac{\boldsymbol{A}^k}{k!}.$$
As we show in the numerical results, this series converges very fast for the matrices under consideration, which stem from reproducing kernels, since they correspond to compact operators.
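A direct evaluation of the truncated power series may be sketched as follows (hypothetical small symmetric matrix; for scaled kernel matrices with moderate norm, a few dozen terms suffice):

```python
import numpy as np

# Direct evaluation of exp(A) = sum_k A^k / k! by a truncated power series
# (hypothetical small symmetric matrix with moderate norm).
rng = np.random.default_rng(6)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T / (4 * n)                        # ||A|| is O(1)

term = np.eye(n)
expA = np.eye(n)
for k in range(1, 30):
    term = term @ A / k                      # A^k / k! built incrementally
    expA = expA + term

lam, V = np.linalg.eigh(A)                   # reference via eigendecomposition
ref = (V * np.exp(lam)) @ V.T
series_error = np.linalg.norm(expA - ref)
```

In the S-format, each of the matrix-matrix products above would be performed as a formatted multiplication on the fixed compression pattern.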

Numerical results
The computations in this section have been performed on a single node with two Intel Xeon E5-2650 v3 @2.30GHz CPUs and up to 512GB of main memory. To achieve consistent timings, all computations have been carried out using 16 cores. The samplet compression is implemented in C++11 and relies on the Eigen template library for linear algebra operations. Moreover, the selected inversion is performed by Pardiso. Throughout this section, we employ samplets with $q + 1 = 4$ vanishing moments. The parameter for the admissibility condition (23) is set to $\eta = 1.25$. Together with the a priori pattern, which is obtained by neglecting admissible blocks, we also consider an a posteriori compression by setting all matrix entries smaller than $\tau = 10^{-5}/N$ to zero, resulting in the a posteriori pattern.

S-formatted matrix multiplication.
To benchmark the multiplication, we consider uniformly distributed random points on the unit hypercube [0, 1]^d. As kernel, we consider the exponential kernel (the Matérn kernel with smoothness parameter ν = 1/2 and correlation length 1), k(x, y) = exp(−‖x − y‖₂)/N. Note that we impose the scaling 1/N of the kernel function in order to fix the largest eigenvalue of the kernel matrix, as its trace then stays uniformly bounded. We compute the matrix product K^η • K̃^η, where K̃^η is obtained from K^η by relatively perturbing each nonzero entry by 10% additive noise, uniformly distributed in [0, 1]. This way, we rule out symmetry effects, as K̃^η will not be symmetric in general. To measure the multiplication error, we consider the estimator e_F, which applies the error matrix to a random matrix X ∈ R^{N×10} with uniformly distributed independent entries. The left-hand side of Figure 2 shows the computation time for a single multiplication. The dashed lines correspond to the asymptotic rates O(N log^α N) for α = 0, 1, 2, 3. It can be seen that the multiplication time for d = 2 perfectly reflects the expected essentially linear behavior. Though the graph is steeper for d = 3, we expect it to flatten for larger N. The right-hand side of the plot shows the multiplication error e_F(K^η • K̃^η − K^η K̃^η), where the formatted multiplication is performed on the a posteriori pattern. Taking into account that the compression errors for K^η are approximately 5.6 · 10^{-6} for d = 2 and 1.6 · 10^{-5} for d = 3, the obtained matrix product can be considered very accurate.
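The benchmark setup can be mimicked in dense form. The sketch below builds the 1/N-scaled exponential kernel on random points, perturbs it, and estimates the error of a (here: entrywise thresholded) product with a random probe matrix X ∈ R^{N×10}. The normalization ‖EX‖_F/‖X‖_F of the estimator e_F is an assumption, as the paper does not restate its exact definition here:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 400, 2
pts = rng.random((N, d))  # uniform random points in [0, 1]^d

# 1/N-scaled exponential kernel (Matern nu = 1/2, correlation length 1)
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
K = np.exp(-D) / N

# break symmetry: perturb every entry by 10% relative uniform noise
Kt = K * (1.0 + 0.1 * rng.random((N, N)))

# dense stand-in for the formatted product: threshold small product entries
P = K @ Kt
tau = 1e-5 / N
P_fmt = np.where(np.abs(P) > tau, P, 0.0)

# probe-based error estimate with a random N x 10 matrix (assumed normalization)
Xp = rng.random((N, 10))
e_F = np.linalg.norm((P_fmt - P) @ Xp) / np.linalg.norm(Xp)
```

Each dropped entry is at most τ in modulus, so the estimate is bounded by Nτ = 10^{-5}, mirroring the accuracy levels reported for the S-formatted product.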

S-formatted matrix inversion.
In order to assess the numerical performance of the matrix inversion, we again consider uniformly distributed random points on the unit hypercube [0, 1]^d. Since the separation radius q_X ranges between 4.7 · 10^{-5} (N = 5000) and 2.8 · 10^{-7} (N = 1 000 000) for d = 2, and between 3.8 · 10^{-4} (N = 5000) and 3.2 · 10^{-5} (N = 1 000 000) for d = 3, we do not expect K^η to be invertible. Therefore, we rather consider the regularized version K^η + µI for a ridge parameter µ > 0. As our theoretical results suggest that the inverse has the same a priori pattern as the matrix itself, we first consider the inversion on the a priori pattern for d = 2.
The left-hand side of Figure 3 shows the computation times for the inverse matrix employing Pardiso. The dashed lines show the asymptotic rates O(N^α) for α = 1, 1.5. For N = 1 000 000, due to the large number of nonzero entries, we use the block inversion with one subdivision. This explains the bump in the computation time caused by the formatted matrix multiplication. Apart from this, Pardiso perfectly exhibits the expected rate of N^{1.5}. The right-hand side of the plot shows the error e_F((K^η + µI)^{-1}(K^η + µI) − I) for the ridge parameters µ = 10^{-6}, 10^{-4}, 10^{-2}, where ·^{-1} denotes the selected inversion on the pattern of K^η. As expected, the error decreases significantly with increasing ridge parameter.
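The block inversion used for the largest problem sizes can be sketched in dense form via one 2×2 subdivision and the Schur complement (illustrative numpy code; the S-formatted version performs these steps with formatted products and selected inversion on the sparsity pattern):

```python
import numpy as np

def block_inverse(M, k):
    """Invert an SPD matrix via one 2x2 block subdivision.

    M = [[A, B], [B^T, D]] with A of size k x k; the inverse is assembled
    from A^{-1} and the inverse Schur complement S^{-1}, S = D - B^T A^{-1} B.
    """
    A, B, D = M[:k, :k], M[:k, k:], M[k:, k:]
    Ainv = np.linalg.inv(A)
    S = D - B.T @ Ainv @ B                     # Schur complement
    Sinv = np.linalg.inv(S)
    TL = Ainv + Ainv @ B @ Sinv @ B.T @ Ainv   # top-left block of M^{-1}
    TR = -Ainv @ B @ Sinv                      # top-right block of M^{-1}
    return np.block([[TL, TR], [TR.T, Sinv]])
```

Applying the formula recursively to A and S yields the multi-step block inversion mentioned in the text; each recursion level trades one large factorization for two smaller ones plus matrix products.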
As the a posteriori pattern typically contains significantly fewer entries, we also investigate the inversion on the a posteriori pattern. The corresponding results are shown in Figure 4. As can be seen on the left-hand side of the figure, the selected inversion now even exhibits a linear behavior, which is explained by the fixed threshold τ resulting in successively fewer entries for increasing N. On the other hand, the errors for the different ridge parameters, depicted on the right-hand side of the same figure, asymptotically exhibit the same behavior as in the a priori case.
Motivated by the results for d = 2, we consider only the inversion on the a posteriori pattern for d = 3. The corresponding results are shown in Figure 5. On the left-hand side of the figure, again the computation times are shown. The dashed lines show the asymptotic rates O(N^α) for α = 1, 2. Up to N = 100 000, the expected quadratic rate is perfectly matched. Due to the large number of nonzeros in the case d = 3, we have employed the block inversion with three recursion steps for N > 100 000, resulting in the peculiar linear behavior for the respective values in the graph. The errors depicted on the right-hand side show a behavior similar to the case d = 2, with a slightly reduced decay.

Table 1. Errors for the contour integral method for (K^η + µI)^{1/2}.

Finally, Table 2 shows the approximation error, measured by e_F, between exp(K^η) and its S-formatted approximation for different values of N. The true matrix exponential is estimated by a power series of length 30 directly applied to the random matrix X. Here, we found that the error starts to stagnate for more than 8 terms in the expansion. The largest eigenvalue satisfies ‖K^η‖₂ ≈ 0.337 (estimated by a Rayleigh quotient iteration with 50 iterations), which explains the rapid convergence. Note that we do not require any regularization here, as only matrix products are computed.

Table 2. Errors for the approximation of exp(K^η) by the power series of the exponential.

5.4. Gaussian process implicit surfaces.
We consider Gaussian process learning of implicit surfaces. In accordance with [56], we consider a closed surface S = ∂Ω of dimension d − 1, given as the 0-level set of a function f, which is modeled as a Gaussian process with prior mean zero. Then, given the data sites X of size N := |X| and the noisy measurements y = f(X) + ε, where ε ∼ N(0, µI), the posterior distribution at the evaluation sites Z ⊂ R³ is determined by, setting M := |Z|,

E[f(Z)|X, y] = K_ZX (K_XX + µI)^{-1} y,
Cov[f(Z)|X, y] = K_ZZ − K_ZX (K_XX + µI)^{-1} K_XZ.

The matrix K_ZX can efficiently be computed by using one samplet tree for Z and a second samplet tree for X, while (K_XX + µI)^{-1} can be computed as in the previous examples. Hence, the computation of the posterior mean E[f(Z)|X, y] is straightforward. For X, we use samplets with q + 1 = 4 vanishing moments, while samplets with q + 1 = 3 vanishing moments are applied for Z. Moreover, we use an a posteriori threshold of τ = 10^{-4}/N for K^η_ZX. Similarly, we can evaluate the covariance in samplet coordinates. However, the evaluation of the standard deviation, i.e., the square root of diag(Cov[f(Z)|X, y]), requires more care. Here, we just transform K_ZX with respect to the points in X and evaluate the diagonal, resulting in a computational cost of O(MN log N).
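In dense form, the posterior mean and pointwise standard deviation amount to a few lines of numpy. The following sketch uses synthetic data; the function f and both point sets are placeholders, and neither the samplet compression of K_ZX nor the selected inversion of K_XX + µI is modeled here:

```python
import numpy as np

rng = np.random.default_rng(5)

def kexp(P, Q):
    """Exponential kernel matrix between two point sets."""
    return np.exp(-np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1))

X = rng.random((200, 3))                    # data sites
Z = rng.random((50, 3))                     # evaluation sites
f = lambda P: np.sin(2 * np.pi * P[:, 0])   # placeholder implicit function
mu = 1e-6                                   # ridge / noise parameter
y = f(X) + np.sqrt(mu) * rng.standard_normal(200)

KXX = kexp(X, X)
KZX = kexp(Z, X)
C = KXX + mu * np.eye(200)

mean = KZX @ np.linalg.solve(C, y)                   # posterior mean
cov = kexp(Z, Z) - KZX @ np.linalg.solve(C, KZX.T)   # posterior covariance
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))      # pointwise std deviation
```

The two linear solves correspond to the applications of (K_XX + µI)^{-1} in the posterior formulas; in the paper they are realized on the compression pattern, which is what makes the M-point evaluation feasible for millions of grid points.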
The left panel in Figure 7 shows the initial setup. 240 data points with a value of −1 are located on a sphere within the point cloud, 15 507 points with a value of 0 are located at its surface, and 1200 points with a value of 1 are located on a box enclosing it. This results in N = 16 947 data points in total. The ridge parameter was set to µ = 2 · 10^{-5}. The conditional expectation and the standard deviation have been computed on a regular grid with M = 8 000 000 points. The middle panel in Figure 7 shows the 0-level set, while the right panel shows the standard deviation. As expected, the standard deviation is lowest close to the data sites (blue is small, red is large).

Conclusion
We have presented a sparse matrix algebra for kernel matrices in samplet coordinates. This algebra allows for the rapid addition, multiplication and inversion of (regularized) kernel matrices, operations which mimic the corresponding algebras of pseudodifferential operators. The proposed arithmetic operations extend to S-formatted, approximate representations of holomorphic functions of S-formatted approximations of self-adjoint operators, which are likewise realized in log-linear cost. While the addition is straightforward, we have derived an error and cost analysis for the multiplication and for the approximate evaluation of holomorphic operator functions in log-linear cost. The S-formatted approximate inversion is realized by selected inversion for sparse matrices, which also enables the computation of general matrix functions by the contour integral approach. The numerical benchmarks corroborate the theoretical findings for data sets in two and three dimensions. As a relevant example from computer graphics, we have considered Gaussian process learning for the computation of a signed distance function from scattered data.
We expect the presently developed fast kernel matrix algebra to impact various areas in machine learning and statistics where kernel-based approximations appear (e.g. [7], [34] and the references therein).

Appendix A. Pseudodifferential operators
We present basic definitions and terminology from the theory of pseudodifferential operators, in particular elements of the calculus of pseudodifferential operators, going back to Seeley [47, 48]. We adopt the notation for the statements of results on pseudodifferential operators from the monographs of Hörmander [28] and Taylor [50], but hasten to add that infinite smoothness of kernels in the corresponding operator calculi is not essential in the S-formatted matrix algebra, as the S-compression is based on the Calderón-Zygmund estimates (21) up to order q + 1. Note that then χ(ξ)a_r(x, ξ) ∈ S^r(Ω × R^d) for any smooth, nonnegative cut-off function χ which vanishes identically for ‖ξ‖₂ ≤ 1/2 and satisfies χ(ξ) ≡ 1 for ‖ξ‖₂ ≥ 1.
The set of all pseudodifferential operators A generated via (37) from a symbol a ∈ S^r(Ω × R^d) is denoted by OPS^r(Ω).
A symbol a ∈ S^r(Ω × R^d) is called a classical symbol of order r ∈ R if for every k ∈ N there exist functions a_{r−k}(x, ξ) ∈ S^{r−k}(Ω × R^d) such that a ∼ Σ_k a_{r−k} (in the sense of asymptotic expansions of symbols, compare [28]), where a_{r−k} is homogeneous of degree r − k, i.e., there holds a_{r−k}(x, tξ) = t^{r−k} a_{r−k}(x, ξ) for every t > 0 and every ξ ∈ R^d with ‖ξ‖₂ ≥ 1. As a consequence of the asymptotic expansion of a ∈ S^r_cl(Ω × R^d), for every α, β ∈ N^d and every compact set K ⊂ Ω there exists a constant c_{α,β}(K) > 0 such that (38) holds for every N ∈ N.

A.2. Calculus.
Pseudodifferential operators admit calculi which are crucial for the subsequent matrix arithmetic. We collect those properties of the calculi in S^r_cl(Ω × R^d) that are required throughout the article.

Figure 1. A scaling distribution on the coarsest scale (left) and samplets on levels 2 and 3 (second from the left to right).

Figure 6. Data points from a 3D scan of the head of Michelangelo's David. The scan is provided by the Statens Museum for Kunst under the Creative Commons CC0 license.

Figure 7. Left panel: data points for the surface reconstruction. Red corresponds to a value of 1, green to a value of 0 and blue to a value of −1. Middle panel: 0-level set of the posterior expectation evaluated at a regular grid. Right panel: standard deviation for the reconstruction (blue is small, red is large).

Proposition A.2 (Schwartz Kernel Theorem [27, Thm. 5.2.1]). Every distributional kernel κ ∈ D′(Ω₁ × Ω₂) induces, via (40), a continuous, linear map from D(Ω₂) to D′(Ω₁). Conversely, for every such continuous linear map K, there exists a unique distribution κ such that (40) holds. The distribution κ is called the (distributional) kernel of K. Via the Schwartz Kernel Theorem, every classical pseudodifferential operator A ∈ OPS^r_cl(Ω) with symbol a ∈ S^r_cl(Ω × R^d) can be written as a (distributional) integral operator with (distributional) Schwartz kernel κ_A. If the order r of the pseudodifferential operator A ∈ OPS^r_cl(Ω) is smaller than −d, its distributional kernel is continuous and satisfies (21).