Analysis of target data-dependent greedy kernel algorithms: Convergence rates for $f$-, $f \cdot P$- and $f/P$-greedy

Data-dependent greedy algorithms in kernel spaces are known to provide fast converging interpolants, while being extremely easy to implement and efficient to run. Despite this experimental evidence, no detailed theory has yet been presented. This situation is unsatisfactory, especially when compared to the case of the data-independent $P$-greedy algorithm, for which optimal convergence rates are available, despite its performance being usually inferior to that of target data-dependent algorithms. In this work we fill this gap by first defining a new scale of greedy algorithms for interpolation that comprises all the existing ones in a unified analysis, where the degree of dependency of the selection criterion on the functional data is quantified by a real parameter. We then prove new convergence rates where this degree is taken into account, and we show that, possibly up to a logarithmic factor, target data-dependent selection strategies provide faster convergence. In particular, for the first time we obtain convergence rates for target data adaptive interpolation that are faster than the ones given by uniform points, without the need for any special assumption on the target function. The rates are confirmed by a number of examples. These results are made possible by a new analysis of greedy algorithms in general Hilbert spaces.


Introduction
Kernel methods are a well-understood and widely used technique for approximation, regression and classification in machine learning and numerical analysis.
We start by collecting some notation and preliminary results, while more details are provided in Section 2. For a non-empty set $\Omega$, a kernel is defined as a symmetric function $k: \Omega \times \Omega \to \mathbb{R}$. The kernel matrix $A_{X_n}$ for a set of points $X_n = \{x_1, \dots, x_n\} \subset \Omega$ is given as $(A_{X_n})_{ij} = k(x_i, x_j) \in \mathbb{R}^{n \times n}$, $i, j = 1, \dots, n$. If the kernel matrix is strictly positive definite for any set $X_n \subset \Omega$ of $n$ distinct points, the kernel is called strictly positive definite. Associated to every strictly positive definite kernel there is a unique Reproducing Kernel Hilbert Space (RKHS) $H_k(\Omega)$ with inner product $\langle \cdot, \cdot \rangle_{H_k(\Omega)}$, which is also called the native space of $k$. It is a space of real-valued functions on $\Omega$ where the kernel $k$ acts as a reproducing kernel, that is,

1. $k(\cdot, x) \in H_k(\Omega)$ for all $x \in \Omega$,
2. $f(x) = \langle f, k(\cdot, x) \rangle_{H_k(\Omega)}$ for all $x \in \Omega$ and all $f \in H_k(\Omega)$ (reproducing property).
Strictly positive definite continuous kernels can be used for the interpolation of continuous functions. The theory is developed under the assumption that $f \in H_k(\Omega)$, and in this case for any set of pairwise distinct interpolation points $X_n \subset \Omega$ there exists a unique minimum-norm interpolant $s_n \in H_k(\Omega)$ that satisfies
$$s_n(x_i) = f(x_i), \quad i = 1, \dots, n. \qquad (1)$$
It can be shown that this interpolant is given by the orthogonal projection $\Pi_{V(X_n)}$ of $f$ onto the linear subspace $V(X_n) := \mathrm{span}\{k(\cdot, x_i),\ x_i \in X_n\}$, i.e.,
$$s_n(\cdot) = \Pi_{V(X_n)}(f) = \sum_{j=1}^n \alpha_j k(\cdot, x_j).$$
The coefficients $\alpha_j$, $j = 1, \dots, n$, can be calculated by solving the linear system of equations arising from the interpolation conditions in Eq. (1), which is always uniquely solvable due to the assumed strict positive definiteness of the kernel. A standard way of estimating the error between the function $f$ and the interpolant in the $\|\cdot\|_{L^\infty(\Omega)}$-norm makes use of the power function, which is given as
$$P_{X_n}(x) := \sup_{0 \neq f \in H_k(\Omega)} \frac{|(f - \Pi_{V(X_n)}(f))(x)|}{\|f\|_{H_k(\Omega)}} = \|k(\cdot, x) - \Pi_{V(X_n)}(k(\cdot, x))\|_{H_k(\Omega)}. \qquad (2)$$
Obviously it holds $P_{X_n}(x_i) = 0$ for all $i = 1, \dots, n$, and the standard power function estimate bounds the interpolation error as
$$|(f - s_n)(x)| \le P_{X_n}(x)\, \|f - s_n\|_{H_k(\Omega)} \le P_{X_n}(x)\, \|f\|_{H_k(\Omega)}, \quad x \in \Omega, \qquad (3)$$
where we denote the residual as $r_n := r_n(f) := f - s_n$. Observe that any worst-case error bound on $|(f - \Pi_{V(X_n)}(f))(x)|$ over the entire $H_k(\Omega)$ transfers to the same decay of the power function via the second equality in Eq. (2). For the large class of translational invariant kernels, which we will introduce below and which includes the notable class of Radial Basis Function (RBF) kernels, it is possible to refine this error estimate by bounding the decay of the power function in terms of the fill distance
$$h_{X_n} := h_{X_n, \Omega} := \sup_{x \in \Omega} \min_{x_j \in X_n} \|x - x_j\|_2.$$
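On a discrete evaluation set, both the interpolant and the power function can be computed directly from kernel matrices. The following Python sketch illustrates this (it is our own illustration, not code from the paper); the Gaussian kernel and all names are placeholder choices, and the sketch uses the standard closed form $P_{X_n}(x)^2 = k(x,x) - k_x^\top A_{X_n}^{-1} k_x$ with $(k_x)_j = k(x, x_j)$, which is equivalent to Eq. (2).

```python
# Illustrative sketch (not from the paper): kernel interpolant and power
# function on a discrete evaluation set, with a Gaussian kernel placeholder.
import numpy as np

def gauss_kernel(X, Y, eps=1.0):
    # k(x, y) = exp(-eps^2 ||x - y||^2), a strictly positive definite kernel
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-(eps ** 2) * d2)

def interpolant_and_power(X_n, f_vals, X_eval, kernel=gauss_kernel):
    """Solve A_{X_n} alpha = f(X_n) (interpolation conditions, Eq. (1)) and
    evaluate s_n and P_{X_n} on X_eval."""
    A = kernel(X_n, X_n)                       # (A_{X_n})_ij = k(x_i, x_j)
    alpha = np.linalg.solve(A, f_vals)         # coefficients of s_n
    K_ex = kernel(X_eval, X_n)                 # k(x, x_j) for all evaluation x
    s_n = K_ex @ alpha                         # s_n(x) = sum_j alpha_j k(x, x_j)
    # P^2(x) = k(x, x) - k_x^T A^{-1} k_x, clipped against round-off
    P2 = kernel(X_eval, X_eval).diagonal() - np.einsum(
        'ij,ji->i', K_ex, np.linalg.solve(A, K_ex.T))
    return s_n, np.sqrt(np.maximum(P2, 0.0))
```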
Depending on certain properties of the kernel, one may obtain in this way both algebraic and exponential rates in terms of $h_{X_n}$. Especially in the case of kernels whose RKHS is norm equivalent to a Sobolev space, these algebraic rates are provably quasi-optimal and may even be extended to certain functions that are outside of $H_k(\Omega)$ (see [17]).
These results are nevertheless limited by their dependence on the filling of the space and by their independence of the target function $f$. Namely, the fill distance decays at best like $h_{X_n, \Omega} \asymp c_\Omega n^{-1/d}$ for quasi-uniform points, which are space-filling and target-independent. On the other hand, a global target-dependent optimization of the interpolation points is a combinatorial and practically infeasible task, and thus approximate strategies have been proposed, in particular greedy algorithms.
Greedy algorithms in general are studied in various branches of mathematics, and we point to [30] for a general treatment of their use in approximation. In kernel interpolation, a greedy algorithm starts with the empty set $X_0 := \emptyset$ and adds points incrementally as $X_{n+1} := X_n \cup \{x_{n+1}\}$ according to some selection criterion $\eta^{(n)}$, that is
$$x_{n+1} := \operatorname*{arg\,max}_{x \in \Omega \setminus X_n} \eta^{(n)}(x).$$
Commonly used selection criteria in the greedy kernel literature are the $P$-greedy [3], $f \cdot P$-greedy [6], $f$-greedy [27], and $f/P$-greedy [14] criteria, and they choose the next point according to the following strategies. From now on we use the short-hand notation $P_n(\cdot) := P_{X_n}(\cdot)$ whenever the power function is determined by some greedy algorithm.
i. $P$-greedy: $\eta^{(n)}(x) = P_n(x)$,
ii. $f \cdot P$-greedy: $\eta^{(n)}(x) = |(f - s_n)(x)| \cdot P_n(x)$,
iii. $f$-greedy: $\eta^{(n)}(x) = |(f - s_n)(x)|$,
iv. $f/P$-greedy: $\eta^{(n)}(x) = |(f - s_n)(x)| / P_n(x)$.

These algorithms have been used in a series of applications (see e.g. [6, 10-12, 18, 22, 27-29]), and overwhelming numerical evidence points to the fact that criteria which incorporate a residual-dependent term provide faster convergence, even if sometimes at the price of stability (see [33] for a discussion of this fact for $f/P$-greedy, and [6] for $f \cdot P$-greedy).
The faster convergence is plausible, since adaptivity to the target function should clearly be beneficial to the convergence speed. Nevertheless, the theoretical results are of the opposite nature. Namely, for the $P$-greedy algorithm it is possible to prove quasi-optimality statements (see [21]): whatever the best known decay rate of the power function for arbitrarily optimized points is, it transfers to the same decay of the power function associated with the points selected by $P$-greedy. Especially in the case of Sobolev spaces, these results can be proven to be optimal [33]. On the other hand, the convergence theory for the target data-dependent algorithms is much weaker: The known results (see Section 2 for a detailed account) provide convergence of order at most $n^{-1/2}$, which not only largely misses the practical observations, but is in general even slower than the rates proven for $P$-greedy.
We remark that existing techniques to prove convergence of greedy algorithms in general Hilbert spaces are not directly transferable to this setting. Indeed, the first results on similar algorithms have been obtained for Matching Pursuit, and they work in finite dimensional spaces [2,13]. When transferred to the kernel setting (see [27]) they require a norm equivalence between the $H_k(\Omega)$-norm and the $\infty$-norm, which holds only for finite $n$. Subsequent general results on greedy algorithms (see [5]) require special assumptions on the target function, and the resulting rates are only of order $n^{-1/2}$. Another common strategy in the greedy literature makes use of the Restricted Isometry Property (see e.g. [1]), which in the kernel setting translates to the requirement that the smallest eigenvalue $\lambda_n$ of the kernel matrix is bounded away from zero uniformly in $n$. This is not the case here, since it is known that $\lambda_n \le \min_{1 \le j \le n} P_{X_n \setminus \{x_j\}}(x_j)^2$ (see [24]), and we will see later that a fast convergence to zero of the right-hand side of this inequality is the key of our analysis. Moreover, all these results prove convergence in the Hilbert space norm, which is generally too strong to obtain convergence rates, since the interpolation operator is an orthogonal projection in $H_k(\Omega)$. We work instead with the $\infty$-norm, which allows us to derive fast convergence, even if it introduces an additional difficulty since the norm of the error is not monotonically decreasing.
The paper is organized as follows. After recalling additional facts on kernel greedy interpolation in Section 2, we derive a new analysis of general greedy algorithms in general Hilbert spaces based on [4] (Section 3). Section 4 then introduces the scale of $\beta$-greedy algorithms and analyzes them in the kernel setting.
These results are combined in Section 5 to obtain precise convergence rates for the minimum error $e_{\min}(f, n) := \min_{1 \le i \le n} \|f - s_i\|_{L^\infty(\Omega)}$. This measure allows us to circumvent the non-monotonicity of the error, and we remark in particular that $e_{\min}(f, n) < \varepsilon$ for some $\varepsilon > 0$ means that an error smaller than $\varepsilon$ is achieved using at most $n$ points. As an exemplary result, we mention here the case where the rate of worst-case convergence in $H_k(\Omega)$ for a fixed set of $n$ interpolation points is $n^{-\alpha}$ for a given $\alpha > 0$. In this case, for $\beta \in [0, 1]$ we get new convergence rates of the form
$$e_{\min}(f, n) \le c \log(n)^{\alpha}\, n^{-\beta/2}\, n^{-\alpha}, \quad n \ge n_0 \in \mathbb{N},$$
with $c > 0$. These results prove in particular that the worst-case decay of the error that can be obtained in $H_k(\Omega)$ with a fixed sequence of points transfers to the $\beta$-greedy algorithms with an additional multiplicative factor of $\log(n)^{\alpha} n^{-\beta/2}$. In other words, adaptively selected points provide faster convergence than any fixed set of points.
Finally, Section 6 illustrates the results with analytical and numerical examples while the final Section 7 presents the conclusion and gives an outlook.

Background results on kernel interpolation
We recall additional required background information on kernel-based approximation and in particular on greedy kernel interpolation. For a more detailed overview we refer the reader to [7,9,31]. We remark that in this section no special attention is paid to the occurring constants, which may change from line to line.

Interpolation by translational invariant kernels
In many applications of interest the domain is a subset of the Euclidean space, i.e., $\Omega \subset \mathbb{R}^d$. In this case, a special kind of kernel is given by translational invariant kernels, i.e., kernels for which there exists a function $\Phi: \mathbb{R}^d \to \mathbb{R}$ with a continuous Fourier transform $\hat{\Phi}$ such that
$$k(x, y) = \Phi(x - y) \quad \text{for all } x, y \in \mathbb{R}^d.$$
We remark that the well-known radial basis function kernels are a particular instance of translational invariant kernels.
Depending on the decay of the Fourier transform of the function $\Phi$, two classes of translational invariant kernels can be distinguished:

1. We call the kernel $k$ a kernel of finite smoothness $\tau > d/2$, if there exist constants $c_\Phi, C_\Phi > 0$ such that
$$c_\Phi (1 + \|\omega\|_2^2)^{-\tau} \le \hat{\Phi}(\omega) \le C_\Phi (1 + \|\omega\|_2^2)^{-\tau}, \quad \omega \in \mathbb{R}^d.$$
The assumption $\tau > d/2$ is required in order to have a Sobolev embedding into $C^0(\Omega)$.
2. If the Fourier transform $\hat{\Phi}$ decays faster than any polynomial rate, the kernel is called infinitely smooth.
As mentioned in Section 1, for these two types of kernels it is possible to derive error estimates by bounding the decay of the power function in terms of the fill distance. We have the following:

1. For kernels of finite smoothness $\tau > d/2$, given appropriate conditions on the domain $\Omega \subset \mathbb{R}^d$ (e.g. Lipschitz boundary and interior cone condition), the native space $H_k(\Omega)$ is norm equivalent to the Sobolev space $W_2^\tau(\Omega)$. By making use of this connection, error estimates for kernel interpolation can be obtained by using Sobolev bounds [16,32] that give
$$\|P_{X_n}\|_{L^\infty(\Omega)} \le C h_{X_n, \Omega}^{\tau - d/2}. \qquad (4)$$
2. For kernels of infinite smoothness such as the Gaussian, the Multiquadric or the Inverse Multiquadric, we have
$$\|P_{X_n}\|_{L^\infty(\Omega)} \le C e^{-c / h_{X_n, \Omega}} \qquad (5)$$
if the domain $\Omega$ is a cube. We remark that these error estimates are not limited to these three exemplary kernels. We point to [31, Theorem 11.22], which states a sufficient condition in order to obtain this exponential kind of error estimate.
By considering well-distributed points such that $h_{X_n, \Omega} \le c_\Omega n^{-1/d}$, the bounds from Eq. (4) and (5) can be cast only in terms of the number of interpolation points $n$: inserting this fill distance decay into Eq. (4) yields
$$\|P_{X_n}\|_{L^\infty(\Omega)} \le \tilde{c}_1 n^{1/2 - \tau/d}, \qquad (6)$$
and correspondingly Eq. (5) yields an exponential bound of the form $\|P_{X_n}\|_{L^\infty(\Omega)} \le \tilde{c}_1 e^{-\tilde{c}_2 n^{1/d}}$.
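The fill distance entering these bounds is easy to estimate numerically on a dense discretization of $\Omega$. A minimal Python sketch (our own illustration, with hypothetical names):

```python
# Illustrative sketch: estimate h_{X_n, Omega} on a dense grid of Omega.
import numpy as np

def fill_distance(X_n, Omega_grid):
    # h = sup_{x in Omega} min_{x_j in X_n} ||x - x_j||_2, estimated on the grid
    d2 = ((Omega_grid[:, None, :] - X_n[None, :, :]) ** 2).sum(-1)
    return np.sqrt(d2.min(axis=1).max())
```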

Greedy kernel interpolation
We collect the motivation, a few properties, and the existing analysis of the four selection criteria introduced in Section 1:

i. $P$-greedy: The $P$-greedy algorithm is the best analyzed one of the four algorithms named above. It aims at minimizing the error for all functions in the native space simultaneously, which is done by greedily minimizing the upper error bound from Eq. (3), i.e. the power function. Thus, the selection criterion of the $P$-greedy algorithm is target data-independent. For the $P$-greedy algorithm it holds $P_n(x_{n+1}) = \|P_n\|_{L^\infty(\Omega)}$. Several results on the $P$-greedy algorithm have been derived in [21,33]:

(a) Corollary 2.2 in [21] showed convergence statements for the maximal power function value $\|P_n\|_{L^\infty(\Omega)}$ for radial basis function kernels, when $\Omega \subset \mathbb{R}^d$ has a Lipschitz boundary and satisfies an interior cone condition. It states
$$\|P_n\|_{L^\infty(\Omega)} \le \begin{cases} C n^{1/2 - \tau/d} & \text{for kernels of finite smoothness } \tau > d/2, \\ C e^{-c n^{1/d}} & \text{for kernels of infinite smoothness.} \end{cases}$$
Via the standard power function bound from Eq. (3), these bounds directly give bounds on the approximation error $\|f - s_n\|_{L^\infty(\Omega)}$. A few more details of the proof strategy of [21] will be recalled in Section 3.
(b) The paper [33] showed further results for the case of kernels of finite smoothness $\tau > d/2$: Theorem 12 in [33] showed that the decay rate of $\|P_n\|_{L^\infty(\Omega)}$ is sharp. The sequence of Theorems 15, 19 and 20 of [33] further established that the resulting sequence of points is asymptotically uniformly distributed under some mild conditions. These results implied (optimal) stability statements in [33, Corollary 22].
ii. $f$-greedy: The $f$-greedy algorithm aims at directly minimizing the residual by setting the currently largest residual value to zero, which is done by placing the next interpolation point at its location, i.e. it holds
$$|r_n(x_{n+1})| = \|r_n\|_{L^\infty(\Omega)}.$$
Existing results prove convergence of order $n^{-\ell/d}$ for suitably smooth kernels. As mentioned before, these convergence results do not reflect the approximation speed of $f$-greedy that can be observed in numerical investigations. Additionally, in [23] convergence of order $n^{-1/2}$ of the $H_k(\Omega)$-norm of the error is proven, but only under additional assumptions on $f$.
iii. $f/P$-greedy: The $f/P$-greedy selection aims at minimizing the native space error of the residual as much as possible, as can be seen from Eq. (7). We remark as a technical detail that the supremum of $|(f - s_n)(x)| / P_n(x)$ over $x \in \Omega \setminus X_n$ need not be attained, as exemplified in Example 6 of [33]; however, this can be alleviated by choosing the next point as a near-maximizer. An existing convergence analysis holds, however, only for a quite restricted set of functions $f$, and it has slightly been extended in [23].
iv. $f \cdot P$-greedy: The idea of the recently introduced $f \cdot P$-greedy algorithm is to combine the power function dependence and the target data dependence, in order to balance the stability of the $P$-greedy algorithm with the target data adaptivity of the $f$-greedy algorithm. No convergence results were given in the original publication [6].
In addition to the selection criteria, we remark that for a practical numerical implementation the greedy algorithms are stopped as soon as a predefined threshold (e.g. on the accuracy or on the numerical stability) is reached, or as soon as the interpolant is exact.
Finally, to analyze and implement these algorithms it is useful to consider the Newton basis $\{v_j\}_{j=1}^n$ of $V_n$ (see [15,18]), which is obtained by applying the Gram-Schmidt orthonormalization process to $\{k(\cdot, x_j),\ j = 1, \dots, n\}$, where $\{x_j,\ j = 1, \dots, n\}$ are the pairwise distinct points that are incrementally selected by the greedy procedure. We recall that we have
$$s_n = \sum_{j=1}^n \langle f, v_j \rangle_{H_k(\Omega)} v_j \ \ \text{with} \ \ \langle f, v_{j+1} \rangle_{H_k(\Omega)} = \frac{r_j(x_{j+1})}{P_j(x_{j+1})}, \qquad P_n(x)^2 = k(x, x) - \sum_{j=1}^n v_j(x)^2,$$
and consequently
$$\|f - s_n\|_{H_k(\Omega)}^2 = \|f\|_{H_k(\Omega)}^2 - \sum_{j=1}^n \langle f, v_j \rangle_{H_k(\Omega)}^2. \qquad (7)$$
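These identities lead to an incremental implementation in which the interpolant and the power function are updated with $O(mn)$ operations per step on a candidate grid of $m$ points. The following Python sketch is again only an illustration, not the authors' code; its interface matches the hypothetical `gauss_kernel`-style routine of the earlier sketch.

```python
# Illustrative sketch: Newton basis evaluation for a given (already selected)
# sequence of point indices on a candidate grid X.
import numpy as np

def newton_basis_interpolation(kernel, X, f_vals, idx_sel):
    """Evaluate the interpolant s_n and power function P_n on X for the
    selected indices idx_sel, using the recursions recalled above."""
    m = X.shape[0]
    V = np.zeros((m, len(idx_sel)))            # columns hold v_1, ..., v_n on X
    s = np.zeros(m)                            # current interpolant s_n on X
    P2 = kernel(X, X).diagonal().copy()        # P_0(x)^2 = k(x, x)
    for n, i in enumerate(idx_sel):
        p = np.sqrt(P2[i])                     # P_n(x_{n+1}) > 0 for a new point
        # v_{n+1}(x) = (k(x, x_{n+1}) - sum_j v_j(x_{n+1}) v_j(x)) / P_n(x_{n+1})
        v = (kernel(X, X[i:i + 1])[:, 0] - V[:, :n] @ V[i, :n]) / p
        c = (f_vals[i] - s[i]) / p             # <f, v_{n+1}> = r_n(x_{n+1}) / P_n(x_{n+1})
        s += c * v                             # s_{n+1} = s_n + <f, v_{n+1}> v_{n+1}
        P2 -= v ** 2                           # P_{n+1}(x)^2 = P_n(x)^2 - v_{n+1}(x)^2
        V[:, n] = v
    return s, np.sqrt(np.maximum(P2, 0.0))
```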

Analysis of greedy algorithms in an abstract setting
This section extends the abstract analysis of greedy algorithms in Hilbert spaces introduced in [4]. For this, let $H$ be a Hilbert space with norm $\|\cdot\| = \|\cdot\|_H$. Let $F \subset H$ be a compact subset and assume, for notational convenience only, that it holds $\|f\|_H \le 1$ for all $f \in F$. We consider greedy algorithms that select elements $f_0, f_1, \dots$, without yet specifying any particular selection criterion. We define $V_n := \mathrm{span}\{f_0, \dots, f_{n-1}\}$ and the following quantities, where $Y_n$ denotes an arbitrary $n$-dimensional subspace of $H$:
$$d_n := \inf_{Y_n \subset H} \sup_{f \in F} \mathrm{dist}(f, Y_n), \qquad \sigma_n := \sup_{f \in F} \mathrm{dist}(f, V_n), \qquad \nu_n := \mathrm{dist}(f_n, V_n). \qquad (8)$$
The quantities $d_n$ and $\sigma_n$ have already been used in [4], where $d_n$ is the Kolmogorov $n$-width of $F$, and we recall that the compactness of $F$ is equivalent to requiring that $\lim_n d_n = 0$ (see [19]). On the other hand, the newly introduced quantity $\nu_n$ does not seem to be an interesting quantity for the abstract setting in itself, and it has not been considered before. However, it will be the key quantity for our new analysis in the kernel setting in Section 4 and Section 5.
As we focus on Hilbert spaces, expressions like $\mathrm{dist}(f, V_n)$ can be computed via the orthogonal projector in $H$ onto $V_n$, which we denote as $\Pi_{V_n}$. We have the following elementary properties:

1. Estimates: $d_n \le \sigma_n$ and $\nu_n \le \sigma_n$ for all $n \in \mathbb{N}$.
2. Monotonicity and boundedness: $\sigma_{n+1} \le \sigma_n \le \sigma_0 \le 1$ for all $n \in \mathbb{N}$.
The paper [4] considers weak greedy algorithms that choose, for some fixed $0 < \gamma \le 1$, the elements $f_n$ such that
$$\mathrm{dist}(f_n, V_n) \ge \gamma\, \sigma_n = \gamma \sup_{f \in F} \mathrm{dist}(f, V_n), \qquad (9)$$
and shows that, roughly speaking, an asymptotic polynomial or exponential decay of $d_n$ yields a polynomial or exponential decay of $\sigma_n$, i.e., the weak greedy algorithms essentially realize the Kolmogorov widths up to multiplicative constants. We remark that this analysis includes the strong greedy algorithm, i.e., $\gamma = 1$.
In the following we show in Subsection 3.1 that comparable statements hold for $\nu_n$ even without using the selection rule of Eq. (9), i.e., even if the elements $f_0, f_1, \dots$ are chosen arbitrarily (e.g. randomly) within $F$.

Greedy approximation with arbitrary selection rules
We start by stating a simple modification of [4, Theorem 3.2] and a subsequent corollary.
Theorem 1. Consider a compact set F in a Hilbert space H, and a greedy algorithm that selects elements from F with an arbitrary selection rule.
We have the following inequalities between $\nu_n$, $\sigma_n$ and $d_n$ for any $N \ge 0$, $K \ge 1$, $1 \le m < K$:
$$\prod_{i=1}^{K} \nu_{N+i}^2 \le \left(\frac{K}{m}\right)^m \left(\frac{K}{K-m}\right)^{K-m} \sigma_{N+1}^{2m}\, d_m^{2(K-m)}.$$

Proof. The result is obtained by simply omitting the last step in the proof of Theorem 3.2 in [4]. Namely, we follow the original proof up to right before Eq. (3.4), i.e., up to the bound on the quantity $a_{N+i, N+i}^2$. Using the second-to-last equation on p. 459 in [4] and our definition of $\nu_n$, in our notation we have $a_{N+i, N+i} = \mathrm{dist}(f_{N+i}, V_{N+i}) = \nu_{N+i}$, and this gives the result. In the original paper an additional step in Eq. (3.4) is used to obtain a bound on $\sigma_n$ instead of $\nu_n$.
Similarly to the approach used in [4], in the following corollary we make suitable choices of N, K, m to specialize the result to the case of algebraically or exponentially decaying Kolmogorov widths.

Corollary 2.
Under the assumptions of Theorem 1 the following holds.

i) If $d_n \le C_0 n^{-\alpha}$ for all $n \in \mathbb{N}$ with some $\alpha, C_0 > 0$, then, with $\tilde{C}_0 := \max\{1, C_0\}$, there is $n_0 \in \mathbb{N}$ such that
$$\left(\prod_{i=n+1}^{2n} \nu_i\right)^{1/n} \le \tilde{C}_0 \sqrt{2}\, (2e)^{\alpha} \log(n)^{\alpha}\, n^{-\alpha}, \quad n \ge n_0. \qquad (10)$$
ii) If $d_n \le \tilde{C}_0 e^{-c_0 n^{\alpha}}$ for all $n \in \mathbb{N}$ with some $\alpha, c_0 > 0$ and $\tilde{C}_0 \ge 1$, then
$$\left(\prod_{i=n+1}^{2n} \nu_i\right)^{1/n} \le \sqrt{2}\, \tilde{C}_0\, e^{-c_1 n^{\alpha}}, \quad c_1 := 2^{-(2+\alpha)} c_0. \qquad (11)$$
Proof. First of all we observe that for $1 \le m < n$ we have $0 < x := m/n < 1$.
We use Theorem 1 for $N = K = n$ and any $1 \le m < n$, i.e. we have
$$\left(\prod_{i=n+1}^{2n} \nu_i\right)^{1/n} \le \left(\frac{n}{m}\right)^{\frac{m}{2n}} \left(\frac{n}{n-m}\right)^{\frac{n-m}{2n}} \sigma_{n+1}^{m/n}\, d_m^{(n-m)/n} \le \left(\frac{1}{x}\right)^{x/2} \left(\frac{1}{1-x}\right)^{(1-x)/2} d_m^{1-x},$$
where we took the $2n$-th root for the second line and used the monotonicity and boundedness of $(\sigma_n)_{n \in \mathbb{N}}$ in the last step, i.e. $\sigma_{n+1}^{m/n} \le \sigma_1^{m/n} \le 1$. In order to prove the statements i) and ii), we conclude now in two different ways:

i) For $n$ fixed we choose a fixed $0 < \omega \ll 1$ and define $m^* := \lceil \omega n \rceil \in \mathbb{N}$, i.e. $0 < x = m^*/n < 1$ with $x \approx \omega$. Inserting the algebraic decay $d_{m^*} \le C_0 (m^*)^{-\alpha}$ into the inequality above, bounding the two prefactors, and finally balancing the choice of $\omega$ (of the order $1/\log(n)$) yields the statement i) with the constant given in Remark 3.
ii) We pick $m = \lceil n/2 \rceil$ and make use of the assumed decay $d_n(F) \le \tilde{C}_0 e^{-c_0 n^{\alpha}}$ to estimate
$$\left(\prod_{i=n+1}^{2n} \nu_i\right)^{1/n} \le \sqrt{2}\, d_m^{(n-m)/n} \le \sqrt{2}\, \tilde{C}_0\, e^{-c_0 (n/2)^{\alpha} \frac{n-m}{n}} \le \sqrt{2}\, \tilde{C}_0\, e^{-c_1 n^{\alpha}},$$
where $c_1 := 2^{-(2+\alpha)} c_0$ and where we used $(n-m)/n \ge 1/4$ for $n \ge 2$, and this concludes the proof.
Remark 3. Observe that the constant $\tilde{C}_0 2^{\alpha + 1/2} e^{\alpha} = \tilde{C}_0 \sqrt{2}\, (2e)^{\alpha}$ in (10) is significantly smaller than the one obtained in [4] for the algebraic rate, which is $C_0 2^{5\alpha + 1}$. However, we have here instead the logarithmic factor in $n$, even if we presume that it may be possible to remove it with a finer analysis.

Analysis of greedy algorithms in the kernel setting
This section introduces and analyses the $\beta$-greedy algorithms, a scale of greedy algorithms which generalizes the $P$-, $f \cdot P$-, $f$- and $f/P$-greedy algorithms.
We work under the assumption that the kernel is normalized, i.e. $k(x, x) \le 1$ for all $x \in \Omega$, which in particular implies $\|P_n\|_{L^\infty(\Omega)} \le 1$ for all $n$.

A scale of greedy algorithms: β-greedy
We start with the definition of β-greedy algorithms.

Definition 4.
A greedy kernel algorithm is called a $\beta$-greedy algorithm with $\beta \in [0, \infty]$, if the next interpolation point is chosen as
$$x_{n+1} := \operatorname*{arg\,max}_{x \in \Omega \setminus X_n} |(f - s_n)(x)|^{\beta}\, P_n(x)^{1 - \beta}, \qquad (15)$$
where for $\beta = \infty$ the selection is understood as $x_{n+1} := \operatorname*{arg\,max}_{x \in \Omega \setminus X_n} |(f - s_n)(x)| / P_n(x)$.
As depicted in Figure 1, for $\beta = 0$ this is the $P$-greedy algorithm, for $\beta = 1/2$ it is the $f \cdot P$-greedy algorithm, and for $\beta = 1$ it is the $f$-greedy algorithm. In the limit $\beta \to \infty$ it makes sense to define the algorithm to be the $f/P$-greedy algorithm.
Observe that the $\beta$-greedy algorithms are well defined also for $1 < \beta < \infty$. Indeed, in this case $1 - \beta < 0$ and thus the power function part occurs as a divisor, and this may potentially be a problem since $P_n(x_i) = 0$ for all $1 \le i \le n$. Nevertheless, the standard power function estimate gives
$$|(f - s_n)(x)|^{\beta}\, P_n(x)^{1 - \beta} \le \left(P_n(x)\, \|f - s_n\|_{H_k(\Omega)}\right)^{\beta} P_n(x)^{1 - \beta} = P_n(x)\, \|f - s_n\|_{H_k(\Omega)}^{\beta},$$
so that the selection criterion extends continuously by zero to the interpolation points, and the maximization in Eq. (15) is well defined.

Figure 1: Visualization of the scale of the $\beta$-greedy algorithms on the real line. Several important cases, $\beta = 0$ ($P$-greedy), $\beta = 1/2$ ($f \cdot P$-greedy), $\beta = 1$ ($f$-greedy) and the limit $\beta \to \infty$ ($f/P$-greedy), are marked.

Remark 5 (Generalizations of the $\beta$-selection rule). We remark that it is sufficient to consider only one parameter $\beta > 0$ for the weighting of $|(f - s_n)(x)|$ and $P_n(x)$ as it was done in Eq. (15), in the sense that using two different parameters would be superfluous. Indeed, due to the strict monotonicity of the function $x \mapsto x^{1/\alpha}$ for $\alpha > 0$, it holds for any $\gamma \in \mathbb{R}$ that
$$\operatorname*{arg\,max}_{x \in \Omega \setminus X_n} |(f - s_n)(x)|^{\gamma}\, P_n(x)^{\alpha} = \operatorname*{arg\,max}_{x \in \Omega \setminus X_n} |(f - s_n)(x)|^{\gamma/\alpha}\, P_n(x),$$
which shows that only the ratio $\gamma/\alpha$ is decisive. The specific parametrization via $\beta$ and $1 - \beta$ in Eq. (15) was chosen in order to obtain $f/P$-greedy as the limit case $\beta \to \infty$.
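To make the scale concrete, the selection rule of Eq. (15) can be combined with the Newton basis update recalled in Section 2 into a single loop. The following Python sketch is an illustration under the assumption that $\Omega$ is replaced by a finite candidate set (a common practical discretization); it is not the authors' implementation. The values `beta = 0`, `0.5`, `1` and `np.inf` recover the $P$-, $f \cdot P$-, $f$- and $f/P$-greedy criteria.

```python
# Illustrative sketch: beta-greedy point selection on a finite candidate set.
import numpy as np

def beta_greedy(kernel, X, f_vals, beta, n_max, tol=1e-5):
    """Select points maximizing |r_n(x)|^beta * P_n(x)^(1 - beta) (Eq. (15));
    beta = np.inf is treated as the f/P-greedy limit."""
    m = X.shape[0]
    V = np.zeros((m, n_max))
    s = np.zeros(m)
    P2 = kernel(X, X).diagonal().copy()
    selected = []
    for n in range(n_max):
        P = np.sqrt(np.maximum(P2, 0.0))
        r = np.abs(f_vals - s)
        with np.errstate(divide='ignore', invalid='ignore'):
            if np.isinf(beta):
                crit = np.where(P > 0, r / P, -np.inf)
            else:
                crit = np.where(P > 0, r ** beta * P ** (1.0 - beta), -np.inf)
        i = int(np.argmax(crit))
        if P[i] < tol:                          # stopping criterion, cf. Section 2
            break
        p = P[i]
        v = (kernel(X, X[i:i + 1])[:, 0] - V[:, :n] @ V[i, :n]) / p
        s += (f_vals[i] - s[i]) / p * v         # Newton basis interpolant update
        P2 -= v ** 2
        V[:, n] = v
        selected.append(i)
    return np.array(selected), s
```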

Analysis of β-greedy algorithms
We can now prove the convergence of these algorithms. So far, the analysis of greedy kernel algorithms has mainly focused on estimates on $\|f - s_i\|_{L^\infty(\Omega)}$. Here and in the following, different quantities will be analyzed with the goal of bounding instead $\min_{i = n+1, \dots, 2n} \|f - s_i\|_{L^\infty(\Omega)}$. We remark that no requirements on the kernel $k$ or the set $\Omega$ are needed for the results of this section, and especially for Theorem 8, as the proofs are based solely on RKHS theory.
We start by proving a key technical statement for greedy kernel interpolation that provides a bound on the product of the residual terms $r_i := f - s_i$. This result holds independently of the strategy that is used to select the points, greedy or not.

Lemma 6. For any sequence $\{x_i\}_{i \in \mathbb{N}} \subset \Omega$ and any $f \in H_k(\Omega)$ it holds for all $n = 1, 2, \dots$ that
$$\left(\prod_{i=n+1}^{2n} |r_i(x_{i+1})|\right)^{1/n} \le n^{-1/2}\, \|r_{n+1}\|_{H_k(\Omega)} \left(\prod_{i=n+1}^{2n} P_i(x_{i+1})\right)^{1/n}. \qquad (16)$$

Proof. Let
$$R_n := \left(\prod_{i=n+1}^{2n} \frac{|r_i(x_{i+1})|}{P_i(x_{i+1})}\right)^{1/n}.$$
The geometric-arithmetic mean inequality gives
$$R_n^2 \le \frac{1}{n} \sum_{i=n+1}^{2n} \left(\frac{|r_i(x_{i+1})|}{P_i(x_{i+1})}\right)^2.$$
We now use Eq. (7) applied to $s_{2n+1}$ and $s_{n+1}$, and the properties of orthogonal projections, to obtain
$$\frac{1}{n} \sum_{i=n+1}^{2n} \left(\frac{|r_i(x_{i+1})|}{P_i(x_{i+1})}\right)^2 = \frac{1}{n} \left(\|r_{n+1}\|_{H_k(\Omega)}^2 - \|r_{2n+1}\|_{H_k(\Omega)}^2\right) \le \frac{1}{n}\, \|r_{n+1}\|_{H_k(\Omega)}^2.$$
It follows that $R_n \le n^{-1/2} \cdot \|r_{n+1}\|_{H_k(\Omega)}$, and thus Eq. (16).

In order to derive convergence statements in the $L^\infty(\Omega)$ norm based on Lemma 6, it is now required to find a relationship between $|r_i(x_{i+1})|$ and $\|r_i\|_{L^\infty(\Omega)}$. To this end, we have the following lemma for $\beta$-greedy algorithms. Observe that the sequence of points depends on the value of $\beta$, i.e. $x_n \equiv x_n^{(\beta)}$, but for notational convenience we drop the superscript.

Lemma 7. Let $f \in H_k(\Omega)$ and consider the $\beta$-greedy algorithm applied to $f$.

a) In the case of $\beta \in [0, 1]$:
$$\|r_i\|_{L^\infty(\Omega)} \le |r_i(x_{i+1})|^{\beta}\, P_i(x_{i+1})^{1 - \beta}\, \|r_i\|_{H_k(\Omega)}^{1 - \beta}. \qquad (17)$$
b) In the case of $\beta \in (1, \infty]$ with $1/\infty := 0$:
$$\|r_i\|_{L^\infty(\Omega)} \le |r_i(x_{i+1})|\, P_i(x_{i+1})^{1/\beta - 1}. \qquad (18)$$

Proof. We prove the two cases separately:

a) For $\beta = 0$, i.e. the $P$-greedy algorithm, this is the standard power function estimate in conjunction with the $P$-greedy selection criterion $P_n(x_{n+1}) = \|P_n\|_{L^\infty(\Omega)}$. For $\beta = 1$ this holds with equality, as it is simply the selection criterion of $f$-greedy, since we have here $|r_n(x_{n+1})| = \|r_n\|_{L^\infty(\Omega)}$. We thus consider $\beta \in (0, 1)$ and let $\tilde{x}_{i+1} \in \Omega$ be such that $|r_i(\tilde{x}_{i+1})| = \|r_i\|_{L^\infty(\Omega)}$. Then the selection criterion from Eq. (15) gives
$$|r_i(x_{i+1})|^{\beta}\, P_i(x_{i+1})^{1 - \beta} \ge |r_i(\tilde{x}_{i+1})|^{\beta}\, P_i(\tilde{x}_{i+1})^{1 - \beta},$$
and in particular
$$\|r_i\|_{L^\infty(\Omega)}^{\beta} \le |r_i(x_{i+1})|^{\beta}\, P_i(x_{i+1})^{1 - \beta}\, P_i(\tilde{x}_{i+1})^{\beta - 1}.$$
Using this bound with the standard power function estimate $\|r_i\|_{L^\infty(\Omega)} = |r_i(\tilde{x}_{i+1})| \le P_i(\tilde{x}_{i+1})\, \|r_i\|_{H_k(\Omega)}$, i.e. $P_i(\tilde{x}_{i+1})^{\beta - 1} \le \left(\|r_i\|_{L^\infty(\Omega)} / \|r_i\|_{H_k(\Omega)}\right)^{\beta - 1}$, gives
$$\|r_i\|_{L^\infty(\Omega)}^{\beta} \le |r_i(x_{i+1})|^{\beta}\, P_i(x_{i+1})^{1 - \beta}\, \|r_i\|_{L^\infty(\Omega)}^{\beta - 1}\, \|r_i\|_{H_k(\Omega)}^{1 - \beta}.$$
This can be rearranged for $\|r_i\|_{L^\infty(\Omega)}$ to yield the final result.

b) For $\beta \in (1, \infty)$, the selection criterion from Eq. (15) can be rearranged to
$$|r_i(x)| \le |r_i(x_{i+1})| \left(\frac{P_i(x)}{P_i(x_{i+1})}\right)^{(\beta - 1)/\beta} \le |r_i(x_{i+1})|\, P_i(x_{i+1})^{1/\beta - 1}, \quad x \in \Omega \setminus X_i,$$
where we used $P_i(x) \le 1$, and taking the supremum $\sup_{x \in \Omega \setminus X_i}$ gives the statement. For $\beta = \infty$, the selection criterion of the $f/P$-greedy algorithm can be directly rearranged to yield the statement (when using the notation $1/\infty = 0$).
Using the results of Lemma 7 as lower bounds on $|r_i(x_{i+1})|$, it is now possible to control the left-hand side of Inequality (16). This gives the main theorem of this section:

Theorem 8. Let $f \in H_k(\Omega)$ and consider the $\beta$-greedy algorithm applied to $f$.

a) In the case of $\beta \in [0, 1]$:
$$\left(\prod_{i=n+1}^{2n} \|r_i\|_{L^\infty(\Omega)}\right)^{1/n} \le n^{-\beta/2}\, \|r_{n+1}\|_{H_k(\Omega)} \left(\prod_{i=n+1}^{2n} P_i(x_{i+1})\right)^{1/n}. \qquad (19)$$
b) In the case of $\beta \in (1, \infty]$ with $1/\infty := 0$:
$$\left(\prod_{i=n+1}^{2n} \|r_i\|_{L^\infty(\Omega)}\right)^{1/n} \le n^{-1/2}\, \|r_{n+1}\|_{H_k(\Omega)} \left(\prod_{i=n+1}^{2n} P_i(x_{i+1})\right)^{1/(\beta n)}. \qquad (20)$$

Proof. We prove the two cases separately:

a) For $\beta = 0$, i.e. $P$-greedy, Eq. (17) gives
$$\|r_i\|_{L^\infty(\Omega)} \le P_i(x_{i+1})\, \|r_i\|_{H_k(\Omega)}.$$
Taking the product $\prod_{i=n+1}^{2n}$ and the $n$-th root, in conjunction with the estimate $\|r_i\|_{H_k(\Omega)} \le \|r_{n+1}\|_{H_k(\Omega)}$ for $i = n+1, \dots, 2n$, gives the result.
For $\beta \in (0, 1]$, we start by reorganizing the estimate (17) of Lemma 7 to get
$$|r_i(x_{i+1})| \ge \|r_i\|_{L^\infty(\Omega)}^{1/\beta}\, P_i(x_{i+1})^{(\beta - 1)/\beta}\, \|r_i\|_{H_k(\Omega)}^{(\beta - 1)/\beta},$$
and we use this to bound the left-hand side of Eq. (16) from below as
$$\left(\prod_{i=n+1}^{2n} \|r_i\|_{L^\infty(\Omega)}^{1/\beta}\, P_i(x_{i+1})^{(\beta - 1)/\beta}\, \|r_i\|_{H_k(\Omega)}^{(\beta - 1)/\beta}\right)^{1/n} \le n^{-1/2}\, \|r_{n+1}\|_{H_k(\Omega)} \left(\prod_{i=n+1}^{2n} P_i(x_{i+1})\right)^{1/n}.$$
Rearranging the factors, and using again the fact that $\|r_i\|_{H_k(\Omega)} \le \|r_{n+1}\|_{H_k(\Omega)}$ for $i = n+1, \dots, 2n$, gives
$$\left(\prod_{i=n+1}^{2n} \|r_i\|_{L^\infty(\Omega)}\right)^{1/(\beta n)} \le n^{-1/2}\, \|r_{n+1}\|_{H_k(\Omega)}^{1/\beta} \left(\prod_{i=n+1}^{2n} P_i(x_{i+1})\right)^{1/(\beta n)}.$$
Now, the inequality can be raised to the exponent $\beta$ to give the final statement.
b) For $\beta \in (1, \infty]$ we proceed similarly by first rewriting Eq. (18) of Lemma 7 as
$$|r_i(x_{i+1})| \ge \|r_i\|_{L^\infty(\Omega)}\, P_i(x_{i+1})^{1 - 1/\beta},$$
and we lower bound the left-hand side of Eq. (16) as
$$\left(\prod_{i=n+1}^{2n} \|r_i\|_{L^\infty(\Omega)}\, P_i(x_{i+1})^{1 - 1/\beta}\right)^{1/n} \le n^{-1/2}\, \|r_{n+1}\|_{H_k(\Omega)} \left(\prod_{i=n+1}^{2n} P_i(x_{i+1})\right)^{1/n}.$$
Dividing by $\left(\prod_{i=n+1}^{2n} P_i(x_{i+1})^{1 - 1/\beta}\right)^{1/n}$ leaves the power function product with the exponent $1/(\beta n)$, which gives the final result, using also $\|P_i\|_{L^\infty(\Omega)} \le 1$ for all $i = 0, 1, \dots$.

An improvement of the standard estimate
As an additional consequence of Lemma 7, the following Corollary 9 gives a new inequality that can be seen as an improved standard power function estimate, i.e. an improvement compared to the standard power function estimate from Eq. (3), which holds for any $\beta$-greedy algorithm.

Corollary 9. Let $f \in H_k(\Omega)$ and consider the $\beta$-greedy algorithm applied to $f$. Then it holds
$$\|r_i\|_{L^\infty(\Omega)} \le P_i(x_{i+1})^{\min(1, 1/\beta)}\, \|r_i\|_{H_k(\Omega)}, \quad i = 0, 1, \dots. \qquad (21)$$

The estimate from Eq. (21) is an improved estimate in comparison to Eq. (3), in that it provides a bound on $\|r_i\|_{L^\infty(\Omega)}$ instead of $|r_i(x_{i+1})|$, which is a strictly larger quantity except in the case of the $f$-greedy algorithm (i.e. $\beta = 1$), where they coincide. Moreover, for $\beta \in [0, 1]$ the right-hand sides of the estimates of Eq. (3) and (21) coincide, while for $\beta > 1$ this improvement comes at the price of a smaller exponent on the power function term, since $1/\beta < 1$.

Remark 10.
We will see in the following how to obtain convergence rates for the term $\min_{n+1 \le i \le 2n} \|r_i\|_{L^\infty(\Omega)}$. From a practitioner's point of view this kind of result might be unsatisfactory, as it is unclear which interpolant $s_i$ gives the best approximation. In this case it is possible to resort to the improved standard power function estimate of Corollary 9: This inequality suggests to pick $s_{i^*}$ with $i^* := \arg\min_{n+1 \le i \le 2n} P_i(x_{i+1})$.
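A minimal sketch of this selection, assuming the values $P_i(x_{i+1})$ were recorded during a greedy run (the helper name and data layout are hypothetical, not from the paper):

```python
# Illustrative helper for Remark 10: pick the iterate i* with the smallest
# recorded power function value P_i(x_{i+1}) among i = n+1, ..., 2n.
import numpy as np

def pick_best_iterate(P_at_selected, n):
    """P_at_selected[i] = P_i(x_{i+1}); requires len(P_at_selected) > 2n."""
    idx = np.arange(n + 1, 2 * n + 1)
    return idx[np.argmin(np.asarray(P_at_selected)[idx])]
```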

Convergence rates for greedy kernel interpolation
We can finally combine the abstract Hilbert space analysis from Section 3 and the greedy kernel interpolation analysis from Section 4, and apply them to concrete classes of kernels.
First of all, we recall a convenient connection that was established in [21] between the abstract analysis of [4] and kernel interpolation. We repeat it as we need to include also the extension of Section 3, i.e., the new quantity $\nu_n$. The goal is to frame the $\beta$-greedy algorithms as particular instances of the general greedy algorithm of Section 3. In this view we choose $H = H_k(\Omega)$ and $F = \{k(\cdot, x),\ x \in \Omega\}$. The fact that this set is compact is implied by the decay to zero of its Kolmogorov width, which is equivalent to the existence of a sequence of points such that the associated power function converges to zero (see Eq. (23) below). This choice means that $f = k(\cdot, x) \in F$ can be uniquely associated with an $x \in \Omega$ and vice versa. This yields a realization of the abstract greedy algorithm that produces a set of centers $X_n$, i.e. $V_n = V(X_n)$, and thus it is a greedy kernel algorithm with an appropriate selection rule. Table 1 summarizes these assignments. With these choices, as can be seen from the definition in Eq. (8), $\sigma_n$ is simply the maximal power function value and $\nu_n$ is the power function value at the selected point:
$$\sigma_n = \|P_n\|_{L^\infty(\Omega)}, \qquad \nu_n = P_n(x_{n+1}). \qquad (22)$$
Moreover, $d_n$ can be similarly bounded as
$$d_n \le \sup_{x \in \Omega} \mathrm{dist}\left(k(\cdot, x), V(X_n)\right) = \|P_{X_n}\|_{L^\infty(\Omega)} \quad \text{for any } X_n \subset \Omega, \qquad (23)$$
and thus any convergence statement on $\|P_{X_n}\|_{L^\infty(\Omega)}$ for a given set of points $X_n \subset \Omega$ gives via Eq. (23) a bound on $d_n$. Additionally, observe that the assumption $\|f\|_H \le 1$ for $f \in F$ means in the kernel setting that $\|k(\cdot, x)\|_{H_k(\Omega)} = \sqrt{k(x, x)} \le 1$ for all $x \in \Omega$, i.e. it is exactly the normalization assumed in Section 4.

Convergence rates for β-greedy algorithms
From Theorem 8 it is now easily possible to derive convergence statements and decay rates for the kernel greedy algorithms, by bounding the right-hand side via Inequality (2) and using the interpretations of $\nu_i$ and $d_n$ from Eq. (22) and Eq. (23).
Corollary 11. Let $f \in H_k(\Omega)$ and consider the $\beta$-greedy algorithm applied to $f$, with the notation $1/\infty := 0$ and $\min(\infty, 1) := 1$.

1. If the worst-case error of interpolation in $H_k(\Omega)$ on some fixed sequence of point sets decays as $C_0 n^{-\alpha}$ for some $\alpha, C_0 > 0$, then there are $c > 0$ and $n_0 \in \mathbb{N}$ such that for all $n \ge n_0$
$$\min_{i=n+1, \dots, 2n} \|f - s_i\|_{L^\infty(\Omega)} \le c\, n^{-\min(\beta, 1)/2} \left(\log(n)^{\alpha}\, n^{-\alpha}\right)^{\min(1, 1/\beta)} \|f\|_{H_k(\Omega)}.$$
2. If the worst-case error decays as $C_0 e^{-c_0 n^{\alpha}}$ for some $\alpha, c_0, C_0 > 0$, then there are $c, c_1 > 0$ such that
$$\min_{i=n+1, \dots, 2n} \|f - s_i\|_{L^\infty(\Omega)} \le c\, n^{-\min(\beta, 1)/2}\, e^{-c_1 \min(1, 1/\beta)\, n^{\alpha}}\, \|f\|_{H_k(\Omega)}.$$
3. For $\beta = \infty$ it holds, without any assumption on the power function decay, that
$$\min_{i=n+1, \dots, 2n} \|f - s_i\|_{L^\infty(\Omega)} \le n^{-1/2}\, \|f\|_{H_k(\Omega)}.$$

Proof. The proof is a simple combination of Corollary 2 and Theorem 8, with the addition of the following simple steps: First, the worst-case bounds in $H_k(\Omega)$ (either algebraic or exponential) imply the same bound on the power function via Eq. (2). Second, in all cases we use the results of Theorem 8 in combination with the bound
$$\min_{i=n+1, \dots, 2n} \|f - s_i\|_{L^\infty(\Omega)} \le \left(\prod_{i=n+1}^{2n} \|f - s_i\|_{L^\infty(\Omega)}\right)^{1/n}.$$
Then, Eq. (19) and (20) of Theorem 8 can be jointly written as
$$\left(\prod_{i=n+1}^{2n} \|r_i\|_{L^\infty(\Omega)}\right)^{1/n} \le n^{-\min(\beta, 1)/2}\, \|f\|_{H_k(\Omega)} \left(\prod_{i=n+1}^{2n} \nu_i\right)^{\min(1, 1/\beta)/n},$$
where we also used $\|r_{n+1}\|_{H_k(\Omega)} \le \|f\|_{H_k(\Omega)}$. Plugging the bounds of Corollary 2 into the last inequality gives the result of the first two points. The third point directly follows from Eq. (20) for $\beta = \infty$, due to $P_i(x_{i+1}) \le 1$ for all $i = 1, 2, \dots$.

Translational invariant kernels
Strictly positive definite and translational invariant kernels are popular kernels for applications. To specialize our results to this interesting case, in this subsection we use the following assumption.

Assumption 1. Let $k(x, y) = \Phi(x - y)$ be a strictly positive definite translational invariant kernel with associated reproducing kernel Hilbert space $H_k(\Omega)$, where the domain $\Omega \subset \mathbb{R}^d$ is assumed to be bounded with Lipschitz boundary and to satisfy an interior cone condition.
In this context we have the following special case of Corollary 11. To highlight the results in the most relevant cases, we state them only for $\beta \in \{0, 1/2, 1, \infty\}$, even if similar statements hold for general $\beta > 0$.
Corollary 12. Let Assumption 1 hold, let $f \in H_k(\Omega)$ and consider the $\beta$-greedy algorithm with $\beta \in \{0, 1/2, 1\}$ applied to $f$. Then there are $c, c_1 > 0$ and $n_0 \in \mathbb{N}$ such that for all $n \ge n_0$ the following holds:

1. In the case of kernels of finite smoothness $\tau > d/2$:
$$\min_{i=n+1, \dots, 2n} \|f - s_i\|_{L^\infty(\Omega)} \le c\, \log(n)^{\tau/d - 1/2}\, n^{1/2 - \tau/d}\, n^{-\beta/2}\, \|f\|_{H_k(\Omega)}. \qquad (25)$$
2. In the case of kernels of infinite smoothness:
$$\min_{i=n+1, \dots, 2n} \|f - s_i\|_{L^\infty(\Omega)} \le c\, n^{-\beta/2}\, e^{-c_1 n^{1/d}}\, \|f\|_{H_k(\Omega)}. \qquad (26)$$
For $\beta = \infty$ it holds in both cases that $\min_{i=n+1, \dots, 2n} \|f - s_i\|_{L^\infty(\Omega)} \le n^{-1/2}\, \|f\|_{H_k(\Omega)}$.

Observe that for any $\beta \in (0, 1]$ we have the additional convergence of order $n^{-\beta/2}$, or $n^{-1/2}$ for $\beta > 1$. The additional decay is faster for increasing $\beta \in (0, 1]$, i.e. increasing the weight of the target data-dependent term in the selection criterion gives better decay rates. In particular, the proven decay rate for $f$-greedy is better than the one for $f \cdot P$-greedy, which in turn is better than the one for $P$-greedy. This additional convergence proves in particular that the Kolmogorov barrier can be broken, i.e., approximation rates that are better than the ones provided by the Kolmogorov width can be obtained for any function in $H_k(\Omega)$. Indeed, as discussed above, any bound on $d_n$ turns into a bound on $\|P_n\|_{L^\infty(\Omega)}$, which can then be used in Corollary 11 or Corollary 12. This is particularly relevant for kernels whose RKHS is norm equivalent to a Sobolev space. But also other general kernels of low smoothness are of interest, since it might happen that the power function decays at an arbitrarily slow speed, while the adaptive points selected by a $\beta$-greedy algorithm provide an additional convergence rate.
Moreover, the additional decay for $\beta > 0$ is dimension-independent and thus does not incur the curse of dimensionality. This is of interest in particular for the translational invariant kernels of Corollary 12, as both the algebraic and the exponential decay of the power function (or Kolmogorov width) degrade with the dimension $d$, and thus the additional term gains more importance.
Despite this notable relevance, the estimates of Corollary 11 and Corollary 12 are likely not optimal in the algebraic case. Indeed, for kernels with algebraically decaying Kolmogorov width, in the case of the $P$-greedy algorithm ($\beta = 0$) bounds without the additional $\log(n)^{\alpha}$ factor are known [21]. We thus expect that the inconvenient additional $\log(n)^{\alpha}$ factor is not required for any of the $\beta$-greedy algorithms. We remark that this factor is related to the auxiliary parameter in the proof of Corollary 2, but we did not find a way to get rid of it, with the exception of $\beta = 0$, i.e. the $P$-greedy case. Moreover, we obtained our bounds by means of the worst-case bounds on $\left(\prod_{i=n+1}^{2n} P_i(x_{i+1})\right)^{1/n}$ from Corollary 2. Numerically, a faster decay than the worst-case bound from Corollary 2 can often be observed (see the examples in Subsection 6.1). Especially, for each $\beta$ value we obtain a different sequence of points and thus a different decay of the corresponding power function values.
Remark 13 (Additional convergence orders). Additional convergence orders can be obtained from the decay of $\|r_n\|_{H_k(\Omega)}$. Even if this quantity is in general decaying at arbitrarily slow speed for a general $f \in H_k(\Omega)$, we mention the case of superconvergence [25,26], which allows one to bound $\|r_i\|_{H_k(\Omega)} \le C_f \cdot \|P_i\|_{L^2(\Omega)}$ for special functions $f \in H_k(\Omega)$. The original superconvergence requirement $f = Tg$ (whereby $T$ is the kernel integral operator and $g \in L^2(\Omega)$) has been slightly relaxed recently (see [23, Theorem 19]).
Remark 14 (Stability). The stability of greedy interpolation, as computed here by the so-called direct method, is mainly linked to the smallest eigenvalue of the kernel interpolation matrix. A standard result [24] gives the upper estimate $\lambda_{\min}(A_{X_n}) \le P_{n-1}(x_n)^2$. In view of the estimates of Eq. (25) and (26), this means that a faster convergence based on a faster decay of the power function values $P_i(x_{i+1})$ directly negatively influences the stability. This holds especially for $\beta > 1$, because in this case the upper bound for the convergence in terms of the power function scales with the exponent $1/\beta < 1$.
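This trade-off can be monitored in practice. The following Python sketch (our own illustration, with hypothetical names) compares the smallest eigenvalue of the kernel matrix of the selected points with the upper bound $P_{n-1}(x_n)^2$ from [24]:

```python
# Illustrative check of the stability bound lambda_min(A_{X_n}) <= P_{n-1}(x_n)^2.
import numpy as np

def stability_check(kernel, X, selected, P_at_selected):
    """selected: indices of the chosen points; P_at_selected[i] = P_i(x_{i+1}),
    so P_at_selected[-1] is P_{n-1}(x_n) for the last added point."""
    A = kernel(X[selected], X[selected])
    lam_min = np.linalg.eigvalsh(A).min()      # smallest eigenvalue of A_{X_n}
    return lam_min, P_at_selected[-1] ** 2     # expect lam_min <= the bound
```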
Remark 15 (Other greedy selection rules). The analysis above shows that the $\gamma$-greedy algorithms which were introduced in [33] are actually closer to the $P$-greedy algorithm than to target data-dependent algorithms in the case of kernels of finite smoothness $\tau > d/2$. In this case, for $\gamma$-greedy algorithms the decay of $P_n(x_{n+1})$ can be both lower and upper bounded by a constant times $n^{1/2 - \tau/d}$. As the point selection criteria of $\gamma$-stabilized greedy algorithms first of all look at the power function value via $P_n(x_{n+1}) \ge \gamma \cdot \|P_n\|_{L^\infty(\Omega)}$, there is no relationship as in Eq. (15) ($\beta > 0$). Thus we cannot derive additional convergence rates.
Remark 16 (Other norms). For kernels of finite smoothness $\tau > d/2$ on a set $\Omega$ with Lipschitz boundary satisfying an interior cone condition, the optimal rates of $L^p$-convergence are of order
$$\|r_n\|_{L^p(\Omega)} \le c_p\, n^{-\tau/d + (1/2 - 1/p)_+}.$$
This rate is matched by the $P$-greedy algorithm (see [33, Corollary 22]), since it is proven to select asymptotically uniformly distributed points.
In the case of the $f$-greedy algorithm, we can use the additional factor $n^{-1/2}$ from Corollary 12 to get rid of the conversion from the $L^p$ to the $L^\infty$ norm, i.e. we have
$$\|r_n\|_{L^p(\Omega)} \le \mathrm{meas}(\Omega)^{1/p}\, \|r_n\|_{L^\infty(\Omega)} \le c_\infty \log(n)^{\tau/d - 1/2}\, n^{-\tau/d}.$$
So we have almost $L^p$-optimal results (up to the poly-logarithmic factor) for $p \in [1, 2]$, and even improved convergence for $p \in (2, \infty]$. Similar statements hold for general $\beta$-greedy algorithms.

Numerical examples

Decay of the power function quantities

As a first example we illustrate the results of Section 3 by comparing the decay of the quantities $\sigma_n$ and $\nu_n$ for four different ways of selecting points in a domain $\Omega \subset \mathbb{R}^3$:

i. $P$-greedy algorithm on the whole domain $\Omega$.
ii. Violet: $P$-greedy algorithm on the subdomain $\Omega_2 := \{x \in \Omega \mid (x)_3 = 1/2\}$. Like this, the dimension is effectively reduced from $d = 3$ to $d = 2$.
iii. Yellow, red: The points are randomly picked within Ω.
The results are displayed in Figure 2:

• The upper two figures display the quantities $\sigma_n = \|P_n\|_{L^\infty(\Omega)}$ (left) and $\nu_n = P_n(x_{n+1})$ (right).
• The lower two figures display the quantities $\left(\prod_{j=n+1}^{2n} \sigma_j\right)^{1/n}$ (left) and $\left(\prod_{j=n+1}^{2n} \nu_j\right)^{1/n}$ (right).

For the numerical experiments, the domain $\Omega$ was discretized using $2 \cdot 10^4$ random points, and $\Omega_2$ was discretized by projecting the random points related to $\Omega$ onto $\Omega_2$. The algorithms run until 300 points are selected or the next selected power function value satisfies $P_n(x_{n+1}) < 10^{-5}$. From the top left picture one can infer that the displayed quantity $\|P_n\|_{L^\infty(\Omega)}$ decays fastest for the $P$-greedy algorithm. This was expected, as this algorithm directly aims at minimizing this quantity. However, the displayed quantity $\|P_n\|_{L^\infty(\Omega)}$ does not drop at all for the $P$-greedy algorithm on $\Omega_2$, as it picks only points from $\Omega_2$ and thus does not fill $\Omega$.
In contrast, the top right picture shows that the displayed quantity $P_n(x_{n+1})$ decays faster for the $P$-greedy algorithm on $\Omega_2$, while for the $P$-greedy algorithm on $\Omega$ we have exactly the same curve as in the top left picture, due to $P_n(x_{n+1}) = \|P_n\|_{L^\infty(\Omega)}$. The two further point choices exhibit a wiggling, noisy behaviour of the displayed quantity $P_n(x_{n+1})$, which is related to the random point choice.
The two lower figures refer to the geometric means $\left(\prod_{j=n+1}^{2n} \,\cdot\,\right)^{1/n}$ of the quantities of the upper figures. In the lower left figure, we can see that only the curve related to the $P$-greedy algorithm on $\Omega$ decays fast; the other curves do not decay at all or only slowly, because the points are not chosen in a way that minimizes the maximal power function value $\|P_n\|_{L^\infty(\Omega)}$. Conversely, the $P$-greedy algorithm on $\Omega$ exhibits the slowest decay of the quantity $\left(\prod_{j=n+1}^{2n} \nu_j\right)^{1/n}$, which is the same curve as in the lower left figure due to $\nu_j = P_j(x_{j+1}) = \|P_j\|_{L^\infty(\Omega)} = \sigma_j$. However, all three other choices of points provide a faster decay of the displayed quantity $\left(\prod_{j=n+1}^{2n} P_j(x_{j+1})\right)^{1/n} = \left(\prod_{j=n+1}^{2n} \nu_j\right)^{1/n}$. The theoretical reason for (at least) the same decay as for the $P$-greedy algorithm on $\Omega$ was proven in Corollary 2.

Figure 2: The upper two plots display the quantities $\sigma_n = \|P_n\|_{L^\infty(\Omega)}$ (left) and $\nu_n = P_n(x_{n+1})$ (right). The lower two plots display the quantities $\left(\prod_{j=n+1}^{2n} \sigma_j\right)^{1/n} = \left(\prod_{j=n+1}^{2n} \|P_j\|_{L^\infty(\Omega)}\right)^{1/n}$ (left) and $\left(\prod_{j=n+1}^{2n} \nu_j\right)^{1/n} = \left(\prod_{j=n+1}^{2n} P_j(x_{j+1})\right)^{1/n}$ (right).
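The geometric-mean quantities of Figure 2 are easy to compute from the per-step records of a greedy run. A minimal sketch (our own illustration; it assumes recorded sequences such as those produced by the `beta_greedy` sketch of Section 4):

```python
# Illustrative computation of the geometric means shown in the lower plots:
# given values[i] (e.g. sigma_i or nu_i), return (prod_{j=n+1}^{2n} values[j])^(1/n).
import numpy as np

def running_geometric_means(values):
    v = np.asarray(values, dtype=float)
    n_max = (len(v) - 1) // 2                  # need indices up to 2n
    return np.array([np.exp(np.mean(np.log(v[n + 1: 2 * n + 1])))
                     for n in range(1, n_max + 1)])
```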

β-greedy algorithms using the Wendland kernel
We consider the application of $\beta$-greedy algorithms for the particular example of the Wendland $k = 0$ kernel on the domain $\Omega = [0, 1]$, which is defined as $k(x, y) = \max(1 - |x - y|, 0)$ and is thus a piecewise linear kernel. Its native space $H_k(\Omega)$ is norm equivalent to the Sobolev space $W_2^1(\Omega)$. It is immediate to see that kernel interpolation using the Wendland $k = 0$ kernel on centers $X_n \subset \Omega$ boils down to linear spline interpolation on the subinterval $[\min X_n, \max X_n] \subset [0, 1]$, while on $\Omega \setminus [\min X_n, \max X_n]$ the interpolant is still an affine function. We consider the function $f: \Omega \to \mathbb{R}$, $x \mapsto x^{\alpha}$ for some $1/2 < \alpha < 1$. For $\alpha > 1/2$ it holds $f \in W_2^1(\Omega)$, thus $f \in H_k(\Omega)$. It can be shown that in the case of asymptotically uniform interpolation points, i.e. $q_n \asymp h_n \asymp n^{-1}$, it is possible to lower-bound the error as (for details see Appendix A)
$$\|f - s_n\|_{L^\infty(\Omega)} \ge C_\alpha \cdot n^{-\alpha} \qquad (27)$$
for some $C_\alpha > 0$. Furthermore, independently of the way the interpolation points $X_n$ are chosen (i.e. even for optimally chosen points), it holds
$$\|f - s_n\|_{L^\infty(\Omega)} \ge C \cdot n^{-2} \qquad (28)$$
for some $C > 0$. Thus we can infer:

• Any (greedy) algorithm that yields asymptotically uniformly distributed points cannot have a convergence rate better than $n^{-\alpha}$ for this particular example. This includes especially the $P$-greedy algorithm, but also any $\gamma$-stabilized greedy algorithm [33], as they are known to provide asymptotically uniform points as well, see [33, Theorem 20]. Thus this example shows that $\gamma$-stabilized greedy algorithms cannot in general be expected to give a better approximation rate than the $P$-greedy algorithm (however, they were motivated by their use in the preasymptotic range). Especially for $\alpha \to 1/2$, the convergence rate can be arbitrarily close to $1/2$.
• For the $f$-greedy and $f \cdot P$-greedy algorithms we have a convergence of at least $\log(n)^{1/2} \cdot n^{-1}$ and $\log(n)^{1/2} \cdot n^{-3/4}$, respectively, according to Corollary 12, which is strictly better compared to the $P$-greedy algorithm.

Figure 3 visualizes the convergence of several $\beta$-greedy algorithms for the described setting. One can observe that the error for the $P$-greedy algorithm ($\beta = 0$) decays approximately according to $n^{-1/2}$, which is in accordance with Eq. (27). For the $f$-greedy algorithm ($\beta = 1$) the error seems to decay according to $n^{-2}$, which is the fastest possible decay rate according to Eq. (28). For all intermediate $\beta$ values one can observe intermediate convergence rates: For values of $\beta$ closer to 1, the error decays faster. The $f/P$-greedy algorithm ($\beta = \infty$) seems to give a convergence in between $n^{-1/2}$ and $n^{-2}$.
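The setting of this subsection is easy to reproduce. The following Python sketch is our own illustration (it reuses the hypothetical `beta_greedy` routine from the sketch in Section 4; the discretization and the constant $\alpha = 0.6$ are arbitrary choices):

```python
# Illustrative reproduction of the experiment: Wendland k = 0 kernel on [0, 1],
# target f(x) = x^alpha with 1/2 < alpha < 1, compared across beta values.
import numpy as np

def wendland0(X, Y):
    # k(x, y) = max(1 - |x - y|, 0): piecewise linear, native space ~ W^1_2
    return np.maximum(1.0 - np.abs(X[:, None, 0] - Y[None, :, 0]), 0.0)

alpha = 0.6                                    # any 1/2 < alpha < 1 works
X = np.linspace(0.0, 1.0, 2000).reshape(-1, 1)
f_vals = X[:, 0] ** alpha

for beta in [0.0, 0.5, 1.0, np.inf]:           # P-, f.P-, f- and f/P-greedy
    sel, s = beta_greedy(wendland0, X, f_vals, beta, n_max=100)
    err = np.abs(f_vals - s).max()             # discrete proxy for the L^inf error
    print(f"beta = {beta}: {len(sel)} points, error ~ {err:.2e}")
```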
We remark that this behaviour of the error decay depending on $\beta$ is not unique to the Wendland $k = 0$ kernel, but can also be observed for other kernels, domains and target functions $f$. This particular example was chosen because it is analytically possible to derive several explicit statements on convergence rates for asymptotically uniform and adapted points.

Conclusion and outlook
Using an abstract analysis of greedy algorithms in Hilbert spaces, it was shown that arbitrary point sequences (e.g. generated by arbitrary greedy kernel algorithms) yield certain decay rates for specific power function quantities. Based on these results and a refined analysis of greedy kernel interpolation, it was possible to investigate and prove convergence statements for a range of greedy kernel algorithms, including the target data-dependent $f$-, $f \cdot P$- and $f/P$-greedy algorithms. The provided techniques and results will likely lead to further advancements, e.g. in the field of kernel quadrature.
Several points remain open, and they will be addressed in future research. First, the proven decay rate for the $f/P$-greedy algorithm is still not satisfactory and is likely improvable. Moreover, the results are independent of the specific choice of the function $f \in H_k(\Omega)$. How can one make use of properties of that function? It would be desirable to conclude a faster decay of the quantity $\left(\prod_{i=n+1}^{2n} P_i(x_{i+1})\right)^{1/n}$ based on properties of the considered function $f \in H_k(\Omega)$. Finally, it is still unclear whether it is possible to derive general statements on the decay of $\|f - s_n\|_{H_k(\Omega)}$, and what the relationship is between this fact and superconvergence.