Worst-case optimal approximation with increasingly flat Gaussian kernels

We study worst-case optimal approximation of positive linear functionals in reproducing kernel Hilbert spaces induced by increasingly flat Gaussian kernels. This provides a new perspective on, and some generalisations of, the problem of interpolation with increasingly flat radial basis functions. When the evaluation points are fixed and unisolvent, we show that the worst-case optimal method converges to a polynomial method. In an additional one-dimensional extension, we also allow the points to be selected optimally and show that in this case convergence is to the unique Gaussian quadrature–type method that achieves the maximal polynomial degree of exactness. The proofs are based on an explicit characterisation of the reproducing kernel Hilbert space of the Gaussian kernel in terms of exponentially damped polynomials.


Introduction
Most popular kernels used in scattered data approximation [11,37] and Gaussian process regression [27] are isotropic (i.e., radial basis functions), depending only on the Euclidean distance ‖x − x′‖₂ between the points:

K_ℓ(x, x′) = Φ(‖x − x′‖₂ / ℓ)    (1.1)

for a continuous positive-definite function Φ : [0, ∞) → R and a length-scale parameter ℓ > 0. Given any function f : R^d → R evaluated at distinct points X = {x_1, …, x_N} ⊂ R^d, such a kernel can be used to construct a unique kernel interpolant based on the translates {K_ℓ(·, x_n)}_{n=1}^N. The kernel interpolant is

s_{ℓ,f,X}(x) = Σ_{n=1}^N f(x_n) u_{ℓ,n}(x),    (1.2)

where u_{ℓ,n} are the Lagrange cardinal functions that solve the linear system

[K_ℓ(x_m, x_n)]_{m,n=1}^N [u_{ℓ,1}(x), …, u_{ℓ,N}(x)]^T = [K_ℓ(x, x_1), …, K_ℓ(x, x_N)]^T    (1.3)

and satisfy u_{ℓ,n}(x_m) = δ_nm. Uniqueness of the solution for each x ∈ R^d is guaranteed by positive-definiteness of the kernel matrix on the left-hand side of this system. When ℓ → ∞, the kernel K_ℓ becomes increasingly flat and the linear system (1.3) increasingly ill-conditioned. Nevertheless, the corresponding kernel interpolant is typically well-behaved in this limit. Starting with the work of Driscoll and Fornberg [10], it has been shown that a certain unisolvency assumption on X implies that the kernel interpolant converges to (i) a polynomial interpolant if the kernel is infinitely smooth [10,13,19,20,31,32] or (ii) a polyharmonic spline interpolant if the kernel is finitely smooth [21,33]. Further generalisations appear in [22]. The former case covers kernels such as Gaussians, multiquadrics, and inverse multiquadrics, while the latter applies to, for example, Matérn kernels and Wendland's functions. Among the most interesting of these results is the one by Schaback [31], who proved that the interpolant at the increasingly flat limit of the Gaussian kernel exists regardless of the geometry of X and coincides with the de Boor and Ron polynomial interpolant [6,7]. Furthermore, the numerical ill-conditioning for large ℓ mentioned above has necessitated the development of techniques for stable evaluation of the kernel interpolant [5,12,14,38]. Increasingly flat kernels have also been discussed independently in the literature on the use of Gaussian processes for numerical integration [24,26,34], albeit accompanied only by non-rigorous arguments.
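As a concrete illustration of the phenomenon described above, the following sketch (our own numerical example, not code from the paper) interpolates f(x) = sin(x) with translates of the Gaussian kernel and prints the condition number of the kernel matrix for growing length-scales; the matrix degrades rapidly while the interpolant stays accurate at the data points.

```python
import numpy as np

# Minimal sketch: Gaussian-kernel interpolation at six points and the growth
# of the kernel matrix condition number as the length-scale ell increases.
def gaussian_kernel(x, y, ell):
    return np.exp(-(x - y) ** 2 / (2 * ell ** 2))

def interpolate(x_data, f_data, x_eval, ell):
    K = gaussian_kernel(x_data[:, None], x_data[None, :], ell)
    coef = np.linalg.solve(K, f_data)  # coefficients of the translates K_ell(., x_n)
    return gaussian_kernel(x_eval[:, None], x_data[None, :], ell) @ coef

x_data = np.linspace(-1.0, 1.0, 6)
f_data = np.sin(x_data)
for ell in (0.5, 2.0, 8.0):
    K = gaussian_kernel(x_data[:, None], x_data[None, :], ell)
    print(ell, np.linalg.cond(K))  # the condition number grows rapidly with ell
```
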
Even though the intuition that the lowest-degree terms in the Taylor expansion of the kernel dominate the construction of the interpolant as ℓ → ∞, and that this ought to imply convergence to a polynomial interpolant, is quite clear, it has not always been translated into transparent proofs. The purpose of this article is to generalise the aforementioned results on flat limits of kernel interpolants to worst-case optimal approximation of general positive linear functionals in the reproducing kernel Hilbert space (RKHS) of the Gaussian kernel

K_ℓ(x, x′) = exp(−‖x − x′‖₂² / (2ℓ²)).    (1.4)

That such generalisations are possible is perhaps not surprising; it is rather the simple proof technique made possible by the worst-case framework and an explicit characterisation [23] of the Gaussian RKHS that we find the most interesting aspect of the present work.

Worst-case optimal approximation
Let Ω be a subset of R^d with a non-empty interior and L : C(Ω) → R a positive linear functional acting on continuous real-valued functions defined on Ω and satisfying L[|p|] < ∞ for every polynomial p on Ω. The functionals most often discussed in this article are the point evaluation and integration functionals

L_x[f] = f(x)  and  L[f] = ∫_Ω f(x) dμ(x).    (1.5)

When restricted to Ω × Ω, the positive-definite kernel K_ℓ in (1.1) induces a unique reproducing kernel Hilbert space H(K_ℓ) ⊂ C(Ω) in which the reproducing property ⟨f, K_ℓ(·, x)⟩_{H(K_ℓ)} = f(x) holds for every x ∈ Ω and f ∈ H(K_ℓ). With minor modifications, everything in this section also holds when the kernel is not isotropic. A cubature rule with points X and weights w ∈ R^N approximates L by

Q_X(w)[f] = Σ_{n=1}^N w(n) f(x_n).    (1.6)

The worst-case error e_ℓ(Q_X(w)) of the cubature rule (1.6) in H(K_ℓ) is

e_ℓ(Q_X(w)) = sup_{‖f‖_{H(K_ℓ)} ≤ 1} | L[f] − Q_X(w)[f] |.    (1.7)

Given a fixed set of distinct points, we are interested in the kernel cubature rule Q_X(w*_ℓ) whose weights are chosen so as to minimise the worst-case error. These weights are unique and available as the solution to the linear system [25, Section 3]

[K_ℓ(x_m, x_n)]_{m,n=1}^N w*_ℓ = (L[K_ℓ(·, x_n)])_{n=1}^N.    (1.8)

Although our notation does not make this explicit, the weights obviously depend on the linear functional L and the evaluation points X. For each x ∈ R^d, the kernel interpolant s_{ℓ,f,X}(x) arises as the kernel cubature rule for approximation of the point evaluation functional L_x in (1.5), and the Lagrange functions are u_{ℓ,n}(x) = w*_ℓ(n). In this case, the worst-case error coincides with the power function [30]. For an arbitrary L, the kernel cubature rule can be obtained by applying L to the kernel interpolant: Q_X(w*_ℓ)[f] = L[s_{ℓ,f,X}]. That is, the weights are w*_ℓ(n) = L[u_{ℓ,n}].
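The system (1.8) is easy to realise numerically. The following sketch (our illustration; the choice L[f] = ∫₀¹ f(x) dx and all parameter values are assumptions made for the example) forms the right-hand side L[K_ℓ(·, x_n)] in closed form via the error function and solves for the worst-case optimal weights.

```python
import numpy as np
from math import erf, sqrt, pi

# Hedged sketch of the linear system (1.8) for the Gaussian kernel and the
# integration functional L[f] = \int_0^1 f(x) dx (our choice for illustration).
def optimal_weights(x, ell):
    K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * ell ** 2))
    # z_n = L[K_ell(., x_n)] = \int_0^1 exp(-(t - x_n)^2 / (2 ell^2)) dt
    z = np.array([ell * sqrt(pi / 2) * (erf((1 - xn) / (sqrt(2) * ell))
                                        + erf(xn / (sqrt(2) * ell))) for xn in x])
    return np.linalg.solve(K, z)

x = np.linspace(0.0, 1.0, 5)
w = optimal_weights(x, ell=0.4)
# the resulting kernel cubature rule integrates smooth functions accurately;
# the exact integral of cos over [0, 1] is sin(1)
print(w @ np.cos(x), np.sin(1.0))
```
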

Contributions
Recall that we only consider the Gaussian kernel (1.4). This article contains two main theoretical contributions:

- In Section 2, we prove that if X is unisolvent with respect to a full polynomial space Π_m and N = dim Π_m, then Q_X(w*_ℓ) converges (as ℓ → ∞) to the unique cubature rule Q_X(w_pol) that satisfies Q_X(w_pol)[p] = L[p] for every polynomial p of degree at most m. This result, contained in Theorem 2.2 and Corollary 2.1, is a generalisation to arbitrary positive linear functionals of the interpolation results cited earlier. If Ω is bounded, the results hold for any positive linear functional satisfying the mild assumptions imposed earlier. However, boundedness of Ω is not necessary: at the end of Section 2, we supply an example involving integration over R^d with respect to the Gaussian measure.
- In Section 3, we present a generalisation, based on a theorem of Barrow [2], for optimal kernel quadrature rules [25, Chapter 5] that have both their points and weights selected so as to minimise the worst-case error. The result, Theorem 3.2, states that such rules, if unique, converge to the N-point Gaussian quadrature rule for the functional L, which is the unique N-point quadrature rule exact for every polynomial p of degree at most 2N − 1. This partially settles a conjecture posed by O'Hagan [26, Section 3.3], further discussed in [24,34], on convergence of optimal kernel quadrature rules to Gaussian quadrature rules.
Some generalisations for other kernels and cubature rules of more general form than (1.6) are briefly discussed in Section 4.

Fixed points
The following theorem, which provides a characterisation of the RKHS of the Gaussian kernel (1.4), is the central tool of this article. This result is due to Steinwart [36] and Minh [23]; see also [35, Section 4.4] and [8, Example 3]. In this theorem (and the remainder of the article), N^d_0 stands for the collection of d-dimensional non-negative multi-indices α = (α_1, …, α_d), for which |α| = α_1 + ⋯ + α_d and α! = α_1! ⋯ α_d!.

Theorem 2.1 (Steinwart 2006; Minh 2010) Let Ω be a subset of R^d with a non-empty interior. Then, the RKHS H(K_ℓ) induced by the Gaussian kernel (1.4) with length-scale ℓ > 0 consists of the functions

f(x) = Σ_{α ∈ N^d_0} f_α x^α exp(−‖x‖₂² / (2ℓ²))    (2.1)

for which the norm ‖f‖²_{H(K_ℓ)} = Σ_{α ∈ N^d_0} ℓ^{2|α|} α! f_α² is finite. Furthermore, the collection

{ (ℓ^{2|α|} α!)^{−1/2} x^α exp(−‖x‖₂² / (2ℓ²)) : α ∈ N^d_0 }    (2.2)

of functions forms an orthonormal basis of H(K_ℓ).
Two crucial implications of this theorem are that H(K_ℓ) consists of functions expressible as series of exponentially damped polynomials, the damping effect vanishing as ℓ → ∞, and that, due to the terms ℓ^{2|α|} appearing in the RKHS norm, the high-degree terms contribute the most to the norm. Consequently, the worst-case error (1.7), which takes into account only functions of at most unit norm, is dominated by low-degree terms when ℓ is large. The rest of this section formalises this intuition.
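The orthonormal expansion behind this intuition can be checked numerically. In the one-dimensional sketch below (our own verification; the name e_n for the basis functions is our notation), the damped monomials e_n(x) = x^n exp(−x²/(2ℓ²)) / (ℓ^n √(n!)) reproduce the Gaussian kernel through K_ℓ(x, y) = Σ_n e_n(x) e_n(y).

```python
from math import exp, factorial

# Truncated orthonormal expansion of the one-dimensional Gaussian kernel in
# damped monomials, as characterised by Theorem 2.1 (illustrative sketch).
def basis(n, x, ell):
    return x ** n * exp(-x ** 2 / (2 * ell ** 2)) / (ell ** n * factorial(n) ** 0.5)

def kernel(x, y, ell):
    return exp(-(x - y) ** 2 / (2 * ell ** 2))

x, y, ell = 0.3, -0.7, 1.2
series = sum(basis(n, x, ell) * basis(n, y, ell) for n in range(30))
print(series, kernel(x, y, ell))  # the truncated series matches the kernel
```
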
Let Π_m ⊂ C(Ω) stand for the space of d-variate polynomials of degree at most m ∈ N_0:

Π_m = span{ x^α : α ∈ N^d_0, |α| ≤ m }.

In this section, we assume that the point set X = {x_1, …, x_N} ⊂ Ω consists of N = dim Π_m distinct points and is Π_m-unisolvent: the zero function is the only element of Π_m that vanishes on X. This is equivalent to non-singularity of the (generalised) Vandermonde matrix

P_Π = [x_n^{α_i}]_{n,i=1}^N,    (2.3)

where {α_1, …, α_N} is an enumeration of the multi-indices with |α_i| ≤ m. Unisolvency thus guarantees that there is a unique polynomial cubature rule Q_X(w_pol) exact on Π_m. Its weights solve the linear system P_Π^T w_pol = L_Π of N equations, where the N-vector L_Π has the elements L_Π(i) = L[x^{α_i}]. In this section, we prove that the worst-case optimal weights w*_ℓ for the Gaussian kernel (1.4) converge to w_pol as ℓ → ∞.
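A concrete one-dimensional instance of the system P_Π^T w_pol = L_Π (our example, not from the paper): with X = {0, 1/2, 1}, Π_2 spanned by {1, x, x²}, and L[f] = ∫₀¹ f(x) dx, the unique polynomial weights are the Simpson weights.

```python
import numpy as np

# Solve P_Pi^T w_pol = L_Pi for d = 1, m = 2, X = {0, 1/2, 1} and
# L[f] = \int_0^1 f(x) dx; the solution is the Simpson rule (1/6, 2/3, 1/6).
x = np.array([0.0, 0.5, 1.0])
P = np.vander(x, 3, increasing=True)          # Vandermonde matrix with rows (1, x_n, x_n^2)
L_pi = np.array([1.0, 1.0 / 2.0, 1.0 / 3.0])  # moments L[x^k] for k = 0, 1, 2
w_pol = np.linalg.solve(P.T, L_pi)
print(w_pol)  # -> approximately [0.1667, 0.6667, 0.1667]
```
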
Define

φ_α(x) = x^α exp(−‖x‖₂² / (2ℓ²)),    (2.4)

so that functions in the Gaussian RKHS, characterised by Theorem 2.1, are of the form f(x) = Σ_{α ∈ N^d_0} f_α φ_α(x) for coefficients f_α decaying sufficiently fast. Since the exponential function has no real roots, the determinant of the matrix

P_{φ,ℓ} = [φ_{α_i}(x_n)]_{n,i=1}^N    (2.5)

is non-zero, and P_{φ,ℓ} is hence non-singular. From non-singularity, it follows that there are unique weights w_{φ,ℓ} such that Q_X(w_{φ,ℓ})[φ_{α_i}] = L[φ_{α_i}] for every i = 1, …, N. The weights solve P_{φ,ℓ}^T w_{φ,ℓ} = L_{φ,ℓ}, where the N-vector L_{φ,ℓ} has the elements L_{φ,ℓ}(i) = L[φ_{α_i}].² This auxiliary cubature rule plays an important role in our argument. To summarise, the following three sequences of weights appear in the proofs below:

1. The weights w*_ℓ, solved from (1.8), are the worst-case optimal weights for the Gaussian kernel (1.4). The results concern the behaviour of these weights as ℓ → ∞.
2. The weights w_pol are constructed such that the cubature rule defined by them is exact for all polynomials up to degree m: Q_X(w_pol)[p] = L[p] for every p ∈ Π_m.
3. The weights w_{φ,ℓ} are constructed such that the cubature rule defined by them is exact for the damped polynomials: Q_X(w_{φ,ℓ})[φ_{α_i}] = L[φ_{α_i}] for every i = 1, …, N.

Lemma 2.1 Suppose that lim_{ℓ→∞} L[φ_α] = L[x^α] for every α ∈ N^d_0 with |α| ≤ m. Then lim_{ℓ→∞} w_{φ,ℓ} = w_pol.

Proof The assumption lim_{ℓ→∞} L[φ_α] = L[x^α] and unisolvency of X imply that lim_{ℓ→∞} w_{φ,ℓ} = w_pol. Because L[|p|] < ∞ for any polynomial p, both the weights w_pol and w_{φ,ℓ} are finite, which implies the claim.
² See [12] for an interpolation method based on a closely related basis derived from a Mercer eigendecomposition of the Gaussian kernel, and [17] for an explicit construction of weights similar to w_{φ,ℓ} in the case where L is the Gaussian integral.
Lemma 2.2 Let (w_ℓ) be any weights such that lim_{ℓ→∞} Q_X(w_ℓ)[x^{α_i}] = L[x^{α_i}] for every i = 1, …, N. Then lim_{ℓ→∞} w_ℓ = w_pol.

Proof We have P_Π^T w_pol = L_Π and

‖L_Π − P_Π^T w_ℓ‖₂ ≤ Σ_{i=1}^N | L[x^{α_i}] − Q_X(w_ℓ)[x^{α_i}] |,

where each of the terms on the right-hand side vanishes as ℓ → ∞. Because ‖L_Π − P_Π^T w_ℓ‖₂ = ‖P_Π^T (w_pol − w_ℓ)‖₂ and P_Π is non-singular, we conclude that lim_{ℓ→∞} w_ℓ = w_pol.
We are now ready to prove the main result of the article for a fixed Π_m-unisolvent point set X ⊂ Ω consisting of N distinct points. The proof proceeds in two steps. First, by considering one of the basis functions (2.2), we show that the worst-case error of Q_X(w*_ℓ) is bounded from below in terms of its error for each φ_α. Second, the sub-optimal cubature rule Q_X(w_{φ,ℓ}) defined above can be used, in combination with (2.1), to establish the upper bound e_ℓ(Q_X(w*_ℓ)) ≤ C ℓ^{−(m+1)}. These two bounds imply that |L[x^α] − Q_X(w*_ℓ)[x^α]| → 0 for every |α| ≤ m, from which the claim follows.

Proof For every α ∈ N^d_0, select the function

g_α = (ℓ^{2|α|} α!)^{−1/2} φ_α.

From Theorem 2.1, it follows that ‖g_α‖²_{H(K_ℓ)} = 1 since g_α is one of the basis functions (2.2). Thus, by definition of the worst-case error,

e_ℓ(Q_X(w*_ℓ)) ≥ | L[g_α] − Q_X(w*_ℓ)[g_α] |.    (2.7)

Next, we derive an appropriate upper bound on e_ℓ(Q_X(w*_ℓ)) by considering the unique sub-optimal cubature rule Q_X(w_{φ,ℓ}) that is exact for every φ_α with |α| ≤ m. In the expansion (2.1) of a function in H(K_ℓ), we have L[φ_α] = Q_X(w_{φ,ℓ})[φ_α] for every term with |α| ≤ m. Consequently, the worst-case error is bounded by the contribution of the high-degree terms:

e_ℓ(Q_X(w_{φ,ℓ})) ≤ sup_{‖f‖_{H(K_ℓ)} ≤ 1} Σ_{|α| > m} |f_α| ( L[|φ_α|] + Q_X(w_{φ,ℓ})[|φ_α|] ),

where f_α are the coefficients that define f in (2.1) and to which assumption (2.6) applies. Moreover, because

max_{n=1,…,N} |φ_α(x_n)| ≤ C_X^{|α|}

for some C_X > 0 and every ℓ, we have

e_ℓ(Q_X(w_{φ,ℓ})) ≤ C_Q ℓ^{−(m+1)}    (2.8)

when ℓ ≥ ℓ_0, where C_Q < ∞ follows from convergence of the last term and Lemma 2.1. Since Q_X(w*_ℓ) is worst-case optimal, we have thus established with (2.7) and (2.8) that, for sufficiently large ℓ,

| L[φ_α] − Q_X(w*_ℓ)[φ_α] | ≤ C (ℓ^{2|α|} α!)^{1/2} ℓ^{−(m+1)}

for every α ∈ N^d_0 such that |α| ≤ m and a constant C independent of ℓ. That is,

lim_{ℓ→∞} Q_X(w*_ℓ)[x^α] = L[x^α] for every |α| ≤ m.    (2.9)

The claim then follows by setting w_ℓ = w*_ℓ in Lemma 2.2.
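The convergence proved above is easy to observe numerically. The sketch below (our own illustration, with L[f] = ∫₀¹ f(x) dx and X = {0, 1/2, 1}, both assumptions of the example) shows the worst-case optimal weights approaching the polynomial (Simpson) weights as ℓ grows.

```python
import numpy as np
from math import erf, sqrt, pi

# Numerical illustration of Theorem 2.2 in one dimension: for growing ell,
# the worst-case optimal weights for L[f] = \int_0^1 f(x) dx at X = {0, 1/2, 1}
# approach the Simpson weights (1/6, 2/3, 1/6).
def optimal_weights(x, ell):
    K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * ell ** 2))
    z = np.array([ell * sqrt(pi / 2) * (erf((1 - t) / (sqrt(2) * ell))
                                        + erf(t / (sqrt(2) * ell))) for t in x])
    return np.linalg.solve(K, z)

x = np.array([0.0, 0.5, 1.0])
w_simpson = np.array([1.0, 4.0, 1.0]) / 6.0
for ell in (1.0, 2.0, 4.0):
    # the deviation from the Simpson weights shrinks as ell grows
    print(ell, np.max(np.abs(optimal_weights(x, ell) - w_simpson)))
```
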
Assumptions of Theorem 2.2 hold, for instance, if the domain Ω is bounded.

Proof On a bounded domain, the convergence lim_{ℓ→∞} L[φ_α] = L[x^α] follows from the dominated convergence theorem, and assumption (2.6) is also satisfied because

Σ_{α ∈ N^d_0} (ℓ^{2|α|} α!)^{−1/2} |a_α| |φ_α(x)| ≤ Σ_{α ∈ N^d_0} (ℓ^{2|α|} α!)^{−1/2} |a_α| β^α,

where β = (b, …, b) ∈ R^d for b = sup_{z ∈ Ω} ‖z‖₂, and finiteness follows from the assumption L[1] < ∞.
However, boundedness of Ω is not necessary. Consider the Gaussian integration functional

L[f] = ∫_{R^d} f(x) (2π)^{−d/2} exp(−‖x‖₂² / 2) dx.

The convergence lim_{ℓ→∞} L[φ_α] = L[x^α] as ℓ → ∞ follows from the monotone convergence theorem. To verify (2.6), recall the absolute moments of the standard Gaussian distribution, E|X|^n = 2^{n/2} Γ((n+1)/2) / √π for n ∈ N_0. These moments grow slowly enough that the series appearing in (2.6) converges, and (2.6) therefore holds for this functional.

Optimal points in one dimension
Let d = 1 and Ω = [a, b] for a < b. In this section, we consider quadrature rules whose points are also selected so as to minimise the worst-case error. A kernel quadrature rule Q*_ℓ = Q_{X*_ℓ}(w*_ℓ) is optimal if its points and weights satisfy

(X*_ℓ, w*_ℓ) ∈ argmin_{X ⊂ [a,b], w ∈ R^N} e_ℓ(Q_X(w)).

In order to eliminate degrees of freedom in ordering the points, we require that the points are in ascending order (i.e., x_n ≤ x_{n+1}). Even though optimal kernel quadrature rules have been studied since the 1970s [1,3,18,28,29] for the integration functional L[f] = ∫_a^b f(x) ω(x) dx with ω(x) > 0, their theory is still not complete (the main results have recently been collated by Oettershagen [25, Section 5.1]). Although uniqueness results have been proved only for totally positive isotropic kernels of the form (1.1) and integration with ω ≡ 1 [4], there exists numerical evidence suggesting that the optimal rule is unique in more general settings [25, p. 97]. Note that the Gaussian kernel (1.4) we consider is totally positive. In Theorem 3.2, we show that uniqueness of an optimal kernel quadrature rule for each ℓ > 0 implies that its increasingly flat limit is Q_G = Q_{X_G}(w_G), the N-point Gaussian quadrature rule for the linear functional L.
This is the unique quadrature rule that is exact for every polynomial of degree at most 2N − 1: Q_G[p] = L[p] for every p ∈ Π_{2N−1}. This degree of exactness is maximal; there are no N-point quadrature rules exact for all polynomials up to degree 2N. The most familiar methods of this type are of course the classical Gaussian quadrature rules for numerical integration [15, Section 1.4]. For example, the Gauss–Legendre quadrature rule satisfies

∫_{−1}^{1} p(x) dx = Σ_{n=1}^N w_G(n) p(x_{G,n})

for every polynomial p of degree at most 2N − 1, and its points are the roots of the Nth degree Legendre polynomial. Theorem 3.2 was conjectured by O'Hagan [26, Section 3.3] in 1991 in the form that the optimal kernel quadrature rule has the classical Gauss–Hermite quadrature rule as its increasingly flat limit if the kernel is Gaussian and L is the Gaussian integral. More discussion of this conjecture, but no rigorous proofs, can be found in [24, Section 4].
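The maximal degree of exactness stated above is quickly verified numerically (our sanity check, not code from the paper): the N-point Gauss–Legendre rule integrates monomials on [−1, 1] exactly up to degree 2N − 1, but fails at degree 2N.

```python
import numpy as np

# Check exactness of the N-point Gauss-Legendre rule on monomials x^k.
N = 4
nodes, weights = np.polynomial.legendre.leggauss(N)

def quad_error(k):
    exact = 0.0 if k % 2 == 1 else 2.0 / (k + 1)  # \int_{-1}^1 x^k dx
    return abs(weights @ nodes ** k - exact)

print(max(quad_error(k) for k in range(2 * N)))  # ~ machine precision
print(quad_error(2 * N))                         # clearly non-zero
```
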
The proof of Theorem 3.2 is based on a general result by Barrow [2] on the existence and uniqueness of generalised Gaussian quadrature rules. This result replaces the polynomials in a Gaussian quadrature rule with generalised polynomials formed out of functions that constitute an extended Chebyshev system [16, Chapter 1]. A collection {u_n}_{n=0}^{m−1} of functions is an extended Chebyshev system if any non-trivial linear combination of the functions has at most m − 1 zeroes, counting multiplicities. That is, if u ∈ span{u_n}_{n=0}^{m−1} and u^{(q_p)}(x_p) = 0 for x_p ∈ [a, b], p = 1, …, P, and q_p = 0, …, Q_p − 1, then Σ_{p=1}^P Q_p ≤ m − 1. Any basis of the space of polynomials of degree at most m − 1 is an extended Chebyshev system. Importantly, the functions {φ_n}_{n=0}^{m−1} in (2.4) are an extended Chebyshev system for any m ∈ N. To verify this, note that any φ ∈ span{φ_n}_{n=0}^{m−1} can be written as φ(x) = exp(−x²/(2ℓ²)) p(x) for some polynomial p of degree at most m − 1, and consequently

φ^{(l)}(x) = exp(−x²/(2ℓ²)) Σ_{r=0}^{l} s_r(x) p^{(r)}(x)

for some polynomials s_r. From this expression, we see that φ^{(l)}(x) = 0 for every l = 0, …, q if and only if p^{(l)}(x) = 0 for every l = 0, …, q. Since p can have at most m − 1 zeroes, counting multiplicities, it follows that the same is true of φ.
Theorem 3.2 Suppose that Ω = [a, b] for a < b. If for every ℓ > 0 there exists a unique optimal kernel quadrature rule Q*_ℓ = Q_{X*_ℓ}(w*_ℓ), then its points and weights converge to those of the N-point Gaussian quadrature rule for L:

lim_{ℓ→∞} X*_ℓ = X_G  and  lim_{ℓ→∞} w*_ℓ = w_G,

where X_G and w_G are the unique points and weights such that Q_{X_G}(w_G)[p] = L[p] for every p ∈ Π_{2N−1}.

Proof In a manner identical to the proof of Theorem 2.2, we establish the lower bound

e_ℓ(Q*_ℓ) ≥ | L[(ℓ^{2n} n!)^{−1/2} φ_n] − Q*_ℓ[(ℓ^{2n} n!)^{−1/2} φ_n] |    (3.1)

that holds for every n ≥ 0. Because {φ_n}_{n=0}^{2N−1} is an extended Chebyshev system, Theorem 3.1 guarantees the existence of a unique N-point quadrature rule

Q_{X_{φ,ℓ}}(w_{φ,ℓ}) that is exact for every φ_n with n ≤ 2N − 1. The points of this rule are distinct and lie inside Ω, and its weights are positive. We can then replicate the rest of the proof of Theorem 2.2 in one dimension, but with m = 2N − 1 and Lemma 2.1 replaced with Lemma 3.1 (applied to the function u = φ_0), to show that, for sufficiently large ℓ and a constant C independent of ℓ,

| L[φ_n] − Q*_ℓ[φ_n] | ≤ C (ℓ^{2n} n!)^{1/2} ℓ^{−2N}

for every n ≤ 2N − 1. We then fix ℓ_0 > 0 and invoke Lemma 3.2 with the function g_ℓ that measures the failure of exactness of a quadrature rule on {φ_n}_{n=0}^{2N−1} and converges to its polynomial counterpart g_∞ uniformly on A. Since the unique minimiser of g_∞ is (X_G, w_G), the claim follows from (3.1) and Lemma 3.2.

Generalisations
This section discusses some straightforward generalisations of the results in Sections 2 and 3.

Damped power series kernels
Theorem 2.1 for the Gaussian kernel (1.4) is a consequence of the identity

K_ℓ(x, x′) = exp(−‖x‖₂²/(2ℓ²)) exp(−‖x′‖₂²/(2ℓ²)) K_pow(x/ℓ, x′/ℓ),

where K_pow(x, x′) = exp(⟨x, x′⟩) is a power series kernel [39]. Accordingly, the results in Sections 2 and 3 can be generalised to a class of kernels that we call damped power series kernels. Let G : R^d → R∖{0} be a nowhere-vanishing function and define G_ℓ(x) = G(x/ℓ). Then, a damped power series kernel is defined as

K_ℓ(x, x′) = G_ℓ(x) G_ℓ(x′) Σ_{α ∈ N^d_0} x^α (x′)^α / (ℓ^{q|α|} ω_α)    (4.1)

for q > 0 and weight parameters ω_α > 0 such that the series converges for every ℓ > 0 and x, x′ ∈ Ω. Arguments identical to those used in [23,39] establish that K_ℓ defined in (4.1) is a positive-definite kernel and that its RKHS H(K_ℓ) consists of functions of the form f(x) = G_ℓ(x) Σ_{α ∈ N^d_0} f_α x^α with ‖f‖²_{H(K_ℓ)} = Σ_{α ∈ N^d_0} ℓ^{q|α|} ω_α f_α² < ∞. The Gaussian kernel is recovered by setting G(x) = exp(−‖x‖₂²/2), q = 2, and ω_α = α!. Note that the Gaussian kernel is an exception; damped power series kernels are rarely stationary.

General information functionals
It would also be easy to replace the cubature rule (1.6) with the generalised version

Q_X(w)[f] = Σ_{n=1}^N w(n) L_n[f],

where L_n are any bounded linear functionals. If the L_n are such that the matrices

[L_n[x^{α_i}]]_{n,i=1}^N  and  [L_n[φ_{α_i}]]_{n,i=1}^N,

which are generalisations of (2.3) and (2.5), are non-singular, then Theorem 2.2 and Corollary 2.1 can be generalised.

Non-unisolvent point sets
If the kernel is Gaussian but the point set X ⊂ Ω is not unisolvent, Schaback [31] has proved that the kernel interpolant (1.2) converges to the de Boor and Ron polynomial interpolant [6,7], which is the unique interpolant to f at X in a point-dependent polynomial space Π_X having, in a certain sense, minimal degree. We expect that extensions of the results in Section 2 to non-unisolvent points are possible. The kernel cubature weights would presumably converge to the weights w_pol such that Q_X(w_pol)[p] = L[p] for every p ∈ Π_X.

Theorem 2.2

Let N = dim Π_m for some m ∈ N_0 and X be Π_m-unisolvent. Suppose that lim_{ℓ→∞} L[φ_α] = L[x^α] for every α ∈ N^d_0 with |α| ≤ m and that

L[ Σ_{α ∈ N^d_0} (ℓ^{2|α|} α!)^{−1/2} |a_α| |φ_α| ] ≤ C    (2.6)

for some C < ∞ and ℓ_0 > 0, every ℓ ≥ ℓ_0, and any sequence (a_α)_{α ∈ N^d_0} such that Σ_{α ∈ N^d_0} a_α² ≤ 1. Then, lim_{ℓ→∞} w*_ℓ = w_pol and e_ℓ(Q_X(w*_ℓ)) = O(ℓ^{−(m+1)}), where w_pol are the weights of the unique polynomial cubature rule such that Q_X(w_pol)[p] = L[p] for every p ∈ Π_m.

Corollary 2.1

Let N = dim Π_m for some m ∈ N_0 and X be Π_m-unisolvent. Suppose that Ω is bounded. Then, lim_{ℓ→∞} w*_ℓ = w_pol and e_ℓ(Q_X(w*_ℓ)) = O(ℓ^{−(m+1)}), where w_pol are the weights of the unique polynomial cubature rule such that Q_X(w_pol)[p] = L[p] for every p ∈ Π_m.

Theorem 3.1 (Barrow 1978)

Let {u_n}_{n=0}^{2N−1} ⊂ C^{2N−1}([a, b]) be an extended Chebyshev system and L a positive linear functional on span{u_n}_{n=0}^{2N−1}. Then, there exist unique points a < x_1 < ⋯ < x_N < b and positive weights w ∈ R^N such that Q_X(w)[u_n] = L[u_n] for every n = 0, …, 2N − 1.

Lemma 3.1

Let Ω ⊂ R^d and suppose that a cubature rule Q_X(w) with non-negative weights satisfies Q_X(w)[u] = L[u] for some positive function u : Ω → (0, ∞) such that 0 < c_l ≤ u(x) ≤ c_u for all x ∈ Ω. Then, max_{n=1,…,N} w(n) ≤ L[1] c_u / c_l.

Power series kernels

Power series kernels

K_ℓ(x, x′) = Σ_{α ∈ N^d_0} x^α (x′)^α / (ℓ^{q|α|} ω_α)

are obtained by selecting G ≡ 1 in (4.1). As ℓ → ∞, the corresponding kernel quadrature rules then converge to polynomial rules. Perhaps the two most interesting special cases are the exponential kernel K_ℓ(x, x′) = exp(xx′/ℓ²) and the Szegő kernel K_ℓ(x, x′) = Σ_{n=0}^∞ (xx′/ℓ²)^n. The Szegő kernel induces a Hardy space on a disk of radius ℓ. Interestingly, it was pointed out already in the 1970s that approximation with the Szegő kernel yields polynomial methods as ℓ → ∞ [18, Section 3]; see also [24, Section 4]. An extensive numerical investigation has recently been published by Oettershagen [25, Section 6.2].