On the positivity and magnitudes of Bayesian quadrature weights

This article reviews and studies the properties of Bayesian quadrature weights, which strongly affect the stability and robustness of the quadrature rule. Specifically, we investigate conditions that are needed to guarantee that the weights are positive or to bound their magnitudes. First, it is shown that the weights are positive in the univariate case if the design points locally minimise the posterior integral variance and the covariance kernel is totally positive (e.g. Gaussian and Hardy kernels). This suggests that gradient-based optimisation of design points may be effective in constructing stable and robust Bayesian quadrature rules. Secondly, we show that the magnitudes of the weights admit an upper bound in terms of the fill distance and separation radius if the RKHS of the kernel is a Sobolev space (e.g. Matérn kernels), suggesting that quasi-uniform points should be used. A number of numerical examples indicate that significant generalisations and improvements appear to be possible, demonstrating the need for further research.

Bayesian quadrature is a method for numerical integration and an example of a probabilistic numerical method [34,22,11]. Let $\Omega$ be a subset of $\mathbb{R}^d$, $d \geq 1$, and $\nu$ a Borel probability measure on $\Omega$. Given an integrand $f \colon \Omega \to \mathbb{R}$, the task is to approximate the integral

$$I_\nu(f) := \int_\Omega f(\boldsymbol{x}) \,\mathrm{d}\nu(\boldsymbol{x}),$$

the solution of which is assumed not to be available in closed form. In Bayesian quadrature, a user specifies a prior distribution over the integrand as a Gaussian process $f_{\mathcal{GP}} \sim \mathcal{GP}(0, k)$ by choosing a positive-definite covariance kernel $k \colon \Omega \times \Omega \to \mathbb{R}$, so as to faithfully represent their knowledge about the integrand, such as its smoothness. The user then evaluates the true integrand at chosen design points $X = \{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n\} \subset \Omega$. By regarding the pairs $\mathcal{D} := \{(\boldsymbol{x}_i, f(\boldsymbol{x}_i))\}_{i=1}^n$ thus obtained as "observed data", the posterior distribution of the integral $I_\nu(f_{\mathcal{GP}}) \mid \mathcal{D}$ becomes a Gaussian random variable. This posterior distribution is useful for uncertainty quantification and decision making in subsequent tasks; this is one factor that makes Bayesian quadrature a promising approach in modern scientific computation, where quantification of discretisation errors is of great importance [8,38].
In Bayesian quadrature, the mean of the posterior over the integral is used as the quadrature estimate. The mean is a weighted average of function values,

$$I^{BQ}_X(f) = \sum_{i=1}^n w^{BQ}_{X,i} f(\boldsymbol{x}_i),$$

where $w^{BQ}_{X,1}, \ldots, w^{BQ}_{X,n} \in \mathbb{R}$ are weights computed from the kernel $k$, the design points $X$ and the measure $\nu$ (see Section 2.1 for details). This form is similar to (quasi) Monte Carlo methods, where $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n$ are (quasi) random points from a suitable proposal distribution and $w_1, \ldots, w_n$ are the associated importance weights, positive by definition. This similarity naturally leads to the following question: are the weights $w^{BQ}_{X,1}, \ldots, w^{BQ}_{X,n}$ of Bayesian quadrature positive? Unlike importance weights, these weights are derived from a Gaussian process model and are not guaranteed to be positive by construction. In Section 3, we present results concerning positivity of the Bayesian quadrature weights. We discuss results on the number of the weights that must be positive, focusing on the univariate case and totally positive kernels (Definition 2). Corollary 1, the main result of this section, states that all the weights are positive if the design points are locally optimal. A practically relevant consequence is that the weights may be positive whenever the design points are obtained by gradient descent, which is guaranteed to provide locally optimal points (see e.g. [35]). Section 4 focuses on the magnitudes of the weights. More specifically, we discuss the behaviour of the sum of absolute weights, $\sum_{i=1}^n |w^{BQ}_{X,i}|$, which strongly affects the stability and robustness of Bayesian quadrature. If this quantity is small, the quadrature rule is robust against errors in integrand evaluations [17] and misspecification of the Gaussian process prior [26]. The quantity is also related to the numerical stability of the quadrature rule. Using a result on the stability of kernel interpolants by De Marchi and Schaback [14], we derive an upper bound on the sum of absolute weights for typical cases in which the Gaussian process has a finite degree of smoothness and the RKHS induced by the covariance kernel is norm-equivalent to a Sobolev space.

Bayesian quadrature
This section defines a Bayesian quadrature rule as the integral of the posterior of the Gaussian process used to model the integrand. We also discuss the equivalent characterisation of this quadrature rule as the worst-case optimal integration rule in the RKHS $\mathcal{H}(k)$ induced by the covariance kernel $k$ of the Gaussian process.

Basics of Bayesian quadrature
In standard Bayesian quadrature [40,36,8], the deterministic integrand $f \colon \Omega \to \mathbb{R}$ is modelled as a Gaussian process. The integrand is assigned a zero-mean Gaussian process prior $f_{\mathcal{GP}} \sim \mathcal{GP}(0, k)$ with a positive-definite covariance kernel $k$. This is to say that for any $n \in \mathbb{N}$ and any distinct points $X = \{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n\} \subset \Omega$ we have $(f_{\mathcal{GP}}(\boldsymbol{x}_1), \ldots, f_{\mathcal{GP}}(\boldsymbol{x}_n)) \sim \mathrm{N}(\boldsymbol{0}, \boldsymbol{K}_X)$, with the $n \times n$ positive-definite (and hence invertible) kernel matrix $[\boldsymbol{K}_X]_{ij} = k(\boldsymbol{x}_i, \boldsymbol{x}_j)$. Conditioning on the data $\mathcal{D} = \{(\boldsymbol{x}_i, f(\boldsymbol{x}_i))\}_{i=1}^n$, consisting of the evaluations $\boldsymbol{f}_X := (f(\boldsymbol{x}_i))_{i=1}^n \in \mathbb{R}^n$ of $f$ at the points $X$, yields a Gaussian posterior process with the mean and covariance

$$\mu_{X,f}(\boldsymbol{x}) = \boldsymbol{k}_X(\boldsymbol{x})^\top \boldsymbol{K}_X^{-1} \boldsymbol{f}_X, \qquad \Sigma_X(\boldsymbol{x}, \boldsymbol{x}') = k(\boldsymbol{x}, \boldsymbol{x}') - \boldsymbol{k}_X(\boldsymbol{x})^\top \boldsymbol{K}_X^{-1} \boldsymbol{k}_X(\boldsymbol{x}'),$$

where $[\boldsymbol{k}_X(\boldsymbol{x})]_i = k(\boldsymbol{x}, \boldsymbol{x}_i)$. Note that the posterior covariance depends only on the points, not on the integrand, and that the posterior mean interpolates the data (i.e., $\mu_{X,f}(\boldsymbol{x}_i) = f(\boldsymbol{x}_i)$ for $i = 1, \ldots, n$). Accordingly, the posterior mean often goes by the name kernel interpolant or, if the kernel is isotropic, radial basis function interpolant.
Due to the linearity of the integration operator, the posterior of the integral becomes a Gaussian distribution with the mean and variance

$$I^{BQ}_X(f) := \mathbb{E}\big[ I_\nu(f_{\mathcal{GP}}) \mid \mathcal{D} \big] = \boldsymbol{k}_{\nu,X}^\top \boldsymbol{K}_X^{-1} \boldsymbol{f}_X, \qquad V^{BQ}_X := \mathrm{Var}\big[ I_\nu(f_{\mathcal{GP}}) \mid \mathcal{D} \big] = I_\nu(k_\nu) - \boldsymbol{k}_{\nu,X}^\top \boldsymbol{K}_X^{-1} \boldsymbol{k}_{\nu,X}, \qquad (1)$$

where $k_\nu := \int_\Omega k(\cdot, \boldsymbol{x}) \,\mathrm{d}\nu(\boldsymbol{x})$ is the kernel mean [53], $\boldsymbol{k}_{\nu,X} \in \mathbb{R}^n$ with $[\boldsymbol{k}_{\nu,X}]_i = k_\nu(\boldsymbol{x}_i)$, and $I_\nu(k_\nu) = \int_\Omega k_\nu \,\mathrm{d}\nu$. The integral mean $I^{BQ}_X(f)$ is used to approximate the true intractable integral $I_\nu(f)$, while the variance $V^{BQ}_X$ is supposed to quantify the epistemic uncertainty, due to only partial information (i.e., a finite number of function evaluations) being used, inherent to this approximation.
The integral mean $I^{BQ}_X(f)$ indeed takes the form of a quadrature rule, a weighted sum of function evaluations:

$$I^{BQ}_X(f) = \sum_{i=1}^n w^{BQ}_{X,i} f(\boldsymbol{x}_i), \qquad (2)$$

where $w^{BQ}_{X,1}, \ldots, w^{BQ}_{X,n}$ are the Bayesian quadrature weights, given by $\boldsymbol{w}^{BQ}_X = \boldsymbol{K}_X^{-1} \boldsymbol{k}_{\nu,X}$. The purpose of this article is to analyse the properties of these weights.
A particular property of a Bayesian quadrature rule is that the $n$ kernel translates $k_{\boldsymbol{x}_i} := k(\cdot, \boldsymbol{x}_i)$ are integrated exactly:

$$I^{BQ}_X(k_{\boldsymbol{x}_i}) = I_\nu(k_{\boldsymbol{x}_i}) \quad \text{for } i = 1, \ldots, n, \qquad (3)$$

which derives from the fact that the $j$th equation of the linear system $\boldsymbol{K}_X \boldsymbol{w}^{BQ}_X = \boldsymbol{k}_{\nu,X}$ defining the weights is

$$\sum_{i=1}^n w^{BQ}_{X,i} \, k(\boldsymbol{x}_j, \boldsymbol{x}_i) = k_\nu(\boldsymbol{x}_j).$$

The left-hand side is precisely $I^{BQ}_X(k_{\boldsymbol{x}_j})$ while on the right-hand side we have $k_\nu(\boldsymbol{x}_j) = I_\nu(k_{\boldsymbol{x}_j})$. Note also that the integral variance is the integration error of the kernel mean: $V^{BQ}_X = I_\nu(k_\nu) - I^{BQ}_X(k_\nu)$. Occasionally, it is instructive to interpret the weights as integrals of the Lagrange cardinal functions $\boldsymbol{u}_X = (u_{X,i})_{i=1}^n$ (see e.g. [54, Chapter 11]). These functions are defined as $\boldsymbol{u}_X(\boldsymbol{x}) = \boldsymbol{K}_X^{-1} \boldsymbol{k}_X(\boldsymbol{x})$. Consequently, the cardinality property $u_{X,i}(\boldsymbol{x}_j) = \delta_{ij}$ is satisfied, as can be verified by considering the interpolant $\mu_{X,g_i}$ to any function $g_i$ such that $g_i(\boldsymbol{x}_j) = \delta_{ij}$. In terms of the Lagrange cardinal functions, the kernel interpolant is

$$\mu_{X,f} = \sum_{i=1}^n f(\boldsymbol{x}_i) \, u_{X,i}. \qquad (4)$$

Since the integral mean is merely the integral of $\mu_{X,f}$, we have from (4) that

$$I^{BQ}_X(f) = I_\nu(\mu_{X,f}) = \sum_{i=1}^n f(\boldsymbol{x}_i) \, I_\nu(u_{X,i}).$$

That is, the $i$th Bayesian quadrature weight is the integral of the $i$th Lagrange cardinal function: $w^{BQ}_{X,i} = I_\nu(u_{X,i})$.
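To make the weight computation concrete, the following minimal Python sketch (not part of the original experiments) computes the weights $\boldsymbol{w}^{BQ}_X = \boldsymbol{K}_X^{-1} \boldsymbol{k}_{\nu,X}$ for the Gaussian kernel and the uniform measure on $[0,1]$, for which the kernel mean is available in closed form via the error function; the length-scale and design points are illustrative choices.

```python
import numpy as np
from scipy.special import erf

ell = 0.3  # length-scale (illustrative choice)

def k(x, y):
    # Gaussian kernel k(x, y) = exp(-(x - y)^2 / (2 ell^2)).
    return np.exp(-(x - y) ** 2 / (2 * ell ** 2))

def k_mean(x):
    # Kernel mean k_nu(x) = int_0^1 k(x, y) dy for the uniform measure on [0, 1].
    c = np.sqrt(2) * ell
    return ell * np.sqrt(np.pi / 2) * (erf(x / c) + erf((1 - x) / c))

x = np.linspace(0.05, 0.95, 7)       # design points X
K = k(x[:, None], x[None, :])        # kernel matrix K_X
w = np.linalg.solve(K, k_mean(x))    # BQ weights w = K_X^{-1} k_{nu,X}

f = lambda t: np.sin(3 * t) + t      # test integrand
print("weights:", w)
print("sum of weights:", w.sum())
print("BQ estimate:", w @ f(x))      # true integral is (1 - cos 3)/3 + 1/2
```

For well-spread points the printed sum of the weights is close to one, in line with the interpretation of the weights as a Dirac approximation of the probability measure $\nu$ discussed in Section 4.2.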

Reproducing kernel Hilbert spaces
An alternative interpretation of the Bayesian quadrature weights is that they are, for the given points, the worst-case optimal weights in the reproducing kernel Hilbert space $\mathcal{H}(k)$ induced by the covariance kernel $k$. The material of this section is contained in, for example, [8, Section 2], [39, Section 3.2] and [30, Section 2]. For a comprehensive introduction to RKHSs, see the monograph of Berlinet and Thomas-Agnan [5].
The RKHS induced by $k$ is the unique Hilbert space $\mathcal{H}(k)$ of functions characterised by (i) the reproducing property $\langle k_{\boldsymbol{x}}, f \rangle_{\mathcal{H}(k)} = f(\boldsymbol{x})$ for every $f \in \mathcal{H}(k)$ and $\boldsymbol{x} \in \Omega$ and (ii) the fact that $k_{\boldsymbol{x}} \in \mathcal{H}(k)$ for every $\boldsymbol{x} \in \Omega$. The worst-case error in $\mathcal{H}(k)$ of a quadrature rule with points $X$ and weights $\boldsymbol{w} \in \mathbb{R}^n$ is

$$e(X, \boldsymbol{w}) := \sup_{\|f\|_{\mathcal{H}(k)} \leq 1} \bigg| I_\nu(f) - \sum_{i=1}^n w_i f(\boldsymbol{x}_i) \bigg|.$$

It can be then shown that the Bayesian quadrature weights $\boldsymbol{w}^{BQ}_X$ are the unique minimiser of the worst-case error among all possible weights for these points:

$$\boldsymbol{w}^{BQ}_X = \operatorname*{arg\,min}_{\boldsymbol{w} \in \mathbb{R}^n} e(X, \boldsymbol{w}) \quad \text{and} \quad e(X, \boldsymbol{w}^{BQ}_X)^2 = V^{BQ}_X.$$

Furthermore, the worst-case error can be written as the $\mathcal{H}(k)$-error in approximating the integration representer $k_\nu$, which satisfies $I_\nu(f) = \langle f, k_\nu \rangle_{\mathcal{H}(k)}$ for every $f \in \mathcal{H}(k)$:

$$e(X, \boldsymbol{w}) = \bigg\| k_\nu - \sum_{i=1}^n w_i k_{\boldsymbol{x}_i} \bigg\|_{\mathcal{H}(k)}.$$

From this representation and the Cauchy–Schwarz inequality it follows that

$$\big| I_\nu(f) - I^{BQ}_X(f) \big| \leq \|f\|_{\mathcal{H}(k)} \sqrt{V^{BQ}_X} \quad \text{for every } f \in \mathcal{H}(k). \qquad (5)$$

For analysis of convergence of Bayesian quadrature rules as $n \to \infty$ it is therefore sufficient to analyse how the worst-case error (i.e., the integral variance) behaves, as long as the integrand indeed lives in $\mathcal{H}(k)$. Convergence will be discussed in Section 4.

Positivity
This section reviews existing results on the positivity of the weights of Bayesian quadrature that can be derived in one dimension when the covariance kernel is totally positive (Definition 2; this assumption is stronger than positive-definiteness but is satisfied by, for example, the Gaussian kernel). For most of the section we assume that $d = 1$ and $\Omega = [a, b]$ for $a < b$. Furthermore, the measure $\nu$ is typically assumed to admit a density function with respect to the Lebesgue measure, an assumption that implies $I_\nu(f) > 0$ if $f(x) > 0$ for almost every $x \in \Omega$. Positivity of the weights was actively investigated during the 1970s [48,50,49,3,2], and these results have been recently refined and collected by Oettershagen [39, Section 4]. To simplify presentation, some of the results in this section are given in a slightly less general form than possible. The two most important results are as follows:
- Theorem 1: at least one half of the weights of any Bayesian quadrature rule are positive.
- Corollary 1: all the weights are positive when the points are selected so that the integral posterior variance in (1) is locally minimised in the sense that each of its $n$ partial derivatives with respect to the integration points vanishes (Definition 3).
The latter of these results is particularly interesting since (i) it implies that points selected using a gradient descent algorithm may have positive weights, and (ii) the resulting Bayesian quadrature rule is a positive linear functional and hence potentially well-suited for integration of functions that are known to be positive, a problem for which a number of transformation-based methods have been developed recently [42,21,10]. As no multivariate extension of the theory used to prove the aforementioned results appears to have been developed, we do not provide any general theoretical results on the weights in higher dimensions. However, some special cases based on, for example, tensor products are discussed in Sections 3.7 and 3.9, and two numerical examples provide some evidence for the conjecture that multivariate versions of Theorem 1 and Corollary 1 hold.
It will turn out that optimal Bayesian quadrature rules are analogous to classical Gaussian quadrature rules in the sense that, in addition to being exact for kernel interpolants (recall (3)), they also exactly integrate Hermite interpolants (see Section 3.2.2). We thus begin by reviewing the argument used to establish positivity of the Gaussian quadrature weights.

Gaussian quadrature
Under the assumption that $\nu$ admits a Lebesgue density, there exist unique weights $w_1, \ldots, w_n$ and points $x_1, \ldots, x_n \in \Omega$ such that

$$I_\nu(P) = \sum_{i=1}^n w_i P(x_i) \qquad (6)$$

for every polynomial $P$ of degree at most $2n - 1$ [18, Chapter 1]. This quadrature rule is known as a Gaussian quadrature rule (for the measure $\nu$). One can show the positivity of the weights of a Gaussian rule as follows.
Proposition 1 Assume that $\nu$ admits a Lebesgue density. Then the weights $w_1, \ldots, w_n$ of the Gaussian quadrature rule (6) are positive.
Proof For each $i = 1, \ldots, n$ there exists a unique polynomial $L_i$ of degree $n - 1$ such that $L_i(x_j) = \delta_{ij}$. This property is shared by the function $G_i := L_i^2 \geq 0$ that, being of degree $2n - 2$, is also integrated exactly by the Gaussian rule. Because $G_i$ is almost everywhere positive and $G_i(x_j) = \delta_{ij}$, it follows from the assumption that $\nu$ admits a density that

$$w_i = \sum_{j=1}^n w_j G_i(x_j) = I_\nu(G_i) > 0.$$

The positivity of the weights is thus concluded.
At first sight this proof may appear to rely on the closedness of the set of polynomials under squaring. Closer analysis reveals a structure that can be generalised later.
To describe this, recall that one of the basic properties of polynomials is that a non-trivial polynomial $P$ of degree $n$ can have at most $n$ zeroes when counting multiplicities (for some properties of polynomials and interpolation with them, see e.g. [1, Chapter 3]). This is to say that if $P$ vanishes together with its first $q_i - 1$ derivatives at distinct points $x_1, \ldots, x_m$, then $q_1 + \cdots + q_m \leq n$. This fact on the zeroes of polynomials can be used to supply a proof of positivity of the Gaussian quadrature weights that does not explicitly make use of the fact that the square of a function is non-negative. By the chain rule, the derivative of $G_i$ vanishes at each $x_j$ such that $j \neq i$. That is, $G_i$ has a double zero at each of these $n - 1$ points (i.e., $G_i(x_j) = 0$ and $G_i^{(1)}(x_j) = 0$), for a total of $2n - 2$ zeroes. Being a polynomial of degree $2n - 2$, $G_i$ cannot have any other zeroes besides these. Since all the zeroes of $G_i$ are double, it cannot hence have any sign changes. This is because, in general, a function $g$ that satisfies $g(x) = g^{(1)}(x) = 0$ but $g^{(2)}(x) \neq 0$ at a point $x$ cannot change its sign at $x$, since its derivative changes sign at the point. From $G_i(x_i) = 1 > 0$ it then follows that $G_i$ is almost everywhere positive.
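Proposition 1 is easy to check numerically. The sketch below uses NumPy's Gauss–Legendre routine, which corresponds to the uniform (Lebesgue) density on $[-1, 1]$, and confirms that all weights are positive for a few illustrative values of $n$.

```python
import numpy as np

# Gauss-Legendre rules integrate polynomials of degree 2n - 1 exactly against
# the uniform density on [-1, 1]; Proposition 1 predicts positive weights.
for n in (5, 20, 80):
    nodes, weights = np.polynomial.legendre.leggauss(n)
    assert np.all(weights > 0)
    print(n, weights.min())
```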

Chebyshev systems and generalised Gaussian quadrature
The argument presented above works almost as such when the polynomials are replaced with generalised polynomials and the Gaussian quadrature rule with a generalised Gaussian quadrature rule. Much of the following material is covered by the introductory chapters of the monograph by Karlin and Studden [28]. In the following, $C^m(\Omega)$ stands for the set of functions that are $m$ times continuously differentiable in the interior of $\Omega$.
Definition 1 A collection of functions $\{\varphi_i\}_{i=1}^m \subset C^{m-1}(\Omega)$ constitutes an (extended) Chebyshev system if any non-trivial linear combination of the functions, called a generalised polynomial, has at most $m - 1$ zeroes, counting multiplicities.
Remark 1 Some of the results we later present, such as Proposition 3, are valid even under a less restrictive definition of a Chebyshev system that does not require differentiability of the $\varphi_i$; of course, in that case the definition is not given in terms of multiple zeroes. The above definition is used here to simplify presentation.
By selecting $\varphi_i(x) = x^{i-1}$ we see that the polynomials are an example of a Chebyshev system. Perhaps the simplest non-trivial Chebyshev system is given by the following example.
Example 1 The functions $\varphi_i(x) = x^{i-1} e^x$ for $i = 1, \ldots, m$ constitute a Chebyshev system. To verify this, observe that any linear combination $\varphi$ of $\varphi_1, \ldots, \varphi_m$ is of the form $\varphi(x) = e^x P(x)$ for a polynomial $P$ of degree at most $m - 1$ and that the $j$th derivative of this function takes the form

$$\varphi^{(j)}(x) = e^x \bigg( P(x) + \sum_{i=1}^{j} c_i P^{(i)}(x) \bigg) \qquad (7)$$

for certain integer coefficients $c_1, \ldots, c_j$. We observe that $\varphi(x_0) = 0$ for a point $x_0$ if and only if $P(x_0) = 0$. If also $\varphi^{(1)}(x_0) = 0$, then it follows from (7) that $P^{(1)}(x_0) = 0$, and, generally, that $\varphi(x_0) = \cdots = \varphi^{(j)}(x_0) = 0$ if and only if $P(x_0) = \cdots = P^{(j)}(x_0) = 0$. That is, the zeroes of $\varphi$ are precisely those of $P$, with the same multiplicities, and, consequently, the functions $\varphi_i$ constitute a Chebyshev system.

Interpolation using a Chebyshev system
A crucial property of generalised polynomials is that unique interpolants can be constructed using them, as we next show. For any Chebyshev system $\{\varphi_i\}_{i=1}^n$ and a set of distinct points $X = \{x_1, \ldots, x_n\} \subset \Omega$, we know that there cannot exist $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_n) \neq \boldsymbol{0}$ such that $\sum_{i=1}^n \alpha_i \varphi_i(x_j) = 0$ for every $j = 1, \ldots, n$, since $\alpha_1 \varphi_1 + \cdots + \alpha_n \varphi_n$ can have at most $n - 1$ zeroes. Equivalently, the only solution $\boldsymbol{\beta} \in \mathbb{R}^n$ to the linear system $\boldsymbol{V}_X \boldsymbol{\beta} = \boldsymbol{0}$, with the generalised Vandermonde matrix $[\boldsymbol{V}_X]_{ji} = \varphi_i(x_j)$, is $\boldsymbol{\beta} = \boldsymbol{0}$; that is, $\boldsymbol{V}_X$ is invertible. For any function $f$, this fact guarantees the existence and uniqueness of an interpolant $s_{X,f}$ such that (i) $s_{X,f}$ is in $\operatorname{span}\{\varphi_1, \ldots, \varphi_n\}$ and (ii) $s_{X,f}(x_j) = f(x_j)$ for each $j = 1, \ldots, n$. These two requirements imply that

$$s_{X,f}(x_j) = \sum_{i=1}^n \alpha_i \varphi_i(x_j) = f(x_j)$$

for some $\boldsymbol{\alpha} \in \mathbb{R}^n$ and every $j = 1, \ldots, n$. In matrix form, these $n$ equations are equivalent to

$$\boldsymbol{V}_X \boldsymbol{\alpha} = \boldsymbol{f}_X, \quad \text{so that} \quad \boldsymbol{\alpha} = \boldsymbol{V}_X^{-1} \boldsymbol{f}_X. \qquad (8)$$
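The following sketch illustrates interpolation in a Chebyshev system by solving the generalised Vandermonde system (8) for the system $\varphi_i(x) = x^{i-1} e^x$ of Example 1; the points and test function are arbitrary illustrative choices.

```python
import numpy as np

# Interpolation in the Chebyshev system phi_i(x) = x^{i-1} e^x of Example 1.
n = 6
x = np.linspace(0.0, 1.0, n)                   # distinct points
f = np.cos(4 * x)                              # data to interpolate

# Generalised Vandermonde matrix [V_X]_{ji} = phi_i(x_j); invertible because
# the phi_i constitute a Chebyshev system.
V = np.exp(x)[:, None] * x[:, None] ** np.arange(n)[None, :]
alpha = np.linalg.solve(V, f)

# Interpolant s_{X,f}(t) = e^t P(t) with polynomial coefficients alpha.
s = lambda t: np.exp(t) * np.polynomial.polynomial.polyval(t, alpha)
print(np.max(np.abs(s(x) - f)))                # ~0: interpolation conditions hold
```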

Hermite interpolants
A Hermite interpolant $s_{X,\boldsymbol{q},f}$ is based on data containing also derivative values (see [1, Section 3.6] for polynomial and [16, Chapter 36] for kernel-based Hermite interpolation). In this setting, the point set $X$ contains $m$ points and $\boldsymbol{q} \in \mathbb{N}_0^m$ is a vector of multiplicities such that $\sum_{i=1}^m q_i = n$. The data to be interpolated are the derivative values $f^{(j_i)}(x_i)$ for $i = 1, \ldots, m$ and $j_i = 0, \ldots, q_i - 1$. That is, the interpolant is to satisfy

$$s_{X,\boldsymbol{q},f}^{(j_i)}(x_i) = f^{(j_i)}(x_i)$$

for each $i = 1, \ldots, m$ and $j_i = 0, \ldots, q_i - 1$. Note that the interpolant $s_{X,f}$ is a Hermite interpolant with $m = n$ and $q_1 = \cdots = q_n = 1$. If the interpolant is to lie in $\operatorname{span}\{\varphi_1, \ldots, \varphi_n\}$, we must have, for some $\alpha_1, \ldots, \alpha_n$,

$$\sum_{l=1}^n \alpha_l \varphi_l^{(j_i)}(x_i) = f^{(j_i)}(x_i).$$

Again, these $n$ equations define a linear system that is invertible because $\{\varphi_i\}_{i=1}^n$ constitute a Chebyshev system. The Hermite interpolant can hence be written in the form (8) with $\boldsymbol{V}_X$ replaced with a version involving also derivatives of the $\varphi_i$ (see e.g. [39, Section 2.3.1]).

Generalised Gaussian quadrature
A generalised Gaussian quadrature rule is a quadrature rule that uses $n$ points to integrate exactly all functions in the span of a Chebyshev system $\{\varphi_i\}_{i=1}^{2n}$:

$$I_\nu(\varphi) = \sum_{i=1}^n w_i \varphi(x_i) \qquad (9)$$

for every $\varphi \in \operatorname{span}\{\varphi_1, \ldots, \varphi_{2n}\}$. The existence and uniqueness of the points and weights are guaranteed under fairly general assumptions [4]. We prove positivity of the weights by constructing a function $F_i \in \operatorname{span}\{\varphi_1, \ldots, \varphi_{2n}\}$ analogous to $G_i$ in Section 3.1.
Proposition 2 Assume that $\nu$ admits a Lebesgue density. Then the weights $w_1, \ldots, w_n$ of the generalised Gaussian quadrature rule (9) are positive.
Proof Let $F_i$ be the Hermite interpolant to the data $F_i(x_j) = \delta_{ij}$ and $F_i^{(1)}(x_j) = 0$ for $j = 1, \ldots, n$; an example is depicted in Figure 1. As there are $2n$ data points, $F_i$ indeed exists since $\{\varphi_i\}_{i=1}^{2n}$ constitute a Chebyshev system. Moreover, $F_i$ can have at most $2n - 1$ zeroes and has a double zero at each $x_j$ with $j \neq i$. Because all its zeroes occurring on $(a, b)$ are double, $F_i$ cannot change its sign, and from $F_i(x_i) = 1 > 0$ it follows that $F_i$ is almost everywhere positive. Since the rule (9) is exact for $F_i$, we conclude as in the proof of Proposition 1 that $w_i = I_\nu(F_i) > 0$.

Next we turn our attention to kernels whose translates and their derivatives constitute Chebyshev systems.

Totally positive kernels
We are now ready to begin considering kernels and Bayesian quadrature. A concept related to Chebyshev systems is that of totally positive kernels, whose theory is covered by the monograph of Karlin [27]. For a sufficiently differentiable kernel, define the derivatives

$$k^{(j)}_x := \frac{\partial^j}{\partial x^j} k(\cdot, x). \qquad (10)$$

In the RKHS framework, these functions act as representers for differentiation: $\langle k^{(j)}_x, f \rangle_{\mathcal{H}(k)} = f^{(j)}(x)$ for every $f \in \mathcal{H}(k)$.
The class of totally positive kernels is smaller than that of positive-definite kernels.

Definition 2 A sufficiently differentiable kernel $k$ on $\Omega \times \Omega$ is totally positive of order $q$ if for every $m \in \mathbb{N}$, all distinct points $x_1, \ldots, x_m \in \Omega$ and all multiplicities $q_1, \ldots, q_m \in \{1, \ldots, q\}$ the $\sum_{i=1}^m q_i$ derivative translates $k^{(j)}_{x_i}$, $i = 1, \ldots, m$ and $j = 0, \ldots, q_i - 1$, constitute a Chebyshev system.

For the simplest case of $q = 1$ and $m = n$, the total positivity condition is that the kernel translates $k_{x_1}, \ldots, k_{x_n}$ constitute a Chebyshev system. This implies, in particular, that the $n \times n$ matrix $[\boldsymbol{K}_X]_{ij} = k(x_i, x_j)$ is invertible for any distinct points. Basic examples of totally positive kernels are the Gaussian kernel

$$k(x, x') = \exp\bigg( -\frac{(x - x')^2}{2\ell^2} \bigg) \qquad (11)$$

with length-scale $\ell > 0$ and the Hardy kernel $k(x, x') = r^2 / (r^2 - x x')$ for $r > 0$. Both of these kernels are totally positive of any order. There is also a convenient result that guarantees total positivity [9, Proposition 3]: $k$ is totally positive if there are positive constants $a_m$ and a positive increasing function $v \in C^\infty(\Omega)$ such that

$$k(x, x') = \sum_{m=0}^\infty a_m v(x)^m v(x')^m.$$

More examples are collected in [27,9].

General result on weights
The following special case of the theory developed in [28, Chapter 2] appears in, for instance, [50, Lemma 2]. Its proof is a generalisation of the proof for the case $m = 2n$ that was discussed in Section 3.2.

Proposition 3 Suppose that $\nu$ admits a Lebesgue density and that $\{\varphi_i\}_{i=1}^m \subset C^{m-1}(\Omega)$ constitute a Chebyshev system. If a quadrature rule with distinct points $x_1, \ldots, x_n$ and weights $w_1, \ldots, w_n$ satisfies $I_\nu(\varphi) = \sum_{i=1}^n w_i \varphi(x_i)$ for every $\varphi \in \operatorname{span}\{\varphi_1, \ldots, \varphi_m\}$, then at least $(m + 1)/2$ of the weights are positive.

An immediate consequence of this proposition is that a Bayesian quadrature rule based on a totally positive kernel has at least one half of its weights positive.
Theorem 1 Suppose that the kernel $k \in C^\infty(\Omega \times \Omega)$ is totally positive of order 1. Then, for any points, at least $(n + 1)/2$ of the Bayesian quadrature weights $w^{BQ}_{X,1}, \ldots, w^{BQ}_{X,n}$ are positive.
Proof Since the kernel is totally positive of order 1, the translates $\{k_{x_i}\}_{i=1}^n$ constitute a Chebyshev system. The exactness condition (3) holds for each of these functions. The claim follows by setting $m = n$ in Proposition 3.

Weights for locally optimal points
Recall the definition (1) of the Bayesian quadrature variance:

$$V^{BQ}_X = I_\nu(k_\nu) - \boldsymbol{k}_{\nu,X}^\top \boldsymbol{K}_X^{-1} \boldsymbol{k}_{\nu,X}.$$

The variance can be considered a function $X \mapsto V^{BQ}_X$ defined on the simplex $S_n := \{\boldsymbol{z} \in \Omega^n : a < z_1 < \cdots < z_n < b\} \subset \Omega^n$.
We introduce the following definition of locally optimal points.
Definition 3 Let $m \leq n$. A Bayesian quadrature rule with points $X \subset \Omega$ is locally $m$-optimal if $X \in S_n$ and there is an $m$-point subset $X^*_m \subset X$ such that

$$\frac{\partial}{\partial x_j} V^{BQ}_X = 0 \quad \text{for every } x_j \in X^*_m. \qquad (12)$$
A locally n-optimal rule is called locally optimal. The point set of a locally m-optimal Bayesian quadrature rule is also called locally m-optimal.
When the kernel is totally positive of any order, it has been shown that any local minimiser of $V^{BQ}_X$ is locally optimal in the sense of the above definition. That is, no point in a point set that locally minimises the variance can be located on the boundary of the integration interval, nor can any two points in the set coalesce. (Coalescence is conceivable because $V^{BQ}_X$ is in fact a continuous function of $X$ defined on the whole of $\Omega^n$, not merely on $S_n$ [39, Proposition 5.5]; it would result in a quadrature rule that also uses evaluations of the derivative of the integrand.) These results, the origins of which can be traced to the 1970s [3,2,6], have been recently collated by Oettershagen [39, Corollary 5.13].
A locally $m$-optimal Bayesian quadrature rule is, in addition to the kernel translates at $X$, exact for the kernel translate derivatives at $X^*_m$ (it is worth noting that Bayesian quadrature rules with derivative evaluations have been recently considered in [45,55]). When $m = n$, this is analogous to the interpretation of classical Gaussian quadrature rules as integrated Hermite interpolants (see [49]). This result first appeared in [33]. Its proof is typically based on considering the RKHS representation of the variance; see [50, Section 3] or [39, Section 5].
We present a mainly linear algebraic proof.
Proposition 4 Let $m \leq n$. Suppose that the $n$-point set $X \in S_n$ is locally $m$-optimal. If the kernel $k$ is differentiable in both of its arguments, then

$$I^{BQ}_X(k^{(1)}_{x_j}) = I_\nu(k^{(1)}_{x_j}) \quad \text{for every } x_j \in X^*_m, \qquad (13)$$

where $k^{(1)}_x$ is the kernel derivative defined in (10).
Proof Define $E(Z) := V^{BQ}_Z$ for $Z \in S_n$. By the definition of local $m$-optimality, the partial derivatives

$$E_j(X) := \frac{\partial}{\partial z_j} E(Z) \bigg|_{Z = X}$$

must vanish for each $j$ for which $x_j \in X^*_m$. Let $\partial_j \boldsymbol{g}(X) \in \mathbb{R}^n$ stand for the $j$th partial derivative of a vector-valued function $\boldsymbol{g} \colon \mathbb{R}^n \to \mathbb{R}^n$ evaluated at $X$. From the explicit expression (1) for the variance we compute

$$E_j(X) = -2 \, (\partial_j \boldsymbol{k}_{\nu,X})^\top \boldsymbol{w}^{BQ}_X + (\boldsymbol{w}^{BQ}_X)^\top (\partial_j \boldsymbol{K}_X) \, \boldsymbol{w}^{BQ}_X,$$

where the inverse matrix derivative formula $\partial_j \boldsymbol{K}_X^{-1} = -\boldsymbol{K}_X^{-1} (\partial_j \boldsymbol{K}_X) \boldsymbol{K}_X^{-1}$ and the weight expression $\boldsymbol{w}^{BQ}_X = \boldsymbol{K}_X^{-1} \boldsymbol{k}_{\nu,X}$ have been used. The two partial derivatives appearing in the equation for $E_j(X)$ can be explicitly computed. First, only the $j$th element of $\boldsymbol{k}_{\nu,X}$ depends on $x_j$. Thus

$$(\partial_j \boldsymbol{k}_{\nu,X})^\top \boldsymbol{w}^{BQ}_X = w^{BQ}_{X,j} \, I_\nu(k^{(1)}_{x_j}).$$

Secondly, only the $j$th row and column of $\boldsymbol{K}_X$ have dependency on $x_j$. For $l \neq j$ we have

$$\partial_j [\boldsymbol{K}_X]_{jl} = \frac{\partial}{\partial x_j} k(x_l, x_j) = k^{(1)}_{x_j}(x_l),$$

where the first equality is a consequence of the symmetry of the kernel. The diagonal element is a total derivative: $\partial_j [\boldsymbol{K}_X]_{jj} = 2 k^{(1)}_{x_j}(x_j)$. Therefore $\partial_j \boldsymbol{K}_X$ is a zero matrix except for the $j$th row and column, which are $(k^{(1)}_{x_j}(x_1), \ldots, 2 k^{(1)}_{x_j}(x_j), \ldots, k^{(1)}_{x_j}(x_n))$ and its transpose, respectively. Hence

$$E_j(X) = 2 \, w^{BQ}_{X,j} \big( I^{BQ}_X(k^{(1)}_{x_j}) - I_\nu(k^{(1)}_{x_j}) \big).$$

Now, if $w^{BQ}_{X,j} = 0$, the integral variance is not decreased by the addition of the point $x_j$; that is, $V^{BQ}_X - V^{BQ}_{X \setminus \{x_j\}} = 0$. According to the worst-case interpretation (5) of the variance, this implies that $I^{BQ}_X(f) = I^{BQ}_{X \setminus \{x_j\}}(f)$ for every $f \in \mathcal{H}(k)$. Since $k^{(1)}_x \in \mathcal{H}(k)$ for every $x \in \Omega$, we deduce that, in particular, (13) holds for $x_j$. On the other hand, if $w^{BQ}_{X,j} \neq 0$, then $E_j(X) = 0$, so that the form of $E_j$ above implies that $I^{BQ}_X(k^{(1)}_{x_j}) = I_\nu(k^{(1)}_{x_j})$.

Remark 2 In higher dimensions the kernel partial derivatives $\partial k(\cdot, \boldsymbol{x}) / \partial x_j$ for $j = 1, \ldots, d$ and $\boldsymbol{x} \in X^*_m$ are integrated exactly by a locally $m$-optimal Bayesian quadrature rule, defined by requiring a gradient version of (12). See also [20]. However, there appear to exist no generalisations of Chebyshev systems and Theorem 3 to higher dimensions.
Theorem 2 Let $m \leq n$. Suppose that the point set $X \in S_n$ is locally $m$-optimal. If $k \in C^\infty(\Omega \times \Omega)$ is totally positive of order 2, then at least $(n + m + 1)/2$ of the Bayesian quadrature weights $w^{BQ}_{X,1}, \ldots, w^{BQ}_{X,n}$ are positive.
Proof By (3) and (13), a Bayesian quadrature rule using a locally $m$-optimal point set is exact for the $n$ kernel translates and $m$ of their derivatives. By total positivity of the kernel, the collection of these $n + m$ functions constitutes a Chebyshev system. The claim follows from Proposition 3.
Finally, the main result of this section follows by setting m = n in the preceding theorem.

Corollary 1
If $k \in C^\infty(\Omega \times \Omega)$ is totally positive of order 2 and $X \in S_n$ is locally optimal, then all the Bayesian quadrature weights $w^{BQ}_{X,1}, \ldots, w^{BQ}_{X,n}$ are positive.
Remark 3 A key consequence of Corollary 1 is the following: if $w^{BQ}_{X,1}, \ldots, w^{BQ}_{X,n}$ contain negative values, then the design points $X$ are not locally optimal. In other words, in this case there is still room for improvement by optimising these points using, for example, gradient descent. In this way, the signs of the weights can provide information about the quality of the design point set.
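Remark 3 suggests a simple computational workflow: optimise the design points by a gradient-based method and use the signs of the resulting weights as a diagnostic. The sketch below, a hypothetical implementation rather than the authors' code, does this for the Gaussian kernel and the uniform measure on $[0, 1]$; since $I_\nu(k_\nu)$ in (1) does not depend on $X$, minimising the variance is equivalent to maximising $\boldsymbol{k}_{\nu,X}^\top \boldsymbol{K}_X^{-1} \boldsymbol{k}_{\nu,X}$.

```python
import numpy as np
from scipy.special import erf
from scipy.optimize import minimize

ell = 0.3
k = lambda x, y: np.exp(-(x - y) ** 2 / (2 * ell ** 2))
c = np.sqrt(2) * ell
k_mean = lambda x: ell * np.sqrt(np.pi / 2) * (erf(x / c) + erf((1 - x) / c))

def neg_explained(x):
    # V_X^BQ = I_nu(k_nu) - k_{nu,X}^T K_X^{-1} k_{nu,X}; the first term does
    # not depend on X, so minimising the variance amounts to maximising the
    # quadratic form below (negated here for a minimiser).
    K = k(x[:, None], x[None, :]) + 1e-12 * np.eye(x.size)  # jitter
    k_nu = k_mean(x)
    return -k_nu @ np.linalg.solve(K, k_nu)

n = 8
x0 = np.sort(np.random.default_rng(0).uniform(0, 1, n))
res = minimize(neg_explained, x0, method="L-BFGS-B",
               bounds=[(0.0, 1.0)] * n)        # gradient-based local optimisation
x_opt = np.sort(res.x)

K = k(x_opt[:, None], x_opt[None, :]) + 1e-12 * np.eye(n)
w = np.linalg.solve(K, k_mean(x_opt))
print(np.all(w > 0))   # Corollary 1 suggests True at a locally optimal point set
```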
A positive-weight quadrature rule is a positive linear functional (i.e., every positive function is mapped to a positive real). A locally optimal Bayesian quadrature rule may therefore be appropriate for numerical integration of functions that are a priori known to be positive, such as likelihood functions. Theoretical comparison to warped models [42,21,10], which encode positivity of the integrand by placing the GP prior on, for example, the square root of the integrand, would be an interesting topic of research.

Greedily selected points
The optimal points discussed in the preceding section cannot be constructed efficiently (see [39, Section 5.2] for what appears to be the most advanced published algorithm). In practice, points selected by greedy minimisation of the integral variance are often used. This approach is known as sequential Bayesian quadrature [12,23]. Assuming for a moment that $d$ is arbitrary and that an $n$-point set $X_n \subset \Omega$ has already been generated, sequential Bayesian quadrature proceeds by selecting a new point $\boldsymbol{x}_{n+1} \in \Omega$ that minimises the integral variance:

$$\boldsymbol{x}_{n+1} = \operatorname*{arg\,min}_{\boldsymbol{x} \in \Omega} V^{BQ}_{X_n \cup \{\boldsymbol{x}\}}.$$

In higher dimensions there is little that we are able to say about qualitative properties of the resulting quadrature rules. However, when $d = 1$ we can invoke Theorem 2, since $X_n \cup \{x_{n+1}\}$ is locally 1-optimal.
Proposition 5 Let $\Omega = [a, b] \subset \mathbb{R}$. Suppose that $k \in C^\infty(\Omega \times \Omega)$ is totally positive of order 2. If $X_n \cup \{x_{n+1}\} \in S_{n+1}$, then at least $(n + 3)/2$ of the weights of the $(n + 1)$-point sequential Bayesian quadrature rule are positive.
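A minimal grid-search version of sequential Bayesian quadrature can be sketched as follows, again for the Gaussian kernel and the uniform measure on $[0, 1]$ as illustrative assumptions; in practice a continuous optimiser would be used instead of a fixed grid.

```python
import numpy as np
from scipy.special import erf

ell = 0.3
k = lambda x, y: np.exp(-(x - y) ** 2 / (2 * ell ** 2))
c = np.sqrt(2) * ell
k_mean = lambda x: ell * np.sqrt(np.pi / 2) * (erf(x / c) + erf((1 - x) / c))

def explained(x):
    # k_{nu,X}^T K_X^{-1} k_{nu,X}; a larger value means a smaller V_X^BQ.
    K = k(x[:, None], x[None, :]) + 1e-12 * np.eye(x.size)
    k_nu = k_mean(x)
    return k_nu @ np.linalg.solve(K, k_nu)

grid = np.linspace(0, 1, 501)     # candidate locations for the next point
X = np.array([0.5])               # initial point
for _ in range(7):
    scores = [explained(np.append(X, g)) for g in grid]
    X = np.append(X, grid[int(np.argmax(scores))])
print(np.sort(X))                 # greedily selected design points
```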

Other kernels and point sets
A number of combinations of kernels and point sets that are not covered by the theory above have been shown, either theoretically or experimentally, to yield positive Bayesian quadrature weights:
- The Brownian motion kernel $k(x, x') = \min(x, x')$ on $\Omega = [0, 1]$ yields, for the uniform measure and any points, the standard trapezoidal rule (with positive weights); see [15] and [51, Lemma 8 in Section 3.2, Chapter 2].
- Suitably selected priors give rise to Bayesian quadrature rules whose posterior mean coincides with a classical rule, such as Gaussian quadrature [29,32]. Analysis of the weights and their positivity naturally reduces to that of the reproduced classical rule.
- There is convincing numerical evidence that the weights are positive if the nodes for the Gaussian kernel and measure on $\mathbb{R}$ are selected by suitably scaling the classical Gauss–Hermite nodes [31].
- Uniform weighting (i.e., $w^{BQ}_{X,i} = 1/n$) can be achieved when certain quasi-Monte Carlo point sets and shift-invariant kernels are used [24].

Upper bound on the sum of weights
We summarise below a simple yet generic result that has an important consequence for the stability of Bayesian quadrature in Section 4.

Lemma 1
Let $\Omega \subset \mathbb{R}^d$ and suppose that the kernel $k$ is positive on $\Omega \times \Omega$. If the Bayesian quadrature weights $w^{BQ}_{X,1}, \ldots, w^{BQ}_{X,n}$ are non-negative, then we have

$$\sum_{i=1}^n w^{BQ}_{X,i} \leq \frac{\sup_{\boldsymbol{x} \in \Omega} k_\nu(\boldsymbol{x})}{\min_{1 \leq i,j \leq n} k(\boldsymbol{x}_i, \boldsymbol{x}_j)}.$$
Proof The claim immediately follows from the property (3) that $\sum_{i=1}^n w^{BQ}_{X,i} k_{\boldsymbol{x}_j}(\boldsymbol{x}_i) = I_\nu(k_{\boldsymbol{x}_j}) = k_\nu(\boldsymbol{x}_j)$ for each $j = 1, \ldots, n$: bounding each kernel value on the left-hand side from below by $\min_{1 \leq i,j \leq n} k(\boldsymbol{x}_i, \boldsymbol{x}_j) > 0$ yields the claim.
Combined with Corollary 1, we obtain a bound on the sum of absolute weights $\sum_{i=1}^n |w^{BQ}_{X_n,i}|$, which is the main topic of discussion in Section 4.
Corollary 2 If $k \in C^\infty(\Omega \times \Omega)$ is positive and totally positive of order 2 and the design points $X \in S_n$ are locally optimal, then we have

$$\sum_{i=1}^n \big| w^{BQ}_{X,i} \big| = \sum_{i=1}^n w^{BQ}_{X,i} \leq \frac{\sup_{x \in \Omega} k_\nu(x)}{\min_{1 \leq i,j \leq n} k(x_i, x_j)}.$$

Most importantly, Corollary 2 is applicable to the Gaussian kernel, for which the upper bound is finite. This result will be discussed in Section 4.4 in more detail. One may see supporting evidence in Fig. 2, where the sum of the weights appears to converge to a value around 1.

Higher dimensions
As far as we are aware, there are no extensions of the theory of Chebyshev systems to higher dimensions. Consequently, it is not possible to say much about positivity of the weights when $d > 1$. Some simple cases can be analysed, however. Let $\Omega = \Omega_1 \times \cdots \times \Omega_1$ and $\nu = \nu_1 \times \cdots \times \nu_1$, where there are $d$ terms in the products. Suppose that (i) the point set $X$ is a Cartesian product of a one-dimensional set $X_1 = \{x^1_1, \ldots, x^1_n\} \subset \Omega_1$, that is, $X = X_1^d$, and (ii) the kernel is of product form: $k(\boldsymbol{x}, \boldsymbol{x}') = \prod_{i=1}^d k_1(x_i, x_i')$ for some kernel $k_1$ on $\Omega_1$. A quadrature rule using Cartesian product points is called a tensor product rule. For such points the Bayesian quadrature weights $\boldsymbol{w}^{BQ}_X$ are products of the one-dimensional weights $\boldsymbol{w}^{BQ}_{X_1}$: the weight for the point $(x^1_{i_1}, \ldots, x^1_{i_d}) \in X$ is $\prod_{j=1}^d w^{BQ}_{X_1, i_j}$ [39, Section 2.4]. In particular, if $k_1$ is totally positive and $X_1$ is a locally optimal set of points, then all the $n^d$ weights $\boldsymbol{w}^{BQ}_X$ are positive, as illustrated by the sketch below. Analysis of more flexible sparse grid and symmetry-based methods [30] might yield more interesting results.
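In code, the tensor product construction reduces to a Kronecker product of the one-dimensional weight vectors, as in the following sketch ($d = 2$, with the Gaussian kernel and the uniform measure on $[0, 1]^2$ as illustrative assumptions):

```python
import numpy as np
from scipy.special import erf

ell = 0.5
k1 = lambda x, y: np.exp(-(x - y) ** 2 / (2 * ell ** 2))
c = np.sqrt(2) * ell
k1_mean = lambda x: ell * np.sqrt(np.pi / 2) * (erf(x / c) + erf((1 - x) / c))

x1 = np.linspace(0.1, 0.9, 5)             # one-dimensional point set X_1
K1 = k1(x1[:, None], x1[None, :])
w1 = np.linalg.solve(K1, k1_mean(x1))     # one-dimensional BQ weights

# For the product kernel and Cartesian product points X = X_1 x X_1, the
# weight of the point (x1[i], x1[j]) is w1[i] * w1[j]:
w2d = np.kron(w1, w1)
print(np.all(w1 > 0), np.all(w2d > 0))    # positivity is inherited from 1-d
```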
We conclude this section with two numerical examples. Both of them involve the standard Gaussian measure on Ω = R d and the Gaussian kernel (11).
Locally optimal points First, we investigate positivity of the weights for locally optimal points. We set $\ell = 1$ and $d = 2$ and use gradient-based optimisation to find points that locally minimise the integral variance for $n = 2, \ldots, 20$. Some point sets generated using the same algorithm have appeared in [52, Section IV] (for other examples of optimal points in dimension two, see [41,36] and, in particular, [39, Chapter 6]). The point sets we obtain appear sensible and all of them are associated with positive weights; four sets and their weights are depicted in Fig. 2.
Random points Secondly, we investigate the validity of Theorem 1 in higher dimensions. We set $\ell = 1.5$ and $d = 4$ and count the number of positive weights for $n = 2, \ldots, 1000$ when each $n$-point set is generated by drawing Monte Carlo samples from $\nu$. Random samples are often used in Bayesian quadrature [46,8,7], and they also function as a suitable test case in which the structure of the point set plays little role in constraining the behaviour of subsets of the weights, as happens when product or symmetric point designs are used. Fig. 3 shows the proportion of positive weights; it appears that at least half of the weights for randomly drawn points are always positive. This supports the obvious conjectural extension of Theorem 1 to higher dimensions.

Magnitudes of weights and the stability
This section studies the magnitudes of the weights of a Bayesian quadrature rule and discusses how they are related to the stability and robustness of the rule. We are in particular interested in the following quantity, which we call the Bayesian quadrature stability constant:

$$\Lambda^{BQ}_{X_n} := \sum_{i=1}^n \big| w^{BQ}_{X_n,i} \big|. \qquad (14)$$

To make the dependency on $n$ more explicit, the quadrature point set is denoted by $X_n$ instead of $X$ in this section. The terminology is motivated by the close connection to the Lebesgue constant

$$\Lambda_{X_n} := \sup_{\boldsymbol{x} \in \Omega} \sum_{i=1}^n \big| u_{X_n,i}(\boldsymbol{x}) \big|, \qquad (15)$$

where $u_{X_n,i}$ are the Lagrange cardinal functions from Section 2.1. The connection to (14) arises from the fact that $w^{BQ}_{X_n,i} = I_\nu(u_{X_n,i})$ for $i = 1, \ldots, n$. We begin by providing an example that motivates studying the stability constant (14). Suppose that the function evaluations contain errors (which may be numerical or stochastic), denoted by $\varepsilon_i$ and modelled as independent zero-mean random variables with variance $\sigma^2$. Then the mean-square error (where the expectation is with respect to $\varepsilon_1, \ldots, \varepsilon_n$) of Bayesian quadrature is given by

$$\mathbb{E}\bigg[ \Big( I_\nu(f) - \sum_{i=1}^n w^{BQ}_{X_n,i} \big( f(\boldsymbol{x}_i) + \varepsilon_i \big) \Big)^2 \bigg] = \big( I_\nu(f) - I^{BQ}_{X_n}(f) \big)^2 + \sigma^2 \sum_{i=1}^n \big( w^{BQ}_{X_n,i} \big)^2.$$

Since $\sum_{i=1}^n (w^{BQ}_{X_n,i})^2 \leq (\Lambda^{BQ}_{X_n})^2$, a small stability constant (14) suppresses the additional error caused by the perturbations $\varepsilon_i$. Another motivating example will be given in Section 4.2, after introducing the necessary notation. Since our results are based on those in [14], which are applicable to kernels that induce Sobolev-equivalent RKHSs, we mainly focus on such kernels in this section (e.g., Matérn kernels). We begin by reviewing basic properties of Sobolev spaces in Section 4.1 and convergence results for Bayesian quadrature in Section 4.2. The main results, Theorem 5 and Corollary 3, on the magnitudes of the quadrature weights and the stability constant appear in Section 4.3. We discuss a relevant stability issue, known as the Runge phenomenon, for infinitely smooth kernels such as the Gaussian kernel in Section 4.4. Finally, simulation results are provided in Section 4.5, demonstrating that the obtained upper bound is conservative; there is thus much room for improving the results.
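The mean-square error decomposition above is straightforward to verify by simulation. In the sketch below (with an illustrative kernel, measure and noise level), the Monte Carlo estimate of the perturbation term should match $\sigma^2 \sum_{i=1}^n (w^{BQ}_{X_n,i})^2$.

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(1)
ell, sigma = 0.3, 0.05
k = lambda x, y: np.exp(-(x - y) ** 2 / (2 * ell ** 2))
c = np.sqrt(2) * ell
k_mean = lambda x: ell * np.sqrt(np.pi / 2) * (erf(x / c) + erf((1 - x) / c))

x = np.linspace(0.05, 0.95, 10)
w = np.linalg.solve(k(x[:, None], x[None, :]), k_mean(x))

f = lambda t: np.sin(3 * t)
I_bq = w @ f(x)                       # noise-free BQ estimate
# Monte Carlo estimate of the perturbation term E[(sum_i w_i eps_i)^2]:
noisy = np.array([w @ (f(x) + sigma * rng.standard_normal(x.size))
                  for _ in range(100000)])
print(np.mean((noisy - I_bq) ** 2))   # should be close to the line below
print(sigma ** 2 * np.sum(w ** 2))
```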

Sobolev spaces
Let $L^2(\Omega)$ denote the Hilbert space of functions that are square-integrable with respect to the standard Lebesgue measure on a measurable set $\Omega \subset \mathbb{R}^d$. A standard Sobolev space $W^r_2(\Omega)$ of order $r \in \mathbb{N}$ is defined as the subspace of functions in $L^2(\Omega)$ whose weak derivatives $D^{\boldsymbol{\alpha}} f$ exist and are in $L^2(\Omega)$ for every $\boldsymbol{\alpha} \in \mathbb{N}_0^d$ such that $|\boldsymbol{\alpha}| \leq r$. That is,

$$W^r_2(\Omega) = \big\{ f \in L^2(\Omega) : D^{\boldsymbol{\alpha}} f \in L^2(\Omega) \text{ for all } |\boldsymbol{\alpha}| \leq r \big\}.$$

This becomes a Hilbert space when equipped with the norm

$$\|f\|_{W^r_2(\Omega)} = \bigg( \sum_{|\boldsymbol{\alpha}| \leq r} \big\| D^{\boldsymbol{\alpha}} f \big\|^2_{L^2(\Omega)} \bigg)^{1/2}.$$

Let $F$ be a Hilbert space such that $F = W^r_2(\Omega)$ as a set of functions. The space $F$ is said to be norm-equivalent to $W^r_2(\Omega)$ if there exist constants $C_1, C_2 > 0$ such that $C_1 \|f\|_{W^r_2(\Omega)} \leq \|f\|_F \leq C_2 \|f\|_{W^r_2(\Omega)}$ for every $f \in W^r_2(\Omega)$. There is a strong connection between Sobolev spaces and RKHSs [54, Chapter 10]. Namely, if $r > d/2$ and the Fourier transform of a translation-invariant kernel $k$ on $\mathbb{R}^d$ decays at the rate $\hat{k}(\boldsymbol{\xi}) \asymp (1 + \|\boldsymbol{\xi}\|^2)^{-r}$, then $\mathcal{H}(k)$ is norm-equivalent to $W^r_2(\mathbb{R}^d)$. The canonical example is the Matérn kernel

$$k_\rho(\boldsymbol{x}, \boldsymbol{x}') = \frac{2^{1-\rho}}{\Gamma(\rho)} \bigg( \frac{\sqrt{2\rho} \, \|\boldsymbol{x} - \boldsymbol{x}'\|}{\ell} \bigg)^{\rho} K_\rho \bigg( \frac{\sqrt{2\rho} \, \|\boldsymbol{x} - \boldsymbol{x}'\|}{\ell} \bigg),$$

where $\ell > 0$ is a length-scale parameter, $\rho > 0$ a smoothness parameter and $K_\rho$ the modified Bessel function of the second kind of order $\rho$. If $\rho = r - d/2 > 0$ and $\Omega$ has a Lipschitz boundary, the RKHS $\mathcal{H}(k_\rho)$ is norm-equivalent to $W^r_2(\Omega)$ [54, Corollary 10.48].

Convergence for Sobolev-equivalent kernels
Recall from Section 2.2 that the integration error of a Bayesian quadrature rule for functions in $\mathcal{H}(k)$ satisfies

$$\big| I_\nu(f) - I^{BQ}_{X_n}(f) \big| \leq \|f\|_{\mathcal{H}(k)} \sqrt{V^{BQ}_{X_n}},$$

so that in convergence analysis only the behaviour of the worst-case error needs to be considered. If the RKHS is norm-equivalent to a Sobolev space, convergence rates for Bayesian quadrature can be established. These results follow from [54, Corollary 11.33]; see [26, Proposition 4] for an explicit version. Some assumptions, satisfied by all domains of interest to us, are needed; see for instance [26, Section 3] for precise definitions.

Assumption 3 The set Ω ⊂ R d is a bounded open set that satisfies an interior cone condition and has a Lipschitz boundary.
This assumption essentially says that the boundary of $\Omega$ is sufficiently regular (Lipschitz boundary) and that there is no "pinch point" on the boundary of $\Omega$ (interior cone condition). Convergence results are expressed in terms of the fill-distance

$$h_{X_n,\Omega} := \sup_{\boldsymbol{x} \in \Omega} \min_{1 \leq i \leq n} \|\boldsymbol{x} - \boldsymbol{x}_i\|,$$

which quantifies the size of the largest "hole" in an $n$-point set $X_n$. We use $\lesssim$ to denote an inequality that is valid up to a constant independent of $n$, the number of points, and the integrand $f$. That is, for generic sequences of functionals $g_n$ and $h_n$, $g_n(f) \lesssim h_n(f)$ means that there is a constant $C > 0$ such that $g_n(f) \leq C h_n(f)$ for all $n \in \mathbb{N}$ and any $f$ in a specified class of functions.
Theorem 4 Suppose that (i) $\Omega$ satisfies Assumption 3, (ii) the measure $\nu$ has a bounded (Lebesgue) density function, and (iii) the RKHS $\mathcal{H}(k)$ is norm-equivalent to $W^r_2(\Omega)$. Then

$$\big| I_\nu(f) - I^{BQ}_{X_n}(f) \big| \lesssim \|f\|_{W^r_2(\Omega)} \, h^r_{X_n,\Omega}$$

for any $f \in W^r_2(\Omega)$ when the fill-distance is sufficiently small.
A simpler form is achieved for point sets that are quasi-uniform, which is to say that $c_1 q_{X_n} \leq h_{X_n,\Omega} \leq c_2 q_{X_n}$ for some constants $c_1, c_2 > 0$ independent of $n$. Here

$$q_{X_n} := \frac{1}{2} \min_{i \neq j} \|\boldsymbol{x}_i - \boldsymbol{x}_j\|$$

is the separation radius. In dimension $d$ quasi-uniform sets satisfy $h_{X_n,\Omega} = O(n^{-1/d})$ as $n \to \infty$ (e.g., regular product grids). From Theorem 4 we thus obtain the rate

$$\big| I_\nu(f) - I^{BQ}_{X_n}(f) \big| \lesssim \|f\|_{W^r_2(\Omega)} \, n^{-r/d} \qquad (16)$$

for $f \in W^r_2(\Omega)$ when the point sets are quasi-uniform and $n$ is sufficiently large. The following simple result is an immediate consequence of the above developments.

Proposition 6 Under the assumptions of Theorem 4, $\big| 1 - \sum_{i=1}^n w^{BQ}_{X_n,i} \big| \lesssim h^r_{X_n,\Omega}$. This is seen by applying Theorem 4 to the constant function $f \equiv 1 \in W^r_2(\Omega)$, for which $I_\nu(f) = 1$ and $I^{BQ}_{X_n}(f) = \sum_{i=1}^n w^{BQ}_{X_n,i}$.

Of course, it is the stability constant $\Lambda^{BQ}_{X_n} = \sum_{i=1}^n |w^{BQ}_{X_n,i}|$ that we analyse next whose behaviour is typically more consequential. However, the above proposition may occasionally be interesting if one desires to interpret Bayesian quadrature as a weighted Dirac approximation $\nu_{BQ} := \sum_{i=1}^n w^{BQ}_{X_n,i} \delta_{\boldsymbol{x}_i} \approx \nu$ of a probability measure (i.e., $\nu_{BQ}(\Omega) \approx 1$). Note that there is also a simple way to ensure that the weights sum to one by inclusion of a non-zero prior mean function for the Gaussian process prior; see [40] and [32, Section 2.3].

Finally, we provide a second example that highlights the importance of analysing the stability constant. Kanagawa et al. [26, Section 4.1] (see also [25]) studied convergence rates of kernel-based quadrature rules in Sobolev spaces when the integrand is potentially rougher (i.e., $f \in W^s_2(\Omega)$ for some $s \leq r$) than assumed. If $s < r$, the integrand $f$ may not belong to the Sobolev space $W^r_2(\Omega)$ that is assumed by the user in constructing the quadrature rule; this is therefore a misspecified setting. Under a certain condition, they showed [26, Corollary 7] that if $\Lambda^{BQ}_{X_n} \lesssim n^c$ for a constant $c \geq 0$, then

$$\big| I_\nu(f) - I^{BQ}_{X_n}(f) \big| \lesssim \|f\|_{W^s_2(\Omega)} \, n^{-s/d + c(1 - s/r)} \qquad (17)$$

when $h_{X_n,\Omega} \lesssim n^{-1/d}$. The condition $\Lambda^{BQ}_{X_n} \lesssim n^c$ means that the stability constant should not grow quickly as $n$ increases. The bound (17) shows that the error in the misspecified setting becomes small if $c$ is small. This implies that if the stability constant $\Lambda^{BQ}_{X_n}$ does not increase quickly, then the quadrature rule is robust against misspecification of the prior. This provides a second motivation for understanding the behaviour of $\Lambda^{BQ}_{X_n}$.
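The fill-distance and separation radius are easy to compute for concrete point sets. The following sketch approximates $h_{X_n,\Omega}$ on a dense grid and evaluates $q_{X_n}$ exactly, showing that equispaced points keep the ratio $h_{X_n,\Omega}/q_{X_n}$ small while random points typically do not.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def fill_distance(X, Omega_grid):
    # h_{X, Omega}: radius of the largest "hole", approximated on a dense grid.
    return cdist(Omega_grid, X).min(axis=1).max()

def separation_radius(X):
    # q_X: half the smallest pairwise distance.
    return pdist(X).min() / 2

grid = np.linspace(0, 1, 10001)[:, None]
for name, X in [("equispaced", np.linspace(0, 1, 50)[:, None]),
                ("random", np.sort(np.random.default_rng(2).uniform(0, 1, 50))[:, None])]:
    h, q = fill_distance(X, grid), separation_radius(X)
    print(name, h, q, h / q)   # quasi-uniform sets keep h/q bounded
```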

Upper bounds for absolute weights
We now analyse the magnitudes of individual weights and the stability constant (14). We first derive an upper bound on the magnitude of each weight $w^{BQ}_{X_n,i}$. The proof is based on an upper bound, derived in [14], on the $L^2(\Omega)$ norms of the Lagrange cardinal functions.

Theorem 5 Suppose that (i) $\Omega$ satisfies Assumption 3, (ii) the measure $\nu$ has a bounded (Lebesgue) density function, and (iii) the RKHS $\mathcal{H}(k)$ is norm-equivalent to $W^r_2(\Omega)$. Then

$$\big| w^{BQ}_{X_n,i} \big| \lesssim h^{d/2}_{X_n,\Omega} \bigg( \frac{h_{X_n,\Omega}}{q_{X_n}} \bigg)^{r - d/2} \quad \text{for every } i = 1, \ldots, n, \qquad (18)$$

provided that $h_{X_n,\Omega}$ is sufficiently small. If the $X_n$ are in addition quasi-uniform, then

$$\big| w^{BQ}_{X_n,i} \big| \lesssim n^{-1/2}. \qquad (19)$$

Proof The results of [14] provide the bound

$$\| u_{X_n,i} \|_{L^2(\Omega)} \lesssim h^{d/2}_{X_n,\Omega} \bigg( \frac{h_{X_n,\Omega}}{q_{X_n}} \bigg)^{r - d/2} \qquad (20)$$

for the Lagrange cardinal functions, provided that $h_{X_n,\Omega}$ is sufficiently small. Let $\|\nu\|_\infty < \infty$ stand for the supremum of the density function of $\nu$. Then it follows from $w^{BQ}_{X_n,i} = I_\nu(u_{X_n,i})$ and the Cauchy–Schwarz inequality that

$$\big| w^{BQ}_{X_n,i} \big| \leq \|\nu\|_\infty \int_\Omega \big| u_{X_n,i}(\boldsymbol{x}) \big| \,\mathrm{d}\boldsymbol{x} \leq \|\nu\|_\infty \, |\Omega|^{1/2} \, \| u_{X_n,i} \|_{L^2(\Omega)}.$$

Inequality (18) now follows from (20). When the $X_n$ are quasi-uniform, the ratio $h_{X_n,\Omega}/q_{X_n}$ remains bounded and $h_{X_n,\Omega}$ behaves like $n^{-1/d}$. From $h^{d/2}_{X_n,\Omega} \lesssim n^{-1/2}$ this verifies (19).

An important consequence of Theorem 5 is that the magnitudes of the quadrature weights decrease uniformly to zero as $n$ increases if the design points are quasi-uniform and $\nu$ has a bounded density. In other words, no design point retains a constant weight that does not decay. This is similar to importance sampling, where the weights decay uniformly at the rate $1/n$. As a direct corollary of Theorem 5 we obtain an upper bound on the stability constant.

Corollary 3 Under the assumptions of Theorem 5,

$$\Lambda^{BQ}_{X_n} = \sum_{i=1}^n \big| w^{BQ}_{X_n,i} \big| \lesssim n \, h^{d/2}_{X_n,\Omega} \bigg( \frac{h_{X_n,\Omega}}{q_{X_n}} \bigg)^{r - d/2} \qquad (21)$$

when $h_{X_n,\Omega}$ is sufficiently small. When the $X_n$ are quasi-uniform and $n$ sufficiently large, this becomes

$$\Lambda^{BQ}_{X_n} \lesssim \sqrt{n}. \qquad (22)$$
While the bounds of Corollary 3 are somewhat conservative (as will be demonstrated in Section 4.5), they are still useful in understanding the factors affecting the stability and robustness of Bayesian quadrature. In particular, inequality (21) shows that the stability constant can be kept small if the ratio $h_{X_n,\Omega}/q_{X_n}$ is kept small; this is possible if the point set is sufficiently uniform.
Another important observation concerns the exponent $r - d/2$ of the ratio $h_{X_n,\Omega}/q_{X_n}$: if the smoothness $r$ of the kernel is large, then the stability constant may become large if the points are not quasi-uniform. This is because $h_{X_n,\Omega}/q_{X_n} \geq 1$ for any configuration of $X_n$, as is easily seen from the definitions of $q_{X_n}$ and $h_{X_n,\Omega}$. This observation implies that the use of a smoother kernel may lead to greater numerical instability. Accordingly, we next discuss the stability of infinitely smooth kernels and the Runge phenomenon that manifests itself in this setting.

On infinitely smooth kernels
While the theoretical results of this section only concern kernels of finite smoothness, we make a few remarks on the stability of Bayesian quadrature when an infinitely smooth kernel, such as the Gaussian kernel, is used. With such a kernel, Bayesian quadrature rules suffer from the famous Runge phenomenon: if equispaced points are used, the Lebesgue constants and the stability constants grow rapidly; see [39, Section 4.3] and [43,44]. This effect is demonstrated in Fig. 4.
A key point is that the Runge phenomenon can occur even when the design points are quasi-uniform (e.g., equispaced). This means that quasi-uniformity of the points does not ensure stability of Bayesian quadrature when the kernel is infinitely smooth. Care has to be taken if a numerically stable Bayesian quadrature rule is to be constructed with such a kernel. One possibility is to use the locally optimal design points of Section 3.5; Corollary 2 then guarantees uniform boundedness of the stability constant, at least when $d = 1$. (In the experiment of Fig. 5, the ratio $h_{X_n,\Omega}/q_{X_n}$, which is expected to control the stability constant, attained a true maximum of roughly $2.2 \times 10^7$; note that the theoretical upper bound (21) contains an additional multiplication by $n$.)

A numerical example
Numerical examples of the behaviour of kernel Lebesgue constants can be found in [13], where it was observed that theoretical bounds similar to (22) are conservative: the Lebesgue constant appears to remain uniformly bounded. Bayesian quadrature weights are no different. We experimented with the Matérn kernel with length-scale $\ell = 0.5$ and the uniform measure on the interval $[0, 1]$. When uniformly spaced points were used, all weights remained positive and their sum quickly converged to one as $n$ was increased. In contrast, Corollary 3 provides the upper bound $\sqrt{n}$ (up to a constant), which is in this case clearly very conservative. When the points were drawn from the uniform distribution on $[0, 1]$, more interesting behaviour was observed (Fig. 5). As expected, the magnitude of $\Lambda^{BQ}_{X_n}$ was closely related to the ratio $h_{X_n,\Omega}/q_{X_n}$. Nevertheless, an increase in $n$ did not generally correspond to an increase in $\Lambda^{BQ}_{X_n}$.
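The experiment of this section can be approximately reproduced with the sketch below. The text does not specify the smoothness of the Matérn kernel, so $\rho = 3/2$ is assumed here, and the kernel mean for the uniform measure is computed numerically on a dense grid.

```python
import numpy as np

ell = 0.5
def matern32(x, y):
    # Matern kernel with rho = 3/2 (assumed; the text leaves rho unspecified).
    r = np.sqrt(3) * np.abs(x - y) / ell
    return (1 + r) * np.exp(-r)

t = np.linspace(0, 1, 4001)          # dense grid for the uniform measure on [0, 1]
rng = np.random.default_rng(3)
for n in (10, 40, 160, 640):
    x = np.sort(rng.uniform(0, 1, n))
    K = matern32(x[:, None], x[None, :]) + 1e-10 * np.eye(n)  # jitter for stability
    k_nu = matern32(t[:, None], x[None, :]).mean(axis=0)      # Riemann approx of k_nu
    w = np.linalg.solve(K, k_nu)
    print(n, np.abs(w).sum())        # stability constant Lambda_n^BQ
```

For random points, near-coalescent pairs make the kernel matrix ill-conditioned, which is exactly the large-$h_{X_n,\Omega}/q_{X_n}$ regime that the bounds of Corollary 3 penalise.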