1 Introduction

Many problems in statistics and the applied sciences require the numerical integration of one or more functions belonging to a particular class. Given \({\mathcal {M}}\subset {\mathbb {R}}^D\), endowed with some probability measure \(\mu \), and an integrable function \(f:{\mathcal {M}}\rightarrow {\mathbb {R}}\), standard Monte Carlo methods approximate the integral \(\int _{{\mathcal {M}}} f(x)\mathrm {d}\mu (x)\) by the finite sum

$$\begin{aligned} \frac{1}{n}\sum _{j=1}^n f(x_j), \end{aligned}$$
(1)

where \(\{x_j\}_{j=1}^n\subset {\mathcal {M}}\) are independent samples from \(\mu \). Monte Carlo integration is widely used in numerical and statistical applications (Robert and Casella 2013). It is well known, however, that the expected worst case integration error for n random points using (1) in reproducing kernel Hilbert spaces does not decay faster than \(n^{-1/2}\), cf. Brauchart et al. (2014), Breger et al. (2018), Hinrichs (2010), Novak and Wozniakowski (2010), Plaskota et al. (2009) and Gräf (2013), proof of Corollary 2.8. To improve the approximation, it has been proposed to re-weight the random points (Briol et al. 2018; Oettershagen 2017; Rasmussen and Ghahramani 2003; Sommariva and Vianello 2006; Ullrich 2017), which is of particular importance when \(\mu \) can only be sampled (Oates et al. 2017) and evaluating f is expensive.
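
To make (1) concrete, here is a minimal Python sketch of plain Monte Carlo integration on \({\mathbb {S}}^2\); the test integrand \(f(x)=x_1^2\) and the sample sizes are illustrative choices, and by symmetry the exact integral equals 1/3.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sphere(n, d=2):
    """Draw n independent samples from the uniform measure on S^d (subset of R^{d+1})."""
    z = rng.standard_normal((n, d + 1))
    return z / np.linalg.norm(z, axis=1, keepdims=True)

f = lambda x: x[:, 0] ** 2          # integral over S^2 equals 1/3 by symmetry

for n in [10**2, 10**4, 10**6]:
    x = sample_sphere(n)
    print(n, abs(f(x).mean() - 1 / 3))   # error decays roughly like n^{-1/2}
```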

That re-weighting of deterministic points can lead to optimal convergence order has been known since the pioneering work of Bakhvalov (1959). For Sobolev spaces on the sphere and more generally on compact Riemannian manifolds, there are numerically feasible strategies to select deterministic points and weights matching optimal worst case error rates, cf. Brandolini et al. (2014), Brauchart et al. (2014), Breger et al. (2017), see also Hellekalek et al. (2016), Hinrichs et al. (2016) and Niederreiter (2003).

The use of random points avoids the need to manually specify a point set and can potentially lead to simpler algorithms if the geometry of the manifold \({\mathcal {M}}\) is complicated. For random points, it was derived in Briol et al. (2018) that the optimal rate for \([0,1]^d\), the sphere, and quite general domains in \({\mathbb {R}}^d\) can be matched up to a logarithmic factor if the weights are optimized with respect to the underlying reproducing kernel. Decay rates of the worst case integration error for Sobolev spaces of dominating mixed smoothness on the torus and the unit cube were studied in Oettershagen (2017). Gaussian kernel quadrature is studied in Karvonen and Särkkä (2019). Numerical experiments on the Grassmannian manifold were provided in Ehler and Gräf (2017). We refer to Trefethen (2017a, b), for further related results.

The present paper is dedicated to verifying that, for Sobolev spaces on closed Riemannian manifolds, random points with optimized weights yield optimal decay rates of the worst case error up to a logarithmic factor. We should point out that we additionally allow for the restriction to nonnegative weights, a desirable property not considered in Briol et al. (2018). Our findings also transfer to functions defined on more general sets such as the d-dimensional unit ball and the simplex.

The paper is structured as follows: First, we bound the worst case error by the covering radius of the underlying points. Second, we use estimates on the covering radius of random points from Reznikov and Saff (2015), see also Brauchart et al. (2018) for the sphere, to establish the optimal approximation rate up to a logarithmic factor. Some consequences for the Bayesian Monte Carlo method are then presented. Numerical experiments for the sphere and the Grassmannian manifold are provided that support our theoretical findings. We also discuss the extension to the unit ball, the cube, and the simplex.

2 Preliminaries

Let \({\mathcal {M}}\subset {\mathbb {R}}^D\) be a smooth, connected, closed Riemannian manifold of dimension d, endowed with the normalized Riemannian measure \(\mu \) throughout the manuscript. Prototypical examples for \({\mathcal {M}}\) are the sphere and the Grassmannian

$$\begin{aligned}&{\mathbb {S}}^{d}=\{x\in {\mathbb {R}}^{d+1} : \Vert x\Vert =1\},\\&{\mathcal {G}}_{k,m} = \{x\in {\mathbb {R}}^{m\times m} : x^\top = x,\; x^2 = x,\; {{\,\mathrm{rank}\,}}(x)=k\}, \end{aligned}$$

respectively, where \(d = k(m-k)\) with \(D=m^2\) in the case of the Grassmannian.

Let \({\mathcal {H}}\) be any normed space of continuous functions \(f:{\mathcal {M}}\rightarrow {\mathbb {R}}\). For points \(\{x_j\}_{j=1}^{n}\subset {\mathcal {M}}\) and weights \(\{w_j\}_{j=1}^{n} \subset {\mathbb {R}}\), the worst case error of integration is defined by

$$\begin{aligned}&{{\,\mathrm{wce}\,}}(\{(x_j, w_j)\}_{j=1}^{n},{\mathcal {H}}) \nonumber \\&\quad := \sup _{\begin{array}{c} f\in {\mathcal {H}}\\ \Vert f\Vert _{{\mathcal {H}}}\le 1 \end{array}}\left| \int _{{\mathcal {M}}} f(x)\mathrm {d}\mu (x)- \sum _{j=1}^{n} w_j f(x_j) \right| . \end{aligned}$$
(2)

Suppose now that \({\mathcal {H}}\) is a reproducing kernel Hilbert space, denoted by \({\mathcal {H}}_K\). Then the squared worst case error can be expressed in terms of the reproducing kernel K as

$$\begin{aligned}&{{\,\mathrm{wce}\,}}(\{(x_j, w_j)\}_{j=1}^{n},{\mathcal {H}}_K)^2 \nonumber \\&\quad = \sum _{i,j=1}^n w_i w_{j}K(x_i,x_j) - 2 \sum _{j=1}^{n} w_{j} \int _{{\mathcal {M}}}K(x_j,y)\mathrm {d}\mu (y) \nonumber \\&\qquad +\int _{{\mathcal {M}}}\int _{{\mathcal {M}}}K(x,y)\mathrm {d}\mu (x)\mathrm {d}\mu (y). \end{aligned}$$
(3)

If \(x_1,\ldots ,x_n\in {\mathcal {M}}\) are random points, independently distributed according to \(\mu \), then it holds

$$\begin{aligned} \sqrt{{\mathbb {E}} \Big [ {{\,\mathrm{wce}\,}}(\{(x_j,\tfrac{1}{n})\}_{j=1}^{n},{\mathcal {H}}_K)^2\Big ]} \asymp n^{-\frac{1}{2}}, \end{aligned}$$
(4)

cf. Brauchart et al. (2014), Breger et al. (2018), Novak and Wozniakowski (2010) and Gräf (2013), Proof of Corollary 2.8. Hence, even if \({\mathcal {H}}_K\) consists of arbitrarily smooth functions, the left hand side of (4) decays only like \(n^{-1/2}\).

The present paper addresses the question of whether, and by how much, one can improve the error rate in (4) by replacing the equal weights \(\frac{1}{n}\) with weights \(\{ w_j\}_{j=1}^n\) that are customized to the random points \(\{x_j\}_{j=1}^n\). From a practical perspective, the methods studied in this paper require that the integrals appearing in (3) can be evaluated analytically.
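
When these integrals are available in closed form, (3) reduces to a finite computation. Below is a minimal sketch, assuming the kernel \(K(x,y)=2-\Vert x-y\Vert \) on \({\mathbb {S}}^2\), for which \(\int _{{\mathbb {S}}^2}K(x,y)\mathrm {d}\mu (y)=2/3\) for all x (see Sect. 6); since this value is constant in x, the double integral in (3) equals 2/3 as well.

```python
import numpy as np

rng = np.random.default_rng(1)

def wce_squared(gram, kernel_mean, double_integral, w):
    """Squared worst case error (3): w^T K w - 2 w^T b + double integral of K."""
    return w @ gram @ w - 2 * w @ kernel_mean + double_integral

# Example: K(x, y) = 2 - ||x - y|| on S^2 with constant kernel mean 2/3.
n = 200
x = rng.standard_normal((n, 3))
x /= np.linalg.norm(x, axis=1, keepdims=True)          # random points on S^2
gram = 2 - np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
w = np.full(n, 1 / n)                                  # equal weights as in (1)
print(np.sqrt(wce_squared(gram, np.full(n, 2 / 3), 2 / 3, w)))
```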

Remark 1

The restriction to integrals with respect to the normalized Riemannian measure \(\mu \) is without too much loss of generality, for if \(\nu \) is any \(\sigma \)-finite measure that is absolutely continuous with respect to \(\mu \), then

$$\begin{aligned} \int _{{\mathcal {M}}} f(x) \mathrm {d}\nu (x)=\int _{{\mathcal {M}}} g(x) \mathrm {d}\mu (x), \end{aligned}$$

where \(g = f \frac{\mathrm {d}\nu }{\mathrm {d}\mu }\) and \(\frac{\mathrm {d}\nu }{\mathrm {d}\mu }\) is the Radon–Nikodym derivative of \(\nu \) with respect to \(\mu \). Often, \({\mathcal {H}}\) is an algebra, so that \(f,\frac{\mathrm {d}\nu }{\mathrm {d}\mu }\in {\mathcal {H}}\) imply \(g\in {\mathcal {H}}\), and our considerations for \(\mu \) apply.

3 Bounding the worst case error by the covering radius

To define appropriate smoothness spaces, let \({\varDelta }\) denote the Laplace–Beltrami operator on \({\mathcal {M}}\) and let \(\{\varphi _\ell \}_{\ell =0}^\infty \) be the collection of its orthonormal eigenfunctions with eigenvalues \(\{-\lambda _\ell \}_{\ell =0}^\infty \) arranged by \(0=\lambda _0 \le \lambda _1\le \ldots \). We choose each \(\varphi _\ell \), \(\ell =0,1,2,\ldots \), to be real-valued with \(\varphi _0\equiv 1\). Given \(f\in L_p({\mathcal {M}})\) with \(1\le p\le \infty \), the Fourier transform is defined by

$$\begin{aligned} {\hat{f}}(\ell ):=\int _{{\mathcal {M}}} f(x)\varphi _\ell (x)\mathrm {d}\mu (x),\qquad \ell =0,1,2,\ldots , \end{aligned}$$

with the usual extension to distributions on \({\mathcal {M}}\). For \(1\le p\le \infty \) and \(s>0\), the Sobolev space \(H^s_p({\mathcal {M}})\) is the collection of all distributions on \({\mathcal {M}}\) with \((I-{\varDelta })^{s/2}f\in L_p({\mathcal {M}})\), i.e., with

$$\begin{aligned} \Vert f\Vert _{H^s_p}&:= \Vert (I-{\varDelta })^{s/2} f \Vert _{L_{p}} \\&= \left\| \sum _{\ell =0}^\infty (1+\lambda _\ell )^{s/2} {\hat{f}}(\ell ) \varphi _\ell \right\| _{L_p}<\infty . \end{aligned}$$

For \(s>d/p\), each function in \(H^s_p({\mathcal {M}})\) is continuous, cf. Brandolini et al. (2014) and Triebel (1992), Theorem 7.4.5, Section 7.4.2, so that point evaluation is well-defined.

For \(s>d/p\) and any set of points \(\{x_j\}_{j=1}^n\subset {\mathcal {M}}\) with arbitrary weights \(\{ w_j\}_{j=1}^n\subset {\mathbb {R}}\), we have

$$\begin{aligned} n^{-s/d} \lesssim {{\,\mathrm{wce}\,}}(\{(x_j, w_j)\}_{j=1}^{n},H_p^s({\mathcal {M}})), \end{aligned}$$
(5)

see Brauchart et al. (2014) for the sphere and Brandolini et al. (2014) for the general case. Note that the constant in (5) may depend on s, \({\mathcal {M}}\), and p.

Another lower bound involves the covering radius,

$$\begin{aligned} \rho _n : = \max _{x\in {\mathcal {M}}} \min _{j=1,\ldots ,n} {{\,\mathrm{dist}\,}}_{{\mathcal {M}}}(x,x_j), \end{aligned}$$

where \({{\,\mathrm{dist}\,}}_{{\mathcal {M}}}\) denotes the geodesic distance. According to Breger et al. (2018), it also holds

$$\begin{aligned} \rho _n^{s+d/q} \lesssim {{\,\mathrm{wce}\,}}(\{(x_j, w_j)\}_{j=1}^{n},H^s_p({\mathcal {M}})), \end{aligned}$$
(6)

where \(1/p+1/q=1\).
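
Numerically, the covering radius can be estimated by maximizing the distance to the nearest point over a dense test sample. A minimal sketch for \({\mathcal {M}}={\mathbb {S}}^2\), where the geodesic distance is \(\arccos \langle x,y\rangle \); the test sample size and batching are arbitrary implementation choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_sphere(n):
    z = rng.standard_normal((n, 3))
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def covering_radius(points, n_test=100_000, batch=5_000):
    """Monte Carlo estimate of max_x min_j dist(x, x_j) on S^2."""
    worst = 0.0
    for _ in range(n_test // batch):
        test = sample_sphere(batch)
        cosines = np.clip(test @ points.T, -1.0, 1.0)
        worst = max(worst, np.arccos(cosines).min(axis=1).max())
    return worst

for n in [100, 400, 1600]:
    print(n, covering_radius(sample_sphere(n)))   # roughly (log(n)/n)^{1/2} for random points
```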

Attempting to match this lower bound, we shall optimize the weights. Given points \(\{x_j\}_{j=1}^n\subset {\mathcal {M}}\), we define optimal weights with nonnegativity constraints by

$$\begin{aligned} \{{\widehat{w}}^{\ge 0;\,p}_j\}_{j=1}^n:= \mathop {\mathrm{arg\,min}}\limits _{w_1,\ldots , w_n\ge 0}{{\,\mathrm{wce}\,}}(\{(x_j, w_j)\}_{j=1}^{n},H_p^s({\mathcal {M}})).\nonumber \\ \end{aligned}$$
(7)

It should be mentioned that similar weight optimization is suggested in Liu and Lee (2016), where the additional constraint \(\sum _{j=1}^n w_j=1\) is used for stabilization purposes.

The worst case error for the optimized weights is upper bounded by the covering radius:

Theorem 1

Let \(1\le p\le \infty \), suppose \(s>d/p\), and let \(\{x_j\}_{j=1}^n\subset {\mathcal {M}}\) be any set of points with covering radius \(\rho _n\). Then the optimized weights \(\{{\widehat{w}}^{\ge 0;\, p}_j\}_{j=1}^n\) in (7) satisfy

$$\begin{aligned} {{\,\mathrm{wce}\,}}(\{(x_j,{\widehat{w}}^{\ge 0;\,p}_j)\}_{j=1}^{n},H_p^s({\mathcal {M}})) \lesssim \rho _n^s. \end{aligned}$$
(8)

Note that the constant in (8) may depend on \({\mathcal {M}}\), s, and p.

Remark 2

If we fix a constant \(c>0\), independent of n and \(\{x_j\}_{j=1}^n\), then any weights \(\{\widetilde{ w}^p_j\}_{j=1}^n\subset {\mathbb {R}}\) with

$$\begin{aligned}&{{\,\mathrm{wce}\,}}(\{(x_j,{\widetilde{w}}^{p}_j)\}_{j=1}^{n},H_p^s({\mathcal {M}})) \\&\quad \le c \cdot {{\,\mathrm{wce}\,}}(\{(x_j,{\widehat{w}}^{\ge 0;\,p}_j)\}_{j=1}^{n},H_p^s({\mathcal {M}})) \end{aligned}$$

satisfy the estimate

$$\begin{aligned} {{\,\mathrm{wce}\,}}(\{(x_j,{\widetilde{w}}^{p}_j)\}_{j=1}^{n},H_p^s({\mathcal {M}})) \lesssim \rho _n^s. \end{aligned}$$

This fact is beneficial when we compute weights numerically.

Proof of Theorem 1

Let \(X:=\{x_j\}_{j=1}^n\) and \(\rho (X):=\rho _n\). There is a subset \(Y=\{y_j\}_{j=1}^m\subset X\) with covering radius \(\rho (Y)\le 2\rho (X)\) and minimal separation

$$\begin{aligned} \delta (Y):=\min _{\begin{array}{c} a,b\in Y\\ a\ne b \end{array}} {{\,\mathrm{dist}\,}}_{{\mathcal {M}}}(a,b) \end{aligned}$$

such that \(\rho (Y)\le 2\delta (Y)\), cf. Filbir and Mhaskar (2010), Section 3. We observe that our present setting satisfies the technical requirements of Filbir and Mhaskar (2010), cf. Hsu (1999) and Chavel (1984), p. 159. We deduce from Brandolini et al. (2014), Lemma 2.14 and Filbir and Mhaskar (2010), Theorem 3.1, see also Mhaskar et al. (2001) for \({\mathcal {M}}={\mathbb {S}}^d\), with Brandolini et al. (2014), Corollary 2.15 that there exist \( w_1,\ldots , w_m\gtrsim \rho _n^d\), such that

$$\begin{aligned} {{\,\mathrm{wce}\,}}(\{(y_j, w_j)\}_{j=1}^{m},H_p^s({\mathcal {M}})) \lesssim m^{-s/d}. \end{aligned}$$

Since the bound

$$\begin{aligned}&{{\,\mathrm{wce}\,}}(\{(x_j,{\widehat{w}}^{\ge 0;\, p}_j)\}_{j=1}^{n},H_p^s({\mathcal {M}})) \\&\quad \le {{\,\mathrm{wce}\,}}(\{(y_j, w_j)\}_{j=1}^{m},H_p^s({\mathcal {M}})), \end{aligned}$$

holds, we also obtain

$$\begin{aligned} {{\,\mathrm{wce}\,}}(\{(x_j,{\widehat{w}}^{\ge 0;\, p}_j)\}_{j=1}^{n},H_p^s({\mathcal {M}})) \lesssim m^{-s/d}. \end{aligned}$$

The general lower bound on the covering radius \(m^{-1/d}\lesssim \rho (Y)\), cf. Breger et al. (2018), implies

$$\begin{aligned} m^{-s/d}\lesssim \rho (Y)^s\lesssim \rho (X)^s, \end{aligned}$$

which concludes the proof. \(\square \)

Combining (6) with Theorem 1 for \(p=1\) yields

$$\begin{aligned} {{\,\mathrm{wce}\,}}(\{(x_j,{\widehat{w}}^{\ge 0;\,1}_j)\}_{j=1}^{n},H_1^s({\mathcal {M}})) \asymp \rho _n^s, \end{aligned}$$

so that the worst case error’s asymptotic behavior is governed by the covering radius.

Remark 3

The above proof reveals that in the setting of Theorem 1 there exist \(\{ w_j\}_{j=1}^n\) with either \( w_j\gtrsim \rho ^d_n\) or \( w_j=0\), for \(j=1,\ldots ,n\), such that

$$\begin{aligned} {{\,\mathrm{wce}\,}}(\{(x_j, w_j)\}_{j=1}^{n},H_p^s({\mathcal {M}})) \lesssim \rho _n^s. \end{aligned}$$

The covering radius \(\rho _n\) of any n points in \({\mathcal {M}}\) satisfies \(\rho _n\gtrsim n^{-1/d}\), which follows from standard volume arguments. If \(\{x_j\}_{j=1}^n\subset {\mathcal {M}}\) are points with asymptotically optimal covering radius, i.e., \(\rho _n\asymp n^{-1/d}\), then Theorem 1 yields the optimal rate for the worst case integration error

$$\begin{aligned} {{\,\mathrm{wce}\,}}(\{(x_j,{\widehat{w}}^{\ge 0;\, p}_j)\}_{j=1}^{n},H_p^s({\mathcal {M}})) \lesssim n^{-s/d}, \end{aligned}$$

cf. (5).

Several point sets on \({\mathbb {S}}^2\) with asymptotically optimal covering radius are discussed in Hardin et al. (2016), see quasi-uniform point sequences therein, and see Breger et al. (2018) for general \({\mathcal {M}}\). The covering radius of random points is studied in Brauchart et al. (2018), Oates et al. (2018) and Reznikov and Saff (2015), which leads to almost optimal bounds on the worst case error in the subsequent section. Although we shall consider independent random points, it is noteworthy that it is verified in Oates et al. (2018) that the required estimates on the covering radius still hold for random points arising from a Markov chain instead of being independent. Note also that results related to Theorem 1 are derived in Mhaskar (2018) for more general spaces \({\mathcal {M}}\).

4 Consequences for random points

For random points \(\{x_j\}_{j=1}^n\subset {\mathcal {M}}\) and any type of weights \(\{ w_j\}_{j=1}^n\subset {\mathbb {R}}\), no matter if random or not, (5) implies, for all \(r>0\),

$$\begin{aligned} n^{-s/d}\lesssim \Big ({\mathbb {E}}\big [{{\,\mathrm{wce}\,}}(\{(x_j, w_j)\}_{j=1}^{n},H_p^s({\mathcal {M}}))^r\big ]\Big )^{1/r}, \end{aligned}$$
(9)

where the constant may depend on s, p, and \({\mathcal {M}}\). Note that if \(\{x_j\}_{j=1}^n\subset {\mathcal {M}}\) are random points, then the weights \(\{{\widehat{w}}^{\ge 0;\, p}_j\}_{j=1}^n\) are random as well. We now deduce from Theorem 1 that the optimal worst case error rate is (almost) matched in this case:

Corollary 1

Let \(\{x_j\}_{j=1}^n\subset {\mathcal {M}}\) be random points, independently distributed according to \(\mu \). Suppose \(1\le p\le \infty \) and \(s>d/p\). Then, for each \(r\ge 1/s\), it holds

$$\begin{aligned}&\Big ({\mathbb {E}}\big [{{\,\mathrm{wce}\,}}(\{(x_j,{\widehat{w}}^{\ge 0;\, p}_j)\}_{j=1}^{n},H_p^s({\mathcal {M}}))^r\big ]\Big )^{1/r} \nonumber \\&\quad \lesssim n^{-s/d}\log (n)^{s/d}. \end{aligned}$$
(10)

Note that Corollary 1 yields the optimal rate up to the logarithmic factor \(\log (n)^{s/d}\), cf. (9), and that the constant in (10) may depend on s, \({\mathcal {M}}\), p, and r.

Proof

From Reznikov and Saff (2015), Theorem 3.2, Corollary 3.3, we deduce that, for each \(r\ge 1\),

$$\begin{aligned} \big ({\mathbb {E}}\rho ^r_n\big )^{1/r} \asymp n^{-1/d}\log (n)^{1/d}, \end{aligned}$$

where the constant may depend on \({\mathcal {M}}\) and r. Thus, Theorem 1 implies

$$\begin{aligned} {\mathbb {E}}\big [{{\,\mathrm{wce}\,}}(\{(x_j,{\widehat{w}}^{\ge 0;\, p}_j)\}_{j=1}^{n},H_p^s({\mathcal {M}}))^r\big ]\lesssim \big (\tfrac{\log (n)}{n}\big )^{sr/d}, \end{aligned}$$

for each \(r\ge 1/s\). \(\square \)

Remark 4

Let \(\nu \) be a probability measure on \({\mathcal {M}}\) that is absolutely continuous with respect to \(\mu \) and whose density is bounded away from zero, i.e., \(\nu =f \mu \) with \(f(x)\ge c>0\), for all \(x\in {\mathcal {M}}\). Corollary 1 still holds for independent samples from \(\nu \), where the constant in (10) then also depends on c. This is because \(\approx \frac{2}{c}n\) independent samples from \(\nu \) cover \({\mathcal {M}}\) at least as well as n independent samples from \(\mu \).

Corollary 1 yields bounds on the moments of the worst case integration error. The results in Reznikov and Saff (2015) also enable us to derive probability estimates:

Corollary 2

Under the assumptions of Corollary 1, there are positive constants \(c_1,\ldots ,c_4\) depending on \({\mathcal {M}}\), where \(c_2\) may additionally depend on s and p, such that,

$$\begin{aligned}&{\mathbb {P}}\Big ({{\,\mathrm{wce}\,}}(\{(x_j,{\widehat{w}}^{\ge 0;\, p}_j)\}_{j=1}^{n},H_p^s({\mathcal {M}})) \ge c_2 \big (\tfrac{r\log (n)}{n}\big )^{s/d} \Big ) \\&\quad \le c_3 \left( \frac{1}{n}\right) ^{c_4r-1}, \end{aligned}$$

for all \(r\ge c_1\).

Proof

By applying Theorem 1, we deduce that there is a constant \(c>0\), which may depend on \({\mathcal {M}}\), s, and p, such that

$$\begin{aligned} {{\,\mathrm{wce}\,}}(\{(x_j,{\widehat{w}}^{\ge 0;\, p}_j)\}_{j=1}^{n},H_p^s({\mathcal {M}})) \le c \rho _n^s. \end{aligned}$$

According to Reznikov and Saff (2015), Theorem 2.1, there are constants \(c_1,{\tilde{c}}_2, c_3,c_4>0\), which may depend on \({\mathcal {M}}\), such that, for all \(r\ge c_1\),

$$\begin{aligned} {\mathbb {P}}\Big ( \rho _n\ge {\tilde{c}}_2\big (\tfrac{r\log (n)}{n}\big )^{1/d} \Big )\le c_3 \big (\tfrac{1}{n}\big )^{c_4r-1}. \end{aligned}$$

Raising the left inequality to the power s and multiplying by c yields the desired result with \(c_2:=c{\tilde{c}}^s_2\). \(\square \)

Our bounds address functions in \(H_p^s({\mathcal {M}})\) exclusively. For bounds beyond Sobolev functions in misspecified settings, we refer to Kanagawa et al. (2019) and references therein.

Remark 5

Corollaries 1 and 2 also hold for weights \(\{{\widetilde{w}}_j^p\}_{j=1}^n\) that minimize (7) up to a constant factor as discussed in Remark 2, and, in particular, for the unconstrained minimizer

$$\begin{aligned} \{{\widehat{w}}^p_j\}_{j=1}^n:= \mathop {\mathrm{arg\,min}}\limits _{\{ w_j\}_{j=1}^n\subset {\mathbb {R}}}{{\,\mathrm{wce}\,}}(\{(x_j, w_j)\}_{j=1}^{n},H_p^s({\mathcal {M}})). \end{aligned}$$
(11)

Nonnegative weights are often more desirable for numerical applications of cubature points, but solving the constrained minimization problem (7) is usually more involved than dealing with the unconstrained problem (11).

5 Implications for Bayesian Monte Carlo

Our results have consequences for Bayesian cubature, cf. Larkin (1972), an integration method whose output is not a scalar but a distribution. Bayesian cubature enables a statistical quantification of integration error, useful in the context of a wider computational work-flow to measure the impact of integration error on subsequent output, cf. Briol et al. (2018) and Cockayne et al. (2017).

Consider a linear topological space \({\mathcal {L}}\) of continuous functions on \({\mathcal {M}}\) such as a reproducing kernel Hilbert space on \({\mathcal {M}}\). The integrand f in Bayesian cubature is treated as a Gaussian random process; that is, \(f : {\mathcal {M}} \times {\varOmega }\rightarrow {\mathbb {R}}\), where \(f(\cdot ,\omega ) \in {\mathcal {L}}\) for each \(\omega \in {\varOmega }\), and the random variables \(\omega \mapsto L f(\cdot ,\omega ) \in {\mathcal {L}}\) are (univariate) Gaussian for all continuous linear functionals L on \({\mathcal {L}}\), such as integration (\({\mathcal {I}} f = \int _{{\mathcal {M}}} f(x) \mathrm {d} \mu (x)\)) and point evaluation (\(\delta _x f = f(x)\)) operators, cf. Bogachev (1998). The Bayesian approach is then taken, wherein the process f is constrained to interpolate the values \(\{(x_j,f(x_j))\}_{j=1}^n\). Formally, this is achieved by conditioning the process on the data provided through the point evaluation operators \(\delta _{x_j}(f) = f(x_j)\), for \(\{x_j\}_{j=1}^n \subset {\mathcal {M}}\). The conditioned process, denoted \(f_n\), is again Gaussian, cf. Bogachev (1998), and as such the linear functional \({\mathcal {I}} f_n\) is a (univariate) Gaussian; this is the output of the Bayesian cubature method. This distribution, defined on the real line, provides statistical uncertainty quantification for the (unknown) true value of the integral.

Concretely, let \(K(x,y) = \text {cov}(f(x),f(y))\) denote the covariance function that characterizes the Gaussian probability model. The output of Bayesian cubature is the univariate Gaussian distribution with mean

$$\begin{aligned} \sum _{j=1}^n {\widehat{w}}_j f(x_j), \end{aligned}$$

which takes the form of a weighted integration method with weights \({\widehat{w}} = ({\widehat{w}}_1,\dots ,{\widehat{w}}_n)^\top \) implicitly defined by \({\mathcal {K}} {\widehat{w}}=b\) where

$$\begin{aligned} {\mathcal {K}}&:= \left[ \begin{array}{ccc} K(x_1,x_1) &{} \dots &{} K(x_1,x_n) \\ \vdots &{} &{} \vdots \\ K(x_n,x_1) &{} \dots &{} K(x_n,x_n) \end{array} \right] \end{aligned}$$
(12)
$$\begin{aligned} b&:= \left[ \begin{array}{c} \int _{{\mathcal {M}}} K(x_1,y) \mathrm {d}\mu (y) \\ \vdots \\ \int _{{\mathcal {M}}} K(x_n,y) \mathrm {d}\mu (y) \end{array} \right] , \end{aligned}$$
(13)

cf. Briol et al. (2018). The integrals appearing in (13) have an elegant interpretation in terms of the kernel mean embedding of the distribution \(\mu \) (Muandet et al. 2017).
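
As a concrete illustration, here is a minimal Python sketch of the Bayesian cubature weights for the kernel \(K(x,y)=2-\Vert x-y\Vert \) on \({\mathbb {S}}^2\), whose integral against \(\mu \) equals 2/3 for every x (see Sect. 6); substituting \({\widehat{w}}={\mathcal {K}}^{-1}b\) into (3) shows that the posterior variance below equals the squared worst case error at these weights. The integrand is an arbitrary test function.

```python
import numpy as np

rng = np.random.default_rng(3)

n = 200
x = rng.standard_normal((n, 3))
x /= np.linalg.norm(x, axis=1, keepdims=True)          # random points on S^2

gram = 2 - np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)   # matrix (12)
b = np.full(n, 2 / 3)                                               # vector (13)

w_hat = np.linalg.solve(gram, b)        # Bayesian cubature weights
variance = 2 / 3 - b @ w_hat            # = wce^2 in (3) at the weights w_hat

f = lambda y: 1 + y[:, 0] + y[:, 1] ** 2    # test integrand; exact integral is 4/3
print(w_hat @ f(x), np.sqrt(max(variance, 0.0)))
```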

Any symmetric positive definite covariance function K can be viewed as a reproducing kernel. In particular, the Bessel kernel

$$\begin{aligned} K^{(s)}_{B}(x,y)=\sum _{\ell =0}^\infty (1+\lambda _\ell )^{-s} \varphi _\ell (x)\varphi _\ell (y),\quad x,y\in {\mathcal {M}}, \end{aligned}$$
(14)

reproduces \(H^s({\mathcal {M}}):=H^s_2({\mathcal {M}})\), for \(s>d/2\). Observe that the weights \({\widehat{w}}\) just defined solve the unconstrained minimization problem (11) for \(p=2\). The latter follows from the quadratic minimization form in (3) as well as from the posterior mean being an \(L_2\)-optimal estimator (Kimeldorf and Wahba 1970).

The variance of the Gaussian measure can be shown to be formally equal to (3) when these weights are substituted, see Briol et al. (2018). The special case where the points \(\{x_j\}_{j=1}^n\) are random was termed Bayesian Monte Carlo in Rasmussen and Ghahramani (2003). Therefore, our results in Sect. 4 have direct consequences for Bayesian Monte Carlo. In view of Remark 5, Corollaries 1 and 2 generalize, within this Bayesian setting, earlier work of Briol et al. (2018) to general smooth, connected, closed Riemannian manifolds.

Fig. 1 The worst case integration error (wce) for \({\mathcal {H}}_{K_1}\) and \({\mathcal {H}}_{K_2}\) averaged over 20 instances of random points in logarithmic scalings. The lines with exact slope \(-\,1/2\) and \(-\,3/4\) are included as a visual aid. The three curves show the average wce with equal weights, with nonnegative weights, and with optimal weights.

6 Numerical experiments for the sphere and the Grassmannian

The numerical computation of the worst case error \({{\,\mathrm{wce}\,}}(\{(x_j, w_j)\}_{j=1}^{n},H_p^s({\mathcal {M}}))\) is difficult in general but, for \(p=2\), it is expressed in terms of the reproducing kernel in (3). Therefore, our numerical experiments are designed for \(p=2\). However, the kernel \(K^{(s)}_B\) itself, see (14), may still be difficult to evaluate numerically, so that we would like to allow for other kernels in numerical experiments. If K is any positive definite kernel on \({\mathcal {M}}\) that reproduces \(H^s({\mathcal {M}})\) with equivalent norms, then

$$\begin{aligned} {{\,\mathrm{wce}\,}}(\{(x_j, w_j)\}_{j=1}^{n},H^s({\mathcal {M}})) \asymp {{\,\mathrm{wce}\,}}(\{(x_j, w_j)\}_{j=1}^{n},{\mathcal {H}}_K). \end{aligned}$$

Therefore, the asymptotic results in Corollaries 1 and 2 are the same when replacing \(\{\widehat{ w}^{\ge 0;\, 2}_j\}_{j=1}^n\) with the minimizer

$$\begin{aligned} \{{\widehat{w}}_j^{\ge 0;K}\}_{j=1}^n:= \mathop {\mathrm{arg\,min}}\limits _{w_1,\ldots , w_n\ge 0}{{\,\mathrm{wce}\,}}(\{(x_j, w_j)\}_{j=1}^{n},{\mathcal {H}}_K). \end{aligned}$$
(15)

Dropping the nonnegativity constraints yields \({\widehat{w}}^{K}\), which is given by \({\widehat{w}} = {\mathcal {K}}^{-1}b\), where \({\mathcal {K}}\) and b are as in (12) and (13). To provide numerical experiments for Sobolev spaces on the sphere \({\mathbb {S}}^2\subset {\mathbb {R}}^3\) and on the Grassmannian \({\mathcal {G}}_{2,4}\), we shall specify suitable kernels in the following. We shall consider two kernels \(K_1,K_2\) on the sphere \({\mathbb {S}}^2\) and two kernels \(K_3,K_4\) on the Grassmannian \({\mathcal {G}}_{2,4}\).

Fig. 2 The worst case integration error for \({\mathcal {H}}_{K_3}\) and \({\mathcal {H}}_{K_4}\) averaged over 20 instances with random points in logarithmic scalings. The lines with exact slope \(-\,1/2\) and \(-\,7/8\) are included as a visual aid. The three curves show the average wce with equal weights, with nonnegative weights, and with optimal weights.

The numerical results are produced by taking sequences of random points \(\{x_j\}_{j=1}^n\) with increasing cardinality n. We compute each of the three worst case errors \({{\,\mathrm{wce}\,}}(\{(x_j,\tfrac{1}{n})\}_{j=1}^{n},{\mathcal {H}}_{K_i})\), \({{\,\mathrm{wce}\,}}(\{(x_j,{\widehat{w}}^{K_i}_j)\}_{j=1}^{n},{\mathcal {H}}_{K_i})\), and \({{\,\mathrm{wce}\,}}(\{(x_j,{\widehat{w}}^{\ge 0;K_i}_j)\}_{j=1}^{n},{\mathcal {H}}_{K_i})\), for \(i=1,\ldots ,4\), and average these results over 20 instantiations of the random points. The minimization problems for the latter two quantities are solved using the Python CVXOPT library. It should be mentioned that numerical experiments on the sphere for the unconstrained optimizer \({\widehat{w}}^{K_1}\) are also contained in Briol et al. (2018).
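
By (3), the nonnegative-weight problem (15) is a quadratic program in the weights. A minimal sketch of the corresponding CVXOPT call, assuming a Gram matrix `gram` and a kernel-mean vector `b` as in (12) and (13) have already been computed (for instance as in the sketches above); the unconstrained weights \({\widehat{w}}^{K}\) are obtained from `np.linalg.solve(gram, b)` as in Sect. 5.

```python
import numpy as np
from cvxopt import matrix, solvers

def nonnegative_weights(gram, b):
    """Minimize w^T gram w - 2 b^T w subject to w >= 0, cf. (3) and (15)."""
    n = len(b)
    P = matrix(2.0 * gram)            # CVXOPT minimizes (1/2) w^T P w + q^T w
    q = matrix(-2.0 * b)
    G = matrix(-np.eye(n))            # encodes -w <= 0, i.e. w >= 0
    h = matrix(np.zeros(n))
    solvers.options["show_progress"] = False
    sol = solvers.qp(P, q, G, h)
    return np.array(sol["x"]).ravel()
```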

The kernel

$$\begin{aligned} K_{1}(x,y) := 2 - \Vert x-y\Vert ,\quad x,y\in {\mathbb {S}}^2, \end{aligned}$$

reproduces the Sobolev space \(H^{3/2}({\mathbb {S}}^2)\) with an equivalent norm, cf. Gräf (2013), Section 6.4.1. To compute (3), it is sufficient to notice

$$\begin{aligned} \int _{{\mathbb {S}}^2} K_1(x,y) \mathrm {d}\mu (y) = \frac{2}{3}, \quad \text {for all } x\in {\mathbb {S}}^2. \end{aligned}$$

By plotting the worst case error versus the number of points in logarithmic scale, we expect to observe lines whose slopes coincide with the decay rate \(-\,s/d\) for the optimized weights and with \(-\,1/2\) for the weights \(1/n\). Indeed, we see in Fig. 1a that \({{\,\mathrm{wce}\,}}(\{(x_j,\tfrac{1}{n})\}_{j=1}^{n},{\mathcal {H}}_{K_1})\) for random points matches the error rate \(-\,1/2\) predicted by (4) with \(d=2\). When optimizing the weights, we observe the decay rate \(-\,3/4\) for both optimizations, \(\widehat{ w}^{\ge 0;K}\) in (15) and the unconstrained minimizer \({\widehat{w}}^K\). Hence, the numerical results match the rate predicted by the theoretical findings in (9), (10) with \(p=2\) and \(r=1\). The logarithmic factor in (10) is not visible.

The smooth kernel

$$\begin{aligned} K_2 (x,y) := 48\exp (-12\Vert x-y\Vert ^2),\quad x,y\in {\mathbb {S}}^2, \end{aligned}$$

generates a space \({\mathcal {H}}_{K_2}\) of smooth functions contained in \(H^s({\mathbb {S}}^2)\), for all \(s>0\), and satisfies

$$\begin{aligned} \int _{{\mathbb {S}}^2} K_2(x,y) \mathrm {d}\mu (y) = 1 - \exp (-\,48),\quad \text {for all } x\in {\mathbb {S}}^2. \end{aligned}$$

Our numerical experiments in Fig. 1b suggest that the decay rate for the optimized weights is indeed faster than \(-\,1/2\). Note that the equal weight case is stuck at the \(-\,1/2\) rate, although we are now dealing with arbitrarily smooth functions.

The dimension of the Grassmannian \({\mathcal {G}}_{2,4}\) is \(d=4\), and we consider the two reproducing kernels

$$\begin{aligned} K_{3}(x,y)&:= \sqrt{(2-{{\,\mathrm{trace}\,}}(xy))^{3}} + 2 {{\,\mathrm{trace}\,}}(xy), \\ K_4 (x,y)&:= \tfrac{3}{2}\exp \big ({{\,\mathrm{trace}\,}}(xy)-2\big ). \end{aligned}$$

Note that \(K_3\) reproduces \(H^{7/2}({\mathcal {G}}_{2,4})\) with an equivalent norm, and \({\mathcal {H}}_{K_4}\) is contained in \(H^{s}({\mathcal {G}}_{2,4})\), for all \(s>0\). The worst case integration error (3) is computable from

$$\begin{aligned} \int _{{\mathcal {G}}_{2,4}} K_3(x,y) \mathrm {d}\mu (y)&= 2 + \frac{74}{75}\sqrt{2}-\frac{2}{5}\log (1+\sqrt{2}),\\ \int _{{\mathcal {G}}_{2,4}} K_4(x,y) \mathrm {d}\mu (y)&= \frac{3}{2} \exp (-1) \int _0^1 \frac{\sinh (t)}{t}\,\mathrm {d}t, \end{aligned}$$

for all \(x\in {\mathcal {G}}_{2,4}\).
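
Random points on \({\mathcal {G}}_{2,4}\), represented as rank-2 projection matrices, can be drawn by orthonormalizing Gaussian \(4\times 2\) matrices (a standard construction for the invariant measure, not spelled out above). A minimal sketch that evaluates the equal-weight worst case error (3) for \(K_3\), using the constant kernel mean from the display above (interpreting \(\log \) as the natural logarithm); since the kernel mean is constant in x, the double integral takes the same value.

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_grassmannian(n, k=2, m=4):
    """Random rank-k projections x = QQ^T whose column space is rotation invariant."""
    pts = []
    for _ in range(n):
        q, _ = np.linalg.qr(rng.standard_normal((m, k)))
        pts.append(q @ q.T)
    return np.array(pts)

def k3(x, y):
    t = np.trace(x @ y)
    return np.sqrt((2 - t) ** 3) + 2 * t

c3 = 2 + 74 / 75 * np.sqrt(2) - 2 / 5 * np.log(1 + np.sqrt(2))   # kernel mean of K_3

n = 100
x = sample_grassmannian(n)
gram = np.array([[k3(xi, xj) for xj in x] for xi in x])
w = np.full(n, 1 / n)
print(np.sqrt(w @ gram @ w - 2 * c3 * w.sum() + c3))   # equal-weight wce, cf. (3)
```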

For \(K_3\), we observe in Fig. 2a that the random points with equal weights yield the decay rate \(-\,1/2\), and optimizing the weights leads to \(-\,7/8\), matching the optimal rate in (9), (10) with \(d=4\). In Fig. 2b, the worst case error for \({\mathcal {H}}_{K_4}\) appears to decay faster than the \(-\,1/2\) rate when optimizing the weights for random points on the Grassmannian \({\mathcal {G}}_{2,4}\), outperforming the case of equal weights.

7 Beyond closed manifolds

We shall make use of the push-forward to transfer our results on the worst case integration error from closed manifolds to more general sets. Suppose S is a topological space and \(h:{\mathcal {M}}\rightarrow S\) is Borel measurable and surjective. We endow S with the push-forward measure \(h_*\mu \) defined by \((h_*\mu )(A)=\mu (h^{-1}A)\) for any Borel measurable subset \(A\subset S\). By abusing notation, let \({{\,\mathrm{dist}\,}}_{{\mathcal {M}}}(A,B):=\inf _{a\in A;\, b\in B} {{\,\mathrm{dist}\,}}_{{\mathcal {M}}}(a,b)\) for \(A,B\subset {\mathcal {M}}\), and we put

$$\begin{aligned} {{\,\mathrm{dist}\,}}_{S,h}(x,y):= {{\,\mathrm{dist}\,}}_{{\mathcal {M}}}(h^{-1}x,h^{-1}y),\quad x,y\in S. \end{aligned}$$
(16)

For \(s>d/p\), we define

$$\begin{aligned} H^s_p(S)_h:=\{f:S\rightarrow {\mathbb {R}}: h^*f \in H^s_p({\mathcal {M}})\} \end{aligned}$$

with \(\Vert f\Vert _{H^s_p(S)_h}:=\Vert h^* f\Vert _{H^s_p({\mathcal {M}})}\), where \(h^*f\) denotes the pullback \(f\circ h\). This enables us to formulate the analogue of Theorem 1:

Theorem 2

Given \(1\le p\le \infty \) with \(s>d/p\) and \(\{x_j\}_{j=1}^n\subset S\), suppose that the following two conditions are satisfied,

  1. (a)

    \(\# h^{-1}x\) is finite, for all \(x\in S\),

  2. (b)

    \({{\,\mathrm{dist}\,}}_{{\mathcal {M}}}(\{a\},h^{-1}x) \asymp {{\,\mathrm{dist}\,}}_{{\mathcal {M}}} (h^{-1}h(a),h^{-1}x)\), for all \(a\in {\mathcal {M}}\), \(x\in S\).

Then there are nonnegative weights \(\{ w_j\}_{j=1}^n\) such that

$$\begin{aligned} {{\,\mathrm{wce}\,}}(\{(x_j, w_j)\}_{j=1}^{n},H^s_p(S)_h) \lesssim \rho _n^s, \end{aligned}$$

where \(\rho _n\) denotes the covering radius of \(\{x_j\}_{j=1}^n\) taken with respect to (16).

Note that (16) is a quasi-metric on S if the assumptions in Theorem 2 are satisfied, i.e., the conditions of a metric are satisfied except for the triangle inequality, which still holds up to a constant factor.

Conventional integration bounds on standard Euclidean domains S, see Kanagawa et al. (2019), Proposition 4, for instance, usually require bounded densities. Theorem 2 also applies to standard Euclidean domains that now inherit the measure and the distance from the closed manifold. The measure can now have unbounded density because there is a potential compensation by the induced distance. This shall be observed for the unit ball, the cube, and the simplex in the present section.

Proof of Theorem 2

Denote \(\{z_{j,i}\}_{i=1}^{n_j}=h^{-1}x_j\), for \(j=1,\ldots ,n\). According to Theorem 1, there exist nonnegative weights \(\{ w_{j,i}\}_{i=1}^{n_j}\), for \(j=1,\ldots ,n\), such that, for all \(f\in H^s_p(S)_h\),

$$\begin{aligned}&\left| \int _{{\mathcal {M}}} (h^*f) (z)\mathrm {d}\mu (z) - \sum _{j=1}^n\sum _{i=1}^{n_j} w_{j,i}(h^*f) (z_{j,i}) \right| \nonumber \\&\quad \lesssim \rho _{{\mathcal {M}}} \left( \bigcup _{j=1}^n h^{-1}x_j \right) ^s \Vert h^* f \Vert _{H^s_p({\mathcal {M}})}, \end{aligned}$$
(17)

where \(\rho _{{\mathcal {M}}}(\bigcup _{j=1}^n h^{-1}x_j)\) denotes the covering radius of \(\bigcup _{j=1}^n h^{-1}x_j\subset {\mathcal {M}}\). The assumptions imply that \(\rho _n\asymp \rho _{{\mathcal {M}}}(\bigcup _{j=1}^n h^{-1}x_j)\), so that \( w_j:=\sum _{i=1}^{n_j} w_{j,i}\), \(j=1,\ldots ,n\), and (17) lead to

$$\begin{aligned} \left| \int _{S} f(x) \mathrm {d}(h_*\mu )(x) - \sum _{j=1}^n w_{j} f(x_j) \right| \lesssim \rho _{n}^s \Vert f\Vert _{H^s_p(S)_h}, \end{aligned}$$

which concludes the proof. \(\square \)

Remark 6

Since independent random points \(\{x_j\}_{j=1}^n\) distributed according to \(h_*\mu \) on S with covering radius \(\rho _n\) are generated by independent random points \(\{z_j\}_{j=1}^n\) with respect to \(\mu \) on \({\mathcal {M}}\) with \(x_j=h(z_j)\), for \(j=1,\ldots ,n\), the observation

$$\begin{aligned} \rho _n\lesssim \rho _{{\mathcal {M}}}(\{z_j\}_{j=1}^n) \end{aligned}$$

implies that Corollaries 1 and 2 also hold for \(H^s_p(S)_h\) and \(h_*\mu \).

The impact of Theorem 2 depends on whether or not the choices of h yield reasonable function spaces \(H^s_p(S)_h\), distances \({{\,\mathrm{dist}\,}}_{S,h}\), and measures \(h_*\mu \). For instance, if h is also injective with measurable \(h^{-1}\), then \(H^s(S)_h\) is the reproducing kernel Hilbert space with kernel

$$\begin{aligned} \sum _{\ell =0}^\infty (1+\lambda _\ell )^{-s} \psi _\ell (x)\psi _\ell (y), \end{aligned}$$

where \(\psi _\ell :=\varphi _\ell \circ h^{-1}\), so that \(\{\psi _\ell \}_{\ell =0}^\infty \) is an orthonormal basis for the square integrable functions with respect to \(h_*\mu \). In the following, we shall discuss a few special cases, in which h is not injective. By using the results in Xu (1998, 2001), we shall determine \(H^s(S)_h\) for S being the unit ball \({\mathbb {B}}^d:=\{x\in {\mathbb {R}}^d : \Vert x\Vert \le 1\}\), the cube \([-1,1]^d\), and the simplex \({\varSigma }^d:=\{x\in {\mathbb {R}}^d: x_1,\ldots ,x_d\ge 0;\; \sum _{i=1}^dx_i\le 1\}\).

Fig. 3 Worst case integration error for \({\mathcal {H}}_{K_5}\) and \({\mathcal {H}}_{K_6}\) with \(r=0.97\), averaged over random points in logarithmic scalings. The three curves show the average wce with equal weights, with nonnegative weights, and with optimal weights.

Let \(h:{\mathbb {S}}^d\rightarrow {\mathbb {B}}^d\) be the projection onto the first d coordinates, i.e., \(h(x)=(x_1,\ldots ,x_d)\in {\mathbb {B}}^d\). The push-forward measure \(h_*\mu _{{\mathbb {S}}^d}\) on \({\mathbb {B}}^d\) is given by

$$\begin{aligned} \frac{{\varGamma }(d/2+1/2)}{\pi ^{d/2+1/2}}\frac{\mathrm{d} x}{\sqrt{1-\Vert x\Vert ^2}}, \end{aligned}$$
(18)

and the assumptions in Theorem 2 are satisfied. Let \(\{T_{k,\ell }: \ell =0,1,2,\ldots ; \;k=1,\ldots ,r^d_\ell \}\) with \(r^d_\ell :=\left( {\begin{array}{c}\ell +d-1\\ \ell \end{array}}\right) \) be orthonormal polynomials with respect to the measure (18), where each \(T_{k,\ell }\) has total degree \(\ell \). For \(d=1\), this corresponds to Chebyshev polynomials. The case \(d=2\) relates to generalized Zernike polynomials, cf. Wünsche (2005).
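
Sampling from the push-forward measure (18) is immediate from its definition: draw uniform points on \({\mathbb {S}}^d\) and keep the first d coordinates. A minimal sketch for \(d=2\); the sanity check compares against the mass of the half-radius ball, which under (18) with \(d=2\) equals \(1-\sqrt{3}/2\approx 0.134\) (a direct computation, not stated above).

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_pushforward_ball(n, d=2):
    """Samples from (18): project uniform points on S^d to their first d coordinates."""
    z = rng.standard_normal((n, d + 1))
    z /= np.linalg.norm(z, axis=1, keepdims=True)   # uniform on S^d
    return z[:, :d]                                 # h(x) = (x_1, ..., x_d)

x = sample_pushforward_ball(100_000)
print(np.mean(np.linalg.norm(x, axis=1) <= 0.5))    # approx 1 - sqrt(3)/2 ~ 0.134
```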

Proposition 1

(The unit ball) For \(s>d/2\) and \(h:{\mathbb {S}}^d \rightarrow {\mathbb {B}}^d\) as above, the space \(H^s({\mathbb {B}}^d)_h\) is reproduced by the kernel

$$\begin{aligned} \begin{aligned}&K^s_{{\mathbb {B}}^d}(x,y) \\&\quad :=\sum _{\ell =0}^\infty (1+\ell (\ell +d-1))^{-s} \sum _{k=1}^{r^d_\ell }T_{k,\ell }(x)T_{k,\ell }(y) \end{aligned} \end{aligned}$$
(19)

for \(x,y\in {\mathbb {B}}^d\).

For related results on approximation on \({\mathbb {B}}^d\), we refer to Petrushev and Xu (2008) and references therein.

Proof

For \(\ell =0,1,2,\ldots \), the eigenfunctions of the Laplace–Beltrami operator on the sphere associated to the eigenvalue \(-\,\lambda _\ell =-\,\ell (\ell +d-1)\) are the spherical harmonics of order \(\ell \), given by the homogeneous harmonic polynomials in \(d+1\) variables of exact total degree \(\ell \) restricted to \({\mathbb {S}}^d\). Each eigenspace \(E_\ell \) associated to \(\lambda _\ell \) splits orthogonally into \(E_\ell = E_\ell ^{(1)} \oplus E_\ell ^{(2)}\), where

$$\begin{aligned} \begin{aligned} E_\ell ^{(1)}&= \{f\in E_\ell : f(x) = f(h(x),-x_{d+1}),\forall x\in {\mathbb {S}}^d\},\\ E_\ell ^{(2)}&= \{f\in E_\ell : f(x) = -f(h(x),-x_{d+1}),\forall x\in {\mathbb {S}}^d\} \end{aligned} \end{aligned}$$
(20)

for \(\ell =0,1,2,\ldots \). We deduce from Xu (1998), Theorem 3.3, Example 3.4 that the functions

$$\begin{aligned} Z^{(1)}_{k,\ell }(z) :=\Vert z\Vert ^{\ell }(h^*T_{k,\ell })\left( \tfrac{z}{\Vert z\Vert }\right) ,\quad z\in {\mathbb {R}}^{d+1} \end{aligned}$$

are homogeneous polynomials of total degree \(\ell \), and their restrictions \(Y^{(1)}_{k,\ell }:=Z^{(1)}_{k,\ell }|_{{\mathbb {S}}^d}\), for \(k=1,\ldots ,r^d_\ell \), are an orthonormal basis for \(E_\ell ^{(1)}\). Note that \(f\in H^s({\mathbb {B}}^d)_h\) if and only if \(f\circ h\) is contained in

$$\begin{aligned} H^s({\mathbb {S}}^d)^{{{\,\mathrm{sym}\,}}}&:= \{f\in H^s({\mathbb {S}}^d) : f|_{h^{-1}x} \;\text {is constant } \forall x\in {\mathbb {B}}^d\} \\&= \{f\in H^s({\mathbb {S}}^d) : f(x)=f(h(x),-x_{d+1}), \;x\in {\mathbb {S}}^d\}. \end{aligned}$$

According to (14) and due to the decomposition induced by (20) and using \(Y^{(1)}_{k,\ell }=h^* T_{k,\ell }\), the reproducing kernel of \(H^s({\mathbb {S}}^d)^{{{\,\mathrm{sym}\,}}}\) is

$$\begin{aligned} K^{s}_{{\mathbb {S}}^d,{{\,\mathrm{sym}\,}}}(x,y)&=\sum _{\ell =0}^\infty (1+\lambda _\ell )^{-s} \sum _{k=1}^{r^d_\ell }Y^{(1)}_{k,\ell }(x)Y^{(1)}_{k,\ell }(y) \\&= \sum _{\ell =0}^\infty (1+\lambda _\ell )^{-s} \sum _{k=1}^{r^d_\ell }(h^*T_{k,\ell })(x) (h^*T_{k,\ell })(y) \end{aligned}$$

for \(x,y\in {\mathbb {S}}^d\). Thus, \(H^s({\mathbb {B}}^d)_h\) is indeed reproduced by (19). \(\square \)

Example 1

(\({\mathbb {B}}^1\)) For \(d=1\), the even and odd spherical harmonics,

$$\begin{aligned} \begin{aligned} Y^{(1)}_{\ell }(\cos (\alpha ),\sin (\alpha ))&= \sqrt{2}\cos (\ell \alpha ),\\ Y^{(2)}_{\ell }(\cos (\alpha ),\sin (\alpha ))&= \sqrt{2}\sin (\ell \alpha ),\quad \alpha \in [0,2\pi ], \end{aligned} \end{aligned}$$
(21)

with \(\ell =1,2,3,\ldots \), and \(Y^{(1)}_0=1\), form orthonormal bases for the respective spaces \(E_\ell ^{(1)}\) and \(E_\ell ^{(2)}\) in (20). We observe \( {{\,\mathrm{dist}\,}}_{[-1,1],h}(x,y):=|\arccos (x)-\arccos (y)|\), \(x,y\in [-1,1]\), and recognize the Chebyshev measure in (18). The Chebyshev polynomials \(T_\ell \) of the first kind, scaled by the factor \(\sqrt{2}\) for \(\ell =1,2,3,\ldots \), indeed satisfy the characteristic identities \(T_\ell (\cos (\alpha )) = \sqrt{2}\cos (\ell \alpha )\) for \(\alpha \in [0,2\pi ]\), \(\ell =1,2,3,\ldots \), and \(T_0=1\).

To simplify numerical experiments, we observe that the kernel

$$\begin{aligned} K_5(x,y)&:= 2 - \sqrt{1-xy+|x-y|}\\&= 2 + \frac{4}{\pi } \sum _{\ell =0}^\infty \frac{1}{4\ell ^2-1} T_\ell (x)T_\ell (y) \end{aligned}$$

reproduces \(H^1({\mathbb {B}}^1)_h\) with an equivalent norm, and, for fixed \(0<r<1\), the smooth kernel

$$\begin{aligned} K_6(r;x,y)&:= \frac{(1-r^2)(1-2rxy+r^2)}{1+r^4 - 4xy(r+r^3)+ r^2(4x^2+4y^2-2)} \\&= \frac{1}{2}+\frac{1}{2}\sum _{\ell =0}^\infty r^\ell T_\ell (x)T_\ell (y) \end{aligned}$$

reproduces a function space that is continuously embedded into \(H^s({\mathbb {B}}^1)_h\) for all \(s>1/2\). As in our previous examples, our numerical experiments in Fig. 3 are in accordance with the theoretical results.
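
The Chebyshev expansion of \(K_5\) can be checked numerically; a small sketch, using the \(\sqrt{2}\)-scaled Chebyshev polynomials from Example 1 and truncating the (slowly converging) series, so agreement is only expected up to the truncation error.

```python
import numpy as np

def t_scaled(ell, x):
    """Orthonormal Chebyshev polynomials: T_0 = 1 and T_ell = sqrt(2) cos(ell arccos x)."""
    return np.where(ell == 0, 1.0, np.sqrt(2) * np.cos(ell * np.arccos(x)))

def k5_closed(x, y):
    return 2 - np.sqrt(1 - x * y + np.abs(x - y))

def k5_series(x, y, terms=20_000):
    ell = np.arange(terms)
    return 2 + 4 / np.pi * np.sum(t_scaled(ell, x) * t_scaled(ell, y) / (4 * ell ** 2 - 1))

for x, y in [(0.3, -0.7), (0.9, 0.9), (0.0, 0.5)]:
    print(k5_closed(x, y), k5_series(x, y))   # agree up to the truncation error
```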

Example 2

(\({\mathbb {B}}^2\)) For \(r\ge 1\), we define the family of kernels

$$\begin{aligned} L_r(x,y) := 3 - \frac{3}{2\sqrt{2r+2}} \sqrt{r-\langle x,y \rangle +\sqrt{ \big (r-\langle x,y\rangle \big )^2 -(1-\Vert x\Vert ^2)(1-\Vert y\Vert ^2) }}. \end{aligned}$$

These kernels are positive definite on \({\mathbb {B}}^2\) and satisfy

$$\begin{aligned} \int _{{\mathbb {B}}^2} L_r(x,y)\frac{\mathrm{d}x}{2\pi \sqrt{1-\Vert x\Vert ^2}} = \frac{3+r+2\sqrt{r^2-1}}{1+r+\sqrt{r^2-1}}. \end{aligned}$$

Note that \(L_1\) reproduces \(H^{3/2}({\mathbb {B}}^2)_h\) with an equivalent norm, and \(L_r\), for \(r>1\), reproduces a space that is continuously embedded into each \(H^s({\mathbb {B}}^2)_h\) for \(s>1\). In our numerical experiments, we set

$$\begin{aligned} K_7(x,y) := L_1(x,y),\qquad K_8(x,y) := L_{51/50}(x,y), \end{aligned}$$

and Fig. 4 supports our theoretical results. In Fig. 4b, however, the worst case error for nonnegative weights does not show faster decay for smooth functions; we speculate that this is due to a numerical artifact in the very last data point.

Fig. 4 Worst case integration error for \({\mathcal {H}}_{K_7}\) and \({\mathcal {H}}_{K_8}\), averaged over random points in logarithmic scalings. The three curves show the average wce with equal weights, with nonnegative weights, and with optimal weights.

The d-dimensional torus \({\mathbb {T}}^d:={\mathbb {S}}^1\times \ldots \times {\mathbb {S}}^1\) leads to \(h:{\mathbb {T}}^d \rightarrow [-1,1]^d\) defined by

$$\begin{aligned} h(x_1,\ldots ,x_d)=\big (x_{1,1}, \ldots ,x_{d,1}\big ), \end{aligned}$$

where \(x_i=(x_{i,1},x_{i,2})^\top \in {\mathbb {S}}^1\). The push-forward of the Riemannian measure on \({\mathbb {T}}^d\) under h is

$$\begin{aligned} \frac{\mathrm{d} x_1 \cdots \mathrm{d}x_d}{\pi ^d\sqrt{(1-x^2_1)\cdots (1-x^2_d)}}. \end{aligned}$$
(22)

A suitable basis of orthonormal polynomials characterizes \(H^s([-1,1]^d)_h\):

Proposition 2

(The cube) For \(s>d/2\) and \(h:{\mathbb {T}}^d \rightarrow [-1,1]^d\) as above, the space \(H^s([-1,1]^d)_h\) is reproduced by

$$\begin{aligned} K^s_{[-1,1]^d}(x,y):=\sum _{\ell \in {\mathbb {N}}^d} (1+\Vert \ell \Vert ^2)^{-s} T_\ell (x) T_\ell (y), \end{aligned}$$

\(x,y\in [-1,1]^d\), where \( T_\ell (x):=T_{\ell _1}(x_1)\cdots T_{\ell _d}(x_d)\), \(x\in [-1,1]^d\), \(\ell \in {\mathbb {N}}^d\).

Proof

The polynomials \(\{T_\ell : \ell \in {\mathbb {N}}^d\}\) are orthonormal with respect to (22). By using the eigenspace decomposition (20) for \({\mathbb {S}}^1\), we deduce that the space

$$\begin{aligned}&H^s({\mathbb {T}}^d)^{{{\,\mathrm{sym}\,}}} \\&\quad := \{f\in H^s({\mathbb {T}}^d) : f|_{h^{-1}x} \text { is constant } \forall x\in [-1,1]^d\} \end{aligned}$$

is reproduced by the kernel

$$\begin{aligned} K^s_{{\mathbb {T}}^d,{{\,\mathrm{sym}\,}}}(x,y)=\sum _{\ell \in {\mathbb {N}}^d} (1+\Vert \ell \Vert ^2)^{-s} Y^{(1)}_{\ell }(x)Y^{(1)}_{\ell }(y), \end{aligned}$$

\(x,y\in {\mathbb {T}}^d\), with \(Y^{(1)}_\ell (x):=Y^{(1)}_{\ell _1}(x_1)\cdots Y^{(1)}_{\ell _d}(x_d)\) and \(Y^{(1)}_{\ell _i}\) are as in (21). Observing \(Y^{(1)}_\ell = h^* T_\ell \) concludes the proof.

\(\square \)

By following Xu (2001), we derive an analogous construction for the simplex. Define \(h:{\mathbb {S}}^d\rightarrow {\varSigma }^d\) by \(h(x):=(x_1^2,\ldots ,x_d^2)\) and observe that the assumptions in Theorem 2 are satisfied. The push-forward measure \(h_*\mu _{{\mathbb {S}}^d}\) on \({\varSigma }^d\) is given by

$$\begin{aligned} \frac{{\varGamma }(d/2+1/2)}{\pi ^{d/2+1/2}}\frac{\mathrm{d} u}{\sqrt{u_1\cdots u_d(1-\sum _{i=1}^d u_i)}}. \end{aligned}$$
(23)

Let \(\{R_{k,\ell }:\ell =0,1,2,\ldots ;\; k=1,\ldots ,r^d_\ell \}\) be a system of orthonormal polynomials with respect to (23) on \({\varSigma }^d\), so that each \(R_{k,\ell }\) has total degree \(\ell \).
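
Sampling from (23) again amounts to applying h to uniform points on \({\mathbb {S}}^d\), i.e., squaring the coordinates; the resulting push-forward coincides with the Dirichlet(1/2, …, 1/2) distribution on the simplex (a standard fact, not stated above). A minimal sketch for \(d=2\).

```python
import numpy as np

rng = np.random.default_rng(6)

def sample_pushforward_simplex(n, d=2):
    """Samples from (23): square the coordinates of uniform points on S^d."""
    z = rng.standard_normal((n, d + 1))
    z /= np.linalg.norm(z, axis=1, keepdims=True)   # uniform on S^d
    return z[:, :d] ** 2                            # h(x) = (x_1^2, ..., x_d^2)

u = sample_pushforward_simplex(100_000)
v = rng.dirichlet(np.full(3, 0.5), size=100_000)[:, :2]   # direct Dirichlet(1/2,1/2,1/2) draw
print(u.mean(axis=0), v.mean(axis=0))                     # both close to (1/3, 1/3)
```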

Proposition 3

(The simplex) For \(s>d/2\) and \(h:{\mathbb {S}}^d \rightarrow {\varSigma }^d\) as above, the space \(H^s({\varSigma }^d)_h\) is reproduced by

$$\begin{aligned} \begin{aligned} K^s_{{\varSigma }^d}(u,v)&:=\sum _{\ell =0}^\infty (1+2\ell (2\ell +d-1))^{-s} \\&\quad \sum _{k=1}^{r^d_\ell } R_{k,\ell }(u)R_{k,\ell }(v), \end{aligned} \end{aligned}$$

for \(u,v\in {\varSigma }^d\).

Proof

Let us define

$$\begin{aligned} Z_{k,2\ell }(z) :=\Vert z\Vert ^{2\ell }R_{k,\ell } \left( \tfrac{z_1^2}{\Vert z\Vert ^2},\ldots , \tfrac{z_d^2}{\Vert z\Vert ^2}\right) , \quad z\in {\mathbb {R}}^{d+1}. \end{aligned}$$

Note that the restrictions \(Y_{k,2\ell }:=Z_{k,2\ell }|_{{\mathbb {S}}^d}\) satisfy \(Y_{k,2\ell }=h^* R_{k,\ell }\). We deduce from Xu (2001) that the collection \(\{Y_{k,2\ell }:k=1,\ldots ,r^d_\ell \}\) is an orthonormal system of spherical harmonics of order \(2\ell \) and that the space

$$\begin{aligned} H^s({\mathbb {S}}^d)^{{{\,\mathrm{sym}\,}}} := \{f\in H^s({\mathbb {S}}^d) : f|_{h^{-1}x} \text { is constant } \forall x\in {\varSigma }^d\} \end{aligned}$$

is reproduced by the kernel

$$\begin{aligned} K(x,y)&:=\sum _{\ell =0}^\infty (1+2\ell (2\ell +d-1))^{-s} \sum _{k=1}^{r^d_\ell } Y_{k,2\ell }(x)Y_{k,2\ell }(y)\\&= \sum _{\ell =0}^\infty (1+2\ell (2\ell +d-1))^{-s} \\&\quad \sum _{k=1}^{r^d_\ell } (h^*R_{k,\ell })(x) (h^*R_{k,\ell })(y), \end{aligned}$$

which concludes the proof. \(\square \)

Remark 7

Our Theorem 2 is an elementary way to transfer results from closed manifolds to more general settings. Our treatment of the unit ball, the cube, and the simplex was based on this transfer. The proof of the underlying Theorem 1 is based on results in Filbir and Mhaskar (2010), and we restricted attention to closed manifolds although the setting in Filbir and Mhaskar (2010) is more general. Alternatively, we could have stated our Theorem 1 in more generality and then attempted to check that the technical requirements in Filbir and Mhaskar (2010) hold. For instance, technical requirements for \([-1,1]\) were checked in Coulhon et al. (2012), and the recent work (Kerkyacharian et al. 2019) covers technical details for the unit ball and the simplex.

8 Perspectives

Re-weighting techniques for statistical and numerical integration have attracted attention in different disciplines. Partially complementing findings in Briol et al. (2018) and Oettershagen (2017), we have here established that re-weighting random points can yield almost optimal approximation rates of the worst case integration error for isotropic Sobolev spaces on closed Riemannian manifolds. Our results suggest several directions for future work, for instance, allowing for more general spaces \({\mathcal {M}}\), considering other smoothness classes than \(H^s_p({\mathcal {M}})\), considering other types of point processes such as determinantal point processes (Bardenet and Hardy 2016), and replacing the expected worst case error by alternative error functionals such as the average error, cf. Novak and Wozniakowski (2010) and Ritter (2000).

Our results have direct consequences for the Bayesian Monte Carlo method, as indicated in Sect. 5. Indeed, there has been recent interest in exploiting Bayesian cubature in applications including global illumination in computer vision (Marques et al. 2015), signal processing (Prüher and Šimandl 2016), uncertainty quantification (Oettershagen 2017) and Bayesian computation (Briol et al. 2018). The results in this paper justify the use of a random point set in these applications, in situations where a deterministic point set would otherwise need to be explicitly constructed.