Optimal Monte Carlo integration on closed manifolds

The worst case integration error of standard Monte Carlo methods with $n$ random points in reproducing kernel Hilbert spaces decays as $n^{-1/2}$. However, the re-weighting of random points, as exemplified in the Bayesian Monte Carlo method, can sometimes be used to improve the convergence order. This paper contributes general theoretical results for Sobolev spaces on closed Riemannian manifolds, where we verify that such re-weighting yields optimal approximation rates up to a logarithmic factor. We also provide numerical experiments matching the theoretical results for some Sobolev spaces on the sphere $\mathbb{S}^2$ and on the Grassmannian manifold $\mathcal{G}_{2,4}$. Our theoretical findings also cover function spaces on more general sets such as the unit ball, the cube, and the simplex.


Introduction
Many problems in statistics and the applied sciences require the numerical computation of integrals for an entire class of functions. Given $M \subset \mathbb{R}^D$, endowed with some probability measure $\mu$, and a function $f : M \to \mathbb{R}$, standard Monte Carlo methods approximate the integral $\int_M f(x)\,d\mu(x)$ by the finite sum

(1) $\frac{1}{n} \sum_{j=1}^n f(x_j)$,

where $\{x_j\}_{j=1}^n \subset M$ are independent samples from $\mu$. On the one hand, Monte Carlo integration is widely used in many numerical and statistical applications [34]. It is well known, however, that the expected worst case integration error for $n$ random points using (1) in reproducing kernel Hilbert spaces does not decay faster than $n^{-1/2}$, cf. [4,5,16,25,30] and [13, proof of Corollary 2.8]. To improve the approximation, it has been proposed to re-weight the random points [7,28,31,35], which is of particular importance when $\mu$ can only be sampled [27] and evaluating $f$ is rather expensive.
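To fix ideas, the following minimal sketch (our own illustration, not from the paper) implements the estimator (1) for the uniform measure on $\mathbb{S}^2$; the integrand is a toy example.

```python
import numpy as np

def monte_carlo_sphere(f, n, seed=0):
    """Plain Monte Carlo estimate (1) of int_{S^2} f dmu for the normalized measure."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, 3))
    x /= np.linalg.norm(x, axis=1, keepdims=True)   # i.i.d. uniform points on S^2
    return np.mean([f(xi) for xi in x])             # equal weights 1/n

# f(x) = x_3^2 integrates to 1/3 over the sphere; the error decays like n^{-1/2}
print(monte_carlo_sphere(lambda x: x[2] ** 2, 10_000))
```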
That re-weighting of deterministic points can lead to optimal convergence order has been known since the pioneering work of [1]. For Sobolev spaces on the sphere and more generally on compact Riemannian manifolds, there are numerically feasible strategies to select deterministic points and weights matching optimal worst case error rates, cf. [2,4,6], see also [15,17,24].
The use of random points avoids the need to manually specify a point set and can potentially lead to simpler algorithms if the geometry of the manifold $M$ is complicated. For random points, it was derived in [7] that the optimal rate for $[0,1]^d$, the sphere, and quite general domains in $\mathbb{R}^d$ can be matched up to a logarithmic factor if the weights are optimized with respect to the underlying reproducing kernel. Decay rates of the worst case integration error for Sobolev spaces of dominating mixed smoothness on the torus and the unit cube were studied in [28]. Numerical experiments on the Grassmannian manifold were provided in [11]. We refer to [36,37] for further related results.
The present note is dedicated to verifying that, for Sobolev spaces on closed Riemannian manifolds, random points with optimized weights yield optimal decay rates of the worst case error up to a logarithmic factor. We should point out that we additionally allow for the restriction to nonnegative weights, a desirable property not considered in [7]. Our findings also transfer to functions defined on more general sets such as the $d$-dimensional unit ball and the simplex.
First, we bound the worst case error by the covering radius of the underlying points. Second, we use estimates on the covering radius of random points from [32], see also [3] for the sphere, to establish the optimal approximation rate up to a logarithmic factor. Some consequences for popular Bayesian integration methods are then presented. Numerical experiments for the sphere and the Grassmannian manifold are provided that support our theoretical findings. We also discuss the extension to the unit ball, the cube, and the simplex.

Preliminaries
Let $M \subset \mathbb{R}^D$ be a smooth, connected, closed Riemannian manifold of dimension $d$, endowed with the normalized Riemannian measure $\mu$ throughout the manuscript. Prototypical examples for $M$ are the sphere and the Grassmannian. Let $H$ be any normed space of continuous functions $f : M \to \mathbb{R}$. For points $\{x_j\}_{j=1}^n \subset M$ and weights $\{w_j\}_{j=1}^n \subset \mathbb{R}$, the worst case error of integration is defined by

(2) $\mathrm{wce}(\{(x_j, w_j)\}_{j=1}^n, H) := \sup_{f \in H,\ \|f\|_H \le 1} \Big| \int_M f(x)\,d\mu(x) - \sum_{j=1}^n w_j f(x_j) \Big|.$

Suppose now that $H$ is a reproducing kernel Hilbert space, denoted by $H_K$; then the squared worst case error can be expressed in terms of the reproducing kernel $K$ by

(3) $\mathrm{wce}(\{(x_j, w_j)\}_{j=1}^n, H_K)^2 = \int_M \int_M K(x,y)\,d\mu(x)\,d\mu(y) - 2 \sum_{j=1}^n w_j \int_M K(x, x_j)\,d\mu(x) + \sum_{j,k=1}^n w_j w_k K(x_j, x_k).$

If $x_1, \dots, x_n \in M$ are random points, independently distributed according to $\mu$, then it holds that

(4) $\Big( \mathbb{E}\, \mathrm{wce}(\{(x_j, \tfrac{1}{n})\}_{j=1}^n, H_K)^2 \Big)^{1/2} = n^{-1/2} \Big( \int_M K(x,x)\,d\mu(x) - \int_M \int_M K(x,y)\,d\mu(x)\,d\mu(y) \Big)^{1/2},$

cf. [4,5,25] and [13, proof of Corollary 2.8]. Hence, even if $H_K$ consists of arbitrarily smooth functions, the left hand side of (4) decays only like $n^{-1/2}$.
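Whenever the kernel integrals are available in closed form, (3) can be evaluated directly. The following sketch assumes an invariant kernel whose mean embedding $\int_M K(x,y)\,d\mu(y)$ is a known constant, as is the case for the kernels used later; the argument names are hypothetical.

```python
import numpy as np

def wce_squared(points, weights, kernel, kernel_mean, kernel_double_mean):
    """Squared worst case error (3) in H_K.

    kernel_mean        : value of int_M K(x, y) dmu(y), assumed constant in x
    kernel_double_mean : value of int_M int_M K(x, y) dmu(x) dmu(y)
    """
    K = np.array([[kernel(x, y) for y in points] for x in points])
    w = np.asarray(weights, dtype=float)
    return kernel_double_mean - 2.0 * kernel_mean * w.sum() + w @ K @ w
```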
The present note is dedicated to the question of whether, and if so by how much, one can improve the error rate in (4) when replacing the equal weights $1/n$ with weights $\{w_j\}_{j=1}^n$ that are customized to the random points $\{x_j\}_{j=1}^n$.
Bounding the worst case error by the covering radius

To define appropriate smoothness spaces, let $\Delta$ denote the Laplace-Beltrami operator on $M$ and let $\{\varphi_\ell\}_{\ell=0}^\infty$ be the collection of its orthonormal eigenfunctions with eigenvalues $\{-\lambda_\ell\}_{\ell=0}^\infty$ arranged so that $0 = \lambda_0 \le \lambda_1 \le \cdots$. We choose each $\varphi_\ell$, $\ell = 0, 1, 2, \dots$, to be real-valued with $\varphi_0 \equiv 1$. Given $f \in L_p(M)$ with $1 \le p \le \infty$, the Fourier transform is defined by

(5) $\hat f(\ell) := \int_M f(x) \varphi_\ell(x)\,d\mu(x), \qquad \ell = 0, 1, 2, \dots,$

with the usual extension to distributions on $M$. For $1 \le p \le \infty$ and $s > 0$, the Sobolev space $H_p^s(M)$ is the collection of all distributions on $M$ with $(I - \Delta)^{s/2} f \in L_p(M)$, i.e., with $\|f\|_{H_p^s(M)} := \|(I - \Delta)^{s/2} f\|_{L_p(M)} < \infty$. For $s > d/p$, the worst case integration error in $H_p^s(M)$ obeys the lower bound

(6) $n^{-s/d} \lesssim \mathrm{wce}(\{(x_j, w_j)\}_{j=1}^n, H_p^s(M)),$

see [4] for the sphere and [2] for the general case. Note that the constant in (6) may depend on $s$, $M$, and $p$.
Attempting to match this lower bound, we shall optimize the weights. Given points $\{x_j\}_{j=1}^n \subset M$, we define optimal weights with nonnegativity constraints by

(8) $\{\hat w_j^{\ge 0;p}\}_{j=1}^n := \operatorname*{arg\,min}_{w_1, \dots, w_n \ge 0} \mathrm{wce}(\{(x_j, w_j)\}_{j=1}^n, H_p^s(M)).$

Recall that the covering radius of $\{x_j\}_{j=1}^n$ is $\rho_n := \sup_{x \in M} \min_{j=1,\dots,n} \mathrm{dist}_M(x, x_j)$, where $\mathrm{dist}_M$ denotes the geodesic distance on $M$. The worst case error for the optimized weights is upper bounded by the covering radius:

Theorem 3.1. Let $1 \le p \le \infty$, suppose $s > d/p$, and let $\{x_j\}_{j=1}^n \subset M$ be any set of points with covering radius $\rho_n$. Then the optimized weights $\{\hat w_j^{\ge 0;p}\}_{j=1}^n$ in (8) satisfy

(9) $\mathrm{wce}(\{(x_j, \hat w_j^{\ge 0;p})\}_{j=1}^n, H_p^s(M)) \lesssim \rho_n^s.$

Note that the constant in (9) may depend on $M$, $s$, and $p$.

Remark 3.2. If we fix a constant $c > 0$, independent of $n$ and $\{x_j\}_{j=1}^n$, then any weights $\{\tilde w_j^p\}_{j=1}^n$ that minimize (8) up to the factor $c$, i.e., with $\mathrm{wce}(\{(x_j, \tilde w_j^p)\}_{j=1}^n, H_p^s(M)) \le c \cdot \mathrm{wce}(\{(x_j, \hat w_j^{\ge 0;p})\}_{j=1}^n, H_p^s(M))$, satisfy the estimate

(10) $\mathrm{wce}(\{(x_j, \tilde w_j^p)\}_{j=1}^n, H_p^s(M)) \lesssim \rho_n^s.$

This fact is beneficial when we compute weights numerically.
Remark 3.3. The above proof reveals that, in the setting of Theorem 3.1, there exist $\{w_j\}_{j=1}^n$ with either $w_j \asymp \rho_n^d$ or $w_j = 0$, for $j = 1, \dots, n$, such that $\mathrm{wce}(\{(x_j, w_j)\}_{j=1}^n, H_p^s(M)) \lesssim \rho_n^s$.
The covering radius $\rho_n$ of any $n$ points in $M$ is lower bounded by $n^{-1/d}$, which follows from standard volume arguments. If $\{x_j\}_{j=1}^n \subset M$ are points with asymptotically optimal covering radius, i.e., $\rho_n \asymp n^{-1/d}$, then Theorem 3.1 yields the optimal rate for the worst case integration error

(12) $\mathrm{wce}(\{(x_j, \hat w_j^{\ge 0;p})\}_{j=1}^n, H_p^s(M)) \asymp n^{-s/d}.$

Several point sets on $\mathbb{S}^2$ with asymptotically optimal covering radius are discussed in [14], see quasi-uniform point sequences therein, and see [5] for general $M$. The covering radius of random points is studied in [3,26,32], which leads to almost optimal bounds on the worst case error in the subsequent section. Although we shall consider independent random points, it is noteworthy that it is verified in [26] that the required estimates on the covering radius still hold for random points arising from a Markov chain instead of being independent. Note also that results related to Theorem 3.1 are derived in [22] for more general spaces $M$.
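To illustrate the behavior of the covering radius for random points, the following sketch (our own illustration, not part of the paper) approximates $\rho_n$ on $\mathbb{S}^2$ by maximizing the geodesic distance to the nearest sample over a dense random test set, and compares it with $(\log n / n)^{1/2}$:

```python
import numpy as np

def uniform_sphere(n, rng):
    x = rng.standard_normal((n, 3))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def covering_radius(points, n_test=10_000, seed=1):
    """Approximate covering radius on S^2 via a random test set."""
    test = uniform_sphere(n_test, np.random.default_rng(seed))
    cosines = np.clip(test @ points.T, -1.0, 1.0)
    # geodesic distance to the nearest point, maximized over the test set
    return np.arccos(cosines.max(axis=1)).max()

rng = np.random.default_rng(0)
for n in [100, 400, 1600]:
    print(n, covering_radius(uniform_sphere(n, rng)), np.sqrt(np.log(n) / n))
```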

Consequences for random points
For random points $\{x_j\}_{j=1}^n \subset M$ and any weights $\{w_j\}_{j=1}^n \subset \mathbb{R}$, no matter if random or not, (6) implies, for all $r > 0$,

(13) $n^{-s/d} \lesssim \big( \mathbb{E}\, \mathrm{wce}(\{(x_j, w_j)\}_{j=1}^n, H_p^s(M))^r \big)^{1/r},$

where the constant may depend on $s$, $p$, and $M$. Note that if $\{x_j\}_{j=1}^n \subset M$ are random points, then the weights $\{\hat w_j^{\ge 0;p}\}_{j=1}^n$ are random as well. We shall deduce from Theorem 3.1 that the optimal worst case error rate is (almost) matched in these cases:

Corollary 4.1. Let $1 \le p \le \infty$, suppose $s > d/p$, and let $\{x_j\}_{j=1}^n \subset M$ be independent random points distributed according to $\mu$. Then, for each $r \ge 1$, the optimized weights $\{\hat w_j^{\ge 0;p}\}_{j=1}^n$ in (8) satisfy

(14) $\big( \mathbb{E}\, \mathrm{wce}(\{(x_j, \hat w_j^{\ge 0;p})\}_{j=1}^n, H_p^s(M))^r \big)^{1/r} \lesssim \Big( \frac{\log n}{n} \Big)^{s/d}.$

Note that Corollary 4.1 yields the optimal rate up to the logarithmic factor $\log(n)^{s/d}$, cf. (13), and that the constant in (14) may depend on $s$, $M$, $p$, and $r$.
Proof. From [32, Theorem 3.2, Corollary 3.3] we deduce that, for each $r \ge 1$,

(15) $\big( \mathbb{E}\, \rho_n^{rs} \big)^{1/r} \lesssim \Big( \frac{\log n}{n} \Big)^{s/d},$

where the constant may depend on $M$ and $r$. Thus, Theorem 3.1 implies

$\big( \mathbb{E}\, \mathrm{wce}(\{(x_j, \hat w_j^{\ge 0;p})\}_{j=1}^n, H_p^s(M))^r \big)^{1/r} \lesssim \big( \mathbb{E}\, \rho_n^{rs} \big)^{1/r} \lesssim \Big( \frac{\log n}{n} \Big)^{s/d}.$

Remark 4.2. Let $\nu$ be a probability measure on $M$ that is absolutely continuous with respect to $\mu$ and whose density is bounded away from zero, i.e., $\nu = f\mu$ with $f(x) \ge c > 0$ for all $x \in M$. Corollary 4.1 still holds for independent samples from $\nu$, where the constant in (14) then also depends on $c$. This is because $2n/c$ independent samples from $\nu$ cover $M$ at least as well as $n$ independent samples from $\mu$.
Corollary 4.1 yields bounds on the moments of the worst case integration error. The results in [32] also enable us to derive probability estimates:

Corollary 4.3. Let $1 \le p \le \infty$, suppose $s > d/p$, and let $\{x_j\}_{j=1}^n \subset M$ be independent random points distributed according to $\mu$. Then there are constants $c_1, c_2, c_3, c_4 > 0$ such that the optimized weights $\{\hat w_j^{\ge 0;p}\}_{j=1}^n$ in (8) satisfy

$\mathbb{P}\Big( \mathrm{wce}(\{(x_j, \hat w_j^{\ge 0;p})\}_{j=1}^n, H_p^s(M)) \le c_2 \Big( \frac{r \log n}{n} \Big)^{s/d} \Big) \ge 1 - c_3 n^{-c_4 r}$

for all $r \ge c_1$.
Proof. By applying Theorem 3.1 we deduce that there is a constant $c > 0$, which may depend on $M$, $s$, and $p$, such that

$\mathrm{wce}(\{(x_j, \hat w_j^{\ge 0;p})\}_{j=1}^n, H_p^s(M)) \le c\, \rho_n^s.$

According to [32, Theorem 2.1], there are constants $c_1, \tilde c_2, c_3, c_4 > 0$, which may depend on $M$, such that, for all $r \ge c_1$,

$\mathbb{P}\Big( \rho_n \le \tilde c_2 \Big( \frac{r \log n}{n} \Big)^{1/d} \Big) \ge 1 - c_3 n^{-c_4 r}.$

Raising the left inequality to the power $s$ and multiplying by $c$ yields the desired result with $c_2 := c\, \tilde c_2^{\,s}$.
Remark 4.4. Corollary 4.1 and Corollary 4.3 also hold for weights $\{\tilde w_j^p\}_{j=1}^n$ that minimize (8) up to a constant factor as discussed in Remark 3.2 and, in particular, for the unconstrained minimizer

(16) $\{\hat w_j^{\,p}\}_{j=1}^n := \operatorname*{arg\,min}_{w_1, \dots, w_n \in \mathbb{R}} \mathrm{wce}(\{(x_j, w_j)\}_{j=1}^n, H_p^s(M)).$

Nonnegative weights are often more desirable for numerical applications of cubature points, but solving the constrained minimization problem (8) is usually more involved than dealing with the unconstrained problem (16).

Relations to Bayesian Monte Carlo
Our results have consequences for Bayesian cubature, cf. [21], an integration method whose output is not a scalar but a distribution. Bayesian cubature enables a statistical quantification of integration error, useful in the context of a wider computational work-flow to measure the impact of integration error on subsequent output, cf. [7,9].
Consider a linear topological space $L$ of continuous functions on $M$. The integrand $f$ in Bayesian cubature is treated as a Gaussian random process; that is, $f : M \times \Omega \to \mathbb{R}$, where $f(\cdot, \omega) \in L$ for each $\omega \in \Omega$, and the random variables $\omega \mapsto Lf(\cdot, \omega)$ are (univariate) Gaussian for all continuous linear functionals $L$ on $L$, such as the integration ($If = \int_M f(x)\,d\mu(x)$) and point evaluation ($\delta_x f = f(x)$) operators, cf. [2]. The Bayesian approach is then taken, wherein the process $f$ is constrained to interpolate the values $\{(x_j, f(x_j))\}_{j=1}^n$. Formally, this is achieved by conditioning the process on the data provided through the point evaluation operators $\delta_{x_1}, \dots, \delta_{x_n}$. The conditioned process, denoted $f_n$, is again Gaussian (cf. [2]) and as such the linear functional $If_n$ is a (univariate) Gaussian; this is the output of the Bayesian cubature method. This distribution, defined on the real line, provides statistical uncertainty quantification for the (unknown) true value of the integral.
Concretely, let $K(x, y) = \mathrm{cov}(f(x), f(y))$ denote the covariance function that characterizes the Gaussian probability model. The output of Bayesian cubature is the univariate Gaussian distribution with mean

(17) $b^\top K^{-1} \big( f(x_1), \dots, f(x_n) \big)^\top$, where $K := \big( K(x_j, x_k) \big)_{j,k=1}^n$ and $b := \Big( \int_M K(x, x_j)\,d\mu(x) \Big)_{j=1}^n$,
cf. [7]. This expression is recognized as a weighted integration method with weights $\hat w = (\hat w_1, \dots, \hat w_n)$ implicitly defined by $\hat w := K^{-1} b$, so that (17) becomes $\sum_{j=1}^n \hat w_j f(x_j)$.
Any symmetric positive definite covariance function $K$ can be viewed as a reproducing kernel. In particular, the Bessel kernel

(18) $K(x, y) = \sum_{\ell=0}^\infty (1 + \lambda_\ell)^{-s} \varphi_\ell(x) \varphi_\ell(y)$

reproduces $H^s(M) := H_2^s(M)$, for $s > d/2$. Observe that the weights $\hat w$ just defined solve the unconstrained minimization problem (16) for $p = 2$. The latter follows from the quadratic minimization form in (3) as well as from the posterior mean being an $L_2$-optimal estimator [20].
The variance of the Gaussian measure can be shown to be formally equal to (3) when these weights are substituted, see [7]. The special case where the points $\{x_j\}_{j=1}^n$ are random was termed Bayesian Monte Carlo in [31]. Therefore, our results in Section 4 have direct consequences for Bayesian Monte Carlo: due to Remark 4.4, Corollary 4.1 and Corollary 4.3 generalize, within this Bayesian setting, earlier work of [7] to general smooth, connected, closed Riemannian manifolds.
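A compact sketch of Bayesian Monte Carlo along these lines: the posterior mean is the weighted sum (17) with $\hat w = K^{-1}b$, and the posterior variance equals (3) with these weights substituted. The kernel constants are hypothetical placeholders under the same invariance assumption as before.

```python
import numpy as np

def bayesian_monte_carlo(points, f_values, kernel, kernel_mean, kernel_double_mean):
    """Posterior mean (17) and variance for Bayesian cubature at given points."""
    K = np.array([[kernel(x, y) for y in points] for x in points])
    b = np.full(len(points), kernel_mean)   # b_j = int_M K(x, x_j) dmu(x)
    w = np.linalg.solve(K, b)               # unconstrained optimal weights, cf. (16)
    mean = w @ np.asarray(f_values)
    var = kernel_double_mean - b @ w        # equals (3) evaluated at w = K^{-1} b
    return mean, var
```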

Numerical experiments for the sphere and the Grassmannian
Numerically computing the worst case error $\mathrm{wce}(\{(x_j, w_j)\}_{j=1}^n, H_p^s(M))$ is difficult in general but, for $p = 2$, it is expressed in terms of the reproducing kernel in (3). Therefore, our numerical experiments are designed for $p = 2$. In analogy to (8), the optimized nonnegative weights for a reproducing kernel $K$ are

(19) $\{\hat w_j^{\ge 0;K}\}_{j=1}^n := \operatorname*{arg\,min}_{w_1, \dots, w_n \ge 0} \mathrm{wce}(\{(x_j, w_j)\}_{j=1}^n, H_K).$

Dropping the nonnegativity constraints yields $\hat w^K$, which is given by $\hat w = K^{-1} b$, where $K$ and $b$ are as in (17). To provide numerical experiments for Sobolev spaces on the sphere $\mathbb{S}^2 \subset \mathbb{R}^3$ and on the Grassmannian $\mathcal{G}_{2,4}$, we shall specify suitable kernels in the following: two kernels $K_1, K_2$ on the sphere $\mathbb{S}^2$ and two kernels $K_3, K_4$ on the Grassmannian $\mathcal{G}_{2,4}$. The numerical results are produced by taking sequences of random points $\{x_j\}_{j=1}^n$ with increasing cardinality $n$. We compute each of the three worst case errors

$\mathrm{wce}(\{(x_j, \tfrac1n)\}_{j=1}^n, H_{K_i}), \quad \mathrm{wce}(\{(x_j, \hat w_j^{\ge 0;K_i})\}_{j=1}^n, H_{K_i}), \quad \mathrm{wce}(\{(x_j, \hat w_j^{K_i})\}_{j=1}^n, H_{K_i}),$

for $i = 1, \dots, 4$, and average the results over 20 instantiations of the random points. The constrained minimization problem for the latter two quantities is solved by using the Python CVXOPT library. It should be mentioned that numerical experiments on the sphere for the unconstrained optimizer $\hat w^{K_1}$ are also contained in [7].
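For $p = 2$, the constrained problem (19) is a quadratic program: by (3), it suffices to minimize $w^\top K w - 2 b^\top w$ subject to $w \ge 0$. A minimal sketch using CVXOPT, the library mentioned above, reads as follows (matrix assembly as in (17)):

```python
import numpy as np
from cvxopt import matrix, solvers

def nonnegative_weights(K, b):
    """Solve (19): minimize w^T K w - 2 b^T w subject to w >= 0."""
    n = len(b)
    solvers.options["show_progress"] = False
    sol = solvers.qp(
        P=matrix(2.0 * K),      # quadratic term: (1/2) w^T P w = w^T K w
        q=matrix(-2.0 * b),     # linear term
        G=matrix(-np.eye(n)),   # -w <= 0 encodes the nonnegativity constraint
        h=matrix(np.zeros(n)),
    )
    return np.array(sol["x"]).ravel()
```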
The kernel $K_1$ reproduces the Sobolev space $H^{3/2}(\mathbb{S}^2)$ with an equivalent norm, cf. [13, Section 6.4.1]. To compute (3) and (17), it is sufficient to notice that $\int_{\mathbb{S}^2} K_1(x, y)\,d\mu(y)$ does not depend on $x$. By plotting the worst case error versus the number of points in logarithmic scale, we expect to observe lines whose slopes coincide with the decay rate $-s/d$ for the optimized weights and slope $-1/2$ for the weights $1/n$. Indeed, we see in Figure 1(a) that $\mathrm{wce}(\{(x_j, \tfrac1n)\}_{j=1}^n, H_{K_1})$ for random points matches the error rate $-1/2$ predicted by (4) with $d = 2$. When optimizing the weights, we observe the decay rate $-3/4$ for both optimizations, $\hat w^{\ge 0;K}$ in (19) and the unconstrained minimizer $\hat w^K$. Hence, the numerical results match the rate predicted by the theoretical findings in (13), (14) with $p = 2$ and $r = 1$. The logarithmic factor in (14) is not visible.
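The slopes themselves can be estimated from the data by a least squares fit in log-log coordinates, e.g.:

```python
import numpy as np

def empirical_rate(ns, errors):
    """Slope of log(error) versus log(n); compare with -s/d."""
    return np.polyfit(np.log(ns), np.log(errors), 1)[0]

# sanity check: an n^{-3/4} sequence yields the slope -0.75
ns = np.array([100, 200, 400, 800, 1600])
print(empirical_rate(ns, ns ** -0.75))
```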
The smooth kernel $K_2$ generates a space $H_{K_2}$ of smooth functions contained in $H^s(\mathbb{S}^2)$ for all $s > 0$, and $\int_{\mathbb{S}^2} K_2(x, y)\,d\mu(y)$ is again constant in $x$. Our numerical experiments in Figure 1(b) suggest that the decay rate for the optimized weights is indeed beyond linear. Note that the equal weight case is stuck with the decay rate $-1/2$, although we are now dealing with arbitrarily smooth functions.

The dimension of the Grassmannian $\mathcal{G}_{2,4}$ is $d = 4$, and we consider the reproducing kernel

$K_3(x, y) := (2 - \operatorname{trace}(xy))^{3/2} + 2 \operatorname{trace}(xy)$

and a second, smooth kernel $K_4$. Note that $K_3$ reproduces $H^{7/2}(\mathcal{G}_{2,4})$ with an equivalent norm, and $H_{K_4}$ is contained in $H^s(\mathcal{G}_{2,4})$ for all $s > 0$. The terms (3) and (17) can be evaluated since $\int_{\mathcal{G}_{2,4}} K_i(x, y)\,d\mu(y)$ is constant for all $x \in \mathcal{G}_{2,4}$. We observe in Figure 2(a) that the random points with equal weights yield the decay rate $-1/2$ while optimizing the weights leads to $-7/8$, matching the optimal rate in (13), (14) with $d = 4$. In Figure 2(b), it seems that the worst case error for $H_{K_4}$ decays faster than linearly when optimizing the weights for random points on the Grassmannian $\mathcal{G}_{2,4}$, outperforming the equal weight case with its rate $-1/2$.
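For reproducibility of this setup, random points on $\mathcal{G}_{2,4}$, viewed as $4 \times 4$ rank-2 orthogonal projection matrices, may be generated by orthonormalizing Gaussian matrices; the sketch below also evaluates $K_3$ (with the exponent $3/2$ as reconstructed above, an assumption on our part):

```python
import numpy as np

def random_projector(rng, d=2, D=4):
    """Uniform random point on G_{2,4} as a D x D projection matrix of rank d."""
    Q, _ = np.linalg.qr(rng.standard_normal((D, d)))
    return Q @ Q.T

def k3(x, y):
    t = np.trace(x @ y)              # 0 <= trace(xy) <= 2 for rank-2 projectors
    return (2.0 - t) ** 1.5 + 2.0 * t

rng = np.random.default_rng(0)
x, y = random_projector(rng), random_projector(rng)
print(k3(x, y))
```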

Beyond closed manifolds
We shall make use of the push-forward to transfer our results on the worst case integration error from closed manifolds to more general sets. Suppose $S$ is a topological space and $h : M \to S$ is Borel measurable and surjective. We endow $S$ with the push-forward measure $h_*\mu$ defined by $(h_*\mu)(A) = \mu(h^{-1}A)$ for any Borel measurable subset $A \subset S$. Abusing notation, let $\mathrm{dist}_M(A, B) := \inf_{a \in A,\, b \in B} \mathrm{dist}_M(a, b)$ for $A, B \subset M$, and we put

(20) $\mathrm{dist}_{S,h}(x, y) := \mathrm{dist}_M(h^{-1}x, h^{-1}y), \qquad x, y \in S.$
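For instance, sampling the push-forward measure is straightforward whenever samples on $M$ are available: anticipating the unit ball example below, with the coordinate projection $h : \mathbb{S}^d \to \mathbb{B}^d$, $h(x_1, \dots, x_{d+1}) = (x_1, \dots, x_d)$ (our reading of the construction via [40], stated here as an assumption), one obtains:

```python
import numpy as np

def pushforward_ball_samples(n, d=2, seed=0):
    """Samples from h_* mu on B^d, where mu is the uniform measure on S^d."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d + 1))
    z /= np.linalg.norm(z, axis=1, keepdims=True)   # uniform points on S^d
    return z[:, :d]                                  # drop the last coordinate
```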
For $s > d/p$, we define

(21) $H_p^s(S)_h := \{ f : S \to \mathbb{R} : h^*f \in H_p^s(M) \}, \qquad \|f\|_{H_p^s(S)_h} := \|h^*f\|_{H_p^s(M)},$

where $h^*f$ denotes the pullback $f \circ h$. This enables us to formulate the analogue of Theorem 3.1:

Theorem 7.1. Given $1 \le p \le \infty$ with $s > d/p$ and $\{x_j\}_{j=1}^n \subset S$, suppose that the following two conditions are satisfied: $\mathrm{dist}_{S,h}(x, y) = 0$ holds if and only if $x = y$, and $\mathrm{dist}_{S,h}$ satisfies the triangle inequality up to a constant factor. Then the optimized weights $\{\hat w_j^{\ge 0;p}\}_{j=1}^n$, defined as in (8) with $H_p^s(S)_h$ in place of $H_p^s(M)$, satisfy

$\mathrm{wce}(\{(x_j, \hat w_j^{\ge 0;p})\}_{j=1}^n, H_p^s(S)_h) \lesssim \rho_n^s,$

where $\rho_n$ denotes the covering radius of $\{x_j\}_{j=1}^n$ taken with respect to (20). Note that (20) is a quasi-metric on $S$ if the assumptions in Theorem 7.1 are satisfied, i.e., the conditions of a metric are satisfied except for the triangle inequality, which still holds up to a constant factor.
Proof. Since $h^*f$ is constant on the fibers of $h$, cubature of $f$ at $\{x_j\}_{j=1}^n \subset S$ corresponds to cubature of $h^*f$ at the lifted set $h^{-1}\{x_j\}_{j=1}^n \subset M$, and (20) relates the covering radius $\rho_M(h^{-1}\{x_j\}_{j=1}^n)$ of this set in $M$ to $\rho_n$, where $\rho_M$ denotes the covering radius taken in $M$. Theorem 3.1 applied on $M$ then yields the claim, which concludes the proof.

Remark 7.2. Independent random points $\{x_j\}_{j=1}^n$ distributed according to $h_*\mu$ on $S$ are generated by independent random points $\{z_j\}_{j=1}^n$ distributed according to $\mu$ on $M$ with $x_j = h(z_j)$, for $j = 1, \dots, n$. Since their covering radii satisfy $\rho_n \le \rho_M(\{z_j\}_{j=1}^n)$, Corollary 4.1 and Corollary 4.3 also hold for $H_p^s(S)_h$ and $h_*\mu$.

The impact of Theorem 7.1 depends on whether or not the choice of $h$ yields reasonable function spaces $H_p^s(S)_h$, distances $\mathrm{dist}_{S,h}$, and measures $h_*\mu$. For instance, if $h$ is also injective with measurable $h^{-1}$, then $H^s(S)_h$ is the reproducing kernel Hilbert space with kernel

$K_S(x, y) := \sum_{\ell=0}^\infty (1 + \lambda_\ell)^{-s} \psi_\ell(x) \psi_\ell(y),$

where $\psi_\ell := \varphi_\ell \circ h^{-1}$, so that $\{\psi_\ell\}_{\ell=0}^\infty$ is an orthonormal basis for the square integrable functions with respect to $h_*\mu$. In the following, we shall discuss a few special cases in which $h$ is not injective. By using the results in [40], we first treat the unit ball $\mathbb{B}^d$, where $h : \mathbb{S}^d \to \mathbb{B}^d$ maps $(x_1, \dots, x_{d+1})$ to $(x_1, \dots, x_d)$. For related results on approximation on $\mathbb{B}^d$, we refer to [29] and references therein.
Proof. For $\ell = 0, 1, 2, \dots$, the eigenfunctions of the Laplace-Beltrami operator on the sphere associated to the eigenvalue $-\lambda_\ell = -\ell(\ell + d - 1)$ are the spherical harmonics of order $\ell$, given by the homogeneous harmonic polynomials in $d + 1$ variables of exact total degree $\ell$ restricted to $\mathbb{S}^d$. Each eigenspace $E_\ell$ associated to $\lambda_\ell$ splits orthogonally into $E_\ell = E_\ell^{(1)} \oplus E_\ell^{(2)}$, where $E_\ell^{(1)}$ and $E_\ell^{(2)}$ consist of the spherical harmonics that are even and odd in the variable $x_{d+1}$, respectively. We deduce from [40, Theorem 3.3, Example 3.4] that the functions $Y_{k,\ell}$ are homogeneous polynomials of total degree $\ell$, and their restrictions $Y_{k,\ell}|_{\mathbb{S}^d}$, for $k = 1, \dots, r_\ell^d$, are an orthonormal basis for $E_\ell^{(1)}$.
According to (18) and due to the decomposition induced by (25), using the functions $Y_{k,\ell}|_{\mathbb{S}^d}$ yields the corresponding kernel expansion on $\mathbb{B}^d$. Thus, $H^s(\mathbb{B}^d)_h$ is indeed reproduced by (24).
Those kernels are positive definite on $\mathbb{B}^2$, and $\int_{\mathbb{B}^2} L_r(x, y)\,d(h_*\mu)(y)$ is constant in $x$. Note that $L_1$ reproduces $H^{3/2}(\mathbb{B}^2)$ with an equivalent norm and $L_r$ for $r > 1$ reproduces a space that is continuously embedded into each $H^s(\mathbb{B}^2)$ for $s > 1$. In our numerical experiments, we set $K_7(x, y) := L_1(x, y)$ and $K_8(x, y) := L_{51/50}(x, y)$; Figure 4 supports our theoretical results. There, however, the worst case error for nonnegative weights does not show superlinear decay for smooth functions in Figure 4(b), but we speculate that this is due to a numerical artifact at the very last data point.

The analogous result for the cube is verified as follows.

Proof. The polynomials $\{T_\ell : \ell \in \mathbb{N}^d\}$ are orthonormal with respect to (30). By using the eigenspace decomposition (25) for $\mathbb{S}^1$, we deduce that the spaces $E_\ell^{(i)}$ are as in (29). Observing $Y_\ell^{(1)} = h^* T_\ell$ concludes the proof.
For the simplex $\Sigma^d$, let $\{R_{k,\ell} : \ell = 0, 1, 2, \dots;\ k = 1, \dots, r_\ell^d\}$ be a system of orthonormal polynomials with respect to (32) on $\Sigma^d$, so that each $R_{k,\ell}$ has total degree $\ell$, and put

$Z_{k,2\ell}(z) := \|z\|^{2\ell}\, R_{k,\ell}\Big( \frac{z_1^2}{\|z\|^2}, \dots, \frac{z_d^2}{\|z\|^2} \Big), \qquad z \in \mathbb{R}^{d+1}.$

Note that the restrictions $Y_{k,2\ell} := Z_{k,2\ell}|_{\mathbb{S}^d}$ satisfy $Y_{k,2\ell} = h^* R_{k,\ell}$. We deduce from [41] that the collection $\{Y_{k,2\ell} : k = 1, \dots, r_\ell^d\}$ is an orthonormal system of spherical harmonics of order $2\ell$ and that the space spanned by these functions coincides with the relevant eigenspace component, which concludes the proof.
Remark 7.8. Our Theorem 7.1 provides an elementary way to transfer results from closed manifolds to more general settings, and our treatment of the unit ball, the cube, and the simplex was based on this transfer. The proof of the underlying Theorem 3.1 is based on results in [12], and we restricted our attention to closed manifolds although the setting in [12] is more general. Alternatively, we could have stated Theorem 3.1 in more generality and then attempted to check that the technical requirements in [12] hold. For instance, technical requirements for $[-1,1]$ were checked in [10], and the recent work [19] covers technical details for the unit ball and the simplex.

Perspectives
Re-weighting techniques for statistical and numerical integration have attracted attention in different disciplines. Partially complementing the findings in [7,28], we have established that re-weighting random points can yield almost optimal approximation rates of the worst case integration error for isotropic Sobolev spaces on closed Riemannian manifolds. Our results suggest several directions for future work, for instance, allowing for more general spaces $M$, considering smoothness classes other than $H_p^s(M)$, and replacing the expected worst case error by alternative error functionals such as the average error, cf. [25,33].