Abstract
Convergence of the Bayes posterior measure is considered in canonical statistical settings where observations sit on a geometrical object such as a compact manifold, or more generally on a compact metric space verifying some conditions. A natural geometric prior based on randomly rescaled solutions of the heat equation is considered. Upper and lower bound posterior contraction rates are derived.
1 Introduction
Let \(\mathcal{M }\) be a compact metric space, equipped with a Borel measure \(\mu \) and the corresponding Borel-sigma field. Let \(\mathbb L ^p:=\mathbb L ^p(\mathcal{M },\mu )\), \(p\ge 1\) and \(\mathcal C ^0(\mathcal{M })\) respectively denote the real vector spaces of real-valued functions defined on \(\mathcal{M }\) that are \(p\)-integrable with respect to \(\mu \) and of real-valued continuous functions on \(\mathcal{M }\). Also, denote by \(\mathcal D (\mathbb R )\) the algebra of real-valued infinitely differentiable functions on the real line.
In this paper we investigate rates of contraction of posterior distributions for nonparametric models on geometrical structures such as
-
1.
Gaussian white noise on a compact metric space \(\mathcal{M }\), where, for \(n\ge 1\), one observes
$$\begin{aligned} dX^{(n)}(x) = f(x)dx + \frac{1}{\sqrt{n}} dZ(x),\quad x\in \mathcal{M }, \end{aligned}$$where \(f\) is in \(\mathbb L ^2\) and \(Z\) is a white noise on \(\mathcal{M }\).
-
2.
Fixed design regression where one observes, for \(n\ge 1\),
$$\begin{aligned} Y_i = f(x_i) + \varepsilon _i,\quad 1\le i\le n. \end{aligned}$$The design points \(\{x_i\}\) are fixed on \(\mathcal{M }\) and the variables \(\{\varepsilon _i\}\) are assumed to be independent standard normal.
-
3.
Density estimation on a manifold where the observations are a sample
$$\begin{aligned} (X_i)_{1\le i\le n}\ \sim \ f, \end{aligned}$$\(X_1,\ldots ,X_n\) are independent identically distributed \(\mathcal{M }\)-valued random variables with positive density function \(f\) on \(\mathcal{M }\).
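As a concrete instance of the second model, the observations can be simulated on the circle \(\mathcal{M }=\mathbb{S }^1\); the test function below is hypothetical and chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Fixed design on the circle, parameterised by [-pi, pi)
x = np.linspace(-np.pi, np.pi, n, endpoint=False)

# A hypothetical smooth test function f (for illustration only)
def f(x):
    return np.sin(x) + 0.5 * np.cos(3 * x)

# Model 2: Y_i = f(x_i) + eps_i, with independent standard normal noise
Y = f(x) + rng.standard_normal(n)

print(Y.shape)  # (200,)
```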
Although an impressive amount of work has been done using frequentist approaches to estimation on manifolds, see [20] and the references therein, we focus in this paper on the Bayes posterior measure.
The study of the behaviour of Bayesian nonparametric methods has developed considerably in recent years, in particular after the seminal works of A. W. van der Vaart, H. van Zanten, S. Ghosal and J. K. Ghosh [12], [27]. In particular, the class of Gaussian processes forms an important family of nonparametric prior distributions, for which precise rates have been obtained in [31]; see also [5] for lower bound counterparts. In [33], the authors obtained adaptive performance up to logarithmic terms by introducing a random rescaling of a very smooth Gaussian random field; the rescaling considered there corresponds to shrinking the paths of the process. These results have been obtained on \([0,1]^d\), \(d\ge 1\). Our aim in this paper is to develop a Bayesian procedure adapted to the geometrical structure of the data. Among the examples covered by our results are directional data, corresponding to the spherical case, and more generally data supported by a compact manifold.
We follow the illuminating approach of [31] and [33] and use a fixed prior distribution, constructed by rescaling a smooth Gaussian random field. For a recent survey on Gaussian processes and their basic properties, see [19]. Basically our aim will be twofold:
First, because the ‘shrinking of paths’ approach from [33] has no natural analogue on a general manifold, this type of rescaling cannot be used. In our more general setting, we show how a rescaling is made possible by introducing a notion of time, decoupled from the underlying space and arising from the semigroup property of a family of operators. Another important difference brought by the geometrical nature of the problem is the underlying Gaussian process, which now originates from a harmonic analysis of the data space \(\mathcal{M }\), with the rescaling naturally acting on the frequency domain. More precisely, we suppose that \(\mathcal{M }\) is equipped with a positive self-adjoint operator \(L\) such that the associated semigroup \(e^{-tL}\), \(t>0\), the heat kernel, allows a smooth functional calculus, which in turn allows the construction of the Gaussian random field. A central example of operator is \(L=-\Delta \), where \(\Delta \) is the Laplacian on \(\mathcal{M }\). Our prior can then be interpreted as a randomly rescaled (random) solution of the heat equation.
This construction enables us to obtain rates of contraction of the posterior distribution depending on the ‘regularity’ of the estimated function (defined in terms of approximation rates) and on a ‘dimension’ of the geometrical object at hand.
Secondly, we prove a lower bound for the posterior contraction rate showing in particular that the logarithmic factor appearing in the upper evaluations of the posterior rate is necessary.
We also draw inspiration from earlier work by [1], where the authors consider a symmetry-adaptive Bayesian estimator in a regression framework. Precise minimax rates in the \(\mathbb L ^2\)-norm over Sobolev spaces of functions on compact connected orientable manifolds without boundary are obtained in [11]. We also mention a recent development by [2], where Bayesian consistency properties are derived for priors based on mixtures of kernels over a compact manifold.
The paper is mostly self-contained and does not require prior knowledge of heat kernel theory. Definitions and notation for the heat kernel can be found in Sect. 2.4. To obtain sharp entropy bounds for some compact sets appearing in the proof, we use the existence of needlet-type basis on \(\mathcal{M }\), as established in [9]. Standard general conditions to obtain rates for Bayesian posteriors are used following [12, 13].
Here is an outline of the paper. We first detail in Sect. 2 the properties assumed on the structure \(\mathcal{M }\) and the associated heat kernel allowing our construction. We then construct the associated Gaussian prior defining the procedure. Examples are considered in Sect. 3. The main results are stated in Sect. 4, that is: rates of contraction for the procedure, as well as a lower bound proving that the logarithmic factor present in the rates is, in fact, sharp. The rest of the paper is devoted to the proofs of these results. In Sects. 5 and 6 structural properties of the considered Gaussian processes are studied, and entropy estimates are stated. These properties enable one to check general sufficient conditions for upper-bound posterior rates, as we demonstrate in Sect. 7. Upper-bound rates are then derived in Sect. 8, as well as the corresponding lower bound result. Sections 9 and 10 contain respectively the definition of Besov spaces and the proofs of the sharp entropy results. Finally, in Sect. 10.5, a homogeneity property of measure of balls needed for our results is verified for compact Riemannian manifolds without boundary.
The notation \(\lesssim \) means less than or equal to up to some universal constant. For any sequences of reals \((a_n)_{n\ge 0}\) and \((b_n)_{n\ge 0}\), the notation \(a_n\sim b_n\) means that the sequences verify \(c\le \liminf _{n} (b_n/a_n) \le \limsup _n (b_n/a_n) \le d\) for some positive constants \(c,d\), and \(a_n\ll b_n\) stands for \(\lim _n (b_n/a_n)=0\). For any reals \(a,b\), we denote \(\min (a,b)=a\wedge b\) and \(\max (a,b)=a\vee b\). For a given differentiable function \(u\), we denote by \(u^{\prime }\) its derivative.
2 The geometrical framework and our method
2.1 Compact metric doubling space \(\mathcal{M }\)
Let \(\rho \) denote the metric on the space \(\mathcal{M }\). The open ball of radius \(r\) centered at \(x \in \mathcal{M }\) is denoted by \(B(x,r)\) and to simplify the notation we put \(\mu (B(x,r)) =: |B(x,r)|.\) Without loss of generality, we impose in the abstract proofs that both the total mass \(\mu (\mathcal{M })\) and the diameter of \(\mathcal{M }\) are equal to \(1\). Although this is typically not the case in practical situations, see Sect. 3, considering the general case only changes constants in the proofs.
We assume that \(\mathcal{M }\) has the so-called doubling property: there exists \(0<D <\infty \) such that, for all \(x\in \mathcal{M }\) and \(r>0\),
$$\begin{aligned} |B(x,2r)| \le 2^D |B(x,r)|. \end{aligned}$$
We say that \(\mathcal{M }\) verifies the Ahlfors property (see e.g. [16]) if there exist positive \( c_1, c_2, d\) such that, for all \(x\in \mathcal{M }\) and \(0<r\le 1\),
$$\begin{aligned} c_1 r^d \le |B(x,r)| \le c_2 r^d. \end{aligned}$$
If (2) holds, then one must have \( d \le D\). Indeed, successive applications of (1) imply \(|B(x,r)| \ge (r/2)^D\). In the rates obtained in the sequel as well as in the specific examples we consider, \(d\) plays the role of a dimension. Notice, however, that there is no need for \(d\) to be an integer.
2.2 Previous work on the real line
To motivate our approach, let us start with the simple case where the space \(\mathcal{M }\) is a compact interval on the real line, say \(\mathcal{M }=[0,1]\). This is the case considered in [33]. The statistical goal, for any one of the models considered in the introduction (white noise, regression and density), is to estimate the unknown function \(f\). To do so, a Bayesian approach first has to put a prior distribution on \(f\), that is, a probability distribution on, say, the continuous functions on \(\mathcal{M }\).
So, how does one build a prior on \(f\)? A possibility to model a random function on \([0,1]\) is to take realisations of a stochastic process on this interval. A natural class which comes to mind is the one of Gaussian processes. Any such process \((Z_t)_{t\in [0,1]}\) is characterised by a mean function, which here will be taken to be identically zero, and a covariance kernel \(K(s,t)=\mathbb E (Z_sZ_t)\), for \(s,t\) in \([0,1]\). In this case, choosing a prior reduces to choosing a covariance kernel \(K\).
In [33], the authors make use of the so-called squared-exponential covariance kernel
$$\begin{aligned} K(s,t) = e^{-(s-t)^2},\quad s,t\in [0,1]. \end{aligned}$$
It can be shown that the centered Gaussian process \((Z_t)\) with such covariance has very smooth paths. To achieve minimax adaptation properties for the Bayesian posterior, the approach taken by [33] is to additionally allow for some extra freedom in the rescaling by considering \((Z_{At})\) with \(A\) random with properly chosen distribution.
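To see how the rescaling acts, note that if \(Z\) has the squared-exponential covariance, then \((Z_{At})\) is again centered Gaussian with covariance \(e^{-A^2(s-t)^2}\): larger \(A\) means faster-decaying correlations, hence rougher paths at a fixed scale. A quick numerical check of this identity (taking the kernel normalisation \(e^{-(s-t)^2}\) as an assumption):

```python
import numpy as np

def sq_exp(s, t):
    """Squared-exponential covariance: K(s, t) = exp(-(s - t)^2)."""
    return np.exp(-((s - t) ** 2))

A = 3.0                                  # rescaling factor
s = np.linspace(0.0, 1.0, 50)
S, T = np.meshgrid(s, s, indexing="ij")  # grid of (s, t) pairs

# Covariance of the rescaled process t -> Z_{At}: correlations decay
# A^2 times faster, i.e. the length-scale shrinks to 1/A
K_rescaled = sq_exp(A * S, A * T)
assert np.allclose(K_rescaled, np.exp(-(A**2) * (S - T) ** 2))
```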
There is a simple reason why the squared-exponential kernel cannot be used in a geometric context for general \(\mathcal{M }\). Although (3) admits the immediate generalisation (recall that \(\rho \) is the metric on \(\mathcal{M }\))
$$\begin{aligned} K(x,y) = e^{-\rho (x,y)^2},\quad x,y\in \mathcal{M }, \end{aligned}$$it can be shown that this function is in general not positive definite: it already fails for examples as simple as \(\mathcal{M }\) taken to be the sphere in \(\mathbb R ^k\), \(k\ge 2\).
2.3 Building a positive-definite kernel on \(\mathcal{M }\) via an operator \(L\)
Following the previous idea of building a Gaussian process on \(\mathcal{M }\), the question is now how to build an appropriate covariance kernel on \(\mathcal{M }\).
Suppose one is given a decomposition of the space of square-integrable functions on \(\mathcal{M }\),
$$\begin{aligned} \mathbb L ^2 = \oplus _{k\ge 0} \mathcal H _{k}, \end{aligned}$$
where the \(\mathcal H _{k}\) are finite-dimensional subspaces of \(\mathbb L ^2\) consisting of continuous functions on \(\mathcal{M }\), and orthogonal in \(\mathbb L ^2\). Then, the projector \(Q_{k}\) on \(\mathcal H _{k}\) is actually a kernel operator \(Q_k(x,y):=\sum _{1\le i \le dim(\mathcal H _k)} e_k^i(x) e_k^i(y)\), where \(\{e_k^i\}\) is any orthonormal basis of \(\mathcal H _k\); so it is obviously a positive-definite kernel. Also, given \(\varphi :\mathbb N \rightarrow (0,+\infty )\) such that \( \forall x \in M, \; \sum _{k\ge 0} \varphi (k) Q_k(x,x) < \infty ,\) the function \(K_\varphi (x,y)=\sum _{k\ge 0}\varphi (k)Q_k(x,y)\) is a positive definite kernel which is the covariance kernel of a Gaussian process. Constructing explicitly a Gaussian process with this covariance is not difficult; this will be done in Sect. 2.6.
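On the circle (see Sect. 3), \(Q_0=1\) and \(Q_k(x,y)=2\cos k(x-y)\), so the construction of \(K_\varphi \) can be checked numerically: any summable \(\varphi \) yields a positive semi-definite kernel matrix. A minimal sketch under these circle conventions:

```python
import numpy as np

def K_phi(x, y, phi, kmax=60):
    """K_phi(x, y) = sum_k phi(k) Q_k(x, y) on the circle,
    with Q_0 = 1 and Q_k(x, y) = 2 cos k(x - y) for k >= 1."""
    ks = np.arange(1, kmax + 1)
    d = np.subtract.outer(x, y)   # pairwise differences x_i - y_j
    return phi(0) + 2.0 * np.tensordot(phi(ks), np.cos(ks[:, None, None] * d), axes=1)

x = np.linspace(-np.pi, np.pi, 80, endpoint=False)
phi = lambda k: np.exp(-0.2 * np.asarray(k, dtype=float) ** 2)  # summable weights

Kmat = K_phi(x, x, phi)
eigs = np.linalg.eigvalsh(Kmat)
print(eigs.min() > -1e-8)  # True: the kernel matrix is PSD up to round-off
```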
A simple way of obtaining a decomposition (5) is by diagonalisation of a self-adjoint positive operator \(L\) on \(\mathcal{M }\) with discrete spectrum, finite-dimensional eigenspaces and eigenfunctions continuous on \(\mathcal{M }\). In this case the subspaces \(\mathcal H _k=:\mathcal H _{\lambda _k}\) can be taken to be the eigenspaces of \(L\). Such an operator has non-negative eigenvalues \(\lambda _k\), which we order increasingly (\( 0\le \lambda _0< \lambda _1<\cdots \)).
While many such operators \(L\) could in principle be used, we will be especially interested in the cases where \(L\) reflects quite well some geometrical properties of \(\mathcal{M }\). A central example when \(\mathcal{M }\) is a compact Riemannian manifold without boundary is \(L=-\Delta \), where \(\Delta \) is the Laplacian (or weighted Laplacian) on \(\mathcal{M }\), see Sect. 3 for details.
The operator \(L\) being given, we still need to choose the function \(\varphi \). For this purpose, we will concentrate on another important aspect of the approach developed in [31] and [33]: the rescaling \(At\). In these papers, the rescaling drives the regularity and its proper choice is essential for the properties of the estimators. In the case of a general set \(\mathcal{M }\), a multiplicative rescaling \(At\) typically has no meaning. To find a meaningful generalisation of the rescaling, we will use a standard tool of the theory of operators (see [10]), the semi-group associated to the operator \(L\), which ultimately yields the choice \(\varphi (k)=e^{-t\lambda _k}\). Note that a somewhat similar point of view has been considered recently in statistical learning with Laplacian-based spectral methods (diffusion maps, diffusion wavelets...) propagating information on the data through a Markov kernel (see for instance [8, 21]).
2.4 Heat kernel
Let us now give the properties on the operator \(L\) that we will need. These conditions arise naturally in the theory of heat kernels. Though no prerequisites on heat kernels are needed for the present paper, we refer the interested reader to [14, 22, 26] for standard expositions on heat kernel theory.
We suppose that the self-adjoint positive operator \(L\) is defined on a domain \(D \subset \mathbb L ^2\) dense in \(\mathbb L ^2\). Then \(-L\) is the infinitesimal generator of a contraction self-adjoint semigroup \(e^{-tL}\), see [10, Thm. 4.6].
We suppose in addition that \(e^{-tL}\) is a Markov kernel operator, i.e. there exists a non-negative continuous kernel \( P_t(x,y)\) (the ‘heat kernel’) such that:
Clearly, from (7), (9) and \(P_t(x,y) \ge 0\), we have
$$\begin{aligned} P_t(x,y) = \int _\mathcal{M } P_{t/2}(x,z)\, P_{t/2}(y,z)\, d\mu (z), \end{aligned}$$
which immediately implies that \(P_t(x,y)\) is a positive-definite kernel.
We will further assume the following bounds on the heat kernel, which are satisfied in a large variety of situations, in particular all the examples considered in Sect. 3 (see [14, 22, 26]):
Suppose that there exist \(C_1, C_2>0\), \(c_1, c_2 >0\), such that, for all \( t \in ]0,1[ \) and any \(x,y\in \mathcal{M }\),
$$\begin{aligned} \frac{C_1}{|B(x,\sqrt{t})|}\, e^{-c_1\rho (x,y)^2/t} \ \le \ P_t(x,y)\ \le \ \frac{C_2}{|B(x,\sqrt{t})|}\, e^{-c_2\rho (x,y)^2/t}. \end{aligned}$$
It is also known ([9], and references therein), that \(P_t(x,y)\) is a continuous function, the eigenspaces \(\mathcal H _{\lambda _k}\) of \(L\) are of finite dimension and the eigenvectors are continuous. That is, \(\mathbb L ^2(\mathcal{M }) = \oplus _{k\ge 0} \mathcal H _{\lambda _k},\) and the orthogonal projectors \(P_{ \mathcal H _{\lambda _k}}\) on the eigenspaces \( \mathcal H _{\lambda _k} \) are kernel operators with kernel
$$\begin{aligned} Q_k(x,y) = \sum _{1\le l\le dim(\mathcal H _{\lambda _k})} e_k^l(x)\, e_k^l(y), \end{aligned}$$
as soon as \(\{e_k^l,\; 1 \le l \le dim(\mathcal H _{\lambda _k} ) \} \) is an orthonormal basis of \(\mathcal H _{\lambda _k}.\)
The Markov kernel \(P_t\) can then be written
$$\begin{aligned} P_t(x,y) = \sum _{k\ge 0} e^{-\lambda _k t}\, Q_k(x,y). \end{aligned}$$
A direct consequence of (10), taking \(x=y\), is that for all \(x\in \mathcal{M }\) and all \(t \in ]0,1[\),
$$\begin{aligned} C_1 \le P_t(x,x)\, |B(x,\sqrt{t})| \le C_2. \end{aligned}$$
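On the circle, where \(|B(x,r)|=r/\pi \) (see Sect. 3), this two-sided comparison between \(P_t(x,x)\) and \(1/|B(x,\sqrt{t})|\) can be observed numerically; the truncated series below is a sketch using the circle normalisations:

```python
import numpy as np

def P_diag(t, kmax=400):
    """P_t(x, x) = 1 + 2 sum_{k>=1} e^{-k^2 t} on the circle (truncated)."""
    k = np.arange(1, kmax + 1)
    return 1.0 + 2.0 * np.exp(-(k**2) * t).sum()

ratios = []
for t in [0.001, 0.01, 0.1]:
    ball = np.sqrt(t) / np.pi        # |B(x, sqrt(t))| = sqrt(t)/pi on the circle
    ratios.append(P_diag(t) * ball)  # should stay between two positive constants

print([round(r, 3) for r in ratios])  # [0.564, 0.564, 0.564], i.e. ~ 1/sqrt(pi)
```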
2.5 Why is the heat kernel a canonical kernel on \(\mathcal{M }\)?
Consider the already large class of compact Riemannian manifolds without boundary for \(\mathcal{M }\). In that case we set \(L=-\Delta \), where \(\Delta \) is the Laplacian on \(\mathcal{M }\), see Sect. 3 for details. We claim that in this case the associated heat kernel is a natural choice for the following reasons
-
1.
The heat kernel \(P_t\) is a positive definite kernel on \(\mathcal{M }\). Thus \(P_t\) is at least a possible candidate for use as a covariance kernel of a Gaussian process.
-
2.
The heat kernel associated to \(L=-\Delta \) on \(\mathcal{M }\) verifies (10), see Sect. 3. In particular, up to constants, the heat kernel appears as a natural geometric generalisation of the squared-exponential kernel (3) on the real line! Also, we see that the ‘time’ \(t\) is a natural candidate for an (inverse) scale parameter. We will indeed allow \(t\) to vary in the definition of our prior below.
-
3.
In the context of harmonic analysis on geometric spaces \(\mathcal{M }\), the Laplacian on \(\mathcal{M }\), or equivalently the associated heat kernel semi-group, are known as natural carriers of the information about the ‘geometry’ of \(\mathcal{M }\). The collection of eigenspaces \(\mathcal H _{\lambda _k}\) defined above can very much be interpreted as a harmonic analysis of \(\mathcal{M }\). For instance, in the case of the circle \(\mathbb S ^1\sim \mathbb R / 2\pi \mathbb Z \), one has \(\lambda _k=k^2\) and \(\mathcal H _{\lambda _k}=\mathcal H _{k}\) is generated by \(x\rightarrow \cos (kx)\) and \(x\rightarrow \sin (kx)\), see Sect. 3.
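The semigroup property \(P_{t+s}(x,y)=\int P_t(x,z)P_s(z,y)\,d\mu (z)\) underlying this construction can be checked numerically on the circle, where the heat kernel has the explicit expansion \(1+2\sum _{k\ge 1}e^{-k^2t}\cos k(x-y)\) (a sketch with a truncated series):

```python
import numpy as np

def P(t, d, kmax=60):
    """Truncated circle heat kernel at lag d = x - y:
    P_t(x, y) = 1 + 2 sum_{k>=1} e^{-k^2 t} cos k(x - y)."""
    k = np.arange(1, kmax + 1)
    d = np.asarray(d, dtype=float)
    return 1.0 + 2.0 * (np.exp(-(k**2) * t) * np.cos(np.multiply.outer(d, k))).sum(axis=-1)

t, s = 0.2, 0.3
x, y = 0.7, -1.1
z = np.linspace(-np.pi, np.pi, 512, endpoint=False)

# Riemann sum for int P_t(x, z) P_s(z, y) dmu(z), with dmu = dz / (2 pi);
# the equispaced periodic grid integrates trigonometric polynomials exactly
lhs = np.mean(P(t, x - z) * P(s, z - y))
rhs = P(t + s, x - y)
print(abs(lhs - rhs) < 1e-8)  # True: P_t composed with P_s gives P_{t+s}
```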
2.6 Our method: Prior, definition
For the statistical results of the paper, we always assume that both conditions (1) and (2) on the set \(\mathcal{M }\) hold. In particular, the prior depends on the ‘dimension parameter’ \(d\) in (2). However, the key entropy property stated in Sect. 5 holds more generally under (1) only.
We consider a prior on functions from \(\mathcal{M }\) to \(\mathbb R \) constructed hierarchically as follows.
First, generate a collection of independent standard normal variables \(\{X_k^l \}\) with indices \(k\ge 0\) and \(1 \le l \le dim(\mathcal H _{\lambda _k})\). Set, for \(x\in \mathcal{M }\) and any \(t\in (0,1]\),
$$\begin{aligned} W^t(x) = \sum _{k\ge 0}\ \sum _{1\le l\le dim(\mathcal H _{\lambda _k})} e^{-\lambda _k t/2}\, X_k^l\, e_k^l(x). \end{aligned}$$
To simplify the notation, and when no confusion is possible, we omit in the sequel the range of indices \(k,l\) in summations. Equation (13) defines a Gaussian stochastic process \(W^t(\cdot )\) indexed by \(\mathcal{M }\). This process is centered and has covariance kernel precisely \(P_t\), as follows from computing
$$\begin{aligned} \mathbb E \left[ W^t(x)\, W^t(y)\right] = \sum _{k} e^{-\lambda _k t} \sum _{l} e_k^l(x)\, e_k^l(y) = \sum _{k\ge 0} e^{-\lambda _k t} Q_k(x,y) = P_t(x,y). \end{aligned}$$
Also, \(W^t\) defines a Gaussian variable in various separable Banach spaces \(\mathbb B \), see [32] for definitions. In particular, it is a Gaussian random element in both \(\mathbb B =(\mathcal C ^0(\mathcal{M }),\Vert \cdot \Vert _{\infty })\) and \( \mathbb B =(\mathbb L ^2(\mathcal{M },\mu ),\Vert \cdot \Vert _2)\), the two Banach spaces we consider in the sequel. To check this, apply Theorem 4.2 in [32], where almost sure convergence of the series (13) in \(\mathbb B \) follows from the properties of the Markov kernel (11).
Second, draw a positive random variable \(T\) according to a density \(g\) on \((0,1]\). This variable can be interpreted as a random scaling, or ‘time’. It turns out that convenient choices of \(g\) are deeply connected to the geometry of \(\mathcal{M }\). We choose the density \(g\) of \(T\) such that, for a real \(a>1\) and positive constants \(C_1,C_2,q\), with \(d\) defined in (2), for all \(t\in (0,1]\),
$$\begin{aligned} C_1\, t^{-q}\, e^{-t^{-d/2}\log ^a (1/t)} \le g(t) \le C_2\, t^{-q}\, e^{-t^{-d/2}\log ^a (1/t)}. \end{aligned}$$
We show below that the choice \(q=1+d/2\) leads to sharp rates.
The full (non-Gaussian) prior we consider is \(W^T\!\), where \(T\) is random with density \(g\). Hence, this construction leads to a prior \(\Pi _w\), which is the probability measure induced by
Some comments are in order. First, one could be tempted to use the Gaussian process \(W^t\) (with \(t\) fixed) as a prior. Nevertheless, similarly to what happens for the squared-exponential prior on the real line [33], the paths of the corresponding Gaussian process are infinitely differentiable almost surely, which would lead to slow rates of convergence for the posterior, see [30]. This difficulty can be overcome by making \(t\) random, allowing the prior to adapt to the regularity of the unknown function. The choice of the particular form of the prior on \(T\) is related to the form taken by the entropy of the Reproducing Kernel Hilbert Space (RKHS) of \(W^t\), as will be seen in Sect. 5. For more discussion, see also Sect. 4.3.
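The hierarchical sampling scheme can be sketched on the circle: draw a time \(T\), then a series with independent Gaussian coefficients damped by \(e^{-\lambda _k T/2}\). For illustration \(T\) is drawn uniformly on \((0,1]\) rather than from the density \(g\) of (14), and the series is truncated; both are simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
kmax = 40                       # truncation level of the expansion
k = np.arange(1, kmax + 1)

def sample_W(t, x, rng):
    """One draw of the (truncated) circle version of W^t:
    W^t(x) = X_0 + sum_k e^{-k^2 t/2} (X_k sqrt(2) cos kx + Y_k sqrt(2) sin kx)."""
    w = np.exp(-(k**2) * t / 2.0)
    X0 = rng.standard_normal()
    Xk = w * rng.standard_normal(kmax)
    Yk = w * rng.standard_normal(kmax)
    return X0 + np.sqrt(2.0) * (Xk @ np.cos(np.outer(k, x)) + Yk @ np.sin(np.outer(k, x)))

# Hierarchical step: draw a random time T first (uniform here, for illustration only)
T = rng.uniform(0.0, 1.0)
x = np.linspace(-np.pi, np.pi, 100, endpoint=False)
path = sample_W(T, x, rng)

# Sanity check at a point: Var W^t(0) should equal P_t(0, 0) = 1 + 2 sum_k e^{-k^2 t}
t = 0.3
draws = np.array([sample_W(t, np.array([0.0]), rng)[0] for _ in range(20000)])
err = abs(draws.var() - (1.0 + 2.0 * np.exp(-(k**2) * t).sum()))
print(err < 0.2)  # True (Monte Carlo with fixed seed)
```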
3 Examples
This section is devoted to the presentation of a variety of examples which naturally fit into the framework introduced above. The first three examples reflect a situation with no boundary where Condition (2) is verified.
The last case gives an illustration of a more complicated situation, where a boundary is present and (2) may or may not be valid, depending on the type of measure \(\mu \) and operator \(L\).
It is interesting to observe in these examples that, in the situations where (2) is true, the constant \(d\) has a natural interpretation as the dimension of the problem.
Torus case Let \(\mathcal{M }=\mathbb{S }^1\) be the torus, parameterised by \([-\pi , \pi ]\), with identification of \(\pi \) and \(-\pi \), and equipped with the normalised Lebesgue measure. The metric \(\rho \) reflects the previous identification.
In particular, for any \(0<r\le \pi \) one has \(|B(x,r)| = r/\pi \), which ensures condition (2) with \(d=1\). The spectral decomposition of the Laplacian operator \(\Delta =-L\) gives rise to the classical Fourier basis, with \(\lambda _k=k^2\), \(\mathcal H _{\lambda _0}\) spanned by the constant function \(1\), and, for \(k\ge 1\),
$$\begin{aligned} \mathcal H _{\lambda _k} = span \{\sqrt{2}\cos (kx),\ \sqrt{2}\sin (kx)\}. \end{aligned}$$
From this one deduces that \( Q_k(x,y) =2 \cos k(x-y)\) and
$$\begin{aligned} e^{t\Delta }(x,y) = P_t(x,y) = 1 + 2\sum _{k\ge 1} e^{-k^2 t}\cos k(x-y). \end{aligned}$$
Clearly, for all \(t >0\), \(e^{t\Delta }(x,x) \ge 1\), and for all \(t\in ]0,1[\), \(e^{t\Delta }(x,x) = 1+2\sum _{k\ge 1}e^{-k^2t} \le C\, t^{-1/2}\), in accordance with (10) since here \(|B(x,\sqrt{t})|=\sqrt{t}/\pi \).
Sphere case Let now \(\mathcal{M }=\mathbb{S }^{d-1} \subset \mathbb R ^d\). The geodesic distance on \(\mathbb{S }^{d-1}\) is given by
$$\begin{aligned} \rho (x,y) = \arccos \langle x,y\rangle , \end{aligned}$$where \(\langle \cdot ,\cdot \rangle \) denotes the Euclidean inner product of \(\mathbb R ^d\).
We take as \(\mu \) the natural measure on \(\mathbb{S }^{d-1}\) which is rotation invariant. It follows that
$$\begin{aligned} |B(x,r)| = \frac{\int _0^r (\sin u)^{d-2}\, du}{\int _0^\pi (\sin u)^{d-2}\, du},\quad 0<r\le \pi . \end{aligned}$$
From this one deduces the following inequalities, ensuring (2) with ‘dimension’ \(d-1\):
$$\begin{aligned} c_1 r^{d-1} \le |B(x,r)| \le c_2 r^{d-1},\quad 0<r\le \pi . \end{aligned}$$
As \(\mathcal{M }=\mathbb{S }^{d-1}\) is a Riemannian manifold, there is a natural Laplacian on \(\mathbb{S }^{d-1}, ~\Delta _{\mathbb{S }^{d-1}}=-L,\) which is a negative self-adjoint operator, whose spectral decomposition we describe next. The eigenspaces \(\mathcal H _{\lambda _k}\) turn out to be the restriction to \(\mathbb{S }^{d-1}\) of polynomials of degree \(k\) which are homogeneous (i.e. \(P(x) =\sum _{|\alpha | =k} a_\alpha x^\alpha ,~~ \alpha =(\alpha _1, \ldots , \alpha _d) , ~|\alpha | =\sum \alpha _i , ~\alpha _i\in \mathbb N \)) and harmonic on \(\mathbb R ^d\) (i.e. \(\Delta P= \sum _{i=1}^d \frac{\partial ^2 P}{\partial x_i^2} =0\).) We have, for \(Y\in \mathcal H _{\lambda _k}\),
$$\begin{aligned} L\, Y = k(k+d-2)\, Y,\quad \text{ i.e. }\ \lambda _k = k(k+d-2). \end{aligned}$$
Moreover \( dim(\mathcal H _{\lambda _k} ) = \frac{2k+d-2}{d-2} {d+k-3 \atopwithdelims ()k}=: N_k(d).\) This space is called the space of spherical harmonics of order \(k\). If \((Y_{ki})_{1\le i\le N_k}\) is any orthonormal basis of \(\mathcal H _{\lambda _k} \), the projector can be written
$$\begin{aligned} Q_k(x,y) = \sum _{i=1}^{N_k} Y_{ki}(x)\, Y_{ki}(y). \end{aligned}$$
In fact, \(Q_k\) further has an explicit expression in terms of Gegenbauer polynomials, see [28]. Now for all \(0< t \le 1\), it can be proved (as a consequence of a general result on compact Riemannian manifolds, see below) that
$$\begin{aligned} c_1 t^{-(d-1)/2} e^{-C_1\rho (x,y)^2/t} \le P_t(x,y) \le c_2 t^{-(d-1)/2} e^{-C_2\rho (x,y)^2/t}. \end{aligned}$$
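The dimension formula \(N_k(d)\) can be cross-checked against the standard count of degree-\(k\) homogeneous harmonic polynomials in \(\mathbb R ^d\), namely \({d+k-1 \atopwithdelims ()k}-{d+k-3 \atopwithdelims ()k-2}\); a quick verification:

```python
from math import comb

def N(k, d):
    """Dimension of H_{lambda_k} on S^{d-1} (d >= 3):
    N_k(d) = (2k + d - 2)/(d - 2) * C(d + k - 3, k)."""
    return (2 * k + d - 2) * comb(d + k - 3, k) // (d - 2)

def harmonic_dim(k, d):
    """Degree-k homogeneous harmonic polynomials in R^d:
    C(d + k - 1, k) - C(d + k - 3, k - 2)."""
    return comb(d + k - 1, k) - (comb(d + k - 3, k - 2) if k >= 2 else 0)

for d in range(3, 8):
    for k in range(10):
        assert N(k, d) == harmonic_dim(k, d)

print(N(0, 3), N(1, 3), N(2, 3))  # 1 3 5: the familiar 2k + 1 on the 2-sphere
```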
Compact Riemannian manifold without boundary This case obviously generalises the two examples above. Let \(\mathcal{M }\) be a compact Riemannian manifold of dimension \(d\) without boundary. Associated to the Riemannian structure we have a measure \(dx\), a metric \(\rho ,\) and a Laplacian \(\Delta ,\) such that
$$\begin{aligned} \int _\mathcal{M } \Delta f(x)\, g(x)\, dx = -\int _\mathcal{M } \langle \nabla f(x), \nabla g(x)\rangle \, dx. \end{aligned}$$
So \(L=-\Delta \) is a symmetric nonnegative operator. Now the associated semigroup \(e^{t\Delta }\) is a positive kernel operator verifying, for all \( t \in ]0,1[ \) (see [7]),
$$\begin{aligned} c_1 t^{-d/2} e^{-C_1\rho (x,y)^2/t} \le e^{t\Delta }(x,y) \le c_2 t^{-d/2} e^{-C_2\rho (x,y)^2/t}. \end{aligned}$$
The main property proved in Sect. 10.5 is that there exist \( 0< c_1 <c_2 <\infty \) such that, for all \(x\in \mathcal{M }\) and \(0<r\le 1\),
$$\begin{aligned} c_1 r^d \le |B(x,r)| \le c_2 r^d, \end{aligned}$$
which is exactly (2) with dimension \(d\).
Jacobi case Consider \(\mathcal{M }=[-1,1],\) equipped with the measure \(\omega (x) dx\) with \(\omega (x) =(1-x)^\alpha (1+x)^\beta \) and \(\alpha >- 1, ~\beta >-1.\) (So we have, in fact, a family of measures.) Consider the metric
$$\begin{aligned} \rho (x,y) = |\arccos x - \arccos y|. \end{aligned}$$
If \(\sigma (x)= 1-x^2\), then \(\tau :=\frac{(\sigma \omega )^{\prime }}{\omega } \) is a polynomial of degree \(1\), and we set
$$\begin{aligned} Lf = D_J f := -\frac{1}{\omega }\left(\sigma \omega f^{\prime }\right)^{\prime } = -\sigma f^{\prime \prime } - \tau f^{\prime }. \end{aligned}$$
The operator \(L\) is a nonnegative symmetric second order differential operator in \(\mathbb L _2 (\omega (x) dx)\). Using Gram-Schmidt orthonormalisation (again, in \(\mathbb L _2 (\omega (x) dx)\)) of \(\{x^k,\, k\in \mathbb N \}\) we get a family of orthonormal polynomials \(\{\pi _k ,\, k\in \mathbb N \}\) called Jacobi polynomials, which gives the spectral decomposition of \(D_J.\) More precisely,
$$\begin{aligned} D_J\, \pi _k = \lambda _k\, \pi _k,\quad \lambda _k = k(k+\alpha +\beta +1). \end{aligned}$$
Then, for any \(k \in \mathbb N \), \(\mathcal H _{\lambda _k}=span\{\pi _k\}\), \(dim(\mathcal H _{\lambda _k}) =1\) and
$$\begin{aligned} Q_k(x,y) = \pi _k(x)\, \pi _k(y). \end{aligned}$$
It can be proved, see [9], that for any \(x,y\) on \(\mathcal{M }\) and any \(t\in ]0,1[\),
$$\begin{aligned} \frac{C_1}{|B(x,\sqrt{t})|}\, e^{-c_1\rho (x,y)^2/t} \le P_t(x,y) \le \frac{C_2}{|B(x,\sqrt{t})|}\, e^{-c_2\rho (x,y)^2/t}. \end{aligned}$$
Furthermore it can be checked, see [23], that for all \(x \in [-1,1]\) and \(0<r \le \pi \),
$$\begin{aligned} c_1\, r\, (r+\sqrt{1-x})^{2\alpha +1} (r+\sqrt{1+x})^{2\beta +1} \le |B(x,r)| \le c_2\, r\, (r+\sqrt{1-x})^{2\alpha +1} (r+\sqrt{1+x})^{2\beta +1}. \end{aligned}$$
So, in this case, condition (2) is true for \(\alpha =\beta =-1/2\) and not fulfilled for other values of \(\alpha \) and \(\beta \).
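For \(\alpha =\beta =-1/2\) the weight is the Chebyshev weight and the \(\pi _k\) are proportional to the Chebyshev polynomials \(T_k(x)=\cos (k\arccos x)\). Assuming the operator then takes the form \(D_Jf=-(1-x^2)f^{\prime \prime }+xf^{\prime }\), with eigenvalues \(\lambda _k=k^2\), the eigenrelation can be checked by finite differences (a sketch):

```python
import numpy as np

def T(k, x):
    """Chebyshev polynomial T_k(x) = cos(k * arccos(x))."""
    return np.cos(k * np.arccos(x))

def DJ(f, x, h=1e-4):
    """Assumed Jacobi operator for alpha = beta = -1/2:
    (D_J f)(x) = -(1 - x^2) f''(x) + x f'(x), via central differences."""
    f1 = (f(x + h) - f(x - h)) / (2 * h)
    f2 = (f(x + h) - 2 * f(x) + f(x - h)) / h**2
    return -(1 - x**2) * f2 + x * f1

x = np.linspace(-0.9, 0.9, 25)     # stay away from the endpoints +-1
for k in range(6):
    res = DJ(lambda u: T(k, u), x) - k**2 * T(k, x)
    assert np.max(np.abs(res)) < 1e-3   # D_J T_k = k^2 T_k, up to discretisation error
print("ok")
```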
4 Main results
Before stating the main results, we briefly present the general Bayesian framework.
4.1 Bayesian framework and general result
Data Given a metric space \({\fancyscript{F}}\) equipped with a \(\sigma \)-field \(\mathcal T \), consider a sequence of statistical experiments \((\mathcal{X }_n,\mathcal{A }_n,\{P_{f}^{(n)}\}_{f\in {\fancyscript{F}}})\). Suppose there exists a common (\(\sigma \)-finite) dominating measure \(\mu ^{(n)}\) for all probability measures \(\{P_{f}^{(n)}\}_{f\in {\fancyscript{F}}}\), that is
$$\begin{aligned} dP_f^{(n)} = p_f^{(n)}\, d\mu ^{(n)},\quad f\in {\fancyscript{F}}. \end{aligned}$$
We assume that the map \( (x^{(n)},f) \rightarrow p_f^{(n)}(x^{(n)}) \) is jointly measurable relative to \(\mathcal{A }_n \otimes \mathcal T \).
Prior We equip the space \(({\fancyscript{F}},\mathcal T )\) with a probability measure \(\Pi \) that is called prior. Then the space \(\mathcal{X }_n\times {\fancyscript{F}}\) can be naturally equipped with the \(\sigma \)-field \(\mathcal{A }_n\otimes \mathcal T \) and with the probability measure
$$\begin{aligned} P[A\times B] = \int _B P_f^{(n)}[A]\, d\Pi (f) = \int _B \int _A p_f^{(n)}(x)\, d\mu ^{(n)}(x)\, d\Pi (f). \end{aligned}$$
The marginal in \(f\) of this measure is the prior \(\Pi \). The conditional law of \(X^{(n)}\) given \(f\) is \(P_f^{(n)}\).
Bayes formula Under the preceding framework, the conditional distribution of \(f\) given the data \(X^{(n)}\) is absolutely continuous with respect to \(\Pi \) and is given by, for any measurable set \(A\in \mathcal T \),
$$\begin{aligned} \Pi [A \,|\, X^{(n)}] = \frac{\int _A p_f^{(n)}(X^{(n)})\, d\Pi (f)}{\int _{{\fancyscript{F}}} p_f^{(n)}(X^{(n)})\, d\Pi (f)}. \end{aligned}$$
We study the convergence of the posterior measure in a frequentist sense in that we suppose that there exists a ‘true’ parameter, here an unknown function, denoted \(f_0\). That is, we consider convergence under the law \(P_{f_0}^{(n)}\). The expectation under this distribution is denoted \(\mathbb{E }_{f_0}\). Theorem 1 in [13] gives sufficient conditions for the posterior to concentrate at rate \(\varepsilon _n\rightarrow 0\) towards \(f_0\), when \(n\) goes to infinity, in the sense that, for \(M\) large enough,
$$\begin{aligned} \mathbb{E }_{f_0} \Pi [f:\ d_n(f,f_0) \ge M\varepsilon _n \,|\, X^{(n)}] \rightarrow 0, \end{aligned}$$
where \(d_n\) is a semi-distance for which certain exponential tests exist, see [13] for details. In [31, 33], the case of plain and randomly rescaled Gaussian priors is considered and the authors establish that, as soon as the statistical distance \(d_n\) of the problem properly relates to the Banach space norm \(\mathbb B \), then Conditions (23), (24), (25) defined in the sequel imply the convergence (16).
4.2 Concentration results
Let us recall that we assume that the compact metric space \(\mathcal{M }\) satisfies the conditions of Sect. 2, that is the doubling property (1) together with the polynomial-type growth (2) of volume of balls. The operator \(L\) is also supposed to verify the properties listed in Sect. 2.4.
Gaussian white noise One observes
$$\begin{aligned} dX^{(n)}(x) = f(x)dx + \frac{1}{\sqrt{n}} dZ(x),\quad x\in \mathcal{M }. \end{aligned}$$
In this case, set \((\mathbb B ,\Vert \cdot \Vert )=(\mathbb L _2,\Vert \cdot \Vert _2)\). We set the prior \(\Pi \) to be the law \(\Pi _w\) induced by \(W^T\), see (15). Here \(\Pi \) serves directly as a prior on \(f\) (so \(w=f\) here). Besov spaces are defined in Sect. 9.
Theorem 1
(Gaussian white noise on \((\mathcal{M },\rho )\), upper-bound) Let the set \(\mathcal{M }\) and the operator \(L\) satisfy the properties listed above. Suppose that \(f_0\) is in the Besov space \(B_{2,\infty }^s(\mathcal{M })\) with \(s>0\) and that the prior \(\Pi \) on \(f\) is \(W^T\) given by (15). Let \(q=1+d/2\) in (14). Set \(\varepsilon _n \sim \bar{\varepsilon }_n\sim (\log {n}/n)^{2s/(2s+d)}\). For \(M\) large enough, as \(n\rightarrow \infty \),
$$\begin{aligned} \mathbb{E }_{f_0} \Pi [f:\ \Vert f-f_0\Vert _2 \ge M\varepsilon _n \,|\, X^{(n)}] \rightarrow 0. \end{aligned}$$
The proof of Theorem 1 is given in Sect. 8. For the next two examples, we only give a sketch of the argument.
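In sequence space the white noise model diagonalises: writing \(f=\sum _k \theta _k e_k\), one observes \(x_k=\theta _k+z_k/\sqrt{n}\) and, conditionally on \(T=t\), the coordinates receive independent \(N(0,e^{-\lambda _kt})\) priors, so the posterior is explicitly Gaussian coordinatewise. A toy illustration (circle eigenvalues, fixed \(t\), and a hypothetical true sequence, all illustrative assumptions) showing the exact posterior-mean risk decreasing with \(n\):

```python
import numpy as np

K = 60
lam = np.arange(K) ** 2                   # circle eigenvalues lambda_k = k^2
t = 0.1                                   # a fixed time, conditionally on T = t
tau2 = np.exp(-lam * t)                   # prior variances e^{-lambda_k t}
theta0 = 1.0 / (np.arange(K) + 1.0) ** 2  # a hypothetical true coefficient sequence

def risk(n):
    """Exact mean squared error of the posterior mean in the conjugate
    Gaussian sequence model x_k = theta_k + z_k / sqrt(n),
    prior theta_k ~ N(0, tau2_k): posterior mean is s_k x_k with
    s_k = n tau2_k / (n tau2_k + 1), so
    risk = sum_k [(1 - s_k)^2 theta0_k^2 + s_k^2 / n]."""
    s = n * tau2 / (n * tau2 + 1.0)
    return float(np.sum((1.0 - s) ** 2 * theta0**2 + s**2 / n))

print(risk(100) > risk(10_000) > risk(1_000_000))  # True: the risk shrinks with n
```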
Fixed design regression With the notation from Sect. 1, the observations are
$$\begin{aligned} Y_i = f(x_i) + \varepsilon _i,\quad 1\le i\le n. \end{aligned}$$
The prior \(\Pi \) is, as above, the law induced by \(W^T\), see (15), and serves directly as a prior on \(f\) (so \(w=f\) here).
If \(f_0\) is in \(B_{\infty ,\infty }^s(\mathcal{M })\), with \(s>0\), and \(q=1+d/2\) in (14), it follows from Sect. 7 that (23), (24), (25) are satisfied with \(\varepsilon _n \sim \bar{\varepsilon }_n\sim (\log {n}/n)^{2s/(2s+d)}\) and \((\mathbb B ,\Vert \cdot \Vert )=(\mathcal C ^0(\mathcal{M }),\Vert \cdot \Vert _{\infty })\). This implies, as in [31], Section 3.3, that if \(d_n\) is the semi-distance defined by, for \(f_1,f_2\) in \({\fancyscript{F}}\),
$$\begin{aligned} d_n(f_1,f_2)^2 = \frac{1}{n}\sum _{i=1}^n \left( f_1(x_i)-f_2(x_i)\right) ^2, \end{aligned}$$
then posterior concentration (16) holds at rate \(\varepsilon _n\).
Density estimation The observations are a sample
$$\begin{aligned} (X_i)_{1\le i\le n}\ \sim \ f, \end{aligned}$$
for a density \(f\) on \(\mathcal{M }\). The true density \(f_0\) is assumed to be continuous and bounded away from \(0\) and infinity on \(\mathcal{M }\). In order to build a prior on densities, we consider the transformation, for any given continuous function \(w:\mathcal{M }\rightarrow \mathbb R \),
$$\begin{aligned} f_{w}^\Lambda (x) = \frac{\Lambda (w(x))}{\int _\mathcal{M } \Lambda (w(u))\, d\mu (u)},\quad x\in \mathcal{M }, \end{aligned}$$
where \(\Lambda :\mathbb R \rightarrow (0,+\infty )\) is such that \(\log \Lambda \) is Lipschitz on \(\mathbb R \) and has an inverse \(\Lambda ^{-1}:(0,+\infty )\rightarrow \mathbb R \). For instance, one can take the exponential function as \(\Lambda \). Here, the function \(w_0\) is taken to be \(w_0:=\Lambda ^{-1}f_0\). The law of \(W^T\), see (15), here serves as a prior \(\Pi _w\) on \(w\)’s, which induces a prior \(\Pi \) on densities via the transformation \(f_{w}^\Lambda \). That is, the final prior \(\Pi \) on densities we consider is \(f_{W^T}^{\Lambda }\). In this case we set \((\mathbb B ,\Vert \cdot \Vert )=(\mathcal C ^0(\mathcal{M }),\Vert \cdot \Vert _{\infty })\), the Banach space in which the function \(w\) and the prior \(\Pi _w\) live.
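The map \(w\mapsto f_w^\Lambda \) is straightforward to implement: with \(\Lambda =\exp \), one sets \(f_w(x)=e^{w(x)}/\int e^{w}d\mu \), which is automatically a positive density. A minimal sketch on the circle with normalised measure (the particular \(w\) below is arbitrary):

```python
import numpy as np

def f_w(w_vals):
    """Transform a function w (given on an equispaced circle grid) into a
    probability density via Lambda = exp and renormalisation:
    f_w(x) = exp(w(x)) / int exp(w(u)) dmu(u), with dmu = dx / (2 pi)."""
    unnorm = np.exp(w_vals)
    return unnorm / unnorm.mean()      # mean = Riemann sum for the dmu-integral

x = np.linspace(-np.pi, np.pi, 400, endpoint=False)
w = np.sin(x) + 0.3 * np.cos(2 * x)    # any continuous w works here

dens = f_w(w)
print(dens.min() > 0, abs(dens.mean() - 1.0) < 1e-12)  # True True
```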
If \(f_0\) is in \(B_{\infty ,\infty }^s(\mathcal{M })\), with \(s>0\), and \(q=1+d/2\) in (14), it follows from Sect. 7 that (23), (24), (25) are satisfied with \(\varepsilon _n \sim \bar{\varepsilon }_n\sim (\log {n}/n)^{2s/(2s+d)}\) and \((\mathbb B ,\Vert \cdot \Vert )=(\mathcal C ^0(\mathcal{M }),\Vert \cdot \Vert _{\infty })\). This implies, as in [31, Section 3.1] (see also [33, Thm. 3.1]), that (16) holds with the previous rate, where \(d_n\) is the Hellinger distance between densities. The verification is as in [31], extending their Lemma 3.1 to the case of a general \(\Lambda \) with \(\log \Lambda \) Lipschitz. The proof is not difficult and is left to the reader.
Discussion In the case that \(\mathcal{M }\) is a compact connected orientable manifold without boundary, minimax rates of convergence have been obtained in [11], where Sobolev balls of smoothness index \(s\) are considered and data are generated from a regression setting. In particular, in this framework, our procedure is adaptive in the minimax sense for Besov regularities, up to a logarithmic factor.
We have obtained convergence rates for the posterior distribution associated to the geometrical prior in a variety of statistical frameworks. Obtaining these rates does not presuppose any a priori knowledge of the regularity of the function \(f_0\). Therefore our procedure is not only nearly minimax, but also nearly adaptive.
Note also that another attractive property of the method is that it does not assume a priori any (upper or lower) bound on the regularity index \(s>0\). This is related to the fact that approximation is via the spaces \(\mathbb{H }_t\), which are made of (super)-smooth functions.
4.3 Lower bound for the rate
Works obtaining (nearly-)adaptive rates of convergence for posterior distributions are relatively recent, and such rates have so far been obtained for density estimation or regression on subsets of the real line or of Euclidean space. Logarithmic factors are often reported in the (upper-bound) rates, but it is unclear whether the rate must include such a logarithmic term. We aim to answer this question in our setting by providing a lower bound for the rate of convergence of our general procedure. This lower bound implies that the rates obtained in Sect. 4 are, in fact, sharp. One can conjecture that the same phenomenon appears for hierarchical Bayesian procedures with randomly rescaled Gaussian priors when the initial Gaussian prior has a RKHS which is made of super-smooth functions (e.g. infinitely differentiable functions), for instance the priors considered in [25, 33].
For simplicity we consider the Gaussian white noise model
We set \((\mathbb B ,\Vert \cdot \Vert )=(\mathbb L ^2(\mathcal{M }),\Vert \cdot \Vert _2)\). As before, for this model the prior sits on the same space as the function \(f\) to be estimated, so \(w=f\). Again, the set \(\mathcal{M }\) and the operator \(L\) are as in Sect. 4.2.
Theorem 2
(Gaussian white noise on \((\mathcal{M },\rho )\), lower bound) Let \(\varepsilon _n =(\log {n}/n)^{s/(2s+d)}\) for \(s>0\) and let the prior on \(f\) be the law induced by \(W^T\), see (15), with \(q>0\) in (14). Then there exist \(f_0\) in the unit ball of \(B_{2,\infty }^s(\mathcal{M })\) and a constant \(c>0\) such that
As a consequence, for any prior of the type (15) with any \(q>0\) in (14), the posterior convergence rate cannot be faster than \(\varepsilon _n\) above. If \(q\) is larger than \(1+d/2\), the rate becomes slower than \(\varepsilon _n\).
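To give a numerical sense of the logarithmic correction, the following sketch compares the rate carrying the log factor with the purely polynomial rate; the parameter values \(s=2\), \(d=2\) are illustrative choices of ours, not taken from the text:

```python
import math

def rate_with_log(n, s, d):
    # contraction rate with logarithmic factor: (log n / n)^(s/(2s+d))
    return (math.log(n) / n) ** (s / (2 * s + d))

def rate_polynomial(n, s, d):
    # purely polynomial (minimax) rate: n^(-s/(2s+d))
    return n ** (-s / (2 * s + d))

for n in (10**3, 10**6, 10**9):
    print(n, rate_with_log(n, 2, 2), rate_polynomial(n, 2, 2))
```

The ratio of the two rates is \((\log n)^{s/(2s+d)}\), so the gap grows (slowly) without bound as \(n\) increases.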
Remark 1
More generally, an adaptation of the proof of Theorem 2 shows that for any ‘reasonable’ prior on \(T\), in the sense that, for \(\varepsilon _n \sim (\log {n}/n)^{s/(2s+d)}\), it holds
then \(\Pi [\Vert f-f_0\Vert _2\le c\varepsilon _n | X]\rightarrow 0\) for small enough \(c>0\). This condition is the standard ‘prior mass’ condition used in checking upper-bound rates, see (23). Note that the previous display is automatically implied if the prior satisfies \(\Pi [\Vert f-f_0\Vert _2\le \varepsilon _n^* ] \ge e^{-Cn{\varepsilon _n^*}^2}\) for \(\varepsilon _n^*=n^{-s/(2s+d)}\), or more generally for any rate at least as fast as \(\varepsilon _n\). For instance, this can be used to check that taking a uniform prior on \((0,1)\) as the law of \(T\) leads to the same lower-bound rate.
4.4 Structure of the proofs
The proof of Theorem 1 is split into two parts. The first part considers the properties of the heat kernel Gaussian process and the concentration of the corresponding posterior measure. A key ingredient for this part is a sharp entropy estimate for the process, stated in Sect. 5. Establishing these sharp estimates is the object of the second part of the proofs.
5 RKHS and heat kernel Gaussian process
This section focuses on the analysis of the underlying Gaussian process \(W^t\) in (13), for \(t>0, x\in \mathcal{M }\),
The Gaussian process \((W^t(x))_{x\in \mathcal{M }}\) is centered and its covariance kernel precisely coincides with the heat kernel on \(\mathcal{M }\), as noted above. Since \(P_t(\cdot ,\cdot )\) is a covariance kernel for any fixed \(t>0\), it is associated to a Reproducing Kernel Hilbert Space (RKHS) \(\mathbb{H }_t\), which is also the RKHS of the Gaussian process \(W^t\), see [32] for definition and properties of the RKHS. Also, \(\mathbb{H }_t\) is the isometric image of \(\mathbb L ^2\) by \(P_t^{1/2}=P_{t/2}\). So the family \(\{e^{-\lambda _kt/2}e^l_k,\;{k\in \mathbb N , \; 1 \le l \le dim(\mathcal H _{\lambda _k}) }\}\) is a ‘natural’ orthonormal basis of \(\mathbb{H }_t\).
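As a concrete instance not treated explicitly in the text, take \(\mathcal{M }\) the circle with \(L=-d^2/dx^2\), so \(\lambda _k=k^2\) with trigonometric eigenfunctions; expanding on the basis above with i.i.d. standard Gaussian coefficients gives a draw of \(W^t\). A minimal sketch, truncating the spectral series (the truncation level \(K\) and the seed are our own choices):

```python
import math
import random

def sample_heat_gp_circle(t, K=200, seed=0):
    """Draw one (truncated) path of the heat-kernel Gaussian process on the
    circle: W^t(x) = g_0 + sum_{1<=k<=K} e^{-k^2 t/2} sqrt(2) (g_k cos kx + h_k sin kx),
    whose covariance is the (truncated) heat kernel P_t(x, y)."""
    rng = random.Random(seed)
    g = [rng.gauss(0.0, 1.0) for _ in range(K + 1)]
    h = [rng.gauss(0.0, 1.0) for _ in range(K + 1)]

    def w(x):
        val = g[0]
        for k in range(1, K + 1):
            damp = math.exp(-k * k * t / 2.0)
            val += damp * math.sqrt(2.0) * (g[k] * math.cos(k * x) + h[k] * math.sin(k * x))
        return val

    return w

def pointwise_variance(t, K=200):
    """Heat kernel on the diagonal: P_t(x, x) = 1 + 2 sum_{k>=1} e^{-k^2 t}."""
    return 1.0 + 2.0 * sum(math.exp(-k * k * t) for k in range(1, K + 1))
```

Smaller \(t\) damps the high frequencies less, so paths get rougher and the pointwise variance \(P_t(x,x)\) grows; this is the qualitative effect that the random rescaling of \(T\) exploits.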
The RKHS \(\mathbb{H }_t\) has also the following description, for any \(t>0\):
equipped with the inner product
Hence, if we denote by \(\mathbb{H }^1_t\) the unit ball of \(\mathbb{H }_t\):
5.1 Entropy of the RKHS unit ball
Let \((X, \rho )\) be a metric space. For \(\varepsilon >0\), we define, as usual, the covering number \(N(\varepsilon , X) \) as the smallest number of balls of radius \(\varepsilon \) covering \(X\). The entropy \(H(\varepsilon , X)\) is by definition \(H(\varepsilon , X)= \log _2 N(\varepsilon , X).\)
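For intuition, these quantities can be computed greedily on a finite sample of a metric space: the greedy set below is \(\varepsilon \)-separated and its \(\varepsilon \)-balls cover the points, so its cardinality is comparable to \(N(\varepsilon , X)\). A sketch with an illustrative point cloud on \([0,1]\) (the grid and tolerance are our own choices):

```python
import math

def greedy_net(points, eps, dist):
    """Build an eps-separated subset whose eps-balls cover `points`
    (i.e. a maximal eps-net): scan the points and keep any point
    farther than eps from all points kept so far."""
    net = []
    for p in points:
        if all(dist(p, q) > eps for q in net):
            net.append(p)
    return net

def entropy(points, eps, dist):
    """H(eps, X) = log_2 of the covering count (here: the greedy net size)."""
    return math.log2(len(greedy_net(points, eps, dist)))

# Example: [0,1] sampled on a fine grid; N(eps, [0,1]) is of order 1/(2 eps).
pts = [i / 1000.0 for i in range(1001)]
dist = lambda a, b: abs(a - b)
net = greedy_net(pts, 0.1, dist)
```

Every point ends up within \(\varepsilon \) of the net (covering), while net points are pairwise more than \(\varepsilon \) apart (separation), the two properties used repeatedly in Sect. 10.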
An important result of this section is the link between the covering number \(N(\varepsilon , \mathcal{M },\rho ) \) of the space \(\mathcal{M }\), and \(H(\varepsilon , \mathbb{H }^1_t, \mathbb L ^p )\) for \(p=2,\; \infty \), where \( \mathbb{H }^1_t\) is the unit ball of the RKHS defined above. More precisely we prove in Sect. 10 the following theorem:
Theorem 3
Let \(\mathcal{M }\) be a compact metric space satisfying (1), on which a self-adjoint positive operator \(L\) exists such that \(e^{-tL}\) has kernel \(P_t\) satisfying the properties listed in Sect. 2.4. Let us fix \(\nu >0, \; a>0\). There exists \(\varepsilon _0>0\) such that for \(\varepsilon , t\) with \( \varepsilon ^\nu \le at\) and \( 0<\varepsilon \le \varepsilon _0,\)
Remark 2
Theorem 3 gives the precise behaviour, up to constants, from above and below, of the entropy of the RKHS unit ball \(\mathbb{H }_t^1\). The constants involved depend only on \(\mathcal{M }, a, \nu \). The mild restriction on the range of \(t\) arises for technical reasons in the proof of the upper bound. As an examination of the proof reveals, this restriction is not needed in the proof of the lower bound.
5.2 Entropy under Ahlfors’ condition
In Sect. 10, the general case is considered but, for the sake of simplicity, in the sequel we focus on the case where Ahlfors' condition (2) is fulfilled. In this case, Theorem 3 takes the following form.
Proposition 1
Under the conditions of Theorem 3, suppose additionally that \(\mathcal{M }\) satisfies (2). If \(\Lambda _{\varepsilon }\) is a maximal \(\varepsilon \)-net of \(\mathcal{M }\), then
for all \( 0<\varepsilon \le \varepsilon _0\), where we suppose that for some \(\nu >0\) and \(a>0\), it holds \(\varepsilon ^\nu \le at.\)
Proof
Let \((B(x_i,\varepsilon ))_{i\in I}\) be a minimal covering of \(\mathcal{M }\). Then,
On the other hand, if \(\Lambda _{\varepsilon }\) is any maximal \( \varepsilon -\)net, we have:
\(\square \)
6 Proofs I: Geometrical Prior, concentration function
6.1 Approximation and small ball probabilities
The so-called concentration function of a Gaussian process defined below turns out to be fundamental to prove sharp concentration of the posterior measure. For this reason we focus now on the detailed study of this function for the geometrical prior.
In the sequel, the notation \((\mathbb B ,\Vert \cdot \Vert _\mathbb B )\) is used for either of the two spaces
Any property stated below with a \(\Vert \cdot \Vert _\mathbb{B }\)-norm holds for both spaces.
Concentration function Consider the Gaussian process \(W^t\) defined in (13), for a fixed \(t\in (0,1]\). Its concentration function within \(\mathbb B \) is defined, for any function \(w_0\) in \(\mathbb B \), as the sum of two terms
The approximation term \(A_{w_0}^t(\varepsilon )\) quantifies how well \(w_0\) can be approximated by elements of the RKHS \(\mathbb{H }_t\) of the prior while keeping the ‘complexity’ of those elements, quantified in terms of the RKHS-norm, as small as possible. The term \(A_{w_0}^t(\varepsilon )\) is finite for all \(\varepsilon >0\) if and only if \(w_0\) lies in the closure of \(\mathbb{H }_t\) in \(\mathbb B \) (which can be checked to coincide with the support of the Gaussian prior, see [32, Lemma 5.1]). It turns out that for the prior \(W^t\), this closure is \(\mathbb B \) itself, as follows from the approximation results below.
In order to have a precise calibration of \(A_{w_0}^t(\varepsilon )\), we will assume regularity conditions on the function \(w_0\), which in turn will yield the rate of concentration. Namely we shall assume that \(w_0\) belongs to a regularity class \({\fancyscript{F}}_s(\mathcal{M })\), \(s>0\), taken equal to a Besov space
The choice of a regularity assumption in the present context is not a simple matter. We adopt here a natural generalisation of the definition of the usual spaces on the real line, by means of approximation properties. For more details we refer to Sect. 9.
Approximation term \(A_{w_0}^t(\varepsilon )\) under the regularity assumption on \(\mathcal{M }\). For any \(w_0\) in the Banach space \(\mathbb B \), consider the following sequence of approximations. For \(\delta >0\), with the operator \(L\) and a Littlewood–Paley function \(\Phi \), see (33) in Sect. 9, define
where \(P_\mathcal{H _{\lambda _k}}\) is the projector onto the eigenspace \(\mathcal H _{\lambda _k}\). For any \(\delta >0\), the sum in the last display is finite, so that \(\Phi (\delta \sqrt{L}) w_0\) belongs to \(\mathbb{H }_t\). It follows directly from the definition of the considered Besov spaces, see (36) in Sect. 9, that, for \(w_0 \in {\fancyscript{F}}_s(\mathcal{M })\),
On the other hand, making use of the choice \(\delta ^s =: \varepsilon \),
Note that \(\Vert w_0 \Vert _2\le 1\) if we suppose that \(w_0\) is in the unit ball of \({\fancyscript{F}}_s(\mathcal{M })\) (since necessarily \(\Vert w_0\Vert _\mathbb B \) is bounded by 1 and, for the case of the sup-norm, since \(\mathcal{M }\) is compact with \(\mu \)-measure \(1\)). Hence,
Note that this is precisely the place where the regularity of the function plays a role.
Small ball probability \(S^t(\varepsilon )\). Let us now show, in successive steps, that the following upper bound holds on the small ball probability of the Gaussian process \(W^t\), viewed as a random element of \(\mathbb B \).
Proposition 2
Fix \(A>0\). There exists a universal constant \(\varepsilon _0>0\), and constants \(C_0,C_1>0\) which depend on \(d,A, \mathbb B \) only, such that, for any \(\varepsilon \le \varepsilon _0\) and any \(t\in [C_1\varepsilon ^{A},1]\),
The steps of the proof follow the method proposed in [33]. The starting point is a bound on the entropy of the unit ball \(\mathbb{H }_t^1\) of \(\mathbb{H }_t\) with respect to the sup-norm, which is a direct consequence of (18) and is summarised as follows:
There exists a universal constant \(\varepsilon _1>0\), and constants \(C_2,C_3>0\) which depend on \(d,A\) only, such that, for any \(\varepsilon \le \varepsilon _1\) and any \(t\in [C_2\varepsilon ^{A},1]\),
Step 1, crude bound. Let \(u_t\) be the mapping canonically associated with \(W^t\) considered in [18] and, as in that paper, set
By definition, the previous quantity is smaller than the solution of the following equation in \(\eta \), where we use the bound (21),
that is \(\eta = \exp \{-Cn^{\frac{2}{2+d}} t^{\frac{d}{2+d}} \}\). Thus
The first equation of [29], p. 300 can be written
We have, for any \(k\ge 1\) and any \(m\ge 1\),
where \(V_m:x\rightarrow x^m e^{-Cx^{\frac{1}{2+d}}}\) is uniformly bounded on \((0,+\infty )\) by a finite constant \(c_m\) (we omit the dependence in \(d\) in the notation). It follows that for any \(n\ge 1\),
We have obtained \(e_k(u_t) \le 32 c_m t^{-md} k^{-2m} \) for any \(k\ge 1\). Lemma 2.1 in [18], itself cited from [24], can be written as follows. If \(\ell _n(u_t)\) denotes the \(n\)-th approximation number of \(u_t\) as defined in [18, p. 1562],
From the bound on \(e_k(u^*_t) \) above one deduces, for some constant \(c_m^{\prime }\) depending only on \(m\), for any \(n\ge 1\),
Consider the definitions, for any \(\varepsilon >0\) and \(t>0\),
A sufficient condition for \(n_t(\varepsilon )\) to exist is \(4\sigma (W^t) \ge \varepsilon \), since \(\ell _n(u_t)\le \ell _1(u_t)=\sigma (W^t)\). So, provided \(\varepsilon \le 4\sigma (W^t)\), the bound on \(\ell _n\) implies \(n_t(\varepsilon )\le C_m (\varepsilon ^{-1} t^{-d})^{1/(2m-1)}\).
The following result makes Proposition 2.3 in [18] precise with respect to the constants involving the process under consideration. This is important in our context since we consider a collection of processes \(\{W^t\}\) indexed by \(t\) and need to keep track of the dependence on \(t\).
Proposition 3
Let \(X\) be centered Gaussian in a real separable Banach space \((E,\Vert \cdot \Vert )\). Define \(n(\varepsilon )\) and \(\sigma (X)\) as above. Then for a universal constant \(C_4>0\), any \(\varepsilon \le 1 \wedge (4\sigma (X))\),
Explicit upper and lower bounds for \(\sigma (W^t)\) are given in Sect. 10, see (50). In the ‘polynomial case’, see (2), these bounds imply, uniformly in the interval of \(t\)’s considered, that \(1 \lesssim \sigma (W^t)\lesssim \varepsilon ^{-B}\) for some \(B>0\).
Combining this fact with Proposition 3 and the previous bound on \(n_t\), we obtain that for some positive constants \(C_7,\varepsilon _3, \zeta \), for any \(\varepsilon \le \varepsilon _3\) and \(t\in [C_2\varepsilon ^{A},1]\)
Step 2, general link between entropy and small ball. According to Lemma 1 in [17], we have, if \(G\) is the distribution function of the standard Gaussian distribution (see their formula (3.19), or (3.2)),
Lemma 4.10 in [33] implies, for every \(x>0\),
Take \(\lambda =\sqrt{2 S^t(\varepsilon )}\) in the previous display. Then for values of \(t,\varepsilon \) such that (21) holds,
Finally, combine this with (22) to obtain the desired inequality (20), that is,
under the conditions \(\varepsilon \le \varepsilon _3\) and \(C_2\varepsilon ^{A}\le t\le 1\).
7 Proofs I: General conditions for posterior rates
A general theory to obtain convergence rates for posterior distributions for some distances is presented in [12, 13]. The object of interest is a function \(f_0\) (e.g. a regression function, a density function etc.). In some cases, for instance density estimation with Gaussian priors, one cannot directly put the prior on the density itself (a Gaussian prior does not lead to positive paths). This is why we parameterise the considered statistical problem with the help of a function \(w_0\) in some separable Banach space \((\mathbb B ,\Vert \cdot \Vert _\mathbb B )\) of functions defined over \((\mathcal{M },\rho )\). As already noticed in Sect. 4, in some cases (e.g. regression) \(w_0\) and \(f_0\) coincide, in others not (e.g. density estimation). As before, \(\mathbb B \) is either \(\mathcal C ^0(\mathcal{M })\) or \(\mathbb L ^2\).
In this section we check that there exist Borel measurable subsets \(B_n\) in \((\mathbb B ,\Vert \cdot \Vert _\mathbb B )\) such that, for some vanishing sequences \(\varepsilon _n\) and \(\bar{\varepsilon }_n\), some \(C>0\) and \(n\) large enough,
This will imply, as in [31], that the posterior concentrates at rate \(\varepsilon _n\) around \(f_0\), see Sect. 8. In [33], the authors also follow this approach. One advantage of the prior considered here is that, contrary to [33], the RKHS unit balls are precisely nested as the time parameter \(t\) varies, see (28). This leads to slightly simplified proofs.
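The nesting of the RKHS unit balls invoked here, see (28), admits a one-line check from the spectral form of the RKHS norm; writing \(c^l_k\) for the coefficients of \(h\) on the basis \((e^l_k)\) (our notation):

```latex
\|h\|_{\mathbb{H}_t}^2 \;=\; \sum_{k}\sum_{l} e^{\lambda_k t}\,|c^l_k|^2
\quad\Longrightarrow\quad
\|h\|_{\mathbb{H}_r} \le \|h\|_{\mathbb{H}_t}\ \text{ for } 0<r\le t,
\qquad\text{hence}\qquad
\mathbb{H}_t^1 \subset \mathbb{H}_r^1 .
```

Indeed \(e^{\lambda _k r}\le e^{\lambda _k t}\) termwise, so the \(\mathbb{H }_r\)-norm is the smaller one and the unit balls shrink as \(t\) grows.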
Prior mass For any fixed function \(w_0\) in \(\mathbb B \) and any \(\varepsilon >0\), by conditioning on the value taken by the random variable \(T\),
The following inequality links mass of Banach-space balls for Gaussian priors with their concentration function in \(\mathbb B \), see [32, Lemma 5.3],
for any \(w_0\) in the support of \(W^t\). We have seen above that any \(f_0\) in \({\fancyscript{F}}_s(\mathcal{M })\) belongs to the support of the prior. It is not hard to adapt the argument to check that, in fact, any \(f_0\) in \(\mathbb B \) can be approximated in \(\mathbb B \) by a sequence of elements of the RKHS \({\mathbb{H }}_t\), and thus belongs to the support in \(\mathbb B \) of the prior, by Lemma 5.1 in [32]. Then
for some \(t_\varepsilon ^*\) to be chosen.
The concentration function is bounded from above, assuming \(\varepsilon \le \varepsilon _3 \) and \(t\in [C_2\varepsilon ^A,1]\), by
Set \(t_\varepsilon ^*=\delta \varepsilon ^{\frac{2}{s}} \log \frac{1}{\varepsilon }\) with \(\delta \) small enough to be chosen. This is compatible with the above conditions provided \(A>2/s\). Then for \(\varepsilon \) small enough and any \(t\in [t_\varepsilon ^*,2t_\varepsilon ^*]\),
Set \(\delta =d/(4s)\). One obtains, for any \(t\in [t_\varepsilon ^*,2t_\varepsilon ^*]\),
Inserting this estimate in the previous bound on the prior mass, one gets, together with (14), for \(\varepsilon \) small enough and \(q\le 1+d/2\),
Condition (23) is satisfied for the choice
Sieve The idea is to build sieves using Borell’s inequality. Recall here that \(\mathbb{H }_r^1\) is the unit ball of the RKHS of the centered Gaussian process \(W^r\), viewed as a process on the Banach space \(\mathbb B \). The notation \(\mathbb B _1\) (as well as \(\mathbb{H }_r^{1}\)) stands for the unit ball of the associated space.
First, notice that from the explicit form of the RKHS of \(W^t\), we have
Let us set for \(M=M_n\), \(\varepsilon =\varepsilon _n\) and \(r>0\) to be chosen later,
Consider the case \(t\ge r\), then using (28)
where the last line follows from Borell’s inequality.
Choices of \(\varepsilon \), \(r\) and \(M\). Let us set \(\varepsilon =\varepsilon _n\) given by (27) and
First, one checks that \(r\) belongs to \([C_2\varepsilon ^A,1]\). This is clear from the definition since we have assumed \(A>2/s\). Then any \(t\in [r,1]\) also belongs to \([C_2\varepsilon ^A,1]\) so we can use the entropy bound and write
Now the bounds \(-\sqrt{2\log (1/u)}\le G^{-1}(u)\le -\frac{1}{2}\sqrt{\log (1/u)}\) valid for \(u\in (0,1/4)\) imply that
as soon as \(M\ge 4\sqrt{S^*_n}\) and \(e^{-S^*_n}<1/4\).
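The Gaussian quantile bounds just used, \(-\sqrt{2\log (1/u)}\le G^{-1}(u)\le -\frac{1}{2}\sqrt{\log (1/u)}\) for \(u\in (0,1/4)\), can be sanity-checked numerically with the standard library (`statistics.NormalDist` is Python's built-in normal distribution; the check itself is ours):

```python
from math import log, sqrt
from statistics import NormalDist

G_inv = NormalDist().inv_cdf  # quantile function G^{-1} of the standard Gaussian

def bounds_hold(u):
    """Check -sqrt(2 log(1/u)) <= G^{-1}(u) <= -(1/2) sqrt(log(1/u)) for u in (0, 1/4)."""
    q = G_inv(u)
    return -sqrt(2.0 * log(1.0 / u)) <= q <= -0.5 * sqrt(log(1.0 / u))
```

The lower bound comes from the standard tail bound \(G(-x)\le e^{-x^2/2}\); the upper one uses that the tail is not too thin on \((0,1/4)\).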
To check \(e^{-S^*_n}<1/4\), note that \(S_n^*\ge S^r(\varepsilon )\), which can be further bounded from below using Equation (3.1) in [17]; this leads to, for any \(\varepsilon ,\lambda >0\),
Here we have used the lower bound on the entropy, see (18). Then take \(\lambda =1\) to obtain \(S^*_n(\varepsilon )\ge \log (4)\) for \(\varepsilon \) small enough.
The first inequality \(M\ge 4\sqrt{S^*_n}\) is satisfied if
and this holds for the choices of \(r\) and \(M\) given by (30). Hence for large enough \(n\),
Then we can write, if \(q\ge 1+d/2\),
Entropy It is enough to bound from above
where we have used (21) to obtain the penultimate inequality.
8 Proofs I: Posterior concentration
Proof of Theorem 1 We check that (23), (24) and (25) are satisfied with \((\mathbb B ,\Vert \cdot \Vert _\mathbb B )=(\mathbb L ^2,\Vert \cdot \Vert _2)\). With the considered prior, it follows from Sect. 7 that, since \(q\le 1+d/2\), Condition (23) holds and, since \(q\ge 1+d/2\), Condition (24) holds. Also, (25) holds with the choice of \(B_n\) from Sect. 7, regardless of the value of \(q\). One can then apply the general rate result (Theorem 1 in [13]), with the distance \(d_n\) in (16) chosen to be the \(\mathbb L ^2\)-norm, see [13], Sect. 5. The end of the proof is as in [31], Theorems 3.1 and 3.4, and is omitted. \(\square \)
Proof of Theorem 2 We use a general approach to proving lower bounds for posterior measures, introduced in [5] (see [5, 6] for examples). The idea is to apply the following lemma (Lemma 1 in [13]) to the sets \(\{f\in \mathbb B ,\ \Vert f-f_0\Vert _\mathbb B \le \zeta _n\}\), for some rate \(\zeta _n\rightarrow 0\) and \(f_0\) in \(B_{2,\infty }^s\), with \(s>0\).
Lemma 1
Let \(\alpha _n\rightarrow 0\) such that \(n\alpha _n^2\rightarrow +\infty \) as \(n\rightarrow \infty \) and let \(B_n\) be a measurable set such that
where, in the white noise model, \(B_{KL}(f_0,\alpha _n)=\{f:\, \Vert f-f_0\Vert _2\le \alpha _n\}\). Then \(\mathbb{E }_{f_0}\Pi (B_n\ | \ X^{(n)}) \rightarrow 0\).
In our context this specialises as follows. Let \(\alpha _n\rightarrow 0\) and \(n\alpha _n^2\rightarrow +\infty \). Suppose that, as \(n\rightarrow +\infty \),
Then \(\zeta _n\) is a lower bound for the rate of the posterior in that, as \(n\rightarrow +\infty \),
We first deal with the case \(q\le 1+d/2\). In this case, choose \(\alpha _n=2\varepsilon _n\), where \(\varepsilon _n=(\log n/n)^{s/(2s+d)}\). In Sect. 7, we have established in (26) that, for the prior \(W^T\) with \(q\le 1+d/2\) in (14), there exists \(C>0\) with
So it is enough to show that, for some well-chosen \(\zeta _n\rightarrow 0\),
We would like to take \(\zeta _n=c\varepsilon _n\), for some (small) constant \(c>0\). In order to bound from above the previous probability, we write
We separate the above integral into two parts. The first one is \(\mathcal T _1:=\{\mu _n\le t\le B t^*_n\}\), where \(t_n^*\) is a cut-off similar to the one in the upper-bound proof, \(t_n^*=\zeta _n^{2/s}\log (1/\zeta _n)\). On \(\mathcal T _1\), one can bound \(\varphi ^t_{f_0}(\zeta _n)\) from below by its small ball probability part \(\varphi _0^t(\zeta _n)\). Moreover, thanks to relation (3.1) in [17], we have, for any \(\lambda >0\) and \(t\in (0,1]\),
Set \(\lambda =1\) and recall from Remark 2 that the lower bound on the entropy can be used for any \(t\) regardless of the value of \(\varepsilon \). This yields, for large enough \(n\), if \(\zeta _n=o(1)\),
Thus we obtain
This is less than \(e^{-(8+C)n\varepsilon _n^2}\) provided \(\zeta _n=\kappa \varepsilon _n\) and \(\kappa >0\) is small enough.
It remains to bound the integral from above on \(\mathcal T _2:=\{Bt_n^* \le t \le 1\}\). Here we bound \( \varphi ^t_{f_0}(\zeta _n)\) from below by its approximation part. For any \(t\in \mathcal T _2\),
We prove in Sect. 9, see Theorem 4, that there exist constants \(c,\; C\) and \(f_0\) in \(B_{2,\infty }^s(\mathcal{M })\) such that
Now, under (32) for the previous fixed function \(f_0\), taking \(\zeta _n=\kappa \varepsilon _n\) for small (but fixed) enough \(\kappa \), it holds, when \(t\) belongs to \(\mathcal T _2\),
For \(\kappa \) small enough, this is larger than any given power of \(\varepsilon _n\). In particular, it is larger than \((8+C)n\varepsilon _n^2\) if the (upper-bound) rate \(\varepsilon _n\) is no more than polynomial in \(n\), which is the case here since \(\varepsilon _n=(\log n/n)^{s/(2s+d)}\). We have verified that (31) is satisfied, which gives the desired lower-bound result when \(q\le 1+d/2\), using Lemma 1.
In the case \(q> 1+d/2\), the proof is similar, except that the logarithmic factor in (26) now has exponent \(q-d/2\), due to the assumption on the prior density \(g\), and that \(\varepsilon _n\) is replaced by \(\widetilde{\varepsilon }_n=(\log {n})^{q-1-\frac{d}{2}}\varepsilon _n\). \(\square \)
9 Proofs II: Besov spaces and needlets
In this section we start by introducing some standard notation useful in the context of Besov spaces, mainly the concepts of Littlewood–Paley function and decomposition into low-frequency subspaces. Let \(L\) be the operator whose properties are listed in Sect. 2.4.
A Littlewood–Paley function is any even function \(\Phi \) in \(\mathcal D (\mathbb R )\) with
Given a Littlewood–Paley function, let us also define
From this it follows that \( 0\le \Psi (x)\le 1\), that the support of \(\Psi \) is included in \(\{ \frac{1}{2} \le |x|\le 2\}\) and that
For any even \(\varTheta \) in \(\mathcal D (\mathbb R )\) (in the sequel we apply (34) below for \(\varTheta =\Phi \), \(\varTheta =\Psi \) or rescaled versions of them) and \(0<\delta \le 1\), define the kernel operator \(\varTheta (\delta \sqrt{L})\), using the spectral decomposition of \(L\), by setting
So any square-integrable function \(f\) on \(\mathcal{M }\) can be expanded \(f = \Phi (\delta \sqrt{L})f + \sum _{j\ge 0} \Psi (2^{-j}\delta \sqrt{L})f, \) where
Moreover, if we define the ‘low frequency’ functions using the eigenspaces \(\mathcal H _{\lambda _k}\) of the operator \(L\) by
for any \(t>0\), we have from the definitions of \(\Phi \) and \(\Psi \) that
Also recall that an \(\varepsilon \)-net \(\Lambda \subset \mathcal{M }\) is a set such that \(\xi \ne \xi ^{\prime }\), \(\xi , \xi ^{\prime } \in \Lambda \), implies \(\rho (\xi , \xi ^{\prime }) >\varepsilon \). A maximal \(\varepsilon \)-net \(\Lambda \) is an \(\varepsilon \)-net such that, for all \(x \in \mathcal{M }\setminus \Lambda \), \(\Lambda \cup \{x\}\) is no longer an \(\varepsilon \)-net.
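The Littlewood–Paley identity underlying the decomposition above, \(\Phi (x)+\sum _{j\ge 0}\Psi (2^{-j}x)=1\), can be checked numerically: assuming the standard choice \(\Psi (x)=\Phi (x/2)-\Phi (x)\) (consistent with the stated support \(\{1/2\le |x|\le 2\}\)), the partial sums telescope to \(\Phi (2^{-J-1}x)\rightarrow \Phi (0)=1\). A sketch with one concrete smooth \(\Phi \), equal to 1 on \([-1/2,1/2]\) and supported in \([-1,1]\); this specific construction of \(\Phi \) is our own:

```python
import math

def _h(u):
    # C^infinity cutoff helper: vanishes for u <= 0
    return math.exp(-1.0 / u) if u > 0 else 0.0

def Phi(x):
    """Littlewood-Paley function: 1 on [-1/2,1/2], 0 outside [-1,1], smooth between."""
    a = _h(1.0 - abs(x))
    b = _h(abs(x) - 0.5)
    return a / (a + b)  # a + b > 0 for every x, so this is well defined

def Psi(x):
    # supported in {1/2 <= |x| <= 2}
    return Phi(x / 2.0) - Phi(x)

def partition_sum(x, J=40):
    """Phi(x) + sum_{j=0}^{J} Psi(2^{-j} x); telescopes to Phi(2^{-J-1} x)."""
    return Phi(x) + sum(Psi(x / 2.0**j) for j in range(J + 1))
```

For any fixed \(x\), only the dyadic scales with \(1/2\le |2^{-j}x|\le 2\) contribute, which is the locality that makes the Besov definitions below workable.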
9.1 Definition of Besov spaces
We follow [9] to introduce the Besov spaces \(B^s_{pq}\) in this setting with \(s>0\), \(1\le p \le \infty \) and \(0<q \le \infty \). To do so, let us choose any Littlewood–Paley function \(\Phi \) as in (33) and let \(\Phi _j(\lambda ):= \Phi (2^{-j}\lambda )\) for \(j\ge 1\). Again, \(L\) is the operator from Sect. 2.4.
Definition 1
Let \(s>0\), \(1\le p\le \infty \), and \(0<q \le \infty \). The Besov space \(B_{pq}^{s}=B_{pq}^{s}(\mathcal{M })=B_{pq}^{s}(\mathcal{M },L)\) is defined as the set of all \(f \in \mathbb L ^p(\mathcal{M },\mu )\) such that
Here the \(\ell ^q\)-norm is replaced by the sup-norm if \(q=\infty \).
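Before the equivalent approximation-based description (Remark 3 below), here is a toy numeric illustration on the circle, where \(\Sigma _t\) is realized by trigonometric polynomials of degree at most \(t\): a function with Fourier coefficients \(c_k=k^{-(s+1/2)}\) (an illustrative sequence of ours) has best \(\mathbb L ^2\)-approximation error of order \(t^{-s}\), hence lies in \(B^s_{2,\infty }\):

```python
import math

def approx_error_l2(T, s, K=10**6):
    """E_T(f)_2 for f with Fourier coefficients c_k = k^{-(s+1/2)}:
    the L2 distance to the best degree-T trigonometric approximation is
    the coefficient tail, sqrt(sum_{k>T} k^{-(2s+1)}) ~ T^{-s} (truncated at K)."""
    return math.sqrt(sum(float(k) ** (-(2.0 * s + 1.0)) for k in range(T + 1, K)))
```

Doubling the degree \(T\) divides the error by roughly \(2^{s}\), which is exactly the dyadic decay that the \(B^s_{2,\infty }\)-norm controls.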
Remark 3
One can prove, as a consequence of the Gaussian estimate (10), see [9], that this definition is independent of the choice of \(\Phi \) and that the Besov spaces can also be introduced via the following approximation properties: If \(\mathbb{E }_t(f)_p\) denotes the best approximation of \(f \in \mathbb L ^p\) from \(\Sigma _t\), see (35), i.e.
(where here \(\mathbb L ^\infty \) is identified with the space \(\mathrm{UCB}\) of all uniformly continuous and bounded functions on \(\mathcal{M }\)), then it is proved in [9] that
9.2 Smooth functional calculus and ‘sampling-father-wavelets’
In addition to the orthogonal analysis provided by the projectors \(P_\mathcal{H _{\lambda _k}}\) onto eigenspaces of the operator \(L\), one can build, following [9], a wavelet-type analysis on \(\mathcal{M }\) associated to \(L\). The properties of the operator \(L\) given in Sect. 2.4 have the following important consequences, see [9],
Localisation [9, Section 3] For any even \(\varTheta \) in \(\mathcal D (\mathbb R )\), there exists a constant \(C( \varTheta )\) such that
From (37) one can easily deduce the symmetrical bound \(| \varTheta (\delta \sqrt{L})(x,y) |\le \frac{1}{\sqrt{|B(x,\delta )| |B(y,\delta )| }} \frac{ C( \varTheta )}{(1+ \frac{ \rho (x, y)}{\delta })^{D+1}}.\)
Father wavelet One can deduce from [9] (Lemmas 5.2 and 5.4) that there exist structural constants \(0<C_0 <\infty \) and \(\gamma >0\) such that for any \(0<\delta \le 1\) and any maximal \(\gamma \delta \)-net \(\Lambda _{\gamma \delta }\), there exists a family of functions \((D^\delta _\xi )_{\xi \in \Lambda _{\gamma \delta }}\) such that
we have the following wavelet-type representation:
Formulae (39) and (40) show that the functions \(|B(\xi , \delta )|D^\delta _\xi \) behave like father wavelets, with coefficients obtained directly by sampling. We will see in Sect. 10 that these functions play an important role, for instance in bounding the entropy of various function spaces.
10 Proofs II: Entropy properties
10.1 Covering number, entropy, \(\varepsilon \)-net
Let \(\Lambda \) be a maximal \(\varepsilon \)-net over a metric space \((X,\rho )\). We have:
Hence, for \(\Lambda _\varepsilon \) a maximal \(\varepsilon -\)net it holds
Now if \((X, \rho )\) is a doubling metric space, we have the following property: if \(x_1, \ldots ,x_N \in B(x,r)\) are such that \( \rho (x_i,x_j) > r 2^{-l}\) (\(l \in \mathbb N \)), then clearly \(B(x,r) \subset B(x_i, 2r)=B(x_i, 2^{l+2} (r 2^{-l-1}))\), and the balls \(B(x_i , r 2^{-l-1})\) are disjoint and contained in \(B(x, 2r)\). So:
If \(\Lambda _{r2^{-l}}\) is any \(r2^{-l}\)-net, then \(Card(\Lambda _{r2^{-l}}) \le 2^{(l+3)d}N(X,r).\)
So if \(\Lambda _{\varepsilon }\) is any maximal \(\varepsilon \)-net and, for \(l\in \mathbb N \), \(\Lambda _{2^l\varepsilon }\) is any maximal \(2^l \varepsilon \)-net, then:
For \(l=0\)
So for any \(\varepsilon >0\) and any maximal \(\varepsilon \)-net \(\Lambda _{\varepsilon }\), \(Card (\Lambda _{\varepsilon })\) and \(N(X, \varepsilon )\) are of the same order.
Moreover, clearly, taking \(r=1\) in (41), so that \(B(x,1)=\mathcal{M }\), we get:
10.2 Dimension of spectral spaces, covering number, and trace of \(P_t\)
Let us now use the assumptions on the heat kernel. The following proposition gives the link between the covering number \( N(\delta , \mathcal{M })\) of the underlying space \(\mathcal{M }\), the behaviour of the trace of \(e^{-tL}\), and the dimension of the spectral spaces. Let us recall:
Denote by \(P_{\Sigma _\lambda }\) the orthogonal projector onto this space and also, with a slight abuse of notation, the associated kernel
Then one can prove the following bounds (see [9], Lemma 3.19): For any \(\lambda \ge 1, \) and \(\delta = \frac{1}{\lambda },\)
Let us recall that \(Tr(e^{-tL})\!=\!\sum _k e^{-\lambda _kt} dim(\mathcal H _ {\lambda _k})\). In addition we have \(\int \nolimits _{\mathcal{M }}\!P_{t}(x, x) d\mu (x) = Tr(e^{-tL})\). Moreover, as
we have, if \(\Vert \; \Vert _{HS}\) stands for the Hilbert-Schmidt norm,
Proposition 4
-
1.
For \( \lambda \ge 1, \quad \delta = \frac{1}{\lambda }, \quad \)
$$\begin{aligned} C^{\prime }_2 \int \limits _{\mathcal{M }} \frac{1}{|B(x, \delta )|} d\mu (x)\! \le \! \dim (\Sigma _\lambda ) \!=\! \int \limits _{\mathcal{M }} P_{\Sigma _\lambda }(x,x) d\mu (x) \!\le \! C_2 \int \limits _{\mathcal{M }} \frac{1}{|B(x, \delta )|} d\mu (x)\nonumber \\ \end{aligned}$$(45) -
2.
$$\begin{aligned} 2^{-2D} N(\delta , \mathcal{M }) \le 2^{-2D} card(\Lambda _\delta )&\le \int \limits _{\mathcal{M }} \frac{1}{|B(x, \delta )|} d\mu (x) \le 2^D card(\Lambda _\delta )\nonumber \\&\le 2^{4D} N(\delta , \mathcal{M }) \end{aligned}$$(46)
where \(\Lambda _\delta \) is any maximal \(\delta \)-net.
-
3.
$$\begin{aligned} C^{\prime }_1 \int \limits _{\mathcal{M }} \frac{1}{|B(x,\sqrt{t})|} d\mu (x) \le Tr(e^{-tL}) \le C_1 \int \limits _{\mathcal{M }} \frac{1}{|B(x,\sqrt{t})|} d\mu (x) \end{aligned}$$
Proof of the Proposition Point 1 is a consequence of (44), while point 3 follows from (12). Let us now prove point 2. Let \(\Lambda _\delta \) be any maximal \(\delta \)-net.
But:
and in the same way:
This implies:
\(\square \)
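Point 3 can be checked in the simplest example: on the circle (\(d=1\), \(L=-d^2/d\theta ^2\), \(\lambda _k=k^2\)) one has \(Tr(e^{-tL})=\sum _{k\in \mathbb Z }e^{-k^2t}\), which by Poisson summation equals \(\sqrt{\pi /t}\,(1+o(1))\) as \(t\rightarrow 0\), matching \(\int d\mu /|B(x,\sqrt{t})|\asymp t^{-1/2}\). A quick numeric sketch (the truncation level \(K\) is our own choice):

```python
import math

def trace_heat_circle(t, K=2000):
    """Tr(e^{-tL}) on the circle: sum over k in Z of e^{-k^2 t}, truncated at |k| <= K."""
    return 1.0 + 2.0 * sum(math.exp(-k * k * t) for k in range(1, K + 1))
```

For small \(t\) the truncated sum agrees with \(\sqrt{\pi /t}\) to machine precision, the \(t^{-d/2}\) blow-up with \(d=1\).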
The preceding results can be summarised in the following corollary:
Corollary 1
10.3 Connection between the covering number of \(\mathcal{M }\) and the entropy of \(\mathbb{H }^1_t\)
In this section we establish the link between the covering number \(N(\varepsilon , \mathcal{M }) \) of the space \(\mathcal{M }\), and \(H(\varepsilon , \mathbb{H }^1_t, \mathbb L ^p)\) for \(p=2,\; \infty \) stated in Theorem 3.
Notice that, using the previous section, one can replace \(N(\delta (t,\varepsilon ), \mathcal{M })\) at any place by \(card(\Lambda _{\delta (t,\varepsilon ) })\), where \(\Lambda _{\delta (t,\varepsilon ) }\) is a maximal \(\delta (t,\varepsilon )\)-net. Also, since \(\mu (\mathcal{M })=1\), we have
So the proof will be done in two steps:
1. We prove the lower bound for \(H(\varepsilon , \mathbb{H }^1_t, \mathbb L ^2)\) in the next subsection, using Carl's inequality.
2. We then prove the upper bound for \(H(\varepsilon , \mathbb{H }^1_t, \mathbb L ^\infty )\).
10.3.1 Proof of Theorem 3: Lower estimates for \(H(\varepsilon , \mathbb{H }^1_t, \mathbb L ^2)\)
Let us recall some classical facts; see [3, 4].
For any subset \(X\) of a metric space, we define, for any \(k \in \mathbb N \),
Clearly
Now, for the special case of a compact positive selfadjoint operator \(T : \mathbb{H }\rightarrow \mathbb{H }\), we have the following inequality of Carl (see [3]) relating \(e_k(T(B))\), where \(B\) is the unit ball of \(\mathbb{H }\), to the eigenvalues \(\mu _1 \ge \mu _2 \ge \cdots \ge 0\) (repeated according to their multiplicity) of \(T\):
In our case, take \(T= P_{t/2}\), \(\mu _i =e^{-(t/2) \lambda _i }\), \(T(B)= \mathbb{H }_t^1\). Let us fix:
Carl’s inequality gives:
So
but by Corollary 1, it holds \( dim ( \Sigma _\lambda ) \sim N( \delta , \mathcal{M })\), with \(\delta = \frac{1}{\lambda }.\) So,
10.3.2 Proof of Theorem 3: Upper estimate for \(H(\varepsilon ,\mathbb{H }^1_t,\mathbb L ^\infty )\)
Recall the notation introduced in Sect. 9 (especially Sect. 9.2). Let us suppose \( \varepsilon ^\nu \le at\), with \(\nu > 0\), \(a>0\).
First, we prove that for all \(\varepsilon >0\) small enough, there exists \(\delta \;(\sim \delta (t, \varepsilon ):=\sqrt{t/\log \frac{1}{\varepsilon }}\,)\) such that
In a second step, we use (39) to expand on the \(|B(\xi , \delta )| D^\delta _\xi \)’s:
In a third step, we use a family of points of \(\Sigma _{1/\delta }\) as centers of balls of radius \(\varepsilon /2\) covering \(\Phi (\delta \sqrt{L}) ( \mathbb{H }^1_t)\), so that the balls of radius \(\varepsilon \) centered at these points form an \(\varepsilon \)-covering of \(\mathbb{H }^1_t\) in the \(\mathbb L ^\infty \) norm.
The next lemma gives bounds on \( \Vert \Phi (\delta \sqrt{L})f- f \Vert _\infty \) and \( \Vert \Phi (\delta \sqrt{L})f \Vert _\infty \), uniformly over \(f \in \mathbb{H }^1_t\).
Lemma 2
For all \(f \in \mathbb{H }^1_t\):
1.
$$\begin{aligned} \Vert \Phi (\delta \sqrt{L}) f \Vert _\infty \lesssim \frac{1}{t^{D/4} } \end{aligned}$$
2.
$$\begin{aligned} \Vert \Psi (\delta \sqrt{L}) f \Vert _\infty \lesssim e^{- \frac{t}{8\delta ^2}}\frac{1}{\delta ^{D/2}} \end{aligned}$$
3.
$$\begin{aligned} \Vert \Phi (\delta \sqrt{L})f- f \Vert _\infty \le \sum _{j\ge 0} \Vert \Psi ( 2^{-j}\delta \sqrt{L}) f \Vert _\infty \lesssim \frac{1}{\delta ^{D/2}} e^{-\frac{A}{4}} A^{-1}, \quad A= \frac{ t}{8\delta ^2} \end{aligned}$$
Proof of the Lemma. First, \( f \in \mathbb{H }^1_t\), so \( f=\sum _k \sum _l a_k^l e^l_k(\cdot )e^{-\lambda _kt/2}\) with \(\sum _k \sum _l | a_k^l |^2 \le 1\). As \(\Phi (\delta \sqrt{L})(x,y) =\sum _k \Phi (\delta \sqrt{\lambda _k}) P_k(x,y)\),
using (12), (37) and the lower bound \(|B(x,r)| \ge (r/2)^D\) obtained in Sect. 2.1. In the same way,
So, we have
Put \(A= \frac{ t}{8\delta ^2}\); as:
as
Conclude that
\(\square \)
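The geometric-series bound behind part 3 of the lemma can be sanity-checked numerically. The sketch below checks that \(\sum _{j\ge 0} 2^{jD/2} e^{-A4^j}\), which dominates \(\delta ^{D/2}\sum _j \Vert \Psi (2^{-j}\delta \sqrt{L})f\Vert _\infty \) up to constants, is indeed controlled by \(e^{-A/4}A^{-1}\); the constant 2 and the tested ranges of \(A\) and \(D\) are illustrative (the bound is meaningful for \(A\) bounded away from 0):

```python
import math

def psi_tail(A, D, jmax=60):
    # Partial sum of sum_{j>=0} 2^{jD/2} exp(-A 4^j); for large j the terms
    # underflow harmlessly to 0.0, so jmax=60 captures the full series.
    return sum(2.0 ** (j * D / 2) * math.exp(-A * 4.0 ** j) for j in range(jmax))

# Check the claimed domination by exp(-A/4)/A, up to a modest constant.
for A in (2.0, 5.0, 10.0, 20.0):
    for D in (1, 2, 3):
        assert psi_tail(A, D) <= 2 * math.exp(-A / 4) / A
```

For \(A\) large the series is dominated by its first term \(e^{-A}\), which is far below \(e^{-A/4}A^{-1}\), consistent with the lemma.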
First step. Fix \(\delta \) such that \(\Vert f- \Phi ( \delta \sqrt{L})f \Vert _\infty < \frac{\varepsilon }{2}\).
Using the previous lemma, we need to choose \(\delta \) so that
Let us take
Then, as \( \varepsilon ^\nu \le at,\)
if \(\alpha \) is suitably chosen. So for \(\frac{1}{\delta }\sim \sqrt{\frac{1}{t} \log \frac{1}{\varepsilon }}\),
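The computation behind this choice of \(\delta \) can be made explicit in a schematic form (the constants are illustrative; the hypothesis \(\varepsilon ^\nu \le at\) is what keeps \(\delta ^{-D/2}\) polynomial in \(1/\varepsilon \)):

```latex
\frac{1}{\delta} = \alpha \sqrt{\tfrac{1}{t}\log\tfrac{1}{\varepsilon}}
\;\Longrightarrow\;
A = \frac{t}{8\delta^2} = \frac{\alpha^2}{8}\log\frac{1}{\varepsilon},
\qquad
e^{-A/4} = \varepsilon^{\alpha^2/32},
```

while \(\delta ^{-D/2} \lesssim (a\,\varepsilon ^{-\nu }\log (1/\varepsilon ))^{D/4}\); hence taking, say, \(\alpha ^2 > 32(1+\nu D/4)\) forces \(\delta ^{-D/2} e^{-A/4} A^{-1} \le \varepsilon /2\) for \(\varepsilon \) small enough.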
Second step. \(\varepsilon \)-covering of \(\mathbb{H }_t^1\).
Now if \(f \in \mathbb{H }_t^1,\) using Lemma 2, \( \Vert \Phi (\delta \sqrt{L})f\Vert _\infty \lesssim t^{-D/4} \). Moreover \(\Phi (\delta \sqrt{L})f \in \Sigma _{1/\delta },\) so, using (39)
Let us consider the following family:
Certainly for all \( f \in \mathbb{H }^1_t, \) there exists \(( k_\xi )\) in the previous family such that
As \( \Vert \Phi (\delta \sqrt{L})f -f \Vert _\infty \le \frac{\varepsilon }{2}\), one can cover \(\mathbb{H }^1_t\) by balls of radius \(\varepsilon \) centered at the \( f_{(k_\cdot )}\).
The cardinality of this family of balls is \((2K+1)^{\mathrm{card}(\Lambda _{\gamma \delta })}\). As \(\gamma \) is a structural constant, \(\varepsilon ^\nu \le at\) and \(\delta \sim \delta (t,\varepsilon )\), clearly
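With all constants suppressed, the resulting entropy count reads as follows (a sketch of the bookkeeping, using \(\mathrm{card}(\Lambda _{\gamma \delta }) \sim N(\gamma \delta ,\mathcal{M}) \sim \delta ^{-D}\) and \(\log (2K+1) \lesssim \log (1/\varepsilon )\), the latter because \(K\) is polynomial in \(1/\varepsilon \) when \(\varepsilon ^\nu \le at\)):

```latex
H(\varepsilon, \mathbb{H}^1_t, \mathbb{L}^\infty)
\;\le\; \mathrm{card}(\Lambda_{\gamma\delta})\,\log(2K+1)
\;\lesssim\; \delta^{-D}\,\log\frac{1}{\varepsilon}
\;\sim\; \Big(\frac{1}{t}\Big)^{D/2}\Big(\log\frac{1}{\varepsilon}\Big)^{1+\frac{D}{2}}.
```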
\(\square \)
10.4 Bounds for \(\mathbb E (\Vert W^t \Vert _\mathbb B ^2)\)
In the two remaining subsections, we prove some useful results used in Sect. 6 and in the proof of Theorem 2, respectively. The next proposition controls the expectation of the squared sup-norm of \(W^t\). Similar, and in fact slightly more precise, bounds in the \(\mathbb L ^2\) norm are obtained along the way.
Proposition 5
There exist universal constants \(C_1\) and \(C_2\) such that
We recall that \(W^t\) can be written
where \(X_k^l\) is a family of independent \(N(0, 1)\) Gaussian variables. Clearly, since \(\mathcal{M }\) is assumed to have measure 1,
As \(\Vert W^t \Vert _2^2 = \sum _k e^{-\lambda _kt}\sum _{1\le l \le \dim \mathcal H _ {\lambda _k}} (X_k^l)^2 \), we get
Hence using Proposition 4, one obtains
Now, let us first observe, using again Proposition 4, that
On the other hand, using the Cauchy–Schwarz inequality,
So
Hence, we get
In addition, \( N(\sqrt{t}, \mathcal{M }) \sim \int \nolimits _{\mathcal{M }} \frac{1}{|B(x, \sqrt{t})|} d\mu (x) \ll \sup _{x \in \mathcal{M }} \frac{1}{|B(x, \sqrt{t})|}.\)
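Assembling the displays of this subsection gives the following schematic summary (constants suppressed; the sup-norm case of Proposition 5 uses, in addition, the Cauchy–Schwarz step above):

```latex
\mathbb{E}\Vert W^t \Vert_2^2
= \sum_k e^{-\lambda_k t}\,\dim\mathcal{H}_{\lambda_k}
\;\lesssim\; N(\sqrt{t}, \mathcal{M})
\;\lesssim\; \sup_{x\in\mathcal{M}} \frac{1}{|B(x,\sqrt{t})|}.
```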
10.5 Lower bound for \(A^t_f(\varepsilon )\)
Theorem 4
For \(s >0\) fixed, there exist a function \(f\) in the unit ball of the Besov space \(B^s_{2,\infty }(\mathcal{M })\) with \(\Vert f \Vert _2^2 =1\), and constants \(c>0\), \(C>0\), such that:
Let us take \(f\) such that
We are interested in:
Let us put
We have,
Necessarily \(\mu \ne 0\); otherwise \(g_0=0\) and \(\Phi (g_0)= \Vert f\Vert ^2_2 \gg \varepsilon ^2\). Let us put \(\lambda = \frac{1}{\mu }.\) We necessarily have \( \lambda g_0 = P_{t/2} f - P_t(g_0) \), hence \(( \lambda + P_{t}) (g_0) = P_{t/2} f\), so
Let us now write the constraint:
Clearly:
is increasing from \(0\) to \( \Vert f \Vert ^2_2.\) Similarly,
is decreasing.
On the other hand, if \(L = \int \limits x\,dE_x\), and
and
Let us recall the following result from [9, Lemma 3.19].
Theorem 5
There exist \( b >1\), \(C^{\prime \prime }_1 >0\), \(C^{\prime \prime }_2 >0\) such that for all \(\lambda \ge 1\), with \(\delta = \frac{1}{\lambda }\),
and more precisely:
As \( P_{\Sigma _{\sqrt{a}}}= E_a\), one can build a function \(f \in \mathbb L ^2\) such that:
for \( a= b^{2j}\!,\) and \( j \in \mathbb N \). It is enough to have:
and this could be done by the previous theorem.
For \(\varepsilon >0\), let us choose \(j\) such that \(b^{-2js} \ge 4\varepsilon ^2\ge b^{-2(j+1)s}\). So
so, if \(\lambda = e^{-ta}, \; a= b^{2j}\),
But
\(\square \)
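The exponent bookkeeping in this proof can be summarized schematically (a sketch; the constants \(c\), \(C\) are illustrative):

```latex
b^{-2js} \asymp \varepsilon^2
\;\Longrightarrow\;
a = b^{2j} \asymp \varepsilon^{-2/s},
\qquad
\lambda = e^{-ta} = \exp\!\big(-c\, t\, \varepsilon^{-2/s}\big),
```

so the constraint \(\Vert g_0 - f \Vert _2 \le \varepsilon \) forces a reproducing-kernel norm of order \(1/\lambda \), i.e. a bound of the shape \(A^t_f(\varepsilon ) \ge C\exp (c\, t\, \varepsilon ^{-2/s})\), which is the content of Theorem 4.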
References
Angers, J.-F., Kim, P.T.: Multivariate Bayesian function estimation. Ann. Stat. 33(6), 2967–2999 (2005)
Bhattacharya, A., Dunson, D.B.: Nonparametric Bayesian density estimation on manifolds with applications to planar shapes. Biometrika 97(4), 851–865 (2010)
Carl, B.: Entropy numbers, \(s\)-numbers, and eigenvalue problems. J. Funct. Anal. 41(3), 290–306 (1981)
Carl, B., Stephani, I.: Entropy, Compactness and the Approximation of Operators. Cambridge Tracts in Mathematics, vol. 98. Cambridge University Press, Cambridge (1990)
Castillo, I.: Lower bounds for posterior rates with Gaussian process priors. Electr. J. Stat. 2, 1281–1299 (2008)
Castillo, I., van der Vaart, A.: Needles and straw in a haystack: posterior contraction for possibly sparse sequences. Ann. Stat. 40(4), 2069–2101 (2012)
Chavel, I.: Eigenvalues in Riemannian Geometry. Pure and Applied Mathematics, vol. 115. Academic Press Inc., Orlando (1984); Including a chapter by Burton Randol, With an appendix by Jozef Dodziuk
Coifman, R.R., Maggioni, M.: Diffusion wavelets. Appl. Comput. Harm. Anal. 21(1), 53–94 (2006)
Coulhon, T., Kerkyacharian, G., Petrushev, P.: Heat kernel generated frames in the setting of Dirichlet spaces. J. Fourier Anal. Appl. 18, 995–1066 (2012)
Davies, E.B.: One-parameter semigroups. London Mathematical Society Monographs, vol. 15. Academic Press Inc. (Harcourt Brace Jovanovich Publishers), London (1980)
Efromovich, S.: On sharp adaptive estimation of multivariate curves. Math. Methods Stat. 9(2), 117–139 (2000)
Ghosal, S., Ghosh, J.K., van der Vaart, A.W.: Convergence rates of posterior distributions. Ann. Stat. 28(2), 500–531 (2000)
Ghosal, S., van der Vaart, A.W.: Convergence rates of posterior distributions for noniid observations. Ann. Stat. 35(1), 192–223 (2007)
Grigor’yan, A.: Heat kernel and analysis on manifolds. AMS/IP Studies in Advanced Mathematics, vol. 47. American Mathematical Society, Providence (2009)
Gromov, M.: Metric structures for Riemannian and non-Riemannian spaces. Progress in Mathematics, vol. 152. Birkhäuser Boston Inc., Boston (1999)
Heinonen, J.: Lectures on Analysis on Metric Spaces. Universitext. Springer, New York (2001)
Kuelbs, J., Li, W.V.: Metric entropy and the small ball problem for Gaussian measures. J. Funct. Anal. 116(1), 133–157 (1993)
Li, W.V., Linde, W.: Approximation, metric entropy and small ball estimates for Gaussian measures. Ann. Probab. 27(3), 1556–1578 (1999)
Lifshits, M.: Lectures on Gaussian processes. SpringerBriefs in Mathematics. Springer, Berlin (2012)
Mardia, K.V., Jupp, P.E.: Directional statistics. Wiley Series in Probability and Statistics. Wiley, Chichester (2000); Revised reprint of Statistics of directional data by Mardia [MR0336854 (49 #1627)]
Nadler, B., Lafon, S., Coifman, R.R., Kevrekidis, I.G.: Diffusion maps, spectral clustering and reaction coordinates of dynamical systems. Appl. Comput. Harm. Anal. 21(1), 113–127 (2006)
Ouhabaz, E.M.: Analysis of heat equations on domains. London Mathematical Society Monographs Series, vol. 31. Princeton University Press, Princeton (2005)
Petrushev, P., Xu, Y.: Localized polynomial frames on the interval with Jacobi weights. J. Fourier Anal. Appl. 11(5), 557–575 (2005)
Pisier, G.: The volume of convex bodies and Banach space geometry. Cambridge Tracts in Mathematics, vol. 94. Cambridge University Press, Cambridge (1989)
Rivoirard, V., Rousseau, J.: Posterior concentration rates for infinite dimensional exponential families. Bayesian Anal. 7(2), 1–24 (2012)
Saloff-Coste, L.: Aspects of Sobolev-type inequalities. London Mathematical Society Lecture Note Series, vol. 289. Cambridge University Press, Cambridge (2002)
Shen, X., Wasserman, L.: Rates of convergence of posterior distributions. Ann. Stat. 29(3), 687–714 (2001)
Stein, E.M., Weiss, G.: Introduction to Fourier analysis on Euclidean spaces. Princeton Mathematical Series, vol. 32. Princeton University Press, Princeton (1971)
Tomczak-Jaegermann, N.: Dualité des nombres d’entropie pour des opérateurs à valeurs dans un espace de Hilbert. C. R. Acad. Sci. Paris Sér. I Math. 305(7):299–301 (1987)
van der Vaart, A., van Zanten, H.: Information rates of nonparametric Gaussian process methods. J. Mach. Learn. Res. 12, 2095–2119 (2011)
van der Vaart, A.W., van Zanten, H.: Rates of contraction of posterior distributions based on Gaussian process priors. Ann. Stat. 36(3), 1435–1463 (2008)
van der Vaart, A.W., van Zanten, H.: Reproducing kernel Hilbert spaces of Gaussian priors. IMS Collect. 3, 200–222 (2008)
van der Vaart, A.W., van Zanten, J.H.: Adaptive Bayesian estimation using a Gaussian random field with inverse gamma bandwidth. Ann. Stat. 37(5B):2655–2675 (2009)
Acknowledgments
The authors would like to thank Richard Nickl, Aad van der Vaart and Harry van Zanten for insightful comments on this work.
Additional information
I. Castillo’s work is partly supported by ANR Grant ‘Banhdits’ ANR-2010-BLAN-0113-03. G. Kerkyacharian and D. Picard’s work is partly supported by ANR Grant ‘Parcimonie’ ANR-2009-BLAN-0128-01.
Appendix: Compact Riemannian manifold
We investigate now the case described in Sect. 3 where \(\mathcal{M }\) is a compact Riemannian manifold of dimension \(d\) without boundary. Our aim here is to prove Ahlfors’ condition (2) for this special case.
Proposition 6
Let \(\mathcal{M }\) be a compact Riemannian manifold of dimension \(d\) without boundary. Then there exist \(0<c\le C <\infty \) such that,
Proof
Let \(\mu \) and \(\rho \) be the (non-normalised) Riemannian measure and metric on \( \mathcal{M }\). The proposition is a consequence of the Bishop–Gromov comparison theorem; see [15] and [7].
As \(\mathcal{M }\) is compact, clearly
where \(Ricc\) is the Ricci tensor and \(g\) is the metric tensor. Let \(V_\kappa (r)\) be the volume of any ball of radius \(r\) in the model space of dimension \(d\) and constant sectional curvature \(\kappa \). Let \(V_d\) be the volume of the unit ball of \(\mathbb R ^d\).
1.
For \(\kappa >0\), the model space is the sphere \(\frac{1}{\sqrt{\kappa }} \mathbb{S }_d\) of \(\mathbb R ^{d+1}\) of radius \(\frac{1}{\sqrt{\kappa }}\) and
$$\begin{aligned} V_\kappa (r)= dV_d \int \limits _0^r \left( \frac{ \sin \sqrt{\kappa } t}{ \sqrt{\kappa } }\right) ^{d-1} dt; \;\text{ so}\quad \left( \frac{2}{\pi }\right) ^{d-1} V_dr^d \le V_\kappa (r) \le V_dr^d \end{aligned}$$
2.
For \(\kappa =0\), the model space is \(\mathbb R ^d\) and
$$\begin{aligned} V_\kappa (r)= V_d r^d \end{aligned}$$
3.
For \(\kappa <0\), the model space is the hyperbolic space of constant sectional curvature \(\kappa \), and
$$\begin{aligned} V_\kappa (r)= dV_d \int \limits _0^r \left( \frac{ \sinh \sqrt{|\kappa |} t}{ \sqrt{|\kappa |} }\right) ^{d-1} dt; \;\text{ so}\quad V_dr^d \le V_\kappa (r) \le V_dr^d e^{(d-1) \sqrt{|\kappa |} r} \end{aligned}$$
as \( s \le \sinh (s) \le se^s.\)
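The two elementary inequalities used for the volume bounds, \((2/\pi ) s \le \sin s \le s\) on \((0, \pi /2]\) and \(s \le \sinh s \le s e^{s}\) for \(s>0\), can be checked numerically; a quick sketch (the grid and tolerance are illustrative):

```python
import math

# sin(s)/s decreases from 1 to 2/pi on (0, pi/2]; sinh(s)/s increases from 1,
# while sinh(s) = (e^s - e^{-s})/2 <= s e^s. Check both families on a grid.
tol = 1e-12  # guards against floating-point rounding at the endpoints
for i in range(1, 1001):
    s = i * (math.pi / 2) / 1000
    assert (2 / math.pi) * s <= math.sin(s) + tol and math.sin(s) <= s + tol
    assert s <= math.sinh(s) + tol and math.sinh(s) <= s * math.exp(s) + tol
```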
Moreover, by the Bishop–Gromov comparison theorem, \( r \mapsto \frac{ |B(x,r)|}{V_\kappa (r)}\) is non-increasing. So if \(0<\varepsilon <r < s \le R= \mathrm{diam}(\mathcal{M })\):
So
So
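In summary, the comparison argument yields the following chain (a sketch, with \(\kappa \) the curvature lower bound fixed above and \(R = \mathrm{diam}(\mathcal{M})\)):

```latex
\mu(\mathcal{M})\,\frac{V_\kappa(r)}{V_\kappa(R)}
= |B(x,R)|\,\frac{V_\kappa(r)}{V_\kappa(R)}
\;\le\; |B(x,r)|
\;\le\; V_\kappa(r),
\qquad 0 < r \le R,
```

and the explicit bounds on \(V_\kappa \) listed in cases 1–3 turn this into \(c\, r^d \le |B(x,r)| \le C\, r^d\), which is Ahlfors’ condition (2).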
Remark 4
If \((\mathcal{M }, \mu , \rho )\) is a compact metric space with a Borel measure \(\mu \), and if we have the doubling condition:
then
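For instance, iterating the doubling condition \(|B(x,2r)| \le A\,|B(x,r)|\) from \(R = \mathrm{diam}(\mathcal{M})\) down to \(r\) gives a polynomial lower mass bound; a sketch, where the exponent \(\log _2 A\) is what the iteration produces:

```latex
\mu(\mathcal{M}) = |B(x,R)| \;\le\; A^{k}\,|B(x, R/2^{k})| \;\le\; A^{k}\,|B(x,r)|
\qquad\text{whenever } R/2^{k} \le r,
```

so choosing the smallest such \(k\) (then \(2^{k} \le 2R/r\)) yields \(|B(x,r)| \ge \mu (\mathcal{M})\, (r/2R)^{\log _2 A}\).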
Castillo, I., Kerkyacharian, G. & Picard, D. Thomas Bayes’ walk on manifolds. Probab. Theory Relat. Fields 158, 665–710 (2014). https://doi.org/10.1007/s00440-013-0493-0