Strong posterior contraction rates via Wasserstein dynamics

In Bayesian statistics, posterior contraction rates (PCRs) quantify the speed at which the posterior distribution concentrates on arbitrarily small neighborhoods of a true model, in a suitable way, as the sample size goes to infinity. In this paper, we develop a new approach to PCRs, with respect to strong norm distances on parameter spaces of functions. Critical to our approach is the combination of a local Lipschitz-continuity for the posterior distribution with a dynamic formulation of the Wasserstein distance, which allows us to set forth an interesting connection between PCRs and some classical problems arising in mathematical analysis, probability and statistics, e.g., Laplace methods for approximating integrals, Sanov's large deviation principles in the Wasserstein distance, rates of convergence of mean Glivenko-Cantelli theorems, and estimates of weighted Poincaré-Wirtinger constants. We first present a theorem on PCRs for a model in the regular infinite-dimensional exponential family, which exploits sufficient statistics of the model, and then extend such a theorem to a general dominated model. These results rely on the development of novel techniques to evaluate Laplace integrals and weighted Poincaré-Wirtinger constants in infinite dimension, which are of independent interest. The proposed approach is applied to the regular parametric model, the multinomial model, the finite-dimensional and the infinite-dimensional logistic-Gaussian model and the infinite-dimensional linear regression. In general, our approach leads to optimal PCRs in finite-dimensional models, whereas for infinite-dimensional models it is shown explicitly how the prior distribution affects PCRs.


Introduction
Bayesian consistency guarantees that the posterior distribution concentrates on arbitrarily small neighborhoods of the true model, in a suitable way, as the sample size goes to infinity (Doob [41], Schwartz [73], Freedman [45,46], Diaconis and Freedman [36], Barron et al. [15], Ghosal et al. [50], Walker [86]). See Ghosal and van der Vaart [51, Chapter 6 and Chapter 7] for a general overview of Bayesian consistency. Posterior contraction rates (PCRs) strengthen the notion of Bayesian consistency, as they quantify the speed at which such small neighborhoods of the true model may shrink while still capturing most of the posterior mass. The problem of establishing optimal PCRs in finite-dimensional (parametric) Bayesian models was first considered in Ibragimov and Has'minskiǐ [59] and LeCam [61]. However, it is in the works of Ghosal et al. [50] and Shen and Wasserman [75] that the problem of establishing PCRs has been investigated in a systematic way, setting forth a general approach to provide PCRs in both finite-dimensional and infinite-dimensional (nonparametric) Bayesian models. Since then, several methods have been proposed to obtain more explicit and also sharper PCRs. Among them, we recall the metric entropy approach, in combination with the definition of specific tests (Schwartz [73], Ghosal et al. [50]), the methods based on bracketing numbers and entropy integrals (Shen and Wasserman [75]), the martingale approach (Walker [86], Walker et al. [87]), the Hausdorff entropy approach (Xing [90]), and some approaches based on the Wasserstein distance (Chae et al. [27], Camerlenghi et al. [25]). See Ghosal and van der Vaart [51, Chapter 8 and Chapter 9], and references therein, for a comprehensive and up-to-date account of PCRs.

Our contributions
In this paper, we develop a new approach to PCRs, in the spirit of the seminal work of Ghosal et al. [50]. We consider a dominated statistical model as a family M = {f θ } θ∈Θ of densities, with the parameter space Θ being a (possibly infinite-dimensional) separable Hilbert space. We focus on posterior Hilbert neighborhoods of a given true parameter, say θ 0 , measuring PCRs in terms of strong norm distances on parameter spaces of functions, such as Sobolev-like norms. This assumption on Θ yields a stronger metric structure on M, as a subset of the space of densities, usually not equivalent to those considered so far in the literature on nonparametric density estimation (see, e.g., Ghosal et al. [49], Giné and Nickl [56], Scricciolo [74], Shen and Wasserman [75], van der Vaart and van Zanten [84], Walker [86], Walker et al. [87]), which are based on (pseudo-)distances such as the L p -norm, the Hellinger distance, the Kullback-Leibler divergence, and the chi-square divergence. To the best of our knowledge, there are no works in the Bayesian literature that deal with strong PCRs for density estimation by using constructive tests, as prescribed by the standard theory, even if this line of research could be pursued as well. As far as we know, the standard nonparametric approach covers the case of (semi-)metrics which are dominated by the Hellinger distance (see, e.g., Ghosal and van der Vaart [51, Proposition D.8]).
We present a theorem on PCRs for the regular infinite-dimensional exponential family of statistical models, and a theorem on PCRs for a general dominated statistical model. The former may be viewed as a special case of the latter, allowing one to exploit sufficient statistics arising from the infinite-dimensional exponential family. Critical to our approach is an assumption of local Lipschitz-continuity for the posterior distribution, with respect to the observations or a sufficient statistic of them. Such a property is typically known as "Bayesian well-posedness" (Stuart [79, Section 4.2]), and it has been investigated in depth in Dolera and Mainini [38,39]. By combining the local Lipschitz-continuity with the dynamic formulation of the Wasserstein distance (Benamou and Brenier [12], Ambrosio et al. [5]), referred to as Wasserstein dynamics, we set forth a connection between the problem of establishing PCRs and some classical problems arising in mathematical analysis, probability and statistics, e.g., Laplace methods for approximating integrals (Breitung [22], Wong [89]), Sanov's large deviation principle in Wasserstein distance (Bolley et al. [19], Jing [60]), rates of convergence of mean Glivenko-Cantelli theorems (Ajtai et al. [1], Ambrosio et al. [7], Dobrić and Yukic [37], Dolera and Regazzini [40], Fournier and Guillin [47], Talagrand [80,81], Bobkov and Ledoux [17], Weed and Bach [88], Jing [60]), and estimates of weighted Poincaré-Wirtinger constants (Bakry et al. [14, Chapter 4], Heinonen et al. [58, Chapter 15]). In particular, our study leads us to introduce new results on Laplace methods for approximating integrals and the estimation of weighted Poincaré-Wirtinger constants in infinite dimension, which are of independent interest. Some applications of our main theorems are presented for the regular parametric model, the multinomial model, the finite-dimensional and the infinite-dimensional logistic-Gaussian model and the infinite-dimensional linear regression. It turns out that our main results lead to optimal PCRs in finite dimension, whereas in infinite dimension it is shown explicitly how the prior distribution affects PCRs. Among the applications of our results, the infinite-dimensional logistic-Gaussian model is arguably the best setting to motivate the use of strong norm distances. In such a setting our approach is of interest when the ultimate goal of the inferential procedure is the estimation of some functional Φ(f θ ) of the density [77, Chapter 6] for which the mapping f → Φ(f ) is not continuous with respect to the aforesaid metrics on densities, whereas θ → Φ(f θ ) turns out to be even locally Lipschitz-continuous with respect to the Hilbertian metric on Θ.
Thus, strong norms allow one to consider larger classes of functionals of density functions, and then possibly a broader range of analyses. Another motivation for the use of strong norms comes from the theory of density estimation under penalized loss functions, with penalizations depending on derivatives of the density, according to the original Good-Gaskins proposal [57,76]. As these penalized loss functions are used to derive smoother estimators, it is of interest to derive the corresponding PCRs under the same loss functions.

Related works
The most popular classical (frequentist) approaches to density estimation are developed within the following frameworks: i) a parameter space that is the space of density functions, typically endowed with the L p norm or the Hellinger distance (Tsybakov [83]), usually associated with the notion of "strong consistency"; ii) a parameter space that is the space of density functions endowed with the Wasserstein distance, under which the parameter space is metrized according to a (concrete) metric structure on the space of the observations (Berthet and Niels-Weed [16]), usually associated with the notion of "weak consistency". Both these frameworks are different from the one we consider in this paper, and therefore a comparison of our PCRs with optimal minimax rates from Tsybakov [83] and Berthet and Niels-Weed [16] is not directly possible. Within the classical literature, Sriperumbudur et al. [78] considered our statistical framework and provided rates of consistency under the infinite-dimensional exponential family of statistical models, though without any formal statement on their minimax optimality. In principle, our approach to PCRs may be developed within the aforementioned popular statistical frameworks for density estimation. However, since our approach relies on properties of the Wasserstein distance that are well-known for parameter spaces with a linear structure, i.e., Wasserstein dynamics, the framework considered in this paper is the most natural and convenient to start with. As for the other statistical frameworks for density estimation, we conjecture that our approach to PCRs requires a suitable formulation of Wasserstein dynamics for parameter spaces with a nonlinear structure. While such a formulation is available from Gigli [53] and Gigli and Ohta [54], it is still not clear to us how to exploit it to deal with PCRs.

Organization of the paper
The paper is structured as follows. In Section 2 we recall the definition of PCR, presenting an equivalent definition in terms of the Wasserstein distance, and we outline the main steps of our approach to PCRs. Section 3 contains the main results of our work, that is, a theorem on PCRs for the regular infinite-dimensional exponential family of statistical models, and a generalization of it to general dominated statistical models. In Section 4 we present some applications of our results to the regular parametric model, the multinomial model, the finite-dimensional and the infinite-dimensional logistic-Gaussian model and the infinite-dimensional linear regression. Section 5 contains a discussion of some directions for future work, especially with respect to the application of our approach to other nonparametric models, such as the popular class of hierarchical (mixture) models. Proofs of our results are deferred to appendices.

A new approach to PCRs
We consider n ≥ 1 observations to be modeled as part of a sequence X (∞) := {X i } i≥1 of exchangeable random variables, with the X i 's taking values in a measurable space (X, X ). Let (Θ, d Θ ) be a metric space, referred to as the parameter space, endowed with its Borel σ-algebra T . Moreover, let π be a probability measure on (Θ, T ), referred to as the prior measure, and let µ(• | •) : X × Θ → [0, 1] be a probability kernel, referred to as the statistical model. The Bayesian approach relies on modeling the parameter of interest as a Θ-valued random variable, say T , with probability distribution π. At the core of Bayesian inference lies the posterior distribution, that is, the conditional distribution of T given a random sample (X 1 , . . ., X n ), whenever both T and the sequence X (∞) are supported on a common probability space (Ω, F , P). The minimal regularity conditions that are maintained, and possibly strengthened, throughout the paper are the following: the set X is a separable topological space, with X coinciding with the ensuing Borel σ-algebra, and (Θ, T ) is a standard Borel space. In this setting, the posterior distribution can be represented through a probability kernel π n (• | •) : T × X n → [0, 1] satisfying
\[ P[X_1 \in A_1, \dots, X_n \in A_n, T \in B] = \int_{A_1 \times \dots \times A_n} \pi_n(B \mid x^{(n)}) \, \alpha_n(dx^{(n)}) \tag{1} \]
for all sets A 1 , . . ., A n ∈ X and B ∈ T and n ≥ 1, where x (n) := (x 1 , . . ., x n ) and
\[ \alpha_n(A_1 \times \dots \times A_n) := \int_\Theta \prod_{i=1}^n \mu(A_i \mid \theta) \, \pi(d\theta). \]
If the statistical model µ(• | θ) is dominated by some σ-finite measure λ on (X, X ), with a corresponding family of λ-densities {f (• | θ)} θ∈Θ , then (a version of) the posterior distribution is given by the Bayes formula, that is, we write
\[ \pi_n(B \mid x^{(n)}) = \frac{\int_B \prod_{i=1}^n f(x_i \mid \theta) \, \pi(d\theta)}{\int_\Theta \prod_{i=1}^n f(x_i \mid \theta) \, \pi(d\theta)} \tag{2} \]
for any set B ∈ T and α n -a.e. x (n) , while α n turns out to be absolutely continuous with respect to the product measure λ ⊗n with density function of the form
\[ \rho_n(x_1, \dots, x_n) := \int_\Theta \prod_{i=1}^n f(x_i \mid \theta) \, \pi(d\theta). \tag{3} \]
Definition 2.1. We say that the posterior distribution is (weakly) consistent at θ 0 ∈ Θ if π n (U c 0 | ξ 1 , . . ., ξ n ) → 0 holds in probability for any neighborhood U 0 of θ 0 , where ξ (∞) := {ξ i } i≥1 stands for a sequence of X-valued independent random variables identically distributed as µ 0 (•) := µ(• | θ 0 ) (Ghosal and van der Vaart [51, Definition 6.1]).
The non-uniqueness of the posterior distribution π n requires additional regularity assumptions in order that π n (• | ξ 1 , . . ., ξ n ) is well-defined. PCRs strengthen the notion of Bayesian consistency, in the sense that they quantify the speed at which such neighborhoods may shrink while still capturing most of the posterior mass. In particular, the definition of PCR can be stated as follows (Ghosal and van der Vaart [51, Definition 8.1]).
Definition 2.2. A sequence {ε n } n≥1 of positive numbers is a PCR at θ 0 if
\[ \pi_n\big(\{\theta \in \Theta : d_\Theta(\theta, \theta_0) \ge M_n \varepsilon_n\} \mid \xi_1, \dots, \xi_n\big) \to 0 \]
holds in probability for every sequence {M n } n≥1 of positive numbers such that M n → ∞.
Now, we present our approach to PCRs based on the Wasserstein distance. This is a new approach, which relies on four main steps that are outlined hereafter. The first step of our approach originates from a reformulation of Definition 2.2 in terms of the so-called p-Wasserstein distance, for p ≥ 1. In particular, to recall this concept in full generality, we denote by (M, d M ) an abstract separable metric space, and we denote by P(M) the corresponding space of all probability measures on (M, B(M)). Then, the p-Wasserstein distance is defined as
\[ W_p(\gamma_1, \gamma_2) := \inf_{\eta \in \mathcal{F}(\gamma_1, \gamma_2)} \left( \int_{M^2} d_M(x, y)^p \, \eta(dx\,dy) \right)^{1/p} \]
for any γ 1 , γ 2 ∈ P p (M), where P p (M) denotes the subset of P(M) of probability measures with finite p-th moment, and \mathcal{F}(γ 1 , γ 2 ) is the class of all probability measures on (M 2 , B(M 2 )) with marginals γ 1 and γ 2 .
Lemma 2.3. The sequence
\[ \varepsilon_n := E\big[ W_p\big(\pi_n(\cdot \mid \xi_1, \dots, \xi_n), \delta_{\theta_0}\big) \big] \]
gives a PCR at θ 0 , where δ θ0 denotes the degenerate distribution at θ 0 .
The second step of our approach relies on the assumption of the existence of a suitable sufficient statistic. In particular, we assume the existence of another metric space, say (S, d S ), and the existence of a measurable map, say S n : X n → S, in such a way that the kernel π n (• | •) in (1) can be represented by means of another kernel, say π * n (• | •) : T × S → [0, 1], according to the identity
\[ \pi_n(\cdot \mid x_1, \dots, x_n) = \pi_n^*(\cdot \mid S_n(x_1, \dots, x_n)) \tag{7} \]
for all (x 1 , . . ., x n ) ∈ X n . See Fortini et al. [44], and references therein, for the existence of sufficient statistics in relationship with the exchangeability assumption. Of course, when the statistical model µ(• | •) is dominated, the existence of the sufficient statistic S n is implied by standard assumptions on the statistical model, such as the well-known Fisher-Neyman factorization criterion.
The third step of our approach relies on the large n asymptotic behavior of the random variable Ŝn := S n (ξ 1 , . . ., ξ n ). In particular, we assume the existence of a weak law of large numbers for Ŝn , which means that there exists some (non-random) S 0 ∈ S for which Ŝn → S 0 holds true in P-probability, as n → +∞. Hereafter, for any sequence {δ n } n≥1 of positive numbers, we denote by
\[ J_n(\delta_n) := P\big[ d_S(\hat{S}_n, S_0) \ge \delta_n \big] \tag{8} \]
the probability that Ŝn lies outside a δ n -neighborhood of S 0 . Usually, J n (δ n ) can be evaluated by means of concentration inequalities and large deviation principles.
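The first step can be illustrated numerically. The sketch below is an illustration under our own assumptions, not the construction used in the paper: it takes a toy conjugate model (X_i ~ N(θ, 1) with a N(0, 1) prior, chosen only because the posterior is explicit) and uses the fact that the only coupling with a Dirac marginal is the product measure, so that W_p(γ, δ_θ0)^p is the p-th moment of the distance to θ0 under γ. Markov's inequality then bounds the posterior mass outside a ball by a power of the Wasserstein distance, which is the mechanism linking W_p to PCRs. The function names are hypothetical.

```python
import numpy as np

def wasserstein_to_dirac(samples, theta0, p=2):
    """W_p between the empirical law of `samples` and the point mass at theta0.

    With a Dirac marginal the only coupling is the product measure, so
    W_p^p is the p-th moment of the distance to theta0.
    """
    dist = np.abs(np.asarray(samples, dtype=float) - theta0)
    return float(np.mean(dist ** p) ** (1.0 / p))

def posterior_mass_outside(samples, theta0, radius):
    """Fraction of posterior draws at distance >= radius from theta0."""
    dist = np.abs(np.asarray(samples, dtype=float) - theta0)
    return float(np.mean(dist >= radius))

# Toy model (an assumption for illustration): X_i ~ N(theta, 1), prior N(0, 1),
# hence the posterior is N(n * xbar / (n + 1), 1 / (n + 1)).
rng = np.random.default_rng(0)
theta0, n, p = 1.0, 400, 2
xbar = rng.normal(theta0, 1.0 / np.sqrt(n))
draws = rng.normal(n * xbar / (n + 1), np.sqrt(1.0 / (n + 1)), size=50_000)

w_p = wasserstein_to_dirac(draws, theta0, p)
radius = 5.0 * w_p
mass = posterior_mass_outside(draws, theta0, radius)
markov_bound = (w_p / radius) ** p  # Markov: mass <= W_p^p / radius^p
```

In this toy model W_2 is of order 1/√n, so a radius that is a slowly diverging multiple of 1/√n captures most of the posterior mass.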
Based on (7), the fourth step of our approach relies on a form of local Lipschitz-continuity for the kernel π * n (• | •), which holds under suitable assumptions on the model µ(• | •) and the prior π. It corresponds to the existence of two sequences of positive numbers, say {δ n } n≥1 and {L (n) 0 } n≥1 , such that, for each n ∈ N,
\[ W_p\big(\pi_n^*(\cdot \mid S_0), \pi_n^*(\cdot \mid S')\big) \le L_0^{(n)} \, d_S(S_0, S') \tag{9} \]
holds for any S ′ belonging to U δn (S 0 ) := {S ∈ S : d S (S 0 , S) < δ n }. We refer to Dolera and Mainini [38,39] for a detailed treatment of the property of local Lipschitz-continuity, for fixed n ∈ N, providing some quantitative estimates for L (n) 0 . Then, according to Lemma 2.3, under the validity of (7) and (9), we obtain the bound (10). Under additional assumptions, in Section 3 we develop a careful analysis of the three terms on the right-hand side of (10), in order to show that they can be bounded in terms of more explicit quantities that behave like n −α , for some α > 0. In particular, the first term is a non-random quantity which is equal to
\[ W_p\big(\pi_n^*(\cdot \mid S_0), \delta_{\theta_0}\big) = \left( \int_\Theta d_\Theta(\theta, \theta_0)^p \, \pi_n^*(d\theta \mid S_0) \right)^{1/p} \]
and it measures the speed of shrinkage of π * n (• | S 0 ) at θ 0 . Its evaluation is a purely analytical problem, which relies on an extension to infinite-dimensional spaces of the classical Laplace methods of approximating integrals. In (10), the term E[d S ( Ŝn , S 0 )] provides the speed of convergence of the mean law of large numbers, which is well-known, at least for the situations considered throughout this paper. The term 1{ Ŝn ∉ U δn (S 0 )} in (10) hints at an application of a large deviation principle. As for the L (n) 0 's in (10), the bounds provided in Dolera and Mainini [38,39] show that they can be expressed in terms of weighted Poincaré-Wirtinger constants. As we will show below, a proper choice of the sequence {δ n } n≥1 should entail that {L (n) 0 } n≥1 is bounded or, at least, diverges at a controlled rate.
Critical to our analysis of the term L (n) 0 is the so-called dynamic formulation of the p-Wasserstein distance, which is referred to as Wasserstein dynamics (Benamou and Brenier [12]). In particular, assume that M is the norm-closure of some nonempty, open and connected subset of a separable Hilbert space H, endowed with scalar product ⟨•, •⟩ and norm ‖•‖. Then, for any γ 0 , γ 1 ∈ P p (M)
\[ W_p^p(\gamma_0, \gamma_1) = \inf_{\{\gamma_t\} \in AC_p[\gamma_0; \gamma_1]} \int_0^1 \int_M \|v_t(x)\|^p \, \gamma_t(dx) \, dt \]
where AC p [γ 0 ; γ 1 ] is the space of all absolutely continuous curves in P p (M) with L p (0, 1) metric derivative (w.r.t. W p ) connecting γ 0 to γ 1 , and {v t } denotes the velocity field of the curve {γ t }, characterized in the weak sense by the continuity equation
\[ \frac{d}{dt} \int_M \psi(x) \, \gamma_t(dx) = \int_M \langle v_t(x), D\psi(x) \rangle \, \gamma_t(dx) \qquad \text{for all } \psi \in C^1_b(M). \tag{13} \]
Here, Dψ denotes the Riesz representative of the Fréchet differential of the function ψ, and ψ ∈ C 1 b (M) means that ψ is the restriction to M of a function in the class C 1 b (H), that is, ψ is a bounded continuous function with bounded continuous Fréchet derivative on H. See Da Prato and Zabczyk [32, Chapter 2] for spaces of continuous functions defined on Hilbert spaces, and Ambrosio et al. [5, Chapter 8] for a detailed account of the partial differential equation (13).
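For intuition on the dynamic formulation, the following sketch checks it in the simplest setting we could think of, which is our own choice and not taken from the paper: two Gaussian laws on the real line with p = 2. There the standard closed form W_2² = (m0 − m1)² + (s0 − s1)² is available, the displacement interpolation is γ_t = N(m_t, s_t²) with linearly interpolated mean and standard deviation, and its velocity field is affine. The code verifies that the time-integrated kinetic energy equals W_2² and that the weak continuity equation holds for the test function ψ = tanh. Function names are hypothetical, and expectations are computed by Gauss-Hermite quadrature.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def w2_gaussians_1d(m0, s0, m1, s1):
    """Closed-form W_2 between N(m0, s0^2) and N(m1, s1^2) on the real line."""
    return float(np.sqrt((m0 - m1) ** 2 + (s0 - s1) ** 2))

def standard_normal_rule(deg=40):
    """Quadrature nodes and weights for expectations under N(0, 1)."""
    nodes, weights = hermegauss(deg)
    return nodes, weights / weights.sum()

def velocity(x, m0, s0, m1, s1, t):
    """Velocity field of the displacement interpolation at time t."""
    m_t = (1.0 - t) * m0 + t * m1
    s_t = (1.0 - t) * s0 + t * s1
    return (m1 - m0) + (s1 - s0) * (x - m_t) / s_t

def kinetic_energy(m0, s0, m1, s1, t, deg=40):
    """Integral of |v_t|^2 against gamma_t = N(m_t, s_t^2)."""
    z, w = standard_normal_rule(deg)
    x = (1.0 - t) * m0 + t * m1 + ((1.0 - t) * s0 + t * s1) * z
    return float(np.sum(w * velocity(x, m0, s0, m1, s1, t) ** 2))

def continuity_gap(m0, s0, m1, s1, t, h=1e-5, deg=40):
    """|d/dt E[psi(X_t)] - E[v_t(X_t) psi'(X_t)]| for psi = tanh."""
    z, w = standard_normal_rule(deg)

    def mean_psi(u):
        x_u = (1.0 - u) * m0 + u * m1 + ((1.0 - u) * s0 + u * s1) * z
        return np.sum(w * np.tanh(x_u))

    lhs = (mean_psi(t + h) - mean_psi(t - h)) / (2.0 * h)  # central difference
    x = (1.0 - t) * m0 + t * m1 + ((1.0 - t) * s0 + t * s1) * z
    rhs = np.sum(w * velocity(x, m0, s0, m1, s1, t) * (1.0 - np.tanh(x) ** 2))
    return float(abs(lhs - rhs))

times = np.linspace(0.0, 1.0, 11)
action = float(np.mean([kinetic_energy(0.0, 1.0, 2.0, 0.5, t) for t in times]))
```

Along this curve the kinetic energy is constant in time, which is the constant-speed property of Wasserstein geodesics.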
For any fixed t and given γ t , it is natural to look for a solution v t (•) of Equation (13) in the form of a gradient, and therefore we may interpret (13) as an abstract elliptic equation, for which it is well-known that Poincaré inequalities play a critical role in proving the existence and regularity of a solution.
Definition 2.4. We say that a probability measure µ on (M, B(M)) satisfies a weighted Poincaré inequality of order p if there exists a constant C p for which (14) holds for every ψ ∈ C 1 b (M). We denote by C p [µ] the best constant C p in (14). In particular, for p = 2 the best constant C 2 [µ] may be characterized by means of Finally, if M = H and µ is absolutely continuous with respect to a non-degenerate Gaussian measure, then the Fréchet derivative in (14) can be replaced by the Malliavin derivative D, yielding the following weaker definition We refer to the monographs of Bogachev [18], Da Prato [30], Da Prato and Zabczyk [32] for a detailed account of Malliavin calculus and related Sobolev spaces.
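As a concrete instance of a Poincaré inequality, the sketch below checks the classical Gaussian case on the real line: for µ = N(0, σ²), the variance of ψ under µ is at most σ² times the mean of |ψ′|², with equality for linear ψ. This is a standard fact and not a result of the paper. How the ratio computed here relates to the constant C 2 [µ] of Definition 2.4 depends on the normalization adopted in (14), which we do not assume; the helper `poincare_ratio` is hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def poincare_ratio(psi, dpsi, sigma, deg=80):
    """Var_mu(psi) / E_mu[|psi'|^2] for mu = N(0, sigma^2), by quadrature."""
    z, w = hermegauss(deg)
    w = w / w.sum()          # weights of a N(0, 1) expectation
    x = sigma * z            # nodes for N(0, sigma^2)
    values = psi(x)
    variance = np.sum(w * values ** 2) - np.sum(w * values) ** 2
    energy = np.sum(w * dpsi(x) ** 2)
    return float(variance / energy)

sigma = 1.5
ratios = {
    "linear": poincare_ratio(lambda x: x, lambda x: np.ones_like(x), sigma),
    "sine": poincare_ratio(np.sin, np.cos, sigma),
    "tanh": poincare_ratio(np.tanh, lambda x: 1.0 - np.tanh(x) ** 2, sigma),
}
```

The linear test function attains the value σ², so σ² is the best constant for this ratio in the Gaussian case; the other test functions stay below it.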

Main results on PCRs
Following the approach to PCRs outlined in Section 2, we present two main results: i) a theorem on PCRs for the regular infinite-dimensional exponential family of statistical models; ii) a theorem on PCRs for a general dominated statistical model.

PCRs for the regular infinite-dimensional exponential family
It is useful to recall the definition and some basic properties of the infinite-dimensional exponential family. In general, classical results on exponential families may be extended to the infinite-dimensional setting through suitable arguments of convex analysis (Bauschke et al. [11], Bauschke and Combettes [10]).
Definition 3.1. Let λ be a σ-finite measure on (X, X ), B be a separable Banach space with dual B * , and B * ⟨•, •⟩ B be the pairing between B and B * . Also, let Γ be a nonempty open subset of B * , and let β : X → B be a measurable map. If the interior Λ of the convex hull of the support of λ ∘ β −1 is nonempty and
\[ \int_X \exp\big\{ {}_{B^*}\langle \gamma, \beta(x) \rangle_B \big\} \, \lambda(dx) < +\infty \]
holds for any γ ∈ Γ, then the regular infinite-dimensional exponential family is a statistical model defined through the family of λ-densities {ϕ(• | γ)} γ∈Γ , where
\[ \phi(x \mid \gamma) := \exp\big\{ {}_{B^*}\langle \gamma, \beta(x) \rangle_B - M_\phi(\gamma) \big\} \tag{16} \]
with
\[ M_\phi(\gamma) := \log \int_X \exp\big\{ {}_{B^*}\langle \gamma, \beta(x) \rangle_B \big\} \, \lambda(dx). \]
Brown [24, Theorem 1.13, Theorem 2.2 and Theorem 2.7] states that M ϕ is a strictly convex function on Γ, lower semi-continuous on B * , of class C ∞ (Γ) and analytic. In addition, Barndorff-Nielsen [9, Corollary 5.3] implies that M ϕ is steep (essentially smooth). Therefore, from Brown [24, Theorem 3.6] it holds that defines a smooth injective map from Γ into B, with dense range. Finally, [24, Corollary 2.5] entails the identifiability of the model characterized by the densities (16).
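A finite-dimensional toy case may help fix ideas on the objects in Definition 3.1. The sketch below is our own illustration: it assumes the standard exponential-family form exp{⟨γ, β(x)⟩ − M(γ)} on a three-point sample space with counting measure, and checks two classical facts, namely that the gradient of the log-partition function M equals the mean of β(X), and that M is strictly convex. All names (`BETA`, `log_partition`, and so on) are hypothetical.

```python
import numpy as np

# Hypothetical toy family: sample space {0, 1, 2} with counting measure,
# and beta(x) given by the rows of BETA (values in R^2).
BETA = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

def log_partition(gamma):
    """M(gamma) = log sum_x exp(<gamma, beta(x)>), computed stably."""
    scores = BETA @ np.asarray(gamma, dtype=float)
    top = scores.max()
    return float(top + np.log(np.exp(scores - top).sum()))

def density(gamma):
    """Probabilities exp(<gamma, beta(x)> - M(gamma)) for x = 0, 1, 2."""
    scores = BETA @ np.asarray(gamma, dtype=float)
    return np.exp(scores - log_partition(gamma))

def mean_of_beta(gamma):
    """Expectation of beta(X) under the density indexed by gamma."""
    return density(gamma) @ BETA

def numerical_gradient(f, gamma, h=1e-6):
    """Central finite-difference gradient of a scalar function f."""
    gamma = np.asarray(gamma, dtype=float)
    grad = np.zeros_like(gamma)
    for j in range(gamma.size):
        step = np.zeros_like(gamma)
        step[j] = h
        grad[j] = (f(gamma + step) - f(gamma - step)) / (2.0 * h)
    return grad

gamma_a = np.array([0.3, -0.7])
gamma_b = np.array([-1.0, 2.0])
grad_gap = float(np.max(np.abs(
    numerical_gradient(log_partition, gamma_a) - mean_of_beta(gamma_a))))
convexity_gap = (0.5 * (log_partition(gamma_a) + log_partition(gamma_b))
                 - log_partition(0.5 * (gamma_a + gamma_b)))
```

In the classical finite-dimensional theory the injective mean-value map is exactly this gradient; whether the map referred to in the text takes the same form in infinite dimension is not something this sketch establishes.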
To introduce the setting of our theorem on PCRs, it is useful to express the statistical model µ(• | •) in terms of an infinite-dimensional exponential family. In this regard, we introduce a further measurable mapping g : Θ → Γ and write the model as in (19). In the setting of (19), we observe that the identity (7) is satisfied with Note that Equation (19) arises naturally from the assumption that the statistical model µ(• | •) is dominated, which provides a family {f (• | θ)} θ∈Θ of density functions. Accordingly, by assuming that X is endowed with a richer metric structure, if The functions g and β then arise from the mapping y → log f (y | θ) and the measure δ x through (some sort of) integration-by-parts, if this is admitted, or through classical Fourier transformation arguments, such as the Plancherel formula.
According to Ledoux and Talagrand [64, Corollary 7.10], under the assumption we set in the sense of Bochner integral, and obtain the strong law of large numbers, i.e. Ŝn → S 0 holds P-a.s., as n → +∞. Now, we set M (θ) = M ϕ (g(θ)) and then define provided that for any n ∈ N and b ∈ B. We remark that (24) is a necessary assumption for the existence of the posterior distribution. Now, we state the theorem on PCRs in the setting of infinite-dimensional exponential families; the proof is deferred to Appendix A.
+ θ 0 Θ P Ŝn ∈ U δn (S 0 ) where π * n (• | •) is given by (23), S 0 and Ŝn are as in (22) and (20), respectively, and U δn (S 0 ) := {S ∈ S | ‖S 0 − S‖ B < δ n }. Theorem 3.2 provides an implicit form for PCRs. That is, the large n asymptotic behaviour of the terms on the right-hand side of (25) must be further investigated to obtain a more explicit expression for the corresponding PCR. In this regard, it is useful to rewrite π * n in terms of the Kullback-Leibler divergence. That is, if S ∘ g is injective and b belongs to the range of S ∘ g, then where denotes the Kullback-Leibler divergence. See Appendix A.2 for the proof of Equation (26). It is natural to expect that the main contribution to PCRs arises from the first and the fourth term on the right-hand side of (25), which provide general algebraic rates of convergence to zero. Hereafter, we investigate the large n asymptotic behaviour of the terms on the right-hand side of (25). More explicit results in terms of PCRs will be presented in Section 4 with respect to the application of Theorem 3.2 in the context of the regular parametric model, the multinomial model, the finite-dimensional and the infinite-dimensional logistic-Gaussian model and the infinite-dimensional linear regression.
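The rewriting of the posterior kernel in terms of the Kullback-Leibler divergence can be checked by hand in a one-parameter case. The sketch below is a toy illustration under our own assumptions: a Bernoulli model with a uniform prior on a grid, with K(θ | θ_b) taken to be the divergence of the model at θ from the model at θ_b, where θ_b = b is the parameter whose mean matches the observed value b of the sufficient statistic. In this case the log-likelihood differs from −n K(θ | b) by a term that does not depend on θ, so the two normalized posteriors coincide. This mirrors the structure the text describes but is not the paper's Equation (26).

```python
import numpy as np

def kl_bernoulli(theta, theta_b):
    """KL divergence of Bernoulli(theta) from Bernoulli(theta_b)."""
    theta = np.asarray(theta, dtype=float)
    return (theta_b * np.log(theta_b / theta)
            + (1.0 - theta_b) * np.log((1.0 - theta_b) / (1.0 - theta)))

def normalize(log_weights):
    """Turn unnormalized log-weights into probabilities on the grid."""
    weights = np.exp(log_weights - np.max(log_weights))
    return weights / weights.sum()

theta_grid = np.linspace(0.001, 0.999, 999)
n, successes = 50, 15
b = successes / n                        # observed sufficient statistic
log_prior = np.zeros_like(theta_grid)    # uniform prior (assumption)

log_lik = (successes * np.log(theta_grid)
           + (n - successes) * np.log1p(-theta_grid))
posterior_from_likelihood = normalize(log_lik + log_prior)
posterior_from_kl = normalize(-n * kl_bernoulli(theta_grid, b) + log_prior)
max_gap = float(np.max(np.abs(posterior_from_likelihood - posterior_from_kl)))
```

Written this way, the posterior is a Laplace-type weight exp(−n K) against the prior, which is what makes Laplace methods applicable to the first term of the bound.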

First term on the right-hand side of (25)
We start by considering the large n asymptotic behaviour of the first term on the right-hand side of (25).
In particular, from (26), we can rewrite this term as in (28).
The last expression of (28) shows the ratio of two Laplace integrals, and therefore the Laplace method of approximating integrals can be applied. In the finite-dimensional setting, i.e. Θ ⊆ R d , the Laplace approximation method is well-known (Breitung [22], Wong [89]), and it leads to the following proposition.
Proposition 3.3. In the case that Θ ⊆ R d , assume that π has a continuous density q with respect to the Lebesgue measure, with q(θ 0 ) > 0, and that θ → K(θ | θ 0 ) is a C 2 -function with a strictly positive definite Hessian at θ 0 , which coincides with the Fisher information matrix I[θ 0 ] at θ 0 . Let ∫ Θ |θ| p π(dθ) < +∞ be fulfilled for some p ≥ 1. Finally, suppose that for any δ > 0 there exists c(δ) > 0 such that
\[ K(\theta \mid \theta_0) \ge c(\delta) \qquad \text{for all } \theta \in \Theta \text{ with } |\theta - \theta_0| \ge \delta. \tag{29} \]
Then, for any p > 0, there hold as n → +∞, where S d (1) := {z ∈ R d : |z| = 1}, dS denotes the surface measure and ⟨•, •⟩ stands for the standard scalar product in R d . Thus, under these assumptions, as n → +∞.
It is interesting to observe that the inequality (29) is a sort of strengthening of the so-called Shannon-Kolmogorov information inequality. See, e.g., Ferguson [43, Chapter 17]. In particular, because of (29), integrals on the whole Θ can be reduced to integrals over balls centered at θ 0 , as integration over the complement of any such ball yields exponentially small quantities with respect to n.
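The finite-dimensional Laplace behaviour can be observed numerically in one dimension. The sketch below rests on our own assumptions (a Bernoulli model, a uniform prior density q, and p = 2) and evaluates the ratio of Laplace integrals whose numerator carries the weight |θ − θ0|² and whose common exponential factor is exp(−n K(θ | θ0)). The classical Laplace heuristic suggests that n times this ratio approaches the inverse Fisher information, here θ0(1 − θ0), regardless of the prior density. Function names are hypothetical.

```python
import numpy as np

def kl_bernoulli(theta, theta0):
    """KL divergence of Bernoulli(theta) from Bernoulli(theta0)."""
    theta = np.asarray(theta, dtype=float)
    return (theta0 * np.log(theta0 / theta)
            + (1.0 - theta0) * np.log((1.0 - theta0) / (1.0 - theta)))

def scaled_second_moment(n, theta0, grid_size=200_001):
    """n times the ratio of Laplace integrals with weight |theta - theta0|^2.

    The grid is uniform, so the mesh width cancels in the ratio and plain
    sums are enough; the prior density is q = 1 (uniform, an assumption).
    """
    theta = np.linspace(1e-6, 1.0 - 1e-6, grid_size)
    weights = np.exp(-n * kl_bernoulli(theta, theta0))
    second_moment = np.sum((theta - theta0) ** 2 * weights) / np.sum(weights)
    return float(n * second_moment)

def inverse_fisher_information(theta0):
    """1 / I(theta0) for the Bernoulli model."""
    return theta0 * (1.0 - theta0)

theta0 = 0.3
limit = inverse_fisher_information(theta0)
value_small_n = scaled_second_moment(100, theta0)
value_large_n = scaled_second_moment(10_000, theta0)
```

Taking the square root, the first term then decays like 1/√n in this toy model, consistent with the parametric rate.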
According to Proposition 3.3, in the finite-dimensional setting the prior distribution does not affect the large n asymptotic behaviour of the first term on the right-hand side of (25). In contrast with the standard finite-dimensional setting, the literature on the Laplace approximation method in the infinite-dimensional setting appears not to be well developed. That is, to the best of our knowledge, infinite-dimensional Laplace approximations are limited to the case in which the measure π is a Gaussian measure (Albeverio and Steblovskaya [2,3]). Unfortunately, this literature does not cover the case in which the Hessian of the map θ → K(θ | θ 0 ) at θ 0 is not coercive (uniformly elliptic), which is precisely the case of interest in our specific problem. The next proposition fills this gap; the proof is deferred to Appendix A.4. The proposition is of independent interest in the context of the classical Laplace method.
Proposition 3.4. Let Θ be a separable Hilbert space with scalar product ⟨•, •⟩, and let π be the non-degenerate Gaussian measure N (m, Q), with m ∈ Θ and Q a trace-class operator. For fixed θ 0 ∈ Θ, assume that θ → K(θ | θ 0 ) belongs to C 2+q (Θ) for some q ∈ (0, 1], and that its Hessian at θ 0 , which coincides with the Fisher information operator I(θ 0 ) at θ 0 , is a compact self-adjoint linear operator from Θ into itself, with trivial kernel. Suppose there exists an orthonormal Fourier basis {e k } k≥1 of Θ which diagonalizes simultaneously both Q and I(θ 0 ), so that are valid with two suitable sequences {λ k } k≥1 and {γ k } k≥1 of strictly positive numbers that go to zero as k → +∞, with {λ k } k≥1 ∈ ℓ 1 . Finally, assume there exist two other Hilbert spaces K and V such that i) V ⊂ Θ ⊂ K with continuous, dense embeddings; ii) an interpolation inequality like holds for any θ ∈ V with conjugate exponents r, s > 1 such that r < 1 + q/2; iii) for all θ ∈ V, the inequalities are valid with some monotone non-decreasing function φ Then, as n → +∞, the following expansion holds with the sequence {ω k } k≥1 ∈ ℓ 2 given by (θ
Remark 3.5. In the infinite-dimensional setting, the assumption (29) is, in general, too strong. Conditions (33)-(34), combined with the interpolation (32), constitute a reasonable set of assumptions that allow a quite general treatment in the applications. It is worth noticing that (29), as well as (33), is expressed in the form of a lower bound for K(θ | θ 0 ). These bounds are conceptually opposite to the so-called "prior mass condition" required in the standard theory (Ghosal and van der Vaart [51, Theorem 8.9, inequality (8.4)]), which is usually proved by means of upper bounds for K(θ | θ 0 ). See, e.g., the upper bounds for K(θ | θ 0 ) in Lemma 2.5 of Ghosal and van der Vaart [51].
Remark 3.6. With respect to Proposition 3.3, the statement of Proposition 3.4 is confined to the case p = 2. There are no technical limitations for treating the more general case p ≠ 2, though p = 2 yields a more readable (conclusive) result.
Remark 3.7. Assumption (31) is not necessary to obtain PCRs. However, without this assumption, the resulting PCR would have a complicated form, which may be recovered from the proof. For example, let Θ N be the finite-dimensional subspace of Θ obtained by the linear span of {e 1 , . . ., e N }, let Q N denote the N × N matrix that represents the restriction of Q to Θ N , after projecting the range of such restriction again on Θ N , and let I N (θ 0 ) denote the N × N matrix associated to the restriction just explained of the operator I(θ 0 ) to Θ N . If Q N and I N (θ 0 ) are non-singular, then the first term on the right-hand side of (35) can be replaced by which is not as clear as the series Σ_{k=1}^∞ λ k /(nλ k γ k + 1). An analogous operation can be performed with respect to the second term on the right-hand side of (35).
Moreover, the above argument can be reinforced by resorting to some trace inequalities, as explained in [26]. In particular, we assume there exists another compact, self-adjoint operator I * such that I(θ 0 ) ≥ I * in the sense of quadratic forms, i.e. ⟨θ, I(θ 0 )θ⟩ ≥ ⟨θ, I * θ⟩ for any θ ∈ Θ. Whence, upon denoting by I * N the restriction of I * to Θ N as above, we have By the Löwner-Heinz theorem, the mapping t → −t −1 is operator monotone, yielding that See again [26] for the details. Therefore, if the orthonormal Fourier basis {e k } k≥1 of Θ diagonalizes simultaneously both Q and I * (instead of I(θ 0 )), so that are valid with suitable strictly positive γ * k 's that go to zero as k → +∞, then by Proposition 3.4
Proposition 3.4 shows that the large n asymptotic behavior of the first term on the right-hand side of (25) is worse than 1/n, which is the large n asymptotic behaviour obtained in Proposition 3.3 with p = 2. For example, by taking the first term on the right-hand side of (35) holds as n → +∞. As for the second term on the right-hand side of (35), it can be made identical to zero by choosing m = θ 0 , that is, by centering the Gaussian prior at θ 0 . However, if holds as n → +∞. Therefore, if c < a this second term is slower than the one in (39), whilst if c > a it is negligible with respect to that term. Again on (39), it is interesting to notice what happens if the eigenvalues λ k 's approach zero very rapidly, like λ k ∼ e −k , for example. Another straightforward calculation shows that holds as n → +∞. A refinement of this argument entails that the large n asymptotic behavior of the right-hand side of (35) can be made arbitrarily close to the rate 1/n, for example by choosing 1+c) for some r, b, c > 0, with arbitrarily large r. By recalling that the first term on the right-hand side of (25) coincides with the square root of the left-hand side of (25), this argument shows that the PCR is arbitrarily close to 1/ √ n. It is reasonable to guess that the minimax (classical) risk should go to zero as fast as 1/ √ n, though we are not aware of any result proving such a behaviour. A merit of Proposition 3.4 is to show explicitly that, within the infinite-dimensional setting, PCRs are influenced by three quantities that do not appear in the finite-dimensional setting of Proposition 3.3: i) the rate of approach to zero of the sequence {λ k } k≥1 , which measures the "regularity of the prior"; ii) the rate of approach to zero of the sequence {γ k } k≥1 , which measures the "regularity of the model"; iii) the rate of approach to zero of the sequence {ω k } k≥1 , which measures how close θ 0 is to m. Finally, we notice that the space V is linked with the Cameron-Martin space associated to π, which must be included in V.
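The influence of the prior and model eigenvalues on the rate can be explored numerically through the series Σ_k λ_k /(nλ_k γ_k + 1) mentioned in Remark 3.7. The sketch below assumes polynomial decays λ_k = k^{-(1+a)} and γ_k = k^{-b}; these specific choices are ours and are not taken from the text. A balancing heuristic (terms with nλ_k γ_k above 1 contribute about 1/(nγ_k), the others about λ_k, with the switch near k ≈ n^{1/(1+a+b)}) suggests a decay of order n^{-a/(1+a+b)}, which the code compares against a log-log slope estimated from a truncated series. Function names and truncation level are hypothetical.

```python
import numpy as np

def prior_model_series(n, a, b, n_terms=200_000):
    """Truncated series sum_k lam_k / (n * lam_k * gam_k + 1).

    Assumed decays (illustrative): lam_k = k^-(1+a), gam_k = k^-b.
    """
    k = np.arange(1, n_terms + 1, dtype=float)
    lam = k ** -(1.0 + a)
    gam = k ** -b
    return float(np.sum(lam / (n * lam * gam + 1.0)))

def empirical_exponent(a, b, n_lo=1e4, n_hi=1e6):
    """Log-log slope of the series between two sample sizes."""
    s_lo = prior_model_series(n_lo, a, b)
    s_hi = prior_model_series(n_hi, a, b)
    return float(np.log(s_hi / s_lo) / np.log(n_hi / n_lo))

def heuristic_exponent(a, b):
    """Exponent suggested by balancing n * lam_k * gam_k against 1."""
    return -a / (1.0 + a + b)

slope_11 = empirical_exponent(1.0, 1.0)
slope_21 = empirical_exponent(2.0, 1.0)
```

Under these assumed decays the exponent is always strictly above −1 and tends to −1 as a grows with b fixed, in line with the observation that the infinite-dimensional behaviour is worse than 1/n but can approach it for rapidly decaying prior eigenvalues.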

Second and third term on the right-hand side of (25)
Now, we consider the large n asymptotic behaviour of the second term and of the third term on the right-hand side of (25). Both these terms depend explicitly on the tail probability (40), which is directly related to classical concentration inequalities for sums of random variables. Besides well-known Bernstein-type concentration inequalities for real-valued random variables (Boucheron et al. [20], Dembo and Zeitouni [34]), some useful generalizations or extensions can be found in, e.g., Giné and Nickl [55], Ledoux and Talagrand [64], Pinelis and Sakhanenko [70] and Yurinskii [91]. In particular, for a suitable choice of the sequence {δ n } n≥1 , such as a constant sequence or a sequence vanishing at an algebraic rate, the term (40) goes to zero at suitable exponential rates, and therefore it provides a negligible contribution to the right-hand side of (25).
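To see why such tail probabilities are negligible, the sketch below uses Hoeffding's inequality as a simple stand-in for the Bernstein-type inequalities cited above; this swap is ours, made only because Hoeffding's bound has a closed form for variables in [0, 1]. It bounds the probability that a sample mean deviates from its expectation by at least δ by 2 exp(−2nδ²), and compares the bound with a simulated frequency for Bernoulli data. With δ_n = n^{-1/4}, a sequence vanishing at an algebraic rate, the bound equals 2 exp(−2√n), which still decays faster than any power of n. Function names are hypothetical.

```python
import numpy as np

def hoeffding_tail_bound(n, delta):
    """Hoeffding bound on P(|mean - expectation| >= delta), data in [0, 1]."""
    return float(min(1.0, 2.0 * np.exp(-2.0 * n * delta ** 2)))

def empirical_tail(n, delta, prob=0.3, reps=20_000, seed=0):
    """Simulated frequency of a deviation >= delta for Bernoulli(prob) means."""
    rng = np.random.default_rng(seed)
    means = rng.binomial(n, prob, size=reps) / n
    return float(np.mean(np.abs(means - prob) >= delta))

sample_sizes = (50, 200, 800)
# delta_n = n^(-1/4): vanishing at an algebraic rate, bound = 2 * exp(-2 * sqrt(n))
vanishing_bounds = [hoeffding_tail_bound(n, n ** -0.25) for n in sample_sizes]
```

A usage example: `empirical_tail(200, 0.1)` can be compared with `hoeffding_tail_bound(200, 0.1)` to see how conservative the bound is for this model.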
The third term on the right-hand side of (25) includes a posterior moment. In particular, an application of Hölder's inequality shows that such a moment is bounded from above by a product of two factors. It is useful to recall that the density function $\rho_n$ has been defined in (3). Accordingly, the second factor coincides with the $\rho'$-th moment of a martingale. At this stage, a possible strategy may rely on well-known bounds for moments of martingales (Dharmadhikari et al. [35]). As for the term $\mathbb{E}[\|\hat{S}_n - S_0\|_B]$, by means of a direct application of Lyapunov's inequality, we can bound it from above by the corresponding root of a higher-order moment, and the resulting right-hand side typically goes to zero as $1/\sqrt{n}$. Besides the obvious case in which B coincides with a separable Hilbert space, we refer to Nemirovski [68], Massart [65] and Massart and Rossignol [66] for the case in which we have $B = \ell_p(\mathbb{R}^d)$.

Fourth term on the right-hand side of (25)
Finally, we consider the large n asymptotic behaviour of the fourth term on the right-hand side of (25).In particular, this term involves the constant L (n) 0 , whose treatment requires to recall some fundamental notions of infinite-dimensional calculus.Given g : Θ → B * , the Fréchet differential D θ [g] of g is now meant as a bounded linear operator from Θ to B * such that g(θ Here, we consider the case p = 2.It should be recalled that the theory of weighted Poincaré constant has been mainly focused on the two cases p = 1 and p = 2 (see, e.g., Bakry et al. [13]).We choose only the latter case in order to avoid other technical problems connected with the Wasserstein dynamic when p = 1.See, e.g., the first comment opening Section 8.3 of Ambrosio et al. [5].Therefore, in order to obtain an explicit upper bound for the constant L (n) 0 it is useful to consider the following proposition; the proof is deferred to Appendix A.5 Proposition 3.8.In addition to the assumptions of Theorem 3.2, suppose that g ∈ C 1 (Θ; B * ), that * π(dθ) < +∞, and that map S • g is continuous.Then, for the constant In addition, if Remark 3.9.When π is a Gaussian measure on the infinite-dimensional Hilbert space Θ, an analogous statement can be formulated with the Fréchet derivative replaced by the Malliavin derivative.Hence, for the constant L (n) 0 in (25) we can set and if holds with V δn (θ 0 ) := (S • g) −1 (U δn (S 0 )), then the following inequality holds true Denote by ⇒ the weak convergence of probability measures on (Θ, B(Θ)).Verifying the validity of (42) represents a strengthening of the fact that, as This may be proved by means of the same arguments as in the proofs of Proposition 3.  
[43, Chapter 18]), the aforesaid mapping proves also to be strictly convex, at least in finite dimension. In this context, there are several conditions that entail the upper bound $C(\theta')/n$ for every n ∈ N and some positive constant $C(\theta')$. In particular, the simplest condition to quote is the so-called Bakry-Émery condition, characterized by the fact that for some ρ > 0, with Id being the identity matrix, uniformly with respect to θ ∈ Θ, in conjunction with the hypothesis that $\pi(d\theta) = e^{-U(\theta)}d\theta$ for some $U \in C^2(\Theta)$. Some generalizations of the condition (47) are given in the next proposition, which specifies some results that first appeared in Bakry et al. [13].
(2) If, in addition, there exist

According to Proposition 3.10, in the finite-dimensional setting the prior distribution does not affect the large n asymptotic behaviour of the weighted Poincaré-Wirtinger constant. A similar phenomenon has been observed in the study of the first term on the right-hand side of (25). Differently from the finite-dimensional setting, the literature on weighted Poincaré-Wirtinger constants in the infinite-dimensional setting appears to be not well developed. To the best of our knowledge, in the infinite-dimensional setting, upper bounds on weighted Poincaré-Wirtinger constants are limited to the case of Gibbsean (Boltzmann) measures, that is, measures of the form $\exp\{-nG(\theta)\}\pi(d\theta)$ with G being a smooth convex function and π being an infinite-dimensional Gaussian measure (Da Prato [31]). While this is the case of interest in our problem, the upper bounds available in the literature are not sharp for large values of n, and therefore they cannot be applied. The next proposition covers this critical gap by providing results involving Malliavin calculus; the proof is deferred to Appendix A.6. The proposition is of independent interest in the context of weighted Poincaré-Wirtinger constants.

Proposition 3.11. Let Θ be a separable Hilbert space, and let π be the non-degenerate Gaussian measure N(m, Q), with m ∈ Θ and Q a trace-class operator. Let $G_0 : \Theta \to \Theta$ be a compact linear operator, with trivial kernel. Let G be an element of $C^2(\Theta)$, bounded from below and such that $\mathrm{Hess}(G(\theta)) \geq G_0$ (in the sense of operators) whenever $\|\theta\|_\Theta \leq R$, for some R > 0. Suppose there exists a Fourier orthonormal basis $\{e_k\}_{k\geq 1}$ of Θ which diagonalizes simultaneously both Q and $G_0$, that is $Q[e_k] = \lambda_k e_k$ and $G_0[e_k] = \eta_k e_k$ for two suitable sequences $\{\lambda_k\}_{k\geq 1}$ and $\{\eta_k\}_{k\geq 1}$ of strictly positive numbers that go to zero as k → +∞, with $\{\lambda_k\}_{k\geq 1} \in \ell_1$.
(1) Suppose, in addition, there exists c > 0 such that θ where G R := sup BR D θ G and C R is an explicit universal constant only depending on R.
(2) Suppose, in addition, there exist $c_1 > 0$, $c_2 > 0$ such that whenever $\|\theta\|_\Theta \geq R$, where $D_\theta$ and $L_\pi$ denote the Malliavin derivative and the Malliavin-Laplace operator associated to π, respectively. Then, for every $n > 1 + 1/c_2$, it holds

Proposition 3.11 shows that the large n asymptotic behavior of the Poincaré-Wirtinger constant holds for any a, b > 0. A particular merit of Proposition 3.11 consists in showing explicitly that, within the infinite-dimensional setting, PCRs are influenced by two quantities that do not appear in the finite-dimensional setting of Proposition 3.10: i) the rate of approach to zero of the sequence $\{\lambda_k\}_{k\geq 1}$, which measures the "regularity of the prior"; ii) the rate of approach to zero of the sequence $\{\eta_k\}_{k\geq 1}$, which measures another "regularity of the model". A similar phenomenon has been observed in the study of the first term on the right-hand side of (25). To conclude, we observe that, under the assumptions of Proposition 3.4, we can apply Equation (26) to rewrite the supremum on the right-hand side of (43), and then observe that the role of θ′ is now confined to the multiplicative constants that appear on the right-hand sides of the various inequalities that we considered. Thus, in order to handle the supremum, it is enough to check the boundedness of such multiplicative constants by standard arguments of continuity.

We conclude this section by summarizing our results on the large n asymptotic behaviours of the terms on the right-hand side of (25). The second and the third terms go to zero exponentially fast, and this holds true independently of the dimension of the statistical model. This confirms that the main contribution to the PCR arises from the first and the fourth terms, which generally give algebraic rates of convergence to zero. In the finite-dimensional setting, the first and the fourth terms go to zero as $n^{-1}$, which is the optimal rate. In the infinite-dimensional setting, the first and the fourth terms go to zero according to Proposition 3.4 and Proposition 3.11. At least when with a, b, c > 0, Equation (39) and Equation (51) show that the first term on the right-hand side of (25) is asymptotically equivalent to $n^{-\frac{a}{2(a+b+1)}} + n^{-\frac{c}{2(a+b+1)}}$, whereas the fourth term on the right-hand side of (25) is asymptotically equivalent to . This completes our analysis of PCRs in the setting of infinite-dimensional exponential families. Some applications of these results will be presented in Section 4 with respect to specific statistical models.

PCRs for a general dominated statistical model
We present a more general version of Theorem 3.2, which relies on the assumption that both the sample space X and parameter space Θ have richer analytical structures.As in Section 3.1, we confine to the case p = 2.In particular, the setting that we consider may be summarized through the following assumptions.iii) [43,Theorem 18]) vi) for any θ, there exist positive constants b(θ), c(θ) for which hold for every x ∈ X; vii) π ∈ P 2 (Θ), with full support; viii) µ 0 ∈ P 2 (X).
The setting of infinite-dimensional exponential families, considered in Section 3.1.3,is a popular example that satisfies Assumptions 3.12.Now, we state the theorem on PCRs in the setting of Assumptions 3.12; the proof is deferred to Appendix A.7 Theorem 3.13.Within the setting specified by Assumptions 3.12, (7) is fulfilled with where γ ∈ S = P 2 (X).Moreover, (9) holds relatively to a suitable choice of a W < +∞ for any n ∈ N. Thus, the assumptions of Lemma 2.3 are fulfilled and a PCR at θ 0 is given by where e (ξ) n )] is the speed of mean Glivenko-Cantelli.
From Theorem 3.13, we observe that if finite for every n ∈ N, then the expression on the right-hand side of (56) reduces to the first two terms. Similarly to Theorem 3.2, Theorem 3.13 provides an implicit form for the PCR, thus requiring further investigation of the large n asymptotic behaviour of the terms on the right-hand side of (56). The posterior distribution appears in (55) and (56), meaning that further work is required to obtain more explicit terms. In general, it is possible to get rid of $\pi^*_n$ in (55) and (56), thus reducing (55) and (56) to expressions that involve only the statistical model and the prior distribution. The first term on the right-hand side of (56) has the same form as in (25), meaning that the Laplace method plays a critical role in the study of this term. Such a term can be handled as described in Proposition 3.3 and Proposition 3.4. With regard to $\varepsilon_{n,2}(X, \mu_0)$, we recall from Fournier and Guillin [47, Theorem 1] that, if $\int_X |x|^q \mu_0(dx) < +\infty$ for some q > 2, then

$$\varepsilon_{n,2}(X, \mu_0) \leq C(q, m) \begin{cases} n^{-1/4} + n^{-(q-2)/(2q)} & \text{if } m = 1, 2, 3 \text{ and } q \neq 4 \\ n^{-1/4}\log(1+n) + n^{-(q-2)/(2q)} & \text{if } m = 4 \text{ and } q \neq 4 \\ n^{-1/m} + n^{-(q-2)/(2q)} & \text{if } m > 4 \text{ and } q \neq m/(m-2) \end{cases}$$

with some positive constant C(q, m). Under some more restrictive assumptions, $\varepsilon_{n,2}(X, \mu_0)$ is of order $O(n^{-1/2})$, which is optimal in dimension 1 (Bobkov and Ledoux [17, Section 5]). In dimension 2, the optimal rate is (log n)/n (Ambrosio et al. [7]), whereas for m ≥ 3 the optimal rate is $n^{-1/m}$ (Talagrand [81]). Lastly, when X has infinite dimension, logarithmic rates have been obtained in Jing [60]. With regard to P[e 0 ], we refer to Bolley et al. [19, Theorem 2.7]. In particular, if $\int_X |x|^q \mu_0(dx) < +\infty$ for some q ≥ 1, then for any t > 0 and n ∈ N, with some positive constant B(q, m). Exponential bounds can also be obtained upon requiring that $\int_X e^{\alpha|x|} \mu_0(dx) < +\infty$ for some α > 0. See Bolley et al. [19, Theorem 2.8]. In the next corollary we show that, under additional assumptions, similar bounds hold true for the other terms appearing on the right-hand side of (56); the proof is deferred to Appendix A.8.
Corollary 3.14. In addition to the hypotheses of Theorem 3.13, suppose that there exist constants C > 0 and β ≥ 2 for which holds for all θ ∈ Θ. Moreover, assume that $\int_X |x|^q \mu_0(dx) < +\infty$ holds for some constant q > 4, and for all n ∈ N and some r > 2. Then, if the neighborhood for some K > 0 and a ∈ [0, 1/4), for the PCR given in (56) we obtain the new bound $+\, C_2\, n^{[q(r-1)(a-1/4)]/r} M_{n,r}$ with suitable positive constants $C_1$ and $C_2$.
From Corollary 3.14, the posterior distribution appears in (58) and (59). With regard to (58), this term is typically available in an explicit form, even if the posterior is not explicit. In general, a possible strategy may rely on well-known bounds for moments of martingales. With regard to $L^{(n)}_0$, this term can be handled as described in Proposition 3.10 and Proposition 3.11, that is by inequalities for the weighted Poincaré-Wirtinger constant. To conclude, it remains to handle the supremum term, which is expected to be bounded with respect to n in regular situations. To deal with this term, a possible strategy consists in obtaining an inequality of the form for a suitable constant $C_\gamma$ and a suitable function W. This particular point will be made more precise in Section 4 with respect to some specific statistical models.
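As an illustration of the mean Glivenko-Cantelli rate $\varepsilon_{n,2}(X, \mu_0)$ discussed above, the following minimal sketch estimates by Monte Carlo the expected 2-Wasserstein distance between the empirical measure of n i.i.d. observations and their common law. It rests on assumptions that are not part of the paper: $\mu_0$ is taken to be the uniform law on [0, 1] (so that m = 1), and the function names `w2_squared_to_uniform` and `mean_w2` are hypothetical. In dimension one the distance has a closed form through quantile functions, which is what the code uses.

```python
import math
import random


def w2_squared_to_uniform(sample):
    """Squared 2-Wasserstein distance between the empirical measure of
    `sample` (points in [0, 1]) and the uniform law on [0, 1].

    In dimension one, W_2^2 is the integral over u in (0, 1) of
    (F_n^{-1}(u) - u)^2, and the empirical quantile function F_n^{-1}
    equals the i-th order statistic on ((i - 1)/n, i/n].
    """
    xs = sorted(sample)
    n = len(xs)
    total = 0.0
    for i, x in enumerate(xs, start=1):
        a, b = (i - 1) / n, i / n
        # Exact value of the integral of (x - u)^2 for u in [a, b].
        total += ((x - a) ** 3 - (x - b) ** 3) / 3.0
    return total


def mean_w2(n, reps=200, seed=0):
    """Monte Carlo estimate of E[W_2(empirical measure, uniform law)]."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(reps):
        sample = [rng.random() for _ in range(n)]
        acc += math.sqrt(w2_squared_to_uniform(sample))
    return acc / reps


if __name__ == "__main__":
    # The rescaled column should stay roughly constant if the rate is n^{-1/2}.
    for n in (100, 400, 1600):
        eps = mean_w2(n)
        print(n, eps, eps * math.sqrt(n))
```

Under these assumptions the rescaled values $\sqrt{n}\,\varepsilon_{n,2}$ should remain of the same order as n grows, in line with the $O(n^{-1/2})$ behaviour recalled above for dimension 1; the sketch says nothing about higher dimensions, where no such closed form is available.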

Regular parametric models
Consider the case of dominated Bayesian statistical models with a finite-dimensional parameter $\theta \in \Theta \subset \mathbb{R}^d$. Accordingly, we start by considering the set of Assumptions 3.12, with d ∈ N, along with the hypotheses of Theorem 3.13. In this setting, the Kullback-Leibler divergence $K(\theta\,|\,\theta_0)$ is a $C^2$ function, whose Hessian at $\theta_0$ coincides with the Fisher information matrix at $\theta_0$. Whence, as $\theta \to \theta_0$. Finally, we assume (29). Therefore, we can apply Proposition 3.3 to get as n → +∞. Now, we discuss the behavior of the constant $L^{(n)}_0$, as n goes to infinity. First, we would like to stress that there are plenty of conditions that entail for every n ∈ N and some positive constant C(γ). We consider the double integral , that is the model is an element of the exponential family in the canonical form, then we notice that $D_\theta \frac{\nabla_x f(x|\theta)}{f(x|\theta)}$ reduces to a d×m matrix whose entries are given by $\partial_{x_j} T_i(x)$, for j = 1, . . ., m and i = 1, . . ., d. Therefore, the study of the above double integral boils down to that of the much simpler expressions $\int_X |\partial_{x_j} T_i(x)|^2 \gamma(dx)$, which are independent of n. More generally, we can reduce the problem by resorting to the Laplace method for approximating probability integrals, from which we have that as n → +∞, where $\theta^*(\gamma)$ denotes a maximum point of the mapping $\theta \mapsto \int_X \log f(y|\theta)\gamma(dy)$. Therefore, if the above right-hand side proves to be positive, a reasonable plan to prove global boundedness of $L^{(n)}_0$ with respect to n can be based on the following two steps. First, we check the validity of an inequality like for every γ belonging to a W (P(Θ)) 2 -neighborhood of $\mu_0$, where C is a positive constant possibly depending on the fixed neighborhood. Second, we prove global boundedness (for γ varying in the neighborhood) of the following integral γ(dx) < +∞ .
To fix ideas in a more concrete way, we consider the Gaussian case, where $\theta = (\mu, \Sigma)$ and $f(\cdot\,|\,\theta)$ is the Gaussian density with mean $\mu$ and covariance matrix $\Sigma$. Note that the mapping $\theta \mapsto \int_X \log f(y\,|\,\theta)\gamma(dy)$ depends on γ only through its moments of order 1 and 2. Thus, the above strategy reduces to an ordinary finite-dimensional maximization problem, very similar to the question of finding the maximum likelihood estimator. Finally, the last term on the right-hand side of (56) can be treated as in Corollary 3.14, by studying the asymptotic behavior of some posterior r-moment as in (58). We state two propositions that summarize the above considerations. The former result holds when Theorem 3.2 can be applied and gives the optimal rate, while the latter result ensues from Theorem 3.13.
Proposition 4.1. Assume that there exist a separable Banach space B with dual $B^*$ and two measurable maps β : X → B and g : Θ → $B^*$ for which (19) is in force. If the assumptions of Theorem 3.2 and Propositions 3.3, 3.8 and 3.10 are met, then as n → +∞ which is the optimal rate.
which is the optimal rate, at least when m = 1.

Multinomial models
Consider the case in which the observations, i.e. both the sequence {X i } i≥1 and the sequence {ξ i } i≥1 , take values in the finite set, say {a 1 , . . ., a N }.It is easy to check that Θ can be assumed to coincide with the interior of the (N − 1)-dimensional simplex where θ = (θ 1 , . . ., θ N −1 ), t = (t 1 , . . ., t N −1 ), θ Of course, if we put X = {a 1 , . . ., a N }, we can not directly apply Theorem 3.13.Nonetheless, we can resort to a reinterpretation of the data, in terms of the frequencies ν n,i , that we now explain, that allows the use of our theorem.We consider defined for p = (p 1 , . . ., p N −1 ) ∈ ∆ N −1 with the usual proviso that p The problem of consistency, and the allied question of finding a PCR, can be now reformulated as follows.
After fixing θ 0 ∈ ∆ N −1 , we consider the sequence {ξ i } i≥1 of i.i.d.random variables, each taking values in {a 1 , . . ., a N }, with P[ξ 1 = a i ] = θ 0,i , for i = 1, . . ., N .An analogous version of Lemma 2.3 states that provides a PCR at θ 0 .Now, we reformulate Theorem 3.13 as follows.First of all, we have that replaces the speed of the mean Glivenko-Cantelli convergence.Then, we have that The relation analogous to that in (56), which gives a PCR at θ 0 , reads as follows where {δ n } n≥1 provides a sequence of positive numbers and L (n) 0 is defined as follows We show that the PCR in (62) reduces to a simpler expression.Indeed, the first term on the right-hand side of ( 62) is similar to the one already studied in the previous section.By resorting to the same theorems from Breitung [22], recalling that the mapping θ → K(θ | θ 0 ) is minimum when θ = θ 0 , we get, as n → +∞, provided that π has full support.As for the second terms on the right-hand side of (62), we have already shown that the expectation is controlled by 1/ √ n.Apropos of the constant L (n) 0 (δ n ), we can easily show that it is bounded, at least whenever θ 0 is fixed in the interior of ∆ N −1 .In fact, δ n can be chosen equal to any positive constant δ less than the distance between θ 0 and the boundary of ∆ N −1 .In particular, by exploiting the convexity of the mapping θ → K(θ | p), we can resort to Proposition 3.10, upon assuming more regularity on the prior distribution π, in order to obtain [C 2 (π * n (• | p))] 2 ≤ C(δ)/n, with a positive constant C(δ) which is independent of p.Then, by means of a direct computation, under the above conditions on θ 0 and δ we can show that the integral can be bounded uniformly in n.To conclude the analysis of the terms on the right-hand side of ( 62), we only need to exploit the boundedness of |θ|, as θ varies in ∆ N −1 , to show that the third and the fourth terms are both bounded by a multiple of Thus, if θ 0 is in the interior of ∆ N −1 and δ 
n = δ, for the same δ as above, it is well known from the theory of large deviations that this probability goes to zero exponentially fast. See Dembo and Zeitouni [34, Chapter 2] for a detailed account. To conclude, we state a proposition that summarizes the above considerations.
Proposition 4.3. Let N ≥ 2 be an integer. Let π be a prior on $\Delta_{N-1}$. If π has a density q (with respect to the Lebesgue measure) such that $q \in C^1(\Delta_{N-1})$ and q(θ) = 0 for any $\theta \in \partial\Delta_{N-1}$, then as n → +∞ which is the optimal rate.
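The $1/\sqrt{n}$ behaviour in the multinomial model can be illustrated numerically in a conjugate special case. The sketch below is only an illustration under stated assumptions, not the argument of the paper: the prior is taken to be a Dirichlet(2, ..., 2) law, whose density is smooth and vanishes on the boundary of the simplex; the parameter is represented by the full vector of N cell probabilities rather than by N − 1 coordinates; and the function names are hypothetical. By conjugacy the posterior is Dirichlet with parameters given by prior parameters plus cell counts, so the posterior expectation of $|\theta - \theta_0|^2$ is available in closed form.

```python
import random


def posterior_second_moment(counts, alpha, theta0):
    """Exact posterior expectation of |theta - theta0|^2 under the
    Dirichlet(alpha + counts) posterior of a multinomial model."""
    post = [a + c for a, c in zip(alpha, counts)]
    total = sum(post)
    value = 0.0
    for p, t0 in zip(post, theta0):
        mean = p / total
        # Variance of one Dirichlet coordinate.
        var = mean * (1.0 - mean) / (total + 1.0)
        # E[(theta_i - t0)^2] = variance + squared bias.
        value += var + (mean - t0) ** 2
    return value


def simulate(n, theta0, alpha, seed=0):
    """Draw n observations with cell probabilities theta0 and return the
    posterior expectation of |theta - theta0|^2."""
    rng = random.Random(seed)
    draws = rng.choices(range(len(theta0)), weights=theta0, k=n)
    counts = [0] * len(theta0)
    for d in draws:
        counts[d] += 1
    return posterior_second_moment(counts, alpha, theta0)


if __name__ == "__main__":
    theta0 = [0.2, 0.3, 0.5]   # true parameter, interior of the simplex
    alpha = [2.0, 2.0, 2.0]    # assumed prior: density vanishes on the boundary
    for n in (100, 1000, 10000):
        m2 = simulate(n, theta0, alpha)
        # n * m2 staying bounded corresponds to a 1/sqrt(n) contraction.
        print(n, m2, n * m2)
```

If the printed values of `n * m2` remain of order one, then by Markov's inequality the posterior mass outside a ball of radius $M_n/\sqrt{n}$ around $\theta_0$ vanishes for any $M_n \to +\infty$, which is the behaviour summarized in the proposition.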

Finite-dimensional logistic-Gaussian model
Consider a class of dominated statistical models specified by density functions of the form where, for simplicity, we have fixed , that is the one-dimensional Lebesgue measure restricted to [0,1], and Γ N (x) := (sin πx, sin 2πx, . . ., sin N πx) .
Of course, the expression $\theta \cdot \Gamma_N(x)$ represents a Fourier polynomial and, for sufficiently large N, can approximate very well any smooth function, in various norms. This model has been studied in connection with the problem of density estimation (Crain [28,29], Lenk [62,63]), essentially as a toy model. In the following section, we will analyze its infinite-dimensional generalization, which is a more flexible statistical model, even if more complex from a mathematical point of view.
To apply Theorem 3.2, we start by fixing θ 0 ∈ Θ, so that µ 0 (dx) = f (x|θ 0 )dx, where x → f (x|θ 0 ) is a continuous and bounded density function on [0, 1].Then, we let {ξ i } i≥1 be a sequence of independent random variables identically distributed as the probability law µ 0 .The model (63) satisfies Definition 3.1 with B = Θ, B * = Θ (by Riesz's representation theorem) and Γ = B * , with B * •, • B being identified with the standard (Euclidean) scalar product of R N .The function β coincides with Γ N (x), while g is the identity function.Finally, we have that ΓN (x) dx which proves to be a convex function, steep, of class C ∞ (Θ) and analytic.Therefore, we have a regular exponential family, in canonical form.As for the prior distribution, besides the multivariate (non-degenerate) Gaussian distribution N (m, Q), with m ∈ Θ and Q being a symmetric and positive-definite N × N matrix, any other distribution of log-concave form like π(dθ) ∝ exp{−U (θ)}dθ fits our assumptions, provided that U is of class C 2 (Θ) and strongly convex.
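To make the model concrete, the following sketch evaluates the finite-dimensional logistic-Gaussian density numerically. It assumes the canonical exponential-family form $f(x\,|\,\theta) = \exp\{\theta \cdot \Gamma_N(x) - M(\theta)\}$ with $M(\theta) = \log \int_0^1 e^{\theta \cdot \Gamma_N(x)} dx$, which is how the description above reads; the midpoint quadrature and the names `gamma_N`, `log_partition` and `density` are choices made here for illustration only.

```python
import math


def gamma_N(x, N):
    """Sine features (sin(pi x), sin(2 pi x), ..., sin(N pi x))."""
    return [math.sin(k * math.pi * x) for k in range(1, N + 1)]


def log_partition(theta, grid=400):
    """M(theta): log of the integral over [0, 1] of exp(theta . Gamma_N(x)),
    approximated by the midpoint rule on `grid` cells."""
    N = len(theta)
    h = 1.0 / grid
    s = 0.0
    for j in range(grid):
        x = (j + 0.5) * h
        s += math.exp(sum(t * g for t, g in zip(theta, gamma_N(x, N))))
    return math.log(s * h)


def density(x, theta, grid=400):
    """Assumed model density f(x | theta) = exp(theta . Gamma_N(x) - M(theta))."""
    dot = sum(t * g for t, g in zip(theta, gamma_N(x, len(theta))))
    return math.exp(dot - log_partition(theta, grid))


if __name__ == "__main__":
    theta = [0.5, -1.0, 0.25]
    print([round(density(x / 4.0, theta), 4) for x in range(5)])
    # Each feature is bounded by 1, so |Gamma_N(x)|^2 <= N for every x.
    N = 5
    print(max(sum(g * g for g in gamma_N(j / 1000.0, N)) for j in range(1001)) <= N)
```

For θ = 0 the density reduces to the uniform density on [0, 1], and the elementary bound $|\Gamma_N(x)| \leq \sqrt{N}$ follows from each sine being bounded by one in absolute value.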
Coming back to the application of Theorem 3.2, we check the validity of the assumptions.First, |Γ N (x)| ≤ √ N for all x ∈ [0, 1], so that ( 21) is in force.Then, (24) and Θ θ ap π(dθ) < +∞ for some a > 1 hold, because of the assumptions on the prior distribution.Thus, the bound (25) provides the desired PCR, so that we proceed further by analyzing the various terms as in Section 3.1.Since is strictly positive definite, we can apply Proposition 3.3.We conclude that the first term on the right-hand side of (25) goes to zero as 1 √ n .Then, the boundedness condition |Γ N (x)| ≤ √ N for all x ∈ [0, 1] entails that the second and the third terms on the right-hand side of (25) go to zero exponentially fast by means of classical concentration inequalities, like Bernstein's inequality for instance (Boucheron et al. [20], Dembo and Zeitouni [34]).Finally, we consider the last term on the right-hand side of (25).In particular, by Jensen's inequality as n → +∞, where Ŝn = n −1 n i=1 Γ N (ξ i ) and Then, we apply Proposition 3.8.Since D θ [g] coincides with the identity operator, we conclude that We conclude our analysis by estimating the weighted Poincaré-Wirtinger constant by means of Proposition 3.10.Indeed, a common feature of these logistic models is that the behavior of the Kullback-Leibler K(θ|θ 0 ) is twofold: it is quadratic as θ varies around θ 0 , while it is linear as |θ| → +∞.Thus, the strong Bakry-Emery condition does not apply here, and we resort to the boundedness condition for all |θ| ≥ R. To check the validity of this lower bound, we fix for simplicity θ 0 = 0, to get . 
By fixing a unitary vector σ ∈ S N −1 and considering θ = tσ, the Laplace approximation yields that the above right-hand side is asymptotic to as t → +∞.At this stage, the function proves to be continuous and non-negative on S N −1 .The minimum of such a function must be positive, otherwise there would exist σ ∈ S N −1 for which the map x → σ • Γ N (x) turns out to be constant.But this contradicts the linear independence of the Fourier basis Γ N , yielding that the function in ( 65) must be strictly positive.This fact validates (64).Thus, by point (1) of Proposition 3.10, and the square of the weighted Poincaré-Wirtinger constant is asymptotic to 1/n.All the above considerations can be summarized in the following proposition.

Infinite-dimensional logistic-Gaussian model
Consider a class of dominated statistical models specified by density functions of the form 1 0 e θ(y) λ(dy) x ∈ X, θ ∈ Θ (66) where we have fixed , that is the one-dimensional Lebesgue measure restricted to [0,1].As for the parameter space Θ, we set thought of as an infinite-dimensional Hilbert space endowed with scalar product and norm Here, the well-known Sobolev embedding theorem (Maz'ya [67]) states that H 1 * (0, 1) is continuously embedded in C 0 [0, 1], and therefore the above notations θ(x) and φ(0) are referred to the continuous representatives of θ and φ, respectively.
The infinite-dimensional logistic-Gaussian model is typically considered in connection with the fundamental problem of density estimation (Crain [28,29], Lenk [62,63]). Under the assumption that the prior is a Gaussian measure, Bayesian consistency is investigated in Tokdar and Ghosh [82], whereas PCRs are provided in Giné and Nickl [56], Rivoirard and Rousseau [72], Scricciolo [74] and van der Vaart and van Zanten [84]. These results consider the set Θ to be the space of all density functions on [0, 1], typically endowed with the total variation distance, the Hellinger distance, some $L^p$ norm or the Kullback-Leibler divergence. Our approach to PCRs relies on the choice (67), so that our PCRs refer to Definition 2.2 with $d_\Theta$ equal to the $H^1_*(0, 1)$-norm. This metric is generally stronger, since a Sobolev norm is, for suitable exponents, greater than the $L^r$ norm considered in Giné and Nickl [56] and, in turn, greater than the (squared) Hellinger distance, as proved in Scricciolo [74, Lemma A.1]. In connection with the statistical model (66), the work of Fukumizu [48] provides an implicit Riemannian structure on the space of densities which is modeled on the metric of the underlying space Θ, that is, the Riemannian distance between two densities $f(\cdot\,|\,\theta_1)$ and $f(\cdot\,|\,\theta_2)$ turns out to be locally equivalent to $\|\theta_1 - \theta_2\|_{H^1_*(0,1)}$. Another (geometrical) view of the set $\{f(\cdot\,|\,\theta)\}_{\theta\in\Theta}$, which is simply thought of as a differential manifold, is provided in Pistone and Rogantin [71].
We provide PCRs for the model ( 66) on the basis of Theorem 3.2.We start by fixing θ 0 ∈ Θ, with Θ being the same as in (67).Whence, µ 0 (dx) = f (x|θ 0 )dx, where x → f (x|θ 0 ) is a continuous and bounded density function on [0, 1].Then, we let {ξ i } i≥1 be a sequence of independent random variables identically distributed with probability law µ 0 .At this stage, we notice that the model (66) satisfies Definition 3.1 with B = Θ, B * = Θ (by Riesz's representation theorem) and Γ = B * .For completeness, we specify that also the pairing B * •, • B is identified with the scalar product •, • as in (68), again by Riesz's representation theorem.In this setting, we deduce that the function β in Definition 3.1 coincides with the Riesz representative of the δ x functional, for any x ∈ [0, 1], that is β x (z) := z1 [0,x] (z) + x1 (x,1] (z) for z ∈ [0, 1], since, for any θ ∈ Θ, θ(x) = θ, β x .Lastly, we fix g as the identity map on Θ, so that ( 19) is satisfied and As for the prior π, we assume that it is a Gaussian measure on Θ, with mean m ∈ Θ and covariance operator Q : Θ → Θ.We recall that Q is a trace operator with eigenvalues {λ k } k≥0 that satisfy ∞ k=0 λ k < +∞.See Da Prato [30,31], and references therein, for a review on Gaussian measures on Hilbert spaces.Now, we check the validity of the assumptions of Theorem 3.2.First, we have that yielding that ( 21) is trivially satisfied.In particular, the element S 0 is given by and Ŝn = n −1 n i=1 β ξi .We recall that Ŝn → S 0 as n → +∞, in both P-a.s. 
and L 2 sense, by the Laws of Large Numbers in Hilbert spaces (Ledoux and Talagrand [64,Corollary 7.10]).Now, we observe that (24) boils down to write that [31,Proposition 1.15].Then, condition iii) of Theorem 3.2 is trivially satisfied.Finally, with regards to iv), we mention that any sequence δ n ∼ n −q with q ∈ [0, 1/2), as n → +∞, is valid as far as we verify the validity of ( 9), as we will do just below.After these preliminaries, we start analyzing the four terms on the right-hand side of (25).
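The reproducing property $\theta(x) = \langle \theta, \beta_x \rangle$ used in the preliminaries above can be checked numerically. The sketch below assumes that $H^1_*(0, 1)$ consists of functions vanishing at 0 and carries the scalar product $\langle \theta, \varphi \rangle = \int_0^1 \theta'(z)\varphi'(z)dz$; this is consistent with the formula for $\beta_x$ given above, but it is an assumption made here for illustration. The test function and the names `beta` and `inner_h1_star` are hypothetical.

```python
import math


def beta(x, z):
    """Riesz representative of the evaluation at x:
    z on [0, x] and x on (x, 1], that is min(z, x)."""
    return z if z <= x else x


def inner_h1_star(dtheta, x, grid=20000):
    """<theta, beta_x> under the assumed scalar product, i.e. the integral
    over [0, 1] of theta'(z) * beta_x'(z). Since beta_x' equals 1 on (0, x)
    and 0 on (x, 1), the integral is restricted to (0, x) and computed
    with the midpoint rule."""
    h = x / grid
    return sum(dtheta((j + 0.5) * h) for j in range(grid)) * h


if __name__ == "__main__":
    # Hypothetical test function with theta(0) = 0.
    theta = lambda z: math.sin(3.0 * z) + z * z
    dtheta = lambda z: 3.0 * math.cos(3.0 * z) + 2.0 * z
    for x in (0.2, 0.7, 1.0):
        print(x, theta(x), inner_h1_star(dtheta, x))
```

The two printed columns should agree up to quadrature error, since under the stated assumption the scalar product reduces to $\int_0^x \theta'(z)dz = \theta(x) - \theta(0) = \theta(x)$.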
As for the first term on the right-hand side of ( 25), we study (28).We observe that where D θ0 M represents the (Riesz representative of) the Fréchet differential of M : Θ → R at θ 0 , while Hess θ0 [M ] = I(θ 0 ) stands for the Hessian operator of M at θ 0 , which coincides with the Fisher information operator I(θ 0 ).In particular, in the last identity we have used the Taylor expansion of M around θ 0 .In view of a more concrete characterization of D θ0 M and I(θ 0 ), we write that where R(h; µ 0 ) ≤ C(µ 0 ) h 3 Θ for h Θ ≤ 1, with some suitable constant C(µ 0 ) depending solely on µ 0 .In particular, a straightforward integration by parts shows that , by means of Riesz's representation.Moreover, with the same technique, we obtain for any y ∈ [0, 1] and h ∈ Θ. Tthe above left-hand side should be read as follows: first, the operator Hess θ0 [M ], applied to h ∈ Θ, gives a new element of Θ, called Hess θ0 [M ][h]; second, this new object, as a continuous function evaluated at y, coincides with the right-hand side.Finally, integration by parts entails that for any h ∈ Θ.The way is now paved for the application of Proposition 3.4 and Remark 3.7.As first step, we check that the operator in (71), from Θ to itself, is compact.As for the term h, Φ 0 Φ 0 , it defines a finite-rank operator, which is of course compact.As for the term 2 ]dz, it is enough to pick a bounded sequence, say {h n } n≥1 , in Θ, and study the sequence {Ψ n } n≥1 given by Ψ n (y) := 2 y 0 h n (z)[1 − F 0 (z)]dz.Now, from the well-known properties of weak topologies of separable Hilbert spaces, we can extract a subsequence {h nj } j≥1 , which converges weakly to some h * ∈ Θ. Whence, h nj converges uniformly (i.e. 
in the strong topology of C 0 [0, 1]) to h * , by the Rellich-Kondrachov embedding theorem.Consequently, it is trivial to get that the sequence {Ψ nj } j≥1 converges strongly in Θ to Ψ * (y) := 2 as j → +∞.This proves that the operator in (71), from Θ to itself, is a compact operator, even if it is not self-adjoint.Then, we resort to Remark 3.7, noticing that where Hess 0 [M ][h] is defined by (71) with θ 0 ≡ 0, to re-write the above relations as or simply as Hess θ0 [M ] ≥ I † .By means of the above argument, I † , as a linear operator from Θ to itself, is again compact, but not self-adjoint.By a straightforward integration by part, we find that a self-adjointized version of I † is given by with x ∈ [0, 1] and k ∈ N.After having fixed the Fourier basis {e k } k≥1 , we can further specify the prior distributions in terms of the probability laws of the random elements Ξ, with values in Θ, of the form (Karhunen-Loève representation) Here, {Z k } k≥1 is a sequence of independent real-valued random variables with Z k ∼ N (m k , λ k ), for suitable sequences m := {m k } k≥1 ⊂ R and {λ k } k≥1 ⊂ (0, +∞) with {m k } k≥1 ∈ ℓ 2 and {λ k } k≥1 ∈ ℓ 1 .Thus, if π(B) := P [ Ξ ∈ B] for any B ∈ B(Θ), it is straightforward to check that π is a Gaussian measure on (Θ, B(Θ)) with mean m and covariance operator Q satisfying Q[e k ] = λ k e k .Whence, (37) is verified.To justify the validity of (38), we check the remaining assumptions of Proposition 3.4.First, it is trivial to check that θ → K(θ|θ 0 ) belongs to C ∞ (Θ), so that we can put q = 1.Then, we consider points i)-iv).For simplicity, we again fix θ 0 ≡ 0, with no real loss of generality.We start with the definition of the space K, expressed as the closure of Θ with respect to the norm which represents, plainly speaking, a dual Sobolev norm of the function x → θ(x)− 1 0 θ(y)dy.The embedding Θ ⊂ K, with dense and continuous inclusion, follows from the Poincaré-Wirtinger inequality.Then, we notice that the function has two 
different behaviors according on whether the norm of θ is small or large.To be more precise, we fix σ ∈ Θ with σ Θ = 1 and then we set θ = tσ for any t ∈ (0, +∞).In particular, as t → 0, a straightforward argument based on Taylor expansions of the exponential and the logarithmic functions shows that On the other hand, as t → +∞, by means of a direct application of the Laplace method of approximation (Wong [89, Theorem 1.II]), we obtain the following expansion + with (a) + := max{a, 0}.Upon denoting by H −1 * (0, 1) the dual space of Θ, we can exploit that L 1 (0, 1) ⊂ H −1 * (0, 1), with continuous dense embedding, to obtain that max Therefore, ( 33)-( 34) are fulfilled with the above choice of the space K, and some φ : [0, +∞) → [0, +∞) which behaves quadratically for small arguments and linearly for large arguments, like Then, the choice of q = 1 entails that r ∈ (1, 3  2 ).Further insights on inequalities like (33) can be found in Bal et al. [8], while properties of homogeneous spaces like K have been recently investigated in Brasco et al. [21].As for the validity of the interpolation inequality (32), we can fix, for example, r = 4/3, s = 4 and start from the following specific version of the Gagliardo-Nirenberg interpolation inequality where H m (0, 1) denotes the standard (Hilbertian) Sobolev space of order m [23, Corollary 5.1].Applying this inequality to f (x) = θ(x) − 1 0 θ(y)dy, we get for all θ ∈ Θ such that d 8 dx 8 θ(x) ∈ L 2 (0, 1).Now, we define the Hilbert space V as the subspace of Θ formed by those θ ∈ Θ such that d 8 dx 8 θ(x) ∈ L 2 (0, 1), with the norm .
The inclusion V ⊂ Θ, with continuous and dense embedding, follows by means of the usual Sobolev embedding theorem [67]. At this stage, we make use of the other specific version of the Gagliardo-Nirenberg interpolation inequality given by f H 2 (0,1) to deduce that u L 2 (0,1) u holds for any $u \in C^\infty_c(0, 1)$ with $\int_0^1 u(x)\,dx = 0$. By combining this inequality with (74), we finally deduce (32) with $r = 4/3$ and $s = 4$. To guarantee that $\pi(V) = 1$, we can resort to the standard Kolmogorov three-series criterion to obtain that P < +∞ holds for any $t > 0$. By proceeding with the analysis of the other terms on the right-hand side of (25), we observe that the boundedness condition (70) entails a direct application of results in Pinelis and Sakhanenko [70] and Yurinskii [91], yielding that and in addition that, for any sequence $\{\delta_n\}_{n \geq 1}$ such that $\delta_n \sim n^{-q}$ with $q \in [0, 1/2)$, for a positive constant $C$ that depends only on $\mu_0$. It remains to deal with the asymptotic behavior of $L_0^{(n)}$ by combining Proposition 3.8 and Proposition 3.11. Here, we exploit once again the fact that $g$ coincides with the identity function, so that for all $S \in \Theta$.
Whence, where we have indicated our preference for a weighted Poincaré-Wirtinger constant, with respect to the Malliavin derivative. Indeed, we can argue as in the finite-dimensional setting, exploiting the key observation that the Kullback-Leibler divergence $K(\theta|\theta_0)$ behaves quadratically if $\theta$ varies around $\theta_0$, while it is linear as $\|\theta\| \to +\infty$. To be more precise, we can use the same arguments developed above to show that the choice $G_0 = I^*$ fits the requirements of Proposition 3.11. Thus, the eigenfunctions $\{e_k\}_{k \geq 1}$ are the same as above, and $\eta_k = \gamma^*_k$. In order to exploit point (1) of Proposition 3.11, we can mimic the arguments already used in the previous section to prove (64). Actually, it works in the same way, with the sole difference that the function in (65) is now replaced by with $\|\sigma\|_{\Theta} = 1$, because the gradient is replaced by the Malliavin derivative. See Da Prato [30, Section 2.3]. Since $Q^{1/2}$ is a compact operator, the image of the bounded set $\{\sigma \in \Theta : \|\sigma\|_{\Theta} = 1\}$ through $Q^{1/2}$ is sequentially compact. Thus, the infimum of the function in (76) cannot be equal to zero. Finally, with the application of (49), which provides the rate of the weighted Poincaré-Wirtinger constant, the discussion is completed. To conclude our analysis, we state a proposition that summarizes all the above considerations.
Proposition 4.5.In connection with the model (66), let X = [0, 1] and Θ = H 1 * (0, 1).Let θ 0 ∈ Θ be fixed.Assume that π = N (m, Q) with m ∈ Θ and Q a non-degenerate trace-class operator satisfying (31).Fix the eigenfunctions {e k } k≥1 and the spaces K and V as in (72), ( 73) and (75), respectively.Finally, set γ * k as in (72), η k = γ * k and ω k according to the Fourier representation for some δ > 0, then points i)-iv) of Proposition 3.4 are valid, along with the assumptions of point (1) of Proposition 3.11.In conclusion, it holds To provide some hints on the optimality of our PCRs, it is useful to recall the discussion at the end of Section 3.1.At least in the simpler case when m = θ 0 , the above rate has the form O n − a−1 for all k for simplicity, a precise statement is as follows: if a > 1 and ε ∈ 0, a−1 2 , then the trajectories of the random process Ξ belong to H 1+ε (0, 1) almost surely.We notice that our rate is just slightly slower than the standard rate n − α 2α+1 which is proved in Giné and Nickl [56], Rivoirard and Rousseau [72], Scricciolo [74], where α is characterized by the fact that the random process Ξ belongs to H α (0, 1) almost surely.This slight discrepancy makes sense since our reference norm (i.e., the Sobolev norm of H 1 * ) is larger than any L p norm, for any p ∈ [1, +∞].To the best of our knowledge, our rate does not admit a fair comparison with any other known rate of consistency, neither Bayesian nor classical, because of the different choice of the loss function.The only fair comparison could be made with the rates obtained in Sriperumbudur et al. 
[78], which are nonetheless relative to distinguished classical estimators (see, in particular, Theorem 7, point ii) therein). Since these classical rates are slower than $n^{-1/3}$, we notice, in support of the optimality of our approach, that our rate is: i) arbitrarily close to the optimal (parametric) rate $n^{-1/2}$ as $a \to +\infty$; ii) faster than $n^{-1/3}$ as soon as $a > 9$, a condition which is surely met in the framework presented in Proposition 4.5, where $a = 15 + \delta$. Hence, a Bayesian estimator that shares our PCR as its rate of consistency performs better than the minimum-distance estimator proposed in Sriperumbudur et al. [78].

Infinite-dimensional linear regression
Consider a statistical model that arises from the popular linear regression. The observed data are the collection of pairs $(u_1, v_1), \dots, (u_n, v_n)$, such that: i) the $u_i$'s vary in an interval $[a, b] \subset \mathbb{R}$, and are modeled as i.i.d. random variables, say $U_1, \dots, U_n$, with a known distribution, say $\varpi(du) = h(u)du$, on $([a, b], \mathscr{B}([a, b]))$; ii) the $v_i$'s vary in $\mathbb{R}$, and are modeled as i.i.d. random variables $V_1, \dots, V_n$. The $V_i$'s depend stochastically on the $U_i$'s according to the relation
\[ V_i = \theta(U_i) + E_i, \qquad i = 1, \dots, n, \]
where $E_1, \dots, E_n$ are i.i.d. random variables with Normal $N(0, \sigma^2)$ distribution, while $\theta : [a, b] \to \mathbb{R}$ is an unknown continuous function. Assuming for simplicity that $\sigma^2 > 0$ is known, the statistical model is characterized by probability densities $f(\cdot|\theta)$ on $[a, b] \times \mathbb{R}$, with respect to the Lebesgue measure, given by
\[ f(u, v|\theta) = \frac{h(u)}{\sqrt{2\pi\sigma^2}} \exp\Big\{-\frac{(v - \theta(u))^2}{2\sigma^2}\Big\}. \]
The space $\Theta$ is chosen, as in the previous section, as a Sobolev space $H^s(a, b)$ with $s > 1/2$, which is continuously embedded in $C^0[a, b]$. Whence, upon fixing $\theta_0 \in \Theta$, On the other hand, from the Bayesian point of view, upon fixing a prior distribution $\pi$ on $(\Theta, \mathscr{T})$ and resorting to the Bayes formula, the posterior takes on the form .
Whence, for any probability measure γ ∈ P 2 ([a, b] × R), we can write the following .
Lastly, as for the Kullback-Leibler divergence, a straightforward computation yields
\[ K(\theta|\theta_0) = \frac{1}{2\sigma^2} \int_a^b |\theta(u) - \theta_0(u)|^2 h(u)\,du. \]
This statistical model is particularly versatile with respect to our theory, because it can be studied either as an infinite-dimensional exponential family or by means of Theorem 3.13 and Corollary 3.14. For example, to see that we can use the theory of infinite-dimensional exponential families, it suffices to consider the identities where $G(x, y; u, v)$ stands for the Green function of the set $[a, b] \times \mathbb{R}$. If $\theta$ varies in a sufficiently regular space, that is, if $s$ is sufficiently large, then $[\Delta((\theta(x) - y)^2)]$ is still a function, which can be set equal to $g(\theta)$. On the other hand, $G(x, y; u, v)$ represents the function $\beta$ in the theory of exponential families.
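As a numerical illustration of the Kullback-Leibler divergence of this regression model, the following minimal sketch assumes that $V = \theta(U) + E$ with $U$ distributed according to $h$ and Gaussian noise of known variance, so that the divergence is the $h$-weighted squared $L^2$ distance divided by $2\sigma^2$. The specific choices of $h$, $\theta_0$ and $\theta$, and all function names, are hypothetical and serve only as an example. The quadrature value is compared with a Monte Carlo average of the log-likelihood ratio under $\theta_0$.

```python
import math
import random


def kl_quadrature(theta, theta0, h, a, b, sigma2, grid=20_000):
    """Midpoint rule for (1 / (2 sigma^2)) * int_a^b (theta - theta0)^2 h(u) du."""
    step = (b - a) / grid
    total = 0.0
    for i in range(grid):
        u = a + (i + 0.5) * step
        total += (theta(u) - theta0(u)) ** 2 * h(u)
    return total * step / (2.0 * sigma2)


def kl_monte_carlo(theta, theta0, sample_u, sigma2, n, rng):
    """Average of log f(U, V | theta0) - log f(U, V | theta) when (U, V) follows theta0."""
    sigma = math.sqrt(sigma2)
    acc = 0.0
    for _ in range(n):
        u = sample_u(rng)
        v = theta0(u) + rng.gauss(0.0, sigma)
        # The factor h(u) and the Gaussian normalizing constant cancel in the ratio.
        acc += ((v - theta(u)) ** 2 - (v - theta0(u)) ** 2) / (2.0 * sigma2)
    return acc / n


# Hypothetical ingredients, chosen only for illustration.
A, B, SIGMA2 = 0.0, 1.0, 1.0
h = lambda u: 1.0                                # uniform design density on [0, 1]
theta0 = lambda u: math.sin(2.0 * math.pi * u)   # "true" regression function
theta = lambda u: theta0(u) + 0.5 * u            # perturbed regression function

rng = random.Random(0)
kl_exact = kl_quadrature(theta, theta0, h, A, B, SIGMA2)
kl_mc = kl_monte_carlo(theta, theta0, lambda r: r.uniform(A, B), SIGMA2, 100_000, rng)
print(kl_exact, kl_mc)
```

With these choices the perturbation is $\theta - \theta_0 = u/2$, so the quadrature value should be close to $\frac{1}{2}\int_0^1 u^2/4\,du = 1/24$, and the Monte Carlo average should agree with it up to sampling error.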
As for the assumptions of Theorem 3.13, we can prove their validity if, for instance, $h$ belongs to $C^0[a, b] \cap C^2(a, b)$ and is bounded away from zero. The assumption $\int_X |x|^q \mu_0(dx) < +\infty$ is valid for any $q > 0$, and (58) holds if we assume, for instance, a Gaussian prior $\pi$. As for Corollary 3.14, we can check the validity of (57) as a consequence of the Gagliardo-Nirenberg interpolation inequality (Maz'ya [67, Section 12.3]). Since $K(\theta|\theta_0)$ is equivalent to the squared $L^2$-norm, for any $s' > s$, where $\alpha := s/s'$. Therefore, choosing a prior distribution that is supported on $H^{s'}(a, b)$, such as a Gaussian-type prior, and recalling that $H^{s'}(a, b)$ is dense in $H^s(a, b)$, it is enough to consider the neighborhood $\|\theta - \theta_0\|_{H^{s'}(a,b)} \leq 1$ of $\theta_0$ and check that the interpolation inequality immediately yields (57). Whence, $\beta = 2/(1 - \alpha)$. These considerations show that Proposition 3.4 is applicable, provided that the prior is Gaussian with a covariance operator that satisfies (31). In any case, both methods end up highlighting the main terms that figure on the right-hand sides of (25) and (56). Now, for the sake of brevity, we confine ourselves to the application of Theorem 3.2. Apropos of the first term on the right-hand side of (25), we notice that $I(\theta_0)$ is independent of $\theta_0$, and is equivalent to the identity operator. In view of a straightforward coercivity, we can apply the results in Section 3.3 of Albeverio and Steblovskaya [2] to obtain that the first term on the right-hand side of (25) is asymptotic to $1/\sqrt{n}$, as $n \to +\infty$. Then, the second and the third terms are exponentially small, and hence asymptotically negligible. To complete the treatment, we are left to discuss the asymptotic behavior of the constant L This is the sum of the terms $\sigma^{-2}(v\theta'(u), \theta(u) - v)$ and $\sigma^{-2}(-\theta'(u)\theta(u), 0)$, where the former vector is a linear functional of $\theta$. Thus, the Fréchet derivative of the first term with respect to $\theta$ is given by the vector $\sigma^{-2}(vS_u, T_u)$, where $T_u$ ($S_u$, respectively) stands for the Riesz representative of the functional $\delta_u$ ($-\delta'_u$, respectively). It is useful to observe that such a derivative, being independent of $\theta$, does not contribute asymptotically in the expression of the double integral, as we have already discussed in the previous section. Finally, the Fréchet derivative of the second term is $-\sigma^{-2}(T_u\theta'(u) + S_u\theta(u), 0)$. At this stage, we can see that the study of the double integral can be reduced, through the use of Sobolev inequalities, to the study of the corresponding posterior moments. To conclude, we state a proposition that summarizes the above considerations. as $n \to +\infty$, which represents the optimal rate.

Discussion
We conclude our work by discussing some directions for future research. The flexibility of the Wasserstein distance is promising when considering non-regular Bayesian statistical models, even in a finite-dimensional setting. One may consider the problem of dealing with dominated statistical models that have moving supports, i.e. supports that depend on $\theta$. The prototypical example is the family of Pareto distributions, which is characterized by a density function
\[ f(x|\theta) = \frac{\alpha x_0^{\alpha}}{x^{\alpha+1}} \mathbb{1}\{x \geq x_0\}, \]
where $\theta = (\alpha, x_0) \in (0, +\infty)^2$. Under this model, by rewriting the posterior distribution to obtain the representation (7), we observe that the empirical distribution can be replaced by the minimum of the observations, which is the maximum likelihood estimator of $x_0$. In doing this, we expect to parallel the proof of Theorem 3.13, with the minimum playing the role of the sufficient statistic, instead of the empirical measure. In particular, we expect that the term $\varepsilon_{n,p}(X, \mu_0)$ should be replaced by other rates typically involved in limit theorems for order statistics. The theoretical framework for such an extension of our results is developed in the work of Dolera and Mainini [39], where it is shown how the continuity equation yields a specific boundary-value problem of Neumann type.
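The role of the sample minimum in the Pareto model can be illustrated with a small simulation. The sketch below is only an example under stated assumptions: it uses the standard Pareto density with shape $\alpha$ and scale $x_0$, inverse-transform sampling, and the classical maximum likelihood formulas; the function names and the parameter values are hypothetical. It shows the error of the minimum shrinking roughly like $1/n$, which is faster than the $n^{-1/2}$ rate typical of regular models.

```python
import math
import random


def sample_pareto(alpha, x0, n, rng):
    """Inverse-transform sampling: X = x0 * U^(-1/alpha), with U uniform on (0, 1]."""
    return [x0 * (1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(n)]


def pareto_mle(xs):
    """Maximum likelihood estimates (alpha_hat, x0_hat) for the Pareto family."""
    x0_hat = min(xs)  # the minimum of the observations estimates the scale x0
    alpha_hat = len(xs) / sum(math.log(x / x0_hat) for x in xs)
    return alpha_hat, x0_hat


# Hypothetical true parameters, chosen only for illustration.
ALPHA, X0 = 3.0, 2.0
rng = random.Random(2)

results = {}
for n in (100, 1_000, 10_000):
    alpha_hat, x0_hat = pareto_mle(sample_pareto(ALPHA, X0, n, rng))
    results[n] = (alpha_hat, x0_hat)
    # n * (x0_hat - x0) stays of order one, reflecting the 1/n rate of the minimum.
    print(n, alpha_hat, x0_hat, n * (x0_hat - X0))
```

The minimum never falls below $x_0$, so the estimation error is one-sided; this is the kind of non-regular behavior for which limit theorems for order statistics, rather than for empirical measures, are the natural tool.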
As for the infinite-dimensional setting covered by Theorem 3.2 and Theorem 3.13, an interesting development of our approach to PCRs is represented by the possibility of finding, for general statistical models, explicit sufficient statistics belonging to Banach spaces of functions. To be more precise, we hint at a constructive version of the well-known Fisher-Neyman factorization lemma. This result would pave the way for a suitable rewriting of the statistical model that allows for the use of our approach. By way of example, one may consider the identity $\log f(x|\theta) = \int_X \log f(y|\theta)\,\delta_x(dy)$, and exploit an integration-by-parts formula to obtain an identity like (19), with respect to a suitable measure $\lambda$ on $(X, \mathscr{X})$. Such a procedure is at the basis for the development of our approach to PCRs in the context of popular nonparametric models, not considered in this paper, such as the Dirichlet process mixture model (Ghosal and
Another promising line of research consists in extending Theorem 3.13 to metric measure spaces. The theoretical ground for this development may be found in the seminal works of Gigli [53], Gigli and Ohta [54], Ambrosio et al. [6], Otto and Villani [69] and von Renesse and Sturm [85]. In such a context, the treatment of the relative entropy functional in the Wasserstein space is of interest. It is well-known that the Hessian of the relative entropy functional, i.e. the Kullback-Leibler divergence, can be generalized by using techniques from infinite-dimensional Riemannian geometry (Otto and Villani [69]). From the statistical side, the possibility of choosing a parameter space that coincides with a space of measures allows one to reconsider, from a different point of view, popular Bayesian statistical models such as Dirichlet process mixture models, which are defined as
\[ f(x|\mathfrak{p}) = \int \tau(x|y)\,\mathfrak{p}(dy), \]
where $\tau$ is a kernel parameterized by $y$, and $\mathfrak{p}$ is a random probability measure with a Dirichlet process prior (Ferguson [42]). The goal should be that of considering PCRs relative to Wasserstein neighborhoods of a given true distribution, say $\mathfrak{p}_0$. This approach is again different from the nonparametric framework considered in Berthet and Niles-Weed [16], and seems still unexplored.
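To make the Dirichlet process mixture model concrete, the following sketch draws one random mixture density. It relies on assumptions that are not part of the text: the truncated stick-breaking construction of the Dirichlet process, a Gaussian kernel $\tau(x|y)$ with fixed bandwidth, and a standard normal base measure. All names and parameter values are hypothetical, and the truncation is only an approximation of the infinite mixture.

```python
import math
import random


def stick_breaking_weights(concentration, n_atoms, rng):
    """Truncated stick-breaking weights; also returns the mass left unassigned."""
    weights = []
    remaining = 1.0
    for _ in range(n_atoms):
        frac = rng.betavariate(1.0, concentration)
        weights.append(remaining * frac)
        remaining *= 1.0 - frac
    return weights, remaining


def gaussian_kernel(x, y, bandwidth):
    """Assumed kernel tau(x | y): Gaussian density centered at y."""
    z = (x - y) / bandwidth
    return math.exp(-0.5 * z * z) / (bandwidth * math.sqrt(2.0 * math.pi))


def sample_dp_mixture_density(concentration, base_sampler, bandwidth, n_atoms, rng):
    """One draw of x -> sum_j w_j tau(x | y_j), with atoms y_j from the base measure."""
    weights, leftover = stick_breaking_weights(concentration, n_atoms, rng)
    weights[-1] += leftover  # fold the truncated tail into the last atom
    atoms = [base_sampler(rng) for _ in range(n_atoms)]

    def density(x):
        return sum(w * gaussian_kernel(x, y, bandwidth) for w, y in zip(weights, atoms))

    return density, weights, atoms


rng = random.Random(1)
density, weights, atoms = sample_dp_mixture_density(
    concentration=2.0,
    base_sampler=lambda r: r.gauss(0.0, 1.0),
    bandwidth=0.5,
    n_atoms=50,
    rng=rng,
)

# Midpoint-rule check that the sampled function integrates to one over a wide window.
step = 0.005
mass = sum(density(-12.0 + (i + 0.5) * step) for i in range(4800)) * step
print(sum(weights), mass)
```

The random mixing measure is discrete, with atoms and weights as above, which is why Wasserstein neighborhoods of a true mixing distribution are a natural notion of closeness in this setting.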

A Proofs
A.1 Proof of Lemma 2.3
By a standard measure-theoretic argument, any two solutions ) as elements of $\mathcal{P}(\Theta)$, for all $x^{(n)} \in X^n \setminus N_n$, where $N_n$ is an $\alpha_n$-null set. The assumption $\mu_0^{\otimes n} \ll \alpha_n$ entails that $\xi^{(n)} := (\xi_1, \dots, \xi_n)$ takes values in $N_n$ with P-probability zero, yielding the desired well-definedness. Then, if $\pi \in \mathcal{P}_p(\Theta)$, any solution $\pi_n(\cdot|\cdot)$ of (1) satisfies $\pi_n(\mathcal{P}_p(\Theta)|x^{(n)}) = 1$ for $\alpha_n$-almost every $x^{(n)} \in X^n$. Since $\mu_0^{\otimes n} \ll \alpha_n$, it follows that $\pi_n(\mathcal{P}_p(\Theta)|\xi^{(n)}) = 1$ with P-probability 1. Whence, 1/p is a random variable, which proves to be finite P-a.s. Combining Markov's and Lyapunov's inequalities, it follows that M n ǫ n holds P-a.s. Now, taking the expectation of both sides and taking account of (6) yields Thus, the convergence indicated in (4) holds in $L^1(\Omega, \mathscr{F}, \mathsf{P})$ and, hence, in P-probability. The proof is complete.
A.2 Proof of identity (26)
In view of (23), it is enough to prove that holds for all $\theta \in \Theta$ and $b$ in the range of $S \circ g$, with some suitable function Combining the above identity with (18) and observing that $g(\theta_b) = S^{-1}(b)$, it follows that is valid for all $\theta \in \Theta$ and $b$ in the range of $S \circ g$. Then, the validity of (79

A.3 Proof of Theorem 3.2
Under the assumptions of the theorem, Lemma 2.3 is valid, and a PCR at $\theta_0$ is given by (6). Moreover, (7) is valid with $S_n(\xi_1, \dots, \xi_n) = \hat{S}_n$, where $\hat{S}_n$ is given by (20), S = B endowed with the distance ensuing from the norm $\|\cdot\|_B$, and the kernel $\pi^*_n(\cdot|\cdot)$ is given by (23). The triangle inequality for $\mathcal{W}_p^{(\mathcal{P}(\Theta))}$ gives with the same $S_0$ as in (22). See [5, Chapter 7] for information about the aforesaid triangle inequality. Then, take the expectation of both sides above to obtain At this stage, the first summand on the last member of (80) is exactly equal to the first summand on the right-hand side of (25), thanks to identity (11). For the second summand on the last member of (80), invoke (9) to conclude that such a term is majorized by the last summand on the right-hand side of (25). It remains to handle the third summand on the last member of (80). Exploit the fact that, for any two elements $\mu, \nu \in \mathcal{P}_p(\Theta)$ there holds Now, the first summand on the right-hand side of (81) can be bounded by means of a combination of Hölder's and Lyapunov's inequalities, yielding For the second summand on the right-hand side of (81) just exploit the triangle inequality to obtain .
Re-organizing the terms just obtained yields the right-hand side of (25), concluding the proof.

A.4 Proof of Proposition 3.4
Start by fixing $\varepsilon$ in the interval $(2(r - 1), q)$, which is possible since $0 < 2(r - 1) < q$. Then, let $\{\eta_n\}_{n \geq 1}$ be a sequence of positive numbers such that $\eta_n = O(n^{-1/(2+\varepsilon)})$ as $n \to +\infty$. Let $B_{\eta_n}(\theta_0)$ denote the open ball in $\Theta$ with radius $\eta_n$, centered at $\theta_0$. Without loss of generality, assume that $\theta_0 \in V$. Otherwise, by density of V, pick a sequence $\{\theta_{0,n}\}_{n \geq 1} \subset V$ such that $\|\theta_{0,n} - \theta_0\|_{\Theta} \to 0$ sufficiently fast, and replace $\theta_0$ by $\theta_{0,n}$. The proof is divided into four steps, according to typical operations in the theory of Laplace approximation. First, let us prove that as $n \to +\infty$ where, for any pair of sequences $\{a_n\}_{n \geq 1}$ and $\{b_n\}_{n \geq 1}$ of positive numbers, the notation $a_n \sim b_n$ means that $\lim_{n \to +\infty} a_n/b_n = 1$. To this aim, it is enough to show that the integrals on the exterior of $B_{\eta_n}(\theta_0)$ are exponentially small, and hence irrelevant in the global asymptotic expansion. Exploiting (32)-(33), one gets $\int_{B_{\eta_n}(\theta_0)^c} \exp\{-nK(\theta|\theta_0)\}\pi(d\theta)$ ρn (θ 0 ) denote the open ball in V, with radius $\rho_n$ and centered at $\theta_0$. Thus, the last integral can be bounded from above by which can be made an exponentially small quantity after properly choosing the sequence $\{\rho_n\}_{n \geq 1}$. Actually, it is enough to choose $\rho_n = O(n^h)$ as $n \to +\infty$, for some $h$ satisfying Of course, this is possible in view of the bound $2 + \varepsilon > 2r$. Now, $h > 0$ entails that η n ρ Lastly, the combination of the identity $s = \frac{r}{r-1}$ with the inequality (84) entails that This argument shows that the second summand in (83) goes to zero like $e^{-n^c}$, making it a negligible quantity. Finally, the first summand in (83) is also bounded by a term that goes to zero like $e^{-n^h}$, thanks to a straightforward combination of Markov's inequality with the assumption that $\int_V e^{t\|\theta\|_V}\pi(d\theta) < +\infty$ for some $t > 0$.
As for the term the argument to prove that it is also exponentially small is similar. Indeed, it is enough to get rid of the term $\|\theta - \theta_0\|^2_{\Theta}$ by a straightforward application of Hölder's inequality. This proves (82). After reducing both the integrals to $B_{\eta_n}(\theta_0)$, exploit the regularity of the map $\theta \mapsto K(\theta|\theta_0)$ by showing that it can be replaced by its second-order Taylor polynomial, which reads because $K(\theta_0|\theta_0) = 0$ and $D_\theta K(\theta|\theta_0)|_{\theta=\theta_0} = 0$. By the assumptions of the proposition, go to zero faster than their respective counterparts exp{−nK(θ|θ 0 )}π(dθ) . Whence, This concludes the fourth step and the proof.
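A one-dimensional toy version of the Laplace approximation behind this proof may help to fix ideas. The sketch below is an illustration under assumptions that are not taken from the text: the function $K(\theta) = \cosh\theta - 1$ plays the role of the Kullback-Leibler divergence (it vanishes at $\theta_0 = 0$ together with its first derivative, and its second derivative there equals 1), and the prior is standard normal. It compares a numerical value of $\int e^{-nK(\theta)}\pi(d\theta)$ with the classical leading term $\pi(0)\sqrt{2\pi/n}$, where $\pi(0)$ denotes the prior density at the origin.

```python
import math


def kl_toy(theta):
    """Hypothetical stand-in for K(theta | theta_0), with theta_0 = 0."""
    return math.cosh(theta) - 1.0


def prior_density(theta):
    """Standard normal prior density (an assumption made for this example)."""
    return math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)


def laplace_integral(n, half_width=3.0, grid=60_000):
    """Midpoint rule for the integral of exp(-n K) against the prior on [-3, 3]."""
    step = 2.0 * half_width / grid
    total = 0.0
    for i in range(grid):
        t = -half_width + (i + 0.5) * step
        total += math.exp(-n * kl_toy(t)) * prior_density(t)
    return total * step


def laplace_leading_term(n, hessian=1.0):
    """Leading term of the Laplace expansion around the minimizer theta_0 = 0."""
    return prior_density(0.0) * math.sqrt(2.0 * math.pi / (n * hessian))


ratios = {n: laplace_integral(n) / laplace_leading_term(n) for n in (10, 100, 1_000)}
for n, ratio in ratios.items():
    print(n, ratio)
```

The ratios approach 1 as $n$ grows, which mirrors the fact that only a shrinking neighborhood of $\theta_0$, on which $K$ is well approximated by its second-order Taylor polynomial, contributes to the asymptotics.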
A.5 Proof of Proposition 3.8
The main issue is to prove the validity of (9). Thus, fix $S_0 \in B$ and $S' \in U_{\delta_n}(S_0)$. For $t$ varying in $[0, 1]$, let $S_t = S_0 + t(S' - S_0)$ denote the line segment joining $S_0$ with $S'$. Use the kernel $\pi^*_n(\cdot|\cdot)$ defined in (23) to lift the line segment $[S_t]_{t \in [0,1]}$ to $\mathcal{P}_2(\Theta)$, by means of the new curve Here, we apply the Benamou-Brenier representation introduced in Section 2 with $M = \Theta$, to get Then, rewrite (13) as By Riesz representation, we get Now, take the derivative inside the integral in the expression of $T_t$, consider the expression of $\mu^*_t$ and apply the Leibniz rule, as follows.
where in the inequality with the superscript "duality" we have used the fact that, for any $b \in B$, it holds This proves inequality (41). Finally, (43) follows trivially from (41), in view of the boundedness condition (42).
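The Benamou-Brenier representation used in this proof can be checked numerically in a simple case. The following sketch is a one-dimensional toy example, not the infinite-dimensional setting of the proposition: it takes two Gaussian laws on the real line (hypothetical parameters), discretizes them through their quantiles, and compares the kinetic action $\int_0^1 \int |v_t|^2 d\mu_t\,dt$ along the displacement interpolation with the squared 2-Wasserstein distance, whose closed form for Gaussians is $(m_1 - m_0)^2 + (s_1 - s_0)^2$.

```python
from statistics import NormalDist


def quantile_particles(mean, std, n):
    """Equal-weight particles located at the midpoint quantiles of N(mean, std^2)."""
    dist = NormalDist(mean, std)
    return [dist.inv_cdf((i + 0.5) / n) for i in range(n)]


def w2_squared(xs, ys):
    """Squared W2 distance between equal-size sorted samples (monotone coupling in 1D)."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def kinetic_action(xs, ys, time_steps):
    """Riemann sum of the action along the path x_t = (1 - t) x + t y of each particle."""
    dt = 1.0 / time_steps
    action = 0.0
    for k in range(time_steps):
        t0, t1 = k * dt, (k + 1) * dt
        speed_sq = 0.0
        for x, y in zip(xs, ys):
            p0 = (1.0 - t0) * x + t0 * y
            p1 = (1.0 - t1) * x + t1 * y
            velocity = (p1 - p0) / dt  # finite-difference velocity of the particle
            speed_sq += velocity * velocity
        action += speed_sq / len(xs) * dt
    return action


# Hypothetical endpoints: N(0, 1) and N(2, 9).
M0, S0, M1, S1 = 0.0, 1.0, 2.0, 3.0
xs = quantile_particles(M0, S0, 2_000)
ys = quantile_particles(M1, S1, 2_000)

w2_sq = w2_squared(xs, ys)
action = kinetic_action(xs, ys, time_steps=10)
closed_form = (M1 - M0) ** 2 + (S1 - S0) ** 2
print(w2_sq, action, closed_form)
```

Along the displacement interpolation each particle moves with constant velocity, so the action coincides with the squared distance between the discretized laws; any other admissible curve joining the two laws would have an action at least as large, which is the content of the dynamic formulation.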
A.6 Proof of Proposition 3.11
The first step of the proof is to provide a result analogous to Bakry et al. [13, Theorem 1.4]. To this aim, we need the concept of Lyapunov function, as done in that paper. Therefore, let $V : \Theta \to \mathbb{R}$ be a $C^2$ function bounded from below. Define the probability measure $\mu_{V,\pi}$ in Gibbsian form as
\[ \mu_{V,\pi}(d\theta) = \frac{e^{-V(\theta)}\pi(d\theta)}{\int_\Theta e^{-V(\tau)}\pi(d\tau)}. \]
Then, define the differential operator $L_{V,\pi} := L_\pi - D_\theta[V] \cdot D_\theta$, where $D$ and $L_\pi$ denote the Malliavin derivative and the Malliavin-Laplace operator associated to $\pi$, respectively. See Da Prato [30, Chapter 2] for the definition and properties of these differential operators. In particular, here it is enough to recall the following integration-by-parts formula that links these operators together: At this stage, we can follow the same exact steps as in Bakry et al. [13] to conclude that where the constants $a, b$ are the same as in (89), while $\kappa_R$ denotes the weighted Poincaré-Wirtinger constant (relative to the Malliavin derivative) of the measure $\mu_{V,\pi}$ restricted to the ball $\{\|\theta\| < R\}$. After these preliminaries, let us consider point (1). We put $V := nG$ in (88). Let $W$ be a $C^2(\Theta)$ function such that $W \geq 1$ on $\Theta$ and such that W

A.7 Proof of Theorem 3.13
To establish (54), we start from the Bayes formula Then, the bound (53) entails that the integral $\int_X \log f(y|\theta)\gamma(dy)$ is well-defined and finite for any $\gamma \in \mathcal{P}_2(X)$ and $\theta \in \Theta$. Now, recalling that $\mu_0 \in \mathcal{P}_2(X)$, let $V_0^{(n)}$ be the $\mathcal{W}_2^{(\mathcal{P}(\Theta))}$-neighborhood of $\mu_0$ for which (55) is in force. Let $\zeta$ be a fixed element of such a neighborhood and let $\{\zeta_t\}_{t \in [0,1]}$ be a $\mathcal{W}_2$-constant speed geodesic connecting $\mu_0$ with $\zeta$. In particular, $[0, 1] \ni t \mapsto \zeta_t$ is an absolutely continuous curve in $\mathcal{P}_2(X)$. The map $\pi^*_n$ allows the construction of a lifting of this path, in the sense that {π The right space for the solution of this problem is, for fixed $t \in (0, 1)$, the weighted Sobolev space H  .
Then, the second summand on the right-hand side of (56) is already provided by the second summand on the right-hand side of the last inequality in (10). Finally, the last two terms on the right-hand side of (56) come from the last summand on the right-hand side of (10), after noticing that we have Indeed, the last term on the above right-hand side yields immediately the last term on the right-hand side of (56). Lastly, we just observe that .
The conclusion of the proof then follows by using (58), again in direct combination with the bound borrowed from Bolley et al. [19, Theorem 2.7].

Assumptions 3.12.
The set $X$, the parameter space $\Theta$ and the statistical model $\mu(\cdot\,|\,\cdot)$ are such that
i) $X$ coincides with an open, connected subset of $\mathbb{R}^m$ with Lipschitz boundary, and $\mathscr{X} = \mathscr{B}(X)$. With minor changes of notation, $X$ could also coincide with a smooth Riemannian manifold without boundary of dimension $m \in \mathbb{N}$.
ii) $\Theta$ coincides with an open, connected subset of a separable Hilbert space of dimension $d \in \mathbb{N} \cup \{+\infty\}$.

π
(dθ) and we observe that the regularity of the mapping x → f (x|θ) allows us to write n i=1 f (x i |θ) as exp n i=1 log f (x i |θ) = exp n