1 Introduction

Bayesian consistency guarantees that the posterior distribution concentrates on arbitrarily small neighborhoods of the true model, in a suitable way, as the sample size goes to infinity [15, 35, 40, 44, 45, 49, 71, 84]. See Ghosal and van der Vaart [50, Chapters 6 and 7] for a general overview of Bayesian consistency. Posterior contraction rates (PCRs) strengthen the notion of Bayesian consistency, as they quantify the speed at which such small neighborhoods of the true model may decrease to zero while still capturing most of the posterior mass. The problem of establishing optimal PCRs in finite-dimensional (parametric) Bayesian models was first considered in Ibragimov and Has’minskiǐ [57] and LeCam [59]. However, it is in the works of Ghosal et al. [49] and Shen and Wasserman [73] that the problem of establishing PCRs was investigated in a systematic way, setting forth a general approach to provide PCRs in both finite-dimensional and infinite-dimensional (nonparametric) Bayesian models. Since then, several methods have been proposed to obtain more explicit and sharper PCRs. Among them, we recall the metric entropy approach, in combination with the definition of specific tests [49, 71], the methods based on bracketing numbers and entropy integrals [73], the martingale approach [84, 85], the Hausdorff entropy approach of Xing [88], and some approaches based on the Wasserstein distance [25, 27]. See Ghosal and van der Vaart [50, Chapters 8 and 9], and references therein, for a comprehensive and up-to-date account of PCRs.

1.1 Our contributions

In this paper, we develop a new approach to PCRs, in the spirit of the seminal work of Ghosal et al. [49]. We consider a dominated statistical model as a family \({\mathscr {M}} = \{f_{\theta }\}_{\theta \in \varTheta }\) of densities, with the parameter space \(\varTheta \) being a (possibly infinite-dimensional) separable Hilbert space. We focus on posterior Hilbert neighborhoods of a given true parameter, say \(\theta _0\), measuring PCRs in terms of strong norm distances on parameter spaces of functions, such as Sobolev-like norms. This assumption on \(\varTheta \) yields a stronger metric structure on \({\mathscr {M}}\), as a subset of the space of densities, usually not equivalent to the structures considered so far in the literature on nonparametric density estimation (see, e.g., [48, 54, 72, 73, 82, 84, 85]), which are based on (pseudo-)distances such as the \(\text {L}^p\)-norm, the Hellinger distance, the Kullback–Leibler divergence, and the chi-square divergence. We are not aware of works in the Bayesian literature that deal with strong PCRs for density estimation by using constructive tests, as prescribed by the standard theory, even if this line of research could be pursued as well. As far as we know, the standard nonparametric approach covers the case of (semi-)metrics that are dominated by the Hellinger distance (see, e.g., [50, Proposition D.8]).

We present a theorem on PCRs for the regular infinite-dimensional exponential family of statistical models, and a theorem on PCRs for general dominated statistical models. The former may be viewed as a special case of the latter, allowing us to exploit sufficient statistics arising from the infinite-dimensional exponential family. Critical to our approach is an assumption of local Lipschitz-continuity of the posterior distribution with respect to the observations, or to a sufficient statistic of them. Such a property is typically known as “Bayesian well-posedness” ([77, Section 4.2]), and it has been investigated in depth in Dolera and Mainini [37, 38]. By combining the local Lipschitz-continuity with the dynamic formulation of the Wasserstein distance [5, 12], referred to as Wasserstein dynamics, we set forth a connection between the problem of establishing PCRs and some classical problems arising in mathematical analysis, probability and statistics, e.g., Laplace methods for approximating integrals [22, 87], Sanov’s large deviation principle in the Wasserstein distance [19, 58], rates of convergence of mean Glivenko–Cantelli theorems [1, 7, 17, 36, 39, 46, 58, 78, 79, 86], and estimates of weighted Poincaré–Wirtinger constants ([14, Chapter 4], [56, Chapter 15]). In particular, our study leads us to introduce new results on Laplace methods for approximating integrals and on the estimation of weighted Poincaré–Wirtinger constants in infinite dimension, which are of independent interest.

Some applications of our main theorems are presented for the regular parametric model, the multinomial model, the finite-dimensional and the infinite-dimensional logistic-Gaussian model, and the infinite-dimensional linear regression. It turns out that our main results lead to optimal PCRs in finite dimension, whereas in infinite dimension it is shown explicitly how the prior distribution affects PCRs. Among the applications of our results, the infinite-dimensional logistic-Gaussian model is arguably the best setting to motivate the use of strong norm distances. In such a setting our approach is of interest when the ultimate goal of the inferential procedure is the estimation of some functional \(\varPhi (f_{\theta })\) of the density [75, Chapter 6] for which the mapping \(f \mapsto \varPhi (f)\) is not continuous with respect to the aforesaid metrics on densities, whereas \(\theta \mapsto \varPhi (f_{\theta })\) turns out to be even locally Lipschitz-continuous with respect to the Hilbertian metric on \(\varTheta \). Thus, strong norms allow us to consider larger classes of functionals of density functions, and hence possibly a broader range of analyses. Another motivation for the use of strong norms comes from the theory of density estimation under penalized loss functions, with penalizations depending on derivatives of the density, according to the original Good–Gaskins proposal [55, 74]. As these penalized loss functions are used to derive smoother estimators, it is natural to derive the corresponding PCRs under the same loss functions.

1.2 Related works

The most popular classical (frequentist) approaches to density estimation are developed within the following frameworks: (i) a parameter space that is the space of density functions, typically endowed with the \(\text {L}^p\) norm or the Hellinger distance [81], usually associated with the notion of “strong consistency”; (ii) a parameter space that is the space of density functions endowed with the Wasserstein distance, under which the parameter space is metrized according to a (concrete) metric structure on the space of the observations [16], usually associated with the notion of “weak consistency”. Both these frameworks are different from the one we consider in this paper, and therefore a comparison of our PCRs with optimal minimax rates from Tsybakov [81] and Berthet and Niles-Weed [16] is not directly possible. Within the classical literature, Sriperumbudur et al. [76] considered our statistical framework and provided rates of consistency under the infinite-dimensional exponential family of statistical models, though without any formal statement on their minimax optimality. In principle, our approach to PCRs may be developed within the aforementioned popular statistical frameworks for density estimation. However, since our approach relies on properties of the Wasserstein distance that are well-known for parameter spaces with a linear structure, i.e. Wasserstein dynamics, the framework considered in this paper is the most natural and convenient to start with. As for the other statistical frameworks for density estimation, we conjecture that our approach to PCRs requires a suitable formulation of Wasserstein dynamics for parameter spaces with a nonlinear structure. While such a formulation is available from Gigli [51] and Gigli and Ohta [52], it is still not clear to us how to exploit it to deal with PCRs.

1.3 Organization of the paper

The paper is structured as follows. In Sect. 2 we recall the definition of PCR, presenting an equivalent definition in terms of the Wasserstein distance, and we outline the main steps of our approach to PCRs. Section 3 contains the main results of our work, that is, a theorem on PCRs for the regular infinite-dimensional exponential family of statistical models, and a generalization of it to general dominated statistical models. In Sect. 4 we present some applications of our results for the regular parametric model, the multinomial model, the finite-dimensional and the infinite-dimensional logistic-Gaussian model, and the infinite-dimensional linear regression. Section 5 contains a discussion of some directions for future work, especially with respect to the application of our approach to other nonparametric models, such as the popular class of hierarchical (mixture) models. Proofs of our results are deferred to appendices.

2 A new approach to PCRs

We consider \(n\ge 1\) observations to be modeled as part of a sequence \(X^{(\infty )}:= \{X_i\}_{i \ge 1}\) of exchangeable random variables, with the \(X_i\)’s taking values in a measurable space \((\mathbb {X}, \mathscr {X})\). Let \((\varTheta , \text {d}_{\varTheta })\) be a metric space, referred to as the parameter space, endowed with its Borel \(\sigma \)-algebra \(\mathscr {T}\). Moreover, let \(\pi \) be a probability measure on \((\varTheta , \mathscr {T})\), referred to as the prior measure, and let \(\mu (\cdot \,|\, \cdot ): \mathscr {X}\times \varTheta \rightarrow [0,1]\) be a probability kernel, referred to as the statistical model. The Bayesian approach relies on modeling the parameter of interest as a \(\varTheta \)-valued random variable, say T, with probability distribution \(\pi \). At the core of Bayesian inference lies the posterior distribution, that is the conditional distribution of T given a random sample \((X_1, \dots , X_n)\), whenever both T and the sequence \(X^{(\infty )}\) are defined on a common probability space \((\varOmega , \mathscr {F}, \mathbb {P})\). The minimal regularity conditions that are maintained, and possibly strengthened, throughout the paper are the following: the set \(\mathbb {X}\) is a separable topological space, with \(\mathscr {X}\) coinciding with the ensuing Borel \(\sigma \)-algebra, and \((\varTheta , \mathscr {T})\) is a standard Borel space. In this setting, the posterior distribution can be represented through a probability kernel \(\pi _n(\cdot \,|\, \cdot ): \mathscr {T}\times \mathbb {X}^n \rightarrow [0,1]\) that satisfies the disintegration

$$\begin{aligned} \mathbb {P}[X_1\in A_1, \dots , X_n\in A_n, T\in B] = \int _{A_1\times \dots \times A_n} \pi _n(B\,|\,x^{(n)}) \alpha _n(\text {d}x^{(n)}) \end{aligned}$$
(1)

for all sets \(A_1, \dots , A_n \in \mathscr {X}\) and \(B \in \mathscr {T}\) and \(n\ge 1\), where \(x^{(n)}:= (x_1, \dots , x_n)\) and

$$\begin{aligned} \alpha _n(A_1\times \dots \times A_n):= \int _{\varTheta }\left[ \prod _{i=1}^n \mu (A_i\,|\, \theta )\right] \pi (\text {d}\theta ), \end{aligned}$$
(2)

so that \(\mathbb {P}[T \in B\,|\,X_1, \dots , X_n] = \pi _n(B\,|\,X_1, \dots , X_n)\) is valid \(\mathbb {P}\)-a.s. for any \(B \in \mathscr {T}\).

Remark 1

When the statistical model \(\mu (\cdot \,|\, \cdot )\) is dominated by some \(\sigma \)-finite measure \(\lambda \) on \((\mathbb {X}, \mathscr {X})\), with a corresponding family of \(\lambda \)-densities \(\{f(\cdot \, |\, \theta )\}_{\theta \in \varTheta }\), (a version of) the posterior distribution is given by the Bayes formula, that is we write

$$\begin{aligned} \pi _n(B\,|\,x^{(n)}) = \frac{\int _B [\prod _{i=1}^n f(x_i\, |\, \theta )] \pi (\text {d}\theta )}{\int _{\varTheta } [\prod _{i=1}^n f(x_i\, |\, \theta )] \pi (\text {d}\theta )} \end{aligned}$$

for any set \(B \in \mathscr {T}\) and \(\alpha _n\)-a.e. \(x^{(n)}\), while \(\alpha _n\) turns out to be absolutely continuous with respect to the product measure \(\lambda ^{\otimes _n}\) with density function of the form

$$\begin{aligned} \rho _n(x_1, \dots , x_n):= \int _{\varTheta }\left[ \prod _{i=1}^n f(x_i\,|\, \theta )\right] \pi (\text {d}\theta )\ . \end{aligned}$$
(3)
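As a concrete illustration of the Bayes formula above, the posterior mass of a set \(B\) can be approximated by reweighting prior draws with the likelihood. The following is a minimal numerical sketch, not taken from the paper, under the assumed toy setting \(f(\cdot \,|\,\theta ) = N(\theta , 1)\) with a standard Gaussian prior:

```python
import numpy as np

rng = np.random.default_rng(0)

theta0 = 0.3
n = 200
x = rng.normal(theta0, 1.0, size=n)          # observations from f(. | theta0) = N(theta0, 1)

# Monte Carlo version of the Bayes formula: draw theta_j from the prior pi = N(0, 1)
# and weight each draw by its likelihood prod_i f(x_i | theta_j).
thetas = rng.normal(0.0, 1.0, size=20_000)
log_lik = -0.5 * ((x[None, :] - thetas[:, None]) ** 2).sum(axis=1)
w = np.exp(log_lik - log_lik.max())          # normalize in log-space for stability
w /= w.sum()

B = np.abs(thetas - theta0) < 0.5            # a neighborhood B of theta0
posterior_mass_B = w[B].sum()                # approximates pi_n(B | x^(n))
print(posterior_mass_B)                      # close to 1 for n = 200
```

The self-normalized weights implement the ratio in the Bayes formula; for \(n = 200\) the posterior mass of the neighborhood is close to one, in line with consistency at \(\theta _0\).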

We say that the posterior distribution is (weakly) consistent at \(\theta _0\in \varTheta \) if, as \(n\rightarrow +\infty \), \(\pi _n(U_0^c\,|\, \xi _1, \dots , \xi _n) \rightarrow 0\) holds in probability for any neighborhood \(U_0\) of \(\theta _0\), where \(\xi ^{(\infty )}:= \{\xi _i\}_{i \ge 1}\) stands for a sequence of \(\mathbb {X}\)-valued independent random variables identically distributed as \(\mu _0(\cdot ):= \mu (\cdot | \theta _0)\) ([50, Definition 6.1]). The non-uniqueness of the posterior distribution \(\pi _n\) requires additional regularity assumptions in order for \(\pi _n(\cdot \,|\, \xi _1, \dots , \xi _n)\) to be well-defined. PCRs strengthen the notion of Bayesian consistency, in the sense that they quantify the speed at which such neighborhoods may decrease to zero while still capturing most of the posterior mass. In particular, the definition of a PCR can be stated as follows ([50, Definition 8.1]).

Definition 1

A sequence \(\{\epsilon _n\}_{n \ge 1}\) of positive numbers is a PCR at \(\theta _0\) if, as \(n\rightarrow +\infty \),

$$\begin{aligned} \pi _n\left( \left\{ \theta \in \varTheta : \text {d}_{\varTheta }(\theta ,\theta _0) \ge M_n\epsilon _n\right\} \,|\, \xi _1, \dots , \xi _n \right) \rightarrow 0 \end{aligned}$$
(4)

holds in probability for every sequence \(\{M_n\}_{n \ge 1}\) of positive numbers such that \(M_n \rightarrow \infty \).

Now, we present our approach to PCRs based on the Wasserstein distance. This is a new approach, which relies on four main steps that are outlined hereafter. The first step of our approach originates from a reformulation of Definition 1 in terms of the so-called p-Wasserstein distance, for \(p \ge 1\). In particular, to recall this concept in full generality, we denote by \((\mathbb {M}, \text {d}_{\mathbb {M}})\) an abstract separable metric space, and we denote by \({\mathcal {P}}(\mathbb {M})\) the corresponding space of all probability measures on \((\mathbb {M}, \mathscr {B}(\mathbb {M}))\). Then, the p-Wasserstein distance is defined as

$$\begin{aligned} \mathcal {W}_p^{({\mathcal {P}}(\mathbb {M}))}(\gamma _1; \gamma _2):= \inf _{\eta \in \mathcal {F}(\gamma _1,\gamma _2)} \left( \int _{\mathbb {M}^2} [\text {d}_{\mathbb {M}}(x, y)]^p\ \eta (\text {d}x\text {d}y) \right) ^{1/p} \end{aligned}$$
(5)

for any \(\gamma _1, \gamma _2 \in {\mathcal {P}}_p(\mathbb {M})\), where

$$\begin{aligned} {\mathcal {P}}_p(\mathbb {M}):= \left\{ \gamma \in {\mathcal {P}}(\mathbb {M})\,:\ \int _{\mathbb {M}} [\text {d}_{\mathbb {M}}(x, x_0)]^p \gamma (\text {d}x) < +\infty \ \text {for some } x_0 \in \mathbb {M}\right\} \end{aligned}$$

and \(\mathcal {F}(\gamma _1, \gamma _2)\) is the class of all probability measures on \((\mathbb {M}^2, \mathscr {B}(\mathbb {M}^2))\) with i-th marginal \(\gamma _i\), for \(i=1,2\). See Ambrosio et al. [5, Chapter 7 and Proposition 7.1.5]. If we let \((\mathbb {M}, \text {d}_{\mathbb {M}}) = (\varTheta , \text {d}_{\varTheta })\), then we can reformulate Definition 1 according to the next lemma; the proof is deferred to Appendix A.1.

Lemma 1

Assume that \(\pi \in {\mathcal {P}}_p(\varTheta )\) and that \(\mu _0^{\otimes _n} \ll \alpha _n\) is valid for any \(n \in \mathbb {N}\). Then, \(\pi _n(\cdot \,|\, \xi _1, \dots , \xi _n)\) is a well-defined random probability measure belonging to \({\mathcal {P}}_p(\varTheta )\) with \(\mathbb {P}\)-probability one, and

$$\begin{aligned} \epsilon _n = \mathbb {E}\left[ \mathcal {W}_p^{({\mathcal {P}}(\varTheta ))}(\pi _n(\cdot \,|\, \xi _1, \dots , \xi _n); \delta _{\theta _0}) \right] \end{aligned}$$
(6)

gives a PCR at \(\theta _0\), where \(\delta _{\theta _0}\) denotes the degenerate distribution at \(\theta _0\).
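For intuition on (6), when \(\varTheta \subseteq \mathbb {R}\) and \(p = 1\), the Wasserstein distance between the posterior and \(\delta _{\theta _0}\) reduces to \(\int _{\varTheta } |\theta - \theta _0|\, \pi _n(\text {d}\theta \,|\,\cdot )\). The following sketch, under an assumed conjugate Gaussian setting (an illustration, not an example from the paper), estimates this expected distance by simulation and recovers the parametric rate \(n^{-1/2}\):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
theta0 = 0.3

def mean_w1(n, reps=200, m=2_000):
    # model N(theta, 1), prior N(0, 1): the posterior is N(n*xbar/(n+1), 1/(n+1))
    vals = []
    for _ in range(reps):
        xbar = rng.normal(theta0, 1.0 / np.sqrt(n))
        post = rng.normal(n * xbar / (n + 1), 1.0 / np.sqrt(n + 1), size=m)
        # W_1 between posterior samples and the point mass at theta0, as in (6) with p = 1
        vals.append(wasserstein_distance(post, [theta0]))
    return np.mean(vals)

e_100, e_400 = mean_w1(100), mean_w1(400)
print(e_100, e_400)   # the ratio is close to 2, i.e. the parametric rate n^{-1/2}
```

Quadrupling \(n\) roughly halves the estimated \(\epsilon _n\), which is the behavior expected from an optimal finite-dimensional PCR.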

The second step of our approach relies on the assumption of the existence of a suitable sufficient statistic. In particular, we assume the existence of another metric space, say \((\mathbb {S}, \text {d}_{\mathbb {S}})\), and of a measurable map, say \(\mathfrak {S}_n: \mathbb {X}^n \rightarrow \mathbb {S}\), such that the kernel \(\pi _n(\cdot | \cdot )\) in (1) can be represented by means of another kernel, say \(\pi _n^{*}(\cdot \,|\, \cdot ): \mathscr {T}\times \mathbb {S}\rightarrow [0,1]\), according to the identity

$$\begin{aligned} \pi _n(\cdot \,|\,x_1, \dots , x_n) = \pi _n^{*}\left( \cdot \,|\,\mathfrak {S}_n(x_1, \dots , x_n)\right) \end{aligned}$$
(7)

for all \((x_1, \dots , x_n) \in \mathbb {X}^n\). See Fortini et al. [43], and references therein, for the existence of sufficient statistics in relationship with the exchangeability assumption. Of course, when the statistical model \(\mu (\cdot | \cdot )\) is dominated, the existence of the sufficient statistic \(\mathfrak {S}_n\) is implied by standard assumptions on the statistical model, such as the well-known Fisher–Neyman factorization criterion.

The third step of our approach relies on the large n asymptotic behavior of the random variable \(\hat{S}_n:= \mathfrak {S}_n(\xi _1, \dots , \xi _n)\). In particular, we assume the existence of a weak law of large numbers for \(\hat{S}_n\), which means that there exists some (non-random) \(S_0 \in \mathbb {S}\) for which \(\hat{S}_n \rightarrow S_0\) holds in \(\mathbb {P}\)-probability, as \(n \rightarrow +\infty \). Hereafter, for any sequence \(\{\delta _n\}_{n \ge 1}\) of positive numbers, we denote by

$$\begin{aligned} J_n(\delta _n):= \mathbb {P}[\text {d}_{\mathbb {S}}(\hat{S}_n, S_0) \ge \delta _n] \end{aligned}$$
(8)

the probability that \(\hat{S}_n\) lies outside a \(\delta _n\)-neighborhood of \(S_0\). Usually, \(J_n(\delta _n)\) can be evaluated by means of concentration inequalities and large deviation principles.
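For instance, when \(\hat{S}_n\) is a mean of \([0,1]\)-valued random variables, \(J_n(\delta _n)\) is controlled by the Hoeffding inequality, \(J_n(\delta ) \le 2\exp \{-2n\delta ^2\}\). A small simulation (an illustration under assumed Bernoulli data, not part of the paper) compares the exact probability with this concentration bound:

```python
import numpy as np

rng = np.random.default_rng(2)
p0, n, delta = 0.4, 500, 0.05

# J_n(delta) = P(|S_hat_n - S_0| >= delta) for the Bernoulli sample mean S_hat_n
S_hat = rng.binomial(n, p0, size=100_000) / n
J_n = np.mean(np.abs(S_hat - p0) >= delta)

hoeffding = 2 * np.exp(-2 * n * delta**2)   # Hoeffding concentration upper bound
print(J_n, hoeffding)                        # J_n is below the bound
```

The simulated \(J_n(\delta )\) sits well below the Hoeffding bound, which is nonetheless sharp enough to decay exponentially in \(n\), as required in the analysis of the remainder terms.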

Based on (7), the fourth step of our approach relies on a form of local Lipschitz-continuity for the kernel \(\pi _n^{*}(\cdot \,|\, \cdot )\), which holds under suitable assumptions on the model \(\mu (\cdot \,|\,\cdot )\) and the prior \(\pi \). It corresponds to the existence of two sequences of positive numbers, say \(\{\delta _n\}_{n \ge 1}\) and \(\{L_0^{(n)}\}_{n \ge 1}\), such that, for each \(n \in \mathbb {N}\),

$$\begin{aligned} \mathcal {W}_p^{({\mathcal {P}}(\varTheta ))}\left( \pi _n^{*}(\cdot \,|\,S_0); \pi _n^{*}(\cdot \,|\,S') \right) \le L_0^{(n)} \text {d}_{\mathbb {S}}(S_0, S') \end{aligned}$$
(9)

holds for any \(S'\) belonging to \(\mathcal {U}_{\delta _n}(S_0):= \{S \in \mathbb {S}\text {: } \text {d}_{\mathbb {S}}(S_0, S) < \delta _n\}\). We refer to Dolera and Mainini [37, 38] for a detailed treatment of the property of local Lipschitz-continuity, for fixed \(n\in \mathbb {N}\), providing some quantitative estimates for \(L_0^{(n)}\). Then, according to Lemma 1, under the validity of (7) and (9), we write

$$\begin{aligned} \epsilon _n&\le \mathcal {W}_p^{({\mathcal {P}}(\varTheta ))}(\pi _n^{*}(\cdot \,|\, S_0); \delta _{\theta _0})\nonumber \\&\quad + L_0^{(n)}\mathbb {E}[ \text {d}_{\mathbb {S}}(\hat{S}_n, S_0)]\nonumber \\&\quad + \mathbb {E}[ \mathcal {W}_p^{({\mathcal {P}}(\varTheta ))}(\pi _n^{*}(\cdot \,|\, S_0); \pi _n^{*}(\cdot \,|\, \hat{S}_n)) \mathbbm {1}\{\hat{S}_n \not \in \mathcal {U}_{\delta _n}(S_0)\}]. \end{aligned}$$
(10)

Under additional assumptions, in Sect. 3 we develop a careful analysis of the three terms on the right-hand side of (10), in order to show that they can be bounded in terms of more explicit quantities that behave like \(n^{-\alpha }\), for some \(\alpha > 0\). In particular, the first term is a non-random quantity which is equal to

$$\begin{aligned} \mathcal {W}_p^{({\mathcal {P}}(\varTheta ))}(\pi _n^{*}(\cdot \,|\, S_0); \delta _{\theta _0}) = \left( \int _{\varTheta } \text {d}_{\varTheta }^p(\theta , \theta _0) \pi _n^{*}(\text {d}\theta \,|\, S_0) \right) ^{1/p}, \end{aligned}$$
(11)

and it measures the speed of shrinkage of \(\pi _n^{*}(\cdot | S_0)\) at \(\theta _0\). Its evaluation is a purely analytical problem, which relies on an extension to infinite-dimensional spaces of the classical Laplace methods for approximating integrals. In (10), the term

$$\begin{aligned} \varepsilon _{n,p}(\mathbb {S},S_0):= \mathbb {E}[ \text {d}_{\mathbb {S}}(\hat{S}_n, S_0)] \end{aligned}$$
(12)

provides the speed of convergence of the mean law of large numbers, which is well-known, at least for the situations considered throughout this paper. The term \(\mathbbm {1}\{\hat{S}_n \not \in \mathcal {U}_{\delta _n}(S_0)\}\) in (10) hints at an application of a large deviation principle. As for the \(L_0^{(n)}\)’s in (10), the bounds provided in Dolera and Mainini [37, 38] show that they can be expressed in terms of weighted Poincaré–Wirtinger constants. As we will show below, a proper choice of the sequence \(\{\delta _n\}_{n \ge 1}\) should entail that \(\{L_0^{(n)}\}_{n \ge 1}\) is bounded or, at least, diverges at a controlled rate.
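To illustrate the local Lipschitz-continuity (9) in a case where \(L_0^{(n)}\) is explicit, consider (an assumption made here purely for illustration, not a setting from the paper) the model \(N(\theta ,1)\) with prior \(N(0,1)\) and statistic \(b = \bar{x}_n\): the posterior given \(b\) is \(N(nb/(n+1), 1/(n+1))\), so two posteriors given \(b\) and \(b'\) are translates of each other and \(\mathcal {W}_p\) between them equals \(\frac{n}{n+1}|b - b'|\), i.e. (9) holds with \(L_0^{(n)} = n/(n+1) \le 1\). A quick numerical check:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(4)

# Posterior map b -> N(n*b/(n+1), 1/(n+1)) for the model N(theta, 1) with prior N(0, 1);
# the Wasserstein distance between two posteriors is (n/(n+1)) * |b - b'|.
n, b, b2 = 50, 0.30, 0.35
post = lambda c: rng.normal(n * c / (n + 1), 1.0 / np.sqrt(n + 1), size=100_000)

w1 = wasserstein_distance(post(b), post(b2))
print(w1, n / (n + 1) * abs(b - b2))         # both approximately equal
```

Here the Lipschitz constant stays bounded in \(n\), which is the favorable situation described above; in general \(L_0^{(n)}\) may grow and must be balanced against the choice of \(\delta _n\).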

Critical to our analysis of the term \(L_0^{(n)}\) is the so-called dynamic formulation of the p-Wasserstein distance, which is referred to as Wasserstein dynamics [12]. In particular, assume that \(\mathbb {M}\) is the norm-closure of some nonempty, open and connected subset of a separable Hilbert space \(\mathbb {H}\), endowed with scalar product \(\langle \cdot ,\cdot \rangle \) and norm \(\Vert \cdot \Vert \). Then, for any \(\gamma _0, \gamma _1 \in {\mathcal {P}}_p(\mathbb {M})\),

$$\begin{aligned} \left[ \mathcal {W}_p^{({\mathcal {P}}(\mathbb {M}))}(\gamma _0; \gamma _1)\right] ^p = \inf _{\{\gamma _t\}_{t \in [0,1]} \in AC^p[\gamma _0; \gamma _1]} \int _0^1 \int _{\mathbb {M}} \Vert {{\textbf {v}}}_t(x)\Vert ^p \gamma _t(\text {d}x) \text {d}t, \end{aligned}$$

where \(AC^p[\gamma _0; \gamma _1]\) is the space of all absolutely continuous curves in \({\mathcal {P}}_p(\mathbb {M})\) with \(\text {L}^p(0,1)\) metric derivative (w.r.t. \({\mathcal {W}}_p\)) connecting \(\gamma _0\) to \(\gamma _1\), and \([0,1]\times \mathbb {M}\ni (t,x)\mapsto {{\textbf {v}}}_t(x)\in \mathbb {H}\) is a Borel function such that for almost every \(t\in (0,1)\) it holds

$$\begin{aligned} \frac{\text {d}}{\text {d}s}\bigg |_{s=t} \int _{\mathbb {M}} \psi (x) \gamma _s(\text {d}x) = \int _{\mathbb {M}} \langle {{\textbf {v}}}_t(x), \text {D} \psi (x)\rangle \gamma _t(\text {d}x)\qquad \forall \ \psi \in C^1_b(\mathbb {M}). \end{aligned}$$
(13)

Here, \(\text {D}\psi \) denotes the Riesz representative of the Fréchet differential of the function \(\psi \), and \(\psi \in C^1_b(\mathbb {M})\) means that \(\psi \) is the restriction to \(\mathbb {M}\) of a function in the class \(C^1_b(\mathbb {H})\), that is \(\psi \) is a bounded continuous function with bounded continuous Fréchet derivative on \(\mathbb {H}\). See Da Prato and Zabczyk [32, Chapter 2] for spaces of continuous functions defined on Hilbert spaces, and Ambrosio et al. [5, Chapter 8] for a detailed account on the partial differential equation (13).

For any fixed t and given \(\gamma _t\), it is natural to look for a solution \({{\textbf {v}}}_t(\cdot )\) of Eq. (13) in the form of a gradient, and therefore we may interpret (13) as an abstract elliptic equation, for which it is well-known that a critical role is played by Poincaré inequalities in the context of proving the existence and regularity of a solution.

Definition 2

We say that a probability measure \(\mu \) on \((\mathbb {M}, \mathscr {B}(\mathbb {M}))\) satisfies a weighted Poincaré inequality of order p if there exists a constant \({\mathcal {C}}_p\) for which

$$\begin{aligned} \inf _{a\in {\mathbb {R}}}\left( \int _\mathbb {M}|\psi (x) - a|^p\,\mu (\text {d}x)\right) ^{\frac{1}{p}} \le {\mathcal {C}}_p \left( \int _\mathbb {M}\Vert \text {D} \psi (x)\Vert ^p\,\mu (\text {d}x) \right) ^{\frac{1}{p}} \end{aligned}$$
(14)

holds for every \(\psi \in C^1_b(\mathbb {M})\). We denote by \({\mathfrak {C}}_p[\mu ]\) the best constant \({\mathcal {C}}_p\) in (14). In particular, for \(p=2\) the best constant \({\mathfrak {C}}_2[\mu ]\) may be characterized by means of

$$\begin{aligned} \left( \frac{1}{{\mathfrak {C}}_2[\mu ]}\right) ^2 = \inf _{\{\psi \in C^1_b(\mathbb {M})\,:\,\int _\mathbb {M}\psi (x)\,\mu (\text {d}x)=0\,\wedge \,\int _\mathbb {M}|\psi (x)|^2\,\mu (\text {d}x)=1\}}\int _\mathbb {M}\Vert \text {D}\psi (x)\Vert ^2\,\mu (\text {d}x) . \end{aligned}$$

Finally, if \(\mathbb {M}= \mathbb {H}\) and \(\mu \) is absolutely continuous with respect to a non-degenerate Gaussian measure, then the Fréchet derivative in (14) can be replaced by the Malliavin derivative \(\mathcal {D}\), yielding the following weaker definition:

$$\begin{aligned} \left( \frac{1}{{\mathfrak {C}}_2^{(M)}[\mu ]}\right) ^2 = \inf _{\{\psi \in C^1_b(\mathbb {M})\,:\,\int _\mathbb {M}\psi (x)\,\mu (\text {d}x)=0\,\wedge \,\int _\mathbb {M}|\psi (x)|^2\,\mu (\text {d}x)=1\}}\int _\mathbb {M}\Vert {\mathcal {D}}\psi (x)\Vert ^2\,\mu (\text {d}x) . \end{aligned}$$

We refer to the monographs of Bogachev [18], Da Prato [30] and Da Prato and Zabczyk [32] for a detailed account of Malliavin calculus and related Sobolev spaces.
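As a one-dimensional illustration of Definition 2 (an elementary fact recalled here, not a result of the paper): for \(\mu = N(0, \sigma ^2)\) on \(\mathbb {R}\) and \(p = 2\), the best constant is \({\mathfrak {C}}_2[\mu ] = \sigma \), with the Rayleigh quotient in the characterization above extremized in the limit by linear functions (which can be approximated by truncations in \(C^1_b\)). The following sketch estimates the quotient by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 1.5
x = rng.normal(0.0, sigma, size=1_000_000)

# For mu = N(0, sigma^2) on R, the best constant in (14) with p = 2 is sigma,
# attained (in the limit) by psi(x) = x; any other test function gives a smaller ratio.
def ratio(psi, dpsi):
    return np.sqrt(np.var(psi(x)) / np.mean(dpsi(x) ** 2))

r_linear = ratio(lambda t: t, lambda t: np.ones_like(t))   # approximately sigma
r_sin = ratio(np.sin, np.cos)                              # strictly below sigma
print(r_linear, r_sin)
```

The linear test function recovers \(\sigma \), while a generic bounded test function such as \(\sin \) yields a strictly smaller quotient, consistently with \({\mathfrak {C}}_2[\mu ]\) being a supremum over such ratios.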

3 Main results on PCRs

Following the approach to PCRs outlined in Sect. 2, we present two main results: (i) a theorem on PCRs for the regular infinite-dimensional exponential family of statistical models; (ii) a theorem on PCRs for a general dominated statistical model.

3.1 PCRs for the regular infinite-dimensional exponential family

It is useful to recall the definition and some basic properties of the infinite-dimensional exponential family. In general, classical results on exponential families may be extended to the infinite-dimensional setting through suitable arguments of convex analysis [10, 11].

Definition 3

Let \(\lambda \) be a \(\sigma \)-finite measure on \((\mathbb {X}, \mathscr {X})\), \(\mathbb {B}\) be a separable Banach space with dual \(\mathbb {B}^{*}\), and \( _{\mathbb {B}^{*}}\langle \cdot , \cdot \rangle _{\mathbb {B}}\) be the duality pairing between \(\mathbb {B}^{*}\) and \(\mathbb {B}\). Also, let \(\varGamma \) be a nonempty open subset of \(\mathbb {B}^{*}\), and let \(\beta : \mathbb {X}\rightarrow \mathbb {B}\) be a measurable map. If the interior \(\varLambda \) of the convex hull of the support of \(\lambda \circ \beta ^{-1}\) is nonempty and

$$\begin{aligned} \int _{\mathbb {X}} \exp \{ _{\mathbb {B}^{*}}\langle \gamma , \beta _x \rangle _{\mathbb {B}}\} \lambda (\text {d}x) < +\infty \end{aligned}$$
(15)

holds for any \(\gamma \in \varGamma \), then the regular infinite-dimensional exponential family is a statistical model defined through the family of \(\lambda \)-densities \(\{\varphi (\cdot \,|\,\gamma )\}_{\gamma \in \varGamma }\), where

$$\begin{aligned} \varphi (x\,|\,\gamma ) = \exp \{ _{\mathbb {B}^{*}}\langle \gamma , \beta _x \rangle _{\mathbb {B}} - M_{\varphi }(\gamma )\} \end{aligned}$$
(16)

with

$$\begin{aligned} M_{\varphi }(\gamma ):= \log \left( \int _{\mathbb {X}} \exp \{ _{\mathbb {B}^{*}}\langle \gamma , \beta _x \rangle _{\mathbb {B}}\} \lambda (\text {d}x) \right) . \end{aligned}$$
(17)

Brown [24, Theorems 1.13, 2.2 and 2.7] states that \(M_{\varphi }\) is a strictly convex function on \(\varGamma \), lower semi-continuous on \(\mathbb {B}^{*}\), of class \(C^{\infty }(\varGamma )\) and analytic. In addition, Barndorff-Nielsen [9, Corollary 5.3] implies that \(M_{\varphi }\) is steep (essentially smooth). Therefore, it follows from Brown [24, Theorem 3.6] that

$$\begin{aligned} {\mathcal {S}}: \gamma \mapsto \nabla M_{\varphi }(\gamma ) = \int _{\mathbb {X}} \beta _x \varphi (x\,|\,\gamma ) \lambda (\text {d}x) \end{aligned}$$
(18)

defines a smooth injective map from \(\varGamma \) into \(\mathbb {B}\), with dense range. Finally, [24, Corollary 2.5] entails the identifiability of the model characterized by the densities (16).
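A minimal finite-dimensional instance of (16)–(18) (offered only as an illustration) is the Bernoulli family: \(\mathbb {X}= \{0,1\}\), \(\lambda \) the counting measure, \(\beta _x = x\) and \(M_{\varphi }(\gamma ) = \log (1 + e^{\gamma })\), for which the mean-value map \({\mathcal {S}}(\gamma ) = \nabla M_{\varphi }(\gamma )\) is the logistic function. The following sketch checks (18) numerically:

```python
import numpy as np

# One-dimensional sketch of (16)-(18): Bernoulli as an exponential family with
# beta_x = x, lambda = counting measure on {0, 1}, M(gamma) = log(1 + e^gamma).
gamma = 0.7
M = lambda g: np.log(1.0 + np.exp(g))

# S(gamma) = grad M(gamma), computed by a central finite difference ...
h = 1e-6
grad_M = (M(gamma + h) - M(gamma - h)) / (2 * h)

# ... equals the mean of beta_x under phi(. | gamma), as in (18)
phi1 = np.exp(gamma - M(gamma))          # phi(1 | gamma); phi(0 | gamma) = 1 - phi1
mean_beta = 0 * (1 - phi1) + 1 * phi1

print(grad_M, mean_beta)                 # both equal e^gamma / (1 + e^gamma)
```

In this toy case injectivity of \({\mathcal {S}}\) is just the strict monotonicity of the logistic function, mirroring the strict convexity of \(M_{\varphi }\) in the general statement.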

To introduce the setting of our theorem on PCR, it is useful to express the statistical model \(\mu (\cdot \,|\,\cdot )\) in terms of an infinite-dimensional exponential family. In this regard, we introduce a further measurable mapping \(g: \varTheta \rightarrow \varGamma \) and write

$$\begin{aligned} \mu (\text {d}x\,|\,\theta ) = \varphi (x\,|\, g(\theta )) \lambda (\text {d}x)\ . \end{aligned}$$
(19)

In the setting of (19), we observe that the identity (7) is satisfied with \(\mathbb {S}= \mathbb {B}\), \(\text {d}_{\mathbb {S}}(s_1, s_2) = \Vert s_1 - s_2\Vert _{\mathbb {B}}\) and \(\mathfrak {S}_n(x_1, \dots , x_n) = n^{-1} \sum _{1\le i\le n} \beta _{x_i}\). Therefore, we write

$$\begin{aligned} \hat{S}_n = \frac{1}{n} \sum _{i=1}^n \beta _{\xi _i} \ . \end{aligned}$$
(20)

Note that Eq. (19) arises naturally from the assumption that the statistical model \(\mu (\cdot \,|\,\cdot )\) is dominated, which provides a family \(\{f(\cdot \,|\,\theta )\}_{\theta \in \varTheta }\) of density functions. Accordingly, by assuming that \(\mathbb {X}\) is endowed with a richer metric structure, if \(f(\cdot \,|\,\theta ) > 0\) and \(x \mapsto f(x\,|\,\theta )\) is continuous for any \(\theta \in \varTheta \), we write

$$\begin{aligned} \log f(x\,|\,\theta ) = \int _{\mathbb {X}} \log f(y\,|\,\theta ) \delta _x(\text {d}y). \end{aligned}$$

The functions g and \(\beta \) then arise from the mapping \(y \mapsto \log f(y\,|\,\theta )\) and the measure \(\delta _x\) through (some sort of) integration by parts, if this is admitted, or through classical Fourier-transform arguments, such as the Plancherel formula.

According to Ledoux and Talagrand [62, Corollary 7.10], under the assumption

$$\begin{aligned} \mathbb {E}\left[ \Vert \beta _{\xi _i}\Vert _{\mathbb {B}}\right] = \int _{\mathbb {X}} \Vert \beta _x\Vert _{\mathbb {B}}\ \mu _0(\text {d}x) = \int _{\mathbb {X}} \Vert \beta _x\Vert _{\mathbb {B}}\ \varphi (x\,|\, g(\theta _0)) \lambda (\text {d}x) < +\infty , \end{aligned}$$
(21)

we set

$$\begin{aligned} S_0 = \int _{\mathbb {X}} \beta _x \mu _0(\text {d}x) = \int _{\mathbb {X}} \beta _x \varphi (x\,|\, g(\theta _0)) \lambda (\text {d}x) \end{aligned}$$
(22)

in the sense of a Bochner integral, and conclude that the strong law of large numbers holds, i.e. \(\hat{S}_n \rightarrow S_0\) \(\mathbb {P}\)-a.s., as \(n\rightarrow +\infty \). Now, we set \(M(\theta ) = M_{\varphi }(g(\theta ))\) and then define

$$\begin{aligned} \pi _n^{*}(\text {d}\theta \,|\, b):= \frac{\exp \{n[ _{\mathbb {B}^{*}}\!\langle g(\theta ), b\rangle _{\mathbb {B}} - M(\theta )]\} \pi (\text {d}\theta )}{\int _{\varTheta } \exp \{n[ _{\mathbb {B}^{*}}\!\langle g(\tau ), b\rangle _{\mathbb {B}} - M(\tau )]\} \pi (\text {d}\tau )}, \end{aligned}$$
(23)

provided that

$$\begin{aligned} \int _{\varTheta } \exp \{n\ _{\mathbb {B}^{*}}\!\langle g(\tau ), b\rangle _{\mathbb {B}} \} \pi (\text {d}\tau ) < +\infty \end{aligned}$$
(24)

for any \(n \in \mathbb {N}\) and \(b \in \mathbb {B}\). We remark that (24) is a necessary assumption for the existence of the posterior distribution. Now, we state the theorem on PCRs in the setting of infinite-dimensional exponential families; the proof is deferred to Appendix A.3.

Theorem 1

Let \(p \ge 1\) be a fixed number. Let \(\{\varphi (\cdot \,|\,\gamma )\}_{\gamma \in \varGamma }\) be a regular infinite-dimensional exponential family according to Definition 3. Let \(\varTheta \) be an open, connected subset of some separable Hilbert space \(\mathbb {H}\). Let \(g: \varTheta \rightarrow \varGamma \) be a measurable mapping for which representation (19) is in force. For a fixed \(\theta _0 \in \varTheta \), suppose that:

  1. (i)

    (21) is valid;

  2. (ii)

    (24) holds for any \(n \in \mathbb {N}\) and \(b \in \mathbb {B}\);

  3. (iii)

    \(\int _{\varTheta } \Vert \theta \Vert ^{ap} \pi (\text {d}\theta ) < +\infty \) for some \(a > 1\);

  4. (iv)

    there exists a sequence \(\{\delta _n\}_{n \ge 1}\) of positive numbers for which (9) is valid for any \(n \in \mathbb {N}\), with \(\mathbb {S}= \mathbb {B}\), \(\text {d}_{\mathbb {S}}(s_1, s_2) = \Vert s_1 - s_2\Vert _{\mathbb {B}}\) and suitable positive constants \(L_0^{(n)}\).

Then, for the PCR \(\epsilon _n\) at \(\theta _0\) it holds

$$\begin{aligned} \epsilon _n&\lesssim \left( \int _{\varTheta } \Vert \theta - \theta _0\Vert _{\varTheta }^p \pi _n^{*}(\text {d}\theta \,|\, S_0) \right) ^{\frac{1}{p}}\nonumber \\&\quad + \Vert \theta _0\Vert _{\varTheta }\ \mathbb {P}\left[ \hat{S}_n \not \in \mathcal {U}_{\delta _n}(S_0) \right] \nonumber \\&\quad + \left( \mathbb {E}\left[ \int _{\varTheta } \Vert \theta \Vert _{\varTheta }^{ap} \pi _n^{*}(\text {d}\theta \,|\, \hat{S}_n) \right] \right) ^{\frac{1}{ap}} \left( \mathbb {P}\left[ \hat{S}_n \not \in \mathcal {U}_{\delta _n}(S_0)\right] \right) ^{1 - \frac{1}{ap}} \nonumber \\&\quad + L_0^{(n)} \mathbb {E}[\Vert \hat{S}_n- S_0\Vert _{\mathbb {B}}] \end{aligned}$$
(25)

where \(\pi _n^{*}(\cdot \,|\,\cdot )\) is given by (23), \(S_0\) and \(\hat{S}_n\) are as in (22) and (20), respectively, and \(\mathcal {U}_{\delta _n}(S_0):= \{S \in \mathbb {S}\ |\ \Vert S_0-S\Vert _{\mathbb {B}} < \delta _n\}\).

Theorem 1 provides an implicit form for PCRs. That is, the large n asymptotic behaviour of the terms on the right-hand side of (25) must be further investigated to obtain a more explicit expression for the corresponding PCR. In this regard, it is useful to rewrite \(\pi ^{*}_{n}\) in terms of the Kullback–Leibler divergence. That is, if \({\mathcal {S}} \circ g\) is injective and b belongs to the range of \({\mathcal {S}} \circ g\), then

$$\begin{aligned} \pi _n^{*}(\text {d}\theta \,|\, b) = \frac{\exp \{-n \textsf {K}(\theta \,|\, \theta _b)\}\pi (\text {d}\theta )}{\int _{\varTheta } \exp \{-n \textsf {K}(\tau \, |\, \theta _b)\}\pi (\text {d}\tau )} \end{aligned}$$
(26)

where \(\theta _b = ({\mathcal {S}} \circ g)^{-1}(b)\) and

$$\begin{aligned} \textsf {K}(\theta \,|\,\theta '):= \int _{\mathbb {X}} \left[ \ln \left( \frac{f(x\,|\,\theta ')}{f(x\,|\,\theta )}\right) \right] f(x\,|\,\theta ')\text {d}x \end{aligned}$$
(27)

denotes the Kullback–Leibler divergence. See Appendix A.2 for the proof of Eq. (26). It is natural to expect that the main contribution to PCRs arises from the first and the fourth terms on the right-hand side of (25), which provide general algebraic rates of convergence to zero. Hereafter, we investigate the large n asymptotic behaviour of the terms on the right-hand side of (25). More explicit PCRs will be presented in Sect. 4, where Theorem 1 is applied to the regular parametric model, the multinomial model, the finite-dimensional and the infinite-dimensional logistic-Gaussian models, and the infinite-dimensional linear regression.
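To illustrate the equivalence of (23) and (26) in the simplest possible case, consider the Gaussian location model with \(f(x\,|\,\theta )\) the \(N(\theta ,1)\) density, for which \(g(\theta )=\theta \), \(M(\theta )=\theta ^2/2\) and \(\textsf {K}(\theta \,|\,\theta _b)=(\theta -\theta _b)^2/2\). The following numerical sketch (the model, prior and grid are illustrative choices, not prescribed by the text) checks that the two representations define the same normalized measure:

```python
import math

# Toy check that representations (23) and (26) define the same measure.
# Gaussian location model (illustrative): f(x|theta) = N(theta, 1) density,
# so g(theta) = theta, M(theta) = theta**2 / 2 and
# K(theta | theta_b) = (theta - theta_b)**2 / 2.
n, b = 25, 0.3                                    # sample size, sufficient statistic
thetas = [i / 200.0 - 2.0 for i in range(801)]    # grid on [-2, 2]
prior = [math.exp(-t * t / 2) for t in thetas]    # unnormalized N(0, 1) prior

# (23): weights proportional to exp{n [<g(theta), b> - M(theta)]} * prior
w23 = [math.exp(n * (t * b - t * t / 2)) * p for t, p in zip(thetas, prior)]
# (26): weights proportional to exp{-n K(theta | theta_b)} * prior, with theta_b = b
w26 = [math.exp(-n * (t - b) ** 2 / 2) * p for t, p in zip(thetas, prior)]

post23 = [w / sum(w23) for w in w23]
post26 = [w / sum(w26) for w in w26]
max_gap = max(abs(p - q) for p, q in zip(post23, post26))
```

The two weight vectors differ only by the constant factor \(e^{-nb^2/2}\), which cancels upon normalization; this cancellation is exactly the mechanism behind (26).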

3.1.1 First term on the right-hand side of (25)

We start by considering the large n asymptotic behaviour of the first term on the right-hand side of (25). In particular, from (26), we can rewrite this term as

$$\begin{aligned}&\int _{\varTheta } \Vert \theta - \theta _0\Vert _{\varTheta }^p \pi _n^{*}(\text {d}\theta \,|\, S_0)\nonumber \\&\quad = \frac{\int _{\varTheta } \Vert \theta - \theta _0\Vert _{\varTheta }^p \exp \{n[ _{\mathbb {B}^{*}}\!\langle g(\theta ), S_0\rangle _{\mathbb {B}} - M(\theta )]\} \pi (\text {d}\theta )}{\int _{\varTheta } \exp \{n[ _{\mathbb {B}^{*}}\!\langle g(\tau ), S_0\rangle _{\mathbb {B}} - M(\tau )]\} \pi (\text {d}\tau )} \nonumber \\&\quad =\frac{\int _{\varTheta } \Vert \theta - \theta _0\Vert _{\varTheta }^p \exp \{-n \textsf {K}(\theta \, |\, \theta _0)\}\pi (\text {d}\theta )}{\int _{\varTheta } \exp \{-n \textsf {K}(\theta \, |\, \theta _0)\}\pi (\text {d}\theta )}\ . \end{aligned}$$
(28)

The last expression in (28) is a ratio of two Laplace integrals, and therefore the Laplace method for approximating integrals can be applied. In the finite-dimensional setting, i.e. \(\varTheta \subseteq \mathbb {R}^d\), the Laplace approximation method is well-known [22, 87], and it leads to the following proposition.

Proposition 1

In the case \(\varTheta \subseteq \mathbb {R}^d\), assume that \(\pi \) has a continuous density q with respect to the Lebesgue measure, with \(q(\theta _0) > 0\), and that \(\theta \mapsto \textsf {K}(\theta \,|\, \theta _0)\) is a \(C^2\)-function with a strictly positive definite Hessian at \(\theta _0\), which coincides with the Fisher information matrix \(\text {I}[\theta _0]\) at \(\theta _0\). Assume also that \(\int _{\varTheta } |\theta |^p\pi (\text {d}\theta ) < +\infty \) for some \(p \ge 1\). Finally, suppose that for any \(\delta > 0\) there exists \(c(\delta ) >0\) such that

$$\begin{aligned} \inf _{|\theta - \theta _0| \ge \delta } \textsf {K}(\theta \,|\, \theta _0) \ge c(\delta ). \end{aligned}$$
(29)

Then, for any \(p > 0\), there hold

$$\begin{aligned} \int _{\varTheta } |\theta - \theta _0|^p e^{-n \textsf {K}(\theta \,|\, \theta _0)} \pi (\text {d}\theta ) \sim \frac{q(\theta _0)}{2}\left( \frac{2}{n}\right) ^{\frac{d+p}{2}} \varGamma \left( \frac{d+p}{2}\right) \frac{\int _{S_d(1)} \{\langle z, \text {I}[\theta _0]^{-1} z\rangle \}^{p/2} \text {d}S(z)}{\sqrt{\text {det}[\text {I}(\theta _0)]}} \end{aligned}$$

and

$$\begin{aligned} \int _{\varTheta } e^{-n \textsf {K}(\theta \, |\, \theta _0)} \pi (\text {d}\theta ) \sim \left( \frac{2\pi }{n}\right) ^{d/2} \frac{q(\theta _0)}{\sqrt{\text {det}[\text {I}(\theta _0)]}} \end{aligned}$$

as \(n \rightarrow +\infty \), where \(S_d(1):= \{z \in \mathbb {R}^d\ |\ \Vert z\Vert = 1\}\), \(\text {d}S\) denotes the surface measure and \(\langle \cdot , \cdot \rangle \) stands for the standard scalar product in \(\mathbb {R}^d\). Thus, under these assumptions,

$$\begin{aligned} \int _{\varTheta } \Vert \theta - \theta _0\Vert _{\varTheta }^p \pi _n^{*}(\text {d}\theta \,|\, S_0) = O(n^{-p/2}) \end{aligned}$$
(30)

as \(n\rightarrow +\infty \).
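The conclusion (30) of Proposition 1 can be probed numerically in dimension \(d=1\); in the sketch below, the choices \(\textsf {K}(\theta \,|\,\theta _0)=(\theta -\theta _0)^2/2\) (the Gaussian location model) and a uniform prior on \([-1,1]\) are illustrative assumptions, and the empirical exponent of the posterior \(p\)-th moment should approach \(p/2\):

```python
import math

# Numerical probe of (30) in dimension d = 1 (illustrative choices):
# K(theta | theta0) = (theta - theta0)**2 / 2 and a uniform prior on [-1, 1].
theta0, p = 0.0, 2
grid = [i / 1000.0 - 1.0 for i in range(2001)]

def posterior_moment(n):
    # ratio of the two Laplace integrals as in (28), via Riemann sums
    num = sum(abs(t - theta0) ** p * math.exp(-n * (t - theta0) ** 2 / 2) for t in grid)
    den = sum(math.exp(-n * (t - theta0) ** 2 / 2) for t in grid)
    return num / den

m1, m2 = posterior_moment(200), posterior_moment(800)
empirical_rate = math.log(m1 / m2) / math.log(800 / 200)   # expect ~ p / 2
```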

It is interesting to observe that the inequality (29) is a sort of strengthening of the so-called Shannon–Kolmogorov information inequality. See, e.g., Ferguson [42, Chapter 17]. In particular, because of (29), integrals on the whole \(\varTheta \) can be reduced to integrals over balls centered at \(\theta _0\), as integration over the complement of any such ball yields exponentially small quantities with respect to n.

According to Proposition 1, in the finite-dimensional setting the prior distribution does not affect the large n asymptotic behaviour of the first term on the right-hand side of (25). In contrast with the finite-dimensional setting, the literature on the Laplace approximation method in the infinite-dimensional setting appears to be underdeveloped. Indeed, to the best of our knowledge, infinite-dimensional Laplace approximations are limited to the case in which the measure \(\pi \) is a Gaussian measure [2, 3]. Unfortunately, this literature does not cover the case in which the Hessian of the map \(\theta \mapsto \textsf {K}(\theta \,|\, \theta _0)\) at \(\theta _0\) is not coercive (uniformly elliptic), which is precisely the case of interest in our problem. The next proposition covers this critical gap; the proof is deferred to Appendix A.4. The proposition is of independent interest in the context of the classical Laplace method.

Proposition 2

Let \(\varTheta \) be a separable Hilbert space with scalar product \(\langle \cdot , \cdot \rangle \), and let \(\pi \) be the non-degenerate Gaussian measure \(\mathcal {N}(m,Q)\), with \(m \in \varTheta \) and Q a trace-class operator. For fixed \(\theta _0 \in \varTheta \), assume that \(\theta \mapsto \textsf {K}(\theta \,|\,\theta _0)\) belongs to \(C^{2+q}(\varTheta )\) for some \(q \in (0,1]\), and that its Hessian at \(\theta _0\), which coincides with the Fisher information operator \(\text {I}(\theta _0)\) at \(\theta _0\), is a compact self-adjoint linear operator from \(\varTheta \) into itself, with trivial kernel. Suppose there exists an orthonormal Fourier basis \(\{{{\textbf {e}}}_k\}_{k \ge 1}\) of \(\varTheta \) which diagonalizes simultaneously both Q and \(\text {I}(\theta _0)\), so that

$$\begin{aligned} Q[{{\textbf {e}}}_k] = \lambda _k {{\textbf {e}}}_k \qquad \qquad \text {I}(\theta _0)[{{\textbf {e}}}_k] = \gamma _k {{\textbf {e}}}_k \end{aligned}$$
(31)

are valid with two suitable sequences \(\{\lambda _k\}_{k \ge 1}\) and \(\{\gamma _k\}_{k \ge 1}\) of strictly positive numbers that go to zero as \(k \rightarrow +\infty \), with \(\{\lambda _k\}_{k \ge 1} \in \ell _1\). Finally, assume there exist two other Hilbert spaces \(\mathbb {K}\) and \(\mathbb {V}\) such that

  1. (i)

    \(\mathbb {V}\subset \varTheta \subset \mathbb {K}\) with continuous, dense embeddings;

  2. (ii)

    an interpolation inequality like

    $$\begin{aligned} \Vert \theta \Vert _{\varTheta } \lesssim \Vert \theta \Vert _{\mathbb {K}}^{1/r} \Vert \theta \Vert _{\mathbb {V}}^{1/s} \end{aligned}$$
    (32)

    holds for any \(\theta \in \mathbb {V}\) with conjugate exponents \(r,s > 1\) such that \(r < 1 + q/2\);

  3. (iii)

    for all \(\theta \in \mathbb {V}\), the inequalities

    $$\begin{aligned}&\textsf {K}(\theta \, |\, \theta _0) \ge \phi (\Vert \theta - \theta _0\Vert _{\mathbb {K}}) \end{aligned}$$
    (33)
    $$\begin{aligned}&\langle \theta -\theta _0, \text {I}(\theta _0)[\theta -\theta _0]\rangle \gtrsim \Vert \theta - \theta _0\Vert _{\mathbb {K}}^2 \end{aligned}$$
    (34)

    are valid with some monotone non-decreasing function \(\phi : [0,+\infty ) \rightarrow [0,+\infty )\) such that \(\phi (x) = O(x^2)\) as \(x \rightarrow 0^+\);

  4. (iv)

    \(\pi (\mathbb {V}) = 1\) and \(\int _{\mathbb {V}} e^{t \Vert \theta \Vert _{\mathbb {V}}} \pi (\text {d}\theta ) < +\infty \) for some \(t > 0\).

Then, as \(n\rightarrow +\infty \), the following expansion

$$\begin{aligned} \int _{\varTheta } \Vert \theta - \theta _0\Vert _{\varTheta }^2 \pi _n^{*}(\text {d}\theta \,|\, S_0) = O\left( \sum _{k=1}^{\infty } \frac{\lambda _k}{n \lambda _k \gamma _k + 1} \right) + O\left( \sum _{k=1}^{\infty } \frac{\omega _k^2}{(n \lambda _k \gamma _k + 1)^2} \right) \end{aligned}$$
(35)

holds with the sequence \(\{\omega _k\}_{k \ge 1} \in \ell _2\) given by \((\theta _0 - m) = \sum _{k=1}^{\infty } \omega _k{{\textbf {e}}}_k\).

Remark 2

In the infinite-dimensional setting, the assumption (29) is, in general, too strong. Conditions (33)–(34), combined with the interpolation (32), constitute a reasonable set of assumptions that allow a quite general treatment in the applications. It is worth noticing that (29), as well as (33), is expressed in the form of a lower bound for \(\textsf {K}(\theta \, |\, \theta _0)\). These bounds are conceptually opposite with respect to the so-called “prior mass condition” required in the standard theory ([50, Theorem 8.9, inequality (8.4)]), which is usually proved by means of upper bounds for \(\textsf {K}(\theta \, |\, \theta _0)\). See, e.g. the upper bounds for \(\textsf {K}(\theta \, |\, \theta _0)\) in Lemma 2.5 of Ghosal and van der Vaart [50].

Remark 3

With respect to Proposition 1, the statement of Proposition 2 is confined to the case \(p=2\). There are no technical obstacles to treating the more general case \(p \ne 2\), though \(p=2\) yields a more readable (conclusive) result.

Remark 4

Assumption (31) is not necessary to obtain PCRs. However, without this assumption, the resulting PCR would have a complicated form, which may be recovered from the proof. For example, let \(\varTheta _N\) be the finite-dimensional subspace of \(\varTheta \) obtained by the linear span of \(\{{{\textbf {e}}}_1, \dots , {{\textbf {e}}}_N\}\), let \(Q_N\) denote the \(N\times N\) matrix that represents the restriction of Q to \(\varTheta _N\), after projecting the range of such restriction again on \(\varTheta _N\), and let \(\text {I}_N(\theta _0)\) denote the \(N\times N\) matrix associated to the same restriction of the operator \(\text {I}(\theta _0)\) to \(\varTheta _N\). If \(Q_N\) and \(\text {I}_N(\theta _0)\) are non-singular, then the first term on the right-hand side of (35) can be replaced by

$$\begin{aligned} \lim _{N\rightarrow +\infty } \text {Tr}\left[ \left( n\text {I}_N(\theta _0) + Q_N^{-1}\right) ^{-1}\right] , \end{aligned}$$
(36)

which is not as clear as the series \(\sum _{k=1}^{\infty } \lambda _k/(n \lambda _k \gamma _k + 1)\). An analogous operation can be performed with respect to the second term on the right-hand side of (35).

Moreover, the above argument can be reinforced by resorting to some trace inequalities, as explained in [26]. In particular, we assume there exists another compact, self-adjoint operator \(\text {I}^{*}\) such that \(\text {I}(\theta _0) \ge \text {I}^{*}\) in the sense of quadratic forms, i.e.

$$\begin{aligned} \langle \theta , \text {I}(\theta _0)[\theta ]\rangle \ge \langle \theta , \text {I}^{*}[\theta ]\rangle \end{aligned}$$

for any \(\theta \in \varTheta \). Whence, upon denoting by \(\text {I}^{*}_N\) the restriction of \(\text {I}^{*}\) to \(\varTheta _N\) as above, we have \(\text {I}_N(\theta _0) \ge \text {I}^{*}_N\) and, consequently, \(n\text {I}_N(\theta _0) + Q_N^{-1} \ge n \text {I}^{*}_N + Q_N^{-1}\). By the Löwner–Heinz theorem, the mapping \(t \mapsto -t^{-1}\) is operator monotone, yielding that

$$\begin{aligned} \text {Tr}\left[ \left( n\text {I}_N(\theta _0) + Q_N^{-1}\right) ^{-1}\right] \le \text {Tr}\left[ \left( n \text {I}^{*}_N + Q_N^{-1}\right) ^{-1}\right] \ . \end{aligned}$$

See again [26] for the details. Therefore, if the orthonormal Fourier basis \(\{{{\textbf {e}}}_k\}_{k \ge 1}\) of \(\varTheta \) diagonalizes simultaneously both Q and \(\text {I}^{*}\) (instead of \(\text {I}(\theta _0)\)), so that

$$\begin{aligned} Q[{{\textbf {e}}}_k] = \lambda _k {{\textbf {e}}}_k \qquad \qquad \text {I}^{*}[{{\textbf {e}}}_k] = \gamma _k^{*} {{\textbf {e}}}_k \end{aligned}$$
(37)

are valid with suitable strictly positive \(\gamma _k^{*}\)’s that go to zero as \(k \rightarrow +\infty \), then by Proposition 2

$$\begin{aligned} \int _{\varTheta } \Vert \theta - \theta _0\Vert _{\varTheta }^2 \pi _n^{*}(\text {d}\theta \,|\, S_0) \lesssim \sum _{k=1}^{\infty } \frac{\lambda _k}{n \lambda _k \gamma _k^{*} + 1} + \sum _{k=1}^{\infty } \frac{\omega _k^2}{(n \lambda _k \gamma _k^{*} + 1)^2} \ . \end{aligned}$$
(38)
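The trace comparison underlying (38) can be checked in finite dimension; the following sketch (all \(2\times 2\) matrices are arbitrary illustrative choices) verifies the Löwner–Heinz consequence \(\text {Tr}[(n\text {I}_N + Q_N^{-1})^{-1}] \le \text {Tr}[(n\text {I}^{*}_N + Q_N^{-1})^{-1}]\) when \(\text {I}_N \ge \text {I}^{*}_N\):

```python
# Finite-dimensional check of the trace comparison: if I >= I* in the sense of
# quadratic forms, then Tr[(n I + Q^{-1})^{-1}] <= Tr[(n I* + Q^{-1})^{-1}].
# All 2x2 matrices below are arbitrary illustrative choices.
def inv2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def add2(m1, m2):
    return [[m1[i][j] + m2[i][j] for j in range(2)] for i in range(2)]

def scale2(m, s):
    return [[s * m[i][j] for j in range(2)] for i in range(2)]

Q = [[1.0, 0.2], [0.2, 0.5]]                       # covariance (positive definite)
I_star = [[0.6, 0.1], [0.1, 0.3]]                  # lower operator bound I*
I_full = add2(I_star, [[0.4, 0.0], [0.0, 0.2]])    # I = I* + PSD perturbation

Qinv = inv2(Q)
traces = []
for n in (10, 100, 1000):
    t_full = sum(inv2(add2(scale2(I_full, n), Qinv))[k][k] for k in range(2))
    t_star = sum(inv2(add2(scale2(I_star, n), Qinv))[k][k] for k in range(2))
    traces.append((t_full, t_star))
trace_inequality_holds = all(tf <= ts for tf, ts in traces)
```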

Proposition 2 shows that the large n asymptotic behavior of the first term on the right-hand side of (25) is worse than 1/n, which is the large n asymptotic behaviour obtained in Proposition 1 with \(p=2\). For example, by taking the first term on the right-hand side of (35) into account, if \(\lambda _k \sim k^{-(1+a)}\) and \(\gamma _k \sim k^{-b}\) as \(k\rightarrow +\infty \), for some \(a,b > 0\), a straightforward calculation shows that

$$\begin{aligned} \sum _{k=1}^{\infty } \frac{k^{-(1+a)}}{n k^{-(1+a+b)} + 1} \sim n^{-{\frac{a}{1+a+b}}} \end{aligned}$$
(39)

holds as \(n\rightarrow +\infty \). As for the second term on the right-hand side of (35), it can be made identically zero by choosing \(m = \theta _0\), that is, by centering the Gaussian prior at \(\theta _0\). However, if \(\lambda _k \sim k^{-(1+a)}\), \(\gamma _k \sim k^{-b}\) and \(\omega ^2_k \sim k^{-(1+c)}\) as \(k\rightarrow +\infty \), for some choice of \(a,b,c > 0\) with \(c < 2(1+a+b)\), then

$$\begin{aligned} \sum _{k=1}^{\infty } \frac{k^{-(1+c)}}{(n k^{-(1+a+b)} + 1)^2} \sim n^{-{\frac{c}{1+a+b}}} \end{aligned}$$

holds as \(n\rightarrow +\infty \). Therefore, if \(c < a\) this second term decays more slowly than the one in (39), whilst if \(c > a\) it is negligible with respect to that term. Returning to (39), it is interesting to notice what happens if the eigenvalues \(\lambda _k\) approach zero very rapidly, like \(\lambda _k \sim e^{-k}\), for example. Another straightforward calculation shows that

$$\begin{aligned} \sum _{k=1}^{\infty } \frac{e^{-k}}{n e^{-k} k^{-b} + 1} \sim \frac{(\log n)^{b+1}}{n} \end{aligned}$$

holds as \(n\rightarrow +\infty \). A refinement of this argument entails that the large n asymptotic behavior of the right-hand side of (35) can be made arbitrarily close to the rate 1/n, for example by choosing \(\lambda _k \sim e^{-k^r}\) and \(\gamma _k \sim k^{-b}\) and \(\omega ^2_k \sim k^{-(1+c)}\) for some \(r, b,c > 0\), with arbitrarily large r. By recalling that the first term on the right-hand side of (25) coincides with the square root of the left-hand side of (35), this argument shows that the PCR is arbitrarily close to \(1/\sqrt{n}\). It is reasonable to guess that the minimax (classical) risk should go to zero as fast as \(1/\sqrt{n}\), though we are not aware of any result proving such a behaviour.
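The rate claimed in (39) is easy to check numerically; in the sketch below, the decay exponents \(a = b = 1\) and the truncation level of the series are illustrative choices, and the empirical exponent should approach \(a/(1+a+b) = 1/3\):

```python
import math

# Empirical check of the rate in (39): with lambda_k ~ k^{-(1+a)} and
# gamma_k ~ k^{-b}, the series should behave like n^{-a/(1+a+b)}.
# The decay exponents a = b = 1 and the truncation level are illustrative.
a, b, K = 1.0, 1.0, 200000

def series(n):
    return sum(k ** -(1 + a) / (n * k ** -(1 + a + b) + 1) for k in range(1, K))

s1, s2 = series(10 ** 3), series(10 ** 5)
empirical_exponent = math.log(s1 / s2) / math.log(10 ** 2)   # expect ~ a/(1+a+b) = 1/3
```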

A merit of Proposition 2 is to show explicitly that, within the infinite-dimensional setting, PCRs are influenced by three quantities that do not appear in the finite-dimensional setting of Proposition 1: (i) the rate of approach to zero of the sequence \(\{\lambda _k\}_{k \ge 1}\), which measures the “regularity of the prior”; (ii) the rate of approach to zero of the sequence \(\{\gamma _k\}_{k \ge 1}\), which measures the “regularity of the model”; (iii) the rate of approach to zero of the sequence \(\{\omega _k\}_{k \ge 1}\), which measures how close \(\theta _0\) is to m. Finally, we notice that the space \(\mathbb {V}\) is linked with the Cameron–Martin space associated to \(\pi \), which must be included in \(\mathbb {V}\).

3.1.2 Second and third terms on the right-hand side of (25)

Now, we consider the large n asymptotic behaviour of the second term and of the third term on the right-hand side of (25). Both these terms depend explicitly on

$$\begin{aligned} \mathbb {P}\left[ \hat{S}_n \not \in \mathcal {U}_{\delta _n}(S_0) \right] = \mathbb {P}\left[ \Vert \hat{S}_n - S_0\Vert _{\mathbb {B}} \ge \delta _n \right] = \mathbb {P}\left[ \Vert \hat{S}_n - \mathbb {E}[\hat{S}_n]\Vert _{\mathbb {B}} \ge \delta _n \right] . \end{aligned}$$
(40)

Note that the tail probability in (40) is directly related to classical concentration inequalities for sums of random variables. Besides well-known Bernstein-type concentration inequalities for real-valued random variables [20, 33], some useful generalizations and extensions can be found in, e.g., Giné and Nickl [53], Ledoux and Talagrand [62], Pinelis and Sakhanenko [68] and Yurinskii [89]. In particular, for a suitable choice of the sequence \(\{\delta _n\}_{n \ge 1}\), such as a constant sequence or a sequence vanishing at an algebraic rate, the term (40) goes to zero at a suitable exponential rate, and therefore it provides a negligible contribution to the right-hand side of (25).
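The exponential decay of (40) can be illustrated in the simplest bounded real-valued case via Hoeffding's inequality, \(\mathbb {P}[|\bar{X}_n - \mu | \ge \delta ] \le 2e^{-2n\delta ^2}\); in the following Monte Carlo sketch, the Bernoulli data and all numerical choices are illustrative assumptions, not part of the setting above:

```python
import math
import random

# Monte Carlo illustration of the exponential decay of (40) for bounded
# real-valued summands, via Hoeffding's inequality:
# P(|mean - mu| >= delta) <= 2 exp(-2 n delta^2).
# Bernoulli data and all numerical choices are illustrative.
random.seed(0)
n, delta, mu, reps = 100, 0.1, 0.5, 20000
hits = 0
for _ in range(reps):
    mean = sum(random.random() < mu for _ in range(n)) / n
    hits += abs(mean - mu) >= delta
tail_estimate = hits / reps                           # true value is about 0.057
hoeffding_bound = 2 * math.exp(-2 * n * delta ** 2)   # about 0.271
```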

The third term on the right-hand side of (25) includes the posterior moment \(\mathbb {E}[\int _{\varTheta } \Vert \theta \Vert _{\varTheta }^{ap} \pi _n^{*}(\text {d}\theta \,|\, \hat{S}_n)] = \mathbb {E}[ \int _{\varTheta } \Vert \theta \Vert _{\varTheta }^{ap} \pi _n(\text {d}\theta \,|\, \xi _1, \dots , \xi _n)]\). In particular, an application of Hölder’s inequality shows that such a moment is bounded from above by

$$\begin{aligned}&\left( \int _{\varTheta } \Vert \theta \Vert ^{\rho ap}\pi (\text {d}\theta )\right) ^{1/\rho }\left( \int _{\mathbb {X}^n} \left[ \frac{\prod _{i=1}^n f(x_i\,|\,\theta _0)}{\rho _n(x_1, \dots , x_n)} \right] ^{\rho '} \!\!\!\!\!\rho _n(x_1, \dots , x_n) \prod _{i=1}^{n}\lambda (\text {d}x_i)\right) ^{1/\rho '} \end{aligned}$$

for conjugate exponents \(\rho , \rho ' > 1\), provided that \(\int _{\varTheta } \Vert \theta \Vert ^{\rho ap}\pi (\text {d}\theta ) < +\infty \). It is useful to recall that the density function \(\rho _n\) has been defined in (3). Accordingly, the second factor above coincides with the \(\rho '\)-th moment of a martingale, since

$$\begin{aligned}&\int _{\mathbb {X}^n} \left[ \frac{\prod _{i=1}^n f(x_i\,|\,\theta _0)}{\rho _n(x_1, \dots , x_n)} \right] ^{\rho '} \!\!\!\!\!\rho _n(x_1, \dots , x_n) \prod _{i=1}^{n}\lambda (\text {d}x_i)= \mathbb {E}\left[ \left( \frac{\prod _{i=1}^n f(X_i\,|\,\theta _0)}{\rho _n(X_1, \dots , X_n)} \right) ^{\rho '} \right] \end{aligned}$$

and

$$\begin{aligned} \mathbb {E}\left[ \frac{\prod _{i=1}^{n+1} f(X_i\,|\,\theta _0)}{\rho _{n+1}(X_1, \dots , X_{n+1})}\,\Big |\, X_1, \dots , X_n \right] = \frac{\prod _{i=1}^n f(X_i\,|\,\theta _0)}{\rho _{n}(X_1, \dots , X_{n})}\ . \end{aligned}$$

At this stage, a possible strategy may rely on well-known bounds for moments of martingales [34]. As for the term \(\mathbb {E}[\Vert \hat{S}_n- S_0\Vert _{\mathbb {B}}]\), a direct application of Lyapunov’s inequality yields

$$\begin{aligned} \mathbb {E}[\Vert \hat{S}_n- S_0\Vert _{\mathbb {B}}] \le \left( \mathbb {E}[\Vert \hat{S}_n- S_0\Vert _{\mathbb {B}}^2] \right) ^{1/2} \end{aligned}$$

and the right-hand side typically goes to zero as \(1/\sqrt{n}\). Besides the obvious case in which \(\mathbb {B}\) coincides with a separable Hilbert space, we refer to Nemirovski [66], Massart [63] and Massart and Rossignol [64] for the case in which we have \(\mathbb {B}= \ell _p(\mathbb {R}^d)\).
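The \(1/\sqrt{n}\) decay of \(\mathbb {E}[\Vert \hat{S}_n- S_0\Vert ]\) can be illustrated by simulation in the Euclidean case; in the sketch below, \(\mathbb {B}= \mathbb {R}^2\) and the mean-zero uniform coordinate distribution are illustrative choices, and quadrupling \(n\) should roughly halve the mean error:

```python
import math
import random

# Monte Carlo check that E[||S_hat_n - S_0||] decays like n^{-1/2};
# here B = R^2 with the Euclidean norm and mean-zero uniform coordinates
# (illustrative choices).
random.seed(1)
d, reps = 2, 4000

def mean_error(n):
    total = 0.0
    for _ in range(reps):
        sums = [0.0] * d
        for _ in range(n):
            for j in range(d):
                sums[j] += random.uniform(-1.0, 1.0)
        total += math.sqrt(sum((s / n) ** 2 for s in sums))
    return total / reps

e1, e2 = mean_error(50), mean_error(200)
ratio = e1 / e2   # quadrupling n should roughly halve the error, so ratio ~ 2
```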

3.1.3 Fourth term on the right-hand side of (25)

Finally, we consider the large n asymptotic behaviour of the fourth term on the right-hand side of (25). In particular, this term involves the constant \(L_0^{(n)}\), whose treatment requires recalling some fundamental notions of infinite-dimensional calculus. Given \(g: \varTheta \rightarrow \mathbb {B}^{*}\), the Fréchet differential \(\mathfrak {D}_{\theta }[g]\) of g is now meant as a bounded linear operator from \(\varTheta \) to \(\mathbb {B}^{*}\) such that \(g(\theta + \delta ) = g(\theta ) + \mathfrak {D}_{\theta }[g](\delta ) + o(\Vert \delta \Vert _{\varTheta })\), as \(\delta \rightarrow 0\) in \(\varTheta \), and

$$\begin{aligned} \Vert \mathfrak {D}_{\theta }[g]\Vert _{*}:= \sup _{\Vert \delta \Vert _{\varTheta }\ \le 1} \Vert \mathfrak {D}_{\theta }[g](\delta )\Vert _{\mathbb {B}^{*}}\ . \end{aligned}$$

Here, we consider the case \(p=2\). It should be recalled that the theory of weighted Poincaré constants has mainly focused on the two cases \(p=1\) and \(p=2\) (see, e.g., [13]). We confine ourselves to the latter case in order to avoid further technical problems connected with the Wasserstein dynamics when \(p=1\). See, e.g., the first comment opening Section 8.3 of Ambrosio et al. [5]. Therefore, in order to obtain an explicit upper bound for the constant \(L_0^{(n)}\), it is useful to consider the following proposition; the proof is deferred to Appendix A.5.

Proposition 3

In addition to the assumptions of Theorem 1, suppose that \(g \in C^1(\varTheta ; \mathbb {B}^{*})\), that \(\int _{\varTheta } \Vert \mathfrak {D}_{\theta }[g]\Vert _{*}^2\ \pi (\text {d}\theta ) < +\infty \), and that the map \({\mathcal {S}} \circ g\) is continuous. Then, for the constant \(L_0^{(n)}\) in (25) we can take

$$\begin{aligned} L_0^{(n)} = n \sup _{S \in \mathcal {U}_{\delta _n}(S_0)} \left\{ \mathfrak {C}_2[\pi _n^{*}(\cdot \,|\, S)] \right\} ^2 \left( \int _{\varTheta } \Vert \mathfrak {D}_{\theta }[g]\Vert _{*}^2\ \pi _n^{*}(\text {d}\theta \,|\, S) \right) ^{1/2}\ . \end{aligned}$$
(41)

In addition, if

$$\begin{aligned} \sup _{n\in \mathbb {N}}\ \ \sup _{\theta ' \in \mathcal {V}_{\delta _n}(\theta _0)}\ \frac{\int _{\varTheta } \Vert \mathfrak {D}_{\theta }[g]\Vert _{*}^2 \exp \{-n \textsf {K}(\theta \, |\, \theta ')\}\pi (\text {d}\theta )}{\int _{\varTheta } \exp \{-n \textsf {K}(\theta \, |\, \theta ')\}\pi (\text {d}\theta )} =: {\mathcal {B}}(g) < +\infty \end{aligned}$$
(42)

holds with \(\mathcal {V}_{\delta _n}(\theta _0):= ({\mathcal {S}} \circ g)^{-1}(\mathcal {U}_{\delta _n}(S_0))\), then

$$\begin{aligned} L_0^{(n)} \le \sqrt{{\mathcal {B}}(g)}\ n \sup _{S \in \mathcal {U}_{\delta _n}(S_0)} \left\{ \mathfrak {C}_2[\pi _n^{*}(\cdot \,|\, S)] \right\} ^2\ . \end{aligned}$$
(43)

Remark 5

When \(\pi \) is a Gaussian measure on the infinite-dimensional Hilbert space \(\varTheta \), an analogous statement can be formulated with the Fréchet derivative replaced by the Malliavin derivative. Hence, for the constant \(L_0^{(n)}\) in (25) we can set

$$\begin{aligned} L_0^{(n)} = n \sup _{S \in \mathcal {U}_{\delta _n}(S_0)} \left\{ \mathfrak {C}_2^{(M)}[\pi _n^{*}(\cdot \,|\, S)] \right\} ^2 \left( \int _{\varTheta } \Vert \mathcal {D}_{\theta }[g]\Vert _{*}^2\ \pi _n^{*}(\text {d}\theta \,|\, S) \right) ^{1/2}, \end{aligned}$$
(44)

and if

$$\begin{aligned} \sup _{n\in \mathbb {N}}\ \sup _{\theta ' \in \mathcal {V}_{\delta _n}(\theta _0)}\ \ \frac{\int _{\varTheta } \Vert \mathcal {D}_{\theta }[g]\Vert _{*}^2 \exp \{-n \textsf {K}(\theta \, |\, \theta ')\}\pi (\text {d}\theta )}{\int _{\varTheta } \exp \{-n \textsf {K}(\theta \, |\, \theta ')\}\pi (\text {d}\theta )} =: {\mathcal {B}}_M(g) < +\infty \end{aligned}$$
(45)

holds with \(\mathcal {V}_{\delta _n}(\theta _0):= ({\mathcal {S}} \circ g)^{-1}(\mathcal {U}_{\delta _n}(S_0))\), then the following inequality holds true

$$\begin{aligned} L_0^{(n)} \le \sqrt{{\mathcal {B}}_M(g)}\ n \sup _{S \in \mathcal {U}_{\delta _n}(S_0)} \left\{ \mathfrak {C}_2^{(M)}[\pi _n^{*}(\cdot \,|\, S)] \right\} ^2\ . \end{aligned}$$
(46)

Denote by \(\Rightarrow \) the weak convergence of probability measures on \((\varTheta , \mathscr {B}(\varTheta ))\). Verifying the validity of (42) represents a strengthening of the fact that, as \(n\rightarrow +\infty \)

$$\begin{aligned} \frac{\exp \{-n \textsf {K}(\theta \,|\, \theta ')\}\pi (\text {d}\theta )}{\int _{\varTheta } \exp \{-n \textsf {K}(\tau \, |\, \theta ')\}\pi (\text {d}\tau )} \Rightarrow \delta _{\theta '}. \end{aligned}$$

This may be proved by means of the same arguments as in the proofs of Propositions 1 and 2. According to Proposition 3, it remains to make more explicit the large n asymptotic behaviour of the weighted Poincaré–Wirtinger constant \(\mathfrak {C}_2[\pi _n^{*}(\cdot \,|\, S')]\). In the finite-dimensional setting, i.e. \(\varTheta \subseteq \mathbb {R}^d\), the representation (28) shows that \(\pi _n^{*}(\cdot \,|\, \cdot )\), and hence the posterior distribution, is a Gibbsean (Boltzmann) probability distribution. Properties of the Kullback–Leibler divergence entail that the mapping \(\theta \mapsto \textsf {K}(\theta \, |\, \theta ')\) is non-negative and vanishes iff \(\theta = \theta '\) (see Ferguson [42, Chapter 17]). Moreover, under standard regularity assumptions on \(f(\cdot \ |\ \cdot )\) [42, Chapter 18], this mapping is also strictly convex, at least in finite dimension. In this context, there are several conditions that entail the upper bound

$$\begin{aligned}{}[{\mathfrak {C}}_2(\pi _n^{*}(\cdot \,|\, {\mathcal {S}} \circ g(\theta ')))]^2 \le \frac{C(\theta ')}{n} \end{aligned}$$

for every \(n \in \mathbb {N}\), where \(C(\theta ')\) is a positive constant. In particular, the simplest condition to quote is the so-called Bakry–Émery condition, characterized by the fact that

$$\begin{aligned} \text {Hess}[\textsf {K}(\cdot \, |\, \theta ')](\theta ) \ge \rho \text {Id} \end{aligned}$$
(47)

for some \(\rho > 0\), with \( \text {Id}\) being the identity matrix, uniformly with respect to \(\theta \in \varTheta \), in conjunction with the hypothesis that \(\pi (\text {d}\theta ) = e^{-U(\theta )}\text {d}\theta \) for some \(U \in C^2(\varTheta )\). Some generalizations of condition (47) are given in the next proposition, which specifies results that first appeared in Bakry et al. [13].

Proposition 4

(Dolera and Mainini [38]) Let U and G be elements of \(\text {C}^2(\varTheta )\), bounded from below and such that \(\text {Hess}(G(\theta ))\ge \alpha \text {Id}\) and \(\text {Hess}(U(\theta ))\ge h \text {Id}\) (in the sense of quadratic forms) whenever \(|\theta |\le R\), for some \(\alpha >0\), \(R>0\) and \(h\in \mathbb {R}\).

  1. (1)

    If, in addition, there exist \(c>0\) and \(\ell \in \mathbb {R}\) such that \(\theta \cdot \nabla G(\theta )\ge c|\theta |\) and \(\theta \cdot \nabla U(\theta )\ge \ell |\theta |\) whenever \(|\theta |\ge R\), then

    $$\begin{aligned}&\left[ {\mathfrak {C}}_2\left( \frac{e^{-nG(\theta ) - U(\theta )} \text {d}\theta }{\int _{\varTheta } e^{-nG(\tau ) - U(\tau )} \text {d}\tau }\right) \right] ^2\\&\quad \le \frac{\alpha n+h+(cn+\ell -d_R+ nG_R+U_R)\,C_R}{(\alpha n+h)\,(cn+\ell -1-d_R)} \sim \frac{1}{n} \end{aligned}$$

    for every \(n>(-h/\alpha )\vee ((d_R+1-\ell )/c)\), where \(d_R:=(d-1)/R\), \(G_R:=\sup _{B_R}|\nabla G|\), \(U_R:=\sup _{B_R}|\nabla U|\) and \(C_R\) is an explicit universal constant only depending on R.

  2. (2)

    If, in addition, there exist \(c_1>0\), \(c_2>0\) such that

    $$\begin{aligned} |\nabla G(\theta )|^2\ge 2c_1+c_2\, [\varDelta G(\theta )+\nabla G(\theta )\cdot \nabla U(\theta )]_+ \end{aligned}$$

    whenever \(|\theta |\ge R\), then

    $$\begin{aligned} \left[ {\mathfrak {C}}_2\left( \frac{e^{-nG(\theta ) - U(\theta )} \text {d}\theta }{\int _{\varTheta } e^{-nG(\tau ) - U(\tau )} \text {d}\tau }\right) \right] ^2 \le \frac{\alpha n+h+e^{\omega _R}(c_1n+G_R^*+W_R)}{(\alpha n+h)c_1n} \sim \frac{1}{n} \end{aligned}$$

    for every \(n>(1+1/c_2)\vee (-h/\alpha )\), where \(G_R^*:=\sup _{B_R}|\varDelta G|\), \(W_R:=\sup _{B_R}|\nabla U||\nabla G|\) and \(\omega _R:=\sup _{B_R}G-\inf _{\varTheta }G\).
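The \(1/n\) behaviour in Proposition 4 can be checked by quadrature in dimension one; in the sketch below, the quadratic (hence strongly convex) choices \(G(\theta ) = U(\theta ) = \theta ^2/2\) are illustrative assumptions under which the Gibbs measure is Gaussian, so that its optimal squared Poincaré constant equals its variance:

```python
import math

# One-dimensional illustration of the 1/n decay in Proposition 4: with the
# quadratic (hence strongly convex) choices G(t) = U(t) = t^2/2, the Gibbs
# measure e^{-nG-U} dtheta is Gaussian and its optimal squared Poincare
# constant equals its variance, namely 1/(n+1).
def gibbs_variance(n):
    grid = [i / 2000.0 - 2.0 for i in range(8001)]   # grid on [-2, 2]
    w = [math.exp(-(n + 1) * t * t / 2) for t in grid]
    z = sum(w)
    m = sum(t * x for t, x in zip(grid, w)) / z
    return sum((t - m) ** 2 * x for t, x in zip(grid, w)) / z

v1, v2 = gibbs_variance(50), gibbs_variance(500)
# (n + 1) * variance should stabilize near 1, i.e. variance ~ 1/n
```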

According to Proposition 4, in the finite-dimensional setting the prior distribution does not affect the large n asymptotic behaviour of the weighted Poincaré–Wirtinger constant \(\mathfrak {C}_2[\pi _n^{*}(\cdot \,|\, S')]\). A similar phenomenon has been observed in the study of the first term on the right-hand side of (25). In contrast with the finite-dimensional setting, the literature on weighted Poincaré–Wirtinger constants in the infinite-dimensional setting appears to be underdeveloped. To the best of our knowledge, in the infinite-dimensional setting, upper bounds on weighted Poincaré–Wirtinger constants are limited to the case of Gibbsean (Boltzmann) measures, that is, measures of the form \(\exp \{-nG(\theta )\}\pi (\text {d}\theta )\) with G a smooth convex function and \(\pi \) an infinite-dimensional Gaussian measure ([31, Chapters 10–11]). While this is the case of interest in our problem, the upper bounds available in the literature are not sharp for large values of n, and therefore they cannot be applied here. The next proposition covers this critical gap by providing results involving Malliavin calculus; the proof is deferred to Appendix A.6. The proposition is of independent interest in the context of weighted Poincaré–Wirtinger constants.

Proposition 5

Let \(\varTheta \) be a separable Hilbert space, and let \(\pi \) be the non-degenerate Gaussian measure \(\mathcal {N}(m,Q)\), with \(m \in \varTheta \) and Q a trace-class operator. Let \(\text {G}_0:\varTheta \rightarrow \varTheta \) be a compact linear operator, with trivial kernel. Let G be an element of \(\text {C}^2(\varTheta )\), bounded from below and such that \(\text {Hess}(G(\theta ))\ge \text {G}_0\) (in the sense of operators) whenever \(\Vert \theta \Vert _{\varTheta } \le R\), for some \(R>0\). Suppose there exists a Fourier orthonormal basis \(\{{{\textbf {e}}}_k\}_{k \ge 1}\) of \(\varTheta \) which diagonalizes simultaneously both Q and \(\text {G}_0\), that is

$$\begin{aligned} Q[{{\textbf {e}}}_k] = \lambda _k {{\textbf {e}}}_k \qquad \qquad \text {G}_0[{{\textbf {e}}}_k] = \eta _k {{\textbf {e}}}_k \end{aligned}$$
(48)

for two suitable sequences \(\{\lambda _k\}_{k \ge 1}\) and \(\{\eta _k\}_{k \ge 1}\) of strictly positive numbers that go to zero as \(k \rightarrow +\infty \), with \(\{\lambda _k\}_{k \ge 1} \in \ell _1\).

  1. (1)

    Suppose, in addition, there exists \(c>0\) such that \(\theta \cdot \mathcal {D}_{\theta } G\ge c\Vert \theta \Vert _{\varTheta }\) whenever \(\Vert \theta \Vert _{\varTheta } \ge R\). Then, for every \(n>\text {Tr}[Q](1 + 1/R)/c\), it holds

    $$\begin{aligned}&\left[ {\mathfrak {C}}_2^{(M)}\left( \frac{e^{-nG(\theta )} \pi (\text {d}\theta )}{\int _{\varTheta } e^{-nG(\tau )} \pi (\text {d}\tau )}\right) \right] ^2\nonumber \\&\quad \lesssim \frac{1 + C_R(1+\tau _n+nG_R) \max _{k \in \mathbb {N}} \left\{ \frac{\lambda _k}{n\lambda _k \eta _k + 1}\right\} }{cn - \text {Tr}[Q](1 + 1/R)} \nonumber \\&\quad = O\left( \max _{k \in \mathbb {N}} \left\{ \frac{\lambda _k}{n\lambda _k \eta _k + 1}\right\} \right) \end{aligned}$$
    (49)

    where \(G_R:=\sup _{B_R} \Vert \mathcal {D}_{\theta } G\Vert \) and \(C_R\) is an explicit constant depending only on R.

  (2)

    Suppose, in addition, there exist \(c_1>0\), \(c_2>0\) such that

    $$\begin{aligned} \Vert \mathcal {D}_{\theta } G\Vert ^2 \ge 2c_1+c_2\, [\mathfrak {L}_{\pi } G(\theta )]_+ \end{aligned}$$
    (50)

    whenever \(\Vert \theta \Vert \ge R\), where \(\mathcal {D}_{\theta }\) and \(\mathfrak {L}_{\pi }\) denote the Malliavin derivative and the Malliavin–Laplace operator associated with \(\pi \), respectively. Then, for every \(n> 1 + 1/c_2\), it holds that

    $$\begin{aligned}&\left[ {\mathfrak {C}}_2^{(M)}\left( \frac{e^{-nG(\theta )} \pi (\text {d}\theta )}{\int _{\varTheta } e^{-nG(\tau )} \pi (\text {d}\tau )}\right) \right] ^2 \\&\quad \lesssim \frac{1 + e^{\omega _R}(C_1 n + G^*_R) \max _{k \in \mathbb {N}} \left\{ \frac{\lambda _k}{n\lambda _k \eta _k + 1}\right\} }{c_1 n} \\&\quad = O\left( \max _{k \in \mathbb {N}} \left\{ \frac{\lambda _k}{n\lambda _k \eta _k + 1} \right\} \right) \end{aligned}$$

    where \(\omega _R:= \sup _{B_R} G - \inf _{\varTheta } G\) and \(G^*_R:= \sup _{B_R} [|\mathfrak {L}_{\pi }[G]| + \Vert \mathcal {D}_{\theta }[G] \Vert ^2]\).

Proposition 5 shows that the large n asymptotic behaviour of the Poincaré–Wirtinger constant \(\mathfrak {C}_2^{(M)}[\pi _n^{*}(\cdot \,|\, S')]\) is worse than the \(n^{-1/2}\) behaviour obtained in Proposition 4. Indeed, by straightforward calculations, as \(n\rightarrow +\infty \)

$$\begin{aligned} \max _{k \in \mathbb {N}} \left\{ \frac{k^{-(1+a)}}{n k^{-(1+a+b)} + 1} \right\} \sim n^{-\frac{a+1}{1+a+b}} \end{aligned}$$
(51)

holds for any \(a,b > 0\). A particular merit of Proposition 5 consists in showing explicitly that, within the infinite-dimensional setting, PCRs are influenced by two quantities that do not appear in the finite-dimensional setting of Proposition 4: (i) the rate of approach to zero of the sequence \(\{\lambda _k\}_{k \ge 1}\), which measures the “regularity of the prior”; (ii) the rate of approach to zero of the sequence \(\{\eta _k\}_{k \ge 1}\), which measures another “regularity of the model”. A similar phenomenon has been observed in the study of the first term on the right-hand side of (25). To conclude, we observe that, under the assumptions of Proposition 2, we can apply Eq. (26) to rewrite the right-hand side of (43) as follows

$$\begin{aligned}&\sup _{S' \in \mathcal {U}_{\delta _n}(S_0)} \left\{ \mathfrak {C}_2^{(M)}[\pi _n^{*}(\cdot \,|\, S')] \right\} ^2\nonumber \\&\quad = \sup _{\theta ' \in ({\mathcal {S}} \circ g)^{-1}(\mathcal {U}_{\delta _n}(S_0))} \left[ {\mathfrak {C}}_2^{(M)}\left( \frac{e^{-n\textsf {K}(\theta \,|\,\theta ')} \pi (\text {d}\theta )}{\int _{\varTheta } e^{-n\textsf {K}(\tau \,|\,\theta ')} \pi (\text {d}\tau )}\right) \right] ^2, \end{aligned}$$
(52)

and then observe that the role of \(\theta '\) is now confined to the multiplicative constants that appear on the right-hand sides of the various inequalities considered above. Thus, in order to handle the supremum, it is enough to check the boundedness of such multiplicative constants by standard continuity arguments.
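The asymptotic relation (51) can also be checked numerically. The following sketch (an illustration outside the formal development; the values \(a = b = 1\), the truncation of the index k and the choices of n are assumptions made for the demonstration) evaluates the maximum and compares it with the predicted power of n.

```python
import numpy as np

# Numerical check of (51): with lambda_k = k^{-(1+a)} and eta_k = k^{-b}
# (so that lambda_k * eta_k = k^{-(1+a+b)}), the maximum over k should
# decay like n^{-(a+1)/(1+a+b)}.  Here a = b = 1 is an illustrative choice.
a, b = 1.0, 1.0
exponent = (a + 1.0) / (1.0 + a + b)

k = np.arange(1.0, 100_000.0)   # finite truncation of the index set
ratios = []
for n in [1e4, 1e5, 1e6]:
    max_val = np.max(k ** (-(1 + a)) / (n * k ** (-(1 + a + b)) + 1.0))
    ratios.append(max_val / n ** (-exponent))
    print(f"n = {n:.0e}: max / n^(-{exponent:.3f}) = {ratios[-1]:.4f}")  # ~ 0.53
```

The near-constant ratio across the three values of n reflects the asymptotic equivalence asserted in (51).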

We conclude this section by summarizing our results on the large n asymptotic behaviours of the terms on the right-hand side of (25). The second and the third term go to zero exponentially fast, and this holds true independently of the dimension of the statistical model. This confirms that the main contribution to the PCR arises from the first and the fourth term, which generally give algebraic rates of convergence to zero. In the finite-dimensional setting, when \(p = 2\), the first and the fourth term go to zero as \(n^{-1}\), which is the optimal rate. In the infinite-dimensional setting, the first and the fourth term go to zero according to Propositions 2 and 5. At least when \(\eta _k \sim \gamma _k \sim k^{-b}\), \(\lambda _k \sim k^{-(1+a)}\) and \(\omega _k^2 \sim k^{-(1+c)}\) with \(a,b,c >0\), Eqs. (39) and (51) show that the first term on the right-hand side of (25) is asymptotically equivalent to

$$\begin{aligned} n^{-\frac{a}{2(a+b+1)}} + n^{-\frac{c}{2(a+b+1)}}, \end{aligned}$$

whereas the fourth term on the right-hand side of (25) is asymptotically equivalent to

$$\begin{aligned} n^{-\frac{a+1-b}{2(a+b+1)}}, \end{aligned}$$

at least assuming that \(\mathbb {E}[\Vert \hat{S}_n- S_0\Vert _{\mathbb {B}}] \sim n^{-1/2}\). This completes our analysis of PCRs in the setting of infinite-dimensional exponential families. Some applications of these results will be presented in Sect. 4 for specific statistical models.
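The exponents in the two displays above can be tabulated directly. The sketch below (the parameter values are illustrative assumptions) evaluates the decay exponents of the first and fourth terms and the slowest one, which governs the overall PCR, under the regimes \(\eta _k \sim \gamma _k \sim k^{-b}\), \(\lambda _k \sim k^{-(1+a)}\), \(\omega _k^2 \sim k^{-(1+c)}\) and \(\mathbb {E}[\Vert \hat{S}_n- S_0\Vert _{\mathbb {B}}] \sim n^{-1/2}\).

```python
# Overall PCR exponent implied by the two displays above; all inputs illustrative.
def pcr_exponents(a, b, c):
    d = 2.0 * (a + b + 1.0)
    first = min(a / d, c / d)      # first term:  n^{-a/d} + n^{-c/d}
    fourth = (a + 1.0 - b) / d     # fourth term: n^{-(a+1-b)/d}, needs a + 1 > b
    return first, fourth, min(first, fourth)

print(pcr_exponents(2.0, 1.0, 2.0))  # (0.25, 0.25, 0.25)
print(pcr_exponents(1.0, 2.0, 3.0))  # fourth exponent degenerates to 0 since b = a + 1
```

Note that the second example shows how a prior that is too irregular relative to the model (b = a + 1) destroys the algebraic rate of the fourth term.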

3.2 PCRs for a general dominated statistical model

We present a more general version of Theorem 1, which relies on the assumption that both the sample space \(\mathbb {X}\) and the parameter space \(\varTheta \) have richer analytical structures. Here, we confine ourselves to the case \(p=2\). In particular, the setting that we consider may be summarized through the following assumptions.

Assumptions 1

The set \(\mathbb {X}\), the parameter space \(\varTheta \), the statistical model \(\mu (\cdot \,|\,\cdot )\) and \(\theta _0\) are such that

  (i)

    \(\mathbb {X}\) coincides with an open, connected subset of \(\mathbb {R}^m\) with Lipschitz boundary, and \(\mathscr {X}= \mathscr {B}(\mathbb {X})\). With minor changes of notation, \(\mathbb {X}\) could also coincide with a smooth Riemannian manifold without boundary of dimension \(m \in \mathbb {N}\).

  (ii)

    \(\varTheta \) coincides with an open, connected subset of a separable Hilbert space of dimension \(d \in \mathbb {N}\cup \{+\infty \}\).

  (iii)

    \(\mu (\cdot \,|\,\cdot )\) is dominated by the m-dimensional Lebesgue measure, i.e. \(\mu (A\,|\,\theta ) = \int _A f(x\,|\,\theta ) \text {d}x\) for every \(A \in \mathscr {X}\), where \(x \mapsto f(x\,|\,\theta ) > 0\) is a probability density function for any \(\theta \in \varTheta \).

  (iv)

    \((x,\theta ) \mapsto f(x\,|\,\theta ) \in C^2(\mathbb {X}\times \varTheta )\);

  (v)

    the model \(\{f(\cdot \,|\,\theta )\}_{\theta \in \varTheta }\) is \(C^2\)-regular at \(\theta _0\) (as in [42, Theorem 18]);

  (vi)

    for any \(\theta \), there exist positive constants \(b(\theta ), c(\theta )\) for which

    $$\begin{aligned} |\log f(x\,|\,\theta )| \le b(\theta )(1 + |x|^2) \quad \text{ and }\quad |\nabla _x \log f(x\,|\,\theta )|\le c(\theta )(1+|x|)\nonumber \\ \end{aligned}$$
    (53)

    hold for every \(x \in \mathbb {X}\);

  (vii)

    \(\pi \in {\mathcal {P}}_2(\varTheta )\), with full support;

  (viii)

    \(\mu _0 \in {\mathcal {P}}_2(\mathbb {X})\).

The setting of infinite-dimensional exponential families, considered in Sect. 3.1.3, is a popular example that satisfies Assumptions 1. Now, we state the theorem on PCRs in the setting of Assumptions 1; the proof is deferred to Appendix A.7.

Theorem 2

Within the setting specified by Assumptions 1, (7) is fulfilled with

$$\begin{aligned} \pi _n^{*}(\text {d}\theta \,|\,\gamma ):= \frac{\exp \{ n\int _{\mathbb {X}} \log f(y \,|\, \theta ) \gamma (\text {d}y)\} }{\int _{\varTheta } \exp \{ n\int _{\mathbb {X}} \log f(y \,|\, \tau ) \gamma (\text {d}y)\} \pi (\text {d}\tau )} \pi (\text {d}\theta ) \end{aligned}$$
(54)

where \(\gamma \in \mathbb {S}= {\mathcal {P}}_2(\mathbb {X})\). Moreover, (9) holds relative to a suitable choice of a \(\mathcal {W}_2^{(\mathcal {P}(\varTheta ))}\)-neighborhood \(V_0^{(n)}\) of \(\mu _0(\cdot ):= \mu (\cdot \,|\,\theta _0)\), provided that

$$\begin{aligned} L_0^{(n)}&:= n \sup _{\gamma \in V_0^{(n)}} [{\mathfrak {C}}_2(\pi _n^{*}(\cdot \,|\,\gamma ))]^2\nonumber \\&\quad \times \left( \int _{\varTheta }\int _{\mathbb {X}} \Big \Vert {\mathfrak {D}}_{\theta } \frac{\nabla _x f(x\,|\,\theta )}{f(x\,|\,\theta )}\Big \Vert ^2 \gamma (\text {d}x) \pi _n^{*}(\text {d}\theta \,|\,\gamma ) \right) ^{1/2} < +\infty \end{aligned}$$
(55)

for any \(n \in \mathbb {N}\). Thus, the assumptions of Lemma 1 are fulfilled and a PCR at \(\theta _0\) is given by

$$\begin{aligned} \epsilon _n&= 3\left( \frac{\int _{\varTheta } \Vert \theta -\theta _0\Vert ^2 e^{-n \textsf {K}(\theta \,|\,\theta _0)} \pi (\text {d}\theta )}{\int _{\varTheta } e^{-n \textsf {K}(\theta \,|\,\theta _0)} \pi (\text {d}\theta )} \right) ^{1/2} \nonumber \\&\quad + L_0^{(n)} \varepsilon _{n,2}(\mathbb {X},\mu _0) + 2\Vert \theta _0\Vert \mathbb {P}[\mathfrak {e}_n^{(\xi )} \not \in V_0^{(n)}] \nonumber \\&\quad + \mathbb {E}\left[ \left( 2 \frac{\int _{\varTheta } \Vert \theta \Vert ^2 \left[ \prod _{i=1}^n f(\xi _i\,|\,\theta ) \right] \pi (\text {d}\theta )}{\int _{\varTheta } \left[ \prod _{i=1}^n f(\xi _i\,|\,\theta ) \right] \pi (\text {d}\theta )} \right) ^{1/2} \mathbbm {1}\{\mathfrak {e}_n^{(\xi )}\not \in V_0^{(n)}\} \right] \end{aligned}$$
(56)

where \(\mathfrak {e}_n^{(\xi )}:= n^{-1} \sum _{1\le i\le n} \delta _{\xi _i}\) and \(\varepsilon _{n,p}(\mathbb {X},\mu _0):= \mathbb {E}[ \mathcal {W}_p^{({\mathcal {P}}(\mathbb {X}))}(\mu _0; \mathfrak {e}_n^{(\xi )})]\) is the speed of the mean Glivenko–Cantelli convergence.

From Theorem 2, we observe that if \(V_0^{(n)} = {\mathcal {P}}_2(\mathbb {X})\) makes \(L_0^{(n)}\) finite for every \(n \in \mathbb {N}\), then the expression on the right-hand side of (56) reduces to the first two terms. Similarly to Theorem 1, Theorem 2 provides an implicit form for the PCR, thus requiring further investigation of the large n asymptotic behaviour of the terms on the right-hand side of (56). The posterior distribution appears in (55) and (56), meaning that further work is required to obtain more explicit terms. In general, it is possible to get rid of \(\pi _n^{*}\) in (55) and (56), thus reducing (55) and (56) to expressions that involve only the statistical model and the prior distribution. The first term on the right-hand side of (56) has the same form as in (25), meaning that the Laplace method plays a critical role in the study of this term, which can be handled as described in Propositions 1 and 2. With regards to \(\varepsilon _{n,2}(\mathbb {X},\mu _0)\), we recall from Fournier and Guillin [46, Theorem 1] that, if \(\int _{\mathbb {X}} |x|^q \mu _0(\text {d}x) < +\infty \) for some \(q > 2\), then

$$\begin{aligned}&\varepsilon _{n,2}(\mathbb {X},\mu _0)\\&\quad \le C(q,m) \left( \int _{\mathbb {X}} |x|^q \mu _0(\text {d}x)\right) ^{1/q}\\&\quad \quad \times \left\{ \begin{array}{ll} n^{-1/4} + n^{-(q-2)/(2q)} &{}\text{ if }\ m=1,2,3\ \text{ and }\ q\ne 4\\ n^{-1/4}\sqrt{\log (1+n)} + n^{-(q-2)/(2q)} &{}\text{ if }\ m=4\ \text{ and }\ q\ne 4\\ n^{-1/m} + n^{-(q-2)/(2q)} &{}\text{ if }\ m>4\ \text{ and }\ q\ne m/(m-2) \end{array}\right. \end{aligned}$$

with some positive constant C(q, m). Under some more restrictive assumptions, \(\varepsilon _{n,2}(\mathbb {X},\mu _0)\) is of order \(O(n^{-1/2})\), which is optimal in dimension 1 ([17, Section 5]). In dimension 2, the optimal rate is \(\sqrt{(\log n)/n}\) [7], whereas for \(m\ge 3\) the optimal rate is \(n^{-1/m}\) [79]. Lastly, when \(\mathbb {X}\) is infinite-dimensional, logarithmic rates have been obtained in Jing [58]. With regards to \(\mathbb {P}[\mathfrak {e}_n^{(\xi )} \not \in V_0^{(n)}]\), we refer to Bolley et al. [19, Theorem 2.7]. In particular, if \(\int _{\mathbb {X}} |x|^q \mu _0(\text {d}x) < +\infty \) for some \(q \ge 1\), then

$$\begin{aligned} \mathbb {P}[\mathcal {W}_2(\mathfrak {e}_n^{(\xi )}; \mu _0)> t] \le B(q,m) t^{-q} \times \left\{ \begin{array}{ll} n^{-q/4} &{}\text{ if }\ q >4\\ n^{1 - q/2} &{}\text{ if }\ q \in [2,4) \end{array}\right. \end{aligned}$$

for any \(t >0\) and \(n \in \mathbb {N}\), with some positive constant B(q, m). Exponential bounds can also be obtained upon requiring that \(\int _{\mathbb {X}} e^{\alpha |x|} \mu _0(\text {d}x) < +\infty \) for some \(\alpha > 0\); see Bolley et al. [19, Theorem 2.8]. In the next corollary we show that, under additional assumptions, similar bounds hold true for the other terms appearing on the right-hand side of (56); the proof is deferred to Appendix A.8.
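For m = 1, the speed \(\varepsilon _{n,2}(\mathbb {X},\mu _0)\) can be simulated directly, since on the real line \(\mathcal {W}_2\) admits a quantile representation that is exact for an empirical measure. The Monte Carlo sketch below takes \(\mu _0\) uniform on [0, 1] (an illustrative choice satisfying the moment condition; the sample sizes and number of replications are also assumptions) and exhibits the \(n^{-1/2}\) rate.

```python
import numpy as np

rng = np.random.default_rng(0)

def w2_to_uniform(x):
    """Exact W2 distance between the empirical measure of x and Uniform(0, 1),
    via the quantile representation of W2 on the real line."""
    x = np.sort(x)
    n = len(x)
    left = np.arange(n) / n
    right = np.arange(1, n + 1) / n
    # integral of (x_(i) - u)^2 over u in ((i-1)/n, i/n], summed over i
    w2_sq = np.sum((x - left) ** 3 - (x - right) ** 3) / 3.0
    return np.sqrt(w2_sq)

scaled = []
for n in [100, 400, 1600, 6400]:
    est = np.mean([w2_to_uniform(rng.uniform(size=n)) for _ in range(40)])
    scaled.append(np.sqrt(n) * est)
    print(f"n = {n:5d}: E[W2] ~ {est:.4f}, sqrt(n) * E[W2] ~ {scaled[-1]:.3f}")
```

The stabilization of \(\sqrt{n}\, \mathbb {E}[\mathcal {W}_2]\) across sample sizes is the m = 1 optimal-rate phenomenon recalled above.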

Corollary 1

In addition to the hypotheses of Theorem 2, suppose that there exist constants \(C>0\) and \(\beta \ge 2\) for which

$$\begin{aligned} \textsf {K}(\theta |\theta _0) \ge C \min \{\Vert \theta - \theta _0\Vert ^{\beta }, 1\} \end{aligned}$$
(57)

holds for all \(\theta \in \varTheta \). Moreover, assume that

$$\begin{aligned} \int _{\mathbb {X}} |x|^q \mu _0(\text {d}x) < +\infty \end{aligned}$$

holds for some constant \(q > 4\), and

$$\begin{aligned} \mathfrak {M}_{n,r}:= \mathbb {E}\left[ \int _{\varTheta } \Vert \theta \Vert ^r \pi _n(\text {d}\theta \,|\,\xi _1, \dots , \xi _n) \right] < +\infty \end{aligned}$$
(58)

for all \(n \in \mathbb {N}\) and some \(r > 2\). Then, if the neighborhood \(V_0^{(n)}\) has the form

$$\begin{aligned} \{\gamma \in {\mathcal {P}}_2(\mathbb {X})\ |\ \mathcal {W}_2(\gamma ; \mu _0) \le Kn^{-a}\} \end{aligned}$$

for some \(K >0\) and \(a \in [0, 1/4)\), for the PCR given in (56) we obtain the new bound

$$\begin{aligned} \epsilon _n&\le C_1 n^{-1/\beta } + L_0^{(n)} \varepsilon _{n,2}(\mathbb {X},\mu _0) + 2\Vert \theta _0\Vert K^{-q} B(q,m) n^{q(a - 1/4)} \nonumber \\&\quad + C_2 n^{[q(r-1)(a - 1/4)]/r} \mathfrak {M}_{n,r} \end{aligned}$$
(59)

with suitable positive constants \(C_1\) and \(C_2\).

In Corollary 1, the posterior distribution still appears, namely in (58) and (59). With regards to (58), this quantity is typically available in explicit form, even when the posterior is not. In general, a possible strategy may rely on well-known bounds for moments of martingales. With regards to \(L_0^{(n)}\), this term can be handled as described in Propositions 4 and 5, that is, by inequalities for the weighted Poincaré–Wirtinger constant. To conclude, it remains to handle

$$\begin{aligned} \sup _{\gamma \in V_0^{(n)}} \int _{\varTheta }\int _{\mathbb {X}} \Big \Vert {\mathfrak {D}}_{\theta } \frac{\nabla _x f(x\,|\,\theta )}{f(x\,|\,\theta )}\Big \Vert ^2 \gamma (\text {d}x) \pi _n^{*}(\text {d}\theta \,|\,\gamma ), \end{aligned}$$

which is expected to be bounded with respect to n in regular situations. To deal with this term, a possible strategy consists in obtaining an inequality of the form

$$\begin{aligned} \int _{\mathbb {X}} \Big \Vert {\mathfrak {D}}_{\theta } \frac{\nabla _x f(x\,|\,\theta )}{f(x\,|\,\theta )}\Big \Vert ^2 \gamma (\text {d}x) \le C_{\gamma } W(\theta ) \end{aligned}$$

for a suitable constant \(C_{\gamma }\) and a suitable function W. This point will be made more precise in Sect. 4 for some specific statistical models.

4 Applications

4.1 Regular parametric models

Consider the case of dominated Bayesian statistical models with a finite-dimensional parameter \(\theta \in \varTheta \subset \mathbb {R}^d\). Accordingly, we start from the set of Assumptions 1, with \(d \in \mathbb {N}\), along with the hypotheses of Theorem 2. In this setting, the Kullback–Leibler divergence \(\textsf {K}(\theta \,|\,\theta _0)\) is a \(C^2\) function, whose Hessian at \(\theta _0\) coincides with the Fisher information matrix at \(\theta _0\). Whence,

$$\begin{aligned} \textsf {K}(\theta \,|\, \theta _0) = \frac{1}{2}\ ^t(\theta - \theta _0) \text {I}[\theta _0] (\theta - \theta _0) + o(|\theta -\theta _0|^2) \end{aligned}$$
(60)

as \(\theta \rightarrow \theta _0\). Finally, we assume (29). Therefore, we can apply Proposition 1 to get

$$\begin{aligned} \mathcal {W}_p(\pi _n^{*}(\text {d}\theta | \mu _0), \delta _{\theta _0}) = \left( \frac{\int _{\varTheta } |\theta - \theta _0|^p e^{-n \textsf {K}(\theta \, |\, \theta _0)} \pi (\text {d}\theta )}{\int _{\varTheta } e^{-n \textsf {K}(\theta \, |\, \theta _0)} \pi (\text {d}\theta )} \right) ^{1/p} = O\big (n^{-1/2}\big ) \end{aligned}$$
(61)

as \(n \rightarrow +\infty \). Now, we discuss the behavior of the constant \(L_0^{(n)}\) as n goes to infinity. First, we stress that there are plenty of conditions that entail

$$\begin{aligned}{}[{\mathfrak {C}}_2(\pi _n^{*}(\cdot \,|\,\gamma ))]^2 \le \frac{C(\gamma )}{n} \end{aligned}$$

for every \(n \in \mathbb {N}\) and some positive constant \(C(\gamma )\). We consider the double integral

$$\begin{aligned} \int _{\varTheta }\int _{\mathbb {X}} \Big \Vert \text {D}_{\theta } \frac{\nabla _x f(x\,|\,\theta )}{f(x\,|\,\theta )}\Big \Vert ^2\gamma (\text {d}x) \pi _n^{*}(\text {d}\theta \,|\,\gamma )\ . \end{aligned}$$

First, if \(f(x\,|\,\theta ) = \exp \{\langle \theta , T(x)\rangle - M(\theta )\}\), that is, the model is an element of the exponential family in canonical form, then \(D_{\theta } \frac{\nabla _x f(x|\theta )}{f(x|\theta )}\) reduces to a \(d\times m\) matrix whose entries are given by \(\partial _{x_j} T_i(x)\), for \(j = 1, \dots , m\) and \(i = 1, \dots , d\). Therefore, the study of the above double integral boils down to that of the much simpler expressions \(\int _{\mathbb {X}} |\partial _{x_j} T_i(x)|^2 \gamma (\text {d}x)\), which are independent of n. More generally, we can reduce the problem by resorting to the Laplace method for approximating probability integrals, from which we have that

$$\begin{aligned} \int _{\varTheta }\Big \Vert \text {D}_{\theta } \frac{\nabla _x f(x\,|\,\theta )}{f(x\,|\,\theta )}\Big \Vert ^2 \pi _n^{*}(\text {d}\theta |\gamma ) \sim \Big \Vert \text {D}_{\theta } \frac{\nabla _x f(x\,|\,\theta )}{f(x\,|\,\theta )}\Big \Vert ^2 \ _{\Big |\ \theta = \theta ^*(\gamma )} \end{aligned}$$

as \(n \rightarrow +\infty \), where \(\theta ^*(\gamma )\) denotes a maximum point of the mapping \(\theta \mapsto \int _{\mathbb {X}} \log f(y|\theta ) \gamma (\text {d}y)\). Therefore, if the above right-hand side proves to be positive, a reasonable plan to prove global boundedness of \(L_0^{(n)}\) with respect to n can be based on the following two steps. First, we check the validity of an inequality like

$$\begin{aligned} \sup _{n\in \mathbb {N}} \int _{\varTheta }\Big \Vert \text {D}_{\theta } \frac{\nabla _x f(x\,|\,\theta )}{f(x|\theta )}\Big \Vert ^2 \pi _n^{*}(\text {d}\theta \,|\,\gamma ) \le C \Big \Vert \text {D}_{\theta } \frac{\nabla _x f(x\,|\,\theta )}{f(x\,|\,\theta )}\Big \Vert ^2 \ _{\Big |\ \theta = \theta ^*(\gamma )} \end{aligned}$$

for every \(\gamma \) belonging to a \(\mathcal {W}_2^{(\mathcal {P}(\varTheta ))}\)-neighborhood of \(\mu _0\), where C is a positive constant possibly depending on the fixed neighborhood. Second, we prove global boundedness (for \(\gamma \) varying in the neighborhood) of the following integral

$$\begin{aligned} \int _{\mathbb {X}} \Big \Vert \text {D}_{\theta } \frac{\nabla _x f(x\,|\,\theta )}{f(x\,|\,\theta )}\Big \Vert ^2 \ _{\Big |\ \theta = \theta ^*(\gamma )}\gamma (\text {d}x) < +\infty \ . \end{aligned}$$

To fix ideas in a more concrete way, we consider the Gaussian case, where \(\theta = (\mu , \varSigma )\) and

$$\begin{aligned} f(x\,|\,\theta ) = (2\pi )^{-m/2} \frac{1}{\sqrt{\text {det}(\varSigma )}} \exp \left\{ -\frac{1}{2} (x-\mu )^t \varSigma ^{-1}(x-\mu )\right\} \qquad x \in \mathbb {R}^m\ . \end{aligned}$$

Note that the mapping \(\theta \mapsto \int _{\mathbb {X}} \log f(y\,|\,\theta ) \gamma (\text {d}y)\) depends on \(\gamma \) only through its moments of order 1 and 2. Thus, the above strategy reduces to an ordinary finite-dimensional maximization problem, very similar to the problem of finding the maximum likelihood estimator. Finally, the last term on the right-hand side of (56) can be treated as in Corollary 1, by studying the asymptotic behavior of some posterior r-moment as in (58). We state two propositions that summarize the above considerations. The former holds when Theorem 1 can be applied and gives the optimal rate, while the latter ensues from Theorem 2.
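The moment-matching reduction just described can be illustrated numerically in dimension m = 1. In the sketch below, the sample standing in for \(\gamma \), the grid, and the parameter ranges are all illustrative assumptions; the point is that the maximizer \(\theta ^*(\gamma )\) found by brute force agrees with the moment-matching (maximum-likelihood-type) solution.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=1.5, size=4000)   # sample standing in for gamma
m1, m2 = y.mean(), np.mean(y ** 2)              # gamma enters only via its moments

def objective(mu, s2):
    # theta -> int log f(y | theta) gamma(dy) for the 1-d Gaussian model,
    # written in terms of the first two moments of gamma
    return -0.5 * np.log(2 * np.pi * s2) - (m2 - 2 * mu * m1 + mu ** 2) / (2 * s2)

# crude grid search for the maximizer theta*(gamma)
mus = np.linspace(0.0, 4.0, 801)
s2s = np.linspace(0.5, 5.0, 901)
vals = objective(mus[:, None], s2s[None, :])
i, j = np.unravel_index(np.argmax(vals), vals.shape)
mu_star, s2_star = mus[i], s2s[j]

print(mu_star, m1)            # agree up to the grid resolution
print(s2_star, m2 - m1 ** 2)  # idem
```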

Proposition 6

Assume that there exist a separable Banach space \(\mathbb {B}\) with dual \(\mathbb {B}^{*}\) and two measurable maps \(\beta : \mathbb {X}\rightarrow \mathbb {B}\) and \(g: \varTheta \rightarrow \mathbb {B}^{*}\) for which (19) is in force. If the assumptions of Theorem 1 and Propositions 1, 3 and 4 are met, then as \(n\rightarrow +\infty \)

$$\begin{aligned} \epsilon _n = O\big (n^{-1/2}\big ), \end{aligned}$$

which is the optimal rate.

Proposition 7

Assume that the model \(\{f(\cdot \,|\,\theta )\}_{\theta \in \varTheta }\) and the prior \(\pi \) satisfy Assumptions 1 along with the assumptions of Propositions 1, 3 and 4. Then, as \(n \rightarrow +\infty \)

$$\begin{aligned} \epsilon _n = O\big ( \varepsilon _{n,2}(\mathbb {X},\mu _0) \big ), \end{aligned}$$

which is the optimal rate, at least when \(m=1\).

4.2 Multinomial models

Consider the case in which the observations, i.e. both the sequence \(\{X_i\}_{i \ge 1}\) and the sequence \(\{\xi _i\}_{i \ge 1}\), take values in a finite set, say \(\{a_1, \dots , a_N\}\). It is easy to check that \(\varTheta \) can be assumed to coincide with the interior of the \((N-1)\)-dimensional simplex

$$\begin{aligned} \varDelta _{N-1}:= \left\{ \theta = (\theta _1, \dots , \theta _{N-1}) \in [0,1]^{N-1}\ \Big |\ \sum _{i=1}^{N-1} \theta _i \le 1\right\} \end{aligned}$$

and

$$\begin{aligned} \pi _n(\text {d}\theta \,|\, x_1, \dots , x_n) = \frac{\left[ \prod _{i=1}^{N} \theta _i^{\nu _{n,i}(x)} \right] \pi (\text {d}\theta )}{\int _{\varDelta _{N-1}} \left[ \prod _{i=1}^{N} t_i^{\nu _{n,i}(x)} \right] \pi (\text {d}t) } \end{aligned}$$

where \(\theta = (\theta _1, \dots , \theta _{N-1})\), \(t = (t_1, \dots , t_{N-1})\), \(\theta _N:= 1 - \sum _{i=1}^{N-1} \theta _i\), \(t_N:= 1- \sum _{i=1}^{N-1} t_i\) and

$$\begin{aligned} \nu _{n,i}(x):= \sum _{j=1}^n \mathbbm {1}\{x_j = a_i\} \qquad \qquad i = 1, \dots , N. \end{aligned}$$

Of course, if we put \(\mathbb {X}= \{a_1, \dots , a_N\}\), we cannot directly apply Theorem 2. Nonetheless, we can resort to a reinterpretation of the data in terms of the frequencies \(\nu _{n,i}\), which we now explain, that allows the use of our theorem. We consider

$$\begin{aligned} \pi _n^{*}(\text {d}\theta \,|\, p):= \frac{\left[ \prod _{i=1}^{N} \theta _i^{np_i} \right] \pi (\text {d}\theta )}{\int _{\varDelta _{N-1}} \left[ \prod _{i=1}^{N} t_i^{np_i} \right] \pi (\text {d}t) } \end{aligned}$$

defined for \(p = (p_1, \dots , p_{N-1}) \in \varDelta _{N-1}\) with the usual proviso that \(p_N:= 1 - \sum _{i=1}^{N-1} p_i\). Whence,

$$\begin{aligned} \pi _n(\text {d}\theta \,|\, x_1, \dots , x_n) = \pi _n^{*}\left( \text {d}\theta \,\Big |\, \left( \frac{\nu _{n,1}(x)}{n}, \dots , \frac{\nu _{n,N-1}(x)}{n}\right) \right) \ . \end{aligned}$$

The problem of consistency, and the allied question of finding a PCR, can be now reformulated as follows. After fixing \(\theta _0 \in \varDelta _{N-1}\), we consider the sequence \(\{\xi _i\}_{i \ge 1}\) of i.i.d. random variables, each taking values in \(\{a_1, \dots , a_N\}\), with \(\mathbb {P}[\xi _1 = a_i] = \theta _{0,i}\), for \(i = 1, \dots , N\). An analogous version of Lemma 1 states that

$$\begin{aligned} \epsilon _n = \mathbb {E}[\mathcal {W}_2^{({\mathcal {P}}(\varTheta ))}(\pi _n(\text {d}\theta \,|\, \xi _1, \dots , \xi _n); \delta _{\theta _0}) ] \end{aligned}$$

provides a PCR at \(\theta _0\). Now, we reformulate Theorem 2 as follows. First of all, we have that

$$\begin{aligned} \mathbb {E}\left[ \left| \left( \frac{\nu _{n,1}(\xi )}{n}, \dots , \frac{\nu _{n,N-1}(\xi )}{n} \right) - \theta _0 \right| \right] \le \sqrt{\sum _{i=1}^{N-1} \mathbb {E}\left[ \left| \frac{\nu _{n,i}(\xi )}{n} - \theta _{0,i} \right| ^2\right] } \le \frac{1}{\sqrt{n}} \end{aligned}$$

replaces the speed of the mean Glivenko–Cantelli convergence. Then, we have that

$$\begin{aligned} \textsf {K}(\theta \,|\, \theta _0) = \sum _{i=1}^N \theta _{0,i} \log \left( \frac{\theta _{0,i}}{\theta _{i}}\right) \ . \end{aligned}$$
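This formula is easy to check numerically. The sketch below (with illustrative \(\theta _0\) and perturbation direction) verifies nonnegativity, the vanishing at \(\theta = \theta _0\), and the local quadratic behavior of the kind displayed in (60); the closed form of the multinomial Fisher information used for the comparison is a standard fact, stated here as an assumption.

```python
import numpy as np

def kl_multinomial(theta, theta0):
    # K(theta | theta0) = sum_{i=1}^N theta0_i log(theta0_i / theta_i),
    # with the N-th coordinate reconstructed from the simplex constraint
    th = np.append(theta, 1.0 - np.sum(theta))
    th0 = np.append(theta0, 1.0 - np.sum(theta0))
    return float(np.sum(th0 * np.log(th0 / th)))

theta0 = np.array([0.2, 0.3])        # N = 3, so theta_{0,3} = 0.5
v = np.array([1.0, -0.5])            # an illustrative perturbation direction

# Multinomial Fisher information at theta0 (standard formula, assumed here):
# I[theta0]_{ij} = delta_{ij} / theta0_i + 1 / theta0_N
I = np.diag(1.0 / theta0) + 1.0 / 0.5
quad = 0.5 * float(v @ I @ v)

for t in [1e-1, 1e-2, 1e-3]:
    print(t, kl_multinomial(theta0 + t * v, theta0) / t ** 2)  # -> quad as t -> 0
print("(1/2) v'Iv =", quad)
```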

The relation analogous to that in (56), which gives a PCR at \(\theta _0\), reads as follows

$$\begin{aligned} \epsilon _n&= 3\left( \frac{\int _{\varDelta _{N-1}} |\theta -\theta _0|^2 e^{-n \textsf {K}(\theta \,|\,\theta _0)} \pi (\text {d}\theta )}{\int _{\varDelta _{N-1}} e^{-n \textsf {K}(\theta \,|\,\theta _0)} \pi (\text {d}\theta )} \right) ^{1/2} \nonumber \\&\quad + L_0^{(n)}(\delta _n) \mathbb {E}\left[ \left| \left( \frac{\nu _{n,1}(\xi )}{n}, \dots , \frac{\nu _{n,N-1}(\xi )}{n} \right) - \theta _0 \right| \right] \nonumber \\&\quad + 2|\theta _0| \mathbb {P}\left[ \left| \left( \frac{\nu _{n,1}(\xi )}{n}, \dots , \frac{\nu _{n,N-1}(\xi )}{n} \right) - \theta _0 \right|> \delta _n\right] \nonumber \\&\quad + \mathbb {E}\Bigg [\left( 2 \frac{\int _{\varDelta _{N-1}} |\theta |^2 \left[ \prod _{i=1}^{N} \theta _i^{\nu _{n,i}(\xi )} \right] \pi (\text {d}\theta )}{\int _{\varDelta _{N-1}} \left[ \prod _{i=1}^{N} \theta _i^{\nu _{n,i}(\xi )} \right] \pi (\text {d}\theta )} \right) ^{1/2} \nonumber \\&\quad \times \mathbbm {1}\left\{ \left| \left( \frac{\nu _{n,1}(\xi )}{n}, \dots , \frac{\nu _{n,N-1}(\xi )}{n} \right) - \theta _0 \right| > \delta _n\right\} \Bigg ] \end{aligned}$$
(62)

where \(\{\delta _n\}_{n \ge 1}\) is a sequence of positive numbers and \(L_0^{(n)}(\delta _n)\) is defined as follows

$$\begin{aligned} L_0^{(n)}(\delta _n):= n \sup _{|p - \theta _0| \le \delta _n} [{\mathfrak {C}}_2(\pi _n^{*}(\cdot \,|\, p))]^2 \left( \int _{\varDelta _{N-1}} \Big | \nabla _{\theta } \nabla _p \textsf {K}(\theta \,|\, p) \Big |^2 \pi _n^{*}(\text {d}\theta \,|\, p) \right) ^{1/2}\ . \end{aligned}$$

We now show that the PCR in (62) reduces to a simpler expression. Indeed, the first term on the right-hand side of (62) is similar to the one already studied in the previous section. By resorting to the same theorems from Breitung [22], and recalling that the mapping \(\theta \mapsto \textsf {K}(\theta \,|\,\theta _0)\) attains its minimum at \(\theta =\theta _0\), we get, as \(n \rightarrow +\infty \),

$$\begin{aligned} \left( \frac{\int _{\varDelta _{N-1}} |\theta - \theta _0|^2 e^{-n \textsf {K}(\theta \,|\, \theta _0)} \pi (\text {d}\theta )}{\int _{\varDelta _{N-1}} e^{-n \textsf {K}(\theta \, |\, \theta _0)} \pi (\text {d}\theta )} \right) ^{1/2} = O\big (n^{-1/2}\big ) \end{aligned}$$

provided that \(\pi \) has full support. As for the second term on the right-hand side of (62), we have already shown that the expectation is controlled by \(1/\sqrt{n}\). Apropos of the constant \(L_0^{(n)}(\delta _n)\), we can easily show that it is bounded, at least whenever \(\theta _0\) is fixed in the interior of \(\varDelta _{N-1}\). In fact, \(\delta _n\) can be chosen equal to any positive constant \(\delta \) smaller than the distance between \(\theta _0\) and the boundary of \(\varDelta _{N-1}\). In particular, by exploiting the convexity of the mapping \(\theta \mapsto \textsf {K}(\theta \,|\, p)\), we can resort to Proposition 4, upon assuming more regularity on the prior distribution \(\pi \), in order to obtain \([{\mathfrak {C}}_2(\pi _n^{*}(\cdot \,|\, p))]^2 \le C(\delta )/n\), with a positive constant \(C(\delta )\) which is independent of p. Then, by means of a direct computation, under the above conditions on \(\theta _0\) and \(\delta \) we can show that the integral

$$\begin{aligned} \int _{\varDelta _{N-1}} \Big | \nabla _{\theta } \nabla _p \textsf {K}(\theta \,|\, p) \Big |^2 \pi _n^{*}(\text {d}\theta \,|\, p) \end{aligned}$$

can be bounded uniformly in n. To conclude the analysis of the terms on the right-hand side of (62), we only need to exploit the boundedness of \(|\theta |\), as \(\theta \) varies in \(\varDelta _{N-1}\), to show that the third and the fourth terms are both bounded by a multiple of

$$\begin{aligned} \mathbb {P}\left[ \left| \left( \frac{\nu _{n,1}(\xi )}{n}, \dots , \frac{\nu _{n,N-1}(\xi )}{n} \right) - \theta _0 \right| > \delta _n\right] \ . \end{aligned}$$

Thus, if \(\theta _0\) is in the interior of \(\varDelta _{N-1}\) and \(\delta _n = \delta \) for every \(n \in \mathbb {N}\), for the same \(\delta \) as above, it is well-known from the theory of large deviations that this probability goes to zero exponentially fast. See Dembo and Zeitouni [33, Chapter 2] for a detailed account. To conclude, we state a proposition that summarizes the above considerations.

Proposition 8

Let \(N \ge 2\) be an integer. Let \(\pi \) be a prior on \(\varDelta _{N-1}\). If \(\pi \) has a density q (with respect to the Lebesgue measure) such that \(q \in \text {C}^1(\overline{\varDelta _{N-1}})\) and \(q(\theta ) = 0\) for any \(\theta \in \partial \varDelta _{N-1}\), then as \(n\rightarrow +\infty \)

$$\begin{aligned} \epsilon _n = O\big (n^{-1/2}\big ), \end{aligned}$$

which is the optimal rate.

4.3 Finite-dimensional logistic-Gaussian model

Consider a class of dominated statistical models specified by density functions of the form

$$\begin{aligned} f(x\,|\,\theta ) = \frac{e^{\theta \cdot \varGamma _N(x)}}{\int _0^1 e^{\theta \cdot \varGamma _N(y)} \lambda (\text {d}y)} \qquad x \in \mathbb {X}, \theta \in \varTheta \end{aligned}$$
(63)

where, for simplicity, we have fixed \(N \in \mathbb {N}\), \(\varTheta = \mathbb {R}^N\), \(\mathbb {X}= [0,1]\), \(\mathscr {X}= \mathscr {B}([0,1])\), \(\lambda = \mathcal {L}^1_{[0,1]}\), that is the one-dimensional Lebesgue measure restricted to [0,1], and

$$\begin{aligned} \varGamma _N(x):= \left( \sin \pi x, \sin 2\pi x, \dots , \sin N\pi x \right) \ . \end{aligned}$$

Of course, the expression \(\theta \cdot \varGamma _N(x)\) represents a Fourier polynomial and, for sufficiently large N, can approximate any smooth function very well, in various norms. This model has been studied in connection with the problem of density estimation [28, 29, 60, 61], essentially as a toy model. In the following section, we will analyze its infinite-dimensional generalization, which is a more flexible statistical model, even if more complex from a mathematical point of view.
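A minimal numerical sketch of the model (63) may help fix ideas; the choice of \(\theta \) and the quadrature scheme for the normalizing constant are illustrative assumptions. It confirms that \(f(\cdot \,|\,\theta )\) is a bona fide density on [0, 1] and that \(|\varGamma _N(x)| \le \sqrt{N}\).

```python
import numpy as np

def gamma_N(x, N):
    """Fourier feature map Gamma_N(x) = (sin(pi x), sin(2 pi x), ..., sin(N pi x))."""
    return np.sin(np.pi * np.outer(np.atleast_1d(x), np.arange(1, N + 1)))

def trapezoid(y, x):
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def density(x, theta, grid):
    """f(x | theta) as in (63); the normalizing constant exp(M(theta)) is
    obtained by quadrature on [0, 1]."""
    Z = trapezoid(np.exp(gamma_N(grid, len(theta)) @ theta), grid)
    return np.exp(gamma_N(x, len(theta)) @ theta) / Z

theta = np.array([1.0, -0.5, 0.25])      # an illustrative theta in R^3
grid = np.linspace(0.0, 1.0, 20001)
f = density(grid, theta, grid)
print(trapezoid(f, grid))                              # ~ 1: a genuine density
print(np.max(np.linalg.norm(gamma_N(grid, 3), axis=1)))  # bounded by sqrt(3)
```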

To apply Theorem 1, we start by fixing \(\theta _0 \in \varTheta \), so that \(\mu _0(\text {d}x) = f(x \,|\, \theta _0) \text {d}x\), where \(x \mapsto f(x \,|\, \theta _0)\) is a continuous and bounded density function on [0, 1]. Then, we let \(\{\xi _i\}_{i \ge 1}\) be a sequence of i.i.d. random variables with common distribution \(\mu _0\). The model (63) satisfies Definition 3 with \(\mathbb {B}= \varTheta \), \(\mathbb {B}^{*} = \varTheta \) (by Riesz’s representation theorem) and \(\varGamma = \mathbb {B}^{*}\), with \( _{\mathbb {B}^{*}}\langle \cdot , \cdot \rangle _{\mathbb {B}}\) being identified with the standard (Euclidean) scalar product of \(\mathbb {R}^N\). The function \(\beta \) coincides with \(\varGamma _N\), while g is the identity function. Finally, we have that

$$\begin{aligned} M(\theta ) = \log \left( \int _0^1 e^{\theta \cdot \varGamma _N(x)} \text {d}x \right) \end{aligned}$$

which proves to be a steep convex function of class \(\text {C}^{\infty }(\varTheta )\), and indeed analytic. Therefore, we have a regular exponential family in canonical form. As for the prior distribution, besides the multivariate (non-degenerate) Gaussian distribution \({\mathcal {N}}(m, Q)\), with \(m \in \varTheta \) and Q a symmetric and positive-definite \(N\times N\) matrix, any other distribution of log-concave form \(\pi (\text {d}\theta ) \propto \exp \{-U(\theta )\} \text {d}\theta \) fits our assumptions, provided that U is of class \(\text {C}^2(\varTheta )\) and strongly convex.

Coming back to the application of Theorem 1, we check the validity of the assumptions. First, \(|\varGamma _N(x)| \le \sqrt{N}\) for all \(x \in [0,1]\), so that (21) is in force. Then, (24) and \(\int _{\varTheta } \Vert \theta \Vert ^{ap} \pi (\text {d}\theta ) < +\infty \) for some \(a > 1\) hold, because of the assumptions on the prior distribution. Thus, the bound (25) provides the desired PCR, so that we proceed further by analyzing the various terms as in Sect. 3.1. Since

$$\begin{aligned} \text {I}(\theta _0) = \text {Hess}[M](\theta _0) = \textsf {Cov}_{\theta _0}(\varGamma _N(\xi _1)) \end{aligned}$$

is strictly positive definite, we can apply Proposition 1. We conclude that the first term on the right-hand side of (25) goes to zero as \(\frac{1}{\sqrt{n}}\). Then, the boundedness condition \(|\varGamma _N(x)| \le \sqrt{N}\) for all \(x \in [0,1]\) entails that the second and the third terms on the right-hand side of (25) go to zero exponentially fast, by means of classical concentration inequalities such as Bernstein’s inequality [20, 33]. Finally, we consider the last term on the right-hand side of (25). In particular, by Jensen’s inequality

$$\begin{aligned} \mathbb {E}[\Vert \hat{S}_n- S_0\Vert _{\mathbb {B}}] \le \left( \mathbb {E}[\Vert \hat{S}_n- S_0\Vert _{\mathbb {B}}^2]\right) ^{1/2} = O\big (n^{-1/2}\big )\, \end{aligned}$$

as \(n \rightarrow +\infty \), where \(\hat{S}_n = n^{-1} \sum _{i=1}^n \varGamma _N(\xi _i)\) and

$$\begin{aligned} S_0 = \mathbb {E}_{\theta _0}[\varGamma _N(\xi _1)] = \int _0^1 \varGamma _N(x) \mu _0(\text {d}x) \ . \end{aligned}$$

Then, we apply Proposition 3. Since \(\mathfrak {D}_{\theta }[g]\) coincides with the identity operator, we conclude that

$$\begin{aligned} \int _{\varTheta } \Vert \mathfrak {D}_{\theta }[g]\Vert _{*}^2\ \pi _n^{*}(\text {d}\theta \,|\, S)= 1 \end{aligned}$$

for all \(S \in \mathbb {B}\). Whence,

$$\begin{aligned} L_0^{(n)} = n \sup _{S \in \mathcal {U}_{\delta _n}(S_0)} \left\{ \mathfrak {C}_2[\pi _n^{*}(\cdot \,|\, S)] \right\} ^2 \ . \end{aligned}$$

We conclude our analysis by estimating the weighted Poincaré–Wirtinger constant by means of Proposition 4. Indeed, a common feature of these logistic models is that the behavior of the Kullback–Leibler divergence \(\textsf {K}(\theta \,|\,\theta _0)\) is twofold: it is quadratic as \(\theta \) varies around \(\theta _0\), while it is linear as \(|\theta | \rightarrow +\infty \). Thus, the strong Bakry–Émery condition does not apply here, and we resort to the lower bound condition

$$\begin{aligned} \theta \cdot \nabla _{\theta } \textsf {K}(\theta \,|\,\theta _0) \ge c|\theta | \end{aligned}$$
(64)

for all \(|\theta | \ge R\), with some suitable \(R>0\). To check the validity of this lower bound, we fix for simplicity \(\theta _0 = 0\), to get

$$\begin{aligned} \theta \cdot \nabla _{\theta } \textsf {K}(\theta \,|\,\theta _0) = \frac{\int _0^1 \phi (x) e^{\phi (x)} \text {d}x}{\int _0^1 e^{\phi (x)} \text {d}x} - \int _0^1 \phi (x) \text {d}x \end{aligned}$$

with \(\phi (x):= \theta \cdot \varGamma _N(x)\). By fixing a unitary vector \(\sigma \in S^{N-1}\) and considering \(\theta = t\sigma \), the Laplace approximation yields that the above right-hand side is asymptotic to

$$\begin{aligned} t \left[ \max _{x \in [0,1]} \sigma \cdot \varGamma _N(x) - \sigma \cdot \int _0^1\varGamma _N(x)\text {d}x \right] \end{aligned}$$

as \(t \rightarrow +\infty \). At this stage, the function

$$\begin{aligned} \sigma \mapsto \max _{x \in [0,1]} \sigma \cdot \varGamma _N(x) - \sigma \cdot \int _0^1\varGamma _N(x)\text {d}x \end{aligned}$$
(65)

proves to be continuous and non-negative on \(S^{N-1}\). The minimum of such a function must be positive: otherwise, there would exist \(\hat{\sigma } \in S^{N-1}\) for which the map \(x \mapsto \hat{\sigma } \cdot \varGamma _N(x)\) would be constant, contradicting the linear independence of the components of the Fourier basis \(\varGamma _N\). Hence, the function in (65) must be strictly positive, which validates (64). Thus, by point (1) of Proposition 4, the square of the weighted Poincaré–Wirtinger constant is asymptotic to 1/n. All the above considerations can be summarized in the following proposition.

Proposition 9

Let \(\pi (\text {d}\theta ) \propto \exp \{-U(\theta )\} \text {d}\theta \) be any prior on \(\varTheta = \mathbb {R}^N\), with U of class \(\text {C}^2(\varTheta )\) and strongly convex. Then, for the finite-dimensional logistic-Gaussian model as in (63) with \(N \in \mathbb {N}\), \(\mathbb {X}= [0,1]\), \(\mathscr {X}= \mathscr {B}([0,1])\) and \(\lambda = \mathcal {L}^1_{[0,1]}\), the PCR \(\epsilon _n\) satisfies

$$\begin{aligned} \epsilon _n = O\big (n^{-1/2}\big ) \end{aligned}$$

as \(n \rightarrow +\infty \), which is the optimal (parametric) rate.
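As a complement to the argument validating (64), the strict positivity of the function in (65) can be probed numerically over random directions on the sphere; the sketch below again assumes the hypothetical choice \(\varGamma _N(x) = (\cos (2\pi k x))_{k \le N}\):

```python
import numpy as np

# Probe sigma -> max_x sigma . Gamma_N(x) - sigma . int_0^1 Gamma_N(x) dx, cf. (65),
# for the hypothetical choice Gamma_N(x) = (cos(2*pi*k*x))_{k<=N}.
rng = np.random.default_rng(0)
N = 4
xs = np.linspace(0.0, 1.0, 2001)
G = np.cos(2.0 * np.pi * np.outer(xs, np.arange(1, N + 1)))
G_mean = np.trapz(G, xs, axis=0)                 # here int_0^1 Gamma_N(x) dx = 0

sigmas = rng.normal(size=(500, N))
sigmas /= np.linalg.norm(sigmas, axis=1, keepdims=True)   # points of S^{N-1}
vals = np.array([np.max(G @ s) - s @ G_mean for s in sigmas])
```

No sampled direction attains zero, consistently with the contradiction argument based on the linear independence of \(\varGamma _N\).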

4.4 Infinite-dimensional logistic-Gaussian model

Consider a class of dominated statistical models specified by density functions of the form

$$\begin{aligned} f(x\,|\,\theta ) = \frac{e^{\theta (x)}}{\int _0^1 e^{\theta (y)} \lambda (\text {d}y)} \qquad x \in \mathbb {X}, \theta \in \varTheta \end{aligned}$$
(66)

where we have fixed \(\mathbb {X}= [0,1]\), \(\mathscr {X}= \mathscr {B}([0,1])\), \(\lambda = \mathcal {L}^1_{[0,1]}\), that is the one-dimensional Lebesgue measure restricted to [0,1]. As for the parameter space \(\varTheta \), we set

$$\begin{aligned} \varTheta = \text {H}^1_{*}(0,1):= \{\phi \in \text {H}^1(0,1)\ |\ \phi (0)=0\} \end{aligned}$$
(67)

thought of as an infinite-dimensional Hilbert space endowed with scalar product

$$\begin{aligned} \langle \phi , \psi \rangle = \int _0^1 \phi '(z) \psi '(z) \text {d}z \end{aligned}$$
(68)

and norm

$$\begin{aligned} \Vert \phi \Vert _{\text {H}^1_{*}(0,1)}:= \left( \int _0^1 [\phi '(z)]^2 \text {d}z\right) ^{1/2}\ . \end{aligned}$$

Here, the well-known Sobolev embedding theorem [65] states that \(\text {H}^1_{*}(0,1)\) is continuously embedded in \(\text {C}^0[0,1]\); therefore, the above notations \(\theta (x)\) and \(\phi (0)\) refer to the continuous representatives of \(\theta \) and \(\phi \), respectively.
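In fact, the embedding constant is at most one here: since \(\phi (x) = \int _0^x \phi '(z)\,\text {d}z\), the Cauchy–Schwarz inequality gives \(|\phi (x)| \le \sqrt{x}\, \Vert \phi \Vert _{\text {H}^1_{*}(0,1)}\). A small numerical illustration on randomly generated elements of \(\text {H}^1_{*}(0,1)\) (test functions of our own choosing, built from sines vanishing at 0):

```python
import numpy as np

# Illustrate ||phi||_sup <= ||phi||_{H^1_*} on random test functions with phi(0) = 0.
rng = np.random.default_rng(0)
zs = np.linspace(0.0, 1.0, 10001)
ks = np.arange(1, 6)

ratios = []
for _ in range(100):
    c = rng.normal(size=ks.size)
    phi = (c * np.sin(np.pi * np.outer(zs, ks))).sum(axis=1)
    dphi = (c * ks * np.pi * np.cos(np.pi * np.outer(zs, ks))).sum(axis=1)
    h1_norm = np.sqrt(np.trapz(dphi**2, zs))          # H^1_* norm of phi
    ratios.append(np.max(np.abs(phi)) / h1_norm)
```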

The infinite-dimensional logistic-Gaussian model is typically considered in connection with the fundamental problem of density estimation [28, 29, 60, 61]. Under the assumption that the prior is a Gaussian measure, Bayesian consistency is investigated in Tokdar and Ghosh [80], whereas PCRs are provided in Giné and Nickl [54], Rivoirard and Rousseau [70], Scricciolo [72] and van der Vaart and van Zanten [82]. These results consider the set \(\varTheta \) to be the space of all density functions on [0, 1], typically endowed with the total variation distance, the Hellinger distance, some \(\text {L}^p\) norm or the Kullback–Leibler divergence. Our approach to PCRs relies on the choice (67), so that our PCRs refer to Definition 1 with \(\text {d}_{\varTheta }\) equal to the \(\text {H}^1_{*}(0,1)\)-norm. This metric is generally stronger, since a Sobolev norm is, for suitable exponents, greater than the \(\text {L}^r\) norm considered in Giné and Nickl [54] and, in turn, greater than the (squared) Hellinger distance, as proved in Scricciolo [72, Lemma A.1]. In connection with the statistical model (66), the work of Fukumizu [47] provides an implicit Riemannian structure on the space of densities which is modeled on the metric of the underlying space \(\varTheta \): the Riemannian distance between two densities \(f(\cdot | \theta _1)\) and \(f(\cdot | \theta _2)\) turns out to be locally equivalent to \(\Vert \theta _1 - \theta _2\Vert _{\text {H}^1_{*}(0,1)}\). Another (geometrical) view of the set \(\{f(\cdot \,|\,\theta )\}_{\theta \in \varTheta }\), simply thought of as a differential manifold, is provided in Pistone and Rogantin [69].

We provide PCRs for the model (66) on the basis of Theorem 1. We start by fixing \(\theta _0 \in \varTheta \), with \(\varTheta \) being the same as in (67). Whence, \(\mu _0(\text {d}x) = f(x | \theta _0) \text {d}x\), where \(x \mapsto f(x | \theta _0)\) is a continuous and bounded density function on [0, 1]. Then, we let \(\{\xi _i\}_{i \ge 1}\) be a sequence of independent random variables identically distributed with probability law \(\mu _0\). At this stage, we notice that the model (66) satisfies Definition 3 with \(\mathbb {B}= \varTheta \), \(\mathbb {B}^{*} = \varTheta \) (by Riesz’s representation theorem) and \(\varGamma = \mathbb {B}^{*}\). For completeness, we specify that also the pairing \( _{\mathbb {B}^{*}}\langle \cdot , \cdot \rangle _{\mathbb {B}}\) is identified with the scalar product \(\langle \cdot , \cdot \rangle \) as in (68), again by Riesz’s representation theorem. In this setting, we deduce that the function \(\beta \) in Definition 3 coincides with the Riesz representative of the \(\delta _x\) functional, for any \(x \in [0,1]\), that is \(\beta _x(z):= z\mathbbm {1}_{[0,x]}(z) + x\mathbbm {1}_{(x, 1]}(z)\) for \(z \in [0,1]\), since, for any \(\theta \in \varTheta \), \(\theta (x) = \langle \theta , \beta _x\rangle \). Lastly, we fix g as the identity map on \(\varTheta \), so that (19) is satisfied and

$$\begin{aligned} M(\theta ) = \log \int _0^1 e^{\theta (y)} \text {d}y. \end{aligned}$$
(69)

As for the prior \(\pi \), we assume that it is a Gaussian measure on \(\varTheta \), with mean \(m \in \varTheta \) and covariance operator \(Q: \varTheta \rightarrow \varTheta \). We recall that Q is a trace-class operator, with eigenvalues \(\{\lambda _k\}_{k \ge 0}\) satisfying \(\sum _{k=0}^{\infty } \lambda _k < +\infty \). See Da Prato [30, 31], and references therein, for a review on Gaussian measures on Hilbert spaces.
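As a quick sanity check of the identification of \(\beta \), note that \(\langle \theta , \beta _x\rangle = \int _0^x \theta '(z)\,\text {d}z = \theta (x)\) under the scalar product (68). A minimal numerical sketch with a hand-picked \(\theta \in \text {H}^1_{*}(0,1)\):

```python
import numpy as np

# Check the reproducing identity <theta, beta_x> = theta(x) for the inner
# product (68), with beta_x'(z) = 1_{[0,x]}(z) and a hand-picked theta, theta(0) = 0.
zs = np.linspace(0.0, 1.0, 100001)
theta = np.sin(np.pi * zs) + zs**2                 # theta in H^1_*(0,1)
theta_prime = np.pi * np.cos(np.pi * zs) + 2.0 * zs

errs = []
for x in (0.25, 0.5, 0.9):
    inner = np.trapz(theta_prime * (zs <= x), zs)  # <theta, beta_x>
    errs.append(abs(inner - (np.sin(np.pi * x) + x**2)))
```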

Now, we check the validity of the assumptions of Theorem 1. First, we have that

$$\begin{aligned} \Vert \beta _x\Vert _{\mathbb {B}} = \left( \int _0^1 \mathbbm {1}_{[0,x]}(z) \text {d}z\right) ^{1/2} = \sqrt{x} \in [0,1] \end{aligned}$$
(70)

yielding that (21) is trivially satisfied. In particular, the element \(S_0\) is given by

$$\begin{aligned} z \mapsto S_0(z) = \int _0^1 \beta _x(z) \mu _0(\text {d}x) = z\mu _0([z,1]) + \int _0^z x \mu _0(\text {d}x) \in \varTheta \end{aligned}$$

and \(\hat{S}_n = n^{-1}\sum _{i=1}^n \beta _{\xi _i}\). We recall that \(\hat{S}_n \rightarrow S_0\) as \(n \rightarrow +\infty \), in both the \(\mathbb {P}\)-a.s. and the \(\text {L}^2\) sense, by the law of large numbers in Hilbert spaces ([62, Corollary 7.10]). Now, we observe that (24) boils down to the identity

$$\begin{aligned} \int _{\varTheta } \exp \{n\ _{\mathbb {B}^{*}}\!\langle g(\tau ), b\rangle _{\mathbb {B}} \} \pi (\text {d}\tau ) = \exp \left\{ n\langle m, b\rangle + \frac{n^2}{2}\langle Q[b], b\rangle \right\} < +\infty \end{aligned}$$

for all \(n \in \mathbb {N}\) and \(b \in \mathbb {B}= \varTheta \). See Da Prato [31, Proposition 1.15]. Then, condition (iii) of Theorem 1 is trivially satisfied. Finally, with regard to (iv), we mention that any sequence \(\delta _n \sim n^{-q}\) with \(q \in [0,1/2)\), as \(n \rightarrow +\infty \), is valid, provided that we verify (9), as we do just below. After these preliminaries, we start analyzing the four terms on the right-hand side of (25).
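The Gaussian moment-generating identity just invoked can be illustrated by Monte Carlo in a finite-dimensional surrogate of \(\varTheta \); the dimensions and numerical values below are our own illustrative choices:

```python
import numpy as np

# Monte Carlo check, in a finite-dimensional surrogate, of the Gaussian
# moment-generating identity E[exp(<Xi, b>)] = exp(<m, b> + (1/2) <Q b, b>).
rng = np.random.default_rng(0)
m = np.array([0.1, -0.2, 0.05])
Q = np.diag([0.10, 0.05, 0.02])                 # trace-class is automatic in R^3
b = np.array([0.5, 0.3, -0.4])

samples = rng.multivariate_normal(m, Q, size=400_000)
mc = np.mean(np.exp(samples @ b))               # Monte Carlo estimate
exact = np.exp(m @ b + 0.5 * b @ Q @ b)         # closed form
```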

As for the first term on the right-hand side of (25), we study (28). We observe that

$$\begin{aligned} \textsf {K}(\theta \,|\, \theta _0)&= \int _0^1 [\theta _0(y) - M(\theta _0) - \theta (y) + M(\theta )] f(y\,|\,\theta _0) \text {d}y \\&= \int _0^1 \langle \theta _0 - \theta , \beta _y\rangle f(y\,|\,\theta _0) \text {d}y + M(\theta ) - M(\theta _0) \\&= \langle \text {D}_{\theta _0} M, \theta _0 - \theta \rangle + M(\theta ) - M(\theta _0) \\&= \frac{1}{2} \langle \text {Hess}_{\theta _0}[M][\theta _0 - \theta ], \theta _0 - \theta \rangle + o(\Vert \theta _0 - \theta \Vert ^2) \qquad (\text {as}\ \theta \rightarrow \theta _0) \end{aligned}$$

where \(\text {D}_{\theta _0} M\) represents the (Riesz representative of) the Fréchet differential of \(M: \varTheta \rightarrow \mathbb {R}\) at \(\theta _0\), while \(\text {Hess}_{\theta _0}[M] = \text {I}(\theta _0)\) stands for the Hessian operator of M at \(\theta _0\), which coincides with the Fisher information operator \(\text {I}(\theta _0)\). In particular, in the last identity we have used the Taylor expansion of M around \(\theta _0\). In view of a more concrete characterization of \(\text {D}_{\theta _0} M\) and \(\text {I}(\theta _0)\), we write that

$$\begin{aligned}&M(\theta _0 + h) - M(\theta _0)\\&\quad = \int _{0}^1 h(y)\mu _0(\text {d}y) + \frac{1}{2}\left[ \int _{0}^1 h^2(y)\mu _0(\text {d}y) - \left( \int _{0}^1 h(y)\mu _0(\text {d}y)\right) ^2\right] + {\mathcal {R}}(h; \mu _0) \end{aligned}$$

where \(|{\mathcal {R}}(h; \mu _0)| \le C(\mu _0) \Vert h\Vert _{\varTheta }^3\) for \(\Vert h\Vert _{\varTheta } \le 1\), with some suitable constant \(C(\mu _0)\) depending solely on \(\mu _0\). In particular, a straightforward integration by parts shows that

$$\begin{aligned} \int _{0}^1 h(y)\mu _0(\text {d}y) = \int _{0}^1 h'(y)\varPhi '_0(y) \text {d}y = \langle h, \varPhi _0 \rangle \end{aligned}$$

with \(\varPhi _0(y):= \int _{0}^y [1 - F_0(z)] \text {d}z\) and \(F_0(z):= \mu _0([0,z])\). Whence, \(\varPhi _0 = \text {D}_{\theta _0} M\), by means of Riesz’s representation. Moreover, with the same technique, we obtain

$$\begin{aligned} \text {Hess}_{\theta _0}[M][h](y) = 2\int _0^y h(z) [1 - F_0(z)] \text {d}z - \langle h, \varPhi _0 \rangle \varPhi _0(y) \end{aligned}$$
(71)

for any \(y \in [0,1]\) and \(h \in \varTheta \). The above left-hand side should be read as follows: first, the operator \(\text {Hess}_{\theta _0}[M]\), applied to \(h \in \varTheta \), gives a new element of \(\varTheta \), called \(\text {Hess}_{\theta _0}[M][h]\); second, this new object, as a continuous function evaluated at y, coincides with the right-hand side. Finally, integration by parts entails that

$$\begin{aligned} \int _{0}^1 h^2(y)\mu _0(\text {d}y) - \left( \int _{0}^1 h(y)\mu _0(\text {d}y)\right) ^2 = \langle h, \text {Hess}_{\theta _0}[M][h]\rangle \end{aligned}$$

for any \(h \in \varTheta \). The way is now paved for the application of Proposition 2 and Remark 4. As a first step, we check that the operator in (71), from \(\varTheta \) to itself, is compact. As for the term \(\langle h, \varPhi _0 \rangle \varPhi _0\), it defines a finite-rank operator, which is of course compact. As for the term \(2\int _0^y h(z) [1 - F_0(z)] \text {d}z\), it is enough to pick a bounded sequence, say \(\{h_n\}_{n \ge 1}\), in \(\varTheta \), and study the sequence \(\{\varPsi _n\}_{n \ge 1}\) given by \(\varPsi _n(y):= 2\int _0^y h_n(z) [1 - F_0(z)] \text {d}z\). Now, from the well-known properties of weak topologies of separable Hilbert spaces, we can extract a subsequence \(\{h_{n_j}\}_{j \ge 1}\) which converges weakly to some \(h_{*} \in \varTheta \). Whence, \(h_{n_j}\) converges uniformly (i.e., in the strong topology of \(\text {C}^0[0,1]\)) to \(h_{*}\), by the Rellich–Kondrachov embedding theorem. Consequently, the sequence \(\{\varPsi _{n_j}\}_{j \ge 1}\) converges strongly in \(\varTheta \) to \(\varPsi _{*}(y):= 2\int _0^y h_{*}(z) [1 - F_0(z)] \text {d}z\), since

$$\begin{aligned} \Vert \varPsi _{n_j} - \varPsi _{*}\Vert _{\varTheta }^2 = 4 \int _0^1 |h_{n_j}(z) - h_{*}(z)|^2 [1 - F_0(z)]^2 \text {d}z \le 4 \Vert h_{n_j} - h_{*} \Vert _{\infty }^2 \rightarrow 0 \end{aligned}$$

as \(j \rightarrow +\infty \). This proves that the operator in (71), from \(\varTheta \) to itself, is a compact operator, even if it is not self-adjoint. Then, we resort to Remark 4, noticing that

$$\begin{aligned}&\int _0^1 h^2(y)\mu _0(\text {d}y) - \left( \int _{0}^1 h(y)\mu _0(\text {d}y)\right) ^2 \\&\quad = \int _{0}^1 \left[ h(x) - \int _{0}^1 h(y)\mu _0(\text {d}y)\right] ^2 f(x\,|\,\theta _0) \text {d}x\\&\quad \ge \exp \{- \text {osc}(\theta _0)\} \int _{0}^1 \left[ h(x) - \int _{0}^1 h(y)\mu _0(\text {d}y)\right] ^2 \text {d}x\\&\quad \ge \exp \{- \text {osc}(\theta _0)\} \int _{0}^1 \left[ h(x) - \int _{0}^1 h(y)\text {d}y\right] ^2 \text {d}x\ \end{aligned}$$

where \(\text {osc}(\theta _0):= \max _{x \in [0,1]} \theta _0(x) - \min _{x \in [0,1]} \theta _0(x)\) is the oscillation. Therefore, we can set

$$\begin{aligned} \text {I}^{\dagger }[h]:= \exp \{- \text {osc}(\theta _0)\} \text {Hess}_{0}[M][h] \end{aligned}$$

where \(\text {Hess}_{0}[M][h]\) is defined by (71) with \(\theta _0 \equiv 0\), to re-write the above relations as

$$\begin{aligned} \langle h, \text {Hess}_{\theta _0}[M][h]\rangle \ge \langle h, \text {I}^{\dagger }[h] \rangle \end{aligned}$$

or simply as \(\text {Hess}_{\theta _0}[M] \ge \text {I}^{\dagger }\). By means of the above argument, \(\text {I}^{\dagger }\), as a linear operator from \(\varTheta \) to itself, is again compact, but not self-adjoint. By a straightforward integration by parts, we find that a self-adjoint counterpart of \(\text {I}^{\dagger }\) is given by

$$\begin{aligned} \text {I}^{*}[h](x):= \exp \{- \text {osc}(\theta _0)\} \left\{ \int _0^1 \beta _x(y) h(y) \text {d}y - \left( x - \frac{x^2}{2}\right) \int _0^1 h(y) \text {d}y\right\} \end{aligned}$$

with \(x \in [0,1]\) and \(h \in \varTheta \). The relation \(\text {Hess}_{\theta _0}[M] \ge \text {I}^{*}\) is, of course, still in force. We can now invoke the spectral theorem for compact, self-adjoint operators on separable Hilbert spaces to deduce the existence of a Fourier basis (complete orthonormal system) \(\{{{\textbf {e}}}_k\}_{k \ge 1}\) for \(\varTheta \) which diagonalizes \(\text {I}^{*}\). With reference to (37), we call \(\{\gamma _k^{*}\}_{k \ge 1}\) the sequence of the corresponding eigenvalues, for which we have that \(\gamma _k^{*} \rightarrow 0\) as \(k \rightarrow +\infty \), again by the spectral theorem. An explicit derivation of \({{\textbf {e}}}_k\) can be drawn from the following integro-differential Cauchy problem

$$\begin{aligned} {\left\{ \begin{array}{ll} -{{\textbf {e}}}_k(x) + \displaystyle {\int _{0}^1} {{\textbf {e}}}_k(y)\text {d}y = \gamma _k^{*} {{\textbf {e}}}^{''}_k(x)\, \qquad x \in [0,1] \\ {{\textbf {e}}}_k(0) = {{\textbf {e}}}^{'}_k(0) = 0 \end{array}\right. } \end{aligned}$$

which is obtained by differentiating twice the relation \(\text {I}^{*}[{{\textbf {e}}}_k] = \gamma _k^{*} {{\textbf {e}}}_k\). Explicit solutions are

$$\begin{aligned} {{\textbf {e}}}_k(x) = \frac{\sqrt{2}}{k\pi }(1 - \cos (k\pi x)) \qquad \qquad \gamma _k^{*} = \frac{\exp \{- \text {osc}(\theta _0)\}}{(k\pi )^2} \end{aligned}$$
(72)

with \(x \in [0,1]\) and \(k\in \mathbb {N}\). After having fixed the Fourier basis \(\{{{\textbf {e}}}_k\}_{k \ge 1}\), we can further specify the prior distributions in terms of the probability laws of the random elements \(\varXi \), with values in \(\varTheta \), of the form (Karhunen–Loève representation)

$$\begin{aligned} \varXi = \sum _{k=1}^{\infty } Z_k {{\textbf {e}}}_k\ . \end{aligned}$$

Here, \(\{Z_k\}_{k \ge 1}\) is a sequence of independent real-valued random variables with \(Z_k \sim {\mathcal {N}}(m_k, \lambda _k)\), for suitable sequences \(m:= \{m_k\}_{k \ge 1} \subset \mathbb {R}\) and \(\{\lambda _k\}_{k \ge 1} \subset (0,+\infty )\) with \(\{m_k\}_{k \ge 1} \in \ell ^2\) and \(\{\lambda _k\}_{k \ge 1} \in \ell ^1\). Thus, if \(\pi (B):= \mathbb {P}\left[ \ \varXi \in B \right] \) for any \(B \in \mathscr {B}(\varTheta )\), it is straightforward to check that \(\pi \) is a Gaussian measure on \((\varTheta ,\mathscr {B}(\varTheta ))\) with mean m and covariance operator Q satisfying \(Q[{{\textbf {e}}}_k] = \lambda _k {{\textbf {e}}}_k\). Whence, (37) is verified. To justify the validity of (38), we check the remaining assumptions of Proposition 2. First, it is trivial to check that \(\theta \mapsto \textsf {K}(\theta |\theta _0)\) belongs to \(\text {C}^{\infty }(\varTheta )\), so that we can put \(q=1\). Then, we consider points (i)–(iv). For simplicity, we again fix \(\theta _0 \equiv 0\), with no real loss of generality. We start with the definition of the space \(\mathbb {K}\), expressed as the closure of \(\varTheta \) with respect to the norm

$$\begin{aligned} \Vert \theta \Vert _{\mathbb {K}}:= \sup _{\begin{array}{c} \psi \in \varTheta \\ \Vert \psi \Vert _{\varTheta } \le 1 \end{array}} \int _0^1 \left[ \theta (x) - \int _0^1 \theta (y)\text {d}y \right] \psi (x) \text {d}x \end{aligned}$$
(73)

which represents, plainly speaking, a dual Sobolev norm of the function \(x \mapsto \theta (x) - \int _0^1 \theta (y)\text {d}y\). The embedding \(\varTheta \subset \mathbb {K}\), with dense and continuous inclusion, follows from the Poincaré–Wirtinger inequality. Then, we notice that the function

$$\begin{aligned} \theta \mapsto \textsf {K}(\theta | \theta _0)&= \log \left( \int _0^1 e^{\theta (y)} \text {d}y\right) - \int _0^1 \theta (y) \text {d}y \\&= \log \left( \int _0^1 \exp \left\{ \theta (x) - \int _0^1 \theta (y)\text {d}y\right\} \text {d}x\right) \end{aligned}$$

has two different behaviors according to whether the norm of \(\theta \) is small or large. To be more precise, we fix \(\sigma \in \varTheta \) with \(\Vert \sigma \Vert _{\varTheta } =1\) and then we set \(\theta = t\sigma \) for any \(t\in (0,+\infty )\). In particular, as \(t \rightarrow 0\), a straightforward argument based on Taylor expansions of the exponential and the logarithmic functions shows that

$$\begin{aligned} \textsf {K}(t\sigma | \theta _0) = \frac{t^2}{2} \int _0^1 \left[ \sigma (x) - \int _0^1 \sigma (y)\text {d}y\right] ^2 \text {d}x + o(t^2)\ . \end{aligned}$$

On the other hand, as \(t \rightarrow +\infty \), by means of a direct application of the Laplace method of approximation ([87, Theorem 1.II]), we obtain the following expansion

$$\begin{aligned} \textsf {K}(t\sigma | \theta _0) \sim t \max _{x \in [0,1]} \left( \sigma (x) - \int _0^1 \sigma (y)\text {d}y\right) _+ \end{aligned}$$

with \((a)_+:= \max \{a, 0\}\). Upon denoting by \(\text {H}^{-1}_{*}(0,1)\) the dual space of \(\varTheta \), we can exploit that \(\text {L}^1(0,1) \subset \text {H}^{-1}_{*}(0,1)\), with continuous dense embedding, to obtain that

$$\begin{aligned} \max _{x \in [0,1]} \left( \sigma (x) - \int _0^1 \sigma (y)\text {d}y\right) _+ \ge \frac{1}{2} \int _0^1 \left| \sigma (x) - \int _0^1 \sigma (y)\text {d}y \right| \text {d}x \gtrsim \Vert \sigma \Vert _{\mathbb {K}} \ . \end{aligned}$$

Therefore, (33)–(34) are fulfilled with the above choice of the space \(\mathbb {K}\), and some \(\phi : [0,+\infty ) \rightarrow [0,+\infty )\) which behaves quadratically for small arguments and linearly for large arguments, like \(\phi (x) = x^2 \mathbbm {1}_{[0,1]}(x) + x\mathbbm {1}_{(1,+\infty )}(x)\). Then, the choice of \(q=1\) entails that \(r \in (1, \frac{3}{2})\). Further insights on inequalities like (33) can be found in Bal et al. [8], while properties of homogeneous spaces like \(\mathbb {K}\) have been recently investigated in Brasco et al. [21]. As for the validity of the interpolation inequality (32), we can fix, for example, \(r= 4/3\), \(s=4\) and start from the following specific version of the Gagliardo–Nirenberg interpolation inequality

$$\begin{aligned} \Vert f\Vert _{\text {H}^1(0,1)} \lesssim \Vert f\Vert _{\text {L}^2(0,1)}^{7/8} \Vert f\Vert _{\text {H}^8(0,1)}^{1/8} \end{aligned}$$

where \(\text {H}^m(0,1)\) denotes the standard (Hilbertian) Sobolev space of order m [23, Corollary 5.1]. Applying this inequality to \(f(x) = \theta (x) - \int _0^1 \theta (y)\text {d}y\), we get

$$\begin{aligned} \Vert \theta \Vert _{\varTheta } \lesssim \left\| \theta - \int _0^1 \theta (y)\text {d}y\right\| _{\text {L}^2(0,1)}^{7/8} \cdot \left\| \theta - \int _0^1 \theta (y)\text {d}y\right\| _{\text {H}^8(0,1)}^{1/8} \end{aligned}$$
(74)

for all \(\theta \in \varTheta \) such that \(\frac{\text {d}^8}{\text {d}x^8} \theta (x) \in \text {L}^2(0,1)\). Now, we define the Hilbert space \(\mathbb {V}\) as the subspace of \(\varTheta \) formed by those \(\theta \in \varTheta \) such that \(\frac{\text {d}^8}{\text {d}x^8} \theta (x) \in \text {L}^2(0,1)\), with the norm

$$\begin{aligned} \Vert \theta \Vert _{\mathbb {V}}:= \left\| \theta - \int _0^1 \theta (y)\text {d}y\right\| _{\text {H}^8(0,1)}\ . \end{aligned}$$
(75)

The inclusion \(\mathbb {V}\subset \varTheta \) with continuous and dense embedding follows by means of the usual Sobolev embedding theorem [65]. At this stage, we make use of the other specific version of the Gagliardo–Nirenberg interpolation inequality given by

$$\begin{aligned} \Vert f\Vert _{\text {H}^2(0,1)} \lesssim \Vert f\Vert _{\text {H}^1(0,1)}^{6/7} \Vert f\Vert _{\text {H}^8(0,1)}^{1/7} \end{aligned}$$

to deduce that

$$\begin{aligned} \Vert u\Vert _{\text {L}^2(0,1)} \lesssim \Vert u\Vert _{\mathbb {K}}^{6/7} \Vert u\Vert _{\mathbb {V}}^{1/7} \end{aligned}$$

holds for any \(u \in \text {C}^{\infty }_c(0,1)\) with \(\int _0^1 u(x)\text {d}x = 0\). By combining this inequality with (74), we finally deduce (32) with \(r= 4/3\) and \(s=4\). To guarantee that \(\pi (\mathbb {V}) = 1\), we can resort to the standard Kolmogorov three-series criterion to obtain that \(\mathbb {P}\left[ \ \varXi \in \mathbb {V}\right] =1\), provided that \(m_k = O(k^{-8-\delta })\) and \(\lambda _k = O(k^{-16-\delta })\) as \(k \rightarrow +\infty \), for some \(\delta > 0\). Finally, since \(\Vert {{\textbf {e}}}_k\Vert _{{\mathbb {V}}} = O(k^7)\) as \(k \rightarrow +\infty \), we obtain that

$$\begin{aligned} \int _{\mathbb {V}} e^{t \Vert \theta \Vert _{\mathbb {V}}} \pi (\text {d}\theta ) = \mathbb {E}\left[ e^{t \Vert \varXi \Vert _{\mathbb {V}}} \right]&\le \mathbb {E}\left[ \exp \left\{ t \sum _{k=1}^{\infty } |Z_k| \cdot \Vert {{\textbf {e}}}_k\Vert _{\mathbb {V}} \right\} \right] \\&\le \exp \left\{ t \sum _{k=1}^{\infty } |m_k| \cdot \Vert {{\textbf {e}}}_k\Vert _{\mathbb {V}} + \frac{t^2}{2} \sum _{k=1}^{\infty } \lambda _k \Vert {{\textbf {e}}}_k\Vert ^2_{\mathbb {V}} \right\} < +\infty \end{aligned}$$

holds for any \(t > 0\).
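Looking back at (72), the explicit eigenpairs can be verified numerically against the Cauchy problem and the \(\text {H}^1_{*}\)-orthonormality; the sketch below takes \(\theta _0 \equiv 0\), so that \(\exp \{-\text {osc}(\theta _0)\} = 1\) and \(\gamma _k^{*} = (k\pi )^{-2}\):

```python
import numpy as np

# Verify the eigenpairs (72) with theta_0 = 0 (so gamma_k^* = 1/(k*pi)^2):
# residual of -e_k(x) + int_0^1 e_k = gamma_k^* e_k''(x), and orthonormality
# <e_j, e_k> = int_0^1 e_j' e_k' = delta_{jk}.
xs = np.linspace(0.0, 1.0, 20001)

def e(k):
    return np.sqrt(2.0) / (k * np.pi) * (1.0 - np.cos(k * np.pi * xs))

def e_pp(k):                                     # second derivative of e_k
    return np.sqrt(2.0) * k * np.pi * np.cos(k * np.pi * xs)

residuals = [np.max(np.abs(-e(k) + np.trapz(e(k), xs) - e_pp(k) / (k * np.pi) ** 2))
             for k in (1, 2, 5)]

def e_p(k):                                      # first derivative of e_k
    return np.sqrt(2.0) * np.sin(k * np.pi * xs)

gram = np.array([[np.trapz(e_p(j) * e_p(k), xs) for k in (1, 2, 3)]
                 for j in (1, 2, 3)])
```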

By proceeding with the analysis of the other terms on the right-hand side of (25), we observe that the boundedness condition (70) enables a direct application of results in Pinelis and Sakhanenko [68] and Yurinskii [89], yielding that

$$\begin{aligned} \mathbb {E}[\Vert \hat{S}_n- S_0\Vert _{\mathbb {B}}] \le \left( \mathbb {E}[\Vert \hat{S}_n- S_0\Vert _{\mathbb {B}}^2]\right) ^{1/2} = O\big (n^{-1/2}\big ) \end{aligned}$$

and in addition that, for any sequence \(\{\delta _n\}_{n \ge 1}\) such that \(\delta _n \sim n^{-q}\) with \(q \in [0,1/2)\),

$$\begin{aligned} \mathbb {P}\left[ \hat{S}_n \not \in \mathcal {U}_{\delta _n}(S_0) \right] \le 2\exp \{-C n^{1-2q}\} \end{aligned}$$

for a positive constant C that depends only on \(\mu _0\). It remains to deal with the asymptotic behavior of \(L_0^{(n)}\) by combining Propositions 3 and 5. Here, we exploit once again the fact that g coincides with the identity function, so that

$$\begin{aligned} \int _{\varTheta } \Vert \mathfrak {D}_{\theta }[g]\Vert _{*}^2\ \pi _n^{*}(\text {d}\theta \,|\, S)= 1 \end{aligned}$$

for all \(S \in \varTheta \). Whence,

$$\begin{aligned} L_0^{(n)} = n \sup _{S \in \mathcal {U}_{\delta _n}(S_0)} \left\{ \mathfrak {C}_2^{(M)}[\pi _n^{*}(\cdot \,|\, S)] \right\} ^2, \end{aligned}$$

where we have used the weighted Poincaré–Wirtinger constant defined with respect to the Malliavin derivative. Indeed, we can argue as in the finite-dimensional setting, exploiting the key observation that the Kullback–Leibler divergence \(\textsf {K}(\theta |\theta _0)\) behaves quadratically as \(\theta \) varies around \(\theta _0\), while it is linear as \(\Vert \theta \Vert \rightarrow +\infty \). To be more precise, we can use the same arguments developed above to show that the choice \(\text {G}_0 = \text {I}^*\) fits the requirements of Proposition 5. Thus, the eigenfunctions \(\{{{\textbf {e}}}_k\}_{k \ge 1}\) are the same as above, and \(\eta _k = \gamma _k^{*}\). In order to exploit point (1) of Proposition 5, we can mimic the arguments already used in the previous section to prove (64). The argument works in the same way, with the sole difference that the function in (65) is now replaced by

$$\begin{aligned} \sigma \mapsto \max _{x \in [0,1]} Q^{1/2}[\sigma ](x) - \int _0^1 Q^{1/2}[\sigma ](x)\text {d}x \end{aligned}$$
(76)

with \(\Vert \sigma \Vert _{\varTheta } = 1\), because of the fact that the gradient is replaced by the Malliavin derivative. See Da Prato [30, Section 2.3]. Since \(Q^{1/2}\) is a compact operator, the image of the bounded set \(\{\sigma \in \varTheta \ |\ \Vert \sigma \Vert _{\varTheta } = 1\}\) through \(Q^{1/2}\) is sequentially compact. Thus, the infimum of the function in (76) cannot be equal to zero. Finally, with the application of (49), which provides the rate of the weighted Poincaré–Wirtinger constant, the discussion is completed. To conclude our analysis, we state a proposition that summarizes all the above considerations.

Proposition 10

In connection with the model (66), let \(\mathbb {X}= [0,1]\) and \(\varTheta = \text {H}^1_{*}(0,1)\). Let \(\theta _0 \in \varTheta \) be fixed. Assume that \(\pi = {\mathcal {N}}(m,Q)\) with \(m \in \varTheta \) and Q a non-degenerate trace-class operator satisfying (31). Fix the eigenfunctions \(\{{{\textbf {e}}}_k\}_{k \ge 1}\) and the spaces \(\mathbb {K}\) and \(\mathbb {V}\) as in (72), (73) and (75), respectively. Finally, set \(\gamma ^{*}_k\) as in (72), \(\eta _k = \gamma ^{*}_k\) and \(\omega _k\) according to the Fourier representation \(\theta _0 - m = \sum _{k=1}^{\infty } \omega _k {\textbf {e}}_k\). If \(m_k = O(k^{-8-\delta })\) and \(\lambda _k = O(k^{-16-\delta })\) as \(k \rightarrow +\infty \), for some \(\delta > 0\), then points (i)–(iv) of Proposition 2 are valid, along with the assumptions of point (1) of Proposition 5. In conclusion, it holds

$$\begin{aligned} \epsilon _n = O\left( \sqrt{\sum _{k=1}^{\infty } \frac{\lambda _k}{n \lambda _k \gamma _k^{*} + 1}} + \sqrt{\sum _{k=1}^{\infty } \frac{\omega _k^2}{(n \lambda _k \gamma _k^{*} + 1)^2}} + \sqrt{n} \max _{k \in \mathbb {N}} \left\{ \frac{\lambda _k}{n\lambda _k \eta _k + 1} \right\} \right) \ . \end{aligned}$$

To provide some hints on the optimality of our PCRs, it is useful to recall the discussion at the end of Sect. 3.1. At least in the simpler case when \(m = \theta _0\), the above rate has the form \(O\big (n^{-\frac{a-1}{2(a + 3)}}\big )\) when \(\lambda _k = O(k^{-(1+a)})\). The parameter a can be interpreted as a smoothness parameter, in the sense that it measures the analytical regularity of the trajectories of the prior \(\pi \). By way of example, supposing \(m_k = 0\) for all k for simplicity, a precise statement is as follows: if \(a > 1\) and \(\varepsilon \in \left( 0, \frac{a-1}{2}\right) \), then the trajectories of the random process \(\varXi \) belong to \(\text {H}^{1+\varepsilon }(0,1)\) almost surely. We notice that our rate is just slightly slower than the standard rate \(n^{-\frac{\alpha }{2\alpha + 1}}\) which is proved in Giné and Nickl [54], Rivoirard and Rousseau [70] and Scricciolo [72], where \(\alpha \) is characterized by the fact that the random process \(\varXi \) belongs to \(\text {H}^{\alpha }(0,1)\) almost surely. This slight discrepancy makes sense, since our reference norm (i.e., the Sobolev norm of \(\text {H}^1_{*}\)) is larger than any \(\text {L}^p\) norm, for any \(p \in [1,+\infty ]\). To the best of our knowledge, our rate does not admit a fair comparison with any other known rate of consistency, either Bayesian or classical, because of the different choice of the loss function. The only fair comparison could be made with the rates obtained in Sriperumbudur et al. [76], which nonetheless refer to specific classical estimators (see, in particular, Theorem 7, point (ii) therein).
Since these classical rates are slower than \(n^{-1/3}\), we notice, in support of the optimality of our approach, that our rate is: (i) arbitrarily close to the optimal (parametric) rate \(n^{-1/2}\) as \(a \rightarrow +\infty \); (ii) faster than \(n^{-1/3}\) as soon as \(a>9\), a condition which is surely met in the framework presented in Proposition 10, where \(a = 15 + \delta \). Hence, a Bayesian estimator that attains our PCR as its rate of consistency performs better than the minimum-distance estimator proposed in Sriperumbudur et al. [76].
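For the reader's convenience, we sketch how the exponent \(-\frac{a-1}{2(a+3)}\) arises from the bound in Proposition 10, under the simplifying choices \(m = \theta _0\) (so that \(\omega _k = 0\)) and \(\lambda _k \asymp k^{-(1+a)}\), \(\gamma _k^{*} = \eta _k \asymp k^{-2}\); the balance \(n \lambda _k \eta _k \asymp 1\) occurs at \(k_n \asymp n^{1/(a+3)}\), and then

```latex
% Sketch of the balancing argument (our own reconstruction of the asymptotics).
\sqrt{\sum_{k \ge 1} \frac{\lambda_k}{n\lambda_k\gamma_k^{*} + 1}}
  \asymp \sqrt{\frac{k_n^{3}}{n} + k_n^{-a}}
  \asymp n^{-\frac{a}{2(a+3)}},
\qquad
\sqrt{n}\,\max_{k \in \mathbb{N}} \frac{\lambda_k}{n\lambda_k\eta_k + 1}
  \asymp \sqrt{n}\, \lambda_{k_n}
  \asymp n^{\frac{1}{2} - \frac{1+a}{a+3}}
  = n^{-\frac{a-1}{2(a+3)}}.
```

Since \(\frac{a-1}{2(a+3)} < \frac{a}{2(a+3)}\), the second contribution dominates, giving the announced rate.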

4.5 Infinite-dimensional linear regression

Consider a statistical model that arises from the popular linear regression setting. The observed data are the collection of pairs \((u_1,v_1), \dots , (u_n, v_n)\), such that: i) the \(u_{i}\)’s vary in an interval \([a,b] \subset \mathbb {R}\), and are modeled as i.i.d. random variables, say \(U_1, \dots , U_n\), with a known distribution, say \(\varpi (\text {d}u) = h(u)\text {d}u\), on \(([a,b], \mathscr {B}([a,b]))\); ii) the \(v_{i}\)’s vary in \(\mathbb {R}\), and are modeled as i.i.d. random variables \(V_1, \dots , V_n\). The \(V_i\)’s are stochastically dependent on the \(U_i\)’s according to the relation

$$\begin{aligned} V_i = \theta (U_i) + E_i, \qquad \qquad i=1, \dots , n, \end{aligned}$$
(77)

where \(E_1, \dots , E_n\) are i.i.d. random variables with Normal \({\mathcal {N}}(0,\sigma ^2)\) distribution, while \(\theta : [a,b] \rightarrow \mathbb {R}\) is an unknown continuous function. Assuming for simplicity that \(\sigma ^2 > 0\) is known, the statistical model is characterized by probability densities \(f(\cdot |\theta )\) on \([a,b]\times \mathbb {R}\), with respect to the Lebesgue measure, given by

$$\begin{aligned} f(x|\theta ) = f((u,v)|\theta ) = \frac{1}{\sqrt{2\pi \sigma ^2}} \exp \left\{ - \frac{[\theta (u) - v]^2}{2\sigma ^2}\right\} h(u) \ . \end{aligned}$$
(78)
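
Before proceeding, it may help to see the data-generating mechanism (77)–(78) in simulation. The sketch below uses purely illustrative choices, namely \([a,b] = [0,1]\), \(h\) the uniform density, \(\theta (u) = \sin (2\pi u)\) and \(\sigma = 0.5\); none of these are prescribed by the model.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(theta, n, sigma=0.5, a=0.0, b=1.0):
    """Draw n pairs (U_i, V_i) from model (78) with h = Uniform(a, b)."""
    u = rng.uniform(a, b, size=n)              # U_i ~ h(u) du
    v = theta(u) + sigma * rng.normal(size=n)  # V_i = theta(U_i) + E_i
    return u, v

theta = lambda u: np.sin(2 * np.pi * u)        # hypothetical regression function
u, v = simulate(theta, n=5000)

# The residuals V_i - theta(U_i) are i.i.d. N(0, sigma^2) by construction.
residuals = v - theta(u)
assert abs(residuals.mean()) < 0.05
assert abs(residuals.std() - 0.5) < 0.05
```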

The space \(\varTheta \) is chosen, as in the previous section, as a Sobolev space \(H^s(a,b)\) with \(s > 1/2\), which is continuously embedded in \(C^0[a,b]\). Whence, upon fixing \(\theta _0 \in \varTheta \),

$$\begin{aligned} \mu _0(\text {d}u \text {d}v) = \frac{1}{\sqrt{2\pi \sigma ^2}} \exp \left\{ - \frac{[\theta _0(u) - v]^2}{2\sigma ^2}\right\} h(u) \text {d}u \text {d}v\ . \end{aligned}$$

On the other hand, from the Bayesian point of view, upon fixing a prior distribution \(\pi \) on \((\varTheta , \mathscr {T})\) and resorting to the Bayes formula, the posterior takes on the form

$$\begin{aligned} \pi _n(\text {d}\theta |x_1, \dots , x_n)&= \pi _n(\text {d}\theta |(u_1, v_1), \dots , (u_n, v_n)) \\&= \frac{\exp \left\{ - \frac{1}{2\sigma ^2} \sum _{i=1}^n [\theta (u_i) - v_i]^2\right\} \pi (\text {d}\theta )}{\int _{\varTheta } \exp \left\{ - \frac{1}{2\sigma ^2} \sum _{i=1}^n [\tau (u_i) - v_i]^2\right\} \pi (\text {d}\tau )}\ . \end{aligned}$$
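
As an illustration of the Bayes formula above, one can replace \(\pi \) by a discrete prior supported on finitely many candidate regression functions and evaluate the posterior weights in closed form. This is only a hypothetical finite toy version of the infinite-dimensional posterior; the candidate functions, sample size and noise level below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5

# Synthetic data from the (hypothetical) true regression function theta_0.
theta0 = lambda u: np.sin(2 * np.pi * u)
u = rng.uniform(size=200)
v = theta0(u) + sigma * rng.normal(size=200)

# Uniform discrete prior over three candidates; posterior weights are
# proportional to exp{ -sum_i [theta(u_i) - v_i]^2 / (2 sigma^2) }.
candidates = [theta0, lambda u: np.cos(2 * np.pi * u), lambda u: 0.0 * u]
log_lik = np.array([-np.sum((th(u) - v) ** 2) / (2 * sigma ** 2) for th in candidates])
weights = np.exp(log_lik - log_lik.max())
weights /= weights.sum()

assert abs(weights.sum() - 1.0) < 1e-12   # a proper probability vector
assert weights.argmax() == 0              # the truth dominates at n = 200
```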

Whence, for any probability measure \(\gamma \in {\mathcal {P}}_2([a,b]\times \mathbb {R})\), we can write the following

$$\begin{aligned} \pi _n^{*}(\text {d}\theta |\gamma ) = \frac{\exp \left\{ - \frac{n}{2\sigma ^2} \int _{[a,b]\times \mathbb {R}} [\theta (u) - v]^2\gamma (\text {d}u \text {d}v) \right\} \pi (\text {d}\theta )}{\int _{\varTheta } \exp \left\{ - \frac{n}{2\sigma ^2} \int _{[a,b]\times \mathbb {R}} [\tau (u) - v]^2\gamma (\text {d}u \text {d}v)\right\} \pi (\text {d}\tau )}\ . \end{aligned}$$
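
In particular, the choice \(\gamma = \gamma _n := \frac{1}{n}\sum _{i=1}^n \delta _{(u_i, v_i)}\), i.e. the empirical measure of the data, recovers exactly the posterior displayed above, since

$$\begin{aligned} n \int _{[a,b]\times \mathbb {R}} [\theta (u) - v]^2\, \gamma _n(\text {d}u \text {d}v) = \sum _{i=1}^n [\theta (u_i) - v_i]^2\ . \end{aligned}$$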

Lastly, as for the Kullback–Leibler divergence, a straightforward computation yields

$$\begin{aligned} \textsf {K}(\theta |\theta _0) = \frac{1}{2\sigma ^2} \int _a^b [\theta (u) - \theta _0(u)]^2 h(u) \text {d}u\ . \end{aligned}$$
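
As a numerical sanity check of this formula, take the illustrative choices \([a,b] = [0,1]\), \(h \equiv 1\), \(\sigma = 1\), \(\theta (u) = u\) and \(\theta _0 \equiv 0\) (none of which come from the paper): the formula predicts \(\textsf {K}(\theta |\theta _0) = \frac{1}{2}\int _0^1 u^2\, \text {d}u = 1/6\), and a brute-force evaluation of the defining double integral agrees.

```python
import numpy as np

sigma = 1.0
theta  = lambda u: u         # hypothetical parameter
theta0 = lambda u: 0.0 * u   # hypothetical true parameter

# Plain Riemann sums on fine grids; h = Uniform(0, 1), v truncated to [-8, 8].
u = np.linspace(0.0, 1.0, 1001)
v = np.linspace(-8.0, 8.0, 2001)
du, dv = u[1] - u[0], v[1] - v[0]
U, V = np.meshgrid(u, v, indexing="ij")

f0 = np.exp(-((theta0(U) - V) ** 2) / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
log_ratio = ((theta(U) - V) ** 2 - (theta0(U) - V) ** 2) / (2 * sigma ** 2)

kl_numeric = float(np.sum(f0 * log_ratio) * du * dv)    # K(theta | theta_0) by definition
kl_formula = float(np.sum((theta(u) - theta0(u)) ** 2) * du / (2 * sigma ** 2))

assert abs(kl_numeric - 1.0 / 6.0) < 1e-3
assert abs(kl_numeric - kl_formula) < 1e-3
```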

This statistical model is particularly versatile with respect to our theory, because it can be studied either as an infinite-dimensional exponential family or by means of Theorem 2 and Corollary 1. For example, to see that we can use the theory of infinite-dimensional exponential families, it suffices to consider the identities

$$\begin{aligned} (\theta (u) - v)^2&= \int _a^b\int _{\mathbb {R}} (\theta (x)-y)^2 \delta _{(u,v)}(\text {d}x\text {d}y) \\&= -\int _a^b\int _{\mathbb {R}} [\varDelta ((\theta (x)-y)^2)] {\mathcal {G}}(x,y;u,v) \text {d}x\text {d}y \end{aligned}$$

where \({\mathcal {G}}(x,y;u,v)\) stands for the Green function of the set \([a,b]\times \mathbb {R}\). If \(\theta \) varies in a sufficiently regular space, that is if s is sufficiently large, then \([\varDelta ((\theta (x)-y)^2)]\) is still a function, which can be set equal to \(g(\theta )\). On the other hand, \({\mathcal {G}}(x,y;u,v)\) represents the function \(\beta \) in the theory of exponential families.

As for the assumptions of Theorem 2, we can prove their validity if, for instance, h belongs to \(C^0[a,b] \cap C^2(a,b)\) and is bounded away from zero. The assumption \(\int _{\mathbb {X}} |x|^q \mu _0(\text {d}x) < +\infty \) is valid for any \(q>0\), and (58) holds if we assume, for instance, a Gaussian prior \(\pi \). As for Corollary 1, we can check the validity of (57) as a consequence of the Gagliardo–Nirenberg interpolation inequality ([65, Section 12.3]). Since \(\textsf {K}(\theta |\theta _0)\) is equivalent to the squared \(L^2\)-norm, that inequality gives

$$\begin{aligned} \Vert \theta - \theta _0\Vert _{H^s(a,b)} \le \Vert \theta - \theta _0\Vert _{L^2(a,b)}^{1-\alpha } \Vert \theta - \theta _0\Vert _{H^{s'}(a,b)}^{\alpha } \end{aligned}$$

for any \(s' > s\), where \(\alpha := s/s'\). Therefore, choosing a prior distribution that is supported on \(H^{s'}(a,b)\), such as a Gaussian-type prior, and recalling that \(H^{s'}(a,b)\) is dense in \(H^s(a,b)\), it is enough to consider the neighborhood \(\Vert \theta - \theta _0\Vert _{H^{s'}(a,b)} \le 1\) of \(\theta _0\) and check that the interpolation inequality immediately yields (57). Whence, \(\beta = 2/(1-\alpha )\). These considerations show that Proposition 2 is applicable, provided that the prior is Gaussian with a covariance operator that satisfies (31). In any case, both methods end up highlighting the main terms that figure on the right-hand sides of (25) and (56).
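
The interpolation inequality can also be verified directly on Fourier coefficients, using \(\Vert f\Vert _{H^s}^2 = \sum _k (1+k^2)^s |c_k|^2\) as an equivalent norm; the sketch below checks it for the arbitrary choice \(s = 1\), \(s' = 2\) (so \(\alpha = 1/2\)) on randomly generated coefficients, for which it holds deterministically by Hölder's inequality.

```python
import numpy as np

rng = np.random.default_rng(2)

def sobolev_norm(c, s):
    """H^s norm of a function with Fourier coefficients c_k, k = 0, ..., K."""
    k = np.arange(len(c))
    return float(np.sqrt(np.sum((1.0 + k ** 2) ** s * np.abs(c) ** 2)))

s, s_prime = 1.0, 2.0
alpha = s / s_prime                                   # alpha = s / s'

# Random decaying coefficients standing in for theta - theta_0.
c = rng.normal(size=200) / (1.0 + np.arange(200)) ** 2

lhs = sobolev_norm(c, s)
rhs = sobolev_norm(c, 0.0) ** (1 - alpha) * sobolev_norm(c, s_prime) ** alpha
assert lhs <= rhs * (1 + 1e-12)   # Hölder guarantees this for every c
```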

Now, for the sake of brevity, we confine ourselves to the application of Theorem 1. Apropos of the first term on the right-hand side of (25), we notice that \(\text {I}(\theta _0)\) is independent of \(\theta _0\), and is equivalent to the identity operator. In view of a straightforward coercivity argument, we can apply the results in Section 3.3 of Albeverio and Steblovskaya [2] to obtain that the first term on the right-hand side of (25) is asymptotic to \(\frac{1}{\sqrt{n}}\), as \(n \rightarrow +\infty \). Then, the second and the third terms are exponentially small, and hence asymptotically negligible. To complete the treatment, we are left to discuss the asymptotic behavior of the constant \(L_0^{(n)}\). Apropos of the Poincaré constant \([{\mathfrak {C}}_2(\pi _n^{*}(\cdot |\gamma ))]^2\), it is straightforward to notice that the mapping \(\theta \mapsto \int _{[a,b]\times \mathbb {R}} [\theta (u) - v]^2\gamma (\text {d}u \text {d}v)\) is twice Fréchet-differentiable with respect to \(\theta \). Therefore, the Bakry–Émery criterion applies and, if \(\pi \) is Gaussian, the results in Da Prato [31, Chapters 10–11] show that \([{\mathfrak {C}}_2(\pi _n^{*}(\cdot |\gamma ))]^2 = O(1/n)\), as \(n \rightarrow \infty \). As for the term \({\mathfrak {D}}_{\theta } \frac{\nabla _x f(x|\theta )}{f(x|\theta )}\), we first notice that

$$\begin{aligned} \frac{\nabla _x f(x|\theta )}{f(x|\theta )} = -\frac{1}{\sigma ^2} \left( \theta '(u)(\theta (u) - v), -(\theta (u) - v) \right) \ . \end{aligned}$$

(Here, the \(\theta \)-independent contribution \((h'(u)/h(u), 0)\) to the score has been dropped, since it is annihilated by the Fréchet derivative \({\mathfrak {D}}_{\theta }\).) This is the sum of the terms \(\sigma ^{-2}\left( v\theta '(u), \theta (u) - v\right) \) and \(\frac{1}{\sigma ^2} \left( -\theta '(u)\theta (u), 0 \right) \), where the former vector is an affine functional of \(\theta \). Thus, the Fréchet derivative of the first term with respect to \(\theta \) is given by the vector \(\sigma ^{-2}(vS_u, T_u)\), where \(T_u\) (\(S_u\), respectively) stands for the Riesz representative of the functional \(\delta _u\) (\(-\delta '_u\), respectively). It is useful to observe that such a derivative, being independent of \(\theta \), does not contribute asymptotically to the expression of the double integral, as we have already discussed in the previous section. Finally, the Fréchet derivative of the second term is \(-\sigma ^{-2}\left( T_u\theta '(u) + S_u\theta (u), 0 \right) \). At this stage, we can see that the study of the double integral can be reduced, through the use of Sobolev inequalities, to the study of the corresponding posterior moments. To conclude, we state a proposition that summarizes the above considerations.

Proposition 11

In connection with the model (77), let \(\mathbb {X}= [a,b] \times \mathbb {R}\) and \(\varTheta = \text {H}^s(a,b)\) for some \(s \ge 3\). Suppose that h is a smooth density on [a, b], satisfying \(1/c \le h(x) \le c\) for any \(x \in [a,b]\) and some \(c>0\). Let \(\theta _0 \in \varTheta \) be fixed. Assume that \(\pi = {\mathcal {N}}(m,Q)\) with any \(m \in \varTheta \) and Q a non-degenerate trace-class covariance operator. Then, it holds that

$$\begin{aligned} \epsilon _n = O\big (n^{-1/2}\big ) \end{aligned}$$

as \(n \rightarrow +\infty \), which represents the optimal rate.

5 Discussion

We conclude our work by discussing some directions for future research. The flexibility of the Wasserstein distance is promising when considering non-regular Bayesian statistical models, even in a finite-dimensional setting. One may consider the problem of dealing with dominated statistical models that have moving supports, i.e. supports that depend on \(\theta \). The prototypical example is the family of Pareto distributions, which is characterized by a density function

$$\begin{aligned} f(x\,|\,\alpha , x_0) = \frac{\alpha x_0^{\alpha }}{x^{1+\alpha }} \mathbbm {1}\{x \ge x_0\}, \end{aligned}$$

where \(\theta = (\alpha , x_0) \in (0,+\infty )^2\). Under this model, by rewriting the posterior distribution to obtain the representation (7), we observe that the empirical distribution can be replaced by the minimum of the observations, which is the maximum likelihood estimator of \(x_0\). In doing so, we expect to parallel the proof of Theorem 2, with the minimum playing the role of the sufficient statistic instead of the empirical measure. In particular, we expect that the term \(\varepsilon _{n,p}(\mathbb {X}, \mu _0)\) should be replaced by other rates typically involved in limit theorems for order statistics. The theoretical framework for such an extension of our results is developed in the work of Dolera and Mainini [38], where it is shown how the continuity equation yields a specific boundary-value problem of Neumann type.
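
To see concretely why the minimum takes over the role of the empirical measure, note that the Pareto likelihood depends on the data only through \(n\), \(\min _i x_i\) and \(\sum _i \log x_i\). The sketch below (an illustrative numerical check, with arbitrary parameter values) confirms that, as a function of \(x_0\), the likelihood increases up to \(\min _i x_i\) and vanishes beyond it, so the sample minimum is indeed the maximum likelihood estimator of \(x_0\).

```python
import numpy as np

rng = np.random.default_rng(3)

def log_lik(x, alpha, x0):
    """Pareto log-likelihood; -inf as soon as one observation falls below x0."""
    x = np.asarray(x)
    if np.any(x < x0):
        return -np.inf
    n = len(x)
    # Depends on the data only through n, min(x) (the indicator) and sum(log x).
    return n * np.log(alpha) + n * alpha * np.log(x0) - (1 + alpha) * np.log(x).sum()

alpha_true, x0_true = 2.0, 1.5                    # arbitrary illustrative values
x = x0_true * (1.0 - rng.uniform(size=500)) ** (-1.0 / alpha_true)  # inverse-CDF draws

m = float(x.min())
assert log_lik(x, alpha_true, m) > log_lik(x, alpha_true, 0.9 * m)  # increasing in x0
assert log_lik(x, alpha_true, 1.01 * m) == -np.inf                  # support constraint
```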

As for the infinite-dimensional setting covered by Theorems 1 and 2, an interesting development of our approach to PCRs is represented by the possibility of finding, for general statistical models, explicit sufficient statistics belonging to Banach spaces of functions. To be more precise, we hint at a constructive version of the well-known Fisher–Neyman factorization lemma. This result would pave the way for a suitable rewriting of the statistical model, one that allows for the use of our approach. By way of example, one may consider the identity \(\log f(x | \theta ) = \int _{\mathbb {X}} \log f(y | \theta ) \delta _x(\text {d}y)\), and exploit an integration-by-parts formula to obtain an identity like (19), with respect to a suitable measure \(\lambda \) on \((\mathbb {X}, {\mathscr {X}})\). Such a procedure is at the basis of the development of our approach to PCRs in the context of popular nonparametric models not considered in this paper, such as the Dirichlet process mixture model ([50, Chapter 5]), random histograms ([50, Example 5.11]) and Pólya trees ([50, Section 3.7]).

Another promising line of research consists in extending Theorem 2 to metric measure spaces. The theoretical ground for this development may be found in the seminal works of Gigli [51], Gigli and Ohta [52], Ambrosio et al. [6], Otto and Villani [67] and von Renesse and Sturm [83]. In such a context, the treatment of the relative entropy functional in the Wasserstein space is of interest. It is well-known that the Hessian of the relative entropy functional, i.e. the Kullback–Leibler divergence, generalizes by using techniques from infinite-dimensional Riemannian geometry [67]. From the statistical side, the possibility of choosing a parameter space that coincides with a space of measures allows one to reconsider, from a different point of view, popular Bayesian statistical models such as Dirichlet process mixture models, which are defined as

$$\begin{aligned} f(x\,|\,\mathfrak {p}) = \int \tau (x\,|\,y) \mathfrak {p}(\text {d}y) \end{aligned}$$

where \(\tau \) is a kernel parameterized by y, and \(\mathfrak {p}\) is a random probability measure with a Dirichlet process prior [41]. The goal should be that of considering PCRs relative to Wasserstein neighborhoods of a given true distribution, say \(\mathfrak {p}_0\). This approach is again different from the nonparametric framework considered in Berthet and Niles-Weed [16], and seems still unexplored.
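
For concreteness, a single draw of the random density \(f(\cdot \,|\,\mathfrak {p})\) under a Dirichlet process prior can be sketched through Sethuraman's stick-breaking representation, truncated at finitely many atoms. The Gaussian kernel, the concentration parameter and the base measure below are arbitrary illustrative choices, not prescribed by the discussion above.

```python
import numpy as np

rng = np.random.default_rng(4)

def truncated_dp(n_atoms=200, M=1.0):
    """Truncated stick-breaking draw of p = sum_j w_j delta_{y_j}."""
    beta = rng.beta(1.0, M, size=n_atoms)                      # stick proportions
    w = beta * np.concatenate(([1.0], np.cumprod(1.0 - beta)[:-1]))
    y = rng.normal(size=n_atoms)                               # atoms from the base measure
    return w, y

w, y = truncated_dp()
# Telescoping identity: sum_j w_j = 1 - prod_j (1 - beta_j) <= 1.
assert np.all(w >= 0.0) and w.sum() <= 1.0 + 1e-12

# One realization of the mixture density f(x | p) with Gaussian kernel tau.
tau_sd = 0.1
x = np.linspace(-6.0, 6.0, 2001)
f = (w[:, None] * np.exp(-(x[None, :] - y[:, None]) ** 2 / (2 * tau_sd ** 2))
     / np.sqrt(2 * np.pi * tau_sd ** 2)).sum(axis=0)

# The density integrates (numerically) to the retained stick mass.
integral = float(np.sum((f[:-1] + f[1:]) * 0.5 * (x[1] - x[0])))
assert abs(integral - w.sum()) < 1e-6
```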