1 Introduction

In this paper we shall deal with the problem of estimating an unknown density \(s\) with respect to the measure \(\mu \) on the measurable space \((\mathcal{X }, \mathcal{W })\) from an i.i.d. sample \(\varvec{X}=(X_1,\ldots ,X_n)\) of random variables \(X_i\in \mathcal{X }\) with distribution \(P_s=s\cdot \mu \). We shall measure the quality of an estimator \(\widehat{s}(X_1,\ldots ,X_n)\) of \(s\) by its quadratic risk \(\mathbb{E }_s\left[ d^2\left( \,\widehat{s},s\right) \right] \) for a suitable distance \(d\), where \(\mathbb{E }_s\) denotes the expectation when \(s\) obtains. We shall denote by \(\Vert \cdot \Vert _q\) the norm in \(\mathbb L _q(\mu )\), omitting the subscript when \(q=2\) for simplicity and by \(d_2\) the distance in \(\mathbb L _2(\mu ):\, d_2(t,u)=\Vert t-u\Vert \). For \(1\le q\le +\infty \) we consider the set \(\overline{\mathbb{L }}_q\) of those densities with respect to \(\mu \) that belong to \(\mathbb L _q(\mu )\), i.e.

$$\begin{aligned} \overline{\mathbb{L }}_q=\left\{ t\in \mathbb L _q(\mu )\,\left| \, t\ge 0 \;\text{ and }\; \int t\,d\mu =1\right. \right\} . \end{aligned}$$

We shall also make use of the Hellinger distance \(h\) and the variation distance \(v\) given by

$$\begin{aligned} h^2(t,u)=\frac{1}{2}\int \left( \sqrt{t}-\sqrt{u}\right) ^2\,d\mu \qquad \text{ and }\qquad v(t,u)=\frac{1}{2}\int |t-u|\,d\mu . \end{aligned}$$
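
As a purely illustrative aside (not part of the original development), these two distances are easy to approximate numerically when \(\mu \) is the Lebesgue measure on \([0,1]\); the short sketch below uses a Riemann sum on a grid and compares the density \(2x\) with the uniform density. The grid size and the two densities are arbitrary choices for the example.

```python
import numpy as np

def hellinger_sq(t, u, dx):
    # h^2(t, u) = (1/2) * integral of (sqrt(t) - sqrt(u))^2 d(mu)
    return 0.5 * np.sum((np.sqrt(t) - np.sqrt(u)) ** 2) * dx

def variation(t, u, dx):
    # v(t, u) = (1/2) * integral of |t - u| d(mu)
    return 0.5 * np.sum(np.abs(t - u)) * dx

x = np.linspace(0.0, 1.0, 100_001)
dx = x[1] - x[0]
t = 2.0 * x                  # the density 2x on [0, 1]
u = np.ones_like(x)          # the uniform density
print(hellinger_sq(t, u, dx), variation(t, u, dx))   # both values lie in [0, 1]
```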

When \(s\) is assumed to belong to the metric space \((M,d)\), a common method of estimation that can be called model-based estimation chooses a subset \(\overline{S}\) of \(M\) and an estimation method which results in an estimator that automatically belongs to \(\overline{S}\). Of this type is the maximum likelihood estimator over \(\overline{S}\), for instance. When the distance \(d\) is either \(h\) or \(v\), it is possible to get very general risk bounds for some special estimators based on finite-dimensional models. Indeed, if we choose for \(\overline{S}\) a model with a metric dimension (to be defined in Sect. 1.2 below and generalizing to subsets of metric spaces the usual dimension of a finite-dimensional linear space) bounded by \(D\), one can design an estimator \(\widetilde{s}\) with values in \(\overline{S}\) satisfying, whatever the true unknown density \(s\),

$$\begin{aligned} \mathbb{E }_s\left[ d^2\left( \,\widetilde{s},s\right) \right] \le C\left[ \inf _{t\in \overline{S}}d^2(s,t)+n^{-1}D\right] , \end{aligned}$$
(1.1)

where \(C\) denotes a universal constant (independent of \(n,\, s\) and \(\overline{S}\)). When \(s\) belongs to \(\overline{S}\), (1.1) provides the following upper bound for the minimax risk over \(\overline{S}\):

$$\begin{aligned} \inf _{\widehat{s}}\sup _{s\in \overline{S}}\mathbb{E }_s \left[ d^2\left( \,\widehat{s},s\right) \right] \le Cn^{-1}D, \end{aligned}$$
(1.2)

where the infimum is over all possible estimators \(\widehat{s}\), a result which actually dates back to Le Cam [23] for the Hellinger distance.

Nevertheless, the square of the \(\mathbb L _2\)-distance \(d_2\) has been much more popular in the past, as a loss function for density estimation, than either the Hellinger or variation distances, mainly because of its simplicity due to the classical “squared bias plus variance” decomposition of the risk. But, although hundreds of papers have been devoted to the derivation of risk bounds for various specific estimators, we do not know of any universal bound for the risk similar to (1.1) when \(d=d_2\), universal meaning here that it is valid for any model \(\overline{S}\) with a metric dimension bounded by \(D\) and any density \(s\in \mathbb L _2(\mu )\); only partial results, valid in some special cases, are available. This is actually not surprising for the following reason. While \(h\) and \(v\) are distances between probabilities, so that \(h(s,t)=h(P_s,P_t)\) is independent of the choice of the underlying dominating measure \(\mu \), this is definitely not the case for the \(\mathbb L _2\)-distance between densities, which depends on the choice of \(\mu \) and is not a distance between probabilities. Given a probability \(P\) and a dominating measure \(\mu \), even whether \(dP/d\mu \) belongs to \(\mathbb L _2(\mu )\) or not depends on \(\mu \). Further remarks on this subject can be found in [16, 17].

It is indeed the distortion between the Hellinger and \(\mathbb L _2\)-distances that explains the problems that may occur when we use the \(\mathbb L _2\)-risk as can be shown by the following elementary computations. When \(t\) and \(u\) are bounded by \(L\),

$$\begin{aligned} \Vert t-u\Vert ^2&= \int \left( \sqrt{t}-\sqrt{u}\right) ^2\left( \sqrt{t} +\sqrt{u}\right) ^2d\mu \le 4L \int \left( \sqrt{t}-\sqrt{u}\right) ^2d\mu \nonumber \\&= 8Lh^2(t,u). \end{aligned}$$
(1.3)

Although this is only an upper bound, there are situations where it is rather sharp as in the following case. Let \(\mu \) be the Lebesgue measure on \([0,1],\, t=L{1\!\!1}_{[0,a)}+(1-aL)(1-a)^{-1}{1\!\!1}_{[a,1]}\) with \(L>1\) and \(0<a<L^{-1}\), and \(u(x)=t(1-x)\) for \(0\le x\le 1\). Then \(\Vert t\Vert _\infty =\Vert u\Vert _\infty =L\) and

$$\begin{aligned} \Vert t-u\Vert ^2=2a\left( L-\frac{1-aL}{1-a}\right) ^2=2a\frac{(L-1)^2}{(1-a)^2}, \end{aligned}$$

while

$$\begin{aligned} h^2(t,u)&= a\left( \sqrt{L}-\sqrt{\frac{1-aL}{1-a}}\right) ^2 =\frac{a}{1-a}\left( \frac{L-1}{\sqrt{L(1-a)} +\sqrt{1-aL}}\right) ^2\\&\le \frac{a(L-1)^2}{L(1-a)^2}. \end{aligned}$$

Therefore \(\Vert t-u\Vert ^2\ge 2Lh^2(t,u)\). If \(a\) is chosen in such a way that \(h^2(t,u)=(4n)^{-1}\), it follows from Le Cam [23] (see Proposition 5 below) that one cannot test between \(t\) and \(u\) with \(n\) i.i.d. observations and small errors, i.e. one cannot distinguish between \(t\) and \(u\) with only \(n\) observations. As a consequence the minimax risk over the set \(\{t,u\}\) will be of order \(n^{-1}\) when the loss function is the squared Hellinger distance while it will be of order \(L n^{-1}\) if the loss function is the squared \(\mathbb L _2\)-distance.
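
As a small numerical aside (an illustration only, with arbitrarily chosen values of \(L\) and \(a\)), one can check the closed-form expressions above and the resulting ratio \(\Vert t-u\Vert ^2/h^2(t,u)\ge 2L\).

```python
import numpy as np

def l2_sq(L, a):
    # ||t - u||^2 = 2a (L - 1)^2 / (1 - a)^2
    return 2.0 * a * (L - 1.0) ** 2 / (1.0 - a) ** 2

def hellinger_sq(L, a):
    # h^2(t, u) = a (sqrt(L) - sqrt((1 - aL) / (1 - a)))^2
    return a * (np.sqrt(L) - np.sqrt((1.0 - a * L) / (1.0 - a))) ** 2

L, a = 50.0, 1e-4                        # any L > 1 and 0 < a < 1/L will do
print(l2_sq(L, a) / hellinger_sq(L, a))  # always at least 2 * L
```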

1.1 The example of projection estimators

1.1.1 The special case of histograms

A simple illustration of the difference that occurs when one computes risk bounds using the \(\mathbb L _2\)-distance rather than the Hellinger distance is provided by the case of histograms. Assuming that \(\mu \) is a finite measure and given a finite partition \(\mathcal{I }=\{I_1,\ldots ,I_k\}\) of \(\mathcal{X }\) with \(\mu (I_j)=l_j>0\) for \(1\le j\le k\), the histogram \(\widehat{s}_{\mathcal{I }}\) based on this partition is defined by

$$\begin{aligned} \widehat{s}_{\mathcal{I }}(X_1,\ldots , X_n)=\frac{1}{n}\sum _{j=1}^k\frac{N_j}{l_j}{1\!\!1}_{I_j} \quad \text{ with }\quad N_j=\sum _{i=1}^n{1\!\!1}_{I_j}(X_i). \end{aligned}$$
(1.4)
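
As an illustrative aside, (1.4) is straightforward to implement; the sketch below (with a regular partition and a Beta sample chosen solely for the example, neither of which comes from the paper) returns the histogram as a function on \([0,1]\).

```python
import numpy as np

def histogram_estimator(X, breaks):
    """Histogram (1.4) on [0, 1] for the partition with breakpoints `breaks`."""
    n = len(X)
    counts, edges = np.histogram(X, bins=breaks)      # N_j for each cell I_j
    lengths = np.diff(edges)                          # l_j = mu(I_j)
    heights = counts / (n * lengths)                  # N_j / (n l_j)

    def s_hat(x):
        j = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(heights) - 1)
        return heights[j]
    return s_hat

rng = np.random.default_rng(0)
X = rng.beta(2.0, 5.0, size=1000)                                  # illustrative sample
s_hat = histogram_estimator(X, breaks=np.linspace(0.0, 1.0, 11))   # regular, k = 10
```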

Let

$$\begin{aligned} p_j=\int _{I_j}s\,d\mu ,\quad \overline{s}_{\mathcal{I }}=\sum _{j=1}^k\frac{p_j}{l_j} {1\!\!1}_{I_j},\qquad \overline{S}_{\mathcal{I }}= \left\{ \left. \sum _{j=1}^k\beta _j {1\!\!1}_{I_j}\,\right| \beta _j\in \mathbb R \text{ for } 1\le j\le k\right\} \end{aligned}$$

and \(\overline{S}_{\mathcal{I }}^0\) be the convex set \(\overline{S}_{\mathcal{I }}\cap \overline{\mathbb{L }}_1\). If \(s\in \mathbb L _2(\mu )\), then \(\overline{s}_{\mathcal{I }}\in \overline{S}_{\mathcal{I }}^0\) is the orthogonal projection of \(s\) onto the \(k\)-dimensional linear space \(\overline{S}_{\mathcal{I }}\) spanned by the functions \({1\!\!1}_{I_j}\) and onto \(\overline{S}_{\mathcal{I }}^0\) as well. It follows that \(\widehat{s}_{\mathcal{I }}\) is an estimator based on the model \(\overline{S}_{\mathcal{I }}^0\).

Choosing \(d_2^2\) as our loss function, we derive that

$$\begin{aligned} \mathbb{E }_s\left[ \Vert \,\widehat{s}_{\mathcal{I }}-s\Vert ^2\right] =\Vert \overline{s}_{\mathcal{I }}-s\Vert ^2+ \frac{1}{n}\sum _{j=1}^k\frac{p_j(1-p_j)}{l_j}. \end{aligned}$$
(1.5)

The simplest and most common situation is the one of regular histograms for which all \(l_j\) are equal to \(k^{-1}\). In this case we derive from (1.5) and a convexity argument that

$$\begin{aligned} \mathbb{E }_s\left[ \Vert \,\widehat{s}_{\mathcal{I }}-s\Vert ^2\right] \le \Vert \overline{s}_{\mathcal{I }}-s\Vert ^2+n^{-1}(k-1). \end{aligned}$$
(1.6)

We then get a risk bound which is the sum of the square of the bias and \(n^{-1}\) times the dimension of the model, i.e. the number of parameters (\(p_1,\ldots ,p_{k-1}\)) which are needed to describe an element of the model \(\overline{S}_{\mathcal{I }}^0\). It can therefore be viewed as an analogue of (1.1).
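
As a quick sanity check (illustration only, with arbitrary cell probabilities), one may verify (1.5) by Monte Carlo in the unbiased case where \(s\) is itself piecewise constant on the partition, so that the risk reduces to the variance term \(n^{-1}\sum _j p_j(1-p_j)/l_j\).

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, reps = 10, 200, 5000
edges = np.linspace(0.0, 1.0, k + 1)
l = np.diff(edges)                           # l_j = 1/k (regular partition)
p = rng.dirichlet(np.ones(k))                # arbitrary cell probabilities p_j
heights = p / l                              # s = sum_j (p_j / l_j) 1_{I_j}, so no bias

risks = []
for _ in range(reps):
    cells = rng.choice(k, size=n, p=p)       # the cell of each observation X_i
    counts = np.bincount(cells, minlength=k)                      # N_j
    risks.append(np.sum((counts / (n * l) - heights) ** 2 * l))   # ||s_hat - s||^2
print(np.mean(risks), np.sum(p * (1 - p) / l) / n)                # should agree
```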

In the general situation of unequal values of the \(l_j\) the previous elementary argument does not work but, if \(s\in \mathbb L _\infty (\mu )\) with norm \(\Vert {s}\Vert _\infty \), the quadratic risk of \(\widehat{s}_{\mathcal{I }}\) can alternatively be bounded by

$$\begin{aligned} \mathbb{E }_s\left[ \Vert \,\widehat{s}_{\mathcal{I }}-s\Vert ^2\right] \le \Vert \overline{s}_{\mathcal{I }}-s\Vert ^2+n^{-1}(k-1)\Vert {s}\Vert _\infty , \end{aligned}$$
(1.7)

which is much worse than (1.6) when \(\Vert {s}\Vert _\infty \) is large. This bound may be far from sharp for a given \(s\) but it is essentially unimprovable if we want it to hold for arbitrary partitions \(\mathcal{I }\) with \(k\) elements and any \(s\in \overline{\mathbb{L }}_\infty \) as shown by the following example. Define the partition \(\mathcal{I }\) on \(\mathcal{X }=[0,1]\) by \(I_j=[(j-1)\alpha ,j\alpha )\) for \(1\le j<k\) and \(I_k=[(k-1)\alpha ,1]\) with \(0<\alpha <(k-1)^{-1}\). Set \(s=s_{\mathcal{I }}= [(k-1)\alpha ]^{-1}\left[ 1-{1\!\!1}_{I_k}\right] \). Then \(p_j=(k-1)^{-1}\) for \(1\le j<k,\, s=\overline{s}_{\mathcal{I }}\) (a case of no bias) and it follows from (1.5) that

$$\begin{aligned} \mathbb{E }_s\left[ \Vert \widehat{s}_{\mathcal{I }}-s\Vert ^2\right] =\frac{k-2}{(k-1)\alpha n} =\frac{(k-2)\Vert {s}\Vert _\infty }{n}. \end{aligned}$$
(1.8)

This shows that there is little room for improvement in (1.7) and that there are cases where the quadratic risk based on \(d_2\) does involve the \(\mathbb L _\infty \)-norm of \(s\). It also demonstrates that there is no hope of bounding the risk of a histogram \(\widehat{s}_{\mathcal{I }}\) based on an arbitrary partition \(\mathcal{I }\) with cardinality \(|\mathcal{I }|=k\) by an analogue of (1.6). Indeed, letting \(\alpha \) go to zero in (1.8) shows that

$$\begin{aligned} \sup _{\{\mathcal{I }\,|\,|\mathcal{I }|=k\}}\sup _{s\in \overline{S}_{\mathcal{I }}^0} \mathbb{E }_s\left[ \Vert \widehat{s}_{\mathcal{I }}-s\Vert ^2\right] =+\infty . \end{aligned}$$

If, instead, we use as our loss function the squared Hellinger distance \(h\) as we previously did we get an analogue of (1.6) and (1.1), namely

$$\begin{aligned} \mathbb{E }_s\left[ h^2(\widehat{s}_\mathcal{I },s)\right] \le h^2(s,\overline{s}_{\mathcal{I }})+(k-1)/(2n), \end{aligned}$$

whatever the partition \(\mathcal{I }\) of cardinality \(k\) and the density \(s\), as shown in [12]. A similar result holds if \(d=v\). In both cases, whatever the partition \(\mathcal{I }\), we can bound the risk by a universal constant times the sum of the squared bias and \(n^{-1}\) times the size of the partition. This is a bound of the form (1.1), since \(|\mathcal{I }|\) is the dimension of our model, i.e. of the linear space generated by the functions \({1\!\!1}_{I_j}\), to which \(\widehat{s}_\mathcal{I }\) belongs.

1.1.2 Projection estimators

More generally, instead of the linear space generated by the functions \({1\!\!1}_{I_j},\, 1\le j\le k\), we can take as a model for estimating \(s\) any \(k\)-dimensional linear subspace \(\overline{S}\) of \(\mathbb L _2(\mu )\). Given an orthonormal basis \((\varphi _1,\ldots ,\varphi _k)\) of \(\overline{S}\), the projection \(\overline{s}\) of \(s\) onto \(\overline{S}\) can be written \(\overline{s}=\sum _{j=1}^k\beta _j\varphi _j\). The estimation method of Cencov [13] consists in replacing each coefficient \(\beta _j=\int \varphi _js\,d\mu \) in this expansion by its empirical version \(\widehat{\beta }_j=n^{-1}\sum _{i=1}^n\varphi _j(X_i)\). This results in the so-called projection estimator \(\widehat{s}=\sum _{j=1}^k\widehat{\beta }_j\varphi _j\) (which in general is not a density) with risk

$$\begin{aligned} {\mathbb{E }}_s\left[ \Vert \widehat{s}-s\Vert ^2\right]&= \Vert \overline{s}-s\Vert ^2+n^{-1}\sum _{j=1}^k \mathop {\mathrm{Var}}\nolimits _s\left( \varphi _j(X_1)\right) \end{aligned}$$
(1.9)
$$\begin{aligned}&\le \Vert \overline{s}-s\Vert ^2+n^{-1}\int \left[ \sum _{j=1}^k\varphi _j^2(x)\right] s(x)\, d\mu (x) \end{aligned}$$
(1.10)
$$\begin{aligned}&\le \Vert \overline{s}-s\Vert ^2+kn^{-1}\min \left\{ k^{-1} \left\| \sum _{j=1}^k\varphi _j^2\right\| _\infty ; \Vert {s}\Vert _\infty \right\} . \end{aligned}$$
(1.11)

The histogram based on the partition \(\mathcal{I }\) is merely the projection estimator corresponding to \(\varphi _j=l_j^{-1/2}{1\!\!1}_{I_j}\). For regular histograms, \(l_j=k^{-1}\) and \(\left\| \sum _{j=1}^k\varphi _j^2\right\| _\infty =k\).
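
As an illustrative aside (the cosine basis \(\varphi _0=1,\, \varphi _j(x)=\sqrt{2}\cos (\pi jx)\) and the Beta sample are choices made for this sketch only), Cencov's projection estimator can be written as follows.

```python
import numpy as np

def projection_estimator(X, k):
    """Projection estimator for the cosine basis phi_0 = 1, phi_j = sqrt(2) cos(pi j x)."""
    # Empirical coefficients beta_hat_j = n^{-1} sum_i phi_j(X_i), j = 0, ..., k - 1.
    beta = np.array([1.0] + [np.mean(np.sqrt(2.0) * np.cos(np.pi * j * X))
                             for j in range(1, k)])

    def s_hat(x):
        phis = np.vstack([np.ones_like(x)] +
                         [np.sqrt(2.0) * np.cos(np.pi * j * x) for j in range(1, k)])
        return beta @ phis        # may be negative: in general not a genuine density
    return s_hat

rng = np.random.default_rng(0)
X = rng.beta(2.0, 5.0, size=1000)
s_hat = projection_estimator(X, k=8)
```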

It has been shown in [9] that the quantity \(\left\| \sum _{j=1}^k\varphi _j^2\right\| _\infty \) only depends on \(\overline{S}\) and not on the choice of the basis. For an arbitrary subset of a uniformly bounded basis like the trigonometric basis, it is bounded by \(Ck\) for a constant \(C\) depending on the basis only. In such a case we get a risk bound of the form \(\Vert \overline{s}-s\Vert ^2+n^{-1}Ck\), which does not involve \(\Vert {s}\Vert _\infty \). If we use the projection onto the first \(k\) elements of a wavelet basis, the bound \(Ck\) still holds, but this is no longer true if we project onto an arbitrary subset of \(k\) elements of the same wavelet basis. We end up with the same dichotomy we found between regular and irregular histograms: bounding the variance term of the risk is sometimes straightforward, leading to a bound of the form \(n^{-1}Ck\), but for some other models the risk bound we derive involves \(\Vert {s}\Vert _\infty \).
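
To make the dichotomy concrete, here is a small numerical aside (the grid evaluation and the particular levels are assumptions of the sketch, not taken from the paper): for the first \(k\) elements of the Haar basis one finds \(\left\| \sum _j\varphi _j^2\right\| _\infty =k\), whereas for \(k\) elements all taken at a single fine resolution level \(J\) it equals \(2^J\), which may be much larger than \(k\).

```python
import numpy as np

x = np.linspace(0.0, 1.0, 2 ** 12, endpoint=False)

def haar(j, l, x):
    # psi_{j,l}(x) = 2^{j/2} [1_{[0,1/2)} - 1_{[1/2,1)}](2^j x - l)
    y = 2.0 ** j * x - l
    return 2.0 ** (j / 2.0) * (((0.0 <= y) & (y < 0.5)) * 1.0
                               - ((0.5 <= y) & (y < 1.0)) * 1.0)

# First k = 2^5 = 32 elements (constant function plus all levels j < 5): sup equals k.
first = [np.ones_like(x)] + [haar(j, l, x) for j in range(5) for l in range(2 ** j)]
print(len(first), np.max(sum(f ** 2 for f in first)))     # 32 and 32

# k = 8 elements, all at the single fine level J = 10: sup equals 2^10 = 1024.
fine = [haar(10, l, x) for l in range(8)]
print(len(fine), np.max(sum(f ** 2 for f in fine)))       # 8 and 1024
```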

A similar difficulty occurs in more complex examples, for instance for the estimators that are considered by Reynaud-Bouret et al. [29]. Their Theorem 1 leads to a risk bound that also involves a variance term depending on the unknown density \(s\) which is the analogue of \(\sum _{j=1}^k\mathop {\mathrm{Var}}\nolimits _s\left( \varphi _j(X_1)\right) \). In some cases, this term can be bounded independently of \(s\) but in some other cases this bounding involves \(\Vert {s}\Vert _\infty \).

It follows from these illustrations that it does not seem easy to get an analogue of (1.1) in full generality when \(d\) is the \(\mathbb L _2\)-distance (at first sight it may sometimes work and sometimes not, depending on the type of model we use). It will be the subject of our next section to formally prove that a general result of the form (1.1) cannot hold when \(d=d_2\).

1.2 Model based estimation

As we already mentioned, a common method for estimating \(s\) consists in choosing a particular subset \(\overline{S}\) of \((M,d)\), that we shall call a model for \(s\), and designing an estimator with values in \(\overline{S}\). Let us set \(M=\overline{\mathbb{L }}_1\) and choose for \(d\) either the Hellinger distance \(h\) or the variation distance \(v\). It follows from Le Cam [23–25] and subsequent results by Birgé [2, 4] that the risk of suitably designed estimators with values in \(\overline{S}\) is the sum of two terms, an approximation term depending on the distance from \(s\) to \(\overline{S}\) and an estimation term depending on the metric dimension of the model \(\overline{S}\), which can be defined as follows.

Definition 1

Let \(\overline{S}\) be a subset of some metric space \((M,d)\) and let \(\mathcal{B }_d(t,r)\) denote the open ball of center \(t\) and radius \(r\) with respect to the metric \(d\). Given \(\eta >0\), a subset \(S_\eta \) of \(M\) is called an \(\eta \)-net for \(\overline{S}\) if, for each \(t\in \overline{S}\), one can find \(t^{\prime }\in S_\eta \) with \(d(t,t^{\prime })\le \eta \).

We say that \(\overline{S}\) has a metric dimension bounded by \(D\ge 0\) if, for every \(\eta >0\), there exists an \(\eta \)-net \(S_\eta \) for \(\overline{S}\) such that

$$\begin{aligned} |S_\eta \cap \mathcal{B }_d(t,x\eta )|\le \exp \left[ Dx^2\right] \quad \text{ for } \text{ all } x\ge 2 \text{ and } t\in M. \end{aligned}$$
(1.12)

Remark

One can always assume that \(S_\eta \subset \overline{S}\) at the price of replacing \(D\) by \(25D/4\) according to Proposition 7 of [4].

When \((M,d)\) is a normed linear space, typical examples of sets with metric dimension bounded by \(D\) are subsets of \(2D\)-dimensional linear subspaces of \(M\), as shown in [4], where the following generalization of Le Cam [23] is also proven.

Proposition 1

Assume that we observe \(n\) i.i.d. random variables with unknown distribution \(P_s,\, s\in (\overline{\mathbb{L }}_1,d),\, d\) being either the Hellinger distance \(h\) or the variation distance \(v\), and that we have at our disposal a subset \(\overline{S}\) of \(\overline{\mathbb{L }}_1\) with metric dimension bounded by \(D\ge 1/2\). One can build an estimator \(\widetilde{s}\) with values in \(\overline{S}\) such that, for all \(s\in \overline{\mathbb{L }}_1\) and some universal constant \(C\),

$$\begin{aligned} \mathbb{E }_s\left[ d^2\left( \,\widetilde{s},s\right) \right] \le C\left[ \inf _{t\in \overline{S}}d^2(s,t)+n^{-1}D\right] \quad \text{ hence }\quad \sup _{s\in \overline{S}}\mathbb{E }_s\left[ d^2\left( \,\widetilde{s},s\right) \right] \le Cn^{-1}D. \end{aligned}$$

The risk bounds (1.1) and (1.2) that we mentioned earlier actually derive from this proposition.

1.3 Some negative results for the \(\mathbb L _2\)-loss

Unfortunately, the analogue of Proposition 1 when we deal with arbitrary densities and models belonging to \(\overline{\mathbb{L }}_2\) and set \(d=d_2\) cannot be true. To see this, let us take \(\mathcal{X }=[0,1],\, \mu \) the Lebesgue measure on \(\mathcal{X }\) and assume that, for some \(\overline{S}\) with metric dimension bounded by \(D\ge 1/2\),

$$\begin{aligned} \inf _{\widehat{s}}\sup _{s\in \overline{S}}\mathbb{E }_s\left[ d_2^2 \left( \,\widehat{s}(\varvec{X}),s\right) \right] =cn^{-1}D \end{aligned}$$

for some \(c>0\). For \(\lambda >1\), consider the mapping \(G_\lambda \) between elements of \(\overline{\mathbb{L }}_2\) given by \(G_\lambda (s)(x)=\lambda s(\lambda x) {1\!\!1}_{[0,\lambda ^{-1}]}(x)\). Then \(d_2\left( G_\lambda (t),G_\lambda (u)\right) = \lambda ^{1/2}d_2(t,u)\) for all \(t,u\in \overline{\mathbb{L }}_2\). This implies that \(G_\lambda \left( \overline{S}\right) \) has the same metric dimension as \(\overline{S}\), therefore also bounded by \(D\). Moreover, any estimator \(\widehat{s}\) of \(s\) can be turned into an estimator \(G_\lambda \left( \,\widehat{s}\right) \) for \(G_\lambda (s)\) and vice-versa, so that the minimax risk over \(G_\lambda \left( \overline{S}\right) \) is \(\lambda cn^{-1}D\). Since \(\lambda \) can be arbitrarily large, the bound (1.2) cannot be universally true.

The fact that the \(\mathbb L _\infty \)-norm of \(s\) may come into the risk based on \(\mathbb L _2\)-loss, as we noticed when studying histograms on irregular partitions, is actually not due to the use of specific estimators like histograms but it is a more general phenomenon as shown by another negative result provided by Proposition 4 of [5] that we recall below for the sake of completeness.

Proposition 2

For each \(L>0\) and each integer \(D\) with \(1\le D\le 3n\), one can find a finite set \(\overline{S}\) of densities with the following properties:

  (i)

    it is a subset of some \(D\)-dimensional affine subspace of \(\mathbb L _2([0,1],dx)\) with a metric dimension bounded by \(D/2\);

  (ii)

    \(\sup _{s\in \overline{S}}\Vert {s}\Vert _\infty \le L+1\);

  (iii)

    for any estimator \(\widehat{s}(X_1,\ldots ,X_n)\) belonging to \(\mathbb L _2([0,1],dx)\) and based on an i.i.d. sample with density \(s\in \overline{S}\),

    $$\begin{aligned} \sup _{s\in \overline{S}}\mathbb{E }_s\left[ \Vert \,\widehat{s}-s\Vert ^2\right] >0.0139DLn^{-1}. \end{aligned}$$
    (1.13)

It follows from this lower bound that the best universal risk bound one can expect to prove for an estimator \(\widehat{s}\) with values in an arbitrary model \(\overline{S}\) with metric dimension bounded by \(D\) when \(s\) is arbitrary in \(\overline{\mathbb{L }}_\infty \) is

$$\begin{aligned} \mathbb{E }_s\left[ d_2^2\left( \,\widehat{s},s\right) \right] \le C\left[ \inf _{t\in \overline{S}} d_2^2(s,t)+n^{-1}D\Vert {s}\Vert _\infty \right] . \end{aligned}$$
(1.14)

The situation becomes worse when \(s\not \in \mathbb L _\infty (\mu )\) or if \(\sup _{s\in \overline{S}}\Vert {s}\Vert _\infty \!=\!+\infty \). It may even happen that, whatever the estimator \(\widehat{s},\, \sup _{s\in \overline{S}}\mathbb{E }_s\left[ d^2\left( \,\widehat{s},s\right) \right] \) is infinite, even if \(\overline{S}\subset \mathbb L _2(\mu )\) has a bounded metric dimension, as shown by the following lower bound, to be proved in Sect. 6.1.

Proposition 3

Let \(\overline{S}=\{s_\theta ,\, 0<\theta \le 1/3\}\) be the set of densities with respect to the Lebesgue measure on \([0,1]\) given by

$$\begin{aligned} s_\theta =\theta ^{-2}{1\!\!1}_{[0,\theta ^3]}+\left( \theta ^2+\theta +1\right) ^{-1} {1\!\!1}_{(\theta ^3,1]}. \end{aligned}$$

If we have at our disposal \(n\) i.i.d. observations with density \(s_\theta \!\in \!\overline{S}\), we can build an estimator \(\widetilde{s}_n\) such that \(\sup _{0<\theta \le 1/3} \mathbb{E }_{s_\theta }\left[ nh^2(s_\theta ,\widetilde{s}_n)\right] \le C\) for some \(C\) independent of \(n\). On the other hand, although the metric dimension of \(\overline{S}\) with respect to the distance \(d_2\) is bounded by \(2,\, \sup _{0<\theta \le 1/3}\mathbb{E }_{s_\theta } \left[ \Vert {s}_\theta \!-\!\widehat{s}_n\Vert ^2\right] \!=\!+\infty \), whatever \(n\) and the estimator \(\widehat{s}_n\).

These three counter-examples show that there is absolutely no hope that Proposition 1 could be true when \(d=d_2\). They also suggest that it is impossible to build a general theory of model selection (or even of estimation based on one single model) with \(\mathbb L _2\)-loss without taking the \(\mathbb L _\infty \)-norm into account, even if there do exist some special situations, like for regular histograms, for which the introduction of the \(\mathbb L _\infty \)-norm is superfluous and Bound (1.14) can be substantially improved.

1.4 About this paper

We have seen in the previous section that Proposition 1, which deals with one single model \(\overline{S}\), cannot be true when \(d\) is the \(\mathbb L _2\)-distance, and the situation is obviously worse for model selection among many models. The general results about model selection or estimator aggregation (which can be viewed as a special case of model selection, as explained in Sect. 9 of [4]) that are valid when \(d=h\) or \(v\), and are stated precisely in Theorem 1 below, cannot hold in full generality when \(d=d_2\). Of course, they may hold in some specific situations or under some additional restrictions, and many results have already been obtained in this direction, but there is presently no general theory of model selection available when the risk is based on the \(\mathbb L _2\)-distance. This major difference between the \(\mathbb L _2\)-distance and the \(\mathbb L _1\) or Hellinger distances was the main motivation for writing this paper.

Our purpose, in the remainder of this paper, will be to explain to what extent the general theory for model selection which has been developed in Birgé [4] for \(d=h\) or \(v\) can be rescued when \(d=d_2\) with the additional introduction of \(\mathbb L _\infty \)-norms in the procedures, even when the density \(s\) does not belong to \(\mathbb L _\infty (\mu )\), and what type of results about adaptation in Besov spaces can be derived from this general approach. In particular, for the case of a single model, this will lead to a generalized version of (1.2) that can also handle the case of \(s\not \in \mathbb L _\infty \). When \(s\in \mathbb L _\infty \) (with an unknown value of \(\Vert {s}\Vert _\infty \)), the risk bounds we get completely parallel (apart from some constants depending on \(\Vert {s}\Vert _\infty \)) those obtained for estimating \(s\) when \(d=h\) or \(v\).

In order to achieve our goal, we shall use the construction of what we have called T-estimators in [4]. These estimators are based on suitable tests between balls with respect to the relevant distance \(d\). In the i.i.d. case when \(d=h\), tests between Hellinger balls were constructed quite a long time ago by Le Cam and there are many other frameworks for which such tests exist; see [7] for various examples. In order to apply our construction to the case of \(d=d_2\) we shall have to derive suitable tests between \(\mathbb L _2\)-balls.

In the next section we shall recall general results for model selection, based on what we have called T-estimators in [4], that hold when \(d=h\) or \(v\) and what is presently known when \(d=d_2\). Section 3 will be devoted to the statement of the main theorems and we shall give a few applications of them, in particular to aggregation of preliminary estimators and estimation of densities belonging to Besov spaces, in Sect. 4. We shall explain precisely the construction of our estimators, which can be viewed as a modification of T-estimators, in Sect. 5. The last section will be devoted to the most technical proofs.

2 Model selection

Let us now go back to histograms. As we noticed, the risk bound we get heavily depends on the choice of the partition. If we have at our disposal a finite (although possibly very large) family \(\{\mathcal{I }_m,m\in \mathcal{M }\}\) of finite partitions of \(\mathcal{X }\) with respective cardinalities \(|\mathcal{I }_m|\), we can consider the corresponding families of models \(\left\{ \overline{S}_{\mathcal{I }_m}, m\in \mathcal{M }\right\} \) and histogram estimators \(\left\{ \widehat{s}_{\mathcal{I }_m}, m\in \mathcal{M }\right\} \). It is then natural to try to find a partition in the family which leads, at least approximately, to the minimal risk \(\inf _{m\in \mathcal{M }}\mathbb{E }_s\left[ \Vert \widehat{s}_{\mathcal{I }_m}-s\Vert ^2\right] \). But one cannot select such a partition from either (1.5) or (1.7) since the risk depends on the unknown density \(s\) via \(\overline{s}_{\mathcal{I }_m}\). Methods of model or estimator selection base the choice of a suitable partition \(\mathcal{I }_{\widehat{m}}\) with \(\widehat{m}= \widehat{m}(X_1,\ldots ,X_n)\) on the observations.

This problem of partition selection is actually a particular case of model selection. Going back to the general framework of Sect. 1.2 with \(d=h\) or \(v\), we can consider a family of models \(\left\{ \overline{S}_m,m\in \mathcal{M }\right\} \), each one with metric dimension bounded by \(D_m\) so that it leads to an estimator \(\widehat{s}_m\) with a risk bounded, according to Proposition 1, by \(C\left[ \inf _{t\in \overline{S}_m}d^2(s,t)+n^{-1}D_m\right] \). Since the bias term \(\inf _{t\in \overline{S}_m}d^2(s,t)\) is unknown, it is impossible to decide which \(m\) leads to the best bound and a natural problem is to design a method for choosing a value \(\widehat{m}(X_1,\ldots ,X_n)\) of \(m\) from the observations in order to minimize the risk bound. There is actually a solution to this problem which is provided by the following result from Birgé [4].

Theorem 1

Let \(\varvec{X}=(X_1,\ldots ,X_n)\) be an i.i.d. sample with unknown density \(s\) belonging to \(\overline{\mathbb{L }}_1,\, d\) be either \(h\) or \(v\) and \(\left\{ \overline{S}_m,m\in \mathcal{M }\right\} \) a finite or countable family of subsets of \(\overline{\mathbb{L }}_1\) with metric dimensions bounded by \(D_m\ge 1/2\) respectively. Let the nonnegative weights \(\Delta _m, m\in \mathcal{M }\) satisfy

$$\begin{aligned} \sum _{m\in \mathcal{M }}\exp [-\Delta _m]=\Sigma ^{\prime }<+\infty . \end{aligned}$$
(2.1)

Then there exists a universal constant \(C\) and an estimator \(\widetilde{s}(\varvec{X})\) such that, for any \(s\in \overline{\mathbb{L }}_1\),

$$\begin{aligned} \mathbb{E }_s\left[ d^2\left( \,\widetilde{s},s\right) \right] \le C(1+\Sigma ^{\prime })\inf _{m\in \mathcal{M }} \left[ \inf _{t\in \overline{S}_m}d^2(s,t)+n^{-1}\max \left\{ D_m;\Delta _m\right\} \right] . \end{aligned}$$
(2.2)

Proposition 1 simply follows by setting \(\mathcal{M } =\{0\},\, \overline{S}_0=\overline{S},\, D_0=D\) and \(\Delta _0=1/2\). One should notice here the Bayesian role of the weights \(\Delta _m\). Choosing weights that satisfy (2.1) amounts to putting a positive prior measure with total mass \(\Sigma ^{\prime }\) on \(\mathcal{M }\), or equivalently on the collection of models, the model \(\overline{S}_m\) receiving mass \(\exp [-\Delta _m]\).

2.1 What is presently known

There exists a considerable amount of literature dealing with problems of model or estimator selection. Most of it is actually devoted to the analysis of Gaussian problems, or regression problems, or density estimation with either Hellinger or Kullback loss and it is not our aim here to review this literature. Only a few papers are actually devoted to our subject, namely model or estimator selection for estimating densities with \(\mathbb L _2\)-loss, and we shall therefore concentrate on these papers only. They can roughly be divided into three groups: the ones dealing with penalized projection estimators, the ones that study aggregation by selection of preliminary estimators and the more specific ones which use methods based on the thresholding of empirical coefficients within a given wavelet basis. The last ones, which are especially designed for the estimation of densities belonging to various kinds of Besov spaces, are typically not advertised as dealing with model selection but, as explained for instance in Sect. 5.1.2 of [11], can be viewed as very special instances of model selection methods for models that are spanned by some finite subsets of an orthonormal wavelet basis. They are definitely not general methods of model selection (i.e. which handle arbitrary densities and families of models) but specific ones dealing only with some special families of models and targeted to estimate special densities.

All these papers have in common the fact that they require more or less severe restrictions on the families of models or densities to be estimated. For instance, aggregation of estimators by selection only deals with models which are singletons, while thresholding of wavelet coefficients amounts to dealing with models which are linear spaces spanned by finite subsets of a wavelet basis. Moreover, apart from a few special cases to be mentioned below, they typically assume that \(s\in \mathbb L _\infty (\mu )\) with a known or estimated bound for \(\Vert {s}\Vert _\infty \).

In order to see how such methods apply to the problem of partition selection for histograms that we mentioned at the beginning of Sect. 2, let us be more specific and assume that \(\mathcal{X }=[0,1],\, \mu \) is the Lebesgue measure and \(\mathcal{N }=\{j/(N+1), 1\le j\le N\}\) for some (possibly very large) positive integer \(N\). For any subset \(m\) of \(\mathcal{N }\), we denote by \(\mathcal{I }_m\) the partition of \(\mathcal{X }\) generated by the intervals with set of endpoints \(m\cup \{0,1\}\) and we set \(\overline{S}_m=\overline{S}_{\mathcal{I }_m}\) and \(\widehat{s}_m=\widehat{s}_{\mathcal{I }_m}\). This leads to a family of partitions \(\mathcal{M }\) with cardinality \(2^N\) and to the corresponding families of linear models \(\left\{ \overline{S}_m, m\in \mathcal{M }\right\} \) and related histogram estimators \(\big \{\widehat{s}_m, m\in \mathcal{M }\big \}\). Then all models \(\overline{S}_m\) are subsets of the largest one \(\overline{S}_{\mathcal{N }}\). Given a sample \(X_1,\ldots ,X_n\) with unknown density \(s\), which partition \(\mathcal{I }_{\widehat{m}}\) with \(\widehat{m}=\widehat{m}(X_1,\ldots ,X_n)\) depending on the observations should we choose to estimate \(s\) and what sort of risk bound could we derive for the resulting estimator?

Since subset selection within a given basis applies here only when \(N=2^K-1\) and we use the Haar basis, we shall only consider this particular case in order to be able to deal with the three above-mentioned methods, keeping in mind that the third one does not apply to arbitrary values of \(N\). In this case, \(\overline{S}_{\mathcal{N }}\) is the linear span of the \(2^K\) first elements of the Haar basis. To each non-empty subset \(q\) of these \(2^K\) elements, we can associate its linear span \(\overline{S}^{\prime }_q\) and the family of linear models \(\left\{ \overline{S}^{\prime }_q, q\in \mathcal{Q }\right\} \) where \(\mathcal{Q }\) denotes the set of those \(q\). To each \(\overline{S}^{\prime }_q\) corresponds a projection estimator (as defined in Sect. 1.1.2) \(\widehat{s}_q\) which looks like a histogram estimator (piecewise constant) and one can consider the problem of selecting an optimal value \(\widehat{q}\) of \(q\), for instance by a proper thresholding of the empirical coefficients. One should nevertheless keep in mind that the two problems (selecting an \(m\) or a \(q\)) are different because the two families of models and estimators are different. In particular the families of models have different approximation properties. For instance, the density \(2^K{1\!\!1}_{[0,2^{-K})}\) belongs to the two-dimensional model \(\overline{S}_{\{1\}}\) but its expansion in the Haar basis has \(K\) non-zero coefficients so that it cannot belong to any model \(\overline{S}^{\prime }_q\) with dimension smaller than \(K\). As to the estimators, one should notice that histograms are always genuine densities which is not the case of the projection estimators \(\widehat{s}_q\).

Penalized projection estimators have been considered by Birgé and Massart [8] and an improved version is to be found in Chapter 7 of [27]. The method either deals with polynomial collections of linear models, i.e. collections for which the number of \(D\)-dimensional models is bounded by a polynomial in \(D\) (which does not apply to our case) or with subset selection within a given basis. Moreover, it requires that \(N<n/\log n\) and a bound on \(\Vert \overline{s}_{\mathcal{I }_{\mathcal{N }}}\Vert _\infty \) be known or estimated, as in Sect. 4.4.4 of [8], since the penalty depends on it.

Methods based on wavelet thresholding, as described in Donoho et al. [18] or Kerkyacharian and Picard [22] (see also the numerous references therein) typically require the same type of restrictions and, in particular, a known upper bound for \(\Vert {s}\Vert _\infty \) in order to properly calibrate the threshold. A noticeable exception appears to be the paper by Reynaud-Bouret et al. [29] which is devoted to the estimation of an unknown density on the real line (with possibly unbounded support) by a method which can be viewed as a specific model selection method, the models being linear spaces spanned by finite subsets of a given wavelet basis (for our problem it should be the Haar basis). Their Theorem 1 is some sort of an oracle inequality which does not involve \(\Vert {s}\Vert _\infty \) at all but, instead, variance terms which are similar to those in the right-hand side of (1.9). To apply it to densities \(s\) belonging to Besov spaces \(B^\alpha _{p,\infty }(\mathbb R )\) (with \(\alpha >(1/p)-(1/2)\) as is always required) they have to bound these variance terms like we did for (1.9) in Sect. 1.1.2. They derive risk bounds which show the same dichotomy we mentioned above for histograms. In the “nice” case (here when \(p>2\)) the bound does not involve \(\Vert {s}\Vert _\infty \). But in the more classical case of \(p\le 2\) they require that \(s\) belong to \(\mathbb L _\infty \) with risk bounds depending on \(\Vert {s}\Vert _\infty \).

Aggregation of estimators by selection assumes that preliminary estimators (one for each model in our case) are given in advance (we should here use the histograms) and typically leads to a risk bound including a term of the form \(n^{-1}\Vert {s}\Vert _\infty \log |\mathcal{M }|= n^{-1}N\Vert {s}\Vert _\infty \log 2\) so that all such results are useless for \(N\ge n\). Moreover, most of them also require that an upper bound for \(\Vert {s}\Vert _\infty \) be known since it enters the construction of the aggregate estimator. This is the case in Rigollet [30] (see for instance his Corollary 2.7) and Juditsky et al. [21, Corollary 5.7] since the parameter \(\beta \) that governs their mirror averaging method depends crucially on an upper bound for \(\Vert {s}\Vert _\infty \). As to Samarov and Tsybakov [32], their Assumption 1 requires that \(N\) be not larger than \(C\log n\). Similar restrictions are to be found in Yang [33] in his developments for mixing strategies and in Rigollet and Tsybakov [31] for linear aggregation of estimators. Lounici [26] does not assume that \(s\in \mathbb L _\infty \) but, instead, that all preliminary estimators are uniformly bounded. One can always truncate the estimators to get this but, to be efficient, the truncation should be adapted to the unknown parameter \(s\), and therefore chosen from the data in a suitable way. We do not know of any paper that allows such a data driven choice.

Consequently, none of these results can solve our partition selection problem with arbitrary partitions in a completely satisfactory way when \(N\) is at least of size \(n\) and whatever the unknown \(s\in \mathbb L _2(\mu )\). This fact was one motivation for our study of model selection for density estimation with \(\mathbb L _2\)-loss. As already mentioned, some results about adaptive estimation on particular smoothness classes that are akin to model selection with special models and do not assume the boundedness of \(s\) can be found in the literature. One can mention estimation of densities belonging to Sobolev classes \(W_2^{\alpha }(\mathbb R )= B^\alpha _{2,2}(\mathbb R ),\, \alpha >0\) studied by Efromovich [20] and densities in \(B^\alpha _{p,\infty }(\mathbb R ),\, p>2\) considered by Reynaud-Bouret et al. [29]. Both results are quite nice but they address specific situations. The results by Efromovich are actually extremely precise since he not only gets the optimal adaptive rates of convergence but also the exact optimal asymptotic constant, which was first computed by Pinsker [28] for Gaussian ellipsoids. He designs a special estimator of the characteristic function based on an application of the Efromovich-Pinsker method to the empirical characteristic function. Then he proceeds by Fourier inversion. This works remarkably well on Sobolev classes, which are defined via the characteristic functions, but cannot be extended to more general models. We actually do not know of a general model selection result that applies to any \(s\in \mathbb L _2(\mu )\) and arbitrary countable families of finite-dimensional models (possibly nonlinear). There is a price to pay for this level of generality: our procedure is of a purely abstract nature and not constructive, only indicating what is theoretically feasible. Unfortunately, we are unable to design a practical procedure with similar properties.

2.2 Some notations

Let us now fix our framework and notations. We want to estimate an unknown density \(s\), with respect to some probability measure \(\mu \) on the measurable space \((\mathcal{X }, \mathcal{W })\), from an i.i.d. sample \(\varvec{X}=(X_1,\ldots ,X_n)\) of random variables \(X_i\in \mathcal{X }\) with distribution \(P_s=s\cdot \mu \). The natural domain of application of our results is therefore a compact space \(\mathcal{X }\) with a finite reference measure \(\nu \), in which case we shall set \(\mu =\nu (\mathcal{X })^{-1}\nu \). Because of this restriction that \(\mu \) should be a probability, our result does not apply to estimating densities with respect to the Lebesgue measure on \(\mathbb R \) but would apply to densities with respect to a Gaussian probability on the line for instance.

Throughout the paper we denote by \(\mathbb P _s\) the probability that gives \(\varvec{X}\) the distribution \(P_s^{\otimes n}\) and by \(\mathbb{E }_s\) the corresponding expectation operator. For \(\Gamma >1\), we set

$$\begin{aligned} \overline{\mathbb{L }}^\Gamma _\infty = \left\{ \left. t\in \overline{\mathbb{L }}_\infty \,\right| \,\Vert t\Vert _\infty \le \Gamma \right\} \end{aligned}$$

and, for each \(s\in \overline{\mathbb{L }}_2\), we define the function \(Q_s\) on \(\mathbb R _+\) by

$$\begin{aligned} Q_s(z)=\int [s(x)-z]^2{1\!\!1}_{\{s(x)>z\}}\,d\mu (x)\quad \text{ for } z\ge 0. \end{aligned}$$
(2.3)
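
As an illustrative aside (the spiky density below is an arbitrary example, not taken from the paper), \(Q_s\) is easy to evaluate numerically; it is nonincreasing and vanishes as soon as \(z\ge \Vert {s}\Vert _\infty \).

```python
import numpy as np

x = np.linspace(0.0, 1.0, 100_001)
dx = x[1] - x[0]
s = np.where(x <= 0.01, 50.0, 0.5 / 0.99)      # a density: 50 on [0, 0.01], ~0.505 elsewhere

def Q(z):
    # Q_s(z) = integral of (s - z)^2 over the set {s > z}
    return np.sum(np.clip(s - z, 0.0, None) ** 2) * dx

print(Q(0.0), Q(10.0), Q(60.0))                # nonincreasing; the last value is 0
```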

We measure the performance at \(s\in \overline{\mathbb{L }}_2\) of an estimator \(\widehat{s}(\varvec{X})\in \mathbb{L }_2\) by its quadratic risk \(\mathbb{E }_s\left[ d_2^2\left( \,\widehat{s}(\varvec{X}),s\right) \right] \). More generally, if \((M,d)\) is a metric space of measurable functions on \(\mathcal{X }\) such that \(M\cap \overline{\mathbb{L }}_1\ne \emptyset \), the quadratic risk of some estimator \(\widehat{s}\in M\) at \(s\in M\cap \overline{\mathbb{L }}_1\) is defined as \(\mathbb{E }_s\left[ d^2\left( \,\widehat{s}(\varvec{X}),s\right) \right] \). We denote by \(|\mathcal{I }|\) the cardinality of the set \(\mathcal{I }\) and set \(a\vee b\) and \(a\wedge b\) for the maximum and the minimum of \(a\) and \(b\), respectively. Throughout the paper \(C\) (or \(C^{\prime }, \dots \)) will denote a universal (numerical) constant and \(C(a,b,\ldots )\) or \(C_q\) a function of the parameters \(a, b,\ldots \) or \(q\). Both may vary from line to line. Finally, from now on, countable will always mean “finite or countable”.

3 Main results

In order to define estimators based on families of models with bounded metric dimensions, we shall follow the approach of Birgé [4] based on what we have called T-estimators. We refer to this paper for the definition and construction of these estimators derived from tests between balls.

3.1 Model selection with bounded T-estimators

Our first result deals with the performance of special T-estimators that are by construction bounded by \(\Gamma \) and therefore belong to \(\overline{\mathbb{L }}^\Gamma _\infty \).

Theorem 2

Assume we are given a countable collection \(\left\{ \overline{S}_m,m\in \mathcal{M }\right\} \) of models in \(\mathbb L _2(\mu )\) with metric dimensions bounded respectively by \(\overline{D}_m\ge 1/2\) and a family of weights \(\Delta _m\) such that

$$\begin{aligned} \Sigma =1+\sum _{m\in \mathcal{M }}\exp [-\Delta _m]<+\infty . \end{aligned}$$
(3.1)

One can build, for each \(\Gamma \ge 3\), a T-estimator \(\widehat{s}^{\,\Gamma }\in \overline{\mathbb{L }}^\Gamma _\infty \) which satisfies, for all \(s\in \overline{\mathbb{L }}_2\) and \(q\ge 1\),

$$\begin{aligned} \mathbb{E }_s\left[ \left\| {s}\!-\!\widehat{s}^{\,\Gamma }\right\| ^q\right] \le C_q\Sigma \left[ \inf _{m\in \mathcal{M }}\left\{ d_2\!\left( s,\overline{S}_m\right) \!+\!\sqrt{\frac{\Gamma \left( \overline{D}_m\vee \Delta _m\right) }{n}}\right\} \!+\!\sqrt{Q_s(\Gamma )}\right] ^q,\qquad \end{aligned}$$
(3.2)

with \(Q_s\) given by (2.3) and \(C_q\) some constant depending only on \(q\). If \(\Vert {s}\Vert _\infty \le \Gamma \), then

$$\begin{aligned} \mathbb{E }_s\left[ \left\| {s}-\widehat{s}^{\,\Gamma }\right\| ^2\right] \le C\Sigma \inf _{m\in \mathcal{M }}\left\{ d_2^2\!\left( s,\overline{S}_m\right) +n^{-1}\Gamma \left( \overline{D}_m\vee \Delta _m\right) \right\} \!. \end{aligned}$$
(3.3)

3.2 General model selection in \(\overline{\mathbb{L }}_2\)

Clearly, the performance of the estimator \(\widehat{s}^{\,\Gamma }\) provided by Theorem 2 depends on the choice of \(\Gamma \) since the right-hand side of (3.2) includes a sum of two terms, the first one being increasing with respect to \(\Gamma \) and the second one, \([Q_s(\Gamma )]^{1/2}\), nonincreasing. An optimal value of \(\Gamma \) should balance these two terms. Unfortunately, both of them depend on the unknown parameter \(s\). We therefore need a way to choose \(\Gamma \) from the data in order to optimize the bound in (3.2).

The idea is to build a sequence of estimators \((\,\widehat{s}^{\,2^i})_{i\ge 2}\) and select a convenient value of \(i\) from our data. Since we only have at disposal a single sample \(\varvec{X}\) to build the estimators \(\widehat{s}^{\,2^i}\) and to choose \(i\), we shall proceed by sample splitting using one half of the sample for the construction of the estimators and the second half to select a suitable value of \(i\). We therefore now consider the general situation where we observe \(n=2n^{\prime }\) i.i.d. random variables \(X_1,\ldots ,X_n\) with an unknown density \(s\in \overline{\mathbb{L }}_2\), not necessarily bounded, and have at disposal a countable collection \(\left\{ \overline{S}_m,m\in \mathcal{M }\right\} \) of models in \(\mathbb L _2(\mu )\) with metric dimensions bounded respectively by \(\overline{D}_m\ge 1/2\) together with a family of weights \(\Delta _m\) which satisfy (3.1). We split our sample \(\varvec{X}=(X_1,\ldots ,X_n)\) into two subsamples \(\varvec{X}\!_1\) and \(\varvec{X}\!_2\) of the same size \(n^{\prime }\). We use \(\varvec{X}\!_1\) to build the T-estimators \(\widehat{s}_i(\varvec{X}\!_1)= \widehat{s}^{\,2^{i+1}}(\varvec{X}\!_1),\, i\ge 1\), which are provided by Theorem 2. It then follows from (3.2) that each such estimator satisfies, for \(q\ge 1\),

$$\begin{aligned}&\mathbb{E }_s\left[ \left\| {s}-\widehat{s}_i(\varvec{X}\!_1)\right\| ^q\right] \\&\quad \le C_q\Sigma \left\{ \inf _{m\in \mathcal{M }}\left[ d_2\!\left( s,\overline{S}_m\right) + \left( \frac{2^i\left( \overline{D}_m\vee \Delta _m\right) }{n}\right) ^{1/2}\right] +\sqrt{Q_s(2^{i+1})}\right\} ^q, \end{aligned}$$

with \(Q_s\) given by (2.3). We now work conditionally on \(\varvec{X}\!_1\), fix a convenient value of \(A\ge 1\) (for instance \(A=1\) if we just want to bound the quadratic risk) and use the second half of the sample \(\varvec{X}\!_2\) to select one estimator among the previous family. This requires a special argument to select a density from an unbounded sequence which is provided by the following proposition to be proved in Sect. 5.4.

Proposition 4

Let \((t_i)_{i\ge 1}\) be a sequence of densities such that \(t_i\in \overline{\mathbb{L }}_\infty ^{2^{i+1}}\) for each \(i\) and \(\varvec{X}\) be an \(n\)-sample with density \(s\in \overline{\mathbb{L }}_2\). Given \(A\ge 1\), one can design an estimator \(\widehat{s}_A(\varvec{X})\) such that

$$\begin{aligned} \mathbb{E }_s\left[ d_2^q\!\left( \,\widehat{s}_A,s\right) \right] \le C(A,q)\inf _{i\ge 1} \left[ d_2(s,t_i)\vee \sqrt{n^{-1}i2^i}\right] ^q\quad \text{ for } 1\le q<2A/\log 2. \end{aligned}$$

The selection, based on the sample \(\varvec{X}\!_2\), of an estimator in the sequence \(\left( \,\widehat{s}_i(\varvec{X}\!_1)\right) _{i\ge 1}\) according to Proposition 4 results in a new estimator \(\widetilde{s}_A(\varvec{X})\) which satisfies

$$\begin{aligned} \mathbb{E }_s\left[ \left. d_2^q\!\left( \,\widetilde{s}_A(\varvec{X}),s\right) \,\right| \, \varvec{X}\!_1\right] \le C(A,q)\inf _{i\ge 1}\left[ d_2\!\left( s,\widehat{s}_i(\varvec{X}\!_1)\right) \vee \sqrt{n^{-1}i2^i}\right] ^q, \end{aligned}$$

provided that \(q<2A/\log 2\). Integrating with respect to \(\varvec{X}\!_1\) and using our previous risk bound gives

$$\begin{aligned} {\mathbb{E }_s\left[ \left\| {s}-\widetilde{s}_A(\varvec{X})\right\| ^q\right] }&\le \!C(A,q)\inf _{i\ge 1}\left\{ \mathbb{E }_s\left[ \left\| {s}-\widehat{s}_i(\varvec{X}\!_1)\right\| ^q \right] +\left( n^{-1}i2^i\right) ^{q/2}\right\} \\&\le \!C(A,q)\Sigma \inf _{i\ge 1} \left\{ \inf _{m\in \mathcal{M }}\left[ d_2^q\left( s,\overline{S}_m\right) + \left( \frac{2^i\left( \overline{D}_m\vee \Delta _m\vee i\right) }{n}\right) ^{q/2}\right] \right. \\&\quad \left. +[Q_s(2^{i+1})]^{q/2}\right\} . \end{aligned}$$

For \(2^i\le z<2^{i+1},\, \log z\ge i\log 2\) and \(Q_s(z)\ge Q_s(2^{i+1})\) since \(Q_s\) is nonincreasing. Modifying the constants in our bounds accordingly, we get the main result of this paper, which provides adaptation with respect to both the models and the truncation constant.
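
Schematically, and keeping in mind that both the truncated T-estimators of Theorem 2 and the selection rule of Proposition 4 are abstract constructions for which no algorithm is given here, the two-stage procedure may be summarized as follows; the two helper functions are hypothetical placeholders introduced only for this sketch.

```python
import numpy as np

def adaptive_estimator(X, truncated_T_estimator, select_index, i_max=20):
    """Schematic two-stage procedure: build truncated estimators from the first
    half-sample, then select one of them from the second half-sample."""
    n = len(X)
    X1, X2 = X[: n // 2], X[n // 2:]             # the two half-samples of size n' = n/2
    # s_hat_i = s_hat^{2^(i+1)}(X1), i = 1, ..., i_max, each bounded by 2^(i+1)
    # (the construction of Theorem 2, here a placeholder callable).
    estimators = [truncated_T_estimator(X1, gamma=2.0 ** (i + 1))
                  for i in range(1, i_max + 1)]
    # Position in the list chosen from the second half (the rule of Proposition 4,
    # again a placeholder callable).
    i_hat = select_index(X2, estimators)
    return estimators[i_hat]
```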

Theorem 3

Let \(\varvec{X}=(X_1,\ldots ,X_n)\) with \(n\ge 2\) be an i.i.d. sample with unknown density \(s\in \overline{\mathbb{L }}_2\) and \(\left\{ \overline{S}_m,m\in \mathcal{M }\right\} \) be a countable collection of models in \(\mathbb L _2(\mu )\) with metric dimensions bounded respectively by \(\overline{D}_m\ge 1/2\). Let \(\{\Delta _m,m\in \mathcal{M }\}\) be a family of weights which satisfy (3.1) and \(Q_s(z)\) be given by (2.3). For each \(A\ge 1\), there exists an estimator \(\widetilde{s}_A(\varvec{X})\) such that, whatever \(s\in \overline{\mathbb{L }}_2\) and \(1\le q<(2A/\log 2)\),

$$\begin{aligned}&{\mathbb{E }_s\left[ \left\| {s}\!-\!\widetilde{s}_A(\varvec{X})\right\| ^q\right] }\nonumber \\&\quad \le \!\!C(A,q)\Sigma \inf _{z\ge 2}\inf _{m\in \mathcal{M }}\left[ d_2^q \left( s,\overline{S}_m\right) \!+\!\left( \frac{z\left( \overline{D}_m\vee \Delta _m\vee \log z\right) }{n}\right) ^{q/2}\!+\![Q_s(z)]^{q/2}\right] .\qquad \quad \quad \end{aligned}$$
(3.4)

In particular, for \(\widetilde{s}=\widetilde{s}_1\) and \(s\in \overline{\mathbb{L }}_\infty (\mu )\),

$$\begin{aligned} \mathbb{E }_s\left[ \left\| {s}-\widetilde{s}(\varvec{X})\right\| ^2\right] \le C\Sigma \inf _{m\in \mathcal{M }}\left[ d_2^2\left( s,\overline{S}_m\right) +n^{-1}\Vert {s}\Vert _\infty \left( \overline{D}_m\vee \Delta _m\vee \log \Vert {s}\Vert _\infty \right) \right] .\qquad \end{aligned}$$
(3.5)

3.3 Some remarks

We see that (3.4) is a generalization of (3.2) and (3.5) of (3.3) at the modest price of the extra \(\log z\) (or \(\log \Vert {s}\Vert _\infty \)). We do not know whether this \(\log z\) is necessary or not but, in a typical model selection problem, when \(s\) belongs to \(\overline{\mathbb{L }}_\infty (\mu )\) but not to \(\cup _{m\in \mathcal{M }} \overline{S}_m\), the optimal value of \(\overline{D}_m\) goes to \(+\infty \) with \(n\), so that, for this optimal value, asymptotically \(\overline{D}_m\vee \Delta _m\vee \log \Vert {s}\Vert _\infty =\overline{D}_m\vee \Delta _m\).

Up to constants depending on \(\Vert {s}\Vert _\infty \), (3.5) is the exact analogue of (1.14) which shows that, when \(s\in \overline{\mathbb{L }}_\infty (\mu )\), all the results about model selection obtained for the Hellinger distance can be translated in terms of the \(\mathbb L _2\)-distance.

Note that Theorem 3 applies to a single model \(\overline{S}\) with metric dimension bounded by \(\overline{D}\), in which case one can use a weight \(\Delta =1/2\le \overline{D}\) which results, if \(A=1\), in the risk bound

$$\begin{aligned} \mathbb{E }_s\left[ \left\| {s}-\widetilde{s}(\varvec{X})\right\| ^2\right] \le C \left[ d_2^2\left( s,\overline{S}\right) +\inf _{z\ge 2} \left\{ \frac{z\left( \overline{D}\vee \log z\right) }{n}+Q_s(z)\right\} \right] , \end{aligned}$$
(3.6)

and, if \(s\in \overline{\mathbb{L }}_\infty (\mu )\),

$$\begin{aligned} \mathbb{E }_s\left[ \left\| {s}-\widetilde{s}(\varvec{X})\right\| ^2\right] \le C \left[ d_2^2\!\left( s,\overline{S}\right) +n^{-1}\Vert {s}\Vert _\infty \left( \overline{D}\vee \log \Vert {s}\Vert _\infty \right) \right] . \end{aligned}$$
(3.7)

Apart from the extra \(\log \Vert {s}\Vert _\infty \), which is harmless when it is smaller than \(\overline{D}\), we recover what we expected, namely the bound (1.14).

Even if \(s\in \overline{\mathbb{L }}_\infty (\mu )\) the bound (3.4) may be much better than (3.5). This is actually already visible with one single model, comparing (3.6) with (3.7). It is indeed easy to find an example of a very spiky density \(s\) for which (3.6) is much better than (3.7) or the classical bound (1.11) obtained for projection estimators. Of course, this is just a comparison of universal bounds, not of the true risk of estimators for a given \(s\).

More surprising is the fact that our estimator can actually dominate a histogram based on the same model, although our counter-example is rather a caricature and more a warning against the use of the \(\mathbb L _2\)-loss than against the use of histogram estimators. Let us consider a partition \(\mathcal{I }\) of \([0,1]\) into \(2D\) intervals \(I_j,\, 1\le j\le 2D\) with the integer \(D\) satisfying \(2\le D\le n\), and fix some \(\gamma \ge 10\). We then set \(\alpha =\left( \gamma ^2n\right) ^{-1}\). For \(1\le j\le D\), the intervals \(I_{2j-1}\) have length \(\alpha \) while the intervals \(I_{2j}\) have length \(\beta \) with \(D(\alpha +\beta )=1\). We denote by \(\overline{S}\) the \(2D\)-dimensional linear space spanned by the indicator functions of the \(I_j\). It is a model with metric dimension bounded by \(D\). We assume that the underlying density \(s\) with respect to Lebesgue measure belongs to \(\overline{S}\) and is defined as

$$\begin{aligned} s=p\alpha ^{-1}\sum _{j=1}^D{1\!\!1}_{I_{2j-1}}+q\beta ^{-1}\sum _{j=1}^D{1\!\!1}_{I_{2j}} \quad \text{ with }\quad p=\gamma \alpha \quad \text{ and }\quad D(p+q)=1, \end{aligned}$$

so that \(\beta >q\) since \(\alpha <p\). We consider two estimators of \(s\) derived from the same model \(\overline{S}\): the histogram \(\widehat{s}_{\mathcal{I }}\) based on the partition \(\mathcal{I }\) and the estimator \(\widetilde{s}\) based on \(\overline{S}\) and provided by Theorem 3. According to (1.5) the risk of   \(\widehat{s}_{\mathcal{I }}\) is

$$\begin{aligned} Dn^{-1}\left[ \alpha ^{-1}p(1-p)+\beta ^{-1}q(1-q)\right] \ge 0.9Dn^{-1}\alpha ^{-1}p= 0.9D\gamma n^{-1}, \end{aligned}$$

since \(p\le 1/10\). The risk of \(\widetilde{s}\) can be bounded by (3.4) with \(z=4\) which gives

$$\begin{aligned} \mathbb{E }_s\left[ \left\| {s}-\widetilde{s}(\varvec{X})\right\| ^2\right]&\le C\left[ 4Dn^{-1}+ D\!\int \nolimits _{I_1}(p/\alpha )^2\,d\mu \right] \\&= CD\left[ 4n^{-1}+p^2\alpha ^{-1}\right] = 5CDn^{-1}. \end{aligned}$$

For large enough values of \(\gamma \), our estimator is better than the histogram. The problem actually comes from the observations falling in some of the intervals \(I_{2j-1}\), which lead to a very bad estimation of \(s\) on those intervals. Note that this happens with small probability since \(Dp=D(\gamma n)^{-1}\le \gamma ^{-1}\). Nevertheless, this event of small probability is important enough to produce a large risk when we use the \(\mathbb L _2\)-loss.
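
As a numerical aside (with illustrative values of \(\gamma ,\, D\) and \(n\) chosen for this sketch), one can evaluate the exact risk (1.5) of the histogram in this example and compare it with the two bounds displayed above.

```python
gamma, D, n = 100.0, 5, 1000
alpha = 1.0 / (gamma ** 2 * n)
beta = 1.0 / D - alpha
p = gamma * alpha                       # p = 1 / (gamma * n)
q = 1.0 / D - p

# Exact risk (1.5) of the histogram: no bias term since s belongs to the model.
risk_hist = (D / n) * (p * (1 - p) / alpha + q * (1 - q) / beta)
print(risk_hist, 0.9 * D * gamma / n, 5 * D / n)   # risk_hist >= 0.9 D gamma / n >> 5 D / n
```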

4 Some applications

4.1 Aggregation of preliminary estimators

Theorem 3 applies in particular to the problem of aggregating preliminary estimators, built from an independent sample, either by selecting one of them or by combining them linearly.

4.1.1 Aggregation by selection

Let us begin with the problem, already considered in Sect. 3.2, of selecting a point among a countable family \(\{t_m, m\in \mathcal{M }\}\). Typically, as in Rigollet [30], the \(t_m\) are preliminary estimators based on an independent sample (obtained by sample splitting if necessary) and we want to choose the best one in the family. This is a situation for which one can choose \(\overline{D}_m=1/2\) for all \(m\) and \(A=1\), which leads to the following corollary.

Corollary 1

Let \(\varvec{X}=(X_1,\ldots ,X_n)\) with \(n\ge 2\) be an i.i.d. sample with unknown density \(s\in \overline{\mathbb{L }}_2\) and \(\{t_m,m\in \mathcal{M }\}\) be a countable collection of points in \(\mathbb L _2(\mu )\). Let \(\{\Delta _m,m\in \mathcal{M }\}\) be a family of weights which satisfy (3.1) and \(Q_s(z)\) be given by (2.3). There exists an estimator \(\widetilde{s}(\varvec{X})\) such that, whatever \(s\in \overline{\mathbb{L }}_2\),

$$\begin{aligned} \mathbb{E }_s\left[ \left\| {s}-\widetilde{s}(\varvec{X})\right\| ^2\right] \le C\Sigma \inf _{z\ge 2} \left\{ \inf _{m\in \mathcal{M }}\left[ d_2^2(s,t_m)+ \frac{z(\Delta _m\vee \log z)}{n}\right] +Q_s(z)\right\} . \end{aligned}$$

4.1.2 Linear aggregation

Rigollet and Tsybakov [31] have considered the problem of linear aggregation. Given a finite set \(\{t_1,\ldots , t_N\}\) of preliminary estimators of \(s\), they use the observations to build a linear combination of the \(t_j\) in order to get a new and potentially better estimator of \(s\). For \(\varvec{\lambda }=(\lambda _1,\ldots ,\lambda _N)\in \mathbb R ^N\), let us set \(t_{\varvec{\lambda }}=\sum _{j=1}^N\lambda _jt_j\). Rigollet and Tsybakov build a selector \(\widehat{\varvec{\lambda }}(X_1,\ldots ,X_n)\) such that the corresponding estimator \(\widehat{s}(\varvec{X})=t_{\widehat{\varvec{\lambda }}}\) satisfies, for all \(s\in \overline{\mathbb{L }}_\infty \),

$$\begin{aligned} \mathbb{E }_s\left[ \left\| {s}-\widehat{s}(\varvec{X})\right\| ^2\right] \le \inf _{\varvec{\lambda }\in \mathbb R ^N}d_2^2\left( s,t_{\varvec{\lambda }}\right) +n^{-1}\Vert {s}\Vert _\infty N. \end{aligned}$$
(4.1)

Unfortunately, this bound, which is shown to be sharp for such an estimator, can be really poor, as compared to the minimal risk \(\inf _{1\le j\le N}d_2^2(s,t_j)\) of the preliminary estimators when one of these is already quite good and \(n^{-1}\Vert {s}\Vert _\infty N\) is large, which is likely to happen when \(N\) is quite large. Moreover, this result tells nothing when \(s\not \in \overline{\mathbb{L }}_\infty \). In [4, Sect. 9.3] we proposed an alternative way of selecting a linear combination of the \(t_j\) based on T-estimators. In the particular situation of densities belonging to \(\overline{\mathbb{L }}_2\), we proceed as follows: we choose for \(\mathcal{M }\) the collection of all nonvoid subsets \(m\) of \(\{1,\ldots ,N\}\) and, for \(m\in \mathcal{M }\), we take for \(\overline{S}_m\) the linear span of the \(t_j\) with \(j\in m\) so that the dimension of \(\overline{S}_m\) is bounded by \(|m|\) and its metric dimension \(\overline{D}_m\) by \(|m|/2\). Since the number of elements of \(\mathcal{M }\) with cardinality \(j\) is \(\big (\begin{array}{c}N\\ j\end{array}\big )<(eN/j)^j\), we may set \(\Delta _m=|m|[2+ \log (N/|m|)]\) so that (3.1) is satisfied with \(\Sigma <2\). An application of Theorem 3 leads to the following corollary.
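
With the same reading of (3.1) as above, the announced bound \(\Sigma <2\) follows from a one-line computation:

$$\begin{aligned} \sum _{m\in \mathcal{M }}e^{-\Delta _m}=\sum _{j=1}^N\big (\begin{array}{c}N\\ j\end{array}\big )e^{-j[2+\log (N/j)]} <\sum _{j=1}^N\left( \frac{eN}{j}\right) ^j\left( \frac{j}{N}\right) ^je^{-2j} \le \sum _{j\ge 1}e^{-j}=(e-1)^{-1}<2. \end{aligned}$$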

Corollary 2

Let \(\varvec{X}=(X_1,\ldots ,X_n)\) with \(n\ge 2\) be an i.i.d. sample with unknown density \(s\in \overline{\mathbb{L }}_2\) and \(\{t_1,\ldots , t_N\}\) be a finite set of points in \(\mathbb L _2(\mu )\). Let \(\mathcal{M }\) be the collection of all nonvoid subsets \(m\) of \(\{1,\ldots ,N\}\) and, for \(m\in \mathcal{M }\),

$$\begin{aligned} \Lambda _m=\left\{ \left. \varvec{\lambda }\in \mathbb R ^N\,\right| \,\lambda _j=0 \text{ for } j\not \in m\right\} . \end{aligned}$$

For each \(A\ge 1\), there exists an estimator \(\widetilde{s}_A(\varvec{X})\) such that, whatever \(s\in \overline{\mathbb{L }}_2\) and \(1\le q<(2A/\log 2)\),

$$\begin{aligned} \mathbb{E }_s\left[ \left\| {s}-\widetilde{s}_A(\varvec{X})\right\| ^q\right] \le C(A,q)\inf _{z\ge 2}\inf _{m\in \mathcal{M }}R(q,s,z,m), \end{aligned}$$

where

$$\begin{aligned} R(q,s,z,m)&= \inf _{\varvec{\lambda }\in \Lambda _m} d_2^q\left( s,t_{\varvec{\lambda }}\right) +\left( \frac{z\left[ |m|\left( {} 1+\log (N/|m|)\right) \vee \log z\right] }{n}\right) ^{q/2}\!\\&\quad +[Q_s(z)]^{q/2} \end{aligned}$$

and \(Q_s(z)\) is given by (2.3).

There are many differences between this bound and (4.1), apart from the nasty constant \(C(A,q)\). Firstly, it applies to densities \(s\) that do not belong to \(\overline{\mathbb{L }}_\infty \) and handles the case of \(q>2\) for a convenient choice of \(A\). Also, when \(s\in \overline{\mathbb{L }}_\infty \) and one of the preliminary estimators is already close to \(s\), it may very well happen, when \(N\) is large, that

$$\begin{aligned} R\left( 2,s,\Vert {s}\Vert _\infty ,m\right) \le \inf _{\varvec{\lambda }\in \Lambda _m} d_2^2\left( s,t_{\varvec{\lambda }}\right) +n^{-1}\Vert {s}\Vert _\infty \left[ |m|\left( {}1+\log (N/|m|)\right) \vee \log \Vert {s}\Vert _\infty \right] \end{aligned}$$

holds with a right-hand side much smaller than that of (4.1) for some \(m\) of small cardinality.

4.2 Selection of projection estimators

In this section, we assume that \(s\in \overline{\mathbb{L }}_\infty (\mu )\). This assumption is not needed for the design of the estimator but only to derive suitable risk bounds. We have at hand a countable family \(\left\{ \overline{S}_m, m\in \mathcal{M }\right\} \) of linear subspaces of \(\mathbb L _2(\mu )\) with respective dimensions \(D_m\) and we choose corresponding weights \(\Delta _m\) satisfying (3.1). For each \(m\), we consider the projection estimator \(\widehat{s}_m\) defined in Sect. 1.1. Each such estimator has a risk bounded by (1.11), i.e.

$$\begin{aligned} \mathbb{E }_s\left[ \Vert \widehat{s}_m-s\Vert ^2\right] \le \Vert \overline{s}_m-s\Vert ^2+n^{-1}D_m \Vert {s}\Vert _\infty , \end{aligned}$$

where \(\overline{s}_m\) denotes the orthogonal projection of \(s\) onto \(\overline{S}_m\). If we apply Corollary 1 to this family of estimators, we get an estimator \(\widetilde{s}(\varvec{X})\) satisfying, for all \(s\in \overline{\mathbb{L }}_\infty \),

$$\begin{aligned} \mathbb{E }_s\left[ \left\| {s}-\widetilde{s}(\varvec{X})\right\| ^2\right] \le C\Sigma \inf _{m\in \mathcal{M }}\left[ \Vert \overline{s}_m-s\Vert ^2+n^{-1}\Vert {s}\Vert _\infty \left( D_m\vee \Delta _m\vee \log \Vert {s}\Vert _\infty \right) \right] . \end{aligned}$$

With this bound at hand, we can now go back to the problem we considered in Sect. 2.1, starting with an arbitrary countable family \(\{\mathcal{I }_m, m\in \mathcal{M }\}\) of finite partitions of \(\mathcal{X }\) and weights \(\Delta _m\) satisfying (3.1). To each partition \(\mathcal{I }_m\) we associate the linear space \(\overline{S}_m\) of piecewise constant functions of the form \(\sum _{I\in \mathcal{I }_m}\beta _I{1\!\!1}_{I}\). The dimension of this linear space is the cardinality of \(\mathcal{I }_m\) and its metric dimension is bounded by \(|\mathcal{I }_m|/2\). If we know that \(s\in \overline{\mathbb{L }}_\infty (\mu )\), we can proceed as we just explained, building the family of histograms \(\widehat{s}_{\mathcal{I }_m}(\varvec{X}\!_1)\) corresponding to our partitions and using Corollary 1 to get

$$\begin{aligned} \mathbb{E }_s\left[ \left\| {s}-\widetilde{s}(\varvec{X})\right\| ^2\right] \le C\Sigma \inf _{m\in \mathcal{M }}\left[ \Vert \overline{s}_{\mathcal{I }_m}-s\Vert ^2+n^{-1}\Vert {s}\Vert _\infty \left( |\mathcal{I }_m|\vee \Delta _m\vee \log \Vert {s}\Vert _\infty \right) \right] ,\nonumber \\ \end{aligned}$$
(4.2)

which should be compared with (1.7). Apart from the unavoidable complexity term \(\Delta _m\) due to model selection, we have only lost (up to the universal constant \(C\)) the replacement of \(|\mathcal{I }_m|\) by \(|\mathcal{I }_m|\vee \log \Vert {s}\Vert _\infty \). Examples of families of partitions and corresponding weights satisfying (3.1) are given in Sect. 9 of [4].
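
As a purely numerical illustration of the building blocks involved here (the histograms \(\widehat{s}_{\mathcal{I }_m}\) themselves, not the selection step of Corollary 1, which relies on T-estimators), the following Python sketch computes the family of histogram estimators associated to dyadic partitions of \([0,1]\); it assumes that \(\mu \) is the Lebesgue measure on \([0,1]\) and all function names are ours.

import numpy as np

def dyadic_histogram(sample, j):
    # Histogram estimator for the dyadic partition of [0,1] into 2**j cells:
    # on a cell I its value is N_I / (n * mu(I)) with mu(I) = 2**(-j).
    n = len(sample)
    counts, _ = np.histogram(sample, bins=2**j, range=(0.0, 1.0))
    return counts / (n * 2.0 ** (-j))

def evaluate(cell_values, t):
    # Evaluate the piecewise constant estimator at points t in [0, 1).
    j = int(np.log2(len(cell_values)))
    cells = np.minimum((t * 2**j).astype(int), 2**j - 1)
    return cell_values[cells]

rng = np.random.default_rng(0)
sample = rng.beta(2.0, 5.0, size=1000)        # i.i.d. draws from some density on [0,1]
family = {j: dyadic_histogram(sample, j) for j in range(7)}
on_grid = evaluate(family[5], np.linspace(0.0, 1.0, 256, endpoint=False))

Each estimator is stored through its constant value on each cell, which is all that is needed to evaluate it or to compare two such piecewise constant functions in \(\mathbb L _2\).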

In the general case of \(s\in \overline{\mathbb{L }}_2(\mu )\), we may apply Theorem 3 to the family of linear models \(\left\{ \overline{S}_m, m\in \mathcal{M }\right\} \) derived from these partitions, getting an estimator \(\widetilde{s}\) with a risk satisfying

$$\begin{aligned} \mathbb{E }_s\left[ \left\| {s}-\widetilde{s}(\varvec{X})\right\| ^2\right]&\le C\Sigma \inf _{z\ge 2} \left\{ \inf _{m\in \mathcal{M }}\left[ \Vert \overline{s}_{\mathcal{I }_m}-s\Vert ^2+ \frac{z(|\mathcal{I }_m|\vee \Delta _m\vee \log z)}{n}\right] +Q_s(z)\right\} . \end{aligned}$$

4.3 A comparison with Gaussian model selection

A benchmark for model selection in general is the particular (simpler) situation of model selection for the so-called white noise framework in which we observe a Gaussian process \(\varvec{X}=\{X_z,z\in [0,1]\}\) with \(X_z=\int _0^zs(x)\,dx+\sigma W_z\), where \(s\) is an unknown element of \(\mathbb L _2( [0,1],dx),\, \sigma >0\) a known parameter and \(W_z\) a Wiener process. For such a problem, an analogue of Theorem 1 has been proved in Birgé [4], namely

Theorem 4

Let \(\varvec{X}\) be the Gaussian process given by

$$\begin{aligned} X_z=\int \nolimits _0^zs(x)\,dx+n^{-1/2}W_z,\quad 0\le z\le 1, \end{aligned}$$

where \(s\) is an unknown element of \(\mathbb L _2( [0,1],dx)\) to be estimated and \(W_z\) a Wiener process. Let \(\left\{ \overline{S}_m,m\in \mathcal{M }\right\} \) be a countable collection of models in \(\mathbb L _2( [0,1],dx)\) with metric dimensions bounded respectively by \(\overline{D}_m\ge 1/2\). Let \(\{\Delta _m,m\in \mathcal{M }\}\) be a family of weights which satisfy (3.1). There exists an estimator \(\widetilde{s}(\varvec{X})\) such that, whatever \(s\in \mathbb L _2( [0,1],dx)\),

$$\begin{aligned} \mathbb{E }_s\left[ \left\| {s}-\widetilde{s}(\varvec{X})\right\| ^2\right] \le C \inf _{m\in \mathcal{M }}\left[ d_2^2\!\left( s,\overline{S}_m\right) +n^{-1} \left( \overline{D}_m\vee \Delta _m\right) \right] . \end{aligned}$$

Comparing this bound with (3.5) shows that, when \(s\in \overline{\mathbb{L }}_\infty (\mu )\), we get a similar risk bound for estimating the density \(s\) from \(n\) i.i.d. random variables, apart from an additional factor depending on \(\Vert {s}\Vert _\infty \). Similar analogies are valid with bounds obtained for estimating densities with squared Hellinger loss or for estimating the intensity of a Poisson process as shown in Birgé [4, 6]. Therefore, all the many examples that have been treated in these papers as well as those in Baraud and Birgé [1] could be transferred to the case of density estimation with \(\mathbb L _2\)-loss with minor modifications due to the appearance of \(\Vert {s}\Vert _\infty \) in the bounds. We leave all these translations as exercises for the interested reader.

4.4 Adaptive estimation in Besov spaces

The Besov space \(B^\alpha _{p,\infty }([0,1])\) with \(\alpha ,p>0\) is defined in Devore and Lorentz [15] and it is known that a necessary and sufficient condition for \(B^\alpha _{p,\infty }([0,1])\subset \mathbb L _2([0,1],dx)\) is \(\delta =\alpha +1/2-1/p>0\), which we shall assume in the sequel. The problem of estimating densities that belong to some Besov space \(B^\alpha _{p,\infty }([0,1])\) adaptively (i.e. without knowing \(\alpha \) and \(p\)) was solved long ago when \(\alpha >1/p\), which is a necessary and sufficient condition for \(B^\alpha _{p,\infty }([0,1])\subset \mathbb L _\infty ([0,1],dx)\). See for instance [14, 18] (under the assumption that an upper bound for \(\Vert {s}\Vert _\infty \) is known) or [8] (with an estimated value of \(\Vert {s}\Vert _\infty \)). It can be treated in the usual way, leading to the minimax rate of convergence \(n^{-2\alpha /(2\alpha +1)}\) for the quadratic risk when \(n\) goes to infinity. The situation is quite different when \(\alpha \le 1/p\) even when \(\alpha \) and \(p\) are known.

4.4.1 Wavelet expansions

It is known from analysis that functions \(s\in \mathbb L _2\left( [0,1],dx\right) \) can be represented by their expansion with respect to some orthonormal wavelet basis \(\{\varphi _{j,k}, j\ge -1, k\in \Lambda (j)\}\) with \(|\Lambda (-1)|\le K\) and \(2^j\le |\Lambda (j)|\le K2^j\) for all \(j\ge 0\). Such a wavelet basis satisfies

$$\begin{aligned} \left\| \sum _{\,k\in \Lambda (j)\,}|\varphi _{j,k}|\right\| _\infty \!\le K^{\prime }2^{j/2} \quad \text{ for } j\!\ge -1 \quad \text{ and } \quad \left\| \sum _{j=-1}^q\sum _{\,k\in \Lambda (j)}\varphi _{j,k}^2\right\| _\infty \!\le K^{\prime \prime }2^q \qquad \quad \end{aligned}$$
(4.3)

and we can write

$$\begin{aligned} s=\sum _{j=-1}^\infty \sum _{\,k\in \Lambda (j)\,}\beta _{j,k}\varphi _{j,k}\quad \text{ with } \quad \beta _{j,k}=\int \varphi _{j,k}(x)s(x)\,dx. \end{aligned}$$
(4.4)

Moreover, for a convenient choice of the wavelet basis (depending on \(\alpha \)), the fact that \(s\) belongs to the Besov space \(B^\alpha _{p,\infty }([0,1])\) with semi-norm \(|s|^\alpha _p\) is equivalent to

$$\begin{aligned} \sup _{j\ge 0}2^{j(\alpha +1/2-1/p)}\left( \sum _{\,k\in \Lambda (j)\,}| \beta _{j,k}|^p\right) ^{1/p}=|s|_{\alpha ,p,\infty }<+\infty , \end{aligned}$$
(4.5)

where \(|s|_{\alpha ,p,\infty }\) is equivalent to the Besov semi-norm \(|s|^\alpha _p\).

Moreover, it follows from Birgé and Massart [8, 10], as summarized in [4, Proposition 13], that, given the integer \(r\), one can find a wavelet basis (depending on \(r\)) and a universal family of linear models \(\{\overline{S}_m, m\in \mathcal{M }=\cup _{J\ge 0}\mathcal{M }_J\}\) with respective dimensions \(\overline{D}_m\), and weights \(\{\Delta _m, m\in \mathcal{M }\}\) satisfying (3.1), with the following properties. Each \(\overline{S}_m\) is the linear span of \(\{\varphi _{-1,k} , k\in \Lambda (-1)\}\cup \{\varphi _{j,k},\, (j,k)\in m\}\) with \(m\subset \cup _{j\ge 0}\Lambda (j);\, \overline{D}_m\vee \Delta _m\le c2^J\) for \(m\in \mathcal{M }_J\) and

$$\begin{aligned} \inf _{m\in \mathcal{M }_J}\inf _{t\in \overline{S}_m}\Vert {s}-t\Vert \le C(\alpha ,p)2^{-J\alpha } |s|_{\alpha ,p,\infty }\quad \text{ for } s\in B^\alpha _{p,\infty }([0,1]),\;\alpha <r. \quad \end{aligned}$$
(4.6)

4.4.2 The bounded case

Actually, only the assumption that \(s\in B^\alpha _{p,\infty }([0,1])\cap \overline{\mathbb{L }}_\infty (\mu )\), rather than \(\alpha >1/p\), is needed to get the optimal rate of convergence \(n^{-2\alpha /(2\alpha +1)}\). Indeed, we may apply the results of Sect. 4.2 to the family of models which satisfies (4.6) and derive an estimator \(\widetilde{s}\) with a risk bounded by

$$\begin{aligned} \mathbb{E }_s\left[ \left\| {s}-\widetilde{s}(\varvec{X})\right\| ^2\right] \le C(\alpha ,p) \inf _{J\ge 0}\left[ 2^{-2J\alpha }\left( |s|_{\alpha ,p,\infty }\right) ^2+n^{-1}\Vert {s}\Vert _\infty \left( 2^J\vee \log \Vert {s}\Vert _\infty \right) \right] . \end{aligned}$$

Choosing \(2^J\) of the order of \(n^{1/(2\alpha +1)}\) leads to the bound

$$\begin{aligned} \mathbb{E }_s\left[ \left\| {s}-\widetilde{s}(\varvec{X})\right\| ^2\right] \le C\left( \alpha ,p,|s|_{\alpha ,p,\infty },\Vert {s}\Vert _\infty \right) n^{-2\alpha /(2\alpha +1)}, \end{aligned}$$

which is valid for all \(s\in B^\alpha _{p,\infty }([0,1])\cap \overline{\mathbb{L }}_\infty (\mu )\), whatever \(\alpha <r\) and \(p\) and although \(\alpha ,\, p,\, |s|_{\alpha ,p,\infty }\) and \(\Vert {s}\Vert _\infty \) are unknown.
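
To see where the choice of \(2^J\) comes from, note that with \(2^J\) of order \(n^{1/(2\alpha +1)}\) the two main terms of the previous bound are balanced,

$$\begin{aligned} 2^{-2J\alpha }\left( |s|_{\alpha ,p,\infty }\right) ^2\asymp \left( |s|_{\alpha ,p,\infty }\right) ^2n^{-2\alpha /(2\alpha +1)} \quad \text{ and }\quad n^{-1}\Vert {s}\Vert _\infty 2^J\asymp \Vert {s}\Vert _\infty n^{-2\alpha /(2\alpha +1)}, \end{aligned}$$

while the remaining term \(n^{-1}\Vert {s}\Vert _\infty \log \Vert {s}\Vert _\infty \) is of smaller order when \(n\) is large.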

4.4.3 Further upper bounds for the risk

When \(\alpha \le 1/p\), i.e. \(0<\delta \le 1/2,\, s\) may be unbounded and the classical theory does not apply any more. Results that do not involve \(\Vert {s}\Vert _\infty \) are available in Efromovich [20] for Sobolev classes \(W_2^\alpha (\mathbb R )=B^\alpha _{2,2}(\mathbb R )\) and for Besov spaces \(B^\alpha _{p,\infty }(\mathbb R )\) with \(p>2\) in Reynaud-Bouret et al. [29]. Nevertheless a general formula for the adaptive minimax risk over balls in \(B^\alpha _{p,\infty }([0,1])\) for \(p\le 2\) and \(1/p-1/2<\alpha \le 1/p\) is presently unknown. Our study will not, unfortunately, solve this problem but, at least, provide some partial information. In this section we assume that \(\alpha \le 1/p\) and restrict ourselves to the case \(p\le 2\) so that \(\delta \le \alpha \).

We consider the wavelet expansion of \(s\) which has been described in Sect. 4.4.1 and, to avoid unnecessary complications, we also assume that \(|s|_{\alpha ,p,\infty }\ge 1\). In what follows, the generic constant \(C\) (changing from line to line) depends on the choice of the basis and \(\delta \). Since \(p\le 2\), by (4.5),

$$\begin{aligned} \left( \sum _{\,k\in \Lambda (j)\,}\beta _{j,k}^2\right) ^{1/2}&\le \left( \sum _{\,k\in \Lambda (j)\,}|\beta _{j,k}|^p\right) ^{1/p}\le |s|_{\alpha ,p,\infty } 2^{-j(\alpha +1/2-1/p)}\\&= |s|_{\alpha ,p,\infty }2^{-j\delta }, \end{aligned}$$

hence, for \(J\in \mathbb N \),

$$\begin{aligned} \left\| \sum _{j>J}\sum _{\,k\in \Lambda (j)\,}\beta _{j,k}\varphi _{j,k}\right\| ^2&= \sum _{j>J}\sum _{\,k\in \Lambda (j)\,}\beta _{j,k}^2\le |s|_{\alpha ,p,\infty }^2\sum _{j>J} 2^{-2j\delta } \nonumber \\&= \left( 2^{2\delta }-1\right) ^{-1}|s|_{\alpha ,p,\infty }^22^{-2J\delta }\le C|s|_{\alpha ,p,\infty }^22^{-2J\delta }. \end{aligned}$$
(4.7)

The simplest estimators of \(s\) are the projection estimators \(\widehat{s}_q\) over the linear spaces \(\overline{S}^{\prime }_q\), where \(\overline{S}^{\prime }_q\) is spanned by \(\{\varphi _{j,k}, -1\le j\le q, k\in \Lambda (j)\}\):

$$\begin{aligned} \widehat{s}_q(\varvec{X})=\sum _{j=-1}^q\sum _{\,k\in \Lambda (j)\,}\widehat{\beta }_{j,k}(\varvec{X}) \varphi _{j,k}\quad \text{ with }\quad \widehat{\beta }_{j,k}(\varvec{X})=n^{-1}\sum _{i=1}^n \varphi _{j,k}(X_i). \end{aligned}$$

The risk of these estimators can be bounded using (1.11), (4.3) and (4.7) by

$$\begin{aligned} \mathbb{E }_s\left[ \left\| {s}-\widehat{s}_q(\varvec{X})\right\| ^2\right] \le d_2^2\left( s,\overline{S}^{\prime }_q\right) +C2^q/n\le C\left[ 2^{-2q\delta }|s|^2_{\alpha ,p,\infty } +2^q/n\right] . \end{aligned}$$

A convenient choice of \(q\), depending on \(\delta \) (therefore nonadaptive), then leads to

$$\begin{aligned} \mathbb{E }_s\left[ \left\| {s}-\widehat{s}_q(\varvec{X})\right\| ^2\right] \le C|s|^2_{\alpha ,p,\infty }n^{-2\delta /(2\delta +1)}. \end{aligned}$$

In particular, when \(p=2\) we recover the usual minimax rate \(n^{-2\alpha /(1+2\alpha )}\) for all values of \(\alpha \) but without adaptation.
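
The following Python sketch (our own code, provided only as an illustration) implements \(\widehat{s}_q\) for the Haar basis, one admissible choice of wavelet basis for which \(|\Lambda (-1)|=1\) and \(|\Lambda (j)|=2^j\); we take \(\mu \) to be the Lebesgue measure on \([0,1]\) and all function names are ours.

import numpy as np

def haar(j, k, x):
    # Haar wavelet psi_{j,k} on [0,1], for j >= 0 and 0 <= k < 2**j.
    y = 2.0**j * x - k
    return 2.0 ** (j / 2.0) * (((0.0 <= y) & (y < 0.5)).astype(float)
                               - ((0.5 <= y) & (y < 1.0)).astype(float))

def projection_estimator(sample, q):
    # \hat s_q with empirical coefficients \hat beta_{j,k} = n^{-1} sum_i psi_{j,k}(X_i);
    # the coefficient on the constant function phi_{-1,0} = 1 always equals 1.
    coeffs = {(j, k): haar(j, k, sample).mean()
              for j in range(q + 1) for k in range(2**j)}
    def s_hat(t):
        out = np.ones_like(t, dtype=float)
        for (j, k), b in coeffs.items():
            out += b * haar(j, k, t)
        return out
    return s_hat

rng = np.random.default_rng(1)
data = rng.beta(2.0, 5.0, size=2000)
s_hat = projection_estimator(data, q=4)       # a nonadaptive choice of q
values = s_hat(np.linspace(0.0, 1.0, 256, endpoint=False))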

One can actually choose \(q\) from the data using a penalized least squares estimator and get a similar risk bound without knowing \(\delta \) as shown by Theorem 7.5 of [27] which proves adaptation to the minimax risk when \(p=2\). It also leads to an adaptive risk bound for the case \(\alpha \le 1/p,\, p<2\) (hence \(\delta <\alpha \)), without the restriction \(s\in \overline{\mathbb{L }}_\infty ([0,1])\) but with a rate which is then slower than \(n^{-2\alpha /(1+2\alpha )}\).

Let us now see what our method can do. Since \(s\) is a density, it follows from (4.4) and (4.3) that \(|\beta _{-1,k}|\le \Vert \varphi _{-1,k}\Vert _\infty \le K^{\prime }/\sqrt{2}\), hence

$$\begin{aligned} \left\| \sum _{k\in \Lambda (-1)}\beta _{-1,k}\varphi _{-1,k}\right\| _\infty \le \left( K^{\prime }/\sqrt{2}\right) \left\| \sum _{k\in \Lambda (-1)}|\varphi _{-1,k}|\right\| _\infty \le K^{\prime 2}/2. \end{aligned}$$

Moreover, for \(j\ge 0\), (4.5) implies that \(\sup _{\,k\in \Lambda (j)\,}|\beta _{j,k}| \le 2^{-j\delta }|s|_{\alpha ,p,\infty }\). Therefore, by (4.3),

$$\begin{aligned} \left\| \sum _{\,k\in \Lambda (j)\,}\beta _{j,k}\varphi _{j,k}\right\| _\infty \le K^{\prime }2^{-j(\alpha -1/p)} |s|_{\alpha ,p,\infty } \end{aligned}$$

and, for \(J\ge 0\),

$$\begin{aligned} \left\| \sum _{j=0}^J\sum _{\,k\in \Lambda (j)\,}\beta _{j,k}\varphi _{j,k}\right\| _\infty \le \left\{ \begin{array}{ll}C|s|_{\alpha ,p,\infty }&{}\quad \text{ if } \alpha >1/p;\\ C(J+1)|s|_{\alpha ,p,\infty }&{}\quad \text{ if } \alpha =1/p;\\ C2^{J(1/p-\alpha )} |s|_{\alpha ,p,\infty }&{}\quad \text{ if } \alpha <1/p.\end{array}\right. \end{aligned}$$

Finally,

$$\begin{aligned} \left\| \sum _{j=-1}^J\sum _{\,k\in \Lambda (j)\,}\beta _{j,k}\varphi _{j,k}\right\| _\infty \!\le C_0L_J|s|_{\alpha ,p,\infty }\quad \text{ with }\quad L_J=\left\{ \begin{array}{ll}1&{} \quad \text{ if } \alpha >1/p;\\ (J+1)&{}\quad \text{ if } \alpha =1/p;\\ 2^{J(1/p-\alpha )}&{}\quad \text{ if } \alpha <1/p.\end{array}\right. \end{aligned}$$

Observing that if \(s=u+v\) with \(\Vert u\Vert _\infty \le z\), then \(Q_s(z)\le \Vert v\Vert ^2\), we can conclude from (4.7) that

$$\begin{aligned} Q_s\left( C_0L_J|s|_{\alpha ,p,\infty }\right) \le C2^{-2J\delta }|s|_{\alpha ,p,\infty }^2. \end{aligned}$$

Let us now turn back to the family of linear models described in Sect. 4.4.1 that satisfy (4.6). Theorem 3 asserts the existence of an estimator \(\widetilde{s}(\varvec{X})\) based on this family of models and satisfying

$$\begin{aligned} \mathbb{E }_s\left[ \left\| {s}-\widetilde{s}(\varvec{X})\right\| ^2\right] \le C\inf _{z\ge 2} \inf _{m\in \mathcal{M }}\left[ d_2^2\left( s,\overline{S}_m\right) + \frac{z\left( \,\overline{D}_m\vee \Delta _m\vee \log z\right) }{n}+Q_s(z)\right] . \end{aligned}$$

Given the integers \(J,J^{\prime }\), we may set \(z=z_{J^{\prime }}=C_0L_{J^{\prime }}|s|_{\alpha ,p,\infty }\) and restrict the minimization to \(m\in \mathcal{M }_J\) which leads by (4.6) to

$$\begin{aligned} \mathbb{E }_s\left[ \left\| {s}-\widetilde{s}(\varvec{X})\right\| ^2\right] \le C \left[ |s|_{\alpha ,p,\infty }^2\left( 2^{-2J\alpha }+2^{-2J^{\prime }\delta }\right) + n^{-1}L_{J^{\prime }}|s|_{\alpha ,p,\infty }\left( 2^J\vee \log z_{J^{\prime }}\right) \right] . \end{aligned}$$

Since \(L_{J^{\prime }}\left( 2^J\vee \log z_{J^{\prime }}\right) \) is a nondecreasing function of both \(J\) and \(J^{\prime }\), this last bound is optimized when \(J\alpha \) and \(J^{\prime }\delta \) are approximately equal which leads to choosing the integer \(J^{\prime }\) so that \(J\alpha /\delta \le J^{\prime }<J\alpha /\delta +1\), hence \(2^{-2J^{\prime }\delta }\le 2^{-2J\alpha }\). Assuming, moreover, that \(2^J\ge \log |s|_{\alpha ,p,\infty }\), which implies that \(2^J\ge C^{\prime }\log z_{J^{\prime }}\), we get

$$\begin{aligned} \mathbb{E }_s\left[ \left\| {s}-\widetilde{s}(\varvec{X})\right\| ^2\right] \le C|s|^2_{\alpha ,p,\infty } \left[ 2^{-2J\alpha }+2^J\left( n|s|_{\alpha ,p,\infty }\right) ^{-1}L_{J^{\prime }}\right] . \end{aligned}$$

We finally fix \(J\) so that \(2^J\ge G>2^{J-1}\), where \(G\) is defined below. This choice ensures that \(G\ge \log |s|_{\alpha ,p,\infty }\) for \(n\) large enough (depending on \(|s|_{\alpha ,p,\infty }\)), which we assume here.

  • If \(\alpha >1/p\) we set \(G=\left( n|s|_{\alpha ,p,\infty }\right) ^{1/(2\alpha +1)}\) which leads to a risk bound of the form

    $$\begin{aligned} Cn^{-2\alpha /(2\alpha +1)}\left( |s|_{\alpha ,p,\infty }\right) ^{(2\alpha +2)/(2\alpha +1)}. \end{aligned}$$
  • If \(\alpha =1/p\), then \(L_{J^{\prime }}<J\alpha /\delta +2\) and we take \(G=\left( n|s|_{\alpha ,p,\infty }/\log n\right) ^{1/(2\alpha +1)}\), which leads to the risk bound

    $$\begin{aligned} C(n/\log n)^{-2\alpha /(2\alpha +1)} \left( |s|_{\alpha ,p,\infty }\right) ^{(2\alpha +2)/(2\alpha +1)}. \end{aligned}$$
  • Finally, for \(\alpha <1/p\), \(L_{J^{\prime }}<\sqrt{2}\,2^{(J\alpha /\delta )(1/p-\alpha )}\) and we set \(G=\left( n|s|_{\alpha ,p,\infty }\right) ^{1/[\alpha +1+\alpha /(2\delta )]}\), which leads, as verified below, to the bound

    $$\begin{aligned} Cn^{-2\alpha /[\alpha +1+\alpha /(2\delta )]} \left( |s|_{\alpha ,p,\infty }\right) ^{(2+\alpha /\delta )/[\alpha +1+\alpha /(2\delta )]}. \end{aligned}$$
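
For the last case, the choice of \(G\) comes from balancing the two terms of the previous bound; a short check, using \(1/p-\alpha =1/2-\delta \), gives

$$\begin{aligned} 2^{-2J\alpha }\asymp \left( n|s|_{\alpha ,p,\infty }\right) ^{-1}2^{J[1+(\alpha /\delta )(1/2-\delta )]} \quad \Longleftrightarrow \quad 2^{J[\alpha +1+\alpha /(2\delta )]}\asymp n|s|_{\alpha ,p,\infty }, \end{aligned}$$

and the resulting risk, of order \(\left( |s|_{\alpha ,p,\infty }\right) ^22^{-2J\alpha }\), leads to the exponent \((2+\alpha /\delta )/[\alpha +1+\alpha /(2\delta )]\) of \(|s|_{\alpha ,p,\infty }\) displayed above.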

4.4.4 Some lower bounds

Lower bounds of the form \(n^{-2\alpha /(1+2\alpha )}\) for the minimax risk over Besov balls are well-known (deriving from lower bounds for Hölder spaces) and they are sharp for \(\alpha >1/p\), as shown in Donoho et al. [18]. To derive new lower bounds for the case \(\alpha <1/p\) we shall use the following proposition which results easily from classical arguments of Le Cam [23]—see also [19] or [35].

Proposition 5

Let \(X_1,\ldots ,X_n\) be i.i.d. observations with an unknown density belonging to a subset \(\mathcal{S }\) of \(\overline{\mathbb{L }}_1(\mu )\) and \(d\) a distance on \(\mathcal{S }\). Let \(t,u\in \mathcal{S }\) such that

$$\begin{aligned} h(t,u)=h(P_t,P_u)= an^{-1/2},\quad a<2^{-1/2}. \end{aligned}$$

Whatever the estimator \(\widehat{s}\) with values in \(\mathcal{S }\) and \(p\ge 1\),

$$\begin{aligned} \max \left\{ \mathbb{E }_t\left[ d^p(\widehat{s},t)\right] , \mathbb{E }_u\left[ d^p(\widehat{s},u)\right] \right\} \ge 2^{-p}\left( 1-a\sqrt{2}\right) d^p(t,u). \end{aligned}$$
(4.8)

Let us consider some probability density \(f\in B^\alpha _{p,\infty }([0,1])\) with compact support included in \((0,1)\) and Besov semi-norm \(|f|^\alpha _p\). We set \(g(x)=af(2anx)\) for some \(a>(2n)^{-1}\) to be fixed later. Then \(g(x)=0\) for \(x\not \in \left( 0,(2an)^{-1}\right) \),

$$\begin{aligned} \Vert g\Vert _q=a(2an)^{-1/q}\Vert f\Vert _q\qquad \text{ and }\qquad |g|^\alpha _p=a(2an)^{\alpha -1/p}|f|^\alpha _p. \end{aligned}$$

Let us now set \(t=g+\left[ 1-(2n)^{-1}\right] {1\!\!1}_{[0,1]}\), so that \(t\) is a density belonging to \(B^\alpha _{p,\infty }([0,1])\) with Besov semi-norm

$$\begin{aligned} |t|^\alpha _p= |g|^\alpha _p=Ka^{1+\alpha -1/p} n^{\alpha -1/p}\quad \text{ with }\quad K=2^{\alpha -1/p}|f|^\alpha _p. \end{aligned}$$

For a given value of the constant \(K^{\prime }>0\), the choice \(a=\left[ K^{\prime }n^{1/p-\alpha }\right] ^{1/(1+\alpha -1/p)}>(2n)^{-1}\) (at least for \(n\) large) leads to \(|t|^\alpha _p= KK^{\prime }\) so that \(K^{\prime }\) determines \(|t|^\alpha _p\). We also consider the density \(u(x)=t(1-x)\) which has the same Besov semi-norm. Then

$$\begin{aligned} h^2(t,u)=\int \nolimits _0^{(2an)^{-1}}\left( \!\sqrt{g+\left[ 1-(2n)^{-1}\right] } -\sqrt{1-(2n)^{-1}}\right) ^2 \!<\!\int \nolimits _0^{(2an)^{-1}}\!g=(2n)^{-1} \end{aligned}$$

and it follows from Proposition 5 that any estimator \(\widehat{s}\) based on \(n\) i.i.d. observations satisfies

$$\begin{aligned} \max \left\{ \mathbb{E }_t\left[ \Vert t-\widehat{s}\Vert ^2\right] , \mathbb{E }_u\left[ \Vert u-\widehat{s}\Vert ^2\right] \right\} \ge C\Vert t-u\Vert ^2=2C\Vert g\Vert ^2=Can^{-1}\Vert f\Vert ^2. \end{aligned}$$

Since \(an^{-1}=K^{\prime 1/(\delta +1/2)}n^{-2\delta /(\delta +1/2)}\), we finally get

$$\begin{aligned} \max \left\{ \mathbb{E }_t\left[ \Vert t-\widehat{s}\Vert ^2\right] , \mathbb{E }_u\left[ \Vert u-\widehat{s}\Vert ^2\right] \right\} \ge C^{\prime }\left( |t|^\alpha _p\right) ^{2/(2\delta +1)}n^{-4\delta /(2\delta +1)}, \end{aligned}$$

where \(C^{\prime }\) depends on \(K^{\prime },\, \Vert f\Vert ,\, |f|^\alpha _p\) and \(\delta \).

4.4.5 Conclusion

In the case \(\alpha >1/p\), the estimator that we built in Sect. 4.4.3 has the usual rate of convergence with respect to \(n\), namely \(n^{-2\alpha /(2\alpha +1)}\), which is known to be optimal, and we can extend the result to the borderline case \(\alpha =1/p\) with only a logarithmic loss. We do not know whether this additional logarithmic factor is necessary or not. When \(\alpha \le 1/p\) only partial results are known which do not involve \(\Vert {s}\Vert _\infty \). Efromovich [20] proves the same adaptive estimation rate \(n^{-2\alpha /(2\alpha +1)}\) for the Sobolev spaces \(W_2^\alpha (\mathbb R )\subsetneq B^\alpha _{2,\infty }(\mathbb R )\) (and even gets the exact asymptotic constants) and Reynaud-Bouret et al. [29] for \(B^\alpha _{p,\infty }(\mathbb R )\) with \(p>2\). As far as we are aware, nothing is known when \(p<2\) and for \(B^\alpha _{2,\infty }(\mathbb R )\setminus W_2^\alpha (\mathbb R )\). Our lower bound \(n^{-4\delta /(2\delta +1)}\) is slower than \(n^{-2\alpha /(1+2\alpha )}\) when \(0<\delta <\alpha [2(\alpha +1)]^{-1}\) or, equivalently, when \(\alpha +[2(\alpha +1)]^{-1}<1/p\) (which is only possible for \(p<2\)). This means that the minimax rate \(n^{-2\alpha /(1+2\alpha )}\) cannot hold in this range (even without adaptation) but this lower bound tells us nothing when \(1/2\le 1/p\le \alpha +[2(\alpha +1)]^{-1}\), in particular when \(p=2\).

In the range \(p<2\) and \(\alpha <1/p\), our upper bound \(n^{-2\alpha /[\alpha +1+\alpha /(2\delta )]}\) can be compared with the risk bound for the penalized least squares estimators based on the nested models \(\overline{S}^{\prime }_q\), which is, as we have seen, of order \(n^{-2\delta /(2\delta +1)}\). Our rate is better when \(\alpha >2\delta /(2\delta +1)\), which is always true for \(\alpha \ge 1/2\) since \(\delta <1/2\). When \(\alpha <1/2\) this requires that \(p<2(1-\alpha )/\left( 1-2\alpha ^2\right) \), which is true independently of \(\alpha \) when \(p<1+2^{-1/2}\). In any case, these upper bounds never match our lower bound \(n^{-4\delta /(2\delta +1)}\) and we have no idea about the true minimax rate (even without adaptation) although we suspect that the rate we have found is suboptimal in the range \(\alpha <1/p\).

4.5 Using a nonlinear model

Let us now come back to the parametric problem that we considered in Sect. 1.3. We can use the whole set \(\overline{S}=\{s_\theta ,\, 0<\theta \le 1/3\}\) as our model which, in this case, contains the true density \(s\) so that there is no approximation term \(d_2\left( s_\theta ,\overline{S}\right) \). It follows from Proposition 3 that the dimension of \(\overline{S}\) is bounded by 2 so that Theorem 3 applies leading to the following bound derived from (3.6):

$$\begin{aligned} \mathbb{E }_\theta \left[ \left\| {s}_\theta -\widetilde{s}(\varvec{X})\right\| ^2\right] \le C \inf _{z\ge 2}\left\{ n^{-1}z\log z+Q_{s_\theta }(z)\right\} \quad \text{ for } \text{ all } \theta \in (0,1/3]. \end{aligned}$$
(4.9)

For \(2\le z<\theta ^{-2},\, Q_{s_\theta }(z)=\theta ^3\left( \theta ^{-2}-z\right) ^2\) and \(Q_{s_\theta }(z)=0\) for \(z\ge \theta ^{-2}\). Optimizing the right-hand side of (4.9) with respect to \(z\) leads to the risk bound

$$\begin{aligned} \mathbb{E }_\theta \left[ \left\| {s}_\theta -\widetilde{s}(\varvec{X})\right\| ^2\right] \le C\theta ^{-1}\left[ (n\theta )^{-1}\log \left( \theta ^{-1}\right) \wedge 1\right] , \end{aligned}$$
(4.10)

which goes to infinity with \(\theta ^{-1}\).

Let us now see to what extent this result is sharp. It follows from Lemma 2 that if \(\lambda =\theta +(12n)^{-1}\), then \(h^2(\theta ,\lambda )<(8n)^{-1}\), hence \(h(\theta ,\lambda )<2^{-3/2}n^{-1/2}\). Also

$$\begin{aligned} d^2_2(\theta ,\lambda )>\theta ^{-1}-\left( \theta +(12n)^{-1}\right) ^{-1}= [\theta (n\theta +(1/12))]^{-1}\ge (2\theta )^{-1}\left[ (n\theta )^{-1}\wedge 12\right] . \end{aligned}$$

It then follows from (4.8) that, whatever the estimator \(\widehat{s}\), we get a lower bound for the risk of the form

$$\begin{aligned} \max \left\{ \mathbb{E }_\theta \left[ \left\| {s}_\theta \!-\!\widehat{s}(\varvec{X})\right\| ^2\right] , \mathbb{E }_\lambda \left[ \left\| {s}_\lambda \!-\!\widehat{s}(\varvec{X})\right\| ^2\right] \right\} \ge (8\theta )^{-1}\left[ (n\theta )^{-1}\wedge 12\right] ,\qquad \end{aligned}$$
(4.11)

which shows that (4.10) is optimal up to the logarithmic factor.

5 The construction of T-estimators for \(\mathbb L _2\)-loss

The construction will actually require several steps since the results of Birgé [4] cannot be applied straightforwardly. We recall that the construction of T-estimators of parameters belonging to the metric space \((M,d)\) relies on the existence of suitable tests between balls in this space. It is required that the errors of these tests satisfy some specific properties. Unfortunately, in the metric space \((\overline{\mathbb{L }}_2,d_2)\) tests with such properties cannot exist for arbitrary balls but can be built under the assumption that the centers of the two balls are bounded by some number \(\Gamma \), the performance of these tests depending on \(\Gamma \). With this result at hand, we can build estimators based on families of special models \(S_m\), following [4]. These models need to be discrete subsets of \(\overline{\mathbb{L }}^\Gamma _\infty \) (for some given \(\Gamma \)) with bounded metric dimension. Since there is no reason that our initial models \(\overline{S}_m\) be of this type (think of linear models), we shall have to build such special models \(S_m\) satisfying these conditions from ordinary ones. This construction will lead to an estimator \(\widehat{s}^{\,\Gamma }\) belonging to \(\overline{\mathbb{L }}^\Gamma _\infty \), the performance of which is given by Theorem 2. The last step involves the choice of \(\Gamma \) among the sequence \((2^{i+1})_{i\ge 1}\) as previously explained in Sect. 3.2.

5.1 Tests between \(\mathbb L _2\)-balls

To derive such tests, we need a few specific technical tools to deal with the \(\mathbb L _2\)-distance.

5.1.1 Randomizing our sample

In the sequel we shall make use of randomized tests based on a randomization trick due to Yang and Barron [34, page 106] which has the effect of replacing all densities involved in our problem by new ones which are uniformly bounded away from zero. For this, we choose some number \(\lambda \in (0,1)\) and consider the mapping \(\tau \) from \(\overline{\mathbb{L }}_2\) to \(\overline{\mathbb{L }}_2\) given by \(\tau (u)=\lambda u+1-\lambda \). Note that \(\tau \) is one-to-one and isometric, up to a factor \(\lambda \), i.e. \(d_2(\tau (u),\tau (v))= \lambda d_2(u,v)\). If \(u\in \overline{\mathbb{L }}^\Gamma _\infty \), then \(\tau (u)\in \overline{\mathbb{L }}_\infty ^{\Gamma ^{\prime }}\) with \(\Gamma ^{\prime }=\lambda \Gamma +1-\lambda \).

Let \(s^{\prime }=\tau (s)\). Given our initial i.i.d. sample \(\varvec{X}\), we want to build new i.i.d. variables \(X^{\prime }_1,\ldots ,X^{\prime }_n\) with density \(s^{\prime }\). For this, we consider two independent \(n\)-samples, \(Z_1,\ldots ,Z_n\) and \(\varepsilon _1,\ldots ,\varepsilon _n\) with respective distributions \(\mu \) and Bernoulli with parameter \(\lambda \). Both samples are independent of \(\varvec{X}\). We then set \(X^{\prime }_i=\varepsilon _iX_i+(1-\varepsilon _i)Z_i\) for \(1\le i\le n\). It follows that \(X^{\prime }_i\) has density \(s^{\prime }\) as required. We shall still denote by \(\mathbb P _s\) the probability on \(\Omega \) that gives \(\varvec{X}^{\prime }=(X^{\prime }_1,\ldots ,X^{\prime }_n)\) the distribution \(P_{s^{\prime }}^{\otimes n}\). Given two distinct points \(t,u\in \overline{\mathbb{L }}_2\) we define a test function \(\psi (\varvec{X}^{\prime })\) between \(t\) and \(u\) as a measurable function with values in \(\{t,u\},\, \psi (\varvec{X}^{\prime })=t\) meaning deciding \(t\) and \(\psi (\varvec{X}^{\prime })=u\) meaning deciding \(u\).

Once we have used the randomization trick of Yang and Barron, for instance with \(\lambda =1/2\), we deal with an i.i.d. sample \(\varvec{X}^{\prime }\) with a density \(s^{\prime }\) which is bounded from below by \(1/2\) and we may therefore work within the set of densities that satisfy this property.
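
A minimal numerical sketch of this randomization (our own code, for illustration only) follows; we take \(\mu \) to be the uniform distribution on \([0,1]\), an assumption made here solely to keep the example self-contained.

import numpy as np

def randomize(sample, lam, rng):
    # Yang-Barron randomization: X'_i = eps_i * X_i + (1 - eps_i) * Z_i with
    # eps_i ~ Bernoulli(lam) and Z_i ~ mu, so that X'_i has density
    # tau(s) = lam * s + 1 - lam, bounded from below by 1 - lam.
    n = len(sample)
    eps = rng.random(n) < lam
    z = rng.random(n)                  # draws from mu = uniform on [0,1]
    return np.where(eps, sample, z)

rng = np.random.default_rng(2)
x = rng.beta(2.0, 5.0, size=1000)
x_prime = randomize(x, lam=0.5, rng=rng)   # with lam = 1/2, the new density is >= 1/2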

5.1.2 Preliminary results about tests between some convex sets

The main tool for the design of tests between \(\mathbb L _2\)-balls of densities is the following proposition which derives from the results of Birgé [3] (keeping here the notations of that paper) and in particular from Corollary 3.2, specialized to the case of \(I=\{t\}\) and \(c=0\).

Proposition 6

Let \(\mathcal{M }\) be some linear space of finite measures on some measurable space \((\Omega ,\mathcal{A })\) with a topology of a locally convex separated linear space. Let \(\mathcal{P },\mathcal{Q }\) be two disjoint sets of probabilities in \(\mathcal{M }\) and \(F\) a set of positive measurable functions on \(\Omega \) with the following properties (with respect to the given topology on \(\mathcal{M }\)):

  1. (i)

    \(\mathcal{P }\) and \(\mathcal{Q }\) are convex and compact;

  2. (ii)

    for any \(f\in F\) and \(0<z<1\) the function \(P\mapsto \int f^z\,dP\) is well-defined and upper semi-continuous on \(\mathcal{P }\cup \mathcal{Q }\);

  3. (iii)

    for any \(P\in \mathcal{P },\, Q\in \mathcal{Q },\, t\in (0,1)\) and \(\varepsilon >0\), there exists an \(f\in F\) such that

    $$\begin{aligned} (1-t)\int f^t\,dP+t\int f^{1-t}\,dQ<\int (dP)^{1-t}(dQ)^t+\varepsilon ; \end{aligned}$$
  4. (iv)

    all probabilities in \(\mathcal{P }\) (respectively in \(\mathcal{Q }\)) are mutually absolutely continuous.

Then one can find \(\overline{P}\in \mathcal{P }\) and \(\overline{Q}\in \mathcal{Q }\) such that

$$\begin{aligned} \sup _{P\in \mathcal{P }}\int \left( \frac{\overline{Q}}{\overline{P}}\right) ^tdP&= \sup _{Q\in \mathcal{Q }}\int \left( \frac{\overline{P}}{\overline{Q}}\right) ^{1-t}dQ = \sup _{P\in \mathcal{P },Q\in \mathcal{Q }}\int (dP)^{1-t}(dQ)^t\\&= \int \left( d\overline{P}\right) ^{1-t}\left( d\overline{Q}\right) ^t. \end{aligned}$$

In Birgé [3], we assumed that \(\mathcal{M }\) was the set of all finite measures on \((\Omega ,\mathcal{A })\) but the proof actually only uses the fact that \(\mathcal{P }\) and \(\mathcal{Q }\) are subsets of \(\mathcal{M }\). Recalling that the Hellinger affinity between two densities \(u\) and \(v\) is defined by \(\rho (u,v)=\int \sqrt{uv}\,d\mu =1-h^2(u,v)\), we get the following corollary.

Corollary 3

Let \(\mu \) be a probability measure on \((\mathcal{X },\mathcal{W })\) and, for \(1\le i\le n\), let \(\left( \mathcal{P }_i,\mathcal{Q }_i\right) \) be a pair of disjoint convex and weakly compact subsets of \(\mathbb L _2(\mu )\) such that

$$\begin{aligned} s>0\; \mu \text{-a.s. }\quad \text{ and }\quad \int s\,d\mu =1\quad \text{ for } \text{ all } s\in \bigcup _{i=1}^n\left( \mathcal{P }_i\cup \mathcal{Q }_i\right) \!. \end{aligned}$$
(5.1)

For each \(i\), one can find \(p_i\in \mathcal{P }_i\) and \(q_i\in \mathcal{Q }_i\) such that

$$\begin{aligned} \sup _{u\in \mathcal{P }_i}\int \sqrt{q_i/p_i}\,u\,d\mu =\sup _{v\in \mathcal{Q }_i} \int \sqrt{p_i/q_i}\,v\,d\mu =\sup _{u\in \mathcal{P }_i,v\in \mathcal{Q }_i}\rho (u,v) =\rho (p_i,q_i). \end{aligned}$$

Let \(\varvec{X}=(X_1,\ldots ,X_n)\) be a random vector on \(\mathcal{X }^n\) with distribution \(\bigotimes _{i=1}^n(s_i\cdot \mu )\) with \(s_i\in \mathcal{P }_i\) for \(1\le i\le n\) and let \(x\in \mathbb R \). Then

$$\begin{aligned} \mathbb P \left[ \sum _{i=1}^n\log (q_i/p_i)(X_i)\ge 2x\right] \le e^{-x}\prod _{i=1}^n\rho (p_i,q_i)\le \exp \left[ -x-\sum _{i=1}^nh^2(p_i,q_i)\right] . \end{aligned}$$

If \(\varvec{X}\) has distribution \(\bigotimes _{i=1}^n(u_i\cdot \mu )\) with \(u_i\in \mathcal{Q }_i\) for \(1\le i\le n\), then

$$\begin{aligned} \mathbb P \left[ \sum _{i=1}^n\log (q_i/p_i)(X_i)\le 2x\right] \le e^x\prod _{i=1}^n\rho (p_i,q_i)\le \exp \left[ x-\sum _{i=1}^nh^2(p_i,q_i)\right] . \end{aligned}$$

Proof

We apply Proposition 6 with \(t=1/2,\, (\mathcal{X },\mathcal{W })=(\Omega ,\mathcal{A })\) and \(\mathcal{M }\) the set of measures of the form \(u\cdot \mu ,\, u\in \mathbb L _2(\mu )\) endowed with the weak \(\mathbb L _2\)-topology. In view of (5.1), \(\mathcal{P }_i\) and \(\mathcal{Q }_i\) can be identified with two sets of probabilities and we can take for \(F\) the set of all positive functions such that \(\log f\) is bounded. As a consequence, all four assumptions of Proposition 6 are satisfied. In order to get (iii) we simply take for \(f\) a suitably truncated version of \(s/u\) when \(P=s\cdot \mu \) and \(Q=u\cdot \mu \). As to the probability bounds they derive from classical exponential inequalities, as for Lemma 7 of Birgé [4].\(\square \)

5.1.3 Abstract tests between \(\mathbb L _2\)-balls

The purpose of this section is to prove the following result, which is of independent interest, about the performance of some tests between \(\mathbb L _2\)-balls.

Theorem 5

Let \(t,u\in \overline{\mathbb{L }}^\Gamma _\infty \) for some \(\Gamma \in (1,+\infty )\). For any \(x\in \mathbb R \), there exists a test \(\psi _{t,u,x}\) between \(t\) and \(u\), based on the randomized sample \(\varvec{X}^{\prime }\) defined in Sect. 5.1.1 with \(\lambda =\sqrt{64/65}\), which satisfies

$$\begin{aligned} \sup _{\left\{ s\in \overline{\mathbb{L }}_2\,|\,d_2(s,t)\le d_2(t,u)/4\right\} }\, \mathbb P _s[\psi _{t,u,x}(\varvec{X}^{\prime })=u]\le \exp \left[ -\frac{n\left( \Vert t-u\Vert ^2+x\right) }{65\Gamma }\right] \end{aligned}$$
(5.2)

and

$$\begin{aligned} \sup _{\left\{ s\in \overline{\mathbb{L }}_2\,|\,d_2(s,u)\le d_2(t,u)/4\right\} }\, \mathbb P _s[\psi _{t,u,x}(\varvec{X}^{\prime })=t]\le \exp \left[ -\frac{n\left( \Vert t-u\Vert ^2-x\right) }{65\Gamma }\right] . \end{aligned}$$
(5.3)

Proof

It requires several steps. To begin with, we use the randomization trick of Yang and Barron described in Sect. 5.1.1, replacing our original sample \(\varvec{X}\) by the randomized sample \(\varvec{X}^{\prime }=(X^{\prime }_1,\ldots ,X^{\prime }_n)\) for some convenient value of \(\lambda \) to be chosen later. Each \(X^{\prime }_i\) has density \(s^{\prime }\ge 1-\lambda \) when \(X_i\) has density \(s\). Then we build a test between \(t^{\prime }=\tau (t)\) and \(u^{\prime }=\tau (u)\) based on \(\varvec{X}^{\prime }\) and Corollary 3. To do this, we set \(\Delta =\Vert t-u\Vert \),

$$\begin{aligned} \mathcal{P }=\tau \left( \mathcal{B }_{d_2}(t,\Delta /4)\cap \overline{\mathbb{L }}_2\right) \quad \text{ and }\quad \mathcal{Q }=\tau \left( \mathcal{B }_{d_2}(u,\Delta /4)\cap \overline{\mathbb{L }}_2\right) \!. \end{aligned}$$

Then \(\mathcal{P }\) is the subset of the ball \(\mathcal{B }_{d_2}(t^{\prime },\lambda \Delta /4)\) of those densities bounded from below by \(1-\lambda \). Densities \(v\) with such properties are characterized by the fact that \(\langle v,{1\!\!1}_\mathcal{X }\rangle =1\) (\({1\!\!1}_\mathcal{X }\in \mathbb L _2(\mu )\) because \(\mu \) is a probability) and \(\langle v,{1\!\!1}_A\rangle \ge (1-\lambda )\mu (A)\) for any measurable set \(A\), a fact which is preserved under weak convergence and convex combinations. This shows that \(\mathcal{P }\) is convex and weakly closed. Since \(\mathcal{B }_{d_2}(t^{\prime },\lambda \Delta /4)\) is weakly compact, it is also the case for \(\mathcal{P }\) and the same argument shows that \(\mathcal{Q }\) is also convex and weakly compact. Moreover \(d_2(\mathcal{P },\mathcal{Q })\ge \lambda \Delta /2\). It then follows from Corollary 3 that one can find \(\bar{t}\in \mathcal{P }\) and \(\bar{u}\in \mathcal{Q }\) such that

$$\begin{aligned} \mathbb P _s\left[ \sum _{i=1}^n\log \left( \bar{u}(X^{\prime }_i)/\bar{t}(X^{\prime }_i)\right) \ge 2y\right] \le \exp \left[ -nh^2\left( \bar{t},\bar{u}\right) -y\right] \quad \text{ if } s\in \mathcal{P }, \end{aligned}$$
(5.4)

while

$$\begin{aligned} \mathbb P _s\left[ \sum _{i=1}^n\log \left( \bar{u}(X^{\prime }_i)/\bar{t}(X^{\prime }_i)\right) \le 2y\right] \le \exp \left[ -nh^2\left( \bar{t},\bar{u}\right) +y\right] \quad \text{ if } s\in \mathcal{Q }. \end{aligned}$$
(5.5)

Fixing \(y=nx/(65\Gamma )\), we finally define \(\psi _{t,u,x}(\varvec{X}^{\prime })\) by setting \(\psi _{t,u,x}(\varvec{X}^{\prime })=u\) if and only if \(\sum _{i=1}^n\log \left( \bar{u}(X^{\prime }_i)/\bar{t}(X^{\prime }_i)\right) \ge 2y\). Since \(s^{\prime }\in \mathcal{P }\) is equivalent to \(s\in \mathcal{B }_{d_2}(t,\Delta /4)\), i.e. \(d_2(s,t)\le \Delta /4\), and similarly \(s^{\prime }\in \mathcal{Q }\) is equivalent to \(d_2(s,u)\le \Delta /4\), to derive (5.2) and (5.3) from (5.4) and (5.5), we just have to show that \(h^2\left( \bar{t},\bar{u}\right) \ge (65\Gamma )^{-1}\Delta ^2\). We start from the fact, to be proved below, that

$$\begin{aligned} \Vert \bar{t}\vee \bar{u}\Vert _\infty \le 2(\lambda \Gamma +1-\lambda ). \end{aligned}$$
(5.6)

It implies that

$$\begin{aligned} h^2\left( \bar{t},\bar{u}\right)&= \frac{1}{2}\int \left( \sqrt{\bar{t}} -\sqrt{\bar{u}}\right) ^2 \,d\mu = \frac{1}{2}\int \frac{\left( \bar{t}-\bar{u}\right) ^2}{\left( \sqrt{\bar{t}}+ \sqrt{\bar{u}}\right) ^2}\,d\mu \ge \frac{\Vert \bar{t}-\bar{u}\Vert ^2}{16(\lambda \Gamma +1-\lambda )}\\&\ge \frac{(\lambda \Delta )^2}{64(\lambda \Gamma +1-\lambda )} = \frac{\Delta ^2}{65\Gamma [\lambda +\Gamma ^{-1}(1-\lambda )]} \ge \frac{\Delta ^2}{65\Gamma }, \end{aligned}$$

since \(\Gamma >1\). As to (5.6), it is a consequence of the next lemma to be proved in Sect. 6.2. We apply this lemma to the pair \(t^{\prime },u^{\prime }\) which satisfies \(\Vert t^{\prime }\vee u^{\prime }\Vert _\infty \le \lambda \Gamma +1-\lambda \). If (5.6) were wrong, we could find \(\bar{t}^{\prime }\in \mathcal{P }\) and \(\bar{u}^{\prime }\in \mathcal{Q }\) with \(h\left( \bar{t}^{\prime },\bar{u}^{\prime }\right) <h\left( \bar{t},\bar{u}\right) \), which, by Corollary 3, is impossible.\(\square \)

Lemma 1

Let us consider four elements \(t,u,v_1,v_2\) in \(\overline{\mathbb{L }}_2\) with \(t\ne u,\, v_1\ne v_2\) and \(\Vert t\vee u\Vert _\infty =B\). If \(\Vert v_1\vee v_2\Vert _\infty >2B\), there exists \(v^{\prime }_1,v^{\prime }_2\in \overline{\mathbb{L }}_2\) with \(d_2(v^{\prime }_1,t)\le d_2(v_1,t),\, d_2(v^{\prime }_2,u)\le d_2(v_2,u)\) and \(h(v^{\prime }_1,v^{\prime }_2)<h(v_1,v_2)\).

5.2 The performance of T-estimators for discrete models

We are now in a position to prove an analogue of Corollary 6 of [4].

Theorem 6

Assume that we observe \(n\) i.i.d. random variables with unknown density \(s\in (\overline{\mathbb{L }}_2,d_2)\) and that we have at our disposal a countable family of discrete subsets \(\{S_m\}_{m\in \mathcal{M }}\) of \(\overline{\mathbb{L }}^\Gamma _\infty \) for some given \(\Gamma >1\). Let each set \(S_m\) satisfy

$$\begin{aligned} |S_m\cap \mathcal{B }_{d_2}(t,x\eta _m)|\le \exp \left[ D_mx^2\right] \quad \text{ for } \text{ all } x\ge 2 \text{ and } t\in \overline{\mathbb{L }}_2, \end{aligned}$$
(5.7)

with \(\eta _m>0,\, D_m\ge 1/2\),

$$\begin{aligned} \eta _m^2\ge \frac{273\Gamma D_m}{n}\quad \text{ for } \text{ all } m\in \mathcal{M }\quad \text{ and } \quad \sum _{m\in \mathcal{M }}\exp \left[ -\frac{n\eta _m^2}{1365\Gamma }\right] =\Sigma ^{\prime }<+\infty . \nonumber \\ \end{aligned}$$
(5.8)

Then one can build a T-estimator \(\widehat{s}\) such that, for all \(s\in \overline{\mathbb{L }}_2\),

$$\begin{aligned} \mathbb{E }_s\left[ d_2^q(s,\widehat{s}\,)\right] \le C_q(\Sigma ^{\prime }+1)\inf _{m\in \mathcal{M }} \left\{ d_2(s,S_m)\vee \eta _m\right\} ^q,\quad \text{ for } \text{ all } q\ge 1. \end{aligned}$$
(5.9)

Proof

Since (5.9) is merely a version of (7.6) of [4] with \(d=d_2\), we just have to show that Theorem 5 of [4] applies to our situation. It relies on Assumptions 1 and 3 of that paper. Assumption 3 follows from (5.7). As to Assumption 1 (with \(a=n/(65\Gamma ),\, B=B^{\prime }=1\) and \(\delta =4d_2\), hence \(\kappa =4\)), it is a consequence of our Theorem 5. The conditions (7.2) and (7.4) of [4] on \(\eta _m\) and \(D_m\) follow from (5.8).\(\square \)

In the case of a single \(D\)-dimensional model \(\overline{S}\subset \overline{\mathbb{L }}^\Gamma _\infty \) we get the following corollary:

Corollary 4

Assume that we observe \(n\) i.i.d. random variables with unknown distribution \(P_s,\, s\in (\overline{\mathbb{L }}_2,d_2)\) and that we have at our disposal a \(D\)-dimensional model \(\overline{S}\subset \overline{\mathbb{L }}^\Gamma _\infty \) for some given \(\Gamma >1\). One can build a T-estimator \(\widehat{s}\) such that, for all \(s\in \overline{\mathbb{L }}_2\),

$$\begin{aligned} \mathbb{E }_s\left[ \Vert {s}-\widehat{s}\Vert ^2\right] \le C\left[ \inf _{t\in \overline{S}} d_2^2(s,t)+n^{-1}D\Gamma \right] . \end{aligned}$$

Proof

By Definition 1 and the remark following it, for each \(\eta _0>0\), one can find an \(\eta _0\)-net \(S_0\subset \overline{S}\) for \(\overline{S}\), hence \(S_0\subset \overline{\mathbb{L }}^\Gamma _\infty \), satisfying (5.7) with \(D_0=25D/4\). Moreover \(d(s,S_0)\le \eta _0+d\left( s,\overline{S}\right) \). Choosing \(\eta _0^2=273\times 25\Gamma D/(4n)\), we may apply Theorem 6. The result then follows from (5.9) with \(q=2\).\(\square \)

Theorem 6 applies in particular to the special situation of each model \(S_m\) being reduced to a single point \(\{t_m\}\) so that we can take \(D_m=1/2\) for each \(m\). We then get the following useful corollary.

Corollary 5

Assume that we observe \(n\) i.i.d. random variables with unknown distribution \(P_s,\, s\in (\overline{\mathbb{L }}_2,d_2)\) and that we have at our disposal a countable subset \(S=\{t_m\}_{m\in \mathcal{M }}\) of \(\overline{\mathbb{L }}^\Gamma _\infty \) for some given \(\Gamma >1\). Let \(\{\Delta _m\}_{m\in \mathcal{M }}\) be a family of weights satisfying (3.1) and such that \(\Delta _m\ge 1/10\) for all \(m\in \mathcal{M }\). We can build a T-estimator \(\widehat{s}\) such that, for all \(s\in \overline{\mathbb{L }}_2\),

$$\begin{aligned} \mathbb{E }_s\left[ d_2^q(s,\widehat{s}\,)\right] \le C_q\Sigma \inf _{m\in \mathcal{M }} \left\{ d_2(s,t_m)\vee \sqrt{\Gamma \Delta _m/n}\right\} ^q\quad \text{ for } \text{ all } q\ge 1. \end{aligned}$$

Proof

Let us set here \(S_m=\{t_m\},\, D_m=1/2\) and \(\eta _m=37\sqrt{\Gamma \Delta _m/n}\) for \(m\in \mathcal{M }\). One can then check that (5.7) and (5.8) are satisfied so that (5.9) holds. Our risk bound follows.\(\square \)
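
For completeness, the verification is immediate (with the usual reading of (3.1) as a bound on \(\sum _{m\in \mathcal{M }}e^{-\Delta _m}\)): (5.7) holds trivially since \(|S_m|=1\le e^{x^2/2}\), while

$$\begin{aligned} \eta _m^2=\frac{1369\,\Gamma \Delta _m}{n}\ge \frac{1369\,\Gamma }{10n}\ge \frac{273\Gamma D_m}{n} \quad \text{ and }\quad \frac{n\eta _m^2}{1365\Gamma }=\frac{1369\,\Delta _m}{1365}\ge \Delta _m, \end{aligned}$$

so that \(\Sigma ^{\prime }\le \sum _{m\in \mathcal{M }}e^{-\Delta _m}<+\infty \).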

5.3 Model selection with uniformly bounded models

At this stage, a major difficulty remains in order to apply Theorem 6 or Corollary 5: we have to build suitable subsets \(S_m\) (or \(S\)) of \(\overline{\mathbb{L }}^\Gamma _\infty \) from classical approximating sets (models) in \(\mathbb{L }_2(\mu )\), finite-dimensional linear spaces for instance. We shall now address this problem.

5.3.1 The projection operator onto \(\overline{\mathbb{L }}^\Gamma _\infty \)

Our first task is to define a projection operator \(\pi _\Gamma \) from \(\mathbb L _2(\mu )\) onto \(\overline{\mathbb{L }}^\Gamma _\infty \) (\(\Gamma >1\)) and to study its properties. In the sequel, we systematically identify a real number \(a\) with the function \(a{1\!\!1}_{\mathcal{X }}\) for the sake of simplicity. The following proposition is the corrected version, by Yannick Baraud, of the initially mistaken result of the author.

Proposition 7

For \(t\in \mathbb L _2(\mu )\) and \(1<\Gamma <+\infty \) we set \(\pi _\Gamma (t)=[(t+\gamma )\vee 0]\wedge \Gamma \) where \(\gamma \) is defined by \(\int [(t+\gamma )\vee 0]\wedge \Gamma \,d\mu =1\). Then \(\pi _\Gamma \) is the projection operator from \(\mathbb L _2(\mu )\) onto the convex set \(\overline{\mathbb{L }}^\Gamma _\infty \). Moreover, if \(s\in \overline{\mathbb{L }}_2\) and \(\Gamma >2\), then

$$\begin{aligned} \Vert {s}-\pi _\Gamma (s)\Vert ^2\le \frac{\Gamma ^2-\Gamma -1}{\Gamma (\Gamma -2)}Q_s(\Gamma ), \end{aligned}$$

with \(Q_s(z)\) given by (2.3).

Proof

First note that the existence of \(\gamma \) follows from the continuity and monotonicity of the mapping \(z\mapsto \int [(t+z)\vee 0]\wedge \Gamma \,d\mu \) and that \(\pi _\Gamma (t) \in \overline{\mathbb{L }}^\Gamma _\infty \). Since \(\overline{\mathbb{L }}^\Gamma _\infty \) is a closed convex subset of a Hilbert space, the projection operator \(\pi \) onto \(\overline{\mathbb{L }}^\Gamma _\infty \) exists and is characterized by the fact that

$$\begin{aligned} \langle t-\pi (t),u-\pi (t)\rangle \le 0\quad \text{ for } \text{ all } u\in \overline{\mathbb{L }}^\Gamma _\infty . \end{aligned}$$
(5.10)

Since \(\int [u-\pi (t)]\,d\mu =0\) for \(u\in \overline{\mathbb{L }}^\Gamma _\infty \), (5.10) implies that \(\langle t+z-\pi (t),u-\pi (t)\rangle \le 0\) for \(z\in \mathbb R \), hence \(\pi (t)=\pi (t+z)\). Since \(\pi _\Gamma (t)=\pi _\Gamma (t+z)\) as well, we may assume that \(\int [t\vee 0]\wedge \Gamma \,d\mu =1\), hence \(\pi _\Gamma (t)= [t\vee 0]\wedge \Gamma \) and \(\pi _\Gamma (t)=t\) on the set \(0\le t\le \Gamma \). Then, for \(u\in \overline{\mathbb{L }}^\Gamma _\infty \),

$$\begin{aligned} \langle t-\pi _\Gamma (t),u-\pi _\Gamma (t)\rangle =\int \nolimits _{t<0}tu\,d\mu +\int \nolimits _{t>\Gamma } (t-\Gamma )(u-\Gamma )\,d\mu \le 0, \end{aligned}$$

since \(0\le u\le \Gamma \). This concludes the proof that \(\pi =\pi _\Gamma \).

Let us now bound \(\left\| {s}-\pi _\Gamma (s)\right\| \) when \(s\in \overline{\mathbb{L }}_2\), setting \(s=s\wedge \Gamma +v\) with \(v=(s-\Gamma ){1\!\!1}_{s>\Gamma }\). Since there is nothing to prove when \(\Vert {s}\Vert _\infty \le \Gamma \), we assume that \(\int v\,d\mu >0\). By Cauchy-Schwarz inequality,

$$\begin{aligned} \left( \int v\,d\mu \right) ^2\le \mu (\{s>\Gamma \})\int v^2\,d\mu \le \Gamma ^{-1}\Vert v\Vert ^2. \end{aligned}$$
(5.11)

Moreover, since \(\int s\wedge \Gamma \,d\mu <1,\, \pi _\Gamma (s)= (s+\gamma )\wedge \Gamma \) with \(0<\gamma \le 1\). Hence

$$\begin{aligned} 1&= \int [(s+\gamma )\wedge \Gamma ]\,d\mu \ge \int (s\wedge \Gamma )\,d\mu +\gamma \mu (\{s\le \Gamma -\gamma \}) \\&\ge 1-\int v\,d\mu +\gamma \left( 1- \frac{1}{\Gamma -\gamma }\right) > 1-\int v\,d\mu +\gamma \frac{\Gamma -2}{\Gamma -1} \end{aligned}$$

and \(\gamma <\left[ (\Gamma -1)/(\Gamma -2)\right] \int v\,d\mu \). Now, since \(0\le \pi _\Gamma (s)-s \le \gamma \) when \(s\le \Gamma \),

$$\begin{aligned} \Vert {s}-\pi _\Gamma (s)\Vert ^2&= \int \nolimits _{s\le \Gamma }[\pi _\Gamma (s)-s]^2\,d\mu + \Vert v\Vert ^2 \le \gamma \int \nolimits _{s\le \Gamma }[\pi _\Gamma (s)-s]\,d\mu +\Vert v\Vert ^2\\&< \frac{\Gamma -1}{\Gamma -2}\left( \int v\,d\mu \right) \int \nolimits _{s>\Gamma }[s-\pi _\Gamma (s)]\,d\mu +\Vert v\Vert ^2\\&\le \frac{\Gamma -1}{\Gamma -2}\left( \int v\,d\mu \right) ^2+\Vert v\Vert ^2 \le \left( 1+\frac{\Gamma -1}{\Gamma (\Gamma -2)}\right) \Vert v\Vert ^2, \end{aligned}$$

where we used (5.11). This concludes our proof.\(\square \)
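
A numerical sketch of \(\pi _\Gamma \) (our own code, on a discretized \(\mu \), given only for illustration) can be obtained by bisection, using the monotonicity of \(\gamma \mapsto \int [(t+\gamma )\vee 0]\wedge \Gamma \,d\mu \) noted at the beginning of the proof; all names and the choice of \(\mu \) below are ours.

import numpy as np

def project_pi_gamma(t_values, cap, weights):
    # pi_Gamma(t) on a grid: find gamma such that [(t + gamma) v 0] ^ cap
    # integrates to 1 (weights are the masses of a discretized probability mu,
    # summing to 1, and cap = Gamma > 1), then return the clipped function.
    def mass(gamma):
        return np.sum(weights * np.clip(t_values + gamma, 0.0, cap))
    lo = -np.max(t_values)                    # mass(lo) = 0 < 1
    hi = cap + np.max(np.abs(t_values))       # mass(hi) = cap > 1
    for _ in range(100):                      # bisection on the monotone map gamma -> mass(gamma)
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if mass(mid) < 1.0 else (lo, mid)
    return np.clip(t_values + 0.5 * (lo + hi), 0.0, cap)

grid = np.linspace(0.0, 1.0, 1001)
w = np.full(grid.shape, 1.0 / grid.size)      # mu = uniform weights on the grid
t = np.sin(6.0 * grid) + 1.0                  # an arbitrary element of L2(mu)
pi_t = project_pi_gamma(t, cap=3.0, weights=w)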

5.3.2 Selection with uniformly bounded models

Typical models \(\overline{S}\) for density estimation in \(\mathbb L _2(\mu )\) are finite-dimensional linear spaces which are not subsets of \(\overline{\mathbb{L }}^\Gamma _\infty \) but merely spaces of functions with nice approximation properties. To apply Theorem 6 we have to replace them by discrete subsets of \(\overline{\mathbb{L }}^\Gamma _\infty \) that satisfy (5.7). Unfortunately, they cannot simply be derived by a discretization of \(\overline{S}\) followed by a projection \(\pi _\Gamma \) or a discretization of \(\pi _\Gamma \left( \overline{S}\right) \). A more complicated construction is required to preserve both the metric and approximation properties of \(\overline{S}\). It is provided by the following preliminary result.

Proposition 8

Let \(\overline{S}\) be a subset of \(\mathbb L _2(\mu )\) with metric dimension bounded by \(D\). For \(\Gamma >2\) and \(\eta >0\), one can find a discrete subset \(S^{\prime }\) of \(\,\overline{\mathbb{L }}^\Gamma _\infty \) with the following properties:

$$\begin{aligned} |S^{\prime }\cap \mathcal{B }_{d_2}(t,x\eta )|\le \exp \left[ 9Dx^2\right] \quad \text{ for } \text{ all } x\ge 2 \text{ and } t\in \mathbb L _2(\mu ); \end{aligned}$$
(5.12)

for any \(s\in \overline{\mathbb{L }}_2\), one can find some \(s^{\prime }\in S^{\prime }\) such that

$$\begin{aligned} \Vert {s}-s^{\prime }\Vert \le 3.1\left[ \eta +\inf _{t\in \overline{S}}\Vert {s}-t\Vert \right] + 4.1\left( \frac{\Gamma ^2-\Gamma -1}{\Gamma (\Gamma -2)}Q_s(\Gamma )\right) ^{1/2}. \end{aligned}$$
(5.13)

Proof

According to Definition 1, we choose some \(\eta \)-net \(S_\eta \) for \(\overline{S}\) such that (1.12) holds for all \(t\in \mathbb L _2(\mu )\). Since, by Proposition 7, the operator \(\pi _\Gamma \) from \(\mathbb L _2(\mu )\) to \(\overline{\mathbb{L }}^\Gamma _\infty \) satisfies \(\Vert u-\pi _\Gamma (t)\Vert \le \Vert u-t\Vert \) for all \(u\in \overline{\mathbb{L }}^\Gamma _\infty \), we may apply Proposition 12 of [4] with \(M^{\prime }=\mathbb L _2(\mu ),\, d=d_2,\, M_0=\overline{\mathbb{L }}^\Gamma _\infty ,\, T=S_\eta ,\, \overline{\pi }=\pi _\Gamma \) and \(\lambda =1\). It shows that one can find a subset \(S^{\prime }\) of \(\pi _\Gamma (S_\eta )\) such that (5.12) holds and \(d_2(u,S^{\prime })\le 3.1d_2(u,S_\eta )\) for all \(u\in \overline{\mathbb{L }}^\Gamma _\infty \). If \(s\) is an arbitrary element of \(\overline{\mathbb{L }}_2\), then

$$\begin{aligned} d_2\left( \pi _\Gamma (s), S^{\prime }\right) \le 3.1d_2\left( \pi _\Gamma (s), S_\eta \right) \le 3.1\left[ d_2\left( \pi _\Gamma (s), s\right) +d_2\left( s,\overline{S}\right) +\eta \right] , \end{aligned}$$

hence

$$\begin{aligned} d_2\left( s, S^{\prime }\right) \le 3.1\left[ d_2\left( s,\overline{S}\right) +\eta \right] + 4.1d_2\left( \pi _\Gamma (s), s\right) . \end{aligned}$$
(5.14)

The conclusion follows from Proposition 7.\(\square \)

We are now in a position to derive our main result about bounded model selection. We start with a countable collection \(\{\overline{S}_m,m\in \mathcal{M }\}\) of models in \(\mathbb L _2(\mu )\) with metric dimensions bounded respectively by \(\overline{D}_m\ge 1/2\) and a family of weights \(\Delta _m\) satisfying (3.1). We fix some \(\Gamma \ge 3\) and, for each \(m\in \mathcal{M }\), we set

$$\begin{aligned} \eta _m=\left[ \left( 50\sqrt{\overline{D}_m}\right) \vee \left( 37\sqrt{\Delta _m}\right) \right] \sqrt{\Gamma /n}. \end{aligned}$$

By Proposition 8 (with \(\eta =\eta _m\)), each \(\overline{S}_m\) gives rise to a subset \(S^\Gamma _m\) which satisfies (5.7) with \(D_m=9\overline{D}_m\). It follows from our choice of \(\eta _m\) that (5.8) is also satisfied so that we may apply Theorem 6 to the family of sets \(\left\{ S^\Gamma _m,m\in \mathcal{M }\right\} \). This results in a T-estimator \(\widehat{s}^{\,\Gamma }\) such that, for all \(s\in \overline{\mathbb{L }}_2\),

$$\begin{aligned} \mathbb{E }_s\left[ d_2^q\!\left( s,\widehat{s}^{\,\Gamma }\right) \right] \le C_q\Sigma \inf _{m\in \mathcal{M }}\left\{ d_2\!\left( s,S^\Gamma _m\right) \vee \eta _m \right\} ^q\quad \text{ for } q\ge 1. \end{aligned}$$

We also derive from Proposition 8 that

$$\begin{aligned} d_2\!\left( s,S^\Gamma _m\right) \le 3.1\left[ \eta _m+\inf _{t\in \overline{S}_m} \Vert {s}-t\Vert \right] +4.1\sqrt{(5/3)Q_s(\Gamma )}. \end{aligned}$$

Putting the bounds together and rearranging the terms leads to Theorem 2.

5.4 An additional selection theorem

In order to derive Theorem 3 we need an additional selection step allowing us to choose a proper estimator among the sequence of estimators \((\widehat{s}^{\,2^i})_{i\ge 1}\). We start with a general selection result, to be proved in Sect. 6.3, that we state for an arbitrary statistical framework since it may apply to situations other than density estimation from an i.i.d. sample. We observe some random object \(\varvec{X}\) with distribution \(P_s\) on \(\mathcal{X }\), where \(s\) belongs to a metric space \(M\) (carrying a distance \(d\)) which indexes a family \(\mathcal{P }=\{P_t, t\in M\}\) of probabilities on \(\mathcal{X }\).

Theorem 7

Let \((t_p)_{p\ge 1}\) be a sequence in \(M\) such that the following assumption holds: for all pairs \((n,p)\) with \(1\le n<p\) and all \(x\in \mathbb R \), one can find a test \(\psi _{t_n,t_p,x}\) based on the observation \(\varvec{X}\) and satisfying

$$\begin{aligned} \sup _{\{s\in M\,|\,d(s,t_n)\le d(t_n,t_p)/4\}}\,\mathbb P _s[\psi _{t_n,t_p,x}(\varvec{X})=t_p] \le B\exp \left[ -a2^{-p}d^2(t_n,t_p)-x\right] \qquad \quad \ \end{aligned}$$
(5.15)

and

$$\begin{aligned} \sup _{\{s\in M\,|\,d(s,t_p)\le d(t_n,t_p)/4\}}\,\mathbb P _s[\psi _{t_n,t_p,x}(\varvec{X})=t_n] \le B\exp \left[ -a2^{-p}d^2(t_n,t_p)+x\right] \qquad \quad \ \end{aligned}$$
(5.16)

with positive constants \(a\) and \(B\) independent of \(n,p\) and \(x\). For each \(A\ge 1\), one can design an estimator \(\widehat{s}_A\) with values in \(\{t_p,\,p\ge 1\}\) such that, for all \(s\in M\),

$$\begin{aligned} \mathbb{E }_s\left[ d^q\left( \,\widehat{s}_A,s\right) \right] \le BC(A,q)\inf _{p\ge 1} \left[ d(s,t_p)\vee \sqrt{a^{-1}p2^p}\right] ^q \quad \text{ for } 1\le q<2A/\log 2. \end{aligned}$$
(5.17)

This general result applies to our specific framework of density estimation based on an observation \(\varvec{X}\) with distribution \(P_s,\, s\in \overline{\mathbb{L }}_2\), provided that the sequence \((t_p)_{p\ge 1}\) be suitably chosen. We shall simply assume here that \(t_p\in \overline{\mathbb{L }}_2\) with \(\Vert t_p\Vert _\infty \le 2^{p+1}\) for each \(p\ge 1\). This implies that, for \(1\le i<j,\, t_i\) and \(t_j\) belong to \(\overline{\mathbb{L }}^{2^{j+1}}_\infty \) so that Theorem 5 applies with \(\varvec{X}\) replaced by the randomized sample \(\varvec{X}^{\prime }\) and the assumption of Theorem 7 is therefore satisfied with \(d=d_2,\, B=1\) and \(a=n/65\), leading to Proposition 4.

6 Proofs

6.1 Proof of Proposition 3

For simplicity, we shall write \(h(\theta ,\lambda )\) for \(h(s_\theta ,s_\lambda )\) and analogously \(d_2(\theta ,\lambda )\) for \(d_2(s_\theta ,s_\lambda )\) and start with a preliminary lemma.

Lemma 2

For the parametric problem described in Proposition 3, the following holds for all \(\theta \) and \(\lambda \) in \((0,1/3]\):

$$\begin{aligned} h^2(\theta ,\lambda )=C(\theta ,\lambda )|\theta -\lambda | \quad \text{ with } 2/9<C(\theta ,\lambda )<3/2 \end{aligned}$$
(6.1)

and

$$\begin{aligned} d_2^2(\theta ,\lambda )=C(\theta ,\lambda )\left| \theta ^{-1}-\lambda ^{-1}\right| \quad \text{ with } 1<C(\theta ,\lambda )<3. \end{aligned}$$
(6.2)

Proof

Let us first evaluate \(h^2(\theta ,\lambda )\) for \(0\!<\!\theta \!<\!\lambda \!\le \!1/3\). Setting \(\beta _\theta \!=\!\left( \theta ^2\!+\!\theta \!+\!1\right) ^{-1}\in [9/13,1)\), we get

$$\begin{aligned} 2h^2(\theta ,\lambda )&= \int \nolimits _0^1\left( \sqrt{s_\theta (x)} -\sqrt{s_\lambda (x)}\right) ^2dx \\&= \theta ^{3}\left( \theta ^{-1}-\lambda ^{-1}\right) ^2+\left( \lambda ^{3} -\theta ^{3}\right) \left( \lambda ^{-1}-\sqrt{\beta _\theta }\right) ^2 \\&+\left( 1-\lambda ^{3}\right) \left( \sqrt{\beta _\theta } -\sqrt{\beta _\lambda }\right) ^2 \\&= (\lambda -\theta ) \frac{\theta }{\lambda }\left( 1-\frac{\theta }{\lambda }\right) +(\lambda -\theta )\left[ 1+\frac{\theta }{\lambda }+\left( \frac{\theta }{\lambda }\right) ^2 \right] \left( 1-\lambda \sqrt{\beta _\theta }\right) ^2\\&+ \left( 1-\lambda ^{3}\right) \left( \sqrt{\beta _\theta } -\sqrt{\beta _\lambda }\right) ^2. \end{aligned}$$

Note that the monotonicity of \(\theta \mapsto \beta _\theta \) implies that

$$\begin{aligned} 4/9<\left( 1-\lambda \sqrt{\beta _\theta }\right) ^2<1,\qquad \sqrt{\beta _\theta }+\sqrt{\beta _\lambda }>2\sqrt{\beta _{1/3}}=6/\sqrt{13} \end{aligned}$$

and

$$\begin{aligned} 0<\beta _\theta -\beta _\lambda =\frac{(\lambda -\theta )(\lambda +\theta +1)}{(\theta ^2+\theta +1)(\lambda ^2+\lambda +1)}<\lambda -\theta . \end{aligned}$$
(6.3)

It follows that

$$\begin{aligned} 0<\left( \sqrt{\beta _\theta }-\sqrt{\beta _\lambda }\right) ^2= \frac{(\beta _\theta -\beta _\lambda )^2}{\left( \sqrt{\beta _\theta }+\sqrt{\beta _\lambda } \right) ^2}<\frac{13}{36}(\lambda -\theta )^2=\frac{13\lambda }{36} (\lambda -\theta )\left( 1-\frac{\theta }{\lambda }\right) \end{aligned}$$

and

$$\begin{aligned} 0\!<\!\left( 1-\lambda ^{3}\right) \!\left( \sqrt{\beta _\theta } -\sqrt{\beta _\lambda }\right) ^2\! \!<\! \frac{13\lambda \left( 1-\lambda ^{3}\right) }{36}(\lambda -\theta )\! \left( \!1-\frac{\theta }{\lambda }\right) \!<\! \frac{2(\lambda -\theta )}{17}\!\left( \!1-\frac{\theta }{\lambda }\right) . \end{aligned}$$
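
The final numerical bound only uses \(\lambda \le 1/3\): since \(\lambda \left( 1-\lambda ^{3}\right) \) is increasing on \((0,1/3]\),

$$\begin{aligned} \frac{13\lambda \left( 1-\lambda ^{3}\right) }{36}\le \frac{13}{36}\cdot \frac{1}{3}\cdot \frac{26}{27}=\frac{338}{2916}<\frac{2}{17}. \end{aligned}$$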

We can therefore write

$$\begin{aligned} G=2(\lambda -\theta )^{-1}h^2(\theta ,\lambda )=z(1-z)+ c_1(\theta ,\lambda )\left( 1+z+z^2\right) +c_2(\theta ,\lambda )(1-z) \end{aligned}$$

with \(z=\theta /\lambda \in (0,1),\, 4/9<c_1(\theta ,\lambda )<1\) and \(0<c_2(\theta ,\lambda )<2/17\). Since, for given values of \(c_1\) and \(c_2\), the right-hand side is increasing with respect to \(z,\, 4/9<c_1<G<3c_1<3\) and (6.1) follows.
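
To make the monotonicity argument explicit (for given values of \(c_1\) and \(c_2\)), the limits of the right-hand side as \(z\rightarrow 0\) and \(z\rightarrow 1\) are \(c_1+c_2\) and \(3c_1\) respectively, whence

$$\begin{aligned} \frac{4}{9}<c_1<c_1+c_2<G<3c_1<3. \end{aligned}$$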

Let us now proceed with the \(\mathbb L _2\)-distance \(d_2\).

$$\begin{aligned} d_2^2(\theta ,\lambda )&= \theta ^3\left( \theta ^{-2}-\lambda ^{-2}\right) ^2+ \left( \lambda ^3-\theta ^3\right) \left( \lambda ^{-2}-\beta _\theta \right) ^2+ \left( 1-\lambda ^3\right) (\beta _\theta -\beta _\lambda )^2\\&= \left( \frac{1}{\theta }-\frac{1}{\lambda }\right) \left( 1-\frac{\theta }{\lambda }\right) \left( 1+\frac{\theta }{\lambda }\right) ^2\\&+ \left( \frac{1}{\theta }-\frac{1}{\lambda }\right) \left[ \frac{\theta }{\lambda }+ \left( \frac{\theta }{\lambda }\right) ^2+\left( \frac{\theta }{\lambda }\right) ^3\right] \left( 1-\lambda ^2\beta _\theta \right) ^2\\&+ \left( \frac{1}{\theta }-\frac{1}{\lambda }\right) \left( 1-\frac{\theta }{\lambda }\right) \theta \lambda ^2\left( 1-\lambda ^3\right) \left( \frac{\beta _\theta -\beta _\lambda }{\lambda -\theta }\right) ^2. \end{aligned}$$

Since \(8/9<1-\lambda ^2\beta _\theta <1\) and, by (6.3),

$$\begin{aligned} 0<\theta \lambda ^2\left( 1-\lambda ^3\right) \left( \frac{\beta _\theta -\beta _\lambda }{\lambda -\theta }\right) ^2<\frac{1}{27}, \end{aligned}$$

we conclude that

$$\begin{aligned} G&= \left( \theta ^{-1}-\lambda ^{-1}\right) ^{-1}\!d_2^2(\theta ,\lambda )\\&= (1-z)(1+z)^2+ c_1(\theta ,\lambda )\left( z+z^2+z^3\right) +c_2(\theta ,\lambda )(1-z) \end{aligned}$$

with \(z=\theta /\lambda \in (0,1),\, 8/9<c_1(\theta ,\lambda )<1\) and \(0<c_2(\theta ,\lambda )<1/27\). It follows that

$$\begin{aligned} 1<1+z-z^2-z^3+(8/9)\left( z+z^2+z^3\right) <G<1+2z+(1/27)(1-z)<3, \end{aligned}$$

which finally implies (6.2). \(\square \)

It immediately follows from (6.1) that the set \(S_\eta =\{s_{\lambda _j},\;j\ge 0\}\) with \(\lambda _j=(2j+1)2\eta ^2/3\) is an \(\eta \)-net for the family \(\overline{S}\) with respect to the Hellinger distance. On the other hand, given \(\lambda \in (0,1/3)\) and \(r\ge 2\eta \), in order that \(s_{\lambda _j}\in \mathcal{B }(s_\lambda ,r)\), it is required that \(h^2(\lambda _j,\lambda )= C(\lambda _j,\lambda )|\lambda _j-\lambda |< r^2\) which implies that \(|\lambda _j-\lambda |<(9/2)r^2\) and therefore

$$\begin{aligned} |S_\eta \cap \mathcal{B }(s_\lambda ,r)|\le 1+(27/4)(r/\eta )^2\le \exp \left[ 0.84(r/\eta )^2\right] \quad \text{ for } \text{ all } s_\lambda \in \overline{S}. \end{aligned}$$
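
For completeness, here is the count behind the last display: the points \(\lambda _j\) form an arithmetic progression with step \(4\eta ^2/3\) and must lie in an interval of length \(9r^2\), whence

$$\begin{aligned} |S_\eta \cap \mathcal{B }(s_\lambda ,r)|\le 1+\frac{9r^2}{4\eta ^2/3}=1+\frac{27}{4}\left( \frac{r}{\eta }\right) ^2, \end{aligned}$$

and the comparison with \(\exp \left[ 0.84(r/\eta )^2\right] \) only needs to be checked at \(r=2\eta \) (where \(28<e^{3.36}\)), the exponential growing faster beyond that point.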

It follows from Lemma 2 of [4] that \(\overline{S}\) has a metric dimension bounded by 3.4 and Corollary 3 of [4] implies that a suitable T-estimator \(\widetilde{s}\) built on \(S_\eta \) has a risk satisfying

$$\begin{aligned} \sup _{0<\theta \le 1/3}\mathbb{E }_{s_\theta }\left[ h^2(s_\theta ,\widetilde{s})\right] \le Cn^{-1}. \end{aligned}$$

Now, setting \(S_\eta =\{s_{\lambda _j},\;j\ge 0\}\) with \(\lambda _j=\left( 3+2j\eta ^2/3\right) ^{-1}\), we deduce as before that \(S_\eta \) is an \(\eta \)-net for \(\overline{S}\) with respect to the \(\mathbb L _2\)-distance. In order that \(s_{\lambda _j}\in \mathcal{B }(s_\lambda ,x\eta )\), it is required that \(d_2^2(\lambda _j,\lambda )=C(\lambda _j,\lambda )|\lambda ^{-1}_j-\lambda ^{-1}|< x^2\eta ^2\), which implies that \(|\lambda ^{-1}_j-\lambda ^{-1}|<x^2\eta ^2\) since \(C(\lambda _j,\lambda )>1\). It follows that the number of elements of \(S_\eta \) contained in the ball is bounded by \(3x^2/2+1\le \exp \left( x^2/2\right) \) for \(x\ge 2\). Hence the metric dimension of \(\overline{S}\) with respect to the \(\mathbb L _2\)-distance is bounded by \(2\). It nevertheless follows from (4.11) that the minimax risk over \(\overline{S}\) is infinite when we use the \(\mathbb L _2\)-loss.

6.2 Proof of Lemma 1

Let us begin with a preliminary lemma.

Lemma 3

Let \(F\) and \(G\) be two disjoint sets with positive measures \(\alpha =\mu (F)\) and \(\beta =\mu (G)\) and \(g\in \overline{\mathbb{L }}_2\) such that \(\inf _{x\in F}g(x)>0\). Set \(g_\varepsilon =g+\varepsilon (\alpha {1\!\!1}_{G} -\beta {1\!\!1}_{F})\) for \(\varepsilon >0\). Then \(g_\varepsilon \) is a density for \(\varepsilon \) small enough and, for any \(f\in \overline{\mathbb{L }}_2\),

$$\begin{aligned} \lim _{\varepsilon \rightarrow 0}\frac{1}{2\varepsilon }\left[ d_2^2(g_\varepsilon ,f)- d_2^2(g,f)\right] =\alpha \int \nolimits _G(g-f)\,d\mu -\beta \int \nolimits _F(g-f)\,d\mu \qquad \end{aligned}$$
(6.4)

and

$$\begin{aligned} \lim _{\varepsilon \rightarrow 0}\frac{2}{\varepsilon }\left[ h^2(g_\varepsilon ,f) -h^2(g,f)\right] =\beta \int \nolimits _F\sqrt{fg^{-1}}\,d\mu -\alpha \int \nolimits _G\sqrt{fg^{-1}}\,d\mu \qquad \end{aligned}$$
(6.5)

with the convention that \(\int _G\sqrt{fg^{-1}}\,d\mu =+\infty \) if either \(\mu (G\cap \{g=0\}\cap \{f>0\})>0\) or the integral diverges.

Proof

Since \(\int g_\varepsilon \,d\mu =1\) and, because \(\inf _{x\in F}g(x)>0\), \(g_\varepsilon \ge 0\) for \(\varepsilon \) small enough, \(g_\varepsilon \) is a density. Moreover, setting \(k=\alpha {1\!\!1}_{G}-\beta {1\!\!1}_{F}\), we get

$$\begin{aligned} d_2^2(g_\varepsilon ,f)=\int (g+\varepsilon k-f)^2\,d\mu = d_2^2(g,f)+\varepsilon ^2\Vert k\Vert ^2+2\varepsilon \int k(g-f)\,d\mu \end{aligned}$$

and (6.4) follows. Let now \(\Delta (\varepsilon )= \varepsilon ^{-1}\left[ h^2(g_\varepsilon ,f)-h^2(g,f)\right] \). Then

$$\begin{aligned} \Delta (\varepsilon )&= \varepsilon ^{-1} \left[ \int \sqrt{gf}\,d\mu -\int \sqrt{(g+\varepsilon k)f}\,d\mu \right] \\&= \varepsilon ^{-1}\left[ \,\int \nolimits _F\left[ \sqrt{gf}-\sqrt{(g-\varepsilon \beta )f}\right] d\mu +\int \nolimits _G\left[ \sqrt{gf}-\sqrt{(g+\varepsilon \alpha )f}\right] d\mu \right] \\&= \int \nolimits _F\frac{\beta \sqrt{f}}{\sqrt{g-\varepsilon \beta }+\sqrt{g}}\,d\mu - \int \nolimits _{G\cap \{g>0\}}\frac{\alpha \sqrt{f}}{\sqrt{g+\varepsilon \alpha }+\sqrt{g}}\, d\mu \\&\quad -\int \nolimits _{G\cap \{g=0\}\cap \{f>0\}}\sqrt{\alpha f/\varepsilon } \,d\mu . \end{aligned}$$

When \(\varepsilon \) tends to 0, the first integral converges to \((\beta /2)\int _F\sqrt{fg^{-1}}\,d\mu \) and the second one converges to \((\alpha /2)\int _{G\cap \{g>0\}}\sqrt{fg^{-1}}\,d\mu \), by monotone convergence. The last one converges to \(+\infty \) if \(\mu (G\cap \{g=0\}\cap \{f>0\})>0\) and 0 otherwise, which completes the proof of (6.5).\(\square \)

We now turn to the proof of Lemma 1 itself. If \(\Vert v_1\vee v_2\Vert _\infty >2B\), we may assume, exchanging the roles of \(v_1\) and \(v_2\) if necessary, that \(\mu (A)>0\) with \(A=\{v_1\ge v_2\;\;\text{ and }\;\;v_1>2B\}\). Let \(C=\{v_1<B\wedge v_2\}\). If \(\mu (C)>0\), we may apply Lemma 3 with \(F=A,\, G=C,\, g=v_1\) and set \(v^{\prime }_1=g_\varepsilon \). We first take \(f=t\). Since \(v_1-t<B\) on \(C\) while \(v_1-t>B\) on \(A\), it follows from (6.4) that \(d_2(v^{\prime }_1,t)<d_2(v_1,t)\) for \(\varepsilon \) small enough. If we now take \(f=v_2\) and use (6.5), we see that \(h(v^{\prime }_1,v_2)<h(v_1,v_2)\) since \(v_2\le v_1\) on \(A\) and \(v_2>v_1\) on \(C\). We conclude by setting \(v^{\prime }_2=v_2\). If \(\mu (C)=0\), then \(\mu (\{B\le v_1<v_2\})+\mu (\{v_2\le v_1<B\})=1\) and both sets have positive \(\mu \)-measure since \(v_1\ne v_2\). In this case we apply Lemma 3 with \(F=\{B\le v_1<v_2\},\, G=\{v_2\le v_1\wedge u\},\, g=v_2\) and set \(v^{\prime }_2=g_\varepsilon \). Then \(\mu (F)>0\) and \(\mu (G)>0\) since \(u\le B<v_2\) on \(F\) while both \(u\) and \(v_2\) are densities. If we use (6.4) with \(f=u\), we derive that \(d_2(v^{\prime }_2,u)<d_2(v_2,u)\) for \(\varepsilon \) small enough and, if we use (6.5) with \(f=v_1\), that \(h(v^{\prime }_2,v_1)<h(v_2,v_1)\). We then conclude by setting \(v^{\prime }_1=v_1\).
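
To illustrate the first case, the right-hand side of (6.4) with \(F=A,\, G=C,\, g=v_1\) and \(f=t\) equals

$$\begin{aligned} \mu (A)\int \nolimits _C(v_1-t)\,d\mu -\mu (C)\int \nolimits _A(v_1-t)\,d\mu <\mu (A)\mu (C)B-\mu (C)\mu (A)B=0, \end{aligned}$$

since \(v_1-t<B\) on \(C\) and \(v_1-t>B\) on \(A\); this is why \(d_2(v^{\prime }_1,t)<d_2(v_1,t)\) for \(\varepsilon \) small enough, and the other applications of Lemma 3 above are checked in the same way.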

6.3 Proof of Theorem 7

We consider the family of tests \(\psi (t_n,t_p,\varvec{X})=\psi _{t_n,t_p,x}(\varvec{X})\) provided by the assumption with \(x=A|p-n|\). Given this family of tests and \(S=\{t_i, i\ge 1\}\), we define the random function \(\mathcal{D }_{\varvec{X}}\) on \(S\) as in Birgé [4], i.e. we set \(\mathcal{R }_i=\{t_j\in S,j\ne i\,|\,\psi (t_i,t_j,\varvec{X}) =t_j\}\) and

$$\begin{aligned} \mathcal{D }_{\varvec{X}}(t_i) =\left\{ \begin{array}{l@{\quad }l}{\displaystyle \sup _{t_j\in \mathcal{R }_i} \left\{ d(t_i,t_j)\right\} }&{}\text{ if } \,\mathcal{R }_i\ne \emptyset ;\\ 0&{}\text{ if } \,\mathcal{R }_i=\emptyset .\end{array}\right. \end{aligned}$$
(6.6)

Given some \(t_i\in S\), we want to bound

$$\begin{aligned} \mathbb P _s\left[ \mathcal{D }_{\varvec{X}}(t_i)>xy_i\right] \quad \text{ for } x\ge 1 \text{ and } y_i=4d(s,t_i)\vee \sqrt{Aa^{-1}i2^i}. \end{aligned}$$

Let us define the integer \(K\) by \(x^2<2^K\le 2x^2\). Then

$$\begin{aligned} K\ge 1,\quad a2^{-i-K}(xy_i)^2\ge a2^{-i-1}y_i^2\ge Ai/2\quad \text{ and } \quad e^{-AK}\le x^{-2A/\log 2}. \end{aligned}$$
(6.7)
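
The last bound in (6.7) simply rewrites the definition of \(K\), since

$$\begin{aligned} 2^K>x^2\;\Longrightarrow \;K\log 2>2\log x\;\Longrightarrow \;e^{-AK}<\exp \left[ -(2A/\log 2)\log x\right] =x^{-2A/\log 2}, \end{aligned}$$

while the middle one follows from \(2^K\le 2x^2\) together with \(ay_i^2\ge Ai2^i\).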

Now, setting \(y=xy_i\), observe that

$$\begin{aligned} \mathbb P _s\left[ \mathcal{D }_{\varvec{X}}(t_i)>y\right] = \mathbb P _s\left[ \;\exists j \text{ with } d(t_i,t_j)>y \text{ and } \psi (t_i,t_j,\varvec{X})=t_j\right] \le \Sigma _1+\Sigma _2, \end{aligned}$$

with

$$\begin{aligned}&\Sigma _1=\sum _{j<i}{1\!\!1}_{d(t_i,t_j)>y}\;\mathbb P _s\left[ \psi (t_i,t_j,\varvec{X}) =t_j\right] ; \\&\Sigma _2=\sum _{j>i}{1\!\!1}_{d(t_i,t_j)>y}\;\mathbb P _s\left[ \psi (t_i,t_j,\varvec{X}) =t_j\right] . \end{aligned}$$

If \(i=1\), then \(\Sigma _1=0\) and if \(i\ge 2\), we can use (5.16) and \(y\ge 4d(s,t_i)\) to derive that

$$\begin{aligned} \Sigma _1&\le B\sum _{j<i}{1\!\!1}_{d(t_i,t_j)>y}\;\exp \left[ -a2^{-i}d^2(t_i,t_j)+A|i-j| \right] \\&\le B\exp \left[ -a2^{-i}y_i^2x^2+Ai\right] \sum _{j\ge 1}e^{-Aj}\\&\le B\frac{e^{-A}}{1-e^{-A}}\exp \left[ -Ai\left( x^2-1\right) \right] \le B \frac{e^{-A}}{1-e^{-A}}\exp \left[ -A\left( x^2-1\right) \right] \\&\le B \left( 1-e^{-A}\right) ^{-1}\exp \left[ -Ax^2\right] \le B \left( 1-e^{-A}\right) ^{-1}x^{-2A/\log 2}, \end{aligned}$$

where we used (6.7), \(i\ge 1,\, x\ge 1\) and the elementary fact that \(x^2\ge (2/\log 2)\log x\) for \(x\ge 1\). Also, by (5.15),

$$\begin{aligned} \Sigma _2&\le B\sum _{j>i}{1\!\!1}_{d(t_i,t_j)>y}\; \exp \left[ -a2^{-j}d^2(t_i,t_j)-A|i-j|\right] \\&\le B\sum _{j>i}\exp \left[ -a2^{-j}y^2-A(j-i)\right] = B\sum _{k=1}^{+\infty }\exp \left[ -a2^{-i-k}y^2-Ak\right] \\&\le B\left[ \sum _{k=1}^{K}\exp \left[ -a2^{-i-k}y^2-Ak\right] + \sum _{k>K}\exp [-Ak]\right] =B(\Sigma _3+\Sigma _4) \end{aligned}$$

with \(\Sigma _4=e^{-AK}\left( e^A-1\right) ^{-1}\) and, by (6.7),

$$\begin{aligned} \Sigma _3&= e^{-AK}\sum _{j=0}^{K-1}\exp \left[ -a2^{-i-K+j}y^2+Aj\right] \le e^{-AK}\sum _{j=0}^{K-1}\exp \left[ -A(i2^{j-1}-j)\right] \\&\le e^{-AK}\sum _{j\ge 0}\exp \left[ -\left( 2^{j-1}-j\right) \right] < 3e^{-AK}. \end{aligned}$$
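
Here the equality is the change of index \(j=K-k\) and the first inequality uses (6.7), since

$$\begin{aligned} a2^{-i-K+j}y^2=2^j\,a2^{-i-K}(xy_i)^2\ge 2^j\,\frac{Ai}{2}=Ai2^{j-1}, \end{aligned}$$

while the second one uses \(A\ge 1\) and \(i\ge 1\).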

We finally get, putting all the bounds together and using (6.7) again,

$$\begin{aligned} \mathbb P _s\left[ \mathcal{D }_{\varvec{X}}(t_i)>xy_i\right] \le BC(A)x^{-2A/\log 2} \quad \text{ for } x\ge 1. \end{aligned}$$
(6.8)

As a consequence \(\mathcal{D }_{\varvec{X}}(t_i)<+\infty \) a.s. and we can define

$$\begin{aligned} \widehat{s}_A=t_p\quad \text{ with } p=\min \left\{ j\,\left| \,\mathcal{D }_{\varvec{X}}(t_j) <\inf _i\mathcal{D }_{\varvec{X}}(t_i)+\sqrt{Aa^{-1}}\right. \right\} . \end{aligned}$$

In view of the definition of \(\mathcal{D }_{\varvec{X}},\, d(t_i,t_j)\le \mathcal{D }_{\varvec{X}}(t_i)\vee \mathcal{D }_{\varvec{X}}(t_j)\), hence, for all \(t_i\in S, d\left( \,\widehat{s}_A,t_i\right) \le \mathcal{D }_{\varvec{X}}(t_i)+\sqrt{Aa^{-1}}\) and

$$\begin{aligned} d\left( \,\widehat{s}_A,s\right) \le \mathcal{D }_{\varvec{X}}(t_i)+\sqrt{Aa^{-1}}+d(s,t_i)< \mathcal{D }_{\varvec{X}}(t_i)+y_i. \end{aligned}$$
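
The last strict inequality holds because \(i2^i\ge 2\) implies \(\sqrt{Aa^{-1}}\le y_i/\sqrt{2}\), while \(4d(s,t_i)\le y_i\), so that

$$\begin{aligned} \sqrt{Aa^{-1}}+d(s,t_i)\le \left( \frac{1}{\sqrt{2}}+\frac{1}{4}\right) y_i<y_i. \end{aligned}$$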

It then follows from (6.8) that

$$\begin{aligned} \mathbb P _s\left[ d\left( \,\widehat{s}_A,s\right) >zy_i\right] \le BC(A)(z-1)^{-2A/\log 2} \quad \text{ for } z\ge 2. \end{aligned}$$

Integrating with respect to \(z\) leads to

$$\begin{aligned} \mathbb{E }_s\left[ \left( d\left( \,\widehat{s}_A,s\right) /y_i\right) ^q\right] \le BC(A,q) \quad \text{ for } 1\le q<2A/\log 2, \end{aligned}$$

and, since \(t_i\) is arbitrary in \(S\),

$$\begin{aligned} \mathbb{E }_s\left[ d^q\left( \,\widehat{s}_A,s\right) \right] \le BC(A,q)\inf _{i\ge 1} \left[ d^q(s,t_i)\vee \left( a^{-1}i2^i\right) ^{q/2}\right] \quad \text{ for } 1\le q<2A/\log 2. \end{aligned}$$
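
For completeness, the integration step can be carried out as follows (a sketch, bounding the probability by 1 for \(z\le 2\) and using the previous probability bound for \(z\ge 2\)):

$$\begin{aligned} \mathbb{E }_s\left[ \left( d\left( \,\widehat{s}_A,s\right) /y_i\right) ^q\right] =q\int \nolimits _0^{+\infty }z^{q-1}\,\mathbb P _s\left[ d\left( \,\widehat{s}_A,s\right) >zy_i\right] dz \le 2^q+qBC(A)\int \nolimits _2^{+\infty }z^{q-1}(z-1)^{-2A/\log 2}\,dz, \end{aligned}$$

and the last integral is finite precisely when \(q<2A/\log 2\), which explains the restriction on \(q\) in (5.17).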