1 Introduction

The paradigm of adaptive nonparametric inference has developed a fairly complete theory for estimation and testing—we mention the key references [2, 3, 8, 9, 23, 25, 29]—but the theory of adaptive confidence statements has not succeeded to the same extent, and consists in significant part of negative results that stand in somewhat puzzling contrast to the fact that adaptive estimators exist. The topic of confidence sets is, however, of vital importance, since it addresses the question of whether the accuracy of adaptive estimation can itself be estimated, and to what extent the abundance of adaptive risk bounds and oracle inequalities in the literature is useful for statistical inference.

In this article we give a set of necessary and sufficient conditions for when confidence sets that adapt to unknown smoothness in \(L^2\)-diameter exist in the problem of nonparametric density estimation on [0,1]. The scope of our techniques extends without difficulty to density estimation on the real line, and also to other common function estimation problems such as nonparametric regression or Gaussian white noise. Our focus on \(L^2\)-type confidence sets is motivated by the fact that they involve the most commonly used loss function in adaptive estimation problems, and so deserve special attention in the theory of adaptive inference.

We can illustrate some main ideas by the simple example of two fixed Sobolev-type classes. Let \(X_1, \dots , X_n\) be i.i.d. with common probability density \(f\) contained in the space \(L^2\) of square-integrable functions on [0,1]. Let \(\Sigma (r)=\Sigma (r,B)\) be a Sobolev ball of probability densities on [0,1], of Sobolev-norm radius \(B\)—see Sect. 2 for precise definitions—and consider adaptation to the submodel \(\Sigma (s) \subset \Sigma (r), s>r\). An adaptive estimator \(\hat{f}_n\) exists, achieving the optimal rate \(n^{-s/(2s+1)}\) for \(f \in \Sigma (s)\) and \(n^{-r/(2r+1)}\) otherwise, in \(L^2\)-risk; see for instance Theorem 2 below.

A confidence set is a random subset \(C_n=C(X_1, \dots , X_n)\) of \(L^2\). Define the \(L^2\)-diameter of a norm-bounded subset \(C\) of \(L^2\) as

$$\begin{aligned} |C| = \inf \left\{ \tau : C \subset \{h: \Vert h-g\Vert _2 \le \tau \}\; \text{ for} \text{ some}\,g \in L^2 \right\} \!, \end{aligned}$$
(1)

equal to the radius of the smallest \(L^2\)-ball containing \(C\). For \(G \subset L^2\) set \(\Vert f-G\Vert _2= \inf _{g \in G}\Vert f-g\Vert _2\) and define, for \(\rho _n \ge 0\) a sequence of real numbers, the separated sets

$$\begin{aligned} \tilde{\Sigma }(r, \rho _n) \equiv \tilde{\Sigma }(r, s, B, \rho _n) = \{f \in \Sigma (r): \Vert f-\Sigma (s)\Vert _2 \ge \rho _n\}. \end{aligned}$$

Obviously \(\tilde{\Sigma }(r,0)=\Sigma (r)\), but for \(\rho _n>0\) these sets are proper subsets of \(\Sigma (r) \setminus \Sigma (s)\). We are interested in adaptive inference in the model

$$\begin{aligned} \mathcal P _n \equiv \Sigma (s) \cup \tilde{\Sigma }(r, \rho _n) \end{aligned}$$

under minimal assumptions on the size of \(\rho _n\). We shall say that the confidence set \(C_n\) is \(L^2\)-adaptive and honest for \(\mathcal P _n\) if there exists a constant \(M\) such that for every \(n \in \mathbb N \),

$$\begin{aligned}&\sup _{f \in \Sigma (s)} {\Pr }_f\left\{ |C_n| > M n^{-s/(2s+1)}\right\} \le \alpha ^{\prime },\end{aligned}$$
(2)
$$\begin{aligned}&\sup _{f \in \tilde{\Sigma }(r,\rho _n)} {\Pr }_f\left\{ |C_n| > M n^{-r/(2r+1)}\right\} \le \alpha ^{\prime } \end{aligned}$$
(3)

and if

$$\begin{aligned} \inf _{f \in \mathcal P _n} {\Pr }_f\left\{ f \in C_n \right\} \ge 1-\alpha -r_n \end{aligned}$$
(4)

where \(r_n \rightarrow 0\) as \(n \rightarrow \infty \). We regard the constants \(\alpha , \alpha ^{\prime }\) as given ‘significance levels’.

Theorem 1

Let \(0<\alpha , \alpha ^{\prime } <1, s>r>1/2\) and \(B>1\) be given.

  1. (A)

    An \(L^2\)-adaptive and honest confidence set for \(\tilde{\Sigma }(r, \rho _n) \cup \Sigma (s)\) exists if one of the following conditions is satisfied:

    1. (i)

      \(s \le 2r\) and \(\rho _n \ge 0\)

    2. (ii)

      \(s>2r\) and

      $$\begin{aligned} \rho _n \ge M n^{-r/(2r+1/2)} \end{aligned}$$

      for every \(n \in \mathbb N \) and some constant \(M\) that depends on \(\alpha , \alpha ^{\prime }, r, B\).

  2. (B)

    If \(s>2r\) and \(C_n\) is an \(L^2\)-adaptive and honest confidence set for \(\tilde{\Sigma }(r, \rho _n) \cup \Sigma (s)\), for every \(\alpha , \alpha ^{\prime }>0\), then necessarily

    $$\begin{aligned} \liminf _n \rho _n n^{r/(2r+1/2)}>0. \end{aligned}$$

We note first that for \(s \le 2r\) adaptive confidence sets exist without any additional restrictions—this is a main finding of the papers [6, 21, 28] and has important precursors in [1, 16, 24]. It is based on the idea that under the general assumption \(f \in \Sigma (r)\) we may estimate the \(L^2\)-risk of any adaptive estimator of \(f\) at precision \(n^{-r/(2r+1/2)}\), which is \(O(n^{-s/(2s+1)})\) precisely when \(s \le 2r\). As soon as one wishes to adapt to smoothness \(s>2r\), however, this device can no longer be used, and adaptive confidence sets then require separation of \(\Sigma (s)\) and \(\Sigma (r) \setminus \Sigma (s)\) (i.e., \(\rho _n>0\)). Maximal subsets of \(\Sigma (r)\) over which \(L^2\)-adaptive confidence sets do exist in the case \(s>2r\) are given in Theorem 1, with separation sequence \(\rho _n\) characterised by the asymptotic order \(n^{-r/(2r+1/2)}\). This rate has, as we show in this article, a fundamental interpretation as the minimax rate of testing between the composite hypotheses

$$\begin{aligned} H_0: f \in \Sigma (s) ~~ \text{ against} ~~H_1: f \in \tilde{\Sigma }(r, \rho _n). \end{aligned}$$
(5)

The occurrence of this rate in Theorem 1 parallels similar findings in Theorem 2 in [17] in the different situation of confidence bands, and is inspired by the general ideas in [5, 13, 17, 22], which attempt to find ‘maximal’ subsets of the usual parameter spaces of adaptive estimation for which honest confidence statements can be constructed. Our results can be construed as saying that for \(s>2r\) confidence sets that are \(L^2\)-adaptive exist precisely over those subsets of the parameter space \(\Sigma (r)\) for which the target \(s\) of adaptation is testable in a minimax way.
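
The role of the threshold \(s=2r\) in this dichotomy is elementary and can be checked by comparing exponents:

$$\begin{aligned} n^{-r/(2r+1/2)} = O\big (n^{-s/(2s+1)}\big ) \iff \frac{r}{2r+1/2} \ge \frac{s}{2s+1} \iff r(2s+1) \ge s\left(2r+\frac{1}{2}\right) \iff s \le 2r. \end{aligned}$$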

Our solution of (5) is achieved in Proposition 2 below, where we construct consistent tests for general composite problems of the kind

$$\begin{aligned} H_0: f \in \Sigma \quad \text{ against} \quad H_1: f \in \Sigma (r), \Vert f-\Sigma \Vert _2 \ge \rho _n, \quad \Sigma \subset \Sigma (r), \end{aligned}$$

whenever the sequence \(\rho _n\) is at least of the order \(\max (n^{-r/(2r+1/2)}, r_n),\) where \(r_n\) is related to the complexity of \(\Sigma \) by an entropy condition. In the case \(\Sigma = \Sigma (s)\) with \(s>2r\) relevant here we can establish \(r_n=n^{-s/(2s+1)}=o(n^{-r/(2r+1/2)}),\) so that this test is minimax in light of lower bounds in [19, 20].

While the case of two fixed smoothness classes in Theorem 1 is appealing in its conceptual simplicity, it does not describe the typical adaptation problem, where one wants to adapt to a continuous smoothness parameter \(s\) in a window \([r,R]\). Moreover, the radius \(B\) of \(\Sigma (s)\) is, unlike in Theorem 1, typically unknown, and the usual practice of ‘undersmoothing’ to deal with this problem incurs a rate-penalty for adaptation that we wish to avoid here. Instead, we shall address the question of simultaneous exact adaptation to the radius \(B\) and to the smoothness \(s\). We first show that such strong adaptation is possible if \(R<2r\), see Theorem 3. In the general case \(R\ge 2r\) we can use the ideas from Theorem 1 as follows: starting from a fixed largest model \(\Sigma (r, B_0)\) with \(r, B_0\) known, we discretise \([r,R]\) into a finite grid \(\mathcal S \) consisting of the geometric progression \(r, 2r, 4r, \dots \), and then use the minimax test for (5) in an iterated way to select the optimal value in \(\mathcal S \). We then use the methods underlying Theorem 1 (A)(i) in the selected window, and show that this gives honest adaptive confidence sets over ‘maximal’ parameter subspaces \(\mathcal P _n \subset \Sigma (r, B_0)\). In contrast to what is possible in the \(L^\infty \)-situation studied in [5], the sets \(\mathcal P _n\) asymptotically contain all of \(\Sigma (r, B_0)\), highlighting yet another difference between the \(L^2\)- and \(L^\infty \)-theory. See Proposition 1 and Theorem 5 below for details. We also present a new lower bound which implies that for \(R>2r\) even ‘pointwise in \(f\)’ inference is impossible for the full parameter space of probability densities in the \(r\)-Sobolev space, see Theorem 4. In other words, even asymptotically one has to remove certain subsets of the maximal parameter space if one wants to construct confidence sets that adapt to arbitrary smoothness degrees. One way to do so is to restrict the space a priori to a fixed ball \(\Sigma (r,B_0)\) of known radius as discussed above, but other assumptions come to mind, such as ‘self-similarity’ conditions employed in [5, 13, 22, 27] for confidence intervals and bands. We discuss briefly how this applies in the \(L^2\)-setting.

We state all main results other than Theorem 1 above in Sects. 2 and 3, and proofs are given, in a unified way, in Sect. 4.

2 The setting

2.1 Wavelets and Sobolev–Besov spaces

Denote by \(L^2:=L^2([0,1])\) the Lebesgue space of square integrable functions on [0,1], normed by \(\Vert \cdot \Vert _2\). For integer \(s\) the classical Sobolev spaces are defined as the spaces of functions \(f \in L^2\) whose (distributional) derivatives \(D^\alpha f, 0<\alpha \le s,\) all lie in \(L^2\). One can describe these spaces, for \(s>0\) any real number, in terms of the natural sequence space isometry of \(L^2\) under an orthonormal basis. We opt here to work with wavelet bases: for index sets \(\mathcal Z \subset \mathbb Z , \mathcal Z _l \subset \mathbb Z \) and \(J_0 \in \mathbb N \), let

$$\begin{aligned} \{\phi _{J_0 m}, \psi _{lk}: m \in \mathcal Z , k \in \mathcal Z _l, l \ge J_0+1, l \in \mathbb N \} \end{aligned}$$

be a compactly supported orthonormal wavelet basis of \(L^2\) of regularity \(S\), where as usual, \(\psi _{lk}=2^{l/2}\psi _k(2^l\cdot )\). We shall only consider Cohen–Daubechies–Vial [7] wavelet bases where \(|\mathcal Z _l|=2^l, |\mathcal Z | \le c(S)<\infty , J_0 \equiv J_0(S)\). We define, for \(\langle f, g\rangle = \int _0^1 fg\) the usual \(L^2\)-inner product, and for \(0 \le s <S\), the Sobolev (-type) norms

$$\begin{aligned} \Vert f\Vert _{s,2}&:= \max \left(2^{J_0s}\sqrt{\sum _{k \in \mathcal Z } \langle f, \phi _{J_0 k}\rangle ^2},\sup _{l \ge J_0+1} 2^{ls} \sqrt{\sum _{k \in \mathcal Z _l}\langle f, \psi _{lk} \rangle ^2 } \right) \nonumber \\&= \max \left(2^{J_0s}\Vert \langle f, \phi _{J_0 \cdot }\rangle \Vert _2, \sup _{l \ge J_0+1} 2^{ls}\Vert \langle f, \psi _{l \cdot } \rangle \Vert _2 \right) \end{aligned}$$
(6)

where in slight abuse of notation we use the symbol \(\Vert \cdot \Vert _2\) for the sequence norms on \(\ell ^2(\mathcal Z _l), \ell ^2(\mathcal Z )\) as well as for the usual norm on \(L^2\). Define moreover the Sobolev (-type) spaces

$$\begin{aligned} W^s \equiv B^s_{2\infty } = \{f \in L^2: \Vert f\Vert _{s,2}<\infty \}. \end{aligned}$$

We note here that \(W^s\) is not the classical Sobolev space—in this case the supremum over \(l \ge J_0+1\) would have to be replaced by summation over \(l\)—but the present definition gives rise to the slightly larger Besov space \(B^s_{2 \infty }\), which will turn out to be the natural exhaustive class for our results below. We still refer to them as Sobolev spaces for simplicity, since the main idea is to measure smoothness in \(L^2\). We understand \(W^s\) as spaces of continuous functions whenever \(s>1/2\) (possible by standard embedding theorems). We shall moreover set, in abuse of notation, \(\phi _{J_0 k} \equiv \psi _{J_0k}\) (which does not equal \(2^{-1/2}\psi _{J_0+1,k}(2^{-1}\cdot )\)) in order for the wavelet series of a function \(f \in L^2\) to have the compact representation

$$\begin{aligned} f=\sum _{l=J_0}^\infty \sum _{k \in \mathcal Z _{l}} \psi _{lk} \langle \psi _{lk},f\rangle , \end{aligned}$$

with the understanding that \(\mathcal Z _{J_0}=\mathcal Z \). The wavelet projection \(\Pi _{V_j}(f)\) of \(f \in L^2\) onto the span \(V_j\) in \(L^2\) of

$$\begin{aligned} \{\phi _{J_0 m}, \psi _{lk}: m \in \mathcal Z , k \in \mathcal Z _l, J_0+1 \le l \le j\} \end{aligned}$$

equals

$$\begin{aligned} K_j(f)(x)&\equiv\int _0^1 K_j(x,y)f(y)dy \equiv 2^j \int _0^1 K(2^jx, 2^jy)f(y)dy\\&= \sum _{l=J_0}^{j-1} \sum _{k \in \mathcal Z _l}\langle f, \psi _{lk} \rangle \psi _{lk}(x) \end{aligned}$$

where \(K(x,y)=\sum _k \phi _{J_0 k}(x) \phi _{J_0 k}(y)\) is the wavelet projection kernel.
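
As a purely numerical illustration of this projection, the following minimal sketch evaluates \(\Pi _{V_j}(f)\) by computing wavelet coefficients with a midpoint-rule quadrature, using the Haar basis on [0,1] as a stand-in for the boundary-corrected CDV bases of regularity \(S\) assumed in the text (Haar has regularity 1 only); all function names and numerical choices are illustrative, not part of the paper.

```python
import numpy as np

# Minimal sketch of the wavelet projection Pi_{V_j}(f) on [0,1], with the Haar
# basis standing in for the CDV bases of regularity S used in the text.

N = 2 ** 14
grid = (np.arange(N) + 0.5) / N                      # midpoint grid on [0,1]

def haar_psi(x):
    # Haar mother wavelet: +1 on [0,1/2), -1 on [1/2,1), 0 elsewhere
    return np.where((0 <= x) & (x < 0.5), 1.0, 0.0) - np.where((0.5 <= x) & (x < 1.0), 1.0, 0.0)

def psi_lk(x, l, k):
    # psi_{lk}(x) = 2^{l/2} psi(2^l x - k)
    return 2.0 ** (l / 2) * haar_psi(2.0 ** l * x - k)

def inner(g, h):
    # <g,h> = int_0^1 g h, midpoint-rule approximation on the grid
    return float(np.mean(g * h))

def projection(fvals, j):
    # Pi_{V_j}(f) = <f,1>*1 + sum_{l=0}^{j-1} sum_k <f,psi_lk> psi_lk
    proj = np.full(N, inner(fvals, np.ones(N)))
    for l in range(j):
        for k in range(2 ** l):
            basis = psi_lk(grid, l, k)
            proj += inner(fvals, basis) * basis
    return proj

f = 1 + 0.5 * np.cos(2 * np.pi * grid)               # a smooth density on [0,1]
for j in (2, 4, 6):
    diff = projection(f, j) - f
    print(j, np.sqrt(inner(diff, diff)))             # L^2 error decreases as j grows
```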

2.2 Adaptive estimation in \(L^2\)

Let \(X_1, \dots , X_n\) be i.i.d. with common density \(f\) on [0,1], with joint distribution equal to the first \(n\) coordinate projections of the infinite product probability measure \({\Pr }_f\). Write \(E_f\) for the corresponding expectation operator. We shall throughout make the minimal assumption that \(f \in W^r\) for some \(r>1/2\), which implies in particular, by Sobolev’s lemma, that \(f\) is continuous and bounded on [0,1]. The adaptation problem arises from the hope that \(f \in W^s\) for some \(s\) significantly larger than \(r\), without wanting to commit to a particular a priori value of \(s\). In this generality the problem is still not meaningful, since the regularity of \(f\) is not only described by containment in \(W^s\), but also by the size of the Sobolev norm \(\Vert f \Vert _{s,2}\). If one defines, for \(0<s<\infty , 1 \le B<\infty \), the Sobolev-balls of densities

$$\begin{aligned} \Sigma (s,B):= \left\{ f:[0,1] \rightarrow [0, \infty ), \int _{[0,1] } f =1, \Vert f\Vert _{s,2} \le B \right\} , \end{aligned}$$
(7)

then Pinsker’s minimax theorem (for density estimation) gives, as \(n \rightarrow \infty \),

$$\begin{aligned} \inf _{T_n} \sup _{f \in \Sigma (s,B)} E_f \Vert T_n-f\Vert _2^2 \sim c(s) B^{2/(2s+1)} n^{-2s/(2s+1)} \end{aligned}$$
(8)

for some constant \(c(s)>0\) depending only on \(s\), and where the infimum extends over all measurable functions \(T_n\) of \(X_1, \dots , X_n\) (cf., e.g., the results in Theorem 5.1 in [10]). So any risk bound, attainable uniformly for elements \(f \in \Sigma (s,B)\), cannot improve on \(B^{2/(2s+1)}n^{-2s/(2s+1)}\) up to multiplicative constants. If \(s,B\) are known then constructing estimators that attain this bound is possible, even with the asymptotically exact constant \(c(s)\). The adaptation problem poses the question of whether estimators can attain such a risk bound without requiring knowledge of \(B,s\).

The paradigm of adaptive estimation has provided us with a positive answer to this problem, and one can prove the following result.

Theorem 2

Let \(1/2 <r \le R<\infty \) be given. Then there exists an estimator \(\hat{f}_n = \hat{f}(X_1, \dots , X_n, r, R)\) such that, for every \(s \in [r,R]\), every \(B \ge 1, U>0\), and every \(n \in \mathbb N \),

$$\begin{aligned} \sup _{f \in \Sigma (s,B), \Vert f\Vert _\infty \le U} E_f \Vert \hat{f}_n - f\Vert _2^2 \le c B^{2/(2s+1)}n^{-2s/(2s+1)} \end{aligned}$$

for a constant \(0<c<\infty \) that depends only on \(r, R, U\).

If one wishes to adapt to the radius \(B \in [1,B_0]\) then the canonical choice for \(U\) is

$$\begin{aligned} \sup _{f \in \Sigma (r,B_0)}\Vert f\Vert _\infty \le c(r) B_0 \equiv U < \infty , \end{aligned}$$
(9)

but other choices will be possible below. More elaborate techniques allow for \(c\) to depend only on \(s\), and even to obtain the exact asymptotic minimax ‘Pinsker’-constant, see for instance Theorem 5.1 in [10]. We shall not study exact constants here, mostly to simplify the exposition and to focus on the main problem of confidence statements, but also since exact constants are asymptotic in nature and we prefer to give nonasymptotic bounds.
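
For completeness, the bound (9) follows from the wavelet characterisation of the supremum norm: for \(f \in \Sigma (r,B_0)\) with \(r>1/2\),

$$\begin{aligned} \Vert f\Vert _\infty \le \sum _{l \ge J_0} \Big \Vert \sum _{k \in \mathcal Z _l} \langle f, \psi _{lk}\rangle \psi _{lk} \Big \Vert _\infty \le c \sum _{l \ge J_0} 2^{l/2} \Vert \langle f, \psi _{l \cdot }\rangle \Vert _2 \le c \Vert f\Vert _{r,2} \sum _{l \ge J_0} 2^{-l(r-1/2)} \le c(r) B_0, \end{aligned}$$

where \(c\) depends only on the wavelet basis and the geometric series converges since \(r>1/2\).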

From a ‘pointwise in \(f\)’ perspective we can conclude from Theorem 2 that adaptive estimation is possible over the full continuous Sobolev scale

$$\begin{aligned} \bigcup _{s \in [r,R], 1 \le B < \infty } \Sigma (s,B) = W^r \cap \left\{ f:[0,1] \rightarrow [0, \infty ), \int _0^1 f =1 \right\} ; \end{aligned}$$

for any probability density \(f \in W^s, s \in [r,R]\), the single estimator \(\hat{f}_n\) satisfies

$$\begin{aligned} E_f\Vert \hat{f}_n -f\Vert _2^2 \le c \Vert f\Vert _{s,2}^{2/(2s+1)} n^{-2s/(2s+1)} \end{aligned}$$

where \(c\) depends on \(r,R, \Vert f\Vert _\infty \). Since \(\hat{f}_n\) does not depend on \(B, U\) or \(s\), we can say that \(\hat{f}_n\) adapts to both \(s \in [r,R]\) and \(B \in [1, B_0]\) simultaneously. If one imposes a uniform upper bound \(\Vert f\Vert _\infty \le U\) then adaptation even holds for every \(B \ge 1\). Our interest here is to understand what remains of this remarkable result if one is interested in adaptive confidence statements rather than in risk bounds.

3 Adaptive confidence sets for Sobolev classes

3.1 Honest asymptotic inference

We aim to characterise those sets \(\mathcal P _n\) consisting of uniformly bounded probability densities \(f \in W^r\) for which we can construct adaptive confidence sets. More precisely, we seek random subsets \(C_n\) of \(L^2\) that depend only on known quantities, cover \(f \in \mathcal P _n\) at least with prescribed probability \(1-\alpha \), and have \(L^2\)-diameter \(|C_n|\) adaptive with respect to radius and smoothness with prescribed probability at least \(1-\alpha ^\prime \). To avoid discussing measurability issues we shall tacitly assume throughout that \(C_n\) lies within an \(L^2\)-ball of radius \(O(|C_n|)\) centered at a random variable \(\tilde{f}_n \in L^2\).

Definition 1

(\(L^2\) -adaptive confidence sets) Let \(X_1, \dots , X_n\) be i.i.d. on [0,1] with common density \(f\). Let \(0<\alpha , \alpha ^{\prime } <1\) and \(1/2 <r \le R\) be given and let \(C_n=C(X_1, \dots , X_n)\) be a random subset of \(L^2\). \(C_n\) is called \(L^2\)-adaptive and honest for a sequence of (nonempty) models \(\mathcal P _n \subset W^r \cap \{f: \Vert f\Vert _\infty \le U\}\), if there exists a constant \(L=L(r,R,U)\) such that for every \(n \in \mathbb N \)

$$\begin{aligned} \sup _{f \in \Sigma (s,B) \cap \mathcal P _n} {\Pr }_f\left\{ |C_n| > L B^{1/(2s+1)}n^{-s/(2s+1)}\right\} \le \alpha ^{\prime } \quad \forall s\in [r,R], B \ge 1,\nonumber \\ \end{aligned}$$
(10)

(the condition being void if \(\Sigma (s,B) \cap \mathcal P _n\) is empty) and

$$\begin{aligned} \inf _{f \in \mathcal P _n} {\Pr }_f\left\{ f \in C_n \right\} \ge 1-\alpha -r_n \end{aligned}$$
(11)

where \(r_n \rightarrow 0\) as \(n \rightarrow \infty \).

To understand the scope of this definition, some discussion is necessary. First, the interval \([r,R]\) describes the range of smoothness parameters one wants to adapt to. Besides the restriction \(1/2<r \le R < \infty \) the choice of this window of adaptation is arbitrary (although the values of \(R,r\) influence the constants). Second, if we wish to adapt to \(B\) in a fixed interval \([1,B_0]\) only, we may take \(\mathcal P _n\) to be a subset of \(\Sigma (r, B_0)\) and the canonical choice of \(U=c(r)B_0\) from (9). In such a situation (10) will still hold for every \(B \ge 1\), although the result will not be meaningful for \(B > B_0\). Otherwise we may impose an arbitrary uniform bound on \(\Vert f\Vert _\infty \) and adapt to all \(B \ge 1\). We require here the sharp dependence on \(B\) in (10) and thus exclude the usual ‘undersmoothed’, near-adaptive confidence sets in our setting. A natural ‘maximal’ model choice would be \(\mathcal P _n = \Sigma (r,B_0) ~ \forall n\) with \(B_0 \ge 1\) arbitrary.

3.2 The case \(R < 2r\)

A first result, the key elements of which have been discovered and discussed in [6, 16, 21, 24, 28], is that \(L^2\)-adaptive confidence statements that parallel the situation of Theorem 2 exist without any additional restrictions whatsoever, in the case where \(R < 2r\), so that the window of adaptation is \([r,2r)\). The sufficiency part of the following theorem is a simple extension of results in Robins and van der Vaart [28] in that it shows that adaptation is possible not only to the smoothness \(s\), but also to the radius \(B\). The main idea of the proof is that, if \(R<2r\), the squared \(L^2\)-risk of \(\hat{f}_n\) from Theorem 2 can be estimated at a rate compatible with adaptation, by a suitable \(U\)-statistic.

Theorem 3

  1. (A)

    If \(R<2r\), then for any \(\alpha , \alpha ^{\prime }\), there exists a confidence set \(C_n=C(X_1, \dots , X_n, r, R, \alpha , \alpha ^{\prime })\) which is honest and adaptive in the sense of Definition 1 for any choice \(\mathcal P _n \equiv \Sigma (r,B_0) \cap \{f: \Vert f\Vert _\infty \le U\}, B_0 \ge 1, U>0\).

  2. (B)

    If \(R \ge 2r\), then for \(\alpha , \alpha ^{\prime }\) small enough no \(C_n\) as in (A) exists.

We emphasise that the confidence set \(C_n\) constructed in the proof of Theorem 3 depends only on \(r,R,\alpha , \alpha ^{\prime }\) and does not require knowledge of \(B_0\) or \(U\). Note however that the sequence \(r_n\) from Definition 1 does depend on \(B_0\)—one may thus use \(C_n\) without any prior choice of parameters, but evaluation of its coverage is still relative to the model \(\Sigma (r,B_0)\). Arbitrariness of \(B_0, U\) implies, by taking \(B_0=\Vert f\Vert _{s,2}, U=\Vert f\Vert _\infty \) in the above result, that ‘pointwise in \(f\)’ adaptive inference is possible for any probability density in the Sobolev space \(W^r\).

Corollary 1

Let \(0<\alpha , \alpha ^{\prime } <1\) and \(1/2 <r \le R\). Assume \(R<2r\). There exists a confidence set \(C_n=C(X_1, \dots , X_n, r, R, \alpha , \alpha ^{\prime })\) such that

  1. (i)

    \(\liminf _n {\Pr }_f\left\{ f \in C_n \right\} \ge 1-\alpha ~~~ \text{ for} \text{ every} \text{ probability} \text{ density}\,f \in W^r,\) and

  2. (ii)

    \(\limsup _n {\Pr }_f\{|C_n| \!>\! L \Vert f\Vert _{s, 2}^{1/(2s+1)}n^{-s/(2s+1)}\} \!\le \! \alpha ^{\prime } \quad \text{ for} \text{ every} \text{ probability} \text{ density} f \in W^s, s\in [r,R],\) and some finite positive constant \(L=L(r,R, \Vert f\Vert _\infty )\).

3.3 The case of general \(R\)

If we allow for general \(R \ge 2r\) honest inference is not possible without restricting \(\mathcal P _n\) further. In fact even a weaker ‘pointwise in \(f\)’ result of the kind of Corollary 1 is impossible for general \(R\ge r\). This is a consequence of the following lower bound.

Theorem 4

Fix \(0<\alpha <1/2\), let \(s \ge r\) be arbitrary. A confidence set \(C_n=C(X_1, \dots , X_n)\) in \(L^2\) cannot satisfy

  1. (i)

    \(\liminf _n {\Pr }_f\{f \in C_n\} \ge 1- \alpha \quad \text{ for} \text{ every} \text{ probability} \text{ density}\,f \in W^r\), and

  2. (ii)

    \(|C_n| \!=\! O_{{\Pr }_f}(r_n) \quad \text{ for} \text{ every} \text{ probability} \text{ density} f \!\in \! W^s\) at any rate \(r_n \!=\! o(n^{-r/(2r+1/2)})\).

For \(R>2r\) we have \(n^{-R/(2R+1)} = o(n^{-r/(2r+1/2)})\). Thus even from a ‘pointwise in \(f\)’ perspective a confidence procedure cannot adapt to the entirety of densities in a Sobolev space \(W^r\) when \(R>2r\). On the other hand if we restrict to proper subsets of \(W^r\), the situation may qualitatively change. For instance if we wish to adapt to submodels of a fixed Sobolev ball \(\Sigma (r, B_0)\) with \(r, B_0\) known, we have the following result.

Proposition 1

Let  \(0\!<\!\alpha , \alpha ^{\prime } \!<\!1\) and \(1/2 \!<\!r \!\le \! R, B_0 \!\ge \! 1\). There exists a confidence set \(C_n=C(X_1,\dots , X_n, B_0,r,R, \alpha , \alpha ^{\prime })\) such that

  1. (i)

    \(\liminf _n {\Pr }_f\left\{ f \in C_n \right\} \ge 1-\alpha \quad \text{ for} \text{ every} \text{ probability} \text{ density}\,f \in \Sigma (r, B_0),\) and

  2. (ii)

    \(\limsup _n {\Pr }_f\{|C_n| \!>\! L \Vert f\Vert _{s, 2}^{1/(2s+1)}n^{-s/(2s+1)}\} \!\le \! \alpha ^{\prime } \quad \text{ for} \text{ every} \text{ probability} \text{ density} f \in \Sigma (s, B_0), s\in [r,R],\) and some finite positive constant \(L=L(r,R, \Vert f\Vert _\infty )\).

Now if we compare Proposition 1 to Theorem 3 we see that there exists a genuine discrepancy between honest and pointwise in \(f\) adaptive confidence sets when \(R\ge 2r\). Of course Proposition 1 is not useful for statistical inference as the index \(n\) from which onwards coverage holds depends on the unknown \(f\). The question arises whether there are meaningful maximal subsets of \(\Sigma (r, B_0)\) for which honest inference is possible. The proof of Proposition 1 is in fact based on the construction of subsets \(\mathcal P _n\) of \(\Sigma (r,B_0)\) which grow dense in \(\Sigma (r,B_0)\) and for which honest inference is possible. This approach follows the ideas from Part (A)(ii) in Theorem 1, and works as follows in the setting of continuous \(s \in [r,R]\): assume without loss of generality that \(2^{N-1}r<R<2^{N}r\) for some \(N \in \mathbb N , N >1\), and define the grid

$$\begin{aligned} \mathcal S =\{s_m\}_{m=1}^N = \{r, 2r, 4r, \dots , 2^{N-1}r\}. \end{aligned}$$

Note that \(\mathcal S \) is independent of \(n\). Define, for \(s \in \mathcal S \setminus \{s_N\}\),

$$\begin{aligned} \tilde{\Sigma }(s, \rho ):= \tilde{\Sigma }(s, B_0, \mathcal S , \rho ) = \left\{ f \in \Sigma (s, B_0): \Vert f-\Sigma (t, B_0)\Vert _2 \ge \rho ~\forall t>s, t \in \mathcal S \right\} . \end{aligned}$$

We will choose the separation rates

$$\begin{aligned} \rho _n(s) \sim n^{-s/(2s+1/2)}, \end{aligned}$$

equal to the minimax rate of testing between \(\Sigma (s, B_0)\) and any submodel \(\Sigma (t, B_0)\) for \(t \in \mathcal S , t>s\). The resulting model is therefore, for \(M\) some positive constant,

$$\begin{aligned} \mathcal P _n(M, \mathcal S ) = \Sigma (s_N, B_0) \bigcup \left(\bigcup _{s \in \mathcal S \setminus \{s_N\}} \tilde{\Sigma }(s, M\rho _n(s))\right). \end{aligned}$$

The main idea behind the following theorem is to first construct a minimax test for the nested hypotheses

$$\begin{aligned} \{H_s: f \in \tilde{\Sigma }(s, M \rho _n(s))\}_{s \in \mathcal S \setminus \{s_N\}}, \end{aligned}$$

then to estimate the risk of the adaptive estimator \(\hat{f}_n\) from Theorem 2 under the assumption that \(f\) belongs to the smoothness hypothesis selected by the test, and finally to construct a confidence set centered at \(\hat{f}_n\) based on this risk estimate (as in the proof of Theorem 3).
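
Schematically, the grid \(\mathcal S \) and the separation rates \(\rho _n(s)\) can be computed as follows; the constant \(M\) and the tests of Proposition 2 are omitted, and the values of \(r, R, n\) in the example are arbitrary.

```python
import numpy as np

# Schematic construction of the grid S = {r, 2r, 4r, ..., 2^{N-1} r} and of the
# separation rates rho_n(s) ~ n^{-s/(2s+1/2)}; the constant M and the tests of
# Proposition 2 are not implemented here.

def smoothness_grid(r, R):
    # doubling grid contained in [r, R): keeps doubling while 2^{N-1} r < R
    S = [r]
    while 2 * S[-1] < R:
        S.append(2 * S[-1])
    return S

def separation_rate(s, n):
    return n ** (-s / (2 * s + 0.5))

r, R, n = 0.75, 7.0, 10_000
S = smoothness_grid(r, R)
print("grid S :", S)                       # here [0.75, 1.5, 3.0, 6.0]
for s in S[:-1]:
    print(s, separation_rate(s, n))        # rho_n(s), smaller for smoother s
```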

Theorem 5

Let \(R > 2r\) and \(B_0 \ge 1\) be arbitrary. There exists a confidence set \(C_n=C(X_1,\dots , X_n, B_0,r,R, \alpha , \alpha ^{\prime })\), honest and adaptive in the sense of Definition 1, for \(\mathcal P _n = \mathcal P _n(M, \mathcal S ), n \in \mathbb N ,\) with \(M\) a large enough constant and \(U\) as in (9).

First note that, since \(\mathcal S \) is independent of \(n, \mathcal P _n(M, \mathcal S ) \nearrow \Sigma (r, B_0)\) as \(n \rightarrow \infty \), so that the model \(\mathcal P _n(M, \mathcal S )\) grows dense in the fixed Sobolev ball, which for known \(B_0\) is the full model. This implies in particular Proposition 1.

An important question is whether \(\mathcal P _n(M, \mathcal S )\) was taken to grow as fast as possible as a function of \(n\), or in other words, whether a smaller choice of \(\rho _n(s)\) would have been possible. The lower bound in Theorem 1 implies that any faster choice for \(\rho _n(s)\) makes honest inference impossible. Indeed, if \(C_n\) is an honest confidence set over \(\mathcal P _n(M, \mathcal S )\) with a faster separation rate \(\rho _n^{\prime }=o(\rho _n(s))\) for some \(s \in \mathcal S \setminus \{s_N\}\), then we can use \(C_n\) to test \(H_0: f \in \Sigma (s^{\prime })\) against \(H_1: f \in \tilde{\Sigma }(s, \rho _n^{\prime })\) for some \(s^{\prime }>2s\), which by the proof of Theorem 1 gives a contradiction.

3.3.1 Self-similarity conditions

The proof of Theorem 5 via testing smoothness hypotheses is strongly tied to knowledge of the upper bound \(B_0\) for the radius of the Sobolev ball, but as discussed above, this cannot be avoided without contradicting Theorem 4. Alternative ways to restrict \(W^r,\) other than constraining the radius, and which may be practically relevant, are given in [5, 13, 22, 27]. The authors instead restrict to ‘self-similar’ functions, whose regularity is similar at large and small scales. As the results [5, 13, 22] prove adaptation in \(L^\infty ,\) they naturally imply adaptation also in \(L^2;\) the functions excluded, however, are now those whose norm is hard to estimate, rather than those whose norm is merely large. In the \(L^2\)-case we need to estimate \(s\) only up to a small constant; as this is more favourable than the \(L^\infty \)-situation, one may impose weaker self-similarity assumptions, tailored to the \(L^2\)-situation. This can be achieved arguing in a similar fashion to [5], but we do not pursue this further in the present paper.

4 Proofs

4.1 Some concentration inequalities

Let \(X_i, i=1, 2, \dots ,\) be the coordinates of the product probability space \((T,\mathcal T ,P)^\mathbb{N }\), where \(P\) is any probability measure on \((T, \mathcal T ), P_n=n^{-1}\sum _{i=1}^n \delta _{X_i}\) the empirical measure, \(E\) expectation under \(P^\mathbb N \equiv \Pr \). For \(M\) any set and \(H:M \rightarrow \mathbb R \), let \(\Vert H\Vert _M=\sup _{m \in M}|H(m)|\). We also write \(Pf=\int _T fdP\) for measurable \(f: T \rightarrow \mathbb R \).

The following Bernstein-type inequality for canonical \(U\)-statistics of order two is due to Giné et al. [12], with refinements about the numerical constants in Houdré and Reynaud-Bouret [18]: let \(R(x,y)\) be a symmetric real-valued function defined on \(T \times T\), such that \(ER(X,x)=0\) for all \(x\), and let

$$\begin{aligned} \Lambda ^2_1&= \frac{n(n-1)}{2} ER(X_1,X_2)^2,\\ \Lambda _2&= n\sup \{E[R(X_1,X_2)\zeta (X_1)\xi (X_2)]:E\zeta ^2(X_1)\le 1,E\xi ^2(X_1)\le 1\},\\ \Lambda _3&= \Vert nER^2(X_1,\cdot )\Vert ^{1/2}_\infty ,\ \ \Lambda _4=\Vert R\Vert _\infty . \end{aligned}$$

Let moreover \(U_n^{(2)}(R) = \frac{2}{n(n-1)} \sum _{i<j} R(X_i, X_j)\) be the corresponding degenerate \(U\)-statistic of order two. Then, there exists a universal constant \(0<C<\infty \) such that for all \(u>0\) and \(n \in \mathbb N \):

$$\begin{aligned} \Pr \left\{ \frac{n(n-1)}{2}|U_n^{(2)}(R)|>C(\Lambda _1u^{1/2} +\Lambda _2u+\Lambda _3u^{3/2}+\Lambda _4u^2)\right\} \le 6\exp \{- u\}.\nonumber \\ \end{aligned}$$
(12)

We will also need Talagrand’s [30] inequality for empirical processes. Let \(\mathcal F \) be a countable class of measurable functions on \(T\) that take values in \([-1/2,1/2]\), or, if \(\mathcal F \) is \(P\)-centered, in \([-1,1]\). Let \(\sigma \) (with \(\sigma \le 1/2\), or \(\sigma \le 1\) if \(\mathcal F \) is \(P\)-centered) and \(V\) be any two numbers satisfying

$$\begin{aligned} \sigma ^2 \ge \Vert Pf^2\Vert _\mathcal F ,\ \ V \ge n\sigma ^2+2E\left\Vert\sum _{i=1}^n(f(X_i)-Pf)\right\Vert_\mathcal F . \end{aligned}$$

Bousquet’s [4] version of Talagrand’s inequality then states: for every \(u >0\),

$$\begin{aligned} \Pr \left\{ \left\Vert\sum _{i=1}^n(f(X_i)-Pf)\right\Vert_\mathcal F \!\ge \! E\left\Vert\sum _{i=1}^n(f(X_i)-Pf)\right\Vert_\mathcal F +u\right\} \!\le \! \exp \left(-\frac{u^2}{2V+\frac{2}{3}u}\right).\nonumber \\ \end{aligned}$$
(13)

A consequence of this inequality, derived in Section 3.1 in [15], is the following. If \(T=[0,1], P\) has bounded Lebesgue density \(f\) on \(T\), and \(f_n(j)=\int _0^1 K_j(\cdot ,y)dP_n(y)\), then for \(M\) large enough, every \(j \ge 0, n \in \mathbb N \) and some positive constants \(c, c^{\prime }\) depending on \(U\) and the wavelet regularity \(S\),

$$\begin{aligned} \sup _{f: \Vert f\Vert _\infty \le U}{\Pr }_f \left\{ \left\Vert f_n(j) - Ef_n(j) \right\Vert_2 > M \sqrt{\Vert f\Vert _\infty \frac{2^j}{n}} \right\} \le c^{\prime } e^{-cM^2 2^j}. \end{aligned}$$
(14)

4.2 A general purpose test for composite nonparametric hypotheses

In this subsection we construct a general test for composite nonparametric null hypotheses that lie in a fixed Sobolev ball, under assumptions only on the entropy of the null-model. While of independent interest, the result will be a key step in the proofs of Theorems 1 and 5.

Let \(X,X_1, \dots , X_n\) be i.i.d. with common probability density \(f\) on [0,1], let \(\Sigma \) be any subset of a fixed Sobolev ball \(\Sigma (t,B)\) for some \(t>1/2\) and consider testing

$$\begin{aligned} H_0: f \in \Sigma ~ \text{ against}\,H_1: f \in \Sigma (t, B)\setminus \Sigma , \Vert f-\Sigma \Vert _2 \ge \rho _n, \end{aligned}$$
(15)

where \(\rho _n \ge 0\) is a sequence of nonnegative real numbers. For \(\{\psi _{lk}\}\) an \(S\)-regular wavelet basis with \(S>t\), \(J_n \ge J_0\) a sequence of positive integers such that \(2^{J_n} \simeq n^{1/(2t+1/2)}\) and for \(g \in \Sigma \), define the \(U\)-statistic

$$\begin{aligned} T_n(g) = \frac{2}{n (n-1)} \sum _{i<j} \sum _{l=J_0}^{J_n-1} \sum _{k \in \mathcal Z _l} (\psi _{lk}(X_i)-\langle \psi _{lk}, g\rangle )(\psi _{lk}(X_j)-\langle \psi _{lk}, g \rangle ) \end{aligned}$$
(16)

and, for \(\tau _n\) some thresholds to be chosen below, the test statistic

$$\begin{aligned} \Psi _n = 1\left\{ \inf _{g \in \Sigma } |T_n(g)| > \tau _n \right\} . \end{aligned}$$
(17)

Measurability of the infimum in (17) can be established by standard compactness/continuity arguments.
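
As a toy illustration of the pairwise structure of (16), the following sketch evaluates \(T_n(g)\) for the single candidate \(g=1\) (the uniform density), with the Haar basis standing in for the \(S\)-regular CDV basis; the infimum over \(\Sigma \) and the calibration of \(\tau _n\) are omitted, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy evaluation of the statistic T_n(g) of (16) for the candidate g = 1, with
# the Haar basis as a stand-in for the S-regular CDV basis of the text.

def haar_psi(x):
    return np.where((0 <= x) & (x < 0.5), 1.0, 0.0) - np.where((0.5 <= x) & (x < 1.0), 1.0, 0.0)

def psi_lk(x, l, k):
    return 2.0 ** (l / 2) * haar_psi(2.0 ** l * x - k)

def T_n(X, coeffs_g, J_n):
    # coeffs_g[(l,k)] = <psi_lk, g>; for g = 1 all wavelet coefficients vanish
    n, total = len(X), 0.0
    for l in range(J_n):
        for k in range(2 ** l):
            a = psi_lk(X, l, k) - coeffs_g.get((l, k), 0.0)
            total += a.sum() ** 2 - (a ** 2).sum()       # = 2 * sum_{i<j} a_i a_j
    return total / (n * (n - 1))

n, t = 2000, 1.0
J_n = int(round(np.log2(n ** (1 / (2 * t + 0.5)))))      # 2^{J_n} ~ n^{1/(2t+1/2)}
X_null = rng.uniform(0, 1, n)                            # f = g: T_n(g) concentrates near 0
X_alt = rng.beta(2, 5, n)                                # f far from g: T_n(g) ~ ||Pi_{V_{J_n}}(f-g)||_2^2
print(T_n(X_null, {}, J_n), T_n(X_alt, {}, J_n))
```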

We shall prove a bound on the sum of the type-one and type-two errors of this test under some entropy conditions on \(\Sigma \), more precisely, on the class of functions

$$\begin{aligned} \mathcal G (\Sigma ) = \bigcup _{J > J_0} \left\{ \sum _{l=J_0}^{J-1} \sum _{k \in \mathcal Z _l} \psi _{lk}(\cdot ) \langle \psi _{lk}, g \rangle : g \in \Sigma \right\} . \end{aligned}$$

Recall the usual covering numbers \(N(\varepsilon , \mathcal G , L^2(P))\) and bracketing metric entropy numbers \(N_{[]}(\varepsilon , \mathcal G , L^2(P))\) for classes \(\mathcal G \) of functions and probability measures \(P\) on [0,1] (e.g., [31, 32]).

Definition 2

Say that \(\Sigma \) is \(s\)-regular if one of the following conditions is satisfied for some fixed finite constant \(A\) and every \(0<\varepsilon <A\):

  1. (a)

    For any probability measure \(Q\) on [0,1] (and \(A,s\) independent of \(Q\)) we have

    $$\begin{aligned} \log N(\varepsilon , \mathcal G (\Sigma ), L^2(Q)) \le (A/\varepsilon )^{1/s}. \end{aligned}$$
  2. (b)

    For \(P\) such that \(dP=fd\lambda \) with Lebesgue density \(f:[0,1] \rightarrow [0, \infty )\) we have

    $$\begin{aligned} \log N_{[]}(\varepsilon , \mathcal G (\Sigma ), L^2(P)) \le (A/\varepsilon )^{1/s}. \end{aligned}$$

Note that a ball \(\Sigma (s,B)\) satisfies this condition for the given \(s, 1/2<s<S,\) since any element of \(\mathcal G (\Sigma (s,B))\) has \(\Vert \cdot \Vert _{s,2}\)-norm no more than \(B\), and since

$$\begin{aligned} \log N(\varepsilon , \Sigma (s,B), \Vert \cdot \Vert _\infty ) \le (A/\varepsilon )^{1/s}, \end{aligned}$$

see, e.g., p. 506 in [26].

Proposition 2

Let

$$\begin{aligned} \tau _n = L d_n \max (n^{-2s/(2s+1)}, n^{-2t/(2t+1/2)}), ~~~\rho ^2_n = \frac{L_0}{L} \tau _n \end{aligned}$$

for real numbers \(1 \le d_n \le d(\log n)^\gamma \) and positive constants \(L, L_0, \gamma ,d\). Let the hypotheses \(H_0, H_1\) be as in (15), the test \(\Psi _n\) as in (17), and assume \(\Sigma \) is \(s\)-regular for some \(s>1/2\). Then for \(L=L(B, t, S), L_0=L_0(L, B,t, S)\) large enough and every \(n \in \mathbb N \) there exist constants \(c_i, i=1,\dots , 3\) depending only on \(L,L_0, t, B\) such that

$$\begin{aligned} \sup _{f \in H_0} E_f\Psi _n + \sup _{f \in H_1} E_f(1-\Psi _n) \le c_1 e^{-d_n^2} + c_2 e^{-c_3 n \rho _n^2}. \end{aligned}$$

The main idea of the proof is as follows: for the type-one errors our test-statistic is dominated by a degenerate \(U\)-statistic which we can bound with inequality (12), carefully controlling the four regimes present. For the alternatives the test statistic can be decomposed into a degenerate \(U\)-statistic which can be dealt with as before, and a linear part, which is the critical one. The latter can be compared to a ratio-type empirical process which we control by a slicing argument applied to \(\Sigma \), combined with Talagrand’s inequality.

Proof

1) We first control the type-one errors. Since \(f \in H_0 = \Sigma \) we see

$$\begin{aligned} E_f\Psi _n = {\Pr }_f \left\{ \inf _{g \in \Sigma } |T_n(g)| > \tau _n \right\} \le {\Pr }_f \left\{ |T_n(f)| > \tau _n \right\} . \end{aligned}$$
(18)

\(T_n(f)\) is a \(U\)-statistic with kernel

$$\begin{aligned} R_f(x,y)= \sum _{l=J_0}^{J_n-1} \sum _{k \in \mathcal Z _l}(\psi _{lk}(x)-\langle \psi _{lk}, f\rangle )(\psi _{lk}(y)-\langle \psi _{lk}, f \rangle ), \end{aligned}$$

which satisfies \(E R_f(x,X_1)=0\) for every \(x\), since \(E_f(\psi _{lk}(X)-\langle \psi _{lk}, f\rangle )=0\) for every \(k,l\). Consequently \(T_n(f)\) is a degenerate \(U\)-statistic of order two, and we can apply inequality (12) to it, which we shall do with \(u=d^2_n\). We thus need to bound the constants \(\Lambda _1, \dots , \Lambda _4\) occurring in inequality (12) in such a way that, for \(L\) large enough,

$$\begin{aligned} \frac{2C}{n(n-1)}(\Lambda _1 d_n+\Lambda _2 d_n^2+\Lambda _3 d_n^3+\Lambda _4 d_n^4) \le L d_n n^{-2t/(2t+1/2)} \le \tau _n, \end{aligned}$$
(19)

which is achieved by the following estimates, noting that \(n^{-2t/(2t+1/2)} \simeq 2^{J_n/2}/n\).

First, by standard \(U\)-statistic arguments, we can bound \(ER^2_f(X_1,X_2)\) by the second moment of the uncentred kernel, and thus, using orthonormality of \(\psi _{lk}\),

$$\begin{aligned} ER_f^2(X_1,X_2)&\le \int \int \left(\sum _{k,l} \psi _{lk}(x) \psi _{lk}(y)\right)^2 f(x)f(y)dxdy \\&\le \Vert f\Vert _\infty ^2 \sum _{l=J_0}^{J_n-1} \sum _{k \in \mathcal Z _l} \int _0^1 \psi _{lk}^2(x)dx \int _0^1 \psi _{lk}^2(y)dy\\&\le C(S)2^{J_n} \Vert f\Vert ^2_\infty \end{aligned}$$

for some constant \(C(S)\) that depends only on the wavelet basis. We obtain \(\Lambda _1^2 \le C(S) n(n-1) 2^{J_n} \Vert f\Vert _\infty ^2/2\) and it follows, using (9) that for \(L\) large enough and every \(n,\)

$$\begin{aligned} \frac{2C\Lambda _1 d_n}{n(n-1)} \le C(S, B, t) \frac{2^{J_n/2}d_n}{n} \le \tau _n/4. \end{aligned}$$

For the second term note that, using the Cauchy–Schwarz inequality and that \(K_j\) is a projection operator

$$\begin{aligned} \left|\int \int \sum _{l=J_0}^{J_n-1} \sum _{k \in \mathcal Z _l}\psi _{lk}(x) \psi _{lk}(y) \zeta (x) \xi (y) f(x)f(y)dxdy \right|&= \left|\int K_{J_n} (\zeta f)(y) \xi (y) f(y)dy \right| \\&\le \Vert K_{J_n}(\zeta f)\Vert _2 \Vert \xi f\Vert _2 \le \Vert f\Vert _\infty ^2, \end{aligned}$$

and similarly

$$\begin{aligned} |E[E_{X_1} [K_{J_n}(X_1,X_2)] \zeta (X_1) \xi (X_2)]| \le \Vert f\Vert ^2_\infty , \ |EK_{J_n}(X_1,X_2)| \le \Vert f\Vert ^2_\infty . \end{aligned}$$

Thus

$$\begin{aligned} E[R_f(X_1,X_2)\zeta (X_1)\xi (X_2)] \le 4\Vert f\Vert ^2_\infty \end{aligned}$$

so that, using (9),

$$\begin{aligned} \frac{2C\Lambda _2 d_n^2}{n(n-1)} \le \frac{C^{\prime }(B,t) d_n^2}{n} \le \tau _n/4 \end{aligned}$$

again for \(L\) large enough and every \(n\).

For the third term, using the decomposition \(R_f(x_1,x)=(r(x_1,x)-E_{X_1}r(X,x))+(E_{X,Y}r(X,Y)-E_Yr(x_1,Y))\) for \(r(x,y)=\sum _{k,l} \psi _{lk}(x)\psi _{lk}(y)\), the inequality \((a+b)^2 \le 2a^2+2b^2\) and again orthonormality, we have that for every \(x\in \mathbb R \),

$$\begin{aligned} n|E_{X_1}R_f^2(X_1,x)| \le 2n \left[\Vert f\Vert _\infty \sum _{l=J_0}^{J_n-1}\sum _{k \in \mathcal Z _l} \psi ^2_{lk}(x) + \Vert f\Vert _\infty \Vert \Pi _{V_{J_n}}(f)\Vert _2^2 \right] \end{aligned}$$

so that, using \(\Vert \psi _{lk}\Vert _\infty \le d2^{l/2}\), again for \(L\) large enough and by (9),

$$\begin{aligned} \frac{2C\Lambda _3d_n^{3}}{n (n-1)}\le C^{\prime \prime }(B,t) \frac{2^{J_n/2}d_n^3}{n}\frac{1}{\sqrt{n}} \le \tau _n/4. \end{aligned}$$

Finally, we have \(\Lambda _4=\Vert R_f\Vert _\infty \le c 2^{J_n}\) and hence

$$\begin{aligned} \frac{2C\Lambda _4 d_n^4}{n(n-1)} \le C^{\prime } \frac{2^{J_n} d_n^4}{n^2} \le \tau _n/4, \end{aligned}$$

so that we conclude for \(L\) large enough and every \(n \in \mathbb N \), from inequality (12),

$$\begin{aligned} {\Pr }_f \left\{ |T_n(f)| > \tau _n \right\} \le 6\exp \left\{ -d^2_n\right\} \end{aligned}$$
(20)

which completes the bound for the type-one errors in view of (18).

2) We now turn to the type-two errors. In this case, for \(f \in H_1\)

$$\begin{aligned} E_f(1-\Psi _n) = {\Pr }_f \left\{ \inf _{g \in \Sigma } |T_n(g)| \le \tau _n \right\} . \end{aligned}$$
(21)

The typical summand of \(T_n(g)\) has the Hoeffding decomposition

$$\begin{aligned}&(\psi _{lk}(X_i)-\langle \psi _{lk},g \rangle )(\psi _{lk}(X_j)-\langle \psi _{lk}, g \rangle ) \\&\quad = (\psi _{lk}(X_i)-\langle \psi _{lk}, f \rangle + \langle \psi _{lk},f-g\rangle )(\psi _{lk}(X_j)-\langle \psi _{lk}, f \rangle + \langle \psi _{lk}, f-g \rangle ) \\&\quad = (\psi _{lk}(X_i)- \langle \psi _{lk}, f \rangle )(\psi _{lk}(X_j)- \langle \psi _{lk}, f \rangle ) \\&\qquad + (\psi _{lk}(X_i)-\langle \psi _{lk}, f\rangle ) \langle \psi _{lk}, f-g \rangle + (\psi _{lk}(X_j)-\langle \psi _{lk}, f \rangle ) \langle \psi _{lk}, f-g \rangle \\&\qquad + \langle \psi _{lk}, f-g \rangle ^2 \end{aligned}$$

so that by the triangle inequality, writing

$$\begin{aligned} L_n(g)= \frac{2}{n} \sum _{i=1}^n \sum _{l=J_0}^{J_n-1}\sum _{k \in \mathcal Z _l}(\psi _{lk}(X_i)-\langle \psi _{lk}, f\rangle ) \langle \psi _{lk}, f-g \rangle \end{aligned}$$
(22)

for the linear terms, we conclude

$$\begin{aligned} \left| T_n(g) \right|&\ge \sum _{l=J_0}^{J_n-1}\sum _{k \in \mathcal Z _l}\langle \psi _{lk}, f-g \rangle ^2 - \left|T_n(f)\right| -|L_n(g)| \nonumber \\&= \Vert \Pi _{V_{J_n}}(f-g)\Vert _2^2 - |T_n(f)| - |L_n(g)| \end{aligned}$$
(23)

for every \(g \in \Sigma \).

We can find random \(g^*_n \in \Sigma \) such that \(\inf _{g \in \Sigma } |T_n(g)| = |T_n(g^*_n)|\). (If the infimum is not attained the proof below requires obvious modifications; for the case \(\Sigma =\Sigma (s,B), s>t\), relevant below, the infimum can be shown to be attained at a measurable minimiser by standard continuity and compactness arguments.) We bound the probability in (21), using (23), by

$$\begin{aligned} {\Pr }_f \left\{ |L_n(g_n^*)| \!>\! \frac{\Vert \Pi _{V_{J_n}}(f\!-\!g_n^*)\Vert _2^2 \!-\! \tau _n}{2}\right\} \!+\!{\Pr }_f \left\{ |T_n(f)| \!>\! \frac{\Vert \Pi _{V_{J_n}}(f\!-\!g_n^*)\Vert _2^2\!-\!\tau _n}{2}\right\} . \end{aligned}$$

Now by the standard approximation bound (cf. (6)) and since \(g^*_n \in \Sigma \subset \Sigma (t, B)\),

$$\begin{aligned} \Vert \Pi _{V_{J_n}}(f-g_n^*)\Vert _2^2 \ge \inf _{g \in \Sigma }\Vert f-g\Vert _2^2 - c(B)2^{-2J_nt} \ge 4 \tau _n \end{aligned}$$
(24)

for \(L_0\) large enough depending only on \(B\) and the choice of \(L\) from above. We can thus bound the sum of the last two probabilities by

$$\begin{aligned} {\Pr }_f \{|L_n(g_n^*)| > \Vert \Pi _{V_{J_n}}(f-g_n^*)\Vert _2^2/4\} +{\Pr }_f \{|T_n(f)| > \tau _n\}. \end{aligned}$$

For the second degenerate part the proof of Step 1 applies, as only boundedness of \(f\) was used there. In the linear part somewhat more care is necessary. We have

$$\begin{aligned} {\Pr }_f \{|L_n(g^*_n)| > \Vert \Pi _{V_{J_n}}(f-g^*_n)\Vert _2^2/4\} \le {\Pr }_f \left\{ \sup _{g \in \Sigma }\frac{|L_n(g)|}{\Vert \Pi _{V_{J_n}}(f-g)\Vert _2^2} > \frac{1}{4}\right\} .\qquad \end{aligned}$$
(25)

Note that the variance of the linear process from (22) can be bounded, for fixed \(g \in \Sigma \), using independence and orthonormality, by

$$\begin{aligned} Var_f(|L_n(g)|)&\le \frac{4}{n} \int \left(\sum _{l=J_0}^{J_n-1}\sum _{k \in \mathcal Z _l} \psi _{lk}(x) \langle \psi _{lk}, f-g\rangle \right)^2 f(x)dx \nonumber \\&\le \frac{4 \Vert f\Vert _\infty }{n} \sum _{l=J_0}^{J_n-1}\sum _{k \in \mathcal Z _l} \int \psi ^2_{lk}(x)dx \cdot \langle \psi _{lk}, f-g \rangle ^2 \nonumber \\&\le \frac{4\Vert f\Vert _\infty \Vert \Pi _{V_{J_n}}(f-g)\Vert _2^2}{n} \end{aligned}$$
(26)

so that the supremum in (25) is one of a self-normalised ratio-type empirical process. Such processes can be controlled by slicing the supremum into shells of almost constant variance, cf. Section 5 in [31] or [11]. Define, for \(g \in \Sigma ,\)

$$\begin{aligned} \sigma ^2(g):=\Vert \Pi _{V_{J_n}}(f-g)\Vert _2^2 \ge \Vert f-g\Vert _2^2 - c(B)2^{-2J_n t} \ge c \rho _n^2, \end{aligned}$$

the inequality holding for \(L_0\) large enough and some \(c>0\), as in (24). Define moreover, for \(m \in \mathbb Z \), the class of functions

$$\begin{aligned} \mathcal G _{m, J_n} = \left\{ 2\sum _{l=J_0}^{J_n-1}\sum _{k \in \mathcal Z _l} \psi _{lk}(\cdot ) \langle \psi _{lk}, f-g \rangle : g \in \Sigma , \sigma ^2(g)\le 2^{m+1} \right\} , \end{aligned}$$

which is uniformly bounded by a constant multiple of \(\Vert f\Vert _{t,2}+\sup _{g \in \Sigma (t,B)}\Vert g\Vert _{t,2} \le 2B\) in view of (6) and since \(t>1/2\). Then, in the notation of Sect. 4.1,

$$\begin{aligned} \sup _{g \in \Sigma : \sigma ^2(g) \le 2^{m+1}}|L_n(g)| = \Vert P_n-P\Vert _\mathcal{G _{m, J_n}} \end{aligned}$$

and we bound the last probability in (25) by

$$\begin{aligned}&{\Pr }_f \left\{ \max _{m \in \mathbb Z : c^{\prime }\rho _n^2 \le 2^m \le C}\sup _{g \in \Sigma : 2^m \le \sigma ^2(g) \le 2^{m+1}}\frac{|L_n(g)|}{\sigma ^2(g)} > \frac{1}{4}\right\} \nonumber \\&\quad \le \sum _{m \in \mathbb Z : c^{\prime }\rho _n^2 \le 2^m \le C} {\Pr }_f \left\{ \sup _{g \in \Sigma : \sigma ^2(g) \le 2^{m+1}}|L_n(g)| > 2^{m-2}\right\} \nonumber \\&\quad \le \sum _{m \in \mathbb Z : c^{\prime }\rho _n^2 \le 2^m \le C} {\Pr }_f \left\{ \Vert P_n-P\Vert _\mathcal{G _{m, J_n}}-E \Vert P_n-P\Vert _\mathcal{G _{m, J_n}}\right.\nonumber \\&\qquad \qquad \qquad \quad \qquad \qquad \qquad \left. > 2^{m-2} -E \Vert P_n-P\Vert _\mathcal{G _{m, J_n}}\right\} \end{aligned}$$
(27)

where we may take \(C<\infty \) as \(\Sigma \subset \Sigma (t, B)\) is bounded in \(L^2\), and where \(c^{\prime }\) is a positive constant such that \(c^{\prime } \rho _n^2 \le 2^m \le c \rho _n^2\) for some \(m \in \mathbb Z \). We bound the expectation of the empirical process. Both the uniform and the bracketing entropy condition for \(\mathcal G (\Sigma )\) carry over to \(\cup _{J > J_0}\mathcal G _{m,J}\) since translation by \(f\) preserves the entropy. Using the standard entropy-bound plus chaining moment inequality (3.5) in Theorem 3.1 in [11] in case a) of Definition 2, and the second bracketing entropy moment inequality in Theorem 2.14.2 in [32] in case b), together with the variance bound (26) and with (9), we deduce

$$\begin{aligned} E \Vert P_n-P\Vert _\mathcal{G _{m,J_n}} \le C \left( \sqrt{\frac{2^m}{n}} (2^m)^{-1/4s} + \frac{(2^m)^{-1/2s}}{n}\right). \end{aligned}$$
(28)

We see that

$$\begin{aligned} 2^{m-2} -E \Vert P_n-P\Vert _\mathcal{G _{m, J_n}} \ge c_0 2^m \end{aligned}$$

for some fixed \(c_0\) precisely when \(2^m\) is of larger magnitude than \((2^m)^{\frac{1}{2}-\frac{1}{4s}} n^{-1/2} + (2^m)^{-1/2s}n^{-1}\), equivalent to \(2^m \ge c^{\prime \prime } n^{-2s/(2s+1)}\) for some \(c^{\prime \prime }>0\), which is satisfied since \(2^m \ge c^{\prime } \rho _n^2 \ge c^{\prime \prime } n^{-2s/(2s+1)}\) if \(L_0\) is large enough, by hypothesis on \(\rho _n\). We can thus rewrite the last probability in (27) as

$$\begin{aligned} \sum _{m \in \mathbb Z : c^{\prime }\rho _n^2 \le 2^m \le C} {\Pr }_f \left\{ n\Vert P_n-P\Vert _\mathcal{G _{m, J_n}}- nE\Vert P_n-P\Vert _\mathcal{G _{m, J_n}} > c_0n2^m \right\} \!. \end{aligned}$$

To this expression we can apply Talagrand’s inequality (13), noting that the supremum over \(\mathcal G _{m, J_n}\) can be realised, by continuity, as one over a countable subset of \(\Sigma \), and since \(\Sigma \) is uniformly bounded by \(\sup _{f \in \Sigma (t, B)}\Vert f\Vert _\infty \le U \equiv U(t,B)\). Renormalising by \(U\) and using (13), (26), (28) we can bound the expression in the last display, up to multiplicative constants, by

$$\begin{aligned} \sum _{m \in \mathbb Z : c^{\prime }\rho _n^2 \le 2^m \le C} \exp \! \left\{ -c_1 \frac{n^2(2^m)^2}{n2^m + nE \Vert P_n-P\Vert _\mathcal{G _{m, J_n}}+ n2^m } \right\}&\!\le\! \sum _{m \in \mathbb Z : c^{\prime }\rho _n^2 \le 2^m \le C}\!\! e^{-c_2 n2^m} \\&\!\le c_3 e^{-c_4 n \rho _n^2} \end{aligned}$$

since \(2^m \ge c^{\prime }\rho _n^2 \gg n^{-1}\), which completes the proof. \(\square \)

4.3 Proof of Theorem 2

Proof

We construct a standard Lepski-type estimator: choose integers \(j_{\min }, j_{\max }\) such that \(J_0 \le j_{\min } < j_{\max }\),

$$\begin{aligned} 2^{j_{\min }} \simeq n^{1/(2R+1)} ~~\mathrm{and} ~~ 2^{j_{\max }} \simeq n^{1/(2r+1)} \end{aligned}$$

and define the grid

$$\begin{aligned} \mathcal J := \mathcal J _n = [j_{\min }, j_{\max }] \cap \mathbb N . \end{aligned}$$

Let \(f_n(j) \equiv f_n(j,\cdot )=\int _0^1 K_j(\cdot ,y)dP_n(y)\) be a linear wavelet estimator based on wavelets of regularity \(S>R\). To simplify the exposition we prove the result for \(\Vert f\Vert _\infty \) known, otherwise the result follows from the same proof, with \(\Vert f\Vert _\infty \) replaced by \(\Vert f_n(j_{\max })\Vert _\infty \), a consistent estimator for \(\Vert f\Vert _\infty \) that satisfies sufficiently tight uniform exponential error bounds (using inequality (26) in [15] and proceeding as in Step (II) on p.1157 in [14]). Set

$$\begin{aligned} \bar{j}_n = \min \bigg \{ j \in \mathcal J : \Vert f_n(j) - f_n(l)\Vert _2^2 \le C(S) (\Vert f\Vert _\infty \vee 1) \frac{2^l}{n} ~~ \forall l>j, l\in \mathcal{J } \bigg \}\qquad \end{aligned}$$
(29)

where \(C(S)\) is a large enough constant, to be chosen below in dependence on the wavelet basis. The adaptive estimator is \(\hat{f}_n = f_n(\bar{j}_n)\). We shall need the standard estimates

$$\begin{aligned} E\Vert f_n(j) - Ef_n(j)\Vert _2^2 \le D \frac{2^j }{n} := D \sigma ^2 (j,n) \end{aligned}$$
(30)

and, for \(f \in W^s, s \in [r,R]\),

$$\begin{aligned} \Vert E f_n(j) - f\Vert _2 \le 2^{-js} D^{\prime } \Vert f\Vert _{s,2} := B(j, f) \end{aligned}$$
(31)

for constants \(D, D^{\prime }\) that depend only on the wavelet basis and on \(r,R\). Define \(j^*:=j^*(f)\) by

$$\begin{aligned} j^*=\min \left\{ j\in \mathcal J : B(j,f) \le \sqrt{D} \sigma (j,n) \right\} \end{aligned}$$

so that, for every \(f \in \Sigma (s,B)\) and \(D^{\prime \prime }=D^{\prime \prime }(D,D^{\prime })\)

$$\begin{aligned} D^{-1} B^2(j^*,f) \!\le \! \sigma ^2(j^*,n) \!\le \! D^{\prime \prime } \Vert f\Vert _{s,2}^{2/(2s+1)} n^{-2s/(2s+1)} \!\le \! D^{\prime \prime } B^{2/(2s+1)} n^{-2s/(2s+1)}.\nonumber \\ \end{aligned}$$
(32)

We will consider the cases \(\{\bar{j}_n \le j^* \}\) and \(\{\bar{j}_n > j^* \}\) separately. First, by the definition of \(\bar{j}_n, j^*\) and (30), (31), (32),

$$\begin{aligned} E \left\Vert f_n(\bar{j}_n) - f \right\Vert^2_2 I_{\{\bar{j}_n \le j^* \}}&\le 2E \left( \Vert f_n (\bar{j}_n) - f_n (j^*) \Vert ^2_2 + \Vert f_n (j^*) - f\Vert ^2_2 \right) I_{\{\bar{j}_n \le j^* \}}\\&\le 2C(S)(\Vert f\Vert _\infty \vee 1)\frac{2^{j^*}}{n} + C^{\prime } \sigma ^2(j^*,n)\\&\le C^{\prime \prime } B^{2/(2s+1)} n^{-2s/(2s+1)} \end{aligned}$$

for \(C^{\prime \prime }=C^{\prime \prime }(D,D^{\prime }, S,U)\), which is the desired bound. On the event \(\{\bar{j}_n > j^* \}\) we have, using (30) and the definition of \(j^*\),

$$\begin{aligned} E \left\Vert f_n(\bar{j}_n) - f\right\Vert_2 I_{\{\bar{j}_n > j^* \}}&\le \sum _{j \in \mathcal J :j > j^*} \left(E \left\Vert f_n(j) -f\right\Vert_2^2 \right)^{1/2} ~ \left(EI_{\{\bar{j}_n = j\}}\right)^{1/2} \\&\le \sum _{j \in \mathcal J : j > j^*} C^{\prime \prime \prime } \sigma (j,n) \cdot \sqrt{ {\Pr }_f\{\bar{j}_n = j\}} \\&\le C^{\prime \prime \prime \prime } \sum _{j \in \mathcal J : j>j^*} \sqrt{ {\Pr }_f\{\bar{j}_n = j\}} \end{aligned}$$

since \(\sup _{j \in \mathcal J }\sigma (j,n) = \sigma (j_{\max },n)\) is bounded in \(n\). Now pick any \(j \in \mathcal J \) so that \(j > j^*\) and denote by \(j^-\) the previous element in the grid (i.e. \(j^-= j-1\)). One has, by definition of \(\bar{j}_n\),

$$\begin{aligned} {\Pr }_f \{\bar{j}_n= j\} \le \sum _{l \in \mathcal J : l \ge j} {\Pr }_f \left\{ \left\Vert f_n(j^-) - f_n(l) \right\Vert_2 > \sqrt{C(S) (\Vert f\Vert _\infty \vee 1) \frac{2^l}{n}}\right\} ,\qquad \end{aligned}$$
(33)

and we observe that, by the triangle inequality,

$$\begin{aligned} \left\Vert f_n(j^-) - f_n(l) \right\Vert_2 \!\le \! \left\Vert f_n(j^-) \!-\! f_n(l) \!-\! Ef_n(j^-) \!+\! Ef_n(l) \right\Vert_2 \!+\! B(j^-, f) \!+\! B(l, f), \end{aligned}$$

where,

$$\begin{aligned} B(j^-, f) + B(l, f) \le 2B(j^*, f) \le c \sigma (j^*,n) \le c^{\prime } \sigma (l,n) \end{aligned}$$

by definition of \(j^*\) and since \(l>j^- \ge j^*\). Consequently, the probability in (33) is bounded by

$$\begin{aligned} {\Pr }_f \left\{ \left\Vert f_n(j^-) \!-\! f_n(l) -Ef_n(j^-) \!+\! Ef_n(l) \right\Vert_2 > (\sqrt{C(S)(\Vert f\Vert _\infty \vee 1)}-c^{\prime }) \sigma (l,n) \right\} ,\nonumber \\ \end{aligned}$$
(34)

and by inequality (14) above this probability is bounded by a constant multiple of \(e^{-d2^l}\) if we choose \(C(S)\) large enough. This gives the overall bound

$$\begin{aligned} \sum _{l \in \mathcal J : l \ge j} c^{\prime \prime }e^{-d2^l} \le d^{\prime }e^{-d^{\prime \prime }2^{j_{\min }}}, \end{aligned}$$

which is smaller than a constant multiple of \(B^{1/(2s+1)} n^{-s/(2s+1)}\), uniformly in \(s \in [r,R], n \in \mathbb N \) and for \(B \ge 1\), by definition of \(j_{\min }\). This completes the proof. \(\square \)
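
In computational terms, the selection rule (29) can be sketched as follows, with Haar empirical wavelet coefficients in place of the CDV basis and an ad hoc constant instead of the \(C(S)\) required by the proof; this is meant only to illustrate the mechanism, not to reproduce the constants.

```python
import numpy as np

rng = np.random.default_rng(1)

# Sketch of the Lepski-type rule (29): by orthonormality, the L^2 distance
# between two linear estimators equals the sum of squared empirical wavelet
# coefficients at the intermediate levels.  The constant C is ad hoc.

def haar_psi(x):
    return np.where((0 <= x) & (x < 0.5), 1.0, 0.0) - np.where((0.5 <= x) & (x < 1.0), 1.0, 0.0)

def empirical_coeffs(X, j_max):
    # hat beta_{lk} = (1/n) sum_i psi_{lk}(X_i) for levels l = 0, ..., j_max - 1
    return {l: np.array([np.mean(2.0 ** (l / 2) * haar_psi(2.0 ** l * X - k))
                         for k in range(2 ** l)])
            for l in range(j_max)}

def lepski(X, j_min, j_max, C=1.0):
    n, beta = len(X), empirical_coeffs(X, j_max)
    for j in range(j_min, j_max + 1):
        # ||f_n(j) - f_n(l)||_2^2 = sum of squared coefficients at levels j, ..., l-1
        if all(sum((beta[m] ** 2).sum() for m in range(j, l)) <= C * 2 ** l / n
               for l in range(j + 1, j_max + 1)):
            return j
    return j_max

X = rng.beta(2, 2, 5000)                  # a very smooth density
print(lepski(X, j_min=2, j_max=9))        # a small resolution level should be selected
```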

4.4 Proof of Theorem 3

Proof

(A) Suppose for simplicity that the sample size is \(2n\), and split the sample into two halves with index sets \(\mathcal S ^1, \mathcal S ^2\), of equal size \(n\), write \(E_1, E_2\) for the corresponding expectations, and \(E=E_1E_2\). Let \(\hat{f}_n= f_n(\bar{j}_n)\) be the adaptive estimator from the proof of Theorem 2 based on the sample \(\mathcal S ^1\). One shows by a standard bias-variance decomposition, using \(\bar{j}_n \in \mathcal J \) and \(\Vert K_j(f)\Vert _{r,2} \le \Vert f\Vert _{r,2}\), that for every \(\varepsilon >0\) there exists a finite positive constant \(B^{\prime }=B^{\prime }(\varepsilon , B_0)\) satisfying

$$\begin{aligned} \inf _{f \in \Sigma (r,B_0)}{\Pr }_f\{\Vert \hat{f}_n\Vert _{r,2} \le B^{\prime }\} \ge 1 - \varepsilon . \end{aligned}$$

It therefore suffices to prove the theorem on the event \(\{\Vert \hat{f}_n\Vert _{r,2} \le B^{\prime }\}\). For a wavelet basis of regularity \(S>R\) and for \(J_n \ge J_0\) a sequence of integers such that \(2^{J_n} \simeq n^{1/(2r+1/2)}\), define the \(U\)-statistic

$$\begin{aligned} U_n(\hat{f}_n)\!=\!\frac{2}{n(n-1)} \sum _{i<j, i,j \in \mathcal S ^2} \sum _{l=J_0}^{J_n-1} \sum _{k \in \mathcal Z _l} (\psi _{lk}(X_i)\!-\!\langle \psi _{lk}, \hat{f}_n\rangle )(\psi _{lk}(X_j)\!-\!\langle \psi _{lk}, \hat{f}_n \rangle )\nonumber \\ \end{aligned}$$
(35)

which has expectation

$$\begin{aligned} E_2 U_n(\hat{f}_n) = \sum _{l=J_0}^{J_n-1} \sum _{k \in \mathcal Z _l}\langle \psi _{lk}, f- \hat{f}_n \rangle ^2 = \Vert \Pi _{V_{J_n}}(f-\hat{f}_n)\Vert _2^2. \end{aligned}$$

Using Chebyshev’s inequality and that, by definition of the norm (6)

$$\begin{aligned} \sup _{h \in \Sigma (r,b)}\Vert \Pi _{V_{J_n}}(h)-h\Vert _2^2 \le c(b) 2^{-2J_nr} \end{aligned}$$

for every \(0<b<\infty \) and some finite constant \(c(b)\), we deduce

$$\begin{aligned}&\inf _{f \in \Sigma (r,B_0)} {\Pr }_{f,2} \left\{ U_n(\hat{f}_n) - \Vert f-\hat{f}_n\Vert _2^2 \ge -(c(B_0)+c(B^{\prime }))2^{-2J_nr} -z(\alpha ) \tau _n(f) \right\} \\&\qquad \ge \inf _{f \in \Sigma (r,B_0)} {\Pr }_{f,2} \left\{ U_n(\hat{f}_n) - \Vert \Pi _{V_{J_n}}(f-\hat{f}_n)\Vert _2^2 \ge -z(\alpha ) \tau _n(f) \right\} \\&\qquad \ge 1 - \sup _{f \in \Sigma (r,B_0)}\frac{Var_2(U_n(\hat{f}_n)-E_2U_n(\hat{f}_n))}{(z(\alpha ) \tau _n(f))^2}. \end{aligned}$$

We now show that the last quantity is greater than or equal to \(1-z(\alpha )^{-2} \ge 1- \alpha \) for quantile constants \(z(\alpha )\) and with

$$\begin{aligned} \tau ^2_n(f)= \frac{C(S)2^{J_n}\Vert f\Vert ^2_\infty }{n(n-1)} +\frac{4\Vert f\Vert _\infty }{n}\Vert \Pi _{V_{J_n}}(f-\hat{f}_n)\Vert _2^2, \end{aligned}$$

which in turn gives the honest confidence set under \(\Pr \)

$$\begin{aligned} C_n(\Vert f\Vert _\infty , B_0) \!=\! \left\{ f: \Vert f-\hat{f}_n\Vert _2 \le \sqrt{z_\alpha \tau _n(f) \!+\! U_n(\hat{f}_n) \!+\! (c(B_0)+c(B^{\prime }))2^{-2{J_n}r}} \right\} .\nonumber \\ \end{aligned}$$
(36)

We shall comment on the role of the constants \(\Vert f\Vert _\infty , c(B_0), c(B^{\prime })\) at the end of the proof, and establish the last claim first: note that the Hoeffding decomposition for the centered \(U\)-statistic with kernel

$$\begin{aligned} R(x,y)= \sum _{l=J_0}^{J_n-1} \sum _{k \in \mathcal Z _l} (\psi _{lk}(x)-\langle \psi _{lk}, \hat{f}_n\rangle )(\psi _{lk}(y)-\langle \psi _{lk}, \hat{f}_n \rangle ) \end{aligned}$$

is (cf. the proof of Theorem 4.1 in [28])

$$\begin{aligned} U_n(\hat{f}_n)-E_2U_n(\hat{f}_n) = \frac{2}{n} \sum _{i=1}^n (\pi _1R)(X_i) + \frac{2}{n(n-1)}\sum _{i<j}(\pi _2R)(X_i, X_j) \equiv L_n + D_n \end{aligned}$$

where

$$\begin{aligned} (\pi _1R)(x) =\sum _{l=J_0}^{J_n-1} \sum _{k \in \mathcal Z _l} (\psi _{lk}(x)-\langle \psi _{lk}, f \rangle ) \langle \psi _{lk}, f-\hat{f}_n \rangle \end{aligned}$$

and

$$\begin{aligned} (\pi _2R)(x,y)= \sum _{l=J_0}^{J_n-1} \sum _{k \in \mathcal Z _l} (\psi _{lk}(x)-\langle \psi _{lk}, f \rangle )(\psi _{lk}(y)-\langle \psi _{lk}, f \rangle ). \end{aligned}$$

The variance of \(U_n(\hat{f}_n)-E_2U_n(\hat{f}_n)\) is the sum of the variances of the two terms in the Hoeffding decomposition. For the linear term we bound the variance \(Var_2(L_n)\) by the second moment, using orthonormality of the \(\psi _{lk}\)s,

$$\begin{aligned} \frac{4}{n} \int \left(\sum _{l=J_0}^{J_n-1} \sum _{k \in \mathcal Z _l} \psi _{lk}(x) \langle \psi _{lk}, \hat{f}_n -f \rangle \right)^2 f(x) dx \le \frac{4 \Vert f\Vert _\infty }{n} \sum _{l=J_0}^{J_n-1} \sum _{k \in \mathcal Z _l} \langle \psi _{lk}, \hat{f}_n-f \rangle ^2, \end{aligned}$$

which equals the second term in the definition of \(\tau ^2_n(f)\). For the degenerate term we can bound \(Var_2(D_n)\) analogously by the second moment of the uncentered kernel (cf. after (19)), i.e., by

$$\begin{aligned} \frac{2}{n(n-1)} \int \!\!\int \left(\sum _{l=J_0}^{J_n-1} \sum _{k \in \mathcal Z _l} \psi _{lk}(x) \psi _{lk}(y) \right)^2 f(x) f(y)\, dx\, dy \le \frac{C(S) 2^{J_n} \Vert f\Vert ^2_\infty }{n(n-1)}, \end{aligned}$$

using orthonormality and the cardinality properties of \(\mathcal Z _l\).

The confidence set so constructed has adaptive expected maximal diameter: let \(f \in \Sigma (s,B)\) for some \(s \in [r,R]\) and some \(1 \le B \le B_0\). The nonrandom terms are of order

$$\begin{aligned} \sqrt{c(B_0)+c(B^{\prime })}2^{-J_nr} + \Vert f\Vert _\infty ^{1/2}2^{J_n/4}n^{-1/2} \le C(S, B_0, B^{\prime }, r, U) n^{-r/(2r+1/2)} \end{aligned}$$

which is \(o(n^{-s/(2s+1)})\) since \(s \le R < 2r\). The random component of \(\tau _n(f)\) has order \(\Vert f\Vert _\infty ^{1/4} n^{-1/4}E_1\Vert \Pi _{V_{J_n}}(\hat{f}_n - f)\Vert _2^{1/2}\) which is also \(o(n^{-s/(2s+1)})\) for \(s<2r\), since \(\Pi _{V_{J_n}}\) is a projection operator and since \(\hat{f}_n\) is adaptive, as established in Theorem 2. Moreover, by Theorem 2 and again the projection properties,

$$\begin{aligned} EU_n(\hat{f}_n) = E_1\Vert \Pi _{V_{J_n}}(\hat{f}_n-f)\Vert _2^2 \le E_1\Vert \hat{f}_n -f\Vert _2^2 \le c B^{2/(2s+1)}n^{-2s/(2s+1)}. \end{aligned}$$

The term in the last display is the leading term in our bound for the diameter of the confidence set and, combined with Markov’s inequality, shows that \(C_n\) adapts to both \(B\) and \(s\) in the sense of Definition 1.
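For later reference we record the elementary exponent comparison behind the claim that \(n^{-r/(2r+1/2)}\) is \(o(n^{-s/(2s+1)})\); it reappears, with equality, in the proof of Theorem 1:

$$\begin{aligned} \frac{r}{2r+1/2} = \frac{2r}{4r+1} > \frac{s}{2s+1} \iff 2r(2s+1) > s(4r+1) \iff s < 2r, \end{aligned}$$

so that \(n^{-r/(2r+1/2)} = o(n^{-s/(2s+1)})\) precisely when \(s < 2r\), the two rates coinciding at \(s = 2r\).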

The confidence set \(C_n(\Vert f\Vert _\infty , B_0)\) is not feasible if \(B_0\) and \(\Vert f\Vert _\infty \) are unknown, in particular under the assumptions of Theorem 3. A confidence set \(C_n\) independent of \(B_0, \Vert f\Vert _\infty \) can, however, be constructed as follows: replace \(c(B_0)+c(B^{\prime })\) in the definition (36) by a sequence of positive real numbers \(c_n\) diverging slowly enough (e.g., like \(\log n\)); this can still be accommodated in the diameter estimate from the last paragraph, since \(n^{-2r/(2r+1/2)}c_n\) remains \(o(n^{-2s/(2s+1)})\) as long as \(s \le R<2r\). Thus define the confidence set

$$\begin{aligned} C_n = \left\{ f: \Vert f-\hat{f}_n\Vert _2 \le \sqrt{z_\alpha \tau _n(f) + U_n(\hat{f}_n) + c_n2^{-2J_nr}} \right\} \!, \end{aligned}$$
(37)

with \(\Vert f\Vert _\infty \) replaced by \(\Vert f_n(j_{\max })\Vert _\infty \) in all expressions where \(\Vert f\Vert _\infty \) occurs. As stated before (29), \(\Vert f_n(j_{\max })\Vert _\infty \) concentrates around \(\Vert f\Vert _\infty \) with exponential error bounds, so that the sufficiency part of Theorem 3 then holds for this \(C_n\) with slightly increased \(z_\alpha \).
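Although the proof is purely theoretical, the quantities in (35)–(37) are computable from data once wavelet coefficients of \(\hat{f}_n\) are available. The following is a minimal numerical sketch, written for the Haar basis purely for concreteness (its regularity \(S=1\) does not meet the requirement \(S>R\) of the proof, so it is illustrative only); the function names, the coefficient dictionary fhat_coeff, and the clipping of the squared radius at zero are our own choices, not part of the construction.

```python
import numpy as np

def haar_psi(l, k, x):
    # Haar wavelet psi_{lk}(x) = 2^{l/2} psi(2^l x - k) on [0, 1),
    # with psi = 1 on [0, 1/2) and psi = -1 on [1/2, 1).
    y = 2.0 ** l * np.asarray(x, dtype=float) - k
    return 2.0 ** (l / 2.0) * (((0.0 <= y) & (y < 0.5)).astype(float)
                               - ((0.5 <= y) & (y < 1.0)).astype(float))

def u_statistic(X, fhat_coeff, J0, Jn):
    # U_n(fhat_n) of (35) computed on the second half-sample X;
    # fhat_coeff[(l, k)] holds the coefficient <psi_{lk}, fhat_n>.
    n = len(X)
    cols = [haar_psi(l, k, X) - fhat_coeff[(l, k)]
            for l in range(J0, Jn) for k in range(2 ** l)]
    Z = np.column_stack(cols)            # n x (number of coefficients)
    G = Z @ Z.T                          # G[i, j] = centred kernel at (X_i, X_j)
    return (G.sum() - np.trace(G)) / (n * (n - 1))   # 2/(n(n-1)) * sum_{i<j}

def radius(u_n, tau_n, z_alpha, slack):
    # Radius of the ball in (36)/(37); the clipping at zero is a practical
    # tweak of ours, since U_n can be negative in finite samples.
    return np.sqrt(max(z_alpha * tau_n + u_n + slack, 0.0))
```

In practice \(\tau _n\) would be evaluated with \(\Vert f_n(j_{\max })\Vert _\infty \) in place of \(\Vert f\Vert _\infty \), as described above.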

(B) Necessity of \(R \le 2r\) follows immediately from Part B of Theorem 1. That \(R<2r\) is also necessary is proved in Sect. 4.8 below. \(\square \)

4.5 Proof of Theorem 1

Proof

That an \(L^2\)-adaptive confidence set exists when \(s \le 2r\) follows from Theorem 3: the case \(s<2r\) is immediate, and the case \(s=2r\) follows using the confidence set (36). This set is feasible since, under the hypotheses of Theorem 1, \(B=B_0\) is known, as are \(B^{\prime }\) and the upper bound for \(\Vert f\Vert _\infty \) (cf. (9)). It is further adaptive since \(n^{-r/(2r+1/2)}=n^{-s/(2s+1)}\) for \(s=2r\).

For part (A)(ii) we use the test \(\Psi _n\) from Proposition 2 with \(\Sigma =\Sigma (s), t=r,\) and define a confidence ball as follows. Take \(\hat{f}_n=f_n(\bar{j}_n)\) to be the adaptive estimator from the proof of Theorem 2, and let, for \(0<L^{\prime }<\infty \),

$$\begin{aligned} C_n={\left\{ \begin{array}{ll}\{f \in \Sigma (r): \Vert f-\hat{f}_n\Vert _2 \le L^{\prime }n^{-s/(2s+1)}\}&\text{ if} ~\Psi _n=0\\ \{f \in \Sigma (r): \Vert f-\hat{f}_n\Vert _2 \le L^{\prime }n^{-r/(2r+1)}\}&\text{ if}~\Psi _n=1 \end{array}\right.} \end{aligned}$$
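Schematically, the radius of this ball is determined by the outcome of the single test \(\Psi _n\); a minimal sketch (with names of our own choosing, and with \(\Psi _n\) treated as given) reads:

```python
def ball_radius(psi_n, n, r, s, L_prime):
    # Radius of the confidence ball around fhat_n from the display above:
    # the faster rate if Psi_n accepts (psi_n == 0), the slower rate if it rejects.
    exponent = s / (2.0 * s + 1.0) if psi_n == 0 else r / (2.0 * r + 1.0)
    return L_prime * n ** (-exponent)
```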

We first prove that \(C_n\) is honest for \(\Sigma (s) \cup \tilde{\Sigma }(r, \rho _n)\) if we choose \(L^{\prime }\) large enough. For \(f \in \Sigma (s)\) we have from Theorem 2, by Markov’s inequality,

$$\begin{aligned} \inf _{f \in \Sigma (s)}{\Pr }_f \left\{ f \in C_n \right\}&\ge 1- \sup _{f \in \Sigma (s)}{\Pr }_f \left\{ \Vert \hat{f}_n-f\Vert _2 > L^{\prime }n^{-s/(2s+1)} \right\} \\&\ge 1-\frac{n^{s/(2s+1)}}{L^{\prime }} \sup _{f \in \Sigma (s)}E_f \Vert \hat{f}_n -f\Vert _2 \\&\ge 1-\frac{c(B,s,r)}{L^{\prime }} \end{aligned}$$

which can be made greater than \(1-\alpha \) for any \(\alpha >0\) by choosing \(L^{\prime }\) large enough, depending only on \(B, \alpha , r, s\). When \(f \in \tilde{\Sigma }(r, \rho _n)\), again by Markov’s inequality,

$$\begin{aligned} \inf _{f \in \tilde{\Sigma }(r, \rho _n)}{\Pr }_f \left\{ f \in C_n\right\} \ge 1 - \frac{\sup _{f \in \Sigma (r)}E_f\Vert \hat{f}_n -f\Vert _2}{L^{\prime }n^{-r/(2r+1)}} - \sup _{f \in \tilde{\Sigma }(r, \rho _n)}{\Pr }_f\{\Psi _n =0\}. \end{aligned}$$

The first subtracted term can be made smaller than \(\alpha /2\) for \(L^{\prime }\) large enough as before. The second subtracted term can also be made less than \(\alpha /2\) using Proposition 2 and the remark preceding it, choosing \(M\) and \(d_n\) to be large but also bounded in \(n\). This proves that \(C_n\) is honest. We now turn to adaptivity of \(C_n\): by the definition of \(C_n\) we always have \(|C_n| \le L^{\prime }n^{-r/(2r+1)}\), so the case \(f \in \tilde{\Sigma }(r, \rho _n)\) is proved. If \(f \in \Sigma (s)\) then using Proposition 2 again, for \(M, d_n\) large enough depending on \(\alpha ^{\prime }\) but bounded in \(n\),

$$\begin{aligned} {\Pr }_f\{|C_n| > L^{\prime }n^{-s/(2s+1)}\} = {\Pr }_f\{\Psi _n =1\} \le \alpha ^{\prime }, \end{aligned}$$

which completes the proof of part A.

To prove part B of Theorem 1 we argue by contradiction and assume that the limit inferior equals zero. We then pass to a subsequence of \(n\) for which the limit is zero, and still denote this subsequence by \(n\). Let \(f_0\equiv 1 \in \Sigma (s)\), suppose \(C_n\) is adaptive and honest for \(\Sigma (s) \cup \tilde{\Sigma }(r, \rho _n)\) for every \(\alpha , \alpha ^{\prime }\), and consider testing

$$\begin{aligned} H_0: f=f_0 ~~~\text{ against}~~~H_1: f \in \tilde{\Sigma }(r, \rho _n) \end{aligned}$$

where \(\rho _n =o(n^{-r/(2r+1/2)})\). Since \(s>2r\) we may assume \(n^{-s/(2s+1)}=o(\rho _n)\) (otherwise replace \(\rho _n\) by \(\rho _n^{\prime } \ge \rho _n\) s.t. \(n^{-s/(2s+1)}=o(\rho _n^{\prime })\)). Accept \(H_0\) if \(C_n \cap \tilde{\Sigma }(r, \rho _n)\) is empty and reject otherwise, formally

$$\begin{aligned} \Psi _n = 1\{C_n \cap \tilde{\Sigma }(r, \rho _n) \ne \emptyset \}. \end{aligned}$$

The type-one errors of this test satisfy

$$\begin{aligned} E_{f_0}\Psi _n&= {\Pr }_{f_0}\left\{ C_n \cap \tilde{\Sigma }(r, \rho _n) \ne \emptyset \right\} \\&\le {\Pr }_{f_0} \{f_0 \in C_n, |C_n| \ge \rho _n\} + {\Pr }_{f_0} \{f_0 \notin C_n\} \\&\le \alpha + \alpha ^{\prime } + r_n \rightarrow \alpha + \alpha ^{\prime } \end{aligned}$$

as \(n \rightarrow \infty \) by the hypothesis of coverage and adaptivity of \(C_n\). The type-two errors satisfy, by coverage of \(C_n\), as \(n \rightarrow \infty \)

$$\begin{aligned} E_f(1-\Psi _n) = {\Pr }_f \{C_n \cap \tilde{\Sigma }(r, \rho _n)=\emptyset \} \le {\Pr }_{f} \{f \notin C_n\} \le \alpha + r_n \rightarrow \alpha , \end{aligned}$$

uniformly in \(f \in \tilde{\Sigma }(r, \rho _n)\). We conclude that this test satisfies

$$\begin{aligned} \limsup _n\left[E_{f_0}\Psi _n + \sup _{f \in H_1}E_f(1-\Psi _n)\right] \le 2\alpha + \alpha ^{\prime } \end{aligned}$$

for arbitrary \(\alpha , \alpha ^{\prime }>0\). For \(\alpha , \alpha ^{\prime }\) small enough this contradicts (the proof of) Theorem 1i in [19], which implies that the limit inferior of the term in brackets in the last display, even with an infimum over all tests, exceeds a fixed positive constant. Indeed, the alternatives (6) in [19] can be taken to be

$$\begin{aligned} f_i(x) = 1 + \epsilon 2^{-j_n (r+1/2)} \sum _{k \in \mathcal Z _{j_n}} \beta _{ik} \psi _{j_n k}(x), ~~~~~i=1, \dots , 2^{2^{j_n}}, \end{aligned}$$

for \(\epsilon >0\) a small constant, \(\beta _{ik} = \pm 1\), and with \(j_n\) such that \(2^{j_n} \simeq n^{1/(2r+1/2)}\). Since

$$\begin{aligned} \inf _{g \in \Sigma (s)}\Vert f_i-g\Vert _2 \ge \sqrt{\sum _{l\ge j_n, k}\langle f_i, \psi _{lk}\rangle ^2} - \sup _{g \in \Sigma (s)} \sqrt{\sum _{l\ge j_n, k}\langle g, \psi _{lk}\rangle ^2} \ge c\epsilon n^{-r/(2r+1/2)} \end{aligned}$$

for every \(\epsilon >0\), some \(c>0\) and \(n\) large enough, these alternatives are also contained in our \(H_1\), so that the proof of the lower bound Theorem 1i in [19] applies also in the present situation. \(\square \)

4.6 Proof of Theorem 5

We shall write \(\Sigma (s)\) for \(\Sigma (s,B_0)\) and \(\tilde{\Sigma }_n(s)\) for \(\tilde{\Sigma }(s, \rho _n(s))\) in this proof, and we write \(\tilde{\Sigma }_n(s_N)\) also for \(\Sigma (s_N)\) in slight abuse of notation. For \(i=1,\dots , N,\) let \(\Psi (i)\) be the test from (17) with \(\Sigma =\Sigma (s_{i+1})\) and \(t=s_i\). Starting from the largest model we first test \(H_0: f \in \Sigma (s_2)\) against \(H_1: f \in \tilde{\Sigma }_n(s_1)\), accepting \(H_0\) if \(\Psi (1)=0\). If \(H_0\) is rejected we set \(\hat{s}_n = s_1=r\); otherwise we proceed to test \(H_0: f \in \Sigma (s_3)\) against \(H_1: f \in \tilde{\Sigma }_n(s_2)\) using \(\Psi (2)\). Iterating this procedure downwards, we define \(\hat{s}_n\) to be the first element \(s_i\) in \(\mathcal S \) for which \(\Psi (i)=1\) rejects; if no rejection occurs we set \(\hat{s}_n\) equal to \(s_N\), the last element of the grid.
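The selection rule just described can be summarised in a few lines of pseudocode; the sketch below treats the tests \(\Psi (i)\) as given black boxes, and the function and argument names are ours:

```python
def select_smoothness(grid, reject):
    # grid = [s_1, ..., s_N] in increasing order, with s_1 = r.
    # reject(i) should return True iff Psi(i) rejects
    # H_0: f in Sigma(s_{i+1}) against H_1: f in tilde-Sigma_n(s_i).
    for i, s_i in enumerate(grid[:-1], start=1):
        if reject(i):           # first rejection: stop and report s_i
            return s_i
    return grid[-1]             # no rejection: s_hat_n = s_N
```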

For \(f \in \mathcal P _n(M, \mathcal S )\), let \(s_{i_0}:=s_{i_0}(f)\) be the unique element of the set \(\{s \in \mathcal S : f \in \tilde{\Sigma }_n(s) \}\). We now show that for \(M\) large enough

$$\begin{aligned} \sup _{f \in \mathcal P _n(M, \mathcal S )}{\Pr }_f \{\hat{s}_n \ne s_{i_0}(f)\} < \max (\alpha , \alpha ^{\prime })/2. \end{aligned}$$
(38)

Indeed, if \(\hat{s}_n < s_{i_0}\) then the test \(\Psi (i)\) has rejected for some \(i < i_0\). In this case \(f \in \tilde{\Sigma }_n(s_{i_0}) \subset \Sigma (s_{i_0}) \subseteq \Sigma (s_{i+1})\) for every \(i<i_0\), and thus,

$$\begin{aligned} {\Pr }_f\{\hat{s}_n< s_{i_0}\}&= {\Pr }_f\left\{ \bigcup _{i < i_0} \{\Psi (i)=1\} \right\} \le \sum _{i<i_0} \sup _{f \in \Sigma (s_{i+1})}E_f\Psi (i) \\&\le C(N) e^{-cd_n^2} < \max (\alpha , \alpha ^{\prime })/2 \end{aligned}$$

using Proposition 2 and the remark preceding it, choosing \(M\) and \(d_n\) to be large but also bounded in \(n\). On the other hand if \(\hat{s}_n > s_{i_0}\) (ignoring the trivial case \(s_{i_0} =s_N\)) then \(\Psi (i_0)\) has accepted despite \(f \in \tilde{\Sigma }_n(s_{i_0})\). Thus

$$\begin{aligned} {\Pr }_f\{\hat{s}_n > s_{i_0}\} \le \sup _{f \in \tilde{\Sigma }_n(s_{i_0})}E_f (1-\Psi (i_0)) \le C e^{-cd_n^2} \le \max (\alpha , \alpha ^{\prime })/2 \end{aligned}$$

again by Proposition 2, for \(M, d_n\) large enough.

Denote now by \(C_n(s_i)\) the confidence set (36) constructed in the proof of Theorem 3 with \(r\) there being \(s_i\), with \(R=2s_i=s_{i+1}\), with \(\Vert f\Vert _\infty \) replaced by \(U\), and with \(z_\alpha \) chosen so that the asymptotic coverage level is \(1-\alpha /2\) for any \(f \in \Sigma (s_i)\). We then set \(C_n=C_n(\hat{s}_n)\), which is a feasible confidence set as \(B_0, r, U\) are known under the hypotheses of the theorem. We then have, from the proof of Theorem 3, uniformly in \(f \in \tilde{\Sigma }_n(s_{i_0}) \subset \Sigma (s_{i_0}),\)

$$\begin{aligned} {\Pr }_f\{f \in C_n(\hat{s}_n)\} \ge {\Pr }_f\{f \in C_n(s_{i_0})\} - \alpha /2 \ge 1- \alpha . \end{aligned}$$

Moreover, if \(f \in \Sigma (s, B) \cap \tilde{\Sigma }_n(s_{i_0})\) for some \(1 \le B \le B_0\) and for either \(s \in [s_{i_0}, s_{i_0+1})\) or \(s \in [s_N, R]\) (in case \(s_{i_0}=s_N\)), the diameter of \(C_n\) satisfies, by the estimates in the proof of Theorem 3,

$$\begin{aligned}&{\Pr }_f\{|C_n(\hat{s}_n)| > C B^{2/(2s+1)}n^{-s/(2s+1)}\} \\&\qquad \le {\Pr }_f\{|C_n(s_{i_0})| > C B^{2/(2s+1)}n^{-s/(2s+1)}\} + \alpha ^{\prime }/2 \le \alpha ^{\prime } \end{aligned}$$

for \(C\) large enough, so that this confidence set is adaptive as well, which completes the proof.

4.7 Proof of Theorem 4

Proof

Suppose such \(C_n\) exists. We will construct functions \(f_m \in W^s, m = 0, 1, \dots ,\) and a further function \(f_\infty \in W^r\), which serve as hypotheses for \(f\). For each \(m \in \mathbb N \), we will ensure that, at some time \(n_m, C_{n_m}\) cannot distinguish between \(f_m\) and \(f_\infty \), and is too small to contain both simultaneously. We will thereby obtain a subsequence \(n_m\) on which, for \(\delta = \tfrac{1}{5}(1 - 2\alpha ),\)

$$\begin{aligned} \sup _m \Pr \nolimits _{f_\infty } \{f_\infty \in C_{n_m}\} \le 1 - \alpha - \delta , \end{aligned}$$

contradicting our assumptions on \(C_n.\)

For \(m = 0, 1, 2, \dots , \infty ,\) construct functions as follows: set \(f_0=1\) and, for \(m \ge 1,\)

$$\begin{aligned} f_m = 1 + \varepsilon \sum _{i=1}^m \sum _{k \in \mathcal Z _{j_i}} 2^{-j_i(r+1/2)} \beta _{ik} \psi _{j_ik}, \end{aligned}$$

where \(\varepsilon >0\) is a constant, and the parameters \(j_1, j_2, \ldots \in \mathbb N \), \(\beta _{ik} = \pm 1\) are chosen inductively, subject to \(j_i/j_{i-1} \ge 1 + 1/(2r)\). Pick \(\varepsilon > 0\) small enough that

$$\begin{aligned} \Vert f_m - f_{m-1}\Vert _\infty \le 2^{-(m+1)} \end{aligned}$$

for all \(m < \infty ,\) and any choice of \(j_i, \beta _{ik}.\) Then

$$\begin{aligned} f_m = 1 + \sum _{i=1}^m (f_i - f_{i-1}) \ge \tfrac{1}{2}, \end{aligned}$$

since \(\sum _{i \ge 1}\Vert f_i - f_{i-1}\Vert _\infty \le \sum _{i \ge 1} 2^{-(i+1)} = \tfrac{1}{2}\); moreover \(\int f_m = \langle 1, f_m \rangle = 1,\) so the \(f_m\) are densities. By (6), \(f_m \in W^r,\) and for \(m < \infty ,\) also \(f_m \in W^s.\)

We have already defined \(f_0\); for convenience let \(n_0 = 1\). Inductively, suppose we have defined \(f_{m-1}, n_{m-1}.\) For \(n_m > n_{m-1}\) and \(D>0 \) large enough depending only on \(f_{m-1}\), we have:

1. \(\Pr _{f_{m-1}}\{f_{m-1} \not \in C_{n_m}\} \le \alpha + \delta \); and

2. \(\Pr _{f_{m-1}}\{|C_{n_m}| \ge Dr_{n_m}\} \le \delta .\)

Setting

$$\begin{aligned} T_n = 1(\exists \ f \in C_n, \Vert f - f_{m-1}\Vert _2 \ge 2Dr_n)\!, \end{aligned}$$

we then have

$$\begin{aligned} \Pr \nolimits _{f_{m-1}}\{T_{n_m}=1\} \le \Pr \nolimits _{f_{m-1}}\{f_{m-1} \not \in C_{n_m}\} + \Pr \nolimits _{f_{m-1}}\{|C_{n_m}| \ge Dr_{n_m}\} \le \alpha + 2\delta .\qquad \end{aligned}$$
(39)

We claim it is possible to choose \(j_m, \beta _{mk}\) and \(n_m\), depending only on \(f_{m-1}\) so that also: 1. if \(m>1\),

$$\begin{aligned} 3Dr_{n_m} \le \Vert f_m - f_{m-1}\Vert _2 \le \tfrac{1}{4} \Vert f_{m-1} - f_{m-2}\Vert _2, \end{aligned}$$
(40)

and 2. for any further choice of \(j_i, \beta _{ik},\)

$$\begin{aligned} \Pr \nolimits _{f_\infty }\{T_{n_m} = 0\} \ge 1 - \alpha - 4\delta . \end{aligned}$$
(41)

We may then conclude that, since all further choices will satisfy (40), so that the tail sum below is at most \(\sum _{k \ge 1} 4^{-k}\Vert f_m - f_{m-1}\Vert _2 = \tfrac{1}{3}\Vert f_m - f_{m-1}\Vert _2\),

$$\begin{aligned} \Vert f_\infty - f_{m-1}\Vert _2 \ge \Vert f_m - f_{m-1}\Vert _2 - \sum _{i=m+1}^\infty \Vert f_i - f_{i-1}\Vert _2 \ge 2Dr_{n_m}, \end{aligned}$$

so

$$\begin{aligned} \Pr \nolimits _{f_\infty }\{f_\infty \in C_{n_m}\} \le \Pr \nolimits _{f_\infty }\{T_{n_m} = 1\} \le \alpha + 4\delta = 1 - \alpha - \delta \end{aligned}$$

as required.

It remains to verify the claim. For \(j \ge (1 + 1/2r)j_{m-1},\) \(\beta _k = \pm 1,\) set

$$\begin{aligned} g_\beta = \varepsilon 2^{-j(r+1/2)} \sum _{k \in \mathcal Z _j} \beta _k \psi _{jk}, \end{aligned}$$

and \(f_\beta = f_{m-1} + g_\beta .\) Allowing \(j \rightarrow \infty ,\) set

$$\begin{aligned} n \sim C2^{j(2r+1/2)}, \end{aligned}$$

for \(C > 0\) to be determined. Then

$$\begin{aligned} \Vert g_\beta \Vert _2 = \varepsilon 2^{-jr} \approx n^{-r/(2r+1/2)}, \end{aligned}$$

so for \(j\) large enough, \(f_\beta \) satisfies (40) with any choice of \(\beta .\)

The density of the law of \(X_1, \dots , X_n\) under \(f_\beta \) with respect to its law under \(f_{m-1}\) is

$$\begin{aligned} Z_\beta = \prod _{i=1}^n \frac{f_\beta }{f_{m-1}}(X_i). \end{aligned}$$

Set \(Z = 2^{-2^j}\sum _\beta Z_\beta ,\) the average of the \(Z_\beta \) over all \(2^{2^j}\) sign vectors \(\beta ,\) so that \(E_{f_{m-1}}[Z] = 1,\) and

$$\begin{aligned} E_{f_{m-1}}[Z^2]&= 2^{-2\cdot 2^j} \sum _{\beta , \beta ^{\prime }} \prod _{i=1}^n E_{f_{m-1}}\left[ \frac{f_\beta f_{\beta ^{\prime }}}{f_{m-1}^2}(X_i)\right]\\&= 2^{-2\cdot 2^j} \sum _{\beta , \beta ^{\prime }} \left\langle \frac{f_\beta }{\sqrt{f_{m-1}}}, \frac{f_{\beta ^{\prime }}}{\sqrt{f_{m-1}}} \right\rangle ^n\\&= 2^{-2\cdot 2^j} \sum _{\beta , \beta ^{\prime }} \left(1 + \left\langle \frac{g_\beta }{\sqrt{f_{m-1}}}, \frac{g_{\beta ^{\prime }}}{\sqrt{f_{m-1}}} \right\rangle \right)^n\\&\le 2^{-2\cdot 2^j} \sum _{\beta , \beta ^{\prime }} (1 + \varepsilon ^22^{1-j(2r+1)}\langle \beta , \beta ^{\prime } \rangle )^n\\&= E[(1 + \varepsilon ^22^{1-j(2r+1)} Y)^n], \end{aligned}$$

where \(Y = \sum _{i=1}^{2^j} R_i,\) for i.i.d. Rademacher random variables \(R_i,\)

$$\begin{aligned} \le E[\exp (n\varepsilon ^2 2^{1-j(2r+1)} Y)] = \cosh \left(D2^{-j/2}(1 + o(1))\right)^{2^j}, \end{aligned}$$

as \(j \rightarrow \infty ,\) for some \(D > 0,\)

$$\begin{aligned}&= \left(1 + D^2 2^{-j} (1 + o(1))\right)^{2^j}\\&\le \exp \left(D^2(1 + o(1))\right)\\&\le 1 + \delta ^2, \end{aligned}$$

for \(j\) large, \(C\) small. Hence \(E_{f_{m-1}}[(Z - 1)^2] \le \delta ^2,\) and we obtain

$$\begin{aligned} \Pr \nolimits _{f_{m-1}}\{T_n = 1\} + \max _\beta \Pr \nolimits _{f_\beta }\{T_n = 0\}&\ge \Pr \nolimits _{f_{m-1}}\{T_n=1\} + 2^{-2^j}\sum _\beta \Pr \nolimits _{f_\beta }\{T_n = 0\}\\&= 1 + E_{f_{m-1}}[(Z-1)1(T_n = 0)]\\&\ge 1 - \delta . \end{aligned}$$
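Two standard facts enter the last two steps and are recorded here for convenience: the moment generating function of the Rademacher sum \(Y\), which, applied with \(t = n\varepsilon ^2 2^{1-j(2r+1)} \asymp 2^{-j/2}\) under the scaling \(n \sim C2^{j(2r+1/2)}\), gives the \(\cosh \) bound above, and the Cauchy–Schwarz step behind the final inequality of the preceding display:

$$\begin{aligned} E\left[e^{tY}\right] = \prod _{i=1}^{2^j} E\left[e^{tR_i}\right] = \cosh (t)^{2^j}, \qquad \left|E_{f_{m-1}}[(Z-1)1(T_n = 0)]\right| \le \left(E_{f_{m-1}}[(Z-1)^2]\right)^{1/2} \le \delta . \end{aligned}$$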

Set \(f_m = f_\beta ,\) for \(\beta \) maximizing \(\Pr \nolimits _{f_\beta }\{T_n = 0\}\). The density of the law of \(X_1, \dots , X_n\) under \(f_\infty \) with respect to its law under \(f_m\) is

$$\begin{aligned} Z^{\prime } = \prod _{i=1}^n \frac{f_\infty }{f_m}(X_i). \end{aligned}$$

Now, \(E_{f_m}[Z^{\prime }] = 1,\) and

$$\begin{aligned} ||f_\infty - f_m||_2^2 = \sum _{i=m+1}^\infty \varepsilon ^2 2^{-2j_ir} \le E^{\prime }2^{-2j_{m+1}r} \le E^{\prime }2^{-j(2r+1)}, \end{aligned}$$

for some constant \(E^{\prime } > 0,\) so similarly

$$\begin{aligned} E_{f_m}[{Z^{\prime }}^2]&\le (1 + 2||f_\infty - f_m||_2^2)^n\\&\le (1 + E^{\prime } 2^{1 - j(2r+1)})^n\\&\le \exp (E^{\prime }n2^{1 - j(2r+1)})\\&= \exp \left(F2^{-j/2}(1+o(1))\right), \end{aligned}$$

for some \(F > 0,\)

$$\begin{aligned} \le 1 + \delta ^2, \end{aligned}$$

for \(j\) large. Hence \(E_{f_{m}}[(Z^{\prime }-1)^2] \le \delta ^2,\) and

$$\begin{aligned} \Pr \nolimits _{f_{m-1}}\{T_n = 1\} + \Pr \nolimits _{f_\infty }\{T_n = 0\}&= \Pr \nolimits _{f_{m-1}}\{T_n=1\} + E_{f_m}[Z^{\prime }1(T_n=0)]\\&\ge 1 - \delta + E_{f_m}[(Z^{\prime }-1)1(T_n = 0)] \\&\ge 1 - 2\delta . \end{aligned}$$

If we take \(j_m = j,\) \(n_m = n\) large enough also that (39) holds, then \(f_\infty \) satisfies (41), and our claim is proved. \(\square \)

4.8 Proof of Part B of Theorem 3

Proof

Suppose such \(C_n\) exists for \(R=2r\). Set \(f_0 = 1,\) and

$$\begin{aligned} f_1 = 1 + B2^{-j(r+1/2)} \sum _{k \in \mathcal Z _j} \beta _{jk} \psi _{jk}, \end{aligned}$$

for \(B > 0,\) \(j > j_0,\) and \(\beta _{jk} = \pm 1\) to be determined. Having chosen \(B,\) we will pick \(j\) large enough that \(f_1 \ge \tfrac{1}{2}.\) Since \(\int f_1 = \langle f_1, 1 \rangle = 1,\) \(f_1\) is then a density.

Set \(\delta = \tfrac{1}{4}(1 - 2\alpha ).\) As \(f_0 \in \Sigma (R, 1),\) for \(n\) and \(L\) large we have:

1. \(\Pr _{f_0}\{f_0 \not \in C_n\} \le \alpha + \delta ;\) and

2. \(\Pr _{f_0}\{|C_n|\ge Ln^{-R/(2R+1)}\} \le \delta .\)

Setting \(T_n = 1(\exists \, f \in C_n : ||f - f_0||_2 \ge 2Ln^{-R/(2R+1)}),\) we then have

$$\begin{aligned} \Pr \nolimits _{f_0}\{T_n = 1\} \le \alpha + 2\delta , \end{aligned}$$

as in the proof of Theorem 4.

For a constant \(C = C(\delta ) > 0\) to be determined, set \(B = (3L)^{2R+1}C^{-R}.\) Allowing \(j \rightarrow \infty ,\) set \(n \sim CB^{-2}2^{j(R+1/2)}.\) Then

$$\begin{aligned} ||f_1 - f_0||_2 = B2^{-jr} \sim 3Ln^{-R/(2R+1)}, \end{aligned}$$

indeed, \(2^{-jr}=2^{-jR/2} \sim (nB^2/C)^{-R/(2R+1)}\), so that \(B2^{-jr} \sim B^{1/(2R+1)}C^{R/(2R+1)}n^{-R/(2R+1)} = 3Ln^{-R/(2R+1)}\) by the choice of \(B\). In particular, for \(j\) large, \(||f_1 - f_0||_2 \ge 2Ln^{-R/(2R+1)}.\) Arguing as in the proof of Theorem 4, with \(Z\) now the average over the sign vectors \(\beta \) of the likelihood ratios of \(f_1\) w.r.t. \(f_0\), we have the second moment bound

$$\begin{aligned} E_{f_0}[Z^2]&\le \cosh (nB^22^{1-j(2r+1)})^{2^j}\\&= \cosh (C2^{1-j/2}(1 + o(1)))^{2^j}\\&= (1 + C^22^{2-j}(1 + o(1)))^{2^j}\\&\le \exp (4C^2(1 + o(1)))\\&\le 1 + \delta ^2, \end{aligned}$$

for \(C(\delta )\) small, \(j\) large. Hence

$$\begin{aligned} \Pr \nolimits _{f_0}\{T_n=1\} + \max _\beta \Pr \nolimits _{f_1}\{T_n = 0\} \ge 1 - \delta . \end{aligned}$$

For all \(j\) (and \(n\)) large enough we then obtain, for suitable \(\beta ,\)

$$\begin{aligned} \Pr \nolimits _{f_1}\{f_1 \in C_n\} \le \Pr \nolimits _{f_1}\{T_n = 1\} \le \alpha + 3\delta = 1 - \alpha - \delta . \end{aligned}$$

Since \(f_1 \in \Sigma (r, B)\) for all \(n, \beta _{jk}\), this contradicts the definition of \(C_n.\) \(\square \)