Abstract
We address the problem of adaptive minimax density estimation on \(\mathbb{R }^d\) with \(\mathbb{L }_p\)-loss over the anisotropic Nikol’skii classes. We fully characterize the behavior of the minimax risk for the different relationships between the regularity parameters and the norm indexes appearing in the definitions of the functional class and of the risk. In particular, we show that there are four different regimes with respect to the behavior of the minimax risk. We develop a single estimator that is (nearly) optimal in order over the complete scale of anisotropic Nikol’skii classes. Our estimation procedure is based on a data-driven selection of an estimator from a fixed family of kernel estimators.
1 Introduction
Let \(X_{1},\ldots , X_{n}\) be independent copies of a random vector \(X\in \mathbb{R }^d\) having density \(f\) with respect to the Lebesgue measure. We want to estimate \(f\) from the observations \(X^{(n)}=(X_{1},\ldots ,X_n)\). By an estimator we mean any \(X^{(n)}\)-measurable map \({\hat{f}}:(\mathbb{R }^{d})^n\rightarrow \mathbb{L }_p(\mathbb{R }^d)\). The accuracy of an estimator \(\hat{f}\) is measured by the \(\mathbb{L }_p\)-risk
where \(\mathbb{E }_f\) denotes expectation with respect to the probability measure \(\mathbb{P }_f\) of the observations \(X^{(n)}=(X_1,\ldots ,X_n)\), and \(\Vert \cdot \Vert _p\), \(p\in [1,\infty )\), is the \(\mathbb{L }_p\)-norm on \(\mathbb{R }^d\). The objective is to construct an estimator of \(f\) with small \(\mathbb{L }_p\)-risk.
In the framework of the minimax approach the density \(f\) is assumed to belong to a functional class \(\Sigma \), which is specified on the basis of prior information on \(f\). Given a functional class \(\Sigma \), a natural accuracy measure of an estimator \(\hat{f}\) is its maximal \(\mathbb{L }_p\)-risk over \(\Sigma \),
The main question is:
-
(i)
how to construct a rate-optimal, or optimal in order, estimator \({\hat{f}}_*\) such that
$$\begin{aligned} \mathcal{R }_p^{(n)}[\hat{f}_*;\Sigma ] \asymp \phi _n(\Sigma ):=\inf _{\hat{f}} \mathcal{R }_p^{(n)}[\hat{f};\Sigma ],\quad n\rightarrow \infty ? \end{aligned}$$
Here the infimum is taken over all possible estimators. We refer to the outlined problem as the problem of minimax density estimation with \(\mathbb{L }_p\)-loss on the class \(\Sigma \).
Although the minimax approach provides a fair and convenient criterion for comparison between different estimators, it lacks some flexibility. Typically \(\Sigma \) is a class of functions determined by some hyper-parameter, say \(\alpha \) (we write \(\Sigma =\Sigma _\alpha \) to indicate explicitly the dependence of the class \(\Sigma \) on the hyper-parameter \(\alpha \)). In general, it turns out that an estimator which is optimal in order on the class \(\Sigma _\alpha \) is not optimal on a class \(\Sigma _{\alpha ^\prime }\) with \(\alpha ^\prime \ne \alpha \). This fact motivates the following question:
-
(ii)
is it possible to construct an estimator \({\hat{f}}_*\) that is optimal in order on some scale of functional classes \(\{\Sigma _\alpha , \alpha \in A\}\) and not only on one class \(\Sigma _\alpha \)? In other words, is it possible to construct an estimator \({\hat{f}}_*\) such that for any \(\alpha \in A\) one has
$$\begin{aligned} \mathcal{R }^{(n)}[\hat{f}_*; \Sigma _\alpha ] \asymp \phi _n(\Sigma _\alpha ),\quad n\rightarrow \infty ? \end{aligned}$$
We refer to this question as the problem of adaptive minimax density estimation on the scale of classes \(\{\Sigma _\alpha , \alpha \in A\}\).
Minimax and adaptive minimax density estimation with \(\mathbb{L }_p\)-loss is the subject of a vast literature; see, for example, [3, 5–11, 16, 18, 19, 22], [27, Chapter 7], [31], [32, 33] and [2]. It is not our aim here to provide a complete review of the literature on density estimation with \(\mathbb{L }_p\)-loss; below we discuss only results that are directly related to our study. First we review papers dealing with the one-dimensional setting; then we proceed to the multivariate case.
The problem of minimax density estimation on \(\mathbb{R }^1\) with \(\mathbb{L }_p\)-loss, \(p\in [2,\infty )\), was studied by Bretagnolle and Huber [3]. In this paper the functional class \(\Sigma \) is the class of all densities such that \([\Vert f^{(\beta )}\Vert _p \Vert f\Vert _{p/2}^{\beta }]^{1/(2\beta +1)}\le L<\infty \), where \(f^{(\beta )}\) is the generalized derivative of order \(\beta \). It was shown there that
Note that the same parameter \(p\) appears in the definitions of the risk and of the functional class.
The problem of adaptive minimax density estimation on a compact interval of \(\mathbb{R }^1\) with \(\mathbb{L }_p\)-loss was addressed in [9]. In this paper the class \(\Sigma \) is the Besov functional class \(\mathbb{B }^\beta _{r,\theta }(L)\), where the parameter \(\beta \) stands for the regularity index, and \(r\) is the index of the norm in which the regularity is measured. It is shown there that there is an elbow in the rates of convergence of the minimax risk according to whether \(p\le r(2\beta +1)\) (called in the literature the dense zone) or \(p\ge r(2\beta +1)\) (the sparse zone). In particular,
Donoho et al. [9] develop a wavelet-based hard-thresholding estimator that achieves the indicated rates (up to a \(\ln n\)-factor in the dense zone) for a scale of the Besov classes \(\mathbb{B }^\beta _{r,\theta }(L)\) under the additional assumption \(\beta r>1\).
It is quite remarkable that if the assumption that the underlying density has compact support is dropped, then the minimax risk behavior becomes completely different. Specifically, Juditsky and Lambert-Lacroix [21] studied the problem of adaptive minimax density estimation on \(\mathbb{R }^1\) with \(\Sigma \) being the Hölder class \(\mathbb{N }_{\infty ,1}(\beta ,L)\). Their results are in striking contrast with those of Donoho et al. [9]: it is shown that
Juditsky and Lambert-Lacroix [21] develop a wavelet-based estimator that achieves the indicated rates up to a logarithmic factor on a scale of the Hölder classes. Note that if the aforementioned results of Donoho et al. [9] for densities with compact support are applied to the Hölder class, \(r=\infty \), then the rate is \(n^{-1/(2+1/\beta )}\) for any \(p\ge 1\). Thus, the rate corresponding to the zone \(1\le p\le 2+1/\beta \) does not appear in the case of compactly supported densities.
In a recent paper, Reynaud-Bouret et al. [30] consider the problem of adaptive density estimation on \(\mathbb{R }^1\) with \(\mathbb{L }_2\)-loss on the Besov classes \(\mathbb{B }_{r,\theta }^\beta (L)\). It is shown there that
They also propose a wavelet-based estimator that achieves the indicated rates up to a logarithmic factor for a scale of Besov classes under the additional assumption \(2\beta r>2-r\). It follows from Donoho et al. [9] that if \(p=2\) and the density is compactly supported then the corresponding rates are \(\phi _n(\Sigma )\asymp n^{-1/(2+1/\beta )}\) for all \(r\ge 2/(2\beta +1)\). Hence the rate corresponding to the zone \(r>2,\,p=2\), does not appear in the case of compactly supported densities.
As for the multivariate setting, Ibragimov and Khasminskii in a series of papers [18, 19] studied the problem of minimax density estimation with \(\mathbb{L }_p\)-loss on \(\mathbb{R }^{d}\). Together with some classes of infinitely differentiable densities, they considered the anisotropic Nikol’skii classes \(\Sigma =\mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L})\), where \(\vec {\beta }=(\beta _1,\ldots ,\beta _d),\,\vec {r}=(r_1,\ldots ,r_d)\) and \(\vec {L}=(L_1,\ldots ,L_d)\) (for the precise definition see Sect. 3.1). It was shown that if \(r_i=p\) for all \(i=1,\ldots ,d\) then
Here \(\beta \) is the parameter defined by the relation \(1/\beta =\sum _{j=1}^d 1/\beta _j\). It should be stressed that in the cited papers the same norm index \(p\) is used in the definitions of the risk and of the functional class. We also refer to the recent paper by Mason [26], where further discussion of these results can be found.
Delyon and Juditsky [4] generalized the results of Donoho et al. [9] to the minimax density estimation on a bounded interval of \(\mathbb{R }^d,\,d\ge 1\) over a collection of the isotropic Besov classes. In particular, they showed that the minimax rates of convergence given by (1.1) hold with \(1/(\beta r)\) and \(1/\beta \) replaced by \(d/(\beta r)\) and \(d/\beta \) respectively. Comparing rates in (1.2) with the asymptotics of minimax risk found in [4] with \(r=p\) we conclude that the rate in (1.2) in the zone \(p\in [1,2)\) does not appear for compactly supported densities.
Recently Goldenshluger and Lepski [15] developed an adaptive minimax estimator over a scale of classes \(\mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L})\); in particular, if \(r_i=p\) for all \(i=1,\ldots , d\) then their estimator attains the minimax rates indicated in (1.2). Note that in the considered setting the norm indexes in the definitions of the risk and the functional class coincide.
The results discussed above show that there is an essential difference between the problems of density estimation on the whole space and on a compact interval. The literature on density estimation on the whole space is quite fragmented, and the relationships between the aforementioned results are yet to be understood. These relationships become even more complex and interesting in the multivariate setting where the density to be estimated belongs to a functional class with anisotropic and inhomogeneous smoothness. The problem of minimax estimation under \(\mathbb{L }_p\)-loss over homogeneous Sobolev \(\mathbb{L }_q\)-balls \((q\ne p)\) was initiated in [28] for the regression model on the unit cube of \(\mathbb{R }^d\). Functional classes with anisotropic and inhomogeneous smoothness were first considered in [23, 24] for the Gaussian white noise model on a compact subset of \(\mathbb{R }^d\). In the density estimation model, [1] studied the case \(p=2\) for compactly supported densities on \([0,1]^d\).
To the best of our knowledge, the problem of estimating a multivariate density from anisotropic and inhomogeneous functional classes on \(\mathbb{R }^d\) has not been considered in the literature. This problem is the subject of the current paper. Our results cover the existing ones and generalize them in the following directions.
-
1.
We fully characterize the behavior of the minimax risk for all possible relationships between the regularity parameters and the norm indexes in the definitions of the functional classes and of the risk. In particular, we discover that there are four different regimes with respect to the minimax rates of convergence: the tail, dense and sparse zones, where the last zone, in turn, is subdivided into two regions. The existence of these regimes is not a consequence of the multivariate nature of the problem or of the considered functional classes; in fact, these regimes appear already in dimension one. Thus our results reveal all possible zones with respect to the rates of convergence in the problem of density estimation on \(\mathbb{R }^d\) and explain the different results on rates of convergence in the existing literature. In particular, the results in [21, 30] pertain to the rates of convergence in the tail and dense zones, while those in [4, 9] correspond to the dense zone and to a subregion of the sparse zone.
-
2.
We propose an estimator based on a data-driven selection from a family of kernel estimators, and establish for it a point-wise oracle inequality. We then use this inequality to derive bounds on the \(\mathbb{L }_p\)-risk over a collection of Nikol’skii functional classes. Since the construction of our estimator does not use any prior information on the class parameters, it is adaptive minimax over a scale of these classes. Moreover, we believe that the method of deriving \(\mathbb{L }_p\)-risk bounds from point-wise oracle inequalities employed in the proof of Theorem 2 is of interest in its own right: it is quite general and can be applied to other nonparametric estimation problems.
-
3.
Another issue studied in the present paper is related to the existence of the tail zone. This zone does not exist in the problem of estimating compactly supported densities. A natural question therefore arises: what is a general condition on \(f\) that ensures the same asymptotics of the minimax risk on \(\mathbb{R }^d\) as in the case of compactly supported densities? We propose a tail dominance condition and show that, in a sense, it is the weakest possible condition under which the tail zone disappears. We also show that this condition guarantees the existence of a consistent estimator under \(\mathbb{L }_1\)-loss. Recall that smoothness alone is not sufficient to guarantee consistency of density estimators in \(\mathbb{L }_1(\mathbb{R }^d)\) (see [20]).
The paper is structured as follows. In Sect. 2 we define our estimation procedure and derive the corresponding point-wise oracle inequality. Section 3 presents upper and lower bounds on the minimax risk. We also discuss the obtained results and relate them to the existing results in the literature. The same estimation problem under the tail dominance condition is studied in Sect. 4. Sections 5–7 contain proofs of Theorems 1–4; proofs of auxiliary results are relegated to Appendices A and B.
The following notation and conventions are used throughout the paper. For vectors \(u, v\in \mathbb{R }^d\) the operations \(u/v,\,u\vee v,\,u\wedge v\) and inequalities such as \(u\le v\) are all understood in the coordinate-wise sense. For instance, \(u\vee v =(u_1\vee v_1,\ldots , u_d\vee v_d)\). All integrals are taken over \(\mathbb{R }^d\) unless the domain of integration is specified explicitly. For a Borel set \(\mathcal{A }\subset \mathbb{R }^d\) the symbol \(|\mathcal{A }|\) stands for the Lebesgue measure of \(\mathcal{A }\); if \(\mathcal{A }\) is a finite set, \(|\mathcal{A }|\) denotes its cardinality.
2 Estimation procedure and point-wise oracle inequality
In this section we define our estimation procedure and derive an upper bound on its point-wise risk.
2.1 Estimation procedure
Our estimation procedure is based on data-driven selection from a family of kernel estimators. The family of estimators is defined as follows.
2.1.1 Family of kernel estimators
Let \(K:[-1/2,1/2]^d\rightarrow \mathbb{R }^1\) be a fixed kernel such that \(K\in \mathbb{C }(\mathbb{R }^d),\,\int K(x)\,\mathrm{d}x=1\), and \(\Vert K\Vert _\infty <\infty \). Let
without loss of generality we assume that \(\log _2n\) is an integer.
Given a bandwidth \(h\in \mathcal{H }\), define the corresponding kernel estimator of \(f\) by the formula
where \(V_h:=\prod _{j=1}^d h_j,\,K_h(\cdot ):=(1/V_h)K(\cdot /h)\). Consider the family of kernel estimators
The proposed estimation procedure is based on data-driven selection of an estimator from \(\mathcal{F }(\mathcal{H })\).
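To fix ideas, the family \(\mathcal{F }(\mathcal{H })\) can be sketched numerically as follows. The displays defining \(\mathcal{H }\) and \(\hat{f}_h\) are omitted above, so this Python sketch makes two assumptions: the bandwidth grid is dyadic (consistent with the requirement that \(\log _2 n\) be an integer), and the estimator has the standard form \(\hat{f}_h(x)=n^{-1}\sum _{i}K_h(X_i-x)\); the triangular product kernel is an illustrative choice satisfying the stated conditions on \(K\).

```python
import numpy as np

def K(t):
    # continuous product kernel supported on [-1/2, 1/2]^d with integral one:
    # K(t) = prod_j 2 * (1 - 2|t_j|)_+  (triangular in each coordinate)
    return np.prod(2.0 * np.clip(1.0 - 2.0 * np.abs(t), 0.0, None), axis=-1)

def f_hat(X, x, h):
    # assumed kernel estimator f_h(x) = (1/n) sum_i K_h(X_i - x),
    # where K_h(.) = (1/V_h) K(./h) and V_h = prod_j h_j
    n = X.shape[0]
    V_h = np.prod(h)
    return K((X - x) / h).sum() / (n * V_h)

rng = np.random.default_rng(0)
n, d = 4096, 2
X = rng.standard_normal((n, d))                # i.i.d. sample from N(0, I_2)

# assumed dyadic bandwidth grid: one isotropic candidate per resolution level
H = [np.full(d, 2.0 ** -k) for k in range(int(np.log2(n)) + 1)]
family = {k: f_hat(X, np.zeros(d), h) for k, h in enumerate(H)}
# family[k] is the estimate of f(0) produced by the k-th candidate bandwidth
```

For the standard Gaussian density, \(f(0)=(2\pi )^{-1}\approx 0.159\); moderate bandwidths in the grid reproduce this value up to the usual bias/variance fluctuations, while very small bandwidths give highly variable estimates, which is exactly the trade-off the selection rule below must resolve.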
2.1.2 Auxiliary estimators
Our selection rule uses auxiliary estimators that are constructed as follows. For any pair \(h,\eta \in \mathcal{H }\) define the kernel \(K_{h,\eta }:=K_h*K_\eta \) by the formula \( [K_h*K_\eta ](t)=\int K_h(t-y) K_\eta (y)\,\mathrm{d}y\). Let \(\hat{f}_{h,\eta }(x)\) denote the estimator associated with this kernel:
The following representation of kernels \(K_{h,\eta }\) will be useful: for any \(h,\eta \in \mathcal{H }\)
where function \(Q_{h,\eta }\) is given by the formula
Here function \(v:\mathbb{R }^d\times \mathbb{R }^d \rightarrow \mathbb{R }^d\) is defined by
The representation (2.2)–(2.3) is obtained by a straightforward change of variables in the convolution integral (see the proof of Lemma 12 in [14]). We also note that \(\mathrm{supp}(Q_{h,\eta })\subseteq [-1,1]^d\), and \(\Vert Q_{h,\eta }\Vert _\infty \le \Vert K\Vert _\infty ^2\) for all \(h,\eta \). In the special case where \(K(t)=\prod _{i=1}^d k(t_i)\) for some univariate kernel \(k:[-1/2, 1/2]\rightarrow \mathbb{R }^1\) we have
We also define
and note that \(\mathrm{supp}(Q)\subseteq [-1,1]^d\), and \(\Vert Q\Vert _\infty \le \Vert K\Vert _\infty ^2\).
2.1.3 Stochastic errors of kernel estimators and their majorants
Uniform moment bounds on stochastic errors of kernel estimators \(\hat{f}_h(x)\) and \(\hat{f}_{h,\eta }(x)\) will play an important role in the construction of our selection rule. Let
denote the stochastic errors of \(\hat{f}_h\) and \(\hat{f}_{h,\eta }\), respectively. In order to construct our selection rule we need to find uniform upper bounds (majorants) on \(\xi _h\) and \(\xi _{h,\eta }\), i.e., functions \(M_h\) and \(M_{h,\eta }\) such that the moments of the random variables
are “small” for each \(x\in \mathbb{R }^d\). We will also be interested in the integrability properties of these moments.
It turns out that the majorants \(M_h(x)\) and \(M_{h,\eta }(x)\) can be defined in the following way. For a function \(g:\mathbb{R }^d\rightarrow \mathbb{R }^1\) let
Now define
where \(\varkappa \) is a positive constant to be specified. In Lemma 2 in Sect. 5 we show that, under an appropriate choice of the parameter \(\varkappa \), the functions
uniformly majorize \(\xi _h\) and \(\xi _{h,\eta }\) in the sense that the moments of the random variables in (2.5) are “small”.
It should be noted, however, that the functions \(M_h(x)\) and \(M_{h,\eta }(x)\) given by (2.8) cannot be used directly in the construction of the selection rule because they depend on the unknown density \(f\) to be estimated. We will use empirical counterparts of \(M_h(x)\) and \(M_{h,\eta }(x)\) instead.
For \(g:\mathbb{R }^d\rightarrow \mathbb{R }^1\) we let
and define
2.1.4 Selection rule and final estimator
Now we are in a position to define our selection rule. For every \(x\in \mathbb{R }^d\) let
The selected bandwidth \({\hat{h}}(x)\) and the corresponding estimator are defined by
Note that the estimation procedure is completely determined by the family of kernel estimators \(\mathcal{F }(\mathcal{H })\) and by the constant \(\varkappa \) appearing in the definition of \({\hat{M}}_h\).
We have to ensure that the map \(x\mapsto \hat{f}_{\hat{h}(x)}(x)\) is an \(X^{(n)}\)-measurable Borel function. This follows from continuity of \(K\) and the fact that \(\mathcal{H }\) is a discrete set; for details see Appendix A, Sect. A.1.
The main idea behind the construction of the selection procedure (2.10)–(2.11) is the following. The expression \({\hat{M}}_{h\vee \eta }(Q, x)+{\hat{M}}_\eta (K, x)\) appearing in the square brackets in (2.10) dominates with large probability the stochastic part of the difference \(|{\hat{f}}_{h,\eta }(x)-{\hat{f}}_\eta (x)|\). Consequently, the first term on the right-hand side of (2.10) serves as a proxy for the deterministic part of \(|{\hat{f}}_{h,\eta }(x)-{\hat{f}}_\eta (x)|\), which is the absolute value of the difference of the biases of the kernel estimators \({\hat{f}}_{h,\eta }(x)\) and \({\hat{f}}_\eta (x)\). The latter, in turn, is closely related to the bias of the estimator \({\hat{f}}_h(x)\). Thus, the first term on the right-hand side of (2.10) is a proxy for the bias of \({\hat{f}}_h(x)\), while the second term is an upper bound on the standard deviation of \({\hat{f}}_h(x)\).
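The idea above can be sketched as a Lepski-type point-wise rule. The displays (2.8)–(2.11) are omitted here, so the Python toy below (dimension one, box kernel) uses a simplified empirical majorant \(\hat{M}\) and an illustrative value of \(\varkappa \); it is a schematic stand-in for the actual procedure, not the paper's exact rule. For each candidate \(h\) it computes the largest excess of \(|\hat{f}_{h,\eta }-\hat{f}_\eta |\) over the sum of majorants, and selects the \(h\) minimizing this bias proxy plus the majorant penalty.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2048
X = rng.standard_normal(n)                        # one-dimensional sample
H = [2.0 ** -k for k in range(1, int(np.log2(n)))]  # dyadic bandwidth grid

def f_hat(x, h):
    # kernel estimator with the box kernel K = indicator of [-1/2, 1/2]
    return np.mean(np.abs(X - x) <= h / 2) / h

def f_hat_pair(x, h, eta):
    # auxiliary estimator with kernel K_h * K_eta: for two box kernels the
    # convolution is the trapezoid with plateau |t| <= |h - eta|/2
    t, a, b = X - x, abs(h - eta) / 2, (h + eta) / 2
    w = np.where(np.abs(t) <= a, 1 / max(h, eta),
                 np.clip(b - np.abs(t), 0, None) / (h * eta))
    return np.mean(w)

def M_hat(x, h, kappa=4.0):
    # simplified empirical majorant ~ sqrt(kappa ln n f_hat / (n h)) + ...
    # (the theoretical constant kappa of the paper is larger)
    return (np.sqrt(kappa * np.log(n) * max(f_hat(x, h), 0.0) / (n * h))
            + kappa * np.log(n) / (n * h))

def select(x):
    def R_hat(h):                                 # proxy for the bias of f_h(x)
        return max(max(abs(f_hat_pair(x, h, e) - f_hat(x, e))
                       - M_hat(x, max(h, e)) - M_hat(x, e), 0.0) for e in H)
    return min(H, key=lambda h: R_hat(h) + 2 * M_hat(x, h))

h_star = select(0.0)
estimate = f_hat(0.0, h_star)                     # true value f(0) is about 0.399
```

The rule automatically rejects the very small bandwidths (their majorant penalty dominates) and, for this smooth density, settles on a coarse bandwidth whose estimate is close to \(f(0)\approx 0.399\).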
2.2 Point-wise oracle inequality
Let \(B_h(f,t)\) be the bias of the kernel estimator \({\hat{f}}_h(t)\),
and define
Theorem 1
For any \(x\in \mathbb{R }^d\) one has
where
Furthermore, for any \(q\ge 1\) if \(\varkappa \ge [\Vert K\Vert _\infty \vee 1]^2 [(4d+2)q+4(d+1)]\) then
where \(C\) is a constant depending only on \(d,\,q\) and \(\Vert K\Vert _\infty \).
We remark that Theorem 1 does not require any conditions on the estimated density \(f\).
3 Adaptive estimation over anisotropic Nikol’skii classes
In this section we study properties of the estimator defined in (2.10)–(2.11). The point-wise oracle inequality of Theorem 1 is the key technical tool for bounding \(\mathbb{L }_p\)-risk of this estimator on the anisotropic Nikol’skii classes.
3.1 Anisotropic Nikol’skii classes
Let \((e_1,\ldots ,e_d)\) denote the canonical basis of \(\mathbb{R }^d\). For a function \(g:\mathbb{R }^d\rightarrow \mathbb{R }^1\) and a real number \(u\in \mathbb{R }\) define the first-order difference operator with step size \(u\) in the direction of the variable \(x_j\) by
By induction, the \(k\)-th order difference operator with step size \(u\) in the direction of the variable \(x_j\) is defined as
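Unwinding this recursion yields the standard explicit formula (assuming the usual first-order operator \(\Delta _{u,j}g(x)=g(x+ue_j)-g(x)\), whose display is omitted above):
$$\begin{aligned} \Delta _{u,j}^{k} g(x)=\sum _{i=0}^{k}(-1)^{k-i}\binom{k}{i}\, g(x+iue_j). \end{aligned}$$
For instance, \(\Delta _{u,j}^{2}g(x)=g(x+2ue_j)-2g(x+ue_j)+g(x)\).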
Definition 1
For given real numbers \(\vec {r}\!=\!(r_1,\ldots ,r_d),\,r_j\!\in \! [1,\infty ],\,\vec {\beta }\!=\!(\beta _1,\ldots ,\beta _d), \beta _j>0\), and \(\vec {L}\!=\!(L_1,\ldots , L_d),\,L_j>0,\,j\!=\!1,\ldots , d\), we say that a function \(g:\mathbb{R }^d\rightarrow \mathbb{R }^1\) belongs to the anisotropic Nikol’skii class \(\mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L})\) if
-
(i)
\(\Vert g\Vert _{r_j}\le L_{j}\) for all \(j=1,\ldots ,d\);
-
(ii)
for every \(j=1,\ldots ,d\) there exists a natural number \(k_j>\beta _j\) such that
$$\begin{aligned} \left\| \Delta _{u,j}^{k_j} g \right\| _{r_j} \le L_j |u|^{\beta _j},\quad \forall u\in \mathbb{R },\quad \forall j=1,\ldots ,d. \end{aligned}$$(3.2)
The anisotropic Nikol’skii class is a special case of the anisotropic Besov class, often encountered in the nonparametric estimation literature.
In particular, \(\mathbb{N }_{\vec {r},d}\left( \vec {\beta },\cdot \right) = \mathbb{B }_{r_1,\ldots ,r_d;\infty ,\ldots ,\infty }^{\beta _1, \ldots ,\beta _d}(\cdot )\), see [29, Section 4.3.4].
3.2 Construction of kernel \(K\)
We will use the following specific kernel \(K\) in the definition of the family \(\mathcal{F }(\mathcal{H })\) (see, e.g., [23] or [15]).
Let \(\ell \) be an integer, and let \(w:[-1/(2\ell ), 1/(2\ell )]\rightarrow \mathbb{R }^1\) be a function satisfying \(\int w(y)\,\mathrm{d}y=1\) and \(w\in \mathbb{C }(\mathbb{R }^1)\). Put
The kernel \(K\) constructed in this way is bounded, supported on \([-1/2,1/2]^d\), belongs to \(\mathbb{C }(\mathbb{R }^d)\) and satisfies
where \(k=(k_1,\ldots ,k_d)\) is the multi-index, \(k_i\ge 0,\,|k|=k_1+\cdots +k_d\), and \(t^k=t_1^{k_1}\cdots t_d^{k_d}\) for \(t=(t_1,\ldots , t_d)\).
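The display defining \(K\) is omitted above. One standard construction consistent with the stated properties (used, e.g., in the literature cited above, though the paper's own display may differ) takes the univariate kernel \(w_\ell (y)=\sum _{i=1}^{\ell }\binom{\ell }{i}(-1)^{i+1}\frac{1}{i}\,w(y/i)\) and sets \(K(t)=\prod _{j=1}^d w_\ell (t_j)\); vanishing moments of \(w_\ell \) up to order \(\ell -1\) then give \(\int t^k K(t)\,\mathrm{d}t=0\) for \(1\le |k|\le \ell -1\) by the product structure. The following Python sketch checks the moment cancellation numerically for an illustrative choice of \(w\).

```python
import numpy as np
from math import comb

ell = 3  # target: vanishing univariate moments of order 1, ..., ell - 1

def w(y):
    # illustrative "mother" kernel: triangle rescaled to [-1/(2*ell), 1/(2*ell)],
    # continuous and integrating to one (not the paper's specific choice)
    return 2.0 * ell * np.clip(1.0 - 2.0 * ell * np.abs(y), 0.0, None)

def w_ell(y):
    # assumed aggregation construction:
    # w_ell(y) = sum_{i=1}^{ell} C(ell, i) (-1)^{i+1} (1/i) w(y / i),
    # supported on [-1/2, 1/2] since supp(w(./i)) = i * supp(w)
    return sum(comb(ell, i) * (-1) ** (i + 1) * w(y / i) / i
               for i in range(1, ell + 1))

y = np.linspace(-0.5, 0.5, 40001)
dy = y[1] - y[0]
moments = [(y ** m * w_ell(y)).sum() * dy for m in range(ell)]
# moments[0] should be close to 1; moments[m] close to 0 for 1 <= m <= ell - 1
```

The cancellation rests on the identity \(\sum _{i=1}^{\ell }(-1)^{i+1}\binom{\ell }{i} i^m=0\) for \(1\le m\le \ell -1\) (a finite difference of a polynomial of degree below \(\ell \)), while for \(m=0\) the same sum equals \(1\), so \(\int w_\ell =1\).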
3.3 Main results
Let \(\mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L})\) be the anisotropic Nikol’skii functional class. Put
and define
In contrast to Theorem 1, which holds over the set of all probability densities, the adaptive results presented below require an additional assumption: the estimated density should be uniformly bounded. To this end, for \(M>0\) we define
Note, however, that if \(J:=\{j\in \{1,\ldots ,d\}:\;r_j=\infty \}\ne \emptyset \) then \( \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)=\mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}) \) with \(M=\min _{j\in J}L_j\). Moreover, in view of the embedding theorem for the anisotropic Nikol’skii classes (see Sect. 6.1 below), the condition \(s>1\) implies that the density to be estimated belongs to a class of uniformly bounded and continuous functions. Thus, if \(s>1\) one has \( \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)=\mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}) \) with some \(M\) completely determined by \(\vec {L}\).
The asymptotic behavior of the \(\mathbb{L }_p\)-risk on class \(\mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)\) is characterized in the next two theorems.
Let family \(\mathcal{F }(\mathcal{H })\) be associated with kernel (3.3). Let \({\hat{f}}\) denote the estimator given by the selection rule (2.10)–(2.11) with \(\varkappa = (\Vert K\Vert _\infty \vee 1)^2[(4d+2)p+4(d+1)]\) that is applied to the family \(\mathcal{F }(\mathcal{H })\).
Theorem 2
For any \(M>0,\,L_0>0,\,\ell \in \mathbb{N }^*\), any \(\vec {\beta }\in (0,\ell ]^d,\,\vec {r}\in (1,\infty ]^d\), any \(\vec {L}\) satisfying \(\min _{j=1,\ldots ,d}L_j\ge L_0\), and any \(p\in (1,\infty )\) one has
Here the constant \(C\) does not depend on \(\vec {L}\) in the cases \(p\le s(2+1/\beta )\) and \(p\ge s(2+1/\beta ),\,s<1\).
Remark 1
-
1.
The condition \(\min _{j=1,\ldots ,d}L_j\ge L_0\) ensures that the constant \(C\) does not depend on \(\vec {L}\) in the cases \(p\le s(2+1/\beta )\) and \(p\ge s(2+1/\beta ),\,s<1\). If \(p\ge s(2+1/\beta )\) and \(s\ge 1\), then \(C\) depends on \(\vec {L}\), and the corresponding expressions can easily be extracted from the proof of the theorem. We note that in this case the map \(\vec {L}\mapsto C(\vec {L})\) is bounded on every closed cube of \((0,\infty )^{d}\).
-
2.
We consider only the case \(1<p<\infty \), excluding \(p=1\) and \(p=\infty \). It is well known [20] that smoothness alone is not sufficient to guarantee consistency of density estimators in \(\mathbb{L }_1(\mathbb{R }^d)\); see also Theorem 3 for a lower bound. The case \(p=\infty \) was considered recently in [25].
-
3.
As discussed above, Theorem 2 requires uniform boundedness of the estimated density, i.e., \(\Vert f\Vert _\infty \le M<\infty \). We note, however, that our estimator \(\hat{f}\) is fully adaptive: its construction does not use any information on the parameters \(\vec {\beta }, \vec {r}, \vec {L}\) and \(M\).
Now we present lower bounds on the minimax risk. Define
Theorem 3
Let \(\vec {\beta }\in (0,\infty )^{d},\,\vec {r}\in [1,\infty ]^{d},\,\vec {L}\in (0,\infty )^{d}\) and \(M>0\) be fixed.
-
(i)
There exists \(c>0\) such that
$$\begin{aligned} \liminf _{n\rightarrow \infty }\left\{ \left( \frac{L_\beta \alpha _n}{n}\right) ^{-\nu } \inf _{\widetilde{f}} \mathcal{R }_p^{(n)}\left[ \widetilde{f};\;\mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)\right] \right\} \ge c,\quad \forall p\in [1,\infty ), \end{aligned}$$where the infimum is taken over all possible estimators \(\widetilde{f}\). If \(\min _{j=1,\ldots ,d} L_j\ge L_0>0\) then in the cases \(p\le s(2+1/\beta )\) or \(p\ge s(2+1/\beta )\) and \(s<1\) the constant \(c\) is independent of \(\vec {L}\).
-
(ii)
Let \(p=\infty \) and \(s\le 1\); then there is no consistent estimator, i.e., for some \(c>0\)
$$\begin{aligned} \liminf _{n\rightarrow \infty }\inf _{\tilde{f}} \sup _{f\in \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)}\mathbb{E }_f\left\| \tilde{f}-f\right\| _\infty \;>\;c. \end{aligned}$$
Remark 2
-
1.
Inspection of the proof shows that if \(\max _{j=1,\ldots ,d} L_j\le L_\infty <\infty \) then the statement (i) is valid with constant \(c\) depending on \(\vec {\beta }, \vec {r}\), \(L_0\), \(L_\infty ,\,d\) and \(M\) only.
-
2.
As mentioned above, adaptive minimax density estimation on \(\mathbb{R }^d\) under \(\mathbb{L }_\infty \)-loss was the subject of the recent paper [25], where a minimax adaptive estimator is constructed under the assumption \(s>1\). Thus, statement (ii) of Theorem 3 completes the research on adaptive density estimation in the supremum norm. It is interesting to note that the minimax rates in the case \(p=\infty \) coincide with those of Theorem 2 if we formally put \(p=\infty \).
3.4 Discussion
The results of Theorem 2 together with the matching lower bounds of Theorem 3 provide a complete classification of the minimax rates of convergence in the problem of density estimation on \(\mathbb{R }^d\). In particular, we discover four different zones with respect to the minimax rates of convergence.
-
The tail zone corresponds to “small” \(p\): \(1<p\le \frac{2+1/\beta }{1+1/s}\). This zone does not appear if the density \(f\) is assumed to be compactly supported, or if a tail dominance condition is imposed; see Sect. 4.
-
The dense zone is characterized by the “intermediate” range of \(p\): \(\frac{2+1/\beta }{1+1/s}\le p\le s(2+1/\beta )\). Here the “usual” rate of convergence \(n^{-\beta /(2\beta +1)}\) holds.
-
The sparse zone corresponds to “large” \(p\): \(p\ge s(2+1/\beta )\). As Theorems 2 and 3 show, this zone is, in turn, subdivided into two regions, \(s\ge 1\) and \(s<1\). This phenomenon was not observed in the existing literature, even for settings with compactly supported densities. For other statistical models (regression, Gaussian white noise, etc.) this result is also new.
It is important to emphasize that the existence of these zones is not related to the multivariate nature of the problem or to the anisotropic smoothness of the estimated density. In fact, these results hold already in the one-dimensional case, and this, to a limited degree, was observed in previous works. In the subsequent remarks we discuss relationships between our results and existing results in the literature, and comment on some open problems.
-
1.
In [4, 9, 24] the sparse zone is defined as \(p>2(1+1/\beta ),\,s>1\). Recall that condition \(s>1\) implies that the density to be estimated belongs to a class of uniformly bounded and continuous functions. In the sparse zone we consider also the case \(s\le 1\), but density \(f\) is assumed to be uniformly bounded. It turns out that in this zone the rate corresponding to the index \(\nu =s/p\) emerges.
-
2.
The one-dimensional setting was considered in [21] and [30]. The setting of Juditsky and Lambert-Lacroix [21] corresponds to \(s=\infty \), while Reynaud-Bouret et al. [30] deal with the case \(p=2\) and \(\beta >1/r-1/2\). Both settings rule out the sparse zone. The rates of convergence in the dense zone obtained in the aforementioned papers are easily recovered from our results. However, in the tail zone our bound contains an additional \(\ln (n)\)-factor.
-
3.
In previous papers on adaptive estimation of densities with unbounded support (cf. [21] and [30]) the developed estimators are explicitly shrunk to zero; this shrinkage is used in bounding the minimax risk on the whole space. We do not employ shrinkage in our construction. Instead, we derive bounds on the \(\mathbb{L }_p\)-risk by integrating the point-wise oracle inequality (2.14). The key elements of this derivation are inequality (2.17) and statement (i) of Proposition 2. Inequality (2.17) relies on the following fact: the errors \(\zeta (x)\) and \(\chi (x)\) are integrable thanks to the careful choice of the majorant. Indeed, Sect. 5.3 shows that these errors are nonzero only with a probability that is integrable and “negligible” in the regions where the density is “small”; this yields integrability of the remainder terms in (2.14). As for Proposition 2, it concerns the integrability of the main term in (2.14). The main difficulty here is that the majorant \(M_h(\cdot ,x)\) itself is not integrable. To overcome it we use the integrability of the estimator \(\hat{f}\), the approximation properties of the density \(f\), and (2.17).
-
4.
In the context of the Gaussian white noise model on a compact interval, [23] developed an adaptive estimator that achieves the rate of convergence \((\ln {n}/n)^{\beta /(2\beta +1)}\) on the anisotropic Nikol’skii classes under the condition \(\sum _{i=1}^d [\frac{1}{\beta _i}(\frac{p}{r_i}-1)]_+<2\). This restriction determines only a part of the dense zone, and our Theorem 2 improves on this result: our estimator achieves the rate \((\ln {n}/n)^{\beta /(2\beta +1)}\) in the zone \(\sum _{i=1}^d \frac{1}{\beta _i}(\frac{p}{r_i}-1)\le 2\), which is equivalent to \(p\le s(2+1/\beta )\).
-
5.
It follows from Theorem 3 that the upper bound of Theorem 2 is sharp in the zone \(p> s(2+1/\beta ),\,s>1\), and nearly sharp, up to a logarithmic factor, in all other zones. This extra logarithmic factor is a consequence of the fact that we use the point-wise selection procedure (2.10)–(2.11). We also have an extra \(\ln n\)-factor on the boundaries \(p=\frac{2+1/\beta }{1+1/s}\) and \(p=s(2+1/\beta )\).
Conjecture 1
The rates found in Theorem 3 are optimal.
Thus, if our conjecture is true, the construction of an estimator achieving the rates of Theorem 3 in the tail and dense zones remains an open problem.
-
6.
Theorem 2 is proved under the assumption \(\vec {r}\in (1,\infty ]^{d}\); i.e., we do not include the case where \(r_j=1\) for some \(j=1,\ldots ,d\). This is related to the construction of our selection rule and to the necessity of bounding the \(\mathbb{L }_{r_j}\)-norms, \(j=1,\ldots , d\), of the term \(\bar{B}_h(f,x)\); see (2.13) and (2.14). For this purpose our derivations use properties of the strong maximal operator (for details see Sect. 6.1), and it is well known that this operator is not of weak \((1,1)\)-type in dimensions \(d\ge 2\). Nevertheless, using inequality (6.5) we were able to obtain the following result.
Corollary 1
Let \(\vec {r}\) be such that \(r_j=1\) for some \(j=1,\ldots ,d\). Then the result of Theorem 2 remains valid if the normalizing factor \((n^{-1}\ln {n})^{\nu }\) is replaced by \((n^{-1}[\ln {n}]^{d})^{\nu }\).
The proof of Corollary 1 coincides with the proof of Theorem 2, with the only difference that the bounds in the proof of Proposition 1 should use (6.5) instead of the Chebyshev inequality. This results in an extra \((\ln {n})^{d-1}\)-factor. We note that the results of Theorem 2 and Corollary 1 coincide if \(d=1\). This is not surprising because in dimension \(d=1\) the strong maximal operator is the Hardy–Littlewood maximal function, which is of the weak (1,1)-type.
4 Tail dominance condition
Let \(g:\mathbb{R }^d\rightarrow \mathbb{R }^1\) be a locally integrable function. Define the map \(g\mapsto g^*\) by the formula
where \(\Pi _h(x)=[x_1-h_1/2,x_1+h_1/2]\times \cdots \times [x_d-h_d/2,x_d+h_d/2]\). In fact, formula (4.1) defines the maximal operator associated with the differential basis \(\cup _{x\in \mathbb{R }^d}\{ \Pi _h(x), h\in (0,2]^d\}\), see [17].
Consider the following set of functions: for any \(\theta \in (0,1]\) and \(R\in (0,\infty )\) let
Note that, although we keep the previous notation \(\Vert g\Vert _\theta =(\int |g(x)|^\theta \,\mathrm{d}x)^{1/\theta }\), the functional \(\Vert \cdot \Vert _\theta \) is no longer a norm if \(\theta \in (0,1)\): the triangle inequality fails, and \(\Vert \cdot \Vert _\theta \) is only a quasi-norm.
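A discrete analogue makes this failure concrete. The following sketch (our illustration only, not part of the estimation procedure) checks that the triangle inequality fails for \(\theta =1/2\):

```python
def quasi_norm(g, theta):
    """Discrete analogue of ||g||_theta = (sum_i |g_i|^theta)^(1/theta)."""
    return sum(abs(x) ** theta for x in g) ** (1.0 / theta)

theta = 0.5
f = [1.0, 0.0]
g = [0.0, 1.0]
# ||f + g||_theta = (1 + 1)^(1/0.5) = 4, while ||f||_theta + ||g||_theta = 2.
lhs = quasi_norm([a + b for a, b in zip(f, g)], theta)
rhs = quasi_norm(f, theta) + quasi_norm(g, theta)
assert lhs > rhs  # the triangle inequality fails for theta < 1
```

The same computation, with sums replaced by integrals of indicator functions of disjoint sets, shows the failure for the continuous functional as well.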
The assumption that \(f\in \mathbb{G }_\theta (R)\) for some \(\theta \in (0,1]\) and \(R>0\) imposes restrictions on the tail of the density \(f\). In particular, the set of densities, uniformly bounded and compactly supported on a cube of \(\mathbb{R }^d\), is embedded in the set \(\mathbb{G }_\theta (\cdot )\) for any \(\theta \in (0,1]\) (for details, see Sect. 7.4). We will refer to the assumption \(f\in \mathbb{G }_\theta (R)\) as the tail dominance condition.
In this section we study the problem of adaptive density estimation under the tail dominance condition. We show that under this condition the minimax rate of convergence can be essentially improved in the tail zone. In particular, if \(\theta \le \theta ^*\) for some \(\theta ^*<1\) given below then the tail zone disappears.
For any \(\theta \in (0,1]\) let
and define
where \(\nu \) is defined in (3.4).
Theorem 4
The following statements hold.
-
(i) For any \(\theta \in (0,1]\) and \(R>0\), Theorem 2 remains valid if one replaces \(\mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)\) by \( \mathbb{G }_\theta (R)\cap \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M),\,\nu \) by \(\nu (\theta )\) and \(\mu _n\) by \(\mu _n(\theta )\). The constant \(C\) may depend on \(\theta \) and \(R\).
-
(ii) For any \(\theta \in (0,1],\,\vec {\beta },\vec {L}\in (0,\infty )^d,\,\vec {r}\in [1,\infty ]^d\) and \(M>0\) one can find \(R>0\) such that Theorem 3 remains valid if one replaces \(\mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)\) by \(\mathbb{G }_\theta (R)\cap \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M),\,\nu \) by \(\nu (\theta )\), and \(\mu _n\) by \(\mu _n(\theta )\).
Remark 3
-
1.
The tail dominance condition leads to an improvement of the rates of convergence in the whole tail zone. In particular, if \(f\in \mathbb{G }_1(R)\) then the additional \(\ln ^{\frac{d}{p}}(n)\)-factor disappears, cf. \(\mu _n\) and \(\mu _n(1)\). Moreover, under the tail dominance condition with \(\theta <1\) the faster convergence rate of the dense zone is achieved over the wider range of values \(\frac{2+1/\beta }{1/\theta +1/s} \le p \le s(2+1/\beta )\). Additionally, if
$$\begin{aligned} \theta < \theta ^*:=\frac{ps}{s(2+1/\beta )-p}, \end{aligned}$$then the tail zone disappears. Note that \(\theta ^*\in (0,1)\) whenever \(p< \frac{2+1/\beta }{1+1/s}\). As mentioned above, the set of uniformly bounded densities compactly supported on a cube of \(\mathbb{R }^d\) is embedded in the set \(\mathbb{G }_\theta (\cdot )\) for any \(\theta \in (0,1]\). This fact explains why the tail zone does not appear in problems of estimating compactly supported densities.
-
2.
We would like to emphasize that the couple \((\theta , R)\) is not used in the construction of the estimation procedure; thus, our estimator is adaptive with respect to \((\theta ,R)\) as well. In particular, if the tail dominance condition does not hold, our estimator achieves the rate of Theorem 2. On the other hand, if this assumption holds, the rate of convergence is improved automatically in the tail zone.
-
3.
The second statement of the theorem is proved under the assumption that \(R\) is large enough. The fact that \(R\) cannot be chosen arbitrarily small is not merely technical; the parameters \(\vec {\beta },\,\vec {L},\,\vec {r},\,M\), \(\theta \) and \(R\) are related to each other. In particular, one can easily provide lower bounds on \(R\) in terms of the other parameters of the class. For instance, by the Lebesgue differentiation theorem, \(f(x)\le f^*(x)\) almost everywhere; therefore for any density \(f\in \mathbb{G }_\theta (R)\) such that \(\Vert f\Vert _\infty \le M\) one has
$$\begin{aligned} 1=\int f\le M^{1-\theta }\Vert f^*\Vert ^\theta _\theta \le M^{1-\theta } R^{\theta } \Rightarrow R\ge M^{1-1/\theta }. \end{aligned}$$Another lower bound on \(R\) in terms of \(\vec {L},\,\vec {r}\) and \(\theta \) can be established using the Littlewood interpolation inequality (see, e.g., [13, Section 5.5]). Let \(0<q_0<q_1\) and \(\alpha \in (0,1)\) be arbitrary numbers; then the Littlewood inequality states that \(\Vert g\Vert _q\le \Vert g\Vert _{q_0}^{1-\alpha }\Vert g\Vert _{q_1}^{\alpha }\), where \(q\) is defined by relation \(\frac{1}{q}=\frac{1-\alpha }{q_0}+\frac{\alpha }{q_1}\). Now, suppose that \(f\in \mathbb{G }_\theta (R)\cap \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L})\), and choose \(q_0=\theta ,\,q_1=r_i\) and \(\alpha =\frac{1-\theta }{1-\theta /r_i}\); then \(q=1\) and by the Littlewood inequality we have
$$\begin{aligned} 1=\Vert f\Vert _1\le \Vert f\Vert _\theta ^{1-\alpha }\Vert f\Vert _{r_i}^\alpha \le R^{\frac{r_i\theta -\theta }{r_i-\theta }}L_{i}^{\frac{r_i-r_i \theta }{r_i-\theta }},\quad i=1,\ldots , d\;\Rightarrow \;R\ge \max _{i=1,\ldots ,d} L_i^{\frac{r_i\theta -r_i}{r_i\theta -\theta }}. \end{aligned}$$ -
4.
Another interesting observation is related to the specific case \(p=1\). Recall that the condition \(f\in \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L},M)\) alone is not sufficient for the existence of consistent estimators. However, for any \(\theta \in (0,1)\) we can show
$$\begin{aligned} \inf _{\tilde{f}} \mathcal{R }_1^{(n)}\left[ \tilde{f};\; \mathbb{G }_\theta (R)\cap \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)\right] \!\le \! C\left[ \frac{L_\beta (\ln n)^d}{n} \right] ^{\frac{1-\theta }{1-\theta /s+1/\beta }} \rightarrow 0, n\rightarrow \infty . \end{aligned}$$This result follows from the proof of Theorem 4 and (6.5).
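Returning to item 3 above, the exponent bookkeeping in the Littlewood interpolation step is easy to verify numerically. The sketch below (our illustration) checks that the choices \(q_0=\theta \), \(q_1=r_i\) and \(\alpha =\frac{1-\theta }{1-\theta /r_i}\) indeed yield \(q=1\):

```python
def interpolated_q(theta, r):
    """Value of q from 1/q = (1 - alpha)/q0 + alpha/q1 with q0 = theta, q1 = r
    and alpha = (1 - theta) / (1 - theta / r), as in the Littlewood step."""
    alpha = (1 - theta) / (1 - theta / r)
    return 1.0 / ((1 - alpha) / theta + alpha / r)

# The interpolation exponent lands exactly at q = 1 for any theta in (0,1), r > 1.
for theta in (0.25, 0.5, 0.9):
    for r in (1.5, 2.0, 10.0):
        assert abs(interpolated_q(theta, r) - 1.0) < 1e-12
```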
Now we argue that condition \(f\in \mathbb{G }_{\theta ^*} (R)\) is, in a sense, the weakest possible ensuring the “usual” rate of convergence, corresponding to the index \(\nu =\beta /(2\beta +1)\), in the whole zone \(p\le s(2+1/\beta )\). Indeed, in view of Theorem 4, the minimax rate of convergence on the class \(\mathbb{G }_{\theta ^*}(R)\cap \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)\), say \(\overline{\psi }_n(\theta ^*)\), satisfies
where the constants \(c\) and \(C\) may depend on \(R\). On the other hand, if \(\underline{\psi }_n(\theta ^*)\) denotes the minimax rate of convergence on the class \(\mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L},M){\setminus }\mathbb{G }_{\theta ^*}(R)\) then
provided that \(p\le \frac{2+1/\beta }{1+1/s}\). The upper bound in (4.4) is one of the statements of Theorem 2, while the lower bound is proved in Sect. 7.5.
Thus we conclude that there is no tail zone in estimation over the class \(\mathbb{G }_{\theta ^*}(R)\cap \mathbb{N }_{\vec {r},d}(\vec {\beta }, \vec {L},M)\), and this zone appears when considering the class \(\mathbb{N }_{\vec {r},d}(\vec {\beta }, \vec {L},M){\setminus } \mathbb{G }_{\theta ^*}(R)\). In this sense \(f\in \mathbb{G }_{\theta ^*}(R)\cap \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L},M)\) is the necessary and sufficient condition eliminating the tail zone.
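As a numerical sanity check of the threshold \(\theta ^*=ps/(s(2+1/\beta )-p)\) from Remark 3 (a sketch with \(\beta \) and \(s\) treated as given scalars), one can verify that \(\theta ^*<1\) exactly when \(p\) lies strictly inside the tail zone \(p<\frac{2+1/\beta }{1+1/s}\):

```python
def theta_star(p, s, beta):
    """Threshold from Remark 3: the tail zone disappears when theta < theta_star."""
    return p * s / (s * (2 + 1 / beta) - p)

beta, s = 2.0, 1.5
p_boundary = (2 + 1 / beta) / (1 + 1 / s)   # = 2.5 / (5/3) = 1.5
# theta* = 1 exactly on the boundary, < 1 inside the tail zone, > 1 outside.
assert abs(theta_star(p_boundary, s, beta) - 1.0) < 1e-12
assert theta_star(p_boundary - 0.1, s, beta) < 1.0
assert theta_star(p_boundary + 0.1, s, beta) > 1.0
```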
5 Proof of Theorem 1
First we state two auxiliary results, Lemmas 1 and 2, and then turn to the proof of the theorem. The proof of measurability of our estimator and the proofs of Lemmas 1 and 2 are given in Appendix A.
5.1 Auxiliary lemmas
For any \(g:\mathbb{R }^d\rightarrow \mathbb{R }^1\) denote
Lemma 1
Let \(\chi _h(g,x)= [|\hat{A}_h(g,x)-A_h(g,x)|- M_h(g,x)]_+,\,h\in \mathcal{H }\); then
The next lemma establishes moment bounds on the following four random variables:
Denote \(\mathrm{k}_\infty =\Vert K\Vert _\infty \vee 1\) and
Lemma 2
Let \(q\ge 1,\,l\ge 1\) be arbitrary numbers. If \(\varkappa \ge \mathrm{k}^2_\infty [(2q+4)d + 2l]\) then for all \( x\in \mathbb{R }^d\)
where constant \(C_0\) depends on \(d,\,q\), and \(\mathrm{k}_\infty \) only.
5.2 Proof of oracle inequality (2.14)
We recall the standard error decomposition of the kernel estimator: for any \(h\in \mathcal{H }\) one has
where \(B_h(f,x)\) and \(\xi _h(x)\) are given in (2.12) and (2.4) respectively. A similar error decomposition holds for the auxiliary estimators \({\hat{f}}_{h,\eta }(x)\); the corresponding bias and stochastic error are denoted by \(B_{h,\eta }(f,x)\) and \(\xi _{h,\eta }(x)\).
\(1^0\). The following relation for the bias \(B_{h,\eta }(f,x)\) of \(\hat{f}_{h,\eta }(x)\) holds:
Indeed, using the Fubini theorem and the fact that \(\int K_h(x)\,\mathrm{d}x=1\) for all \(h\in \mathcal{H }\) we have
It remains to note that \( \int K_h(t-y)[f(t)-f(y)]\,\mathrm{d}t=B_h(f,y)\) and to subtract \(f(x)\) from both sides of the above equality. Thus, (5.3) is proved.
\(2^{0}\). By the triangle inequality we have for any \(h\in \mathcal{H }\)
We bound each term on the right hand side separately.
First we note that, by (5.3) and (2.13), for any \(h\in \mathcal{H }\)
Thus, for any \(h\in \mathcal{H }\)
where we put
Second, by (5.3) and \(\hat{f}_{h,\eta }\equiv {\hat{f}}_{\eta , h}\) for any \(h, \eta \in \mathcal{H }\) we have
where the last inequality holds by definition of \(\hat{R}_h(x)\) [see (2.10)]. These inequalities imply the following upper bound on the first term on the right hand side of (5.4): for any \(h\in \mathcal{H }\)
where we have used the fact that \({\hat{R}}_{\hat{h}}(x) \le {\hat{R}}_h(x)\) for all \(h\in \mathcal{H }\), and inequality (5.5).
Now we turn to bounding the second term on the right hand side of (5.4). We get for any \(h\in \mathcal{H }\)
where we again used (5.5) and the fact that \(\hat{R}_{\hat{h}}(x)\le \hat{R}_h(x)\) for all \(h\in \mathcal{H }\).
Finally for any \(h\in \mathcal{H }\)
Thus, combining (5.6), (5.7) and (5.4) we obtain
\(3^{0}\). In order to complete the proof we note that by the first inequality of Lemma 1 for any \(g:\mathbb{R }^d\rightarrow \mathbb{R }^{1}\)
In addition, by the second inequality in Lemma 1
so that \(\hat{\zeta }(x) \le \zeta (x)+ 2\chi (x)\). Substituting these bounds in (5.8) we obtain
as claimed. \(\square \)
5.3 Proof of moment bounds (2.17)
Let \(\zeta _j(x),\,j=1,\ldots , 4\) be defined by (5.1). Then
as claimed in Lemma 2.
Let \(T_1=\{x\in \mathbb{R }^d: F(x)\ge n^{-l}\}\) and \(T_2=\mathbb{R }^d{\setminus }T_1\). Therefore
Now we analyze integrability on the set \(T_2\). We consider only the case \(j=1,2\) since computations for \(j=3,4\) are the same as for \(j=1\).
Let \( U_{\max }(x)=[x-1, x+1]^d \) and define the event \(D(x)=\{\sum _{i=1}^n {\mathbf{1}}[X_i\in U_{\max }(x)]<2\}\), and let \(\bar{D}(x)\) denote the complementary event. First we argue that for \(j=1,2\)
Indeed, if \(x\in T_2\) then for any \(h\in \mathcal{H }\)
Here we have used that \(\mathcal{H }=[1/n,1]^d\) and that \(\text {supp}(K)=[-1/2,1/2]^d,\;\text {supp}(Q)=[-1,1]^d\).
Hence, by definition of \(\xi _h(x)\), for any \(h\in \mathcal{H }\) one has for any \(l\ge d+1\)
where we have used that \(n^{d-l}\le (nV_h)^{-1}\) for \(l\ge d+1,\,\varkappa \ln n\ge 4\mathrm{k}_\infty \) by the condition on \(\varkappa \) [see also definition of \(M_h(K,x)\)], and \(n\ge 3\). Therefore \(\zeta _1(x)\mathbf{1}\{D(x)\}=0\) for \(x\in T_2\). By the same reasoning for \(\zeta _2(x)\) we obtain that \(\zeta _2(x)\mathbf{1}\{D(x)\}=0,\,\forall x\in T_2\) because \(\varkappa \ln n \ge 4\mathrm{k}^2_\infty \). Thus (5.10) is proved. Using (5.10) we can write
Now we bound from above the integral on the right hand side of the last display formula. For any \(z>0\) we have in view of the exponential Markov inequality
Minimizing the right hand side w.r.t. \(z\) we find \(z=\ln 2 -\ln {\{nF(x)\}}\) and, therefore,
Since \(F(x)\le n^{-l}\) for any \(x\in T_2\) we obtain
Combining this inequality with (5.11) we obtain
Choosing \(l=(d+1)q+2\) we come to the assertion of the theorem in view of (5.9) and (5.12). \(\square \)
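The minimization step in the above argument can be spelled out as follows (a reconstruction; here \(S=\sum _{i=1}^n \mathbf{1}[X_i\in U_{\max }(x)]\), so that \(\bar{D}(x)=\{S\ge 2\}\), and we take \(F(x)=\mathbb{P }_f\{X_1\in U_{\max }(x)\}\), consistent with the sets \(T_1\) and \(T_2\)). The exponential Markov inequality gives, for any \(z>0\),

```latex
\mathbb{P}_f\{\bar{D}(x)\}=\mathbb{P}_f\{S\ge 2\}
  \le e^{-2z}\,\mathbb{E}_f e^{zS}
  = e^{-2z}\bigl(1+F(x)(e^{z}-1)\bigr)^{n}
  \le \exp\bigl\{-2z+nF(x)(e^{z}-1)\bigr\}.
```

The exponent \(g(z)=-2z+nF(x)(e^z-1)\) satisfies \(g^\prime (z)=-2+nF(x)e^z\) and \(g^{\prime \prime }(z)>0\); hence \(g\) is minimized at \(e^z=2/(nF(x))\), i.e. at \(z=\ln 2-\ln \{nF(x)\}\), which is an admissible choice (\(z>0\)) since \(nF(x)<2\) on \(T_2\). Substituting this value back yields \(\mathbb{P }_f\{\bar{D}(x)\}\le (enF(x)/2)^2e^{-nF(x)}\).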
6 Proofs of Theorem 2 and statement (i) of Theorem 4
The proofs of Theorem 2 and of statement (i) of Theorem 4 go along similar lines. For this reason we state our auxiliary results (Propositions 1 and 2) in a form suitable for use in the proof of Theorem 4.
This section is organized as follows. First, in Sect. 6.1 we present and discuss some facts from functional analysis. Then in Lemma 3 of Sect. 6.2 we state an auxiliary result on approximation properties of the kernel \(K\) defined in (3.3). Proof outline and notation are discussed in Sect. 6.3. Sect. 6.4 presents two auxiliary propositions, and the proofs of Theorem 2 and statement (i) of Theorem 4 are completed in Sects. 6.5 and 6.6. Proofs of the auxiliary results, Lemma 3 and Propositions 1 and 2 are given in Appendix B.
In the subsequent proof \(c_i,C_i,\bar{c}_i, \bar{C}_i, \hat{c}_i,\hat{C}_i, \tilde{c}_i, \tilde{C}_i, \ldots \), stand for constants that can depend on \(L_0,\,M,\,\vec {\beta },\,\vec {r},\,d\) and \(p\), but are independent of \(\vec {L}\) and \(n\). These constants can differ from one appearance to another. In the case when the assumption \(f\in \mathbb{G }_\theta (R)\) with \(\theta \in (0,1]\) is imposed, they may also depend on \(\theta \) and \(R\).
6.1 Preliminaries
We present an embedding theorem for the anisotropic Nikol’skii classes and discuss some properties of the strong maximal operator.
6.1.1 Embedding theorem
The statement given below in (6.2) is a particular case of the embedding theorem for anisotropic Nikol’skii classes \(\mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L})\); see [29, Section 6.9.1].
For the fixed class parameters \(\vec {\beta }\) and \(\vec {r}\) define
and put
Let \(\tau (p)>0\) and \(\tau _i>0\) for all \(i=1,\ldots , d\); then for any \(p\ge 1\) one has
where constant \(c>0\) is independent of \(\vec {L}\) and \(p\).
6.1.2 Strong maximal function
Let \(g:\mathbb{R }^d\rightarrow \mathbb{R }\) be a locally integrable function. We define the strong maximal function \(g^\star \) of \(g\) by formula
where the supremum is taken over all possible rectangles \(H\) in \(\mathbb{R }^d\) with sides parallel to the coordinate axes, containing point \(x\). It is worth noting that the Hardy–Littlewood maximal function is defined by (6.3) with the supremum taken over all cubes with sides parallel to the coordinate axes, centered at \(x\).
It is well known that the strong maximal operator \(g\mapsto g^\star \) is of the strong \((p,p)\)-type for all \(1<p\le \infty \), i.e., if \(g\in \mathbb{L }_p(\mathbb{R }^d)\) then \(g^\star \in \mathbb{L }_p(\mathbb{R }^d)\) and there exists a constant \(\bar{C}\) depending on \(p\) only such that
Let \(g^*\) be defined in (4.1). Since obviously \(g^*(x)\le g^\star (x)\) for all \(x\in \mathbb{R }^d\) we have
In distinction to the Hardy–Littlewood maximal function, the strong maximal operator is not of the weak (1,1)-type. In fact, the following statement holds: there exists constant \(C\) depending on \(d\) only such that
We refer to [17] for more details.
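To fix ideas, in dimension one the following brute-force discrete sketch (our illustration on a grid, not the operator used in the proofs) computes the maximal function as the largest average of \(|g|\) over windows containing a given point, and checks the pointwise domination \(|g|\le g^\star \):

```python
def discrete_maximal(g):
    """Brute-force 1-d maximal function on a grid: for each index i, the
    largest average of |g| over any contiguous window containing i."""
    n = len(g)
    out = [0.0] * n
    for i in range(n):
        best = 0.0
        for a in range(i + 1):          # left endpoint of the window
            for b in range(i, n):       # right endpoint of the window
                window = [abs(x) for x in g[a:b + 1]]
                best = max(best, sum(window) / len(window))
        out[i] = best
    return out

g = [0.0, 1.0, 4.0, 1.0, 0.0]
m = discrete_maximal(g)
# Analogue of the Lebesgue differentiation theorem: |g| <= g* pointwise.
assert all(mi >= abs(gi) for mi, gi in zip(m, g))
```

In \(d\ge 2\) the supremum runs over all rectangles (not just cubes) containing the point, which is precisely what destroys the weak \((1,1)\)-type.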
6.2 Approximation properties of kernel \(K\)
The next lemma establishes an upper bound on the norm of the bias \(B_h(f,\cdot )\) of the kernel estimator \(\hat{f}_h\) when \(f\) belongs to the anisotropic Nikol’skii class.
Lemma 3
Let \(f\in \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L})\). Let \(\hat{f}_h\) be the estimator (2.1) associated with kernel (3.3) with \(\ell > \max _{j=1,\ldots ,d}\beta _j\). Then \(B_h(f,x)\) can be represented as the sum \(B_h(f, x)=\sum _{j=1}^d B_{h,j}(f,x)\) with functions \(B_{h,j}(f,x)\) satisfying the following inequalities:
Moreover, if \(s \ge 1\), then for any \(p\ge 1\)
where \(\vec {\gamma }=\vec {\gamma }(p)\) and \(\vec {q}=\vec {q}(p)\) are defined in (6.1). Here \(C_1\) and \(C_2\) are constants independent of \(\vec {L}\) and \(p\).
6.3 Proof outline and notation
The starting point of our proof is the pointwise oracle inequality (2.14) together with the moment bound (2.17). Denote
then, taking into account that \(M_\eta (K\vee Q, x)\) is greater than \(M_\eta (K,x)\) and \(M_\eta (Q,x)\) for any \(x\) and \(\eta \) [see (2.6) and (2.9)], and using (2.14), we have
where \(c_0\) is an absolute constant, and \(\omega (x):=\zeta (x)+\chi (x)\) with \(\zeta (x)\) and \(\chi (x)\) defined in (2.15) and (2.16). Therefore, by (2.17) applied with \(q=p\) and by the Fubini theorem, there exists constant \(\bar{c}_0>0\) such that for any probability density \(f\) and any Borel set \(\mathcal{A }\subseteq \mathbb{R }^d\) one has
Recall that \(\mathrm{k}_\infty =\Vert K\Vert _\infty \vee 1\); by definition of \(\bar{B}_h(f,x)\) [see (2.13)] and by Lemma 3 one has
where \(B_{h,j}^*(f,x)\) is the strong maximal function of \(|B_{h,j}(f,x)|,\,j=1,\ldots ,d\). Therefore if we let
then
The key element of the proof is the derivation of upper bounds on the integral
These bounds will be established by dividing \(\mathbb{R }^d\) into “slices” and by an appropriate choice of the bandwidth \(h\in \mathcal{H }\) on every “slice”. For this purpose the following bounds on the norms of \(B^*_{h,j}(f, \cdot )\) will be used. Inequality (6.4) and the first assertion of Lemma 3 imply that for any \(p>1,\,\vec {r}\in (1,\infty ]^{d}\) and any \(f\in \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L})\) one has
Moreover, if \(s\ge 1\) then, by the second assertion of Lemma 3, for any \(p>1,\,\vec {r}\in (1,\infty ]^{d}\) and \(f\in \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L})\)
Let \(\delta :=\ln n/n,\,\varphi := (L_\beta \delta )^{\beta /(2\beta +1)}\). Let \(m_0(\theta ),\;\theta \in (0,1],\) be an integer number to be specified later; see (6.19) below. For \(m\in \mathbb{Z },\,m\ge m_0(\theta )\) define “slices”
and consider the corresponding integrals
With this notation, using (6.9) and (6.11) we can write
The rest of the proof consists of bounding the integrals \(J_{m_0(\theta )}^-\) and \(J_m\) on the right hand side of (6.14) and combining these bounds in different zones.
The following notation will be used in the subsequent proof. For the sake of brevity we will write
We let \(I:=\{1,\ldots ,d\}\), and
With \(\vec {\gamma }=(\gamma _1,\ldots ,\gamma _d)\) and \(\vec {q}=(q_1,\ldots ,q_d)\) given by (6.1) we define quantities \(\gamma ,\,\upsilon \) and \(L_\gamma \) by the formulas
Note some useful inequalities between the quantities defined above. First, \(\gamma _j < \beta _j\) for all \(j\in I_-\) which is a consequence of the fact that \(\tau (p)<\tau _j\) for \(j\in I_-\). This implies
Next, if \(s\ge 1\) then
We have
Hence (6.17) will be proved if we show that \(r_j^{-1}\tau (p)p\ge \tau _j\) for all \(j\in I_-\). Indeed,
where to get the second inequality we have used that \(r_j\le p\) for any \(j\in I_-\) and that \(s\ge 1\). Finally, remark also that
Indeed, since \(r_j\ge p\) for any \(j\in I_+\cup I_\infty \),
This yields \(p\le \upsilon /\gamma \), and (6.18) follows.
6.4 Auxiliary results
For \(\theta \in (0,1]\) and for some constant \(\hat{c}_1>0\) define
Note that \(1-\theta /s+1/\beta >0\) for any \(\theta \in (0,1]\), since \(s\ge \beta \) by \(r_j\ge 1,\,j=1,\ldots , d\). Therefore \(m_0(\theta )<0\) for large enough \(n\).
It will be convenient to introduce the following notation
It follows from this definition that
hence \(m_1>1\) for large \(n\).
The bounds on \(J_{m_0(\theta )}^-\) and \(J_m\) are given in the next two propositions.
Proposition 1
There exist constants \(\hat{c}_1,\hat{c}_2>0\) and \({\hat{C}}_1,\hat{C}_2>0\) such that for all \(n\) large enough the following statements hold.
-
(i) For any probability density \(f\) and any \(m_0(1)\le m\le 0\)
$$\begin{aligned}&J_m \le \hat{C}_1\, 2^{m\left( p-\frac{2+1/\beta }{1+1/s}\right) } \varphi ^p. \end{aligned}$$(6.23) -
(ii) Let \(f\in \mathbb{G }_\theta (R),\,\theta \in (0,1]\); then for any \(m_0(\theta )\le m\le 0\) one has
$$\begin{aligned}&J_m \le \hat{C}_1\, 2^{m\left( p-\frac{2+1/\beta }{1/\theta +1/s}\right) } \varphi ^p. \end{aligned}$$(6.24) -
(iii) For any \(m\in \mathbb{Z }\) satisfying \(1\le 2^{m} \le \hat{c}_2\varphi ^{-1}\) and any probability density \(f\) one has
$$\begin{aligned}&J_m\;\le \;\hat{C}_2 2^{m [p- s(2+1/\beta )]} \varphi ^p. \end{aligned}$$(6.25) -
(iv) Let \(s\ge 1\); then for any \(m\in \mathbb{Z }\) such that \( m\ge m_1, \; 2^{m} \le \hat{c}_2\varphi ^{-1} \) and any probability density \(f\) one has
$$\begin{aligned}&J_m \;\le \;\hat{C}_2\varphi ^{p} \left[ \frac{L_\gamma \varphi ^{1/\beta }}{L_\beta \varphi ^{1/\gamma }}\right] ^{\upsilon } 2^{m\left[ p-\upsilon (2+1/\gamma )\right] }. \end{aligned}$$(6.26)
Proposition 2
There exist constants \(\hat{C}_3, \hat{C}_4>0\) such that the following statements hold.
-
(i) Let \(\nu \) be defined in (3.4). Then for all large enough \(n\) and for any density \(f\) one has
$$\begin{aligned} J^{-}_{m_0(1)}\;=\;\mathbb{E }_f\int _{\mathcal{X }^-_{m_0(1)}} \left| \hat{f}(x)-f(x)\right| ^{p}\,\mathrm{d}x\;\le \; \hat{C}_3\pi _n(L_\beta \delta )^{p\nu }, \end{aligned}$$(6.27)where \(\pi _n=\ln ^d(n)\) if \(p\le \frac{2+1/\beta }{1+1/s}\) and \(\pi _n=1\) otherwise.
-
(ii) Let \(\nu (\theta )\) be defined in (4.3). Then for any \(\theta \in (0,1)\) and for all \(n\) large enough
$$\begin{aligned} \sup _{f\in \mathbb{G }_\theta (R)}\mathbb{E }_f\int _{\mathcal{X }_{m_0(\theta )}^-} \left| \hat{f}(x)-f(x)\right| ^{p}\,\mathrm{d}x\le \hat{C}_4(L_\beta \delta )^{p\nu (\theta )}. \end{aligned}$$(6.28)
6.5 Proof of Theorem 2
Using (6.14) and inequality (6.27) of Proposition 2 we obtain
We proceed with bounding the second term on the right hand side of the last display formula. First, because \(\Vert f\Vert _\infty \le M\),
This implies that there exists constant \(c_3>0\) with the following property:
Thus the sum on the right hand side of (6.29) extends from \(m_0(1)\) to \(m_2\).
\(1^0\). Tail zone: \(p<\frac{2+1/\beta }{1+1/s}\). Using bounds (6.23) and (6.25) of Proposition 1, we obtain
where the last inequality follows from the fact that \(m_0(1)<0\) and \(p<\frac{2+1/\beta }{1+1/s}<s(2+1/\beta )\). Using (6.19), after straightforward algebra we obtain that
\(2^0\). Dense zone: \(\frac{2+1/\beta }{1+1/s} < p< s(2+\frac{1}{\beta })\). Because \(p>\frac{2+1/\beta }{1+1/s}\), by Proposition 1, inequality (6.23) with \(\theta =1\),
Furthermore, because \(p<s(2+\frac{1}{\beta })\) we have by Proposition 1, inequality (6.25), that
Thus, in the dense zone
\(3^{0}\). Sparse zone: \(p>s(2+1/\beta ),\,s<1\). First we note that the bound in (6.30) remains true since \(p>s(2+1/\beta )\). For the same reason, in view of Proposition 1, inequality (6.25),
Here we have used the definition of \(m_2\). It remains to note that the conditions \(p>s(2+1/\beta ),\,s<1\) imply that \(\varphi ^{p}\delta ^{-s}\rightarrow 0\) as \(n\rightarrow \infty \). Therefore the statement of the theorem follows from (6.30) and (6.31).
\(4^{0}\). Sparse zone: \(p>s(2+1/\beta ),\,s\ge 1\). We need to bound only \(\sum _{m=1}^{m_2} J_m\), because (6.30) remains true. By inequality (6.25) of Proposition 1 and because \(p>s(2+1/\beta )\)
Next, we have in view of the inequality (6.26) of Proposition 1
Since \(p-\upsilon (2+1/\gamma )<0\) [see (6.18)],
In order to obtain the second inequality we have used (6.21). Thus,
Using equality (6.22) and (6.21) we obtain
The statement of the theorem is now obtained by the following routine computations. Denote
First, we remark that
Next,
Hence, \(1/\gamma -1/\beta =1/\gamma _{-} -1/\beta _{-}=A/(\tau (p)\beta )\), which implies that
Two last equalities yield
where the last equality follows from the fact that \(\tau (p)-1/(p\beta )=1-1/s\). This together with (6.32) leads to the statement of the theorem in the sparse zone.
\(5^0\). Boundary zones: \(p=s(2+\frac{1}{\beta }),\,p=\frac{2+1/\beta }{1+1/s}\). Here the proof coincides with the proof for the dense zone with the only difference that the corresponding sums equal \(|m_1|\) and \(m_2\) respectively. \(\square \)
6.6 Proof of statement (i) of Theorem 4
In view of (6.14) and by bound (6.28) of Proposition 2,
If \(p<\frac{2+1/\beta }{1/\theta +1/s}\) then, using bounds (6.24) and (6.25) of Proposition 1, we have
and the assertion of the theorem follows. If \(s(2+1/\beta )\ge p\ge \frac{2+1/\beta }{1/\theta +1/s}\) then
7 Proofs of Theorem 3, statement (ii) of Theorem 4 and the lower bound in (4.4)
The proof is organized as follows. First, we formulate two auxiliary statements, Lemmas 4 and 5. Second, we present a general construction of a finite set of functions employed in the proof of lower bounds. Then we specialize the constructed set of functions in different regimes and derive the announced lower bounds.
7.1 Auxiliary lemmas
The first statement given in Lemma 4 is a simple consequence of Theorem 2.4 from [34]. Let \(\mathbb{F }\) be a given set of probability densities.
Lemma 4
Assume that for any sufficiently large integer \(n\) one can find a positive real number \(\rho _n\) and a finite subset of functions \(\{f^{(0)}, f^{(j)},\;j\in \mathcal{J }_n\}\subset \mathbb{F }\) such that
Then for any \(q\ge 1\)
where the infimum on the left hand side is taken over all possible estimators.
We will apply Lemma 4 with \(\mathbb{F }= \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)\) in the proof of Theorem 3 and with \(\mathbb{F }=\mathbb{G }_\theta (R)\cap \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)\) in the proof of statement (ii) of Theorem 4.
Next we quote the Varshamov–Gilbert lemma (see, e.g., Lemma 2.9 in [34]).
Lemma 5
(Varshamov–Gilbert) Let \(\varrho _m\) be the Hamming distance on \(\{0,1\}^m,\,m\in \mathbb{N }^*\), i.e.
For any \(m\ge 8\) there exists a subset \(\mathcal{P }_m\) of \(\{0,1\}^m\) such that \(|\mathcal{P }_m|\ge 2^{m/8}\), and
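A greedy selection illustrates why such a subset exists. The sketch below (our illustration; for transparency we impose pairwise distance \(2\), stricter than the \(m/8\) of the lemma for \(m=8\)) still easily exceeds the guaranteed cardinality \(2^{m/8}\):

```python
from itertools import product

def hamming(u, v):
    """Hamming distance between two binary words of equal length."""
    return sum(a != b for a, b in zip(u, v))

def greedy_code(m, d):
    """Greedily collect binary words of length m with pairwise distance >= d."""
    code = []
    for w in product((0, 1), repeat=m):
        if all(hamming(w, c) >= d for c in code):
            code.append(w)
    return code

m = 8
code = greedy_code(m, 2)
# The greedy code (here: all even-weight words) far exceeds 2^(m/8) = 2.
assert len(code) >= 2 ** (m // 8)
assert all(hamming(u, v) >= 2 for i, u in enumerate(code) for v in code[i + 1:])
```

The lemma itself is non-constructive; the greedy argument merely shows that codes of the required size and separation are abundant.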
7.2 Proof of Theorem 3. General construction of a finite set of functions
1\(^0\). For any \(t\in \mathbb{R }\) set
Note that \(\Lambda \) is a probability density compactly supported on \([-1,1]\) and infinitely differentiable on the real line, \(\Lambda \in \mathbb{C }^{\infty }(\mathbb{R }^1)\). Obviously, for any \(\alpha >0\) and \(r\ge 1\) there exists constant \(c_1=c_1(\alpha ,r)<\infty \) such that
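A canonical example of such a density is the normalized bump function; the sketch below (our illustration, not necessarily the exact \(\Lambda \) used here) verifies numerically that it integrates to one and vanishes outside \([-1,1]\):

```python
import math

def bump(t):
    """Unnormalized C^infinity bump supported on (-1, 1)."""
    return math.exp(-1.0 / (1.0 - t * t)) if abs(t) < 1.0 else 0.0

# Normalize by a midpoint-rule integral so that Lambda is a probability density.
N = 200_000
h = 2.0 / N
Z = sum(bump(-1.0 + (k + 0.5) * h) for k in range(N)) * h

def Lambda(t):
    return bump(t) / Z

integral = sum(Lambda(-1.0 + (k + 0.5) * h) for k in range(N)) * h
assert abs(integral - 1.0) < 1e-6   # Lambda integrates to (approximately) one
assert Lambda(1.5) == 0.0           # and is supported on [-1, 1]
```

All derivatives of this bump vanish at \(\pm 1\), which is what makes \(\Lambda \in \mathbb{C }^{\infty }(\mathbb{R }^1)\) despite the compact support.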
Define
where the parameter \(N=N(n)>8\) will be chosen later. By construction, \(\bar{f}^{(0)}\) is a probability density for any choice of \(N\); moreover, \(\mathrm{supp}(\bar{f}^{(0)})=[-N/2-1, N/2+1]^d\), and
Moreover, in view of (7.3) and by the Young inequality, there exists a vector of constants \(\vec {C}=(\tilde{C}_1,\ldots ,\tilde{C}_d)\) depending on \(\vec {\beta }\) and \(\vec {r}\) only such that
Note that \(\vec {C}\) does not depend on \(N\).
Let \(L_0>0\) be fixed, and let \(f^{(0)}(x)=\varkappa ^{d}\bar{f}^{(0)}(x\varkappa )\), where \(\varkappa >0\) is chosen in such a way that \(f^{(0)}\) belongs to the class \(\mathbb{N }_{\vec {r},d}(\vec {\beta },2^{-1}\vec {L}_0)\), where \(\vec {L}_0=(L_0,\ldots ,L_0)\). The existence of such \(\varkappa \) independent of \(N\) and determined by \(\vec {\beta },\,\vec {r}\) and \(L_0\) is guaranteed by (7.5). Note also that \(f^{(0)}\) is a probability density. Moreover, we remark that \(\Vert \bar{f}^{(0)}\Vert _\infty \le N^{-d}\) since \(\int |\Lambda |=1\).
Thus,
provided that \(N>(2M^{-1})^{1/d}\varkappa \). This condition is assumed to be fulfilled.
\(2^{0}\). Put for any \(t\in \mathbb{R }^1\)
We obviously have \(g\in \mathbb{C }^\infty (\mathbb{R }^{1})\), and
For any \(l=1,\ldots ,d\) let \((20\varkappa )^{-1}>\sigma _l=\sigma _l(n)\rightarrow 0,\,n\rightarrow \infty \), be sequences to be specified later. Let \(M_l=(20\varkappa \sigma _l)^{-1}N\), and without loss of generality assume that \(M_l,\,l=1,\ldots , d\), are integers. Define also
and let \(\mathcal{M }=\{1,\ldots , M_1\}\times \cdots \times \{1,\ldots , M_d\}\). For any \(m=(m_1,\ldots ,m_d)\in \mathcal{M }\) define
Several remarks on these definitions are in order. First, in view of (7.7)(ii)
Second, since \(g\in \mathbb{C }^{\infty }(\mathbb{R }^1)\), we have that \(G_m\in \mathbb{C }^{\infty }(\mathbb{R }^d)\) for any \(m\in \mathcal{M }\). Moreover, for any \(l=1,\ldots ,d\), any \(|h|\le \sigma _l\) and any integer \(k\)
where \(D^k_l G\) stands for the \(k\)th order derivative of a function \(G\) with respect to the variable \(x_l\), and \(\Delta _{h,l}\) is the first order difference operator with step size \(h\) in direction of the variable \(x_l\). For \(m\in \mathcal{M }\) define
It is easily checked that \(\pi \) defines an enumeration of the set \(\mathcal{M }\), i.e., \(\pi :\mathcal{M }\rightarrow \{1,2,\ldots ,|\mathcal{M }|\} \) is a bijection. Let \(W\) be a subset of \(\{0,1\}^{|\mathcal{M }|}\). Define a family of functions \(\{F_w, w\in W\}\) by
where \(w_j,\,j=1,\ldots , |\mathcal{M }|\) are the coordinates of \(w\), and \(A\) is a parameter to be specified. It follows from (7.7)(iii), (7.8) and (7.9) that
and (7.7)(i) implies that
\(3^{0}\). Now we find conditions which guarantee that \(F_{w}\in \mathbb{N }_{\vec {r},d}(\vec {\beta },2^{-1}\vec {L})\) for any \(w\in W\).
Fix \(l=1,\ldots ,d\), and let \(k_l=\lfloor \beta _l\rfloor +1\) if \(\beta _l\notin \mathbb{N }^*\), and \(k_l=\lfloor \beta _l\rfloor +2\) if \(\beta _l\in \mathbb{N }^*\) (here \(\lfloor x \rfloor \) stands for the maximal integer number strictly less than \(x\)).
First, for any \(w\in W\) and \(h\in \mathbb{R }\)
where the last inequality is found in [29, Section 4.4.4]. Next, in view of (7.9) and (7.10) we obtain for any \(w\in W\) and any \(r_l\ne \infty \)
where we have put \(S_W:=\sup _{w\in W}|\{j:\;w_j\ne 0\}|\). Thus, for any \(r_l\ne \infty \) we have
Similarly, we get for any \(w\in W\)
In view of (7.7)(ii) and \(|h|\le \sigma _l\), function \(g^{(k_l-1)}(\cdot -[h/\sigma _l]) -g^{(k_l-1)}(\cdot )\) is supported on \([-3,3]\). Therefore the fact that \(g\in \mathbb{C }^{\infty }(\mathbb{R }^1)\) implies for any \(r_l\in [1,\infty ]\)
In the last inequality we have used that \(0\le \beta _l-k_l+1\le 1\) by definition of \(k_l\). Combining this with (7.13), (7.14) and (7.15) we have for any \(|h|\le \sigma _l\) and any \(r_l\in [1,\infty ]\)
If \(|h|\ge \sigma _l\) then we note that \(\Delta _{h,l}(D^{k_l-1}_l F_w)(\cdot )=(D^{k_l-1}_lF_w)(\cdot -h e_l)-(D^{k_l-1}_lF_w)(\cdot )\), and by the triangle inequality
In view of (7.8) and (7.9) we get for any \(w\in W\) and any \(r_l\ne \infty \)
Moreover,
We obtain finally from (7.13) that for any \(|h|\ge \sigma _l\) and any \(r_l\in [1,\infty ]\)
Combining (7.16) and (7.17) we conclude that for any \(w\in W\) and \(r_l\in [1,\infty ]\)
where \(C_1=\max _{l}(\Vert g\Vert _{r_l}^{d-1} \max \{6^{1/r_l}\Vert g^{(k_l)}\Vert _{\infty },2\Vert g^{(k_l-1)}\Vert _{r_l}\})\). Thus, if
then \(F_{w}\in \mathbb{N }_{\vec {r},d}(\vec {\beta },2^{-1}\vec {L})\) for any \(w\in W\).
4\(^0.\) Define for any \(w\in W\)
Recall that \(f^{(0)}\) is the probability density belonging to \(\mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}_0/2, M/2)\). Therefore, in view of (7.12) and under condition (7.18), for any \(w\in W\)
where the latter inclusion holds because \(\min _{j=1,\ldots ,d}L_j\ge L_0\).
By construction of \(F_{w}\), for any \(w\in W\)
This yields
On the other hand, by (7.4)
Therefore, if we require
this together with (7.13) implies
We conclude that \(f_w\ge 0\) for any \(w\in W\). Moreover, we get from (7.6), (7.11) and (7.23) that \(\Vert f_w\Vert _\infty \le M\) for any \(w\in W\).
All this, together with (7.19), shows that \(\{f^{(0)}, f_w, w\in W\}\) is a finite set of probability densities from \(\mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)\). Thus Lemma 4 is applicable with \(\mathcal{J }_n=W\) and \(\mathbb{F }= \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)\).

5\(^0\). Suppose now that the set \(W\) is chosen so that
where, we recall, \(\varrho _{|\mathcal{M }|}\) is the Hamming distance on \(\{0,1\}^{|\mathcal{M }|}\). Here \(B=B(n)\ge 1\) is a parameter to be specified. Then we deduce from (7.19), (7.8) and (7.9), that for all \(w,w^\prime \in W\)
Here we have used that the map \(\pi \) is a bijection. Putting \(C_2=\frac{1}{2}\Vert g\Vert ^{d}_{p}\), we conclude that condition (7.1) of Lemma 4 is fulfilled with
Let us remark that (7.26) remains true if we formally put \(p=\infty \). Indeed, similarly to (7.25),
Here we have used (7.9), the fact that the map \(\pi \) is a bijection, and that \(w\ne w^\prime \) for all \(w,w^\prime \in W\) in view of (7.24).
Now we verify condition (7.2) of Lemma 4. First observe that
Since \(X_k,\,k=1,\ldots ,n\) are i.i.d. random vectors, we have for any \(w\in W\)
The last equality follows from (7.12). By (7.20) and (7.22),
hence for any \(w\in W\)
Repeating computations that led to (7.17) we have
The right hand side of the latter inequality does not depend on \(w\); hence we obtain
where we have put \(C_3=\varkappa ^{-d}\Vert g\Vert ^{2d}_{2}\). Therefore, if
then condition (7.2) of Lemma 4 is fulfilled with \(C=1\).
In order to apply Lemma 4 it remains to specify the set \(W\) and the parameters \(A,\,N,\,\sigma _j,\,j=1,\ldots ,d\) so that the relationships (7.18), (7.23), (7.24), and (7.28) are simultaneously fulfilled. According to (4), under these conditions the lower bound is given by \(\rho _n\) in (7.26).
7.3 Proof of Theorem 3. Derivation of lower bounds in different zones
We begin with the construction of the set \(W\). Let \(m\ge 8\) be an integer whose choice will be made later, and, without loss of generality, assume that \(|\mathcal{M }|/m\) is an integer. Let \(\mathcal{P }_m\) be a subset of \(\{0,1\}^{m}\) such that
Existence of such set \(\mathcal{P }_m\) is guaranteed by Lemma 5. Let \(\mathcal{J }:=\{1+\frac{j}{m}|\mathcal{M }|, \;j=0,\ldots , m-1\}\), and note that \(\mathcal{J }\subseteq \{1,\ldots ,|\mathcal{M }|\}\) with the equality in the case \(m=|\mathcal{M }|\). Define the map \(\Upsilon : \mathcal{P }_m\rightarrow \{0,1\}^{|\mathcal{M }|}\) by
and let \(W=\Upsilon (\mathcal{P }_m)\). Obviously, \(\varrho _{|\mathcal{M }|}(w,w^\prime )=\varrho _{|\mathcal{M }|} (\Upsilon [a],\Upsilon [a^\prime ])= \varrho _m(a,a^\prime )\) for all \(w,w^\prime \in W\); therefore (7.29) implies that
With such a set \(W,\,S_W\le m\); moreover, since \(\ln (|W|)\ge m\ln 2/8\), condition (7.28) holds true if
We also note that condition (7.18) is fulfilled if we require
In addition, (7.24) holds with \(B=m/8\).
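Lemma 5 is a Varshamov–Gilbert-type bound. The interaction between a separated set \(\mathcal{P }_m\) and the coordinate-spreading map \(\Upsilon \) can be sketched numerically; the greedy search below, the toy sizes, and the distance threshold \(m/2\) (used instead of \(m/8\) to keep the demo small) are illustrative assumptions, not the paper's construction.

```python
import itertools

def hamming(a, b):
    """Hamming distance between two binary tuples of equal length."""
    return sum(x != y for x, y in zip(a, b))

def greedy_separated_set(m, dmin):
    """Greedily collect binary words of length m with pairwise Hamming
    distance at least dmin (a naive Varshamov-Gilbert-style packing)."""
    chosen = []
    for a in itertools.product((0, 1), repeat=m):
        if all(hamming(a, b) >= dmin for b in chosen):
            chosen.append(a)
    return chosen

def upsilon(a, m, big_m):
    """Spread the m coordinates of a over the positions {j * big_m/m : j},
    filling the remaining big_m - m positions with zeros."""
    w = [0] * big_m
    step = big_m // m
    for j, bit in enumerate(a):
        w[j * step] = bit
    return tuple(w)

m, big_m = 10, 20              # toy sizes; the paper lets these grow with n
P = greedy_separated_set(m, m // 2)
W = [upsilon(a, m, big_m) for a in P]
```

Since \(\Upsilon \) writes the coordinates of \(a\) to distinct positions and fills the rest with zeros, it is injective and preserves Hamming distances, which is the property used to transfer (7.29) to \(W\).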
7.3.1 Tail zone: \(p\le \frac{2+1/\beta }{1+1/s}\)
Let \(m=|\mathcal{M }|\). By construction, \(|\mathcal{M }|=\prod _{l=1}^d M_l= (20\varkappa )^{-d}N^d\prod _{l=1}^d\sigma _l^{-1}\) and, therefore (7.32) is reduced to
Thus, choosing
we guarantee the fulfillment of (7.33) provided that \(C_6\ge \max _{l=1,\ldots ,d}C^{-1/\beta _l}_5\). Moreover, with this choice (7.31) is reduced to
where, as before, \(L_\beta =\prod _{l=1}^d L_l^{1/\beta _l}\). Moreover, we have from (7.26)
Let \(N^d=C_9A^{-1}\), where constant \(C_9\le \varkappa ^{d}\) will be specified below; then (7.23) holds. Next, in view of (7.35) and (7.36)
We remark that \(N\rightarrow \infty \) as \(n\rightarrow \infty \). It remains to check that \(\sigma _l,\,l=1,\ldots , d\) are small enough. It follows from (7.34) that if \(r_l>1\), then \(\sigma _l\rightarrow 0\) as \(n\rightarrow \infty \) since \(A\rightarrow 0\). If \(r_l=1\), then
Choosing \(C_9\) small enough we guarantee that \(\sigma _l\le (20\varkappa )^{-1}\), for all \(l=1,\ldots ,d\). This condition is required in the construction of the family \(G_{m},\,m\in \mathcal{M }\). Thus, Lemma 4 can be applied with \(\rho _n=C_{11}(L_\beta \alpha _n n^{-1})^{\nu }\), and the result follows.
7.3.2 Dense zone: \(\frac{2+1/\beta }{1+1/s} \le p\le s(2+\frac{1}{\beta })\)
Here, as in the previous case, we let \(m=|\mathcal{M }|\). The relationships (7.34), (7.35) and (7.36) remain true, but our choice of \(N\) will be different.
Let \(N=C_{12}\) for some constant \(C_{12}\). This yields in view of (7.35) and (7.36)
The requirement (7.23) is obviously fulfilled since \(A\rightarrow 0,\;n\rightarrow \infty \). Moreover, we obtain from (7.34) that \(\sigma _l\rightarrow 0\) as \(n\rightarrow \infty \) and, therefore, \(\sigma _l\le (20\varkappa )^ {-1},\,l=1,\ldots ,d\) for \(n\) large enough. Thus, Lemma 4 can be applied with \(\rho _n=C_{14}(L_\beta \alpha _n n^{-1})^{\nu }\) and the result follows.
7.3.3 Sparse zone: \(s(2+\frac{1}{\beta })<p<\infty ,\,s<1\)
Let \(A=\tilde{C}\) and \(N=C_{17}\) and suppose that \(\tilde{C}\le C^{-1}_{17} \varkappa ^{d}\); then (7.23) is satisfied. Moreover (7.31) and (7.32) are reduced to
Let \(\tilde{c}_1,\,\tilde{c}_2\) be constants satisfying \(\tilde{c}_1\le \tilde{C}^{-1}C_{18}\), and \(\tilde{c}_2\le \tilde{C}^{-1}C_{19}\). It is straightforward to check that if we choose
then inequalities (7.37) are fulfilled. With this choice (7.26) is reduced to
It remains to verify that \(\sigma _l\) are small enough, and that \(m\ge 8,\,|\mathcal{M }|/m\ge 1\). Note that \(m\rightarrow \infty \) as \(n\rightarrow \infty \) because of \(s<1\). Recall also that
hence \(|\mathcal{M }|/m\ge (20\varkappa )^{-d}C_{17}^d (\tilde{c}_1\tilde{c}_2^{1/\beta })^{-s}L_0^{-s/\beta }n^s\). Thus \(|\mathcal{M }|/m\ge 1\) for large enough \(n\).
We note also that \(\sigma _l\le (\tilde{c}_2L_0)^{-1/\beta _l}\) for all \(n\) large enough. Therefore, if we choose \(\tilde{C}\) large enough and put \(\tilde{c}_2=\tilde{C}^{-1} C_{19}\) we can ensure that \(\sigma _l\le (20\varkappa )^{-1}\) for all \(l=1,\ldots ,d\). Thus, Lemma 4 can be applied with \(\rho _n=\tilde{C}C_{20}(L_\beta \alpha _n n^{-1})^{\nu }\) and the result follows.
7.3.4 Sparse zone: \(s(2+\frac{1}{\beta })<p<\infty ,\,s\ge 1\)
Here we consider another choice of the set \(W\). Let \(W=\{e_1,e_2,\ldots ,e_{|\mathcal{M }|}\}\), where \(e_1,\ldots ,e_{|\mathcal{M }|}\) is the canonical basis of \(\mathbb{R }^{|\mathcal{M }|}\). With this choice
and (7.24) holds with \(B=1\). Let \(N=C_{14}\); then (7.18) and (7.28) take the form
Moreover, we get from (7.26)
Put \(\varepsilon =\sqrt{\ln n/n}\) and
We have
and it is evident that \(\prod _{l=1}^d\sigma _{l}\le \varepsilon ^{1/(\beta +1/2)}\) for all \(n\) large enough; hence \(\ln (\prod _{l=1}^d\sigma ^{-1}_{l})\ge \ln n/(2\beta +1)\). Then it is easily checked that our choice (7.43) satisfies (7.40) and (7.41) provided that
Here we have also used that \(d-1/s\ge 0\). Note also that if \(s>1\) then
which ensures (7.23) and \(\sigma _l\le (20\varkappa )^{-1},\,l=1,\ldots , d\) for all \(n\) large enough.
On the other hand, if \(s=1\) then we should add to (7.44) the conditions
Obviously, both restrictions hold if we choose \(c_1\) and \(c_2\) small enough, but now these constants may depend on \(\vec {L}\). Note, however, that if \(\max _{l=1,\ldots ,d}L_l\le L_\infty \) then \(c_1\) and \(c_2\) can be chosen depending on \(L_0\) and \(L_\infty \) only.
Using (7.42) and (7.43) we conclude that Lemma 4 is applicable with
which completes the proof of statement (i) of the theorem.
7.3.5 Proof of statement (ii): sparse zone, \(p=\infty ,\,s\le 1\)
The proof in this case coincides with the one for the sparse zone with \(s<1\). Thus, we keep (7.37), (7.38), and, in view of (7.27), (7.39) is replaced by \( \rho _n=\widetilde{C}C_{17}. \) Since \(\rho _n\) does not tend to \(0\) as \(n\rightarrow \infty \), a consistent estimator does not exist. All other details of the proof remain unchanged. This completes the proof of Theorem 3. \(\square \)
7.4 Proof of statement (ii) of Theorem 4
The proof goes along the lines of the proof of Theorem 3 with modifications indicated below.
We start with the following simple observation: for any \(M>0\) and \(y>0\) one has
This is an immediate consequence of the fact that conditions \(\Vert g\Vert _\infty \le M,\,\mathrm{supp}\{g\} \subseteq [-y,y]^d\) imply that \(\Vert g^*\Vert _\infty \le M\) and \(\mathrm{supp}\{g^*\}\subseteq [-y-2, y+2]^d\).
Next, we note that the lower bounds of Theorem 3 in the dense and sparse zones are proved over the set of compactly supported densities. Hence they are valid also on \(\mathbb{G }_\theta (R)\cap \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)\), provided that \(R\) is large enough. Therefore, if \(p\ge \frac{2+1/\beta }{1/\theta +1/s}\) the assertion of the theorem follows.
Let \(p< \frac{2+1/\beta }{1/\theta +1/s}\). The proof of the lower bound here differs from the proof of Theorem 3 only in construction of the function \(f^{(0)}\).
Let \(f^{(0)}\) be the function constructed exactly as in the proof of Theorem 3 with \(N=N_0\) fixed throughout the asymptotics \(n\rightarrow \infty \), and such that \(f^{(0)}\in \mathbb{N }_{\vec {r},d}(\vec {\beta },4^{-1}\vec {L}_0, 4^{-1}M)\). Since \(N_0\) is fixed, \(f^{(0)}\) is compactly supported and, by (7.46) we have that \(f^{(0)}\in \mathbb{G }_\theta (R_1)\) for some large enough \(R_1>0\). Define
where \(N=N(n)\rightarrow \infty \) will be specified later. Let \(\tilde{f}^{(\theta )}(x)=\varsigma ^{d}\bar{f}^{(\theta )}(\varsigma x)\), where \(\varsigma >0\) is chosen to guarantee \(\tilde{f}^{(\theta )}\in \mathbb{N }_{\vec {r},d}(\vec {\beta },4^{-1}\vec {L}_0, 4^{-1}M)\). We note, however, that in contrast to the case \(\theta =1\), \(\tilde{f}^{(\theta )}\) is not a probability density. In particular, \(\int \tilde{f}^{(\theta )}\rightarrow 0\) as \(N\rightarrow \infty \), because \(\theta <1\). Define
where \(p_N:=\int \tilde{f}^{(\theta )}\) ensures \(\int f^{(\theta )}=1\).
Note also that \(f^{(1)}=\tilde{f}^{(1)}\) since \(\tilde{f}^{(1)}\) is a probability density and, therefore \(p_N=1\).
Thus, we can assert that
Note that, by construction, \(\tilde{f}^{(\theta )}\) is supported on the cube \([(-N/2-1)/\varsigma ,(N/2+1)/ \varsigma ]^d\) and bounded by \(N^{-d/\theta }\varsigma ^d\). Therefore, in view of (7.46), \(\tilde{f}^{(\theta )}\in \mathbb{G }_\theta (R_2)\) for some large enough \(R_2\).
Let \(W\) be the parameter set as defined in the proof of Theorem 3. For any \(w\in W\) and any \(\theta \le 1\) we let
where functions \(F_w\) are constructed as in the proof of Theorem 3. If instead of (7.23) we require
then we obtain in view of (7.11) and (7.47) that \(\{F_w,\;w\in W\}\subset \mathbb{G }_\theta (R_3)\) for some large enough \(R_3\). All of the above allows us to conclude that \(\{f^{(\theta )},\;f^{(\theta )}_w,\;w\in W\}\) is a finite set of probability densities from \(\mathbb{G }_\theta (R)\cap \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)\) for some large enough \(R>0\), and Lemma 4 is applicable with \(\mathcal{J }_n=W\) and \(\mathbb{F }=\mathbb{G }_\theta (R)\cap \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)\).
Note also that if \(\theta =1\) we come to the construction used in the proof of Theorem 3 and, therefore, the statement of the theorem in the case \(\theta =1\) follows.
Suppose now that \(\theta <1\). We follow the construction of the set \(W\) for the tail zone given in Sect. 7.3.1. Choose \(m=|\mathcal{M }|\) and note that (7.33), (7.34), (7.36) remain unchanged, while (7.35) should be replaced by
Now we choose \(N^d=cA^{-\theta }\) with \(c\le \varkappa ^{d}+\varsigma ^{d}\); then (7.47) is valid. We obtain from (7.48) that
Finally, because (7.34) remains intact, \(\sigma _l\rightarrow 0\) as \(n\rightarrow \infty \) for any \(l=1,\ldots , d\); this follows from \(A\rightarrow 0\) and \(\theta <1\). This completes the proof. \(\square \)
7.5 Proof of the lower bound in (4.4)
The required result will follow from the lower bound of Theorem 3 in the tail zone (see Sect. 7.3.1) if we show that for any given \(R>0\) and \(\theta \in (0,1)\)
First we note that \(f^{(0)}=N^{-d}\) for \(x\in [-(N-2)/(2\varkappa ), (N-2)/(2\varkappa )]^d\); therefore, \(\Vert [f^{(0)}]^*\Vert _\theta \rightarrow \infty \) as \(N\rightarrow \infty \), because \(\theta <1\).
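The divergence is elementary arithmetic: for the flat part described above,

```latex
\Big\Vert N^{-d}\,\mathbf{1}_{[-(N-2)/(2\varkappa ),\,(N-2)/(2\varkappa )]^d}\Big\Vert _\theta
= N^{-d}\Big(\frac{N-2}{\varkappa }\Big)^{d/\theta }
\asymp \varkappa ^{-d/\theta }\,N^{d(1/\theta -1)}\longrightarrow \infty ,\qquad N\rightarrow \infty ,
```

since \(d(1/\theta -1)>0\) for \(\theta <1\); here we use only that the \(\mathbb{L }_\theta \)-norm of \([f^{(0)}]^*\) is at least that of this flat part of \(f^{(0)}\).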
Next, in view of (7.21), \(f_w(x)=f^{(0)}(x)\) for any \(x\notin [-(N-4)/(4\varkappa ), \;(N-4)/(4\varkappa )]^d\), which also implies
It remains to note that in the tail zone the parameter \(N\) is chosen so that \(N=N(n)\rightarrow \infty \) as \(n\rightarrow \infty \). This completes the proof of (7.49). \(\square \)
References
Akakpo, N.: Adaptation to anisotropy and inhomogeneity via dyadic piecewise polynomial selection. Math. Methods Stat. 21, 1–28 (2012)
Birgé, L.: Model selection for density estimation with \({\mathbb{L}}_2\)-loss. arXiv:0808.1416v2, http://arxiv.org (2008)
Bretagnolle, J., Huber, C.: Estimation des densités: risque minimax. Z. Wahrsch. Verw. Gebiete 47, 119–137 (1979)
Delyon, B., Juditsky, A.: On minimax wavelet estimators. Appl. Comput. Harmon. Anal. 3, 215–228 (1996)
Devroye, L., Györfi, L.: Nonparametric Density Estimation: The \({\mathbb{L}}_1\) View. Wiley, New York (1985)
Devroye, L., Lugosi, G.: A universally acceptable smoothing factor for kernel density estimation. Ann. Stat. 24, 2499–2512 (1996)
Devroye, L., Lugosi, G.: Nonasymptotic universal smoothing factors, kernel complexity and Yatracos classes. Ann. Stat. 25, 2626–2637 (1997)
Devroye, L., Lugosi, G.: Combinatorial Methods in Density Estimation. Springer, New York (2001)
Donoho, D.L., Johnstone, I.M., Kerkyacharian, G., Picard, D.: Density estimation by wavelet thresholding. Ann. Stat. 24, 508–539 (1996)
Efroimovich, S.Yu.: Non-parametric estimation of the density with unknown smoothness. Theory Probab. Appl. 30, 557–568 (1986)
Efroimovich, S.Yu.: Adaptive estimation of and oracle inequalities for probability densities and characteristic functions. Ann. Stat. 36, 1127–1155 (2008)
Folland, G.B.: Real Analysis, 2nd edn. Wiley, New York (1999)
Garling, D.J.H.: Inequalities: A Journey Into Linear Analysis. Cambridge University Press, Cambridge (2007)
Goldenshluger, A., Lepski, O.: Uniform bounds for norms of sums of independent random functions. Ann. Probab. 39, 2318–2384 (2011)
Goldenshluger, A., Lepski, O.: Bandwidth selection in kernel density estimation: oracle inequalities and adaptive minimax optimality. Ann. Stat. 39, 1608–1632 (2011)
Golubev, G.K.: Non-parametric estimation of smooth probability densities. Probl. Inform. Transm. 1, 52–62 (1992)
de Guzmán, M.: Differentiation of Integrals in \(R^n\). With appendices by Antonio Córdoba, and Robert Fefferman, and two by Roberto Moriyón. Lecture Notes in Mathematics, vol. 481. Springer, Berlin (1975)
Hasminskii, R., Ibragimov, I.: On density estimation in the view of Kolmogorov’s ideas in approximation theory. Ann. Stat. 18, 999–1010 (1990)
Ibragimov, I.A., Khasminski, R.Z.: An estimate of the density of a distribution. Zap. Nauchn. Sem. Leningrad. Otdel. Mat. Inst. Steklov. (LOMI) 98, 61–85 (1980) (in Russian)
Ibragimov, I.A., Khasminski, R.Z.: More on estimation of the density of a distribution. Zap. Nauchn. Sem. Leningrad. Otdel. Mat. Inst. Steklov. (LOMI) 108, 72–88 (1981) (in Russian)
Juditsky, A., Lambert-Lacroix, S.: On minimax density estimation on \({\mathbb{R}}\). Bernoulli 10, 187–220 (2004)
Kerkyacharian, G., Picard, D., Tribouley, K.: \(L^p\) adaptive density estimation. Bernoulli 2, 229–247 (1996)
Kerkyacharian, G., Lepski, O., Picard, D.: Nonlinear estimation in anisotropic multi-index denoising. Probab. Theory Relat. Fields 121, 137–170 (2001)
Kerkyacharian, G., Lepski, O., Picard, D.: Nonlinear estimation in anisotropic multiindex denoising. Sparse case. Theory Probab. Appl. 52, 58–77 (2008)
Lepski, O.: Multivariate density estimation under sup-norm loss: oracle approach, adaptation and independence structure. Manuscript (2012)
Mason, D.M.: Risk bounds for kernel density estimators. Zap. Nauchn. Sem. Leningrad. Otdel. Mat. Inst. Steklov. (LOMI) 363, 66–104 (2009). http://www.pdmi.ras.ru/znsl/
Massart, P.: Concentration Inequalities and Model Selection. Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6–23, 2003. Lecture Notes in Mathematics, Vol. 1896. Springer, Berlin (2007)
Nemirovski A.S.: Nonparametric estimation of smooth regression functions. Soviet J. Comput. Systems Sci. 23(6), 1–11 (1985); translated from Izv. Akad. Nauk SSSR Tekhn. Kibernet. 3, 50–60 (1985) (Russian)
Nikol’skii S.M.: Priblizhenie Funktsii Mnogikh Peremennykh i Teoremy Vlozheniya. (in Russian). [Approximation of functions of several variables and imbedding theorems.] 2nd edn, revised and supplemented. Nauka, Moscow (1977)
Reynaud-Bouret, P., Rivoirard, V., Tuleau-Malot, C.: Adaptive density estimation: a curse of support? J. Stat. Plann. Inference 141, 115–139 (2011)
Rigollet, Ph.: Adaptive density estimation using the blockwise Stein method. Bernoulli 12, 351–370 (2006)
Rigollet, Ph., Tsybakov, A.B.: Linear and convex aggregation of density estimators. Math. Methods Stat. 16, 260–280 (2007)
Samarov, A., Tsybakov, A.: Aggregation of density estimators and dimension reduction. Advances in Statistical Modeling and Inference, pp. 233–251, Ser. Biostat., Vol. 3. World Sci. Publ., Hackensack (2007)
Tsybakov, A.: Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, New York (2009)
Additional information
Supported by the ISF Grant No. 104/11.
Appendices
Appendix A: Proofs of auxiliary results of Sect. 5
1.1 A.1 Measurability
Write \(f(x,X^{(n)}):=\hat{f}_{\hat{h}(x)}(x)\), and note that the map \(f:\mathbb{R }^d\times \mathbb{R }^{dn}\rightarrow \mathbb{R }\) is completely determined by the kernel \(K\) and the set \(\mathcal{H }\). We need to show that \(f\) is a Borel function.
Let \(R_h(x,X^{(n)}):=\hat{R}_h(x)\), and note that for every \(h\in \mathcal{H }\), the map \(R_h:\mathbb{R }^d\times \mathbb{R }^{dn}\rightarrow \mathbb{R }\) is a continuous function. This follows from the continuity of the kernel \(K\) and from the fact that \(\mathcal{H }\) is a finite set. The continuity of \(K\) also implies that the map \(f_h:\mathbb{R }^d\times \mathbb{R }^{dn}\rightarrow \mathbb{R }\) is a continuous function for any \(h\in \mathcal{H }\), where \(f_h(x,X^{(n)}):=\hat{f}_h(x)\). Next, denote by \(\mathfrak{B }\) the Borel \(\sigma \)-algebra on \(\mathbb{R }^d\times \mathbb{R }^{dn}\), and let \(b:\mathbb{R }^d\times \mathbb{R }^{dn}\rightarrow \mathcal{H }\) be the function \(b(x,X^{(n)}):=\hat{h}(x)\). We obviously have for any given \(h\in \mathcal{H }\)
where the last inclusion follows from the continuity of \(R_\eta ,\,\eta \in \mathcal{H }\). Here we have also used that \(\mathcal{H }\) is a finite set. It remains to note that
and the required statement follows.
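The final identity expresses the selected estimator through the finitely many continuous maps \(f_h\) and the Borel sets \(\{b=h\}\), schematically \(f=\sum _{h\in \mathcal{H }}f_h\mathbf{1}\{b=h\}\). This representation can be checked numerically; the Gaussian kernel, the risk proxy \(\hat{R}_h\) and the bandwidth grid below are toy stand-ins, not the paper's definitions.

```python
import math
import random

random.seed(0)
H = [0.1, 0.2, 0.4, 0.8]                       # finite bandwidth family
X = [random.gauss(0.0, 1.0) for _ in range(200)]

def f_h(x, h):
    """Kernel density estimate at x with a Gaussian kernel (toy stand-in)."""
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in X) / (
        len(X) * h * math.sqrt(2.0 * math.pi))

def R_h(x, h):
    """Toy pointwise risk proxy: squared distance to the smallest-bandwidth
    estimate plus a variance-like penalty (illustrative only)."""
    return (f_h(x, h) - f_h(x, H[0])) ** 2 + 1.0 / (len(X) * h)

def b(x):
    """Selected bandwidth: minimizer of the risk proxy (ties -> first)."""
    return min(H, key=lambda h: R_h(x, h))

def f_selected(x):
    """Direct evaluation of the selected estimator f_{b(x)}(x)."""
    return f_h(x, b(x))

def f_via_indicators(x):
    """The same map written as sum over h of f_h(x) * 1{b(x) = h}."""
    return sum(f_h(x, h) * (1.0 if b(x) == h else 0.0) for h in H)
```

Because \(\mathcal{H }\) is finite, the two evaluations agree exactly, which is the decomposition underlying the Borel measurability of the selected estimator.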
1.2 A.2 Proof of Lemma 1
\(1^{0}\). Note that \(\check{M}_h(g,x)=4^{-1} \hat{M}_h(g,x)\) and let \(\mathcal{H }_0=\{h\in \mathcal{H }: A_h(g,x)\ge 4\varkappa \ln n/(nV_h)\}\).
For any \(h\in \mathcal{H }_0\) we have
Therefore,
We have for any \(h\in \mathcal{H }_0\)
It yields for any \(h\in \mathcal{H }_0\)
(b). Now consider the set \(\mathcal{H }_1:=\mathcal{H }{\setminus }\mathcal{H }_0\). Here \(A_h(g,x)\le 4\varkappa \ln n/(nV_h)\), and, by definition of \(M_h\) we have
Note that we have \(\hat{M}_h(g,x)\ge \varkappa \ln n /(nV_h)\) for all \(h\). This together with (8.2) shows that
Furthermore, for any \(h\in \mathcal{H }_1\)
Therefore
To get the penultimate inequality we have used that \(\sqrt{|ab|}\le 2^{-1}(|a|+|b|)\). Thus, it is shown that
Relations (8.4), (8.3) and (8.1) imply the statement of the lemma. \(\square \)
1.3 A.3 Proof of Lemma 2
\(1^{0}\). Let \(g:\mathbb{R }^d\rightarrow \mathbb{R }^1\) be a fixed bounded function, and let
With this notation \(\xi _h(x) = \xi _h(K,x)\) and \(\hat{A}_h(g,x)-A_h(g, x) = \xi _h(|g|, x)\). Therefore moment bounds on \(\zeta _1(x),\,\zeta _3(x)\) and \(\zeta _4(x)\) will follow from those on \(\xi _h(g,x)\) with substitution \(g \in \{ K,Q, |K|, |Q|\}\). Since \(M_h(g,x)\) depends on \(g\) only via \(|g|\) and \(\Vert g\Vert _\infty \) [see (2.6)–(2.7)], \(M_h(g, x)=M_h(|g|,x)\), and moment bounds on \(\zeta _1(x)\) and \(\zeta _3(x)\) are identical. The bound on \(\zeta _4(x)\) will follow from bounds on \(\zeta _1(x)\) and \(\zeta _3(x)\) with only one modification: kernel \(K\) should be replaced by \(Q\). As for \(\zeta _2(x)\), \(\xi _{h,\eta }(x)\) cannot be represented in terms of \(\xi _h(g,x)\) with function \(g\) independent of \(h\) and \(\eta \); see (2.3). However, the bounds on \(\zeta _2(x)\) will be obtained similarly with minor modifications. Thus it suffices to bound \(\mathbb{E }_f[\zeta _1(x)]^q\) and \(\mathbb{E }_f[\zeta _2(x)]^q\).
\(2^{0}\). We start with bounding \(\mathbb{E }_f[\zeta _1(x)]^q\). For any \(z>0,\,h\in \mathcal{H }\) and \(q\ge 1\) one has
This inequality follows by integration of the Bernstein inequality and the following bound on the second moment of \(\xi _h(x)\):
We will show that \(\mathbb{E }_f[\zeta _1(x)]^q\) is bounded by the expression appearing on the right hand side of (5.2). In fact, we will prove a stronger inequality. Let for some \(l>0\)
It suffices to show that (5.2) holds when, in the definition of \(\zeta _1(x)\), the quantity \(M_h(K,x)\) is replaced by \(\tilde{M}_h(K,x)\), where
Indeed, since \(n^{-d}\le V_h\le 1\) for any \(h\in \mathcal{H }\), we have that
Therefore \(\tilde{M}_h(K,x) \le M_h(K,x)\) for all \(x\in \mathbb{R }^d\) and \(h\in \mathcal{H }\) provided that
Thus if we establish (5.2) with \(M_h(K,x)\) replaced by \(\tilde{M}_h(K,x)\), the required bound for \(\mathbb{E }_f[\zeta _1(x)]^q\) will be proved.
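For reference, the tail-integration argument behind (8.5) is the standard one; in generic notation, with \(v\) a variance proxy and \(b\) a sup-norm proxy for the corresponding characteristics of \(\xi _h(x)\),

```latex
\mathbb{E }\big [(\xi -z)_{+}^{\,q}\big ]
= q\int _0^{\infty }t^{q-1}\,\mathbb{P }\{\xi \ge z+t\}\,dt,
\qquad
\mathbb{P }\{\xi \ge u\}\le \exp \Big \{-\frac{u^{2}}{2v+\tfrac{2}{3}bu}\Big \},
```

and integrating the Bernstein tail over \(t\) produces an exponential factor in \(z\) together with a \(\Gamma (q+1)\)-type constant.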
We have for any \(h\in \mathcal{H }\)
Furthermore, taking into account that \(A_h(g,x)\le V_h^{-1}\Vert g\Vert _\infty \) for any \(g\), we obtain
Here we have used that \(n\ge 3\). If we set \(z=\lambda _h\) then (8.5) together with the two previous display formulas yields
As mentioned above, under the same conditions inequality (8.6) holds for \(\mathbb{E }_f[\zeta _3(x)]^q\).
As for the moment bound for \(\zeta _4(x)\), in all formulas above \(K\) should be replaced by \(Q\) and \(\mathrm{k}_\infty \) by \(\mathrm{k}^2_\infty \) since \(\Vert Q\Vert _\infty \le \mathrm{k}_\infty ^2\). Specifically, if \(\varkappa \ge \mathrm{k}^2_\infty [d(2q+4)+2l]\) then
\(3^{0}\). Now we turn to bounding \(\mathbb{E }_f[\zeta _2(x)]^q\). We have similarly to (8.5)
Here we have used the following bound on the second moment of \(\xi _{h,\eta }(x)\):
The further proof goes along the same lines as the above proof with the following minor modifications: in all formulas \(\mathrm{k}_\infty \) should be replaced with \(\mathrm{k}_\infty ^2,\,V_{h\vee \eta }\) should be written instead of \(V_h\), and \(\varkappa \) should satisfy \(\varkappa \ge \mathrm{k}^2_\infty [d(2q+4)+2l]\). The statement of the lemma holds with constant \(C_0=2^{d^2+1}\Gamma (q+1)(2\mathrm{k}^2_\infty )^q\). Combining the above bounds we complete the proof. \(\square \)
Appendix B: Proofs of auxiliary results of Sect. 6
1.1 B.1. Proof of Lemma 3
We have
First, we note that \(f(x+uh)-f(x)\) can be represented by the telescopic sum
where we formally set \(h_{d+1}u_{d+1}=0\).
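Explicitly, the telescopic sum reads (a standard reconstruction consistent with the stated convention \(h_{d+1}u_{d+1}=0\)):

```latex
f(x+uh)-f(x)=\sum _{j=1}^{d}\Big [f\big (x_1,\ldots ,x_{j-1},\,x_j+u_jh_j,\ldots ,x_d+u_dh_d\big )
-f\big (x_1,\ldots ,x_{j-1},\,x_j,\,x_{j+1}+u_{j+1}h_{j+1},\ldots ,x_d+u_dh_d\big )\Big ],
```

so that the \(j\)-th summand is an increment in the \(j\)-th coordinate only; the convention makes the \(j=d\) bracket equal to \(f(x_1,\ldots ,x_{d-1},x_d+u_dh_d)-f(x)\).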
Next, for any function \(g:\mathbb{R }^d\rightarrow \mathbb{R }^1\) and \(j=1,\ldots , d\) we have
The last equality follows from the definition of \(\ell \)-th order difference operator (3.1). Thus (9.2) and (9.1) imply that \(B_h(f,x) =\sum _{j=1}^d B_{h,j}(f,x)\), where
Therefore, by the Minkowski inequality for integrals (see, e.g., [12, Section 6.3])
Since \(f\in \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L})\) one has
This proves (6.6). To get (6.7) we first note that the condition \(s\ge 1\) implies \(\tau (p)>0\) and \(\tau _j>0,\,j=1,\ldots ,d\). Then the inequality in (6.7) follows by the same reasoning with \(r_j\) replaced by \(q_j,\,\beta _j\) replaced by \(\gamma _j\) and with the use of embedding (6.2). \(\square \)
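For completeness, the Minkowski inequality for integrals invoked from [12, Section 6.3] states that for \(p\ge 1\) and any jointly measurable \(F\),

```latex
\bigg \Vert \int F(\cdot ,u)\,du\bigg \Vert _p\;\le \;\int \big \Vert F(\cdot ,u)\big \Vert _p\,du ,
```

which is what moves the \(\mathbb{L }_p\)-norm inside the integral over \(u\) defining \(B_{h,j}(f,\cdot )\).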
1.2 B.2. Proof of Proposition 1
By definition of \(J_m\) and \(\mathcal{X }_m\),
and now we bound from above \(|\mathcal{X }_m|\). By definition of \(\mathcal{X }_m\) we have for any \(h\in \mathcal{H }\)
Recall that with the introduced notation
For any \(h\in \mathcal{H }\) we have
By the Chebyshev inequality and (6.12) for any \(h\)
In addition, if \(s\ge 1\) then the Chebyshev inequality and (6.13) yield
In order to prove statements (i)–(iv) of the proposition we bound quantities \(J_{m,1}(h)\) and \(J_{m,2}(h)\) with bandwidth \(h=h[m]\) specified in an appropriate way.
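The Chebyshev inequalities used here are instances of the Markov inequality for the Lebesgue measure: for any measurable \(g\), any \(t>0\) and \(q\ge 1\),

```latex
\mathrm{Leb}\big \{x\in \mathbb{R }^d:\;|g(x)|\ge t\big \}
=\int \mathbf{1}\{|g(x)|\ge t\}\,dx
\le \int \frac{|g(x)|^{q}}{t^{q}}\,dx
=t^{-q}\Vert g\Vert _q^q .
```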
1.2.1 B.2.1. Proof of statements (i) and (ii)
Note that the bound in the statement (i) coincides formally with that of the statement (ii) when \(\theta =1\). This implies that the statement (ii) in the case \(\theta =1\) follows from the statement (i). So, with slight abuse of notation, we will identify the case \(\theta =1\) with the assumption that \(f\) is a probability density.
\(1^{0}\). We start with bounding the term \(J_{m,1}(h)\) on the right hand side of (9.4). Assume that \(h\in \mathcal{H }\) is such that
then by the Chebyshev inequality
where we have taken into account that, for any \(\eta ,\,\Vert A_\eta \Vert _\theta \le R\) if \(f\in \mathbb{G }_\theta (R)\) with \(\theta <1\), and \(\Vert A_\eta \Vert _1\le \mathrm{k}_\infty ^2\). By definition of \(\mathcal{H }\), for any \(\eta \ge h,\,\eta \in \mathcal{H },\) we have \(V_\eta = V_h 2^{k_1+\cdots +k_d}\) for some \(k_1,\ldots ,k_d\ge 1\), which implies that \(\sum _{\eta \ge h} V_\eta ^{-\theta } \le (1-2^{-\theta })^{-d} V_h^{-\theta }\). Thus, we conclude that for any \(h\) satisfying (9.8) one has
\(2^{0}\). Let \(\tilde{h}=(\tilde{h}_1,\ldots ,\tilde{h}_d)\in (0,\infty ]^d\) be given by
where constant \(c_4\) will be specified later. Let us prove that \(\tilde{h}\in [n^{-1},1]^{d}\) for large enough \(n\).
Denote
and remark that \(a>0\). We note also that
If \(b_j\le 0\) then, because \(m\le 0\),
for all large enough \(n\). On the other hand, since \(0\ge m\ge m_0(\theta )\) and \(2^{m_0(\theta )a} \le 2^{a}\hat{c}_1\varkappa \varphi \) by definition of \(m_0(\theta )\),
where we took into account that \(1+b_j a^{-1}>0\) and \(\min _{j=1,\ldots ,d} L_j\ge L_0>0\). Then choosing constant \(c_4\) small enough we have \(\tilde{h}_j\le 1\). Thus we showed that \(\tilde{h}_j\in [n^{-1}, 1]\) for \(j\) such that \(b_j\le 0\).
Now consider the case \(b_j>0\). Here
It remains to note that
in view of the obvious inequality \(1/\beta -\theta /s\ge 1/\beta _j-\theta /(\beta _jr_j)\), which, in its turn, follows from the fact that \(\theta \in (0,1]\). Thus, we have that \(\tilde{h}_j>n^{-1}\) for all large enough \(n\). Furthermore, if \(b_j>0\) then since \(m\le 0\)
for all large enough \(n\). Thus we have shown that \(\tilde{h}\in [n^{-1}, 1]^d\).
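The geometric bound from step \(1^0\), \(\sum _{\eta \ge h}V_\eta ^{-\theta }\le (1-2^{-\theta })^{-d}V_h^{-\theta }\), can be checked numerically on a truncated dyadic grid; the truncation level and the parameter values below are arbitrary.

```python
def dyadic_tail_sum(theta, d, kmax):
    """Sum of (V_eta / V_h)^(-theta) over eta >= h on a dyadic bandwidth grid:
    V_eta = V_h * 2^(k_1+...+k_d), so the sum factorizes into d identical
    one-dimensional geometric sums over k in {0, ..., kmax}."""
    one_dim = sum(2.0 ** (-theta * k) for k in range(kmax + 1))
    return one_dim ** d

def geometric_bound(theta, d):
    """The closed-form bound (1 - 2^(-theta))^(-d)."""
    return (1.0 - 2.0 ** (-theta)) ** (-d)
```

The truncated sum increases to the bound as `kmax` grows, so the constant \((1-2^{-\theta })^{-d}\) is sharp in the limit.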
\(3^{0}\). Now we proceed with bounding \(J_{m,2}(h)\) for a specific choice of \(h=h[m]\), which is defined as follows. Let \(h[m]\in \mathcal{H }\) be such that \(h[m]< \tilde{h}\le 2h[m]\). Let constant \(c_4\) in (9.10) be chosen so that \(c_4<(2\bar{c}_1)^{-1}\), where \(\bar{c}_1\) appears on the right hand side of (6.12). With this choice of \(c_4\) by (6.12)
Therefore, \(J_{m,2}^{(2)}(h[m])=0\), where \(J_{m,2}^{(2)}(\cdot )\) is defined in (9.5). Moreover, we obtain from (9.6) and \(h[m]\le \tilde{h}_j\) that
Note that
This together with (9.9) yields
Then it follows from (9.11) and (9.13) that
which combined with (9.3) results in
Inequality (9.3) is valid only if (9.8) is fulfilled for \(h[m]\), i.e., \(\varkappa \delta V_{h[m]}^{-1} <2^{m-2}\varphi \); now we verify this condition. It is sufficient to check that \(\varkappa \delta 2^{d}V^{-1}_{\tilde{h}}< 2^{m-2}\varphi \). In view of (9.12) this inequality will follow if
Taking into account that \(L_\beta \delta =\varphi ^{2+1/\beta }\) we conclude that (9.8) is fulfilled for \(h[m]\) if
which is ensured by the condition \(m\ge m_0(\theta )\). This completes the proof of (6.24).
1.2.2 B.2.2. Proof of statement (iii)
\(1^{0}\). Let \(\hat{c}_4\) be a constant to be specified later, and let \(c_4\) be the constant given in (9.10). Let \(C_j=c_4\) if \(j\in I_\infty \) and \(C_j=\hat{c}_4\) if \(j\in I{\setminus }I_\infty \). Define \(\tilde{h}=(\tilde{h}_1,\ldots ,\tilde{h}_d)\in (0,\infty ]^d\) by the formula
Note that if \(j\in I_\infty \) the corresponding coordinates of \(\tilde{h}\) given by (9.10) and (9.14) are the same.
Let us show that \(\tilde{h}\in [n^{-1},1]^d\) for large enough \(n\). First consider the coordinates \(\tilde{h}_j\) such that \(1-\frac{s}{r_j}(2+1/\beta )\ge 0\). Because \(m\ge 0\) we have for all \(n\) large enough
where we have used the obvious inequality \(\beta _j/\beta >1\) for any \(j=1,\ldots , d\). On the other hand, because \(2^m\le \hat{c}_2\varphi ^{-1}\) we obtain
Thus \(\tilde{h}_j \le 1\) for large enough \(n\) if \(j\in I{\setminus }I_\infty \), and \(\tilde{h}_j\le 1\) by choice of constant \(c_4\) if \(j\in I_\infty \).
Now consider the case \(1-\frac{s}{r_j}(2+1/\beta )< 0\). Since \(2^m\le \hat{c}_2\varphi ^{-1}\)
for all \(n\) large enough. Here we have used the obvious inequality \(1/s>1/\beta _jr_j\) \(\forall j=1,\ldots ,d\). On the other hand, since \(m\ge 0,\,\tilde{h}_j\le (C_jL_j^{-1}\varphi )^{1/\beta _j}\le 1\) for large enough \(n\). Thus we have proved that \(\tilde{h}\in [n^{-1}, 1]^d\) for all large enough \(n\).
\(2^{0}\). Let \(h[m]\in \mathcal{H }\) be such that \(h[m]< \tilde{h}\le 2h[m]\), and choose the constant \(c_4\) to satisfy \(c_4<(2\bar{c}_1)^{-1}\) [see (6.12)]. Recall that formulas (9.10) and (9.14) coincide for \(j\in I_\infty \). Therefore, as before, with the indicated choice of \(c_4\) we have
Let \(\beta _{\pm }\) and \(\beta _\infty \) be defined by expressions \( 1/\beta _\pm :=\sum _{j\in I_+\cup I_-} 1/\beta _j\) and \(1/\beta _\infty :=\sum _{j\in I_\infty } 1/\beta _j\). We have
This together with \(2^m\le \hat{c}_2\varphi ^{-1}\) shows that \(V_{h[m]}\ge c_{13}\hat{c}_4^{1/\beta _\pm } L^{-1}_\beta \varphi ^{2+1/\beta }=c_{13} \hat{c}_4^{1/\beta _\pm }\delta \) and, therefore,
Remark that \(A_\eta (x) \le 2^{d}M\mathrm{k}_\infty ^2 \) for all \(x\in \mathbb{R }^d\) and \(\eta \in \mathcal{H }\). Hence, in view of (9.17)
It yields together with (9.16)
Setting \(\hat{c}_4\) so that \(c_{15}\hat{c}_4^{-1/(2\beta _\pm )}<2^{-1}\), we obtain \(\sup _{\eta \ge \tilde{h}} M_\eta (x) \le 2^{m-1}\varphi \). This implies that
Moreover, it follows from (9.6) and from inequality \(h[m]\le \tilde{h}\) that
Then (6.25) is a consequence of (9.3), (9.15), (9.18) and (9.19). Statement (iii) is proved.
1.2.3 B.2.3. Proof of statement (iv)
\(1^{0}\). Let \(C_j,\,j=1,\ldots , d\) be the same constants as in the proof of statement (iii) in the previous section. Define \(\tilde{h}=(\tilde{h}_1,\ldots ,\tilde{h}_d)\in (0,\infty ]^d\) by the following formula
where \(\gamma _j,\,q_j\) are defined in (6.1) and \(\gamma ,\,\upsilon \) and \(L_\gamma \) are given in (6.15).
Let us show that \(\tilde{h}\in [n^{-1},1]^d\) for large \(n\). Let \(b_j=1-\frac{\upsilon }{q_j}(2+1/\gamma )\).
First, assume that \(b_j<0\). Since \(m> 0\) and \(2^m\le \hat{c}_2\varphi ^{-1}\),
where we have used the obvious inequality \(1/\upsilon >1/(\gamma _jq_j)\) for any \(j=1,\ldots ,d\). On the other hand, in view of \(m\ge m_1\) and by (6.20)
Then by (6.21)
where the expression for constant \(C_1(\vec {L})\) is easily found. It remains to note that
and in view of (6.16) and (6.17)
This shows that \(\tilde{h}_j \le 1\) for large \(n\).
Now assume that \(b_j\ge 0\). Then, similarly to the reasoning that led to (9.21), we have
Since \(\varphi ^{2+1/\beta }=L_\beta \delta \),
for all \(n\) large enough. Here we have used (6.16), (6.17), and the obvious inequalities \(2+1/\gamma >1/\gamma _j\) and \(1/\upsilon >1/(\gamma _j r_j)\) for all \(j=1,\ldots , d\). On the other hand, since \(2^{m}\le \hat{c}_2\varphi ^{-1}\)
Therefore \(\tilde{h}_j\rightarrow 0\) as \(n\rightarrow \infty \) for all \(j\in I{\setminus }I_\infty \), and \(\tilde{h}_j\le c_{19}(c_4L_0^{-1})^{1/\gamma _j}\) for all \(j\in I_\infty \). Choosing \(c_4\) small enough, we arrive at the required assertion.
\(3^{0}\). Let \(h[m]\in \mathcal{H }\) be such that \(h[m]< \tilde{h}\le 2h[m]\), and let the constant \(c_4\) satisfy \(c_4<(2\bar{c}_1)^{-1}\), where \(\bar{c}_1\) is given in (6.12). With this choice of \(c_4\), if \(j\in I_\infty \) then the corresponding coordinates of \(\tilde{h}\) given by (9.14) and (9.20) coincide. Hence, as before, we have
Let \(\frac{1}{\gamma _\pm }:=\sum _{j\in I_+\cup I_-} \frac{1}{\gamma _j}\); then
We remark that (9.23) and (9.16) coincide up to the change in notation \(\beta _\pm \leftrightarrow \gamma _\pm \). Hence all the computations preceding (9.18) remain valid, and we have as before
Moreover, we obtain from (9.7)
The bound given in (6.26) follows now from (9.3), (9.22), (9.24) and (9.25). \(\square \)
B.3 Proof of Proposition 2
where \(c_0\) and \(c_1\) are appropriate constants, \(\bar{U}_f(x)\) and \(U_f(x)\) are given by (6.8) and (6.10) respectively, and \(\omega (x):=\zeta (x)+\chi (x)\) with \(\zeta (x)\) and \(\chi (x)\) defined in (2.15) and (2.16).
B.3.1. Proof of statement (i)
Here, for brevity, we write \(m_0=m_0(1)\). By (9.26)
Noting that \(\Vert \hat{f}\Vert _1\le \ln ^d(n)\Vert K\Vert _1\le \ln ^d(n) \mathrm{k}_\infty \), we have \(\Vert \hat{f}-f\Vert _1\le \ln ^d(n) \mathrm{k}_\infty +1\) and, therefore,
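The inline bound on \(\Vert \hat{f}-f\Vert _1\) above is nothing more than the triangle inequality combined with the fact that \(f\) is a probability density:

```latex
\bigl\|\hat f-f\bigr\|_{1}
  \le \bigl\|\hat f\bigr\|_{1}+\|f\|_{1}
  \le \ln ^d(n)\,\mathrm{k}_\infty +1,
\qquad\text{since }\ \|f\|_{1}=\int_{\mathbb{R}^d} f(x)\,\mathrm{d}x=1 .
```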
Moreover, since \(\varkappa = \mathrm{k}_\infty ^2 [(4d+2)p+4(d+1)]\), the second statement of Theorem 1 implies
Combining these inequalities and taking into account that \(2^{m_0}\varphi \le 1\) we obtain
By the definition of \(m_0=m_0(1)\), \(2^{m_0}\varphi \le c_6 (L_\beta \delta )^{1/(1+1/\beta -1/s)}\); therefore
It remains to note that for large \(n\)
and (6.27) follows.
B.3.2. Proof of statement (ii)
Let \(f^*\) be the maximal operator of \(f\) defined in (4.1). It follows from the definition of \(M_\eta (x)\) that for any \(h\in \mathcal{H }\)
Moreover, by the definition of \(\bar{B}_h(f,x)\), we have \(\bar{B}_h(f,x) \le c_{9} [f^*(x)+f(x)]\le 2c_{9} f^*(x)\) almost everywhere, where the last inequality follows from the Lebesgue differentiation theorem. Using these two inequalities and setting \(h=(1,\ldots ,1)\) in (6.8), we arrive at the following upper bound on \(\bar{U}_f(x)\)
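For the reader's convenience we recall the standard (Hardy–Littlewood) form of the maximal operator; the definition in (4.1) may differ in normalization (e.g. averages over cubes rather than centered balls), so this is a sketch of the step just used rather than a restatement of (4.1):

```latex
% Hardy--Littlewood maximal function (standard centered-ball form):
f^*(x)=\sup_{r>0}\frac{1}{|B(x,r)|}\int_{B(x,r)}f(y)\,\mathrm{d}y ,
\qquad
f(x)=\lim_{r\to 0}\frac{1}{|B(x,r)|}\int_{B(x,r)}f(y)\,\mathrm{d}y
  \le f^*(x)\quad\text{a.e.},
```

the equality holding almost everywhere by the Lebesgue differentiation theorem, which is exactly how the bound \(f(x)\le f^*(x)\) enters above.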
In view of (6.11) we have that \(\mathcal{X }_{m_0(\theta )}^-\subseteq \mathcal{X }^-:=\{x\in \mathbb{R }^d: \bar{U}_f(x) \le \mathrm{k}_\infty 2^{m_0(\theta )}\varphi \}\); therefore if we put
then
We now bound from above the two terms on the right-hand side of this inequality.
First consider \(\mathbb{E }_fS_1\). By (9.26) for any \(\theta \in (0,1]\) we have
Here we have used that, by (9.27), \(\bar{U}_f(x)\le 2 c_{10}\delta \) for all \(x\in D_1\). Recall that \(\hat{f}(x)=\hat{f}_{\hat{h}(x)}(x)\); therefore, for any \(\theta \in (0,1]\)
Thus, for any \(f\in \mathbb{G }_\theta (R)\),
Furthermore, because \(\varkappa = \mathrm{k}_\infty ^2 [(4d+2)p+4(d+1)]\), by the second statement of Theorem 1
Combining the last two inequalities we obtain
Now we proceed with bounding \(\mathbb{E }_f S_2\). We have
Here (a) follows from the second statement of Theorem 1 and \(\bar{U}_f(x) \le 2^{m_0(\theta )}\varphi \) for \(x\in D_2\), and (b) is valid because \(\bar{U}_f(x)\le 3c_{10}f^*(x)\) for all \(x\in D_2\), see (9.27).
Note that \(\delta ^{1-\frac{1}{1-\theta /s+1/\beta }}(\ln {n})^{\frac{d\theta }{p-\theta }}\rightarrow 0\) as \(n\rightarrow \infty \) since \(\theta \le 1\) and \(1/\beta >1/s\), where the latter inequality follows from \(r\in (1,\infty ]^d\).
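To verify the stated convergence, observe first that the exponent of \(\delta \) is strictly positive:

```latex
% theta <= 1 and 1/beta > 1/s give theta/s <= 1/s < 1/beta, hence
1-\frac{\theta}{s}+\frac{1}{\beta}>1
\quad\Longrightarrow\quad
1-\frac{1}{\,1-\theta/s+1/\beta\,}>0 .
```

Since \(\delta \) tends to zero at a polynomial rate in \(n\) (up to logarithmic factors, as everywhere in this proof), a positive power of \(\delta \) dominates any fixed power of \(\ln n\), and the product indeed vanishes.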
Thus, combining (9.29) and (9.30) with (9.28), we obtain
as claimed. \(\square \)
Cite this article
Goldenshluger, A., Lepski, O. On adaptive minimax density estimation on \(\mathbb{R }^d\). Probab. Theory Relat. Fields 159, 479–543 (2014). https://doi.org/10.1007/s00440-013-0512-1
Keywords
- Density estimation
- Oracle inequality
- Adaptive estimation
- Kernel estimators
- \(\mathbb{L }_p\)-risk
Mathematics Subject Classification (2000)
- 62G05
- 62G20