1 Introduction

Let \(X_{1},\ldots , X_{n}\) be independent copies of a random vector \(X\in \mathbb{R }^d\) having density \(f\) with respect to the Lebesgue measure. We want to estimate \(f\) using the observations \(X^{(n)}=(X_{1},\ldots ,X_n)\). By an estimator we mean any \(X^{(n)}\)-measurable map \({\hat{f}}:(\mathbb{R }^d)^n\rightarrow \mathbb{L }_p(\mathbb{R }^d)\). The accuracy of an estimator \(\hat{f}\) is measured by the \(\mathbb{L }_p\)-risk

$$\begin{aligned} \mathcal{R }^{(n)}_p[\hat{f}, f]:=\left( \mathbb{E }_f \Vert \hat{f}-f\Vert _p^p\right) ^{1/p},\quad p\in [1,\infty ), \end{aligned}$$

where \(\mathbb{E }_f\) denotes expectation with respect to the probability measure \(\mathbb{P }_f\) of the observations \(X^{(n)}=(X_1,\ldots ,X_n)\), and \(\Vert \cdot \Vert _p\), \(p\in [1,\infty )\), is the \(\mathbb{L }_p\)-norm on \(\mathbb{R }^d\). The objective is to construct an estimator of \(f\) with small \(\mathbb{L }_p\)-risk.

In the framework of the minimax approach, the density \(f\) is assumed to belong to a functional class \(\Sigma \), which is specified on the basis of prior information on \(f\). Given a functional class \(\Sigma \), a natural accuracy measure of an estimator \(\hat{f}\) is its maximal \(\mathbb{L }_p\)-risk over \(\Sigma \),

$$\begin{aligned} \mathcal{R }_p^{(n)}[{\hat{f}};\Sigma ] = \sup _{f\in \Sigma } \mathcal{R }_p^{(n)}[{\hat{f}},f]. \end{aligned}$$

The main question is:

  (i)

    how to construct a rate-optimal, or optimal in order, estimator \({\hat{f}}_*\) such that

    $$\begin{aligned} \mathcal{R }_p^{(n)}[\hat{f}_*;\Sigma ] \asymp \phi _n(\Sigma ):=\inf _{\hat{f}} \mathcal{R }_p^{(n)}[\hat{f};\Sigma ],\quad n\rightarrow \infty ? \end{aligned}$$

Here the infimum is taken over all possible estimators. We refer to the outlined problem as the problem of minimax density estimation with \(\mathbb{L }_p\)-loss on the class \(\Sigma \).

Although the minimax approach provides a fair and convenient criterion for comparing different estimators, it lacks some flexibility. Typically \(\Sigma \) is a class of functions determined by a hyper-parameter, say \(\alpha \) (we write \(\Sigma =\Sigma _\alpha \) to indicate explicitly the dependence of the class \(\Sigma \) on the hyper-parameter \(\alpha \)). In general, an estimator that is optimal in order on the class \(\Sigma _\alpha \) is not optimal on a class \(\Sigma _{\alpha ^\prime }\) with \(\alpha ^\prime \ne \alpha \). This fact motivates the following question:

  (ii)

    is it possible to construct an estimator \({\hat{f}}_*\) that is optimal in order on some scale of functional classes \(\{\Sigma _\alpha , \alpha \in A\}\) and not only on one class \(\Sigma _\alpha \)? In other words, is it possible to construct an estimator \({\hat{f}}_*\) such that for any \(\alpha \in A\) one has

    $$\begin{aligned} \mathcal{R }_p^{(n)}[\hat{f}_*; \Sigma _\alpha ] \asymp \phi _n(\Sigma _\alpha ),\quad n\rightarrow \infty ? \end{aligned}$$

We refer to this question as the problem of adaptive minimax density estimation on the scale of classes \(\{\Sigma _\alpha , \alpha \in A\}\).

The minimax and adaptive minimax density estimation with \(\mathbb{L }_p\)-loss is the subject of a vast literature; see, for example, [3, 5–11, 16, 18, 19, 22], [27, Chapter 7], [31], [32, 33] and [2]. It is not our aim here to provide a complete review of the literature on density estimation with \(\mathbb{L }_p\)-loss; below we discuss only results that are directly related to our study. First we review papers dealing with the one-dimensional setting; then we proceed to the multivariate case.

The problem of minimax density estimation on \(\mathbb{R }^1\) with \(\mathbb{L }_p\)-loss, \(p\in [2,\infty )\), was studied by Bretagnolle and Huber [3]. In this paper the functional class \(\Sigma \) is the class of all densities such that \([\Vert f^{(\beta )}\Vert _p \Vert f\Vert _{p/2}^{\beta }]^{1/(2\beta +1)}\le L<\infty \), where \(f^{(\beta )}\) is the generalized derivative of order \(\beta \). It was shown there that

$$\begin{aligned} \phi _n(\Sigma )\asymp n^{-\frac{1}{2+1/\beta }},\quad \forall p\in [2,\infty ). \end{aligned}$$

Note that the same parameter \(p\) appears in the definitions of the risk and of the functional class.

The problem of adaptive minimax density estimation on a compact interval of \(\mathbb{R }^1\) with \(\mathbb{L }_p\)-loss was addressed in [9]. In that paper the class \(\Sigma \) is the Besov functional class \(\mathbb{B }^\beta _{r\theta }(L)\), where the parameter \(\beta \) stands for the regularity index, and \(r\) is the index of the norm in which the regularity is measured. It is shown there that there is an elbow in the rates of convergence for the minimax risk according to whether \(p\le r(2\beta +1)\) (called in the literature the dense zone) or \(p\ge r(2\beta +1)\) (the sparse zone). In particular,

$$\begin{aligned} \phi _n\left( \mathbb{B }^\beta _{r\theta }(L)\right) \ge \left\{ \begin{array}{l@{\quad }l} n^{-\frac{1}{2+1/\beta }}, &{} p\le r(2\beta +1),\\ (\ln n/n)^{\frac{1-1/(\beta r)+1/(\beta p)}{2-2/(\beta r)+1/\beta }}, &{}p\ge r(2\beta +1). \end{array} \right. \end{aligned}$$
(1.1)

Donoho et al. [9] develop a wavelet-based hard-thresholding estimator that achieves the indicated rates (up to a \(\ln n\)-factor in the dense zone) for a scale of Besov classes \(\mathbb{B }^\beta _{r\theta }(L)\) under the additional assumption \(\beta r>1\).

It is quite remarkable that if the assumption that the underlying density has compact support is dropped, then the minimax risk behavior becomes completely different. Specifically, Juditsky and Lambert-Lacroix [21] studied the problem of adaptive minimax density estimation on \(\mathbb{R }^1\) with \(\Sigma \) being the Hölder class \(\mathbb{N }_{\infty ,1}(\beta ,L)\). Their results are in striking contrast with those of Donoho et al. [9]: it is shown that

$$\begin{aligned} \phi _n\left( \mathbb{N }_{\infty ,1}(\beta ,L)\right) \ge \left\{ \begin{array}{l@{\quad }l} n^{-\frac{1}{2+1/\beta }}, &{} p>2+1/\beta ,\\ n^{-\frac{1-1/p}{1+1/\beta }}, &{} 1\le p\le 2+1/\beta . \end{array}\right. \end{aligned}$$

Juditsky and Lambert-Lacroix [21] develop a wavelet-based estimator that achieves the indicated rates up to a logarithmic factor on a scale of Hölder classes. Note that if the aforementioned results of Donoho et al. [9] for densities with compact support are applied to the Hölder class (\(r=\infty \)), then the rate is \(n^{-1/(2+1/\beta )}\) for any \(p\ge 1\). Thus, the rate corresponding to the zone \(1\le p\le 2+1/\beta \) does not appear in the case of compactly supported densities.

In a recent paper, Reynaud-Bouret et al. [30] consider the problem of adaptive density estimation on \(\mathbb{R }^1\) with \(\mathbb{L }_2\)-loss on the Besov classes \(\mathbb{B }_{r\theta }^\beta (L)\). It is shown there that

$$\begin{aligned} \phi _n\left( \mathbb{B }_{r\theta }^\beta (L)\right) \ge \left\{ \begin{array}{l@{\quad }l} n^{-\frac{1}{2+1/\beta }}, &{} 2/(2\beta +1)<r\le 2,\\ n^{-\frac{1/2}{1-1/(\beta r)+1/\beta }}, &{} r>2. \end{array} \right. \end{aligned}$$

They also propose a wavelet-based estimator that achieves the indicated rates up to a logarithmic factor for a scale of Besov classes under the additional assumption \(2\beta r>2-r\). It follows from Donoho et al. [9] that if \(p=2\) and the density is compactly supported, then the corresponding rates are \(\phi _n(\Sigma )\asymp n^{-1/(2+1/\beta )}\) for all \(r\ge 2/(2\beta +1)\). Hence the rate corresponding to the zone \(r>2,\,p=2\), does not appear in the case of compactly supported densities.

As for the multivariate setting, Ibragimov and Khasminskii in a series of papers [18, 19] studied the problem of minimax density estimation with \(\mathbb{L }_p\)-loss on \(\mathbb{R }^{d}\). Together with some classes of infinitely differentiable densities, they considered the anisotropic Nikol’skii classes \(\Sigma =\mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L})\), where \(\vec {\beta }=(\beta _1,\ldots ,\beta _d),\,\vec {r}=(r_1,\ldots ,r_d)\) and \(\vec {L}=(L_1,\ldots ,L_d)\) (for the precise definition see Sect. 3.1). It was shown that if \(r_i=p\) for all \(i=1,\ldots ,d\) then

$$\begin{aligned} \phi _n\left( \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L})\right) \asymp \left\{ \begin{array}{l@{\quad }l} n^{-\frac{1-1/p}{1-1/(\beta p)+1/\beta }}, &{} p\in [1,2),\\ n^{-\frac{1}{2+1/\beta }}, &{} p\in [2,\infty ). \end{array} \right. \end{aligned}$$
(1.2)

Here \(\beta \) is the parameter defined by the relation \(1/\beta =\sum _{j=1}^d 1/\beta _j\). It should be stressed that in the cited papers the same norm index \(p\) is used in the definitions of the risk and of the functional class. We also refer to the recent paper by Mason [26], where further discussion of these results can be found.

Delyon and Juditsky [4] generalized the results of Donoho et al. [9] to minimax density estimation on a bounded interval of \(\mathbb{R }^d,\,d\ge 1\), over a collection of isotropic Besov classes. In particular, they showed that the minimax rates of convergence given by (1.1) hold with \(1/(\beta r)\) and \(1/\beta \) replaced by \(d/(\beta r)\) and \(d/\beta \), respectively. Comparing the rates in (1.2) with the asymptotics of the minimax risk found in [4] with \(r=p\), we conclude that the rate in (1.2) in the zone \(p\in [1,2)\) does not appear for compactly supported densities.

Recently Goldenshluger and Lepski [15] developed an adaptive minimax estimator over a scale of classes \(\mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L})\); in particular, if \(r_i=p\) for all \(i=1,\ldots , d\) then their estimator attains the minimax rates indicated in (1.2). Note that in the considered setting the norm indexes in the definitions of the risk and the functional class coincide.

The results discussed above show that there is an essential difference between the problems of density estimation on the whole space and on a compact interval. The literature on density estimation on the whole space is quite fragmented, and the relationships between the aforementioned results are yet to be understood. These relationships become even more complex and interesting in the multivariate setting, where the density to be estimated belongs to a functional class with anisotropic and inhomogeneous smoothness. The problem of minimax estimation under \(\mathbb{L }_p\)-loss over homogeneous Sobolev \(\mathbb{L }_q\)-balls \((q\ne p)\) was initiated in [28] in the regression model on the unit cube of \(\mathbb{R }^d\). Functional classes with anisotropic and inhomogeneous smoothness were first considered in [23, 24] for the Gaussian white noise model on a compact subset of \(\mathbb{R }^d\). In the density estimation model, [1] studied the case \(p=2\) for compactly supported densities on \([0,1]^d\).

To the best of our knowledge, the problem of estimating a multivariate density from anisotropic and inhomogeneous functional classes on \(\mathbb{R }^d\) has not been considered in the literature. This problem is the subject of the current paper. Our results cover the existing ones and generalize them in the following directions.

  1.

    We fully characterize the behavior of the minimax risk for all possible relationships between the regularity parameters and the norm indexes in the definitions of the functional class and of the risk. In particular, we discover that there are four different regimes with respect to the minimax rates of convergence: the tail, dense and sparse zones, the last of which is subdivided into two regions. The existence of these regimes is not a consequence of the multivariate nature of the problem or of the considered functional classes; in fact, these regimes appear already in dimension one. Thus our results reveal all possible zones with respect to the rates of convergence in the problem of density estimation on \(\mathbb{R }^d\) and explain the different results on rates of convergence in the existing literature. In particular, the results in [21, 30] pertain to the rates of convergence in the tail and dense zones, while those in [4, 9] correspond to the dense zone and to a subregion of the sparse zone.

  2.

    We propose an estimator based upon a data-driven selection from a family of kernel estimators, and establish for it a point-wise oracle inequality. We then use this inequality to derive bounds on the \(\mathbb{L }_p\)-risk over a collection of the Nikol’skii functional classes. Since the construction of our estimator does not use any prior information on the class parameters, it is adaptive minimax over a scale of these classes. Moreover, we believe that the method of deriving \(\mathbb{L }_p\)-risk bounds from point-wise oracle inequalities employed in the proof of Theorem 2 is of interest in its own right. It is quite general and can be applied to other nonparametric estimation problems.

  3.

    Another issue studied in the present paper is related to the existence of the tail zone. This zone does not exist in the problem of estimating compactly supported densities. A natural question then arises: what is a general condition on \(f\) that ensures the same asymptotics of the minimax risk on \(\mathbb{R }^d\) as in the case of compactly supported densities? We propose a tail dominance condition and show that, in a sense, it is the weakest possible condition under which the tail zone disappears. We also show that this condition guarantees the existence of a consistent estimator under \(\mathbb{L }_1\)-loss. Recall that smoothness alone is not sufficient to guarantee consistency of density estimators in \(\mathbb{L }_1(\mathbb{R }^d)\) (see [20]).

The paper is structured as follows. In Sect. 2 we define our estimation procedure and derive the corresponding point-wise oracle inequality. Section 3 presents upper and lower bounds on the minimax risk; we also discuss the obtained results and relate them to the existing literature. The same estimation problem under the tail dominance condition is studied in Sect. 4. Sections 5–7 contain the proofs of Theorems 1–4; proofs of auxiliary results are relegated to Appendices A and B.

The following notation and conventions are used throughout the paper. For vectors \(u, v\in \mathbb{R }^d\) the operations \(u/v,\,u\vee v,\,u\wedge v\) and inequalities such as \(u\le v\) are all understood in the coordinate-wise sense. For instance, \(u\vee v =(u_1\vee v_1,\ldots , u_d\vee v_d)\). All integrals are taken over \(\mathbb{R }^d\) unless the domain of integration is specified explicitly. For a Borel set \(\mathcal{A }\subset \mathbb{R }^d\) symbol \(|\mathcal{A }|\) stands for the Lebesgue measure of \(\mathcal{A }\); if \(\mathcal{A }\) is a finite set, \(|\mathcal{A }|\) denotes the cardinality of \(\mathcal{A }\).

2 Estimation procedure and point-wise oracle inequality

In this section we define our estimation procedure and derive an upper bound on its point-wise risk.

2.1 Estimation procedure

Our estimation procedure is based on data-driven selection from a family of kernel estimators. The family of estimators is defined as follows.

2.1.1 Family of kernel estimators

Let \(K:[-1/2,1/2]^d\rightarrow \mathbb{R }^1\) be a fixed kernel such that \(K\in \mathbb{C }(\mathbb{R }^d),\,\int K(x)\,\mathrm{d}x=1\), and \(\Vert K\Vert _\infty <\infty \). Let

$$\begin{aligned} \mathcal{H }=\left\{ h=(h_1,\ldots ,h_d)\in (0,1]^d: \;h_j=2^{-k_j}, k_j=0,\ldots , \log _2n,\;j=1,\ldots ,d\right\} ; \end{aligned}$$

without loss of generality we assume that \(\log _2n\) is an integer.

Given a bandwidth \(h\in \mathcal{H }\), define the corresponding kernel estimator of \(f\) by the formula

$$\begin{aligned} \hat{f}_h (x):= \frac{1}{nV_h} \sum _{i=1}^n K\left( \frac{X_i-x}{h}\right) = \frac{1}{n}\sum _{i=1}^n K_h(X_i-x), \end{aligned}$$
(2.1)

where \(V_h:=\prod _{j=1}^d h_j,\,K_h(\cdot ):=(1/V_h)K(\cdot /h)\). Consider the family of kernel estimators

$$\begin{aligned} \mathcal{F }(\mathcal{H }):=\{\hat{f}_h, h\in \mathcal{H }\}. \end{aligned}$$

The proposed estimation procedure is based on data-driven selection of an estimator from \(\mathcal{F }(\mathcal{H })\).
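For illustration, the dyadic bandwidth set \(\mathcal{H }\) and the estimators (2.1) can be coded directly. The sketch below is ours, not part of the paper's procedure; in particular it uses a discontinuous box kernel for brevity, whereas the paper requires \(K\in \mathbb{C }(\mathbb{R }^d)\).

```python
import numpy as np

def dyadic_grid(n, d):
    """The set H: all h with h_j = 2^{-k_j}, k_j = 0, ..., log2(n)."""
    ks = 2.0 ** (-np.arange(int(np.log2(n)) + 1))
    grids = np.meshgrid(*([ks] * d), indexing="ij")
    return np.stack([g.ravel() for g in grids], axis=1)   # shape (|H|, d)

def kernel_estimate(X, x, h, K):
    """Kernel estimator (2.1): (1/(n V_h)) * sum_i K((X_i - x)/h)."""
    n, V_h = X.shape[0], np.prod(h)
    return np.sum(K((X - x) / h)) / (n * V_h)

def K_box(u):
    """Box kernel on [-1/2, 1/2]^d (a toy stand-in for the continuous K of Sect. 3.2)."""
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)
```

For \(n=8\) and \(d=2\), say, the grid \(\mathcal{H }\) has \((\log _2 n+1)^d=16\) elements.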

2.1.2 Auxiliary estimators

Our selection rule uses auxiliary estimators that are constructed as follows. For any pair \(h,\eta \in \mathcal{H }\) define the kernel \(K_h*K_\eta \) by the formula \( [K_h*K_\eta ](t)=\int K_h(t-y) K_\eta (y)\,\mathrm{d}y\). Let \(\hat{f}_{h,\eta }(x)\) denote the estimator associated with this kernel:

$$\begin{aligned} \hat{f}_{h,\eta }(x)=\frac{1}{n}\sum _{i=1}^n K_{h,\eta }(X_i-x),\quad K_{h,\eta }=K_h*K_\eta . \end{aligned}$$

The following representation of kernels \(K_{h,\eta }\) will be useful: for any \(h,\eta \in \mathcal{H }\)

$$\begin{aligned}{}[K_h*K_\eta ](t) = \frac{1}{V_{h\vee \eta }} Q_{h,\eta }\left( \frac{t}{h\vee \eta }\right) , \end{aligned}$$
(2.2)

where function \(Q_{h,\eta }\) is given by the formula

$$\begin{aligned} Q_{h,\eta }(t)=\int K\left( v(y,t-\nu y)\right) K\left( v(t-\nu y,y)\right) \,\mathrm{d}y,\quad \nu :=\frac{h\wedge \eta }{h\vee \eta }. \end{aligned}$$
(2.3)

Here function \(v:\mathbb{R }^d\times \mathbb{R }^d \rightarrow \mathbb{R }^d\) is defined by

$$\begin{aligned} v_j(y,z)=\left\{ \begin{array}{l@{\quad }l} y_j, &{} h_j\le \eta _j,\\ z_j , &{} h_j>\eta _j, \end{array}\! \right. ,\quad j=1,\ldots ,d. \end{aligned}$$

The representation (2.2)–(2.3) is obtained by a straightforward change of variables in the convolution integral (see the proof of Lemma 12 in [14]). We also note that \(\mathrm{supp}(Q_{h,\eta })\subseteq [-1,1]^d\) and \(\Vert Q_{h,\eta }\Vert _\infty \le \Vert K\Vert _\infty ^2\) for all \(h,\eta \). In the special case where \(K(t)=\prod _{i=1}^d k(t_i)\) for some univariate kernel \(k:[-1/2, 1/2]\rightarrow \mathbb{R }^1\) we have

$$\begin{aligned} Q_{h,\eta } (t) =\prod _{i=1}^d \int k(t_i-\nu _i u_i)k(u_i)\,\mathrm{d}u_i,\quad \nu _i=(h_i\wedge \eta _i)/(h_i\vee \eta _i). \end{aligned}$$

We also define

$$\begin{aligned} Q(t) = \sup _{h,\eta \in \mathcal{H }}\left| \int K\left( v(y,t-\nu y)\right) K\left( v(t-\nu y,y)\right) \,\mathrm{d}y\right| , \end{aligned}$$

and note that \(\mathrm{supp}(Q)\subseteq [-1,1]^d\), and \(\Vert Q\Vert _\infty \le \Vert K\Vert _\infty ^2\).
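For a product kernel, the coordinate factors of \(Q_{h,\eta }\) can be evaluated by one-dimensional quadrature. The sketch below uses our own illustrative choices (box kernel, midpoint rule); for \(\nu _i=1\), i.e. \(h_i=\eta _i\), it reproduces the triangle kernel, consistent with \(\mathrm{supp}(Q_{h,\eta })\subseteq [-1,1]^d\) and \(\Vert Q_{h,\eta }\Vert _\infty \le \Vert K\Vert _\infty ^2\).

```python
import numpy as np

def q_factor(t, nu, k, m=4001):
    """One coordinate of Q_{h,eta} for a product kernel:
    the integral of k(t - nu*u) * k(u) over u in supp(k) = [-1/2, 1/2]."""
    u = (np.arange(m) + 0.5) / m - 0.5   # midpoint rule on [-1/2, 1/2]
    return np.sum(k(t - nu * u) * k(u)) / m

def k_box(u):
    """Box kernel on [-1/2, 1/2] (illustrative; the paper's kernel is continuous)."""
    return (np.abs(u) <= 0.5).astype(float)
```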

2.1.3 Stochastic errors of kernel estimators and their majorants

Uniform moment bounds on stochastic errors of kernel estimators \(\hat{f}_h(x)\) and \(\hat{f}_{h,\eta }(x)\) will play an important role in the construction of our selection rule. Let

$$\begin{aligned} \begin{aligned} \xi _{h}(x)&= \frac{1}{n}\sum _{i=1}^n K_h(X_i-x) - \int K_h(t-x)f(t)\,\mathrm{d}t,\\ \xi _{h, \eta }(x)&= \frac{1}{n}\sum _{i=1}^n K_{h,\eta }(X_i-x) - \int K_{h,\eta }(t-x)f(t)\,\mathrm{d}t \end{aligned} \end{aligned}$$
(2.4)

denote the stochastic errors of \(\hat{f}_h\) and \(\hat{f}_{h,\eta }\), respectively. In order to construct our selection rule we need uniform upper bounds (majorants) on \(\xi _h\) and \(\xi _{h,\eta }\), i.e., functions \(M_h\) and \(M_{h,\eta }\) such that the moments of the random variables

$$\begin{aligned} \sup _{h\in \mathcal{H }}\left[ |\xi _h(x)|- M_h(x)\right] _+,\quad \sup _{h,\eta \in \mathcal{H }}\left[ |\xi _{h,\eta }(x)| - M_{h,\eta }(x)\right] _+ \end{aligned}$$
(2.5)

are “small” for each \(x\in \mathbb{R }^d\). We will also be interested in the integrability properties of these moments.

It turns out that the majorants \(M_h(x)\) and \(M_{h,\eta }(x)\) can be defined in the following way. For a function \(g:\mathbb{R }^d\rightarrow \mathbb{R }^1\) let

$$\begin{aligned} A_h(g,x)= \int |g_h(t-x)| f(t)\,\mathrm{d}t,\quad g_h(\cdot )=V^{-1}_h g(\cdot /h),\quad h\in \mathcal{H }. \end{aligned}$$
(2.6)

Now define

$$\begin{aligned} M_h(g,x)=\sqrt{\frac{\varkappa A_h(g,x)\ln n}{nV_h}} + \frac{\varkappa \ln n}{nV_h}, \end{aligned}$$
(2.7)

where \(\varkappa \) is a positive constant to be specified. In Lemma 2 in Sect. 5 we show that, under an appropriate choice of the parameter \(\varkappa \), the functions

$$\begin{aligned} M_h(x):= M_h(K,x),\quad M_{h,\eta }(x):= M_{h\vee \eta }(Q, x) \end{aligned}$$
(2.8)

uniformly majorize \(\xi _h\) and \(\xi _{h,\eta }\) in the sense that the moments of the random variables in (2.5) are “small”.

It should be noted, however, that the functions \(M_h(x)\) and \(M_{h,\eta }(x)\) given by (2.8) cannot be used directly in the construction of the selection rule because they depend on the unknown density \(f\) to be estimated. We will use empirical counterparts of \(M_h(x)\) and \(M_{h,\eta }(x)\) instead.

For \(g:\mathbb{R }^d\rightarrow \mathbb{R }^1\) we let

$$\begin{aligned} {\hat{A}}_h(g, x)=\frac{1}{n}\sum _{i=1}^n |g_h(X_i-x)|, \end{aligned}$$

and define

$$\begin{aligned} {\hat{M}}_h(g,x)= 4\sqrt{\frac{\varkappa {\hat{A}}_h(g,x)\ln n}{nV_h}} + \frac{4\varkappa \ln n}{nV_h}. \end{aligned}$$
(2.9)
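In code, \({\hat{A}}_h\) and the empirical majorant (2.9) read as follows (an illustrative transcription; the function names are ours):

```python
import numpy as np

def A_hat(X, x, h, g):
    """Empirical counterpart of A_h(g, x): (1/n) * sum_i |g_h(X_i - x)|,
    where g_h(.) = V_h^{-1} g(./h)."""
    return np.mean(np.abs(g((X - x) / h))) / np.prod(h)

def M_hat(X, x, h, g, kappa):
    """Empirical majorant (2.9): 4*sqrt(kappa * A_hat * ln(n)/(n V_h)) + 4*kappa*ln(n)/(n V_h)."""
    n, V_h = X.shape[0], np.prod(h)
    lam = kappa * np.log(n) / (n * V_h)
    return 4.0 * np.sqrt(A_hat(X, x, h, g) * lam) + 4.0 * lam
```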

2.1.4 Selection rule and final estimator

Now we are in a position to define our selection rule. For every \(x\in \mathbb{R }^d\) let

$$\begin{aligned} {\hat{R}}_h(x)&= \sup _{\eta \in \mathcal{H }}\left[ |{\hat{f}}_{h,\eta }(x)-{\hat{f}}_\eta (x)| - {\hat{M}}_{h\vee \eta }(Q, x)-{\hat{M}}_\eta (K, x)\right] _+ \nonumber \\&+\sup _{\eta \ge h} {\hat{M}}_{\eta }(Q, x) + {\hat{M}}_h(K,x), \quad h\in \mathcal{H }. \end{aligned}$$
(2.10)

The selected bandwidth \({\hat{h}}(x)\) and the corresponding estimator are defined by

$$\begin{aligned} {\hat{h}}(x)=\mathrm{arg}\inf _{h\in \mathcal{H }} {\hat{R}}_h(x),\quad {\hat{f}}(x)={\hat{f}}_{{\hat{h}}(x)}(x),\quad x\in \mathbb{R }^d. \end{aligned}$$
(2.11)

Note that the estimation procedure is completely determined by the family of kernel estimators \(\mathcal{F }(\mathcal{H })\) and by the constant \(\varkappa \) appearing in the definition of \({\hat{M}}_h\).

We have to ensure that the map \(x\mapsto \hat{f}_{\hat{h}(x)}(x)\) is an \(X^{(n)}\)-measurable Borel function. This follows from continuity of \(K\) and the fact that \(\mathcal{H }\) is a discrete set; for details see Appendix A, Sect. A.1.

The main idea behind the construction of the selection procedure (2.10)–(2.11) is the following. The expression \({\hat{M}}_{h\vee \eta }(Q, x)+{\hat{M}}_\eta (K, x)\) appearing in the square brackets in (2.10) dominates, with high probability, the stochastic part of the difference \(|{\hat{f}}_{h,\eta }(x)-{\hat{f}}_\eta (x)|\). Consequently, the first term on the right-hand side of (2.10) serves as a proxy for the deterministic part of \(|{\hat{f}}_{h,\eta }(x)-{\hat{f}}_\eta (x)|\), which is the absolute value of the difference of the biases of the kernel estimators \({\hat{f}}_{h,\eta }(x)\) and \({\hat{f}}_\eta (x)\). The latter, in turn, is closely related to the bias of the estimator \({\hat{f}}_h(x)\). Thus, the first term on the right-hand side of (2.10) is a proxy for the bias of \({\hat{f}}_h(x)\), while the second term is an upper bound on the standard deviation of \({\hat{f}}_h(x)\).
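Structurally, (2.10)–(2.11) is a Lepski-type comparison of estimators across bandwidths. The skeleton below sketches the rule at a fixed point \(x\), with the estimates and empirical majorants supplied as callables; all names are ours and the numerical ingredients are assumed to be given.

```python
import numpy as np

def select_bandwidth(H, f_hat, f_aux, M_K, M_Q):
    """Selection rule (2.10)-(2.11) at a fixed point x.
    H     : list of bandwidth vectors
    f_hat : h -> kernel estimate f_hat_h(x)
    f_aux : (h, eta) -> auxiliary estimate f_hat_{h,eta}(x)
    M_K   : h -> empirical majorant M_hat_h(K, x)
    M_Q   : h -> empirical majorant M_hat_h(Q, x), evaluated at h v eta."""
    def R_hat(h):
        # first term of (2.10): proxy for the bias of f_hat_h(x)
        bias_proxy = max(
            max(abs(f_aux(h, eta) - f_hat(eta))
                - M_Q(np.maximum(h, eta)) - M_K(eta), 0.0)
            for eta in H
        )
        # second term: upper bound on the stochastic error of f_hat_h(x)
        noise = max(M_Q(eta) for eta in H if np.all(eta >= h)) + M_K(h)
        return bias_proxy + noise
    h_star = min(H, key=R_hat)
    return h_star, f_hat(h_star)
```

When all estimates agree (zero bias signal), the rule simply picks the bandwidth with the smallest majorant, as one would expect.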

2.2 Point-wise oracle inequality

Let \(B_h(f,t)\) be the bias of the kernel estimator \({\hat{f}}_h(t)\),

$$\begin{aligned} B_h(f,t)=\int K_h (y-t) f(y)\,\mathrm{d}y -f(t), \end{aligned}$$
(2.12)

and define

$$\begin{aligned} {\bar{B}}_h(f,x)= |B_h(f,x)|\,\vee \,\sup _{\eta \in \mathcal{H }} \left| \int K_\eta (t-x) B_h(f,t)\,\mathrm{d}t\right| . \end{aligned}$$
(2.13)

Theorem 1

For any \(x\in \mathbb{R }^d\) one has

$$\begin{aligned} |{\hat{f}}(x)-f(x)|&\le \inf _{h\in \mathcal{H }} \left\{ 4{\bar{B}}_h(f,x)+ 60 \sup _{\eta \ge h} M_\eta (Q, x)+ 61M_h(K,x)\right\} \nonumber \\&\quad +7\zeta (x)\; +\; 18\chi (x), \end{aligned}$$
(2.14)

where

$$\begin{aligned} \zeta (x)&:=\sup _{h\in \mathcal{H }} [|\xi _h(x)|- M_h(K,x)]_+ \;\vee \; \sup _{h,\eta \in \mathcal{H }}[|\xi _{h,\eta }(x)| - M_{h\vee \eta }(Q,x)]_+,\qquad \end{aligned}$$
(2.15)
$$\begin{aligned} \chi (x)&:= \max _{g\in \{K,Q\}}\sup _{h\in \mathcal{H }}\left[ |\hat{A}_h(g,x)-A_h(g,x)|- M_h(g,x)\right] _+. \end{aligned}$$
(2.16)

Furthermore, for any \(q\ge 1\), if \(\varkappa \ge [\Vert K\Vert _\infty \vee 1]^2 [(4d+2)q+4(d+1)]\) then

$$\begin{aligned} \int \mathbb{E }_f\left\{ [\zeta (x)]^q +[\chi (x)]^q\right\} \,\mathrm{d}x \;\le \; C n^{-q/2}, \quad \forall n\ge 3, \end{aligned}$$
(2.17)

where \(C\) is a constant depending only on \(d,\,q\) and \(\Vert K\Vert _\infty \).

We remark that Theorem 1 does not require any conditions on the estimated density \(f\).

3 Adaptive estimation over anisotropic Nikol’skii classes

In this section we study properties of the estimator defined in (2.10)–(2.11). The point-wise oracle inequality of Theorem 1 is the key technical tool for bounding \(\mathbb{L }_p\)-risk of this estimator on the anisotropic Nikol’skii classes.

3.1 Anisotropic Nikol’skii classes

Let \((e_1,\ldots ,e_d)\) denote the canonical basis of \(\mathbb{R }^d\). For a function \(g:\mathbb{R }^d\rightarrow \mathbb{R }^1\) and a real number \(u\in \mathbb{R }\), define the first-order difference operator with step size \(u\) in the direction of the variable \(x_j\) by

$$\begin{aligned} \Delta _{u,j}g (x)=g(x+ue_j)-g(x),\quad j=1,\ldots ,d. \end{aligned}$$

By induction, the \(k\)-th order difference operator with step size \(u\) in direction of the variable \(x_j\) is defined as

$$\begin{aligned} \Delta _{u,j}^kg(x)= \Delta _{u,j} \Delta _{u,j}^{k-1} g(x) = \sum _{l=1}^k (-1)^{l+k}\left( {\begin{array}{c}k\\ l\end{array}}\right) \Delta _{ul,j}g(x). \end{aligned}$$
(3.1)
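The expansion of \(\Delta _{u,j}^{k}\) in terms of first-order differences in (3.1) can be checked directly; the sketch below compares the two sides in dimension one (function names are ours).

```python
from math import comb, isclose

def diff_k(g, x, u, k):
    """k-th order difference: sum_{l=0}^{k} (-1)^{k-l} C(k,l) g(x + l*u)."""
    return sum((-1) ** (k - l) * comb(k, l) * g(x + l * u) for l in range(k + 1))

def diff_k_first_order(g, x, u, k):
    """Right-hand side of (3.1): sum_{l=1}^{k} (-1)^{l+k} C(k,l) [g(x+l*u) - g(x)]."""
    return sum((-1) ** (l + k) * comb(k, l) * (g(x + l * u) - g(x))
               for l in range(1, k + 1))
```

Both expressions agree, and \(\Delta _{u,j}^{k}\) annihilates polynomials of degree \(<k\), which is what drives the bias bounds over the Nikol’skii classes.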

Definition 1

For given real numbers \(\vec {r}\!=\!(r_1,\ldots ,r_d),\,r_j\!\in \! [1,\infty ],\,\vec {\beta }\!=\!(\beta _1,\ldots ,\beta _d), \beta _j>0\), and \(\vec {L}\!=\!(L_1,\ldots , L_d),\,L_j>0,\,j\!=\!1,\ldots , d\), we say that function \(g:\mathbb{R }^d\rightarrow \mathbb{R }^1\) belongs to the anisotropic Nikol’skii class \(\mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L})\) if

  (i)

    \(\Vert g\Vert _{r_j}\le L_{j}\) for all \(j=1,\ldots ,d\);

  (ii)

    for every \(j=1,\ldots ,d\) there exists a natural number \(k_j>\beta _j\) such that

    $$\begin{aligned} \left\| \Delta _{u,j}^{k_j} g \right\| _{r_j} \le L_j |u|^{\beta _j},\quad \forall u\in \mathbb{R }. \end{aligned}$$
    (3.2)

The anisotropic Nikol’skii class is a specific case of the anisotropic Besov class, often encountered in the nonparametric estimation literature.

In particular, \(\mathbb{N }_{\vec {r},d}\left( \vec {\beta },\cdot \right) = \mathbb{B }_{r_1,\ldots ,r_d;\infty ,\ldots ,\infty }^{\beta _1, \ldots ,\beta _d}(\cdot )\), see [29, Section 4.3.4].

3.2 Construction of kernel \(K\)

We will use the following specific kernel \(K\) in the definition of the family \(\mathcal{F }(\mathcal{H })\) (see, e.g., [23] or [15]).

Let \(\ell \) be a positive integer, and let \(w:[-1/(2\ell ), 1/(2\ell )]\rightarrow \mathbb{R }^1\) be a function satisfying \(\int w(y)\,\mathrm{d}y=1\) and \(w\in \mathbb{C }(\mathbb{R }^1)\). Put

$$\begin{aligned} w_\ell (y)\!=\!\sum _{i=1}^\ell \left( {\begin{array}{c}\ell \\ i\end{array}}\right) (-1)^{i+1}\frac{1}{i}w\left( \frac{y}{i}\right) ,\quad K(t)\!=\!\prod _{j=1}^d w_\ell (t_j),\quad t=(t_1,\ldots ,t_d).\qquad \end{aligned}$$
(3.3)

The kernel \(K\) constructed in this way is bounded, supported on \([-1/2,1/2]^d\), belongs to \(\mathbb{C }(\mathbb{R }^d)\) and satisfies

$$\begin{aligned} \int K(t)\,\mathrm{d}t=1,\quad \int K(t) t^k\,\mathrm{d}t=0,\quad \forall |k|=1,\ldots , \ell -1, \end{aligned}$$

where \(k=(k_1,\ldots ,k_d)\) is the multi-index, \(k_i\ge 0,\,|k|=k_1+\cdots +k_d\), and \(t^k=t_1^{k_1}\cdots t_d^{k_d}\) for \(t=(t_1,\ldots , t_d)\).
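The moment conditions just stated can be verified numerically for the construction (3.3). In this sketch we take \(\ell =3\) and, as an illustrative choice of \(w\), a box function on \([-1/(2\ell ),1/(2\ell )]\) with unit integral (the paper requires a continuous \(w\)); the quadrature is a plain midpoint rule.

```python
import numpy as np
from math import comb

def w_ell(y, ell, w):
    """The univariate building block of (3.3):
    sum_{i=1}^{ell} C(ell, i) (-1)^{i+1} (1/i) w(y/i)."""
    return sum(comb(ell, i) * (-1) ** (i + 1) / i * w(y / i)
               for i in range(1, ell + 1))

ell = 3
def w(y):  # box on [-1/(2*ell), 1/(2*ell)] with integral 1 (illustrative choice)
    return np.where(np.abs(y) <= 1.0 / (2 * ell), float(ell), 0.0)

# midpoint quadrature of the first ell moments over supp(w_ell) ⊆ [-1/2, 1/2]
m = 100001
y = (np.arange(m) + 0.5) / m - 0.5
vals = w_ell(y, ell, w)
moments = [float(np.sum(vals * y ** j) / m) for j in range(ell)]
# expect moments ≈ [1, 0, 0]: integral one, the first ell-1 moments vanish
```

The construction cancels the moments for any \(w\) with unit integral, since \(\sum _{i=1}^\ell (-1)^{i+1}\binom{\ell }{i} i^j\) equals \(1\) for \(j=0\) and \(0\) for \(1\le j\le \ell -1\).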

3.3 Main results

Let \(\mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L})\) be the anisotropic Nikol’skii functional class. Put

$$\begin{aligned} \frac{1}{\beta } := \sum _{j=1}^d \frac{1}{\beta _j},\quad \frac{1}{s} := \sum _{j=1}^d \frac{1}{\beta _jr_j}, \quad L_\beta := \prod _{j=1}^d L_j^{1/\beta _j}, \end{aligned}$$

and define

$$\begin{aligned} \nu&= \left\{ \begin{array}{l@{\quad }l} \frac{1-1/p}{1-1/s+1/\beta }, &{} \quad p <\frac{2+1/\beta }{1+1/s},\\ \frac{\beta }{2\beta +1}, &{} \quad \frac{2+1/\beta }{1+1/s} \le p \le s(2+1/\beta ),\\ s/p, &{} \quad p> s(2+1/\beta ),\;s<1,\\ \frac{1-1/s + 1/(p\beta )}{2-2/s+1/\beta }, &{} \quad p> s(2+1/\beta ),\; s\ge 1, \end{array} \right. \\ \mu _n&= \left\{ \begin{array}{l@{\quad }l} (\ln n)^{d/p}, &{} \quad p \le \frac{2+1/\beta }{1+1/s},\\ (\ln n)^{1/p}, &{} \quad p=s(2+1/\beta ),\\ 1, &{} \quad \mathrm{otherwise}. \end{array} \right. \nonumber \end{aligned}$$
(3.4)
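Since the zones in (3.4) depend only on \(1/\beta \), \(1/s\) and \(p\), the exponent \(\nu \) is straightforward to tabulate. The following is a direct transcription (the function name is ours):

```python
def rate_exponent(beta, r, p):
    """Exponent nu from (3.4); beta, r are length-d sequences, p is the risk index."""
    inv_beta = sum(1.0 / b for b in beta)                  # 1/beta
    inv_s = sum(1.0 / (b * rj) for b, rj in zip(beta, r))  # 1/s
    if p < (2.0 + inv_beta) / (1.0 + inv_s):
        return (1.0 - 1.0 / p) / (1.0 - inv_s + inv_beta)  # tail zone
    if p * inv_s <= 2.0 + inv_beta:                        # i.e. p <= s(2+1/beta)
        return 1.0 / (2.0 + inv_beta)                      # dense zone
    if inv_s > 1.0:                                        # s < 1
        return 1.0 / (p * inv_s)                           # sparse zone, s < 1
    return (1.0 - inv_s + inv_beta / p) / (2.0 - 2.0 * inv_s + inv_beta)  # sparse, s >= 1
```

For example, for \(d=1\), \(r=\infty \) (so \(1/s=0\)) and \(\beta =1\), it returns the tail-zone exponent \(1/4\) at \(p=2\) and the dense-zone exponent \(1/3\) at \(p=4\), matching the rates of Juditsky and Lambert-Lacroix quoted in the Introduction.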

In contrast to Theorem 1, which holds over the set of all probability densities, the adaptive results presented below require an additional assumption: the estimated density must be uniformly bounded. For this purpose, for \(M>0\) we define

$$\begin{aligned} \mathbb{N }_{\vec {r},d}\left( \vec {\beta },\vec {L}, M\right) := \mathbb{N }_{\vec {r},d}\left( \vec {\beta },\vec {L}\right) \;\cap \; \left\{ f:\Vert f\Vert _\infty \le M\right\} . \end{aligned}$$

Note, however, that if the set \(J:=\{j=1,\ldots ,d:\;r_j=\infty \}\) is non-empty then \( \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)=\mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}) \) with \(M=\min _{j\in J}L_j\). Moreover, in view of the embedding theorem for the anisotropic Nikol’skii classes (see Sect. 6.1 below), the condition \(s>1\) implies that the density to be estimated belongs to a class of uniformly bounded and continuous functions. Thus, if \(s>1\) one has \( \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)=\mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}) \) with some \(M\) completely determined by \(\vec {L}\).

The asymptotic behavior of the \(\mathbb{L }_p\)-risk on class \(\mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)\) is characterized in the next two theorems.

Let the family \(\mathcal{F }(\mathcal{H })\) be associated with the kernel (3.3), and let \({\hat{f}}\) denote the estimator given by the selection rule (2.10)–(2.11), with \(\varkappa = (\Vert K\Vert _\infty \vee 1)^2[(4d+2)p+4(d+1)]\), applied to the family \(\mathcal{F }(\mathcal{H })\).

Theorem 2

For any \(M>0,\,L_0>0,\,\ell \in \mathbb{N }^*\), any \(\vec {\beta }\in (0,\ell ]^d,\,\vec {r}\in (1,\infty ]^d\), any \(\vec {L}\) satisfying \(\min _{j=1,\ldots ,d}L_j\ge L_0\), and any \(p\in (1,\infty )\) one has

$$\begin{aligned} \limsup _{n\rightarrow \infty }\left\{ \mu _n\left( \frac{L_\beta \ln n}{n}\right) ^{-\nu }\; \mathcal{R }_p^{(n)}\left[ {\hat{f}}\,;\mathbb{N }_{\vec {r},d}\left( \vec {\beta },\vec {L}, M\right) \right] \right\} \le C <\infty . \end{aligned}$$

Here constant \(C\) does not depend on \(\vec {L}\) in the cases \(p\le s(2+1/\beta )\) and \(p\ge s(2+1/\beta ),\,s<1\).

Remark 1

  1.

    The condition \(\min _{j=1,\ldots ,d}L_j\ge L_0\) ensures that the constant \(C\) does not depend on \(\vec {L}\) in the cases \(p\le s(2+1/\beta )\) and \(p\ge s(2+1/\beta ),\,s<1\). If \(p\ge s(2+1/\beta )\) and \(s\ge 1\) then \(C\) depends on \(\vec {L}\), and the corresponding expressions can be easily extracted from the proof of the theorem. We note that in this case the map \(\vec {L}\mapsto C(\vec {L})\) is bounded on every closed cube of \((0,\infty )^{d}\).

  2.

    We consider only the case \(1<p<\infty \), excluding \(p=1\) and \(p=\infty \). It is well known [20] that smoothness alone is not sufficient to guarantee consistency of density estimators in \(\mathbb{L }_1(\mathbb{R }^d)\); see also Theorem 3 for a lower bound. The case \(p=\infty \) was considered recently in [25].

  3.

    As discussed above, Theorem 2 requires uniform boundedness of the estimated density, i.e. \(\Vert f\Vert _\infty \le M<\infty \). We note, however, that our estimator \(\hat{f}\) is fully adaptive: its construction does not use any information on the parameters \(\vec {\beta }, \vec {r}, \vec {L}\) and \(M\).

Now we present lower bounds on the minimax risk. Define

$$\begin{aligned} \alpha _n = \left\{ \begin{array}{l@{\quad }l} \ln n, &{} \quad p> s(2+1/\beta ),\; s\ge 1,\\ 1, &{} \quad \mathrm{otherwise}. \end{array} \right. \end{aligned}$$

Theorem 3

Let \(\vec {\beta }\in (0,\infty )^{d},\,\vec {r}\in [1,\infty ]^{d},\,\vec {L}\in (0,\infty )^{d}\) and \(M>0\) be fixed.

  1. (i)

    There exists \(c>0\) such that

    $$\begin{aligned} \liminf _{n\rightarrow \infty }\left\{ \left( \frac{L_\beta \alpha _n}{n}\right) ^{-\nu } \inf _{\widetilde{f}} \mathcal{R }_p^{(n)}\left[ \widetilde{f};\;\mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)\right] \right\} \ge c,\quad \forall p\in [1,\infty ), \end{aligned}$$

    where the infimum is taken over all possible estimators \(\widetilde{f}\). If \(\min _{j=1,\ldots ,d} L_j\ge L_0>0\) then in the cases \(p\le s(2+1/\beta )\) or \(p\ge s(2+1/\beta )\) and \(s<1\) the constant \(c\) is independent of \(\vec {L}\).

  2. (ii)

    Let \(p=\infty \) and \(s\le 1\); then there is no consistent estimator, i.e., for some \(c>0\)

    $$\begin{aligned} \liminf _{n\rightarrow \infty }\inf _{\tilde{f}} \sup _{f\in \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)}\mathbb{E }_f\left\| \tilde{f}-f\right\| _\infty \;>\;c. \end{aligned}$$

Remark 2

  1. 1.

    Inspection of the proof shows that if \(\max _{j=1,\ldots ,d} L_j\le L_\infty <\infty \) then statement (i) is valid with a constant \(c\) depending on \(\vec {\beta }, \vec {r}\), \(L_0\), \(L_\infty ,\,d\) and \(M\) only.

  2. 2.

    As mentioned above, adaptive minimax density estimation on \(\mathbb{R }^d\) under \(\mathbb{L }_\infty \)-loss was the subject of the recent paper [25], where a minimax adaptive estimator is constructed under the assumption \(s>1\). Thus, statement (ii) of Theorem 3 finalizes the research on adaptive density estimation in the supremum norm. It is interesting to note that the minimax rates in the case \(p=\infty \) coincide with those of Theorem 2 if we formally put \(p=\infty \).

3.4 Discussion

The results of Theorem 2, together with the matching lower bounds of Theorem 3, provide a complete classification of the minimax rates of convergence in the problem of density estimation on \(\mathbb{R }^d\). In particular, we discover four different zones with respect to the minimax rates of convergence.

  • The tail zone corresponds to “small” \(p,\,1<p\le \frac{2+1/\beta }{1+1/s}\). This zone does not appear if the density \(f\) is assumed to be compactly supported, or if some tail dominance condition is imposed; see Sect. 4.

  • The dense zone is characterized by the “intermediate” range of \(p,\,\frac{2+1/\beta }{1+1/s}\le p\le s(2+1/\beta )\). Here the “usual” rate of convergence \(n^{-\beta /(2\beta +1)}\) holds.

  • The sparse zone corresponds to “large” \(p,\,p\ge s(2+1/\beta )\). As Theorems 2 and 3 show, this zone is in turn subdivided into two regions, \(s\ge 1\) and \(s<1\). This phenomenon was not observed in the existing literature, even for settings with compactly supported densities. For other statistical models (regression, white Gaussian noise, etc.) this result is also new.

It is important to emphasize that the existence of these zones is related neither to the multivariate nature of the problem nor to the anisotropic smoothness of the estimated density. In fact, these results hold already in the one-dimensional case, and this was observed, to a limited degree, in previous works. In the subsequent remarks we discuss relationships between our results and the existing literature, and comment on some open problems.
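The zone boundaries just described can be made concrete in a few lines. The sketch below takes the effective indices \(\beta \) and \(s\) (defined earlier in the paper) as inputs; the function name is ours:

```python
def zone(p: float, beta: float, s: float) -> str:
    """Classify the norm index p into the tail/dense/sparse zones of Sect. 3.4."""
    tail_dense = (2 + 1 / beta) / (1 + 1 / s)  # tail/dense boundary
    dense_sparse = s * (2 + 1 / beta)          # dense/sparse boundary
    if p <= tail_dense:
        return "tail"
    if p <= dense_sparse:
        return "dense"
    return "sparse, s >= 1" if s >= 1 else "sparse, s < 1"

# e.g. for beta = 1, s = 2 the boundaries sit at p = 2 and p = 6
```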

  1. 1.

    In [4, 9, 24] the sparse zone is defined as \(p>2(1+1/\beta ),\,s>1\). Recall that the condition \(s>1\) implies that the density to be estimated belongs to a class of uniformly bounded and continuous functions. In the sparse zone we also consider the case \(s\le 1\), with the density \(f\) assumed to be uniformly bounded. It turns out that in this zone the rate corresponding to the index \(\nu =s/p\) emerges.

  2. 2.

    The one-dimensional setting was considered in [21] and [30]. The setting of Juditsky and Lambert-Lacroix [21] corresponds to \(s=\infty \), while Reynaud-Bouret et al. [30] deal with the case \(p=2\) and \(\beta >1/r-1/2\). Both settings rule out the sparse zone. The rates of convergence in the dense zone obtained in the aforementioned papers are easily recovered from our results. However, in the tail zone our bound contains an additional \(\ln (n)\)-factor.

  3. 3.

    In the previous papers on adaptive estimation of densities with unbounded support (cf. [21] and [30]) the developed estimators are explicitly shrunk to zero; this shrinkage is used in bounding the minimax risk on the whole space. We do not employ shrinkage in our estimator construction. Instead, we derive bounds on the \(\mathbb{L }_p\)-risk by integration of the point-wise oracle inequality (2.14). The key elements of this derivation are inequality (2.17) and statement (i) of Proposition 2. Inequality (2.17) relies on the following fact: the errors \(\zeta (x)\) and \(\chi (x)\) are integrable thanks to a careful choice of the majorant. Indeed, Sect. 5.3 shows that these errors are nonzero with a probability that is integrable and “negligible” in the regions where the density is “small”. This yields integrability of the remainder terms in (2.14). Proposition 2, in turn, concerns the integrability of the main term in (2.14). The main difficulty here is that the majorant \(M_h(\cdot ,x)\) itself is not integrable. To overcome it we use the integrability of the estimator \(\hat{f}\), approximation properties of the density \(f\), and (2.17).

  4. 4.

    In the context of the Gaussian white noise model on a compact interval, [23] developed an adaptive estimator that achieves the rate of convergence \((\ln {n}/n)^{\beta /(2\beta +1)}\) on the anisotropic Nikol’skii classes under the condition \(\sum _{i=1}^d [\frac{1}{\beta _i}(\frac{p}{r_i}-1)]_+<2\). This restriction determines only a part of the dense zone, and our Theorem 2 improves on this result: our estimator achieves the rate \((\ln {n}/n)^{\beta /(2\beta +1)}\) in the zone \(\sum _{i=1}^d \frac{1}{\beta _i}(\frac{p}{r_i}-1)\le 2\), which is equivalent to \(p\le s(2+1/\beta )\).

  5. 5.

    It follows from Theorem 3 that the upper bound of Theorem 2 is sharp in the zone \(p> s(2+1/\beta ),\,s>1\), and is nearly sharp, up to a logarithmic factor, in all other zones. This extra logarithmic factor is a consequence of the fact that we use the point-wise selection procedure (2.10)–(2.11). We also have an extra \(\ln n\)-term on the boundaries \(p=\frac{2+1/\beta }{1+1/s}\) and \(p=s(2+1/\beta )\).

Conjecture 1

The rates found in Theorem 3 are optimal.

Thus, if our conjecture is true, the construction of an estimator achieving the rates of Theorem 3 in the tail and dense zones remains an open problem.

  6. 6.

    Theorem 2 is proved under the assumption \(\vec {r}\in (1,\infty ]^{d}\); i.e., we do not include the case where \(r_j=1\) for some \(j=1,\ldots ,d\). This is related to the construction of our selection rule and to the necessity to bound the \(\mathbb{L }_{r_j}\)-norms, \(j=1,\ldots , d\), of the term \(\bar{B}_h(f,x)\); see (2.13) and (2.14). For this purpose our derivations use properties of the strong maximal operator (for details see Sect. 6.1), and it is well known that this operator is not of weak \((1,1)\)-type in dimensions \(d\ge 2\). Nevertheless, using inequality (6.5) we were able to obtain the following result.

Corollary 1

Let \(\vec {r}\) be such that \(r_j=1\) for some \(j=1,\ldots ,d\). Then the result of Theorem 2 remains valid if the normalizing factor \((n^{-1}\ln {n})^{\nu }\) is replaced by \((n^{-1}[\ln {n}]^{d})^{\nu }\).

The proof of Corollary 1 coincides with the proof of Theorem 2, with the only difference that the bounds in the proof of Proposition 1 should use (6.5) instead of the Chebyshev inequality. This results in an extra \((\ln {n})^{d-1}\)-factor. We note that the results of Theorem 2 and Corollary 1 coincide if \(d=1\). This is not surprising because in dimension \(d=1\) the strong maximal operator is the Hardy–Littlewood maximal function, which is of weak (1,1)-type.

4 Tail dominance condition

Let \(g:\mathbb{R }^d\rightarrow \mathbb{R }^1\) be a locally integrable function. Define the map \(g\mapsto g^*\) by the formula

$$\begin{aligned} g^*(x):= \sup _{h\in (0,2]^d} \frac{1}{V_h} \int _{\Pi _h(x)} g(t)\,\mathrm{d}t,\quad x\in \mathbb{R }^d, \end{aligned}$$
(4.1)

where \(\Pi _h(x)=[x_1-h_1/2,x_1+h_1/2]\times \cdots \times [x_d-h_d/2,x_d+h_d/2]\). In fact, formula (4.1) defines the maximal operator associated with the differential basis \(\cup _{x\in \mathbb{R }^d}\{ \Pi _h(x), h\in (0,2]^d\}\), see [17].

Consider the following set of functions: for any \(\theta \in (0,1]\) and \(R\in (0,\infty )\) let

$$\begin{aligned} \mathbb{G }_\theta (R)=\left\{ g:\mathbb{R }^d\rightarrow \mathbb{R }:\quad \Vert g^*\Vert _\theta \le R\right\} . \end{aligned}$$
(4.2)

Note that, although we keep the previous notation \(\Vert g\Vert _\theta =(\int |g(x)|^\theta \,\mathrm{d}x)^{1/\theta }\), \(\Vert \cdot \Vert _\theta \) is no longer a norm if \(\theta \in (0,1)\).

The assumption that \(f\in \mathbb{G }_\theta (R)\) for some \(\theta \in (0,1]\) and \(R>0\) imposes restrictions on the tail of the density \(f\). In particular, the set of uniformly bounded densities compactly supported on a cube of \(\mathbb{R }^d\) is embedded in \(\mathbb{G }_\theta (\cdot )\) for any \(\theta \in (0,1]\) (for details, see Sect. 7.4). We will refer to the assumption \(f\in \mathbb{G }_\theta (R)\) as the tail dominance condition.

In this section we study the problem of adaptive density estimation under the tail dominance condition. We show that under this condition the minimax rate of convergence can be essentially improved in the tail zone. In particular, if \(\theta \le \theta ^*\) for some \(\theta ^*<1\) given below then the tail zone disappears.

For any \(\theta \in (0,1]\) let

$$\begin{aligned} \nu ^*(\theta )=\max \left\{ \frac{1-\theta /p}{1-\theta /s+1/\beta },\;\frac{1}{2+1/\beta }\right\} , \end{aligned}$$

and define

$$\begin{aligned}&\nu (\theta ) =\left\{ \begin{array}{l@{\quad }l} \nu ^*(\theta ),&{} p \le s(2+1/\beta ),\\ \nu , &{} p> s(2+ 1/\beta ), \end{array} \right. \nonumber \\&\quad \mu _n(\theta )=\left\{ \begin{array}{l@{\quad }l} (\ln n)^{1/p}, &{} p \in \{ \frac{2+1/\beta }{1/\theta +1/s},\; s(2+1/\beta )\},\\ 1, &{} \mathrm{otherwise}, \end{array} \right. \end{aligned}$$
(4.3)

where \(\nu \) is defined in (3.4).
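For reference, the exponent \(\nu (\theta )\) of (4.3) can be evaluated as follows; since \(\nu \) of (3.4) is defined outside this excerpt, it is passed in as a parameter (a sketch, names ours):

```python
def nu_theta(theta: float, p: float, beta: float, s: float,
             nu_sparse: float) -> float:
    """Rate exponent nu(theta) of (4.3): nu*(theta) in the zone
    p <= s(2 + 1/beta), and nu of (3.4) (given here as nu_sparse) otherwise."""
    nu_star = max((1 - theta / p) / (1 - theta / s + 1 / beta),
                  1 / (2 + 1 / beta))
    return nu_star if p <= s * (2 + 1 / beta) else nu_sparse
```

For instance, with \(\theta =1,\,\beta =1,\,s=2,\,p=3\) one is in the zone \(p\le s(2+1/\beta )=6\) and obtains \(\nu ^*(1)=\max \{\frac{2/3}{3/2},\frac{1}{3}\}=4/9\).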

Theorem 4

The following statements hold.

  • (i) For any \(\theta \in (0,1]\) and \(R>0\), Theorem 2 remains valid if one replaces \(\mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)\) by \(\mathbb{G }_\theta (R)\cap \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)\), \(\nu \) by \(\nu (\theta )\), and \(\mu _n\) by \(\mu _n(\theta )\). The constant \(C\) may depend on \(\theta \) and \(R\).

  • (ii) For any \(\theta \in (0,1],\,\vec {\beta },\vec {L}\in (0,\infty )^d,\,\vec {r}\in [1,\infty ]^d\) and \(M>0\) one can find \(R>0\) such that Theorem 3 remains valid if one replaces \(\mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)\) by \(\mathbb{G }_\theta (R)\cap \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)\), \(\nu \) by \(\nu (\theta )\), and \(\mu _n\) by \(\mu _n(\theta )\).

Remark 3

  1. 1.

    The tail dominance condition leads to an improvement of the rates of convergence in the whole tail zone. In particular, if \(f\in \mathbb{G }_1(R)\) then the additional \(\ln ^{\frac{d}{p}}(n)\)-factor disappears; cf. \(\mu _n\) and \(\mu _n(1)\). Moreover, under the tail dominance condition with \(\theta <1\) the faster convergence rate of the dense zone is achieved over a wider range of values of \(p\), namely \(\frac{2+1/\beta }{1/\theta +1/s} \le p \le s(2+1/\beta )\). Additionally, if

    $$\begin{aligned} \theta < \theta ^*:=\frac{ps}{s(2+1/\beta )-p}, \end{aligned}$$

    then the tail zone disappears. Note that \(\theta ^*\in (0,1)\) whenever \(p< \frac{2+1/\beta }{1+1/s}\). As mentioned above, the set of uniformly bounded densities compactly supported on a cube of \(\mathbb{R }^d\) is embedded in \(\mathbb{G }_\theta (\cdot )\) for any \(\theta \in (0,1]\). This fact explains why the tail zone does not appear in problems of estimating compactly supported densities.

  2. 2.

    We would like to emphasize that the couple \((\theta , R)\) is not used in the construction of the estimation procedure; thus, our estimator is adaptive with respect to \((\theta ,R)\) as well. In particular, if the tail dominance condition does not hold, our estimator achieves the rate of Theorem 2; if it does hold, the rate of convergence in the tail zone improves automatically.

  3. 3.

    The second statement of the theorem is proved under the assumption that \(R\) is large enough. The fact that \(R\) cannot be chosen arbitrarily small is not technical: the parameters \(\vec {\beta },\,\vec {L},\,\vec {r},\,M\), \(\theta \) and \(R\) are related to each other. In particular, one can easily derive lower bounds on \(R\) in terms of the other parameters of the class. For instance, by the Lebesgue differentiation theorem, \(f(x)\le f^*(x)\) almost everywhere; therefore for any density \(f\in \mathbb{G }_\theta (R)\) such that \(\Vert f\Vert _\infty \le M\) one has

    $$\begin{aligned} 1=\int f\le M^{1-\theta }\Vert f^*\Vert ^\theta _\theta \le M^{1-\theta } R^{\theta } \Rightarrow R\ge M^{1-1/\theta }. \end{aligned}$$

    Another lower bound on \(R\) in terms of \(\vec {L},\,\vec {r}\) and \(\theta \) can be established using the Littlewood interpolation inequality (see, e.g., [13, Section 5.5]). Let \(0<q_0<q_1\) and \(\alpha \in (0,1)\) be arbitrary numbers; then the Littlewood inequality states that \(\Vert g\Vert _q\le \Vert g\Vert _{q_0}^{1-\alpha }\Vert g\Vert _{q_1}^{\alpha }\), where \(q\) is defined by relation \(\frac{1}{q}=\frac{1-\alpha }{q_0}+\frac{\alpha }{q_1}\). Now, suppose that \(f\in \mathbb{G }_\theta (R)\cap \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L})\), and choose \(q_0=\theta ,\,q_1=r_i\) and \(\alpha =\frac{1-\theta }{1-\theta /r_i}\); then \(q=1\) and by the Littlewood inequality we have

    $$\begin{aligned} 1\!=\!\Vert f\Vert _1\!\le \! \Vert f\Vert _\theta ^{1-\alpha }\Vert f\Vert _{r_i}^\alpha \!\le \! R^{\frac{r_i\theta -\theta }{r_i-\theta }}L_{i}^{\frac{r_i-r_i \theta }{r_i-\theta }},\quad i\!=\!1,\ldots , d\;\Rightarrow \;R\!\ge \! \max _{i=1,\ldots ,d} L_i^{\frac{r_i\theta -r_i}{r_i-\theta }}. \end{aligned}$$
  4. 4.

    Another interesting observation concerns the specific case \(p=1\). Recall that the condition \(f\in \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L},M)\) alone is not sufficient for the existence of consistent estimators. However, for any \(\theta \in (0,1)\) we can show

    $$\begin{aligned} \inf _{\tilde{f}} \mathcal{R }_1^{(n)}\left[ \tilde{f};\; \mathbb{G }_\theta (R)\cap \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)\right] \!\le \! C\left[ \frac{L_\beta (\ln n)^d}{n} \right] ^{\frac{1-\theta }{1-\theta /s+1/\beta }} \rightarrow 0, n\rightarrow \infty . \end{aligned}$$

    This result follows from the proof of Theorem 4 and (6.5).
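The two lower bounds on \(R\) derived in item 3 above can be combined numerically; a sketch for finite \(r_i\) (the function name is ours):

```python
def R_lower_bound(theta, M, L, r):
    """Maximum of the two lower bounds on R from Remark 3:
    R >= M^(1 - 1/theta)  and  R >= max_i L_i^((r_i*theta - r_i)/(r_i - theta)),
    for finite r_i."""
    lb_sup = M ** (1 - 1 / theta)
    lb_nik = max(L_i ** ((r_i * theta - r_i) / (r_i - theta))
                 for L_i, r_i in zip(L, r))
    return max(lb_sup, lb_nik)

# e.g. theta = 1/2, M = 4, L = (2,), r = (2,):
#   M^(-1) = 0.25 and 2^(-2/3) ~ 0.63, so the bound is 2^(-2/3)
```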

Now we argue that the condition \(f\in \mathbb{G }_{\theta ^*} (R)\) is, in a sense, the weakest possible one ensuring the “usual” rate of convergence, corresponding to the index \(\nu =\beta /(2\beta +1)\), in the whole zone \(p\le s(2+1/\beta )\). Indeed, in view of Theorem 4, the minimax rate of convergence on the class \(\mathbb{G }_{\theta ^*}(R)\cap \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)\), say \(\overline{\psi }_n(\theta ^*)\), satisfies

$$\begin{aligned}&c\left( L_\beta /n\right) ^{\frac{\beta }{2\beta +1}}\;\le \; \overline{\psi }_n(\theta ^*)\;\le \; C(\ln {n})^{1/p}\left( L_\beta \ln {n}/n\right) ^{\frac{\beta }{2\beta +1}}, \end{aligned}$$

where the constants \(c\) and \(C\) may depend on \(R\). On the other hand, if \(\underline{\psi }_n(\theta ^*)\) denotes the minimax rate of convergence on the class \(\mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L},M){\setminus }\mathbb{G }_{\theta ^*}(R)\) then

$$\begin{aligned} c\left( L_\beta /n\right) ^{\frac{1-1/p}{1-1/s+1/\beta }} \le \underline{\psi }_n(\theta ^*) \;\le \; C\left( L_\beta \ln {n}/n\right) ^{\frac{1-1/p}{1-1/s+1/\beta }}, \end{aligned}$$
(4.4)

provided that \(p\le \frac{2+1/\beta }{1+1/s}\). The upper bound in (4.4) is one of the statements of Theorem 2, while the lower bound is proved in Sect. 7.5.

Thus we conclude that there is no tail zone in estimation over the class \(\mathbb{G }_{\theta ^*}(R)\cap \mathbb{N }_{\vec {r},d}(\vec {\beta }, \vec {L},M)\), while this zone appears when considering the class \(\mathbb{N }_{\vec {r},d}(\vec {\beta }, \vec {L},M){\setminus } \mathbb{G }_{\theta ^*}(R)\). In this sense \(f\in \mathbb{G }_{\theta ^*}(R)\cap \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L},M)\) is the necessary and sufficient condition eliminating the tail zone.

5 Proof of Theorem 1

First we state two auxiliary results, Lemmas 1 and 2, and then turn to the proof of the theorem. The proof of measurability of our estimator and the proofs of Lemmas 1 and 2 are given in Appendix A.

5.1 Auxiliary lemmas

For any \(g:\mathbb{R }^d\rightarrow \mathbb{R }^1\) denote

$$\begin{aligned} {\check{M}}_h(g,x) =\sqrt{\frac{\varkappa {\hat{A}}_h(g,x)\ln n}{nV_h}} + \frac{\varkappa \ln n}{nV_h}~. \end{aligned}$$

Lemma 1

Let \(\chi _h(g,x)= [|\hat{A}_h(g,x)-A_h(g,x)|- M_h(g,x)]_+,\,h\in \mathcal{H }\); then

$$\begin{aligned}{}[{\check{M}}_h(g,x)- 5M_h(g,x)]_+ \le \frac{1}{2} \chi _h(g,x),\quad [M_h(g,x)- 4{\check{M}}_h(g,x)]_+ \le 2 \chi _h(g,x). \end{aligned}$$

The next lemma establishes moment bounds on the following four random variables:

$$\begin{aligned} \zeta _1(x)&:= \sup _{h\in \mathcal{H }} [|\xi _h(x)|- M_h(K,x)]_{+};\nonumber \\ \zeta _2(x)&:= \sup _{h,\eta \in \mathcal{H }} [|\xi _{h,\eta }(x)|-M_{h\vee \eta }(Q,x)]_{+};\nonumber \\ \zeta _3(x)&:= \sup _{h\in \mathcal{H }} [ |A_h(K,x)-\hat{A}_h(K,x)|-M_h(K,x)]_+;\\ \zeta _4(x)&:= \sup _{h\in \mathcal{H }} [ |A_h(Q,x)-\hat{A}_h(Q,x)|-M_h(Q,x)]_+.\nonumber \end{aligned}$$
(5.1)

Denote \(\mathrm{k}_\infty =\Vert K\Vert _\infty \vee 1\) and

$$\begin{aligned} F(x)=\int \mathbf 1 _{[-1,1]^d}(t-x)f(t)\,\mathrm{d}t. \end{aligned}$$

Lemma 2

Let \(q\ge 1,\,l\ge 1\) be arbitrary numbers. If \(\varkappa \ge \mathrm{k}^2_\infty [(2q+4)d + 2l]\) then for all \( x\in \mathbb{R }^d\)

$$\begin{aligned} \mathbb{E }_f [\zeta _j(x)]^q&\le C_0 n^{-q/2}\left\{ F(x)\vee n^{-l}\right\} ,\quad j=1,2,3,4, \end{aligned}$$
(5.2)

where the constant \(C_0\) depends on \(d,\,q\), and \(\mathrm{k}_\infty \) only.

5.2 Proof of oracle inequality (2.14)

We recall the standard error decomposition of the kernel estimator: for any \(h\in \mathcal{H }\) one has

$$\begin{aligned} |{\hat{f}}_h(x) - f(x)| \le |B_h(f,x)| + |\xi _h(x)|, \end{aligned}$$

where \(B_h(f,x)\) and \(\xi _h(x)\) are given in (2.12) and (2.4), respectively. A similar error decomposition holds for the auxiliary estimators \({\hat{f}}_{h,\eta }(x)\); the corresponding bias and stochastic error are denoted by \(B_{h,\eta }(f,x)\) and \(\xi _{h,\eta }(x)\).

\(1^0\). The following relation for the bias \(B_{h,\eta }(f,x)\) of \(\hat{f}_{h,\eta }(x)\) holds:

$$\begin{aligned} B_{h,\eta }(f,x) - B_\eta (f,x) = \int K_{\eta }(t-x) B_h(f,t)\,\mathrm{d}t,\quad \forall h, \eta \in \mathcal{H }. \end{aligned}$$
(5.3)

Indeed, using the Fubini theorem and the fact that \(\int K_h(x)\,\mathrm{d}x=1\) for all \(h\in \mathcal{H }\) we have

$$\begin{aligned} \int [K_h * K_\eta ](t-x) f(t)\,\mathrm{d}t&= \int \left[ \int K_h(t-y) K_\eta (y-x)\,\mathrm{d}y\right] f(t)\,\mathrm{d}t\\&= \int K_\eta (y-x) f(y)\,\mathrm{d}y\\&+ \int K_\eta (y-x) \left[ \int K_h(t-y)[f(t)-f(y)]\,\mathrm{d}t\right] \, \mathrm{d}y. \end{aligned}$$

It remains to note that \( \int K_h(t-y)[f(t)-f(y)]\,\mathrm{d}t=B_h(f,y)\) and to subtract \(f(x)\) from both sides of the above equality. Thus, (5.3) is proved.
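Once everything is sampled on a grid, identity (5.3) reduces to associativity of discrete convolution, so it can be checked numerically. Below is a minimal one-dimensional sketch with a uniform kernel and a Gaussian density; both are illustrative choices of ours, not the paper's kernel (3.3):

```python
import numpy as np

# One-dimensional numerical check of (5.3):
#   B_{h,eta}(f,.) - B_eta(f,.) = K_eta * B_h(f,.)
x = np.linspace(-8.0, 8.0, 4001)          # odd length: grid center at x = 0
dx = x[1] - x[0]

def K_h(h):
    # K_h(u) = h^{-1} K(u/h), with K the uniform kernel on [-1/2, 1/2]
    return np.where(np.abs(x) <= h / 2, 1.0 / h, 0.0)

conv = lambda a, b: np.convolve(a, b, mode="same") * dx

f = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
Kh, Keta = K_h(0.5), K_h(0.3)

B_h = conv(f, Kh) - f                     # bias of the kernel estimator f_h
B_eta = conv(f, Keta) - f
B_h_eta = conv(f, conv(Kh, Keta)) - f     # bias for the kernel K_h * K_eta

lhs = B_h_eta - B_eta
rhs = conv(B_h, Keta)
interior = slice(500, 3501)               # discard boundary-truncation effects
assert np.max(np.abs(lhs[interior] - rhs[interior])) < 1e-9
```

Note that the check is exact up to floating-point error: the terms \(-f\) cancel on both sides, and what remains is the associativity \(f*(K_h*K_\eta )=(f*K_h)*K_\eta \).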

\(2^{0}\). By the triangle inequality we have for any \(h\in \mathcal{H }\)

$$\begin{aligned} |{\hat{f}}_{\hat{h}}(x)-f(x)| \le |{\hat{f}}_{\hat{h}}(x)- {\hat{f}}_{{\hat{h}}, h}(x)| + |{\hat{f}}_{{\hat{h}},h}(x) - {\hat{f}}_h(x)| + |{\hat{f}}_h(x) -f(x)|.\qquad \end{aligned}$$
(5.4)

We bound each term on the right hand side separately.

First we note that, by (5.3) and (2.13), for any \(h\in \mathcal{H }\)

$$\begin{aligned}&{\hat{R}}_h(x) - \sup _{\eta \ge h} {\hat{M}}_\eta (Q,x) - {\hat{M}}_h(K,x)\\&\quad =\sup _{\eta \in \mathcal{H }}\left[ |{\hat{f}}_{h,\eta }(x)-{\hat{f}}_\eta (x)| - {\hat{M}}_{h\vee \eta }(Q,x)-{\hat{M}}_\eta (K,x)\right] _+\\&\quad \le {\bar{B}}_h(f,x) + \sup _{\eta \in \mathcal{H }}\left[ |\xi _{h,\eta }(x)-\xi _\eta (x)| - {\hat{M}}_{h\vee \eta }(Q, x)-{\hat{M}}_\eta (K, x)\right] _+. \end{aligned}$$

Thus, for any \(h\in \mathcal{H }\)

$$\begin{aligned} {\hat{R}}_h(x) \le \bar{B}_h(f,x) + 2{\hat{\zeta }}(x) + {\hat{M}}_h(K,x) + \sup _{\eta \ge h} {\hat{M}}_\eta (Q,x), \end{aligned}$$
(5.5)

where we put

$$\begin{aligned} \hat{\zeta } (x) {:}= \sup _{h,\eta \in \mathcal{H }}\left[ |\xi _{h,\eta }(x)|- \hat{M}_{h\vee \eta }(Q, x)\right] _+ \vee \sup _{h \in \mathcal{H }}\left[ |\xi _{h}(x)|- \hat{M}_{h}(K,x)\right] _+. \end{aligned}$$

Second, by (5.3) and \(\hat{f}_{h,\eta }\equiv {\hat{f}}_{\eta , h}\) for any \(h, \eta \in \mathcal{H }\) we have

$$\begin{aligned} |{\hat{f}}_{h,\eta }(x) - {\hat{f}}_h(x)|&\le |B_{\eta ,h}(f,x)-B_{h}(f,x)| + |\xi _{h,\eta }(x)-\xi _{h}(x)|\\&\le \; B_\eta (f,x) + \left[ |\xi _{h,\eta }(x)-\xi _h(x)|- {\hat{M}}_{h\vee \eta }(Q,x)-\hat{M}_h(K,x)\right] \\&+ \sup _{\eta \ge h} {\hat{M}}_{\eta }(Q,x) + {\hat{M}}_h(K,x) \\&\le \bar{B}_\eta (f,x) + 2{\hat{\zeta }}(x) +{ \hat{R}}_h(x), \end{aligned}$$

where the last inequality holds by the definition of \(\hat{R}_h(x)\) [see (2.10)]. These inequalities imply the following upper bound on the first term on the right hand side of (5.4): for any \(h\in \mathcal{H }\)

$$\begin{aligned}&|{\hat{f}}_{{\hat{h}},h}(x) - {\hat{f}}_{\hat{h}}(x) | \;\le \; \bar{B}_h(f,x) + {\hat{R}}_{\hat{h}}(x) + 2 {\hat{\zeta }}(x) \nonumber \\&\quad \le \bar{B}_h(f,x) + {\hat{R}}_h(x) + 2{\hat{\zeta }}(x)\nonumber \\&\quad \le 2\bar{B}_h(f,x) + 4{\hat{\zeta }}(x) + \sup _{\eta \ge h} {\hat{M}}_\eta (Q,x) +{\hat{M}}_h(K,x); \end{aligned}$$
(5.6)

where we have used the fact that \({\hat{R}}_{\hat{h}}(x) \le {\hat{R}}_h(x)\) for all \(h\in \mathcal{H }\), and inequality (5.5).

Now we turn to bounding the second term on the right hand side of (5.4). We get for any \(h\in \mathcal{H }\)

$$\begin{aligned} |\hat{f}_{\hat{h}, h}(x) - \hat{f}_h(x)|&= |\hat{f}_{\hat{h}, h}(x) - \hat{f}_h(x)| \pm \left[ \hat{M}_{\hat{h}\vee h}(Q,x) + \hat{M}_h(K,x)\right] \nonumber \\&\le \hat{R}_{\hat{h}}(x) + \sup _{\eta \ge h} \hat{M}_\eta (Q,x)+\hat{M}_h(K,x)\nonumber \\&\le \bar{B}_h(f,x)+2\hat{\zeta }(x)+2\sup _{\eta \ge h} \hat{M}_\eta (Q,x)+2\hat{M}_h(K,x),\qquad \end{aligned}$$
(5.7)

where we again used (5.5) and the fact that \(\hat{R}_{\hat{h}}(x)\le \hat{R}_h(x)\) for all \(h\in \mathcal{H }\).

Finally for any \(h\in \mathcal{H }\)

$$\begin{aligned} |\hat{f}_h(x)-f(x)| \le |B_h(f, x)| +|\xi _h(x)| \le \bar{B}_h(f,x) + M_h(K,x) + \zeta (x). \end{aligned}$$

Thus, combining (5.6), (5.7) and (5.4) we obtain

$$\begin{aligned} |\hat{f}_{\hat{h}}(x)-f(x)|&\le \inf _{h\in \mathcal{H }} \left\{ 4\bar{B}_h(f,x)+ 3 \sup _{\eta \ge h} \hat{M}_\eta (Q, x)+ 3\hat{M}_h(K,x) + M_h(K,x)\right\} \nonumber \\&+ 6{\hat{\zeta }}(x) + \zeta (x). \end{aligned}$$
(5.8)

\(3^{0}\). In order to complete the proof we note that by the first inequality of Lemma 1 for any \(g:\mathbb{R }^d\rightarrow \mathbb{R }^{1}\)

$$\begin{aligned} {\hat{M}}_h(g,x) \le 20 M_h(g,x) + 2 \chi _h(g,x). \end{aligned}$$

In addition, by the second inequality in Lemma 1

$$\begin{aligned} |\xi _h(x)| - \hat{M}_h(K,x)&= |\xi _h(x)| - M_h(K,x) + M_h(K,x)- \hat{M}_h(K,x)\\&\quad \le \zeta (x) + 2\chi (x),\\ |\xi _{h,\eta }(x)| - \hat{M}_{h\vee \eta }(Q,x)&= |\xi _{h,\eta }(x)| - M_{h\vee \eta }(Q,x) + M_{h\vee \eta }(Q,x) - \hat{M}_{h\vee \eta }(Q,x)\\&\quad \le \zeta (x) + 2\chi (x), \end{aligned}$$

so that \(\hat{\zeta }(x) \le \zeta (x)+ 2\chi (x)\). Substituting these bounds in (5.8) we obtain

$$\begin{aligned} |\hat{f}(x)-f(x)|&\le \inf _{h\in \mathcal{H }} \left\{ 4\bar{B}_h(f,x)+ 60 \sup _{\eta \ge h} M_\eta (Q, x)+ 61M_h(K,x)\right\} \\&+ 7\zeta (x) + 18\chi (x), \end{aligned}$$

as claimed. \(\square \)

5.3 Proof of moment bounds (2.17)

Let \(\zeta _j(x),\,j=1,\ldots , 4\) be defined by (5.1). Then

$$\begin{aligned} \mathbb{E }_f [\zeta _j(x)]^q&\le C_0 n^{-q/2} \left\{ F(x)\vee n^{-l}\right\} , \end{aligned}$$

as claimed in Lemma 2.

Let \(T_1=\{x\in \mathbb{R }^d: F(x)\ge n^{-l}\}\) and \(T_2=\mathbb{R }^d{\setminus }T_1\). Then

$$\begin{aligned}&\int _{T_1} \mathbb{E }_f [\zeta _j(x)]^q\,\mathrm{d}x \le C_0 n^{-q/2} \int _{T_1}F(x)\,\mathrm{d}x\le C_0 n^{-q/2}\int F(x)\,\mathrm{d}x = 2^{d}C_0n^{-q/2}.\nonumber \\ \end{aligned}$$
(5.9)

Now we analyze integrability on the set \(T_2\). We consider only the cases \(j=1,2\), since the computations for \(j=3,4\) are the same as for \(j=1\).

Let \( U_{\max }(x)=[x-1, x+1]^d \), define the event \(D(x)=\{\sum _{i=1}^n {\mathbf{1}}[X_i\in U_{\max }(x)]<2\}\), and let \(\bar{D}(x)\) denote the complementary event. First we argue that for \(j=1,2\)

$$\begin{aligned} \zeta _j(x) \mathbf{1 }\{D(x)\}=0,\quad \forall x\in T_2. \end{aligned}$$
(5.10)

Indeed, if \(x\in T_2\) then for any \(h\in \mathcal{H }\)

$$\begin{aligned} \left| \mathbb{E }_f K_h(X_i\!-\!x)\right| \!\le \! n^{d}\mathrm{k}_\infty F(x)\!\le \! \mathrm{k}_\infty n^{d-l}, \quad \left| \mathbb{E }_f Q_h(X_i-x)\right| \!\le \! n^{d}\mathrm{k}^{2}_\infty F(x)\!\le \! \mathrm{k}^{2}_\infty n^{d-l}. \end{aligned}$$

Here we have used that \(\mathcal{H }=[1/n,1]^d\) and that \(\text {supp}(K)=[-1/2,1/2]^d,\;\text {supp}(Q)=[-1,1]^d\).

Hence, by definition of \(\xi _h(x)\), for any \(h\in \mathcal{H }\) one has for any \(l\ge d+1\)

$$\begin{aligned} |\xi _h(x)| \mathbf{1}\{D(x)\}&\le \left| \frac{1}{n}\sum _{i=1}^n K_h(X_i-x)\right| \mathbf{1}\{D(x)\} + \mathrm{k}_\infty n^{d-l}\\&\le \frac{2\mathrm{k}_\infty }{nV_h} + \mathrm{k}_\infty n^{d-l} \;\le \; \frac{4\mathrm{k}_\infty }{nV_h}\le M_h(K,x), \end{aligned}$$

where we have used that \(n^{d-l}\le (nV_h)^{-1}\) for \(l\ge d+1\), that \(\varkappa \ln n\ge 4\mathrm{k}_\infty \) by the condition on \(\varkappa \) [see also the definition of \(M_h(K,x)\)], and that \(n\ge 3\). Therefore \(\zeta _1(x)\mathbf{1}\{D(x)\}=0\) for \(x\in T_2\). The same reasoning applied to \(\zeta _2(x)\) yields \(\zeta _2(x)\mathbf{1}\{D(x)\}=0,\,\forall x\in T_2\), because \(\varkappa \ln n \ge 4\mathrm{k}^2_\infty \). Thus (5.10) is proved. Using (5.10) we can write

$$\begin{aligned} \int _{T_2} \mathbb{E }_f [\zeta _j(x)]^q \mathbf{1}\{\bar{D}(x)\}\,\mathrm{d}x&\le \int _{T_2} \mathbb{E }_f \left( \left[ \sup _{h\in \mathcal{H }} |\xi _h(x)|^q \!\vee \! \sup _{h,\eta \in \mathcal{H }} |\xi _{h,\eta }(x)|^q\right] \mathbf{1}\{\bar{D}(x)\}\right) \,\mathrm{d}x \nonumber \\&\le \left( 2\mathrm{k}^2_\infty n^d\right) ^q \int _{T_2} \mathbb{P }_f\{\bar{D}(x)\} \mathrm{d}x. \end{aligned}$$
(5.11)

Now we bound from above the integral on the right hand side of the last display formula. For any \(z>0\) we have in view of the exponential Markov inequality

$$\begin{aligned} \mathbb{P }_f\{\bar{D}(x)\}&= \mathbb{P }_f\left\{ \sum _{i=1}^n \mathbf{1}[X_i\in U_{\max }(x)] \ge 2\right\} \le e^{-2z} \left[ e^{z} F(x) + 1-F(x)\right] ^n\\&= e^{-2z}\left[ (e^z-1)F(x)+1\right] ^n \le \exp \{-2z + n(e^{z}-1)F(x)\}. \end{aligned}$$

Minimizing the right hand side w.r.t. \(z\) we find \(z=\ln 2 -\ln {\{nF(x)\}}\) and, therefore,

$$\begin{aligned} \mathbb{P }_f\left\{ \bar{D}(x)\right\} \le 4^{-1}n^2F^2(x) \exp \{2-nF(x)\}\le (e^2/4) n^2 F^2(x). \end{aligned}$$

Since \(F(x)\le n^{-l}\) for any \(x\in T_2\) we obtain

$$\begin{aligned} \int _{T_2} \mathbb{P }_f\{\bar{D}(x)\}\,\mathrm{d}x \le (e^2/4)n^{2-l} \int F(x)\,\mathrm{d}x =2^d (e^2/4) n^{2-l}. \end{aligned}$$

Combining this inequality with (5.11) we obtain

$$\begin{aligned}&\int _{T_2} \mathbb{E }_f [\zeta _j(x)]^q \mathbf{1}\{\bar{D}(x)\}\,\mathrm{d}x \le 2^d (2\mathrm{k}^2_\infty )^q(e^2/4) n^{2+dq-l}. \end{aligned}$$
(5.12)

Choosing \(l=(d+1)q+2\) we arrive at the assertion of the theorem in view of (5.9) and (5.12). \(\square \)
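The Chernoff-type bound \(\mathbb{P }_f\{\bar{D}(x)\}\le (e^2/4)n^2F^2(x)\) used in the proof can be sanity-checked against the exact binomial tail probability; a small sketch (the helper name is ours):

```python
import math

def p_at_least_two(n: int, q: float) -> float:
    """Exact P{Binomial(n, q) >= 2} = 1 - (1-q)^n - n*q*(1-q)^(n-1)."""
    return 1.0 - (1 - q) ** n - n * q * (1 - q) ** (n - 1)

# the bound (e^2/4) n^2 q^2 dominates the exact tail for every (n, q) tried
for n in (10, 100, 1000):
    for q in (1e-4, 1e-3, 1e-2):
        bound = (math.e ** 2 / 4) * n ** 2 * q ** 2
        assert p_at_least_two(n, q) <= bound
```

Note that the Chernoff computation in the proof requires \(z=\ln 2-\ln \{nF(x)\}\ge 0\), i.e. \(nF(x)\le 2\); when \(nF(x)>2\) the bound exceeds 1 and holds trivially.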

6 Proofs of Theorem 2 and statement (i) of Theorem 4

The proofs of Theorem 2 and of statement (i) of Theorem 4 go along similar lines; that is why we state our auxiliary results (Propositions 1 and 2) in a form suitable for use in the proof of Theorem 4.

This section is organized as follows. First, in Sect. 6.1 we present and discuss some facts from functional analysis. Then, in Lemma 3 of Sect. 6.2, we state an auxiliary result on the approximation properties of the kernel \(K\) defined in (3.3). The proof outline and notation are discussed in Sect. 6.3, Sect. 6.4 presents two auxiliary propositions, and the proofs of Theorem 2 and statement (i) of Theorem 4 are completed in Sects. 6.5 and 6.6. Proofs of the auxiliary results, Lemma 3 and Propositions 1 and 2, are given in Appendix B.

In the subsequent proofs \(c_i,C_i,\bar{c}_i, \bar{C}_i, \hat{c}_i,\hat{C}_i, \tilde{c}_i, \tilde{C}_i, \ldots \) stand for constants that may depend on \(L_0,\,M,\,\vec {\beta },\,\vec {r},\,d\) and \(p\), but are independent of \(\vec {L}\) and \(n\); they can differ at different appearances. When the assumption \(f\in \mathbb{G }_\theta (R)\) with \(\theta \in (0,1]\) is imposed, they may also depend on \(\theta \) and \(R\).

6.1 Preliminaries

We present an embedding theorem for the anisotropic Nikol’skii classes and discuss some properties of the strong maximal operator.

6.1.1 Embedding theorem

The statement given below in (6.2) is a particular case of the embedding theorem for the anisotropic Nikol’skii classes \(\mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L})\); see [29, Section 6.9.1].

For the fixed class parameters \(\vec {\beta }\) and \(\vec {r}\) define

$$\begin{aligned} \tau (p)&= 1-\sum _{j=1}^d\frac{1}{\beta _j} \left( \frac{1}{r_j}-\frac{1}{p}\right) ,\quad \tau _i=1-\sum _{j=1}^d\frac{1}{\beta _j} \left( \frac{1}{r_j}-\frac{1}{r_i}\right) ,\quad i=1,\ldots ,d, \end{aligned}$$

and put

$$\begin{aligned} q_i=r_i\vee p, \quad \gamma _i=\left\{ \begin{array}{l@{\quad }l} \frac{\beta _i\tau (p)}{\tau _i},\quad &{} r_i<p,\\ \beta _i,\quad &{} r_i\ge p. \end{array} \right. \end{aligned}$$
(6.1)

Let \(\tau (p)>0\) and \(\tau _i>0\) for all \(i=1,\ldots , d\); then for any \(p\ge 1\) one has

$$\begin{aligned} \mathbb{N }_{\vec {r},d}\left( \vec {\beta },\vec {L}\right) \subseteq \mathbb{N }_{\vec {q},d}\left( \vec {\gamma },c\vec {L}\right) , \end{aligned}$$
(6.2)

where constant \(c>0\) is independent of \(\vec {L}\) and \(p\).
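The exponents in (6.1) are straightforward to compute from the class parameters; a sketch for finite \(r_i\) (the function name is ours):

```python
def embedding_exponents(beta, r, p):
    """tau(p), (tau_i), (q_i), (gamma_i) of (6.1) for class parameters
    beta = (beta_1,...,beta_d), r = (r_1,...,r_d) and norm index p
    (all taken finite here)."""
    d = len(beta)
    tau_p = 1 - sum((1 / r[j] - 1 / p) / beta[j] for j in range(d))
    tau = [1 - sum((1 / r[j] - 1 / r[i]) / beta[j] for j in range(d))
           for i in range(d)]
    q = [max(r[i], p) for i in range(d)]
    gamma = [beta[i] * tau_p / tau[i] if r[i] < p else beta[i]
             for i in range(d)]
    return tau_p, tau, q, gamma

# e.g. d = 1, beta = 2, r = 2, p = 4: tau(p) = 7/8, tau_1 = 1,
# q_1 = 4 and gamma_1 = 2 * 7/8 = 1.75
```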

6.1.2 Strong maximal function

Let \(g:\mathbb{R }^d\rightarrow \mathbb{R }\) be a locally integrable function. We define the strong maximal function \(g^\star \) of \(g\) by formula

$$\begin{aligned} g^\star (x):= \sup _{H} \frac{1}{|H|} \int _H g(t)\,\mathrm{d}t,\quad x\in \mathbb{R }^d, \end{aligned}$$
(6.3)

where the supremum is taken over all rectangles \(H\subset \mathbb{R }^d\) with sides parallel to the coordinate axes that contain the point \(x\). It is worth noting that the Hardy–Littlewood maximal function is defined by (6.3) with the supremum taken over all cubes with sides parallel to the coordinate axes centered at \(x\).
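On a grid, the strong maximal function can be computed exactly by brute force over all axis-aligned rectangles of cells, using a 2-D prefix sum; a small illustrative sketch for \(d=2\) (ours, not part of the paper's proofs):

```python
import numpy as np

def strong_maximal(g: np.ndarray) -> np.ndarray:
    """Discrete strong maximal function on a 2-D grid: at each cell, the
    largest average of g over axis-aligned rectangles of cells containing it.
    Brute force over O(n^2 m^2) rectangles -- fine for small grids."""
    n, m = g.shape
    S = np.zeros((n + 1, m + 1))
    S[1:, 1:] = np.cumsum(np.cumsum(g, axis=0), axis=1)  # prefix sums of g
    out = np.full_like(g, -np.inf, dtype=float)
    for i0 in range(n):
        for i1 in range(i0 + 1, n + 1):
            for j0 in range(m):
                for j1 in range(j0 + 1, m + 1):
                    avg = (S[i1, j1] - S[i0, j1] - S[i1, j0] + S[i0, j0]) \
                          / ((i1 - i0) * (j1 - j0))
                    view = out[i0:i1, j0:j1]
                    np.maximum(view, avg, out=view)
    return out

# a point mass spreads out: every cell sees a rectangle containing the mass
g = np.zeros((4, 4)); g[1, 2] = 1.0
gs = strong_maximal(g)
assert gs[1, 2] == 1.0 and (gs >= g).all() and (gs > 0).all()
```

The example illustrates the source of the weak-(1,1) failure: even a single point mass makes \(g^\star \) strictly positive everywhere, with level sets of rectangles in every aspect ratio.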

It is well known that the strong maximal operator \(g\mapsto g^\star \) is of strong \((p,p)\)-type for all \(1<p\le \infty \): if \(g\in \mathbb{L }_p(\mathbb{R }^d)\) then \(g^\star \in \mathbb{L }_p(\mathbb{R }^d)\), and there exists a constant \(\bar{C}\) depending on \(p\) only such that

$$\begin{aligned} \Vert g^\star \Vert _p \le \bar{C} \Vert g\Vert _p,\quad p\in (1,\infty ]. \end{aligned}$$

Let \(g^*\) be defined in (4.1). Since obviously \(g^*(x)\le g^\star (x)\) for all \(x\in \mathbb{R }^d\) we have

$$\begin{aligned} \Vert g^*\Vert _p \le \bar{C} \Vert g\Vert _p,\quad p\in (1,\infty ]. \end{aligned}$$
(6.4)

In contrast to the Hardy–Littlewood maximal function, the strong maximal operator is not of weak (1,1)-type. Instead, the following statement holds: there exists a constant \(C\) depending on \(d\) only such that

$$\begin{aligned} \left| \{x: g^\star (x)\ge \alpha \}\right| \le C \int \frac{|g(x)|}{\alpha } \left\{ 1 + \left( \ln _+\frac{|g(x)|}{\alpha }\right) ^{d-1}\right\} \,\mathrm{d}x,\quad \forall \alpha >0.\qquad \end{aligned}$$
(6.5)

We refer to [17] for more details.
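For intuition, a brute-force discrete analogue of the strong maximal function (6.3) can be computed on a small grid. In the sketch below (illustrative only) the supremum over rectangles is replaced by a maximum over all index boxes containing a given cell; it also exhibits the pointwise domination \(g^\star \ge g\) for nonnegative \(g\).

```python
# Brute-force discrete analogue of the strong maximal function (6.3):
# the supremum over rectangles is replaced by a maximum over all index
# boxes [i1, i2] x [j1, j2] containing the cell (x, y). Feasible only
# for tiny grids; illustration, not part of the paper.

def strong_maximal(g):
    n, m = len(g), len(g[0])
    out = [[0.0] * m for _ in range(n)]
    for x in range(n):
        for y in range(m):
            best = g[x][y]          # the single cell is itself a box
            for i1 in range(x + 1):
                for i2 in range(x, n):
                    for j1 in range(y + 1):
                        for j2 in range(y, m):
                            area = (i2 - i1 + 1) * (j2 - j1 + 1)
                            s = sum(g[i][j]
                                    for i in range(i1, i2 + 1)
                                    for j in range(j1, j2 + 1))
                            best = max(best, s / area)
            out[x][y] = best
    return out

g_demo = [[0.0, 1.0, 0.0],
          [2.0, 0.0, 0.0],
          [0.0, 0.0, 4.0]]
g_star = strong_maximal(g_demo)     # dominates g_demo pointwise
```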

6.2 Approximation properties of kernel \(K\)

The next lemma establishes an upper bound on the norm of the bias \(B_h(f,\cdot )\) of the kernel estimator \(\hat{f}_h\) when \(f\) belongs to the anisotropic Nikol’skii class.

Lemma 3

Let \(f\in \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L})\). Let \(\hat{f}_h\) be the estimator (2.1) associated with kernel (3.3) with \(\ell > \max _{j=1,\ldots ,d}\beta _j\). Then \(B_h(f,x)\) can be represented as the sum \(B_h(f, x)=\sum _{j=1}^d B_{h,j}(f,x)\) with functions \(B_{h,j}(f,x)\) satisfying the following inequalities:

$$\begin{aligned} \left\| B_{h,j}(f, \cdot )\right\| _{r_j} \le C_1L_j h_j^{\beta _j},\quad \forall j=1,\ldots ,d. \end{aligned}$$
(6.6)

Moreover, if \(s \ge 1\), then for any \(p\ge 1\)

$$\begin{aligned} \left\| B_{h,j}(f,\cdot )\right\| _{q_j} \le C_2 L_j h_j^{\gamma _j} ,\quad \forall j=1,\ldots , d, \end{aligned}$$
(6.7)

where \(\vec {\gamma }=\vec {\gamma }(p)\) and \(\vec {q}=\vec {q}(p)\) are defined in (6.1). Here \(C_1\) and \(C_2\) are constants independent of \(\vec {L}\) and \(p\).

6.3 Proof outline and notation

The starting point of our proof is the pointwise oracle inequality (2.14) together with the moment bound (2.17). Denote

$$\begin{aligned} \bar{U}_f(x)= \inf _{h\in \mathcal{H }}\left\{ \bar{B}_h(f,x)+\sup _{\eta \ge h} M_\eta (K\vee Q, x)\right\} ; \end{aligned}$$
(6.8)

then, taking into account that \(M_\eta (K\vee Q, x)\) is greater than \(M_\eta (K,x)\) and \(M_\eta (Q,x)\) for any \(x\) and \(\eta \) [see (2.6) and (2.9)], and using (2.14), we have

$$\begin{aligned} |{\hat{f}}(x)-f(x)| \le c_0 [\bar{U}_f(x) + \omega (x)], \end{aligned}$$

where \(c_0\) is an absolute constant, and \(\omega (x):=\zeta (x)+\chi (x)\) with \(\zeta (x)\) and \(\chi (x)\) defined in (2.15) and (2.16). Therefore, by (2.17) applied with \(q=p\) and by the Fubini theorem, there exists constant \(\bar{c}_0>0\) such that for any probability density \(f\) and any Borel set \(\mathcal{A }\subseteq \mathbb{R }^d\) one has

$$\begin{aligned} \mathbb{E }_f\int _{\mathcal{A }}|\hat{f}(x)-f(x)|^{p}\,\mathrm{d}x \;\le \; \bar{c}_0\left[ \int _{\mathcal{A }}\bar{U}_f^p(x)\,\mathrm{d}x +n^{-p/2}\right] . \end{aligned}$$
(6.9)

Recall that \(\mathrm{k}_\infty =\Vert K\Vert _\infty \vee 1\); by definition of \(\bar{B}_h(f,x)\) [see (2.13)] and by Lemma 3 one has

$$\begin{aligned} \bar{B}_h(f,x)\le \mathrm{k}_\infty \sum _{j=1}^d B_{h,j}^*(f,x), \end{aligned}$$

where \(B_{h,j}^*(f,\cdot )\) is the strong maximal function of \(|B_{h,j}(f,\cdot )|,\,j=1,\ldots ,d\). Therefore if we let

$$\begin{aligned} U_f(x):= \inf _{h\in \mathcal{H }}\left\{ \max _{j=1,\ldots ,d} B^*_{h,j}(f,x) + \sup _{\eta \ge h} M_\eta (K\vee Q, x)\right\} , \end{aligned}$$
(6.10)

then

$$\begin{aligned} \bar{U}_f(x)\le \mathrm{k}_\infty U_f(x),\quad \forall x\in \mathbb{R }^d. \end{aligned}$$
(6.11)

The key element of the proof is derivation of upper bounds on the integral

$$\begin{aligned} J:=\int _{\mathbb{R }^d} U_f^p(x)\,\mathrm{d}x. \end{aligned}$$

These bounds will be established by dividing \(\mathbb{R }^d\) into “slices” and by an appropriate choice of the bandwidth \(h\in \mathcal{H }\) on every “slice”. For this purpose the following bounds on the norms of \(B^*_{h,j}(f, \cdot )\) will be used. Inequality (6.4) and the first assertion of Lemma 3 imply that for any \(p>1,\,\vec {r}\in (1,\infty ]^{d}\) and any \(f\in \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L})\) one has

$$\begin{aligned}&\left\| B^*_{h,j}(f, \cdot )\right\| _{r_j} \le \bar{c}_1L_j h_j^{\beta _j},\quad \forall j=1,\ldots , d. \end{aligned}$$
(6.12)

Moreover, if \(s\ge 1\) then, by the second assertion of Lemma 3, for any \(p>1,\,\vec {r}\in (1,\infty ]^{d}\) and \(f\in \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L})\)

$$\begin{aligned}&\left\| B^*_{h,j}(f,\cdot )\right\| _{q_j} \le \bar{c}_2 L_j h_j^{\gamma _j},\quad \forall j=1,\ldots ,d. \end{aligned}$$
(6.13)

Let \(\delta :=\ln n/n,\,\varphi := (L_\beta \delta )^{\beta /(2\beta +1)}\). Let \(m_0(\theta ),\;\theta \in (0,1],\) be an integer to be specified later; see (6.19) below. For \(m\in \mathbb{Z },\,m\ge m_0(\theta )\) define the “slices”

$$\begin{aligned} \mathcal{X }_m&:= \left\{ x\in \mathbb{R }^d: 2^m \varphi < U_f(x) \le 2^{m+1} \varphi \right\} ,\\ \mathcal{X }^-_{m_0(\theta )}&:= \left\{ x\in \mathbb{R }^d: U_f(x) \le 2^{m_0(\theta )}\varphi \right\} , \end{aligned}$$

and consider the corresponding integrals

$$\begin{aligned} J_m:=\int _{\mathcal{X }_m} U_f^p(x)\,\mathrm{d}x,\quad J_{m_0}^-:=\int _{\mathcal{X }_{m_0(\theta )}^-} U_f^p(x)\,\mathrm{d}x. \end{aligned}$$

With this notation, using (6.9) and (6.11) we can write

$$\begin{aligned} \mathbb{E }_f\Vert \hat{f}-f\Vert _p^p&\le \mathbb{E }_f\int _{\mathcal{X }_{m_0(\theta )}^{-}} |\hat{f}(x)-f(x)|^p\,\mathrm{d}x + \tilde{c}_1 \sum _{m=m_0(\theta )}^\infty \int _{\mathcal{X }_m} U^p_f(x)\,\mathrm{d}x + \tilde{c}_2n^{-p/2} \nonumber \\&=: J_{m_0(\theta )}^- +\;\tilde{c}_1 \sum _{m=m_0(\theta )}^\infty J_m \;+\; \tilde{c}_2n^{-p/2}. \end{aligned}$$
(6.14)

The rest of the proof consists of bounding the integrals \(J_{m_0(\theta )}^-\) and \(J_m\) on the right hand side of (6.14) and combining these bounds in different zones.

The following notation will be used in the subsequent proof. For the sake of brevity we will write

$$\begin{aligned} M_\eta (x):=M_\eta (K\vee Q, x),\quad A_\eta (x):=A_\eta (K\vee Q,x),\quad \forall \eta \in \mathcal{H }. \end{aligned}$$

We let \(I:=\{1,\ldots ,d\}\), and

$$\begin{aligned} I_+\!:=\!\{j\in I: p\!\le \! r_j<\infty \},\quad I_-\!:=\!\{j\in I:1\!<\!r_j\!<\!p\},\quad I_\infty \!:=\!\{j\in I:r_j\!=\!\infty \}. \end{aligned}$$

With \(\vec {\gamma }=(\gamma _1,\ldots ,\gamma _d)\) and \(\vec {q}=(q_1,\ldots ,q_d)\) given by (6.1) we define quantities \(\gamma ,\,\upsilon \) and \(L_\gamma \) by the formulas

$$\begin{aligned} \frac{1}{\gamma }:=\sum _{j=1}^d \frac{1}{\gamma _j},\quad \frac{1}{\upsilon }:=\sum _{j=1}^d \frac{1}{\gamma _j q_j},\quad L_\gamma :=\prod _{j=1}^dL_j^{1/\gamma _j}. \end{aligned}$$
(6.15)

We note some useful inequalities between the quantities defined above. First, \(\gamma _j < \beta _j\) for all \(j\in I_-\), which is a consequence of the fact that \(\tau (p)<\tau _j\) for \(j\in I_-\). This implies

$$\begin{aligned} \frac{1}{\gamma } - \frac{1}{\beta } = \sum _{j\in I_-} \left( \frac{1}{\gamma _j}-\frac{1}{\beta _j}\right) >0. \end{aligned}$$
(6.16)

Next, if \(s\ge 1\) then

$$\begin{aligned} \frac{1}{s} > \frac{1}{\upsilon }. \end{aligned}$$
(6.17)

We have

$$\begin{aligned} \frac{1}{s} - \frac{1}{\upsilon } = \sum _{j\in I_-} \left( \frac{1}{\beta _jr_j}-\frac{1}{\gamma _j p}\right) =\sum _{j\in I_-}\frac{1}{\beta _j}\left( \frac{1}{r_j}-\frac{\tau _j}{\tau (p)p}\right) . \end{aligned}$$

Hence (6.17) will be proved if we show that \(r_j^{-1}\tau (p)p\ge \tau _j\) for all \(j\in I_-\). Indeed,

$$\begin{aligned} \frac{\tau (p)p}{r_j}=\frac{p(1-1/s)+1/\beta }{r_j}\ge 1-\frac{1}{s}+\frac{1}{\beta r_j}=\tau _j, \end{aligned}$$

where the inequality uses that \(r_j\le p\) for any \(j\in I_-\) and that \(s\ge 1\). Finally, remark also that

$$\begin{aligned} p-\upsilon (2+1/\gamma )<0. \end{aligned}$$
(6.18)

Indeed, since \(r_j\ge p\) for any \(j\in I_+\cup I_\infty \),

$$\begin{aligned} \frac{p}{\upsilon }=\sum _{j\in I_+}\frac{p}{\beta _jr_j}+\sum _{j\in I_-} \frac{1}{\gamma _j}\le \sum _{j\in I_+}\frac{1}{\beta _j}+\sum _{j\in I_-}\frac{1}{\gamma _j} =\frac{1}{\gamma }. \end{aligned}$$

This yields \(p\le \upsilon /\gamma \), so that \(p-\upsilon (2+1/\gamma )\le -2\upsilon <0\), and (6.18) follows.
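Inequalities (6.16)-(6.18) are easy to check numerically. The following self-contained sketch does so for one hypothetical parameter set with \(d=2\) and \(s\ge 1\); the values of \(\vec \beta \), \(\vec r\) and \(p\) are illustrative only and are not taken from the paper.

```python
# Numerical check of (6.16)-(6.18) for one hypothetical parameter set
# with d = 2 and s >= 1; beta, r, p are illustrative only.
beta = [1.5, 2.0]
r    = [2.0, 4.0]
p    = 3.0

inv_s    = sum(1 / (b * ri) for b, ri in zip(beta, r))    # 1/s
inv_beta = sum(1 / b for b in beta)                       # 1/beta

# quantities from (6.1)
tau_p = 1 - sum((1 / b) * (1 / rj - 1 / p) for b, rj in zip(beta, r))
tau   = [1 - sum((1 / b) * (1 / rj - 1 / ri) for b, rj in zip(beta, r))
         for ri in r]
q     = [max(ri, p) for ri in r]
gamma = [b * tau_p / ti if ri < p else b
         for b, ri, ti in zip(beta, r, tau)]

# quantities from (6.15)
inv_gamma   = sum(1 / gi for gi in gamma)                        # 1/gamma
inv_upsilon = sum(1 / (gi * qi) for gi, qi in zip(gamma, q))     # 1/upsilon
```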

6.4 Auxiliary results

For \(\theta \in (0,1]\) and for some constant \(\hat{c}_1>0\) define

$$\begin{aligned} m_0(\theta ):=\min \left\{ m\in \mathbb{Z }: 2^{m\left( \frac{1-\theta /s+1/\beta }{1+\theta /s} \right) } >\hat{c}_1\varkappa \varphi \right\} . \end{aligned}$$
(6.19)

Note that \(1-\theta /s+1/\beta >0\) for any \(\theta \in (0,1]\), since \(r_j\ge 1,\,j=1,\ldots , d\), implies \(s\ge \beta \). Therefore \(m_0(\theta )<0\) for all \(n\) large enough.

It will be convenient to introduce the following notation

$$\begin{aligned} m_1:= \min \left\{ m\in \mathbb{Z }: 2^{m\left[ \upsilon (2+1/\gamma )-s(2+1/\beta )\right] }\ge \left( L_\gamma /L_\beta \right) ^{\upsilon } \varphi ^{\upsilon (1/\beta -1/\gamma )}\right\} . \end{aligned}$$
(6.20)

It follows from this definition that

$$\begin{aligned} \left[ \left( L_\gamma /L_\beta \right) \varphi ^{1/\beta -1/\gamma } \right] ^{\frac{\upsilon }{\upsilon (2+1/\gamma )\!-\!s(2+1/\beta )}}\le 2^{m_1}\le 2\left[ \left( L_\gamma /L_\beta \right) \varphi ^{1/\beta -1/\gamma } \right] ^{\frac{\upsilon }{\upsilon (2+1/\gamma )-s(2+1/\beta )}}.\nonumber \\ \end{aligned}$$
(6.21)

In view of (6.16) and (6.17)

$$\begin{aligned} \upsilon \left( 2+\frac{1}{\gamma }\right) - s\left( 2+\frac{1}{\beta }\right) = s\upsilon \left[ \left( 2+\frac{1}{\beta }\right) \left( \frac{1}{s}- \frac{1}{\upsilon }\right) + \frac{1}{s}\left( \frac{1}{\gamma }-\frac{1}{\beta }\right) \right] >0;\nonumber \\ \end{aligned}$$
(6.22)

hence \(m_1>1\) for large \(n\).

The bounds on \(J_{m_0(\theta )}^-\) and \(J_m\) are given in the next two propositions.

Proposition 1

There exist constants \(\hat{c}_1,\hat{c}_2>0\) and \({\hat{C}}_1,\hat{C}_2>0\) such that for any \(n\) large enough the following statements hold.

  • (i) For any probability density \(f\) and any \(m_0(1)\le m\le 0\)

    $$\begin{aligned}&J_m \le \hat{C}_1\, 2^{m\left( p-\frac{2+1/\beta }{1+1/s}\right) } \varphi ^p. \end{aligned}$$
    (6.23)
  • (ii) Let \(f\in \mathbb{G }_\theta (R),\,\theta \in (0,1]\); then for any \(m_0(\theta )\le m\le 0\) one has

    $$\begin{aligned}&J_m \le \hat{C}_1\, 2^{m\left( p-\frac{2+1/\beta }{1/\theta +1/s}\right) } \varphi ^p. \end{aligned}$$
    (6.24)
  • (iii) For any \(m\in \mathbb{Z }\) satisfying \(1\le 2^{m} \le \hat{c}_2\varphi ^{-1}\) and any probability density \(f\) one has

    $$\begin{aligned}&J_m\;\le \;\hat{C}_2 2^{m [p- s(2+1/\beta )]} \varphi ^p. \end{aligned}$$
    (6.25)
  • (iv) Let \(s\ge 1\); then for any \(m\in \mathbb{Z }\) such that \( m\ge m_1, \; 2^{m} \le \hat{c}_2\varphi ^{-1} \) and any probability density \(f\) one has

    $$\begin{aligned}&J_m \;\le \;\hat{C}_2\varphi ^{p} \left[ \frac{L_\gamma \varphi ^{1/\beta }}{L_\beta \varphi ^{1/\gamma }}\right] ^{\upsilon } 2^{m\left[ p-\upsilon (2+1/\gamma )\right] }. \end{aligned}$$
    (6.26)

Proposition 2

There exist constants \(\hat{C}_3, \hat{C}_4>0\) such that the following statements hold.

  • (i) Let \(\nu \) be defined in (3.4). Then for all large enough \(n\) and for any density \(f\) one has

    $$\begin{aligned} J^{-}_{m_0(1)}\;=\;\mathbb{E }_f\int _{\mathcal{X }^-_{m_0(1)}} \left| \hat{f}(x)-f(x)\right| ^{p}\,\mathrm{d}x\;\le \; \hat{C}_3\pi _n(L_\beta \delta )^{p\nu }, \end{aligned}$$
    (6.27)

    where \(\pi _n=\ln ^d(n)\) if \(p\le \frac{2+1/\beta }{1+1/s}\) and \(\pi _n=1\) otherwise.

  • (ii) Let \(\nu (\theta )\) be defined in (4.3). Then for any \(\theta \in (0,1)\) and for all \(n\) large enough

    $$\begin{aligned} \sup _{f\in \mathbb{G }_\theta (R)}\mathbb{E }_f\int _{\mathcal{X }_{m_0(\theta )}^-} \left| \hat{f}(x)-f(x)\right| ^{p}\,\mathrm{d}x\le \hat{C}_4(L_\beta \delta )^{p\nu (\theta )}. \end{aligned}$$
    (6.28)

6.5 Proof of Theorem 2

Using (6.14) and inequality (6.27) of Proposition 2 we obtain

$$\begin{aligned} \mathbb{E }_f\Vert \hat{f}-f\Vert _p^p \;\le \; c_1\pi _n \left( L_\beta \delta \right) ^{p\nu } + c_2\sum _{m=m_0(1)}^\infty J_m. \end{aligned}$$
(6.29)

We proceed by bounding the second term on the right-hand side of the last display. First, because \(\Vert f\Vert _\infty \le M\),

$$\begin{aligned} \max _{j=1,\ldots ,d} \Vert B^*_{h,j}(f,\cdot )\Vert _\infty \le 2^{d}M\mathrm{k}^2_{\infty },\quad \sup _{\eta >0}\Vert A_\eta \Vert _\infty \le 2^{d}M\mathrm{k}^2_{\infty }. \end{aligned}$$

This implies that there exists a constant \(c_3>0\) with the following property: setting

$$\begin{aligned} m_2:=\min \{m\in \mathbb{Z }: 2^m \ge c_3\varphi ^{-1}\}, \end{aligned}$$

one has \(J_m=0\) for all \(m\ge m_2\). Thus the sum on the right-hand side of (6.29) extends from \(m_0(1)\) to \(m_2\) only.

\(1^0\). Tail zone: \(p<\frac{2+1/\beta }{1+1/s}\). Using bounds (6.23) and (6.25) of Proposition 1, we obtain

$$\begin{aligned} \sum _{m=m_0(1)}^\infty J_m&\le c_4\varphi ^p\left[ \sum _{m=m_0(1)}^{0} 2^{m(p-\frac{2+1/\beta }{1+1/s})} + \sum _{m=1}^{m_2} 2^{m[p-s(2+1/\beta )]}\right] \\&\le c_5\,\varphi ^p 2^{m_0(1)(p-\frac{2+1/\beta }{1+1/s})}, \end{aligned}$$

where the last inequality follows from the fact that \(m_0(1)<0\) and \(p<\frac{2+1/\beta }{1+1/s}<s(2+1/\beta )\). Using (6.19), after straightforward algebra we obtain that

$$\begin{aligned} \sum _{m=m_0(1)}^\infty J_m \le c_6 \left( L_\beta \delta \right) ^{\frac{p-1}{1+1/\beta -1/s}}\le c_6~\left( L_\beta \delta \right) ^{p\nu }. \end{aligned}$$

\(2^0\). Dense zone: \(\frac{2+1/\beta }{1+1/s} < p< s(2+\frac{1}{\beta })\). Because \(p>\frac{2+1/\beta }{1+1/s}\), by Proposition 1, inequality (6.23) with \(\theta =1\),

$$\begin{aligned} \sum _{m=m_0(1)}^0 J_m \;\le \; c_{7} \varphi ^p \sum _{m=m_0(1)}^0 2^{m\left( p-\frac{2+1/\beta }{1+1/s}\right) } \le c_{8}\,\varphi ^p = c_{8} (L_\beta \delta )^{\frac{p\beta }{2\beta +1}}. \end{aligned}$$
(6.30)
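For completeness, the second inequality in (6.30) is just the elementary geometric-series bound: with \(a:=p-\frac{2+1/\beta }{1+1/s}>0\),

$$\begin{aligned} \sum _{m=m_0(1)}^0 2^{ma}\le \sum _{m=-\infty }^{0} 2^{ma}=\frac{1}{1-2^{-a}}<\infty . \end{aligned}$$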

Furthermore, because \(p<s(2+\frac{1}{\beta })\) we have by Proposition 1, inequality (6.25), that

$$\begin{aligned} \sum _{m=1}^{m_2} J_m \;\le \; c_{9} \varphi ^p \sum _{m=1}^{m_2} 2^{m\left( p-s\left( 2+\frac{1}{\beta }\right) \right) } = c_{10} (L_\beta \delta )^{\frac{p\beta }{2\beta +1}}. \end{aligned}$$

Thus, in the dense zone

$$\begin{aligned} \sum _{m=m_0(1)}^{m_2} J_m \le c_{11} (L_\beta \delta )^{\frac{p\beta }{2\beta +1}}\le c_{11}(L_\beta \delta )^{p\nu } . \end{aligned}$$

\(3^{0}\). Sparse zone: \(p>s(2+1/\beta ),\,s<1\). First we note that the bound in (6.30) remains true, since \(p>s(2+1/\beta )\) implies \(p>\frac{2+1/\beta }{1+1/s}\). For the same reason, in view of Proposition 1, inequality (6.25),

$$\begin{aligned} \sum _{m=1}^{m_2} J_m\le c_{12} \varphi ^p 2^{m_2\left( p-s\left( 2\!+\!\frac{1}{\beta }\right) \right) } \!\le \! c_{13} \varphi ^{s\left( 2+\frac{1}{\beta }\right) } \!=\! c_{13} (L_\beta \delta )^{s}\le c_{13}(L_\beta \delta )^{p\nu }.\qquad \end{aligned}$$
(6.31)

Here we have used the definition of \(m_2\). It remains to note that the conditions \(p>s(2+1/\beta )\) and \(s<1\) imply that \(\varphi ^{p}\delta ^{-s}\rightarrow 0\) as \(n\rightarrow \infty \). Therefore the statement of the theorem follows from (6.30) and (6.31).

\(4^{0}\). Sparse zone: \(p>s(2+1/\beta ),\,s\ge 1\). We need to bound only \(\sum _{m=1}^{m_2} J_m\), because (6.30) remains true. By inequality (6.25) of Proposition 1 and because \(p>s(2+1/\beta )\)

$$\begin{aligned}&\sum _{m=1}^{m_1} J_m\le c_{14}\varphi ^p2^{m_1\left( p-s\left( 2+\frac{1}{\beta }\right) \right) }. \end{aligned}$$

Next, we have in view of the inequality (6.26) of Proposition 1

$$\begin{aligned} \sum _{m=m_1+1}^{m_2} J_m\le c_{15} \varphi ^{p}\left[ \frac{L_\gamma \varphi ^{1/\beta }}{L_\beta \varphi ^{1/\gamma }}\right] ^{\upsilon } \sum _{m=m_1+1}^{m_2}2^{m\left[ p-\upsilon (2+1/\gamma )\right] }. \end{aligned}$$

Since \(p-\upsilon (2+1/\gamma )<0\) [see (6.18)],

$$\begin{aligned} \sum _{m=m_1+1}^{m_2} J_m\le c_{16} \varphi ^{p}\left[ \frac{L_\gamma \varphi ^{1/\beta }}{L_\beta \varphi ^{1/\gamma }}\right] ^{\upsilon } 2^{m_1\left( p-\upsilon [2+1/\gamma ]\right) } \le c_{16}\varphi ^{p}2^{m_1\left( p-s[2+1/\beta ]\right) }. \end{aligned}$$

In order to obtain the second inequality we have used (6.21). Thus,

$$\begin{aligned} \sum _{m=1}^{m_2} J_m \le c_{17}\varphi ^{p}2^{m_1\left[ p-s(2+1/\beta )\right] }. \end{aligned}$$

Using equality (6.22) and (6.21) we obtain

$$\begin{aligned} \sum _{m=1}^{m_2} J_m\le c_{20} \left( L_\gamma /L_\beta \right) ^{\frac{p-s(2+1/\beta )}{s(2+1/\beta )(1/s-1/\upsilon )+(1/\gamma -1/\beta )}} (L_\beta \delta )^{\frac{p(1/s-1/\upsilon )+1/\gamma -1/\beta }{(2+1/\beta )(1/s-1/\upsilon )+(1/\gamma -1/\beta )s^{-1}}}. \end{aligned}$$

The statement of the theorem is now obtained by the following routine computations. Denote

$$\begin{aligned} A=\frac{1}{s_-}-\frac{1}{p\beta _-},\quad \frac{1}{s_-}=\sum _{j\in I_-}\frac{1}{\beta _jr_j},\quad \frac{1}{\beta _-}=\sum _{j\in I_-}\frac{1}{\beta _j},\quad \frac{1}{\gamma _-}=\sum _{j\in I_-}\frac{1}{\gamma _j}. \end{aligned}$$

First, we remark that

$$\begin{aligned}&p\left( \frac{1}{s}\!-\!\frac{1}{\upsilon }\right) + \frac{1}{\gamma }-\frac{1}{\beta }\!=\!\frac{p}{s_-} - \frac{1}{\gamma _-} + \frac{1}{\gamma _-} - \frac{1}{\beta _-}\!=\!\frac{p}{s_-} - \frac{1}{\beta _-}=Ap.\qquad \end{aligned}$$
(6.32)

Next,

$$\begin{aligned} \frac{1}{\gamma _-}=\sum _{j\in I_-}\frac{\tau _j}{\tau (p)\beta _j}&= \frac{1}{\tau (p)} \sum _{j\in I_-}\frac{1}{\beta _j}[1-1/s+1/(r_j\beta )]= \frac{1-1/s}{\tau (p)\beta _-}+\frac{1}{\tau (p)\beta s_-}\\&= \frac{1-1/s}{\tau (p)\beta _-}+\frac{1}{\tau (p)\beta } \left( \frac{1}{s_-}-\frac{1}{p\beta _-}\right) +\frac{1}{\tau (p)\beta p\beta _-}\\&= \frac{1}{\tau (p)\beta _-}\left( 1-\frac{1}{s}+\frac{1}{p\beta }\right) +\frac{A}{\tau (p)\beta } =\frac{1}{\beta _-}+\frac{A}{\tau (p)\beta }. \end{aligned}$$

Hence, \(1/\gamma -1/\beta =1/\gamma _{-} -1/\beta _{-}=A/(\tau (p)\beta )\), which implies that

$$\begin{aligned} \frac{1}{s}-\frac{1}{\upsilon }=\frac{1}{s_{-}}-\frac{1}{p\gamma _{-}} =A+\frac{1}{p}\left( \frac{1}{\beta _{-}}-\frac{1}{\gamma _{-}}\right) = A\left( 1-\frac{1}{p\tau (p)\beta }\right) . \end{aligned}$$

Two last equalities yield

$$\begin{aligned} \left( 2+\frac{1}{\beta }\right) \left( \frac{1}{s}-\frac{1}{\upsilon }\right) + \left( \frac{1}{\gamma }-\frac{1}{\beta }\right) \frac{1}{s}&= \frac{A}{\tau (p)}\left[ \left( 2+\frac{1}{\beta }\right) \left( \tau (p)-\frac{1}{p\beta }\right) +\frac{1}{s\beta }\right] \\&= \frac{A}{\tau (p)}\left( 2+\frac{1}{\beta }-\frac{2}{s\beta }\right) , \end{aligned}$$

where the last equality follows from the fact that \(\tau (p)-1/(p\beta )=1-1/s\). This, together with (6.32), leads to the statement of the theorem in the sparse zone.

\(5^0\). Boundary zones: \(p=s(2+\frac{1}{\beta })\) and \(p=\frac{2+1/\beta }{1+1/s}\). Here the proof coincides with the proof for the dense zone, with the only difference that the corresponding sums are equal to \(|m_1|\) and \(m_2\), respectively. \(\square \)

6.6 Proof of statement (i) of Theorem 4

In view of (6.14) and by bound (6.28) of Proposition 2,

$$\begin{aligned} \mathbb{E }_f\Vert \hat{f}-f\Vert _p^p&\le c_1\left( L_\beta \delta \right) ^{p\nu (\theta )} + c_2\sum _{m=m_0(\theta )}^\infty J_m. \end{aligned}$$

If \(p<\frac{2+1/\beta }{1/\theta +1/s}\) then, using bounds (6.24) and (6.25) of Proposition 1, we have

$$\begin{aligned} \sum _{m=m_0(\theta )}^\infty J_m&\le c_3\varphi ^p \sum _{m=m_0(\theta )}^{m_2} 2^{m\left( p-\frac{2+1/\beta }{1/\theta +1/s}\right) } \le c_4\,\varphi ^p 2^{m_0(\theta )\left( p-\frac{2+1/\beta }{1/\theta +1/s}\right) }\\&= c_5(L_\beta \delta )^{\frac{p-\theta }{1-\theta /s+1/\beta }}, \end{aligned}$$

and the assertion of the theorem follows. If \(s(2+1/\beta )\ge p\ge \frac{2+1/\beta }{1/\theta +1/s}\) then

$$\begin{aligned} \sum _{m=m_0(\theta )}^\infty J_m \le c_6\mu _n^{p}(\theta )\varphi ^p\le c_7\mu _n^{p}(\theta )\left( L_\beta \delta \right) ^{p\nu (\theta )}. \end{aligned}$$

7 Proofs of Theorem 3, statement (ii) of Theorem 4 and the lower bound in (4.4)

The proof is organized as follows. First, we formulate two auxiliary statements, Lemmas 4 and 5. Second, we present a general construction of a finite set of functions employed in the proof of lower bounds. Then we specialize the constructed set of functions in different regimes and derive the announced lower bounds.

7.1 Auxiliary lemmas

The first statement given in Lemma 4 is a simple consequence of Theorem 2.4 from [34]. Let \(\mathbb{F }\) be a given set of probability densities.

Lemma 4

Assume that for any sufficiently large integer \(n\) one can find a positive real number \(\rho _n\) and a finite subset of functions \(\{f^{(0)}, f^{(j)},\;j\in \mathcal{J }_n\}\subset \mathbb{F }\) such that

$$\begin{aligned}&\left\| f^{(i)}- f^{(j)}\right\| _p \ge 2\rho _n,\quad \; \forall i, j\in \mathcal{J }_n\cup \{0\}:\;i\ne j;\end{aligned}$$
(7.1)
$$\begin{aligned}&\limsup _{n\rightarrow \infty }\frac{1}{|\mathcal{J }_n|^{2}} \sum _{j\in \mathcal{J }_n}\mathbb{E }_{f^{(0)}}\left\{ \frac{\mathrm{d}\mathbb{P }_{f^{(j)}}}{\mathrm{d}\mathbb{P }_{f^{(0)}}}(X^{(n)})\right\} ^{2}=:C <\infty . \end{aligned}$$
(7.2)

Then for any \(q\ge 1\)

$$\begin{aligned} \liminf _{n\rightarrow \infty } \inf _{\tilde{f}}\; \sup _{f \in \mathbb{F }} \rho ^{-1}_n\left( \mathbb{E }_f \left\| \tilde{f} - f\right\| ^{q}_p\right) ^{1/q} \ge \left( \sqrt{C} +\sqrt{C+1} \right) ^{-2/q}, \end{aligned}$$

where the infimum on the left-hand side is taken over all possible estimators.

We will apply Lemma 4 with \(\mathbb{F }= \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)\) in the proof of Theorem 3 and with \(\mathbb{F }=\mathbb{G }_\theta (R)\cap \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)\) in the proof of statement (ii) of Theorem 4.

Next we quote the Varshamov–Gilbert lemma (see, e.g., Lemma 2.9 in [34]).

Lemma 5

(Varshamov–Gilbert) Let \(\varrho _m\) be the Hamming distance on \(\{0,1\}^m,\,m\in \mathbb{N }^*\), i.e.

$$\begin{aligned} \varrho _m(a,b)=\sum _{j=1}^m \mathbf{1}\left\{ a_j\ne b_j\right\} ,\quad a,b\in \{0,1\}^m. \end{aligned}$$

For any \(m\ge 8\) there exists a subset \(\mathcal{P }_m\) of \(\{0,1\}^m\) such that \(|\mathcal{P }_m|\ge 2^{m/8}\), and

$$\begin{aligned} \varrho _m(a,a^\prime )\ge \frac{m}{8},\quad \forall a,a^\prime \in \mathcal{P }_m:\; a\ne a^\prime . \end{aligned}$$
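For small \(m\), the conclusion of Lemma 5 can be illustrated by a greedy search; the sketch below is only an illustration (the lemma itself is proved by a counting argument in [34]), and for \(m=8\) the distance requirement \(m/8=1\) is mild.

```python
from itertools import product

def greedy_separated_set(m):
    """Greedily collect binary strings of length m whose pairwise
    Hamming distance is at least m/8 (an illustration of Lemma 5,
    not its proof)."""
    chosen = []
    for cand in product((0, 1), repeat=m):
        if all(sum(a != b for a, b in zip(cand, c)) >= m / 8
               for c in chosen):
            chosen.append(cand)
    return chosen

P = greedy_separated_set(8)   # m = 8 is the smallest m covered by the lemma
```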

7.2 Proof of Theorem 3. General construction of a finite set of functions

1\(^0\). For any \(t\in \mathbb{R }\) set

$$\begin{aligned} \Lambda (t)= \left( \,\,\int _{-1}^1 e^{-1/(1-u^2)}\,\mathrm{d}u\right) ^{-1} e^{-1/(1-t^2)}\; \mathbf{1 }_{[-1,1]}(t). \end{aligned}$$

Note that \(\Lambda \) is a probability density compactly supported on \([-1,1]\) and infinitely differentiable on the real line, \(\Lambda \in \mathbb{C }^{\infty }(\mathbb{R }^1)\). Obviously, for any \(\alpha >0\) and \(r\ge 1\) there exists constant \(c_1=c_1(\alpha ,r)<\infty \) such that

$$\begin{aligned} \Lambda \in \mathbb{N }_{r,1}(\alpha ,c_1). \end{aligned}$$
(7.3)

Define

$$\begin{aligned} \bar{f}^{(0)}(x)= \prod _{l=1}^d\left[ \frac{1}{N} \int _{\mathbb{R }^1}\Lambda (y-x_l)\mathbf{1 }_{[-\frac{N}{2},\frac{N}{2}]}(y)\,\mathrm{d}y\right] ,\quad x=(x_1,\ldots ,x_d)\in \mathbb{R }^d, \end{aligned}$$

where the parameter \(N=N(n)>8\) will be chosen later. By construction, \(\bar{f}^{(0)}\) is a probability density for any choice of \(N\); moreover, \(\mathrm{supp}(\bar{f}^{(0)})=[-N/2-1, N/2+1]^d\), and

$$\begin{aligned} \bar{f}^{(0)}(x)=N^{-d},\quad \forall x \in \left[ -N/2+1, N/2-1\right] ^d. \end{aligned}$$
(7.4)

Moreover, in view of (7.3) and by the Young inequality, there exists a vector of constants \(\vec {C}=(\tilde{C}_1,\ldots ,\tilde{C}_d)\), depending on \(\vec {\beta }\) and \(\vec {r}\) only, such that

$$\begin{aligned} \bar{f}^{(0)}\in \mathbb{N }_{\vec {r},d}\left( \vec {\beta },\vec {C}\right) \!. \end{aligned}$$
(7.5)

Note that \(\vec {C}\) does not depend on \(N\).

Let \(L_0>0\) be fixed, and let \(f^{(0)}(x)=\varkappa ^{d}\bar{f}^{(0)}(x\varkappa )\), where \(\varkappa >0\) is chosen in such a way that \(f^{(0)}\) belongs to the class \(\mathbb{N }_{\vec {r},d}(\vec {\beta },2^{-1}\vec {L}_0)\), where \(\vec {L}_0=(L_0,\ldots ,L_0)\). The existence of such \(\varkappa \) independent of \(N\) and determined by \(\vec {\beta },\,\vec {r}\) and \(L_0\) is guaranteed by (7.5). Note also that \(f^{(0)}\) is a probability density. Moreover, we remark that \(\Vert \bar{f}^{(0)}\Vert _\infty \le N^{-d}\) since \(\int |\Lambda |=1\).

Thus,

$$\begin{aligned} f^{(0)}\in \mathbb{N }_{\vec {r},d}(\vec {\beta }, \vec {L}_0/2, M/2), \end{aligned}$$
(7.6)

provided that \(N>(2M^{-1})^{1/d}\varkappa \). This condition is assumed to be fulfilled.

\(2^{0}\). Put for any \(t\in \mathbb{R }^1\)

$$\begin{aligned} g(t)= \int _{\mathbb{R }^1}\Lambda (y-t)\left[ \mathbf 1 _{[0,1]} (y)-\mathbf 1 _{[-1,0]}(y)\right] \,\mathrm{d}y. \end{aligned}$$

We obviously have \(g\in \mathbb{C }^\infty (\mathbb{R }^{1})\), and

$$\begin{aligned}&(\mathrm{i})\quad \int _{\mathbb{R }^1} g(y)\,\mathrm{d}y =0,\quad (\mathrm{ii})\quad \text {supp}(g)\subseteq [-2,2],\quad (\mathrm{iii})\quad \Vert g\Vert _\infty \le 1. \end{aligned}$$
(7.7)
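Properties (7.7) can be checked numerically. The following sketch approximates \(g\) via a simple midpoint rule; it is illustrative only and is not part of the proof.

```python
import math

# Illustrative numerical check of properties (7.7) of g, using a
# midpoint rule; not part of the proof.

def bump(t):
    # unnormalized bump from step 1^0
    return math.exp(-1.0 / (1.0 - t * t)) if abs(t) < 1.0 else 0.0

n = 2000
Z = sum(bump(-1.0 + (k + 0.5) * (2.0 / n)) for k in range(n)) * (2.0 / n)

def g(t):
    # g(t) = int Lambda(y - t) [1_[0,1](y) - 1_[-1,0](y)] dy, Lambda = bump/Z
    h = 1.0 / n
    pos = sum(bump((k + 0.5) * h - t) for k in range(n)) * h / Z    # y in [0, 1]
    neg = sum(bump(-(k + 0.5) * h - t) for k in range(n)) * h / Z   # y in [-1, 0]
    return pos - neg
```

Since \(\Lambda \) is even, \(g\) is odd, which is the reason behind (7.7)(i).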

For any \(l=1,\ldots ,d\) let \(\sigma _l=\sigma _l(n)\in \big (0,(20\varkappa )^{-1}\big )\), with \(\sigma _l(n)\rightarrow 0\) as \(n\rightarrow \infty \), be sequences to be specified later. Let \(M_l=(20\varkappa \sigma _l)^{-1}N\); without loss of generality we assume that \(M_l,\,l=1,\ldots , d\), are integers. Define also

$$\begin{aligned} x_{j,l}=-\frac{N-4}{4\varkappa }+8j\sigma _l,\quad j=1,\ldots , M_l,\quad l=1,\ldots ,d, \end{aligned}$$

and let \(\mathcal{M }=\{1,\ldots , M_1\}\times \cdots \times \{1,\ldots , M_d\}\). For any \(m=(m_1,\ldots ,m_d)\in \mathcal{M }\) define

$$\begin{aligned} G_m(x)&= \prod _{l=1}^dg\left( \frac{x_l-x_{m_l,l}}{\sigma _l}\right) ,\quad x\in \mathbb{R }^d,\\ \Pi _m&= \left[ x_{m_1,1}-3\sigma _1,x_{m_1,1}+3\sigma _1\right] \times \cdots \times \left[ x_{m_d,d}-3\sigma _d, x_{m_d,d}+3\sigma _d\right] \subset \mathbb{R }^d. \end{aligned}$$

Several remarks on these definitions are in order. First, in view of (7.7)(ii)

$$\begin{aligned}&\text {supp}(G_m)\subset \Pi _m,\quad \forall m\in \mathcal{M },\end{aligned}$$
(7.8)
$$\begin{aligned}&\Pi _m\cap \Pi _j=\emptyset ,\quad \forall m, j\in \mathcal{M }:\quad m\ne j. \end{aligned}$$
(7.9)

Second, since \(g\in \mathbb{C }^{\infty }(\mathbb{R }^1)\), we have that \(G_m\in \mathbb{C }^{\infty }(\mathbb{R }^d)\) for any \(m\in \mathcal{M }\). Moreover, for any \(l=1,\ldots ,d\), any \(|h|\le \sigma _l\) and any integer \(k\)

$$\begin{aligned}&\text {supp}\left\{ \Delta _{h,l}\, (D^{k}_lG_m)\right\} \subseteq \Pi _m, \quad \forall m\in \mathcal{M }, \end{aligned}$$
(7.10)

where \(D^k_l G\) stands for the \(k\)th order derivative of a function \(G\) with respect to the variable \(x_l\), and \(\Delta _{h,l}\) is the first order difference operator with step size \(h\) in direction of the variable \(x_l\). For \(m\in \mathcal{M }\) define

$$\begin{aligned} \pi (m)=\sum _{j=1}^{d-1}(m_j-1)\left( \prod _{l=j+1}^d M_l\right) +m_d. \end{aligned}$$

It is easily checked that \(\pi \) defines an enumeration of the set \(\mathcal{M }\), i.e., \(\pi :\mathcal{M }\rightarrow \{1,2,\ldots ,|\mathcal{M }|\} \) is a bijection. Let \(W\) be a subset of \(\{0,1\}^{|\mathcal{M }|}\). Define a family of functions \(\{F_w, w\in W\}\) by

$$\begin{aligned} F_w(x)=A\sum _{m\in \mathcal{M }}w_{\pi (m)}G_m(x),\quad x\in \mathbb{R }^d, \end{aligned}$$

where \(w_j,\,j=1,\ldots , |\mathcal{M }|\) are the coordinates of \(w\), and \(A\) is a parameter to be specified. It follows from (7.7)(iii), (7.8) and (7.9) that

$$\begin{aligned} \Vert F_{w}\Vert _\infty \le A,\quad \forall w\in W, \end{aligned}$$
(7.11)

and (7.7)(i) implies that

$$\begin{aligned} \int _{\mathbb{R }^d}F_{w}(x)\,\mathrm{d}x=0,\quad \forall w\in W. \end{aligned}$$
(7.12)
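The bijectivity of the enumeration \(\pi \) introduced above can be verified by direct computation; the sketch below does so for hypothetical sizes \(M_l\) (the values are illustrative only).

```python
from itertools import product

# Direct check (with hypothetical sizes M_l) that pi enumerates
# M = {1,...,M_1} x ... x {1,...,M_d} bijectively onto {1,...,|M|}.
M_sizes = [3, 4, 2]              # illustrative M_1, M_2, M_3

def pi(m):
    d = len(M_sizes)
    val = m[d - 1]               # m_d
    for j in range(d - 1):
        block = 1
        for l in range(j + 1, d):
            block *= M_sizes[l]  # prod_{l=j+1}^{d} M_l
        val += (m[j] - 1) * block
    return val

images = sorted(pi(m)
                for m in product(*(range(1, M + 1) for M in M_sizes)))
```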

\(3^{0}\). Now we find conditions which guarantee that \(F_{w}\in \mathbb{N }_{\vec {r},d}(\vec {\beta },2^{-1}\vec {L})\) for any \(w\in W\).

Fix \(l=1,\ldots ,d\), and let \(k_l=\lfloor \beta _l\rfloor +1\) if \(\beta _l\notin \mathbb{N }^*\), and \(k_l=\lfloor \beta _l\rfloor +2\) if \(\beta _l\in \mathbb{N }^*\) (here \(\lfloor x \rfloor \) stands for the largest integer strictly less than \(x\)).

First, for any \(w\in W\) and \(h\in \mathbb{R }\)

$$\begin{aligned} \left\| \Delta ^{k_l}_{h,l} F_w\right\| _{r_l}= \left\| \Delta ^{k_l-1}_{h,l} (\Delta _{h,l} F_w)\right\| _{r_l} \le |h|^{k_l-1}\left\| \Delta _{h,l} (D^{k_l-1}_lF_w)\right\| _{r_l}, \end{aligned}$$
(7.13)

where the last inequality is found in [29, Section 4.4.4]. Next, in view of (7.9) and (7.10) we obtain for any \(w\in W\) and any \(r_l\ne \infty \)

$$\begin{aligned} \left\| \Delta _{h,l} (D^{k_l-1}_l F_w)\right\| ^{r_l}_{r_l}&= \sum _{j\in \mathcal{M }}\int _{\Pi _j}\left| \Delta _{h,l} (D^{k_l-1}_lF_w)(x)\right| ^{r_l}\,\mathrm{d}x \nonumber \\&= A^{r_l} \sum _{j\in \mathcal{M }}w_{\pi (j)}\int _{\Pi _j}\left| \Delta _{h,l} (D^{k_l-1}_l G_j)(x)\right| ^{r_l}\,\mathrm{d}x \nonumber \\&\le A^{r_l}S_W\Vert g\Vert ^{(d-1)r_l}_{r_l}\sigma _l^{-(k_l-1)r_l} \left( \prod _{j=1}^d\sigma _j\right) \\&\times \left\| g^{(k_l-1)} \left( \cdot -\frac{h}{\sigma _l}\right) -g^{(k_l-1)}(\cdot )\right\| ^{r_l}_{r_l}, \end{aligned}$$

where we have put \(S_W:=\sup _{w\in W}|\{j:\;w_j\ne 0\}|\). Thus, for any \(r_l\ne \infty \) we have

$$\begin{aligned}&\left\| \Delta _{h,l} (D^{k_l-1}_l F_w)\right\| _{r_l}\le \; A \Vert g\Vert ^{(d-1)r_l}_{r_l}\sigma _l^{-(k_l-1)} \left( S_W\prod _{j=1}^d\sigma _j\right) ^{\frac{1}{r_l}}\nonumber \\&\quad \times \left\| g^{(k_l-1)}\left( \cdot -\frac{h}{\sigma _l}\right) -g^{(k_l-1)}(\cdot )\right\| _{r_l}. \end{aligned}$$
(7.14)

Similarly, we get for any \(w\in W\)

$$\begin{aligned} \left\| \Delta _{h,l} (D^{k_l-1}_l F_w)\right\| _{\infty }&= \sup _{j\in \mathcal{M }}\sup _{x\in \Pi _j}\left| \Delta _{h,l} (D^{k_l-1}_lF_w)(x)\right| \nonumber \\&= A \sup _{j\in \mathcal{M }}w_{\pi (j)}\sup _{x\in \Pi _j}\left| \Delta _{h,l} (D^{k_l-1}_l G_j)(x)\right| \nonumber \\&\le A \Vert g\Vert ^{(d-1)}_{\infty }\sigma _l^{-(k_l-1)} \left\| g^{(k_l-1)}\left( \cdot \!-\!\frac{h}{\sigma _l}\right) \!-\!g^{(k_l-1)}(\cdot )\right\| _{\infty }.\qquad \end{aligned}$$
(7.15)

In view of (7.7)(ii) and \(|h|\le \sigma _l\), the function \(g^{(k_l-1)}(\cdot -[h/\sigma _l]) -g^{(k_l-1)}(\cdot )\) is supported on \([-3,3]\). Therefore the fact that \(g\in \mathbb{C }^{\infty }(\mathbb{R }^1)\) implies for any \(r_l\in [1,\infty ]\)

$$\begin{aligned} \left\| g^{(k_l-1)}\left( \cdot -h/\sigma _l\right) \!-\!g^{(k_l-1)}(\cdot )\right\| _{r_l}\le 6^{1/r_l} \left\| g^{(k_l)}\right\| _{\infty }|h/\sigma _l|\!\le \! 6^{1/r_l}\left\| g^{(k_l)}\right\| _{\infty }|h/\sigma _l|^{\beta _l-k_l+1}.\qquad \end{aligned}$$

In the last inequality we have used that \(0\le \beta _l-k_l+1\le 1\) by definition of \(k_l\). Combining this with (7.13), (7.14) and (7.15) we have for any \(|h|\le \sigma _l\) and any \(r_l\in [1,\infty ]\)

$$\begin{aligned}&\left\| \Delta ^{k_l}_{h,l} F_w\right\| _{r_l} \le A|h|^{\beta _l}6^{1/r_l} \Vert g\Vert _{r_l}^{d-1} \left\| g^{(k_l)}\right\| _{\infty } \sigma _l^{-\beta _l}\left( S_W\prod _{j=1}^d\sigma _j\right) ^{1/r_l}. \end{aligned}$$
(7.16)

If \(|h|\ge \sigma _l\) then we note that \(\Delta _{h,l}(D^{k_l-1}_l F_w)(\cdot )=(D^{k_l-1}_lF_w)(\cdot -h e_l)-(D^{k_l-1}_lF_w)(\cdot )\), and by the triangle inequality

$$\begin{aligned} \left\| \Delta _{h,l}(D^{k_l-1}_l F_w)\right\| _{r_l}\le 2 \left\| D^{k_l-1}_l F_w\right\| _{r_l}\le 2 \left\| D^{k_l-1}_lF_w \right\| _{r_l}|h/\sigma _l|^{\beta _l-k_l+1}. \end{aligned}$$

In view of (7.8) and (7.9) we get for any \(w\in W\) and any \(r_l\ne \infty \)

$$\begin{aligned} \left\| D^{k_l-1}_l F_w\right\| ^{r_l}_{r_l}&= \sum _{j\in \mathcal{M }}\int _{\Pi _j}\left| D^{k_l-1}_l F_w(x)\right| ^{r_l}\,\mathrm{d}x =A^{r_l}\sum _{j\in \mathcal{M }}w_{\pi (j)}\int _{\Pi _j}\left| D^{k_l-1}_lG_j(x)\right| ^{r_l}\,\mathrm{d}x\nonumber \\&\le A^{r_l}S_W \Vert g\Vert ^{(d-1)r_l}_{r_l} \left\| g^{(k_l-1)}\right\| ^{r_l}_{r_l}\sigma _l^{(1-k_l)r_l} \left( \prod _{j=1}^d\sigma _j\right) . \end{aligned}$$

Moreover,

$$\begin{aligned} \left\| D^{k_l-1}_l F_w\right\| _{\infty }&= \sup _{j\in \mathcal{M }}\sup _{x\in \Pi _j}\left| D^{k_l-1}_l F_w(x)\right| =A\sup _{j\in \mathcal{M }}w_{\pi (j)}\sup _{x\in \Pi _j}\left| D^{k_l-1}_lG_j(x)\right| \nonumber \\&\le A\Vert g\Vert ^{(d-1)}_{\infty } \left\| g^{(k_l-1)}\right\| _{\infty }\sigma _l^{(1-k_l)}. \end{aligned}$$

We obtain finally from (7.13) that for any \(|h|\ge \sigma _l\) and any \(r_l\in [1,\infty ]\)

$$\begin{aligned}&\left\| \Delta ^{k_l}_{h,l} F_w\right\| _{r_l} \le A|h|^{\beta _l}2\Vert g\Vert ^{d-1}_{r_l}\left\| g^{(k_l-1)}\right\| _{r_l} \sigma _l^{-\beta _l}\left( S_W\prod _{j=1}^d\sigma _j\right) ^{1/r_l}. \end{aligned}$$
(7.17)

Combining (7.16) and (7.17) we conclude that for any \(w\in W\) and \(r_l\in [1,\infty ]\)

$$\begin{aligned} \left\| \Delta ^{k_l}_{h,l} F_w\right\| _{r_l}\le C_1 A|h|^{\beta _l} \sigma _l^{-\beta _l}\left( S_W\prod _{j=1}^d\sigma _j\right) ^{1/r_l}, \quad \forall h\in \mathbb{R }^1, \end{aligned}$$

where \(C_1=\max _{l}(\Vert g\Vert _{r_l}^{d-1} \max \{6^{1/r_l}\Vert g^{(k_l)}\Vert _{\infty },2\Vert g^{(k_l-1)}\Vert _{r_l}\})\). Thus, if

$$\begin{aligned}&A\sigma _l^{-\beta _l} \left( S_W\prod _{j=1}^d\sigma _j\right) ^{1/r_l}\le (2C_1)^{-1} L_l,\quad \forall l=1,\ldots , d \end{aligned}$$
(7.18)

then \(F_{w}\in \mathbb{N }_{\vec {r},d}(\vec {\beta },2^{-1}\vec {L})\) for any \(w\in W\).

4\(^0\). Define for any \(w\in W\)

$$\begin{aligned} f_w(x)=f^{(0)}(x)+F_{w}(x),\quad x\in \mathbb{R }^d. \end{aligned}$$

Recall that \(f^{(0)}\) is the probability density belonging to \(\mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}_0/2, M/2)\). Therefore, in view of (7.12) and under condition (7.18), for any \(w\in W\)

$$\begin{aligned}&\int _{\mathbb{R }^d} f_w(x)\,\mathrm{d}x=1,\quad f_w\in \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}), \end{aligned}$$
(7.19)

where the latter inclusion holds because \(\min _{j=1,\ldots ,d}L_j\ge L_0\).

By construction of \(F_{w}\), for any \(w\in W\)

$$\begin{aligned}&F_{w}(x)=0, \quad \forall x\notin \left[ -\frac{1}{4\varkappa }(N-4), \;\frac{1}{4\varkappa }(N+4)\right] ^d. \end{aligned}$$
(7.20)

This yields

$$\begin{aligned}&f_w(x)=f^{(0)}(x)\ge 0, \quad \forall x\notin \left[ -\frac{1}{4\varkappa }(N-4), \;\frac{1}{4\varkappa }(N+4)\right] ^d. \end{aligned}$$
(7.21)

On the other hand, by (7.4)

$$\begin{aligned} f^{(0)}(x)=\varkappa ^{d} N^{-d},\quad \forall x\in \left[ -\frac{1}{4\varkappa }(N-4), \;\frac{1}{4\varkappa }(N+4)\right] ^d. \end{aligned}$$
(7.22)

Therefore, if we require

$$\begin{aligned}&A\le \varkappa ^d N^{-d}, \end{aligned}$$
(7.23)

this together with (7.13) implies

$$\begin{aligned} f_w(x)\ge 0, \quad \forall x\in \left[ -\frac{1}{4\varkappa }(N-4), \;\frac{1}{4\varkappa }(N+4)\right] ^d. \end{aligned}$$

We conclude that \(f_w\ge 0\) for any \(w\in W\). Moreover, we get from (7.6), (7.11) and (7.23) that \(\Vert f_w\Vert _\infty \le M\) for any \(w\in W\).

All this, together with (7.19), shows that \(\{f^{(0)}, f_w, w\in W\}\) is a finite set of probability densities from \(\mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)\). Thus Lemma 4 is applicable with \(\mathcal{J }_n=W\) and \(\mathbb{F }= \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)\).

5\(^0\). Suppose now that the set \(W\) is chosen so that

$$\begin{aligned}&\varrho _{|\mathcal{M }|}(w,w^\prime )\ge B,\quad \forall w,w^\prime \in W, \end{aligned}$$
(7.24)

where, we remind, \(\varrho _{|\mathcal{M }|}\) is the Hamming distance on \(\{0,1\}^{|\mathcal{M }|}\). Here \(B=B(n)\ge 1\) is a parameter to be specified. Then we deduce from (7.19), (7.8) and (7.9), that for all \(w,w^\prime \in W\)

$$\begin{aligned} \left\| f_w-f_{w^\prime }\right\| ^{p}_p&= \left\| F_w-F_{w^\prime }\right\| ^{p}_p =A^p\sum _{j\in \mathcal{M }}\left| w_{\pi (j)}-w^\prime _{\pi (j)} \right| \int _{\Pi _j}|G_j(x)|^{p}\,\mathrm{d}x\nonumber \\&= A^{p}\Vert g\Vert ^{dp}_{p} \left( \prod _{j=1}^d\sigma _j\right) \sum _{j\in \mathcal{M }} \left| w_{\pi (j)}-w^\prime _{\pi (j)}\right| \nonumber \\&= A^{p}\Vert g\Vert ^{dp}_{p} \left( \prod _{j=1}^d\sigma _j\right) \varrho _{|\mathcal{M }|} (w,w^\prime ) \nonumber \\&\ge \Vert g\Vert ^{dp}_{p}A^pB \left( \prod _{j=1}^d\sigma _j\right) . \end{aligned}$$
(7.25)

Here we have used that the map \(\pi \) is a bijection. Putting \(C_2=\frac{1}{2}\Vert g\Vert ^{d}_{p}\), we conclude that condition (7.1) of Lemma 4 is fulfilled with

$$\begin{aligned}&\rho _n=C_2A\left( B\prod _{j=1}^d\sigma _j\right) ^{1/p}. \end{aligned}$$
(7.26)
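Indeed, if condition (7.1) of Lemma 4 is the usual separation requirement \(\Vert f_w-f_{w^\prime }\Vert _p\ge 2\rho _n\) for all distinct \(w,w^\prime \) (an assumption on the form of (7.1), which is not restated here), then (7.25) gives

$$\begin{aligned} \left\| f_w-f_{w^\prime }\right\| _p\ge \Vert g\Vert ^{d}_{p}A\left( B\prod _{j=1}^d\sigma _j\right) ^{1/p}=2C_2A\left( B\prod _{j=1}^d\sigma _j\right) ^{1/p}=2\rho _n. \end{aligned}$$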

Let us remark that (7.26) remains true if we formally put \(p=\infty \). Indeed, similarly to (7.25),

$$\begin{aligned}&\left\| f_w-f_{w^\prime }\right\| _\infty \!=\! \left\| F_w-F_{w^\prime }\right\| _\infty =A\sup _{j\in \mathcal{M }}\left| w_{j}-w^\prime _{j}\right| \Vert g\Vert ^{d}_{\infty }\ge A \Vert g\Vert ^{d}_{\infty }.\qquad \end{aligned}$$
(7.27)

Here we have used (7.9), the fact that the map \(\pi \) is a bijection, and that, by (7.24), any two distinct \(w,w^\prime \in W\) differ in at least one coordinate.

Now we verify condition (7.2) of Lemma 4. First observe that

$$\begin{aligned} \frac{\mathrm{d}\mathbb{P }_{f_w}}{\mathrm{d}\mathbb{P }_{f^{(0)}}}(X^{(n)})= \prod _{k=1}^n\frac{f_w(X_k)}{f^{(0)}(X_k)}. \end{aligned}$$

Since \(X_k,\,k=1,\ldots ,n\) are i.i.d. random vectors, we have for any \(w\in W\)

$$\begin{aligned} \mathbb{E }_{f^{(0)}}\left\{ \prod _{k=1}^n\frac{f_w(X_k)}{f^{(0)}(X_k)} \right\} ^{2}&= \left\{ \int _{\mathbb{R }^d} \frac{ \left( f^{(0)}(x)\right) ^2+2f^{(0)}(x)F_w(x)+F^{2}_w(x)}{f^{(0)}(x)}\,\mathrm{d}x\right\} ^{n} \nonumber \\&= \left\{ 1+\int _{\mathbb{R }^d}\frac{F^{2}_w(x)}{f^{(0)}(x)}\,\mathrm{d}x \right\} ^{n}. \end{aligned}$$

The last equality follows from (7.12). By (7.20) and (7.22),

$$\begin{aligned} \int _{\mathbb{R }^d}\frac{F^{2}_w(x)}{f^{(0)}(x)}\,\mathrm{d}x= \varkappa ^{-d}N^d \Vert F_w\Vert _2^2; \end{aligned}$$

hence for any \(w\in W\)

$$\begin{aligned} \mathbb{E }_{f^{(0)}}\left\{ \frac{\mathrm{d}\mathbb{P }_{f_w}}{\mathrm{d}\mathbb{P }_{f^{(0)}}} (X^{(n)})\right\} ^{2}=\left\{ 1+\varkappa ^{-d}N^d \Vert F_w\Vert _2^2 \right\} ^{n}\le \exp {\left\{ n\varkappa ^{-d}N^d \Vert F_w\Vert _2^2\right\} }. \end{aligned}$$

Repeating the computations that led to (7.17) we have

$$\begin{aligned} \Vert F_w\Vert _2^2\le A^{2}\Vert g\Vert ^{2d}_{2} S_W\prod _{j=1}^d\sigma _j. \end{aligned}$$
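In detail, this computation mirrors (7.25) with \(p=2\): by (7.8) and (7.9),

$$\begin{aligned} \Vert F_w\Vert _2^2=A^{2}\sum _{j\in \mathcal{M }}w_{\pi (j)}\int _{\Pi _j}|G_j(x)|^{2}\,\mathrm{d}x =A^{2}\Vert g\Vert ^{2d}_{2}\left( \prod _{j=1}^d\sigma _j\right) \sum _{j\in \mathcal{M }}w_{\pi (j)} \le A^{2}\Vert g\Vert ^{2d}_{2}S_W\prod _{j=1}^d\sigma _j, \end{aligned}$$

where, as in (7.25), \(\int _{\Pi _j}|G_j|^2=\Vert g\Vert _2^{2d}\prod _j\sigma _j\) and \(\sum _{j\in \mathcal{M }}w_{\pi (j)}\le S_W\).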

The right-hand side of the latter inequality does not depend on \(w\); hence we obtain

$$\begin{aligned} \frac{1}{|W|^{2}}\sum _{w\in W}\mathbb{E }_{f^{(0)}} \left\{ \frac{\mathrm{d}\mathbb{P }_{f_w}}{\mathrm{d}\mathbb{P }_{f^{(0)}}}(X^{(n)})\right\} ^{2}\le \exp \left\{ C_3n A^2S_W N^d\left( \prod _{j=1}^d\sigma _j\right) -\ln {(|W|)}\right\} , \end{aligned}$$

where we have put \(C_3=\varkappa ^{-d}\Vert g\Vert ^{2d}_{2}\). Therefore, if

$$\begin{aligned} C_4nA^2S_WN^d \prod _{j=1}^d \sigma _j\le \ln {(|W|)} \end{aligned}$$
(7.28)

then condition (7.2) of Lemma 4 is fulfilled with \(C=1\).

In order to apply Lemma 4 it remains to specify the set \(W\) and the parameters \(A,\,N,\,\sigma _j,\,j=1,\ldots ,d\) so that the relationships (7.18), (7.23), (7.24), and (7.28) are simultaneously fulfilled. According to Lemma 4, under these conditions the lower bound is given by \(\rho _n\) in (7.26).

7.3 Proof of Theorem 3. Derivation of lower bounds in different zones

We begin with the construction of the set \(W\). Let \(m\ge 8\) be an integer whose choice will be made later, and, without loss of generality, assume that \(|\mathcal{M }|/m\) is an integer. Let \(\mathcal{P }_m\) be a subset of \(\{0,1\}^{m}\) such that

$$\begin{aligned} |\mathcal{P }_m|\ge 2^{m/8},\quad \varrho _{m}(z,z^\prime )\ge m/8,\quad \forall z,z^\prime \in \mathcal{P }_m. \end{aligned}$$
(7.29)

The existence of such a set \(\mathcal{P }_m\) is guaranteed by Lemma 5. Let \(\mathcal{J }:=\{1+\frac{j}{m}|\mathcal{M }|, \;j=0,\ldots , m-1\}\), and note that \(\mathcal{J }\subseteq \{1,\ldots ,|\mathcal{M }|\}\), with equality when \(m=|\mathcal{M }|\). Define the map \(\Upsilon : \mathcal{P }_m\rightarrow \{0,1\}^{|\mathcal{M }|}\) by

$$\begin{aligned} \Upsilon _j[a]= \left\{ \begin{array}{l@{\quad }l} a_j,\quad &{} j\in \mathcal{J },\\ 0,\quad &{} j\in \{1,\ldots ,|\mathcal{M }|\}{\setminus }\mathcal{J }, \end{array} \right. \end{aligned}$$

and let \(W=\Upsilon (\mathcal{P }_m)\). Obviously, \(\varrho _{|\mathcal{M }|}(w,w^\prime )=\varrho _{|\mathcal{M }|} (\Upsilon [a],\Upsilon [a^\prime ])= \varrho _m(a,a^\prime )\) for all \(w,w^\prime \in W\); therefore (7.29) implies that

$$\begin{aligned} |W|\ge 2^{m/8},\quad \varrho _{|\mathcal{M }|}(w,w^\prime )\ge 8^{-1}m,\quad \forall w,w^\prime \in W. \end{aligned}$$
(7.30)

With such a set \(W\) we have \(S_W\le m\); moreover, since \(\ln (|W|)\ge (m\ln 2)/8\), condition (7.28) holds true if

$$\begin{aligned}&A^{2}n N^d \prod _{j=1}^d \sigma _j \le (8C_4)^{-1}\ln 2. \end{aligned}$$
(7.31)

We also note that condition (7.18) is fulfilled if we require

$$\begin{aligned} A\sigma _l^{-\beta _l} \left( m\prod _{j=1}^d \sigma _j\right) ^{1/r_l}\le (2C_1)^{-1} L_l,\quad \forall l=1,\ldots , d. \end{aligned}$$
(7.32)

In addition, (7.24) holds with \(B=m/8\).

7.3.1 Tail zone: \(p\le \frac{2+1/\beta }{1+1/s}\)

Let \(m=|\mathcal{M }|\). By construction, \(|\mathcal{M }|=\prod _{l=1}^d M_l= (20\varkappa )^{-d}N^d\prod _{l=1}^d\sigma _l^{-1}\) and, therefore, (7.32) reduces to

$$\begin{aligned} A\sigma _l^{-\beta _l}N^{d/r_l}\le C_5L_l. \end{aligned}$$
(7.33)

Thus, choosing

$$\begin{aligned} \sigma _l=C_6 A^{1/\beta _l} L_l^{- 1/\beta _l} N^{\frac{d}{\beta _l r_l}}, \end{aligned}$$
(7.34)

we guarantee the fulfillment of (7.33) provided that \(C_6\ge \max _{l=1,\ldots ,d}C^{-1/\beta _l}_5\). Moreover, with this choice (7.31) is reduced to

$$\begin{aligned}&A^{2+1/\beta }N^{d(1+1/s)}\le C_7 L_\beta n^{-1}, \end{aligned}$$
(7.35)

where, as before, \(L_\beta =\prod _{l=1}^d L_l^{1/\beta _l}\). Moreover, we have from (7.26)

$$\begin{aligned}&\rho _n=C_8 AN^{d/p},\quad C_8=C_2\left( 8(20\varkappa )^{d}\right) ^{-1/p}. \end{aligned}$$
(7.36)

Let \(N^d=C_9A^{-1}\), where the constant \(C_9\le \varkappa ^{d}\) will be specified below; then (7.23) holds. Next, in view of (7.35) and (7.36)

$$\begin{aligned} A=C_{10}(L_\beta /n)^{\frac{1}{1-1/s+1/\beta }},\quad \rho _n=C_{11}(L_\beta /n)^{\frac{1-1/p}{1-1/s+1/\beta }}= C_{11}\left( L_\beta \alpha _n n^{-1}\right) ^{\nu }. \end{aligned}$$
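For the reader's convenience, here is the arithmetic behind this display: substituting \(N^d=C_9A^{-1}\) into (7.35) and (7.36) gives

$$\begin{aligned} A^{2+1/\beta }N^{d(1+1/s)}=C_9^{1+1/s}A^{1-1/s+1/\beta },\quad \rho _n=C_8AN^{d/p}=C_8C_9^{1/p}A^{1-1/p}, \end{aligned}$$

so (7.35) is satisfied, with equality up to constants, by \(A=C_{10}(L_\beta /n)^{\frac{1}{1-1/s+1/\beta }}\), and the expression for \(\rho _n\) follows by raising \(A\) to the power \(1-1/p\).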

We remark that \(N\rightarrow \infty \) as \(n\rightarrow \infty \). It remains to check that \(\sigma _l,\,l=1,\ldots , d\) are small enough. It follows from (7.34) that if \(r_l>1\), then \(\sigma _l\rightarrow 0\) as \(n\rightarrow \infty \) since \(A\rightarrow 0\). If \(r_l=1\), then

$$\begin{aligned} \sigma _l=C_{12} C_9^{1/\beta _l } L_l^{- 1/\beta _l}\le C_{12}(C_9/L_{0})^{1/\beta _l}. \end{aligned}$$

Choosing \(C_9\) small enough we guarantee that \(\sigma _l\le (20\varkappa )^{-1}\), for all \(l=1,\ldots ,d\). This condition is required in the construction of the family \(G_{j},\,j\in \mathcal{M }\). Thus, Lemma 4 can be applied with \(\rho _n=C_{11}(L_\beta \alpha _n n^{-1})^{\nu }\), and the result follows.

7.3.2 Dense zone: \(\frac{2+1/\beta }{1+1/s} \le p\le s(2+\frac{1}{\beta })\)

Here, as in the previous case, we let \(m=|\mathcal{M }|\). The relationships (7.34), (7.35) and (7.36) remain true, but our choice of \(N\) will be different.

Let \(N=C_{12}\) for some constant \(C_{12}\). This yields in view of (7.35) and (7.36)

$$\begin{aligned} A=C_{13}(L_\beta /n)^{\frac{\beta }{2\beta +1}},\quad \rho _n=C_{14}(L_\beta /n)^{\frac{\beta }{2\beta +1}}= C_{14}\left( L_\beta \alpha _n n^{-1}\right) ^{\nu }. \end{aligned}$$
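The exponent arises from solving (7.35) with \(N\) fixed:

$$\begin{aligned} A^{2+1/\beta }\asymp L_\beta n^{-1}\;\Longrightarrow \;A\asymp \left( L_\beta /n\right) ^{\frac{1}{2+1/\beta }}=\left( L_\beta /n\right) ^{\frac{\beta }{2\beta +1}}, \end{aligned}$$

while \(\rho _n=C_8AN^{d/p}\asymp A\) since \(N\) is constant.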

The requirement (7.23) is obviously fulfilled since \(A\rightarrow 0,\;n\rightarrow \infty \). Moreover, we obtain from (7.34) that \(\sigma _l\rightarrow 0\) as \(n\rightarrow \infty \) and, therefore, \(\sigma _l\le (20\varkappa )^ {-1},\,l=1,\ldots ,d\) for \(n\) large enough. Thus, Lemma 4 can be applied with \(\rho _n=C_{14}(L_\beta \alpha _n n^{-1})^{\nu }\) and the result follows.

7.3.3 Sparse zone: \(s(2+\frac{1}{\beta })<p<\infty ,\,s<1\)

Let \(A=\tilde{C}\) and \(N=C_{17}\) and suppose that \(\tilde{C}\le C^{-1}_{17} \varkappa ^{d}\); then (7.23) is satisfied. Moreover (7.31) and (7.32) are reduced to

$$\begin{aligned}&n \prod _{j=1}^d \sigma _j \le \tilde{C}^{-2}C_{18},\quad \sigma _l^{-\beta _l}\left( m\prod _{j=1}^d \sigma _j\right) ^{1/r_l}\!\le \! \widetilde{C}^{-1}C_{19}L_l, \quad \forall l=1,\ldots ,d.\nonumber \\ \end{aligned}$$
(7.37)

Let \(\tilde{c}_1,\,\tilde{c}_2\) be constants satisfying \(\tilde{c}_1\le \tilde{C}^{-2}C_{18}\) and \(\tilde{c}_2\le \tilde{C}^{-1}C_{19}\). It is straightforward to check that if we choose

$$\begin{aligned} m\!=\!\tilde{c}_1^{-1+s}\tilde{c}_2^{s/\beta }L_\beta ^{s}n^{1-s},\quad \sigma _l\!=\! (\tilde{c}_2L_l)^{-1/\beta _l}\left( \tilde{c}_1 \tilde{c}_2^{1/\beta }L_\beta n^{-1}\right) ^{s/(\beta _lr_l)}, \quad l=1,\ldots ,d,\nonumber \\ \end{aligned}$$
(7.38)

then inequalities (7.37) are fulfilled. With this choice (7.26) is reduced to

$$\begin{aligned} \rho _n= \widetilde{C}C_{17}\left( m\prod _{j=1}^d\sigma _j \right) ^{1/p}\!=\!\widetilde{C}C_{20}(L_\beta n^{-1})^{s/p}\!=\!\widetilde{C}C_{20}\left( L_\beta \alpha _n n^{-1}\right) ^{\nu }.\qquad \end{aligned}$$
(7.39)

It remains to verify that the \(\sigma _l\) are small enough, and that \(m\ge 8\) and \(|\mathcal{M }|/m\ge 1\). Note that \(m\rightarrow \infty \) as \(n\rightarrow \infty \) because \(s<1\). Recall also that

$$\begin{aligned} |\mathcal{M }|=\prod _{l=1}^d M_l= (20\varkappa )^{-d}N^d\prod _{l=1}^d\sigma _l^{-1} = (20\varkappa )^{-d} C_{17}^d \tilde{c}_1^{-1} n; \end{aligned}$$

hence \(|\mathcal{M }|/m\ge (20\varkappa )^{-d}C_{17}^d (\tilde{c}_1\tilde{c}_2^{1/\beta })^{-s}L_0^{-s/\beta }n^s\). Thus \(|\mathcal{M }|/m\ge 1\) for large enough \(n\).

We note also that \(\sigma _l\le (\tilde{c}_2L_0)^{-1/\beta _l}\) for all \(n\) large enough. Therefore, if we choose \(\tilde{C}\) large enough and put \(\tilde{c}_2=\tilde{C}^{-1} C_{19}\) we can ensure that \(\sigma _l\le (20\varkappa )^{-1}\) for all \(l=1,\ldots ,d\). Thus, Lemma 4 can be applied with \(\rho _n=\widetilde{C}C_{20}(L_\beta \alpha _n n^{-1})^{\nu }\) and the result follows.

7.3.4 Sparse zone: \(s(2+\frac{1}{\beta })<p<\infty ,\,s\ge 1\)

Here we consider another choice of the set \(W\). Let \(W=\{e_1,e_2,\ldots ,e_{|\mathcal{M }|}\}\), where \(e_j\), \(j=1,\ldots ,|\mathcal{M }|\), are the canonical basis vectors in \(\mathbb{R }^{|\mathcal{M }|}\). With this choice

$$\begin{aligned} S_W=1,\quad |W|= (20\varkappa )^{-d}N^d\prod _{j=1}^d\sigma _j^{-1}, \end{aligned}$$

and (7.24) holds with \(B=1\). Let \(N=C_{14}\); then (7.18) and (7.28) take the form

$$\begin{aligned} A\sigma _l^{-\beta _l} \left( \prod _{j=1}^d \sigma _j\right) ^{1/r_l}&\le (2C_1)^{-1} L_l,\quad \forall l=1,\ldots , d;\end{aligned}$$
(7.40)
$$\begin{aligned} A^{2}n\prod _{j=1}^d \sigma _j&\le C_{15}\ln {\left( \prod _{j=1}^d \sigma ^{-1}_j\right) }. \end{aligned}$$
(7.41)

Moreover, we get from (7.26)

$$\begin{aligned} \rho _n=C_{16} A \left( \prod _{j=1}^d\sigma _j\right) ^{1/p}. \end{aligned}$$
(7.42)

Put \(\varepsilon =\sqrt{\ln n/n}\) and

$$\begin{aligned} A=c_1L_\beta ^{\frac{1}{2-2/s+1/\beta }}\varepsilon ^{\frac{1-1/s}{1-1/s+ 1/(2\beta )}},\quad \sigma _{l}=c_2L_\beta ^{\frac{1- 2/r_{l}}{\beta _l(2-2/s+1/\beta )}} \varepsilon ^{\frac{1-1/s+1/(\beta r_l)}{\beta _l(1-1/s+ 1/(2\beta ))}} L_l^{-1/\beta _l}.\qquad \qquad \end{aligned}$$
(7.43)

We have

$$\begin{aligned} \prod _{l=1}^d\sigma _{l}=c_2^{d}L_\beta ^{-\frac{2}{2-2/s + 1/\beta }} \varepsilon ^{\frac{1/\beta }{1- 1/s + 1/(2\beta )}}, \end{aligned}$$

and it is evident that \(\prod _{l=1}^d\sigma _{l}\le \varepsilon ^{1/(\beta +1/2)}\) for all \(n\) large enough; hence \(\ln (\prod _{l=1}^d\sigma ^{-1}_{l})\ge \ln n/(2\beta +1)\). Then it is easily checked that our choice (7.43) satisfies (7.40) and (7.41) provided that

$$\begin{aligned} c_1\le (2C_1)^{-1},\quad c_2\le 1,\quad c_1^2c_2^{d}\le C_{15}/(2\beta +1). \end{aligned}$$
(7.44)

Here we have also used that \(d-1/s\ge 0\). Note also that if \(s>1\) then
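A sketch of the verification of (7.41): the exponents in (7.43) are tuned so that the \(L_\beta \)-powers cancel and

$$\begin{aligned} A^{2}\prod _{l=1}^d\sigma _l=c_1^2c_2^{d}\,\varepsilon ^{\frac{2(1-1/s)+1/\beta }{1-1/s+1/(2\beta )}} =c_1^2c_2^{d}\varepsilon ^{2}=c_1^2c_2^{d}\,\frac{\ln n}{n}; \end{aligned}$$

hence \(A^{2}n\prod _{l}\sigma _l=c_1^2c_2^{d}\ln n\), which is at most \(C_{15}\ln n/(2\beta +1)\le C_{15}\ln (\prod _l\sigma _l^{-1})\) under the last condition in (7.44).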

$$\begin{aligned} A\rightarrow 0,\quad \max _{l=1,\ldots , d}\sigma _l\rightarrow 0,\quad n\rightarrow \infty , \end{aligned}$$

which ensures (7.23) and \(\sigma _l\le (20\varkappa )^{-1}\), \(l=1,\ldots , d\), for all \(n\) large enough.

On the other hand, if \(s=1\) then we should add to (7.44) the conditions

$$\begin{aligned} c_1L_\beta ^{\frac{1}{2-2/s+1/\beta }}\le C_{14}\varkappa ^{d},\quad c_2\max _{l=1,\ldots ,d}\left[ L_\beta ^{\frac{1/\beta _{l}- 2/(\beta _{l}r_{l})}{2-2/s+1/\beta }} L_0^{-1/\beta _l}\right] \le (20\varkappa )^{-1}. \end{aligned}$$

Obviously, both restrictions hold if we choose \(c_1\) and \(c_2\) small enough, but now these constants may depend on \(\vec {L}\). Note, however, that if \(\max _{l=1,\ldots ,d}L_l\le L_\infty \) then \(c_1\) and \(c_2\) can be chosen depending on \(L_0\) and \(L_\infty \) only.

Using (7.42) and (7.43) we conclude that Lemma 4 is applicable with

$$\begin{aligned} \rho _n=C_{16}L_\beta ^{\frac{1/2- 1/p}{1- 1/s + 1/(2\beta )}} \left( \frac{\ln n}{n}\right) ^{\frac{1- 1/s + 1/(p\beta )}{2(1- 1/s \!+\! 1/(2\beta ))}}=C_{16}L_\beta ^{\frac{1/2- 1/p}{1- 1/s \!+\! 1/(2\beta )}-\nu }\left( \frac{L_\beta \alpha _n}{n}\right) ^{\nu }.\qquad \end{aligned}$$
(7.45)

This completes the proof of statement (i) of the theorem.

7.3.5 Proof of statement (ii): sparse zone, \(p=\infty ,\,s\le 1\)

The proof in this case coincides with the one for the sparse zone with \(s<1\). Thus, we keep (7.37), (7.38), and, in view of (7.27), (7.39) is replaced by \( \rho _n=\widetilde{C}C_{17}. \) Since \(\rho _n\) does not tend to \(0\) as \(n\rightarrow \infty \), a consistent estimator does not exist. All other details of the proof remain unchanged. This completes the proof of Theorem 3. \(\square \)

7.4 Proof of statement (ii) of Theorem 4

The proof goes along the lines of the proof of Theorem 3 with modifications indicated below.

We start with the following simple observation: for any \(M>0\) and \(y>0\) one has

$$\begin{aligned} \Vert g\Vert _\infty \le M,\quad \mathrm{supp}\{g\}\subseteq [-y,y]^d \quad \Rightarrow g\in \mathbb{G }_\theta \left( M(2y+4)^{d/\theta }\right) ,\quad \forall \theta \in (0,1).\nonumber \\ \end{aligned}$$
(7.46)

This is an immediate consequence of the fact that conditions \(\Vert g\Vert _\infty \le M,\,\mathrm{supp}\{g\} \subseteq [-y,y]^d\) imply that \(\Vert g^*\Vert _\infty \le M\) and \(\mathrm{supp}\{g^*\}\subseteq [-y-2, y+2]^d\).

Next, we note that the lower bounds of Theorem 3 in the dense and sparse zones are proved over a set of compactly supported densities. Hence they are also valid on \(\mathbb{G }_\theta (R)\cap \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)\), provided that \(R\) is large enough. Hence, if \(p\ge \frac{2+1/\beta }{1/\theta +1/s}\) the assertion of the theorem follows.

Let \(p< \frac{2+1/\beta }{1/\theta +1/s}\). The proof of the lower bound here differs from the proof of Theorem 3 only in construction of the function \(f^{(0)}\).

Let \(f^{(0)}\) be the function constructed exactly as in the proof of Theorem 3 with \(N=N_0\) fixed throughout the asymptotics \(n\rightarrow \infty \), and such that \(f^{(0)}\in \mathbb{N }_{\vec {r},d}(\vec {\beta },4^{-1}\vec {L}_0, 4^{-1}M)\). Since \(N_0\) is fixed, \(f^{(0)}\) is compactly supported and, by (7.46) we have that \(f^{(0)}\in \mathbb{G }_\theta (R_1)\) for some large enough \(R_1>0\). Define

$$\begin{aligned} \bar{f}^{(\theta )}(x)= \prod _{l=1}^d\left[ N^{-1/\theta } \int _{\mathbb{R }}\Lambda (y-x_l)\mathbf 1 _{[-\frac{N}{2},\frac{N}{2}]}(y)\,\mathrm{d}y\right] ,\quad x=(x_1,\ldots ,x_d)\in \mathbb{R }^d, \end{aligned}$$

where \(N=N(n)\rightarrow \infty \) will be specified later. Let \(\tilde{f}^{(\theta )}(x)=\varsigma ^{d}\bar{f}^{(\theta )}(\varsigma x)\), where \(\varsigma >0\) is chosen to guarantee \(\tilde{f}^{(\theta )}\in \mathbb{N }_{\vec {r},d}(\vec {\beta },4^{-1}\vec {L}_0, 4^{-1}M)\). We note, however, that in contrast to the case \(\theta =1\), \(\tilde{f}^{(\theta )}\) is not a probability density. In particular, \(\int \tilde{f}^{(\theta )}\rightarrow 0\) as \(N\rightarrow \infty \), because \(\theta <1\). Define

$$\begin{aligned} f^{(\theta )}=(1-p_N)f^{(0)}+\tilde{f}^{(\theta )}, \end{aligned}$$

where \(p_N:=\int \tilde{f}^{(\theta )}\) ensures \(\int f^{(\theta )}=1\).
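Indeed, since \(f^{(0)}\) is a probability density, the mixture is normalized by construction:

$$\begin{aligned} \int f^{(\theta )}=(1-p_N)\int f^{(0)}+\int \tilde{f}^{(\theta )}=(1-p_N)+p_N=1. \end{aligned}$$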

Note also that \(f^{(1)}=\tilde{f}^{(1)}\) since \(\tilde{f}^{(1)}\) is a probability density and, therefore, \(p_N=1\).

Thus, we can assert that

$$\begin{aligned} f^{(\theta )}\in \mathbb{N }_{\vec {r},d}(\vec {\beta },2^{-1}\vec {L}_0, 2^{-1}M),\quad \int f^{(\theta )}=1,\quad f^{(\theta )}\ge 0. \end{aligned}$$

Note that, by construction, \(\tilde{f}^{(\theta )}\) is supported on the cube \([(-N/2-1)/\varsigma ,(N/2+1)/ \varsigma ]^d\) and bounded by \(N^{-d/\theta }\varsigma ^d\). Therefore, in view of (7.46), \(\tilde{f}^{(\theta )}\in \mathbb{G }_\theta (R_2)\) for some large enough \(R_2\).

Let \(W\) be the parameter set as defined in the proof of Theorem 3. For any \(w\in W\) and any \(\theta \le 1\) we let

$$\begin{aligned} f^{(\theta )}_w(x)=f^{(\theta )}(x)+F_{w}(x),\quad x\in \mathbb{R }^d, \end{aligned}$$

where the functions \(F_w\) are constructed as in the proof of Theorem 3. If instead of (7.23) we require

$$\begin{aligned}&A\le \left[ \varkappa ^d+ \varsigma ^{d}\right] N^{-d/\theta }, \end{aligned}$$
(7.47)

then we obtain in view of (7.11) and (7.47) that \(\{F_w,\;w\in W\}\subset \mathbb{G }_\theta (R_3)\) for some large enough \(R_3\). All of the above allows us to conclude that \(\{f^{(\theta )},\;f^{(\theta )}_w,\;w\in W\}\) is a finite set of probability densities from \(\mathbb{G }_\theta (R)\cap \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)\) for some large enough \(R>0\), and Lemma 4 is applicable with \(\mathcal{J }_n=W\) and \(\mathbb{F }=\mathbb{G }_\theta (R)\cap \mathbb{N }_{\vec {r},d}(\vec {\beta },\vec {L}, M)\).

Note also that if \(\theta =1\) we come to the construction used in the proof of Theorem 3 and, therefore, the statement of the theorem in the case \(\theta =1\) follows.

Suppose now that \(\theta <1\). We follow the construction of the set \(W\) for the tail zone given in Sect. 7.3.1. Choose \(m=|\mathcal{M }|\) and note that (7.33), (7.34) and (7.36) remain unchanged, while (7.35) should be replaced by

$$\begin{aligned}&A^{2+ 1/\beta }N^{d(1/\theta +1/s)}\le C_7 L_\beta n^{-1}. \end{aligned}$$
(7.48)

Now we choose \(N^d=cA^{-\theta }\) with \(c^{1/\theta }\le \varkappa ^{d}+\varsigma ^{d}\); then (7.47) is valid. We obtain from (7.48) that

$$\begin{aligned} A=C_{8}(L_\beta /n)^{\frac{1}{1-\theta /s+1/\beta }},\quad \rho _n=C_{9}(L_\beta /n)^{\frac{1-\theta /p}{1-\theta /s+1/\beta }}. \end{aligned}$$
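These expressions follow by substituting \(N^d=cA^{-\theta }\) into (7.48) and (7.36):

$$\begin{aligned} A^{2+1/\beta }N^{d(1/\theta +1/s)}=c^{1/\theta +1/s}A^{1-\theta /s+1/\beta },\quad \rho _n\propto AN^{d/p}=c^{1/p}A^{1-\theta /p}, \end{aligned}$$

which yields the stated exponents upon solving \(A^{1-\theta /s+1/\beta }\asymp L_\beta n^{-1}\).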

Finally, because (7.34) remains intact, \(\sigma _l\rightarrow 0\) as \(n\rightarrow \infty \) for any \(l=1,\ldots , d\); this follows from \(A\rightarrow 0\) and \(\theta <1\). This completes the proof. \(\square \)

7.5 Proof of the lower bound in (4.4)

The required result will follow from the lower bound of Theorem 3 in the tail zone (see Sect. 7.3.1) if we show that for any given \(R>0\) and \(\theta \in (0,1)\)

$$\begin{aligned} f^{(0)}\notin \mathbb{G }_\theta (R),\quad f_w \notin \mathbb{G }_\theta (R),\quad \forall w\in W. \end{aligned}$$
(7.49)

First we note that, by (7.22), \(f^{(0)}=\varkappa ^{d}N^{-d}\) for \(x\in [-(N-4)/(4\varkappa ), (N+4)/(4\varkappa )]^d\); therefore, \(\Vert [f^{(0)}]^*\Vert _\theta \rightarrow \infty \) as \(N\rightarrow \infty \), because \(\theta <1\).

Next, in view of (7.21), \(f_w(x)=f^{(0)}(x)\) for any \(x\notin [-(N-4)/(4\varkappa ), \;(N+4)/(4\varkappa )]^d\), which also implies

$$\begin{aligned} \inf _{w\in W}\Vert f_w^*\Vert _\theta \rightarrow \infty ,\quad N\rightarrow \infty . \end{aligned}$$

It remains to note that in the tail zone the parameter \(N\) is chosen so that \(N=N(n)\rightarrow \infty \) as \(n\rightarrow \infty \). This completes the proof of (7.49). \(\square \)