1 Introduction

In the present note we derive non-asymptotic concentration inequalities for the uniform deviation between a multivariate density function and its non-parametric kernel density estimator over the support of the former. We work in a time series setting consisting of multivariate stationary processes with summable uniform (phi-) mixing coefficients (see Davidson, 1994). We rely heavily on the iid results of Vogel and Schettler (2013), adjusting their proofs via the use of concentration and covariance inequalities for uniformly mixing processes. Regarding the two underlying probability measures, we readily derive analogous probability bounds for their (first) Wasserstein distance, as well as for the deviations between integrals of bounded functions w.r.t. them. These bounds hold uniformly w.r.t. the sample size. The inequalities can be used for the construction of confidence regions, the estimation of the finite sample probabilities of false inclusion of parameter values in level sets of moment conditions, etc. Extension of the results to other forms of temporal dependence, like strong mixing or absolute regularity, and subsequently the application to a wider range of econometric time series models, is deferred to future research.

As an example, we apply the concentration results to the derivation of statistical guarantees and oracle inequalities in regularized prediction problems with Lipschitz and strongly convex costs over function spaces. The inequalities imply a uniform non-asymptotic LLN for the deviation between the empirical (w.r.t. the kernel estimator) and the population cost differentials. This, coupled with the relevant sub-differential calculus in convex programming, implies a non-asymptotic statistical guarantee for the \(L^{2}\) deviation between the empirical and the population solutions, as long as the regularization parameter is appropriately dominated. The framework is quite general and allows for dynamic parameter spaces and population solutions. It includes as special cases non-linear Support Vector Machines with Hinge costs.

For the remainder of the note: in Sect. 2 we derive and discuss the concentration inequalities, and in Sect. 3 we treat the aforementioned regularized prediction problems. Section 4 contains the proofs.

2 Concentration inequalities for kernel density estimators

We begin with an assumption that specifies our probabilistic and statistical framework. The assumption imposes restrictions on the marginal distributions and the dynamics of the stochastic process involved in the density estimation. It also restricts the properties of the employed kernel technology:

Assumption 1

(i) The \(\mathbb {R}^{n}\)-valued stochastic process \(\left( \varvec{x}_{t}\right) _{t\in \mathbb {Z}}\) is strictly stationary and phi-mixing, with absolutely summable mixing coefficient sequence \(\left( \phi _{n}\right) _{n\in \mathbb {N}}\). (ii) \({\mathcal {K}}\) is a positive, symmetric, bounded, Lipschitz continuous and compactly supported convolution kernel on \(\mathbb {R}^{n}\) such that \(\int _{\mathbb {R}^{n}}{{\mathcal {K}}}\left( u\right) du=1\) and \(\int _{\mathbb {R}^{n}}\left\| u\right\| ^{2}{{\mathcal {K}}}\left( u\right) du<+\infty\). (iii) The distribution of \(\varvec{x}_{0}\) has a compact support \({\mathcal {X}}\), and a continuous density \(f_{\varvec{\varvec{x}}}\left( \cdot \right)\) that is twice differentiable with continuous second derivatives.

An example of a dynamic multivariate process that satisfies Assumption 1(i) is given by the solution of the following stochastic recursion equation (SRE):

$$\begin{aligned} \varvec{x}_{t}=h\left( \varvec{x}_{t-1}\right) +z_{t}, \end{aligned}$$
(1)

where \(\left( z_{t}\right) _{t\in \mathbb {Z}}\) is an iid sequence of n-random vectors, the distribution of \(z_{0}\) has a density, and \(h:\mathbb {R}^{n}\rightarrow \mathbb {R}^{n}\) is a contraction (w.r.t. some metric on \(\mathbb {R}^{n}\)) with compact range. Theorem 2.1.3 of Doukhan and Ghindès (1983) then implies the required mixing property for the unique solution of the SRE. This incorporates the iid case as a special case. If \(z_{0}\) also has bounded support, then the compactness of the support of the distribution of \(\varvec{x}_{0}\) required in Assumption 1(iii) also holds. The remaining parts impose mostly usual conditions in non-parametric statistics (see El Machkouri et al., 2020, and references therein). Boundedness of supports can be relaxed as long as \({\mathcal {K}}\) has an integrable Fourier transform, and the Hessian of \(f_{\varvec{\varvec{x}}}\) is bounded in the Frobenius norm.

The researcher has at her disposal the time series sample \(\left( \varvec{x}_{t}\right) _{t=1,\dots ,T}\), and estimates the unknown \(f_{\varvec{\varvec{x}}}\) via the kernel estimator \(\frac{1}{Tb_{T}^{n}}\sum _{t=1}^{T}{{\mathcal {K}}}\left( \frac{\varvec{x}_{t}-\cdot }{b_{T}}\right)\), with \(b_{T}>0\) the bandwidth. We consider the problem of bounding, uniformly in T, the probability that the uniform deviation \(\sup _{\varvec{y}\in {{\mathcal {X}}}}\left| \frac{1}{Tb_{T}^{n}}\sum _{t=1}^{T}{{\mathcal {K}}}\left( \frac{\varvec{x}_{t}-\varvec{y}}{b_{T}}\right) -f_{\varvec{\varvec{x}}}\left( \varvec{y}\right) \right|\) exceeds an asymptotically negligible deterministic sequence. The following theorem summarizes the result. There, \({{\mathcal {W}}}\left( G,\,G^{\star }\right)\) denotes the first Wasserstein distance between arbitrary distributions \(G,\,G^{\star }\) on \({{\mathcal {X}}}\), defined as \(\min _{\gamma \in \Gamma \left( G,\,G^{\star }\right) }\int _{{{\mathcal {X}}}\times {{\mathcal {X}}}}d\left( z,z^{\star }\right) d\gamma \left( z,z^{\star }\right)\), where \(\Gamma \left( G,\,G^{\star }\right)\) denotes the set of Borel probability distributions on \({{\mathcal {X}}}\times {{\mathcal {X}}}\) that have respective marginals \(G,\,G^{\star }\), and d denotes the Euclidean distance (see Gao et al., 2017). \(\mu\) denotes the Lebesgue measure on \(\mathbb {R}^{n}\), and \(\text {diam}\left( {{\mathcal {X}}}\right)\) denotes the Euclidean diameter of \({\mathcal {X}}\).
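For concreteness, the estimator above is directly computable. The following is a minimal numerical sketch, assuming a product Epanechnikov kernel (positive, symmetric, bounded, Lipschitz and compactly supported, consistent with Assumption 1(ii)) and a stand-in iid sample; all numerical choices are illustrative, not part of the theory above.

```python
import numpy as np

def epanechnikov(u):
    """Product Epanechnikov kernel on R^n; compactly supported on [-1,1]^n, integrates to 1."""
    inside = np.all(np.abs(u) <= 1.0, axis=-1)
    return np.where(inside, np.prod(0.75 * (1.0 - u**2), axis=-1), 0.0)

def kde(sample, y, b):
    """Kernel estimate (1/(T b^n)) * sum_t K((x_t - y)/b), evaluated at the rows of y."""
    T, n = sample.shape
    u = (sample[:, None, :] - y[None, :, :]) / b   # shape (T, m, n)
    return epanechnikov(u).sum(axis=0) / (T * b**n)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(2000, 2))             # stand-in sample on X = [-1,1]^2
grid = np.array([[0.0, 0.0], [0.5, -0.5]])
print(kde(x, grid, b=0.3))                         # estimates should be near the true density 1/4
```

A dependent sample, e.g. generated by the SRE (1), can be substituted for `x` without changing the estimator itself.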

Theorem 1

(Concentration Inequalities) Suppose that Assumption 1 holds:

A. Uniformly in \(T\ge 1\) and for any \(k>0,\)

$$\begin{aligned} \mathbb {P}\left( \begin{array}{c} \sup _{\varvec{y}\in {{\mathcal {X}}}}\left| \frac{1}{Tb_{T}^{n}}\sum _{t=1}^{T}{{\mathcal {K}}}\left( \frac{\varvec{x}_{t}-\varvec{y}}{b_{T}}\right) -f_{\varvec{\varvec{x}}}\left( \varvec{y}\right) \right| >\beta _{T,k}\end{array}\right) \le 2\exp \left( -\frac{k^{2}}{2C_{3}\left( 1+2\sum _{n=1}^{\infty }\phi _{n}\right) ^{2}}\right) , \end{aligned}$$
(2)

where, \(\beta _{T,k}:=\frac{k}{\sqrt{T}b_{T}^{n}}+\frac{C_{1}}{\left( 2\pi \right) ^{n}\sqrt{T}\sqrt{1+2\sum _{n=1}^{\infty }\sqrt{\phi _{n}}}b_{T}^{n}}+\frac{1}{2}C_{2}b_{T}^{2},\) \(C_{1}:=\int _{\mathbb {R}^{n}}\left| \int _{\text {supp}\left( {{\mathcal {K}}}\right) }{{\mathcal {K}}}\left( u\right) \exp \left( \text {i}y^{T}u\right) du\right| dy,\) \(C_{2}:=\sup _{i,j,\,{y}\in {{\mathcal {X}}}}\left| \frac{\partial ^{2}f_{{{x}}}\left( \varvec{y}\right) }{\partial x_{i}\partial x_{j}}\right| \int _{\text {supp}\left( {\mathcal{K}}\right) }\left\| u\right\| ^{2}{{\mathcal {K}}}\left( u\right) du,\) and \(C_{3}:=\sup _{u\in \text {supp}\left( {{\mathcal {K}}}\right) }{{\mathcal {K}}}^{2}\left( u\right)\).
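The bounding sequence and the exponential tail in (2) are explicit and can be evaluated numerically. A sketch, with hypothetical values for the constants \(C_{1},C_{2},C_{3}\) and a geometrically decaying mixing sequence (all illustrative):

```python
import numpy as np

def beta_Tk(k, T, b, n, C1, C2, phi_sqrt_sum):
    """The bounding sequence beta_{T,k} of Theorem 1.A."""
    return (k / (np.sqrt(T) * b**n)
            + C1 / ((2.0 * np.pi)**n * np.sqrt(T) * np.sqrt(1.0 + 2.0 * phi_sqrt_sum) * b**n)
            + 0.5 * C2 * b**2)

def tail_bound(k, C3, phi_sum):
    """The rhs of (2): 2 exp(-k^2 / (2 C3 (1 + 2 sum_n phi_n)^2))."""
    return 2.0 * np.exp(-k**2 / (2.0 * C3 * (1.0 + 2.0 * phi_sum)**2))

phi = 0.5 ** np.arange(1, 100)     # hypothetical summable (geometric) mixing coefficients
b_val = beta_Tk(k=6.0, T=10_000, b=0.2, n=1, C1=1.0, C2=2.0,
                phi_sqrt_sum=np.sqrt(phi).sum())
p_val = tail_bound(k=6.0, C3=1.0, phi_sum=phi.sum())
print(b_val, p_val)                # for this k the tail is below one, hence informative
```

Note the trade-off visible in the code: increasing k tightens the tail bound but inflates \(\beta _{T,k}\).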

B. Let \(\varvec{\varvec{F}}_{T}\) denote the cdf corresponding to \(\frac{1}{Tb_{T}^{n}}\sum _{t=1}^{T}{{\mathcal {K}}}\left( \frac{\varvec{x}_{t}-\cdot }{b_{T}}\right)\), and analogously \(\varvec{\varvec{\varvec{F}}}\) the cdf corresponding to the density \(f_{\varvec{\varvec{x}}}\left( \cdot \right)\). Then, uniformly in \(T\ge 1\), and for any \(k>0\),

$$\begin{aligned} \mathbb {P}\left( {{\mathcal {W}}}\left( \varvec{\varvec{F}}_{T},\varvec{\varvec{F}}\right) >\text {diam}\left( {{\mathcal {X}}}\right) \mu \left( {{\mathcal {X}}}\right) \beta _{T,k}\right) \le 2\exp \left( -\frac{k^{2}}{2C_{3}\left( 1+2\sum _{n=1}^{\infty }\phi _{n}\right) ^{2}}\right) . \end{aligned}$$
(3)

C. Suppose that \({\left( {{\mathcal {F}},\left\| \cdot \right\| _{{\mathcal {F}}}\right) }\) is an R-bounded subset of a semi-normed space of real functions defined on \({\mathcal {X}}\), and there exists some \(c^{\star }>0\) for which \(\left\| \cdot \right\| _{\infty }\le c^{\star }\left\| \cdot \right\| _{{\mathcal {F}}}\). Then, uniformly in \(T\ge 1\), for any \(k>0\), and for any \(f\in {\mathcal {F}}\),

$$\begin{aligned} \mathbb {P}\left( \left| \mathbb {E}_{\varvec{\varvec{F}}_{T}}\left( f\left( \varvec{y}\right) \right) -\mathbb {E}_{\varvec{\varvec{F}}}\left( f\left( \varvec{y}\right) \right) \right| >c^{\star }R\mu \left( {\mathcal {X}}\right) \beta _{T,k}\right) \le 2\exp \left( -\frac{k^{2}}{2C_{3}\left( 1+2\sum _{n=1}^{\infty }\phi _{n}\right) ^{2}}\right) . \end{aligned}$$
(4)

The results in (2)–(4) are non-asymptotic. Whenever \(b_{T}=o\left( 1\right)\) while \(\sqrt{T}b_{T}^{n}\rightarrow +\infty\), they also provide estimates for the rates of convergence of the deviations considered. The derivation of (2) essentially follows the proof of the iid Theorem 1 of Vogel and Schettler (2013), taking into account relevant uniform mixing, concentration, and covariance inequalities, especially when handling the variance of empirical Fourier transforms. Then (3)–(4) follow from the dual functional representation of \({\mathcal {W}}\) of Kantorovich (1960), the uniform boundedness of the function space involved in C, and the compactness of \({\mathcal {X}}\). The bounding sequence \(\beta _{T,k}\) depends on the integral of the Fourier transform of the kernel, the second moment of the kernel, the magnitude of the Hessian of \(f_{\varvec{\varvec{x}}}\), the mixing coefficients, the sample size, and the bandwidth. In (3)–(4), \(\beta _{T,k}\) is complemented by the Lebesgue measure and the diameter of \({\mathcal {X}}\), and/or the uniform bound and the norm properties of the function space considered. The probability bound depends on the bound of the kernel and the mixing coefficients. It becomes tighter whenever \({\mathcal {K}}\) admits a low maximum, and/or the mixing coefficients are small and converge rapidly to zero. For example, suppose that \(n=1\), and the SRE (1) is actually a stationary uniformly ergodic AR(1) recursion, i.e. \(h(x)=\beta x\) with \(|\beta |<1\), and \(z_{0}\) follows the uniform distribution supported on \([-1,1]\); see Proposition 2.1.5 of Doukhan and Ghindès (1983). Furthermore, suppose that \({\mathcal {K}}\) equals the Gaussian kernel truncated at \([-1,1]\) with scale equal to \(\sigma ^{2}\). Then, Theorem 14.14 of Davidson (1994) implies that the probability bound is bounded above by \(2\exp {(-\frac{\pi \sigma ^{2}(1-\beta )^{2} k^{2}}{(1-\beta +2C)^{2}})}\), for some \(C>0\). This bound is less than one, and hence meaningful, if and only if \(k>\sqrt{\frac{\ln (2)}{\pi }}\frac{(1-\beta +2C)}{\sigma (1-\beta )}\).
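The AR(1) example admits a direct numerical check of the displayed bound and of the threshold for k; a sketch with hypothetical parameter values \(\sigma ^{2},\beta ,C\):

```python
import math

def ar1_tail_bound(k, sigma2, beta, C):
    """The bound 2 exp(-pi sigma^2 (1-beta)^2 k^2 / (1-beta+2C)^2) from the AR(1) example."""
    return 2.0 * math.exp(-math.pi * sigma2 * (1.0 - beta)**2 * k**2
                          / (1.0 - beta + 2.0 * C)**2)

def k_threshold(sigma2, beta, C):
    """Smallest k for which the bound falls below one, i.e. becomes informative."""
    return (math.sqrt(math.log(2.0) / math.pi)
            * (1.0 - beta + 2.0 * C) / (math.sqrt(sigma2) * (1.0 - beta)))

sigma2, beta, C = 1.0, 0.5, 0.25   # hypothetical values
k0 = k_threshold(sigma2, beta, C)
print(k0, ar1_tail_bound(k0, sigma2, beta, C))   # the bound equals one exactly at the threshold
```

Evaluating at \(k_{0}\) confirms the "if and only if" claim: the exponent equals \(\ln 2\) there, so the bound is exactly one, and it is strictly decreasing in k beyond that point.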

Remark 1

The extension of the results of Theorem 1 to stationary processes that are either strongly mixing or absolutely regular (see for example Ch. 14 of Davidson, 1994) is not trivial. The proof is based on the Hoeffding-type inequality of Rio (2000) for functions exhibiting the bounded differences property. To our knowledge, analogous results are not available for the aforementioned weaker forms of mixing. The derivation of such inequalities in those frameworks is a very interesting issue for future research. Another promising approach could be based on results like Corollary 3.3 of Krebs (2018), which is related to the main result of Merlevède et al. (2009). Unfortunately, this is not directly usable in our framework, because it presupposes uniform boundedness for the functions involved in the partial sums. This is not the case in our framework, due to the presence of the bandwidth reciprocal as a multiplicative factor outside the kernel. Analogously, the extension of such a result to sequences of bounded, yet not necessarily uniformly bounded, functions is also an interesting issue for future research.

A and C can be used, among others, in order to construct non-asymptotic confidence sets for large enough T under some further restrictions. Suppose for example that the non-parametric framework includes a restriction on the mixing coefficients of the form \(\sum _{n=1}^{\infty }\phi _{n}\le C^{\star }\) for some known \(C^{\star }>0\), as well as a known upper bound \(M>0\) for the Hessian of \(f_{x}\). Then (4) implies the confidence interval \(\left( \mathbb {E}_{\varvec{\varvec{F}}_{T}}\left( f\left( \varvec{y}\right) \right) \mp c^{\star }R\mu \left( {\mathcal {X}}\right) \beta ^{\star }_{T,k}\right)\) for \(\mathbb {E}_{\varvec{\varvec{F}}}\left( f\left( \varvec{y}\right) \right)\), of probability at least \(1-2\exp \left( -\frac{k^{2}}{2C_{3}\left( 1+2C^{\star }\right) ^{2}}\right)\), uniformly on \({\mathcal {F}}\), where \(\beta ^{\star }_{T,k}:=\frac{k}{\sqrt{T}b_{T}^{n}}+\frac{C_{1}}{\left( 2\pi \right) ^{n}\sqrt{T}b_{T}^{n}}+\frac{1}{2}Mb_{T}^{2}\). The construction of valid confidence sets without the above restrictions, via the estimation of the supremum of the Hessian appearing in \(C_{2}\) and of the mixing coefficients, is left for future research. Notice that the Hessian estimation could be facilitated by differentiation of the kernel estimator; see for example Sheather (2004). The estimation of the mixing coefficient series could be analogously facilitated by the results in Ahsen and Vidyasagar (2013) and truncation. B can be used in order to bound from above the probability of including \(\theta ^{\star }\in \left\{ \sup _{{\mathcal {W}}\left( \varvec{\varvec{F}}_{T},\varvec{G}\right) \le \lambda _{T}}\mathbb {E}_{\varvec{G}}\left( g\left( \theta ^{T}\varvec{y}\right) \right) \le 0\right\}\), while \(\theta ^{\star }\notin \left\{ \mathbb {E}_{\varvec{\varvec{F}}}\left( g\left( \theta ^{T}\varvec{y}\right) \right) \le 0\right\}\), for \(g:\mathbb {R}\rightarrow \mathbb {R}\), and \(\theta ^{\star }\in \Theta \subseteq \mathbb {R}^{n}\). If g is 1-Lipschitz, \(\Theta\) is bounded in the Euclidean norm, and \(\lambda _{T}\ge \text {diam}\left( \Theta \right) \text {diam}\left( {\mathcal {X}}\right) \mu \left( {\mathcal {X}}\right) \beta _{T,k}\), then the probability of falsely classifying \(\theta ^{\star }\) in the zero level set of \(\mathbb {E}_{\varvec{\varvec{F}}}\left( g\left( \theta ^{T}\varvec{y}\right) \right)\), via the use of the conservative statistical program \(\sup _{{\mathcal {W}}\left( \varvec{\varvec{F}}_{T},\varvec{G}\right) \le \lambda _{T}}\mathbb {E}_{\varvec{G}}\left( g\left( \theta ^{T}\varvec{y}\right) \right)\), is bounded above by the rhs of (3).
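The confidence-interval construction above is fully explicit once \(C^{\star }\) and M are given; a sketch, assuming illustrative values for every constant (including \(C_{1}, c^{\star }, R\) and the Lebesgue measure of \({\mathcal {X}}\)):

```python
import numpy as np

def beta_star(k, T, b, n, C1, M):
    """beta*_{T,k}: beta_{T,k} with the Hessian constant replaced by the known bound M."""
    return (k / (np.sqrt(T) * b**n)
            + C1 / ((2.0 * np.pi)**n * np.sqrt(T) * b**n)
            + 0.5 * M * b**2)

def confidence_interval(est, k, T, b, n, C1, M, c_star, R, mu_X):
    """Interval E_{F_T}(f(y)) -/+ c* R mu(X) beta*_{T,k} for E_F(f(y))."""
    half = c_star * R * mu_X * beta_star(k, T, b, n, C1, M)
    return est - half, est + half

def coverage(k, C3, C_star):
    """Guaranteed level 1 - 2 exp(-k^2 / (2 C3 (1 + 2 C*)^2))."""
    return 1.0 - 2.0 * np.exp(-k**2 / (2.0 * C3 * (1.0 + 2.0 * C_star)**2))

lo, hi = confidence_interval(est=0.7, k=8.0, T=50_000, b=0.1, n=1,
                             C1=1.0, M=2.0, c_star=1.0, R=1.0, mu_X=2.0)
print(lo, hi, coverage(k=8.0, C3=1.0, C_star=1.0))
```

The sketch makes the practical tension visible: a larger k raises the guaranteed coverage but widens the interval through \(\beta ^{\star }_{T,k}\).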

3 Regularized prediction problems with Lipschitz costs

In what follows \({\left( {\mathcal {F}},\left\| \cdot \right\| _{{\mathcal {F}}}\right) }\) conforms to the function space in Theorem 1.C, \({\mathcal {L}}\) is a loss function on \(\mathbb {R}^{2}\), and the researcher has the sample \(\left( \varvec{x}_{t}\right) _{t\in \left\{ 1,\cdots ,T\right\} }\) with \(\varvec{x}_{t}:=\left( y_{t},\textbf{X}{}_{t}\right)\), where \(y_{t}\) denotes the response variable and \(\textbf{X}_{t}\) the predictors. We consider the regularized prediction (conditional on \(\textbf{X}_{t}\)) empirical program:

$$\begin{aligned} f_{T}:=\arg \min _{f\in {\mathcal {F}}}\mathbb {E}_{\varvec{\varvec{F}}_{T}}\left[ {\mathcal {L}}\left( f\left( \varvec{x}\right) ,y\right) \right] +\lambda _{T}\left\| f\right\| _{{\mathcal {F}}}, \end{aligned}$$
(5)

with \(\lambda _{T}>0\) a regularization parameter.
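A minimal finite-dimensional sketch of a program of the form (5): the function class is a Euclidean ball of linear predictors, the cost is the absolute error (1-Lipschitz), and \(\left\| \cdot \right\| _{{\mathcal {F}}}\) is the Euclidean norm. All of these choices, the data-generating process, and the subgradient solver are illustrative, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 500
X = rng.uniform(-1, 1, size=(T, 3))                 # predictors X_t
y = X @ np.array([1.0, -0.5, 0.25]) + 0.1 * rng.standard_normal(T)

lam = 0.05                                          # regularization parameter lambda_T

def objective(w):
    """Empirical Lipschitz cost plus norm penalty: mean |y - f_w(X)| + lam * ||w||."""
    return np.mean(np.abs(y - X @ w)) + lam * np.linalg.norm(w)

def subgrad(w):
    r = y - X @ w
    g = -(X * np.sign(r)[:, None]).mean(axis=0)     # subgradient of the empirical cost
    g += lam * w / max(np.linalg.norm(w), 1e-12)    # subgradient of the penalty
    return g

w = np.zeros(3)
for i in range(2000):                               # projected subgradient descent
    w -= 0.05 / np.sqrt(i + 1) * subgrad(w)
    nrm = np.linalg.norm(w)
    if nrm > 5.0:                                   # projection onto the R-ball, R = 5,
        w *= 5.0 / nrm                              # keeping F uniformly R-bounded (SG.i)
print(w, objective(w))
```

The projection step mirrors the uniform R-boundedness of \({\mathcal {F}}\) required below, and the absolute-error cost supplies the Lipschitz property.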

We employ the concentration inequalities of the previous section in order to obtain statistical guarantees for the \(L^{2}\) distance between \(f_{T}\) and the solution to the population analogue of (5): \(\inf _{f\in {\mathcal {F}}}\mathbb {E}_{\varvec{\varvec{\varvec{F}}}}\left[ {\mathcal {L}}\left( f\left( \varvec{x}\right) ,y\right) \right]\). This is summarized in the following result.

Theorem 2

(Statistical Guarantees) Suppose that Assumption 1 holds. Suppose furthermore that (SG.i) \({\mathcal {F}}\) is convex and uniformly R-bounded, and (SG.ii) for some \(L,\kappa >0\), uniformly in the second argument, \({\mathcal {L}}\left( \cdot ,\cdot \right)\) is L-Lipschitz and \(\mathbb {E}_{\varvec{\varvec{\varvec{F}}}}\left[ {\mathcal {L}}\right]\) is \(\kappa\)-strongly convex. Let \(f^{\star }\) be the unique solution of the population statistical program \(\inf _{{\mathcal {F}}}\mathbb {E}_{\varvec{\varvec{\varvec{F}}}}\left[ {\mathcal {L}}\left( f\left( \varvec{x}\right) ,y\right) \right]\), and suppose that it lies in the interior of \({\mathcal {F}}\). Then, if \(\beta _{T,k}>\frac{\lambda _{T}}{2c^{\star }\mu \left( {\mathcal {X}}\right) L}\), the following statistical guarantees hold:

$$\begin{aligned} \kappa \left\| f_{T}-f^{\star }\right\| _{2}\le 4c^{\star }\mu \left( {\mathcal {X}}\right) LR\sqrt{\beta _{T,k}}+\sqrt{2R}\sqrt{4c^{\star }\mu \left( {\mathcal {X}}\right) L\left( \kappa +L\right) \beta _{T,k}+\kappa \lambda _{T}}, \end{aligned}$$
(6)

with probability greater than or equal to \(1-2\exp \left( -\frac{k^{2}}{2C_{3}\left( 1+2\sum _{n=1}^{\infty }\phi _{n}\right) ^{2}}\right)\).

The parameter space convexity and (small sample) boundedness in (SG.i) and the Lipschitz continuity property in (SG.ii) are not rare in statistical applications. Strong convexity of the population criterion depends crucially on \(\varvec{\varvec{F}}\), and holds whenever \(\mathbb {E}_{\varvec{\varvec{\varvec{F}}}}\left[ {\mathcal {L}}\right]\) is convex and twice Fréchet differentiable with second order derivative whose spectrum is bounded away from zero uniformly in \({\mathcal {F}}\). The statistical guarantees in (6) hold for any T. The inequality (6) provides an upper bound for the \(L^{2}\) distance between the empirical predictor and the population solution that holds with non-trivial probability as long as \(k>\sqrt{2\ln (2)C_{3}}(1+2\sum _{n=1}^{\infty }\phi _{n})\). The bound depends on the Lipschitz and strong convexity properties of the loss, the regularization parameter, as well as the characteristics of the kernel, the unknown density, and the boundedness properties of the function space as those appear in Theorem 1. The result allows for diverging R as \(T\rightarrow \infty\), hence for cases where the parameter space \({\mathcal {F}}\) becomes asymptotically unbounded. It also allows for the population solution \(f^{\star }\) to depend on T, as well as for the strong convexity parameter \(\kappa\) to become asymptotically nullified. If \(b_{T}=o\left( 1\right)\) and \(\sqrt{T}b_{T}^{n}\rightarrow +\infty\), these conditions imply that \(\left\| f_{T}-f^{\star }\right\| _{2}\) becomes asymptotically negligible w.h.p. as long as \(\lambda _{T}<2c^{\star }\mu \left( {\mathcal {X}}\right) L\beta _{T,k}\) for some \(k\rightarrow \infty\), and \(\frac{R}{\kappa }\left( b_{T}+\sqrt{\frac{k}{\sqrt{T}b_{T}^{n}}}\right) =o\left( 1\right)\). Thus, Theorem 2 provides sufficient conditions for weak consistency of \(f_{T}\), even in cases where R diverges, as long as the regularization parameter is asymptotically strictly bounded above by some sequence \(O(\frac{k_{T}}{\sqrt{T}b_{T}^{n}}+\frac{C_{1}}{\left( 2\pi \right) ^{n}\sqrt{T}\sqrt{1+2\sum _{n=1}^{\infty }\sqrt{\phi _{n}}}b_{T}^{n}}+\frac{1}{2}C_{2}b_{T}^{2})\), where the diverging \(k_{T}\) is \(o(\sqrt{T}b_{T}^{n})\).

An example that adheres to the formulation above is that of Support Vector Machines with Hinge costs (see Example 14.19 of Wainwright, 2019). \({\mathcal {F}}\) is typically the R-ball of a Reproducing Kernel Hilbert Space comprised of discriminant real functions and centered at zero, and \({\mathcal {L}}\left( f\left( x\right) ,y\right) :=\left( 1-yf\left( x\right) \right) _{+}\). The latter is clearly 1-Lipschitz in its first argument, while \(\kappa\)-strong convexity holds as long as \(\mathbb {E}_{\varvec{F}}\left[ y^{2}\delta \left( 1-yf\left( x\right) \right) \right]\), with \(\delta\) denoting the Dirac delta function, is bounded away from zero uniformly on \(\mathcal {F}\).
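The hinge-cost program can be sketched in representer form, \(f(\cdot )=\sum _{t}\alpha _{t}k(\cdot ,\textbf{X}_{t})\), with a Gaussian reproducing kernel and the RKHS norm \(\sqrt{\alpha ^{T}K\alpha }\) as penalty. The kernel bandwidth, the synthetic labels, and the subgradient solver below are all illustrative choices, not the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 200
X = rng.uniform(-1, 1, size=(T, 2))
y = np.sign(X[:, 0] * X[:, 1] + 1e-9)               # nonlinearly separated labels

K = np.exp(-4.0 * ((X[:, None, :] - X[None, :, :])**2).sum(-1))  # Gaussian Gram matrix
lam = 0.01                                          # regularization parameter lambda_T
alpha = np.zeros(T)

def hinge_risk(a):
    """Empirical hinge cost plus RKHS-norm penalty lam * ||f||_H."""
    f = K @ a
    return np.mean(np.maximum(0.0, 1.0 - y * f)) + lam * np.sqrt(max(a @ K @ a, 0.0))

for i in range(500):                                 # subgradient descent on representer weights
    f = K @ alpha
    active = (1.0 - y * f) > 0.0                     # margin violators drive the subgradient
    g = -(K[:, active] * y[active]).sum(axis=1) / T
    g += lam * (K @ alpha) / max(np.sqrt(alpha @ K @ alpha), 1e-12)
    alpha -= 0.5 / np.sqrt(i + 1) * g

acc = np.mean(np.sign(K @ alpha) == y)
print(hinge_risk(alpha), acc)
```

The hinge cost is kinked, so only a subgradient is available; this is precisely why the sub-differential calculus invoked in the proof of Theorem 2 is needed in this example.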

Towards a generalization of (6), suppose that (5) is substituted with \(\min _{f\in \mathcal {G}_{T}}\mathbb {E}_{\varvec{\varvec{F}}_{T}}\left[ {\mathcal {L}}\left( f\left( \varvec{x}\right) ,y\right) \right] +\lambda _{T}\left\| f\right\| _{\mathcal {F}}\), with convex \({\mathcal {G}}_{T} \subseteq {\mathcal {F}}\), define \(g_{T}^{\star }:=\arg \min _{f\in \mathcal {G}_{T}}\left\| f-f^{\star }\right\| _{2}\), and suppose that the sub-differential \(\partial \mathbb {E}_{\varvec{\varvec{\varvec{F}}}}\left[ {\mathcal {L}}\left( \cdot ,y\right) \right]\) is \(L_{\partial }\)-Lipschitz uniformly in y (see for example Ch. 9 of Rockafellar and Wets, 2009). Then the following oracle inequality is similarly obtained (see the proof of Theorem 2):

$$\begin{aligned} \kappa \left\| f_{T}-f^{\star }\right\| _{2}\le \begin{array}{c} 4c^{\star }\mu \left( {\mathcal {X}}\right) LR\sqrt{\beta _{T,k}}+\left( 1+L+L_{\partial }\right) \left\| g_{T}^{\star }-f^{\star }\right\| _{2}\\ +\sqrt{\begin{array}{c} \left( \left( L+L_{\partial }\right) ^{2}\left\| g_{T}^{\star }-f^{\star }\right\| _{2}+8L^{2}c^{\star }\mu \left( {\mathcal {X}}\right) R\sqrt{\beta _{T,k}}\right) \left\| g_{T}^{\star }-f^{\star }\right\| _{2}\\ +2\left( 4c^{\star }\mu \left( {\mathcal {X}}\right) L\left( \kappa +L\right) \beta _{T,k}+\kappa \lambda _{T}\right) R \end{array}} \end{array}, \end{aligned}$$
(7)

whenever \(\lambda _{T}<2c^{\star }\mu \left( {\mathcal {X}}\right) L\beta _{T,k}\) holds, with probability greater than or equal to the probability bound in Theorem 2. This reduces to (6) when \(f^{\star }=g_{T}^{\star }\).

4 Proofs

Proof of Theorem 1

Consider (2). Due to the Hoeffding type inequality for phi-mixing processes (see Rio 2000), and working exactly as in the proof of Theorem 1 of Vogel and Schettler (2013), we obtain that \(\mathbb {P}\left( \left| J_{T}-\mathbb {E}\left( J_{T}\right) \right| >t\right) \le 2\exp \left( -\frac{t^{2}Tb_{T}^{2n}}{2C_{3}\left( 1+2\sum _{n=1}^{\infty }\phi _{n}\right) ^{2}}\right)\), for any \(t\ge 0\), where \(J_{T}:=\sup _{\varvec{\varvec{x}}\in {\mathcal {X}}}\left| \frac{1}{Tb_{T}^{n}}\sum _{t=1}^{T}{\mathcal {K}}\left( \frac{\varvec{\varvec{\varvec{x}}}_{t}-\varvec{\varvec{x}}}{b_{T}}\right) -f_{\varvec{\varvec{\varvec{x}}}}\left( \varvec{\varvec{x}}\right) \right|\). Working as in the proof of the first Lemma of Vogel and Schettler (2013), and noting that, due to the phi-mixing covariance inequality (see Corollary 14.5 of Davidson, 1994), \(\frac{1}{T^{2}}\text {Var}\left( \sum _{t=1}^{T}\exp \left( \text {i}u^{\text {T}}\varvec{\varvec{x}}_{t}\right) \right) \le \frac{1}{T}\left( 1+2\sum _{n=1}^{\infty }\sqrt{\phi _{n}}\right)\), we obtain that

$$\begin{aligned} \mathbb {E}\left( \sup _{\varvec{\varvec{x}}\in {\mathcal {X}}}\left| \frac{1}{Tb_{T}^{n}}\sum _{t=1}^{T}{\mathcal {K}}\left( \frac{\varvec{\varvec{\varvec{x}}}_{t}-\varvec{\varvec{x}}}{b_{T}}\right) -\mathbb {E}\left( \frac{1}{Tb_{T}^{n}}\sum _{t=1}^{T}{\mathcal {K}}\left( \frac{\varvec{\varvec{\varvec{x}}}_{t}-\varvec{\varvec{x}}}{b_{T}}\right) \right) \right| \right) \le \frac{C_{1}}{\left( 2\pi \right) ^{n}\sqrt{T}\sqrt{1+2\sum _{n=1}^{\infty }\sqrt{\phi _{n}}}b_{T}^{n}}. \end{aligned}$$

Finally, due to the second Lemma of Vogel and Schettler (2013), we obtain the inequality \(\sup _{\varvec{\varvec{x}}\in {\mathcal {X}}}\left| \mathbb {E}\left( \frac{1}{Tb_{T}^{n}}\sum _{t=1}^{T}{\mathcal {K}}\left( \frac{\varvec{\varvec{\varvec{x}}}_{t}-\varvec{\varvec{x}}}{b_{T}}\right) \right) -f_{\varvec{\varvec{\varvec{x}}}}\left( \varvec{\varvec{x}}\right) \right| \le \frac{1}{2}C_{2}b_{T}^{2}\). The result follows by choosing \(t:=\frac{k}{\sqrt{T}b_{T}^{n}}\) in the probability inequality above. (3) follows from Theorem 4 of Gibbs and Su (2002), the compactness of \({\mathcal {X}}\), and (2). Analogously, (4) follows from the uniform boundedness of \(\mathcal {F}\), the dominance of \(\left\| \cdot \right\| _{\mathcal {F}}\), the compactness of \({\mathcal {X}}\), and (2).

Proof of Theorem 2

Set \(R^{\star }:=2c^{\star }\mu \left( {\mathcal {X}}\right) R\). First, for any \(f\in \mathcal {F}\), due to the Lipschitz properties of \({\mathcal {L}}\), the \(\left\| \cdot \right\| _{\mathcal {F}}\)-boundedness of \(\mathcal {F}\) (which implies that \(\mathcal {F}-f^{\star }\) is 2R-bounded), and (4),

$$\begin{aligned}{} & {} \mathbb {P}\left[ \frac{\left| \mathbb {E}_{\varvec{F}_{T}}\left[ {\mathcal {L}}\left( f\left( \varvec{x}\right) ,y\right) \right] -\mathbb {E}_{\varvec{F}_{T}}\left[ {\mathcal {L}}\left( f^{\star }\left( \varvec{x}\right) ,y\right) \right] -\mathbb {E}_{\varvec{F}}\left[ {\mathcal {L}}\left( f\left( \varvec{x}\right) ,y\right) \right] +\mathbb {E}_{\varvec{F}}\left[ {\mathcal {L}}\left( f^{\star }\left( \varvec{x}\right) ,y\right) \right] \right| }{\left\| f-f^{\star }\right\| _{2}+\sqrt{\beta _{T,k}}}\ge 2LR^{\star }\sqrt{\beta _{T,k}}\right] \\{} & {} \quad \le \mathbb {P}\left[ \frac{\left| \mathbb {E}_{\varvec{F}_{T}}\left[ f\left( \varvec{x}\right) -f^{\star }\left( \varvec{x}\right) \right] -\mathbb {E}_{\varvec{F}}\left[ f\left( \varvec{x}\right) -f^{\star }\left( \varvec{x}\right) \right] \right| }{\left\| f-f^{\star }\right\| _{2}+\sqrt{\beta _{T,k}}}\ge R^{\star }\sqrt{\beta _{T,k}}\right] \\{} & {} \quad \le 2\exp \left( -\frac{k^{2}}{2C_{3}\left( 1+2\sum _{n=1}^{\infty }\phi _{n}\right) ^{2}}\right) . \end{aligned}$$

We first prove (7). Then (6) follows by assuming \(f^{\star }=g_{T}^{\star }\), and noting that since \(f^{\star }\) is interior, the normal cone of \(\mathcal {F}\) at \(f^{\star }\) is \(\left\{ 0\right\}\), and the optimality of \(f^{\star }\) implies that \(\left\langle \partial \mathbb {E}_{\varvec{\varvec{F}}}\left[ {\mathcal {L}}\left( f^{\star }\left( \varvec{x}\right) ,y\right) \right] ,f_{T}-f^{\star }\right\rangle _{2}\in \left\{ 0\right\}\), where \(\partial \mathbb {E}_{\varvec{\varvec{F}}}\left[ {\mathcal {L}}\left( f^{\star }\left( \varvec{x}\right) ,y\right) \right]\) denotes the sub-differential of the population criterion at \(f^{\star }\), and \(\left\langle \cdot ,\cdot \right\rangle _{2}\) the \(L^{2}\) inner product. In this case the Lipschitz property of the sub-differential is redundant. Towards proving (7), remember that \(g_{T}^{\star }:=\arg \min _{f\in \mathcal {G}_{T}}\left\| f-f^{\star }\right\| _{2}\). Then consider the event

$$\begin{aligned} \left\{ \frac{\left| \mathbb {E}_{\varvec{\varvec{F}}_{T}}\left[ {\mathcal {L}}\left( f_{T}\left( \varvec{x}\right) ,y\right) \right] -\mathbb {E}_{\varvec{\varvec{F}}_{T}}\left[ {\mathcal {L}}\left( g_{T}^{\star }\left( \varvec{x}\right) ,y\right) \right] -\mathbb {E}_{\varvec{\varvec{F}}}\left[ {\mathcal {L}}\left( f_{T}\left( \varvec{x}\right) ,y\right) \right] +\mathbb {E}_{\varvec{\varvec{F}}}\left[ {\mathcal {L}}\left( g_{T}^{\star }\left( \varvec{x}\right) ,y\right) \right] \right| }{\left\| f_{T}-g_{T}^{\star }\right\| _{2}+\sqrt{\beta _{T,k}}}\le 2LR^{\star }\sqrt{\beta _{T,k}}\right\} , \end{aligned}$$

and notice that the probability of the above is bounded below by \(1-2\exp \left( -\frac{k^{2}}{2C_{3}\left( 1+2\sum _{n=1}^{\infty }\phi _{n}\right) ^{2}}\right)\), due to the previous result (which does not use the fact that \(f^{\star }\) is interior). If the event holds, then, due to the fact that \(\mathbb {E}_{\varvec{\varvec{F}}_{T}}\left[ {\mathcal {L}}\left( f_{T}\left( \varvec{x}\right) ,y\right) \right] -\mathbb {E}_{\varvec{\varvec{F}}_{T}}\left[ {\mathcal {L}}\left( g_{T}^{\star }\left( \varvec{x}\right) ,y\right) \right] +\lambda _{T}\left( \left\| f_{T}\right\| _{\mathcal {F}}-\left\| g_{T}^{\star }\right\| _{\mathcal {F}}\right) \le 0\), it must be the case that

$$\begin{aligned} \mathbb {E}_{\varvec{\varvec{\varvec{F}}}}\left[ {\mathcal {L}}\left( f_{T}\left( \varvec{x}\right) ,y\right) \right] -\mathbb {E}_{\varvec{\varvec{\varvec{F}}}}\left[ {\mathcal {L}}\left( g_{T}^{\star }\left( \varvec{x}\right) ,y\right) \right] +\lambda _{T}\left( \left\| f_{T}\right\| _{\mathcal {F}}-\left\| g_{T}^{\star }\right\| _{\mathcal {F}}\right) \le 2LR^{\star }\sqrt{\beta _{T,k}}\left( \left\| f_{T}-g_{T}^{\star }\right\| _{2}+\sqrt{\beta _{T,k}}\right) . \end{aligned}$$

The \(\kappa\)-strong convexity of the population criterion then implies that

$$\begin{aligned}{} & {} \left\langle \partial \mathbb {E}_{\varvec{\varvec{\varvec{F}}}}\left[ {\mathcal {L}}\left( g_{T}^{\star }\left( \varvec{x}\right) ,y\right) \right] ,f_{T}-g_{T}^{\star }\right\rangle _{2}+\frac{\kappa }{2}\left\| f_{T}-g_{T}^{\star }\right\| _{2}^{2}+\lambda _{T}\left( \left\| f_{T}\right\| _{\mathcal {F}}-\left\| g_{T}^{\star }\right\| _{\mathcal {F}}\right) \\{} & {} \quad \le 2LR^{\star }\sqrt{\beta _{T,k}}\left( \left\| f_{T}-g_{T}^{\star }\right\| _{2}+\sqrt{\beta _{T,k}}\right) . \end{aligned}$$

Now, notice that due to the local optimality of \(g_{T}^{\star }\), \(\partial \mathbb {E}_{\varvec{F}}\left[ {\mathcal {L}}\left( g_{T}^{\star }\left( \varvec{x}\right) ,y\right) \right]\) must lie inside the normal cone of \(\mathcal {G}_{T}\) at \(g_{T}^{\star }\). This and the fact that \(f_{T}\) satisfies the empirical local optimality conditions imply that

$$\begin{aligned} \left\langle \partial \mathbb {E}_{\varvec{\varvec{F}}}\left[ {\mathcal {L}}\left( g_{T}^{\star }\left( \varvec{x}\right) ,y\right) \right] ,f_{T}-g_{T}^{\star }\right\rangle _{2}\le 0. \end{aligned}$$

The lhs of the previous display is greater than or equal to

$$\begin{aligned} \left\langle \partial \mathbb {E}_{\varvec{\varvec{F}}}\left[ {\mathcal {L}}\left( g_{T}^{\star }\left( \varvec{x}\right) ,y\right) \right] -\partial \mathbb {E}_{\varvec{\varvec{F}}}\left[ {\mathcal {L}}\left( f^{\star }\left( \varvec{x}\right) ,y\right) \right] ,f_{T}-g_{T}^{\star }\right\rangle _{2}+\left\langle \partial \mathbb {E}_{\varvec{\varvec{\varvec{F}}}}\left[ {\mathcal {L}}\left( f^{\star }\left( \varvec{x}\right) ,y\right) \right] ,f_{T}-g_{T}^{\star }\right\rangle _{2} \end{aligned}$$

which (due to the sub-differential inclusion condition) is greater than or equal to

$$\begin{aligned}{} & {} \left\langle \partial \mathbb {E}_{\varvec{\varvec{F}}}\left[ {\mathcal {L}}\left( g_{T}^{\star }\left( \varvec{x}\right) ,y\right) \right] -\partial \mathbb {E}_{\varvec{\varvec{\varvec{F}}}}\left[ {\mathcal {L}}\left( f^{\star }\left( \varvec{x}\right) ,y\right) \right] ,f_{T}-g_{T}^{\star }\right\rangle _{2}\\{} & {} \quad +\left\langle \partial \mathbb {E}_{\varvec{\varvec{\varvec{F}}}}\left[ {\mathcal {L}}\left( f^{\star }\left( \varvec{x}\right) ,y\right) \right] ,f_{T}-g_{T}^{\star }\right\rangle _{2}+\left\langle \partial \mathbb {E}_{\varvec{\varvec{F}}}\left[ {\mathcal {L}}\left( f^{\star }\left( \varvec{x}\right) ,y\right) \right] ,f^{\star }-g_{T}^{\star }\right\rangle _{2}, \end{aligned}$$

which due to Cauchy–Schwarz inequality and the Lipschitz property of the sub-differential is greater than or equal to

$$\begin{aligned} -L_{\partial }\left\| g_{T}^{\star }-f^{\star }\right\| _{2}\left\| f_{T}-g_{T}^{\star }\right\| _{2}-L\left( \left\| f_{T}-g_{T}^{\star }\right\| _{2}+\left\| f^{\star }-g_{T}^{\star }\right\| _{2}\right) . \end{aligned}$$

The previous then imply that

$$\begin{aligned}{} & {} -L_{\partial }\left\| g_{T}^{\star }-f^{\star }\right\| _{2}\left\| f_{T}-g_{T}^{\star }\right\| _{2}-L\left( \left\| f_{T}-g_{T}^{\star }\right\| _{2}+\left\| f^{\star }-g_{T}^{\star }\right\| _{2}\right) +\frac{\kappa }{2}\left\| f_{T}-g_{T}^{\star }\right\| _{2}^{2}+\lambda _{T}\left( \left\| f_{T}\right\| _{\mathcal {F}}-\left\| g_{T}^{\star }\right\| _{\mathcal {F}}\right) \\{} & {} \quad \le 2LR^{\star }\sqrt{\beta _{T,k}}\left( \left\| f_{T}-g_{T}^{\star }\right\| _{2}+\sqrt{\beta _{T,k}}\right) \Rightarrow \\{} & {} \frac{\kappa }{2}\left\| f_{T}-g_{T}^{\star }\right\| _{2}^{2}-\left( L\left( 2R^{\star }\sqrt{\beta _{T,k}}+\left\| f^{\star }-g_{T}^{\star }\right\| _{2}\right) +L_{\partial }\left\| g_{T}^{\star }-f^{\star }\right\| _{2}\right) \left\| f_{T}-g_{T}^{\star }\right\| _{2}\\{} & {} \quad +\left( \lambda _{T}\left( \left\| f_{T}\right\| _{\mathcal {F}}-\left\| g_{T}^{\star }\right\| _{\mathcal {F}}\right) -2LR^{\star }\beta _{T,k}\right) \le 0. \end{aligned}$$
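The root comparison applied to the last display rests on the following elementary fact, spelled out here for completeness with \(x:=\left\| f_{T}-g_{T}^{\star }\right\| _{2}\) and B, C abbreviating the linear and constant coefficients of the quadratic above:

```latex
% With x := \|f_T - g_T^\star\|_2 and the abbreviations
%   B := L(2R^\star\sqrt{\beta_{T,k}} + \|f^\star - g_T^\star\|_2)
%        + L_\partial \|g_T^\star - f^\star\|_2 \ge 0,
%   C := \lambda_T(\|f_T\|_{\mathcal F} - \|g_T^\star\|_{\mathcal F})
%        - 2LR^\star\beta_{T,k},
% the display reads (\kappa/2) x^2 - B x + C \le 0, whose roots are
% (B \pm \sqrt{B^2 - 2\kappa C})/\kappa. When C < 0 the discriminant
% exceeds B^2, so one root is negative, one is positive, and
\[
x \;\le\; \frac{B + \sqrt{B^{2} - 2\kappa C}}{\kappa }.
\]
```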

The condition that ensures that the quadratic polynomial in the lhs of the previous display has two distinct roots, one negative and one positive, is \(\beta _{T,k}>\frac{\lambda _{T}}{2c^{\star }\mu \left( {\mathcal {X}}\right) L}\), since this and the fact that \(\left\| f_{T}\right\| _{\mathcal {F}}-\left\| g_{T}^{\star }\right\| _{\mathcal {F}}\le 2R\), imply that \(\lambda _{T}\left( \left\| f_{T}\right\| _{\mathcal {F}}-\left\| g_{T}^{\star }\right\| _{\mathcal {F}}\right) -2LR^{\star }\beta _{T,k}<0\). Comparing with the positive root we obtain that

$$\begin{aligned}{} & {} \kappa \left\| f_{T}-g_{T}^{\star }\right\| _{2}\le L\left( 2R^{\star }\sqrt{\beta _{T,k}}+\left\| f^{\star }-g_{T}^{\star }\right\| _{2}\right) +L_{\partial }\left\| g_{T}^{\star }-f^{\star }\right\| _{2}\\{} & {} \qquad +\sqrt{4LR^{\star }\left( \kappa +L\right) \beta _{T,k}+\left( L+L_{\partial }\right) ^{2}\left\| g_{T}^{\star }-f^{\star }\right\| _{2}^{2}+4L^{2}R^{\star }\sqrt{\beta _{T,k}}\left\| f^{\star }-g_{T}^{\star }\right\| _{2}-2\kappa \lambda _{T}\left( \left\| f_{T}\right\| _{\mathcal {F}}-\left\| g_{T}^{\star }\right\| _{\mathcal {F}}\right) }\\{} & {} \quad \le 2LR^{\star }\sqrt{\beta _{T,k}}+\left( L+L_{\partial }\right) \left\| g_{T}^{\star }-f^{\star }\right\| _{2}\\{} & {} \qquad +\sqrt{4LR^{\star }\left( \kappa +L\right) \beta _{T,k}+\left( L+L_{\partial }\right) ^{2}\left\| g_{T}^{\star }-f^{\star }\right\| _{2}^{2}+4L^{2}R^{\star }\sqrt{\beta _{T,k}}\left\| f^{\star }-g_{T}^{\star }\right\| _{2}+2\kappa \lambda _{T}\left\| g_{T}^{\star }\right\| _{\mathcal {F}}}, \end{aligned}$$

from which the oracle inequality (7) follows by noting that \(\left\| g_{T}^{\star }\right\| _{\mathcal {F}}\le R\).