1 Introduction

In the present note we derive non-asymptotic concentration inequalities for the uniform deviation between a multivariate density function and its non-parametric kernel density estimator over the support of the former. We work in a time series setting consisting of multivariate stationary processes with summable uniform (phi-) mixing coefficients (see Davidson, 1994). We rely heavily on the iid results of Vogel and Schettler (2013), adjusting their proofs via the use of concentration and covariance inequalities for uniformly mixing processes. Regarding the two underlying probability measures, we readily derive analogous probability bounds for their (first) Wasserstein distance, as well as for the deviations between integrals of bounded functions w.r.t. them. These bounds hold uniformly w.r.t. the sample size. The inequalities can be used for the construction of confidence regions, the estimation of the finite sample probabilities of false inclusion of parameter values in level sets of moment conditions, etc. Extension of the results to other forms of temporal dependence, like strong mixing or absolute regularity, and subsequently the application to a wider range of econometric time series models, is deferred to future research.

As an example, we apply the concentration results to the derivation of statistical guarantees and oracle inequalities in regularized prediction problems with Lipschitz and strongly convex costs over function spaces. The inequalities imply a uniform non-asymptotic LLN for the deviation between the empirical (w.r.t. the kernel estimator) and the population cost differentials. This, coupled with the relevant sub-differential calculus in convex programming, implies a non-asymptotic statistical guarantee for the \(L^{2}\) deviation between the empirical and the population solutions, as long as the regularization parameter is appropriately dominated. The framework is quite general and allows for dynamic parameter spaces and population solutions. It includes as special cases non-linear Support Vector Machines with Hinge costs.

For the remainder of the note: in Sect. 2 we derive and discuss the concentration inequalities, and in Sect. 3 we treat the aforementioned regularized prediction problems. Section 4 contains the proofs.

2 Concentration inequalities for kernel density estimators

We begin with an assumption that specifies our probabilistic and statistical framework. The assumption imposes restrictions on the marginal distributions and the dynamics of the stochastic process involved in the density estimation. It also restricts the properties of the employed kernel technology:

Assumption 1

(i) The \(\mathbb {R}^{n}\)-valued stochastic process \(\left( \varvec{x}_{t}\right) _{t\in \mathbb {Z}}\) is strictly stationary and phi-mixing, with absolutely summable mixing coefficient sequence \(\left( \phi _{n}\right) _{n\in \mathbb {N}}\). (ii) \({\mathcal {K}}\) is a positive, symmetric, bounded, Lipschitz continuous and compactly supported convolution kernel on \(\mathbb {R}^{n}\) such that \(\int _{\mathbb {R}^{n}}{{\mathcal {K}}}\left( u\right) du=1\) and \(\int _{\mathbb {R}^{n}}\left\| u\right\| ^{2}{{\mathcal {K}}}\left( u\right) du<+\infty\). (iii) The distribution of \(\varvec{x}_{0}\) has a compact support \({\mathcal {X}}\), and a continuous density \(f_{\varvec{\varvec{x}}}\left( \cdot \right)\) that is twice differentiable with continuous second derivatives.

An example of a dynamic multivariate process that satisfies Assumption 1(i) is given by the solution of the following stochastic recursion equation (SRE):

$$\begin{aligned} \varvec{x}_{t}=h\left( \varvec{x}_{t-1}\right) +z_{t}, \end{aligned}$$
(1)

where \(\left( z_{t}\right) _{t\in \mathbb {Z}}\) is an iid sequence of n-random vectors, the distribution of \(z_{0}\) has a density, and \(h:\mathbb {R}^{n}\rightarrow \mathbb {R}^{n}\) is a contraction (w.r.t. some metric on \(\mathbb {R}^{n}\)) with compact range. Theorem 2.1.3 of Doukhan and Ghindès (1983) then implies the required mixing property for the unique solution of the SRE. This incorporates the iid case as a special case. If \(z_{0}\) also has bounded support, then the compactness of the support of the distribution of \(\varvec{x}_{0}\) required in Assumption 1(iii) also holds. The remaining parts impose mostly usual conditions in non-parametric statistics (see El Machkouri et al., 2020, and references therein). Boundedness of supports can be relaxed as long as \({\mathcal {K}}\) has an integrable Fourier transform, and the Hessian of \(f_{\varvec{\varvec{x}}}\) is bounded in the Frobenius norm.

The researcher has at her disposal the time series sample \(\left( \varvec{x}_{t}\right) _{t=1,\dots ,T}\), and estimates the unknown \(f_{\varvec{\varvec{x}}}\) via the kernel estimator \(\frac{1}{Tb_{T}^{n}}\sum _{t=1}^{T}{{\mathcal {K}}}\left( \frac{\varvec{x}_{t}-\cdot }{b_{T}}\right)\), with \(b_{T}>0\) the bandwidth. We consider the problem of bounding, uniformly in T, the probability that the uniform deviation \(\sup _{\varvec{y}\in {{\mathcal {X}}}}\left| \frac{1}{Tb_{T}^{n}}\sum _{t=1}^{T}{{\mathcal {K}}}\left( \frac{\varvec{x}_{t}-\varvec{y}}{b_{T}}\right) -f_{\varvec{\varvec{x}}}\left( \varvec{y}\right) \right|\) exceeds an asymptotically negligible deterministic sequence. The following theorem summarizes the result. There, \({{\mathcal {W}}}\left( G,\,G^{\star }\right)\) denotes the first Wasserstein distance between arbitrary distributions \(G,\,G^{\star }\) on \({{\mathcal {X}}}\), defined as \(\min _{\gamma \in \Gamma \left( G,\,G^{\star }\right) }\int _{{{\mathcal {X}}}\times {{\mathcal {X}}}}d\left( z,z^{\star }\right) d\gamma \left( z,z^{\star }\right)\), where \(\Gamma \left( G,\,G^{\star }\right)\) denotes the set of Borel probability distributions on \({{\mathcal {X}}}\times {{\mathcal {X}}}\) that have respective marginals \(G,\,G^{\star }\), and d denotes the Euclidean distance (see Gao et al., 2017). \(\mu\) denotes the Lebesgue measure on \(\mathbb {R}^{n}\), and \(\text {diam}\left( {{\mathcal {X}}}\right)\) denotes the Euclidean diameter of \({\mathcal {X}}\).
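For concreteness, the estimator above is directly computable. The following is a minimal numerical sketch, assuming a product Epanechnikov kernel (positive, symmetric, bounded, Lipschitz and compactly supported, consistent with Assumption 1(ii)) and a stand-in iid sample; all numerical choices are illustrative, not part of the theory above.

```python
import numpy as np

def epanechnikov(u):
    """Product Epanechnikov kernel on R^n; compactly supported on [-1,1]^n, integrates to 1."""
    inside = np.all(np.abs(u) <= 1.0, axis=-1)
    return np.where(inside, np.prod(0.75 * (1.0 - u**2), axis=-1), 0.0)

def kde(sample, y, b):
    """Kernel estimate (1/(T b^n)) * sum_t K((x_t - y)/b), evaluated at the rows of y."""
    T, n = sample.shape
    u = (sample[:, None, :] - y[None, :, :]) / b   # shape (T, m, n)
    return epanechnikov(u).sum(axis=0) / (T * b**n)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(2000, 2))             # stand-in sample on X = [-1,1]^2
grid = np.array([[0.0, 0.0], [0.5, -0.5]])
print(kde(x, grid, b=0.3))                         # estimates should be near the true density 1/4
```

A dependent sample, e.g. generated by the SRE (1), can be substituted for `x` without changing the estimator itself.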

Theorem 1

(Concentration Inequalities) Suppose that Assumption 1 holds:

A. Uniformly in \(T\ge 1\) and for any \(k>0,\)

$$\begin{aligned} \mathbb {P}\left( \begin{array}{c} \sup _{\varvec{y}\in {{\mathcal {X}}}}\left| \frac{1}{Tb_{T}^{n}}\sum _{t=1}^{T}{{\mathcal {K}}}\left( \frac{\varvec{x}_{t}-\varvec{y}}{b_{T}}\right) -f_{\varvec{\varvec{x}}}\left( \varvec{y}\right) \right| >\beta _{T,k}\end{array}\right) \le 2\exp \left( -\frac{k^{2}}{2C_{3}\left( 1+2\sum _{n=1}^{\infty }\phi _{n}\right) ^{2}}\right) , \end{aligned}$$
(2)

where, \(\beta _{T,k}:=\frac{k}{\sqrt{T}b_{T}^{n}}+\frac{C_{1}}{\left( 2\pi \right) ^{n}\sqrt{T}\sqrt{1+2\sum _{n=1}^{\infty }\sqrt{\phi _{n}}}b_{T}^{n}}+\frac{1}{2}C_{2}b_{T}^{2},\) \(C_{1}:=\int _{\mathbb {R}^{n}}\left| \int _{\text {supp}\left( {{\mathcal {K}}}\right) }{{\mathcal {K}}}\left( u\right) \exp \left( \text {i}y^{T}u\right) du\right| dy,\) \(C_{2}:=\sup _{i,j,\,{y}\in {{\mathcal {X}}}}\left| \frac{\partial ^{2}f_{{{x}}}\left( \varvec{y}\right) }{\partial x_{i}\partial x_{j}}\right| \int _{\text {supp}\left( {\mathcal{K}}\right) }\left\| u\right\| ^{2}{{\mathcal {K}}}\left( u\right) du,\) and \(C_{3}:=\sup _{u\in \text {supp}\left( {{\mathcal {K}}}\right) }{{\mathcal {K}}}^{2}\left( u\right)\).
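The bounding sequence and the exponential tail in (2) are explicit and can be evaluated numerically. A sketch, with hypothetical values for the constants \(C_{1},C_{2},C_{3}\) and a geometrically decaying mixing sequence (all illustrative):

```python
import numpy as np

def beta_Tk(k, T, b, n, C1, C2, phi_sqrt_sum):
    """The bounding sequence beta_{T,k} of Theorem 1.A."""
    return (k / (np.sqrt(T) * b**n)
            + C1 / ((2.0 * np.pi)**n * np.sqrt(T) * np.sqrt(1.0 + 2.0 * phi_sqrt_sum) * b**n)
            + 0.5 * C2 * b**2)

def tail_bound(k, C3, phi_sum):
    """The rhs of (2): 2 exp(-k^2 / (2 C3 (1 + 2 sum_n phi_n)^2))."""
    return 2.0 * np.exp(-k**2 / (2.0 * C3 * (1.0 + 2.0 * phi_sum)**2))

phi = 0.5 ** np.arange(1, 100)     # hypothetical summable (geometric) mixing coefficients
b_val = beta_Tk(k=6.0, T=10_000, b=0.2, n=1, C1=1.0, C2=2.0,
                phi_sqrt_sum=np.sqrt(phi).sum())
p_val = tail_bound(k=6.0, C3=1.0, phi_sum=phi.sum())
print(b_val, p_val)                # for this k the tail is below one, hence informative
```

Note the trade-off visible in the code: increasing k tightens the tail bound but inflates \(\beta _{T,k}\).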

B. Let \(\varvec{\varvec{F}}_{T}\) denote the cdf corresponding to \(\frac{1}{Tb_{T}^{n}}\sum _{t=1}^{T}{{\mathcal {K}}}\left( \frac{\varvec{x}_{t}-\cdot }{b_{T}}\right)\), and analogously \(\varvec{\varvec{\varvec{F}}}\) the cdf corresponding to the density \(f_{\varvec{\varvec{x}}}\left( \cdot \right)\). Then, uniformly in \(T\ge 1\), and for any \(k>0\),

$$\begin{aligned} \mathbb {P}\left( {{\mathcal {W}}}\left( \varvec{\varvec{F}}_{T},\varvec{\varvec{F}}\right) >\text {diam}\left( {{\mathcal {X}}}\right) \mu \left( {{\mathcal {X}}}\right) \beta _{T,k}\right) \le 2\exp \left( -\frac{k^{2}}{2C_{3}\left( 1+2\sum _{n=1}^{\infty }\phi _{n}\right) ^{2}}\right) . \end{aligned}$$
(3)

C. Suppose that \({\left( {{\mathcal {F}},\left\| \cdot \right\| _{{\mathcal {F}}}\right) }\) is an R-bounded subset of a semi-normed space of real functions defined on \({\mathcal {X}}\), and there exists some \(c^{\star }>0\) for which \(\left\| \cdot \right\| _{\infty }\le c^{\star }\left\| \cdot \right\| _{{\mathcal {F}}}\). Then, uniformly in \(T\ge 1\), for any \(k>0\), and for any \(f\in {\mathcal {F}}\),

$$\begin{aligned} \mathbb {P}\left( \left| \mathbb {E}_{\varvec{\varvec{F}}_{T}}\left( f\left( \varvec{y}\right) \right) -\mathbb {E}_{\varvec{\varvec{F}}}\left( f\left( \varvec{y}\right) \right) \right| >c^{\star }R\mu \left( {\mathcal {X}}\right) \beta _{T,k}\right) \le 2\exp \left( -\frac{k^{2}}{2C_{3}\left( 1+2\sum _{n=1}^{\infty }\phi _{n}\right) ^{2}}\right) . \end{aligned}$$
(4)

The results in (2)–(4) are non-asymptotic. Whenever \(b_{T}=o\left( 1\right)\) while \(\sqrt{T}b_{T}^{n}\rightarrow +\infty\), they also provide estimates for the rates of convergence of the deviations considered. The derivation of (2) essentially follows the proof of the iid Theorem 1 of Vogel and Schettler (2013), taking into account relevant uniform mixing, concentration, and covariance inequalities, especially when handling the variance of empirical Fourier transforms. Then (3)–(4) follow from the dual functional representation of \({\mathcal {W}}\) of Kantorovich (1960), the uniform boundedness of the function space involved in C, and the compactness of \({\mathcal {X}}\). The bounding sequence \(\beta _{T,k}\) depends on the integral of the Fourier transform of the kernel, the second moment of the kernel, the magnitude of the Hessian of \(f_{\varvec{\varvec{x}}}\), the mixing coefficients, the sample size, and the bandwidth. In (3)–(4), \(\beta _{T,k}\) is complemented by the Lebesgue measure and the diameter of \({\mathcal {X}}\), and/or the uniform bound and the norm properties of the function space considered. The probability bound depends on the bound of the kernel and the mixing coefficients. It becomes tighter whenever \({\mathcal {K}}\) admits a low maximum, and/or the mixing coefficients are small and converge rapidly to zero. For example, suppose that \(n=1\), and the SRE (1) is actually a stationary uniformly ergodic AR(1) recursion, i.e. \(h(x)=\beta x\) with \(|\beta |<1\), and \(z_{0}\) follows the uniform distribution supported on \([-1,1]\); see Proposition 2.1.5 of Doukhan and Ghindès (1983). Furthermore, suppose that \({\mathcal {K}}\) equals the Gaussian kernel truncated at \([-1,1]\) with scale equal to \(\sigma ^{2}\). Then, Theorem 14.14 of Davidson (1994) implies that the probability bound is bounded above by \(2\exp {(-\frac{\pi \sigma ^{2}(1-\beta )^{2} k^{2}}{(1-\beta +2C)^{2}})}\), for some \(C>0\). This bound is less than one, and hence meaningful, if and only if \(k>\sqrt{\frac{\ln (2)}{\pi }}\frac{(1-\beta +2C)}{\sigma (1-\beta )}\).
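The AR(1) example admits a direct numerical check of the displayed bound and of the threshold for k; a sketch with hypothetical parameter values \(\sigma ^{2},\beta ,C\):

```python
import math

def ar1_tail_bound(k, sigma2, beta, C):
    """The bound 2 exp(-pi sigma^2 (1-beta)^2 k^2 / (1-beta+2C)^2) from the AR(1) example."""
    return 2.0 * math.exp(-math.pi * sigma2 * (1.0 - beta)**2 * k**2
                          / (1.0 - beta + 2.0 * C)**2)

def k_threshold(sigma2, beta, C):
    """Smallest k for which the bound falls below one, i.e. becomes informative."""
    return (math.sqrt(math.log(2.0) / math.pi)
            * (1.0 - beta + 2.0 * C) / (math.sqrt(sigma2) * (1.0 - beta)))

sigma2, beta, C = 1.0, 0.5, 0.25   # hypothetical values
k0 = k_threshold(sigma2, beta, C)
print(k0, ar1_tail_bound(k0, sigma2, beta, C))   # the bound equals one exactly at the threshold
```

Evaluating at \(k_{0}\) confirms the "if and only if" claim: the exponent equals \(\ln 2\) there, so the bound is exactly one, and it is strictly decreasing in k beyond that point.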

Remark 1

The extension of the results of Theorem 1 to stationary processes that are either strongly mixing or absolutely regular (see for example Ch. 14 of Davidson, 1994) is not trivial. The proof is based on the Hoeffding-type inequality of Rio (2000) for functions exhibiting the bounded differences property. To our knowledge, analogous results are not available for the aforementioned weaker forms of mixing. The derivation of such inequalities in those frameworks is a very interesting issue for future research. Another promising approach could be based on results like Corollary 3.3 of Krebs (2018), which is related to the main result of Merlevède et al. (2009). Unfortunately, this is not directly usable in our framework, because it presupposes uniform boundedness for the functions involved in the partial sums. This is not the case in our framework, due to the presence of the bandwidth reciprocal as a multiplicative factor outside the kernel. Analogously, the extension of such a result to sequences of bounded, yet not necessarily uniformly bounded, functions is also an interesting issue for future research.

A and C can be used, among others, in order to construct non-asymptotic confidence sets for large enough T under some further restrictions. Suppose for example that the non-parametric framework includes a restriction on the mixing coefficients of the form \(\sum _{n=1}^{\infty }\phi _{n}\le C^{\star }\) for some known \(C^{\star }>0\), as well as a known upper bound \(M>0\) for the Hessian of \(f_{x}\). Then (4) implies the confidence interval \(\left( \mathbb {E}_{\varvec{\varvec{F}}_{T}}\left( f\left( \varvec{y}\right) \right) \mp c^{\star }R\mu \left( {\mathcal {X}}\right) \beta ^{\star }_{T,k}\right)\) for \(\mathbb {E}_{\varvec{\varvec{F}}}\left( f\left( \varvec{y}\right) \right)\), of probability at least \(1-2\exp \left( -\frac{k^{2}}{2C_{3}\left( 1+2C^{\star }\right) ^{2}}\right)\), uniformly on \({\mathcal {F}}\), where \(\beta ^{\star }_{T,k}:=\frac{k}{\sqrt{T}b_{T}^{n}}+\frac{C_{1}}{\left( 2\pi \right) ^{n}\sqrt{T}b_{T}^{n}}+\frac{1}{2}Mb_{T}^{2}\). The construction of valid confidence sets without the above restrictions, via the estimation of the supremum of the Hessian appearing in \(C_{2}\) and of the mixing coefficients, is left for future research. Notice that the Hessian estimation could be facilitated by differentiation of the kernel estimator; see for example Sheather (2004). The estimation of the mixing coefficient series could be analogously facilitated by the results in Ahsen and Vidyasagar (2013) and truncation. B can be used in order to bound from above the probability of including \(\theta ^{\star }\in \left\{ \sup _{{\mathcal {W}}\left( \varvec{\varvec{F}}_{T},\varvec{G}\right) \le \lambda _{T}}\mathbb {E}_{\varvec{G}}\left( g\left( \theta ^{T}\varvec{y}\right) \right) \le 0\right\}\), while \(\theta ^{\star }\notin \left\{ \mathbb {E}_{\varvec{\varvec{F}}}\left( g\left( \theta ^{T}\varvec{y}\right) \right) \le 0\right\}\), for \(g:\mathbb {R}\rightarrow \mathbb {R}\), and \(\theta ^{\star }\in \Theta \subseteq \mathbb {R}^{n}\). If g is 1-Lipschitz, \(\Theta\) is bounded in the Euclidean norm, and \(\lambda _{T}\ge \text {diam}\left( \Theta \right) \text {diam}\left( {\mathcal {X}}\right) \mu \left( {\mathcal {X}}\right) \beta _{T,k}\), then the probability of falsely classifying \(\theta ^{\star }\) in the zero level set of \(\mathbb {E}_{\varvec{\varvec{F}}}\left( g\left( \theta ^{T}\varvec{y}\right) \right)\), via the use of the conservative statistical program \(\sup _{{\mathcal {W}}\left( \varvec{\varvec{F}}_{T},\varvec{G}\right) \le \lambda _{T}}\mathbb {E}_{\varvec{G}}\left( g\left( \theta ^{T}\varvec{y}\right) \right)\), is bounded above by the rhs of (3).
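The confidence-interval construction above is fully explicit once \(C^{\star }\) and M are given; a sketch, assuming illustrative values for every constant (including \(C_{1}, c^{\star }, R\) and the Lebesgue measure of \({\mathcal {X}}\)):

```python
import numpy as np

def beta_star(k, T, b, n, C1, M):
    """beta*_{T,k}: beta_{T,k} with the Hessian constant replaced by the known bound M."""
    return (k / (np.sqrt(T) * b**n)
            + C1 / ((2.0 * np.pi)**n * np.sqrt(T) * b**n)
            + 0.5 * M * b**2)

def confidence_interval(est, k, T, b, n, C1, M, c_star, R, mu_X):
    """Interval E_{F_T}(f(y)) -/+ c* R mu(X) beta*_{T,k} for E_F(f(y))."""
    half = c_star * R * mu_X * beta_star(k, T, b, n, C1, M)
    return est - half, est + half

def coverage(k, C3, C_star):
    """Guaranteed level 1 - 2 exp(-k^2 / (2 C3 (1 + 2 C*)^2))."""
    return 1.0 - 2.0 * np.exp(-k**2 / (2.0 * C3 * (1.0 + 2.0 * C_star)**2))

lo, hi = confidence_interval(est=0.7, k=8.0, T=50_000, b=0.1, n=1,
                             C1=1.0, M=2.0, c_star=1.0, R=1.0, mu_X=2.0)
print(lo, hi, coverage(k=8.0, C3=1.0, C_star=1.0))
```

The sketch makes the practical tension visible: a larger k raises the guaranteed coverage but widens the interval through \(\beta ^{\star }_{T,k}\).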

3 Regularized prediction problems with Lipschitz costs

In what follows \({\left( {\mathcal {F}},\left\| \cdot \right\| _{{\mathcal {F}}}\right) }\) conforms to the function space in Theorem 1.C, \({\mathcal {L}}\) is a loss function on \(\mathbb {R}^{2}\), and the researcher has the sample \(\left( \varvec{x}_{t}\right) _{t\in \left\{ 1,\cdots ,T\right\} }\) with \(\varvec{x}_{t}:=\left( y_{t},\textbf{X}{}_{t}\right)\), where \(y_{t}\) denotes the response variable and \(\textbf{X}_{t}\) the predictors. We consider the regularized prediction (conditional on \(\textbf{X}_{t}\)) empirical program:

$$\begin{aligned} f_{T}:=\arg \min _{f\in {\mathcal {F}}}\mathbb {E}_{\varvec{\varvec{F}}_{T}}\left[ {\mathcal {L}}\left( f\left( \varvec{x}\right) ,y\right) \right] +\lambda _{T}\left\| f\right\| _{{\mathcal {F}}}, \end{aligned}$$
(5)

with \(\lambda _{T}>0\) a regularization parameter.
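A minimal finite-dimensional sketch of a program of the form (5): the function class is a Euclidean ball of linear predictors, the cost is the absolute error (1-Lipschitz), and \(\left\| \cdot \right\| _{{\mathcal {F}}}\) is the Euclidean norm. All of these choices, the data-generating process, and the subgradient solver are illustrative, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 500
X = rng.uniform(-1, 1, size=(T, 3))                 # predictors X_t
y = X @ np.array([1.0, -0.5, 0.25]) + 0.1 * rng.standard_normal(T)

lam = 0.05                                          # regularization parameter lambda_T

def objective(w):
    """Empirical Lipschitz cost plus norm penalty: mean |y - f_w(X)| + lam * ||w||."""
    return np.mean(np.abs(y - X @ w)) + lam * np.linalg.norm(w)

def subgrad(w):
    r = y - X @ w
    g = -(X * np.sign(r)[:, None]).mean(axis=0)     # subgradient of the empirical cost
    g += lam * w / max(np.linalg.norm(w), 1e-12)    # subgradient of the penalty
    return g

w = np.zeros(3)
for i in range(2000):                               # projected subgradient descent
    w -= 0.05 / np.sqrt(i + 1) * subgrad(w)
    nrm = np.linalg.norm(w)
    if nrm > 5.0:                                   # projection onto the R-ball, R = 5,
        w *= 5.0 / nrm                              # keeping F uniformly R-bounded (SG.i)
print(w, objective(w))
```

The projection step mirrors the uniform R-boundedness of \({\mathcal {F}}\) required below, and the absolute-error cost supplies the Lipschitz property.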

We employ the concentration inequalities of the previous section in order to obtain statistical guarantees for the \(L^{2}\) distance between \(f_{T}\) and the solution to the population analogue of (5): \(\inf _{f\in {\mathcal {F}}}\mathbb {E}_{\varvec{\varvec{\varvec{F}}}}\left[ {\mathcal {L}}\left( f\left( \varvec{x}\right) ,y\right) \right]\). This is summarized in the following result.

Theorem 2

(Statistical Guarantees) Suppose that Assumption 1 holds. Suppose furthermore that (SG.i) \({\mathcal {F}}\) is convex and uniformly R-bounded, and (SG.ii) for some \(L,\kappa >0\), uniformly in the second argument, \({\mathcal {L}}\left( \cdot ,\cdot \right)\) is L-Lipschitz and \(\mathbb {E}_{\varvec{\varvec{\varvec{F}}}}\left[ {\mathcal {L}}\right]\) is \(\kappa\)-strongly convex. Let \(f^{\star }\) be the unique solution of the population statistical program \(\inf _{{\mathcal {F}}}\mathbb {E}_{\varvec{\varvec{\varvec{F}}}}\left[ {\mathcal {L}}\left( f\left( \varvec{x}\right) ,y\right) \right]\), and suppose that it lies in the interior of \({\mathcal {F}}\). Then, if \(\beta _{T,k}>\frac{\lambda _{T}}{2c^{\star }\mu \left( {\mathcal {X}}\right) L}\), the following statistical guarantees hold:

$$\begin{aligned} \kappa \left\| f_{T}-f^{\star }\right\| _{2}\le 4c^{\star }\mu \left( {\mathcal {X}}\right) LR\sqrt{\beta _{T,k}}+\sqrt{2R}\sqrt{4c^{\star }\mu \left( {\mathcal {X}}\right) L\left( \kappa +L\right) \beta _{T,k}+\kappa \lambda _{T}}, \end{aligned}$$
(6)

with probability greater than or equal to \(1-2\exp \left( -\frac{k^{2}}{2C_{3}\left( 1+2\sum _{n=1}^{\infty }\phi _{n}\right) ^{2}}\right)\).

The parameter space convexity and (small sample) boundedness in (SG.i) and the Lipschitz continuity property in (SG.ii) are not rare in statistical applications. Strong convexity of the population criterion depends crucially on \(\varvec{\varvec{F}}\), and holds whenever \(\mathbb {E}_{\varvec{\varvec{\varvec{F}}}}\left[ {\mathcal {L}}\right]\) is convex and twice Fréchet differentiable with second order derivative whose spectrum is bounded away from zero uniformly in \({\mathcal {F}}\). The statistical guarantees in (6) hold for any T. The inequality (6) provides an upper bound for the \(L^{2}\) distance between the empirical predictor and the population solution that holds with non-trivial probability as long as \(k>\sqrt{2\ln (2)C_{3}}(1+2\sum _{n=1}^{\infty }\phi _{n})\). The bound depends on the Lipschitz and strong convexity properties of the loss, the regularization parameter, as well as the characteristics of the kernel, the unknown density, and the boundedness properties of the function space as those appear in Theorem 1. The result allows for diverging R as \(T\rightarrow \infty\), hence for cases where the parameter space \({\mathcal {F}}\) becomes asymptotically unbounded. It also allows for the population solution \(f^{\star }\) to depend on T, as well as for the strong convexity parameter \(\kappa\) to become asymptotically nullified. If \(b_{T}=o\left( 1\right)\) and \(\sqrt{T}b_{T}^{n}\rightarrow +\infty\), these conditions imply that \(\left\| f_{T}-f^{\star }\right\| _{2}\) becomes asymptotically negligible w.h.p. as long as \(\lambda _{T}<2c^{\star }\mu \left( {\mathcal {X}}\right) L\beta _{T,k}\) for some \(k\rightarrow \infty\), and \(\frac{R}{\kappa }\left( b_{T}+\sqrt{\frac{k}{\sqrt{T}b_{T}^{n}}}\right) =o\left( 1\right)\). Thus, Theorem 2 provides sufficient conditions for weak consistency of \(f_{T}\), even in cases where R diverges, as long as the regularization parameter is asymptotically strictly bounded above by some sequence \(O(\frac{k_{T}}{\sqrt{T}b_{T}^{n}}+\frac{C_{1}}{\left( 2\pi \right) ^{n}\sqrt{T}\sqrt{1+2\sum _{n=1}^{\infty }\sqrt{\phi _{n}}}b_{T}^{n}}+\frac{1}{2}C_{2}b_{T}^{2})\), where the diverging \(k_{T}\) is \(o(\sqrt{T}b_{T}^{n})\).

An example that adheres to the formulation above is that of Support Vector Machines with Hinge costs (see Example 14.19 of Wainwright, 2019). \({\mathcal {F}}\) is typically the R-ball of a Reproducing Kernel Hilbert Space comprised of discriminant real functions and centered at zero, and \({\mathcal {L}}\left( f\left( x\right) ,y\right) :=\left( 1-yf\left( x\right) \right) _{+}\). The latter is clearly 1-Lipschitz in its first argument, while \(\kappa\)-strong convexity holds as long as \(\mathbb {E}_{\varvec{F}}\left[ y^{2}\delta \left( 1-yf\left( x\right) \right) \right]\), with \(\delta\) denoting the Dirac delta function, is bounded away from zero uniformly on \(\mathcal {F}\).
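The hinge-cost program can be sketched in representer form, \(f(\cdot )=\sum _{t}\alpha _{t}k(\cdot ,\textbf{X}_{t})\), with a Gaussian reproducing kernel and the RKHS norm \(\sqrt{\alpha ^{T}K\alpha }\) as penalty. The kernel bandwidth, the synthetic labels, and the subgradient solver below are all illustrative choices, not the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 200
X = rng.uniform(-1, 1, size=(T, 2))
y = np.sign(X[:, 0] * X[:, 1] + 1e-9)               # nonlinearly separated labels

K = np.exp(-4.0 * ((X[:, None, :] - X[None, :, :])**2).sum(-1))  # Gaussian Gram matrix
lam = 0.01                                          # regularization parameter lambda_T
alpha = np.zeros(T)

def hinge_risk(a):
    """Empirical hinge cost plus RKHS-norm penalty lam * ||f||_H."""
    f = K @ a
    return np.mean(np.maximum(0.0, 1.0 - y * f)) + lam * np.sqrt(max(a @ K @ a, 0.0))

for i in range(500):                                 # subgradient descent on representer weights
    f = K @ alpha
    active = (1.0 - y * f) > 0.0                     # margin violators drive the subgradient
    g = -(K[:, active] * y[active]).sum(axis=1) / T
    g += lam * (K @ alpha) / max(np.sqrt(alpha @ K @ alpha), 1e-12)
    alpha -= 0.5 / np.sqrt(i + 1) * g

acc = np.mean(np.sign(K @ alpha) == y)
print(hinge_risk(alpha), acc)
```

The hinge cost is kinked, so only a subgradient is available; this is precisely why the sub-differential calculus invoked in the proof of Theorem 2 is needed in this example.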

Towards a generalization of (6), suppose that (5) is substituted with \(\min _{f\in \mathcal {G}_{T}}\mathbb {E}_{\varvec{\varvec{F}}_{T}}\left[ {\mathcal {L}}\left( f\left( \varvec{x}\right) ,y\right) \right] +\lambda _{T}\left\| f\right\| _{\mathcal {F}}\), with convex \({\mathcal {G}}_{T} \subseteq {\mathcal {F}}\), define \(g_{T}^{\star }:=\arg \min _{f\in \mathcal {G}_{T}}\left\| f-f^{\star }\right\| _{2}\), and suppose that the sub-differential \(\partial \mathbb {E}_{\varvec{\varvec{\varvec{F}}}}\left[ {\mathcal {L}}\left( \cdot ,y\right) \right]\) is \(L_{\partial }\)-Lipschitz uniformly in y (see for example Ch. 9 of Rockafellar and Wets, 2009). Then the following oracle inequality is similarly obtained (see the proof of Theorem 2):

$$\begin{aligned} \kappa \left\| f_{T}-f^{\star }\right\| _{2}\le \begin{array}{c} 4c^{\star }\mu \left( {\mathcal {X}}\right) LR\sqrt{\beta _{T,k}}+\left( 1+L+L_{\partial }\right) \left\| g_{T}^{\star }-f^{\star }\right\| _{2}\\ +\sqrt{\begin{array}{c} \left( \left( L+L_{\partial }\right) ^{2}\left\| g_{T}^{\star }-f^{\star }\right\| _{2}+8L^{2}c^{\star }\mu \left( {\mathcal {X}}\right) R\sqrt{\beta _{T,k}}\right) \left\| g_{T}^{\star }-f^{\star }\right\| _{2}\\ +2\left( 4c^{\star }\mu \left( {\mathcal {X}}\right) L\left( \kappa +L\right) \beta _{T,k}+\kappa \lambda _{T}\right) R \end{array}} \end{array}, \end{aligned}$$
(7)

whenever \(\lambda _{T}<2c^{\star }\mu \left( {\mathcal {X}}\right) L\beta _{T,k}\) holds, with probability greater than or equal to the probability bound in Theorem 2. This reduces to (6) when \(f^{\star }=g_{T}^{\star }\).

4 Proofs

Proof of Theorem 1

Consider (2). Due to the Hoeffding type inequality for phi-mixing processes (see Rio 2000), and working exactly as in the proof of Theorem 1 of Vogel and Schettler (2013), we obtain that \(\mathbb {P}\left( \left| J_{T}-\mathbb {E}\left( J_{T}\right) \right| >t\right) \le 2\exp \left( -\frac{t^{2}Tb_{T}^{2n}}{2C_{3}\left( 1+2\sum _{n=1}^{\infty }\phi _{n}\right) ^{2}}\right)\), for any \(t\ge 0\), where \(J_{T}:=\sup _{\varvec{\varvec{x}}\in {\mathcal {X}}}\left| \frac{1}{Tb_{T}^{n}}\sum _{t=1}^{T}{\mathcal {K}}\left( \frac{\varvec{\varvec{\varvec{x}}}_{t}-\varvec{\varvec{x}}}{b_{T}}\right) -f_{\varvec{\varvec{\varvec{x}}}}\left( \varvec{\varvec{x}}\right) \right|\). Working as in the proof of the first Lemma of Vogel and Schettler (2013), and noting that, due to the phi-mixing covariance inequality (see Corollary 14.5 of Davidson, 1994), \(\frac{1}{T^{2}}\text {Var}\left( \sum _{t=1}^{T}\exp \left( \text {i}u^{\text {T}}\varvec{\varvec{x}}_{t}\right) \right) \le \frac{1}{T}\left( 1+2\sum _{n=1}^{\infty }\sqrt{\phi _{n}}\right)\), we obtain that

$$\begin{aligned} \mathbb {E}\left( \sup _{\varvec{\varvec{x}}\in {\mathcal {X}}}\left| \frac{1}{Tb_{T}^{n}}\sum _{t=1}^{T}{\mathcal {K}}\left( \frac{\varvec{\varvec{\varvec{x}}}_{t}-\varvec{\varvec{x}}}{b_{T}}\right) -\mathbb {E}\left( \frac{1}{Tb_{T}^{n}}\sum _{t=1}^{T}{\mathcal {K}}\left( \frac{\varvec{\varvec{\varvec{x}}}_{t}-\varvec{\varvec{x}}}{b_{T}}\right) \right) \right| \right) \le \frac{C_{1}}{\left( 2\pi \right) ^{n}\sqrt{T}\sqrt{1+2\sum _{n=1}^{\infty }\sqrt{\phi _{n}}}b_{T}^{n}}. \end{aligned}$$

Finally, due to the second Lemma of Vogel and Schettler (2013), we obtain the inequality \(\sup _{\varvec{\varvec{x}}\in {\mathcal {X}}}\left| \mathbb {E}\left( \frac{1}{Tb_{T}^{n}}\sum _{t=1}^{T}{\mathcal {K}}\left( \frac{\varvec{\varvec{\varvec{x}}}_{t}-\varvec{\varvec{x}}}{b_{T}}\right) \right) -f_{\varvec{\varvec{\varvec{x}}}}\left( \varvec{\varvec{x}}\right) \right| \le \frac{1}{2}C_{2}b_{T}^{2}\). The result follows by choosing \(t:=\frac{k}{\sqrt{T}b_{T}^{n}}\) in the probability inequality above. (3) follows from Theorem 4 of Gibbs and Su (2002), the compactness of \({\mathcal {X}}\), and (2). Analogously, (4) follows from the uniform boundedness of \(\mathcal {F}\), the dominance of \(\left\| \cdot \right\| _{\mathcal {F}}\), the compactness of \({\mathcal {X}}\), and (2).

Proof of Theorem 2

Set \(R^{\star }:=2c^{\star }\mu \left( {\mathcal {X}}\right) R\). First, for any \(f\in \mathcal {F}\), due to the Lipschitz properties of \({\mathcal {L}}\), the \(\left\| \cdot \right\| _{\mathcal {F}}\)-boundedness of \(\mathcal {F}\) (which implies that \(\mathcal {F}-f^{\star }\) is 2R-bounded), and (4),

$$\begin{aligned}{} & {} \mathbb {P}\left[ \frac{\left| \mathbb {E}_{\varvec{F}_{T}}\left[ {\mathcal {L}}\left( f\left( \varvec{x}\right) ,y\right) \right] -\mathbb {E}_{\varvec{F}_{T}}\left[ {\mathcal {L}}\left( f^{\star }\left( \varvec{x}\right) ,y\right) \right] -\mathbb {E}_{\varvec{F}}\left[ {\mathcal {L}}\left( f\left( \varvec{x}\right) ,y\right) \right] +\mathbb {E}_{\varvec{F}}\left[ {\mathcal {L}}\left( f^{\star }\left( \varvec{x}\right) ,y\right) \right] \right| }{\left\| f-f^{\star }\right\| _{2}+\sqrt{\beta _{T,k}}}\ge 2LR^{\star }\sqrt{\beta _{T,k}}\right] \\{} & {} \quad \le \mathbb {P}\left[ \frac{\left| \mathbb {E}_{\varvec{F}_{T}}\left[ f\left( \varvec{x}\right) -f^{\star }\left( \varvec{x}\right) \right] -\mathbb {E}_{\varvec{F}}\left[ f\left( \varvec{x}\right) -f^{\star }\left( \varvec{x}\right) \right] \right| }{\left\| f-f^{\star }\right\| _{2}+\sqrt{\beta _{T,k}}}\ge R^{\star }\sqrt{\beta _{T,k}}\right] \\{} & {} \quad \le 2\exp \left( -\frac{k^{2}}{2C_{3}\left( 1+2\sum _{n=1}^{\infty }\phi _{n}\right) ^{2}}\right) . \end{aligned}$$

We first prove (7). Then (6) follows by assuming \(f^{\star }=g_{T}^{\star }\), and noting that since \(f^{\star }\) is interior, the normal cone of \(\mathcal {F}\) at \(f^{\star }\) is \(\left\{ 0\right\}\), and the optimality of \(f^{\star }\) implies that \(\left\langle \partial \mathbb {E}_{\varvec{\varvec{F}}}\left[ {\mathcal {L}}\left( f^{\star }\left( \varvec{x}\right) ,y\right) \right] ,f_{T}-f^{\star }\right\rangle _{2}\in \left\{ 0\right\}\), where \(\partial \mathbb {E}_{\varvec{\varvec{F}}}\left[ {\mathcal {L}}\left( f^{\star }\left( \varvec{x}\right) ,y\right) \right]\) denotes the sub-differential of the population criterion at \(f^{\star }\), and \(\left\langle \cdot ,\cdot \right\rangle _{2}\) the \(L^{2}\) inner product. In this case the Lipschitz property of the sub-differential is redundant. Towards proving (7), remember that \(g_{T}^{\star }:=\arg \min _{f\in \mathcal {G}_{T}}\left\| f-f^{\star }\right\| _{2}\). Then consider the event

$$\begin{aligned} \left\{ \frac{\left| \mathbb {E}_{\varvec{\varvec{F}}_{T}}\left[ {\mathcal {L}}\left( f_{T}\left( \varvec{x}\right) ,y\right) \right] -\mathbb {E}_{\varvec{\varvec{F}}_{T}}\left[ {\mathcal {L}}\left( g_{T}^{\star }\left( \varvec{x}\right) ,y\right) \right] -\mathbb {E}_{\varvec{\varvec{F}}}\left[ {\mathcal {L}}\left( f_{T}\left( \varvec{x}\right) ,y\right) \right] +\mathbb {E}_{\varvec{\varvec{F}}}\left[ {\mathcal {L}}\left( g_{T}^{\star }\left( \varvec{x}\right) ,y\right) \right] \right| }{\left\| f_{T}-g_{T}^{\star }\right\| _{2}+\sqrt{\beta _{T,k}}}\le 2LR^{\star }\sqrt{\beta _{T,k}}\right\} , \end{aligned}$$

and notice that the probability of the above is bounded below by \(1-2\exp \left( -\frac{k^{2}}{2C_{3}\left( 1+2\sum _{n=1}^{\infty }\phi _{n}\right) ^{2}}\right)\), due to the previous result (which does not use the fact that \(f^{\star }\) is interior). If the event holds, then, due to the fact that \(\mathbb {E}_{\varvec{\varvec{F}}_{T}}\left[ {\mathcal {L}}\left( f_{T}\left( \varvec{x}\right) ,y\right) \right] -\mathbb {E}_{\varvec{\varvec{F}}_{T}}\left[ {\mathcal {L}}\left( g_{T}^{\star }\left( \varvec{x}\right) ,y\right) \right] +\lambda _{T}\left( \left\| f_{T}\right\| _{\mathcal {F}}-\left\| g_{T}^{\star }\right\| _{\mathcal {F}}\right) \le 0\), it must be the case that

$$\begin{aligned} \mathbb {E}_{\varvec{\varvec{\varvec{F}}}}\left[ {\mathcal {L}}\left( f_{T}\left( \varvec{x}\right) ,y\right) \right] -\mathbb {E}_{\varvec{\varvec{\varvec{F}}}}\left[ {\mathcal {L}}\left( g_{T}^{\star }\left( \varvec{x}\right) ,y\right) \right] +\lambda _{T}\left( \left\| f_{T}\right\| _{\mathcal {F}}-\left\| g_{T}^{\star }\right\| _{\mathcal {F}}\right) \le 2LR^{\star }\sqrt{\beta _{T,k}}\left( \left\| f_{T}-g_{T}^{\star }\right\| _{2}+\sqrt{\beta _{T,k}}\right) . \end{aligned}$$

The \(\kappa\)-strong convexity of the population criterion then implies that

$$\begin{aligned}{} & {} \left\langle \partial \mathbb {E}_{\varvec{\varvec{\varvec{F}}}}\left[ {\mathcal {L}}\left( g_{T}^{\star }\left( \varvec{x}\right) ,y\right) \right] ,f_{T}-g_{T}^{\star }\right\rangle _{2}+\frac{\kappa }{2}\left\| f_{T}-g_{T}^{\star }\right\| _{2}^{2}+\lambda _{T}\left( \left\| f_{T}\right\| _{\mathcal {F}}-\left\| g_{T}^{\star }\right\| _{\mathcal {F}}\right) \\{} & {} \quad \le 2LR^{\star }\sqrt{\beta _{T,k}}\left( \left\| f_{T}-g_{T}^{\star }\right\| _{2}+\sqrt{\beta _{T,k}}\right) . \end{aligned}$$

Now, notice that due to the local optimality of \(g_{T}^{\star }\), \(\partial \mathbb {E}_{\varvec{F}}\left[ {\mathcal {L}}\left( g_{T}^{\star }\left( \varvec{x}\right) ,y\right) \right]\) must lie inside the normal cone of \(\mathcal {G}_{T}\) at \(g_{T}^{\star }\). This and the fact that \(f_{T}\) satisfies the empirical local optimality conditions imply that

$$\begin{aligned} \left\langle \partial \mathbb {E}_{\varvec{\varvec{F}}}\left[ {\mathcal {L}}\left( g_{T}^{\star }\left( \varvec{x}\right) ,y\right) \right] ,f_{T}-g_{T}^{\star }\right\rangle _{2}\le 0. \end{aligned}$$

The lhs of the previous display is greater than or equal to

$$\begin{aligned} \left\langle \partial \mathbb {E}_{\varvec{\varvec{F}}}\left[ {\mathcal {L}}\left( g_{T}^{\star }\left( \varvec{x}\right) ,y\right) \right] -\partial \mathbb {E}_{\varvec{\varvec{F}}}\left[ {\mathcal {L}}\left( f^{\star }\left( \varvec{x}\right) ,y\right) \right] ,f_{T}-g_{T}^{\star }\right\rangle _{2}+\left\langle \partial \mathbb {E}_{\varvec{\varvec{\varvec{F}}}}\left[ {\mathcal {L}}\left( f^{\star }\left( \varvec{x}\right) ,y\right) \right] ,f_{T}-g_{T}^{\star }\right\rangle _{2} \end{aligned}$$

which (due to the sub-differential inclusion condition) is greater than or equal to

$$\begin{aligned}{} & {} \left\langle \partial \mathbb {E}_{\varvec{\varvec{F}}}\left[ {\mathcal {L}}\left( g_{T}^{\star }\left( \varvec{x}\right) ,y\right) \right] -\partial \mathbb {E}_{\varvec{\varvec{\varvec{F}}}}\left[ {\mathcal {L}}\left( f^{\star }\left( \varvec{x}\right) ,y\right) \right] ,f_{T}-g_{T}^{\star }\right\rangle _{2}\\{} & {} \quad +\left\langle \partial \mathbb {E}_{\varvec{\varvec{\varvec{F}}}}\left[ {\mathcal {L}}\left( f^{\star }\left( \varvec{x}\right) ,y\right) \right] ,f_{T}-g_{T}^{\star }\right\rangle _{2}+\left\langle \partial \mathbb {E}_{\varvec{\varvec{F}}}\left[ {\mathcal {L}}\left( f^{\star }\left( \varvec{x}\right) ,y\right) \right] ,f^{\star }-g_{T}^{\star }\right\rangle _{2}, \end{aligned}$$

which due to Cauchy–Schwarz inequality and the Lipschitz property of the sub-differential is greater than or equal to

$$\begin{aligned} -L_{\partial }\left\| g_{T}^{\star }-f^{\star }\right\| _{2}\left\| f_{T}-g_{T}^{\star }\right\| _{2}-L\left( \left\| f_{T}-g_{T}^{\star }\right\| _{2}+\left\| f^{\star }-g_{T}^{\star }\right\| _{2}\right) . \end{aligned}$$

The previous then imply that

$$\begin{aligned}{} & {} -L_{\partial }\left\| g_{T}^{\star }-f^{\star }\right\| _{2}\left\| f_{T}-g_{T}^{\star }\right\| _{2}-L\left( \left\| f_{T}-g_{T}^{\star }\right\| _{2}+\left\| f^{\star }-g_{T}^{\star }\right\| _{2}\right) +\frac{\kappa }{2}\left\| f_{T}-g_{T}^{\star }\right\| _{2}^{2}+\lambda _{T}\left( \left\| f_{T}\right\| _{\mathcal {F}}-\left\| g_{T}^{\star }\right\| _{\mathcal {F}}\right) \\{} & {} \quad \le 2LR^{\star }\sqrt{\beta _{T,k}}\left( \left\| f_{T}-g_{T}^{\star }\right\| _{2}+\sqrt{\beta _{T,k}}\right) \Rightarrow \\{} & {} \frac{\kappa }{2}\left\| f_{T}-g_{T}^{\star }\right\| _{2}^{2}-\left( L\left( 2R^{\star }\sqrt{\beta _{T,k}}+\left\| f^{\star }-g_{T}^{\star }\right\| _{2}\right) +L_{\partial }\left\| g_{T}^{\star }-f^{\star }\right\| _{2}\right) \left\| f_{T}-g_{T}^{\star }\right\| _{2}\\{} & {} \quad +\left( \lambda _{T}\left( \left\| f_{T}\right\| _{\mathcal {F}}-\left\| g_{T}^{\star }\right\| _{\mathcal {F}}\right) -2LR^{\star }\beta _{T,k}\right) \le 0. \end{aligned}$$
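The root comparison applied to the last display rests on the following elementary fact, spelled out here for completeness with \(x:=\left\| f_{T}-g_{T}^{\star }\right\| _{2}\) and B, C abbreviating the linear and constant coefficients of the quadratic above:

```latex
% With x := \|f_T - g_T^\star\|_2 and the abbreviations
%   B := L(2R^\star\sqrt{\beta_{T,k}} + \|f^\star - g_T^\star\|_2)
%        + L_\partial \|g_T^\star - f^\star\|_2 \ge 0,
%   C := \lambda_T(\|f_T\|_{\mathcal F} - \|g_T^\star\|_{\mathcal F})
%        - 2LR^\star\beta_{T,k},
% the display reads (\kappa/2) x^2 - B x + C \le 0, whose roots are
% (B \pm \sqrt{B^2 - 2\kappa C})/\kappa. When C < 0 the discriminant
% exceeds B^2, so one root is negative, one is positive, and
\[
x \;\le\; \frac{B + \sqrt{B^{2} - 2\kappa C}}{\kappa }.
\]
```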

The condition that ensures that the quadratic polynomial in the lhs of the previous display has two distinct roots, one negative and one positive, is \(\beta _{T,k}>\frac{\lambda _{T}}{2c^{\star }\mu \left( {\mathcal {X}}\right) L}\), since this and the fact that \(\left\| f_{T}\right\| _{\mathcal {F}}-\left\| g_{T}^{\star }\right\| _{\mathcal {F}}\le 2R\), imply that \(\lambda _{T}\left( \left\| f_{T}\right\| _{\mathcal {F}}-\left\| g_{T}^{\star }\right\| _{\mathcal {F}}\right) -2LR^{\star }\beta _{T,k}<0\). Comparing with the positive root we obtain that

$$\begin{aligned}{} & {} \kappa \left\| f_{T}-g_{T}^{\star }\right\| _{2}\le L\left( 2R^{\star }\sqrt{\beta _{T,k}}+\left\| f^{\star }-g_{T}^{\star }\right\| _{2}\right) +L_{\partial }\left\| g_{T}^{\star }-f^{\star }\right\| _{2}\\{} & {} \qquad +\sqrt{4LR^{\star }\left( \kappa +L\right) \beta _{T,k}+\left( L+L_{\partial }\right) ^{2}\left\| g_{T}^{\star }-f^{\star }\right\| _{2}^{2}+4L^{2}R^{\star }\sqrt{\beta _{T,k}}\left\| f^{\star }-g_{T}^{\star }\right\| _{2}-2\kappa \lambda _{T}\left( \left\| f_{T}\right\| _{\mathcal {F}}-\left\| g_{T}^{\star }\right\| _{\mathcal {F}}\right) }\\{} & {} \quad \le 2LR^{\star }\sqrt{\beta _{T,k}}+\left( L+L_{\partial }\right) \left\| g_{T}^{\star }-f^{\star }\right\| _{2}\\{} & {} \qquad +\sqrt{4LR^{\star }\left( \kappa +L\right) \beta _{T,k}+\left( L+L_{\partial }\right) ^{2}\left\| g_{T}^{\star }-f^{\star }\right\| _{2}^{2}+4L^{2}R^{\star }\sqrt{\beta _{T,k}}\left\| f^{\star }-g_{T}^{\star }\right\| _{2}+2\kappa \lambda _{T}\left\| g_{T}^{\star }\right\| _{\mathcal {F}}}, \end{aligned}$$

from which the oracle inequality (7) follows by noting that \(\left\| g_{T}^{\star }\right\| _{\mathcal {F}}\le R\).