Concentration inequalities for kernel density estimators under uniform mixing

We derive non-asymptotic concentration inequalities for the uniform deviation between a multivariate density function and its non-parametric kernel density estimator in a stationary and uniformly mixing time series framework. We derive analogous inequalities for their first Wasserstein distance, as well as for the deviations between integrals of bounded functions with respect to them. These inequalities can be used for the construction of confidence regions, the estimation of finite sample probabilities of decision errors, etc. We employ the concentration results in the derivation of statistical guarantees and oracle inequalities in regularized prediction problems with Lipschitz and strongly convex costs.


Introduction
In the present note we are concerned with the derivation of non-asymptotic concentration inequalities for the uniform deviation between a multivariate density function and its non-parametric kernel density estimator over the support of the former. We employ a time series setting consisting of multivariate stationary processes with summable uniform (phi-) mixing coefficients (see Davidson, 1994). We rely heavily on the iid results of Vogel and Schettler (2013), adjusting their proofs via the use of concentration and covariance inequalities for uniformly mixing processes. Regarding the two underlying probability measures, we readily derive analogous inequalities for their first Wasserstein distance, as well as for the deviations between integrals of bounded functions with respect to them.

The hospitality of the Laboratory of Econometrics and the Laboratory of Economic Policy Studies (EMOP) of the department is gratefully acknowledged.

Concentration inequalities for kernel density estimators
We begin with an assumption that specifies our probabilistic and statistical framework. The assumption imposes restrictions on the marginal distributions and the dynamics of the stochastic process involved in the density estimation. It also restricts the properties of the employed kernel technology:

Assumption 1 (i) The ℝ^n-valued stochastic process (x_t)_{t∈ℤ} is strictly stationary and phi-mixing, with absolutely summable mixing coefficient sequence (φ_n)_{n∈ℕ}. (ii) K is a positive, symmetric, bounded, Lipschitz continuous and compactly supported convolution kernel on ℝ^n such that ∫_{ℝ^n} K(u) du = 1 and ∫_{ℝ^n} ‖u‖² K(u) du < +∞. (iii) The distribution of x_0 has a compact support X, and a continuous density f_x(⋅) that is twice differentiable with continuous second derivatives.
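The kernel conditions in Assumption 1(ii) are easy to verify numerically for standard choices. The following sketch does so for n = 1 with the Epanechnikov kernel K(u) = 0.75(1 − u²) on [−1, 1], an admissible kernel used here purely as an illustration (the paper does not fix a particular K):

```python
import numpy as np

# Epanechnikov kernel: positive, symmetric, bounded, Lipschitz,
# and compactly supported on [-1, 1].
def K(u):
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

u = np.linspace(-1.5, 1.5, 300001)
du = u[1] - u[0]
total_mass = float(np.sum(K(u)) * du)            # Riemann sum of K: ~ 1
second_moment = float(np.sum(u**2 * K(u)) * du)  # ~ 0.2, hence finite
symmetric = bool(np.allclose(K(u), K(-u)))
```

The same check applies to any candidate kernel; only compact support and the Lipschitz property need to be verified analytically.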
An example of a dynamic multivariate process that satisfies Assumption 1(i) is given by the solution of the following stochastic recursion equation (SRE):

x_t = h(x_{t−1}) + z_t, (1)

where (z_t)_{t∈ℤ} is an iid sequence of n-random vectors, the distribution of z_0 has a density, and h : ℝ^n → ℝ^n is a contraction (w.r.t. some metric on ℝ^n) with compact range. Theorem 2.1.3 of Doukhan and Ghindès (1983) then implies the required mixing property for the unique solution of the SRE. This incorporates the iid case as a special case. If z_0 also has bounded support, the compactness of the support of the distribution of x_0 required in Assumption 1(iii) also holds. The remaining parts impose mostly usual conditions in non-parametric statistics (see El Machkouri et al., 2020, and references therein). Boundedness of the supports can be generalized as long as K has an integrable Fourier transform, and the Hessian of f_x is bounded in the Frobenius norm.
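Model (1) is straightforward to simulate. The sketch below uses the illustrative choices h = 0.5 tanh (a contraction with Lipschitz constant 0.5 < 1 and range in [−0.5, 0.5]) and z_0 uniform on [−0.5, 0.5]; neither choice is taken from the paper, but both satisfy the requirements above, so the trajectory stays in a compact set:

```python
import numpy as np

rng = np.random.default_rng(0)

def h(x):
    # a contraction with compact range, as required for the SRE example
    return 0.5 * np.tanh(x)

T = 1000
z = rng.uniform(-0.5, 0.5, size=T)  # iid innovations with bounded support
x = np.empty(T)
x[0] = 0.0
for t in range(1, T):
    x[t] = h(x[t - 1]) + z[t]

# Since |h| <= 0.5 and |z_t| <= 0.5, the trajectory satisfies |x_t| <= 1.
```

Any contraction with compact range would do; this choice merely makes the compact-support conclusion visible in the simulated path.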
The researcher has at her disposal the time series sample (x_t)_{t=1,…,T}, and estimates the unknown f_x via the kernel estimator

f̂_T(y) := (1/(T b_T^n)) ∑_{t=1}^T K((y − x_t)/b_T),

where b_T > 0 denotes the bandwidth. We consider the problem of bounding, uniformly in T, the probability that the uniform deviation sup_{y∈X} |f̂_T(y) − f_x(y)| exceeds an asymptotically negligible deterministic sequence. The following theorem summarizes the result. There, W(G, G⋆) denotes the first Wasserstein distance between the arbitrary distributions G, G⋆ on X, defined as min_{γ∈Γ(G,G⋆)} ∫_{X×X} d(z, z⋆) dγ(z, z⋆), where Γ(G, G⋆) denotes the set of Borel probability distributions on X × X that have respective marginals G, G⋆, and d denotes the Euclidean distance (see Gao et al., 2017). λ denotes the Lebesgue measure on ℝ^n and diam(X) denotes the Euclidean diameter of X.
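As a toy illustration of the objects just defined, the following sketch computes f̂_T and its uniform deviation over a grid, for an iid special case of Assumption 1 with the triangular density on [0, 2]; the sample size, bandwidth, and grid are illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
T, b = 5000, 0.2
x = rng.uniform(size=T) + rng.uniform(size=T)  # triangular density on [0, 2]

def K(u):
    # Epanechnikov kernel, admissible under Assumption 1(ii)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def f_hat(y):
    # kernel density estimator: (1 / (T b)) * sum_t K((y - x_t) / b)
    return K((y[:, None] - x[None, :]) / b).mean(axis=1) / b

grid = np.linspace(0.3, 1.7, 200)  # interior grid, away from the boundary
f_true = np.where(grid <= 1.0, grid, 2.0 - grid)
deviation = float(np.max(np.abs(f_hat(grid) - f_true)))
```

The quantity `deviation` is the (grid-restricted) uniform deviation that the theorem below controls in probability.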

Theorem 1 (Concentration Inequalities) Suppose that Assumption 1 holds:
A. Uniformly in T ≥ 1 and for any k > 0, (2) holds.

B. Then, uniformly in T ≥ 1 and for any k > 0, (3) holds.

C. Suppose that F is an R-bounded subset of a semi-normed space (F, ‖⋅‖_F) of real functions defined on X, and that there exists some c⋆ > 0 for which ‖⋅‖_∞ ≤ c⋆‖⋅‖_F. Then, uniformly in T ≥ 1 and for any k > 0, (4) holds.

Journal of the Korean Statistical Society (2023) 52: 440–449

The results in (2)–(4) are non-asymptotic. Whenever b_T = o(1) while √(T b_T^n) → +∞, they can also provide estimates for the rates of convergence of the deviations considered. The derivation of (2) essentially follows from the proof of the iid Theorem 1 of Vogel and Schettler (2013), by taking into account relevant uniform mixing, concentration, and covariance inequalities, especially when handling the variance of empirical Fourier transforms. Then (3)–(4) follow from the dual functional representation of W of Kantorovich (1960), the uniform boundedness of the function space involved in C, and the compactness of X. The bounding sequence ε_{T,k} depends on the integral of the Fourier transform of the kernel, the second moment of the kernel, the magnitude of the Hessian of f_x, the mixing coefficients, the sample size, and the bandwidth. In the last two cases, ε_{T,k} is complemented by the Lebesgue measure λ(X) and the diameter of X, and/or the uniform bound and the norm properties of the function space considered. The probability bound depends on the bound of the kernel and the mixing coefficients. It becomes tighter whenever K admits a low maximum, and/or the mixing coefficients are small and converge rapidly to zero. For example, suppose that n = 1, and Model (1) is actually a stationary uniformly ergodic AR(1) recursion, i.e. h(x) = ρx with |ρ| < 1, and z_0 follows the uniform distribution supported on [−1, 1]; see Proposition 2.1.5 of Doukhan and Ghindès (1983). Furthermore, suppose that K equals the Gaussian kernel truncated at [−1, 1] with scale equal to 2. Then, Theorem 14.14 of Davidson (1994) implies that the probability bound is bounded above by 2 exp(−2(1−ρ)²k²/(1−ρ+2C)²), for some C > 0. This bound is less than one, and hence meaningful, if and only if k > √(ln(2)/2) (1−ρ+2C)/(1−ρ).
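The arithmetic of this AR(1) example is easy to verify numerically. The sketch below evaluates the probability bound and the threshold for k, using the illustrative values ρ = 0.5 and C = 1; the value of C is purely hypothetical, since in the example it is merely some positive constant:

```python
import math

def prob_bound(k, rho, C):
    # the bound 2 exp(-2 (1 - rho)^2 k^2 / (1 - rho + 2C)^2) of the example
    return 2.0 * math.exp(-2.0 * (1.0 - rho) ** 2 * k**2 / (1.0 - rho + 2.0 * C) ** 2)

def k_threshold(rho, C):
    # the bound equals one exactly at k = sqrt(ln(2)/2) (1 - rho + 2C)/(1 - rho)
    return math.sqrt(math.log(2.0) / 2.0) * (1.0 - rho + 2.0 * C) / (1.0 - rho)

rho, C = 0.5, 1.0
k_star = k_threshold(rho, C)
# Just above the threshold the bound is informative (< 1); just below, it is not.
```

The computation confirms that the bound crosses one exactly at the stated threshold, i.e. it is meaningful if and only if k exceeds k_star.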

Remark 1
The extension of the results of Theorem 1 to stationary processes that are either strongly mixing or absolutely regular (see for example Ch. 14 of Davidson, 1994) is not trivial. The proof is based on the Hoeffding-type inequality of Rio (2000) for functions exhibiting the bounded differences property. To our knowledge, analogous results are not available for the aforementioned weaker forms of mixing. The derivation of such inequalities in those frameworks is a very interesting issue for future research. Another promising approach could be based on results like Corollary 3.3 of Krebs (2018), which is related to the main result of Merlevède et al. (2009). Unfortunately, this is not directly usable in our framework because it presupposes uniform boundedness for the functions involved in the partial sums. This is not the case in our framework, due to the presence of the bandwidth reciprocal as a multiplicative factor outside the kernel. Analogously, the extension of such a result to sequences of bounded, yet not necessarily uniformly bounded, functions is also an interesting issue for future research.
Parts A and C can be used, among others, in order to construct non-asymptotic confidence sets for large enough T, under some further restrictions. Suppose for example that the non-parametric framework includes a restriction on the mixing coefficients of the form ∑_{n=1}^∞ φ_n ≤ C⋆ for some known C⋆ > 0, as well as a known upper bound M > 0 for the Hessian of f_x. Then (4) implies a confidence set, of coverage at least one minus the probability bound appearing there, for the integrals 𝔼(f(x_0)), uniformly on F. The construction of valid confidence sets without the above restrictions, via the estimation of the supremum of the Hessian of the C² density f_x and of the mixing coefficients, is left for future research. Notice that the Hessian estimation could be facilitated by differentiation of the kernel estimator; see for example Sheather (2004). The estimation of the mixing coefficient series could be analogously facilitated by the results in Ahsen and Vidyasagar (2013) and truncation. Part B can be used in order to bound from above the probability that sup_{W(F_T,G)≤δ_T} 𝔼_G[g(θ⋆ᵀy)] ≤ 0, while 𝔼_F[g(θ⋆ᵀy)] > 0, for g : ℝ → ℝ, and θ⋆ ∈ Θ ⊆ ℝ^n. If g is 1-Lipschitz, Θ is bounded in the Euclidean norm, and δ_T ≥ diam(Θ) diam(X) λ(X) ε_{T,k}, then the probability of falsely classifying θ⋆ in the zero level set of 𝔼_F[g(θᵀy)], via the use of the conservative statistical program sup_{W(F_T,G)≤δ_T} 𝔼_G[g(θᵀy)], is bounded above by the rhs of (3).
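The conservatism of the statistical program above can be made concrete via the Kantorovich dual representation of W: for a 1-Lipschitz g, |𝔼_G[g] − 𝔼_{F_T}[g]| ≤ W(F_T, G), so over the ball {G : W(F_T, G) ≤ δ_T} the program's value is bounded above by the empirical mean plus the radius. A minimal numerical sketch, with an illustrative sample, g, and radius:

```python
import numpy as np

rng = np.random.default_rng(2)
sample = rng.normal(size=500)  # stand-in for the empirical measure F_T

def g(v):
    # a 1-Lipschitz function
    return np.abs(v)

delta = 0.1  # Wasserstein radius delta_T (illustrative)
empirical = float(g(sample).mean())
# Kantorovich-Rubinstein: sup over {G : W(F_T, G) <= delta} of E_G[g]
# is at most the empirical mean plus delta.
conservative_upper = empirical + delta
```

The slack between `conservative_upper` and `empirical` is exactly the radius, which quantifies the price of the robustification.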

Regularized prediction problems with Lipschitz costs
In what follows (F, ‖⋅‖_F) conforms to the function space in Theorem 1.C, and L is a loss function on ℝ². Given the sample (x_t)_{t∈{1,…,T}} with x_t := (y_t, X_t), where y_t denotes the response variables and X_t the predictors, we consider the regularized prediction (conditional on X_t) empirical program (5), with λ_T > 0 a regularization parameter.
We employ the concentration inequalities of the previous section in order to obtain statistical guarantees for the L₂ distance between f_T and the solution to the population analogue of (5): inf_{f∈F} 𝔼[L(f(x), y)]. This is summarized in the following result. The parameter space convexity and (small sample) boundedness in (SG.i) and the Lipschitz continuity property in (SG.ii) are not rare in statistical applications. Strong convexity of the population criterion depends crucially on F and holds whenever 𝔼[L] is convex and twice Fréchet differentiable, with a second-order derivative whose spectrum is bounded away from zero uniformly on F. The statistical guarantees in (6) hold for any T. The inequality (6) provides an upper bound for the L₂ distance between the empirical predictor and the population solution that holds with non-trivial probability as long as k > √(2 ln(2)) C₃ (1 + 2 ∑_{n=1}^∞ φ_n). The bound depends on the Lipschitz and strong convexity properties of the loss, the regularization parameter, as well as the characteristics of the kernel, the unknown density, and the boundedness properties of the function space, as those appear in Theorem 1. The result allows for R diverging with T → ∞, hence for cases where the parameter space F becomes asymptotically unbounded. It also allows for the population solution f⋆ to depend on T, as well as for the strong convexity parameter to become asymptotically nullified. If b_T = o(1) and √(T b_T^n) → +∞, it implies that ‖f_T − f⋆‖₂ becomes asymptotically negligible w.h.p. Hence Theorem 2 provides sufficient conditions for weak consistency of f_T even in cases where R diverges, as long as the regularization parameter is asymptotically strictly bounded above by an appropriate O(⋅) sequence. An example that adheres to the formulation above is the one of Support Vector Machines with hinge costs (see Example 14.19 of Wainwright, 2019).
F is typically the R-ball, centered at zero, of a Reproducing Kernel Hilbert Space comprised of discriminant real functions, and L(f(x), y) := (1 − y f(x))₊. The latter is clearly 1-Lipschitz in its first argument, while γ-strong convexity holds as long as 𝔼[y² δ(1 − y f(x))], with δ denoting the Dirac delta function, is bounded away from zero uniformly on F.
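A minimal numerical sketch of this SVM example: regularized hinge-loss minimization (1/T) ∑_t (1 − y_t f(X_t))₊ + λ_T ‖f‖², with a linear score f(X) = wᵀX standing in for the RKHS element, and subgradient descent as the solver. The data, λ_T, and the step schedule are all illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
T, lam = 400, 0.1
w_true = np.array([1.0, -1.0])
X = rng.normal(size=(T, 2))
y = np.sign(X @ w_true + 0.1 * rng.normal(size=T))  # labels in {-1, +1}

def objective(w):
    # empirical hinge risk plus ridge-type regularization
    margins = 1.0 - y * (X @ w)
    return float(np.maximum(margins, 0.0).mean() + lam * w @ w)

w = np.zeros(2)
for _ in range(200):
    margins = 1.0 - y * (X @ w)
    active = (margins > 0).astype(float)
    grad = -(active * y) @ X / T + 2.0 * lam * w  # a subgradient of the objective
    w -= 0.1 * grad
```

Since the objective at w = 0 equals 1 + λ_T·0 = 1, the descent iterate should end well below that value; the guarantee of Theorem 2 concerns how close such an empirical minimizer is to the population one.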
Towards a generalization of (6), if (5) is substituted with its analogue over a convex subset G_T of F, and the sub-differential 𝔼[∂L(⋅, y)] is L_∂-Lipschitz uniformly in y (see for example Ch. 9 of Rockafellar and Wets, 2009), then the following oracle inequality (7) is similarly obtained (see the proof of Theorem 2), whenever λ_T < 2 c⋆ λ(X) L_∂ ε_{T,k} holds, with probability greater than or equal to the probability bound in Theorem 2. This reduces to (6) when f⋆ = g⋆_T.

Proofs
Proof of Theorem 1 Consider (2). Due to the Hoeffding-type inequality for phi-mixing processes (see Rio, 2000), and working exactly as in the proof of Theorem 1 of Vogel and Schettler (2013), we obtain a first probability inequality. Working as in the proof of the first Lemma of Vogel and Schettler (2013), and noting the phi-mixing covariance inequality (see Corollary 14.5 of Davidson, 1994), we obtain a bound on the variance term. Finally, due to the second Lemma of Vogel and Schettler (2013), we obtain the inequality upon substituting √(T b_T^n) in the probability inequality above.
(3) follows from Theorem 4 of Gibbs and Su (2002), the compactness of X, and (2). Analogously, (4) follows from the uniform boundedness of F, the dominance of ‖⋅‖_F over ‖⋅‖_∞, the compactness of X, and (2).
case the Lipschitz property of the sub-differential is redundant. Towards proving (7), consider the event and notice that its probability is bounded below, due to the previous result (which does not use the fact that f⋆ is interior). If the event holds, then the γ-strong convexity of the population criterion implies the corresponding bound. Now, notice that due to the local optimality of g⋆_T, 𝔼[∂L(g⋆_T(x), y)] must lie inside the normal cone of G_T at g⋆_T. This and the fact that f_T satisfies the empirical local optimality conditions imply the next display. The lhs of that display is greater than or equal to a term which (due to the sub-differential inclusion condition) is greater than or equal to a term which, due to the Cauchy–Schwarz inequality and the Lipschitz property of the sub-differential, is greater than or equal to the final bound. The previous then imply (7).