Kernel Density Estimates in a Non-standard Situation

Abstract

The kernel density estimate is an integral part of the statistical toolbox. It has been widely studied and is very well understood in situations where the observations \(\{x_i\}\) are i.i.d., or form a stationary process with some weak dependence. However, there are situations where these conditions do not hold. For instance, while the eigenvalue distribution of large-dimensional random matrices converges, the eigenvalues themselves are highly correlated for most common random matrix models. Suppose \(\{F_n\}\) is a sequence of empirical distribution functions (usually random) which converges weakly to a non-random distribution function F with density f in some probabilistic sense. We show that under mild conditions on the kernel K and the limit density f, the kernel density estimate \(\hat{f}\) based on \(F_n\) converges to f in suitable probabilistic senses. This demonstrates the robustness of the kernel density estimate. We show how the rate of convergence of \(\hat{f}\) to f can be linked to the rates of convergence of \(F_n\) and \({\mathrm{E}}(F_n)\) to F. Using these general results, we establish the consistency of the kernel density estimates, including upper bounds on the rate of convergence, for two popular random matrix models. We also provide a few simulations to demonstrate these results and conclude with a few open questions.

Introduction

Suppose \(F_n\) is a distribution function and K is a suitable nonnegative real-valued function with \(\int _{-\infty }^{\infty } K(t){\mathrm {d}}t=1\). Then, K is called a kernel function. For a bandwidth \(h > 0\), define (all integrals are over \((-\infty , \ \infty )\))

$$\begin{aligned} \hat{f}(x) = \dfrac{1}{h}\int K\left( \dfrac{x-t}{h}\right) {\mathrm {d}}F_n(t). \end{aligned}$$

Note that \(\hat{f} \ge 0\) and \(\int _{-\infty }^{\infty }\hat{f}(u) {\mathrm {d}}u =1\) so that \(\hat{f}\) is a probability density function.

Often \(F_n\) is the empirical distribution which puts mass \(n^{-1}\) at \(x_1, x_2, \ldots , x_n\) which are observations from some statistical model, so that

$$\begin{aligned}F_n(x)=n^{-1}\sum _{i=1}^n \mathbb {I}(x_i \le x).\end{aligned}$$

In that case \(\hat{f}\) is a random function and

$$\begin{aligned} \hat{f}(x) = \dfrac{1}{nh} \sum _{i=1}^{n} K\left( \dfrac{x-x_i}{h}\right) . \end{aligned}$$
(1)

In particular, if \(\{x_i\}\) is an i.i.d. sequence, or more generally, a strictly stationary sequence with marginal density f, then \(\hat{f}\) serves as an estimate of f and hence is called a kernel density estimate (KDE) of f with bandwidth h. The KDE is an integral part of the statistical toolbox and is very well understood in these situations. If K satisfies some mild restrictions, then \(\hat{f}\) has very nice properties; in particular, as \(n \rightarrow \infty \), it is a consistent estimate of f in various probabilistic senses when h is chosen appropriately.
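To fix ideas, here is a minimal Python sketch of (1); the helper names and the choice of the Gaussian kernel are ours, not prescriptions of the theory:

```python
import numpy as np

def gaussian_kernel(u):
    # Standard normal density; any nonnegative K with integral 1 would do.
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(x, data, h, kernel=gaussian_kernel):
    """Evaluate the kernel density estimate (1) at the points x."""
    x = np.atleast_1d(x)
    # (1/(n h)) * sum_i K((x - x_i)/h), vectorized over a grid of x values.
    u = (x[:, None] - data[None, :]) / h
    return kernel(u).mean(axis=1) / h

# Example: i.i.d. standard normal data with an illustrative bandwidth.
rng = np.random.default_rng(0)
data = rng.standard_normal(1000)
grid = np.linspace(-3, 3, 61)
fhat = kde(grid, data, h=len(data) ** (-0.2))
```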

Now suppose that the sequence \(\{F_n\}\), random or non-random, converges (in some probabilistic sense when they are random) to a non-random distribution function F which has a density f. Then, it is natural to consider \(\hat{f}(x)\) as an estimate of f(x) and ask what properties this estimate has. Note that we are not assuming that the \(\{x_i\}\) have any independence or weak dependence structure.

We first show that under mild conditions on K and f, \(\hat{f}\) converges to f (in some probabilistic sense) when a proper bandwidth is chosen. We also relate the rate of convergence of \(\hat{f}\) to f to the rates of convergence of \(F_n\) and \({\mathrm{E}}(F_n)\) to F. In particular, this means that the KDE is very robust: more or less whenever the empirical distribution converges, the corresponding KDE is consistent, provided h is chosen suitably.

We then provide three examples from the eigenvalue distribution of random matrices. It is known that the eigenvalues of the most popular random matrices are highly correlated. Nevertheless, KDE is routinely used. Suppose \(A_n\) is a sequence of (symmetric) random matrices of increasing order. Let \(\lambda ^{(n)}_1, \lambda ^{(n)}_2, \ldots , \lambda ^{(n)}_n\) be the (random) eigenvalues, and let \(F_n\) be the corresponding empirical spectral distribution (ESD) function which puts mass \(n^{-1}\) at each of the eigenvalues. Then, the corresponding KDE is given by

$$\begin{aligned} \hat{f}(x) = \dfrac{1}{nh} \sum _{i=1}^{n} K\left( \dfrac{x-\lambda ^{(n)}_i}{h}\right) . \end{aligned}$$
(2)

For many matrix models, the ESD \(F_n\) converges almost surely to a non-random limiting spectral distribution (LSD) F, which has density f. There is a host of existing results on the rate of convergence of \(F_n\) to F for specific matrices. We use these results and their suitable modifications to establish the consistency of the KDE in two matrix models, the Wigner matrix and the sample covariance matrix. We also provide a few simulations to demonstrate these results and end with a few open questions, especially for the random Toeplitz matrix.
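The same estimator applies verbatim to eigenvalues, as in (2). A sketch, reusing the kde helper from the earlier snippet, with a Gaussian Wigner matrix (defined formally later) as a stand-in:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Symmetric matrix with i.i.d. standard normal entries on and above the diagonal.
G = rng.standard_normal((n, n))
W = np.triu(G) + np.triu(G, 1).T
eigs = np.linalg.eigvalsh(W / np.sqrt(n))   # eigenvalues of n^{-1/2} W_n

grid = np.linspace(-2.5, 2.5, 101)
fhat = kde(grid, eigs, h=n ** (-1 / 8))     # the bandwidth used in the paper's simulations
```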

Main Results

General Kernel Density Estimate

For any real-valued function g on \(\mathbb {R}\), its total variation \(\mathcal {T}(g)\) is defined as

$$\begin{aligned}\mathcal {T}(g)=\sup \sum _{i=1}^\infty \left| g(t_i)-g(t_{i-1})\right| \end{aligned}$$

where \(\{t_i\}\) is a partition of \((-\infty , \ \infty )\) and the supremum is over all such partitions. The function g is said to be of bounded variation if \(\mathcal {T}(g) < \infty \).

For two distribution functions F and G on \(\mathbb {R}\), their Kolmogorov distance is defined as

$$\begin{aligned}\Vert F-G\Vert _\infty =\sup _{x\in \mathbb {R}}|F(x)-G(x)|.\end{aligned}$$
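In simulations, this distance between an ESD \(F_n\) and a continuous limit F can be computed exactly, since the supremum is attained at the jump points of \(F_n\). A small Python sketch (ours, not from the paper):

```python
import numpy as np

def kolmogorov_distance(sample, F):
    """Exact sup_x |F_n(x) - F(x)| when F is a continuous, vectorized CDF.

    The supremum is attained at the jump points of the empirical F_n,
    so it suffices to compare F with F_n just before and at each jump.
    """
    xs = np.sort(np.asarray(sample))
    n = len(xs)
    Fx = F(xs)
    lower = np.abs(Fx - np.arange(0, n) / n)
    upper = np.abs(np.arange(1, n + 1) / n - Fx)
    return max(lower.max(), upper.max())
```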

The kernel K is always assumed to be nonnegative and \(\int _{-\infty }^{\infty } K(t){\mathrm {d}}t=1\).

Our first theorem is on the asymptotic unbiasedness of the KDE.

Theorem 1

Suppose K is a kernel of bounded variation and \(\int _{-\infty }^{\infty } |u|K(u){\mathrm {d}}u < \infty \). Let \(F_n\) be a sequence of random distribution functions on \(\mathbb {R}\). Let F be a (non-random) distribution function with density f.

  1. (i)

    If f is Lipschitz continuous at x, then

    $$\begin{aligned} |{\mathrm{E}}(\hat{f}(x))-f(x)|= O \left( \dfrac{1}{h}\Vert {\mathrm{E}}F_n -F\Vert _{\infty }+ h \right) . \end{aligned}$$
    (3)

    Hence, if \(\Vert {\mathrm{E}}(F_n) -F\Vert _{\infty } \rightarrow 0\), and \(h \rightarrow 0\) such that \([\Vert {\mathrm{E}}(F_n) -F\Vert _{\infty }]^{-1} h \rightarrow \infty \), then

    $$\begin{aligned} \lim _{n \rightarrow \infty } {\mathrm{E}}(\hat{f}(x)) = f(x). \end{aligned}$$
    (4)
  2. (ii)

    If the Lipschitz continuity in (i) above holds uniformly in x, then the convergences in (3) and (4) above are also uniform over x.

Proof

Since f is Lipschitz continuous at x, there exists a constant C (depending on x) such that for all t, \(|f(x)-f(t)|\le C|x-t|\). Using this,

$$\begin{aligned} |{\mathrm{E}}(\hat{f}(x)) - f(x) |&\le \left| {\mathrm{E}}\left( \int \dfrac{1}{h}K(\dfrac{x-t}{h}){\mathrm {d}}(F_n-F)(t) \right) \right| \\&+ \left| \int \dfrac{1}{h}K(\dfrac{x-t}{h}) (f(t) - f(x)) {\mathrm {d}}t\right| \\&\le \left| {{\,\mathrm{E}\,}}\left( \int \dfrac{1}{h}K(u){\mathrm {d}}(F_n-F)(x-uh) \right) \right| \\&+ \int K(u) |f(x -uh) - f(x)| {\mathrm {d}}u\\&\le \left| \left( \int \dfrac{1}{h}{{\,\mathrm{E}\,}}[(F_n-F)(x-uh)] {\mathrm {d}} K(u)\right) \right| + C h \int K(u) |u| {\mathrm {d}}u\\&\le \dfrac{1}{h}\Vert {{\,\mathrm{E}\,}}(F_n) -F\Vert _{\infty } \int {\mathrm {d}}|K|(u) + C h \int |u|K(u) {\mathrm {d}}u. \end{aligned}$$

All conclusions now follow directly from the above. \(\square \)

The next theorem, in particular, implies the point-wise consistency of the KDE.

Theorem 2

Let \(F_n\) be a sequence of random distribution functions on \(\mathbb {R}\). Suppose F is a (non-random) distribution function with density f. Suppose K is a kernel of bounded variation and \(\int |u| K(u){\mathrm {d}}u < \infty \).

  1. (i)

    If f is Lipschitz continuous at x, then for any \(p \ge 1\),

    $$\begin{aligned} {{\,\mathrm{E}\,}}[|\hat{f}(x) - f(x)|^p] =O\left( \dfrac{{{\,\mathrm{E}\,}}[\Vert F_n -F\Vert _{\infty }^p]}{h^p} + h^p\right) . \end{aligned}$$
    (5)

    Hence, if \(h \rightarrow 0\) such that \(\left[ {{\,\mathrm{E}\,}}(\Vert F_n -F\Vert _{\infty }^p)\right] ^{-1/p}h \rightarrow \infty \), then \({{\,\mathrm{E}\,}}|\hat{f}(x) - f(x)|^p \rightarrow 0\), which in turn implies \(\hat{f}(x) - f(x)\rightarrow 0\) in probability.

  2. (ii)

    If the Lipschitz continuity above holds uniformly in x, then the conclusions above also hold uniformly in x.

Proof

$$\begin{aligned} |\hat{f}(x) - f(x) |^p&\le \left| \int \dfrac{1}{h}K(\dfrac{x-t}{h}){\mathrm {d}}(F_n-F)(t) + \int \dfrac{1}{h}K(\dfrac{x-t}{h}) (f(t) - f(x)) {\mathrm {d}}t \right| ^p \\&= \left| \int \dfrac{1}{h}K(u){\mathrm {d}}(F_n-F)(x-uh) + \int K(u) (f(x -uh) - f(x)) {\mathrm {d}}u \right| ^p\\&\le 2^p \big \{\big | \int \dfrac{1}{h}(F_n-F)(x-uh) {\mathrm {d}} K(u)\big |^p \\&\quad+ \big |\int K(u) (f(x -uh) - f(x)) {\mathrm {d}}u\big |^p \big \}\\&\le \dfrac{2^p \Vert F_n -F\Vert _{\infty }^p}{h^p} \Big (\int {\mathrm {d}}|K|(u)\Big )^p + 2^p C^{p} h^p \big |\int |u| K(u) {\mathrm {d}}u\big |^{p} \end{aligned}$$

Taking expectations on both sides yields (5); the remaining conclusions follow immediately. \(\square \)

The above result is useful only if we can estimate the quantity \({{\,\mathrm{E}\,}}[\Vert F_n -F\Vert _{\infty }^p]\). We investigate this issue in the next section.

Bound for the Distribution Function

To apply Theorems 1 and 2, we need bounds on \(\Vert {{\,\mathrm{E}\,}}(F_n)-F\Vert _\infty \) and \({{\,\mathrm{E}\,}}[\Vert F_n-F\Vert _\infty ^p]\). The next theorem connects the two in a general way.

For any distribution function G (possibly random), its (random) characteristic function is defined as

$$\begin{aligned}\varphi _G(t)= \int e^{itx} {\mathrm {d}}G(x).\end{aligned}$$

Note that if G is a random distribution function, then \(\varphi _G\) is also random.

The convolution \(F*G\) of two distributions F and G is defined as

$$\begin{aligned}F*G(x)=\int F(x-y){\mathrm {d}}G(y)=\int G(x-y){\mathrm {d}}F(y).\end{aligned}$$

The next result is an extension of a result of Chatterjee and Bose [7] who considered the case \(p=2\).

Theorem 3

Suppose F is a random distribution function on \(\mathbb {R}\) with (random) characteristic function \(\varphi _{F}\). Let \(p \ge 2\). Suppose that for each t,

$$\begin{aligned} {{\,\mathrm{E}\,}}\big [|\varphi _{F}(t) - {{\,\mathrm{E}\,}}[\varphi _{F}(t)] |^p\big ] \le C^p|t|^p. \end{aligned}$$
(6)

If G is a non-random distribution function on \(\mathbb {R}\) such that \(\sup _{x\in {\mathbb {R}}}|G^\prime (x)|\le \lambda \), then, with the same C as above,

$$\begin{aligned} {{\,\mathrm{E}\,}}[\Vert F - G \Vert _\infty ^p] \le 6^p\Vert {{\,\mathrm{E}\,}}(F) - G\Vert _\infty ^p + K (C\lambda )^{p/2} \end{aligned}$$

where K depends only on p.

Proof

The proof we give below parallels the proof given in Chatterjee and Bose [7] for \(p=2\).

Let

$$\begin{aligned}F_0(x)={{\,\mathrm{E}\,}}(F(x)), \ x\in \mathbb {R}.\end{aligned}$$

Then, \(F_0\) is a distribution function with characteristic function

$$\begin{aligned}\varphi _{F_{0}}(t)={{\,\mathrm{E}\,}}(\varphi _{F}(t)).\end{aligned}$$

Define the probability density on \(\mathbb {R}\)

$$\begin{aligned} h_L(x) = \dfrac{1 - \cos (Lx)}{\pi Lx^2}, -\infty< x < \infty . \end{aligned}$$

Let \(H_L\) be the corresponding distribution function. The characteristic function of \(H_L\) is given by

$$\begin{aligned}\psi _L(t)= \big (1-\dfrac{|t|}{L}\big )\mathbb {I}(|t|\le L).\end{aligned}$$

By Esseen’s lemma (see Feller [10], page 510),

$$\begin{aligned} \Vert F - G\Vert _\infty \le 2 \Vert F*H_L - G*H_L \Vert _\infty + \dfrac{24\lambda }{\pi L}. \end{aligned}$$
(7)

Now

$$\begin{aligned} \Vert F*H_L - G*H_L \Vert _\infty\le\, & \Vert F_0*H_L - G*H_L\Vert _\infty +\Vert F*H_L - F_0*H_L\Vert _\infty \\\le\, & \Vert F_0 - G\Vert _\infty +\Vert F*H_L - F_0*H_L\Vert _\infty . \end{aligned}$$

By applying the inversion formula (see Feller [10], pages 482–484), we have

$$\begin{aligned} \Vert F*H_L - F_0*H_L\Vert _\infty \le \dfrac{1}{\pi } \int _{-L}^L |\psi _L(t)|\dfrac{ |\varphi _{F}(t) - \varphi _{F_{0}}(t)|}{|t|} {\mathrm {d}}t. \end{aligned}$$

Hence, using hypothesis (6), and noting that \(\int _{-\infty }^\infty |\psi _L(t)|{\mathrm {d}}t = L\), by Jensen’s inequality,

$$\begin{aligned} {{\,\mathrm{E}\,}}\big [\Vert F*H_L - F_0*H_L\Vert _\infty ^p\big ] \le \dfrac{L^p}{\pi ^p} \int _{-L}^L \dfrac{|\psi _L(t)|}{L}\dfrac{ {{\,\mathrm{E}\,}}[|\varphi _{F}(t) - \varphi _{F_{0}}(t)|^p]}{|t|^p} {\mathrm {d}}t\le \dfrac{C^p L^p}{\pi ^p}. \end{aligned}$$

Combining all these facts and noting that for \(p \ge 1, \ |a+b+c|^p \le 3^p(|a|^p + |b|^p +|c|^p)\), we have

$$\begin{aligned} {{\,\mathrm{E}\,}}\Vert F - G\Vert _\infty ^p \le 6^p\Vert F_0 - G\Vert _\infty ^p + \left( \dfrac{6CL}{\pi }\right) ^p + \left( \dfrac{72\lambda }{\pi L}\right) ^p. \end{aligned}$$

Choosing \(L \sim (\lambda /C)^{1/2}\), so that the last two terms are both of order \((C\lambda )^{p/2}\), gives the desired conclusion. \(\square \)

The crux of the above result is of course the bound assumed in condition (6). The random quantity \(\varphi _{F}(t)\) can be viewed as a nonlinear function of the underlying data generating process. So, the problem boils down to getting variance type bounds for functions of random variables which are nonlinear. We achieve these bounds by the so-called Efron–Stein inequalities. With more restrictive assumptions on the underlying random variables, sharpening of these bounds may be possible. For further information on Efron–Stein inequalities, see Efron and Stein [9], Steele [15], Devroye [8], and Györfi et al. [12].

The next result is an extension of a well-known result for the case \(p=2\).

Theorem 4

Suppose \(f(x_1, x_2, \ldots , x_n)\) is a coordinatewise differentiable map from \(\mathbb {R}^n\) into \(\mathbb {C}\), and \(M_j\), \(1\le j \le n\), are constants such that

$$\begin{aligned} \sup _x|\dfrac{\partial f}{\partial x_j}| \le M_j \;\;\; \text{ for } j = 1,2,\ldots , n. \end{aligned}$$

Let \(p\ge 2.\) Then, there is a constant K, depending only on p, such that for any independent random variables \(X_1, X_2, \ldots , X_n,\)

$$\begin{aligned} {{\,\mathrm{E}\,}}\big [|f(X_1, X_2, \ldots , X_n) - {{\,\mathrm{E}\,}}[f(X_1, X_2, \ldots , X_n)]|^p\big ] \le K \Big (\sum _{k=1}^n M_k^2 ({{\,\mathrm{E}\,}}| X_k -X^*_k|^p)^{2/p} \Big )^{p/2} \end{aligned}$$

where \(\{X_k^*\}\) denotes an independent copy of \(\{X_k\}\).

In particular, if \( X_1, X_2, \ldots , X_n\) are i.i.d. and \(M_j = M\) for all j, then

$$\begin{aligned} {{\,\mathrm{E}\,}}\big [|f(X_1, X_2, \ldots , X_n) - {{\,\mathrm{E}\,}}[f(X_1, X_2, \ldots , X_n)]|^p\big ] \le K n^{p/2} M^p {{\,\mathrm{E}\,}}| X_1 -X^*_1|^p. \end{aligned}$$

Proof

Define the \(\sigma \)-fields \( \mathcal {F}_k = \sigma (X_1, X_2,\ldots , X_k)\), \(k\ge 1\), with \(\mathcal {F}_0\) trivial. Clearly, \(\{\mathcal {F}_k\}_{k \ge 0} \) forms a filtration. Denote by \({{\,\mathrm{E}\,}}_k\) the conditional expectation given \( \mathcal {F}_k\). Then,

$$\begin{aligned} f(X_1, X_2, \ldots , X_n) - {{\,\mathrm{E}\,}}f(X_1, X_2, \ldots , X_n) = \sum _{k=1}^{n} \gamma _k \end{aligned}$$

where

$$\begin{aligned} \gamma _k= {{\,\mathrm{E}\,}}_k \left[ f(X_1, X_2, \ldots , X_n)\right] - {{\,\mathrm{E}\,}}_{k-1} \left[ f(X_1, X_2, \ldots , X_n)\right] . \end{aligned}$$

Since \(X_k^*\) is distributed as \(X_k\) and is independent of \(\{ X_1, X_2, \ldots , X_n\}\),

$$\begin{aligned}&{{\,\mathrm{E}\,}}_k \left[ f(X_1, \ldots , X_{k-1},X^*_k, X_{k+1},\ldots , X_n)\right] \\&\qquad = {{\,\mathrm{E}\,}}_{k-1} \left[ f(X_1, \ldots , X_{k-1},X^*_k, X_{k+1},\ldots , X_n)\right] . \end{aligned}$$

Therefore, we have,

$$\begin{aligned} |\gamma _k|&\le \big | {{\,\mathrm{E}\,}}_{k}\left[ f(X_1, X_2,\ldots ,X_n) -f(X_1,\ldots ,X^*_k,\ldots ,X_n)\right] \big | \ \\&\quad + \big | {{\,\mathrm{E}\,}}_{k-1} \left[ f(X_1,X_2,\ldots ,X_n) -f(X_1,\ldots ,X^*_k,\ldots ,X_n)\right] \big |\\&\le M_k \big ({{\,\mathrm{E}\,}}_k [|X_k - X^*_k|] + {{\,\mathrm{E}\,}}[|X_k -X^*_k|]\big ). \end{aligned}$$

Observe that \(\{ \gamma _k\} \) forms a complex martingale difference sequence. By Burkholder's inequality (see Burkholder [6]),

$$\begin{aligned} \Delta := {{\,\mathrm{E}\,}}\big [\big |f(X_1, \ldots , X_n) - {{\,\mathrm{E}\,}}\big [f(X_1, \ldots , X_n)\big ]\big |^p\big ] = {{\,\mathrm{E}\,}}\big [\big | \sum _k \gamma _k \big |^p\big ] \le K {{\,\mathrm{E}\,}}\Big [\Big ( \sum _k | \gamma _k |^2 \Big )^{p/2}\Big ]. \end{aligned}$$

By Minkowski’s inequality applied to the right-hand side,

$$\begin{aligned} \Delta \le K \left[ \sum _k \left( {{\,\mathrm{E}\,}}|\gamma _k|^p \right) ^{2/p} \right] ^{p/2}. \end{aligned}$$

By Jensen’s inequality,

$$\begin{aligned} {{\,\mathrm{E}\,}}[|\gamma _k|^p] \le 2^p M_k^p {{\,\mathrm{E}\,}}\big [{{\,\mathrm{E}\,}}_k[|X_k - X^*_k |^p] + {{\,\mathrm{E}\,}}[| X_k -X^*_k|^p]\big ] \le 2^{p+1} M_k^p {{\,\mathrm{E}\,}}[| X_k -X^*_k|^p] \end{aligned}$$

and the theorem follows. \(\square \)
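As a sanity check, the \(p=2\) case of the theorem (the classical Efron–Stein-type bound) is easy to probe numerically. The following Python sketch is our toy example, not from the paper: it uses \(f(x)=n^{-1}\sum _j \cos (x_j)\), for which \(M_j = 1/n\), so the theorem predicts a variance of order \(n\cdot (1/n)^2 = 1/n\).

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    # Toy coordinatewise-differentiable statistic with |df/dx_j| <= 1/n.
    return np.cos(x).mean()

for n in (100, 400, 1600):
    vals = np.array([f(rng.standard_normal(n)) for _ in range(2000)])
    # Theorem 4 with p = 2 predicts a variance of order 1/n here.
    print(n, vals.var(), 1 / n)
```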

Kernel Density Estimate for Random Matrices

Now suppose that \(A_n\) is a sequence of (symmetric) random matrices of increasing order with eigenvalues \(\lambda ^{(n)}_1, \lambda ^{(n)}_2, \ldots , \lambda ^{(n)}_n\). Let

$$\begin{aligned}F_n(x)=n^{-1}\sum _{i=1}^{n} \mathbb {I}(\lambda ^{(n)}_i\le x)\end{aligned}$$

be the corresponding empirical spectral distribution (ESD) function.

In the random matrix literature, for various types of matrices, \(F_n\) converges to some non-random distribution F, either almost surely or in probability. Then, F is called the limiting spectral distribution (LSD) of the sequence of matrices. See Bose [4] for common examples of patterned random matrices where the LSD exists. Often this LSD has a bounded density, say f. Note that if the above convergence holds, then the non-random distribution function \({{\,\mathrm{E}\,}}(F_n)\) also converges to F.

At the same time, for visualization, when these matrices are simulated, a KDE is routinely fitted to the eigenvalues. We demonstrate two examples of random matrices, namely the Wigner matrix and the sample covariance matrix, where consistency of the KDE follows from our results, thus validating the fitting of a KDE to the eigenvalues. Simulations from these matrices along with some performance measures are also provided. We also discuss the random Toeplitz matrix, where the LSD is known to have a density; however, no further properties of the limit density are known, nor is any rate of convergence of the ESD to the LSD known.

Wigner Matrix

The semi-circle distribution F has the density

$$\begin{aligned} f(x)= & {\left\{ \begin{array}{ll} \dfrac{1}{2\pi } \sqrt{4-x^2},\ \ \text {if} \ \ -2< x <2 \\ 0,\ \ \text {otherwise.} \end{array}\right. } \end{aligned}$$

Figure 1 gives a plot of this density.

Fig. 1: Density of the standard semi-circle law

It arises as the density of the limiting spectral distribution of the Wigner matrix (see below). Note that f is bounded by \(\pi ^{-1}\).

The Wigner matrix \(W_n =((w_{jk}^{(n)}))\) is a symmetric matrix with random independent entries on and above the diagonal. We shall drop the superscript n for ease of notation. Let \(F_n\) be the ESD of \(n^{-1/2}W_n\).

Under different sets of conditions, \(F_n\) converges weakly, either almost surely or in probability, to F. See Bose [4] for a discussion on the different sets of conditions. We shall work with the following assumption:

(W1):

\({{\,\mathrm{E}\,}}(w_{jk})= 0\), \({{\,\mathrm{E}\,}}(w_{jk}^2)=1\) for all \(j, k\).

(W2):

\(\sup _{i, j, n} {{\,\mathrm{E}\,}}[w_{ij}^4] < \infty .\)

(W3):

\(\displaystyle {\sum _{1\le i, j \le n}} {{\,\mathrm{E}\,}}\left( w_{ij}^4 \mathbb {I}_{\{|w_{ij}| \ge \epsilon n^{1/2}\}}\right) = o(n^2) \text{ for } \text{ any } \epsilon > 0.\)

Bai [1] showed that if (W1)–(W3) are satisfied, then

$$\begin{aligned}\Vert {{\,\mathrm{E}\,}}(F_n)-F\Vert _\infty =O(n^{-1/4}).\end{aligned}$$

We now explain how we can apply Theorem 2 to establish the consistency of the KDE together with a rate of convergence.

Let us fix some \(t \in \mathbb {R}\) and let \(\varphi _n(t)\) be the characteristic function of the ESD of \(n^{-1/2}W_n\) evaluated at t. Chatterjee and Bose [7] showed that

$$\begin{aligned} \left\| \dfrac{\partial \varphi _n(t)}{\partial w_{jk}}\right\| _\infty \le 2|t|n^{-3/2}. \end{aligned}$$

So by Theorem 4, noting that the Wigner matrix involves \(O(n^2)\) independent random variables, we have the following inequality for the characteristic function of the ESD of \(n^{-1/2}W_n\) with i.i.d. entries having finite pth moment (below, K is a constant depending only on p).

$$\begin{aligned} {{\,\mathrm{E}\,}}\big [|\varphi _n(t)- {{\,\mathrm{E}\,}}\varphi _n(t)|^p\big ] \le K (n^{2})^{p/2}(|t|n^{-3/2})^{p}=K|t|^p n^{-p/2}. \end{aligned}$$
(8)
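Inequality (8) can also be probed empirically. A sketch (our script; the ±1 entry distribution, the value of t, and the replication count are our choices) estimating the left side of (8) for \(p=2\):

```python
import numpy as np

rng = np.random.default_rng(3)
t = 1.0

def phi_wigner(n):
    # Characteristic function of the ESD of n^{-1/2} W_n at t,
    # i.e. (1/n) sum_j exp(i t lambda_j), for a +/-1 Wigner matrix.
    G = rng.choice([-1.0, 1.0], size=(n, n))
    W = np.triu(G) + np.triu(G, 1).T
    eigs = np.linalg.eigvalsh(W / np.sqrt(n))
    return np.exp(1j * t * eigs).mean()

for n in (50, 100, 200):
    z = np.array([phi_wigner(n) for _ in range(200)])
    # (8) with p = 2 predicts E|phi_n(t) - E phi_n(t)|^2 = O(1/n).
    print(n, np.mean(np.abs(z - z.mean()) ** 2), 1 / n)
```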

Using (8) and Theorem 3, we get for \(p \ge 2\),

$$\begin{aligned} {{\,\mathrm{E}\,}}\big [\Vert F_n-F\Vert _\infty ^p\big ] \le 6^p\Vert {{\,\mathrm{E}\,}}F_n-F\Vert _\infty ^p + O(n^{-p/4}). \end{aligned}$$
(9)

Observe that, by the bound of Bai [1] quoted above, the first term is also \(O(n^{-p/4})\). We can use this result in Theorem 2 and obtain the following theorem. Suppose \(\hat{f}\) is a kernel density estimate constructed from the eigenvalues of \(n^{-1/2}W_n\), where the kernel K satisfies the usual conditions.

Theorem 5

Suppose the Wigner matrix satisfies the conditions (W1)–(W3) given above. Then, for any compact subset C which excludes \(\pm 2\),

$$\begin{aligned} \big [{{\,\mathrm{E}\,}}[\sup _{x\in C} |\hat{f}(x) - f(x)|^p]\big ]^{1/p}=O(\dfrac{1}{n^{1/4} h}+h),\,p \le 4. \end{aligned}$$

The above bound continues to hold for any \(p> 4\) provided \(\sup _{i, j, n} {{\,\mathrm{E}\,}}[|w_{ij}|^p] < \infty \).

In particular, \(\displaystyle {\sup _{x\in C}}|\hat{f}(x)-f(x)|\rightarrow 0\) in probability if \(h\rightarrow 0\) and \(hn^{1/4}\rightarrow \infty \).

If we optimize the upper bound \(O(\dfrac{1}{n^{1/4} h}+h)\) over h, then we obtain \(h\cong n^{-1/8}\), and the upper bound is of the order \(n^{-1/8}\). We do not claim that this is the optimal rate of convergence of \(\big [{{\,\mathrm{E}\,}}[\displaystyle {\sup _{x\in C}} |\hat{f}(x) - f(x)|^p]\big ]^{1/p}\), nor that this bandwidth has any optimality properties.

Fig. 2: Histograms and KDE for Wigner matrix ESD, \(n=50\) (left) and \(n=500\) (right)

Simulations for the Wigner matrix. We considered the Wigner matrix with i.i.d. entries, with several different choices of the dimension n, the distribution of the entries, and the kernel K. We took \(C=[-2+0.05, \ 2-0.05]\). We summarize below the findings of the simulations:

  1. 1.

    Figure 2 presents the histograms along with the KDE, which is in striking agreement with the semi-circle density, even for \(n=50\). The entries were chosen as i.i.d. Bernoulli ±1 with probability 1/2 each. Similar results were obtained for other choices of the underlying distributions. Recall that the optimal bandwidth obtained by minimizing the upper bound is \(h=n^{-1/8}\). We used this bandwidth and the Gaussian kernel. The typical number of replications was 500. We also tried other bandwidths and kernels; the Gaussian kernel with the bandwidth \(h=n^{-1/2}\) gave the best results.

  2. 2.

    Our theoretical results give an upper bound on the rate of convergence: with the choice \(h=n^{-1/8}\), we have \(\big [{{\,\mathrm{E}\,}}[\displaystyle {\sup _{x\in C}} |\hat{f}(x) - f(x)|^p] \big ]^{1/p} = O(n^{-1/8})\). The top right panel of Fig. 3 gives the box plots of the random variable \(X_{n, h} = \sup _{x\in C} |\hat{f}(x) - f(x)|\). There is only a marginal difference in the outcome as the kernel varies. It would be interesting to investigate whether, after proper centering and scaling, \(X_{n,h}\) converges in distribution.

  3. 3.

    To see how \(\hat{f}(x) - f(x)\) varies with x, and keeping in mind the upper bound \(n^{-1/8}\), we considered \(n^{1/8} (\hat{f}(x) - f(x))\) for a range of values of x, fixing n at 500. In the left panel of Fig. 3, we present the Q–Q plots for a selection of x values. At each x, the distribution appears to be normal, but with different variances.

  4. 4.

    In Theorem 5, with the bandwidth \(h=n^{-1/8}\), the rate of convergence is \(n^{-1/8}\). Of course, it is possible that these upper bound rates are conservative as a function of n. We investigated this by simulations (a code sketch for one such experiment follows this list). We took the average over repeated sampling of \(\displaystyle {\sup _{x\in C}} |\hat{f}(x) - f(x)|^p\) (\(=s_{n,p}\), say) for a sequence of choices of n and p: \(n=50(50)1000\) and \(p=0.25(0.25)5\). The bottom right panel of Fig. 3 presents, for each p, the slope of the regression of \(\log (s_{n,p})\) on \(\log n\), based on 50 replicates. The plot of these slopes against p is linear with approximate slope −1/8, leading us to believe that with the choice \(h=n^{-1/8}\), the upper bound rate of \(n^{-1/8}\) in Theorem 5 is actually sharp.
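A minimal sketch of one such experiment (our implementation; the paper used several kernels and typically 500 replications, whereas this sketch uses the Gaussian kernel and 50 replications):

```python
import numpy as np

rng = np.random.default_rng(4)

def semicircle(x):
    # Density of the standard semi-circle law on (-2, 2).
    return np.where(np.abs(x) < 2,
                    np.sqrt(np.maximum(4 - x**2, 0)) / (2 * np.pi), 0.0)

def sup_error_wigner(n, h, grid):
    # One replicate of sup_{x in C} |fhat(x) - f(x)| for a +/-1 Wigner matrix.
    G = rng.choice([-1.0, 1.0], size=(n, n))
    W = np.triu(G) + np.triu(G, 1).T
    eigs = np.linalg.eigvalsh(W / np.sqrt(n))
    u = (grid[:, None] - eigs[None, :]) / h
    fhat = np.exp(-0.5 * u**2).mean(axis=1) / (h * np.sqrt(2 * np.pi))
    return np.abs(fhat - semicircle(grid)).max()

C = np.linspace(-1.95, 1.95, 200)          # C = [-2+0.05, 2-0.05]
for n in (50, 200, 800):
    errs = [sup_error_wigner(n, n ** (-1 / 8), C) for _ in range(50)]
    print(n, np.mean(errs))                # should shrink roughly like n^{-1/8}
```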

Fig. 3: Summary statistics and plots for the KDE in the Wigner matrix, where the kernels are G-Gaussian, E-Epanechnikov, T-Triangular, B-Biweight, C-Cosine, O-Optcosine

Sample Covariance Matrix

The Marčenko–Pastur distribution \(F_y(\cdot )\) with parameter \(0 < y \le 1\) arises as the LSD of the sample covariance matrix. It has a continuous density

$$\begin{aligned} f_y(x) = \left\{ \begin{array}{cl} \dfrac{1}{2\pi xy} \sqrt{(b-x)(x-a)} & \text{ if } a \le x \le b, \\ & \\ 0 & \text{ otherwise. } \end{array} \right. \end{aligned}$$
(10)

where

$$\begin{aligned}a = a(y) = (1 - \sqrt{y})^2\ \ \text {and}\ \ b = b(y) = (1+\sqrt{y})^2. \end{aligned}$$

We now show that the density is bounded. Write

$$\begin{aligned}(2\pi )^2y^2f_y^2(x)= & \dfrac{(b-x)(x-a)}{x^2}\\= & \dfrac{a+b}{x}-\dfrac{ab}{x^2}-1=g(x),\ \ \text {say}. \end{aligned}$$

It is easily checked that

$$\begin{aligned} g^{\prime }(x)=-\dfrac{a+b}{x^2}+\dfrac{2ab}{x^3}. \end{aligned}$$
(11)

The solution of \(g^{\prime }(x)=0\) is \(x_0=\dfrac{2ab}{a+b}\), the harmonic mean of a and b; in particular, \(a\le x_0\le b\). It is clear from (11) that g is increasing for \(a\le x\le x_0\) and decreasing for \(x_0\le x\le b\). Thus, g, and hence \(f_y\), attains its maximum at \(x_0\). Hence,

$$\begin{aligned} \sup _{ a\le x\le b}f_y(x)=f_y(x_0)= & \dfrac{1}{2\pi y x_0}\sqrt{(b-x_0)(x_0-a)}\\= & \dfrac{a+b}{4\pi y ab}\cdot \dfrac{\sqrt{(b^2-ab)(ab-a^2)}}{a+b}\\= & \dfrac{b-a}{4\pi y \sqrt{ab}}= [\pi \sqrt{y}(1-y)]^{-1}=\lambda ,\ \ \text {say},\end{aligned}$$

since \(b-a=4\sqrt{y}\) and \(\sqrt{ab}=(1-\sqrt{y})(1+\sqrt{y})=1-y\).
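A quick numerical check of this bound (our script; it simply maximizes \(f_y\) on a fine grid and compares the maximum with \(\lambda \)):

```python
import numpy as np

def mp_density(x, y):
    # Marcenko-Pastur density f_y on (a, b), zero elsewhere.
    a, b = (1 - np.sqrt(y)) ** 2, (1 + np.sqrt(y)) ** 2
    inside = (x > a) & (x < b)
    out = np.zeros_like(x)
    out[inside] = np.sqrt((b - x[inside]) * (x[inside] - a)) \
        / (2 * np.pi * y * x[inside])
    return out

for y in (0.1, 0.25, 0.5):
    a, b = (1 - np.sqrt(y)) ** 2, (1 + np.sqrt(y)) ** 2
    x = np.linspace(a, b, 200001)
    lam = 1 / (np.pi * np.sqrt(y) * (1 - y))   # the bound lambda computed above
    print(y, mp_density(x, y).max(), lam)      # the two values should agree closely
```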

This distribution is also defined for \(y > 1\), but then there is an extra point mass of \(1-y^{-1}\) at 0; hence, this case is excluded from our discussion.

Suppose X is a real \(p_n\times n\) matrix with entries \(x_{jk}\), which are i.i.d. real random variables with mean zero and unit variance. Let \(S_n = \dfrac{1}{n}XX^{\top }\). We shall write S for \(S_n\). This is the unadjusted sample covariance matrix when we treat the n columns of X as observations on a vector of dimension \(p_n\). It is known that if the fourth moment of \(x_{jk}\) is finite and \(p_n\rightarrow \infty , y_n=p_n/n \longrightarrow y \in (0, \infty )\), then the LSD of S is \(F_y\) almost surely. See Bose [4] for a detailed proof and also for the case \(y=0\). We focus only on the case \(y \in (0, 1)\) to avoid point masses and other rate of convergence issues.

As in the case of the Wigner matrix, several rate results are known; see, for example, Bai, Miao and Tsay [2], Bai, Miao and Yao [3], and Götze and Tikhomirov [11]. In particular, it follows from Bai, Miao and Yao [3] that if \(y_n\) remains bounded away from 1 and all the random variables are uniformly bounded by \(M < \infty \), then

$$\begin{aligned} ||{{\,\mathrm{E}\,}}(F_n)-F_y||_\infty =O(n^{-1/2}). \end{aligned}$$
(12)

To keep things simple, we shall assume this boundedness condition. One can replace this strong assumption by weaker conditions to obtain results weaker than what we present below. See also the above references and Chatterjee and Bose [7].

Now consider S as a function of the entries of X. Clearly,

$$\begin{aligned} S_{jk} := \dfrac{\partial S}{\partial x_{jk}} = \dfrac{1}{n}(YX^{\top } + XY^{\top }) \end{aligned}$$

where \(Y = \partial X / \partial x_{jk}\). The matrix Y has 1 at the (j, k)th position and 0 elsewhere, i.e., \(Y = e_je_k^{\top }\), where \(e_j\) is the vector with 1 at the jth position and 0 elsewhere. Thus, if \(x_{\cdot k}\) denotes the kth column of X and \(\varphi _n(t)\) denotes the characteristic function of the ESD of \(S_n\) evaluated at t, then

$$\begin{aligned} \dfrac{\partial \varphi _n(t)}{\partial x_{jk}}=\, & p_n^{-1} \text{ tr }(itS_{jk}e^{itS}) \\= \,& it(np_{n})^{-1} \text{ tr } (YX^{\top } e^{itS} + XY^{\top } e^{itS}) \\= \,& it(np_{n})^{-1} \text{ tr } (e_j x_{\cdot k}^{\top } e^{itS} + x_{\cdot k} e_j^{\top } e^{itS}) \\= \,& it(np_{n})^{-1} (x_{\cdot k}^{\top } e^{itS}e_j + e_j^{\top } e^{itS} x_{\cdot k})\\= \,& \dfrac{2itz_{kj}}{np_{n}} \end{aligned}$$

where we have written \(z_{kj}\) for the jth component of the vector \(z_k := e^{itS}x_{\cdot k}\). Note that since \(e^{itS}\) is unitary, \(\Vert z_k\Vert = \Vert x_{\cdot k}\Vert \).

Hence, using the boundedness of the variables,

$$\begin{aligned} \left|\dfrac{\partial \varphi _n(t)}{\partial x_{jk}}\right| \le 2|t|(np_n)^{-1} \Vert z_k \Vert = 2|t|(np_n)^{-1} \Vert x_{\cdot k}\Vert \le \dfrac{2M|t|}{n\sqrt{p_{n}}}. \end{aligned}$$
(13)

So by Theorem 4, noting that the S matrix involves \(O(np_n)\) independent random variables, we have the following inequality regarding the characteristic function of the ESD of S when X has i.i.d. entries with finite pth moment (below K is an absolute constant, depending only on p).

$$\begin{aligned} {{\,\mathrm{E}\,}}\big [|\varphi _n(t)- {{\,\mathrm{E}\,}}\varphi _n(t)|^p\big ] \le K (np_n)^{p/2}(\dfrac{M|t|}{n\sqrt{p_{n}}})^{p}=KM^{p}|t|^p n^{-p/2}. \end{aligned}$$
(14)

Using (14) and Theorem 3, we get for \(p \ge 2\),

$$\begin{aligned} {{\,\mathrm{E}\,}}[\Vert F_n-F\Vert _\infty ^p] \le 6^p\Vert {{\,\mathrm{E}\,}}(F_n)-F\Vert _\infty ^p + K(M\lambda )^{p/2}n^{-p/4} \end{aligned}$$
(15)

for some constant K which depends only on p. Observe that since (12) holds, the first term on the right side of the above inequality is of much smaller order than the second term. As a consequence, using (15) and Theorem 2, we obtain the following result.

Suppose \(\hat{f}\) is the kernel density estimate constructed from the eigenvalues of the matrix \(S_n=n^{-1}XX^{\top }\), where the kernel K satisfies the usual conditions.

Theorem 6

Suppose the entries \(\{x_{jk}\}\) of X are independent with mean 0 and variance 1 and are uniformly bounded by M. Suppose \(p_n\rightarrow \infty \) and \(p_n/n \rightarrow y \in (0, \ 1)\). Then, for any compact subset C which excludes the points \(a = a(y) = (1 - \sqrt{y})^2\) and \(b = b(y) = (1+\sqrt{y})^2\),

$$\begin{aligned}\big [{{\,\mathrm{E}\,}}[\sup _{x\in C} |\hat{f}(x) - f(x)|^p]\big ]^{1/p}=O\left( (M\lambda )^{1/2}\left( \dfrac{1}{n^{1/4} h}+h\right) \right) .\end{aligned}$$

In particular, \(\sup _{x\in C}|\hat{f}(x)-f(x)|\rightarrow 0\) in probability if \(h\rightarrow 0\) and \(hn^{1/4}\rightarrow \infty \).

If we optimize the upper bound \(O(\dfrac{1}{n^{1/4} h}+h)\), then the optimal bandwidth is given by \(h\cong n^{-1/8}\) with the corresponding rate of convergence of \(n^{-1/8}\).

Fig. 4: Histogram along with the KDE (dashed line) of the ESD of the S-matrix and the Marčenko–Pastur density

Fig. 5: Summary statistics and plots for the KDE of the ESD of the S-matrix

Simulations for the sample covariance matrix. Even though our theorem assumes boundedness of the entries, we decided to explore the unbounded case. So, we let the entries of \(X_{p_n\times n}\) be i.i.d. standard normal. Several different choices of the dimension n, the kernel K, and \(p_n/n=y\) were explored. A selection of our findings is presented in Figs. 4 and 5. As before, we used the bandwidth \(h=n^{-1/8}\), obtained by minimizing the upper bound. The typical number of replications was 500. We took \(C=[a+0.05, \ b -0.05]\). We may summarize the findings of the simulations as follows:

  1. 1.

    Figure 4 gives the histograms and the KDE of the ESD for \(n=50\) (top panel) and \(n=500\) (bottom panel), along with the Marčenko–Pastur density. The KDE is in agreement with the limit density, even for a dimension as low as \(n=50\), with \(y=0.25\).

  2. 2.

    The left panel of Fig. 5 gives the box plots of the random variable \(Y_{n,h}=\sup _{x\in C} |\hat{f}(x) - f(x)|\). The effect of the choice of the kernel is even smaller than in the Wigner case. The middle panel gives the Q–Q plot against the normal quantiles, which shows reasonable agreement.

  3. 3.

    To explore the sharpness of the bound \(n^{-1/8}\), we took the average over repeated sampling of \(\displaystyle {\sup _{x\in C}} |\hat{f}(x) - f(x)|^p\) (\(=s_{n,p}\), say) for a sequence of choices of n and p: \(n=50(50)1000\) and \(p=0.25(0.25)5\). The right panel of Fig. 5 presents, for each p, the slope of the regression of \(\log (s_{n,p})\) on \(\log n\), based on 50 replicates. This plot too is linear, as in the Wigner case. (A code sketch for one replicate of such an experiment follows this list.)
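A minimal sketch of one replicate (our implementation; it reuses the illustrative mp_density helper from the earlier sketch and the Gaussian kernel, which are our choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n, y = 500, 0.25
p_dim = int(y * n)                         # p_n = yn; named p_dim to avoid clashing with the moment p
X = rng.standard_normal((p_dim, n))        # unbounded entries, as in the simulations
S = X @ X.T / n                            # unadjusted sample covariance matrix
eigs = np.linalg.eigvalsh(S)

a, b = (1 - np.sqrt(y)) ** 2, (1 + np.sqrt(y)) ** 2
C = np.linspace(a + 0.05, b - 0.05, 200)   # C = [a+0.05, b-0.05]
h = n ** (-1 / 8)
u = (C[:, None] - eigs[None, :]) / h
fhat = np.exp(-0.5 * u**2).mean(axis=1) / (h * np.sqrt(2 * np.pi))
err = np.abs(fhat - mp_density(C, y)).max()   # one draw of Y_{n,h}
```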

Toeplitz Matrix

The \(n\times n\) matrix \(T_n = ((x_{|i-j|}))\) is called a symmetric Toeplitz matrix of order n. Let \(F_n\) be the ESD of \(n^{-1/2}T_n\). It is known that if the \(\{x_i\}\) are i.i.d. with mean 0 and variance 1, then \(F_n\) converges almost surely to a non-random limit, say F (see Bryc, Dembo and Jiang [5] and Hammond and Miller [13]); see Bose [4] for a detailed proof. While the form of F is unknown, it is known that the distribution is symmetric about 0 and has variance 1. Sen and Virág [14] proved that F has a bounded density, say f. No further smoothness properties of f appear to be known. Under suitable conditions on the entries, it follows from Chatterjee and Bose [7] that

$$\begin{aligned}{{\,\mathrm{E}\,}}[\Vert F_n - F\Vert _{\infty }] \le 2\Vert {{\,\mathrm{E}\,}}(F_n) - F\Vert _{\infty } + O(n^{-1/4}).\end{aligned}$$

One may also be able to sharpen this result using the arguments of the present article. Unfortunately, no rate results are yet known for the quantity \(\Vert {{\,\mathrm{E}\,}}(F_n) - F\Vert _{\infty }\).
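Even without rate results, the KDE itself is straightforward to compute for this model. A sketch (our code; the dimension, bandwidth, and entry distribution match the simulation choices described below):

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(6)
n = 1000
x = rng.standard_normal(n)                 # x_0, x_1, ..., x_{n-1}
T = toeplitz(x)                            # symmetric, T[i, j] = x_{|i-j|}
eigs = np.linalg.eigvalsh(T / np.sqrt(n))  # eigenvalues of n^{-1/2} T_n

# Compare the KDE with the standard normal density, e.g., near the peak.
grid = np.linspace(-3, 3, 121)
h = n ** (-1 / 2)
u = (grid[:, None] - eigs[None, :]) / h
fhat = np.exp(-0.5 * u**2).mean(axis=1) / (h * np.sqrt(2 * np.pi))
normal = np.exp(-0.5 * grid**2) / np.sqrt(2 * np.pi)
```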

Fig. 6: Histograms and KDE for Toeplitz matrix ESD, \(n=50\) (left) and \(n=500\) (right)

Fig. 7: Blowup of the standard normal density (dashed line) and the KDE (solid line) for the Toeplitz matrix ESD around the center 0 and around ±2, for \(n=1000\) with 5000 replications

Fig. 8: Q–Q plot of standard normal versus quantiles of the ESD for the Toeplitz matrix, for \(n=1000\) with 5000 replications

Fig. 9: 99.95th percentile of the Toeplitz matrix ESD, \(n=1000\) with 5000 replications

Simulations from the Toeplitz matrix. We generated random Toeplitz matrices where the \(\{x_i\}\) are i.i.d. standard normal variables. As before, several different choices of the dimension n and the kernel K were explored. We may summarize the findings of the simulations as follows:

  1. 1.

    In Fig. 6, we present the histograms along with the KDE (solid line) for \(n=50\) (left) and \(n=500\) (right). For comparison, the standard normal density has been overlaid (dashed line). It is known that the Toeplitz limit is not normal. Clear departure from normality is visible in the peak. The Q–Q plot in Fig. 8 also demonstrates the departure from normality.

  2. 2.

    The blowup of the KDE around 0 and around ±2 (solid line), along with the standard normal density (dashed line), is shown in Fig. 7. This clearly shows the lower peak of the Toeplitz limit. Quite interestingly, the Toeplitz density appears to have two crossings with the normal density in the range \(x > 0\) (Fig. 8).

  3. 3.

    It is known that the moments of the Toeplitz limit are bounded by the corresponding Gaussian moments. Recall that the fourth moment of the standard normal is 3; in contrast, the fourth moment of the Toeplitz limit is known to be 8/3. No closed-form formula is known for the general moments. In Fig. 9, we explore the behavior of the tail of the ESD. Recall that the 99.95th percentile of the standard normal is approximately 3.29. We obtained three estimates of this percentile for the Toeplitz limit: the estimate based on the ESD, and estimates based on the KDE with \(h=n^{-1/2}\) and with \(h=n^{-1/8}\). All the estimates clearly indicate that the value of this percentile for the Toeplitz limit is much less than 3.

Some Questions

From the plots, it appears that for the choice \(h=n^{-1/8}\), the rate bounds given in Theorems 5 and 6 are sharp. It would be interesting to settle this theoretically. On the other hand, the choice \(h=n^{-1/2}\) worked best in our simulations; it would be interesting to see whether the actual rate of convergence of \(\hat{f}\) to f is better than \(n^{-1/8}\) with this choice. It would also be interesting to study the behavior of appropriately scaled \(\hat{f}(x)-f(x)\), for each fixed x, as a process in x, and the supremum of this process.

It appears that, for the Toeplitz matrix, identification of the density f, smoothness properties of f, and rate of convergence results are all quite challenging. It would be interesting to settle whether the density of the Toeplitz limit does have a smaller peak than the normal density, and whether it crosses the normal density at two (or more) points for \(x >0\).

The referee has pointed out that better rates of convergence may be achievable by sharpening the existing methods, in particular the methods of Chatterjee and Bose [7].

As far as the rates of convergence of the EDF and the expected EDF are concerned, the existing results in the literature are possibly the sharpest achievable, and we have quoted some of them earlier. However, when it comes to the convergence rate of the KDE, we have worked with a slightly improved version of Chatterjee and Bose [7], and we acknowledge that there is scope for further improvement. We hope to pursue this issue in the future. We thank the referee for raising these issues.

References

  1. Bai ZD (1993) Convergence rate of expected spectral distributions of large random matrices. I. Wigner matrices. Ann Probab 21(2):625–648
  2. Bai ZD, Miao B, Tsay J (1997) A note on the convergence rate of the spectral distributions of large random matrices. Stat Probab Lett 34(1):95–101
  3. Bai ZD, Miao B, Yao JF (2003) Convergence rates of the spectral distributions of large sample covariance matrices. SIAM J Matrix Anal Appl 25(1):105–127
  4. Bose A (2018) Patterned random matrices. Chapman & Hall, Boca Raton
  5. Bryc W, Dembo A, Jiang T (2006) Spectral measure of large random Hankel, Markov and Toeplitz matrices. Ann Probab 34(1):1–38
  6. Burkholder DL (1966) Martingale transforms. Ann Math Stat 37:1494–1504
  7. Chatterjee S, Bose A (2004) A new method for bounding rates of convergence of empirical spectral distributions. J Theor Probab 17(4):1003–1019
  8. Devroye L (1991) Exponential inequalities in nonparametric estimation. In: Nonparametric functional estimation and related topics (Spetses, 1990), 31–44. NATO Adv Sci Inst Ser C Math Phys Sci 335. Kluwer Acad Publ, Dordrecht
  9. Efron B, Stein C (1981) The jackknife estimate of variance. Ann Stat 9(3):586–596
  10. Feller W (1966) An introduction to probability theory and its applications, vol II, 2nd edn. Wiley, New York
  11. Götze F, Tikhomirov AN (2007) The rate of convergence of spectra of sample covariance matrices. Theory Probab Appl 54(1):129–140
  12. Györfi L, Kohler M, Krzyżak A, Walk H (2002) A distribution-free theory of nonparametric regression. Springer, New York
  13. Hammond C, Miller SJ (2005) Distribution of eigenvalues for the ensemble of real symmetric Toeplitz matrices. J Theor Probab 18(3):537–566
  14. Sen A, Virág B (2011) Absolute continuity of the limiting eigenvalue distribution of the random Toeplitz matrix. Electron Commun Probab 16:706–711
  15. Steele JM (1986) An Efron–Stein inequality for nonsymmetric statistics. Ann Stat 14(2):753–758


Author information


Corresponding author

Correspondence to Arup Bose.

Ethics declarations

Conflicts of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information


Arup Bose: Research supported by J.C. Bose National Fellowship, Department of Science and Technology, Government of India.

This article is part of the topical collection “Celebrating the Centenary of Professor C. R. Rao” guest edited by Ravi Khattree, Sreenivasa Rao Jammalamadaka, and M. B. Rao.


Cite this article

Bose, A., Bhattacharjee, M. Kernel Density Estimates in a Non-standard Situation. J Stat Theory Pract 15, 22 (2021). https://doi.org/10.1007/s42519-020-00161-0


Keywords

  • Kernel density estimate
  • Efron–Stein inequality
  • Esseen’s lemma
  • Eigenvalues
  • Limiting spectral distribution
  • Wigner matrix
  • Sample variance covariance matrix
  • Toeplitz matrix
  • Box plot

Mathematics Subject Classification

  • Primary 62G07
  • Secondary 60F99
  • 15B52
  • 60B20