Abstract
The kernel density estimate is an integral part of the statistical toolbox. It has been widely studied and is very well understood in situations where the observations \(\{x_i\}\) are i.i.d., or form a stationary process with some weak dependence. However, there are situations where these conditions do not hold. For instance, while the eigenvalue distribution of large-dimensional random matrices converges, the eigenvalues themselves are highly correlated for most common random matrix models. Suppose \(\{F_n\}\) is a sequence of empirical distribution functions (usually random) which converges weakly, in some probabilistic sense, to a nonrandom distribution function F with density f. We show that under mild conditions on the kernel K and the limit density f, the kernel density estimate \(\hat{f}\) based on \(F_n\) converges to f in suitable probabilistic senses. This demonstrates the robustness of the kernel density estimate. We show how the rate of convergence of \(\hat{f}\) to f can be linked to the rate of convergence of \(F_n\) and \({\mathrm{E}}(F_n)\) to F. Using these general results, we establish the consistency of the kernel density estimates, including upper bounds on the rate of convergence, for two popular random matrix models. We also provide a few simulations to demonstrate these results and conclude with a few open questions.
Introduction
Suppose \(F_n\) is a distribution function and K is a suitable nonnegative real-valued function with \(\int _{-\infty }^{\infty } K(t)\,{\mathrm {d}}t=1\). Then, K is called a kernel function. Define (all integrals are over the range \((-\infty , \ \infty )\))
Note that \(\hat{f} \ge 0\) and \(\int _{-\infty }^{\infty }\hat{f}(u) {\mathrm {d}}u =1\), so that \(\hat{f}\) is a probability density function.
Often \(F_n\) is the empirical distribution which puts mass \(n^{-1}\) at each of \(x_1, x_2, \ldots , x_n\), which are observations from some statistical model, so that
In that case \(\hat{f}\) is a random function and
In particular, if \(\{x_i\}\) is an i.i.d. sequence, or more generally, a strictly stationary sequence with marginal density f, then \(\hat{f}\) serves as an estimate of f and hence is called a kernel density estimate (KDE) of f with bandwidth h. The KDE is an integral part of the statistical toolbox and is very well understood in these situations. If K satisfies some mild restrictions, then \(\hat{f}\) has very nice properties; in particular, as \(n \rightarrow \infty \), it is a consistent estimate of f in various probabilistic senses when h is chosen appropriately.
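As a concrete illustration of the definition above, here is a minimal sketch of the KDE \(\hat{f}(x) = (nh)^{-1}\sum_i K((x-x_i)/h)\) in Python, using the Gaussian kernel. The function names are ours, chosen for illustration only.

```python
import math

def gaussian_kernel(t):
    # The standard normal density, a common choice of kernel K.
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def kde(x, data, h, K=gaussian_kernel):
    # f_hat(x) = (nh)^{-1} * sum_i K((x - x_i)/h), the KDE at the point x.
    n = len(data)
    return sum(K((x - xi) / h) for xi in data) / (n * h)
```

Since K integrates to 1, so does \(x \mapsto \texttt{kde}(x, \cdot, h)\), matching the remark that \(\hat f\) is itself a probability density.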
Now suppose that the sequence \(\{F_n\}\), random or nonrandom, converges (in some probabilistic sense when they are random) to a nonrandom distribution function F which has a density f. Then, it is natural to consider \(\hat{f}(x)\) as an estimate of f(x) and ask what properties this estimate has. Note that we are not assuming that the \(\{x_i\}\) have any independence or weak dependence structure.
We first show that under mild conditions on K and f, \(\hat{f}\) converges to f (in some probabilistic sense) when a proper bandwidth is chosen. We also relate the rate of convergence of \(\hat{f}\) to f, to the rate of convergence of \(F_n\) and \({\mathrm{E}}(F_n)\) to F. In particular, this means that the KDE is very robust—more or less whenever the empirical distribution converges, the corresponding KDE is consistent provided h is chosen suitably.
We then provide three examples from the eigenvalue distribution of random matrices. It is known that the eigenvalues of the most popular random matrices are highly correlated. Nevertheless, KDE is routinely used. Suppose \(A_n\) is a sequence of (symmetric) random matrices of increasing order. Let \(\lambda ^{(n)}_1, \lambda ^{(n)}_2, \ldots , \lambda ^{(n)}_n\) be the (random) eigenvalues, and let \(F_n\) be the corresponding empirical spectral distribution (ESD) function which puts mass \(n^{-1}\) at each of the eigenvalues. Then, the corresponding KDE is given by
For many matrix models, the ESD \(F_n\) converges almost surely to a nonrandom limiting spectral distribution (LSD) F, which has density f. There is a host of existing results on the rate of convergence of \(F_n\) to F for specific matrices. We use these results and their suitable modifications to establish the consistency of the KDE in two matrix models, the Wigner matrix and the sample covariance matrix. We also provide a few simulations to demonstrate these results and end with a few open questions, especially for the random Toeplitz matrix.
Main Results
General Kernel Density Estimate
For any real-valued function g on \(\mathbb {R}\), its total variation \(\mathcal {T}(g)\) is defined as
where \(\{t_j\}\) is a partition of \((-\infty , \ \infty )\) and the supremum is over all such partitions. The function g is said to be of bounded variation if \(\mathcal {T}(g) < \infty \).
For two distribution functions F and G on \(\mathbb {R}\), their Kolmogorov distance is defined as
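For an empirical distribution function against a continuous F, the Kolmogorov distance can be computed exactly, since the supremum is attained at the jump points of the empirical CDF. A small illustrative sketch (our own naming):

```python
def kolmogorov_distance(sample, F):
    # sup_x |F_n(x) - F(x)| between the empirical distribution function of
    # `sample` and a continuous distribution function F.  For continuous F
    # the supremum is attained at the sample points, where F_n jumps, so we
    # compare F with both the left and right limits of F_n there.
    xs = sorted(sample)
    n = len(xs)
    return max(max(abs((i + 1) / n - F(x)), abs(i / n - F(x)))
               for i, x in enumerate(xs))
```

For example, a single observation at 0.5 against the uniform CDF on [0, 1] gives distance 0.5.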
The kernel K is always assumed to be nonnegative with \(\int _{-\infty }^{\infty } K(t){\mathrm {d}}t=1\).
Our first theorem is on the asymptotic unbiasedness of the KDE.
Theorem 1
Suppose K is a kernel of bounded variation and \(\int _{-\infty }^{\infty } |u|K(u){\mathrm {d}}u < \infty \). Let \(F_n\) be a sequence of random distribution functions on \(\mathbb {R}\). Let F be a (nonrandom) distribution function with density f.

(i)
If f is Lipschitz continuous at x, then
$$\begin{aligned} {\mathrm{E}}(\hat{f}(x))-f(x)= O \left( \dfrac{1}{h}\Vert {\mathrm{E}}F_n - F\Vert _{\infty }+ h \right) . \end{aligned}$$(3)Hence, if \(\Vert {\mathrm{E}}(F_n) - F\Vert _{\infty } \rightarrow 0\), and \(h \rightarrow 0\) such that \([\Vert {\mathrm{E}}(F_n) - F\Vert _{\infty }]^{-1} h \rightarrow \infty \), then
$$\begin{aligned} \lim _{n \rightarrow \infty } {\mathrm{E}}(\hat{f}(x)) = f(x). \end{aligned}$$(4) 
(ii)
If the Lipschitz continuity in (i) above holds uniformly in x, then the convergences in (3) and (4) above are also uniform over x.
Proof
Since f is Lipschitz continuous at x, there exists a constant C (depending on x) such that for all t, \(|f(x)-f(t)|\le C|x-t|\). Using this,
All conclusions now follow directly from the above. \(\square \)
The next theorem, in particular, implies the pointwise consistency of the KDE.
Theorem 2
Let \(F_n\) be a sequence of random distribution functions on \(\mathbb {R}\). Suppose F is a (nonrandom) distribution function with density f. Suppose K is a kernel of bounded variation and \(\int |u| K(u){\mathrm {d}}u < \infty \).

(i)
If f is Lipschitz continuous at x, then for any \(p \ge 1\),
$$\begin{aligned} {{\,\mathrm{E}\,}}[|\hat{f}(x) - f(x)|^p] =O\left( \dfrac{{{\,\mathrm{E}\,}}[\Vert F_n - F\Vert _{\infty }^p]}{h^p} + h^p\right) . \end{aligned}$$(5)Hence, if \(h \rightarrow 0\) such that \(\left[ {{\,\mathrm{E}\,}}(\Vert F_n - F\Vert _{\infty }^p)\right] ^{-1/p}h \rightarrow \infty \), then \({{\,\mathrm{E}\,}}|\hat{f}(x) - f(x)|^p \rightarrow 0\), which in turn implies \(|\hat{f}(x) - f(x)|\rightarrow 0\) in probability.

(ii)
If the Lipschitz continuity above holds uniformly in x, then the conclusions above also hold uniformly in x.
Proof
The conclusions follow easily from the above. \(\square \)
The above result is useful only if we can estimate the quantity \({{\,\mathrm{E}\,}}[\Vert F_n F\Vert _{\infty }^p]\). We investigate this issue in the next section.
Bound for the Distribution Function
To apply Theorems 1 and 2, we need bounds on \(\Vert {{\,\mathrm{E}\,}}(F_n)-F\Vert _{\infty }\) and \({{\,\mathrm{E}\,}}[\Vert F_n-F\Vert _{\infty }^p]\). The next theorem connects the two in a general way.
For any distribution function G (possibly random), its (random) characteristic function is defined as
Note that if G is a random distribution function, then \(\varphi _G\) is also random.
The convolution \(F*G\) of two distributions F and G is defined as
The next result is an extension of a result of Chatterjee and Bose [7] who considered the case \(p=2\).
Theorem 3
Suppose F is a random distribution function on \(\mathbb {R}\) with (random) characteristic function \(\varphi \). Let \(p \ge 2\). Suppose that for each t,
If G is a nonrandom distribution function on \(\mathbb {R}\) such that \(\sup _{x\in {\mathbb {R}}}G^\prime (x)\le \lambda \), then with the same C given above,
where K depends only on p.
Proof
The proof we give below parallels the proof given in Chatterjee and Bose [7] for \(p=2\).
Let
Then, \(F_0\) is a distribution function with characteristic function
Define the probability density on \(\mathbb {R}\)
Let \(H_L\) be the corresponding distribution function. The characteristic function of \(H_L\) is given by
By Esseen’s lemma (see Feller (1966, page 510)[10]),
Now
By applying the inversion formula (see Feller (1966, pages 482–484) [10]), we have
Hence, using hypothesis (6), and noting that \(\int _{-\infty }^\infty \psi _L(t){\mathrm {d}}t = L\), by Jensen’s inequality,
Combining all these facts and noting that for \(p \ge 1\), \(|a+b+c|^p \le 3^p(|a|^p + |b|^p +|c|^p)\), we have
Choosing \(L \sim (C\lambda )^{1/2}\) gives the desired conclusion. \(\square \)
The crux of the above result is of course the bound assumed in condition (6). The random quantity \(\varphi _{F}(t)\) can be viewed as a nonlinear function of the underlying data generating process. So, the problem boils down to getting variance-type bounds for nonlinear functions of random variables. We achieve these bounds by the so-called Efron–Stein inequalities. With more restrictive assumptions on the underlying random variables, sharpening of these bounds may be possible. For further information on Efron–Stein inequalities, see Efron and Stein [9], Steele [15], Devroye [8], and Györfi et al. [12].
The next result is an extension of a well-known result for the case \(p=2\).
Theorem 4
Suppose \(f(x_1, x_2, \ldots , x_n)\) is a coordinatewise differentiable map from \(\mathbb {R}^n\) into \(\mathbb {C}\) and \(M_j\), \(1\le j \le n\) are constants such that
Let \(p\ge 2.\) Then, there is an absolute constant K, depending only on p, such that for any independent random variables \(X_1, X_2, \ldots , X_n,\)
where \(\{X_k^*\}\) denotes an independent copy of \(\{X_k\}\).
In particular, if \( X_1, X_2, \ldots , X_n\) are i.i.d. and \(M_j = M\) for all j, then
Proof
Define the \(\sigma \)-fields \( \mathcal {F}_k = \sigma (X_1, X_2,\ldots , X_k).\) Clearly, \(\{\mathcal {F}_k\}_{k \ge 0} \) forms a filtration. Denote by \({{\,\mathrm{E}\,}}_k\) the conditional expectation given \( \mathcal {F}_k\). Then,
where
Since \(X_k^*\) is distributed as \(X_k\) and is independent of \(\{ X_1, X_2, \ldots , X_n\}\),
Therefore, we have,
Observe that \(\{ \gamma _k\} \) forms a complex martingale difference sequence. By the Burkholder inequality (see Burkholder [6]),
By using Minkowski’s inequality on the right-hand side of the above,
By Jensen’s inequality,
and the theorem follows. \(\square \)
Kernel Density Estimate for Random Matrices
Now suppose that \(A_n\) is a sequence of (symmetric) random matrices of increasing order with eigenvalues \(\lambda ^{(n)}_1, \lambda ^{(n)}_2, \ldots , \lambda ^{(n)}_n\). Let
be the corresponding empirical spectral distribution (ESD) function.
In the random matrix literature, for various types of matrices, \(F_n\) converges to some nonrandom distribution F, either almost surely or in probability. Then, F is called the limiting spectral distribution (LSD) of the sequence of matrices. See Bose [4] for common examples of patterned random matrices where the LSD exists. Often this LSD has a bounded density, say f. Note that if the above convergence holds, then the nonrandom distribution function \({{\,\mathrm{E}\,}}(F_n)\) also converges to F.
At the same time, for visualization, when these matrices are simulated, a KDE is routinely fitted to the eigenvalues. We demonstrate two examples of random matrices, namely the Wigner matrix and the sample covariance matrix, where consistency of the KDE follows from our results and thus validates fitting a KDE to the eigenvalues. Simulations from these matrices along with some performance measures are also provided. We also discuss the random Toeplitz matrix, where the LSD is known to have a density. However, neither any further properties of the limit density are known, nor any rate result for the convergence of the ESD to the LSD is known.
Wigner Matrix
The semicircle distribution F has the density
Figure 1 gives a plot of this density.
It arises as the density of the limit spectral distribution of the Wigner matrix (see below). Note that f is bounded by \(\pi ^{-1}\).
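The semicircle density and its bound \(\pi^{-1}\) (attained at 0) can be checked numerically. The sketch below uses the standard formula \(f(x)=\sqrt{4-x^2}/(2\pi)\) on \([-2, 2]\):

```python
import math

def semicircle_density(x):
    # Density of the standard semicircle law: sqrt(4 - x^2) / (2*pi) on [-2, 2].
    if abs(x) > 2:
        return 0.0
    return math.sqrt(4.0 - x * x) / (2.0 * math.pi)
```

Evaluating at 0 gives \(f(0)=2/(2\pi)=\pi^{-1}\), the maximum of the density.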
The Wigner matrix \(W_n =((w_{jk}^{(n)}))\) is a symmetric matrix with random independent entries on and above the diagonal. We shall drop the superscript n for ease of notation. Let \(F_n\) be the ESD of \(n^{-1/2}W_n\).
Under different sets of conditions, \(F_n\) converges weakly, either almost surely or in probability, to F. See Bose [4] for a discussion on the different sets of conditions. We shall work with the following assumption:
 (W1):

\({{\,\mathrm{E}\,}}(w_{jk})= 0\), \({{\,\mathrm{E}\,}}(w_{jk}^2)=1\) for all j, k.
 (W2):

\(\sup _{i, j, n} {{\,\mathrm{E}\,}}[w_{ij}^4] < \infty .\)
 (W3):

\(\displaystyle {\sum _{1\le i, j \le n}} {{\,\mathrm{E}\,}}\left( w_{ij}^4 \mathbb {I}_{\{|w_{ij}| \ge \epsilon n^{1/2}\}}\right) = o(n^2) \text{ for } \text{ any } \epsilon > 0.\)
Bai [1] showed that if (W1)–(W3) are satisfied, then
We now explain how we can apply Theorem 2 to establish the consistency of the KDE together with a rate of convergence.
Let us fix some \(t \in \mathbb {R}\) and let \(\varphi _n(t)\) be the characteristic function of the ESD of \(n^{-1/2}W_n\) evaluated at t. Chatterjee and Bose [7] showed that
So by Theorem 4, noting that the Wigner matrix involves \(O(n^2)\) independent random variables, we have the following inequality regarding the characteristic function of the ESD of \(n^{-1/2}W_n\) with i.i.d. entries with finite pth moment (below K is an absolute constant, depending only on p).
Using (8) and Theorem 3, we get for \(p \ge 2\),
Observe that the first term is of a smaller order than the second term above. We can use this result in Theorem 2 and obtain the following theorem. Suppose \(\hat{f}\) is a kernel density estimate constructed from the eigenvalues of \(n^{1/2}W_n\) where the kernel K satisfies the usual conditions.
Theorem 5
Suppose the Wigner matrix satisfies the conditions (W1)–(W3) given above. Then, for any compact subset C which excludes \(\pm 2\),
The above bound continues to hold for any \(p> 4\) provided \(\sup _{i, j, n} {{\,\mathrm{E}\,}}[|w_{ij}|^p] < \infty \).
In particular, \(\displaystyle {\sup _{x\in C}}|\hat{f}(x)-f(x)|\rightarrow 0\) in probability if \(h\rightarrow 0\) and \(hn^{1/4}\rightarrow \infty \).
If we optimize the upper bound \(O(\dfrac{1}{n^{1/4} h}+h)\), then we obtain \(h\cong n^{-1/8}\) and the upper bound is of the order \(n^{-1/8}\). We are not claiming that this rate is the optimal rate of convergence of \(\big [{{\,\mathrm{E}\,}}[\displaystyle {\sup _{x\in C}} |\hat{f}(x) - f(x)|^p]\big ]^{1/p}\) or that this bandwidth has any optimality properties.
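The consistency statement can be illustrated numerically. The sketch below (our own setup, not the authors' simulation code) builds a \(\pm 1\) Wigner matrix, forms the Gaussian-kernel KDE of the ESD of \(n^{-1/2}W_n\) with the bandwidth \(h = n^{-1/8}\) obtained above, and records the sup-error against the semicircle density over a compact set away from \(\pm 2\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
# Symmetric Wigner matrix with i.i.d. +/-1 entries on and above the diagonal;
# these entries satisfy conditions (W1)-(W3).
G = rng.choice([-1.0, 1.0], size=(n, n))
W = np.triu(G) + np.triu(G, 1).T
eigs = np.linalg.eigvalsh(W / np.sqrt(n))

h = n ** (-1.0 / 8.0)  # bandwidth from optimizing the upper bound

def f_hat(x):
    # Gaussian-kernel KDE of the empirical spectral distribution.
    return np.mean(np.exp(-0.5 * ((x - eigs) / h) ** 2)) / (h * np.sqrt(2.0 * np.pi))

def f(x):
    # Semicircle density, the limit.
    return np.sqrt(max(4.0 - x * x, 0.0)) / (2.0 * np.pi)

grid = np.linspace(-1.95, 1.95, 79)
sup_err = max(abs(f_hat(x) - f(x)) for x in grid)
```

Even at this moderate dimension, the sup-error over the compact set is small, consistent with Theorem 5.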
Simulations for the Wigner matrix We considered the Wigner matrix with i.i.d. entries, with several different choices of the dimension n, the distribution of the entries, and the kernel K. We took \(C=[-2+0.05, \ 2-0.05]\). We summarize below the findings of the simulations:

1.
Figure 2 presents the histograms along with the KDE, which is in striking agreement with the semicircle density, even for \(n=50\). The entries were chosen as i.i.d. Bernoulli ±1 with probability 1/2 each. Similar results were obtained for other choices of the underlying distributions. Recall that the optimal bandwidth obtained by minimizing the upper bound is \(h=n^{-1/8}\). We used this bandwidth and the Gaussian kernel. The typical number of replications was 500. We also tried other bandwidths and kernels. The Gaussian kernel and the bandwidth \(h=n^{-1/2}\) gave the best result.

2.
Our theoretical results presented an upper bound for the rate of convergence. This tells us that with the choice \(h=n^{-1/8}\), we have \(\big [{{\,\mathrm{E}\,}}[\displaystyle {\sup _{x\in C}} |\hat{f}(x) - f(x)|^p] \big ]^{1/p} = O(n^{-1/8})\). The top right panel of Fig. 3 gives the box plots of the random variable \(X_{n, h} = \sup _{x\in C} |\hat{f}(x) - f(x)|\). There is only a marginal difference in the outcome as the kernel varies. It would be interesting to investigate whether, after proper centering and scaling, \(X_{n,h}\) converges in distribution.

3.
To see how \(\hat{f}(x) - f(x)\) varies with x, and remembering the upper bound \(n^{-1/8}\), we considered \(n^{1/8} (\hat{f}(x) - f(x))\) for a range of values of x, fixing n at 500. In the left panel of Fig. 3, we present the Q–Q plot for a selection of x values. At each x, the distribution appears to be normal, but with different variances.

4.
In Theorem 5, with the bandwidth \(h=n^{-1/8}\), the rate of convergence is \(n^{-1/8}\). Of course, it is possible that these upper bound rates are conservative as a function of n. We investigated this by simulation. We took the average over repeated sampling of \(\displaystyle {\sup _{x\in C}} |\hat{f}(x) - f(x)|^p\) (\(=s_{n,p}\), say) for a sequence of choices of n and p: \(n=50(50)1000\) and \(p=0.25(0.25)5\). The bottom right panel of Fig. 3 presents the slopes of the regression of \(\log (s_{n,p})\) on \(\log n\) for these sequences of values of n and p, based on 50 replicates. This plot is linear with approximate slope \(-1/8\), leading us to believe that with the choice \(h=n^{-1/8}\), the upper bound rate of \(n^{-1/8}\) in Theorem 5 is actually sharp.
Sample Covariance Matrix
The Marčenko–Pastur distribution \(F_y(\cdot )\) with parameter \(y \le 1\) arises as the LSD of the sample covariance matrix. It has a continuous density
where
We show that the density is bounded. Write
It is easily checked that
The solution of \(g^{\prime }(x)=0\) is \(x_0=\dfrac{2ab}{a+b}\). Note that \(a\le x_0\le b\). It is clear from (11) that g is increasing for \(a\le x\le x_0\) and decreasing for \(x_0\le x\le b\). Thus, g, and hence f, attains its maximum at \(x_0\). Hence,
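The maximizer \(x_0=2ab/(a+b)\) and the boundedness of the density can be checked numerically. The sketch below (illustrative, with \(y=0.25\)) uses the standard Marčenko–Pastur density formula \(f(x)=\sqrt{(b-x)(x-a)}/(2\pi yx)\) on \([a, b]\):

```python
import math

def mp_density(x, y):
    # Marchenko-Pastur density with parameter y in (0, 1):
    # f(x) = sqrt((b - x)(x - a)) / (2*pi*y*x) on [a, b], 0 outside.
    a = (1.0 - math.sqrt(y)) ** 2
    b = (1.0 + math.sqrt(y)) ** 2
    if x <= a or x >= b:
        return 0.0
    return math.sqrt((b - x) * (x - a)) / (2.0 * math.pi * y * x)

y = 0.25
a = (1.0 - math.sqrt(y)) ** 2   # = 0.25
b = (1.0 + math.sqrt(y)) ** 2   # = 2.25
x0 = 2.0 * a * b / (a + b)      # claimed maximizer; here 0.45
```

A grid search over \([a, b]\) confirms that the density is maximized at \(x_0\), in agreement with the calculus argument above.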
This distribution is also defined for \(y > 1\), but then there is an extra point mass of \(1-y^{-1}\) at 0, and hence this case is excluded from our discussion.
Suppose X is a real \(p_n\times n\) matrix with entries \(x_{jk}\), which are i.i.d. real random variables with mean zero and unit variance. Let \(S_n = \dfrac{1}{n}XX^{\top }\). We shall write S for \(S_n\). This is the unadjusted sample covariance matrix when we treat the n columns of X as observations on a vector of dimension \(p_n\). It is known that if the fourth moment of \(x_{jk}\) is finite and \(p_n\rightarrow \infty , y_n=p_n/n \longrightarrow y \in (0, \infty )\), then the LSD of S is \(F_y\) almost surely. See Bose [4] for a detailed proof and also for the case \(y=0\). We focus only on the case \(y \in (0, 1)\) to avoid point masses and other rate of convergence issues.
As in the case of Wigner matrices, several rate results are known; see, for example, Bai, Miao and Tsay [2], Bai, Miao and Yao [3], and Götze and Tikhomirov [11]. In particular, it follows from Bai, Miao and Yao [3] that if \(y_n\) remains bounded away from 1 and all random variables are uniformly bounded by \(M < \infty \), then
To keep things simple, we shall assume this boundedness condition. One can replace this strong assumption by weaker conditions to obtain results weaker than what we present below. See also the above references and Chatterjee and Bose [7].
Now consider S as a function of the entries of X. Clearly,
where \(Y = \partial X / \partial x_{jk}\). Now the matrix Y has 1 at the (j, k)th position and 0 elsewhere, i.e., \(Y = e_je_k^{\top }\) where \(e_j\) is the vector with 1 at the jth position and 0 elsewhere. Thus, if \(x_{\cdot k}\) denotes the kth column of X and \(\varphi _n(t)\) denotes the characteristic function of the ESD of \(S_n\) evaluated at t, then
where we have written \(z_{kj}\) for the jth component of the vector \(z_k := e^{itS}x_{\cdot k}\). Note that since \(e^{itS}\) is unitary, \(\Vert z_k\Vert = \Vert x_{\cdot k}\Vert \).
Hence, using the boundedness of the variables,
So by Theorem 4, noting that the S matrix involves \(O(np_n)\) independent random variables, we have the following inequality regarding the characteristic function of the ESD of S when X has i.i.d. entries with finite pth moment (below K is an absolute constant, depending only on p).
Using (14) and Theorem 3, we get for \(p \ge 2\),
for some constant K which depends only on p. Observe that since (12) holds, the first term on the right side of the above inequality is of much smaller order than the second term. As a consequence, using (13) and Theorem 2, we obtain the following result.
Suppose \(\hat{f}\) is the kernel density estimate constructed from the eigenvalues of the matrix \(S_n=n^{-1}XX^{\top }\) where the kernel K satisfies the usual conditions.
Theorem 6
Suppose the entries \(\{x_{i,j}\}\) of X are independent with mean 0 and variance 1 and are uniformly bounded by M. Suppose \(p\rightarrow \infty \) and \(p/n \rightarrow y \in (0, \ 1)\). Then, for any compact subset C which excludes \(a = a(y) = (1  \sqrt{y})^2\) and \(b = b(y) = (1+\sqrt{y})^2\),
In particular, \(\sup _{x\in C}|\hat{f}(x)-f(x)|\rightarrow 0\) in probability if \(h\rightarrow 0\) and \(hn^{1/4}\rightarrow \infty \).
If we optimize the upper bound \(O(\dfrac{1}{n^{1/4} h}+h)\), then the optimal bandwidth is given by \(h\cong n^{-1/8}\), with the corresponding rate of convergence \(n^{-1/8}\).
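As an illustration of Theorem 6, the following sketch (our own setup, not the authors' simulation code) simulates an unadjusted sample covariance matrix with \(y=p/n=0.25\) and evaluates the Gaussian-kernel KDE of its ESD near the Marčenko–Pastur mode; purely for illustration, we use the smaller bandwidth \(h=n^{-1/2}\):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 800, 200                      # y = p/n = 0.25
X = rng.standard_normal((p, n))
S = X @ X.T / n                      # unadjusted sample covariance matrix
eigs = np.linalg.eigvalsh(S)

y = p / n
a = (1.0 - y ** 0.5) ** 2            # left edge of the support
b = (1.0 + y ** 0.5) ** 2            # right edge of the support

h = n ** (-0.5)                      # small bandwidth, chosen for illustration

def f_hat(x):
    # Gaussian-kernel KDE of the ESD of S.
    return np.mean(np.exp(-0.5 * ((x - eigs) / h) ** 2)) / (h * np.sqrt(2.0 * np.pi))
```

The eigenvalues concentrate on \([a, b]\), and the KDE near the mode \(x_0 = 2ab/(a+b)\) is close to the limiting density value there.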
Simulations for the sample covariance matrix Even though our theorem assumes boundedness of the entries, we decided to explore the unbounded case. So, we let the entries of \(X_{p_n\times n}\) be i.i.d. standard normal. Several different choices of the dimension n, the kernel K and \(p_n/n=y\) were explored. A selection of our findings is presented in Figs. 4 and 5. As before, we used the bandwidth \(h=n^{-1/8}\), obtained by minimizing the upper bound. The typical number of replications was 500. We took \(C=[a+0.05, \ b-0.05]\). We may summarize the findings of the simulations as follows:

1.
Figure 4 gives the histograms, the KDE of the ESD for \(n=50\) (top panel) and \(n=500\) (bottom panel) and the Marčenko–Pastur density. The KDE is in agreement with the limit density, even for dimension as low as \(n=50\) and \(y=0.25\).

2.
The left panel of Fig. 5 gives the box plots of the random variable \(Y_{n,h}=\sup _{x\in C} |\hat{f}(x) - f(x)|\). The effect of the choice of the kernel is even smaller than in the Wigner case. The middle panel gives the Q–Q plot versus the normal quantiles, and this appears to show reasonable agreement.

3.
To explore the sharpness of the bound \(n^{-1/8}\), we took the average over repeated sampling of \(\displaystyle {\sup _{x\in C}} |\hat{f}(x) - f(x)|^p\) (\(=s_{n,p}\), say) for a sequence of choices of n and p: \(n=50(50)1000\) and \(p=0.25(0.25)5\). The right panel of Fig. 5 presents the slopes of the regression of \(\log (s_{n,p})\) on \(\log n\) for these sequences of values of n and p, based on 50 replicates. This plot too is linear, as in the Wigner case.
Toeplitz Matrix
The \(n\times n\) matrix \(T_n = ((x_{|i-j|}))\) is called a symmetric Toeplitz matrix of order n. Let \(F_n\) be the ESD of \(n^{-1/2}T_n\). It is known that if the \(\{x_i\}\) are i.i.d. with mean 0 and variance 1, then \(F_n\) converges almost surely to a nonrandom F, say (see Bryc, Dembo and Jiang [5] and Hammond and Miller [13]). See Bose [4] for a detailed proof. While the form of F is unknown, it is known that the distribution is symmetric about 0 and has variance 1. Sen and Virág [14] proved that F has a bounded density, say f. No further smoothness properties of f appear to be known. Under suitable conditions on the entries, it follows from Chatterjee and Bose [7] that
One may also be able to sharpen this result using the arguments of the present article. Unfortunately, no rate results are yet known for the quantity \(\Vert {{\,\mathrm{E}\,}}(F_n)  F\Vert _{\infty }\).
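The known facts about the Toeplitz limit (symmetry about 0, variance 1, fourth moment 8/3 as stated below) can be checked against a simulated ESD. A sketch under our own setup:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x = rng.standard_normal(n)                       # x_0, x_1, ..., x_{n-1}
idx = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
T = x[idx]                                       # symmetric Toeplitz: T[i, j] = x_{|i-j|}
eigs = np.linalg.eigvalsh(T / np.sqrt(n))

# Empirical moments of the ESD; the limit has mean 0, variance 1, and
# fourth moment 8/3, strictly less than the Gaussian value 3.
m2 = float(np.mean(eigs ** 2))
m4 = float(np.mean(eigs ** 4))
```

At moderate n, the empirical second and fourth moments are already close to the limiting values 1 and 8/3.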
Simulations from the Toeplitz matrix We generated random Toeplitz matrices where the \(\{x_i\}\) are i.i.d. standard normal variables. As before, several different choices of the dimension n and the kernel K were explored. We may summarize the findings of the simulations as follows:

1.
In Fig. 6, we present the histograms along with the KDE (solid line) for \(n=50\) (left) and \(n=500\) (right). For comparison, the standard normal density has been overlaid (dashed line). It is known that the Toeplitz limit is not normal. Clear departure from normality is visible in the peak. The Q–Q plot in Fig. 8 also demonstrates the departure from normality.

2.
The blow-up of the KDE around 0 and around ±2 (solid line), along with the standard normal density (dashed line), is shown in Fig. 7. This clearly shows the lower peak of the Toeplitz limit. Quite interestingly, the Toeplitz density appears to have two crossings with the normal density in the range \(x > 0\) (Fig. 8).

3.
It is known that the moments of the Toeplitz limit are bounded by the Gaussian moments. Recall that the fourth moment of the standard normal is 3. In contrast, it is known that the fourth moment of the Toeplitz limit is 8/3. No closed-form formula is known for the general moments. In Fig. 9, we explore the behavior of the tail of the ESD. Recall that the 99.95th percentile of the standard normal is approximately 3. We obtain three estimates of this percentile for the Toeplitz limit: the estimate based on the ESD, and estimates based on the KDE with \(h=n^{-1/2}\) and \(h=n^{-1/8}\). All estimates clearly indicate that the value of this percentile for the Toeplitz limit is much less than 3.
Some Questions
From the plots, it appears that for the choice \(h=n^{-1/8}\), the rate bounds given in Theorems 5 and 6 are sharp. It will be interesting to settle this theoretically. On the other hand, the choice \(h=n^{-1/2}\) worked best in our simulations. It will be interesting to see whether the actual rate of convergence of \(\hat{f}\) to f is better than \(n^{-1/8}\) with this choice. It would also be interesting to study the behavior of appropriately scaled \(\hat{f}(x)-f(x)\), for each x, as well as a process in x, and the supremum of this process.
It appears that, for the Toeplitz matrix, identification of the density f, smoothness properties of f, and rate of convergence results are all quite challenging. It would be interesting to settle whether the density of the Toeplitz limit does have a smaller peak than the normal density, and whether it crosses the normal density at two points (or more) for \(x >0\).
The referee has pointed out that better rates of convergence may be achievable by sharpening the existing methods, in particular the methods of Chatterjee and Bose [7].
As far as the rate of convergence of the EDF and the expected EDF is concerned, the existing results in the literature are possibly the sharpest achievable and we have quoted some of these earlier. However, when it comes to the convergence rates of the KDE, we have worked with a slightly improved version of Chatterjee and Bose [7] but we acknowledge that there is scope for further improvement. We hope to pursue this issue in future. We thank the referee for raising these issues.
References
 1.
Bai ZD (1993) Convergence rate of expected spectral distributions of large random matrices. I. Wigner matrices. Ann Probab 21(2):625–648
 2.
Bai ZD, Miao B, Tsay J (1997) A note on the convergence rate of the spectral distributions of large random matrices. Stat Probab Lett 34(1):95–101
 3.
Bai ZD, Miao B, Yao JF (2003) Convergence rates of the spectral distributions of large sample covariance matrices. Siam J Matrix Anal Appl 25(1):105–127
 4.
Bose A (2018) Patterned random matrices. Chapman & Hall, Boca Raton
 5.
Bryc W, Dembo A, Jiang T (2006) Spectral measure of large random Hankel, Markov and Toeplitz matrices. Ann Probab 34(1):1–38
 6.
Burkholder DL (1966) Martingale transforms. Ann Math Stat 37:1494–1504
 7.
Chatterjee S, Bose A (2004) A new method for bounding rates of convergence of empirical spectral distributions. J Theor Probab 17(4):1003–1019
 8.
Devroye L (1991) Exponential inequalities in nonparametric estimation. In: Nonparametric functional estimation and related topics (Spetses, 1990), 31–44, NATO Adv Sci Inst Ser C Math Phys Sci, 335, Kluwer Acad Publ, Dordrecht
 9.
Efron B, Stein C (1981) The jackknife estimate of variance. Ann Stat 9(3):586–596
 10.
Feller W (1966) An introduction to probability theory and its applications, vol II, 2nd edn. Wiley, New York
 11.
Götze F, Tikhomirov AN (2007) The rate of convergence of spectra of sample covariance matrices. Theory Probab Appl 54(1):129–140
 12.
Györfi L, Kohler M, Krzyżak A, Walk H (2002) A distribution free theory of nonparametric regression. Springer, New York
 13.
Hammond C, Miller SJ (2005) Distribution of eigenvalues for the ensemble of real symmetric Toeplitz matrices. J Theor Probab 18(3):537–566
 14.
Sen A, Virág B (2011) Absolute continuity of the limiting eigenvalue distribution of the random Toeplitz matrix. Electron Commun Probab 16:706–711
 15.
Steele JM (1986) An Efron–Stein inequality for nonsymmetric statistics. Ann Stat 14(2):753–758
Ethics declarations
Conflicts of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Arup Bose: Research supported by J.C. Bose National Fellowship, Department of Science and Technology, Government of India.
This article is part of the topical collection “Celebrating the Centenary of Professor C. R. Rao” guest edited by, Ravi Khattree, Sreenivasa Rao Jammalamadaka, and M. B. Rao.
Cite this article
Bose, A., Bhattacharjee, M. Kernel Density Estimates in a Nonstandard Situation. J Stat Theory Pract 15, 22 (2021). https://doi.org/10.1007/s42519-020-00161-0
Keywords
 Kernel density estimate
 Efron–Stein inequality
 Esseen’s lemma
 Eigenvalues
 Limiting spectral distribution
 Wigner matrix
 Sample variance covariance matrix
 Toeplitz matrix
 Box plot
Mathematics Subject Classification
 Primary 62G07
 Secondary 60F99
 15B52
 60B20