Sankhya B, Volume 72, Issue 2, pp 123–153

Projection pursuit via white noise matrices

Authors

  • G. Hui, Genzyme Corporation
  • Bruce G. Lindsay, Department of Statistics, Pennsylvania State University

DOI: 10.1007/s13571-011-0008-x

Cite this article as:
Hui, G. & Lindsay, B.G. Sankhya B (2010) 72: 123. doi:10.1007/s13571-011-0008-x

Abstract

Projection pursuit is a technique for locating projections from high- to low-dimensional space that reveal interesting non-linear features of a data set, such as clustering and outliers. The two key components of projection pursuit are the chosen measure of interesting features (the projection index) and the algorithm used to optimize it. In this paper, a white noise matrix based on the Fisher information matrix is proposed for use as the projection index. This matrix index is easily estimated by the kernel method. The eigenanalysis of the estimated matrix index provides a set of solution projections that are most similar to white noise. Applications to simulated and real data sets show that our algorithm successfully reveals interesting features in fairly high dimensions with a practical sample size and low computational effort.

Keywords

Projection pursuit, Fisher information matrix, Eigenanalysis

1 Introduction

This paper is concerned with the construction and analysis of white noise matrices for use in analyzing high dimensional data. The goal is to identify a subspace of white noise projections, which, by definition, are marginally normal and independent of all orthogonal projections. For a reduced dimension data analysis, one could then discard the white noise subspace (or the subspace most similar to white noise) and use the remaining orthogonal projections, which we call the informative projections, to look for interesting relationships.

The methodology is closely related to classical projection pursuit (Friedman and Tukey 1974; Huber 1985). Projection pursuit is a technique that explores high dimensional data by examining the marginal distributions of low dimensional linear projections. The two basic components of projection pursuit are its index and its algorithm. The projection index is designed to measure how “interesting” or “uninteresting” the features are. Usually it is a distance between the marginal distribution of the data projection in a given direction and some “uninteresting distribution” for that marginal. According to the central limit theorem, a projection, being essentially a linear combination of variables, tends to be normal under some regularity conditions (Diaconis and Freedman 1984). A normal distribution is elliptically symmetric and has the least information (Fisher information, negative entropy) for a fixed variance. So based on both theoretical and empirical evidence, researchers have reached the consensus that normality best represents the notion of “uninterestingness” (Diaconis and Freedman 1984; Huber 1985).

The fundamental conceptual difference between classical projection pursuit and our approach is that the former searches for the least normal projections to use as the data summaries, while we seek to find the projections that are the most similar to white noise in order to discard them and use the remaining orthogonal projections. In either case, however, the output is a selected set of interesting projections.

Regardless of which approach seems most natural, ours offers a major computational advantage because it can be carried out using an ordinary eigendecomposition of an easily computed matrix. In contrast, most projection pursuit algorithms have the drawback of a high computational cost because, in order to find the optimal projection, the projection index needs to be calculated or estimated for a large set of possible projections. When the dimension increases, the computational cost increases exponentially. For Gaussian mixture model classification, Calo (2007) developed a forward/backward projection pursuit algorithm which avoids the “curse of dimensionality,” but we have not seen a general solution.

Our starting point for the construction of white noise matrices is the standardized Fisher information matrix Jf for a density. The projections we will call informative are the ones with the largest Fisher information. For reasons of computation and robustness, instead of Jf, we will estimate the projection index \(J_{f_2},\) the standardized Fisher information matrix for the density square transformed distribution: \(f_2(x)=f^2(x)/\int f^2(y)dy.\) The least normal projection from this new projection index can be estimated algebraically just as in principal component analysis, provided the matrix is standardized (i.e., linear effects are removed). One only needs to estimate the matrix measure \(J_{f_2}\) by kernel density estimation, and then do an eigenanalysis for the estimated matrix.

If an eigenvalue of Jf or \(J_{f_2}\) reaches a specified lower bound, the corresponding linear projection is exactly a white noise coordinate, which is marginally normally distributed and is independent of all orthogonal solution projections. In practice, from the eigenanalysis, one can select large eigenvalues to find the most informative linear projections for future study, or from a converse but equivalent point of view, one could find and discard the least informative linear projections corresponding to small eigenvalues.

We briefly illustrate the effectiveness of our method by finding a spiral structure hidden in an eight-dimensional space. Suppose we construct a spiral using 400 points in the first two dimensions of the space (Fig. 1), and fill the remaining six coordinates with white noise. This example was used in Posse (1995) for comparing the efficiency of projection pursuit methods. It is a big challenge for an algorithm to find the spiral structure in a high dimensional space, because “the density of the spiral is nearly normal i.e., nearly radial and decreasing when going away from the center” (Posse 1995). The first two principal components from the estimated \(J_{f_2}\) were found in less than 3 CPU seconds, and they reveal the spiral structure very well (Fig. 2).
Fig. 1

The spiral structure in the first two dimensional space
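For readers who want to set up a test case like this one, the following is a minimal data-generation sketch in Python/NumPy: a spiral in the first two coordinates padded with six white noise coordinates. The spiral recipe here is only a generic stand-in (Posse 1995 gives the construction actually used in the paper); the estimator sketch in Section 2.5 can then be applied to X.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
# A generic planar spiral: radius grows with the angle, plus a little jitter.
theta = rng.uniform(0.0, 3.0 * np.pi, n)
r = theta / (3.0 * np.pi) + rng.normal(0.0, 0.02, n)
spiral = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
noise = rng.standard_normal((n, 6))       # six white noise coordinates
X = np.hstack([spiral, noise])            # 400 points in 8 dimensions
```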

Fig. 2

The structure found by the eigenanalysis of \(J_{\hat{f}_2}\), d = 8, n = 400

This structure from the eigenanalysis of \(J_{f_2}\) is similar to the result of Posse’s method, which was shown to be better than Friedman’s algorithm (Friedman 1987) in this example.

The remainder of the paper is organized as follows. In Section 2, we introduce the standardized Fisher information matrix Jf and interpret the results of its eigenanalysis. The new projection index \(J_{f_2}\) is then developed in order to eliminate certain computational and statistical problems that arise in estimating Jf. In Section 3 we study how to detect white noise coordinates using the eigenvalues of \(J_{f_2}\). Section 4 discusses the selection of the bandwidth. In Section 5, the new algorithm is applied to several simulated and real data sets. We give some final remarks in Section 6 and summarize the performance and future work in Section 7.

2 Standardized Fisher information matrix

2.1 Introduction of standardized Fisher information matrix

Let X = (X1, X2,..., Xd) be a d-dimensional random vector with the density function f(x), mean μ and covariance matrix Vf. We will assume that the following regularity condition holds: the density f is continuously differentiable on an open set of support. It follows that Vf is nonsingular.

Definition 1

The Fisher information matrix for density f(x) is defined to be \(\widetilde{J}_f=\int\frac{\nabla_x f \cdot\nabla_x f^T}{f}dx,\) where \(\nabla_x f(x)=(\frac{\partial}{\partial x_1}f,...,\frac{\partial}{\partial x_d}f)\). The standardized Fisher information matrix for density f(x) is defined to be \(J_{f}=V_{f}^{1/2}\cdot\Bigl(\int\frac{\nabla_x f \cdot\nabla_x f^T}{f}dx\Bigr)\cdot V_{f}^{1/2},\) where \(V_{f}^{1/2}\) is the symmetric square root of the covariance matrix.
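As a quick check of Definition 1 (a standard calculation we add for orientation), the multivariate normal attains the identity matrix:
$$ f=N(\mu,\Sigma):\quad \nabla_x f(x)=-\Sigma^{-1}(x-\mu)f(x),\qquad \widetilde{J}_f=\int \Sigma^{-1}(x-\mu)(x-\mu)^T\Sigma^{-1}f(x)\,dx=\Sigma^{-1}, $$
$$ J_f=V_f^{1/2}\,\Sigma^{-1}\,V_f^{1/2}=\Sigma^{1/2}\Sigma^{-1}\Sigma^{1/2}=I_d. $$
This is exactly the equality case of the inequality stated below.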

In the case d = 1, Jf or \(\widetilde{J}_f\) is called the Fisher information number (Terrell 1995; Papaioannou and Ferentinos 2005). In the standardized Fisher information matrix, the derivative is taken with respect to x rather than with respect to parameters, as in the ordinary Fisher information matrix, so it can be viewed as a measure of the information in the density itself rather than in location parameters. Kagan (2001) demonstrated the connection between the Fisher information for a density and the Fisher information for parameters. We now use this idea to generate a special information-matrix based goodness-of-fit methodology for the multivariate normal.

The key to our methodology is that the normal distribution is known to have the smallest standardized Fisher information matrix, in the positive definite sense, among all regular density functions:
$$ \widetilde{J}_f\geq V_{f}^{-1}\Leftrightarrow J_{f}\geq I_d,$$
(1)
where, as always, A1 ≥ A2 means that A1 − A2 is a positive semi-definite matrix. (See Kagan et al. (1973) for an early, univariate version of this result.) The inequality (1) becomes an equality if and only if f is a normal density function. Note that the regularity condition is needed, as is evident from considering a uniform density on the unit hypercube, which would have nominal information Jf = 0. Note also that (1) implies that all the eigenvalues of Jf are one or larger.

Without loss of generality, in the theoretical development we assume that X has been standardized: μ = 0 and Vf = Id. Otherwise, one can standardize X using \(Y=V_{f}^{-\frac{1}{2}}(X-\mu).\)

2.2 Conditional normality interpretation

We start by introducing a new interpretation of the diagonal elements of Jf. The ith diagonal term of Jf can be expressed as
$$\begin{array}{lll} &&\int \left(\frac{\partial}{\partial x_i}f(x_1,\dots,x_d)\Big/f(x)\right)^2 f(x)\,dx\\ &&\quad=\int\left(\int\left(\frac{\partial}{\partial x_i}\log f\left(x_i|x_{-i}\right)\right)^2 f(x_i|x_{-i})\,dx_i\right)f(x_{-i})\,dx_{-i}\\ &&\quad=\int J_{X_i|x_{-i}}\, f(x_{-i})\,dx_{-i}, \end{array}$$
(2)
where \(x_{-i}=(x_1,\dots,x_{i-1},x_{i+1},\dots,x_d)\), and \(J_{X_i|x_{-i}}\) is the Fisher information for the conditional distribution \(f(x_i|x_{-i})\). That is, the ith diagonal term of Jf is not the Fisher information of the ith marginal distribution of X, but rather a weighted average of the Fisher information in Xi when conditioned on the rest of the uncorrelated variables, where the weight is the density function \(f(x_{-i})\).

Proposition 1

Let Jf be the standardized Fisher information matrix of X = (X1, X2,...,Xd). When the ith diagonal term of Jf reaches the lower bound 1, Xi is marginally normal and independent of \(X_{-i}\), with probability one over \(X_{-i}\)’s distribution.

Our proof is in the Appendix. See also Section 6.2 for a way to extend the results in this paper beyond the normal application. All such results are, at their core, a version of the Cramer–Rao inequality. The result indicates that \(X_i|x_{-i}\) is standard normal for almost every \(x_{-i}\), so we can conclude that Xi is marginally normal and independent of all the other variables. We then call Xi a white noise coordinate.

If the ith diagonal term of Jf is bigger than one, some combination of non-normality and dependence on \(X_{-i}\) exists. Since X has been standardized, all the Xi’s are uncorrelated, so any dependence must arise from a nonlinear structural relationship, clustering, or other forms of dependence that can occur despite zero correlation. Large values correspond to variables Xi whose conditional distributions contain more Fisher information than the normal.

2.3 Eigenanalysis of Jf

In this section, we present how to use an eigenanalysis of Jf to find the most informative projections. The following result is the basis for the eigenanalysis of Jf.

Proposition 2

Let A be a d × d nonsingular matrix and Y = AX. Then
$$ J_{g}=V_{g}^{1/2}\left(A^{-1}\right)^{T}V_{f}^{-1/2}\cdot J_{f}\cdot V_{f}^{-1/2}A^{-1}V_{g}^{1/2}, $$
(3)
where g(y) is the density of Y, Vg is the covariance matrix of g, Jg is the standardized Fisher information matrix for g.

The first consequence of this proposition is that the standardization \(A=V_{f}^{-\frac{1}{2}}\) leaves the standardized Fisher information unchanged.

Next consider an orthogonal transformation Z = ΓTY, where the orthogonal matrix \(\Gamma =[\gamma_{1},\gamma _{2},...,\gamma_{d}]^T,\) and the density of Z is denoted h(z). These transformations preserve the standardized structure and so do not create any new linear relationships. The Fisher information matrix of Z has the form \(J_h=\Gamma J_{f}\Gamma^{T}=\Bigl(\gamma_i^TJ_{f}\gamma_j\Bigr).\) Now suppose that γ1,...,γd are the eigenvectors from an eigenanalysis of Jg ( = Jf), and that the corresponding ordered eigenvalues are λ1 ≥ λ2 ≥ ... ≥ λd. It is then clear that Jh is a diagonal matrix with the eigenvalues λ1, λ2,...,λd down the diagonal. The projection \(Z_1=\gamma_1^TY\) maximizes the first diagonal term of Jh, and the maximal value is just λ1. According to the discussion of Section 2.2, the projection \(Z_1=\gamma_1^TY\) from the eigenanalysis of Jf has the least conditional normality (most conditional information) conditioned on all the other orthogonal uncorrelated variables, and the value λ1 is a measure of its informativeness. By a similar analysis, \(Z_i=\gamma_i^TY\) has the most conditional information among all projections that are uncorrelated with \(Z_1,\dots,Z_{i-1}\).

2.4 Interpreting the eigenvalues

Let λ1 ≥ λ2 ≥ ... ≥ λd be the eigenvalues of Jf. If \(\lambda_{i+1}=1\), then λj = 1 for all j ≥ i + 1. According to Proposition 1, \(Z_{i+1},\dots,Z_d\) are then white noise coordinates. We assume that they can be discarded, as they are not only marginally normal, but also independent of the remaining orthogonal variables. Let d − N(f) be the number of white noise coordinates, so that N(f) is the number of informative coordinates. In other words, the projections (Z1,...,ZN(f)) are sufficient for further analysis.

When the smallest eigenvalue λd > 1, there is no white noise subspace to discard. If we nonetheless use the projections (Z1,...,ZN(f)), we can say that we have picked the N(f) most conditionally informative projections, or, equivalently, that we have discarded the subspace of projections that is most similar to white noise among all linear projections, in the sense of having the least conditional information.

2.5 Transforming to \(J_{f_2}\)

In the last section, we showed that an eigenanalysis of Jf provides a white noise decomposition. Unfortunately, Jf usually does not have an explicit form, even for a mixture of normals. Further, if we estimate Jf by replacing the density f with a kernel density estimate \(\hat{f}\), the integral will not have an explicit form because of the density \(\hat{f}\) in the denominator of the integrand. Monte Carlo or numerical integration would be required to calculate these measures.

We considered the density square transformation and found it to be highly effective at solving this problem. Suppose the random variable S has the density \(\frac{f^2(s)}{\int f^2(y)dy}:=f_2(s),\) where f(s) is the density of X. The density square transformation preserves some good properties of the original density f:
  1. The ordering of the density values is unchanged: f(x1) < f(x2) ⇔ f2(x1) < f2(x2) and f(x1) = f(x2) ⇔ f2(x1) = f2(x2), so both densities have the same contour lines.

  2. The number and locations of the density modes are unchanged. This is a consequence of the first property.

  3. The density f2 accentuates the peaks of the unimodal density f relative to the tails. (This is clear from examining density ratios: if f(x2)/f(x1) > 1, then f2(x2)/f2(x1) > f(x2)/f(x1).)

  4. It preserves normality: X is normal if and only if S is normal, \(X\sim N(0,\Sigma)\Leftrightarrow S\sim N(0,\Sigma/2)\) (a short check is given after this list).

  5. As a consequence of Property 4, the density square transformation also preserves the white noise subspace. (Suppose that the density has a white noise subspace in coordinates k + 1,...,d. Then the density factors as f(x1,...,xk)ϕ(xk + 1,...,xd). Clearly the squared density has the same factorization, and ϕ2 is again normal.)
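A one-line check of Property 4 (our addition): for a centered normal,
$$ f(x)\propto \exp\!\left(-\tfrac{1}{2}x^T\Sigma^{-1}x\right)\;\Longrightarrow\; f^2(x)\propto \exp\!\left(-x^T\Sigma^{-1}x\right)=\exp\!\left(-\tfrac{1}{2}x^T(\Sigma/2)^{-1}x\right), $$
which is the N(0, Σ/2) kernel, so \(S\sim N(0,\Sigma/2)\).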
Plugging f2 into Eq. 1, and noting that \(\nabla_x f_2=2f\nabla_x f/\int f^2(y)dy\) so that the standardized Fisher information matrix of f2 is four times the matrix defined below, provides the inequality
$$ J_{f_2}:=\frac{V_{f_2}^{1/2}\int\nabla_x f\cdot\nabla_x f^TdxV_{f_2}^{1/2}}{\int f^2(x)dx}\geq\frac{1}{4} I_d, $$
(4)
where \(V_{f_2}\) is the covariance matrix of S:
$$ V_{f_2}=\frac{\int xx^Tf^2(x)dx}{\int f^2(x)dx}-\left(\frac{\int xf^2(x)dx}{\int f^2(x)dx}\right)\left(\frac{\int xf^2(x)dx}{\int f^2(x)dx}\right)^T. $$
Matrix \(J_{f_2}\) is a non-normality matrix measure for S ∼ f2, and hence also a non-normality matrix measure for X ∼ f, since the density square transformation preserves normality. A key advantage of \(J_{f_2}\) over Jf is that it is easy to estimate using the kernel method. It also helps to make the methodology more robust to the presence of outliers. (We are unaware of any previous use of the squared density transformation to regularize a statistical analysis.)
We are now ready to turn our theoretical tools into a practical data analysis that we will call white noise analysis (WNA). Let
$$ \hat{f}_H(x)=\frac{1}{n|H|}\sum\limits_{i=1}^n K_d\left(H^{-1}(x-X_i)\right), $$
where Kd is the kernel function and H is the bandwidth matrix, which we assume to be positive definite. In this paper, we use the normal density as the kernel because it will not introduce any non-normality into the problem and it enables explicit computation formulas. Substituting \(\hat{f}_H(x)\) for f(x) in \(J_{f_2},\) one can derive an explicit form of the estimator of \(J_{f_2}:\)
$$ \hat{J}_{f_2}=J_{\hat{f}_2}=\frac{V_{\hat{f}_2}^{1/2}\int\nabla_x \hat{f}_H(x)\cdot\nabla_x \hat{f}_H(x)^TdxV_{\hat{f}_2}^{1/2}}{\int \hat{f}^2_H(x)dx}. $$
The terms in this expression can be calculated as
$$\begin{array}{rll} \hat{f}_2(x)&=&\hat{f}^2_H(x)/\int\hat{f}^2_H(y)dy,\nonumber\\[6pt] \int\hat{f}_H^2(x)dx&=&\frac{1}{n^2}\sum\limits_{i,j}\phi_d\left(X_i,X_j,2H^2\right), \end{array}$$
$$\begin{array}{rll} \int\nabla_x\hat{f}_H(x)\cdot\nabla_x\hat{f}_H^T(x)dx &=&\frac{1}{n^2}\sum\limits_{i,j}\phi_d\left(X_i-X_j,0, 2H^2\right)\\ &&\times \left(\frac{H^{-2}}{2}-\frac{H^{-2}}{4}\left(X_i-X_j\right)\left(X_i-X_j\right)^TH^{-2}\right), \end{array}$$
$$ V_{\hat{f}_2}=\frac{\int xx^T\hat{f}^2_H(x)dx}{\int \hat{f}^2_H(x)dx}-\left(\frac{\int x\hat{f}^2_H(x)dx}{\int \hat{f}^2_H(x)dx}\right)\left(\frac{\int x\hat{f}^2_H(x)dx}{\int \hat{f}^2_H(x)dx}\right)^T, $$
$$ \int xx^T\hat{f}^2_H(x)dx=\frac{1}{n^2}\sum\limits_{i,j}\phi_d\left(X_i,X_j,2H^2\right)\left(\frac{H^2}{2}+\frac{\left(X_i+X_j\right)\left(X_i+X_j\right)^T}{4}\right), $$
and
$$ \int x\hat{f}_H^2(x)dx=\frac{1}{n^2}\sum\limits_{i,j}\phi_d\left(X_i,X_j,2H^2\right)\left(\frac{X_i+X_j}{2}\right). $$
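To make these formulas concrete, here is a minimal numerical sketch of the estimator and its eigenanalysis, written in Python/NumPy for illustration (the paper’s own computations were done in Matlab). It assumes an isotropic bandwidth H = hId applied to standardized data; the function name white_noise_analysis and the default normal-reference bandwidth are our choices, not the authors’ code.

```python
import numpy as np

def white_noise_analysis(X, h=None):
    """Sketch of the eigenanalysis of J_hat_{f_2} with a normal kernel and
    isotropic bandwidth H = h * I_d on standardized data (illustrative only)."""
    n, d = X.shape

    # Standardize: mean 0, identity covariance, via the symmetric root of S^{-1}.
    Xc = X - X.mean(axis=0)
    S = np.cov(Xc, rowvar=False)
    w, U = np.linalg.eigh(S)
    Z = Xc @ (U @ np.diag(w ** -0.5) @ U.T)

    # Normal-reference bandwidth (Bowman and Foster) with Sigma = I.
    if h is None:
        h = (4.0 / (d + 2)) ** (1.0 / (d + 4)) * n ** (-1.0 / (d + 4))

    D = Z[:, None, :] - Z[None, :, :]            # pairwise differences X_i - X_j
    T = Z[:, None, :] + Z[None, :, :]            # pairwise sums X_i + X_j
    # phi_d(X_i, X_j, 2 h^2 I): normal density with covariance 2 h^2 I.
    W = (4.0 * np.pi * h ** 2) ** (-d / 2) * np.exp(-(D ** 2).sum(-1) / (4.0 * h ** 2))

    int_f2 = W.mean()                            # (1/n^2) sum_ij phi_d = int f_hat^2
    grad = ((W.sum() / (2 * h ** 2)) * np.eye(d)
            - np.einsum('ij,ijk,ijl->kl', W, D, D) / (4 * h ** 4)) / n ** 2
    xxT = ((W.sum() * h ** 2 / 2) * np.eye(d)
           + np.einsum('ij,ijk,ijl->kl', W, T, T) / 4) / n ** 2
    xm = np.einsum('ij,ijk->k', W, T) / (2 * n ** 2)

    m = xm / int_f2
    V = xxT / int_f2 - np.outer(m, m)            # V_{f_hat_2}
    vw, vU = np.linalg.eigh(V)
    V_half = vU @ np.diag(np.sqrt(vw)) @ vU.T    # symmetric square root

    J = V_half @ grad @ V_half / int_f2          # estimated matrix index J_hat_{f_2}
    lam, Gamma = np.linalg.eigh(J)
    order = np.argsort(lam)[::-1]                # most informative first
    return lam[order], Gamma[:, order], Z        # eigenvalues >= 1/4 in theory
```

For plotting, the two most informative solution projections are then the columns of Z @ Gamma[:, :2].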
One important aspect of these calculations is that the estimated covariance matrix, \(V_{\hat{f}_2}\), is always nonsingular as long as H is. This implies that one can apply the method even when n is less than d. All these estimators are V-statistics, and so are consistent under some assumptions involving a large sample size n and a vanishing bandwidth H. However, it is preferable to view the problem with H fixed as n goes to infinity. In this case the kernel estimator \(\hat{J}_{f_2}=J_{\hat{f}_2}\) can be viewed as a direct measure of the non-normality of the kernel-smoothed distribution \(f_2^*(x)= (f^*(x))^2/\int (f^*(y))^2dy,\) where
$$ f^*(x)=\int f(y)\frac{1}{|H|}K_d\left(H^{-1}(x-y)\right)dy. $$
This transformation preserves the normality of f(x) for any fixed H. Note that \(\hat{J}_{f_2}\) is also a consistent estimator of \(J_{f_2^*}\) without H going to zero. This is an important aspect for applying the method in higher dimensions. From this point of view our regularity condition is also guaranteed to hold, as Gaussian smoothing ensures it is true for \(f_2^*\).

This double-smoothing idea (smooth the model and the data with the same kernel) has at least two advantages. First, suppose we view \(J_{f^*_2}\) as a function on distributions, say T(F). Then the estimator \(\hat{J}_{f_2}\) is equal to \(T(\hat{F}_n),\) where \(\hat{F}_n\) is the empirical distribution. This fact can be used to derive the asymptotic properties of the estimator \(\hat{J}_{f_2}\) (see Hui 2008; Lindsay et al. 2008 for details). Second, the eigenanalysis of \(\hat{J}_{f_2}\) correctly estimates N(f) for any fixed H when the sample size is large enough.

Although it is true that asymptotically all values of H lead one to finding the true white noise coordinates, it is also true that the choice of H has some importance in the sensitivity of the method. We will discuss this further in Section 4.

3 White noise detection and testing

Let X = (X1, X2,..., Xd) be a standardized d-dimensional random vector with density function f(x), and let S be the random variable with density f2. Suppose the eigenvalues of \(J_{f_2}\) are ordered λ1 ≥ λ2 ≥ ... ≥ λd with corresponding eigenvectors γ1, γ2,...,γd. Suppose the solution projection \(P=(\gamma _{1}^TS,...,\gamma _{d}^TS)\) has the density h2. The Fisher information \(J_{h_2}\) is then the diagonal matrix \(J_{h_2}=diag(\lambda_1,\lambda_2,...,\lambda_d).\) The eigenvalue λi is the measure of the non-normality of the corresponding projection \(P_i=\gamma_i^TS.\) According to Eq. 4 and Proposition 1, when the eigenvalue λk reaches the lower bound 0.25, then λi = 0.25 for all i ≥ k, and the corresponding projections \(P_i=\gamma_i^TS\) are white noise coordinates, which are standard normal and independent of the other projections. Although the variable S is defined through the squared density, we use the corresponding projections of the original variable X when we make plots, motivated by our preceding discussion.

In this section, we propose a sequential test to detect the true white noise coordinates within the solution projections from the eigenanalysis of \(J_{f_2}.\) First we test the null hypothesis that all eigenvalues are equal to 0.25, that is, H0: N(f) = 0. The alternative hypothesis is that the largest eigenvalue λ1 > 0.25. If we reject the null hypothesis, we consider the next hypothesis: H0: N(f) = 1, i.e., λ2 = λ3 = ... = λd = 0.25, versus Ha: N(f) > 1, i.e., λ2 > 0.25. We continue in this fashion until we fail to reject.

3.1 Test procedure

For the general null hypothesis H0: N(f) = k − 1, i.e., λk = λk + 1 = ... = λd = 0.25, versus the alternative hypothesis Ha: N(f) > k − 1, i.e., λk > 0.25, we propose two different test statistics: \(\hat{\lambda}_k\) and \(\hat{S}_k=\sum_{j=k}^d\hat{\lambda}_j.\)

The next challenge is to find suitable critical values. We do so by means of a hybrid parametric-nonparametric bootstrap (Davison and Hinkley 1997). Under the null hypothesis H0: N(f) = k − 1, the projections (Pk, Pk + 1,...,Pd) are white noise coordinates. Suppose Zk, Zk + 1,...,Zd are independent standard normal variables. Then the vector \(P^*=(P_1,P_2,...,P_{k-1},Z_k, Z_{k+1},...,Z_d)\) should have approximately the same distribution as the solution projection P = (P1, P2,..., Pk, Pk + 1,..., Pd).

For a fixed sample size n and dimensionality d, we draw B = 1000 random samples of size n from a (d − k + 1)-dimensional standard normal distribution. For every sample \((Z^{ib}_k,Z^{ib}_{k+1},...,Z^{ib}_d)\), i = 1, 2,..., n, b = 1, 2,..., B, we use the data \((P^{i}_1,P^{i}_2,...,P^{i}_{k-1},Z^{ib}_k,Z^{ib}_{k+1},...,Z^{ib}_d)\), i = 1, 2,..., n, to estimate the Fisher information matrix \(\hat{J}^b_{f_2},\) b = 1, 2,..., B. The jth eigenvalue of \(\hat{J}^b_{f_2}\) is the resampled value \(\lambda^b_j,\) k ≤ j ≤ d. We construct the empirical distributions of \(\lambda^b_k\) and \(S^b_k=\sum_{j=k}^d\lambda^b_j\) using the B = 1000 samples, and then estimate the critical values \(\hat{F}_{\lambda_k,0.05}\) and \(\hat{F}_{S_k,0.05}\). If the estimated \(\hat{\lambda}_k\) (respectively \(\hat{S}_k\)) is less than the critical value \(\hat{F}_{\lambda_k,0.05}\) (respectively \(\hat{F}_{S_k,0.05}\)), we fail to reject the null hypothesis H0: N(f) = k − 1, i.e., λk = λk + 1 = ... = λd = 0.25; that is, we conclude that there are at least d − k + 1 white noise coordinates.
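The following is a minimal sketch of this hybrid bootstrap in Python/NumPy, reusing the illustrative white_noise_analysis function from Section 2.5. It reports bootstrap p-values rather than 0.05 critical values, which is equivalent for the sequential decision; the names and defaults are ours, not the authors’.

```python
import numpy as np

def wna_bootstrap_pvalues(P, k, B=1000, h=None, seed=0):
    """Hybrid bootstrap for H0: N(f) = k-1, i.e. projections P_k,...,P_d are
    white noise.  P is the n x d matrix of solution projections (sketch only)."""
    rng = np.random.default_rng(seed)
    n, d = P.shape

    lam_obs, _, _ = white_noise_analysis(P, h)          # observed eigenvalues
    lam_k_obs, S_k_obs = lam_obs[k - 1], lam_obs[k - 1:].sum()

    lam_k_b = np.empty(B)
    S_k_b = np.empty(B)
    for b in range(B):
        P_star = P.copy()
        # Keep P_1,...,P_{k-1}; replace the rest by independent N(0,1) noise.
        P_star[:, k - 1:] = rng.standard_normal((n, d - k + 1))
        lam_b, _, _ = white_noise_analysis(P_star, h)
        lam_k_b[b] = lam_b[k - 1]
        S_k_b[b] = lam_b[k - 1:].sum()

    # One-sided p-values; reject H0 when they are small (e.g. below 0.05).
    return np.mean(lam_k_b >= lam_k_obs), np.mean(S_k_b >= S_k_obs)
```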

After a dimension reduction that removes all white noise coordinates, so that all remaining projections are significantly non-normal, one can, as in principal component analysis, use the cumulative proportion of eigenvalues \(\sum_{i=1}^k\lambda_i/\sum_{i=1}^d\lambda_i\) as an index of the fraction of the non-normality of the data explained by the selected projections P1,...,Pk.

3.2 Test validity

If we are testing N(f) = 0, then the above bootstrap testing procedure is exactly the parametric bootstrap for this testing problem. If we are testing N(f) = k for some k ≥ 1, however, the null hypothesis is semi-parametric, as we do not specify the distribution of the informative coordinates.

One way to evaluate the effectiveness of our bootstrap procedure when N(f) > 0 is to compare the p-values obtained from it with those we would obtain by parametric bootstrapping (Monte Carlo) from the true null distribution, as we do in the following examples.

Example 1

Let X = (X1, X2), where X1 ∼ 0.5N(2, 1) + 0.5N( − 2, 1), X2 ∼ N(0, 1) and X1 ⊥ X2. Notice N(f) = 1. We draw R = 100 samples from X: \((X_1^{ir}, X_2^{ir}), i=1,2,...,n=300; r=1,2,...,R=100.\) For every sample, we fix the first estimated projection \(P_1^{r},\) draw random normal samples \(P_2^{i,r,b}, b=1, 2,...,B = 1000,\) estimate the eigenvalues \((\lambda_1^{rb},\lambda_2^{rb}),\) and calculate the p-value of the rth replicate under H0: N(f) = 1. The corresponding p-value of the rth replicate under the true distribution of \((\lambda_1^{b},\lambda_2^{b})\) was also estimated using the empirical distribution. Figure 3 shows that the p-values from the bootstrap procedure are very highly correlated with those from the empirical distribution.
Fig. 3

n = 300, d = 2, X1 ∼ 0.5N(2, 1) + 0.5N( − 2, 1), X2 ∼ N(0, 1)

Example 2

Let X = (X1, X2, X3), where \(X_1\sim 0.5N(2,1)+0.5N(-2,1),\) X2 ∼ 0.5N(1, 1) + 0.5N( − 1, 1), X3 ∼ N(0, 1) and X1, X2, X3 are independent. Notice that N(f) = 2. Suppose we draw R = 100 samples from X: \((X_1^{ir}, X_2^{ir}, X_3^{ir}), i=1,2,...,n=500; r=1,2,...,R.\) First we consider the hypothesis H0: N(f) = 2. The p-values from the bootstrap procedure and the empirical distribution are highly correlated, although there is some scatter (Fig. 4). Similarly, for the false null hypothesis N(f) = 1, we can calculate the p-values using the bootstrap procedure and the empirical distribution. Figure 5 shows that the p-values from the bootstrap procedure are much closer to those from the empirical distribution, since only one projection P1 is fixed. However, it also shows that there is very little power to detect the true alternative in this case. Note that although X2 is a mixture of normals, it is a symmetric unimodal mixture.
Fig. 4

n = 500, d = 3, X1 ∼ 0.5N(2, 1) + 0.5N( − 2, 1), X2 ∼ 0.5N(1, 1) + 0.5N( − 1, 1), X3 ∼ N(0, 1), H0: P3 is a white noise coordinate

Fig. 5

n = 500, d = 3, X1 ∼ 0.5N(2, 1) + 0.5N( − 2, 1), X2 ∼ 0.5N(1, 1) + 0.5N( − 1, 1), X3 ∼ N(0, 1), H0: P2 and P3 are white noise coordinates

4 Selection of H

For a fixed sample size n and dimensionality d, it is important but also very challenging to choose the most appropriate bandwidth. Too small a bandwidth leads to a very spiky density estimator, which is less normal than the true density, even when the sample really comes from the true distribution. Too large a bandwidth leads to over-smoothing, which makes any non-normal data look more normal. Since standardizing the data leaves the standardized Fisher information matrix unchanged, we propose to use H = hId for the standardized data (Σ = Id).

For the normal density kernel, Bowman and Foster (1993) recommended the bandwidth Hopt
$$ H_{\rm opt}=\left(\frac{4}{d+2}\right)^{\frac{1}{d+4}}\Sigma^{1/2}n^{-\frac{1}{d+4}}, $$
where Σ can be estimated by the sample covariance matrix. This bandwidth Hopt is optimal for the density estimator \(\hat{f}\) when the true density is normal, according to the minimal mean integrated squared error criterion. In our experience, Hopt usually provides good results in finding informative projections, and there is a good reason for this: projection pursuit based on \(\hat{J}_{f_2}\) is not very sensitive to the choice of H. This is not surprising given that the eigenanalysis of \(\hat{J}_{f_2}\) is consistent for N(f) regardless of H, provided that the sample size n is large enough. We should note, however, that Hopt cannot be used when n is smaller than d, as the sample covariance is singular. In this case, one could use a diagonal matrix H, either a multiple of the identity or a diagonal matrix with coordinate standard deviations on the diagonal.
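As a small illustration (a sketch, not the authors’ code), the Bowman and Foster bandwidth can be computed directly from the sample:

```python
import numpy as np

def bandwidth_hopt(X):
    """Normal-reference bandwidth matrix H_opt; requires n > d so that the
    sample covariance matrix is nonsingular (illustrative sketch only)."""
    n, d = X.shape
    S = np.cov(X, rowvar=False)
    w, U = np.linalg.eigh(S)
    S_half = U @ np.diag(np.sqrt(w)) @ U.T        # symmetric square root of Sigma
    return (4.0 / (d + 2)) ** (1.0 / (d + 4)) * n ** (-1.0 / (d + 4)) * S_half
```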

5 Examples

In Section 5.1, we apply the eigenanalysis of the estimated matrix \(\hat{J}_{f_2}\) to several simulated data sets to investigate its power in detecting known non-normal structures. In Section 5.2, the algorithm is applied to real data sets to compare its performance with classical methods. We summarize the advantages and possible problems of our projection pursuit method in Sections 6 and 7. All programs were implemented in Matlab 7.0.1.

5.1 Normal mixture model

The first example is a three-component normal mixture model of equal proportions (Fig. 6). Each component has unit covariance matrix I2. The mean vectors are (5, 5), ( − 5, − 5) and (5, − 5). The transformed distribution after standardization is still a three-component normal mixture and the standardization will not change the possible optimal directions. The theoretical eigenvectors of \(J_{f_2}\) are \((\frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}})\) for the largest eigenvalue and \((\frac{1}{\sqrt{2}},-\frac{1}{\sqrt{2}})\) for the smallest eigenvalue. So the projection index \(J_{f_2}\) picks the projection \(P_1=(X_1+X_2)/\sqrt{2}.\) The projection generates a marginal density that is a tri-modal mixture of three normals. Alternatively, the orthogonal projection \(P_2=(X_2-X_1)/\sqrt{2}\) is closest to white noise. This marginal is a bimodal mixture of two normals.
Fig. 6

Contour plot for \(f(x_1,x_2)=\frac{1}{3}\phi(x_1,5,1)\phi(x_2,5,1)+\frac{1}{3}\phi(x_1,5,1)\phi(x_2,-5,1)+\frac{1}{3}\phi(x_1,-5, 1)\phi(x_2,-5,1)\)
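As a usage illustration (the sample size and seed here are our choices; the paper does not report them for this example), one can simulate this mixture and apply the white_noise_analysis sketch from Section 2.5. In line with the theoretical analysis above, the leading eigenvector should be close to \((1/\sqrt{2},1/\sqrt{2})\) up to sign.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 600
means = np.array([[5.0, 5.0], [-5.0, -5.0], [5.0, -5.0]])
X = means[rng.integers(0, 3, n)] + rng.standard_normal((n, 2))  # unit-covariance components

lam, Gamma, Z = white_noise_analysis(X)   # sketch from Section 2.5
print(lam)                                # both eigenvalues should exceed 1/4
print(Gamma[:, 0])                        # roughly (1, 1)/sqrt(2), up to sign
```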

The above two examples also support the conjecture that the optimal projections from the matrix index \(J_{f_2},\) which have least conditional normality, also tend to have poor marginal normality.

5.2 High dimension low sample size

In the current data-rich environment, it is natural to ask whether our white noise analysis can be carried out when d is larger than n, the so-called high dimension, low sample size (HDLSS) scenario. This is a rich question that we address here only with a simple simulation experiment; our results are promising, but admittedly incomplete. We generated n = 25 observations in a d = 100 dimensional space. The first coordinate is from the normal mixture f(x1) = 0.5ϕ(x1, 0, 1) + 0.5ϕ(x1, 5, 1); all other coordinates were generated as white noise. We use H = Id instead of Hopt, since the latter is not positive definite in this scenario. The results from 100 replications of this experiment showed that the eigenanalysis of \(J_{\hat{f}_2}\) always puts the most weight on the first dimension (the mixture coordinate). The mean absolute weight assigned to the first coordinate was 0.7292, whereas the mean of the second largest absolute weight was 0.1079. We discuss the HDLSS problem further in Section 7. Figure 7 shows a typical result.
Fig. 7

The histogram of the one-dimensional solution projection from \(J_{\hat{f}_2}\). One typical result with n = 25, d = 100

5.3 Particle physics data

We now turn to real data sets. The first data set, having 500 observations, was derived from a high-energy particle physics scattering experiment (Ballam et al. 1971; Friedman and Tukey 1974; Jee 1985). In the nuclear reaction, a positively charged pi-meson strikes a proton, producing a proton, two positively charged pi-mesons and a negatively charged pi-meson. Every observation consists of seven independent measurements.

The results of the eigenanalysis of \(J_{\hat{f}_2}\) are listed in Table 1 (9 CPU seconds). The results from our two test statistics are contradictory. According to the test procedure based on \(\hat{\lambda}_k,\) all solution projections are sequentially found to be significantly non-normal because all eigenvalues are bigger than the corresponding critical values. On the other hand, ∑ λk is less than the critical value 4.6306, so on the basis of this test, we would not reject the hypothesis that all solution projections are normal.
Table 1

Eigenanalysis of \(J_{\hat{f}_2}\) and critical values from 1,000 samples for particle physics data

k                                                    1        2        3        4        5        6        7
λk                                                   0.7697   0.7053   0.6492   0.5196   0.5055   0.4453   0.3930
\(\hat{F}_{\lambda_k,0.05}\)                         0.7170   0.6978   0.6685   0.4069   0.5069   0.4095   0.3420
\(\sum_{i=k}^d\lambda_i\)                            3.9876   3.2179   2.5126   1.8634   1.3438   0.8383   0.3930
\(\hat{F}_{S_k,0.05}\)                               4.6306   3.8902   3.1485   2.3925   1.4689   0.8001   0.3420
\(\sum_{i=1}^k\lambda_i/\sum_{i=1}^d\lambda_i\)      0.1930   0.3699   0.5327   0.6630   0.7898   0.9014   1.00

The two largest classical principal components from the covariance matrix are shown in Fig. 8, which indicates that the data may consist of three clusters, but the structure is not very clear. The first two solution projections from \(J_{\hat{f}_2}\) (Fig. 9) show a triangular shape with points concentrated around two of the three corners, which is obviously non-normal. So the test based on \(\hat{\lambda}_k\) seems to better detect the non-normal structure. The first two projections explain about 36.68% of the non-normality of the whole data.
Fig. 8

Particle physics data: scatter plot of the two largest principal components

Fig. 9

Particle physics data: scatter plot of the two most informative solution projections from \(\hat{J}_{f_2}\)

Jee (1985) also found similar triangular structures using the trace of Jf as the projection index.

5.4 Iris flower data

This well-known data set was analyzed by Fisher and many other researchers. It contains three classes of fifty observations each, where each class refers to a species of iris plant. One class is quite different from the other two. Every observation consists of four measurements: sepal length, sepal width, petal length, and petal width.

The results of the eigenanalysis for the whole iris data are listed in Table 2 (<1 CPU second). According to the critical values based on B = 1000 samples, both tests agree that only the first solution projection is significantly non-normal. The first solution projection explains 52.08% of non-normality of the whole data.
Table 2

Eigenanalysis of \(J_{\hat{f}_2}\) and critical values from 1,000 samples for the whole Iris data

k                                                    1        2        3        4
λk                                                   2.2375   0.7846   0.7464   0.6527
\(\hat{F}_{\lambda_k,0.05}\)                         1.3877   0.9404   0.8170   0.7146
\(\sum_{i=k}^d\lambda_i\)                            4.5573   1.3032   1.3992   0.6527
\(\hat{F}_{S_k,0.05}\)                               4.6259   2.4667   1.5214   0.7146
\(\sum_{i=1}^k\lambda_i/\sum_{i=1}^d\lambda_i\)      0.5208   0.6930   0.8568   1.00

The histogram of the one-dimensional solution projection and the scatter plot of the two-dimensional solution projections are shown in Figs. 10 and 11. The projected data are well separated into two clusters: the first 50 observations (one species) and the remaining 100 observations (the other two species). The 100 observations are also separated into clusters according to their true labels, but the boundary is not as apparent.
Fig. 10

Iris data (150 points). The histogram of the one-dimensional solution projection from \(J_{\hat{f}_2}\)

Fig. 11

Iris data (150 points). The scatter plot of the two most informative solution projections from \(J_{\hat{f}_2}\)

After deleting the first, well-separated class, we applied our eigenanalysis to the remaining 100 observations. The scatter plot of the two-dimensional projections from \(\hat{J}_{\hat{f}_2^*}\) is shown in Fig. 12. The result is similar to that of Friedman and Tukey (1974) and to the two-dimensional Fisher linear discriminant projections (see Fig. 2e of Friedman and Tukey 1974).
Fig. 12

Iris data (100 points). The scatter plot of the two most informative solution projections from \(J_{\hat{f}_2}\)

6 Two final remarks

6.1 Relationship of WNA with PCA

We start with a lemma that will play an important role in our discussion of the relationship of our methods to principal component analysis.

Lemma 1

The unstandardized information matrix \(\tilde{J}_{f}\) obeys the following transformation rule. If X has density f and Y = AX has density g, then
$$ \tilde{J}_{g}=A^{-1}\tilde{J}_{f}A^{-T}. $$

One should notice the inverse relationship between this transformation rule and the covariance matrix rule, Var(Y) = AVar(X)AT. This plays an important role in the relationship between the two.

Principal component analysis (PCA) is an extremely important technique in multivariate analysis. It is based on an eigenanalysis of the covariance matrix Σ or the correlation matrix of the data or model. We find it helpful to think of white noise analysis (WNA) as a second, independent, step in data analysis. This is because the computation of Jf is based on standardized random variables that no longer carry principal component information. It is truly information that lies outside of the covariance matrix.

WNA and PCA operate by inverse transformation laws, as seen in our lemma. This creates an interesting dual relationship between the two. Suppose that we were to carry out a white noise analysis on a data set (or density) without standardizing. As a simple example, suppose the true density f is N(0, Σ). Then the (unstandardized) Fisher information is
$$ \tilde{J}_{f}=\Sigma ^{-1}. $$
That is, a spectral decomposition of the information matrix \(\tilde{J}_{f}\) would provide the same eigenvectors as the PCA, but the ordering of the eigenvectors would be completely inverted. That is, the most important principal components would be deemed the most similar to white noise in WNA. If the two methods were used to select a few variables of interest, they would select variables in exactly the opposite order!
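In matrix form (a restatement of the claim just made): if \(\Sigma=\Gamma\Lambda\Gamma^{T}\) with \(\Lambda=diag(\ell_1\geq\cdots\geq\ell_d)\), then
$$ \tilde{J}_{f}=\Sigma^{-1}=\Gamma\Lambda^{-1}\Gamma^{T}, $$
so the eigenvectors coincide with the principal component directions, while the eigenvalues \(1/\ell_1\leq\cdots\leq 1/\ell_d\) appear in reversed order: the largest-variance direction carries the smallest information eigenvalue.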

One can imagine the implications of this normal example for a more general density. If one does not standardize the information matrix, one creates a situation where the analysis chooses directions that are “interesting” based on a blend of features, with small variances and large information content being prioritized simultaneously.

Clearly PCA has been widely successful. In very high dimensional data, it seems that one might wish to use PCA and WNA in some artful combination. This is a worthy topic for future research.

6.2 Extensions of our method

The Fisher information inequality (Eq. 1) has been the basis of our derivations. A referee has asked how one can generalize the white noise analysis of this paper. Here is one way to do this.

Let g(x; θ) be a parametric class of densities, which we might think of as the null hypothesis. In this paper g(x; θ) was the multi-normal density with mean θ. Then let f(x; θ) be a second class of densities with the same parameter θ, now corresponding to one or more alternative hypotheses that could depend on choice of f. In our analysis, f(x; θ) was the location family of densities having the same covariance matrix as g, and generated by an arbitrary smooth baseline density f.

Our generalization requires the kind of regularity conditions found in the Cramer–Rao lower bound. In addition, a key assumption for the theory is that the score function for g, namely \(v(\theta ;x)=\nabla _{\theta }\log(g(x;\theta )),\) is an unbiased estimating function when used in family f: Ef[v(θ; X)] = 0.

In this paper, the normal score v(θ; x) has been \(\Sigma^{-1}(x-\theta)\) with Σ chosen to be Σf = Varf(X). That is, the normal density g has been matched to the hypothetical f in both location θ and covariance Σf. The Godambe information matrix (Godambe 1960) for the estimating function v(θ; x) is then defined to be
$$ \mathbb{G}_{f}(v)=E_{f}[-\nabla _{\theta }v(\theta ,X)]\{Var_{f}(v(\theta ;X))\}^{-1}E_{f}[-\nabla _{\theta }v(\theta ;X)]^{T}. $$
In this paper, for the specified f and the normal score v, this is \(\Sigma _{f}^{-1}.\) This calculation holds because \(E_{f}[-\nabla _{\theta }v(\theta ,X)]=\Sigma_f^{-1}\). That is, in this example, the score v is information unbiased in the density f (Lindsay 1982). Moreover, in this example, the Godambe information of v in family f is also the Fisher information of v in family g.
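To spell out the claims in this paragraph (a direct computation under the stated choices), with \(v(\theta ;x)=\Sigma _{f}^{-1}(x-\theta )\):
$$ E_{f}[-\nabla _{\theta }v(\theta ,X)]=\Sigma _{f}^{-1},\qquad Var_{f}\big(v(\theta ;X)\big)=\Sigma _{f}^{-1}\Sigma _{f}\Sigma _{f}^{-1}=\Sigma _{f}^{-1}, $$
$$ \mathbb{G}_{f}(v)=\Sigma _{f}^{-1}\big(\Sigma _{f}^{-1}\big)^{-1}\Sigma _{f}^{-1}=\Sigma _{f}^{-1}. $$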
The key result of Godambe (1960) is that the f-score \(u(\theta ;x)=\nabla _{\theta }\log f(x;\theta )\) has the most Godambe information of all estimating functions, and so we have \( \mathbb{G}_{f}(u)\geq \mathbb{G}_{f}(v).\) But from Bartlett’s identity this is equivalent to
$$ \tilde{J}_{f}\geq \mathbb{G}_{f}(v) $$
where \(\tilde{J}_{f}(\theta )\) is the Fisher information in family f. It is also true that this inequality is an equality if and only if the u and v scores are equivalent. In our normal analysis, the Godambe information in v in family f equals the Fisher information of v in the corresponding normal density g, so we have a Fisher information inequality as well.
The last inequality can be turned into a standardized inequality if \(G=\mathbb{G}_{f}(v)\) is nonsingular:
$$ J_{f}=G^{-1/2}\tilde{J}_{f}G^{-1/2}\geq I. $$
In the normal case, this is just the inequality (Eq. 1). From this point, an analysis can follow the theme of this paper. It follows that an eigenanalysis of the standardized information matrix Jf, when estimated over a set of alternative parametric families f(x; θ), provides knowledge about the nature of the deviation of the density f from the null density g, with the largest eigenvalue corresponding to the linear combination of parameters with the greatest increase in information in the f-score, u, over that in the estimating function v based on the g-score.

Of course, there could be significant obstacles to carrying out this analysis in practice with a different null family g(x; θ). We used many particular features of normal densities and location families. In particular, our interpretation of the diagonal terms of the information in terms of the mean information in the conditional density would only hold in certain kernel-based models that are symmetric in the x variable and the θ variable (as the normal density is).

7 Conclusion and future work

In the simulated samples and real data analyses, the projection pursuit method based on the eigenanalysis of \(J_{f_2}\) successfully revealed interesting non-linear structures in fairly high dimensions with a practical sample size. Compared to current projection indices, the matrix \(J_{f_2}\) has been shown to be a rapidly computable and effective projection matrix index. However, we should remain cautious about interpreting the revealed structures, because projection pursuit is only one part of exploratory data analysis, providing the most informative projections for further study.

An issue worthy of further exploration is the potential of white noise analysis in high dimension, low sample size (HDLSS) data. As we have shown, there is no technical problem with a direct application of the method, but there are many issues with tuning the bandwidth parameter H. There are related questions about whether one could combine principal component analysis (PCA) with white noise analysis (WNA) in a profitable way. Much work on PCA in the HDLSS scenario has already been done (Hall et al. 2005; Kazuyoshi and Makoto 2001, 2009; Sungkyu and Marron 1995; Muller et al. 2011). One possible hybrid method for HDLSS would be to use PCA on the original data to produce data of somewhat reduced dimension, and then apply WNA to find a smaller subset of interesting directions within the reduced data.

Rejoinder by the authors

The authors are grateful to Drs. Sen and Ray for their thoughtful comments on our paper.

We agree with most of Dr. Sen’s points, although we have found that the use of the squared density f2 seems to reduce the effect of isolated outlying points, and thereby enhances robustness. As he also notes, there are important questions about how methods such as ours can work well in higher dimensions, partly due to the convergence properties of nonparametric density estimators. It is quite clear that in any asymptotic analysis in which the data dimension goes to infinity, one must have a signal strength going to infinity in order to separate it from the noise. This is an interesting point that deserves further investigation.

Part of Dr. Ray’s discussion bears strongly on the same issues of growing dimension, and will provide a springboard for future investigation of this question. He also asks how our method might be extended to other densities than the normal. In our paper, in Section 6.2, we have provided our thoughts on this point, but they are necessarily only a skeleton idea for what might be done. The points he makes will be valuable in any future development of this idea.

Copyright information

© Indian Statistical Institute 2011