# Projection pursuit via white noise matrices


DOI: 10.1007/s13571-011-0008-x

Cite this article as: Hui, G. & Lindsay, B.G. Sankhya B (2010) 72: 123. doi:10.1007/s13571-011-0008-x


## Abstract

Projection pursuit is a technique for locating projections from high- to low-dimensional space that reveal interesting non-linear features of a data set, such as clustering and outliers. The two key components of projection pursuit are the chosen measure of interesting features (the projection index) and the algorithm used to optimize it. In this paper, a white noise matrix based on the Fisher information matrix is proposed for use as the projection index. This matrix index is easily estimated by the kernel method. The eigenanalysis of the estimated matrix index provides a set of solution projections that are most similar to white noise. Applications to simulated and real data sets show that our algorithm successfully reveals interesting features in fairly high dimensions with a practical sample size and low computational effort.

### Keywords

Projection pursuit · Fisher information matrix · Eigenanalysis

## 1 Introduction

This paper is concerned with the construction and analysis of white noise matrices for use in analyzing high dimensional data. The goal is to identify a subspace of white noise projections, which, by definition, are marginally normal and independent of all orthogonal projections. For a reduced dimension data analysis, one could then discard the white noise subspace (or the subspace most similar to white noise) and use the remaining orthogonal projections, which we call the informative projections, to look for interesting relationships.

The methodology is closely related to classical projection pursuit (Friedman and Tukey 1974; Huber 1985). Projection pursuit is a technique that explores high dimensional data by examining the marginal distributions of low dimensional linear projections. The two basic components of projection pursuit are its index and its algorithm. The projection index is designed to measure how “interesting” or “uninteresting” the features are. Usually it is a distance between the marginal distribution of the data projection in a direction and some “uninteresting distribution” for that marginal. According to the central limit theorem, a projection, being essentially a linear combination of variables, tends to be normal under some regularity conditions (Diaconis and Freedman 1984). A normal distribution is elliptically symmetric and has the least information (Fisher information, negative entropy) for a fixed variance. So, based on both theoretical and empirical evidence, researchers have reached the consensus that normality best represents the notion of “uninterestingness” (Diaconis and Freedman 1984; Huber 1985).

The fundamental conceptual difference between classical projection pursuit and our approach is that the former searches for the least normal projections to use as the data summaries, while we seek to find the projections that are the most similar to white noise in order to discard them and use the remaining orthogonal projections. In either case, however, the output is a selected set of interesting projections.

Regardless of which approach seems most natural, ours offers a major computational advantage because it can be carried out using an ordinary eigendecomposition of an easily computed matrix. In contrast, most projection pursuit algorithms suffer from high computational cost because, in order to find the optimal projection, the projection index must be calculated or estimated for a large set of possible projections. As the dimension increases, the computational cost increases exponentially. For Gaussian mixture model classification, Calo (2007) developed a forward/backward projection pursuit algorithm that avoids the “curse of dimensionality,” but we have not seen a general solution.

Our starting point for the construction of white noise matrices is the standardized Fisher information matrix *J*_{f} for a density. The projections we will call informative are the ones with the largest Fisher information. For reasons of computation and robustness, instead of *J*_{f}, we will estimate the projection index \(J_{f_2},\) the standardized Fisher information matrix for the density square transformed distribution: \(f_2(x)=f^2(x)/\int f^2(y)dy.\) The least normal projection from this new projection index can be estimated algebraically just as in principal component analysis, provided the matrix is standardized (i.e., linear effects are removed). One only needs to estimate the matrix measure \(J_{f_2}\) by kernel density estimation, and then do an eigenanalysis for the estimated matrix.

If an eigenvalue of *J*_{f} or \(J_{f_2}\) reaches a specified lower bound, the corresponding linear projection is exactly a **white noise coordinate**, which is marginally normally distributed and is independent of all orthogonal solution projections. In practice, from the eigenanalysis, one can select large eigenvalues to find the most informative linear projections for future study, or from a converse but equivalent point of view, one could find and discard the least informative linear projections corresponding to small eigenvalues.

This structure from the eigenanalysis of \(J_{f_2}\) is similar to the result of Posse’s method, which has been shown to be better than Friedman’s algorithm (Friedman 1987).

The remainder of the paper is organized as follows. In Section 2, we introduce the standardized Fisher information matrix *J*_{f} and interpret the results of its eigenanalysis. The new projection index \(J_{f_2}\) is then developed in order to eliminate certain computational and statistical problems that arise in estimating *J*_{f}. In Section 3 we study how to detect white noise coordinates using the eigenvalues of \(J_{f_2}\). Section 4 discusses the selection of the bandwidth. In Section 5, the new algorithm is applied to several simulated and real data sets. We summarize the performance and discuss future work in Section 6.

## 2 Standardized Fisher information matrix

### 2.1 Introduction of standardized Fisher information matrix

Let *X* = (*X*_{1}, *X*_{2},..., *X*_{d}) be a *d*-dimensional random vector with the density function *f*(*x*), mean *μ* and covariance matrix *V*_{f}. We will assume that the following *regularity condition* holds: the density *f* is continuously differentiable on an open set of support. It follows that *V*_{f} is nonsingular.

**Definition 1**

The Fisher information matrix for density *f*(*x*) is defined to be \(\widetilde{J}_f=\int\frac{\nabla_x f \cdot\nabla_x f^T}{f}dx,\) where \(\nabla_x f(x)=(\frac{\partial}{\partial x_1}f,...,\frac{\partial}{\partial x_d}f)^T\). The standardized Fisher information matrix for density *f*(*x*) is defined to be \(J_{f}=V_{f}^{1/2}\cdot\Bigl(\int\frac{\nabla_x f \cdot\nabla_x f^T}{f}dx\Bigr)\cdot V_{f}^{1/2},\) where \(V_{f}^{1/2}\) is the symmetric square root of the covariance matrix.

In the case *d* = 1, *J*_{f} or \(\widetilde{J}_f\) is called the Fisher information number (Terrell 1995; Papaioannou and Ferentinos 2005). In the standardized Fisher information matrix, we take the derivative with respect to *x* rather than with respect to the parameters, as in the ordinary Fisher information matrix. So it can also be viewed as a measure of the information in the density itself, not in a location parameter. Kagan (2001) demonstrated the connection between the Fisher information for a density and the Fisher information for parameters. We now use this idea to generate a special information-matrix based goodness-of-fit methodology for the multivariate normal.
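As a quick numerical illustration of Definition 1 in the case *d* = 1 (a sketch, not from the paper; the choice *μ* = 1, *σ* = 2 is arbitrary): the Fisher information number of *N*(*μ*, *σ*²) is 1/*σ*², so its standardized version equals the normal lower bound 1.

```python
# Numerical check: for a univariate normal N(mu, sigma^2), the Fisher
# information number is 1/sigma^2, so the standardized version
# sigma^2 * (1/sigma^2) equals the lower bound 1.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def fisher_information_number(pdf, dpdf, lo=-20.0, hi=20.0):
    """J~ = integral of (f')^2 / f over the support."""
    val, _ = quad(lambda x: dpdf(x) ** 2 / pdf(x), lo, hi)
    return val

mu, sigma = 1.0, 2.0
pdf = lambda x: norm.pdf(x, mu, sigma)
dpdf = lambda x: -(x - mu) / sigma**2 * norm.pdf(x, mu, sigma)

J_tilde = fisher_information_number(pdf, dpdf)
J_standardized = sigma**2 * J_tilde     # V^{1/2} J~ V^{1/2} in d = 1
print(J_tilde, J_standardized)          # ~0.25 and ~1.0
```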

The key result is the matrix inequality

\(J_{f}\geq I_{d}, \qquad (1)\)

where *A*_{1} ≥ *A*_{2} means that *A*_{1} − *A*_{2} is a positive semi-definite matrix. (See Kagan et al. (1973) for an early version of this result, albeit univariate.) The inequality (1) becomes an equality if and only if *f* is a normal density function. Note that the regularity condition is needed, as is evident from considering a uniform density on the unit hypercube, which would have nominal information *J*_{f} = 0. Note also that (1) implies that all the eigenvalues of *J*_{f} are one or larger.

Without loss of generality, in the theoretical development we assume that *X* has been standardized: *μ* = 0 and *V*_{f} = *I*_{d}. Otherwise, one can standardize *X* using \(Y=V_{f}^{-\frac{1}{2}}(X-\mu).\)

### 2.2 Conditional normality interpretation

This section gives a conditional interpretation of the diagonal terms of *J*_{f}. The *i*th diagonal term of *J*_{f} can be expressed as

\(\bigl(J_{f}\bigr)_{ii}=\int J_{X_i|X_{-i}}\,f(x_{-i})\,dx_{-i},\)

where *x*_{ − i} = (*x*_{1},..., *x*_{i − 1}, *x*_{i + 1},..., *x*_{d}), and \(J_{X_i|X_{-i}}\) is the Fisher information for the conditional distribution *f*(*x*_{i}|*x*_{ − i}). That is, the *i*th diagonal term of *J*_{f} is not the Fisher information of the *i*th marginal distribution of *X*, but rather a weighted average of the Fisher information in *x*_{i} when conditioned on the rest of the uncorrelated variables, where the weight is the density function *f*(*x*_{ − i}).

**Proposition 1**

*Let J*_{f}* be the standardized Fisher information matrix of X* = (*X*_{1}, *X*_{2},...,*X*_{d})*. When the i**th diagonal term of J*_{f}* reaches the lower bound 1, X*_{i}* is marginally normal and independent of X*_{ − i}* with probability one over X*_{ − i}*’s distribution.*

Our proof is in the Appendix. See also Section 6.2 for a way to extend the results in this paper beyond the normal application. At their core, all such results are a version of the Cramer–Rao inequality. The proposition indicates that *X*_{i}|*x*_{ − i} is standard normal for any *x*_{ − i}, so we can conclude that *X*_{i} is marginally normal and independent of all the other variables. We then call *X*_{i} a **white noise coordinate**.

If the *i*th diagonal term of *J*_{f} is bigger than one, some combination of non-normality and dependence on *X*_{ − i} exists. Since *X* has been standardized, all the *X*_{i}’s are uncorrelated, so any dependence must arise from a nonlinear structural relationship, clustering, or other forms of dependence that can occur despite zero correlation. Large diagonal values correspond to variables *X*_{i} whose conditional distributions contain more Fisher information than the normal.

### 2.3 Eigenanalysis of *J*_{f}

In this section, we present how to use an eigenanalysis of *J*_{f} to find the most informative projections. The following result is the basis for the eigenanalysis of *J*_{f}.

**Proposition 2**

*Let A be a d* × *d nonsingular matrix and Y* = *AX. Then*

\(J_{g}=V_{g}^{1/2}\,A^{-T}\,\widetilde{J}_{f}\,A^{-1}\,V_{g}^{1/2},\qquad V_{g}=AV_{f}A^{T},\)

*where g*(*y*) *is the density of Y*, *V*_{g} *is the covariance matrix of g, and J*_{g} *is the standardized Fisher information matrix for g.*

The first consequence of this proposition is that the standardization \(A=V_{f}^{-\frac{1}{2}}\) leaves the standardized Fisher information unchanged.

Next consider an orthogonal transformation *Z* = Γ*Y*, where the orthogonal matrix \(\Gamma =[\gamma_{1},\gamma _{2},...,\gamma_{d}]^T\) has rows \(\gamma_i^T\), and the density of *Z* is *h*(*z*) = *g*(Γ^{T}*z*). These transformations preserve the standardized structure and so do not create any new linear relationships. The Fisher information matrix of *Z* has the form \(J_h=\Gamma J_{f}\Gamma^{T}=\Bigl(\gamma_i^TJ_{f}\gamma_j\Bigr).\) Now suppose that *γ*_{1},...,*γ*_{d} are the eigenvectors from an eigenanalysis of *J*_{g} ( = *J*_{f}), and that the corresponding ordered eigenvalues are *λ*_{1} ≥ *λ*_{2} ≥ ... ≥ *λ*_{d}. It is then clear that *J*_{h} is a diagonal matrix, with the eigenvalues *λ*_{1}, *λ*_{2},...,*λ*_{d} down the diagonal. The projection \(Z_1=\gamma_1^TY\) maximizes the diagonal term of *J*_{h}; the maximal value is just *λ*_{1}. According to the discussion of Section 2.2, the projection \(Z_1=\gamma_1^TY\) from the eigenanalysis of *J*_{f} has the **least conditional normality** (most conditional information) conditioned on all the other orthogonal uncorrelated variables. The value *λ*_{1} is a measure of its informativeness. By a similar analysis, \(Z_i=\gamma_i^TY\) has the most conditional information among all projections that are uncorrelated with *Z*_{1},...,*Z*_{i − 1}.
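The diagonalization step can be sketched as follows; the matrix `J` is a hypothetical standardized information matrix for illustration, not one estimated from data.

```python
# Sketch: for any symmetric "information" matrix J, the eigenvector matrix
# Gamma (rows gamma_1^T, ..., gamma_d^T) diagonalizes it:
# Gamma J Gamma^T = diag(lambda_1 >= ... >= lambda_d).
import numpy as np

J = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.5, 0.3],
              [0.0, 0.3, 1.0]])          # hypothetical J_f for illustration

lam, vecs = np.linalg.eigh(J)            # eigh returns ascending eigenvalues
order = np.argsort(lam)[::-1]            # sort descending, as in the paper
lam, vecs = lam[order], vecs[:, order]
Gamma = vecs.T                           # rows are gamma_1^T, ..., gamma_d^T

J_h = Gamma @ J @ Gamma.T                # information matrix of Z = Gamma Y
print(np.round(J_h, 6))                  # diagonal, eigenvalues on diagonal
```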

### 2.4 Interpreting the eigenvalues

Let *λ*_{1} ≥ *λ*_{2} ≥ ... ≥ *λ*_{d} be the eigenvalues of *J*_{f}. If *λ*_{i + 1} = 1, then *λ*_{j} = 1 for all *j* ≥ *i* + 1. According to Proposition 1, *Z*_{i + 1},...,*Z*_{d} are then **white noise coordinates**. We assume that they can be discarded, as they are not only marginally normal, but also independent of the remaining orthogonal variables. Let *d* − *N*(*f*) be the number of white noise coordinates, so that *N*(*f*) is the number of informative coordinates. In other words, the projections (*Z*_{1},...,*Z*_{N(f)}) are sufficient for further analysis.

When the smallest eigenvalue *λ*_{d} > 1, there is no exact white noise subspace to discard. If we were to use the projections (*Z*_{1},...,*Z*_{N(f)}), we can say that we have picked the *N*(*f*) most conditionally informative projections or, equivalently, that we have discarded the subspace of projections that is **most similar** to white noise among all linear projections, in the sense of having the least conditional information.
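A minimal sketch of this selection rule, with hypothetical eigenvalues rather than ones estimated from data:

```python
# Sketch: reading off N(f) from ordered eigenvalues of J_f. Eigenvalues at
# the lower bound 1 correspond to white noise coordinates; the rest are
# informative.
import numpy as np

def n_informative(eigenvalues, lower_bound=1.0, tol=1e-6):
    """Number of informative coordinates N(f): eigenvalues above the bound."""
    lam = np.sort(np.asarray(eigenvalues))[::-1]
    return int(np.sum(lam > lower_bound + tol))

lam = [3.2, 1.8, 1.0, 1.0]        # two informative, two white noise coords
print(n_informative(lam))         # 2
```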

### 2.5 Transforming to \(J_{f_2}\)

In the last section, we showed that an eigenanalysis of *J*_{f} provides a white noise decomposition. Unfortunately, *J*_{f} does not have an explicit form for most population models, even for a mixture of normals. Further, if we estimate *J*_{f} by replacing the density *f* with a kernel density estimate \(\hat{f}\), the integration will not have an explicit form because of the density \(\hat{f}\) in the denominator of the integral. Monte Carlo or numerical integration would be required to calculate these measures.

Let *S* be a random variable with the density \(\frac{f^2(s)}{\int f^2(y)dy}:=f_2(s),\) where *f*(*s*) is the density of *X*. The density square transformation preserves some good properties of the original density *f*:

1. The ordering of the density values is unchanged: *f*(*x*_{1}) < *f*(*x*_{2}) ⇔ *f*_{2}(*x*_{1}) < *f*_{2}(*x*_{2}), and *f*(*x*_{1}) = *f*(*x*_{2}) ⇔ *f*_{2}(*x*_{1}) = *f*_{2}(*x*_{2}); hence both densities have the same contour lines.
2. The number and locations of the density modes are unchanged. This is a consequence of the first property.
3. The density *f*_{2} accentuates the peaks of a unimodal density *f* relative to the tails. (This is clear from examining density ratios: if *f*(*x*_{2})/*f*(*x*_{1}) > 1, then *f*_{2}(*x*_{2})/*f*_{2}(*x*_{1}) > *f*(*x*_{2})/*f*(*x*_{1}).)
4. It preserves normality: *X* is normal if and only if *S* is normal, \(X\sim N(0,\Sigma)\Leftrightarrow S\sim N(0,\Sigma/2).\)
5. As a consequence of Property 4, the density square transformation also preserves the white noise subspace. (Suppose that the density has a white noise subspace in coordinates *k* + 1,...,*d*. Then the density factors as *f*(*x*_{1},...,*x*_{k})*ϕ*(*x*_{k + 1},...,*x*_{d}). Clearly the squared density has the same factorization, and *ϕ*^{2} is again normal.)
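Property 4 is easy to verify numerically. The sketch below (with an arbitrary σ = 1.7) checks that squaring and renormalizing the *N*(0, σ²) density gives exactly the *N*(0, σ²/2) density:

```python
# Numerical check of Property 4: the density-square transform of
# N(0, sigma^2) is exactly N(0, sigma^2 / 2).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

sigma = 1.7
f = lambda x: norm.pdf(x, 0.0, sigma)
c, _ = quad(lambda y: f(y) ** 2, -30, 30)   # normalizing constant of f^2
f2 = lambda x: f(x) ** 2 / c                # the density-square transform

x = np.linspace(-5, 5, 201)
target = norm.pdf(x, 0.0, sigma / np.sqrt(2))
err = np.max(np.abs(f2(x) - target))
print(err)                                   # ~0 up to integration error
```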

Substituting *f*_{2} into Eq. 1 provides the corresponding lower-bound inequality for \(J_{f_2}\), with equality exactly in the normal case. Hence \(J_{f_2}\) is a non-normality matrix measure for *Y*: *Y* ∼ *f*_{2}, and also a non-normality matrix measure for *X* ∼ *f*, since the density square transformation preserves normality. A key advantage of \(J_{f_2}\) over *J*_{f} is that it is easy to estimate using the kernel method. It also helps to make the methodology more robust to the presence of outliers. (We are unaware of any previous use of the squared density transformation to regularize a statistical analysis.)
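To see why kernel estimation is easy here (our own algebra, consistent with the definitions above): since \(f_2\propto f^2\), we have \(\nabla f_2 = 2f\,\nabla f/\int f^2(y)dy\), so the information integrand for *f*_{2} contains no density in a denominator: \(\widetilde{J}_{f_2}=4\int\nabla f\,\nabla f^T dx\big/\int f^2(y)dy.\) A numerical check for *f* = *N*(0, 1), where *f*_{2} = *N*(0, 1/2) has Fisher information number 2:

```python
# Check: for f = N(0,1), 4 * ∫ (f')^2 dx / ∫ f^2 dy equals the Fisher
# information number of f_2 = N(0, 1/2), namely 2.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

f = lambda x: norm.pdf(x)
df = lambda x: -x * norm.pdf(x)

num, _ = quad(lambda x: df(x) ** 2, -20, 20)   # ∫ (f')^2 dx
den, _ = quad(lambda y: f(y) ** 2, -20, 20)    # ∫ f^2 dy
J2 = 4.0 * num / den
print(J2)                                      # ~2.0
```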

Given a sample *X*_{1},...,*X*_{n}, the kernel density estimator is \(\hat{f}_H(x)=\frac{1}{n}\sum_{i=1}^n|H|^{-1/2}K_d\bigl(H^{-1/2}(x-X_i)\bigr),\) where *K*_{d} is the kernel function and *H* is the bandwidth matrix. We assume that *H* is positive definite. In this paper, we use the normal density as the kernel because it will not introduce any non-normality into the problem, and it enables explicit computation formulas. Substituting \(\hat{f}_H(x)\) for *f*(*x*) in \(J_{f_2},\) one can derive an explicit form of the estimator for \(J_{f_2}.\)

The estimator is well defined no matter how small *H* is. This implies that one can apply the method even when *n* is less than *d*. All these estimators are V-statistics, so are consistent under some assumptions involving a large sample size *n* and a vanishing bandwidth *H*. However, it is preferable to view the problem with *H* fixed as *n* goes to infinity. In this case the kernel estimator \(\hat{J}_{f_2}=J_{\hat{f}_2}\) is a direct measure of the non-normality of the kernel-smoothed distribution \(f_2^*(x)=(f^*(x))^2\big/\int(f^*(y))^2dy,\) where \(f^*\) denotes the kernel-smoothed version of *f*(*x*) for any fixed *H*. Note that \(\hat{J}_{f_2}\) is also a consistent estimator of \(J_{f_2^*}\) without *H* going to zero. This is an important aspect for applying the method in higher dimensions. From this point of view our regularity condition is also guaranteed to hold, as Gaussian smoothing ensures it is true for \(f_2^*\).

This double-smoothing idea (smooth the model and the data with the same kernel) has at least two advantages. First, suppose we view \(J_{f^*_2}\) as a function on distributions, say *T*(*F*). Then the estimator \(\hat{J}_{f_2}\) is equal to \(T(\hat{F}_n),\) where \(\hat{F}_n\) is the empirical distribution. This fact can be used to derive the asymptotic properties of the estimator \(\hat{J}_{f_2}\) (see Hui 2008; Lindsay et al. 2008 for details). Second, the eigenanalysis of \(\hat{J}_{f_2}\) correctly estimates *N*(*f*) for any fixed *H* when the sample size is large enough.

Although it is true that asymptotically all values of *H* lead one to finding the true white noise coordinates, it is also true that the choice of *H* has some importance in the sensitivity of the method. We will discuss this further in Section 4.

## 3 White noise detection and testing

Let *X* = (*X*_{1}, *X*_{2},..., *X*_{d}) be a standardized d-dimensional random vector with the density function *f*(*x*), and *S* be the random variable with the density *f*_{2}. Suppose the eigenvalues of \(J_{f_2}\) are ordered: *λ*_{1} ≥ *λ*_{2} ≥ ... ≥ *λ*_{d} with corresponding eigenvectors *γ*_{1}, *γ*_{2},...,*γ*_{d}. Suppose the solution projection \(P=(\gamma _{1}^TS,...,\gamma _{d}^TS)\) has the density *h*_{2}. The Fisher information \(J_{h_2}\) is then the diagonal matrix \(J_{h_2}=diag(\lambda_1,\lambda_2,...\lambda_d).\) The eigenvalue *λ*_{i} is the measure of the non-normality of the corresponding projection \(P_i=\gamma_i^TS.\) According to Eq. 4 and Proposition 1, when the eigenvalue *λ*_{k} reaches a lower bound 0.25, the eigenvalues *λ*_{i} = 0.25, *i* ≥ *k*, and the corresponding projections \(P_i=\gamma_i^TS\) are white noise coordinates, which are standard normal and independent of the other projections. Although the variable *S* is defined by the squared density, we will use the corresponding projections of the original variable *X* when we do plots, motivated by our preceding discussion.

In this section, we propose a sequential test to detect the true white noise coordinates within the solution projections from the eigenanalysis of \(J_{f_2}.\) First we test the null hypothesis that all eigenvalues are equal to 0.25, that is, *H*_{0}: *N*(*f*) = 0. The alternative hypothesis is that the largest eigenvalue *λ*_{1} > 0.25. If we reject the null hypothesis, we consider the next hypothesis: *H*_{0}: *N*(*f*) = 1, i.e., *H*_{0}: *λ*_{2} = *λ*_{3} = ... = *λ*_{d} = 0.25, vs *H*_{a}: *N*(*f*) > 1, i.e., *H*_{a}: *λ*_{2} > 0.25. We propose to continue in this fashion until we fail to reject.

### 3.1 Test procedure

For the general null hypothesis *H*_{0}: *N*(*f*) = *k* − 1, i.e., *H*_{0}: *λ*_{k} = *λ*_{k + 1} = ... = *λ*_{d} = 0.25, vs the alternative hypothesis *H*_{a}: *N*(*f*) > *k* − 1, i.e., *H*_{a}: *λ*_{k} > 0.25, we propose two different test statistics: \(\hat{\lambda}_k\) and \(\hat{S}_k=\sum_{j=k}^d\hat{\lambda}_j.\)

The next challenge is to find suitable critical values. We do so by means of a hybrid parametric-nonparametric bootstrap (Davison and Hinkley 1997). Under the null hypothesis *H*_{0}: *N*(*f*) = *k* − 1, the projections (*P*_{k},*P*_{k + 1},...,*P*_{d}) are white noise coordinates. Suppose *Z*_{k},*Z*_{k + 1},...,*Z*_{d} are independent standard normal variables. Then the variable vector \(P^*=(P_1,P_2,...,P_{k-1},Z_k, Z_{k+1},...,Z_d)\) should have approximately the same distribution as the solution projection *P* = (*P*_{1}, *P*_{2},..., *P*_{k}, *P*_{k + 1},..., *P*_{d}).

For a fixed sample size *n* and dimensionality *d*, we draw *B* = 1000 random samples of size *n* from a (*d* − *k* + 1)-dimensional standard normal distribution. For every sample \((Z^{ib}_k,Z^{ib}_{k+1},...,Z^{ib}_d)\), *i* = 1, 2,..., *n*, *b* = 1, 2,..., *B*, we use the data \((P^{i}_1,P^{i}_2,...,P^{i}_{k-1},Z^{ib}_k,Z^{ib}_{k+1},...,Z^{ib}_d)\), *i* = 1, 2,..., *n*, to estimate the Fisher information matrix \(\hat{J}^b_{f_2},\) *b* = 1, 2,..., *B*. The *j*th eigenvalue of \(\hat{J}^b_{f_2}\) gives a resampled value \(\lambda^b_j,\) *k* ≤ *j* ≤ *d*. We construct the empirical distributions of \(\lambda^b_k\) and \(S^b_k=\sum_{j=k}^d\lambda^b_j\) using the *B* = 1000 samples, and then estimate the critical values \(\hat{F}_{\lambda_k,0.05}\) and \(\hat{F}_{S_k,0.05}\). If the estimate \(\hat{\lambda}_k\) (respectively \(\hat{S}_k\)) is less than the critical value \(\hat{F}_{\lambda_k,0.05}\) (respectively \(\hat{F}_{S_k,0.05}\)), we fail to reject the null hypothesis *H*_{0}: *N*(*f*) = *k* − 1, i.e., *λ*_{k} = *λ*_{k + 1} = ... = *λ*_{d} = 0.25; that is, we conclude that there are at least *d* − *k* + 1 white noise coordinates.
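The resampling loop can be sketched as below. Here `toy_stat` is a deliberately simple stand-in statistic (the paper's statistic, the *k*th eigenvalue of \(\hat{J}_{f_2}\), requires the closed-form kernel estimator, which we do not reproduce); the sketch shows only the hybrid structure of holding the informative columns fixed while redrawing the tested columns as standard normal.

```python
# Skeleton of the hybrid parametric-nonparametric bootstrap. `stat` stands
# in for the k-th eigenvalue of the estimated J_{f_2}. Under H0: N(f)=k-1,
# columns k-1..d-1 (0-based) are redrawn as standard normal each replicate.
import numpy as np

def bootstrap_critical_value(P, k, stat, B=1000, alpha=0.05, rng=None):
    """0.95-quantile of `stat` under H0 that the last d-k+1 columns are noise."""
    rng = np.random.default_rng(rng)
    n, d = P.shape
    vals = np.empty(B)
    for b in range(B):
        Pb = P.copy()
        Pb[:, k - 1:] = rng.standard_normal((n, d - k + 1))
        vals[b] = stat(Pb, k)
    return np.quantile(vals, 1.0 - alpha)

# Toy statistic (largest squared excess kurtosis among the tested columns),
# a placeholder for the paper's eigenvalue statistic.
def toy_stat(P, k):
    Z = P[:, k - 1:]
    m2 = np.mean(Z ** 2, axis=0)
    m4 = np.mean(Z ** 4, axis=0)
    return np.max((m4 / m2 ** 2 - 3.0) ** 2)

rng = np.random.default_rng(0)
P = rng.standard_normal((300, 3))                     # pure white noise data
crit = bootstrap_critical_value(P, k=1, stat=toy_stat, B=200, rng=1)
print(crit > 0.0)                                     # True
```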

After a dimension reduction that removes all white noise coordinates, all remaining projections are significantly non-normal. Then, similar to principal component analysis, one can use the cumulative proportion of eigenvalues \(\sum_{i=1}^k\lambda_i/\sum_{i=1}^d\lambda_i\) as an index of the fraction of the non-normality of the data explained by the selected projections *P*_{1},...,*P*_{k}.

### 3.2 Test validity

If we are testing *N*(*f*) = 0, then the above bootstrap testing procedure is exactly the parametric bootstrap for this testing problem. If we are testing *N*(*f*) = *k*, *k* ≥ 1, however, the null hypothesis is now semi-parametric, as we do not specify the distribution of the informative coordinates.

One way to evaluate the effectiveness of our bootstrap procedure when *N*(*f*) > 0 is to compare the p-values obtained from it with those we would obtain by parametric bootstrapping (Monte Carlo) from the true null distribution, as we do in the following example.

### Example 1

Let *X* = (*X*_{1}, *X*_{2}), where *X*_{1} ∼ 0.5*N*(2, 1) + 0.5*N*( − 2, 1), *X*_{2} ∼ *N*(0, 1), and *X*_{1} ⊥ *X*_{2}. Notice *N*(*f*) = 1. We draw *R* = 100 samples from *X*: \((X_1^{ir}, X_2^{ir}), i=1,2,...,n=300; r=1,2,...,R=100.\) For every sample, we fix the first estimated projection \(P_1^{r},\) draw random normal samples \(P_2^{i,r,b}, b=1,2,...,B=1000,\) estimate the eigenvalues \((\lambda_1^{rb},\lambda_2^{rb}),\) and calculate the p-value of the *r*th replicate under *H*_{0}: *N*(*f*) = 1. The corresponding p-value of the *r*th replicate under the true distribution of \((\lambda_1^{b},\lambda_2^{b})\) was also estimated using the empirical distribution. Figure 3 shows that the p-values from the bootstrap procedure are very highly correlated with those from the empirical distribution.

### Example 2

Let *X* = (*X*_{1}, *X*_{2}, *X*_{3}), where \(X_1\sim 0.5N(2,1)+0.5N(-2,1),\) *X*_{2} ∼ 0.5*N*(1, 1) + 0.5*N*( − 1, 1), *X*_{3} ∼ *N*(0, 1), and *X*_{1}, *X*_{2}, *X*_{3} are independent. Notice that *N*(*f*) = 2. Suppose we draw *R* = 100 samples from *X*: \((X_1^{ir}, X_2^{ir}, X_3^{ir}), i=1,2,...,n=500; r=1,2,...,R.\) First we consider the hypothesis *H*_{0}: *N*(*f*) = 2. The p-values from the bootstrap procedure and the empirical distribution are highly correlated, although there is some scatter (Fig. 4). Similarly, for the false null hypothesis *N*(*f*) = 1, we can calculate the p-values using the bootstrap procedure and the empirical distribution. Figure 5 shows that the p-values from the bootstrap procedure are much closer to those from the empirical distribution, since only one projection *P*_{1} is fixed. However, it also shows that there is very little power to detect the true alternative in this case. Note that although *X*_{2} is a mixture of normals, it is a symmetric unimodal mixture.

## 4 Selection of *H*

For a fixed sample size *n* and dimensionality *d*, it is important but also very challenging to choose the most appropriate bandwidth. Too small a bandwidth leads to a very spiky density estimator, which is less normal than the true density, even when the sample really is from the true distribution. Too large a bandwidth leads to over-smoothing, which will make any non-normal data look more normal. Since standardizing the data leaves the standardized Fisher information matrix unchanged, we propose to use *H* = *hI*_{d} for the standardized data (Σ = *I*_{d}).

A natural default choice is the normal reference bandwidth *H*_{opt}, which is optimal for the density estimator \(\hat{f}\) when the true density is normal, according to the minimal mean integrated squared error criterion. In our experience, *H*_{opt} usually provides good results in finding informative projections. There is a good reason for this: projection pursuit based on \(\hat{J}_{f_2}\) is not very sensitive to the choice of *H*. This is not surprising given that the eigenanalysis of \(\hat{J}_{f_2}\) is consistent for *N*(*f*) regardless of *H*, provided that the sample size *n* is large enough. We should note, however, that *H*_{opt} cannot be used when *n* is smaller than *d*, as the sample covariance is singular. In this case, one could use a diagonal matrix *H*, either a multiple of the identity or a diagonal matrix with coordinate standard deviations on the diagonal.

## 5 Examples

In Section 5.1, we apply the eigenanalysis of the estimated matrix \(\hat{J}_{f_2}\) to simulated data sets to investigate its power in detecting known non-normal structures. Section 5.2 considers a high dimension, low sample size simulation, and Sections 5.3 and 5.4 apply the algorithm to real data sets to compare its performance with classical methods. We summarize the advantages and possible problems of our projection pursuit method in Section 6. All programs were implemented in Matlab 7.0.1.

### 5.1 Normal mixture model

The first simulated data set is drawn from a three-component bivariate normal mixture with common covariance matrix *I*_{2}. The mean vectors are (5, 5), ( − 5, − 5) and (5, − 5). The transformed distribution after standardization is still a three-component normal mixture, and the standardization will not change the possible optimal directions. The theoretical eigenvectors of \(J_{f_2}\) are \((\frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}})\) for the largest eigenvalue and \((\frac{1}{\sqrt{2}},-\frac{1}{\sqrt{2}})\) for the smallest eigenvalue. So the projection index \(J_{f_2}\) picks the projection \(P_1=(X_1+X_2)/\sqrt{2}.\) This projection generates a marginal density that is a tri-modal mixture of three normals. Alternatively, the orthogonal projection \(P_2=(X_2-X_1)/\sqrt{2}\) is closest to white noise; this marginal is a bimodal mixture of two normals.

The above examples also support the conjecture that the optimal projections from the matrix index \(J_{f_2},\) which have the least conditional normality, also tend to have poor marginal normality.
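The geometry of the Section 5.1 example can be checked directly: projecting the three mixture means onto the two eigendirections shows why one marginal is trimodal and the other bimodal (a sketch; the signs of the eigenvectors are arbitrary).

```python
# The three mixture means project to three distinct locations along
# gamma_1 (trimodal marginal) but only two along gamma_2 (bimodal).
import numpy as np

means = np.array([[5.0, 5.0], [-5.0, -5.0], [5.0, -5.0]])
g1 = np.array([1.0, 1.0]) / np.sqrt(2)    # most informative direction
g2 = np.array([-1.0, 1.0]) / np.sqrt(2)   # direction closest to white noise

proj1 = means @ g1                        # P1 = (X1 + X2) / sqrt(2)
proj2 = means @ g2                        # P2 = (X2 - X1) / sqrt(2)
print(np.unique(np.round(proj1, 6)).size)   # 3 -> trimodal marginal
print(np.unique(np.round(proj2, 6)).size)   # 2 -> bimodal marginal
```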

### 5.2 High dimension low sample size

Next we consider the case where *d* is larger than *n*, the so-called high dimension, low sample size (HDLSS) scenario. This is a rich question that we here address only with a simple simulation experiment. Our results are promising, but admittedly incomplete. We generated *n* = 25 observations in *d* = 100 dimensional space. The first coordinate is from the normal mixture *f*(*x*_{1}) = 0.5*ϕ*(*x*_{1}, 0, 1) + 0.5*ϕ*(*x*_{1}, 5, 1). All other coordinates were generated as white noise. We use *H* = *I*_{d} instead of *H*_{opt}, since the latter is not positive definite in this scenario. The results from 100 replications of this experiment showed that the eigenanalysis of \(J_{\hat{f}_2}\) always puts the most weight on the first dimension (the mixture coordinate). The mean of the absolute weight assigned to the first coordinate was 0.7292, whereas the mean for the second largest coordinate was 0.1079. We discuss the HDLSS problem further in our discussion section. Figure 7 shows a typical result.

### 5.3 Particle physics data

We now turn to real data sets. The first data set, having 500 observations, was derived from a high-energy particle physics scattering experiment (Ballam et al. 1971; Friedman and Tukey 1974; Jee 1985). In the nuclear reaction, a positively charged pi-meson becomes a proton, two positively charged pi-mesons and a negatively charged pi-meson. Every observation consists of seven independent measurements.

The table below gives the eigenanalysis results. The sum of the estimated eigenvalues \(\hat{S}_1=\sum_{k}\hat{\lambda}_k=3.9876\) is less than the critical value 4.6306, so on the basis of this test, we would not reject the hypothesis that all solution projections are normal.

Eigenanalysis of \(J_{\hat{f}_2}\) and critical values from 1,000 samples for particle physics data

| *k* | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| \(\hat{\lambda}_k\) | 0.7697 | 0.7053 | 0.6492 | 0.5196 | 0.5055 | 0.4453 | 0.3930 |
| \(\hat{F}_{\lambda_k,0.05}\) | 0.7170 | 0.6978 | 0.6685 | 0.4069 | 0.5069 | 0.4095 | 0.3420 |
| \(\hat{S}_k=\sum_{i=k}^d\hat{\lambda}_i\) | 3.9876 | 3.2179 | 2.5126 | 1.8634 | 1.3438 | 0.8383 | 0.3930 |
| \(\hat{F}_{S_k,0.05}\) | 4.6306 | 3.8902 | 3.1485 | 2.3925 | 1.4689 | 0.8001 | 0.3420 |
| \(\sum_{i=1}^k\hat{\lambda}_i/\sum_{i=1}^d\hat{\lambda}_i\) | 0.1930 | 0.3699 | 0.5327 | 0.6630 | 0.7898 | 0.9014 | 1.00 |

Jee (1985) also found similar triangular structures using the trace of *J*_{f} as projection index.

### 5.4 Iris flower data

This well-known data set was analyzed by Fisher and many other researchers. It contains three classes of fifty observations each, where each class refers to a species of iris plant. One class is quite different from the other two. Every observation consists of four measurements: sepal length, sepal width, petal length, and petal width.

Using critical values estimated from *B* = 1000 bootstrap samples, both tests agree that only the first solution projection is significantly non-normal. The first solution projection explains 52.08% of the non-normality of the whole data.

Eigenanalysis of \(J_{\hat{f}_2}\) and critical values from 1,000 samples for the whole Iris data

| *k* | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| \(\hat{\lambda}_k\) | 2.2375 | 0.7846 | 0.7464 | 0.6527 |
| \(\hat{F}_{\lambda_k,0.05}\) | 1.3877 | 0.9404 | 0.8170 | 0.7146 |
| \(\hat{S}_k=\sum_{i=k}^d\hat{\lambda}_i\) | 4.5573 | 1.3032 | 1.3992 | 0.6527 |
| \(\hat{F}_{S_k,0.05}\) | 4.6259 | 2.4667 | 1.5214 | 0.7146 |
| \(\sum_{i=1}^k\hat{\lambda}_i/\sum_{i=1}^d\hat{\lambda}_i\) | 0.5208 | 0.6930 | 0.8568 | 1.00 |

## 6 Two final remarks

### 6.1 Relationship of WNA with PCA

We start with a lemma that will play an important role in our discussion of the relationship of our methods to principal component analysis.

**Lemma 1**

*The unstandardized information matrix* \(\tilde{J}_{f}\) *obeys the following transformation rule. If X has density f and Y* = *AX has density g, then*

\(\tilde{J}_{g}=A^{-T}\,\tilde{J}_{f}\,A^{-1}.\)

One should notice the inverse relationship between this transformation rule and the covariance matrix rule, *Var*(*Y*) = *AVar*(*X*)*A*^{T}. This plays an important role in the relationship between the two.
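A numerical sketch of this inverse relationship for a normal density, where \(\tilde{J}_f=\Sigma^{-1}\) (the particular Σ and *A* below are arbitrary):

```python
# Check: for f = N(0, Sigma), J~_f = Sigma^{-1}; under Y = AX the rule
# J~_g = A^{-T} J~_f A^{-1} mirrors Var(Y) = A Var(X) A^T inversely.
import numpy as np

Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
A = np.array([[1.0, 0.5], [0.0, 2.0]])

J_f = np.linalg.inv(Sigma)                         # info of N(0, Sigma)
Var_Y = A @ Sigma @ A.T                            # covariance rule
J_g = np.linalg.inv(A).T @ J_f @ np.linalg.inv(A)  # Lemma 1 rule

print(np.allclose(J_g, np.linalg.inv(Var_Y)))      # True
```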

Principal component analysis (PCA) is an extremely important technique in multivariate analysis. It is based on an eigenanalysis of the covariance matrix Σ or the correlation matrix of the data or model. We find it helpful to think of white noise analysis (WNA) as a second, independent step in data analysis. This is because the computation of *J*_{f} is based on standardized random variables that no longer carry principal component information. It is truly information that lies outside of the covariance matrix.

Consider a simple example in which the density *f* is *N*(0, Σ). Then the (unstandardized) Fisher information is \(\tilde{J}_{f}=\Sigma^{-1}.\)

One can imagine the implications of this normal example on a more general density. If one does not standardize the information matrix one would create a situation where the analysis would choose directions that were “interesting” based on a blend of features, with small variances and large information content being prioritized simultaneously.

Clearly PCA has been widely successful. In very high dimensional data, it seems that one might wish to use PCA and WNA in some artful combination. This is a worthy topic for future research.

### 6.2 Extensions of our method

The Fisher information inequality (Eq. 1) has been the basis of our derivations. A referee has asked how one can generalize the white noise analysis of this paper. Here is one way to do this.

Let *g*(*x*; *θ*) be a parametric class of densities, which we might think of as the null hypothesis. In this paper *g*(*x*; *θ*) was the multi-normal density with mean *θ*. Then let *f*(*x*; *θ*) be a second class of densities with the same parameter *θ*, now corresponding to one or more alternative hypotheses that could depend on choice of *f*. In our analysis, *f*(*x*; *θ*) was the location family of densities having the same covariance matrix as *g*, and generated by an arbitrary smooth baseline density *f*.

Our generalization requires the kind of regularity conditions found in the Cramer–Rao lower bound. In addition, a key assumption for the theory is that the score function for *g*, namely \(v(\theta ;x)=\nabla _{\theta }\log(g(x;\theta )),\) is an unbiased estimating function when used in family *f*: *E*_{f}[*v*(*θ*; *X*)] = 0.

In our analysis, *v*(*θ*; *x*) has been Σ^{−1}(*x* − *θ*) with Σ chosen to be Σ_{f} = *Var*_{f}(*X*). That is, the normal density *g* has been matched to the hypothetical *f* in both location *θ* and covariance Σ_{f}. The Godambe information matrix (Godambe 1960) for estimating function *v*(*θ*; *x*) will then be defined to be

\[\mathbb{G}_{f}(v)=E_{f}\big[-\nabla_{\theta}v(\theta;X)\big]^{T}\,\mathrm{Var}_{f}\big(v(\theta;X)\big)^{-1}\,E_{f}\big[-\nabla_{\theta}v(\theta;X)\big].\]

For our *f* and the normal score *v*, this is \(\Sigma _{f}^{-1}\). This calculation holds because \(E_{f}[-\nabla _{\theta }v(\theta ;X)]=\Sigma_f^{-1}\) and \(\mathrm{Var}_{f}\big(v(\theta ;X)\big)=\Sigma_f^{-1}\). That is, in this example, the score *v* is information unbiased in the density *f* (Lindsay 1982). Moreover, in this example, the Godambe information of *v* in family *f* is also the Fisher information of *v* in family *g*.
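This information unbiasedness can be checked by simulation for a non-normal *f*. In the sketch below, *f* is a centered exponential (one-dimensional for simplicity, so Σ_{f} is a scalar), and both the Monte Carlo estimate of *Var*_{f}(*v*) and the resulting Godambe information recover Σ_{f}^{−1}; all names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(1.0, size=200_000) - 1.0  # non-normal f: mean 0, variance 1
sigma2 = 1.0                                  # Var_f(X), matched by construction

theta = 0.0
v = (x - theta) / sigma2                      # normal score matched to f

D = 1.0 / sigma2                              # E_f[-dv/dtheta], exact here
V = v.var()                                   # Var_f(v), Monte Carlo estimate
godambe = D ** 2 / V                          # scalar Godambe information

# Information unbiasedness: Var_f(v) and the Godambe information
# should both be close to 1/sigma2.
assert abs(V - 1.0 / sigma2) < 0.02
assert abs(godambe - 1.0 / sigma2) < 0.02
```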

By the theory of optimal estimating functions, the *f*-score \(u(\theta ;x)=\nabla _{\theta }\log f(x;\theta )\) has the most Godambe information of all unbiased estimating functions, and so we have \( \mathbb{G}_{f}(u)\geq \mathbb{G}_{f}(v).\) But from Bartlett’s identity this is equivalent to the Fisher information inequality \(\tilde{J}_{f}\geq \Sigma _{f}^{-1}\) in *f*. It is also true that this inequality is an equality if and only if the *u* and *v* scores are equivalent. In our normal analysis, the Godambe information of *v* in family *f* equals the Fisher information of *v* in the corresponding normal density *g*, so we have a Fisher information inequality as well.
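The inequality can be verified in closed form for a one-dimensional example: a Laplace(0, *b*) density has location Fisher information 1/*b*² and variance 2*b*², so its information strictly exceeds the matched normal bound 1/(2*b*²), reflecting its non-normality. A sketch (the scale *b* is arbitrary):

```python
b = 1.7                        # Laplace scale, chosen arbitrarily
fisher_info = 1.0 / b**2       # location Fisher information of Laplace(0, b)
variance = 2.0 * b**2          # Var of Laplace(0, b)
normal_bound = 1.0 / variance  # Sigma_f^{-1}: information of the matched normal

# Strict inequality, and the ratio is exactly 2 for any Laplace scale.
assert fisher_info > normal_bound
assert abs(fisher_info / normal_bound - 2.0) < 1e-12
```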

The matrix *J*_{f}, when estimated over a set of alternative parametric families *f*(*x*; *θ*), provides knowledge about the nature of the deviation of the density *f* from the null density *g*, with the largest eigenvalue corresponding to the linear combination of parameters with the greatest increase in information in the *f*-score, *u*, over that in the estimating function *v* based on the *g*-score.

Of course, there could be significant obstacles to carrying out this analysis in practice with a different null family *g*(*x*; *θ*). We used many particular features of normal densities and location families. In particular, our interpretation of the diagonal terms of the information in terms of the mean information in the conditional density would only hold in certain kernel-based models that are symmetric in the *x* variable and the *θ* variable (as the normal density is).

## 7 Conclusion and future work

In the simulated samples and real data analysis, the projection pursuit method based on the eigenanalysis of \(J_{f_2}\) successfully revealed interesting non-linear structures in fairly high dimensions with a practical sample size. Compared to current projection indices, the matrix \(J_{f_2}\) has been shown to be a rapidly computable and effective projection matrix index. However, we should remain cautious about interpreting the revealed structures, because projection pursuit is only one part of exploratory data analysis, providing the most informative projections for further study.

An issue worthy of further exploration is the potential of white noise analysis in high dimension, low sample size (HDLSS) data. As we have shown, there is no technical problem with a direct application of the method, but there are many issues with tuning the bandwidth parameter *H*. There are related questions about whether one could combine principal component analysis (PCA) with white noise analysis (WNA) in a profitable way. Much work on PCA in the HDLSS scenario has already been done (Hall et al. 2005; Kazuyoshi and Makoto 2001, 2009; Sungkyu and Marron 1995; Muller et al. 2011). One possible hybrid method for HDLSS would be to use PCA on the original data to produce data with somewhat reduced dimension, and then apply our WNA to find a smaller subset of interesting directions within the reduced data from PCA.
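Such a hybrid pipeline might be sketched as follows; the WNA step is left as a stub, since it depends on the kernel estimate of \(J_{\hat{f}_2}\), and the function name `pca_reduce` and the data dimensions are our own illustrative choices.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Center X and project it onto its top principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Illustrative HDLSS-flavoured data: n = 50 points in d = 200 dimensions.
rng = np.random.default_rng(2)
X = rng.standard_normal((50, 200))

# Step 1: PCA down to a dimension where kernel density estimation is feasible.
Z = pca_reduce(X, n_components=10)
assert Z.shape == (50, 10)

# Step 2 (not shown): standardize Z, estimate J_f2 by the kernel method,
# and eigenanalyze it to separate informative from white noise directions.
```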

## Rejoinder by the authors

The authors are grateful to Drs. Sen and Ray for their thoughtful comments on our paper.

We agree with most of Dr. Sen’s points, although we have found that the use of the squared density *f*_{2} seems to reduce the effect of isolated outlying points, and thereby enhances robustness. As he also notes, there are important questions about how methods such as ours can work well in higher dimensions, partly due to the convergence properties of nonparametric density estimators. It is quite clear that in any asymptotic analysis in which the data dimension goes to infinity, one must have a signal strength going to infinity in order to separate it from the noise. This is an interesting point that deserves further investigation.

Part of Dr. Ray’s discussion bears strongly on the same issues of growing dimension, and will provide a springboard for future investigation of this question. He also asks how our method might be extended to other densities than the normal. In our paper, in Section 6.2, we have provided our thoughts on this point, but they are necessarily only a skeleton idea for what might be done. The points he makes will be valuable in any future development of this idea.