Projection pursuit via white noise matrices

Hui, Guodong; Lindsay, Bruce G.

doi:10.1007/s13571-011-0008-x


Published: 07 June 2011

Volume 72, pages 123–153, (2010)

Sankhya B

Guodong Hui¹ &
Bruce G. Lindsay²


Abstract

Projection pursuit is a technique for locating projections from high- to low-dimensional space that reveal interesting non-linear features of a data set, such as clustering and outliers. The two key components of projection pursuit are the chosen measure of interesting features (the projection index) and its algorithm. In this paper, a white noise matrix based on the Fisher information matrix is proposed for use as the projection index. This matrix index is easily estimated by the kernel method. The eigenanalysis of the estimated matrix index provides a set of solution projections that are most similar to white noise. Application to simulated data and real data sets shows that our algorithm successfully reveals interesting features in fairly high dimensions with a practical sample size and low computational effort.


References

Ahn, J., J. Marron, K. Muller, and Y. Chi 2007. The high-dimension, low-sample-size geometric representation holds under mild conditions. Biometrika 94(3):760.
Azzalini, A., and A. Capitanio 1999. Statistical applications of the multivariate skew normal distribution. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61(3):579–602.
Azzalini, A., and A. Capitanio 2003. Distributions generated by perturbation of symmetry with emphasis on a multivariate skew t-distribution. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65(2):367–389.
Azzalini, A., and A. Valle 1996. The multivariate skew-normal distribution. Biometrika 83(4):715.
Ballam, J., G.B. Chadwick, Z.C.G. Guiragossian, W.B. Johnson, D.W.G.S. Leith, and J. Morigasu. 1971. Van Hove analysis of the reactions π⁻p → π⁻π⁻π⁺p and π⁺p → π⁺π⁺π⁻ at 16 GeV/c. Physical Review 4:1946–1947.
Bowman, A.W., and P.J. Foster. 1993. Adaptive smoothing and density-based tests of multivariate normality. Journal of the American Statistical Association 88(422):529–539.
Calo, D.G. 2007. Gaussian mixture model classification: a projection pursuit approach. Computational Statistics & Data Analysis 52(1):471–482.
Davison, A.C., and D.V. Hinkley. 1997. Bootstrap methods and their application. Cambridge Series in Statistical and Probabilistic Mathematics, No 1. ISBN-10: 0521574714.
Diaconis, P., and D. Freedman. 1984. Asymptotics of graphical projection pursuit. Annals of Statistics 12(3):793–815.
Fraley, C., and A. Raftery 2002. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 97(458):611–631.
Friedman, J.H. 1987. Exploratory projection pursuit. Journal of the American Statistical Association 82(397):249–266.
Friedman, J.H., and J.W. Tukey. 1974. A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers C-23:881–889.
Frühwirth-Schnatter, S., and S. Pyne 2010. Bayesian inference for finite mixtures of univariate and multivariate skew-normal and skew-t distributions. Biostatistics 11(2):317.
Genton, M. 2004. Skew-elliptical distributions and their applications: a journey beyond normality.
Godambe, V.P. 1960. An optimum property of regular maximum likelihood estimation. Annals of Mathematical Statistics 31(4):1208–1211.
Golub, T., D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. Loh, J. Downing, M. Caligiuri, et al. 1999. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439):531.
Hall, P., J.S. Marron, and A. Neeman. 2005. Geometric representation of high dimension, low sample size data. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67(3):427–444.
Huber, P.J. 1985. Projection pursuit. Annals of Statistics 13(2):435–475.
Hui, G.D. 2008. Matrix distances with their application to finding directional deviations from normality in high-dimensional data. PhD Thesis, Pennsylvania State University.
Jee, R.J. 1985. A study of projection pursuit methods. PhD Thesis, Rice University.
Kagan, A. 2001. Another look at Cramer–Rao inequality. The American Statistician 55(3):211–212.
Kagan, A., Yu.V. Linnik, and C.R. Rao. 1973. Characterization problems in mathematical statistics. Wiley Series in Probability and Mathematical Statistics, No 1. ISBN-10: 0471454214.
Kazuyoshi, Y., and A. Makoto. 2001. Effective PCA for high-dimension, low-sample-size data with singular value decomposition of cross data matrix. Journal of Multivariate Analysis 101(9):2060–2077.
Kazuyoshi, Y., and A. Makoto. 2009. PCA consistency for non-Gaussian data in high dimension, low sample size context. Communications in Statistics - Theory and Methods 38(16):2634–2652.
Li, J., S. Ray, and B. Lindsay 2007. A nonparametric statistical approach to clustering via mode identification. Journal of Machine Learning Research 8(8):1687–1723.
Lin, T., J. Lee, and S. Yen 2007. Finite mixture modelling using the skew normal distribution. Statistica Sinica 17(3):909.
Lindsay, B.G. 1982. Conditional score functions: some optimality results. Biometrika 69:503–512.
Lindsay, B.G., M. Markatou, S.R. Ray, K. Yang, and S.C. Chen. 2008. Quadratic distances on probabilities: a unified foundation. Annals of Statistics 36:983–1006.
Melnykov, V., and R. Maitra 2010. Finite mixture models and model-based clustering. Statistics Surveys 4:80–116.
Muller, K.E., Y.-Y. Chi, J. Ahn, and J.S. Marron. 2011. Limitations of high dimension, low sample size principal components for Gaussian data (under revision for resubmission).
Papaioannou, T., and K. Ferentinos. 2005. On two forms of Fisher’s measure of information. Communications in Statistics - Theory and Methods 34:1461–1470.
Posse, C. 1995. Projection pursuit exploratory data analysis. Computational Statistics and Data Analysis 20:669–687.
Ray, S., and B.G. Lindsay 2008. Model selection in high dimensions: a quadratic-risk-based approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70(1):95–118.
Sungkyu, J., and J.S. Marron. 1995. PCA consistency in high dimension, low sample size context. Annals of Statistics 37(6B):4104–4130.
Terrell, G.R. 1995. A Fisher information test for Pearson-family membership. In Proceedings of the statistical computing section, joint statistical meetings, Orlando, Florida, 230–234.


Rejoinder by the authors

The authors are grateful to Drs. Sen and Ray for their thoughtful comments on our paper.

We agree with most of Dr. Sen’s points, although we have found that the use of the squared density $f_2$ seems to reduce the effect of isolated outlying points, and thereby enhances robustness. As he also notes, there are important questions about how methods such as ours can work well in higher dimensions, partly due to the convergence properties of nonparametric density estimators. It is quite clear that in any asymptotic analysis in which the data dimension goes to infinity, one must have a signal strength going to infinity in order to separate it from the noise. This is an interesting point that deserves further investigation.

Part of Dr. Ray’s discussion bears strongly on the same issues of growing dimension, and will provide a springboard for future investigation of this question. He also asks how our method might be extended to densities other than the normal. In Section 6.2 of our paper we have provided our thoughts on this point, but they are necessarily only a skeleton idea of what might be done. The points he makes will be valuable in any future development of this idea.

Author information

Authors and Affiliations

Genzyme Corporation, Framingham, MA, 01702, USA
Guodong Hui
Department of Statistics, Pennsylvania State University, University Park, PA, 16802, USA
Bruce G. Lindsay

Authors

Guodong Hui
Bruce G. Lindsay

Corresponding author

Correspondence to Guodong Hui.

Additional information

Lindsay’s research is supported by the NSF-DMS.

Appendix: Proof of proposition 1

In the standardized situation, we have

$$ J_{\!\!f}-I_d = E\left[\left(\nabla_x \log f(x)+x\right)\left(\nabla_x \log f(x)+x\right)^T \right]. $$

It follows that the $jj$th diagonal entry is zero if and only if $\frac{\partial}{\partial x_j} \log f(x)+x_j=0$ a.s. But $\frac{\partial}{\partial x_j} \log f(x)=\frac{\partial}{\partial x_j} \log f(x_j|x_{-j})$, which proves the result.
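
The displayed identity can be checked in a few lines. The following is a sketch under stated assumptions: $J_{\!\!f} = E[\nabla_x \log f(x)\,\nabla_x \log f(x)^T]$, "standardized" means $E[x]=0$ and $E[xx^T]=I_d$, and $f$ decays fast enough for integration by parts. The score is written in the convention where $\nabla_x \log f(x) = -x$ for the standard normal density. Integration by parts gives

$$ E\left[\frac{\partial}{\partial x_j} \log f(x)\, x_k\right] = \int \frac{\partial f(x)}{\partial x_j}\, x_k \, dx = -\delta_{jk}\int f(x)\,dx = -\delta_{jk}, $$

so that $E[\nabla_x \log f(x)\, x^T] = -I_d$, and expanding the outer product yields

$$ E\left[\left(\nabla_x \log f(x)+x\right)\left(\nabla_x \log f(x)+x\right)^T\right] = J_{\!\!f} - I_d - I_d + I_d = J_{\!\!f} - I_d. $$

Under these assumptions the matrix $J_{\!\!f}-I_d$ is positive semi-definite, and it vanishes exactly when $\nabla_x \log f(x) = -x$ a.s., that is, for the standard normal density.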

Discussion of “Projection pursuit via white noise matrices”

by G. Hui and B. Lindsay

Pranab K. Sen

Cary C. Boshamer Professor of Biostatistics, University of North Carolina

at Chapel Hill, Chapel Hill, NC 27599-7420, USA

e-mail: pksen@bios.unc.edu

The authors are to be commended for an excellent job with this white-noise-matrix approach to projection pursuit, which locates projections from high- to low-dimensional spaces with a view to preserving interesting features, together with appropriate algorithms to enhance statistical data analysis. A white noise matrix, based on the Fisher information matrix, is formulated, and the classical kernel method is incorporated to estimate the projection index. The main thrust of the paper is the adaptability of the computational algorithm to fairly high-dimensional data models with comparatively low computational effort.

The basic feature of this interesting approach is the incorporation of the density $f_2({\bf x})$, related to the basic density $f({\bf x})$ by the formula

$$ f_2({\bf x}) = \frac{f^2({\bf x})}{\int_{\mathbb{R}^p} f^2({\bf y})\,d{\bf y}}. $$

(5)
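
As a quick illustration of (5), consider the case where $f$ is the $N_p({\bf 0}, \Sigma)$ density. This is an example chosen for this note, not one taken from the paper. Squaring the density doubles the exponent:

$$ f^2({\bf x}) \propto \exp\left(-{\bf x}^T \Sigma^{-1} {\bf x}\right) = \exp\left(-\tfrac{1}{2}\,{\bf x}^T \left(\tfrac{1}{2}\Sigma\right)^{-1} {\bf x}\right), $$

so that after normalization $f_2$ is the $N_p({\bf 0}, \Sigma/2)$ density. In this case squaring preserves normality and halves the covariance; more generally, squaring downweights regions where $f$ is small relative to regions where it is large.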

With this, various properties of information matrices, some rather well known, have been incorporated to validate the computational aspects, albeit in an asymptotic setup. In this respect, we may keep in mind that in the general multivariate case, treating the sample mean vector as an estimator of the population mean vector, the Cramer–Rao information inequality stresses that

$$ {\bf V} - {\bf I}^{-1} = \text{positive semi-definite}, $$

(6)

so that equivalently,

$$ {\bf I} - {\bf V}^{-1} = \text{positive semi-definite}. $$

(7)

Thus, if any diagonal element of ${\bf I}$ is equal to the reciprocal of the corresponding diagonal element of ${\bf V}$, then by the Bhattacharyya inequality, the corresponding element of ${\bf V}^{-1}$ must be equal to the reciprocal of the corresponding element of ${\bf V}$, and hence the matrix ${\bf V}$ is diagonal, as was to be expected.

From the methodological point of view, the kernel method of course has the appeal of wide applicability, even in high-dimensional data models. However, the rate of convergence of kernel density estimators in the univariate case is $O(n^{-2/5})$, not the conventional rate $O(n^{-1/2})$, while in the high-dimensional case the rate becomes much slower. Therefore, from a mathematical perspective, there is a qualm: how large must the sample size be to suit exceedingly high-dimensional models? Current advances in multivariate density estimation clearly signal that a drastically large sample size is required to claim asymptotic properties, as are needed in hypothesis testing problems.
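
To give a feel for the sample sizes involved, the sketch below assumes the textbook rate $n^{-2/(d+4)}$ for a second-order kernel density estimator with an optimally chosen bandwidth in $d$ dimensions, which reduces to $n^{-2/5}$ at $d = 1$. The function names and the reference sample size are illustrative choices, not quantities from the paper.

```python
def kde_rate(n: float, d: int) -> float:
    """Assumed convergence rate n^(-2/(d+4)) of a kernel density estimate in d dimensions."""
    return n ** (-2.0 / (d + 4))


def matching_sample_size(n_ref: float, d: int) -> float:
    """Sample size in d dimensions whose assumed rate equals the univariate rate at n_ref.

    Solves n^(-2/(d+4)) = n_ref^(-2/5) for n.
    """
    return n_ref ** ((d + 4) / 5.0)


# Sample sizes needed to match the univariate accuracy obtained with 1,000 points.
sizes = {d: matching_sample_size(1000, d) for d in (1, 2, 6, 10)}
```

Under this assumed rate, matching the univariate accuracy of 1,000 observations already takes about $10^6$ observations at $d = 6$, which is the kind of growth the qualm above refers to.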

Thirdly, there is a subtle emphasis on linear manifolds in projection pursuit methods of dimension reduction. However, in the general case, the lower-dimensional contours need not be hyperplanes. How far, then, can the proposed methods be used in such contour-curve models?

Finally, even in mixture models with Gaussian distributions, the usual parametric approach may not be very robust. In such a case, instead of the highly nonrobust covariance matrix ${\bf V}$, can some robust competitors be used to robustify the procedures and the algorithms therein? Some Huber-type influence function might give a better impression, although that may be harder to exhibit analytically.

Again, I would like to congratulate the authors for a commendable job.

Discussion of “Projection pursuit via white noise matrices”

by G. Hui and B. Lindsay

Surajit Ray

Department of Mathematics and Statistics, Boston University,

Boston, MA 02215, USA

e-mail: sray@math.bu.edu

1.1 Introduction

I would like to commend Guodong Hui and Bruce Lindsay for their excellent article on projection pursuit. The paper approaches projection pursuit from a “residual analysis” perspective. As in the standard projection pursuit problem, the goal is to separate the “interesting” from the “uninteresting” dimensions in a multidimensional dataset. But unlike the standard projection pursuit approach, where one focuses on uncovering the possible dimensions of interest, this paper first looks for the uninteresting dimensions and then designates the remaining dimensions as the interesting ones. There are several advantages to this approach. First, it saves a lot of computational cost to look for “null dimensions” first, especially for data of “large magnitude” both in the dimension of the data vectors and in the number of vectors. Statistical tests for “interesting directions” in projection pursuit approaches are hard to come by. In this paper, Hui and Lindsay provide a test statistic for choosing the “null dimensions”, based on the elements of the observed Fisher information matrix. As the observed Fisher information is an easily computable quantity, it has a huge computational advantage over other methods.

In this discussion I will mostly focus on the advantages of uncovering the “null directions” as the basis of statistical analyses and its possible extensions to solve challenging statistical problems. Hui and Lindsay have successfully used this technique in the context of the projection pursuit approach, by defining the “uninteresting directions” as the directions closest to “normal distributions”. After discarding these “uninteresting directions” they are able to uncover the non-trivial linear and non-linear directions in high-dimensional data. I think this bottom-up approach is an excellent idea, and there exists a large array of statistical problems where one can apply the technique of extraction of “white noise” developed in this paper. Many of the theoretical results and computational advantages developed in this paper should be readily applicable to these problems. I will focus on two interesting areas: the first extends the “white noise” concept to the high-dimensional, low-sample-size scenario, and the second generalizes the concept of “white noise” to accommodate base distributions beyond normals.

1.2 Extension to high-dimensional low sample size framework

In recent times, with the advancement of scientific instruments and techniques, it has become possible to generate huge amounts of data on a single unit; e.g., in gene expression data we can measure the expression of thousands of genes simultaneously. But in many cases the sample size is orders of magnitude smaller than the number of dimensions; e.g., for discriminating types of cancer, the microarray gene expression data described in Golub et al. (1999) has a total of only 76 samples, but for each sample the expression values of 7129 genes are recorded. Such datasets are often referred to as High Dimensional Low Sample Size (HDLSS) data, and their characteristics are profoundly different from the classical scenario, where the number of samples is typically much larger than the number of dimensions. As a result, methodologies developed under the classical framework, e.g. principal component analysis and clustering, are not always readily applicable under the HDLSS framework.

As in the classical scenario, clustering is often an important statistical task for HDLSS datasets. There has been a lot of recent development in the research field of clustering and inference on the number of clusters (see Fraley and Raftery 2002; Li et al. 2007; Ray and Lindsay 2008; Melnykov and Maitra 2010). In this discussion we will focus on clustering of HDLSS data and on how we can make use of the notion of finding “white noise” to detect cluster membership and the number of clusters. We propose using available HDLSS asymptotic results to find structures in datasets that conform to the dimensionality restriction of HDLSS datasets, i.e. d ≫ n, where d increases while n is fixed. Here we use the word “structure” synonymously with homogeneous subsets or clusters of the data. So a dataset with no clusters, i.e. a dataset that is analogous to white noise, should be considered “structureless” by our definition.

1.2.1 Motivation

Hall et al. (2005) and later Ahn et al. (2007) show that in the high-dimension (d), low-sample-size (n) setting the asymptotics take effect as the dimension grows, not the sample size. We start by inspecting the properties of Gaussian data representing “white noise” in high dimensions.

The d-asymptotic approach of Hall et al. (2005) showed that, as d increases, under some regularity conditions, the geometrical structure of the data cloud becomes almost deterministic. In particular, if we generate n standard normal variates $Z_1, Z_2, \ldots, Z_n$ of dimension d, where d ≫ n and d → ∞, then:

$ \parallel Z \parallel =d^{\frac 1 2}+O_p(1)$, i.e. the distance of a sample point from the center is approximately $\sqrt{d}$, so the data cloud lies near the surface of a sphere.
The pairwise distance between sample points is $ \parallel Z_1-Z_2 \parallel =(2d)^{\frac 1 2}+O_p(1)$.
The pairwise angle between sample points, viewed from the mean, is $\angle(Z_1,Z_2) = \pi/2+ O_p(d^{-\frac 1 2})$, i.e. each pair of vectors is nearly perpendicular.
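
The three properties above are easy to check by simulation. The following is a minimal sketch assuming NumPy is available; the function name `hdlss_geometry`, the seed, and the default sizes (taken to match the n = 30, d = 2,000 example later in this discussion) are illustrative choices.

```python
import numpy as np


def hdlss_geometry(n=30, d=2000, seed=0):
    """Simulate n standard normal vectors in d dimensions and return
    their norms, pairwise distances, and pairwise angles at the origin."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, d))

    # Distance of each point from the center (the origin).
    norms = np.linalg.norm(Z, axis=1)

    # All unordered pairs (i, j) with i < j.
    i, j = np.triu_indices(n, k=1)
    pair_dist = np.linalg.norm(Z[i] - Z[j], axis=1)

    # Angle between each pair of vectors, in radians.
    cosines = np.sum(Z[i] * Z[j], axis=1) / (norms[i] * norms[j])
    angles = np.arccos(np.clip(cosines, -1.0, 1.0))
    return norms, pair_dist, angles


norms, pair_dist, angles = hdlss_geometry()
summary = {
    "mean norm / sqrt(d)": norms.mean() / np.sqrt(2000),
    "mean pair distance / sqrt(2d)": pair_dist.mean() / np.sqrt(2 * 2000),
    "mean angle / (pi/2)": angles.mean() / (np.pi / 2),
}
```

Each ratio in `summary` should be close to one, reflecting the near-deterministic simplex structure of the white noise data cloud.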

In summary, the above asymptotic results show that “white noise” HDLSS data acquire their randomness only from a random rotation of the whole simplex generated by the above nearly deterministic relationships between the data vectors. We will utilise these properties to design diagnostic plots for white noise distributions and use them to extract clusters from HDLSS data.

Our goal in this note is to discuss how the approach of Hui and Lindsay can be combined with the geometric structure of HDLSS data to develop a methodology for determining the number of clusters in high-dimensional data. As in Hui and Lindsay, we keep finding subclusters of the data and removing the selected ones until we are left with only “white noise”. In the next section we demonstrate this concept using an example. Providing a formal framework for using it as a model selection tool requires further work and is thus beyond the scope of this discussion.

1.2.2 Illustrative example

We will demonstrate our approach to model selection for the number of clusters in the HDLSS scenario using a simulated example. This dataset was generated with d = 2,000 and n = 30, with four distinct clusters. A heatmap of the simulated data is given in the upper right-hand corner of Fig. 13. The data follow a checkerboard pattern, with the dimensions plotted along the y-axis and the samples plotted along the x-axis. Four sample clusters and five column clusters are easily visible. Here our main goal is to extract the sample clusters and arrive at the correct answer of four.

Classical model selection approaches always predict many more than four clusters, as they fail to accommodate the special geometric structure of HDLSS data. This is easily explained by the fact that in higher dimensions data are so sparse that data points form spurious clusters.

But our approach of using the “white noise” of null directions as a model selection tool readily uses the geometric properties of HDLSS data. Among the three properties presented at the beginning of this section (individual distance, pairwise distance and pairwise angles), we demonstrate our findings for the model selection tool using only the distance statistic. Note that the structures in this dataset are extracted using a Singular Value Decomposition (SVD) analysis of the data matrix. At each stage we remove the direction corresponding to the highest eigenvalue. One can certainly come up with an alternative clustering method that accommodates all the clusters in one step. In Fig. 13, in rows 1 to 5 of the left panel, the first column shows the structure of the most important directions, the second column shows the cumulative directions, and the third column shows the residual structure of the data after removal of each of the directions. The right-hand panel shows the distance statistics along with the correct asymptotic error bounds. Each row shows the effect of removing important directions from the data. Finally, the fifth row shows the plots after removing all four important directions. The distance statistic shows that more than 95% of the distance measures now fall within the bounds. In a typical model selection rule based on this distance measure, one would stop here and arrive at the answer of four important directions. Note that the residual plot corresponding to direction 4 looks like white noise; further extraction of directions keeps the residual as noise, but because of these extra dimensions, after adjusting the error bounds we find more distances outside the limits. This is an indication of overfitting of the model. Overall, if we use the number of points outside the 95% CI of d as the statistic, we clearly see a global dip at four, which is the true number of clusters.
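
The deflation-and-diagnose loop described above can be sketched as follows. This is an illustration under stated assumptions, not the discussant's implementation: the simulated checkerboard (`simulate_checkerboard`, with its cluster sizes and shift) is hypothetical, the noise standard deviation is taken as known and equal to one, and the error band, which the text does not specify, is assumed here to be the white noise band $\sqrt{d} \pm 1.96/\sqrt{2}$ shrunk by each row's leverage, as one would do for regression residuals.

```python
import numpy as np


def simulate_checkerboard(n_per_cluster=(8, 8, 7, 7), d=2000, shift=3.0, seed=0):
    """Hypothetical data: 4 sample clusters, 5 column blocks, unit Gaussian noise."""
    rng = np.random.default_rng(seed)
    block = d // 5
    rows = []
    for c, size in enumerate(n_per_cluster):
        mean = np.zeros(d)
        mean[c * block:(c + 1) * block] = shift  # cluster c is shifted on block c only
        rows.append(np.tile(mean, (size, 1)))
    M = np.vstack(rows)                     # n x d matrix of cluster means
    X = M + rng.standard_normal(M.shape)    # observed data
    return X, M


def distance_diagnostic(X, max_k=6, z=1.96):
    """For k = 0..max_k removed directions, count rows whose residual norm
    falls outside the assumed approximate 95% white noise band."""
    n, d = X.shape
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    outside, residuals = [], []
    for k in range(max_k + 1):
        # Removing the leading singular direction k times in succession
        # is the same as subtracting the rank-k truncated SVD.
        R = X - (U[:, :k] * s[:k]) @ Vt[:k]
        norms = np.linalg.norm(R, axis=1)
        # Assumed adjustment: noise scale shrinks with the row leverage.
        scale = np.sqrt(1.0 - np.sum(U[:, :k] ** 2, axis=1))
        centre = scale * np.sqrt(d)
        half_width = z * scale / np.sqrt(2.0)
        outside.append(int(np.sum(np.abs(norms - centre) > half_width)))
        residuals.append(R)
    return outside, residuals


X, M = simulate_checkerboard()
outside, residuals = distance_diagnostic(X)
```

The list `outside` plays the role of the "number of points outside the bounds" statistic. Whether it dips at the true number of clusters depends on the signal strength and on how well the assumed band matches the residual distribution, so the band should be treated as a placeholder for properly derived asymptotic bounds.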

As mentioned before, the above example just shows how one can use the null directions in a non-classical framework to extract important directions and design model selection tools based on the statistics. On the other hand, Hui and Lindsay’s work provides a formal framework for using null directions in a projection pursuit framework. The open question is: can one use Hui and Lindsay’s framework to design a formal model selection method, based on the HDLSS asymptotics, for clustering HDLSS data?

1.3 Uncovering white-T and skewed-T noises

In this section we discuss another possible extension of Hui and Lindsay’s approach. There is growing interest in the literature in parametric families of multivariate distributions that represent, in some sense, departures from the multivariate normal family. In the recent past, skew-normal (Azzalini and Valle 1996; Azzalini and Capitanio 1999; Lin et al. 2007) and skew-t distributions (Azzalini and Capitanio 2003; Frühwirth-Schnatter and Pyne 2010; Genton 2004) have proved to be useful for capturing skewness and kurtosis, avoiding the need for transformations. For example, finite mixtures of such distributions have been considered as a more general tool for handling heterogeneous data involving asymmetric behaviors across subpopulations. Multivariate versions of the skewed distributions are more challenging, but they allow robust modeling of high-dimensional multimodal and asymmetric data generated by popular biotechnological platforms such as flow cytometry.

In notation a continuous random vector Y(p ×1) is said to have a multivariate skew normal distribution if its p.d.f. is given by

$$ f(y;\mu,\Sigma,D) = \frac 1 {\Phi(0;I+D \Sigma D')} \phi_p(y;\mu,\Sigma) \Phi[D(y-\mu)], $$

(8)

where $\mu\in \mathbb R^p$, $\Sigma > 0$, and $\phi_p(\cdot;\mu,\Sigma)$ and $\Phi(\cdot;\Sigma)$ denote the p.d.f. and the c.d.f. of the p-dimensional normal distribution with mean $\mu$ and covariance matrix $\Sigma$. Here $D$ is the $p \times p$ skewness parameter.
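
For readers who want to evaluate the density in Eq. 8 numerically, the sketch below assumes SciPy is available and reads the notation as follows: $\Phi(0; I + D\Sigma D')$ is the zero-mean p-dimensional normal c.d.f. with covariance $I + D\Sigma D'$ evaluated at the origin, and $\Phi[D(y-\mu)]$ is the c.d.f. of $N_p(0, I)$, which factorises into univariate normal c.d.f.s. That reading, and the function name `skew_normal_pdf`, are assumptions of this note.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm


def skew_normal_pdf(y, mu, Sigma, D):
    """Evaluate the multivariate skew normal density of Eq. 8 at the rows of y."""
    y = np.atleast_2d(np.asarray(y, dtype=float))        # shape (m, p)
    mu = np.atleast_1d(np.asarray(mu, dtype=float))      # shape (p,)
    Sigma = np.atleast_2d(np.asarray(Sigma, dtype=float))
    D = np.atleast_2d(np.asarray(D, dtype=float))
    p = mu.size

    # Normalising constant Phi(0; I + D Sigma D'), computed numerically.
    const = multivariate_normal.cdf(
        np.zeros(p), mean=np.zeros(p), cov=np.eye(p) + D @ Sigma @ D.T
    )
    # Symmetric base density phi_p(y; mu, Sigma).
    base = multivariate_normal.pdf(y, mean=mu, cov=Sigma)
    # Skewing factor: assumed N_p(0, I) c.d.f. at D(y - mu), a product of
    # univariate standard normal c.d.f.s.
    skew = np.prod(norm.cdf((y - mu) @ D.T), axis=1)
    return base * skew / const
```

With p = 1, $\Sigma = 1$ and $D = \delta$, the normalising constant equals 1/2 and the expression reduces to the familiar univariate form $2\phi(y)\Phi(\delta y)$; with $D = 0$ it reduces to the normal density itself.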

Hui and Lindsay’s current approach can uncover “non-normal” directions, as the smoothing step does not introduce normality and the test statistic is designed as a contrast against the “white noise” Fisher information matrix. Can the same set of tools be used to uncover the multivariate T or multivariate skewed T, given in Eq. 8, as the “white noise” distribution?


Hui, G., Lindsay, B.G. Projection pursuit via white noise matrices. Sankhya B 72, 123–153 (2010). https://doi.org/10.1007/s13571-011-0008-x


Received: 27 July 2009
Revised: 04 January 2011
Accepted: 18 January 2011
Published: 07 June 2011
Issue Date: November 2010
DOI: https://doi.org/10.1007/s13571-011-0008-x
