Abstract
The textbook literature of principal components analysis (PCA) dates from a period when statistical computing was much less powerful than it is today and the dimensionality of data sets typically processed by PCA correspondingly much lower. When the formulas in those textbooks involve limiting properties of PCA descriptors, the limit involved is usually the indefinite increase of sample size for a fixed roster of variables. But contemporary applications of PCA in organismal systems biology, particularly in geometric morphometrics (GMM), generally involve much greater counts of variables. The way one might expect pure noise to degrade the biometric signal in this more contemporary context is described by a different mathematical literature concerned with the situation where the count of variables itself increases while remaining proportional to the count of specimens. The founders of this literature established a result of startling simplicity. Consider steadily larger and larger data sets consisting of completely uncorrelated standardized Gaussians (mean zero, variance 1) such that the ratio of variables to cases (the so-called “p/n ratio”) is fixed at a value y. Then the largest eigenvalue of their covariance matrix tends to \((1+\sqrt{y})^2\), the smallest tends to \((1-\sqrt{y})^2\), and their ratio tends to the limiting value \(\bigl((1+\sqrt{y})\bigm/(1-\sqrt{y})\bigr)^2\), whereas in the uncorrelated model both of these eigenvalues and also their ratio should be just 1.0. For \(y={1/4},\) not an atypical value for GMM data sets, this ratio is 9; for \(y={1/2},\) which is still not atypical, it is 34. 
These extrema and ratios, easily confirmed in simulations of realistic size and consistent with real GMM findings in typical applied settings, bear severe negative implications for any technique that involves inverting a covariance structure on shape coordinates, including multiple regression on shape, discriminant analysis by shape, canonical variates analysis of shape, covariance distance analysis from shape, and maximum-likelihood estimation of shape distributions that are not constrained by strong prior models. The theorem also suggests that we should use extreme caution whenever considering a biological interpretation of any Partial Least Squares analysis involving large numbers of landmarks or semilandmarks. I illuminate these concerns with the aid of one simulation, two explicit reanalyses of previously published data, and several little sermons.
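The limiting values quoted in the abstract are indeed easy to confirm in a simulation of realistic size. The following sketch (Python with NumPy; the sample sizes and variable names are illustrative choices, not taken from the paper) draws completely uncorrelated standardized Gaussians with p/n = y = 1/4 and compares the extreme eigenvalues of their sample covariance matrix with the limits \((1+\sqrt{y})^2 = 2.25\) and \((1-\sqrt{y})^2 = 0.25\):

```python
# Marchenko-Pastur extreme-eigenvalue limits for pure noise.
# With p/n = y = 1/4, the largest sample covariance eigenvalue should
# approach (1 + sqrt(y))^2 = 2.25 and the smallest (1 - sqrt(y))^2 = 0.25,
# even though every population eigenvalue is exactly 1.
import numpy as np

rng = np.random.default_rng(seed=1)
n, p = 2000, 500                      # y = p/n = 1/4
X = rng.standard_normal((n, p))       # uncorrelated N(0, 1) "shape data"
S = X.T @ X / n                       # sample covariance matrix (p x p)
eigenvalues = np.linalg.eigvalsh(S)   # sorted in ascending order

y = p / n
print("largest eigenvalue: ", eigenvalues[-1], " limit:", (1 + np.sqrt(y)) ** 2)
print("smallest eigenvalue:", eigenvalues[0], " limit:", (1 - np.sqrt(y)) ** 2)
print("condition number:   ", eigenvalues[-1] / eigenvalues[0], " limit: 9.0")
```

At this sample size the extremes typically land within a few percent of their limits, so the condition number is already close to the limiting value of 9 even though the population covariance is the identity.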
References
Anderson, G. W., Guionnet, A., & Zeitouni, O. (2010). An introduction to random matrices. Cambridge: Cambridge University Press.
Anderson, T. W. (1963). Asymptotic theory for principal component analysis. Annals of Mathematical Statistics, 34, 122–148.
Belsley, D. A., Kuh, E., & Welsch, R. E. (2004). Regression diagnostics: Identifying influential data and sources of collinearity. New York: Wiley.
Bookstein, F. L. (1991). Morphometric tools for landmark data: Geometry and biology. Cambridge: Cambridge University Press.
Bookstein, F. L., Streissguth, A., Sampson, P., Connor, P., & Barr, H. (2002). Corpus callosum shape and neuropsychological deficits in adult males with heavy fetal alcohol exposure. NeuroImage, 15, 233–251.
Bookstein, F. L. (2014). Measuring and reasoning: Numerical inference in the sciences. Cambridge: Cambridge University Press.
Bookstein, F. L. (2015). Integration, disintegration, and self-similarity: characterizing the scales of shape variation in landmark data. Evolutionary Biology, 42, 395–426.
Bookstein, F. L. (2016a). The inappropriate symmetries of multivariate statistical analysis in geometric morphometrics. Evolutionary Biology, 43, 277–313.
Bookstein, F. L. (2016b). Reconsidering “The inappropriateness of conventional cephalometrics”. American Journal of Orthodontics and Dentofacial Orthopedics, 149, 784–797.
Bookstein, F. L. (2017). A method of factor analysis for shape coordinates. American Journal of Physical Anthropology. doi:10.1002/ajpa.23277.
Bookstein, F. L. (2018). A course of morphometrics for biologists. Cambridge: Cambridge University Press (to appear).
Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: A practical information-theoretic approach (2nd ed.). New York: Springer.
Dryden, I. L., & Mardia, K. V. (2016). Statistical shape analysis (2nd ed.). Chichester: Wiley.
Gavrilets, S. (2004). Fitness landscapes and the origin of species. Princeton: Princeton University Press.
Goodall, C. R. (1991). Procrustes methods in the statistical analysis of shape. Journal of the Royal Statistical Society, Series B, 53, 285–339.
Gower, J. C. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53, 325–338.
Grenander, U., & Silverstein, J. W. (1977). Spectral analysis of networks with random topologies. SIAM Journal on Applied Mathematics, 32, 499–519.
Jackson, J. E. (1991). A user’s guide to principal components. New York: Wiley-Interscience.
Jolicoeur, P., & Mosimann, J. E. (1960). Size and shape variation in the painted turtle: A principal component analysis. Growth, 24, 339–354.
Jolliffe, I. T. (2002). Principal component analysis (2nd ed.). New York: Springer.
Jonsson, D. (1982). Some limit theorems for the eigenvalues of a sample covariance matrix. Journal of Multivariate Analysis, 12, 1–38.
Lande, R. (1979). Quantitative genetic analysis of multivariate evolution, applied to brain:body size allometry. Evolution, 33, 402–416.
Leamer, E. L. (1978). Specification searches: Ad hoc inference with nonexperimental data. New York: Wiley.
Marchenko, V. A., & Pastur, L. A. (1967). Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1, 457–483.
Marcus, L. F. (1990). Traditional morphometrics. In F. J. Rohlf & F. L. Bookstein (Eds.), Proceedings of the Michigan Morphometrics Workshop. Ann Arbor: University of Michigan Museums.
Mardia, K. V., Kent, J. T., & Bibby, J. M. (1979). Multivariate analysis. London: Academic Press.
Mitteroecker, P., & Bookstein, F. L. (2009). The ontogenetic trajectory of the phenotypic covariance matrix, with examples from craniofacial shape in rats and humans. Evolution, 63, 727–737.
Mitteroecker, P., & Bookstein, F. L. (2011). Linear discrimination, ordination, and the visualization of selection gradients in modern morphometrics. Evolutionary Biology, 38, 100–114.
Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2, 559–572.
Pigliucci, M. (2012). Landscapes, surfaces, and morphospaces: What are they good for? In E. Svensson & R. Calsbeek (Eds.), The adaptive landscape in evolutionary biology (pp. 26–38). Oxford: Oxford University Press.
Shorack, G. (2000). Probability for statisticians. New York: Springer.
Stein, C. M. (1969). Multivariate analysis I. Technical Report No. 42, Department of Statistics, Stanford University, December 1969. Accessed as report OLK NSF 42 from https://statistics.stanford.edu/resources/technical-reports, 2 Dec 2016.
Theobald, D. L., & Wuttke, D. S. (2008). Accurate structural correlations from maximum likelihood superpositions. PLoS Computational Biology, 4(2), e43.
WolframMathWorld. Wigner’s Semicircle Law. http://mathworld.wolfram.com/WignersSemicircleLaw.html. Accessed 21 Dec 2016.
Wright, S. (1954). The interpretation of multivariate systems. In O. Kempthorne, et al. (Eds.), Statistics and mathematics in biology (pp. 11–33). Ames: Iowa State College Press.
Acknowledgements
Michael Perlman of the Department of Statistics, University of Washington, was a graduate student in the Stanford Department of Statistics when Charles Stein was working on his notes about this theorem. In 2016 he cheerfully filled me in on its history, covering all the years through 2015 during which I had been searching elsewhere for something like it. Jim Rohlf, Stony Brook University, carefully reviewed an early version of this manuscript, suggested several clarifications, and caught several embarrassing errors in drafts of several of the figures; Ioana Dumitriu, University of Washington, located the G. W. Anderson reference for me; and an anonymous reviewer for this journal persuaded me to sharpen the basic argument at many points in the exposition. As ever, I thank Philipp Mitteroecker, University of Vienna, for endless conversations on the general topic of multivariate fallacies and for help with the rhetoric of this knotty exposition in particular.
Appendix
To help resolve the fallacies in the examples, the text frequently refers to the limiting large-sample condition number (the ratio of largest to smallest eigenvalue) as a function of y, the ratio of the count of variables to the count of specimens. For convenience, here is that value, \(\bigl((1+\sqrt{y})\bigm/(1-\sqrt{y})\bigr)^2\), for values of y from 0.01 to 0.99 by hundredths. Note how remarkably nonlinear this function becomes as y approaches 1.0. To retrieve a value from the table, add the row label to the column label; that sum is the value of y corresponding to the limiting condition number in the table cell thus specified. For instance, the limiting condition number of a matrix with four times as many cases as variables (y = 0.25) is 9.00, the entry in the 0.20 row and the 0.05 column. For those who prefer a visual representation, Fig. 14 graphs the logarithms (base e) of the same values.
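Readers who wish to regenerate the table, or to evaluate the formula at values of y between the tabled ones, can do so directly; here is a minimal Python sketch (the function name is my own) that prints the same row-plus-column layout described above:

```python
# Regenerate the appendix table: the limiting condition number
# ((1 + sqrt(y)) / (1 - sqrt(y)))**2 for y from 0.01 to 0.99 by hundredths,
# with row labels 0.0, 0.1, ..., 0.9 and column labels 0.00-0.09
# (row label + column label = y).
import math

def limiting_condition_number(y):
    """Limiting ratio of largest to smallest covariance eigenvalue
    for pure Gaussian noise with variables/specimens ratio y, 0 < y < 1."""
    r = math.sqrt(y)
    return ((1 + r) / (1 - r)) ** 2

print("      " + "".join(f"{c / 100:>9.2f}" for c in range(10)))
for row in range(10):
    cells = []
    for col in range(10):
        y = row / 10 + col / 100
        # y = 0 falls outside the table's range; leave that cell blank.
        cells.append(f"{limiting_condition_number(y):>9.2f}" if 0 < y < 1 else " " * 9)
    print(f"{row / 10:>5.1f} " + "".join(cells))
```

For example, `limiting_condition_number(0.25)` returns exactly 9.0, the entry retrieved in the paragraph above, and `limiting_condition_number(0.5)` returns approximately 33.97, matching the value quoted in the abstract for y = 1/2.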
Cite this article
Bookstein, F.L. A Newly Noticed Formula Enforces Fundamental Limits on Geometric Morphometric Analyses. Evol Biol 44, 522–541 (2017). https://doi.org/10.1007/s11692-017-9424-9