Abstract
The textbook literature of principal components analysis (PCA) dates from a period when statistical computing was much less powerful than it is today and the dimensionality of data sets typically processed by PCA correspondingly much lower. When the formulas in those textbooks involve limiting properties of PCA descriptors, the limit involved is usually the indefinite increase of sample size for a fixed roster of variables. But contemporary applications of PCA in organismal systems biology, particularly in geometric morphometrics (GMM), generally involve much greater counts of variables. The way one might expect pure noise to degrade the biometric signal in this more contemporary context is described by a different mathematical literature concerned with the situation where the count of variables itself increases while remaining proportional to the count of specimens. The founders of this literature established a result of startling simplicity. Consider steadily larger and larger data sets consisting of completely uncorrelated standardized Gaussians (mean zero, variance 1) such that the ratio of variables to cases (the so-called “p/n ratio”) is fixed at a value y. Then the largest eigenvalue of their covariance matrix tends to \((1+\sqrt{y})^2\), the smallest tends to \((1-\sqrt{y})^2\), and their ratio tends to the limiting value \(\bigl((1+\sqrt{y})\bigm/(1-\sqrt{y})\bigr)^2\), whereas in the uncorrelated model both of these eigenvalues and also their ratio should be just 1.0. For \(y={1/4},\) not an atypical value for GMM data sets, this ratio is 9; for \(y={1/2},\) which is still not atypical, it is 34. 
These extrema and ratios, easily confirmed in simulations of realistic size and consistent with real GMM findings in typical applied settings, bear severe negative implications for any technique that involves inverting a covariance structure on shape coordinates, including multiple regression on shape, discriminant analysis by shape, canonical variates analysis of shape, covariance distance analysis from shape, and maximum-likelihood estimation of shape distributions that are not constrained by strong prior models. The theorem also suggests that we should use extreme caution whenever considering a biological interpretation of any Partial Least Squares analysis involving large numbers of landmarks or semilandmarks. I illuminate these concerns with the aid of one simulation, two explicit reanalyses of previously published data, and several little sermons.
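The limiting values quoted in the abstract are indeed easy to confirm in a simulation of realistic size. The following sketch (Python with NumPy; the sample sizes and variable names are illustrative choices, not taken from the paper) draws completely uncorrelated standardized Gaussians with p/n = y = 1/4 and compares the extreme eigenvalues of their sample covariance matrix with the limits \((1+\sqrt{y})^2 = 2.25\) and \((1-\sqrt{y})^2 = 0.25\):

```python
# Marchenko-Pastur extreme-eigenvalue limits for pure noise.
# With p/n = y = 1/4, the largest sample covariance eigenvalue should
# approach (1 + sqrt(y))^2 = 2.25 and the smallest (1 - sqrt(y))^2 = 0.25,
# even though every population eigenvalue is exactly 1.
import numpy as np

rng = np.random.default_rng(seed=1)
n, p = 2000, 500                      # y = p/n = 1/4
X = rng.standard_normal((n, p))       # uncorrelated N(0, 1) "shape data"
S = X.T @ X / n                       # sample covariance matrix (p x p)
eigenvalues = np.linalg.eigvalsh(S)   # sorted in ascending order

y = p / n
print("largest eigenvalue: ", eigenvalues[-1], " limit:", (1 + np.sqrt(y)) ** 2)
print("smallest eigenvalue:", eigenvalues[0], " limit:", (1 - np.sqrt(y)) ** 2)
print("condition number:   ", eigenvalues[-1] / eigenvalues[0], " limit: 9.0")
```

At this sample size the extremes typically land within a few percent of their limits, so the condition number is already close to the limiting value of 9 even though the population covariance is the identity.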
References
Anderson, G. W., Guionnet, A., & Zeitouni, O. (2010). An introduction to random matrices. Cambridge: Cambridge University Press.
Anderson, T. W. (1963). Asymptotic theory for principal component analysis. Annals of Mathematical Statistics, 34, 122–148.
Belsley, D. A., Kuh, E., & Welsch, R. E. (2004). Regression diagnostics: Identifying influential data and sources of collinearity. New York: Wiley.
Bookstein, F. L. (1991). Morphometric tools for landmark data: Geometry and biology. Cambridge: Cambridge University Press.
Bookstein, F. L., Streissguth, A., Sampson, P., Connor, P., & Barr, H. (2002). Corpus callosum shape and neuropsychological deficits in adult males with heavy fetal alcohol exposure. NeuroImage, 15, 233–251.
Bookstein, F. L. (2014). Measuring and reasoning: Numerical inference in the sciences. Cambridge: Cambridge University Press.
Bookstein, F. L. (2015). Integration, disintegration, and self-similarity: characterizing the scales of shape variation in landmark data. Evolutionary Biology, 42, 395–426.
Bookstein, F. L. (2016a). The inappropriate symmetries of multivariate statistical analysis in geometric morphometrics. Evolutionary Biology, 43, 277–313.
Bookstein, F. L. (2016b). Reconsidering “The inappropriateness of conventional cephalometrics”. American Journal of Orthodontics and Dentofacial Orthopedics, 149, 784–797.
Bookstein, F. L. (2017). A method of factor analysis for shape coordinates. American Journal of Physical Anthropology. doi:10.1002/ajpa.23277.
Bookstein, F. L. (2018). A course of morphometrics for biologists. Cambridge: Cambridge University Press (to appear).
Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: A practical information-theoretic approach (2nd ed.). New York: Springer.
Dryden, I. L., & Mardia, K. V. (2016). Statistical shape analysis (2nd ed.). Chichester: Wiley.
Gavrilets, S. (2004). Fitness landscapes and the origin of species. Princeton: Princeton University Press.
Goodall, C. R. (1991). Procrustes methods in the statistical analysis of shape. Journal of the Royal Statistical Society, Series B, 53, 285–339.
Gower, J. C. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53, 325–338.
Grenander, U., & Silverstein, J. W. (1977). Spectral analysis of networks with random topologies. SIAM Journal on Applied Mathematics, 32, 499–519.
Jackson, J. E. (1991). A user’s guide to principal components. New York: Wiley-Interscience.
Jolicoeur, P., & Mosimann, J. E. (1960). Size and shape variation in the painted turtle: A principal component analysis. Growth, 24, 339–354.
Jolliffe, I. T. (2002). Principal component analysis (2nd ed.). New York: Springer.
Jonsson, D. (1982). Some limit theorems for the eigenvalues of a sample covariance matrix. Journal of Multivariate Analysis, 12, 1–38.
Lande, R. (1979). Quantitative genetic analysis of multivariate evolution, applied to brain:body size allometry. Evolution, 33, 402–416.
Leamer, E. L. (1978). Specification searches: Ad hoc inference with nonexperimental data. New York: Wiley.
Marchenko, V. A., & Pastur, L. A. (1967). Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1, 457–483.
Marcus, L. F. (1990). Traditional morphometrics. In F. J. Rohlf & F. L. Bookstein (Eds.), Proceedings of the Michigan Morphometrics Workshop. Ann Arbor: University of Michigan Museums.
Mardia, K. V., Kent, J. T., & Bibby, J. M. (1979). Multivariate analysis. London: Academic Press.
Mitteroecker, P., & Bookstein, F. L. (2009). The ontogenetic trajectory of the phenotypic covariance matrix, with examples from craniofacial shape in rats and humans. Evolution, 63, 727–737.
Mitteroecker, P., & Bookstein, F. L. (2011). Linear discrimination, ordination, and the visualization of selection gradients in modern morphometrics. Evolutionary Biology, 38, 100–114.
Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2, 559–572.
Pigliucci, M. (2012). Landscapes, surfaces, and morphospaces: What are they good for? In E. Svensson & R. Calsbeek (Eds.), The adaptive landscape in evolutionary biology (pp. 26–38). Oxford: Oxford University Press.
Shorack, G. (2000). Probability for statisticians. New York: Springer.
Stein, C. M. (1969). Multivariate analysis I. Technical Report No. 42, Department of Statistics, Stanford University, December 1969. Accessed as report OLK NSF 42 from https://statistics.stanford.edu/resources/technical-reports, 2 Dec 2016.
Theobald, D. L., & Wuttke, D. S. (2008). Accurate structural correlations from maximum likelihood superpositions. PLoS Computational Biology, 4(2), e43.
WolframMathWorld. Wigner’s Semicircle Law. http://mathworld.wolfram.com/WignersSemicircleLaw.html. Accessed 21 Dec 2016.
Wright, S. (1954). The interpretation of multivariate systems. In O. Kempthorne, et al. (Eds.), Statistics and mathematics in biology (pp. 11–33). Ames: Iowa State College Press.
Acknowledgements
Michael Perlman of the Department of Statistics, University of Washington, was a graduate student in the Stanford Department of Statistics when Charles Stein was working on his notes about this theorem. In 2016 he cheerfully filled me in on its history, covering all the years through 2015 during which I had been searching elsewhere for something like it. Jim Rohlf, Stony Brook University, carefully reviewed an early version of this manuscript, suggested several clarifications, and caught several embarrassing errors in drafts of several of the figures; Ioana Dumitriu, University of Washington, located the G. W. Anderson reference for me; and an anonymous reviewer for this journal persuaded me to sharpen the basic argument at many points in the exposition. As ever, I thank Philipp Mitteroecker, University of Vienna, for endless conversations on the general topic of multivariate fallacies and for help with the rhetoric of this knotty exposition in particular.
Appendix
To help resolve the fallacies in the examples, the text frequently refers to the limiting large-sample condition number (the ratio of largest to smallest eigenvalue) as a function of y, the ratio of the count of variables to the count of specimens. For convenience, here is that value, \(\bigl((1+\sqrt{y})\bigm/(1-\sqrt{y})\bigr)^2\), for values of y from 0.01 to 0.99 by hundredths. Note how remarkably nonlinear this function becomes as y approaches 1.0. To retrieve a value from the table, add the row label to the column label; that sum is the value of y corresponding to the limiting condition number in the table cell thus specified. For instance, the limiting condition number of a matrix with four times as many cases as variables (y = 0.25) is 9.00, the entry in the 0.20 row and the 0.05 column. For those who prefer a visual representation, Fig. 14 graphs the logarithms (base e) of the same values.
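Readers who wish to regenerate the table, or to evaluate the formula at values of y between the tabled ones, can do so directly; here is a minimal Python sketch (the function name is my own) that prints the same row-plus-column layout described above:

```python
# Regenerate the appendix table: the limiting condition number
# ((1 + sqrt(y)) / (1 - sqrt(y)))**2 for y from 0.01 to 0.99 by hundredths,
# with row labels 0.0, 0.1, ..., 0.9 and column labels 0.00-0.09
# (row label + column label = y).
import math

def limiting_condition_number(y):
    """Limiting ratio of largest to smallest covariance eigenvalue
    for pure Gaussian noise with variables/specimens ratio y, 0 < y < 1."""
    r = math.sqrt(y)
    return ((1 + r) / (1 - r)) ** 2

print("      " + "".join(f"{c / 100:>9.2f}" for c in range(10)))
for row in range(10):
    cells = []
    for col in range(10):
        y = row / 10 + col / 100
        # y = 0 falls outside the table's range; leave that cell blank.
        cells.append(f"{limiting_condition_number(y):>9.2f}" if 0 < y < 1 else " " * 9)
    print(f"{row / 10:>5.1f} " + "".join(cells))
```

For example, `limiting_condition_number(0.25)` returns exactly 9.0, the entry retrieved in the paragraph above, and `limiting_condition_number(0.5)` returns approximately 33.97, matching the value quoted in the abstract for y = 1/2.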
Cite this article
Bookstein, F.L. A Newly Noticed Formula Enforces Fundamental Limits on Geometric Morphometric Analyses. Evol Biol 44, 522–541 (2017). https://doi.org/10.1007/s11692-017-9424-9