A Newly Noticed Formula Enforces Fundamental Limits on Geometric Morphometric Analyses

Focal Reviews · Evolutionary Biology

Abstract

The textbook literature of principal components analysis (PCA) dates from a period when statistical computing was much less powerful than it is today and the dimensionality of data sets typically processed by PCA correspondingly much lower. When the formulas in those textbooks involve limiting properties of PCA descriptors, the limit involved is usually the indefinite increase of sample size for a fixed roster of variables. But contemporary applications of PCA in organismal systems biology, particularly in geometric morphometrics (GMM), generally involve much greater counts of variables. The way one might expect pure noise to degrade the biometric signal in this more contemporary context is described by a different mathematical literature concerned with the situation where the count of variables itself increases while remaining proportional to the count of specimens. The founders of this literature established a result of startling simplicity. Consider steadily larger and larger data sets consisting of completely uncorrelated standardized Gaussians (mean zero, variance 1) such that the ratio of variables to cases (the so-called “p/n ratio”) is fixed at a value y. Then the largest eigenvalue of their covariance matrix tends to \((1+\sqrt{y})^2\), the smallest tends to \((1-\sqrt{y})^2\), and their ratio tends to the limiting value \(\bigl((1+\sqrt{y})\bigm/(1-\sqrt{y})\bigr)^2\), whereas in the uncorrelated model both of these eigenvalues and also their ratio should be just 1.0. For \(y={1/4},\) not an atypical value for GMM data sets, this ratio is 9; for \(y={1/2},\) which is still not atypical, it is 34. These extrema and ratios, easily confirmed in simulations of realistic size and consistent with real GMM findings in typical applied settings, bear severe negative implications for any technique that involves inverting a covariance structure on shape coordinates, including multiple regression on shape, discriminant analysis by shape, canonical variates analysis of shape, covariance distance analysis from shape, and maximum-likelihood estimation of shape distributions that are not constrained by strong prior models. The theorem also suggests that we should use extreme caution whenever considering a biological interpretation of any Partial Least Squares analysis involving large numbers of landmarks or semilandmarks. I illuminate these concerns with the aid of one simulation, two explicit reanalyses of previously published data, and several little sermons.
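The extreme eigenvalues quoted above are easy to check numerically. The following minimal sketch (not from the article; it assumes NumPy and the illustrative sample sizes n = 400 with p = 100 or p = 200) draws pure-noise standardized Gaussian data, computes the sample covariance matrix, and compares its extreme eigenvalues and condition number with the Marchenko–Pastur limits \((1\pm\sqrt{y})^2\).

```python
# Minimal sketch (not the article's own code): verify the Marchenko-Pastur extremes
# for pure-noise Gaussian data with p/n = y. Sample sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)

def extreme_eigenvalues(n, p):
    """Largest and smallest eigenvalues of the sample covariance of n x p pure noise."""
    X = rng.standard_normal((n, p))   # uncorrelated standardized Gaussians (mean 0, variance 1)
    S = np.cov(X, rowvar=False)       # p x p sample covariance matrix
    eigvals = np.linalg.eigvalsh(S)   # eigenvalues in ascending order
    return eigvals[-1], eigvals[0]

for n, p in [(400, 100), (400, 200)]:  # y = 1/4 and y = 1/2
    y = p / n
    lmax, lmin = extreme_eigenvalues(n, p)
    print(f"y = {y:.2f}: observed extremes {lmax:.2f}, {lmin:.2f}; "
          f"limits {(1 + y**0.5)**2:.2f}, {(1 - y**0.5)**2:.2f}; "
          f"condition number {lmax / lmin:.1f} vs limit {((1 + y**0.5) / (1 - y**0.5))**2:.1f}")
```

At these sample sizes the observed extremes already lie close to the limiting values, so the condition numbers of roughly 9 and 34 cited in the abstract emerge from data containing no signal at all.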

References

  • Anderson, G. W., Guionnet, A., & Zeitouni, O. (2010). An introduction to random matrices. Cambridge: Cambridge University Press.

  • Anderson, T. W. (1963). Asymptotic theory for principal component analysis. Annals of Mathematical Statistics, 34, 122–148.

  • Belsley, D. A., Kuh, E., & Welsch, R. E. (2004). Regression diagnostics: Identifying influential data and sources of collinearity. New York: Wiley.

  • Bookstein, F. L. (1991). Morphometric tools for landmark data: Geometry and biology. Cambridge: Cambridge University Press.

  • Bookstein, F. L., Streissguth, A., Sampson, P., Connor, P., & Barr, H. (2002). Corpus callosum shape and neuropsychological deficits in adult males with heavy fetal alcohol exposure. NeuroImage, 15, 233–251.

  • Bookstein, F. L. (2014). Measuring and reasoning: Numerical inference in the sciences. Cambridge: Cambridge University Press.

  • Bookstein, F. L. (2015). Integration, disintegration, and self-similarity: Characterizing the scales of shape variation in landmark data. Evolutionary Biology, 42, 395–426.

  • Bookstein, F. L. (2016a). The inappropriate symmetries of multivariate statistical analysis in geometric morphometrics. Evolutionary Biology, 43, 277–313.

  • Bookstein, F. L. (2016b). Reconsidering “The inappropriateness of conventional cephalometrics”. American Journal of Orthodontics, 149, 784–797.

  • Bookstein, F. L. (2017). A method of factor analysis for shape coordinates. American Journal of Physical Anthropology. doi:10.1002/ajpa.23277.

  • Bookstein, F. L. (2018, to appear). A course of morphometrics for biologists. Cambridge: Cambridge University Press.

  • Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: A practical information-theoretic approach (2nd ed.). New York: Springer.

  • Dryden, I. L., & Mardia, K. V. (2016). Statistical shape analysis (2nd ed.). Chichester: Wiley.

  • Gavrilets, S. (2004). Fitness landscapes and the origin of species. Princeton: Princeton University Press.

  • Goodall, C. R. (1991). Procrustes methods in the statistical analysis of shape. Journal of the Royal Statistical Society, Series B, 53, 285–339.

  • Gower, J. C. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53, 325–338.

  • Grenander, U., & Silverstein, J. W. (1977). Spectral analysis of networks with random topologies. SIAM Journal on Applied Mathematics, 32, 499–519.

  • Jackson, J. E. (1991). A user’s guide to principal components. New York: Wiley-Interscience.

  • Jolicoeur, P., & Mosimann, J. E. (1960). Size and shape variation in the painted turtle: A principal component analysis. Growth, 24, 339–354.

  • Jolliffe, I. T. (2002). Principal component analysis (2nd ed.). New York: Springer.

  • Jonsson, D. (1982). Some limit theorems for the eigenvalues of a sample covariance matrix. Journal of Multivariate Analysis, 12, 1–38.

  • Lande, R. (1979). Quantitative genetic analysis of multivariate evolution, applied to brain:body size allometry. Evolution, 33, 402–416.

  • Leamer, E. L. (1978). Specification searches: Ad hoc inference with nonexperimental data. New York: Wiley.

  • Marchenko, V. A., & Pastur, L. A. (1967). Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1, 457–483.

  • Marcus, L. F. (1990). Traditional morphometrics. In F. J. Rohlf & F. L. Bookstein (Eds.), Proceedings of the Michigan Morphometrics Workshop. Ann Arbor: University of Michigan Museums.

  • Mardia, K. V., Kent, J. T., & Bibby, J. M. (1979). Multivariate analysis. London: Academic Press.

  • Mitteroecker, P., & Bookstein, F. L. (2009). The ontogenetic trajectory of the phenotypic covariance matrix, with examples from craniofacial shape in rats and humans. Evolution, 63, 727–737.

  • Mitteroecker, P., & Bookstein, F. L. (2011). Linear discrimination, ordination, and the visualization of selection gradients in modern morphometrics. Evolutionary Biology, 38, 100–114.

  • Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2, 559–572.

  • Pigliucci, M. (2012). Landscapes, surfaces, and morphospaces: What are they good for? In E. Svensson & R. Calsbeek (Eds.), The adaptive landscape in evolutionary biology (pp. 26–38). Oxford: Oxford University Press.

  • Shorack, G. (2000). Probability for statisticians. New York: Springer.

  • Stein, C. M. (1969). Multivariate analysis I. Technical Report No. 42, Department of Statistics, Stanford University, December 1969. Accessed as report OLK NSF 42 from https://statistics.stanford.edu/resources/technical-reports, 2 Dec 2016.

  • Theobald, D. L., & Wuttke, D. S. (2008). Accurate structural correlations from maximum likelihood superpositions. PLoS Computational Biology, 4(2), e43.

  • Wolfram MathWorld. Wigner’s semicircle law. http://mathworld.wolfram.com/WignersSemicircleLaw.html. Accessed 21 Dec 2016.

  • Wright, S. (1954). The interpretation of multivariate systems. In O. Kempthorne, et al. (Eds.), Statistics and mathematics in biology (pp. 11–33). Ames: Iowa State College Press.


Acknowledgements

Michael Perlman of the Department of Statistics, University of Washington, was a graduate student in the Stanford Department of Statistics when Charles Stein was working on his notes about this theorem. Here in 2016 he cheerfully informed me of its history, covering all the years up through 2015 during which I had been searching for something like it elsewhere. Jim Rohlf, Stony Brook University, carefully reviewed an early version of this manuscript, suggested several clarifications, and caught several embarrassing errors in drafts of several of the figures; Ioana Dumitriu, University of Washington, located the G. W. Anderson reference for me; and an anonymous reviewer for this journal persuaded me to sharpen the basic argument at many different points in the exposition. As ever, I thank Philipp Mitteroecker, University of Vienna, for endless conversations on the general topic of multivariate fallacies and for help with the rhetoric of this knotty exposition in particular.

Author information

Correspondence to Fred L. Bookstein.

Appendix

To help resolve the fallacies in the examples, the text frequently refers to the limiting large-sample condition number (the ratio of largest to smallest eigenvalue) as a function of the ratio of the count of variables to the count of specimens. For convenience, here is that value, \(\bigl((1+\sqrt{y})\bigm/(1-\sqrt{y})\bigr)^2\), for values of y from 0.01 to 0.99 by hundredths. It is a remarkably nonlinear function of y as the ratio approaches 1.0. To retrieve a value from the table, add the row label to the column label; that sum is the value of y corresponding to the limiting condition number in the table cell thus specified. For instance, the limiting condition number for a data set with four times as many cases as variables (y = 0.25) is 9.00, the entry in the 0.20 row and the 0.05 column. For those who prefer a visual representation, Fig. 14 graphs the logarithms (base e) of the same values. (A short computational sketch reproducing representative values follows Fig. 14 below.)

Table 1 Limiting condition numbers of a Wishart matrix: \(\bigl((1+\sqrt{y})\bigm/(1-\sqrt{y})\bigr)^2\) as a function of y, the p/n ratio, as \(p,~n\rightarrow \infty\)
Fig. 14 Table 1 as a line plot. (Compare Fig. 8.)
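For readers who wish to reproduce individual entries of Table 1 or the curve in Fig. 14, here is a minimal sketch (plain Python, an assumption; the article itself supplies no code) evaluating the limiting condition number for a few representative values of y.

```python
# Minimal sketch (not from the article): the limiting condition number
# ((1 + sqrt(y)) / (1 - sqrt(y)))**2 of a Wishart matrix as a function of y = p/n.
import math

def limiting_condition_number(y):
    """Limiting ratio of largest to smallest covariance eigenvalue for p/n ratio y < 1."""
    r = math.sqrt(y)
    return ((1 + r) / (1 - r)) ** 2

for y in (0.01, 0.10, 0.25, 0.50, 0.75, 0.99):
    print(f"y = {y:.2f}: limiting condition number = {limiting_condition_number(y):10.2f}")
# y = 0.25 reproduces the value 9.00 cited in the text; y = 0.50 gives about 33.97.
```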


Cite this article

Bookstein, F.L. A Newly Noticed Formula Enforces Fundamental Limits on Geometric Morphometric Analyses. Evol Biol 44, 522–541 (2017). https://doi.org/10.1007/s11692-017-9424-9
