Principal Components Analysis

  • Detlef Groth
  • Stefanie Hartmann
  • Sebastian Klie
  • Joachim Selbig
Protocol
Part of the Methods in Molecular Biology book series (MIMB, volume 930)

Abstract

Principal components analysis (PCA) is a standard tool in multivariate data analysis for reducing the number of dimensions while retaining as much of the data’s variation as possible. Instead of investigating thousands of original variables, one explores the first few components, which contain the majority of the data’s variation. Visualizing and statistically analyzing these new variables, the principal components, can help to find similarities and differences between samples. The original variables that contribute most to the first few components can be identified as well.

This chapter seeks to deliver a conceptual understanding of PCA as well as a mathematical description. We describe how PCA can be used to analyze different datasets, and we include practical code examples. Possible shortcomings of the methodology and ways to overcome these problems are also discussed.
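As a rough illustration of the idea summarized above, the following minimal sketch computes the first two principal components of a small, made-up data matrix (rows are samples, columns are variables). It is not the chapter's own code: the function names (`center`, `covariance`, `leading_eigenpair`, `pca`) and the data are hypothetical, and power iteration with deflation is used in place of a library eigendecomposition so that the example needs only the Python standard library.

```python
import math
import random


def center(data):
    """Subtract each column mean so that every variable has mean zero."""
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    return [[x - m for x, m in zip(row, means)] for row in data]


def covariance(centered):
    """Sample covariance matrix (variables x variables) of centered data."""
    n, p = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(p)] for i in range(p)]


def leading_eigenpair(matrix, iterations=500):
    """Power iteration: approximate the largest eigenvalue and its unit eigenvector."""
    p = len(matrix)
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    vec = [rng.random() + 0.1 for _ in range(p)]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        new = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0.0:  # no variance left to capture
            break
        vec = [x / norm for x in new]
    # Rayleigh quotient gives the eigenvalue for the unit vector found above.
    value = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(p))
                for i in range(p))
    return value, vec


def pca(data, n_components=2):
    """Return sample scores, component weights, and explained-variance fractions."""
    centered = center(data)
    cov = covariance(centered)
    p = len(cov)
    total_variance = sum(cov[i][i] for i in range(p))
    components, explained = [], []
    for _ in range(n_components):
        value, vec = leading_eigenpair(cov)
        components.append(vec)
        explained.append(value / total_variance if total_variance > 0 else 0.0)
        # Deflation: remove the variance already captured by this component.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(p)]
               for i in range(p)]
    # Project every centered sample onto the components.
    scores = [[sum(x * w for x, w in zip(row, vec)) for vec in components]
              for row in centered]
    return scores, components, explained


# Made-up example: six samples, three variables (the first two are correlated).
data = [
    [2.5, 2.4, 0.5],
    [0.5, 0.7, 0.3],
    [2.2, 2.9, 0.6],
    [1.9, 2.2, 0.4],
    [3.1, 3.0, 0.5],
    [2.3, 2.7, 0.4],
]

scores, components, explained = pca(data, n_components=2)
print("explained variance fractions:", explained)
print("weights of the first component:", components[0])
```

In this sketch, `explained` gives the fraction of the total variance captured by each component, `scores` holds the coordinates of the samples in the reduced space, and the weights in `components` indicate how strongly each original variable contributes to a component. For real datasets an established routine would normally be preferable to this hand-written loop.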

Key words

  • Principal components analysis
  • Multivariate data analysis
  • Metabolite profiling
  • Codon usage
  • Dimensionality reduction

Copyright information

© Springer Science+Business Media, LLC 2013

Authors and Affiliations

  • Detlef Groth (1)
  • Stefanie Hartmann (1)
  • Sebastian Klie (1)
  • Joachim Selbig (1)
  1. AG Bioinformatics, University of Potsdam, Potsdam-Golm, Germany
