Principal Components Analysis

Part of the book series: Methods in Molecular Biology (MIMB, volume 930)

Abstract

Principal components analysis (PCA) is a standard tool in multivariate data analysis for reducing the number of dimensions while retaining as much of the data’s variation as possible. Instead of investigating thousands of original variables, the analyst explores the first few components, which contain the majority of the data’s variation. Visualization and statistical analysis of these new variables, the principal components, can help to find similarities and differences between samples. The original variables that contribute most to the first few components can be identified as well.
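
The idea in the paragraph above can be illustrated with a minimal sketch. This is not the chapter's own code: it assumes Python with NumPy, uses synthetic data, and the function name `pca` and all variable names are hypothetical. It centers the data, takes a singular value decomposition, and reports the sample scores, the variable loadings, and the fraction of variance carried by the first components.

```python
import numpy as np

def pca(X, n_components=2):
    """Illustrative PCA via SVD; rows of X are samples, columns are variables."""
    Xc = X - X.mean(axis=0)                       # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # sample coordinates
    loadings = Vt[:n_components].T                # variable contributions
    explained = s**2 / np.sum(s**2)               # variance fraction per component
    return scores, loadings, explained[:n_components]

# Synthetic data (an assumption for illustration): 50 samples, 200 variables
# generated from 2 hidden factors plus a small amount of noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 2)) * np.array([5.0, 2.0])
mixing = rng.normal(size=(2, 200))
X = latent @ mixing + 0.1 * rng.normal(size=(50, 200))

scores, loadings, explained = pca(X, n_components=2)

# Original variables with the largest absolute loading on component 1
top_vars = np.argsort(np.abs(loadings[:, 0]))[::-1][:5]

print("variance explained by 2 components:", explained.sum())
print("top contributing variables on PC1:", top_vars)
```

Plotting the two columns of `scores` against each other gives the usual sample map for spotting similarities and differences, while `top_vars` lists the original variables that contribute most to the first component. Whether variables should also be scaled to unit variance before the decomposition depends on the dataset and is left out of this sketch.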

This chapter seeks to deliver a conceptual understanding of PCA as well as a mathematical description. We describe how PCA can be used to analyze different datasets, and we include practical code examples. Possible shortcomings of the methodology and ways to overcome these problems are also discussed.

Acknowledgments

We thank Kristin Feher for carefully reviewing our manuscript.

Author information

Corresponding author

Correspondence to Detlef Groth.

Copyright information

© 2013 Springer Science+Business Media, LLC

About this protocol

Cite this protocol

Groth, D., Hartmann, S., Klie, S., Selbig, J. (2013). Principal Components Analysis. In: Reisfeld, B., Mayeno, A. (eds) Computational Toxicology. Methods in Molecular Biology, vol 930. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-62703-059-5_22

  • DOI: https://doi.org/10.1007/978-1-62703-059-5_22

  • Publisher Name: Humana Press, Totowa, NJ

  • Print ISBN: 978-1-62703-058-8

  • Online ISBN: 978-1-62703-059-5

  • eBook Packages: Springer Protocols
