Principal Components Analysis

Groth, Detlef; Hartmann, Stefanie; Klie, Sebastian; Selbig, Joachim

doi:10.1007/978-1-62703-059-5_22

Part of the book series: Methods in Molecular Biology (MIMB, volume 930)

Abstract

Principal components analysis (PCA) is a standard tool in multivariate data analysis for reducing the number of dimensions while retaining as much of the data’s variation as possible. Instead of investigating thousands of original variables, the first few components, which contain the majority of the data’s variation, are explored. The visualization and statistical analysis of these new variables, the principal components, can help to find similarities and differences between samples. Important original variables that are the major contributors to the first few components can be discovered as well.
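The idea in the paragraph above can be illustrated with a minimal sketch. It is not taken from the chapter: Python with NumPy, the function name `pca`, and the synthetic data are all assumptions chosen for illustration. The sketch centers the data, takes a singular value decomposition, and reports the sample scores, the variable loadings, and the share of variance captured by each retained component.

```python
import numpy as np


def pca(X, n_components=2):
    """Illustrative PCA via SVD (hypothetical helper, not the chapter's code).

    X has one row per sample and one column per variable.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    variances = s**2 / (X.shape[0] - 1)          # variance along each component
    ratio = variances / variances.sum()          # fraction of total variance
    scores = U[:, :n_components] * s[:n_components]   # samples in component space
    loadings = Vt[:n_components].T               # weight of each variable per component
    return scores, loadings, ratio[:n_components]


# Synthetic example (assumed data): 50 samples, 200 variables that all
# follow one hidden factor plus a small amount of noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 1))
weights = rng.normal(size=(1, 200))
X = latent @ weights + 0.1 * rng.normal(size=(50, 200))

scores, loadings, ratio = pca(X, n_components=2)

# Original variables contributing most strongly to the first component.
top_variables = np.argsort(np.abs(loadings[:, 0]))[::-1][:5]
print("explained variance ratio:", ratio)
print("top contributing variables:", top_variables)
```

Plotting the two columns of `scores` against each other gives the kind of sample overview the abstract describes, and the largest absolute entries in a column of `loadings` point to the original variables that drive that component.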

This chapter seeks to deliver a conceptual understanding of PCA as well as a mathematical description. We describe how PCA can be used to analyze different datasets, and we include practical code examples. Possible shortcomings of the methodology and ways to overcome these problems are also discussed.

Acknowledgments

We thank Kristin Feher for carefully reviewing our manuscript.

Author information

Authors and Affiliations

AG Bioinformatics, University of Potsdam, Potsdam-Golm, Germany
Detlef Groth, Stefanie Hartmann, Sebastian Klie & Joachim Selbig

Corresponding author

Correspondence to Detlef Groth.

Editor information

Editors and Affiliations

School of Biomedical Engineering, Chemical & Biological Engineering, Colorado State University, Campus Delivery 1370, Fort Collins, 80523-1370, Colorado, USA
Brad Reisfeld
Chemical & Biological Engineering, Colorado State University, Campus Delivery 1370, Fort Collins, 80523-1370, Colorado, USA
Arthur N. Mayeno

Cite this protocol

Groth, D., Hartmann, S., Klie, S., Selbig, J. (2013). Principal Components Analysis. In: Reisfeld, B., Mayeno, A. (eds) Computational Toxicology. Methods in Molecular Biology, vol 930. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-62703-059-5_22

DOI: https://doi.org/10.1007/978-1-62703-059-5_22
Published: 18 August 2012
Publisher Name: Humana Press, Totowa, NJ
Print ISBN: 978-1-62703-058-8
Online ISBN: 978-1-62703-059-5
eBook Packages: Springer Protocols
