Abstract
Principal components analysis (PCA) is a standard tool in multivariate data analysis for reducing the number of dimensions while retaining as much of the data’s variation as possible. Instead of investigating thousands of original variables, one explores the first few components, which contain the majority of the data’s variation. Visualizing and statistically analyzing these new variables, the principal components, can help to find similarities and differences between samples. The original variables that contribute most to the first few components can be identified as well.
This chapter seeks to deliver a conceptual understanding of PCA as well as a mathematical description. We describe how PCA can be used to analyze different datasets, and we include practical code examples. Possible shortcomings of the methodology and ways to overcome these problems are also discussed.
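The idea summarized in the abstract can be illustrated with a minimal sketch. It is not taken from the chapter: it assumes NumPy, uses synthetic data, and the function name `pca` and all variable names are hypothetical. It computes the components from a singular value decomposition of the centered data matrix, which is one common way to obtain them.

```python
import numpy as np

def pca(X, n_components=2):
    """PCA via SVD of the column-centered data matrix (rows = samples)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                    # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained_var = s**2 / (X.shape[0] - 1)    # variance along each component
    ratio = explained_var / explained_var.sum()
    loadings = Vt[:n_components].T             # variables x components
    scores = Xc @ loadings                     # samples x components
    return scores, loadings, ratio[:n_components]

# Synthetic example (an assumption for illustration): 50 samples, 4 variables,
# driven by one hidden factor plus a small amount of noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 1))
weights = np.array([[2.0, 1.0, 0.5, 0.0]])
X = latent @ weights + 0.05 * rng.normal(size=(50, 4))

scores, loadings, ratio = pca(X, n_components=2)

# Index of the original variable contributing most to the first component.
top_variable = int(np.argmax(np.abs(loadings[:, 0])))
```

Here `scores` holds the samples expressed in the new variables, `ratio` gives the share of total variation captured by each component, and the largest absolute entries in a column of `loadings` point to the original variables that contribute most to that component.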
Acknowledgments
We thank Kristin Feher for carefully reviewing our manuscript.
Copyright information
© 2013 Springer Science+Business Media, LLC
About this protocol
Cite this protocol
Groth, D., Hartmann, S., Klie, S., Selbig, J. (2013). Principal Components Analysis. In: Reisfeld, B., Mayeno, A. (eds) Computational Toxicology. Methods in Molecular Biology, vol 930. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-62703-059-5_22
DOI: https://doi.org/10.1007/978-1-62703-059-5_22
Publisher Name: Humana Press, Totowa, NJ
Print ISBN: 978-1-62703-058-8
Online ISBN: 978-1-62703-059-5
eBook Packages: Springer Protocols