Abstract
In the present work we have selected a collection of statistical and mathematical tools useful for the exploration of multivariate data, and we present them in a form that is meant to be particularly accessible to a classically trained mathematician. We give self-contained and streamlined introductions to principal component analysis, multidimensional scaling and statistical hypothesis testing. Within the presented mathematical framework we then propose a general exploratory methodology for the investigation of real-world high-dimensional datasets that builds on statistical and knowledge-supported visualizations. We exemplify the proposed methodology by applying it to several different genome-wide DNA microarray datasets. The exploratory methodology should be seen as an embryo that can be expanded and developed in many directions. As an example, we point out some recent promising advances in the theory of random matrices that, if further developed, could potentially provide practically useful and theoretically well-founded estimates of the information content in dimension-reducing visualizations. We hope that the present work can serve as an introduction to, and help to stimulate more research within, the interesting and rapidly expanding field of data exploration.
Acknowledgements
First of all, I would like to thank Gunnar Sparr for being a role model, for me and for many other young mathematicians, of a pure mathematician who went on to contribute serious applied work. Gunnar’s help, support and general encouragement have been very important during my own development within the field of mathematical modeling. I sincerely thank Johan Råde for helping me to learn almost everything I know about data exploration. Without him the work presented here would truly not have been possible. Applied work is best done in collaboration, and I am blessed with Thoas Fioretos as my long-term collaborator within the field of molecular biology. I am grateful for what he has tried to teach me and I hope he is willing to continue to try. Finally, I thank Charlotte Soneson for reading this work and, as always, giving very valuable feedback.
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Fontes, M. (2012). Statistical and Knowledge Supported Visualization of Multivariate Data. In: Åström, K., Persson, LE., Silvestrov, S. (eds) Analysis for Science, Engineering and Beyond. Springer Proceedings in Mathematics, vol 6. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20236-0_6
DOI: https://doi.org/10.1007/978-3-642-20236-0_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20235-3
Online ISBN: 978-3-642-20236-0
eBook Packages: Mathematics and Statistics (R0)