Statistical and Knowledge Supported Visualization of Multivariate Data

Fontes, Magnus

doi:10.1007/978-3-642-20236-0_6

Part of the book series: Springer Proceedings in Mathematics (PROM, volume 6)

Abstract

In the present work we have selected a collection of statistical and mathematical tools useful for the exploration of multivariate data, and we present them in a form that is meant to be particularly accessible to a classically trained mathematician. We give self-contained and streamlined introductions to principal component analysis, multidimensional scaling and statistical hypothesis testing. Within the presented mathematical framework we then propose a general exploratory methodology for the investigation of real-world high-dimensional datasets that builds on statistical and knowledge supported visualizations. We exemplify the proposed methodology by applying it to several different genomewide DNA-microarray datasets. The exploratory methodology should be seen as an embryo that can be expanded and developed in many directions. As an example, we point out some recent promising advances in the theory of random matrices that, if further developed, could potentially provide practically useful and theoretically well-founded estimates of the information content in dimension-reducing visualizations. We hope that the present work can serve as an introduction to, and help to stimulate more research within, the interesting and rapidly expanding field of data exploration.

Acknowledgements

First of all, I would like to thank Gunnar Sparr for being a role model, for me and for many other young mathematicians, of a pure mathematician who evolved into contributing serious applied work. Gunnar’s help, support and general encouragement have been very important during my own development within the field of mathematical modeling. I sincerely thank Johan Råde for helping me to learn almost everything I know about data exploration. Without him the work presented here would truly not have been possible. Applied work is best done in collaboration, and I am blessed with Thoas Fioretos as my long-term collaborator within the field of molecular biology. I am grateful for what he has tried to teach me and I hope he is willing to continue to try. Finally, I thank Charlotte Soneson for reading this work and, as always, giving very valuable feedback.

Author information

Authors and Affiliations

Centre for Mathematical Sciences, Lund University, 118, SE-22100, Lund, Sweden
Magnus Fontes

Corresponding author

Correspondence to Magnus Fontes.

Editor information

Editors and Affiliations

Centre for Mathematical Sciences, Lund University, Sölvegatan 18, Lund, 221 00, Sweden
Kalle Åström
Dept. Mathematics, Luleå University of Technology, Luleå, 971 87, Sweden
Lars-Erik Persson
Centre for Mathematical Sciences, Lund University, Lund, Sweden
Sergei D. Silvestrov

About this paper

Cite this paper

Fontes, M. (2012). Statistical and Knowledge Supported Visualization of Multivariate Data. In: Åström, K., Persson, LE., Silvestrov, S. (eds) Analysis for Science, Engineering and Beyond. Springer Proceedings in Mathematics, vol 6. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20236-0_6

DOI: https://doi.org/10.1007/978-3-642-20236-0_6
Published: 18 May 2011
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20235-3
Online ISBN: 978-3-642-20236-0
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
