Statistical and Knowledge Supported Visualization of Multivariate Data

  • Conference paper
Analysis for Science, Engineering and Beyond

Part of the book series: Springer Proceedings in Mathematics (PROM, volume 6)

Abstract

In the present work we select a collection of statistical and mathematical tools useful for the exploration of multivariate data, and we present them in a form meant to be particularly accessible to a classically trained mathematician. We give self-contained and streamlined introductions to principal component analysis, multidimensional scaling and statistical hypothesis testing. Within this mathematical framework we then propose a general exploratory methodology for the investigation of real-world high-dimensional datasets that builds on statistical and knowledge-supported visualizations. We exemplify the proposed methodology by applying it to several different genome-wide DNA-microarray datasets. The exploratory methodology should be seen as an embryo that can be expanded and developed in many directions. As an example, we point out some recent promising advances in the theory of random matrices that, if further developed, could potentially provide practically useful and theoretically well-founded estimates of the information content in dimension-reducing visualizations. We hope that the present work can serve as an introduction to, and help to stimulate more research within, the interesting and rapidly expanding field of data exploration.
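As a rough illustration of the kind of dimension-reducing visualization the abstract refers to, the following minimal sketch computes principal component scores through a singular value decomposition and reports the fraction of total variance each retained component explains. It is not taken from the chapter: the function name `pca_scores`, the synthetic data and the two-group structure are assumptions made purely for illustration, and the explained-variance fraction is only one simple proxy for the information content of a projection.

```python
import numpy as np


def pca_scores(data, n_components=2):
    """Project samples (rows) onto the leading principal components.

    Hypothetical helper for illustration only. Returns the component
    scores and the fraction of total variance explained by each
    retained component.
    """
    # Center every variable (column) at zero mean.
    centered = data - data.mean(axis=0)
    # Thin SVD: centered = u @ diag(s) @ vt, singular values descending.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    # Coordinates of the samples along the principal directions.
    scores = u[:, :n_components] * s[:n_components]
    # Squared singular values are proportional to component variances.
    explained = s**2 / np.sum(s**2)
    return scores, explained[:n_components]


# Synthetic stand-in for an expression matrix (an assumption, not real
# microarray data): 20 samples, 500 variables, where the first 10 samples
# are shifted in the first 50 variables.
rng = np.random.default_rng(0)
data = rng.normal(size=(20, 500))
data[:10, :50] += 3.0

scores, explained = pca_scores(data, n_components=2)
print(scores.shape)
print(explained)
```

Plotting the two columns of `scores` against each other gives the usual two-dimensional PCA sample plot, and `explained` indicates how much of the total variance that plot retains.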

Acknowledgements

First of all, I would like to thank Gunnar Sparr for being a role model, for me and for many other young mathematicians, of a pure mathematician who went on to contribute serious applied work. Gunnar’s help, support and general encouragement have been very important during my own development within the field of mathematical modeling. I sincerely thank Johan Råde for helping me to learn almost everything I know about data exploration. Without him the work presented here would truly not have been possible. Applied work is best done in collaboration, and I am blessed with Thoas Fioretos as my long-term collaborator within the field of molecular biology. I am grateful for what he has tried to teach me and I hope he is willing to continue to try. Finally, I thank Charlotte Soneson for reading this work and, as always, giving very valuable feedback.

Author information

Corresponding author

Correspondence to Magnus Fontes.

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Fontes, M. (2012). Statistical and Knowledge Supported Visualization of Multivariate Data. In: Åström, K., Persson, LE., Silvestrov, S. (eds) Analysis for Science, Engineering and Beyond. Springer Proceedings in Mathematics, vol 6. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20236-0_6
