Integrated data analysis for genome-wide research

Part of the Experientia Supplementum book series (EXS, volume 97)

Abstract

Integrated data analysis is introduced as the intermediate level of a systems biology approach for analysing different ‘omics’ datasets, i.e., genome-wide measurements of transcripts, protein levels or protein-protein interactions, and metabolite levels, with the aim of generating a coherent understanding of biological function. In this chapter we focus on methods of correlation analysis that were recently applied in molecular biology, ranging from simple pairwise correlation to kernel canonical correlation. Several examples are presented to illustrate their application. The input data for this analysis frequently originate from different experimental platforms. Therefore, preprocessing steps such as data normalisation and missing value estimation are inherent to this approach. The corresponding procedures, potential pitfalls and biases, and available software solutions are reviewed. The multiplicity of observations obtained in omics-profiling experiments necessitates the application of multiple testing correction techniques.
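As an illustration of the simplest case mentioned above, the following sketch combines pairwise correlation between two profile types with a multiple testing correction. It is not taken from the chapter: the choice of Pearson correlation, a permutation test for the p-values, and the Benjamini-Hochberg adjustment are assumptions made for this example, and all names and numbers (`transcripts`, `metabolites`, `geneA`, `met1`, and so on) are hypothetical toy data.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)


def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the observed correlation."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # The add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical toy profiles measured over the same six samples.
transcripts = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "geneB": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
}
metabolites = {
    "met1": [2.1, 3.9, 6.2, 8.1, 9.8, 12.2],
    "met2": [5.0, 3.0, 6.0, 1.0, 4.0, 2.0],
}

# One test per transcript-metabolite pair, then one joint correction.
pairs = [(g, m) for g in transcripts for m in metabolites]
raw = [permutation_pvalue(transcripts[g], metabolites[m]) for g, m in pairs]
adj = benjamini_hochberg(raw)

for (g, m), p, q in zip(pairs, raw, adj):
    r = pearson(transcripts[g], metabolites[m])
    print(f"{g} vs {m}: r={r:+.3f}  p={p:.4f}  adjusted p={q:.4f}")
```

The correction is applied once over all pairs rather than per transcript, since the number of simultaneous tests grows with the product of the two dataset sizes. With real data, a rank-based coefficient or mutual information could be substituted for `pearson` without changing the rest of the sketch.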

Keywords

  • Mutual Information
  • Independent Component Analysis
  • Canonical Correlation Analysis
  • Biological Organisation

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Copyright information

© 2007 Birkhäuser Verlag/Switzerland

About this chapter

Cite this chapter

Steinfath, M., Repsilber, D., Scholz, M., Walther, D., Selbig, J. (2007). Integrated data analysis for genome-wide research. In: Baginsky, S., Fernie, A.R. (eds) Plant Systems Biology. Experientia Supplementum, vol 97. Birkhäuser Basel. https://doi.org/10.1007/978-3-7643-7439-6_13
