Integrative Approaches for Microarray Data Analysis

  • Levi Waldron
  • Hilary A. Coller
  • Curtis Huttenhower
Part of the Methods in Molecular Biology book series (MIMB, volume 802)


Microarrays were one of the first technologies of the genomic revolution to gain widespread adoption, rapidly expanding from a cottage industry to the source of thousands of experimental results. They were also one of the first assays for which data repositories and metadata were standardized, and for which many journals required researchers to make published data publicly available. Microarrays provide high-throughput insights into the biological functions of genes and gene products; however, they also present a “curse of dimensionality,” whereby the availability of many gene expression measurements in few samples makes it challenging to distinguish noise from true biological signal. All of these factors argue for integrative approaches to microarray data analysis, which combine data from multiple experiments to increase sample size, avoid laboratory-specific bias, and enable new biological insights not possible from a single experiment. Here, we discuss several approaches to integrative microarray analysis for a diverse range of applications, including biomarker discovery, gene function and interaction prediction, and regulatory network inference. We also show how, by integrating large microarray compendia with diverse genomic data types, more nuanced biological hypotheses can be explored computationally. This chapter provides overviews and brief descriptions of each of these approaches to microarray integration.

Key words

Microarray; Meta-analysis; Bioinformatics; Coexpression; Functional interaction networks; Biomolecular networks; Bayesian networks; Regulatory networks; Protein function prediction; MEFIT; COALESCE



The authors would like to thank the editors of this title for their gracious support, the laboratories of Olga Troyanskaya and Leonid Kruglyak for their valuable input, and all of the members of the Coller and Huttenhower laboratories. This research was supported by PhRMA Foundation grant 2007RSGl9572, NIH/NIGMS 1R01 GM081686, NSF DBI-1053486, NIH grant T32 HG003284, and NIGMS Center of Excellence grant P50 GM071508. H.A.C. was the Milton E. Cassel scholar of the Rita Allen Foundation.



Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  • Levi Waldron (1)
  • Hilary A. Coller (2)
  • Curtis Huttenhower (1)
  1. Department of Biostatistics, Harvard School of Public Health, Boston, USA
  2. Department of Molecular Biology, Princeton University, Princeton, USA
