Statistical Methods for Integrating Multiple Types of High-Throughput Data

  • Yang Xie
  • Chul Ahn
Part of the Methods in Molecular Biology book series (MIMB, volume 620)


Large-scale sequencing, copy number, mRNA, and protein data hold great promise for biomedical research while posing great challenges for data management and analysis. Integrating different types of high-throughput data from diverse sources can increase the statistical power of data analysis and provide deeper biological understanding. This chapter uses two biomedical research examples to illustrate why there is an urgent need to develop reliable and robust methods for integrating heterogeneous data. We then introduce and review some recently developed statistical methods for integrative analysis, for both statistical inference and classification purposes. Finally, we present some useful public access databases and program code to facilitate integrative analysis in practice.

Key words

Integrative analysis, high-throughput data analysis, microarray



The authors thank Drs. Wei Pan, Peng Wei, Feng Tai, and Guanghua Xiao for discussions and suggestions, and thank Dr. Peng Wei for providing WinBUGS programs. This work was partially supported by NIH UL1 RR024982, 1R21 DA027592, and SPORE P50 CA70907.



Copyright information

© Humana Press, a part of Springer Science+Business Media, LLC 2010

Authors and Affiliations

  • Yang Xie (1)
  • Chul Ahn (1)
  1. Division of Biostatistics, Department of Clinical Sciences, The Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, USA
