Skip to main content

Statistical Methods for Integrating Multiple Types of High-Throughput Data

  • Protocol
  • First Online:
Statistical Methods in Molecular Biology

Part of the book series: Methods in Molecular Biology ((MIMB,volume 620))

Abstract

Large-scale sequencing, copy number, mRNA, and protein data have given great promise to the biomedical research, while posing great challenges to data management and data analysis. Integrating different types of high-throughput data from diverse sources can increase the statistical power of data analysis and provide deeper biological understanding. This chapter uses two biomedical research examples to illustrate why there is an urgent need to develop reliable and robust methods for integrating the heterogeneous data. We then introduce and review some recently developed statistical methods for integrative analysis for both statistical inference and classification purposes. Finally, we present some useful public access databases and program code to facilitate the integrative analysis in practice.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Protocol
USD 49.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 149.00
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 199.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info
Hardcover Book
USD 219.99
Price excludes VAT (USA)
  • Durable hardcover edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Lackie J, Dow J. The Dictionary of Cell and Molecular Biology. Academic Press: London, 1999.

    Google Scholar 

  2. Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, et al. Genome-wide location and function of DNA binding proteins. Science 2000; 290(5500): 2306–9.

    Google Scholar 

  3. Iyer VR, Horak CE, Scafe CS, Botstein D, Snyder M, Brown PO. Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature 2001; 409(6819):533–8.

    Google Scholar 

  4. Shannon MF, Rao S. Transcription. Of chips and ChIPs. Science 2002; 296(5568):666–9.

    Google Scholar 

  5. Simon I, Barnett J, Hannett N, Harbison CT, Rinaldi NJ, Volkert TL, Volkert Wyrick JJ, Volkert Zeitlinger J, Volkert Gifford DK, Volkert Jaakkola TS, et al. Serial regulation of transcriptional regulators in the yeast cell cycle. Cell 2001; 106(6):697–708.

    Google Scholar 

  6. Buck MJ, Lieb JD. ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics 2004; 83(3):349–60.

    Google Scholar 

  7. Shedden K, Taylor JMG, Enkemann SA, Tsao MS, Yeatman TJ, Gerald WL, Eschrich S, Jurisica I, Giordano TJ, Misek DE, et al. Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study. Nat Med 2008; 14(8):822–7.

    Google Scholar 

  8. Xie Y, Minna JD. Predicting the future for people with lung cancer. Nat Med 2008; 14(8):812–3.

    Google Scholar 

  9. Tibshirani R, Hastie T, Narasimhan B, Chu G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci USA 2002; 99(10): 6567–72.

    Google Scholar 

  10. Huang X, Pan W. Linear regression and two-class classification with gene expression data. Bioinformatics 2003; 19(16): 2072–8.

    Google Scholar 

  11. Wu B. Differential gene expression detection and sample classification using penalized linear regression models. Bioinformatics 2006; 22(4):472–6.

    Google Scholar 

  12. Carlin B, Louis T. Bayes and Empirical Bayes Methods for Data Analysis. Chapman and Hall/CRC Press: Boca Raton, FL, 2000.

    Book  Google Scholar 

  13. Hastie T, Tibishirani R, Friedman J. The Elements of Statistical Learning. Springer; New York, NY, 2001.

    Google Scholar 

  14. Xie Y, Pan W, Jeong KS, Khodursky A. Incorporating prior information via shrinkage: a combined analysis of genome-wide location data and gene expression data. Stat Med 2007; 26(10): 2258–75.

    Google Scholar 

  15. Guo X, Qi H, Verfaillie CM, Pan W. Statistical significance analysis of longitudinal gene expression data. Bioinformatics 2003; 19(13):1628–35.

    Google Scholar 

  16. Pan W. On the use of permutation in and the performance of a class of nonparametric methods to detect differential gene expression. Bioinformatics 2003; 19(11):1333–40.

    Google Scholar 

  17. Efron B, Tibshirani R, Storey JD, Tusher V. Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc 2001; 96(456):1151–60.

    Google Scholar 

  18. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 2001; 98(9):5116–21.

    Google Scholar 

  19. Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J R Stat Soc, Series B 1995; 57: 289–300.

    Google Scholar 

  20. Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Nat Acad Sci USA 2003; 100(16):9440–45, 10.1073.

    Google Scholar 

  21. Xie Y, Pan W, Khodursky AB. A note on using permutation-based false discovery rate estimates to compare different analysis methods for microarray data. Bioinformatics 2005; 21(23):4280–8.

    Google Scholar 

  22. Donoho DL, Johnstone IM. Adapting to unknown smoothness via wavelet shrinkage. J Am Stat Assoc 1995; 90(432):1200–24.

    Google Scholar 

  23. Donoho D. De-noising by soft-thresholding. Information Theory, IEEE Trans, May 1995; 41(3):613–27, 10.1109/18.382009.

    Google Scholar 

  24. Kanehisa M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000; 28(1):27–30.

    Google Scholar 

  25. Lee I, Date SV, Adai AT, Marcotte EM. A probabilistic functional network of yeast genes. Science 2004; 306(5701): 1555–8.

    Google Scholar 

  26. Franke L, van Bakel H, Fokkens L, de Jong ED, Egmont-Petersen M, Wijmenga C. Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am J Hum Genet 2006; 78(6):1011–25.

    Google Scholar 

  27. Wei Z, Li H. A Markov random field model for network-based analysis of genomic data. Bioinformatics 2007; 23(12): 1537–44.

    Google Scholar 

  28. Xiao G, Cavan R, Khodursky A. A improved detection of differentially expressed genes via incorporation of gene location. Biometrics 2009; In Press.

    Google Scholar 

  29. Broet P, Richardson S. Detection of gene copy number changes in CGH microarrays using a spatially correlated mixture model. Bioinformatics 2006; 22(8):911–8.

    Google Scholar 

  30. Wei P, Pan W. Incorporating gene networks into statistical tests for genomic data via a spatially correlated mixture model. Bioinformatics 2008; 24(3):404–11.

    Google Scholar 

  31. Newton MA, Kendziorski CM, Richmond CS, Blattner FR, Tsui KW. On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J Comput Biol 2001; 8(1):37–52.

    Google Scholar 

  32. Pan W. A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics 2002; 18(4):546–54.

    Google Scholar 

  33. McLachlan GJ, Bean RW, Jones LBT. A simple implementation of a normal mixture approach to differential gene expression in multiclass microarrays. Bioinformatics 2006; 22(13):1608–15.

    Google Scholar 

  34. McLachlan G, Peel D. Finite Mixture Models. Wiley: New York, 2000.

    Book  Google Scholar 

  35. Pan W. Incorporating gene functions as priors in model-based clustering of microarray gene expression data. Bioinformatics 2006; 22(7):795–801.

    Google Scholar 

  36. Lee Y, Nelder JA. Double hierarchical generalized linear models (with discussion). J R Stat Soc: Series C (Applied Statistics) May 2006 55(2):139–85.

    Google Scholar 

  37. Besag J, Kooperberg C. On conditional and intrinsic autoregression. Biometrika 1995; 82(4):733–46.

    Google Scholar 

  38. Pan W. Incorporating biological information as a prior in an empirical Bayes approach to analyzing microarray data. Stat Appl Genet Mol Biol 2005; 4(NIL):Article12.

    Google Scholar 

  39. Xie Y JK, Pan W, Xiao G, Khodursky A. A Bayesian Approach to joint Modeling of Protein-DNA Binding, Gene Expression and Sequence Data. Statistics in Medicine 2009; in press.

    Google Scholar 

  40. Lonnstedt I, Britton T. Hierarchical Bayes models for cdna microarray gene expression. Biostatistics 2005; 6:279–91.

    Google Scholar 

  41. Vapnik V. Statistical Learning Theory. Wiley: New York, 1998.

    Google Scholar 

  42. Breiman L. Random forests. Machine Learning 2001; 45(1):5–32.

    Google Scholar 

  43. Wang Y, Klijn JGM, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, van Gelder MEM, Yu J, et al. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 2005; 365(9460): 671–9.

    Google Scholar 

  44. Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, Tamayo P, Renshaw AA, D’Amico AV, Richie JP, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 2002; 1(2): 203–9.

    Google Scholar 

  45. Welsh JB, Sapinoso LM, Su AI, Kern SG, Wang-Rodriguez J, Moskaluk CA, Frierson HFJ, Hampton GM. Analysis of gene expression identifies candidate markers and pharmacological targets in prostate cancer. Cancer Res 2001; 61(16): 5974–8.

    Google Scholar 

  46. Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M, et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Nat Acad Sci USA 2001; 98(24):13 790–95.

    Google Scholar 

  47. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999; 286(5439):531–7.

    Google Scholar 

  48. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000; 25(1):25–9.

    Google Scholar 

  49. Lottaz C, Spang R. Molecular decomposition of complex clinical phenotypes using biologically structured analysis of microarray data. Bioinformatics 2005; 21(9):1971–8.

    Google Scholar 

  50. Tai F, Pan W. Incorporating prior knowledge of predictors into penalized classifiers with multiple penalty terms. Bioinformatics 2007; 23(14):1775–82.

    Google Scholar 

  51. Garrett-Mayer E, Parmigiani G, Zhong X, Cope L, Gabrielson E. Cross-study validation and combined analysis of gene expression microarray data. Biostatistics 2008; 9(2): 333–54.

    Google Scholar 

Download references

Acknowledgments

The authors thank Drs. Wei Pan, Peng Wei, Feng Tai, and Guanghua Xiao for discussions and suggestions, and thank Dr. Peng Wei for providing WinBUGS programs. This work was partially supported by NIH UL1 RR024982 1R21 DA027592, and SPORE P50 CA70907.

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2010 Humana Press, a part of Springer Science+Business Media, LLC

About this protocol

Cite this protocol

Xie, Y., Ahn, C. (2010). Statistical Methods for Integrating Multiple Types of High-Throughput Data. In: Bang, H., Zhou, X., van Epps, H., Mazumdar, M. (eds) Statistical Methods in Molecular Biology. Methods in Molecular Biology, vol 620. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-60761-580-4_19

Download citation

  • DOI: https://doi.org/10.1007/978-1-60761-580-4_19

  • Published:

  • Publisher Name: Humana Press, Totowa, NJ

  • Print ISBN: 978-1-60761-578-1

  • Online ISBN: 978-1-60761-580-4

  • eBook Packages: Springer Protocols

Publish with us

Policies and ethics