Statistical Methods for Integrating Multiple Types of High-Throughput Data

Xie, Yang; Ahn, Chul

doi:10.1007/978-1-60761-580-4_19

Part of the book series: Methods in Molecular Biology (MIMB, volume 620)

Abstract

Large-scale sequencing, copy number, mRNA, and protein data hold great promise for biomedical research while posing great challenges for data management and analysis. Integrating different types of high-throughput data from diverse sources can increase the statistical power of data analysis and provide deeper biological understanding. This chapter uses two biomedical research examples to illustrate the urgent need for reliable and robust methods for integrating heterogeneous data. We then review some recently developed statistical methods for integrative analysis, for both statistical inference and classification. Finally, we present some useful publicly accessible databases and program code to facilitate integrative analysis in practice.

Acknowledgments

The authors thank Drs. Wei Pan, Peng Wei, Feng Tai, and Guanghua Xiao for discussions and suggestions, and thank Dr. Peng Wei for providing WinBUGS programs. This work was partially supported by NIH UL1 RR024982, 1R21 DA027592, and SPORE P50 CA70907.

Author information

Authors and Affiliations

Division of Biostatistics, Department of Clinical Sciences, The Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA
Yang Xie & Chul Ahn

Editor information

Editors and Affiliations

Weill Medical College, Dept. Public Health, Cornell University, East 69th St. 411, New York, 10021, New York, USA
Heejung Bang
Weill Medical College, Dept. Public Health, Cornell University, East 69th St. 411, New York, 10021, New York, USA
Xi Kathy Zhou
Journal of Experimental Medicine, Rockefeller University Press, First Ave. 1114, New York, 10021, New York, USA
Heather L. van Epps
Weill Medical College, Dept. Public Health, Cornell University, East 69th St. 411, New York, 10021, New York, USA
Madhu Mazumdar

Cite this protocol

Xie, Y., Ahn, C. (2010). Statistical Methods for Integrating Multiple Types of High-Throughput Data. In: Bang, H., Zhou, X., van Epps, H., Mazumdar, M. (eds) Statistical Methods in Molecular Biology. Methods in Molecular Biology, vol 620. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-60761-580-4_19

DOI: https://doi.org/10.1007/978-1-60761-580-4_19
Published: 15 December 2009
Publisher Name: Humana Press, Totowa, NJ
Print ISBN: 978-1-60761-578-1
Online ISBN: 978-1-60761-580-4
eBook Packages: Springer Protocols