Random Forest for Bioinformatics

Qi, Yanjun

doi:10.1007/978-1-4419-9326-7_11

Yanjun Qi^3,4

19k Accesses
3 Altmetric

Abstract

Modern biology has experienced an increased use of machine learning techniques for large scale and complex biological data analysis. In the area of Bioinformatics, the Random Forest (RF) [6] technique, which includes an ensemble of decision trees and incorporates feature selection and interactions naturally in the learning process, is a popular choice. It is nonparametric, interpretable, efficient, and has high prediction accuracy for many types of data. Recent work in computational biology has seen an increased use of RF, owing to its unique advantages in dealing with small sample size, high-dimensional feature space, and complex data structures.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 189.00; Price excludes VAT (USA)

Softcover Book: USD 249.99; Price excludes VAT (USA)

Hardcover Book: USD 249.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Altmann, A., Toloşi, L., Sander, O., Lengauer, T.: Permutation importance: a corrected feature importance measure. Bioinformatics 26(10), 1340 (2010)
Article Google Scholar
Amaratunga, D., Cabrera, J., Lee, Y.: Enriched random forests. Bioinformatics 24(18), 2010 (2008)
Google Scholar
Bao, L., Zhou, M., Cui, Y.: nssnpanalyzer: identifying disease-associated nonsynonymous single nucleotide polymorphisms. Nucleic Acids Research 33(suppl 2), W480 (2005)
Article Google Scholar
Barenboim, M., Masso, M., Vaisman, I., Jamison, D.: Statistical geometry based prediction of nonsynonymous snp functional effects using random forest and neuro-fuzzy classifiers. Proteins: Structure, Function, and Bioinformatics 71(4), 1930–1939 (2008)
Article Google Scholar
Barrett, J., Cairns, D.: Application of the random forest classification method to peaks detected from mass spectrometric proteomic profiles of cancer patients and controls. Statistical Applications in Genetics and Molecular Biology 7(2), 4 (2008)
Article MathSciNet Google Scholar
Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001). DOI 10.1023/A: 1010933404324
Google Scholar
Bureau, A., Dupuis, J., Falls, K., Lunetta, K.L., Hayward, B., Keith, T.P., Van Eerdewegh, P.: Identifying snps predictive of phenotype using random forests. Genet Epidemiol 28(2), 171–82 (2005). DOI 10.1002/gepi.20041
Article Google Scholar
Chen, X., Jeong, J.: Sequence-based prediction of protein interaction sites with an integrative method. Bioinformatics 25(5), 585 (2009)
Article Google Scholar
Chen, X., Liu, C.T., Zhang, M., Zhang, H.: A forest-based approach to identifying gene and gene–gene interactions. Proc Natl Acad Sci USA 104(49), 19,199–203 (2007). DOI 10.1073/pnas.0709868104
Article Google Scholar
Chen, X., Liu, M.: Prediction of protein–protein interactions using random decision forest framework. Bioinformatics 21(24), 4394 (2005)
Article Google Scholar
Chen, X., Wang, M., Zhang, H.: The use of classification trees for bioinformatics. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1(1), 55–63 (2011)
Google Scholar
Cummings, M., Myers, D.: Simple statistical models predict c-to-u edited sites in plant mitochondrial rna. BMC Bioinformatics 5(1), 132 (2004)
Article Google Scholar
Cummings, M., Segal, M.: Few amino acid positions in rpob are associated with most of the rifampin resistance in mycobacterium tuberculosis. BMC Bioinformatics 5(1), 137 (2004)
Article Google Scholar
Cutler, D., Edwards Jr, T., Beard, K., Cutler, A., Hess, K., Gibson, J., Lawler, J.: Random forests for classification in ecology. Ecology 88(11), 2783–2792 (2007)
Article Google Scholar
Diaz-Uriarte, R., de Andrés, S.: Variable selection from random forests: application to gene expression data. Arxiv preprint q-bio/0503025 (2005)
Google Scholar
Dybowski, J.N., Heider, D., Hoffmann, D.: Prediction of co-receptor usage of hiv-1 from genotype. PLoS Comput Biol 6(4), e1000,743 (2010). DOI 10.1371/journal.pcbi. 1000743
Article Google Scholar
Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63, 3–42 (2006)
Google Scholar
Geurts, P., Fillet, M., De Seny, D., Meuwis, M., Malaise, M., Merville, M., Wehenkel, L.: Proteomic mass spectra classification using decision tree based ensemble methods. Bioinformatics 21(14), 3138 (2005)
Article Google Scholar
Hamby, S., Hirst, J.: Prediction of glycosylation sites using random forests. BMC Bioinformatics 9(1), 500 (2008)
Article Google Scholar
Hanselmann, M., Ko the, U., Kirchner, M., Renard, B., Amstalden, E., Glunde, K., Heeren, R., Hamprecht, F.: Toward digital staining using imaging mass spectrometry and random forests. Journal of Proteome Research 8(7), 3558–3567 (2009)
Article Google Scholar
Hothorn, T., Hornik, K., Zeileis, A., Wien, W., Wien, W.: Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics 15(3), 651–674 (2006)
Article MathSciNet Google Scholar
Izmirlian, G.: Application of the random forest classification algorithm to a seldi-tof proteomics study in the setting of a cancer prevention trial. Annals of the New York Academy of Sciences 1020(1), 154–174 (2004)
Article Google Scholar
Karpievitch, Y., Hill, E., Leclerc, A., Dabney, A., Almeida, J.: An introspective comparison of random forest-based classifiers for the analysis of cluster-correlated data by way of rf++. PloS one 4(9), e7087 (2009)
Article Google Scholar
Kirchner, M., Timm, W., Fong, P., Wangemann, P., Steen, H.: Non-linear classification for on-the-fly fractional mass filtering and targeted precursor fragmentation in mass spectrometry experiments. Bioinformatics 26(6), 791 (2010)
Article Google Scholar
Kruglyak, L., Nickerson, D.A.: Variation is the spice of life. Nat Genet 27(3), 234–6 (2001). DOI 10.1038/85776
Article Google Scholar
Lee, J., Lee, J., Park, M., Song, S.: An extensive comparison of recent classification tools applied to microarray data. Computational Statistics & Data Analysis 48(4), 869–885 (2005)
Article MathSciNet MATH Google Scholar
Lin, N., Wu, B., Jansen, R., Gerstein, M., Zhao, H.: Information assessment on predicting protein–protein interactions. BMC Bioinformatics 5(1), 154 (2004)
Article Google Scholar
Lunetta, K., Hayward, L., Segal, J., Van Eerdewegh, P.: Screening large-scale association study data: exploiting interactions using random forests. BMC Genetics 5(1), 32 (2004)
Article Google Scholar
Ma, Y., Ding, Z., Qian, Y., Shi, X., Castranova, V., Harner, E., Guo, L.: Predicting cancer drug response by proteomic profiling. Clinical Cancer Research 12(15), 4583 (2006)
Article Google Scholar
Meng, Y., Yu, Y., Cupples, L., Farrer, L., Lunetta, K.: Performance of random forest when snps are in linkage disequilibrium. BMC Bioinformatics 10(1), 78 (2009)
Article Google Scholar
Menze, B., Kelm, B., Masuch, R., Himmelreich, U., Bachert, P., Petrich, W., Hamprecht, F.: A comparison of random forest and its gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC Bioinformatics 10(1), 213 (2009)
Article Google Scholar
Moore, J., Asselbergs, F., Williams, S.: Bioinformatics challenges for genome-wide association studies. Bioinformatics 26(4), 445 (2010)
Article Google Scholar
Qi, Y., Bar-Joseph, Z., Klein-Seetharaman, J.: Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Proteins: Structure, Function, and Bioinformatics 63(3), 490–500 (2006)
Article Google Scholar
Qi, Y., Dhiman, H., Bhola, N., Budyak, I., Kar, S., Man, D., Dutta, A., Tirupula, K., Carr, B., Grandis, J., et al.: Systematic prediction of human membrane receptor interactions. Proteomics 9(23), 5243–5255 (2009)
Article Google Scholar
Qi, Y., Klein-Seetharaman, J., Bar-Joseph, Z.: Random forest similarity for protein–protein interaction prediction from multiple sources. In: Proceedings of the Pacific Symposium on Biocomputing (2005)
Google Scholar
Riddick, G., Song, H., Ahn, S., Walling, J., Borges-Rivera, D., Zhang, W., Fine, H.: Predicting in vitro drug sensitivity using random forests. Bioinformatics 27(2), 220 (2011)
Article Google Scholar
Saeys, Y., Inza, I., Larrañaga, P.: A review of feature selection techniques in bioinformatics. Bioinformatics 23(19), 2507 (2007)
Article Google Scholar
Segal, M.R.: Machine learning benchmarks and random forest regression. Technical Report, Center for Bioinformatics & Molecular Biostatistics, University of California, San Francisco (2004)
Google Scholar
Statnikov, A., Wang, L., Aliferis, C.: A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics 9(1), 319 (2008)
Article Google Scholar
Strobl, C., Boulesteix, A., Kneib, T., Augustin, T., Zeileis, A.: Conditional variable importance for random forests. BMC Bioinformatics 9(1), 307 (2008)
Article Google Scholar
Strobl, C., Boulesteix, A., Zeileis, A., Hothorn, T.: Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics 8(1), 25 (2007)
Article Google Scholar
Svetnik, V., Liaw, A., Tong, C., Culberson, J.C., Sheridan, R.P., Feuston, B.P.: Random forest: a classification and regression tool for compound classification and qsar modeling. J Chem Inf Comput Sci 43(6), 1947–58 (2003). DOI 10.1021/ci034160g
Article Google Scholar
Tastan, O., Qi, Y., Carbonell, J., Klein-Seetharaman, J.: Prediction of interactions between HIV-1 and human proteins by information integration. In: Pac Symp Biocomput, vol. 516 (2009)
Google Scholar
Wang, M., Chen, X., Zhang, H.: Maximal conditional chi-square importance in random forests. Bioinformatics 26(6), 831 (2010)
Article Google Scholar
Wang, W.Y.S., Barratt, B.J., Clayton, D.G., Todd, J.A.: Genome-wide association studies: theoretical and practical concerns. Nat Rev Genet 6(2), 109–18 (2005). DOI 10.1038/nrg1522
Article Google Scholar
Wu, X., Wu, Z., Li, K.: Identification of differential gene expression for microarray data using recursive random forest. Chin Med J 121(24), 2492–2496 (2008)
Article Google Scholar
Yang, P., Hwa Yang, Y., Zhou, B., Zomaya, Y., et al.: A review of ensemble methods in bioinformatics. Current Bioinformatics 5(4), 296–308 (2010)
Article Google Scholar
Zhang, H., Yu, C., Singer, B.: Cell and tumor classification using gene expression data: construction of forests. Proceedings of the National Academy of Sciences 100(7), 4168 (2003)
Article Google Scholar

Download references

Author information

Authors and Affiliations

Machine Learning Department, NEC Labs America, Princeton, NJ, USA
Yanjun Qi
4 Independence Way, Suite 200, Princeton, NJ, USA
Yanjun Qi

Authors

Yanjun Qi
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Yanjun Qi .

Editor information

Editors and Affiliations

Microsoft, One Microsoft Road, Redmond, 98052, USA
Cha Zhang
Honeywell, Douglas Drive North 1985, Golden Valley, 55422, USA
Yunqian Ma

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Qi, Y. (2012). Random Forest for Bioinformatics. In: Zhang, C., Ma, Y. (eds) Ensemble Machine Learning. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-9326-7_11

Download citation

DOI: https://doi.org/10.1007/978-1-4419-9326-7_11
Published: 19 January 2012
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4419-9325-0
Online ISBN: 978-1-4419-9326-7
eBook Packages: EngineeringEngineering (R0)

Publish with us

Policies and ethics