Random Forest for Bioinformatics

Ensemble Machine Learning

Abstract

Modern biology has experienced an increased use of machine learning techniques for large-scale and complex biological data analysis. In the area of bioinformatics, the Random Forest (RF) [6] technique, which builds an ensemble of decision trees and incorporates feature selection and interactions naturally in the learning process, is a popular choice. It is nonparametric, interpretable, efficient, and has high prediction accuracy for many types of data. Recent work in computational biology has seen an increased use of RF, owing to its unique advantages in dealing with small sample sizes, high-dimensional feature spaces, and complex data structures.
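The ensemble idea summarized in the abstract can be illustrated with a minimal sketch. It is not the chapter's implementation: it assumes unpruned trees grown on bootstrap samples, a random subset of roughly sqrt(p) candidate features at each split (a common default for classification), and majority voting. The toy data set (40 samples, 20 features, only feature 0 informative) and all function names are hypothetical, chosen to mimic the small-sample, high-dimensional setting the abstract mentions.

```python
import math
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, features):
    """Return (impurity decrease, feature, threshold), or None if no split exists."""
    parent = gini(labels) * len(labels)
    best = None
    for f in features:
        values = sorted({r[f] for r in rows})
        for thr in values[:-1]:  # rule is x[f] <= thr, so both sides stay non-empty
            left = [y for r, y in zip(rows, labels) if r[f] <= thr]
            right = [y for r, y in zip(rows, labels) if r[f] > thr]
            gain = parent - gini(left) * len(left) - gini(right) * len(right)
            if best is None or gain > best[0]:
                best = (gain, f, thr)
    return best


def grow_tree(rows, labels, mtry, rng, importance):
    """Grow an unpruned tree; every node sees a fresh random subset of mtry features."""
    if len(set(labels)) == 1:
        return labels[0]  # pure node becomes a leaf
    split = best_split(rows, labels, rng.sample(range(len(rows[0])), mtry))
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    gain, f, thr = split
    importance[f] += gain  # accumulate Gini decrease per feature
    left = [i for i, r in enumerate(rows) if r[f] <= thr]
    right = [i for i, r in enumerate(rows) if r[f] > thr]
    return (f, thr,
            grow_tree([rows[i] for i in left], [labels[i] for i in left],
                      mtry, rng, importance),
            grow_tree([rows[i] for i in right], [labels[i] for i in right],
                      mtry, rng, importance))


def predict_tree(node, row):
    """Follow the split rules from the root down to a leaf label."""
    while isinstance(node, tuple):
        f, thr, left, right = node
        node = left if row[f] <= thr else right
    return node


def fit_forest(rows, labels, n_trees=50, seed=0):
    """Fit n_trees trees, each on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    mtry = max(1, int(math.sqrt(n_features)))  # assumed default: sqrt(p)
    importance = [0.0] * n_features
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        forest.append(grow_tree([rows[i] for i in idx], [labels[i] for i in idx],
                                mtry, rng, importance))
    return forest, importance


def predict_forest(forest, row):
    """Majority vote over all trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]


# Hypothetical toy data: 40 samples, 20 features, only feature 0 carries signal.
data_rng = random.Random(1)
X = [[data_rng.random() for _ in range(20)] for _ in range(40)]
y = [int(row[0] > 0.5) for row in X]

forest, importance = fit_forest(X, y)
accuracy = sum(predict_forest(forest, r) == t for r, t in zip(X, y)) / len(X)
top_feature = max(range(20), key=lambda f: importance[f])
print("training accuracy:", accuracy, "top-ranked feature:", top_feature)
```

The accumulated Gini decrease per feature is one simple way to see how feature ranking falls out of the learning process; it is used here only for illustration, and accuracy is measured on the training data rather than on held-out samples.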

References

  1. Altmann, A., Toloşi, L., Sander, O., Lengauer, T.: Permutation importance: a corrected feature importance measure. Bioinformatics 26(10), 1340 (2010)

  2. Amaratunga, D., Cabrera, J., Lee, Y.: Enriched random forests. Bioinformatics 24(18), 2010 (2008)

  3. Bao, L., Zhou, M., Cui, Y.: nsSNPAnalyzer: identifying disease-associated nonsynonymous single nucleotide polymorphisms. Nucleic Acids Research 33(suppl 2), W480 (2005)

  4. Barenboim, M., Masso, M., Vaisman, I., Jamison, D.: Statistical geometry based prediction of nonsynonymous SNP functional effects using random forest and neuro-fuzzy classifiers. Proteins: Structure, Function, and Bioinformatics 71(4), 1930–1939 (2008)

  5. Barrett, J., Cairns, D.: Application of the random forest classification method to peaks detected from mass spectrometric proteomic profiles of cancer patients and controls. Statistical Applications in Genetics and Molecular Biology 7(2), 4 (2008)

  6. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001). DOI 10.1023/A:1010933404324

  7. Bureau, A., Dupuis, J., Falls, K., Lunetta, K.L., Hayward, B., Keith, T.P., Van Eerdewegh, P.: Identifying SNPs predictive of phenotype using random forests. Genet Epidemiol 28(2), 171–82 (2005). DOI 10.1002/gepi.20041

  8. Chen, X., Jeong, J.: Sequence-based prediction of protein interaction sites with an integrative method. Bioinformatics 25(5), 585 (2009)

  9. Chen, X., Liu, C.T., Zhang, M., Zhang, H.: A forest-based approach to identifying gene and gene–gene interactions. Proc Natl Acad Sci USA 104(49), 19199–19203 (2007). DOI 10.1073/pnas.0709868104

  10. Chen, X., Liu, M.: Prediction of protein–protein interactions using random decision forest framework. Bioinformatics 21(24), 4394 (2005)

  11. Chen, X., Wang, M., Zhang, H.: The use of classification trees for bioinformatics. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1(1), 55–63 (2011)

  12. Cummings, M., Myers, D.: Simple statistical models predict C-to-U edited sites in plant mitochondrial RNA. BMC Bioinformatics 5(1), 132 (2004)

  13. Cummings, M., Segal, M.: Few amino acid positions in rpoB are associated with most of the rifampin resistance in Mycobacterium tuberculosis. BMC Bioinformatics 5(1), 137 (2004)

  14. Cutler, D., Edwards Jr, T., Beard, K., Cutler, A., Hess, K., Gibson, J., Lawler, J.: Random forests for classification in ecology. Ecology 88(11), 2783–2792 (2007)

  15. Diaz-Uriarte, R., de Andrés, S.: Variable selection from random forests: application to gene expression data. Arxiv preprint q-bio/0503025 (2005)

  16. Dybowski, J.N., Heider, D., Hoffmann, D.: Prediction of co-receptor usage of HIV-1 from genotype. PLoS Comput Biol 6(4), e1000743 (2010). DOI 10.1371/journal.pcbi.1000743

  17. Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63, 3–42 (2006)

  18. Geurts, P., Fillet, M., De Seny, D., Meuwis, M., Malaise, M., Merville, M., Wehenkel, L.: Proteomic mass spectra classification using decision tree based ensemble methods. Bioinformatics 21(14), 3138 (2005)

  19. Hamby, S., Hirst, J.: Prediction of glycosylation sites using random forests. BMC Bioinformatics 9(1), 500 (2008)

  20. Hanselmann, M., Köthe, U., Kirchner, M., Renard, B., Amstalden, E., Glunde, K., Heeren, R., Hamprecht, F.: Toward digital staining using imaging mass spectrometry and random forests. Journal of Proteome Research 8(7), 3558–3567 (2009)

  21. Hothorn, T., Hornik, K., Zeileis, A.: Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics 15(3), 651–674 (2006)

  22. Izmirlian, G.: Application of the random forest classification algorithm to a SELDI-TOF proteomics study in the setting of a cancer prevention trial. Annals of the New York Academy of Sciences 1020(1), 154–174 (2004)

  23. Karpievitch, Y., Hill, E., Leclerc, A., Dabney, A., Almeida, J.: An introspective comparison of random forest-based classifiers for the analysis of cluster-correlated data by way of RF++. PLoS ONE 4(9), e7087 (2009)

  24. Kirchner, M., Timm, W., Fong, P., Wangemann, P., Steen, H.: Non-linear classification for on-the-fly fractional mass filtering and targeted precursor fragmentation in mass spectrometry experiments. Bioinformatics 26(6), 791 (2010)

  25. Kruglyak, L., Nickerson, D.A.: Variation is the spice of life. Nat Genet 27(3), 234–6 (2001). DOI 10.1038/85776

  26. Lee, J., Lee, J., Park, M., Song, S.: An extensive comparison of recent classification tools applied to microarray data. Computational Statistics & Data Analysis 48(4), 869–885 (2005)

  27. Lin, N., Wu, B., Jansen, R., Gerstein, M., Zhao, H.: Information assessment on predicting protein–protein interactions. BMC Bioinformatics 5(1), 154 (2004)

  28. Lunetta, K., Hayward, L., Segal, J., Van Eerdewegh, P.: Screening large-scale association study data: exploiting interactions using random forests. BMC Genetics 5(1), 32 (2004)

  29. Ma, Y., Ding, Z., Qian, Y., Shi, X., Castranova, V., Harner, E., Guo, L.: Predicting cancer drug response by proteomic profiling. Clinical Cancer Research 12(15), 4583 (2006)

  30. Meng, Y., Yu, Y., Cupples, L., Farrer, L., Lunetta, K.: Performance of random forest when SNPs are in linkage disequilibrium. BMC Bioinformatics 10(1), 78 (2009)

  31. Menze, B., Kelm, B., Masuch, R., Himmelreich, U., Bachert, P., Petrich, W., Hamprecht, F.: A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC Bioinformatics 10(1), 213 (2009)

  32. Moore, J., Asselbergs, F., Williams, S.: Bioinformatics challenges for genome-wide association studies. Bioinformatics 26(4), 445 (2010)

  33. Qi, Y., Bar-Joseph, Z., Klein-Seetharaman, J.: Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Proteins: Structure, Function, and Bioinformatics 63(3), 490–500 (2006)

  34. Qi, Y., Dhiman, H., Bhola, N., Budyak, I., Kar, S., Man, D., Dutta, A., Tirupula, K., Carr, B., Grandis, J., et al.: Systematic prediction of human membrane receptor interactions. Proteomics 9(23), 5243–5255 (2009)

  35. Qi, Y., Klein-Seetharaman, J., Bar-Joseph, Z.: Random forest similarity for protein–protein interaction prediction from multiple sources. In: Proceedings of the Pacific Symposium on Biocomputing (2005)

  36. Riddick, G., Song, H., Ahn, S., Walling, J., Borges-Rivera, D., Zhang, W., Fine, H.: Predicting in vitro drug sensitivity using random forests. Bioinformatics 27(2), 220 (2011)

  37. Saeys, Y., Inza, I., Larrañaga, P.: A review of feature selection techniques in bioinformatics. Bioinformatics 23(19), 2507 (2007)

  38. Segal, M.R.: Machine learning benchmarks and random forest regression. Technical Report, Center for Bioinformatics & Molecular Biostatistics, University of California, San Francisco (2004)

  39. Statnikov, A., Wang, L., Aliferis, C.: A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics 9(1), 319 (2008)

  40. Strobl, C., Boulesteix, A., Kneib, T., Augustin, T., Zeileis, A.: Conditional variable importance for random forests. BMC Bioinformatics 9(1), 307 (2008)

  41. Strobl, C., Boulesteix, A., Zeileis, A., Hothorn, T.: Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics 8(1), 25 (2007)

  42. Svetnik, V., Liaw, A., Tong, C., Culberson, J.C., Sheridan, R.P., Feuston, B.P.: Random forest: a classification and regression tool for compound classification and QSAR modeling. J Chem Inf Comput Sci 43(6), 1947–58 (2003). DOI 10.1021/ci034160g

  43. Tastan, O., Qi, Y., Carbonell, J., Klein-Seetharaman, J.: Prediction of interactions between HIV-1 and human proteins by information integration. In: Pac Symp Biocomput, vol. 516 (2009)

  44. Wang, M., Chen, X., Zhang, H.: Maximal conditional chi-square importance in random forests. Bioinformatics 26(6), 831 (2010)

  45. Wang, W.Y.S., Barratt, B.J., Clayton, D.G., Todd, J.A.: Genome-wide association studies: theoretical and practical concerns. Nat Rev Genet 6(2), 109–18 (2005). DOI 10.1038/nrg1522

  46. Wu, X., Wu, Z., Li, K.: Identification of differential gene expression for microarray data using recursive random forest. Chin Med J 121(24), 2492–2496 (2008)

  47. Yang, P., Hwa Yang, Y., Zhou, B., Zomaya, Y., et al.: A review of ensemble methods in bioinformatics. Current Bioinformatics 5(4), 296–308 (2010)

  48. Zhang, H., Yu, C., Singer, B.: Cell and tumor classification using gene expression data: construction of forests. Proceedings of the National Academy of Sciences 100(7), 4168 (2003)

Author information

Corresponding author

Correspondence to Yanjun Qi.

Copyright information

© 2012 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Qi, Y. (2012). Random Forest for Bioinformatics. In: Zhang, C., Ma, Y. (eds) Ensemble Machine Learning. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-9326-7_11

  • DOI: https://doi.org/10.1007/978-1-4419-9326-7_11

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4419-9325-0

  • Online ISBN: 978-1-4419-9326-7

  • eBook Packages: Engineering, Engineering (R0)
