Know-GRRF: Domain-Knowledge Informed Biomarker Discovery with Random Forests

  • Xin Guan
  • Li Liu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10814)

Abstract

Because of its robustness and built-in feature selection capability, random forest is frequently employed in omics studies for biomarker discovery and predictive modeling. However, random forest treats all features as equally important a priori, whereas domain knowledge may justify prioritizing the more relevant ones. Furthermore, it has been shown that an antecedent feature selection step can improve the performance of random forest by reducing noise and shrinking the search space. In this paper, we present a novel Know-guided regularized random forest (Know-GRRF) method that incorporates domain knowledge in a random forest framework for feature selection. Via rigorous simulations, we show that Know-GRRF outperforms existing methods by correctly identifying informative features and improving the accuracy of subsequent predictive models. Know-GRRF is responsive to a wide range of tuning parameters that help to better differentiate candidate features. Know-GRRF is also stable from run to run, making it robust to noise. We further prove that Know-GRRF is a generalized form of two existing methods, RRF and GRRF. We applied Know-GRRF to a real-world radiation biodosimetry study that uses non-human primate data to discover biomarkers for human applications. By using cross-species correlation as domain knowledge, Know-GRRF identified three gene markers that significantly improved the cross-species prediction accuracy. We implemented Know-GRRF as an R package that is available through the CRAN archive.
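
As a rough illustration of the regularization idea that RRF and GRRF build on, the sketch below scales down the split gain of any feature not yet selected by a per-feature coefficient. It is a minimal sketch under stated assumptions, not the Know-GRRF algorithm or the R package's API: the function names, the gain values, and the use of a knowledge score directly as the coefficient are all hypothetical and are not taken from the paper.

```python
def regularized_gain(gain, feature, selected, coef):
    """Return the raw gain for an already selected feature,
    otherwise the gain scaled by that feature's coefficient."""
    if feature in selected:
        return gain
    return coef[feature] * gain


def pick_split(gains, selected, coef):
    """Choose the feature with the largest regularized gain at one node
    and record it in `selected` if its regularized gain is positive."""
    best = max(gains, key=lambda f: regularized_gain(gains[f], f, selected, coef))
    if regularized_gain(gains[best], best, selected, coef) > 0:
        selected.add(best)
    return best


# Hypothetical raw gains at a single tree node.
gains = {"geneA": 0.30, "geneB": 0.28, "geneC": 0.10}

# One shared coefficient for every feature (RRF-style): raw gain decides.
uniform = {f: 0.8 for f in gains}

# Hypothetical knowledge scores in [0, 1] used directly as coefficients
# (an assumption for illustration; the paper's mapping may differ).
knowledge = {"geneA": 0.2, "geneB": 0.9, "geneC": 0.5}

print(pick_split(gains, set(), uniform))
print(pick_split(gains, set(), knowledge))
```

With the shared coefficient the feature with the highest raw gain is chosen, while the knowledge-weighted coefficients let a slightly weaker but better supported feature win. A feature already in the selected set is never penalized, which is what discourages adding redundant features.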

Keywords

Biomarker discovery, Domain knowledge, Feature selection, Regularized random forest

Notes

Acknowledgments

We thank George Runger, Kristin Gillis, Vel Murugan, Jin Park and Garrick Wallstrom for insightful discussions. This project has been funded in part with federal funds from the Biomedical Advanced Research and Development Authority, Office of the Assistant Secretary for Preparedness and Response, Office of the Secretary, Department of Health and Human Services, under Contract No. HHS01201000008C.

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Intel Corporation, Chandler, USA
  2. Department of Biomedical Informatics, Scottsdale, USA
  3. Biodesign Institute, Arizona State University, Tempe, USA
  4. Department of Neurology, Mayo Clinic, Scottsdale, USA
