
A review of random forest-based feature selection methods for data science education and applications

  • Review
  • Published in: International Journal of Data Science and Analytics

Abstract

Random forest (RF) is one of the most popular statistical learning methods in both data science education and applications. Feature selection, enabled by RF, is often among the very first tasks in a data science project, such as a college capstone project or an industry consulting project. The goal of this paper is to provide a comprehensive review of 12 RF-based feature selection methods for classification problems. The review provides the necessary description of each method and the corresponding software packages. We show that different methods typically do not provide consistent feature selection results, and that model performance also varies when different RF-based feature selection approaches are employed. This observation suggests that caution must be taken when performing feature selection with RF: feature selection cannot be done blindly, without a sound understanding of the methods adopted, which is not always the case in industry and in many senior capstone projects that we have observed. The paper serves as a one-stop reference where students, data science consultants, engineers, and data scientists can access the basic ideas behind these methods, the advantages and limitations of different approaches, and the software packages that implement them.
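The inconsistency the abstract describes can be seen even with the two importance measures shipped in a single library. The sketch below (illustrative only, not taken from the paper; the paper's methods are implemented in R packages) uses scikit-learn on synthetic data to compute impurity-based (MDI) importance and permutation importance from the same fitted forest, then compares the resulting feature rankings:

```python
# Illustrative sketch: two common RF importance measures can rank
# features differently, motivating the caution the paper advises.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic classification data: 8 features, only 3 informative.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based (MDI) importance: fast, computed during training,
# but known to be biased toward high-cardinality features.
mdi = rf.feature_importances_

# Permutation importance: drop in score when a feature is shuffled;
# slower, but measured on actual predictive performance.
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)

mdi_rank = np.argsort(mdi)[::-1]
perm_rank = np.argsort(perm.importances_mean)[::-1]
print("MDI ranking:        ", mdi_rank)
print("Permutation ranking:", perm_rank)
```

The two rankings often agree on the strongest features but can disagree on the rest, which is exactly why a selected feature subset depends on which RF-based method is used.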


Data availability

All datasets are publicly available and the sources of data are stated in the paper.


Funding

This material was partially supported by the National Science Foundation under Award No. OIA-1946391.

Author information


Contributions

The first author, RI, was a Ph.D. student and graduate research assistant of the corresponding author. RI performed the literature review, completed the numerical examples, and prepared the draft of the paper. The corresponding author, XL, provided advice throughout the process, led research meetings and discussions, and revised the manuscript.

Corresponding author

Correspondence to Xiao Liu.

Ethics declarations

Conflict of interest

The authors are not aware of any potential conflict of interest.

Human and animal participants

The paper does not involve human participants or animals.

Informed consent

The paper does not involve any informed consent.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Iranzad, R., Liu, X. A review of random forest-based feature selection methods for data science education and applications. Int J Data Sci Anal (2024). https://doi.org/10.1007/s41060-024-00509-w

