Abstract
Random forest (RF) is one of the most popular statistical learning methods in both data science education and applications. Feature selection, enabled by RF, is often among the first tasks in a data science project, such as a college capstone project or an industry consulting engagement. The goal of this paper is to provide a comprehensive review of 12 RF-based feature selection methods for classification problems. The review describes each method and its supporting software packages. We show that different methods typically do not produce consistent feature selection results, and that model performance also varies when different RF-based feature selection approaches are employed. This observation suggests that caution must be taken when performing feature selection with RF: feature selection cannot be done blindly without a sound understanding of the methods adopted, which is not always the case in industry or in many senior capstone projects that we have observed. The paper serves as a one-stop reference where students, data science consultants, engineers, and data scientists can access the basic ideas behind these methods, the advantages and limitations of different approaches, and the software packages that implement them.
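The inconsistency described above can be seen even without the paper's 12 reviewed methods. As a minimal sketch (not taken from the paper), the following compares two common RF-based importance measures available in scikit-learn — impurity-based importance and permutation importance — which can already rank the same features differently on the same data; the dataset and all parameter choices here are illustrative assumptions.

```python
# Minimal sketch: two common RF-based feature importance measures
# can rank the same features differently on the same data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic classification data: 8 features, only 3 of them informative
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Measure 1: mean decrease in impurity, built into the fitted forest
mdi_rank = np.argsort(rf.feature_importances_)[::-1]

# Measure 2: permutation importance, computed by shuffling each feature
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
perm_rank = np.argsort(perm.importances_mean)[::-1]

print("Impurity-based ranking:   ", mdi_rank)
print("Permutation-based ranking:", perm_rank)
```

Comparing the two printed rankings (and repeating with different random seeds) illustrates why the choice of importance measure, and of the feature selection method built on top of it, must be made with care rather than by default.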
Data availability
All datasets are publicly available and the sources of data are stated in the paper.
Funding
This material was partially supported by the National Science Foundation under Award No. OIA-1946391.
Author information
Authors and Affiliations
Contributions
The first author, RI, was a Ph.D. student and graduate research assistant of the corresponding author. RI performed the literature review, completed the numerical examples, and prepared the draft of the paper. The corresponding author, XL, provided advice throughout the process, led research meetings and discussions, and revised the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors are not aware of any potential conflicts of interest.
Human and animal participants
The paper does not involve human participants and/or animals.
Informed consent
The paper does not involve any informed consent.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Iranzad, R., Liu, X. A review of random forest-based feature selection methods for data science education and applications. Int J Data Sci Anal (2024). https://doi.org/10.1007/s41060-024-00509-w
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s41060-024-00509-w