Abstract
Random forest (RF) is one of the most popular statistical learning methods in both data science education and applications. Feature selection, enabled by RF, is often among the first tasks in a data science project, such as a college capstone project or an industry consulting engagement. The goal of this paper is to provide a comprehensive review of 12 RF-based feature selection methods for classification problems. The review describes each method and its supporting software packages. We show that different methods typically do not produce consistent feature selection results, and that model performance also varies when different RF-based feature selection approaches are employed. This observation suggests that caution must be taken when performing feature selection with RF: feature selection cannot be done blindly without a sound understanding of the methods adopted, which is not always the case in industry or in many senior capstone projects that we have observed. The paper serves as a one-stop reference where students, data science consultants, engineers, and data scientists can access the basic ideas behind these methods, the advantages and limitations of different approaches, and the software packages that implement them.
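The inconsistency described above can be seen even without the paper's 12 reviewed methods. As a minimal sketch (not taken from the paper), the following compares two common RF-based importance measures available in scikit-learn — impurity-based importance and permutation importance — which can already rank the same features differently on the same data; the dataset and all parameter choices here are illustrative assumptions.

```python
# Minimal sketch: two common RF-based feature importance measures
# can rank the same features differently on the same data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic classification data: 8 features, only 3 of them informative
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Measure 1: mean decrease in impurity, built into the fitted forest
mdi_rank = np.argsort(rf.feature_importances_)[::-1]

# Measure 2: permutation importance, computed by shuffling each feature
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
perm_rank = np.argsort(perm.importances_mean)[::-1]

print("Impurity-based ranking:   ", mdi_rank)
print("Permutation-based ranking:", perm_rank)
```

Comparing the two printed rankings (and repeating with different random seeds) illustrates why the choice of importance measure, and of the feature selection method built on top of it, must be made with care rather than by default.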
Data availability
All datasets are publicly available and the sources of data are stated in the paper.
Funding
This material was partially supported by the National Science Foundation under Award No. OIA-1946391.
Author information
Authors and Affiliations
Contributions
The first author, RI, was a Ph.D. student and graduate research assistant of the corresponding author. RI performed the literature review, completed the numerical examples, and prepared the draft of the paper. The corresponding author, XL, provided advice throughout the process, led research meetings and discussions, and revised the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors are not aware of any potential conflicts of interest.
Human and animal participants
The paper does not involve human participants and/or animals.
Informed consent
The paper does not involve any informed consent.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Iranzad, R., Liu, X. A review of random forest-based feature selection methods for data science education and applications. Int J Data Sci Anal (2024). https://doi.org/10.1007/s41060-024-00509-w
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s41060-024-00509-w