Abstract
Learning algorithms can be less effective on datasets with an extensive feature space because of irrelevant and redundant features. Feature selection reduces the dimensionality of the feature space by eliminating such features without significantly degrading the decision quality of the trained model. Over the last few decades, numerous algorithms have been developed to identify the most significant features for specific learning tasks. Each has its advantages and disadvantages, and it falls to the data scientist to judge the suitability of a particular algorithm for a given task. With such a vast number of feature selection algorithms available, however, choosing the appropriate one can be daunting even for an expert. These challenges motivated us to analyze the properties of the algorithms together with the characteristics of datasets. This paper reviews existing feature selection algorithms, providing an exhaustive analysis of their properties and relative performance, and addresses their evolution, formulation, and usefulness. It further categorizes the reviewed algorithms according to the properties required for a specific dataset and study objective, discusses popular area-specific feature selection techniques, and identifies open research challenges in feature selection that are yet to be overcome.
Acknowledgements
The authors would like to thank the anonymous reviewers for their valuable comments and suggestions.
Author information
Contributions
DT and KB wrote the main manuscript text.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Theng, D., Bhoyar, K.K. Feature selection techniques for machine learning: a survey of more than two decades of research. Knowl Inf Syst 66, 1575–1637 (2024). https://doi.org/10.1007/s10115-023-02010-5