Abstract
Filter methods are a class of feature selection techniques used to identify a subset of informative features during data preprocessing. While the differential efficacy of these techniques has been extensively compared in data science pipelines for predictive outcome modeling, less work has examined how their stability is affected by properties of the underlying data corpora. A set of six stability metrics (Davis, Dice, Jaccard, Kappa, Lustgarten, and Novovičová) was compared during cross-validation in a Monte Carlo simulation study on synthetic data to examine variability in the stability of three filter methods in data pipelines for binary classification, considering five underlying data properties: (1) measurement error in the independent covariates, (2) number of training observations, (3) number of features, (4) magnitude of class imbalance, and (5) missing data pattern. Feature selection stability was platykurtic and was negatively impacted by measurement error and by a smaller number of training observations in the input corpora. The Novovičová stability metric yielded the highest mean stability values, while the Davis stability metric yielded the lowest. The distribution of all stability metrics was negatively skewed, and the Jaccard metric exhibited the largest variability across all five data properties. A statistical analysis of the synergistic effects between filter feature selection techniques, filter cutoffs, data corpora properties, and machine learning (ML) algorithms on overall pipeline efficacy, quantified using the area under the curve (AUC) evaluation metric, is also presented and discussed.
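Several of the metrics named above (Jaccard, Dice) quantify stability as the average set overlap between the feature subsets selected in different cross-validation folds. The sketch below illustrates this idea under the standard subset-similarity definitions of the Jaccard and Dice indices, averaged over all fold pairs; it is a minimal illustration only, and the function names and example subsets are our own, not taken from the paper.

```python
# Illustrative sketch: mean pairwise set-overlap stability of feature
# selection across cross-validation folds, using the standard Jaccard
# and Dice similarity indices on selected-feature subsets.
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard index: |A ∩ B| / |A ∪ B| (1.0 for two empty sets)."""
    return len(a & b) / len(a | b) if a | b else 1.0


def dice(a: set, b: set) -> float:
    """Dice index: 2|A ∩ B| / (|A| + |B|) (1.0 for two empty sets)."""
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 1.0


def mean_pairwise_stability(subsets, sim) -> float:
    """Average a similarity index over all pairs of selected subsets."""
    pairs = list(combinations(subsets, 2))
    return sum(sim(a, b) for a, b in pairs) / len(pairs)


# Feature subsets selected in three hypothetical CV folds
folds = [{"x1", "x2", "x3"}, {"x1", "x2", "x4"}, {"x1", "x3", "x4"}]
print(round(mean_pairwise_stability(folds, jaccard), 3))  # 0.5
print(round(mean_pairwise_stability(folds, dice), 3))     # 0.667
```

A value of 1.0 would indicate that every fold selected exactly the same features; values near 0 indicate that the selected subsets barely overlap, which is the instability the simulation study measures.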
Data availability
The R software used to perform the simulation study can be provided upon request by sending an email to the corresponding author.
References
Alelyani, S.: On feature selection stability: a data perspective. Doctoral Dissertation. Arizona State University, Tempe, Arizona (2013)
Alexandro, D.: Aiming for success: evaluating statistical and machine learning methods to predict high school student performance and improve early warning systems. Doctoral Dissertation. University of Connecticut, Storrs, Connecticut (2018)
Almutiri, T., Saeed, F.: A hybrid feature selection method combining Gini index and support vector machine with recursive feature elimination for gene expression classification. Int. J. Data Min. Modell. Manag. 14(1), 41–62 (2022)
Aphinyanaphongs, Y., Fu, L.D., Li, Z., Peskin, E.R., Efstathiadis, E., Aliferis, C.F., Statnikov, A.: A comprehensive empirical comparison of modern supervised classification and feature selection methods for text categorization. J. Associat. Inform. Sci. Technol. 65(10), 1964–1987 (2014)
Barabanova, I.V., Vychuzhanin, P., Nikitin, N.O.: Sensitivity analysis of the composite data-driven pipelines in the automated machine learning. Procedia Comp. Sci. 193, 484–493 (2021)
Belanche, L.A., González, F.F.: Review and evaluation of feature selection algorithms in synthetic problems. arXiv preprint arXiv:1101.2320 (2011)
Berens, J., Schneider, K., Görtz, S., Oster, S., Burghoff, J.: Early detection of students at risk – predicting student dropouts using administrative student data and machine learning methods. J. Educat. Data Min. 11(3), 1–41 (2018)
Bertolini, R.: Evaluating performance variability of data pipelines for binary classification with applications to predictive learning analytics. (Doctoral Dissertation). Stony Brook University, Stony Brook, New York (2021)
Bertolini, R., Finch, S.J.: Synergistic effects between data corpora properties and machine learning performance in data pipelines. Int. J. Data Min. Modell. Manag. 14(3), 217–233 (2022)
Bertolini, R., Finch, S.J., Nehm, R.H.: Enhancing data pipelines for forecasting student performance: integrating feature selection with cross-validation. Int. J. Educat. Technol. Higher Educat. 18(1), 1–23 (2021)
Bertolini, R., Finch, S.J., Nehm, R.H.: Quantifying variability in predictions of student performance: examining the impact of bootstrap resampling in data pipelines. Comp. Educat.: Artif. Intell. 3, 100067 (2022)
Bharathi, N., Rishiikeshwer, B.S., Shriram, T.A., Santhi, B., Brindha, G.R.: The significance of feature selection techniques in machine learning. Fund. Meth. Mach. Deep. Learn. Algorith. Tool. Appl. (2022). https://doi.org/10.1002/9781119821908.ch5
Biswas, S., Wardat, M., Rajan, H.: The Art and Practice of Data Science Pipelines: A Comprehensive Study of Data Science Pipelines In Theory, In-The-Small, and In-The-Large. arXiv preprint arXiv:2112.01590 (2021)
Bolón-Canedo, V., Sánchez-Maroño, N., Alonso-Betanzos, A.: A review of feature selection methods on synthetic data. Knowled. Infor. Sys. 34(3), 483–519 (2013)
Bolón-Canedo, V., Sánchez-Maroño, N., Alonso-Betanzos, A.: Recent advances and emerging challenges of feature selection in the context of big data. Knowl.-Based Sys. 86, 33–45 (2015)
Bolón-Canedo, V., Sánchez-Maroño, N., Alonso-Betanzos, A., Benítez, J.M., Herrera, F.: A review of microarray datasets and applied feature selection methods. Infor. Sci. 282, 111–135 (2014)
Bommert, A.M.: Integration of feature selection stability in model fitting. Doctoral Dissertation. TU Dortmund University, Dortmund, Germany (2021)
Bommert, A.M., Lang, M.: Stabm: stability measures for feature selection. J. Open Sour. Softw. 6(59), 3010 (2021)
Bommert, A.M., Rahnenführer, J.: Adjusted measures for feature selection stability for data sets with similar features. In: International conference on machine learning, optimization, and data science, pp. 203–214. Springer, Cham (2020)
Bommert, A.M., Rahnenführer, J., Lang, M.: A multicriteria approach to find predictive and sparse models with stable feature selection for high-dimensional data. Comput. Math. Methods Med. 2017, 7907163 (2017)
Bommert, A.M., Sun, X., Bischl, B., Rahnenführer, J., Lang, M.: Benchmark for filter methods for feature selection in high-dimensional classification data. Comput. Stat. Data Anal. 143, 106839 (2020)
Bommert, A.M., Welchowski, T., Schmid, M., Rahnenführer, J.: Benchmark of filter methods for feature selection in high-dimensional gene expression survival data. Brief. Bioinfor. 23(1), 1–13 (2022)
Bonferroni, C.: Teoria statistica delle classi e calcolo delle probabilita. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commericiali di Firenze 8, 3–62 (1936)
Borda, J.C.: Mémoire sur les élections au scrutin. Mémoires de l'Académie royale des Sciences de Paris pour l'Année 1781, 657–665 (1781)
Boulesteix, A.L., Slawski, M.: Stability and aggregation of ranked gene lists. Brief. Bioinfor. 10(5), 556–568 (2009)
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and regression trees. Routledge, London (2017)
Brown, G., Pocock, A., Zhao, M.J., Luján, M.: Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. J. Mach. Learn. Res. 13(1), 27–66 (2012)
Burka, D., Puppe, C., Szepesváry, L., Tasnádi, A.: And the winner is... Chevalier de Borda: Neural networks vote according to Borda’s Rule. In: Proceedings of the Sixth International Workshop on Computational Social Choice (2016)
Carletta, J.: Assessing agreement on classification tasks: the kappa statistic. Comput. Ling. 22(2), 249–254 (1996)
Chaibub Neto, E., Bare, J.C., Margolin, A.A.: Simulation studies as designed experiments: the comparison of penalized regression models in the “large p, small n” setting. PloS one 9(10), e107957 (2014)
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
Couronné, R., Probst, P., Boulesteix, A.L.: Random forest versus logistic regression: a large-scale benchmark experiment. BMC Bioinfor. 19(1), 1–14 (2018)
Dash, M., Liu, H.: Feature selection for classification. Intell. Data Anal. 1(3), 131–156 (1997)
Davis, C.A., Gerick, F., Hintermair, V., Friedel, C.C., Fundel, K., Küffner, R., Zimmer, R.: Reliable gene signatures for microarray classification: assessment of stability and performance. Bioinformatics 22(19), 2356–2363 (2006)
Davison, A.C., Hinkley, D.V.: Bootstrap methods and their application (No. 1). Cambridge University Press, Cambridge (1997)
Densmore, J.: Data pipeline pocket reference. O’Reilly Media, Inc (2021)
Dice, L.R.: Measures of the amount of ecologic association between species. Ecology 26(3), 297–302 (1945)
Dittman, D.J., Khoshgoftaar, T.M., Wald, R., Napolitano, A.: Similarity analysis of feature ranking techniques on imbalanced DNA microarray datasets. In: 2012 IEEE International conference on bioinformatics and biomedicine, pp. 1–5. IEEE (2012)
Dittman, D.J., Khoshgoftaar, T.M., Wald, R., Napolitano, A.: Classification performance of rank aggregation techniques for ensemble gene selection. In: Proceedings of the twenty-sixth international FLAIRS conference, pp. 420–425 (2013)
Duangsoithong, R., Windeatt, T.: Bootstrap feature selection for ensemble classifiers. In: Industrial conference on data mining, pp. 28–41. Springer, Berlin, Heidelberg (2010)
Dwork, C., Kumar, R., Naor, M., Sivakumar, D.: Rank aggregation methods for the web. In: Proceedings of the 10th International world wide web conference, pp. 613–622. ACM (2001)
Ebenuwa, S.H., Sharif, M.S., Alazab, M., Al-Nemrat, A.: Variance ranking attributes selection techniques for binary classification problem in imbalance data. IEEE Access 7, 24649–24666 (2019)
Ghai, B., Mishra, M., Mueller, K.: Cascaded debiasing: studying the cumulative effect of multiple fairness-enhancing interventions. arXiv preprint arXiv:2202.03734 (2022)
Goswami, S., Chakraborty, S., Guha, P., Tarafdar, A., Kedia, A.: Filter-based feature selection methods using hill climbing approach. In: Natural computing for unsupervised learning, pp. 213–234. Springer, Cham (2019)
Gulgezen, G., Cataltepe, Z., Yu, L.: Stable and accurate feature selection. In: Joint European conference on machine learning and knowledge discovery in databases, pp. 455–468. Springer, Berlin, Heidelberg (2009)
Guzmán-Martinez, R., Alaiz-Rodríguez, R.: Feature selection stability assessment based on the Jensen–Shannon divergence. In: Joint European conference on machine learning and knowledge discovery in databases, pp. 597–612. Springer, Berlin, Heidelberg (2011)
Hall, M.A.: Correlation-based feature selection for machine learning. Doctoral Dissertation. University of Waikato, Hamilton, New Zealand (1999)
Hopf, K., Reifenrath, S.: Filter methods for feature selection in supervised machine learning applications–Review and benchmark. arXiv preprint arXiv:2111.12140 (2021)
Hua, J., Tembe, W.D., Dougherty, E.R.: Performance of feature-selection methods in the classification of high-dimension data. Patt. Recognit. 42(3), 409–424 (2009)
Huang, B.F., Boutros, P.C.: The parameter sensitivity of random forests. BMC Bioinform. 17(1), 1–13 (2016)
Huang, C.: Feature selection and feature stability measurement method for high-dimensional small sample data based on big data technology. Computat. Intell. Neurosci. 2021, 1–12 (2021)
Izenman, A.J.: Modern multivariate statistical techniques. In: Springer Texts in Statistics, Springer, New York (2008)
Jaccard, P.: Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de La Société Vaudoise Des Sciences Naturelles 37, 547–579 (1901)
Källberg, D., Vidman, L., Rydén, P.: Comparison of methods for feature selection in clustering of high-dimensional RNA-sequencing data to identify cancer subtypes. Front. Genet. 12, 632620 (2021)
Kalousis, A., Prados, J., Hilario, M.: Stability of feature selection algorithms: a study on high-dimensional spaces. Knowl. Inform. Sys. 12(1), 95–116 (2007)
Karegowda, A.G., Manjunath, A.S., Jayaram, M.A.: Comparative study of attribute selection using gain ratio and correlation based feature selection. Int. J. Inform. Technol. Knowl. Manag. 2(2), 271–277 (2010)
Karunakaran, V., Rajasekar, V., Joseph, S.: Exploring filter and wrapper feature selection techniques in machine learning. In: Computational vision and bio-inspired computing, pp. 497–506. Springer, Singapore (2021)
Khaire, U.M., Dhanalakshmi, R.: Stability of feature selection algorithm: a review. J. King Saud Univer. Comp. Inf. Sci. 34(4), 1060–1073 (2019)
Khoshgoftaar, T.M., Gao, K., Seliya, N.: Attribute selection and imbalanced data: problems in software defect prediction. In: 2010 22nd IEEE International conference on tools with artificial intelligence, pp. 137–144. IEEE (2010)
Khoshgoftaar, T.M., Golawala, M., Van Hulse, J.: An empirical study of learning from imbalanced data using random forest. In: 19th IEEE International conference on tools with artificial intelligence, pp. 310–317. IEEE (2007)
Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. In: International joint conference on artificial intelligence, pp. 1137–1145 (1995)
Koprinska, I., Rana, M., Agelidis, V.G.: Correlation and instance based feature selection for electricity load forecasting. Knowl.-Based Sys. 82, 29–40 (2015)
Krízek, P., Kittler, J., Hlavác, V.: Improving stability of feature selection methods. In: International conference on computer analysis of images and patterns, pp. 929–936. Springer, Berlin, Heidelberg (2007)
Kuhn, M.: Caret: classification and regression training. Astrophysics Source Code Library, ascl-1505 (2015)
Kujawska, H., Slavkovik, M., Rückmann, J.J.: Predicting the winners of Borda, Kemeny and Dodgson elections with supervised machine learning. In: Multi-agent systems and agreement technologies, pp. 440–458. Springer, Cham (2020)
Laborda, J., Ryoo, S.: Feature selection in a credit scoring model. Mathematics 9(7), 746 (2021)
Lausser, L., Müssel, C., Maucher, M., Kestler, H.A.: Measuring and visualizing the stability of biomarker selection techniques. Comput. Stat. 28(1), 51–65 (2013)
Lazar, C., Taminau, J., Meganck, S., Steenhoff, D., Coletta, A., Molter, C., de Schaetzen, V., Duque, R., Bersini, H., Nowe, A.: A survey on filter techniques for feature selection in gene expression microarray analysis. IEEE/ACM Trans. Comput. Biol. Bioinform. 9(4), 1106–1119 (2012)
Liu, H.: Algorithms for Scalability and Security in Adversarial Environments. Doctoral Dissertation. The University of Arizona, Tucson, Arizona (2021)
Lustgarten, J.L., Gopalakrishnan, V., Visweswaran, S.: Measuring stability of feature selection in biomedical datasets. In: AMIA Annual Symposium Proceedings, p. 406. American Medical Informatics Association (2009)
Mangal, A., Holm, E.A.: A comparative study of feature selection methods for stress hotspot classification in materials. Integrat. Mater. Manuf. Innovat. 7(3), 87–95 (2018)
Marshall, A., Altman, D.G., Royston, P., Holder, R.L.: Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study. BMC Med. Resear. Methodol. 10(1), 1–16 (2010)
Meng, X.B., Gao, X.Z., Lu, L., Liu, Y., Zhang, H.: A new bio-inspired optimisation algorithm: Bird Swarm Algorithm. J. Exper. Theoret. Artif. Intell. 28(4), 673–687 (2016)
Meyer, P.E., Schretter, C., Bontempi, G.: Information-theoretic feature selection in microarray data using variable complementarity. IEEE J. Select. Top. Sign. Process. 2(3), 261–274 (2008)
Mohd Yusof, M., Mohamed, R., Wahid, N.: Benchmark of feature selection techniques with machine learning algorithms for cancer datasets. In: Proceedings of the international conference on artificial intelligence and robotics and the International conference on automation, control, and robotics engineering, pp. 1–5 (2016)
Montgomery, D.C.: Design and Analysis of Experiments. John Wiley & Sons (2017)
Moons, E., Aerts, M., Wets, G.: A tree based lack-of-fit test for multiple logistic regression. Stat. Med. 23(9), 1425–1438 (2004)
Morán-Fernández, L., Bólon-Canedo, V., Alonso-Betanzos, A.: How important is data quality? Best classifiers vs best features. Neurocomputing 470, 365–375 (2022)
Munirathinam, D.J., Ranganadhan, M.: A new improved filter based feature selection model for high-dimensional data. J. Supercomp. 76(8), 5745–5762 (2020)
Nogueira, S.: Quantifying the stability of feature selection. Doctoral dissertation. The University of Manchester, Manchester, United Kingdom (2018)
Nogueira, S., Brown, G.: Measuring the stability of feature selection. In: Joint European conference on machine learning and knowledge discovery in databases, pp. 442–457. Springer, Cham (2016)
Nogueira, S., Sechidis, K., Brown, G.: On the stability of feature selection algorithms. J. Mach. Learn. Res. 18(1), 6345–6398 (2017)
Novovičová, J., Somol, P., Pudil, P.: A new measure of feature selection algorithms’ stability. In: 2009 IEEE International conference on data mining workshops, pp. 382–387. IEEE (2009)
Rajbahadur, G.K., Oliva, G.A., Hassan, A.E., Dingel, J.: Pitfalls analyzer: quality control for model-driven data science pipelines. In: 2019 ACM/IEEE 22nd international conference on model driven engineering languages and systems (MODELS), pp. 12–22. IEEE (2019)
Ramaswami, M.R., Bhaskaran, R.: A study on feature selection techniques in educational data mining. J. Comput. 1(1), 7–11 (2009)
Ren, K., Fang, W., Qu, J., Zhang, X., Shi, X.: Comparison of eight filter-based feature selection methods for monthly streamflow forecasting—three case studies on CAMELS data sets. J. Hydrol. 586, 124897 (2020)
Romanski, P., Kotthoff, L., Kotthoff, M.L.: Package ‘FSelector’. URL: http://cran.r-project.org/web/packages/FSelector/index.html (2013)
Salman, R., Alzaatreh, A., Sulieman, H.: The stability of different aggregation techniques in ensemble feature selection. J. Big Data 9(1), 1–23 (2022)
Sánchez-Maroño, N., Alonso-Betanzos, A., Tombilla-Sanromán, M.: Filter methods for feature selection – a comparative study. In: International conference on intelligent data engineering and automated learning, pp. 178–187. Springer, Berlin, Heidelberg (2007)
Sarkar, C., Cooley, S., Srivastava, J.: Robust feature selection technique using rank aggregation. Appl. Artif. Intell. 28(3), 243–257 (2014)
Sen, R., Mandal, A.K., Chakraborty, B.: A critical study on stability measures of feature selection with a novel extension of lustgarten index. Mach. Learn. Knowl. Extract. 3(4), 771–787 (2021)
Sen, R., Mandal, A.K., Chakraborty, B.: Performance analysis of extended lustgarten index for stability of feature selection. In: 2021 IEEE international conference on service operations and logistics, and informatics (SOLI), pp. 1–5. IEEE (2021)
Skiena, S.S.: The Data Science Design Manual. Springer (2017)
Skurichina, M., Duin, R.P.: Combining feature subsets in feature selection. In: International workshop on multiple classifier systems, pp. 165–175. Springer, Berlin, Heidelberg (2005)
Somol, P., Novovičová, J.: Evaluating stability and comparing output of feature selectors that optimize feature subset cardinality. IEEE Trans. Patt. Anal. Mach. Intell. 32(11), 1921–1939 (2010)
Strobl, C., Boulesteix, A.L., Kneib, T., Augustin, T., Zeileis, A.: Conditional variable importance for random forests. BMC Bioinform. 9(1), 1–11 (2008)
Strobl, C., Boulesteix, A.L., Zeileis, A., Hothorn, T.: Bias in random forest variable importance measures: illustrations, sources, and a solution. BMC Bioinform. 8(1), 1–21 (2007)
Subbian, K., Melville, P.: Supervised rank aggregation for predicting influence in networks. arXiv preprint arXiv:1108.4801 (2011)
Sun, L., Wang, L., Ding, W., Qian, Y., Xu, J.: Feature selection using fuzzy neighborhood entropy-based uncertainty measures for fuzzy neighborhood multigranulation rough sets. IEEE Trans. Fuzzy Sys. 29(1), 19–33 (2020)
Tan, F., Fu, X., Zhang, Y., Bourgeois, A.G.: A genetic algorithm-based method for feature subset selection. Soft Comp. 12(2), 111–120 (2008)
Toloşi, L., Lengauer, T.: Classification with correlated features: unreliability of feature ranking and solutions. Bioinform. 27(14), 1986–1994 (2011)
Tsanas, A., Little, M.A., McSharry, P.E.: A simple filter benchmark for feature selection. J. Mach. Learn. Res. 1, 1–24 (2010)
Tunkiel, A.T., Sui, D., Wiktorski, T.: Data-driven sensitivity analysis of complex machine learning models: a case study of directional drilling. J. Petrol. Sci. Eng. 195, 107630 (2020)
Urbanowicz, R.J., Meeker, M., La Cava, W., Olson, R.S., Moore, J.H.: Relief-based feature selection: introduction and review. J. Biomed. Inform. 85, 189–203 (2018)
Urkullu, A., Pérez, A., Calvo, B.: Statistical model for reproducibility in ranking-based feature selection. Knowl. Inform. Sys. 63(2), 379–410 (2021)
Van Buuren, S., Groothuis-Oudshoorn, K.: mice: multivariate imputation by chained equations in R. J. Stat. Softw. 45(1), 1–67 (2011)
Wah, Y.B., Ibrahim, N., Hamid, H.A., Abdul-Rahman, S., Fong, S.: Feature selection methods: case of filter and wrapper approaches for maximising classification accuracy. Pertanika J. Sci. Technol. 26(1), 329–340 (2018)
Wald, R., Khoshgoftaar, T.M., Dittman, D., Awada, W., Napolitano, A.: An extensive comparison of feature ranking aggregation techniques in bioinformatics. In: 2012 IEEE 13th international conference on information reuse & integration (IRI), pp. 377–384. IEEE (2012)
Wald, R., Khoshgoftaar, T.M., Dittman, D.: Mean aggregation versus robust rank aggregation for ensemble gene selection. In: 2012 11th International conference on machine learning and applications, pp. 63–69. IEEE (2012)
Wald, R., Khoshgoftaar, T.M., Napolitano, A.: Stability of filter- and wrapper-based feature subset selection. In: 2013 IEEE 25th International conference on tools with artificial intelligence, pp. 374–380. IEEE (2013)
Ying, C., Klein, A., Christiansen, E., Real, E., Murphy, K., Hutter, F.: Nas-bench-101: Towards reproducible neural architecture search. In: International conference on machine learning, pp. 7105–7114. PMLR (2019)
Yu, L., Ding, C., Loscalzo, S.: Stable feature selection via dense feature groups. In: Proceedings of the 14th ACM SIGKDD International conference on knowledge discovery and data mining, pp. 803–811. ACM (2008)
Zuber, V., Strimmer, K.: Gene ranking and biomarker discovery under correlation. Bioinformatics 25(20), 2700–2707 (2009)
Funding
No funding was received to assist with the preparation of this manuscript.
Author information
Authors and Affiliations
Contributions
RB and SF conceptualized the study. RB performed all data analyses, wrote the first draft of the manuscript, and prepared all tables and figures. RB and SF reviewed and approved the final manuscript. This work encompasses a portion of the doctoral dissertation of RB.
Corresponding author
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Bertolini, R., Finch, S.J. Stability of filter feature selection methods in data pipelines: a simulation study. Int J Data Sci Anal 17, 225–248 (2024). https://doi.org/10.1007/s41060-022-00373-6