Abstract
Filter methods are a class of feature selection techniques used to identify a subset of informative features during data preprocessing. While the differential efficacy of these techniques has been extensively compared in data science pipelines for predictive outcome modeling, less work has examined how their stability is affected by properties of the underlying data corpora. A set of six stability metrics (Davis, Dice, Jaccard, Kappa, Lustgarten, and Novovičová) was compared during cross-validation in a Monte Carlo simulation study on synthetic data to examine variability in the stability of three filter methods in data pipelines for binary classification, considering five underlying data properties: (1) measurement error in the independent covariates, (2) number of training observations, (3) number of features, (4) magnitude of class imbalance, and (5) missing data pattern. Feature selection stability was platykurtic and was negatively impacted by measurement error and by a smaller number of training observations in the input corpora. The Novovičová stability metric yielded the highest mean stability values, while the Davis stability metric yielded the lowest. The distribution of all stability metrics was negatively skewed, and the Jaccard metric exhibited the largest variability across all five data properties. A statistical analysis of the synergistic effects between filter feature selection techniques, filter cutoffs, data corpora properties, and machine learning (ML) algorithms on overall pipeline efficacy, quantified using the area under the curve (AUC) evaluation metric, is also presented and discussed.
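Several of the metrics named above (Jaccard, Dice) quantify stability as the average set overlap between the feature subsets selected in different cross-validation folds. The sketch below illustrates this idea under the standard subset-similarity definitions of the Jaccard and Dice indices, averaged over all fold pairs; it is a minimal illustration only, and the function names and example subsets are our own, not taken from the paper.

```python
# Illustrative sketch: mean pairwise set-overlap stability of feature
# selection across cross-validation folds, using the standard Jaccard
# and Dice similarity indices on selected-feature subsets.
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard index: |A ∩ B| / |A ∪ B| (1.0 for two empty sets)."""
    return len(a & b) / len(a | b) if a | b else 1.0


def dice(a: set, b: set) -> float:
    """Dice index: 2|A ∩ B| / (|A| + |B|) (1.0 for two empty sets)."""
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 1.0


def mean_pairwise_stability(subsets, sim) -> float:
    """Average a similarity index over all pairs of selected subsets."""
    pairs = list(combinations(subsets, 2))
    return sum(sim(a, b) for a, b in pairs) / len(pairs)


# Feature subsets selected in three hypothetical CV folds
folds = [{"x1", "x2", "x3"}, {"x1", "x2", "x4"}, {"x1", "x3", "x4"}]
print(round(mean_pairwise_stability(folds, jaccard), 3))  # 0.5
print(round(mean_pairwise_stability(folds, dice), 3))     # 0.667
```

A value of 1.0 would indicate that every fold selected exactly the same features; values near 0 indicate that the selected subsets barely overlap, which is the instability the simulation study measures.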
Data availability
The R software used to perform the simulation study can be provided upon request by sending an email to the corresponding author.
References
Alelyani, S.: On feature selection stability: a data perspective. Doctoral Dissertation. Arizona State University, Tempe, Arizona (2013)
Alexandro, D.: Aiming for success: evaluating statistical and machine learning methods to predict high school student performance and improve early warning systems. Doctoral Dissertation. University of Connecticut, Storrs, Connecticut (2018)
Almutiri, T., Saeed, F.: A hybrid feature selection method combining Gini index and support vector machine with recursive feature elimination for gene expression classification. Int. J. Data Min. Modell. Manag. 14(1), 41–62 (2022)
Aphinyanaphongs, Y., Fu, L.D., Li, Z., Peskin, E.R., Efstathiadis, E., Aliferis, C.F., Statnikov, A.: A comprehensive empirical comparison of modern supervised classification and feature selection methods for text categorization. J. Associat. Inform. Sci. Technol. 65(10), 1964–1987 (2014)
Barabanova, I.V., Vychuzhanin, P., Nikitin, N.O.: Sensitivity analysis of the composite data-driven pipelines in the automated machine learning. Procedia Comp. Sci. 193, 484–493 (2021)
Belanche, L.A., González, F.F.: Review and evaluation of feature selection algorithms in synthetic problems. arXiv preprint arXiv:1101.2320 (2011)
Berens, J., Schneider, K., Görtz, S., Oster, S., Burghoff, J.: Early detection of students at risk – predicting student dropouts using administrative student data and machine learning methods. J. Educat. Data Min. 11(3), 1–41 (2018)
Bertolini, R.: Evaluating performance variability of data pipelines for binary classification with applications to predictive learning analytics. (Doctoral Dissertation). Stony Brook University, Stony Brook, New York (2021)
Bertolini, R., Finch, S.J.: Synergistic effects between data corpora properties and machine learning performance in data pipelines. Int. J. Data Min. Modell. Manag. 14(3), 217–233 (2022)
Bertolini, R., Finch, S.J., Nehm, R.H.: Enhancing data pipelines for forecasting student performance: integrating feature selection with cross-validation. Int. J. Educat. Technol. Higher Educat. 18(1), 1–23 (2021)
Bertolini, R., Finch, S.J., Nehm, R.H.: Quantifying variability in predictions of student performance: examining the impact of bootstrap resampling in data pipelines. Comp. Educat.: Artif. Intell. 3, 100067 (2022)
Bharathi, N., Rishiikeshwer, B.S., Shriram, T.A., Santhi, B., Brindha, G.R.: The significance of feature selection techniques in machine learning. Fund. Meth. Mach. Deep. Learn. Algorith. Tool. Appl. (2022). https://doi.org/10.1002/9781119821908.ch5
Biswas, S., Wardat, M., Rajan, H.: The Art and Practice of Data Science Pipelines: A Comprehensive Study of Data Science Pipelines In Theory, In-The-Small, and In-The-Large. arXiv preprint arXiv:2112.01590 (2021)
Bolón-Canedo, V., Sánchez-Maroño, N., Alonso-Betanzos, A.: A review of feature selection methods on synthetic data. Knowled. Infor. Sys. 34(3), 483–519 (2013)
Bolón-Canedo, V., Sánchez-Maroño, N., Alonso-Betanzos, A.: Recent advances and emerging challenges of feature selection in the context of big data. Knowl.-Based Sys. 86, 33–45 (2015)
Bolón-Canedo, V., Sánchez-Maroño, N., Alonso-Betanzos, A., Benítez, J.M., Herrera, F.: A review of microarray datasets and applied feature selection methods. Infor. Sci. 282, 111–135 (2014)
Bommert, A.M.: Integration of feature selection stability in model fitting. Doctoral Dissertation. TU Dortmund University, Dortmund, Germany (2021)
Bommert, A.M., Lang, M.: Stabm: stability measures for feature selection. J. Open Sour. Softw. 6(59), 3010 (2021)
Bommert, A.M., Rahnenführer, J.: Adjusted measures for feature selection stability for data sets with similar features. In: International conference on machine learning, optimization, and data science, pp. 203–214. Springer, Cham (2020)
Bommert, A.M., Rahnenführer, J., Lang, M.: A multicriteria approach to find predictive and sparse models with stable feature selection for high-dimensional data. Comput. Math. Methods Med. 2017, 7907163 (2017)
Bommert, A.M., Sun, X., Bischl, B., Rahnenführer, J., Lang, M.: Benchmark for filter methods for feature selection in high-dimensional classification data. Comput. Stat. Data Anal. 143, 106839 (2020)
Bommert, A.M., Welchowski, T., Schmid, M., Rahnenführer, J.: Benchmark of filter methods for feature selection in high-dimensional gene expression survival data. Brief. Bioinfor. 23(1), 1–13 (2022)
Bonferroni, C.: Teoria statistica delle classi e calcolo delle probabilita. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commericiali di Firenze 8, 3–62 (1936)
Borda, J.C.: Mémoire sur les élections au scrutin. Mémoires de l'Académie royale des Sciences de Paris pour l'Année 1781, 657–665 (1781)
Boulesteix, A.L., Slawski, M.: Stability and aggregation of ranked gene lists. Brief. Bioinfor. 10(5), 556–568 (2009)
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and regression trees. Routledge, London (2017)
Brown, G., Pocock, A., Zhao, M.J., Luján, M.: Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. J. Mach. Learn. Res. 13(1), 27–66 (2012)
Burka, D., Puppe, C., Szepesváry, L., Tasnádi, A.: And the winner is... Chevalier de Borda: Neural networks vote according to Borda’s Rule. In: Proceedings of the Sixth International Workshop on Computational Social Choice (2016)
Carletta, J.: Assessing agreement on classification tasks: the kappa statistic. Comput. Ling. 22(2), 249–254 (1996)
Chaibub Neto, E., Bare, J.C., Margolin, A.A.: Simulation studies as designed experiments: the comparison of penalized regression models in the “large p, small n” setting. PloS one 9(10), e107957 (2014)
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
Couronné, R., Probst, P., Boulesteix, A.L.: Random forest versus logistic regression: a large-scale benchmark experiment. BMC Bioinfor. 19(1), 1–14 (2018)
Dash, M., Liu, H.: Feature selection for classification. Intell. Data Anal. 1(3), 131–156 (1997)
Davis, C.A., Gerick, F., Hintermair, V., Friedel, C.C., Fundel, K., Küffner, R., Zimmer, R.: Reliable gene signatures for microarray classification: assessment of stability and performance. Bioinformatics 22(19), 2356–2363 (2006)
Davison, A.C., Hinkley, D.V.: Bootstrap methods and their application (No. 1). Cambridge University Press, Cambridge (1997)
Densmore, J.: Data pipeline pocket reference. O’Reilly Media, Inc (2021)
Dice, L.R.: Measures of the amount of ecologic association between species. Ecology 26(3), 297–302 (1945)
Dittman, D.J., Khoshgoftaar, T.M., Wald, R., Napolitano, A.: Similarity analysis of feature ranking techniques on imbalanced DNA microarray datasets. In: 2012 IEEE International conference on bioinformatics and biomedicine, pp. 1–5. IEEE (2012)
Dittman, D.J., Khoshgoftaar, T.M., Wald, R., Napolitano, A.: Classification performance of rank aggregation techniques for ensemble gene selection. In: Proceedings of the twenty-sixth international FLAIRS conference, pp. 420–425 (2013)
Duangsoithong, R., Windeatt, T.: Bootstrap feature selection for ensemble classifiers. In: Industrial conference on data mining, pp. 28–41. Springer, Berlin, Heidelberg (2010)
Dwork, C., Kumar, R., Naor, M., Sivakumar, D.: Rank aggregation methods for the web. In: Proceedings of the 10th International world wide web conference, pp. 613–622. ACM (2001)
Ebenuwa, S.H., Sharif, M.S., Alazab, M., Al-Nemrat, A.: Variance ranking attributes selection techniques for binary classification problem in imbalance data. IEEE Access 7, 24649–24666 (2019)
Ghai, B., Mishra, M., Mueller, K.: Cascaded debiasing: studying the cumulative effect of multiple fairness-enhancing interventions. arXiv preprint arXiv:2202.03734 (2022)
Goswami, S., Chakraborty, S., Guha, P., Tarafdar, A., Kedia, A.: Filter-based feature selection methods using hill climbing approach. In: Natural computing for unsupervised learning, pp. 213–234. Springer, Cham (2019)
Gulgezen, G., Cataltepe, Z., Yu, L.: Stable and accurate feature selection. In: Joint European conference on machine learning and knowledge discovery in databases, pp. 455–468. Springer, Berlin, Heidelberg (2009)
Guzmán-Martinez, R., Alaiz-Rodríguez, R.: Feature selection stability assessment based on the Jensen–Shannon divergence. In: Joint European conference on machine learning and knowledge discovery in databases, pp. 597–612. Springer, Berlin, Heidelberg (2011)
Hall, M.A.: Correlation-based feature selection for machine learning. Doctoral Dissertation. University of Waikato, Hamilton, New Zealand (1999)
Hopf, K., Reifenrath, S.: Filter methods for feature selection in supervised machine learning applications–Review and benchmark. arXiv preprint arXiv:2111.12140 (2021)
Hua, J., Tembe, W.D., Dougherty, E.R.: Performance of feature-selection methods in the classification of high-dimension data. Patt. Recognit. 42(3), 409–424 (2009)
Huang, B.F., Boutros, P.C.: The parameter sensitivity of random forests. BMC Bioinform. 17(1), 1–13 (2016)
Huang, C.: Feature selection and feature stability measurement method for high-dimensional small sample data based on big data technology. Computat. Intell. Neurosci. 2021, 1–12 (2021)
Izenman, A.J.: Modern multivariate statistical techniques. In: Springer Texts in Statistics, Springer, New York (2008)
Jaccard, P.: Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de La Société Vaudoise Des Sciences Naturelles 37, 547–579 (1901)
Källberg, D., Vidman, L., Rydén, P.: Comparison of methods for feature selection in clustering of high-dimensional RNA-sequencing data to identify cancer subtypes. Front. Genet. 12, 632620 (2021)
Kalousis, A., Prados, J., Hilario, M.: Stability of feature selection algorithms: a study on high-dimensional spaces. Knowl. Inform. Sys. 12(1), 95–116 (2007)
Karegowda, A.G., Manjunath, A.S., Jayaram, M.A.: Comparative study of attribute selection using gain ratio and correlation based feature selection. Int. J. Inform. Technol. Knowl. Manag. 2(2), 271–277 (2010)
Karunakaran, V., Rajasekar, V., Joseph, S.: Exploring filter and wrapper feature selection techniques in machine learning. In: Computational vision and bio-inspired computing, pp. 497–506. Springer, Singapore (2021)
Khaire, U.M., Dhanalakshmi, R.: Stability of feature selection algorithm: a review. J. King Saud Univer. Comp. Inf. Sci. 34(4), 1060–1073 (2019)
Khoshgoftaar, T.M., Gao, K., Seliya, N.: Attribute selection and imbalanced data: problems in software defect prediction. In: 2010 22nd IEEE International conference on tools with artificial intelligence, pp. 137–144. IEEE (2010)
Khoshgoftaar, T.M., Golawala, M., Van Hulse, J.: An empirical study of learning from imbalanced data using random forest. In: 19th IEEE International conference on tools with artificial intelligence, pp. 310–317. IEEE (2007)
Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. In: International joint conference on artificial intelligence, pp. 1137–1145 (1995)
Koprinska, I., Rana, M., Agelidis, V.G.: Correlation and instance based feature selection for electricity load forecasting. Knowl.-Based Sys. 82, 29–40 (2015)
Krízek, P., Kittler, J., Hlavác, V.: Improving stability of feature selection methods. In: International conference on computer analysis of images and patterns, pp. 929–936. Springer, Berlin, Heidelberg (2007)
Kuhn, M.: Caret: classification and regression training. Astrophysics Source Code Library, ascl-1505 (2015)
Kujawska, H., Slavkovik, M., Rückmann, J.J.: Predicting the winners of Borda, Kemeny and Dodgson elections with supervised machine learning. In: Multi-agent systems and agreement technologies, pp. 440–458. Springer, Cham (2020)
Laborda, J., Ryoo, S.: Feature selection in a credit scoring model. Mathematics 9(7), 746 (2021)
Lausser, L., Müssel, C., Maucher, M., Kestler, H.A.: Measuring and visualizing the stability of biomarker selection techniques. Comput. Stat. 28(1), 51–65 (2013)
Lazar, C., Taminau, J., Meganck, S., Steenhoff, D., Coletta, A., Molter, C., de Schaetzen, V., Duque, R., Bersini, H., Nowe, A.: A survey on filter techniques for feature selection in gene expression microarray analysis. IEEE/ACM Trans. Comput. Biol. Bioinform. 9(4), 1106–1119 (2012)
Liu, H.: Algorithms for Scalability and Security in Adversarial Environments. Doctoral Dissertation. The University of Arizona, Tucson, Arizona (2021)
Lustgarten, J.L., Gopalakrishnan, V., Visweswaran, S.: Measuring stability of feature selection in biomedical datasets. In: AMIA Annual Symposium Proceedings, p. 406. American Medical Informatics Association (2009)
Mangal, A., Holm, E.A.: A comparative study of feature selection methods for stress hotspot classification in materials. Integrat. Mater. Manuf. Innovat. 7(3), 87–95 (2018)
Marshall, A., Altman, D.G., Royston, P., Holder, R.L.: Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study. BMC Med. Resear. Methodol. 10(1), 1–16 (2010)
Meng, X.B., Gao, X.Z., Lu, L., Liu, Y., Zhang, H.: A new bio-inspired optimisation algorithm: Bird Swarm Algorithm. J. Exper. Theoret. Artif. Intell. 28(4), 673–687 (2016)
Meyer, P.E., Schretter, C., Bontempi, G.: Information-theoretic feature selection in microarray data using variable complementarity. IEEE J. Select. Top. Sign. Process. 2(3), 261–274 (2008)
Mohd Yusof, M., Mohamed, R., Wahid, N.: Benchmark of feature selection techniques with machine learning algorithms for cancer datasets. In: Proceedings of the international conference on artificial intelligence and robotics and the International conference on automation, control, and robotics engineering, pp. 1–5 (2016)
Montgomery, D.C.: Design and Analysis of Experiments. John Wiley & Sons (2017)
Moons, E., Aerts, M., Wets, G.: A tree based lack-of-fit test for multiple logistic regression. Stat. Med. 23(9), 1425–1438 (2004)
Morán-Fernández, L., Bólon-Canedo, V., Alonso-Betanzos, A.: How important is data quality? Best classifiers vs best features. Neurocomputing 470, 365–375 (2022)
Munirathinam, D.J., Ranganadhan, M.: A new improved filter based feature selection model for high-dimensional data. J. Supercomp. 76(8), 5745–5762 (2020)
Nogueira, S.: Quantifying the stability of feature selection. Doctoral dissertation. The University of Manchester, Manchester, United Kingdom (2018)
Nogueira, S., Brown, G.: Measuring the stability of feature selection. In: Joint European conference on machine learning and knowledge discovery in databases, pp. 442–457. Springer, Cham (2016)
Nogueira, S., Sechidis, K., Brown, G.: On the stability of feature selection algorithms. J. Mach. Learn. Res. 18(1), 6345–6398 (2017)
Novovičová, J., Somol, P., Pudil, P.: A new measure of feature selection algorithms’ stability. In: 2009 IEEE International conference on data mining workshops, pp. 382–387. IEEE (2009)
Rajbahadur, G.K., Oliva, G.A., Hassan, A.E., Dingel, J.: Pitfalls analyzer: quality control for model-driven data science pipelines. In: 2019 ACM/IEEE 22nd international conference on model driven engineering languages and systems (MODELS), pp. 12–22. IEEE (2019)
Ramaswami, M.R., Bhaskaran, R.: A study on feature selection techniques in educational data mining. J. Comput. 1(1), 7–11 (2009)
Ren, K., Fang, W., Qu, J., Zhang, X., Shi, X.: Comparison of eight filter-based feature selection methods for monthly streamflow forecasting—three case studies on CAMELS data sets. J. Hydrol. 586, 124897 (2020)
Romanski, P., Kotthoff, L., Kotthoff, M.L.: Package ‘FSelector’. URL: http://cran.r-project.org/web/packages/FSelector/index.html (2013)
Salman, R., Alzaatreh, A., Sulieman, H.: The stability of different aggregation techniques in ensemble feature selection. J. Big Data 9(1), 1–23 (2022)
Sánchez-Maroño, N., Alonso-Betanzos, A., Tombilla-Sanromán, M.: Filter methods for feature selection – a comparative study. In: International conference on intelligent data engineering and automated learning, pp. 178–187. Springer, Berlin, Heidelberg (2007)
Sarkar, C., Cooley, S., Srivastava, J.: Robust feature selection technique using rank aggregation. Appl. Artif. Intell. 28(3), 243–257 (2014)
Sen, R., Mandal, A.K., Chakraborty, B.: A critical study on stability measures of feature selection with a novel extension of lustgarten index. Mach. Learn. Knowl. Extract. 3(4), 771–787 (2021)
Sen, R., Mandal, A.K., Chakraborty, B.: Performance analysis of extended lustgarten index for stability of feature selection. In: 2021 IEEE international conference on service operations and logistics, and informatics (SOLI), pp. 1–5. IEEE (2021)
Skiena, S.S.: The Data Science Design Manual. Springer (2017)
Skurichina, M., Duin, R.P.: Combining feature subsets in feature selection. In: International workshop on multiple classifier systems, pp. 165–175. Springer, Berlin, Heidelberg (2005)
Somol, P., Novovičová, J.: Evaluating stability and comparing output of feature selectors that optimize feature subset cardinality. IEEE Trans. Patt. Anal. Mach. Intell. 32(11), 1921–1939 (2010)
Strobl, C., Boulesteix, A.L., Kneib, T., Augustin, T., Zeileis, A.: Conditional variable importance for random forests. BMC Bioinform. 9(1), 1–11 (2008)
Strobl, C., Boulesteix, A.L., Zeileis, A., Hothorn, T.: Bias in random forest variable importance measures: illustrations, sources, and a solution. BMC Bioinform. 8(1), 1–21 (2007)
Subbian, K., Melville, P.: Supervised rank aggregation for predicting influence in networks. arXiv preprint arXiv:1108.4801 (2011)
Sun, L., Wang, L., Ding, W., Qian, Y., Xu, J.: Feature selection using fuzzy neighborhood entropy-based uncertainty measures for fuzzy neighborhood multigranulation rough sets. IEEE Trans. Fuzzy Sys. 29(1), 19–33 (2020)
Tan, F., Fu, X., Zhang, Y., Bourgeois, A.G.: A genetic algorithm-based method for feature subset selection. Soft Comp. 12(2), 111–120 (2008)
Toloşi, L., Lengauer, T.: Classification with correlated features: unreliability of feature ranking and solutions. Bioinform. 27(14), 1986–1994 (2011)
Tsanas, A., Little, M.A., McSharry, P.E.: A simple filter benchmark for feature selection. J. Mach. Learn. Res. 1, 1–24 (2010)
Tunkiel, A.T., Sui, D., Wiktorski, T.: Data-driven sensitivity analysis of complex machine learning models: a case study of directional drilling. J. Petrol. Sci. Eng. 195, 107630 (2020)
Urbanowicz, R.J., Meeker, M., La Cava, W., Olson, R.S., Moore, J.H.: Relief-based feature selection: introduction and review. J. Biomed. Inform. 85, 189–203 (2018)
Urkullu, A., Pérez, A., Calvo, B.: Statistical model for reproducibility in ranking-based feature selection. Knowl. Inform. Sys. 63(2), 379–410 (2021)
Van Buuren, S., Groothuis-Oudshoorn, K.: mice: multivariate imputation by chained equations in R. J. Stat. Softw. 45(1), 1–67 (2011)
Wah, Y.B., Ibrahim, N., Hamid, H.A., Abdul-Rahman, S., Fong, S.: Feature selection methods: case of filter and wrapper approaches for maximising classification accuracy. Pertanika J. Sci. Technol. 26(1), 329–340 (2018)
Wald, R., Khoshgoftaar, T.M., Dittman, D., Awada, W., Napolitano, A.: An extensive comparison of feature ranking aggregation techniques in bioinformatics. In: 2012 IEEE 13th international conference on information reuse & integration (IRI), pp. 377–384. IEEE (2012)
Wald, R., Khoshgoftaar, T.M., Dittman, D.: Mean aggregation versus robust rank aggregation for ensemble gene selection. In: 2012 11th International conference on machine learning and applications, pp. 63–69. IEEE (2012)
Wald, R., Khoshgoftaar, T.M., Napolitano, A.: Stability of filter- and wrapper-based feature subset selection. In: 2013 IEEE 25th International conference on tools with artificial intelligence, pp. 374–380. IEEE (2013)
Ying, C., Klein, A., Christiansen, E., Real, E., Murphy, K., Hutter, F.: Nas-bench-101: Towards reproducible neural architecture search. In: International conference on machine learning, pp. 7105–7114. PMLR (2019)
Yu, L., Ding, C., Loscalzo, S.: Stable feature selection via dense feature groups. In: Proceedings of the 14th ACM SIGKDD International conference on knowledge discovery and data mining, pp. 803–811. ACM (2008)
Zuber, V., Strimmer, K.: Gene ranking and biomarker discovery under correlation. Bioinformatics 25(20), 2700–2707 (2009)
Funding
No funding was received to assist with the preparation of this manuscript.
Author information
Authors and Affiliations
Contributions
RB and SF conceptualized the study. RB performed all data analyses, wrote the first draft of the manuscript, and prepared all tables and figures. RB and SF reviewed and approved the final manuscript. This work encompasses a portion of the doctoral dissertation of RB.
Corresponding author
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Bertolini, R., Finch, S.J. Stability of filter feature selection methods in data pipelines: a simulation study. Int J Data Sci Anal 17, 225–248 (2024). https://doi.org/10.1007/s41060-022-00373-6