Abstract
The advent of Big Data has brought an unprecedented increase in data volume, not only in the number of samples but also in the number of available features. Feature selection, the process of selecting relevant features and discarding irrelevant ones, has been successfully applied over the last decades to reduce the dimensionality of datasets. However, a great number of feature selection methods are available in the literature, and choosing the right one for a given problem is not a trivial decision. In this paper we try to determine which of these methods are best suited for a particular type of problem, and we study their effectiveness by comparing them with a random selection of features. In our experiments we use an extensive number of datasets, covering a wide variety of real-world problems that need to be dealt with in this field. Seven popular feature selection methods were used, together with five different classifiers to evaluate their performance. The experimental results suggest that feature selection is, in general, a powerful tool in machine learning, with correlation-based feature selection being the best option regardless of the scenario. We also found that choosing an inappropriate threshold when using ranker methods leads to results as poor as randomly selecting a subset of features.
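The comparison described above can be illustrated with a minimal NumPy sketch. This is not the paper's actual experimental setup (the synthetic data, the correlation ranker, the nearest-centroid classifier, and the threshold k are all assumptions made for illustration): it ranks features by their absolute Pearson correlation with the class label, keeps the top-k, and compares the resulting accuracy with that of k randomly chosen features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary problem: 5 informative features out of 50 (illustrative
# setup, not one of the paper's datasets).
n, d, informative = 400, 50, 5
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d))
X[:, :informative] += y[:, None] * 1.5   # shift informative features by class

train, test = slice(0, 300), slice(300, n)

def rank_by_correlation(X, y):
    """Rank features by |Pearson correlation| with the class label (descending)."""
    xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12)
    return np.argsort(-np.abs(corr))

def nearest_centroid_accuracy(X, y, feats):
    """Train a nearest-centroid classifier on the chosen features; return test accuracy."""
    Xtr, Xte = X[train][:, feats], X[test][:, feats]
    c0 = Xtr[y[train] == 0].mean(axis=0)
    c1 = Xtr[y[train] == 1].mean(axis=0)
    pred = (np.linalg.norm(Xte - c1, axis=1)
            < np.linalg.norm(Xte - c0, axis=1)).astype(int)
    return (pred == y[test]).mean()

k = 5   # an appropriate threshold here: as many features as are informative
ranking = rank_by_correlation(X[train], y[train])
acc_ranked = nearest_centroid_accuracy(X, y, ranking[:k])
acc_random = nearest_centroid_accuracy(X, y, rng.choice(d, size=k, replace=False))
print(f"ranker top-{k}: {acc_ranked:.2f}   random {k}: {acc_random:.2f}")
```

With a well-chosen k, the ranker recovers the informative features and clearly beats the random subset; shrinking or inflating k degrades it toward random performance, which is the threshold effect the abstract refers to.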
This research has been financially supported in part by the Spanish Ministerio de Economía y Competitividad (research project PID2019-109238GB-C2) and by the Xunta de Galicia (Grants ED431C 2018/34 and ED431G 2019/01) with European Union ERDF funds. CITIC, as a Research Center accredited by the Galician University System, is funded by the "Consellería de Cultura, Educación e Universidades" of the Xunta de Galicia, 80% through ERDF funds (ERDF Operational Programme Galicia 2014–2020) and the remaining 20% by the "Secretaría Xeral de Universidades" (Grant ED431G 2019/01).
© 2021 Springer Nature Switzerland AG
Morán-Fernández, L., Bolón-Canedo, V. (2021). Dimensionality Reduction: Is Feature Selection More Effective Than Random Selection?. In: Rojas, I., Joya, G., Català, A. (eds) Advances in Computational Intelligence. IWANN 2021. Lecture Notes in Computer Science(), vol 12861. Springer, Cham. https://doi.org/10.1007/978-3-030-85030-2_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-85029-6
Online ISBN: 978-3-030-85030-2