Abstract
The advent of Big Data has brought an unprecedented increase in data volume, not only in the number of samples but also in the number of available features. Feature selection, the process of selecting relevant features and discarding irrelevant ones, has been successfully applied over the last decades to reduce the dimensionality of datasets. However, a great number of feature selection methods are available in the literature, and choosing the right one for a given problem is not a trivial decision. In this paper we try to determine which of these methods are best suited for a particular type of problem, and we study their effectiveness by comparing them with a random selection of features. In our experiments we use an extensive number of datasets, covering a wide variety of real-world problems that need to be dealt with in this field. Seven popular feature selection methods were used, together with five different classifiers to evaluate their performance. The experimental results suggest that feature selection is, in general, a powerful tool in machine learning, with correlation-based feature selection being the best option regardless of the scenario. We also found that choosing an inappropriate threshold when using ranker methods leads to results as poor as randomly selecting a subset of features.
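The comparison described above can be illustrated with a minimal NumPy sketch. This is not the paper's actual experimental setup (the synthetic data, the correlation ranker, the nearest-centroid classifier, and the threshold k are all assumptions made for illustration): it ranks features by their absolute Pearson correlation with the class label, keeps the top-k, and compares the resulting accuracy with that of k randomly chosen features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary problem: 5 informative features out of 50 (illustrative
# setup, not one of the paper's datasets).
n, d, informative = 400, 50, 5
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d))
X[:, :informative] += y[:, None] * 1.5   # shift informative features by class

train, test = slice(0, 300), slice(300, n)

def rank_by_correlation(X, y):
    """Rank features by |Pearson correlation| with the class label (descending)."""
    xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12)
    return np.argsort(-np.abs(corr))

def nearest_centroid_accuracy(X, y, feats):
    """Train a nearest-centroid classifier on the chosen features; return test accuracy."""
    Xtr, Xte = X[train][:, feats], X[test][:, feats]
    c0 = Xtr[y[train] == 0].mean(axis=0)
    c1 = Xtr[y[train] == 1].mean(axis=0)
    pred = (np.linalg.norm(Xte - c1, axis=1)
            < np.linalg.norm(Xte - c0, axis=1)).astype(int)
    return (pred == y[test]).mean()

k = 5   # an appropriate threshold here: as many features as are informative
ranking = rank_by_correlation(X[train], y[train])
acc_ranked = nearest_centroid_accuracy(X, y, ranking[:k])
acc_random = nearest_centroid_accuracy(X, y, rng.choice(d, size=k, replace=False))
print(f"ranker top-{k}: {acc_ranked:.2f}   random {k}: {acc_random:.2f}")
```

With a well-chosen k, the ranker recovers the informative features and clearly beats the random subset; shrinking or inflating k degrades it toward random performance, which is the threshold effect the abstract refers to.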
This research has been financially supported in part by the Spanish Ministerio de Economía y Competitividad (research project PID2019-109238GB-C2) and by the Xunta de Galicia (Grants ED431C 2018/34 and ED431G 2019/01) with European Union ERDF funds. CITIC, as a Research Center accredited by the Galician University System, is funded by the "Consellería de Cultura, Educación e Universidades" of the Xunta de Galicia, 80% through ERDF funds (ERDF Operational Programme Galicia 2014–2020) and the remaining 20% by the "Secretaría Xeral de Universidades" (Grant ED431G 2019/01).
© 2021 Springer Nature Switzerland AG
Morán-Fernández, L., Bolón-Canedo, V. (2021). Dimensionality Reduction: Is Feature Selection More Effective Than Random Selection?. In: Rojas, I., Joya, G., Català, A. (eds) Advances in Computational Intelligence. IWANN 2021. Lecture Notes in Computer Science(), vol 12861. Springer, Cham. https://doi.org/10.1007/978-3-030-85030-2_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-85029-6
Online ISBN: 978-3-030-85030-2