Abstract
Feature selection is becoming an essential part of machine learning pipelines, including those generated by recent AutoML tools. For datasets with epistatic interactions between features, as is common in the bioinformatics domain, feature selection can be crucial. A recent method called SLUG has outperformed state-of-the-art feature selection algorithms on a large set of noisy epistatic datasets. SLUG uses genetic programming (GP) as a classifier (the learner), nested inside a genetic algorithm (GA) that performs feature selection (the wrapper). In this work, we pair the GA with different learners in an attempt to match the results of SLUG with less computational effort. We also propose a new feedback mechanism between the learner and the wrapper to improve convergence towards the key features. Although we do not match the results of SLUG, we demonstrate the positive effect of the feedback mechanism, motivating additional research to further improve SLUG and other existing feature selection methods.
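The wrapper/learner nesting described in the abstract can be illustrated with a minimal sketch. This is not SLUG and not the authors' implementation: the XOR toy dataset, the lookup-table learner, the function names (`make_xor_data`, `learner_accuracy`, `ga_select`) and all GA settings are assumptions chosen only to keep the example small and dependency-free. The feedback mechanism is not shown, since the abstract gives no details of it.

```python
import random
from collections import Counter, defaultdict


def make_xor_data(n_samples, n_features, rng):
    """Toy epistatic data (assumption): label = x0 XOR x1, other features are noise."""
    X = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
    y = [row[0] ^ row[1] for row in X]
    return X, y


def learner_accuracy(mask, X_tr, y_tr, X_te, y_te):
    """Nested learner (stand-in for GP): majority-vote lookup table on selected features."""
    cols = [i for i, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # an empty feature subset cannot be evaluated
    votes = defaultdict(Counter)
    for row, label in zip(X_tr, y_tr):
        votes[tuple(row[c] for c in cols)][label] += 1
    default = Counter(y_tr).most_common(1)[0][0]  # fallback for unseen combinations
    correct = 0
    for row, label in zip(X_te, y_te):
        key = tuple(row[c] for c in cols)
        pred = votes[key].most_common(1)[0][0] if key in votes else default
        correct += int(pred == label)
    return correct / len(y_te)


def ga_select(X_tr, y_tr, X_te, y_te, rng, pop_size=30, generations=25, p_mut=0.1):
    """GA wrapper: evolves bit masks; fitness is the nested learner's validation accuracy."""
    n = len(X_tr[0])
    cache = {}  # each mask is evaluated by the learner only once

    def fitness(mask):
        key = tuple(mask)
        if key not in cache:
            cache[key] = learner_accuracy(mask, X_tr, y_tr, X_te, y_te)
        return cache[key]

    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        children = [best[:]]  # elitism: carry the best mask over unchanged
        while len(children) < pop_size:
            # tournament selection of two parents (tournament size 3)
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            cut = rng.randrange(1, n)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # bit-flip mutation
            child = [b ^ 1 if rng.random() < p_mut else b for b in child]
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
    return best, fitness(best)


rng = random.Random(0)
X, y = make_xor_data(600, 8, rng)
best_mask, best_acc = ga_select(X[:400], y[:400], X[400:], y[400:], rng)
print("selected features:", [i for i, b in enumerate(best_mask) if b])
print("validation accuracy:", best_acc)
```

In this toy problem neither feature 0 nor feature 1 is predictive on its own, so a filter that scores features individually would likely discard both, whereas the wrapper scores whole subsets through the learner and can retain the interacting pair. Swapping `learner_accuracy` for another classifier is the kind of substitution the abstract describes, at whatever computational cost that learner brings.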
Acknowledgments
This work was supported by FCT, Portugal, through funding of LASIGE Research Unit (UIDB/00408/2020, UIDP/00408/2020) and CISUC (UID/CEC/00326/2020); projects AICE (DSAIPA/DS/0113/2019), from FCT, and RETINA (NORTE-01-0145-FEDER-000062), supported by Norte Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF). The authors acknowledge the work facilities and equipment provided by GECAD research center (UIDB/00760/2020) to the project team. The authors were also supported by their respective PhD grants, Pedro Carvalho (UI/BD/151053/2021), Nuno Rodrigues (2021/05322/BD), João Batista (SFRH/BD/143972/2019).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Carvalho, P., Ribeiro, B., Rodrigues, N.M., Batista, J.E., Vanneschi, L., Silva, S. (2023). Feature Selection on Epistatic Problems Using Genetic Algorithms with Nested Classifiers. In: Correia, J., Smith, S., Qaddoura, R. (eds) Applications of Evolutionary Computation. EvoApplications 2023. Lecture Notes in Computer Science, vol 13989. Springer, Cham. https://doi.org/10.1007/978-3-031-30229-9_42
DOI: https://doi.org/10.1007/978-3-031-30229-9_42
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-30228-2
Online ISBN: 978-3-031-30229-9
eBook Packages: Computer Science, Computer Science (R0)