Feature Selection on Epistatic Problems Using Genetic Algorithms with Nested Classifiers

  • Conference paper
Applications of Evolutionary Computation (EvoApplications 2023)

Abstract

Feature selection is becoming an essential part of machine learning pipelines, including those generated by recent AutoML tools. For datasets with epistatic interactions between features, as is common in the bioinformatics domain, feature selection can be crucial. A recent method called SLUG has outperformed state-of-the-art feature selection algorithms on a large set of noisy epistatic datasets. SLUG uses genetic programming (GP) as a classifier (the learner), nested inside a genetic algorithm (GA) that performs feature selection (the wrapper). In this work, we pair the GA with different learners in an attempt to match the results of SLUG with less computational effort. We also propose a new feedback mechanism between the learner and the wrapper to improve convergence towards the key features. Although we do not match the results of SLUG, we demonstrate the positive effect of the feedback mechanism, which motivates additional research to further improve SLUG and other existing feature selection methods.
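To make the wrapper/learner nesting concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a toy epistatic dataset in which the label is the XOR of two binary features, so neither feature is informative on its own, and it uses a hypothetical lookup-table classifier as a stand-in for the nested learner. All function names, parameters, and GA settings are assumptions chosen for illustration. The feedback mechanism is not included, since the abstract does not describe it.

```python
import random
from collections import Counter, defaultdict


def make_xor_data(n_samples, n_features, rng):
    """Toy epistatic data (assumption): label = x0 XOR x1, other features are noise."""
    X = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
    y = [row[0] ^ row[1] for row in X]
    return X, y


def table_learner_accuracy(mask, X_train, y_train, X_test, y_test):
    """Stand-in learner: majority vote per combination of the selected feature values."""
    cols = [i for i, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # an empty feature subset cannot be evaluated
    votes = defaultdict(Counter)
    for row, label in zip(X_train, y_train):
        votes[tuple(row[i] for i in cols)][label] += 1
    default = Counter(y_train).most_common(1)[0][0]  # fallback for unseen combinations
    correct = 0
    for row, label in zip(X_test, y_test):
        key = tuple(row[i] for i in cols)
        pred = votes[key].most_common(1)[0][0] if key in votes else default
        correct += pred == label
    return correct / len(y_test)


def ga_feature_selection(fitness, n_features, rng, pop_size=30, generations=20, p_mut=0.1):
    """GA wrapper: each individual is a bit mask over features, scored by the nested learner."""
    population = [[1] * n_features]  # baseline individual that keeps every feature
    population += [[rng.randint(0, 1) for _ in range(n_features)]
                   for _ in range(pop_size - 1)]
    scored = sorted(((fitness(ind), ind) for ind in population), reverse=True)
    for _ in range(generations):
        children = [scored[0][1][:]]  # elitism: the best mask survives unchanged
        while len(children) < pop_size:
            p1 = max(rng.sample(scored, 2))[1]  # binary tournament selection
            p2 = max(rng.sample(scored, 2))[1]
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            children.append(child)
        scored = sorted(((fitness(ind), ind) for ind in children), reverse=True)
    return scored[0]  # (best fitness, best mask)


rng = random.Random(0)
X, y = make_xor_data(400, 8, rng)
X_train, y_train, X_test, y_test = X[:200], y[:200], X[200:], y[200:]


def fitness(mask):
    # Holdout accuracy of the nested learner on the selected features.
    return table_learner_accuracy(mask, X_train, y_train, X_test, y_test)


best_fitness, best_mask = ga_feature_selection(fitness, 8, rng)
print(best_fitness, best_mask)
```

In this sketch the wrapper only sees a single accuracy value per feature subset, which is the one-way relationship the proposed feedback mechanism is meant to enrich. Scoring masks on a single holdout split is a simplification; a more careful setup would keep a separate test set that the GA never sees.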

Acknowledgments

This work was supported by FCT, Portugal, through funding of LASIGE Research Unit (UIDB/00408/2020, UIDP/00408/2020) and CISUC (UID/CEC/00326/2020); projects AICE (DSAIPA/DS/0113/2019), from FCT, and RETINA (NORTE-01-0145-FEDER-000062), supported by Norte Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF). The authors acknowledge the work facilities and equipment provided by GECAD research center (UIDB/00760/2020) to the project team. The authors were also supported by their respective PhD grants, Pedro Carvalho (UI/BD/151053/2021), Nuno Rodrigues (2021/05322/BD), João Batista (SFRH/BD/143972/2019).

Author information

Corresponding author

Correspondence to Pedro Carvalho.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Carvalho, P., Ribeiro, B., Rodrigues, N.M., Batista, J.E., Vanneschi, L., Silva, S. (2023). Feature Selection on Epistatic Problems Using Genetic Algorithms with Nested Classifiers. In: Correia, J., Smith, S., Qaddoura, R. (eds) Applications of Evolutionary Computation. EvoApplications 2023. Lecture Notes in Computer Science, vol 13989. Springer, Cham. https://doi.org/10.1007/978-3-031-30229-9_42

  • DOI: https://doi.org/10.1007/978-3-031-30229-9_42

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-30228-2

  • Online ISBN: 978-3-031-30229-9

  • eBook Packages: Computer Science, Computer Science (R0)
