A General Feature Engineering Wrapper for Machine Learning Using \(\epsilon \)-Lexicase Survival

  • William La CavaEmail author
  • Jason Moore
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10196)


We propose a general wrapper for feature learning that interfaces with other machine learning methods to compose effective data representations. The proposed feature engineering wrapper (FEW) uses genetic programming to represent and evolve individual features tailored to the machine learning method with which it is paired. In order to maintain feature diversity, \(\epsilon \)-lexicase survival is introduced, a method based on \(\epsilon \)-lexicase selection. This survival method preserves semantically unique individuals in the population based on their ability to solve difficult subsets of training cases, thereby yielding a population of uncorrelated features. We demonstrate FEW with five different off-the-shelf machine learning methods and test it on a set of real-world and synthetic regression problems with dimensions varying across three orders of magnitude. The results show that FEW is able to improve model test predictions across problems for several ML methods. We discuss and test the scalability of FEW in comparison to other feature composition strategies, most notably polynomial feature expansion.


Genetic programming Feature selection Representation learning Regression 



This work was supported by the Warren Center for Network and Data Science at the University of Pennsylvania, as well as NIH grants P30-ES013508, AI116794 and LM009012.


  1. 1.
    Arnaldo, I., O’Reilly, U.M., Veeramachaneni, K.: Building predictive models via feature synthesis, pp. 983–990. ACM Press (2015)Google Scholar
  2. 2.
    Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013)CrossRefGoogle Scholar
  3. 3.
    De Melo, V.V.: Kaizen programming, pp. 895–902. ACM Press (2014)Google Scholar
  4. 4.
    Foster, D., Karloff, H., Thaler, J.: Variable selection is hard. In: Proceedings of The 28th Conference on Learning Theory, pp. 696–709 (2015)Google Scholar
  5. 5.
    Friedman, J., Hastie, T., Tibshirani, R.: The elements of statistical learning. Springer series in statistics, vol. 1. Springer, Berlin (2001)zbMATHGoogle Scholar
  6. 6.
    Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)zbMATHGoogle Scholar
  7. 7.
    Harrison, D., Rubinfeld, D.L.: Hedonic housing prices and the demand for clean air. J. Environ. Econ. Manage. 5(1), 81–102 (1978)CrossRefzbMATHGoogle Scholar
  8. 8.
    Helmuth, T., Spector, L., Matheson, J.: Solving uncompromising problems with lexicase selection. IEEE Trans. Evol. Comput. PP(99), 1–1 (2014)Google Scholar
  9. 9.
    Iba, H., Sato, T.: Genetic Programming with Local Hill-Climbing. Tech. Rep. ETL-TR-94-4, Electrotechnical Laboratory, 1-1-4 Umezono, Tsukuba-city, Ibaraki, 305, Japan (1994).
  10. 10.
    Icke, I., Bongard, J.C.: Improving genetic programming based symbolic regression using deterministic machine learning. In: IEEE Congress on Evolutionary Computation (CEC), 2013, pp. 1763–1770. IEEE (2013)Google Scholar
  11. 11.
    Kamath, U., Lin, J., De Jong, K.: SAX-EFG: an evolutionary feature generation framework for time series classification, pp. 533–540. ACM Press (2014)Google Scholar
  12. 12.
    Kommenda, M., Kronberger, G., Winkler, S., Affenzeller, M., Wagner, S.: Effects of constant optimization by nonlinear least squares minimization in symbolic regression. In: Blum, C., Alba, E., Bartz-Beielstein, T., Loiacono, D., Luna, F., Mehnen, J., Ochoa, G., Preuss, M., Tantar, E., Vanneschi, L. (eds.) GECCO 2013 Companion, pp. 1121–1128. ACM, Amsterdam (2013)CrossRefGoogle Scholar
  13. 13.
    La Cava, W., Danai, K., Spector, L., Fleming, P., Wright, A., Lackner, M.: Automatic identification of wind turbine models using evolutionary multiobjective optimization. Renew. Energy Part 2 87, 892–902 (2016)CrossRefGoogle Scholar
  14. 14.
    La Cava, W., Spector, L., Danai, K.: Epsilon-Lexicase Selection for Regression, pp. 741–748. ACM Press (2016)Google Scholar
  15. 15.
    Liskowski, P., Krawiec, K., Helmuth, T., Spector, L.: Comparison of semantic-aware selection methods in genetic programming. In: Proceedings of the Companion Publication of the 2015 Annual Conference on Genetic and Evolutionary Computation, GECCO Companion 2015, pp. 1301–1307. ACM, New York (2015)Google Scholar
  16. 16.
    McConaghy, T.: FFX: fast, scalable, deterministic symbolic regression technology. In: Riolo, R., Vladislavleva, E., Moore, J.H. (eds.) Genetic Programming Theory and Practice IX. Genetic and Evolutionary Computation, pp. 235–260. Springer, New York (2011)CrossRefGoogle Scholar
  17. 17.
    Muharram, M., Smith, G.D.: Evolutionary constructive induction. IEEE Trans. Knowl. Data Eng. 17(11), 1518–1528 (2005)CrossRefGoogle Scholar
  18. 18.
    Muharram, M.A., Smith, G.D.: The effect of evolved attributes on classification algorithms. In: Gedeon, T.T.D., Fung, L.C.C. (eds.) AI 2003. LNCS (LNAI), vol. 2903, pp. 933–941. Springer, Heidelberg (2003). doi: 10.1007/978-3-540-24581-0_80 CrossRefGoogle Scholar
  19. 19.
    Muharram, M.A., Smith, G.D.: Evolutionary feature construction using information gain and gini index. In: Keijzer, M., O’Reilly, U.-M., Lucas, S., Costa, E., Soule, T. (eds.) EuroGP 2004. LNCS, vol. 3003, pp. 379–388. Springer, Heidelberg (2004). doi: 10.1007/978-3-540-24650-3_36 CrossRefGoogle Scholar
  20. 20.
    Olson, R.S., Bartley, N., Urbanowicz, R.J., Moore, J.H.: Evaluation of a tree-based pipeline optimization tool for automating data science. arXiv preprint (2016).
  21. 21.
    Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)MathSciNetzbMATHGoogle Scholar
  22. 22.
    Redmond, M., Baveja, A.: A data-driven software tool for enabling cooperative information sharing among police departments. Eur. J. Oper. Res. 141(3), 660–678 (2002)CrossRefzbMATHGoogle Scholar
  23. 23.
    Spector, L.: Assessment of problem modality by differential performance of lexicase selection in genetic programming: a preliminary report. In: Proceedings of the Fourteenth International Conference on Genetic and Evolutionary Computation Conference Companion, pp. 401–408 (2012)Google Scholar
  24. 24.
    Tan, K.C., Lee, T.H., Khor, E.F.: Evolutionary algorithms with dynamic population size and local exploration for multiobjective optimization. IEEE Trans. Evol. Comput. 5(6), 565–588 (2001)CrossRefGoogle Scholar
  25. 25.
    Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodological) 58, 267–288 (1996)MathSciNetzbMATHGoogle Scholar
  26. 26.
    Topchy, A., Punch, W.F.: Faster genetic programming based on local gradient search of numeric leaf values. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001), pp. 155–162 (2001)Google Scholar
  27. 27.
    Torres-Sospedra, J., Montoliu, R., Martnez-Us, A., Avariento, J.P., Arnau, T.J., Benedito-Bordonau, M., Huerta, J.: UJIIndoorLoc: A new multi-building and multi-floor database for WLAN fingerprint-based indoor localization problems. In: 2014 International Conference on Indoor Positioning and Indoor Navigation (IPIN), pp. 261–270. IEEE (2014)Google Scholar
  28. 28.
    Tsanas, A., Xifara, A.: Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools. Energy Build. 49, 560–567 (2012)CrossRefGoogle Scholar
  29. 29.
    Vanneschi, L., Cuccu, G.: A study of genetic programming variable population size for dynamic optimization problems. In: International Conference on Evolutionary Computation (ICEC 2009), pp. 119–126. Madeira, Portugal (2009)Google Scholar
  30. 30.
    Vladislavleva, E., Smits, G., Hertog, D.: Order of nonlinearity as a complexity measure for models generated by symbolic regression via pareto genetic programming. IEEE Trans. Evol. Comput. 13(2), 333–349 (2009)CrossRefGoogle Scholar
  31. 31.
    White, D.R., McDermott, J., Castelli, M., Manzoni, L., Goldman, B.W., Kronberger, G., Jakowski, W., O’Reilly, U.M., Luke, S.: Better GP benchmarks: community survey results and proposals. Genet. Program. Evolvable Mach. 14(1), 3–29 (2012)Google Scholar

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. 1.Institute for Biomedical InformaticsUniversity of PennsylvaniaPhiladelphiaUSA

Personalised recommendations