Learning feature spaces for regression with genetic programming


Genetic programming has found recent success as a tool for learning sets of features for regression and classification. Multidimensional genetic programming is a useful variant of genetic programming for this task because it represents candidate solutions as sets of programs. These sets of programs expose additional information that can be exploited for building block identification. In this work, we discuss this architecture and others in terms of their propensity for allowing heuristic search to utilize information during the evolutionary process. We investigate methods for biasing the components of programs that are promoted in order to guide search towards useful and complementary feature spaces. We study two main approaches: (1) the introduction of new objectives and (2) the use of specialized semantic variation operators. We find that a semantic crossover operator based on stagewise regression leads to significant improvements on a set of regression problems. The inclusion of semantic crossover produces state-of-the-art results in a large benchmark study of open-source regression problems, in comparison to several leading machine learning approaches and other genetic programming frameworks. Finally, we look at the collinearity and complexity of the data representations produced by different methods, in order to assess whether relevant, concise, and independent factors of variation can be produced in application.




References

1. I. Arnaldo, K. Krawiec, U.M. O’Reilly, Multiple regression genetic programming, in Proceedings of the 2014 Conference on Genetic and Evolutionary Computation (ACM Press, 2014), pp. 879–886. https://doi.org/10.1145/2576768.2598291
2. I. Arnaldo, U.M. O’Reilly, K. Veeramachaneni, Building predictive models via feature synthesis, in GECCO (ACM Press, 2015), pp. 983–990. https://doi.org/10.1145/2739480.2754693
3. D.A. Belsley, A guide to using the collinearity diagnostics. Comput. Sci. Econ. Manag. 4(1), 33–50 (1991). https://doi.org/10.1007/BF00426854
4. Y. Bengio, A. Courville, P. Vincent, Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013)
5. P.P. Brahma, D. Wu, Y. She, Why deep learning works: a manifold disentanglement perspective. IEEE Trans. Neural Netw. Learn. Syst. 27(10), 1997–2008 (2016)
6. M. Castelli, S. Silva, L. Vanneschi, A C++ framework for geometric semantic genetic programming. Genet. Program. Evol. Mach. 16(1), 73–81 (2015). https://doi.org/10.1007/s10710-014-9218-0
7. T. Chen, C. Guestrin, XGBoost: a scalable tree boosting system, in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16 (ACM, New York, 2016), pp. 785–794. https://doi.org/10.1145/2939672.2939785
8. W.S. Cleveland, Visualizing Data (Hobart Press, New Jersey, 1993)
9. A. Cline, C. Moler, G. Stewart, J. Wilkinson, An estimate for the condition number of a matrix. SIAM J. Numer. Anal. 16(2), 368–375 (1979). https://doi.org/10.1137/0716029
10. E. Conti, V. Madhavan, F.P. Such, J. Lehman, K.O. Stanley, J. Clune, Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. arXiv:1712.06560 (2017)
11. C. Cortes, X. Gonzalvo, V. Kuznetsov, M. Mohri, S. Yang, AdaNet: adaptive structural learning of artificial neural networks. arXiv:1607.01097 (2016)
12. V.V. De Melo, Kaizen programming, in GECCO (ACM Press, New York, 2014), pp. 895–902. https://doi.org/10.1145/2576768.2598264
13. K. Deb, S. Agrawal, A. Pratap, T. Meyarivan, A fast elitist non-dominated sorting genetic algorithm for multi-objective optimization: NSGA-II, in Parallel Problem Solving from Nature PPSN VI, vol. 1917, ed. by M. Schoenauer, K. Deb, G. Rudolph, X. Yao, E. Lutton, J.J. Merelo, H.P. Schwefel (Springer, Berlin, 2000), pp. 849–858. http://repository.ias.ac.in/83498/
14. J. Demšar, Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)
15. C. Eastwood, C.K.I. Williams, A framework for the quantitative evaluation of disentangled representations, in ICLR (2018). https://openreview.net/forum?id=By-7dz-AZ
16. C. Fernando, D. Banarse, M. Reynolds, F. Besse, D. Pfau, M. Jaderberg, M. Lanctot, D. Wierstra, Convolution by evolution: differentiable pattern producing networks. arXiv:1606.02580 (2016)
17. R. Ffrancon, M. Schoenauer, Memetic semantic genetic programming, in GECCO (ACM Press, 2015), pp. 1023–1030. https://doi.org/10.1145/2739480.2754697
18. S.B. Fine, E. Hemberg, K. Krawiec, U.M. O’Reilly, Exploiting subprograms in genetic programming, in Genetic Programming Theory and Practice XV, Genetic and Evolutionary Computation, ed. by W. Banzhaf, R.S. Olson, W. Tozier, R. Riolo (Springer, Berlin, 2018), pp. 1–16
19. D. Floreano, P. Dürr, C. Mattiussi, Neuroevolution: from architectures to learning. Evol. Intell. 1(1), 47–62 (2008). https://doi.org/10.1007/s12065-007-0002-4
20. Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, in Computational Learning Theory, ed. by P. Vitanyi (Springer, Berlin, 1995), pp. 23–37. https://doi.org/10.1007/3-540-59119-2_166
21. J. Friedman, T. Hastie, R. Tibshirani, The Elements of Statistical Learning, Springer Series in Statistics, vol. 1 (Springer, Berlin, 2001)
22. A.H. Gandomi, A.H. Alavi, A new multi-gene genetic programming approach to nonlinear system modeling. Part I: materials and structural engineering problems. Neural Comput. Appl. 21(1), 171–187 (2012). https://doi.org/10.1007/s00521-011-0734-z
23. F. Gomez, J. Schmidhuber, R. Miikkulainen, Efficient non-linear control through neuroevolution, in ECML, vol. 4212 (Springer, 2006), pp. 654–662
24. A. Gonzalez-Garcia, J. van de Weijer, Y. Bengio, Image-to-image translation for cross-domain disentanglement. arXiv:1805.09730 (2018)
25. I. Goodfellow, H. Lee, Q.V. Le, A. Saxe, A.Y. Ng, Measuring invariances in deep networks, in Advances in Neural Information Processing Systems (2009), pp. 646–654
26. M. Graff, E.S. Tellez, E. Villaseñor, S. Miranda, Semantic genetic programming operators based on projections in the phenotype space. Res. Comput. Sci. 94, 73–85 (2015)
27. N. Hadad, L. Wolf, M. Shahar, A two-step disentanglement method, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 772–780
28. I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, A. Lerchner, \(\beta\)-VAE: learning basic visual concepts with a constrained variational framework, in ICLR (2017)
29. A.E. Hoerl, R.W. Kennard, Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12(1), 55–67 (1970)
30. C. Igel, Neuroevolution for reinforcement learning using evolution strategies, in The 2003 Congress on Evolutionary Computation, CEC ’03, vol. 4 (IEEE, 2003), pp. 2588–2595
31. V. Ingalalli, S. Silva, M. Castelli, L. Vanneschi, A multi-dimensional genetic programming approach for multi-class classification problems, in Genetic Programming, ed. by M. Nicolau (Springer, Berlin, 2014), pp. 48–60. https://doi.org/10.1007/978-3-662-44303-3_5
32. G. James, D. Witten, T. Hastie, R. Tibshirani, An Introduction to Statistical Learning, Springer Texts in Statistics, vol. 103, ed. by N.H. Timm (Springer, New York, 2013). https://doi.org/10.1007/978-1-4614-7138-7
33. D.P. Kingma, J. Ba, Adam: a method for stochastic optimization. arXiv:1412.6980 (2014)
34. S. Kirkpatrick, C.D. Gelatt, M.P. Vecchi, Optimization by simulated annealing. Science 220(4598), 671–680 (1983)
35. M. Kommenda, G. Kronberger, M. Affenzeller, S.M. Winkler, B. Burlacu, Evolving simple symbolic regression models by multi-objective genetic programming, in Genetic Programming Theory and Practice XIV, Genetic and Evolutionary Computation (Springer, Ann Arbor, 2015)
36. K. Krawiec, Genetic programming-based construction of features for machine learning and knowledge discovery tasks. Genet. Program. Evol. Mach. 3(4), 329–343 (2002). https://doi.org/10.1023/A:1020984725014
37. K. Krawiec, On relationships between semantic diversity, complexity and modularity of programming tasks, in Proceedings of the Fourteenth International Conference on Genetic and Evolutionary Computation (ACM, 2012), pp. 783–790
38. K. Krawiec, Behavioral Program Synthesis with Genetic Programming, vol. 618 (Springer, Berlin, 2016)
39. K. Krawiec, U.M. O’Reilly, Behavioral programming: a broader and more detailed take on semantic GP, in Proceedings of the 2014 Conference on Genetic and Evolutionary Computation (ACM Press, 2014), pp. 935–942. https://doi.org/10.1145/2576768.2598288
40. A. Kumar, P. Sattigeri, A. Balakrishnan, Variational inference of disentangled latent concepts from unlabeled observations, in ICLR (2018). https://openreview.net/forum?id=H1kG7GZAW
41. W. La Cava, T. Helmuth, L. Spector, J.H. Moore, A probabilistic and multi-objective analysis of lexicase selection and \(\varepsilon\)-lexicase selection. Evol. Comput. (2018). https://doi.org/10.1162/evco_a_00224
42. W. La Cava, J. Moore, A general feature engineering wrapper for machine learning using \(\epsilon\)-lexicase survival, in Genetic Programming (Springer, Cham, 2017), pp. 80–95. https://doi.org/10.1007/978-3-319-55696-3_6
43. W. La Cava, J.H. Moore, Ensemble representation learning: an analysis of fitness and survival for wrapper-based genetic programming methods, in GECCO ’17: Proceedings of the 2017 Genetic and Evolutionary Computation Conference (ACM, Berlin, 2017), pp. 961–968. https://doi.org/10.1145/3071178.3071215. arXiv:1703.06934
44. W. La Cava, J.H. Moore, Semantic variation operators for multidimensional genetic programming, in Proceedings of the 2019 Genetic and Evolutionary Computation Conference, GECCO ’19 (ACM, Prague, 2019). https://doi.org/10.1145/3321707.3321776. arXiv:1904.08577
45. W. La Cava, S. Silva, K. Danai, L. Spector, L. Vanneschi, J.H. Moore, Multidimensional genetic programming for multiclass classification. Swarm Evol. Comput. (2018). https://doi.org/10.1016/j.swevo.2018.03.015
46. W. La Cava, T.R. Singh, J. Taggart, S. Suri, J.H. Moore, Learning concise representations for regression by evolving networks of trees, in ICLR (2019). arXiv:1807.00981 (in press)
47. W. La Cava, L. Spector, K. Danai, Epsilon-lexicase selection for regression, in Proceedings of the Genetic and Evolutionary Computation Conference 2016, GECCO ’16 (ACM, New York, 2016), pp. 741–748. https://doi.org/10.1145/2908812.2908898
48. Q. Le, B. Zoph, Using machine learning to explore neural network architecture (2017). https://ai.googleblog.com/2017/05/using-machine-learning-to-explore.html
49. C. Liu, B. Zoph, J. Shlens, W. Hua, L.J. Li, L. Fei-Fei, A. Yuille, J. Huang, K. Murphy, Progressive neural architecture search. arXiv:1712.00559 (2017)
50. T. McConaghy, FFX: fast, scalable, deterministic symbolic regression technology, in Genetic Programming Theory and Practice IX, ed. by R. Riolo, E. Vladislavleva, J.H. Moore (Springer, Berlin, 2011), pp. 235–260. https://doi.org/10.1007/978-1-4614-1770-5_13
51. D. Medernach, J. Fitzgerald, R.M.A. Azad, C. Ryan, A new wave: a dynamic approach to genetic programming, in Proceedings of the Genetic and Evolutionary Computation Conference 2016, GECCO ’16 (ACM, New York, 2016), pp. 757–764. https://doi.org/10.1145/2908812.2908857
52. V.V. de Melo, W. Banzhaf, Automatic feature engineering for regression models with machine learning: an evolutionary computation and statistics hybrid. Inf. Sci. (2017). https://doi.org/10.1016/j.ins.2017.11.041
53. G. Montavon, K.R. Müller, Better representations: invariant, disentangled and reusable, in Neural Networks: Tricks of the Trade, Lecture Notes in Computer Science, ed. by G. Montavon, K.R. Müller (Springer, Berlin, 2012), pp. 559–560
54. A. Moraglio, K. Krawiec, C.G. Johnson, Geometric semantic genetic programming, in Parallel Problem Solving from Nature, PPSN XII (Springer, 2012), pp. 21–31
55. M. Muharram, G.D. Smith, Evolutionary constructive induction. IEEE Trans. Knowl. Data Eng. 17(11), 1518–1528 (2005)
56. L. Muñoz, S. Silva, L. Trujillo, M3GP: multiclass classification with GP, in Genetic Programming (Springer, 2015), pp. 78–91
57. L. Muñoz, L. Trujillo, S. Silva, M. Castelli, L. Vanneschi, Evolving multidimensional transformations for symbolic regression with M3GP. Memet. Comput. (2018). https://doi.org/10.1007/s12293-018-0274-5
58. K. Neshatian, M. Zhang, P. Andreae, A filter approach to multiple feature construction for symbolic learning classifiers using genetic programming. IEEE Trans. Evol. Comput. 16(5), 645–661 (2012)
59. R.M. O’Brien, A caution regarding rules of thumb for variance inflation factors. Qual. Quant. 41(5), 673–690 (2007). https://doi.org/10.1007/s11135-006-9018-6
60. R.S. Olson, W. La Cava, P. Orzechowski, R.J. Urbanowicz, J.H. Moore, PMLB: a large benchmark suite for machine learning evaluation and comparison. BioData Mining (2017). arXiv:1703.00512
61. P. Orzechowski, W. La Cava, J.H. Moore, Where are we now? A large benchmark study of recent symbolic regression methods. arXiv:1804.09331 (2018). https://doi.org/10.1145/3205455.3205539
62. T.P. Pawlak, B. Wieloch, K. Krawiec, Semantic backpropagation for designing search operators in genetic programming. IEEE Trans. Evol. Comput. 19(3), 326–340 (2015)
63. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
64. H. Pham, M.Y. Guan, B. Zoph, Q.V. Le, J. Dean, Efficient neural architecture search via parameter sharing. arXiv:1802.03268 (2018)
65. E. Real, Using evolutionary AutoML to discover neural network architectures (2018). https://ai.googleblog.com/2018/03/using-evolutionary-automl-to-discover.html
66. E. Real, S. Moore, A. Selle, S. Saxena, Y.L. Suematsu, J. Tan, Q. Le, A. Kurakin, Large-scale evolution of image classifiers. arXiv:1703.01041 (2017)
67. M. Schmidt, H. Lipson, Age-fitness Pareto optimization, in Genetic Programming Theory and Practice VIII (Springer, 2011), pp. 129–146
68. D. Searson, M. Willis, G. Montague, Co-evolution of non-linear PLS model components. J. Chemom. 21(12), 592–603 (2007). https://doi.org/10.1002/cem.1084
69. D.P. Searson, D.E. Leahy, M.J. Willis, GPTIPS: an open source genetic programming toolbox for multigene symbolic regression, in Proceedings of the International MultiConference of Engineers and Computer Scientists, vol. 1 (IMECS, Hong Kong, 2010), pp. 77–80
70. S. Silva, L. Muñoz, L. Trujillo, V. Ingalalli, M. Castelli, L. Vanneschi, Multiclass classification through multidimensional clustering, in Genetic Programming Theory and Practice XIII, vol. 13 (Springer, Ann Arbor, 2015)
71. L. Spector, Assessment of problem modality by differential performance of lexicase selection in genetic programming: a preliminary report, in Proceedings of the Fourteenth International Conference on Genetic and Evolutionary Computation Companion (2012), pp. 401–408
72. K.O. Stanley, Compositional pattern producing networks: a novel abstraction of development. Genet. Program. Evolvable Mach. 8(2), 131–162 (2007). https://doi.org/10.1007/s10710-007-9028-8
73. K.O. Stanley, J. Clune, J. Lehman, R. Miikkulainen, Designing neural networks through neuroevolution. Nat. Mach. Intell. 1(1), 24 (2019). https://doi.org/10.1038/s42256-018-0006-z
74. K.O. Stanley, D.B. D’Ambrosio, J. Gauci, A hypercube-based encoding for evolving large-scale neural networks. Artif. Life 15(2), 185–212 (2009). https://doi.org/10.1162/artl.2009.15.2.15202
75. K.O. Stanley, R. Miikkulainen, Evolving neural networks through augmenting topologies. Evol. Comput. 10(2), 99–127 (2002). https://doi.org/10.1162/106365602320169811
76. R. Tibshirani, Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Methodol. 58, 267–288 (1996)
77. R. Tibshirani, T. Hastie, B. Narasimhan, G. Chu, Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. 99(10), 6567–6572 (2002). https://doi.org/10.1073/pnas.082099299
78. L. Vanneschi, M. Castelli, L. Manzoni, K. Krawiec, A. Moraglio, S. Silva, I. Gonçalves, PSXO: population-wide semantic crossover, in Proceedings of the Genetic and Evolutionary Computation Conference Companion (ACM, 2017), pp. 257–258
79. E. Vladislavleva, G. Smits, D. den Hertog, Order of nonlinearity as a complexity measure for models generated by symbolic regression via Pareto genetic programming. IEEE Trans. Evol. Comput. 13(2), 333–349 (2009). https://doi.org/10.1109/TEVC.2008.926486
80. W. Whitney, Disentangled representations in neural models. arXiv:1602.02383 (2016)
81. B. Zoph, Q.V. Le, Neural architecture search with reinforcement learning. arXiv:1611.01578 (2016)

Acknowledgements


This work was supported by NIH Grants K99LM012926-01A1, AI116794 and LM012601, as well as the PA CURE Grant from the Pennsylvania Department of Health. Special thanks to Tilak Raj Singh and other members of the Computational Genetics Lab at the University of Pennsylvania.

Author information



Corresponding author

Correspondence to William La Cava.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.



Additional experiment information

Table 6 details the hyperparameters for each method used in the experimental results described in Sects. 4 and 5.

Table 6 Comparison methods and their hyperparameters for the comparisons in Sect. 4.2. Tuned values are denoted with brackets

Comparison of selection algorithms

Our initial analysis sought to determine how different SO approaches performed within this framework. We tested five methods: (1) NSGA2, (2) Lex, (3) LexNSGA2, (4) simulated annealing, and (5) random search. The simulated annealing and random search approaches are described below.

Simulated annealing Simulated annealing (SimAnn) is a non-evolutionary technique that instead models the optimization process after the metallurgical process of annealing. In our implementation, offspring compete with their parents; in the case of multiple parents, an offspring competes with the parent with which it shares more nodes. The probability of an offspring replacing its parent in the population is given by the equation

$$\begin{aligned} P_{sel}(n_o | n_p, t) = \exp {\left( \frac{F(n_p) - F(n_o)}{t}\right) } \end{aligned}$$

The probability that an offspring replaces its parent is a function of their fitnesses, F, in our case the mean squared loss of the candidate model. In Eq. 7, t is a scheduling parameter that controls the rate of “cooling”, i.e. the rate at which worsening steps in the search space are tolerated by the update rule. In accordance with [34], we use an exponential schedule for t, defined as \(t_{g} = (0.9)^g t_0\), where g is the current generation and \(t_0\) is the starting temperature. \(t_0\) is set to 10 in our experiments.
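A minimal sketch of this acceptance rule and cooling schedule follows. The function names are illustrative rather than those of the actual implementation, and we cap the acceptance probability at 1, the usual convention when the offspring improves on its parent:

```python
import numpy as np

def p_accept(loss_parent, loss_offspring, t):
    """Acceptance probability from Eq. 7. F is the mean squared loss,
    so lower is better: improvements are always accepted, and worse
    offspring are accepted with a probability that shrinks as t cools."""
    return min(1.0, float(np.exp((loss_parent - loss_offspring) / t)))

def temperature(g, t0=10.0, decay=0.9):
    """Exponential cooling schedule t_g = decay**g * t0, with t0 = 10."""
    return (decay ** g) * t0
```

At generation 0 a slightly worse offspring is accepted with probability near 1; as g grows and t shrinks, the same worsening step becomes increasingly unlikely to survive.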

Random search We compare the selection and survival methods to random search, in which no assumptions are made about the structure of the search space. To conduct random search, we randomly sample \({\mathbb {S}}\) using the initialization procedure. Since FEAT begins with a linear model of the process, random search will produce a representation at least as good as this initial model on the internal validation set.

A note on archiving When FEAT is used without a complexity-aware survival method (i.e., with Lex, SimAnn, or Random), a separate population is maintained that acts as an archive. The archive maintains a Pareto front according to minimum loss and complexity (Eq. 3). At the end of optimization, the archive is tested on a small hold-out validation set, and the individual with the lowest validation loss is the final selected model. Maintaining this archive helps protect against overfitting resulting from overly complex, high-capacity representations, and the archive can also be interpreted directly to help understand the process being modelled.
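The archive logic described above can be sketched as follows. The dictionary representation of individuals and the function names are illustrative, not FEAT's actual data structures; the point is the non-domination test over the two archive objectives, loss and complexity:

```python
def dominates(a, b):
    """True if model a is at least as good as b in both loss and
    complexity, and strictly better in at least one."""
    return (a["loss"] <= b["loss"] and a["complexity"] <= b["complexity"]
            and (a["loss"] < b["loss"] or a["complexity"] < b["complexity"]))

def update_archive(archive, candidate):
    """Insert candidate only if no archived model dominates it, then drop
    any archived models the candidate dominates, so the archive remains
    a Pareto front over loss and complexity."""
    if any(dominates(m, candidate) for m in archive):
        return archive
    return [m for m in archive if not dominates(candidate, m)] + [candidate]

def select_final(archive, validation_loss):
    """After optimization, return the archived model with the lowest
    loss on the hold-out validation set."""
    return min(archive, key=validation_loss)
```

Because the archive keeps every non-dominated trade-off between accuracy and complexity, the final selection step can prefer a simpler model whenever its hold-out loss is competitive.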

We benchmarked these approaches in a separate experiment on 88 datasets from PMLB [60]. The results are shown in Figs. 13, 14, 15 and 16. Considering Figs. 13 and 14, we see that LexNSGA2 achieves the best average \(R^2\) value while producing small solutions in comparison to Lex. NSGA2, SimAnneal, and Random search all produce less accurate models. The runtime comparisons in Fig. 15 show that the methods are mostly within an order of magnitude of each other, with NSGA2 being the fastest (due to its maintenance of small representations) and Random search being the slowest. The slow runtime of Random search suggests that, in the absence of selection pressure, the variation operators tend to increase the average size of solutions over many iterations.

Fig. 13

Mean tenfold CV \(R^2\) performance for various SO methods in comparison to other ML methods, across the benchmark problems

Fig. 14

Size comparisons of the final models in terms of number of parameters

Fig. 15

Wall-clock runtime for each method in seconds

Fig. 16

Mean correlation between engineered features for different SO methods compared to the correlations in the original data (ElasticNet)

Illustrative example

We show an illustrative example of the final archive and model selection process from applying FEAT to a galaxy visualization dataset [8] in Fig. 17. The red and blue points correspond to training and validation scores for each archived representation, with a square denoting the selected final model. Five of the representations are printed in plain text, with each feature separated by brackets. The vertical lines in the left figure denote the test scores for FEAT, RF and ElasticNet. Notably, ElasticNet's performance roughly matches that of a linear representation, and the RF test performance corresponds to the representation \([\tanh (x_0)][\tanh (x_1)]\), which is suggestive of axis-aligned splits for \(x_0\) and \(x_1\). The selected model is shown on the right, with the features sorted according to the magnitudes of \(\beta\) in the linear model. The final representation combines tanh, polynomial, linear and interacting features. This representation is a clear extension of simpler ones in the archive, and the archive thereby serves to characterize the improvement in predictive accuracy brought about by increasing complexity. Although a mechanistic interpretation requires domain expertise, the final representation is certainly concise and amenable to interpretation.

Fig. 17

(Left) Representation archive for the visualizing galaxies dataset. (Right) Selected model and its weights. Internal weights omitted

Statistical comparisons

We perform pairwise comparisons of methods according to the procedure recommended by Demšar [14] for comparing multiple estimators (Table 7). In Table 8, the CV \(R^2\) rankings are compared. In Table 9, the best model size rankings are compared. Note that KernelRidge is omitted from the size comparisons since we do not have a comparable way of measuring its model size.

Table 7 Algorithms from Orzechowski et al. [61] with their parameter settings
Table 8 Bonferroni-adjusted p values using a Wilcoxon signed rank test of R\(^2\) scores for the FEAT variants across all benchmarks
Table 9 Bonferroni-adjusted p values using a Wilcoxon signed rank test of sizes for the FEAT variants across all benchmarks
Table 10 Bonferroni-adjusted p values using a Wilcoxon signed rank test of MSE scores for the methods across all benchmarks
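The adjustment used in these tables can be sketched as follows, assuming SciPy's `wilcoxon` test. The function name and data layout are illustrative: `scores` maps each method to its per-dataset scores, paired across methods, and each raw p value is multiplied by the number of pairwise tests (Bonferroni) and capped at 1:

```python
from scipy.stats import wilcoxon

def pairwise_wilcoxon(scores):
    """Pairwise Wilcoxon signed-rank tests with Bonferroni adjustment.
    `scores` maps a method name to its per-dataset scores; every method
    must be scored on the same datasets, in the same order."""
    methods = sorted(scores)
    n_tests = len(methods) * (len(methods) - 1) // 2
    adjusted = {}
    for i, m1 in enumerate(methods):
        for m2 in methods[i + 1:]:
            _, p = wilcoxon(scores[m1], scores[m2])
            adjusted[(m1, m2)] = min(1.0, p * n_tests)  # Bonferroni
    return adjusted
```

The signed-rank test is appropriate here because scores on the same benchmark dataset are paired, and no normality assumption is made about the score differences.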


About this article


Cite this article

La Cava, W., Moore, J.H. Learning feature spaces for regression with genetic programming. Genet Program Evolvable Mach 21, 433–467 (2020). https://doi.org/10.1007/s10710-020-09383-4



  • Representation learning
  • Feature construction
  • Variation
  • Regression