Parameter identification for symbolic regression using nonlinear least squares

In this paper we analyze the effects of using nonlinear least squares for parameter identification of symbolic regression models and integrate it as local search mechanism in tree-based genetic programming. We employ the Levenberg–Marquardt algorithm for parameter optimization and calculate gradients via automatic differentiation. We provide examples where the parameter identification succeeds and fails and highlight its computational overhead. Using an extensive suite of symbolic regression benchmark problems we demonstrate the increased performance when incorporating nonlinear least squares within genetic programming. Our results are compared with recently published results obtained by several genetic programming variants and state of the art machine learning algorithms. Genetic programming with nonlinear least squares performs among the best on the defined benchmark suite and the local search can be easily integrated in different genetic programming algorithms as long as only differentiable functions are used within the models.


Introduction
Symbolic regression is the task of finding a mathematical model that best explains the relationship between one or more independent variables and one dependent variable. The ability to simultaneously search the space of possible model structures and their parameters (in terms of appropriate numerical coefficients) makes genetic programming (GP) [21,34] a popular approach for symbolic regression.
However, as a biologically-inspired approach guided by fitness-based selection, the GP search process for symbolic regression is characterized by a loose coupling between fitness, expressed as an error measure with respect to the target variable, and variation operators | subordinate search heuristics in solution space that generate new models in each generation.
Consequently, it is difficult to foresee the effects on model output when variation operators perform changes on the model structure, often leading to situations where promising model structures are ignored by the algorithm due to low fitness caused by ill-fitting parameters [44]. In some cases, this can lead to necessary building blocks becoming extinct in the population before they are combined in a solution and thus recognized by the algorithm.
Generally speaking, achieving high-quality solutions in GP-based symbolic regression requires solving three interrelated subtasks: 1. Selection of the appropriate subset of variables (feature selection) 2. Discovery of the best suited model structure containing these variables 3. Determination of optimal parameter values of the model Each of these subtasks depends on the results of the previous subtask to generate optimal solutions, therefore improvements of one task can lead to improvements to the whole algorithm. Although these three subtasks have to be solved in any case, most algorithms create solutions without explicitly addressing the necessary steps.
Symbolic regression problems can be solved by tree-based GP that evolves individuals which capture all characteristics of a solution such as the appropriate subset of variables, model structure, and parameters. Individuals are in general manipulated as a whole (by crossover and/or mutation), which results in difficulties for generating good solutions. The reason is that such evolutionary variations partially change the model structure and occurring variables which necessitates a re-fitting of all parameters of the model.
In this context, hybridization with local search methods improves algorithm effectiveness by shifting the burden of finding appropriate numerical coefficients to subordinate algorithms or heuristics that become part of the evolutionary process at the same level with crossover and mutation operators. This division of tasks is particularly appropriate since genetic programming is already well suited for variable selection [6,41].
In this contribution we study the hybridization of tree-based GP with nonlinear least squares optimization of numeric parameters and discuss aspects ranging from implementation to runtime performance and solution quality. Our methodology for parameter identification in symbolic regression combines linear scaling, automatic differentiation and gradient-based optimization through the Levenberg-Marquardt (LM) algorithm. A novel contribution is the comparison of results produced by our proposed approach to results of a number of similar approaches that integrate local search mechanisms as well and other non-evolutionary regression techniques as reported in [33].

Numerical parameters in symbolic regression
When performing symbolic regression, numerical parameters are of prime importance. Numerical parameters, also referred to as constants, form together with variables the terminal set T . Every new leaf node in a symbolic expression tree is initialized with elements from the terminal set T . Internal nodes are initialized with elements from the function set F . Together, the two sets define the primitive set P used by the GP system to generate new symbolic expressions.
Constants in T can be either explicitly stated (as predefined and immutable numerical constants), or they can be defined as ephemeral random constants (ERC) [21]. In the first case, the terminal set may contain different numerical values alongside variables, such as T = {X, 1.0, 2.0, } containing the variable X and three predefined numerical constants. Consequently, random uniform initialization would result in a 75% probability for a constant to be selected during tree creation when a leaf node is initialized. This disparate ratio between constants and variables could be altered by introducing a selection bias for terminal symbols.
In the second case, the special symbol R is added to the terminal set and every time the ERC symbol R is selected during tree creation a new constant value is sampled from a predefined distribution. ERCs provide greater flexibility as it is possible to create a greater variety of real-valued constants, compared to including constants directly in the terminal set. Whether immutable or ephemeral constants are more suitable largely depends on the problem at hand and both approaches might even be combined.
The numerical constants potentially reachable through the variation of solutions depend solely on the initial constant values during tree creation. To allow new constants to be discovered, GP literature recommends adding special manipulation operators to the algorithm that alter the numerical constants of a solution, for example as described in Schoenauer et al. [38], where a random Gaussian variable is added to the constant, which is inspired by mutation in evolution strategies [39]. A similar technique by Gaussian mutation is detailed in Ryan and Keijzer [37]. Another possibility inspired by simulated annealing [18] is to replace all numeric constants with new values, sampled from a uniform distribution and adapted by a temperature factor [10]. Another alternative for adapting numerical values in symbolic regression is the inclusion of local search in GP.

Literature review
Local search aims to find a local optimum starting from a single solution. The best solution among a neighboring set of solutions is selected by applying a local move and through iterative application of such moves a local optimum is reached. Local search algorithms are often employed as subordinate heuristics for solution refinement within a higher-level metaheuristic framework. In the context of symbolic regression, local search refers to a further improvement of existing solutions towards a local optimum.
Overall, the hybridization of GP with local search methods represents a good fit. The inherent disadvantage of local search methods of converging toward the next attracting local optimum (depending on initial starting conditions) is reduced by GP, because it is likely that multiple, differently parameterized instances of the same model structure are present in the population, thus providing different starting conditions for the local search.
Another helpful technique to escape such local optima is to keep random mutation enabled, although its role and significance are reduced as the identification of appropriate numerical parameters is often performed by local search methods. As a result mutation is mostly responsible for introducing variations during the search for symbolic regression solutions.
Krawiec [22] integrates a hill-climbing algorithm into GP for symbolic classification and applies it to the best solution in each generation. The author reports a statistically significant improvement over standard GP in terms of classification accuracy. This result shows that even a small amount of local optimization can provide substantial benefits.
Topchy and Punch [42] use gradient-descent for parameter adaptation in GP for symbolic regression. They perform 100 gradient-descent iterations for each individual and show that this successfully prevents the loss of good structures by comparing the results with and without gradient-descent. Furthermore, they show that a bias towards model structures that are more readily adaptable is introduced and that their approach outperforms standard GP.
Wang et al. [48] apply a set of representation-specific local search operators to decision trees, for GP-based classification. The algorithm called Memetic GP (MGP) achieves better training quality but is more prone to overfitting due to the high intensity local search.
Lane et al. [26] use a different approach where they change the functions of internal tree nodes until fitness is improved. They test different strategies, according to several tree parsimony measures, when to apply this form of tuning and to which internal nodes. They report significant improvements in test quality over standard GP. These results are verified by Azad and Ryan [3] when integrating the tuning of internal tree nodes in GP.
Z-Flores et al. [50] use an alternative parameterization of GP trees, where a weight coefficient is assigned to each function node. They employ gradient descent [8] and test different strategies for the integration in GP. They find that best results are obtained when local search is applied to all individuals and that optimizing only a subset of the best individuals also represents a viable strategy. The same approach of adding and tuning weight coefficients of internal nodes is also evaluated in GP for binary classification [51], where the classification accuracy improved on all tested problems.
Juarez et al. [15] test the benefits of integrating local search in GP and neat-GP. While the performance in terms of quality does not increase significantly, they were able to produce consistently smaller and easier to interpret solutions. Trujillo et al. [44] also use gradient descent to optimize a percentage of individuals in each generation. They apply local search stochastically based on a probability given by tree size. They report substantial improvements in terms of solution quality, convergence speed and program size.
La Cava et al. [24] add an "epigenetic layer" as a mechanism for local search and test its effectiveness in solving symbolic regression and program synthesis problems. They attach a corresponding epigenome to each individual, which determines which genes are active in its structure. Epigenomes are altered by mutation and only changes that improve fitness or program complexity are accepted. The proposed approach is able to outperform GP in terms of fitness, exact solutions, and program sizes.
Castelli et al. [4] integrate local search in Geometric Semantic GP and test it on a number of real-world symbolic regression problems. They find that the resulting algorithm severely overfits and propose an alternative approach where local search is only applied in the first 50 generations of the evolutionary run. Table 1 offers a detailed summary of previous approaches of local search in genetic programming. Gradient descent and hill climbing are prevalent local search methods, while Lamarckian evolution is the preferred local learning behavior. In the context of learning behavior, Lamarckian and Baldwinian learning refer to the way GP handles the information obtained via local search, which can be coded back into the genotype (Lamarckian) or not (Baldwinian). In both cases, local search affects the selection process and changes the behavior of the algorithm.

Scope of this study
Only few other works [15,42,50] referenced in our survey use gradient-based local search with GP for symbolic regression. We therefore consider it opportune to revisit the topic and investigate additional aspects such as the effectiveness of local search, its convergent behavior and runtime impact on the evolutionary algorithm.
The main aim of this work is to provide a detailed treatment of gradient-based numerical optimization in the context of GP local search for symbolic regression. We follow the Lamarckian model where numerical coefficients optimized via nonlinear least squares are written back to the genotype. We investigate the benefits of nonlinear least squares optimization for two GP flavors (Standard GP and Offspring Selection GP [1]) and analyze performance in comparison with several other state-of-the-art methods on a large selection of benchmark problems [33].

Methodology
Our approach for parameter identification in symbolic regression combines automatic differentiation with gradient-based optimization [19,20]. This approach is integrated as a local search mechanism in genetic programming and has been implemented in the open source framework for heuristic optimization HeuristicLab [47].

Mathematical formulation
For simplicity, we assume no implicit feature selection is performed and each GP model uses the entire set of input variables.
1. Let X ∈ ℝ n×m be an n × m matrix where each column x i ∈ ℝ n , i = 1, … , m is an n-dimensional input variable and each row s j ∈ ℝ m , j = 1, … , n is an m-dimensional training sample. In what follows we take X as training data for the algorithm. 2. Let y ∈ ℝ n be the target vector for the regression problem. 3. Let P be the GP primitive set and S the syntactic search space defined by it. 4. Let be the space of possible expressions and their parameters. That is, the set of all tuples (E, ) , where E ∈ S is a symbolic expression and ∈ ℝ p a parameter vector of length p corresponding to the numerical parameters of E. Let us call a tuple (E, ) a symbolic expression model M E, ∈ . 5. Let G ∶ × ℝ n×m → ℝ n be a function that evaluates a model M E, ∈ on training data X and returns an n-dimensional output vector ŷ ∈ ℝ n : The overall symbolic regression goal is to find the optimal model M opt = (E opt , opt ) During local search the model structure E remains fixed for a model M E, while the parameter vector ∈ ℝ p is subject to optimization. Thus, for the sake of simplicity we define the local search residual H ∶ ℝ p → ℝ n as a function that evaluates parameter vector and returns the difference between the model output and the actual target: We formulate the goal of local search as a minimization problem: To optimize over n training samples we consider the Jacobian J( ) of H( ): For nonlinear least squares, at each iteration of the gradient descent algorithm we use the linearization which leads to the following linear least squares problem: The problem described by Eq. (5) can be efficiently solved using trust region or line search methods [31].

Local search algorithm
Our implementation uses a trust region gradient descent approach, namely the Levenberg-Marquardt (LM) algorithm [27,29] with an iteration limit as stopping criterion. Per default the LM algorithm is stopped after ten iterations. Derivatives are calculated using automatic differentiation [12,36], since the structure of the evaluated expressions is known and contains only arithmetic operations and elementary functions that can be derived using the chain rule.
We integrate linear scaling [16,17] with local search in order to bring each estimate ŷ in range with the target values y. The reason is to eliminate the need to find the correct offset and scale for the estimates, allowing GP to focus on finding the correct model structure.
Linear scaling is achieved in practice by introducing a structural extension such that a model M E, is added as an input to a linear transformation block as illustrated in Fig. 1. The resulting expression will contain four additional nodes and its parameter vector will include two additional coefficients (the slope and intercept of the linear transformation).
These individual components linear scaling, gradient calculation, and nonlinear least squares optimization are the building blocks for parameter identification of symbolic regression models and the whole method is given as pseudo-code in Algorithm 1.
Implementation-wise we prepare a model M E, from a tree-based symbolic expression by first converting it to a data structure with which automatic differentiation can operate. The initial values for are extracted from the leaf nodes of the tree. At each iteration the LM algorithm computes the residuals H( ) and the Jacobian J( ) and updates accordingly.
Then the fitness of the symbolic expression tree is calculated, according to the objective function used by the algorithm solving the symbolic regression problem. This way, although the LM algorithm optimizes the mean squared error, another objective function (for example the coefficient of determination R 2 or the mean relative error) could be used to assess the fitness of the generated model. Finally, the optimized numerical values are written back to the symbolic expression tree according to the Lamarckian learning model.

Analysis
We analyze the adaptation of numeric parameter values with respect to achieved improvement, convergence behavior and runtime overhead. We provide a detailed breakdown of measured execution time in relation with the number of training samples and the number of local optimization iterations.
First, we demonstrate the behavior of optimizing the parameters of a predefined model structure that matches the data generating function. The model structure (Eq. 6) contains three variables x i and four parameters i . The training data is generated using Eq. (6) with = {30.0, 1.0, 1.0, 10.0} by randomly sampling 300 data points in the interval [0.05, 2] for x 1 and x 3 and [1, 2] for x 2 . This data generating procedure corresponds to the original problem definition given in [45]. We have chosen this particular synthetic problem because the parameters have a nonlinear effect in the model and several iterative steps are required to find the optimal values when using the LM algorithm. The elements of the initial vector are sampled uniformly from the interval [− 100, 100] . Then, is iteratively optimized until no further improvement can be achieved. Table 2 illustrates a successful application of the LM algorithm, where the parameters of the model are correctly identified. The first row 'Iteration 0' states the initial parameter values and each subsequent row the updated parameter vector . It can be observed that all parameters are simultaneously adapted according to the current gradient information. Furthermore, it could happen that the values deviate more from the target values compared to previous iterations (see for example 1 between Iteration 15 and 20), but in this example the correct values are identified after 35 iterations .
Appropriate starting values play a crucial role in the success of the LM algorithm. It requires a starting point within the basin of attraction for the global optimum otherwise it converges to a local optimum. Thus, the algorithm might fail to identify the optimal model parameters even when an optimal model structure is identified by GP. Model structures with multiple optima are particularly vulnerable to this phenomenon. We illustrate the importance of good starting values by optimizing the model structure M E, = sin( x) . In contrast with the previous example where was a parameter vector, here is a scalar value. This corresponds to identification of the frequency of a sinusoidal function. We generate a dataset containing 6000 samples for x ranging from − 3.0 to 3.0 with a step size of 0.001 and a frequency = 2.5.
We test different starting values of in the interval [− 10, 10] with a step size of 0.05 and plot the mean squared error between the generated data and the model outputs sin( x) before and after LM optimization. The results illustrated in Fig. 2 show that LM always improves the mean squared error. However, the global optimum is only reached if the starting value for lies in the interval [0. 8,4] and thus within the basin of attraction of the global optimum. Otherwise, the optimization converges towards a local optimum and a mean squared error of almost zero cannot be reached.
The largest drawback of the methodology is that it is computationally expensive to perform multiple gradient and function evaluations for each model structure. The LM algorithm has an asymptotic runtime complexity of O(nd 2 ) with n the number of training samples and d the number of parameters for a fixed number of iterations. We empirically analyze its impact on the execution performance of GP using randomly created expression trees. Figure 3 shows the growth in runtime for the LM algorithm with an increasing number of maximum iterations and training samples. For this experiment we created 1000 symbolic expression trees with the probabilistic tree creator (PTC 2) [28] and report the median execution time of the local search for an individual tree. 1 As expected, the runtime increases linearly with the number of training samples for the different tested LM iterations. The number of training samples has the largest influence, as the training error has to be recalculated after each iteration of the LM algorithm. The number of iterations has a smaller influence on runtime since the LM algorithm may stop early when no more improvement can be achieved.
Compared to tree evaluation without local search (0 iterations), the accumulated overhead exceeds one order of magnitude for training data containing more than 1000 training samples. This effect (observed in numerous other publications) can be alleviated by applying parameter identification to smaller segments of the population, selected either probabilistically or according to fitness, or by using fewer training samples for parameter identification.

Experiments and results
We use the Penn Machine Learning Benchmarks (PMLB) collection of benchmark regression problems developed by Olson et al. [32] with the same choice of problems as in Orzechowski et al. [33] for evaluating the performance of local search A fivefold cross-validation is performed on the training data and repeated five times to account for stochastic effects. Hence, we have created 25 symbolic regression models (five times fivefold cross validation) that are trained on 4/5 folds of the training data and evaluated on the remaining fold. Afterwards, we select the best parameter settings for each algorithm and problem based on the average mean squared error on the fold not used for training obtained by these 25 generated models. Using the identified parameter settings we perform 50 repetitions of each algorithm on the whole training partition. The parameter settings for GP and OSGP, which have not been tuned, are detailed in Table 3.
This results are illustrated graphically in Fig. 4 for GP and Fig. 5 for OSGP, where each chart column (ordered ascending by the number of training samples)  corresponds to a tested problem. These figures show the difference in the median R 2 between the local search variant and the standard genetic programming variant. The color green represents a positive difference in favor of the NLS hybridization while the color red represents a negative difference. Column hatching indicates statistical significance calculated using a two-sided Wilcoxon rank-sum test ( = 0.05 ). Overall it can be observed that hybridization with local search drastically improves the modeling accuracy. A summary of the overall results is given in Table 4, where it can be seen that both GP NLS and OSGP NLS produce statistically-better results for the majority of problems, for both training and test data. For an easier comparison we use symbols +, − to denote a statistically-significant performance difference between the NLS hybridization and the baseline variant.
The detailed raw data containing the results of each algorithm repetition is available for download at the HeuristicLab homepage. 2 Aggregated results for each problem are given in Appendix Tables 6 and 7 as median R 2 ± interquartile range on the training and test data. The last column in each table illustrates how GP NLS and OSGP NLS compare against their standard counterparts.
We then perform a large-scale comparison between our GP variants and several symbolic regression methods tested by Orzechowski et al. [33] and summarized in   Table 5. We follow the original methodology in [33] and calculate algorithm ranks on the tested problems based on median mean squared error. The experimental setup has been intentionally chosen as similar as possible to the benchmarking study [33] so that we can compare the results of our methods with already published ones. Therefore, we ranked the performance of all algorithms based on the median mean squared errors. However, we only created new results for GP, GP NLS, OSGP and OSPG NLS and reused the publicly available results and scripts for analysis and visualization provided by the Epistasis Lab at the University of Pennsylvania. 3 Box plots showing the rank distribution for each algorithm across all problems are shown in Fig. 6 for the training set and in Fig. 7 for the test set.
We assess performance on the basis of training and test median ranks, and we compare the significance of the problem rankings obtained by each algorithm using the Kruskal test [23] to ascertain that the calculated rank medians are statistically different and Dunn's test [9] with Holm-Bonferroni adjustments, to ascertain whether the pairwise rank differences between the tested algorithms are  Tables 8  and 9.
In terms of training performance, both GP and OSGP algorithms fall behind all other GP variants except AFP. At the same time, neither GP NLS and OSGP NLS distinguish themselves as top performers on this benchmark set as they are outranked by MRGP, XGBoost and GradBoost.
However, in terms of generalization ability (test performance), the results show that GP NLS is tied with OSGP NLS, EPLEX-1M and XGBoost and outperforms all other methods. In turn, OSGP NLS is tied with GP NLS and EPLEX-1M and outperforms all other methods. Overall, the proposed local search hybridization is able to significantly improve the generalization ability of algorithms.

Conclusion
Hybridization with local search represents a particularly effective approach for improving GP performance on a wide array of symbolic regression and symbolic classification. Many studies in the literature report substantial benefits in terms of solution quality, convergence speed and model size, with the added cost of increased running time caused by additional evaluations required by the local improvement procedure.
In this work we described in detail one such hybridization using GP and Offspring Selection GP for the evolution of model structure and the Levenberg-Marquardt algorithm for optimization of numerical parameters for symbolic regression. Implementation-wise, it is straight-forward to integrate linear scaling by extending the model parameter vector with two additional coefficients for scale and intercept. In general the approach works remarkably well on the suite of benchmark problems that we used for testing regardless of the GP variant employed. However, local search may get trapped in local optima, as we exemplified on the problem of frequency detection for a trigonometric function.
Our testing indicates improved generalization as the main benefit of this approach. In terms of training quality GP NLS and OSGP NLS rank behind MRGP (Multiple Regression Genetic Programming), However, on the test set OSGP NLS produces on average the best predictions among all GP variants.
As future work in this area we plan to investigate in more detail the correspondence between local search effort (in terms of gradient descent iterations) and algorithm performance within the Lamarckian framework. We conjecture that a more efficient search model can be attained by tuning the balance between local and global search effort.
Acknowledgements Open access funding provided by University of Applied Sciences Upper Austria. We would especially like to thank the researchers of the Epistasis Lab at the University of Pennsylvania for collecting the benchmark suite and providing their results for comparison as well as the accompanying analysis scripts.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Genetic Programming and Evolvable Machines (2020) 21:471-501 Commons licence, and indicate if changes were made.The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.To view a copy of this licence, visit http://creativecommons.org/ licenses/by/4.0/.