1 Introduction

Genetic programming (GP) [13] is a machine learning algorithm based on evolutionary search. It is mostly used to solve supervised learning problems, such as classification and regression. In classification, for instance, even though standard GP can achieve state-of-the-art performance on many binary problems, it is often unable to handle problems with multiple classes. To address this issue, a GP-based system called Multidimensional Multiclass Genetic Programming (M2GP) was proposed in [11], and then extended as Multidimensional Multiclass Genetic Programming with Multidimensional Populations, or simply M3GP [21].

M3GP evolves a transformation \(k:{\mathbb {R}}^p \rightarrow {\mathbb {R}}^d\) with \(p,d \in {\mathbb {N}}\) using a special tree representation, in essence mapping the p input features of the problem to a new feature space of size d. Afterward, M3GP applies the Mahalanobis distance classifier to measure the quality of each transformation based on accuracy. In other words, M3GP was proposed as a wrapper-based GP classifier [21, 23].

Symbolic regression, on the other hand, is a regression analysis that searches for the symbolic model that best fits a dataset. It is the most prominent application domain of GP, requiring very little prior knowledge of what the final solution ought to look like, such as an initial structure for the model. Instead, expressions are formed by randomly combining syntactic building blocks, such as mathematical operators, functions, constants and variables. Subsequently, new expressions are formed by recombining previous ones, using the principles of evolutionary algorithms [13]. However, while GP is very good at constructing and modifying the syntax of a model, it is not well equipped to perform parameter optimization in an efficient manner [29]. Syntax operations are done with standard genetic operators, but model parameters are not directly accessible to the commonly used search operators. Moreover, the overwhelming majority of GP systems generate complex non-linear models, while many scientific and engineering domains prefer to use models that are linear in the parameters, which are easier to analyze, interpret and tune.

This paper presents an extension of M3GP that applies it to symbolic regression. While the M3GP acronym does not hold for regression tasks, we keep the same name for clarity and future reference. Taken with previous works, this paper shows that M3GP can be seen as a general purpose GP-based algorithm, applicable to the two most common supervised learning tasks. The proposal is a hybrid approach toward symbolic regression. Instead of using GP to construct the solution model, we use the M3GP algorithm to transform the problem features and, subsequently, apply standard Multiple Linear Regression (MLR) to construct the final model. The approach implements a sequential memetic structure as defined in [4, 5], using Lamarckian inheritance and two local search methods. The first local searcher prunes the dimensionality of the models, while MLR is the second method used for parameter fitting.

The main contributions of this work can be stated as follows. First, M3GP is extended to construct regression models that are linear in the parameters, which are desirable in many real-world domains. Second, the proposal is a memetic approach that combines two local search methods with a GP global search, one of the few examples of a memetic GP [7, 29]. Third, experiments show that the method finds highly accurate models for synthetic and real-world benchmarks. Fourth, the algorithm has the desirable property of generating compact models, relative to other methods. Finally, evidence is presented that the performance of M3GP can be explained by the increase in the maximal mutual information in the transformed feature space relative to the original feature space.

The paper is organized as follows. Section 2 overviews related works and the original M3GP algorithm. Section 3 presents the proposed extensions to M3GP for symbolic regression. Section 4 describes the experimental study, providing the general setup and implementation details. Results are presented and discussed in Sect. 5, with comparisons to other relevant techniques. Finally, Sect. 6 concludes the paper and discusses future work.

2 Background and previous work

The model building strategy followed by M3GP is related to other non-GP methods. For instance, Multivariate Adaptive Regression Splines (MARS) [10] builds regression models by fitting basis functions to distinct intervals of the independent variables. Another related technique is the Fast Function Extraction (FFX) algorithm [16], a non-evolutionary (but GP inspired) technique based on pathwise regularized learning. FFX builds a large library of basis functions and uses elastic net regression to prune the final model. The present work includes an experimental comparison with MARS and FFX.

Other recent techniques are Kaizen Programming (KP) [18], Multiple Regression GP (MRGP) [2] and Evolutionary Feature Synthesis (EFS) [3]. KP is based on a set of principles for the continuous improvement in industry used by Japanese businesses. While KP is not an evolutionary method, it is a population-based algorithm where the entire population is used to construct the final solution, integrating ideas from cooperative co-evolution and GP search operators. EFS is similar to KP, but follows a more traditional evolutionary approach, where models are also built from multiple individuals in the population using LASSO regression. On the other hand, MRGP builds linear models where the predictors include the output at the root node of the GP tree, the original input variables, and all of the outputs from the internal tree nodes. MRGP uses Least-Angle regression (LARS) to perform both parameter fitting and feature selection. This approach, however, seems redundant since using all of the internal nodes in a tree can be expected to produce highly correlated predictors, which must then be filtered by the regularized regression.

2.1 M2GP–multidimensional multiclass

M2GP searches for a transformation of the input feature space, such that the transformed data is more easily classified. To achieve this, M2GP uses a tree representation that performs a mapping \(k:{\mathbb {R}}^p \rightarrow {\mathbb {R}}^d\) where k is the evolved model or tree. The root node of k works as a container, and each subtree \(ST_i\) stemming from k defines a new feature dimension, such that \(i=1, ..., d\), as summarized in Fig. 1. The genetic operators used in M2GP are subtree crossover and mutation, with the restriction that the root is never chosen as the crossing or mutation point. Fig. 2 shows an example of the type of clustering produced by M2GP. The original data, in this case in the \((x_1,x_2,x_3)\) space, is mapped into a new space (also 3-dimensional in this case) by a tree whose root node has three branches, each performing the mapping to one of the three new feature dimensions \((\widehat{x_1}, \widehat{x_2}, \widehat{x_3})\).
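The container-root representation can be illustrated with a minimal Python sketch. The nested-tuple encoding, the small function set and the names `eval_subtree` and `transform` are assumptions made for this example only; they are not taken from the M2GP or GPLAB implementations.

```python
import math

# Hypothetical function set, for illustration only.
FUNCTIONS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "sin": math.sin,
}

def eval_subtree(node, x):
    """Evaluate one subtree ST_i on a single input vector x."""
    if isinstance(node, tuple):            # internal node: (op, child, ...)
        op, *children = node
        return FUNCTIONS[op](*(eval_subtree(c, x) for c in children))
    if isinstance(node, str):              # terminal "x0", "x1", ... is an input feature
        return x[int(node[1:])]
    return float(node)                     # numeric constant

def transform(individual, x):
    """Apply k: R^p -> R^d; the root is only a container of d subtrees."""
    return [eval_subtree(st, x) for st in individual]

# A transformation with d = 3 subtrees over p = 3 input features.
k = [("add", "x0", "x1"), ("mul", "x1", "x2"), ("sin", "x0")]
x_hat = transform(k, [1.0, 2.0, 3.0])
```

Storing the individual as a plain list of subtrees makes the root a pure container, so the restriction that the root is never chosen as a crossover or mutation point holds by construction.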

Fig. 1 Special tree with d subtrees \(ST_i\), each one defining a new feature dimension constructed with the input data

Fig. 2 Example of a transformation produced by M2GP for a problem with three classes. Left: original data with a low classification accuracy. Right: transformed data, where classification achieves high accuracy. Large circles represent the centroids of each class, and each data point is marked by a different symbol depending on the class (dot, cross or small circle)

A drawback of M2GP is that the number of dimensions d is fixed at the beginning of the run using a greedy heuristic, and the algorithm does not add or remove dimensions to the evolved transformations during the search.

2.2 M3GP–M2GP with multidimensional populations

M3GP evolves a population of models that may contain transformations with a different number of dimensions, as shown in Fig. 3. M3GP achieved state-of-the-art performance compared to Random Forests, Random Subspaces and Multilayer Perceptron [21]. M3GP includes special genetic operators that can add or remove dimensions, and it is assumed that selection will be sufficient to discard individuals with the least useful dimensions while maintaining the best new features within the population. The main details of M3GP are the following.

Fig. 3 M3GP starts with a population of one dimensional transformations and through mutation (Fig. 4b, c) the search can evolve to a multidimensional population

Fig. 4 Three possible mutation types in M3GP. a Subtree mutation. b Add dimension. c Remove dimension

Fig. 5 Two possible crossover types in M3GP. a Standard subtree crossover. b Crossover of dimensions

Initial population M3GP starts the search with a random population where all the individuals have only one dimension, see Fig. 3. This ensures that the evolutionary search first evaluates simple solutions, before progressing to higher dimensional solutions. The initial population is generated using the Full method [13].

Mutation During the breeding phase, whenever mutation is the chosen genetic operator, one of three variants is performed with equal probability (see Fig. 4): a) standard subtree mutation, where a new randomly created tree replaces a randomly chosen branch (excluding the root node) of the parent; b) adding a randomly created tree as a new branch of the root node, adding one dimension to the parent tree; c) randomly removing a complete branch of the root node, removing one dimension from the parent transformation. However, when the root node of an individual has a single branch (a transformation with a single dimension), the dimension deletion mutation (c) is not considered. The intent of using the same probability for each mutation operator was to avoid a bias towards removing or adding dimensions. Selective pressure during the evolutionary search determines if the number of dimensions increases or is reduced. In M3GP mutation is the only way of adding and removing dimensions,Footnote 1 and therefore we have increased its probability of occurrence from 0.1 (used in M2GP [11]) to 0.5, to better explore the solution space in terms of number of dimensions. Preliminary results have shown that this relatively high mutation rate allows M3GP to produce better training and testing performance (lower error) [21].
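The three mutation variants can be sketched as follows, representing an individual as a Python list of subtrees (the root container) and each subtree as a nested tuple. This is only an illustrative sketch: the helper names, the tiny random tree generator and the way the mutation point is picked (uniform over dimensions, then a random walk down the tree) are assumptions of this example, not the GPLAB implementation.

```python
import random

def random_tree(rng, n_features, depth=2):
    """Grow a small random tree (hypothetical generator, for illustration only)."""
    if depth == 0 or rng.random() < 0.3:
        return f"x{rng.randrange(n_features)}"
    op = rng.choice(["add", "sub", "mul"])
    return (op, random_tree(rng, n_features, depth - 1),
                random_tree(rng, n_features, depth - 1))

def replace_random_branch(tree, new_branch, rng):
    """Variant (a): replace a randomly reached branch inside one dimension."""
    if not isinstance(tree, tuple) or rng.random() < 0.5:
        return new_branch
    op, *children = tree
    i = rng.randrange(len(children))
    children[i] = replace_random_branch(children[i], new_branch, rng)
    return (op, *children)

def mutate(individual, rng, n_features):
    """Apply one of the available mutation variants with equal probability."""
    variants = ["subtree", "add_dim"]
    if len(individual) > 1:                # variant (c) is skipped when d = 1
        variants.append("remove_dim")
    choice = rng.choice(variants)
    child = list(individual)               # copy; the parent is left untouched
    if choice == "subtree":
        i = rng.randrange(len(child))
        child[i] = replace_random_branch(child[i], random_tree(rng, n_features), rng)
    elif choice == "add_dim":
        child.append(random_tree(rng, n_features))
    else:
        del child[rng.randrange(len(child))]
    return child
```

Every variant touches exactly one dimension, so the number of dimensions of the child differs from that of the parent by at most one.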

Crossover Whenever crossover is chosen, one of two actions is performed with equal probability (see Fig. 5): a) standard subtree crossover, where a random node (excluding the root node) is chosen in each of the parents and the respective branches are swapped; b) swapping of dimensions, where a random branch of the root node is chosen in each parent and swapped. Notice that crossover (b) is a special case of crossover (a). It is included to encourage the search to exchange complete features instead of focusing only on modifying existing ones, particularly since crossover (a) can be seen as a form of subtree mutation given the randomness of the crossover point.
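Both crossover types can be sketched in a few lines of Python, again assuming that an individual is a list of nested-tuple subtrees. All function names here are hypothetical and the code is a sketch of the idea rather than the GPLAB operators.

```python
import random

def paths(tree, prefix=()):
    """Yield the path to every node of a nested-tuple tree."""
    yield prefix
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, prefix + (i,))

def get(tree, path):
    """Return the subtree found by following path from the top of tree."""
    for i in path:
        tree = tree[i]
    return tree

def put(tree, path, new):
    """Return a copy of tree with the subtree at path replaced by new."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (put(tree[i], path[1:], new),) + tree[i + 1:]

def all_nodes(individual):
    """All (dimension, path) pairs; the root container itself is excluded."""
    return [(d, p) for d, st in enumerate(individual) for p in paths(st)]

def subtree_crossover(a, b, rng):
    """Crossover (a): swap two randomly chosen branches."""
    (da, pa), (db, pb) = rng.choice(all_nodes(a)), rng.choice(all_nodes(b))
    sa, sb = get(a[da], pa), get(b[db], pb)
    ca, cb = list(a), list(b)
    ca[da] = put(a[da], pa, sb)
    cb[db] = put(b[db], pb, sa)
    return ca, cb

def swap_dimensions(a, b, rng):
    """Crossover (b): swap two complete branches of the root node."""
    ca, cb = list(a), list(b)
    i, j = rng.randrange(len(ca)), rng.randrange(len(cb))
    ca[i], cb[j] = cb[j], ca[i]
    return ca, cb

def crossover(a, b, rng):
    """Choose between the two crossover types with equal probability."""
    if rng.random() < 0.5:
        return subtree_crossover(a, b, rng)
    return swap_dimensions(a, b, rng)
```

In this encoding, crossover (b) corresponds to crossover (a) restricted to empty paths, which is the sense in which it is a special case.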

It is notable that all genetic operators (mutations and crossovers) in M3GP operate on a single feature dimension of the evolved transformations. This is done to induce a smoother fitness landscape, promoting incremental variations on the evolved models, variations that only add, remove or modify a single feature. Since fitness depends on all of the feature dimensions, this will increase the locality of the genotype-phenotype mapping. This is particularly true for individuals with a large number of dimensions, where a single dimension will have less influence on fitness than in smaller individuals, allowing the search to more effectively exploit each of the visited regions of solution space. Conversely, in standard GP subtree crossover and mutation can have very large effects on fitness, inducing a more rugged fitness landscape [20].

Pruning The best individual in the population goes through a pruning process at the end of each generation using a greedy local search. The pruning procedure removes a randomly chosen dimension and reevaluates the tree. If the fitness improves, the pruned tree replaces the original and goes through pruning of another randomly chosen dimension (without repetition). Otherwise, the pruned tree is discarded and the original tree goes through pruning of another dimension. The procedure stops after evaluating all dimensions. Pruning applied to all the population has a high computational cost, so we only apply it to the best solution found in the population at each generation. This operator defines the first local search method used in our sequential memetic structure [4, 5], while the second local search method is described below.
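The pruning procedure can be sketched as the following greedy loop. The sketch assumes a minimization fitness (such as RMSE) supplied by the caller as `fitness`, and it assumes that at least one dimension is always kept; both the function name `prune` and these details are illustrative choices, not taken from the original implementation.

```python
import random

def prune(individual, fitness, rng):
    """Greedy pruning: try removing each dimension once, in random order."""
    remaining = list(range(len(individual)))   # indices still in the tree
    best_fit = fitness([individual[i] for i in remaining])
    order = list(remaining)
    rng.shuffle(order)                         # random order, without repetition
    for idx in order:
        if len(remaining) == 1:                # assumption: keep one dimension
            break
        trial = [i for i in remaining if i != idx]
        trial_fit = fitness([individual[i] for i in trial])
        if trial_fit < best_fit:               # improvement: keep the pruned tree
            remaining, best_fit = trial, trial_fit
        # otherwise the pruned tree is discarded and the current one is kept
    return [individual[i] for i in remaining], best_fit
```

Each call costs at most d + 1 fitness evaluations for an individual with d dimensions, which is consistent with applying it only to the best individual of each generation.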

3 M3GP for symbolic regression

All previous works on M2GP and M3GP have focused on classification. This work extends M3GP to symbolic regression, which can be defined as follows. The goal is to search for the symbolic expression \(K^O(\varvec{\theta ^O}):{\mathbb {R}}^p \rightarrow {\mathbb {R}}\) that best fits a particular training set \({\mathbb {T}} = \{ ({\mathbf {x}}_1, y_1), \dots , ({\mathbf {x}}_n, y_n)\}\) of n input/output pairs with \({\mathbf {x}}_i \in {\mathbb {R}}^p\) and \(y_i \in {\mathbb {R}}\), stated as

$$\begin{aligned} (K^{O},\varvec{\theta ^O}) \leftarrow \underset{K \in {\mathbb {G}};\, \varvec{\theta }\in {\mathbb {R}}^m}{\arg \min } \ f ( K({\mathbf {x}}_i,\varvec{\theta }), y_i) \quad \text {with} \quad i=1, \dots , n \ , \end{aligned}$$

where \({\mathbb {G}}\) is the solution or syntactic space defined by the functions and terminals, f is the fitness function based on the difference between a program’s output \(K({\mathbf {x}}_i,\varvec{\theta })\) and the desired output \(y_i\), and \(\varvec{\theta }\) is a particular parametrization of the symbolic expression K, assuming m real-valued parameters. GP attempts to discover the model structure and parameters that best fit the data.

On the other hand, conventional regression techniques optimize a pre-specified model structure, for example by applying Multiple Linear Regression (MLR) to model the relationship between input and output variables by fitting a linear model to the observed dataset, as given by

$$\begin{aligned} y_i =\beta _0 + \beta _1 x_{i,1} + \beta _2 x_{i,2} + \dots + \beta _p x_{i,p} + \epsilon _i \quad \text {for} \quad i=1,2,\dots ,n, \end{aligned}$$

where the \(\beta _j\) are the model parameters and \(\epsilon _i\) represents the residual error for the i-th training instance. However, MLR applied directly to the original features produces poor quality solutions compared with state-of-the-art algorithms (as shown in Sect. 5). Since the model structure is fixed, in general the only way to improve the accuracy of the model is to extract better predictive features from the original data.

For this purpose, we use the transformations k evolved by M3GP to construct a new dataset \(\{\widehat{{\mathbf {x}}_i},y_i\}\), where \(\widehat{{\mathbf {x}}_i} = k({\mathbf {x}}_i)\) and \(\widehat{{\mathbf {x}}_i} = (\widehat{x_{i,1}}, ..., \widehat{x_{i,d}})\). Notice that the j-th element \(\widehat{x_{i,j}}\) of \(\widehat{{\mathbf {x}}_i}\) is generated by the j-th subtree \(ST_j\) of k, as depicted in Fig. 1. This new dataset is used to build an MLR model, where model parameters \((\beta _0, \dots , \beta _d)\) are computed using QR decomposition to solve the least squares fitting problem. This represents the second local search method of the sequential memetic structure used in the proposed M3GP algorithm [4, 5].

The fitness function is based on the prediction accuracy of the MLR model for each M3GP transformation given by the Root Mean Square Error (RMSE), thus posing a minimization problem.Footnote 2 During testing the transformation tree k and the model parameters \((\beta _0, \dots , \beta _d)\) are used to predict the output on unseen data.
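The fitness evaluation can be sketched as follows: fit an MLR model on the transformed features through a QR factorization and return the training RMSE. To keep the sketch dependency-free it uses a small modified Gram-Schmidt QR written in pure Python and assumes the design matrix has full column rank. This is a stand-in chosen for the example; the function names are hypothetical and a practical implementation would call a numerical library routine instead.

```python
import math

def qr_least_squares(A, y):
    """Solve min ||A b - y|| with a modified Gram-Schmidt QR factorization.

    Assumes A (a list of rows) has full column rank.
    """
    m, n = len(A), len(A[0])
    Q = [[A[i][j] for i in range(m)] for j in range(n)]   # columns of A
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        R[j][j] = math.sqrt(sum(v * v for v in Q[j]))
        Q[j] = [v / R[j][j] for v in Q[j]]
        for k in range(j + 1, n):
            R[j][k] = sum(a * b for a, b in zip(Q[j], Q[k]))
            Q[k] = [b - R[j][k] * a for a, b in zip(Q[j], Q[k])]
    qty = [sum(q * t for q, t in zip(Q[j], y)) for j in range(n)]
    beta = [0.0] * n
    for j in reversed(range(n)):           # back substitution on R beta = Q^T y
        s = sum(R[j][k] * beta[k] for k in range(j + 1, n))
        beta[j] = (qty[j] - s) / R[j][j]
    return beta

def mlr_rmse(X_hat, y):
    """Fit beta_0..beta_d on transformed features and return (beta, RMSE)."""
    A = [[1.0] + list(row) for row in X_hat]   # leading 1 carries beta_0
    beta = qr_least_squares(A, y)
    preds = [sum(b * a for b, a in zip(beta, row)) for row in A]
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y))
    return beta, rmse
```

At test time the same `beta` would be applied to the transformed test instances, so only the tree k and the d + 1 coefficients need to be stored.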

Table 1 Symbolic regression benchmarks, as suggested in [17]

Other regression techniques could have been employed instead of MLR, such as ridge regression, LARS, LASSO, or elastic net regression. However, we choose the simplest possible approach for the following reasons. First, classification results with M3GP were found to be the best when a simple linear classifier was used instead of more complex or non-linear methods. Second, we feel that the evolutionary process should be used to filter out non-useful features, instead of relying on regularization techniques that might bias the search in unexpected ways.

4 Experiments

The experiments evaluate M3GP based on the following criteria. First, learning performance given by the RMSE on the training data for the best model found. Second, generalization is evaluated by the RMSE of the best model over an unseen test set. Third, the size of the evolved transformations, since small and parsimonious solutions are preferred over large and complex models [25]. In this case, we consider two measures of size. On the one hand, model size given by the number of tree nodes.Footnote 3 On the other hand, the number of new feature dimensions d in the evolved transformations.

The first two criteria are fairly common in machine learning tasks. Obviously, a good learning algorithm should fit the training data as well as possible, without failing to generalize on test data. However, the third criterion is more important for GP-based methods that explicitly generate symbolic expressions that are amenable to human interpretation. Smaller models are preferred over larger models if their predictive accuracy is equivalent. This is particularly important for the M3GP approach since experts in many domains of science and engineering prefer models that are linear in the parameters, which are easier to analyze and interpret. Since M3GP starts from simple one-dimensional transformations, we also investigate how the search evolves relative to the number of feature dimensions.

As the results will show, M3GP exhibits the ability to learn transformations that generalize to unseen test instances, while producing relatively parsimonious solutions. To better understand how M3GP reaches this level of performance, another evaluation criterion is included in the experimental analysis. A normalized version of mutual information for the continuous case [12] is used to measure the functional dependency between an input feature X and the expected output Y. Let \((X, Y)\) be a normally distributed random vector with Pearson’s correlation coefficient \(\rho \). The mutual information is calculated by \(I(X;Y)=-\frac{1}{2}\log (1-\rho ^2)\). Then we can define a normalized version of mutual information [12] as

$$\begin{aligned} I_{c}^{*}(X;Y):=\sqrt{1-\exp [-2I(X;Y)]} \ \end{aligned}$$

The values of \(I_{c}^{*}(X;Y)\) lie in the range [0, 1], where values closer to 1 indicate a stronger functional dependency, and are equal to \(|\rho |\) if \((X, Y)\) is normally distributed with correlation coefficient \(\rho \).
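The last property follows directly by substituting the Gaussian expression for \(I(X;Y)\) into the definition of \(I_{c}^{*}(X;Y)\). The short derivation below is added for illustration and uses only the two formulas given above:

$$\begin{aligned} \exp [-2I(X;Y)] = \exp [\log (1-\rho ^2)] = 1-\rho ^2 \quad \Rightarrow \quad I_{c}^{*}(X;Y) = \sqrt{1-(1-\rho ^2)} = |\rho | . \end{aligned}$$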

In our analysis, we consider the original feature (x) and the transformed feature \((\widehat{x})\) that exhibit the highest mutual information relative to the target variable y, to compare the data before and after the transformation applied by M3GP. The hypothesis is that the maximal mutual information of the transformed features will increase relative to that of the original input features, which is evaluated experimentally.
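The comparison can be illustrated with a small Python sketch. It assumes the Gaussian closed form given above, under which the normalized mutual information reduces to the absolute Pearson correlation; the estimator actually used in the experiments may differ, and the function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

def normalized_mi_gaussian(xs, ys):
    """I_c^*(X;Y) computed from rho under the Gaussian assumption."""
    rho = pearson(xs, ys)
    rho2 = min(rho * rho, 1.0 - 1e-12)     # guard against log(0) when |rho| = 1
    mi = -0.5 * math.log(1.0 - rho2)
    return math.sqrt(1.0 - math.exp(-2.0 * mi))

def max_normalized_mi(columns, y):
    """Highest normalized mutual information over a set of feature columns."""
    return max(normalized_mi_gaussian(col, y) for col in columns)

# Toy example: y = x^2 is uncorrelated with x but fully explained by x^2.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [v * v for v in x]
raw_score = max_normalized_mi([x], y)
transformed_score = max_normalized_mi([[v * v for v in x]], y)
```

In this toy case the raw feature carries no linear information about the target, while the transformed feature does, which is the kind of increase the hypothesis refers to.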

To measure the above performance criteria, we perform two groups of experiments: (1) synthetic symbolic regression benchmarks and (2) real-world problems. For the former group, we compare M3GP with other standard GP-based approaches. On the other hand, the latter group of experiments, with more difficult problems, is used to compare M3GP with more advanced GP techniques and other non-GP methods. In both groups we include a linear model generated with MLR, to illustrate the importance of the data transformation evolved by M3GP, which also relies on MLR but applies it to the transformed feature space instead of the original problem features.

4.1 Synthetic benchmark problems

We evaluate the proposed approach using nine synthetic regression benchmarks, described in [17] and summarized in Table 1. For comparison, we use several other GP algorithms, namely: standard tree-based GP [13]; neatGP [25], a bloat free GP that uses topological speciation and fitness sharing to protect novel topologies and promote incremental evolution; and neatGP-SC [25], a variant of neatGP that uses standard crossover. neatGP is included in the comparisons since it controls code growth effectively, allowing us to evaluate the ability of M3GP to generate compact models.

4.2 Real-world problems

We use five real-world regression problems, summarized in Table 2. These problems represent challenging modeling tasks from various domains that continue to receive interest in recent literature, as summarized by the references provided in Table 2. In this case, we compare M3GP with a variety of approaches. The first group of methods consists of linear regression algorithms, namely: Robust Regression, Quadratic Regression and Linear regression with MLR. The second group includes three state-of-the-art algorithms, namely: (1) MARS [10]; (2) the FFX algorithm [16]; and (3) Geometric Semantic GP (GSGP) [20]. GSGP uses search operators that induce a unimodal fitness landscape by explicitly considering and bounding the effect that the operators have on program output (what is often referred to as semantics). This last method is included since it is considered a state-of-the-art GP approach for regression.

4.3 Implementation and experimental setup

The implementations of neatGP,Footnote 4 neatGP-SC and M3GP are based on GPLAB, an open source GP toolbox for Matlab.Footnote 5 During evaluation, an error threshold of \(e^{-5}\) is used: once the threshold is reached (which only happened on the synthetic problems), all solutions with an error below it are considered to be equivalent. This is important for the selection mechanism used, which is Lexicographic Parsimony Pressure [14]. In this method, when comparing equivalent solutions the smallest one is preferred.
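The interaction between the error threshold and Lexicographic Parsimony Pressure can be sketched as a comparison between two candidates, each described by an (error, size) pair. The function names and the tournament loop are assumptions of this sketch, and the threshold is passed in by the caller rather than fixed to a particular value.

```python
import random

def better(a, b, threshold):
    """Return the preferred of two (rmse, size) pairs.

    Errors at or below the threshold are treated as equivalent, in which
    case the smaller model wins; otherwise the lower error wins.
    """
    err_a, size_a = a
    err_b, size_b = b
    equivalent = err_a == err_b or (err_a <= threshold and err_b <= threshold)
    if equivalent:
        return a if size_a <= size_b else b
    return a if err_a < err_b else b

def tournament(population, k, threshold, rng):
    """Pick k random candidates and return the preferred one."""
    contestants = rng.sample(population, k)
    winner = contestants[0]
    for candidate in contestants[1:]:
        winner = better(winner, candidate, threshold)
    return winner
```

Once several individuals fall below the threshold, selection is driven by size alone, which gives the search a pressure toward smaller transformations without adding a penalty term to the fitness function.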

Table 2 Symbolic regression real-world problems

For the synthetic benchmarks, Table 1 specifies the function set used on each problem as suggested in [17], while for the real-world problems the function set used is specified in Table 2. For the GP based methods, all other parameters are set as in Table 3; for specific configuration details see [21]. While we do not perform explicit parameter tuning of the algorithms, recent work has shown that in general GP is extremely robust to parameter initialization [24], and a series of informal tests confirm the same for M3GP. For all other methods, we used the parameters suggested in: [15] for GP; [6] for GSGP; and [25] for neatGP.

For the real-world problems, the terminal set contains the input variables and Ephemeral Random Constants (ERC) in the range [0, 1], except for GSGP [6]. The Matlab implementations of Robust, Quadratic and Linear regression were used. For MARS,Footnote 6 FFXFootnote 7 [16] and GSGPFootnote 8 [6] the algorithms were used considering the default library settings as recommended by the original authors. For each of the original datasets we have constructed 30 different randomly generated partitions, with a training and test data ratio of 70:30, each used in 30 independent runs.

Table 3 Running parameters for the GP-based algorithms

4.4 Comments on complexity

In this work, the issue of algorithm complexity and run-time is not studied. Indeed, for GP it is notoriously difficult to provide bounds on algorithm complexity. Recent works have been able to study GP complexity and provide tighter theoretical bounds than previous results. However, the results are based on overly simplified theoretical problems [8], which do not provide a realistic depiction of the types of problems studied in state-of-the-art symbolic regression. For now, it is important to state that M3GP is basically a tree-based GP algorithm, making the efficiency and execution time comparable to any other tree-based approach. The only additional cost incurred by the M3GP search is the MLR applied to the new feature space, which as stated above is done with QR decomposition instead of the more expensive singular value decomposition (SVD). The time complexity of QR decomposition depends on the algorithm used, but is in the order of \(O(mn^2)\), where in our case m is the number of data points in the training set and n is the number of new feature dimensions. As will be seen, n will be problem dependent but relatively small, below 10 on most synthetic problems and below 50 on real-world problems.

Fig. 6 Training and testing RMSE on each synthetic benchmark

Table 4 Comparison of the median training RMSE for synthetic benchmarks

Table 5 Comparison of the median testing RMSE for synthetic benchmarks

Fig. 7 Size of the best solution found on each synthetic benchmark

5 Results

This section is divided into two parts. First we summarize the results on the synthetic benchmarks. Afterward, we summarize the results on the real-world problems. In each case, results are organized using the same structure: First, performance in terms of training and testing RMSE; second, performance in terms of model size and dimensionality; and third, mutual information analysis of the original features and the transformed features.

5.1 Statistical comparisons

Statistical comparisons are done with a \(1 \times N\) formulation, where a single control method (M3GP) is compared with N algorithms. The Friedman test and the Bonferroni-Dunn correction of the p values are used for each comparison. In all tests, the null hypothesis (that the medians of two groups are equal) is rejected at the \(\alpha = 0.01\) significance level.

Fig. 8 Total feature dimensions in the best solution found by M3GP for benchmark regression problems. The original numbers of attributes are Koza (1), Nguyen-3 (1), Nguyen-5 (1), Nguyen-7 (1), Nguyen-10 (2), Keijzer (1), Korns (5), Vladislavleva (2), Pagie (2)

Fig. 9 Frequency histogram of the number of dimensions of each transformation in the population when the error threshold (\(e^{-5}\)) is reached, and how these progress in subsequent generations (shown in the caption of the plots after the problem name). The \(*\) mark represents the number of dimensions of the best solution and on the inset plot the \(\times \) mark represents the fitness of the best solution. First row shows Nguyen-7, second row shows Nguyen-10 and the third row shows Vladislavleva-1

5.2 Synthetic benchmarks

5.2.1 Training and testing RMSE

Figure 6 presents boxplot comparisons of training and testing fitness (RMSE) of each GP variant (GP [13], neatGP [25], neatGP-SC [25] and M3GP) and MLR. These plots show the minimum, first quartile, median, third quartile, and maximum values over all thirty runs of each algorithm. M3GP achieves the best overall performance, and since MLR by itself exhibits the worst performance, it is clear that the transformations evolved by M3GP are quite powerful. The statistical comparisons are presented in Table 4 and Table 5, where a (*) indicates that the null hypothesis cannot be rejected. M3GP achieves good training performance, with the median error at or very close to zero. The exception is Korns-12, but M3GP still exhibits competitive performance relative to other GP variants.

In the case of testing performance, the median error values are also very close to zero, showing no effects of overfitting and very small variance. The only exceptions are Korns-12 and Vladislavleva-1. In the case of Korns-12 neatGP obtained the best result, with M3GP performing only 4% worse in terms of median performance. Nonetheless, it is evident that M3GP has difficulty with this problem, given the large variance in testing performance. The Korns-12 function (see Table 1) is non-linear with respect to some of its parameters. This explains why M3GP has more difficulty fitting this problem, since M3GP is intended to produce models that are linear in the parameters.

The worst testing result appears on Vladislavleva-1, where M3GP produces a high error and variance. The reason was found to be the use of exp and \(-exp\) in the function set. Changing the function set to \(\{+, -, \times , \div , sin, cos, log\}\) produced a median training RMSE of 0.002 and a median testing RMSE of 1.23. Reducing the generations to 25 (from 100) produced a median training error of 0.02 and a testing error of 0.12, substantially reducing overfitting. These results are better than applying MLR on the original data, and similar to neatGP. This suggests that the function sets used in GP are method-dependent, since GP, neatGP and neatGP-SC did not overfit with the original function set.

5.2.2 Evolution of size and dimensions

The solutions evolved by M3GP are not large, as seen in Fig. 7, which shows a boxplot comparison of the total number of nodes in the best solution found by each method.Footnote 9 M3GP generates models with a relatively small number of nodes, which are close in size to the solutions obtained by the bloat-free methods taken into account. The only exceptions are Nguyen-10 (with high variance), Korns-12 and Vladislavleva-1, where the median number of nodes is higher than for other GP variants. Given that on the latter two problems M3GP exhibited its worst testing performance, this might provide a useful indicator of when the M3GP search is struggling to find a general model.

Fig. 10 Average frequency histograms of the number of dimensions of each transformation in the population at the end of the run, computed over 30 runs. The \(*\) mark represents the median number of dimensions

Fig. 11 The maximal normalized mutual information computed over the training and testing partitions on the synthetic benchmarks. Raw represents the original data, and M3GP represents the transformed data. a Training. b Testing

Fig. 12 Performance comparison of training and testing RMSE for concrete, energy efficiency cooling and energy efficiency heating problems. a Concrete. b Energy efficiency cooling. c Energy efficiency heating

Fig. 13 Performance comparison of training and testing RMSE for Tower and Yacht problems. a Tower. b Yacht hydro dynamics

The dimensionality of the best solution found by M3GP is shown in Fig. 8. The complexity of the models depends on the characteristics of the problem. On the large majority of problems the variance is low, while the median number of dimensions is less than 10. However, on the two problems where M3GP exhibits a worse performance (Korns-12 and Vladislavleva-1 with the function set in Table 1), the algorithm increases the number of dimensions and shows higher variability across runs. This information could also be used to stop the search and prevent overfitting, a study left as future work.

If we take a closer look at how the number of dimensions evolves across generations, we observe three different behaviors. First, on problems Nguyen-3, Nguyen-7, Keijzer-6 and Pagie-1 the number of dimensions exhibits a stable distribution across all generations after reaching the error threshold (\(e^{-5}\)). In the first row of Fig. 9 we present a snapshot of the dimensionality of the transformations and fitness of the best solution (\(\times \) mark on the inset plot) at different generations during the M3GP search on the Nguyen-7 problem (similar results were seen on the other problems). The figure presents a frequency histogram of the number of dimensions of each transformation in the population when a perfect fitness (at or below the \(e^{-5}\) threshold) is reached, and the same at 75 and 150 generations for a single illustrative run.Footnote 10 On these problems the number of dimensions used in the population stabilizes early in the run and does not change much over time. The plots also show the number of dimensions of the best solution found so far (\(*\) mark). By the end of the run the population surrounds this point.

For the second behavior a similar analysis has been performed, shown in the second row of Fig. 9 for the Nguyen-10 problem. In this case, the number of dimensions increases at the beginning of the run, until the error threshold is reached by a particular solution (the leftmost plot). After this point the distribution of dimensions progressively shifts towards fewer dimensions (simplifying existing solutions), following the best solution in the population (\(*\) mark). This behavior is mostly attributed to the mutation operator that eliminates dimensions, working in conjunction with the Lexicographic Parsimony Pressure [14] applied in the tournament selection. It shows that, on the Nguyen-10, Koza-1 and Nguyen-5 problems, M3GP can automatically reduce the size of the evolved transformations without requiring an explicit penalty term in the fitness function. While M3GP showed the best performance on these problems, neatGP-SC produced almost equivalent performance in both training and testing. This is an important correlation, since neatGP is a bloat control GP method that tends to favor smaller solutions, which is consistent with the tendency of M3GP to reduce the number of dimensions used by the evolved transformations.
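The interaction between tournament selection and lexicographic parsimony pressure described above can be sketched as follows. This is a minimal illustration and not the implementation used in the experiments: the dictionary fields `rmse` and `size`, the tournament size `k`, and the function name are assumptions introduced here. The idea is that the contestant with the lowest error wins, and size is consulted only to break ties.

```python
import random

def lexicographic_tournament(population, k=3, rng=random):
    """Select one individual from k random contestants.

    Contestants are compared first by error (lower is better); the size of
    the solution is used only to break ties, so no penalty term is added
    to the fitness itself. Field names are hypothetical.
    """
    contestants = rng.sample(population, k)
    return min(contestants, key=lambda ind: (ind["rmse"], ind["size"]))

# Toy population: "a" and "b" have equal error, but "b" is smaller.
population = [
    {"name": "a", "rmse": 0.10, "size": 7},
    {"name": "b", "rmse": 0.10, "size": 4},
    {"name": "c", "rmse": 0.25, "size": 2},
]

# With k equal to the population size every individual competes.
winner = lexicographic_tournament(population, k=3)
print(winner["name"])
```

Because the comparison key is a tuple, a smaller solution never beats a more accurate one; it only wins among equally accurate contestants, which is what lets smaller transformations spread once the error threshold has been reached.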

The third behavior appeared on Korns-12 and Vladislavleva-1, presented in the last row of Fig. 9 for the latter. On these problems M3GP shows a clear tendency towards increasing the dimensionality of the evolved solutions over the entire run. Interestingly, on these problems M3GP does not achieve the best testing performance, clearly indicating the presence of bloat since the increase in dimensions did not improve performance.

Finally, Fig. 10 shows the average frequency histograms of the evolved populations at the end of the run for each benchmark. The plots show that the populations converge to a unimodal distribution on most problems, with bimodal distributions appearing on Korns-12 and Vladislavleva-1. Once again, the atypical behavior is exhibited on the problems where M3GP struggles to generalize. On the remaining problems, M3GP evolves populations that show a unimodal distribution centered around the dimensionality of the best solutions found, with decreasing tails. Some works have previously recommended that similar distributions should be explicitly enforced during a GP-based search to control code growth [25]. Hence, the observed behavior can explain the ability of M3GP to evolve compact models.
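A frequency histogram of this kind, together with a simple check for a single peak, can be computed as in the sketch below. The function names and the toy populations are assumptions made for illustration. The unimodality test is deliberately naive (frequencies must rise to one peak and then fall), so it would need smoothing before being applied to noisy histograms.

```python
from collections import Counter
from statistics import median

def dimension_histogram(population_dims):
    """Relative frequency of each number of dimensions in a population."""
    counts = Counter(population_dims)
    total = len(population_dims)
    return {d: counts[d] / total for d in sorted(counts)}

def is_unimodal(hist):
    """True if the frequencies rise to a single peak and then fall.

    Dimensions missing between the smallest and largest observed values
    are treated as having frequency zero.
    """
    lo, hi = min(hist), max(hist)
    freqs = [hist.get(d, 0.0) for d in range(lo, hi + 1)]
    i, n = 0, len(freqs)
    while i + 1 < n and freqs[i] <= freqs[i + 1]:
        i += 1  # climb towards the peak
    while i + 1 < n and freqs[i] >= freqs[i + 1]:
        i += 1  # descend from the peak
    return i == n - 1

# Hypothetical populations: number of dimensions of each individual.
compact = [3, 4, 4, 5, 5, 5, 6, 6, 7]
split = [2, 2, 2, 3, 6, 6, 6, 7]

print(is_unimodal(dimension_histogram(compact)), median(compact))
print(is_unimodal(dimension_histogram(split)))
```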

5.2.3 Mutual information analysis

The mutual information analysis for the training and testing partitions is summarized in Fig. 11. The plots show improvement of the maximal mutual information, before (raw) and after the M3GP transformation. In all cases M3GP increases the maximal mutual information. While the enhancement varies for each problem, even the smaller improvements can make a difference, based on the performance difference observed between M3GP and MLR on most problems. However, notice that the increase in mutual information is not necessarily proportional to the improvement in predictive accuracy.
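To make the before and after comparison concrete, the sketch below estimates a normalized mutual information between each feature and the output, and takes the maximum over features. It is only an approximation of the analysis: the equal-width binning, the normalization by \(\sqrt{H(X)H(Y)}\), the number of bins and all function names are assumptions, and the estimator used in the paper may differ.

```python
import math
from collections import Counter

def _bin(values, bins):
    """Assign each value to one of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]

def _entropy(labels):
    """Shannon entropy (in nats) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def normalized_mi(x, y, bins=4):
    """Binned mutual information I(X;Y), normalized by sqrt(H(X) H(Y))."""
    bx, by = _bin(x, bins), _bin(y, bins)
    hx, hy = _entropy(bx), _entropy(by)
    mi = hx + hy - _entropy(list(zip(bx, by)))
    denom = math.sqrt(hx * hy)
    return mi / denom if denom > 0 else 0.0

def maximal_mi(features, y, bins=4):
    """Largest normalized MI between any single feature column and y."""
    return max(normalized_mi(col, y, bins) for col in features)

# Toy problem: y = x^2. The raw feature x is compared against a
# hypothetical evolved feature x*x.
x = [i / 2 for i in range(-8, 9)]
y = [v * v for v in x]
raw = maximal_mi([x], y)
transformed = maximal_mi([[v * v for v in x]], y)
print(raw, transformed)
```

In this toy case the transformed feature determines the output exactly, so its normalized score reaches the maximum, while the raw feature scores lower because the sign of x carries no information about y.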

Table 6 Comparison of the median training RMSE for real-world problems
Table 7 Comparison of the median testing RMSE for real-world problems

5.3 Real-world problems

5.3.1 Training and testing RMSE

Figures 12 and 13 present boxplot comparisons of RMSE (fitness) between the different regression methods: Robust, Quadratic, MLR, MARS, FFX, GSGP and M3GP. The statistical analysis is presented in Tables 6 and 7, where a (*) indicates that the null hypothesis of the Friedman test cannot be rejected at the \(\alpha =0.01\) significance level. On these problems, M3GP obtained the best performance (lowest median error) on 4 of 5 problems in terms of training, and on 3 of 5 problems for testing.
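For reference, the quantity being compared (the median RMSE over independent runs) can be computed as in the following sketch. The function names and the toy predictions are illustrative assumptions, not data from the experiments.

```python
import math
from statistics import median

def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def median_rmse(runs):
    """Median RMSE over runs, each given as a (y_true, y_pred) pair."""
    return median(rmse(t, p) for t, p in runs)

# Three hypothetical runs on the same targets.
targets = [1.0, 2.0, 3.0]
runs = [
    (targets, [1.0, 2.0, 3.0]),  # perfect fit, RMSE 0
    (targets, [2.0, 3.0, 4.0]),  # constant offset of 1, RMSE 1
    (targets, [3.0, 4.0, 5.0]),  # constant offset of 2, RMSE 2
]
print(median_rmse(runs))
```

The median is used rather than the mean because a single divergent run would otherwise dominate the summary.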

The Tower dataset is the only problem where M3GP did not achieve the best training performance; there, Quadratic regression obtained the best result. For this reason, M3GP was also run on this problem with quadratic regression instead of MLR. In this configuration M3GP obtains a median training RMSE of 13.38, outperforming quadratic regression. However, the median testing RMSE is 25.85, which is higher than that of M3GP with MLR, indicating that the quadratic model pushes the algorithm to overfit the data.

In terms of testing performance, M3GP and FFX obtain the best results with no statistical difference on two problems (Energy Efficiency Heating and Yacht Hydro Dynamics), FFX obtains the best result with statistical significance on Tower and Concrete, and M3GP obtains the best result with statistical significance on the Energy Efficiency Cooling problem.

5.3.2 Evolution of size and dimensions

M3GP, MARS and FFX are similar in the sense that they build linear models using a transformed feature space. A comparison between the best model found by each algorithm is reported in Fig. 14 for size and Fig. 15 for the number of new feature dimensions in the model, also referred to as basis functions. Tables 8 and 9 present a numerical comparison, with MARS achieving the smallest size and the fewest dimensions on all problems. However, MARS has worse results in terms of RMSE. FFX produces the largest solutions in terms of total dimensions and size, which makes its models harder to interpret. M3GP, on the other hand, produces significantly smaller solutions (size and dimensions) than FFX, and the difference is substantial on all problems. This is noteworthy since M3GP performs competitively in terms of testing and training RMSE. In other words, M3GP exhibits a useful compromise between quality and interpretability.
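The two measures compared here can be illustrated with a small sketch. It assumes that a transformation is stored as a list of expression trees, one per dimension, that trees are nested tuples of the form `(operator, child, ...)`, and that size is the total number of nodes. These representation choices are assumptions for illustration and may not match how size is counted in Tables 8 and 9.

```python
def tree_size(node):
    """Number of nodes in an expression tree.

    Internal nodes are tuples (operator, child, ...); anything else
    (a variable name or a constant) is a leaf.
    """
    if isinstance(node, tuple):
        return 1 + sum(tree_size(child) for child in node[1:])
    return 1

def model_summary(transformation):
    """Dimensions (basis functions) and total size of a transformation."""
    return {
        "dimensions": len(transformation),
        "size": sum(tree_size(tree) for tree in transformation),
    }

# Hypothetical transformation with three basis functions.
transformation = [
    ("mul", "x1", "x2"),          # x1 * x2       -> 3 nodes
    ("sin", ("add", "x3", 2.0)),  # sin(x3 + 2.0) -> 4 nodes
    "x4",                         # x4            -> 1 node
]
summary = model_summary(transformation)
print(summary)
```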

Fig. 14

Size of the symbolic expressions of the best solutions found by M3GP, MARS and FFX on the real-world regression problems

Fig. 15

Basis functions (dimensions) of the best solutions for M3GP, MARS and FFX for real-world regression problems. The original attributes (dimensions) for each problem are: Concrete (8), Energy Efficiency Cooling (8), Energy Efficiency Heating (8), Tower (25), Yacht Hydro Dynamics (6)

Fig. 16

Average frequency histograms of the number of dimensions of each transformation in the population at the end of the run, computed over 30 runs. The \(*\) mark represents the median of dimensions. a Concrete. b EE cooling. c EE heating. d Tower. e Yacht

Table 8 Median size of symbolic expressions for M3GP, MARS and FFX
Table 9 Median number of dimensions (basis functions) for M3GP, MARS and FFX

Figure 16 shows the average frequency histograms of the evolved populations at the end of the run. The histograms show unimodal distributions, except on Yacht, with a tendency towards increasing the dimensionality of the evolved solutions. Nonetheless, solutions are still more parsimonious compared to FFX, the only method with comparable performance.

5.3.3 Mutual information analysis

The mutual information results for the real-world problems are summarized in Fig. 17, which shows that the data transformation was able to increase the maximal mutual information. The total increase, however, is not proportional to the predictive accuracy of the model, as was also observed on the synthetic problems. For instance, on the Energy Efficiency problems there is only a small improvement in mutual information, while the difference in testing performance between MLR and M3GP is almost double. Conversely, on the Concrete problem the maximal mutual information doubles (Fig. 17) while the relative difference in RMSE between M3GP and MLR is smaller (Fig. 12).

Fig. 17

The maximal normalized mutual information computed over the training and testing partitions on the real-world problems. Raw represents the original data, and M3GP represents the transformed data. a Training. b Testing

6 Conclusions

This paper presents the M3GP algorithm for symbolic regression, an approach to evolve multidimensional data transformations for supervised learning tasks. Experimental results show that M3GP is effective at finding accurate models. Moreover, the method compares favorably with other learning techniques, including GP-based methods (such as neatGP and GSGP) and non-evolutionary approaches (such as FFX and MARS).

Therefore, in problem domains where the goal is to find models that are linear in the parameters, M3GP can effectively be used to evolve multidimensional transformations that enhance the predictive capabilities of the original problem features. Results show that M3GP is able to evolve new problem features that increase the mutual information of the predictive variables with respect to the desired output. In many cases this increase is substantial, which helps explain the predictive accuracy exhibited by the models produced by M3GP. Focusing on the real-world problems, M3GP can produce highly accurate models (relative to FFX, MARS and GSGP) that are one order of magnitude smaller, in terms of dimensions and total number of operations, than those found by the most competitive method (FFX). However, when M3GP models are larger than those generated by other methods, such as standard GP, it seems reasonable to assume that the evolved models might not generalize. In summary, when compact and accurate linear models are sought, M3GP has been shown to be a useful modeling method, based on a comprehensive experimental evaluation on synthetic and real-world datasets.

Future work will consider the following. The proposed M3GP algorithm presents a semantic memetic structure with two local search methods, namely the dimensional pruning mechanism and MLR to fit model parameters. However, the approach could be enhanced by adopting the use of parallel memetic structures [4, 5]. For instance, the algorithm could employ several parameter estimation algorithms, such as LARS, LASSO, stochastic gradient descent or gradient-free methods such as CMA-ES or Differential Evolution. Moreover, other dimensionality reduction techniques could be integrated besides the pruning heuristic, such as PCA or kernel PCA.