Data Set Information
Following the same procedure described in Yeh (1998), experimental data from 17 different sources were used to check the reliability of the strength model. Data were assembled for concrete containing cement plus fly ash, blast furnace slag, and superplasticizer. A determination was made to ensure that these mixtures were a fairly representative group for all of the major parameters that influence the strength of HPC and present the complete information required for such an evaluation. The dataset is the one that was used in Yeh and Lien (2009), Chou et al. (2010), Cheng et al. (2013, 2014) and Castelli et al. (2013) and it consists of 1028 observations and 8 variables. Some facts about those variables are reported in Table 1.
For each of the studied computational methods, 30 independent executions (runs) were performed, using a different partitioning of the dataset into training and test set. More particularly, for each run 70% of the observations were selected at random with uniform distribution to form the training set, while the remaining 30% form the test set. The parameters used are summarized in Table 2. Besides those parameters, the primitive operators were addition, subtraction, multiplication, and division protected as in Koza (1992). The terminal symbols included one variable for each feature in the dataset, plus the following numerical constants: − 1.0, − 0.75, − 0.5, − 0.25, 0.25, 0.5, 0.75, 1.0. Parent selection was done using tournaments of size 5 for GSGP, and tournaments of size 10 for each layer of the nested selection for NAGP. The same selection as in NAGP was also performed in the first 50 generations of NAGP_50. Crossover rate was equal to zero (i.e., no crossover was performed during the evolution) for all the studied methods. While NAGP and NAGP_50 do not have a crossover operator implemented yet, the motivation for not using crossover in GSGP can be found in Castelli et al. (2014).
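The partitioning protocol and the protected division described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the seed, and the rounding of the 70% training share are assumptions, and the protected division follows the common Koza-style convention of returning 1 when the denominator is zero.

```python
import random

# Numerical constants listed in the text as terminal symbols.
CONSTANTS = [-1.0, -0.75, -0.5, -0.25, 0.25, 0.5, 0.75, 1.0]

def protected_div(a, b):
    """Division that returns 1.0 instead of failing when the denominator is zero."""
    return a / b if b != 0 else 1.0

def random_split(n_obs, train_fraction, rng):
    """Shuffle the observation indices uniformly and cut them into train/test."""
    indices = list(range(n_obs))
    rng.shuffle(indices)
    n_train = round(train_fraction * n_obs)  # rounding rule is an assumption
    return indices[:n_train], indices[n_train:]

def make_partitions(n_obs=1028, n_runs=30, train_fraction=0.7, seed=0):
    """One independent 70/30 partition per run (the seed is hypothetical)."""
    rng = random.Random(seed)
    return [random_split(n_obs, train_fraction, rng) for _ in range(n_runs)]

partitions = make_partitions()
train_idx, test_idx = partitions[0]
print(len(partitions), len(train_idx), len(test_idx))  # 30 720 308
```

Generating all partitions once and storing them makes it straightforward to reuse exactly the same splits for every method under comparison.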
Experimental Results, Comparison with GSGP
The experimental results are organized as follows:
Fig. 7 reports the results of the training error and the error of the best individual on the training set, evaluated on the test set (from now on, the terms training error and test error will be used for simplicity);
Fig. 8 reports the results of the size of the evolved solutions (expressed as number of tree nodes);
Table 3 reports the results of the study of statistical significance that we have performed on the results of the training and test error.
From Fig. 7, we can see that NAGP_50 clearly outperforms the other two studied methods both on training and on unseen data. Comparing NAGP to GSGP, we can observe that the two methods returned similar results, with GSGP slightly better on training data and NAGP slightly better on unseen data. The plots of Fig. 7a, b also show visually how useful it is for NAGP_50 to “switch” from the NAGP algorithm to the GSGP algorithm after 50 generations: on both the training and the test set, the curve of NAGP_50 improves rapidly at generation 50, in what looks like a sudden descending “step”.
Now, let us discuss Fig. 8, which reports the size of the evolved programs. GSGP and NAGP_50 generate much larger individuals than NAGP. This was expected, given that generating large individuals is a known drawback of GSOs (Moraglio et al. 2012). The fact that NAGP_50 does not use GSOs in the first 50 generations only partially limits the problem: it simply delays the code growth, which, after generation 50, is as strong as for GSGP. On the other hand, it is clearly visible that NAGP generates much smaller individuals: after an initial phase in which the size of the individuals grows also for NAGP, there is basically no further code growth (the curve rapidly stabilizes and becomes practically parallel to the horizontal axis). Last but not least, it is interesting to remark that the final model generated by NAGP has only around 50 tree nodes, which is a remarkably small model size for such a complex application as the one studied here.
To analyse the statistical significance of the results of the training and test errors, a set of tests has been performed. The Lilliefors test has shown that the data are not normally distributed and hence a rank-based statistic has been used. The Mann–Whitney U-test for pairwise data comparison with Bonferroni correction has been used, under the alternative hypothesis that the samples do not have equal medians at the end of the run, with a significance level α = 0.05. The p-values are reported in Table 3, where statistically significant differences are highlighted with p-values in italics.
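The pairwise testing procedure can be illustrated with a small self-contained sketch. It is not the code used to produce Table 3: it relies on the tie-corrected normal approximation of the two-sided Mann–Whitney U-test, applies the Bonferroni correction by dividing α by the number of pairwise comparisons (an assumption about how the correction was applied), and uses made-up error values with hypothetical method names.

```python
import math
from collections import Counter
from itertools import combinations

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the tie-corrected normal approximation."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = average_ranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    variance = n1 * n2 / 12 * ((n + 1) - ties / (n * (n - 1)))
    if variance == 0:
        return u1, 1.0
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    return u1, math.erfc(abs(z) / math.sqrt(2))

def pairwise_bonferroni(samples, alpha=0.05):
    """Test every pair of methods against a Bonferroni-corrected threshold."""
    pairs = list(combinations(sorted(samples), 2))
    threshold = alpha / len(pairs)
    report = {}
    for a, b in pairs:
        _, p = mann_whitney_u(samples[a], samples[b])
        report[(a, b)] = (p, p < threshold)
    return report

# Made-up final errors for 30 runs of three hypothetical methods.
toy = {
    "method_a": [10.0 + 0.1 * i for i in range(30)],
    "method_b": [14.0 + 0.1 * i for i in range(30)],
    "method_c": [10.05 + 0.1 * i for i in range(30)],
}
for pair, (p, significant) in pairwise_bonferroni(toy).items():
    print(pair, f"p={p:.3g}", "significant" if significant else "not significant")
```

With 30 runs per method the normal approximation is usually considered adequate, but an exact test or a library implementation may return slightly different p-values.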
As we can observe, all the differences between the results obtained with all the studied methods are statistically significant.
The conclusion is straightforward: NAGP_50 outperforms GSGP in terms of prediction accuracy, but returns results that are comparable to the ones of GSGP in terms of the size of the model. On the other hand, NAGP outperforms GSGP in terms of prediction accuracy on unseen data and also in terms of model size.
Experimental Results, Comparison with Other Machine Learning Techniques
This section compares the results obtained by NAGP and NAGP_50 with the ones achieved with other state-of-the-art machine learning (ML) methods. The same 30 different partitions of the dataset used in the previous part of the experimental study were considered. To run the ML techniques, we used the implementation provided by the Weka public domain software (Weka 2018). The techniques taken into account are: linear regression (LIN) (Weisberg 2005), isotonic regression (ISO) (Hoffmann 2009), an instance-based learner that uses an entropic distance measure (K*) (Cleary and Trigg 1995), multilayer perceptron (MLP) (Haykin 1999) trained with back propagation algorithm, radial basis function network (RBF) (Haykin 1999), and support vector machines (SVMs) (Schölkopf and Smola 2002) with a polynomial kernel.
As done for the previous experimental phase, a preliminary study has been performed in order to find the best tuning of the parameters for all the considered techniques. In particular, using the facilities provided by Weka, we performed a grid search parameter tuning, where different combinations of the parameters were tested. Table 4 shows the interval of tested values for each parameter and for each technique.
The results of the comparison are reported in Figs. 9 and 10, which present the performance on the training and test sets, respectively. We start the analysis by commenting on the performance on the training set.
As one can see in Fig. 9, K* is the best performer on the training set, producing better-quality models than all the other studied techniques. MLP is the second-best technique, followed by NAGP_50 and SVMs. LIN outperforms both GSGP and NAGP, while ISO produces results similar to those of NAGP. Finally, the worst performer is RBF. Focusing on NAGP_50, it is important to highlight that its performance is comparable to that of MLP and SVMs, two techniques that are commonly used to address this kind of problem.
While the results on the training data are important, the performance on the test set is a fundamental indicator to assess the robustness of the model with respect to its ability to generalize over unseen instances. This is a property that must be ensured in order to use a ML technique for addressing a real-world problem. According to Fig. 10, NAGP_50 outperforms all the other techniques taken into consideration on the test set. Interestingly, its performance is comparable with the one achieved on the training set, presenting no evidence of overfitting. This indicates that NAGP_50 produces robust models that are able to generalize over unseen data.
To assess the statistical significance of the results presented in Figs. 9 and 10, the same type of statistical test as the one used in the previous section was performed, with α = 0.05 and the Bonferroni correction. Table 5 reports the p-values returned by the Mann–Whitney test with respect to the results achieved on the training set. Results reported in italics are those for which the null hypothesis can be rejected (i.e. the statistically significant results). According to these results, NAGP_50 produces results that are comparable with those of SVMs, while K* is the best performer, followed by MLP.
Table 6 reports the p-values of the Mann–Whitney test with respect to the results achieved on the test set. According to these p-values, it is possible to state that NAGP_50 is the best performer, producing solutions that outperform the other techniques in a statistically significant way. SVMs are not able to reach the same quality on the test set, as they overfit the training data. Interestingly, all the non-GP techniques, except LIN, suffer from overfitting, hence producing models that are not able to generalize well on unseen data.
Experimental Results, Discussion of an Evolved Model
In this section, we show and discuss the best multi-individual evolved by NAGP in our simulations. It is important to point out that, as Fig. 8 clearly shows, this would not be possible for NAGP_50 and for GSGP, since these two methods use GSOs and these operators cause a rapid growth in the size of the evolved solutions. For this reason, it was not possible to show the final model in Cheng et al. (2013), while it is possible in the present contribution.
The best multi-individual evolved by NAGP in all the runs that we have performed was composed of the following two expressions, in prefix notation:
(* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (* (+ (* X0 (− X6 (* (+ (− (/ (* X3 (− − 0.75 X5)) (/ (/ 0.5 (+ − 1.0 − 0.25)) (/− 1.0 0.75))) (− X1 (− (/ X7 0.25) (+ (* (/− 0.75 (+ X7 X0)) (* (− 1.0 X6) (− (* (/ (+ X2 X1) − 0.25) (+ (/ X7 (+ − 0.5 (/ (/ X4 (+ − 1.0 X1)) X7))) (− (− 0.25 1.0) (+ (+ X7 X1) (− 1.0 − 0.25))))) 0.5))) 0.5)))) (− (/ (− (+ − 0.5 (*− 0.25 0.75)) (/ (* (/ X5 (− − 0.25 0.5)) (+ X0 (− (/ (− − 0.25 X3) (/ (/ (/ (/ (− (* (/ X2− 1.0) X0) X6)− 0.25) (/ (/ (+ − 0.25− 1.0) (+ X7− 0.5)) (+ − 0.25 − 0.75))) X1) X1)) − 0.75))) (/ (/ (/ (+ X6 (/ 0.25 (*− 0.75 1.0))) 0.75) (+ − 1.0 (+ 1.0 X3)))− 1.0))) (/ X6 (/ (+ (+ X6 − 1.0) (*− 0.5 (− 1.0 (− − 0.75 X3)))) 0.75)))− 0.25)) (/ X7 (* (+ − 0.75 (/ 1.0 (* (/ X6 (/ 0.75 (+ 0.75 X4))) (* (− (+ − 0.25 X4) 0.75) (* X6 (/ (* (/− 0.25 (+ (− − 0.5 X6) 1.0)) (− X6 0.25)) X7)))))) X7))))) (* X7 − 0.5)) 34.0) X5) 33.0) 36.0) X6) 23.0) 31.0) 23.0) 20.0) 31.0) 39.0) 39.0) (− X6 1.0)) 36.0) 28.0) 34.0) 22.0) 25.0) 38.0) 26.0) 29.0) 34.0) 27.0) 30.0) 23.0) 33.0) 35.0) 24.0) 34.0) 36.0) 36.0) 37.0) 38.0) 36.0) 27.0) 39.0) 36.0) 20.0) 34.0) 37.0) 37.0) 37.0) 36.0) 32.0) 37.0) 39.0) 33.0) 26.0) 39.0) 31.0) 33.0) 24.0) 27.0) 27.0) 33.0) 39.0) 37.0) 38.0) 36.0) 32.0) 23.0) 35.0) 24.0) 39.0) 26.0) 26.0) (+ (+ X0 (* (− 1.0 (+ (/ X2 0.75) − 0.75))− 0.5)) (+ (* X1 1.0) X7))) 26.0) 37.0) 37.0) 27.0) 32.0) 38.0) 22.0) 37.0) 34.0) 31.0) 28.0) 30.0) 21.0) 26.0) 23.0) 20.0) 38.0) 38.0) 33.0) 32.0) 21.0) 24.0) 20.0) 37.0) 30.0) 21.0).
(* (* (/ X4 (+ X7 (+ (* (− (− (/ (* (− (/ (+ 1.0 X3) (− X1 (− (− X2 X6) − 0.75))) (− (/ (− X2 X4) (* (+ (/ (+ − 1.0 0.25) (+ 0.25 X0)) (+ X3 (− X3 0.5))) X4)) X4)) X1) (/ 0.75 − 0.25)) (− X5 − 1.0)) (+ − 0.5 0.5)) (/ − 1.0 X2)) (− X6 X2)))) X7) 21.0).
The reader is referred to Table 1 for a description of the variables used in these expressions (only the IDs referenced in the table, X0, X1, …, X7, are used in the above expressions). If we consider the reconstructed expression Popt [as in Eq. (3)] using these two expressions, Popt has an error on the training set equal to 9.53 and an error on the test set equal to 9.06. Both the relationship between the training and test error (they have the same order of magnitude and the error on the test set is even smaller) and a comparison with the median results reported in Fig. 7 allow us to conclude that this solution performs very well, with no overfitting.
The first thing one notices when looking at these two expressions is that the first one is significantly different from the second one: first of all in terms of size (the first expression is clearly larger than the second), but also in terms of tree shape. In the first expression, in fact, one may notice a skewed and unbalanced shape consisting of several multiplications by constant numbers. This observation is not surprising: the first of these two expressions is the one that has undergone the multiplication by the constant λ during the mutation events, as explained in Sect. 5. These repeated multiplications by constants also have, of course, an impact on the size of the expression (this is the reason why the first expression is larger than the second one). However, all these multiplications by a constant can easily be simplified, i.e. transformed into one single multiplication by a constant. The second expression, instead, is much simpler and quite easy to read (numeric simplifications are also possible on this second expression, which would make it even simpler and easier to read).
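The simplification of repeated multiplications by constants can be sketched as follows. This is only an illustration under stated assumptions: the expression is written with ASCII operators, the helper names are hypothetical, only the outermost product is folded, and the toy expression is a shortened stand-in for the long chain of the first expression, not the evolved model itself.

```python
def tokenize(text):
    """Split a prefix expression into parentheses, operators and operands."""
    return text.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Turn a token list into nested lists [op, arg1, arg2], floats and names."""
    token = tokens.pop(0)
    if token == "(":
        node = [tokens.pop(0)]
        while tokens[0] != ")":
            node.append(parse(tokens))
        tokens.pop(0)  # drop the closing parenthesis
        return node
    try:
        return float(token)
    except ValueError:
        return token  # a variable name such as X0

def split_factors(node):
    """Flatten a product into (constant part, list of non-constant factors)."""
    if isinstance(node, float):
        return node, []
    if isinstance(node, list) and node[0] == "*":
        c_left, f_left = split_factors(node[1])
        c_right, f_right = split_factors(node[2])
        return c_left * c_right, f_left + f_right
    return 1.0, [node]

def fold(node):
    """Collapse all constant factors of the outermost product into one."""
    constant, factors = split_factors(node)
    if not factors:
        return constant
    product = factors[0]
    for factor in factors[1:]:
        product = ["*", product, factor]
    return product if constant == 1.0 else ["*", constant, product]

def to_prefix(node):
    """Print a nested-list tree back in prefix notation."""
    if isinstance(node, list):
        return "(" + " ".join([node[0]] + [to_prefix(a) for a in node[1:]]) + ")"
    return str(node)

toy = "(* (* (* (+ X0 X1) 34.0) X5) 33.0)"
print(to_prefix(fold(parse(tokenize(toy)))))  # (* 1122.0 (* (+ X0 X1) X5))
```

Because the folding reorders floating-point multiplications, the simplified expression may differ from the original in the last digits of its output, which is normally irrelevant for readability purposes.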
Concerning the variables used by the two models, Table 7 shows the number of times that each of the variables appears in these two expressions. From this table, we can see that variables X6 and X7 are the ones that appear most frequently in the expressions, and thus we hypothesize that these variables are considered as the most useful, i.e. informative, ones by NAGP for the correct reconstruction of the target. These variables represent fine aggregate (expressed in kg/m³) and age of testing (expressed in number of days), respectively.
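A count of this kind can be obtained with a few lines of code. The sketch below is an assumption about how such a tally might be computed, not the procedure used for Table 7; for brevity it is applied only to the second (shorter) expression, transcribed with ASCII minus signs, so its output covers only part of the counts in the table.

```python
import re
from collections import Counter

VARIABLE = re.compile(r"X[0-7]")  # the eight feature IDs X0 ... X7

def count_variables(*expressions):
    """Count how often each variable ID occurs across the given expressions."""
    counts = Counter({f"X{i}": 0 for i in range(8)})
    for expression in expressions:
        counts.update(VARIABLE.findall(expression))
    return counts

# Second expression of the multi-individual, transcribed from the text.
second_expr = (
    "(* (* (/ X4 (+ X7 (+ (* (- (- (/ (* (- (/ (+ 1.0 X3) "
    "(- X1 (- (- X2 X6) -0.75))) (- (/ (- X2 X4) (* (+ (/ (+ -1.0 0.25) "
    "(+ 0.25 X0)) (+ X3 (- X3 0.5))) X4)) X4)) X1) (/ 0.75 -0.25)) "
    "(- X5 -1.0)) (+ -0.5 0.5)) (/ -1.0 X2)) (- X6 X2)))) X7) 21.0)"
)

counts = count_variables(second_expr)
for name in sorted(counts):
    print(name, counts[name])
```

Passing both expressions to `count_variables` would give the combined tally over the whole multi-individual.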