Cellular geometric semantic genetic programming

Among the different variants of Genetic Programming (GP), Geometric Semantic GP (GSGP) has proved to be both efficient and effective in finding good solutions. The fact that the operators of GSGP operate on the semantics of the individuals in a clear way provides guarantees on the way the search is performed. GSGP is not, however, free from limitations like the premature convergence of the population to a small–and possibly sub-optimal–area of the search space. One reason for this issue could be the fact that good individuals can quickly “spread” in the population suppressing the emergence of competition. To mitigate this problem, we impose a cellular automata (CA) inspired communication topology over GSGP. In CAs a collection of agents (as finite state automata) are positioned in a n -dimensional periodic grid and communicates only locally with the automata in their neighbourhoods. Similarly, we assign a location to each individual on an n -dimensional grid and the entire evolution for an individual will happen locally by considering, for each individual, only the individuals in its neighbourhood. Specifically, we present an algorithm in which, for each generation, a subset of the neighbourhood of each individual is sampled and the selection for the given cell in the grid is performed by extracting the two best individuals of this subset, which are employed as parents for the Geometric Semantic Crossover. We compare this cellular GSGP (cGSGP) approach with standard GSGP on eight regression problems, showing that it can provide better solutions than GSGP. Moreover, by analyzing convergence rates, we show that the improvement is observable regardless of the number of executed generations. As a side effect, we additionally show that combining a small-neighbourhood-based cellular spatial structure with GSGP helps in producing smaller solutions. Finally, we measure the spatial autocorrelation of the population by adopting the Moran’s I coefficient to provide an overview of the diversity, showing that our cellular spatial structure helps in providing better diversity during the early stages of the evolution.


Introduction
Genetic Programming (GP) [1] is a prominent technique in Evolutionary Computation (EC).It consists of evolving programs, typically represented as trees, to solve specific supervised learning problems.Among the many variants of GP, Geometric Semantic GP (GSGP) [2] has received widespread interest since its introduction in 2012, showing its ability to outperform standard GP [3][4][5].
Unlike traditional GP, where the functions space is explored with search operators that act on the syntactic representations of individuals without taking into account the effects on the semantics, GSGP considers the space of the semantic representation of programs and each syntactic manipulation derives from the need to obtain a specific effect on the semantics.In this case, semantics can be generally defined as the behaviour of a program when it is executed, which can be approximated as the vector having as elements all outputs generated on the training data [2].
Despite its advantages over standard GP, GSGP still suffers from some limitations, as mentioned by Moraglio et al [6].In particular, while the fitness landscape induced by GSGP is unimodal, the fact that not all points (programs) in that landscape can be represented leads to the emergence of local optima or "holes" in the landscape, where the evolution might get stuck.Another effect is that GSGP might promote convergence towards individuals that dominate others in the early generations.In other words, the number of generations needed for one individual to fully propagate, i.e., the takeover time, is small.This can lead to being able to move in the fitness landscape only via mutation, due to the convex hull of the population-the only area of the space that can be explored via crossover-having shrunk too much and not around the global optimum.
In this work, we propose a new form of GSGP, which takes inspiration from one of the oldest natural computing models, cellular automata (CA) [7][8][9], to force the interactions among individuals of a population to be local according to a given topology.
Recall that CA is a formal model comprising a grid of identical cells, or automata, each updating its internal state based on a local rule dependent solely on a fixed number of neighboring automata.In a similar fashion, all individuals of a population are positioned into a grid (and n-dimensional torus).To update the position i in the grid, there is a selection phase happening only among individuals in the neighbourhood of i, i.e., in a ball of a fixed radius; if crossover happens, it will also be only between two individuals selected in the same neighbourhood.Hence, all interactions are local according to the topology imposed on the population.One effect of this is that even a good individual is limited in "propagation speed" by this topology, allowing more time for other-possibly betterindividuals to be generated.This should help the optimization process to keep a higher rate of diversity in the population during the early generations, allowing for a more complete and global exploration of the search space at the beginning of the evolution before it starts to converge to dominant individuals.
The resulting cellular GSGP (cGSGP) is compared, in this paper, with standard GSGP in multiple regression problems.A two-dimensional torus-based topology and multiple neighbourhood sizes are tested, together with multiple ways of performing only local selections.The results show that, in fact, cGSGP provides an improvement compared to GSGP in terms of the quality of the solutions.
The paper is structured as follows: Section 2 provides an overview of previous CA-inspired evolutionary methods.Section 3 provides a formalisation and explanation of the proposed method.The experimental settings and the datasets used in the experiments are described in Sect.4, while in Sects.5 and 6 the results obtained are presented and discussed.Finally, Sect.7 summarizes the main contributions of this work and draws some possible directions for future research.

Related works
The idea of imposing a certain spatial structure to the population of evolutionary algorithms was already explored in the literature [10][11][12], allowing each individual to interact only with a small subset of the population, i.e., its neighbourhood, usually specified by imposing a grid structure on the population, leading to the formation of semi-isolated groups of individuals having similar features.
In the case of classical genetic algorithm (GA) [13], the concept was introduced in the form of cellular Genetic Algorithm (cGA) [10][11][12].cGA was used to improve optimisation performance for a well-known problem such as the Travelling Salesman Problem (TSP) [14], where it was combined with the Simulated Annealing algorithm.Extensive research on the impact of different neighbourhoods and topologies on the performance of cGA in various problem domains was performed in [10] and [12].Alba et al. [10] also proposed a dynamic version of cGA, which adaptively adjusts the exploration/exploitation trade-off ratio during evolution.
Murata et al. [15] proposed a variation of GA for multiobjective optimisation with a cellular structure (C-MOGA) that was an extension of multiobjective GA (MOGA).In C-MOGA, the selection operator is applied in the neighbourhood of each cell.Nebro and coauthors [16] came up with another approach for multiobjective optimisation which takes inspiration from canonical cGA and results in an algorithm called MOCell.
Folino et al. [17] presented a cellular parallel implementation of GP endowed with a load-balancing mechanism that distributes the computational load among the processors equally, showing that the cellular approach can outperform both standard GP and the island model [18], where fully separate evolutions are carried out on four different subsets of the population.Authors in [19] and [20] showed how a spatial population structure in combination with a local elitist replacement scheme can effectively reduce the bloat phenomenon (i.e., the increase in average tree size without a corresponding increase in fitness) for GP, without compromising the performance.
In addition to GA and GP, several other evolutionary algorithms employed the cellular model to control the diffusion of solutions throughout the population [21][22][23], demonstrating how the idea of imposing a cellular structure is a concept applicable in general to populations of solutions independently from the algorithm working on them.However, to the best of our knowledge, the use of a cellular structure has not been explored in the context of GSGP, an area that we plan to start tackling in this work.
Promoting diversity within a population can also be achieved through a process known as speciation.This concept is rooted in the observation that natural ecosystems consist of distinct physical spaces, or niches, each characterized by specific features.This diversity of niches facilitates the emergence of various species, with individuals of a species sharing common attributes.Within each niche, resources are constrained, leading individuals to compete for them.
Likewise, the population of an Evolutionary Algorithm can be strategically partitioned into separate niches comprising individuals with similar structural characteristics.This arrangement ensures that only individuals belonging to the same species can interact through crossovers.
Della Cioppa et al. [24] proposed an adaptive species discovery strategy which addresses some limitations of nitching methods which rely on a priori knowledge about the fitness landscape.Authors in [25] took inspiration from the design principles of the NEAT algorithm [26] to leverage speciation to limit programs growth in GP.Through their algorithm, named neat-GP, complexity is built only when the problem requires it and the population is progressively divided into species containing individuals with similar size and shape.Juárez-Smith et al. [27] integrated neat-GP with a local search operator to improve the quality of solutions.Authors in [28] define a network distance metric to speciate a population of artificial gene regulatory networks, showing how speciation tends to facilitate diversity and keep small-sized individuals.Martins et al. [29] combined GA and speciation with a grid pattern recognition approach to reduce investment risk and increase profits.Additionally, they made an analysis of the differences between the speciated and the non-speciated approach.Wickman et al. [30] leveraged speciation to evolve a diverse population of policies in the context of Reinforcement Learning.
Even though there are several works about cellular EAs, it is not trivial to assess whether the cellular structure in EA is state-of-the-art or not.Several works in the literature have attempted to study the effect of selection pressure and sampling strategy in both standard EA [31,32] and cellular-based EA [33,34].
The cellular structure in EA can also have a significant impact on the takeover time, i.e., the number of generations required for a single best individual to fully propagate to the population with copies of itself.Clearly, a small takeover time implies a fast loss in the diversity of the population.In cellular-based EA the spatial structure itself provides a way of controlling takeover time and selection pressure, as shown, for example, in [34].
Therefore, considering that premature convergence to local sub-optimal solutions is one of the major limitations of GSGP, our intuition is that imposing a cellular structure may be beneficial in the context of GSGP.Such an approach may help increase the takeover time and avoid a fast loss in diversity.

Methods
In this section, we recall the basics of GSGP and then we will introduce cGSGP, focusing on how the evolution process must be changed to take into account the topology imposed on the population.

Geometric semantic genetic programming
Traditionally, Genetic Programming (GP) explores the space of solutions by means of operators acting on the syntax of programs, i.e., their formal representation, without any regard to the semantic effect of the operators on the individuals.In recent years, the integration of semantic awareness in GP has been shown to improve its performance in different scenarios, attracting the interest of the scientific community [35].Although several definitions are possible, here the semantics of an individual is the vector of output values of a program on a set of inputs.Hence, we can map each individual into a point on the semantic space, a finite-dimensional vector space, usually ℝ n .
GSGP, introduced for the first time by Moraglio et al. [2], defines operators that act directly on the semantics of individuals, the Geometric Semantic Operators (GSOs).
Specifically, the Geometric Semantic Crossover (GSC) takes the following form: where T 1 , T 2 ∶ ℝ n → ℝ are the two parents, and R is a random real function with outputs in the interval [0, 1], usually represented by a random tree.Since, for each input x ∈ ℝ n , T XO (x) is a linear combination of the two parents' outputs, the GSC ensures that the offspring has semantics belonging to the segment joining the semantics of the two parents on the semantic space.
Similarly, the Geometric Semantic Mutation (GSM) is defined as follows: where T ∶ ℝ n → ℝ is the function being mutated, R 1 , R 2 are two random real func- tions with outputs in the interval [0, 1] and m is a parameter called mutation step.GSM can be considered as a "weak" perturbation of the individual since the new individual is contained in the hyper-sphere of radius m centred in the semantics of the parent in the semantic space.Notice that the same geometric semantic operators of crossover and mutation that induce an unimodal fitness landscape, however, also increase the size of the individuals.For crossover, in particular, this increase is exponential in the number of generations, as already highlighted when GSGP was introduced [2].Practical solutions to this problem have been proposed over the years [2,[36][37][38][39][40], to make this approach usable in real-world scenarios.The combination of the formal properties of GSGP and the presence of efficient implementations [41] allowed it to outperform standard GP on many symbolic regression tasks and real-world problems [5,37].

Cellular GSGP
Standard GSGP does not impose any spatial structure on the population, allowing selection to be performed across the entire population and the crossover to happen between any two individuals. (1) In cGSGP a neighbourhood structure is imposed in a way inspired by CA [7][8][9].Recall that CA is a formal model where finite automata are distributed according to a certain topology and each cell updates its own internal state following a local rule depending only on a certain number of neighbour automata.
Similarly, we distribute the individuals of a population on a toroidal grid.In this scenario, each individual belongs to a specific cell of this grid, hence to a spatial location, which is chosen randomly at the beginning.At this point, it is possible to introduce the concept of neighbourhood within our population.Specifically, we consider an individual to be in the neighbourhood of radius r of a given cell if it is contained in the hypercube with side 2r + 1 corresponding to the Moore neighbourhood of radius r.In the following the n-dimensional toroidal grid with neighbourhood of radius r ∈ ℕ will be denoted by T n r .We choose to impose a toroidal configuration to the grid in order to have a complete neighbourhood for all cells: a hypercube has incomplete neighbourhoods for cells near the faces of the hypercube.Furthermore, notice that there are no "disconnected" neighbourhoods in T n r when r > 0 ; this means that information can travel between any two points of the grid, thus there are no isolated or independent sub-populations.
Once a neighbourhood structure has been defined, it is necessary to define how this structure is going to influence the selection process and the crossover operator 1 .
Let us now consider a position i in an n-dimensional toroidal grid with neighbourhood N i .For example, in two dimensions i = (i 1 , i 2 ) and a neighbourhood of radius r it is Moreover, each individual is contained in its own neigh- bourhood (Fig. 1).
For any given population each position i has an associated tree T i , so we expect that when we need to replace a tree in position i that it will be derived from the trees in positions part of N i (its neighbourhood).
If crossover is not performed, it is sufficient to replace (and possibly mutate) the individual in position i with one selected locally (maybe itself).Furthermore, as long as each selection is performed in a local way, forcing crossover to be local is trivial: it is sufficient to perform crossover between two locally selected individuals.Hence, we will focus on how we can define local selection (Figs. 2, 3).

Selection strategies
As a first strategy, we employ a rank-based selection method within local neighbourhoods ( ), where the entire neighbourhood N i of position i is considered.In this case, the selection procedure returns the elements of N i ordered by fitness, from the best one to the worst one without any repetition.As a consequence, if no crossover is performed then the new tree in position i will be the best one in N i .If, instead, crossover is performed, then it will be between the two best individuals in N i .
The selection given by is deterministic and, without the presence of only local interactions, would be too strict and would force premature convergence in a few generations.Local interactions avoid this problem (different neighbourhoods can have different best individuals), but we might still want to have a less "draconian" selection.To this end, as a second strategy, we employ a tournament-based selection method within local neighbourhoods ( p ), where the tournament size is defined w.r.t. a real value p ∈ (0, 1] representing a proportion of the given neighbourhood.
The idea of p is that is not applied on the entire neighbourhood of an individual i, but on a subset S ⊆ N i .This subset S is selected by iterating across the elements of N i and inserting them into S with probability p.In other words, S is a sampled subset of N i with probability p.Notice that now p can be used as a parameter to tune the selec- tion pressure, with p = 1 resulting in the most pressure as in the selection strategy.How to leverage the cellular structure to select the parents?
Define selection pressure and selections number: tournament selection, roulette selection, cellular selection (pressure may be given by a sampling probability)

Standard EA
How to generate the offspring from the parents?
Define a crossover operator to be executed with some probability: sub-tree onepoint crossover, geometric semantic crossover How to mutate the offspring?
Define a mutation operator to be executed with some probability: sub-tree uniform mutation, geometric semantic mutation (with some mutation step) Fig. 2 Diagram describing a baseline cellular-based evolutionary algorithm.For each activity, the main design choice is highlighted, and an overview of the main hyper-parameters and variants is presented.Especially, we highlight in bold the hyper-parameters and the main building blocks that compose the design choices In Algorithm 1 we provide a pseudo-code for the p selection strategy, which is equivalent to when p = 1.

Experimental study
In the experimental phase, we try to provide answers to the following questions: RQ1 Is it possible to overcome the issue of converging too fast to small and possibly sub-optimal areas of the search space?
RQ1 Can cGSGP exploit the presence of local communication to produce better solutions than GSGP?
To this end, we compare standard GSGP with cGSGP by using a specific grid topology with different neighbourhood sizes.In particular, we consider a twodimensional toroidal grid with neighbourhood of radius 1, 2, and 3, denoted by T 2 1 , T 2  2 , and T 2 3 , respectively.Each method is tested on eight regression datasets that have already been used in GP-related problems [5,37,42,43].Table 1 summarizes the number of samples and the number of variables for each dataset.References are added to provide a link to the complete description of these datasets.
Each dataset has been partitioned in 100 train-test splits.Specifically, 70 % of the observations, selected at random with uniform probability, has been used to build the training set, while the remaining 30 % has been used to build the test set.We run for each method and for each dataset 100 repetitions with different seeds, one for each split (the KJ6 dataset is unique in that it is defined for a single split; therefore, we use the same split for all repetitions, with the seed influencing only the evolution).
The optimisation problem we aim to address is a single-objective minimisation problem in which the fitness is the Root Mean Squared Error (RMSE) [51] computed between the target and the predictions obtained by applying the discovered symbolic models on the training set.We check the performance on the test set to obtain information on the way the model behaves on unseen data.The evolution parameters are fixed for all runs and described in Table 2.We test different pairs of population size and number of generations.The chosen combinations are described in Table 3.
We choose the different types of population sizes in order to obtain three different perfectly squared bi-dimensional toroidal grids whose dimensions are evenly distributed.Moreover, we set the number of generations in order to always obtain, approximately, the same number of fitness evaluations, and thus encourage a fairer comparison (since the total number of fitness evaluations is not a multiple of 900, we set the corresponding number of generations to the nearest rounded integer available).
We test each method by using: crossover and mutation together sequentially ( E cx+mut ), only crossover ( E cx ), and only mutation ( E mut ).For the sake of simplicity, we define this variant as exploration pipeline.
Each method is compared with the others methods by using a Wilcoxon signedrank test [52], and the corresponding p-values are collected and adjusted by using a Holm-Bonferroni correction [53] with = 0.05 (this test is carried out only if a Kruskal-Wallis test [54] with = 0.05 is passed for the involved methods and the given dataset).
To guarantee that the results obtained are due to the presence of a communication topology in cGSGP, GSGP was tested with tournament selection with different tournament sizes up to 12.When the results were statistically compared, no significant difference was found in the results.As a consequence, we present the results comparing cGSGP with GSGP with tournament selection of size 4.
All the code implementing GSGP and cGSGP used in the experiments is available in the following repository: https:// github.com/ lurovi/ CA-GSGP.

Results
In this section, we present the results from our experimental phase and discuss the outcomes.

Statistical comparison with the baseline
We showcase the experimental results comparing GSGP and cGSGP, utilizing the combinations of topology and neighbourhood radius introduced earlier, and employing two distinct selection methods: and 0.6 .The comparisons for all the proposed combinations of parameters are shown in the following tables, in which the median RMSE values on the test sets are presented.Each table describes the outcome for a specific combination of population size and number of generations.Within the same table, for each dataset, all topologies, selection strategies, and variants of exploration pipeline are presented.
For a given dataset, cGSGP methods that exhibit statistically better performance than the methods in the table with the same selection strategy and exploration pipeline, including the corresponding GSGP, are marked with an asterisk (*).Moreover, we highlight in bold the cGSGP methods that, taken individually, are meaningfully better, according to a Wilcoxon signed-rank test, than the GSGP baseline with the same exploration pipeline (Tables 4, 5, 6).
As a general observation for E cx+mut , in all datasets, cGSGP is able to perform better than GSGP, with the specific selection method ( or 0.6 ) not providing a significant difference.The only exception seems to be KJ6 for 100 as population size.
In case the population size is small, what influences the results tends to be the topology: at least a minimum size of the neighbourhood is needed, in fact, T 2  1 is the topology with the smallest neighbourhood (9 positions including the central one) and is usually unable to provide results which are better than GSGP in a statistically significant way.
Larger neighbourhoods instead tend to provide more consistent results.Notice that, while the best results are obtained with T 2 3 , a quite large neighbourhood, a smaller one, with radius 2, is sufficient to consistently perform better than GSGP.This could be useful to obtain good results while limiting the amount of local interaction between the individuals in the population.Moreover, it seems that by increasing the size of the population it becomes more likely that the cGSGP method can perform better than the baseline.This occasionally seems to happen even within a small neighbourhood radius.This result suggests that a large enough population size is probably necessary to take advantage of the cellularbased selection performed on the toroidal grid.By inspecting the results for E cx and E mut , the general observation that we can formulate is that the mutation operator alone is more beneficial than the crossover operator alone as regards our proposed approach.As a matter of fact, we can see that when E mut is employed, it seems to be more likely that our method performs better than the baseline, while there is less significant difference when E cx is used.This could suggest that the mutation operator is the one that benefits more from our proposed selection procedure.
The cGSGP with the E cx combination performs better than the baseline only for the synthetic datasets (V14 and KJ6) and for TXC.On the other hand, the cGSGP with the E mut combination performs better than the baseline for all the datasets except for the synthetic ones.In general, the vanilla combination E cx+mut still seems to be the one that provides more consistent results.

Best RMSE distribution
The detailed results for each dataset are presented as box-plots of the RMSE on the test set of the best individuals in the last generation.Especially, we present the results for 100 as population size and 1000 generations, with E cx+mut as explo- ration pipeline.Figure 4 presents the results for the Vladislavleva14 (V14) synthetic dataset.The plot shows that cGSGP method is able to provide better results than the baseline for this specific dataset, which benefits from an exploration based on local interactions of neighbourhoods.Particularly, the strategy seems the one that consistently performs better than the baseline.
Figure 5 presents the results for the Keijzer6 (KJ6) synthetic dataset.In this case, the cGSGP fails to outperform the baseline, meaning that the proposed selection strategies are not beneficial for this specific type of problem when the population size is small.Moreover, it seems that results from cGSGP tend to even become worse if we set a larger radius with the strategy.Figure 6 illustrates the results for the Airfoil (ARF) dataset.As it is possible to observe, all variants of cGSGP perform better than GSGP, with the twodimensional torus T 2 1 being not statistically better than GSGP independently from the selection method used.It is also possible to observe that the use of 0.6 produces lower errors, but the difference is only a slight one.The largest neighbourhood, i.e., T 2  3 , is able to produce results that are better than all other cGSGP methods independently from the selection procedure employed.
The results for the Concrete (CNC) dataset presented in Figure 7 show a pattern where different topologies influence the quality of results.While for T 2 1 the results remain not statistically better than GSGP, it is possible to observe that an increase in the size of neighbourhood corresponds to an improvement in the quality of the solutions independently from the selection method employed.Similarly to the CNC dataset, also for the Slump (SLM) dataset, whose results are presented in Figure 8, the performances of GSGP and cGSGP are comparable.In fact, for two different topologies, the performances are not better than GSGP in a statistically significant way.
Figure 9 presents the results for the Toxicity (TXC) dataset.As it is possible to observe, there is still a slight increase in the quality of the solution with the increase in the size of the neighbourhood, with the T 2 3 providing the best results also compared to other cGSGP configurations.
The results for the Yacht (YCH) dataset are presented in Figure 10 confirming the trend observed in the other datasets.A local-only interaction helps and is able to produce better results than GSGP, with an additional increase in the solutions' quality for larger neighbourhoods.Similarly to many of the other datasets, also for the Parkinson (PRK) dataset, whose results are presented in Figure 11, we can observe that, regardless of the selection strategy employed, a cGSGP-based method with a sufficiently large radius is necessary to perform better than GSGP.

Convergence rate analysis
In this section, we analyze the convergence rate of our proposed techniques.For illustrative purposes, the convergence rates of GSGP and cGSGP are depicted in Figure 12.In these plots, both E cx+mut and are employed for a population size of 100 spanning 1000 generations.Specifically, for every generation, we illustrate the fitness evaluated on the test set of the best individual of the corresponding The plot is based on E cx+mut and strategy, with 100 as population size and 1000 generations generation; the shaded area denotes the inter-quartile range.It is evident that, due to the enforcement of elitism, this individual is also the best one discovered during the course of the evolution up to that point.
Line-plots indicate a negative trend for the two synthetic datasets for both baseline and cGSGP, particularly prominent for KJ6.In both scenarios, the test fitness seems to grow with the increasing number of generations.This is especially noticeable for T 2 2 and T 2 3 in KJ6.In contrast, within the V14 dataset, the proposed approach consistently outperforms the baseline.However, it appears to plateau after the final generation, suggesting potential over-fitting across all the methods and both synthetic datasets.
For all other datasets, we consistently observe, for all involved methods, either a constantly decreasing trend (TXC, YCH, and PRK) or an abrupt drop in fitness during the initial generations, leading to a plateau in the subsequent ones (ARF, CNC, and SLM).In these situations, GSGP and T 2 1 display similar behaviors.Yet, cGSGP methods with an extended radius tend to exhibit a more pronounced downward trend.
The insights gained from this analysis reinforce our findings from Sect.5.1.Here, it seems that the evolutionary process gains from a cellular-inspired selection procedure, especially with a sizeable radius.This approach enables the optimization to derive superior results compared to the standard GSGP.Moreover, based on the convergence rates, this benefit appears to hold true regardless of the number of executed generations.

Population fitness distribution
After assessing the performance of the proposed methods across the generations as regards the best-discovered solutions, we analyze the fitness distribution on the train set of the entire population.To this end, we show the median train fitness in Figure 13, where E cx+mut and are employed for a population size of 100 and 1000 generations.Specifically, for each generation, we plot the median of all the fitness values evaluated on the train set by the individuals in the population of the generation at hand (the shaded area represents the inter-quartile range).
For almost all the tested datasets, the median RMSE consistently decreases.This is also true for the two synthetic datasets, which, according to the convergence rates analysis, suffer from over-fitting-related issues.As regards the real-world datasets, the median RMSE on the train set confirms the trend observed in the line-plots of the previous section, with TXC that exhibits a decreasing trend with a negative concavity for T 2 2 and T 2 3 .Finally, similarly to what we have already seen in the previous section, we can observe a plateau after a few generations in ARF and CNC.
An analysis based on the train distribution of the population should reveal some insights into the diversity preservation mechanisms that play along with the proposed techniques.However, as we can see, except for the synthetic datasets and occasionally for SLM, there is little or no variability in the train fitness values.Hence, this type of analysis does not reveal useful information on the diversity of the population, which is what we are going to try analyzing in deeper detail in the next section.

Semantic diversity analysis
Evaluating the diversity of the population across the generations is a challenging task, especially in the context of GSGP, where the analysis of the syntactical structure of trees may not be feasible for multiple reasons: (1) if two individuals are syntactically different, then any geometric crossover and mutation resulting from them will always be different, (2) more importantly, a tree-representation of a GSGP individual is exponential in size.Hence, in this context, we try to analyze the diversity of the population by introducing a measure that accounts for the semantic distance of individuals.
Each individual evaluated on the train set produces a semantic vector, which contains all the train predictions of the individual.It is then possible to compute the Global Moran's I (denoted as I ) [55,56] on the semantic vectors of a given popula- tion and use it to measure diversity.I is a measure of spatial autocorrelation [57], assessing similarity or dissimilarity of values located in neighbouring locations and formally defined as where N is the population size, w ij ∈ ℝ is the value located at the i-th row and the j-th column of the matrix w ∈ ℝ N×N , ∑ N j=1 w ij , x i is a representation of the i-th individual in the population as a semantic vector, and x is the mean of the representations of all the individuals in the population.In our case, w ij = 1 if i ≠ j and the j-th individual belongs to the neighbourhood of the i-th indi- vidual (in vanilla GSGP the entire population is a single neighbourhood), otherwise w ij = 0.
I ranges in the interval [−1, 1] , where values close to 1 indicate positive spatial autocorrelation, while values close to -1 suggest the presence of negative autocorrelation.Values near 0 highlight a spatial pattern not different from a random phenomenon.A high value means that similar semantic content is spatially clustered in the population, hence preserving diversity between different clusters, while a value near 0 suggests a spatial arrangement of semantic vectors which is not different from a random one.
We show the trend of I in the following line-plots, where both E cx+mut and are employed; the shaded area denotes the inter-quartile range.In this type of analysis, we test all three combinations of population size and number of generations to better show the relationship between population size and diversity of cGSGP.We remark that considering I is one possible diversity measure and necessarily represents only a partial view on the topic of semantic diversity for GSGP.
Figure 14 shows the I trend for 100 as population size.For V14 and KJ6 datasets we observe a flattening towards 0 after the very first generations for the different cGSGP methods, while the baseline is always near 0. In all the other cases, it is evident how the cGSGP methods exhibit higher diversity than the baseline, which consistently shows I values always close to 0, for a high number of generations.The 1 always showing the highest I .For all the different cGSGP methods, diversity decreases as generations progress, except for TXC, where it remains almost constant after an initial transitory phase.
Figure 15 shows the I trend for 400 as population size.As before, for V14 and KJ6, we can observe a fast convergence towards 0 for all cGSGP methods, while standard GSGP exhibits null I for the entire evolution, just as it happens for all the other datasets.T 2 1 is still the method with the highest diversity for all the different cases and the trend of diversity over generations is consistent with what was seen above, with TXC maintaining I values that are almost constant for all methods.
Figure 16 shows I trend for 900 as population size.Even here, V14 and KJ6 show trajectories which quickly flatten towards 0 for all cGSGP methods.Diversity is always null for the baseline for all datasets.Also, the rest of the analysis is entirely analogous to the previous cases.TXC shows nearly constant trends for all methods, while all the other datasets exhibit a decrease in diversity over generations for all methods.Once again, we can observe a clear increase in diversity as the neighborhood radius decreases.
Notice that T 2 1 shows the highest diversity in all the cases: local interactions are forced to happen only within a small neighbourhood.This means that the spread of good individuals across the generations will be slower than that obtained with a larger radius, resulting in a higher I.
In conclusion, the population size does not seem to affect the trend of diversity for the different methods, since we can observe analogous results for all the different sizes.On the contrary, the size of the neighbourhood seems to have a central role in the formation of spatial clusters preserving semantic diversity of individuals.

Solution size analysis
Finally, we provide an analysis on the size of the solutions.As we have already stated in Sect.3.1, the solution size always increases at every generation, because of the way the GSGP operators are defined.In this section, we try to analyze whether imposing a CA-inspired communication topology over GSGP leads to smaller solutions.This type of analysis is mainly justified by previous research works in which it was demonstrated that a spatial structure may help to reduce program growth [19,20].In a tree-based GP context, a common property that measures the solution size is the number of nodes (denoted as ).In GSGP, this value can easily become extremely large, increasing exponentially with the number of generations.For this reason, in the following, we report the base-10 logarithm in the number of nodes with a focus on its median value in the population.
As expected, the number of nodes increases exponentially w.r.t. the number of generations.In Figure 17 we show the distribution of the median log 10 ( ) in the last generation for all tested problems and for all repetitions.
Based on the plot, T 2 1 tends to discover smaller solutions than GSGP, while T 2 2 and T 2 3 tend to discover larger solutions than GSGP (except for T 2 2 with 0.6 ).This suggests that imposing a cellular spatial structure characterized by a small neighbourhood actually helps in limiting the program growth by some orders of magnitude.However, differently from other methods, when a cellular structure is imposed on GSGP, there is no clear reduction in the solution sizes for all neighbourhoods.Additionally, by comparing this outcome with the previous analysis on the convergence rates (Sec.5.3), the increase in median individual size is not hindering the quality of the solutions generated.

Discussion
Finally, based on the outcome of the experimental session, we try to summarize the main findings to provide proper answers to our research questions: RQ1 Is it possible to overcome the issue of converging too fast to small and possibly sub-optimal areas of the search space?
The issue of premature convergence is probably related to the lack of any inherent mechanism of diversity preservation in GSGP.To reduce this problem we imposed a spatial structure to the population, thus allowing local-only communication.This structure can limit the takeover time, allowing the preservation of diversity.Clearly, there is a trade-off between the amount of local communication and the semantics of the individuals that can be generated using only the individuals in a limited neighbourhood.In particular, for any semantic crossover the results are bound to be in the convex hull of the possible parents, a smaller neighbourhood restricts the part of the space where new individuals can be generated, thus limiting the possible semantics that can be obtained.Hence, new semantics might require more time to be generated, with a decrease in the speed of exploration.
Experimentally, we observed that a small neighbourhood radius provides better diversity among different neighbourhoods at the cost of slowing down the exploration.By increasing the radius, the exploration gets faster at the cost of losing overall diversity.Therefore, a reasonable tradeoff should be found to obtain both good diversity and a reasonably fast exploration.
By analyzing the Moran's I coefficient, we observe that cGSGP can provide diversity among different local neighbourhoods in the early stages of the evolution, enabling a more complete exploration of the search space before dominant solutions propagate to the entire population.This implies that a spatial structure such as a cellular-based one is beneficial also in the context of GSGP.
RQ2 Can cGSGP exploit the presence of local communication to produce better solutions than GSGP?
In the majority of cases, cGSGP-based methods can produce better solutions than GSGP.This is especially true for cGSGP-based methods with two or three as radius applied to real-world datasets.Hence, we assume that our neighbourhood-based selection strategy that exploits local communication is beneficial for the improvement of GSGP.Furthermore, by analyzing convergence rates, this seems to hold regardless of the actual number of generations executed.

Conclusion
In this paper, we introduce the cellular GSGP (cGSGP), a way to endow a communication topology on GSGP that allows to improve the performance compared to standard GSGP.This seems to be because by leveraging a spatial structure within GSGP we are able to increase the takeover time and reduce the corresponding loss of diversity, limiting GSGP operators to act only on individuals belonging to the same neighbourhood.In particular, in the early stage of the evolution, the population is characterized by low intra-neighbourhood and high inter-neighbourhood diversity, as measured via Moran's I coefficient.While not explored in this work, a cellular structure can be useful from a scalability point of view: local communications allow for easy distribution of the computation among multiple nodes.
In terms of future directions of research, the influence of topology is an important aspect to study.Furthermore, the effect of which local selection to use appears to be important, so this aspect requires further studies.Additional studies should also be performed to understand the propagation of information inside the population to obtain some insights into the interplay between topology and evolution.Finally, it would be interesting to implement cGSGP in a distributed way, taking advantage of the local-only communications that are part of the methods.

Fig. 1
Fig. 1 Example of a population with 100 individuals distributed according to T 2 1 .In the figure, two positions with the corresponding neighbourhoods are highlighted, following the toroidal structure when the position is close to the borders.Each cell contains the individual in that position, whose row index and column index are indicated as a subscript

Fig. 4 Fig. 5
Fig.4 Test results on the V14 dataset.The boxplot shows the distribution of RMSE at the last generations

Fig. 6 Fig. 7
Fig.6 Test results on the ARF dataset.The boxplot shows the distribution of RMSE in the last generations

Fig. 8 Fig. 9
Fig.8 Test results on the SLM dataset.The boxplot shows the distribution of RMSE at the last generations

Fig. 10 Fig. 11 Fig. 12
Fig.10 Test results on the YCH dataset.The boxplot shows the distribution of RMSE at the last generations

Fig. 13
Fig.13 Trend of median train RMSE of the population.The plot is based on E cx+mut and strategy, with 100 as population size and 1000 generations

Fig. 14 Fig. 15 Fig. 16
Fig. 14 Trend of I .The plot is based on E cx+mut and strategy, with 100 as population size and 1000 generations

Fig. 17
Fig.17 Distribution of the median log 10 ( ) computed on the population of the last generation (100 as population size for 1000 generations) for all the tested datasets and repetitions Diagram comparing a vanilla GSGP and our proposed cGSGP method.The figure highlights which are the components that differ between the two techniques

Table 2
Parameters adopted during the experimental phaseParameter Value

Table 3
Combinations of population size and number of generations that are tested.For each combination, the side length of the squared toroidal grid is shown, along with the corresponding total number of fitness evaluations

Table 4
Table of the median best fitness for cGSGP on the test set and all considered topologies

Table 5
Table of the median best fitness for cGSGP on the test set and all considered topologies

Table 6
Table of the median best fitness for cGSGP on the test set and all considered topologies