Evolving ensembles of heuristics for the travelling salesman problem

The Travelling Salesman Problem (TSP) is a well-known optimisation problem that has been widely studied over the last century. As a result, a variety of exact and approximate algorithms have been proposed in the literature. When it comes to solving large instances in real time, greedy algorithms guided by priority rules are the most common approach, the nearest neighbour (NN) heuristic being one of the most popular rules. NN is quite general but also very simple, so it may not be the best choice in some cases. Alternatively, we may design more sophisticated heuristics that consider the particular features of families of instances. To do so, we have to build priority rules from problem attributes other than the proximity of the next city. However, this process may not be easy for humans, and so it is often addressed by some learning procedure. In this regard, hyper-heuristics such as Genetic Programming (GP) stand out as one of the most popular approaches. Furthermore, a single heuristic, even if good on average, may not be good for a number of instances of a given set. For this reason, the use of ensembles of heuristics is often a good alternative, which raises the problem of building ensembles from a given set of heuristic rules. In this paper, we study the application of two kinds of ensembles to the TSP. Given a set of TSP instances having similar characteristics, we first exploit GP to build a set of heuristics involving a number of problem attributes, and then we build ensembles combining these heuristics by means of a Genetic Algorithm (GA). The experimental study provided valuable insights into the construction and use of single rules and ensembles, and clearly showed that the performance of ensembles justifies the extra time they require compared to individual heuristics.


Introduction
The Travelling Salesman Problem (TSP) is one of the most studied combinatorial optimisation problems, with many real-world applications in fields such as logistics, planning and communications (Punnen 2007; Mavrovouniotis et al. 2017). Consequently, many exact and approximate algorithms have been proposed in the literature over the last decades, such as the Lin-Kernighan heuristic (Lin and Kernighan 1973), the Christofides heuristic (Christofides 1976), or Genetic Local Search (GLS) (Freisleben and Merz 1996), among others. These approaches have demonstrated good performance, but when the problem instance is large, the solution must be computed in a limited time, or not all information is available at the beginning, the best, if not the only, option is a greedy algorithm, preferably guided by some kind of efficient heuristic rule.
Such heuristics may be designed manually by experts in the problem domain. This is the case of the simple and well-known Nearest Neighbour (NN) heuristic. But such a simple heuristic often fails to build a good solution due to the small amount of knowledge it can exploit. Therefore, more sophisticated heuristics are normally required to obtain outstanding solutions, but designing them is usually a hard and time-consuming task for experts (Branke et al. 2016). Alternatively, some hyper-heuristic may be used to search a given space of heuristics. In this context, Genetic Programming (GP) stands out as one of the most common approaches (Burke et al. 2019), and it has already been applied to a large variety of hard optimisation problems (Burke et al. 2012; Đurasević et al. 2016; Gil-Gala et al. 2019; Nguyen et al. 2019; Zhang et al. 2021).
Nevertheless, a single heuristic, even if it performs well on average over a large set of instances, may not be good for a number of them individually. For this reason, several approaches based on ensembles (sets of heuristics) have recently been proposed for some optimisation problems, such as one machine scheduling (Gil-Gala et al. 2020), unrelated machines scheduling (Đurasević and Jakobović 2019), job shop scheduling (Hart and Sim 2016; Park et al. 2018), resource-constrained project scheduling (Dumić and Jakobović 2021), or the capacitated arc routing problem (Wang et al. 2019), among others.
In this paper, we investigate two types of ensembles, termed collaborative and competitive, respectively. A collaborative ensemble builds a solution in such a way that all the rules contribute to each decision in every iteration, while in a competitive ensemble each rule is exploited to build an independent solution and the best of these solutions is taken as the solution of the ensemble. Each type of ensemble has its own weak and strong points. For example, collaborative ensembles may be exploited in online settings where the solution is being implemented at the same time as it is being built (Đurasević and Jakobović 2019), while competitive ensembles require all the problem data before starting to build a solution.
The ensembles we consider herein are aimed at solving the Dynamic TSP (DTSP), viewed as a sequence of static TSPs over a time horizon. Therefore, they are exploited to solve TSP instances within a limited time. This time should be much lower than the interval between every two consecutive TSP instances, which will of course depend on the particular DTSP setting.
To establish the extent to which the ensembles are viable for the DTSP, we performed an experimental study on the set of instances proposed in Duflo et al. (2019), which are instances of different sizes taken from the TSPLIB (2022). To create ensembles, we exploit Genetic Programming (GP) to evolve a set of heuristics, which are then combined into ensembles by means of a Genetic Algorithm (GA). The results of this study show that the quality of the solutions produced by the ensembles makes up for the larger time they require with respect to single rules, and that competitive ensembles perform much better than collaborative ones.
The remainder of the paper is organised as follows. In the next section, we give the formulation of the (Dynamic) TSP. The proposed solving method is described in Sect. 3, and Sect. 4 introduces ensembles of rules. Then, in Sect. 5, we detail the combined approach of GP and GA to evolve heuristics and ensembles. In Sect. 6, we report the results of the experimental study. Finally, in Sect. 7, we summarise the main conclusions and outline some ideas for future work.

The travelling salesman problem
In the classical version of the Travelling Salesman Problem (TSP), we are given a symmetric N×N matrix D, in which D_ij indicates the distance between cities i and j. The goal is to obtain an optimal tour, i.e., the shortest path that visits all cities and returns to the starting city. Figure 1 shows an instance with 5 cities and one of its solutions.
The dynamic version of the TSP, the DTSP, was introduced by Psaraftis (1998). Since then, a number of variants have been considered, but there is still no unified framework. In some cases, the distances between cities may change and, in other cases, some cities may be removed or added. A review and taxonomy of the models proposed over the last three decades are given in Psaraftis et al. (2016). In general, in an instance of the DTSP, the distances between the cities, D_ij(t), may change over time following some temporal pattern that depends on the underlying problem. In this way, the DTSP is a continuous problem, but in practical settings it is usually considered as a sequence of static TSP instances over a sequence of time points t_i, i = 1, ..., T, each interval (t_i, t_{i+1}) being sufficiently short that the instance at time t_i must be solved in real time, indeed by taking a time much lower than t_{i+1} − t_i. Therefore, a particular solution may be viewed as a permutation of the N cities s = (s_1, ..., s_N), which is evaluated as

C(s) = Σ_{k=1}^{N−1} D_{s_k s_{k+1}} + D_{s_N s_1}

Solving the TSP in real-time
In accordance with the previous definition, solving an instance of the DTSP amounts to solving a sequence of static TSP instances at time points t_i, i = 1, ..., T. It often happens that the instances at times t_i and t_{i+1} are very similar. In these cases, repairing a previous solution may be better than generating a new one from scratch; therefore, some population-based metaheuristic such as Ant Colony Optimisation (ACO) or Evolutionary Algorithms (EA) may be a good choice (Mavrovouniotis et al. 2017). However, if there are abrupt changes from t_i to t_{i+1}, making a fresh start on the new instance may be better. This is the option we take here; specifically, we propose to use a greedy algorithm guided by problem-domain priority rules, under the assumption that the time intervals t_{i+1} − t_i between consecutive solutions may be too short to use exact methods or even population-based metaheuristics.
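The tour evaluation above can be sketched in a few lines of Python (a minimal illustration; the distance matrix and tour below are made up for the example):

```python
def tour_length(tour, dist):
    """Total length of a closed tour, including the return to the start:
    sum of D[s_k][s_{k+1}] for consecutive cities, plus D[s_N][s_1]."""
    n = len(tour)
    return sum(dist[tour[k]][tour[(k + 1) % n]] for k in range(n))

# A 4-city symmetric instance with made-up distances
dist = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]
print(tour_length([0, 1, 3, 2], dist))  # 2 + 4 + 3 + 9 = 18
```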
In the TSP context, greedy algorithms are usually termed route generation schemes, as in each iteration they select the next city by applying some heuristic until a complete tour is built. The procedure we use here is given in Algorithm 1; the heuristic is used as a priority rule, meaning that it assigns a priority value to each unvisited city, and the city with the highest priority is selected to be visited next. An example of such a rule is the well-known Nearest Neighbour (NN) heuristic: the priority of the candidate city j after i is calculated from only one problem attribute, the distance D_ij, as 1/D_ij. Figure 2 illustrates the use of this heuristic; in the example, the next city to be visited after A will be B, as it is the closest to A among the unvisited cities.
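A minimal sketch of such a route generation scheme, with NN as the priority rule, could look as follows (the function names and the 4-city distance matrix are illustrative assumptions, not the paper's implementation):

```python
def build_tour(dist, priority, start=0):
    """Route generation scheme (in the spirit of Algorithm 1): repeatedly
    visit the unvisited city with the highest priority value."""
    tour = [start]
    unvisited = set(range(len(dist))) - {start}
    while unvisited:
        current = tour[-1]
        # Select the candidate city with the highest priority
        nxt = max(unvisited, key=lambda j: priority(current, j, unvisited))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

# Nearest Neighbour as a priority rule: priority of j after i is 1 / D_ij
dist = [[0, 2, 9, 10], [2, 0, 6, 4], [9, 6, 0, 3], [10, 4, 3, 0]]
nn = lambda i, j, unvisited: 1.0 / dist[i][j]
print(build_tour(dist, nn))  # [0, 1, 3, 2]
```

More sophisticated rules only need to replace the `priority` callable; the scheme itself stays unchanged.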
Because NN exploits very little information about the problem, it often builds a tour that looks good for the first cities but is really bad for the last ones, as the remaining unvisited cities end up quite dispersed and far from each other in the last iterations. To avoid NN's low performance, we may consider attributes other than the distance to the next candidate city; for example, some measure of the dispersion of the remaining unvisited cities. But a large number of attributes makes the problem of devising new heuristics difficult, so an automatic procedure may be the best option.
In Duflo et al. (2019), the authors consider 7 attributes and exploit GP to evolve priority rules, which are evaluated with quadratic time complexity. Those attributes were also exploited in Singh and Pillay (2022) with a novel hyper-heuristic based on ant colony optimisation (HACO). Both works show that the evolved heuristics actually outperform NN and some other well-known classic heuristic algorithms for the TSP, such as nearest insertion or the Christofides heuristic (Christofides 1976).
We conjecture that exploiting a small number of simple attributes could be enough to obtain reasonable heuristics that, in turn, can be evaluated in less time. This is the rationale of the GP approach proposed in Sect. 5.1. In addition to single rules, we also explore the use of ensembles of rules (see Sect. 4); the rationale is that by combining the recommendations of a set of rules we may take wiser decisions than with single rules.

Ensembles of rules
Under the assumption that a single rule may not be robust enough to produce good solutions for all instances in a given set, we explore here the use of ensembles. An ensemble is simply a set of rules. Figure 3 shows an ensemble composed of 3 rules.
Based on previous experience with other problems, such as one machine sequencing with variable capacity (Gil-Gala et al. 2020) or unrelated parallel machines scheduling (Đurasević and Jakobović 2019), we propose to use two kinds of ensembles, termed collaborative and competitive, respectively.
Collaborative ensembles are much like the classic ensembles used in other contexts such as classification or recommendation systems. The rationale behind them is that good rules take the right decision in most situations and fail in only a few of them. Therefore, the decision on the next city to visit may be taken by aggregating the recommendations of each rule in the ensemble. In Đurasević and Jakobović (2019), the authors analysed the two classic aggregation methods, namely sum and vote, and opted for the second to avoid the issue of normalising the priorities of the individual rules, which is not an easy problem in general. This is the approach we consider here as well. In the voting method, each rule assigns the value 1 to the city with the largest priority and 0 to the remaining ones. Then, these values are summed up and the city with the largest sum is chosen, breaking ties at random.
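The voting method just described can be sketched as follows (an illustrative Python fragment; the rule and distance matrix are made-up assumptions):

```python
from collections import Counter
import random

def vote_next_city(rules, current, unvisited):
    """Voting aggregation for a collaborative ensemble: each rule gives one
    vote to its highest-priority candidate; ties are broken at random."""
    votes = Counter()
    for rule in rules:
        favourite = max(unvisited, key=lambda j: rule(current, j))
        votes[favourite] += 1
    top = max(votes.values())
    return random.choice([city for city, v in votes.items() if v == top])

# Three copies of a nearest-neighbour rule over a made-up distance matrix
dist = [[0, 2, 9, 10], [2, 0, 6, 4], [9, 6, 0, 3], [10, 4, 3, 0]]
nn = lambda i, j: 1.0 / dist[i][j]
print(vote_next_city([nn, nn, nn], 0, {1, 2, 3}))  # 1 (all votes agree)
```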
In turn, the rules in a competitive ensemble work independently of each other, each building a different solution. Then, the best of these solutions is taken as the solution produced by the ensemble. The rationale of this kind of ensemble is that a good rule produces good solutions to some instances but may produce bad solutions to others; therefore, by taking different rules, one can cover reasonably well all the instances in a given set.
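A competitive ensemble can be sketched by running the greedy scheme once per rule and keeping the shortest tour (again an illustration under made-up data, not the actual implementation; the two rules below are assumptions for the example):

```python
def competitive_solve(rules, dist, start=0):
    """Competitive ensemble: each rule builds its own tour with the greedy
    scheme and the shortest tour is returned as the ensemble's solution."""
    def build(rule):
        tour, unvisited = [start], set(range(len(dist))) - {start}
        while unvisited:
            nxt = max(unvisited, key=lambda j: rule(tour[-1], j))
            tour.append(nxt)
            unvisited.remove(nxt)
        return tour

    def length(t):
        return sum(dist[t[k]][t[(k + 1) % len(t)]] for k in range(len(t)))

    return min((build(r) for r in rules), key=length)

dist = [[0, 2, 9, 10], [2, 0, 6, 4], [9, 6, 0, 3], [10, 4, 3, 0]]
nearest = lambda i, j: 1.0 / dist[i][j]   # nearest neighbour
farthest = lambda i, j: dist[i][j]        # a deliberately bad rule
print(competitive_solve([nearest, farthest], dist))  # [0, 1, 3, 2]
```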
In both cases, competitive and collaborative, ensembles may be built from a given set of heuristics, as proposed in Đurasević and Jakobović (2019), where the authors analysed 5 methods to create collaborative ensembles, namely random selection, probabilistic selection, grow, grow-destroy and instance-based. In all cases, the ensemble starts from a single random rule and new rules are then added iteratively up to a given limit. Each time a new rule is added, the partial ensemble must be evaluated on a training set of instances. As an alternative, we propose here to use a Genetic Algorithm (GA) to build ensembles of both types (see Sect. 5.2).

Evolving heuristics and ensembles
In this work, we use the same methodology as in Đurasević and Jakobović (2019) and Gil-Gala et al. (2022). First, a large set of heuristics (priority rules in this context) is evolved by Genetic Programming (GP), and then these rules are used to build ensembles by means of a Genetic Algorithm (GA).

GP for evolving priority rules
Priority rules are simple arithmetic expressions that may be naturally represented by trees. For this reason, the GP framework proposed by Koza (1992) is widely used to evolve new heuristic rules. To use GP, the first issue is to establish a set of symbols and some grammar. The grammar restricts the set of expression trees that can be built from the symbols, so it fixes the search space of GP. The set of symbols must include a number of attributes of the problem, some constants and a set of operators. In this work, we consider three problem attributes, namely:

- D_cn: the distance from c to n.
- D_in: the distance from i to n.
- D_c: the distance from the centroid of the unvisited cities to c.

where c denotes the current city in the partial tour built so far, i is the initial city, and n is a candidate city to be visited next. D_c is calculated as the distance between c and the point c_n (the centroid of the unvisited cities excluding n) defined by the coordinates x = (X − x_n)/(N_rm − 1) and y = (Y − y_n)/(N_rm − 1), where N_rm is the number of remaining cities to visit, X and Y are the sums of the x-values and y-values of the unvisited cities, and x_n and y_n are the coordinates of n. Figure 4 shows an example of these terminals. We also include 10 constants and a number of unary and binary arithmetic functions in the set of symbols. The whole set is given in Table 1. The set of attributes {D_cn, D_in, D_c} is indeed a subset of the 7 attributes considered in Duflo et al. (2019). As mentioned, the rationale of this selection is to consider a small number of attributes that are both meaningful and easy to evaluate.
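Assuming each city is given by its (x, y) coordinates, the three attributes could be computed as in the following sketch (names are illustrative; the corner case where no unvisited city other than the candidate remains is ignored for brevity):

```python
import math

def attributes(coords, current, initial, cand, unvisited):
    """The three terminal attributes; coords maps city -> (x, y).
    Assumes at least one unvisited city other than the candidate."""
    d_cn = math.dist(coords[current], coords[cand])   # D_cn
    d_in = math.dist(coords[initial], coords[cand])   # D_in
    # Centroid of the unvisited cities excluding the candidate,
    # i.e. ((X - x_n) / (N_rm - 1), (Y - y_n) / (N_rm - 1))
    rest = [c for c in unvisited if c != cand]
    cx = sum(coords[c][0] for c in rest) / len(rest)
    cy = sum(coords[c][1] for c in rest) / len(rest)
    d_c = math.dist(coords[current], (cx, cy))        # D_c
    return d_cn, d_in, d_c

coords = {0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (3, 0)}
print(attributes(coords, 0, 0, 1, {1, 2, 3}))  # (1.0, 1.0, 2.5)
```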
The GP strategy is rather conventional and quite similar to that of other studies (Đurasević et al. 2016; Gil-Gala et al. 2021; Nguyen et al. 2019; Zhang et al. 2021; Duflo et al. 2019). GP starts from an initial population generated by the well-known ramped half-and-half method (Koza 1992). Then, GP follows an evolutionary scheme in which parents are randomly paired at the beginning of each generation; each pair of parents is combined, and their offspring are mutated with a given probability. The genetic operators are the well-known one-point crossover and subtree mutation (Koza 1992). Finally, in the replacement phase, from each two parents and their offspring, the best child is selected unconditionally and the second selection comes from a tournament between the parents and the other offspring. The evaluation is the same as in Duflo et al. (2019) and Gil-Gala et al. (2022): each candidate rule is evaluated on a set of TSP instances (the training set), and the fitness value of the rule is given by the inverse of the average tour length over all instances.
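The fitness computation can be sketched as follows (the helper `tour_length_of` is a hypothetical stand-in for running the route generation scheme with the candidate rule on an instance):

```python
def fitness(rule, training_set, tour_length_of):
    """Fitness of a candidate rule: the inverse of the average tour length
    obtained by the rule over the training instances."""
    total = sum(tour_length_of(rule, inst) for inst in training_set)
    return len(training_set) / total  # 1 / (total / number of instances)

# Stub that pretends the rule obtains tour lengths 10 and 20
print(fitness(None, ["a", "b"], lambda r, i: {"a": 10, "b": 20}[i]))  # 1/15
```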

GA for building ensembles
To build ensembles, either collaborative or competitive, we are given a set of rules R, and the goal is to come up with a subset of at most P rules, such that the ensemble composed of these rules performs as well as possible on a given (training) set of TSP instances. In this work, we adapted the GA proposed to build competitive ensembles in Gil-Gala et al. (2022). This GA may be used to build collaborative ensembles simply by changing the evaluation function. As proposed in Gil-Gala et al. (2020, 2023), the encoding scheme is variations with repetition of the elements of R taken P at a time. Figure 5 depicts an example of ensemble encoding.
In this illustration, R consists of five rules, and the ensemble is represented as an array containing three rules, each encoded by its index: 3, 1, and 4. This encoding allows representing subsets of maximum size P and using classic genetic operators such as one-point crossover and single mutation. The evolutionary scheme is quite similar to that of GP described in Sect. 5.1, and the initial population is randomly generated.
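Under these assumptions, the encoding and the two operators could be sketched as follows (illustrative code, not the authors' implementation):

```python
import random

def random_ensemble(num_rules, P):
    """Chromosome: a variation with repetition, P indices into the rule set R."""
    return [random.randrange(num_rules) for _ in range(P)]

def one_point_crossover(a, b):
    """Classic one-point crossover on two chromosomes of equal length."""
    cut = random.randint(1, len(a) - 1)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def single_mutation(chrom, num_rules):
    """Replace one randomly chosen gene with a random rule index."""
    child = chrom[:]
    child[random.randrange(len(child))] = random.randrange(num_rules)
    return child

c1, c2 = one_point_crossover([3, 1, 4], [0, 2, 2])
print(c1, c2)  # e.g. [3, 2, 2] [0, 1, 4] with cut = 1
```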
Regarding the evaluation of candidate ensembles, there are substantial differences between collaborative and competitive ensembles. In the first case, each instance in the training set must be solved by each candidate ensemble, in a similar way as done by GP to evaluate a candidate rule. In a good collaborative ensemble, it is expected that most of the rules take the right decision in each iteration of the route generation scheme (see Algorithm 1) when it solves every instance in the training set; only in this way will the ensemble eventually produce a good solution.
However, the evaluation of competitive ensembles can be done much more efficiently. If the results of each rule in R on each instance of the training set were known in advance, we would not need to obtain a new solution for a candidate competitive ensemble, as its solution is just that of the best rule in the ensemble. However, for the sake of a fair comparison with collaborative ensembles, in the experimental study (see Sect. 6) we assume that these results are not known in advance. Therefore, each rule in a competitive ensemble must be evaluated on the training set, but only when it appears in an ensemble for the first time, as the result may be kept and reused in the same or later generations of the GA. In a good competitive ensemble, it is expected that at least one of the rules produces a good solution to each problem instance in the training set. In other words, the fittest competitive ensembles evolved by the GA should provide a good covering of the training set, i.e., for each instance in the training set, they should include one of the rules that perform best for that instance.
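The caching scheme described here can be sketched as follows (the `solve` helper is a hypothetical stand-in for running a rule on an instance; the numeric stub at the end only illustrates the covering idea):

```python
def make_competitive_evaluator(instances, solve):
    """Evaluator for competitive ensembles with caching: solve(rule, inst)
    is called only the first time a rule appears in any ensemble; later
    evaluations reuse the stored per-instance tour lengths."""
    cache = {}

    def lengths(rule):
        if rule not in cache:
            cache[rule] = [solve(rule, inst) for inst in instances]
        return cache[rule]

    def evaluate(ensemble):
        # For each instance, the ensemble scores the best length among its rules
        per_rule = [lengths(r) for r in ensemble]
        return sum(min(col) for col in zip(*per_rule)) / len(instances)

    return evaluate

# Stub: rule r "solves" instance i with length |r - i|, so the ensemble
# [1, 5] covers both instances perfectly while either rule alone does not.
evaluate = make_competitive_evaluator([1, 5], lambda r, i: abs(r - i))
print(evaluate([1, 5]))  # 0.0
```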

Experimental study
We performed an experimental study to assess the viability of the ensembles and to compare their performance with respect to that of individual rules.

Experimental setup
We implemented prototypes of GP and GA in Java 8 and ran a series of experiments on a Linux machine: a Dell PowerEdge R740 with 2 x Intel Xeon Gold 6132 (2.6 GHz, 28 cores) and 128 GB of RAM. To analyse the effect that the size of the instances may have on the quality of the ensembles, we used training sets composed of the N smallest instances, with N taking the different values shown in Table 2.
The set of rules R was produced by GP. It is composed of 42 000 rules, of which 35 296 are syntactically different. They were recorded from the last population of each GP execution. Specifically, 6 000 rules (200 individuals and 30 executions) were collected by training GP with each training set in Table 2.
The parameters used for GP and GA are summarised in Table 3.These values were taken from some previous experiments reported in Gil-Gala et al. (2022).We considered sizes 3, 5 and 7 for both types of ensembles, collaborative and competitive.For each configuration of parameters, GP and GA were executed 30 times, and the best, average, and standard deviation of the 30 solutions (heuristics or ensembles) were recorded on both the training and the test sets.
As mentioned, we have only used the vote combination method for collaborative ensembles (Đurasević and Jakobović 2019).
GP was first run starting from a random initial population of rules, and then from a population built from rules evolved in previous executions of GP. This was done for the sake of a fair comparison between ensembles and rules.
In all cases, the stopping condition of the algorithms was a given number of generations, but we also established a time limit of 1440 min. Thus, executions where the field "Time(min)" shows 1440 min are those in which the algorithm was stopped by the time limit before reaching the maximum number of generations. In addition, we report the number of syntactically different chromosomes (the field "Unique"), which denotes the average number of unique chromosomes (rules or ensembles) per configuration.

Analysis of GP and GA
In this section, we analyse the results of the rules and ensembles produced by GP and GA, respectively, with different settings.In particular, we will try to assess their generalisation capability.

Priority rules evolved by GP
Table 4 summarises the results obtained by the rules evolved by GP. For each setting (Init./N), the best and average tour lengths, and the standard deviations, of the 30 rules are reported for both the training and test sets; for the test set, the best value refers to the rule that was best in training. We can observe that, starting from a heuristic initial population, GP is more stable and is able to reach slightly better rules on average than when it starts from random populations. However, this difference vanishes on the test set. The time taken, as expected, grows in direct proportion to the size of the training set. The number of different chromosomes evaluated along each execution is about 1/4 of the maximum theoretical value (300 × 200 = 60 000), with the only exception being the largest training set, for which GP did not reach 200 generations. The differences between heuristic and random initial populations for different sizes of the training set may be better observed in the box plots of the results on the test set given in Fig. 6. As could be expected, the lowest value of N produces the worst results. Besides, there are statistically significant differences between random and heuristic initialisation for only 4 of the 7 values of N.

Collaborative ensembles evolved by GA
Table 5 and Fig. 7 summarise the results obtained by the collaborative ensembles evolved by GA. In this case, the main observation is that the performance of the ensembles on the test set improves with the size of the training set, and that there are no significant differences between the three sizes of the ensembles for each training set. Besides, the time taken by GA grows exponentially, so that it is unable to complete the 200 generations for N = 49 in all cases, and even for N = 42 for ensembles of sizes 5 and 7. This is not surprising, as GA must build a new solution for each instance in the training set in order to evaluate each candidate ensemble, and the average size of the instances grows with the value of N (see Table 2).

[Table 4 note: the last column shows the time taken by GP in one execution, and the next-to-last column the number of different candidate rules that were evaluated. Fig. 6 caption: box plots of the results achieved by the priority rules for each test set (N) and initialisation method (Hr random or Hh heuristic) (see Table 4); the p-values produced by the Wilcoxon signed-rank test are shown on top.]

Competitive ensembles evolved by GA
The results achieved by competitive ensembles are reported in Table 6 and Fig. 8, in a similar way as for collaborative ensembles. As in that case, we can observe that the performance of the competitive ensembles slightly improves on average with the size of the training set. But, unlike collaborative ensembles, the performance of competitive ensembles strongly improves with the size of the ensemble, as shown by the Kruskal-Wallis test (Fig. 8). Besides, the time taken by GA grows smoothly with the size of the training set, so that it is able to complete the 200 generations well before the time limit in all experiments. This is not surprising, as to evaluate a competitive ensemble the GA does not need to build new solutions for rules that took part in previous ensembles when searching for the best solution among the compounding rules. Here, we have to be aware that the efficiency of competitive ensembles could be further improved if the results of the rules on the training set were available beforehand, which may be a reasonable assumption. In this case, the time taken to evaluate a competitive ensemble would be independent of the size of the instances, and so the GA would run in time linear in the number of instances in the training set.

[Table 5 note: the last column shows the time taken by GA in one execution, and the next-to-last column the number of different ensembles that were evaluated. Fig. 7 caption: box plots of the results reported in Table 5 obtained by collaborative ensembles on the test set; the p-values produced by the Kruskal-Wallis test for the three ensemble sizes and each training set in the X-axis are given at the top.]

Comparison
In this section, we show a comparison between ensembles and single rules. We also provide a comparison against classic heuristics, like the nearest neighbour, and the best-known solutions in the literature obtained by exact methods and metaheuristics.

Single rules versus ensembles
Table 7 summarises the differences between ensembles and single rules with regard to the quality of the solutions reached. Collaborative ensembles do not always produce better results than single rules, and there are statistically significant differences in only 15 out of the 21 configurations (N, P). In turn, competitive ensembles are always better than single rules, with a single exception when comparing the best solutions reached by the best ensemble of 3 rules and the best rule, and they show statistically significant differences in all configurations.

Collaborative versus competitive ensembles
Figure 9 shows a comparison between collaborative and competitive ensembles by means of a series of box plots and p-values from the Wilcoxon signed-rank test, one for each configuration (N, P). We can observe that competitive ensembles perform better than collaborative ensembles in all cases. For some configurations, even the worst result from competitive ensembles is better than the best result reached by collaborative ensembles.

Run-time analysis
Since the route generation scheme given by Algorithm 1, guided by heuristics (rules or ensembles), is aimed at solving the DTSP, we have to analyse the time taken by the algorithms to assess their suitability for the dynamic changes of each particular setting. From the algorithm structure, it is clear that the execution time will depend on both the size of the static TSP instance and the size of the rule or ensemble exploited.
For this purpose, we generated 1 000 random rules and 1 000 random ensembles of size P = 5. All the instances in the test set were solved by these rules and ensembles, considering the latter both as collaborative and as competitive, and the time taken in each run was recorded. We evaluated each rule and ensemble independently and intentionally excluded the reuse of previously calculated results, so as to conduct an unbiased investigation of runtime performance.
Figure 10 shows the box plots of these experiments. The average times required to solve all instances in the test set were 1.62 and 1.63 s for competitive and collaborative ensembles respectively, and 0.3 s for the single rules. Regarding the size of the heuristics, i.e., the number of symbols in a rule or ensemble, one may expect it to be strongly correlated with the time taken by the algorithms. This is rather clear in Fig. 11, which shows the dispersion plots of the time taken versus the size of the heuristics; the correlation coefficients in the three plots indicate a high correlation.

[Table 7 note: the symbol ▲ means that the ensemble produces better results than single rules, or that the Wilcoxon signed-rank test shows statistical differences. Fig. 9 caption: box plots of the results achieved by collaborative and competitive ensembles of sizes 3, 5 and 7 on the test set, detailed in Tables 5 and 6.]
Finally, we have to analyse the influence of the problem size on the time taken by the algorithms. To this end, Fig. 12 shows the bar plot of time versus instance size for the best rule and ensembles obtained. We can see that the time taken is directly related to the number of cities.

Comparison against the state of the art
The TSP is an extensively studied problem, and numerous algorithms have been proposed to solve it, including the Lin-Kernighan heuristic (Lin and Kernighan 1973), the Christofides heuristic (Christofides 1976), and Genetic Local Search (GLS) (Freisleben and Merz 1996). While these algorithms demonstrate good performance, they face challenges when dealing with large problem instances, time constraints, or incomplete information. In such situations, a greedy algorithm guided by an efficient heuristic rule is often the preferred solution. That said, we consider classic heuristics such as NN or nearest insertion (NI), as well as the priority rules evolved by GPHH (Duflo et al. 2019) and HACO (Singh and Pillay 2022), as suitable references in the context of hyper-heuristics.
For our comparative study, we exploited NN and NI in combination with Algorithm 1 and considered the results of the best rules evolved by GPHH, as presented in Duflo et al. (2019). We also include the results obtained by the best configuration of HACO from Singh and Pillay (2022). The results obtained by all the mentioned methods, detailed for each instance of the test set, are reported in Table 8. The size of the ensembles is P = 7, and they were evolved from the largest training set (N = 49). The second column of the table shows the best-known solution for each instance, which in some cases is optimal.
As might be expected, all methods produce solutions much worse than the best-known ones, which were normally obtained by heavy exact or approximate methods that take much more time than greedy algorithms guided by heuristics. With regard to simple rules, it is clear that NN and NI perform worse than the best rules evolved by both GPHH and GP, showcasing the advantage of automatically calculated rules over classic ones. The best rule evolved by GP is better than that evolved by GPHH on average, and also in 11 of the 21 instances, showing that it is possible to obtain good rules from a small number of problem attributes.
We can see that ensembles produce the best solutions among all the heuristics considered, with competitive ensembles better than collaborative ones in all but 3 instances. Furthermore, we can also observe that competitive ensembles achieve results similar (on average) to those of HACO. However, the HACO approach has the drawback that the evolved heuristic is difficult for humans to interpret: a rule in HACO is encoded as a pheromone matrix, which is much harder to read than an expression tree.

Conclusions and future work
As in some previous works (Duflo et al. 2019; Gil-Gala et al. 2022), we have seen that Genetic Programming is a suitable hyper-heuristic to evolve priority rules for the TSP. In our experimental study, these rules outperformed some classic heuristics, such as Nearest Neighbour and Nearest Insertion. From the comparison between GPHH (Duflo et al. 2019) and the GP proposed in this work, we can see that using a small number of problem attributes keeps the search space of GP reasonably small. Therefore, it may reach better rules than those found in the search space generated from a large set of attributes.
We have also seen that ensembles of rules may produce better results than single rules, at the cost of execution time increasing linearly with the size of the ensemble. Of the two kinds of ensembles analysed, competitive ensembles showed much better performance than collaborative ones. Nevertheless, we must be aware that both were evaluated on building solutions for static TSP instances, which is a suitable framework when a DTSP is viewed as a sequence of static TSP instances over time. However, in other situations, dynamic problems require on-line solutions, i.e., the route is travelled at the same time as it is built. In these cases, competitive ensembles cannot be used, so collaborative ones may also represent a good alternative to single rules.
In future work, we plan to consider alternative combination methods (Durasevic et al. 2023; Park et al. 2018) and multiobjective optimisation (Durasevic et al. 2023). Additionally, we are interested in analysing alternative rule representations, such as neural networks (Branke et al. 2015; Jia et al. 2022) or pheromone matrices (Singh and Pillay 2022), to build ensembles.

Fig. 2 Application of the Nearest Neighbour (NN) heuristic

- D_cn: Distance from c to n.
- D_in: Distance from i to n.
- D_c: Distance from the centroid of the unvisited cities to c.
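These three attributes can be combined into a priority rule of the kind GP evolves. The sketch below uses a hypothetical combination (D_cn + D_c - D_in) chosen purely for illustration; it is not one of the rules actually evolved in the paper, and here i denotes a fixed reference city.

```python
import math

def centroid(points):
    # Arithmetic mean of a set of 2-D points.
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def priority(cities, c, i, n, unvisited):
    """Score for candidate city n (lower score = chosen first).

    The combination D_cn + D_c - D_in is purely illustrative.
    """
    d_cn = math.dist(cities[c], cities[n])      # D_cn: current city -> candidate
    d_in = math.dist(cities[i], cities[n])      # D_in: reference city -> candidate
    cen = centroid([cities[u] for u in unvisited])
    d_c = math.dist(cen, cities[c])             # D_c: centroid of unvisited -> current
    return d_cn + d_c - d_in
```

A greedy constructor would simply pick, at each step, the unvisited city n that minimises this score.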

Fig. 3 An example of an ensemble composed of three rules. D_cn denotes the distance from city c to n; D_in denotes the distance from city i to n; and D_c denotes the distance from the centroid of the unvisited cities to city c

The test bed is composed of the set of 70 TSP instances considered in Duflo et al. (2019) and Gil-Gala et al. (2022). They are the Euclidean instances (EDGE-WEIGHT TYPE=EUC 2D) from the TSPLIB (2022) having fewer than 4000 cities. As in the above works, the same 21 instances were used for testing and the remaining 49 instances were used for training, with the number of cities ranging from 52 (berlin52) to 3795 (fl3795).

Fig. 5 An example of an ensemble with three heuristics encoded in GA

Fig. 9 Box plots of the results achieved by collaborative and competitive ensembles on the test set (detailed in Tables 5 and 6). For each ensemble size P, the box plots are organised by increasing values of the size of the training set N. The numbers at the top are the p-values from Wilcoxon signed-rank tests

Fig. 10 Box plot of the time required (in milliseconds) to solve the test set with each heuristic type

Fig. 12 Bar plot of the time required (in seconds) to achieve the best solutions with each heuristic type

Table 1 Function and terminal sets used to build expression trees. The symbol '-' is considered in both unary and binary versions

Table 2 Description of the training and test sets of TSP instances used in the experimental study

Table 3 Parameters used by GP and GA

Table 4 Tardiness values of the solutions reached by the priority rules evolved by GP on both the training and test sets

Table 5 Tardiness values reached by the collaborative ensembles calculated via GA from different training sets (see Table 2) with maximum ensemble sizes P of 3, 5 and 7, on the training and test sets

Table 6 Tardiness values reached by the competitive ensembles calculated via GA from different training sets (see Table 2) with maximum ensemble sizes P of 3, 5 and 7, on the training and test sets

Box plots of the results achieved by competitive ensembles solving the test set that are reported in Table 6. The numbers at the top are the p-values produced by the Kruskal-Wallis test

Table 7 Summary of the comparison between rules and ensembles on the test set

Table 8 Comparison of the best rule and ensembles achieved by GP and GA against the Nearest Neighbour (NN) and Nearest Insertion (NI) heuristics, the Genetic Programming Hyper-heuristic (GPHH) proposed in Duflo et al. (2019), the Hyper-heuristic Ant Colony Optimisation (HACO) proposed in Singh and Pillay (2022), and the best-known (BK) solutions (BK 2022), over the whole test set