Evolving Continuous Optimisers from Scratch

This work uses genetic programming to explore the space of continuous optimisers, with the goal of discovering novel ways of doing optimisation. In order to keep the search space broad, the optimisers are evolved from scratch using Push, a Turing-complete, general-purpose, language. The resulting optimisers are found to be diverse, and explore their optimisation landscapes using a variety of interesting, and sometimes unusual, strategies. Significantly, when applied to problems that were not seen during training, many of the evolved optimisers generalise well, and often outperform existing optimisers. This supports the idea that novel and effective forms of optimisation can be discovered in an automated manner. This paper also shows that pools of evolved optimisers can be hybridised to further increase their generality, leading to optimisers that perform robustly over a broad variety of problem types and sizes.


Introduction
This work aims to explore the optimisation algorithms that lie outside of conventional human design space. It is based on the premise that innate biases in human thought cause us to only explore certain parts of any design space, and that evolutionary algorithms (EAs) can be a powerful tool for exploring these spaces more widely. For the most part, this work is curiosity-led. However, it also addresses recent issues surrounding the ad hoc design of new optimisers through mimicry of natural phenomena. Despite early success with EAs and particle swarm optimisation (PSO), this trend has increasingly resulted in optimisers that are technically novel, but which differ in minor and often arbitrary ways from existing optimisers [42,23,2]. If we are to create new optimisation algorithms, arguably it is better to do this in a more systematic, objective and automated manner.
Specifically, this approach uses PushGP [43] to explore optimiser design space. This builds on a significant history of using genetic programming (GP) to explore optimisation behaviours, particularly within the hyperheuristics [7] community, where it is used as a way of fitting existing optimisers to new problems. Since the focus of hyperheuristics is on adapting existing optimisation behaviours, these systems often work with very restricted languages. By comparison, if the aim is to explore a broad design space, it is desirable to use a language that allows the expression of diverse behaviours. This motivates the choice of PushGP [44], a genetic programming (GP) system based around a typed, Turing-complete language that contains both low-level primitives and control flow instructions. This provides a basis with which to design optimisers from scratch, largely avoiding the biases that result from re-using existing optimisation building blocks.
This paper presents results from applying this approach to problems in the domain of continuous function optimisation. It extends work reported in [24], which built upon methods introduced in [22] for evolving local optimisers. The main contributions are as follows: -It is shown that PushGP can be used to design both local and populationbased optimisers, largely from scratch. -It is shown that these optimisers display significant novelty, and are competitive against existing optimisers. -It is shown that significant generality can be achieved by training optimisers on only a single problem, so long as the problem instances are diverse. -It is shown that even more effective optimisers can be built by hybridising pools of diverse evolved optimisers.
The paper is organised as follows: Section 2 discusses related work, Section 3 describes the methods used in this work, Section 4 presents results and analysis, Section 5 discusses limitations and future work, and Section 6 concludes.

Related Work
Optimisation involves finding optimal solutions to problems. It is a broad field, but can roughly be broken into two sub-fields: gradient-based methods, which are typically mathematical approaches that use information provided by a function's derivatives, and gradient-free methods, which do not require a mathematical model of the problem. The study of gradient-free methods covers several, overlapping, communities, and this has led to somewhat diffusive terminology [46]. However, a distinction is often made between local search and population-based methods. Local search includes methods that follow a single trajectory through the search space, typically evaluating a single solution in each iteration of the algorithm. Well-known examples are random search, hill climbing, simulated annealing, and tabu search [26].
Population-based methods follow multiple trajectories, and can be further divided into population-centric methods, which manage the population centrally, and process-centric methods, in which search processes communicate in a distributed fashion. The archetypes of these two methods are the genetic algorithm (GA), and PSO, both modelled upon biological processes. However, many more have followed in their footsteps, and this has led to some debate over what constitutes a novel optimisation algorithm. In part, this debate rests of the meaning of the term metaheuristic, which is widely used to refer both to individual optimisation algorithms, and to broader optimisation concepts. If the latter meaning is adopted, then it has been argued that most so-called novel metaheuristics contain very little novelty [42,23,2].
Rather than resting on human design, an alternative approach to developing new optimisers is to use machine learning. Most existing approaches to doing this use GP, and fall within an area that has come to be known as generative hyperheuristics. Work in this area focuses on automatically fitting existing metaheuristics to particular problem instances. The targeted metaheuristics include GP itself [48,11,18], as well as various EA frameworks [38,25,27,31,51], swarm algorithms [33,36,6,16] and memetic algorithms [40,17]. There are two general approaches: assembling new optimisers from the components of existing optimisers [38,31,27,40,6], or designing new components for existing optimisers, particularly variation operators [48,11,18,33,10,51]. Various forms of GP have been used for this, including tree-based GP [11,27,35], linear GP [31,13,51], graph-based GP [48,18,41,40], grammatical evolution [25,6,16], and recently (building upon our earlier work [22]) PushGP [17]. Two ways in which most of these hyperheuristic approaches differ from this work are their focus on adapting existing metaheuristic frameworks, and the use of high-level primitives mined from existing algorithmic components. Possibly the closest existing approach is that of Shirakawa et al. [41], who used a graph-based GP with control flow instructions to explore a wider space of optimisers. However, their approach uses high-level primitives that are commonly found in human-designed optimisers, and their aim was to find new optimisers that broadly resemble existing human-designed algorithms.
Whilst the development of hyperheuristics has mostly taken place within the GP community, recently the deep learning and AutoML communities have also been exploring the use of machine learning to design optimisation behaviours. Their main focus has been on using deep learning to design new optimisers for training neural networks [1,49,29], but, more recently, there has been interest from the same community in using GP [34]. Similar to this work, the aim is to evolve optimisation (and, more generally, machine learning) behaviours from scratch, and they demonstrate how this approach can re-discover gradient descent. Although their goal is to reduce human bias, the approach uses a language largely comprised of mathematical primitives of the kind used by existing machine learning techniques, and no control flow oper-  ations. In this respect, the design space is significantly more constrained than the one used in this paper.

The Push Language
In this work, optimisation behaviours are expressed using the Push language. Push was designed for use within a GP context, and has a number of features that promote evolvability [43,45,44]. These include the use of stacks, a mechanism that enables evolving programs to maintain state with less fragility than using conventional indexed memory instructions [20]. However, it is also Turing-complete, meaning that it is more expressive that many languages used within GP systems. Another notable strength is its type system, which is de-signed so that all randomly generated programs are syntactically valid, meaning that (unlike type systems introduced to more conventional forms of GP) there is no need to complicate the variation operators or penalise/repair invalid solutions. This is implemented by means of multiple stacks; each stack contains values of a particular type, and all instructions are typed, and will only execute when values are present on their corresponding type stacks.
There are stacks for primitive data types (Booleans, floats, integers) and each of these have both special-purpose instructions (e.g. arithmetic instructions for the integer and float stacks, logic operators for the Boolean stack) and general-purpose stack instructions (push, pop, swap, duplicate, rot etc.) associated with them. Another important stack is the execution stack. At the start of execution, the instructions in a Push program are placed onto this stack and can be manipulated by special instructions; this allows behaviours like looping and conditional execution to be carried out. Finally, there is an input stack, which remains fixed during execution. This provides a way of passing non-volatile constants to a Push program; when popped from the input stack, corresponding values get pushed to the appropriate type stack.
GP systems that use the Push language are known as PushGP. Since a Push program is essentially a list of instructions, it can be represented as a linear array and manipulated using GA-like mutation and crossover operators. This work uses a modified version of Psh 1 (a Java implementation of PushGP) called OptoPsh 2 .

Domain-Specific Instructions
During the course of an optimisation run, an optimiser moves between search points within a particular optimisation problem's search space, resulting in an optimisation trajectory. To allow optimisers to store, express and manipulate search points within this trajectory, an extra vector type has been added to the Push language. This represents search points as fixed-length floating point vectors, one for each dimension of the search space. These can be manipulated using the special-purpose vector instructions shown in Table 1, which include: -Standard mathematical operations, such as pair-wise and component-wise arithmetic, scaling, dot product, and magnitude. -Operations that create a random vector and push it to the vector stack.
There are variants for generating unit vectors and vectors with defined bounds; for the latter, the current value f on the float stack is used to set the bounds for each dimension to the interval [−f, f ]. vector.apply and vector.zip instructions, which allow code to be applied to each component (or each pair of components in the case of zip) using a functional programming style.
-The vector.between instruction, which returns a point on a line between two vectors. For this instruction, the distance along the line is determined by a value popped from the float stack; if this value is between 0 and 1, then the point is a corresponding distance between the two vectors; if less than 0 or greater than 1, then the line is extended beyond the first or second vector, respectively.

Evolving Optimisers
Algorithm 1 outlines the fitness function for evaluating evolved Push programs. A Push program is required to carry out a single move, or optimisation step, each time it is called. In order to generate an optimisation trajectory within a given search space, the Push program is then called multiple times by an outer loop until a specified evaluation budget has been reached. After each call, the value at the top of the Push program's vector stack is popped and the corresponding search point is evaluated. The objective value of this search point, as well as information about whether it was an improving move and whether it moved outside the problem's search bounds, are then passed back to the Push program via the relevant type stacks. During an optimisation run, the contents of a program's stacks are preserved between calls, meaning that Push programs have the capacity to build up their own internal state, and consequently the potential to carry out different types of moves as search progresses. This framework is used to evolve and evaluate both local and populationbased optimisers. For a local optimiser, there is a single Push program, and a single set of stacks that are seeded with a random search point within the bounds of the current optimisation problem. For a population-based optimiser, multiple copies of the same Push program are executed in parallel, and are able to communicate with each other during the course of an optimisation run. At this point, it is important to make a distinction between the population of PushGP (i.e. the population of all programs that are being evolved) and the population of the optimiser (i.e. a population that is formed from multiple copies of a single evolved program during an optimisation run). Since there is scope for confusion, the latter will be referred to as a swarm from now on. So, for a swarm of size s, there are s copies of a particular program. Each swarm member is started at a different random search point within the bounds of the current optimisation problem. Each copy of the program has its own stacks, meaning that swarm members are able to build up their own internal state independently. Once a optimisation run has started, the swarm members remain persistent, i.e. there is no selection mechanism that creates and destroys them.
To allow coordination between swarm members during an optimisation run, two extra instructions are provided, vector.current and vector.best. These both look up information about another swarm member's search state, pushing either its current or best seen point of search to the vector stack. The Algorithm 1 Evaluating an evolved PushGP optimiser 1: fitness ← 0 2: for repeats do Measure fitness over multiple optimisation runs 3: pbest ← ∞ 4: for p ← 1, popsize do Initialise population state 5: prog p ← copy of evolved Push program 6: clearstacks(prog p ) 7: point p ← random initial point within search bounds 8: valuep ← evaluate(point p ) 9: push(point p , prog p .vector) Pass initial search point to program 10: push(valuep, prog p .float) Pass initial objective value to program 11: push(true, prog p .boolean) 12: push(bounds, prog p .input) Put search space bounds on input stack 13: bestvalp ← valuep 14: if bestvalp < pbest then 15: pbest target swarm member is determined by the value at the top of the integer stack (modulus the swarm size to ensure a valid number); if this stack is empty, or contains a negative value, the current or best search point of the current swarm member is returned. This sharing mechanism, combined with the use of persistent search processes, means that the evolved population-based optimisers resemble swarm algorithms such as PSO in their general mechanics. However, there is no selective pressure to use these mechanisms in any particular way, so evolved optimisers are not constrained by the design space of existing swarm optimisers.
The evolutionary parameters are shown in Table 2.

Hybridising Optimisers
An advantage of using EAs is that they tend to find diverse solutions, both due to variance between runs, and due to the use of a population of solutions within runs. This diversity can often be leveraged by combining solutions to form an ensemble, which in turn can be advantageous in terms of generality. Within an optimisation context, the combining of optimisers is usually referred to as hybridisation, and there is a long history of hybridising optimisers in order to build on their individual strengths [5]. For instance, memetic algorithms, which combine EAs with local search metaheuristics, often out-perform their constituent parts [9]. There are many ways in which evolved Push optimisers could potentially be hybridised. In this work, a heterogeneous swarm is used, where each member of the swarm can run a different evolved Push program. The Push programs used in a particular swarm can be sourced from a single PushGP population, or they can be assembled from multiple PushGP runs. The Push program assigned to a particular swarm member can then be persistent throughout an optimisation run, or it can be changed at each iteration. In this paper, a relatively simple approach is investigated, in which we take the best program from each one of a group of PushGP runs, and then randomly assign each swarm member of the heterogeneous swarm a randomly-chosen program at each iteration.

Evaluating Optimisers
Optimisation problems were selected from the widely used CEC 2005 realvalued parameter optimisation benchmarks [47]. These are all minimisation problems, meaning that the aim is to find the input vector (i.e. the search point) that generates the lowest value when passed as an argument to the function. Five of these were used to train optimisers. These were chosen partly because they provide a diverse range of optimisation landscapes, but also because they are relatively fast to evaluate: -F 1 , the sphere function, a separable unimodal bowl-shaped function. It is the simplest of the benchmarks, and can be solved by gradient descent. -F 9 , Rastrigin's function, is non-separable and has a large number of regularly spaced local optima whose magnitudes curve towards a bowl where the global minimum is found. The difficulty of this function lies in avoiding the many local optima on the path to the global optimum, though it is made easier by the regular spacing, since the distance between local optima basins can in principle be learnt. -F 12 , Schwefel's problem number 2.13, is non-separable and multimodal and has a small number of peaks that can be followed down to a shared valley region. Gradient descent can be used to find the valley, but the difficulty lies in finding the global mimimum, since the valley contains multiple irregularly-spaced local optima. -F 13 is a composition of Griewank's and Rosenbrock's functions. This composition leads to a complex surface that is non-separable, highly multimodal and irregular, and hence challenging for optimisers to navigate. -F 14 , a version of Schaffer's F 6 Function, comprises concentric elliptical ridges. In the centre is a region of greater complexity where the global optimum lies. It is non-separable and is challenging due to the lack of useful gradient information in most places, and the large number of local optima.
The higher-numbered functions in the CEC 2005 benchmarks comprise variants of four more composition functions, each of which is significantly more complex than those listed above, and each of which takes approximately two order of magnitude longer to evaluate. Due to the number of function evaluations required to evolve optimisers, it was not feasible to use these functions for this purpose. However, one of each type (namely F 15 , F 18 , F 21 and F 24 ) were selected to measure the broader generality of the evolved optimisers in the follow-up analysis.
To discourage overfitting to a particular problem instance, random transformations are applied to each dimension of these functions when they are used to measure fitness during the course of an evolutionary run. Random translations (of up to ±50% for each axis) prevent the evolving optimisers from learning the location of the optimum, random scalings (50-200% for each axis) prevent them from learning the distance between features of the landscape, and random axis flips (with 50% probability per axis) prevent directional biases, e.g. learning which corner of the landscape contains the global optimum.
Fitness is the mean of 10 optimisation runs, each with random initial locations and random transformations. During training, 10-dimensional versions of the functions are used, and the evaluation budget for a single optimisation run is 1E + 3 fitness evaluations (FEs). For the results tables and figures shown in the following section, the best-of-run optimisers are reevaluated over the CEC 2005 benchmark standard of 25 optimisation runs, and random transformations are not applied.

Results
Push optimisers were trained separately on each of the five training functions. Both local search and population-based optimisers were evolved. The local search optimisers have a single point of search, and each optimisation run lasts for 1E +3 iterations during training. For population-based optimisers, the F9_5x200 1E + 3 evaluation budget can be split between the swarm size and the number of iterations in different ways. In these experiments, splits of (swarm size × iterations) 50×20, 25×40 and 5×200 are used. For each of these configurations, and for each of the five training functions, 50 independent runs of PushGP were carried out. The first thing evident from these fitness distributions is that the tradeoff between swarm size and number of iterations is more significant for some problems than others. For F 1 , better optimisers are generally found for smaller swarm sizes, with the local search (i.e. 1×1000) distribution having the lowest mean error. This makes sense, because the unimodal F 1 landscape favours intensification over diversification. For F 12 , the sweet spot appears to be for 5×200, possibly reflecting the number of peaks in the landscape (i.e. 5). For the other problems, the differences appear relatively minor, and effective optimisers could be evolved for all configurations. In most cases, the best optimiser for a particular problem is an outlier within the distributions, so may not reflect any intrinsic benefit of one configuration over another. That said, four of these best-in-problem optimisers used small populations (2 with 1×1000 and 2 with 5×200). This suggests that it might be easier to find optimisers that use smaller populations rather than larger ones, perhaps reflecting the additional effort required to evolve effective coordination mechanisms within population-based optimisers. Fig. 1 also shows that, for all the training problems, the PushGP runs found at least one optimiser that performed better, on average, than G-CMA-ES. For the simplest problem F 1 , there was only one evolved optimiser that beat the general purpose optimiser. For the other problems, many optimisers were found that performed better. This is perhaps unsurprising, given that the capacity to overfit problems is a central motivation for existing work on hyperheuristics. However, an important difference in this work is the use of random problem transformations during training, since this causes the problems to exhibit greater generality, preventing optimisers from over-learning specific features of the landscape. The results suggest that this does not significantly impact the ability of evolved optimisers to out-perform general purpose optimisers on the problem on which they were trained. Table 3: Generality of evolved optimisers. For each optimiser trained with a fitness evaluation budget (FEs) of 1E + 3 function evaluations on 10D functions, mean errors are shown for 25 optimisation runs of the 10D, 30D and 50D versions of the functions for a budget of both 1E + 3 and 1E + 4. The mean rank is also shown, and the best result for each combination of problem dimensionality (D) and evaluation budget is underlined for each function number and ranking. Grey text shows where an optimiser is being evaluated on the function on which it was trained.

D FEs
Optimiser

Generality of Evolved Optimisers
However, the ability to out-perform general purpose optimisers on the problem on which they were trained is arguably not that important, especially given the substantial overhead of evolving the optimiser in the first place. Of more interest is how the evolved optimisers generalise to larger and different problems that they were not trained on. Table 3 gives insight into this, showing how well the best evolved optimiser for each training function generalises to both larger instances of the same function and to the other eight functions (i.e. the other four training functions, and the four used only for evaluation). Mean error rates are shown both for the 10-dimensional problems used in training, and for the more difficult 30 and 50-dimensional versions, both with the 1E + 3 evaluation budget used in training and for a larger 1E + 4 evaluation budget. To give an indication of relative performance, the average rank of each optimiser across the nine functions is also shown for each combination of dimensionality and evaluation budget. Table 3 shows that most of the evolved optimisers generalise well to the 30D and 50D versions of the 10D function on which they were trained. The optimisers trained on F 12 , F 13 and F 14 functions do best in this regard, outperforming G-CMA-ES on the 30D (in addition to the 10D) versions of these functions. The F 1 optimiser is the only one which generalises relatively poorly to larger problems, being beaten by G-CMA-ES and several of the other evolved optimisers on the 30D and 50D versions.
However, the most interesting observation from Table 3 is that many of the optimisers also generalise well to other functions, often out-performing the optimiser that was trained on lower-dimensional versions of the same function. Notably, in terms of its mean ranking across these nine functions, the F 12 optimiser does better than G-CMA-ES at all dimensionalities for a budget of 1E + 3 FEs. This level of generality is quite surprising, given that it only saw one of the nine functions (and only at 10D) during training.
Whilst the F 12 optimiser has the best mean rank at 10D and 30D, it is surpassed by the F 9 optimiser at 50D, which ranks first for five out of the nine functions. This is also quite suprising, not only because the F 9 optimiser does relatively poorly at lower dimensionalities, but also because it was trained on a landscape where the optima are always regularly spaced -which is not true of the other functions. However, it is known that the relative rankings of optimisers can change considerably as the dimensionality of a search space increases [14], and this seems to be a reflection of this more general observation.
It is notable that G-CMA-ES always performs best on two of the most complex landscapes, F 21 and F 24 . This may suggest that training on simpler functions limits the generality of optimisers when applied to more complex landscapes. However, G-CMA-ES is beaten in most cases by the F 12 optimiser on the other two complex landscapes, F 15 and F 18 , so this conclusion is certainly not clear. Table 3 also shows what happens when the evaluation budget is increased to 1E + 4 FEs, an order of magnitude more than the budget the evolved optimisers were able to use per optimisation run during training. First of all, it is evident that all the evolved optimisers continue to make progress when run for longer, so the restricted training budget does not appear to have impaired their generality in this regard. At 10D, the F 12 optimiser ties with G-CMA-ES. However, at 30D and 50D, G-CMA-ES does out-perform the evolved optimisers when using a larger evaluation budget. This suggests that training on 1E + 3 FEs pushes evolution towards optimisers that make relatively good progress on small budgets, but which may be less competitive on larger budgets. Nevertheless, it is notable that the F 12 optimiser still ranks first for three of the 30D functions, the same number as G-CMA-ES.
Overall, these results demonstrate that it is possible to train an optimiser on a single function at a relatively low dimensionality, and, so long as the instances are diverse enough to prevent overfitting (i.e using random transformations in this case), the resulting optimiser can generalise to a diverse range of optimisation landscapes at significantly higher dimensionalities. This is useful to know, since it means that optimisers can be trained relatively efficiently, i.e. they do not need to be evaluated on multiple functions at high dimensionalities during the evolutionary process. Table 4 shows the evolved Push expression used by the best evolved optimiser for each training function, in each case slightly simplified by removing instructions that have no effect on their fitness. Whilst it is sometimes possible to understand their behaviour by looking at the evolved expressions alone, it is usually possible to gain more insight by observing the interpreter's stack states as they run, and by observing their trajectories on 2D versions of the function landscapes. Figs. 2 and 3 show examples of the latter, both for the functions they were trained on, and the other eight functions, respectively. In almost all cases, optimisers generalise well to the easier 2D version of the function they were trained on, and it can be seen in Fig. 2 that in each case the optimiser's trajectory approaches the global optimum. Overall, these five optimisers display a diverse range of search behaviours, a number of which are quite novel:

Behaviour of Evolved Optimisers
F 1 best optimiser This optimiser uses a swarm of five parallel search processes. At each iteration, each of these moves to a position relative to the swarm best (i.e the best point seen so far by all swarm members). In each case, this is done by adding a random vector (vector.wrand) to the swarm best.   The size of this random vector is determined using a trigonometric expression based on the value of certain dimensions of the swarm member's current and best search points. This means that the move size carried out by each swarm member at each iteration is different, which leads to the generation of move sizes over multiple scales. For the F 1 landscape, the small moves are visible as the trajectory approaches the optima, and the large moves can be seen by the radiating coloured lines.
F 9 best optimiser This is a local optimiser with only one point of search. It continually switches between searching around the best-seen search point and evaluating a random search point. When searching around the best-seen point, at each iteration it adds the sine of the iteration number to a single dimension of the vector, moving along two dimensions each time. In essence, the periodic nature of the sine function causes the trajectory to systematically explore the nearby search space, which can be seen by the space-filling patterns centred around the final point of search in Fig. 2. This use of trigonometric functions appears to be relatively common amongst evolved optimisers, as reflected in Table 5, which lists the 20 most frequent instructions found for each of the four swarm size × iterations configurations. Boolean signals to determine each move. Notably, each swarm member uses a deterministic geometric pattern to explore the space around the current search point. This is depicted in Fig. 4, showing how it causes the optimiser to explore neighbourhoods that lie at powers-of-two distance from its centre. Once a neighbourhood has been seeded, it progressively grows outwards, until it eventually meets the surrounding neighbourhoods. The sampling rate for a neighbourhood is proportional to its distance from the centre, meaning that closer regions are explored with greater granularity, and the neighbourhoods are seeded progressively, meaning that it only searches further out if the nearer neighbourhoods are not productive. This pattern causes it to explore local neighbourhoods over a number of scales, enabling it to search within landscapes that are considerably larger than the one it was trained on, such as F 1 and F 14 .
F 13 best optimiser This optimiser, by comparison, has the simplest program. It is a local optimiser and, at each iteration, it adds a random value to one of the dimensions of the best-seen search point, cycling through the dimensions on each subsequent move (hence why it generates a cross-shaped trajectory in 2D). The size of each move is determined by both the sine of the objective value of the current point and the sine of the largest dimension of the search space (accessed using the input.inall instruction, which is commonly used in the local optimisers). The former causes the move size to vary cyclically as search progresses, and the latter allows it to adapt the move size to the landscape size, at least to an extent. Although this optimiser only (with the exception of the first move) explores moves in one dimension at a time, it still works quite well on the non-separable landscapes.
F 14 best optimiser This optimiser is the only one of the five which uses both a larger swarm size (of 25) and the vector.between instruction. Each iteration, each swarm member uses this instruction to generate a new search point halfway between the swarm best and one of its own previous positions, which are all retained within the vector stack. A small random vector is then added to each half-way point, presumably to inject diversity. Importantly, which previous position is used for a particular swarm member is determined by the swarm member's index: the first swarm member uses its current position, and higher numbered swarm members go back further in time. This allows a backtracking behaviour within the swarm, presumably useful for landscapes, such as F 14 , that are deceptive and have limited gradient information. From Table 5, it can be seen that the usage of vector.between generally increases as the swarm size increases, perhaps reflecting pressure to carry out more coordinated behaviour in larger swarms in order to counteract the smaller number of iterations.
A general observation from Fig. 3 is that the optimisers produce visibly different trajectories on landscapes that have very different search bounds to the function they were trained on. For instance, the F 9 and F 13 optimisers both carry out much smaller moves when applied to the F 1 and F 14 landscapes, and the F 14 optimiser carries out much larger moves on the landscapes with smaller bounds. This suggests that they may have overfit the size of the landscapes to some degree, despite the use of random instance scalings during training, and there may be some benefit to using either using a wider range of scalings, or normalising the search bounds. However, normalising search bounds may be problematic for real-world problems where the region containing the optima is not known.

Diversity of Evolved Optimisers
Due to the stochastic nature of EAs, it is common for different solutions to be found in different runs. Figs. 5 and 6 illustrate some of the diversity found between runs in these experiments. Fig. 5 shows trajectories for a selection of best-in-run 1×1000 optimisers trained on the F 1 function. These optimisers show significant variation in both the pattern of moves they carry out and the size of moves, highlighting that substantial inter-run diversity is present even for local optimisers trained on a unimodal landscape. Fig. 6 shows that considerable diversity is also present amongst the population-based optimisers. These plots suggest that interesting ideas about how to do optimisation could be gained by looking more closely at the diverse solutions found between runs.
Another way to get some insight into the amount of behavioural diversity is to look at whether there are any notable differences in how the optimisers within a batch of runs (corresponding to a particular configuration and training function) generalise to the nine different functions. Fig. 7 uses parallel coordinate plots to show this; specifically, for each optimiser within a batch of 50 runs, how they perform relative to one another on each of the CEC 2005 functions. In general, it is evident that not all the optimisers in a batch have the same performance profile across functions. For the F 13 and F 14 batches, in particular, it is clear that there are optimisers within the batch that generalise to different functions to the best optimiser within the batch. For example, the F 13 batch shows three groups: the best F 13 optimisers (in red), which mostly generalise to F 9 , F 12 and F 15 , but not to F 18 and higher; the intermediate F 13 optimisers (orange), which generalise better to F 18 and higher, and the weaker F 13 optimisers (green), which generalise to F 14 and do better than the first group on F 18 and higher. However, to varying degrees, this stratification is also present in the other batches, and suggests that repeated PushGP runs do lead to significant behavioural diversity.
As an aside, the F 14 plot is also notable for showing that the best F 14 optimisers do not generalise well to most other problems (with the exception of F 1 ). Likewise, it is mostly the case that the best optimisers for the other training functions do not generalise well to F 14 , as shown by the upward peak at F 14 in most of the plots. This suggests that F 14 may not be a good function for training generalisable optimisers. The same appears true, to a lesser extent, for F 13 ; however, the best optimisers here still generalise to several other functions. For optimisers trained on the lower-numbered functions, by comparison, the best optimisers generalise relatively well to the other functions (at least within their batch). This is especially true for F 12 , where the optimisers which do best on F 12 also tend to do best on the other functions. Taking into account the broader good performance of the best F 12 optimiser, this may point towards F 12 being a sweet-spot in terms of complexity -with the higher-numbered functions being too complex, and therefore encouraging overlearning of their topological characteristics, and the lower-numbered functions being too simple to train behaviourally complex optimisers. However, this is a somewhat speculative conclusion, and requires further investigation.   Fig. 7: Parallel coordinate plots showing the relative performance of each bestof-run optimiser within a batch of runs. Plots are shown for the five batches of runs that contained the best optimiser for each training function, i.e. from top to bottom: F 1 5×200, F 9 1×1000, F 12 5×200, F 13 1×1000, F 14 25×20. Each line depicts the performance profile of a particular optimiser (low values are better), and the colour of the line shows its relative performance on the function that was used in training, i.e. those towards the red end of the spectrum had a relatively high fitness on their training function.

Hybridisation of Evolved Optimisers
Another potential benefit of having diverse solutions is the capacity to hybridise optimisers. Much like ensembles in machine learning, hybridisation offers the potential to combine multiple optimisers in order to improve their generality. Here we take an initial look at this idea, by combining multiple evolved Push optimisers into one hybrid optimiser. In Section 3.4, a simple (a) All 1×1000, n = 25, on F 1 (b) F 1 5×200, n = 10, on F 18 approach to hybridising an arbitrary number of evolved Push programs into a single optimiser with an arbitrary swarm size was outlined. This involves each swarm member randomly selecting a Push program (from amongst a pool of previously evolved programs) at each iteration. Random allocation on a periteration basis avoids having to match the size of the pool to the swarm size. The disadvantage of this approach is that different programs have to share their stack states between invocations. However, initial experiments suggested that this method is more effective than persistently assigning a different program to each swarm member. In future work, it would be interesting to look more closely at the effect of how the programs are hybridised, but here the investigation is restricted to evaluating the effect of which programs are hybridised. Table 6 shows the performance of hybrids that are constructed from various pools of evolved Push optimisers. There are two types of pools used. First, function-specific pools. For each of the training functions, these pools are assembled from the best-in-run optimisers from the batch where the best optimiser for that function was evolved; for instance, for F 1 , the best evolved optimiser had a configuration of 5 × 200, so the pool is constructed by selecting the best optimiser evolved in each of the 50 runs in the F 1 5 × 200 batch. Second, function-agnostic pools, where, for each of the swarm size × iterations configurations, all of the best-in-run optimisers for that configuration are assembled into a pool. For instance, for the All 1 × 1000 pool, the pool is assembled from the 1 × 1000 batches for each of the five training functions, forming a pool of 250 optimisers. Hybrid optimisers are then constructed from the top n solutions present within each pool, for various values of n.
Results are only shown for the hardest (50D) function instances. On these functions, there appears to be a clear benefit to using hybridisation, at least in terms of mean error rate. Notably, for every function except F 9 , the best error rate (shown in green) in Table 6 is produced by a hybrid optimiser, rather than by the stand-alone best optimiser from each pool (these are indicated by a pool size of 1). Also, for every function-specific pool except F 12 , one or more of the hybrids shows better generalisation across all the functions than the stand-alone best. However, the sweet spot, in terms of n, varies across the pools, with a hybrid of 5 optimisers best for the F 9 , F 13 and F 14 pools, and a hybrid of size 20 best for the F 1 pool.
Another important observation is that hybrids constructed from functionspecific pools tend to perform better than those constructed from functionagnostic pools, with the best error rate being found by function-specific hybrids for all but the easiest function, F 1 . This is unexpected, since the optimisers trained on one function would be expected to be less diverse than those found within a pool of optimisers trained on different functions. However, the trend is quite clear in Table 6. One possible explanation is that the optimisers within a function-specific pool may be better adapted to one another.
Hybrids assembled from optimisers trained on the F 13 function appear to generalise particularly well, producing the lowest error rates on five of the nine functions. Apart from F 9 , the n = 5 F 13 hybrid performs significantly better on every 50D function than the F 9 best optimiser, which was the winner amongst the stand-alone optimisers for this dimensionality. This is perhaps surprising given that the F 13 best optimiser performs relatively poorly on the 50D functions (see Table 3). On the other hand, Fig. 7 shows that the F 13 pool contains relatively high diversity in terms of the functions that the best-in-run optimisers perform well at. In the case of ensembles, it is beneficial for the models from which they are composed to make mistakes on different problem instances, and it seems plausible that the same would be true when hybridising optimisers.
To give some insight into the behaviour of the F 13 hybrids, Figs. 8a and 8b depict the trajectories of the n = 5 and n = 20 versions on a number of 2D functions. It is evident that, for at least the 2D versions of these functions, the hybrid optimisers are capable of hopping between different local optima basins on even the most challenging landscapes. Contrast this with the behaviour of some of the optimisers from which they are comprised, whose trajectories are shown in Fig. 8c. On F 12 , the component optimisers struggle to escape from the local optimum basin nearest to their starting point, but the n = 5 hybrid (Fig. 8a, left) is able to hop around the F 12 landscape and find the global optimum basin.
These results indicate that hybridisation of evolved optimisers could be a productive avenue to explore further. A notable benefit is that it is not necessary to evolve the component optimisers each time. That is, it is only necessary to train a pool of optimisers once, and the pool can then be hybridised in different ways to address particular problems. However, a downside of hybridisation is that it makes optimisers harder to understand, with the behaviour of a hybrid likely to be emergent in unexpected ways from the behaviour of its component optimisers. Consider, for instance, Fig. 9. This shows the behaviour of two more hybrid optimisers, specifically those with the lowest error rates on F 1 and F 18 . In both cases, the optimisers move to the region containing the optimum in a small number of moves. However, due to the large number of programs guiding their moves, the underlying logic behind the optimisation process is unclear, and this arguably takes away one of the advantages of GP over deep learning -the relative interpretability of its solutions.

Limitations and Future Work
This study set out to show two things. First, that PushGP can be used to design useful optimisers, and second, that it can discover optimisers that lie beyond the existing human design space. Both of these have been demonstrated, at least to an extent. PushGP has been shown able to design optimisers, and some of these optimisers have a performance that compares favourably against a well-regarded general-purpose optimiser, in many cases exceeding it. The analysis of evolved optimisers identified a number of novel behaviours, such as the use of trigonometric functions and certain geometric patterns for efficiently exploring a search volume, and the use of distributed backtracking behaviours to deal with deceptive landscapes.
Nevertheless, this is only an initial proof of concept, and there remains a lot more that could be done to explore the optimiser design space in a broader and more systematic way. One limitation of this study is the use of only a small number of functions to evolve optimisers. Whilst these were chosen to represent a range of landscapes, they are not representative of all optimisation landscapes, and this consequently limits the range of behaviours that can be evolved. It also raises the question of which set of functions would be appropriate for this task. Existing benchmark sets like BBOB [15] are a possibility, but there are known to be significant biases in how they sample the space of landscapes [28,8,19]. Perhaps a more productive direction would be to use tunable benchmark functions [12,37], which generate landscapes with defined features. These features could potentially be used as an orthogonal basis with which to define a particular space of optimisation landscapes, providing a more uniform sampling of optimisation behaviours.
Conversely, by providing a varied selection of optimisers to be evaluated, perhaps the approach outlined in this paper could be used to assess benchmark functions, or even provide a better understanding of the scope for optimiser generality. The latter is important because, although the no free lunch theorem [50] itself does not apply within a restricted subspace of functions such as continuous functions, it is widely believed that there are restrictions on the amount of generality that can be achieved within such sub-spaces.
One of the objectives of this study was to explore diverse approaches to optimisation. Whilst this has been achieved to an extent both through the use of multiple training functions and the natural diversity that occurs between EA runs, there is also scope for explicitly promoting diversity. This could be done both at the population level, using a niching mechanism [21] to encourage the evolution of diverse solutions, or between runs, by penalising solutions which resemble those found in previous runs. The main barrier to doing this, however, lies in finding a suitable distance metric for comparing optimisers. Potentially this could be done by comparing trajectories. One approach would be to derive features from trajectories and compare these, for example search space coverage and move size statistics. This would be relatively easy to do, and would have the added benefit of allowing novelty search techniques such as MAP-elites [30] to be applied, but the selection of features could easily introduce bias. An alternative would be to directly compare trajectories. This is much harder to do, but one potential approach would be to map trajectories to local optima networks (as done in [4]), and then compare the trajectories within these networks.
Diversification methods could also be beneficial when generating optimiser pools for forming hybrids, since this could enable explicit selection of component optimisers that make errors on different landscapes. More generally, this study showed the potential benefit of hybridising pools of evolved optimisers. However, both the way in which component optimisers were selected, and the approach used to hybridise them, was somewhat arbitrary. Consequently, there is a lot more scope for research in this area. For instance, a wrapper method could be used to select which component optimisers to use from the pool.
Several of the other design choices made in this study may limit or bias the space of optimisers that is being explored. This includes requiring evolved optimisers to only carry out a single move each time they are invoked, which was done so that PushGP does not have to re-discover an outer loop for each optimiser. However, it is conceivable that this limits the design space, in particular pushing evolution towards optimisers that carry out similar kinds of move every time they are invoked, rather than adapting their behaviour in interesting ways as they explore the optimisation landscape. For the population-based optimisers, the use of persistent search processes and modes of inter-process communication resembling those in PSO are also likely to introduce bias. To address this, in future work, it would be interesting to explore a wider range of models of population-based search.
A number of the design choices made in this study reflected the need to limit computational effort. Computational effort is undoubtedly a limiting factor for this kind of research, but there are potential approaches for reducing the effort required to evolve optimisers. This includes general EA-based approaches for reducing fitness evaluation time, such as surrogate fitness models [32]. Another approach would be to promote the evolution of more efficient optimisers which take less time to evaluate, e.g. by penalising optimisers that use excessive instruction executions. Time and space complexity were not considered in this study, but it would be useful to do so, especially if the aim is to generate practical optimisers.
This study only looked at the design of optimisers for solving continuous functions, and it would be interesting to apply the approach to other domains. Combinatorial optimisation is one possibility, and this could be done by modifying the PushGP function set so that only valid solutions can be generated.
An intriguing possibility would be to apply the approach to GP itself, evolving PushGP programs that can optimise programs. The simplest way of doing this would be to optimise grammatical evolution [39] programs, since these are normally represented by fixed-length vectors of numbers. Indeed, anything that can be represented as a vector of numbers could potentially be a target of this approach, with relatively little modification.

Conclusions
This work uses genetic programming to design optimisers. The optimisers are expressed using the Push language, which allows them to be evolved largely from scratch using low-level primitives and control flow instructions. This discriminates the approach from earlier methods for evolving optimisers, which generally used high-level building blocks derived from existing optimisers. This means that the evolved optimisers are relatively unbiased towards existing human knowledge of how to do optimisation, offering the potential to discover truly novel approaches. This contrasts to recent nature-inspired approaches to optimiser design, many of which have failed to introduce any real novelty.
The results show that PushGP can both discover and express optimisation behaviours that are effective, complex and diverse. Importantly, the evolved optimisers generalise to problems they did not see during training, and often out-perform general-purpose optimisers on these previously unseen problems. Behavioural analysis shows that the evolved optimisers use a diverse range of strategies to explore optimisation landscapes, using behaviours that differ significantly from existing local and population-based optimisers.
This paper also takes an initial look at the hybridisation of evolved optimisers. The approach exploits the diversity found between evolutionary runs to form pools of optimisers, which are then hybridised using a simple mechanism. Many of the resulting hybrid optimisers were found to be significantly better than individual optimisers in terms of performance and generality, although at the expense of understandability. This suggests a potentially productive path for further research into the automatic design of optimisers.