Introduction

Artificial neural networks (ANNs) are popular solutions to many modern problems such as computer vision [1, 2] and natural language processing [3, 4]. Most ANNs must undergo training to learn to solve a task. In most training procedures, an optimizer adjusts the ANN’s weights based on examples. The choice of optimizer profoundly impacts the efficiency and efficacy of training. The benefits of an optimizer can be increased further through specialization, which typically happens by tuning the optimizer's parameters (such as the learning rate).

However, parameter tuning is merely a superficial specialization. Changes in parameter values can only adjust the optimizer's existing behavior, leaving no room for the creation of specialized behavior. In this work, we present AutoLR [5], a framework based on structured grammatical evolution [6] that can evolve optimizers with specialized behavior. We use AutoLR to evolve optimizers for the Fashion-MNIST image classification problem. The best optimizers resulting from evolution are then compared with human-made optimizers. These tests reveal that the evolved optimizers can compete with their human-made counterparts, which have been developed over decades of research. One of the evolved optimizers, ADES, remains competitive even when moved to a different image classification task. Furthermore, ADES does not work in the same way as the human-made optimizers, employing novel mechanisms for training. The competitive performance and unique behavior of ADES open future opportunities for evolutionary approaches to help researchers discover new, even more effective optimizers.

This work is an extended version of Carvalho et al., published at EuroGP 2022 [7]. This version includes an extended comparison of the training behavior of human-made and evolved optimizers. Additionally, we add Sect. “Applying AutoLR to Other Problems” to this version, outlining how the reader may reproduce the experiments and apply the system to their own work. The rest of this paper is structured as follows: Sect. “Background” presents existing human-made optimizers. Section “AutoLR” presents AutoLR and its key components and design decisions. Section “Experimentation” documents the experiments conducted to validate AutoLR and compares the evolved optimizers to human-made state-of-the-art competitors. The aforementioned Sect. “Applying AutoLR to Other Problems”, exclusive to the current paper, describes how the reader may use AutoLR to improve the training of their neural network applications in Tensorflow [8] and Pytorch [9]. Section “Conclusion” sums up the achievements of AutoLR and the contributions of the current work.

Background

The profound impact of optimizers on training speed and quality motivates researchers to study and develop many solutions to this problem. In this section, we review how optimizers work, the most relevant optimizers, their behavior, and their contributions to the field.

An optimizer is an algorithm that aims to minimize an objective function l(w). This function takes the network’s weights (or trainable parameters) as an input w. The optimizer minimizes this function by adjusting the parameters in response to the gradient of the objective function \(\nabla l(w)\). This gradient is calculated using the back-propagation algorithm [10]. This process repeats several times, each repetition commonly referred to as an epoch. Training concludes once a stop condition is reached, usually either a maximum number of epochs or a maximum runtime. Since training goes through several epochs, the notation \(w_{t}\) is used to refer to the network’s weights at a certain epoch t.

The original optimizer used for neural network training is stochastic gradient descent (SGD) [11]. In SGD, the new set of weights (\(w_{t}\)) is calculated using a learning rate (lr) to adjust the magnitude of the changes proposed by the gradient of the objective function evaluated at the previous weights (\(\nabla l(w_{t-1})\)). SGD is shown in Eq. 1.

$$\begin{aligned} w_{t} \xleftarrow {} w_{t-1} - \text {lr} \times \nabla l(w_{t-1}) \end{aligned}$$
(1)
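To make the update concrete, the following NumPy sketch implements Eq. 1 (an illustration, not a framework implementation; grad stands for the back-propagated gradient \(\nabla l(w_{t-1})\)):

import numpy as np

def sgd_step(w, grad, lr=0.01):
    # Eq. 1: move against the gradient, scaled by the learning rate
    return w - lr * grad

Every optimizer discussed below follows this same shape: a function that maps the previous weights and the gradient to the new weights, possibly carrying extra state between epochs.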

The foundation of all modern optimizers is the product of the gradient and a scaling factor (i.e., the learning rate). While SGD uses this approach, it fails to address some competing objectives that come up during training. When a weight is close to its ideal value, a low learning rate is preferable so minor adjustments can reach the ideal value. On the other hand, if the weight is far from optimized, a high learning rate is desirable to accelerate optimization [12].

Adaptive optimizers address the shortcoming of using a single learning rate for the entire network. They track long-term gradient information for each weight to make weight-specific adjustments. We call this gradient-tracking approach adaptive behavior.

There are a variety of adaptive behaviors that can improve the speed and quality of training. In the Momentum optimizer [13], a momentum term (\(x_t\)) increases the magnitude of changes to weights that keep moving in the same direction. This behavior is regulated by two parameters, the learning rate (lr) and the momentum constant (mom), which dictates how strong the effect of momentum is. The Momentum optimizer is shown in Eq. 2.

$$\begin{aligned} x_{t}&\xleftarrow {} \textit{mom} \times x_{t-1} - lr \times \nabla l(w_{t-1})\nonumber \\ w_{t}&\xleftarrow {} w_{t-1} + x_{t} \end{aligned}$$
(2)

The momentum optimizer addresses the issues mentioned above with SGD, allowing unoptimized weights to build up speed in a specific direction and optimized weights to make minor adjustments. However, the momentum term can cause weights to "overshoot" their optimal value, hindering the training process.

A variation of the momentum optimizer, Nesterov’s momentum [14], mitigates overshooting. This optimizer calculates the gradient for the weights after the momentum term has been applied. This change works as a "look-ahead" mechanism that can correct the direction of the momentum term faster. Faster adjustments to the momentum term improve its accuracy and prevent overshooting. Nesterov’s momentum is shown in Eq. 3.

$$\begin{aligned} x_{t}&\xleftarrow {} mom \times x_{t-1} - lr \times \nabla l(w_{t-1} + mom \times x_{t-1})\nonumber \\ w_{t}&\xleftarrow {} w_{t-1} + x_{t} \end{aligned}$$
(3)

RMSprop [15] is an unpublished optimizer that uses an annealing mechanism. In RMSprop, stagnated weights undergo more extensive changes, and weights that are not converging get slowed down. This behavior is achieved by dividing the product of the learning rate and the gradient (\(lr \times \nabla l(w_{t-1})\)) by a scaling factor. This scaling factor, \(x_t\), is a discounted average of the squared gradient \(\nabla l(w_{t-1})^2\). An additional parameter, \(\rho \), regulates the exponential decay rate of the discounted average. RMSprop is shown in Eq. 4.

$$\begin{aligned} x_{t}&\xleftarrow {} \rho x_{t-1} + (1 - \rho ) \nabla l(w_{t-1})^2\nonumber \\ w_{t}&\xleftarrow {} w_{t-1} - \frac{lr \times \nabla l(w_{t-1})}{\sqrt{x_{t}} + \epsilon } \end{aligned}$$
(4)

The moving average in RMSprop is initialized at 0, biasing the term toward zero. Adam [16] is an optimizer similar to RMSprop that corrects this bias using a new term (\(z_t\)). Adam also uses a range where it expects the gradient to remain consistent (\(\frac{x_{t-1}}{\sqrt{y_{t-1}}}\)) and keeps changes within that range. Adam is shown in Eq. 5. Two parameters, \(\beta _1\) and \(\beta _2\), regulate the decay rates of the discounted averages, similar to \(\rho \) in the previous optimizer.

$$\begin{aligned} x_{t}&\xleftarrow {} \beta _1 x_{t-1} + (1 - \beta _1) \nabla l(w_{t-1})\nonumber \\ y_{t}&\xleftarrow {} \beta _2 y_{t-1} + (1 - \beta _2) \nabla l(w_{t-1})^2\nonumber \\ z_{t}&\xleftarrow {} lr \times \frac{\sqrt{1 - \beta _2^t}}{(1 - \beta _1^t)}\nonumber \\ w_{t}&\xleftarrow {} w_{t-1} - z_{t} \times \frac{ x_{t}}{\sqrt{y_{t}} + \epsilon } \end{aligned}$$
(5)
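As an illustration of Eq. 5, a NumPy sketch of one Adam step follows (a simplified sketch, not Tensorflow's implementation; x and y are the per-weight discounted averages carried between epochs, and t is the epoch counter starting at 1):

import numpy as np

def adam_step(w, grad, x, y, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    x = beta1 * x + (1 - beta1) * grad       # discounted average of the gradient
    y = beta2 * y + (1 - beta2) * grad ** 2  # discounted average of the squared gradient
    z = lr * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)  # bias-corrected step size z_t
    return w - z * x / (np.sqrt(y) + eps), x, y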

AutoLR

AutoLR [5] is an open-source framework that uses evolutionary algorithms (specifically, a genetic programming [17] technique called structured grammatical evolution [18]) to create specialized optimizers. The framework has two key components: the grammar, a set of rules that dictates which optimizers the framework can create, and the fitness function, a procedure that assesses the quality of the optimizers created. We describe these components in detail in the upcoming sections.

Grammar

The grammar used in AutoLR enables the system to replicate most mechanisms found in human-made optimizers. AutoLR has extensive freedom to create optimizers: the grammar allows the system to construct anything that may train the network quickly and effectively. It is a conscious design decision to use a grammar that allows AutoLR to explore beyond human-made solutions, enabling the creation of novel optimizers undiscovered by humans. However, this freedom leads to slower evolution, as many solutions in the search space are syntactically valid but completely unable to train a network.

In AutoLR, four functions make up an optimizer. The most important is the weight function (\(weight\_func\) in the grammar); this function is responsible for changing the weights. The other three are auxiliary functions (the x, y, and z funcs in the grammar). These allow the system to develop adaptive behaviors (e.g., momentum, moving averages, learning rate annealing). Three auxiliary functions are sufficient to replicate most optimizers found in the literature. These functions are executed sequentially in the following order:

\(x\_func \xrightarrow {} y\_func \xrightarrow {} z\_func \xrightarrow {} weight\_func\).

Each function can use: the result of preceding functions, the gradient, constant values between 0 and 1, and several mathematical operations (add, subtract, multiply, divide, power, square, negative, square root). These options are organized in rules based on their role: operations in func rules, terminals (constants, preceding function results, and the gradient) in terminal rules. Each function has its own instance of these rules to ensure that behavior evolved for one function cannot move to a different one. Additionally, the weight function does not include the gradient in its terminals; this promotes the discovery of adaptive behavior. For example, the Momentum optimizer (Eq. 2) fits this structure: an auxiliary function computes the momentum term \(x_t\) from the gradient, and the weight function adds it to the previous weights.

The complete grammar is too large to include in this paper, but Fig. 1 presents an abridged version.

Fig. 1
figure 1

CFG for the evolution of optimizers

Due to technical limitations, not every single behavior found in the literature is reproducible with AutoLR. Notably, the look-ahead gradient calculation used in Nesterov’s momentum is incompatible with how AutoLR handles the gradient calculation.

Fitness Function

AutoLR can evolve optimizers for any neural network and task that uses gradient-based training. To evaluate the quality of an optimizer, we use the candidate optimizer to train a neural network and assign fitness based on the network’s performance.

Training once is sufficient to differentiate between good and bad solutions, but optimizers of similar quality may be indiscernible from a single observation. We mitigate this issue by repeating training multiple times with each optimizer, assessing network quality after each training. To reduce computational costs, training is only repeated if the network achieves a result above a threshold.

AutoLR typically evolves optimizers for a single task and dataset. The data is split into three sets: the training set, the validation set, and the fitness set.

The network is trained using the training set. This set reserves separate data for each training repetition, ensuring every repetition trains with unique data.

We use the validation set to calculate the validation loss which gives insight into training progress. Additionally, an early stop mechanism monitors the validation loss and concludes training if this metric does not improve for several consecutive epochs (the exact number is controlled using a patience parameter). Early stop prevents overfitting and reduces the cost of fitness evaluations. If early stop is not activated, training will conclude after a set number of training epochs.

Once training concludes, we calculate the performance of the resulting network using the fitness set. Evolved optimizers can sometimes be erratic, alternating between great training performance and complete training failure in different repetitions. To discourage evolution from favoring these solutions, the optimizer's fitness is the worst performance achieved on the fitness set across all training repetitions.

Pseudo-code for the fitness evaluation function is presented in Algorithm 1.

figure a
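In outline, the evaluation proceeds as in the following Python sketch (hedged: reset_weights, train, and evaluate are placeholders for framework-specific calls, and the repetition count and threshold values are illustrative):

def optimizer_fitness(optimizer, model, data, repetitions=5, threshold=0.5):
    scores = []
    for r in range(repetitions):
        reset_weights(model)                       # same starting weights every time
        train(model, optimizer, data.training[r],  # early stop monitors validation loss
              validation=data.validation)
        scores.append(evaluate(model, data.fitness))
        if scores[-1] < threshold:                 # only repeat promising optimizers
            break
    return min(scores)                             # fitness is the worst repetition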

Experimentation

We conducted experiments using AutoLR to study the benefits of specializing optimizers through evolution. The experimentation has two phases, documented in the following two sections. In Sect. “Evolving Optimizers”, we detail the results of evolutionary experiments, documenting the significant events of evolution, and present outstanding optimizers. In Sect. “Comparison with Human-Made Optimizers”, we compare the best evolved optimizers with human-made optimizers in two tasks. We also describe our steps to ensure fair comparisons between evolved and human-made optimizers.

Evolving Optimizers

AutoLR has several parameters that must be configured, presented in Table 1. We found that a population of 20 individuals for 1500 generations was sufficient to find interesting, high-quality solutions consistently. We observe that optimizers created by AutoLR are fragile and a considerable number of invalid optimizers exist in most populations. We use a large tournament size to increase the chances of selecting good optimizers despite this abundance of low-quality individuals. We perform nine evolutionary runs using these settings.

In this experiment, we use AutoLR to evolve optimizers for classification of the Fashion-MNIST dataset, using a simple convolutional neural network architecture (full architecture details in the supplementary material). Specifically, we use the training set of Fashion-MNIST (60,000 images, referred to as Fashion-MNIST-Training from here on out). We distribute these 60,000 examples among the three roles in the fitness function. We reserve 53,000 images for network training (10,600 unique training examples per repetition, for up to five repetitions). The validation set contains 3500 instances, and the fitness set contains the remaining 3500 images.

Table 1 Experimental parameters

Figure 2 shows the best solution fitness and the mean population fitness at each generation, averaged across all runs. Early on, optimizers that use the gradient value directly dominate the population. These optimizers can be successful but often fail completely to train the network. These individuals contribute to the discovery of more robust optimizers that use simple strategies (like a learning rate or momentum) to train competent networks consistently. Most runs converge to these simple optimizers, rediscovering effective mechanisms invented by humans.

Fig. 2
figure 2

Progression of mean fitness and best fitness throughout evolution. Plot shows the average and standard deviation across all runs

However, some experiments go beyond human-made strategies and discover novel behavior. Specifically, there are two evolved optimizers worth highlighting.

The Sign optimizer, named after its distinctive operation, is the best-performing optimizer across all experiments. In Eq. 6, we show a simplified version of this optimizer (\(lr = 0.0009\) in the experiment).

$$\begin{aligned} \begin{aligned} w_{t}&= w_{t - 1} - lr \times sign(\nabla l(w_{t - 1})) \end{aligned} \end{aligned}$$
(6)

This optimizer is unusual as it discards the size of the gradient; the learning rate alone determines the size of the weight changes. The evolved instance of this optimizer uses a small learning rate that leads to successive small changes that steadily improve performance. We hypothesize that this optimizer is an artifact evolved to exploit the early stop mechanism, as incremental improvements avoid the stop condition. Avoiding early stop results in more extended training and a better network, meaning our fitness function favors this optimizer. Additionally, the sign operation simplifies the direction of the gradient, reducing each component to its sign. Researchers have found that it is sometimes desirable to discard part of the gradient information; many optimizers use the magnitude of the gradient as part of an adaptive mechanism. To the best of our knowledge, no human-made optimizer directly changes the weights based on the gradient sign. While it is difficult to know why this optimizer uses such a counter-intuitive approach, we hypothesize that the fixed step size makes it more resistant to the vanishing/exploding gradient problems that often occur in training. Since the Sign optimizer does not exhibit any adaptive behavior, we also selected the best evolved adaptive optimizer for further study.

The adaptive evolutionary squared optimizer (ADES) is the best evolved optimizer with adaptive behavior, presented in Eq. 7. The instance of ADES produced in evolution uses the parameters: \(\beta _1 = 0.08922, \beta _2 = 0.0891\).

$$\begin{aligned} y_{t}&= y_{t-1} - (\beta _1 \times y_{t-1}^2 + \beta _2 \times (y_{t-1} \times \nabla l(w_{t-1}))\nonumber \\&\quad + \beta _2 \times \nabla l(w_{t-1}))\nonumber \\ w_{t}&= w_{t - 1} + y_{t} \end{aligned}$$
(7)
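To make the two evolved update rules concrete, the following NumPy sketch implements Eqs. 6 and 7 (an illustration; grad stands for \(\nabla l(w_{t-1})\), y is the ADES auxiliary variable carried between epochs, and the default parameter values are those reported above):

import numpy as np

def sign_step(w, grad, lr=0.0009):
    # Eq. 6: the step size is fixed by lr; only the gradient's sign is used
    return w - lr * np.sign(grad)

def ades_step(w, grad, y, beta1=0.08922, beta2=0.0891):
    # Eq. 7: the auxiliary variable accumulates a squared term of itself
    y = y - (beta1 * y ** 2 + beta2 * (y * grad) + beta2 * grad)
    return w + y, y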

To the best of our knowledge, ADES’ adaptive behavior is novel and undiscovered by humans. Specifically, the use of a squared auxiliary variable is unique to ADES. We empirically observe that ADES employs a mechanism similar to momentum but faster and more erratic. Since ADES exhibits novel adaptive behavior, it is challenging to comprehensively understand how the optimizer interacts with the networks it trains. Fully understanding this optimizer is outside the scope of this work, as it requires a significant study outside the field of evolutionary computation.

Comparison with Human-Made Optimizers

While the evolved optimizers perform well in evolution, comparisons with human-made alternatives are necessary to assess their usefulness. We devised a procedure to make fair comparisons between evolved and human-made optimizers.

We evaluate an optimizer by using it to train a network and rating it based on the quality of the network after training. This approach is similar to the fitness function used during evolution, but we made some changes to ensure fairness. Training is extended to 1000 epochs and we do not use early stop. Once training concludes, we restore the best weights (those that scored the highest validation accuracy during training) and calculate accuracy on test data. Since there is a stochastic component to training, we repeat this optimizer evaluation 30 times with different seeds; all results in this section show the average and standard deviation of the test and validation accuracy. Figure 3 summarizes the differences between the fitness function used in evolution and the benchmark function used for these experiments.

We compare the evolved optimizers with three of the previously presented human-made optimizers: Nesterov’s momentum, RMSprop, and Adam. We compare the solutions on two different datasets: Fashion-MNIST (used in evolution) and CIFAR-10 (a more complicated image classification task and a more complex neural network architecture).

Fig. 3
figure 3

Differences between the fitness function used for evolution and the benchmark function used to compare evolved and human-made optimizers in post-hoc analysis

There are also fairness issues when it comes to specialization. Since the evolved optimizers are specialized from their inception, it is essential to also specialize the human-made optimizers for fair comparisons. Consequently, we specialize all optimizers (including the evolved ones) with Bayesian optimization [19]. We use Bayesian optimization to test 100 parameter combinations guided by the validation accuracy, with ten restarts. Bayesian optimization searches for ideal parameter values for each optimizer, using each optimizer’s default parameter values as a starting point to facilitate the search. Specifically, we optimize the following parameters for each optimizer (a sketch of this tuning step follows the list):

  • Sign—lr,

  • ADES—\(\beta _1\), \(\beta _2\),

  • Adam—lr, \(\beta _1\), \(\beta _2\),

  • RMSprop—lr, \(\rho \),

  • Nesterov—lr, momentum.

This tuning is performed for each task; optimizer evaluation uses the best parameters found during this search.
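As a sketch of this tuning step, here is how it could be written with scikit-optimize's gp_minimize (our procedure does not prescribe a specific library; validation_accuracy is a placeholder that trains the network with the given parameters and returns validation accuracy, and the search bounds are illustrative):

from skopt import gp_minimize
from skopt.space import Real

space = [Real(1e-5, 1e-1, prior="log-uniform", name="lr"),  # Adam's parameters
         Real(0.5, 0.999, name="beta_1"),
         Real(0.5, 0.9999, name="beta_2")]

def objective(params):
    lr, beta_1, beta_2 = params
    # validation_accuracy is a placeholder: train with these values and
    # return validation accuracy (negated, since gp_minimize minimizes)
    return -validation_accuracy(lr=lr, beta_1=beta_1, beta_2=beta_2)

result = gp_minimize(objective, space, n_calls=100, n_random_starts=10,
                     x0=[0.001, 0.9, 0.999])  # default values as a starting point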

Fashion-MNIST

In this experiment, we test the optimizers in the network architecture and dataset used in evolution. The test accuracy assessment uses data excluded from evolution (Fashion-MNIST-Test, 10,000 instances) to ensure fairness. We use Fashion-MNIST-Training data for training (53,000 instances) and validation (7000 instances). In Table 2, we show the best parameter values found through Bayesian optimization and the average and standard deviation of the validation and test accuracy for each optimizer across 30 repetitions.

The differences in performance are marginal in this task. The exception is the Sign optimizer, which performed about 1.5% worse than the others. This result supports our hypothesis that the Sign optimizer thrived by exploiting the characteristics of fitness evaluation. Despite weaker results, the Sign optimizer has the smallest accuracy drop between validation and test sets, suggesting it succeeds in avoiding overfitting.

Table 2 Trial results of all optimizers in Fashion-MNIST

ADES achieved the best test accuracy across all optimizers, suggesting that AutoLR succeeded in creating a specialized optimizer for this task. It is expected that the evolved optimizer would perform well in its native task, but, remarkably, an automatically created optimizer outperforms human-made optimizers backed by decades of research. However, the differences in performance between the four best optimizers are marginal. Performing a Mann–Whitney U test for the null hypothesis "ADES and Nesterov are equal" with a significance level of 0.05, we cannot reject the null hypothesis (\(p=0.267\)).

CIFAR-10

CIFAR-10 is a different image classification dataset. We use this dataset to investigate whether the evolved optimizers generalize to different data within the same task. CIFAR-10 is more challenging than Fashion-MNIST, with \(32 \times 32 \times 3\) images (compared to Fashion-MNIST’s \(28 \times 28 \times 1\) images). We use a more complex, though still modest, architecture for this task (full architecture in the repository [5]).

CIFAR-10 is divided into two sets. CIFAR-10-Training has 50,000 images; we use this data for training (43,000 instances) and validation (7000 instances). CIFAR-10-Test has 10,000 instances, all used for the test accuracy calculation. In Table 3, we show the best parameter values found through Bayesian optimization and the average and standard deviation of the validation and test accuracy for each optimizer across 30 repetitions.

Once again, the Sign optimizer performs the worst but remains the most resistant to the dataset change. We hypothesize that the Sign optimizer is more resistant to overfitting because the sign operation discards the magnitude of the gradient vector.

ADES is among the best solutions even in this new dataset, outperforming Adam and RMSProp in validation and test accuracy. While the differences remain small, it is vital to recognize that the evolved optimizer can compete with state-of-the-art human-made solutions even outside its native task. ADES’ fine-tuned behavior for Fashion-MNIST is supposed to be its advantage over other solutions; this strong result on a different task exceeds expectations.

ADES’ success in CIFAR-10 raises questions about the broader applicability of the optimizer. In what other datasets, architectures, and tasks can ADES succeed? Future work is necessary to understand the contribution of ADES to the broader field of machine learning.

The success of ADES in this task also strengthens the case for AutoLR as a tool for optimizer creation. A vital characteristic of ADES is that it remains competitive while using a novel optimization approach, undiscovered by humans. Even if future work reveals that ADES is not widely applicable, understanding its unique features and why it succeeds in specific tasks may provide helpful insights for creating better human-made optimizers. Historically, researchers recycle and adjust concepts from older optimizers to create new, better solutions. AutoLR’s ability to evolve competitive optimizers with novel behavior may help researchers improve human-made optimizers.

Furthermore, the evolutionary conditions did not incentivize the creation of an optimizer that performs well outside its native task. The system evolved ADES due to its fitness for the Fashion-MNIST dataset, but the optimizer still performs well in other tasks. This result raises the question: "What if AutoLR is used to evolve generally applicable optimizers?" Using a fitness function that evaluates optimizers based on their performance on several tasks could create optimizers that perform well across many datasets and architectures. Additionally, our evolutionary setup enforced minimal restrictions on the optimizers created. It would also be worthwhile to evolve optimizers under more restrictions, informed by expert knowledge, that promote solutions closer to human-made optimizers.

Convergence Comparison

Finally, we compare how ADES converges in comparison with some human-made optimizers. To investigate this matter, we track the loss, training accuracy, and validation accuracy of ADES, Adam, and Nesterov’s momentum over 1000 training epochs. We use the same settings as the benchmark evaluation for CIFAR-10; we opt to conduct this comparison in CIFAR-10 since a harder dataset is more likely to highlight the differences between optimizers. We repeat this training 30 times to account for outliers in each optimizer’s regular behavior. Results for training accuracy, validation accuracy, and training loss are presented in Figs. 4, 5 and 6, respectively.

Table 3 Trial results of all optimizers in CIFAR-10
Fig. 4
figure 4

Training accuracy in CIFAR-10 during training for ADES, Adam, and Nesterov’s momentum

Fig. 5
figure 5

Validation accuracy in CIFAR-10 during training for ADES, Adam, and Nesterov’s momentum

The results from this test are consistent across all visualizations of the data. Adam is the fastest to converge of the three optimizers, reaching its peak training accuracy early on. Adam also appears to stagnate earlier in training compared to the other two solutions. Notably, we can see that Adam’s training loss actually begins to increase rather than decrease at a certain point. However, this may be a quirk of the parameters found through Bayesian optimization rather than characteristic behavior of the solution. ADES and Nesterov’s momentum behave very similarly in this setting. As evidenced by previous experiments, these optimizers have very similar performances, but the convergence plots show that Nesterov’s momentum trains slightly faster than its evolved counterpart. However, this difference is ultimately small and does not change the previous conclusion that ADES is competitive with human-made optimizers.

Fig. 6
figure 6

Training loss in CIFAR-10 during training for ADES, Adam, and Nesterov’s momentum

Applying AutoLR to Other Problems

In this work, we used AutoLR to evolve optimizers in the Tensorflow framework, but AutoLR is open-source [5] and readily applicable to any task with gradient-based training.

We wrote AutoLR in Python, with ready-to-use grammars for two major machine learning frameworks: Tensorflow [8] and Pytorch [9]. Applying it to any task implemented using one of these frameworks requires transforming the existing code into a fitness function suitable for AutoLR. To use it in other frameworks, the user must create a grammar and a custom optimizer class and configure the experiments. Experiments are configured using a parameter file; Table 4 summarizes all parameters, grouped by component.

This paper is also accompanied by a Google Colab [20] notebook with code examples of all components described.

Table 4 AutoLR parameters

In the rest of this section, we describe how the reader may use AutoLR to create optimizers for their machine learning tasks. We explain how to create a custom fitness function and grammar, even without prior experience in evolutionary algorithms. Finally, we describe the outputs of the framework and how to extract optimizers after the experiments conclude. The purpose of this section is to clearly and openly present the concrete implementation steps needed to adapt AutoLR to other problems, improving the accessibility of the framework and the reproducibility of our experiments.

Creating a Fitness Function

In AutoLR, the fitness function always receives a tuple containing the phenotype (a string describing the optimizer) and the parameters. The fitness function should perform several common machine learning tasks:

  • Load the model.

  • Create an optimizer.

  • Load the data.

  • Train the network.

  • Evaluate network performance after training.

Once the function concludes, it should return the fitness of the optimizer and any other information the user may want to analyze. The function should return this data in a tuple of length 2, where the first element is the fitness value as a float and the second is a serializable dictionary with all other logging information. Listing 2 presents an example of such a function in AutoLR.

figure b
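A minimal Tensorflow/Keras-flavored sketch of such a function follows (hedged: TFOptimizer stands in for AutoLR's Tensorflow custom optimizer class described in “Optimizer Creation”, load_data is a user-defined helper, and the exact signature follows the tuple convention above):

import tensorflow as tf

def fitness_function(phenotype, params):
    model = tf.keras.models.load_model(params["MODEL"])
    optimizer = TFOptimizer(phenotype, model)  # placeholder class name
    model.compile(optimizer=optimizer, loss="categorical_crossentropy",
                  metrics=["accuracy"])
    (x_train, y_train), (x_val, y_val), (x_fit, y_fit) = load_data(params)
    model.fit(x_train, y_train, validation_data=(x_val, y_val),
              epochs=params["EPOCHS"], batch_size=params["BATCH_SIZE"])
    _, accuracy = model.evaluate(x_fit, y_fit)
    # first element: fitness as a float; second: serializable logging dictionary
    return float(accuracy), {"phenotype": phenotype}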

It is not feasible to explain how to accomplish these tasks for all possible AutoLR applications. Instead, this section provides examples of these tasks in AutoLR, gives recommendations for successful implementations, and shows how to adapt them using AutoLR parameters.

Finally, AutoLR calls the fitness function using an Evaluator class. The framework includes Evaluator classes for Tensorflow and PyTorch, which support custom fitness functions in their initialization.

Model Loading

AutoLR fitness functions should start with model loading. All optimizers should train the same architecture from the same starting weights. Consequently, the user should use identical initial weights for each fitness evaluation, avoiding random weight initialization.

In this work, we also performed multiple trainings for each fitness evaluation. To minimize computational costs, consider simply resetting the weights after each training rather than reloading the network from scratch.

The user can use the MODEL parameter to specify the model file used in an experiment. This parameter allows the same fitness function to work in several experiments using different models. In Listing 3, we exemplify how model loading can fit in a fitness function with multiple trainings.

figure c
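A sketch of this pattern in Keras (the training helper train_once is a placeholder for one training run):

import tensorflow as tf

def run_repetitions(model_path, optimizer, training_subsets, train_once):
    model = tf.keras.models.load_model(model_path)
    initial_weights = model.get_weights()   # snapshot the starting weights once
    for subset in training_subsets:
        model.set_weights(initial_weights)  # reset instead of reloading from disk
        train_once(model, optimizer, subset)
    return model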

Optimizer Creation

AutoLR fitness functions receive the phenotype (string) that describes an optimizer. The user must convert the phenotype into a usable optimizer using a custom optimizer class. This approach allows advanced users to create optimizer classes that process the string and turn it into an optimizer for their ML framework of choice.

However, creating a custom optimizer class can be difficult and requires understanding the optimizer classes of the ML framework, which are often complicated. AutoLR includes custom optimizer classes for Tensorflow and PyTorch, allowing users to avoid these challenges. Using these custom optimizers is as simple as calling them with the phenotype string and the model, as shown in Listing 4.

figure d

Data Loading

The AutoLR architecture should have little impact on the data loading procedure used for a task. However, it is worth highlighting that the data distribution deeply affects an experiment's computational costs. Since training is conducted many times during an experiment, adjusting the related parameters (i.e., \(EPOCHS, VALIDATION\_SIZE, FITNESS\_SIZE, BATCH\_SIZE\)) can have a profound impact on the resources used by AutoLR. Note that there is no \(TRAINING\_SIZE\) parameter; training uses all data available after creating the validation and fitness sets. We also emphasize the importance of excluding some data from the fitness evaluation. We believe that evolved optimizers may overfit, similar to neural networks, so it is desirable to perform posterior quality assessments using data excluded from evolution.
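One possible split, following the roles described above, is sketched below (the sizes would come from the \(VALIDATION\_SIZE\) and \(FITNESS\_SIZE\) parameters; shuffling and exact slicing are up to the user):

def split_data(x, y, validation_size, fitness_size):
    holdout = validation_size + fitness_size
    x_val, y_val = x[:validation_size], y[:validation_size]
    x_fit, y_fit = x[validation_size:holdout], y[validation_size:holdout]
    x_train, y_train = x[holdout:], y[holdout:]  # training takes all remaining data
    return (x_train, y_train), (x_val, y_val), (x_fit, y_fit)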

Training the Network

Network training during fitness evaluation is the most computationally expensive part of AutoLR. The evolved optimizers may exhibit dysfunctional behavior; detecting it early reduces training costs considerably. Using an early stop mechanism that enforces a stop criterion based on the progress of the validation metric will significantly reduce the time spent training invalid optimizers.
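In Keras, for example, this mechanism is a single callback (a sketch, assuming the patience value is read from the parameter file):

import tensorflow as tf

def train_with_early_stop(model, x_train, y_train, x_val, y_val, epochs, patience):
    early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                                  patience=patience)
    model.fit(x_train, y_train, validation_data=(x_val, y_val),
              epochs=epochs, callbacks=[early_stop])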

Creating a Grammar

The grammar determines the optimizers that AutoLR can create and the structure they must follow. Specifically, AutoLR uses context-free grammars (CFGs) defined in text files. To implement a custom grammar, the user only needs to create a valid text file that defines a grammar and load it into the framework using the GRAMMAR parameter. A valid text file should comprise one line for each grammar rule, written in Backus-Naur form. Each line should start with a non-terminal and the \(::=\) symbol, followed by all expansion possibilities for the non-terminal. The \(<>\) symbols should enclose all non-terminals. Additionally, \(\vert \) should separate different productions for the same rule.
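For illustration, a small fragment in this format (not the shipped Tensorflow or Pytorch grammar; the terminal names are illustrative) could look like this:

<weight_func> ::= <weight_term> | (<weight_term> - <weight_expr>)
<weight_expr> ::= (<weight_term> * <weight_term>) | <weight_term>
<weight_term> ::= x_res | y_res | z_res | <constant>
<constant> ::= 0.1 | 0.01 | 0.001

Note that, as discussed in Sect. “Grammar”, the weight function's terminals exclude the gradient, while the auxiliary functions' rules (omitted here) would include it.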

The framework includes grammars for two popular Python machine learning frameworks, Pytorch and Tensorflow. The user may want to use a different grammar to create optimizers for a different framework or programming language. The included grammars can reproduce most behavior in human-made optimizers while minimizing bias towards these solutions. The user can design a more biased or restrictive grammar to reduce the problem space, leading to faster discovery of suitable optimizers.

User-created grammars should include basic mathematical operations (i.e., addition, multiplication, subtraction) as they appear in most optimizers. It is also essential to include a variety of constants to serve as parameters (e.g., the learning rate). The user should avoid a linear distribution of constants, as the parameters in optimizers usually take values close to 0 or 1. Capturing all relevant values from a linear distribution requires sampling many constants at a small step size, and this increased number of constants makes the search space more difficult for the algorithm to sift through. To address this, the user may manually remove redundant constants or use functions that generate a higher density of relevant constants (e.g., a sigmoid function). Finally, the grammar should allow optimizers to calculate auxiliary variables that do not affect the weights directly, enabling the evolution of adaptive mechanisms (e.g., momentum, learning rate annealing).

Running an Experiment

After selecting the parameters, grammar, and fitness function, the user can start running experiments. The following command runs the experiments:

figure e

Where PARAMETER_FILE is a valid parameter file including all desired settings for the experiments. When using a custom fitness function, the user must alter the main.py file to load it into one of the evaluators provided.

The results of experiments are logged periodically (determined by the SAVE_STEP parameter) to the EXPERIMENT_NAME folder. The logger saves several types of files:

  • iteration_N.json files show the state of the population at generation N. These files include the genotype, fitness (and individual evaluations), tree depth, id, phenotype, mapping values, smart phenotype (phenotype with inactive genes omitted for clarity), parents, and other information returned by the fitness function.

  • parameters.json shows the parameters used for the experiment.

  • z-archive_N.json files show the archive state at generation N. These files only include the evaluations, id, and fitness of archived individuals.

  • _progress_report.csv shows the best value, average, and standard deviation of the fitness for each generation.

  • builtinstate, numpystate save the random state of the system at the end of each generation. These files are necessary for crash recovery and do not contain actual results.

Finally, users should use a GPU-accelerated fitness function if they have access to such resources. Most of the computational cost of AutoLR experiments comes from the repeated training in fitness evaluations; GPU acceleration will significantly reduce the time needed to conduct experiments.

Once evolution is complete, the user should parse the iteration files produced by the framework to find the optimizers created and their performance. The optimizers in these files are saved as phenotypes; the user must recreate the optimizers from these phenotypes (see Sect. “Optimizer Creation”).
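A sketch of this post-processing step (hedged: the exact JSON layout and key names may differ; here we assume each iteration file holds a list of individuals with the fields listed above and that higher fitness is better):

import json

with open("EXPERIMENT_NAME/iteration_1500.json") as f:
    population = json.load(f)

best = max(population, key=lambda ind: ind["fitness"])  # best individual of the run
print(best["phenotype"])  # recreate the optimizer from this string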

Conclusion

In this work, we present AutoLR, an evolutionary framework that creates optimizers for neural network training. We show that AutoLR evolves competitive optimizers for image classification from scratch. One evolved optimizer, ADES, performs on par with human-made solutions while exhibiting novel behavior. ADES can compete with human-made optimizers even outside its native task. The results achieved by ADES support AutoLR as a tool for creating effective optimizers for both specific and general tasks. To summarize, the contributions of this paper are as follows:

  • We propose AutoLR, a framework capable of evolving adaptive optimizers that compete with the state-of-the-art.

  • We create, benchmark, and analyze two evolved optimizers with novel approaches to neural network optimization.

  • We present ADES, an evolved optimizer competitive with state-of-the-art human-made optimizers in two relevant image classification tasks.

These results also motivate future work. ADES remains competitive when moved to different image classification problems, but it is essential to study if it remains useful when moved to an entirely different class of problem.

Additionally, we only study optimizers based on their accuracy. Other factors, such as convergence speed and sensitivity to hyper-parameters, are also crucial in real applications. Studying these properties may reveal new advantages and disadvantages of the evolved optimizers.

The success of ADES in different tasks also hints that AutoLR may help create general optimizers. In the study of optimizers, new solutions often re-purpose ideas from earlier optimizers. Using AutoLR to evolve general optimizers may lead to creative new approaches to optimization and help researchers develop better optimizers in the future.