Massively Parallel Genetic Optimization through Asynchronous Propagation of Populations

We present Propulate, an evolutionary optimization algorithm and software package for global optimization and in particular hyperparameter search. For efficient use of HPC resources, Propulate omits the synchronization after each generation as done in conventional genetic algorithms. Instead, it steers the search with the complete population present at time of breeding new individuals. We provide an MPI-based implementation of our algorithm, which features variants of selection, mutation, crossover, and migration and is easy to extend with custom functionality. We compare Propulate to the established optimization tool Optuna. We find that Propulate is up to three orders of magnitude faster without sacrificing solution accuracy, demonstrating the efficiency and efficacy of our lazy synchronization approach. Code and documentation are available at https://github.com/Helmholtz-AI-Energy/propulate


Introduction
Machine learning (ML) algorithms are heavily used in almost every area of human life today, from medical diagnosis and critical infrastructure to transportation and food production. Almost all ML algorithms have non-learnable hyperparameters (HPs) that influence the training and in particular their predictive capacity. As evaluating a set of HPs involves at least a partial training, state-free approaches to HP optimization (HPO), like grid and random search, often go beyond available compute resources [15]. To explore the high-dimensional HP spaces efficiently, information from previous evaluations must be leveraged to guide the search. Such state-dependent strategies minimize the number of evaluations to find a useful model, reducing search times and thus the energy consumption of the computation. Bayesian and bio-inspired optimizers are the most popular of these AutoML approaches. Among the latter, genetic algorithms (GAs) are versatile metaheuristics inspired by natural evolution. To solve a search-for-solutions problem, a population of candidate solutions (or individuals) is evolved in an iterative interplay of selection and variation [23,30]. Although reaching the global optimum is not guaranteed, GAs often find near-optimal solutions with less computational effort than classical optimizers [9,8]. They have become popular for various optimization problems, including HPO for ML and neural architecture search (NAS) [14].
To take full advantage of the increasingly bigger models and datasets, designing scalable algorithms for high performance computing (HPC) has become a must [40]. While Bayesian optimization is inherently serial, the structure of GAs renders them suitable for parallelization [34]: Since all candidates in each iteration are independent, they can be evaluated in parallel. To breed the next generation, however, the previous one has to be completed. As the computational expenses for evaluating different candidates vary, synchronizing the parallel evolutionary process affects the scalability by introducing a substantial bottleneck. Approaches to reducing the overall communication in parallel GAs like the island model (IM) [34] do not address the underlying synchronization problem.
To solve the issues arising from explicit synchronization, we introduce Propulate, a massively parallel genetic optimizer with asynchronous propagation of populations and migration. Unlike classical GAs, Propulate maintains a continuous population of already evaluated individuals with a softened notion of the typically strictly separated, discrete generations. Our contributions include:
- A novel parallel genetic algorithm based on a fully asynchronous island model with independently processing workers, which makes it possible to parallelize the optimization process and to distribute the internal evaluation of the objective function.
We use Propulate to optimize various benchmark functions and the HPs of a deep neural network on a supercomputer. Comparing our results to those of the popular HPO package Optuna, we find that Propulate is consistently drastically faster without sacrificing solution accuracy. We further show that Propulate scales well to at least 100 processing elements (PEs) without relevant loss of efficiency, demonstrating the efficacy of our asynchronous evolutionary approach.

Related Work
Recent progress in ML has triggered heavy use of these techniques with Python as the de facto standard programming language. Tuning HPs requires solving high-dimensional optimization problems with ML algorithms as black boxes and model performance metrics as objective functions (OFs). Most common are Bayesian optimizers (e.g. Optuna [2], Hyperopt [7], SMAC3 [24,27], Spearmint [32], GPyOpt [5], and MOE [38]) and bio-inspired methods such as swarm-based (e.g. FLAPS [39]) and evolutionary (e.g. DEAP [16], MENNDL [40]) algorithms. Below, we provide an overview of popular HP optimizers in Python, with a focus on state-dependent parallel algorithms and implementations. A theoretical overview of parallel GAs can be found in surveys [12,4,3] and books [37,29].
Optuna adopts various algorithms for HP sampling and pruning of unpromising trials, including tree-structured Parzen estimators (TPEs), Gaussian processes, and covariance matrix adaptation evolution strategy. It enables parallel runs via a relational database server. In the parallel case, an Optuna candidate obtains information about previous candidates from and stores results to disk.
SMAC3 (Sequential Model-based Algorithm Configuration) combines a random-forest based Bayesian approach with an aggressive racing mechanism [24]. Its parallel variant pSMAC uses multiple collaborating SMAC3 runs which share their evaluations through the file system. Spearmint, GPyOpt, and MOE are Gaussian-process based Bayesian optimizers. Spearmint enables distributed HPO via Sun Grid Engine and MongoDB. GPyOpt is integrated into the Sherpa package [22], which provides implementations of recent HP optimizers along with the infrastructure to run them in parallel via a grid engine and a database server. MOE (Metric Optimization Engine) uses a one-step Bayes-optimal algorithm to maximize the multi-points expected improvement in a parallel setting [38]. Using a REST-based client-server model, it enables multi-level parallelism by distributing each evaluation and running multiple evaluations at a time.
Nevergrad [31] and Autotune [25] provide gradient-free and evolutionary optimizers, including Bayesian, particle swarm, and one-shot optimization. In Nevergrad, parallel evaluations use several workers via an executor from Python's concurrent module. Autotune enables concurrent global and local searches, cross-method sharing of evaluations, method hybridization, and multi-level parallelism. Open Source Vizier [33] is a Python interface for Google's HPO service Vizier. It implements Gaussian process bandits [19] and enables dynamic optimizer switching. A central database server does the algorithmic proposal work, clients perform evaluations and communicate with the server via remote procedure calls. Katib [18] is a cloud-native AutoML project based on the Kubernetes container orchestration system. It integrates with Optuna and Hyperopt. Tune [26] is built on the Ray distributed computing platform. It interfaces with Optuna, Hyperopt, and Nevergrad and leverages multi-level parallelism.
DEAP (Distributed Evolutionary Algorithms in Python) [16] implements general GAs, evolution strategies, multi-objective optimization, and co-evolution of multi-populations. It enables parallelization via Python's multiprocessing or SCOOP module. EvoTorch [36] is built on PyTorch and implements distribution- and population-based algorithms. Using a Ray cluster, it can scale over multiple CPUs, GPUs, and computers. MENNDL (Multi-node Evolutionary Neural Networks for Deep Learning) [40] is a closed-source MPI-parallelized HP optimizer for automated network selection. A master node handles the genetic operations while evaluations are done on the remaining worker nodes. However, global synchronization hinders optimal resource utilization [40].
Algorithm 1: Basic GA. In each generation, the individuals are evaluated in terms of the optimization problem's OF. Genetic operators propagate them to the next generation: The selection operator chooses a portion of the current generation, where better individuals are usually preferred. To breed new individuals, the genes of two or more parent individuals from the selected pool are manipulated. While the crossover operator recombines the parents' genes, the mutation operator alters them randomly. This is repeated until a stopping condition is met.
Input: Search-space limits, population size P , termination condition, selection policy, crossover probability, mutation probability.
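The steps named in the caption of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the listing from the paper: the function name `basic_ga`, the truncation selection of the better half, the uniform crossover, and the single-gene resampling mutation are all assumptions chosen for brevity.

```python
import random

def basic_ga(objective, limits, pop_size=20, generations=50,
             crossover_prob=0.7, mutation_prob=0.4, seed=0):
    """Minimize `objective` over the box `limits` with a generational GA (illustrative sketch)."""
    rng = random.Random(seed)
    # Random initial population inside the search-space limits.
    population = [[rng.uniform(lo, hi) for lo, hi in limits] for _ in range(pop_size)]
    best = min(population, key=objective)
    for _ in range(generations):
        # Evaluate and rank the complete generation. In a parallel GA this is
        # the point where all workers would have to wait for each other.
        ranked = sorted(population, key=objective)
        if objective(ranked[0]) < objective(best):
            best = ranked[0]
        # Selection (assumed policy): the better half forms the breeding pool.
        pool = ranked[: max(2, pop_size // 2)]
        children = []
        while len(children) < pop_size:
            mother, father = rng.sample(pool, 2)
            child = list(mother)
            # Crossover: recombine the parents' genes gene by gene.
            if rng.random() < crossover_prob:
                child = [m if rng.random() < 0.5 else f for m, f in zip(mother, father)]
            # Mutation: resample one randomly chosen gene within its limits.
            if rng.random() < mutation_prob:
                i = rng.randrange(len(child))
                lo, hi = limits[i]
                child[i] = rng.uniform(lo, hi)
            children.append(child)
        population = children
    # Compare the last generation against the best individual seen so far.
    return min((best, min(population, key=objective)), key=objective)
```

The stopping condition here is a fixed number of generations; any other termination criterion could replace the outer loop condition.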

Propulate Algorithm and Implementation
To alleviate the bottleneck inherent to synchronized parallel genetic algorithms, our massively parallel genetic optimizer Propulate (propagate and populate) implements a fully asynchronous island model specifically designed for largescale HPC systems. Unlike conventional GAs, Propulate maintains a continuous population of evaluated individuals with a softened notion of the typically strictly separated generations. This enables asynchronous evaluation, variation, propagation, and migration of individuals.
Propulate's basic mechanism is that of Darwinian evolution, i.e., beneficial traits are selected, recombined, and mutated to breed more fit individuals (see Algorithm 1). On a higher level, Propulate employs an IM, which combines independent evolution of self-contained subpopulations with intermittent exchange of selected individuals [34]. To coordinate the search globally, each island occasionally delegates migrants to be included in the target islands' populations. Islands communicate genetic information competitively, thus increasing diversity among the subpopulations compared to panmictic models [11]. For synchronous IMs, this exchange occurs simultaneously after fixed intervals, with no computation happening in that time. The following hyperparameters characterize IMs:
- Island number and subpopulation sizes
- Migration (pollination) probability
- Number of migrants (pollinators): How many individuals migrate from the source population at a time.
- Migration (pollination) topology: Directed graph of migration (pollination) paths between islands.
- Emigration policy: How to select emigrants (e.g., random or best) and whether to remove them from the source population (actual migration) or not (pollination).
- Immigration policy: How to insert immigrants into the target population, i.e., either add them (migration) or replace existing individuals (pollination, e.g., random or worst).
Fig. 1. We illustrate the asynchronous propagation and intra-island synchronization of the population using the example of the blue individual ind i1 g3 . This individual is bred by worker 1 in generation 3 by applying the propagator (yellow) to the worker's current population. After evaluating ind i1 g3 , worker 1 sends it to all workers on its island and appends it to its population. As no evaluated individuals dispatched by worker 2 await to be received, worker 1 proceeds with breeding. Worker 2 receives the blue ind i1 g3 only after finishing the evaluation of the red ind i1 g2 . It then appends both to its population and breeds a new individual for generation 3.
Propulate's functional principle is outlined in Algorithm 2. We consider multiple PEs (or workers) partitioned into islands. Each worker processes one individual at a time and maintains a population to track evaluated and migrated individuals on its island. To mitigate the computational overhead of synchronized OF evaluations, Propulate leverages asynchronous propagation of continuous populations with interwoven, worker-specific generations (see Figure 1). In each iteration, each worker breeds and evaluates an individual which is added to its population list. It then sends the individual with its evaluation result to all workers on the same island and, in return, receives evaluated individuals dispatched by them for a mutual update of their population lists. To avoid explicit synchronization points, the independently operating workers use asynchronous point-to-point communication via MPI to share their results. Each one dispatches its result immediately after finishing an evaluation. Directly afterwards, it non-blockingly checks for incoming messages from workers of its own island awaiting to be received. In the next iteration, it breeds a new individual by applying the evolutionary operators to its continuous population list of all evaluated individuals from any generation on the island. The workers thus proceed asynchronously without idle times despite the individuals' varying computational costs.
Algorithm 2: Propulate with pollination.
Input: Search-space limits; hyperparameters n_islands, island sizes P_i (i = 1, ..., n_islands), number of iterations generations, evolutionary operators (including selection policy, crossover probability, mutation probability etc.), pollination probability, pollination topology, emigration policy, immigration policy.
Wait for all other workers to finish: MPI.COMM_WORLD.barrier()
Final check for incoming messages so all workers hold complete population.
Probe individuals evaluated by other workers on island: receive_intra_isle_individuals()
Probe for incoming pollinators immigrating from other islands: receive_immigrants()
Probe for individuals replaced by other workers on island to be deactivated: deactivate_replaced_individuals()
Result: n individuals with smallest OF values.
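The effect of dropping the per-generation barrier can be illustrated with a small back-of-the-envelope model. The sketch below is an idealization under stated assumptions: it ignores breeding and communication costs, the evaluation times are randomly generated placeholders rather than measurements, and the function names are hypothetical. With a barrier after each generation, every generation lasts as long as its slowest evaluation; without barriers, each worker runs its evaluations back to back and only the final synchronization waits for the slowest worker.

```python
import random

def synchronous_walltime(eval_times):
    """eval_times[w][g] is the time worker w needs for its g-th evaluation.

    With a barrier after every generation, each generation costs as much
    as the slowest evaluation in it.
    """
    n_generations = len(eval_times[0])
    return sum(max(worker[g] for worker in eval_times) for g in range(n_generations))

def asynchronous_walltime(eval_times):
    """Without barriers, each worker evaluates back to back; only the final
    synchronization waits for the slowest worker."""
    return max(sum(worker) for worker in eval_times)

# Placeholder evaluation times: 8 workers, 100 evaluations each (not measured data).
rng = random.Random(42)
times = [[rng.uniform(1.0, 10.0) for _ in range(100)] for _ in range(8)]
sync_t = synchronous_walltime(times)
async_t = asynchronous_walltime(times)
print(f"synchronous: {sync_t:.1f}, asynchronous: {async_t:.1f}")
```

Since a sum of per-generation maxima is never smaller than the maximum of per-worker sums, the asynchronous estimate is always less than or equal to the synchronous one, and the gap widens as evaluation times vary more.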
After the mutual update, asynchronous migration or pollination between islands happens on a per-worker basis with a certain probability. Each worker selects a number of emigrants from its current population. For actual migration, an individual can only exist actively on one island. A worker thus may only choose eligible emigrants from an exclusive subset of the island's population to avoid overlapping selections by other workers. It then dispatches the emigrants to the target islands' workers as specified in the migration topology. Finally, it sends them to all workers on its island for island-wide deactivation of emigrated individuals before deactivating them in its own population.
In the next step, the worker probes for and, if applicable, receives immigrants from other islands. It then checks for individuals emigrated by other workers of its island and tries to deactivate them in its population. Due to the asynchronicity, individuals might be designated to be deactivated before arriving in the population. Propulate continuously corrects these synchronization artefacts during the optimization.
For pollination (Figure 2), identical copies of individuals can exist on multiple islands. Workers thus can choose emigrating pollinators from any active individuals in their current populations and do not deactivate them upon emigration. To control the population growth, pollinators replace active individuals in the target population according to the immigration policy. For proper accounting of the population, one random worker of the target island selects the individual to be replaced and informs the other workers accordingly. Individuals to be deactivated that are not yet in the population are cached to be replaced in the next iteration. This process is repeated until each worker has evaluated a set number of generations. Finally, the population is synchronized among workers and the best individuals are returned.
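The replacement step on the receiving side of pollination can be sketched as follows. This is an illustration of the bookkeeping only, without any MPI communication. The `Individual` class, its `active` flag, and the function name `immigrate_pollinators` are hypothetical and are not taken from the Propulate code base; the two policies shown ("worst" and "random") are the examples named in the immigration policy above.

```python
import random
from dataclasses import dataclass

@dataclass
class Individual:
    genes: tuple
    loss: float
    active: bool = True

def immigrate_pollinators(population, pollinators, policy="worst", rng=None):
    """Insert copies of `pollinators` and deactivate one active resident per immigrant.

    Returns the deactivated individuals, i.e., what the selecting worker
    would have to announce to the other workers on its island.
    """
    rng = rng or random.Random()
    replaced = []
    for pollinator in pollinators:
        # Choose the resident to replace before adding the immigrant,
        # so that an immigrant can never replace itself.
        residents = [ind for ind in population if ind.active]
        if residents:
            if policy == "worst":
                victim = max(residents, key=lambda ind: ind.loss)
            else:  # "random"
                victim = rng.choice(residents)
            victim.active = False
            replaced.append(victim)
        # The pollinator is a copy; the original stays active on its source island.
        population.append(Individual(pollinator.genes, pollinator.loss))
    return replaced
```

Deactivating instead of deleting keeps the full evaluation history available while the number of active individuals on the island stays constant.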
Propulate uses so-called propagators to breed child individuals from an existing collection of parent individuals. It implements various standard genetic operators, including uniform, best, and worst selection, random initialization, stochastic and conditional propagators, point and interval mutation, and several forms of crossover. In addition, Propulate provides a default propagator: Having selected two random parents from the breeding pool consisting of a set number of the currently most fit individuals, uniform crossover and point mutation are performed each with a specified probability. Afterwards, interval mutation is performed. To prevent premature trapping in a local optimum, a randomly initialized individual is added with a specified probability instead of one bred from the current population.
Fig. 2. Asynchronous pollination. Consider two islands with N (blue) and M (red) workers, respectively. We illustrate pollination (dark colors) by tracing worker N on island 1. After evaluation and mutual intra-island updates (light blue, see Figure 1), this worker performs pollination: it sends copies of the chosen pollinators to all workers of each target island, here island 2. The target island's workers receive the pollinators asynchronously (dark blue arrows). For proper accounting of the populations, worker 1 on island 2 selects the individual to be replaced and informs all workers on its island accordingly (middle red arrow). Afterwards, worker N receives incoming pollinators from island 2 to be included into its population. It then probes for individuals that have been replaced by other workers on its island, here worker 1, in the meantime and need to be deactivated. After these pollination-related intra-island population updates, it breeds the next generation. As pollination does not occur in this generation, it directly receives pollinators from island 2. This time, worker N chooses the individual to be replaced.
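The default propagator described above can be sketched roughly as follows. This is not Propulate's implementation: the function name, the `(genes, loss)` tuple representation, the `pool_size` and `sigma_factor` parameters, and in particular the form of the interval mutation (modeled here as a Gaussian perturbation clipped to the gene's limits) are assumptions made for illustration.

```python
import random

def default_propagator(population, limits, pool_size=4, crossover_prob=0.7,
                       point_mutation_prob=0.4, random_init_prob=0.2,
                       sigma_factor=0.05, rng=None):
    """Breed one child from `population`, a list of (genes, loss) tuples.

    `limits` holds one (low, high) pair per gene. Illustrative sketch only.
    """
    rng = rng or random.Random()
    # Occasionally (or if there are too few parents) return a random individual.
    if rng.random() < random_init_prob or len(population) < 2:
        return [rng.uniform(lo, hi) for lo, hi in limits]
    # Breeding pool: a set number of the currently fittest individuals.
    pool = sorted(population, key=lambda ind: ind[1])[: max(2, pool_size)]
    (genes_a, _), (genes_b, _) = rng.sample(pool, 2)
    child = list(genes_a)
    # Uniform crossover: take each gene from either parent with equal chance.
    if rng.random() < crossover_prob:
        child = [a if rng.random() < 0.5 else b for a, b in zip(genes_a, genes_b)]
    # Point mutation: resample a single gene uniformly within its limits.
    if rng.random() < point_mutation_prob:
        i = rng.randrange(len(child))
        child[i] = rng.uniform(*limits[i])
    # Interval mutation (assumed form): small Gaussian step, clipped to the limits.
    return [min(hi, max(lo, g + rng.gauss(0.0, sigma_factor * (hi - lo))))
            for g, (lo, hi) in zip(child, limits)]
```

Because the propagator only reads the current population list, a worker can call it at any time, which is what allows breeding without waiting for a generation to complete.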

Experimental Evaluation
We evaluate Propulate on various benchmark functions (see Section 4.4) and an HPO use case in remote sensing classification (see Section 4.5) which provides a real world application. We compare our results against Optuna, since it is the most widely used HPO software.

Benchmark Functions
Benchmark functions are used to evaluate optimizers in terms of convergence, accuracy, and robustness. The informative value of such studies is limited by how well we understand the characteristics making real-life optimization problems difficult and our ability to embed these features into benchmark functions [28]. We use Propulate to optimize a variety of traditional and recent benchmark functions emulating situations optimizers have to cope with in different kinds of problems (see Table 1).
- Rosenbrock has a narrow minimum inside a parabola-shaped valley.
- Step represents the problem of flat surfaces. Plateaus pose obstacles to optimizers as they lack information about which direction is favorable.
- Quartic is a unimodal function padded with Gaussian noise. As it never returns the same value on the same point, algorithms that do not perform well on this test function will do poorly on noisy data.
- Rastrigin is non-linear and highly multimodal. Its surface is determined by two external variables, controlling the modulation's amplitude and frequency. The local minima are located at a rectangular grid with size 1. Their functional values increase with the distance to the global minimum.
- Griewank's product creates sub-populations strongly codependent to parallel GAs, while the summation produces a parabola. Its local optima lie above parabola level but decrease with increasing dimensions, i.e., the larger the search range, the flatter the function.
- Schwefel has a second-best minimum far away from the global optimum.
- Lunacek's bi-sphere's [28] landscape structure is the minimum of two quadratic functions, each creating a single funnel in the search space. The spheres are placed along the positive search-space diagonal, with the optimal and sub-optimal sphere in the middle of the positive and negative quadrant, respectively. Their distance and the barrier's height increase with dimensionality, creating a globally non-separable underlying surface.
- Lunacek's bi-Rastrigin [28] is a double-funnel version of Rastrigin. This function isolates global structure as the main difference impacting problem difficulty on a well understood test case.
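For orientation, four of the functions described above are given below in their widely used textbook forms. The exact parameterizations and search ranges used in the paper are those of its Table 1, which may differ; the defaults here (Rastrigin amplitude 10 and frequency 2π, the Schwefel offset 418.9829 per dimension) are the common conventions and should be read as assumptions.

```python
import math

def rosenbrock(x):
    """Narrow curved valley; global minimum 0 at x = (1, ..., 1)."""
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
               for i in range(len(x) - 1))

def rastrigin(x, amplitude=10.0, frequency=2.0 * math.pi):
    """Parabola modulated by a cosine; global minimum 0 at the origin."""
    return amplitude * len(x) + sum(v ** 2 - amplitude * math.cos(frequency * v) for v in x)

def griewank(x):
    """Parabolic sum minus a cosine product; global minimum 0 at the origin."""
    parabola = sum(v ** 2 for v in x) / 4000.0
    product = math.prod(math.cos(v / math.sqrt(i)) for i, v in enumerate(x, start=1))
    return 1.0 + parabola - product

def schwefel(x):
    """Second-best minimum far from the global one, which lies near x_i = 420.9687."""
    return 418.9829 * len(x) - sum(v * math.sin(math.sqrt(abs(v))) for v in x)
```

Evaluating `rastrigin` at integer grid points such as `[1.0]` gives values close to the nearby local minima, which illustrates why their function values grow with the distance to the origin.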

Meta-Optimizing the Optimizer
Propulate itself has HPs influencing its optimization behavior, accuracy, and robustness. To explore their effect systematically and give transparent recommendations for default values, we conducted a grid search across the six most prominent HPs. The search space is shown in Table 2. We ran the grid search five times for the quartic, Rastrigin, and bi-Rastrigin benchmark functions (see Table 1 and Section 4.4), each with a different seed consistently used over all points within a search. All three functions have their global minimum at zero. They were chosen for their high-dimensional parameter spaces (30, 20, and 30, respectively) and different levels of difficulty to optimize. For quartic, Propulate found a minimum below 0.01 ± 0.005 for 80.12 % of all points across the five grid searches. This increases to 94.94 % for minima found within 0.1 ± 0.05 of the global minimum. In comparison, the tolerances have to be relaxed considerably for the more complex Rastrigin and bi-Rastrigin. While only 18.57 % of all grid points had a function value less than 1.0 ± 0.5 for Rastrigin, only a single point resulted in an average value of less than 10 for bi-Rastrigin. Although the average value of bi-Rastrigin was only less than 10 once, we found the minimum across each of the five searches to be less than 1.0 for 3.31 % of the grid points. Considering grid points with at least one result smaller than 1.0, 86.61 % used either 16 or 36 islands, while the remainder used eight. As Propulate initializes different islands at different positions in the search space, the chance that one of them is at a very beneficial position increases with the number of islands. This is further confirmed by a migration probability of 0.7 or 0.9 for 61.41 % of these points. If one of the islands is well-initialized, it thus will quickly notify others.
With every best grid point using pollination, we clearly find pollination to be favorable over real migration. To determine the other HPs, we compute the averages of the results for the top ten grid points across all three functions. The top ten were determined by grouping over the lowest average and standard deviation of the function values, sorting by the averages, and sorting by the standard deviations. This method reduces the chances of a single run simply benefiting from an advantageous starting seed. Average crossover, point-mutation, and random-initialization probabilities are 0.655 ± 0.056, 0.363 ± 0.133, and 0.423 ± 0.135, respectively. The average number of islands was 28.800 ± 6.009 which equates to an island population of 5.00 ± 1.043. The average migration probability was 0.527 ± 0.150. These values provide a reasonable starting point towards choosing default HPs for Propulate (see Table 3). As the grid searches only considered functions with independent parameters, we assume a relatively high random-initialization probability to be useful due to the benefits of random search [6]. On this account, we chose to reduce the default random-initialization probability to 0.2. As the migration probability might also be lowered artificially by this phenomenon, we set its default to 0.7. The default probabilities for crossover and point-mutation were chosen as 0.7 and 0.4, respectively. The island size was set at four individuals. This is a practical choice as our test system has four accelerators per node and the number of CPUs per node is a multiple of four.

Benchmark Function Optimization
For each function, we ran ten equivalent optimizations each with Propulate and Optuna, using the same compute resources, degree of parallelization, and number of evaluations. Figure 3 shows the optimization accuracy over walltime comparing Propulate with default parameters determined from our grid search (see Table 3) to Optuna's default optimizer. In terms of accuracy, Propulate and Optuna are comparable in most experiments. For many functions, e.g. Schwefel, bi-Rastrigin, and Rastrigin, Propulate even achieves a better OF value. In terms of walltime, Propulate is consistently at least one order of magnitude faster. This is due to Propulate's MPI-based communication over the fast network, whereas Optuna uses relational databases with SQL and is limited by the slow file system. Since the functions are cheap to evaluate, optimization and communication dominate the walltime. In particular for problems where evaluations are cheap compared to the search itself, we find that Optuna's computational efficiency suffers massively from the frequent file locking inherent to its parallelization strategy, reducing its usability for large-scale HPC applications.
In addition, we inspected the evolution of the population over walltime for both Propulate and Optuna. An example for minimizing the Rastrigin function is shown in Figure 4. Propulate is roughly three orders of magnitude faster and makes significantly greater progress in terms of both OF values and distance to the global optimum. Due to this drastic difference in runtime, we measure only 46.27 Wh for Propulate compared to Optuna's 2646.29 Wh.

HP Optimization for Remote Sensing Classification
BigEarthNet [35] is a Sentinel-2 multispectral image dataset in remote sensing. It comprises 590 326 image patches each of which is assigned one or more of the 19 available CORINE Land Cover map labels [10,35]. Multiple computer vision networks for BigEarthNet classification have been trained [35], with ResNet-50 [20] being the most accurate. While a previous Propulate version was used to optimize a set of HPs and the architecture for this use case [13], a more versatile and efficient parallelization strategy in the current version makes it worthwhile to revisit this application. Analogously to [13], we consider different optimizers, learning rate (LR) schedulers, activation functions, loss functions, number of filters in each convolutional block, and activation order [21]. The search space is shown in Table 4. Optimizer parameters, LR functions, and LR warmup are included as well. We only consider SGD-based optimizers as they share common parameters and thus exclude Adam-like optimizers from the search. We theorize that including Adam led to the difficulties seen previously [13]. The training is exited if the validation loss has not decreased for ten epochs. We prepared the data analogously to [13]. The network is implemented in TensorFlow [1].
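The stopping rule for the training runs can be illustrated with a short sketch. It assumes the usual patience-based reading of the criterion, i.e., training ends once the validation loss has failed to improve on its best value for ten consecutive epochs; the function name and the example loss values are hypothetical.

```python
def epoch_of_early_stop(val_losses, patience=10):
    """Return the 1-based epoch at which training stops, or None if it never stops.

    Training stops once the validation loss has not improved on its best
    value for `patience` consecutive epochs.
    """
    best = float("inf")
    stale_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                return epoch
    return None

# Made-up loss curve: improves for three epochs, then stagnates.
losses = [1.0, 0.9, 0.8] + [0.85] * 12
print(epoch_of_early_stop(losses))
```

In a Keras-style training loop the same behavior is typically obtained with an early-stopping callback monitoring the validation loss, so a hand-written loop like this is only needed for custom training code.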
For both Propulate and Optuna, we ran three searches each over 24 h on 32 GPUs. We use 1 − F_1^val, with F_1^val the validation F1 score, as the OF to be minimized. On average, Optuna achieves its best OF value of (0.39 ± 0.01) within (7.05 ± 3.14) h. Propulate beats Optuna's average best after (5.30 ± 2.41) h and achieves its best OF value of (0.36 ± 0.00) within (13.89 ± 5.15) h.
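The objective 1 − F1 for a multi-label problem can be computed as sketched below. The text does not state how the F1 score is averaged over the 19 labels, so micro-averaging is an assumption here, as are the function name and the list-of-lists input format.

```python
def one_minus_f1(y_true, y_pred):
    """Return 1 - F1 (micro-averaged) for multi-label 0/1 targets and predictions.

    `y_true` and `y_pred` are lists of equally long 0/1 label vectors,
    one vector per sample. Lower values are better.
    """
    tp = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        for t, p in zip(truth, pred):
            if t and p:
                tp += 1      # label correctly predicted
            elif p:
                fp += 1      # label predicted but not present
            elif t:
                fn += 1      # label present but missed
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return 1.0 - f1
```

Minimizing this quantity is equivalent to maximizing the F1 score, which lets a minimizer such as the one described here be used unchanged.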

Scaling
Finally, we explore Propulate's scaling behavior for the use case presented in Section 4.5. Figure 5 shows our results for weak and strong linear scaling. Our baseline configuration used two nodes. Since each node has four GPUs, we calculate speedup and efficiency with respect to eight workers. For strong scaling, we fix the total number of evaluations at 512 and increase the number of workers, i.e., GPUs. We average over three runs with different seeds and keep four workers per island while increasing the number of islands. Speedup increases up to 128 workers, where we reach approximately half the optimal value. This is an expected decline since each worker only processes few individuals, so the variance in evaluation times leads to larger idle times of the faster workers before the final population synchronization at the end. Additionally, as the number of workers approaches the total number of evaluations, the randomly initialized evolutionary search in turn approaches a random search. This means that the search performance is likely to be worse than what the pure compute performance might suggest. It is still possible to apply Propulate on these scales, but the other search parameters have to be adjusted accordingly, as shown in the weak scaling plot (Figure 5, top). Weak efficiency only drops to 95 % at our largest configuration of 128 workers. The early superlinear behavior is likely due to the non-sequential baseline.
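Speedup and efficiency relative to the eight-worker baseline can be computed as in the sketch below. The walltimes are made-up placeholder values, not measurements from the paper; they are chosen only so that the 128-worker case reproduces the "approximately half the optimal value" situation described above. The function names are hypothetical.

```python
def strong_scaling(walltimes, baseline_workers=8):
    """Map worker count -> (speedup, efficiency) for a fixed total workload.

    `walltimes` maps worker count to measured walltime. Speedup is relative
    to the baseline configuration; the ideal speedup is n / baseline_workers.
    """
    t_base = walltimes[baseline_workers]
    result = {}
    for n, t in sorted(walltimes.items()):
        speedup = t_base / t
        ideal = n / baseline_workers
        result[n] = (speedup, speedup / ideal)
    return result

def weak_scaling_efficiency(walltimes, baseline_workers=8):
    """Map worker count -> efficiency when the workload grows with the workers."""
    t_base = walltimes[baseline_workers]
    return {n: t_base / t for n, t in sorted(walltimes.items())}

# Placeholder walltimes in hours (illustrative only, not measured).
hypothetical = {8: 16.0, 16: 8.0, 32: 4.2, 64: 2.5, 128: 2.0}
for n, (s, e) in strong_scaling(hypothetical).items():
    print(f"{n:4d} workers: speedup {s:5.2f}, efficiency {e:4.2f}")
```

With a non-sequential baseline, efficiencies above 1 at small worker counts are possible, because the reference run itself already includes parallel overheads.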

Conclusion
We presented Propulate, our HPC-adapted, asynchronous genetic optimization algorithm and software. Our experimental evaluation shows that the fully asynchronous evaluation, propagation, and migration enable a highly efficient and parallelizable genetic optimization. Harder to quantify than performance but very important is ease of use. Especially for HPC applications at scale, some parallelization and distribution models are more suited than others. A purely MPI-based implementation as in Propulate is not only extremely efficient for highly parallel and communication-intensive algorithms but also easy to set up and maintain, since the required infrastructure is commonly available on HPC systems. This is not the case for any of the other tools investigated, except for the not publicly available MENNDL. This also facilitates a tighter coupling of individuals during the optimization, which enables a more efficient evaluation of candidates and in particular early stopping informed by previously evaluated individuals in the NAS case. Propulate was already successfully applied to HPO for various ML models on different HPC machines [13,17]. Another avenue for future work is including variable-length gene descriptions. Mutually exclusive genes of different lengths, such as the parameter sets for Adam- and SGD-like optimizers in our NAS use case, can thus be explored efficiently. While this is already possible, it requires an inconvenient workaround of including inactive genes and adapting the propagators to manually prevent the evaluation of many individuals differing only in inactive genes.