
1 Introduction

Machine learning (ML) algorithms are heavily used in almost every area of human life today, from medical diagnosis and critical infrastructure to transportation and food production. Almost all ML algorithms have non-learnable hyperparameters (HPs) that influence the training and, in particular, the predictive capacity of the resulting model. As evaluating a set of HPs involves at least a partial training, state-free approaches to HP optimization (HPO), like grid and random search, often exceed the available compute resources [15]. To explore the high-dimensional HP spaces efficiently, information from previous evaluations must be leveraged to guide the search. Such state-dependent strategies minimize the number of evaluations needed to find a useful model, reducing search times and thus the energy consumption of the computation. Bayesian and bio-inspired optimizers are the most popular of these AutoML approaches. Among the latter, genetic algorithms (GAs) are versatile metaheuristics inspired by natural evolution. To solve a search-for-solutions problem, a population of candidate solutions (or individuals) is evolved in an iterative interplay of selection and variation [23, 30]. Although reaching the global optimum is not guaranteed, GAs often find near-optimal solutions with less computational effort than classical optimizers [8, 9]. They have become popular for various optimization problems, including HPO for ML and neural architecture search (NAS) [14].

To take full advantage of the increasingly bigger models and datasets, designing scalable algorithms for high performance computing (HPC) has become a must [40]. While Bayesian optimization is inherently serial, the structure of GAs renders them suitable for parallelization [34]: Since all candidates in each iteration are independent, they can be evaluated in parallel. To breed the next generation, however, the previous one has to be completed. As the computational expenses for evaluating different candidates vary, synchronizing the parallel evolutionary process affects the scalability by introducing a substantial bottleneck. Approaches to reducing the overall communication in parallel GAs like the island model (IM) [34] do not address the underlying synchronization problem.

To solve the issues arising from explicit synchronization, we introduce Propulate, a massively parallel genetic optimizer with asynchronous propagation of populations and migration. Unlike classical GAs, Propulate maintains a continuous population of already evaluated individuals with a softened notion of the typically strictly separated, discrete generations. Our contributions include:

  • A novel parallel genetic algorithm based on a fully asynchronous island model with independently processing workers, allowing the optimization process to be parallelized and the internal evaluation of the objective function to be distributed.

  • Massive parallelism by asynchronous propagation of continuous populations and migration.

  • A prototypical implementation in Python using efficient communication via the Message Passing Interface (MPI).

  • Optimal use of parallel hardware by minimizing idle times in HPC systems.

We use Propulate to optimize various benchmark functions and the HPs of a deep neural network on a supercomputer. Comparing our results to those of the popular HPO package Optuna, we find that Propulate is consistently drastically faster without sacrificing solution accuracy. We further show that Propulate scales well to at least 100 processing elements (PEs) without relevant loss of efficiency, demonstrating the efficacy of our asynchronous evolutionary approach.

2 Related Work

Recent progress in ML has triggered heavy use of these techniques with Python as the de facto standard programming language. Tuning HPs requires solving high-dimensional optimization problems with ML algorithms as black boxes and model performance metrics as objective functions (OFs). Most common are Bayesian optimizers (e.g. Optuna [2], Hyperopt [7], SMAC3 [24, 27], Spearmint [32], GPyOpt [5], and MOE [38]) and bio-inspired methods such as swarm-based (e.g. FLAPS [39]) and evolutionary (e.g. DEAP [16], MENNDL [40]) algorithms. Below, we provide an overview of popular HP optimizers in Python, with a focus on state-dependent parallel algorithms and implementations. A theoretical overview of parallel GAs can be found in surveys [3, 4, 12] and books [29, 37].

Optuna adopts various algorithms for HP sampling and pruning of unpromising trials, including tree-structured Parzen estimators (TPEs), Gaussian processes, and the covariance matrix adaptation evolution strategy (CMA-ES). It enables parallel runs via a relational database server. In the parallel case, an Optuna candidate obtains information about previous candidates from disk and stores its own results there.
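To illustrate this storage-backed parallelization scheme, the following minimal sketch sets up an Optuna study that several independently launched processes can share; the study name, database path, and toy objective are placeholders and not the setup used in our experiments.

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # Toy objective standing in for an expensive model training.
    x = trial.suggest_float("x", -10.0, 10.0)
    return (x - 2.0) ** 2

# Several independently launched processes can run this script concurrently;
# they coordinate solely through the shared relational database on disk.
study = optuna.create_study(
    study_name="shared-example-study",      # placeholder name
    storage="sqlite:///shared_example.db",  # shared on-disk database (placeholder path)
    load_if_exists=True,
    direction="minimize",
)
study.optimize(objective, n_trials=100)
```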

SMAC3 (Sequential Model-based Algorithm Configuration) combines a random-forest based Bayesian approach with an aggressive racing mechanism [24]. Its parallel variant pSMAC uses multiple collaborating SMAC3 runs which share their evaluations through the file system.

Spearmint, GPyOpt, and MOE are Gaussian-process based Bayesian optimizers. Spearmint enables distributed HPO via Sun Grid Engine and MongoDB. GPyOpt is integrated into the Sherpa package [22], which provides implementations of recent HP optimizers along with the infrastructure to run them in parallel via a grid engine and a database server. MOE (Metric Optimization Engine) uses a one-step Bayes-optimal algorithm to maximize the multi-points expected improvement in a parallel setting [38]. Using a REST-based client-server model, it enables multi-level parallelism by distributing each evaluation and running multiple evaluations at a time.

Nevergrad [31] and Autotune [25] provide gradient-free and evolutionary optimizers, including Bayesian, particle swarm, and one-shot optimization. In Nevergrad, parallel evaluations use several workers via an executor from Python’s concurrent.futures module. Autotune enables concurrent global and local searches, cross-method sharing of evaluations, method hybridization, and multi-level parallelism. Open Source Vizier [33] is a Python interface for Google’s HPO service Vizier. It implements Gaussian process bandits [19] and enables dynamic optimizer switching. A central database server does the algorithmic proposal work, while clients perform evaluations and communicate with the server via remote procedure calls. Katib [18] is a cloud-native AutoML project based on the Kubernetes container orchestration system. It integrates with Optuna and Hyperopt. Tune [26] is built on the Ray distributed computing platform. It interfaces with Optuna, Hyperopt, and Nevergrad and leverages multi-level parallelism.

DEAP (Distributed Evolutionary Algorithms in Python) [16] implements general GAs, evolution strategies, multi-objective optimization, and co-evolution of multi-populations. It enables parallelization via Python’s multiprocessing or SCOOP module. EvoTorch [36] is built on PyTorch and implements distribution- and population-based algorithms. Using a Ray cluster, it can scale over multiple CPUs, GPUs, and computers. MENNDL (Multi-node Evolutionary Neural Networks for Deep Learning) [40] is a closed-source MPI-parallelized HP optimizer for automated network selection. A master node handles the genetic operations while evaluations are done on the remaining worker nodes. However, global synchronization hinders optimal resource utilization [40].

3 Propulate Algorithm and Implementation


To alleviate the bottleneck inherent to synchronized parallel genetic algorithms, our massively parallel genetic optimizer Propulate implements a fully asynchronous island model specifically designed for large-scale HPC systems. Unlike conventional GAs, Propulate maintains a continuous population of evaluated individuals with a softened notion of the typically strictly separated generations. This enables asynchronous evaluation, variation, propagation, and migration of individuals. To ensure interoperability with existing data science and ML workflows, we provide a Python implementation. In most applications, evaluating the OF represents the largest contribution to the total resource consumption. Performance-relevant paths inside the OF evaluation are expected to be implemented and optimized in CUDA, C/C++, or Fortran. With the aforementioned workflows, this is typically already the case.

Propulate's basic mechanism is that of Darwinian evolution, i.e., beneficial traits are selected, recombined, and mutated to breed more fit individuals (see Algorithm 1). On a higher level, Propulate employs an IM, which combines independent evolution of self-contained subpopulations with intermittent exchange of selected individuals [34]. To coordinate the search globally, each island occasionally delegates migrants to be included in the target islands’ populations. With worse performing islands typically receiving candidates from better performing ones, islands communicate genetic information competitively, thus increasing diversity among the subpopulations compared to panmictic models [11]. Independent of the breeding mechanism used on each single island of a synchronous IM, this migrant exchange occurs simultaneously after a fixed number of synchronously evaluated generations, with no computation happening in that time. The following hyperparameters characterize an IM (a configuration sketch follows the list):

  • Island number and subpopulation sizes

  • Migration (pollination) probability

  • Number of migrants (pollinators): How many individuals migrate from the source population at a time.

  • Migration (pollination) topology: Directed graph of migration (pollination) paths between islands.

  • Emigration policy: How to select emigrants (e.g., random or best) and whether to remove them from the source population (actual migration) or not (pollination).

  • Immigration policy: How to insert immigrants into the target population, i.e., either add them (migration) or replace existing individuals (pollination, e.g., random or worst).
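As a purely illustrative summary of these knobs, the following configuration object collects them in one place; all names and defaults are hypothetical and do not reflect Propulate's actual API.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class IslandModelConfig:
    """Hypothetical container for the island-model hyperparameters listed above."""
    num_islands: int = 4
    island_size: int = 4                        # workers (and thus subpopulation slots) per island
    migration_probability: float = 0.7          # per-worker chance to migrate after a generation
    num_migrants: int = 1                       # individuals sent per migration event
    # Directed migration topology: source island -> list of target islands (here a simple ring).
    topology: Dict[int, List[int]] = field(
        default_factory=lambda: {0: [1], 1: [2], 2: [3], 3: [0]}
    )
    emigration_policy: str = "best"             # how to select emigrants: "best" or "random"
    pollination: bool = True                    # True: emigrants stay active on the source island
    immigration_policy: str = "replace_worst"   # how immigrants enter the target population
```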


Propulate's functional principle is outlined in Algorithm 2. We consider multiple PEs (or workers) partitioned into islands. Each worker processes one individual at a time and maintains a population to track evaluated and migrated individuals on its island. To mitigate the computational overhead of synchronized OF evaluations, Propulate leverages asynchronous propagation of continuous populations with interwoven, worker-specific generations (see Fig. 1). In each iteration, each worker breeds and evaluates an individual, which is added to its population list. It then sends the individual with its evaluation result to all workers on the same island and, in return, receives evaluated individuals dispatched by them for a mutual update of their population lists. To avoid explicit synchronization points, the independently operating workers use asynchronous point-to-point communication via MPI to share their results. Each one dispatches its result immediately after finishing an evaluation. Directly afterwards, it non-blockingly checks for incoming messages from workers of its own island waiting to be received. In the next iteration, it breeds a new individual by applying the evolutionary operators to its continuous population list of all evaluated individuals from any generation on the island. The workers thus proceed asynchronously without idle times despite the individuals’ varying computational costs.
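The following minimal sketch illustrates this per-worker loop with mpi4py. It is a simplified illustration under the assumption that all ranks form a single island, with a toy propagator and toy objective function; it is not Propulate's actual implementation.

```python
import random
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
rng = random.Random(rank)
NUM_GENERATIONS = 10

def evaluate(ind):
    # Toy objective (sphere function) standing in for an expensive evaluation.
    return ind["x"] ** 2 + ind["y"] ** 2

def breed(population):
    # Toy propagator: recombine and perturb the two fittest individuals, else initialize randomly.
    if len(population) < 2:
        return {"x": rng.uniform(-5.0, 5.0), "y": rng.uniform(-5.0, 5.0)}
    p1, p2 = sorted(population, key=lambda ind: ind["loss"])[:2]
    return {g: rng.choice((p1[g], p2[g])) + rng.gauss(0.0, 0.1) for g in ("x", "y")}

population = []  # continuous population of all evaluated individuals on the island
for generation in range(NUM_GENERATIONS):
    ind = breed(population)
    ind["loss"] = evaluate(ind)
    population.append(ind)

    # Dispatch the evaluated individual to all other island workers
    # (non-blocking sends; requests are not tracked here for brevity).
    for dest in range(size):
        if dest != rank:
            comm.isend(ind, dest=dest, tag=0)

    # Non-blockingly drain results dispatched by other workers, then continue breeding.
    while comm.iprobe(source=MPI.ANY_SOURCE, tag=0):
        population.append(comm.recv(source=MPI.ANY_SOURCE, tag=0))
```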

Fig. 1. Asynchronous propagation. Interaction of two workers on one island. Individuals bred by worker 1 and 2 are shown in blue and red, respectively. Their origins are given by a generation sub- and an island superscript. Populations are depicted as round grey boxes, where most recent individuals have black outlines. Varying evaluation times are represented by sharp boxes of different widths. We illustrate the asynchronous propagation and intra-island synchronization of the population using the example of the blue individual \(\textrm{ind}_{g3}^{i1}\). This individual is bred by worker 1 in generation 3 by applying the propagator (yellow) to the worker’s current population. After evaluating \(\textrm{ind}_{g3}^{i1}\), worker 1 sends it to all workers on its island and appends it to its population. As no evaluated individuals dispatched by worker 2 await to be received, worker 1 proceeds with breeding. Worker 2 receives the blue \(\textrm{ind}_{g3}^{i1}\) only after finishing the evaluation of the red \(\textrm{ind}_{g2}^{i1}\). It then appends both to its population and breeds a new individual for generation 3. (Color figure online)

After the mutual update, asynchronous migration or pollination between islands happens on a per-worker basis with a certain probability. Each worker selects a number of emigrants from its current population. For actual migration, an individual can only exist actively on one island. A worker thus may only choose eligible emigrants from an exclusive subset of the island’s population to avoid overlapping selections by other workers. It then dispatches the emigrants to the target islands’ workers as specified in the migration topology. Finally, it sends them to all workers on its island for island-wide deactivation of emigrated individuals before deactivating them in its own population.

In the next step, the worker probes for and, if applicable, receives immigrants from other islands. It then checks for individuals emigrated by other workers of its island and tries to deactivate them in its population. Due to the asynchronicity, individuals might be designated to be deactivated before arriving in the population. Propulate continuously corrects these synchronization artefacts during the optimization.

For pollination (see Fig. 2), identical copies of individuals can exist on multiple islands. Workers thus can choose emigrating pollinators from any active individuals in their current populations and do not deactivate them upon emigration. To control the population growth, pollinators replace active individuals in the target population according to the immigration policy. For proper accounting of the population, one random worker of the target island selects the individual to be replaced and informs the other workers accordingly. Individuals to be deactivated that are not yet in the population are cached to be replaced in the next iteration. This process is repeated until each worker has evaluated a set number of generations. Finally, the population is synchronized among workers and the best individuals are returned.

Fig. 2. Asynchronous pollination. Consider two islands with N (blue) and M (red) workers, respectively. We illustrate pollination (dark colors) by tracing worker N on island 1. After evaluation and mutual intra-island updates (light blue, see Fig. 1), this worker performs pollination: It sends copies of the chosen pollinators to all workers of each target island, here island 2. The target island’s workers receive the pollinators asynchronously (dark blue arrows). For proper accounting of the populations, worker 1 on island 2 selects the individual to be replaced and informs all workers on its island accordingly (middle red arrow). Afterwards, worker N receives incoming pollinators from island 2 to be included into its population. It then probes for individuals that have been replaced by other workers on its island, here worker 1, in the meantime and need to be deactivated. After these pollination-related intra-island population updates, it breeds the next generation. As pollination does not occur in this generation, it directly receives pollinators from island 2. This time, worker N chooses the individual to be replaced. (Color figure online)

Propulate uses so-called propagators to breed child individuals from an existing collection of parent individuals. It implements various standard genetic operators, including uniform, best, and worst selection, random initialization, stochastic and conditional propagators, point and interval mutation, and several forms of crossover. In addition, Propulate provides a default propagator: Having selected two random parents from the breeding pool consisting of a set number of the currently most fit individuals, uniform crossover and point mutation are performed each with a specified probability. Afterwards, interval mutation is performed. To prevent premature trapping in a local optimum, a randomly initialized individual is added with a specified probability instead of one bred from the current population.
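The following sketch re-implements this default breeding step in plain Python for illustration; parameter names, defaults, and helper structure are ours and do not mirror Propulate's actual propagator classes.

```python
import random

def breed_default(population, limits, rng, pool_size=20,
                  crossover_prob=0.7, point_mutation_prob=0.4,
                  interval_sigma=0.1, random_init_prob=0.2):
    """Illustrative sketch of the default breeding step described above.
    `limits` maps each gene name to its (low, high) search range."""
    # Occasionally inject a completely random individual to avoid premature convergence.
    if len(population) < 2 or rng.random() < random_init_prob:
        return {g: rng.uniform(lo, hi) for g, (lo, hi) in limits.items()}

    # Choose two random parents from the breeding pool of currently fittest individuals.
    pool = sorted(population, key=lambda ind: ind["loss"])[:pool_size]
    parent1, parent2 = rng.sample(pool, 2)
    child = {g: parent1[g] for g in limits}

    # Uniform crossover with a given probability: each gene comes from either parent.
    if rng.random() < crossover_prob:
        child = {g: rng.choice((parent1[g], parent2[g])) for g in limits}

    # Point mutation with a given probability: reset one random gene entirely.
    if rng.random() < point_mutation_prob:
        g = rng.choice(list(limits))
        child[g] = rng.uniform(*limits[g])

    # Interval mutation: perturb each gene within a small interval, clipped to its range.
    for g, (lo, hi) in limits.items():
        child[g] = min(hi, max(lo, child[g] + rng.gauss(0.0, interval_sigma * (hi - lo))))
    return child
```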

4 Experimental Evaluation

We evaluate Propulate on various benchmark functions (see Sect. 4.4) and an HPO use case in remote sensing classification (see Sect. 4.5), which provides a real-world application. We compare our results against Optuna since it is among the most widely used HPO packages.

4.1 Experimental Environment

We ran the experiments on the distributed-memory, parallel hybrid supercomputer Hochleistungsrechner Karlsruhe (HoreKa) at the Steinbuch Centre for Computing, Karlsruhe Institute of Technology. Each of its 769 compute nodes is equipped with two 38-core Intel Xeon Platinum 8368 processors at 2.4 GHz base and 3.4 GHz maximum turbo frequency, 256 GB (standard) or 512 GB (high-memory and accelerator) local memory, a local 960 GB NVMe SSD disk, and two network adapters. 167 of the nodes are accelerator nodes, each equipped with four NVIDIA A100-40 GPUs with 40 GB memory connected via NVLink. Inter-node communication uses a low-latency, non-blocking NVIDIA Mellanox InfiniBand 4X HDR interconnect with 200 Gbit/s per port. A Lenovo XClarity controller measures full node energy consumption, excluding file systems, networking, and cooling. The operating system is Red Hat Enterprise Linux 8.2.

4.2 Benchmark Functions

Benchmark functions are used to evaluate optimizers in terms of convergence, accuracy, and robustness. The informative value of such studies is limited by how well we understand the characteristics making real-life optimization problems difficult and our ability to embed these features into benchmark functions [28]. We use Propulate to optimize a variety of traditional and recent benchmark functions emulating situations optimizers have to cope with in different kinds of problems (see Table 1).

  • Sphere is smooth, unimodal, strongly convex, symmetric, and thus simple.

  • Rosenbrock has a narrow minimum inside a parabola-shaped valley.

  • Step represents the problem of flat surfaces. Plateaus pose obstacles to optimizers as they lack information about which direction is favorable.

  • Quartic is a unimodal function padded with Gaussian noise. As it never returns the same value at the same point, algorithms that do not perform well on this test function will do poorly on noisy data.

  • Rastrigin is non-linear and highly multimodal. Its surface is determined by two external variables controlling the modulation’s amplitude and frequency. The local minima lie on a rectangular grid with spacing 1. Their function values increase with the distance to the global minimum.

  • Griewank’s product term couples the variables, creating strongly codependent sub-populations that challenge parallel GAs, while the summation produces a parabola. Its local optima lie above the parabola level but decrease with increasing dimensionality, i.e., the larger the search range, the flatter the function.

  • Schwefel has a second-best minimum far away from the global optimum.

  • Lunacek’s bi-sphere’s [28] landscape structure is the minimum of two quadratic functions, each creating a single funnel in the search space. The spheres are placed along the positive search-space diagonal, with the optimal and sub-optimal sphere in the middle of the positive and negative quadrant, respectively. Their distance and the barrier’s height increase with dimensionality, creating a globally non-separable underlying surface.

  • Lunacek’s bi-Rastrigin [28] is a double-funnel version of Rastrigin. This function isolates global structure as the main difference impacting problem difficulty on a well understood test case.

Table 1. Benchmark functions.
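For concreteness, two of these benchmark functions in their standard textbook form (the conventional Rastrigin amplitude \(A=10\) is assumed):

```python
import numpy as np

def sphere(x: np.ndarray) -> float:
    """Smooth, unimodal, convex; global minimum 0 at the origin."""
    return float(np.sum(x ** 2))

def rastrigin(x: np.ndarray, amplitude: float = 10.0) -> float:
    """Highly multimodal; local minima on a unit grid, global minimum 0 at the origin."""
    return float(amplitude * x.size + np.sum(x ** 2 - amplitude * np.cos(2.0 * np.pi * x)))

# Example: evaluate a random 20-dimensional point in the usual Rastrigin domain [-5.12, 5.12].
x = np.random.uniform(-5.12, 5.12, size=20)
print(sphere(x), rastrigin(x))
```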

4.3 Meta-optimizing the Optimizer

Propulate itself has HPs influencing its optimization behavior, accuracy, and robustness. To explore their effect systematically and give transparent recommendations for default values, we conducted a grid search across the six most prominent HPs. The search space is shown in Table 2. We ran the grid search five times for the quartic, Rastrigin, and bi-Rastrigin benchmark functions (see Table 1 and Sect. 4.4), each with a different seed consistently used over all points within a search. All three functions have their global minimum at zero. They were chosen for their high-dimensional parameter spaces (30, 20, and 30 dimensions, respectively) and different levels of difficulty to optimize. For quartic, Propulate found a minimum below \(0.01\pm 0.005\) for 80.12% of all points across the five grid searches. This increases to 94.94% for minima found within \(0.1\pm 0.05\) of the global minimum. In comparison, the tolerances have to be relaxed considerably for the more complex Rastrigin and bi-Rastrigin: only 18.57% of all grid points reached a function value below \(1.0\pm 0.5\) for Rastrigin, and only a single point resulted in an average value below 10 for bi-Rastrigin. Although the average bi-Rastrigin value dropped below 10 only once, the minimum across the five searches was below 1.0 for 3.31% of the grid points.

Table 2. Grid search parameters. All experiments use 144 CPUs equally distributed between two nodes. Random-initialization probability refers to the chance that a new individual is generated entirely randomly.

Considering grid points with at least one result smaller than 1.0, 86.61% used either 16 or 36 islands, while the remainder used eight. As Propulate initializes different islands at different positions in the search space, the chance that one of them starts at a very beneficial position increases with the number of islands. This is further confirmed by a migration probability of 0.7 or 0.9 for 61.41% of these points: if one of the islands is well-initialized, it will thus quickly pass its good candidates on to the others.

Since every one of the best grid points uses pollination, we clearly find pollination to be favorable over actual migration. To determine the other HPs, we compute the averages of the results for the top ten grid points across all three functions. The top ten were determined by grouping the results per grid point, computing the average and standard deviation of the function values, sorting by the averages, and breaking ties by the standard deviations. This method reduces the chances of a single run simply benefiting from an advantageous starting seed. Average crossover, point-mutation, and random-initialization probabilities are \(0.655 \pm 0.056\), \(0.363 \pm 0.133\), and \(0.423 \pm 0.135\), respectively. The average number of islands was \(28.800 \pm 6.009\), which equates to an island population of \(5.00 \pm 1.043\). The average migration probability was \(0.527 \pm 0.150\). These values provide a reasonable starting point towards choosing default HPs for Propulate (see Table 3). As the grid searches only considered functions with independent parameters, we assume a relatively high random-initialization probability to be useful due to the benefits of random search [6]. On this account, we chose to reduce the default random-initialization probability to 0.2. As the migration probability might also be artificially lowered by this phenomenon, we set its default to 0.7. The default probabilities for crossover and point mutation were chosen as 0.7 and 0.4, respectively. The island size was set to four individuals. This is a practical choice as our test system has four accelerators per node and the number of CPUs per node is a multiple of four.
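As an illustration of this ranking procedure (not the actual analysis script), the grid-search results could be aggregated as follows; the column names and values are made up.

```python
import pandas as pd

# One row per (grid point, seed) with the best function value found in that run.
# Columns and values are illustrative only.
results = pd.DataFrame({
    "grid_point": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "seed":       [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "best_value": [0.8, 1.2, 0.9, 0.3, 0.4, 0.5, 2.0, 0.1, 1.7],
})

ranking = (
    results.groupby("grid_point")["best_value"]
    .agg(["mean", "std"])
    .sort_values(["mean", "std"])  # sort by average first, then by standard deviation
)
top_grid_points = ranking.head(10)
print(top_grid_points)
```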

4.4 Benchmark Function Optimization

For each function, we ran ten equivalent Propulate and Optuna optimizations, using the same compute resources, degree of parallelization, and number of evaluations. Figure 3 shows the optimization accuracy over wallclock time, comparing Propulate with default parameters determined from our grid search (see Table 3) to Optuna’s default optimizer. In terms of accuracy, Propulate and Optuna are comparable in most experiments. For many functions, e.g. Schwefel, bi-Rastrigin, and Rastrigin, Propulate even achieves a better OF value. In terms of wallclock time, Propulate is consistently at least one order of magnitude faster. This is due to Propulate's MPI-based communication over the fast network, whereas Optuna uses relational databases with SQL and is limited by the slow file system. Since the functions are cheap to evaluate, optimization and communication dominate the wallclock time. In particular for problems where evaluations are cheap compared to the search itself, we find that Optuna’s computational efficiency suffers massively from the frequent file locking inherent to its parallelization strategy, reducing its usability for large-scale HPC applications.
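For reference, a minimal single-process sketch of such an Optuna baseline on the Rastrigin function; the dimensionality and trial count are illustrative, and the parallel runs additionally rely on the relational-database storage described above.

```python
import numpy as np
import optuna

DIM = 20  # illustrative dimensionality

def rastrigin_objective(trial: optuna.Trial) -> float:
    # Sample each coordinate in the usual Rastrigin domain [-5.12, 5.12].
    x = np.array([trial.suggest_float(f"x{i}", -5.12, 5.12) for i in range(DIM)])
    return float(10.0 * DIM + np.sum(x ** 2 - 10.0 * np.cos(2.0 * np.pi * x)))

study = optuna.create_study(direction="minimize")  # default sampler (TPE)
study.optimize(rastrigin_objective, n_trials=1000)
print(study.best_value)
```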

Table 3. Propulate HPs for benchmark function minimization.

In addition, we inspected the evolution of the population over wallclock time for both Propulate and Optuna. An example for minimizing the Rastrigin function is shown in Fig. 4. Propulate is roughly three orders of magnitude faster and makes significantly greater progress in terms of both OF values and distance to the global optimum. Due to this drastic difference in runtime, we measured an energy consumption of only 46.27 Wh for Propulate compared to Optuna’s 2646.29 Wh.

Fig. 3. Benchmark function minimization accuracy over wallclock time. Lowest function values found by Propulate (red) and Optuna (blue) versus wallclock time to reach them, each averaged over ten runs. Step is not shown since both optimizers achieve a perfect value of \(-25\) within 0.6 s and 278.2 s, respectively. (Color figure online)

4.5 HP Optimization for Remote Sensing Classification

BigEarthNet [35] is a Sentinel-2 multispectral image dataset in remote sensing. It comprises 590 326 image patches, each of which is assigned one or more of the 19 available CORINE Land Cover map labels [10, 35]. Multiple computer vision networks for BigEarthNet classification have been trained [35], with ResNet-50 [20] being the most accurate. While a previous Propulate version was used to optimize a set of HPs and the architecture for this use case [13], a more versatile and efficient parallelization strategy in the current version makes it worthwhile to revisit this application. Analogously to [13], we consider different optimizers, learning rate (LR) schedulers, activation functions, loss functions, number of filters in each convolutional block, and activation orders [21]. The search space is shown in Table 4. Optimizer parameters, LR functions, and LR warmup are included as well. We only consider SGD-based optimizers as they share common parameters and thus exclude Adam-like optimizers from the search. We theorize that including Adam led to the difficulties seen previously [13]. The training is stopped early if the validation loss has not improved for ten epochs. We prepared the data analogously to [13]. The network is implemented in TensorFlow [1].
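This early-stopping criterion corresponds to a standard Keras callback; the sketch below uses a tiny stand-in model and random multi-label data rather than the actual ResNet-50/BigEarthNet pipeline.

```python
import numpy as np
import tensorflow as tf

# Stop training once the validation loss has not improved for ten consecutive epochs.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True
)

# Tiny stand-in model and random 19-label data; the real use case trains a ResNet-50 on BigEarthNet.
model = tf.keras.Sequential([tf.keras.layers.Dense(19, activation="sigmoid")])
model.compile(optimizer="sgd", loss="binary_crossentropy")
x = np.random.normal(size=(256, 32)).astype("float32")
y = (np.random.uniform(size=(256, 19)) > 0.5).astype("float32")
model.fit(x, y, validation_split=0.2, epochs=100, callbacks=[early_stopping], verbose=0)
```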

Fig. 4. Evolution of the population over wallclock time for the Rastrigin function. Propulate (left) versus Optuna (right). OF values (blue) use the left-hand scale, distances to the global optimum (purple) use the right-hand scale. Pastel dots show each individual’s OF value/distance. Solid (dashed) lines show the minimum (median) value and distance achieved so far. Maximum value and distance are shown in black. Both optimizers perform 38 912 evaluations. Note the difference on the time axis. (Color figure online)

For both Propulate and Optuna, we ran three searches each over 24 h on 32 GPUs. We use \(1-F_1^{\text {val}}\) with the validation \(F_1\) score as the OF to be minimized. On average, Optuna achieves its best OF value of \({(0.39 \pm 0.01)}\) within \({(7.05 \pm 3.14)}\) h. Propulate beats Optuna’s average best after \({(5.30 \pm 2.41)}\) h and achieves its best OF value of \({(0.36 \pm 0.00)}\) within \({(13.89 \pm 5.15)}\) h.

4.6 Scaling

Finally, we explore Propulate's scaling behavior for the use case presented in Sect. 4.5. Figure 5 shows our results for weak and strong scaling. Our baseline configuration used two nodes. Since each node has four GPUs, we calculate speedup and efficiency with respect to eight workers. For strong scaling, we fix the total number of evaluations at 512 and increase the number of workers, i.e., GPUs. We average over three runs with different seeds and keep four workers per island while increasing the number of islands. Speedup increases up to 128 workers, where we reach approximately half the optimal value. This is an expected decline since each worker only processes a few individuals, so the variance in evaluation times leads to larger idle times of the faster workers before the final population synchronization at the end. Additionally, as the number of workers approaches the total number of evaluations, the randomly initialized evolutionary search in turn approaches a random search. This means that the search performance is likely to be worse than what the pure compute performance might suggest. It is still possible to apply Propulate on these scales, but the other search parameters have to be adjusted accordingly as shown in the weak scaling plot (see Fig. 5 top). The early super-linear behavior is likely due to the non-sequential baseline. For small node counts, the performance is influenced by effects stemming from cluster utilization beyond the use case studied here, like file system congestion or inter-node distance in the network. With larger node counts relative to total cluster size, these effects average out or approach the worst case, which is consistent with the trend shown in Fig. 5. Weak efficiency only drops to 95% on average at our largest configuration of 128 workers.
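For reference, the conventional definitions of speedup and parallel efficiency relative to a baseline of \(N_0 = 8\) workers, which we assume here, read

\[
S(N) = \frac{T(N_0)}{T(N)}, \qquad
E_{\text{strong}}(N) = \frac{N_0}{N}\,S(N), \qquad
E_{\text{weak}}(N) = \frac{T(N_0)}{T(N)} \ \text{(problem size scaled proportionally to } N\text{)},
\]

where \(T(N)\) denotes the wallclock time measured with \(N\) workers.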

Table 4. HP search space of ResNet-50 for BigEarthNet classification.
Fig. 5. Scaling with respect to a baseline of eight workers. Weak efficiency (top) and strong speedup (bottom). Use case and search space are described in Sect. 4.5. Weak-scaling problem size is varied via the number of OF evaluations. Results are averaged over three runs.

5 Conclusion

We presented Propulate, our HPC-adapted, asynchronous genetic optimization algorithm and software. Our experimental evaluation shows that the fully asynchronous evaluation, propagation, and migration enable a highly efficient and parallelizable genetic optimization. To our knowledge, all existing Python-based genetic optimization tools use synchronization schemes that are not tailored to application in HPC environments. Ease of use is harder to quantify than performance but very important. Especially for HPC applications at scale, some parallelization and distribution models are more suited than others. A purely MPI-based implementation as in Propulate is not only extremely efficient for highly parallel and communication-intensive algorithms but also easy to set up and maintain, since the required infrastructure is commonly available on HPC systems. This is not the case for any of the other tools investigated, except for the not publicly available MENNDL. In addition, Propulate's asynchronicity facilitates a tighter coupling of individuals during the optimization, which enables a more efficient evaluation of candidates and, in particular, early stopping informed by previously evaluated individuals in the NAS case. Propulate was already successfully applied to HPO for various ML models on different HPC machines [13, 17]. Another avenue for future work is including variable-length gene descriptions. Mutually exclusive genes of different lengths, such as the parameter sets for Adam- and SGD-like optimizers in our NAS use case, can thus be explored efficiently. While this is already possible, it requires an inconvenient workaround of including inactive genes and adapting the propagators to manually prevent the evaluation of many individuals differing only in inactive genes.