Parametric Optimization on HPC Clusters with Geneva

Many scientific challenges today are parametric optimization problems that are extremely complex and computationally intensive to solve. At the same time, the hardware for high-performance computing is becoming increasingly powerful. Geneva is a framework for parallel optimization of large-scale problems with highly nonlinear quality surfaces in grid and cloud environments. To harness the immense computing power of high-performance computing clusters, we have developed a new networking component for Geneva—the so-called MPI Consumer—which makes Geneva suitable for HPC. Geneva is best known for its evolutionary algorithm, which requires repeatedly evaluating a user-defined cost function. The MPI Consumer parallelizes the computation of the candidate solutions’ cost functions by sending them to remote cluster nodes. By using an advanced multithreading mechanism on the master node and asynchronous requests on the worker nodes, the MPI Consumer is highly scalable. Additionally, it provides fault tolerance, which is usually not the case for MPI programs but becomes increasingly important for HPC. Moreover, the MPI Consumer provides a framework for the intuitive implementation of fine-grained parallelization of the cost function. Since the MPI Consumer conforms to the standard paradigm of HPC programs, it vastly improves Geneva’s user-friendliness on HPC clusters. This article gives insight into Geneva’s general system architecture and the system design of the MPI Consumer as well as the underlying concepts. Geneva—including the novel MPI Consumer—is publicly available as an open source project on GitHub (https://github.com/gemfony/geneva) and is currently used for fundamental physics research at GSI in Darmstadt, Germany.

An important part of the study of the strong interaction based on QCD is computer-aided calculations on discretized space-time lattices [3, pp. 218−219]. These lattice-QCD simulations require extremely large computational power and can therefore only be performed for a few specific cases. At GSI/FAIR in Darmstadt, Germany, research is being done to establish an effective quantum field theory that describes QCD but can be computed with significantly less computational effort [2, p. 1]. For this purpose, QCD is modified by variable transformations to obtain equations that are easier to solve than the original theory. These transformations, however, result in a large number of unknown low-energy constants (LECs) that are not analytically accessible [2, p. 2]. Nevertheless, to determine them approximately in the effective field theory, single solutions of the fundamental QCD are calculated first [2, p. 2]. The challenge is then to find a set of LEC parameters for which the effective field-theoretic equations and the QCD equations yield approximately the same results [4, p. 15; 5, p. 6]. At GSI, typical problems include up to 50 LECs.
In practice, optimization problems are usually non-differentiable, so the optimal solution cannot be calculated analytically. Moreover, the size of the input parameter space makes it infeasible to evaluate all possible input parameter sets when searching for the optimal one. For this reason, optimization algorithms based on metaheuristics are used to systematically search the parameter space for an approximate solution. There are two main classes of optimization algorithms: local and global search algorithms.
Starting from an entry point, local search algorithms try to iteratively improve the quality of a parameter set by searching in its local environment. These types of algorithms converge comparatively quickly to the closest local optimum. However, they might not be able to find the global optimum of the cost function. Moreover, many implementations such as gradient descent [15] are not easily parallelizable because of dependencies between the intermediate results.
On the other hand, global search algorithms like evolutionary algorithms [13] scan a bigger range of the parameter space and focus on finding the global optimum. Unlike local search algorithms, global search algorithms do not converge as quickly to a given optimum. Many implementations use a population-based approach, in which many candidate solutions are evaluated independently in one iteration, resulting in an inherent potential for parallelization. Because real-life problems often have a large search space and many local minima, global search algorithms are particularly important today.
In recent years, substantial resources have been invested in building new high-performance computing clusters [16-18, pp. 10-12]. Utilizing these computational resources for optimizing application problems allows for faster progress in science and engineering. In particular, large-scale problems can be approached which would not be solvable without this massive computing power.
Although software for parametric optimization is available, none of the existing frameworks is an optimal fit for solving large-scale optimization problems on HPC clusters. DEAP [19, 20], DISROPT [21, 22] and DISOP [23], for instance, are frameworks written in Python. However, Python is known to be significantly slower than compiled languages [24], which is not acceptable when working on big computer clusters that are extremely costly to use. CEGO [25, 26] and PaGMO [27-29], on the other hand, are optimization frameworks written in C++. CEGO is not suitable for HPC, as it does not allow for distributed parallelization but merely for thread-based parallel execution. PaGMO is a popular framework and implements many different optimization algorithms. However, since its upgrade to version 2.x.x, it no longer supports parallelization over MPI or any other networking technology. The reason is that breaking changes between PaGMO 1.x.x and 2.x.x require rewrites of parts of the library that have not yet been undertaken. Therefore, PaGMO is not able to exploit the potential of grids, clouds and clusters.
Geneva is a powerful optimization framework focusing on global optimization algorithms and supporting parallelization in grid and cloud environments. Additionally, it is specifically designed for long-running optimization functions and stability. It is, however, not optimized for high-performance computing and has no integration with common cluster scheduling systems. Also, Geneva does not allow for distributed parallelization of the user-defined cost function, which is, nevertheless, necessary if the requirements for computing the cost function go beyond the resources of a single compute node.
For this reason, we have developed a new networking component for the Geneva optimization library: the so-called MPI Consumer. The MPI Consumer efficiently distributes candidate solutions to cluster nodes. The interface it exposes is compliant with today's programming paradigm for high-performance computing, which makes it easy to use in cluster environments. Furthermore, the MPI Consumer provides a framework for fine-grained parallelization of the user-defined cost function. As a result, Geneva is not only a great tool for running in grids and clouds, but is now also optimized for HPC clusters.
In this work, we first give an overview of Geneva's system architecture in general and afterwards explain the MPI Consumer's system design in more detail. Furthermore, we evaluate the improvements achieved through the MPI Consumer and briefly show how it can be used.

Geneva's System Architecture
As mentioned in the introduction, Geneva is an open-source library for distributed optimization and provides the basis for the software component developed in this work. Geneva is written in the C++17 standard, and its only external dependency is the Boost [30] program libraries, which provide fundamental functionality such as serialization and parsing. Geneva's approximately 130,000 lines of program code are separated into the components shown in Fig. 1.
• The Common sublibrary contains functionality that could potentially be used by all other software components. For example, it contains the implementation of a thread pool, a logger, a parser and other useful components.
• The Geneva sublibrary provides functionality specific to parametric optimization. Figure 2 shows an overview of the main base classes of this sublibrary. GObject is directly or indirectly the base class for most of the classes in the Geneva sublibrary.
G_Optimization_Algorithm_Base is the base class for all optimization algorithms and unifies their interface within Geneva. GParameterBase is the base class for all types of parameters that are adjusted during optimization; individual parameters of different types and lists of parameters are derived from this class. GAdaptorT is the base class for classes that determine the adaptation of candidate solutions within optimization algorithms. GParameterSet is the base class for parameter sets to be optimized. Users define optimization problems by deriving a class from GParameterSet. In doing so, users add parameters to their parameter set which are indirectly derived from GParameterBase, and also define a quality function taking the parameters as input and returning one or more floating point numbers for the objectives. The currently implemented optimization algorithms are a multi-objective evolutionary algorithm, gradient descent, simulated annealing, a swarm algorithm and parameter scans.
• The Courtier sublibrary is a template library for parallelization. It is used within Geneva for the parallel evaluation of individuals, but can also serve other use cases because of its generic implementation. Since this sublibrary is particularly relevant for the MPI Consumer implemented in this work, it is discussed in detail below.
• The Hap sublibrary provides mechanisms for the transparent, centralized generation of random numbers in multithreaded environments.
• Furthermore, automated unit tests and benchmarks are used to avoid errors in the library collection. A set of examples is provided to give users an insight into the many capabilities of Geneva and to get them started quickly.
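The usage pattern described above can be illustrated with a highly simplified, self-contained stand-in. Note that the class and method names below mimic, but are not, Geneva's real API; this is only a sketch of the "derive a parameter set and implement the cost function" idea.

```cpp
#include <cassert>
#include <vector>

// Simplified stand-in for Geneva's GParameterSet base class (illustrative
// only, not the real interface): users derive from it and implement the
// cost function over their parameters.
class ParameterSetBase {
public:
    virtual ~ParameterSetBase() = default;
    std::vector<double> parameters;           // stands in for GParameterBase objects
    virtual double fitnessCalculation() = 0;  // user-defined cost function
};

// Example problem: minimize the sum of squares of the parameters.
class SphereProblem : public ParameterSetBase {
public:
    double fitnessCalculation() override {
        double sum = 0.0;
        for (double p : parameters) sum += p * p;
        return sum;
    }
};
```

In Geneva proper, the optimization algorithm would repeatedly adapt the parameters and call the user's cost function; here only the user-facing derivation pattern is shown.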

Parallelization with the Courtier Library
The Courtier library plays a particularly important role for the MPI Consumer, as the consumer is added to this sublibrary as a new module. Fig. 3 shows the basic parts that make up the Courtier sublibrary. Single tasks to be executed are defined through so-called work items. These are implemented as GProcessingContainerT embedded in an instance of type GCommandContainerT. GProcessingContainerT provides a generic interface for computable entities consisting of data, the function to compute and meta-information.
GCommandContainerT contains an attribute of type GProcessingContainerT and a command indicating the current processing stage of the work item and provides (de)serialization functionality.
The central building block of the Courtier library architecture is the broker, which provides an abstraction layer separating producers and consumers. Producers can register a so-called buffer port, which consists of two thread-safe queues, with the broker. Producers can then pass work items to be processed to the broker by adding them to the in-buffer of the buffer port. On the other hand, one or many consumers can be registered, which request work items and process them in a consumer-specific manner. When a work item is requested by a consumer, the broker takes one of the available work items from the in-buffer of one of the registered buffer ports. If the in-buffers of all buffer ports are empty, the consumer receives an empty GCommandContainerT whose command attribute indicates that no work items are currently available. Every time a consumer requests a new work item, it returns the computed work item of the previous request (if a work item was obtained in the previous request). The broker inserts returned items into the out-buffer of the buffer port which they were previously taken from. The distribution of items occurs on both the producer and consumer sides in a round-robin fashion.
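The buffer-port mechanism described above can be sketched as two thread-safe queues plus a consumer-side request function. This is a minimal illustration of the concept, not Geneva's actual broker code; all names are invented for this sketch.

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <optional>
#include <string>

// Minimal sketch of a buffer port: an in-queue for raw work items handed
// in by a producer and an out-queue for processed items returned by
// consumers (names illustrative, not Geneva's classes).
template <typename T>
class ThreadSafeQueue {
public:
    void push(T item) {
        std::lock_guard<std::mutex> lock(m_);
        q_.push_back(std::move(item));
    }
    // Non-blocking pop: an empty optional signals "no work available",
    // mirroring the empty GCommandContainerT the broker hands out.
    std::optional<T> try_pop() {
        std::lock_guard<std::mutex> lock(m_);
        if (q_.empty()) return std::nullopt;
        T item = std::move(q_.front());
        q_.pop_front();
        return item;
    }
private:
    std::mutex m_;
    std::deque<T> q_;
};

struct BufferPort {
    ThreadSafeQueue<std::string> in;   // producer -> consumer
    ThreadSafeQueue<std::string> out;  // consumer -> producer
};

// A consumer-side request: return the previously computed item (if any)
// and fetch the next raw work item, as described in the text.
inline std::optional<std::string> request_work(BufferPort& port,
                                               std::optional<std::string> previous) {
    if (previous) port.out.push(std::move(*previous));
    return port.in.try_pop();
}
```

The real broker additionally multiplexes several buffer ports in round-robin fashion; that layer is omitted here for brevity.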
It is important to note that the producer, broker and consumer all run in the same process (but on different threads) and therefore cannot be distributed to different machines. This makes the system architecture fundamentally different from well-known message-queuing architectures such as MQTT [32]. In this way, many components of the overall system run on the same machine. With a large number of producers and consumers, this could lead to a bottleneck, which would result in reduced scalability. In practice, usually exactly one producer and exactly one consumer are used, so this potential bottleneck has little to no relevance.
The parallelization of computation happens on the consumer side. Different consumer types implement the distribution of individual work items in different ways. So far, the available consumer types were a serial consumer, a multithreaded consumer, and two distributed consumers based on the Boost.Asio and Boost.Beast networking libraries. As the main contribution of this work, another distributed consumer, the MPI Consumer, has been developed, which significantly improves Geneva's user experience for high-performance computing and adds useful new features for solving large-scale optimization problems. Of particular interest are the distributed consumers. These represent servers and provide work items via their respective network technologies to appropriately implemented consumer clients. The work items are serialized via the serialization functionality of GProcessingContainerT before each send and deserialized after each receive. The actual computation of the items is then done on the client side. Clients must be started separately from the main process as independent processes (usually on other physical machines), and it must be ensured that a network connection to the server is possible at the time of client startup.

Building an MPI Consumer
This section explains the MPI Consumer's system design. To begin with, in section "System Design Overview" an overview of the system architecture is given. The subsequent sections explain the individual parts in more detail.

System Design Overview
The MPI consumer developed in this work is a distributed consumer within Geneva's Courtier sublibrary. Fig. 4 shows a bird's eye view of the MPI Consumer's system architecture.
The MPI Consumer consists of a server, clients and subclients. The server waits for clients requesting work items. To make the server more scalable, it handles incoming requests with multiple execution streams. When a work item is requested, the server takes a work item from the broker (if available) and sends it to the client that requested it. Clients use asynchronous requests: while they are computing a work item, they already send the request for the next one. Asynchronous requests can increase the productivity of the clients by filling the waiting time for server responses with computations. The request for a new work item contains (if available) the previously computed work item, which the server delivers back to the broker. The client-server model is not common for MPI applications; more commonly, the fork-join model is used. In the context of optimization algorithms such as the evolutionary algorithm, the fork-join model would mean splitting the population of candidate solutions into equally sized groups at the beginning of each iteration and sending one group to each worker process. However, since the evaluation of individual work items can take very different amounts of computation time, the fork-join model would lead to suboptimal use of the computational resources of the nodes: early-finishing processes have to wait for slower nodes to complete their computation, because the evolutionary algorithm only generates the next population once all processes have returned all candidate solutions. In the client-server model, on the other hand, the candidate solutions of each iteration of the optimization algorithm are retrieved one by one by the clients. Since the number of individuals in the population is usually at least one order of magnitude greater than the number of clients, each client will send many requests in each iteration.
Unlike the fork-join model, the client-server model results in natural load balancing, since faster clients send more requests and thus compute more work items. In addition, the client-server model makes it comparatively easy to achieve the desired fault tolerance with respect to unreachable clients. Since the clients work independently of each other and the server does not need to know the number of intact clients, all remaining clients are unaffected if individual clients become unavailable. To guarantee fault tolerance, it is furthermore necessary to protect against failed network communication by using timeouts. This ensures that the server does not wait indefinitely for responses from unavailable clients.
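The load-balancing advantage can be illustrated with a toy makespan calculation (illustrative code, not part of Geneva): fork-join pre-splits the items into equally sized contiguous groups, while the client-server model dispatches each item to the next idle client.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Fork-join: the item list is split into equally sized contiguous groups,
// one per client; the iteration ends when the slowest group finishes.
double forkjoin_makespan(const std::vector<double>& cost, int clients) {
    std::size_t per = (cost.size() + clients - 1) / clients;
    double worst = 0.0;
    for (std::size_t i = 0; i < cost.size(); i += per) {
        double sum = 0.0;
        for (std::size_t j = i; j < std::min(i + per, cost.size()); ++j)
            sum += cost[j];
        worst = std::max(worst, sum);
    }
    return worst;
}

// Client-server: each client requests the next item as soon as it is idle,
// i.e. greedy dispatch to the earliest-free client.
double clientserver_makespan(const std::vector<double>& cost, int clients) {
    std::vector<double> busy(clients, 0.0);
    for (double c : cost) {
        auto it = std::min_element(busy.begin(), busy.end());
        *it += c;
    }
    return *std::max_element(busy.begin(), busy.end());
}
```

With item costs {8, 1, 1, 1} and two clients, fork-join assigns {8, 1} and {1, 1} (makespan 9), while per-item dispatch finishes in time 8.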
There is an option for work items to be computed in small groups. Each client is then assigned a certain number of subclients among which it distributes the calculation of individual work items. However, since the cost function of the work items is specified by the user, its parallelization must also be implemented by the user. The MPI Consumer makes this possible in an intuitive way by providing a preconfigured infrastructure for the parallelization of the cost function.
At program start, all processes are identical. Depending on their MPI rank, the processes configure themselves as server, client or subclient. This is convenient with modern cluster scheduling systems, since all processes are homogeneous at the time a job is started, and MPI is natively supported by common workload managers.

Fault Tolerance with Timeouts
This section shows how the MPI Consumer can handle a broad range of possible errors using a client-server model with timeouts.
Since the MPI Consumer shall be extremely scalable, fault tolerance is essential. Even if Geneva itself were fault-free, there would always be classes of faults, such as network or hardware failures, that are outside of the software's control. If the probability of a fault f ∈ F on a single node is p_f, then the probability that f occurs on at least one node in a system with n nodes is 1 − (1 − p_f)^n. For example, with an error probability p_f of only 0.1%, the probability that f occurs in a system with 500 nodes is 1 − (1 − 0.001)^500 ≈ 39%. Thus, the claim to provide high scalability also requires solid handling of unpredictable errors.
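The calculation above can be reproduced with a one-line helper (a sketch for illustration; the function name is ours):

```cpp
#include <cassert>
#include <cmath>

// Probability that a fault with per-node probability p_f occurs on at
// least one of n nodes: 1 - (1 - p_f)^n, as derived in the text.
double system_fault_probability(double p_f, int n) {
    return 1.0 - std::pow(1.0 - p_f, n);
}
```

For p_f = 0.001 and n = 500 this yields approximately 0.39, matching the 39% quoted above.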
Not all errors can be caught with reasonable effort. Therefore, the MPI Consumer's fault tolerance focuses on faults f of type F with F ∶= { temporary or permanent inaccessibility of clients }. The error class F covers a large portion of common errors that can occur in the system: it is comparatively generic and thus includes many types of errors (network connection between server and client unavailable, client crashed for an unknown reason, etc.), and the probability of occurrence of these errors increases with the number of clients, as explained before. The probability of unpredictable server-side errors, in contrast, is not related to the number of clients and is therefore less relevant. In addition, server-side fatal errors such as hardware failures are much more difficult to handle. The fault model used is crash-recovery [33, pp. 34-35]: clients can fail unexpectedly (i.e., without reporting an error in a defined manner), and failed clients can spontaneously become available again. This is a realistic model in the context of Geneva. After all, clients might be unavailable due to hardware, software or network failures. At the same time, it is also possible that a client is merely busy with the calculation of a work item for longer than expected and therefore participates in the system again after an alleged error has been detected.
To ensure that the specification S can be satisfied even if an error f ∈ F occurs, the server must be independent of the number of available clients at all times and ensure that a failing client does not affect the satisfaction of S (temporary failures of clients or network connections for finite time are irrelevant in this respect). To enable this, as shown in Fig. 5, a client-server model with two-sided timeouts is used. The server does not know anything about the number or status of clients. It only knows that occasionally requests will arrive from the black box and that they have a sender address to which the response should be sent. The behavior of the system with n clients is defined for all n ∈ ℕ0 as waiting cyclically for requests and processing them. Assuming that no errors occur and n ≥ 1, this behavior obviously fulfills S. To make sure that S is also fulfilled when errors f ∈ F occur, fault tolerance mechanisms must ensure that a system with n clients behaves like an intact system with n − 1 clients once an error f ∈ F occurs. For this purpose, the system must have a detector that detects the unavailability of a node and a corrector that resets the server's state to its state before the error. As a detector, the MPI Consumer uses timeouts. Each network communication operation is expected to terminate within a given finite time frame. If the time frame is exceeded, the connection is terminated by the corrector and the resources associated with the connection are released.
Conversely, clients also detect the unavailability of the server. In this case, a controlled shutdown of the client is initiated. The reason for server unavailability can be a controlled shutdown of the server, a network failure or an error on the server side. In all cases, shutting down the client makes sense. In the particular use case of the Courtier library within Geneva, a controlled server shutdown occurs at the end of the optimization process.
To realize the client-server model, point-to-point communication between the server and the individual clients is used. MPI itself provides the following three functions (and corresponding receive functions) for this purpose: (1) MPI_Send, a blocking standard-mode send that returns once the send buffer can be reused; (2) MPI_Ssend, a fully synchronous send that completes only after the receiver has started receiving; and (3) MPI_Isend, a non-blocking (asynchronous) send. None of the three supports timeouts, which are, nevertheless, required for implementing the previously explained loosely coupled client-server architecture.
Therefore, we use MPI's asynchronous point-to-point operations to implement timeouts ourselves, as briefly shown in Listing 1. First, an asynchronous network operation is initiated (implemented using the previously mentioned MPI_I* operations). Then, the current time is recorded for later determination of the elapsed time. The process then repeatedly checks whether the operation has completed or the timeout has been exceeded. At the end of each iteration, depending on the use case, a waiting time can be configured to put less stress on the CPU. Depending on how critical it is to know the status of the message transmission early, the wait factor can be adjusted or omitted. To return the process to its initial state in case of a timeout, the network operation is aborted and the associated resources are released. The loop is then exited and the process can resume its task.
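The poll-with-timeout pattern can be sketched as follows. To keep the example self-contained and runnable without an MPI installation, a std::future stands in for the MPI_Request handle; in the actual MPI Consumer the completion check would be MPI_Test and the abort path MPI_Cancel plus a request free, as described above.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

// Sketch of the timeout loop from Listing 1. A std::future emulates the
// pending asynchronous operation. Returns true if the operation completed
// within the timeout, false if it was "aborted" due to timeout.
template <typename T>
bool wait_with_timeout(std::future<T>& op,
                       std::chrono::milliseconds timeout,
                       std::chrono::milliseconds poll_interval) {
    auto start = std::chrono::steady_clock::now();
    for (;;) {
        // Completion check (MPI_Test in the MPI Consumer).
        if (op.wait_for(std::chrono::milliseconds(0)) == std::future_status::ready)
            return true;
        // Timeout exceeded: abort the operation and release its resources
        // (MPI_Cancel / MPI_Request_free in the MPI Consumer).
        if (std::chrono::steady_clock::now() - start > timeout)
            return false;
        // Optional back-off to reduce CPU load while polling.
        std::this_thread::sleep_for(poll_interval);
    }
}
```

An operation that finishes quickly is reported as completed, while one that never finishes is abandoned after the configured timeout, returning the caller to its initial state.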

Asynchronous Requests
In this section, we explain the idea of asynchronous client requests to improve the system's scalability.
Had the system been implemented without asynchronous requests, like the consumers that previously existed in Geneva, scalability would be limited by Amdahl's law [31, p. 5; 34]. According to Amdahl, the minimum execution time achievable by parallel evaluation of candidate solutions is the sum t_min = t_seq + t_par of the time t_seq spent sequentially on server-side computation (i.e., primarily (de)serialization) and the time t_par required for parallel execution on the clients. This phenomenon is of great importance: testing has shown that the (de)serialization of work items requires a comparatively large amount of computation time on the server. This is not surprising, since regardless of the number of clients, each work item must be (de)serialized once on the server. The clients of the MPI Consumer use asynchronous requests to increase their productivity and thereby the performance and scalability of the overall system. Asynchronous requests can effectively circumvent this limit by allowing sequential computations on the server to overlap in time with computations on the clients, so that t_min ≈ t_par holds. This method had also been briefly proposed by Berlich and is now implemented as part of the MPI Consumer [31, p. 6].
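The timing argument can be captured in a tiny model (illustrative only; with full overlap the iteration time is bounded below by max(t_seq, t_par), which approaches t_par when serialization is cheap relative to evaluation):

```cpp
#include <algorithm>
#include <cassert>

// Illustrative timing model: without overlap an iteration costs
// t_seq + t_par; with asynchronous requests the server-side
// (de)serialization overlaps client computation, giving roughly
// max(t_seq, t_par) ≈ t_par when t_seq << t_par.
double iteration_time(double t_seq, double t_par, bool overlap) {
    return overlap ? std::max(t_seq, t_par) : t_seq + t_par;
}
```

For example, with t_seq = 2 and t_par = 10, the non-overlapped iteration costs 12 time units, while the overlapped one costs 10.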
The asynchronous request mechanism used in the MPI Consumer clients can be described by the deterministic finite state machine shown in Fig. 6. The operations mapped to the edges of the state machine are defined as follows:
req: Sending a request for a new work item asynchronously to the server. The request includes sending the last processed work item (if any) to the server. The call returns immediately and is processed by an independent execution stream.
recv: Waiting until the result of the previously dispatched req operation has been received.
proc: Processing the work items previously received by recv.
The states of the automaton can be described informally as follows:
q0: No communication with the server has been started yet.
q1: No work items have been received yet and the request for the first work item has been started.
q2: There is an unprocessed work item in a local queue. Additionally, there may be up to one processed work item in a local queue.
q3: There is one unprocessed work item in a local queue and another work item has been requested but has not yet been received.
q4: There is a processed work item in a local queue and another work item has been requested, but has not yet been retrieved.
As can be seen, the clients first send a request to the server and wait for the reply to fill their local queue, which brings them to state q2. Once state q2 is reached, the clients enter a cycle in which they always first request a new work item and then compute the currently available work item while this request is being processed. In this way, the waiting times for the server's responses are filled with meaningful computations. Note, however, that asynchronous requests have no effect if there is no previous work item, i.e., for the first work item to be processed. For population-based optimization algorithms parallelized with the MPI Consumer, this case occurs at the beginning of each iteration of the optimization algorithm.
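The cycle described above can be encoded as a small transition function. The transition table is our reading of the textual description of Fig. 6 (q0 → q1 → q2, then the cycle q2 → q3 → q4 → q2), not a reproduction of the figure itself.

```cpp
#include <cassert>
#include <stdexcept>

// The client's asynchronous request cycle as a minimal state machine,
// with states q0..q4 and operations req/recv/proc as defined in the text.
enum class State { q0, q1, q2, q3, q4 };
enum class Op { req, recv, proc };

State step(State s, Op op) {
    switch (s) {
        case State::q0: if (op == Op::req)  return State::q1; break; // first request
        case State::q1: if (op == Op::recv) return State::q2; break; // first item arrives
        case State::q2: if (op == Op::req)  return State::q3; break; // request next item
        case State::q3: if (op == Op::proc) return State::q4; break; // compute current item
        case State::q4: if (op == Op::recv) return State::q2; break; // next item arrives
    }
    throw std::logic_error("invalid transition");
}
```

Starting in q0, the client performs req and recv once and then loops through req, proc, recv, overlapping each pending request with the computation of the current item.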

Multithreading
This section explains the multithreading design of the MPI Consumer server. We explain what considerations have been made to optimize the server's scalability using multithreading and the responsible software component's design.
Computers in modern HPC clusters, which the MPI Consumer targets, are equipped with a high number of CPU cores. For instance, GSI's Virgo cluster in Darmstadt, Germany, is composed of over 470 machines with 56 to 256 CPU cores each. Therefore, the MPI Consumer server features a minimal-overhead multithreading design which allows scaling to this high number of physical cores without suffering from congestion. Unlike networking libraries such as Boost.Asio or Boost.Beast, MPI does not provide an infrastructure for multithreading. Hence, multithreading must be implemented from scratch in the MPI Consumer. The diagram in Fig. 7 shows the main components of the server involved in multithreading as an excerpt from Fig. 4.
The design follows the idea that few execution streams suffice for small tasks dominated by waiting time, whereas more execution streams are useful for computationally intensive tasks. For small tasks associated with waiting time, the waiting times of multiple tasks can be overlapped in a single execution stream, and it is not critical if the tasks are processed serially, because their computation does not take much time. Splitting the small tasks among many execution streams would lower their productivity and block hardware resources unnecessarily, since those streams would be waiting for a large part of the time. For computationally intensive tasks without waiting, on the other hand, parallelization is worthwhile, because the difference between serial and parallel execution time is significant: using more cores ideally divides the computation time by the number of execution streams. The four components receiver thread, thread pool, clean-up thread and open sessions queue are the most important building blocks involved in organizing the execution streams. For the reasons mentioned above, requests are first bundled in the receiver thread, then distributed to the thread pool, and finally re-combined in the clean-up thread.
The receiver thread permanently waits for requests from arbitrary clients. When a request arrives, it is passed to the thread pool for processing as soon as possible. This allows the receiver thread to start waiting for further requests again as soon as possible afterwards and to respond to them as quickly as possible.
The thread pool is a group of execution streams dealing with requests already received by the receiver thread. The number of execution streams in the thread pool is configurable and should be chosen according to the number of physical CPUs available on the host machine. Processing a request involves comparatively time-consuming operations, mainly the deserialization of the request and the serialization of the response. With a thread pool of n execution streams, n already accepted connections can be processed simultaneously, which approximately divides the required processing time by n. Once the response has been serialized and the asynchronous operation for sending it to the client has been initiated, the thread pool adds the pending session to the open sessions queue. From that point on, this execution stream of the thread pool is immediately available again for processing new requests, i.e., before the send operation is completed.
The open sessions queue is a thread-safe queue for processed sessions whose send operation has not yet been completed. To release allocated resources without errors, it is essential to check the status of the sessions' network operations. The clean-up thread periodically iterates over the queue and checks if there are sessions whose network operation has been completed, whose timeout has been exceeded, or which are in an erroneous state. In each of these cases, that session is removed from the queue, and if there is an error, it is handled.
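A condensed sketch of this threading layout follows: a single receiver enqueues incoming requests, a pool of n workers performs the expensive per-request work, and a counter stands in for the clean-up step. This is an illustrative stand-in, not the MPI Consumer's actual code; the open sessions queue and timeout handling are omitted for brevity.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Simulates the server's execution streams: one receiver thread feeds a
// pool of workers; real request handling is replaced by an integer payload.
int run_server_sketch(int num_requests, int pool_size) {
    std::queue<int> pending;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
    std::atomic<int> completed{0};

    // Receiver: accepts requests quickly and hands them to the pool.
    std::thread receiver([&] {
        for (int i = 0; i < num_requests; ++i) {
            std::lock_guard<std::mutex> lock(m);
            pending.push(i);
            cv.notify_one();
        }
        { std::lock_guard<std::mutex> lock(m); done = true; }
        cv.notify_all();
    });

    // Thread pool: pool_size streams process received requests concurrently.
    std::vector<std::thread> pool;
    for (int t = 0; t < pool_size; ++t) {
        pool.emplace_back([&] {
            for (;;) {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [&] { return !pending.empty() || done; });
                if (pending.empty()) return;  // done and drained
                pending.pop();
                lock.unlock();
                ++completed;  // stands in for deserialize/serialize + send
            }
        });
    }
    receiver.join();
    for (auto& t : pool) t.join();
    return completed.load();
}
```

Every request handed in by the receiver is eventually processed by exactly one pool thread, regardless of the pool size.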

Automatic Configuration
This section explains how the MPI Consumer's ability to automatically configure its processes allows for easy scheduling on modern HPC clusters using common scheduling systems such as Slurm [35].
The distributed consumers implemented in Geneva so far (the Boost.Asio and Boost.Beast consumers) required the server's IP address and port at client start-up time. However, when scheduling jobs on modern HPC clusters using scheduling systems such as Slurm [35] or PBS [36, 37], the specific machines to run the job on are determined dynamically based on the required and available resources. Therefore, scheduling Geneva on HPC systems required a two-phase approach: first scheduling the server, then determining its IP address, and finally scheduling the clients. For this reason, Geneva was not convenient to use on HPC clusters in the past. When using the MPI Consumer, all processes are initially equivalent and independent and take the same command line arguments. The processes configure their role later at runtime using their MPI rank, which is set as an environment variable by the scheduling system. Figure 8 shows how the distributed startup and configuration of an optimization with the MPI Consumer works conceptually. Only a single and trivial submit script is needed on a submit node, as opposed to the previously mentioned two-phase startup. With an appropriate command (in Slurm, e.g., sbatch), the scheduling system is instructed to allocate a certain number of compute nodes and start the script on the first node. The submit script then starts the configured number of instances of Geneva on the allocated compute nodes. In doing so, the scheduling system's MPI integration automatically sets up the runtime environments of the Geneva processes so that each node is uniquely identifiable and reachable by its rank. Only then does the MPI Consumer decide for itself which process takes on which role, so that the initially homogeneous processes take on heterogeneous tasks.
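Role selection by rank can be sketched as a pure function. The concrete layout used below (rank 0 is the server, followed by blocks of one client plus its subclients) is our illustrative assumption; the actual rank-to-role mapping in Geneva may differ.

```cpp
#include <cassert>
#include <string>

// Derives a process role from the MPI rank alone, so that initially
// homogeneous processes can configure themselves without any extra input.
// Layout (assumed for illustration): rank 0 = server, then for each
// client its subclients: [client, subclient, ..., subclient].
std::string role_for_rank(int rank, int subclients_per_client) {
    if (rank == 0) return "server";
    return (rank - 1) % (subclients_per_client + 1) == 0 ? "client"
                                                         : "subclient";
}
```

With two subclients per client, ranks 0..4 map to server, client, subclient, subclient, client.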

Subclient Parallelization
In this section, we explain how the MPI Consumer allows users to conveniently implement distributed parallelization of their cost functions by forming groups of clients that together compute individual cost functions. Later, in section "Using and Evaluating the MPI Consumer", more details about the exposed interface are shown and a typical usage pattern is presented.
So far, Geneva only parallelizes the optimization at the population level of the optimization algorithm, but not at the individual level. This means that while multiple individuals can be computed simultaneously on different clients, Geneva does not provide an infrastructure for parallelizing the quality function itself. Nevertheless, in some use cases the problem complexity requires distributing the evaluation function across multiple machines, because the number of cores or the amount of main memory on a single machine would not be sufficient. Implementing this as a user on top of Geneva is an extremely complex task and not something that a domain expert using Geneva would want to do or necessarily be capable of doing. Furthermore, it is a task that can be partially solved in a generic way, so that multiple Geneva users do not all face the same challenge.
The GMPISubClientOptimizer provides a new programming interface for Geneva that works with the MPI Consumer to provide the user with convenient access to preconfigured compute nodes of the cluster to implement the parallelization of the cost function using MPI. The individual cost functions are computed by each client with the help of subclients in a distributed manner.
To realize this, different, independent communication groups are needed, which are visualized in Fig. 9. The base communicator is the one formed at initialization time and includes all Geneva processes. In a system with m clients, m + 1 subgroups are derived from the base communicator. One of the subgroups is used for communication between the Geneva server and its clients. Each of the other m groups contains a client and a configurable number of subclients. In Geneva, users usually interact with the GParameterSet class and the Go2 class. Users create a class derived from GParameterSet, within which they define the optimization problem, mainly by defining a parameter search space and the cost function by overriding the fitnessCalculation method. The Go2 class coordinates the optimization and provides a convenient abstract interface to optimization algorithms and parallelization strategies. To make parallelization with subclients as user-friendly as possible, we have created an extended user interface for Geneva by developing the two classes GMPISubClientIndividual and GMPISubClientOptimizer, which are directly derived from the above-mentioned classical interface. Apart from providing a couple of additional public methods to enable the desired functionality, the interface is equivalent to the traditional interface, but internally uses additional mechanisms to set up the subclient infrastructure.
Listing 2 shows how the user can use the GMPISubClientOptimizer class. First, the class defining the optimization problem (UserIndividual) is derived from GMPISubClientIndividual instead of GParameterSet (not shown) to get access to the additional functionality needed. Also, the Go2 class in the application's main function is replaced by the GMPISubClientOptimizer class. Then a function is defined that should be executed by the subclients (subClientJob). This function is registered with the optimizer in the main function. The optimizer is then responsible for calling this function for each subclient with the correct configuration and communicator, thereby creating the groups previously shown in Fig. 9. Inside the fitnessCalculation method, which is inherited from GParameterSet and contains the definition of the optimization problem, the user has access to the preconfigured communicator by means of the getCommunicator method. The implementation of the parallel computation of the quality function is application-specific and can now be freely designed by the user within the two methods fitnessCalculation and subClientJob. In many cases, it is useful to run an algorithm on the subclients until the associated client no longer provides any more parameter sets. For example, in each iteration of a while loop, the data of a subproblem could be received from the associated client, then computed, and then returned. Once the associated client shuts down because the optimization algorithm on the server has finished, this loop should also terminate. This commonly used scheme is straightforward to implement thanks to the getClientStatus method inherited from GMPISubClientIndividual, as also shown in Listing 2.

Using and Evaluating the MPI Consumer
In this section, we evaluate the new features that the MPI Consumer adds to Geneva and how they affect Geneva's user experience for high-performance computing. Furthermore, we use performance tests to investigate the MPI Consumer's scalability.

New Features and User Experience
As mentioned previously in section "Subclient Parallelization", Geneva originally did not support parallelization of the cost function itself but was only capable of executing multiple cost functions in parallel. The MPI Consumer introduces fine-grained parallelization of the cost function using subclients.
The enhanced user interface for using subclients in Geneva is identical to the classical interface but adds additional methods. This allows users to adopt subclient optimization with minimal changes to the user code: only the base class of the user-defined optimization problem and the optimizer used must be changed, which requires modifying just two lines of the user source code. In fact, in the simpler use case without subclients, no modifications to user code are necessary at all, because the MPI Consumer also integrates with the Go2 class, which is part of the traditional Geneva user interface. To use the MPI Consumer, just a different command line parameter must be specified at startup time. 12 The framework offered to the user already takes care of (1) the initialization of the network communication; (2) the coordination of processes and the assignment of their roles; (3) the invocation of the client and subclient code at the right time; (4) checking the clients' state to determine the end of the optimization; and (5) shutting down the network connections at the end. The remaining task for the user is thereby reduced to the application-specific minimum. Furthermore, implementing the parallelization with subclients is intuitive for domain experts, as they are already accustomed to working with MPI, the longstanding standard for high-performance computing.
The second challenge that Geneva faced in the past was its integration with modern high-performance computing workflows. As explained in section "Automatic Configuration", starting Geneva processes required constant configuration parameters at start-up time, which in high-performance computing environments are usually only determined after the job has been submitted. This drawback made Geneva harder to use than it should have been and required a multi-step approach for submitting HPC jobs.
In contrast, with the MPI Consumer, Geneva now integrates well with modern cluster scheduling systems. For example, to schedule an MPI program (program) with n processes on a high-performance cluster using Slurm as the scheduling system, the command srun --ntasks=n ./program --consumer mpi suffices. Similarly, for running a Geneva optimization on a single machine, the MPI Consumer seamlessly integrates with MPI launcher programs such as mpirun and mpiexec.

Scalability Evaluation
To test the MPI Consumer's scalability, we have created a performance test, which performs a pseudo-optimization 13 with a certain fixed number of individuals and iterations and measures the execution time required for the entire optimization. The test takes two vectors as input: the numbers of clients and the durations of the simulated cost function computation. The test then determines the execution times for all elements of the Cartesian product of these input vectors and presents them as a three-dimensional graph, as shown in Fig. 10. We have tested client counts in the range of 1 to 1000 and cost function evaluation times in the range of 0.001 to 10 seconds. The graph shows the speedup, which is calculated as the quotient of serial execution time (one client) and parallel execution time. The hardware used was a computer with AMD EPYC 7551 32-core processors and a total of 128 physical cores with 2 threads each. As for the configuration of the MPI Consumer, we have set the thread pool size to 64 and activated asynchronous client requests.
The ideal behavior of the system is a linear speedup, which would show up in the graph as a plane containing every point for which the speedup equals the number of clients. In contrast, for a system with a scalability issue, both smaller evaluation times and a higher number of clients would be expected to have a negative impact on the efficiency 14 of the system. This is because both parameters increase the frequency with which requests from clients arrive at the server and thus directly increase the server load.
The test results depicted in Fig. 10 show close-to-optimal 15 speedup with an efficiency of ≈ 94% for cost functions with a computation time of 5 seconds or more. When using a shorter evaluation function with a duration of one second, the scalability is slightly reduced, resulting in a speedup of 640 for 1000 clients, which equals about 1000 requests per second. When the execution time of the cost function is decreased further, the efficiency drops because this directly increases the frequency at which requests are received on the server. However, this scalability limitation with short execution times occurs naturally in any system, since the request frequency can theoretically be increased to an arbitrary value while hardware resources are limited. Taking into account the heavy (de)serialization tasks that have to be performed on the server for every request, we think that the scalability reached is reasonable. Moreover, tests with the other Geneva consumers (using the C++ Boost.Beast and Boost.Asio networking libraries) have indicated that the MPI Consumer is even more scalable than these components.

Fig. 10 Results of performance tests evaluating the MPI Consumer's scalability on a 128-core computer. The speedup is calculated as the quotient of serial and parallel execution time.

12 Please refer to the user guide in section A for more details.
13 The evaluation cost is implemented as a sleep call to allow for testing higher loads without exhausting the available CPUs on the cost function computation. This also makes the test more independent of the hardware architecture used and isolates the effect of the network communication.
14 The efficiency is calculated as the quotient of ideal and actual execution time.
15 Note that optimal speedup is not possible because there always exist some non-parallelizable parts of code.
Furthermore, note that we had up to 1000 client processes running on the same machine, together with a server that itself used more than 64 threads. The scalability shown in the test results may therefore have been inhibited by limited resources and thread contention, yet is still more than reasonable. We expect the MPI Consumer to perform even better in production environments on high-performance computing clusters, where each process is exclusively assigned the requested number of CPU cores.

Conclusion
The MPI Consumer adds a new software component for network communication to the powerful optimization library Geneva, making it better suited for high-performance computing. The MPI Consumer's subclient parallelization functionality enables intuitive fine-grained parallelization of user-defined cost functions with minimal overhead. Furthermore, the MPI Consumer significantly improves Geneva's user experience for high-performance computing, as it now seamlessly integrates with common HPC scheduling systems. Performance evaluation of the MPI Consumer on a 128-core machine with up to 1000 client processes has shown close-to-linear scalability for reasonable computation times of the cost function.
All software has been contributed to the official GitHub repository of Geneva [38] and is now available for public use. Users can benefit from the MPI Consumer directly, without any adaptations to their existing program code.
Independent of Geneva and parametric optimization, the MPI Consumer can be used as part of Geneva's Courtier sublibrary as a scalable framework for client-server workflows on HPC clusters. Furthermore, the discussed components of the MPI Consumer can be used in a generalized form as programming patterns. Asynchronous communication with timeouts provides a general method for implementing fault-tolerant systems on top of non-fault-tolerant communication primitives. The multithreading scheme of the MPI Consumer can be used independently as a template for a scalable server. In addition, the state machine for asynchronous requests is a general approach to increasing the efficiency of sequential processing of tasks that involve waiting times.
Geneva is currently being used for fundamental physics research on a high-performance cluster at GSI in Darmstadt, Germany.

Appendix: MPI Consumer User Guide
In this section, we give instructions for using the MPI Consumer with the Geneva program library. For general usage hints for Geneva independent of the MPI Consumer, refer to the Geneva manual [10], the paper by Berlich [31], and the examples in the GitHub repository [38].

Overview of the Repository
Geneva, including the MPI Consumer, is available as open-source software on GitHub [38]. The MPI Consumer has been integrated into the repository's develop branch, which is also its default branch. The most important components of the MPI Consumer can be found in the repository at the following locations:

Build and Installation
The MPI Consumer depends on an implementation of the MPI standard as a dynamically linked library. One of Geneva's strengths, however, is that its only dependency on external program libraries is Boost [30]. Thus, users who do not have an MPI implementation installed should still be able to use Geneva. Therefore, it is possible to specify in the build configuration of the Geneva library whether it should be built with or without the MPI Consumer. If the MPI Consumer is to be built, the corresponding program code is compiled and the MPI installation is linked; otherwise, the MPI installation is not needed on the target machine. Listing 3 shows a small snippet of the scripts/genevaConfig.gcfg configuration file, which can be used to conveniently set options for cmake and thereby configure the Geneva build.
If the variable BUILDMPICONSUMER is set to 1 and an implementation of MPI exists on the machine, Geneva with the MPI Consumer can be compiled and installed using the script scripts/prepareBuild.sh, as described in the Geneva manual [10, pp. 77–86].

Using the MPI Consumer
A fundamental difficulty in the user-friendly implementation of a consumer for Geneva using MPI is to abstract all calls to MPI library functions so that they are hidden from the user. There must be a place in the program code where the rank of the current process is used to decide whether the process should act as a server (usually rank = 0) or as a client (usually rank > 0). The MPI Consumer achieves this by simultaneously inheriting from both abstract base classes GBaseConsumerT and GBaseClientT. It then initializes MPI and executes the server, client or subclient code depending on the rank of the process (see section "Automatic Configuration").
Listing 4 demonstrates how the user can initialize and start the MPI Consumer. After MPI is initialized by calling the setPositionInCluster method, it can be queried whether the current process is a client or subclient by calling the isWorkerNode method. To run an application program named program, such as the one shown in Listing 4, with n parallel local processes of which n − 1 processes represent clients, the command mpirun -np n ./program can be executed. An example of such a program can be found at examples/geneva/16_GMPIConsumer.
Within Geneva, however, it is recommended and more convenient to use the Go2 class instead of using the Courtier library directly. Go2 is, among other things, an abstraction layer on top of the Courtier sublibrary. Go2 allows the user to write a program independent of consumers and accepts the type and configuration of the consumer as command line parameters. The MPI Consumer has also been integrated into Go2. Therefore, programs written in the past using Go2 can be used directly with the MPI Consumer, without a single change to the user code, if the correct command line parameters are supplied at startup. The command line option for using the MPI Consumer is --consumer mpi. A list of all options relevant for the MPI Consumer can be found in section "Configuration".

Configuration
To adapt to specific use cases, the MPI Consumer has numerous configuration options, which already come with sensible default values. The MPI Consumer registers with Go2 in the constructor of Go2 and provides its configuration options so that Go2 can read them from the command line. If a program using the Go2 class is started with the command line parameters --help --showAll, it will not start an optimization but instead print all available configuration parameters. The names of all parameters that affect only client processes have the prefix mpi_worker_, those that affect only the server have the prefix mpi_master_, and those that affect both server and clients have only the prefix mpi_.
The most important parameters are briefly explained below:

• mpi_worker_asyncRequests: a boolean value specifying whether the asynchronous requests explained in section "Asynchronous Requests" should be enabled. As this is helpful in the vast majority of cases, asynchronous requests are enabled by default.

• mpi_master_nIOThreads: the number of threads in the thread pool, which was explained in section "Multithreading". In general, more threads are better than fewer. However, additional threads can only be effective if there are enough physical CPU cores available on the node running the server. Therefore, the default value for this option is 0, which causes the number of threads to be set dynamically to the number of physical CPU cores.

• *_pollInterval and *_pollTimeout: these options define the intervals at which asynchronous operations are checked, and their timeouts. These quantities are input parameters for the fault-tolerant communication with MPI explained in section "Fault Tolerance with Timeouts". The clients use the same parameters for checking send and receive operations. For the server, however, these parameters are separate, since checking the send operation is done on the clean-up thread described in section "Multithreading" and is less time-critical for the server, so it makes sense to keep these parameters apart. All of these parameters have default values that follow the logic described in the sections "Fault Tolerance with Timeouts" and "Multithreading". Usually, the default values should be sufficient. In cases of extremely high load, however, it may be necessary to increase mpi_worker_pollTimeout to prevent clients from shutting down prematurely due to long response times when the server is overloaded. It usually makes more sense, though, to reduce the number of clients to avoid the overload in the first place.