Motivation

Quantum chromodynamics (QCD) is the fundamental quantum field theory describing the strong interaction and one of the most active research topics in physics today [1, p. 9, 2, p. 1]. An important part of the study of the strong interaction based on QCD is computer-aided calculation on discretized space–time lattices [3, pp. 218–219]. These lattice-QCD simulations require extremely large computational power and can therefore only be performed for a few specific cases. At GSI/FAIR in Darmstadt, Germany, research is being done to establish an effective quantum field theory that describes QCD but can be computed with significantly less computational effort [2, p. 1]. For this purpose, QCD is modified by variable transformations to obtain equations that are easier to solve than the original theory. These transformations, however, result in a large number of unknown low-energy constants (LECs) that are not analytically accessible [2, p. 2]. To determine them approximately in the effective field theory, single solutions of the fundamental QCD are calculated first [2, p. 2]. The challenge is then to find a set of LEC parameters for which the effective field-theoretic equations and the QCD equations yield approximately the same results [4, p. 15, 5, p. 6]. At GSI, typical problems involve up to 50 LECs. In particular, the computation of a quality function requires the solution of coupled nonlinear systems of equations [5, p. 6], which leads to a strongly nonlinear behavior of the quality function. Due to the resulting many local optima, meaningful solutions have so far only been found with the help of evolutionary algorithms, as opposed to classical approaches such as gradient descent. Usually, at least 1000 iterations of the optimization with up to 50,000 individuals are needed until convergence or until the uncertainties of the LECs can be estimated. The computation time of typical quality functions for a single individual varies widely, depending on the considered system on the one hand and on the parameter set on the other. The typical total time of an optimization run has so far been several weeks, despite massive parallelization on GSI's high-performance computing cluster. If the computer-aided optimization could be carried out more efficiently, progress in the basic research of the strong interaction could be further advanced.

Introduction

Parametric optimization is the process of searching for an optimal set of input parameters to an application-specific cost function, such that its output value is minimal. Parametric optimization problems are often encountered in physics [6, 7, pp. 115–116], chemistry [8], medicine [9, 10, 70], engineering [11, pp. 419–420, 12, p. 131, 13, p. 12370] and economics [14].
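Written formally (with \(f\) as a generic symbol for the cost function and \(X\) for the parameter space, not taken from the cited works), this amounts to finding

\[
x^{*} = \underset{x \in X}{\arg\min}\, f(x),
\]

where \(f: X \rightarrow \mathbb{R}\) is the application-specific cost function and \(X\) is the space of admissible parameter sets.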

In practice, optimization problems are usually not differentiable, so the optimal solution cannot be calculated analytically. Moreover, the size of the input parameter space makes it impossible to evaluate all possible parameter sets when searching for the optimal one. For this reason, optimization algorithms based on metaheuristics are used to systematically search the parameter space for an approximate solution. There are two main classes of optimization algorithms—local and global search algorithms.

Starting from an entry point, local search algorithms try to iteratively improve the quality of a parameter set by searching in its local environment. These types of algorithms converge comparatively quickly to the closest local optimum. However, they might not be able to find the global optimum of the cost function. Moreover, many implementations such as gradient descent [15] are not easily parallelizable because of dependencies between the intermediate results.

On the other hand, global search algorithms such as evolutionary algorithms [13] scan a larger range of the parameter space and focus on finding the global optimum. Unlike local search algorithms, they do not converge as quickly to a given optimum. Many implementations use a population-based approach in which many candidate solutions are evaluated independently in one iteration, resulting in an inherent potential for parallelization. Because real-life problems often have a large search space and many local minima, global search algorithms are particularly important today.

Recently, many resources have been invested in building new high-performance computing (HPC) clusters [16–18, pp. 10–12]. Utilizing these computational resources for optimizing application problems allows for faster progress in science and engineering. In particular, large-scale problems can be approached which would not be solvable without this massive computing power.

Although software for parametric optimization is available, none of the existing frameworks is an optimal fit for solving large-scale optimization problems on HPC clusters. DEAP [19, 20], DISROPT [21, 22] and DISOP [23], for instance, are frameworks written in Python. However, Python is known to be significantly slower than compiled languages [24], which is not acceptable when working on big computer clusters that are extremely costly to use. CEGO [25, 26] and PaGMO [27–29], on the other hand, are optimization frameworks written in C++. CEGO is not suitable for HPC, as it does not allow for distributed parallelization but merely for thread-based parallel execution. PaGMO is a popular framework and implements many different optimization algorithms. However, since its upgrade to version 2.x.x, it no longer supports parallelization over MPI or any other networking technology. The reason is that breaking changes between PaGMO 1.x.x and 2.x.x require rewrites of parts of the library that have not been approached yet. Therefore, PaGMO is not able to exploit the potential of grids, clouds and clusters.

Geneva is a powerful optimization framework focusing on global optimization algorithms and supporting parallelization in grid and cloud environments. Additionally, it is specifically designed for long-running optimization functions and stability. It is, however, not optimized for high-performance computing and has no integration with common cluster scheduling systems. Also, Geneva does not allow for distributed parallelization of the user-defined cost function, which is, nevertheless, necessary if the requirements for computing the cost function go beyond the resources of a single compute node.

For this reason, we have developed a new networking component for the Geneva optimization library—the so-called MPI Consumer. The MPI Consumer efficiently distributes candidate solutions to cluster nodes. The interface it exposes is compliant with today’s programming paradigm for high-performance computing, which makes it easy to use in cluster environments. Furthermore, the MPI Consumer provides a framework for fine-grained parallelization of the user-defined cost function. As a result, Geneva is not only a great tool for running in grids and clouds, but also is now optimized for HPC clusters.

In this work, we first give an overview of Geneva’s system architecture in general and afterwards explain the MPI Consumer’s system design in more detail. Furthermore, we evaluate the improvements achieved through the MPI Consumer and briefly show how it can be used.

Geneva’s System Architecture

As mentioned in the introduction, Geneva is an open-source library for distributed optimization and provides the basis for the software component developed in this work. Geneva is programmed in the C++17 standard and has as its only external dependency the Boost [30] program libraries, which implement fundamental functionalities such as serialization and parsing. Geneva's approximately 130,000 lines of program code are separated into the components shown in Fig. 1.

Fig. 1: Overview of the components contained in the Geneva optimization library

  • The Common sublibrary contains functionality that could potentially be used by all other software components. For example, it contains the implementation of a thread pool, a logger, a parser and other useful components.

  • The Geneva sublibrary provides functionalities specific to parametric optimization. Figure 2 shows an overview of the main base classes of this sublibrary. GObject is directly or indirectly the base class for most of the classes in the Geneva sublibrary. G_Optimization_Algorithm_Base is the base class for all optimization algorithms and unifies their interface within Geneva. GParameterBase is the base class for all types of parameters that are adjusted during optimization. Individual parameters of different types and lists of parameters are derived from this class. GAdaptorT is the base class for classes that determine the adaptation of candidate solutions within optimization algorithms. GParameterSet is the base class for parameter sets to be optimized. Users define optimization problems by deriving a class from GParameterSet. In doing so, users add parameters to their parameter set which are indirectly derived from GParameterBase, and also define a quality function taking the parameters as input and returning one or more floating point numbers for the objectives (a minimal sketch of such a user-defined class is shown after this list). The currently implemented optimization algorithms are a multi-objective evolutionary algorithm, gradient descent, simulated annealing, a swarm algorithm and parameter scans.

  • The Courtier library is a template library for parallelization. It is used within Geneva for parallel evaluation of individuals, but can also be used for other use cases because of its generic implementation. Since this sublibrary is particularly relevant for the MPI Consumer implemented in this work, it will be discussed in detail afterwards.

  • The Hap library provides mechanisms for the transparent, centralized generation of random numbers in multithreaded environments.

  • Furthermore, automated unit tests and benchmarks are used to avoid errors in the library collection. A set of examples is provided to give users an insight into the many capabilities of Geneva and get them started quickly.
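As referenced above, a minimal skeleton of a user-defined parameter set might look as follows. This is a schematic sketch only: the header path, the namespace and the exact signature of fitnessCalculation vary between Geneva versions and are assumptions here.

```cpp
// Schematic sketch; header location, namespace and method signature are
// assumptions based on the description above, not verbatim Geneva API.
#include <geneva/GParameterSet.hpp>   // assumed header location

class MyIndividual : public Gem::Geneva::GParameterSet {   // assumed namespace
public:
    MyIndividual() {
        // Parameters derived from GParameterBase would be added here,
        // thereby defining the search space of the optimization problem.
    }

protected:
    // The user-defined quality function: maps the current parameter values
    // to a floating point objective value.
    double fitnessCalculation() override {
        double quality = 0.;
        // ... evaluate the cost of the current parameter values ...
        return quality;
    }
};
```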

Fig. 2: An overview of the top-level base classes of the Geneva sublibrary

Parallelization with the Courtier Library

The Courtier library plays a particularly important role for the MPI Consumer, as the consumer is added to this sublibrary as a new module. Fig. 3 shows the basic parts that make up the Courtier sublibrary.

Fig. 3: Overview of the Courtier library, which is an independent sublibrary of Geneva and provides networking functionalities

Single tasks to be executed are defined through so-called work items. These are implemented as GProcessingContainerT instances embedded in an instance of type GCommandContainerT. GProcessingContainerT provides a generic interface for computable entities consisting of data, the function to compute and metainformation. GCommandContainerT holds an attribute of type GProcessingContainerT together with a command indicating the current processing stage of the work item, and provides (de)serialization functionality.

The central building block of the Courtier library architecture is the broker, which provides an abstraction layer separating producers and consumers. Producers can register a so-called buffer port, which consists of two thread-safe queues, with the broker. Producers can then pass work items to be processed to the broker by adding them to the in-buffer of the buffer port. On the other hand, one or many consumers can be registered, which request work items and process them in a consumer-specific manner. When a work item is requested by a consumer, the broker takes one of the available work items from the in-buffer of one of the registered buffer ports. If the in-buffers of all buffer ports are empty, the consumer receives an empty GCommandContainerT whose command attribute indicates that no work items are currently available. Every time a consumer requests a new work item, it returns the computed work item of the previous request (if a work item was obtained in the previous request). The broker inserts returned items into the out-buffer of the buffer port which they were previously taken from. The distribution of items occurs on both the producer and consumer sides in a round-robin fashion.
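The following condensed sketch illustrates the buffer port and broker concept described above. It is a conceptual illustration only and does not reproduce the actual Courtier class templates; WorkItem, the class names and the round-robin policy shown are simplifications.

```cpp
// Conceptual illustration of the broker/buffer-port idea; not the actual
// Courtier implementation.
#include <deque>
#include <memory>
#include <mutex>
#include <optional>
#include <vector>

struct WorkItem { /* payload, command and metadata of a single task */ };

// A buffer port consists of two thread-safe queues: raw items going to the
// consumers (in-buffer) and processed items coming back (out-buffer).
class BufferPort {
public:
    void pushRaw(WorkItem item)            { push(rawMutex_, raw_, std::move(item)); }
    std::optional<WorkItem> popRaw()       { return pop(rawMutex_, raw_); }
    void pushProcessed(WorkItem item)      { push(processedMutex_, processed_, std::move(item)); }
    std::optional<WorkItem> popProcessed() { return pop(processedMutex_, processed_); }

private:
    static void push(std::mutex& m, std::deque<WorkItem>& q, WorkItem item) {
        std::lock_guard<std::mutex> lock(m);
        q.push_back(std::move(item));
    }
    static std::optional<WorkItem> pop(std::mutex& m, std::deque<WorkItem>& q) {
        std::lock_guard<std::mutex> lock(m);
        if (q.empty()) return std::nullopt;
        WorkItem item = std::move(q.front());
        q.pop_front();
        return item;
    }
    std::mutex rawMutex_, processedMutex_;
    std::deque<WorkItem> raw_, processed_;
};

// The broker hands out raw work items from the registered buffer ports in
// round-robin order. (Routing processed items back to the originating port
// is omitted in this sketch.)
class Broker {
public:
    void enroll(std::shared_ptr<BufferPort> port) {
        std::lock_guard<std::mutex> lock(mutex_);
        ports_.push_back(std::move(port));
    }
    std::optional<WorkItem> get() {
        std::lock_guard<std::mutex> lock(mutex_);
        for (std::size_t i = 0; i < ports_.size(); ++i) {
            auto& port = ports_[next_++ % ports_.size()];
            if (auto item = port->popRaw()) return item;
        }
        return std::nullopt;   // all in-buffers are currently empty
    }

private:
    std::mutex mutex_;
    std::vector<std::shared_ptr<BufferPort>> ports_;
    std::size_t next_ = 0;
};
```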

It is important to note that the producer, broker and consumer all run in the same process (but on different threads) and therefore cannot be distributed to different machines. This makes the system architecture fundamentally different from well-known message-queuing architectures such as MQTT [32]. In this way, many components of the overall system run on the same machine. With a large number of producers and consumers, this could lead to a bottleneck, which would result in reduced scalability. In practice, usually exactly one producer and exactly one consumer are used, so this potential bottleneck has little to no relevance.

The parallelization of computation is done on the consumer side. Different consumer types implement the distribution of individual work items in different ways. So far, the available consumer types have been a serial consumer, a multithreaded consumer, and two distributed consumers based on the Boost.Asio and Boost.Beast networking libraries, respectively. As the main contribution of this work, another distributed consumer, the MPI Consumer, is developed, which significantly improves Geneva's user experience for high-performance computing and adds new useful features for solving large-scale optimization problems. Of particular interest are the distributed consumers. These represent servers and provide work items via their respective network technologies to appropriately implemented consumer clients. The work items are serialized via the serialization functionality of GProcessingContainerT before each send and deserialized after each receive. The actual computation of the items is then done on the client side. Clients must be started separately from the main process as independent processes (usually on other physical machines), and it must be ensured that a network connection to the server is possible at the time of client startup.

Building an MPI Consumer

This section explains the MPI Consumer’s system design. To begin with, in section "System Design Overview" an overview of the system architecture is given. The subsequent sections explain the individual parts in more detail.

System Design Overview

The MPI Consumer developed in this work is a distributed consumer within Geneva's Courtier sublibrary. Figure 4 shows a bird's-eye view of the MPI Consumer's system architecture.

Fig. 4: Overview of the components which together form the MPI Consumer

The MPI Consumer consists of a server, clients and subclients. The server waits for clients requesting work items. To make the server more scalable, it handles incoming requests with multiple execution streams. When a work item is requested, the server takes a work item from the broker (if available) and sends it to the client that requested it. Clients use asynchronous requests: while they are computing a work item, they already send the request for another work item. Asynchronous requests can increase the productivity of the clients by filling the waiting time for server responses with computations. The request for a new work item contains (if available) the previously computed work item, which the server delivers back to the broker.

The client–server model is not common for MPI applications; more commonly, the fork-join model is used. In the context of optimization algorithms such as the evolutionary algorithm, the fork-join model would mean splitting the population of candidate solutions into equally sized groups at the beginning of each iteration of the algorithm and sending one group to each worker process. However, since the evaluation of individual work items can take very different amounts of computation time, the fork-join model would in this case lead to suboptimal use of the computational resources of the nodes: early-finishing processes would have to wait for slower nodes to complete their computation, because the evolutionary algorithm can only generate the next population once all processes have returned all candidate solutions. In the client–server model, on the other hand, the candidate solutions of each iteration of the optimization algorithm are retrieved one by one by the clients. Since the number of individuals in the population is usually at least one order of magnitude greater than the number of clients, each client sends many requests in each iteration. Differently from the fork-join model, the client–server model results in natural load balancing, since faster clients send more requests and can thus compute more work items.

In addition, the client–server model makes it comparatively easy to achieve the desired fault tolerance with respect to unreachable clients. Since the clients work independently of each other and the server does not need to know the number of intact clients, all remaining clients are unaffected if individual clients become unavailable. To guarantee fault tolerance, it is furthermore necessary to protect against failed network communication by using timeouts. This ensures that the server does not wait indefinitely for responses from unavailable clients.

There is an option for work items to be computed in small groups. Each client is then assigned a certain number of subclients among which it distributes the calculation of individual work items. However, since the cost function of the work items is specified by the user, its parallelization must also be implemented by the user. The MPI Consumer makes this possible in an intuitive way by providing a preconfigured infrastructure for the parallelization of the cost function.

At the program start time, all processes are equal. Depending on their MPI rank the processes configure themselves as server, client or subclient. This is convenient to use with modern cluster scheduling systems since at the time of starting a job all processes are homogeneous and MPI is natively supported by common workload managers.

Fault Tolerance with Timeouts

This section shows how the MPI Consumer can handle a broad range of possible errors using a client–server model with timeouts.

Since the MPI Consumer shall be extremely scalable, fault tolerance is essential. This is because even if Geneva itself were fault-free, there would always be classes of faults, such as network or hardware failures, that are outside of the software's control. If the probability of a fault \(f \in F\) on a node is described by \(p_{f}\), then the probability that such a fault occurs on at least one node in a system with n nodes is \(1 - (1 - p_{f})^{n}\). For example, with an error probability \(p_{f}\) of only \(0.1\%\), the probability that f occurs in a system with 500 nodes is \(1 - (1 - 0.001)^{500} \approx 39\%\). Thus, the claim to provide high scalability also requires the solid handling of unpredictable errors.

Not all errors can be caught with reasonable effort. Therefore, the MPI Consumer's fault tolerance focuses on faults f of the class \(F:= \{\)temporary or permanent inaccessibility of clients\(\}\). The error class F covers a large portion of common errors that can occur in the system, since it is comparatively generic and thus includes many types of errors (network connection between server and client unavailable, client crashed for an unknown reason, etc.), and since the probability of occurrence of these errors increases with the number of clients, as explained before. The probability of unpredictable server-side errors, however, is not related to the number of clients and is therefore less relevant. In addition, server-side fatal errors such as hardware failures are also much more difficult to handle. If the specification S of the software is defined as \(S:= \{\)the MPI Consumer processes all available work items in finite time\(\}\), then the MPI Consumer with n clients satisfies S under the assumption \(A:= \{\)\(f \in F\) occurs at most \(n-1\) times for an infinite time\(\}\).

The fault model used is crash-recovery [33, pp. 34–35] and states that clients can fail unexpectedly (i.e., without reporting an error in a defined manner) and failed clients can spontaneously become available again. This is a realistic model in the context of Geneva. After all, clients might be unavailable due to hardware, software or network failures. At the same time, it is also possible that a client is merely busy with the calculation of a work item for a longer time than expected and therefore participates in the system again after an alleged error has been detected.

Fig. 5: Visualization of the network communication between clients and server, showing that the server is independent of the clients' states

To ensure that the specification S can be satisfied even if an error \(f \in F\) occurs, the server must be independent of the number of available clients at all times and ensure that a failing client does not affect the satisfaction of S. To enable this, as shown in Fig. 5, a client–server model with two-sided timeouts is used. The server does not know anything about the number or status of clients. It only knows that occasionally requests from the black box will arrive and that they carry a sender address to which the response should be sent. The behavior of the system with n clients is defined for all \(n \in \mathbb{N}_0\) as waiting cyclically for requests and processing them. Assuming that no errors occur and \(n \ge 1\), this behavior obviously fulfills S. To make sure that S is also fulfilled in case of errors \(f \in F\), fault tolerance mechanisms must ensure that a system with n clients behaves like an intact system with \(n-1\) clients once an error \(f \in F\) occurs. For this purpose, the system must have a detector that detects the unavailability of a node and a corrector that resets the server's state to its initial state before the error. As a detector, the MPI Consumer uses timeouts. Each network communication operation is expected to terminate within a given finite time frame. If the time frame is exceeded, the connection is terminated by the corrector and the resources associated with the connection are released.

Conversely, clients also detect the unavailability of the server. In this case, a controlled shutdown of the client is initiated. The reason for server unavailability can be the controlled shutdown of the server, a network failure or an error on the server side. In all cases, shutting down the client makes sense. In particular, in the use case of the Courtier library within Geneva, a controlled server shutdown occurs at the end of the optimization process.

Listing 1: Implementation of a timeout around an asynchronous MPI network operation

To realize the client–server model, point-to-point communication between the server and the individual clients must be used. For this purpose, MPI itself provides the following three send functions (and the corresponding receive functions): (1) MPI_Send, the standard blocking send, which may buffer the message or synchronize with the receiver; (2) MPI_Ssend, the fully synchronous blocking send; and (3) MPI_Isend, the non-blocking (asynchronous) send. None of the three functions supports timeouts, which are, nevertheless, required for implementing the previously explained loosely coupled client–server architecture.

Therefore, we use MPI’s asynchronous point-to-point operations to implement timeouts ourselves, as briefly shown in Listing 1. First, an asynchronous network operation is initiated (using the previously mentioned MPI_I* operations). Then, the current time is recorded so that the elapsed time can be determined later. The operation is then repeatedly checked for completion until it has either completed or the timeout has been exceeded. At the end of each iteration, depending on the use case, a waiting time can be configured to put less stress on the CPU. Depending on how critical it is to learn the status of the message transmission early, this wait factor can be adjusted or omitted. To return the process to its initial state in case of a timeout, the network operation is aborted and the associated resources are released. The loop is then exited and the process can resume its task.
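A minimal sketch of this pattern is given below; it assumes an MPI_Isend-based send path, and the function name, timeout and polling interval are illustrative choices rather than Geneva's actual implementation.

```cpp
#include <mpi.h>
#include <chrono>
#include <thread>

// Sketch: wrap an asynchronous MPI send in a timeout. The function name and
// parameters are illustrative; the same scheme works for MPI_Irecv.
bool sendWithTimeout(const void* buf, int count, int dest, int tag, MPI_Comm comm,
                     std::chrono::milliseconds timeout,
                     std::chrono::milliseconds pollInterval) {
    MPI_Request request;
    MPI_Isend(buf, count, MPI_CHAR, dest, tag, comm, &request);    // start the asynchronous operation

    const auto start = std::chrono::steady_clock::now();           // remember the starting time
    for (;;) {
        int completed = 0;
        MPI_Test(&request, &completed, MPI_STATUS_IGNORE);          // has the operation finished?
        if (completed) return true;

        if (std::chrono::steady_clock::now() - start > timeout) {   // timeout exceeded:
            MPI_Cancel(&request);                                    // abort the operation and
            MPI_Wait(&request, MPI_STATUS_IGNORE);                   // release the associated resources
            return false;
        }
        std::this_thread::sleep_for(pollInterval);                   // optional back-off to reduce CPU load
    }
}
```

Note that MPI_Cancel only marks the request for cancellation; the subsequent MPI_Wait completes the cancelled request and frees its resources, which corresponds to the corrector step described above.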

Asynchronous Requests

In this section, we explain the idea of asynchronous client requests to improve the system’s scalability.

If the system had been implemented without asynchronous requests, like the consumers that have existed in Geneva so far, scalability would be limited by Amdahl's law [31, p. 5, 34]. According to Amdahl, the minimum execution time achievable by parallel evaluation of candidate solutions is the sum \(t_{min} = t_{seq} + t_{par}\) of the time \(t_{seq}\) spent sequentially on server-side computation (i.e., primarily (de)serialization) and the time \(t_{par}\) required for parallel execution on the clients. This phenomenon is of great importance, as testing has shown that the (de)serialization of work items requires a comparatively large amount of computation time on the server. This is not surprising, since regardless of the number of clients, each work item must be (de)serialized once on the server. The clients of the MPI Consumer use asynchronous requests to increase their productivity and thereby the performance and scalability of the overall system. The reason is that asynchronous requests can effectively circumvent Amdahl's law by allowing sequential computations on the server to overlap in time with computations on the clients, so that \(t_{min} \approx t_{par}\) holds. This method had also been briefly proposed by Dr. Berlich and is now implemented as part of the MPI Consumer [31, p. 6].

The asynchronous request mechanism used in the MPI Consumer clients can be described by the deterministic finite state machine shown in Fig. 6. The operations mapped to the edges of the state machine are defined as follows:

Fig. 6: State machine for the double buffering mechanism used in the MPI Consumer clients, with the operations req := asynchronous sending of a request, recv := blocking receive of a reply, proc := processing of the previously received reply

req: Sending a request for a new work item asynchronously to the server. The request includes sending the last processed work item (if any) to the server. The call returns immediately and is processed by an independent execution stream.

recv: Waiting until the result of the previously dispatched req operation has been received.

proc: Processing the work items previously received by recv.


The states of the automaton can be described informally as follows:

q0: No communication with the server has been started yet.

q1: No work items have been received yet and the request for the first work item has been started.

q2: There is an unprocessed work item in a local queue. There is also either zero or one processed work item in a local queue.

q3: There is one unprocessed work item in a local queue and another work item has been requested but has not yet been received.

q4: There is a processed work item in a local queue and another work item has been requested, but has not yet been retrieved.

As can be seen, the clients first send a request to the server and wait for the reply to fill their local queues, which makes them reach state q2. Once state q2 is reached, the clients are in a cycle in which they always first request a new work item and then compute the currently available work item while this request is being processed. In this way, the waiting times for the server’s responses are filled with meaningful computations. Note, however, that asynchronous requests have no effect if there is no previous work item, i.e., on the first work item to be processed. For population-based optimization algorithms parallelized with the MPI Consumer, this case occurs at the beginning of each iteration of the optimization algorithm.
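The cycle q2 → q3 → q4 → q2 can be sketched roughly as follows. The tags, buffer sizes and helper functions (processWorkItem, isShutdownCommand) are hypothetical placeholders for Geneva's serialized work item handling and are not part of the actual client code.

```cpp
// Sketch of the double-buffered client loop; tags, sizes and the helper
// functions are hypothetical placeholders, not Geneva's actual code.
#include <mpi.h>
#include <cstddef>
#include <utility>
#include <vector>

constexpr int SERVER_RANK = 0;
constexpr int TAG_REQUEST = 1;     // client -> server: request (carrying the last result, if any)
constexpr int TAG_WORK_ITEM = 2;   // server -> client: next work item
constexpr std::size_t MAX_ITEM_SIZE = 1 << 20;

// Hypothetical helpers standing in for deserialization, evaluation and serialization.
void processWorkItem(const std::vector<char>& item, std::vector<char>& result);
bool isShutdownCommand(const std::vector<char>& item);

void clientLoop() {
    std::vector<char> item(MAX_ITEM_SIZE), result(MAX_ITEM_SIZE), sendBuf(MAX_ITEM_SIZE);

    // q0 -> q1 -> q2: the very first request carries no result yet.
    MPI_Send(sendBuf.data(), 0, MPI_CHAR, SERVER_RANK, TAG_REQUEST, MPI_COMM_WORLD);
    MPI_Recv(item.data(), static_cast<int>(item.size()), MPI_CHAR, SERVER_RANK,
             TAG_WORK_ITEM, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    while (!isShutdownCommand(item)) {
        // req: asynchronously return the previous result and ask for the next item (q2 -> q3).
        MPI_Request request;
        MPI_Isend(sendBuf.data(), static_cast<int>(sendBuf.size()), MPI_CHAR,
                  SERVER_RANK, TAG_REQUEST, MPI_COMM_WORLD, &request);

        // proc: evaluate the current item while the request is in flight (q3 -> q4).
        processWorkItem(item, result);

        // recv: wait until the send buffer may be reused, then fetch the next item (q4 -> q2).
        MPI_Wait(&request, MPI_STATUS_IGNORE);
        std::swap(sendBuf, result);   // the fresh result becomes the payload of the next request
        MPI_Recv(item.data(), static_cast<int>(item.size()), MPI_CHAR, SERVER_RANK,
                 TAG_WORK_ITEM, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}
```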

Multithreading

This section explains the multithreading design of the MPI Consumer server: what considerations have been made to optimize the server's scalability using multithreading, and how the responsible software components are designed.

Computers in modern HPC clusters, which the MPI Consumer is targeted at, are equipped with a high number of CPU cores. For instance, GSI’s Virgo cluster in Darmstadt, Germany, is composed of over 470 machines with 56 to 256 CPU cores each. Therefore, the MPI Consumer server features a minimal overhead multithreading design which allows scaling to this high number of physical cores without suffering from congestion. Unlike other networking libraries such as Boost.Asio or Boost.Beast, MPI does not provide an infrastructure for multithreading. Hence, multithreading must be implemented from scratch in the MPI Consumer. In the diagram in Fig. 7, the main components of the server involved in the multithreading are shown as an excerpt from Fig. 4.

Fig. 7: Visualization of the multithreading concept used in the MPI Consumer server

The design follows the idea that few execution streams are appropriate for small tasks that are dominated by waiting time, whereas more execution streams are useful for computationally intensive tasks. For small tasks associated with waiting time, the waiting times of multiple tasks can be overlapped within a single execution stream, and it is not critical if these tasks are processed serially because their computation does not take much time. Splitting the small tasks among many execution streams would lower the productivity of those streams and thus block hardware resources unnecessarily, since they would be waiting for a large part of the time. For computationally intensive tasks without waiting, on the other hand, parallelization is useful because the difference between serial and parallel execution time is significant: ideally, using more cores divides the computation time by the number of execution streams used for these types of tasks. The four components receiver thread, thread pool, clean-up thread and open sessions queue are the most important building blocks involved in organizing the execution streams. For the reasons mentioned above, requests are first bundled in the receiver thread, then distributed to the thread pool, and finally re-combined in the clean-up thread.

The receiver thread permanently waits for requests from arbitrary clients. When a request arrives, it is passed to the thread pool for processing as soon as possible. This allows the receiver thread to start waiting for further requests again as soon as possible afterwards and to respond to them as quickly as possible.

The thread pool is a group of execution streams dealing with processing requests already received by the receiver thread. The number of execution streams of the thread pool is configurable and should be chosen according to the number of physical CPUs available on the host machine. Processing a request involves comparatively time-consuming operations—mainly the deserialization of the request and the serialization of the response. With a thread pool with n execution streams, n already accepted connections can be processed simultaneously, which approximately divides the required processing time by n. Once the response has been serialized and the asynchronous operation for sending it to the client has been initiated, the thread pool adds the pending session to the open sessions queue. From that point on, this execution stream of the thread pool is immediately available again for processing new requests, i.e., before the send operation is completed.

The open sessions queue is a thread-safe queue for processed sessions whose send operation has not yet been completed. To release allocated resources without errors, it is essential to check the status of the sessions’ network operations. The clean-up thread periodically iterates over the queue and checks if there are sessions whose network operation has been completed, whose timeout has been exceeded, or which are in an erroneous state. In each of these cases, that session is removed from the queue, and if there is an error, it is handled.
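The interplay of the open sessions queue and the clean-up thread can be sketched roughly as follows. The Session structure, the timeout handling and the error treatment shown here are simplified assumptions for illustration, not the actual Geneva code.

```cpp
// Simplified sketch of the open sessions queue that the clean-up thread
// periodically scans; the Session structure and error handling are assumptions.
#include <mpi.h>
#include <chrono>
#include <deque>
#include <mutex>

struct Session {
    MPI_Request sendRequest;                          // pending asynchronous send of the response
    std::chrono::steady_clock::time_point started;    // time at which the send was initiated
};

class OpenSessionsQueue {
public:
    void push(Session s) {
        std::lock_guard<std::mutex> lock(mutex_);
        sessions_.push_back(s);
    }

    // Called periodically by the clean-up thread: removes completed sessions
    // and cancels sessions whose timeout has been exceeded.
    void cleanUp(std::chrono::milliseconds timeout) {
        std::lock_guard<std::mutex> lock(mutex_);
        for (auto it = sessions_.begin(); it != sessions_.end();) {
            int completed = 0;
            MPI_Test(&it->sendRequest, &completed, MPI_STATUS_IGNORE);
            if (completed) {
                it = sessions_.erase(it);                        // response delivered, release the session
            } else if (std::chrono::steady_clock::now() - it->started > timeout) {
                MPI_Cancel(&it->sendRequest);                    // client presumably unavailable
                MPI_Wait(&it->sendRequest, MPI_STATUS_IGNORE);   // complete the cancelled request
                it = sessions_.erase(it);
            } else {
                ++it;                                            // still in flight, keep waiting
            }
        }
    }

private:
    std::mutex mutex_;
    std::deque<Session> sessions_;
};
```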

Automatic Configuration

This section explains how the MPI Consumer’s ability to automatically configure its processes allows for easy scheduling on modern HPC clusters using common scheduling systems such as Slurm [35].

The distributed consumers so far implemented in Geneva (the Boost.Asio and Boost.Beast consumers) required the server's IP address and port at client start-up time. However, when scheduling jobs on modern HPC clusters using scheduling systems such as Slurm [35] or PBS [36, 37], the specific machines to run the job are determined dynamically based on the required and available resources. Therefore, scheduling Geneva on HPC systems required a two-phase approach: first scheduling the server, then determining its IP address, and only then scheduling the clients. For this reason, Geneva was not convenient to use on HPC clusters in the past. When using the MPI Consumer, all processes are initially equivalent and independent and all take the same command line arguments. The processes configure their role later at runtime using their MPI rank, which is set as an environment variable by the scheduling system.

Fig. 8: Visualization of scheduling a Geneva optimization run on an HPC cluster

Figure 8 shows how the distributed startup and configuration of an optimization with the MPI Consumer works conceptually. Only a single, trivial submit script is needed on a submit node, as opposed to the previously mentioned two-phase startup. With an appropriate command (in Slurm, e.g., sbatch), the scheduling system is instructed to allocate a certain number of compute nodes and start the script on the first node. The submit script then starts the configured number of instances of Geneva on the allocated compute nodes. In doing so, the scheduling system's MPI integration automatically sets up the runtime environment of the Geneva processes so that each process is uniquely identifiable and reachable by its rank. Only then does the MPI Consumer itself decide which process takes on which role, so that the initially homogeneous processes take on heterogeneous tasks.

Subclient Parallelization

In this section, we explain how the MPI Consumer allows users to conveniently implement distributed parallelization of their cost functions by forming groups of clients which together compute individual cost functions. Later in section "Using and Evaluating the MPI Consumer", more details about the exposed interface are shown and a typical usage pattern is presented.

Geneva so far only parallelizes the optimization at the population level of the optimization algorithm, but not at the individual level. This means that while multiple individuals can be computed simultaneously on different clients, Geneva does not provide an infrastructure for parallelizing the quality function itself. In some use cases, nevertheless, the problem complexity requires the distribution of the evaluation function to multiple machines because the number of cores or the amount of main memory on a single machine would not be sufficient. Implementing this as a user on top of Geneva is an extremely complex task and definitely not something that a domain expert using Geneva would like to do or would be capable of doing. Furthermore, it is a task that can be partially solved in a generic way to avoid multiple users of Geneva facing the same challenge.

The GMPISubClientOptimizer provides a new programming interface for Geneva that works with the MPI Consumer to provide the user with convenient access to preconfigured compute nodes of the cluster to implement the parallelization of the cost function using MPI. The individual cost functions are computed by each client with the help of subclients in a distributed manner.

Fig. 9: Illustration of the communication groups used by the MPI Consumer and GMPISubClientOptimizer

To realize this, different, independent communication groups are needed; these are visualized in Fig. 9. The base communicator is the one formed at initialization time and includes all Geneva processes. In a system with m clients, \(m+1\) subgroups are derived from the base communicator. One of the subgroups is used for communication between the Geneva server and its clients. Each of the other m groups contains a client and a configurable number of subclients.
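One way to derive such groups from a base communicator is MPI_Comm_split, as in the following sketch. The rank-to-role mapping and the number of subclients per client are assumptions made for illustration and are not necessarily the mapping used by Geneva.

```cpp
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int nSub = 3;   // subclients per client (assumed configuration value)
    const bool isServer = (rank == 0);
    // Assumed block mapping: after the server, each block of (nSub + 1) consecutive
    // ranks forms one group, with the first rank of the block acting as the client.
    const int  group    = isServer ? MPI_UNDEFINED : (rank - 1) / (nSub + 1);
    const bool isClient = !isServer && ((rank - 1) % (nSub + 1) == 0);

    // Subgroup 1: the Geneva server and all clients (used for distributing work items).
    MPI_Comm serverClientComm = MPI_COMM_NULL;
    MPI_Comm_split(MPI_COMM_WORLD, (isServer || isClient) ? 0 : MPI_UNDEFINED,
                   rank, &serverClientComm);

    // Subgroups 2..m+1: one per client, containing the client and its subclients;
    // this is the communicator handed to the user-defined cost function.
    MPI_Comm subClientComm = MPI_COMM_NULL;
    MPI_Comm_split(MPI_COMM_WORLD, group, rank, &subClientComm);

    // ... server, client and subclient logic would follow here ...

    MPI_Finalize();
    return 0;
}
```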

In Geneva, users usually interact with the GParameterSet class and the Go2 class. Users create a class derived from GParameterSet, within which they define the optimization problem, mainly by defining a parameter search space and by overriding the fitnessCalculation method to implement the cost function. The Go2 class coordinates the optimization and provides a convenient abstract interface to optimization algorithms and parallelization strategies. To make parallelization with subclients as user-friendly as possible, we have created an extended user interface for Geneva by developing the two classes GMPISubClientIndividual and GMPISubClientOptimizer, which are directly derived from the above-mentioned classical interface. Apart from a couple of additional public methods that enable the desired functionality, the interface is equivalent to the traditional one, but internally uses additional mechanisms to set up the subclient infrastructure.

Listing 2: Usage of the GMPISubClientOptimizer and GMPISubClientIndividual classes

Listing 2 shows how the user can use the GMPISubClientOptimizer class. First, the class defining the optimization problem (UserIndividual) is derived from GMPISubClientIndividual instead of GParameterSet (not shown) to get access to the additional functionality needed. Also, the Go2 class in the application's main function is replaced by the GMPISubClientOptimizer class. Then a function is defined which is to be executed by the subclients (subClientJob). This function is registered with the optimizer in the main function. The optimizer is then responsible for calling this function for each subclient with the correct configuration and communicator, thereby creating the groups previously shown in Fig. 9. Inside the fitnessCalculation method, which is inherited from GParameterSet and defines the optimization problem, the user has access to the preconfigured communicator by means of the getCommunicator method. The implementation of the parallel computation of the quality function is application-specific and can now be freely designed by the user within the two methods fitnessCalculation and subClientJob. In many cases, it is useful to run an algorithm on the subclients until the associated client no longer provides any more parameter sets. For example, in each iteration of a while loop, the data of a subproblem could be received from the associated client, then computed, and then returned. Once the associated client shuts down because the optimization algorithm on the server has finished, this loop should also be terminated. This commonly used scheme is straightforward to implement thanks to the getClientStatus method inherited from GMPISubClientIndividual, as also shown in Listing 2.
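Based on this description, the overall usage pattern can be sketched as follows. The sketch is schematic: the stand-in declarations, the registration method name (registerSubClientJob), the status enum and the exact signatures are assumptions and will differ from the actual Geneva code shown in Listing 2.

```cpp
// Schematic reconstruction of the usage pattern described above.
#include <mpi.h>

// Stand-in declarations so the sketch is self-contained; in a real application
// these come from the Geneva headers (names per the description above, details assumed).
enum class ClientStatus { RUNNING, DOWN };
struct GMPISubClientIndividual {
    static ClientStatus getClientStatus();              // assumed accessor
};
struct UserIndividual : GMPISubClientIndividual { };     // user's class; overrides fitnessCalculation()
struct GMPISubClientOptimizer {
    GMPISubClientOptimizer(int argc, char** argv);
    void registerSubClientJob(int (*job)(MPI_Comm));     // hypothetical registration method
};

// Executed by every subclient: cooperate with the associated client until the
// optimization on the server has finished.
int subClientJob(MPI_Comm communicator) {
    while (UserIndividual::getClientStatus() == ClientStatus::RUNNING) {
        // ... receive a subproblem from the client via `communicator`,
        //     compute it and send the result back ...
    }
    return 0;
}

int main(int argc, char** argv) {
    // GMPISubClientOptimizer replaces the usual Go2 class.
    GMPISubClientOptimizer optimizer(argc, argv);

    // Register the function to be executed on all subclients; the optimizer calls
    // it with the sub-communicator of the respective group (cf. Fig. 9).
    optimizer.registerSubClientJob(subClientJob);

    // ... add the UserIndividual and an optimization algorithm, then start the
    //     optimization as with the traditional interface ...
    return 0;
}
```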

Using and Evaluating the MPI Consumer

In this section, we evaluate the new features that the MPI Consumer adds to Geneva and how they affect Geneva’s user experience for high-performance computing. Furthermore, we use performance tests to investigate the MPI Consumer’s scalability.

New Features and User Experience

As mentioned previously in section "Subclient Parallelization", Geneva originally did not support parallelization of the cost function itself but was only capable of executing multiple cost functions in parallel. The MPI Consumer introduces fine-grained parallelization of the cost function using subclients.

The enhanced user interface for using subclients in Geneva is identical to the classical interface apart from a few additional methods. This allows users to adopt subclient optimization with minimal changes to their code: only the base class of the user-defined optimization problem and the optimizer class must be changed, which merely requires modifying two lines of user source code. In fact, in the simpler use case without subclients, no modifications to user code are necessary at all, because the MPI Consumer also integrates with the Go2 class, which is part of the traditional Geneva user interface. To use the MPI Consumer, just a different command line parameter must be specified at startup time. The framework offered to the user already takes care of (1) the initialization of the network communication; (2) the coordination of processes and their assignment to different roles; (3) the invocation of the client and subclient code at the right time; (4) checking the clients' state to determine the end of the optimization; and (5) shutting down the network connections at the end. The remaining task for the user is thereby reduced to the absolute minimum, which is application-specific. Furthermore, implementing the parallelization with subclients is intuitive for domain experts, as they are already accustomed to working with MPI, which has long been the standard for high-performance computing.

The second challenge that Geneva was facing in the past was its integration with modern high-performance computing workflows. As explained in section "Automatic Configuration", starting Geneva processes required constant configuration parameters at start-up time, which in high-performance computing environments are usually only determined after the job has been submitted. This drawback made Geneva not as easy to use as it should have been and required a multi-step approach for submitting HPC jobs.

In contrast, with the MPI Consumer, Geneva now integrates well with modern cluster scheduling systems. For example, to schedule an MPI program (program) with n processes on a high-performance cluster using Slurm as the scheduling system, the command srun --ntasks=n ./program --consumer mpi suffices. Similarly, for running a Geneva optimization on a single machine, the MPI Consumer seamlessly integrates with MPI launcher programs such as mpirun and mpiexec.

Scalability Evaluation

To test the MPI Consumer's scalability, we have created a performance test which performs a pseudo-optimization with a fixed number of individuals and iterations and measures the execution time required for the entire optimization. The test takes two vectors as input: the numbers of clients and the durations of the simulated cost function computation. The test then determines the execution times for all elements of the Cartesian product of these input vectors and presents them as a three-dimensional graph, as shown in Fig. 10. We have tested client counts in the range of 1 to 1000 and cost function evaluation times in the range of 0.001 to 10 seconds. The graph shows the speedup, which is calculated as the quotient of serial execution time (one client) and parallel execution time. The hardware used was a computer with AMD EPYC 7551 32-core processors and a total of 128 physical cores with 2 threads each. As for the configuration of the MPI Consumer, we have set the thread pool size to 64 and activated asynchronous client requests.

Fig. 10: Results of performance tests evaluating the MPI Consumer's scalability on a 128-core computer. The speedup is calculated as the quotient of serial and parallel execution times

The ideal behavior of the system is a linear speedup, which would show up in the graph as a plane containing every point for which the number of clients equals the speedup. In contrast, for a system with a scalability issue, both smaller evaluation times and a higher number of clients would be expected to have a negative impact on the efficiency of the system. This is because both parameters increase the frequency with which requests from clients arrive at the server and thus directly increase the server load.
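Here, the speedup is the quotient of serial and parallel execution time as stated above, and efficiency is used in its usual meaning of speedup per client:

\[
S(n) = \frac{T(1)}{T(n)}, \qquad E(n) = \frac{S(n)}{n},
\]

where \(T(n)\) denotes the total execution time of the pseudo-optimization with \(n\) clients.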

The test results depicted in Fig. 10 show a close-to-optimal speedup with an efficiency of \(\approx 94\%\) for cost functions with a computation time of 5 seconds or more. When using a shorter evaluation function with a duration of one second, the scalability is slightly reduced, resulting in a speedup of 640 for 1000 clients, which equals about 1000 requests per second. When the execution time of the cost function is decreased further, the efficiency drops, because this directly increases the frequency at which requests are received by the server. However, this scalability limitation for short execution times is something that naturally happens in any system, since the request frequency can theoretically be increased to an arbitrary value while hardware resources are limited. Taking into account the heavy (de)serialization work that has to be performed on the server for every request, we think that the scalability reached is reasonable. Moreover, tests with the other Geneva consumers (using the C++ Boost.Beast and Boost.Asio networking libraries) have indicated that the MPI Consumer is even more scalable than these components. Furthermore, one must note that we had up to 1000 client processes running on the same machine as a server that itself used more than 64 threads. The scalability shown in the test results might therefore have been inhibited by limited resources and thread contention, and is still more than reasonable. We therefore expect the MPI Consumer to perform even better in production environments on high-performance computing clusters, where each process is exclusively assigned the requested number of CPU cores.

Conclusion

The MPI Consumer adds a new software component for network communication to the powerful optimization library Geneva, making it now also well suited for high-performance computing. The MPI Consumer's subclient parallelization functionality enables intuitive, fine-grained parallelization of user-defined cost functions with minimal overhead. Furthermore, the MPI Consumer significantly improves Geneva's user experience for high-performance computing, as it now seamlessly integrates with common HPC scheduling systems. A performance evaluation of the MPI Consumer on a 128-core machine with up to 1000 client processes has shown close-to-optimal scalability for reasonable cost function computation times.

All software has been contributed to the official GitHub repository of Geneva [38] and is now available for public use. Users can benefit from the MPI Consumer directly, without any adaptations to their existing program code.

Independently of Geneva's optimization functionality, the MPI Consumer can be used as part of the Courtier sublibrary as a scalable framework for client–server workflows on HPC clusters.

Furthermore, the discussed components of the MPI Consumer can be used in a generalized form as programming patterns. Asynchronous communication with timeouts provides a general method to implement fault-tolerant systems on top of non-fault-tolerant communication primitives. The multithreading scheme of the MPI Consumer can be used independently as a template for a scalable server. In addition, the state machine for asynchronous requests is a general approach to increase the efficiency of sequential processing of tasks that involve waiting times.

Geneva is currently being used for fundamental physics research on a high-performance cluster at GSI in Darmstadt, Germany.