Hierarchical dynamic workload scheduling on heterogeneous clusters for grid search of inverse problems

Inverse problems occur in many scientific fields. Albeit grid search, where points of a regular grid are tested as possible solutions, is a straightforward and robust method to numerically solve inverse problems, it is computationally intensive and becomes prohibitive when the problem has a high dimensionality. Heterogeneous clusters are a viable and cost-effective solution to exploit the combined computational power of multiple available computers. In this paper, we present a computing framework that supports efficient grid search for inverse problems on heterogeneous clusters. Scheduling the workload on such systems might be challenging, especially when nodes are comprised of CPUs and GPUs with different computational speeds. The framework dynamically schedules computations on the processing elements of the cluster according to a selected performance index, which is determined at run-time. The framework is extensible, as it allows easy integration of additional inverse problems.


Introduction
Heterogeneous clusters are a cost-effective approach to combine existing computational infrastructure, typically within an organization, toward solving large computational problems. Multiple applications, libraries and computational kernels have been successfully implemented for heterogeneous clusters and exhibit satisfactory performance [4,5,34,37,42,54,56,66].
Achieving high performance for parallel applications becomes especially challenging on heterogeneous systems consisting of processing elements of varying performance characteristics. Therefore, it is challenging to develop algorithms that will distribute the workload of a parallel application among all processing elements so as to minimize execution time. Lately, the advent of multi-core CPUs, which in addition might support simultaneous multithreading (SMT) on each core, heterogeneous multi-core CPUs and accelerators like GPUs have further intensified the problem both within and across cluster nodes.
Static scheduling of the workload is effective when all processing elements are identical and the application does not introduce further imbalances after distribution of the workload, e.g., a different number of iterations is required to converge to a solution on different processing elements. Hence, it is not appropriate for heterogeneous clusters. Dynamic scheduling approaches, like Chunk Self-Scheduling, Guided Self-Scheduling [44], Trapezoid Self-Scheduling [62] and Factoring [22], improve on static scheduling, but suffer from their own problems. They introduce additional cost, as they require further communication and synchronization, and they still might not achieve good work balancing when processing elements of the cluster have large differences in processing speed. This is true in modern computing systems, where GPUs can have orders of magnitude higher performance compared to CPUs.
Recognition of these shortcomings has led to approaches that take into account the processing speed of nodes and processing elements within nodes. One route that has been followed is to execute a set of benchmarks on the cluster and determine the relative speed of each processing element. This information is stored and later on used to decide on the workload that should be assigned to each processing element. A drawback of this approach is that the benchmarks might not accurately resemble the compute and communication requirements of the applications that will be run, which sometimes might even depend on the specific input. This makes decisions regarding workload scheduling suboptimal. Furthermore, changes in the hardware of the cluster require new execution of benchmarks.
The most advanced scheduling techniques use real-time measurements during execution of the application of interest to determine the relative execution speed of processing elements and distribute work according to measured speed [6,12,52]. Some work is assigned to each processing element and the time required to complete the work is measured. A performance ratio function is used to make scheduling decisions on the fly, regarding the amount of work that should be next assigned to each processing element. This approach can be applied at different levels of the cluster, leading to hierarchical dynamic scheduling schemes. For 1 3 example, performance can be monitored at the node level, thus first distributing work according to the performance of each node. At a second level, the workload assigned to each node is further distributed among processing elements within the node, according to their relative performance.
In this work, we focus on the use of heterogeneous clusters to solve highdimensional inverse problems. Inverse problems consist in using the results of actual observations to infer the values of the parameters that characterize the system under investigation (cf. Sect. 2). Such problems occur in many scientific fields including Medicine [9,30], Earth Sciences [63,64], Civil Engineering [47,57], Solid Mechanics [26,57], Aeronautics [32] and Fluid Dynamics [18,40]. Our work is motivated by the importance of inverse problems in seismology. As Southeast Europe and West Asia are highly seismic areas, identifying seismic faults and monitoring, recording and evaluating their behavior is of high importance to organize, prepare and protect residential areas and infrastructure. Characteristics of seismic faults can be identified by formulating inverse problems using measurements either from seismographs [59,60] or GPS [63,64]. Developing and maintaining, however, a different code base for each inverse problem is a tedious task. A common software framework where different inverse problems can easily be incorporated is a much more attractive solution. If such a framework can in addition compute solutions of inverse problems in short time-spans, this can lead to further utilization of the framework for real-time applications, e.g., issuing warnings for tsunamis due to earthquakes.
In order to determine the values of the parameters that lead to the measured observations, a search space, i.e., a range of values where each parameter can lie, must first be defined. Trial solutions (different combinations of values for the parameters from within the search space) must be tested and the one that provides results closest to the observed measurements is typically considered to be the solution. How trial solutions are selected leads to different techniques to solve inverse problems. One approach is to sample the search space at equal distances per dimension, hence creating a regular grid. Although this is a robust method to numerically solve inverse problems, it is computationally intensive when the problem has a high dimensionality. Increasing the sampling rate on multiple dimensions to improve accuracy leads to an exponential increase in the number of trial solutions that must be tested. This is an important drawback that keeps the method from mainstream adoption. Even if parallel processing is considered, the sampling rate on each dimension of the problem must be kept low for problems of high dimensionality, in order for the computations to complete within an acceptable time-frame. Typical approaches, such as parallelizing only the outermost loop of the loop nest that samples the grid, might not expose sufficient parallelism to keep all processing elements busy. Instead, multiple levels of the loop nest must be parallelized. Overall, it is desirable to develop a generic framework that can help the user, expert or not, to easily implement regular grid searches on the domain of interest. Moreover, it would be desirable for these grid searches to take into account any special properties of the inverse problem and its objective function as well as of the computing platform so as to maximize efficiency and minimize run-times.
In this work, we introduce a parallel computing framework for regular grid searches of inverse problems on heterogeneous clusters. In the context of the framework, we introduce Hierarchical Real-time Dynamic Loop Scheduling (HRDLS), a hierarchical dynamic scheduling algorithm to distribute the workload of regular grid searches of inverse problems among nodes of the cluster and processing elements within each node. The assigned workload is dynamically determined through real-time measurements during execution of the application. HRDLS presents a challenge when it starts, since no data are yet available to determine performance of processing elements. We also introduce an initial adaptation strategy that exponentially increases the workload per processing element and quickly leads to appropriate timing measurements for further work distribution. Work distribution in our framework is further facilitated through linearization of the search space (collapsing of the whole loop nest that samples the grid). Linearization is feasible because the search grid is regular and allows representation of each grid point by a single integer value. Furthermore, it is easily reversible and it reduces computation and communication overhead of assigning work to processing elements in the cluster, especially for high-dimensionality problems. Linearization is performed automatically by the framework and is completely transparent to the user, relieving him/her from the corresponding programming burden. Finally, the framework is extensible, as it allows easy integration of additional inverse problems. The code of the framework is freely available. 1 In summary, our contributions are as follows: • We introduce a parallel computing framework for regular grid searches of inverse problems on heterogeneous clusters that is easy to use and easy to extend with the inclusion of new inverse problems. • We introduce Hierarchical Real-time Dynamic Loop Scheduling (HRDLS), a hierarchical dynamic scheduling algorithm used in the context of the framework. HRDLS uses real-time measurements to take decisions regarding distribution of the workload among nodes of the cluster and processing elements within each node. • We introduce an initial adaptation strategy during the first assignments of work that quickly leads to appropriate timing measurements for further work distribution with HRDLS. • We propose a method to represent each trial solution as a single integer, thus reducing the amount of data that needs to be exchanged among nodes and processing elements within a node. • We fine-tune the parameters of the proposed framework and show that our approach leads to satisfactory performance compared to a highly optimized, GPU-only application for grid searches of inverse problems.
In the remaining of this paper, we will use the following terminology. A node will be one of the computers that comprise the cluster. A processing element will be any 1 3 component of a node that is capable of performing computations, i.e., a CPU, a GPU or any other type of accelerator. A compute unit will be either a node or a processing element, depending on the level where our hierarchical scheduling algorithm is applied, i.e., at the cluster or node level. The rest of this paper is organized as follows: In Sect. 2, we discuss important issues with respect to inverse problems, thus identifying features that the computing framework should support. In Sect. 3, we present the design and implementation of the computing framework, as well as our hierarchical scheduling algorithm. In Sect. 4, we present two physical problems that we implemented in the context of the framework in order to evaluate performance. In Sect. 5, we present an experimental evaluation of the framework. In Sect. 6, we present related work, whereas in Sect. 7, we discuss conclusions.

Background
To study a physical system one has to define a set of parameters x = [x 1 , x 2 , … , x D ] ⊺ , whose values characterize the system. We assume that the number of parameters is D. Furthermore, one has to define a mathematical model g that allows for given values of the parameters to predict the results that actual measurements will provide for some observable parameters (or data) d = [d 1 , d 2 , … , d M ] ⊺ . In other words, we describe a physical system as: The above is also called the forward problem 2 .
In many cases, the model g is known, but it is not possible to obtain specific values for the parameters. This is a typical case in Earth sciences. In the Okada model (Sect. 4.1), a seismic fault is described by nine parameters, which are used to calculate the displacements that should be observed at a specific point on the Earth's surface due to an earthquake caused by the fault. However, it is practically impossible to measure the parameters, as the fault is typically deep underground. What is of interest in this Relationship between a forward and its corresponding inverse problem case, is whether one can determine the values of the parameters when displacements at several points on the Earth's surface have been measured. This is also called the inverse problem. Figure 1 presents the relationship between the forward and the inverse problem.
Generalizing, inverse theory can be defined as a set of mathematical techniques that help us obtain knowledge about the physical world, based on inferences drawn from observations. Typically, inverse theory is applied to observations that can be represented numerically. Hence, observations are measurements (the data) obtained during physical processes. The questions we want to answer are stated in terms of the numerical values (and statistics) of specific (but not necessarily directly measurable) properties of the world (the parameters).
Although there exist elegant, exact solutions for several inverse problems [10,11,21,24,31,38,65], in reality they are of limited applicability for a number of reasons. First, such exact inversion techniques are only applicable under idealistic situations that do not hold in practice [39]. Second, exact inversion techniques are often very unstable [13]. Lastly, but most importantly, in many inverse problems the parameters that one aims to determine are continuous functions of the parameters that correspond to spacial variables. This means that these parameters have infinitely many degrees of freedom. However, in a realistic experiment the amount of measurements that can be obtained to determine the parameters is finite. As a result, the acquired measurements cannot determine the parameters uniquely, meaning that there can be multiple parameters that explain the acquired measurements equally well [2,3,43].
This last point indicates that there is a possibility that the parameters obtained from the inversion of the data are not necessarily the true parameters that one is looking for. Another reason that can lead to the calculation of parameters other than the ones we are interested in is that measurements, although often regarded as errorless constants in the model [19,68], are in reality contaminated by unknown measurement errors v = [v 1 , v 2 , … , v M ] ⊺ . Therefore, measurements represent in fact stochastic variables = [ 1 , 2 , … , M ] ⊺ for which only an estimate can be determined [35], e.g., using some kind of appropriate measuring equipment. For this reason, Eq. (1) is rewritten as: The solution of Eq. (2) cannot be calculated exactly, but only using numerical search techniques [33,46,50]. Typically, these techniques start by creating a set that contains a number of trial solutions x k ( x k ∈ , k = 1, 2, … , P ). Depending on how is constructed and whether new trial solutions are added during the procedure leads to different techniques. Each x k is then provided as input to Eq. (1), which predicts a result d k (solving the forward problem for each x k ). Using an objective function that is devised to measure the discrepancy between measurements and theoretical predictions d k , the best solution x is selected. Having the estimated parameters x that best fit the data, it is now necessary to investigate the relation between the estimated and the true parameters. This is also called the appraisal problem. In the appraisal problem, one determines what properties of x are recovered by x and what errors are attached to it. Figure 2 presents the relationship between the forward, the inverse and the appraisal problem. It is important to acknowledge that taking into account (2) g(x) = + v errors and limited resolution of the selected parameters x k is essential to make a physical interpretation of the selected solution x [58].
Instead of finding a single solution x , the objective function can collect multiple solutions x k that satisfy the required criteria, creating a set ̃ of solutions. ̃ can be a subset or even a superset of , since new trial solutions can be added to during the search for the solutions. The set ̃ can then be further analyzed, as it might contain (possibly disconnected) ensembles (or clusters) of acceptable solutions, each one corresponding to a different class of solutions [46,50].

Numerical techniques to solve inverse problems
Numerical techniques require the definition of a parameter space. The parameter space corresponds to a D-dimensional space in which the solutions are likely to exist. A parameter space is defined in advance, using additional, independent information or even assuming reasonable, large uncertainty margins for each component of the unknown variable x . The ideal approach to solving an inverse problem would be to exhaustively consider all possible solutions in the parameter space, which is obviously not possible due to computational time constraints. Hence, the largest possible and most promising number of trial solutions should be considered. Several techniques exist on how to select these trial solutions.
An issue that arises by limiting the number of trial solutions is that the selected solution x might not be the best one in the parameter space and that one might get trapped in a local extremum and miss the global extremum in the parameter space.
From the many approaches for tackling inverse problems, we here focus on the so-called grid search method [45,51]. This approach has the advantage of being reliable, in the sense that it always finds the global extremum within the selected grid, as long as the objective function is smooth over the scale of the grid spacing, so that the extremum is not missed due to the grid spacing being too coarse. Furthermore, it is easier to perform uncertainty analysis, because the samples are uniformly distributed and produce an unbiased sample of the parameter space. On the other hand, grid search becomes quickly computationally intensive and eventually intractable due to the "curse of dimensionality." If the problem under consideration has a large number of dimensions D, making the grid sufficiently dense along all dimensions leads to exponential increase in the number of trial solutions. This is further intensified if the cost of forward modeling is high. The highest number of Fig. 2 A more precise representation of the inverse problem, as extended due to non-uniqueness of solutions and errors in measurements dimensions we discuss in this work is D = 18 , which already leads to a high computational load, but the dimensionality of the problem can reach even more than 100 [7].
To overcome the issue, irregular parameter space sampling methods were proposed. These range from pure Monte Carlo search, where trial solutions are selected completely randomly, to the generation of quasi-random sequences [1,53], simulated annealing [25], genetic algorithms and gradient-based methods. Although typically faster than grid search, these methods might get trapped in a local extremum and not find the best solution.

Grid search and modern parallel architectures
Modern computing systems are based on powerful multi-core CPUs and GPUs. Their capabilities surpass by many orders of magnitude the capabilities of systems that were available some decades ago, when grid search was first proposed to solve inverse problems. It is also typical for organizations to possess multiple such computing systems with different types of CPUs and GPUs, which can be connected through a network to build a heterogeneous cluster. This creates an infrastructure with considerable computing power. Under this view, it is interesting to investigate whether grid search can be used nowadays as a method to solve inverse problems within acceptable time-frames on such systems, even for problems with a high dimensionality.
At first sight, grid search seems to suit extremely well parallel architectures. For each grid point, forward modeling has to be performed, which is independent of forward modeling for other grid points. Some synchronization is only required to select the global best solution from the local best solutions that all processing elements have computed (or to add multiple acceptable solutions to the set of solutions). Hence, a static assignment of grid points to processing elements could be considered. There are however a number of parameters that perplex the problem. First, if one requires that all processing elements (CPU and GPU) on each node of a heterogeneous cluster should be used, a static assignment of the workload will lead to load-imbalance and reduced performance, as each processing element has different compute capabilities. Second, the problem of load-imbalance can be further intensified if there exist physical constraints in the inverse problem. It might be the case that combinations of values for specific parameters have no physical meaning. In this case, forward modeling should not be performed at all for these grid points. However, the number of such grid points that each processing element will receive is not known a priori. Therefore, a dynamic scheduling scheme that assigns grid points to processing elements during run-time is more adequate to better exploit the available computational infrastructure. Furthermore, decisions regarding workload assignment should take into account the actual performance of the forward model and the objective function on each processing element of the heterogeneous cluster. Therefore, the scheduling algorithm should take into account real-time measurements during execution and adapt the workload of each node and processing element according to their real speed for the actual inverse problem in use.
Apart from the performance oriented parameters that must be taken into account during design and implementation of the framework and the dynamic workload scheduler, other issues are of equal importance. Users of the framework might be scientists from domains other than Computer Science. The framework should therefore be easy to use, but also provide acceptable performance on the clusters used without requiring from the user complicated configuration of parameters that control the behavior of the framework. On the other hand, if the user is determined to optimize performance further, the framework should expose and allow adaptation of its parameters. Finally, it should allow easy integration of additional inverse problems, hiding the intricacies of the framework from the user and allowing him/her to concentrate on the inverse problem being solved.

A computing framework for grid searches
In this section, we present the design and implementation of a parallel computing framework for regular grid searches of inverse problems on heterogeneous clusters. Before delving into the details of the framework, we present an example of a typical inverse problem in order to elucidate the specific design decisions for the framework. As mentioned in Sect. 1, an inverse problem can be formulated to find characteristics of a seismic fault using measurements taken at several seismic stations that are equipped with seismographs. When an earthquake happens, seismic stations send their data to a central station and a first, quick estimation of the epicenter of the earthquake can be made. Having a first estimate of the area where the earthquake happened, an inverse problem to further localize the incident is formulated. The forward model requires coordinates of where the earthquake happened and the time between when the earthquake happened and when seismic waves reach a seismic station. Using this information, the forward model computes the waveform that a seismic station should record. Therefore, a four-dimensional spatiotemporal search space has to be defined. The spatial dimensions of the search space are defined to surround the epicenter. For example, the search space could be defined to be 200 km north up to 200 km south of the epicenter, 200 km west up to 200 km east of the epicenter and up to 20 km toward the center of the Earth. Computationally, it is not possible to feed all data points of the search space to the forward model. Therefore, data points from the search space are sampled at constant intervals, for example every 1 km. Similarly, a time interval, e.g., 2 s, is defined and discretized, e.g., per 10ms, for the temporal dimension of the grid. In this example, the grid consists of ≈ 160 million grid points, with each one used as input to the forward model. The trial solution that produces waveforms closer to the ones recorded in the seismographs of the seismic stations is used to identify where the earthquake happened. An objective function must be defined to measure the discrepancy of the waveforms produced per trial solution from the waveforms recorded. Figure 3 shows a real example of applying the approach 3 .   Figure 4 presents an overview of the framework and will be used throughout the following discussion. In order to define the parameter space and the uniform grid of trial solutions, the user of the framework only needs to provide a triplet

Parameter space and trial solutions representation
. The first two values correspond to the minimum and maximum value that can be assigned to parameter x d in the context of the inverse problem being solved. Combined for all parameters, these values define the parameter space of the inverse problem (Fig. 4a). The third value defines the number of grid points that should be created for parameter (Fig. 4b). Therefore, trial solutions are sampled at the following values per parameter x d : This approach makes the definition of the parameter space easy and intuitive to the user, allowing at the same time the framework to create the search grid internally, without further intervention from the user.
To further simplify how the framework internally handles trial solutions, we notice that in order to describe a trial solution, only a vector I = [i 1 , i 2 , ..., i D ] of integer indices is required. If I is given, the framework can identify the trial solution by applying Eq. (3) on each dimension. However, we can go one step further and convert vector I to a single integer value, thus even further simplifying handling of trial solutions. We exploit the same idea that is used in many programming languages to map a multi-dimensional array to a one-dimensional array for storing it in memory (Fig. 4c). The following equation maps the indices vector to a unique linear index: Linearization of the indices of trial solutions both simplifies assignment of trial solutions to compute units and leads to reduced exchange of information in the context of the framework. Assume that s trial solutions (with linear indices 0, 1, … , s − 1 ) have already been distributed for forward modeling. If the framework decides that the next batch of trial solutions to be distributed must have a size of n, then it needs to communicate only two integer values to the compute unit that will perform forward modeling for these trial solutions: The first linear index s and the size n of the batch (Fig. 4d). Using simple division and modulus operations, vector I can be reconstructed for each trial solution in the above range, and then from Eq. (3), the trial solution to be used for forward modeling can be easily obtained. Linearization is completely hidden from the user.

Forward model and objective function representation
In the context of the proposed framework, the forward model is implemented as a function that accepts two parameters. The first parameter is a vector that contains the values of the parameters of a single trial solution x k . The second parameter is a pointer that can be used to pass any additional information that the inverse problem requires for forward modeling. The function must return as a result the prediction d k . It is important to ensure that the function that implements forward modeling can be applied independently for each trial solution. Finally, the function must be implemented for both, the CPU and the GPU. Although this leads to some duplication of code, it also allows the user to take advantage of platform-specific optimizations that cannot be applied by the framework itself.
One of the main duties of the framework is to distribute the workload among processing elements of the cluster. This is performed by linearizing the indices of trial solutions and assigning batches of continuous linearized indices to processing elements (see Sect. 3.1). Exposing the batches of indices to the user is not desirable. It would require from the user to write code to iterate over every linearized index in the batch, reconstruct vector I for each index, use Eq. (3) to obtain the values of the parameters of each trial solution and finally call the forward modeling function for each trial solution. To alleviate the user from this burden, the framework employs a wrapper function that performs the aforementioned operations, completely hiding the associated complexity from the user (Fig. 4e).
The objective function accepts one parameter, which is the prediction d k computed by the forward modeling function for a trial solution x k . The function must return true if the prediction satisfies the criteria set by the user or false otherwise. Calling of the objective function happens within the framework-internal wrapper function that also calls the forward modeling function. If the return value is true, the framework adds the trial solution to the set of accepted solutions (Fig. 4e). The framework, however, supports an additional functionality. If no objective function is provided by the user, the predictions of all trial solutions are collected and saved. This allows more advanced post-processing of the results, if required by the user. As with the forward modeling function, the objective function must be implemented for both the CPU and the GPU.

Virtual network structure
A key feature of the framework is its ability to effectively utilize heterogeneous clusters. The framework can be executed on a network of nodes, with each node containing one or more multi-core CPUs and one or more GPUs. In order to manage the resources that participate in the computation, the framework implements a virtual network that follows a two-layer coordinator/worker structure. The overall architecture of the network is presented in Fig. 5.
The first layer is the inter-node network, which manages the nodes on which the computation is executed. This layer is implemented with MPI. On one node, a coordinator process is created, whereas on the remaining nodes, worker processes are 1 3 deployed. The second layer is the intra-node network. This layer is implemented with OpenMP. It is deployed per worker process of the first layer, and it manages the processing elements of the node. Within each worker process, a coordinator thread, a CPU handler thread and GPU handler threads are created. To further clarify the interaction between components, we include Algorithms 1-4 in Appendix A, with the first two algorithms corresponding to the coordinator and worker processes and the last two algorithms corresponding to the coordinator and worker threads. Algorithm 2 presents the creation of the aforementioned threads in each worker process and the initiation of execution of the coordinator thread, which afterward acts as an intermediate between the coordinator process and the worker threads of the node.

First level communication and work distribution
Distribution of work at the first level of the virtual network is achieved through communication between the coordinator process and the coordinator thread of each worker process. A dynamic, on-demand batch assignment scheme is followed. The coordinator process repeatedly waits for requests from the coordinator threads. As soon as a request has been received, it determines the rank of the worker process that sent the request and the type of the request (Algorithm 1, Line 27). Depending on the type of the request, the coordinator process either assigns work to the requesting worker process, or waits to receive results from the worker process, or simply notes that the worker process is terminating (Algorithm 1, Lines 32, 34-38 and 41, respectively). The coordinator thread repeatedly requests work from the coordinator process, completes it and returns the results to the coordinator process, until there is no more work to be performed (Algorithm 3, Lines 22, 46-50 and 27, respectively). The work that the coordinator process assigns to worker processes is simply a batch of the linearized indices of trial solutions. The size of the batch is The framework builds a two-level, virtual network on top of the cluster. Coordinator/worker processes are employed to distribute work among nodes. Coordinator/worker threads are employed to further distribute work among processing elements within the node automatically determined by our dynamic workload scheduler and depends on the real-time performance measurements obtained for each node (Algorithm 1, Line 30. Sect. 3.4 provides details regarding the scheduler). A time stamp is recorded before sending a batch of work and immediately after receiving results from a worker process (Algorithm 1, Lines 31 and 39, respectively). Thus, the coordinator process is able to track the amount of time each worker process requires for a specific amount of work assigned to it. This information is used in our dynamic workload scheduler to determine the size of the next batch that should be assigned to the worker process. Furthermore, the coordinator process keeps internal information so that it can arrange the returned results according to the order of the trial solutions. For the sake of clarity, we have not included this in the algorithms.
A concern regarding the first layer of the virtual network is its centralized nature. There is a single coordinator process that handles multiple worker processes. Using the framework on larger systems might turn the coordinator process into a bottleneck. As we show, however, in Sect. 5.2, the framework behaves well even for larger numbers of worker processes. This can be attributed to the minimal data exchanged when sending requests, assigning work and receiving results (typically, there are only a few tenths of solutions). Furthermore, we will show that worker processes finish their assigned work at different points in time, hence avoiding to a large degree conflicts in communication.

Second-level communication and work distribution
After receiving work from the coordinator process, the coordinator thread distributes the work among the processing elements of the node, gathers results and propagates them back to the coordinator process. Distribution of work among processing elements of a worker is achieved through the CPU and GPU handler threads. A single CPU handler thread is created to manage all CPUs (and their subsequent cores) of a node, whereas a separate GPU handler thread is created for each GPU of a node. These handler threads cooperate with the coordinator thread in a similar manner to how worker processes cooperate with the coordinator process at the first level of the virtual network. Each time the coordinator thread receives a batch of work, it initiates execution of the worker threads and waits for them to finish their computations (Algorithm 3, Lines [33][34][35][36][37][38][39][40][41][42][43][44]. Similarly to the coordinator process, the coordinator thread records a time stamp per worker thread (Algorithm 3, Line 35). After worker threads complete their computations, the coordinator thread propagates the results of the node to the coordinator process (Algorithm 3, Lines 46-50).
Worker threads repeatedly request work from the coordinator thread, complete it and return results to the coordinator thread, until there is no more work to be performed (Algorithm 4, Lines 13, 29-40 and 15, respectively). The work assigned to a worker thread is a sub-batch of the batch that the coordinator thread received from the coordinator process. Before exiting, each worker thread records a time stamp to indicate when it finished its assigned work (Algorithm 4, Line 43). The same dynamic workload scheduler as in the first layer is used to distribute work among worker threads, taking into account the corresponding timing measurements. As denoted in Fig. 5, if the coordinator node has more available processing elements, a second process can be created that acts as a worker process and exploits processing elements that would otherwise remain unused in that node. In contrast to the first level of the virtual network, contention for work assignment in the second level is expected to be low in all cases. Even if a node has 4 or 8 GPUs, the total number of threads competing to receive work from the coordinator thread will be small (Algorithm 4, Line 13), since there is a single CPU handler thread and a GPU handler thread per GPU. This is a rather low number of threads, and no severe overheads are expected.

Execution of forward model and objective function on the CPU and the GPU
The CPU handler thread distributes the work it receives from the coordinator thread amongst the available physical processing cores, which execute the user-provided forward modeling function and (if required) the objective function for each trial solution assigned to them. Since forward modeling of different trial solutions might require different execution times, the framework uses Chunk Self-Scheduling to dynamically assign work to the available CPU cores. If the user can guarantee that the physical cores are identical and the processing load of each trial solution is constant, the framework can be configured to statically assign work to all available cores in order to alleviate the overhead of one more layer of dynamic scheduling.
Each GPU handler thread initiates execution of a framework-internal CUDA kernel, after receiving work from the coordinator thread. This kernel calls the forward modeling function and (if required) the objective function for each trial solution assigned to the GPU. If the number of trial solutions is less than the number of CUDA cores on the GPU, the number of CUDA threads launched equals the number of trial solutions and each CUDA thread calculates results for one trial solution. Otherwise, the number of CUDA threads launched equals the number of CUDA cores and the framework-internal CUDA kernel calls the user provided device functions for multiple trial solutions on each CUDA core, thus reducing overhead. Assignment of trial solutions to CUDA cores is performed statically in this case, as dynamic scheduling on the GPU is expected to have a high overhead. Parallel execution of the forward modeling and the objective function for the CPU and the GPU is represented in Algorithm 4, Lines 29-40. For the sake of clarity, we have not included the call to the framework-internal CUDA kernel in the case of execution on the GPU.
To further optimize performance when executing on the GPU, we took advantage of the different memory levels available. Data regarding the representation of the grid, as well as additional data passed to the forward modeling function (see Sect. 3.2) are stored in constant memory. Typically, these are less than 1Kb of data. Solutions are stored in the global memory. Typically, the are only a few tenths of solutions and accessing global memory for such a limited number of times does not induce significant overhead. A second reason that solutions are stored in global memory is for the special case when the user requests for all solutions to be stored. In this case, a large amount of memory is required, which is available only in the global memory. Shared memory is used for storing information required by the forward modeling and objective functions. For each thread executing on the GPU, this includes a vector containing the indices that represent each trial solution and a second vector that represents the coordinates of each trial solution within the grid (see Eq. (3) and Algorithm 4, Lines 30-31). Although using shared memory in this case allows fast access to frequently used data, it also limits the number of threads than can comprise a block of threads. Furthermore, the larger the dimensionality of the problem, the larger the vectors and the fewer the number of threads that can be created and execute concurrently. Some aspects regarding the usage of shared memory for the trial solutions and the impact on performance are discussed in Sect. 5. From the above discussion, it is clear that the amount of data required to be transferred between the host and the device is in the typical case very limited, with the exception of the case where the user requests all solutions to be stored. Most data are transferred during initialization of the framework (data regarding the representation of the grid and additional data passed to the forward modeling function) and at the end of execution (copying results from the device to the host). During execution, only the information regarding the batch assigned to a GPU (first linearized index and size of batch) must be transferred to the GPU.

Hierarchical dynamic workload scheduling
The proposed framework has been designed to support heterogeneous clusters including processing elements with differing compute capabilities. Furthermore, the framework is agnostic to the forward model, which might have a different computational cost for different trial solutions. To this end, the framework incorporates a dynamic workload scheduling scheme, taking into account both sources of work imbalance. In addition, the framework is easy to use, requiring the least intervention from the user in order to achieve acceptable performance. If required, experts might fine-tune the system in order to further improve its performance.
Given the above requirements, we developed the Hierarchical Real-time Dynamic Loop Scheduling (HRDLS) scheme. We present the main functions of HRDLS in Algorithm 5 and discuss their details in this section. HRDLS is based on Heuristic parallel loop scheduling (HPLS) [67], which takes into account the performance of each node. HPLS uses a performance function to calculate a performance ratio for each node, which indicates the performance of the node in relation to all other nodes. The workload assigned to each node is calculated according to this ratio. The performance function in HPLS uses execution times that are collected a priori by executing a specific benchmark on each node, and the performance ratio PF i of node i is calculated as: where t i is the execution time of the benchmark on node i and S is the set of nodes. This approach has the advantage that a heavier workload (a larger batch of trial solutions in our case) is assigned to nodes that perform better. On the other hand, it is vulnerable to changes in the composition of the cluster. Adding or removing a compute unit would require new benchmarks and the recalculation of the performance ratios for all nodes. Furthermore, the benchmark selected for calculating the performance ratios might not be representative of the forward modeling functions that the user of the framework will actually write and use. As a result, HPLS (and any other scheduling scheme that determines statically the performance of each node) might not be the best choice.
HRDLS tackles this issue using a performance function that is based on continuous real-time measurements of the actual forward model being used at the time, instead of pre-calculated data. Furthermore, it takes a hierarchical approach with respect to the underlying hardware architecture. In the context of a cluster and the virtual network structure of the framework (Sect. 3.3), HRDLS handles separately the first and second layers of the network. More specifically, an instance of the HRDLS scheduler runs on the coordinator process and, in addition, an instance runs on each worker process. The instance running on the coordinator process maintains information regarding all nodes of the cluster, whereas instances running on worker processes maintain information regarding the processing elements of the node they are running on.
In the heart of HRDLS lies the performance function: where N j is the size of the last batch that was assigned to a compute unit j, T j is the time it took for compute unit j to calculate the results for this batch and S is the set of compute units in the network layer where HRDLS is being applied. In the context of the framework, S are the nodes of the cluster when HRDLS is applied at the first level of the virtual network and the processing elements of a single node when HRDLS is applied at the second level of the virtual network. The purpose of the HRDLS scheduler is twofold. First, when a work assignment request is made, the instance of the scheduler at the corresponding level decides on the amount of work to be sent to the requesting compute unit (Algorithm 5, function getNextBatch()). Second, to make this decision, the scheduler must maintain the performance ratio of each compute unit and update these each time a compute unit finishes a batch of assigned work (Algorithm 5, function recalcRatios()). Equation (6) is used to calculate the performance ratio based on the real-time measurements of T j . Execution times T j are measured by the previous layer of the virtual network, i.e., execution times for nodes of the first layer are measured by the coordinator process and execution times for a processing element are measured by the coordinator thread of the corresponding node, as described throughout Sect. 3.3 and in Algorithms 1-4. This allows the framework to take into account not only the raw performance of a compute unit, but also more real-time factors that affect execution time, such as data transfers, network delays, synchronization and overhead of work balancing within a node.
Each time processing of a batch is completed, the function recalcRatios() is called at the appropriate level of the virtual network to update the performance ratios with the new timing information acquired for the compute unit that completed its batch (Algorithm 1, Line 40 for the first level of the virtual network and Algorithm 3, Line 43 for the second level of the virtual network). It is straightforward to show that the ratios of all nodes always add up to 1 and the size of the next batch assigned to a compute unit is easily calculated by simply multiplying a predetermined batch size of trial solutions b with the performance ratio (Algorithm 5, Line 21). A drawback of HRDLS is that when execution of the framework starts, there is no data available to calculate performance ratios. Therefore, the question arises what the batch size assigned to a compute unit will be, until execution time measurements are available. One solution would be to use a constant batch size. This, however, can lead to other problems. If the batch size is too small, the measured execution time for the batch might be dominated by factors other than raw performance, like data transfers and other network delays. If the batch size is too large, it might take too much time to report execution time for the batch, if it happens that the batch is executed on a slow compute unit. In both cases, the calculation of an appropriate batch size for the next assignments of work in HRDLS will not be optimum and it is possible that several rounds of calculations are required until a good batch size for each node is found. In order to overcome this issue, we implemented a temporary limitation of batch sizes for the first assignments. The batch size per compute unit i is initially set to a constant c, which is deliberately chosen to be small. After each assignment of a batch to the compute unit, c is doubled, leading to an exponential increase. Apart from c, the user can also set a custom limit s, regarding the number of batch assignments to compute units during this initial stage. This allows the exponential increase to continue even after there are enough execution time measurements from all compute units, which allows HRDLS to adapt faster in the lower hierarchical levels of the virtual network. The upper limit L i,k of the batch size for the k-th iteration of compute unit i is then calculated as: Each compute unit carries its own iteration counter k i . Equation (7) summarizes how the batch size N i,k for compute unit i is finally calculated (Algorithm 5, Lines 24-26): One further variation of HRDLS that one could experiment with is to use average ratios instead of always using the latest one. This can be effective when dealing with highly irregular computational demands for each trial solution, i.e., some trial solutions might be much faster to calculate than others. A bad scenario would be for a slow node to take a large batch of work that contains only trial solutions that are fast to calculate. This would lead to a low execution time and hence a higher ratio for the node, resulting in a larger batch of trial solutions in the next iteration. The next assignment could contain trial solutions that are more computationally demanding, causing the slow node to be overwhelmed and take an extreme amount of time to finish, leaving other nodes idle. This can be especially destructive if it happens during the last assignments when there are no other trial solutions to be scheduled on the remaining nodes. To counter this, HRDLS can be set to calculate the scores as , so that a small number of outlier measurements is less likely to trigger the problem. HRDLS can also be set to ignore time measurements that are below a given threshold t min , assuming these correspond to executions that were so fast that most of the time was spent in overhead operations (Algorithm 5, Line 48). Table 1 summarizes the parameters taken into account during design of the framework, as presented throughout this section. A short description that presents the relevant action taken to address each issue is included, so as to provide a global overview to the reader.

Physical problems used for the evaluation of the framework
To evaluate the performance of our framework, we focus on two problems, both based on highly nonlinear, overdetermined systems of equations and observations of ground deformation, i.e., displacements measured by GPS on the ground surface. The first problem describes ground deformation caused by seismic faults at depth. The second problem describes ground deformation caused by accumulation of magma in magma chambers of volcanoes.

Ground deformation due to seismic faults
Seismic faults are approximated as rectangular surfaces that exhibit a uniform slip at the depth of the lithosphere and are defined by nine variables (see Fig. 6). The Okada equations [41] (presented in Appendix B) relate these nine characteristics of a fault and its motion to the ground deformations that should be observed due to the motion of the fault. Hence, the Okada equations define a system of equations with D = 9 unknowns. As previously described, however, the opposite problem is of importance. Displacements on M points on the ground surface are known and the fault and its motion described by x = {E C , N C , L, W, d, , , , U} must be determined. It is also common that two or more seismic faults contribute to the ground deformation observed and that the characteristics of all of them must be determined. In this case, x contains nine unknowns for each fault. Taking advantage of the significant runtime improvements made possible with our framework implementation, we also address these more advanced cases by experimenting with a two faults Okada problem, corresponding to a system of equations with D = 18 unknowns.
An issue that complicates the distribution of workload among all available compute units on a system in the case of the Okada equations is the fact that there exist physical constraints that lead to load imbalance in calculations per grid point. The Okada model is valid under the assumption that the ratio of the length and width of a fault lies within a specific range. If the constraint does not hold for a trial solution, no further calculations have to be performed for it. Since the input grid is provided during run-time, the exclusion of the grid points that do not satisfy the constraint leads to work imbalance that cannot be predicted. However, as will become evident in Sect. 5, our framework manages to cope with this additional difficulty.

Ground deformation due to magma chambers
Magma chambers are approximated as spheres and are defined by four variables: coordinates x, y and depth d of the center of the sphere and volumetric change ΔV . The Mogi equations [36] (presented in Appendix C) relate these four characteristics of a magma chamber to the ground deformations that should be observed due to the change in the volume of the chamber as it accumulates magma. Hence, a system of equations with D = 4 unknowns is defined. As with the seismic fault case, we also examine a problem in which two magma sources contribute toward the observed ground deformations and D = 8 unknowns have to be determined.
As with the case of ground deformation due to seismic faults, physical constraints exist that lead to load imbalance in calculations per grid point. More specifically, the model is valid if the radius of the magma chamber is small relative to its depth. If the ratio of these variables is above a manually set threshold for a trial solution, no further calculations are performed for it.

Experimental evaluation
As presented throughout Sect. 3, the framework depends on a number of parameters that determine its operation, like b, c, s and t min . Exposing these parameters to the user provides flexibility, but also requires careful adjustment in order to take full advantage of all available resources. In the context of a heterogeneous cluster, it might be even necessary to adjust these parameters independently for each node. Apart from the aforementioned parameters, which are of importance for HRDLS during intra-and inter-node workload scheduling, the user can also indicate which resources of a worker node will be used ("CPU", "GPU" or "Both") and define CPU-and GPU-specific parameters.
In the context of a single node, a single CPU handler thread controls all CPUs and/or cores of a node. Further refinements such as the use of static or dynamic scheduling of trial solutions on CPU cores can be made. In the latter case, the batch size of trial solutions must be provided. Finally, if GPUs are available, it is possible to set the number of trial solutions assigned per GPU thread, the block size and the number of GPU streams to be used. Sensible default values for each parameter are set in the context of the framework, but high resource utilization requires further fine-tuning. In the rest of this paper, we will refer to a set of values for all parameters as a configuration.
In the sequel of this section, we vary several of these parameters and present results regarding performance of the framework. Numerous more results have been gathered for different combinations of values for all parameters, but are not presented due to space limits. Using these results, decisions were made regarding the default value per parameter. These default values lead to acceptable performance in each case, relieving the user from fine-tuning the framework manually, except if the user decides so.
For comparison reasons, the framework can be configured to not use HRDLS for workload distribution. If HRDLS is disabled at the intra-node level, trial solutions are equally distributed among processing elements of the node. If HRDLS is disabled at the inter-node level, Chunk Self-Scheduling is used and only a chunk size needs to be provided.
In order to evaluate our framework, we employed three systems: • We compared our framework against another application that performs grid search for inverse problems [63,64]. This application specifically targets systems with only a single GPU and therefore our framework significantly extends it. The aforementioned application does not linearize the parameter space, but requires incorporation of the required number of nested loops (that correspond to the D dimensions of the problem) into the forward model. Although the implementation of the corresponding computational kernel becomes more complicated, this approach allows advanced optimizations for fast execution on the GPU, like dynamic loop interchange during run-time to match the total number of iterations of nested loops to the number of CUDA cores of the utilized GPU [64]. The application includes implementations of the Okada and Mogi models as described in Sect. 4, and we will refer to it as the Baseline version. Since the Baseline version targets only a single GPU, a direct comparison with the framework presented in this paper can be performed only for this architecture. Therefore, we have used an incremental approach for the evaluation of the framework and we present results for the following cases: 1. We perform a performance evaluation of the framework and of the Baseline version for the Okada and Mogi models using only the GPU on System A. This allows a direct comparison between the two implementations. Based on the con-1 3 clusions drawn from this comparison, we fine-tune the framework for execution on the GPU, with the purpose of reaching at least the performance of the Baseline version. 2. After ensuring that the performance of the framework on the GPU is optimized, we setup the framework to only use all CPU cores on System A and fine-tune to fully utilize the available CPU resources. 3. After fine-tuning for all CPU cores and the GPU separately for System A, we configure the framework to use all of them simultaneously, in order to test and fine-tune the utilization of multiple resources within the same system. Dynamic self-scheduling with HRDLS is exploited for the first time at this point. Fine-tuning is performed to match the theoretical lowest execution times, which are calculated based on the measurements from the two previous points (see Sect. 5.1).
The same process is also performed separately for System B and System C. 4. Finally, we setup our framework to use a heterogeneous cluster that is comprised of System B and System C. We further experiment with two sub-configurations: (a) use all GPUs on both systems and (b) use all CPU cores and GPUs on both systems. In both sub-configurations, System B runs the coordinator process, while a worker process is also run on the node. Finally, the same fine-tuning process is followed to tune all inter-node parameters that were not being used in the previous setups. While this mainly includes parameters for HRDLS in the inter-node network, other operations such as data distribution, data collection and MPI idling features are also tested and evaluated. Fine-tuning is again performed until the theoretical lowest execution times are approximated (see Sect. 5.1).

Performance metrics
The fine-tuning process is based on measured execution times. To adequately assess the performance of the framework, we evaluated it using the Okada and Mogi models. For each model, 1 or 2 faults or magma chambers were assumed, and in each case, 3 to 4 different grids of trial solutions were used, leading to a total of 15 test cases. For the rest of this paper, we will refer to each model/search grid combination as a search instance. Considering the range of user-adjustable parameters, the number of trial solutions in a grid and a systems' characteristics, it is expected that a search instance might perform better on one configuration, while another search instance might perform better on a different configuration. Due to the large number of combinations of search instances and configurations, analysis of the results is challenging. Therefore, we opted toward a performance indicator that is a single number. More specifically, we used the sum of relative delays over all search instances for a specific configuration, compared to a target configuration: where i is an index over all search instances. A negative value indicates better performance compared to the target. In short, the relative delay shows how multiple search instances behave with respect to a single configuration. If more search instances behave better with a specific configuration, compared to the same search instances on the target configuration, the relative delay will have a smaller value. The relative delay can be calculated over all search instances in order to get a general performance indicator for a specific configuration, but it can also be calculated separately over search instances of each model, in order to get more detailed information regarding performance. A drawback of using the Baseline version to compare against is the fact that it does not support using all processing elements of a node. Therefore, to evaluate configurations where only CPU cores or CPU cores and GPUs are used, we calculate the theoretical lowest execution times. These are the execution times that could be achieved according to the measured execution times from the fine-tuning process. Let t c and t g be the best execution times obtained during fine-tuning of a search instance using only CPU cores and only GPUs, respectively. Let n be the total number of trial solutions in that search instance. Assuming that t c and t g correspond to the best possible utilization of the hardware, we calculate the theoretical execution time t cg that could be achieved if we perfectly distributed the workload among the CPU cores and the GPUs, and introduced no extra computational overhead. If T = work time is the throughput at which each resource processes trial solutions, the throughput T cg of using both resources is the sum of the throughputs of each individual processing element, i.e., T cg = T c + T g . Since T c = n∕t c , T g = n∕t g and T cg = n∕t cg solutions/sec: The embarrassingly parallel nature of the problem (forward modeling of a trial solution is independent from forward modeling of all other trial solutions) and the fact that asynchronous execution on the CPU and the GPUs is exploited in the framework, makes using the separate execution times on the CPU and the GPU a prediction model of the combined execution time that provides sufficient accuracy for our purposes. Equation (9) can then be used as a rough lower limit for the execution time when both, CPU cores and GPUs of the system are used. Then, we can finetune the framework targeting this execution time. Equation (9) can also be applied for any combination of configurations. For example, after fine-tuning two systems using both their CPUs and GPUs, it is possible to calculate the theoretical lowest execution time for the combination of the two systems in order to test the workload distribution within a network of physical systems.

Evaluation results
In this section, we present the results of our performance evaluation. In the figures that follow, the relative delay over all physical problems for a specific configuration is presented, except if otherwise mentioned. Using this approach, we present the expected average behavior of the framework for different problems, with different dimensionalities and characteristics. More details regarding performance of each individual physical problem can be found in Appendix E. Figure 7 presents the relative delay of the framework with respect to the Baseline version (executed on System A), when only the GPU is used for calculations on each of the three systems. This corresponds to the first case of our incremental approach for the performance evaluation, as described earlier. Results are presented when different numbers of trial solutions are assigned to each thread on the GPU. The graph is created from the data contained in Table 4, where further details regarding performance of each physical problem can be found. As already mentioned, fine-tuning has been performed on the GPU of System A, which includes defining the number of streams (with 8 streams performing best) and the number of threads in a block. Due to these optimizations, the framework indeed performs better or equally wellcompared to the Baseline version on System A. The same observation holds for System B, with the exception of assigning 20000 trial solution to each GPU thread. The reduced performance observed in this case is due to the 1-chamber Mogi problem, which has a relative delay of 22.17 (see Table 4). This, in turn, is caused by the relatively small grid used, which includes about 10 7 trial solutions. Hence, only about 10 7 ∕20000 = 500 threads are running, whereas the total number of Streaming Processors (SPs or GPU cores) on the V100 is 5120. The framework performs better than the Baseline version on System B for all other physical problems where grids have more trial solutions. Systems A and C exhibit similar behavior for the 1-chamber Mogi problem, but are affected less, since they have 2880 and 768 GPU cores, respectively. These results show that in most cases the framework can compete with the highly optimized Baseline version, when only GPUs are used.

Evaluation when using only GPUs
On the other hand, System C exhibits worse behavior in all cases. In order to further analyze this behavior, we include Fig. 8, which presents separately the relative delay for each physical problem. These results indicate that reduced performance on System C is caused mainly by the 1-fault Okada case (see Table 4). We tracked the Fig. 7 Relative delay of the framework with respect to the Baseline version (executed on System A), when only the GPU is used for calculations on each system issue down to the architectural differences of the GPU of System C, compared to the GPUs of the other two systems. The GPU of System A, that was used to fine-tune the framework, is based on the "Pascal" architecture, whereas the GPU of System B is based on the "Volta" architecture. On the other hand, the GPU of System C is based on the "Kepler" architecture. A combination of architectural features that are common in "Pascal" and "Volta", but different in "Kepler" explain the behavior of the Okada model on System C. The first factor that contributes to the lower performance on System C is a combination of the number of Streaming Processors (SPs) per Streaming Multiprocessor (SM) and the size of the shared memory per SM on "Kepler", which are 192 SPs and 48KB per SM, respectively. Although the user of the framework can define the size of a block of threads, the size of the block might be further limited by the framework, depending on the size of the shared memory available per SM. Shared memory is used in the framework to store temporary internal information that is critical to the computation, and that would otherwise be stored on the device's global memory causing significant delays. The larger the dimensionality of a problem, the more memory is required per solution, leading to less threads in a block. To make the case clear, we use an arithmetic example, which corresponds to the real data for the 1-fault and 2-fault Okada problems. The framework determines the block size for a problem with D dimensions (2-fault Okada) as: Size of available shared memory∕Memory required per solution = 384.8 , which is truncated to 384 = 2 ⋅ 192 . In this case, the GPU is utilized extremely well, as the block size is a multiple of the number of SPs in an SM. If another problem with D/2 dimensions is used (1-fault Okada), then Size of available shared memory∕Memory required per solution = 769.6 , which is truncated to 769 = 4 ⋅ 192 + 1 . In this case, an additional thread must execute, utilizing only a single SP out of the 192 SPs of the SM while it is executing. On contrary, the other two GPUs have 64 SPs and 96KB of shared memory per SM, which significantly reduces the effect of this issue. This observation points toward possible Fig. 8 Relative delay of the framework with respect to the Baseline version (executed on System A) for each physical problem, when only the GPU is used for calculations on System C improvements that could be applied in the future in the fine-tuning process of the framework. For example, a different configuration could be applied, depending on the GPU architecture detected.
The second architectural feature that contributes toward lower performance on System C is the number of available Special Function Units (SFUs) per SM. In "Kepler," there is 1 SFU per 6 SPs, whereas in the other two architectures, there is 1 SFU per 4 SPs. The Okada model relies much more on trigonometric, logarithmic and square root functions, which are handled in the SFUs, compared to the Mogi model. Hence, the relative delay of the Mogi model is lower than the Okada model, at least for lower numbers of trial solutions per GPU thread. As more trial solutions are assigned to each GPU thread, the 1-chamber Mogi model exploits less SPs, as explained previously, and its performance drops. The 2-chambers Mogi model is the only one exhibiting good performance on System C, as it is not affected by any of the previously mentioned issues. Figure 9 presents the relative delay of the framework with respect to the Baseline version (executed on the GPU of System A), when only the CPU is used for calculations on each of the three systems. This corresponds to the second case of our incremental approach for the performance evaluation. Results are presented for different batch sizes used by the framework's dynamic CPU scheduler (see Sect. 3.3). The graph is created from the data contained in Table 5, where further details regarding performance of each physical problem can be found. As expected, performance on the CPU is lower, even compared to the least powerful GPU of the three systems. Performance on System B is 10-12 times higher on Fig. 9 Relative delay of the framework with respect to the Baseline version (executed on System A), when only the CPU is used for calculations on each system the CPU, compared to the other two systems. This is due to the fact that System B is equipped with processors that are based on a newer architecture and because it incorporates a total of 24 cores, which is 6 times higher compared to the other two systems. Figure 9 proves the large discrepancy among performance of different processing elements used. Furthermore, it shows that smaller batch sizes provide better performance, highlighting the need for dynamic scheduling for the CPU cores due to the highly irregular computational demands of the trial solutions. Fig. 10 Relative delay of the framework when CPUs and the GPU are used simultaneously. Results are presented for different batch sizes and the last ratio calculated is used to decide the next assignment Fig. 11 Relative delay of the framework when CPUs and the GPU are used simultaneously. Results are presented for different batch sizes and the average ratio calculated is used to decide the next assignment 1 3 Figure 10, 12, 13 present the relative delay of the framework on each system using both CPU and GPU, with respect to the theoretical lowest execution times based on the respective system's best CPU and GPU configurations. This corresponds to the third case of our incremental approach for the performance evaluation. Results are presented for different combinations of b values and ratio recalculation strategies (last or average) for the intra-node HRDLS. For the b values, we have two sets of configurations; the first set contains a series of values that are common for all search instances within a configuration (Fig. 10, 11), while in the second set the values are Fig. 12 Relative delay of the framework when CPUs and the GPU are used simultaneously. Batch sizes used are a fraction of the total number of Trial Solutions (TS) and the last ratio calculated is used to decide the next assignment Fig. 13 Relative delay of the framework when CPUs and the GPU are used simultaneously. Batch sizes used are a fraction of the total number of Trial Solutions (TS) and the average ratio calculated is used to decide the next assignment fractions of the trial solution grid's size of each search instance, thus calculated at run-time (Fig. 12, 13). The graphs are created from the data contained in Table 6 and Table 7, where further details regarding performance of each physical problem can be found.

Evaluation when using CPUs and GPUs on each system
In the first set of graphs, we can see the effect that the b value has in HRDLS. It is obvious that the optimal value depends on the system that is used, and looking at the raw results, we can see that it is also affected by the model itself. However, a value of 10 8 seems to provide the optimal (or near optimal) results in all cases. We would like to mention that we have tested further cases with smaller batch sizes. These are not included in the graphs, as performance deteriorated quickly. This leads to high values of the relative delay (up to 1500 in some cases), which made it difficult to show the differences for the remaining cases. The second set of graphs shows significantly better performance when the b value is calculated as a fraction of the total number of trial solutions on the corresponding grid. The improvement in performance is a result of better behavior of the problems of smaller dimensionality, especially when grids with lower numbers of trial solutions are used. Many grids in these cases have a total number of trial solutions between 10 5 and 10 7 . When using batches of constant size, the size is relatively large compared to the number of trial solutions and only few batches are available, leading to loss in performance. Using fractions of the total number of trial solutions creates a larger number of batches, allowing better load balancing, thus leading to better performance. For the problems of higher dimensionality, even the smaller grids have a larger number of trial solutions and workload imbalance is significantly lower. An interesting point is that the recalculation strategy (last or average) does not seem to make a difference with respect to performance. HRDLS assigns workload according to the time required by each compute unit to complete a specific amount of work. When this strategy is successful, the scores used to determine the next batch size for a compute unit are not varying significantly throughout time. Therefore, the last and average scores have small differences. An exception can be seen for System C in Fig. 10, 11, where using the average ratio improves performance for the smallest batch size. From the specifications regarding single-precision performance of the three GPUs used and relative performance of the corresponding CPUs (Fig. 9), we conclude that System C has the largest difference in performance between the GPU and the CPU for the physical problems used. Using the average ratio seems to improve performance in this case, by making better decisions with respect to assignments when such large performance differences exist. This result confirms our observation in Sect. 3.4 that lead to the introduction of average ratios. Overall, these results prove the dependency of the HRDLS parameters, specifically the b value, on the model, the size of the trial solution grid, and of course the computing systems.

Evaluation when using CPUs and GPUs on the heterogeneous cluster
After having evaluated each system separately, we now evaluate the performance of the heterogeneous cluster that is comprised from System B and System C. This corresponds to the fourth case of our incremental approach for the performance evaluation. Figure 14 presents the relative delay of the framework for the heterogeneous cluster using only the GPUs, with respect to the theoretical lowest execution times of each system's best GPU-only configuration. Figure 15 presents the corresponding information when using both the CPUs and GPUs of the heterogeneous cluster, comparing with their corresponding best CPU and GPU configurations. Results are presented for different b, s, and c (where applicable) values for the internode HRDLS, using the "last" ratio recalculation strategy with a time threshold for ratio adjustment t min = 10ms . The graphs are created from the data contained in Table 8 and 9, where further details regarding performance of each physical problem can be found.
These last graphs highlight the importance of HRDLS's temporary limitation of batch sizes during its first assignments, along with the significance of the values c and s. This becomes clear when compared with Fig. 16, which presents results for the heterogeneous cluster when no limitation is set during the first assignments. It is clear that when working with a heterogeneous cluster of nodes, HRDLS requires some time to adapt to the systems before it can be used to efficiently distribute large chunks of Fig. 14 Relative delay of the framework for the heterogeneous cluster comprised from System B and System C without limitation for first assignments, when only the GPUs are used. "MAX" represents a very large batch size (the largest unsigned integer in our implementation) Fig. 15 Relative delay of the framework for the heterogeneous cluster comprised from System B and System C without limitation for first assignments, when the CPUs and the GPUs are used. "MAX" represents a very large batch size (the largest unsigned integer in our implementation) the search grid. In Figs. 14 and 15, we see an optimum c value at around 10 7 . Lower values prevent HRDLS from correctly adjusting its ratios due to extremely low time measurements during the first assignments that are filtered by t min . Higher values can effectively render the entire functionality irrelevant, by distributing too many trial solutions to slow nodes in the beginning, leaving faster nodes idle. As expected, the s value, which determines how many steps will be limited before HRDLS utilizes the calculated ratios for each node, is also important, as more steps allow the intra-node HRDLS of each system to also adapt to its node, but too many might cause a communication overhead significantly higher than is needed. The b value is also proven important even with the temporary limitation of batch sizes. Due to the unpredictable nature of the trial solutions within a search grid causing irregular computational demands, and the differences in the compute capability of our physical nodes, it is simply too improbable that HRDLS will manage, at any point in time, to calculate the exact ratios needed for a perfect single-step workload distribution. As such, we find that the cost of a few additional Fig. 16 Relative delay of the framework for the heterogeneous cluster comprised from System B and System C without limitation for first assignments Table 2 Default values for the parameters of HRDLS. The framework can vary the number of threads comprising a block on the GPU due to other constraints (size of shared memory, number of trial solutions assigned to the GPU)

Parameter
Default value Batch size during initial assignments (c) 1.0e+7 Number of steps during initial assignments (s) 6 Timing threshold ( t min ) 10ms Scheduling for CPU cores controlled by CPU handler thread Dynamic, chunk size = 10000 Block size for GPU 1024 threads Streams per GPU 8 Processing elements to use for computations Both (All CPUs and GPUs) Use latest or average ratio Average assignments caused by a limited b value is preferable to the risk of making large assignments and leaving nodes idle. Table 2 summarizes the default values used in HRDLS. These were derived from the performance analysis carried out and provide acceptable performance in most cases.

Workload balancing evaluation
In this section, we evaluate to which extend HRDLS manages to balance workload. Using System B, a 2-fault Okada problem was run multiple times. The number of worker processes, as well as the number of threads that the CPU handler thread, was controlling in each worker process varied. The CPU of the system used consists of 24 cores, each supporting two hardware threads. The selected inverse problem was run so that each time the total number of threads executing the forward modeling and objective functions was 48. After completion of a batch of trial solutions on a node, a semi-random delay was added before requesting the next batch. This has been done so as to emulate execution of different worker processes on nodes of different speed. The delay is calculated according to the rank of each process and is a random value in the interval [0.9 ⋅ rank∕2, 1.1 ⋅ rank∕2] . This approach leads processes with a higher rank to consistently emulate systems of lower speed throughout each execution. The grid used for the inverse problem consists of 6 ⋅ 10 9 trial solutions. Table 3 summarizes the results. We recorded the time when the first process completes its last assigned batch (column "Min"), when the last process completes its last assigned batch (column "Max") and the average over completion times of all worker processes (column "Average"). Absolute values of standard deviation and as a percentage over the average have also been included. The results show that HRDLS manages to balance the workload efficiently. In the most difficult case (24 worker processes, 2 threads per process), the slowest process (rank 23) has an average delay of 11.5 s per batch processed, compared to the fastest process (rank 0). Even in this case, HRDLS manages to distribute the workload of the application in such a manner that all processes complete close to each other. A clarification regarding the results presented in Table 3 must be made at this point. It was mentioned in Sect. 3.3.2 that contention for work assignment among threads is expected to be low in all cases. This statement is with respect to HRDLS at the node level, where the single CPU and multiple GPU handler threads compete for assignment of work (Algorithm 4, Line 13). Results presented in this section indicate that for 12 or more threads per process execution time increases significantly. We remind the reader that after the single CPU handler thread of a worker process receives work, it has to further distribute it among the cores of the CPU it is controlling (Algorithm 4, Line 29, "parallel for" construct, implemented using OpenMP when executed on the CPU). The overhead observed is due to work distribution of OpenMP to an increasing number of these threads.

Coordinator process communication bottleneck evaluation
As pointed out in Sect. 3.3.1, it is possible that the coordinator process might turn into a bottleneck, due to the centralized nature of the virtual network. In this section, we analyze the behavior of communication at the first level of the virtual network to determine the extend at which this bottleneck appears.
Using System B, we run twice a 2-fault Okada problem, recording the time at which each worker process requests work from the coordinator process. The first time 24 worker processes were created, so that a single process will execute on each core of the two CPUs of the system. Since the CPUs support two hardware threads per core, we repeated the experiment creating 48 worker processes. Since all processes execute on the same system, the GPU of the system was not used, as it would need to be shared among all processes. The grid used is the same as in the previous section, consisting of 6 ⋅ 10 9 trial solutions. Figures 19 and 20 in Appendix D summarize the results. Dots on each horizontal line represent points in time when the corresponding process requests work. As expected, during the first assignments, there are multiple worker processes that attempt to communicate with the coordinator process at the same time to request work. This behavior appears because the dynamic workload scheduler assigns small batch sizes to worker processes during the first rounds and all batch sizes have the same size (see Eq. (7)). As a result, processing elements on each node finish quickly their computations and coordinator threads request their next batch from the coordinator process. However, as execution progresses and batch sizes start to adapt to the performance of each worker process, communication conflicts start to become less frequent. Figures 19 and 20 do not provide a clear view regarding communication conflicts during the first assignments of work. In the 24 processes case, the time required for all processes to receive their first 5 batches of work is only 0.5 s and a total of 120 work assignments are performed (4ms per request on average). In the 48 processes case, the corresponding time is 2.3 s, and a total of 250 assignments are performed (9ms per request on average). To provide a better view for the initial assignments of work, we include Fig. 21. For this graph, we consider that two requests to the coordinator process that happen within 10ms consist a communication conflict. The result labeled "5" includes communication conflicts that happen up to the fifth work assignment for all processes. Almost all communication requests conflict with others in this case. The result labeled "6" includes communication conflicts that happen after the last process has received its fifth work assignment and until all processes have been assigned work for the sixth time. The last two results are similar to this last case. The graph shows that communication conflicts decrease quickly and that there are almost no conflicts starting from the eighth assignment.
These results show that even under the presence of a larger number of worker processes, the coordinator process is under pressure mainly during the first few ( ≈ 5 ) rounds of work assignment. But even in this case, it manages to serve worker processes in under 10ms in the vast majority of cases. Taking into account that there are 11-13 work assignments per process in the 24 processes case and 10-12 work assignments in the 48 processes case for the selected grid, the overhead of communication conflicts is negligible. Thus, our approach to linearize the indices of trial solutions and minimize exchange of data proves to be an appropriate approach. Figures 19 and 20 show another expected behavior of the framework. Excluding the first few rounds of work assignment, which are handled in a special way, the figures show that after the assignment of a short batch and the recording of timing information, the next batches are typically much larger, showing that our scheduler indeed adapts to the processing speed of processing elements. However, there are exceptions to this behavior. One exception is when there are few trial solutions left in the batch assigned to the worker process. In this case, the worker thread will receive these remaining trial solutions. The second case is when a processing element requires more time for computations for a short batch, compared to other processing elements. The scheduler computes a lower score for this particular processing element and hence assigns a shorter batch to it. Nevertheless, these cases again show that our scheduler manages to adapt to changing conditions of execution in the context of the application.

Related work
Dynamic scheduling of application workload on heterogeneous clusters has been used in various cases. In [61], the authors present a scalable distributed Monte Carlo framework for multi-accelerator heterogeneous clusters. Different load balancing schemes are evaluated and, as in the case of our framework, dynamic scheduling is used to improve utilization of all available processing elements in the cluster. In [29], the authors present a methodology for agglomerating work units dynamically at runtime and scheduling the agglomerated work units on accelerators. The methodology is portable across architectures and adaptable between problem configurations. It is evaluated using an n-body particle simulation and a parallel dense LU solver. In [69], the authors present a multiround scheduling algorithm and a scheduling framework for farm-pattern based applications on heterogeneous clusters. In [48,49], the authors implement the PLB-Hec and PLB-HAC profile-based load-balancing algorithms for data-parallel applications in heterogeneous CPU-GPU clusters. Both algorithms construct a performance curve model for each resource during run-time and continuously adapt it to changing conditions through a machine learning approach. PLB-HAC is an improvement of PLB-HeC that removes synchronization phases, enhances the rebalancing mechanism and includes support for Xeon Phi accelerators. Similarly to our work, the authors in [55] present a hierarchical scheduling scheme where the same shared-memory-level code can be used on both the CPU and the GPU. A difference compared to our work is that the code used for inter-node communication in [55] can also be used during CPU-GPU communication. This is in contrast to the traditional "offload" model of the GPU, in which GPU code can differ significantly from CPU code. Using a hybrid MPI-OpenMP implementation of the acoustic-elastic wave propagation, the authors demonstrate the efficacy of the proposed partitioning scheme on Stampede, a heterogeneous, Intel Xeon Phi accelerated supercomputer. In [14,15], the authors propose a distributed batch-calculation approach that does not require the coordinatorworker execution scheme. They approach also considers features lately introduced into MPI, such as passive-target remote memory access, shared-memory window creation and atomic read-modify-write operations. To evaluate the proposed approach, the authors implement five well-known dynamic loop scheduling techniques (Self-Scheduling (SS), Guided Self-Scheduling (GSS), Trapezoid Self-Scheduling (TSS), Factoring (FAC) and Weighted Factoring (WF)), apply them to two applications and evaluate their performance on two different heterogeneous hardware setups. In [16], the authors expand on their previous work and compare their distributed to the centralized batch-size assignment approach.
Dynamic scheduling has also been adopted in the context of single heterogeneous systems. In [20], the authors present a profile-based AI-assisted dynamic scheduling approach to efficiently utilize heterogeneous resources (CPU cores and GPUs) in a single system. The approach combines online scheduling, application profile information, hardware mathematical modeling and offline machine learning estimation modeling to implement automatic application-device-specific scheduling for heterogeneous architectures. Similarly, in [23], the authors present two online profilingbased scheduling algorithms that use low-overhead online profiling to automatically partition the work of data-parallel kernels between the CPU and the integrated GPU of a system. Grid search has been used as a method to characterize seismic faults in [63,64]. The work presented in these publications targets only a single GPU and has been used in this work as the Baseline version. The authors in [59,60] present a grid search approach for a different method of obtaining information regarding earthquakes. The moment tensor is calculated, which allows the computation of the magnitude, the epicenter, the depth and other information regarding an earthquake. Special care is taken to improve on fully automatic operation of the system, by introducing filters that exclude low quality input data. Regarding the computational part, OpenACC has been used and again only a single GPU is targeted. The software developed has recently been made fully operational and provides real-time analysis of earthquakes at the Institute of Geodynamics of the National Observatory of Athens, Greece 4 and FMIPA Unesa -Universitas Negeri Surabaya, Indonesia. 5 It has replaced the previous software due to better automation and accuracy, indicating that grid search is applicable to real-time, real-life problems. Grid search (together with random search) has also been extensively used in optimizing neural networks parameters [8,17,27,28].

Conclusions
We presented a framework that allows dynamic workload scheduling of grid searches for inverse problems on heterogeneous clusters. To the best of our knowledge, this is the first attempt to solve inverse problems on such a computational environment. We have shown that the hierarchical dynamic workload scheduler proposed in the context of the framework adapts to the varying computational load of trial solutions and the varying processing speed of processing elements within the cluster, by assigning appropriate numbers of trial solutions to each processing element, hence reducing total execution time. Fine-tuning of the parameters of the framework is a necessary step to obtain high performance, but our experimental evaluation has led to a set of parameters that provide acceptable performance in most cases. Although the framework already manages to achieve high performance, there is still room for improvement, like better determination of block sizes on the GPU. Nevertheless, our framework shows that grid search as a method for solving inverse problems is now feasible due to the performance that GPUs provide.
The statistical analysis of obtained results is currently a post-processing step that has to be performed by the user with appropriate tools. Although this does not affect performance of the framework, it is a component that could further facilitate the user in this process. Therefore, we are planning to implement such a component in the framework. Fig. 17 Geometry of the source model of Okada. A point P at a free surface is dislocated by u E , u N , u Z due to the uniform slippage of a rectangular surface (fault) by U 1 and U 2 at depth d. The free surface denotes the ground surface and the limit of the elastic half-space Setting R = √ 2 + 2 + q 2 , X = √ 2 + q 2 , = ⋅ cos + q ⋅ sin and = ⋅ sin − q ⋅ cos , displacements in the (x, y, z) coordinate system corresponding to dislocation U 1 (strike-slip direction) are calculated as: whereas displacements corresponding to the U 2 dislocation (dip-slip direction) are calculated as: where except if the fault has a vertical tilt ( = 90 • ⇒ cos = 0) , where ux ss ( , , q, ) = − U 1 2 ⋅ ⋅ ⋅ q R ⋅ (R + ) + arctan ⋅ q ⋅ R + I 1 ⋅ sin uy ss ( , , q, ⋅ arctan ⋅ (X + q ⋅ cos ) + X ⋅ (R + X) ⋅ sin ⋅ (R + X) ⋅ cos In the above equations, and are the Lamé constants and under the assumption of an elastic isotropic space, Poisson's ratio is: It can also be shown that: Finally, the displacements of point P are transformed into the (E, N, Z) coordinate system through the following equations: where u x = ux ss (x, p, q, ) − ux ss (x, a, q, ) − ux ss (b, p, q, ) + ux ss (b, a, q, ) + ux ds (x, p, q, ) − ux ds (x, a, q, ) − ux ds (b, p, q, ) + ux ds (b, a, q, ) u y = uy ss (x, p, q, ) − uy ss (x, a, q, ) − uy ss (b, p, q, ) + uy ss (b, a, q, ) + uy ds (x, p, q, ) − uy ds (x, a, q, ) − uy ds (b, p, q, ) + uy ds (b, a, q, ) u z = uz ss (x, p, q, ) − uz ss (x, a, q, ) − uz ss (b, p, q, ) + uz ss (b, a, q, ) + uz ds (x, p, q, ) − uz ds (x, a, q, ) − uz ds (b, p, q, ) + uz ds (b, a, q, ) Fig. 18 Geometry of the source model of Mogi. A point P at a free surface is dislocated by a radial 3D displacement u , due to the volume change ΔV of a magma chamber M. The radial distance between P and M is r. Displacement u is analyzed into a radial horizontal displacement u r and a vertical displacement u z , described by the standard Mogi equations, on the condition that the radius R of the magma chamber is small relative to its depth d. Displacement u r can further be analyzed into two Cartesian displacements u x and u y

Appendix E: Tables
In this Appendix, tables containing the data used in Sect. 5.2 are included. The tables have been color-coded, to better visualize the results (See Tables 4, 5 , 6, 7, 8, 9).

Fig. 21
Number of communication conflicts until the last worker process receives its first 5 work assignments and then until the last process receives its 6 th , 7 th and 8 th work assignment Table 4 Relative delay of the framework with respect to the Baseline version (executed on System A), when only the GPU is used for calculations on each system 1 3 Hierarchical dynamic workload scheduling on heterogeneous… Table 5 Relative delay of the framework with respect to the Baseline version (executed on System A), when only the CPU is used for calculations on each system Table 6 Relative delay of the framework on each system using both CPU and GPU, with respect to the theoretical lowest execution times based on the respective system's best CPU and GPU configurations. Results given for common worker batch sizes Table 7 Relative delay of the framework on each system using both CPU and GPU, with respect to the theoretical lowest execution times based on the respective system's best CPU and GPU configurations. Results given for worker batch sizes specific to each search grid Table 8 Heterogeneous cluster comprised of System B and System C, without limitation of batch sizes for the first assignments