
1 Introduction

Load balancing refers to the distribution of tasks over a set of computing resources in parallel systems. We simplify load as execution time; a load difference between processes results in imbalance. A process is an abstract entity performing its tasks on a processor. For example, imbalance arises when a process has to wait for the others in bulk-synchronous parallel programs. The primary use case in this paper is iterative applications such as adaptive mesh refinement (AMR) for solving partial differential equations (PDEs) [22]. Traditional methods distribute the load at the beginning by using cost indicators. However, an unexpected performance slowdown can lead to a new imbalance. Therefore, dynamic load balancing strategies such as work-stealing [9] are more practical. Work-stealing principally waits until the queue of an underloaded process is empty; that process then steals tasks from overloaded processes once an agreement is reached. In contrast, the reactive approach monitors execution repeatedly to estimate the load status and offloads tasks if the imbalance ratio reaches a given condition [13]. The monitored information is the most recent number of waiting tasks in each queue, which implicitly represents the computing speed of each process. From this, the imbalance ratio is estimated, and tasks at an overloaded process can be reactively offloaded to a corresponding underloaded process [23]. Without prior load information, this approach conservatively fixes the number of tasks offloaded per balancing operation. Nevertheless, very high imbalance remains a challenge that limits reactive load balancing.

We propose a proactive approach for offloading tasks to improve performance further. The scheme is based on task characterization and online load prediction. Instead of monitoring only queue information, we characterize task features and execution times on-the-fly. We then use this data to train an adaptive prediction model, whose knowledge is learned from dynamic changes during execution. Our proactive algorithm subsequently uses the prediction results to guide task offloading. The idea is implemented in Chameleon, a task-based programming framework for shared and distributed memory [13]. We evaluate this work with an artificial benchmark (matrix multiplication) and an adaptive mesh refinement (AMR) code named Sam(oa)\(^{2}\) [18]. Sam(oa)\(^{2}\) is a hybrid parallel framework for solving PDE systems on dynamically adaptive tree-structured triangular meshes. Variations in computation cost per element are caused by the limiting procedure, the space-time predictor, and the numerical inundation treatment at coastlines [21]. Our examples and implementation are described in more detail online (see footnote 5). The main contributions are:

  • We discuss what limits the existing reactive approaches and define a proactive solution based on load prediction.

  • Our approach shows when it is possible to apply machine learning on-the-fly to predict task execution time.

  • Then, a fully distributed algorithm for offloading tasks is proposed to improve load balancing further.

The rest of the paper is organized as follows: Sect. 2 discusses related work. Section 3 describes the terminology of task-based load balancing and the motivation for our problem. The online prediction scheme and the proactive algorithm for offloading tasks are addressed in Sect. 4. Finally, Sect. 5 presents the evaluation, and Sect. 6 concludes with future work.

2 Related Work

Assuming that system performance is stable, load balancing has been studied in terms of static cost models and partitioning algorithms [4, 12]. Balance is achieved by accurately mapping tasks to processors. Our paper focuses on issues arising after the work has already been partitioned. As mentioned, performance slowdown is a cause of imbalance during execution [27]. There are three classes of dynamic load balancing algorithms: centralized [5], distributed, and hierarchical [7]. Work-stealing is a traditional approach employed in shared memory systems [2]. For distributed memory, work-stealing is risky because of communication overhead. Researchers have attempted to improve communication by using RDMA in PGAS programming models [9, 15]. Lifflander et al. introduced a hierarchical technique that applies the persistence principle to refine the load of task-based applications [17]. Focusing on scientific applications where computational tasks tend to be persistent, Menon et al. proposed using partial information about the global system state to balance load by randomized work-stealing [19]. To improve stealing decisions, Freitas et al. analyzed workload information and combined it with distributed scheduling algorithms [10]; the authors reduced migration overhead by packing similar tasks to minimize messages. Instead of enhancing migration, reactive solutions rely on monitoring execution speed to offload tasks from an overloaded process to underloaded targets [13, 23]. A related idea is replication, which aims at tackling unexpected performance variability [24]. However, it is difficult to know exactly how many tasks should be offloaded or which processes are truly underloaded in high imbalance cases. Without prior load knowledge, replication strategies need to fix the target processes for replicas, such as neighbor ranks; this decision is not easy to make and may incur high cost. Using machine learning-based prediction to guide task scheduling is not new; however, the difference lies in the problem features and the applied context. Most studies target cloud [1] or cluster management [8] and rely on historic logs or traces [3, 25] obtained from profilers, e.g., TAU [26] or Extrae [20]. Li et al. introduced an online prediction model to optimize task scheduling in a master-worker model in the R language [16]. Our context is a given distribution of tasks whose imbalance is caused by performance slowdown at runtime; therefore, offline prediction from historical data is insufficient.

3 Preliminaries and Motivation

Fig. 1. The illustration of (A) an iterative task-based execution with 4 ranks, 2 threads per rank, and (B) a real load imbalance case with Sam(oa)\(^2\).

Many-task runtimes have been studied in shared memory architectures [28]. A task is defined by an entry function and its data (e.g., input arguments). An iterative application is decomposed into distinct parallel phases of executing tasks. Barriers synchronize each parallel execution phase (a so-called time step in numerical simulation). Figure 1(A) illustrates an execution phase, where the x-axis represents the time progress, the y-axis lists four processes (MPI ranks 0 to 3), and the green boxes indicate tasks. Each rank has 16 tasks, executed by two threads per rank. In general, we define \(n_t\) independent tasks per phase, where \(T =\{0, ..., n_t-1\}\) denotes the set of tasks. Each task has an associated wallclock execution time (\(w \ge 0\)) and runs on a specific core until termination. All tasks in T are distributed over \(n_p\) processes, where \(P = \{0, ..., n_p-1\}\) denotes the set of processes. The real value of w depends on the task's input, CPU frequency, and memory bandwidth; therefore, it can only be measured at runtime. Below, we address some definitions and illustrate their symbols in Fig. 1(A).

  • \(W_{i}\): the wallclock execution time of Rank i. In addition, \(L_{i}\) denotes the total load of Rank i, i.e., the sum of the load values of all tasks assigned to Rank i.

  • \(W_{par}\): indicates the longest wallclock execution time (the so-called parallel wallclock execution time), where \(W_{par} = \max _{\forall i \in P} W_{i}\).

Thereby, the maximum wallclock execution time (\(W_{max}\)) is identical to \(W_{par}\), the minimum is \(W_{min} = \min _{\forall i \in P} W_{i}\), and the average is \(W_{avg} = avg_{\forall i \in P} W_{i}\). Load balancing strategies aim to minimize \(W_{par}\). To evaluate the balance, we use the ratio of the maximum and average W values, called \(R_{imb}\), shown in Eq. 1, where \(R_{imb} \ge 0\) and a high \(R_{imb}\) means a high imbalance.

$$\begin{aligned} R_{imb} = \frac{W_{max}}{W_{avg}} - 1 \end{aligned}$$
(1)
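To make these definitions concrete, the following minimal Python sketch (with invented wallclock times) computes \(W_{par}\), \(W_{avg}\), and \(R_{imb}\) exactly as defined above:

```python
# Illustrative only: per-rank wallclock times W_i for n_p = 4 ranks (invented values).
W = [10.0, 12.0, 11.0, 19.0]

W_par = max(W)               # W_max: the parallel (longest) wallclock time
W_min = min(W)
W_avg = sum(W) / len(W)      # average wallclock time over all ranks

R_imb = W_par / W_avg - 1.0  # Eq. (1): 0 means perfect balance, higher means more imbalance
print(f"W_par={W_par}, W_avg={W_avg}, R_imb={R_imb:.2f}")  # R_imb is about 0.46 here
```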

In work-stealing, underloaded ranks exchange information with overloaded ranks when their task queues are empty, and tasks can be stolen once an agreement is reached. However, this might be too late in distributed memory because of communication overhead. In contrast, the reactive balancing approach uses a dedicated thread. Based on the most current status, tasks are offloaded early by speculative balancing operations instead of waiting for empty queues [23]. This approach has two strategies: reactive task offloading [14] and reactive task replication [24]. Without prior knowledge, reactive balancing decisions must be conservative at runtime regarding the number of offloaded tasks and the potential victims. In cases of a high imbalance ratio, as Fig. 1(B) shows, the uncertainty of a balancing decision at time \(t_{k}\) can affect the overall efficiency of the execution. This motivates the following questions addressed in this work:

  1. For permanent task offloading, how can we know the appropriate number of tasks to offload?

  2. For victim selection from phase to phase, how can we know the potential victims to offload tasks proactively?

  3. For a long-term vision, it is necessary to learn the variability of communication overhead along with the given topology information at runtime.

4 Online Load Prediction and Proactive Task Offloading

4.1 Online Load Prediction

This work exploits a hybrid MPI+OpenMP task-based framework and a dedicated thread to perform online prediction with a machine learning regression model. The results are then used to balance load before a new iteration begins.

Where does the dataset come from? The inputs (IN) come from two sides: the application (\(IN_{app}\)) and the system (\(IN_{sys}\)), where \(IN_{app}\) comprises task-related features and \(IN_{sys}\) relates to processor frequencies or performance counters. The output (OUT) can be the wallclock execution time of a task or the total load of a rank in the next execution phases. IN and OUT are normalized from the information characterized at runtime and are used to create a training dataset. Because applications are domain-specific, users should pre-define the influential characteristics or parameters. Therefore, we design this scheme as a user-defined tool outside the main library [6].
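As a hedged illustration (not the actual tool code from [6]), the Python sketch below assembles such a normalized training sample for the MxM case, using the matrix size as an \(IN_{app}\) feature, the core frequency as an \(IN_{sys}\) feature, and the task wallclock time as OUT; all numeric values are invented:

```python
import numpy as np

# One row per executed task: [matrix_size, core_frequency_GHz]; labels are wallclock times in s.
raw_inputs = np.array([[256, 2.6], [512, 2.6], [512, 2.2], [1024, 2.6]], dtype=float)
raw_labels = np.array([0.011, 0.084, 0.099, 0.670])

def min_max_normalize(x):
    """Scale each feature column into [0, 1]; constant columns are left at 0."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (x - lo) / span

IN = min_max_normalize(raw_inputs)    # normalized IN_app (matrix size) and IN_sys (frequency)
OUT = raw_labels / raw_labels.max()   # normalized prediction target (task wallclock time)
```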

When is a prediction model trained? Iterative applications can have many execution phases (iterations), depending on the computation scenario. In the hybrid MPI+OpenMP model, our dedicated thread runs asynchronously with the other threads; it characterizes and collects runtime data during the first iterations on each rank. We expose the input-output features as configuration parameters of the tool, which users can tune flexibly before running their applications. This also raises some related questions:

  • Which input features and how much data are effective?

  • Why is machine learning needed?

  • In which ways do the learned parameters change during runtime?

First, the input-output features are based on observing application characteristics. Since each use case differs, it is difficult to state how much data is generally adequate; an external user-defined tool is therefore appropriate for this issue. Second, the hypothesis is that there is a correlation between application and system characteristics that can be mapped to a prediction target over iterations. Moreover, the repetitive nature of iterative applications makes it easier for machine learning to capture the behavior. Third, learning models can be made adaptive by re-training under performance variability. However, the level of variability at which the model becomes ineffective is not addressed in this paper and will be examined in future work.
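To illustrate how such user-tunable parameters might look, the sketch below lists a hypothetical configuration; the parameter names and default values are ours for illustration only and are not the actual options of the tool in [6]:

```python
# Hypothetical configuration of the external prediction tool (names and values are illustrative).
PREDICTION_CONFIG = {
    "collect_iterations": 20,          # warm-up iterations used to build the training dataset
    "history_window": 4,               # previous iterations used as input features (cf. Eq. 2)
    "input_features": ["task_args", "core_frequency"],  # selected IN_app / IN_sys features
    "prediction_target": "rank_total_load",             # OUT: per-task time or per-rank load
    "retrain_on_variability": True,    # re-train the model when the behavior drifts
}
```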

Table 1. The input-output features for training the prediction models.

For our experiments, the input and output parameters of online prediction are described in Table 1. There are two use cases: synthetic matrix multiplication (denoted MxM) and Sam(oa)\(^2\). In MxM, the matrix size argument of a task mainly determines its execution time; we therefore configure the training inputs to be the matrix sizes and the core frequency. Sam(oa)\(^2\) uses the concept of grid sections, where each section is processed by a single thread [18]. A traversed section is an independent computation unit, which is defined as a task. Following the canonical approach of cutting the grid into parts of uniform load, the tasks per rank are uniform, but sets of tasks on different ranks might not have the same load. For Sam(oa)\(^2\), we therefore predict the total load of a rank in an iteration instead of the wallclock time of each task (w), where \(L^{I}_{i}\) denotes the total load value of Rank i in Iteration I. To estimate w, we can divide \(L^{I}_{i}\) by the number of tasks assigned to the rank. Furthermore, our observation shows that \(L^{I}_{i}\) can be predicted from the correlation between the current iteration and the previous iterations. For example, suppose Rank 0 has finished Iteration I and we take the total load values of the four previous iterations. In that case, the training features are the load values from Iteration \(I-4\) to \(I-1\), as in the following samples for \(I = 8, 9\).

$$\begin{aligned} \begin{aligned} \scriptstyle&\cdots \\ \scriptstyle&L^{4}_{0},L^{5}_{0},L^{6}_{0},L^{7}_{0} \rightarrow L^{8}_{0} \\ \scriptstyle&L^{5}_{0},L^{6}_{0},L^{7}_{0},L^{8}_{0} \rightarrow L^{9}_{0} \end{aligned} \end{aligned}$$
(2)

Concretely, the left part of each arrow contains the training inputs, and the right part contains the training label. Other ranks also use this format to generate their datasets.
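A minimal Python sketch of this sliding-window construction, here paired with an off-the-shelf linear regression model purely as an example (the model actually used by our tool may differ), could look as follows:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def build_window_dataset(loads, window=4):
    """Turn a rank's load history [L^0, L^1, ...] into (inputs, labels):
    the loads of `window` previous iterations predict the next iteration's load."""
    X, y = [], []
    for i in range(window, len(loads)):
        X.append(loads[i - window:i])   # e.g. L^4..L^7 as features
        y.append(loads[i])              # e.g. L^8 as label
    return np.array(X), np.array(y)

# Invented load history of one rank over the first 20 iterations.
load_history = [100.0 + 3.0 * np.sin(0.4 * i) + 0.5 * i for i in range(20)]

X_train, y_train = build_window_dataset(load_history, window=4)
model = LinearRegression().fit(X_train, y_train)

# Predict the total load of the next iteration from the 4 most recent loads.
next_load = model.predict(np.array([load_history[-4:]]))[0]
```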

4.2 Proactive Algorithm and Offloading Strategies

As Algorithm 1 shows, our proactive algorithm uses the prediction results as inputs, where Array L contains the predicted total load per rank and Array N denotes the given number of tasks per rank. The number of ranks (\(n_{p}\), as introduced in Sect. 3) is the size of L and N. First, L is sorted by load value and stored in a new array \(\hat{L}\). Second, \(L_{avg}\) indicates the average load, which is considered the optimal balanced value. To estimate how many tasks should be offloaded, Algorithm 1 uses Array R to record the total load of offloaded tasks (so-called remote tasks). Additionally, Array TB is used to track the number of local tasks (tasks remaining on the local rank) and remote tasks. TB is a tracking table with \(n_{p}\) rows and columns, where the diagonal entries represent the numbers of local tasks and the off-diagonal entries indicate the numbers of remote tasks. For example, if \(TB[i,j] > 0\) (\(i \ne j\)), Rank i should offload TB[i, j] tasks to Rank j.

Algorithm 1 (pseudocode listing).

In detail, the outer loop iterates forward over the victims (\(\hat{L}[i] < L_{avg}\)). The underload of Rank i with respect to \(L_{avg}\) is then calculated and named \(\delta _{under}\), meaning Rank i needs an additional load of \(\delta _{under}\) to be balanced. The inner loop iterates backward over the offloaders (\(\hat{L}[j] > L_{avg}\)). The overload (\(\delta _{over}\)) of Rank j with respect to \(L_{avg}\) is then calculated and distributed to the victims. To compute the number of tasks for offloading, we need to know the load per task (w), except in cases where we predict w directly, e.g., in MxM. Otherwise, the load per task can be estimated as the total predicted load divided by the number of tasks assigned to the rank, named \(\hat{w}\) at line 10. Afterward, the number of offloaded tasks (\(N_{off}\)) and the total offloaded load (\(L_{off}\)) are calculated. The values of \(\delta _{under}\), \(\hat{L}\), N, R, and TB are then updated at the corresponding indices. In line 18, the absolute difference between \(\delta _{under}\) and \(L_{avg}\) is compared with \(\hat{w}\) to check whether the current offloader still has enough tasks to fill up the load of \(\delta _{under}\); if not, we move on to another offloader. Regarding complexity, if there are \(n_{p}\) ranks in total and K of them are victims, then \(n_{p}-K\) ranks are offloaders, and the algorithm takes \(O(K(n_{p}-K))\). As mentioned, our implementation is described in more detail online (see footnote 5). For offloading tasks, we use two strategies: round-robin and packed-tasks offloading. Round-robin sends tasks one by one; e.g., if Algorithm 1 determines that \(R_{0}\) needs to offload 3 tasks to \(R_{1}\) and 5 tasks to \(R_{2}\), it will send the \(1^{st}\) task to \(R_{1}\), the \(2^{nd}\) one to \(R_{2}\), and repeat the process until all tasks are sent. In contrast, packed-tasks offloading encodes the three tasks for \(R_{1}\) as one package and sends it at once before proceeding to \(R_{2}\).
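For illustration, the Python sketch below approximates the core idea of Algorithm 1 as described above: it sorts the predicted loads, walks the victims forward and the offloaders backward, estimates \(\hat{w}\), and fills the tracking table TB. It is a simplified re-implementation under our own assumptions, not the actual Chameleon code, and it omits details such as Array R and the exact line-18 check:

```python
import numpy as np

def proactive_offloading(L, N):
    """Sketch of building a proactive offloading plan.
    L: predicted total load per rank, N: number of tasks assigned per rank.
    Returns TB, where TB[i, j] (i != j) is the number of tasks Rank i should
    offload to Rank j, and TB[i, i] is the number of tasks it keeps locally."""
    L = np.asarray(L, dtype=float).copy()
    N = np.asarray(N, dtype=int)
    n_p = len(L)
    L_avg = L.mean()                    # the optimal balanced load per rank
    order = np.argsort(L)               # rank indices sorted by predicted load (ascending)
    TB = np.diag(N)                     # diagonal: local tasks, off-diagonal: remote tasks

    j = n_p - 1                                    # inner index: current (heaviest) offloader
    for i in range(n_p):                           # outer loop: victims, most underloaded first
        victim = order[i]
        delta_under = L_avg - L[victim]            # load the victim still needs to be balanced
        if delta_under <= 0:
            break                                  # remaining ranks are not underloaded
        while delta_under > 0 and j > i:
            offloader = order[j]
            delta_over = L[offloader] - L_avg      # excess load of the offloader
            if delta_over <= 0:
                break
            w_hat = L[offloader] / max(TB[offloader, offloader], 1)  # estimated load per task
            n_off = int(min(delta_under, delta_over) // w_hat)       # whole tasks to move
            if n_off > 0:
                l_off = n_off * w_hat
                TB[offloader, victim] += n_off
                TB[offloader, offloader] -= n_off
                L[offloader] -= l_off
                L[victim] += l_off
                delta_under -= l_off
            if L[offloader] - L_avg < w_hat:       # offloader exhausted: take the next one
                j -= 1
            elif n_off == 0:                       # victim cannot absorb a whole task anymore
                break
    return TB

# Example: Ranks 0/1 are underloaded, Ranks 2/3 overloaded; each rank holds 4 tasks.
print(proactive_offloading([10.0, 10.0, 30.0, 30.0], [4, 4, 4, 4]))
```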

5 Evaluation

5.1 Environment and Online Prediction Evaluation

All tests are run on three clusters with different communication infrastructures at the Leibniz Supercomputing Centre: CoolMUC2 (see footnote 6), SuperMUC-NG (see footnote 7), and BEAST (see footnote 8). The CoolMUC2 system has 28-way Haswell-based nodes and an FDR14 InfiniBand interconnect. SuperMUC-NG features dual-socket Intel Skylake compute nodes with 48 cores per node, connected via Intel OmniPath. On the BEAST system, we use AMD Rome EPYC 7742 nodes with a higher-bandwidth interconnect, HDR 200 Gb/s InfiniBand.

Fig. 2. An evaluation of online load prediction for Sam(oa)\(^2\) in simulating the oscillating lake scenario.

Table 2. The overview of compared load balancing methods.

The first evaluation shows the results of load prediction with Sam(oa)\(^2\). We run 100 time steps to simulate the oscillating lake scenario. Sam(oa)\(^2\) has several configuration parameters, such as the number of grid sections and the grid size, which are described in [18]; this paper uses the default configuration to reproduce the experiments. As mentioned in Subsect. 4.1, the training input features are the total loads of the first finished iterations (the dataset is built from the first 20 iterations). To evaluate accuracy, we use the MSE loss [11] between real and predicted values, shown as a boxplot in Fig. 2 (left). It indicates the feasibility of using this prediction scheme for load balancing, where the x-axis denotes the machine scale and the y-axis the loss values. In addition, Fig. 2 (right) highlights the comparison between real and predicted loads from \(R_{28}\) to \(R_{31}\) on 16 nodes from Iteration 20 to 99, because the data of Iterations 0–19 are used to generate the training dataset.
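For reference, the MSE loss used here is simply the mean of the squared differences between real and predicted loads, as in this short illustration with invented values:

```python
import numpy as np

real = np.array([102.0, 105.5,  98.7, 110.2])   # measured total loads (invented values)
pred = np.array([101.2, 106.1,  99.5, 108.9])   # predicted total loads (invented values)
mse  = np.mean((real - pred) ** 2)              # mean squared error, cf. the MSE loss in [11]
```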

Fig. 3. The comparison of MxM test cases with 8 ranks in total, 2 ranks per node.

5.2 Artificial Imbalance Benchmark

We use the synthetic MxM test cases to ease reproducibility; the tasks are independent and have uniform load. The number of tasks per rank is varied to create different imbalance scenarios. In detail, we generate 4 cases, from no imbalance to a high imbalance ratio (Imb.0–Imb.3). The proposed methods are named \(proact\_off1\) and \(proact\_off2\); they apply the same prediction scheme and proactive algorithm but different offloading strategies. All compared methods are listed in Table 2. In Fig. 3, a smaller ratio is better, indicating that the \(W_{par}\) and waiting-time values between ranks are low. Among the reactive solutions, \(react\_mig\) and \(react\_mig\_rep\) are competitive. However, in the case of Imb.3 on CoolMUC2, the ratio is \(\approx \) 1.7 with \(random\_ws\) and 1.5–1.1 with \(react\_mig\) and \(react\_mig\_rep\), whereas \(proact\_off1\) and \(proact\_off2\) reduce it to under 0.6. On SuperMUC-NG and the BEAST system, the communication overhead is mitigated by the higher-bandwidth interconnects, showing that the reactive methods are still useful. Corresponding to the Imb. values, the second row of charts highlights the speedup of each method over the baseline, calculated from their execution times.

5.3 Realistic PDE Use Case with Sam(oa)\(^2\)

In this experiment, we vary the number of ranks on each system, with two ranks per node, where each rank uses all cores of a CPU socket, e.g., 14 threads per rank on CoolMUC2. Given the different communication overheads, the tests show the scalability and adaptation of the various methods. In Fig. 4, the reactive and proactive methods obtain higher performance than the baseline. Compared to \(react\_mig\), speculative replication (\(react\_rep\)) usually comes at some cost. However, their combination \(react\_mig\_rep\) can help in the cases with 16 or more ranks on CoolMUC2 and BEAST. The replication strategy has difficulty dealing with imbalance cases of consecutive underloaded ranks. In contrast, our proactive approach uses online prediction to provide information about potential victims. As we can see, \(proact\_off1\) and \(proact\_off2\) can improve load balancing in the high imbalance cases (\(\ge 8\) ranks). Of the two offloading strategies, \(proact\_off2\) incurs some delay for encoding a set of tasks when the data is large; therefore, if an overloaded rank has multiple victims, the second victim must wait for the first package to be processed. Admittedly, the proactive algorithm depends on the accuracy of the prediction models. However, the features characterized by the online scheme at runtime can flexibly reflect the execution behavior, making it feasible to build a reasonable runtime cost model. Furthermore, the reactive and proactive approaches can be combined to complement each other.

Fig. 4. The comparison of imbalance ratios and speedup of the various methods for the oscillating lake simulation use case.

6 Conclusion

We have introduced a proactive approach for task-based load balancing in distributed memory systems, which mainly targets iterative applications. The approach combines online load prediction with proactive task offloading, and we proposed a fully distributed algorithm that utilizes the prediction results to guide task offloading. The paper shows that existing reactive approaches can be limited in high imbalance use cases because they lack load information to select victims and to decide wisely on the number of offloaded tasks. Our proactive approach provides prediction knowledge to make better decisions, e.g., which ranks are potential victims and how many tasks should be offloaded. We implemented this approach in a task-based parallel library and evaluated it with synthetic and real use cases. The results confirm the benefits in important use cases on three different systems. As a long-term vision, this work can be considered a potential scheme to co-schedule tasks across multiple applications in future parallel systems; our solution could work as a plugin on top of a task-based programming framework to improve load balancing.