1 Introduction

The job scheduler plays a critical role in HPC platforms by allocating resources to submitted jobs and implementing system management policies [1]. One of the best-known scheduling algorithms is backfilling, an optimized allocation strategy introduced by Lifka [2]. This approach executes a queued job ahead of jobs submitted earlier, with the aim of using otherwise idle computational resources. There are two types of backfilling: conservative backfilling and EASY (Extensible Argonne Scheduling sYstem) backfilling [3]. In the conservative approach, a reservation is made for each incoming job, and jobs are allowed to move ahead in the queue on the condition that they do not delay any other queued job beyond its reserved start time. EASY backfilling, on the other hand, takes a more aggressive approach and allows short jobs to be executed as long as they do not delay the job at the head of the queue. Both conservative and EASY backfilling have benefited some workloads while negatively affecting others [4, 5]. For example, for workloads typical of IBM Scalable POWERparallel Systems (IBM SP2), EASY backfilling showed better performance, whereas for other workloads both algorithms behave similarly [6]. In practice, the EASY backfilling algorithm is common, and it is supported by all major production schedulers [7, 8]. Note that both approaches require users to estimate the running time of their jobs; interestingly, inaccurate user runtime estimates may even be beneficial for these algorithms [9]. Although these techniques shorten job wait times, in practice a significant fraction of the jobs cannot be effectively executed because they require more resources than are currently available.

Recently, application malleability has become a hot research topic in several large-scale projects (ADMIRE [10], DEEP-SEA [11], RED-SEA [12], and TEXTAROSSA [13], among others). In the context of application scheduling, malleability comes to the rescue: it supports dynamically increasing or decreasing the number of processors of running applications [14]. We denote this action as application reconfiguration. This feature adds an extra dimension to the scheduling problem, since resources can be dynamically re-allocated from some applications to others during execution. However, new factors, such as application scalability or the reconfiguration overhead [15], must be considered to determine the feasibility of a malleable operation. This work presents a malleable backfilling algorithm that can be configured with different execution policies. The work has been developed in the context of the ADMIRE project [10], an EU-funded project that targets the development of a controller providing a holistic view of the system by combining compute-node and application monitoring and modeling. Additionally, the project aims to develop malleable compute and I/O runtimes and scheduling techniques. The proposed malleable scheduling algorithm uses the models produced in the ADMIRE framework to determine which jobs are reconfigured and scheduled. Experimental results show how this approach yields more efficient application execution.

The main contributions of this work are summarized as follows:

  • We present a dynamic scheduling algorithm as a performance-driven backfilling variant over a cluster environment.

  • We propose several malleable policies based on application scalability to increase system utilization and decrease slowdown and turnaround time.

  • We include novel application scalability models and reconfiguration overhead models to support the scheduler’s decision making.

  • We explore the impact of malleability when backfilling fails to find a suitable job for the available resources.

  • We include a comprehensive evaluation and analysis of different scheduling algorithms using real workloads with a variety of scenarios.

The remainder of this article is structured as follows: Sect. 2 introduces an overview of the execution framework and system components. Section 3 describes the proposed main algorithm and reconfiguration strategies. Section 4 presents the experimental methodology, covering the system platform, the performance metrics used, and the results. Section 5 provides the background and discusses related work. Finally, Sect. 6 summarizes the work and outlines future directions.

Fig. 1 ADMIRE architecture. The hardware components are shown in green, and the ADMIRE components are in blue. Dark blue boxes show the new components developed for this work (color figure online)

2 Execution framework

The main objective of the ADMIRE project [10] is to create an active compute and I/O stack that dynamically adjusts compute and storage requirements through intelligent global coordination, elasticity of computation and I/O, and the scheduling of storage resources along all levels of the storage hierarchy. Figure 1 illustrates the malleable execution framework considered here. The compute, I/O, and network components are shown in green, while the ADMIRE components are shown in blue. The latter include user-level applications, controllers, algorithms, models, and resource managers. The user-level applications comprise both the running applications and the libraries that provide malleability and monitoring support. The controllers manage the application execution, generate the malleability commands, and collect the monitored performance metrics. The next level, algorithms and models, includes the decision-making logic and the application models used to predict application performance for a different number of processes. Finally, Slurm is used as the resource manager, enacting the decisions of the scheduling algorithms; note that Slurm implements backfilling via the sched/backfill plugin [16, 17]. The main contributions of this work to the execution framework are shown in dark blue: the malleable scheduling algorithms and the application performance models. To validate our proposal, we evaluate the algorithms and models using a batch-system simulator, ElastiSim. Note that I/O scheduling has not been considered in this work.

2.1 ElastiSim

As experiments on real systems are expensive and time-consuming, we used ElastiSim to evaluate our proposed scheduling algorithms. ElastiSim is a batch-system simulator for malleable workloads that simulates jobs, applications, the scheduling algorithm, and the platform. It is a discrete-event simulator written in C++ and based on the simulation framework SimGrid [18]. In contrast to other general-purpose simulators in the literature, such as Batsim [19], AccaSim [20], or Alea [21], ElastiSim has built-in support for malleable jobs augmented with performance models, which we exploit to describe malleability in simulated applications. As ElastiSim also provides a Python interface to integrate custom scheduling algorithms, simulating scheduling scenarios introduces less overhead than simulators based on actual batch systems [22, 23].

To evaluate scheduling algorithms for large-scale distributed systems, ElastiSim introduces system actors and separates the concerns of platform simulation, user interaction (i.e., job submission), and the batch system, including the scheduling algorithm. As illustrated in Fig. 2, ElastiSim provides interfaces to define the workload and the platform that executes it. To define the scheduling logic, users integrate their scheduling policies through the provided scheduling interface, which runs in a separate process and constantly communicates with the simulation process to forward scheduling decisions. We used the Python interface of ElastiSim to develop the algorithms proposed in this work.

Fig. 2 The architecture of ElastiSim. The scheduling algorithm runs as an external process and is in constant communication with the simulation process. Taken from [24]

ElastiSim’s workload model comprises jobs and their corresponding application model. While jobs define high-level attributes such as the requested number of nodes, application models define the actual load executed on the system. To define an application model, ElastiSim introduces phases and tasks. Each application model consists of phases, and each phase comprises tasks representing system activities such as computation, communication, or I/O operations. To allow the reconfiguration of malleable applications during runtime, ElastiSim introduces scheduling points where reconfigurations can safely occur. Malleable applications automatically provide scheduling points between phases to apply reconfiguration requests issued by the scheduler.

As the system load introduced by an application depends on the configuration (i.e., the number of resources) and scales linearly only in rare cases, ElastiSim supports performance models to define the load of a task executed on the system (e.g., the number of bytes to transfer). Users can define performance models (i.e., mathematical functions) dependent on the number of nodes to allow the workload to adapt to a new configuration and define its scalability. With a final attribute, the distribution pattern, users can define which assigned resources participate in the task (e.g., all ranks or root only), allowing the description of simulated clones of real-world applications.
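To make this structure concrete, the following is a minimal conceptual sketch of the phase/task/performance-model decomposition described above. It is our own illustration in plain Python, not ElastiSim's actual interface; the class and field names are assumptions for exposition.

    from dataclasses import dataclass, field
    from typing import Callable, List

    # Conceptual sketch only (not ElastiSim's actual API): an application is a
    # sequence of phases; each phase groups tasks whose load is a function of
    # the number of assigned nodes; reconfigurations apply between phases.

    @dataclass
    class Task:
        kind: str                              # "compute", "comm", or "io"
        load_model: Callable[[int], float]     # nodes -> load (FLOP, bytes, ...)
        distribution: str = "all_ranks"        # which ranks participate

    @dataclass
    class Phase:
        tasks: List[Task] = field(default_factory=list)
        # A scheduling point follows each phase, where a malleable job can
        # safely apply a pending reconfiguration request.

    # Example: a compute task whose per-node load shrinks with the node count.
    compute_task = Task("compute", load_model=lambda n: 1.2e12 / n)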

2.2 Performance models

In this work, we have introduced a \(\rho\) parameter into the simulation framework, with \(\rho \ge 0\). This parameter makes it possible to reproduce the behavior of applications with different scalability degrees, i.e., sub-linear speedups. Figure 3a shows the application execution time for different \(\rho\) values, with \(\rho\) equal to zero representing linear scalability and \(\rho\) values greater than zero representing different degrees of sub-linear scalability. It is important to highlight that only CPU-intensive applications have been considered in this work. The results can also be applied to other classes of applications (such as I/O- and communication-intensive jobs) because they usually exhibit sub-linear scalability, which is equivalent (from the performance point of view) to considering a \(\rho\) greater than zero.
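The exact closed form behind \(\rho\) is internal to the simulator, but an Amdahl-style parameterization reproduces the described behavior: \(\rho = 0\) yields linear scalability and larger \(\rho\) values yield increasingly sub-linear speedups. The following sketch is our assumption of such a form, not the simulator's verbatim model.

    def predicted_runtime(t1: float, nodes: int, rho: float) -> float:
        """Amdahl-style reading of the rho parameter (our assumption):
        rho = 0 yields linear scalability (t1 / nodes), while rho > 0
        yields increasingly sub-linear speedups."""
        assert nodes >= 1 and rho >= 0.0
        return t1 * (rho + (1.0 - rho) / nodes)

    # Example: with t1 = 1000 s and rho = 0.05, 16 nodes give ~109.4 s,
    # a speedup of ~9.1x instead of the linear 16x.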

In this work, we assume that the running applications subject to malleable reconfiguration are modeled (see Footnote 1) in order to predict their performance when malleable reconfigurations are applied. The malleable scheduler uses two different models: the scalability model and the reconfiguration overhead model. The scalability model estimates the application execution time when the job is executed with a different number of processes. In general, this model can be generated either offline, based on previous execution records [25], or online, with the support of monitoring libraries [26]. In particular, we have mathematically modeled the simulated job performance using the previously defined \(\rho\) parameter as a model input. Figure 3b shows an example in which the real scalability of one application is compared with the one predicted by the model for \(\rho = 0.05\).

The main idea behind the performance-aware scheduling algorithms presented in this work is that the scheduler selects jobs with higher scalability for expanding and jobs with lower scalability for shrinking. Note that, in practice, the application scalability models may have a small error and not be perfectly accurate with respect to the actual application, which might lead to a wrong selection of applications. In the experimental section, we evaluate this situation and show that a certain error in the model does not greatly affect the performance of the malleable scheduling algorithm: suitable jobs are still selected, even if not always the best-fitting ones.

Fig. 3 The performance modeling of one application

The reconfiguration overhead model takes into account the redistribution of data, the creation or destruction of processes, and the allocation or release of processors that occur when a job is reconfigured. We denote these operations as reconfiguration overheads. These overheads have also been incorporated into both the simulation framework and the application modeling components.

The cost of redistributing data depends on the type of application, the memory footprint, and other factors such as the network performance. For a given application, the reconfiguration overhead depends on the number of processors the application has before and after the reconfiguration [27]. We assume that the reconfiguration overhead for a given application is linearly proportional to the change in the number of nodes. Following [28], Eqs. 1 and 2 compute the overhead for data redistribution \(T_{dr}\) and the overhead for process creation \(T_p\).

$$T_{dr} = \alpha (\Delta N) + \frac{\beta }{N}$$
(1)
$$T_p = b (\Delta N)$$
(2)

Here, \(\alpha\) and \(\beta\) are constants, b is the per-node process-creation cost (proportional to the amount of data specified per node), \(\Delta N\) is the change in the number of processors, and N is the total number of processors involved in the reconfiguration. For example, the data redistribution cost of reconfiguring from 5 to 10 nodes is lower than that of reconfiguring from 5 to 20 nodes: the total nodes involved are 15 and 25, while the differences \(\Delta N\) are 5 and 15, respectively.
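For illustration, the two overheads can be transcribed directly into code. The function below follows Eqs. 1 and 2, with N taken as the total number of nodes involved (the sum of the node counts before and after, as in the example above); summing \(T_{dr}\) and \(T_p\) into a single total is our simplification.

    def reconfiguration_overhead(n_before: int, n_after: int,
                                 alpha: float, beta: float, b: float) -> float:
        """Overhead of Eqs. (1) and (2): data redistribution
        T_dr = alpha * dN + beta / N plus process creation T_p = b * dN,
        where dN is the change in nodes and N the total nodes involved."""
        d_n = abs(n_after - n_before)
        n_total = n_before + n_after          # e.g., 5 -> 10 gives N = 15
        t_dr = alpha * d_n + beta / n_total   # Eq. (1)
        t_p = b * d_n                         # Eq. (2)
        return t_dr + t_p

The descriptions of the scheduling algorithm and policies are provided in the next section.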

3 Malleable backfilling scheduling

The scheduling algorithm is an extension of EASY backfilling (EBF) that takes malleability into account. The basic idea is to combine application malleability with monitoring and modeling in order to incorporate application scalability into the decision-making process. Figure 4 describes how the system components cooperate: the system receives the applications and resources and then applies the performance model to help the feasibility component decide on reconfiguration events. All these components are simulated and tested within the simulation framework.

Fig. 4 System and simulation framework overview, illustrating how the system components receive their inputs and are configured using the proposed MEBF

Malleable EASY Backfilling (MEBF) (Algorithm 1) periodically receives a list of jobs and nodes, from which it extracts sub-lists according to their state and type: pending jobs, running jobs, running jobs that are malleable, and free nodes (lines 1–4). First, it attempts to serve each submitted job when it arrives. A job is allocated directly if enough resources are available to match the application’s preferred number of nodes. Otherwise, MEBF allows the job to adapt to the availability of the system’s resources (lines 6–9), as long as it does not go below the minimum limit (\({nodes_{\text {min}}}\)). Here, the function GetResources returns the number of resources that can be assigned to the job, while the function Allocate performs the actual allocation. If an arriving job cannot be allocated because of insufficient available resources, the system state is examined to decide between backfilling, shrink, and expand. If there are no free resources at all while at least one job is pending, the Shrink procedure is invoked (lines 10–12). Backfill, on the other hand, is invoked to find another job that can be allocated on the available nodes when there are free resources but not enough for the first queued job (lines 14–16). When no BackFill\(_{\text {Job}}\) is found (e.g., because every candidate’s estimated execution time exceeds the time until the running jobs release the resources), a reconfiguration of running applications can be performed instead (lines 17–18). This reconfiguration can be either a shrink or an expand as an alternative to backfilling (we denote these reconfigurations as expand\(^+\) and shrink\(^+\)). Otherwise, when the system load decreases, so that all jobs are running and idle resources are available, the Expand procedure can be applied (lines 20–21).
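The control flow of Algorithm 1 can be summarized in a condensed Python sketch. The helper names (GetResources, Allocate, Shrink, Expand, Backfill) follow the text; the job and node attributes are our assumptions, and the line-number comments refer to Algorithm 1.

    def mebf_schedule(jobs, nodes):
        """Condensed sketch of Algorithm 1 (MEBF); the line-number comments
        refer to Algorithm 1, and the helper routines are assumed given."""
        pending   = [j for j in jobs if j.state == "pending"]    # lines 1-4
        running   = [j for j in jobs if j.state == "running"]
        malleable = [j for j in running if j.is_malleable]
        free      = [n for n in nodes if n.is_free]

        for job in pending:
            n = get_resources(job, free)          # assignable node count
            if n >= job.nodes_min:                # fits, possibly adapted
                allocate(job, n)                  # lines 6-9
            elif not free:                        # nothing free, job pending
                shrink(malleable)                 # lines 10-12 (Algorithm 2)
            else:                                 # free, but not enough
                bf_job = backfill(pending, free)  # lines 14-16
                if bf_job is None:                # no suitable backfill job
                    reconfigure(malleable, free)  # lines 17-18: expand+/shrink+
        if not pending and free:                  # low load, idle resources
            expand(malleable, free)               # lines 20-21 (Sect. 3.1)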

Figure 5 provides examples of the proposed idea and approaches. Each job is represented as a rectangle labeled with its total number of processes and the number of processes per node; for example, 4x1 means that each of the four processes runs on a different compute node. The jobs arrive in the queue in the order J1, J2, J3, J4, J5, J6, J7, J8. First, Fig. 5a shows how the resource gap is filled with J4, J7, and J8 from the queue using the backfilling approach. When a reconfiguration is applied to a malleable application, as shown in Fig. 5b, J3 is shrunk to run its eight processes on four nodes instead of eight, reducing queue time, while J6 expands to terminate earlier with better resource utilization. The combination of reconfiguration events (i.e., shrink and expand) with EBF forms the core idea of MEBF. In MEBF, a shrink event is executed when no more resources are available, while an expand event is executed when the queue is empty. In some cases, no matching job can be found in the queue for the resource gaps, illustrated by areas (A) and (B) in Fig. 5b; the MEBF approach leaves these gaps unused. For the sake of distinction, we refer to this approach as MEBF-basic. We explored two strategies as extensions to MEBF-basic: MEBF-expand\(^+\) and MEBF-shrink\(^+\). Figure 5c shows how MEBF-expand\(^+\) performs an additional expand even if the queue is not empty. Conversely, MEBF-shrink\(^+\) (Fig. 5d) shrinks running applications to make more space for waiting jobs even though the system resources are not fully utilized.

Fig. 5 Toy example of scheduling eight jobs using a EBF, b MEBF-basic, c MEBF-expand\(^+\), and d MEBF-shrink\(^+\). The X-axis represents time, and the Y-axis represents the number of nodes

In the MEBF approaches, application reconfiguration is applied according to the scalability exhibited by the application. In the case of shrink, as indicated in Algorithm 2, the routine first receives the list of all malleable running applications and then excludes the previously reconfigured ones; this gives the shrink routine enough awareness to avoid repeated reconfigurations (lines 1 and 5). The running applications are sorted by their scalability, and the least scalable application becomes the strongest candidate (line 2). The share_factor determines the number of nodes that can be granted, according to the resources allocated to the candidate job (lines 3–4). If the candidate has not yet been reconfigured and its reconfiguration is predicted to be Feasible, the job is shrunk (lines 5–7). The expand function works in the opposite way: it picks the most scalable application and extends it to a specific number of resources according to the selected expand policy, as detailed in Sect. 3.1.
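A sketch of the shrink routine follows, under the same assumptions as the MEBF sketch above; feasible corresponds to the feasibility check of Algorithm 3 (discussed next), and share_factor takes the value reported in Sect. 4.3.

    SHARE_FACTOR = 0.4   # value used in the experiments (Sect. 4.3)

    def shrink(malleable_running):
        """Sketch of Algorithm 2: release nodes from the least scalable job."""
        candidates = [j for j in malleable_running if not j.reconfigured]
        candidates.sort(key=lambda j: j.scalability)      # least scalable first
        for job in candidates:
            grant = int(job.nodes_assigned * SHARE_FACTOR)
            target = max(job.nodes_assigned - grant, job.nodes_min)
            if target < job.nodes_assigned and feasible(job, target):
                released = job.nodes_assigned - target
                job.reconfigure(target)                   # shrink the job
                job.reconfigured = True                   # avoid repeats
                return released                           # nodes handed back
        return 0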

In all reconfiguration policies, the adjustment follows the confirmation from the feasibility routine (Algorithm 3). The feasibility routine receives the candidate job and assesses it based on various factors, including the job’s characteristics, the reconfiguration cost, and the potential impact of the resource adjustment on job performance. First, the remaining execution time of the running candidate is evaluated: if the job is expected to run for a sufficiently long time, the reconfiguration benefits are more worthwhile (line 2). Here, a threshold parameter \(\theta\) is set to half of the initially estimated job execution time. The performance model (line 3) estimates the application execution time when it is executed with the new number of processes, including the related overhead. The reconfiguration is considered feasible if the ratio between the estimated execution time with the new processes and the static execution time does not exceed a predefined threshold \(\gamma\) (lines 4–6). After running many experiments, we determined that the optimal value for the \(\gamma\) threshold is 2; that is, the new execution time should not exceed twice the static execution time (i.e., the execution time without any changes to the assigned resources). Exceeding this limit could negatively impact performance.
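Putting the pieces together, the feasibility check can be sketched as follows. It composes the scalability and overhead models sketched in Sect. 2.2; the job attribute names are assumptions, and \(\theta\) and \(\gamma\) take the values reported in Sect. 4.3.

    THETA = 0.5   # remaining-time threshold (Sect. 4.3)
    GAMMA = 2.0   # maximum allowed ratio to the static runtime (Sect. 4.3)

    def feasible(job, new_nodes: int) -> bool:
        """Sketch of Algorithm 3: is reconfiguring `job` to `new_nodes`
        worthwhile? Composes the two models sketched in Sect. 2.2."""
        # Line 2: enough work must remain for the change to pay off.
        if job.remaining_time() < THETA * job.estimated_runtime:
            return False
        # Line 3: predict the new runtime, including the reconfiguration cost.
        t_new = (predicted_runtime(job.t1, new_nodes, job.rho)
                 + reconfiguration_overhead(job.nodes_assigned, new_nodes,
                                            job.alpha, job.beta, job.b))
        # Lines 4-6: feasible if the prediction stays within GAMMA times the
        # static execution time (no reconfiguration at all).
        return t_new <= GAMMA * job.static_runtime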

Algorithm 1 Main scheduling, MEBF

Algorithm 2 Reconfiguration event, Shrink

Algorithm 3 Reconfiguration feasibility

3.1 Reconfiguration policies

Reconfiguration policies determine how the expand and shrink operations are implemented. The shrink procedure is implemented using Algorithm 2. In this section, we explain three expand policies: Handoff, Spare, and Intensive.

In the Handoff policy, an expand occurs only when the expanding amount (i.e., the number of nodes that could be added to the job by the expand operation) is significant. A set of constraints, listed in Algorithm 4, limits the newly created processes to the ones that produce the highest performance gain. Initially, the list of running malleable applications is sorted by scalability in descending order (line 1). The expanding amount is chosen between the nodes\(_{\text {max}}\) limit of the candidate job and its currently assigned nodes using GetResources (line 3). If the job would be extended by an amount greater than its currently assigned nodes and the reconfiguration with the new nodes is feasible, the job is expanded (lines 5–7); a sketch follows the algorithm below. The essence of this strategy is to give a chance to the jobs that did not occupy significant space in the first allocation, while limiting expand events to those that are worth the reconfiguration overhead.

Algorithm 4 Expand policy: Handoff
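A sketch of the Handoff policy under the same assumptions as the earlier sketches; the acceptance condition, requiring the expansion to exceed the job's current allocation, is the distinguishing part.

    def expand_handoff(malleable_running, free_nodes):
        """Sketch of Algorithm 4 (Handoff): expand only when the gain is large."""
        candidates = sorted(malleable_running,
                            key=lambda j: j.scalability, reverse=True)
        for job in candidates:
            extra = get_resources(job, free_nodes)   # bounded by job.nodes_max
            # Accept only expansions larger than the current allocation,
            # i.e., expansions worth the reconfiguration overhead.
            if extra > job.nodes_assigned and \
               feasible(job, job.nodes_assigned + extra):
                job.reconfigure(job.nodes_assigned + extra)
                return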

The Spare policy (Algorithm 5) is less constrained than the Handoff strategy and takes into account the possibility of newly arriving jobs. The idea of Spare is to use a portion of the free resources while keeping a spare amount of free nodes, thereby reducing the waiting time of jobs that may arrive at different intervals. As in Handoff, the running applications are sorted by their scalability, and the expanding amount is determined by GetResources (lines 1–3). If the expanding amount is larger than a certain ratio of the currently allocated nodes and the reconfiguration by this amount is considered feasible, the expand takes place (lines 6–7). After running many experimental tests, we determined that Spare provides the best performance when its ratio equals half of the assigned nodes.

Algorithm 5 Expand policy: Spare

The Intensive policy is the least restrictive one. Jobs are expanded to their maximum possible number of compute nodes, even if that consumes all the available resources (i.e., \(Nodes_{\text {free}}\)). Algorithm 6 shows how this policy allows jobs to expand. As in the Handoff and Spare policies, Intensive sorts the running applications according to their scalability, the number of new resources used in the expansion is determined by GetResources, and the Feasibility routine assesses the reconfiguration. If the assessment (line 5) is positive, the job is expanded to the maximum number of resources without any additional restrictions; the sketch after Algorithm 6 contrasts the three policies’ conditions.

Algorithm 6 Expand policy: Intensive
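The Spare and Intensive policies share the skeleton of the Handoff sketch above and differ only in the acceptance condition tested before the feasibility check. A compact comparison of the three conditions, under the same assumptions as before:

    SPARE_RATIO = 0.5   # best-performing Spare ratio found experimentally (Sect. 4.3)

    def accepts_expand(policy: str, current_nodes: int, extra_nodes: int) -> bool:
        """Acceptance condition of each expand policy (sketch); a feasibility
        check (Algorithm 3) still follows a positive answer."""
        if policy == "handoff":      # only significant expansions
            return extra_nodes > current_nodes
        if policy == "spare":        # expand, but keep nodes for future arrivals
            return extra_nodes > SPARE_RATIO * current_nodes
        if policy == "intensive":    # least restrictive: any feasible expansion
            return extra_nodes > 0
        raise ValueError(f"unknown policy: {policy}")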

4 Runtime evaluation

In this section, we first describe the workloads used in the experiments and the experimental platform. After that, a performance comparison between EBF and the policies proposed in this work is carried out.

4.1 Experimental framework

Table 1 describes the main characteristics of the platform used for the evaluations: a cluster based on Intel Xeon Platinum processors and an Intel Omni-Path interconnect with a fat-tree topology. This topology minimizes the number of links in the cluster while maintaining a high bisection bandwidth and a relatively small diameter. It is modeled using SimGrid [18], integrated with ElastiSim (Sect. 2.1). The configuration distributes the nodes over a two-tier cluster where each lower-level router is connected to a group of fifty nodes, all with the same computational power. The platform consists of 500 compute nodes, each with 48 cores (24,000 cores in total). Assuming each processor is an Intel Xeon Platinum 8160 24C at 2.1 GHz, the node peak performance is 3.2 Tflops.

Table 1 Description of the platform used in experimental evaluations

4.2 Workloads

For our study, we selected a real parallel workload log from [29], the KIT ForHLR II log, which we denote as KIT. The main data fields considered from the KIT trace are the job identifier, submit time, run time, wait time, and the number of allocated processors.

The KIT workload log is originally rigid. Therefore, we generated a malleable workload by randomly selecting a certain percentage of the jobs to be malleable. In our experiments, the malleable percentage is 40%, 60%, 80%, or 100% of the jobs, with the remaining jobs being rigid. Table 2 describes the main features of a simulated malleable application. Each malleable job has a minimum and a maximum number of nodes. We considered the number of allocated nodes from the KIT trace log as the preferred number of nodes (\(nodes_{\text {pref}}\)), with the maximum number of nodes (\(nodes_{\text {max}}\)) set to five times \(nodes_{\text {pref}}\) and the minimum number of nodes (\(nodes_{\text {min}}\)) set to half of \(nodes_{\text {pref}}\). KIT also contains the execution time and the number of assigned processors of each job. From these, we calculated the computational load in FLOP executed by the corresponding job. To compute the computational load accurately, we considered the specifications of the cluster [30] on which the KIT workload was originally executed, in particular the processor type and its peak performance in GFLOP/s. The number of processes in the simulated platform is then obtained by matching the processing power of the original platform (where the KIT trace was collected) and the simulated cluster. For example, consider a job that was originally executed on one compute node (48 cores) with an execution time of 45 s. Given that the computational power of each core is 41.6 GFLOP/s, the estimated computational load for this job is 89.8 TFLOP.
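The conversion from a trace entry to a FLOP budget follows directly from the example above; the helper below is our own illustration, using the per-core peak of the original cluster.

    CORE_GFLOPS = 41.6   # per-core peak of the original KIT cluster [30]

    def job_load_tflop(cores: int, runtime_s: float) -> float:
        """Convert a trace entry (cores, runtime) into a computational load,
        assuming the job ran at the original cluster's peak rate."""
        return cores * CORE_GFLOPS * runtime_s / 1000.0   # GFLOP -> TFLOP

    print(job_load_tflop(48, 45.0))   # 89.856 TFLOP (reported as 89.8 above)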

Regarding trace generation, we selected two traces with 1K and 5K jobs. For each of them, we created two sub-traces: one with the original arrival times and one with scaled arrival times. In the scaled version, we reduce the arrival times by 25% compared to the original sample, creating pressure on the system so that we can observe its impact. The workload with no changes is referred to as KIT_normal, while the one with scaled arrival times is referred to as KIT_intensive.

Both samples are used to generate further workloads with different distributions of reconfiguration overheads and scalability levels. The reconfiguration overhead model, described in Sect. 2.2, uses a scale factor to convert the computed cost to bytes; this scale factor is derived from [31]. When the model uses the full value of this scale factor, we call it full reconfiguration overhead (ROF); when it uses half of this factor, we call it half reconfiguration overhead (ROH). In this work, we have considered three scalability levels: low (LS) with \(\rho \in (0.2,0.3]\), medium (MS) with \(\rho \in (0.1,0.2]\), and high (HS) with \(\rho \in (0.0,0.1]\). In each level, the application \(\rho\) values follow a normal distribution within the corresponding interval. Table 4 summarizes these workloads. For example, \(KIT\_normal\_LS\_ROH\) represents a workload with the original arrival times and low scalability, tested with half the reconfiguration overhead.

Table 2 Workload description
Table 3 Workload specifications
Table 4 Workload categorization according to the system utilization level, the application scalability, and the reconfiguration overheads

4.3 Experimental results

This section contains the experimental results obtained from simulations. We used the average turnaround time, the average slowdown, and the average CPU utilization as evaluation metrics. While the turnaround time (TAT) is computed as the difference between a job’s completion time and its submission time, the average turnaround time (TAT\(_{\text {avg}}\)) is the sum of the jobs’ TATs divided by the workload size; accordingly, a lower value means better performance. The slowdown is defined as the average turnaround time divided by the average static execution time (i.e., the execution time when the workload is executed without any reconfigurations). The average CPU utilization, Eq. 3, is the average utilization across all processors; a higher value means better CPU utilization.

$$\begin{aligned} \text {CPU}_{\text {utilization}} = \frac{\sum \text {CPU}_{\text {used}}}{\text {CPUs}_{\text {total}} \times \text {Time}} \end{aligned}$$
(3)
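For reference, the three metrics can be computed as follows; the job attributes are assumptions about what the simulator reports, with core-seconds standing in for CPU\(_{\text {used}}\) in Eq. 3.

    def evaluate(jobs, total_cpus: int, horizon_s: float):
        """Compute TAT_avg, slowdown, and average CPU utilization (sketch);
        core_seconds stands in for CPU_used in Eq. (3)."""
        tat = [j.completion_time - j.submit_time for j in jobs]
        tat_avg = sum(tat) / len(jobs)                    # lower is better
        static_avg = sum(j.static_runtime for j in jobs) / len(jobs)
        slowdown = tat_avg / static_avg
        cpu_util = sum(j.core_seconds for j in jobs) / (total_cpus * horizon_s)
        return tat_avg, slowdown, cpu_util                # utilization: higher is better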

We have used the following values for the parameters in this simulation:

  • The \(\alpha\) and \(\beta\) used in the reconfiguration overhead model (Sect. 2.2) are randomly chosen from the interval [0.005, 0.05].

  • The share_factor in the shrink procedure (Algorithm 2) is set to 0.4.

  • The \(\theta\) and \(\gamma\) in the reconfiguration feasibility check (Algorithm 3) are set to 0.5 and 2.0, respectively.

  • The ratio in the Spare policy (Algorithm 5) is set to 0.5.

As a base scenario, we analyzed each policy in terms of the initial allocation of the jobs that are reconfigured. Figure 6 shows the percentage of expand events per job group, where the jobs are grouped (x-axis) by their initial assignment. We observe that the Spare policy affects jobs with small initial allocations, while Handoff tends to include larger jobs in addition to the small ones. The Intensive strategy covers all job sizes.

Fig. 6 The percentage of expand events per job group for the 1K-job workload with the different expand policies: Spare, Handoff, and Intensive

The workloads described in Sect. 4.2 have been tested extensively under different conditions. The first scenario compares the performance of the proposed algorithm under pressure on the system with its performance without any pressure (i.e., with the original arrival times). This scenario uses the workloads KIT_intensive and KIT_normal from Table 4, with sizes of 1K and 5K jobs each. For these workloads, the average turnaround times in Fig. 7 show that MEBF with Handoff, Spare, and Intensive is a clear improvement over EBF for both the intensive and the normal workloads. For the intensive workload with 1K jobs, Handoff provides the best performance among the policies: at this workload size, Handoff allows shrink more often than expand, reducing wait times while limiting expand to only the most important events. The improvements are 40%, 33%, and 37% for Handoff, Intensive, and Spare, respectively. For the intensive workload with 5K jobs, the Spare policy performs best, as it allows a reasonable amount of expansion while retaining a portion of the resources for new incoming jobs in such a large workload. The improvements are 49%, 48%, and 47% for Spare, Intensive, and Handoff, respectively.

Fig. 7 Average turnaround time of workloads with 1K and 5K jobs for intensive (arrival times scaled by 25%) and normal (original arrival times) workloads

The second scenario evaluates different scalability levels, corresponding to the workloads KIT_intensive_LS_ROF, KIT_intensive_MS_ROF, and KIT_intensive_HS_ROF with sizes of 1K and 5K jobs. Figure 8 shows the average turnaround time of EBF compared to the Handoff, Spare, and Intensive policies with low, medium, and high scalability. For both workload sizes and all scalability levels, MEBF provides better performance than EBF. For the 1K-job workload, Handoff is the best approach in the low-scalability case. The statistical data of the 1K experiments (Table 5) show that Handoff creates more opportunities to shrink while limiting expand, and at the same time allows frequent backfilling. In such cases, Handoff uses shrink events in addition to backfilling to reduce the waiting time without affecting the overall performance of the workload. The MEBF policies achieve improvements of 40%, 37%, and 33% for Handoff, Intensive, and Spare, respectively. With medium scalability, the Intensive policy provides a 42% improvement over EBF: at this scalability level, Intensive allows expand to happen more frequently than shrink and backfilling, while Handoff and Spare restrict the malleability of the medium-scalability workload; therefore, Intensive provides the best performance. Spare achieves the best improvement, 17%, for highly scalable workloads, as it allows such a flexible workload to perform expand and shrink operations more than the other policies. On the other hand, for the larger and more intensive workloads with 5K jobs, all MEBF policies provide better performance than EBF, while the differences between the MEBF policies are not significant at this workload size.

Table 5 The percentage of shrink, expand, and backfill events relative to the total number of events
Fig. 8 Average turnaround time of the intensive workloads with 1K and 5K jobs for different levels of scalability

The third scenario evaluates the effect of different levels of reconfiguration overhead. The performance models employ these overheads to decide the suitability of each malleable operation. We tested KIT_intensive_HS with 5K jobs under two conditions: one where the workload bears the full reconfiguration cost, and one where it bears half of this cost. Figure 9 shows the results of these evaluations. All MEBF algorithms perform better than EBF in both cases. At full reconfiguration cost, Handoff is the best approach; it limits expand events while providing better chances to shrink and backfill, which in turn significantly reduces the waiting time. For the low-cost case, Intensive offers the greatest improvement, as it takes advantage of reconfigurations while their cost is low.

Fig. 9 Average turnaround time of the workload with 5K jobs for different reconfiguration overhead levels

In this work, we also compare the different policies in terms of CPU utilization. Figure 10 shows the CPU utilization for the 1K and 5K workloads under different scalability levels. For the low-scalability workload with 1K jobs, the system is better utilized by EBF, with Intensive close to EBF’s utilization. For medium- and high-scalability workloads, the MEBF policies utilize the system better than EBF, as they give jobs the ability to adapt to the system state. For the 5K workload, MEBF reaches 99% utilization due to the intensive workload and the dynamism offered by MEBF. The Intensive policy provides the highest level of dynamism and therefore utilizes resources up to the limit at all scalability levels. In this context, CPU utilization over time is also evaluated. Figure 11 shows that the system is better utilized over time with the MEBF policies than with EBF, due to both malleability and backfilling. When the system reaches the time range of 6000 to 8000 s, it has fewer jobs to schedule; at this point, the MEBF policies are more effective than EBF thanks to expand operations. There are also differences in utilization between the MEBF policies, which corroborate the results in Fig. 10.

Fig. 10 System utilization for the 5000-job and 1000-job workloads under Handoff, Spare, Intensive, and EBF with different levels of scalability

Fig. 11 CPU utilization over time for the 1K KIT_intensive_HS_ROF workload, scheduled by EBF, MEBF-Handoff, MEBF-Spare, and MEBF-Intensive

In the next scenario, we evaluate the scheduling performance under different percentages of rigid jobs. The workload with 1K jobs (KIT_intensive_MS_ROF, Table 3) has been used to derive four workloads with 40%, 60%, 80%, and 100% malleable jobs; in each, the rigid jobs are randomly selected. Figure 12 shows that MEBF performs better than EBF in terms of turnaround time, even in the presence of a large fraction of rigid jobs. When only 40% of the jobs are malleable, the MEBF policies outperform EBF by a similar margin of 2.1%. Among the policies, Handoff provides the best performance at 60% malleable, where it reduces the turnaround time by 34% compared to EBF. From the experiment statistics, we observe that in this case Handoff allows more backfilling events than expand or shrink events: the presence of rigid jobs and their distribution across the workload creates more gaps and thus more opportunities for backfilling. Since the distribution of rigid jobs is randomly generated, Handoff does not perform as well as the other policies at 80% and 100% malleable, where Intensive provides more efficient malleable scheduling.

Fig. 12 The average turnaround time with different percentages of malleable jobs: 40%, 60%, 80%, and 100%

Table 6 summarizes the slowdown improvement of the MEBF policies over EBF. The highest improvement for the 1K workload is achieved by the Intensive and Spare strategies at medium and high scalability, respectively, with a percentage of 42%. The 5K workload achieves its best performance with the Spare policy, reaching a 49% improvement over EBF at low scalability.

Table 7 outlines the slowdown improvement of MEBF when rigid jobs are present; 60% or 80% of each workload is malleable, while the remainder is rigid. For 60% malleable jobs, the best policy at medium scalability is Handoff, which improves the slowdown by 34%, whereas the 80% malleable, medium-scalability workload improves by 33% when run with Intensive. The workloads in both tables are run with the full reconfiguration cost and scaled arrival times.

Table 6 The slowdown improvement for each fully malleable workload with different sizes and scalability levels
Table 7 The slowdown improvement of the 1K workload with different malleability percentages and scalability levels

In this work, we also study the implications of malleability when backfilling is unable to find a job in the queue that matches the available resources. In such a case, MEBF-basic leaves these resources unused. To address this, we tested the malleability reconfigurations shrink (MEBF-shrink\(^+\)) and expand (MEBF-expand\(^+\)) (see line 17 in Algorithm 1) in addition to those offered by MEBF-basic. The shrink in MEBF-shrink\(^+\) is performed even if resources are available but do not match any job in the queue, while the expand in MEBF-expand\(^+\) is performed even if the queue is not empty.

Fig. 13 The slowdown of the 1K-job workload with MEBF-basic, MEBF-shrink\(^+\), and MEBF-expand\(^+\) over different scalability levels

Figures 13 and 14 show the results of using MEBF-shrink\(^+\) and MEBF-expand\(^+\) with different scalability levels for the workloads with 1K and 5K jobs. MEBF outperforms EBF in all approaches for both workloads. Comparing MEBF-shrink\(^+\) with MEBF-basic, MEBF-shrink\(^+\) offers the same improvement as MEBF-basic, while MEBF-expand\(^+\) provides less improvement. Based on the empirical statistics, we found that although expanding reduces the execution time of jobs, too many expand operations may increase the waiting time, which can negatively impact the overall workload performance. MEBF-basic and MEBF-shrink\(^+\) take better advantage of malleability than MEBF-expand\(^+\) because they trade off the waiting time against the execution time of the jobs. To confirm this behavior under lower pressure, we also tested MEBF-shrink\(^+\) and MEBF-expand\(^+\) with a workload with the original arrival times (i.e., without scaling the arrival times). Figure 15 shows that MEBF-shrink\(^+\) performs better than MEBF-expand\(^+\) under both workloads, KIT_normal_MS and KIT_intensive_MS.

Fig. 14 The slowdown of the 5K-job workload with MEBF-basic, MEBF-shrink\(^+\), and MEBF-expand\(^+\) over different scalability levels

Fig. 15 The slowdown of the medium-scalability 5K workload run with MEBF-shrink\(^+\) and MEBF-expand\(^+\) under scaled (intensive) and normal arrival times

The final scenario evaluates the impact of an error in the application performance model on MEBF performance. Figure 16 evaluates the 1K KIT workload with two models: one that accurately predicts the application performance (no error) and one that overestimates the application scalability (error). The error was introduced as a bias in the \(\rho\) values of each application, but not in the corresponding model; in this way, the real application scalability (see Fig. 3a) is smaller than the one predicted by the model. The application \(\rho\) values were increased by 0.1, so the model estimates a larger scalability than the real one. The results show that, even with this model uncertainty, MEBF performs better than EBF. The reason is that, from a performance point of view, it is not critical to select exactly the applications with the maximum or minimum scalability: suitable jobs are still chosen. Note that EBF is also affected by this uncertainty, as it also uses the performance models.

Fig. 16 Evaluation of the impact of the application model error on MEBF performance

5 Background

While there are many different scheduling algorithms, the simplest and most widely used approach is First Come First Serve (FCFS) scheduling, in which submitted jobs are served based on their arrival time [32]. Although it is fair and predictable [33], FCFS has two main drawbacks: the convoy effect, where all other tasks wait for a large task to release its resources, and fragmentation, where the free processors cannot meet the requirements of the next job and therefore remain idle until additional ones become available [9]. Both effects lead to lower resource utilization than would be possible if smaller tasks took precedence over the large ones [32]. Non-FCFS scheduling, on the other hand, tends to serve small jobs before larger ones to take advantage of the available resources; but when small jobs always move ahead, large jobs risk starvation. The backfilling approach is a compromise: it allows a limited number of small jobs to use idle resources when the next job in order cannot be served [34].

Depending on who decides the number of nodes and when the decision happens, HPC jobs can be divided into four categories: rigid, evolving, moldable, and malleable [35]. For rigid and evolving jobs, users decide how many nodes to use. For rigid jobs, the decision is made at submission time, before execution; evolving jobs may change their requirements during execution, with the change request initiated by the application itself. For moldable and malleable jobs, users specify a range of nodes on which a job can run, and the scheduler decides how many nodes to use; for moldable jobs, the scheduler makes this decision at the start of execution. Malleable and evolving jobs can dynamically shrink or grow the resources on which they execute during runtime [36]. By shrinking, a malleable job releases part of its assigned resources, allowing new jobs waiting in the queue to start or other malleable jobs to be reconfigured to benefit from the released resources. Expanding the resources of a malleable job can increase resource utilization and possibly improve the execution of the job. Although the execution of moldable and malleable jobs can improve system utilization and reduce average response time, their use in HPC has never become common [37].

Various challenges may limit the use of malleability. One of them is the need for adequate support from message-passing libraries, middleware, and resource management systems (RMS). Additionally, scheduling malleable jobs requires communication between the running jobs and the RMS. This entails tracking pending jobs and prioritizing which jobs to expand or shrink at a given point in time. Malleable job schedulers need to decide when to increase or decrease the resources of malleable jobs [28]; this is done by giving the scheduling algorithm the ability to deal with malleable jobs and to apply the malleability decisions.

Many previous research papers have addressed the problem of scheduling malleable workloads and allocating resources to such jobs in parallel systems by integrating new techniques into backfilling or FCFS policies [1]. Sonmez et al. [38] proposed two strategies for managing adaptive applications: the first expands only the longest-running jobs, based on start time, while the other distributes the new resources evenly across all jobs. In both cases, their experimental results clearly demonstrate the positive impact of malleability on application performance.

An adaptive job scheduler for managing malleable applications is proposed in [39]. It uses a variant of the equipartitioning algorithm to schedule adaptive applications: job execution can start on a partition, but the job’s expansion is restricted to that partition. The main idea is to enhance the capabilities of the load balancer to allow shrinking, expanding, or changing the set of processors allocated to a job.

In [28], several scheduling algorithms for adaptive (i.e., evolving and malleable) applications are proposed. Although different scheduling policies were used to apply the adaptive events, in all cases the results showed that workloads with elastic applications improved both application and system performance compared to the same workloads with only rigid applications.

Utrera et al. [1] extended FCFS scheduling to support malleability. They used the concept of virtual malleability, which can be applied to all jobs regardless of their categorization. Their shrink policy halves the number of processors assigned to the oldest job. They claim that their approach does not incur any overhead for redistributing data. Even under this assumption, the results obtained compared to EASY backfilling are impressive enough to show the power of malleability to improve system performance.

The work in [14] illustrates the potential benefits and challenges of combining malleability with scheduling policies. The authors analyzed the quality of FCFS scheduling for a mix of rigid and malleable jobs and showed experimentally how average response time and utilization can be improved by considering malleability. In some cases, for example [40], the scheduler must collect data about the performance of the running applications to decide which should receive additional processors and which should give processors up. A variant of backfilling scheduling in [41] enables malleability by shrinking the resources of running jobs to make room for jobs that run with a reduced amount of resources, but only if the estimated slowdown improves over the static approach.

Our algorithm differs from previous works in that it uses dedicated application performance models to aid the scheduler’s decision making. Additionally, the proposed algorithm is adaptive and considers the existing system resources. While most previous works focused on policies for shrinking running jobs rather than expanding them, this work includes three novel policies for expansion and one for shrinking, with each expansion policy targeting a specific workload type based on job size and scalability level, which sets it apart from previous works. A scalability model for applications is simulated and evaluated using a discrete-event simulator, ElastiSim.

6 Conclusion

In this paper, we have proposed scheduling algorithms for malleable workloads based on performance models. We proposed three expand policies, Handoff, Spare, and Intensive, each targeting a specific type of workload and each better suited to a particular scalability level than the others. For example, the Intensive policy worked better for highly scalable workloads, while Handoff provided better performance for workloads with low scalability. We evaluated the proposed algorithm and policies using real workload traces. The main result of our study is that the proposed algorithms achieve good improvements over EASY backfilling in all scenarios. As the workload grows, the performance differences between the proposed policies become insignificant. In addition, the reconfiguration cost affects the performance. Another finding was that performing too many reconfigurations can negatively impact workload performance, so it is critical to find the most appropriate strategy to achieve the best performance. For example, MEBF-expand\(^+\) did not provide better performance than MEBF-basic, and MEBF-shrink\(^+\) provided results similar to MEBF-basic. Finally, MEBF-basic was able to improve performance even when a significant portion of the workload was rigid. In the future, we plan to incorporate I/O into the scheduling decision logic and to extend the application modeling and simulation to other classes of applications, such as I/O-intensive or communication-intensive applications and applications with different execution phases.