1 Introduction

The job scheduler plays a critical role in HPC platforms by allocating resources to submitted jobs and implementing system management policies [1]. One of the best-known scheduling algorithms is backfilling, an optimized allocation strategy introduced by Lifka [2]. This approach executes a queued job ahead of jobs submitted earlier, with the aim of using otherwise idle computational resources. There are two types of backfilling: conservative backfilling and EASY (Extensible Argonne Scheduling sYstem) backfilling [3]. In the conservative approach, a reservation is made for each incoming job, and jobs are allowed to move ahead in the queue on the condition that they do not delay any other queued job beyond its reserved start time. EASY backfilling, on the other hand, takes a more aggressive approach and allows short jobs to be executed as long as they do not delay the job at the head of the queue. Both conservative and EASY backfilling have benefited some workloads while negatively affecting others [4, 5]. For example, for workloads typical of IBM Scalable POWERparallel Systems (IBM SP2), EASY backfilling showed better performance, whereas for other workloads both algorithms behave similarly [6]. In practice, the EASY backfilling algorithm is common, and it is supported by all major production schedulers [7, 8]. Note that both approaches require users to estimate the running time of their jobs; interestingly, inaccurate user runtime estimates may even be beneficial for these algorithms [9]. Although these techniques shorten job wait times, in practice a significant fraction of the jobs cannot be effectively executed because they require more resources than are currently available.

Recently, application malleability has become a hot research topic in several large-scale projects (ADMIRE [10], DEEP-SEA [11], RED-SEA [12], and TEXTAROSSA [13], among others). In the context of application scheduling, malleability comes to the rescue: it supports dynamically increasing or decreasing the number of processors of running applications [14]. We denote this action as application reconfiguration. This feature adds an extra dimension to the scheduling problem, since resources can be dynamically re-allocated from some applications to others during execution. However, new factors, such as application scalability or the reconfiguration overhead [15], must be considered to determine the feasibility of a malleable operation. This work presents a malleable backfilling algorithm that can be configured with different execution policies. The work has been developed in the context of the ADMIRE project [10], an EU-funded project that targets the development of a controller providing a holistic view of the system by combining compute-node and application monitoring and modeling. Additionally, the project aims to develop malleable compute and I/O runtimes and scheduling techniques. The proposed malleable scheduling algorithm uses the models produced in the ADMIRE framework to determine which jobs are reconfigured and scheduled. Experimental results show how this approach yields more efficient application execution.

The main contributions of this work are summarized as follows:

  • We present a dynamic scheduling algorithm as a performance-driven backfilling variant over a cluster environment.

  • We propose several malleable policies based on application scalability to increase system utilization and decrease slowdown and turnaround time.

  • We include novel application scalability models and reconfiguration overhead models to support the scheduler’s decision making.

  • We explore the impact of malleability when backfilling fails to find a suitable job for the available resources.

  • We include a comprehensive evaluation and analysis of different scheduling algorithms using real workloads with a variety of scenarios.

The remainder of this article is structured as follows: Sect. 2 introduces an overview of the execution framework and system components. Section 3 describes the proposed main algorithm and reconfiguration strategies. Section 4 presents the experimental methodology, covering the system platform, the performance metrics used, and the results. Section 5 provides the background and discusses related work. Finally, Sect. 6 summarizes the work and outlines future directions.

Fig. 1 ADMIRE architecture. The hardware components are shown in green, and the ADMIRE components are in blue. Dark blue boxes show the new components developed for this work (color figure online)

2 Execution framework

The main objective of the ADMIRE project [10] is to create an active compute and I/O stack that dynamically adjusts compute and storage requirements through intelligent global coordination, elasticity of computation and I/O, and the scheduling of storage resources along all levels of the storage hierarchy. Figure 1 illustrates the malleable execution framework considered here. The compute, I/O, and network components are shown in green, while the ADMIRE components are shown in blue. The latter include user-level applications, controllers, algorithms, models, and resource managers. The user-level applications comprise both the running applications and the libraries that provide malleability and monitoring support. The controllers manage the application execution, generate the malleability commands, and collect the monitored performance metrics. The next level, algorithms and models, includes the decision-making logic and the application models used to predict application performance for a different number of processes. Finally, Slurm is used as the resource manager, enacting the decisions of the scheduling algorithms; note that Slurm implements backfilling via the sched/backfill plugin [16, 17]. The main contributions of this work to the execution framework are shown in dark blue: the malleable scheduling algorithms and the application performance models. To validate our proposal, we evaluate the algorithms and models using a batch-system simulator, ElastiSim. Note that I/O scheduling has not been considered in this work.

2.1 ElastiSim

As experiments on real systems are expensive and time-consuming, we used ElastiSim to evaluate our proposed scheduling algorithms. ElastiSim is a batch-system simulator for malleable workloads that simulates jobs, applications, the scheduling algorithm, and the platform. It is a discrete-event simulator written in C++ and based on the simulation framework SimGrid [18]. In contrast to other general-purpose simulators in the literature, such as Batsim [19], AccaSim [20], or Alea [21], ElastiSim has built-in support for malleable jobs augmented with performance models, which we exploit to describe malleability in simulated applications. As ElastiSim also provides a Python interface to integrate custom scheduling algorithms, simulating scheduling scenarios introduces less overhead than simulators based on actual batch systems [22, 23].

To evaluate scheduling algorithms for large-scale distributed systems, ElastiSim introduces system actors and separates the concerns of platform simulation, user interaction (i.e., job submission), and the batch system, including the scheduling algorithm. As illustrated in Fig. 2, ElastiSim provides interfaces to define the workload and the platform that executes it. To define the scheduling logic, users integrate their scheduling policies through the provided scheduling interface, which runs in a separate process and constantly communicates with the simulation process to forward scheduling decisions. We used the Python interface of ElastiSim to develop the algorithms proposed in this work.

Fig. 2 The architecture of ElastiSim. The scheduling algorithm runs as an external process and is in constant communication with the simulation process. Taken from [24]

ElastiSim’s workload model comprises jobs and their corresponding application model. While jobs define high-level attributes such as the requested number of nodes, application models define the actual load executed on the system. To define an application model, ElastiSim introduces phases and tasks. Each application model consists of phases, and each phase comprises tasks representing system activities such as computation, communication, or I/O operations. To allow the reconfiguration of malleable applications during runtime, ElastiSim introduces scheduling points where reconfigurations can safely occur. Malleable applications automatically provide scheduling points between phases to apply reconfiguration requests issued by the scheduler.

As the system load introduced by an application depends on the configuration (i.e., the number of resources) and scales linearly only in rare cases, ElastiSim supports performance models to define the load of a task executed on the system (e.g., the number of bytes to transfer). Users can define performance models (i.e., mathematical functions) dependent on the number of nodes to allow the workload to adapt to a new configuration and define its scalability. With a final attribute, the distribution pattern, users can define which assigned resources participate in the task (e.g., all ranks or root only), allowing the description of simulated clones of real-world applications.
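To make this structure concrete, the following is a minimal conceptual sketch of the phase/task/performance-model decomposition described above. It is our own illustration in plain Python, not ElastiSim's actual interface; the class and field names are assumptions for exposition.

    from dataclasses import dataclass, field
    from typing import Callable, List

    # Conceptual sketch only (not ElastiSim's actual API): an application is a
    # sequence of phases; each phase groups tasks whose load is a function of
    # the number of assigned nodes; reconfigurations apply between phases.

    @dataclass
    class Task:
        kind: str                              # "compute", "comm", or "io"
        load_model: Callable[[int], float]     # nodes -> load (FLOP, bytes, ...)
        distribution: str = "all_ranks"        # which ranks participate

    @dataclass
    class Phase:
        tasks: List[Task] = field(default_factory=list)
        # A scheduling point follows each phase, where a malleable job can
        # safely apply a pending reconfiguration request.

    # Example: a compute task whose per-node load shrinks with the node count.
    compute_task = Task("compute", load_model=lambda n: 1.2e12 / n)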

2.2 Performance models

In this work, we have introduced a \(\rho\) parameter into the simulation framework, with \(\rho \ge 0\). This parameter makes it possible to reproduce the behavior of applications with different scalability degrees, i.e., sub-linear speedups. Figure 3a shows the application execution time for different \(\rho\) values, with \(\rho\) equal to zero representing linear scalability and \(\rho\) values greater than zero representing different degrees of sub-linear scalability. It is important to highlight that only CPU-intensive applications have been considered in this work. The results can also be applied to other classes of applications (such as I/O- and communication-intensive jobs) because they usually exhibit sub-linear scalability, which is equivalent (from the performance point of view) to considering a \(\rho\) greater than zero.
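The exact closed form behind \(\rho\) is internal to the simulator, but an Amdahl-style parameterization reproduces the described behavior: \(\rho = 0\) yields linear scalability and larger \(\rho\) values yield increasingly sub-linear speedups. The following sketch is our assumption of such a form, not the simulator's verbatim model.

    def predicted_runtime(t1: float, nodes: int, rho: float) -> float:
        """Amdahl-style reading of the rho parameter (our assumption):
        rho = 0 yields linear scalability (t1 / nodes), while rho > 0
        yields increasingly sub-linear speedups."""
        assert nodes >= 1 and rho >= 0.0
        return t1 * (rho + (1.0 - rho) / nodes)

    # Example: with t1 = 1000 s and rho = 0.05, 16 nodes give ~109.4 s,
    # a speedup of ~9.1x instead of the linear 16x.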

In this work, we assume that the running applications subject to malleable reconfiguration are modeled (see Footnote 1) in order to predict their performance when malleable reconfigurations are applied. The malleable scheduler uses two different models: the scalability model and the reconfiguration overhead model. The scalability model estimates the application execution time when the job is executed with a different number of processes. In general, this model can be generated either offline, based on previous execution records [25], or online, with the support of monitoring libraries [26]. In particular, we have mathematically modeled the simulated job performance using the previously defined \(\rho\) parameter as a model input. Figure 3b shows an example in which the real scalability of one application is compared with the one predicted by the model for \(\rho = 0.05\).

The main idea behind the performance-aware scheduling algorithms presented in this work is that the scheduler selects jobs with higher scalability for expanding and jobs with lower scalability for shrinking. Note that, in practice, the application scalability models may have a small error and not be perfectly accurate with respect to the actual application, which might lead to a wrong selection of applications. In the experimental section, we evaluate this situation and show that a certain error in the model does not greatly affect the performance of the malleable scheduling algorithm: suitable jobs are still selected, even if not always the best-fitting ones.

Fig. 3 The performance modeling of one application

The reconfiguration overhead model takes into account the redistribution of data, the creation or destruction of processes, and the allocation or release of processors that occur when a job is reconfigured. We denote these operations as reconfiguration overheads. These overheads have also been incorporated into both the simulation framework and the application modeling components.

The cost of redistributing data depends on the type of application, the memory footprint, and other factors such as the network performance. For a given application, the reconfiguration overhead depends on the number of processors the application has before and after the reconfiguration [27]. We assume that the reconfiguration overhead for a given application is linearly proportional to the change in the number of nodes. Following [28], Eqs. 1 and 2 compute the overhead for data redistribution \(T_{dr}\) and the overhead for process creation \(T_p\).

$$T_{dr} = \alpha (\Delta N) + \frac{\beta }{N}$$
(1)
$$T_p = b (\Delta N)$$
(2)

Here, \(\alpha\) and \(\beta\) are constants, b is the per-node process-creation cost (proportional to the amount of data specified per node), \(\Delta N\) is the change in the number of processors, and N is the total number of processors involved in the reconfiguration. For example, the data redistribution cost of reconfiguring from 5 to 10 nodes is lower than that of reconfiguring from 5 to 20 nodes: the total nodes involved are 15 and 25, while the differences \(\Delta N\) are 5 and 15, respectively.
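For illustration, the two overheads can be transcribed directly into code. The function below follows Eqs. 1 and 2, with N taken as the total number of nodes involved (the sum of the node counts before and after, as in the example above); summing \(T_{dr}\) and \(T_p\) into a single total is our simplification.

    def reconfiguration_overhead(n_before: int, n_after: int,
                                 alpha: float, beta: float, b: float) -> float:
        """Overhead of Eqs. (1) and (2): data redistribution
        T_dr = alpha * dN + beta / N plus process creation T_p = b * dN,
        where dN is the change in nodes and N the total nodes involved."""
        d_n = abs(n_after - n_before)
        n_total = n_before + n_after          # e.g., 5 -> 10 gives N = 15
        t_dr = alpha * d_n + beta / n_total   # Eq. (1)
        t_p = b * d_n                         # Eq. (2)
        return t_dr + t_p

The descriptions of the scheduling algorithm and policies are provided in the next section.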

3 Malleable backfilling scheduling

The scheduling algorithm is an extension of EASY backfilling (EBF) that takes malleability into account. The basic idea is to combine application malleability with monitoring and modeling in order to incorporate application scalability into the decision-making process. Figure 4 describes how the system components cooperate: the system receives the applications and resources and then applies the performance model to help the feasibility component decide on reconfiguration events. All these components are simulated and tested within the simulation framework.

Fig. 4 System and simulation framework overview, illustrating how the system components receive their inputs and are configured using the proposed MEBF

Malleable EASY Backfilling (MEBF) (Algorithm 1) periodically receives a list of jobs and nodes, from which it extracts sub-lists according to their state and type: pending jobs, running jobs, running jobs that are malleable, and free nodes (lines 1–4). First, it attempts to serve each submitted job when it arrives. A job is allocated directly if enough resources are available to match the application’s preferred number of nodes. Otherwise, MEBF allows the job to adapt to the availability of the system’s resources (lines 6–9), as long as it does not go below the minimum limit (\({nodes_{\text {min}}}\)). Here, the function GetResources returns the number of resources that can be assigned to the job, while the function Allocate performs the actual allocation. If an arriving job cannot be allocated because of insufficient available resources, the system state is examined to decide between backfilling, shrink, and expand. If there are no free resources at all while at least one job is pending, the Shrink procedure is invoked (lines 10–12). Backfill, on the other hand, is invoked to find another job that can be allocated on the available nodes when there are free resources but not enough for the first queued job (lines 14–16). When no BackFill\(_{\text {Job}}\) is found (e.g., because every candidate’s estimated execution time exceeds the time until the running jobs release the resources), a reconfiguration of running applications can be performed instead (lines 17–18). This reconfiguration can be either a shrink or an expand as an alternative to backfilling (we denote these reconfigurations as expand\(^+\) and shrink\(^+\)). Otherwise, when the system load decreases, so that all jobs are running and idle resources are available, the Expand procedure can be applied (lines 20–21).
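The control flow of Algorithm 1 can be summarized in a condensed Python sketch. The helper names (GetResources, Allocate, Shrink, Expand, Backfill) follow the text; the job and node attributes are our assumptions, and the line-number comments refer to Algorithm 1.

    def mebf_schedule(jobs, nodes):
        """Condensed sketch of Algorithm 1 (MEBF); the line-number comments
        refer to Algorithm 1, and the helper routines are assumed given."""
        pending   = [j for j in jobs if j.state == "pending"]    # lines 1-4
        running   = [j for j in jobs if j.state == "running"]
        malleable = [j for j in running if j.is_malleable]
        free      = [n for n in nodes if n.is_free]

        for job in pending:
            n = get_resources(job, free)          # assignable node count
            if n >= job.nodes_min:                # fits, possibly adapted
                allocate(job, n)                  # lines 6-9
            elif not free:                        # nothing free, job pending
                shrink(malleable)                 # lines 10-12 (Algorithm 2)
            else:                                 # free, but not enough
                bf_job = backfill(pending, free)  # lines 14-16
                if bf_job is None:                # no suitable backfill job
                    reconfigure(malleable, free)  # lines 17-18: expand+/shrink+
        if not pending and free:                  # low load, idle resources
            expand(malleable, free)               # lines 20-21 (Sect. 3.1)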

Figure 5 provides examples of the proposed idea and approaches. Each job is represented as a rectangle labeled with its total number of processes and the number of processes per node; for example, 4x1 means that each of the four processes runs on a different compute node. The jobs arrive in the queue in the order J1, J2, J3, J4, J5, J6, J7, J8. First, Fig. 5a shows how the resource gap is filled with J4, J7, and J8 from the queue using the backfilling approach. When a reconfiguration is applied to a malleable application, as shown in Fig. 5b, J3 is shrunk to run its eight processes on four nodes instead of eight, reducing queue time, while J6 expands to terminate earlier with better resource utilization. The combination of reconfiguration events (i.e., shrink and expand) with EBF forms the core idea of MEBF. In MEBF, a shrink event is executed when no more resources are available, while an expand event is executed when the queue is empty. In some cases, no matching job can be found in the queue for the resource gaps, illustrated by areas (A) and (B) in Fig. 5b; the MEBF approach leaves these gaps unused. For the sake of distinction, we refer to this approach as MEBF-basic. We explored two strategies as extensions to MEBF-basic: MEBF-expand\(^+\) and MEBF-shrink\(^+\). Figure 5c shows how MEBF-expand\(^+\) performs an additional expand even if the queue is not empty. Conversely, MEBF-shrink\(^+\) (Fig. 5d) shrinks running applications to make more space for waiting jobs even though the system resources are not fully utilized.

Fig. 5 Toy example of scheduling eight jobs using a EBF, b MEBF-basic, c MEBF-expand\(^+\), and d MEBF-shrink\(^+\). The X-axis represents time, and the Y-axis represents the number of nodes

In the MEBF approaches, application reconfiguration is applied according to the scalability exhibited by the application. In the case of shrink, as indicated in Algorithm 2, the routine first receives the list of all malleable running applications and then excludes the previously reconfigured ones; this gives the shrink routine enough awareness to avoid repeated reconfigurations (lines 1 and 5). The running applications are sorted by their scalability, and the least scalable application becomes the strongest candidate (line 2). The share_factor determines the number of nodes that can be granted, according to the resources allocated to the candidate job (lines 3–4). If the candidate has not yet been reconfigured and its reconfiguration is predicted to be Feasible, the job is shrunk (lines 5–7). The expand function works in the opposite way: it picks the most scalable application and extends it to a specific number of resources according to the selected expand policy, as detailed in Sect. 3.1.
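A sketch of the shrink routine follows, under the same assumptions as the MEBF sketch above; feasible corresponds to the feasibility check of Algorithm 3 (discussed next), and share_factor takes the value reported in Sect. 4.3.

    SHARE_FACTOR = 0.4   # value used in the experiments (Sect. 4.3)

    def shrink(malleable_running):
        """Sketch of Algorithm 2: release nodes from the least scalable job."""
        candidates = [j for j in malleable_running if not j.reconfigured]
        candidates.sort(key=lambda j: j.scalability)      # least scalable first
        for job in candidates:
            grant = int(job.nodes_assigned * SHARE_FACTOR)
            target = max(job.nodes_assigned - grant, job.nodes_min)
            if target < job.nodes_assigned and feasible(job, target):
                released = job.nodes_assigned - target
                job.reconfigure(target)                   # shrink the job
                job.reconfigured = True                   # avoid repeats
                return released                           # nodes handed back
        return 0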

In all reconfiguration policies, the adjustment follows the confirmation from the feasibility routine (Algorithm 3). The feasibility routine receives the candidate job and assesses it based on various factors, including the job’s characteristics, the reconfiguration cost, and the potential impact of the resource adjustment on job performance. First, the remaining execution time of the running candidate is evaluated: if the job is expected to run for a sufficiently long time, the reconfiguration benefits are more worthwhile (line 2). Here, a threshold parameter \(\theta\) is set to half of the initially estimated job execution time. The performance model (line 3) estimates the application execution time when it is executed with the new number of processes, including the related overhead. The reconfiguration is considered feasible if the ratio between the estimated execution time with the new processes and the static execution time does not exceed a predefined threshold \(\gamma\) (lines 4–6). After running many experiments, we determined that the optimal value for the \(\gamma\) threshold is 2; that is, the new execution time should not exceed twice the static execution time (i.e., the execution time without any changes to the assigned resources). Exceeding this limit could negatively impact performance.
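Putting the pieces together, the feasibility check can be sketched as follows. It composes the scalability and overhead models sketched in Sect. 2.2; the job attribute names are assumptions, and \(\theta\) and \(\gamma\) take the values reported in Sect. 4.3.

    THETA = 0.5   # remaining-time threshold (Sect. 4.3)
    GAMMA = 2.0   # maximum allowed ratio to the static runtime (Sect. 4.3)

    def feasible(job, new_nodes: int) -> bool:
        """Sketch of Algorithm 3: is reconfiguring `job` to `new_nodes`
        worthwhile? Composes the two models sketched in Sect. 2.2."""
        # Line 2: enough work must remain for the change to pay off.
        if job.remaining_time() < THETA * job.estimated_runtime:
            return False
        # Line 3: predict the new runtime, including the reconfiguration cost.
        t_new = (predicted_runtime(job.t1, new_nodes, job.rho)
                 + reconfiguration_overhead(job.nodes_assigned, new_nodes,
                                            job.alpha, job.beta, job.b))
        # Lines 4-6: feasible if the prediction stays within GAMMA times the
        # static execution time (no reconfiguration at all).
        return t_new <= GAMMA * job.static_runtime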

Algorithm 1 Main scheduling, MEBF

Algorithm 2 Reconfiguration event, Shrink

Algorithm 3 Reconfiguration feasibility

3.1 Reconfiguration policies

Reconfiguration policies determine how the expand and shrink operations are implemented. The shrink procedure is implemented using Algorithm 2. In this section, we explain three expand policies: Handoff, Spare, and Intensive.

In the Handoff policy, an expand occurs only when the expanding amount (i.e., the number of nodes that could be added to the job by the expand operation) is significant. A set of constraints, listed in Algorithm 4, limits the newly created processes to the ones that produce the highest performance gain. Initially, the list of running malleable applications is sorted by scalability in descending order (line 1). The expanding amount is chosen between the nodes\(_{\text {max}}\) limit of the candidate job and its currently assigned nodes using GetResources (line 3). If the job would be extended by an amount greater than its currently assigned nodes and the reconfiguration with the new nodes is feasible, the job is expanded (lines 5–7); a sketch follows the algorithm below. The essence of this strategy is to give a chance to the jobs that did not occupy significant space in the first allocation, while limiting expand events to those that are worth the reconfiguration overhead.

Algorithm 4 Expand policy: Handoff
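A sketch of the Handoff policy under the same assumptions as the earlier sketches; the acceptance condition, requiring the expansion to exceed the job's current allocation, is the distinguishing part.

    def expand_handoff(malleable_running, free_nodes):
        """Sketch of Algorithm 4 (Handoff): expand only when the gain is large."""
        candidates = sorted(malleable_running,
                            key=lambda j: j.scalability, reverse=True)
        for job in candidates:
            extra = get_resources(job, free_nodes)   # bounded by job.nodes_max
            # Accept only expansions larger than the current allocation,
            # i.e., expansions worth the reconfiguration overhead.
            if extra > job.nodes_assigned and \
               feasible(job, job.nodes_assigned + extra):
                job.reconfigure(job.nodes_assigned + extra)
                return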

The Spare policy (Algorithm 5) is less constrained than the Handoff strategy and takes into account the possibility of newly arriving jobs. The idea of Spare is to use a portion of the free resources while keeping a spare amount of free nodes, thereby reducing the waiting time of jobs that may arrive at different intervals. As in Handoff, the running applications are sorted by their scalability, and the expanding amount is determined by GetResources (lines 1–3). If the expanding amount is larger than a certain ratio of the currently allocated nodes and the reconfiguration by this amount is considered feasible, the expand takes place (lines 6–7). After running many experimental tests, we determined that Spare provides the best performance when its ratio equals half of the assigned nodes.

Algorithm 5 Expand policy: Spare

The Intensive policy is the least restrictive one. Jobs are expanded to their maximum possible number of compute nodes, even if that consumes all the available resources (i.e., \(Nodes_{\text {free}}\)). Algorithm 6 shows how this policy allows jobs to expand. As in the Handoff and Spare policies, Intensive sorts the running applications according to their scalability, the number of new resources used in the expansion is determined by GetResources, and the Feasibility routine assesses the reconfiguration. If the assessment (line 5) is positive, the job is expanded to the maximum number of resources without any additional restrictions; the sketch after Algorithm 6 contrasts the three policies’ conditions.

Algorithm 6 Expand policy: Intensive
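The Spare and Intensive policies share the skeleton of the Handoff sketch above and differ only in the acceptance condition tested before the feasibility check. A compact comparison of the three conditions, under the same assumptions as before:

    SPARE_RATIO = 0.5   # best-performing Spare ratio found experimentally (Sect. 4.3)

    def accepts_expand(policy: str, current_nodes: int, extra_nodes: int) -> bool:
        """Acceptance condition of each expand policy (sketch); a feasibility
        check (Algorithm 3) still follows a positive answer."""
        if policy == "handoff":      # only significant expansions
            return extra_nodes > current_nodes
        if policy == "spare":        # expand, but keep nodes for future arrivals
            return extra_nodes > SPARE_RATIO * current_nodes
        if policy == "intensive":    # least restrictive: any feasible expansion
            return extra_nodes > 0
        raise ValueError(f"unknown policy: {policy}")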

4 Runtime evaluation

In this section, we first describe the workloads used in the experiments and the experimental platform. After that, a performance comparison between EBF and the policies proposed in this work is carried out.

4.1 Experimental framework

Table 1 describes the main characteristics of the platform used for the evaluations: a cluster based on Intel Xeon Platinum processors and an Intel Omni-Path interconnect with a fat-tree topology. This topology minimizes the number of links in the cluster while maintaining a high bisection bandwidth and a relatively small diameter. It is modeled using SimGrid [18], integrated with ElastiSim (Sect. 2.1). The configuration distributes the nodes over a two-tier cluster where each lower-level router is connected to a group of fifty nodes, all with the same computational power. The platform consists of 500 compute nodes, each with 48 cores (24,000 cores in total). Assuming each processor is an Intel Xeon Platinum 8160 24C at 2.1 GHz, the node peak performance is 3.2 Tflops.

Table 1 Description of the platform used in experimental evaluations

4.2 Workloads

For our study, we selected a real parallel workload log from [29], the KIT ForHLR II log, which we denote as KIT. The main data fields considered from the KIT trace are the job identifier, submit time, run time, wait time, and the number of allocated processors.

The KIT workload log is originally rigid. Therefore, we generated a malleable workload by randomly selecting a certain percentage of the jobs to be malleable. In our experiments, the malleable percentage is 40%, 60%, 80%, or 100% of the jobs, with the remaining jobs being rigid. Table 2 describes the main features of a simulated malleable application. Each malleable job has a minimum and a maximum number of nodes. We considered the number of allocated nodes from the KIT trace log as the preferred number of nodes (\(nodes_{\text {pref}}\)), with the maximum number of nodes (\(nodes_{\text {max}}\)) set to five times \(nodes_{\text {pref}}\) and the minimum number of nodes (\(nodes_{\text {min}}\)) set to half of \(nodes_{\text {pref}}\). KIT also contains the execution time and the number of assigned processors of each job. From these, we calculated the computational load in FLOP executed by the corresponding job. To compute the computational load accurately, we considered the specifications of the cluster [30] on which the KIT workload was originally executed, in particular the processor type and its peak performance in GFLOP/s. The number of processes in the simulated platform is then obtained by matching the processing power of the original platform (where the KIT trace was collected) and the simulated cluster. For example, consider a job that was originally executed on one compute node (48 cores) with an execution time of 45 s. Given that the computational power of each core is 41.6 GFLOP/s, the estimated computational load for this job is 89.8 TFLOP.
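The conversion from a trace entry to a FLOP budget follows directly from the example above; the helper below is our own illustration, using the per-core peak of the original cluster.

    CORE_GFLOPS = 41.6   # per-core peak of the original KIT cluster [30]

    def job_load_tflop(cores: int, runtime_s: float) -> float:
        """Convert a trace entry (cores, runtime) into a computational load,
        assuming the job ran at the original cluster's peak rate."""
        return cores * CORE_GFLOPS * runtime_s / 1000.0   # GFLOP -> TFLOP

    print(job_load_tflop(48, 45.0))   # 89.856 TFLOP (reported as 89.8 above)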

Regarding trace generation, we selected two traces with 1K and 5K jobs. For each of them, we created two sub-traces: one with the original arrival times and one with scaled arrival times. In the scaled version, we reduce the arrival times by 25% compared to the original sample, creating pressure on the system so that we can observe its impact. The workload with no changes is referred to as KIT_normal, while the one with scaled arrival times is referred to as KIT_intensive.

Both samples are used to generate further workloads with different distributions of reconfiguration overheads and scalability levels. The reconfiguration overhead model, described in Sect. 2.2, uses a scale factor to convert the computed cost to bytes; this scale factor is derived from [31]. When the model uses the full value of this scale factor, we call it full reconfiguration overhead (ROF); when it uses half of this factor, we call it half reconfiguration overhead (ROH). In this work, we have considered three scalability levels: low (LS) with \(\rho \in (0.2,0.3]\), medium (MS) with \(\rho \in (0.1,0.2]\), and high (HS) with \(\rho \in (0.0,0.1]\). In each level, the application \(\rho\) values follow a normal distribution within the corresponding interval. Table 4 summarizes these workloads. For example, \(KIT\_normal\_LS\_ROH\) represents a workload with the original arrival times and low scalability, tested with half the reconfiguration overhead.

Table 2 Workload description
Table 3 Workload specifications
Table 4 Workload categorization according to the system utilization level, the application scalability, and the reconfiguration overheads

4.3 Experimental results

This section contains the experimental results obtained from simulations. We used the average turnaround time, the average slowdown, and the average CPU utilization as evaluation metrics. While the turnaround time (TAT) is computed as the difference between a job’s completion time and its submission time, the average turnaround time (TAT\(_{\text {avg}}\)) is the sum of the jobs’ TATs divided by the workload size; accordingly, a lower value means better performance. The slowdown is defined as the average turnaround time divided by the average static execution time (i.e., the execution time when the workload is executed without any reconfigurations). The average CPU utilization, Eq. 3, is the average utilization across all processors; a higher value means better CPU utilization.

$$\begin{aligned} \text {CPU}_{\text {utilization}} = \frac{\sum \text {CPU}_{\text {used}}}{\text {CPUs}_{\text {total}} \times \text {Time}} \end{aligned}$$
(3)
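For reference, the three metrics can be computed as follows; the job attributes are assumptions about what the simulator reports, with core-seconds standing in for CPU\(_{\text {used}}\) in Eq. 3.

    def evaluate(jobs, total_cpus: int, horizon_s: float):
        """Compute TAT_avg, slowdown, and average CPU utilization (sketch);
        core_seconds stands in for CPU_used in Eq. (3)."""
        tat = [j.completion_time - j.submit_time for j in jobs]
        tat_avg = sum(tat) / len(jobs)                    # lower is better
        static_avg = sum(j.static_runtime for j in jobs) / len(jobs)
        slowdown = tat_avg / static_avg
        cpu_util = sum(j.core_seconds for j in jobs) / (total_cpus * horizon_s)
        return tat_avg, slowdown, cpu_util                # utilization: higher is better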

We have used the following values for the parameters in this simulation:

  • The \(\alpha\) and \(\beta\) used in the reconfiguration overhead model (Sect. 2.2) are randomly chosen from the interval [0.005, 0.05].

  • The share_factor in the shrink procedure (Algorithm 2) is set to 0.4.

  • The \(\theta\) and \(\gamma\) in the reconfiguration feasibility check (Algorithm 3) are set to 0.5 and 2.0, respectively.

  • The ratio in the Spare policy (Algorithm 5) is set to 0.5.

As a base scenario, we analyzed each policy in terms of the initial allocation of the jobs that are reconfigured. Figure 6 shows the percentage of expand events per job group, where the jobs are grouped (x-axis) by their initial assignment. We observe that the Spare policy affects jobs with small initial allocations, while Handoff tends to include larger jobs in addition to the small ones. The Intensive strategy covers all job sizes.

Fig. 6 The percentage of expand events per job group for the 1K-job workload with the different expand policies: Spare, Handoff, and Intensive

The workloads described in Sect. 4.2 have been tested extensively under different conditions. The first scenario compares the performance of the proposed algorithm under pressure on the system with its performance without any pressure (i.e., with the original arrival times). This scenario uses the workloads KIT_intensive and KIT_normal from Table 4, with sizes of 1K and 5K jobs each. For these workloads, the average turnaround times in Fig. 7 show that MEBF with Handoff, Spare, and Intensive is a clear improvement over EBF for both the intensive and the normal workloads. For the intensive workload with 1K jobs, Handoff provides the best performance among the policies: at this workload size, Handoff allows shrink more often than expand, reducing wait times while limiting expand to only the most important events. The improvements are 40%, 33%, and 37% for Handoff, Intensive, and Spare, respectively. For the intensive workload with 5K jobs, the Spare policy performs best, as it allows a reasonable amount of expansion while retaining a portion of the resources for new incoming jobs in such a large workload. The improvements are 49%, 48%, and 47% for Spare, Intensive, and Handoff, respectively.

Fig. 7 Average turnaround time of workloads with 1K and 5K jobs for intensive (arrival times scaled by 25%) and normal (original arrival times) workloads

The second scenario evaluates different scalability levels, corresponding to the workloads KIT_intensive_LS_ROF, KIT_intensive_MS_ROF, and KIT_intensive_HS_ROF with sizes of 1K and 5K jobs. Figure 8 shows the average turnaround time of EBF compared to the Handoff, Spare, and Intensive policies with low, medium, and high scalability. For both workload sizes and all scalability levels, MEBF provides better performance than EBF. For the 1K-job workload, Handoff is the best approach in the low-scalability case. The statistical data of the 1K experiments (Table 5) show that Handoff creates more opportunities to shrink while limiting expand, and at the same time allows frequent backfilling. In such cases, Handoff uses shrink events in addition to backfilling to reduce the waiting time without affecting the overall performance of the workload. The MEBF policies achieve improvements of 40%, 37%, and 33% for Handoff, Intensive, and Spare, respectively. With medium scalability, the Intensive policy provides a 42% improvement over EBF: at this scalability level, Intensive allows expand to happen more frequently than shrink and backfilling, while Handoff and Spare restrict the malleability of the medium-scalability workload; therefore, Intensive provides the best performance. Spare achieves the best improvement, 17%, for highly scalable workloads, as it allows such a flexible workload to perform expand and shrink operations more than the other policies. On the other hand, for the larger and more intensive workloads with 5K jobs, all MEBF policies provide better performance than EBF, while the differences between the MEBF policies are not significant at this workload size.

Table 5 The percentage of shrink, expand, and backfill events relative to the total number of events
Fig. 8 Average turnaround time of the intensive workloads with 1K and 5K jobs for different levels of scalability

The third scenario evaluates the effect of different levels of reconfiguration overhead. The performance models employ these overheads to decide the suitability of each malleable operation. We tested KIT_intensive_HS with 5K jobs under two conditions: one where the workload bears the full reconfiguration cost, and one where it bears half of this cost. Figure 9 shows the results of these evaluations. All MEBF algorithms perform better than EBF in both cases. At full reconfiguration cost, Handoff is the best approach; it limits expand events while providing better chances to shrink and backfill, which in turn significantly reduces the waiting time. For the low-cost case, Intensive offers the greatest improvement, as it takes advantage of reconfigurations while their cost is low.

Fig. 9 Average turnaround time of the workload with 5K jobs for different reconfiguration overhead levels

In this work, we also compare the different policies in terms of CPU utilization. Figure 10 shows the CPU utilization for the 1K and 5K workloads under different scalability levels. For the low-scalability workload with 1K jobs, the system is better utilized by EBF, with Intensive close to EBF’s utilization. For medium- and high-scalability workloads, the MEBF policies utilize the system better than EBF, as they give jobs the ability to adapt to the system state. For the 5K workload, MEBF reaches 99% utilization due to the intensive workload and the dynamism offered by MEBF. The Intensive policy provides the highest level of dynamism and therefore utilizes resources up to the limit at all scalability levels. In this context, CPU utilization over time is also evaluated. Figure 11 shows that the system is better utilized over time with the MEBF policies than with EBF, due to both malleability and backfilling. When the system reaches the time range of 6000 to 8000 s, it has fewer jobs to schedule; at this point, the MEBF policies are more effective than EBF thanks to expand operations. There are also differences in utilization between the MEBF policies, which corroborate the results in Fig. 10.

Fig. 10 System utilization for the 5000-job and 1000-job workloads under Handoff, Spare, Intensive, and EBF with different levels of scalability

Fig. 11 CPU utilization over time for the 1K KIT_intensive_HS_ROF workload, scheduled by EBF, MEBF-Handoff, MEBF-Spare, and MEBF-Intensive

In the next scenario, we evaluate the scheduling performance under different percentages of rigid jobs. The workload with 1K jobs (KIT_intensive_MS_ROF, Table 3) has been used to derive four workloads with 40%, 60%, 80%, and 100% malleable jobs; in each, the rigid jobs are randomly selected. Figure 12 shows that MEBF performs better than EBF in terms of turnaround time, even in the presence of a large fraction of rigid jobs. When only 40% of the jobs are malleable, the MEBF policies outperform EBF by a similar margin of 2.1%. Among the policies, Handoff provides the best performance at 60% malleable, where it reduces the turnaround time by 34% compared to EBF. From the experiment statistics, we observe that in this case Handoff allows more backfilling events than expand or shrink events: the presence of rigid jobs and their distribution across the workload creates more gaps and thus more opportunities for backfilling. Since the distribution of rigid jobs is randomly generated, Handoff does not perform as well as the other policies at 80% and 100% malleable, where Intensive provides more efficient malleable scheduling.

Fig. 12 The average turnaround time with different percentages of malleable jobs: 40%, 60%, 80%, and 100%

Table 6 summarizes the slowdown improvement of the MEBF policies over EBF. The highest improvement for the 1K workload is achieved by the Intensive and Spare strategies at medium and high scalability, respectively, with a percentage of 42%. The 5K workload achieves its best performance with the Spare policy, reaching a 49% improvement over EBF at low scalability.

Table 7 outlines the slowdown improvement of MEBF when rigid jobs are present; 60% or 80% of each workload is malleable, while the remainder is rigid. For 60% malleable jobs, the best policy at medium scalability is Handoff, which improves the slowdown by 34%, whereas the 80% malleable, medium-scalability workload improves by 33% when run with Intensive. The workloads in both tables are run with the full reconfiguration cost and scaled arrival times.

Table 6 The slowdown improvement for each fully malleable workload with different sizes and scalability levels
Table 7 The slowdown improvement of the 1K workload with different malleability percentages and scalability levels

In this work, we also study the implications of malleability when backfilling is unable to find a job in the queue that matches the available resources. In such a case, MEBF-basic leaves these resources unused. To address this, we tested the malleability reconfigurations shrink (MEBF-shrink\(^+\)) and expand (MEBF-expand\(^+\)) (see line 17 in Algorithm 1) in addition to those offered by MEBF-basic. The shrink in MEBF-shrink\(^+\) is performed even if resources are available but do not match any job in the queue, while the expand in MEBF-expand\(^+\) is performed even if the queue is not empty.

Fig. 13 The slowdown of the 1K-job workload with MEBF-basic, MEBF-shrink\(^+\), and MEBF-expand\(^+\) over different scalability levels

Figures 13 and 14 show the results of using MEBF-shrink\(^+\) and MEBF-expand\(^+\) with different scalability levels for the workloads with 1K and 5K jobs. MEBF outperforms EBF in all approaches for both workloads. Comparing MEBF-shrink\(^+\) with MEBF-basic, MEBF-shrink\(^+\) offers the same improvement as MEBF-basic, while MEBF-expand\(^+\) provides less improvement. Based on the empirical statistics, we found that although expanding reduces the execution time of jobs, too many expand operations may increase the waiting time, which can negatively impact the overall workload performance. MEBF-basic and MEBF-shrink\(^+\) take better advantage of malleability than MEBF-expand\(^+\) because they trade off the waiting time against the execution time of the jobs. To confirm this behavior under lower pressure, we also tested MEBF-shrink\(^+\) and MEBF-expand\(^+\) with a workload with the original arrival times (i.e., without scaling the arrival times). Figure 15 shows that MEBF-shrink\(^+\) performs better than MEBF-expand\(^+\) under both workloads, KIT_normal_MS and KIT_intensive_MS.

Fig. 14 The slowdown of the 5K-job workload with MEBF-basic, MEBF-shrink\(^+\), and MEBF-expand\(^+\) over different scalability levels

Fig. 15 The slowdown of the medium-scalability 5K workload run with MEBF-shrink\(^+\) and MEBF-expand\(^+\) under scaled (intensive) and normal arrival times

The final scenario evaluates the impact of an error in the application performance model on MEBF performance. Figure 16 evaluates the 1K KIT workload with two models: one that accurately predicts the application performance (no error) and one that overestimates the application scalability (error). The error was introduced as a bias in the \(\rho\) values of each application, but not in the corresponding model; in this way, the real application scalability (see Fig. 3a) is smaller than the one predicted by the model. The application \(\rho\) values were increased by 0.1, so the model estimates a larger scalability than the real one. The results show that, even with this model uncertainty, MEBF performs better than EBF. The reason is that, from a performance point of view, it is not critical to select exactly the applications with the maximum or minimum scalability: suitable jobs are still chosen. Note that EBF is also affected by this uncertainty, as it also uses the performance models.

Fig. 16 Evaluation of the impact of the application model error on MEBF performance

5 Background

While there are many different scheduling algorithms, the simplest and most widely used approach is First Come First Serve (FCFS) scheduling, in which submitted jobs are served based on their arrival time [32]. Although it is fair and predictable [33], FCFS has two main drawbacks: the convoy effect, where all other tasks wait for a large task to release its resources, and fragmentation, where the free processors cannot meet the requirements of the next job and therefore remain idle until additional ones become available [9]. Both effects lead to lower resource utilization than would be possible if smaller tasks took precedence over the large ones [32]. Non-FCFS scheduling, on the other hand, tends to serve small jobs before larger ones to take advantage of the available resources; but when small jobs always move ahead, large jobs risk starvation. The backfilling approach is a compromise: it allows a limited number of small jobs to use idle resources when the next job in order cannot be served [34].

Depending on who decides the number of nodes and when the decision happens, HPC jobs can be divided into four categories: rigid, evolving, moldable, and malleable [35]. For rigid and evolving jobs, users decide how many nodes to use. For rigid jobs, the decision is made at submission time, before execution; evolving jobs may change their requirements during execution, with the change request initiated by the application itself. For moldable and malleable jobs, users specify a range of nodes on which a job can run, and the scheduler decides how many nodes to use; for moldable jobs, the scheduler makes this decision at the start of execution. Malleable and evolving jobs can dynamically shrink or grow the resources on which they execute during runtime [36]. By shrinking, a malleable job releases part of its assigned resources, allowing new jobs waiting in the queue to start or other malleable jobs to be reconfigured to benefit from the released resources. Expanding the resources of a malleable job can increase resource utilization and possibly improve the execution of the job. Although the execution of moldable and malleable jobs can improve system utilization and reduce average response time, their use in HPC has never become common [37].

Various challenges may limit the use of malleability. One of them is the need for adequate support from message-passing libraries, middleware, and resource management systems (RMS). Additionally, scheduling malleable jobs requires communication between the running jobs and the RMS. This entails tracking pending jobs and prioritizing which jobs to expand or shrink at a given point in time. Malleable job schedulers need to decide when to increase or decrease the resources of malleable jobs [28]; this is done by giving the scheduling algorithm the ability to deal with malleable jobs and to apply the malleability decisions.

Many previous research papers have addressed the problem of scheduling malleable workloads and allocating resources to such jobs in parallel systems by integrating new techniques into backfilling or FCFS policies [1]. Sonmez et al. [38] proposed two strategies for managing adaptive applications: the first expands only the longest-running jobs, based on start time, while the other distributes the new resources evenly across all jobs. In both cases, their experimental results clearly demonstrate the positive impact of malleability on application performance.

An adaptive job scheduler for managing malleable applications is proposed in [39]. It uses a variant of the equipartitioning algorithm to schedule adaptive applications: job execution can start on a partition, but the job’s expansion is restricted to that partition. The main idea is to enhance the capabilities of the load balancer to allow shrinking, expanding, or changing the set of processors allocated to a job.

In [28], several scheduling algorithms for adaptive (i.e., evolving and malleable) applications are proposed. Although different scheduling policies were used to apply the adaptive events, in all cases the results showed that workloads with elastic applications improved both application and system performance compared to the same workloads with only rigid applications.

Utrera et al. [1] extended FCFS scheduling to support malleability. They used the concept of virtual malleability, which can be applied to all jobs regardless of their categorization. Their shrink policy halves the number of processors assigned to the oldest job. They claim that their approach does not incur any overhead for redistributing data. Even under this assumption, the results obtained compared to EASY backfilling are impressive enough to show the power of malleability to improve system performance.

The work in [14] illustrates the potential benefits and challenges of combining malleability with scheduling policies. The authors analyzed the quality of FCFS scheduling for a mix of rigid and malleable jobs and showed experimentally how average response time and utilization can be improved by considering malleability. In some cases, for example [40], the scheduler must collect data about the performance of the running applications to decide which should receive additional processors and which should give processors up. A variant of backfilling scheduling in [41] enables malleability by shrinking the resources of running jobs to make room for jobs that run with a reduced amount of resources, but only if the estimated slowdown improves over the static approach.

Our algorithm differs from previous works in that it uses dedicated application performance models to aid the scheduler’s decision making. Additionally, the proposed algorithm is adaptive and considers the existing system resources. While most previous works focused on policies for shrinking running jobs rather than expanding them, this work includes three novel policies for expansion and one for shrinking, with each expansion policy targeting a specific workload type based on job size and scalability level, which sets it apart from previous works. A scalability model for applications is simulated and evaluated using a discrete-event simulator, ElastiSim.

6 Conclusion

In this paper, we have proposed scheduling algorithms for malleable workloads based on performance models. We proposed three expand policies, Handoff, Spare, and Intensive, each targeting a specific type of workload and each better suited to a particular scalability level than the others. For example, the Intensive policy worked better for highly scalable workloads, while Handoff provided better performance for workloads with low scalability. We evaluated the proposed algorithm and policies using real workload traces. The main result of our study is that the proposed algorithms achieve good improvements over EASY backfilling in all scenarios. As the workload grows, the performance differences between the proposed policies become insignificant. In addition, the reconfiguration cost affects the performance. Another finding was that performing too many reconfigurations can negatively impact workload performance, so it is critical to find the most appropriate strategy to achieve the best performance. For example, MEBF-expand\(^+\) did not provide better performance than MEBF-basic, and MEBF-shrink\(^+\) provided results similar to MEBF-basic. Finally, MEBF-basic was able to improve performance even when a significant portion of the workload was rigid. In the future, we plan to incorporate I/O into the scheduling decision logic and to extend the application modeling and simulation to other classes of applications, such as I/O-intensive or communication-intensive applications and applications with different execution phases.