## Abstract

As multicore processors become ever more prevalent, it is important for real-time programs to take advantage of intra-task parallelism in order to support computation-intensive applications with tight deadlines. In this paper, we consider the global earliest deadline first (GEDF) scheduling policy for task sets consisting of parallel tasks. Each task can be represented by a directed acyclic graph (DAG) where nodes represent computational work and edges represent dependences between nodes. In this model, we prove that GEDF provides a *capacity augmentation bound* of \(4-\frac{2}{m}\) and a *resource augmentation bound* of \(2-\frac{1}{m}\). The capacity augmentation bound acts as a linear-time schedulability test, since it guarantees that any task set with total utilization of at most \(m/(4-\frac{2}{m})\), where each task’s critical-path length is at most \(1/(4-\frac{2}{m})\) of its deadline, is schedulable on \(m\) cores under GEDF. In addition, we present a pseudo-polynomial time fixed-point schedulability test for GEDF; this test uses a carry-in work calculation based on the proof for the capacity bound. Finally, we present and evaluate a prototype platform—called PGEDF—for scheduling parallel tasks using GEDF. PGEDF is built by combining the GNU OpenMP runtime system and the \(\text {LITMUS}^\text {RT}\) operating system. This platform allows programmers to write parallel OpenMP tasks and specify real-time parameters such as deadlines for tasks. We perform two kinds of experiments to evaluate the performance of GEDF for parallel tasks: (1) we run numerical simulations for DAG tasks, and (2) we execute randomly generated tasks using PGEDF. Both sets of experiments indicate that GEDF performs surprisingly well and outperforms an existing scheduling technique that involves task decomposition.

### Keywords

Real-time scheduling · Parallel scheduling · Global EDF · Resource augmentation bound · Capacity augmentation bound

## 1 Introduction

During the last decade, the increase in the performance of processor chips has come primarily from increasing numbers of cores. This has led to extensive work on real-time scheduling techniques that can exploit multicore and multiprocessor systems. Most prior work has concentrated on inter-task parallelism, where each task runs sequentially (and therefore can only run on a single core) and multiple cores are exploited by increasing the number of tasks. This type of scheduling is called multiprocessor scheduling. When a model is limited to inter-task parallelism, each individual task’s total execution requirement must be smaller than its deadline, since an individual task cannot run any faster than it would on a single-core machine. In order to enable tasks with higher execution demands and tighter deadlines, such as those used in autonomous vehicles, video surveillance, computer vision, radar tracking and real-time hybrid testing (Maghareh et al. 2012), we must enable parallelism within tasks.

In this paper, we are interested in parallel scheduling, where in addition to inter-task parallelism, task sets contain intra-task parallelism, which allows threads from one task to run in parallel on more than a single core. While there has been some recent work in this area, many of these approaches are based on task decomposition (Lakshmanan et al. 2010; Saifullah et al. 2013, 2014), which first decomposes each parallel task into a set of sequential subtasks with assigned intermediate release times and deadlines, and then schedules these sequential subtasks using a known multiprocessor scheduling algorithm. In this work, we are interested in analyzing the performance of global EDF (GEDF) schedulers *without* any decomposition.

We consider a general task model, where each task is represented as a directed acyclic graph (DAG) and where each node represents a sequence of instructions (thread) and each edge represents a dependency between nodes. A node is ready to be executed when all its predecessors have been executed. GEDF works as follows: for ready nodes at each time step, the scheduler first tries to schedule as many jobs with the earliest deadline as it can; then it schedules jobs with the next earliest deadline, and so on, until either all cores are busy or no more nodes are ready.
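The per-step decision described above can be sketched as follows. This is a minimal illustration of our own, not the paper's code; the representation of ready nodes as (absolute deadline, node id) pairs and the function name are assumptions made for this sketch.

```python
def gedf_step(ready, m):
    """One GEDF decision on m cores: among the ready nodes, those whose
    jobs have the earliest absolute deadlines are scheduled first, until
    all cores are busy or no ready node remains."""
    # sort ready nodes by their job's absolute deadline; ties are broken
    # arbitrarily (here, by node id)
    return [node for _, node in sorted(ready)[:m]]

# Three ready nodes, two cores: the two earliest-deadline nodes run.
scheduled = gedf_step([(12, "a"), (7, "b"), (9, "c")], m=2)
print(scheduled)  # ['b', 'c']
```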

Compared with other schedulers, GEDF has benefits, such as automatic load balancing. Efficient and scalable implementations of GEDF for sequential tasks are available for Linux (Lelli et al. 2011) and \(\text{LITMUS}^\text{RT}\) (Brandenburg and Anderson 2009), and these can be used to implement GEDF for parallel tasks if decomposition is not required. Prior theory analyzing GEDF for parallel tasks is either restricted to a single recurring task (Baruah et al. 2012) or considers response-time analysis for soft real-time tasks (Liu and Anderson 2012). In this paper, we consider task sets with \(n\) tasks and analyze their schedulability under a GEDF scheduler in terms of augmentation bounds.

We distinguish between two types of augmentation bounds, both of which are called “resource augmentation” in the previous literature. By standard definition, a scheduler \(\mathcal {S}\) provides a resource augmentation bound of \(b\) if the following condition holds: if an ideal scheduler can schedule a task set on \(m\) unit-speed cores, then \(\mathcal {S}\) can schedule that task set on \(m\) cores of speed \(b\). Note that the ideal scheduler (optimal schedule) is only a hypothetical scheduler: if a feasible schedule exists for a task set, then the ideal scheduler is guaranteed to schedule it. Unfortunately, even for a single parallel DAG task, scheduling it on \(m\) cores within a deadline has been shown to be NP-complete (Garey and Johnson 1979). If there is more than one task in the system, the problem is only exacerbated. Since there may be no way to tell whether the ideal scheduler can schedule a given task set on unit-speed cores, a resource augmentation bound may not directly provide a schedulability test.

Therefore, we distinguish resource augmentation from a capacity augmentation bound that can serve as an easy schedulability test. If on unit-speed cores, a task set has total utilization of at most \(m\) and the critical-path length of each task is smaller than its deadline, then scheduler \(\mathcal {S}\) with capacity augmentation bound \(b\) can schedule this task set on \(m\) cores of speed \(b\). Note that the ideal scheduler cannot schedule any task set that does not meet these utilization and critical-path length bounds on unit-speed cores; hence, a capacity augmentation bound of \(b\) trivially implies a resource augmentation bound of \(b\).

It is important to note that the capacity augmentation bound is an extension of the notion of a schedulable utilization bound from sequential tasks to parallel real-time tasks. Just like a utilization bound, it provides an estimate of how much load a system can handle in the worst case. Therefore, it provides qualitatively different information about the scheduler than a resource augmentation bound. It also has the advantage that it directly leads to schedulability tests, since one can easily check the bounds on utilization and critical-path length for any task set. A tighter but more complex schedulability test is only needed when a task set does not satisfy the capacity augmentation bound. Additionally, many bounds proved for parallel real-time tasks in prior literature, which were called resource augmentation bounds, were actually capacity augmentation bounds. Thus, the capacity augmentation bound is important for comparing different schedulers.

- 1.
For a system with \(m\) identical cores, we prove a capacity augmentation bound of \(4-\frac{2}{m}\) (which approaches 4 as \(m\) approaches infinity) for sporadic task sets with *implicit deadlines*—the relative deadline of each task is equal to its period. Another way to understand this bound is: if a task set has total utilization at most \(m/(4-\frac{2}{m})\) and the critical-path length of each task is at most \(1/(4-\frac{2}{m})\) of its deadline, then it can be scheduled using GEDF on unit-speed cores.

- 2.
For a system with \(m\) identical cores, we prove a resource augmentation bound of \(2-\frac{1}{m}\) (which approaches 2 as \(m\) approaches infinity) for sporadic task sets with *arbitrary deadlines*.^{1}

- 3.
We also show that GEDF’s capacity augmentation bound for parallel task sets (even with implicit deadlines) is lower bounded by \(\frac{3+\sqrt{5}}{2} \approx 2.618\). In particular, we show example task sets with utilization \(m\) where the critical-path length of each task is no more than its deadline, while GEDF misses a deadline on \(m\) cores with speed less than \(\frac{3+\sqrt{5}}{2} \approx 2.618\).

- 4.
We conduct simulation experiments to show that the capacity augmentation bound is safe for task sets with different DAG structures (as mentioned above, checking the resource augmentation bound is difficult since we cannot compute the optimal schedule). Simulations show that GEDF performs surprisingly well. All simulated random task sets meet their deadlines at \(50\,\%\) utilization (core speed of \(2\)). We also compare GEDF with a scheduling technique that decomposes parallel tasks and then schedules the decomposed subtasks using GEDF (Saifullah et al. 2014). For all of the DAG task sets considered in our experiments, the GEDF scheduler without decomposition has better performance.

- 1.
While the capacity augmentation bound functions as a linear-time schedulability test, we further provide a sufficient fixed-point schedulability test for tasks with implicit deadlines that may admit more task sets but takes pseudo-polynomial time to compute.

- 2.
To demonstrate the feasibility of parallel GEDF scheduling in real systems, we implement a prototype platform named PGEDF. PGEDF supports standard OpenMP programs with parallel for-loops. Therefore, it supports a subset of DAGs—namely, synchronous tasks, where the program consists of a sequence of sequential or parallel segments and parallel segments are expressed using parallel for-loops. While not as general as DAGs, these synchronous tasks constitute a large subset of interesting parallel programs. PGEDF integrates the GNU OpenMP runtime system (OpenMP 2011) and the \(\text{LITMUS}^\text{RT}\)-patched Linux kernel (Brandenburg and Anderson 2009), where the former executes each task with parallel threads and the latter is responsible for scheduling the threads of all tasks under GEDF.

- 3.
We evaluate the performance of PGEDF with randomly generated synthetic task sets. For these task sets, all deadlines are met in PGEDF when total utilization is less than \(30\,\%\) (core speed of \(3.3\)). We compare PGEDF with an existing parallel real-time platform, RT-OpenMP (Ferry et al. 2013), which was also designed for synchronous tasks but uses decomposed fixed-priority scheduling. We find that PGEDF performs better for most task sets.

## 2 Related work

Most prior work on hard real-time scheduling atop multiprocessors has concentrated on sequential tasks (Davis and Burns 2011). In this context, many sufficient schedulability tests for GEDF and other global fixed-priority scheduling algorithms have been proposed (Andersson et al. 2001; Srinivasan and Baruah 2002; Goossens et al. 2003; Bertogna et al. 2009; Baruah and Baker 2008; Baker and Baruah 2009; Lee and Shin 2012; Baruah 2004; Bertogna and Baruah 2011). In particular, for implicit-deadline hard real-time tasks, the best known utilization bound is \(\approx 50\,\%\), using partitioned fixed-priority scheduling (Andersson and Jonsson 2003) or partitioned EDF (Baruah et al. 2010; Lopez et al. 2004); this trivially implies a capacity augmentation bound of \(2\). Baruah et al. (2010) proved that global EDF has a capacity augmentation bound of \(2-1/m\) for sequential tasks on multiprocessors.

Earlier work considering intra-task parallelism makes strong assumptions on task models (Lee and Heejo 2006; Collette et al. 2008; Manimaran et al. 1998). For more realistic parallel tasks, e.g. synchronous tasks, Kato and Ishikawa (2009) proposed a gang scheduling approach. The synchronous model, a special case of the more general DAG model, represents tasks with a sequence of multi-threaded segments with synchronization points between them (such as those generated by parallel for-loops).

Most other approaches for scheduling synchronous tasks involve decomposing parallel tasks into independent sequential subtasks, which are then scheduled using known multiprocessor scheduling techniques, such as deadline monotonic (Fisher et al. 2006) or GEDF (Baruah and Baker 2008). For a restricted set of synchronous tasks, Lakshmanan et al. (2010) proved a capacity augmentation bound of 3.42 using deadline monotonic scheduling for decomposed tasks. For more general synchronous tasks, Saifullah et al. (2013) proved a capacity augmentation bound of 4 for GEDF and 5 for deadline monotonic scheduling. The decomposition strategy was improved in Nelissen et al. (2012) to use fewer cores. For the same general synchronous model, the best known augmentation bound is 3.73 (Kim et al. 2013), also using decomposition. The decomposition approach in Saifullah et al. (2013) was recently extended to general DAGs (Saifullah et al. 2014) to achieve a capacity augmentation bound of 4 under GEDF on decomposed tasks (note that in that work GEDF is used to schedule sequential decomposed tasks, not parallel tasks directly). This is the best augmentation bound known for task sets with multiple DAGs. For scheduling synchronous tasks without decomposition, Chwa et al. (2013) and Axer et al. (2013) presented schedulability tests for GEDF and partitioned fixed-priority scheduling, respectively.

More recently, there has been some work on scheduling general DAGs without decomposition. Regarding resource augmentation bounds for GEDF, Andersson and de Niz (2012) proved a resource augmentation bound of \(2-\frac{1}{m}\) under GEDF for a staged DAG model. Baruah et al. (2012) proved that when the task set is a *single DAG task* with arbitrary deadlines, GEDF provides a resource augmentation bound of 2. For multiple DAGs under GEDF, Bonifaci et al. (2013) and Li et al. (2013) independently proved the same resource augmentation bound of \(2-\frac{1}{m}\) using different proof techniques, extending the \(2-\frac{1}{m}\) resource augmentation result for sequential multiprocessor scheduling by Phillips et al. (1997). Bonifaci et al. (2013) also proved that global deadline monotonic scheduling has a resource augmentation bound of \(3-\frac{1}{m}\), and presented polynomial-time and pseudo-polynomial-time schedulability tests for DAGs with arbitrary deadlines. In this paper, we consider the capacity augmentation bound for GEDF and provide a linear-time schedulability test derived directly from the capacity augmentation bound, as well as a pseudo-polynomial-time schedulability test for DAGs with implicit deadlines.

There have been some results for other scheduling strategies and different real-time constraints. Nogueira et al. (2012) explored the use of work-stealing for real-time scheduling. That paper is mostly experimental and focuses on soft real-time performance; its bounds for hard real-time scheduling only guarantee that tasks meet deadlines if their utilization is smaller than 1. Liu and Anderson (2012) analyzed the response time of GEDF without decomposition for soft real-time tasks.

Various platforms support sequential real-time tasks on multicore machines (Brandenburg and Anderson 2009; Lelli et al. 2011; Cerqueira et al. 2014). \(\text{LITMUS}^\text{RT}\) (Brandenburg and Anderson 2009) is a straightforward implementation of GEDF scheduling that offers usability, stability and predictability. SCHED_DEADLINE (Lelli et al. 2011) is another comparable GEDF patch for Linux and has been submitted to mainline Linux. A more recent implementation, G-EDF-MP (Cerqueira et al. 2014), uses message passing instead of locking and has better scalability than the previous implementations. Our platform prototype, PGEDF, is implemented using \(\text{LITMUS}^\text{RT}\) as the underlying GEDF scheduler. Our goal is simply to illustrate the feasibility of GEDF for parallel tasks. We speculate that if the underlying GEDF scheduler implementation were replaced with one that has lower overhead, the overall performance of PGEDF would also improve.

As for parallel tasks, we are aware of two systems (Kim et al. 2013; Ferry et al. 2013) that support parallel real-time tasks based on different decomposition strategies. Kim et al. (2013) used a reservation-based OS to implement a system that can run parallel real-time programs for an autonomous vehicle application, demonstrating that parallelism can enhance performance for complex tasks. Ferry et al. (2013) developed a parallel real-time scheduling service on standard Linux. However, since both systems adopt task decomposition, they require users to provide exact task structures and subtask execution times in order to decompose tasks correctly. The system presented in Ferry et al. (2013) also requires modifications to the compiler and runtime system to decompose, dispatch and execute parallel applications. The platform prototype presented here requires neither decomposition nor such detailed information.

Scheduling parallel tasks without deadlines has been addressed by parallel-computing researchers (Polychronopoulos and Kuck 1987; Drozdowski 1996; Deng et al. 1996; Agrawal et al. 2008). Soft real-time scheduling has been studied for various optimization criteria, such as cache misses (Calandrino and Anderson 2009; Anderson and Calandrino 2006), makespan (Wang and Cheng 1992) and total work done by tasks that meet deadlines (Oh-Heum and Kyung-Yong 1999).

## 3 Task model and definitions

For each task \(\tau _i\) in task set \(\tau \), let \(C_i = \sum _j{C_i^j}\) be the total worst-case execution time of its nodes on a single core, also called the work of the task. Let \(L_i\) be the critical-path length (i.e., the worst-case execution time of the task on an infinite number of cores). In Fig. 1, the critical path (i.e., the longest path) starts at node \(W_1^1\), goes through \(W_1^3\) and ends at node \(W_1^4\), so the critical-path length of DAG \(W_1\) is \(1+3+2=6\). The work and the critical-path length of any job generated by task \(\tau _i\) are the same as those of task \(\tau _i\).

We also define the notion of remaining work and remaining critical-path length of a partially executed job. The remaining work is the total work minus the work that has already been done. The remaining critical-path length is the length of the longest path in the unexecuted portion of the DAG (including partially executed nodes). For example, in Fig. 1, if \(W_1^1\) and \(W_1^2\) are completely executed, and \(W_1^3\) is partially executed such that 1 unit (out of 3) of work has been done for it, then the remaining critical-path length is \(2 + 2 = 4\).
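These quantities are straightforward to compute from the DAG. The sketch below is our own, not the paper's code; it recomputes the Fig. 1 example, where the exact edge structure of \(W_1\) and the cost of node \(W_1^2\) (here 2) are assumptions made for illustration, since only the costs of the critical-path nodes (1, 3 and 2) are given above.

```python
from collections import defaultdict

def critical_path(costs, edges):
    """Length of the longest path (sum of node costs) through a DAG,
    computed by memoized depth-first search."""
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
    memo = {}
    def longest_from(u):
        if u not in memo:
            memo[u] = costs[u] + max((longest_from(v) for v in succ[u]),
                                     default=0)
        return memo[u]
    return max(longest_from(u) for u in costs)

# W_1 mimicking Fig. 1: W1_1 -> {W1_2, W1_3} -> W1_4 (structure assumed).
costs = {"W1_1": 1, "W1_2": 2, "W1_3": 3, "W1_4": 2}
edges = [("W1_1", "W1_2"), ("W1_1", "W1_3"),
         ("W1_2", "W1_4"), ("W1_3", "W1_4")]

work = sum(costs.values())        # the work C_1
L1 = critical_path(costs, edges)  # 1 + 3 + 2 = 6

# Remaining critical-path length after W1_1 and W1_2 finish and 1 (of 3)
# units of W1_3 are done: rerun with the remaining cost of each node.
remaining = {"W1_1": 0, "W1_2": 0, "W1_3": 2, "W1_4": 2}
L1_rem = critical_path(remaining, edges)  # 2 + 2 = 4
print(work, L1, L1_rem)  # 8 6 4
```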

Nodes do not have individual release offsets and deadlines when scheduled by the GEDF scheduler; they share the same absolute deadline of their jobs. Therefore, to analyze the GEDF scheduler, we do not require any knowledge of the DAG structure beyond the total worst-case execution time \(C_i\), deadline \(D_i\), period \(P_i\) and critical-path length \(L_i\). We also define the utilization of a task \(\tau _i\) as \(u_i = \frac{C_i}{P_i}\).

Note that a task set can be feasible on \(m\) unit-speed cores only if the following two conditions hold:

- The critical-path length of each task is at most its deadline.$$\begin{aligned} L_i \le D_i \end{aligned}$$(1)
- The total utilization is at most the number of cores.$$\begin{aligned} \sum _i u_i \le m \end{aligned}$$(2)

## 4 Capacity augmentation bound of \(4-\frac{2}{m}\)

In this section, we prove a capacity augmentation bound of \(4-\frac{2}{m}\) for *implicit deadline tasks*, which yields a sufficient schedulability test. In particular, we show that GEDF can successfully schedule a task set if the task set satisfies two conditions: (1) its total utilization is at most \(m/(4-\frac{2}{m})\), and (2) the critical-path length of each task is at most \(1/(4-\frac{2}{m})\) of its period (and deadline). Note that this is equivalent to saying that if a task set meets the conditions of Inequalities 1 and 2 on cores of unit speed, then it can be scheduled on \(m\) cores of speed \(4-\frac{2}{m}\) (which approaches 4 as \(m\) approaches infinity).
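The resulting linear-time test can be written down directly. This is our illustration, not the paper's code; the function name and the (C, L, D) tuple representation are assumptions.

```python
def gedf_capacity_test(tasks, m):
    """Linear-time schedulability test from the capacity augmentation
    bound: tasks is a list of (C_i, L_i, D_i) with implicit deadlines
    (P_i = D_i). Returns True if GEDF is guaranteed to schedule the
    task set on m unit-speed cores."""
    b = 4 - 2 / m                       # the capacity augmentation bound
    total_util = sum(C / D for C, _, D in tasks)
    # condition (1): total utilization at most m / (4 - 2/m)
    # condition (2): each critical-path length at most D_i / (4 - 2/m)
    return total_util <= m / b and all(L <= D / b for _, L, D in tasks)

# On m = 4 cores (b = 3.5): a task with C = 7, L = 2, D = P = 8 has
# utilization 0.875 <= 4/3.5 and L = 2 <= 8/3.5, so it passes the test.
print(gedf_capacity_test([(7, 2, 8)], m=4))  # True
```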

The gist of the proof is the following: at a job’s release time, we can bound the remaining work from other tasks under GEDF with speedup \(4-\frac{2}{m}\). Bounded remaining work leads to bounded interference from other tasks, and hence GEDF can successfully schedule all of them.

### 4.1 Notation

We first define a notion of interference. Consider a job \(J_{k,a}\), which is the \(a\)th instance of task \(\tau _k\). Under GEDF scheduling, only jobs that have absolute deadlines earlier than the absolute deadline of \(J_{k,a}\) can interfere with \(J_{k,a}\). We say that a job is unfinished if the job has been released but has not completed yet. Due to implicit deadlines (\(D_i = P_i\)), at most one job of each task can be unfinished at any time.

### 4.2 Proof of the Theorem

Consider a GEDF schedule with \(m\) cores each of speed \(b\). Each time step can be divided into \(b\) sub-steps such that each core can do one unit of work in each sub-step. We say a sub-step is complete if all cores are working during that sub-step, and otherwise we say it is incomplete.

First, we state a couple of straightforward lemmas.

**Lemma 1**

On every incomplete sub-step, the remaining critical-path length of each unfinished job reduces by 1.

**Lemma 2**

*If there are \(t^*\) incomplete sub-steps during an interval of \(t\) time steps, then the total amount of work done during the interval is at least \(m(bt - t^*) + t^* = mbt - (m-1)t^*\).*

*Proof*

The total number of complete sub-steps during \(t\) steps is \(bt - t^*\), and the total work done during these complete sub-steps is \(m(bt - t^*)\). On an incomplete sub-step, at least one unit of work is done; therefore, the total work done in incomplete sub-steps is at least \(t^*\). Adding the two gives us the bound. \(\square \)
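The accounting in this proof can be checked numerically (our illustration; the parameter values below are arbitrary).

```python
def lemma2_min_work(m, b, t, t_star):
    """Lower bound on the work done during t steps on m cores of speed b
    when t_star sub-steps are incomplete: the bt - t_star complete
    sub-steps contribute m*(b*t - t_star) units of work and the
    incomplete ones at least t_star, so in total at least
    m*b*t - (m-1)*t_star units of work are done."""
    return m * (b * t - t_star) + t_star

# m = 4 cores of speed b = 3, t = 10 steps, t_star = 5 incomplete
# sub-steps: at least 4 * (30 - 5) + 5 = 105 units of work are done.
print(lemma2_min_work(4, 3, 10, 5))  # 105
```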

We now prove a sufficient condition for the schedulability of a job.

**Lemma 3**

*If the total amount of work that must be done between the release time and the deadline of a job \(J_{k,a}\) (the interference on \(J_{k,a}\), including \(J_{k,a}\)’s own work) is at most \(bmD_k - (m-1)D_k\), then \(J_{k,a}\) meets its deadline.*

*Proof*

Note that there are \(D_k\) time steps (therefore \(bD_k\) sub-steps) between the release time and deadline of this job. There are two cases:

**Case 1:** The total number of incomplete sub-steps between the release time and deadline of \(J_{k,a}\) is more than \(D_k\), and therefore also more than \(L_k\) (since \(L_k \le D_k\)). In this case, \(J_{k,a}\)’s remaining critical-path length reduces on each of these sub-steps. After at most \(L_k\) incomplete sub-steps, the remaining critical-path length is 0 and the job has finished executing. Therefore, it cannot miss the deadline.

**Case 2:** The total number of incomplete sub-steps between the release and deadline of \(J_{k,a}\) is at most \(D_k\). Therefore, by Lemma 2, the total amount of work done during this time is at least \(bmD_k - (m-1)D_k\). Since the total interference (including \(J_{k,a}\)’s work) is at most this quantity, the job cannot miss its deadline. \(\square \)

**Lemma 4**

*Proof*

We now complete the proof by showing that the carry-in work is bounded as required by Lemma 4 for every job.

**Lemma 5**

*Proof*

We prove this lemma by induction from absolute time \(0\) to the release time of job \(J_{k,a}\).

**Base Case:** For the very first job of all the tasks released in the system (denoted \(J_{l, 1}\)), no carry-in jobs are released before this job. Therefore, the condition trivially holds and the job meets its deadline by Lemma 4.

**Inductive Step:** Assume that the condition holds for every job with an earlier release time than \(J_{k,a}\). Therefore, according to Lemma 4, every earlier-released job meets its deadline. Now we prove that the condition also holds for job \(J_{k,a}\).

For job \(J_{k,a}\), if there is no carry-in work from jobs released earlier than \(J_{k,a}\), so that \(R^{k, a} = 0\), the property trivially holds. Otherwise, there is at least one unfinished job (a job with carry-in work) at the release time of \(J_{k,a}\).

We compare \(t\) with \(\alpha _j^{k, a}\). When \(t \le \frac{1}{2} D_j\), by Eq. (8), \(\alpha _j^{k,a} = D_j - t \ge \frac{1}{2} D_j \ge t\). There are two cases:

**Case 1:** \(t \le \frac{1}{2} D_j\) and hence \(\alpha _j^{k, a} \ge t\):

**Case 2:** \(t > \frac{1}{2} D_j\):

From Lemmas 4 and 5, we can easily derive the following capacity augmentation bound theorem.

**Theorem 1**

If, on unit-speed cores, the utilization of a sporadic task set is at most \(m\), and the critical-path length of each task is at most its deadline, then all tasks can meet their implicit deadlines on \(m\) cores of speed \(4-\frac{2}{m}\).

Theorem 1 establishes the capacity augmentation bound of GEDF; it can also be restated as follows:

**Corollary 1**

If a sporadic task set \(\tau \) with implicit deadlines satisfies the following conditions: (1) its total utilization is at most \(1/(4-\frac{2}{m})\) of the total system capacity \(m\), and (2) the critical-path length \(L_i\) of every task \(\tau _i \in \tau \) is at most \(D_i/(4-\frac{2}{m})\), then GEDF can schedule this task set \(\tau \) on \(m\) unit-speed cores.

## 5 Fixed point schedulability test

In Sect. 4, we described a capacity augmentation bound for the GEDF scheduler, which acts as a simple linear-time schedulability test. In this section, we describe a tighter sufficient fixed-point schedulability test for parallel task sets with implicit deadlines under a GEDF scheduler. We start with a schedulability test similar to one for sequential tasks. Then, we improve the calculation of the carry-in work; this improvement is based on some of the equations used in the proof of our capacity augmentation bound. Finally, we further improve the interference calculation by considering the calculated finish time, and altogether derive the fixed-point schedulability test.

### 5.1 Basic schedulability test

Given a task set, we denote \(\widehat{R_i^k}\) as an upper bound on the carry-in work from task \(\tau _i\) to a job of task \(\tau _k\), and \(\widehat{R^k} = \sum _i \widehat{R_i^k}\) as an upper bound on the total carry-in work from the entire task set to a job of task \(\tau _k\). We also denote \(\widehat{A_i^k}\) and \(\widehat{A^k}\) as the corresponding upper bounds on individual and total interference to task \(\tau _k\). In addition, \(\widehat{n_i^k}\) is an upper bound on the number of task \(\tau _i\)’s interfering jobs, which are not part of the carry-in jobs, but interfere with task \(\tau _k\). Finally, we use \(\widehat{f_k}\) to denote an upper bound on the relative completion time of task \(\tau _k\). If \(\widehat{f_k} \le D_k\), then task \(\tau _k\) is schedulable, and otherwise we cannot guarantee its schedulability.

Clearly, there can be at most one carry-in job of task \(\tau _i\) interfering with the job \(J_{k,a}\) of task \(\tau _k\). Moreover, if in the worst case of \(\widehat{A_i^k}\) this job has already finished before the release time of \(J_{k,a}\), then \(\widehat{R_i^k} = 0\). By the definition of carry-in jobs and Eq. (14) for \(\widehat{n_i^k}\), the length of the interval between the deadline of the carry-in job and the release time of job \(J_{k,a}\) is \(D_k - \widehat{n_i^k} D_i\). If the carry-in job has not finished when job \(J_{k,a}\) is released, then \(D_k - \widehat{n_i^k} D_i\) has to be longer than \(D_k - \widehat{f_i}\), where \(\widehat{f_i}\) is the upper bound on task \(\tau _i\)’s completion time.

Before the last step of calculating \(\widehat{f_k}''\), \(\widehat{f_k}\) never exceeds \(D_k\) in any iteration. After the first iteration, each \(\widehat{f_k}\) either stays at \(D_k\) or decreases (because \(\widehat{f_k}'\) is less than \(D_k\)). More importantly, \(\widehat{f_k}\) decreases or stays the same when the \(\widehat{f_i}\) of at least one other task \(\tau _i\) decreases. In conclusion, \(\widehat{f_k}\) never increases from one iteration to the next; therefore, the fixed-point calculation converges.
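The overall structure of the iteration can be sketched as follows. This is our rendering, not the paper's code: `bound` stands in for the interference calculation of Eqs. (14)-(17), which are not reproduced here, and `demo_bound` is a made-up placeholder used only to exercise the loop, not the paper's formula.

```python
def fixed_point_test(tasks, bound):
    """tasks: list of (C, L, D) with implicit deadlines. Iterate the
    completion-time upper bounds f_k, capped at D_k, until they stop
    changing; then do one final uncapped pass and declare the task set
    schedulable if every final f_k is at most D_k."""
    f = [D for (_, _, D) in tasks]
    while True:
        # cap at D_k: the interference equations are only valid under
        # the assumption f_k <= D_k (Sect. 5.1)
        new_f = [min(bound(k, tasks, f), D)
                 for k, (_, _, D) in enumerate(tasks)]
        if new_f == f:
            break
        f = new_f
    final = [bound(k, tasks, f) for k in range(len(tasks))]
    return all(fk <= D for fk, (_, _, D) in zip(final, tasks))

def demo_bound(k, tasks, f):
    # placeholder interference bound: a task's own work plus a crude
    # share of all tasks' (capped) completion-time bounds
    C, _, _ = tasks[k]
    return C + sum(f) // 4

print(fixed_point_test([(2, 1, 10), (3, 1, 10)], demo_bound))  # True
```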

Note that there is a subtlety in this calculation. Because of the assumption \(\widehat{f_i} \le D_i\) in Eq. (14), Eq. (17) is only correct when the completion time of each task in the task set is no more than its relative deadline. This is why, during the fixed-point calculation, we do not update \(\widehat{f_k}\) if the newly calculated value \(\widehat{f_k}'\) is larger than \(D_k\). After the last step (calculating \(\widehat{f_k}''\)) of the fixed-point calculation, if the task set is schedulable, i.e., the assumption is satisfied, then we have correctly calculated an upper bound on the interference and therefore an upper bound on the completion time. Therefore, if this test says that a task set is schedulable, it is indeed schedulable. If the test says that the task set is unschedulable, the test may be underestimating the interference. This inaccuracy does not matter, however: if even the underestimated interference makes the task set unschedulable, then the correct estimation would also deem it unschedulable.

### 5.2 Improving the carry-in work calculation

In the basic test, we calculate the carry-in work using Eq. (15). However, this upper bound calculation \(X_i^k\) may be pessimistic, if task \(\tau _k\) has a very short period, while task \(\tau _i\) has a very long period. This is because if the carry-in job of \(\tau _i\) to \(\tau _k\) has not finished before \(\tau _k\) is released, then the entire \(C_i\) will be counted as interference. However, GEDF, as a greedy algorithm, might have already executed most of the computation of the carry-in job. Inspired by the proof of the capacity augmentation bound for GEDF, we propose another upper bound for \(\widehat{R^k}\).

Note that in the proof of Lemma 5, there are two cases. The calculation of \(X^k = \sum _i X_i^k\) in the basic test is similar to Case 1, but without knowing the first carry-in job. Therefore, from Case 2, we can also obtain another upper bound \(Y^k\) for \(\widehat{R^k}\), again without knowing the first carry-in job. After obtaining the two upper bounds of \(\widehat{R^k}\), we can simply take the minimum of \(X^k\) and \(Y^k\) to achieve a schedulability test.

### 5.3 Improving the calculation for completion time

The overall schedulability test is presented in Algorithm 1.

## 6 Resource augmentation bound of \(2-\frac{1}{m}\)

In this section, we prove the resource augmentation bound of \(2-\frac{1}{m}\) for GEDF scheduling of arbitrary deadline tasks.

First, some definitions. Since the GEDF scheduler runs on cores of speed \(2-\frac{1}{m}\), each step under GEDF can be divided into \(2m-1\) sub-steps of length \(\frac{1}{m}\). In each sub-step, each core can do \(\frac{1}{m}\) units of work (i.e., execute one sub-node). Under GEDF, on an incomplete step, all ready nodes are executed (Observation 2). As in Sect. 4, we say that a sub-step is complete if all cores are busy, and incomplete otherwise. For each sub-step \(t\), we define \(\mathcal {F_I}(t)\) as the set of sub-nodes that have *finished* executing under an ideal scheduler after sub-step \(t\), \(\mathcal {R_I}(t)\) as the set of sub-nodes that are *ready* (all their predecessors have been executed) to be executed by the ideal scheduler before sub-step \(t\), and \(\mathcal {D_I}(t)\) as the set of sub-nodes completed by the ideal scheduler in sub-step \(t\). Note that \(\mathcal {D_I}(t) = \mathcal {R_I}(t) \cap \mathcal {F_I}(t)\). We define \(\mathcal {F_G}(t)\), \(\mathcal {R_G}(t)\), and \(\mathcal {D_G}(t)\) similarly for the GEDF scheduler.

We prove the resource augmentation bound by comparing each incomplete sub-step of a GEDF scheduler with each step of the ideal scheduler. We show that (1) if GEDF has had at least as many incomplete sub-steps as the total number of steps of the ideal scheduler, then GEDF has executed all the sub-nodes on the critical-path of the task and hence must have completed this task; (2) otherwise, GEDF has “many complete sub-steps” and in these complete sub-steps, it must have finished all the work that is required to be done by this time and hence must also have completed all the tasks with the same or earlier deadlines.

**Observation 2**

Note that for the ideal scheduler, each original step consists of \(m\) sub-steps, while for GEDF at speed \(2-\frac{1}{m}\) each step consists of \(2m-1\) sub-steps.

For example, in Fig. 6, step \(t_1\) has two sub-steps \(t_{1(1)}\) and \(t_{1(2)}\) under the ideal scheduler, while under GEDF there is an additional sub-step \(t_{1(3)}\) (since \(2m-1=3\)).

**Theorem 3**

If an ideal scheduler can schedule a task set \(\tau \) (periodic or sporadic tasks with arbitrary deadlines) on a unit-speed system with \(m\) identical cores, then global EDF can schedule \(\tau \) on \(m\) cores of speed \(2-\frac{1}{m}\).

*Proof*

Consider any time \(t\); we distinguish two cases, based on the number of incomplete sub-steps GEDF has had in \([0, t]\).

**Case 1:** In \([0, t]\), GEDF has at most \(tm\) incomplete sub-steps.

Since there are at least \((2tm-t)-tm=tm-t\) complete sub-steps, GEDF completes \(|\mathcal {F_G}(t)|-|\mathcal {F_G}(0)| \ge m(tm-t)+tm=tm^2\) units of work, since each complete sub-step finishes \(m\) sub-nodes and each incomplete sub-step finishes at least \(1\) sub-node. We define \(I(t)\) as the set of all sub-nodes from jobs with absolute deadlines no later than \(t\). Since the ideal scheduler can schedule this task set, we know that \(|I(t)|-|\mathcal {F_I}(0)| \le m \cdot mt = tm^2\), since the ideal scheduler finishes at most \(m\) sub-nodes in each sub-step and has \(mt\) sub-steps during \([0,t]\). Hence, \(|\mathcal {F_G}(t)|-|\mathcal {F_G}(0)| \ge |I(t)|-|\mathcal {F_I}(0)|\). By Eq. (23), we get \(|\mathcal {F_G}(t)| \ge |I(t)|\). Jobs in \(I(t)\) have earlier deadlines than all other jobs, so under GEDF no other jobs can interfere with them: the GEDF scheduler never executes other sub-nodes unless there are no ready sub-nodes from \(I(t)\). Since \(|\mathcal {F_G}(t)| \ge |I(t)|\), i.e. GEDF has finished at least as many sub-nodes as \(I(t)\) contains, GEDF must have finished all sub-nodes in \(I(t)\). Therefore, GEDF meets all deadlines, since it has finished all work that needed to be done by time \(t\).
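The counting in Case 1 can be summarized in one chain of inequalities (each complete sub-step finishes \(m\) sub-nodes, each incomplete sub-step at least 1, while the ideal scheduler finishes at most \(m\) sub-nodes in each of its \(mt\) sub-steps):

```latex
|\mathcal{F_G}(t)|-|\mathcal{F_G}(0)|
  \;\ge\; \underbrace{m(tm-t)}_{\text{complete sub-steps}}
        + \underbrace{tm}_{\text{incomplete sub-steps}}
  \;=\; tm^2
  \;=\; m \cdot mt
  \;\ge\; |I(t)|-|\mathcal{F_I}(0)|.
```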

**Case 2:** In \([0, t]\), GEDF has more than \(tm\) incomplete sub-steps.

For each integer \(s\), we define \(f(s)\) as the first time instant such that the number of incomplete sub-steps in the interval \([0, f(s)]\) is exactly \(s\). Note that sub-step \(f(s)\) is always incomplete, since otherwise it would not be the first such instant. We show, via induction, that \(\mathcal {F_I}(s) \subseteq \mathcal {F_G}(f(s))\); in other words, by sub-step \(f(s)\), GEDF has completed every sub-node that the ideal scheduler has completed after \(s\) sub-steps.

**Base Case:** For \(s=0\), \(f(s)=0\). By Eq. (23), the claim is vacuously true.

**Inductive Step:** Suppose that for \(s-1\) the claim \(\mathcal {F_I}(s-1) \subseteq \mathcal {F_G}(f(s-1))\) is true. Now, we prove that \(\mathcal {F_I}(s) \subseteq \mathcal {F_G}(f(s))\).

### 6.1 An example providing intuition for the proof

We provide an example in Fig. 6 to illustrate the proof of Case 2, comparing the execution traces of an ideal scheduler (this scheduler is "ideal" only in the sense that it meets all deadlines) and GEDF. In addition to task \(\tau _1\) from Fig. 1, task \(\tau _2\) consists of two nodes connected to another node, all with execution time 1 (each split into 2 sub-nodes in the figure). All tasks are released by time \(t_0\). The system has 2 cores, so GEDF has a resource augmentation bound of 1.5. The top of Fig. 6 shows the execution under the ideal scheduler on unit-speed cores, while the bottom shows the execution under GEDF on cores of speed 1.5. Each step is divided into 2 and 3 sub-steps, representing speeds 1 and 1.5 for the ideal scheduler and GEDF, respectively.

Since the critical-path length of task \(\tau _1\) is equal to its deadline, intuitively it should be executed immediately, even though it has the latest deadline. That is exactly what the ideal scheduler does. However, GEDF (which does not take critical-path length into consideration) prioritizes task \(\tau _2\). On a unit-speed system, GEDF would therefore cause task \(\tau _1\) to miss its deadline; with speed-1.5 cores, however, all jobs finish on time. To illustrate Case 2 of the above theorem, consider \(s=2\). Since \(t_{2(3)}\) is the second incomplete sub-step under GEDF, \(f(s) = t_{2(3)}\). All the sub-nodes finished by the ideal scheduler after the second sub-step (shown above in dark grey) have also been finished under GEDF by sub-step \(t_{2(3)}\) (shown below in dark grey).

## 7 Lower bound on capacity augmentation bound of GEDF

While the above proof guarantees a bound, the ideal scheduler is not known; hence, given a task set, we cannot tell whether it is feasible on unit-speed cores, and therefore we cannot tell whether it is schedulable by GEDF on cores of speed \(2-\frac{1}{m}\).

One standard way to prove resource augmentation bounds is to use lower bounds on the ideal scheduler, such as Inequalities 1 and 2. As previously stated, we call the resource augmentation bound proven using these lower bounds a capacity augmentation bound in order to distinguish it from the augmentation bound described above. To prove a capacity augmentation bound of \(b\) under GEDF, one must prove that if Inequalities 1 and 2 hold for a task set on \(m\) unit-speed cores, then GEDF can schedule that task set on \(m\) cores of speed \(b\). Hence, the capacity augmentation bound is also an easy schedulability test.

First, we demonstrate a counter-example showing that a capacity augmentation bound of 2 for GEDF is impossible.

The execution trace of the task set on 6 cores of speed 2 under GEDF is shown in Fig. 8. The first task is released at time 0 and immediately executed by GEDF. Since the system under GEDF runs at speed 2, \(W_1^{1,1}\) finishes executing at time 28. GEDF then executes 6 of the 12 parallel nodes of task \(\tau _1\). At time 29, task \(\tau _2\) is released. However, its deadline is \(r_2+D_2 = 29+60 = 89\), which is later than the deadline 88 of task \(\tau _1\). Nodes from task \(\tau _1\) are therefore not preempted by task \(\tau _2\) and continue to execute until all of them finish at time 60, so task \(\tau _1\) meets its deadline. The GEDF scheduler only then executes task \(\tau _2\), finishing it at time 90; task \(\tau _2\) thus just misses its deadline of 89. Note that this is not a counter-example to the resource augmentation bound of Theorem 3, since no scheduler can schedule this task set on a unit-speed system either.

## 8 Simulation evaluation

In this section, we present the results of our simulations of the performance of GEDF and the robustness of our capacity augmentation bound.^{2} We randomly generate task sets that fully load machines, and then simulate their execution on machines of increasing speed. The capacity augmentation bound for GEDF predicts that all task sets should become schedulable by the time the core speed is increased to \(4-\frac{2}{m}\). In our simulations, all task sets became schedulable before the speed reached 2.

We also compared GEDF with another method that provides capacity bounds for scheduling multiple DAGs (with a DAG's utilization potentially more than 1) on multicores (Saifullah et al. 2014). In this method, which we call DECOMP, tasks are decomposed into sequential subtasks and then scheduled using GEDF.^{3} We find that GEDF without decomposition performs better than DECOMP for most task sets.

### 8.1 Task sets and experimental setup

We generate two types of DAG tasks for evaluation. For each method, we first fix the number of nodes \(n\) in the DAG and then add edges.

**(1) Erdos–Renyi method** \(G(n,p)\) (Cordeiro et al. 2010): For a DAG with \(n\) nodes, there are about \(n^2/2\) possible valid edges. We go through each valid edge and add it with probability \(p\), where \(p\) is a parameter (i.e. a DAG with \(e\) valid edges will have \(ep\) edges on average). Note that this method does not necessarily generate a connected DAG. Although the bound does not require the DAG of a task to be connected, connecting more of its nodes can make it harder to schedule. Hence, we modified the algorithm slightly: in a last step, we add the fewest edges needed to make the DAG connected.
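The generation procedure just described can be sketched as follows (a minimal illustration with our own function and type names, not the generator used in the experiments). Orienting every edge from a lower-numbered to a higher-numbered node guarantees acyclicity, and a union-find pass adds one edge per extra component, the fewest needed to connect the DAG:

```cpp
#include <cassert>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

struct Dag {
    int n;
    std::vector<std::pair<int, int>> edges;   // edge u -> v with u < v
};

// Union-find helper used to track weak connectivity while adding edges.
static int find_root(std::vector<int>& parent, int x) {
    while (parent[x] != x) x = parent[x] = parent[parent[x]];
    return x;
}

Dag generate_gnp_dag(int n, double p, unsigned seed) {
    std::mt19937 rng(seed);
    std::bernoulli_distribution coin(p);
    Dag dag{n, {}};
    std::vector<int> parent(n);
    std::iota(parent.begin(), parent.end(), 0);
    // Visit each valid edge (u, v), u < v, and keep it with probability p.
    for (int u = 0; u < n; ++u)
        for (int v = u + 1; v < n; ++v)
            if (coin(rng)) {
                dag.edges.emplace_back(u, v);
                parent[find_root(parent, u)] = find_root(parent, v);
            }
    // Last step from the text: add the fewest edges needed to connect the
    // DAG (one edge per remaining weakly connected component).
    for (int v = 1; v < n; ++v)
        if (find_root(parent, v) != find_root(parent, 0)) {
            dag.edges.emplace_back(0, v);
            parent[find_root(parent, v)] = find_root(parent, 0);
        }
    return dag;
}
```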

**(2) Special synchronous task** \(L(n,m)\): As shown in Fig. 7, synchronous tasks in which highly parallel segments follow sequential segments make scheduling difficult for GEDF, since they can cause deadline misses for other tasks. Therefore, we generate task sets with alternating sequential and highly parallel segments. Tasks in \(L(n,m)\) (where \(m\) is the number of processors) are generated in the following way: while the total number of nodes in the DAG is smaller than \(n\), we add a sequential segment consisting of a single node and then randomly generate the next parallel layer. For each parallel layer, we uniformly generate a number \(t\) between 1 and \(\lfloor \frac{n}{m} \rfloor \) and set the number of nodes in the layer to \(t \cdot m\).
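The \(L(n,m)\) construction can be sketched as follows (function and variable names are ours; this is an illustration, not the experimental code). It returns the number of nodes in each segment, alternating a single sequential node with a parallel layer of \(t \cdot m\) nodes:

```cpp
#include <cassert>
#include <random>
#include <vector>

// Sketch of the L(n, m) synchronous-task generator described above.
std::vector<int> generate_sync_layers(int n, int m, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> width(1, n / m);  // t in [1, floor(n/m)]
    std::vector<int> layers;
    int total = 0;
    while (total < n) {
        layers.push_back(1);          // sequential segment: a single node
        total += 1;
        int t = width(rng);
        layers.push_back(t * m);      // parallel segment: t*m nodes
        total += t * m;
    }
    return layers;
}
```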

Given a task structure generated by either of the above methods, the worst-case execution time of each individual node in the DAG is picked randomly from the range \([50, 500]\). The critical-path length \(L_i\) of each task is then calculated. After that, we assign each task a period (equal to its deadline); note that a valid deadline is at least the critical-path length. Two types of periods were assigned to tasks.

**(1) Harmonic period:** All tasks have periods that are integral powers of 2. We first compute the smallest value \(a\) such that \(2^a\) is larger than the task's critical-path length \(L_i\). We then randomly assign the task a period of \(2^a\), \(2^{a+1}\), or \(2^{a+2}\) to generate tasks with varying utilization. All tasks are released at the same time, and each task set is simulated for its hyper-period.

**(2) Arbitrary period:** An arbitrary period is assigned in the form \((L_i + \frac{C_i}{0.5m})\cdot (1+0.25\cdot \mathrm{Gamma}(2,1))\), where \(\mathrm{Gamma}(2, 1)\) is the Gamma distribution with \(k = 2\) and \(\theta = 1\). The formula is designed such that, for small \(m\), tasks tend to have smaller utilization. This allows us to have a reasonable number of tasks per task set for any value of \(m\). Each task set is simulated for 20 times its longest period.
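The two period-assignment rules can be sketched as follows (helper names are ours). Both guarantee a period at least as large as the critical-path length \(L_i\), so every deadline is valid:

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Harmonic rule: the smallest power of two exceeding the critical-path
// length L, shifted up by 0, 1, or 2 powers at random.
double harmonic_period(double L, std::mt19937& rng) {
    int a = 0;
    while (std::pow(2.0, a) <= L) ++a;            // smallest a with 2^a > L
    std::uniform_int_distribution<int> shift(0, 2);
    return std::pow(2.0, a + shift(rng));         // 2^a, 2^{a+1}, or 2^{a+2}
}

// Arbitrary rule: (L + C/(0.5 m)) * (1 + 0.25 * Gamma(2, 1)).
double arbitrary_period(double L, double C, int m, std::mt19937& rng) {
    std::gamma_distribution<double> g(2.0, 1.0);  // shape k = 2, scale theta = 1
    return (L + C / (0.5 * m)) * (1.0 + 0.25 * g(rng));
}
```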

Several parameters were varied to test the system: \(G(n,p)\) versus \(L(n,m)\) DAGs, different values of \(p\) for \(G(n,p)\), harmonic versus arbitrary periods, and the number of cores \(m\) (4, 8, 16, 32, 64). Task sets are created by adding tasks until the total utilization reaches \(99\,\%\) of \(m\).

We first simulated the task sets for each setting on cores of speed 1, and increased the speed in steps of \(0.2\). For each setting, we measured the failure ratio—the number of task sets in which *any* task missed its deadline, divided by the total number of simulated task sets. We stopped increasing the speed for a task set once no deadline was missed.

Our experiments are statistically unbiased, because tasks and task sets are generated randomly in an independent and identically distributed manner. For each setting, we generated 1,000 task sets. This number is large enough to provide stable failure-ratio results; the exact minimum schedulable speedup depends on the particular task sets generated, so only the trend is comparable across settings and schedulers.

### 8.2 Simulation results

We first present the results for task sets generated by the Erdos–Renyi method for various settings of \(p\) and different numbers of processors, to see the effect of these parameters on the performance of GEDF.

#### 8.2.1 Erdos–Renyi method

For this method, we generate two types of task sets: (1) *Fixed-\(p\) task sets*: all task sets have the same \(p\); we varied \(p\) over {0.01, 0.02, 0.03, 0.05, 0.07, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}. (2) *Random-\(p\) task sets*: each task has a different, randomly picked value of \(p\).

Figure 9d–f shows the failure rates for fixed-\(p\) task sets as we varied \(p\) while keeping \(m\) constant at 16. GEDF without decomposition still outperforms DECOMP in almost all cases. Comparing the results for 64-core and 16-core task sets with the same \(p\), we can see that DECOMP improves greatly while GEDF improves only slightly. This is mostly because, under GEDF, most task sets are already schedulable at a speedup of 1.4; the required speedup may already be near its limit, leaving little room for improvement.

Figure 11 also allows us to see other effects. For instance, we can compare the failure rates for harmonic versus arbitrary periods by comparing Fig. 11b with Fig. 11a. The figures suggest that, in general, harmonic- and arbitrary-period task sets behave similarly. Tasks with arbitrary periods do appear slightly easier to schedule, especially for GEDF; this is at least partially explained by the observation that, with harmonic periods, many tasks have the same deadline, making it difficult for GEDF to distinguish between them. These trends also hold for other parameter settings, so we omit those figures to reduce redundancy.

We also compare the effect of fixed versus random \(p\) by comparing Fig. 11c with Fig. 11a. The former shows the failure ratio for GEDF and DECOMP, as we vary \(m\), for task sets where \(p\) is not fixed but randomly generated for each task. Again, GEDF outperforms DECOMP. Note, however, that GEDF appears to have a harder time with random \(p\) than in the fixed-\(p\) experiment.

#### 8.2.2 Synchronous method

Figure 11d shows the comparison between GEDF and DECOMP, varying \(m\), for the specially constructed synchronous task sets. In this case, the failure ratio for GEDF is higher than for task sets generated with the Erdos–Renyi method, and DECOMP sometimes outperforms GEDF in terms of failure ratio and required speedup. This indicates that synchronous tasks with highly parallel segments are indeed more difficult for GEDF to schedule. However, even in this case, a speedup of more than \(2\) was never required: even though Fig. 7 demonstrates that task sets requiring a speedup of more than \(2\) exist, such pathological task sets never appeared in our randomly generated sample.

In conclusion, simulation results indicate that GEDF performs better than predicted by the capacity augmentation bound. For most task sets, GEDF is also better than DECOMP.

## 9 Parallel GEDF platform

To demonstrate the feasibility of parallel GEDF scheduling, we implemented a simple prototype platform called PGEDF by combining the GNU OpenMP runtime system and the \(\text{LITMUS}^\text{RT}\) operating system. PGEDF is a straightforward implementation based on these off-the-shelf systems; it simply sets appropriate parameters for both OpenMP and \(\text{LITMUS}^\text{RT}\) without modifying either. The platform is also easy to use: the user writes tasks as programs with standard OpenMP directives and compiles them with the g++ compiler. In addition, the user provides a task-set configuration file that specifies the tasks in the task set and their deadlines. To validate the theory we present, PGEDF is configured for CPU-intensive workloads; cache and I/O effects are beyond the scope of this paper.

Note that our goal in implementing PGEDF as a prototype platform is to show that GEDF is a good scheduler for parallel real-time tasks. The platform uses the GEDF plug-in of \(\text{LITMUS}^\text{RT}\) to execute the tasks. Our experimental results show that this PGEDF implementation performs better than two other existing platforms for parallel real-time tasks in most cases. Recent work (Cerqueira et al. 2014) has shown that by using message passing instead of coarse-grained locking (as used in \(\text{LITMUS}^\text{RT}\)), the overhead of a GEDF scheduler can be significantly lowered. We could therefore potentially get even better performance by using G-EDF-MP as the underlying operating-system scheduler (instead of \(\text{LITMUS}^\text{RT}\)). However, improving the implementation and performance of PGEDF is beyond the scope of this work.

We first describe the relevant aspects of OpenMP and \(\text{LITMUS}^\text{RT}\), and then describe the specific settings that allow us to run parallel real-time tasks on this platform.

### 9.1 Background

We briefly introduce the GNU OpenMP runtime system and the \(\text{LITMUS}^\text{RT}\)-patched Linux operating system, with an emphasis on the particular features that PGEDF relies on to realize parallel GEDF scheduling.

#### 9.1.1 OpenMP overview

OpenMP is a specification for parallel programs that defines an open standard for parallel programming in C, C++, and Fortran (OpenMP 2011). It consists of a set of compiler directives, library routines, and environment variables that influence runtime behavior. Our PGEDF implementation uses the GNU implementation of the OpenMP runtime system (GOMP), which is part of GCC (the GNU Compiler Collection).

In OpenMP, *logical parallelism* in a program is specified through compiler pragma statements. For example, a regular for-loop can be transformed into a parallel for-loop by simply adding #pragma omp parallel for above the regular for statement. This gives the system permission to execute the iterations of the loop independently in parallel with each other. If the compiler does not support OpenMP, the pragma will be ignored, and the for-loop will be executed sequentially. On the other hand, if OpenMP is supported, then the runtime system can choose to execute these iterations in parallel. OpenMP also supports other parallel constructs; however, for our prototype of PGEDF, we only support parallel synchronous tasks. These tasks are described as a series of segments which can be parallel or sequential. A parallel segment is described as a parallel for-loop while a sequential segment consists of arbitrary sequential code. Therefore, we will restrict our attention to parallel for-loops.
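A minimal example of the construct described above (our own illustrative function, not PGEDF code): adding the pragma lets the loop iterations run in parallel when compiled with OpenMP support (e.g. -fopenmp); otherwise the pragma is ignored and the loop runs sequentially, with the same result.

```cpp
#include <cassert>
#include <vector>

// Scale every element of a vector; the iterations are independent, so the
// loop may safely be executed in parallel by the OpenMP runtime.
std::vector<double> scale_all(const std::vector<double>& in, double k) {
    std::vector<double> out(in.size());
    #pragma omp parallel for
    for (long i = 0; i < (long)in.size(); ++i)
        out[i] = in[i] * k;
    return out;
}
```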

We now briefly describe the OpenMP (specifically GOMP) runtime strategy for such programs. Under GOMP, each OpenMP program starts with a single thread, called the master thread. During execution, when the runtime system encounters the first parallel section of the program, the master thread will create a thread pool and assign that team of threads to the parallel region. The threads created by the master thread in the thread pool are called worker threads. The number of worker threads can be set by the user.

The master thread executes the sequential segments. In parallel segments (parallel for-loops), each iteration is considered a unit of work, and the runtime maps (distributes) these units to the team of threads according to the chosen policy, as specified by arguments passed to the omp_set_schedule() function call. OpenMP offers three kinds of policies: dynamic, static, and guided. Under the static 1 policy, all iterations are divided among the team of threads at the start of the loop and distributed one by one: each thread in the team gets one iteration at a time, in a round-robin manner.

Once a thread finishes all its assigned work from a particular parallel segment, it waits for all other threads in the team to finish before moving on to the next segment of the task. The waiting policy can be set via the environment variable OMP_WAIT_POLICY. With passive synchronization, waiting threads are blocked and put into the Linux sleep queue, where they consume no CPU cycles while waiting. With active synchronization, waiting threads spin without yielding their processors, consuming CPU cycles while waiting.

One important property of GOMP, upon which our implementation relies, is that the team of threads for each program is reusable. After the execution of a parallel region, the threads in the team are not destroyed; instead, all threads except the master thread wait for the next parallel segment, again according to the policy set by OMP_WAIT_POLICY, while the master thread continues with the sequential segment. When the master thread encounters the next parallel segment, the GOMP runtime system detects that a team of threads is already available and simply reuses it, as before.

#### 9.1.2 \(\text {LITMUS}^\text {RT}\) overview

\(\text{LITMUS}^\text{RT}\) (Linux Testbed for Multiprocessor Scheduling in Real-Time Systems) is an algorithm-oriented real-time extension of Linux (Brandenburg and Anderson 2009). It focuses on multiprocessor real-time scheduling and synchronization. Many real-time schedulers, including global, clustered, partitioned, and semi-partitioned schedulers, are implemented as plug-ins for Linux. Users can use these schedulers for real-time tasks, and standard Linux scheduling for non-real-time tasks.

In \(\text{LITMUS}^\text{RT}\), the GEDF implementation is meant for sequential tasks. A typical \(\text{LITMUS}^\text{RT}\) real-time program contains one or more rt_tasks (real-time tasks), which are released periodically. Each rt_task can be regarded as an rt_thread: a standard Linux thread with real-time parameters. Under the GEDF scheduler, an rt_task can be suspended and migrated to a different CPU core according to the GEDF scheduling strategy. The platform uses three main data structures to hold these tasks: a release queue, a one-to-one processor mapping, and a shared ready queue. The release queue, implemented as a priority queue with a clock-tick handler, holds jobs that are waiting to be released. The one-to-one processor mapping records, for each processor, the thread of the job currently executing on it. The ready queue, shared by all processors, is implemented as a priority queue using binomial heaps for fast queue-merge operations triggered by jobs with coinciding release times.
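The shared ready queue can be modeled in user space as a priority queue ordered by absolute deadline, earliest first. This is only an illustration of the ordering GEDF enforces, not \(\text{LITMUS}^\text{RT}\)'s in-kernel binomial-heap implementation:

```cpp
#include <cassert>
#include <queue>
#include <vector>

// A job with an absolute deadline; smaller deadline means higher priority.
struct Job { int id; long abs_deadline; };

// Comparator making std::priority_queue a min-heap on the deadline.
struct LaterDeadline {
    bool operator()(const Job& a, const Job& b) const {
        return a.abs_deadline > b.abs_deadline;
    }
};

using ReadyQueue = std::priority_queue<Job, std::vector<Job>, LaterDeadline>;
```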

A thread is run as a \(\text{LITMUS}^\text{RT}\) real-time task in three steps:

1. First, the function init_rt_thread() is called to initialize the user-space real-time interface for the thread.
2. Then, the real-time parameters of the thread are set by calling set_rt_task_param(getid(), &rt_task_param): the getid() function returns the actual thread ID in the system; the real-time parameters, including period, relative deadline, execution time, and budget policy, are stored in the rt_task_param structure; these parameters are then stored in the TCB (thread control block) under the unique thread ID and validated by the kernel.
3. Finally, task_mode(LITMUS_RT_TASK) is called to start running the thread as a real-time task.

### 9.2 PGEDF platform implementation

Now we describe how our PGEDF platform integrates the GOMP runtime with GEDF scheduling in \(\text{LITMUS}^\text{RT}\) to execute parallel real-time tasks. The PGEDF platform provides two key features: parallel task execution and real-time GEDF scheduling. The GOMP runtime system performs the parallel execution of each task, while real-time execution and GEDF scheduling are realized by the \(\text{LITMUS}^\text{RT}\) GEDF plug-in.

#### 9.2.1 Programming interface

Currently, PGEDF only supports synchronous task sets with implicit deadlines—tasks that consist of a sequence of segments, where each segment is either a parallel segment (specified using a parallel for-loop) or a sequential segment (regular sequential code).

#### 9.2.2 PGEDF operation

Unlike sequential tasks, where there is only one thread per rt_task, a parallel task has a team of threads generated by OpenMP. Since all the threads in the team belong to the same task, we must set all their deadlines (periods) to be the same. In addition, we must make sure that all the threads of all the tasks are properly handled by the GEDF plug-in in \(\text{LITMUS}^\text{RT}\). We now describe how to set the parameters of both OpenMP and \(\text{LITMUS}^\text{RT}\) to properly enforce this policy.

We first describe the specific parameter settings we use to ensure correct execution: (1) We specify the static 1 policy within OpenMP, to ensure that each thread gets approximately the same amount of work. (2) We set the OpenMP thread-synchronization policy to passive. As discussed in Sect. 9.1.1, PGEDF cannot allow threads to busy-wait; with blocking synchronization, a worker thread that finishes its job goes to sleep immediately and yields the processor to threads from other tasks. The GEDF scheduler then assigns the available core to the thread at the front of the prioritized ready queue. Thus, the idle time of one task can be utilized by other tasks, which is consistent with GEDF scheduling theory. (3) For each task, we set the number of threads equal to the number of available cores, \(m\), using the GOMP function call *omp_set_num_threads(m)*. This means that if there are \(n\) tasks in the system, there are \(mn\) threads in total. (4) In \(\text{LITMUS}^\text{RT}\), the budget_policy is set to NO_ENFORCEMENT and the execution time of a thread is set equal to the relative deadline, as we do not need bandwidth control.

Let us first look at the initial for-loop. This parallel for-loop sets the proper real-time parameters for the task so that it is correctly scheduled by the GEDF plug-in in \(\text{LITMUS}^\text{RT}\). We must set the real-time parameters for the entire team of OpenMP threads of the task; however, OpenMP threads are designed to be invisible to programmers, so we have no direct access to them. We get around this problem with the initial for-loop, which has exactly \(m\) iterations—recall that each task has exactly \(m\) threads in its thread pool. Before this parallel for-loop, we set the scheduling policy to static 1, a round-robin static mapping between iterations and threads, so each iteration is mapped to exactly one thread in the thread pool. Therefore, even though we cannot directly access OpenMP threads, we can still set their real-time parameters inside the initial parallel for-loop, by calling rt_thread(period, deadline) within the loop. This function is defined within the PGEDF platform to perform the configuration of \(\text{LITMUS}^\text{RT}\); in particular, it performs the configuration steps itemized in the previous section. Since the thread team is reused for all parallel regions of the same program, we only need to set its real-time parameters once, during task initialization, rather than at each job invocation.
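The initialization trick can be sketched as follows. This is an illustration, not PGEDF source: rt_thread() here is a hypothetical stand-in that only counts invocations (the real PGEDF helper configures \(\text{LITMUS}^\text{RT}\)), and the OpenMP calls are guarded so the sketch also runs without OpenMP support. The point is that an \(m\)-iteration parallel for-loop under the static 1 policy hands exactly one iteration to each of the \(m\) pool threads, so rt_thread() runs exactly \(m\) times, once per thread:

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical stand-in for PGEDF's rt_thread() helper: counts calls.
static int rt_thread_calls = 0;
void rt_thread(long period, long deadline) {
    (void)period; (void)deadline;
    #pragma omp atomic
    ++rt_thread_calls;
}

// Initialization sketch: one iteration (and hence one rt_thread() call)
// is mapped to each of the m threads in the task's thread pool.
void pgedf_init(int m, long period, long deadline) {
#ifdef _OPENMP
    omp_set_num_threads(m);                   // one thread per core
    omp_set_schedule(omp_sched_static, 1);    // round-robin "static 1" policy
#endif
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < m; ++i)
        rt_thread(period, deadline);
}
```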

After initialization, each task is executed periodically by task.run(task_argc, task_argv), inside which there can be multiple parallel for-loops executed by the same team of threads. Periodic execution is achieved by the parallel for-loop after the task.run function: after each job invocation, this loop ensures that sleep_next_period() is called by each thread in the thread pool. Note again that since the number of iterations in this parallel for-loop is \(m\), each thread gets exactly one iteration, ensuring that every thread calls this function. This last for-loop is similar to the initialization for-loop, but it tells the system that all the threads in the team of this task have finished their work and should only be woken up when the next period begins.

We can now briefly argue that these settings guarantee correct GEDF execution. After we set the real-time parameters appropriately, all relative deadlines are automatically converted to absolute deadlines when scheduled by \(\text{LITMUS}^\text{RT}\). Since every thread in the team of a particular task has the same deadline, all threads of that task have the same priority. Also, threads of a task with an earlier deadline have higher priority than threads of tasks with later deadlines—this is guaranteed by the \(\text{LITMUS}^\text{RT}\) GEDF plug-in. Since the number of threads allocated to each program equals the number of cores, as required by GEDF, each job can utilize the entire machine when it is the highest-priority task and has enough parallelism. If it does not have enough parallelism, some of its threads sleep and yield the machine to the job with the next highest priority. Therefore, the GEDF scheduler within \(\text{LITMUS}^\text{RT}\) enforces the correct priorities using the ready queue.

## 10 Experimental evaluation of PGEDF

We now describe our empirical evaluation of PGEDF using randomly generated tasks in OpenMP. Our experiments indicate that the parallel GEDF scheduling algorithm provides good real-time performance and that PGEDF outperforms the only other openly available parallel real-time platform, RT-OpenMP (Ferry et al. 2013), in most cases.

### 10.1 Experimental machines

Our experimental hardware is a 16-core machine with two Intel Xeon E5-2687W processors. We use the \(\text{LITMUS}^\text{RT}\)-patched Linux kernel 3.10.5 and the GOMP runtime system from GCC version 4.6.3. The first core of the machine is always reserved in \(\text{LITMUS}^\text{RT}\) for releasing jobs periodically when running experiments. To test both single-socket and multi-socket performance, we ran two configurations: one with 7 experimental cores (1 reserved for releasing jobs and the other 8 disabled) and one with 14 experimental cores (1 reserved for releasing jobs and 1 disabled). For experiments with \(m\) available cores (\(m=7\) or 14) and one reserved core, we set the number of cores through the Linux kernel boot-time parameter maxcpus=\(m+1\). After rebooting, only \(m+1\) cores are available and the rest are disabled entirely.

### 10.2 Task set generation

Here, we describe how we randomly generate task sets for our empirical evaluation; Table 1 summarizes their characteristics. Task sets whose names begin with *T7* use 7 cores, and the rest use 14 cores. For each task, we first randomly select its period (and deadline) \(D\) in a range between 4 and 128 ms. For task sets with harmonic deadlines, periods are always chosen from {4, 8, 16, 32, 64, 128 ms}, while for arbitrary deadlines, periods can be any value between 4 and 128 ms.

**Table 1** Task set characteristics

Name | Total \(\#\) cores | Deadline | \(L/D\,\%\) | Avg. \(\#\) iterations | Avg. \(\#\) tasks per task set
---|---|---|---|---|---
T14:LP:LS:Har | 14 | Harmonic | 100 | 8 | 5.03
T14:HP:LS:Har | 14 | Harmonic | 100 | 12 | 3.38
T14:LP:HS:Har | 14 | Harmonic | 50 | 8 | 8.58
T14:HP:HS:Har | 14 | Harmonic | 50 | 12 | 5.22
T7:LP:LS:Har | 7 | Harmonic | 100 | 8 | 3.66
T7:HP:HS:Har | 7 | Harmonic | 50 | 12 | 3.47
T14:HP:LS:Arb | 14 | Arbitrary | 100 | 12 | 3.33

The task sets vary along two other dimensions. (1) Tasks may have low parallelism or high parallelism. We control the parallelism by controlling the average number of iterations in each parallel for-loop: for low-parallelism task sets, the number of iterations in each parallel for-loop is chosen from a log-normal distribution with mean 8; for high-parallelism task sets, from a log-normal distribution with mean 12. In Table 1, high-parallelism task sets have HP in their label, while low-parallelism task sets have LP. Note that high-parallelism task sets have fewer tasks per task set on average, since each individual task typically has higher utilization. (2) Tasks may have low slack (LS) or high slack (HS). We control the slack of a task by controlling the ratio between its critical-path length and its deadline: for low-slack tasks, the critical-path length can be as large as the period; for high-slack tasks, it is at most half the deadline. In general, low-slack tasks are more difficult to schedule.

For all types of tasks, the execution time of each iteration was chosen from a log-normal distribution with a mean of 700 \(\upmu \)s. Segments were added to a task until adding another segment would make its critical-path length exceed the desired maximum ratio of the deadline (\(1/2\) for high-slack tasks and \(1\) for low-slack tasks). Each task set starts empty, and tasks are added until the total utilization is between 98 and \(100\,\%\) of \(m\)—the number of cores in the machine. For example, for 14-core experiments, the total utilization is between 13.72 and 14. As before, tasks and task sets are generated randomly in an independent and identically distributed manner.

As with the numerical simulation experiments described in Sect. 8, we wished to understand the effect of speedup. We achieved the desired speedup by scaling down the execution time of each iteration of each segment of each task in each task set. For each experiment, we first generated 100 task sets with total utilization between \(0.98m\) and \(m\), and then scaled the execution times down by a factor equal to the desired speedup. For example, at a speedup of 2, an iteration with an execution time of 700 \(\upmu \)s is scaled down to 350 \(\upmu \)s, and the total utilization of the task set becomes about 7 in a 14-core experiment. In this manner, without changing the actual core speed, we achieve the desired speedup relative to the original task set. We evaluated the speedup values {5, 3.3, 2.5, 2, 1.8, 1.6, 1.4, 1.2}, which correspond to total utilizations of {20, 30, 40, 50, 56, 62.5, 71.4, 83.3} % of \(m\).
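The scaling step is straightforward arithmetic; a sketch follows, assuming tasks are represented as lists of (iterations, execution-time) segment pairs, which is an illustrative representation rather than the platform's actual data structure:

```python
def scale_task(period, segments, speedup):
    """Divide every iteration's execution time by the speedup factor.
    The period and deadline stay fixed, so utilization drops by the same factor."""
    return period, [(iters, t / speedup) for iters, t in segments]

def utilization(period, segments):
    """Utilization of one task: total work of a job divided by its period."""
    return sum(iters * t for iters, t in segments) / period
```

For instance, a 700 \(\upmu\)s iteration at speedup 2 becomes 350 \(\upmu\)s, and a speedup of \(b\) leaves the task set at \(100/b\,\%\) of its original utilization, matching the evaluated pairs (speedup 5 gives 20 %, speedup 2 gives 50 %, and so on).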

### 10.3 Baseline platform

We compared the performance of PGEDF with RT-OpenMP (Ferry et al. 2013), the only other open-source platform that can schedule parallel synchronous task sets on multicore systems. RT-OpenMP is based on a task-decomposition scheduling strategy similar to the DECOMP algorithm in Sect. 8: parallel tasks are decomposed into sequential subtasks with intermediate release times and deadlines, and these sequential subtasks are then scheduled using a partitioned deadline-monotonic scheduling strategy (Fisher et al. 2006). This decomposition-based scheduler was shown to guarantee a capacity augmentation bound of 5 (Saifullah et al. 2013); in theory, any valid bin-packing strategy provides this bound. The original paper (Ferry et al. 2013) compared worst-fit and best-fit bin-packing strategies for partitioning and found that worst-fit always performed better. We therefore compare PGEDF (solid lines in the figures) only against RT-OpenMP with worst-fit bin-packing (dashed lines in the figures).
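The worst-fit partitioning step of the baseline can be sketched as follows. This is a minimal illustration of the bin-packing strategy, not RT-OpenMP's actual code; subtasks are represented only by their utilizations, which is an assumption made for brevity.

```python
def worst_fit_partition(subtask_utils, m):
    """Place each subtask on the currently least-loaded core (worst-fit).
    Returns the per-core loads, or None if some subtask would push a core's
    load past 1.0, i.e., the partitioning attempt fails."""
    loads = [0.0] * m
    for u in subtask_utils:
        k = min(range(m), key=loads.__getitem__)  # index of the least-loaded core
        if loads[k] + u > 1.0:
            return None                           # no core can accept this subtask
        loads[k] += u
    return loads
```

Worst-fit spreads load evenly across cores, leaving every core with similar slack for its partitioned deadline-monotonic schedule; this even spreading is a plausible reason it outperformed best-fit in the original comparison.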

### 10.4 Experiment results

For all experiments, each task set was run for 1,000 hyper-periods in the harmonic-deadline case and for 1,000 times the largest period in the arbitrary-deadline case. We say that a task set failed if any task missed any deadline over the entire run of the experiment. In all figures, we plot the failure rate, the ratio of failed task sets to the total number of task sets. The \(x\)-axis is the task set’s utilization as a percentage of \(m\); for example, \(50\,\%\) utilization in a 14-core experiment corresponds to a total utilization of \(7\). This setting is also equivalent to running the experiment on a machine of speed 2; the corresponding speedup factor is shown as a second \(x\)-axis at the top of each figure.

We now examine the influence of the degree of parallelism in task sets. First, consider the high-slack task sets: Fig. 16a, b show the results for high- and low-parallelism task sets under the high-slack setting. High-parallelism task sets have a slightly higher failure ratio than low-parallelism task sets on both platforms, but the difference is not significant. For the low-slack case, Fig. 15a, b show the results for high- and low-parallelism task sets, and here the results are reversed: both platforms perform better on high-parallelism task sets than on low-parallelism ones. We believe this is because low-parallelism task sets contain a larger number of tasks per task set (shown in Table 1), which leads to higher overhead from a larger total number of threads. For low-slack tasks, the slack between the deadline and the critical-path length is relatively small, so they are more sensitive to overhead; consequently, when there are many threads (as in low-parallelism task sets), they perform worse. Also note that this effect is much more pronounced on RT-OpenMP than on PGEDF, indicating that PGEDF may be less affected by overheads and more scalable.

Finally, note that there is a significant difference between the simulation results in Sect. 8 and these experiments: in simulation, GEDF required a speedup of at most 2, while here it often requires a speedup of 2.5 or more. This is not surprising, since real platforms have overheads that are completely ignored in simulation. In particular, in 14-core experiments on our machines, the operating system incurs high inter-socket communication overhead, which is ignored by the theory and not modeled in the simulations.

In conclusion, PGEDF outperforms RT-OpenMP in all experiments and generally requires a lower speedup to schedule task sets. In addition, the capacity augmentation bound of 4 for the GEDF scheduler held in all experiments conducted here.

## 11 Conclusions

In this paper, we have presented the best known bounds for GEDF scheduling of parallel tasks represented as DAGs. In particular, we proved that GEDF provides a resource augmentation bound of \(2-1/m\) for sporadic task sets with arbitrary deadlines and a capacity augmentation bound of \(4-2/m\) for implicit deadlines. The capacity augmentation bound also serves as a simple schedulability test: a task set is schedulable on \(m\) cores if (1) \(m\) is at least \(4-2/m\) times its total utilization, and (2) the implicit deadline of each task is at least \(4-2/m\) times its critical-path length. We also presented a pseudo-polynomial-time fixed-point schedulability test for GEDF.
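The linear-time test implied by the capacity augmentation bound can be transcribed directly; the tuple representation of tasks below is an assumption made for illustration:

```python
def gedf_capacity_test(m, tasks):
    """Sufficient schedulability test from the (4 - 2/m) capacity augmentation
    bound: tasks is a list of (utilization, critical_path_length, deadline)
    tuples for implicit-deadline DAG tasks on m cores under GEDF."""
    b = 4.0 - 2.0 / m
    total_util = sum(u for u, _, _ in tasks)
    # condition (1): m >= b * total utilization
    # condition (2): each deadline >= b * critical-path length
    return total_util * b <= m and all(cp * b <= d for _, cp, d in tasks)
```

The test runs in a single pass over the task set, which is what makes it attractive as an admission check compared to the pseudo-polynomial fixed-point test.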

We presented two types of evaluation results. First, we simulated randomly generated DAG tasks under a variety of settings; in these simulations, we never observed a required capacity augmentation of more than 2. Second, we implemented and empirically evaluated a simple prototype platform, PGEDF, for running parallel tasks using GEDF. Programmers write their programs using OpenMP pragmas, and the platform schedules them on a multicore machine. For computationally intensive tasks, our experiments indicate that this platform outperforms a previous platform that relies on task decomposition.

There are three possible directions for future work. First, we would like to extend the capacity augmentation bounds to constrained and arbitrary deadlines. Second, while we proved a lower bound of \(\frac{3+\sqrt{5}}{2}\) on the capacity augmentation required by GEDF, a gap remains between this lower bound and the upper bound of \(4-2/m\), which we would like to close. Finally, we would like to conduct more experiments on PGEDF to quantify its performance and measure its overheads in more detail, and to improve the platform based on those findings.

## Footnotes

- 1.
In ECRTS 2013, two papers (Li et al. 2013; Bonifaci et al. 2013) proved the same resource augmentation bound of \(2-\frac{1}{m}\). These results were derived independently and in parallel, using different analysis techniques. A more detailed discussion of the results from Bonifaci et al. (2013) is presented in Sect. 2.

- 2.
Note that, due to the lack of a corresponding schedulability test, it is difficult to validate the resource augmentation bound of \(2-1/m\) experimentally or through simulation.

- 3.
For DECOMP, we report end-to-end deadline miss ratios rather than the miss ratios of the decomposed subtasks' deadlines.

## Notes

### Acknowledgments

This research was supported in part by NSF Grants CCF-1136073 (CPS) and CCF-1337218 (XPS).

### References

- Agrawal K, Leiserson CE, He Y, Hsu WJ (2008) Adaptive work-stealing with parallelism feedback. ACM Trans Comput Syst 26(3):7
- Anderson JH, Calandrino JM (2006) Parallel real-time task scheduling on multicore platforms. In: RTSS
- Andersson B, Jonsson J (2003) The utilization bounds of partitioned and Pfair static-priority scheduling on multiprocessors are 50 %. In: ECRTS
- Andersson B, de Niz D (2012) Analyzing global-EDF for multiprocessor scheduling of parallel tasks. Principles of distributed systems. Prentice Hall, Upper Saddle River, pp 16–30
- Andersson B, Baruah S, Jonsson J (2001) Static-priority scheduling on multiprocessors. In: RTSS
- Axer P, Quinton S, Neukirchner M, Ernst R, Dobel B, Hartig H (2013) Response-time analysis of parallel fork-join workloads with real-time constraints. In: ECRTS
- Baker TP (2005) An analysis of EDF schedulability on a multiprocessor. IEEE Trans Parallel Distrib Syst 16(8):760–768
- Baker TP, Baruah SK (2009) Sustainable multiprocessor scheduling of sporadic task systems. In: ECRTS
- Baruah S (2004) Optimal utilization bounds for the fixed-priority scheduling of periodic task systems on identical multiprocessors. IEEE Trans Comput 53(6):781–784
- Baruah S, Baker T (2008) Schedulability analysis of global EDF. Real-Time Syst 38(3):223–235
- Baruah S, Bonifaci V, Marchetti-Spaccamela A, Stiller S (2010) Improved multiprocessor global schedulability analysis. Real-Time Syst 46(1):3–24
- Baruah SK, Bonifaci V, Marchetti-Spaccamela A, Stougie L, Wiese A (2012) A generalized parallel task model for recurrent real-time processes. In: RTSS
- Bertogna M, Baruah S (2011) Tests for global EDF schedulability analysis. J Syst Arch 57(5):487–497
- Bertogna M, Cirinei M, Lipari G (2009) Schedulability analysis of global scheduling algorithms on multiprocessor platforms. IEEE Trans Parallel Distrib Syst 20(4):553–566
- Bonifaci V, Marchetti-Spaccamela A, Stiller S, Wiese A (2013) Feasibility analysis in the sporadic DAG task model. In: ECRTS
- Brandenburg BB, Anderson JH (2009) On the implementation of global real-time schedulers. In: RTSS
- Calandrino JM, Anderson JH (2009) On the design and implementation of a cache-aware multicore real-time scheduler. In: ECRTS
- Cerqueira F, Brandenburg BB (2013) A comparison of scheduling latency in Linux, PREEMPT-RT, and LITMUS\(^\text{RT}\). In: OSPERT
- Cerqueira F, Vanga M, Brandenburg BB (2014) Scaling global scheduling with message passing. In: RTAS
- Chwa HS, Lee J, Phan KM, Easwaran A, Shin I (2013) Global EDF schedulability analysis for synchronous parallel tasks on multicore platforms. In: ECRTS
- Collette S, Cucu L, Goossens J (2008) Integrating job parallelism in real-time scheduling theory. Inf Process Lett 106(5):180–187
- Cordeiro D, Mouni G, Perarnau S, Trystram D, Vincent JM, Wagner F (2010) Random graph generation for scheduling simulations. In: SIMUTools
- Davis RI, Burns A (2011) A survey of hard real-time scheduling for multiprocessor systems. ACM Comput Surv 43(4):35
- Deng X, Gu N, Brecht T, Lu K (1996) Preemptive scheduling of parallel jobs on multiprocessors. In: SODA
- Drozdowski M (1996) Real-time scheduling of linear speedup parallel tasks. Inf Process Lett 57(1):35–40
- Ferry D, Li J, Mahadevan M, Agrawal K, Gill C, Lu C (2013) A real-time scheduling service for parallel tasks. In: RTAS
- Fisher N, Baruah S, Baker TP (2006) The partitioned scheduling of sporadic tasks according to static-priorities. In: ECRTS
- Garey MR, Johnson DS (1979) Computers and intractability: a guide to the theory of NP-completeness. WH Freeman & Co, San Francisco
- Goossens J, Funk S, Baruah S (2003) Priority-driven scheduling of periodic task systems on multiprocessors. Real-Time Syst 25(2–3):187–205
- Kato S, Ishikawa Y (2009) Gang EDF scheduling of parallel task systems. In: RTSS
- Kim J, Kim H, Lakshmanan K, Rajkumar RR (2013) Parallel scheduling for cyber-physical systems: analysis and case study on a self-driving car. In: ICCPS
- Lakshmanan K, Kato S, Rajkumar R (2010) Scheduling parallel real-time tasks on multi-core processors. In: RTSS
- Lee J, Shin KG (2012) Controlling preemption for better schedulability in multi-core systems. In: RTSS
- Lee WY, Heejo L (2006) Optimal scheduling for real-time parallel tasks. IEICE Trans Inf Syst 89(6):1962–1966
- Lelli J, Lipari G, Faggioli D, Cucinotta T (2011) An efficient and scalable implementation of global EDF in Linux. In: OSPERT
- Li J, Agrawal K, Lu C, Gill C (2013) Analysis of global EDF for parallel tasks. In: ECRTS
- Liu C, Anderson J (2012) Supporting soft real-time parallel applications on multicore processors. In: RTCSA
- López JM, Díaz JL, García DF (2004) Utilization bounds for EDF scheduling on real-time multiprocessor systems. Real-Time Syst 28(1):39–68
- Maghareh A, Dyke S, Prakash A, Bunting G, Lindsay P (2012) Evaluating modeling choices in the implementation of real-time hybrid simulation. In: EMI/PMC
- Manimaran G, Murthy CSR, Ramamritham K (1998) A new approach for scheduling of parallelizable tasks in real-time multiprocessor systems. Real-Time Syst 15(1):39–60
- Nelissen G, Berten V, Goossens J, Milojevic D (2012) Techniques optimizing the number of processors to schedule multi-threaded tasks. In: ECRTS
- Nogueira L, Pinho LM (2012) Server-based scheduling of parallel real-time tasks. In: EMSOFT
- Oh-Heum K, Kyung-Yong C (1999) Scheduling parallel tasks with individual deadlines. Theor Comput Sci 215(1):209–223
- OpenMP (2011) OpenMP Application Program Interface v3.1. http://www.openmp.org/mp-documents/OpenMP3.1.pdf
- Phillips CA, Stein C, Torng E, Wein J (1997) Optimal time-critical scheduling via resource augmentation. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing (STOC), pp 140–149
- Polychronopoulos CD, Kuck DJ (1987) Guided self-scheduling: a practical scheduling scheme for parallel supercomputers. IEEE Trans Comput 100(12):1425–1439
- Saifullah A, Li J, Agrawal K, Lu C, Gill C (2013) Multi-core real-time scheduling for generalized parallel task models. Real-Time Syst 49(4):404–435
- Saifullah A, Ferry D, Li J, Agrawal K, Lu C, Gill C (2014) Parallel real-time scheduling of DAGs. IEEE Trans Parallel Distrib Syst
- Srinivasan A, Baruah S (2002) Deadline-based scheduling of periodic task systems on multiprocessors. Inf Process Lett 84(2):93–98
- Wang Q, Cheng KH (1992) A heuristic of scheduling parallel tasks and its analysis. SIAM J Comput 21(2):281–294