On the hard-real-time scheduling of embedded streaming applications
DOI: 10.1007/s10617-012-9086-x
- Cite this article as:
- Bamakhrama, M.A. & Stefanov, T.P. Des Autom Embed Syst (2013) 17: 221. doi:10.1007/s10617-012-9086-x
Abstract
In this paper, we consider the problem of hard-real-time (HRT) multiprocessor scheduling of embedded streaming applications modeled as acyclic dataflow graphs. Most of the hard-real-time scheduling theory for multiprocessor systems assumes independent periodic or sporadic tasks. Such a simple task model is not directly applicable to dataflow graphs, where nodes represent actors (i.e., tasks) and edges represent data-dependencies. The actors in such graphs have data-dependency constraints and do not necessarily conform to the periodic or sporadic task models. In this work, we prove that the actors in acyclic Cyclo-Static Dataflow (CSDF) graphs can be scheduled as periodic tasks. Moreover, we provide a framework for computing the periodic task parameters (i.e., period and start time) of each actor, and handling sporadic input streams. Furthermore, we define formally a class of CSDF graphs called matched input/output (I/O) rates graphs which represents more than 80 % of streaming applications. We prove that strictly periodic scheduling is capable of achieving the maximum achievable throughput of an application for matched I/O rates graphs. Therefore, hard-real-time schedulability analysis can be used to determine the minimum number of processors needed to schedule matched I/O rates applications while delivering the maximum achievable throughput. This can be of great use for system designers during the Design Space Exploration (DSE) phase.
Keywords
Real-time multiprocessor scheduling · Embedded streaming systems

1 Introduction
The ever-increasing complexity of embedded systems realized as Multi-Processor Systems-on-Chips (MPSoCs) is imposing several challenges on systems designers [18]. Two major challenges in designing streaming software for embedded MPSoCs are: (1) how to efficiently express the parallelism found in applications, and (2) how to allocate the processors such that multiple running applications receive guaranteed services, together with the ability to dynamically start/stop applications without affecting other already running applications.
Model-of-Computation (MoC) based design has emerged as a de-facto solution to the first challenge [10]. In MoC-based design, the application can be modeled as a directed graph where nodes represent actors (i.e., tasks) and edges represent communication channels. Different MoCs define different rules and semantics on the computation and communication of the actors. The main benefits of a MoC-based design are the explicit representation of important properties in the application (e.g., parallelism) and the enhanced design-time analyzability of the performance metrics (e.g., throughput). One particular MoC that is popular in the embedded signal processing systems community is the Cyclo-Static Dataflow (CSDF) model [5] which extends the well-known Synchronous Data Flow (SDF) model [15].
Unfortunately, no such de-facto solution exists yet for the second challenge of processor allocation [23]. For a long time, self-timed scheduling was considered the most appropriate policy for streaming applications modeled as dataflow graphs [14, 28]. However, the need to support multiple applications running on a single system without prior knowledge of the properties of the applications (e.g., required throughput, number of tasks, etc.) at system design-time is forcing a shift towards run-time scheduling approaches as explained in [13]. Most of the existing run-time scheduling solutions assume applications modeled as task graphs and provide best-effort or soft-real-time quality-of-service (QoS) [23]. Few run-time scheduling solutions exist which support applications modeled using a MoC and provide hard-real-time QoS [4, 11, 20, 21]. However, these solutions either use simple MoCs such as SDF/PGM graphs or use Time-Division Multiplexing (TDM)/Round-Robin (RR) scheduling. Several algorithms from the hard-real-time multiprocessor scheduling theory [9] can perform fast admission and scheduling decisions for incoming applications while providing hard-real-time QoS. Moreover, these algorithms provide temporal isolation which is the ability to dynamically start/run/stop applications without affecting other already running applications. However, these algorithms from the hard-real-time multiprocessor scheduling theory received little attention in the embedded MPSoC community. This is mainly due to the fact that these algorithms assume independent periodic or sporadic tasks [9]. Such a simple task model is not directly applicable to modern embedded streaming applications. This is because a modern streaming application is typically modeled as a directed graph where nodes represent actors, and edges represent data-dependencies. The actors in such graphs have data-dependency constraints and do not necessarily conform to the periodic or sporadic task models.
Therefore, in this paper we investigate the applicability of the hard-real-time scheduling theory for periodic tasks to streaming applications modeled as acyclic CSDF graphs. In such graphs, the actors are data-dependent. However, we analytically prove that they (i.e., the actors) can be scheduled as periodic tasks. As a result, a variety of hard-real-time scheduling algorithms for periodic tasks can be applied to schedule such applications with a certain guaranteed throughput. By considering acyclic CSDF graphs, our investigation findings and proofs are applicable to most streaming applications since it has been shown recently that around 90 % of streaming applications can be modeled as acyclic SDF graphs [30]. Note that SDF graphs are a subset of the CSDF graphs we consider in this paper.
1.1 Problem statement
Given a streaming application modeled as an acyclic CSDF graph, determine whether it is possible to execute the graph actors as periodic tasks. A periodic task τ_{i} is defined by a 3-tuple τ_{i}=(S_{i},C_{i},T_{i}). The interpretation is as follows: τ_{i} is invoked at time instants t=S_{i}+kT_{i} and it has to execute for C_{i} time-units before time t=S_{i}+(k+1)T_{i} for all k∈ℕ_{0}, where S_{i} is the start time of τ_{i} and T_{i} is the task period. This scheduling approach is called Strictly Periodic Scheduling (SPS) [22] to avoid confusion with the term periodic scheduling used in the dataflow scheduling theory to refer to a repetitive finite sequence of actor invocations. The sequence is periodic since it is repeated infinitely with a constant period. However, the individual actor invocations are not guaranteed to be periodic. In the remainder of this paper, periodic scheduling/schedule refers to strictly periodic scheduling/schedule.
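To make the task model concrete, the release and deadline instants of the kth invocation can be sketched as follows (a minimal illustration; the helper name `job_window` is ours, not from the paper):

```python
# Sketch of the strictly periodic task model tau_i = (S_i, C_i, T_i):
# invocation k is released at S + k*T and must complete C time-units of
# work before S + (k+1)*T (implicit deadline D = T).
def job_window(S, C, T, k):
    """Return (release, deadline) of the k-th invocation of task (S, C, T)."""
    assert C <= T, "the execution time must fit within one period"
    release = S + k * T
    deadline = S + (k + 1) * T
    return release, deadline
```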
1.2 Paper contributions
Given a streaming application modeled as an acyclic CSDF graph, we analytically prove that it is possible to execute the graph actors as periodic tasks. Moreover, we present an analytical framework for computing the periodic task parameters for the actors, i.e., the period and the start time, together with the minimum buffer sizes of the communication channels such that the actors execute as periodic tasks. The proposed framework is also capable of handling sporadic input streams. Furthermore, we formally define two classes of CSDF graphs: matched input/output (I/O) rates graphs and mis-matched I/O rates graphs. Matched I/O rates graphs constitute around 80 % of streaming applications [30]. We prove that strictly periodic scheduling is capable of delivering the maximum achievable throughput for matched I/O rates graphs. Applying our approach to matched I/O rates applications enables using a plethora of schedulability tests developed in the real-time scheduling theory [9] to easily determine the minimum number of processors needed to schedule a set of applications using a certain algorithm to provide the maximum achievable throughput. This can be of great use for embedded systems designers during the Design Space Exploration (DSE) phase.
The remainder of this paper is organized as follows: Sect. 2 gives an overview of the related work. Section 3 introduces the CSDF model and the considered system model. Section 4 presents the proposed analytical framework. Section 5 presents the results of empirical evaluation of the framework presented in Sect. 4. Finally, Sect. 6 ends the paper with conclusions.
2 Related work
Parks and Lee [25] studied the applicability of non-preemptive Rate-Monotonic (RM) scheduling to dataflow programs modeled as SDF graphs. The main difference compared to our work is: (1) they considered non-preemptive scheduling. In contrast, we consider only preemptive scheduling. Non-preemptive scheduling is known to be NP-hard in the strong sense even for the uniprocessor case [12], and (2) they considered SDF graphs which are a subset of the more general CSDF graphs.
Goddard [11] studied applying real-time scheduling to dataflow programs modeled using the Processing Graphs Method (PGM). He used a task model called Rate-Based Execution (RBE) in which a real-time task τ_{i} is characterized by a 4-tuple τ_{i}=(x_{i},y_{i},d_{i},c_{i}). The interpretation is as follows: τ_{i} executes x_{i} times in time period y_{i} with a relative deadline d_{i} per job release and c_{i} execution time per job release. For a given PGM, he developed an analysis technique to find the RBE task parameters of each actor and buffer size of each channel. Thus, his approach is closely related to ours. However, our approach uses CSDF graphs which are more expressive than PGM graphs in that PGM supports only a constant production/consumption rate on edges (same as SDF), whereas CSDF supports varying (but predefined) production/consumption rates. As a result, the analysis technique in [11] is not applicable to CSDF graphs.
Bekooij et al. presented a dataflow analysis for embedded real-time multiprocessor systems [4]. They analyzed the impact of TDM scheduling on applications modeled as SDF graphs. Moreira et al. have investigated real-time scheduling of dataflow programs modeled as SDF graphs in [20, 21, 22]. They formulated a resource allocation heuristic [20] and a TDM scheduler combined with static allocation policy [21]. Their TDM scheduler improves the one proposed in [4]. In [22], they proved that it is possible to derive a strictly periodic schedule for the actors of a cyclic SDF graph iff the periods are greater than or equal to the maximum cycle mean of the graph. They formulated the conditions on the start times of the actors in the equivalent Homogeneous SDF (HSDF, [15]) graph in order to enforce a periodic execution of every actor as a Linear Programming (LP) problem.
Our approach differs from [4, 20, 21, 22] in: (1) using the periodic task model which allows applying a variety of proven hard-real-time scheduling algorithms for multiprocessors, and (2) using the CSDF model which is more expressive than the SDF model.
3 Background
3.1 Cyclo-static dataflow (CSDF)
- 1.The successors set, denoted by succ(v_{i}), and given by:$$ \mathsf{succ}(v_i) = \bigl\{ v_j \in V : \exists e_u = (v_i, v_j) \in E \bigr\} $$(1)
- 2.The predecessors set, denoted by prec(v_{i}), and given by:$$ \mathsf{prec}(v_i) = \bigl\{ v_j \in V : \exists e_u = (v_j, v_i) \in E \bigr\} $$(2)
- 3.The input channels set, denoted by inp(v_{i}), and given by:$$ \mathsf{inp}(v_i) = \left \{ \begin{array}{l@{\quad}l} \{ e_u \in E : e_u = (v_j, v_i) \}, & \mbox{if } \sigma_i >1 \\ \mbox{The set of channels delivering the input streams to } v_i & \mbox{if } \sigma_i = 1 \end{array} \right . $$(3)
- 4.The output channels set, denoted by out(v_{i}), and given by:$$ \mathsf{out}(v_i) = \left\{ \begin{array}{l@{\quad}l} \{e_u \in E : e_u = (v_i, v_j)\}, & \mbox{if } \sigma_i <\mathcal{L}\\ \mbox{The set of channels carrying the output streams from } v_i, & \mbox{if } \sigma_i = \mathcal{L} \end{array} \right. $$(4)
Every actor v_{j}∈V has an execution sequence [f_{j}(1),f_{j}(2),…,f_{j}(P_{j})] of length P_{j}. The interpretation of this sequence is: The nth time that actor v_{j} is fired, it executes the code of function f_{j}(((n−1) mod P_{j})+1). Similarly, production and consumption of tokens are also cyclic sequences in CSDF. The token production of actor v_{j} on channel e_{u} is represented as a sequence of constant integers \([x_{j}^{u}(1), x_{j}^{u}(2), \ldots, x_{j}^{u}(P_{j})]\). The nth time that actor v_{j} is fired, it produces \(x_{j}^{u}(((n - 1) \bmod P_{j}) + 1)\) tokens on channel e_{u}. The consumption of actor v_{k} is completely analogous; the token consumption of actor v_{k} from a channel e_{u} is represented as a sequence \([y_{k}^{u}(1), y_{k}^{u}(2), \ldots, y_{k}^{u}(P_{k})]\) of length P_{k}. The firing rule of a CSDF actor v_{k} is evaluated as “true” for its nth firing iff all its input channels contain at least \(y_{k}^{u}(((n - 1) \bmod P_{k}) + 1)\) tokens. The total number of tokens produced by actor v_{j} on channel e_{u} during the first n invocations, denoted by \(X_{j}^{u}(n)\), is given by \(X_{j}^{u}(n) = \sum_{l = 1}^{n} x_{j}^{u}(l)\). Similarly, the total number of tokens consumed by actor v_{k} from channel e_{u} during the first n invocations, denoted by \(Y_{k}^{u}(n)\), is given by \(Y_{k}^{u}(n) = \sum_{l = 1}^{n} y_{k}^{u}(l)\).
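The cumulative production function \(X_{j}^{u}(n)\) follows directly from the cyclic rate sequence; a small sketch (the function name is illustrative, not from the paper):

```python
# X(n): total tokens produced during the first n firings, where 'x' is the
# cyclic production sequence [x(1), ..., x(P)] of a CSDF actor on a channel.
# The cumulative consumption function Y(n) is computed identically.
def tokens_produced(x, n):
    P = len(x)
    full_cycles, rest = divmod(n, P)
    return full_cycles * sum(x) + sum(x[:rest])
```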
Example 1
An important property of the CSDF model is its decidability, which is the ability to derive at compile-time a schedule for the actors. This is formulated in the following definitions and results from [5].
Definition 1
(Valid static schedule [5])
Given a connected CSDF graph G, a valid static schedule for G is a finite sequence of actors invocations that can be repeated infinitely on the incoming sample stream while the amount of data in the buffers remains bounded. A vector q=[q_{1},q_{2},…,q_{N}]^{T}, where q_{j}>0, is a repetition vector of G if each q_{j} represents the number of invocations of an actor v_{j} in a valid static schedule for G. The repetition vector of G in which all the elements are relatively prime^{1} is called the basic repetition vector of G, denoted by \(\dot{\mathbf{q}}\). G is consistent if there exists a repetition vector. If a deadlock-free schedule can be found, G is said to be live. Both consistency and liveness are required for the existence of a valid static schedule.
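For a chain-shaped graph, a repetition vector can be obtained from the balance equations sketched below (an illustrative Python sketch under our own simplifying assumptions: the graph is a chain and only the per-cycle token totals on each edge are given; the helper name is ours):

```python
from fractions import Fraction
from functools import reduce
from math import gcd, lcm

def repetition_vector(P, prod, cons):
    """Smallest valid repetition vector of a chain v_0 -> v_1 -> ... -> v_n.
    P[i]: cycle length P_i of actor v_i; prod[i] / cons[i]: total tokens
    produced by v_i / consumed by v_{i+1} per complete cycle on edge i."""
    # r_i = number of complete cycles of v_i per graph iteration;
    # balance on edge i: r_i * prod[i] = r_{i+1} * cons[i]
    r = [Fraction(1)]
    for p, c in zip(prod, cons):
        r.append(r[-1] * p / c)
    s = lcm(*(x.denominator for x in r))      # clear denominators
    r = [int(x * s) for x in r]
    g = reduce(gcd, r)
    r = [x // g for x in r]                   # smallest integer cycle counts
    return [ri * Pi for ri, Pi in zip(r, P)]  # q_i = r_i * P_i firings
```

For example, an SDF edge producing 2 and consuming 3 tokens per firing yields q = [3, 2]: three firings of the producer match two of the consumer.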
Theorem 1
([5])
Definition 2
For a consistent and live CSDF graph G, an actor iteration is the invocation of an actor v_{i}∈V for q_{i} times, and a graph iteration is the invocation of every actor v_{i}∈V for q_{i} times, where q_{i}∈q.
Corollary 1
(From [5])
If a consistent and live CSDF graph G completes n iterations, where n∈ℕ, then the net change to the number of tokens in the buffers of G is zero.
Lemma 1
Any acyclic consistent CSDF graph is live.
Proof
Bilsen et al. proved in [5] that a CSDF graph is live iff every cycle in the graph is live. Equivalently, a CSDF graph deadlocks only if it contains at least one cycle. Thus, absence of cycles in a CSDF graph implies its liveness. □
Example 2
3.2 System model and scheduling algorithms
In this section, we introduce the system model and the related schedulability results.
3.2.1 System model
A system Ω consists of a set π={π_{1},π_{2},…,π_{m}} of m homogeneous processors. The processors execute a task set τ={τ_{1},τ_{2},…,τ_{n}} of n periodic tasks, and a task may be preempted at any time. A periodic task τ_{i}∈τ is defined by a 4-tuple τ_{i}=(S_{i},C_{i},T_{i},D_{i}), where S_{i}≥0 is the start time of τ_{i}, C_{i}>0 is the worst-case execution time of τ_{i}, T_{i}≥C_{i} is the task period, and D_{i}, where C_{i}≤D_{i}≤T_{i}, is the relative deadline of τ_{i}. A periodic task τ_{i} is invoked (i.e., releases a job) at time instants t=S_{i}+kT_{i} for all k∈ℕ_{0}. Upon invocation, τ_{i} executes for C_{i} time-units. The relative deadline D_{i} is interpreted as follows: τ_{i} has to finish executing its kth invocation before time t=S_{i}+kT_{i}+D_{i} for all k∈ℕ_{0}. If D_{i}=T_{i}, then τ_{i} is said to have implicit-deadline. If D_{i}<T_{i}, then τ_{i} is said to have constrained-deadline. If all the tasks in a task-set τ have the same start time, then τ is said to be synchronous. Otherwise, τ is said to be asynchronous.
The utilization of a task τ_{i} is U_{i}=C_{i}/T_{i}. For a task set τ, the total utilization of τ is \(U_{\mathrm{sum}} = \sum_{\tau_{i} \in\tau} U_{i}\) and the maximum utilization factor of τ is \(U_{\mathrm{max}} = \max_{\tau_{i} \in\tau} U_{i}\).
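These two metrics can be computed in a few lines; a minimal sketch:

```python
# U_sum and U_max of a task set, per Sect. 3.2.1.
def utilization(tasks):
    """tasks: list of (S, C, T) tuples. Returns (U_sum, U_max)."""
    U = [C / T for (_, C, T) in tasks]
    return sum(U), max(U)
```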
In the remainder of this paper, a task set τ refers to an asynchronous set of implicit-deadline periodic tasks. As a result, we refer to a task τ_{i} with a 3-tuple τ_{i}=(S_{i},C_{i},T_{i}) by omitting the implicit deadline D_{i} which is equal to T_{i}.
3.2.2 Scheduling asynchronous set of implicit deadline periodic tasks
- Partitioned: Each task is allocated to a processor and no migration is permitted.
- Global: Migration is permitted for all tasks.
- Hybrid: Hybrid algorithms mix partitioned and global approaches and they can be further classified into:
  - 1. Semi-partitioned: Most tasks are allocated to processors and few tasks are allowed to migrate.
  - 2. Clustered: Processors are grouped into clusters and the tasks that are allocated to one cluster are scheduled by a global scheduler.
- 1. M_{PAR} is specific to the task set τ for which it is computed. Another task set \(\hat{\tau}\) with the same total utilization and maximum utilization factor as τ might not be schedulable on M_{PAR} processors due to partitioning issues.
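The partitioning issue mentioned above can be illustrated with a first-fit decreasing bin-packing sketch (a hypothetical illustration of ours; the exact definition of M_{PAR} appears in text omitted here, and this only mirrors the underlying idea):

```python
# First-fit decreasing partitioning of task utilizations onto
# unit-capacity processors (each processor may be loaded up to 1.0).
def first_fit_processors(utilizations):
    """Number of processors needed under first-fit decreasing allocation."""
    bins = []
    for u in sorted(utilizations, reverse=True):
        for i, load in enumerate(bins):
            if load + u <= 1.0:
                bins[i] += u
                break
        else:
            bins.append(u)  # no processor can host this task: open a new one
    return len(bins)
```

For example, three tasks of utilization 0.6 each need three processors even though their total utilization is only 1.8, which is why two task sets with equal U_sum and U_max can require different processor counts.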
4 Strictly periodic scheduling of acyclic CSDF graphs
This section presents our analytical framework for scheduling the actors in acyclic CSDF graphs as periodic tasks. The construction it uses arranges the actors forming the CSDF graph into a set of levels as shown in Sect. 3. All actors belonging to a certain level depend directly only on the actors in the previous levels. Then, we derive, for each actor, a period and start time, and for each channel, a buffer size. These derived parameters ensure that a strictly periodic schedule can be achieved in the form of a pipelined sequence of invocations of all the actors in each level.
4.1 Definitions and assumptions
In the remainder of this paper, a graph G refers to an acyclic consistent CSDF graph. We base our analysis on the following assumptions:
Assumption 1
- 1. Z_{i}∩Z_{j}=∅ for all v_{i},v_{j}∈V with i≠j.
- 2. The first samples of all the streams arrive prior to or at the same time when the actors of G start executing.
- 3. Each input stream I_{j} is characterized by a minimum inter-arrival time (also called period) of the samples, denoted by γ_{j}. This minimum inter-arrival time is assumed to be equal to the period of the input actor which receives I_{j}. This assumption indicates that the inter-arrival time for input streams can be controlled by the designer to match the periods of the actors.
Assumption 2
An actor v_{i} consumes its input data immediately when it starts its firing and produces its output data just before it finishes its firing.
We start with the following definition:
Definition 3
(Execution time vector)
Let \(\eta= \max_{v_{i} \in V}(\mu_{i} q_{i})\) and Q=lcm{q_{1},q_{2},…,q_{N}} (lcm denotes the least-common-multiple operator). Now, we give the following definition.
Definition 4
(Matched input/output rates graph)
The concept of matched I/O rates applications was first introduced in [30] as the applications with a low value of Q. However, the authors did not establish an exact test for determining whether an application is matched I/O rates or not. The test in (13) is a novel contribution of this paper. If η mod Q=0, then there exists at least one actor in the graph which fully utilizes the processor on which it runs. This, as shown later in Sect. 4.3.3, allows the graph to achieve optimal throughput. On the other hand, if η mod Q≠0, then there exist idle durations in the period of each actor, which results in sub-optimal throughput. This is illustrated later in Example 3, which shows the strictly periodic schedule of a mis-matched I/O rates application.
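Using the definitions of η and Q above, the matched I/O rates check can be sketched as follows (a hedged illustration of the idea behind the paper's test (13); the function name is ours):

```python
from math import lcm

# eta = max_i(mu_i * q_i), Q = lcm(q_1, ..., q_N);
# the graph is matched I/O rates iff eta mod Q == 0.
def is_matched_io(mu, q):
    """mu[i]: worst-case execution time of actor i; q[i]: repetition count."""
    eta = max(m * qi for m, qi in zip(mu, q))
    Q = lcm(*q)
    return eta % Q == 0
```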
Definition 5
(Output path latency)
Let w_{a⇝z}={(v_{a},v_{b}),…,(v_{y},v_{z})} be an output path in a graph G. The latency of w_{a⇝z} under periodic input streams, denoted by L(w_{a⇝z}), is the elapsed time between the start of the first firing of v_{a} which produces data to (v_{a},v_{b}) and the finish of the first firing of v_{z} which consumes data from (v_{y},v_{z}).
Consequently, we define the maximum latency of G as follows:
Definition 6
(Graph maximum latency)
Definition 7
(Self-timed schedule)
A self-timed schedule (STS) is one where all the actors are fired as soon as their input data are available.
Definition 8
(Strictly periodic actor)
An actor v_{i}∈V is strictly periodic iff the time period between any two consecutive firings is constant.
Definition 9
(Period vector)
Definition 9 implies that all the actors have the same iteration period. This is captured in the following definition:
Definition 10
(Iteration period)
Now, we prove the existence of a strictly periodic schedule when the input streams are strictly periodic. An input stream I_{j} connected to input actor v_{i} is strictly periodic iff the inter-arrival time between any two consecutive samples is constant. Based on Assumption 1-3, it follows that γ_{j}=λ_{i}. Later on, we extend the results to handle periodic with jitter and sporadic input streams.
4.2 Existence of a strictly periodic schedule
Lemma 2
Proof
Theorem 2
For any graph G, a periodic schedule Π exists such that every actor v_{i}∈V is strictly periodic with a constant period λ_{i}∈λ^{min} and every communication channel e_{u}∈E has a bounded buffer capacity.
Proof
In schedule Π_{∞}, every actor v_{i} is fired every λ_{i} time-unit once it starts. The start time defined in (26) guarantees that actors in a given level will start only when they have enough data to execute one iteration in a periodic way. The overlapping guarantees that once the actors have started, they will always find enough data for executing the next iteration since their predecessors have already executed one additional iteration. Thus, schedule Π_{∞} shows the existence of a periodic schedule of G where every actor v_{j}∈V is strictly periodic with a period equal to λ_{j}.
Example 3
4.3 Earliest start times and minimum buffer sizes
Now, we are interested in finding the earliest start times of the actors, and the minimum buffer sizes of the communication channels that guarantee the existence of a periodic schedule. Minimizing the start times and buffer sizes is crucial since it minimizes the initial response time and the memory requirements of the applications modeled as acyclic CSDF graphs.
4.3.1 Earliest start times
In the proof of Theorem 2, the notion of start time was introduced to denote when the actor is started on the system. The start time values used in the proof of the theorem were not the minimum ones. Here, we derive the earliest start times. We start with the following definitions:
Definition 11
(Cumulative production function)
The cumulative production function of actor v_{i} producing into channel e_{u} during the interval [t_{s},t_{e}), denoted by \(\mathsf{prd}_{[t_{s}, t_{e})} (v_{i},e_{u})\), is the sum of the number of tokens produced by v_{i} into e_{u} during the interval [t_{s},t_{e}).
Similarly, we define the cumulative consumption function as follows:
Definition 12
(Cumulative consumption function)
The cumulative consumption function of actor v_{i} consuming from channel e_{u} over the interval [t_{s},t_{e}], denoted by \(\mathsf{cns}_{[t_{s}, t_{e}]}(v_{i},e_{u})\), is the sum of the number of tokens consumed by v_{i} from e_{u} during the interval [t_{s},t_{e}].
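Under Assumption 2, both cumulative functions can be evaluated for a strictly periodic actor. The sketch below is ours: it places the consumption instant of the kth firing at its release S+kλ (per Assumption 2) and, as a simplifying assumption, approximates the production instant by the end of the period slot S+(k+1)λ:

```python
# Cumulative production over the half-open interval [t_s, t_e) for a
# strictly periodic actor with start time S, period lam, and cyclic
# production sequence x. Production of firing k is counted at S+(k+1)*lam.
def prd(S, lam, x, t_s, t_e):
    total, k = 0, 0
    while S + (k + 1) * lam < t_e:
        if S + (k + 1) * lam >= t_s:
            total += x[k % len(x)]
        k += 1
    return total

# Cumulative consumption over the closed interval [t_s, t_e]; consumption
# of firing k is counted at its release instant S + k*lam.
def cns(S, lam, y, t_s, t_e):
    total, k = 0, 0
    while S + k * lam <= t_e:
        if S + k * lam >= t_s:
            total += y[k % len(y)]
        k += 1
    return total
```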
Recall that prec(v_{i}) is the predecessors set of actor v_{i}, \(Y_{i}^{u}\) is the consumption sequence of an actor v_{i} from channel e_{u}, and α is the iteration period. Now, we give the following lemma:
Lemma 3
Proof
In (36), a valid start time candidate ϕ_{i→j} must satisfy extra conditions to guarantee that the number of produced tokens on edge e_{u}=(v_{i},v_{j}) at any time instant \(t \ge\hat{t}\) is greater than or equal to the number of consumed tokens at the same instant. To satisfy these extra conditions, we consider the following two possible cases:
Satisfying (37) guarantees that v_{j} can fire at times \(t = \hat{t}, \hat{t} + \lambda_{j}, \ldots, \hat{t} + \alpha\). Thus, a valid value of \(\hat{t}\) guarantees that once v_{j} is started, it always finds enough data to fire for one iteration. As a result, v_{j} executes in a strictly periodic way.
This case occurs when v_{j} consumes zero tokens during the interval \([\hat{t},\phi_{i}]\). This is valid behavior since the consumption rates sequence can contain zero elements. Since \(\hat{t} < \phi_{i}\), it is sufficient to check the cumulative production and consumption over the interval [ϕ_{i},ϕ_{i}+α] since by time t=ϕ_{i}+α both v_{i} and v_{j} are guaranteed to have finished one iteration. Thus, \(\hat{t}\) also guarantees that once v_{j} is started, it always finds enough data to fire. Hence, v_{j} executes in a strictly periodic way.
Any value of \(\hat{t}\) which satisfies (39) is a valid start time value that guarantees strictly periodic execution of v_{j}. Since there might be multiple values of \(\hat{t}\) that satisfy (39), we take the minimum value because it is the earliest start time that guarantees strictly periodic execution of v_{j}. □
4.3.2 Minimum buffer sizes
Lemma 4
Proof
Equation (40) tracks the maximum cumulative number of unconsumed tokens in e_{u} during one iteration for v_{i} and v_{j}. There are two cases:
Theorem 3
- 1. every edge e_{u}∈E has a capacity of at least b_{u} tokens, where b_{u} is given by (40)
- 2. τ_{G} satisfies the schedulability test of \(\mathcal{A}\) on M processors
Proof
Follows from Theorem 2, and Lemmas 3 and 4. □
Example 4
4.3.3 Throughput and latency analysis
Now, we analyze the throughput of the graph actors under strictly periodic scheduling and compare it with the maximum achievable throughput. We also present a formula to compute the latency for a given CSDF graph under strictly periodic scheduling. We start with the following definitions:
Definition 13
(Actor throughput)
Definition 14
(Rate-optimal strictly periodic schedule [22])
For a graph G, a strictly periodic schedule that delivers the same throughput as a self-timed schedule for all the actors is called Rate-Optimal Strictly Periodic Schedule (ROSPS).
Now, we provide the following result.
Theorem 4
For a matched I/O rates graph G, the maximum achievable throughput of the graph actors under strictly periodic scheduling is equal to their maximum throughput under self-timed scheduling.
Proof
Equation (44) shows that the throughput under SPS depends solely on the relationship between Q and η. Recall from Definition 3 that the execution time μ used by our framework is the maximum value over all the actual execution times of the actor. Therefore, if ηmodQ=0, then R_{SPS}(v_{i}) is exactly the same as R_{STS}(v_{i}) for SDF graphs and CSDF graphs where all the firings of an actor v_{i} require the same actual execution time. If ηmodQ≠0 and/or the actor actual execution time differs per firing, then R_{SPS}(v_{i}) is lower than R_{STS}(v_{i}). These findings illustrate that our framework has high potential since it allows the designer to analytically determine the type of the application (i.e., matched vs. mis-matched) and accordingly to select the proper scheduler needed to deliver the maximum achievable throughput.
Now, we prove the following result regarding matched I/O rates applications:
Corollary 2
For a matched I/O rates graph G scheduled using its minimum period vector λ^{min}, U_{max}=1.
Proof
Recall from Sect. 3.2.2 that β=⌊1/U_{max}⌋. It follows from Corollary 2 that β=1 for matched I/O rates applications scheduled using their minimum period vectors.
4.4 Handling sporadic input streams
In case the input streams are not strictly periodic, there are several techniques to accommodate the aperiodic nature of the streams. We present here some of these techniques.
4.4.1 De-jitter buffers
In case of periodic with jitter input streams, it is possible to use de-jitter buffers to hide the effect of jitter. We assume that a jittery input stream I_{i} starts at time t=t_{0} and has a constant inter-arrival time γ_{i} equal to the input actor period (see Assumption 1-3) and jitter bounds \([\varepsilon_{i}^{-}, \varepsilon_{i}^{+}]\). The interpretation of the jitter bounds is that the kth sample of the stream is expected to arrive in the interval \([t_{0} + k\gamma_{i} - \varepsilon_{i}^{-}, t_{0} + k\gamma_{i} + \varepsilon_{i}^{+}]\). If a sample arrives in the interval \([t_{0} + k\gamma_{i} - \varepsilon_{i}^{-}, t_{0} + k\gamma_{i})\), then it is called an early sample. On the other hand, if the sample arrives in the interval \((t_{0} + k\gamma_{i}, t_{0} + k\gamma_{i} + \varepsilon_{i}^{+}]\), then it is called a late sample. It is trivial to show that early samples do not affect the periodicity of the input actor as the samples arrive prior to the actor release time. Late samples, however, pose a problem as they might arrive after an actor is released.
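The early/on-time/late classification above can be sketched as follows (an illustrative helper of ours; the notation follows the text):

```python
# Classify the k-th sample of a jittery stream that starts at t0 and has
# nominal inter-arrival time gamma: its nominal arrival is t0 + k*gamma.
def classify(arrival, k, t0, gamma):
    nominal = t0 + k * gamma
    if arrival < nominal:
        return "early"    # before the input actor's release: harmless
    if arrival > nominal:
        return "late"     # after the release: needs a de-jitter buffer
    return "on-time"
```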
For late samples, it is possible to insert a buffer before each input actor v_{i} receiving a jittery input stream I_{j} to hide the effect of jitter. The buffer delays delivering the samples to the input actor by a certain amount of time, denoted by t_{buffer}(I_{j}). t_{buffer}(I_{j}) has to be computed such that once the input actor is started, it always finds data in the buffer. Assuming that \(\varepsilon_{i}^{-}, \varepsilon_{i}^{+} \in[0, \gamma_{i}]\), we can derive the minimum value of t_{buffer}(I_{j}) and the minimum buffer size. To do so, we start by proving the following lemma:
Lemma 5
Proof
Lemma 6
Proof
During a time interval (t,t+t_{MIT}(I_{j})), v_{i} can fire at most twice. Therefore, it is necessary to buffer up to 2 samples in order to guarantee that the input actor v_{i} can continue firing periodically when the samples are separated by t_{MIT} time-units. □
Lemma 7
Let v_{i} be an input actor and I_{j} be a jittery input stream to v_{i}. Suppose that I_{j} starts at time t=t_{0} and v_{i} starts at time t=t_{0}+t_{buffer}(I_{j}). The de-jitter buffer must be able to hold at least 3 samples.
Proof
Suppose that the (k−1)th and (k+1)th samples arrive late and early, respectively, by the maximum amount of jitter. This means that they both arrive at time t=t_{0}+kγ_{i}. Now, suppose that the kth sample arrives with no jitter. This means that at time t=t_{0}+kγ_{i} there are 3 samples arriving. Hence, the de-jitter buffer must be able to store them. During the interval [t_{0}+kγ_{i},t_{0}+(k+1)γ_{i}), there are no incoming samples and v_{i} processes the (k−1)th sample. At time t=t_{0}+(k+1)γ_{i}, the (k+2)th sample might arrive, which means that there are again 3 samples available to v_{i}. By the periodicity of v_{i} and I_{j}, the previous pattern can repeat. □
The main advantage of the de-jitter buffer approach is that the actors are still treated and scheduled as periodic tasks. However, the major disadvantage is the extra delay encountered by the input stream samples and the extra memory needed for the buffers.
4.4.2 Resource reservation
For sporadic streams in general, we can consider the actors as aperiodic tasks and apply techniques for aperiodic task scheduling from real-time scheduling theory [6]. One popular approach is based on using a server task to service the aperiodic tasks. Servers provide resource reservation guarantees and temporal isolation. Several servers have been proposed in the literature (e.g., [1, 27]). The advantages of using servers are the enforced isolation between the tasks, and the ability to support arbitrary input streams. When using servers, we can schedule each actor using a server which has an execution budget C_{s} equal to the actor execution time and a period P_{s} equal to the actor’s period.
- 1. The underlying operating system (OS) or scheduler has a monitoring mechanism which polls the buffers to detect when an actor has enough data to fire. Once it detects that an actor has enough data to fire, it releases an actor job.
- 2. Modify the actor implementation such that the polling happens within the actor. In this approach, an actor job is always released at the start of the actor period. When the actor is activated (i.e., running), it checks its input buffers for data. If enough data is available, then it executes its function. Otherwise, it exhausts its budget and waits until the next period. This mechanism is summarized in Fig. 12.
The first approach (i.e., polling by the OS) does not require modifications to the actors' implementations. However, it requires an additional task which continuously checks all the buffers; this task can become a bottleneck if there are many channels. The second approach is better in terms of scalability and overhead, but it might delay the processing of the data.
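The second mechanism (polling inside the actor) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name, the token counts, and the use of a Python deque as the FIFO channel are all assumptions for the example.

```python
from collections import deque

class PollingActor:
    """Actor whose job, released once per period, polls its own input buffer."""

    def __init__(self, tokens_needed, func):
        self.tokens_needed = tokens_needed  # consumption rate per firing (assumed)
        self.func = func                    # the actor's function
        self.buffer = deque()               # input FIFO channel

    def push(self, token):
        """Producer side: write one token into the input channel."""
        self.buffer.append(token)

    def job(self):
        """Released by the scheduler/server at the start of every period."""
        if len(self.buffer) >= self.tokens_needed:
            args = [self.buffer.popleft() for _ in range(self.tokens_needed)]
            return self.func(args)          # enough data: fire
        return None                         # not enough data: wait for next period

actor = PollingActor(tokens_needed=2, func=sum)
actor.push(3)
print(actor.job())   # None: only 1 token buffered, job exhausts its budget
actor.push(4)
print(actor.job())   # 7: two tokens available, the actor fires
```

Note that a job that finds insufficient data simply returns without firing, which models the "exhausts its budget and waits until the next period" behavior described above.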
5 Evaluation results
We evaluate the framework proposed in Sect. 4 by performing an experiment on a set of 19 real-life streaming applications. The objective of the experiment is to compare the throughput of streaming applications scheduled using our strictly periodic scheduling to their maximum achievable throughput obtained via self-timed scheduling. After that, we discuss the implications of our results from Sect. 4 and of the throughput comparison experiment. For brevity, in the remainder of this section we refer to our strictly periodic scheduling/schedule as SPS and to the self-timed scheduling/schedule as STS.
The streaming applications used in the experiment come from different domains (e.g., signal processing, communication, multimedia, etc.). The benchmarks are described in detail in the next section.
5.1 Benchmarks
Benchmarks used for evaluation
Domain | No. | Application | Source |
---|---|---|---|
Signal Processing | 1 | Multi-channel beamformer | [30] |
 | 2 | Discrete cosine transform (DCT) | |
 | 3 | Fast Fourier transform (FFT) kernel | |
 | 4 | Filterbank for multirate signal processing | |
 | 5 | Time delay equalization (TDE) | |
Cryptography | 6 | Data Encryption Standard (DES) | |
 | 7 | Serpent | |
Sorting | 8 | Bitonic Parallel Sorting | |
Video processing | 9 | MPEG2 video | |
 | 10 | H.263 video decoder | [29] |
Audio processing | 11 | MP3 audio decoder | |
 | 12 | CD-to-DAT rate converter (SDF)^{a} | [24] |
 | 13 | CD-to-DAT rate converter (CSDF) | |
 | 14 | Vocoder | [30] |
Communication | 15 | Software FM radio with equalizer | |
 | 16 | Data modem | [29] |
 | 17 | Satellite receiver | |
 | 18 | Digital Radio Mondiale receiver | [19] |
Medical | 19 | Heart pacemaker^{b} | [26] |
We use the SDF^{3} tool-set [29] for several purposes during the experiments. SDF^{3} is a powerful analysis tool-set which is capable of analyzing CSDF and SDF graphs to check for consistency errors, compute the repetition vector, compute the maximum achievable throughput, etc. SDF^{3} accepts the graphs in XML format. For StreamIt benchmarks, the StreamIt compiler is capable of exporting an SDF graph representation of the stream program. The exported graph is then converted into the XML format required by SDF^{3}. For the graphs from the research articles, we constructed the XML representation for the CSDF graphs manually.
5.2 Experiment: throughput and latency comparison
In this experiment, we compare the throughput and latency resulting from our SPS approach to the maximum achievable throughput and minimum achievable latency of a streaming application. Recall from Definition 7 that the maximum achievable throughput and minimum achievable latency of a streaming application modeled as a CSDF graph are the ones achieved under self-timed scheduling. In this experiment, we report the throughput for the output actors (i.e., the actors producing the output streams of the application, see Sect. 3). For latency, we report the graph maximum latency according to Definition 6. For SPS, we used the minimum period vector given by Lemma 2. The STS throughput and latency are computed using the SDF^{3} tool-set. SDF^{3} defines R_{STS}(G) as the graph throughput under STS, and R_{STS}(v_{i})=q_{i}R_{STS}(G) as the actor throughput. Similarly, L_{STS}(G) denotes the graph latency under self-timed scheduling. We use the sdf3analysis tool from SDF^{3} to compute the throughput and latency for the STS with auto-concurrency disabled and assuming unbounded FIFO channel sizes. Computing the throughput is performed using the throughput algorithm, while latency is computed using the latency(min_st) algorithm.
Results of throughput comparison. v_{out} denotes the output actor
Application | \(\dot{q}_{\mathrm{out}}\) | R_{STS}(v_{out}) | η | Q | R_{SPS}(v_{out}) | R_{SPS}(v_{out})/R_{STS}(v_{out}) |
---|---|---|---|---|---|---|
Beamformer | 1 | 1.97×10^{−4} | 5076 | 1 | 1/5076 | 1.0 |
DCT | 1 | 2.1×10^{−5} | 47616 | 1 | 1/47616 | 1.0 |
FFT | 1 | 8.31×10^{−5} | 12032 | 1 | 1/12032 | 1.0 |
Filterbank | 1 | 8.84×10^{−5} | 11312 | 1 | 1/11312 | 1.0 |
TDE | 1 | 2.71×10^{−5} | 36960 | 1 | 1/36960 | 1.0 |
DES | 1 | 9.765×10^{−4} | 1024 | 1 | 1/1024 | 1.0 |
Serpent | 1 | 2.99×10^{−4} | 3336 | 1 | 1/3336 | 1.0 |
Bitonic | 1 | 1.05×10^{−2} | 95 | 1 | 1/95 | 1.0 |
MPEG2 | 1 | 1.30×10^{−4} | 7680 | 1 | 1/7680 | 1.0 |
H.263 | 1 | 3.01×10^{−6} | 332046 | 594 | 1/332046 | 1.0 |
MP3 | 2 | 5.36×10^{−7} | 3732276 | 2 | 1/1866138 | 1.0 |
CD2DAT-S | 160 | 1.667×10^{−1} | 960 | 23520 | 1/147 | 0.04 |
CD2DAT-C | 160 | 1.361×10^{−1} | 1176 | 23520 | 1/147 | 0.05 |
Vocoder | 1 | 1.1×10^{−4} | 9105 | 1 | 1/9105 | 1.0 |
FM | 1 | 6.97×10^{−4} | 1434 | 1 | 1/1434 | 1.0 |
Modem | 1 | 6.25×10^{−2} | 16 | 16 | 1/16 | 1.0 |
Satellite | 240 | 2.27×10^{−1} | 1056 | 5280 | 1/22 | 0.2 |
Receiver | 288000 | 4.76×10^{−2} | 6048000 | 288000 | 1/21 | 1.0 |
Pacemaker | 64 | 2.0×10^{−1} | 320 | 320 | 1/5 | 1.0 |
5.3 Discussion
Unfortunately, such an easy computation of the minimum number of processors is not possible for STS. This is because the minimum number of processors required by STS, denoted by M_{STS}, cannot be easily computed with equations such as (9), (10), and (11). Finding M_{STS} in practice requires Design Space Exploration (DSE) procedures to find the best allocation which delivers the maximum achievable throughput. This fact shows one more advantage of using our SPS framework compared to STS in cases where SPS gives the same throughput as STS.
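To illustrate why the processor count for SPS is easy to compute: once the actors are periodic tasks with execution time C_{i} and period T_{i}, a lower bound follows directly from the total utilization. Equations (9)-(11) are not reproduced in this excerpt, so the sketch below uses the classic bound for optimal schedulers, M = ⌈∑ C_{i}/T_{i}⌉, with a made-up task set; it is an assumption standing in for the paper's exact equations.

```python
import math

# Hedged sketch: actors as periodic tasks (C_i, T_i). The classic lower
# bound on the number of processors for an optimal multiprocessor
# scheduler is the ceiling of the total utilization. Values are illustrative.
tasks = [(2, 10), (3, 10), (5, 20), (4, 8)]  # (C_i, T_i), made-up

utilization = sum(c / t for c, t in tasks)   # 0.2 + 0.3 + 0.25 + 0.5 = 1.25
m_min = math.ceil(utilization)
print(m_min)  # minimum processors under an optimal scheduler
```

For STS, by contrast, no closed-form expression of this kind exists, which is why DSE is needed.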
6 Conclusions
We prove that the actors of a streaming application, modeled as an acyclic CSDF graph, can be scheduled as periodic tasks. As a result, a variety of hard-real-time scheduling algorithms for periodic tasks can be applied to schedule such applications with a certain guaranteed throughput. We present an analytical framework for computing the periodic task parameters for the actors together with the minimum channel sizes such that a strictly periodic schedule exists. We also show how the proposed framework can handle sporadic input streams. We define formally a class of CSDF graphs called matched I/O rates applications which represents more than 80 % of streaming applications. We prove that strictly periodic scheduling is capable of delivering the maximum achievable throughput for matched I/O rates applications together with the ability to analytically determine the minimum number of processors needed to schedule the applications.
Acknowledgements
This work is supported by CATRENE/MEDEA+ 2A718 TSAR (Terascale multicore processor architecture) project. We would like to thank William Thies and Sander Stuijk for their support with StreamIt and SDF^{3} benchmarks, respectively.
Open Access
This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.