On the hard-real-time scheduling of embedded streaming applications
DOI: 10.1007/s10617-012-9086-x
Cite this article as:
Bamakhrama, M.A. & Stefanov, T.P. Des Autom Embed Syst (2013) 17: 221. doi:10.1007/s10617-012-9086-x
Abstract
In this paper, we consider the problem of hard-real-time (HRT) multiprocessor scheduling of embedded streaming applications modeled as acyclic dataflow graphs. Most of the hard-real-time scheduling theory for multiprocessor systems assumes independent periodic or sporadic tasks. Such a simple task model is not directly applicable to dataflow graphs, where nodes represent actors (i.e., tasks) and edges represent data-dependencies. The actors in such graphs have data-dependency constraints and do not necessarily conform to the periodic or sporadic task models. In this work, we prove that the actors in acyclic Cyclo-Static Dataflow (CSDF) graphs can be scheduled as periodic tasks. Moreover, we provide a framework for computing the periodic task parameters (i.e., period and start time) of each actor, and for handling sporadic input streams. Furthermore, we formally define a class of CSDF graphs called matched input/output (I/O) rates graphs, which represents more than 80 % of streaming applications. We prove that strictly periodic scheduling is capable of achieving the maximum achievable throughput of an application for matched I/O rates graphs. Therefore, hard-real-time schedulability analysis can be used to determine the minimum number of processors needed to schedule matched I/O rates applications while delivering the maximum achievable throughput. This can be of great use for system designers during the Design Space Exploration (DSE) phase.
Keywords
Real-time multiprocessor scheduling · Embedded streaming systems
1 Introduction
The ever-increasing complexity of embedded systems realized as Multi-Processor Systems-on-Chips (MPSoCs) is imposing several challenges on system designers [18]. Two major challenges in designing streaming software for embedded MPSoCs are: (1) how to express the parallelism found in applications efficiently, and (2) how to allocate the processors to provide guaranteed services to multiple running applications, together with the ability to dynamically start/stop applications without affecting other already running applications.
Model-of-Computation (MoC) based design has emerged as a de-facto solution to the first challenge [10]. In MoC-based design, the application can be modeled as a directed graph where nodes represent actors (i.e., tasks) and edges represent communication channels. Different MoCs define different rules and semantics on the computation and communication of the actors. The main benefits of MoC-based design are the explicit representation of important properties in the application (e.g., parallelism) and the enhanced design-time analyzability of the performance metrics (e.g., throughput). One particular MoC that is popular in the embedded signal processing systems community is the Cyclo-Static Dataflow (CSDF) model [5], which extends the well-known Synchronous Data Flow (SDF) model [15].
Unfortunately, no such de-facto solution exists yet for the second challenge of processor allocation [23]. For a long time, self-timed scheduling was considered the most appropriate policy for streaming applications modeled as dataflow graphs [14, 28]. However, the need to support multiple applications running on a single system without prior knowledge of the properties of the applications (e.g., required throughput, number of tasks, etc.) at system design-time is forcing a shift towards run-time scheduling approaches, as explained in [13]. Most of the existing run-time scheduling solutions assume applications modeled as task graphs and provide best-effort or soft-real-time quality-of-service (QoS) [23]. Few run-time scheduling solutions exist which support applications modeled using a MoC and provide hard-real-time QoS [4, 11, 20, 21]. However, these solutions either use simple MoCs such as SDF/PGM graphs or use Time-Division Multiplexing (TDM)/Round-Robin (RR) scheduling. Several algorithms from hard-real-time multiprocessor scheduling theory [9] can perform fast admission and scheduling decisions for incoming applications while providing hard-real-time QoS. Moreover, these algorithms provide temporal isolation, which is the ability to dynamically start/run/stop applications without affecting other already running applications. However, these algorithms have received little attention in the embedded MPSoC community, mainly because they assume independent periodic or sporadic tasks [9]. Such a simple task model is not directly applicable to modern embedded streaming applications, because a modern streaming application is typically modeled as a directed graph where nodes represent actors and edges represent data-dependencies. The actors in such graphs have data-dependency constraints and do not necessarily conform to the periodic or sporadic task models.
Therefore, in this paper we investigate the applicability of hard-real-time scheduling theory for periodic tasks to streaming applications modeled as acyclic CSDF graphs. In such graphs, the actors are data-dependent. However, we analytically prove that they can be scheduled as periodic tasks. As a result, a variety of hard-real-time scheduling algorithms for periodic tasks can be applied to schedule such applications with a certain guaranteed throughput. By considering acyclic CSDF graphs, our findings and proofs are applicable to most streaming applications, since it has been shown recently that around 90 % of streaming applications can be modeled as acyclic SDF graphs [30]. Note that SDF graphs are a subset of the CSDF graphs we consider in this paper.
1.1 Problem statement
Given a streaming application modeled as an acyclic CSDF graph, determine whether it is possible to execute the graph actors as periodic tasks. A periodic task \(\tau_i\) is defined by a 3-tuple \(\tau_i = (S_i, C_i, T_i)\). The interpretation is as follows: \(\tau_i\) is invoked at time instants \(t = S_i + kT_i\) and it has to execute for \(C_i\) time-units before time \(t = S_i + (k+1)T_i\) for all \(k \in \mathbb{N}_0\), where \(S_i\) is the start time of \(\tau_i\) and \(T_i\) is the task period. This scheduling approach is called Strictly Periodic Scheduling (SPS) [22]. The qualifier “strictly” avoids confusion with the term periodic scheduling used in dataflow scheduling theory to refer to a repetitive finite sequence of actor invocations: such a sequence is periodic since it is repeated infinitely with a constant period, but the individual actor invocations are not guaranteed to be periodic. In the remainder of this paper, periodic scheduling/schedule refers to strictly periodic scheduling/schedule.
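As an illustration of the SPS interpretation above, the following minimal Python sketch enumerates the job windows \([S_i + kT_i, S_i + (k+1)T_i)\) of a task \(\tau_i = (S_i, C_i, T_i)\) and checks a trace of execution slices against them. The function names and the trace format (a list of (start, end) slices per job) are assumptions made for illustration only.

```python
from typing import List, Tuple

def sps_job_windows(S: int, C: int, T: int, n_jobs: int) -> List[Tuple[int, int]]:
    """(release, deadline) of the first n_jobs invocations of tau = (S, C, T):
    job k is released at S + k*T and must receive C time-units before S + (k+1)*T."""
    if not 0 < C <= T:
        raise ValueError("expected 0 < C <= T")
    return [(S + k * T, S + (k + 1) * T) for k in range(n_jobs)]

def meets_sps(S: int, C: int, T: int, executed: List[List[Tuple[int, int]]]) -> bool:
    """executed[k] lists the (start, end) slices run by job k. Each job must get
    exactly C time-units, all inside its own window (preemption allowed)."""
    windows = sps_job_windows(S, C, T, len(executed))
    for (release, deadline), slices in zip(windows, executed):
        if any(s < release or e > deadline or s >= e for s, e in slices):
            return False
        if sum(e - s for s, e in slices) != C:
            return False
    return True

print(sps_job_windows(2, 3, 5, 3))   # [(2, 7), (7, 12), (12, 17)]
```

The second job in `meets_sps(2, 3, 5, [[(2, 5)], [(8, 10), (11, 12)]])` is preempted once and still receives its 3 time-units inside \([7, 12)\), so the trace is accepted.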
1.2 Paper contributions
Given a streaming application modeled as an acyclic CSDF graph, we analytically prove that it is possible to execute the graph actors as periodic tasks. Moreover, we present an analytical framework for computing the periodic task parameters for the actors, that is, the period and the start time, together with the minimum buffer sizes of the communication channels such that the actors execute as periodic tasks. The proposed framework is also capable of handling sporadic input streams. Furthermore, we formally define two classes of CSDF graphs: matched input/output (I/O) rates graphs and mismatched I/O rates graphs. Matched I/O rates graphs constitute around 80 % of streaming applications [30]. We prove that strictly periodic scheduling is capable of delivering the maximum achievable throughput for matched I/O rates graphs. Applying our approach to matched I/O rates applications enables using a plethora of schedulability tests developed in real-time scheduling theory [9] to easily determine the minimum number of processors needed to schedule a set of applications using a certain algorithm while providing the maximum achievable throughput. This can be of great use for embedded systems designers during the Design Space Exploration (DSE) phase.
The remainder of this paper is organized as follows: Sect. 2 gives an overview of the related work. Section 3 introduces the CSDF model and the considered system model. Section 4 presents the proposed analytical framework. Section 5 presents the results of empirical evaluation of the framework presented in Sect. 4. Finally, Sect. 6 ends the paper with conclusions.
2 Related work
Parks and Lee [25] studied the applicability of non-preemptive Rate-Monotonic (RM) scheduling to dataflow programs modeled as SDF graphs. The main differences compared to our work are: (1) they considered non-preemptive scheduling, whereas we consider only preemptive scheduling (non-preemptive scheduling is known to be NP-hard in the strong sense even for the uniprocessor case [12]), and (2) they considered SDF graphs, which are a subset of the more general CSDF graphs.
Goddard [11] studied applying real-time scheduling to dataflow programs modeled using the Processing Graphs Method (PGM). He used a task model called Rate-Based Execution (RBE) in which a real-time task \(\tau_i\) is characterized by a 4-tuple \(\tau_i = (x_i, y_i, d_i, c_i)\). The interpretation is as follows: \(\tau_i\) executes \(x_i\) times in a time period \(y_i\), with a relative deadline \(d_i\) per job release and an execution time \(c_i\) per job release. For a given PGM graph, he developed an analysis technique to find the RBE task parameters of each actor and the buffer size of each channel. Thus, his approach is closely related to ours. However, our approach uses CSDF graphs, which are more expressive than PGM graphs in that PGM supports only a constant production/consumption rate on edges (same as SDF), whereas CSDF supports varying (but predefined) production/consumption rates. As a result, the analysis technique in [11] is not applicable to CSDF graphs.
Bekooij et al. presented a dataflow analysis for embedded real-time multiprocessor systems [4]. They analyzed the impact of TDM scheduling on applications modeled as SDF graphs. Moreira et al. investigated real-time scheduling of dataflow programs modeled as SDF graphs in [20–22]. They formulated a resource allocation heuristic [20] and a TDM scheduler combined with a static allocation policy [21]. Their TDM scheduler improves the one proposed in [4]. In [22], they proved that it is possible to derive a strictly periodic schedule for the actors of a cyclic SDF graph iff the periods are greater than or equal to the maximum cycle mean of the graph. They formulated, as a Linear Programming (LP) problem, the conditions on the start times of the actors in the equivalent Homogeneous SDF (HSDF, [15]) graph that enforce a periodic execution of every actor.
Our approach differs from [4, 20–22] in: (1) using the periodic task model, which allows applying a variety of proven hard-real-time scheduling algorithms for multiprocessors, and (2) using the CSDF model, which is more expressive than the SDF model.
3 Background
3.1 Cyclo-static dataflow (CSDF)
1. The successors set, denoted by \(\mathsf{succ}(v_i)\), and given by: $$ \mathsf{succ}(v_i) = \bigl\{ v_j \in V : \exists e_u = (v_i, v_j) \in E \bigr\} \tag{1} $$
2. The predecessors set, denoted by \(\mathsf{prec}(v_i)\), and given by: $$ \mathsf{prec}(v_i) = \bigl\{ v_j \in V : \exists e_u = (v_j, v_i) \in E \bigr\} \tag{2} $$
3. The input channels set, denoted by \(\mathsf{inp}(v_i)\), and given by: $$ \mathsf{inp}(v_i) = \begin{cases} \{ e_u \in E : e_u = (v_j, v_i) \}, & \text{if } \sigma_i > 1 \\ \text{the set of channels delivering the input streams to } v_i, & \text{if } \sigma_i = 1 \end{cases} \tag{3} $$
4. The output channels set, denoted by \(\mathsf{out}(v_i)\), and given by: $$ \mathsf{out}(v_i) = \begin{cases} \{ e_u \in E : e_u = (v_i, v_j) \}, & \text{if } \sigma_i < \mathcal{L} \\ \text{the set of channels carrying the output streams from } v_i, & \text{if } \sigma_i = \mathcal{L} \end{cases} \tag{4} $$
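The sets in (1) and (2) can be derived mechanically from the edge list. The definition of the level \(\sigma_i\) does not appear in this excerpt, so the Python sketch below assumes a longest-path level: actors without predecessors are at level 1, and every other actor sits one level above its deepest predecessor. The helper names `neighbour_sets` and `levels` are hypothetical.

```python
from collections import deque

def neighbour_sets(V, E):
    """succ(v) and prec(v) as in Eqs. (1) and (2); E is a list of (src, dst) channels."""
    succ = {v: set() for v in V}
    prec = {v: set() for v in V}
    for vi, vj in E:
        succ[vi].add(vj)
        prec[vj].add(vi)
    return succ, prec

def levels(V, E):
    """Hypothetical level assignment sigma (longest path from an actor without
    predecessors). Raises ValueError on a cycle, since the framework targets
    acyclic graphs only."""
    succ, prec = neighbour_sets(V, E)
    pending = {v: len(prec[v]) for v in V}          # predecessors not yet processed
    ready = deque(v for v in V if pending[v] == 0)   # Kahn's topological traversal
    sigma = {v: 1 for v in ready}
    processed = 0
    while ready:
        v = ready.popleft()
        processed += 1
        for w in succ[v]:
            sigma[w] = max(sigma.get(w, 1), sigma[v] + 1)
            pending[w] -= 1
            if pending[w] == 0:
                ready.append(w)
    if processed != len(V):
        raise ValueError("graph contains a cycle")
    return sigma

V = ["a", "b", "c", "d"]
E = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("a", "d")]
print(levels(V, E))   # a: 1, b: 2, c: 2, d: 3
```

The topological traversal doubles as an acyclicity check, which matches the restriction of the framework to acyclic graphs.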
Every actor \(v_j \in V\) has an execution sequence \([f_j(1), f_j(2), \ldots, f_j(P_j)]\) of length \(P_j\). The interpretation of this sequence is: the n-th time that actor \(v_j\) is fired, it executes the code of function \(f_j(((n-1) \bmod P_j) + 1)\). Similarly, production and consumption of tokens are also sequences of length \(P_j\) in CSDF. The token production of actor \(v_j\) on channel \(e_u\) is represented as a sequence of constant integers \([x_j^u(1), x_j^u(2), \ldots, x_j^u(P_j)]\). The n-th time that actor \(v_j\) is fired, it produces \(x_j^u(((n-1) \bmod P_j) + 1)\) tokens on channel \(e_u\). The consumption of actor \(v_k\) is completely analogous; the token consumption of actor \(v_k\) from a channel \(e_u\) is represented as a sequence \([y_k^u(1), y_k^u(2), \ldots, y_k^u(P_k)]\). The firing rule of a CSDF actor \(v_k\) is evaluated as “true” for its n-th firing iff each input channel \(e_u\) of \(v_k\) contains at least \(y_k^u(((n-1) \bmod P_k) + 1)\) tokens. The total number of tokens produced by actor \(v_j\) on channel \(e_u\) during the first n invocations, denoted by \(X_j^u(n)\), is given by \(X_j^u(n) = \sum_{l=1}^{n} x_j^u(((l-1) \bmod P_j) + 1)\). Similarly, the total number of tokens consumed by actor \(v_k\) from channel \(e_u\) during the first n invocations, denoted by \(Y_k^u(n)\), is given by \(Y_k^u(n) = \sum_{l=1}^{n} y_k^u(((l-1) \bmod P_k) + 1)\).
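The cyclic indexing rule, the cumulative counts \(X_j^u(n)\) and \(Y_k^u(n)\), and the firing rule can be expressed directly in code. In this Python sketch (helper names are hypothetical), a rate sequence is a list of length \(P_j\) and firings are numbered from 1, as in the text.

```python
def rate(seq, n):
    """Rate of the n-th firing (n >= 1) of a cyclic sequence of length P,
    i.e. seq(((n - 1) mod P) + 1) in the paper's 1-based notation."""
    return seq[(n - 1) % len(seq)]        # Python lists are 0-based

def cumulative(seq, n):
    """X_j^u(n) or Y_k^u(n): tokens produced/consumed over the first n firings."""
    return sum(rate(seq, l) for l in range(1, n + 1))

def can_fire(input_seqs, n, tokens):
    """Firing rule for the n-th firing of an actor v_k.

    input_seqs: {channel: consumption sequence y_k^u} for every input channel
    tokens:     {channel: tokens currently stored in that channel}"""
    return all(tokens[u] >= rate(seq, n) for u, seq in input_seqs.items())

x = [2, 0, 1]              # P_j = 3 phases
print(cumulative(x, 5))    # 2 + 0 + 1 + 2 + 0 = 5
```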
Example 1
An important property of the CSDF model is its decidability, which is the ability to derive a schedule for the actors at compile-time. This is formulated in the following definitions and results from [5].
Definition 1
(Valid static schedule [5])
Given a connected CSDF graph G, a valid static schedule for G is a finite sequence of actor invocations that can be repeated infinitely on the incoming sample stream while the amount of data in the buffers remains bounded. A vector \(\mathbf{q} = [q_1, q_2, \ldots, q_N]^T\), where \(q_j > 0\), is a repetition vector of G if each \(q_j\) represents the number of invocations of an actor \(v_j\) in a valid static schedule for G. The repetition vector of G in which all the elements are relatively prime^{1} is called the basic repetition vector of G, denoted by \(\dot{\mathbf{q}}\). G is consistent if there exists a repetition vector. If a deadlock-free schedule can be found, G is said to be live. Both consistency and liveness are required for the existence of a valid static schedule.
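Theorem 1 is not reproduced in this excerpt. As a hedged illustration of how a repetition vector can be obtained, the sketch below solves the balance equations of a connected CSDF graph over whole phase cycles, i.e. \(r_{\mathrm{src}} X_{\mathrm{src}}^u(P_{\mathrm{src}}) = r_{\mathrm{dst}} Y_{\mathrm{dst}}^u(P_{\mathrm{dst}})\) for every channel, and then sets \(q_j = P_j r_j\). Scaling to whole phase cycles is an assumption of this sketch (it is not claimed to be the paper's construction), and the function name and input format are illustrative.

```python
from fractions import Fraction
from math import gcd, lcm

def repetition_vector(P, edges):
    """Sketch: repetition vector of a connected CSDF graph via balance equations.

    P:     {actor: number of phases P_j}
    edges: list of (src, dst, production sequence, consumption sequence);
           every sequence is assumed to have a nonzero total.
    Returns q_j = P_j * r_j, with r the smallest positive integer solution."""
    adj = {v: [] for v in P}
    for src, dst, prod, cons in edges:
        # r_dst = r_src * sum(prod) / sum(cons), and the inverse relation
        adj[src].append((dst, Fraction(sum(prod), sum(cons))))
        adj[dst].append((src, Fraction(sum(cons), sum(prod))))
    start = next(iter(P))
    r = {start: Fraction(1)}
    stack = [start]
    while stack:                              # propagate rates through the graph
        v = stack.pop()
        for w, ratio in adj[v]:
            implied = r[v] * ratio
            if w not in r:
                r[w] = implied
                stack.append(w)
            elif r[w] != implied:
                raise ValueError("inconsistent graph: no repetition vector")
    if len(r) != len(P):
        raise ValueError("graph is not connected")
    scale = lcm(*(f.denominator for f in r.values()))
    ints = {v: int(f * scale) for v, f in r.items()}
    g = 0
    for value in ints.values():
        g = gcd(g, value)
    return {v: P[v] * ints[v] // g for v in P}

print(repetition_vector({"v1": 1, "v2": 2}, [("v1", "v2", [2], [1, 2])]))
# {'v1': 3, 'v2': 4}
```

In the example, v1 (one phase) produces 2 tokens per firing and v2 (two phases) consumes [1, 2]; the balance 2·r1 = 3·r2 gives r = (3, 2) and q = (3, 4). Three firings of v1 produce 6 tokens and four firings of v2 consume 1 + 2 + 1 + 2 = 6, so the buffer returns to its initial state, as stated in Corollary 1.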
Theorem 1
([5])
Definition 2
For a consistent and live CSDF graph G, an actor iteration is the invocation of an actor \(v_i \in V\) for \(q_i\) times, and a graph iteration is the invocation of every actor \(v_i \in V\) for \(q_i\) times, where \(q_i \in \mathbf{q}\).
Corollary 1
(From [5])
If a consistent and live CSDF graph G completes n iterations, where \(n \in \mathbb{N}\), then the net change to the number of tokens in the buffers of G is zero.
Lemma 1
Any acyclic consistent CSDF graph is live.
Proof
Bilsen et al. proved in [5] that a CSDF graph is live iff every cycle in the graph is live. Consequently, a CSDF graph can deadlock only if it contains at least one cycle that is not live. An acyclic CSDF graph contains no cycles, so this condition holds vacuously and the graph is live. □
Example 2
3.2 System model and scheduling algorithms
In this section, we introduce the system model and the related schedulability results.
3.2.1 System model
A system Ω consists of a set \(\pi = \{\pi_1, \pi_2, \ldots, \pi_m\}\) of m homogeneous processors. The processors execute a task set \(\tau = \{\tau_1, \tau_2, \ldots, \tau_n\}\) of n periodic tasks, and a task may be preempted at any time. A periodic task \(\tau_i \in \tau\) is defined by a 4-tuple \(\tau_i = (S_i, C_i, T_i, D_i)\), where \(S_i \ge 0\) is the start time of \(\tau_i\), \(C_i > 0\) is the worst-case execution time of \(\tau_i\), \(T_i \ge C_i\) is the task period, and \(D_i\), where \(C_i \le D_i \le T_i\), is the relative deadline of \(\tau_i\). A periodic task \(\tau_i\) is invoked (i.e., releases a job) at time instants \(t = S_i + kT_i\) for all \(k \in \mathbb{N}_0\). Upon invocation, \(\tau_i\) executes for \(C_i\) time-units. The relative deadline \(D_i\) is interpreted as follows: \(\tau_i\) has to finish executing its k-th invocation before time \(t = S_i + kT_i + D_i\) for all \(k \in \mathbb{N}_0\). If \(D_i = T_i\), then \(\tau_i\) is said to have an implicit deadline. If \(D_i < T_i\), then \(\tau_i\) is said to have a constrained deadline. If all the tasks in a task set τ have the same start time, then τ is said to be synchronous. Otherwise, τ is said to be asynchronous.
The utilization of a task \(\tau_i\) is \(U_i = C_i/T_i\). For a task set τ, the total utilization of τ is \(U_{\mathrm{sum}} = \sum_{\tau_i \in \tau} U_i\) and the maximum utilization factor of τ is \(U_{\mathrm{max}} = \max_{\tau_i \in \tau} U_i\).
In the remainder of this paper, a task set τ refers to an asynchronous set of implicit-deadline periodic tasks. As a result, we refer to a task \(\tau_i\) with a 3-tuple \(\tau_i = (S_i, C_i, T_i)\) by omitting the implicit deadline \(D_i\), which is equal to \(T_i\).
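The task model above maps naturally onto a small immutable record. The sketch below (Python; the class and function names are illustrative) validates the constraints \(S_i \ge 0\), \(C_i > 0\), \(T_i \ge C_i\) and computes \(U_i\), \(U_{\mathrm{sum}}\) and \(U_{\mathrm{max}}\) with exact fractions, which avoids floating-point rounding in utilization-based tests.

```python
from dataclasses import dataclass
from fractions import Fraction

@dataclass(frozen=True)
class PeriodicTask:
    """Implicit-deadline periodic task tau_i = (S_i, C_i, T_i), with D_i = T_i."""
    S: int  # start time, >= 0
    C: int  # worst-case execution time, > 0
    T: int  # period (and relative deadline), >= C

    def __post_init__(self):
        if self.S < 0 or self.C <= 0 or self.T < self.C:
            raise ValueError("require S >= 0, C > 0 and T >= C")

    @property
    def U(self) -> Fraction:
        return Fraction(self.C, self.T)

def u_sum(tasks):
    return sum((t.U for t in tasks), Fraction(0))

def u_max(tasks):
    return max(t.U for t in tasks)

tasks = [PeriodicTask(0, 2, 4), PeriodicTask(3, 1, 3), PeriodicTask(1, 3, 12)]
print(u_sum(tasks), u_max(tasks))   # 13/12 1/2
```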
3.2.2 Scheduling an asynchronous set of implicit-deadline periodic tasks

Partitioned: Each task is allocated to a processor and no migration is permitted.

Global: Migration is permitted for all tasks.

Hybrid: Hybrid algorithms mix partitioned and global approaches and they can be further classified into:
 1. Semi-partitioned: Most tasks are allocated to processors and few tasks are allowed to migrate.
 2. Clustered: Processors are grouped into clusters and the tasks that are allocated to one cluster are scheduled by a global scheduler.
 1. \(M_{\mathrm{PAR}}\) is specific to the task set τ for which it is computed. Another task set \(\hat{\tau}\) with the same total utilization and maximum utilization factor as τ might not be schedulable on \(M_{\mathrm{PAR}}\) processors due to partitioning issues.
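This excerpt does not show how \(M_{\mathrm{PAR}}\) is computed. To illustrate why a processor count obtained for one task set need not carry over to another task set with the same \(U_{\mathrm{sum}}\) and \(U_{\mathrm{max}}\), the sketch below uses first-fit partitioning with the uniprocessor EDF utilization test (a processor accepts tasks while its total utilization stays at most 1). This is a common partitioned scheme chosen here for illustration; it is not necessarily the algorithm behind \(M_{\mathrm{PAR}}\).

```python
from fractions import Fraction

def first_fit_edf(utils):
    """Assign tasks (given by their utilizations U_i) to processors first-fit, in
    the given order. A processor accepts a task only if its total utilization
    stays <= 1, the exact EDF test on one processor for implicit-deadline
    periodic tasks. Returns one list of utilizations per processor."""
    bins = []
    for u in utils:
        for b in bins:
            if sum(b) + u <= 1:
                b.append(u)
                break
        else:                        # no existing processor fits: open a new one
            bins.append([u])
    return bins

half, two_fifths, three_tenths = Fraction(1, 2), Fraction(2, 5), Fraction(3, 10)
set_a = [half] * 4                                                 # U_sum = 2, U_max = 1/2
set_b = [half, two_fifths, two_fifths, two_fifths, three_tenths]   # U_sum = 2, U_max = 1/2
print(len(first_fit_edf(set_a)), len(first_fit_edf(set_b)))       # 2 3
```

Both sets have \(U_{\mathrm{sum}} = 2\) and \(U_{\mathrm{max}} = 1/2\), yet `set_b` cannot be split onto two processors by any partitioning, because none of its subsets sums to exactly 1.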
4 Strictly periodic scheduling of acyclic CSDF graphs
This section presents our analytical framework for scheduling the actors in acyclic CSDF graphs as periodic tasks. The framework arranges the actors of the CSDF graph into a set of levels, as shown in Sect. 3, such that all actors belonging to a given level depend directly only on actors in previous levels. Then, we derive a period and a start time for each actor, and a buffer size for each channel. These derived parameters ensure that a strictly periodic schedule can be achieved in the form of a pipelined sequence of invocations of all the actors in each level.
4.1 Definitions and assumptions
In the remainder of this paper, a graph G refers to an acyclic consistent CSDF graph. We base our analysis on the following assumptions:
Assumption 1
 1. \(Z_i \cap Z_j = \emptyset\) for all \(v_i, v_j \in V\) with \(i \neq j\).
 2. The first samples of all the streams arrive no later than the time at which the actors of G start executing.
 3. Each input stream \(I_j\) is characterized by a minimum inter-arrival time (also called period) of the samples, denoted by \(\gamma_j\). This minimum inter-arrival time is assumed to be equal to the period of the input actor which receives \(I_j\). This assumption implies that the designer can control the inter-arrival time of input streams to match the periods of the actors.
Assumption 2
An actor v _{ i } consumes its input data immediately when it starts its firing and produces its output data just before it finishes its firing.
We start with the following definition:
Definition 3
(Execution time vector)
Let \(\eta = \max_{v_i \in V}(\mu_i q_i)\) and \(Q = \operatorname{lcm}\{q_1, q_2, \ldots, q_N\}\) (lcm denotes the least-common-multiple operator). Now, we give the following definition.
Definition 4
(Matched input/output rates graph)
The concept of matched I/O rates applications was first introduced in [30] to describe applications with a low value of Q. However, the authors did not establish an exact test for determining whether an application is matched I/O rates or not. The test in (13) is a novel contribution of this paper. If \(\eta \bmod Q = 0\), then there exists at least one actor in the graph which fully utilizes the processor on which it runs. This, as shown later in Sect. 4.3.3, allows the graph to achieve optimal throughput. On the other hand, if \(\eta \bmod Q \neq 0\), then there are idle intervals in the period of each actor, which result in suboptimal throughput. This is illustrated later in Example 3, which shows the strictly periodic schedule of a mismatched I/O rates application.
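The quantities \(\eta\) and Q and the matched I/O rates test \(\eta \bmod Q = 0\) take only a few lines to compute. The sketch assumes, as stated in the proof of Theorem 4, that \(\mu_i\) is the worst-case execution time of actor \(v_i\), and that `q` holds the repetition vector entries; the function names are hypothetical.

```python
from math import lcm

def eta_and_Q(mu, q):
    """eta = max_i mu_i * q_i and Q = lcm(q_1, ..., q_N)."""
    eta = max(m * r for m, r in zip(mu, q))
    Q = lcm(*q)
    return eta, Q

def is_matched(mu, q):
    """Matched I/O rates test: eta mod Q == 0."""
    eta, Q = eta_and_Q(mu, q)
    return eta % Q == 0

print(is_matched([2, 3, 1], [3, 2, 6]))   # True:  eta = 6,  Q = 6
print(is_matched([5, 3, 1], [3, 2, 6]))   # False: eta = 15, Q = 6
```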
Definition 5
(Output path latency)
Let \(w_{a \leadsto z} = \{(v_a, v_b), \ldots, (v_y, v_z)\}\) be an output path in a graph G. The latency of \(w_{a \leadsto z}\) under periodic input streams, denoted by \(L(w_{a \leadsto z})\), is the elapsed time between the start of the first firing of \(v_a\) which produces data to \((v_a, v_b)\) and the finish of the first firing of \(v_z\) which consumes data from \((v_y, v_z)\).
Consequently, we define the maximum latency of G as follows:
Definition 6
(Graph maximum latency)
Definition 7
(Self-timed schedule)
A self-timed schedule (STS) is one where all the actors are fired as soon as their input data are available.
Definition 8
(Strictly periodic actor)
An actor \(v_i \in V\) is strictly periodic iff the time between any two consecutive firings of \(v_i\) is constant.
Definition 9
(Period vector)
Definition 9 implies that all the actors have the same iteration period. This is captured in the following definition:
Definition 10
(Iteration period)
Now, we prove the existence of a strictly periodic schedule when the input streams are strictly periodic. An input stream \(I_j\) connected to input actor \(v_i\) is strictly periodic iff the inter-arrival time between any two consecutive samples is constant. Based on Assumption 1-3, it follows that \(\gamma_j = \lambda_i\). Later on, we extend the results to handle periodic input streams with jitter and sporadic input streams.
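Definitions 9 and 10 and Lemma 2 are not reproduced in this excerpt. Under the assumptions that periods are integers, that every actor's period is at least its execution time (\(\lambda_i \ge \mu_i\)), and that all actors share the iteration period \(\alpha = \lambda_i q_i\), one can reason as follows: \(\alpha \ge \mu_i q_i\) for every actor gives \(\alpha \ge \eta\), and integrality of every \(\lambda_i = \alpha / q_i\) requires \(\alpha\) to be a multiple of Q. The smallest such value is \(\alpha = Q\lceil \eta/Q \rceil\). The Python sketch below computes it; that this coincides with the paper's \(\boldsymbol{\lambda}^{\min}\) is an assumption, although it is consistent with Corollary 2 (\(U_{\max} = 1\) when \(\eta \bmod Q = 0\)).

```python
from math import lcm

def min_period_vector(mu, q):
    """Smallest integer periods with lambda_i * q_i = alpha for all actors and
    lambda_i >= mu_i (assumed constraints, see above).

    alpha must be >= eta = max_i mu_i*q_i and divisible by every q_i (hence by
    Q = lcm(q)), so the smallest choice is alpha = Q * ceil(eta / Q)."""
    eta = max(m * r for m, r in zip(mu, q))
    Q = lcm(*q)
    alpha = Q * -(-eta // Q)          # integer ceiling division
    return alpha, [alpha // r for r in q]

print(min_period_vector([2, 3, 1], [3, 2, 6]))   # (6, [2, 3, 1])   matched
print(min_period_vector([5, 3, 1], [3, 2, 6]))   # (18, [6, 9, 3])  mismatched
```

For an input actor \(v_i\), Assumption 1-3 then ties the stream period \(\gamma_j\) to the corresponding \(\lambda_i\).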
4.2 Existence of a strictly periodic schedule
Lemma 2
Proof
Theorem 2
For any graph G, a periodic schedule Π exists such that every actor \(v_i \in V\) is strictly periodic with a constant period \(\lambda_i \in \boldsymbol{\lambda}^{\min}\) and every communication channel \(e_u \in E\) has a bounded buffer capacity.
Proof
In schedule \(\Pi_\infty\), every actor \(v_i\) is fired every \(\lambda_i\) time-units once it starts. The start time defined in (26) guarantees that actors in a given level start only when they have enough data to execute one iteration in a periodic way. The overlapping of iterations guarantees that once the actors have started, they always find enough data for executing the next iteration, since their predecessors have already executed one additional iteration. Thus, schedule \(\Pi_\infty\) shows the existence of a periodic schedule of G where every actor \(v_j \in V\) is strictly periodic with a period equal to \(\lambda_j\).
Example 3
4.3 Earliest start times and minimum buffer sizes
Now, we are interested in finding the earliest start times of the actors, and the minimum buffer sizes of the communication channels that guarantee the existence of a periodic schedule. Minimizing the start times and buffer sizes is crucial since it minimizes the initial response time and the memory requirements of the applications modeled as acyclic CSDF graphs.
4.3.1 Earliest start times
In the proof of Theorem 2, the notion of start time was introduced to denote when the actor is started on the system. The start time values used in the proof of the theorem were not the minimum ones. Here, we derive the earliest start times. We start with the following definitions:
Definition 11
(Cumulative production function)
The cumulative production function of actor \(v_i\) producing into channel \(e_u\) during the interval \([t_s, t_e)\), denoted by \(\mathsf{prd}_{[t_s, t_e)}(v_i, e_u)\), is the sum of the number of tokens produced by \(v_i\) into \(e_u\) during the interval \([t_s, t_e)\).
Similarly, we define the cumulative consumption function as follows:
Definition 12
(Cumulative consumption function)
The cumulative consumption function of actor \(v_i\) consuming from channel \(e_u\) over the interval \([t_s, t_e]\), denoted by \(\mathsf{cns}_{[t_s, t_e]}(v_i, e_u)\), is the sum of the number of tokens consumed by \(v_i\) from \(e_u\) during the interval \([t_s, t_e]\).
Recall that \(\mathsf{prec}(v_i)\) is the predecessors set of actor \(v_i\), \(Y_i^u\) is the cumulative consumption of actor \(v_i\) from channel \(e_u\), and α is the iteration period. Now, we give the following lemma:
Lemma 3
Proof
In (36), a valid start time candidate \(\phi_{i \to j}\) must satisfy extra conditions to guarantee that the number of produced tokens on edge \(e_u = (v_i, v_j)\) at any time instant \(t \ge \hat{t}\) is greater than or equal to the number of consumed tokens at the same instant. To satisfy these extra conditions, we consider the following two possible cases:
Satisfying (37) guarantees that \(v_j\) can fire at times \(t = \hat{t}, \hat{t} + \lambda_j, \ldots, \hat{t} + \alpha\). Thus, a valid value of \(\hat{t}\) guarantees that once \(v_j\) is started, it always finds enough data to fire for one iteration. As a result, \(v_j\) executes in a strictly periodic way.
This case occurs when \(v_j\) consumes zero tokens during the interval \([\hat{t}, \phi_i]\). This is a valid behavior since the consumption rate sequence can contain zero elements. Since \(\hat{t} < \phi_i\), it is sufficient to check the cumulative production and consumption over the interval \([\phi_i, \phi_i + \alpha]\), because by time \(t = \phi_i + \alpha\) both \(v_i\) and \(v_j\) are guaranteed to have finished one iteration. Thus, \(\hat{t}\) also guarantees that once \(v_j\) is started, it always finds enough data to fire. Hence, \(v_j\) executes in a strictly periodic way.
Any value of \(\hat{t}\) which satisfies (39) is a valid start time value that guarantees strictly periodic execution of \(v_j\). Since multiple values of \(\hat{t}\) might satisfy (39), we take the minimum value because it is the earliest start time that guarantees strictly periodic execution of \(v_j\). □
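Equations (36) to (39) are not reproduced here. As a hedged, brute-force counterpart to Lemma 3 for a single channel \(e_u = (v_i, v_j)\), the sketch below scans integer start times of \(v_j\) and returns the first one at which every checked firing finds enough tokens. Following Assumption 2, it takes the pessimistic view that the producer's tokens appear only at the deadline of each of its firings and that the consumer takes its tokens at each release. The function name, the integer time grid and the `firings` horizon are assumptions of this sketch.

```python
def cumulative(seq, n):
    """Tokens over the first n firings of a cyclic rate sequence."""
    full, rest = divmod(n, len(seq))
    return full * sum(seq) + sum(seq[:rest])

def earliest_start(phi_i, lam_i, x, lam_j, y, firings, initial_tokens=0, t_max=10_000):
    """Brute-force earliest integer start time t_hat of consumer v_j.

    The k-th producer firing (k = 0, 1, ...) is assumed to deliver its x-tokens
    only at its deadline phi_i + (k+1)*lam_i; the k-th consumer firing takes its
    y-tokens at its release t_hat + k*lam_j. The first `firings` consumer
    firings are checked."""
    def produced_by(t):
        done = max(0, (t - phi_i) // lam_i)   # producer firings with deadline <= t
        return cumulative(x, done)

    for t_hat in range(t_max + 1):
        if all(initial_tokens + produced_by(t_hat + k * lam_j) >= cumulative(y, k + 1)
               for k in range(firings)):
            return t_hat
    raise ValueError("no start time found up to t_max")

print(earliest_start(0, 2, [1], 4, [2], firings=3))   # 4
```

Because `produced_by` never decreases, any start time later than a valid one is also valid. For an actor with several predecessors, the maximum of the per-channel results is therefore the earliest start time that satisfies all its input channels, under the same assumptions.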
4.3.2 Minimum buffer sizes
Lemma 4
Proof
Equation (40) tracks the maximum cumulative number of unconsumed tokens in \(e_u\) during one iteration for \(v_i\) and \(v_j\). There are two cases:
Theorem 3
 1.
every edge \(e_u \in E\) has a capacity of at least \(b_u\) tokens, where \(b_u\) is given by (40)
 2.
\(\tau_G\) satisfies the schedulability test of \(\mathcal{A}\) on M processors
Proof
Follows from Theorem 2, and Lemmas 3 and 4. □
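Equation (40) is not reproduced in this excerpt. A simple, deliberately pessimistic way to size a channel is to count every producer firing from its release (its output cannot appear earlier) and every consumer firing only from its deadline (its input has certainly been consumed by then), and to take the peak difference over a time horizon. The Python sketch below does this. It yields an upper bound that may exceed the minimum \(b_u\) given by (40), and its name and parameters are illustrative.

```python
def cumulative(seq, n):
    """Tokens over the first n firings of a cyclic rate sequence."""
    full, rest = divmod(n, len(seq))
    return full * sum(seq) + sum(seq[:rest])

def buffer_bound(phi_i, lam_i, x, phi_j, lam_j, y, horizon, initial_tokens=0):
    """Conservative bound on the occupancy of channel (v_i, v_j) over [0, horizon]
    for a producer (phi_i, lam_i, x) and a consumer (phi_j, lam_j, y)."""
    def released(t, phi, lam):       # firings with release phi + k*lam <= t
        return 0 if t < phi else (t - phi) // lam + 1

    def expired(t, phi, lam):        # firings with deadline phi + (k+1)*lam <= t
        return 0 if t < phi else (t - phi) // lam

    return max(initial_tokens
               + cumulative(x, released(t, phi_i, lam_i))
               - cumulative(y, expired(t, phi_j, lam_j))
               for t in range(horizon + 1))

print(buffer_bound(0, 2, [1], 4, 4, [2], horizon=40))   # 4
```

In the example, a producer with \(\phi_i = 0\), \(\lambda_i = 2\), x = [1] feeds a consumer with \(\phi_j = 4\), \(\lambda_j = 4\), y = [2], and the bound settles at 4 tokens once both actors run periodically.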
Example 4
4.3.3 Throughput and latency analysis
Now, we analyze the throughput of the graph actors under strictly periodic scheduling and compare it with the maximum achievable throughput. We also present a formula to compute the latency for a given CSDF graph under strictly periodic scheduling. We start with the following definitions:
Definition 13
(Actor throughput)
Definition 14
(Rateoptimal strictly periodic schedule [22])
For a graph G, a strictly periodic schedule that delivers the same throughput as a self-timed schedule for all the actors is called a Rate-Optimal Strictly Periodic Schedule (ROSPS).
Now, we provide the following result.
Theorem 4
For a matched I/O rates graph G, the maximum achievable throughput of the graph actors under strictly periodic scheduling is equal to their maximum throughput under self-timed scheduling.
Proof
Equation (44) shows that the throughput under SPS depends solely on the relationship between Q and η. Recall from Definition 3 that the execution time μ used by our framework is the maximum value over all the actual execution times of the actor. Therefore, if \(\eta \bmod Q = 0\), then \(R_{\mathrm{SPS}}(v_i)\) is exactly the same as \(R_{\mathrm{STS}}(v_i)\) for SDF graphs and for CSDF graphs where all the firings of an actor \(v_i\) require the same actual execution time. If \(\eta \bmod Q \neq 0\) and/or the actual execution time of the actor differs per firing, then \(R_{\mathrm{SPS}}(v_i)\) is lower than \(R_{\mathrm{STS}}(v_i)\). These findings show that our framework allows the designer to analytically determine the type of the application (i.e., matched vs. mismatched) and, accordingly, to select the proper scheduler needed to deliver the maximum achievable throughput.
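Equation (44) is not reproduced here. Assuming \(R_{\mathrm{SPS}}(v_i) = 1/\lambda_i\) with \(\lambda_i = \alpha/q_i\) and \(\alpha = Q\lceil \eta/Q \rceil\) (integer periods), and assuming that under self-timed scheduling an acyclic graph whose actors have constant execution times and cannot overlap their own firings reaches \(R_{\mathrm{STS}}(v_i) = q_i/\eta\), the ratio \(R_{\mathrm{SPS}}(v_i)/R_{\mathrm{STS}}(v_i) = \eta/\alpha\) is the same for every actor. The sketch below computes it exactly; it equals 1 precisely when \(\eta \bmod Q = 0\). The function name is hypothetical.

```python
from fractions import Fraction
from math import lcm

def throughput_ratio(mu, q):
    """R_SPS(v_i) / R_STS(v_i) under the assumptions above: with
    R_SPS(v_i) = q_i / alpha and R_STS(v_i) = q_i / eta, the ratio is eta / alpha."""
    eta = max(m * r for m, r in zip(mu, q))
    Q = lcm(*q)
    alpha = Q * -(-eta // Q)          # Q * ceil(eta / Q)
    return Fraction(eta, alpha)

print(throughput_ratio([2, 3, 1], [3, 2, 6]))   # 1    (matched)
print(throughput_ratio([5, 3, 1], [3, 2, 6]))   # 5/6  (mismatched)
```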
Now, we prove the following result regarding matched I/O rates applications:
Corollary 2
For a matched I/O rates graph G scheduled using its minimum period vector \(\boldsymbol{\lambda}^{\min}\), \(U_{\max} = 1\).
Proof
Recall from Sect. 3.2.2 that \(\beta = \lfloor 1/U_{\max} \rfloor\). It follows from Corollary 2 that β = 1 for matched I/O rates applications scheduled using their minimum period vectors.
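Under the integer-period assumption \(\lambda_i = \alpha/q_i\) with \(\alpha = Q\lceil \eta/Q \rceil\) (an assumption, since the paper's period vector is not reproduced in this excerpt), the utilization of actor \(v_i\) is \(U_i = \mu_i/\lambda_i = \mu_i q_i/\alpha\), so the actor attaining \(\eta\) has \(U_i = 1\) whenever \(\alpha = \eta\). The sketch below computes \(U_{\max}\) and \(\beta\); the function name is hypothetical.

```python
from fractions import Fraction
from math import floor, lcm

def umax_and_beta(mu, q):
    """U_max of the actors under the assumed minimum periods lambda_i = alpha / q_i,
    and beta = floor(1 / U_max)."""
    eta = max(m * r for m, r in zip(mu, q))
    Q = lcm(*q)
    alpha = Q * -(-eta // Q)                                     # Q * ceil(eta / Q)
    u_max = max(Fraction(m * r, alpha) for m, r in zip(mu, q))  # mu_i / lambda_i
    return u_max, floor(1 / u_max)

print(umax_and_beta([2, 3, 1], [3, 2, 6]))   # matched:    U_max = 1,   beta = 1
print(umax_and_beta([1, 1], [2, 3]))         # mismatched: U_max = 1/2, beta = 2
```

For the mismatched example, \(\eta = 3\), \(Q = 6\) and \(\alpha = 6\), so no actor fully utilizes its processor.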
4.4 Handling sporadic input streams
In case the input streams are not strictly periodic, there are several techniques to accommodate the aperiodic nature of the streams. We present here some of these techniques.
4.4.1 De-jitter buffers
In case of periodic input streams with jitter, it is possible to use de-jitter buffers to hide the effect of jitter. We assume that a jittery input stream \(I_i\) starts at time \(t = t_0\) and has a constant inter-arrival time \(\gamma_i\) equal to the input actor period (see Assumption 1-3) and jitter bounds \([\varepsilon_i^-, \varepsilon_i^+]\). The interpretation of the jitter bounds is that the k-th sample of the stream is expected to arrive in the interval \([t_0 + k\gamma_i - \varepsilon_i^-, t_0 + k\gamma_i + \varepsilon_i^+]\). If a sample arrives in the interval \([t_0 + k\gamma_i - \varepsilon_i^-, t_0 + k\gamma_i)\), then it is called an early sample. On the other hand, if the sample arrives in the interval \((t_0 + k\gamma_i, t_0 + k\gamma_i + \varepsilon_i^+]\), then it is called a late sample. Early samples do not affect the periodicity of the input actor because they arrive before the actor release time. Late samples, however, pose a problem as they might arrive after the actor is released.
For late samples, it is possible to insert a buffer before each input actor \(v_i\) receiving a jittery input stream \(I_j\) to hide the effect of jitter. The buffer delays delivering the samples to the input actor by a certain amount of time, denoted by \(t_{\mathrm{buffer}}(I_j)\), which has to be computed such that once the input actor is started, it always finds data in the buffer. Assuming that \(\varepsilon_i^-, \varepsilon_i^+ \in [0, \gamma_i]\), we can derive the minimum value of \(t_{\mathrm{buffer}}(I_j)\) and the minimum buffer size. To do so, we start by proving the following lemma:
Lemma 5
Proof
Lemma 6
Proof
During a time interval \((t, t + t_{\mathrm{MIT}}(I_j))\), \(v_i\) can fire at most twice. Therefore, it is necessary to buffer up to 2 samples in order to guarantee that the input actor \(v_i\) can continue firing periodically when the samples are separated by \(t_{\mathrm{MIT}}\) time-units. □
Lemma 7
Let \(v_i\) be an input actor and \(I_j\) be a jittery input stream to \(v_i\). Suppose that \(I_j\) starts at time \(t = t_0\) and \(v_i\) starts at time \(t = t_0 + t_{\mathrm{buffer}}(I_j)\). The de-jitter buffer must be able to hold at least 3 samples.
Proof
Suppose that the (k−1)-th and (k+1)-th samples arrive late and early, respectively, by the maximum amount of jitter, where both jitter bounds take their maximum value \(\gamma_i\). This means that both samples arrive at time \(t = t_0 + k\gamma_i\). Now, suppose that the k-th sample arrives with no jitter. This means that at time \(t = t_0 + k\gamma_i\) there are 3 samples arriving. Hence, the de-jitter buffer must be able to store them. During the interval \([t_0 + k\gamma_i, t_0 + (k+1)\gamma_i)\), there are no incoming samples and \(v_i\) processes the (k−1)-th sample. At time \(t = t_0 + (k+1)\gamma_i\), the (k+2)-th sample might arrive, which means that there are again 3 samples available to \(v_i\). By the periodicity of \(v_i\) and \(I_j\), the previous pattern can repeat. □
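The argument of Lemma 7 can be checked numerically. The sketch assumes \(t_{\mathrm{buffer}} = \varepsilon_i^+\) (the exact statement of Lemma 5 is not reproduced in this excerpt), under which every sample arrives no later than the release of the job that consumes it, and integer jitter bounds \(\varepsilon_i^-, \varepsilon_i^+ \le \gamma_i\). The function name and the deterministic jitter pattern are illustrative.

```python
def max_dejitter_occupancy(arrivals, t0, gamma, t_buffer):
    """Peak number of samples held by the de-jitter buffer.

    arrivals[k]: arrival time of the k-th sample (k = 0, 1, ...).
    The input actor takes sample k at its release t0 + t_buffer + k*gamma.
    Samples arriving at an instant are counted before the sample taken at that
    same instant is removed (the worst case used in the proof of Lemma 7)."""
    taken = [t0 + t_buffer + k * gamma for k in range(len(arrivals))]
    if any(a > c for a, c in zip(arrivals, taken)):
        raise ValueError("a sample arrived after the input actor needed it")
    # occupancy only grows at arrival instants, so checking those suffices
    return max(sum(a <= t for a in arrivals) - sum(c < t for c in taken)
               for t in set(arrivals))

gamma = 10
jitter = [((k * 7919) % 21) - 10 for k in range(300)]        # deterministic values in [-10, 10]
stream = [k * gamma + j for k, j in enumerate(jitter)]
print(max_dejitter_occupancy(stream, 0, gamma, t_buffer=10))  # never above 3
```

Placing the (k−1)-th sample late and the (k+1)-th sample early by \(\gamma_i\), as in the proof, makes three samples coincide with the on-time k-th sample, and the peak reaches exactly 3.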
The main advantage of the de-jitter buffer approach is that the actors are still treated and scheduled as periodic tasks. However, the major disadvantage is the extra delay encountered by the input stream samples and the extra memory needed for the buffers.
4.4.2 Resource reservation
For sporadic streams in general, we can treat the actors as aperiodic tasks and apply aperiodic task scheduling techniques from real-time scheduling theory [6]. One popular approach uses a server task to service the aperiodic tasks. Servers provide resource reservation guarantees and temporal isolation, and several servers have been proposed in the literature (e.g., [1, 27]). The advantages of using servers are the enforced isolation between tasks and the ability to support arbitrary input streams. When using servers, we can schedule each actor using a server with an execution budget \(C_{s}\) equal to the actor's execution time and a period \(P_{s}\) equal to the actor's period. Because the input data may arrive sporadically, it must also be determined when an actor has enough data to fire. This can be done in two ways:
1. The underlying operating system (OS) or scheduler provides a monitoring mechanism that polls the buffers to detect when an actor has enough data to fire. Once it detects that an actor has enough data, it releases an actor job.
2. The actor implementation is modified such that the polling happens within the actor. In this approach, an actor job is always released at the start of the actor period. When the actor is activated (i.e., running), it checks its input buffers for data. If enough data is available, it executes its function; otherwise, it exhausts its budget and waits until the next period. This mechanism is summarized in Fig. 12.
The first approach (i.e., polling by the OS) does not require modifying the actors' implementations. However, it requires an additional task that continuously checks all the buffers, and this task can become a bottleneck if there are many channels. The second approach is better in terms of scalability and overhead. However, it might delay the processing of data: samples that arrive just after an actor job has found its buffers empty are not processed until the next period.
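The second approach (polling within the actor) can be illustrated with a small discrete-time simulation. This is a hypothetical Python sketch, not the implementation behind Fig. 12: the class `PollingActor`, the helper `simulate`, and the parameters `period` and `rate` (tokens consumed per firing) are illustrative names. Execution time and budget accounting are not modeled; a job that finds too little data simply does nothing until the next release.

```python
from collections import deque


class PollingActor:
    """Hypothetical actor that polls its own input FIFO (second approach).

    A job is released at the start of every period. If the FIFO holds at
    least `rate` tokens, the job consumes them and fires; otherwise it does
    no useful work (a real system would exhaust its budget) and the actor
    waits for its next release.
    """

    def __init__(self, period: int, rate: int):
        self.period = period
        self.rate = rate
        self.fifo = deque()
        self.fired_at = []  # release times of jobs that actually fired

    def job(self, now: int):
        if len(self.fifo) >= self.rate:
            tokens = [self.fifo.popleft() for _ in range(self.rate)]
            self.fired_at.append(now)
            return tokens
        return None  # not enough data in this period


def simulate(actor: PollingActor, arrivals: dict, horizon: int) -> None:
    """Step through discrete time; `arrivals` maps time -> list of tokens.

    Tokens arriving at time t are assumed to be visible to a job released
    at the same time t.
    """
    for t in range(horizon):
        actor.fifo.extend(arrivals.get(t, []))
        if t % actor.period == 0:
            actor.job(t)


if __name__ == "__main__":
    actor = PollingActor(period=5, rate=2)
    simulate(actor, {0: ["a"], 3: ["b"], 7: ["c", "d"], 12: ["e"]}, horizon=20)
    print(actor.fired_at)    # [5, 10]
    print(list(actor.fifo))  # ['e']
```

In this run, the tokens `c` and `d` arrive at time 7 but are only processed by the job released at time 10, which is the processing delay mentioned above.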
5 Evaluation results
We evaluate the framework proposed in Sect. 4 by performing an experiment on a set of 19 real-life streaming applications. The objective of the experiment is to compare the throughput of the streaming applications under our strictly periodic scheduling with their maximum achievable throughput, obtained via self-timed scheduling. After that, we discuss the implications of the results from Sect. 4 and of the throughput comparison experiment. For brevity, in the remainder of this section we refer to our strictly periodic scheduling/schedule as SPS and to self-timed scheduling/schedule as STS.
The streaming applications used in the experiment are real-life applications from different domains (e.g., signal processing, communication, and multimedia). The benchmarks are described in detail in Sect. 5.1.
5.1 Benchmarks
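Benchmarks used for evaluation

| Domain | No. | Application | Source |
|---|---|---|---|
| Signal Processing | 1 | Multichannel beamformer | [30] |
| Signal Processing | 2 | Discrete cosine transform (DCT) | [30] |
| Signal Processing | 3 | Fast Fourier transform (FFT) kernel | [30] |
| Signal Processing | 4 | Filterbank for multirate signal processing | [30] |
| Signal Processing | 5 | Time delay equalization (TDE) | [30] |
| Cryptography | 6 | Data Encryption Standard (DES) | [30] |
| Cryptography | 7 | Serpent | [30] |
| Sorting | 8 | Bitonic parallel sorting | [30] |
| Video processing | 9 | MPEG2 video | [30] |
| Video processing | 10 | H.263 video decoder | [29] |
| Audio processing | 11 | MP3 audio decoder | [29] |
| Audio processing | 12 | CD-to-DAT rate converter (SDF)^{a} | [24] |
| Audio processing | 13 | CD-to-DAT rate converter (CSDF) | [24] |
| Audio processing | 14 | Vocoder | [30] |
| Communication | 15 | Software FM radio with equalizer | [30] |
| Communication | 16 | Data modem | [29] |
| Communication | 17 | Satellite receiver | [29] |
| Communication | 18 | Digital Radio Mondiale receiver | [19] |
| Medical | 19 | Heart pacemaker^{b} | [26] |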
We use the SDF^{3} toolset [29] for several purposes during the experiments. SDF^{3} is a powerful analysis toolset which is capable of analyzing CSDF and SDF graphs to check for consistency errors, compute the repetition vector, compute the maximum achievable throughput, etc. SDF^{3} accepts the graphs in XML format. For StreamIt benchmarks, the StreamIt compiler is capable of exporting an SDF graph representation of the stream program. The exported graph is then converted into the XML format required by SDF^{3}. For the graphs from the research articles, we constructed the XML representation for the CSDF graphs manually.
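One of the analyses delegated to SDF^{3} is the computation of the repetition vector. To make that step concrete, the minimal Python sketch below (our own illustration, not SDF^{3}'s implementation) solves the SDF balance equations \(q_{\mathrm{src}} \cdot p = q_{\mathrm{dst}} \cdot c\) for a connected graph by propagating rate ratios along the edges, scaling to the smallest integer solution, and rejecting inconsistent graphs. It covers plain SDF only; CSDF graphs need an extended version that is not shown. The function name `repetition_vector` and the edge tuple layout `(src, dst, prod, cons)` are hypothetical. The example uses the CD-to-DAT SDF graph with the rates commonly quoted in the dataflow literature (1:1, 2:3, 2:7, 8:7, 5:1); these rates are an assumption and are not taken from this paper. Python 3.9 or later is required for multi-argument `math.gcd` and `math.lcm`.

```python
from collections import defaultdict, deque
from fractions import Fraction
from math import gcd, lcm


def repetition_vector(actors, edges):
    """Smallest positive integer solution of the SDF balance equations.

    `edges` holds tuples (src, dst, prod, cons): src produces `prod` tokens
    per firing and dst consumes `cons` tokens per firing.
    """
    adj = defaultdict(list)
    for src, dst, p, c in edges:
        adj[src].append((dst, Fraction(p, c)))  # q_dst = q_src * p / c
        adj[dst].append((src, Fraction(c, p)))  # q_src = q_dst * c / p
    rate = {actors[0]: Fraction(1)}
    queue = deque([actors[0]])
    while queue:  # breadth-first propagation of relative firing rates
        u = queue.popleft()
        for v, ratio in adj[u]:
            if v not in rate:
                rate[v] = rate[u] * ratio
                queue.append(v)
    if len(rate) != len(actors):
        raise ValueError("graph is not connected")
    scale = lcm(*(r.denominator for r in rate.values()))
    q = {a: int(rate[a] * scale) for a in actors}
    g = gcd(*q.values())
    q = {a: n // g for a, n in q.items()}
    for src, dst, p, c in edges:  # every balance equation must hold
        if q[src] * p != q[dst] * c:
            raise ValueError(f"inconsistent rates on edge {src}->{dst}")
    return q


if __name__ == "__main__":
    cd2dat = [("A", "B", 1, 1), ("B", "C", 2, 3), ("C", "D", 2, 7),
              ("D", "E", 8, 7), ("E", "F", 5, 1)]
    print(repetition_vector(list("ABCDEF"), cd2dat))
    # {'A': 147, 'B': 147, 'C': 98, 'D': 28, 'E': 32, 'F': 160}
```

Propagating exact fractions avoids rounding errors, and the final pass over all edges catches graphs with cycles whose rates cannot be balanced.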
5.2 Experiment: throughput and latency comparison
In this experiment, we compare the throughput and latency resulting from our SPS approach with the maximum achievable throughput and minimum achievable latency of each streaming application. Recall from Definition 7 that the maximum achievable throughput and minimum achievable latency of a streaming application modeled as a CSDF graph are those achieved under self-timed scheduling. We report the throughput of the output actors (i.e., the actors producing the output streams of the application, see Sect. 3). For latency, we report the graph maximum latency according to Definition 6. For SPS, we use the minimum period vector given by Lemma 2. The STS throughput and latency are computed using the SDF^{3} toolset, which defines \(R_{\mathrm{STS}}(G)\) as the graph throughput under STS and \(R_{\mathrm{STS}}(v_{i}) = q_{i} R_{\mathrm{STS}}(G)\) as the throughput of actor \(v_{i}\). Similarly, \(L_{\mathrm{STS}}(G)\) denotes the graph latency under self-timed scheduling. We use the sdf3analysis tool from SDF^{3} to compute the STS throughput and latency with auto-concurrency disabled and assuming unbounded FIFO channel sizes. The throughput is computed using the throughput algorithm, while the latency is computed using the latency(min_st) algorithm.
Results of throughput comparison. \(v_{\mathrm{out}}\) denotes the output actor.

| Application | \(\dot{q}_{\mathrm{out}}\) | \(R_{\mathrm{STS}}(v_{\mathrm{out}})\) | η | Q | \(R_{\mathrm{SPS}}(v_{\mathrm{out}})\) | \(R_{\mathrm{SPS}}(v_{\mathrm{out}})/R_{\mathrm{STS}}(v_{\mathrm{out}})\) |
|---|---|---|---|---|---|---|
| Beamformer | 1 | 1.97×10^{−4} | 5076 | 1 | 1/5076 | 1.0 |
| DCT | 1 | 2.1×10^{−5} | 47616 | 1 | 1/47616 | 1.0 |
| FFT | 1 | 8.31×10^{−5} | 12032 | 1 | 1/12032 | 1.0 |
| Filterbank | 1 | 8.84×10^{−5} | 11312 | 1 | 1/11312 | 1.0 |
| TDE | 1 | 2.71×10^{−5} | 36960 | 1 | 1/36960 | 1.0 |
| DES | 1 | 9.765×10^{−4} | 1024 | 1 | 1/1024 | 1.0 |
| Serpent | 1 | 2.99×10^{−4} | 3336 | 1 | 1/3336 | 1.0 |
| Bitonic | 1 | 1.05×10^{−2} | 95 | 1 | 1/95 | 1.0 |
| MPEG2 | 1 | 1.30×10^{−4} | 7680 | 1 | 1/7680 | 1.0 |
| H.263 | 1 | 3.01×10^{−6} | 332046 | 594 | 1/332046 | 1.0 |
| MP3 | 2 | 5.36×10^{−7} | 3732276 | 2 | 1/1866138 | 1.0 |
| CD2DATS | 160 | 1.667×10^{−1} | 960 | 23520 | 1/147 | 0.04 |
| CD2DATC | 160 | 1.361×10^{−1} | 1176 | 23520 | 1/147 | 0.05 |
| Vocoder | 1 | 1.1×10^{−4} | 9105 | 1 | 1/9105 | 1.0 |
| FM | 1 | 6.97×10^{−4} | 1434 | 1 | 1/1434 | 1.0 |
| Modem | 1 | 6.25×10^{−2} | 16 | 16 | 1/16 | 1.0 |
| Satellite | 240 | 2.27×10^{−1} | 1056 | 5280 | 1/22 | 0.2 |
| Receiver | 288000 | 4.76×10^{−2} | 6048000 | 288000 | 1/21 | 1.0 |
| Pacemaker | 64 | 2.0×10^{−1} | 320 | 320 | 1/5 | 1.0 |
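The \(R_{\mathrm{SPS}}(v_{\mathrm{out}})\) and ratio columns can be cross-checked from the other columns. The minimal Python sketch below assumes that the period of the output actor under SPS is \(\lambda_{\mathrm{out}} = (Q/\dot{q}_{\mathrm{out}}) \lceil \eta / Q \rceil\) and that \(R_{\mathrm{SPS}}(v_{\mathrm{out}}) = 1/\lambda_{\mathrm{out}}\). This relation is our reading of the minimum period vector of Lemma 2 (not reproduced in this section), with η and Q taken directly from the table; it is consistent with every row above. The helper names are hypothetical.

```python
from fractions import Fraction


def sps_output_period(q_out: int, eta: int, Q: int) -> Fraction:
    """Assumed SPS period of the output actor: (Q / q_out) * ceil(eta / Q)."""
    ceil_eta_over_q = -(-eta // Q)  # integer ceiling, avoids float rounding
    return Fraction(Q, q_out) * ceil_eta_over_q


def throughput_ratio(q_out: int, r_sts: float, eta: int, Q: int) -> float:
    """R_SPS(v_out) / R_STS(v_out), with R_SPS(v_out) = 1 / period."""
    r_sps = 1 / sps_output_period(q_out, eta, Q)
    return float(r_sps) / r_sts


# A few rows copied from the table: (name, q_out, R_STS(v_out), eta, Q)
ROWS = [
    ("H.263", 1, 3.01e-6, 332046, 594),
    ("MP3", 2, 5.36e-7, 3732276, 2),
    ("CD2DATS", 160, 1.667e-1, 960, 23520),
    ("Satellite", 240, 2.27e-1, 1056, 5280),
    ("Receiver", 288000, 4.76e-2, 6048000, 288000),
]

if __name__ == "__main__":
    for name, q_out, r_sts, eta, Q in ROWS:
        period = sps_output_period(q_out, eta, Q)
        ratio = throughput_ratio(q_out, r_sts, eta, Q)
        print(f"{name:10s} R_SPS = 1/{period}  ratio = {ratio:.2f}")
```

Exact integer arithmetic is used for the period so that large values such as η = 6048000 do not suffer from floating-point rounding; only the final ratio is computed in floating point, since \(R_{\mathrm{STS}}\) is reported with limited precision.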
5.3 Discussion
Unfortunately, the minimum number of processors cannot be computed as easily as discussed above for STS. This is because the minimum number of processors required by STS, denoted by \(M_{\mathrm{STS}}\), cannot be obtained from closed-form equations such as (9), (10), and (11). In practice, finding \(M_{\mathrm{STS}}\) requires Design Space Exploration (DSE) procedures to find the best allocation that delivers the maximum achievable throughput. This is one more advantage of our SPS framework over STS in cases where SPS delivers the same throughput as STS.
6 Conclusions
We prove that the actors of a streaming application, modeled as an acyclic CSDF graph, can be scheduled as periodic tasks. As a result, a variety of hard-real-time scheduling algorithms for periodic tasks can be applied to schedule such applications with a certain guaranteed throughput. We present an analytical framework for computing the periodic task parameters of the actors, together with the minimum channel sizes, such that a strictly periodic schedule exists. We also show how the proposed framework can handle sporadic input streams. We formally define a class of CSDF graphs, called matched I/O rates applications, which represents more than 80 % of streaming applications. We prove that strictly periodic scheduling is capable of delivering the maximum achievable throughput for matched I/O rates applications, while also making it possible to determine analytically the minimum number of processors needed to schedule these applications.
Acknowledgements
This work is supported by CATRENE/MEDEA+ 2A718 TSAR (Terascale multicore processor architecture) project. We would like to thank William Thies and Sander Stuijk for their support with StreamIt and SDF^{3} benchmarks, respectively.
Open Access
This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.