Abstract
In this paper, we consider the problem of hard-real-time (HRT) multiprocessor scheduling of embedded streaming applications modeled as acyclic dataflow graphs. Most of the hard-real-time scheduling theory for multiprocessor systems assumes independent periodic or sporadic tasks. Such a simple task model is not directly applicable to dataflow graphs, where nodes represent actors (i.e., tasks) and edges represent data-dependencies. The actors in such graphs have data-dependency constraints and do not necessarily conform to the periodic or sporadic task models. In this work, we prove that the actors in acyclic Cyclo-Static Dataflow (CSDF) graphs can be scheduled as periodic tasks. Moreover, we provide a framework for computing the periodic task parameters (i.e., period and start time) of each actor, and for handling sporadic input streams. Furthermore, we formally define a class of CSDF graphs called matched input/output (I/O) rates graphs, which represents more than 80 % of streaming applications. We prove that strictly periodic scheduling is capable of achieving the maximum achievable throughput of an application for matched I/O rates graphs. Therefore, hard-real-time schedulability analysis can be used to determine the minimum number of processors needed to schedule matched I/O rates applications while delivering the maximum achievable throughput. This can be of great use for system designers during the Design Space Exploration (DSE) phase.
Introduction
The ever-increasing complexity of embedded systems realized as Multi-Processor Systems-on-Chips (MPSoCs) is imposing several challenges on systems designers [18]. Two major challenges in designing streaming software for embedded MPSoCs are: (1) how to express the parallelism found in applications efficiently, and (2) how to allocate the processors to provide guaranteed services to multiple running applications, together with the ability to dynamically start/stop applications without affecting other already running applications.
Model-of-Computation (MoC) based design has emerged as a de-facto solution to the first challenge [10]. In MoC-based design, the application can be modeled as a directed graph where nodes represent actors (i.e., tasks) and edges represent communication channels. Different MoCs define different rules and semantics on the computation and communication of the actors. The main benefits of a MoC-based design are the explicit representation of important properties in the application (e.g., parallelism) and the enhanced design-time analyzability of the performance metrics (e.g., throughput). One particular MoC that is popular in the embedded signal processing systems community is the Cyclo-Static Dataflow (CSDF) model [5], which extends the well-known Synchronous Data Flow (SDF) model [15].
Unfortunately, no such de-facto solution exists yet for the second challenge of processor allocation [23]. For a long time, self-timed scheduling was considered the most appropriate policy for streaming applications modeled as dataflow graphs [14, 28]. However, the need to support multiple applications running on a single system without prior knowledge of the properties of the applications (e.g., required throughput, number of tasks, etc.) at system design-time is forcing a shift towards run-time scheduling approaches as explained in [13]. Most of the existing run-time scheduling solutions assume applications modeled as task graphs and provide best-effort or soft-real-time quality-of-service (QoS) [23]. Few run-time scheduling solutions exist which support applications modeled using a MoC and provide hard-real-time QoS [4, 11, 20, 21]. However, these solutions either use simple MoCs such as SDF/PGM graphs or use Time-Division Multiplexing (TDM)/Round-Robin (RR) scheduling. Several algorithms from the hard-real-time multiprocessor scheduling theory [9] can perform fast admission and scheduling decisions for incoming applications while providing hard-real-time QoS. Moreover, these algorithms provide temporal isolation, which is the ability to dynamically start/run/stop applications without affecting other already running applications. However, these algorithms have received little attention in the embedded MPSoC community. This is mainly because they assume independent periodic or sporadic tasks [9]. Such a simple task model is not directly applicable to modern embedded streaming applications, which are typically modeled as directed graphs where nodes represent actors and edges represent data-dependencies. The actors in such graphs have data-dependency constraints and do not necessarily conform to the periodic or sporadic task models.
Therefore, in this paper we investigate the applicability of the hard-real-time scheduling theory for periodic tasks to streaming applications modeled as acyclic CSDF graphs. In such graphs, the actors are data-dependent. However, we analytically prove that they (i.e., the actors) can be scheduled as periodic tasks. As a result, a variety of hard-real-time scheduling algorithms for periodic tasks can be applied to schedule such applications with a certain guaranteed throughput. By considering acyclic CSDF graphs, our findings and proofs are applicable to most streaming applications, since it has been shown recently that around 90 % of streaming applications can be modeled as acyclic SDF graphs [30]. Note that SDF graphs are a subset of the CSDF graphs we consider in this paper.
Problem statement
Given a streaming application modeled as an acyclic CSDF graph, determine whether it is possible to execute the graph actors as periodic tasks. A periodic task τ _{ i } is defined by a 3-tuple τ _{ i }=(S _{ i },C _{ i },T _{ i }). The interpretation is as follows: τ _{ i } is invoked at time instants t=S _{ i }+kT _{ i } and it has to execute for C _{ i } time-units before time t=S _{ i }+(k+1)T _{ i } for all k∈ℕ_{0}, where S _{ i } is the start time of τ _{ i } and T _{ i } is the task period. This scheduling approach is called Strictly Periodic Scheduling (SPS) [22] to avoid confusion with the term periodic scheduling used in the dataflow scheduling theory to refer to a repetitive finite sequence of actor invocations. That sequence is periodic since it is repeated infinitely with a constant period. However, the individual actor invocations are not guaranteed to be periodic. In the remainder of this paper, periodic scheduling/schedule refers to strictly periodic scheduling/schedule.
Paper contributions
Given a streaming application modeled as an acyclic CSDF graph, we analytically prove that it is possible to execute the graph actors as periodic tasks. Moreover, we present an analytical framework for computing the periodic task parameters for the actors, that is, the period and the start time, together with the minimum buffer sizes of the communication channels such that the actors execute as periodic tasks. The proposed framework is also capable of handling sporadic input streams. Furthermore, we formally define two classes of CSDF graphs: matched input/output (I/O) rates graphs and mismatched I/O rates graphs. Matched I/O rates graphs constitute around 80 % of streaming applications [30]. We prove that strictly periodic scheduling is capable of delivering the maximum achievable throughput for matched I/O rates graphs. Applying our approach to matched I/O rates applications enables using a plethora of schedulability tests developed in the real-time scheduling theory [9] to easily determine the minimum number of processors needed to schedule a set of applications using a certain algorithm to provide the maximum achievable throughput. This can be of great use for embedded systems designers during the Design Space Exploration (DSE) phase.
The remainder of this paper is organized as follows: Sect. 2 gives an overview of the related work. Section 3 introduces the CSDF model and the considered system model. Section 4 presents the proposed analytical framework. Section 5 presents the results of empirical evaluation of the framework presented in Sect. 4. Finally, Sect. 6 ends the paper with conclusions.
Related work
Parks and Lee [25] studied the applicability of non-preemptive Rate-Monotonic (RM) scheduling to dataflow programs modeled as SDF graphs. The main differences compared to our work are: (1) they considered non-preemptive scheduling, whereas we consider only preemptive scheduling (non-preemptive scheduling is known to be NP-hard in the strong sense even for the uniprocessor case [12]), and (2) they considered SDF graphs, which are a subset of the more general CSDF graphs.
Goddard [11] studied applying real-time scheduling to dataflow programs modeled using the Processing Graphs Method (PGM). He used a task model called Rate-Based Execution (RBE) in which a real-time task τ _{ i } is characterized by a 4-tuple τ _{ i }=(x _{ i },y _{ i },d _{ i },c _{ i }). The interpretation is as follows: τ _{ i } executes x _{ i } times in time period y _{ i } with a relative deadline d _{ i } per job release and c _{ i } execution time per job release. For a given PGM, he developed an analysis technique to find the RBE task parameters of each actor and the buffer size of each channel. Thus, his approach is closely related to ours. However, our approach uses CSDF graphs, which are more expressive than PGM graphs in that PGM supports only a constant production/consumption rate on edges (same as SDF), whereas CSDF supports varying (but predefined) production/consumption rates. As a result, the analysis technique in [11] is not applicable to CSDF graphs.
Bekooij et al. presented a dataflow analysis for embedded real-time multiprocessor systems [4]. They analyzed the impact of TDM scheduling on applications modeled as SDF graphs. Moreira et al. have investigated real-time scheduling of dataflow programs modeled as SDF graphs in [20–22]. They formulated a resource allocation heuristic [20] and a TDM scheduler combined with a static allocation policy [21]. Their TDM scheduler improves the one proposed in [4]. In [22], they proved that it is possible to derive a strictly periodic schedule for the actors of a cyclic SDF graph iff the periods are greater than or equal to the maximum cycle mean of the graph. They formulated the conditions on the start times of the actors in the equivalent Homogeneous SDF (HSDF, [15]) graph in order to enforce a periodic execution of every actor as a Linear Programming (LP) problem.
Our approach differs from [4, 20–22] in: (1) using the periodic task model, which allows applying a variety of proven hard-real-time scheduling algorithms for multiprocessors, and (2) using the CSDF model, which is more expressive than the SDF model.
Background
Cyclo-static dataflow (CSDF)
In [5], the CSDF model is defined as a directed graph G=〈V,E〉, where V is a set of actors and E⊆V×V is a set of communication channels. Actors represent functions that transform incoming data streams into outgoing data streams. The communication channels carry streams of data, and an atomic data object is called a token. A channel e _{ u }∈E is a first-in, first-out (FIFO) queue with unbounded capacity, and is defined by a tuple e _{ u }=(v _{ i },v _{ j }). The tuple means that e _{ u } is directed from v _{ i } (called source) to v _{ j } (called destination). The number of actors in a graph G is denoted by N=|V|. An actor receiving an input stream of the application is called an input actor, and an actor producing an output stream of the application is called an output actor. A path w _{ a⇝z } between actors v _{ a } and v _{ z } is an ordered sequence of channels defined as w _{ a⇝z }={(v _{ a },v _{ b }),(v _{ b },v _{ c }),…,(v _{ y },v _{ z })}. A path w _{ i⇝j } is called an output path if v _{ i } is an input actor and v _{ j } is an output actor. \(\mathcal{W}\) denotes the set of all output paths in G. In this work, we consider only acyclic CSDF graphs. An acyclic graph G has a number of levels, denoted by \(\mathcal{L}\), which is given by Algorithm 1. The level of an actor v _{ i }∈V is denoted by σ _{ i }. Each actor v _{ i }∈V is associated with four sets:

1.
The successors set, denoted by succ(v _{ i }), and given by:
$$ \mathsf{succ}(v_i) = \bigl\{ v_j \in V : \exists e_u = (v_i, v_j) \in E \bigr\} $$(1) 
2.
The predecessors set, denoted by prec(v _{ i }), and given by:
$$ \mathsf{prec}(v_i) = \bigl\{ v_j \in V : \exists e_u = (v_j, v_i) \in E \bigr\} $$(2) 
3.
The input channels set, denoted by inp(v _{ i }), and given by:
$$ \mathsf{inp}(v_i) = \left \{ \begin{array}{l@{\quad}l} \{ e_u \in E : e_u = (v_j, v_i) \}, & \mbox{if } \sigma_i >1 \\ \mbox{The set of channels delivering the input streams to } v_i & \mbox{if } \sigma_i = 1 \end{array} \right . $$(3) 
4.
The output channels set, denoted by out(v _{ i }), and given by:
$$ \mathsf{out}(v_i) = \left\{ \begin{array}{l@{\quad}l} \{e_u \in E : e_u = (v_i, v_j)\}, & \mbox{if } \sigma_i <\mathcal{L}\\ \mbox{The set of channels carrying the output streams from } v_i, & \mbox{if } \sigma_i = \mathcal{L} \end{array} \right. $$(4)
Every actor v _{ j }∈V has an execution sequence [f _{ j }(1),f _{ j }(2),…,f _{ j }(P _{ j })] of length P _{ j }. The interpretation of this sequence is: The nth time that actor v _{ j } is fired, it executes the code of function f _{ j }(((n−1)modP _{ j })+1). Similarly, production and consumption of tokens are also sequences of length P _{ j } in CSDF. The token production of actor v _{ j } on channel e _{ u } is represented as a sequence of constant integers \([x_{j}^{u}(1), x_{j}^{u}(2), \ldots, x_{j}^{u}(P_{j})]\). The nth time that actor v _{ j } is fired, it produces \(x_{j}^{u}(((n - 1) \bmod P_{j}) + 1)\) tokens on channel e _{ u }. The consumption of actor v _{ k } is completely analogous; the token consumption of actor v _{ k } from a channel e _{ u } is represented as a sequence \([y_{k}^{u}(1), y_{k}^{u}(2), \ldots, y_{k}^{u}(P_{k})]\). The firing rule of a CSDF actor v _{ k } is evaluated as “true” for its nth firing iff every input channel e _{ u } of v _{ k } contains at least \(y_{k}^{u}(((n - 1) \bmod P_{k}) + 1)\) tokens. The total number of tokens produced by actor v _{ j } on channel e _{ u } during the first n invocations, denoted by \(X_{j}^{u}(n)\), is given by \(X_{j}^{u}(n) = \sum_{l = 1}^{n} x_{j}^{u}(l)\). Similarly, the total number of tokens consumed by actor v _{ k } from channel e _{ u } during the first n invocations, denoted by \(Y_{k}^{u}(n)\), is given by \(Y_{k}^{u}(n) = \sum_{l = 1}^{n} y_{k}^{u}(l)\).
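The cyclic indexing and the cumulative counts can be illustrated with a small sketch. It assumes that the summation index in X and Y wraps around the phase sequence in the same way as the per-firing rate does; the function and variable names are illustrative only, and the two sequences are the ones quoted for channel e _{2} in Example 1.

```python
def rate_at_firing(sequence, n):
    """Rate used by the n-th firing (1-based): sequence[(n - 1) mod P]."""
    return sequence[(n - 1) % len(sequence)]


def cumulative_tokens(sequence, n):
    """Tokens moved during the first n firings, with phases repeating cyclically."""
    full_cycles, remainder = divmod(n, len(sequence))
    return full_cycles * sum(sequence) + sum(sequence[:remainder])


production_v1_e2 = [5, 3, 2]   # production sequence of v1 on e2 (Example 1)
consumption_v3_e2 = [1, 3, 1]  # consumption sequence of v3 on e2 (Example 1)

print(rate_at_firing(production_v1_e2, 4))      # fourth firing wraps to phase 1
print(cumulative_tokens(production_v1_e2, 4))   # 5 + 3 + 2 + 5
print(cumulative_tokens(consumption_v3_e2, 3))  # 1 + 3 + 1
```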
Example 1
Figure 1 shows a CSDF graph consisting of four actors and four communication channels. Actor v _{1} is the input actor with a successors set succ(v _{1})={v _{2},v _{3}}, and v _{4} is the output actor with a predecessors set prec(v _{4})={v _{2},v _{3}}. There are two output paths in the graph: w _{1}={(v _{1},v _{2}),(v _{2},v _{4})} and w _{2}={(v _{1},v _{3}),(v _{3},v _{4})}. The production sequences are shown between square brackets at the start of edges (e.g., [5,3,2] for actor v _{1} on edge e _{2}), while the consumption sequences are shown between square brackets at the end of the edges (e.g., [1,3,1] for v _{3} on e _{2}).
An important property of the CSDF model is its decidability, which is the ability to derive at compile-time a schedule for the actors. This is formulated in the following definitions and results from [5].
Definition 1
(Valid static schedule [5])
Given a connected CSDF graph G, a valid static schedule for G is a finite sequence of actor invocations that can be repeated infinitely on the incoming sample stream while the amount of data in the buffers remains bounded. A vector q=[q _{1},q _{2},…,q _{ N }]^{T}, where q _{ j }>0, is a repetition vector of G if each q _{ j } represents the number of invocations of an actor v _{ j } in a valid static schedule for G. The repetition vector of G in which all the elements are relatively prime^{Footnote 1} is called the basic repetition vector of G, denoted by \(\dot{\mathbf{q}}\). G is consistent if there exists a repetition vector. If a deadlock-free schedule can be found, G is said to be live. Both consistency and liveness are required for the existence of a valid static schedule.
Theorem 1
([5])
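In a CSDF graph G, a repetition vector q=[q _{1},q _{2},…,q _{ N }]^{T} is given by
$$ \mathbf{q} = \mathbf{P} \cdot \mathbf{r}, \quad \mbox{with } P_{jk} = \left\{ \begin{array}{l@{\quad}l} P_j, & \mbox{if } j = k \\ 0, & \mbox{otherwise} \end{array} \right. $$(5)
where r=[r _{1},r _{2},…,r _{ N }]^{T} is a positive integer solution of the balance equation
$$ \boldsymbol{\Gamma} \cdot \mathbf{r} = \mathbf{0} $$(6)
and where the topology matrix Γ∈ℤ^{|E|×|V|} is defined by
$$ \Gamma_{uj} = \left\{ \begin{array}{l@{\quad}l} X_j^u(P_j), & \mbox{if actor } v_j \mbox{ produces on channel } e_u \\ -Y_j^u(P_j), & \mbox{if actor } v_j \mbox{ consumes from channel } e_u \\ 0, & \mbox{otherwise} \end{array} \right. $$(7)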
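As an illustration of Theorem 1, the following sketch solves the balance equation by propagating ratios along the channels rather than by computing a matrix null-space; this is a substitute technique chosen to keep the example dependency-free, not the procedure of [5]. It assumes a connected graph in which every rate sequence of an actor has exactly P _{ j } entries and sums to a positive value. The two-actor graph is hypothetical, and `math.lcm` requires Python 3.9 or later.

```python
from fractions import Fraction
from functools import reduce
from math import gcd, lcm


def repetition_vector(phases, channels):
    """phases: {actor: P_j}; channels: [(src, dst, prod_seq, cons_seq)].

    Returns {actor: q_j} with q_j = P_j * r_j, where r is the smallest
    positive integer solution of the per-channel balance
    r_src * X_src(P_src) = r_dst * Y_dst(P_dst).
    """
    adjacency = {actor: [] for actor in phases}
    for src, dst, prod, cons in channels:
        ratio = Fraction(sum(prod), sum(cons))  # r_dst / r_src
        adjacency[src].append((dst, ratio))
        adjacency[dst].append((src, 1 / ratio))

    start = next(iter(phases))
    r = {start: Fraction(1)}
    stack = [start]
    while stack:
        actor = stack.pop()
        for neighbour, ratio in adjacency[actor]:
            value = r[actor] * ratio
            if neighbour not in r:
                r[neighbour] = value
                stack.append(neighbour)
            elif r[neighbour] != value:
                raise ValueError("inconsistent graph: no repetition vector")
    if len(r) != len(phases):
        raise ValueError("graph is not connected")

    # Scale the rational solution to the smallest positive integers.
    scale = reduce(lcm, (f.denominator for f in r.values()))
    integers = {a: int(f * scale) for a, f in r.items()}
    common = reduce(gcd, integers.values())
    return {a: phases[a] * (integers[a] // common) for a in phases}


# Hypothetical graph: actor "a" has two phases producing [1, 2] tokens,
# actor "b" has one phase consuming [2] tokens.
q = repetition_vector({"a": 2, "b": 1}, [("a", "b", [1, 2], [2])])
print(q)
```

In this hypothetical graph, four firings of "a" produce 1+2+1+2 = 6 tokens and three firings of "b" consume 3·2 = 6 tokens, so the buffer returns to its initial state after one iteration.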
Definition 2
For a consistent and live CSDF graph G, an actor iteration is the invocation of an actor v _{ i }∈V for q _{ i } times, and a graph iteration is the invocation of every actor v _{ i }∈V for q _{ i } times, where q _{ i }∈q.
Corollary 1
(From [5])
If a consistent and live CSDF graph G completes n iterations, where n∈ℕ, then the net change to the number of tokens in the buffers of G is zero.
Lemma 1
Any acyclic consistent CSDF graph is live.
Proof
Bilsen et al. proved in [5] that a CSDF graph is live iff every cycle in the graph is live. Equivalently, a CSDF graph deadlocks only if it contains at least one cycle. Thus, absence of cycles in a CSDF graph implies its liveness. □
Example 2
For the CSDF graph shown in Fig. 1
System model and scheduling algorithms
In this section, we introduce the system model and the related schedulability results.
System model
A system Ω consists of a set π={π _{1},π _{2},…,π _{ m }} of m homogeneous processors. The processors execute a task set τ={τ _{1},τ _{2},…,τ _{ n }} of n periodic tasks, and a task may be preempted at any time. A periodic task τ _{ i }∈τ is defined by a 4-tuple τ _{ i }=(S _{ i },C _{ i },T _{ i },D _{ i }), where S _{ i }≥0 is the start time of τ _{ i }, C _{ i }>0 is the worst-case execution time of τ _{ i }, T _{ i }≥C _{ i } is the task period, and D _{ i }, where C _{ i }≤D _{ i }≤T _{ i }, is the relative deadline of τ _{ i }. A periodic task τ _{ i } is invoked (i.e., releases a job) at time instants t=S _{ i }+kT _{ i } for all k∈ℕ_{0}. Upon invocation, τ _{ i } executes for C _{ i } time-units. The relative deadline D _{ i } is interpreted as follows: τ _{ i } has to finish executing its kth invocation before time t=S _{ i }+kT _{ i }+D _{ i } for all k∈ℕ_{0}. If D _{ i }=T _{ i }, then τ _{ i } is said to have an implicit deadline. If D _{ i }<T _{ i }, then τ _{ i } is said to have a constrained deadline. If all the tasks in a task set τ have the same start time, then τ is said to be synchronous. Otherwise, τ is said to be asynchronous.
The utilization of a task τ _{ i } is U _{ i }=C _{ i }/T _{ i }. For a task set τ, the total utilization of τ is \(U_{\mathrm{sum}} = \sum_{\tau_{i} \in\tau} U_{i}\) and the maximum utilization factor of τ is \(U_{\mathrm{max}} = \max_{\tau_{i} \in\tau} U_{i}\).
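A minimal sketch of these two quantities, using exact rational arithmetic to avoid floating-point rounding. The task parameters are hypothetical and the helper name is illustrative only.

```python
from fractions import Fraction


def utilization_stats(tasks):
    """tasks: list of (S_i, C_i, T_i) tuples. Returns (U_sum, U_max)."""
    utilizations = [Fraction(c, t) for _, c, t in tasks]
    return sum(utilizations), max(utilizations)


# Hypothetical asynchronous implicit-deadline task set (S_i, C_i, T_i).
tasks = [(0, 1, 4), (0, 2, 6), (3, 3, 12)]
u_sum, u_max = utilization_stats(tasks)
print(u_sum, u_max)  # U_sum = 1/4 + 1/3 + 1/4, U_max = 1/3
```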
In the remainder of this paper, a task set τ refers to an asynchronous set of implicit-deadline periodic tasks. As a result, we refer to a task τ _{ i } with a 3-tuple τ _{ i }=(S _{ i },C _{ i },T _{ i }) by omitting the implicit deadline D _{ i }, which is equal to T _{ i }.
Scheduling asynchronous sets of implicit-deadline periodic tasks
Given a system Ω and a task set τ, a valid schedule is one that allocates a processor to a task τ _{ i }∈τ for exactly C _{ i } time-units in the interval [S _{ i }+kT _{ i },S _{ i }+(k+1)T _{ i }) for all k∈ℕ_{0} with the restriction that a task may not execute on more than one processor at the same time. A necessary and sufficient condition for τ to be scheduled on Ω to meet all the deadlines (i.e., τ is feasible) is:
$$ U_{\mathrm{sum}} \leq m $$(8)
The problem of constructing a periodic schedule for τ can be solved using several algorithms [9]. These algorithms differ in the following aspects: (1) Priority Assignment: A task can have fixed priority, job-fixed priority, or dynamic priority, and (2) Allocation: Based on whether a task can migrate between processors upon preemption, algorithms are classified into:

Partitioned: Each task is allocated to a processor and no migration is permitted

Global: Migration is permitted for all tasks

Hybrid: Hybrid algorithms mix partitioned and global approaches and can be further classified into:

1.
Semi-partitioned: Most tasks are allocated to processors and few tasks are allowed to migrate

2.
Clustered: Processors are grouped into clusters and the tasks that are allocated to one cluster are scheduled by a global scheduler

An important property of scheduling algorithms is optimality. A scheduling algorithm \(\mathcal{A}\) is said to be optimal iff it can schedule any feasible task set τ on Ω. Several global and hybrid algorithms were proven optimal for scheduling asynchronous sets of implicit-deadline periodic tasks (e.g., [2, 3, 8, 16]). The minimum number of processors needed to schedule τ using an optimal scheduling algorithm, denoted by M _{ OPT }, is given by:
$$ M_{\mathit{OPT}} = \lceil U_{\mathrm{sum}} \rceil $$(9)
Partitioned algorithms are known to be non-optimal for scheduling implicit-deadline periodic tasks [7]. However, they have the advantage of not requiring task migration. One prominent example of partitioned scheduling algorithms is the Partitioned Earliest Deadline First (PEDF) algorithm. EDF is known to be optimal for scheduling arbitrary task sets on a uniprocessor system [6]. In a multiprocessor system, EDF can be combined with different processor allocation algorithms (e.g., bin-packing heuristics such as First-Fit (FF) and Worst-Fit (WF)). López et al. derived in [17] the worst-case utilization bounds for a task set τ to be schedulable using PEDF. These bounds serve as a simple sufficient schedulability test. Based on these bounds, they derived the minimum number of processors needed to schedule a task set τ under PEDF, denoted by M _{ PEDF }:
$$ M_{\mathit{PEDF}} = \left\{ \begin{array}{l@{\quad}l} 1, & \mbox{if } U_{\mathrm{sum}} \leq 1 \\ \left\lceil \frac{(\beta + 1) U_{\mathrm{sum}} - 1}{\beta} \right\rceil, & \mbox{if } U_{\mathrm{sum}} > 1 \end{array} \right. $$(10)
where β=⌊1/U _{max}⌋. A task set τ with total utilization U _{sum} and maximum utilization factor U _{max} is always guaranteed to be schedulable on M _{ PEDF } processors. Since M _{ PEDF } is derived based on a sufficient test, it is important to note that τ may be schedulable on fewer processors. We define M _{ PAR } as the minimum number of processors on which τ can be partitioned assuming bin-packing allocation (e.g., First-Fit (FF)) with each set in the partition having a total utilization of at most 1. M _{ PAR } can be expressed as:
$$ M_{\mathit{PAR}} = \min_{x \in\mathbb{N}} \bigl\{ x : B \mbox{ is an } x\mbox{-partition of } \tau \mbox{ and } U_{B_i} \leq 1\ \forall B_i \in B \bigr\} $$(11)
where \(U_{B_i}\) denotes the total utilization of the tasks in the set B _{ i }.
M _{ PAR } is specific to the task set τ for which it is computed. Another task set \(\hat{\tau}\) with the same total utilization and maximum utilization factor as τ might not be schedulable on M _{ PAR } processors due to partitioning issues.
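The gap between the optimal processor count and a partitioned count can be seen in a small sketch. It assumes that M _{ OPT } is the ceiling of the total utilization, and it uses plain First-Fit in the given task order as the bin-packing heuristic; First-Fit yields a valid partition but not necessarily the smallest one. The utilization values are hypothetical.

```python
from fractions import Fraction
from math import ceil


def m_opt(utilizations):
    """Processors needed by an optimal algorithm: ceiling of U_sum."""
    return ceil(sum(utilizations))


def first_fit_processors(utilizations):
    """Number of processors used by First-Fit with capacity 1 per processor."""
    loads = []
    for u in utilizations:
        for index, load in enumerate(loads):
            if load + u <= 1:
                loads[index] = load + u
                break
        else:
            loads.append(u)  # no existing processor can host this task
    return len(loads)


# Hypothetical task utilizations: three heavy tasks and one light task.
utilizations = [Fraction(3, 5), Fraction(3, 5), Fraction(3, 5), Fraction(1, 5)]
print(m_opt(utilizations))                 # U_sum = 2
print(first_fit_processors(utilizations))  # no two heavy tasks share a processor
```

Here the total utilization is exactly 2, yet no two of the heavy tasks fit on one processor, so any partition needs three processors. This is the kind of partitioning issue mentioned above.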
Strictly periodic scheduling of acyclic CSDF graphs
This section presents our analytical framework for scheduling the actors in acyclic CSDF graphs as periodic tasks. The construction it uses arranges the actors forming the CSDF graph into a set of levels as shown in Sect. 3. All actors belonging to a certain level depend directly only on the actors in the previous levels. Then, we derive, for each actor, a period and start time, and for each channel, a buffer size. These derived parameters ensure that a strictly periodic schedule can be achieved in the form of a pipelined sequence of invocations of all the actors in each level.
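One way to obtain such levels is longest-path layering: an actor without predecessors is placed in level 1, and every other actor is placed one level above its highest predecessor. The sketch below follows that assumption and is not a transcription of Algorithm 1; the function name is illustrative, and the graph topology is the one described in Example 1.

```python
def actor_levels(actors, channels):
    """Return {actor: level}, with level 1 for actors without predecessors."""
    predecessors = {actor: set() for actor in actors}
    for src, dst in channels:
        predecessors[dst].add(src)

    levels = {}

    def level(actor):
        # Memoised recursion; terminates because the graph is acyclic.
        if actor not in levels:
            levels[actor] = 1 + max(
                (level(p) for p in predecessors[actor]), default=0
            )
        return levels[actor]

    for actor in actors:
        level(actor)
    return levels


# Topology of the graph in Example 1 (rates omitted).
levels = actor_levels(
    ["v1", "v2", "v3", "v4"],
    [("v1", "v2"), ("v1", "v3"), ("v2", "v4"), ("v3", "v4")],
)
print(levels)                # v1 in level 1, v2 and v3 in level 2, v4 in level 3
print(max(levels.values()))  # number of levels
```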
Definitions and assumptions
In the remainder of this paper, a graph G refers to an acyclic consistent CSDF graph. We base our analysis on the following assumptions:
Assumption 1
A graph G has a set \(I = \{ I_{1}, I_{2}, \ldots, I_{\mathcal{K}}\}\) of \(\mathcal{K}\) sporadic input streams connected to the input actors of G. The set of input streams to an actor v _{ i } is denoted by Z _{ i }. We make the following assumptions about the input streams:

1.
Z _{ i }∩Z _{ j }=∅ ∀v _{ i },v _{ j }∈V with i≠j.

2.
The first samples of all the streams arrive prior to or at the same time when the actors of G start executing

3.
Each input stream I _{ j } is characterized by a minimum inter-arrival time (also called period) of the samples, denoted by γ _{ j }. This minimum inter-arrival time is assumed to be equal to the period of the input actor which receives I _{ j }. This assumption indicates that the inter-arrival time for input streams can be controlled by the designer to match the periods of the actors.
Assumption 2
An actor v _{ i } consumes its input data immediately when it starts its firing and produces its output data just before it finishes its firing.
We start with the following definition:
Definition 3
(Execution time vector)
For a graph G, an execution time vector μ, where μ∈ℕ^{N}, represents the worst-case execution times, measured in time-units, of the actors in G. The worst-case execution time of an actor v _{ j }∈V is given by
$$ \mu_j = \max_{k = 1}^{P_j} \biggl( T^{R} \sum_{e_l \in\mathsf{inp}(v_j)} y_j^l(k) + T_j^{C}(k) + T^{W} \sum_{e_r \in\mathsf{out}(v_j)} x_j^r(k) \biggr) $$(12)
where P _{ j } is the length of CSDF firing/production/consumption sequences of actor v _{ j }, T ^{R} is the worst-case time needed to read a single token from an input channel, \(y_{j}^{l}\) is the consumption sequence of v _{ j } from channel e _{ l }, T ^{W} is the worst-case time needed to write a single token to an output channel, \(x_{j}^{r}\) is the production sequence of v _{ j } into channel e _{ r }, and \(T_{j}^{C}(k)\) is the worst-case computation time of v _{ j } in firing k.
Let \(\eta= \max_{v_{i} \in V}(\mu_{i} q_{i})\) and Q=lcm{q _{1},q _{2},…,q _{ N }} (lcm denotes the least-common-multiple operator). Now, we give the following definition.
Definition 4
(Matched input/output rates graph)
A graph G is said to be a matched input/output (I/O) rates graph if and only if
$$ \eta \bmod Q = 0 $$(13)
If (13) does not hold, then G is said to be a mismatched I/O rates graph.
The concept of matched I/O rates applications was first introduced in [30] as the applications with a low value of Q. However, the authors did not establish an exact test for determining whether an application is matched I/O rates or not. The test in (13) is a novel contribution of this paper. If ηmodQ=0, then there exists at least one actor in the graph which fully utilizes the processor on which it runs. This, as shown later in Sect. 4.3.3, allows the graph to achieve optimal throughput. On the other hand, if ηmodQ≠0, then there exist idle durations in the period of each actor, which results in sub-optimal throughput. This is illustrated later in Example 3, which shows the strictly periodic schedule of a mismatched I/O rates application.
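The test can be sketched in a few lines, assuming η is the largest product of worst-case execution time and repetition count and Q is the least common multiple of the repetition counts, as defined above. The two graphs are hypothetical, the function name is illustrative, and `math.lcm` requires Python 3.9 or later.

```python
from math import lcm


def matched_io_test(mu, q):
    """Return (is_matched, eta, Q) for execution times mu and repetitions q."""
    eta = max(m * r for m, r in zip(mu, q))
    big_q = lcm(*q)
    return eta % big_q == 0, eta, big_q


# Hypothetical graph A: every actor needs mu_i * q_i = 6 time-units per iteration.
print(matched_io_test([2, 3, 1], [3, 2, 6]))
# Hypothetical graph B: eta = 5 is not a multiple of Q = 2.
print(matched_io_test([5, 2], [1, 2]))
```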
Definition 5
(Output path latency)
Let w _{ a⇝z }={(v _{ a },v _{ b }),…,(v _{ y },v _{ z })} be an output path in a graph G. The latency of w _{ a⇝z } under periodic input streams, denoted by L(w _{ a⇝z }), is the elapsed time between the start of the first firing of v _{ a } which produces data to (v _{ a },v _{ b }) and the finish of the first firing of v _{ z } which consumes data from (v _{ y },v _{ z }).
Consequently, we define the maximum latency of G as follows:
Definition 6
(Graph maximum latency)
For a graph G, the maximum latency of G under periodic input streams, denoted by L(G), is given by:
$$ L(G) = \max_{w_{i \rightsquigarrow j} \in\mathcal{W}} L(w_{i \rightsquigarrow j}) $$(14)
Definition 7
(Self-timed schedule)
A self-timed schedule (STS) is one where all the actors are fired as soon as their input data are available.
Self-timed scheduling has been shown in [28] to achieve the maximum achievable throughput and minimum achievable latency of a Homogeneous SDF (HSDF, [15]) graph. This result extends to CSDF graphs since any CSDF graph can be converted to an equivalent HSDF graph. For acyclic graphs, the STS throughput of an actor v _{ i }, denoted by R _{ STS }(v _{ i }), is given by:
$$ R_{\mathit{STS}}(v_i) = \frac{q_i}{\eta} $$(15)
Definition 8
(Strictly periodic actor)
An actor v _{ i }∈V is strictly periodic iff the time period between any two consecutive firings is constant.
Definition 9
(Period vector)
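For a graph G, a period vector λ, where λ∈ℕ^{N}, represents the periods, measured in time-units, of the actors in G. λ _{ j }∈λ is the period of actor v _{ j }∈V. λ is given by the solution to both
$$ q_1 \lambda_1 = q_2 \lambda_2 = \cdots= q_{N-1} \lambda_{N-1} = q_N \lambda_N $$(16)
and
$$ \boldsymbol{\lambda} - \boldsymbol{\mu} \geq\mathbf{0} $$(17)
where \(q_{j} \in\dot{\mathbf{q}}\) (the basic repetition vector of G according to Definition 1).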
Definition 9 implies that all the actors have the same iteration period. This is captured in the following definition:
Definition 10
(Iteration period)
For a graph G, the iteration period under strictly periodic scheduling, denoted by α, is given by
$$ \alpha= q_i \lambda_i \quad\mbox{for any } v_i \in V $$(18)
Now, we prove the existence of a strictly periodic schedule when the input streams are strictly periodic. An input stream I _{ j } connected to input actor v _{ i } is strictly periodic iff the inter-arrival time between any two consecutive samples is constant. Based on the third item of Assumption 1, it follows that γ _{ j }=λ _{ i }. Later on, we extend the results to handle periodic with jitter and sporadic input streams.
Existence of a strictly periodic schedule
Lemma 2
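For a graph G, the minimum period vector of G, denoted by λ ^{min}, is given by
$$ \lambda_i^{\mathrm{min}} = \frac{Q}{q_i} \biggl\lceil\frac{\eta}{Q} \biggr\rceil \quad\mbox{for } v_i \in V $$(19)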
Proof
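Equation (16) can be rewritten as:
$$ \boldsymbol{\Delta} \boldsymbol{\lambda} = \mathbf{0} $$(20)
where Δ∈ℤ^{(N−1)×N} is given by
$$ \boldsymbol{\Delta} = \left[ \begin{array}{ccccc} q_1 & -q_2 & 0 & \cdots& 0 \\ 0 & q_2 & -q_3 & \cdots& 0 \\ \vdots& & \ddots& \ddots& \vdots\\ 0 & \cdots& 0 & q_{N-1} & -q_N \end{array} \right] $$(21)
Observe that nullity(Δ)=1. Thus, there exists a single vector which forms the basis of the null-space of Δ. This vector can be represented by taking any unknown λ _{ k } as the free unknown and expressing the other unknowns in terms of it, which results in:
$$ \lambda_j = \frac{q_k}{q_j} \lambda_k \quad\forall v_j \in V $$
Every λ _{ j } must be an integer, so q _{ k } λ _{ k } must be a multiple of every q _{ j }. The minimum λ _{ k }∈ℕ is therefore
$$ \lambda_k = \frac{\mathrm{lcm}\{q_1, q_2, \ldots, q_N\}}{q_k} = \frac{Q}{q_k} $$
Thus, the minimum λ∈ℕ^{N} that solves (16) is given by
$$ \lambda_i = \frac{Q}{q_i} \quad\mbox{for } v_i \in V $$(22)
Let \(\boldsymbol{\hat{\lambda}}\) be the solution given by (22). Equations (16) and (17) can be rewritten as:
$$ \boldsymbol{\lambda} = c \boldsymbol{\hat{\lambda}} $$(23)
$$ c \boldsymbol{\hat{\lambda}} - \boldsymbol{\mu} \geq\mathbf{0} $$(24)
where c∈ℕ. Equation (24) can be rewritten as:
$$ c \geq\frac{\mu_i}{\hat{\lambda}_i} = \frac{\mu_i q_i}{Q} \quad\forall v_i \in V $$(25)
It follows that c must be greater than or equal to \(\max_{v_{i} \in V}(\mu_{i} q_{i}) /Q = \eta/ Q\). However, η/Q is not always guaranteed to be an integer. As a result, the value is rounded by taking the ceiling. It follows that the minimum λ which satisfies both of (16) and (17) is given by
$$ \lambda_i^{\mathrm{min}} = \frac{Q}{q_i} \biggl\lceil\frac{\eta}{Q} \biggr\rceil \quad\mbox{for } v_i \in V $$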
□
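The result of Lemma 2 can be checked numerically. The sketch below reads the proof as giving each actor the period (Q/q _{ i })·⌈η/Q⌉, which is our interpretation of the argument above; the helper name and the two graphs are hypothetical, and `math.lcm` requires Python 3.9 or later.

```python
from math import lcm


def min_period_vector(mu, q):
    """Minimum periods under the reading lambda_i = (Q / q_i) * ceil(eta / Q)."""
    eta = max(m * r for m, r in zip(mu, q))
    big_q = lcm(*q)
    c = -(-eta // big_q)  # integer ceiling of eta / Q
    return [(big_q // r) * c for r in q]


# Hypothetical mismatched graph: eta = 5, Q = 2, so c = 3.
mu, q = [5, 2], [1, 2]
periods = min_period_vector(mu, q)
print(periods)

# All actors share one iteration period, and no period is below the WCET.
assert len({r * p for r, p in zip(q, periods)}) == 1
assert all(p >= m for p, m in zip(periods, mu))

# Hypothetical matched graph: eta = Q = 6, so every period equals its WCET.
print(min_period_vector([2, 3, 1], [3, 2, 6]))
```

In the mismatched case the first actor has a period of 6 time-units but needs only 5, which illustrates the idle time discussed after Definition 4.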
Theorem 2
For any graph G, a periodic schedule Π exists such that every actor v _{ i }∈V is strictly periodic with a constant period λ _{ i }∈λ ^{min} and every communication channel e _{ u }∈E has a bounded buffer capacity.
Proof
Recall that in this proof we assume that the input streams to level-1 actors are strictly periodic with periods equal to the input actors' periods. Therefore, it follows that level-1 actors can execute periodically since their input streams are always available when they fire. By Definition 2, level-1 actors will complete one iteration when they fire q _{ i } times, where q _{ i } is the repetition of v _{ i }∈A _{1} and A _{1} denotes the set of level-1 actors. Assume that level-1 actors start executing at time t=0. Then, by time t=α, level-1 actors are guaranteed to finish one iteration. According to Theorem 1, level-1 actors will also generate enough data such that every actor v _{ k }∈A _{2} can execute q _{ k } times (i.e., one iteration) with a period λ _{ k }. According to (16), firing v _{ k } for q _{ k } times with a period λ _{ k } takes α time-units. Thus, starting level-2 actors at time t=α guarantees that they can execute periodically with their periods given by Definition 9 for α time-units. Similarly, by time t=2α, level-3 actors will have enough data to execute for one iteration. Thus, starting level-3 actors at time t=2α guarantees that they can execute periodically for α time-units. By repeating this over all the \(\mathcal{L}\) levels, a schedule Π _{1} (shown in Fig. 2) is constructed in which all the actors that belong to A _{ i } are started at start time ϕ _{ i } given by
$$ \phi_i = (i - 1) \alpha $$(26)
A _{ j }(k) denotes level-j actors executing their kth iteration. For example, A _{2}(1) denotes level-2 actors executing their first iteration. At time \(t = \mathcal{L}\alpha\), G completes one iteration. It is trivial to observe from Π _{1} that as soon as level-1 actors finish one iteration, they can immediately start executing the next iteration since their input streams arrive periodically. If level-1 actors start their second iteration at time t=α, their execution will overlap with the execution of the level-2 actors. By doing so, level-2 actors can start their second iteration immediately after finishing their first iteration since they will have all the needed data to execute one iteration periodically at time t=2α. This overlapping can be applied to all the levels to yield the schedule Π _{2} shown in Fig. 3.
Now, the overlapping can be applied \(\mathcal{L}\) times on schedule Π _{1} to yield a schedule \(\varPi_{\mathcal{L}}\) as shown in Fig. 4.
Starting from time \(t = \mathcal{L}\alpha\), a schedule Π _{∞} can be constructed as shown in Fig. 5.
In schedule Π _{∞}, every actor v _{ i } is fired every λ _{ i } time-units once it starts. The start time defined in (26) guarantees that actors in a given level will start only when they have enough data to execute one iteration in a periodic way. The overlapping guarantees that once the actors have started, they will always find enough data for executing the next iteration since their predecessors have already executed one additional iteration. Thus, schedule Π _{∞} shows the existence of a periodic schedule of G where every actor v _{ j }∈V is strictly periodic with a period equal to λ _{ j }.
The next step is to prove that Π _{∞} executes with bounded memory buffers. In Π _{∞}, the largest delay in consuming the tokens occurs for a channel e _{ u }∈E connecting a level-1 actor and a level-\(\mathcal{L}\) actor. This is illustrated in Fig. 5 by observing that the data produced by iteration 1 of a level-1 source actor will be consumed by iteration 1 of a level-\(\mathcal{L}\) destination actor after \((\mathcal{L}-1) \alpha\) time-units. In this case, e _{ u } must be able to store at least \((\mathcal{L}-1) X^{u}_{1}(q_{1})\) tokens. However, starting from time \(t = \mathcal{L} \alpha\), both the level-1 and the level-\(\mathcal{L}\) actors execute in parallel. Thus, we increase the buffer size by \(X^{u}_{1}(q_{1})\) tokens to account for the overlapped execution. Hence, the total buffer size of e _{ u } is \(\mathcal{L} X^{u}_{1}(q_{1})\) tokens. Similarly, if a level-2 actor, denoted v _{2}, is connected directly to a level-\(\mathcal{L}\) actor via channel e _{ v }, then e _{ v } must be able to store at least \((\mathcal{L}-1) X^{v}_{2}(q_{2})\) tokens. By repeating this argument over all the different pairs of levels, it follows that each channel e _{ u }∈E, connecting a level-i source actor and a level-j destination actor, where j≥i, will store according to schedule Π _{∞} at most:
\((j - i + 1) X^{u}_{k}(q_{k})\) (27)
tokens, where v _{ k } is the level-i actor, and \(q_{k} \in \dot{\mathbf{q}}\). Thus, an upper bound on the FIFO sizes exists. □
Example 3
We illustrate Theorem 2 by constructing a periodic schedule for the CSDF graph shown in Fig. 1. Assume that the CSDF graph has an execution vector μ=[5,2,3,2]^{T}. Given \(\dot{\mathbf{q}}= [3, 3, 6, 4]^{T}\) as computed in Example 2, we use (19) to find λ ^{min}=[8,8,4,6]^{T}. Figure 6 illustrates the periodic schedule of the actors for the first graph iteration. \(\mathcal{L}= 3\) and the levels consist of three sets: A _{1}={v _{1}}, A _{2}={v _{2},v _{3}}, and A _{3}={v _{4}}. A _{1} actors start at time t=0. Since α=q _{ i } λ _{ i }=24 for any v _{ i } in the graph, A _{2} actors start at time t=α=24 and A _{3} actors start at time t=2α=48. Every actor v _{ j } in the graph executes for μ _{ j } timeunits every λ _{ j } timeunits. For example, actor v _{2} starts at time t=24 and executes for 2 timeunits every 8 timeunits.
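The numbers in this example can be reproduced with a short script. The following is a minimal sketch, assuming that (19) has the form λ _{ i }^{min}=(Q/q _{ i })⌈η/Q⌉ with Q=lcm(q _{1},…,q _{ N }) and η=max(μ _{ i } q _{ i }), which is consistent with the decomposition η=pQ+r used in the proof of Theorem 4. The function names `min_periods` and `level_start_times` are illustrative and not part of the framework.

```python
from math import lcm  # available in Python 3.9+


def min_periods(mu, q):
    """Minimum periods, assuming lambda_i = (Q / q_i) * ceil(eta / Q)."""
    Q = lcm(*q)                               # least common multiple of the repetitions
    eta = max(m * r for m, r in zip(mu, q))   # largest workload per graph iteration
    rounds = -(-eta // Q)                     # integer ceiling of eta / Q
    return [(Q // r) * rounds for r in q]


def level_start_times(levels, alpha):
    """Start times from the proof of Theorem 2: level k (1-based) starts at (k - 1) * alpha."""
    return {
        actor: (k - 1) * alpha
        for k, actors in enumerate(levels, start=1)
        for actor in actors
    }


mu = [5, 2, 3, 2]            # execution vector from Example 3
q = [3, 3, 6, 4]             # repetition vector from Example 2
periods = min_periods(mu, q)
alpha = q[0] * periods[0]    # iteration period, the same for every actor
starts = level_start_times([["v1"], ["v2", "v3"], ["v4"]], alpha)
print(periods, alpha, starts)
```

Under these assumptions the script yields the period vector, the iteration period α, and the level-based start times quoted in the example.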
Earliest start times and minimum buffer sizes
Now, we are interested in finding the earliest start times of the actors, and the minimum buffer sizes of the communication channels that guarantee the existence of a periodic schedule. Minimizing the start times and buffer sizes is crucial since it minimizes the initial response time and the memory requirements of the applications modeled as acyclic CSDF graphs.
Earliest start times
In the proof of Theorem 2, the notion of start time was introduced to denote when the actor is started on the system. The start time values used in the proof of the theorem were not the minimum ones. Here, we derive the earliest start times. We start with the following definitions:
Definition 11
(Cumulative production function)
The cumulative production function of actor v _{ i } producing into channel e _{ u } during the interval [t _{ s },t _{ e }), denoted by \(\mathsf{prd}_{[t_{s}, t_{e})} (v_{i},e_{u})\), is the sum of the number of tokens produced by v _{ i } into e _{ u } during the interval [t _{ s },t _{ e }).
In case of implicitdeadline periodic tasks, \(\mathsf{prd}_{[t_{s}, t_{e})}(v_{i},e_{u})\) is given by:
Similarly, we define the cumulative consumption function as follows:
Definition 12
(Cumulative consumption function)
The cumulative consumption function of actor v _{ i } consuming from channel e _{ u } over the interval [t _{ s },t _{ e }], denoted by \(\mathsf{cns}_{[t_{s}, t_{e}]}(v_{i},e_{u})\), is the sum of the number of tokens consumed by v _{ i } from e _{ u } during the interval [t _{ s },t _{ e }].
Similar to (28), \(\mathsf{cns}_{[t_{s}, t_{e}]} (v_{i},e_{u})\) is given by:
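As an illustration, the two cumulative functions can be sketched in code. The sketch below is not a transcription of (28) and (29). It assumes integer time, that the interval starts at the actor's start time, that the tokens of a firing are counted as produced only once the period of that firing has fully elapsed (its implicit deadline), and that a firing consumes all of its tokens at its release. The helper `cumulative` and the argument names are hypothetical.

```python
def cumulative(rates, n):
    """Sum of the first n entries of a cyclically repeating rate sequence."""
    if n <= 0:
        return 0
    full, rest = divmod(n, len(rates))
    return full * sum(rates) + sum(rates[:rest])


def prd(rates, period, t_s, t_e):
    """Tokens produced during [t_s, t_e) by an actor started at t_s.

    Assumption: a firing's tokens become visible at its deadline,
    i.e. after a whole period has elapsed.
    """
    if t_e < t_s:
        return 0
    return cumulative(rates, (t_e - t_s) // period)


def cns(rates, period, t_s, t_e):
    """Tokens consumed during [t_s, t_e] by an actor started at t_s.

    Assumption: a firing consumes all of its tokens at its release time.
    """
    if t_e < t_s:
        return 0
    return cumulative(rates, (t_e - t_s) // period + 1)


# Hypothetical producer with rate sequence [1, 0, 2] and period 8, started at t = 0.
print(prd([1, 0, 2], 8, 0, 7), prd([1, 0, 2], 8, 0, 8), prd([1, 0, 2], 8, 0, 24))
# Hypothetical consumer with rate sequence [1] and period 8, started at t = 8.
print(cns([1], 8, 8, 8), cns([1], 8, 8, 16))
```

Counting production late and consumption early is the pessimistic choice: any start time that passes a check built on these functions remains safe if a firing finishes before its deadline.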
Recall that prec(v _{ i }) is the predecessors set of actor v _{ i }, \(Y_{i}^{u}\) is the consumption sequence of an actor v _{ i } from channel e _{ u }, and α is the iteration period. Now, we give the following lemma:
Lemma 3
For a graph G, the earliest start time of an actor v _{ j }∈V, denoted by ϕ _{ j }, under a strictly periodic schedule is given by
where
Proof
Theorem 2 proved that starting a levelk actor v _{ j } at a start time
guarantees strictly periodic execution of the actor v _{ j }. Any start time later than that guarantees also strictly periodic execution since v _{ j } will always find enough data to execute in a strictly periodic way.
Equation (32) can be rewritten as:
The equivalence follows from observing that a levelk actor, where k>1, has a level(k−1) predecessor. Hence, applying (33) to a levelk actor, where k>1, yields:
Now, we are interested in starting v _{ j }∈A _{ k }, where k>1, earlier. That is:
ϕ _{ j } is also bounded from below, since an actor v _{ j } cannot start before the application is started. That is:
If we select ϕ _{ j } such that
then this guarantees that ϕ _{ j } also satisfies (35).
In (36), a valid start time candidate ϕ _{ i→j } must satisfy extra conditions to guarantee that the number of produced tokens on edge e _{ u }=(v _{ i },v _{ j }) at any time instant \(t \ge\hat{t}\) is greater than or equal to the number of consumed tokens at the same instant. To satisfy these extra conditions, we consider the following two possible cases:
Case I: \(\hat{t} \ge\phi_{i}\). This case is illustrated in Fig. 7. In this case, a valid start time candidate \(\hat{t}\) must satisfy:
Satisfying (37) guarantees that v _{ j } can fire at times \(t = \hat{t}, \hat{t} + \lambda_{j}, \ldots, \hat{t} + \alpha\). Thus, a valid value of \(\hat{t}\) guarantees that once v _{ j } is started, it always finds enough data to fire for one iteration. As a result, v _{ j } executes in a strictly periodic way.
Case II: \(\hat{t} < \phi_{i}\). This case is illustrated in Fig. 8. A valid start time candidate \(\hat{t}\) must satisfy:
This case occurs when v _{ j } consumes zero tokens during the interval \([\hat{t},\phi_{i}]\). This is valid behavior since the consumption rates sequence can contain zero elements. Since \(\hat{t} < \phi_{i}\), it is sufficient to check the cumulative production and consumption over the interval [ϕ _{ i },ϕ _{ i }+α] since by time t=ϕ _{ i }+α both v _{ i } and v _{ j } are guaranteed to have finished one iteration. Thus, \(\hat{t}\) also guarantees that once v _{ j } is started, it always finds enough data to fire. Hence, v _{ j } executes in a strictly periodic way.
Now, we can merge (37) and (38) which results in:
Any value of \(\hat{t}\) which satisfies (39) is a valid start time value that guarantees strictly periodic execution of v _{ j }. Since there might be multiple values of \(\hat{t}\) that satisfy (39), we take the minimum value because it is the earliest start time that guarantees strictly periodic execution of v _{ j }. □
Minimum buffer sizes
Lemma 4
For a graph G, the minimum bounded buffer size b _{ u } of a communication channel e _{ u }∈E connecting a source actor v _{ i } with start time ϕ _{ i }, and a destination actor v _{ j } with start time ϕ _{ j }, where v _{ i },v _{ j }∈V, under a strictly periodic schedule is given by
Proof
Equation (40) tracks the maximum cumulative number of unconsumed tokens in e _{ u } during one iteration for v _{ i } and v _{ j }. There are two cases:
Case I: ϕ _{ i }≤ϕ _{ j }. In this case, (40) tracks the maximum cumulative number of unconsumed tokens in e _{ u } during the time interval [ϕ _{ i },ϕ _{ j }+α). Figure 9 illustrates the execution timelines of v _{ i } and v _{ j } when ϕ _{ i }≤ϕ _{ j }. In interval A, v _{ i } is actively producing tokens while v _{ j } has not yet started executing. As a result, it is necessary to buffer all the tokens produced in this interval in order to prevent v _{ i } from blocking on writing. Thus, b _{ u } must be greater than or equal to \(\mathsf{prd}_{[\phi_{i}, \phi _{j})}(v_{i},e_{u})\). Starting from time t=ϕ _{ j }, both v _{ i } and v _{ j } execute in parallel (i.e., overlapped execution). In the proof of Theorem 2, an additional \(X^{u}_{i}(q_{i})\) tokens were added to the buffer size of e _{ u } to account for the overlapped execution. However, this value is a “worst-case” value. The minimum number of tokens that needs to be buffered is given by the maximum number of unconsumed tokens in e _{ u } at any time over the time interval [ϕ _{ j },ϕ _{ j }+α) (i.e., intervals B and C in Fig. 9). Taking the maximum number of unconsumed tokens guarantees that v _{ i } will always have enough space to write to e _{ u }. Thus, b _{ u } is sufficient and minimum for guaranteeing strictly periodic execution of v _{ i } and v _{ j } in the time interval [ϕ _{ i },ϕ _{ j }+α). At time t=ϕ _{ j }+α, both v _{ i } and v _{ j } have completed one iteration and the number of tokens in e _{ u } is the same as at time t=ϕ _{ j } (this follows from Corollary 1). Due to the strict periodicity of v _{ i } and v _{ j }, the pattern shown in Fig. 9 repeats. Thus, b _{ u } is also sufficient and minimum for any t≥ϕ _{ j }+α.
Case II: ϕ _{ i }>ϕ _{ j }. Figure 10 illustrates this case. According to Lemma 3, ϕ _{ j } can be smaller than ϕ _{ i } iff the destination actor v _{ j } consumes zero tokens in interval A. Therefore, the intervals in which there is actual production/consumption of tokens are B and C. During interval B, there is overlapped execution and b _{ u } gives the maximum number of unconsumed tokens in e _{ u } during [ϕ _{ i },ϕ _{ j }+α), which guarantees that v _{ i } always has enough space to write to e _{ u } and v _{ j } has enough data to consume from e _{ u }. At time t=ϕ _{ j }+α, v _{ j } finishes one iteration and interval C starts. During interval C, v _{ i } is producing data to e _{ u } while v _{ j } is consuming zero tokens. Therefore, e _{ u } has to accommodate all the tokens produced during interval C and b _{ u } must be greater than or equal to \(\mathsf{prd}_{[\phi_{j} + \alpha,\phi_{i} + \alpha]}(v_{i},e_{u})\). As in Case I, b _{ u } is sufficient and minimum for guaranteeing strictly periodic execution of v _{ i } and v _{ j } in the interval [ϕ _{ j },ϕ _{ i }+α]. At time t=ϕ _{ i }+α, both v _{ i } and v _{ j } have completed one iteration and e _{ u } contains a number of tokens equal to the number of tokens at time t=ϕ _{ i }. Due to the strict periodicity of v _{ i } and v _{ j }, their execution pattern repeats. Thus, b _{ u } is also sufficient and minimum for any t≥ϕ _{ i }+α.
□
Theorem 3
For a graph G, let τ _{ G } be a task set such that τ _{ i }∈τ _{ G } corresponds to v _{ i }∈V. τ _{ i } is given by:
where ϕ _{ i } is the earliest start time of v _{ i } given by (30), μ _{ i }∈μ, and λ _{ i }∈λ ^{min} is the period given by (19). τ _{ G } is schedulable on M processors using any hard-real-time scheduling algorithm \(\mathcal{A}\) for asynchronous sets of implicit-deadline periodic tasks if:
1. every edge e _{ u }∈E has a capacity of at least b _{ u } tokens, where b _{ u } is given by (40), and
2. τ _{ G } satisfies the schedulability test of \(\mathcal{A}\) on M processors.
Proof
Follows from Theorem 2, and Lemmas 3 and 4. □
Example 4
This is an example to illustrate Lemmas 3, 4, and Theorem 3. First, we calculate the earliest start times and the corresponding minimum buffer sizes for the CSDF graph shown in Fig. 1. Applying Lemmas 3 and 4 on the CSDF graph results in:
where ϕ _{ i } denotes the earliest start time of actor v _{ i }, and b _{ j } denotes the minimum buffer size of communication channel e _{ j }. Given μ and λ ^{min} computed in Example 3, we construct a task set τ _{ G }={(0,5,8),(8,2,8),(8,3,4),(20,2,6)}. We compute the minimum number of required processors to schedule τ _{ G } according to (9), (10), and (11):
τ _{ G } is schedulable using an optimal scheduling algorithm on 2 processors, and is schedulable using P-EDF on 3 processors.
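The two processor counts can be checked numerically. The sketch below assumes that (9) is ⌈U _{sum}⌉ and that (10) is ⌈((β+1)U _{sum}−1)/β⌉ with β=⌊1/U _{max}⌋; the latter reduces to ⌈2U _{sum}−1⌉ for β=1, which is the form quoted in the Discussion. The partition-based count of (11) is not covered, and the function name `processor_bounds` is hypothetical.

```python
from fractions import Fraction
from math import ceil, floor


def processor_bounds(tasks):
    """Return (M_OPT, M_PEDF) for tasks given as (start, execution time, period)."""
    utils = [Fraction(c, t) for _, c, t in tasks]   # exact utilizations C_i / T_i
    u_sum, u_max = sum(utils), max(utils)
    beta = floor(1 / u_max)
    m_opt = ceil(u_sum)                             # assumed form of (9)
    m_pedf = ceil(((beta + 1) * u_sum - 1) / beta)  # assumed form of (10)
    return m_opt, m_pedf


tau_G = [(0, 5, 8), (8, 2, 8), (8, 3, 4), (20, 2, 6)]  # task set of Example 4
m_opt, m_pedf = processor_bounds(tau_G)
print(m_opt, m_pedf)
```

Exact fractions are used instead of floating-point numbers so that a total utilization that is exactly an integer is not rounded up by accident.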
Throughput and latency analysis
Now, we analyze the throughput of the graph actors under strictly periodic scheduling and compare it with the maximum achievable throughput. We also present a formula to compute the latency for a given CSDF graph under strictly periodic scheduling. We start with the following definitions:
Definition 13
(Actor throughput)
For a graph G, the throughput of actor v _{ i }∈V under strictly periodic scheduling, denoted by R _{ SPS }(v _{ i }), is given by
Definition 14
(Rate-optimal strictly periodic schedule [22])
For a graph G, a strictly periodic schedule that delivers the same throughput as a self-timed schedule for all the actors is called a Rate-Optimal Strictly Periodic Schedule (ROSPS).
Now, we provide the following result.
Theorem 4
For a matched I/O rates graph G, the maximum achievable throughput of the graph actors under strictly periodic scheduling is equal to their maximum throughput under selftimed scheduling.
Proof
The maximum achievable throughput under strictly periodic scheduling is the one obtained when \(\lambda_{i} = \lambda_{i}^{\min}\). Recall from (19) that
Let us rewrite η as η=pQ+r, where p=η÷Q (÷ is the integer division operator), and r=η mod Q. Now, (43) can be rewritten as
Recall from (15) that
Now, recall from Definition 4 that a matched I/O rates graph satisfies the following condition:
Therefore, the maximum achievable throughput of the actors of a matched I/O rates graph under strictly periodic scheduling is:
□
Equation (44) shows that the throughput under SPS depends solely on the relationship between Q and η. Recall from Definition 3 that the execution time μ used by our framework is the maximum value over all the actual execution times of the actor. Therefore, if η mod Q=0, then R _{ SPS }(v _{ i }) is exactly the same as R _{ STS }(v _{ i }) for SDF graphs and for CSDF graphs where all the firings of an actor v _{ i } require the same actual execution time. If η mod Q≠0 and/or the actual execution time of the actor differs per firing, then R _{ SPS }(v _{ i }) is lower than R _{ STS }(v _{ i }). These findings illustrate that our framework has high potential since it allows the designer to determine analytically the type of the application (i.e., matched vs. mismatched) and, accordingly, to select the proper scheduler needed to deliver the maximum achievable throughput.
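The dependence on Q and η can be made concrete with a small check. The sketch below assumes that R _{ SPS }(v _{ i })=1/λ _{ i }^{min} with λ _{ i }^{min}=(Q/q _{ i })⌈η/Q⌉, that R _{ STS }(v _{ i })=q _{ i }/η, and that every firing of an actor takes its worst-case execution time. Under these assumptions the ratio of the two throughputs is η/(Q⌈η/Q⌉), the same for every actor. The function name and the second input vector are hypothetical.

```python
from fractions import Fraction
from math import lcm  # available in Python 3.9+


def sps_to_sts_ratio(mu, q):
    """Ratio R_SPS / R_STS under the assumptions stated in the lead-in."""
    Q = lcm(*q)
    eta = max(m * r for m, r in zip(mu, q))
    rounds = -(-eta // Q)              # integer ceiling of eta / Q
    return Fraction(eta, Q * rounds)   # equals 1 exactly when eta mod Q == 0


def is_matched(mu, q):
    """Matched I/O rates test in the sense used above: eta mod Q == 0."""
    return max(m * r for m, r in zip(mu, q)) % lcm(*q) == 0


print(is_matched([5, 2, 3, 2], [3, 3, 6, 4]), sps_to_sts_ratio([5, 2, 3, 2], [3, 3, 6, 4]))
print(is_matched([4, 2], [3, 6]), sps_to_sts_ratio([4, 2], [3, 6]))
```

The first call uses the vectors of Example 3; the second uses a made-up two-actor graph for which η is a multiple of Q, so the ratio is 1.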
Now, we prove the following result regarding matched I/O rates applications:
Corollary 2
For a matched I/O rates graph G scheduled using its minimum period vector λ ^{min}, U _{max}=1.
Proof
Recall from Sect. 3.2.1 that the utilization of a task τ _{ i } is defined as U _{ i }=C _{ i }/T _{ i }, where C _{ i }≤T _{ i }. Therefore, the maximum possible value for U _{ i } is when C _{ i }=T _{ i } which leads to U _{ i }=1.0. Now, let v _{ m } be the actor with the maximum product of actor execution time and repetition. That is
The period of v _{ m } is λ _{ m } given by
Now, let us write η as η=pQ+r, where p=η÷Q (÷ is the integer division operator), and r=η mod Q. Then, we can rewrite (48) as
For matched I/O rates applications, r=0 (see Definition 4). Therefore, (50) can be rewritten as
The utilization of v _{ m } is U _{ m } given by
Since r=0 and η=pQ=μ _{ m } q _{ m }, (52) becomes
□
Recall from Sect. 3.2.2 that β=⌊1/U _{max}⌋. It follows from Corollary 2 that β=1 for matched I/O rates applications scheduled using their minimum period vectors.
Let ϕ _{ i } be the earliest start time of an actor v _{ i }∈V. Then, according to Definitions 5 and 6, the graph latency L(G) is given by:
where ϕ _{ j } and ϕ _{ i } are the earliest start times of the output actor v _{ j } and the input actor v _{ i }, respectively, λ _{ j } and λ _{ i } are the periods of v _{ j } and v _{ i }, and \(g^{C}_{j}\) and \(g^{P}_{i}\) are two constants, such that for an output path w _{ i⇝j } in which e _{ r } is the first channel and e _{ u } is the last channel, \(g^{P}_{i}\) and \(g^{C}_{j}\) are given by:
where \(x_{i}^{r}\) and \(y_{j}^{u}\) are production/consumption rates sequences introduced in Sect. 3.
Handling sporadic input streams
In case the input streams are not strictly periodic, there are several techniques to accommodate the aperiodic nature of the streams. We present here some of these techniques.
Dejitter buffers
In case of periodic input streams with jitter, it is possible to use dejitter buffers to hide the effect of jitter. We assume that a jittery input stream I _{ i } starts at time t=t _{0} and has a constant interarrival time γ _{ i } equal to the input actor period (see Assumption 13) and jitter bounds \([\varepsilon_{i}^{-}, \varepsilon_{i}^{+}]\). The interpretation of the jitter bounds is that the kth sample of the stream is expected to arrive in the interval \([t_{0} + k\gamma_{i} - \varepsilon_{i}^{-}, t_{0} + k\gamma_{i} + \varepsilon_{i}^{+}]\). If a sample arrives in the interval \([t_{0} + k\gamma_{i} - \varepsilon_{i}^{-}, t_{0} + k\gamma_{i})\), then it is called an early sample. On the other hand, if the sample arrives in the interval \((t_{0} + k\gamma_{i}, t_{0} + k\gamma_{i} + \varepsilon_{i}^{+}]\), then it is called a late sample. Early samples do not affect the periodicity of the input actor, as they arrive prior to the actor release time. Late samples, however, pose a problem as they might arrive after an actor is released.
For late samples, it is possible to insert a buffer before each input actor v _{ i } receiving a jittery input stream I _{ j } to hide the effect of jitter. The buffer delays the delivery of the samples to the input actor by a certain amount of time, denoted by t _{buffer}(I _{ j }). t _{buffer}(I _{ j }) has to be computed such that once the input actor is started, it always finds data in the buffer. Assuming that \(\varepsilon_{i}^{-}\) and \(\varepsilon_{i}^{+} \in[0, \gamma_{i}]\), we can derive the minimum value for t _{buffer}(I _{ j }) and the minimum buffer size. To do so, we start by proving the following lemma:
Lemma 5
Let I _{ j } be a jittery input stream with \(\varepsilon_{i}^{-}, \varepsilon_{i}^{+} \in[0,\gamma_{i}]\). Then, the maximum interarrival time between any two consecutive samples in I _{ j }, denoted by t _{MIT}(I _{ j }), satisfies:
Proof
Based on the jitter model, t _{MIT} occurs when the kth sample is early by the maximum value of jitter (i.e., arrives at time t=kγ _{ i }−γ _{ i }), and the (k+1)th sample is late by the maximum value of jitter (i.e., arrives at time t=(k+1)γ _{ i }+γ _{ i }). This is illustrated in Fig. 11.
□
Lemma 6
An input actor v _{ i }∈V is guaranteed to always find an input sample in each of its input dejitter buffers if the following holds:
Proof
During a time interval (t,t+t _{MIT}(I _{ j })), v _{ i } can fire at most twice. Therefore, it is necessary to buffer up to 2 samples in order to guarantee that the input actor v _{ i } can continue firing periodically when the samples are separated by t _{MIT} timeunits. □
Lemma 7
Let v _{ i } be an input actor and I _{ j } be a jittery input stream to v _{ i }. Suppose that I _{ j } starts at time t=t _{0} and v _{ i } starts at time t=t _{0}+t _{buffer}(I _{ j }). The dejitter buffer must be able to hold at least 3 samples.
Proof
Suppose that the (k−1) and (k+1) samples arrive late and early, respectively, by the maximum amount of jitter. This means that they arrive at time t=t _{0}+kγ _{ i }. Now, suppose that the kth sample arrives with no jitter. This means that at time t=t _{0}+kγ _{ i } there are 3 samples arriving. Hence, the dejitter buffer must be able to store them. During the interval [t _{0}+kγ _{ i },t _{0}+(k+1)γ _{ i }), there are no incoming samples and v _{ i } processes the (k−1) sample. At time t=t _{0}+(k+1)γ _{ i }, the (k+2) sample might arrive which means that there are again 3 samples available to v _{ i }. By the periodicity of v _{ i } and I _{ j }, the previous pattern can repeat. □
The main advantage of the dejitter buffer approach is that the actors are still treated and scheduled as periodic tasks. However, the major disadvantage is the extra delay encountered by the input stream samples and the extra memory needed for the buffers.
Resource reservation
For sporadic streams in general, we can consider the actors as aperiodic tasks and apply techniques for aperiodic task scheduling from real-time scheduling theory [6]. One popular approach is based on using a server task to service the aperiodic tasks. Servers provide resource reservation guarantees and temporal isolation. Several servers have been proposed in the literature (e.g., [1, 27]). The advantages of using servers are the enforced isolation between the tasks and the ability to support arbitrary input streams. When using servers, we can schedule each actor using a server which has an execution budget C _{ s } equal to the actor's execution time and a period P _{ s } equal to the actor's period.
One particular issue when scheduling the actors using servers is how to generate the aperiodic task requests. For the CSDF model, the requests can be generated when the firing rule of an actor is evaluated as “true” (see Sect. 3). Detecting when the firing rule is evaluated as “true” can be done in the following ways:
1. The underlying operating system (OS) or scheduler has a monitoring mechanism which polls the buffers to detect when an actor has enough data to fire. Once it detects that an actor has enough data to fire, it releases an actor job.
2. The actor implementation is modified such that the polling happens within the actor. In this approach, an actor job is always released at the start of the actor period. When the actor is activated (i.e., running), it checks its input buffers for data. If enough data is available, then it executes its function. Otherwise, it exhausts its budget and waits until the next period. This mechanism is summarized in Fig. 12.
The first approach (i.e., polling by the OS) does not require modifications to the actors’ implementations. However, it requires an additional task which always checks all the buffers. This task can become a bottleneck if there are many channels. The second approach is better in terms of scalability and overhead. However, it might cause delays in the processing of the data.
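The second approach can be sketched as the body of a single actor job. This is only an illustration under simplifying assumptions: input buffers are modeled as in-memory FIFO queues, the scheduler and the server budget are not modeled, and giving up the budget is represented by returning without firing. The names `run_job`, `inputs`, `needed`, and `fire` are hypothetical.

```python
from collections import deque


def run_job(inputs, needed, fire):
    """Body of one actor job, released at the start of the actor's period.

    inputs : list of deques used as FIFO input buffers
    needed : number of tokens the current firing requires from each buffer
    fire   : the actor function, called with the consumed tokens
    Returns True if the actor fired, False if it gave up its budget.
    """
    if all(len(buf) >= n for buf, n in zip(inputs, needed)):
        # Firing rule holds: consume the tokens and execute the actor function.
        tokens = [[buf.popleft() for _ in range(n)] for buf, n in zip(inputs, needed)]
        fire(tokens)
        return True
    return False  # not enough data: yield and wait for the next period


buffers = [deque([1, 2]), deque()]
results = []
first = run_job(buffers, [2, 1], results.append)   # second buffer is still empty
buffers[1].append(3)
second = run_job(buffers, [2, 1], results.append)  # firing rule now holds
print(first, second, results)
```

The check is performed on all buffers before any token is removed, so a job that cannot fire leaves the buffers untouched for the next period.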
Evaluation results
We evaluate the framework proposed in Sect. 4 by performing an experiment on a set of 19 real-life streaming applications. The objective of the experiment is to compare the throughput of streaming applications when scheduled using our strictly periodic scheduling to their maximum achievable throughput obtained via self-timed scheduling. After that, we discuss the implications of our results from Sect. 4 and the throughput comparison experiment. For brevity, we refer in the remainder of this section to our strictly periodic scheduling/schedule as SPS and the self-timed scheduling/schedule as STS.
The streaming applications used in the experiment are real-life streaming applications coming from different domains (e.g., signal processing, communication, and multimedia). The benchmarks are described in detail in the next section.
Benchmarks
We collected the benchmarks from several sources. The first source is the StreamIt benchmark [30], which contributes 11 streaming applications. The second source is the SDF^{3} benchmark [29], which contributes 5 streaming applications. The third source is individual research articles which contain real-life CSDF graphs, such as [19, 24, 26]. In total, 19 applications are considered, as shown in Table 1. The graphs are a mixture of CSDF and SDF graphs. The actors' execution times of the StreamIt benchmark are specified by its authors in clock cycles measured on the MIT RAW architecture, while the actors' execution times of the SDF^{3} benchmark are specified for the ARM architecture. For the graphs from [24, 26], the authors do not mention the actors' execution times explicitly. As a result, we made assumptions regarding the execution times, which are reported below Table 1.
We use the SDF^{3} toolset [29] for several purposes during the experiments. SDF^{3} is a powerful analysis toolset which is capable of analyzing CSDF and SDF graphs to check for consistency errors, compute the repetition vector, compute the maximum achievable throughput, etc. SDF^{3} accepts the graphs in XML format. For StreamIt benchmarks, the StreamIt compiler is capable of exporting an SDF graph representation of the stream program. The exported graph is then converted into the XML format required by SDF^{3}. For the graphs from the research articles, we constructed the XML representation for the CSDF graphs manually.
Experiment: throughput and latency comparison
In this experiment, we compare the throughput and latency resulting from our SPS approach to the maximum achievable throughput and minimum achievable latency of a streaming application. Recall from Definition 7 that the maximum achievable throughput and minimum achievable latency of a streaming application modeled as a CSDF graph are the ones achieved under selftimed scheduling. In this experiment, we report the throughput for the output actors (i.e., the actors producing the output streams of the application, see Sect. 3). For latency, we report the graph maximum latency according to Definition 6. For SPS, we used the minimum period vector given by Lemma 2. The STS throughput and latency are computed using the SDF^{3} toolset. SDF^{3} defines R _{ STS }(G) as the graph throughput under STS, and R _{ STS }(v _{ i })=q _{ i } R _{ STS }(G) as the actor throughput. Similarly, L _{ STS }(G) denotes the graph latency under selftimed scheduling. We use the sdf3analysis tool from SDF^{3} to compute the throughput and latency for the STS with autoconcurrency disabled and assuming unbounded FIFO channel sizes. Computing the throughput is performed using the throughput algorithm, while latency is computed using the latency(min_st) algorithm.
Table 2 shows the results of comparing the throughput of the output actor for every application under both STS and SPS schedules. The most important column in the table is the last one, which shows the ratio of the SPS schedule throughput to the STS schedule throughput (R _{ SPS }(v _{out})/R _{ STS }(v _{out})), where v _{out} denotes the output actor. Our SPS delivers the same throughput as STS for 16 out of 19 applications. All these 16 applications are matched I/O rates applications. This result conforms with Theorem 4 proved in Sect. 4. Only three applications (CD2DAT(S,C) and Satellite) are mismatched and have lower throughput under our SPS. Table 2 also confirms the observation made by the authors in [30], who reported an interesting finding: neighboring actors often have matched I/O rates. This reduces the opportunity and impact of advanced scheduling strategies proposed in the literature. According to [30], these advanced scheduling strategies (e.g., [28]) are suitable for mismatched I/O rates applications. Looking at the results in Table 2, we see that our SPS approach performs very well for matched I/O applications.
Figure 13 shows the ratios of the SPS latency (denoted by L _{ SPS }(G)) to the STS latency. Over all the applications, the average SPS latency is 5× the STS latency. We also see that the mismatched applications have large latency, which conforms with their suboptimal throughput. If we exclude the mismatched applications, then the average SPS latency is 4× the STS latency. For latency-insensitive applications, this is acceptable as long as they can be scheduled using SPS to achieve the maximum achievable throughput. For latency-sensitive applications, the latency can be reduced by, for example, using the constrained-deadline model (see Sect. 3.2.1). The constrained-deadline model assigns to each task τ _{ i } a deadline D _{ i }<T _{ i }, where T _{ i } is the task period. For example, the Vocoder application has a ratio of L _{ SPS }(G)/L _{ STS }(G)≈13.5 under the implicit-deadline model. This ratio is reduced to 1.0 if the deadline of each task is set to its execution time. However, using the constrained-deadline model requires a different schedulability analysis. Therefore, a detailed treatment of how to reduce the latency is outside the scope of this paper.
Discussion
Suppose that an engineer wants to design an embedded MPSoC which will run a set of matched I/O rates streaming applications. How can the engineer easily determine the minimum number of processors needed to schedule the applications to deliver the maximum achievable throughput? Our SPS framework in Sect. 4 provides a very fast and accurate answer, thanks to Theorems 3 and 4. They allow easy computation of the minimum number of processors needed by different hard-real-time scheduling algorithms for periodic tasks to schedule any matched I/O streaming application, modeled as an acyclic CSDF graph, while guaranteeing the maximum achievable throughput. Figure 14 illustrates the ability to easily compute the minimum number of processors required to schedule the benchmarks in Table 1 using optimal and partitioned hard-real-time scheduling algorithms for asynchronous sets of implicit-deadline periodic tasks. For optimal algorithms, the minimum number of processors is denoted by M _{ OPT } and computed using (9). For partitioned algorithms, we choose the P-EDF algorithm combined with First-Fit (FF) allocation, abbreviated as P-EDF-FF. For P-EDF-FF, the minimum number of processors is computed using (10) (M _{ PEDF }) and (11) (M _{ PAR }). For matched I/O applications scheduled using the minimum periods obtained by Lemma 2, Corollary 2 shows that β defined in Sect. 3.2.2 is equal to 1. This implies that for matched I/O applications, M _{ PEDF }=⌈2U _{sum}−1⌉, which is approximately twice M _{ OPT } for large values of U _{sum}. M _{ PAR } provides lower resource usage than M _{ PEDF }, with the restriction that it is valid only for the specific task set τ _{ G } for which it is computed. Another task set \(\hat{\tau}_{G}\) with the same total utilization and maximum utilization factor as τ _{ G } may not be schedulable on M _{ PAR } processors due to partitioning issues.
Comparing M _{ PAR } to M _{ OPT }, we see that P-EDF-FF requires, in around 44 % of the cases, an average of 14 % more processors than an optimal algorithm due to bin-packing effects.
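The bin-packing effect can be seen with a small first-fit allocation sketch. It is not a transcription of (11): it assumes that tasks are considered in the given order and that a processor scheduled by EDF can accept implicit-deadline periodic tasks as long as its total utilization does not exceed 1. The function name `first_fit` and the utilization values are hypothetical.

```python
from fractions import Fraction


def first_fit(utilizations):
    """Place each task on the first processor whose total utilization stays <= 1."""
    loads = []
    for u in utilizations:
        for k, load in enumerate(loads):
            if load + u <= 1:      # uniprocessor EDF bound for implicit deadlines
                loads[k] = load + u
                break
        else:
            loads.append(u)        # no processor fits: open a new one
    return loads


example = [Fraction(1, 2), Fraction(2, 3), Fraction(1, 2), Fraction(1, 3)]
loads = first_fit(example)
print(len(loads), loads)

# Same total utilization, but the tasks cannot be packed as tightly.
tight = [Fraction(2, 3), Fraction(2, 3), Fraction(2, 3)]
print(len(first_fit(tight)))
```

Both task sets have a total utilization of 2, yet the second one needs three processors under this allocation, which is the kind of gap between partitioned and optimal scheduling discussed above.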
Unfortunately, such an easy computation of the minimum number of processors is not possible for STS. This is because the minimum number of processors required by STS, denoted by M _{ STS }, cannot be easily computed with equations such as (9), (10), and (11). Finding M _{ STS } in practice requires Design Space Exploration (DSE) procedures to find the best allocation which delivers the maximum achievable throughput. This shows one more advantage of using our SPS framework compared to using STS in cases where our SPS gives the same throughput as STS.
Conclusions
We prove that the actors of a streaming application, modeled as an acyclic CSDF graph, can be scheduled as periodic tasks. As a result, a variety of hard-real-time scheduling algorithms for periodic tasks can be applied to schedule such applications with a certain guaranteed throughput. We present an analytical framework for computing the periodic task parameters for the actors together with the minimum channel sizes such that a strictly periodic schedule exists. We also show how the proposed framework can handle sporadic input streams. We define formally a class of CSDF graphs called matched I/O rates applications which represents more than 80 % of streaming applications. We prove that strictly periodic scheduling is capable of delivering the maximum achievable throughput for matched I/O rates applications together with the ability to analytically determine the minimum number of processors needed to schedule the applications.
Notes
1. I.e., \(\gcd\{q_{1}, q_{2}, \ldots, q_{N}\} = 1\).
References
1. Abeni L, Buttazzo G (1998) Integrating multimedia applications in hard real-time systems. In: Proceedings of the 19th IEEE real-time systems symposium (RTSS), pp 4–13. doi:10.1109/REAL.1998.739726
2. Anderson JH, Srinivasan A (2001) Mixed Pfair/ERfair scheduling of asynchronous periodic tasks. In: Proceedings of the 13th Euromicro conference on real-time systems (ECRTS 2001), pp 76–85. doi:10.1109/EMRTS.2001.934004
3. Andersson B, Tovar E (2006) Multiprocessor scheduling with few preemptions. In: Proceedings of the 12th IEEE international conference on embedded and real-time computing systems and applications (RTCSA 2006), pp 322–334. doi:10.1109/RTCSA.2006.45
4. Bekooij M, Hoes R, Moreira O, Poplavko P, Pastrnak M, Mesman B, Mol J, Stuijk S, Gheorghita V, Meerbergen J (2005) Dataflow analysis for real-time embedded multiprocessor system design. In: Dynamic and robust streaming in and between connected consumer-electronic devices, vol 3. Springer, Amsterdam, pp 81–108. doi:10.1007/1402034547_4
5. Bilsen G, Engels M, Lauwereins R, Peperstraete J (1996) Cyclo-static dataflow. IEEE Trans Signal Process 44(2):397–408. doi:10.1109/78.485935
6. Buttazzo GC (2011) Hard real-time computing systems, 3rd edn. Springer, Berlin. doi:10.1007/9781461406761
7. Carpenter J, Funk S, Holman P, Srinivasan A, Anderson J, Baruah S (2004) A categorization of real-time multiprocessor scheduling problems and algorithms. In: Leung JYT (ed) Handbook of scheduling: algorithms, models, and performance analysis. CRC Press, Boca Raton. doi:10.1201/9780203489802.ch30
8. Cho H, Ravindran B, Jensen ED (2010) T-L plane-based real-time scheduling for homogeneous multiprocessors. J Parallel Distrib Comput 70(3):225–236. doi:10.1016/j.jpdc.2009.12.003
9. Davis RI, Burns A (2011) A survey of hard real-time scheduling for multiprocessor systems. ACM Comput Surv 43:35:1–35:44. doi:10.1145/1978802.1978814
10. Gerstlauer A, Haubelt C, Pimentel AD, Stefanov TP, Gajski DD, Teich J (2009) Electronic system-level synthesis methodologies. IEEE Trans Comput-Aided Des Integr Circuits Syst 28(10):1517–1530. doi:10.1109/TCAD.2009.2026356
11. Goddard S (1998) On the management of latency in the synthesis of real-time signal processing systems from processing graphs. PhD thesis, University of North Carolina at Chapel Hill
12. Jeffay K, Stanat D, Martel C (1991) On non-preemptive scheduling of periodic and sporadic tasks. In: Proceedings of the 12th real-time systems symposium (RTSS 1991), pp 129–139. doi:10.1109/REAL.1991.160366
13. Karam L, Al-Kamal I, Gatherer A, Frantz G, Anderson D, Evans B (2009) Trends in multicore DSP platforms. IEEE Signal Process Mag 26(6):38–49. doi:10.1109/MSP.2009.934113
14. Lee EA, Ha S (1989) Scheduling strategies for multiprocessor real-time DSP. In: IEEE global telecommunications conference and exhibition: communications technology for the 1990s and beyond (GLOBECOM 1989), vol 2, pp 1279–1283. doi:10.1109/GLOCOM.1989.64160
15. Lee EA, Messerschmitt DG (1987) Synchronous data flow. Proc IEEE 75(9):1235–1245. doi:10.1109/PROC.1987.13876
16. Levin G, Funk S, Sadowski C, Pye I, Brandt S (2010) DP-FAIR: a simple model for understanding optimal multiprocessor scheduling. In: Proceedings of the 22nd Euromicro conference on real-time systems (ECRTS 2010), pp 3–13. doi:10.1109/ECRTS.2010.34
17. López JM, Díaz JL, García DF (2004) Utilization bounds for EDF scheduling on real-time multiprocessor systems. Real-Time Syst 28:39–68. doi:10.1023/B:TIME.0000033378.56741.14
18. Martin G (2006) Overview of the MPSoC design challenge. In: Proceedings of the 43rd annual design automation conference (DAC 2006), pp 274–279. doi:10.1145/1146909.1146980
19. Moonen A, Bekooij M, van den Berg R, van Meerbergen J (2008) Cache aware mapping of streaming applications on a multiprocessor system-on-chip. In: Proceedings of the conference on design, automation and test in Europe (DATE 2008), pp 300–305. doi:10.1145/1403375.1403448
20. Moreira O, Mol JD, Bekooij M, van Meerbergen J (2005) Multiprocessor resource allocation for hard-real-time streaming with a dynamic job-mix. In: Proceedings of the 11th IEEE real time and embedded technology and applications symposium (RTAS 2005), pp 332–341. doi:10.1109/RTAS.2005.33
21. Moreira O, Valente F, Bekooij M (2007) Scheduling multiple independent hard-real-time jobs on a heterogeneous multiprocessor. In: Proceedings of the 7th ACM & IEEE international conference on embedded software (EMSOFT 2007), pp 57–66. doi:10.1145/1289927.1289941
22. Moreira OM, Bekooij MJG (2007) Self-timed scheduling analysis for real-time applications. EURASIP J Adv Signal Process 2007:1–15. doi:10.1155/2007/83710
23. Nollet V, Verkest D, Corporaal H (2010) A safari through the MPSoC runtime management jungle. Signal Process Syst 60:251–268. doi:10.1007/s1126500803054
24. Oh H, Ha S (2004) Fractional rate dataflow model for efficient code synthesis. J VLSI Signal Process 37:41–51. doi:10.1023/B:VLSI.0000017002.91721.0e
25. Parks T, Lee E (1995) Non-preemptive real-time scheduling of dataflow systems. In: Proceedings of the 1995 international conference on acoustics, speech, and signal processing (ICASSP 1995), vol 5, pp 3235–3238. doi:10.1109/ICASSP.1995.479574
26. Pellizzoni R, Meredith P, Nam MY, Sun M, Caccamo M, Sha L (2009) Handling mixed-criticality in SoC-based real-time embedded systems. In: Proceedings of the 7th ACM international conference on embedded software (EMSOFT 2009), pp 235–244. doi:10.1145/1629335.1629367
27. Sprunt B, Sha L, Lehoczky J (1989) Aperiodic task scheduling for hard-real-time systems. Real-Time Syst 1:27–60. doi:10.1007/BF02341920
28. Sriram S, Bhattacharyya SS (2009) Embedded multiprocessors: scheduling and synchronization, 2nd edn. CRC Press, Boca Raton. doi:10.1201/9781420048025
29. Stuijk S, Geilen M, Basten T (2006) SDF^{3}: SDF for free. In: Proceedings of the 6th international conference on application of concurrency to system design (ACSD 2006), pp 276–278. doi:10.1109/ACSD.2006.23
30. Thies W, Amarasinghe S (2010) An empirical characterization of stream programs and its implications for language and compiler design. In: Proceedings of the 19th international conference on parallel architectures and compilation techniques (PACT 2010), pp 365–376. doi:10.1145/1854273.1854319
Acknowledgements
This work is supported by the CATRENE/MEDEA+ 2A718 TSAR (Terascale multicore processor architecture) project. We would like to thank William Thies and Sander Stuijk for their support with the StreamIt and SDF^{3} benchmarks, respectively.
Open Access
This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Bamakhrama, M.A., Stefanov, T.P. On the hard-real-time scheduling of embedded streaming applications. Des Autom Embed Syst 17, 221–249 (2013). https://doi.org/10.1007/s106170129086x
Keywords
 Real-time multiprocessor scheduling
 Embedded streaming systems