Energy-Efficient Real-Time Scheduling for Two-Type Heterogeneous Multiprocessors

We propose three novel mathematical optimization formulations that solve the same two-type heterogeneous multiprocessor scheduling problem for a real-time taskset with hard constraints. Our formulations are based on a global scheduling scheme and a fluid model. The first formulation is a mixed-integer nonlinear program, since the scheduling problem is intuitively considered as an assignment problem. However, by changing the scheduling problem to first determine a task workload partition and then to find the execution order of all tasks, the computation time can be significantly reduced. Specifically, the workload partitioning problem can be formulated as a continuous nonlinear program for a system with continuous operating frequency, and as a continuous linear program for a practical system with a discrete speed level set. The task ordering problem can be solved by an algorithm with a complexity that is linear in the total number of tasks. The work is evaluated against existing global energy/feasibility optimal workload allocation formulations. The results illustrate that our algorithms are both feasibility optimal and energy optimal for both implicit and constrained deadline tasksets. Specifically, our algorithm can achieve up to 40% energy saving for some simulated tasksets with constrained deadlines. The benefit of our formulation compared with existing work is that our algorithms can solve a more general class of scheduling problems due to incorporating a scheduling dynamic model in the formulations and allowing for a time-varying speed profile. Moreover, our algorithms can be applied to both online and offline scheduling schemes.


INTRODUCTION
Efficient energy management has become an important issue for modern computing systems due to growing computational power demands in, e.g., sensor networks, satellites, multi-robot systems and personal electronic devices. There are two common schemes used in modern computing energy management systems. One is dynamic power management (DPM), where certain parts of the system are turned off during the processor idle state. The other is dynamic voltage and frequency scaling (DVFS), which reduces the energy consumption by exploiting the relation between the supply voltage and power consumption. In this work, we consider the problem of scheduling real-time tasks on heterogeneous multiprocessors under a DVFS scheme with the goal of minimizing energy consumption, while ensuring that both the execution cycle requirement and timeliness constraints of real-time tasks are satisfied.

Terminologies and Definitions
This section provides basic terminologies and definitions used throughout the paper.
Task $T_i$: An aperiodic task $T_i$ is defined as a triple $T_i := (c_i, d_i, b_i)$, where $c_i$ is the number of CPU cycles required to complete the task, $d_i$ is the task's relative deadline and $b_i$ is the arrival time of the task. A periodic task $T_i$ is defined as a triple $T_i := (c_i, d_i, p_i)$, where $p_i$ is the task's period. If the task's deadline is equal to its period, the task is said to have an 'implicit deadline'. The task is considered to have a 'constrained deadline' if its deadline is not larger than its period, i.e. $d_i \le p_i$. In the case that the task's deadline can be less than, equal to, or greater than its period, it is said to have an 'arbitrary deadline'. Throughout the paper, a task refers to the aperiodic task model unless stated otherwise, because a periodic task can be transformed into a collection of aperiodic tasks with appropriately defined arrival times and deadlines: the $j$th instance of a periodic task $T_i$, where $j \ge 1$, arrives at time $(j-1)p_i$, requires $c_i$ execution cycles and has an absolute deadline at time $(j-1)p_i + d_i$. Moreover, for a periodic taskset, we only need to find a valid schedule within its hyperperiod $L$, defined as the least common multiple (LCM) of all task periods; the total number of job instances of a periodic task $T_i$ during the hyperperiod $L$ is equal to $L/p_i$. The taskset is defined as the set of all tasks. The taskset is feasible if there exists a schedule such that no task in the taskset misses its deadline.
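The transformation of a periodic taskset into aperiodic job instances over one hyperperiod can be sketched as follows. This is only an illustration under the assumption that all task parameters are integers in a common time unit; the names `PeriodicTask`, `Job`, `hyperperiod` and `expand` are hypothetical and do not come from the paper.

```python
from math import lcm  # multi-argument lcm requires Python 3.9+
from typing import NamedTuple


class PeriodicTask(NamedTuple):
    c: int  # required CPU cycles
    d: int  # relative deadline
    p: int  # period


class Job(NamedTuple):
    c: int  # required CPU cycles
    d: int  # relative deadline
    b: int  # arrival time


def hyperperiod(tasks):
    """Least common multiple of all task periods."""
    return lcm(*(t.p for t in tasks))


def expand(tasks):
    """Expand each periodic task into its L / p_i aperiodic job instances.

    The j-th instance (j >= 1) arrives at (j - 1) * p_i and keeps the same
    cycle requirement and relative deadline as the periodic task.
    """
    L = hyperperiod(tasks)
    return [Job(t.c, t.d, (j - 1) * t.p)
            for t in tasks
            for j in range(1, L // t.p + 1)]
```

For two tasks with periods 4 and 6, the hyperperiod is 12, so the first task contributes three jobs and the second contributes two.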
Speed $s^r$: The operating speed $s^r$ is defined as the ratio between the operating frequency $f^r$ of a type-$r$ processor and the maximum system frequency $f_{\max}$, i.e. $s^r := f^r/f_{\max}$, where $f_{\max} := \max\{\max f^r \mid r \in R\}$ and $R := \{1, 2\}$.
Minimum Execution Time$^1$ $x_i$: The minimum execution time $x_i$ is the execution time of task $T_i$ when executed at the maximum system frequency $f_{\max}$, i.e. $x_i := c_i/f_{\max}$.
1. In the literature, this is often called 'worst-case execution time'. However, in the case where the speed is allowed to vary, using the term 'minimum execution time' makes more sense, since the execution time increases as the speed is scaled down. For simplicity of exposition, we also assume no uncertainty, hence 'worst-case' is not applicable here. Extensions to uncertainty should be relatively straightforward, in which case $x_i$ then becomes 'minimum worst-case execution time'.
Task Density$^2$ $\delta_i(s_i)$: For a periodic task, the task density $\delta_i(s_i)$ is defined as the ratio between the task execution time and the minimum of its deadline and its period, i.e. $\delta_i(s_i) := c_i/(s_i f_{\max} \min\{d_i, p_i\})$, where $s_i$ is the task execution speed.
Taskset Density $D(s_i)$: The taskset density $D(s_i)$ of a periodic taskset is defined as the sum of all task densities in the taskset, i.e. $D(s_i) := \sum_{i=1}^{n} \delta_i(s_i)$. The minimum taskset density $D$ is given by $D := \sum_{i=1}^{n} \delta_i(1)$.
System Capacity $C$: The system capacity $C$ is defined as $C := \sum_{r \in R} s^r_{\max} m_r$, where $s^r_{\max}$ is the maximum speed of type-$r$ processors, i.e. $s^r_{\max} := f^r_{\max}/f_{\max}$ with $f^r_{\max} := \max f^r$, and $m_r$ is the total number of processors of type-$r$.
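As a small numerical illustration of the density and capacity definitions above, the sketch below evaluates $\delta_i$, the minimum taskset density and $C$ for given parameters. The function names and the tuple layouts `(c, d, p)` and `(f_r_max, m_r)` are assumptions made for this example only.

```python
def task_density(c, d, p, s, f_max):
    """delta_i(s_i) = c_i / (s_i * f_max * min(d_i, p_i))."""
    return c / (s * f_max * min(d, p))


def min_taskset_density(tasks, f_max):
    """Sum of task densities with every task run at full speed (s = 1).

    tasks: iterable of (c, d, p) tuples (hypothetical layout).
    """
    return sum(task_density(c, d, p, 1.0, f_max) for c, d, p in tasks)


def system_capacity(clusters, f_max):
    """C = sum over processor types of s^r_max * m_r.

    clusters: iterable of (f_r_max, m_r) tuples (hypothetical layout).
    """
    return sum((f_r_max / f_max) * m_r for f_r_max, m_r in clusters)
```

With `f_max = 2.0`, the tasks `(2, 4, 4)` and `(3, 2, 6)` have densities 0.25 and 0.75, and a platform with two processors at maximum frequency 2.0 plus four at maximum frequency 1.0 has capacity 4.0.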
Migration Scheme: A global scheduling scheme allows task migration between processors and a partitioned scheduling scheme does not allow task migration.
Feasibility Optimal: An algorithm is feasibility optimal if the algorithm is guaranteed to be able to construct a valid schedule such that no deadlines are missed, provided a schedule exists.
Energy Optimal: An algorithm is energy optimal when it is guaranteed to find a schedule that minimizes the energy, while meeting the deadlines, provided such a schedule exists.
Step Function: A function $f : X \to \mathbb{R}$ is a step (also called a piecewise constant) function, denoted $f \in PC$, if there exists a finite partition $\{X_1, \ldots, X_p\}$ of $X \subseteq \mathbb{R}$ and a set of real numbers $\{\phi_1, \ldots, \phi_p\}$ such that $f(x) = \phi_i$ for all $x \in X_i$, $i \in \{1, \ldots, p\}$.

Related Work
Due to the heterogeneity of the processors, one should not only consider the different operating frequency sets among processors, but also the hardware architecture of the processors, since task execution time will be different for each processor type. In other words, the system has to be captured by two aspects: the difference in operating speed sets and the execution cycles required by different tasks on different processor types.
With these aspects, full-migration (global) scheduling algorithms, where tasks are allowed to migrate between different processor types, are not applicable in practice, since it is difficult to identify how much computational work is executed on one processor type compared to another due to differences in instruction sets, register formats, etc. Thus, most of the work on heterogeneous multiprocessor scheduling consists of partition-based/non-preemptive task scheduling algorithms [1]-[7], i.e. tasks are partitioned onto one of the processor types and a well-known uniprocessor scheduling algorithm, such as Earliest Deadline First (EDF) [8], is used to find a valid schedule. With this scheme, the heterogeneous multiprocessor scheduling problem is reduced to a task partitioning problem, which can be formulated as an integer linear program (ILP). Examples of such work are [1] and [5].
However, with the advent of ARM two-type heterogeneous multicore architectures, such as the big.LITTLE architecture [9], which support task migration among different core types, a global scheduling algorithm is possible. In [10], [11], the first energy-aware global scheduling framework for this special architecture is presented, where an algorithm called Hetero-Split is proposed to solve the workload assignment problem and an algorithm called Hetero-Wrap to solve the schedule generation problem. Their framework is similar to ours, except that we adopt a fluid model to represent the scheduling dynamic, our assigned operating frequency is time-varying and the CPU idle energy consumption is also considered.
2. When all tasks are assumed to have implicit deadlines, this is often called 'task utilization'.
A fluid model is the ideal schedule path of a real-time task. The remaining execution time is represented by a straight line where the slope of the line is the task execution speed. However, a practical task execution path is nonlinear, since a task may be preempted by other tasks. The execution interval of a task is represented by a line with a negative slope and a non-execution interval is represented by a line with zero slope.
There are at least two well-known homogeneous multiprocessor scheduling algorithms that are based on a fluid scheduling model: Proportionate-fair (Pfair) [12] and Largest Local Remaining Execution Time First (LLREF) [13]. Both Pfair and LLREF are global scheduling algorithms. By introducing the notion of fairness, Pfair ensures that at any instant no task is one or more quanta (time intervals) away from the task's fluid path. However, the Pfair algorithm suffers from a significant run-time overhead, because tasks are split into several segments, incurring frequent algorithm invocations and task migrations. To overcome the disadvantages of quantum-based scheduling algorithms, the LLREF algorithm splits/preempts a task at two scheduling events within each time interval [13]. One occurs when the remaining time of an executing task is zero and it is better to select another task to run. The other event happens when the task has no laxity, i.e. the difference between the task deadline and the remaining execution time left is zero, hence the task needs to be selected immediately in order to finish the remaining workload in time.
The unified theory of the deadline partitioning technique and its feasibility optimal versions, called DP-FAIR, for periodic and sporadic tasks are given in [14]. Deadline Partitioning (DP) [14] is the technique that partitions time into intervals bounded by two successive task deadlines, after which each task is allocated a workload and is scheduled within each time interval. A simple optimal scheduling algorithm based on DP-FAIR, called DP-WRAP, was presented in [14]. The DP-WRAP algorithm partitions time according to the DP technique and, within each time interval, the tasks are scheduled using McNaughton's wrap around algorithm [15]. McNaughton's wrap around algorithm aligns all task workloads along a real number line, starting at zero, then splits the line into chunks of length 1 and assigns each chunk to its own processor. Note that a task that has been split across two chunks migrates between the two processors to which those chunks are assigned. The work of [14] was extended in [16], [17] by incorporating a DVFS scheme to reduce power consumption.
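The wrap around idea described above can be illustrated with a minimal sketch. It assumes that workloads are given as fractions of one time interval (each at most 1) and uses exact rational arithmetic; the function name `mcnaughton_wrap` and the output layout are choices made for this example, not taken from [15].

```python
from fractions import Fraction


def mcnaughton_wrap(workloads, m):
    """Lay workloads end to end on [0, m] and cut the line at integers.

    workloads: list of (task, w), with w the fraction of the interval
               that the task must execute (0 <= w <= 1).
    m:         number of identical processors.
    Returns (task, processor, start, end) pieces, where start and end
    are fractions of the interval on that processor.
    """
    loads = [(task, Fraction(w)) for task, w in workloads]
    if any(w < 0 or w > 1 for _, w in loads) or sum(w for _, w in loads) > m:
        raise ValueError("workloads do not fit on m processors")
    schedule, pos = [], Fraction(0)
    for task, w in loads:
        end = pos + w
        while pos < end:
            k = int(pos)                     # chunk [k, k + 1) is processor k
            cut = min(end, Fraction(k + 1))  # split at the chunk boundary
            schedule.append((task, k, pos - k, cut - k))
            pos = cut
    return schedule
```

In this sketch a task that crosses a chunk boundary appears as two pieces, at the end of the interval on one processor and at its start on the next; since its workload is at most 1, the two pieces do not overlap in time.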
However, although the algorithms that are based on the notion of fairness [13], [14], [16]-[19] are feasibility optimal, they have hardly been applied in real systems, since they suffer from high scheduling overheads, i.e. task preemptions and migrations. Recently, two feasibility optimal algorithms that are not based on the notion of fairness have been proposed. One is the RUN algorithm [20], which uses a dualization technique to reduce the multiprocessor scheduling problem to a series of uniprocessor scheduling problems. The other is U-EDF [21], which generalises the earliest deadline first (EDF) algorithm to multiprocessors by reducing the problem to EDF on a uniprocessor.
As an alternative to the above methods, the multiprocessor scheduling problem can also be formulated as an optimization problem. However, since the problem is NP-hard [22] in general, an approximate polynomial-time heuristic method is often used. Examples of these approaches can be found in [23], [24], which consider energy-aware multiprocessor scheduling with probabilistic task execution times. The tasks are partitioned among the set of processors, after which the running frequency is computed based on the task execution time probabilities. Among all feasible assignments, an energy-optimal assignment is chosen by solving a mathematical optimization problem, where the objective is to minimize some energy function. The constraints ensure that all tasks will meet their deadlines and that only one processor is assigned to a task. In partitioned scheduling algorithms, such as [23], [24], once a task is assigned to a specific processor, the multiprocessor scheduling problem is reduced to a set of uniprocessor scheduling problems, which are well studied [25]. However, a partitioned scheduling method cannot, in general, provide an optimal schedule.

Contribution
The main contributions of this work are:
• The formulation of a real-time multiprocessor scheduling problem as an infinite-dimensional continuous-time optimal control problem.
• Three mathematical programming formulations to solve a hard real-time task scheduling problem on heterogeneous multiprocessor systems with DVFS capabilities are proposed.
• We provide a generalised optimal speed profile solution to a uniprocessor scheduling problem with a real-time taskset.
• Our work is a multiprocessor scheduling algorithm that is both feasibility optimal and energy optimal.
• Our formulations are capable of solving a multiprocessor scheduling problem with any periodic tasksets as well as aperiodic tasksets, in contrast to existing work, due to the incorporation of a scheduling dynamic and a time-varying speed profile.
• The proposed algorithms can be applied to both an online scheduling scheme, where the characteristics of the taskset are not known until the time of execution, and an offline scheduling scheme, where the taskset information is known a priori.
• Moreover, the proposed formulations can also be extended to a multicore architecture that only allows the frequency to be changed at the cluster level, rather than at the core level, as explained in Section 2.3.

Outline
This paper is organized as follows: Section 2 defines our feasibility scheduling problem in detail. Details on solving the scheduling problem with finite-dimensional mathematical optimization are given in Section 3. The optimality problem formulations are presented in Section 4. The simulation setup and results are presented in Section 5. Finally, conclusions and future work are discussed in Section 6.

FEASIBILITY PROBLEM FORMULATION
Though our objective is to minimize the total energy consumption, we will first consider a feasibility problem before presenting an optimality problem.

System model
We consider a set of $n$ real-time tasks that are to be partitioned on a two-type heterogeneous multiprocessor system composed of $m_r$ processors of type-$r$, $r \in R$. We will assume that the system supports task migration among processor types, e.g. sharing the same instruction set and having a special interconnection for data transfer between processor types. Note that $c_i$ is the same for all processor types, since the instruction set is the same.

Task/Processor Assumptions
Tasks do not share resources, do not have any precedence constraints and are ready to start at the beginning of the execution. A task can be preempted/migrated between different processor types at any time. The cost of preemption and migration is assumed to be negligible or included in the minimum task execution times. Processors of the same type are homogeneous, i.e. they have the same set of operating frequencies and power consumptions. Each processor's voltage/speed can be adjusted individually. Additionally, for an ideal system, a processor is assumed to have a continuous speed range. For a practical system, a processor is assumed to have a finite set of operating speed levels.

Scheduling as an Optimal Control Problem
Below, we will refer to the sets $I := \{1, \ldots, n\}$, $K_r := \{1, \ldots, m_r\}$ and $\Gamma := [0, L]$, where $L$ is the largest deadline of all tasks. Note that $\forall i$, $\forall k$, $\forall r$, $\forall t$ are short-hand notations for $\forall i \in I$, $\forall k \in K_r$, $\forall r \in R$, $\forall t \in \Gamma$, respectively. The scheduling problem can therefore be formulated as the following infinite-dimensional continuous-time optimal control problem: find $x_i(\cdot)$, $a^r_{ik}(\cdot)$, $s^r_k(\cdot)$, $\forall i \in I, k \in K_r, r \in R$, subject to constraints (1a)-(1h), where the state $x_i(t)$ is the remaining minimum execution time of task $T_i$ at time $t$, the control input $s^r_k(t)$ is the execution speed of the $k$th processor of type-$r$ at time $t$ and the control input $a^r_{ik}(t)$ indicates the processor assignment of task $T_i$ at time $t$, i.e. $a^r_{ik}(t) = 1$ if and only if task $T_i$ is active on processor $k$ of type-$r$. Notice that here we formulated the problem with speed selection at the core level; the stricter assumption of a multicore architecture with cluster-level speed assignment is handled straightforwardly by replacing the core-level speed assignment $s^r_k$ with a cluster-level speed assignment $s^r$ in the above formulation.
The initial conditions on the minimum execution time of all tasks and the task deadline constraints are specified in (1a) and (1b), respectively. The fluid model of the scheduling dynamic is given by the differential constraint (1c). Constraint (1d) ensures that each task will be assigned to at most one non-idle processor at a time. Constraint (1e) guarantees that each non-idle processor will only be assigned to at most one task at a time. The speeds are constrained by (1f) to take on values from $S^r \subseteq [0, 1]$. Constraint (1g) emphasizes that task assignment variables are binary. Lastly, (1h) denotes that the control inputs should be step functions.
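For orientation, one way to write constraints (1a)-(1h) that is consistent with the constraint-by-constraint description above is sketched below. This is a sketch under assumptions rather than a transcription: the underline in $\underline{x}_i$ (used to separate the constant minimum execution time from the state $x_i(t)$), the exact form of the deadline condition in (1b) and the direction of the inequality in (1c) are all assumptions, the last one chosen so that the remaining execution time cannot decrease faster than the assigned processing rate.

```latex
\begin{align*}
& x_i(b_i) = \underline{x}_i, && \forall i \tag{1a}\\
& x_i(b_i + d_i) = 0, && \forall i \tag{1b}\\  % assumed form of the deadline condition
& \dot{x}_i(t) \ge -\sum_{r \in R} \sum_{k \in K_r} a^r_{ik}(t)\, s^r_k(t), && \forall i, t \tag{1c}\\  % assumed inequality direction
& \sum_{r \in R} \sum_{k \in K_r} a^r_{ik}(t) \le 1, && \forall i, t \tag{1d}\\
& \sum_{i \in I} a^r_{ik}(t) \le 1, && \forall k, r, t \tag{1e}\\
& s^r_k(t) \in S^r, && \forall k, r, t \tag{1f}\\
& a^r_{ik}(t) \in \{0, 1\}, && \forall i, k, r, t \tag{1g}\\
& a^r_{ik}(\cdot),\ s^r_k(\cdot) \in PC, && \forall i, k, r \tag{1h}
\end{align*}
```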

Fact 1. A solution to (1) where (1c) is satisfied with equality can be constructed from a solution to (1).
Proof: Let $(a, s, x)$ be a feasible point of (1). Let [...], $\forall i, k, r, t$. It follows that $(\tilde{a}, s, x)$ is a solution to (1) where (1c) is an equality.

SOLVING THE SCHEDULING PROBLEM WITH FINITE-DIMENSIONAL MATHEMATICAL OPTIMIZA-TION
The original problem (1) will be discretized by introducing piecewise constant constraints on the control inputs $s$ and $a$. Let $\mathcal{T} := \{\tau_0, \tau_1, \ldots, \tau_N\}$, which we will refer to as the major grid, denote the set of discretization time steps corresponding to the distinct arrival times and deadlines of all tasks within $L$, where $0 = \tau_0 < \tau_1 < \tau_2 < \cdots < \tau_N = L$.
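Constructing the major grid amounts to collecting the distinct arrival times and absolute deadlines and sorting them. The following sketch assumes jobs are given as `(b, d)` pairs of arrival time and relative deadline, and that the earliest arrival is at time 0; the function names are hypothetical.

```python
def major_grid(jobs):
    """Sorted distinct arrival times b_i and absolute deadlines b_i + d_i.

    jobs: iterable of (b, d) pairs (hypothetical layout).
    """
    return sorted({t for b, d in jobs for t in (b, b + d)})


def interval_lengths(grid):
    """Lengths tau_{mu+1} - tau_mu of consecutive major-grid intervals."""
    return [t_next - t for t, t_next in zip(grid, grid[1:])]
```

Because duplicates are removed before sorting, every interval length is strictly positive.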

Mixed-Integer Nonlinear Program (MINLP-DVFS)
The above scheduling problem, subject to piecewise constant constraints on the control inputs, can be naturally formulated as an MINLP, defined below. Since the context switches due to task preemption and migration can jeopardize the performance, a variable discretization time step method [26] is applied on a minor grid, so that the solution to our scheduling problem does not depend on the size of the discretization time step. Let $\{\tau_{\mu,0}, \ldots, \tau_{\mu,M}\}$ denote the set of discretization time steps on a minor grid on the interval $[\tau_\mu, \tau_{\mu+1}]$, where $\{\tau_{\mu,1}, \ldots, \tau_{\mu,M-1}\}$ is to be determined for all $\mu$ by solving an appropriately defined optimization problem.
Let $\Lambda$ denote the set of all tasks within $L$, i.e. $\Lambda := \{T_i \mid i \in I\}$, and let $\forall \mu_i$ be short notation for $\forall \mu \in U_i$.
By solving a first-order ODE with piecewise constant input, a solution of the scheduling dynamic (1c) has to satisfy the difference constraint (4a). The discretization of the original problem (1) subject to piecewise constant constraints on the inputs (3) is therefore equivalent to a finite-dimensional MINLP (4) in the discretized variables, $\forall i \in I, k \in K_r, r \in R$, subject to (4a) and the remaining constraints of (4), where (4h)-(4i) enforce upper and lower bounds on discretization time steps.
[...]
Proof: Follows from the fact that if a solution exists to (1), then the Hetero-Wrap scheduling algorithm [11] can find a valid schedule with at most $m_r - 1$ migrations within the cluster [11, Lemma 2].

Computationally Tractable Multiprocessor Scheduling Algorithms
The time to compute a solution to problem (4) is impractical even for a small problem size. However, if we relax the binary constraints in (4g) so that the value of $a$ can be interpreted as the percentage of a time interval during which the task is executed (this will be denoted as $\omega$ in later formulations), rather than the processor assignment, the problem can be reformulated as an NLP for a system with continuous operating speed and an LP for a system with discrete speed levels. The NLP and LP can be solved in a fraction of the time taken to solve the MINLP above. In particular, the heterogeneous multiprocessor scheduling problem can be simplified into two steps:

STEP 1: Workload Partitioning
Determine the percentage of task execution times and execution speed within a time interval such that the feasibility constraints are satisfied.

STEP 2: Task Ordering
From the solution given in the workload partitioning step, find the execution order of all tasks within a time interval such that no task will be executed on more than one processor at a time.

Solving the Workload Partitioning Problem as a Continuous Nonlinear Program (NLP-DVFS)
Since knowing the processor on which a task will be executed does not help in finding the task execution order, the corresponding processor assignment subscript $k$ of the control variables $\omega$ and $s$ is dropped to reduce the number of decision variables. Moreover, partitioning time using only a major grid (i.e. $M = 1$) is enough to guarantee a valid solution, i.e. the percentage of the task execution time within a major grid interval is equal to the sum of all percentages of task execution times on the minor grid. Since we only need a major grid, we define the notation $[\mu] := \tau_\mu$ and $h[\mu] := \tau_{\mu+1} - \tau_\mu$. Note that we assume $h[\mu] > 0, \forall \mu$. We also assume that the set of allowable speed levels $S^r$ is a closed interval given by the lower bound $s^r_{\min}$ and upper bound $s^r_{\max}$.
Consider now the finite-dimensional NLP (6), where $\omega^r_i[\mu]$ is defined as the percentage of the time interval $[\tau_\mu, \tau_{\mu+1}]$ for which task $T_i$ is executing on a processor of type-$r$ at speed $s^r_i[\mu]$. Constraint (6d) guarantees that a task will not run on more than one processor at a time. The constraint that the total workload at each time interval should be less than or equal to the system capacity is specified in (6e). Upper and lower bounds on task execution speed and percentage of task execution time are given in (6f) and (6g), respectively.
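The per-interval constraints described above can be checked mechanically for a candidate workload partition. The sketch below encodes one reading of (6d)-(6g): (6d) as $\sum_r \omega^r_i \le 1$ for each task and (6e) as $\sum_i \omega^r_i \le m_r$ for each processor type. Both readings, the function name and the dictionary-of-lists data layout are assumptions made for illustration. The sketch does not check the initial, deadline and difference constraints of (6).

```python
def violated_constraints(omega, speed, m, s_min, s_max, tol=1e-9):
    """Return labels of violated constraints for one major-grid interval.

    omega[r][i]: fraction of the interval task i runs on type-r processors
    speed[r][i]: speed assigned to task i on type-r processors
    m[r]:        number of type-r processors
    s_min[r], s_max[r]: speed bounds of type-r processors
    """
    types = list(m)
    tasks = range(len(omega[types[0]]))
    bad = []
    for i in tasks:
        # assumed (6d): a task cannot run for more than the whole interval
        if sum(omega[r][i] for r in types) > 1 + tol:
            bad.append(("6d", i))
    for r in types:
        # assumed (6e): total workload on a cluster is at most its m_r slots
        if sum(omega[r]) > m[r] + tol:
            bad.append(("6e", r))
        for i in tasks:
            if not (s_min[r] - tol <= speed[r][i] <= s_max[r] + tol):
                bad.append(("6f", r, i))
            if not (-tol <= omega[r][i] <= 1 + tol):
                bad.append(("6g", r, i))
    return bad
```

An empty return value means that none of the four checked constraint groups is violated for that interval.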

Solving the Workload Partitioning Problem as a Linear Program (LP-DVFS)
The problem (6) can be further simplified to an LP if the set of speed levels $S^r$ is finite, as is often the case for practical systems. We denote by $s^r_q$ the execution speed at level $q \in Q^r := \{1, \ldots, l_r\}$ of an $r$-type processor, where $l_r$ is the total number of speed levels of an $r$-type processor. Let $\forall q$ be short-hand for $\forall q \in Q^r$.
Consider now the finite-dimensional LP (7), where $\omega^r_{iq}[\mu]$ is the percentage of the time interval $[\tau_\mu, \tau_{\mu+1}]$ for which task $T_i$ is executing on a processor of type-$r$ at a speed level $q$. Note that all constraints are similar to (6), but the speed levels are fixed.
Theorem 3. A solution to (6) can be constructed from a solution to (7), and vice versa, if the discrete speed set $S^r$ is any finite subset of the closed interval $[s^r_{\min}, s^r_{\max}]$ with $s^r_{\min}$ and $s^r_{\max}$ in $S^r$ for all $r$.
Proof: Let $(x, \omega, s)$ denote a solution to (6) and $(x, \omega)$ a solution to (7

Task Ordering Algorithm
This section discusses how to find a valid schedule in the task ordering step for each time interval $[\tau_\mu, \tau_{\mu+1}]$. Since the solution obtained in the workload partitioning step gives the workload of each task on each processor type within each time interval, one might think of using McNaughton's wrap around algorithm [15] to find a valid schedule for each processor within the processor type. However, McNaughton's wrap around algorithm only guarantees that a task will not be executed on two processors at the same time within one cluster. There remains the possibility that a task will be assigned to more than one processor type (cluster) at the same time.
To avoid parallel execution on the two clusters, we can adopt the Hetero-Wrap algorithm proposed in [11] to solve the task ordering problem of a two-type heterogeneous multiprocessor platform. The algorithm takes the workload partitioning solution to STEP 1 as its input and returns $(\sigma^r_{ik}, \eta^r_{ik}) \in [0, 1]^2$, $\forall i, k, r$, which is a task-to-processor interval assignment on each cluster. Note that, for a solution to problem (7), we define the total execution workload of a task as $\omega^r_i := \sum_q \omega^r_{iq}$ and assume that the percentages of execution time of each task at all frequency levels $\omega^r_{iq}$ will be grouped together in order to minimize the number of migrations and preemptions. In order to be self-contained, the Hetero-Wrap algorithm is given in Algorithm 1. Specifically, the algorithm classifies the tasks into four subsets: (i) a set $IM_a$ of migrating tasks with $\sum_r \omega^r_i = 1$, (ii) a set $IM_b$ of migrating tasks with $\sum_r \omega^r_i < 1$, (iii) a set $CP_1$ of partitioned tasks on the type-1 cluster, and (iv) a set $CP_2$ of partitioned tasks on the type-2 cluster. The algorithm then employs the following simple rules:
• For a type-1 cluster, tasks are scheduled in the order of $IM_a$, $IM_b$ and $CP_1$ using McNaughton's wrap around algorithm. That is, a slot along the number line is allocated, starting at zero, with length equal to $m_1$, and each task is aligned with its assigned workload on empty slots of the cluster in the specified order, starting from left to right.
• For a type-2 cluster, in the same manner, tasks are scheduled using McNaughton's wrap around algorithm, but in the order of $IM_a$, $IM_b$ and $CP_2$, starting from right to left. Note that the order of tasks in $IM_a$ has to be consistent with the order in the type-1 cluster.
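The two rules above can be sketched in a few lines. This follows the prose description only and is not a transcription of Algorithm 1 from [11]. It assumes workloads are exact rationals (pass `Fraction` or string values, since binary floats may make a sum that should equal 1 fall just below it), that at most one task falls in $IM_b$, and that the per-cluster totals fit on $m_1$ and $m_2$ processors. The names `wrap` and `hetero_wrap` are hypothetical.

```python
from fractions import Fraction


def wrap(loads, m, right_to_left=False):
    """Wrap (task, workload) pairs onto m unit slots; returns
    (task, processor, start, end) pieces with 0 <= start < end <= 1."""
    pieces = []
    pos = Fraction(m) if right_to_left else Fraction(0)
    for task, w in loads:
        lo, hi = (pos - w, pos) if right_to_left else (pos, pos + w)
        pos = lo if right_to_left else hi
        while lo < hi:
            k = int(lo)                     # slot [k, k + 1) is processor k
            cut = min(hi, Fraction(k + 1))  # split at the slot boundary
            pieces.append((task, k, lo - k, cut - k))
            lo = cut
    return pieces


def hetero_wrap(omega, m1, m2):
    """omega maps task -> (w1, w2), the fractions of the interval that
    the task executes on cluster 1 and on cluster 2."""
    omega = {i: (Fraction(a), Fraction(b)) for i, (a, b) in omega.items()}
    if any(a < 0 or b < 0 or a + b > 1 for a, b in omega.values()):
        raise ValueError("each task needs 0 <= w1 + w2 <= 1")
    if (sum(a for a, _ in omega.values()) > m1
            or sum(b for _, b in omega.values()) > m2):
        raise ValueError("cluster capacity exceeded")
    im_a = [i for i, (a, b) in omega.items() if a > 0 and b > 0 and a + b == 1]
    im_b = [i for i, (a, b) in omega.items() if a > 0 and b > 0 and a + b < 1]
    cp1 = [i for i, (a, b) in omega.items() if a > 0 and b == 0]
    cp2 = [i for i, (a, b) in omega.items() if a == 0 and b > 0]
    if len(im_b) > 1:
        raise ValueError("more than one task in IM_b")  # assumption of this sketch
    # Cluster 1 is filled left to right, cluster 2 right to left,
    # with the migrating tasks in the same order on both clusters.
    s1 = wrap([(i, omega[i][0]) for i in im_a + im_b + cp1], m1)
    s2 = wrap([(i, omega[i][1]) for i in im_a + im_b + cp2], m2,
              right_to_left=True)
    return s1, s2
```

Filling the two clusters from opposite ends is what keeps a migrating task from overlapping itself: for tasks in $IM_a$ the cumulative workloads on the two clusters add up to an integer, so the pieces on one cluster occupy exactly the part of the interval left free on the other.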
However, the algorithm requires a feasible solution to (6) or (7) in which $IM_b$ has at most one task, which we will call an inter-cluster migrating task. From Theorem 3, we can always transform a solution to (6) into a solution to (7). Therefore, we only need to show, by the following facts and lemma, that there exists a solution to (7) with at most one inter-cluster migrating task that lies on a vertex of the feasible region.

Algorithm 1 Hetero-Wrap Algorithm [11]
INPUT: $\omega^r_i$, $m_r$, $\forall i, r$
$\sigma^r_{ik} \leftarrow 0$, $\eta^r_{ik} \leftarrow 1$, $\forall i, k, r$
$p_1 \leftarrow 0$, $p_2 \leftarrow m_2$, $k_1 \leftarrow 1$, $k_2 \leftarrow m_2$
for $r = 1, 2$ do
    if $r = 1$ then
        for $i \in \{IM_a, IM_b, CP_1\}$ do
            if $p_1 = 0$ then
                [...]
            else
                if $p_1 + \omega^r_i \le k_1$ then
                    [...]
    [...]
    $\eta^r_{ik_2} \leftarrow p_2 - (k_2 - 1)$
    [...]

[...] $A\chi \le b$, $\chi \in \mathbb{R}^n\}$ for some $A \in \mathbb{R}^{(m+n) \times n}$, $b \in \mathbb{R}^{m+n}$, $c \in \mathbb{R}^n$. Suppose that $n$ constraints are nonnegative constraints on each variable, i.e. $\chi_i \ge 0$, $\forall i \in \{1, 2, \ldots, n\}$, and the rest are $m$ linearly independent constraints. If [...]
Proof: A unique basic solution can be identified by any $n + m$ linearly independent active constraints. Since there are $n$ nonnegative constraints and $m < n$, a basic solution will have at most $m$ non-zero values.

Lemma 7. For a solution to (7) that lies on a vertex of the feasible region, there will be at most one inter-cluster migrating task.
Proof: The number of variables $\omega$ subjected to the nonnegative constraint (7f) at each time interval of (7) is $n(\sum_r l_r)$. The number of variables $\omega$ subjected to the set of necessary and sufficient feasibility constraints (7d)-(7e) is $n + 2$. Note that we do not count the number of variables in (7c) because (7c) and (7d) are linearly dependent constraints for a given value of [...]. If we assume that $n \ge 2$ and each processor type has at least one speed level, then it follows from Fact 6 that the number of non-zero values of the variable $\omega$, a solution to (7) at a vertex of the feasible region, is at most $n + 2$. Let $\gamma$ be the number of tasks assigned to two processor types. Therefore, there are $2\gamma + (n - \gamma)$ entries of the variable $\omega$ that are non-zero. This implies that $\gamma < 2$, i.e. the number of inter-cluster migrating tasks is at most one.
To illustrate how Algorithm 1 works, consider a simple taskset in which the percentage of execution workload partition at time interval $[\tau_\mu, \tau_{\mu+1}]$ for each task is as shown in Table 1. A feasible schedule obtained by Algorithm 1 is shown in Figure 1. For this example, [...]
Theorem 8. If a solution to (1) exists, then a solution to (6)/(7) exists. Furthermore, at least one valid schedule satisfying (1) can be constructed from a solution to problem (6)/(7) and the output from Algorithm 1.
Proof: The existence of a valid schedule is proven in [11, Thm 3]. It follows from Facts 4-6 and Lemma 7 that one can compute a solution with at most one inter-cluster migrating task. Given a solution to (6)/(7) and the output from Algorithm 1 for all intervals, choose $a$ to be a step function such that $a^r_{ik}(t)$ [...] and $a^r_{ik}(t) = 0$ otherwise, $\forall i, k, r, \mu, \nu$. Specifically, one can verify that the following condition holds: [...] $\int_{\tau_{\mu,\nu}} \sum_k a^r_{ik}(t)\,dt$, $\forall i, r, \mu, \nu$.
Then it is straightforward to show that (1) is satisfied. Note that, although the multiprocessor scheduling problem is solved in two steps in this section, the computation time to solve (6) or (7) is much shorter than that of solving problem (1) via its discretization (4): even for a small problem, computing a solution of (4) can take up to an hour, while (6) or (7) can be solved in milliseconds on a general-purpose desktop PC with off-the-shelf optimization solvers. Furthermore, the complexity of Algorithm 1 is $O(n)$ [11].

Energy Consumption model
A power consumption model can be expressed as a summation of dynamic power consumption $P_d$ and static power consumption $P_s$. Dynamic power consumption is due to the charging and discharging of CMOS gates, while static power consumption is due to subthreshold leakage current and reverse bias junction current [29]. The dynamic power consumption of CMOS processors at a clock frequency $f = s f_{\max}$ is given by
$$P_d(s) := C_{ef} V_{dd}^2\, s f_{\max}, \tag{9a}$$
where the constraint
$$s f_{\max} \le \zeta \frac{(V_{dd} - V_t)^2}{V_{dd}} \tag{9b}$$
has to be satisfied [29]. Here $C_{ef} > 0$ denotes the effective switch capacitance, $V_{dd}$ is the supply voltage, $V_t$ is the threshold voltage ($V_{dd} > V_t > 0$ V) and $\zeta > 0$ is a hardware-specific constant. From (9b), it follows that if $s$ increases, then the supply voltage $V_{dd}$ may have to increase (and if $V_{dd}$ decreases, so does $s$). In the literature, the total power consumption is often simply expressed as an increasing function of the form
$$P(s) := P_d(s) + P_s = \alpha s^\beta + P_s, \tag{10}$$
where $\alpha > 0$ and $\beta \ge 1$ are hardware-dependent constants, while the static power consumption $P_s$ is assumed to be either constant or zero [30]. The energy consumption of executing and completing a task $T_i$ at a constant speed $s_i$ is given by
$$E(s_i) := \frac{x_i}{s_i} P(s_i). \tag{11}$$
In the literature, it is often assumed that $E$ is an increasing function of the operating speed. However, because $s \mapsto 1/s$ is a decreasing function, it follows that the energy consumed might not be an increasing function if $P_s$ is non-zero; Figure 3 gives an example of when the energy is non-monotonic, even if the power is an increasing function of clock frequency. This result implies the existence of a non-zero energy-efficient speed $s_{eff}$, i.e. the minimizer of (11) [31]-[33]. Moreover, in the work of [34], the non-convex relationship between the energy consumption and processor speed can be observed as a result of scaling the supply voltage.
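To make the non-monotonic energy behaviour concrete, the sketch below evaluates the energy of running a task at constant speed under the polynomial power model $P(s) = \alpha s^\beta + P_s$ and computes the stationary point of the energy in closed form. The closed form follows from setting the derivative of $x(\alpha s^{\beta-1} + P_s/s)$ to zero, which gives $s^\beta = P_s/(\alpha(\beta - 1))$. It assumes $\beta > 1$ and $P_s > 0$ and ignores the speed bounds, so the returned value may lie outside the allowable speed range. The function names are hypothetical.

```python
def energy(s, x, alpha, beta, p_static):
    """Energy to complete a task with minimum execution time x at speed s:
    execution time x / s multiplied by power alpha * s**beta + p_static."""
    return (x / s) * (alpha * s ** beta + p_static)


def energy_efficient_speed(alpha, beta, p_static):
    """Unconstrained minimizer of energy(); assumes beta > 1, p_static > 0."""
    return (p_static / (alpha * (beta - 1))) ** (1.0 / beta)
```

With `alpha = 1`, `beta = 3` and `p_static = 0.25`, the stationary point is approximately 0.5, and the energy at speeds 0.4 and 0.6 is higher than at 0.5, so slowing down below the energy-efficient speed costs energy in this model.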
The total energy consumption of executing a real-time task $T_i$ can be expressed as a summation of active energy consumption and idle energy consumption, i.e. $E = E_{active} + E_{idle}$, where $E_{active}$ is the energy consumption when the processor is busy executing the task and $E_{idle}$ is the energy consumption when the processor is idle. The energy consumption of executing and completing a task $T_i$ at a constant speed $s_i$ is
$$E(s_i) := \frac{x_i}{s_i}\left(P_{active}(s_i) - P_{idle}\right) + P_{idle}\, d_i, \tag{12}$$
where $P_{active}(s) := P_{da}(s) + P_{sa}$ is the total power consumption in the active interval and $P_{idle} := P_{di} + P_{si}$ is the total power consumption during the idle period. $P_{da} > 0$ and $P_{sa} \ge 0$ are the dynamic and static power consumption during the active period, respectively. Similarly, $P_{di} > 0$ and $P_{si} \ge 0$ are the dynamic and static power consumption during the idle period. $P_{di}$ will be assumed to be a constant, since the processor is executing a nop (no operation) instruction at the lowest frequency $f_{\min}$ during the idle interval. $P_{sa}$ and $P_{si}$ are also assumed to be constants, where $P_{si} < P_{sa}$. Note that $P_{active}(s) - P_{idle}$ is strictly greater than zero.

Optimality Problem Formulation
The scheduling problem with the objective of minimizing the total energy consumption of executing the taskset on a two-type heterogeneous multiprocessor can be formulated as the following optimal control problems:

I) Continuous Optimal Control Problem:

\[ \min_{x_i(\cdot),\; a^r_{ik}(\cdot),\; s^r_k(\cdot),\; \forall i \in I,\, k \in K^r,\, r \in R} \;\; \sum_{r,k,i} \int_0^L \ell^r\!\left(a^r_{ik}(t), s^r_k(t)\right) dt \qquad (13) \]

subject to (1 [...] subject to (7).

where $\ell^r(a, s) := a \left( P^r_{active}(s) - P^r_{idle} \right)$. Note that (16) is an LP, since the cost is linear in the decision variables.

Constant or Time-varying Speed?
In this section, we present a result on the general form of an optimal speed trajectory for a uniprocessor scheduling problem with a real-time taskset. With this observation about the optimal speed profile, we can formulate algorithms that are able to solve a more general class of scheduling problems than those in the literature.

Consider the following simple example, illustrated in Fig. 4, where the power consumption model $P(\cdot)$ is a concave function of speed. Assume that $s_2$ is the lowest possible constant speed at which task $T_i$ can be finished on time, i.e. $x_i = s_2 d_i$. The energy consumed is $E(s_2) = P(s_2) d_i$ and the average power consumption is $P(s_2) =: \bar{P}_c$. Let $\lambda \in [0, 1]$ be a constant such that $s_2 = \lambda s_1 + (1 - \lambda) s_3$, $s_1 < s_2 < s_3$. Suppose $s(\cdot)$ is a time-varying speed profile such that $s(t) = s_1$, $\forall t \in [0, t_1)$ and $s(t) = s_3$, $\forall t \in [t_1, d_i)$. We can choose $t_1$ such that $x_i = s_1 t_1 + s_3 (d_i - t_1)$, which gives $t_1 = \lambda d_i$. The energy used in this case is $P(s_1) t_1 + P(s_3)(d_i - t_1) = \left( \lambda P(s_1) + (1 - \lambda) P(s_3) \right) d_i$, which is no greater than $P(s_2) d_i$ because $P$ is concave. This result implies that a time-varying speed profile is better than a constant speed profile when the power consumption is concave. Notably, the result can be generalised to the case where the power model is non-convex or non-concave, as well as to a discrete speed set.

[...] There exists a piecewise constant speed trajectory $s(\cdot)$ with at most one switch such that the amount of computations done and the energy consumed are the same as when using $s^*(\cdot)$, i.e. $s(\cdot)$ is of the form [...] where $\hat{s}, \check{s} \in S$, $\lambda \in [0, 1]$, such that the total amount of computations and energy consumed [...]

[...] It follows that $c = \sum_i s_i \Delta_i$ and $E = \sum_i P(s_i) \Delta_i$. Hence, with $\lambda_i := \Delta_i / (t_f - t_0)$, the average speed is $\bar{s} := c/(t_f - t_0) = \sum_i \lambda_i s_i$ and the average power is $\bar{P} := E/(t_f - t_0) = \sum_i \lambda_i P(s_i)$.
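The concave-power example at the start of this section can be checked numerically. The sketch below assumes the switching time $t_1 = \lambda d_i$, which makes the work done by the two-speed profile equal to that of the constant speed $s_2 = \lambda s_1 + (1 - \lambda) s_3$. The square-root power model and the numbers are hypothetical stand-ins for a concave $P(\cdot)$.

```python
import math


def compare_profiles(s1, s3, lam, deadline, power):
    """Work and energy of a constant speed s2 versus an s1-then-s3 profile."""
    s2 = lam * s1 + (1.0 - lam) * s3        # equivalent constant speed
    t1 = lam * deadline                     # time spent at the lower speed s1
    work_const = s2 * deadline
    work_switch = s1 * t1 + s3 * (deadline - t1)
    energy_const = power(s2) * deadline
    energy_switch = power(s1) * t1 + power(s3) * (deadline - t1)
    return work_const, work_switch, energy_const, energy_switch


# Hypothetical concave power model: P(s) = sqrt(s).
wc, ws, ec, es = compare_profiles(0.25, 1.0, 0.5, 10.0, math.sqrt)
print("work:", wc, ws)        # both profiles complete the same work
print("energy:", ec, es)      # the switching profile uses less energy
```

Replacing `math.sqrt` by a convex function such as `lambda s: s ** 3` reverses the inequality, in which case the constant speed is the cheaper option.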
Corollary 11. An optimal speed profile to (13) can be constructed by switching between no more than two nonzero speed levels within each time interval defined by two consecutive time steps of the major grid T .
Proof: The overall optimal speed profile can be obtained by connecting an optimal time-varying speed profile proven in Theorem 9 for each partitioned time interval. Specifically, the generalised optimal speed profile is a step function.
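For a finite speed level set, the two-level structure stated in Corollary 11 can be illustrated for a single interval on one processor. The sketch below enumerates every pair of levels that brackets the required average speed and keeps the mix with the lowest average power. This brute-force search is an illustrative substitute, not the formulation used in this paper, and the speed and power values are hypothetical.

```python
from itertools import combinations_with_replacement


def best_two_level_mix(levels, required_speed):
    """Return (avg_power, s_lo, s_hi, frac_lo) for the cheapest two-level mix.

    `levels` maps each available speed to its power consumption; `frac_lo`
    is the fraction of the interval spent at the lower speed `s_lo`.
    """
    best = None
    for s_lo, s_hi in combinations_with_replacement(sorted(levels), 2):
        if not s_lo <= required_speed <= s_hi:
            continue                      # pair cannot average to the target
        if s_lo == s_hi:
            frac_lo = 1.0                 # single constant speed level
        else:
            frac_lo = (s_hi - required_speed) / (s_hi - s_lo)
        avg_power = frac_lo * levels[s_lo] + (1.0 - frac_lo) * levels[s_hi]
        if best is None or avg_power < best[0]:
            best = (avg_power, s_lo, s_hi, frac_lo)
    if best is None:
        raise ValueError("required speed lies outside the available range")
    return best


# Hypothetical non-convex power table: the middle level is inefficient.
levels = {0.25: 1.0, 0.5: 3.5, 1.0: 4.0}
print(best_two_level_mix(levels, 0.5))
```

In this example the required average speed of 0.5 is itself an available level, yet mixing the levels 0.25 and 1.0 gives a lower average power than running at 0.5 throughout.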
The results of the above Theorem and Corollaries can be applied directly to scheduling algorithms that adopt the DP technique, such as LLREF and DP-WRAP, as well as our algorithms in Section 3. Consider the problem of determining the optimal speeds at each time interval defined by two consecutive task deadlines. By subdividing time into such intervals, we can easily determine the optimal speed profile of four uniprocessor scheduling paradigms classified by power consumption and taskset models, i.e. (i) a convex power consumption model with an implicit deadline taskset, (ii) a convex power consumption model with a constrained deadline taskset, (iii) a non-convex power consumption model with an implicit deadline taskset and (iv) a non-convex power consumption model with a constrained deadline taskset. Specifically, if the taskset has implicit deadlines, then the required workloads (taskset density) are equal for all time intervals, so the optimal speed profiles of all schedule intervals are the same as well. Therefore, the optimal speed profile is a constant for (i) (Cor. 10) and a combination of two speeds for (iii) (Thm. 9). However, for a constrained deadline taskset, the required workload varies from interval to interval, but is constant within each interval. Hence, regardless of whether the power function is (ii) convex or (iv) non-convex, the optimal speed profile is a (time-varying) piecewise constant function. In other words, for generality, a time-varying speed profile with two speed levels at each partitioned time interval is guaranteed to provide an energy optimal solution.

Theorem 12. Consider the optimization problems (13)-(16). An optimal speed profile for (13) can be constructed using any of the following methods:
• Compute a solution to (14) with the lower bound on M at least twice the bound in Theorem 2.
• If the active power function $P_{active}$ is convex and the speed level sets are closed intervals, compute a solution to (15). If there is more than one inter-cluster partitioning task, then the (finite) range of the optimal speed profile should be used to define and compute a solution to (16) with at most one inter-cluster partitioning task. This process is concluded with Algorithm 1.
• If the speed level sets are finite, compute a solution to (16) with at most one inter-cluster partitioning task, followed by Algorithm 1.
Proof: Follows from the choices of $a$ and $s$ made in the proofs of Theorem 2 and Theorem 8. The costs of all problems are then equal.

System, Processor and Task models
The energy efficiency of the solutions to the above optimization problems is evaluated on the ARM big.LITTLE architecture, where a big core provides faster execution times, but consumes more energy than a LITTLE core. The details of the ARM Cortex-A15 (big) and Cortex-A7 (LITTLE) cores, which have been validated in [10], are given in Tables 2 and 3. The active power consumption models, obtained by a polynomial curve fitting to the generic form (10), are shown in Table 4. The plots of the actual data versus the fitted models are shown in Fig. 5. The idle power consumption was not reported, thus we will assume this to be a constant strictly less than the lowest active power consumption, namely $P_{idle} = 70$ mW for the big core and $P_{idle} = 12$ mW for the LITTLE core.

To illustrate that our formulations are able to solve a broader class of multiprocessor scheduling problems than other optimal algorithms reported in the literature, we consider periodic taskset models with both implicit and constrained deadlines. However, more general taskset models can be solved by our algorithms as well, such as an arbitrary deadline taskset, where the deadline could be greater than the period, a sporadic taskset, where the inter-arrival time of successive tasks is at least $p_i$ time units, and an aperiodic taskset. To guarantee the existence of a valid schedule, the minimum taskset density has to be less than or equal to the system capacity. Moreover, a periodic task needs to be able to be executed on any processor type. Specifically, the minimum task density should be less than or equal to the lowest capacity of all processor types, i.e. $\delta_i(1) \le 0.375$ for this particular architecture.
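The two conditions on the taskset density stated above can be checked with a minimal sketch. It assumes that capacities are normalised to the big core (1.0 for a big core and 0.375 for a LITTLE core) and that the minimum density of a task is $c_i / (f_{\max} \min\{d_i, p_i\})$, i.e. its density at full speed. The task parameters and the value of $f_{\max}$ are hypothetical.

```python
def system_capacity(cores):
    """Total capacity of a list of (core_count, normalised_capacity) pairs."""
    return sum(count * capacity for count, capacity in cores)


def min_task_density(task, f_max):
    """Density of a (cycles, deadline, period) task when run at full speed."""
    cycles, deadline, period = task
    return cycles / (f_max * min(deadline, period))


def density_conditions_hold(tasks, cores, f_max):
    """Check the taskset-level and the per-task density conditions."""
    densities = [min_task_density(task, f_max) for task in tasks]
    lowest_capacity = min(capacity for _, capacity in cores)
    fits_system = sum(densities) <= system_capacity(cores)
    fits_any_core = all(d <= lowest_capacity for d in densities)
    return fits_system and fits_any_core


# Two big cores and six LITTLE cores, normalised to the big core.
cores = [(2, 1.0), (6, 0.375)]
# Hypothetical tasks: (cycles, deadline in s, period in s), f_max = 1 GHz.
tasks = [(3.0e8, 1.0, 1.0), (2.0e8, 0.8, 1.0)]
print(system_capacity(cores))                       # 4.25
print(density_conditions_hold(tasks, cores, 1.0e9))
```

A task whose full-speed density exceeds 0.375 fails the per-task condition even when the system as a whole has spare capacity, since it could not be completed on a LITTLE core alone.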

Comparison between Algorithms
For a system with a continuous speed range, four algorithms are compared: (i) MINLP-DVFS, (ii) NLP-DVFS, (iii) GWA-SVFS, which represents a global energy/feasibility-optimal workload allocation with a constant frequency scaling scheme at the core level, and (iv) GWA-NoDVFS, which is a global scheduling approach without a frequency scaling scheme. For a system with discrete speed levels, four algorithms are compared: (i) LP-DVFS, (ii) GWA-NoDVFS, (iii) GWA-DDiscrete and (iv) GWA-SDiscrete, where the last two represent global energy/feasibility-optimal workload allocation with time-varying and constant discrete frequency scaling schemes, respectively. Note that GWA-SVFS, GWA-NoDVFS, GWA-DDiscrete and GWA-SDiscrete are based on the mathematical optimization formulation proposed in [10], but adapted to our framework, for which details are given below.
GWA-SVFS/GWA-NoDVFS: Given $m_r$ processors of type-$r$ and $n$ periodic tasks, determine a constant operating speed $s^r_k$ for each processor and the workload ratio $y^r_{ik}$ for all tasks within hyperperiod $L$ that solves:

\[ \min_{s^r_k,\; y^r_{ik},\; i \in I,\, k \in K^r,\, r \in R} \;\; \sum_{r,i,k} [\ldots] \]

where $y^r_{ik}$ is the ratio of the workload of task $T_i$ on processor $k$ of type-$r$, $\delta^r_{ik}(s^r_k)$ is the task density on processor $k$ of type-$r$, defined as $\delta^r_{ik}(s^r_k) := y^r_{ik} c_i / (s^r_k f_{\max} \min\{d_i, p_i\})$, and $L_i := L \min\{d_i, p_i\} / p_i$. Note that when $d_i = p_i$, as in the case of an implicit deadline taskset, $L_i = L$, $\forall i$. Constraint (17b) ensures that all tasks will be allocated the required amount of execution time. The constraint that a task will not be executed on more than one processor at the same time is specified in (17c). Constraint (17d) asserts that the assigned workload will not exceed the processor type capacity. Upper and lower bounds on the workload ratio of a task are given in (17e). The difference between GWA-SVFS and GWA-NoDVFS lies in restricting the core-level operating speed $s^r_k$ to be either a continuous variable (17f) or fixed at the maximum value (17g).
GWA-DDiscrete: Given $m_r$ processors of type-$r$ and $n$ periodic tasks, determine a percentage of the task workload $y^r_{iq}$ at a specific speed level for all tasks within hyperperiod $L$ that solves:

\[ \min_{y^r_{iq},\; i \in I,\, q \in Q^r,\, r \in R} \;\; \sum_{r,i,q} [\ldots] \]

where $y^r_{iq}$ is the percentage of the workload of task $T_i$ on processor type-$r$ at speed level $q$ and $\delta^r_{iq}(y^r_{iq})$ is the task density on processor type-$r$ at speed level $q$, i.e. $\delta^r_{iq}(y^r_{iq}) := y^r_{iq} c_i / (s^r_q f_{\max} \min\{d_i, p_i\})$. Constraint (18b) guarantees that the total execution workload of a task is allocated. Constraint (18c) assures that a task will be executed on only one processor at a time. Constraint (18d) ensures that the workload capacity of each processor type is not violated. Constraint (18e) provides upper and lower bounds on the percentage of task workload at a specific speed level.
GWA-SDiscrete: Given $m_r$ processors of type-$r$ and $n$ periodic tasks, determine a percentage of task workload $y^r_{ikq}$ at a specific speed level and a processor speed level selection $z^r_{kq}$ for all tasks within hyperperiod $L$ that solves:

\[ \min_{y^r_{ikq},\; z^r_{kq},\; i \in I,\, k \in K^r,\, q \in Q^r,\, r \in R} \;\; \sum_{r,i,q} [\ldots] \]

\[ \sum_{r=1}^{\kappa} \sum_{q=1}^{l_r} \delta^r_{ikq}(y^r_{ikq}) \, z^r_{kq} \le 1, \]

where $y^r_{ikq}$ is the workload partition of task $T_i$ on processor $k$ of type-$r$ at speed level $q$ and $z^r_{kq}$ is a speed level selection variable for processor $k$ of type-$r$, i.e. $z^r_{kq} = 1$ if speed level $q$ of a type-$r$ processor is selected and $z^r_{kq} = 0$ otherwise. Constraints (19b) and (19d)-(19f) are the same as in GWA-DDiscrete. Constraint (19c) assures that only one speed level is selected. Constraint (19g) states that the speed level selection variable is binary.

Note on the curve-fitting error measure: [...] is the magnitude of the relative error in the $i$th measurement, $z \mapsto F(z)$ is the estimated function, $z$ is the input data, $y$ is the actual data and $k$ is the total number of fitted points.

Simulation Setup and Results
For simplicity and without loss of generality, consider the case where independent real-time tasks are to be executed on two-type processor architectures, for which the details are given in Section 5.1. The MINLP formulations were modelled using ZIMPL [37] and solved with SCIP [38]. The LP and NLP formulations were solved with SoPlex [39] and Ipopt [40], respectively. The value of the minor grid discretization step M is chosen according to Theorem 2.
For implicit deadline tasksets, we consider a system composed of two big cores and six LITTLE cores, which has a system capacity of 4.25 (= 2 + 2.25). The total energy consumption is evaluated for each taskset given in Table 5, with the minimum taskset density varying from 0.5 to the system capacity in steps of 0.25. Figure 6a shows simulation results for scheduling a real-time taskset with implicit deadlines on an ideal system. The minimum taskset density D is represented on the horizontal axis. The vertical axis is the total energy consumption normalised by that of GWA-NoDVFS, where a value less than 1 means that the algorithm does better than GWA-NoDVFS.
The three algorithms with a DVFS scheme, i.e. MINLP-DVFS, NLP-DVFS and GWA-SVFS, produce the same optimal energy consumption, even though both of our algorithms allow the operating speed to vary with time, in contrast to the constant frequency scaling scheme used by GWA-SVFS. The simulation results suggest that the optimal speed is constant, rather than time-varying, for an implicit deadline taskset that has a constant workload over time. This result is consistent with Corollary 10. Moreover, the LITTLE core, which has only 37.5% of the computing power of the big core and consumes considerably less power even when running at full speed, will be selected by the optimizer before the big cores are considered. This is why we can see two upward parabolic curves in the figures: the first corresponds to the case where only LITTLE cores are selected, while both core types are selected in the second, which happens when the minimum taskset density is larger than the capacity of the LITTLE-core cluster.
However, for a practical system, where a processor has discrete speed levels, the constant speed assignment is not an optimal strategy. As can be observed in Figure 6b, LP-DVFS and GWA-DDiscrete are energy optimal, while GWA-SDiscrete is not. The results imply that a time-varying combination of discrete speed levels is necessary to obtain an energy optimal schedule.

For a real-time taskset with constrained deadlines, we consider a system with one big core and one LITTLE core, i.e. a system capacity of 1.375 (= 1 + 0.375). The simulation results of executing each taskset, listed in Table 6, are shown in Figure 7, where the total energy consumption normalised by that of GWA-NoDVFS is on the vertical axis. It can be seen from the plots that for a taskset with a piecewise constant and time-varying workload, i.e. constrained deadlines, GWA-SVFS, GWA-DDiscrete and GWA-SDiscrete cannot provide an optimal energy consumption, while our algorithms are optimal. This is because time is incorporated in our formulations, which is beneficial for solving a scheduling problem with a time-varying workload as well as one with a constant workload.
Lastly, it should be noted that the energy saving percentage varies with the taskset, so the numbers in the plots shown here may differ for other tasksets, but the significant outcomes stay the same.

CONCLUSIONS
This work presents multiprocessor scheduling as an optimal control problem with the objective of minimizing the total energy consumption. We have shown that the scheduling problem is computationally tractable by first solving a workload partitioning problem, then a task ordering problem. The simulation results illustrate that our algorithms are both feasibility optimal and energy optimal when compared to an existing global energy/feasibility optimal workload allocation algorithm. Moreover, we have shown via proof and simulation that a constant frequency scaling scheme is enough to guarantee optimal energy consumption for an ideal system with a constant workload and convex power function, while this is not true in the case of a time-varying workload or a non-convex power function. For a practical system with discrete speed levels, a time-varying speed assignment is necessary to obtain an optimal energy consumption in general.
For future work, one could incorporate a DPM scheme and formulate the problem as a multi-objective optimization problem to further reduce the energy consumption of a system. Extending the idea presented here to cope with uncertainty in a task's execution time using feedback is also possible. Though our work has focused on minimizing the energy consumption, the framework could easily be applied to other objectives, such as leakage-aware, thermal-aware and communication-aware scheduling problems. Numerically efficient methods could also be developed to solve the optimization problems defined here.