Energy-efficient scheduling of streaming applications in VFI-NoC-HMPSoC based edge devices

Energy-aware high-performance computing is becoming a challenging facet for streaming applications at edge devices in the Internet-of-Things (IoT) due to the high computational complexity involved. Consequently, the number of processors in multiprocessor systems has increased significantly, and Voltage Frequency Islands (VFIs) have recently been adopted as an effective energy management mechanism in large-scale multiprocessor chip designs. In this paper, energy-aware scheduling of real-time streaming applications on edge-devices is investigated. First, an innovative re-timing based technique is developed to transform the dependent workload into an independent task model, so that the available resources and the otherwise wasted slack in the processors can be used with a minimal prologue. Moreover, unlike the existing population-based optimization algorithms, a novel population-based algorithm, ARSH-FATI, is proposed that can dynamically switch between explorative and exploitative search modes at run-time for a performance trade-off. Finally, a communication contention-aware Earliest Edge Consistent Deadline First (EECDF) scheduling algorithm is presented. Our static scheduler ARSH-FATI performs task mapping and ordering collectively. Consequently, its performance is superior to the existing state-of-the-art approach proposed for homogeneous VFI based MPSoCs.


Introduction
The Internet-of-Things (IoT) paradigm is a 21st century technological revolution that interconnects a plethora of edge-devices to the Internet for monitoring and automation without human intervention Ali et al. (2018a). IoT is positively influencing human lives through interconnection and automation in applications including environmental monitoring, health monitoring, home appliances, automated devices, surveillance and security. The literature suggests that there will be over 50 billion interconnected IoT edge-devices by 2020. Multimedia content was estimated to be around 80% of the overall Internet data traffic in 2019. Video surveillance used to monitor traffic conditions is a prime example of a multimedia application in IoT. In traffic monitoring, video streaming is used to collect information regarding vehicle positions, traffic congestion and road accident severity. The data content is streamed over the network in an encoded form, and the live or recorded video is presented to the end-user or a professional for analysis Aliyu et al. (2018). Meaningful information is extracted from the video using machine learning techniques Yu and Wang (2019).
Streaming applications can be modelled using periodic dependent task models because these applications execute repeatedly to service the data stream as it arrives Wang et al. (2013). For example, the MPEG encoder is executed many times for a video stream. Streaming applications are computationally intensive as they service continuous streams of data. These properties make streaming services suitable for execution on MPSoCs (Multiprocessor-System-on-Chips) due to their high performance and exceptional Quality-of-Service (QoS) Han et al. (2015); Ali et al. (2018a); Tariq and Wu (2017). MPSoCs integrate processors, I/O units, and memory on a single silicon die. MPSoCs are widely deployed in IoT applications and consumer electronics such as smart-phones, personal computers and tablets Ali et al. (2018a); Miao et al. (2018); Zhao et al. (2018). MPSoC systems also play a vital role in mobile systems such as autonomous cars and robots Tariq et al. (2018). Xilinx Zynq® UltraScale+™ MPSoCs and the Renesas R-Car H3 nona-core are popular examples of commercially available multiprocessor systems for both robotic and IoT based applications. Similarly, Apple A11 Bionic, Xilinx Zynq® UltraScale+™ RFSoCs, Samsung Exynos 9810, Intel® Stratix® 10, Tilera TILE-Gx72™, and EZchip TILE-Mx100™ are high-performance MPSoCs used in consumer digital systems Ali et al. (2018a); Tariq and Wu (2016).
To use the multiprocessor architecture of MPSoCs efficiently, techniques are required that can increase the degree of parallelism of streaming applications Wang et al. (2013). In this article, we explore task level software pipelining to maximize the degree of parallelism of the periodic dependent task set. Re-timing is a powerful technique applied at the application level to transform the intra-period dependencies between tasks by regrouping tasks from different periods. Re-timing reschedules a parent task a few periods ahead of its child task so that the data needed by the child task is available at the start of each period. Consequently, the start time of the child task is not constrained by the finish time of the parent task. Stated simply, re-timing converts the dependent task model into an independent task model by transforming intra-period data dependencies into inter-period data dependencies.
The complexity of data intensive applications in both IoT and consumer digital devices is proliferating and, as a result, the number of processors on MPSoC computing systems is increasing Ali et al. (2018a). According to the International Technology Roadmap for Semiconductors (ITRS), MPSoCs will soon have hundreds of processors, so traditional bus-based System-on-Chips (SoCs) will become a bottleneck due to their poor scalability and congestion issues. Network-on-Chip (NoC) based communication offers improved scalability and higher flexibility compared with traditional bus structures Guindani and Moraes (2013); Wu (2016, 2017). Various IoT based real-time streaming applications deploy MPSoCs, including human gait analysis Nguyen (2015), human recognition Safaei et al. (2018); Meng et al. (2008), video surveillance Yan et al. (2009); Sasagawa and Mori (2016), video/image enhancement Saponara et al. (2013), healthcare Djelouat et al. (2018); Zhai et al. (2018), and person tracking Ahmed et al. (2018).
The Voltage Frequency Island (VFI) design style, a Globally Asynchronous Locally Synchronous (GALS) approach, has recently been introduced to NoC based MPSoCs; it divides the tiles into islands. Each island possesses its own operating frequency, supply voltage, and threshold voltage Jang and Pan (2011); Ullah Tariq et al. (2019). The VFI based NoC-MPSoC (VFI-NoC-HMPSoC) is an ideal choice for data intensive applications due to its higher throughput, lower complexity, and superior energy-efficiency Lackey et al. (2002). VFI-MPSoCs require fewer multiple-clock first-in-first-out (MCFIFO) units and voltage level converters (VLCs) Mahabadi et al. (2013). Green computing in modern embedded systems is a challenging facet because higher energy consumption reduces the lifetime of an edge-device in IoT and increases carbon dioxide (CO2) emissions, i.e. the carbon footprint Rault et al. (2014); Huang et al. (2014). Proper task scheduling using search-based algorithms on multiprocessor architectures can significantly improve the performance and energy-efficiency of a battery-constrained edge-device. Re-timing, which transforms intra-period data dependencies into inter-period data dependencies Wang et al. (2011), is another technique that drastically increases energy-efficiency Ali et al. (2018a).
The rest of the paper is organized as follows: Sect. 2 discusses existing scheduling approaches and algorithms. Section 3 presents preliminaries and Sect. 4 explains our re-timing technique. In Sect. 5 we discuss our proposed static contention-aware energy-efficient task scheduling. We present simulation results on different benchmarks in Sect. 6, while Sect. 7 offers the conclusion.

Literature review and contributions
Task scheduling on MPSoC architectures is a well-known NP-hard problem. Different heuristics have been developed using mathematical formulations such as Non Linear Programming (NLP), Mixed Integer Linear Programming (MILP), Integer Linear Programming (ILP), and Linear Programming (LP). Similarly, search based heuristics using selection, crossover, mutation, and elitism have been deployed. Popular examples of search based heuristics include Genetic Algorithm (GA), Ant Colony Optimization (ACO), and Particle Swarm Optimization (PSO). Among the aforementioned algorithms, GA is widely used for task scheduling Ali et al. (2018a, 2018b); Mirjalili and Lewis (2016). These evolutionary algorithms are stochastic generate-and-test algorithms based on exploration and exploitation. Exploration is the capability of a heuristic to discover unseen regions of the search space, while exploitation is the ability of an algorithm to proceed in the desired direction for improvement. For example, in GA, crossover and mutation are hypothetically considered to perform exploitation and exploration respectively Črepinšek et al. (2013); Vafaee and Nelson (2010). However, there is strong criticism that mutation possesses no competitive advantage over crossover Vafaee and Nelson (2010). Nevertheless, these stochastic heuristics fail to exploit the available chunks of information (schemata) efficiently. Moreover, exploration and exploitation are opposing forces, so a well-found balance between them determines the success of a heuristic. Olafsson (1995) introduced one of the first dynamical models for the scheduling of tasks on heterogeneous multiprocessor systems. Aydin et al. (2001) introduced a DVFS technique to determine the optimal voltage levels for the tasks and developed an algorithm called Earliest Deadline First (EDF) to generate a feasible task schedule. Aydin et al. (2004) further investigated energy-aware scheduling of periodic tasks and designed the Dynamic Reclaiming Algorithm (DRA) to efficiently utilize the available idle slack. Shin and Kim (2003) investigated non-population based scheduling and developed a Non Linear Programming (NLP) based heuristic that assigns optimal discrete voltage levels to each task to reduce the computational energy consumption. Tariq and Wu (2016) scheduled tasks on homogeneous MPSoCs and assigned continuous voltages to the tasks for execution. Tariq et al. (2018) formulated the scheduling problem as an NLP and integrated task ordering and voltage scaling to maximally reduce the energy consumption. They developed an algorithm called Iterative Offline Energy-aware Task and Communication Scheduling (IOETCS), which uses the Earliest Successor-Tree-Consistent Deadline First (ESTCDF) heuristic to generate an initial task schedule and then applies voltage scaling. However, these researchers only investigated energy-aware task scheduling with a single processor per VFI.
Research studies further explored task scheduling on VFI based MPSoCs using a bus as the communication interconnect. For example, Jang and Pan (2011) investigated energy optimization of VFI based NoC-MPSoCs by reducing the power overheads due to VFIs. Similarly, Shin et al. (2011) considered a VFI based NoC-MPSoC and developed an inter-VFI communication aware mapping algorithm to decrease the communication energy consumption. Han et al. (2014) mapped the tasks on the processors of the VFIs and the communications on the NoC to reduce the overall makespan and inter-VFI communication. The authors developed two contention and energy-aware task mapping and edge scheduling heuristics, called CA-TMES-Quick and CA-TMES-Search, for assigning tasks to the processors and edges to the NoC. Pagani et al. (2014) presented a Single Frequency Approximation (SFA) algorithm for optimal voltage assignment to the processor islands in an MPSoC architecture. SFA is integrated with the Dynamic Programming Mapping Algorithm (DPMA) to increase the energy-efficiency and minimize the running time. Liu and Guo (2016) developed an algorithm called Voltage Island Largest Capacity First (VILCF) for task scheduling. VILCF reduces the energy consumption by fully utilizing an island that is already active before activating other islands. Gammoudi et al. (2018) scheduled periodic tasks on homogeneous NoC-VFI-MPSoC architectures using the well-known EDF task ordering policy. Tariq et al. (2019) developed a meta-heuristic for energy-efficient and contention-aware scheduling of dependent tasks with precedence constraints. Although these investigations reduced the energy consumption through proper task mapping and scheduling, they did not consider re-timing at the task level to further decrease the total energy consumption.
In this paper, we investigate energy-efficient and contention-aware static scheduling for a set of tasks with precedence and deadline constraints representing a real-time periodic streaming application on a VFI based NoC-HMPSoC. Our contributions and innovations are as follows:
1. We present an innovative re-timing based algorithm that transforms the intra-period data dependencies of the streaming application into inter-period data dependencies to maximize the degree of parallelism and, at the same time, reduce the re-timing latency.
2. We propose a novel meta-heuristic called ARSH-FATI for task scheduling. ARSH-FATI considers the energy performance profiles of the processors, contention at NoC links, and inter-VFI communications during task mapping.
3. We also develop a novel contention-aware Earliest Edge Consistent Deadline First (EECDF) scheduling scheme applicable to both task and communication nodes.
4. Our energy management scheme achieves better average energy-efficiency of ∼15% and ∼20% over CA-TMES-Search and CA-TMES-Quick Han et al. (2015) respectively for 8 real benchmarks. The energy savings further increase to ∼35% and ∼40% when ARSH-FATI is combined with our proposed re-timing technique. Furthermore, the prologue reduces by ∼50% when we compare our re-timing technique with R-DAG Liu et al. (2009).

Preliminaries
In this section we explain the models that are used in our simulations. In this paper, we use the terms processor and tile interchangeably.

Application model
We characterize a real-time periodic application by a DAG $G(V, E, X)$, illustrated in Fig. 1, where $V = \{v_1, v_2, v_3, \ldots, v_n\}$ is a set of tasks and $E \subseteq V \times V$ is a set of directed edges, each edge $(v_i, v_j) \in E$ representing a data dependency between two tasks. For example, an edge from $v_i$ to $v_j$ indicates that $v_i$ is the predecessor of $v_j$ and outputs data to $v_j$, while $v_j$ is the successor of $v_i$ and accepts input data from $v_i$. Moreover, $X$ denotes the set of directed edge weights, where the weight of edge $(v_i, v_j)$ is the volume of data, in bits, sent from $v_i$ to $v_j$. We consider soft periodic applications, and all tasks in the application have a common period $D$.

System architecture
We consider a VFI-NoC-HMPSoC architecture, shown in Fig. 2, with $M$ processors $P = \{pe_1, pe_2, pe_3, \ldots, pe_M\}$. Each tile contains a processor, a network interface (NI) card, and local memory. We partition the processors of the target multiprocessor architecture into a set $C = \{c_1, c_2, c_3, \ldots, c_{m'}\}$ of $m'$ heterogeneous VFIs, where each VFI $c_i \in C$ consists of a set of $k$ homogeneous processors. We assume that processors within an island are of the same type, while processors across different VFIs may be of different types. Furthermore, we assume a 2D-mesh topology NoC, XY routing and virtual cut-through (VCT) switching.

Communication architecture
We consider a 2D-mesh topology NoC interconnect for inter-processor and inter-VFI communications. Each processor of the VFI-NoC-MPSoC is associated with a router for communicating with other processors. A 2D-mesh NoC contains $N_R$ rows and $N_C$ columns, with the number of processors equal to $N_R \times N_C$. There are five ports in each router: four of them are used to communicate with neighbouring routers, while one port is dedicated to communicating with the associated processor. A link connects either two routers or a processor and a router. We assume that all links are indistinguishable and full-duplex with the same bandwidth, $b_w$. We deploy the well-known XY deterministic routing, which is a common choice for the 2D-mesh NoC topology. In XY deterministic routing, the packets are first routed in the x-direction and then in the y-direction. We consider Virtual Cut-through (VCT) packet switching in this paper. Wormhole (WH) and VCT are the popular and widely adopted packet switching techniques in NoC interconnects. In WH each packet is divided into small pieces called FLITS. When a packet traverses the network, WH immediately determines its next hop and forwards it; the subsequent FLITS then worm their way through the network. Although WH switching is a simple approach with higher flow control efficiency than VCT, when congestion occurs the stalled packet may block all the links it occupies, which results in lower link utilization. In VCT the entire packet is sent to the next node, while in WH a portion of the message is stored. Thus, VCT
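The XY routing rule described above (x-direction first, then y-direction) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `xy_route` and the `(x, y)` tile coordinate convention are assumptions introduced here.

```python
def xy_route(src, dst):
    """Return the ordered list of router-to-router links from src to dst.

    Tiles are addressed by (x, y) mesh coordinates; this addressing scheme
    is an assumption made for the sketch.
    """
    x, y = src
    links = []
    # Phase 1: move along the x-direction until the column matches.
    while x != dst[0]:
        next_x = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (next_x, y)))
        x = next_x
    # Phase 2: move along the y-direction until the row matches.
    while y != dst[1]:
        next_y = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, next_y)))
        y = next_y
    return links


route = xy_route((0, 0), (2, 1))
print(route)
```

Because the route is deterministic, two communications contend exactly when their link lists share an element, which is what a contention-aware scheduler has to check.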

Energy model
We adopted the energy model presented in Ogras et al. (2009), where the total energy consumption is the sum of the energy consumed by all task nodes and all communication nodes:

$$E_{total} = \sum_{v_i \in V} E_i + \sum_{v_u \in V^*} E_u$$

where $V$ and $V^*$ denote the task nodes and communication nodes respectively. $E_i$ is the energy consumed in the execution of a task $v_i$ mapped on processor $pe_k$ that belongs to VFI $c_j$. It is determined by $C_{eff}^{k}$, the effective switched capacitance of $pe_k$; $V_{dd}^{j}$, the supply voltage; $L_g$, the number of logic gates; $K_3$, $K_4$ and $K_5$, which are technology specific parameters; and $v_{bs}$ and $I_{jn}$, the body-bias voltage and leakage current respectively. $E_u$ is the energy consumed in the execution of a communication node $v_u$ that traverses the route $R_u = \langle L_1, L_2, \ldots, L_l \rangle$. The parent and child task nodes of $v_u$ are $v_p$ and $v_c$ respectively. $E_u$ is calculated from $E_{bit}(src, dest)$, the energy consumed to transmit one bit from the source tile to the destination tile; $E_{bit}(src, dest)$ is discussed in detail in Ogras et al. (2009).
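For orientation, the two per-node energy terms can be written out under explicit assumptions. The expressions below are not taken from this text. The first assumes the commonly used dynamic-plus-leakage power model built from the parameters listed above, multiplied by the task execution time $et(v_i, k)$ at frequency $f_j$. The second assumes that communication energy is the data volume times the per-bit energy. The symbol $x_{(p,c)}$ for the weight of the edge from $v_p$ to $v_c$ is a hypothetical name.

```latex
% Assumed form: (dynamic power + leakage power) x execution time
E_i = \Big( C_{eff}^{k} \, (V_{dd}^{j})^{2} \, f_j
      + L_g \big( V_{dd}^{j} \, K_3 \, e^{K_4 V_{dd}^{j}} \, e^{K_5 v_{bs}}
      + |v_{bs}| \, I_{jn} \big) \Big) \cdot et(v_i, k)

% Assumed form: data volume (bits) x energy per transmitted bit
E_u = x_{(p,c)} \cdot E_{bit}(src, dest)
```

Under these assumptions, lowering $V_{dd}^{j}$ reduces the dynamic term quadratically, which is why per-island voltage assignment has such a strong effect on $E_i$.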

Offline pipelined scheduling
In this section we discuss our coarse-grained task level pipelining, or re-timing, approach. Before presenting the proposed approach, we briefly explain re-timing itself.
The notion of re-timing was originally introduced by Leiserson and Saxe (1991) to reduce the cycle period of synchronous circuits. Recently, Wang et al. (2011, 2013, 2014) extended re-timing to schedule applications represented by a classical DAG (Directed Acyclic Graph) task model on MPSoCs. From the point of view of the program, re-timing regroups the loop body such that some or all dependencies within a period are transformed into inter-period dependencies. A re-timing function $RT$ assigns each node $v_i$ a re-timing value $RT(v_i)$. The re-timing function is valid if no reference is made to data from a future period, that is, if for every edge $(v_i, v_j) \in E$:

$$RT(v_i) - RT(v_j) \geq 0$$

If $RT(v_i) - RT(v_j) < 0$, the re-timing function is illegal because this condition implies a reference to unavailable data from the future.

Figure 3(a) shows the schedule, generated by CA-TMES-Search, of the first three periods of the DAG shown in Fig. 4(a). Figures 4(b) and 4(c) show the re-timing values of the nodes generated by R-DAG and by our approach respectively. The application is scheduled on an MPSoC that consists of two processors, $pe_1$ and $pe_2$. Table 1 shows the execution time and energy consumption of each task of the CTG in Fig. 4(a) on the two processors. Compared to $pe_1$, $pe_2$ is more energy-efficient. Notice that CA-TMES-Search fails to efficiently utilize the more energy-efficient processor $pe_2$ because it favours the processor on which the task can start the earliest. $pe_2$ remains idle until $v_1$ completes execution on $pe_1$. This is because of the intra-period dependency between $v_1$ and $v_2$. CA-TMES-Search cannot utilize this slack.
Figure 3(c) shows the schedule generated by our approach. Notice that if intra-period dependencies can be transformed into inter-period dependencies, then the wasted slack can be utilized. This is achieved by regrouping tasks from different periods and rescheduling the computation and communication nodes. As each task is periodic, in Fig. 3(c) we reschedule the periodic task $v_1$ and execute it one period before $v_2$ and $v_3$. The newly added period is called the prologue. In this way the data required by $v_2$ and $v_3$ is available at the start of each period, so $v_2$ can start early on $pe_2$. Consequently, task $v_4$ can also be scheduled on $pe_2$. Since our approach utilizes the available resources more efficiently, it is able to generate a more energy-efficient schedule. The energy consumption of the schedule in Fig. 3(a) is 7.5 nJ, whereas the schedule in Fig. 3(c) consumes 7 nJ. Our approach can further reduce the energy to 6 nJ if the MPSoC has another energy-efficient processor like $pe_2$.
Although re-timing is effective in reducing energy consumption, there is a cost associated with it: it adds a prologue. The prologue latency is the number of periods in the prologue times the period $D$. The number of periods in the prologue is equal to the maximum re-timing value $RT_{max}$ of the nodes in $G$. Thus the prologue latency is:

$$prologLatency = RT_{max} \times D \tag{4}$$

Besides energy reduction, we also want to minimize prologLatency. Figure 3(b) shows the re-timed schedule generated by R-DAG. Compared to this, the prologue latency of the re-timed schedule generated by our approach is half. We are able to reduce the prologue latency because we take a different approach compared to R-DAG. We first transform the DAG into an independent task set by relaxing the precedence constraints between the nodes, and then we schedule the independent task model onto the MPSoC. Hence the MPSoC resources are maximally utilized and do not remain idle due to precedence constraints between nodes. Finally, we calculate the re-timing values of the nodes and generate the re-timed schedule.
Algorithm 1 describes our offline pipelined scheduling approach, which has two main steps. Firstly, we generate the relaxed schedule of the DAG $G$ on the MPSoC by relaxing the precedence constraints between the nodes in $G$. We discuss our offline scheduling approach in detail in Sect. 5.
Secondly, we calculate the re-timing values of the nodes. Given a node $v_i$ and its child node $v_j$, our re-timing function is defined in terms of the start time of $v_j$ and the finish time of $v_i$. Finally, we generate the re-timed schedule (Algorithm 1, line 8).

Fig. 4 Periodic application: (a) DAG, (b) DAG with re-timing values using R-DAG, (c) DAG with re-timing values using our approach
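To make the second step concrete, the sketch below derives re-timing values from a relaxed schedule. The rule used here is an assumption for illustration and is not the paper's re-timing function: a parent is moved ahead of its child by the smallest number of periods that makes the parent's finish time no later than the child's start time. The names `retiming_values` and `prologue_latency` and the example schedule are hypothetical.

```python
import math


def retiming_values(edges, start, finish, period):
    """Assign each node the number of periods it is executed ahead of the sinks.

    Assumed rule: for an edge (u, v), u must lead v by
    ceil((finish[u] - start[v]) / period) periods when that value is positive.
    """
    children = {}
    for u, v in edges:
        children.setdefault(u, []).append(v)

    rt = {}

    def visit(u):
        if u in rt:
            return rt[u]
        best = 0  # nodes without children keep re-timing value 0
        for v in children.get(u, []):
            shift = max(0, math.ceil((finish[u] - start[v]) / period))
            best = max(best, visit(v) + shift)
        rt[u] = best
        return best

    for node in start:
        visit(node)
    return rt


def prologue_latency(rt, period):
    """Prologue latency as RT_max times the period D (Eq. 4)."""
    return max(rt.values()) * period


# Hypothetical relaxed schedule of a four-task DAG with period D = 10.
edges = [("v1", "v2"), ("v1", "v3"), ("v2", "v4"), ("v3", "v4")]
start = {"v1": 0, "v2": 0, "v3": 4, "v4": 5}
finish = {"v1": 4, "v2": 5, "v3": 8, "v4": 9}
rt = retiming_values(edges, start, finish, 10)
print(rt, prologue_latency(rt, 10))
```

By construction every edge satisfies $RT(v_i) - RT(v_j) \geq 0$, so the result is a valid re-timing in the sense defined earlier; whether it is the minimal one depends on the relaxed schedule it is given.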

Contention-aware static energy optimization approach
The details of our proposed ARSH-FATI meta-heuristic are given in Algorithm 2. Before we explain our algorithms, we first define an extended graph $G_e$, illustrated in Fig. 5. Two kinds of events exist in an application: communications and tasks. In order to schedule the communications using traditional DAG based scheduling approaches, we transform a DAG $G$ into an extended graph $G_e$. Given a task to processor mapping, we construct $G_e$ by inserting an additional node $v_s$ into graph $G$ for each edge $(v_i, v_j)$ whose tail node $v_i$ and head node $v_j$ are mapped on different processors. The edge $(v_i, v_j)$ is replaced in the extended graph by two edges, $(v_i, v_s)$ and $(v_s, v_j)$. The additional inserted nodes are called communication nodes. The extended graph is represented by $G_e(V + V^*, E)$, where $V$ denotes the task nodes, $V^*$ the communication nodes, and $E$ the edges.
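The construction of the extended graph can be sketched in a few lines. This is an illustrative sketch only: the function name `build_extended_graph` and the tuple used to label a communication node are assumptions made here.

```python
def build_extended_graph(edges, mapping):
    """Insert a communication node on every edge that crosses processors.

    edges:   list of (parent, child) task pairs of the DAG G
    mapping: dict task -> processor id
    Returns (communication_nodes, extended_edges).
    """
    comm_nodes = []
    ext_edges = []
    for u, v in edges:
        if mapping[u] != mapping[v]:
            # Hypothetical label for the inserted communication node v_s.
            s = ("comm", u, v)
            comm_nodes.append(s)
            ext_edges.append((u, s))
            ext_edges.append((s, v))
        else:
            # Same processor: no network transfer, so the edge is kept.
            ext_edges.append((u, v))
    return comm_nodes, ext_edges


comm, ext = build_extended_graph(
    [("v1", "v2"), ("v1", "v3")],
    {"v1": 1, "v2": 2, "v3": 1},
)
print(comm)
print(ext)
```

Because the set of communication nodes depends on the mapping, the extended graph has to be rebuilt for every candidate mapping that the search evaluates.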

[Fig. 5: extended graph $G_e$; legend: communication node, task node]
Algorithm 2 (excerpt): Find the generation best $\pi_b$ and worst $\pi_w$ members; Generate the extended graph $G_e$ for the task to processor mapping determined by $\pi$;

The robustness of the ARSH-FATI algorithm lies in updating the parameter dimensional rate (DR) at run-time during the search process. Our algorithm attains a satisfactory trade-off between the exploration and exploitation attributes of the search process. We define the parameter DR as the percentage of tasks that are probabilistically re-mapped to generate a new solution (mapping) from the current (best and worst) solutions. The need for re-mapping only a percentage of the tasks, and not all of them, stems from the sensitivity of energy consumption to task mapping in this energy optimization problem. In other words, re-mapping even a small subset of tasks can generate a schedule whose energy consumption differs significantly from that of the schedule generated by the original mapping. Thus, the role of DR is to adjust the exploration and exploitation features of the ARSH-FATI meta-heuristic at run-time, as we explain in the following.
1. Setting the initial value of DR: We set DR to an initial value $DR_0$ (Line 3). $DR_0$ can take any value in the range $0 < DR_0 \leq 1$. A higher DR value means a more explorative search that leads to large and unconstrained step sizes. In contrast, a small value of DR motivates a more exploitative search by allowing small and conservative steps in the search space. Therefore we set $DR_0 = 0.4$.

2. Population Generation: We initially generate a matrix $\pi$ of zeros (Line 1) with one row per member of the population and $n$ columns, where $n$ is the total number of tasks. Each member of the population reflects a possible task to processor mapping, and the entire population reflects different mappings. The value of the element $\pi[i][j]$ represents the processor where the $j$-th task is mapped in the $i$-th member of the population. This value is a positive integer that identifies the processor. For example, if the value of the third element is 2, task $v_3$ is mapped on $pe_2$. We use the notation $\pi[i][:]$ to access the $i$-th member of the population. We generate an initial value for each member of the population by randomly mapping tasks to processors. For each member we calculate a fitness value from $e$, the total energy consumption, and $m$, the make-span of the schedule; $e$ and $m$ are obtained by executing Algorithm 3. Before we proceed with our discussion we define the following two terms: (a) Global best, gBest, is the member of the population that has the highest fitness value, gBestFit, in all generations. (b) Generation worst, $\pi_w$, is the member of the population that has the lowest fitness value in a generation.

3. Population Refinement: We refine the initial population until the termination criteria are satisfied (Lines 12-36). In each generation we update each member of the population. The $j$-th element of the $i$-th member is updated using two random numbers, $r_1$ and $r_2$. The term weighted by $r_1$, $r_1\,(gBest[j] - \pi[i, j])$, reflects the likelihood that the member moves closer to the global best, and the term weighted by $r_2$ reflects the likelihood that it moves away from the worst member of the population. We use an acceptance probability function $P(\Delta f, T)$ to adopt or reject the new value of the member $\pi[i, :]$ (Lines 18-24). The parameter $\Delta f$ is the difference between the new and old fitness values of $\pi[i, :]$, $\Delta f = f_{new} - f_{old}$. The parameter $T$ is referred to as the temperature. Under $P(\Delta f, T)$, a new mapping that reduces the energy consumption is always accepted. If the new mapping is worse than the current mapping, there is still a probability that it is accepted. We have included this feature in ARSH-FATI to prevent it from getting stuck in a local optimum. The temperature $T$ is reduced in each iteration by multiplying it by a cooling factor that lies strictly between 0 and 1 (Line 27). The cooling factor is calculated from maxGens, $T_2$ and $T_1$, where $T_2$ is a very large number and $T_1$ is a very small number. Initially the temperature is set to $T_2$, and it reduces to $T_1$ as the optimization finishes. In each generation we update DR (Line 25) using the dimensional rate adaption parameter, whose value lies strictly between 0 and 1. In this work it is set to 0.98. This parameter sets the new value of DR during the optimization process. The values of $DR_{min}$ and $DR_{max}$ are respectively the lower and upper bounds on DR, and must be set subject to the constraint $0 < DR_{min} < DR_{max} < 1$. We avoid excessive exploration and exploitation by setting $DR_{min}$ and $DR_{max}$ to 0.2 and 0.6 respectively.
4. Termination criteria: ARSH-FATI terminates if either the generation count reaches the maximum number of generations, maxGens, or no improvement is observed in gBest for a preset number of consecutive generations.
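The building blocks named in the steps above can be sketched as small helper functions. These are illustrations under stated assumptions rather than the authors' formulas: the acceptance rule is assumed to be the usual simulated-annealing form $\exp(\Delta f / T)$ for $\Delta f < 0$, the cooling factor is assumed to be geometric so that the temperature falls from $T_2$ to $T_1$ in exactly maxGens steps, and all function names are hypothetical.

```python
import math
import random


def random_population(pop_size, n_tasks, n_procs, rng):
    """One row per member; entry j is the processor (1..n_procs) of task j."""
    return [[rng.randint(1, n_procs) for _ in range(n_tasks)]
            for _ in range(pop_size)]


def cooling_factor(t_high, t_low, max_gens):
    """Factor c in (0, 1) such that t_high * c**max_gens equals t_low."""
    return (t_low / t_high) ** (1.0 / max_gens)


def acceptance_probability(delta_f, temperature):
    """Assumed Metropolis-style rule on the fitness difference delta_f.

    Improvements (delta_f >= 0) are always accepted; a worse mapping is
    accepted with probability exp(delta_f / temperature).
    """
    if delta_f >= 0:
        return 1.0
    return math.exp(delta_f / temperature)


def clamp_dr(dr, dr_min=0.2, dr_max=0.6):
    """Keep the dimensional rate inside [dr_min, dr_max]."""
    return max(dr_min, min(dr_max, dr))


rng = random.Random(0)
population = random_population(pop_size=4, n_tasks=5, n_procs=3, rng=rng)
c = cooling_factor(t_high=1000.0, t_low=0.001, max_gens=100)
print(len(population), len(population[0]), round(c, 4))
```

With a high temperature early on, `acceptance_probability` is close to 1 even for worse mappings, so the search explores; as the temperature approaches $T_1$ the probability collapses and the search becomes exploitative, which mirrors the role that DR plays on the step size.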

Earliest edge consistent deadline first (EECDF) algorithm
Before we describe EECDF we define some notation. The worst case execution time of a task node $v_i$ mapped on processor $pe_k$ operating at frequency $f_j$ is

$$et(v_i, k) = \frac{NCC(v_i, k)}{f_j}$$

where $NCC(v_i, k)$ is the worst case number of clock cycles of $v_i$ on processor $pe_k$. Each task node $v_i$ has a start time and a finish time. Similarly, for a communication node $v_j$ (corresponding to edge $(v_a, v_b)$), the transmission time on a link $L$ between processors $pe_s$ and $pe_d$ depends on the link width $b_w$ and on $f_s$ and $f_d$, the frequencies of $pe_s$ and $pe_d$ respectively. The start and finish times of $v_j$ are defined per link $L$. Given the task to processor mapping, the operating frequencies of the processors and a DAG $G$, we calculate the edge consistent deadline (ECD) of each node by the following dynamic programming algorithm.
Traverse the DAG $G$ in reverse topological order. If the task $v_i$ is a sink node, its ECD $d'_i$ is equal to its pre-assigned deadline $d_i$; otherwise $d'_i$ is derived from the ECDs of the nodes in $ISucc_i$, the set of immediate successors of $v_i$. The ECD $d'_j$ of a communication node is the same as that of its parent (task) node.
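The dynamic programming pass can be sketched as below. The recurrence for non-sink nodes is an assumption, chosen to match the usual successor-consistent deadline rule: a node must finish early enough for each immediate successor to still meet its own ECD, so $d'_i = \min\big(d_i, \min_{v_j \in ISucc_i} (d'_j - et_j)\big)$. Communication delays are ignored and the function name is hypothetical.

```python
def edge_consistent_deadlines(succ, deadline, exec_time):
    """Compute an ECD per node with memoised recursion over successors.

    succ:      dict node -> list of immediate successors (ISucc)
    deadline:  dict node -> pre-assigned deadline d_i
    exec_time: dict node -> worst case execution time on its processor
    Visiting successors first is equivalent to a reverse topological pass.
    """
    ecd = {}

    def visit(v):
        if v in ecd:
            return ecd[v]
        children = succ.get(v, [])
        if not children:
            # Sink node: ECD equals the pre-assigned deadline.
            ecd[v] = deadline[v]
        else:
            # Assumed rule: leave room for every successor to execute.
            latest = min(visit(c) - exec_time[c] for c in children)
            ecd[v] = min(deadline[v], latest)
        return ecd[v]

    for node in deadline:
        visit(node)
    return ecd


succ = {"v1": ["v2", "v3"], "v2": ["v4"], "v3": ["v4"], "v4": []}
deadline = {"v1": 20, "v2": 20, "v3": 20, "v4": 20}
exec_time = {"v1": 4, "v2": 5, "v3": 4, "v4": 4}
print(edge_consistent_deadlines(succ, deadline, exec_time))
```

Nodes closer to the sources receive tighter ECDs, which is what lets a minimum-ECD-first policy respect the original dependency structure even after the precedence constraints are relaxed.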
The EECDF algorithm is described in Algorithm 3. It performs five major steps:
1. Calculate the ECD of each task $v_i \in G$ (Line 1).
2. Relax the precedence constraints between the nodes in $G_e$. We generate the schedule under no precedence constraints, so no slack is wasted due to precedence constraints.
3. Create a ready queue $R$ and insert all the source nodes of $G_e$ into $R$ (Line 2).
4. Find the node $v_i$ that has the minimum ECD in $R$ and schedule it. Then delete $v_i$ from $R$ and insert all the ready nodes of $G_e$ into $R$. Repeat this until $R$ is empty (Lines 3-10).
5. Calculate the energy $E$ and make-span $m$ of the schedule.
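The steps above can be sketched as a priority-queue loop. This is a simplified illustration, not Algorithm 3 itself: the seven scheduling rules and the energy calculation are omitted, a node is assumed to become ready once all of its predecessors have been scheduled, and its start time is assumed to depend only on when its resource (a processor, or a link for a communication node) becomes free. All names are hypothetical.

```python
import heapq


def eecdf_order(succ, ecd, resource, duration):
    """Schedule nodes in minimum-ECD-first order on their assigned resources.

    succ:     dict node -> list of immediate successors in the extended graph
    ecd:      dict node -> edge consistent deadline
    resource: dict node -> processor or link that executes the node
    duration: dict node -> execution or transmission time
    Returns (schedule, makespan) with schedule[node] = (start, finish).
    """
    preds_left = {v: 0 for v in ecd}
    for u in succ:
        for v in succ[u]:
            preds_left[v] += 1

    # Ready queue R, initialised with the source nodes, ordered by ECD.
    ready = [(ecd[v], v) for v in ecd if preds_left[v] == 0]
    heapq.heapify(ready)

    free_at = {}   # time at which each resource becomes idle
    schedule = {}
    while ready:
        _, v = heapq.heappop(ready)
        # Relaxed precedence: only resource availability delays the start.
        start = free_at.get(resource[v], 0)
        finish = start + duration[v]
        schedule[v] = (start, finish)
        free_at[resource[v]] = finish
        for c in succ.get(v, []):
            preds_left[c] -= 1
            if preds_left[c] == 0:
                heapq.heappush(ready, (ecd[c], c))

    makespan = max((f for _, f in schedule.values()), default=0)
    return schedule, makespan


succ = {"v1": ["v2", "v3"], "v2": ["v4"], "v3": ["v4"], "v4": []}
ecd = {"v1": 11, "v2": 16, "v3": 16, "v4": 20}
resource = {"v1": "pe1", "v2": "pe2", "v3": "pe1", "v4": "pe2"}
duration = {"v1": 4, "v2": 5, "v3": 4, "v4": 4}
print(eecdf_order(succ, ecd, resource, duration))
```

In this example `v2` starts at time 0 on `pe2` even though its parent `v1` finishes at time 4, which is exactly the slack that the relaxed schedule exposes and that re-timing later makes legal.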
We define seven rules to schedule the highest priority node $v_i \in R$. The first three rules (R1 to R3) deal with the schedule of a task node and the remaining four (R4 to R7) deal with the schedule of a communication node.
Task scheduling rules: The schedule of a task node v_i is obtained by applying the following rules collectively in order: The release time of each unscheduled task node v_j mapped on the same processor as v_i is
Communication scheduling rules: In communication scheduling, network resources such as links are treated as processors, in the sense that each communication can use only one resource at a time. Hence, communication nodes are scheduled on the links for the time they occupy them. Consider a communication node v_j whose source is mapped on pe_src and whose destination is mapped on pe_dst. The routing algorithm used by the network generates the route R_j from pe_src to pe_dst. The route R_j = <L_1, L_2, …, L_l> is an ordered list of links, where L_1 is the first link and L_l is the last link on the route. The schedule of v_j is obtained by applying the following rules collectively in order:
Algorithm 3: EECDF
Input: a DAG G, an extended DAG G_e, task-to-processor mapping vector α
Output: energy e and make-span m of the schedule
1 Calculate the ECD d_i of each task v_i ∈ G;
2 Relax the precedence constraints in G_e;
3 Insert all source nodes into a ready queue R;
4 while there are ready nodes in R do
5     Find the node v_i with the minimum ECD in R;
6     if v_i is a task node then
7         Schedule v_i subject to rules R1, R2 and R3;
8     else
9         Schedule v_i subject to rules R4, R5, R6 and R7;
10    Delete v_i from R;
11    Insert all ready nodes into R;
12 Calculate the energy e and make-span m of the schedule;
In summary, we presented a contention-aware, population-based search algorithm for task scheduling that can switch between explorative and exploitative search modes for a better performance trade-off. We also developed a novel energy-aware re-timing approach that efficiently exploits the idle slack in the processors while significantly reducing the prologue for real-time streaming applications. Moreover, we performed task ordering and mapping in an integrated manner.
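The control flow of Algorithm 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: rules R1 to R7 are not reproduced, the energy calculation is omitted, and each node (task or communication) is assumed to occupy exactly one resource, whereas a real communication node occupies every link on its route. The function name eecdf_order and the tuple layout (ecd, duration, resource) are hypothetical choices made for this illustration.

```python
import heapq
from collections import defaultdict

def eecdf_order(nodes, edges):
    """nodes: {name: (ecd, duration, resource)}; edges: list of (parent, child).
    Returns (schedule, makespan), where schedule maps name -> (start, finish)."""
    children = defaultdict(list)
    pending = {n: 0 for n in nodes}      # number of unscheduled parents per node
    for parent, child in edges:
        children[parent].append(child)
        pending[child] += 1

    # Ready queue R: initially the source nodes, ordered by ECD (name breaks ties).
    ready = [(nodes[n][0], n) for n in nodes if pending[n] == 0]
    heapq.heapify(ready)

    free_at = defaultdict(float)         # next idle time of each processor or link
    schedule = {}
    while ready:
        _, n = heapq.heappop(ready)      # ready node with the minimum ECD
        _, duration, resource = nodes[n]
        # Assumption: precedence is relaxed, so only resource availability
        # constrains the start time (the paper's rules R1-R7 are not modelled).
        start = free_at[resource]
        finish = start + duration
        free_at[resource] = finish
        schedule[n] = (start, finish)
        for c in children[n]:            # a child becomes ready once all parents are scheduled
            pending[c] -= 1
            if pending[c] == 0:
                heapq.heappush(ready, (nodes[c][0], c))

    makespan = max((f for _, f in schedule.values()), default=0.0)
    return schedule, makespan

# Hypothetical example: two tasks joined by one communication node, plus an independent task.
example_nodes = {
    "t1": (4, 2.0, "pe0"),
    "c12": (5, 1.0, "link1"),
    "t2": (8, 3.0, "pe1"),
    "t3": (6, 2.0, "pe0"),
}
example_edges = [("t1", "c12"), ("c12", "t2")]
example_schedule, example_makespan = eecdf_order(example_nodes, example_edges)
print(list(example_schedule), example_makespan)
```

In this example the nodes are scheduled in the order t1, c12, t3, t2, and the make-span is determined by processor pe0, which runs t1 and t3 back to back.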

Experimental setup and results
In this section, we explain the experimental setup used for the simulation, report the energy consumption values for different benchmarks, and discuss the results. We deploy 8 real benchmarks, listed in Table 2, adopted from the Embedded System Synthesis Benchmarks Suite (E3S), a well-known and widely adopted benchmark suite in task scheduling research Cheng et al. (2013).
We deploy the energy model of the Samsung Exynos 5422 chip described in Liu et al. (2015) for our simulations, using two types of processors. Type 1 processors are high-performance Cortex A15 (big) processors and type 2 processors are low-power Cortex A7 (little) processors. The Cortex A15 consumes ∼ 6 − 12 times more power than the Cortex A7 Lukefahr et al. (2014). The relative power consumptions and operating frequencies of both types are given in Table 3. The 70 nanometer (nm) processor technology parameters are adopted from Ali et al. (2018a). We used Matlab version R2016a to build the simulation environment and conducted the experiments on a hardware platform with an Intel (R) Xeon (R), i5-3570 CPU with a 3.50 GHz clock frequency, 16.00 GB of memory and a 10 MB cache.
We compare our ARSH-FATI results with the energy management approach presented by Han et al. (2015). The authors developed a contention-aware static mapping and scheduling scheme for a set of tasks with precedence constraints that reduces the makespan and the inter-VFI communications in order to lower the total energy consumption. They proposed two static heuristics, CA-TMES-Quick and CA-TMES-Search. CA-TMES-Quick performs task mapping first and then determines the routes for communications. CA-TMES-Search calculates the start time of the tasks and avoids communication contention. It selects the processor that offers the earliest start time (EST) for a task. Overall, CA-TMES-Search performs better than CA-TMES-Quick because it coordinates task mapping in an exhaustive way and consequently reduces the makespan significantly.

Results and discussions
VFI-NoC-MPSoCs are used as edge-computing devices in IoT for processing data-intensive applications that involve multimedia data. They are de-facto computing architectures because they provide higher throughput and performance than traditional controllers. Reducing energy consumption is a prime concern in edge-devices because higher energy consumption decreases the lifetime of the network. Moreover, a reduced prologue, i.e. latency, is required during video streaming in real-time applications such as surveillance and traffic monitoring. Therefore, we aim to reduce both the energy consumption of the edge-devices and the latency of the real-time streaming application in IoT. First, we generate results for two scenarios: a homogeneous VFI-NoC-MPSoC computing platform and a heterogeneous multiprocessing computing system. Second, we produce results for ARSH-FATI without re-timing, and then for ARSH-FATI combined with our re-timing approach on a VFI-NoC-HMPSoC.
Before we explain the impact of the ARSH-FATI meta-heuristic on energy performance, we discuss the impact of the important parameter DR on ARSH-FATI, which is shown in Figure 6. Initially we set DR = 0.3, although it can take values from 0.1 to 0.5 with negligible impact on the overall energy-efficiency during static task scheduling. The results demonstrate that the energy-efficiency of ARSH-FATI decreases slightly when DR = 0.1 or DR = 0.5 is used. However, ARSH-FATI automatically adjusts the DR value to attain maximum energy-efficiency. Setting DR = 0.1 initially means that ARSH-FATI performs insufficient exploration, whereas DR = 0.5 leads to excessive exploration. Therefore, we set DR = 0.3 as the nominal initial value for our ARSH-FATI meta-heuristic. ARSH-FATI converges and the DR value becomes relatively stable when the number of iterations (NI) reaches 200, while an infinitesimal variation occurs at 500. No further variations occur when NI > 500; therefore, we consider NI = 500 , = 0.9 , and = 5 for our experiments.

ARSH-FATI without re-timing
We generate results for 8 real benchmarks considering homogeneous and heterogeneous VFI-NoC-MPSoC computing platforms with 4 VFIs and 2 × 2 processors per VFI (PPI). First, we consider all the processors to be homogeneous, i.e. type 1, and we set the operating frequency to the maximum, f_max = 2.0 GHz. Second, we use a heterogeneous VFI-NoC-MPSoC architecture with both type 1 and type 2 processors. We randomly select the type of each processor to generate the heterogeneous multiprocessor computing platform, which ensures unbiased experimentation. Figure 7 shows the energy performance of our ARSH-FATI static task scheduler for the two scenarios compared to CA-TMES-Search and CA-TMES-Quick, the state-of-the-art task scheduling techniques. The x-axis denotes the real-world benchmarks while the y-axis represents energy consumption in joules (J). Not surprisingly, (ARSH-FATI)_homogeneous consumes less energy because our population-based meta-heuristic ARSH-FATI searches the solution space better, as it can switch between exploitation and exploration search modes during task scheduling. Our scheduler (ARSH-FATI)_homogeneous tends to map dependent tasks, i.e. parent and child nodes, on the same processor to reduce the communication energy E_c in Equation (1). Energy-efficiency increases further when we use a heterogeneous architecture for task scheduling. ARSH-FATI considers the energy performance profiles of the processors during task mapping and schedules high energy consuming tasks on energy-efficient processors such that the timing constraints are not violated. Next, we determine the impact of the PPI on energy consumption and examine the ability of ARSH-FATI to efficiently utilize the computing platform resources for real benchmarks. We set the number of VFIs to NVFI = 4, use a heterogeneous computing system, and gradually set PPI = 2 × 2, 4 × 2, 4 × 3.
Figure 8 illustrates that the energy consumption of MP3-decoder, JPEG-encoder, and Robot reduces when PPI is increased gradually, whereas the other benchmarks do not demonstrate a noticeable reduction in energy consumption. MP3-decoder, JPEG-encoder, and Robot consist of a relatively higher number of task nodes and offer a higher degree of parallelism. We also examine the performance of ARSH-FATI on a complex real benchmark, Robot, which contains 88 task nodes. Compared to PPI = 2 × 2, ARSH-FATI achieves energy-efficiency gains of ∼ 13% and ∼ 18% for Robot when PPI = 4 × 2 and PPI = 4 × 3 are deployed, respectively. These results show that ARSH-FATI utilizes the resources efficiently by exploiting the degree of parallelism present within the real benchmarks. However, re-timing is necessary to further decrease the total energy consumption.
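The energy-profile-aware mapping described above (placing a task on the most energy-efficient processor that still meets its timing constraint) can be illustrated with a small Python sketch. This is not the ARSH-FATI search itself. The function pick_processor, the frequency and relative power numbers, and the assumption that a task needs the same number of cycles on both processor types are all hypothetical and chosen only for illustration; the actual values are those of Table 3.

```python
def pick_processor(cycles, deadline, processors):
    """Return the name of the processor type that meets the deadline with the
    least energy, or None if no type meets it.
    processors: {name: (frequency_hz, relative_power)} (hypothetical values)."""
    best, best_energy = None, float("inf")
    for name, (freq_hz, power) in processors.items():
        exec_time = cycles / freq_hz       # simplification: same cycle count on every type
        if exec_time > deadline:
            continue                       # timing constraint would be violated
        energy = power * exec_time         # energy = power x execution time
        if energy < best_energy:
            best, best_energy = name, energy
    return best

# Hypothetical platform: the big core is faster but draws 6x the power of the little core.
platform = {"big": (2.0e9, 6.0), "little": (1.4e9, 1.0)}

print(pick_processor(2.8e9, 2.5, platform))  # loose deadline
print(pick_processor(2.8e9, 1.5, platform))  # tight deadline
```

With the loose deadline the little core is selected because it finishes in time at lower energy; with the tight deadline only the big core is feasible. A full scheduler would also have to account for communication energy and contention, which this sketch ignores.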

ARSH-FATI with re-timing
Now we generate results for 8 real benchmarks on the VFI-NoC-HMPSoC architecture and integrate the ARSH-FATI meta-heuristic with our re-timing technique. We set VFIs = 4 and PPI = 2 × 2, and randomly generate a heterogeneous multiprocessor system for task scheduling. Figure 9 compares the energy consumption of (ARSH-FATI)_Heterogeneous, (ARSH-FATI)_Heterogeneous-re-timing, CA-TMES-Quick and CA-TMES-Search for the real benchmarks. The horizontal and vertical axes show the benchmarks and the energy consumption, respectively. The energy-efficiency increases significantly when we integrate our re-timing technique with (ARSH-FATI)_Heterogeneous. (ARSH-FATI)_Heterogeneous-re-timing achieves average energy savings of ∼ 20 %, ∼ 35 % and ∼ 40 % over (ARSH-FATI)_Heterogeneous, CA-TMES-Quick and CA-TMES-Search, respectively. This increase in energy-efficiency is due to transforming the dependent task model into an independent task model. In other words, intra-dependencies in data are transformed into inter-dependencies. This transformation enables the scheduling technique to efficiently utilize the available resources and the slack in the processors, which consequently reduces the energy consumption for real benchmarks. We compare the prologue latency of our re-timing approach with that of R-DAG Liu et al. (2009). We conducted the experiments on 8 real benchmarks to estimate the re-timing values.
In terms of energy consumption, both approaches perform similarly and there is no noticeable difference. However, the prologue latency of our approach is significantly lower than that of R-DAG. We compare the prologue latency in terms of the maximum re-timing value RT_max. The smaller the value of RT_max, the shorter the prologue latency. The RT_max values are summarized in Table 4. Our re-timing technique reduces RT_max by ∼ 50% when R-DAG is used as the baseline.
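To make the role of RT_max concrete, the following Python sketch computes re-timing values for a small DAG under one simple rule. It assumes that every edge (u, v) requires r(u) ≥ r(v) + 1 so that each intra-period dependency becomes an inter-period one, and it assigns the smallest values that satisfy this. This rule is an illustrative assumption and is neither the technique proposed here nor necessarily the exact rule used by R-DAG; the function name retiming_values is hypothetical.

```python
from functools import lru_cache

def retiming_values(nodes, edges):
    """Return ({node: r}, RT_max) with r(u) >= r(v) + 1 for every edge (u, v).
    The minimal such r is the longest path, counted in edges, from a node to a sink."""
    children = {n: [] for n in nodes}
    for u, v in edges:
        children[u].append(v)

    @lru_cache(maxsize=None)
    def r(n):
        # Sinks need no re-timing; every other node sits one period ahead of its
        # most re-timed child.
        return 0 if not children[n] else 1 + max(r(c) for c in children[n])

    values = {n: r(n) for n in nodes}
    return values, max(values.values(), default=0)

# Hypothetical four-node DAG: a -> b -> c and a -> d.
values, rt_max = retiming_values(["a", "b", "c", "d"],
                                 [("a", "b"), ("b", "c"), ("a", "d")])
print(values, rt_max)
```

Under this rule RT_max equals the depth of the DAG, so the prologue grows with the longest dependency chain. A technique that lowers RT_max, as reported in Table 4, shortens the prologue accordingly.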

Conclusion
The Network-on-Chip (NoC) based Multiprocessor System-on-Chip (MPSoC) architecture is becoming a de-facto computing platform for real-time, computationally intensive applications in the Internet-of-Things (IoT) due to its higher performance and Quality-of-Service (QoS). Reducing energy consumption at the edge-device level in IoT is important because higher energy consumption reduces the lifetime of the edge-device and the network. For the first time, we investigated contention-aware and energy-efficient scheduling of dependent tasks on a VFI-NoC-HMPSoC computing system. We proposed a novel meta-heuristic, ARSH-FATI, for energy-aware task scheduling. ARSH-FATI can dynamically switch at run-time between exploitative and explorative search modes for a better energy trade-off. We also integrated the communication contention-aware Earliest Edge Consistent Deadline First (EECDF) scheduling with ARSH-FATI to minimize the total energy consumption. We performed experiments on 8 real benchmarks for different scenarios. We also transformed the periodic DAG tasks into periodic independent tasks using a coarse-grained software pipelining, i.e. re-timing, mechanism for further energy savings. Our static scheduler outperformed the CA-TMES-Search and CA-TMES-Quick energy management techniques and achieved ∼ 15 % and ∼ 20 % average energy savings, respectively. The energy-efficiency increased up to ∼ 35 % and ∼ 40 % when ARSH-FATI was combined with our novel re-timing technique. Moreover, our re-timing technique reduces the prologue by ∼ 50% compared to the state-of-the-art approach R-DAG. We do not consider memory limits during task scheduling and re-timing; therefore, memory bounds can be violated. Consequently, one direction for future work is to develop a memory-aware, energy-efficient task scheduling approach for real-time streaming applications. Moreover, any voltage scaling technique can be integrated with ARSH-FATI to further increase the energy-efficiency.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.