1 Introduction

The semiconductor industry continues to build chips that accommodate a high density of very-large-scale integrated (VLSI) circuits. System-on-Chip (SoC) is the methodology used to integrate all essential components, such as IP cores (which include RAM, counters, interfaces, voltage regulators, etc.), onto a single chip [1, 28]. In the past few years, as the number of cores has increased, the design structure of SoCs has become more complex. Because of this complexity, SoC designs are less flexible and suffer from communication problems between different cores. Network-on-Chip (NoC) was introduced as a design concept for SoCs with communication support, providing a better and more powerful solution for connecting different intellectual property (IP) cores through a scalable interconnection network [2,3,4, 16]. An NoC architecture comprises interconnected devices such as processors (IP cores, DSPs, ASICs, etc.), routers, network interfaces, and communication links or channels, as shown in Fig. 1. Communication between different cores is achieved by sending and receiving packets over the interconnection network. NoC offers a flexible mechanism by supporting different interconnection networks and fault tolerance. The interconnection network communicates through various communication protocols, which are useful for enhancing the flexibility of systems when merged with VLSI [5, 6, 17].

Fig. 1 NoC mesh topology architecture

On-chip interconnection networks have benefits over shared wiring and buses, namely high bandwidth utilization, lower latency, low power consumption, scalability and flexibility. A lot of research has been done in the area of NoC to optimize the design; the main areas on which design needs to focus are topology generation, scheduling, mapping, routing and floorplanning [7, 22]. Each area plays its own important role in improving the performance of multiprocessor systems. In this paper, we mainly focus on the mapping of tasks onto a 2D NoC architecture. Task mapping consists of finding the best placement of tasks such that the mapping fulfills a set of requirements, such as lower energy consumption, reduced congestion and lower latency, while satisfying the bandwidth constraints. There are two approaches for mapping tasks onto cores: static and dynamic task mapping [8, 9, 23]. In static task mapping, tasks are placed at design time. Because the placement is decided at design time, static mapping can use different composite algorithms to investigate the SoC resources, which results in optimized solutions with lower energy consumption and efficient performance of the multiprocessor SoC. The major drawback of static mapping approaches is that they cannot handle newly arrived tasks or applications that may be loaded at run-time. To tackle this problem, dynamic or run-time mapping techniques were introduced, which can map dynamic tasks onto the topology at run-time. In this paper, we propose three mapping algorithms, HorMAP, RtMAP and DACMAP, for mapping tasks onto a topology having different cores, so that the latency, queuing time, service time and energy consumption of the topology are minimized.

2 Related work

Various mapping algorithms have been developed by different researchers to provide better performance in terms of energy consumption, latency, thermal behavior and bandwidth constraints. Ning et al. [10] proposed a mapping algorithm named GA-MMAS, which is a combination of Genetic Algorithm (GA) and MAX-MIN Ant System (MMAS), to optimize energy consumption for NoC. Jang et al. [11] proposed A3MAP, an Architecture-Aware Analytic Mapping algorithm that can be applied to regular mesh architectures with homogeneous cores as well as to irregular mesh or custom architectures with heterogeneous cores. The authors first developed an interconnection matrix for modelling any task graph and network, and then converted the task mapping problem to Mixed Integer Quadratic Programming (MIQP). As MIQP is an NP-hard problem, the authors proposed two heuristic techniques, a successive relaxation algorithm (A3MAP-SR) and a genetic algorithm (A3MAP-GA), to reduce the amount of traffic, and compared them on regular mesh, irregular mesh and custom networks. Yin et al. [12] proposed an application mapping technique that incorporates domain knowledge into a genetic algorithm (GA) to minimize the energy consumption of NoC communication.

The GA is initialized with knowledge of the network partition, whereas the genetic crossover operator is guided by communication demands. The effects of the domain-knowledge GA on the initial population and genetic operator are analyzed in terms of solution quality and convergence speed. Fen et al. [13] developed GAMR, a genetic-algorithm-based mapping and routing technique for 2D regular Network-on-Chip (NoC) architectures under bandwidth constraints. The main focus of the authors is to minimize energy consumption and maximize link bandwidth utilization of the NoC design. GAMR maps the IP cores of an application onto the NoC topology and generates a deterministic, deadlock-free, minimal routing path for every communication trace.

Wang explored bandwidth- and latency-based IP mapping, which maps a set of IP cores onto the tiles of a mesh NoC topology in order to minimize the power consumption of inter-core communication [14]. By analyzing the communication characteristics of different applications through their communication trace graphs, the author recognized two connectivity templates: graphs with tightly coupled vertices and graphs with distributed vertices. The author developed different mapping approaches for these templates: tightly coupled vertices are mapped onto topology tiles that are very close to one another, whereas distributed vertices are mapped according to a graph partition scheme given by the author. Murali et al. [15] introduced NMAP, a fast algorithm for mapping cores onto a mesh interconnection architecture under bandwidth constraints in order to minimize the average communication delay. The NMAP algorithm is designed for single minimum-path routing and also for split-traffic routing. The algorithm was applied to a DSP benchmark design, and simulation was carried out using the xpipes library. Yang et al. [18] divided the NoC-based MPSoC design process into two steps: scheduling subtasks to processing elements of appropriate type and quantity, and then mapping those processing elements onto the NoC topology. Particle swarm optimization (PSO) was used in the first step to reduce task execution time, task running cost and transfer cost. The outcome of the first step was a communication diagram, and the second step yielded the least network transmission delay, resource consumption and power consumption. Qianqi et al. [19] found Pareto-optimal solutions rather than the single solution usually obtained through scalarization. The authors proposed fault-tolerant routing and an improved particle swarm optimization to meet NoC requirements with the capability of searching for solutions. The proposed methods addressed the tradeoff between high performance and reliability of the system. Srinivasan et al. [20] presented a technique to reconfigure the network dynamically among different use-cases and explained how to integrate Dynamic Voltage and Frequency Scaling (DVS/DFS) techniques with use-case-centric NoC design. This dynamic reconfiguration of the NoC, along with the integration of DVS/DFS schemes, results in lower power consumption for NoC systems. Mehran et al. [21] proposed the SPIRAL algorithm for mapping tasks onto different cores, which minimizes energy consumption. To implement the SPIRAL algorithm, the authors used a 2D mesh topology with XY routing and used MATLAB to evaluate its performance. SMAP, a tool for generating random graphs, was also used. The authors compared the SPIRAL algorithm with random mapping and genetic mapping algorithms to show improved results in terms of energy consumption. However, even when there are very few tasks to be mapped, spiral mapping chooses the middle core first; as the middle core is farther from the task queue, processing becomes slower and mapping the tasks takes more time.

Marcon et al. [24] proposed a communication dependence and computation model (CDCM) for application mapping on regular NoCs. Using the CDCM technique, the authors estimated a 40% reduction in execution time and a 20% reduction in energy consumption. Celik discussed the effect of application mapping on NoC using network traffic that exhibits self-similarity [25]. The author considered queuing delay and packet loss rate in order to analyze the effect of application mapping. Jiawen et al. [26] proposed a logistic-function-based adaptive genetic algorithm (LFAGA) for energy-efficient mapping of applications on 3D NoC. The result of LFAGA is compared with the chaos-genetic mapping algorithm (CGMAP), showing a saving of 18.6% in energy consumption. Harmanani proposed an effective routing algorithm whose main concern is to minimize blocking in routing [27]. The author used a 2D mesh topology and benchmarks such as VOPD, DSP filter and LinearP15 to simulate the results.

The rest of the paper is organized as follows: Section 3 presents the problem formulation and mathematical representation of the proposed approach. Section 4 explains the existing random mapping algorithm. Section 5 briefly explains the proposed approach. The experimental results are presented in Section 6, followed by the conclusion and future work in Section 7.

3 Problem formulation

Before formulating the mapping problem, we assume that we are given an application characterized by a set of tasks that are to be scheduled onto NoC cores. To understand the mapping problem, some important definitions need to be explained.

Definition 1

A Logical Application Trace Graph (LATG) \(G = (A_{t}, E_{t})\) is a directed acyclic graph, where \(a_{t} \in A_{t}\) represents a task from the list of application tasks and \(c_{i,j} \in E_{t}\) is a directed arc between application tasks that shows the communication dependency between tasks \(a_{t1}\) and \(a_{t2}\). The logical application trace graph is depicted in Fig. 2. Each directed edge or arc has one property:

  • \(v(c_{i,j})\) represents the volume of bits transferred over arc \(c_{i,j}\).

Fig. 2 Logical Application Trace Graph

Definition 2

An NoC Architecture Characterization Graph (NACG) \(G = (T, L_{T})\) is an undirected graph, as shown in Fig. 3, where the vertices \(t_{i}, t_{j} \in T\) represent tiles in the NoC architecture and \(l_{k} = l_{i,j} = (t_{i}, t_{j}) \in L_{T}\) represents the routing path between \(t_{i}\) and \(t_{j}\). A routing path in the NACG has three properties:

  • \(e(l_{i,j})\) is the average energy consumed in transferring the bits of a task from \(t_{i}\) to \(t_{j}\).

  • \(Lat(l_{i,j})\) represents the average latency of a task from \(t_{i}\) to \(t_{j}\).

  • \(band(l_{i,j})\) is defined as the bandwidth of the link between \(t_{i}\) and \(t_{j}\).

Fig. 3 NoC Architecture Characterization Graph

Definition 3

A mapping function \({\varOmega}\) is represented as \({\varOmega } : A_{t} \rightarrow T\); it maps application tasks from the LATG onto tiles available in the NACG, where \(a_{t} \in A_{t}\), \({\varOmega}(a_{t}) \in T\), and \({\varOmega}(a_{t})\) denotes the mapped tile in the NACG. Fig. 4 shows the mapping of application tasks onto the NoC tile-based architecture.

Fig. 4 NoC mapping technique

Finally, the mapping problem is formulated as follows:

Given: an LATG \(G = (A_{t}, E_{t})\) and an NACG \(G = (T, L_{T})\),

Evaluate: a mapping function \({\varOmega } : A_{t} \rightarrow T\) that maps each task \(a_{t} \in A_{t}\) in the LATG to a tile \(t_{i} \in T\) in the NACG, such that energy consumption and average latency are minimized.

Fig. 5 represents the flow of mapping application tasks onto the topology in order to obtain optimized results in terms of energy consumption and average latency.

Fig. 5 Flowchart representation of application mapping onto topology

3.1 Energy model

The objective function is to minimize the energy consumption, which can be mathematically represented as:

$$ \begin{array}{@{}rcl@{}} \min \left\{\sum\limits_{\forall a_{t} \in A_{t}}^{} e_{{\varOmega}(a_{t})} + \sum\limits_{\forall c_{i,j} \in E_{t}}^{} v(c_{i,j})\right.\\ \left.\times \sum\limits_{l_{i,j} \in R_{{\varOmega}(a_{t1}) , {\varOmega}(a_{t2})}}^{\mid R_{{\varOmega}(a_{t1}) , {\varOmega}(a_{t2})} \mid} e(R_{{\varOmega}(a_{t1}) , {\varOmega}(a_{t2})})\right\} \end{array} $$
(1)

satisfying conditions as

$$ \forall a_{t} \in A_{t}, \forall {\varOmega}(a_{t}) \in T $$
(2)
$$ \forall a_{t1} \neq a_{t2}, \forall {\varOmega}(a_{t1}) \neq {\varOmega}(a_{t2}) $$
(3)
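
As an illustration of how a candidate mapping could be checked against conditions (2) and (3) and scored with the objective in Eq. 1, the following Python sketch may help. It is not the authors' implementation: the names `is_valid_mapping`, `mapping_energy` and `line_route`, the dictionary-based encoding of the graphs, the numeric values, and the reading of the inner sum in Eq. 1 as a sum of per-link energies \(e(l_{i,j})\) along the route are all assumptions made for this example.

```python
def is_valid_mapping(omega, tasks, tiles):
    """Check conditions (2) and (3) for a mapping omega: task -> tile."""
    # Condition (2): every task is mapped to a tile of the NACG.
    if not all(t in omega and omega[t] in tiles for t in tasks):
        return False
    # Condition (3): distinct tasks occupy distinct tiles.
    return len({omega[t] for t in tasks}) == len(tasks)


def line_route(src, dst):
    """Hypothetical routing function: tiles sit on a line, links are (i, i+1)."""
    lo, hi = min(src, dst), max(src, dst)
    return [(i, i + 1) for i in range(lo, hi)]


def mapping_energy(omega, volumes, tile_energy, link_energy, route):
    """Evaluate the objective of Eq. 1 for one mapping (assumed reading)."""
    # First term: energy of the tiles that host the tasks.
    compute = sum(tile_energy[omega[t]] for t in omega)
    # Second term: volume of each arc times the energy of the links it crosses.
    comm = sum(
        vol * sum(link_energy[link] for link in route(omega[a], omega[b]))
        for (a, b), vol in volumes.items()
    )
    return compute + comm


tasks = ["a1", "a2"]
tiles = {0, 1, 2}
omega = {"a1": 0, "a2": 2}
volumes = {("a1", "a2"): 10}               # v(c_{1,2}) in bits
tile_energy = {0: 1.0, 1: 1.0, 2: 2.0}     # illustrative values only
link_energy = {(0, 1): 0.5, (1, 2): 0.25}  # illustrative values only
cost = mapping_energy(omega, volumes, tile_energy, link_energy, line_route)
```

A search procedure would call `is_valid_mapping` to discard infeasible candidates and compare the remaining ones by `cost`.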

The average energy consumption for transferring a task from \(t_{i}\) to \(t_{j}\) can be represented as follows:

$$ E_{task}^{t_{i}, t_{j}} = N \times num_{hops} \times E_{Link} + N \times (num_{hops} - 1) \times E_{Router} $$
(4)

where \(E_{Link}\) and \(E_{Router}\) represent the energy consumption of a link and of a router, respectively. To compute \(E_{Router}\), we have to compute the energy consumption of the buffer (\(E_{Buffer}\)), of the crossbar switch (\(E_{Crossbar}\)) and of the arbiter (\(E_{Arbiter}\)), as given in Eq. 5. \(E_{Arbiter}\) is further divided into two parts (Eq. 6): (i) \(E_{Crossbar\_Allocation}\), the energy consumption of switch allocation, and (ii) \(E_{VC\_Allocation}\), the energy consumption of virtual channel allocation. \(E_{Link}\) can be computed as given in Eq. 7. The energy consumption of the topology, calculated over all N tasks, is given in Eq. 8.

$$ E_{Router} = E_{Buffer} + E_{Crossbar} + E_{Arbiter} $$
(5)
$$ E_{Arbiter} = E_{Crossbar\_Allocation} + E_{VC\_Allocation} $$
(6)
$$ E_{Link} = \frac{P_{Link}}{Freq.} $$
(7)
$$ E_{Total} = \sum\limits_{i = 1}^{N} E_{task_{i}} $$
(8)
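
The energy model of Eqs. 4 to 8 can be written down directly as a few functions. The Python sketch below is only an illustration: the function and parameter names are chosen here, the factor `n` stands for the N of Eq. 4 exactly as it appears in the formula, and the numbers in the usage lines are arbitrary rather than measured values.

```python
def router_energy(e_buffer, e_crossbar, e_crossbar_alloc, e_vc_alloc):
    """E_Router from Eq. 5, with E_Arbiter expanded according to Eq. 6."""
    e_arbiter = e_crossbar_alloc + e_vc_alloc      # Eq. 6
    return e_buffer + e_crossbar + e_arbiter       # Eq. 5


def link_energy(p_link, freq):
    """E_Link from Eq. 7: link power divided by operating frequency."""
    return p_link / freq


def task_energy(n, num_hops, e_link, e_router):
    """Energy of one task between two tiles, following Eq. 4 term by term."""
    return n * num_hops * e_link + n * (num_hops - 1) * e_router


def total_energy(task_energies):
    """E_Total from Eq. 8: sum of the per-task energies."""
    return sum(task_energies)


# Arbitrary example values, not taken from the paper's tables.
e_router = router_energy(1.0, 2.0, 0.5, 0.5)
e_link = link_energy(2.0, 4.0)
e_task = task_energy(1, 3, e_link, e_router)
e_total = total_energy([e_task, 0.5])
```

Keeping each equation in its own function makes it easy to substitute simulator-derived values for \(E_{Link}\) and \(E_{Router}\) without touching the rest of the model.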

3.2 Latency model

The mapping function for minimization of average latency of topology can be mathematically formulated as:

$$ \begin{array}{@{}rcl@{}} \min \left\{\sum\limits_{\forall a_{t} \in A_{t}}^{} Lat_{{\varOmega}(a_{t})} + \sum\limits_{\forall c_{i,j} \in E_{t}}^{} v(c_{i,j})\right.\\ \left.\quad\times \sum\limits_{l_{i,j} \in R_{{\varOmega}(a_{t1}) , {\varOmega}(a_{t2})}}^{\mid R_{{\varOmega}(a_{t1}) , {\varOmega}(a_{t2})} \mid} Lat(R_{{\varOmega}(a_{t1}) , {\varOmega}(a_{t2})})\right\} \end{array} $$
(9)

satisfying conditions as

$$ \forall a_{t} \in A_{t}, \forall {\varOmega}(a_{t}) \in T $$
(10)
$$ \forall a_{t1} \neq a_{t2}, \forall {\varOmega}(a_{t1}) \neq {\varOmega}(a_{t2}) $$
(11)

The latency from tile \(t_{i}\) to tile \(t_{j}\) can be computed according to Eq. 12. The overall latency for all N tasks is calculated by Eq. 13.

$$ Lat_{task}^{t_{i}, t_{j}} = N \times num_{hops} \times Lat_{Link} + N \times (num_{hops} - 1) \times Lat_{Router} $$
(12)
$$ Lat_{Total} = \sum\limits_{i = 1}^{N} Lat_{task_{i}} $$
(13)

Figure 6 shows the 3 × 3 NoC topology in the form of tiles, where each tile consists of cores (which can be IP cores, DSP cores, etc.) and routers (consisting of a crossbar switch, routing algorithm and arbitration logic). The latencies of a single task transferred across a channel are \(D_{injection}\) and \(D_{ejection}\), and the latencies of a task across a router are \(D_{switch}\), \(D_{routing}\) and \(D_{waiting}\). In Fig. 7, we consider the link injection latency (\(D_{injection}\)), the latency of the first router (\(D_{switch} + D_{routing}\)), the inter-tile latency (\(D_{waiting}\)), defined as the time the task takes to arrive at the destination from the moment the first bit is sent out from the source for a single hop, the second router latency (\(D_{routing} + D_{switch}\)), and the link ejection latency (\(D_{ejection}\)). The latency of a single hop can be calculated according to Eq. 14:

Fig. 6 NoC topology tile

Fig. 7 Latency flow in single hop

$$ \begin{array}{@{}rcl@{}} Latency\_single\_hop &=& D_{injection} + (D_{routing} + D_{switch}) + D_{waiting} \\ &&+ (D_{routing} + D_{switch}) + D_{ejection} \end{array} $$
(14)

To calculate the latency from the source core to the destination core, we assume that once a task arrives at the destination core, it is immediately accessible for processing by that core. In Fig. 8, the latency is considered from the source IP core to the destination IP core, passing through the routers \(R_{source}\), \(R_{intermediate}\) and \(R_{destination}\). The latency of a task with two hops between the source and destination cores (\(L_{source \rightarrow destination}\)) is calculated as given in Eq. 15, where \(W^{source}\), \(W^{intermediate}\) and \(W^{destination}\) represent the waiting times in the routers. The average latency of a task (L) can be calculated using Eq. 16, where \(P_{source \rightarrow destination}\) is the probability of a task being generated.

$$ \begin{array}{@{}rcl@{}} L_{source \rightarrow destination} &=& D_{injection} + (D_{routing} + W_{inj \rightarrow port}^{source} + D_{switch} )\\ &&+ D_{waiting} + (D_{routing} + W_{port \rightarrow port}^{intermediate} + D_{switch})\\ &&+ D_{waiting} + (D_{routing} + W_{port \rightarrow ejc}^{destination} + D_{switch})\\ &&+ D_{ejection} + (m - 1) (D_{switch} + D_{waiting}) \end{array} $$
(15)
$$ L = \sum\limits_{source} \sum\limits_{destination} P_{source \rightarrow destination} \times L_{source \rightarrow destination} $$
(16)

Fig. 8 Latency flow in two hops from source to destination core
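
To make the latency expressions concrete, Eqs. 14, 15 and 16 can be transcribed into Python as below. This is a sketch under stated assumptions: the function and argument names are invented for the example, `m` is passed through exactly as it appears in Eq. 15, and the delay values in the usage lines are arbitrary.

```python
def single_hop_latency(d_inj, d_routing, d_switch, d_waiting, d_ej):
    """Eq. 14: injection, first router, inter-tile wait, second router, ejection."""
    router = d_routing + d_switch
    return d_inj + router + d_waiting + router + d_ej


def two_hop_latency(d_inj, d_routing, d_switch, d_waiting, d_ej,
                    w_source, w_intermediate, w_destination, m):
    """Eq. 15: source, intermediate and destination routers with waiting times."""
    return (
        d_inj
        + (d_routing + w_source + d_switch)
        + d_waiting
        + (d_routing + w_intermediate + d_switch)
        + d_waiting
        + (d_routing + w_destination + d_switch)
        + d_ej
        + (m - 1) * (d_switch + d_waiting)
    )


def average_latency(probability, latency):
    """Eq. 16: latency of each (source, destination) pair weighted by its probability."""
    return sum(probability[pair] * latency[pair] for pair in probability)


# Arbitrary delays, chosen only to exercise the formulas.
one_hop = single_hop_latency(1, 2, 3, 4, 5)
two_hop = two_hop_latency(1, 2, 3, 4, 5, 1, 2, 3, m=2)
pairs_p = {("s1", "d1"): 0.5, ("s2", "d2"): 0.5}
pairs_l = {("s1", "d1"): one_hop, ("s2", "d2"): two_hop}
avg = average_latency(pairs_p, pairs_l)
```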

4 Existing random mapping algorithm

Several issues arise when using the random mapping algorithm for NoC: load balancing (as shown in Fig. 9), latency, service time and queuing time are not handled. In the random algorithm, tasks are mapped onto the cores randomly, as described in Algorithm 1. The worst case of the algorithm occurs when the same core is chosen every time for mapping a task. As all tasks are mapped onto the same core, new tasks remain in the queue and may wait indefinitely until the core is ready to process a new task. Once the core is available, the task is mapped onto it. In the best case, every core has an equal probability of being chosen, and tasks are mapped onto the cores uniformly. The best case of the random algorithm is rarely obtained. Consider a scenario in which the last core of the grid is chosen every time to map the tasks. In such a case, the latency involved in mapping the tasks onto the cores will be very high. Mapping tasks onto cores with the random algorithm therefore incurs high latency, service time, queuing time and energy consumption. To improve mapping performance, this paper proposes the horological, rotational, and divide and conquer mapping algorithms.

Fig. 9 Load balancing in random mapping algorithm

Algorithm 1 Random mapping algorithm
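
The random mapping described in this section can be sketched in a few lines of Python. This is an illustrative reading of the prose, not a reproduction of the original listing: the names `random_map` and `core_loads`, the use of core indices 0 to 63, and the fixed seed are assumptions made so the example is repeatable.

```python
import random
from collections import Counter


def random_map(num_tasks, num_cores, seed=None):
    """Assign every task to a uniformly chosen core index."""
    rng = random.Random(seed)
    return [rng.randrange(num_cores) for _ in range(num_tasks)]


def core_loads(placement, num_cores):
    """Count how many tasks landed on each core."""
    counts = Counter(placement)
    return [counts.get(core, 0) for core in range(num_cores)]


placement = random_map(num_tasks=64, num_cores=64, seed=0)
loads = core_loads(placement, 64)
busiest = max(loads)   # cores that received several tasks
idle = loads.count(0)  # cores that received none
```

Inspecting `busiest` and `idle` for different seeds shows the load imbalance the text attributes to random mapping: nothing prevents some cores from collecting several tasks while others stay empty.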

5 Proposed approach

In this section, the three proposed approaches are discussed, which prove to be better than the existing random mapping algorithm in terms of latency, load balancing and energy consumption. The first approach is the horological mapping algorithm, in which the cores are visited one by one, guaranteeing load balancing over the cores of the grid. The second approach is the rotational mapping algorithm, in which tasks are assigned to the cores in rotation, one by one, with the aim of minimizing the latency involved in mapping tasks. The third approach is the divide and conquer mapping algorithm, which provides an assurance of load balancing on the grid.

5.1 Horological algorithm

As the name suggests, in this mapping algorithm the tasks are mapped horologically onto the cores, one by one. Once a task is assigned to a core, the core processes it, and after processing, the core is ready to execute the next task in the queue. The cores are allotted a core_id horologically. The first task in the queue is allocated to the first core, the second task to the second core, and so on. When the task on some core is completed, a new task is allocated to this core. This algorithm produces good results in terms of load balancing on the cores, but the access time of a core increases as we move towards the last core, with the last core having the maximum access time. Fig. 10 shows the allocation of tasks on an 8 × 8 mesh topology. For instance, suppose there are 8 tasks to be mapped onto the cores; even if the core with core_id 8 is closer to the queue, no task will be assigned to it, and tasks will instead be assigned to the cores with core_id 0 to core_id 7. Horological mapping proves to be better than random mapping in terms of load balancing (as shown in Fig. 11), queuing time and service time. It also resolves the bottleneck issue existing in the random mapping algorithm. The horological mapping algorithm is given in Algorithm 2.

Fig. 10 Horological mapping algorithm

Fig. 11 Load balancing in horological mapping algorithm

Algorithm 2 Horological mapping algorithm
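
A minimal sketch of the horological assignment order, written in Python, is given below. It assumes that cores carry core_id values 0 to num_cores − 1 and that a core has finished its previous task by the time the assignment wraps around to it; task completion times are not modelled. The function names are chosen for this example only.

```python
def horological_map(num_tasks, num_cores):
    """Assign task k to core_id k mod num_cores, i.e. visit cores in core_id order."""
    return [task % num_cores for task in range(num_tasks)]


def core_loads(placement, num_cores):
    """Count how many tasks each core received."""
    loads = [0] * num_cores
    for core in placement:
        loads[core] += 1
    return loads


# The 8-task example from the text on an 8 x 8 mesh (64 cores).
first_eight = horological_map(8, 64)

# With 128 tasks every core is visited exactly twice.
loads_128 = core_loads(horological_map(128, 64), 64)
```

The 8-task case lands on core_id 0 to 7 regardless of how close any other core is to the queue, which is the access-time drawback noted in the text.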

5.2 Rotational mapping algorithm

The rotational mapping algorithm is proposed in this paper to minimize the latency involved in mapping a task onto a core, but it provides no assurance of load balancing, as shown in Fig. 12. In the rotational algorithm, tasks are mapped onto the cores in a rotational manner. The basic concern of the approach is to reduce the amount of time required for mapping a task onto a core. To achieve this goal, a task must be mapped onto the core placed nearest to the task allocation queue; so whenever a task has to be mapped, it is mapped onto the core that is nearest to the allocation queue and is in the ready state, i.e., ready to accept the task for execution. For this purpose, the ports of the routers are very important. In the rotational algorithm, the task to be mapped is routed to the elements (routers or cores) attached to the ports of the router. For each router, tasks are passed to each port in sequential order, from port zero to the last port. Once all the ports have been visited, the procedure repeats from the first port of the router to the last port. In this way, the algorithm is capable of mapping multiple tasks onto the cores until the task allocation queue is empty. As the procedure repeats for each router, considering all the ports every time, the algorithm is called the rotational algorithm. The rotational mapping algorithm is given in Algorithm 3.

Fig. 12 Load balancing in rotational mapping algorithm

Algorithm 3 Rotational mapping algorithm
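
One possible reading of the rotational procedure is a breadth-first walk over router ports that stops at the first ready core, which yields the ready core with the fewest hops from the allocation queue. The Python sketch below follows that reading and should be taken as an assumption-laden illustration: the port order (local, east, south, west, north), the placement of the allocation queue at router (0, 0), and the simplification that a core stays busy once it has received a task are all choices made here, not details given in the text.

```python
from collections import deque

# Assumed port order: local core first, then east, south, west, north.
PORTS = [(0, 0), (0, 1), (1, 0), (0, -1), (-1, 0)]


def rotational_pick(ready, n, start=(0, 0)):
    """Return the ready core nearest to the router at `start`, or None."""
    seen = {start}
    frontier = deque([start])
    while frontier:
        r, c = frontier.popleft()
        for dr, dc in PORTS:
            if (dr, dc) == (0, 0):
                # Local port: the core attached to this router.
                if ready[r][c]:
                    return (r, c)
                continue
            nr, nc = r + dr, c + dc
            if 0 <= nr < n and 0 <= nc < n and (nr, nc) not in seen:
                seen.add((nr, nc))
                frontier.append((nr, nc))
    return None


def rotational_map(num_tasks, n, start=(0, 0)):
    """Map tasks one by one onto the nearest ready core of an n x n mesh."""
    ready = [[True] * n for _ in range(n)]
    placement = []
    for _ in range(num_tasks):
        core = rotational_pick(ready, n, start)
        if core is None:          # no ready core left; remaining tasks wait
            break
        ready[core[0]][core[1]] = False
        placement.append(core)
    return placement


first_four = rotational_map(4, 3)
```

Because tasks always go to the nearest ready core, the cores close to the queue fill up first, which matches the observation that this scheme does not balance load.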

5.3 Divide and conquer mapping algorithm

In the divide and conquer mapping algorithm, the main emphasis is on load balancing on an n × n mesh topology. As the name suggests, the 2D mesh topology is first divided vertically into two (nearly equal) parts, and then the division is carried out horizontally. After each vertical and horizontal division, the topology is divided into 4 sub-grids of nearly equal dimensions (rows × columns), as shown in Fig. 13. Tasks from the task list, which is maintained as a queue, are mapped onto the sub-grids in such a way that the load is equally balanced over the mesh topology. For this purpose, each time a task has to be mapped, the grid and sub-grids are further divided both vertically and horizontally. The task is assigned to a core belonging to the sub-grid with the least number of tasks mapped. In this way, the tasks mapped onto the cores of the sub-grids are balanced, hence there is an assurance of load balancing during the mapping of tasks to cores. For instance, consider a simple scenario for mapping tasks onto the 8 × 8 mesh topology: the first task from the task list is mapped onto the first core of the first sub-grid. The second task is mapped onto the 5th core, belonging to the second sub-grid. Similarly, the 3rd task is mapped onto the 33rd core, belonging to the third sub-grid, and the next task is mapped onto the 37th core, which belongs to the fourth sub-grid. In this way, all tasks are mapped onto the mesh topology as shown in Fig. 14, yielding an NoC architecture with complete load balancing, as shown in Fig. 15. The divide and conquer mapping algorithm is given in Algorithm 4.

Fig. 13 Grid division into sub-grids

Fig. 14 Divide and conquer mapping algorithm

Fig. 15 Load balancing in divide and conquer mapping algorithm

Algorithm 4 Divide and conquer mapping algorithm

6 Experimental results

For implementation, we used the OMNeT++ simulator along with its in-built mapping package. To implement the proposed mapping algorithms, we considered a 2-dimensional 8 × 8 mesh topology for the NoC. Initially, the application tasks are maintained in the task list, which is organized as a queue. From that task list, tasks are mapped onto the cores following the proposed mapping algorithms described in Section 5. We performed simulations varying the number of tasks from 64 to 128 and compared the results in terms of latency, queuing time, service time and energy consumption. Fig. 16 shows the average latency of the proposed mapping algorithms for the mesh topology, compared with the random mapping algorithm.

Fig. 16 Average latency (in ns) of mapping algorithms in mesh topology

Figure 17 gives a graphical analysis of queuing time for the random and proposed mapping algorithms, and a comparison of total queuing time is given in Fig. 18. The service time required by each task, obtained using the OMNeT++ simulator, is shown in Fig. 19. The best mapping algorithm in terms of total service time can be identified from the comparative analysis shown in Fig. 20.

Fig. 17 Queuing time of mesh topology: (a) random mapping, (b) horological mapping, (c) rotational mapping, (d) divide and conquer mapping

Fig. 18 Total queuing time (in ns) of mapping algorithms in mesh topology

Fig. 19 Service time of mesh topology: (a) random mapping, (b) horological mapping, (c) rotational mapping, (d) divide and conquer mapping

Fig. 20 Total service time (in ns) of mapping algorithms in mesh topology

To compute the energy consumption of the topology, we used the Orion 2.0 simulator. With the help of Orion, we calculate the energy consumption of a link, represented as \(E_{Link}\), and the energy consumption of a router, represented as \(E_{Router}\). Table 1 shows the energy of the router at different loads. Table 2 gives the energy consumption of the link at different link lengths and loads. Using Eq. 4, we compute the energy consumption of individual cores as well as the energy consumption of the topology, as shown in Figs. 21 and 22. Table 3 compares the proposed and random mapping algorithms in terms of average latency, total queuing time and total service time for the mesh topology.

Fig. 21 Energy consumption of tasks in (a) random mapping, (b) horological mapping, (c) rotational mapping, (d) divide and conquer mapping

Fig. 22 Comparison of energy consumption (in pJ) of mapping algorithms

Table 1 Energy of router (in pJ) at different load
Table 2 Energy of link (in pJ) at different load and link length (in mm)
Table 3 Comparison of average Latency, total queuing time and total service time of mapping algorithms (in ns)

7 Conclusion and future work

In this paper, we have proposed mapping algorithms for tile-based NoC mesh topologies that map application tasks onto NoC tiles such that energy consumption and average latency are minimized while satisfying performance constraints. Processing tiles with high computational power in a big-little approach are widely used in ARM-based SoCs such as Apple's M1 processor. These task mapping algorithms can be applied dynamically to such clusters of processing cores with optimized QoS. As future work, our main emphasis is to apply these mapping algorithms, blended with machine learning, dynamically over different NoC topologies. A possible further extension is the formulation of an efficient mapping algorithm for different 3D NoC tile-based architectures.