Energy aware scheduling model and online heuristics for stencil codes on heterogeneous computing architectures

Performance of high-end supercomputers will reach the exascale through the advent of core counts in billions. However, in the upcoming exascale computing era it is important not only to focus on the performance, but also on scalability of fine-grained parallel applications, data locality and energy aware scheduling within the parallel code. In fact, parallel applications need to change even now by redesigning algorithms and data structures respectively to take advantage of the recent improvements in energy efficiency of heterogeneous computing hardware, including multicore processors and GPU accelerators. Over the next few years one of the biggest challenges for exascale will be the ability of parallel applications to fully exploit locality which will, in turn, be required to achieve expected performance and energy efficiency. Future highly parallel applications will have to deal with deep memory hierarchies taking into account energy cost in moving data off-chip. Therefore, they will have to apply new coordinated scheduling approaches to balance energy aware resource utilization and minimize work starvation during runtime. As new constraints and limits on memory bandwidth and energy will play a key role in high performance computing (HPC) in the future, more sophisticated and dynamic scheduling techniques will be needed and applied within the parallel code. In this paper we focus on an energy-aware distribution of the stencil workload on heterogeneous processors. Our analysis of energy and performance models focused on relevant class of stencil computations to explore the relationship between task scheduling algorithms and energy constraints. More precisely, we search for a schedule which minimizes the energy usage within a specified computation’s deadline of the stencil workload on heterogeneous architectures. Since the problem is computationally intractable, we present an integer linear programming formulation for finding optimal schedules. As finding optimal schedules is time consuming we have developed four heuristics and tested them experimentally with respect to optimal solutions. In our work we focus on a single node configurations with heterogeneous processors. These configurations represent the state of the art multi- and many-core architectures.


Introduction
Stencil computations as relevant class of applications occur in many HPC codes on block-structured grids for modelling various physical phenomena, e.g. for computational fluid dynamics, geometric modelling, solving partial differential equations or image and video processing [1][2][3][4][5]. As computing time and memory usage grow linearly with the number of array elements in stencil computations our research targets highly parallel implementations of stencil codes together with task scheduling and optimization techniques taking into consideration energy cost and data locality [6][7][8][9][10]. We have proved during our experimental studies that recent changes introduced in heterogeneous computing hardware resulted in different performance and energy characteristics that are critical for highly efficient and scalable stencil computations [11]. As shown in [12,13], the overall performance of stencil computations is memory bound. One should note that many existing HPC architectures mainly focus on floating point performance [14]. However, only a partial and limited usage of the floating point units in a given computing architecture is possible today and may reduce energy cost without the performance degradation. Moreover, many latest improvements introduced in dynamic power management policies at the hardware level, e.g. dynamic voltage and frequency scaling (DVFS) or even switching off an entire unit block of a chip (clock gating), can lead to significant reduction in the energy required for memory-bound workloads. Advanced dynamic power management policies give new opportunities for scheduling tasks within the fine-grained parallel code as users are able to control the utilization of various functional units in heterogeneous computing hardware, e.g. turn on and off dynamically individual cores, change on-demand the frequency of a small processing and communication units or even put portions of cache memory at specific sleep states during runtime.
In our previous work [15] we performed an exhaustive evaluation of the key characteristics that have a relevant impact on the performance and energy usage of a stencil computation running on a certain processing unit. Based on these characteristics in this article, we present an energyaware ILP model that distributes stencil computations to heterogeneous processors and minimizes the schedule energy cost while meeting the computation's deadline. The distribution of stencil computations is done on the blocks obtained from the decomposition of the computational domain. The computational domain is a Cartesian grid on which the stencil computations are defined. The optimization space of the model shows that the best strategy depends not only on load balancing the problem size between the processing units, the processing units specification, and the stencils employed, but also on detailed mapping of the communication dependencies of the blocks to the communication topology of respective processing units. No previous work has attempted to account for the time and energy simultaneously in the context of the distribution of the stencil computations between processing units. We also developed new heuristics that schedule example workloads in real time. The developed heuristics attempt to include the communication overhead in the distribution process. The described algorithms have been tested experimental using the state of the art multi-and many-core architectures. In our work we focus the experiments on a single node configurations with heterogeneous processors.
The paper is organized as follows. In Sect. 2 the related work is discussed. The key properties that have an influence on energy usage are defined in Sect. 3. The scheduling problem is introduced in Sect. 4. Performance and energy models are introduced in Sect. 5. Section 6 describes the integer linear programming (ILP) model. The dynamic scheduling policies are described in Sect. 7. Section 8 presents experiments using a 3D Laplacian stencil defined on different grid topologies using several CPU-GPU configurations. Section 9 concludes our experiments and presents a future work.

Related work
In general, considered stencil calculations perform global sweeps through data structures that are typically much larger than the capacity of the available data caches available within processing units. Additionally, accessing data in main memory within the hardware is not fast enough and we often have to deal with the traffic between local cache and main memory. Therefore, many researchers have already tried to exploit data locality in stencil computations by performing operations on cache-sized blocks of data after domain decomposition [16], after time decomposition [17] or proposed cache-aware optimisation algorithms on many-core modern processors [18].
In consequence, there exist frameworks that try to ease the implementation of the stencil calculations. The user writes single stencil code in a framework's specific language which during a compilation is translated to a target architecture. The frameworks distribute the computations to employ multiple processors. The distribution involves the decomposition of the Cartesian domain to overlapping blocks. The overlap, called halo region, is needed to correctly update the decomposed block on borders. Each block is updated by a single processor. The minimal size of the overlap depends on the stencil pattern. The stencil pattern defines which neighbouring points are used during stencil computations. For example, Physis [19] uniformly decomposes the global domain over all the accelerators as instructed by a user-controllable parameter. The user has to experimentally determine which decomposition provides the highest performance. The framework focuses only on the GPU architecture. Similarly, work in [1] utilises a simple decomposition method with uniform partition where each processor and accelerator receives blocks of the same size. On the other hand, authors in [20] provide a method that allows programmers to partition the data contiguously between CPU and GPU within a single node. Unlike our work, their approach does not allow to find an optimal distribution of the domain between heterogeneous architectures in terms of time and energy costs. What is more, there is a lack of careful analyses of stencil optimizations and performance modelling connecting specific properties such as communication and locality with architectural time and energy costs.
Moreover, performance and energy models for modern heterogeneous computing architectures incorporating specialized processing capabilities should be flexible and extendable to explore recent properties of heterogeneous hardware units. A good example is the roofline model which allows a programmer to model, predict, and analyse an individual kernel performance given an architecture communication and computation capabilities [21]. In this approach an application is modelled simply by the ratio of useful operations to memory operations. The roofline model can predict the performance of a simple von Neumann architecture with two levels of memory as well as the more complex design with a multi-level memory hierarchy. It has been successfully used to model the performance of many applications on the multi-core and many-core processors [22]. Recently, it has been extended to model the energy consumption in GPUs [23]. In the new model the authors have assumed that each operation has a fixed energy cost and a fixed data movement cost while the constant energy cost is linear in time. The constant power depends on both a hardware and an algorithm and includes both static and leakage power management. However, the proposed model does not include dynamic power management by charging and discharging gate capacitance. The authors assumed that time per work (arithmetic) operation and time per memory operation are estimated with the hardware peak throughput values, whereas the energy cost is estimated using a linear regression based on real experiments. Another set of extensions to the roofline model have been proposed in [24] to model energy on dual multi-core CPU with three-level cache hierarchy. In this approach the dynamic power management was modelled as a second degree polynomial, based on real benchmark data, that scales linearly with the number of active cores up to the saturation point. The authors assumed that the dynamic power depends quadratically on the frequency. In the saturation point the energy to solution grows with the number of used cores, that is proportional to dynamic power, while the time to solution stays constant. In our article we are providing two examples of architectures CPUs and GPUs. However, the presented model can be utilized with other architectures as well, for instance Intel Xeon Phi or ARM. Antoher example is an energy model presented in [25] to evaluate the cost of parallel algorithms for GPU. Based on the energy model they propose the method for the energy scalability to easy the selection of the optimal number of blocks.

Stencil properties
In our previous work [15] we experimentally discovered the key characteristics that have a relevant impact on the performance and energy usage of a stencil computation running on a certain processing unit (PU). We tested the performance and energy usage of an example 3D Laplacian stencil on eight core Intel Xeon E5-2670@2.6GHz CPU and Kepler K20m GPU using multiple of frequency and voltage pairs, called P-states. Firstly, the maximum performance can be reached with a lower number of cores than available. Secondly, to minimize the energy usage it is more important to reduce the frequency than the number of cores used. What is more, in case of CPU, DRAM may use up to 60% of the energy. Thus, the data movement consumes the most of the power. Finally, the lowest energy usage may be reached with not the maximum performance. To summarise the analysis, the stencil computation u ∈ T , called task, is described by the following parameters: 1. The number of arithmetic operations per grid point W u, p on a processor p, 2. The number of required bytes to update a grid point Q u, p on a processor p, The processor p ∈ P has following properties:

Problem formulation
As showed in the previous section that the data locality has the highest influence on the energy usage, it has encouraged us to focus our research on a stencil workload scheduling using heterogeneous computing architectures to minimize the energy usage while meeting the computation's deadline. The scheduling problem is defined by a set P of m processors and a workload T = T 1 , T 2 , ..., T n of n dependent tasks. A considered workload represents a stencil defined on a structural grid. Each point on a grid is updated with a strict pattern, see Fig. 1. The pattern defines which neighbouring points are used during a stencil computation. A single update of the whole grid is called a timestep. In our approach we focus on an explicit method where a current timestep is updated by using values of the grid points from a previous timestep. The considered heterogeneous hardware includes unrelated processing units (PUs) and the same stencil computation takes different execution times on them. Based on our experimental studies we distinguished two different unrelated processing units: central processing units (CPUs) and graphic processing units (GPUs).
The workload contains the set of dependent tasks. The block decomposition of the structural grid updated by the stencil forms the workload of tasks with the communication dependencies. A task represents a single block of the decomposed grid. We assume that the grid is decomposed on equally sized blocks. We assume that a given task may be processed by a single processing unit at a time and each processing unit may execute several tasks.
The tasks are represented by a directed graph defined by a tuple G = (V, E) where V denotes the set of tasks and E represents the set of edges. For simplicity we assume that the task T u = u and the processor is depicted by p. Each edge (u, v) ∈ E defines a communication between the tasks u, v ∈ V . The communication load d u,v on the edge (u, v) depicts the number of grid cells exchanged between tasks. The model assumes a fully connected network of heterogeneous processors with heterogeneous communication links. If tasks u and v are executed on different processors p, k ∈ P, they cause the time t e p,k and the energy e e p,k penalty required to exchange a single grid cell between the processors p and k. If both tasks are scheduled on the same processor, then the communication time and the communication energy are equal to zero. The computation load w u describes the number of grid cells provided by the task u. The computation time and the energy cost to update the single cell on the processor p are represented by t c u, p,l and e c u, p,l respectively; see (3), (4). The idle power P idle p depicts the power used when no computations are executed on processor p. The memory size m p represents the maximum number of grid cells that can be computed on processor p. The total communication time and the total communication energy to exchange all data are represented by t e and e e respectively. Total execution time t t indicates how much time it takes to finish the whole workload. The execution deadline t d denotes the time by which all tasks have to be finished. The objective is to determine a schedule such that the total energy cost is minimized and deadline t d is not exceeded.

Performance and energy models
Detailed analysis of the performance and the energy usage of the stencil computations on two unrelated processing units resulted in the following formulation of the performance model. Computation time t c u, p,l of task u on processor p with state l is estimated as follows: where O u, p is the number of arithmetic operations executed and B u, p is the number of bytes transferred. The energy model assumes that each arithmetic operation as well as the memory operation consumes some energy: Variables e op u, p , e byte u, p approximates the energy usage of stencil operations. For simplicity, it is assumed that arithmetic operations, i.e. additions, multiplications, subtractions and divisions, consume the same amount of energy. Additionally, the energy usage also depends on an instruction set used, thus for the highest performance the CPU implementation of the stencil uses the vector extensions. P0 u, p,l is a constant power consumed by the processor P p based on the state l. The coefficients e op u, p , e byte u, p and P0 u, p,l are approximated with a linear regression. Table 1 shows estimated values of the energy cost for the double precision floating point operation and the transfer of a single byte of data. For CPU and GPU the cost to transfer a single byte of data is 5.2x and 6x more expensive than the floating point operation, respectively. What is more, both floating point and memory operations are 5x more expensive on CPU than on GPU. Figure 2 shows that the constant power grows linearly with the increasing number of cores using different P-states.

Optimal model
This section presents a method based on ILP (ILP) to obtain the optimal solution of the energy minimization problem. In particular, this method is developed to have a reference for the heuristics described in Sect. 7. Before going into details let us introduce basic definitions from graph theory [26].

Incidence
Two edges uv, st ∈ E are incident if {u, v} ∩ {s, t} = ∅ and edge uv ∈ E is also called incident to its both end nodes u and v. The set of edges incident at a node u is denoted by δ G (u): The number of edges incident to a node u is the degree of this node in G and will be denoted by deg G (u). For U ⊆ V the set of all edges with exactly one endpoint in U is denoted by δ(U ). In a directed graph the edges in E are assumed to be ordered pairs and are described as

Maximum degree and maximum multiplicity
Maximum degree and maximum multiplicity of a graph are defined as A graph with μ(G) = 1 that contains no parallel edges is called simple. Graphs with maximum multiplicity at least 1 are called multigraphs and denoted by M.

Chromatic index
An edge colouring of a graph G = (V, E) is a map c : E → C which assigns to each edge e ∈ E a colour c(e) ∈ C such that no two incident edges receive the same colour. The minimal cardinality of the colour set C for which such a mapping exists is called the chromatic index of the graph and denoted by χ (G).

ILP solution
Our method was inspired by the model proposed in [27]. The idea is to decompose the scheduling problem to two parallel subproblems. At first, the tasks are mapped to processors to minimize the maximum number of grid cells placed on each processor. Secondly, the number of communication rounds is minimized by employing an edge colouring model. The communication is executed in parallel between different pairs of processors in stencil computations. However, each processor can initiate a single communication link with another processor at a time. As a result, we have to employ several communication rounds to exchange all data. The number of communication rounds directly influence the communication time t e , as each round costs some time. The reason for selecting the ILP solution is that the edge colouring problem is NP-hard [28,29]. For the task scheduling model the set of edges is mapped to processors, where each edge (u, v) may be mapped to a single processor p or might be placed between two different processors p and k. In the first case, both endpoints u and v are mapped to p. In the second case, u is mapped to p and v to k or u is mapped to k and v to p. For each edge the slots ( p, k) ∈ P × P are provided and it is required that each edge must be assigned to exactly one slot. If edge e is assigned to slot ( p, k) then it starts in p and ends in k. If p = k then e lies completely on p and is intra-processor, in all other cases it is inter-processor. For the minimization of the number of communication rounds the edge colouring model is used. If the graph G = (V, E) of tasks is mapped to the complete graph K m of m processors to form a new multigraph M p , then each edge in M p receives at least as many colours as its multiplicity demands and incident edges do not receive the same colour. What is more, an edge can only receive a colour that is used.
Variables. For the integer programming model we introduce the following variables: These constraints are for all p ∈ P and uv ∈ E, where (13), -(Control the number of grid cells) This constraint controls the number of the grid cells allocated for each p ∈ P. The sum of grid cells mapped to processor p is given by for all p ∈ P, -(Number of colours not less than multiplicity) This constraint requires that each edge in M p receives at least as many colours as its multiplicity demands. Each edge models time required to exchange single grid cell between processors p and k: -(Incident edges receive different colours) This requires that incident edges do not receive the same colour in the edge colouring of M p and that an edge can only receive a colour that is used: -(Restrict memory capacity for each processor) This constraint restricts the number of grid cells allocated for each processor p: -(Control energy used for communication) The sum of energy used for the inter-processor communication is depicted as -(Control execution time) These two constraints calculate the total execution time t t using the maximum value from the computation and the communication time. As described in Sect. 4 the computation and the communication are done in parallel: -(Control processor's idle time) This constraint controls the idle time for all p ∈ P: -(Deadline) This inequality restricts the execution time: Table 2 shows the number of variables and constraints that formulate the ILP model.
Optimization objective. Finally, the objective of the model is to minimize the energy cost:

Heuristics
Taking into account the relevance of an energy efficiency issues in the next generation of the high-end supercomputers in this section we introduce new heuristics. In our approach we consider energy aware stencil workload scheduling on heterogeneous architectures with two following objectives: -minimize the energy usage, -load balance of the tasks to meet the deadline.

Simple
This strategy is focused on balancing the load between processors, and does not take into account the communication dependencies. These heuristics are usually quite simple and fast as they act online on the workload.

Balancing load
The algorithm distributes tasks to processors while attempting to keep the maximal load small and not to exceed the If there are multiple tasks that attain this 10: minimum pick the one with smallest computational load. 11: Remove task u from set T 12: Set m(u ) = p i and s = s + w u 13: for If there are multiple tasks that attain this 10: minimum pick the one with smallest computational load.

11:
Remove task u from set T 12: Set m(u ) = p i and s = s + w u 13: for deadline. This strategy is called Balancing Load, see Algorithm 1. We start with processor p 0 and assign tasks to this processor until its size is at least w V * r i / p∈P r p . Then we move to the next processor and repeat the procedure. The limit w V * r i / p∈P r p stems from the fact that in a prefect balancing of tasks there is one processor that has this many grid cells. This limit is a modification of a limit w V /|P| for homogeneous processor, as we consider the speed r p of each processor. The time complexity of the algorithm is O |V | to assign all tasks to processors. The algorithm is sensitive to the order in which the tasks and the processors are selected. If there are multiple tasks that attain this 15: minimum pick the one with smallest computational load.

16:
Remove task u from set T 17: Set m(u ) = p i and s = s + w u 18: end while 20: end while 21: end procedure

Advanced
Algorithms described in this section attempt to include communication overhead in the scheduling process. They try to find such a schedule that the resulting multigraph yields a small chromatic index. Since finding the chromatic index is an NP-complete problem [28], the algorithms employ different approximation methods to minimize it.

Minimize degree
In this algorithm task u with the lowest number of unmapped edges is assigned to the current processor p. Based on the equations χ (G) ≤ (G) + μ(G) and χ (G) ≤ 3 * (G)/2 the chromatic index χ (G) of any multigraph G depends on the max degree. Thus, when task u is assigned to processor p, then each incident edge to this task not mapped to p increases the current degree of p by one. The neighbours of the task u that are mapped to another processor k = p also increase the degree of p, but they are not considered in this algorithm. Therefore, the array deg(u) is used to keep for each task u the number of unmapped edges. The number by which the degree of processor p would increase if the task u was mapped to it. If two tasks have the same number of unmapped edges, then the task with the smallest computational load is selected. In other words, the number of additional grid points by which computational load on the processor p would exceed the perfect load w V * r i / p∈P r p if the task u was mapped to p. The running time of Algorithm 2 is O |V 2 | . The time needed to find the task with the smallest computational load takes O |V | , whereas the while loop is executed O |V | times.

Minimize multicut
In this algorithm the chromatic index for a multigraph is estimated based on the complete number of edges |E|. The previous Algorithm 2 is modified to obtain a minimal multicut. To achieve this task u with the smallest number of the unscheduled neighbours is found to be mapped on the current processor p. In line 3 the deg(u) is initialized with the number of the unscheduled neighbours. Each scheduled task u decreases the deg(v) for each unscheduled neighbour v of u . The time complexity of the algorithm is equal to O |V 2 | .

Accumulate neighbours
In this algorithm the unmapped task u with the highest number of neighbours on the currently selected processor p is chosen. This policy tries to yield most of the communication edges of the grid graph intra-processor. The array N records the number of neighbours the task u has on the processor p. In line 10 the task with the most neighbours on the processor p is selected. However, at the end of the inner while loop (line 12) when the processor p is almost full different strategy is employed. The task u connected to the subgraph mapped to p with a minimum number of neighbours not on p is selected. To recognise when the processor is almost full the load factor f ∈ [0, 1] is introduced. While s ≤ f * w V * r i / p∈P r p the tasks with the maximum number of neighbours on the current processor p are selected, whereas s ≥ f * w V * r i / p∈P r p the tasks with the minimum number of neighbours not on p are picked. When no unmapped task is adjacent to the tasks currently mapped to p, then the task with the maximum degree is preferred. Additionally, for the strategy defined in line 12, the task with the minimum degree among the unmapped ones is selected. To find the task in lines 10 and 12 takes O |V | time. The while loops are executed |V | times, thus the whole Algorithm 4 runs in time O |V 2 | .

Simulation setup
To validate our models a new simulator has been designed and implemented to calculate the total execution time, the energy usage and the number of communication rounds (colours). The simulator is initialized with the following data: The simulation instances include the two different real world simulation grids. These grids are related to the weather simulations problems. The connection topology of points on each grid is defined by a 3D Laplacian stencil depicted in Figure 1. Table 3 outlines the properties of the test instances. The first grid called Cuboid (Figure 3) was used to simulate a decaying turbulence of a homogeneous incompressible fluid. Whereas, the second grid called Sphere (Figure 4) was used as a benchmark for the atmospheric circulation models. The connections in the horizontal direction for the Sphere grid are periodical. Figure 5 shows the example of the decomposed Cuboid grid with the connection dependencies. Each number represents the block id that later is mapped to specific processor. To analyse the quality of the ILP model and the heuristics the grids are mapped to single node with the three different configurations of the processors: CPU-CPU, CPU-GPU and 2xCPU-2xGPU. The simulated CPU is Intel Xeon   Figure 6 presents the node topology with four processors. In all algorithms the CPU and GPU frequencies are set to default values. GPU operates at 705MHz of the core clock and 2600 MHz of the memory clock. Whereas, CPU operates at 2.6 GHz of the core clock. The parameters used in all test runs are shown in Table 4. The values of the parameters are obtained based on methodology described in Sect. 3 and 5.

Simulation results
First, we show the results for the ILP model, see Tables 5 and  6, where the first column presents the configurations used. Each configuration is simulated with different deadlines. The deadline is provided as an input parameter. We reduce its value to the point where the ILP model is not able to generate the feasible solution. The next columns provide the number of colours used in a multigraph, the total energy consumed, the energy used for computations, the energy used for communication and the time elapsed. The number of colours used in the graph colouring provide information about the number of the communication rounds employed. As the results show the decreasing deadlines improve the computation times, however they increase the energy usage. This is especially true for the heterogeneous configurations where the energy usage grows up to 7x from the extended deadline to the shortest one. Shorter deadline forces usage of the next processing unit, and as a result, it requires more energy to commu-nicate. For example, for the 2xCPU-2xGPU configuration 88% of energy is consumed by the communication. Therefore, for this reason it is important to efficiently distribute the stencil tasks to reduce the number of the communication rounds between the processing units. However, as we can see it is beneficial to use the heterogeneous configurations, as we switch from the CPU-CPU configuration to the 2xCPU-2xGPU configuration both the computation time and energy costs decrease by 87 and 57%, respectively, for the Cuboid grid. Similarly, the computation time for the Sphere grid decreases by 87% whereas the energy usage decreases by 42%. Higher energy usage for the Sphere grid is caused by the periodic connections of tasks on the I and J boundaries. For single node configurations we can notice that the maximum computation time t c u, p,l among the processing units is a bottleneck for the total execution time t t . Although, we can expect that for the multi-node configurations the limiting factor will be the communication time.
The quality of all the four heuristics described in Sect. 7 is presented. The obtained results are presented in Tables 7,8,9,10,11 and 12, where Algorithm 1 is tested with four different sorting orders of the tasks: random (RD), IJK indices, JIK indices and KIJ indices. The tasks can be order by the grid indices depending on their location within the grid. This order may have influence on the number of edges mapped between different processors.
For Algorithm 4 the first column in Tables contains the value of the load factor f , that depicts when the processor is almost full. This algorithm is tested with different values of this parameter. The second to last columns show the number The best results in terms of energy efficiency are given in bold The best results in terms of energy efficiency are given in bold of edges in the returned scheduling, the number of colours used in the obtained multigraph and the objective values for the energy and time. Gap is defined as a difference between the optimal solution o and the solution o * returned be the algorithm: The optimal solution with the shortest deadline is selected as a base for the comparison. In other words, the results The best results in terms of energy efficiency are given in bold   Tables 7 and 10 show that almost all heuristics except for Alg_1 − R D are able to schedule stencil tasks with close to the optimal solution for homogeneous hardware configurations with two processors. What is more, the results show that the heuristics that target at the balanced load provide good solutions for simple configurations with two processors. The balancing load algorithm Alg_1 produces an efficient distribution depending on the sorting order of the input tasks. The order based on JIK indices minimizes the number of the communication rounds for both grids with two processors. With four processors it is beneficial to use the heuristics that take into account the communication penalty.
The algorithm Alg_4 provides good schedules for the four processors as it tries to yield most of the communication edges of the task graph intra-processor. The quality of schedule depends on the load factor, which determines when to switch the mapping from the task with the most neighbours on the current processor to the task with a minimum number of neighbours not on the current processor. Take as an example the 5x4x5 gird with 100 blocks distributed on a node with single CPU and two GPUs where the 3D Laplacian stencil is empolyed. Figure 7a shows the schedule from Alg_1 with the best performing J I K order. The blocks are distributed horizontally according to the J I K order. Figure  7b shows the output from Alg_4 with the load factor equal to 0.9. The blocks scheduled to CPU are distributed vertically within the computational grid whereas the blocks scheduled  to GPUs are distributed horizontally. Alg_4 and the rest of the algorithms ( Alg_2 and Alg_3) are able to mix the spatial distribution of the blocks. The energy cost is 4.65J and 4.43J for Alg_1 and Alg_4 respectively. 5% of the energy is saved by reducing the number of communication rounds. The presented heuristics may be applied to the distribution of the stencil computations between the processing units defined on the Cartesian grids. These grids may be 2D or 3D with or without periodic boundaries. Table 13 shows the average exeuction time of the investigated ILP model and heuristics for the 2xCPU-2xGPU configuration with previously described grid setups. As one can notice, the time to find the optimal solutions is seven orders of magnitude larger than the time of heuristics.

Verification of energy model
This section contains the experimental comparison of the energy usage model used in the simulator with the real measurements. Figures 8 and 9 present the comparison of energy usage between the proposed model and the real measurements. The results are obtained for the Intel Xeon processor  As it can be observed, the accuracy of the presented model is high and exceeds visibly 90%. The results suggest that applying the time and energy models, while verifying different scheduling policies, does not lead to deterioration of overall results. This leads to the conclusion that the described environment can be used to simulate the heterogeneous computer system.

Conclusions and future work
In this paper new heuristics to distribute efficiently the stencil workload on the heterogeneous architectures and consequently minimize the energy usage within the deadline are presented and evaluated. They are based on our analysis of energy and performance models for a relevant class of stencil computations to explore the relationship between task scheduling algorithms and energy constraints. Additionally, the obtained results during experimental tests of our heuristics are compared to optimal solutions achieved by the ILP formulation of the stencil-scheduling problem. The optimization space of the model shows that the best strategy depends not only on load balancing the problem size between the processing units, the processing units specification, and the stencils employed, but also on detailed mapping of the communication dependencies of the blocks to the communication topology of respective processing units. The careful mapping of the stencil tasks on the heterogeneous architectures can lead to the substantial savings in the execution time and energy costs. We show that even a basic heuristic with load balancing for the configurations of the two processors is sufficient enough with respect to energy efficiency. Moreover, in this paper we demonstrate various improvements which take into account recent achievements in heterogeneous CPU and GPU architectures. Nevertheless, with the increasing number of processors, in our opinion heuristics that take into account the communication penalty are needed. The heuristics are applicable to distribute the stencil computations defined on a Cartesian grid regardless of a domain topology. The domain borders can be both periodic and non-periodic.
In our future work we plan to extend the proposed model and heuristics to take into account the remote communication between nodes to better predict the runtime and the energy usage of stencil computations in large scale. Therefore, we plan to conduct additional experimental tests to model the data movement within the inter-node network. Furthermore, we want to model a workflow of the different stencils to better predict the energy usage of real use cases and applications.
ons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.