To sustain the claims of this paper, it is necessary to derive a set of models and algorithms that determine an optimal share of the load among the computing devices. First, a set of concepts and parameters must be defined, characterising both the parallel application and the devices of the system.
- Work-item: the unit of concurrent execution in OpenCL. This paper assumes that every work-item represents the same amount of computational load.
- Total Workload (\(W\)): the number of work-items needed to solve a problem. It is determined by some input parameters of the application.
- Device Workload (\(W_C, W_G\)): the number of work-items assigned to each device: \(W_C\) for the CPU and \(W_G\) for the GPU.
- Processing speeds of devices (\(S_C, S_G\)): the number of work-items that each device can execute per time unit, taking into account the communication times.
- Processing speed of the system (\(S_T\)): the sum of the speeds of all the devices in the system.
$$\begin{aligned} S_T = S_C + S_G \end{aligned}$$
- Device execution time (\(T_C, T_G\)): the time required by a device to complete its assigned workload.
$$\begin{aligned} T_C = \frac{W_C}{S_C} \qquad T_G = \frac{W_G}{S_G} \end{aligned}$$
- Total execution time (\(T\)): the time required by the whole system to execute the application, determined by the last device to finish its task.
$$\begin{aligned} T = \max \{T_C, T_G\} \end{aligned}$$
- Workload partition (\(\alpha \)): the proportion of the total workload that is given to the CPU; the proportion for the GPU is then \(1-\alpha \).
$$\begin{aligned} W_C = \alpha W \qquad W_G = (1 - \alpha ) W \end{aligned}$$
Based on the above, the total execution time (T) is obtained from the workload of each device and their processing speed:
$$\begin{aligned} T = \max \left\{ \alpha \frac{W}{S_C}, (1 - \alpha ) \frac{W}{S_G} \right\} \end{aligned}$$
(1)
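As a minimal illustration of Eq. 1, the following Python sketch computes the total execution time for a hypothetical pair of devices; the workload and speed figures are invented placeholders, not measurements of any real system.

```python
# Sketch of Eq. 1: the total co-execution time is set by the
# slower-finishing device. All figures are hypothetical placeholders.

def total_time(alpha, W, S_C, S_G):
    """T = max{alpha*W/S_C, (1-alpha)*W/S_G} (Eq. 1)."""
    T_C = alpha * W / S_C        # time for the CPU to finish its share
    T_G = (1 - alpha) * W / S_G  # time for the GPU to finish the rest
    return max(T_C, T_G)

# Example: 10^6 work-items, GPU four times faster than the CPU.
print(total_time(0.2, 1e6, S_C=1e4, S_G=4e4))  # 20.0 time units
```

With these speeds, \(\alpha = 0.2\) makes both arguments of the maximum equal, anticipating the balanced share derived in Sect. 2.1.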
It is also necessary to model the energy behaviour of the system by considering the specifications of the devices.
- Static power (\(P_C^S, P_G^S\)): the power consumed by each device while idle. It is unavoidable and is consumed throughout the execution of the application.
- Dynamic power (\(P_C^D, P_G^D\)): the power consumed by each device while it is computing.
- Device energy (\(E_C, E_G\)): the energy consumed by each device during the execution.
- Total energy (\(E\)): the energy drawn by the heterogeneous system while executing the application. It is the sum of the energies of the individual devices.
The total consumed energy is the sum of the static (first term in Eq. 2) and dynamic (second term in Eq. 2) energies. The static energy is consumed by both devices throughout the execution of the task; it is thus obtained by multiplying the static power of the devices, \(P_C^S\) and \(P_G^S\), by the total execution time \(T\) (Eq. 1). The dynamic energy is consumed only while a device is computing: \(P^D_C T_C\) for the CPU and \(P^D_G T_G\) for the GPU.
$$\begin{aligned} E = \left[ (P_C^S+P_G^S) \max \left\{ \alpha \frac{W}{S_C}, (1-\alpha )\frac{W}{S_G} \right\} \right] + \left[ \alpha P_C^D \frac{W}{S_C} + (1 - \alpha ) P_G^D \frac{W}{S_G} \right] \end{aligned}$$
(2)
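Under the same hypothetical figures, Eq. 2 can be sketched as follows; the power values (in watts) are again placeholders chosen only for illustration.

```python
# Sketch of Eq. 2: static energy accrues on both devices for the whole
# run; dynamic energy only while each device computes. Placeholder figures.

def total_energy(alpha, W, S_C, S_G, P_C_S, P_G_S, P_C_D, P_G_D):
    T = max(alpha * W / S_C, (1 - alpha) * W / S_G)  # Eq. 1
    static = (P_C_S + P_G_S) * T                     # first term of Eq. 2
    dynamic = (P_C_D * alpha * W / S_C               # CPU computing time
               + P_G_D * (1 - alpha) * W / S_G)      # GPU computing time
    return static + dynamic

print(total_energy(0.2, 1e6, 1e4, 4e4,
                   P_C_S=10, P_G_S=25, P_C_D=40, P_G_D=150))  # 4500.0
```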
2.1 Optimal Performance Load Balancing
Attending strictly to performance, an ideal load balancing algorithm makes both devices take the same time \(T_{opt}\) to complete their assigned workloads, so that neither incurs idle time waiting for the other to finish.
$$\begin{aligned} T_{opt} = T_C = T_G = \frac{W}{S_T} \end{aligned}$$
The question remains which work distribution achieves this, that is, which \(\alpha \) satisfies the above equation. Intuitively, it depends on the speeds of the devices. Expression 1 showed that the execution time of each device is determined by the workload assigned to it, as well as by its processing speed.
$$\begin{aligned} T_C = \alpha \frac{W}{S_C} \qquad T_G = (1-\alpha ) \frac{W}{S_G} \end{aligned}$$
Both times are linear in \(\alpha \), so each defines a segment in the range \(0 \le \alpha \le 1\). \(T_C\) has positive slope and reaches its maximum at \(\alpha = 1\), while \(T_G\) has negative slope and reaches its maximum at \(\alpha = 0\). Where the two segments cross, both devices take the same time to execute, and therefore the optimal share \(\alpha _{opt}\) is found.
$$\begin{aligned} \alpha _{opt} \frac{W}{S_C} = (1 - \alpha _{opt}) \frac{W}{S_G} \Rightarrow \alpha _{opt} \left( \frac{W}{S_C} + \frac{W}{S_G} \right) = \frac{W}{S_G} \Rightarrow \alpha _{opt} = \frac{S_C}{S_C + S_G} \end{aligned}$$
(3)
Finally, it is also possible to determine the gain (or speedup) of the optimal execution compared to running on each of the devices alone.
$$ G_C = \frac{1}{\alpha _{opt}} \qquad G_G = \frac{1}{1-\alpha _{opt}} $$
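A short sketch of Eq. 3 and the two gains, reusing the hypothetical speeds from the earlier examples:

```python
# Sketch of Eq. 3 and the speedups over single-device execution.

def alpha_opt(S_C, S_G):
    return S_C / (S_C + S_G)  # Eq. 3: balanced share for the CPU

a = alpha_opt(1e4, 4e4)       # 0.2: the CPU receives a fifth of the work
print(a, 1 / a, 1 / (1 - a))  # G_C = 5.0 (vs. CPU-only), G_G = 1.25 (vs. GPU-only)
```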
2.2 Optimal Energy Load Balancing
The value of \(\alpha _{opt}\) determined by Expression 3 tells how to share the workload between the two devices to obtain the best performance. It is now interesting to know whether this sharing also gives the lowest energy consumption.
Regarding the total energy of the system (Expression 2), note that it uses the maximum function. Note also that \(\alpha _{opt}\) is the turning point at which the device that determines the execution time changes: below it the GPU finishes last, above it the CPU does, so the maximum changes its result there. The total energy of the system can therefore be expressed piece-wise, as two linear segments joined at \(\alpha _{opt}\). This expression is not differentiable at that point, but it is continuous. In order to determine the location of the minimum, three cases have to be analysed.
1. Both segments have positive slope, so \(\alpha = 0\) gives the minimum energy.
2. Both segments have negative slope, so the minimum is found at \(\alpha = 1\).
3. The slope of the left segment is negative and that of the right one is positive, so the minimum occurs at \( \alpha _{opt} = \frac{S_C}{S_C+S_G} \).
The problem is now to find when each of these cases occurs; for this, each segment has to be analysed separately. Note that the remaining combination, a positive left slope with a negative right slope, cannot occur: the right slope always exceeds the left one by \((P_C^S + P_G^S)(\frac{W}{S_C} + \frac{W}{S_G}) > 0\), as the derivatives below show.
Left Side. In the range \(0< \alpha < \alpha _{opt}\) the CPU is underused: its workload is not enough to keep it busy, and it has to wait for the overworked GPU to finish. Therefore the execution time is dictated by the GPU, and the energy of the whole system is:
$$\begin{aligned} E = (P_C^S + P_G^S + P_G^D) (1-\alpha ) \frac{W}{S_G} + P_C^D \alpha \frac{W}{S_C} \end{aligned}$$
To find when the segment has a negative slope, it is differentiated with respect to \(\alpha \) and compared to 0:
$$\begin{aligned} \frac{dE}{d\alpha } = -(P_C^S + P_G^S) \frac{W}{S_G} + P_C^D \frac{W}{S_C} - P_G^D \frac{W}{S_G} < 0 \Rightarrow \frac{S_G}{S_C} < \frac{P_C^S + P_G^S + P_G^D}{P_C^D} \end{aligned}$$
(4)
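This differentiation can also be reproduced symbolically. The following sketch uses sympy, an aid chosen for this illustration rather than a tool employed by the paper; the symbol names are ours:

```python
# Symbolic check of Eq. 4: differentiate the left-segment energy and
# inspect the sign of the slope.
import sympy as sp

alpha, W, S_C, S_G = sp.symbols('alpha W S_C S_G', positive=True)
P_CS, P_GS, P_CD, P_GD = sp.symbols('P_CS P_GS P_CD P_GD', positive=True)

# Energy while the GPU dictates the execution time (0 < alpha < alpha_opt).
E_left = (P_CS + P_GS + P_GD) * (1 - alpha) * W / S_G + P_CD * alpha * W / S_C
slope = sp.together(sp.diff(E_left, alpha))

# Combined over a common denominator; negative exactly when
# S_G/S_C < (P_CS + P_GS + P_GD)/P_CD, i.e. when Eq. 4 holds.
print(slope)
```

The right segment of the next paragraph can be checked in exactly the same way.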
Right Side. In the range \(\alpha _{opt}< \alpha < 1\) the opposite situation occurs: the CPU is overloaded and takes longer to complete its workload than the GPU. The execution time is then determined by the CPU, and the system energy is:
$$\begin{aligned} E = (P_C^S + P_G^S + P_C^D)\, \alpha \frac{W}{S_C} + P_G^D (1-\alpha ) \frac{W}{S_G} \end{aligned}$$
As before, the slope of the segment is found by differentiating; this time the aim is to find when the slope is positive.
$$\begin{aligned} \frac{dE}{d\alpha } = (P_C^S + P_G^S) \frac{W}{S_C} + P_C^D \frac{W}{S_C} - P_G^D \frac{W}{S_G} > 0 \Rightarrow \frac{S_G}{S_C} > \frac{P_G^D}{P_C^S + P_G^S + P_C^D} \end{aligned}$$
(5)
Satisfying both Expressions (4 and 5) means that the third case occurs, where the minimum energy is found at \(\alpha _{opt} = \frac{S_C}{S_C+S_G}\). Combining these leads to:
$$\begin{aligned} \frac{P_G^D}{P_C^S + P_G^S + P_C^D}< \frac{S_G}{S_C} < \frac{P_C^S + P_G^S + P_G^D}{P_C^D} \end{aligned}$$
(6)
This indicates that the ratio between the speeds of the devices must lie within a given range in order for the sharing to make sense from an energy perspective. The energy consumed in this case can be expressed as:
$$\begin{aligned} E = \frac{W}{S_T} (P_C^S + P_G^S + P_C^D + P_G^D) \end{aligned}$$
(7)
Should the above condition not be satisfied, it is advisable to use only one of the devices. If \(\frac{S_G}{S_C} > \frac{P_C^S + P_G^S + P_G^D}{P_C^D}\), both slopes are positive and the minimum appears at \(\alpha = 0\), meaning that using the CPU is pointless: no matter how small its portion of the work, it is going to waste energy. The consumption in this case is:
$$\begin{aligned} E = \frac{W}{S_G} (P_C^S + P_G^S + P_G^D) \end{aligned}$$
(8)
When the condition fails on the other side, \(\frac{S_G}{S_C} < \frac{P_G^D}{P_C^S + P_G^S + P_C^D}\), both slopes are negative and the minimum is found at \(\alpha = 1\). Then it is the CPU that must be used exclusively, as assigning even the smallest workload to the GPU is detrimental to the energy consumption of the system.
$$\begin{aligned} E = \frac{W}{S_C} \cdot (P_C^S + P_G^S + P_C^D) \end{aligned}$$
(9)
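The case analysis of Eqs. 6-9 can be condensed into a small selector; the following Python sketch assumes the same hypothetical speed and power figures as in the earlier examples:

```python
# Sketch of Eqs. 6-9: choose the energy-optimal alpha from the speed
# ratio S_G/S_C. All figures are placeholders.

def energy_optimal_alpha(S_C, S_G, P_C_S, P_G_S, P_C_D, P_G_D):
    ratio = S_G / S_C
    lower = P_G_D / (P_C_S + P_G_S + P_C_D)   # left bound of Eq. 6
    upper = (P_C_S + P_G_S + P_G_D) / P_C_D   # right bound of Eq. 6
    if ratio > upper:
        return 0.0                # GPU-only: any CPU share wastes energy (Eq. 8)
    if ratio < lower:
        return 1.0                # CPU-only: any GPU share wastes energy (Eq. 9)
    return S_C / (S_C + S_G)      # Eq. 6 holds: co-execute at alpha_opt (Eq. 7)

print(energy_optimal_alpha(1e4, 4e4, 10, 25, 40, 150))  # 0.2: co-execution pays off
```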
2.3 Optimal Energy Efficiency Load Balancing
Finally, this section analyses the advantage of co-execution when considering energy efficiency. The metric used to evaluate the efficiency is the Energy-Delay Product (EDP), that is, the product of the consumed energy and the execution time of the application. The starting point is then the combination of the time and energy expressions (1 and 2) of the system.
Again, since both expressions include the maximum function, they have to be analysed piece-wise. This time, both pieces are quadratic functions of \(\alpha \) that may have local extrema anywhere in their range. Therefore it is necessary to equate the derivative to 0 and solve for \(\alpha \).
Left Side. If \(0< \alpha < \frac{S_C}{S_T}\), the expressions for time and energy are multiplied, obtaining the EDP. Differentiating with respect to \(\alpha \) and equating the derivative to 0 leads to an extremum at \(\alpha _{left}\).
$$\begin{aligned} \alpha _{left} = \frac{2 S_C (P_C^S + P_G^S + P_G^D) - S_G P_C^D}{2 S_C (P_C^S + P_G^S + P_G^D) - 2 S_G P_C^D} \end{aligned}$$
Right Side. Now the range (\( \frac{S_C}{S_T}< \alpha < 1)\) is considered. Again, combining the time and energy expressions for this interval gives the EDP, which is differentiated and equated to 0 to locate the extremum at \(\alpha _{right}\).
$$\begin{aligned} \alpha _{right} = \frac{S_C P_G^D}{2 \left[ S_C P_G^D - S_G (P_C^S + P_G^S + P_C^D) \right] } \end{aligned}$$
The analysis of both sides shows that determining the minimum EDP is less straightforward than in the previous analyses. There are five candidate values of \(\alpha \). The first three are \(\alpha =0\), \(\alpha _{opt}\) and \(\alpha =1\). In addition, due to the quadratic nature of both parts of the EDP expression, a local minimum may exist in each of them; as shown above, these can occur at \(\alpha _{left}\) and \(\alpha _{right}\). However, these extrema are only relevant if they lie within the appropriate ranges, \(0< \alpha _{left} < \frac{S_C}{S_T}\) and \(\frac{S_C}{S_T}< \alpha _{right} < 1\). To find the optimum workload share, the energy efficiency is evaluated at the relevant points and the best one is chosen, as in the sketch below. Again, if the optimal \(\alpha \) is neither 0 nor 1, co-execution is advisable.
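A sketch of this selection procedure in Python, evaluating the EDP at every relevant candidate; all figures remain hypothetical placeholders:

```python
# Sketch of the EDP minimisation: evaluate E*T at alpha = 0, alpha_opt
# and alpha = 1, plus alpha_left/alpha_right when they fall inside their
# ranges, and keep the best share.

def edp_optimal_alpha(W, S_C, S_G, P_C_S, P_G_S, P_C_D, P_G_D):
    def T(a):                                            # Eq. 1
        return max(a * W / S_C, (1 - a) * W / S_G)
    def E(a):                                            # Eq. 2
        return ((P_C_S + P_G_S) * T(a)
                + P_C_D * a * W / S_C + P_G_D * (1 - a) * W / S_G)

    a_opt = S_C / (S_C + S_G)
    candidates = [0.0, a_opt, 1.0]

    den_left = 2 * (S_C * (P_C_S + P_G_S + P_G_D) - S_G * P_C_D)
    if den_left != 0:
        a_left = (2 * S_C * (P_C_S + P_G_S + P_G_D) - S_G * P_C_D) / den_left
        if 0 < a_left < a_opt:                           # keep only in-range extrema
            candidates.append(a_left)

    den_right = 2 * (S_C * P_G_D - S_G * (P_C_S + P_G_S + P_C_D))
    if den_right != 0:
        a_right = S_C * P_G_D / den_right
        if a_opt < a_right < 1:
            candidates.append(a_right)

    # Evaluating the EDP at every candidate makes a stray local maximum harmless.
    return min(candidates, key=lambda a: E(a) * T(a))

print(edp_optimal_alpha(1e6, 1e4, 4e4, 10, 25, 40, 150))  # 0.2 for these figures
```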