Scheduling divisible loads with time and cost constraints
Abstract
In distributed computing, divisible load theory provides an important system model for allocation of data-intensive computations to processing units working in parallel. The main task is to define how a computation job should be split into parts, to which processors those parts should be allocated and in which sequence. The model is characterized by multiple parameters describing processor availability in time, transfer times of job parts to processors, their computation times and processor usage costs. The main criteria are usually the schedule length and cost minimization. In this paper, we provide the generalized formulation of the problem, combining key features of divisible load models studied in the literature, and prove its NP-hardness even for unrestricted processor availability windows. We formulate a linear program for the version of the problem with a fixed number of processors. For the case with an arbitrary number of processors, we close the gaps in the study of special cases, developing efficient algorithms for single criterion and bicriteria versions of the problem, when transfer times are negligible.
Keywords
Divisible load scheduling · Computational complexity · Linear programming
1 Introduction
Table 1 Problem parameters and notations
V | Total load size |
T | Deadline (upper limit of the schedule length) |
K | Budget (upper limit on cost) |
\(x_{i}\) | Decision variable for the size of the load chunk assigned to processor \(P_{i}\) |
\(B_{i}\) | The maximum load processor \(P_{i}\) can compute, due to memory limitations |
\(C_{i}\) | Finishing time for computing load \(x_{i}\) on processor \(P_{i}\) |
\(C_{\max }\) | Schedule length defined as \(\max \left\{ C_{i}|i=1,\ldots ,m\right\} \) |
\(\mathcal {K}\) | Overall cost |
\(\left[ r_{i},d_{i}\right] \) | The time window when processor \(P_{i}\) is available |
\(p_{i}+a_{i}x_{i}\) | The time for computing load \(x_{i}\) on processor \(P_{i}\), where \(p_{i}\) is the setup time to start computation and \(a_{i}\) is the processing rate (or reciprocal of speed) of processor \(P_{i}\) |
\(s_{i}+c_{i}x_{i}\) | The time for transferring load \(x_{i}\) to processor \(P_{i}\), where \(s_{i}\) is communication start up time and \(c_{i}\) is the communication rate (or reciprocal of bandwidth) of the link to \(P_{i}\) |
\(f_{i}+\ell _{i}x_{i}\) | The cost of computing load \(x_{i}\) by processor \(P_{i}\), including the fixed cost \(f_{i}\) |
\(\mathcal {P}\) | Set of the worker processors |
\(\mathcal {P}^{\prime }\) | Set of processors participating in computation, \(\mathcal {P}^{\prime }\subseteq \mathcal {P}\) |
m | Total number of processors, i.e., \(m=|\mathcal {P}|\) |
\(m^{\prime }\) | Number of processors in \(\mathcal {P}^{\prime }\), i.e., \( m^{\prime }=|\mathcal {P}^{\prime }|\) |
In its general form, the problem of divisible load scheduling (DLS) can be formulated as follows. A computational load of volume V (measured in bytes) is initially held by a master processor \(P_{0}\). The load must be distributed among worker processors from set \(\mathcal {P}= \{P_{1},\dots ,P_{m}\}\). In our model, the master processor \(P_{0}\) only distributes the load and does not perform any computation. In some publications this assumption is waived so that \(P_{0}\) performs computation after it completes all communications. The results discussed in the following sections can be easily adjusted for that case.
For the summary of the notation and explanation of the parameters used in this paper, see Table 1. Each processor \(P_{i}\) has its own availability interval \(\left[ r_{i},d_{i} \right] \). The time required for sending x bytes of load to \(P_{i}\) is \( s_{i}+c_{i}x\), for \(i=1,\ldots ,m\). The communications are performed sequentially, i.e., only one processor at a time can receive its chunk of the load from the master. The transmission to \(P_{i}\) of the allocated chunk can start at any time, even before the processor’s availability time \(r_{i}\). The processing of the chunk can start only after the allocated chunk is received in full and no earlier than processor’s availability time \(r_{i}\). For the chunk of size x received by processor \( P_{i}\), the computation time and the processor usage cost (computation cost) are \(p_{i}+a_{i}x\) and \(f_{i}+\ell _{i}x\), respectively.
It is required that \(P_{i}\) finishes computation by the end of its availability interval \(d_{i}\), \(d_{i}>r_{i}+p_{i}\). Due to the limited memory size, there is an upper limit \(B_{i}\) on the maximum load that can be handled by processor \(P_{i}\). A processor may be left unused if no load is sent to it. Such a processor does not incur any time or cost overheads.
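The timing and cost model described above can be illustrated with a minimal sketch. It assumes that the master sends the chunks back to back starting at time 0, without inserting idle time between transfers; the names `Processor` and `evaluate` are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    s: float    # communication start-up time
    c: float    # communication rate (reciprocal of bandwidth)
    p: float    # computation setup time
    a: float    # processing rate (reciprocal of speed)
    f: float    # fixed cost
    ell: float  # cost per unit of load
    r: float    # start of the availability window
    d: float    # end of the availability window
    B: float    # maximum load (memory limit)


def evaluate(sequence, x, procs):
    """Return (C_max, cost, feasible) for chunk sizes x sent in the given order.

    Assumption: transfers are performed back to back from time 0.
    """
    clock = 0.0  # time at which the master finishes the current transfer
    c_max, cost, feasible = 0.0, 0.0, True
    for i in sequence:
        q, load = procs[i], x[i]
        if load <= 0:
            continue  # unused processor: no time or cost overhead
        clock += q.s + q.c * load          # sequential communication
        start = max(clock, q.r)            # cannot compute before r_i
        finish = start + q.p + q.a * load  # completion time C_i
        feasible = feasible and finish <= q.d and load <= q.B
        c_max = max(c_max, finish)
        cost += q.f + q.ell * load
    return c_max, cost, feasible


# Hypothetical two-processor instance.
procs = [
    Processor(s=1, c=1, p=0, a=2, f=1, ell=1, r=0, d=100, B=10),
    Processor(s=0, c=1, p=1, a=1, f=0, ell=2, r=10, d=100, B=10),
]
print(evaluate([0, 1], [2, 3], procs))  # (14.0, 9.0, True)
```

In this instance the second processor receives its chunk at time 6 but has to wait until \(r_{2}=10\), so it determines the schedule length.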
Let \(C_{i}\) denote the time when processor \(P_{i}\) completes its chunk, and let \(C_{\max }\) denote the length of the whole schedule. The cost of processing the load is denoted \(\mathcal {K}\). Solving the DLS problem requires three decisions:
Decision 1 choosing the subset of processors \(\mathcal {P^{\prime }} \subseteq \mathcal {P}\) for performing computation; for any processor \(P_{i}\in \mathcal {P^{\prime }}\) the allocated chunk size is nonzero (\(x_{i}>0\)) and such a processor is called active;
Decision 2 choosing the sequence in which the master processor \( P_{0} \) sends parts of the load to the processors in \(\mathcal {P^{\prime }}\);
Decision 3 choosing the sizes \(x_{i}\) of the chunks sent to the processors in \(\mathcal {P^{\prime }}\), so that the whole load V is distributed.
Two single-criterion versions of the problem are distinguished:
in problem \(\mathrm {DL}_{\text {time}}(K)\) the objective is to minimize \(C_{\max }\) subject to \(\mathcal {K}\le K\), where K is an upper limit of the available budget,
in problem \(\mathrm {DL}_{\text {cost}}(T)\) the objective is to minimize the cost \(\mathcal {K}\) subject to \(C_{\max }\le T\), where T is an upper limit of the acceptable schedule length.
Table 2 Summary of the results
Transfer time | Comput. time | Cost | Conditions | Objectives | Results |
---|---|---|---|---|---|
\(c_{i}x_{i}\) | \(a_{i}x_{i}\) | \(\ell _{i}x_{i}\) | \(C_{\max }\le T\), \(\mathcal {K}\le K\) | | NP-complete even if \(r_{i}=0\), \(d_{i}=B_{i}=\infty \), Sect. 3 |
\(cx_{i}\) (common c) | \(a_{i}x_{i}\) | \(\ell _{i}x_{i}\) | \(r_{i}=0\), \(d_{i}=B_{i}=\infty \) | \((C_{\max },\mathcal {K})\) | \(O(m^{3})\), Shakhlevich (2013) |
O(m), Sect. 3.2 if \(c_{1}\le c_{2}\le \cdots \le c_{m}\) and \(\frac{\ell _{1}}{c_{1}}\le \frac{\ell _{2}}{c_{2}}\le \cdots \le \frac{\ell _{m}}{c_{m}}\) | |||||
\(s_{i}+c_{i}x_{i}\) | \(p_{i}+a_{i}x_{i}\) | \(f_{i}+\ell _{i}x_{i}\) | arb. \( r_{i},d_{i},B_{i}\) | \((C_{\max },\mathcal {K})\) | FPT w.r.t. m, Sect. 2 |
\(c_{i}x_{i}\) | \(a_{i}x_{i}\) | 0 | \(r_{i}=0\), \(d_{i}=B_{i}=\infty \) | \(C_{\max }\) | \(O(m\log m)\), Bharadwaj et al. (1994), Bharadwaj et al. (1996), Blazewicz and Drozdowski (1997) (processor seq. \(c_{1}\le c_{2}\le \cdots \le c_{m}\)) |
\(s+cx_{i}\) (common s, c) | \(a_{i}x_{i}\) | 0 | \(r_{i}=0\), \(d_{i}=B_{i}=\infty \) | \(C_{\max }\) | \(O(m\log m)\), Blazewicz and Drozdowski (1997) (processor seq. \(a_{1}\le a_{2}\le \cdots \le a_{m}\)) |
\(s_{i}\) | \(p_i+a_{i}x_{i}\) | 0 | \(r_{i}=0,\)\(d_{i}=B_{i}=\infty \) | \(C_{\max }\) | NP-hard, even if processor sequence is fixed Drozdowski and Lawenda (2005) |
\(s_{i}\) | \(a_{i}x_{i}\) | 0 | \(r_{i}=0\), \(d_{i}=B_{i}=\infty \) | \(C_{\max }\) | NP-hard Yang et al. (2007) \(O(m\log (Vmas)\times \min \{\lfloor s+Va\rfloor ,S\})^{*)}\) |
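0 | \(p_{i}+a_{i}x_{i}\) | \(\ell _{i}x_{i}\) | arb. \(r_{i},d_{i},B_{i}\); \(C_{\max }\le T\) | \(\mathcal {K}\) | O(m), Sect. 4.1 |
0 | \(p_{i}+a_{i}x_{i}\) | \(\ell _{i}x_{i}\) | arb. \(r_{i},d_{i},B_{i}\); \(\mathcal {K}\le K\) | \(C_{\max }\) | \(O(m\log m)\), Sect. 4.1 |
0 | \(p_{i}+a_{i}x_{i}\) | \(\ell _{i}x_{i}\) | arb. \(r_{i},d_{i},B_{i}\) | \((C_{\max },\mathcal {K})\) | \(O(m\log m)\), Sect. 4.1 |
0 | \(a_{i}x_{i}\) | \(f_{i}\) | \(C_{\max }\le T\), \(\mathcal {K}\le K\) | | NP-complete even if \(r_{i}=0\), \(d_{i}=B_{i}=\infty \), Drozdowski and Lawenda (2005), Sect. 4.2 |
0 | \(p_{i}+a_{i}x_{i}\) | \(f_{i}+\ell _{i}x_{i}\) | arb. \(r_{i},d_{i},B_{i}\); fixed set of active processors; \(C_{\max }\le T\) | \(\mathcal {K}\) | O(m), Sect. 4.2 |
0 | \(p_{i}+a_{i}x_{i}\) | \(f_{i}+\ell _{i}x_{i}\) | arb. \(r_{i},d_{i},B_{i}\); fixed set of active processors; \(\mathcal {K}\le K\) | \(C_{\max }\) | \(O(m\log m)\), Sect. 4.2 |
0 | \(p_{i}+a_{i}x_{i}\) | \(f_{i}+\ell _{i}x_{i}\) | arb. \(r_{i},d_{i},B_{i}\); fixed set of active processors | \((C_{\max },\mathcal {K})\) | \(O(m\log m)\), Sect. 4.2 |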
While the computation time overhead is at the center of the DLS problem and cannot be ignored, the two other types of overheads may become negligible in some scenarios. The version of the problem with zero cost overheads is well studied, see Bharadwaj et al. (1994), Bharadwaj et al. (1996), Blazewicz and Drozdowski (1997), Drozdowski and Lawenda (2005), Yang et al. (2007) and the summary of the results in the second part of Table 2. In this paper, we analyze the alternative version with zero transfer overheads; see the lower part of Table 2. It appears that if fixed cost overheads are negligible (\(f_{i}=0\) for all \(P_{i}\in \mathcal {P}\)), then the bicriteria version of the problem is solvable in \(O(m\log m)\) time. Its single criterion counterpart of cost minimization subject to a bounded schedule length can be solved in O(m) time. The version with nonzero fixed cost overheads \(f_{i}\) is NP-hard; however, if the set of active processors is fixed, cost minimization can be solved in O(m) time, while schedule length minimization and the bicriteria version can be solved in \(O(m\log m)\) time.
The rest of this paper is organized as follows. In Sect. 2 we study the general version of the problem, with arbitrary values of all parameters \(s_{i}\), \(c_{i}\), \(p_{i}\), \(a_{i}\), \( f_{i} \), \(\ell _{i}\), \(r_{i}\), \(d_{i}\) and \(B_{i}\) for all processors \( P_{i}\in \mathcal {P}\). In Sect. 3 we present our results for the case with zero fixed overheads, \(s_{i}=p_{i}=f_{i}=0\) for all processors \(P_{i}\in \mathcal {P}\). Section 4 is dedicated to the system with negligible transfer times, \(s_{i}=c_{i}=0\) for all \(P_{i}\in \mathcal {P}\). Conclusions are presented in Sect. 5.
2 Nonzero time/cost parameters—fixed set of active processors
In this section, we consider the DLS problem with arbitrary time/cost parameters \(s_{i}\), \(c_{i}\), \(p_{i}\), \(a_{i}\), \(f_{i}\), \(\ell _{i}\) and arbitrary processor availability parameters \(r_{i}\), \(d_{i}\), \(B_{i}\). The number of worker processors m is fixed. We present linear programs for problems \(\mathrm {DL}_{\text {time}}\mathrm {(}K\mathrm {)}\) and \(\mathrm {DL}_{ \text {cost}}\mathrm {(}T\mathrm {)}\), justifying that both problems are fixed parameter tractable (FPT) with respect to the parameter m. We then explain how problem \(\mathrm {DL}_{\text {bicrit}}\) can be solved in FPT time. Note that for an arbitrary m the problem is NP-hard, as we show in Sect. 3.
2.1 Limited cost K—schedule length minimization
Consider first problem \(\mathrm {DL}_{\text {time}}\mathrm {(}K\mathrm {)}\), assuming that the set of processors \(\mathcal {P}^{\prime }\subseteq \mathcal { P}\), which receive nonzero chunks of the load, is fixed, and their sequence is also fixed. At the end of the section, we discuss the case with a non-fixed processor sequence.
2.2 Limited schedule length T—cost minimization
2.3 Time–cost trade-off
Consider now the bicriteria problem \(\mathrm {DL}_{\text {bicrit}}\). The approach described below constructs at most \(\eta \) trade-off curves, one for each fixed processor activation sequence, and then takes their lower envelope.
Another extreme point \(\left( T^{q},K^{q}\right) \) corresponds to the schedule with the smallest cost \(K^{q}\). It can be found by solving \(\mathrm {LP}_{\text {cost}}(\infty )\), i.e., with \(T=\infty \); then, for the found value of \(K^{q}\), the associated minimum schedule length \(T^{q}\) can be found by solving \(\mathrm {LP}_{\text {time}}(K^{q})\).
Merging trade-off curves into the Pareto-front
- (1) Find the intersection points of all pieces of trade-off curves defined for various processor sequences;
- (2) Find the minimal layer of the breakpoints and the intersection points, using, for example, an approach outlined in Cormen et al. (2001) (p. 1045) for the symmetric problem of finding the maximal layer. Note that given z points in the plane, the minimal or maximal layer can be found in \(O(z\log z) \) time.
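The minimal layer computation in step (2) can be sketched as a sort followed by a single sweep, which gives the \(O(z\log z)\) bound. This is a generic illustration rather than the exact procedure from Cormen et al. (2001); the function name `minimal_layer` is hypothetical, and points are assumed to be given as (T, K) pairs.

```python
def minimal_layer(points):
    """Return the points not dominated by any other point.

    A point (t, k) is dominated if another point is no worse in both
    coordinates. Sorting dominates the running time: O(z log z).
    """
    layer = []
    best_k = float("inf")
    # Sort by schedule length, breaking ties by cost.
    for t, k in sorted(set(points)):
        if k < best_k:  # strictly cheaper than every shorter schedule
            layer.append((t, k))
            best_k = k
    return layer


pts = [(1, 5), (2, 3), (2, 4), (3, 3), (4, 1), (0, 6), (5, 2)]
print(minimal_layer(pts))  # [(0, 6), (1, 5), (2, 3), (4, 1)]
```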
Statement 1
Problems \(\mathrm {DL}_{\text {time}}(K)\), \(\mathrm {DL}_{\text {cost}}(T)\) and \(\mathrm {DL}_{\text {bicrit}}\) are fixed-parameter tractable with respect to the number of processors m.
3 Nonzero time/cost parameters—zero fixed overheads: \( p_{i}=s_{i}=f_{i}=0\)
In this section, we assume that all fixed overheads are equal to zero, i.e., \( p_{i}=s_{i}=f_{i}=0\), for \(i=1,\dots ,m\), while the linear components of transfer time (\(c_{i}\)), computation time (\(a_{i}\)) and cost (\(\ell _{i}\)) are not simultaneously equal to zero. In terms of the three types of decisions introduced in Sect. 1, decisions of type 3 imply decisions of type 1: any processor \(P_{i}\) which gets a chunk \(x_{i}=0\) does not contribute to any time or cost component because there are no fixed overheads. This implies that a processor receiving a 0-size chunk can be removed from the list \(\mathcal {P}^{\prime }\) of active processors. Thus, it is enough to make decisions 2 and 3. Our main result of this section is the NP-hardness proof of \(\mathrm {DL}_{\text {cost}}(T)\) and \(\mathrm {DL} _{\text {time}}(K)\).
3.1 Limited schedule length T and limited cost K
To prove NP-hardness of problem \(\mathrm {DL}_{\text {cost}}(T)\), let us introduce its decision version \(\mathrm {DL}(T,K)\), which asks whether there exists a feasible solution with the schedule length and the cost not exceeding the given thresholds T and K, respectively. We reduce the Even-odd partition problem to problem \(\mathrm {DL}(T,K)\).
Lemma 1
- (i) The whole load of size V is fully processed.
- (ii) Every processor is fully loaded, completing its load chunk at time T.
- (iii) The cost of the schedule is K.
- (iv) Schedule S defines a solution to the instance of \(\mathrm {DL} (T,K)\).
Proof
A feasible schedule with processor sequence \( (P_{11},P_{12},~P_{21},P_{22},~P_{31},P_{32})\)
We prove that properties (i)–(iv) hold for schedule S.
In the remaining part we prove that if there exists a solution to the instance of problem \(\mathrm {DL}(T,K)\), then there exists a solution to the related instance of Even-odd partition. The lemma below starts with auxiliary properties of a feasible schedule and concludes with the main result.
Lemma 2
- (1) Schedule S can be transformed into a schedule with processor activating sequence \((\left\{ P_{11},P_{12}\right\} ,\ldots , \left\{ P_{u1},P_{u2}\right\} , \left\{ P_{u+1,1},P_{u+1,2}\right\} ,\ldots , \left\{ P_{n1},P_{n2}\right\} )\).
- (2) Consider a feasible schedule obtained by transformation (1). Renumber processors in the order they appear in the activating sequence and renumber the associated values \(e_{u1}\) and \(e_{u2}\) accordingly. For the resulting schedule, with processor activating sequence \((P_{11},P_{12},\ldots ,P_{u1},P_{u2},\ldots ,P_{n1},P_{n2})\), the following inequality holds:
  $$\begin{aligned} \sum \limits _{i=1}^{n}2^{2i}G^{n-i+2}y_{i}\ge 3\sum \limits _{i=2}^{n+1}G^{i}, \end{aligned}$$
  where
  $$\begin{aligned} y_{i}&= \frac{1}{T}\left( c_{i1}x_{i1}+c_{i2}x_{i2}\right) \\&= \frac{1}{2^{2i-1}} \left( \frac{x_{i1}}{G^{n-i+2}+e_{i1}}+\frac{x_{i2}}{G^{n-i+2}+e_{i2}} \right) , \end{aligned}$$ (29)
  for \(i=1,\ldots ,n\).
- (3) If in a feasible schedule satisfying property (2) at least one processor \(P_{uk}\), \(1\le u\le n\), \(k=1,2\), is not fully loaded (i.e., \( C_{uk}<T\) holds), then
  $$\begin{aligned} \sum \limits _{i=1}^{n}2^{2i}G^{n-i+2}y_{i}<3\sum \limits _{i=2}^{n+1}G^{i}. \end{aligned}$$
- (4) Each of the 2n processors is fully loaded and has completion time T.
- (5) Equality \(\sum _{i=1}^{n}e_{i1}=G\) holds so that the set \(\left\{ e_{i1}\right\} _{i=1}^{n}\) defines a solution to Even-odd partition.
Proof
Let t be the starting time of communication of processor \(P_{h_{z-1}}\) in S. Modify the fragment of schedule S, starting from t, by moving the full load \(x_{h_{z-1}}\) from \(P_{h_{z-1}}\) to \(P_{h_{z}}\). In the new schedule, processor \(P_{h_{z}}\) finishes communication at time \( t+c_{h_{z}}(x_{h_{z-1}}+x_{h_{z}})\), which is less than \( t+c_{h_{z-1}}x_{h_{z-1}}+c_{h_{z}}x_{h_{z}}\), the communication finish time of \(P_{h_{z}}\) in S (since \(c_{h_{z-1}}>c_{h_{z}}\)). The same is true for the computation completion time: the new completion time of \(P_{h_{z}}\) is \(t+2c_{h_{z}}(x_{h_{z-1}}+x_{h_{z}})\), which is less than \( t+c_{h_{z-1}}x_{h_{z-1}}+2c_{h_{z}}x_{h_{z}}\), the completion time of \(P_{h_{z}}\) in the original schedule (since \(c_{h_{z-1}}>2c_{h_{z}}\) by (30)). The cost of the modified schedule is less than that of the original one since each of the values \(\ell _{v1}\) and \(\ell _{v2}\) is greater than \(\ell _{u1}\) and \(\ell _{u2}\).
As a result of the described transformation, processor \(P_{h_{z}}\) takes the full load of processor \(P_{h_{z-1}}\), making \(P_{h_{z-1}}\) idle. Modify the processor sequence by swapping \(P_{h_{z-1}}\) and \(P_{h_{z}}\). If \(P_{h_{z}}\) is still out of order, then perform a similar transformation: move the load from \(P_{h_{z-2}}\) to \(P_{h_{z}}\), making \(P_{h_{z-2}}\) idle and swap the two processors. Continue shifting processor \(P_{h_{z}}\) upstream until it reaches the right position in the schedule, immediately after its partner from the pair \(\left\{ P_{u1},P_{u2}\right\} \) or after a pair of processors \(\left\{ P_{u-1,1},P_{u-1,2}\right\} \). Repeating the same transformation, we construct a schedule with no larger length and with a smaller cost.
Property (4) immediately follows from properties (2) and (3). It remains to prove property (5).
We conclude with the main result which follows from Lemmas 1 and 2.
Theorem 1
Problem \(\mathrm {DL}(T,K)\) is NP-complete, and problems \(\mathrm {DL}_{\text {cost}}(T)\) and \(\mathrm {DL}_{\text {time}}(K)\) are NP-hard, even if computation time, communication time and cost have no fixed overheads, and \(r_i=0, d_i=B_i=\infty \), for \( i=1,\dots ,m\).
3.2 Time–cost trade-off
For zero overheads, the arguments from Sect. 2 can be simplified. MIP formulations (13)–(19) and (20)–(23) still hold, but the number of different sequences can be reduced from \(\eta =2^{m}m!\), given by (12), to \(\eta =m!\). Notice that due to zero overheads there is no need to select the set of active processors \(\mathcal {P}^{\prime }\), since an idle processor can be kept in any place of the sequence. The smaller value of \(\eta \) results in a slightly lower time complexity for enumerating all trade-offs, namely \(O\left( 4^{m}m^{m}m!\times LP(m-1,4m)\right) \).
The problem of finding extreme points in the (T, K)-space, with the shortest schedule or with the smallest cost, was addressed in the prior research for the special case when all processors are available simultaneously and have no deadline and capacity restrictions, \(r_{i}=0\), \( d_{i}=B_{i}=\infty \) for all \(1\le i\le m\). As shown in Bharadwaj et al. (1994, 1996), Blazewicz and Drozdowski (1997), the shortest schedule is provided if processors are sequenced in the non-decreasing order of \(c_{i}\) and complete all tasks simultaneously. For the same special case, the cheapest solution is constructed if the whole load is processed by the cheapest processor, i.e., \(P_{i}:\ell _{i}=\min _{j=1}^{m}\{\ell _{j}\}\). Hence, in the bicriteria problem \(\mathrm {DL}_{\text {bicrit}}\) end-points \( (T^{0},K^{0}),(T^{q},K^{q})\) of the time–cost trade-off can be found in, respectively, \(O(m\log m)\) and O(m) time.
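For this special case, both end-points can be computed directly. The sketch below is an illustration under the stated conditions (\(s_{i}=p_{i}=f_{i}=0\), \(r_{i}=0\), \(d_{i}=B_{i}=\infty \), \(a_{i}>0\)): with back-to-back communications and simultaneous completion, consecutive processors i and j in the sequence satisfy \(a_{i}x_{i}=(c_{j}+a_{j})x_{j}\), and the chunk sizes are scaled to sum to V. The function names are hypothetical, and ties between equally cheap processors are broken in favor of the faster one, which is an assumption of this sketch.

```python
def shortest_schedule(c, a, ell, V):
    """End-point (T^0, K^0): shortest schedule and its cost."""
    m = len(c)
    order = sorted(range(m), key=lambda i: c[i])  # non-decreasing c_i
    # Chunk sizes relative to the first processor in the sequence,
    # from a_i * x_i = (c_j + a_j) * x_j for consecutive i, j.
    rel = [1.0]
    for prev, cur in zip(order, order[1:]):
        rel.append(rel[-1] * a[prev] / (c[cur] + a[cur]))
    scale = V / sum(rel)
    x = [0.0] * m
    for i, r in zip(order, rel):
        x[i] = r * scale
    first = order[0]
    T = (c[first] + a[first]) * x[first]  # common completion time
    K = sum(l * xi for l, xi in zip(ell, x))
    return x, T, K


def cheapest_schedule(c, a, ell, V):
    """End-point (T^q, K^q): whole load on the cheapest processor."""
    i = min(range(len(c)), key=lambda j: (ell[j], c[j] + a[j]))
    return i, (c[i] + a[i]) * V, ell[i] * V


# Hypothetical instance with two identical links and speeds.
print(shortest_schedule([1, 1], [1, 1], [1, 2], 3))  # ([2.0, 1.0], 4.0, 4.0)
print(cheapest_schedule([1, 1], [1, 1], [1, 2], 3))  # (0, 6, 3)
```

In the example, the first processor finishes at time 2 + 2 = 4 and the second receives its chunk during [2, 3] and computes during [3, 4], so both complete simultaneously.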
For the general case of arbitrary \(r_{i}\), \(d_{i}\), \(B_{i}\), finding the solution \(\left( T^{q},K^{q}\right) \) with the lowest cost, i.e., the rightmost point in the time–cost trade-off, is computationally hard by Theorem 1: even though the schedule length is unrestricted when the cost is minimized, the processor availability constraints \(r_{i},d_{i},B_{i}\) may impose limits equivalent to a bound on the schedule length. We conjecture that finding the solution \(\left( T^{0},K^{0}\right) \) with the shortest schedule is also computationally hard.
Conjecture 1
For arbitrary \(r_{i},d_{i},B_{i}\), for all processors \(P_{i}\in \mathcal {P}\), problem \(\mathrm {DL}_{\text {time}}(\infty )\) is NP-hard, even if computation time, communication time and cost have no fixed overheads.
Theorem 2
If processors \(\mathcal {P}\) are agreeable and have no availability and capacity restrictions, i.e., \(r_{i}=0\), \(d_{i}=B_{i}=\infty \), \(1\le i\le m\) , then an optimum solution can be found in polynomial time.
Proof
Assume that processors are numbered in accordance with (38) and they are activated in the order of their numbering. The processor sequence corresponding to \(c_{1}\le c_{2}\le \dots \le c_{m}\) guarantees the shortest schedule (Bharadwaj et al. 1994, 1996; Blazewicz and Drozdowski 1997). It is also known (Yang et al. 2007) that in the shortest schedule there are no idle times between communications and all processors finish computation simultaneously.
4 Zero transfer overheads: \(s_{i}=c_{i}=0\)
The main model studied in this section is characterized by zero transfer times and zero fixed cost overheads for all processors \(P_{i}\): \( s_{i}=c_{i}=0\), \(f_{i}=0\), \(1\le i\le m\) (Sect. 4.1). In terms of the three types of decisions introduced in Sect. 1, only decisions of type 1 and 3 should be considered: any processor \(P_{i}\) which gets a zero-size chunk \(x_{i}=0\) should be removed from the list \(\mathcal {P}^{\prime }\) of active processors, and all processors in \(\mathcal {P}^{\prime }\) can be sequenced arbitrarily. We also discuss how the proposed methods can be adjusted for the case with arbitrary cost overheads \(f_{i}\) (Sect. 4.2) and their applicability to the related models with nonzero transfer times (Sect. 4.3).
4.1 Zero fixed cost overheads: \(f_{i}=0\)
In this section, we study the version of the main problem with the cost function \(F=\sum _{i=1}^{m}\ell _{i}x_{i}\), i.e., \(f_{i}=0\) for \(1\le i\le m\).
Statement 2
If there are no transfer overheads and the cost function is \(F=\sum _{i=1}^{m}\ell _{i}x_{i}\), then problem \(\mathrm {DL}_{\mathrm {cost}}(T)\) is solvable in O(m) time.
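The idea behind this statement can be sketched as a continuous knapsack: for a given T, each processor has a load capacity, and the load is assigned to the cheapest processors first. The capacity formula below, \(\min \{B_{i},(\min \{T,d_{i}\}-r_{i}-p_{i})/a_{i}\}\), is inferred from the model description and is an assumption of this sketch, as is the function name. The sketch sorts by \(\ell _{i}\) and therefore runs in \(O(m\log m)\) time; reaching O(m) would require a linear-time selection technique instead of sorting.

```python
def min_cost_allocation(procs, V, T):
    """Greedy continuous knapsack for DL_cost(T) with zero transfer times.

    procs: list of dicts with keys a, B, r, d, p, ell (hypothetical layout).
    Returns (x, cost), or None if the load V does not fit within T.
    """
    def capacity(q):
        # Largest chunk that finishes by min(T, d_i) and fits in memory.
        window = min(T, q["d"]) - q["r"] - q["p"]
        return max(0.0, min(q["B"], window / q["a"]))

    x = [0.0] * len(procs)
    remaining = V
    for i in sorted(range(len(procs)), key=lambda j: procs[j]["ell"]):
        take = min(capacity(procs[i]), remaining)
        x[i] = take
        remaining -= take
        if remaining <= 0:
            break
    if remaining > 1e-12:
        return None  # infeasible for this T
    cost = sum(q["ell"] * xi for q, xi in zip(procs, x))
    return x, cost


# Hypothetical two-processor instance.
procs = [
    dict(a=1, B=10, r=0, d=20, p=0, ell=3),
    dict(a=2, B=10, r=0, d=20, p=2, ell=1),
]
print(min_cost_allocation(procs, V=8, T=10))   # ([4.0, 4.0], 16.0)
print(min_cost_allocation(procs, V=30, T=10))  # None
```

In the example, the cheaper second processor can take at most (10 - 2) / 2 = 4 units by time T = 10, so the remaining 4 units go to the first processor.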
For the counterpart \(\mathrm {DL}_{\text {time}}(K)\) of problem \(\mathrm {DL}_{\text {cost}}(T)\), we can only propose an \(O(m\log m)\)-time algorithm. As we show next, the bicriteria problem \(\mathrm {DL}_{\text {bicrit}}\) is solvable in \(O(m\log m)\) time as well. Thus, in what follows we focus on \( \mathrm {DL}_{\text {bicrit}}\); a solution to \(\mathrm {DL}_{\text {time}}(K)\) can be found from a solution to \(\mathrm {DL}_{\text {bicrit}}\) without increasing the \(O(m\log m)\) time complexity.
In order to find the Pareto-front for \(\mathrm {DL}_{\text {bicrit}}\), consider \(\mathrm {LP}_{\text {cost}}(T)\) as the underlying model and treat it in a parametric way, with parameter T that varies in \(\left[ \min _{1\le i\le m}\left\{ d_{i}\right\} ,\right. \left. \max _{1\le i\le m}\left\{ d_{i}\right\} \right] \). Notice that for small values of T from that interval the problem \(\mathrm {LP}_{\text {cost}}(T)\) may be infeasible.
We start with the rightmost point of the trade-off, that corresponds to a solution with the largest length and minimum cost. It can be found in O(m) time by solving problem \(\mathrm {LP}_{\text {cost}}(T)\) with \( T=\max _{1\le i\le m}\left\{ d_{i}\right\} \). The resulting point \(\left( T,K\right) \) has \(K=\sum _{i=1}^{m}\ell _{i}x_{i}(T)\).
- critical processors \(\mathcal {P}_{c}\), with \(x_{i}\left( T\right) =\left( T-r_{i}-p_{i}\right) /a_i\) so that \(C_{i}=T\),
- non-critical processors \(\mathcal {P}_{n}\), with \(x_{i}\left( T\right) =\widetilde{B}_{i}\) so that \(C_{i}<T\),
- excluded processors \(\mathcal {P}_{e}\), with \(x_{i}\left( T\right) =0\).
- (a) The set \(\mathcal {P}_{c}\) is adjusted to exclude a processor whose load reduces to 0. This happens if the decreased value \(T^{\prime \prime }=T^{\prime }-\Delta \) reaches \(r_{i}+p_{i}\) for some \(P_{i}\in \mathcal {P}_{c}\). In this case
  $$\begin{aligned} \Delta =T^{\prime }-\max \left\{ r_{i}+p_{i}|P_{i}\in \mathcal {P} _{c}\right\} . \end{aligned}$$ (48)
- (b) The set \(\mathcal {P}_{c}\) is adjusted to include a non-critical processor which becomes critical. This happens if \(T^{\prime \prime }=T^{\prime }-\Delta \) reaches an absolute deadline \(\widetilde{d}_{i}\) for some non-critical processor \(P_{i}\in \mathcal {P}_{n}\), where
  $$\begin{aligned} \widetilde{d}_{i}=\min \{d_{i},r_{i}+p_{i}+a_{i}B_{i}\}. \end{aligned}$$ (49)
  In this case
  $$\begin{aligned} \Delta =T^{\prime }-\max \left\{ \widetilde{d}_{i}|P_{i}\in \mathcal {P} _{n}\right\} . \end{aligned}$$ (50)
- (c) The split processor \(P_{s}\) can no longer get any additional load since its increased load \(x_{s}^{\prime \prime }=x_{s}^{\prime }+h\left( \mathcal {P}_{c}\right) \Delta \) reaches an absolute upper bound \( \widetilde{B}_{s}\), which implies
  $$\begin{aligned} \Delta =\frac{\widetilde{B}_{s}-x_{s}^{\prime }}{h\left( \mathcal {P} _{c}\right) }. \end{aligned}$$
- (d) The split processor \(P_{s}\) can no longer get any additional load since processing its increased load \(x_{s}^{\prime \prime }\) reaches \(T^{\prime \prime }\) so that processor \(P_{s}\) becomes a critical processor. This happens if the completion time \(r_{s}+p_{s}+a_{s}\left( x_{s}^{\prime }+ h\left( \mathcal {P}_{c}\right) \Delta \right) \) of \(P_{s}\) becomes equal to \(T^{\prime \prime }=T^{\prime }-\Delta \), which implies
  $$\begin{aligned} \Delta =\frac{T^{\prime }-\left( r_{s}+p_{s}+a_{s}x_{s}^{\prime }\right) }{ a_{s}h\left( \mathcal {P}_{c}\right) +1}. \end{aligned}$$
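The four events can be combined into a single step that returns the smallest \(\Delta \) and the event that triggers it. The following is a minimal sketch under several assumptions: \(h\left( \mathcal {P}_{c}\right) =\sum _{P_{i}\in \mathcal {P}_{c}}1/a_{i}\) (inferred from the way h is updated when a processor joins or leaves \(\mathcal {P}_{c}\)), \(\widetilde{B}_{i}=\min \{B_{i},(d_{i}-r_{i}-p_{i})/a_{i}\}\) (inferred from the example data of this section), and the next breakpoint is given by the smallest \(\Delta \). The function name and the dictionary layout are hypothetical. For brevity the maxima are recomputed by scanning the sets; the priority queues mentioned in the complexity analysis would replace these scans.

```python
def next_event(T, critical, noncritical, split, x_split):
    """Return (delta, event) for the next breakpoint when T decreases.

    Processors are dicts with keys a, B, r, d, p (hypothetical layout).
    """
    def ready(q):
        return q["r"] + q["p"]

    def d_abs(q):  # absolute deadline, formula (49)
        return min(q["d"], ready(q) + q["a"] * q["B"])

    def b_abs(q):  # absolute upper bound on the load (assumed formula)
        return min(q["B"], (q["d"] - ready(q)) / q["a"])

    h = sum(1.0 / q["a"] for q in critical)  # assumed definition of h(P_c)
    # Each candidate is (delta, tie-break rank, event); (d) wins ties with (c).
    candidates = []
    if critical:
        candidates.append((T - max(ready(q) for q in critical), 2, "a"))
    if noncritical:
        candidates.append((T - max(d_abs(q) for q in noncritical), 3, "b"))
    if h > 0:
        candidates.append(((b_abs(split) - x_split) / h, 1, "c"))
    finish = ready(split) + split["a"] * x_split
    candidates.append(((T - finish) / (split["a"] * h + 1), 0, "d"))
    delta, _, event = min(candidates)
    return delta, event


# Processors 1 to 3 of the example data given later in this section,
# with a hypothetical total load V = 30 at T' = 110.
P1 = dict(a=1, B=10, r=80, d=100, p=1)
P2 = dict(a=4, B=40, r=30, d=110, p=2)
P3 = dict(a=8, B=10, r=20, d=40, p=5)
print(next_event(110, critical=[P2], noncritical=[P1], split=P3, x_split=0.5))
# (5.5, 'c')
```

Here the candidate values are 78 for (a), 19 for (b), 5.5 for (c) and 27 for (d), so the split processor reaches its upper bound 1.875 first, at \(T^{\prime \prime }=104.5\).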
- In the case of event (a) triggered by processor \(P_{i}\in \mathcal {P} _{c}\), calculate \(x_{s}^{\prime \prime }\) by (47) and set
  $$\begin{aligned} \mathcal {P}_{c}:=\mathcal {P}_{c}\backslash \left\{ P_{i}\right\} ,\, \mathcal {P}_{e}:=\mathcal {P}_{e}\cup \left\{ P_{i}\right\} , \end{aligned}$$
  $$\begin{aligned} h\left( \mathcal {P}_{c}\right) := h\left( \mathcal {P}_{c}\right) -\frac{1}{a_{i}},\, k\left( \mathcal {P}_{c}\right) := k\left( \mathcal {P}_{c}\right) -\frac{\ell _{i}}{a_{i}}. \end{aligned}$$
- In the case of event (b) triggered by processor \(P_{i}\in \mathcal {P} _{n}\), calculate \(x_{s}^{\prime \prime }\) by (47) and set
  $$\begin{aligned} \mathcal {P}_{c}:=\mathcal {P}_{c}\cup \left\{ P_{i}\right\} ,\, \mathcal {P}_{n}:=\mathcal {P}_{n}\backslash \left\{ P_{i}\right\} , \end{aligned}$$
  $$\begin{aligned} h\left( \mathcal {P}_{c}\right) := h\left( \mathcal {P}_{c}\right) +\frac{1}{a_{i}},\, k\left( \mathcal {P}_{c}\right) :=k\left( \mathcal {P}_{c}\right) + \frac{\ell _{i}}{a_{i}}. \end{aligned}$$
- In the case of event (c), set
  $$\begin{aligned} \mathcal {P}_{n}:=\mathcal {P}_{n}\cup \left\{ P_{s}\right\} , \end{aligned}$$
  and define a new split processor by considering processors \(P_{i}\), \( i=s+1,s+2,\ldots \) one by one: if \(r_{i}+p_{i}<T\), then \(P_{i}\) becomes a new split processor (note that its load is 0); otherwise, \(P_{i}\) joins the set of excluded processors \(\mathcal {P}_{e}\), and the next processor is examined. If no processor from \(\left\{ P_{s+1},\ldots ,P_{m}\right\} \) becomes a split processor, the algorithm stops.
- In the case of event (d), set
  $$\begin{aligned} \mathcal {P}_{c}:=\mathcal {P}_{c}\cup \left\{ P_{s}\right\} ,\, h\left( \mathcal {P}_{c}\right) := h\left( \mathcal {P}_{c}\right) +\frac{1}{a_{s}},\, k\left( \mathcal {P}_{c}\right) :=k\left( \mathcal {P}_{c}\right) +\frac{\ell _{s}}{a_{s}}, \end{aligned}$$
  and find the next split processor \(P_{s}\) as in the case of event (c).
If both events (c) and (d) happen simultaneously, proceed as in the case of event (d).
We can now treat the found breakpoint \(\left( T^{\prime \prime },K^{\prime \prime }\right) \) as the current one and proceed similarly to finding the next breakpoint. The algorithm stops if \(s=m\) and event (c) or (d) happens.
Initialization involves renumbering processors in accordance with (45), finding the first solution \(\left( x_{1}^{\prime },x_{2}^{\prime },\ldots ,x_{m}^{\prime }\right) \) for \(T^{\prime }=\max \left\{ \widetilde{d}_{i}|1\le i\le m\right\} \), computing \(K^{\prime }\) and auxiliary values \(h\left( \mathcal {P}_{c}\right) \) and \(k\left( \mathcal { P}_{c}\right) \). All required steps can be done in \(O(m\log m)\) time.
A transition from one breakpoint to the next one requires updating the two priority queues, which takes \(O(\log m)\) time, and updating the five parameters, \(x_{s}^{\prime \prime }\), \(T^{\prime \prime }\), \(K^{\prime \prime }\), \(h\left( \mathcal {P}_{c}\right) \) and \(k\left( \mathcal {P} _{c}\right) \), which can be done in O(1) time. Note that x-values for \( i\ne s\) are not maintained. Since there are at most m events of each type, the total number of breakpoints is no larger than 4m, and the overall time complexity is \(O(m\log m)\). Thus, the following statement holds.
Statement 3
Problem \(\mathrm {DL}_{\text {bicrit}}\) has a trade-off with at most 4m breakpoints which can be computed in \(O(m\log m)\) time.
Example data for time–cost trade-off calculation
i | \(a_i\) | \(B_i\) | \(r_i\) | \(d_i\) | \(p_i\) | \(\ell _i\) | \(r_i+p_i\) | \( \widetilde{d}_i\) | \(\widetilde{B}_i\) |
---|---|---|---|---|---|---|---|---|---|
1 | 1 | 10 | 80 | 100 | 1 | 1 | 81 | 91 | 10 |
2 | 4 | 40 | 30 | 110 | 2 | 2 | 32 | 110 | 19.5 |
3 | 8 | 10 | 20 | 40 | 5 | 3 | 25 | 40 | 1.875 |
4 | 4 | 20 | 20 | 70 | 4 | 5 | 24 | 70 | 11.5 |
5 | 5 | 10 | 10 | 80 | 2 | 8 | 12 | 62 | 10 |
6 | 6 | 10 | 40 | 100 | 2 | 10 | 42 | 100 | \(\approx \) 9.667 |
7 | 3 | 30 | 5 | 50 | 1 | 20 | 6 | 50 | \(\approx \) 14.667 |
8 | 2 | 50 | 10 | 60 | 3 | 40 | 13 | 60 | 23.5 |
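The three derived columns of the table can be recomputed from the raw parameters. The sketch below uses formula (49) for \(\widetilde{d}_{i}\); the formula \(\widetilde{B}_{i}=\min \{B_{i},(d_{i}-r_{i}-p_{i})/a_{i}\}\) is not stated explicitly in this excerpt and is an assumption that reproduces the tabulated values. The function name `derived` is hypothetical.

```python
# Raw example data: (i, a_i, B_i, r_i, d_i, p_i, ell_i)
rows = [
    (1, 1, 10, 80, 100, 1, 1),
    (2, 4, 40, 30, 110, 2, 2),
    (3, 8, 10, 20, 40, 5, 3),
    (4, 4, 20, 20, 70, 4, 5),
    (5, 5, 10, 10, 80, 2, 8),
    (6, 6, 10, 40, 100, 2, 10),
    (7, 3, 30, 5, 50, 1, 20),
    (8, 2, 50, 10, 60, 3, 40),
]


def derived(a, B, r, d, p):
    """Return (r + p, absolute deadline, absolute load bound)."""
    ready = r + p
    d_abs = min(d, ready + a * B)   # formula (49)
    b_abs = min(B, (d - ready) / a)  # assumed formula for the load bound
    return ready, d_abs, b_abs


for i, a, B, r, d, p, ell in rows:
    ready, d_abs, b_abs = derived(a, B, r, d, p)
    print(f"{i} | {ready} | {d_abs} | {b_abs:.3f}")
```

For processor 5, for instance, the memory limit is binding: \(r_{5}+p_{5}+a_{5}B_{5}=62<d_{5}=80\), so \(\widetilde{d}_{5}=62\) and \(\widetilde{B}_{5}=B_{5}=10\).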
Load allocations \(x_i\) and total costs in time–cost trade-off calculation
4.2 Arbitrary cost overheads \(f_{i}\), \(\ell _{i}\)
In the general case, with nonzero overheads \(f_{i}\) in the cost function \( K=\sum _{i=1}^{m}\left( f_{i}+\ell _{i}x_{i}\right) \), both problems \(\mathrm {DL}_{\text {time}}\) and \(\mathrm {DL}_{\text {cost}}\) are NP-hard, see Drozdowski and Lawenda (2005). As we show in this section, the problem can be solved efficiently if we limit our search to a class of solutions with a fixed set of active processors \(\mathcal {P}^{\prime }\subseteq \mathcal {P}\). The associated problem is of the form: given a set of active processors \(\mathcal {P} ^{\prime }\), it is required to allocate a positive load to each active processor, minimizing the objective function T or K. Note that some processors may get an infinitesimally small load \(\varepsilon >0\); such a processor then has a completion time \(r_{i}+p_{i}+a_{i}\varepsilon \), which should be taken into account when calculating the length T of the schedule.
- Property 1: component \(\sum _{i\in \mathcal {P}^{\prime }}f_{i}\) is constant and can be excluded from K;
- Property 2: the schedule length T satisfies \(T>\rho \), where
  $$\begin{aligned} \rho =\max _{1\le i\le m^{\prime }}\left\{ r_{i}+p_{i}\right\} . \end{aligned}$$
For problem \(\mathrm {DL}_{\text {cost}}(T)\) with a given set \(\mathcal {P} ^{\prime }\) and \(T>\rho \), consider the continuous knapsack formulation (40) defined over \(\mathcal {P}^{\prime }\). If in an optimal knapsack solution \(x_{i}=0\) for some \(P_{i}\in \mathcal {P}^{\prime } \), then such a solution is adjusted by replacing 0-values by \(\varepsilon \). The overall time complexity remains the same, O(m).
For problem \(\mathrm {DL}_{\text {bicrit}}\), apply the approach from Sect. 4.1 for the processor set \(\mathcal {P}^{\prime }\) and output the part of the trade-off that satisfies \(T>\rho \). Treat all found solutions as if each idle processor gets an \(\varepsilon \)-load. This does not affect the values of T and K, assuming that \( \varepsilon \) is infinitesimally small. One case needs special attention. It occurs if all breakpoints \(\left( T,K\right) \) satisfy \(T\le \rho \). In that case consider the rightmost point \(\left( T^{*},K^{*}\right) \) of the trade-off and output the unique solution obtained from \(\left( T^{*},K^{*}\right) \) by allocating \(\varepsilon \)-loads to all idle processors. The described adjustments do not affect the \(O(m\log m)\) time complexity derived for problem \(\mathrm {DL}_{\text {bicrit}}\) in Sect. 4.1.
Treating problem \(\mathrm {DL}_{\text {time}}(K)\) as a special case of problem \(\mathrm {DL}_{\text {bicrit}}\), we conclude that it is solvable in \(O(m\log m)\) time.
Statement 4
If there are arbitrary cost overheads \(f_{i}\), \(\ell _{i}\) in the cost function \(K=\sum _{i=1}^{m}\left( f_{i}+\ell _{i}x_{i}\right) \), and the set of active processors is fixed, then problem \(\mathrm {DL}_{\mathrm {cost}}(T)\) is solvable in O(m) time, while problems \(\mathrm {DL}_{ \mathrm {time}}(K)\) and \(\mathrm {DL}_{\mathrm {bicrit}}\) are solvable in \( O(m\log m)\) time.
4.3 Arbitrary transfer overheads \(s_{i},c_{i}\)
The results from Sect. 4.1 can be applied to special scenarios with nonzero transfer overheads \(s_{i},c_{i}\).
One scenario arises in parallel communication with simultaneous start mode, see Kim (2003), Robertazzi (2003), with computation speeds slower than transfer speeds. For the simultaneous start mode, load transfer to all worker processors starts at the same time. Worker processors start computing as soon as the first grain of the load is received. Due to the slower computation speeds compared to transfer speeds, computation time of any grain is higher than the transfer time of any subsequent grain. Thus, in parallel communication with simultaneous start, communication does not affect the overall schedule length and cost. Consequently, communication time can be ignored as if \( c_{i}=s_{i}=0\) for any \(P_{i}\in \mathcal {P}\).
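The claim can be checked numerically with a small grain-level simulation. This is a sketch under simplifying assumptions that are not part of the source: a single worker, equal-sized grains, no start-up times (\(s_{i}=p_{i}=0\)), and the hypothetical names `finish_time` and `grains`. When the transfer rate c does not exceed the processing rate a, the finishing time equals the pure computation time \(a\cdot x\) plus the transfer time of the first grain, which vanishes as the grains shrink.

```python
def finish_time(x: float, a: float, c: float, grains: int) -> float:
    """Finishing time of one worker receiving and computing load x grain by grain.

    a: processing rate (time per unit load), c: transfer rate (time per unit load).
    Transfer starts at time 0; a grain can be computed once it has fully arrived
    and the previous grain has been computed.
    """
    delta = x / grains          # size of one grain
    done = 0.0                  # time when the previous grain finished computing
    for g in range(grains):
        arrival = (g + 1) * c * delta   # grain g has been fully received
        start = max(arrival, done)      # wait for the data or for the processor
        done = start + a * delta
    return done


# Computation slower than transfer (c <= a): communication is hidden.
hidden = finish_time(10.0, 2.0, 1.0, 1000)
# Computation faster than transfer (c > a): communication dominates.
exposed = finish_time(10.0, 1.0, 2.0, 1000)
```

In the first call the result differs from \(a\cdot x=20\) only by the transfer time of a single grain. In the second call the worker repeatedly waits for data, so the finishing time is governed by the transfer time instead.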
Another scenario is typical for a pipeline-like computing mode. Load scattering and processing are interleaved so that communications and computations are performed at different stages. The load is distributed in one interval (say interval i) and processed in the next interval (\(i+1\)). If there is a common communication time \(\tau _{comm}\) for all processors and a common computation time \(\tau _{comp}\), with \(\tau _{comm}\le \tau _{comp}\), then the communications executed in interval i do not determine partitioning of the load for minimum computing time and the cost in interval \(i+1 \). It can be shown that the general case of the pipeline mode, characterized by \( s_{i}>0,a_{i}>0\), is NP-hard [see, e.g., DLS with processor release times in Drozdowski and Lawenda (2005)].
5 Conclusions
In this paper, we analyze the time/cost optimization for divisible load scheduling problems with arbitrary processor memory sizes, ready times, deadlines, communication and computation start-up costs. Three versions of the problem are studied: \(\mathrm {DL}_{\text {time}}\mathrm {(}K\mathrm {)}\)—schedule length minimization for the given limited budget K, \(\mathrm {DL}_{\text {cost}}\mathrm {(}T\mathrm {)}\)—cost minimization for the given schedule length limit T, and \(\mathrm {DL}_{\text {bicrit}}\)—constructing the set of time–cost Pareto-optimal solutions. All three versions can be solved in polynomial time for fixed m.
The case with given upper bounds on the schedule length and cost appears to be NP-hard even if all fixed overheads are zero (\(p_{i}=s_{i}=f_{i}=0\) for all \(P_{i}\in \mathcal {P}\)). This result is rather unusual: all previous NP-hardness results in the divisible load theory assumed nonzero fixed overheads. Interestingly, a divisible load problem is linked to scheduling problems with preemption, for which NP-hardness results are rather atypical (see, e.g., Sitters 2001; Drozdowski et al. 2017).
We leave an open question regarding the time complexity of finding a shortest schedule with processor availability constraints, but with zero fixed communication and computation overheads, regardless of the computation cost (\(K=\infty \)). We believe that the latter problem is computationally hard, see Conjecture 1. In contrast, the version with negligible communication times is solvable in \(O(m\log m)\) time even in its bicriteria setting.
Our summary table provided in the Introduction presents the state-of-the-art results in divisible load scheduling and can be used as a guideline for future research.
References
- Adler, I., & Monteiro, R. D. C. (1992). A geometric view on parametric linear programming. Algorithmica, 8, 161–176.
- Agrawal, R., & Jagadish, H. V. (1988). Partitioning techniques for large-grained parallelism. IEEE Transactions on Computers, 37, 1627–1634.
- Balas, E., & Zemel, E. (1980). An algorithm for large zero-one knapsack problems. Operations Research, 28, 1130–1154.
- Bharadwaj, V., Ghose, D., & Mani, V. (1994). Optimal sequencing and arrangement in distributed single-level tree networks with communication delays. IEEE Transactions on Parallel and Distributed Systems, 5, 968–976.
- Bharadwaj, V., Ghose, D., Mani, V., & Robertazzi, T. G. (1996). Scheduling divisible loads in parallel and distributed systems. Los Alamitos: IEEE Computer Society Press.
- Blazewicz, J., & Drozdowski, M. (1997). Distributed processing of divisible jobs with communication startup costs. Discrete Applied Mathematics, 76, 21–41.
- Cheng, Y.-C., & Robertazzi, T. G. (1988). Distributed computation with communication delay. IEEE Transactions on Aerospace and Electronic Systems, 24, 700–712.
- Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2001). Introduction to algorithms (2nd ed.). Cambridge: MIT Press and McGraw-Hill.
- Drozdowski, M. (2009). Scheduling for parallel processing. London: Springer.
- Drozdowski, M., Jaehn, F., & Paszkowski, R. (2017). Scheduling position-dependent maintenance operations. Operations Research, 65, 1657–1677.
- Drozdowski, M., & Lawenda, M. (2005). The combinatorics in divisible load scheduling. Foundations of Computing and Decision Sciences, 30, 297–308.
- Goldfarb, D., & Todd, M. J. (1989). Chapter II: Linear programming. In G. L. Nemhauser, A. H. G. Rinnooy Kan, & M. J. Todd (Eds.), Handbooks in operations research and management science. Optimization (Vol. 1, pp. 73–170). Elsevier Science Publishers B.V. (North-Holland).
- Kim, H. J. (2003). A novel optimal load distribution algorithm for divisible loads. Cluster Computing, 6, 41–46.
- Robertazzi, T. G. (2003). Ten reasons to use divisible load theory. IEEE Computer, 36, 63–68.
- Shakhlevich, N. V. (2013). Scheduling divisible loads to optimize the computation time and cost. Lecture Notes in Computer Science (Vol. 8193, pp. 138–148). Cham: Springer.
- Sitters, R. A. (2001). Two NP-hardness results for preemptive minsum scheduling of unrelated parallel machines. Lecture Notes in Computer Science (Vol. 2081, pp. 396–405). Berlin: Springer.
- Yang, Y., Casanova, H., Drozdowski, M., Lawenda, M., & Legrand, A. (2007). On the complexity of multi-round divisible load scheduling. INRIA Rhône-Alpes, Research Report No. 6096.
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.