
Scheduling divisible loads with time and cost constraints

  • M. Drozdowski
  • N. V. Shakhlevich

Open Access Article

Abstract

In distributed computing, divisible load theory provides an important system model for allocating data-intensive computations to processing units working in parallel. The main task is to define how a computation job should be split into parts, to which processors those parts should be allocated and in which sequence. The model is characterized by multiple parameters describing processor availability in time, transfer times of job parts to processors, their computation times and processor usage costs. The main criteria are usually schedule length and cost minimization. In this paper, we provide a generalized formulation of the problem, combining key features of divisible load models studied in the literature, and prove its NP-hardness even for unrestricted processor availability windows. We formulate a linear program for the version of the problem with a fixed number of processors. For the case with an arbitrary number of processors, we close the gaps in the study of special cases, developing efficient algorithms for single criterion and bicriteria versions of the problem when transfer times are negligible.

Keywords

Divisible load scheduling · Computational complexity · Linear programming

1 Introduction

Divisible load theory (DLT) is an important model of parallel computations. It is assumed that a big volume of data, conventionally referred to as load, can be divided continuously into parts which can be processed independently on distributed computers. DLT was proposed by Cheng and Robertazzi (1988) to represent computation in a chain of intelligent sensors. A very similar approach was proposed independently by Agrawal et al. (1988) to model performance of a network of workstations. DLT was successfully applied to scheduling data-intensive applications and to analyzing various aspects of their efficiency depending on communication sequences, load scattering algorithms, memory limitations and time-varying environments (Bharadwaj et al. 1996; Drozdowski 2009; Robertazzi 2003). Overall, DLT is widely recognized as an adequate and accurate model of real systems dealing with load distribution.
Table 1

Problem parameters and notations

V: Total load size

T: Deadline (upper limit of the schedule length)

K: Budget (upper limit on cost)

\(x_{i}\): Decision variable for the size of the load chunk assigned to processor \(P_{i}\)

\(B_{i}\): The maximum load processor \(P_{i}\) can compute, due to memory limitations

\(C_{i}\): Finishing time for computing load \(x_{i}\) on processor \(P_{i}\)

\(C_{\max }\): Schedule length, defined as \(\max \left\{ C_{i}\,|\,i=1,\ldots ,m\right\} \)

\(\mathcal {K}\): Overall cost

\(\left[ r_{i},d_{i}\right] \): The time window when processor \(P_{i}\) is available

\(p_{i}+a_{i}x_{i}\): The time for computing load \(x_{i}\) on processor \(P_{i}\), where \(p_{i}\) is the setup time to start computation and \(a_{i}\) is the processing rate (reciprocal of speed) of processor \(P_{i}\)

\(s_{i}+c_{i}x_{i}\): The time for transferring load \(x_{i}\) to processor \(P_{i}\), where \(s_{i}\) is the communication start-up time and \(c_{i}\) is the communication rate (reciprocal of bandwidth) of the link to \(P_{i}\)

\(f_{i}+\ell _{i}x_{i}\): The cost of computing load \(x_{i}\) on processor \(P_{i}\), including the fixed cost \(f_{i}\)

\(\mathcal {P}\): Set of the worker processors

\(\mathcal {P}^{\prime }\): Set of processors participating in the computation, \(\mathcal {P}^{\prime }\subseteq \mathcal {P}\)

m: Total number of processors, \(m=|\mathcal {P}|\)

\(m^{\prime }\): Number of processors in \(\mathcal {P}^{\prime }\), \(m^{\prime }=|\mathcal {P}^{\prime }|\)

In its general form, the problem of divisible load scheduling (DLS) can be formulated as follows. A computational load of volume V (measured in bytes) is initially held by a master processor \(P_{0}\). The load must be distributed among worker processors from set \(\mathcal {P}=\{P_{1},\dots ,P_{m}\}\). In our model, the master processor \(P_{0}\) only distributes the load and does not perform any computation. In some publications this assumption is waived, so that \(P_{0}\) performs computation after it completes all communications. The results discussed in the following sections can be easily adjusted for that case.

For a summary of the notation and an explanation of the parameters used in this paper, see Table 1. Each processor \(P_{i}\) has its own availability interval \(\left[ r_{i},d_{i}\right] \). The time required for sending x bytes of load to \(P_{i}\) is \(s_{i}+c_{i}x\), for \(i=1,\ldots ,m\). The communications are performed sequentially, i.e., only one processor at a time can receive its chunk of the load from the master. The transmission of the allocated chunk to \(P_{i}\) can start at any time, even before the processor's availability time \(r_{i}\). The processing of the chunk can start only after the allocated chunk is received in full, and no earlier than the processor's availability time \(r_{i}\). For a chunk of size x received by processor \(P_{i}\), the computation time and the processor usage cost (computation cost) are \(p_{i}+a_{i}x\) and \(f_{i}+\ell _{i}x\), respectively.

It is required that \(P_{i}\) finishes computation by the end of its availability interval \(d_{i}\), \(d_{i}>r_{i}+p_{i}\). Due to the limited memory size, there is an upper limit \(B_{i}\) on the maximum load that can be handled by processor \(P_{i}\). A processor may be left unused if no load is sent to it. Such a processor does not incur any time or cost overheads.

Let \(C_{i}\) denote the time when processor \(P_{i}\) completes its chunk, and let \(C_{\max }\) denote the length of the whole schedule. The cost of processing the load is denoted \(\mathcal {K}\). Solving the DLS problem requires three decisions:

Decision 1 choosing the subset of processors \(\mathcal {P^{\prime }} \subseteq \mathcal {P}\) for performing computation; for any processor \(P_{i}\in \mathcal {P^{\prime }}\) the allocated chunk size is nonzero (\(x_{i}>0\)) and such a processor is called active;

Decision 2 choosing the sequence in which the master processor \( P_{0} \) sends parts of the load to the processors in \(\mathcal {P^{\prime }}\);

Decision 3 splitting the total load of size V into chunks \(x_{i}\), one for each processor \(P_{i}\in \mathcal {P^{\prime }}\), such that the schedule length \(C_{\max }\) and the total cost \(\mathcal {K}\) are minimum,
$$\begin{aligned} C_{\max }&=\max _{P_{i}\in \mathcal {P}^{\prime }}\{C_{i}\}, \\ \mathcal {K}&=\sum _{P_{i}\in \mathcal {P}^{\prime }}(f_{i}+\ell _{i}x_{i}). \end{aligned}$$
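For a fixed activation sequence and fixed chunk sizes, both objective values can be computed in a single pass over the processors, accumulating the sequential transfer times. The following sketch (illustrative names, not the authors' code; it follows the model of this section, in which a transfer may start before \(r_{i}\) but computation may not) evaluates \(C_{\max }\) and \(\mathcal {K}\):

```python
# A minimal evaluator for one candidate solution of the DLS problem.
# seq: processor indices in activation order; x[i]: chunk size for processor i;
# r, s, c, p, a, f, l: the per-processor parameters of Table 1.
def evaluate(seq, x, r, s, c, p, a, f, l):
    t_comm = 0.0            # time when the master's link becomes free
    C_max, cost = 0.0, 0.0
    for i in seq:
        t_comm += s[i] + c[i] * x[i]      # chunk fully received by P_i
        start = max(r[i], t_comm)         # earliest computation start
        C_i = start + p[i] + a[i] * x[i]  # completion time (3) of Sect. 2
        C_max = max(C_max, C_i)
        cost += f[i] + l[i] * x[i]
    return C_max, cost
```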
With respect to decision 3, the most general version is the bicriteria one: finding the Pareto front \((C_{\max },\mathcal {K})\) of non-dominated solutions in criteria \(C_{\max }\) and \(\mathcal {K}\). We denote that problem by \(\mathrm {DL}_{\text {bicrit}}\). Its two counterparts deal with minimizing one objective subject to a bound on the value of the second objective:
  • in problem \(\mathrm {DL}_{\text {time}}(K)\) the objective is to minimize \(C_{\max }\) subject to \(\mathcal {K}\le K\), where K is an upper limit of the available budget,

  • in problem \(\mathrm {DL}_{\text {cost}}(T)\) the objective is to minimize the cost \(\mathcal {K}\) subject to \(C_{\max }\le T\), where T is an upper limit of the acceptable schedule length.

There are three types of overheads for any processor \(P_{i}\in \mathcal {P^{\prime }}\): transfer time (also called communication time) \( s_{i}+c_{i}x\), computation time \(p_{i}+a_{i}x\) and computation cost \( f_{i}+\ell _{i}x\). We refer to parameters \(s_{i}\), \(p_{i}\) and \(f_{i}\) as fixed overheads as they define fixed amounts of time and cost incurred if a nonzero chunk of load is allocated to a processor. These amounts are independent of the chunk size.
In this paper, we perform the complexity study of the formulated DLS problem focusing on the most general case with arbitrary processors’ availability windows \(\left[ r_{i},d_{i}\right] \) and arbitrary restrictions on processors’ maximum loads \(B_{i}\). The results are summarized in Table 2. In the column “Objectives” we specify how the two objectives, \(C_{\max }\) and \(\mathcal {K}\), are handled: single criterion problems deal with either \(C_{\max }\) or \(\mathcal {K}\), while notation \(\left( C_{\max },\mathcal {K}\right) \) is used for the bicriteria problem in the space of objectives \(C_{\max }\), \(\mathcal {K}\). If one of the objectives is bounded, then the corresponding inequality is stated in the column “Conditions”.
Table 2

Summary of the results

Transfer time | Comput. time | Cost | Conditions | Objectives | Results

\(c_{i}x_{i}\) | \(a_{i}x_{i}\) | \(\ell _{i}x_{i}\) | \(C_{\max }\le T\), \(\mathcal {K}\le K\) | – | NP-complete, Sect. 3, even if \(r_{i}=0\), \(d_{i}=B_{i}=\infty \)

\(cx_{i}\) (common c) | \(a_{i}x_{i}\) | \(\ell _{i}x_{i}\) | \(r_{i}=0\), \(d_{i}=B_{i}=\infty \) | \((C_{\max },\mathcal {K})\) | \(O(m^{3})\), Shakhlevich (2013); \(O(m)\), Sect. 3.2, if \(c_{1}\le c_{2}\le \cdots \le c_{m}\) and \(\frac{\ell _{1}}{c_{1}}\le \frac{\ell _{2}}{c_{2}}\le \cdots \le \frac{\ell _{m}}{c_{m}}\)

\(s_{i}+c_{i}x_{i}\) | \(p_{i}+a_{i}x_{i}\) | \(f_{i}+\ell _{i}x_{i}\) | arb. \(r_{i},d_{i},B_{i}\) | \((C_{\max },\mathcal {K})\) | FPT w.r.t. m, Sect. 2

\(c_{i}x_{i}\) | \(a_{i}x_{i}\) | 0 | \(r_{i}=0\), \(d_{i}=B_{i}=\infty \) | \(C_{\max }\) | \(O(m\log m)\), Bharadwaj et al. (1994), Bharadwaj et al. (1996), Blazewicz and Drozdowski (1997) (processor seq. \(c_{1}\le c_{2}\le \cdots \le c_{m}\))

\(s+cx_{i}\) (common s, c) | \(a_{i}x_{i}\) | 0 | \(r_{i}=0\), \(d_{i}=B_{i}=\infty \) | \(C_{\max }\) | \(O(m\log m)\), Blazewicz and Drozdowski (1997) (processor seq. \(a_{1}\le a_{2}\le \cdots \le a_{m}\))

\(s_{i}\) | \(p_{i}+a_{i}x_{i}\) | 0 | \(r_{i}=0\), \(d_{i}=B_{i}=\infty \) | \(C_{\max }\) | NP-hard, even if the processor sequence is fixed, Drozdowski and Lawenda (2005)

\(s_{i}\) | \(a_{i}x_{i}\) | 0 | \(r_{i}=0\), \(d_{i}=B_{i}=\infty \) | \(C_{\max }\) | NP-hard, Yang et al. (2007); \(O(m\log (Vmas)\times \min \{\lfloor s+Va\rfloor ,S\})\)*

0 | \(p_{i}+a_{i}x_{i}\) | \(\ell _{i}x_{i}\) | arb. \(r_{i},d_{i},B_{i}\); \(C_{\max }\le T\); \(\mathcal {K}\le K\) | \(\mathcal {K}\); \(C_{\max }\); \((C_{\max },\mathcal {K})\) | O(m), Sect. 4.1; \(O(m\log m)\), Sect. 4.1; \(O(m\log m)\), Sect. 4.1

0 | \(a_{i}x_{i}\) | \(f_{i}\) | \(C_{\max }\le T\), \(\mathcal {K}\le K\) | – | NP-complete even if \(r_{i}=0\), \(d_{i}=B_{i}=\infty \), Drozdowski and Lawenda (2005), Sect. 4.2

0 | \(p_{i}+a_{i}x_{i}\) | \(f_{i}+\ell _{i}x_{i}\) | arb. \(r_{i},d_{i},B_{i}\); fixed set of active processors; \(C_{\max }\le T\); \(\mathcal {K}\le K\) | \(\mathcal {K}\); \(C_{\max }\); \((C_{\max },\mathcal {K})\) | O(m), Sect. 4.2; \(O(m\log m)\), Sect. 4.2; \(O(m\log m)\), Sect. 4.2

*) \(a=\max _{i=1,\ldots ,m}\{a_{i}\}\), \(s=\max _{i=1,\ldots ,m}\{s_{i}\}\), \(S=\sum _{i=1}^{m}s_{i}\)

In the presence of all three types of overheads, the DLS problem is NP-hard even if all fixed overheads are negligible,
$$\begin{aligned} s_{i}=p_{i}=f_{i}=0~~\mathrm {for~all~~}P_{i}\in \mathcal {P}, \end{aligned}$$
(1)
and processors’ restrictions are relaxed,
$$\begin{aligned} r_{i}=0,~d_{i}=B_{i}=\infty ~~\mathrm {for~all~~}P_{i}\in \mathcal {P}. \end{aligned}$$
(2)
If in addition to (1)–(2) per-unit transfer costs are equal,
$$\begin{aligned} c_{i}=c~~\mathrm {for~all~~}P_{i}\in \mathcal {P}, \end{aligned}$$
the problem is solvable in \(O(m^{3})\) time even in the bicriteria setting (Shakhlevich 2013). As we show in this paper, the general case with arbitrary \(s_{i}\), \(c_{i}\), \(p_{i}\), \(a_{i}\), \(f_{i}\), \(\ell _{i}\), \(r_{i}\), \(d_{i}\), \(B_{i}\) can be solved via linear programming under the condition that the number of worker processors m is fixed.

While the computation time overhead is at the center of the DLS problem and cannot be ignored, the two other types of overheads may become negligible in some scenarios. The version of the problem with zero cost overheads is well studied, see Bharadwaj et al. (1994), Bharadwaj et al. (1996), Blazewicz and Drozdowski (1997), Drozdowski and Lawenda (2005), Yang et al. (2007) and the summary of the results in the second part of Table 2. In this paper, we analyze the alternative version with zero transfer overheads; see the lower part of Table 2. It appears that if fixed cost overheads are negligible (\(f_{i}=0\) for all \(P_{i}\in \mathcal {P}\)), then the bicriteria version of the problem is solvable in \(O(m\log m)\) time. Its single criterion counterpart of cost minimization subject to a bounded schedule length can be solved in O(m) time. The version with nonzero fixed cost overheads \(f_{i}\) is NP-hard, but can be solved in O(m) time provided that the set of active processors is fixed.

The rest of this paper is organized as follows. In Sect. 2 we study the general version of the problem, with arbitrary values of all parameters \(s_{i}\), \(c_{i}\), \(p_{i}\), \(a_{i}\), \(f_{i}\), \(\ell _{i}\), \(r_{i}\), \(d_{i}\) and \(B_{i}\) for all processors \(P_{i}\in \mathcal {P}\). In Sect. 3 we present our results for the case with zero fixed overheads, \(s_{i}=p_{i}=f_{i}=0\) for all processors \(P_{i}\in \mathcal {P}\). Section 4 is dedicated to the system with negligible transfer times, \(s_{i}=c_{i}=0\) for all \(P_{i}\in \mathcal {P}\). Conclusions are presented in Sect. 5.

2 Nonzero time/cost parameters—fixed set of active processors

In this section, we consider the DLS problem with arbitrary time/cost parameters \(s_{i}\), \(c_{i}\), \(p_{i}\), \(a_{i}\), \(f_{i}\), \(\ell _{i}\) and arbitrary processor availability parameters \(r_{i}\), \(d_{i}\), \(B_{i}\). The number of worker processors m is fixed. We present linear programs for problems \(\mathrm {DL}_{\text {time}}\mathrm {(}K\mathrm {)}\) and \(\mathrm {DL}_{ \text {cost}}\mathrm {(}T\mathrm {)}\), justifying that both problems are fixed parameter tractable (FPT) with respect to the parameter m. We then explain how problem \(\mathrm {DL}_{\text {bicrit}}\) can be solved in FPT time. Note that for an arbitrary m the problem is NP-hard, as we show in Sect. 3.

2.1 Limited cost K—schedule length minimization

Consider first problem \(\mathrm {DL}_{\text {time}}\mathrm {(}K\mathrm {)}\), assuming that the set of processors \(\mathcal {P}^{\prime }\subseteq \mathcal { P}\), which receive nonzero chunks of the load, is fixed, and their sequence is also fixed. At the end of the section, we discuss the case with a non-fixed processor sequence.

Let processors in \(\mathcal {P}^{\prime }\) be renumbered in the order of their activation so that \(P_{1}\) receives the first chunk, \(P_{2}\) receives the second one, etc., until \(P_{m^{\prime }}\) receives the last chunk of the load, where \(m^{\prime }=|\mathcal {P}^{\prime }|\). Let \(x_{1}\), \(x_{2}\), ..., \(x_{m^{\prime }}\) represent the load distribution among processors \(\mathcal {P}^{\prime }\). Then, the completion time \(C_{i}\) of any processor \(P_{i}\), \(1\le i\le m^{\prime }\), can be calculated as
$$\begin{aligned} C_{i}=\max \left\{ r_{i},\sum \limits _{k=1}^{i}\left( s_{k}+c_{k}x_{k}\right) \right\} +\left( p_{i}+a_{i}x_{i}\right) . \end{aligned}$$
(3)
Note that the first term in (3) represents the earliest possible starting time of the computation: the release time of \(P_{i}\) or the total duration of the chain of communication times for the upstream processors \( P_{1},P_{2},\ldots ,P_{i}\), whichever is larger. The second term represents the computation time.
Using (3), we follow Drozdowski and Lawenda (2005) to formulate problem \(\mathrm {DL}_{\text {time}}(K)\) as a linear program \(\mathrm {LP}_{\text {time}}(K)\) of the form:
$$\begin{aligned}&\mathrm {LP}_{\text {time}}\mathrm {(}K\mathrm {)}\text {:}\min T \end{aligned}$$
(4)
$$\begin{aligned}&\mathrm {s.t.}\nonumber \\&\sum \limits _{k=1}^{i}\left( s_{k}+c_{k}x_{k}\right) +\left( p_{i}+a_{i}x_{i}\right) \le T, \quad i=1,\ldots ,m^{\prime }, \end{aligned}$$
(5)
$$\begin{aligned}&r_{i}+\left( p_{i}+a_{i}x_{i}\right) \le T, \quad i=1,\ldots ,m^{\prime }, \end{aligned}$$
(6)
$$\begin{aligned}&\sum \limits _{k=1}^{i}\left( s_{k}+c_{k}x_{k}\right) +\left( p_{i}+a_{i}x_{i}\right) \le d_{i}, \quad i=1,\ldots ,m^{\prime }, \end{aligned}$$
(7)
$$\begin{aligned}&r_{i}+\left( p_{i}+a_{i}x_{i}\right) \le d_{i}, \quad i=1,\ldots ,m^{\prime }, \end{aligned}$$
(8)
$$\begin{aligned}&0\le x_{i}\le B_{i}, \quad i=1,\ldots ,m^{\prime }, \end{aligned}$$
(9)
$$\begin{aligned}&\sum \limits _{i=1}^{m^{\prime }}\left( f_{i}+\ell _{i}x_{i}\right) \le K, \end{aligned}$$
(10)
$$\begin{aligned}&\sum \limits _{i=1}^{m^{\prime }}x_{i}=V. \end{aligned}$$
(11)
Here the schedule length T is the variable to be minimized. It is defined via inequalities (5)–(6), which model \(T=\max _{1\le i\le m^{\prime }}\left\{ C_{i}\right\} \) with \(C_{i}\) given by (3). Inequalities (7)–(8) guarantee that computation on processor \(P_{i}\) is completed by the end of its availability interval. Inequalities (9) guarantee that the load of each processor \(P_{i}\) does not exceed its memory size \(B_{i}\). By (10) the total computation cost does not exceed K. The total size of the load allocated to the \(m^{\prime }\) processors is equal to V by (11).
There are \(m^{\prime }+1\) variables and \(5m^{\prime }+2\) constraints in \(\mathrm {LP}_{\text {time}}(K)\), not counting the nonnegativity constraints. The number of variables and the number of constraints can each be reduced by one, using equation (11). The number of constraints can be further reduced by \(m^{\prime }\) by combining conditions (8)–(9) for every \(1\le i\le m^{\prime }\) into
$$\begin{aligned} 0\le x_{i}\le \min \left\{ B_{i},\tfrac{1}{a_{i}}\left( d_{i}-r_{i}-p_{i}\right) \right\} . \end{aligned}$$
Thus, problem \(\mathrm {DL}_{\text {time}}\mathrm {(}K\mathrm {)}\) can be solved in \(O(\mathrm {LP}(m^{\prime },4m^{\prime }+1))\) time, where \(O\left( \mathrm {LP}(u,w)\right) \) is the time complexity of solving an LP problem with u variables and w inequality constraints, see, e.g., Goldfarb and Todd (1989).
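As a sanity check of the formulation, \(\mathrm {LP}_{\text {time}}(K)\) can be handed to an off-the-shelf LP solver. The sketch below is an illustration under stated assumptions, not the authors' implementation: it assumes SciPy is available, treats all listed processors as active in the given order, assumes \(a_{i}>0\), and combines (8)–(9) into variable bounds as above.

```python
from scipy.optimize import linprog

def lp_time(K, V, r, d, B, s, c, p, a, f, l):
    """Solve LP_time(K), constraints (5)-(11), for the fixed activation
    order 1..m'. Variables: x_1..x_m' plus T (appended last).
    Returns (T, x) or None if the LP is infeasible."""
    m = len(a)
    A_ub, b_ub = [], []
    for i in range(m):
        S = sum(s[: i + 1])                       # fixed transfer overheads
        row = [c[k] if k <= i else 0.0 for k in range(m)]
        row[i] += a[i]                            # comm. chain + own computation
        A_ub.append(row + [-1.0]); b_ub.append(-S - p[i])        # (5)
        A_ub.append([a[i] if k == i else 0.0 for k in range(m)] + [-1.0])
        b_ub.append(-r[i] - p[i])                                # (6)
        A_ub.append(row + [0.0]); b_ub.append(d[i] - S - p[i])   # (7)
    A_ub.append(list(l) + [0.0]); b_ub.append(K - sum(f))        # (10)
    bounds = [(0.0, min(B[i], (d[i] - r[i] - p[i]) / a[i]))      # (8)+(9)
              for i in range(m)] + [(0.0, None)]                 # T >= 0
    res = linprog([0.0] * m + [1.0],              # objective: minimize T
                  A_ub=A_ub, b_ub=b_ub,
                  A_eq=[[1.0] * m + [0.0]], b_eq=[V],            # (11)
                  bounds=bounds)
    return (res.x[-1], list(res.x[:-1])) if res.success else None
```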
With \(m^{\prime }\le m\), there are at most \(2^{m}\) possible selections for set \(\mathcal {P}^{\prime }\) and at most m! processor sequences for each selection. Hence, problem \(\mathrm {DL}_{\text {time}}(K)\) can be solved in \(O(\eta \times \mathrm {LP}(m,4m+1))\) time, where
$$\begin{aligned} \eta =2^{m}m! \end{aligned}$$
(12)
which implies the FPT time complexity with respect to parameter m.
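The enumeration underlying this bound can be sketched directly. Here `lp_solver` is a hypothetical callback standing for any routine that solves the LP above for one fixed sequence, returning the optimal \(T\) or None if infeasible:

```python
from itertools import combinations, permutations

def candidate_sequences(procs):
    """Every activation sequence over every subset P' of P, as in the
    O(eta x LP(m, 4m+1)) scheme; their number is sum_k C(m,k)k! <= 2^m m!."""
    for k in range(len(procs) + 1):
        for subset in combinations(procs, k):
            yield from permutations(subset)

def solve_dl_time(procs, lp_solver):
    """Return the best (T, sequence) over all candidates."""
    best = None
    for seq in candidate_sequences(procs):
        T = lp_solver(seq)
        if T is not None and (best is None or T < best[0]):
            best = (T, seq)
    return best
```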

2.2 Limited schedule length T—cost minimization

Consider now problem \(\mathrm {DL}_{\text {cost}}(T)\) of finding the cheapest schedule for the common deadline T, assuming that the set of active processors \(\mathcal {P}^{\prime }\subseteq \mathcal {P}\) is fixed and their sequence is \((1,2,\ldots ,m^{\prime })\). Similarly to problem \(\mathrm {DL}_{\text {time}}(K)\), problem \(\mathrm {DL}_{\text {cost}}(T)\) can be solved by a linear program with objective \(\sum _{i=1}^{m^{\prime }}(f_{i}+\ell _{i}x_{i})\), which, for a fixed set of active processors, is equivalent to minimizing \(\sum _{i=1}^{m^{\prime }}\ell _{i}x_{i}\):
$$\begin{aligned}&\mathrm {LP}_{\text {cost}}(T)\text {:}\min \sum _{i=1}^{m^{\prime }}\ell _{i}x_{i} \end{aligned}$$
(13)
$$\begin{aligned}&\mathrm {s.t.~~} \nonumber \\&\sum \limits _{k=1}^{i}\left( s_{k}+c_{k}x_{k}\right) +\left( p_{i}+a_{i}x_{i}\right) \le T, \quad i=1,\ldots ,m^{\prime }, \end{aligned}$$
(14)
$$\begin{aligned}&r_{i}+\left( p_{i}+a_{i}x_{i}\right) \le T,\quad i=1,\ldots ,m^{\prime }, \end{aligned}$$
(15)
$$\begin{aligned}&\sum \limits _{k=1}^{i}\left( s_{k}+c_{k}x_{k}\right) +\left( p_{i}+a_{i}x_{i}\right) \le d_{i}, \quad i=1,\ldots ,m^{\prime }, \end{aligned}$$
(16)
$$\begin{aligned}&r_{i}+\left( p_{i}+a_{i}x_{i}\right) \le d_{i}, \quad i=1,\ldots ,m^{\prime }, \end{aligned}$$
(17)
$$\begin{aligned}&0\le x_{i}\le B_{i}, \quad i=1,\ldots ,m^{\prime }, \end{aligned}$$
(18)
$$\begin{aligned}&\sum \limits _{i=1}^{m^{\prime }}x_{i}=V. \end{aligned}$$
(19)
Simplifying the model, we combine (14) with (16) and also combine (15), (17), (18) to get
$$\begin{aligned}&\mathrm {LP}_{\text {cost}}(T)\text {:}\min \sum _{i=1}^{m^{\prime }}\ell _{i}x_{i} \end{aligned}$$
(20)
$$\begin{aligned}&\mathrm {s.t.~~} \nonumber \\&\sum \limits _{k=1}^{i}\left( s_{k}+c_{k}x_{k}\right) +\left( p_{i}+a_{i}x_{i}\right) \le \min \{T,d_{i}\},~~ \nonumber \\&i=1,\ldots ,m^{\prime }, \end{aligned}$$
(21)
$$\begin{aligned}&0\le x_{i}\le \min \{B_{i},\tfrac{1}{a_{i}}\left( \min \{T,d_{i}\}-r_{i}-p_{i}\right) \}, \nonumber \\&i=1,\ldots ,m^{\prime }, \end{aligned}$$
(22)
$$\begin{aligned}&\sum \limits _{i=1}^{m^{\prime }}x_{i}=V. \end{aligned}$$
(23)
The number of variables and the number of constraints can each be reduced by one, using Eq. (23). In the resulting LP, there are \(m^{\prime }-1\) variables and \(2m^{\prime }\) constraints, so that problem \(\mathrm {DL}_{\text {cost}}(T)\) can be solved in \(O(\eta \times \mathrm {LP}(m-1,2m))\) time, where \(\eta \) is given by (12).
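As with \(\mathrm {LP}_{\text {time}}(K)\), the simplified model (20)–(23) maps directly onto a generic LP solver. The following sketch assumes SciPy is available and a fixed activation order; fixed costs \(f_{i}\) are dropped from the objective, which is equivalent when the set of active processors is fixed:

```python
from scipy.optimize import linprog

def lp_cost(T, V, r, d, B, s, c, p, a, l):
    """Solve the simplified LP_cost(T), constraints (20)-(23), for the
    fixed activation order 1..m'. Returns (cost, x) or None if infeasible."""
    m = len(a)
    A_ub, b_ub = [], []
    for i in range(m):
        row = [c[k] if k <= i else 0.0 for k in range(m)]
        row[i] += a[i]                                    # (21)
        A_ub.append(row)
        b_ub.append(min(T, d[i]) - sum(s[: i + 1]) - p[i])
    bounds = [(0.0, min(B[i], (min(T, d[i]) - r[i] - p[i]) / a[i]))
              for i in range(m)]                          # (22)
    res = linprog(l, A_ub, b_ub, [[1.0] * m], [V], bounds)  # (20), (23)
    return (res.fun, list(res.x)) if res.success else None
```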

2.3 Time–cost trade-off

Consider now the bicriteria problem \(\mathrm {DL}_{\text {bicrit}}\). The approach described below constructs at most \(\eta \) trade-offs, one for each fixed processor activation sequence, and then takes their lower envelope.

For a fixed sequence \(\mathcal {S}_{i}\) with \(m^{\prime }\) processors, \( m^{\prime }\le m\), the trade-off consists of linear pieces that connect breakpoints \(\left( T^{0},K^{0}\right) ,\left( T^{1},K^{1}\right) ,\ldots ,\left( T^{q},K^{q}\right) \),
$$\begin{aligned} T^{0}\le & {} T^{1}\le \cdots \le T^{q}, \\ K^{0}\ge & {} K^{1}\ge \cdots \ge K^{q}. \end{aligned}$$
The schedule corresponding to the extreme point \(\left( T^{0},K^{0}\right) \) has the smallest length \(T^{0}\), which can be found by solving \(\mathrm {LP}_{\text {time}}(\infty )\), i.e., the model with \(K=\infty \) or, equivalently, with inequality (10) eliminated. For the calculated \(T^{0}\), the associated minimum cost \(K^{0}\) can be found by solving \(\mathrm {LP}_{\text {cost}}(T^{0})\).

The other extreme point \(\left( T^{q},K^{q}\right) \) corresponds to the schedule with the smallest cost \(K^{q}\). It can be found by solving \(\mathrm {LP}_{\text {cost}}(\infty )\) with \(T=\infty \); then, for the found value of \(K^{q}\), the associated minimum schedule length \(T^{q}\) can be found by solving \(\mathrm {LP}_{\text {time}}(K^{q})\).

The remaining points \((T^{i},K^{i})\), \(1\le i\le q-1\), of the trade-off for the fixed processor activation sequence \(\mathcal {S}_{i}\) can be found as the solution to the parametric linear programming problem \(\mathrm {LP}_{\text {cost}}(T)\) of type (13)–(19), with variable \(T\in \left[ T^{0},T^{q}\right] \) in constraints (14)–(15) treated as a parameter. Again, the model can be simplified so that the resulting formulation has \(m^{\prime }-1\) x-variables (after eliminating one variable using (19)) and \(4m^{\prime }\) constraints (after combining (17) with (18) and eliminating the equality constraint (19)). Using standard methods of parametric optimization, the breakpoints of the trade-off can be found in \(O(q\times \mathrm {LP}(m-1,4m))\) time, see, e.g., Adler and Monteiro (1992). The number of breakpoints q is bounded by the number of basic feasible solutions, which does not exceed \(\left( {\begin{array}{c}4m\\ m-1\end{array}}\right) \) for a linear program with \(m-1\) variables and 4m constraints. Using the upper bound
$$\begin{aligned} Q=4^{m}m^{m}, \end{aligned}$$
(24)
for the latter expression, we conclude that the time complexity of constructing the trade-off for a fixed processor activation sequence \(\mathcal {S}_{i}\) of \(m^{\prime }\) processors, \(m^{\prime }\le m\), is \(O(Q\times \mathrm {LP}(m-1,4m))\). With the upper bound \(\eta \) on the number of processor sequences given by (12), the overall time complexity of constructing all \(\eta \) trade-off curves is \(O(\eta Q\times \mathrm {LP}(m-1,4m))=O\left( 8^{m}m^{m}m!\times \mathrm {LP}(m-1,4m)\right) \).
We have to construct a merged set of Pareto-optimal points in the \(\left( T,K\right) \)-space over all processor sequences \(\mathcal {S}_{i}\). The resulting set of points \(\mathcal {F}\) and the segments connecting them constitute a non-increasing function of cost K in time T. This function may contain convex and concave parts, and it can be discontinuous, see Fig. 1. A point of discontinuity \(\left( t,K(t)\right) \) corresponds to, e.g., the left end of a trade-off curve for some processor sequence \(\mathcal {S}_{i}\) such that another processor sequence \(\mathcal {S}_{j}\) incurs a higher cost at t, or the right end of the \(\mathcal {S}_{j}\) trade-off curve lies to the left of t.
Fig. 1

Merging trade-off curves into the Pareto-front

A possible approach to finding the overall trade-off may consist of the following two steps.
  1. Find the intersection points of all pieces of the trade-off curves defined for the various processor sequences;

  2. Find the minimal layer of the breakpoints and the intersection points, using, for example, the approach outlined in Cormen et al. (2001, p. 1045) for the symmetric problem of finding the maximal layer. Note that given z points in the plane, the minimal or maximal layer can be found in \(O(z\log z)\) time.
To handle the discontinuity, introduce vertical lines from left end-points of individual trade-offs, horizontal lines from the right end-points, and include those pieces in Step (1). Then the overall number of linear pieces considered in Step (1) is not larger than \(\eta \left( Q+2\right) \), their \( O(\eta ^{2}Q^{2})\) intersection points can be found in \(O(\eta ^{2}Q^{2})\) time, see, e.g., Cormen et al. (2001), and finally the minimal layer is constructed in \(O(\eta ^{2}Q^{2}\log (\eta Q))\) time. Taking into account expressions (12) and (24) for \(\eta \) and Q and the time complexity \( O\left( 8^{m}m^{m}m!\times \mathrm {LP}(m-1,4m)\right) \) for constructing all individual trade-offs, we conclude that the time complexity of solving problem \(\mathrm {DL}_{\text {bicrit}}\) is polynomially bounded for fixed m.
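Once the breakpoints and intersection points have been collected, the minimal layer itself reduces to a staircase sweep: sort the z candidate points by T and keep a point only if it improves the best cost seen so far. A minimal sketch (illustrative; it omits the auxiliary vertical and horizontal segments discussed above):

```python
def minimal_layer(points):
    """Pareto-minimal points among (T, K) pairs: a point survives iff no
    earlier-or-equal T achieves a smaller-or-equal K. Sorting dominates the
    running time, giving O(z log z) for z points."""
    best_K = float("inf")
    layer = []
    for T, K in sorted(set(points)):   # ascending T, ties broken by K
        if K < best_K:                 # strictly improves the cost
            layer.append((T, K))
            best_K = K
    return layer
```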

Statement 1

Problems \(\mathrm {DL}_{\text {time}}(K)\), \(\mathrm {DL}_{\text {cost}}(T)\), \(\mathrm {DL}_{\text {bicrit}}\) are fixed parameter tractable with respect to the number of machines m.

3 Nonzero time/cost parameters—zero fixed overheads: \( p_{i}=s_{i}=f_{i}=0\)

In this section, we assume that all fixed overheads are equal to zero, i.e., \( p_{i}=s_{i}=f_{i}=0\), for \(i=1,\dots ,m\), while the linear components of transfer time (\(c_{i}\)), computation time (\(a_{i}\)) and cost (\(\ell _{i}\)) are not simultaneously equal to zero. In terms of the three types of decisions introduced in Sect. 1, decisions of type 3 imply decisions of type 1: any processor \(P_{i}\) which gets a chunk \(x_{i}=0\) does not contribute to any time or cost component because there are no fixed overheads. This implies that a processor receiving a 0-size chunk can be removed from the list \(\mathcal {P}^{\prime }\) of active processors. Thus, it is enough to make decisions 2 and 3. Our main result of this section is the NP-hardness proof of \(\mathrm {DL}_{\text {cost}}(T)\) and \(\mathrm {DL} _{\text {time}}(K)\).

3.1 Limited schedule length T and limited cost K

To prove NP-hardness of problem \(\mathrm {DL}_{\text {cost}}(T)\), let us introduce its decision version \(\mathrm {DL}(T,K)\), which verifies whether there exists a feasible solution with the schedule length and the cost not exceeding the given thresholds T and K, respectively. We reduce Even-odd partition to problem \(\mathrm {DL}(T,K)\).

Even-odd partition is defined as follows: given a set \( E=\{e_{1},\dots ,e_{2n}\}\) of positive integers, is there a subset \( E_{1}\subset E\) such that \(\sum _{e_{i}\in E_{1}}e_{i}=\sum _{e_{i}\in E\setminus E_{1}}e_{i}=G\) and \(E_{1}\) contains exactly one element from pair \(e_{2i-1},e_{2i}\), for \(i=1,\dots ,n\)? For an instance of even-odd partition, we construct an instance of \(\mathrm {DL}(T,K)\) as follows. Let \( T>0\) be an arbitrary schedule length and \(m=2n\) be the number of processors of the set \(\mathcal {P}\). The processor parameters are defined as \(r_{i}=0\), \(d_{i}=B_{i}=\infty \) for \(i=1,\dots ,2n\), and
$$\begin{aligned} \begin{array}{clcll} a_{2i-1} &{} = &{} c_{2i-1} &{} = &{} \dfrac{T}{2^{2i-1}\left( G^{n-i+2}+e_{2i-1}\right) }, \\ &{}&{}\ell _{2i-1} &{} = &{} \dfrac{e_{2i-1}}{G^{n-i+2}+e_{2i-1}}, \\ a_{2i} &{} = &{} c_{2i} &{} = &{} \dfrac{T}{2^{2i-1}\left( G^{n-i+2}+e_{2i}\right) },\\ &{}&{} \ell _{2i} &{} = &{} \dfrac{e_{2i}}{G^{n-i+2}+e_{2i}}, \end{array} \end{aligned}$$
(25)
for \(i=1,\dots ,n\). The load size V and the cost limit K are given by
$$\begin{aligned} V=\frac{3}{2}\sum _{i=1}^{n+1}G^{i}, \quad K=\frac{3}{2}G. \end{aligned}$$
It is easy to see that the reduction is polynomial.

Lemma 1

If there exists a solution to an instance of even-odd partition, then there exists a schedule S for the instance \( \mathrm {DL}(T,K)\) which satisfies the following properties.
  (i) The whole load of size V is fully processed.

  (ii) Every processor is fully loaded, completing its load chunk at time T.

  (iii) The cost of the schedule is K.

  (iv) Schedule S defines a solution to the instance of \(\mathrm {DL}(T,K)\).

Proof

Let \(e_{i1}\) denote the element of the pair \(\{e_{2i-1},e_{2i}\}\) which belongs to \(E_{1}\), and \(e_{i2}\) be the other element of the pair, \(i=1,\dots ,n\). Construct schedule S by selecting the processor activation sequence \((P_{11},P_{12},\ldots ,P_{u1},P_{u2},\ldots ,P_{n1},P_{n2})\), where processors \(P_{i1},P_{i2}\) correspond to \(e_{i1},e_{i2}\), respectively, and define the load chunks as
$$\begin{aligned} x_{i1}=G^{n-i+2}+e_{i1}, \quad x_{i2}=\frac{1}{2}\left( G^{n-i+2}+e_{i2}\right) , \quad i=1,\dots ,n. \end{aligned}$$
(26)
Fig. 2

A feasible schedule with processor sequence \( (P_{11},P_{12},~P_{21},P_{22},~P_{31},P_{32})\)

We prove that properties (i)–(iv) hold for schedule S.

(i) The sum of all chunks allocated to processors \( \mathcal {P}\) is as follows:
$$\begin{aligned} \mathrm {Load}= & {} \sum \limits _{i=1}^{n}(x_{i1}+x_{i2})\nonumber \\= & {} \sum \limits _{i=1}^{n} \left( G^{n-i+2}+e_{i1}+\frac{1}{2}G^{n-i+2}+\frac{1}{2}e_{i2}\right) \nonumber \\= & {} \frac{3}{2}\sum \limits _{i=1}^{n}G^{n-i+2}+\frac{1}{2}\sum \limits _{i=1}^{n}(e_{i1}+e_{i2})+\frac{1}{2}\sum \limits _{i=1}^{n}e_{i1}\nonumber \\= & {} \frac{3}{2}\sum \limits _{i=1}^{n}G^{n-i+2}+G+\frac{1}{2}G=V. \end{aligned}$$
(27)
(ii) For any processor \(P_{i1}\), \(1\le i\le n\), its communication time \(\mu _{i1}\) and computation time \(\nu _{i1}\) are given by
$$\begin{aligned} \mu _{i1}= & {} \nu _{i1}= \frac{T}{2^{2i-1}\left( G^{n-i+2}+e_{i1}\right) }\times \left( G^{n-i+2}+e_{i1}\right) \\= & {} \frac{T}{2^{2i-1}}. \end{aligned}$$
Similarly, for any processor \(P_{i2}\) the associated values are
$$\begin{aligned}&\mu _{i2}=\nu _{i2}=\frac{T}{2^{2i-1}\left( G^{n-i+2}+e_{i2}\right) }\times \frac{1}{2}\left( G^{n-i+2}+e_{i2}\right) \\&\quad \quad =\frac{T}{2^{2i}}. \end{aligned}$$
Thus, the schedule is of the form shown in Fig. 2, with every processor \(P_{ij}\) completing at time T, \(i=1,\ldots ,n\), \(j=1,2\).
(iii) The cost of S is as follows:
$$\begin{aligned}&\mathrm {Cost}=\sum \limits _{i=1}^{n}\left( \ell _{i1}x_{i1}+\ell _{i2}x_{i2}\right) \nonumber \\&\quad \quad =\sum \limits _{i=1}^{n}\left( \frac{e_{i1}}{G^{n-i+2}+e_{i1}}\times (G^{n-i+2}+e_{i1})\right. \nonumber \\&\quad \left. +\frac{e_{i2}}{G^{n-i+2}+e_{i2}}\times \frac{1}{2}(G^{n-i+2}+e_{i2}) \right) \nonumber \\&\quad =\sum \limits _{i=1}^{n}\left( e_{i1}+\frac{1}{2}e_{i2}\right) = \frac{3}{2}G=K. \end{aligned}$$
(28)
By properties (i)–(iii) the schedule is feasible and it defines a solution to \(\mathrm {DL}(T,K)\) so that property (iv) holds.\(\square \)
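The construction (26)–(28) can also be checked numerically. The sketch below (Python; the instance with \(n=2\), \(G=32\) and the pair values is ours, chosen as a yes-instance of Even-odd partition, and is purely illustrative) recomputes the chunk sizes of (26) with exact rational arithmetic and confirms that \(\mathrm{Load}=V\) and \(\mathrm{Cost}=K\).

```python
from fractions import Fraction as F

def check_construction(pairs, G):
    """pairs[i] = (e_i1, e_i2); for a yes-instance of Even-odd partition
    the e_i1-values sum to G and all 2n values sum to 2G."""
    n = len(pairs)
    V = F(3, 2) * sum(G**i for i in range(1, n + 2))   # target load
    K = F(3, 2) * G                                    # cost budget
    load = cost = F(0)
    for i, (e1, e2) in enumerate(pairs, start=1):
        g = G**(n - i + 2)
        x1 = F(g + e1)                  # chunk of P_i1, eq. (26)
        x2 = F(g + e2, 2)               # chunk of P_i2, eq. (26)
        load += x1 + x2
        # per-unit costs l_ij = e_ij / (G^{n-i+2} + e_ij)
        cost += F(e1, g + e1) * x1 + F(e2, g + e2) * x2
    return load == V and cost == K      # eqs. (27), (28)

# n = 2, G = 32 (> 2^4); 10 + 22 = G and 10 + 20 + 22 + 12 = 2G
print(check_construction([(10, 20), (22, 12)], 32))  # True
```

A perturbed instance with \(\sum e_{i1}\ne G\) (e.g., replacing the first pair by (9, 21)) fails the check, in line with property (5) of Lemma 2.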

In the remaining part we prove that if there exists a solution to the instance of problem \(\mathrm {DL}(T,K)\), then there exists a solution to the related instance of Even-odd partition. The lemma below starts with auxiliary properties of a feasible schedule and concludes with the main result.

Lemma 2

A feasible schedule S for the instance of \(\mathrm {DL} (T,K)\) satisfies the following properties.
  (1) Schedule S can be transformed into a schedule with processor activating sequence \((\left\{ P_{11},P_{12}\right\} ,\ldots , \left\{ P_{u1},P_{u2}\right\} , \left\{ P_{u+1,1},P_{u+1,2}\right\} ,\ldots , \left\{ P_{n1},P_{n2}\right\} )\).
  (2) Consider a feasible schedule obtained by transformation (1). Renumber processors in the order they appear in the activating sequence and renumber the associated values \(e_{u1}\) and \(e_{u2}\) accordingly. For the resulting schedule, with processor activating sequence \((P_{11},P_{12},\ldots ,P_{u1},P_{u2},\ldots ,P_{n1},P_{n2})\), the following inequality holds:
     $$\begin{aligned} \sum \limits _{i=1}^{n}2^{2i}G^{n-i+2}y_{i}\ge 3\sum \limits _{i=2}^{n+1}G^{i}, \end{aligned}$$
     where
     $$\begin{aligned} y_{i}= & {} \frac{1}{T}\left( c_{i1}x_{i1}+c_{i2}x_{i2}\right) \nonumber \\= & {} \frac{1}{2^{2i-1}} \left( \frac{x_{i1}}{G^{n-i+2}+e_{i1}}+\frac{x_{i2}}{G^{n-i+2}+e_{i2}} \right) , \end{aligned}$$
     (29)
     for \(i=1,\ldots ,n\).
  (3) If in a feasible schedule satisfying property (2) at least one processor \(P_{uk}\), \(1\le u\le n\), \(k=1,2\), is not fully loaded (i.e., \(C_{uk}<T\) holds), then
     $$\begin{aligned} \sum \limits _{i=1}^{n}2^{2i}G^{n-i+2}y_{i}<3\sum \limits _{i=2}^{n+1}G^{i}. \end{aligned}$$
  (4) Each of the 2n processors is fully loaded and has completion time T.
  (5) Equality \(\sum _{i=1}^{n}e_{i1}=G\) holds so that the set \(\left\{ e_{i1}\right\} _{i=1}^{n}\) defines a solution to Even-odd partition.

Proof

(1) Let schedule \(S\ \)be of the form \(S=\left( P_{h_{1}},P_{h_{2}},\!\ldots \!,P_{h_{2n}}\right) \) and \(P_{h_{z}}\) be an out-of-order processor such that it belongs to a pair \(\left\{ P_{u1},P_{u2}\right\} \) with the smallest index u, and its predecessor in S is \(P_{h_{z-1}}\in \left\{ P_{v1},P_{v2}\right\} \) with \(v>u\). Notice that
$$\begin{aligned} c_{h_{z-1}}>2c_{_{h_{z}}} \end{aligned}$$
(30)
since
$$\begin{aligned}&\frac{c_{h_{z-1}}}{c_{_{h_{z}}}}\\&\quad =\frac{T}{2^{2v-1}\left( G^{n-v+2}+e_{h_{z-1}}\right) }/ \frac{T}{2^{2u-1}\left( G^{n-u+2}+e_{h_{z}}\right) }\\&\quad =\frac{G^{n-u+2}+e_{h_{z}}}{2^{2(v-u)}\left( G^{n-v+2}+e_{h_{z-1}}\right) } \\&\quad>\frac{G^{n-u+2}}{2^{2(v-u)+1}G^{n-v+2}}=\frac{G^{v-u}}{2^{2(v-u)+1}}>2, \end{aligned}$$
where the first inequality holds since \(e_{h_{z}}>0\) and \(e_{h_{z-1}}<G^{n-v+2}\), and the second one is satisfied for a sufficiently large G, for example \(G>2^{4}\).

Let t be the starting time of communication of processor \(P_{h_{z-1}}\) in S. Modify the fragment of schedule S, starting from t, by moving the full load \(x_{h_{z-1}}\) from \(P_{h_{z-1}}\) to \(P_{h_{z}}\). In the new schedule, processor \(P_{h_{z}}\) finishes communication at time \( t+c_{h_{z}}(x_{h_{z-1}}+x_{h_{z}})\), which is less than \( t+c_{h_{z-1}}x_{h_{z-1}}+c_{h_{z}}x_{h_{z}}\), the communication finish time of \(P_{h_{z}}\) in S (since \(c_{h_{z-1}}>c_{_{h_{z}}}\)). The same is true for the computation completion time: the new completion time of \(P_{h_{z}}\) is \(t+2c_{h_{z}}(x_{h_{z-1}}+x_{h_{z}})\), which is less than \( t+c_{h_{z-1}}x_{h_{z-1}}+2c_{h_{z}}x_{h_{z}}\), the completion time of \(P_{h_{z}}\) in the original schedule (since \(c_{h_{z-1}}>2c_{_{h_{z}}}\) by (30)). The cost of the modified schedule is less than that of the original one since each of the values \(\ell _{v1}\) and \(\ell _{v2}\) is greater than \(\ell _{u1}\) and \(\ell _{u2}\).

As a result of the described transformation, processor \(P_{h_{z}}\) takes the full load of processor \(P_{h_{z-1}}\), making \(P_{h_{z-1}}\) idle. Modify the processor sequence by swapping \(P_{h_{z-1}}\) and \(P_{h_{z}}\). If \(P_{h_{z}}\) is still out of order, then perform a similar transformation: move the load from \(P_{h_{z-2}}\) to \(P_{h_{z}}\), making \(P_{h_{z-2}}\) idle and swap the two processors. Continue shifting processor \(P_{h_{z}}\) upstream until it reaches the right position in the schedule, immediately after its partner from the pair \(\left\{ P_{u1},P_{u2}\right\} \) or after a pair of processors \(\left\{ P_{u-1,1},P_{u-1,2}\right\} \). Repeating the same transformation, we construct a schedule with no larger length and with a smaller cost.

(2) For a feasible schedule, the two inequalities \( \mathrm {Load}\ge V\) and \(\mathrm {Cost}\le K\) hold so that
$$\begin{aligned} \mathrm {Load}-\mathrm {Cost}\ge V-K=\frac{3}{2}\sum _{i=2}^{n+1}G^{i}. \end{aligned}$$
(31)
Using the expression
$$\begin{aligned}&\mathrm {Load}-\mathrm {Cost} \\&\quad =\sum \limits _{i=1}^{n}\left( x_{i1}+x_{i2}\right) -\sum \limits _{i=1}^{n}\left( \ell _{i1}x_{i1}+\ell _{i2}x_{i2}\right) \\&\quad =\sum \limits _{i=1}^{n}\left( x_{i1}+x_{i2}\right) \\&\qquad -\sum \limits _{i=1}^{n}\left( \dfrac{e_{i1}x_{i1}}{G^{n-i+2}+e_{i1}}+\dfrac{e_{i2}x_{i2}}{G^{n-i+2}+e_{i2}} \right) \\&\quad =\sum \limits _{i=1}^{n}G^{n-i+2}\left( \frac{x_{i1}}{G^{n-i+2}+e_{i1}}+ \frac{x_{i2}}{G^{n-i+2}+e_{i2}}\right) \end{aligned}$$
and by applying variables \(y_{i}\), \(i=1,\ldots ,n,\) defined by (29), the necessary condition (31) can be rewritten as
$$\begin{aligned} \sum _{i=1}^{n}2^{2i-1}G^{n-i+2}y_{i}\ge \frac{3}{2}\sum _{i=2}^{n+1}G^{i} \end{aligned}$$
so that property (2) holds. Let us observe that violating (31) means that either insufficient load is processed or the cost limit is exceeded.
(3) Consider a feasible schedule with processor activating sequence \((P_{11},P_{12},\)\(\ldots ,\)\(P_{i1},P_{i2},\)\(\ldots ,\)\( P_{n1},P_{n2})\), where at least one condition from
$$\begin{aligned} C_{ik}\le T,\quad i=1,\ldots ,n,\quad k=1,2, \end{aligned}$$
(32)
holds as a strict inequality. Taking a linear combination of these inequalities weighted by constants \(\lambda _{i}\) and \(2\lambda _{i}\), we obtain:
$$\begin{aligned} \sum _{i=1}^{n}\lambda _{i}\left( C_{i1}+2C_{i2}\right) <\sum _{i=1}^{n}3\lambda _{i}T, \end{aligned}$$
(33)
where
$$\begin{aligned} \begin{array}{ll} \lambda _{i}=2^{2(i-1)}\left( G^{n-i+2}-3\sum \limits _{u=2}^{n-i+1}G^{u}\right) ,&i=1,\ldots ,n. \end{array} \end{aligned}$$
(34)
Notice that \(\lambda _{i}>0\) for \(G>4\) since \(\lambda _{i}/2^{2(i-1)}=G^{n-i+2}-3\frac{G^{n-i+2}-G^{2}}{G-1}= \frac{G^{n-i+3}-4G^{n-i+2}+3G^{2}}{G-1}>0\). In what follows, we show that the left-hand side of (33) is equal to \(\left( \sum _{i=1}^{n}2^{2i}G^{n-i+2}y_{i}\right) T\) and the right-hand side is equal to \(\left( 3\sum _{i=2}^{n+1}G^{i}\right) T\) so that property (3) holds.
Starting with the left-hand side, we deduce:
$$\begin{aligned} \mathrm {LHS}= & {} \sum _{i=1}^{n}\lambda _{i}\left( \left[ \sum \limits _{u=1}^{i-1}\left( c_{u1}x_{u1}+c_{u2}x_{u2}\right) +2c_{i1}x_{i1}\right] \right. \\&\left. +2\left[ \sum \limits _{u=1}^{i-1}\left( c_{u1}x_{u1}+c_{u2}x_{u2}\right) +c_{i1}x_{i1}+2c_{i2}x_{i2}\right] \right) \\= & {} \sum _{i=1}^{n}\lambda _{i}\left( 3\sum \limits _{u=1}^{i-1}\left( c_{u1}x_{u1}+c_{u2}x_{u2}\right) \right. \\&\left. +4(c_{i1}x_{i1}+c_{i2}x_{i2})\right) \\= & {} \sum _{i=1}^{n}\lambda _{i} \left( 3\sum \limits _{u=1}^{i-1}y_{u}+4y_{i}\right) T\\= & {} \left( \sum _{i=1}^{n-1}\left( 4\lambda _{i}+3\sum \limits _{u=i+1}^{n}\lambda _{u}\right) y_{i}+ 4\lambda _{n}y_{n}\right) T\\= & {} \left( \sum _{i=1}^{n-1}b_{i}y_{i}+b_{n}y_{n}\right) T, \end{aligned}$$
where
$$\begin{aligned} b_{i}= & {} 4\lambda _{i}+3\sum \limits _{u=i+1}^{n}\lambda _{u},\quad i=1,2,\ldots ,n-1, \end{aligned}$$
(35)
$$\begin{aligned} b_{n}= & {} 4\lambda _{n}. \end{aligned}$$
(36)
It remains to prove that
$$\begin{aligned} b_{i}=2^{2i}G^{n-i+2},\quad 1\le i\le n. \end{aligned}$$
(37)
Indeed, by (34), \(4\lambda _{n}=2^{2n}G^{2}\) and (37) holds for \(i=n\). If (37) holds for some i, \(2\le i\le n\), then it also holds for \(i-1\) since by (35)
$$\begin{aligned} b_{i-1}= & {} 4\lambda _{i-1}+3\sum \limits _{u=i}^{n}\lambda _{u}= b_{i}+4\lambda _{i-1}-\lambda _{i}\\= & {} 2^{2i}G^{n-i+2}\\&+4\times 2^{2(i-2)}\left( G^{n-i+3}-3\sum _{u=2}^{n-i+2}G^{u}\right) \\&-2^{2(i-1)}\left( G^{n-i+2}-3\sum _{u=2}^{n-i+1}G^{u}\right) \\= & {} 2^{2i}G^{n-i+2}+2^{2(i-1)}G^{n-i+3}\\&-3\times 2^{2(i-1)}\sum _{u=2}^{n-i+2}G^{u}-2^{2(i-1)}G^{n-i+2}\\&+3\times 2^{2(i-1)}\sum _{u=2}^{n-i+1}G^{u}\\= & {} 2^{2(i-1)}G^{n-i+3}\\&+2^{2(i-1)}\left( 4G^{n-i+2}-3G^{n-i+2}-G^{n-i+2}\right) \\= & {} 2^{2(i-1)}G^{n-i+3}. \end{aligned}$$
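Identity (37) and the positivity of the multipliers can also be confirmed numerically. The short sketch below (Python; the values of n and G are illustrative) evaluates (34)–(36) directly.

```python
# Sanity check of identity (37): with lambda_i from (34) and b_i from
# (35)-(36), every b_i equals 2^{2i} G^{n-i+2} (exact integer arithmetic).
def lam(i, n, G):
    # eq. (34): lambda_i = 2^{2(i-1)} (G^{n-i+2} - 3 sum_{u=2}^{n-i+1} G^u)
    return 4 ** (i - 1) * (G ** (n - i + 2)
                           - 3 * sum(G ** u for u in range(2, n - i + 2)))

def b(i, n, G):
    if i == n:
        return 4 * lam(n, n, G)                        # eq. (36)
    return 4 * lam(i, n, G) + 3 * sum(lam(u, n, G)     # eq. (35)
                                      for u in range(i + 1, n + 1))

n, G = 4, 32
print(all(b(i, n, G) == 4 ** i * G ** (n - i + 2) for i in range(1, n + 1)))
# True; moreover every lambda_i is positive, as required for G > 4
```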
Consider now the expression in the right-hand side of (33), divided by 3T:
$$\begin{aligned} \frac{\mathrm {RHS}}{3T}= & {} \sum _{i=1}^{n}\lambda _{i}=\sum _{i=1}^{n}2^{2(i-1)}\left( G^{n-i+2}-3\sum _{u=2}^{n-i+1}G^{u}\right) \\= & {} \sum _{i=1}^{n}2^{2(i-1)}G^{n-i+2}-3\sum _{i=1}^{n-1}2^{2(i-1)} \sum _{u=2}^{n-i+1}G^{u}\\= & {} \sum _{u=2}^{n+1}2^{2(n-u+1)}G^{u}-3\sum _{u=2}^{n}G^{u} \sum _{i=u}^{n}2^{2(n-i)}\\= & {} \sum _{u=2}^{n}\left( 2^{2(n-u+1)}-3\sum _{i=u}^{n}2^{2(n-i)}\right) G^{u}+G^{n+1} \\= & {} \sum _{u=2}^{n}\left( 2^{2(n-u+1)}-3\times \frac{4\times 2^{2(n-u)}-1}{4-1} \right) G^{u}\\&+G^{n+1}=\sum _{u=2}^{n+1}G^{u}. \end{aligned}$$
Thus, inequality (33) is proved.

Property (4) immediately follows from properties (2) and (3). It remains to prove property (5).

By property (4), \(C_{11}=T\) so that
$$\begin{aligned} C_{11}=2c_{11}x_{11}=2\times \frac{T}{2\left( G^{n+1}+e_{11}\right) }x_{11}=T \end{aligned}$$
and \(x_{11}=G^{n+1}+e_{11}\). By the same property, \(C_{12}=T\) so that
$$\begin{aligned} \begin{array}{lll} C_{12}=c_{11}x_{11}+2c_{12}x_{12}=\frac{T}{2}+2\times \frac{T}{2\left( G^{n+1}+e_{12}\right) }x_{12}=T \end{array} \end{aligned}$$
and \(x_{12}=\frac{1}{2}\left( G^{n+1}+e_{12}\right) \). Proceeding in a similar way it is easy to verify that in a feasible schedule variables \(x_{i1}\), \(x_{i2}\) are defined by (26).
In order to prove the equality \(\sum _{i=1}^{n}e_{i1}=G\), consider conditions \(\mathrm {Load}\ge V\) and \(\mathrm {Cost}\le K\) which hold for any feasible schedule. Repeating calculations (27) and (28), we obtain:
$$\begin{aligned} \mathrm {Load}&=\frac{3}{2}\sum _{i=1}^{n}G^{n-i+2}+ \frac{1}{2}\sum _{i=1}^{n}(e_{i1}+e_{i2})+\frac{1}{2}\sum _{i=1}^{n}e_{i1}\\&=\frac{3}{2}\sum _{i=2}^{n+1}G^{i}+G+\frac{1}{2}\sum _{i=1}^{n}e_{i1}\ge V= \frac{3}{2} \sum _{i=1}^{n+1}G^{i},&\\ \mathrm {Cost}&=\sum _{i=1}^{n}\left( e_{i1}+\frac{1}{2}e_{i2}\right) \\&=\frac{1}{2}\sum _{i=1}^{n}\left( e_{i1}+e_{i2}\right) +\frac{1}{2}\sum _{i=1}^{n}e_{i1}\\&=G+\frac{1}{2}\sum _{i=1}^{n}e_{i1}\le K=\frac{3}{2}G. \end{aligned}$$
The latter two inequalities imply
$$\begin{aligned} \frac{1}{2}\sum _{i=1}^{n}e_{i1}\ge & {} \frac{1}{2}G, \\ \frac{1}{2}\sum _{i=1}^{n}e_{i1}\le & {} \frac{1}{2}G, \end{aligned}$$
which together imply property (5). \(\square \)

We conclude with the main result which follows from Lemmas 1 and 2.

Theorem 1

Problem \(\mathrm {DL}(T,K)\) is NP-complete, and problems \(\mathrm {DL}_{\text {cost}}(T)\) and \(\mathrm {DL}_{\text {time}}(K)\) are NP-hard, even if computation time, communication time and cost have no fixed overheads and \(r_i=0, d_i=B_i=\infty \), for \( i=1,\dots ,m\).

3.2 Time–cost trade-off

For zero overheads, the arguments from Sect. 2 can be simplified. The MIP formulations (13)–(19) and (20)–(23) remain valid, but the number of different sequences can be reduced from \(\eta =2^{m}m!\), given by (12), to \(\eta =m!\). Indeed, with zero overheads there is no need to select a set of active processors \(\mathcal {P}^{\prime }\), since an idle processor can be kept at any position of the sequence. The smaller value of \(\eta \) results in a slightly lower time complexity for enumerating all trade-offs, namely \(O\left( 4^{m}m^{m}m!\times LP(m-1,4m)\right) \).

The problem of finding the extreme points in the (T, K)-space, with the shortest schedule or with the smallest cost, was addressed in prior research for the special case when all processors are available simultaneously and have no deadline and capacity restrictions, \(r_{i}=0\), \( d_{i}=B_{i}=\infty \) for all \(1\le i\le m\). As shown in Bharadwaj et al. (1994, 1996) and Blazewicz and Drozdowski (1997), the shortest schedule is obtained if processors are sequenced in non-decreasing order of \(c_{i}\) and all complete their computations simultaneously. For the same special case, the cheapest solution is constructed by processing the whole load on the cheapest processor, i.e., \(P_{i}:\ell _{i}=\min _{j=1}^{m}\{\ell _{j}\}\). Hence, in the bicriteria problem \(\mathrm {DL}_{\text {bicrit}}\) the end-points \( (T^{0},K^{0})\) and \((T^{q},K^{q})\) of the time–cost trade-off can be found in \(O(m\log m)\) and O(m) time, respectively.

For the general case of arbitrary \(r_{i}\), \(d_{i}\), \(B_{i}\), finding the solution \(\left( T^{q},K^{q}\right) \) with the lowest cost, i.e., the rightmost point \((T^{q},K^{q})\) of the time–cost trade-off, is computationally hard by Theorem 1: even though the schedule length may be arbitrary when minimizing cost, the processor availability constraints \(r_{i},d_{i},B_{i}\) may impose restrictions equivalent to a schedule length limit. We conjecture that finding the solution \(\left( T^{0},K^{0}\right) \) with the shortest schedule is also computationally hard.

Conjecture 1

For arbitrary \(r_{i},d_{i},B_{i}\), for all processors \(P_{i}\in \mathcal {P}\), problem \(\mathrm {DL}_{\text {time}}(\infty )\) is NP-hard, even if computation time, communication time and cost have no fixed overheads.

In the remaining part of this section, we consider the case of agreeable processors. In that case, processors can be renumbered so that the two conditions hold:
$$\begin{aligned} \begin{array}{l} c_{1}\le c_{2}\le \dots \le c_{m}, \\ \frac{\ell _{1}}{c_{1}}\le \frac{\ell _{2}}{c_{2}}\le \cdots \le \frac{ \ell _{m}}{c_{m}}. \end{array} \end{aligned}$$
(38)
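For illustration, the sketch below (Python; the helper name, data layout and sample values are ours) tests condition (38): sort processors by \(c_i\) and verify that the ratios \(\ell _i/c_i\) are nondecreasing along the sorted sequence.

```python
# Check whether a processor set is agreeable in the sense of (38).
def agreeable_order(procs):
    """procs: list of (c_i, l_i) pairs; returns the renumbering
    satisfying (38) if the processors are agreeable, otherwise None."""
    order = sorted(procs)   # nondecreasing c_i (ties broken by l_i)
    # l_i / c_i <= l_{i+1} / c_{i+1}  <=>  l_i * c_{i+1} <= l_{i+1} * c_i
    if all(a[1] * b[0] <= b[1] * a[0] for a, b in zip(order, order[1:])):
        return order
    return None

print(agreeable_order([(2, 4), (1, 1)]))   # [(1, 1), (2, 4)] -- agreeable
print(agreeable_order([(1, 4), (2, 1)]))   # None -- the ratios decrease
```

The sort dominates the running time, matching the \(O(m\log m)\) bound mentioned in the proof of Theorem 2.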

Theorem 2

If processors \(\mathcal {P}\) are agreeable and have no availability and capacity restrictions, i.e., \(r_{i}=0\), \(d_{i}=B_{i}=\infty \), \(1\le i\le m\) , then an optimum solution can be found in polynomial time.

Proof

Assume that processors are numbered in accordance with (38) and they are activated in the order of their numbering. The processor sequence corresponding to \(c_{1}\le c_{2}\le \dots \le c_{m}\) guarantees the shortest schedule (Bharadwaj et al. 1994, 1996; Blazewicz and Drozdowski 1997). It is also known (Yang et al. 2007) that in the shortest schedule there are no idle times between communications and all processors finish computation simultaneously.

We show, by an interchange argument, that under the agreeable condition (38) the total cost is also minimum. Consider a pair \( P_{i},P_{i+1}\) adjacent in the sequence. Assume that communication to \(P_{i}\) starts \(\tau \) units of time before the end of the schedule. The load processed by \(P_{i}\) is \(x_{i}=\tau /(a_{i}+c_{i})\). Since there are no idle times in the schedule and \(P_{i+1}\) receives its load and processes it while \(P_{i}\) is computing, the load processed by \(P_{i+1}\) is \( x_{i+1}=a_{i}x_{i}/(a_{i+1}+c_{i+1})=(\tau a_{i})/((a_{i}+c_{i})(a_{i+1}+c_{i+1}))\). The cost of processing on \( P_{i},P_{i+1}\) is
$$\begin{aligned} K_{1}=\ell _{i}x_{i}+\ell _{i+1}x_{i+1}=\frac{\tau (\ell _{i}a_{i+1}+\ell _{i}c_{i+1}+\ell _{i+1}a_{i})}{(a_{i}+c_{i})(a_{i+1}+c_{i+1})}. \end{aligned}$$
If the order of communications were \(P_{i+1},P_{i}\), the cost would be
$$\begin{aligned} K_{2}=\frac{\tau (\ell _{i+1}a_{i}+\ell _{i+1}c_{i}+\ell _{i}a_{i+1})}{ (a_{i}+c_{i})(a_{i+1}+c_{i+1})}. \end{aligned}$$
The difference between the two costs is
$$\begin{aligned} K_{1}-K_{2}=\frac{\tau (\ell _{i}c_{i+1}-\ell _{i+1}c_{i})}{ (a_{i}+c_{i})(a_{i+1}+c_{i+1})}. \end{aligned}$$
The processor sequence \((P_{i},P_{i+1})\) results in a solution that is no more expensive if \( \ell _{i}/\ell _{i+1}\le c_{i}/c_{i+1}\), which holds under (38). Thus, for agreeable processors the shortest schedule is also the cheapest, and the Pareto-front reduces to a single point in the time–cost space. Checking whether processors are agreeable takes \(O(m\log m)\) time, and the load sizes \( x_{i}\) can be calculated in O(m) time (Blazewicz and Drozdowski 1997).\(\square \)
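The interchange argument can be illustrated numerically. The sketch below (Python; the parameter values are illustrative and satisfy (38)) computes the cost of a pair of processors in both orders and recovers the sign of \(K_{1}-K_{2}\) given above.

```python
# Cost of an adjacent pair in the order (1, 2), following the proof:
# x_1 = tau / (a_1 + c_1), x_2 = a_1 * x_1 / (a_2 + c_2).
def cost_pair(a1, c1, l1, a2, c2, l2, tau=1.0):
    x1 = tau / (a1 + c1)
    x2 = a1 * x1 / (a2 + c2)
    return l1 * x1 + l2 * x2

# An agreeable pair: c1 <= c2 and l1/c1 = 1 <= l2/c2 = 2.
a1, c1, l1 = 2.0, 1.0, 1.0
a2, c2, l2 = 3.0, 2.0, 4.0
K1 = cost_pair(a1, c1, l1, a2, c2, l2)   # order (P_i, P_{i+1})
K2 = cost_pair(a2, c2, l2, a1, c1, l1)   # swapped order
print(K1 <= K2)  # True: the agreeable order is no more expensive
```

Here \(K_{1}-K_{2}=\tau (\ell _{i}c_{i+1}-\ell _{i+1}c_{i})/((a_{i}+c_{i})(a_{i+1}+c_{i+1}))=-2/15\), matching the closed-form difference.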

4 Zero transfer overheads: \(s_{i}=c_{i}=0\)

The main model studied in this section is characterized by zero transfer times and zero fixed cost overheads for all processors \(P_{i}\): \( s_{i}=c_{i}=0\), \(f_{i}=0\), \(1\le i\le m\) (Sect. 4.1). In terms of the three types of decisions introduced in Sect. 1, only decisions of types 1 and 3 need to be considered: any processor \(P_{i}\) which gets a zero-size chunk \(x_{i}=0\) should be removed from the list \(\mathcal {P}^{\prime }\) of active processors, and all processors in \(\mathcal {P}^{\prime }\) can be sequenced arbitrarily. We also discuss how the proposed methods can be adjusted for the case with arbitrary cost overheads \(f_{i}\) (Sect. 4.2) and their applicability to related models with nonzero transfer times (Sect. 4.3).

4.1 Zero fixed cost overheads: \(f_{i}=0\)

In this section, we study the version of the main problem with the cost function \(F=\sum _{i=1}^{m}\ell _{i}x_{i}\), i.e., \(f_{i}=0\) for \(1\le i\le m\).

In the cost minimization problem \(\mathrm {DL}_{\text {cost}}(T)\), given the schedule length limit T, the upper bounds \(u_{i}\) on the chunks \(x_{i}\) allocated to each processor i can be found from (6) combined with (9) and (11):
$$\begin{aligned} u_{i}(T)=\min \left\{ \widetilde{B}_{i},\frac{1}{a_{i}}\left[ T-r_{i}-p_{i} \right] ^{+}\right\} \end{aligned}$$
(39)
where \(\left[ f\right] ^{+}=\max \left\{ f,0\right\} \) and
$$\begin{aligned} \widetilde{B}_{i}=\min \left\{ B_{i},\frac{1}{a_{i}}\left( d_{i}-r_{i}-p_{i}\right) ,V\right\} . \end{aligned}$$
We assume that \(\sum _{i=1}^{m}u_{i}(T)\ge V\) so that a feasible solution exists. For the given T, the cost minimization problem can be modeled as the following linear program:
$$\begin{aligned}&\mathrm {LP}_{\text {cost}}(T)\text {:}\min \sum \limits _{i=1}^{m}\ell _{i}x_{i}~~\mathrm {s.t.} \end{aligned}$$
(40)
$$\begin{aligned}&\sum \limits _{i=1}^{m}x_{i}=V, \end{aligned}$$
(41)
$$\begin{aligned}&0\le x_{i}\le u_{i}(T),\quad i=1,\ldots ,m. \end{aligned}$$
(42)
Note that (40)–(42) is the continuous knapsack problem, solvable in O(m) time by the algorithm due to Balas and Zemel (1980), which implies the following result.

Statement 2

If there are no transfer overheads and the cost function is \(F=\sum _{i=1}^{m}\ell _{i}x_{i}\), then problem \(\mathrm {DL}_{\mathrm {cost}}(T)\) is solvable in O(m) time.

For the counterpart \(\mathrm {DL}_{\text {time}}(K)\) of problem \(\mathrm {DL}_{\text {cost}}(T)\), we can only propose an \(O(m\log m)\)-time algorithm. As we show next, the bicriteria problem \(\mathrm {DL}_{\text {bicrit}}\) is solvable in \(O(m\log m)\) time as well. Thus, in what follows we focus on \( \mathrm {DL}_{\text {bicrit}}\); a solution to \(\mathrm {DL}_{\text {time}}(K)\) can be found from a solution to \(\mathrm {DL}_{\text {bicrit}}\) without increasing the \(O(m\log m)\) time complexity.

In order to find the Pareto-front for \(\mathrm {DL}_{\text {bicrit}}\), consider \(\mathrm {LP}_{\text {cost}}(T)\) as the underlying model and treat it in a parametric way, with parameter T that varies in \(\left[ \min _{1\le i\le m}\left\{ d_{i}\right\} ,\right. \left. \max _{1\le i\le m}\left\{ d_{i}\right\} \right] \). Notice that for small values of T from that interval the problem \(\mathrm {LP}_{\text {cost}}(T)\) may be infeasible.

Since \(\mathrm {LP}_{\text {cost}}(T)\) is the continuous knapsack problem for each fixed T, an optimal load distribution is defined by the formulae:
$$\begin{aligned} x_{1}(T)= & {} u_{1}\left( T\right) , \end{aligned}$$
(43)
$$\begin{aligned} x_{i}(T)= & {} \min \left\{ u_{i}(T),~\left[ V-\sum \limits _{k=1}^{i-1}u_{k}(T) \right] ^{+}\right\} ,\nonumber \\&\quad i=2,\ldots ,m, \end{aligned}$$
(44)
assuming that processors are numbered so that
$$\begin{aligned} \ell _{1}\le \ell _{2}\le \cdots \le \ell _{m}. \end{aligned}$$
(45)
Informally, the cheapest processor \(P_{1}\) gets the highest possible load, and every subsequent processor \(P_{i}\) gets the highest possible load, after all cheaper processors are loaded to their maximum capacity without violating the given T.
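The greedy distribution (43)–(44) with the bounds (39) can be sketched as follows (Python; the instance data in the usage example is illustrative). For clarity, a plain sort by \(\ell_i\) is used, giving \(O(m\log m)\); the O(m) bound of Statement 2 requires the median-based selection of Balas and Zemel (1980) instead.

```python
# Upper bound u_i(T) of eq. (39): u_i = min{B~_i, [T - r_i - p_i]^+ / a_i}.
def u(T, a, B, r, d, p, V):
    B_abs = min(B, (d - r - p) / a, V)        # absolute bound B~_i
    return min(B_abs, max(T - r - p, 0) / a)

def min_cost_allocation(T, procs, V):
    """procs: list of (a_i, B_i, r_i, d_i, p_i, l_i); returns loads x_i
    in the input order (eqs. (43)-(44)), or None if LP_cost(T) is
    infeasible, i.e., sum of u_i(T) < V."""
    order = sorted(range(len(procs)), key=lambda i: procs[i][5])  # by l_i
    x, rest = [0.0] * len(procs), V
    for i in order:                           # cheapest processor first
        a, B, r, d, p, l = procs[i]
        x[i] = min(u(T, a, B, r, d, p, V), rest)
        rest -= x[i]
    return x if rest <= 1e-12 else None

# V = 12 units, two processors: the cheaper P_1 is filled to its bound.
print(min_cost_allocation(10, [(1, 10, 0, 100, 0, 1),
                               (2, 10, 0, 100, 0, 3)], 12))  # [10, 2]
```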

We start with the rightmost point of the trade-off, which corresponds to the solution with the largest length and the minimum cost. It can be found in O(m) time by solving problem \(\mathrm {LP}_{\text {cost}}(T)\) with \( T=\max _{1\le i\le m}\left\{ d_{i}\right\} \). The resulting point \(\left( T,K\right) \) has \(K=\sum _{i=1}^{m}\ell _{i}x_{i}(T)\).

Consider a cost-optimum schedule of some length T. Let \(P_{s}\) be a processor with the smallest index whose load can be increased without increasing the length T. Such a processor satisfies a strict inequality
$$\begin{aligned} x_{s}<u_{s}\left( T\right) . \end{aligned}$$
(46)
We call \(P_{s}\) a split processor as it corresponds to the so-called split item s in the solution to the underlying continuous knapsack problem,
$$\begin{aligned} \begin{array}{ll} x_{i}\left( T\right) =u_{i}\left( T\right) , &{} 1\le i\le s-1, \\ x_{s}\left( T\right) =V-\sum \limits _{i=1}^{s-1}u_{i}(T), &{} \\ x_{j}\left( T\right) =0, &{} s+1\le j\le m. \end{array} \end{aligned}$$
In accordance with definition (39) of \(u_{i}\left( T\right) \), we divide processors \(\left\{ P_{1},\ldots ,P_{s-1}\right\} \) into three subsets:
  • critical processors \(\mathcal {P}_{c}\), with \(x_{i}\left( T\right) =\left( T-r_{i}-p_{i}\right) /a_i\) so that \(C_{i}=T\),
  • non-critical processors \(\mathcal {P}_{n}\), with \(x_{i}\left( T\right) =\widetilde{B}_{i}\) so that \(C_{i}<T\),
  • excluded processors \(\mathcal {P}_{e}\), with \(x_{i}\left( T\right) =0\).

Note that each excluded processor \(P_{i}\in \mathcal {P}_{e}\) defined for some schedule length T remains excluded for any smaller schedule length, since
$$\begin{aligned} r_{i}+p_{i}\ge T. \end{aligned}$$
The approach described below moves from a current breakpoint, denoted by \(\left( T^{\prime },K^{\prime }\right) \), to the next one \(\left( T^{\prime \prime },K^{\prime \prime }\right) \), repeating similar steps: it simultaneously decreases the load values of processors \(P_{i}\in \mathcal {P} _{c}\) and compensates that change by increasing the load of the split processor \(P_{s}\), keeping the total processed volume equal to V at all stages. For an efficient implementation, we define two auxiliary values:
$$\begin{aligned} h\left( \mathcal {P}_{c}\right) =\sum _{P_{i}\in \mathcal {P}_{c}} \frac{1}{a_{i}} \end{aligned}$$
the combined speed of processors \(\mathcal {P}_{c}\), and
$$\begin{aligned} k\left( \mathcal {P}_{c}\right) =\sum _{P_{i}\in \mathcal {P}_{c}} \frac{\ell _{i}}{a_{i}} \end{aligned}$$
the combined cost of occupying processors \(\mathcal {P}_{c}\) for 1 time unit.
Consider a transition from a solution with load values \(x_{i}^{\prime }\), corresponding to \(\left( T^{\prime },K^{\prime }\right) \), to a solution with load values \(x_{i}^{\prime \prime }\), corresponding to \(\left( T^{\prime \prime },K^{\prime \prime }\right) \). For any \(P_{i}\in \mathcal {P} _{c}\), the T-value changes from \(T^{\prime }=r_{i}+p_{i}+a_{i}x_{i}^{\prime }\) to \(T^{\prime }-\Delta =r_{i}+p_{i}+a_{i}x_{i}^{\prime \prime }\) so that
$$\begin{aligned} x_{i}^{\prime \prime }=x_{i}^{\prime }-\frac{1}{a_{i}}\Delta ,~~P_{i}\in \mathcal {P}_{c}. \end{aligned}$$
For the split processor \(P_{s}\), the increase in its load should be equal to the cumulative decrease in the load of processors \(\mathcal {P}_{c}\):
$$\begin{aligned} x_{s}^{\prime \prime }=x_{s}^{\prime }+h\left( \mathcal {P}_{c}\right) \Delta . \end{aligned}$$
(47)
The next breakpoint \(\left( T^{\prime \prime },K^{\prime \prime }\right) \) is triggered by one of the events (a)–(d), whichever is reached first. It corresponds to the smallest value of \(\Delta \), defined in the event descriptions below.
  (a) The set \(\mathcal {P}_{c}\) is adjusted to exclude a processor whose load reduces to 0. This happens if the decreased value \(T^{\prime \prime }=T^{\prime }-\Delta \) reaches \(r_{i}+p_{i}\) for some \(P_{i}\in \mathcal {P}_{c}\). In this case
     $$\begin{aligned} \Delta =T^{\prime }-\max \left\{ r_{i}+p_{i}|P_{i}\in \mathcal {P} _{c}\right\} . \end{aligned}$$
     (48)
  (b) The set \(\mathcal {P}_{c}\) is adjusted to include a non-critical processor which becomes critical. This happens if \(T^{\prime \prime }=T^{\prime }-\Delta \) reaches an absolute deadline \(\widetilde{d}_{i}\) for some non-critical processor \(P_{i}\in \mathcal {P}_{n}\),
     $$\begin{aligned} \widetilde{d}_{i}=\min \{d_{i},r_{i}+p_{i}+a_{i}B_{i}\}. \end{aligned}$$
     (49)
     In this case
     $$\begin{aligned} \Delta =T^{\prime }-\max \left\{ \widetilde{d}_{i}|P_{i}\in \mathcal {P} _{n}\right\} . \end{aligned}$$
     (50)
  (c) The split processor \(P_{s}\) can no longer get any additional load since its increased load \(x_{s}^{\prime \prime }=x_{s}^{\prime }+h\left( \mathcal {P}_{c}\right) \Delta \) reaches an absolute upper bound \( \widetilde{B}_{s}\), which implies
     $$\begin{aligned} \Delta =\frac{\widetilde{B}_{s}-x_{s}^{\prime }}{h\left( \mathcal {P} _{c}\right) }. \end{aligned}$$
  (d) The split processor \(P_{s}\) can no longer get any additional load since processing its increased load \(x_{s}^{\prime \prime }\) reaches \(T^{\prime \prime }\) so that processor \(P_{s}\) becomes a critical processor. This happens if the completion time \(r_{s}+p_{s}+a_{s}\left( x_{s}^{\prime }+ h\left( \mathcal {P}_{c}\right) \Delta \right) \) of \(P_{s}\) becomes equal to \(T^{\prime \prime }=T^{\prime }-\Delta \), which implies
     $$\begin{aligned} \Delta =\frac{T^{\prime }-\left( r_{s}+p_{s}+a_{s}x_{s}^{\prime }\right) }{ a_{s}h\left( \mathcal {P}_{c}\right) +1}. \end{aligned}$$
Thus, it suffices to calculate \(\Delta \)-values for each of the events (a)–(d), select the smallest one and compute the characteristics of the next breakpoint:
$$\begin{aligned} T^{\prime \prime }=T^{\prime }-\Delta ,\quad K^{\prime \prime }=K^{\prime }-k\left( \mathcal {P}_{c}\right) \Delta +\ell _{s}h\left( \mathcal {P}_{c}\right) \Delta . \end{aligned}$$
Note that the term \(-k\left( \mathcal {P}_{c}\right) \Delta = -\sum _{P_{i}\in \mathcal {P}_{c}}\frac{\ell _{i}}{a_{i}}\Delta =-\sum _{P_{i}\in \mathcal {P}_{c}}\ell _{i}\left( x_{i}^{\prime } -x_{i}^{\prime \prime }\right) \) defines the cost change due to the decrease in the loads of processors \(\mathcal {P} _{c}\), while the term \(\ell _{s}h\left( \mathcal {P}_{c}\right) \Delta =\ell _{s}\left( x_{s}^{\prime \prime }-x_{s}^{\prime }\right) \) defines the cost change due to the increase in the load of the split processor \(P_{s}\). To complete the transition from \(\left( T^{\prime },K^{\prime }\right) \) to \( \left( T^{\prime \prime },K^{\prime \prime }\right) \), perform the following updates.
  • In the case of event (a) triggered by processor \(P_{i}\in \mathcal {P} _{c}\), calculate \(x_{s}^{\prime \prime }\) by (47) and set
    $$\begin{aligned} \mathcal {P}_{c}:=\mathcal {P}_{c}\backslash \left\{ P_{i}\right\} ,\, \mathcal {P}_{e}:=\mathcal {P}_{e}\cup \left\{ P_{i}\right\} , \end{aligned}$$
    $$\begin{aligned} h\left( \mathcal {P}_{c}\right) := h\left( \mathcal {P}_{c}\right) -\frac{1}{a_{i}},\, k\left( \mathcal {P}_{c}\right) := k\left( \mathcal {P}_{c}\right) -\frac{\ell _{i}}{a_{i}}. \end{aligned}$$
  • In the case of event (b) triggered by processor \(P_{i}\in \mathcal {P} _{n}\), calculate \(x_{s}^{\prime \prime }\) by (47) and set
    $$\begin{aligned} \mathcal {P}_{c}:=\mathcal {P}_{c}\cup \left\{ P_{i}\right\} ,\, \mathcal {P}_{n}:=\mathcal {P}_{n}\backslash \left\{ P_{i}\right\} , \end{aligned}$$
    $$\begin{aligned} h\left( \mathcal {P}_{c}\right) := h\left( \mathcal {P}_{c}\right) +\frac{1}{a_{i}},\, k\left( \mathcal {P}_{c}\right) :=k\left( \mathcal {P}_{c}\right) + \frac{\ell _{i}}{a_{i}}. \end{aligned}$$
  • In the case of event (c), set
    $$\begin{aligned} \mathcal {P}_{n}:=\mathcal {P}_{n}\cup \left\{ P_{s}\right\} , \end{aligned}$$
    and define a new split processor by considering processors \(P_{i}\), \(i=s+1,s+2,\ldots \), one by one: if \(r_{i}+p_{i}<T\), then \(P_{i}\) becomes a new split processor (note that its load is 0); otherwise, \(P_{i}\) joins the set of excluded processors \(\mathcal {P}_{e}\), and the next processor is examined. If no processor from \(\left\{ P_{s+1},\ldots ,P_{m}\right\} \) becomes a split processor, the algorithm stops.
  • In the case of event (d), set
    $$\begin{aligned} \mathcal {P}_{c}:=\mathcal {P}_{c}\cup \left\{ P_{s}\right\} , h\left( \mathcal {P}_{c}\right) := h\left( \mathcal {P}_{c}\right) +\frac{1}{a_{s}}, \end{aligned}$$
    $$\begin{aligned} k\left( \mathcal {P}_{c}\right) :=k\left( \mathcal {P}_{c}\right) +\frac{\ell _{s}}{a_{s}}, \end{aligned}$$
    and find the next split processor \(P_{s}\) as in the case of event (c).
  • If both events (c) and (d) happen simultaneously, proceed as in the case of event (d).

We can now treat the found breakpoint \(\left( T^{\prime \prime },K^{\prime \prime }\right) \) as the current one and proceed similarly to find the next breakpoint. The algorithm stops when \(s=m\) and event (c) or (d) happens.
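One transition between breakpoints can be sketched as follows (Python; the function name, the flat data layout and the sample instance are ours, and \(\mathcal {P}_{c}\), \(\mathcal {P}_{n}\) are kept as plain index sets here, whereas the \(O(m\log m)\) implementation maintains them as priority queues).

```python
# Compute the Delta-candidates of events (a)-(d), pick the smallest and
# move from the current point (T', K') to the next breakpoint (T'', K'').
def next_breakpoint(T, K, s, x_s, procs, P_c, P_n, h, k):
    """procs[i] = (a_i, B~_i, r_i, p_i, d~_i, l_i); s: split processor
    index, x_s: its load; h, k: the values h(P_c), k(P_c)."""
    a_s, B_s, r_s, p_s, _, l_s = procs[s]
    cands = []
    if P_c:   # event (a): a critical processor's load shrinks to 0
        cands.append((T - max(procs[i][2] + procs[i][3] for i in P_c), "a"))
    if P_n:   # event (b): a non-critical processor hits its deadline d~_i
        cands.append((T - max(procs[i][4] for i in P_n), "b"))
    cands.append(((B_s - x_s) / h, "c"))       # (c): P_s reaches B~_s
    cands.append(((T - (r_s + p_s + a_s * x_s)) / (a_s * h + 1), "d"))
    delta, event = min(cands)
    return T - delta, K - k * delta + l_s * h * delta, event

procs = [(1, 10, 0, 0, 100, 1),   # P_1: critical
         (1, 4, 0, 0, 100, 2)]    # P_2: split processor
print(next_breakpoint(10.0, 10.0, 1, 0.0, procs, {0}, set(), 1.0, 1.0))
# (6.0, 14.0, 'c')
```

In the example, event (c) triggers first with \(\Delta = 4\), so the schedule length drops to 6 while the cost rises by \((\ell_s h - k)\Delta = 4\).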

In order to implement all calculations efficiently, we maintain sets \( \mathcal {P}_{c}\) and \(\mathcal {P}_{n}\) as priority queues so that each calculation (48) or (50) can be done in O(1) time. Since an element is added to \(\mathcal {P}_{c}\) or removed from \(\mathcal {P}_{c}\) at most once, all add/remove operations require \(O(m\log m)\) time. The same holds for add/remove operations on \(\mathcal {P}_{n}\). Similarly, \( \mathcal {P}_{e}\) is updated no more than m times.

Initialization involves renumbering processors in accordance with (45), finding the first solution \(\left( x_{1}^{\prime },x_{2}^{\prime },\ldots ,x_{m}^{\prime }\right) \) for \(T^{\prime }=\max \left\{ \widetilde{d}_{i}|1\le i\le m\right\} \), computing \(K^{\prime }\) and auxiliary values \(h\left( \mathcal {P}_{c}\right) \) and \(k\left( \mathcal { P}_{c}\right) \). All required steps can be done in \(O(m\log m)\) time.

A transition from one breakpoint to the next one requires updating the two priority queues, which takes \(O(\log m)\) time, and updating the five parameters, \(x_{s}^{\prime \prime }\), \(T^{\prime \prime }\), \(K^{\prime \prime }\), \(h\left( \mathcal {P}_{c}\right) \) and \(k\left( \mathcal {P} _{c}\right) \), which can be done in O(1) time. Note that x-values for \( i\ne s\) are not maintained. Since there are at most m events of each type, the total number of breakpoints is no larger than 4m, and the overall time complexity is \(O(m\log m)\). Thus, the following statement holds.

Statement 3

Problem \(\mathrm {DL}_{\text {bicrit}}\) has a trade-off with at most 4m breakpoints which can be computed in \(O(m\log m)\) time.

Let us finish this section with an example illustrating the calculation of the time–cost trade-off. Assume that the load size is \(V=30\) and the number of processors is \(m=8\); their parameters are given in Table 3. The breakpoints \(\left( T,K\right) \) are presented in the first two rows of Table 4. The boxed elements represent the load of the split processor \(P_{s}\). The loads of the remaining processors are not maintained by the described algorithm, in order to achieve the \(O(m\log m)\) time complexity. For completeness, we present the optimal x-values for all processors.
Table 3
Example data for time–cost trade-off calculation

| \(i\) | \(a_i\) | \(B_i\) | \(r_i\) | \(d_i\) | \(p_i\) | \(\ell_i\) | \(r_i+p_i\) | \(\widetilde{d}_i\) | \(\widetilde{B}_i\) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 10 | 80 | 100 | 1 | 1 | 81 | 91 | 10 |
| 2 | 4 | 40 | 30 | 110 | 2 | 2 | 32 | 110 | 19.5 |
| 3 | 8 | 10 | 20 | 40 | 5 | 3 | 25 | 40 | 1.875 |
| 4 | 4 | 20 | 20 | 70 | 4 | 5 | 24 | 70 | 11.5 |
| 5 | 5 | 10 | 10 | 80 | 2 | 8 | 12 | 62 | 10 |
| 6 | 6 | 10 | 40 | 100 | 2 | 10 | 42 | 100 | \(\approx 9.667\) |
| 7 | 3 | 30 | 5 | 50 | 1 | 20 | 6 | 50 | \(\approx 14.667\) |
| 8 | 2 | 50 | 10 | 60 | 3 | 40 | 13 | 60 | 23.5 |
Table 4

Load allocations \(x_i\) and total costs in time–cost trade-off calculation

4.2 Arbitrary cost overheads \(f_{i}\), \(\ell _{i}\)

In the general case, with nonzero overheads \(f_{i}\) in the cost function \(K=\sum _{i=1}^{m}\left( f_{i}+\ell _{i}x_{i}\right)\), both problems \(\mathrm {DL}_{\text {time}}\) and \(\mathrm {DL}_{\text {cost}}\) are NP-hard, see Drozdowski and Lawenda (2005). As we show in this section, the problem can be solved efficiently if we limit the search to the class of solutions with a fixed set of active processors \(\mathcal {P}^{\prime }\subseteq \mathcal {P}\). The associated problem is as follows: given a set of active processors \(\mathcal {P}^{\prime }\), allocate a positive load to each active processor so as to minimize the objective function T or K. Note that some processors may get an arbitrarily small load \(\varepsilon >0\); such a processor then has completion time \(r_{i}+p_{i}+a_{i}\varepsilon \), which should be taken into account when calculating the schedule length T.

There are two common properties that hold for any problem, \(\mathrm {DL}_{\text {time}}\), \(\mathrm {DL}_{\text {cost}}\) or \(\mathrm {DL}_{\text {bicrit}}\), under the assumption that \(\mathcal {P}^{\prime }\) is fixed:
  • Property 1: component \(\sum _{i\in \mathcal {P}^{\prime }}f_{i}\) is constant and can be excluded from K;

  • Property 2: the schedule length T satisfies \(T>\rho \), where
    $$\begin{aligned} \rho =\max _{P_{i}\in \mathcal {P}^{\prime }}\left\{ r_{i}+p_{i}\right\} . \end{aligned}$$
Based on these two properties, the results from Sect. 4.1 can be adjusted to handle the case with a fixed set \(\mathcal {P}^{\prime }\).

For problem \(\mathrm {DL}_{\text {cost}}(T)\) with a given set \(\mathcal {P} ^{\prime }\) and \(T>\rho \), consider the continuous knapsack formulation (40) defined over \(\mathcal {P}^{\prime }\). If in an optimal knapsack solution \(x_{i}=0\) for some \(P_{i}\in \mathcal {P}^{\prime } \), then such a solution is adjusted by replacing 0-values by \(\varepsilon \). The overall time complexity remains the same, O(m).
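The continuous-knapsack step can be sketched as follows. The capacity formula and function below are our assumptions, chosen to be consistent with the model's completion times \(r_i+p_i+a_ix_i\) and tightened deadlines \(\widetilde{d}_i\); formulation (40) itself is not restated. The sketch sorts processors by unit cost, giving \(O(m\log m)\), whereas the O(m) bound in the text relies on median-based selection as in Balas and Zemel (1980).

```python
def min_cost_allocation(procs, V, T):
    """Greedy continuous knapsack: load the cheapest processors (by unit
    cost l_i) first, up to their capacity at deadline T.

    Each processor is a tuple (a_i, B_i, r_i, p_i, l_i, d_tilde_i).
    The capacity formula is an assumption consistent with completion
    times r_i + p_i + a_i * x_i and tightened deadlines d~_i.
    Returns (total_cost, loads), or None if T is infeasible for volume V.
    """
    caps = []
    for a, B, r, p, l, d_t in procs:
        cap = max(0.0, min(B, (min(T, d_t) - r - p) / a))
        caps.append((l, cap))
    # O(m log m) sort; the paper's O(m) bound instead uses median-based
    # selection (Balas and Zemel 1980).
    order = sorted(range(len(procs)), key=lambda i: caps[i][0])
    loads = [0.0] * len(procs)
    remaining, cost = float(V), 0.0
    for i in order:
        l, cap = caps[i]
        x = min(cap, remaining)
        loads[i] = x
        cost += l * x
        remaining -= x
        if remaining <= 1e-12:
            return cost, loads
    return None  # total capacity at deadline T is below V
```

Replacing any zero load by \(\varepsilon \), as described above, turns such a knapsack solution into one that keeps every processor of \(\mathcal {P}^{\prime }\) active.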

For problem \(\mathrm {DL}_{\text {bicrit}}\), apply the approach from Sect. 4.1 to the processor set \(\mathcal {P}^{\prime }\) and output the part of the trade-off that satisfies \(T>\rho \). Treat all found solutions as if each idle processor gets an \(\varepsilon \)-load; this does not affect the values of T and K, assuming that \(\varepsilon \) is infinitely small. One case needs special attention: it occurs if all breakpoints \(\left( T,K\right)\) satisfy \(T\le \rho \). In that case, consider the rightmost point \(\left( T^{*},K^{*}\right)\) of the trade-off and output the unique solution obtained from \(\left( T^{*},K^{*}\right)\) by allocating \(\varepsilon \)-loads to all idle processors. The described adjustments do not affect the \(O(m\log m)\) time complexity derived for problem \(\mathrm {DL}_{\text {bicrit}}\) in Sect. 4.1.

Treating problem \(\mathrm {DL}_{\text {time}}(K)\) as a special case of problem \(\mathrm {DL}_{\text {bicrit}}\), we conclude that it is solvable in \(O(m\log m)\) time.
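Since the trade-off is piecewise linear with at most 4m breakpoints (Statement 3), a budget query for \(\mathrm {DL}_{\text {time}}(K)\) can also be answered directly from the breakpoint list. The sketch below is illustrative; the function name and list representation are ours, and we assume breakpoints are sorted by increasing T with nonincreasing cost.

```python
from bisect import bisect_left

def min_time_for_budget(breakpoints, budget):
    """Given trade-off breakpoints (T, K) sorted by increasing T with
    nonincreasing K, return the smallest T whose minimum cost fits the
    budget, interpolating linearly between consecutive breakpoints
    (the trade-off is assumed piecewise linear, as in Statement 3).
    Returns None if even the rightmost breakpoint exceeds the budget.
    """
    if breakpoints[-1][1] > budget:
        return None            # budget too small for any schedule length
    if breakpoints[0][1] <= budget:
        return breakpoints[0][0]  # shortest schedule already affordable
    # Binary search for the first breakpoint whose cost fits the budget;
    # negating costs yields a nondecreasing sequence for bisect.
    costs = [-k for _, k in breakpoints]
    j = bisect_left(costs, -budget)
    (T0, K0), (T1, K1) = breakpoints[j - 1], breakpoints[j]
    # Linear interpolation on the segment where cost crosses the budget.
    return T0 + (K0 - budget) * (T1 - T0) / (K0 - K1)
```

Each query takes \(O(\log m)\) time once the breakpoint list is built, matching the overall \(O(m\log m)\) bound.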

Statement 4

If there are arbitrary cost overheads \(f_{i}\), \(\ell _{i}\) in the cost function \(K=\sum _{i=1}^{m}\left( f_{i}+\ell _{i}x_{i}\right)\), and the set of active processors is fixed, then problem \(\mathrm {DL}_{\text {cost}}(T)\) is solvable in O(m) time, while problems \(\mathrm {DL}_{\text {time}}(K)\) and \(\mathrm {DL}_{\text {bicrit}}\) are solvable in \(O(m\log m)\) time.

4.3 Arbitrary transfer overheads \(s_{i},c_{i}\)

The results from Sect. 4.1 can be applied to special scenarios with nonzero transfer overheads \(s_{i},c_{i}\).

One scenario arises in parallel communication with the simultaneous start mode, see Kim (2003), Robertazzi (2003), when computation speeds are slower than transfer speeds. In the simultaneous start mode, load transfer to all worker processors starts at the same time, and each worker starts computing as soon as the first grain of the load is received. Since computation is slower than transfer, the computation time of any grain exceeds the transfer time of any subsequent grain. Thus, in parallel communication with simultaneous start, communication affects neither the schedule length nor the cost, and communication time can be ignored as if \(c_{i}=s_{i}=0\) for every \(P_{i}\in \mathcal {P}\).

Another scenario is typical for a pipeline-like computing mode. Load scattering and processing are interleaved so that communications and computations are performed at different stages: the load is distributed in one interval (say, interval i) and processed in the next interval (\(i+1\)). If there is a common communication time \(\tau _{comm}\) for all processors and a common computation time \(\tau _{comp}\), with \(\tau _{comm}\le \tau _{comp}\), then the communications executed in interval i do not constrain the partitioning of the load that minimizes the computing time and cost in interval \(i+1\). It can be shown that the general case of the pipeline mode, characterized by \(s_{i}>0,a_{i}>0\), is NP-hard [see, e.g., DLS with processor release times in Drozdowski and Lawenda (2005)].

5 Conclusions

In this paper, we analyze time/cost optimization for divisible load scheduling problems with arbitrary processor memory sizes, ready times, deadlines, and communication and computation start-up costs. Three versions of the problem are studied: \(\mathrm {DL}_{\text {time}}(K)\), schedule length minimization for a given limited budget K; \(\mathrm {DL}_{\text {cost}}(T)\), cost minimization for a given schedule length limit T; and \(\mathrm {DL}_{\text {bicrit}}\), constructing the set of time–cost Pareto-optimal solutions. All three versions can be solved in polynomial time for fixed m.

The case with given upper bounds on the schedule length and cost appears to be NP-hard even if all fixed overheads are zero (\(p_{i}=s_{i}=f_{i}=0\) for all \(P_{i}\in \mathcal {P}\)). This result is rather unusual: all previous NP-hardness results in the divisible load theory assumed nonzero fixed overheads. Interestingly, a divisible load problem is linked to scheduling problems with preemption, for which NP-hardness results are rather atypical (see, e.g., Sitters 2001; Drozdowski et al. 2017).

We leave open the question of the time complexity of finding a shortest schedule with processor availability constraints but zero fixed communication and computation overheads, regardless of the computation cost (\(K=\infty \)). We believe that the latter problem is computationally hard, see Conjecture 1. In contrast, the version with negligible communication times is solvable in \(O(m\log m)\) time even in its bicriteria setting.

Our summary table provided in the Introduction presents the state-of-the-art results in divisible load scheduling and can be used as a guideline for future research.

References

  1. Adler, I., & Monteiro, R. D. C. (1992). A geometric view on parametric linear programming. Algorithmica, 8, 161–176.
  2. Agrawal, R., & Jagadish, H. V. (1988). Partitioning techniques for large-grained parallelism. IEEE Transactions on Computers, 37, 1627–1634.
  3. Balas, E., & Zemel, E. (1980). An algorithm for large zero-one knapsack problems. Operations Research, 28, 1130–1154.
  4. Bharadwaj, V., Ghose, D., & Mani, V. (1994). Optimal sequencing and arrangement in distributed single-level tree networks with communication delays. IEEE Transactions on Parallel and Distributed Systems, 5, 968–976.
  5. Bharadwaj, V., Ghose, D., Mani, V., & Robertazzi, T. G. (1996). Scheduling divisible loads in parallel and distributed systems. Los Alamitos: IEEE Computer Society Press.
  6. Blazewicz, J., & Drozdowski, M. (1997). Distributed processing of divisible jobs with communication startup costs. Discrete Applied Mathematics, 76, 21–41.
  7. Cheng, Y.-C., & Robertazzi, T. G. (1988). Distributed computation with communication delay. IEEE Transactions on Aerospace and Electronic Systems, 24, 700–712.
  8. Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2001). Introduction to algorithms (2nd ed.). Cambridge: MIT Press and McGraw-Hill.
  9. Drozdowski, M. (2009). Scheduling for parallel processing. London: Springer.
  10. Drozdowski, M., Jaehn, F., & Paszkowski, R. (2017). Scheduling position-dependent maintenance operations. Operations Research, 65, 1657–1677.
  11. Drozdowski, M., & Lawenda, M. (2005). The combinatorics in divisible load scheduling. Foundations of Computing and Decision Sciences, 30, 297–308.
  12. Goldfarb, D., & Todd, M. J. (1989). Chapter II: Linear programming. In G. L. Nemhauser, A. H. G. Rinnooy Kan, & M. J. Todd (Eds.), Handbooks in operations research and management science: Optimization (Vol. 1, pp. 73–170). Amsterdam: North-Holland.
  13. Kim, H. J. (2003). A novel optimal load distribution algorithm for divisible loads. Cluster Computing, 6, 41–46.
  14. Robertazzi, T. G. (2003). Ten reasons to use divisible load theory. IEEE Computer, 36, 63–68.
  15. Shakhlevich, N. V. (2013). Scheduling divisible loads to optimize the computation time and cost. Lecture Notes in Computer Science (Vol. 8193, pp. 138–148). Cham: Springer.
  16. Sitters, R. A. (2001). Two NP-hardness results for preemptive minsum scheduling of unrelated parallel machines. Lecture Notes in Computer Science (Vol. 2081, pp. 396–405). Berlin: Springer.
  17. Yang, Y., Casanova, H., Drozdowski, M., Lawenda, M., & Legrand, A. (2007). On the complexity of multi-round divisible load scheduling. Research Report No. 6096, INRIA Rhône-Alpes.

Copyright information

© The Author(s) 2019

Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Authors and Affiliations

  1. Institute of Computing Science, Poznań University of Technology, Poznań, Poland
  2. School of Computing, University of Leeds, Leeds, UK