Scheduling divisible loads with time and cost constraints

In distributed computing, divisible load theory provides an important system model for allocation of data-intensive computations to processing units working in parallel. The main task is to define how a computation job should be split into parts, to which processors those parts should be allocated and in which sequence. The model is characterized by multiple parameters describing processor availability in time, transfer times of job parts to processors, their computation times and processor usage costs. The main criteria are usually the schedule length and cost minimization. In this paper, we provide the generalized formulation of the problem, combining key features of divisible load models studied in the literature, and prove its NP-hardness even for unrestricted processor availability windows. We formulate a linear program for the version of the problem with a fixed number of processors. For the case with an arbitrary number of processors, we close the gaps in the study of special cases, developing efficient algorithms for single criterion and bicriteria versions of the problem, when transfer times are negligible.


Introduction
Divisible load theory (DLT) is an important model of parallel computations. It is assumed that a big volume of data, conventionally referred to as load, can be divided continuously into parts which can be processed independently on distributed computers. DLT was proposed by Cheng and Robertazzi (1988) to represent computation in a chain of intelligent sensors. A very similar approach was proposed independently by Agrawal et al. (1988) to model the performance of a network of workstations. DLT has been successfully applied to scheduling data-intensive applications and to analyzing various aspects of their efficiency depending on communication sequences, load scattering algorithms, memory limitations and time-varying environments (Bharadwaj et al. 1996; Drozdowski 2009; Robertazzi 2003).
In its general form, the problem of divisible load scheduling (DLS) can be formulated as follows. A computational load of volume V (measured in bytes) is initially held by a master processor P 0 . The load must be distributed among worker processors from set P = {P 1 , . . . , P m }. In our model, master processor P 0 only distributes the load and does not perform any computation. In some publications this assumption is waived so that P 0 performs computation after it completes all communications. The results discussed in the following sections can be easily adjusted for that case.
For a summary of the notation and an explanation of the parameters used in this paper, see Table 1. Each processor P i has its own availability interval [r i , d i ]. The time required for sending x bytes of load to P i is s i + c i x, for i = 1, . . . , m. The communications are performed sequentially, i.e., only one processor at a time can receive its chunk of the load from the master. The transmission to P i of the allocated chunk can start at any time, even before the processor's availability time r i . The processing of the chunk can start only after the allocated chunk is received in full, and no earlier than the processor's availability time r i . For a chunk of size x received by processor P i , the computation time and the processor usage cost (computation cost) are p i + a i x and f i + ℓ i x, respectively.
It is required that P i finishes computation by the end of its availability interval, i.e., by d i , where d i > r i + p i .

Table 1 Notation
p i + a i x i — the time for computing load x i on processor P i , where p i is the setup time to start computation and a i is the processing rate (or reciprocal of speed) of processor P i
s i + c i x i — the time for transferring load x i to processor P i , where s i is the communication start-up time and c i is the communication rate (or reciprocal of bandwidth) of the link to P i
f i + ℓ i x i — the cost of computing load x i by processor P i , including the fixed cost f i
P — set of the worker processors
P′ — set of processors participating in computation, P′ ⊆ P
m — total number of processors, i.e., m = |P|
m′ — number of processors in P′, i.e., m′ = |P′|

Due to the limited memory size, there is an upper limit B i on the maximum load that can be handled by processor P i . A processor may be left unused if no load is sent to it. Such a processor does not incur any time or cost overheads. Let C i denote the time when processor P i completes its chunk, and let C max denote the length of the whole schedule. The cost of processing the load is denoted by K. Solving the DLS problem requires three decisions:
Decision 1: choosing the subset of processors P′ ⊆ P for performing computation; for any processor P i ∈ P′ the allocated chunk size is nonzero (x i > 0) and such a processor is called active;
Decision 2: choosing the sequence in which the master processor P 0 sends parts of the load to the processors in P′;
Decision 3: splitting the total load of size V into chunks x i , one for each processor P i ∈ P′, such that the schedule length C max and the total cost K are minimized.
With respect to Decision 3, the most general version is the bicriteria one: finding the Pareto front of solutions non-dominated in the criteria C max and K. We denote that problem by DL bicrit .
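The model just described can be made concrete with a small sketch (ours, not from the paper): given hypothetical processor parameters, an activation sequence and a load split, it evaluates the schedule length C max and the cost K, returning None if the availability or memory restrictions are violated.

```python
# A minimal sketch of the DLS model; all parameter values are hypothetical.
from dataclasses import dataclass

@dataclass
class Proc:
    r: float; d: float; B: float   # availability window [r, d], memory limit B
    s: float; c: float             # transfer time s + c*x
    p: float; a: float             # computation time p + a*x
    f: float; l: float             # computation cost f + l*x  (l plays the role of the paper's per-unit cost)

def evaluate(procs, chunks):
    """For active processors listed in activation order with loads `chunks`,
    return (C_max, K), or None if some constraint is violated."""
    t_comm = 0.0                        # master is busy sending until this moment
    C_max, K = 0.0, 0.0
    for P, x in zip(procs, chunks):
        if x > P.B:                     # memory restriction B_i
            return None
        t_comm += P.s + P.c * x         # communications are sequential
        start = max(P.r, t_comm)        # computing starts after r_i and after full receipt
        C = start + P.p + P.a * x       # completion time C_i
        if C > P.d:                     # must finish within [r_i, d_i]
            return None
        C_max = max(C_max, C)
        K += P.f + P.l * x              # computation cost f_i + l_i * x_i
    return C_max, K
```

For instance, with two processors and chunks (2, 3), the second communication can only start once the first one ends, and a processor left unused simply does not appear in the argument lists.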
Its two counterparts deal with minimizing one objective subject to a bound on the other:
- in problem DL time (K′), the objective is to minimize C max subject to K ≤ K′, where K′ is an upper limit on the available budget;
- in problem DL cost (T′), the objective is to minimize the cost K subject to C max ≤ T′, where T′ is an upper limit on the acceptable schedule length.
There are three types of overheads for any processor P i ∈ P : transfer time (also called communication time) s i + c i x, computation time p i +a i x and computation cost f i + i x. We refer to parameters s i , p i and f i as fixed overheads as they define fixed amounts of time and cost incurred if a nonzero chunk of load is allocated to a processor. These amounts are independent of the chunk size.
In this paper, we perform the complexity study of the formulated DLS problem focusing on the most general case with arbitrary processors' availability windows [r i , d i ] and arbitrary restrictions on processors' maximum loads B i . The results are summarized in Table 2. In the column "Objectives" we specify how the two objectives, C max and K, are handled: single criterion problems deal with either C max or K, while notation (C max , K) is used for the bicriteria problem in the space of objectives C max , K. If one of the objectives is bounded, then the corresponding inequality is stated in the column "Conditions".
In the presence of all three types of overheads, the DLS problem is NP-hard even if all fixed overheads are negligible,
s i = p i = f i = 0, (1)
and the processors' availability restrictions are relaxed,
r i = 0, d i = B i = ∞. (2)
If, in addition to (1)-(2), per-unit transfer costs are equal, the problem is solvable in O(m^3) time even in the bicriteria setting (Shakhlevich 2013). As we show in this paper, the general case with arbitrary parameters can be solved via linear programming under the condition that the number of worker processors m is fixed. While the computation time overhead is at the center of the DLS problem and cannot be ignored, the two other types of overheads may become negligible in some scenarios. The version of the problem with zero cost overheads is well studied, see Bharadwaj et al. (1994, 1996), Blazewicz and Drozdowski (1997), Drozdowski and Lawenda (2005), Yang et al. (2007) and the summary of the results in the second part of Table 2. In this paper, we analyze the alternative version with zero transfer overheads; see the lower part of Table 2. It appears that if fixed cost overheads are negligible ( f i = 0 for all P i ∈ P), then the bicriteria version of the problem is solvable in O(m log m) time. Its single criterion counterpart of cost minimization subject to a bounded schedule length can be solved in O(m) time. The version with nonzero fixed cost overheads f i is NP-hard, but can be solved in O(m) time provided that the set of active processors is fixed.
Further organization of this paper is as follows. In Sect. 2 we study the general version of the problem, with arbitrary values of all parameters s i , c i , p i , a i , f i , ℓ i , r i , d i and B i for all processors P i ∈ P. In Sect. 3 we present our results for the case with zero fixed overheads, s i = p i = f i = 0 for all processors P i ∈ P. Section 4 is dedicated to the system with negligible transfer times, s i = c i = 0 for all P i ∈ P. Conclusions are presented in Sect. 5.

Nonzero time/cost parameters-fixed set of active processors
In this section, we consider the DLS problem with arbitrary time/cost parameters s i , c i , p i , a i , f i , ℓ i and arbitrary processor availability parameters r i , d i , B i . The number of worker processors m is fixed. We present linear programs for problems DL time (K′) and DL cost (T′), justifying that both problems are fixed parameter tractable (FPT) with respect to the parameter m. We then explain how problem DL bicrit can be solved in FPT time. Note that for an arbitrary m the problem is NP-hard, as we show in Sect. 3.

Limited cost K-schedule length minimization
Consider first problem DL time (K′), assuming that the set of processors P′ ⊆ P which receive nonzero chunks of the load is fixed, and that their sequence is also fixed. At the end of the section, we discuss the case with a non-fixed processor sequence. Let the processors in P′ be renumbered in the order of their activation, so that P 1 receives the first chunk, P 2 receives the second one, etc., until P m′ receives the last chunk of the load, where m′ = |P′|. Let x 1 , x 2 , …, x m′ represent the load distribution among the processors of P′. Then the completion time C i of any processor P i , 1 ≤ i ≤ m′, can be calculated as

C i = max { r i , Σ_{j=1}^{i} (s j + c j x j ) } + p i + a i x i . (3)

Note that the first term in (3) represents the earliest possible starting time of the computation: the release time of P i or the total duration of the chain of communication times for the upstream processors P 1 , P 2 , . . . , P i , whichever is larger. The second term represents the computation time. Using (3), we follow Drozdowski and Lawenda (2005) to formulate problem DL time (K′) as a linear program LP time (K′) of the form:

minimize T (4)
subject to
r i + p i + a i x i ≤ T , 1 ≤ i ≤ m′, (5)
Σ_{j=1}^{i} (s j + c j x j ) + p i + a i x i ≤ T , 1 ≤ i ≤ m′, (6)
Σ_{j=1}^{i} (s j + c j x j ) + p i + a i x i ≤ d i , 1 ≤ i ≤ m′, (7)
r i + p i + a i x i ≤ d i , 1 ≤ i ≤ m′, (8)
x i ≤ B i , 1 ≤ i ≤ m′, (9)
Σ_{i=1}^{m′} ( f i + ℓ i x i ) ≤ K′, (10)
Σ_{i=1}^{m′} x i = V , (11)
x i ≥ 0, 1 ≤ i ≤ m′.

Here the schedule length T is the variable to be minimized. It is defined via inequalities (5)-(6), which model T = max 1≤i≤m′ {C i } with C i given by (3). Inequalities (7)-(8) guarantee that computation on machine P i is completed by the end of the machine availability interval. Inequalities (9) guarantee that the load of each processor P i does not exceed its memory size B i . By (10), the total computation cost does not exceed K′. The total size of the load allocated to the m′ processors is equal to V by (11).
There are m′ + 1 variables and 5m′ + 2 constraints in LP time (K′), not counting the nonnegativity constraints. The number of variables and the number of constraints can each be reduced by one, using equation (11). The number of constraints can be further reduced by m′ by combining conditions (8)-(9) for every 1 ≤ i ≤ m′ into

x i ≤ min { B i , (d i − r i − p i )/a i } .

Thus, for a fixed set P′ and a fixed processor sequence, problem DL time (K′) can be solved in O(LP(m′, 4m′ + 1)) time, where O(LP(u, w)) is the time complexity of solving an LP problem with u variables and w inequality constraints, see, e.g., Goldfarb and Todd (1989). With m′ ≤ m, there are at most 2^m possible selections for set P′ and at most m! processor sequences for each selection, i.e., at most

η = 2^m m! (12)

combinations overall. Hence, problem DL time (K′) can be solved in O(η × LP(m′, 4m′ + 1)) time, which implies the FPT time complexity with respect to the parameter m.
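As an illustration of the linear programming approach, the sketch below (ours, not from the paper) builds LP time (K′) for one fixed activation sequence using SciPy's linprog, which is assumed to be available; the constraint rows mirror the inequalities (5)-(11) described above, and the parameter values in the usage example are hypothetical.

```python
# A sketch of LP_time(K') for a fixed activation sequence 1..m; scipy assumed.
import numpy as np
from scipy.optimize import linprog

def lp_time(r, d, B, s, c, p, a, f, l, V, K):
    """Minimize the schedule length T; variables are x_1..x_m and T (last)."""
    r, d, B, s, c, p, a, f, l = map(np.asarray, (r, d, B, s, c, p, a, f, l))
    m = len(r)
    obj = np.zeros(m + 1); obj[-1] = 1.0                # minimize T
    A_ub, b_ub = [], []
    for i in range(m):
        row = np.zeros(m + 1); row[i] = a[i]; row[-1] = -1.0
        A_ub.append(row.copy()); b_ub.append(-(r[i] + p[i]))      # r_i + p_i + a_i x_i <= T
        row[-1] = 0.0
        A_ub.append(row.copy()); b_ub.append(d[i] - r[i] - p[i])  # r_i + p_i + a_i x_i <= d_i
        row = np.zeros(m + 1); row[:i + 1] = c[:i + 1]; row[i] += a[i]; row[-1] = -1.0
        A_ub.append(row.copy()); b_ub.append(-(s[:i + 1].sum() + p[i]))      # comm chain <= T
        row[-1] = 0.0
        A_ub.append(row.copy()); b_ub.append(d[i] - s[:i + 1].sum() - p[i])  # comm chain <= d_i
    row = np.zeros(m + 1); row[:m] = l
    A_ub.append(row); b_ub.append(K - f.sum())          # total cost <= K
    res = linprog(obj, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=[np.append(np.ones(m), 0.0)], b_eq=[V],     # total load = V
                  bounds=[(0, B[i]) for i in range(m)] + [(0, None)])
    return (res.x[:m], res.x[-1]) if res.success else None
```

For example, a single processor with c = a = 1 and no fixed overheads must spend 2x time units on a chunk of size x, so a load of V = 10 gives the optimal length T = 20.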

Limited schedule length T-cost minimization
Consider now problem DL cost (T′) of finding the cheapest schedule for the common deadline T′, assuming that the set of active processors P′ ⊆ P is fixed and their sequence is (1, 2, . . . , m′). Similarly to problem DL time (K′), problem DL cost (T′) can be solved by a linear program with objective (13), the total cost to be minimized, subject to constraints (14)-(19), the counterparts of (5)-(9) and (11) with the schedule length bounded by T′. Simplifying the model, we combine (14) with (16), and also combine (15), (17) and (18), to get the reduced formulation (20)-(23). The number of variables and the number of constraints can each be reduced by one, using the total load equation (23). In the resulting LP, there are m′ − 1 variables and 2m′ constraints, so that problem DL cost (T′) can be solved in O(η × LP(m′ − 1, 2m′)) time, where η is given by (12).

Time-cost trade-off
Consider now the bicriteria problem DL bicrit . The approach described below constructs at most η trade-off curves, one for each fixed processor activation sequence, and then takes their lower envelope.
For a fixed sequence S i with m′ processors, m′ ≤ m, the trade-off consists of linear pieces connecting the breakpoints (T 0 , K 0 ), (T 1 , K 1 ), . . . , (T q , K q ). The schedule corresponding to the extreme point (T 0 , K 0 ) has the smallest length T 0 , which can be found by solving LP time (∞), i.e., the model with K′ = ∞ or, equivalently, with inequality (10) eliminated. For the calculated T 0 , the associated minimum cost K 0 can be found by solving LP cost (T 0 ).
The other extreme point (T q , K q ) corresponds to the schedule with the smallest cost K q . It can be found by solving LP cost (∞), i.e., the model with T′ = ∞; then, for the found value of K q , the associated minimum schedule length T q can be found by solving LP time (K q ).
The remaining points (T i , K i ), 1 ≤ i ≤ q − 1, of the trade-off for the fixed processor activation sequence S i can be found by solving the parametric linear programming problem LP cost (T ) of type (13)-(19), with the schedule length bound T in (14)-(15) treated as a parameter. Again, the model can be simplified so that the resulting formulation has m′ − 1 x-variables (after eliminating one variable using (19)) and 4m′ constraints (after combining (17) with (18) and eliminating the equality constraint (19)). Using standard methods of parametric optimization, the breakpoints of the trade-off can be found in O(q × LP(m′ − 1, 4m′)) time, see, e.g., Adler and Monteiro (1992). The number of breakpoints q is bounded by the number of basic feasible solutions, which does not exceed the binomial coefficient C(4m′, m′ − 1) for a linear program with m′ − 1 variables and 4m′ constraints. Using an upper bound for the latter expression, we conclude that the time complexity of constructing the trade-off for a fixed processor sequence is O(4^m m^m × LP(m − 1, 4m)). With the upper bound η on the number of processor sequences given by (12), the overall time complexity of constructing the η trade-off curves is

O(2^m m! × 4^m m^m × LP(m − 1, 4m)). (24)

We then have to construct a merged set of Pareto-optimal points in the (T , K )-space over all processor sequences S i . The resulting set of points F and the segments connecting them constitute a non-increasing function of cost K in time T . This function may contain convex and concave parts, and it can be discontinuous, see Fig. 1. A point of discontinuity (t, K (t)) arises when, e.g., the left end of a trade-off curve for some processor sequence S i is at t while another processor sequence S j incurs a higher cost at t, or when the right end of the S j trade-off curve is earlier than t.
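The merging of the individual trade-off curves into a single non-increasing function can be illustrated by a small sketch (ours, an illustration rather than the paper's exact procedure): each curve is a list of breakpoints, and the merged curve is the pointwise minimum evaluated over a set of candidate abscissae.

```python
# An illustrative sketch: pointwise minimum of piecewise-linear trade-off curves.
def eval_curve(curve, t):
    """curve: breakpoints (T, K) sorted by T with K non-increasing.
    Returns the interpolated cost at t, or None outside the curve's T-range
    (outside its range a sequence contributes no feasible schedule)."""
    if t < curve[0][0] or t > curve[-1][0]:
        return None
    for (t0, k0), (t1, k1) in zip(curve, curve[1:]):
        if t0 <= t <= t1:
            if t1 == t0:
                return min(k0, k1)
            return k0 + (k1 - k0) * (t - t0) / (t1 - t0)
    return curve[-1][1]

def lower_envelope(curves, grid):
    """Pointwise minimum over the candidate abscissae in `grid`."""
    env = []
    for t in sorted(grid):
        vals = [v for c in curves if (v := eval_curve(c, t)) is not None]
        if vals:
            env.append((t, min(vals)))
    return env
```

A discontinuity of the merged function shows up here as a jump between consecutive grid points where the identity of the minimizing curve changes.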
A possible approach to finding the overall trade-off may consist of the following two steps: (1) collect the linear pieces of all individual trade-off curves and find their pairwise intersection points; (2) construct the minimal layer (the lower envelope) of those pieces. To handle the discontinuities, introduce vertical lines from the left end-points of the individual trade-offs and horizontal lines from the right end-points, and include those pieces in Step (1). Then the overall number of linear pieces considered in Step (1) is not larger than η (Q + 2), where Q denotes the maximum number of pieces in a single trade-off curve; their O(η^2 Q^2) intersection points can be found in O(η^2 Q^2) time, see, e.g., Cormen et al. (2001), and finally the minimal layer is constructed in O(η^2 Q^2 log(ηQ)) time. Taking into account expressions (12) and (24), we obtain the overall time complexity of solving the bicriteria problem DL bicrit .


Zero fixed overheads: s i = p i = f i = 0

In this section, we assume that all fixed overheads are equal to zero, i.e., p i = s i = f i = 0, for i = 1, . . . , m, while the linear components of the transfer time (c i ), computation time (a i ) and cost ( ℓ i ) are not simultaneously equal to zero. In terms of the three types of decisions introduced in Sect. 1, decisions of type 3 imply decisions of type 1: any processor P i which gets a chunk x i = 0 does not contribute to any time or cost component because there are no fixed overheads. This implies that a processor receiving a 0-size chunk can be removed from the list P′ of active processors. Thus, it is enough to make decisions 2 and 3. Our main result of this section is the NP-hardness proof of DL cost (T ) and DL time (K ).

Limited schedule length T and limited cost K
To prove the NP-hardness of problem DL cost (T ), let us introduce its decision version DL(T , K ), which verifies whether there exists a feasible solution with the schedule length and the cost not exceeding the given thresholds T and K , respectively. We reduce Even-odd partition to problem DL(T , K ). Even-odd partition is defined as follows: given a set E = {e 1 , . . . , e 2n } of positive integers, is there a subset E 1 ⊂ E such that

Σ_{e i ∈ E 1 } e i = Σ_{e i ∈ E\E 1 } e i = G

and E 1 contains exactly one element from each pair {e 2i−1 , e 2i }, i = 1, . . . , n?
For an instance of Even-odd partition, we construct an instance of DL(T , K ) as follows. Let T > 0 be an arbitrary schedule length and m = 2n be the number of processors of the set P. The processor parameters are defined as r i = 0, for i = 1, . . . , n. The load size V and the cost limit K are given by
It is easy to see that the reduction is polynomial.
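For reference, verifying a certificate of Even-odd partition is straightforward; the helper below (ours, illustrative and not part of the reduction itself) checks that a selection of one element per pair splits E into two halves of equal sum G.

```python
# Illustrative verifier for Even-odd partition certificates.
def is_even_odd_partition(e, picks):
    """e: list of 2n positive integers forming pairs (e[2i], e[2i+1]), 0-indexed;
    picks[i] in {0, 1} selects which element of pair i belongs to E_1."""
    n = len(e) // 2
    chosen = sum(e[2 * i + picks[i]] for i in range(n))
    return 2 * chosen == sum(e)      # E_1 and its complement both sum to G
```

The NP-hardness argument only needs the "yes" direction of such a certificate: the reduction maps a valid selection to a feasible schedule of length T and cost K.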

Lemma 1
If there exists a solution to an instance of Even-odd partition, then there exists a schedule S for the instance of DL(T , K ) which satisfies the following properties.
(i) The whole load of size V is fully processed.
(ii) Every processor is fully loaded, completing its load chunk at time T .
(iii) The cost of the schedule is K .
(iv) Schedule S defines a solution to the instance of DL(T , K ).
Proof Let e i1 denote the element of the pair {e 2i−1 , e 2i } which belongs to E 1 , and let e i2 be the other element of the pair, i = 1, . . . , n. Construct schedule S by selecting the processor activation sequence (P 11 , P 12 , . . . , P u1 , P u2 , . . . , P n1 , P n2 ), where processors P i1 , P i2 correspond to e i1 , e i2 , respectively, and define the load chunks accordingly. We prove that properties (i)-(iv) hold for schedule S.
(i) The sum of all chunks allocated to the processors of P is equal to V .
(ii) For any processor P i1 , 1 ≤ i ≤ n, its communication time μ i1 and computation time ν i1 , and similarly the associated values for any processor P i2 , are such that the schedule is of the form shown in Fig. 2, with every processor P i j completing at time T , i = 1, . . . , n, j = 1, 2.
(iii) The cost of S is as follows: By properties (i)-(iii) the schedule is feasible and it defines a solution to DL(T , K ) so that property (iv) holds.
In the remaining part we prove that if there exists a solution to the instance of problem DL(T , K ), then there exists a solution to the related instance of Even-odd partition. The lemma below starts with auxiliary properties of a feasible schedule and concludes with the main result.
Lemma 2 A feasible schedule S for the instance of DL(T , K ) satisfies the following properties.
(3) If in a feasible schedule satisfying property (2) at least one processor P uk , 1 ≤ u ≤ n, k = 1, 2, is not fully loaded (i.e., C uk < T holds), then
(4) Each of the 2n processors is fully loaded and has completion time T .
(5) The load distribution of schedule S defines a solution to Even-odd partition.
Proof (1) Let schedule S be of the form S = (P h 1 , P h 2 , . . . , P h 2n ), and let P h z be an out-of-order processor such that it belongs to the pair {P u1 , P u2 } with the smallest index u; its predecessor in S is P h z−1 . Then the corresponding chain of inequalities holds, where the first inequality is valid since e h z > 0 and e h z−1 < G^{n−v+2} , and the second one is satisfied for a sufficiently large G, for example G > 2^4 .
Let t be the starting time of the communication of processor P h z−1 in S. Modify the fragment of schedule S, starting from t, by moving the full load x h z−1 from P h z−1 to P h z . In the new schedule, processor P h z finishes communication at time t + c h z (x h z−1 + x h z ), which is earlier than in the original schedule. The same is true for the computation completion time: the new completion time of P h z is t + 2c h z (x h z−1 + x h z ), which is less than t + c h z−1 x h z−1 + 2c h z x h z , the completion time of P h z in the original schedule (since c h z−1 > 2c h z by (30)). The cost of the modified schedule is less than that of the original one, since each of the values ℓ v1 and ℓ v2 is greater than both ℓ u1 and ℓ u2 .
As a result of the described transformation, processor P h z takes the full load of processor P h z−1 , making P h z−1 idle. Modify the processor sequence by swapping P h z−1 and P h z . If P h z is still out of order, then perform a similar transformation: move the load from P h z−2 to P h z , making P h z−2 idle and swap the two processors. Continue shifting processor P h z upstream until it reaches the right position in the schedule, immediately after its partner from the pair {P u1 , P u2 } or after a pair of processors P u−1,1 , P u−1,2 . Repeating the same transformation, we construct a schedule with no larger length and with a smaller cost.
(2) For a feasible schedule, the two inequalities Load ≥ V and Cost ≤ K hold, which yields the necessary condition (31). Applying the variables y i , i = 1, . . . , n, defined by (29), condition (31) can be rewritten as

Σ_{i=1}^{n} 2^{2i−1} G^{n−i+2} y i ≥ (3/2) Σ_{i=2}^{n+1} G^i ,

so that property (2) holds. Let us observe that violating (31) means that insufficient load is processed or the cost limit is exceeded.
(3) Consider a feasible schedule with processor activation sequence (P 11 , P 12 , . . . , P i1 , P i2 , . . . , P n1 , P n2 ), where at least one of the conditions holds as a strict inequality. Taking a linear combination of these inequalities, weighted by the constants λ i and 2λ i defined by (34)-(35), we obtain inequality (33). In what follows, we show that the left-hand side of (33) is equal to Σ_{i=1}^{n} 2^{2i} G^{n−i+2} y i T and that the right-hand side is equal to 3 Σ_{i=2}^{n+1} G^i T , so that property (3) holds. Starting with the left-hand side, we deduce the required expression; it remains to prove identity (37). Indeed, by (34), 4λ n = 2^{2n} G^2 , so (37) holds for i = n. If (37) holds for some i, 2 ≤ i ≤ n, then it also holds for i − 1, by (35). Considering now the expression on the right-hand side of (33), divided by 3T , we obtain the required value. Thus, inequality (33) is proved.
In order to prove the equality Σ_{i=1}^{n} e i1 = G, consider the conditions Load ≥ V and Cost ≤ K , which hold for any feasible schedule. Repeating the calculations (27) and (28), we obtain two opposite inequalities on Σ_{i=1}^{n} e i1 , which together imply Σ_{i=1}^{n} e i1 = G and hence property (5).
We conclude with the main result, which follows from Lemmas 1 and 2.

Theorem 1 Problems DL cost (T ) and DL time (K ) are NP-hard, even if all fixed overheads are equal to zero.

Time-cost trade-off
For zero overheads, the arguments from Sect. 2 can be simplified. The LP formulations (13)-(19) and (20)-(23) still hold, but the number of different sequences can be reduced from η = 2^m m!, given by (12), to η = m!. Notice that due to zero overheads there is no need to select the set of active processors P′, since an idle processor can be kept in any place of the sequence. The smaller value of η results in a slightly lower time complexity of enumerating all trade-offs, namely O(4^m m^m m! × LP(m − 1, 4m)). The problem of finding the extreme points in the (T , K )-space, with the shortest schedule or with the smallest cost, was addressed in prior research for the special case when all processors are available simultaneously and have no deadline and capacity restrictions, r i = 0, d i = B i = ∞ for all 1 ≤ i ≤ m. As shown in Bharadwaj et al. (1994, 1996) and Blazewicz and Drozdowski (1997), the shortest schedule is obtained if the processors are sequenced in non-decreasing order of c i and complete their computations simultaneously. For the same special case, the cheapest solution is obtained if the whole load is processed by the cheapest processor, i.e., a processor P i with the minimum per-unit cost ℓ i . Hence, in the bicriteria problem DL bicrit the end-points (T 0 , K 0 ), (T q , K q ) of the time-cost trade-off can be found in O(m log m) and O(m) time, respectively.
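The cited closed-form result for r i = 0, d i = B i = ∞ can be sketched as follows (our sketch, with hypothetical inputs): ordering the processors by non-decreasing c i and forcing simultaneous completion means that the communication and computation of P i+1 fit exactly into the computation interval of P i , giving x i+1 = x i a i /(a i+1 + c i+1 ); the resulting loads are then scaled so that they sum to V.

```python
# A sketch of the shortest-schedule load distribution for s_i = p_i = f_i = 0
# and no availability/capacity restrictions.
def shortest_schedule(c, a, V):
    order = sorted(range(len(c)), key=lambda i: c[i])   # non-decreasing c_i
    ratios = [1.0]
    for prev, cur in zip(order, order[1:]):
        # simultaneous completion: (c_cur + a_cur) * x_cur = a_prev * x_prev
        ratios.append(ratios[-1] * a[prev] / (a[cur] + c[cur]))
    scale = V / sum(ratios)
    x = {i: rho * scale for i, rho in zip(order, ratios)}
    T = (c[order[0]] + a[order[0]]) * x[order[0]]       # schedule length
    return x, T
```

For two identical processors with c = a = 1 and V = 3, the first processor gets twice the load of the second one, and both finish at T = 4.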
For the general case of arbitrary r i , d i , B i , finding the solution (T q , K q ) with the lowest cost, i.e., the rightmost point of the time-cost trade-off, is computationally hard by Theorem 1: even though the schedule length may be arbitrary when searching for the cheapest schedule, the processor availability constraints r i , d i , B i may impose limits equivalent to a bound on the schedule length. We conjecture that finding the solution (T 0 , K 0 ) with the shortest schedule is also computationally hard.
Conjecture 1 For arbitrary r i , d i , B i , for all processors P i ∈ P, problem DL time (∞) is NP-hard, even if computation time, communication time and cost have no fixed overheads.
In the remaining part of this section, we consider the case of agreeable processors. In that case, the processors can be renumbered so that the two conditions

c 1 ≤ c 2 ≤ · · · ≤ c m and ℓ i /ℓ i+1 ≤ c i /c i+1 , 1 ≤ i ≤ m − 1, (38)

hold.
Theorem 2 If the processors of P are agreeable and have no availability and capacity restrictions, i.e., r i = 0, d i = B i = ∞, 1 ≤ i ≤ m, then an optimum solution can be found in polynomial time.
Proof Assume that the processors are numbered in accordance with (38) and that they are activated in the order of their numbering. The processor sequence corresponding to c 1 ≤ c 2 ≤ · · · ≤ c m guarantees the shortest schedule (Bharadwaj et al. 1994, 1996; Blazewicz and Drozdowski 1997). It is also known (Yang et al. 2007) that in the shortest schedule there are no idle times between communications and all processors finish computation simultaneously. We show, by an interchange argument, that under the agreeable condition (38) the total cost is also minimum. Consider a pair P i , P i+1 in the sequence, and assume that the communication to P i starts τ units of time before the end of the schedule. The load processed by P i is x i = τ/(a i + c i ).
Since there are no idle times in the schedule and P i+1 receives its load and processes it while P i is computing, the load processed by P i+1 is x i+1 = τ a i /((a i + c i )(a i+1 + c i+1 )). The cost of processing on P i , P i+1 is

τ ( ℓ i /(a i + c i ) + ℓ i+1 a i /((a i + c i )(a i+1 + c i+1 )) ).

If the order of the communications were P i+1 , P i , the cost would be

τ ( ℓ i+1 /(a i+1 + c i+1 ) + ℓ i a i+1 /((a i + c i )(a i+1 + c i+1 )) ).

The difference between the two costs is

τ ( ℓ i c i+1 − ℓ i+1 c i ) / ((a i + c i )(a i+1 + c i+1 )).

Thus, the processor sequence (P i , P i+1 ) results in a cheaper solution if ℓ i /ℓ i+1 ≤ c i /c i+1 . Hence, for agreeable processors the shortest schedule is also the cheapest, and the Pareto front reduces to a single point in the time×cost space. It is possible to check whether the processors are agreeable in O(m log m) time and to calculate the load sizes x i in O(m) time (Blazewicz and Drozdowski 1997).
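The interchange condition derived above yields a simple agreeability check; the sketch below (ours) sorts by c i and verifies the equivalent cross-product form ℓ i c i+1 ≤ ℓ i+1 c i for adjacent pairs, which avoids divisions.

```python
# A sketch of the O(m log m) agreeability check suggested by the interchange
# argument: after sorting by c_i, adjacent pairs must satisfy
# l_i / l_{i+1} <= c_i / c_{i+1}, i.e. l_i * c_{i+1} <= l_{i+1} * c_i.
def is_agreeable(c, l):
    order = sorted(range(len(c)), key=lambda i: (c[i], l[i]))
    return all(l[i] * c[j] <= l[j] * c[i]
               for i, j in zip(order, order[1:]))
```

If the check succeeds, a single schedule is simultaneously time-optimal and cost-optimal, so no trade-off needs to be constructed.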

Zero transfer overheads: s i = c i = 0
The main model studied in this section is characterized by zero transfer times and zero fixed cost overheads for all processors P i :

s i = c i = 0 and f i = 0, i = 1, . . . , m.

In terms of the three types of decisions introduced in Sect. 1, only decisions of types 1 and 3 should be considered: any processor P i which gets a zero-size chunk x i = 0 should be removed from the list P′ of active processors, and all processors in P′ can be sequenced arbitrarily. We also discuss how the proposed methods can be adjusted for the case with arbitrary cost overheads f i (Sect. 4.2), and their applicability to the related models with nonzero transfer times (Sect. 4.3).

Zero fixed cost overheads: f i = 0
In this section, we study the version of the main problem with the cost function F = Σ_{i=1}^{m} ℓ i x i , i.e., f i = 0 for 1 ≤ i ≤ m. In the cost minimization problem DL cost (T ), given the schedule length limit T , the upper bound u i (T ) on the chunk x i allocated to processor P i can be found from (6) combined with (9) and (11):

u i (T ) = min { B i , max { 0, ( min {T , d i } − r i − p i ) / a i } } . (39)

We assume that Σ_{i=1}^{m} u i (T ) ≥ V , so that a feasible solution exists. For the given T , the cost minimization problem can be modeled as the following linear program:

minimize Σ_{i=1}^{m} ℓ i x i (40)
subject to Σ_{i=1}^{m} x i = V , (41)
0 ≤ x i ≤ u i (T ), 1 ≤ i ≤ m. (42)

Note that (40)-(42) is the continuous knapsack problem, solvable in O(m) time by the algorithm due to Balas and Zemel (1980), which implies the following result.
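The continuous knapsack structure can be sketched as follows (our sketch); for clarity it uses an O(m log m) sort-based greedy rather than the O(m) median-finding scheme of Balas and Zemel cited above.

```python
# A sketch of LP_cost(T) as a continuous knapsack, solved greedily by cost rate.
def cheapest_distribution(l, u, V):
    """l[i]: per-unit cost, u[i]: upper bound u_i(T); returns loads x or None."""
    if sum(u) < V:
        return None                      # no feasible solution for this T
    x = [0.0] * len(l)
    for i in sorted(range(len(l)), key=lambda i: l[i]):
        x[i] = min(u[i], V)              # load the cheapest processors first
        V -= x[i]
        if V == 0:
            break
    return x
```

The last processor to receive load (the "split item" of the knapsack) is, in general, the only one loaded strictly below its bound; this observation drives the parametric algorithm below.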

Statement 2 If there are no transfer overheads and the cost function is F = Σ_{i=1}^{m} ℓ i x i , then problem DL cost (T ) is solvable in O(m) time.
For the counterpart DL time (K ) of problem DL cost (T ), we can only propose an O(m log m)-time algorithm. As we show next, the bicriteria problem DL bicrit is solvable in O(m log m) time as well. Thus, in what follows we focus on DL bicrit ; a solution to DL time (K ) can be found from a solution to DL bicrit without increasing the O(m log m) time complexity.
In order to find the Pareto front for DL bicrit , consider LP cost (T ) as the underlying model and treat it in a parametric way, with the parameter T varying in the interval [ min 1≤i≤m {d i }, max 1≤i≤m {d i } ]. Notice that for small values of T from that interval, problem LP cost (T ) may be infeasible.
Since LP cost (T ) is the continuous knapsack problem for each fixed T , an optimal load distribution is defined by the formulae

x i = u i (T ), 1 ≤ i ≤ s − 1,
x s = V − Σ_{i=1}^{s−1} u i (T ),
x i = 0, s + 1 ≤ i ≤ m,

assuming that the processors are numbered so that

ℓ 1 ≤ ℓ 2 ≤ · · · ≤ ℓ m . (45)

Informally, the cheapest processor P 1 gets the highest possible load, and every subsequent processor P i gets the highest possible load after all cheaper processors are loaded to their maximum capacity, without violating the given T . We start with the rightmost point of the trade-off, which corresponds to a solution with the largest length and minimum cost. It can be found in O(m) time by solving problem LP cost (T ) with T = max 1≤i≤m {d i }.
Consider a cost-optimal schedule of some length T . Let P s be the processor with the smallest index whose load can be increased without increasing the length T . Such a processor satisfies the strict inequality x s < u s (T ). We call P s a split processor, as it corresponds to the so-called split item in the solution of the underlying continuous knapsack problem. In accordance with definition (39) of u i (T ), we divide the processors {P 1 , . . . , P s−1 } into three subsets: the critical processors P c , which complete their computations exactly at time T ; the non-critical processors P n , which are loaded to a bound independent of T ; and the excluded processors P e , which get zero load. Note that each excluded processor P i ∈ P e defined for some schedule length T remains excluded for any smaller schedule length, since u i (T ) is non-decreasing in T .
The approach described below moves from a current breakpoint, denoted by (T , K ), to the next one (T ′, K ′), repeating similar steps: it simultaneously decreases the load values of the processors P i ∈ P c and compensates that change by increasing the load of the split processor P s , keeping the total processed volume equal to V at all stages. For an efficient implementation, we define two auxiliary values:

h (P c ) = Σ_{P i ∈ P c } 1/a i , k (P c ) = Σ_{P i ∈ P c } ℓ i /a i .

Consider a transition from a solution with load values x i , corresponding to (T , K ), to a solution with load values x i ′, corresponding to (T ′, K ′).
For any P i ∈ P c , the completion time changes from T = r i + p i + a i x i to T − Δ = r i + p i + a i x i ′, so that

x i ′ = x i − Δ/a i . (46)

For the split processor P s , the increase in its load should be equal to the cumulative decrease in the loads of the processors of P c :

x s ′ = x s + Δ h (P c ). (47)

The next breakpoint (T ′, K ′) is triggered by one of the events (a)-(d), whichever is reached first. It corresponds to the smallest value of Δ, defined in the event descriptions below.
(a) The set P c is adjusted to exclude a processor whose load reduces to 0. This happens if the decreased value T ′ = T − Δ reaches r i + p i for some P i ∈ P c . In this case

Δ = T − max { r i + p i | P i ∈ P c } .

(b) The set P c is adjusted to include a non-critical processor which becomes critical. This happens if T ′ = T − Δ reaches the deadline d i of some non-critical processor P i ∈ P n . In this case

Δ = T − max { d i | P i ∈ P n , d i < T } .

(c) The split processor P s can no longer get any additional load, since its increased load x s ′ = x s + Δ h (P c ) reaches the absolute upper bound B s , which implies

Δ = ( B s − x s ) / h (P c ).

(d) The split processor P s can no longer get any additional load, since processing its increased load x s ′ makes P s a critical processor. This happens if the completion time r s + p s + a s (x s + Δ h (P c )) of P s becomes equal to T ′ = T − Δ, which implies

Δ = ( T − (r s + p s + a s x s ) ) / ( a s h (P c ) + 1 ).
Thus, it suffices to calculate the Δ-values for each of the events (a)-(d), select the smallest one and compute the characteristics of the next breakpoint:

T ′ = T − Δ, K ′ = K − Δ k (P c ) + ℓ s Δ h (P c ).

Here the term Δ k (P c ) defines the cost change due to the decrease in the loads of the processors of P c , while the term ℓ s Δ h (P c ) = ℓ s (x s ′ − x s ) defines the cost change due to the increase in the load of the split processor P s . To complete the transition from (T , K ) to (T ′, K ′), perform the following updates.
- In the case of event (a), triggered by processor P i ∈ P c , calculate x s ′ by (47), remove P i from P c (it joins P e ) and update h (P c ) and k (P c ) accordingly.
- In the case of event (b), triggered by processor P i ∈ P n , calculate x s ′ by (47), move P i from P n to P c and update h (P c ) and k (P c ) accordingly.
- In the case of event (c), set x s = B s and define a new split processor by considering the processors P i , i = s + 1, s + 2, …, one by one: if r i + p i < T ′, then P i becomes the new split processor (note that its load is 0); otherwise, P i joins the set of excluded processors P e , and the next processor is examined. If no processor from {P s+1 , . . . , P m } becomes a split processor, the algorithm stops.
- In the case of event (d), move P s to P c , update h (P c ) and k (P c ) accordingly, and find the next split processor P s as in the case of event (c).
- If both events (c) and (d) happen simultaneously, proceed as in the case of event (d).
We can now treat the found breakpoint (T ′, K ′) as the current one and proceed similarly to find the next breakpoint. The algorithm stops if s = m and event (c) or (d) happens.
In order to implement all calculations efficiently, we maintain the sets P_c and P_n as priority queues so that each calculation (48) or (50) can be done in O(1) time. Since an element is added to P_c or removed from P_c at most once, all add/remove operations require O(m log m) time. The same holds for add/remove operations on P_n. Similarly, P_e is updated no more than m times.
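A minimal sketch of this bookkeeping, assuming Python's `heapq` min-heaps with negated keys to obtain max-heap behaviour (the function names are ours, not the paper's):

```python
import heapq

# P_c is keyed by r_i + p_i: as T decreases, event (a) is triggered by
# the critical processor with the LARGEST r_i + p_i.  P_n is keyed by
# d_i for event (b).  heapq is a min-heap, so keys are negated.
crit, noncrit = [], []

def push_critical(i, r, p):
    heapq.heappush(crit, (-(r + p), i))        # O(log m)

def push_noncritical(i, d):
    heapq.heappush(noncrit, (-d, i))           # O(log m)

def delta_a(T):
    """Candidate Delta for event (a), or infinity if P_c is empty."""
    return T + crit[0][0] if crit else float('inf')

def delta_b(T):
    """Candidate Delta for event (b), or infinity if P_n is empty."""
    return T + noncrit[0][0] if noncrit else float('inf')
```

Each push or pop costs O(log m), and each processor enters or leaves each queue at most once, which yields the O(m log m) bound.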
Initialization involves renumbering the processors in accordance with (45), finding the first solution x_1, x_2, . . . , x_m for T = max{d_i | 1 ≤ i ≤ m}, and computing K together with the auxiliary values h(P_c) and k(P_c). All required steps can be done in O(m log m) time.
A transition from one breakpoint to the next one requires updating the two priority queues, which takes O(log m) time, and updating the five parameters x_s, T, K, h(P_c) and k(P_c), which can be done in O(1) time. Note that the x-values for i ≠ s are not maintained. Since there are at most m events of each type, the total number of breakpoints is no larger than 4m, and the overall time complexity is O(m log m). Thus, the following statement holds.

Statement 3
Problem DL_bicrit has a trade-off with at most 4m breakpoints, which can be computed in O(m log m) time.
Let us finish this section with an example to illustrate the calculation of the time-cost trade-off. Assume that the load size is V = 30 and the number of processors is m = 8. Their parameters are given in Table 3. The breakpoints (T, K) are presented in the first two rows of Table 4. The boxed elements represent the load of the split processor P_s. The loads of the remaining processors are not maintained by the described algorithm, in order to achieve the O(m log m) time complexity. For completeness, we present the optimal x-values for all processors.

Arbitrary cost overheads f_i, ℓ_i
In the general case, with nonzero overheads f_i in the cost function K = Σ_{i=1}^{m} (f_i + ℓ_i x_i), both problems DL_time and DL_cost are NP-hard, see Drozdowski and Lawenda (2005). As we show in this section, the problem can be solved efficiently if we limit our search to a class of solutions with a fixed set of active processors P' ⊆ P. The associated problem is of the following form: given a set of active processors P', it is required to allocate a positive load to each active processor, minimizing the objective function T or K. Note that some processors may get an infinitesimally small load ε > 0; such a processor then has completion time r_i + p_i + a_i ε, which should be taken into account when calculating the length T of the schedule.
There are two common properties that hold for any of the problems DL_time, DL_cost and DL_bicrit under the assumption that P' is fixed. Property 1: the component Σ_{P_i ∈ P'} f_i is constant and can be excluded from K. Property 2: the schedule length T satisfies T > ρ, where ρ = max{r_i + p_i | P_i ∈ P'}. Based on these two properties, the results from Sect. 4.1 can be adjusted to handle the case with a fixed set P'.
For problem DL_cost(T) with a given set P' and T > ρ, consider the continuous knapsack formulation (40) defined over P'. If in an optimal knapsack solution x_i = 0 for some P_i ∈ P', then such a solution is adjusted by replacing the 0-values by ε. The overall time complexity remains the same, O(m).
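For intuition, the continuous-knapsack step can be sketched as a greedy allocation. This is our own illustrative sketch, not formulation (40) itself; the capacity expression min(B_i, (T − r_i − p_i)/a_i) is an assumption consistent with the model, and the sorting below costs O(m log m), whereas the O(m) bound in the text relies on linear-time (weighted-median) selection instead.

```python
def min_cost_allocation(V, T, procs):
    """Greedy continuous knapsack: allocate a load of volume V at
    minimum cost sum(l_i * x_i), subject to every processor finishing
    its share by time T.

    procs: list of dicts with keys r, p, a, l (unit cost), B (bound).
    Returns the list of loads x_i, or None if V cannot fit by time T.
    """
    # Capacity of processor i under deadline T (assumed form, see above).
    caps = [min(q['B'], max(0.0, (T - q['r'] - q['p']) / q['a']))
            for q in procs]
    order = sorted(range(len(procs)), key=lambda i: procs[i]['l'])
    x = [0.0] * len(procs)
    remaining = V
    for i in order:                  # fill cheapest processors first
        x[i] = min(caps[i], remaining)
        remaining -= x[i]
        if remaining == 0:
            break
    return x if remaining == 0 else None
```

In the adjusted solution described above, any 0-values among the active processors would then be replaced by ε-loads.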
For problem DL_bicrit, apply the approach from Sect. 4.1 to the processor set P' and output the part of the trade-off that satisfies T > ρ. Treat all found solutions as if each idle processor gets an ε-load. This assumption does not affect the values of T and K, since ε is infinitesimally small. One case needs special attention: it occurs if all breakpoints (T, K) satisfy T ≤ ρ. In that case, consider the rightmost point (T*, K*) of the trade-off and output the unique solution obtained from (T*, K*) by allocating ε-loads to all idle processors.

Arbitrary transfer overheads s_i, c_i
The results from Sect. 4.1 can be applied to special scenarios with nonzero transfer overheads s_i, c_i.
One scenario arises in parallel communication with the simultaneous start mode, see Kim (2003), Robertazzi (2003), with computation speeds slower than transfer speeds. In the simultaneous start mode, load transfer to all worker processors starts at the same time, and the workers start computing as soon as the first grain of the load is received. Since computation is slower than transfer, the computation time of any grain exceeds the transfer time of any subsequent grain. Thus, in parallel communication with simultaneous start, communication does not affect the overall schedule length and cost, and the results from Sect. 4.1 apply.
Another scenario is typical for a pipeline-like computing mode. Load scattering and processing are interleaved so that communications and computations are performed at different stages: the load is distributed in one interval (say, interval i) and processed in the next interval (i + 1). If there is a common communication time τ_comm for all processors and a common computation time τ_comp, with τ_comm ≤ τ_comp, then the communications executed in interval i do not restrict the partitioning of the load that minimizes the computing time and the cost in interval i + 1. It can be shown that the general case of the pipeline mode, characterized by s_i > 0, a_i > 0, is NP-hard [see, e.g., DLS with processor release times in Drozdowski and Lawenda (2005)].

Conclusions
In this paper, we analyze time/cost optimization for divisible load scheduling problems with arbitrary processor memory sizes, ready times, deadlines, and communication and computation start-up costs. Three versions of the problem are studied: DL_time(K), schedule length minimization for a given budget limit K; DL_cost(T), cost minimization for a given schedule length limit T; and DL_bicrit, constructing the set of time-cost Pareto-optimal solutions. All three versions can be solved in polynomial time for fixed m.
The case with given upper bounds on both the schedule length and the cost appears to be NP-hard even if all fixed overheads are zero (p_i = s_i = f_i = 0 for all P_i ∈ P). This result is rather unusual: all previous NP-hardness results in divisible load theory assumed nonzero fixed overheads. Interestingly, a divisible load problem is thereby linked to scheduling problems with preemption, for which NP-hardness results are rather atypical (see, e.g., Sitters 2001; Drozdowski et al. 2017).
We leave open the question of the time complexity of finding a shortest schedule under processor availability constraints, but with zero fixed communication and computation overheads, when the computation cost is unrestricted (K = ∞). We believe that this problem is computationally hard, see Conjecture 1. In contrast, the version with negligible communication times is solvable in O(m log m) time, even in its bicriteria setting.
Our summary table provided in the Introduction presents the state-of-the-art results in divisible load scheduling and can be used as a guideline for future research.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.