Approximating Vector Scheduling: Almost Matching Upper and Lower Bounds

We consider the Vector Scheduling problem, a natural generalization of the classical makespan minimization problem to multiple resources. Here, we are given $n$ jobs, represented as $d$-dimensional vectors in $[0,1]^d$, and $m$ identical machines, and the goal is to assign the jobs to machines such that the maximum load of each machine over all the coordinates is at most 1. For fixed $d$, the problem admits an approximation scheme, and the best known running time is $n^{f(\epsilon,d)}$ where $f(\epsilon,d) = (1/\epsilon)^{\tilde{O}(d)}$ ($\tilde{O}$ suppresses polylogarithmic terms in $d$). In particular, the dependence on $d$ is double exponential. In this paper we show that a double exponential dependence on $d$ is necessary, and give an improved algorithm with essentially optimal running time.
Specifically, we let $\exp(x)$ denote $2^x$ and show that: (1) For any $\epsilon < 1$, there is no $(1+\epsilon)$-approximation with running time $\exp\left(o(\lfloor 1/\epsilon \rfloor^{d/3})\right)$ unless the Exponential Time Hypothesis fails.
(2) No $(1+\epsilon)$-approximation with running time $\exp\left(\lfloor 1/\epsilon \rfloor^{o(d)}\right)$ exists, unless NP has subexponential time algorithms. (3) Similar lower bounds also hold even if $\epsilon m$ extra machines are allowed (i.e. with resource augmentation), for sufficiently small $\epsilon > 0$. (4) We complement these lower bounds with a $(1+\epsilon)$-approximation that runs in time $\exp\left((1/\epsilon)^{O(d \log \log d)}\right) + nd$.
This gives the first efficient approximation scheme (EPTAS) for the problem.


Introduction
We consider the Vector Scheduling problem defined as follows. The input consists of a collection $J$ of $n$ jobs $p_1, \ldots, p_n$, viewed as $d$-dimensional vectors from $[0,1]^d$, and $m$ identical machines. The goal is to find an assignment of the jobs to the machines such that the load satisfies $\|\sum_{p \in P_i} p\|_\infty \le 1$ for each machine $i \in [m]$, where $P_i$ is the set of jobs assigned to machine $i$. That is, the maximum load on any machine in any coordinate is at most 1.
Vector Scheduling is the natural multi-dimensional generalization of the classic Multiprocessor Scheduling problem (also known as makespan minimization, $P||C_{\max}$, or load balancing). In the latter problem, the goal is to assign $n$ jobs with arbitrary processing times to $m$ machines in order to minimize the maximum sum of processing times (load) over all the machines. However, for many applications, the jobs may use different resources and the load of a job cannot be described by a single aggregate measure. For example, if jobs have both CPU and memory requirements, their processing requirement is best modeled as a two-dimensional vector, where the value in each coordinate corresponds to one of the requirements. Note that the assumption that the maximum load of a machine in any coordinate is 1 is without loss of generality, as the different coordinates can be scaled independently.
In this paper we are concerned with approximation algorithms. We say that an algorithm is an α-approximation for some α > 1 if it finds an assignment with load at most α, whenever there exists a feasible schedule with load at most 1.

Previous Work
Multiprocessor Scheduling and the related Bin Packing problem are two of the most fundamental problems in combinatorial optimization with a long and rich history. We only describe the work on Multiprocessor Scheduling in the setting where the number of machines m is part of the input. It is well-known that Multiprocessor Scheduling is strongly NP-hard [10].
The first polynomial time approximation scheme (PTAS), that is, a $(1+\epsilon)$-approximation algorithm with polynomial running time for every fixed $\epsilon > 0$, was obtained by Hochbaum and Shmoys [11]. The running time of their algorithm is $O(n^{O(1/\epsilon^2)})$. Note that by the strong NP-hardness of the problem one cannot hope to have a running time with polynomial dependence on $1/\epsilon$ (i.e. an FPTAS), unless P = NP.
An efficient polynomial time approximation scheme (EPTAS), i.e. an algorithm with running time $f(\epsilon) n^{O(1)}$, was implicit in [11] by replacing the dynamic program by an integer linear program and using fast integer programming algorithms in fixed dimensions. Alon et al. [1] developed a more general framework to obtain EPTASes for parallel machine scheduling that run in $f(\epsilon) + O(n)$ time, where $f(\epsilon)$ is a double exponential function in $1/\epsilon$.
Recently, this running time was substantially improved by Jansen [14] to $O(2^{\tilde{O}(1/\epsilon^2)} + n^{O(1)})$. His main idea is to use fast integer programming in fixed dimensions, together with an elegant result of Eisenbrand and Shmonin [6] about the existence of optimum integer solutions with small support. Most of these results also extend to the setting of uniform machines, i.e. a setting where the machine speeds differ (see e.g. [12,14]).
Fewer results are known for the case when the number of dimensions exceeds one. Chekuri and Khanna [5] gave the first polynomial-time approximation scheme for a fixed number of dimensions. They gave an algorithm with running time $n^{g(\epsilon,d)}$, where $g(\epsilon,d) = (1/\epsilon)^{\tilde{O}(d)}$; that is, the running time is $n^{(1/\epsilon)^{\tilde{O}(d)}}$. This seems to be the currently best known running time for this problem. PTASes for several other generalizations are also known [3,7,8].
When $d$ is part of the input, Chekuri and Khanna [5] gave a polynomial time $O(\ln^2 d)$-approximation and proved that it is NP-hard to approximate the problem within any constant factor. This approximation factor has been recently improved to $O(\log d)$ by Meyerson et al. [18]. The latter result even holds in the online setting.

Our Contribution
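A natural question is whether there exists an approximation scheme for Vector Scheduling with a single exponential running time in $1/\epsilon$ and $d$, e.g. $\exp(\mathrm{poly}(1/\epsilon, d))$. We rule out this possibility by showing the following strong lower bound (Theorem 1): for any $\epsilon < 1$, there is no $(1+\epsilon)$-approximation with running time $\exp\left(o(\lfloor 1/\epsilon \rfloor^{d/3})\right)$ unless the Exponential Time Hypothesis fails. This follows from a relatively simple reduction from the 3-Dimensional Matching problem. The same reduction also implies the following hardness under a more standard complexity assumption (Theorem 2): no $(1+\epsilon)$-approximation with running time $\exp\left(\lfloor 1/\epsilon \rfloor^{o(d)}\right)$ exists, unless NP has subexponential time algorithms.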
One may wonder whether these lower bounds are robust or whether they crucially exploit the fact that no additional machines are allowed. It is instructive to consider the case of $d = 1$ (i.e. Multiprocessor Scheduling). Recall that no FPTAS is possible for the problem. However, if one allows some extra machines (say $\epsilon m$ of them), then the running time dependence on $\epsilon$ reduces dramatically and in particular, an FPTAS is possible. In fact, the known FPTASes for Bin Packing imply that even very few extra machines (poly-logarithmic in $m$) suffice [16,20], and one does not even need to violate the capacity of any machine.
Somewhat surprisingly, we show that extra machines do not help for Vector Scheduling, provided that the desired approximation ratio is sufficiently small.

Theorem 3 For any
To complement the lower bounds above, we show the following algorithmic result.

Theorem 4 For any
By the lower bounds above, the running time is essentially the best possible (modulo the O(log log d) factor in the exponent), and the nd term is simply the time required to read the input. Theorem 4 gives the first EPTAS for Vector Scheduling.

Techniques
At a high level, the algorithm is similar to that of [14], and relies on integer programming in fixed dimensions and the existence of optimum integer solutions with small support. However, there are some important differences between $d = 1$ and $d > 1$. In particular, for $d = 1$ the small jobs (with size $\le \epsilon$) do not cause any problems and can later be assigned greedily in the remaining space, after solving the problem for just the big jobs. However, for $d \ge 2$, the big and small jobs (by small we mean jobs that are small in every dimension) interact in more complex ways and must be considered together. The following example illustrates this difficulty.
Example 1 Consider the following instance in $d = 2$ dimensions, with $m = 2$ machines and the following jobs: $p_1 = (1/2, 0)$, $p_2 = (1/2, 0)$ and $p_i = (\epsilon/2, \epsilon)$ for $3 \le i \le 2/\epsilon + 2$. Clearly, these jobs can be scheduled on two machines by assigning the first two jobs to separate machines and splitting the small jobs evenly. However, if the two large jobs are assigned to the same machine, then no assignment of the small jobs keeps the maximum load within a factor $1 + \epsilon$ of 1 once $\epsilon < 1/3$. The two large jobs have total load $(1, 0)$. As the small jobs have total load $(1, 2)$, no matter how these are assigned to the two machines, if a fraction $x$ of them goes to the machine holding the large jobs, then one machine will have load at least $\min_{x \in [0,1]} \max\{1 + x, 2x, 1 - x, 2(1 - x)\}$, which is $4/3$ (attained for $x = 1/3$).
Chekuri and Khanna [5] overcame this problem by 'guessing' the division between small and large jobs for each machine. This allows them to decouple the assignment of small and big vectors. However, as there are roughly $m^{(1/\epsilon)^d}$ different possible divisions, with precision $\epsilon$, this is not useful to obtain an efficient polynomial time approximation scheme.
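The bound of 4/3 in Example 1 can be checked numerically. The following is a minimal sketch, not part of the original argument: it treats the small jobs as divisible (which only makes the lower bound easier to evade), and the function name `worst_load` and the grid resolution are illustrative choices.

```python
from fractions import Fraction

def worst_load(x):
    """Largest coordinate load over both machines when the machine holding
    both large jobs also receives a fraction x of the small jobs."""
    machine_a = (1 + x, 2 * x)          # large jobs (1, 0) plus x * (1, 2)
    machine_b = (1 - x, 2 * (1 - x))    # the remaining (1 - x) * (1, 2)
    return max(*machine_a, *machine_b)

# Scan the fractions x = k/300 in [0, 1] and keep the best split.
grid = [Fraction(k, 300) for k in range(301)]
best_x = min(grid, key=worst_load)
best_load = worst_load(best_x)
print(best_x, best_load)
```

Exact rational arithmetic is used so that the minimizer on the grid coincides with the value $x = 1/3$ stated in the example.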
To get around this, we incorporate both large and small vectors in our mixed integer linear program (MILP), but ensure that it has only few constraints by tracking only some coarse-grained information for the small jobs. We find an optimum solution to this MILP, which gives an integral assignment of large jobs, but small jobs might be assigned fractionally. We then show how to assign the small jobs to machines without overloading them. To do this, we first assign the jobs greedily guided by a potential function, which guarantees that the aggregate amount of overload on machines is small. This load is small enough to ensure that the jobs on overloaded machines can be redistributed in a round-robin manner. A naive implementation of the greedy assignment requires O(mn) time (as for each job, we need to determine which machine causes the least increase in potential), so we also present some additional ideas to show how everything can be done in linear time.

Organization
In Sect. 2 we state our notation and the hypotheses on which our lower bounds are based, and we describe the relevant background on integer programming. In Sect. 3 we prove our lower bounds for Vector Scheduling and we present our algorithm in Sect. 4.

Preliminaries
For a vector $v$, let $v_j$ denote its $j$-th coordinate. For two vectors $a, b$ we say that $a \le b$ if $a_i \le b_i$ for all $i$. Throughout the paper the logarithm $\log$ is taken with base 2 and we let $\exp(x)$ denote $2^x$. We use $\tilde{O}(\cdot)$ to suppress polylogarithmic factors. Without loss of generality we assume that the number of machines is less than the number of jobs (otherwise assign one job per machine or conclude infeasibility).
In the 3-CNF-Sat problem, we are given a Boolean expression in conjunctive normal form, consisting of N variables and M clauses that each consist of three literals. The question is whether or not there exists an assignment of logical values to the variables such that the expression evaluates to TRUE. Impagliazzo, Paturi and Zane formulated the Exponential Time Hypothesis, which in combination with the sparsification lemma [4] can be stated as follows.

Hypothesis 1 (Exponential Time Hypothesis (ETH) [13]) 3-CNF-Sat with N variables and M clauses cannot be solved in time
We will use the following well-known result for fast integer linear programs with few integer variables.

Theorem 5 (Lenstra [17], Kannan [15], Frank and Tardos [9]) Consider a mixed-integer linear program $\min\{c^T x \mid Ax \ge b \text{ and } \forall i \in I : x_i \in \mathbb{Z}\}$ with $n$ variables and $m$ constraints, where $I \subseteq [n]$ denotes the set of indices of integer variables. Let $s$ denote the binary encoding length of the input. There is an algorithm that finds a feasible solution or decides that there is no feasible solution in $O(n^{2.5n + o(n)} \cdot s)$ arithmetic operations.
Relatively recently, based on an elegant pigeonhole argument, Eisenbrand and Shmonin [6] showed that every feasible integer linear program has an optimum solution with small support.

Theorem 6 (Eisenbrand and Shmonin [6]) Let $\min\{c^T y \mid Ay = b, y \ge 0, y \in \mathbb{Z}^n\}$ be an integer program, where $A \in \mathbb{Z}^{m \times n}$ and $c \in \mathbb{Z}^n$. If this integer program has a finite optimum, then there exists an optimal solution $y^* \in \mathbb{Z}^n_{\ge 0}$ in which the number of nonzero components is at most $2(m + 1)(\log(m + 1) + s + 2)$, where $s$ is the largest size in binary representation of any coefficient of $A$ and $c$.
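To get a feeling for the support bound of Theorem 6, the expression can simply be evaluated. The sketch below is only an illustration; the function name `support_bound` is hypothetical, and the logarithm is taken with base 2 as in the rest of the paper.

```python
import math

def support_bound(m, s):
    """Evaluate 2(m + 1)(log(m + 1) + s + 2) with log taken base 2.

    m: number of equality constraints of the integer program.
    s: largest binary size of any coefficient of A and c.
    """
    return 2 * (m + 1) * (math.log2(m + 1) + s + 2)

# Three constraints with one-bit coefficients: 2 * 4 * (2 + 1 + 2) = 40.
small_case = support_bound(3, 1)
print(small_case)
```

Note that the bound depends only on the number of constraints and the coefficient size, not on the number of variables $n$.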

Lower Bounds on the Running Time
We prove our lower bounds by a reduction from 3-Dimensional Matching (3-DM) to Vector Scheduling. In Sect. 3.1 we prove Theorem 1 by describing the reduction and proving that an approximate solution to the Vector Scheduling instance implies an exact solution for 3-DM and hence 3-CNF-Sat. In Sect. 3.2 we outline how the same reduction implies Theorem 2. Finally, in Sect. 3.3 we give the proof for Theorem 3 concerning resource augmentation.
Before we give our reduction, we first define the 3-Dimensional Matching problem. An instance of 3-DM consists of three disjoint sets $X$, $Y$, and $Z$, satisfying $|X| = |Y| = |Z| =: q$, and a set $T \subseteq X \times Y \times Z$ of triples. The goal is to find a subset of triples $T' \subseteq T$ such that each element of $X$, $Y$, and $Z$ occurs in exactly one triple of $T'$.
In [10], a reduction from 3-CNF-Sat to 3-DM is given that transforms instances of 3-CNF-Sat with $N$ variables and $M$ clauses into instances of 3-DM with $q = 6M$ and $|T| = 2MN + 3M + 2M^2 N(N - 1)$ (using better bookkeeping one can prove that $|T| = 17M$ suffices). Therefore, the ETH (Hypothesis 1) implies there is

The Construction
The main idea of the reduction is the following construction of a Vector Scheduling instance from 3-DM. For each triple in $T$ we construct a job (that we call a triple-job), and for each element in $X$, $Y$ or $Z$ we construct as many jobs as the number of times this element occurs in the triples (we call such jobs element-jobs). We explicitly refer to X-jobs, Y-jobs and Z-jobs if we want to distinguish the element-jobs of the three sets. For each element $i$, we designate exactly one of its jobs as the real element-job corresponding to $i$, and refer to the other element-jobs of $i$ as dummy jobs. The number of machines is equal to the number of triples. We will assign sizes to these jobs such that to obtain a schedule where the maximum load in any coordinate is at most 1, we need to schedule each triple together with its corresponding three element-jobs, and moreover these element-jobs are either all real or all dummy element-jobs.

Table 1: Construction of the jobs from elements and triples of the 3-DM problem (columns: job name; values of the coordinates).

Let $\epsilon > 0$ be such that $1/\epsilon$ is integer. Let $b = 1/\epsilon - 1$ and let $\mathbf{b}$ denote the vector that has $b$ in every coordinate. By $\langle i \rangle$ we denote the $(b + 1)$-ary encoding of the integer $i$ and by $\overline{\langle i \rangle}$ we denote its complement, that is, $\overline{\langle i \rangle} := \mathbf{b} - \langle i \rangle$. Let $\langle i \rangle_j$ denote the $j$-th digit from the right of $\langle i \rangle$. For ease of notation, we scale the jobs by a factor $b$. That is, all jobs are vectors in $\{0, \ldots, b\}^d$ and we want to know whether we can schedule the jobs such that the maximum load in each coordinate is at most $b$. To make the proofs easier to read, we rename the elements in the sets $X$, $Y$ and $Z$ by assuming that $X = Y = Z = \{1, \ldots, q\}$.
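The $(b+1)$-ary encoding and its complement can be illustrated with a few lines of code. This is a sketch under assumptions: the helper names `encode` and `complement` are hypothetical, digits are stored least significant first, and the number of digits is passed explicitly.

```python
def encode(i, b, num_digits):
    """Digits of i in base b + 1, least significant digit first."""
    digits = []
    for _ in range(num_digits):
        digits.append(i % (b + 1))
        i //= b + 1
    if i != 0:
        raise ValueError("num_digits is too small for this integer")
    return digits

def complement(digits, b):
    """Digit-wise complement: b minus each digit."""
    return [b - x for x in digits]

eps_inv = 4                  # 1/epsilon, assumed to be an integer
b = eps_inv - 1
code = encode(6, b, 3)       # 6 = 2 + 1 * 4, so the digits are [2, 1, 0]
comp = complement(code, b)   # [1, 2, 3]
matched = [x + y for x, y in zip(code, comp)]
# Pairing the code of 6 with the complement of a different integer (here 5)
# pushes at least one digit sum away from b.
mismatched = [x + y for x, y in zip(code, complement(encode(5, b, 3), b))]
print(matched, mismatched)
```

An encoding added to its own complement gives exactly $b$ in every digit, whereas a mismatched pair deviates from $b$ in some digit; this is the property the reduction relies on when it forces the load to be exactly $b$.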

The Formal Reduction
Given an instance $(X, Y, Z; T)$ of 3-DM, let $n_X(i)$ denote the number of triples $(x, y, z)$ for which $x = i$; in a similar way, we define $n_Y(i)$ and $n_Z(i)$. For each element $i \in X$, we create $n_X(i)$ jobs, one real X-job $i$ and $n_X(i) - 1$ dummy X-jobs. In a similar way, we create $n_Y(j)$ Y-jobs for each element $j \in Y$ and $n_Z(k)$ Z-jobs for each element $k \in Z$. Finally, we have $|T|$ triple-jobs, one for each triple $l \in T$. The number of machines is equal to $m := |T|$. Note that the total number of jobs is $n = 4|T|$, since each triple contributes one triple-job and three element-jobs. Recall that $|X| = |Y| = |Z| = q$, and let $\ell := \lceil \log_{1/\epsilon} q \rceil$. We associate a vector to each of the jobs as in Table 1. These vectors are $d$-dimensional, where $d := 7 + 3\ell$. In particular, the first four coordinates of a job indicate whether the job corresponds to an element in $X$, $Y$, $Z$ or to a triple in $T$. The following three coordinates encode for each X-, Y-, or Z-job whether it is a real job or a dummy job. The last part of each job encodes the element to which the job corresponds.
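The size of the constructed instance follows directly from the triples. The sketch below counts jobs, machines and dimensions for a toy instance; the function name `instance_size`, the dictionary keys, and the choice of $\ell$ as the smallest integer with $(1/\epsilon)^\ell \ge q$ are assumptions made for illustration.

```python
from collections import Counter

def instance_size(triples, q, eps_inv):
    """Job count, machine count and dimension of the constructed instance."""
    n_x = Counter(x for x, _, _ in triples)
    n_y = Counter(y for _, y, _ in triples)
    n_z = Counter(z for _, _, z in triples)
    element_jobs = sum(n_x.values()) + sum(n_y.values()) + sum(n_z.values())
    real_jobs = len(n_x) + len(n_y) + len(n_z)   # one real job per occurring element
    # Smallest ell with (1/eps)^ell >= q, computed without floating point.
    ell = 0
    while eps_inv ** ell < q:
        ell += 1
    return {
        "machines": len(triples),
        "jobs": len(triples) + element_jobs,
        "dummy_jobs": element_jobs - real_jobs,
        "dimension": 7 + 3 * ell,
    }

toy = instance_size([(1, 1, 1), (2, 2, 2), (1, 2, 2)], q=2, eps_inv=2)
print(toy)
```

For the toy instance with three triples, the count is twelve jobs on three machines, matching $n = 4|T|$.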

Proof of the Reduction
We now show that the reduction has the desired properties.

Lemma 1 (Completeness) If the 3-DM instance has a solution, then there exists an assignment of the jobs to the m machines such that the load on every machine in each coordinate is at most b.
Proof Consider the collection $T'$ of disjoint triples that cover $X$, $Y$ and $Z$. For each triple $(i, j, k) \in T'$ we assign the corresponding triple-job and the real element-jobs corresponding to $i$, $j$ and $k$ to a single machine. Clearly, every coordinate on every such machine has load at most $b$. We place each of the remaining triples $(i, j, k)$ on a machine with a dummy job for $i$, for $j$ and for $k$. It is easily verified that this is a feasible assignment.

Lemma 2 If the Vector Scheduling instance has a solution with load at most
Proof Consider any solution with load at most $(1 + \epsilon)b$. We begin with various properties of such a solution.

Property 1
The load is exactly b in each coordinate on each machine.
Proof The load of each machine is at most $(1 + \epsilon)b = b + 1 - \epsilon < b + 1$. As all jobs have integer coordinates, the load of each machine is at most $b$. Moreover, since $\sum_{i \in X} n_X(i) = \sum_{j \in Y} n_Y(j) = \sum_{k \in Z} n_Z(k) = |T| = m$, observe that the total amount of work in each coordinate summed over all jobs is $mb$. As all jobs are scheduled and the load is at most $b$, it is exactly $b$.

Property 2 Each machine processes exactly one triple-job, one X-job, one Y-job, and one Z-job.
Proof This follows immediately from the values in the first four coordinates and the previous property.
Property 3 Element-jobs assigned to the same machine are either all real jobs or all dummy jobs.
Proof From Property 1 and the values in the fifth, sixth and seventh coordinate we see that the following three statements are simultaneously true: 1. There is exactly one real X -job or dummy Z -job (coordinate 5); 2. There is exactly one real Y -job or dummy X -job (coordinate 6); 3. There is exactly one real Z -job or dummy Y -job (coordinate 7).
The claim now follows by combining this with the fact that by Property 2 there is exactly one (real or dummy) job of each of the types X , Y and Z .
Property 4 If a machine processes the triple-job (i, j, k) and a (real or dummy) element-job a, then a is equal to i, j or k, depending on whether a is an X , Y or Z -job.
Proof We only consider the case that $a$ is an X-element; the other cases are similar. By Properties 1 and 2, we know that $\overline{\langle i \rangle} + \langle a \rangle = \mathbf{b}$. Therefore, $a = i$.

If a machine processes three real element-jobs, then by the last property the corresponding three elements form a triple in the 3-DM instance. Let $T'$ consist of all triples corresponding to the triple-jobs that are scheduled together with real elements. Then, the triples in $T'$ have no overlap as there is only one real element-job corresponding to an element. Moreover, $T'$ covers all elements, because all jobs, and therefore also all real element-jobs, need to be scheduled.
Therefore we have the following lemma.

Lower Bound Assuming NP has no Subexponential Time Algorithms
Lemma 3 also implies the following.

Lower Bound with Resource Augmentation
In this subsection we show a lower bound on the running time of $(1+\epsilon)$-approximation algorithms for Vector Scheduling that are allowed resource augmentation, i.e. besides exceeding the optimal load by a factor $(1 + \epsilon)$, it is also allowed to use $\epsilon m$ extra machines.
To show this, we reduce from a stricter version of 3-Dimensional Matching, namely 3-Dimensional Matching-B, abbreviated as 3-DM-B. In this problem we are given a set of triples $T \subseteq X \times Y \times Z$, where $X$, $Y$ and $Z$ are disjoint finite sets and each element in $X \cup Y \cup Z$ appears at most $B$ times in the triples of $T$. The goal is to find a subset of triples $T' \subseteq T$ that maximizes the number of elements in $X \cup Y \cup Z$ that appear exactly once in $T'$.

Theorem 7 (Petrank [19]) For 3-Dimensional Matching-3 it is NP-hard to distinguish between instances where all elements can be covered by disjoint triples and those instances where at most a $(1 - \epsilon_{\text{3-DM}})$ fraction of the elements can be covered by disjoint triples, where $\epsilon_{\text{3-DM}} < 1$ is some universal constant.
Using this result we prove the following lemma.

Lemma 4 For Vector Scheduling in $d \ge d_0$ dimensions it is NP-hard to distinguish between instances where all jobs can be scheduled on $m$ machines with maximum load 1 and those instances where all jobs can be scheduled on $(1 + \epsilon_0)m$ machines with maximum load $1 + \epsilon_0$, where $0 < \epsilon_0 < 1$ and $d_0 \ge 1$ are some universal constants and $1/\epsilon_0$ is integer.
Proof Construct a Vector Scheduling instance from the 3-DM-3 problem in almost the same way as for 3-DM. The only difference is that for every (real or dummy) X-job $i$ and triple $(i, j, k)$, instead of only encoding $\langle i \rangle$ (respectively $\overline{\langle i \rangle}$), we append to this the encoding $\overline{\langle i \rangle}$ (respectively $\langle i \rangle$); all other jobs get extra zero-entries. See Table 2. Consequently, if a triple $(i, j, k)$ is scheduled on a machine where also an X-job $x$ is scheduled, then $i = x$. Previously we established this through the fact that the load in each coordinate is exactly $b$. However, here we do not have this property because of the extra machines.
There are at most $3q$ triples, where $q = |X| = |Y| = |Z|$. One direction is clear: if all $3q$ elements can be covered by disjoint triples then there is a schedule of height at most $b$ on $m = 3q$ machines. For the other direction, suppose we found a $(1 + \epsilon)$-approximate solution with $3\epsilon q$ extra machines. Using the same reasoning as before, we now have the following properties:
• The maximum load is $b$;
• On each machine there is at most one triple, one X-, one Y-, and one Z-job;
• On each machine, if there are three element-jobs, then all three are real jobs or all three are dummy jobs;
• If a triple $(i, j, k)$ and an X-job $x$, Y-job $y$ and Z-job $z$ are scheduled on the same machine, then $i = x$, $j = y$ and $k = z$.
Therefore, every machine on which a triple and three real elements are scheduled corresponds to a triple in the solution to the 3-DM-3 problem.
We will now show that there is a universal constant $\epsilon$ such that it is hard to distinguish between instances where everything fits on $m$ machines with maximum load 1 and instances where everything fits on $(1 + \epsilon)m$ machines with maximum load $1 + \epsilon$. Consider the $3\epsilon q$ machines without a triple. These $3\epsilon q$ machines contain at most $9\epsilon q$ element-jobs. Considering that there are $3q$ machines on which $9q - 9\epsilon q$ element-jobs must be scheduled together with triples, there are at most $9\epsilon q$ machines with a triple but with at most two elements. Hence, there are at most $9\epsilon q + 2(9\epsilon q)$ real elements that are scheduled on either a machine without a triple, or with a triple but with at most two elements. Therefore, at least $3q - 27\epsilon q$ real elements are scheduled together with triples, which corresponds to $q - 9\epsilon q$ disjoint triples that cover $3q - 27\epsilon q$ elements. If $27\epsilon < \epsilon_{\text{3-DM}}$, we found a solution where more than a $(1 - \epsilon_{\text{3-DM}})$-fraction of the elements are covered in the 3-DM-3 instance, which is NP-hard.
Following the proof of Theorem 2, this immediately implies the following.

Linear Time Approximation Algorithm
In this section we describe our linear time algorithm. Roughly, it works as follows. First, we preprocess the instance such that there are relatively few different types of large jobs at the cost of a small factor in the approximation guarantee. Next, we formulate and solve a mixed-integer linear program from which we obtain a multiset of configurations of large jobs, each of which can fit on one machine. We assign each configuration to a distinct machine, thereby assigning large jobs integrally to machines and small jobs fractionally. In the randomized algorithm, we assign the small jobs according to the probabilities obtained from the MILP and redistribute the small jobs on the overloaded machines over the other machines in such a way that no machine is overloaded. In the deterministic algorithm, we derandomize this step by assigning the small jobs integrally to machines in a greedy manner guided by a potential function that tracks the aggregate overload on the machines. Finally, we distribute this overload evenly over all machines, ensuring that the final load of every machine is at most $1 + \epsilon$.

Linear time algorithm
1. Preprocess the instance.
2. Solve the MILP, and assign big jobs according to this solution.
3. Assign small jobs to machines randomly according to the probabilities obtained from the MILP solution.
4. Remove small jobs from the overloaded machines and evenly distribute them over all machines.

Preprocessing
The preprocessing uses the same ideas used before in the design of approximation schemes. Typically, it is much easier to work with a few distinct jobs as we will see in the formulation of our mixed-integer linear program.

Preprocessing
1. Round each coordinate of every job down to the nearest power of $(1 + \epsilon)$ times $\epsilon^4/d^2$ (Lemma 5).
2. Set coordinates of jobs that are small in comparison to the biggest coordinate to zero (Lemma 6).
The first step is to round all coordinates of each job down to the nearest power of $(1 + \epsilon)$ times a small polynomial in $\epsilon$ and $1/d$.
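The rounding step can be sketched as follows. This is an illustration under assumptions rather than the exact rule of Lemma 5: the function name `round_down` is hypothetical, floating point numbers stand in for exact values, and coordinates below $\epsilon^4/d^2$ are mapped to zero here.

```python
import math

def round_down(value, eps, d):
    """Round value down to base * (1 + eps)^k for an integer k >= 0,
    where base = eps^4 / d^2. Values below base become 0 (an assumption)."""
    base = eps ** 4 / d ** 2
    if value < base:
        return 0.0
    k = math.floor(math.log(value / base, 1 + eps))
    # Correct k in case the floating point logarithm is off by one.
    while base * (1 + eps) ** (k + 1) <= value:
        k += 1
    while k > 0 and base * (1 + eps) ** k > value:
        k -= 1
    return base * (1 + eps) ** k

eps, d = 0.5, 2
rounded = [round_down(v, eps, d) for v in (0.001, 0.02, 0.3, 1.0)]
print(rounded)
```

Each rounded coordinate is at most the original one and loses less than a factor $1 + \epsilon$, which is why the preprocessing costs only a small factor in the approximation guarantee.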

Lemma 5 ([5])
Given a set $V$ of jobs and $\epsilon > 0$, let $W$ be a modified (multi)set of $V$ where we replace each job $v$ in $V$ with a job $w$ as follows: Then, for any subset of jobs $V' \subseteq V$ with corresponding subset $W' \subseteq W$, we have

Next, we ensure that the non-zero values of coordinates of a job are not too small compared to the largest coordinate of a job.

Lemma 6 ([5]) Given a set $V$ of jobs and $\eta > 0$, let $W$ be a modified (multi)set of $V$ where we replace each job $v$ in $V$ with a job $w$ as follows: Then, for any subset of jobs $V' \subseteq V$ with corresponding subset $W' \subseteq W$, we have $\sum_{v \in V'} v \le \sum_{w \in W'} w + \eta \|\sum_{w \in W'} w\|_\infty \mathbf{1}$.

The following lemma states that the error due to the preprocessing of any schedule is small, and follows from the previous lemmata, setting $\eta := \epsilon/d$.

Lemma 7
Let $\epsilon > 0$, let $V$ be the original set of jobs and $W$ be the (multi)set of jobs preprocessed by Lemmata 5 and 6. Then for any $w \in W$ and coordinate $j \in [d]$,

Moreover, for any subset of jobs
From now on, by job we mean the job preprocessed by Lemma 7.

The Mixed-Integer Linear Program
In this subsection we describe our mixed-integer linear program and how to solve it fast. We distinguish between small and big jobs and treat them differently. A job $p$ is small if $\|p\|_\infty < \epsilon^3/d$ and otherwise the job is big.
As all non-zero coordinates are at most a factor $d/\epsilon$ apart by Lemma 7, the smallest possible coordinate of any big job is $\epsilon^4/d^2$. Let $T_{\text{big}}$ be the set of all types of big jobs, $T_{\text{big}} := \{0, \epsilon^4/d^2, (1 + \epsilon)\epsilon^4/d^2, (1 + \epsilon)^2 \epsilon^4/d^2, \ldots, 1\}^d$. A big job $p$ has type $t \in T_{\text{big}}$ if and only if $p = t$. Every big job has a corresponding type, since the rounding procedure rounded these jobs to exactly these values.
Similarly, we define a set $T_{\text{small}}$ of all types of small jobs. We define the type of a small job based on its relative size in each coordinate, that is, a small job $p$ has type $t = (t_1, \ldots, t_d) \in T_{\text{small}}$ if and only if $p_j / \|p\|_\infty = t_j$ for all coordinates $j \in [d]$.
As the smallest non-zero coordinate in $p / \|p\|_\infty$ is at least $\epsilon/d$, we define $T_{\text{small}} := \{0, (1 + \epsilon)^{-\tau}, (1 + \epsilon)^{-\tau + 1}, \ldots, 1\}^d$, where $\tau$ is such that $(1 + \epsilon)^{-\tau}$ is the smallest power of $1 + \epsilon$ that is at least $\epsilon/d$. Note that each small job has exactly one type in $T_{\text{small}}$ and that there are at most $T := (4 \log_{1+\epsilon}(d/\epsilon) + 2)^d$ types of big and small jobs.
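The distinction between big and small jobs and the type of a small job can be made concrete as follows. This is a sketch only: the helper names `is_small` and `small_type` are hypothetical, jobs are tuples of exact fractions, and the jobs are assumed to be preprocessed already.

```python
from fractions import Fraction

def is_small(p, eps):
    """A job is small if its largest coordinate is below eps^3 / d."""
    d = len(p)
    return max(p) < eps ** 3 / d

def small_type(p):
    """Relative size of every coordinate with respect to the largest one."""
    largest = max(p)
    return tuple(x / largest for x in p)

eps = Fraction(1, 2)
job = (Fraction(1, 1000), Fraction(1, 2000))
small = is_small(job, eps)      # threshold is (1/8) / 2 = 1/16
job_type = small_type(job)      # relative sizes (1, 1/2)
amount = max(job)               # the largest coordinate, i.e. the norm of the job
print(small, job_type, amount)
```

Two small jobs with the same type differ only by a scalar, so the program only needs to track the total amount of each type rather than individual small jobs.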
The mixed-integer linear program has a variable for every configuration, which is a collection of big jobs together with available space for small jobs. We will call the (rounded) space for small jobs a profile, which is a vector from $F := \{0, \epsilon, (1+\epsilon)\epsilon, (1+\epsilon)^2\epsilon, \ldots, 1\}^d$. A configuration $C$ is a tuple $C = (B, f)$, where $B$ is a multiset of rounded processing times of big jobs and $f$ is a profile for small jobs such that the big jobs and the profile fit together on one machine, exceeding the maximum load by only a little, i.e.
$\sum_{p \in B} p_j + f_j \le 1 + \epsilon$ for all coordinates $j$. As each big job has a coordinate of at least $\epsilon^3/d$, there can be no more than $d^2/\epsilon^3$ big jobs on a machine. As there are at most $T$ types of big jobs, we know that there are at most

We now describe our mixed-integer linear program. Let $\mathcal{C}$ be the set of all configurations and let $x_C$ denote the number of machines that have jobs assigned to them according to configuration $C \in \mathcal{C}$. Let $n(C,t)$ denote the number of big jobs of type $t$ in configuration $C$, and let $n(t)$ denote the total number of big jobs of type $t$ in the instance. Denote the set of small jobs of type $t$ assigned to configurations having profile $f$ by $J(f,t)$, and let the variable $y_{f,t} = \sum_{p \in J(f,t)} \|p\|_\infty$ denote the sum of their largest coordinates, which we call their amount. Let $a(t) := \sum_{p :\, p \text{ is of small type } t} \|p\|_\infty$ denote the total amount of small jobs of type $t$ in the instance. Consider the following program.
$x \in \mathbb{Z}^{\mathcal{C}}, \quad y, x \ge 0$

The first and second constraints ensure that the big and the small jobs are covered integrally and fractionally, respectively. The third constraint ensures that small jobs fit in the machine profiles, as it requires that for each profile $f$, the cumulative amount of small jobs of all types $t$ that are assigned to $f$ is at most the total amount of $f$. These are valid constraints for any feasible solution.
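For reference, the three constraints described here can be written out as follows. This is a reconstruction from the surrounding text rather than the authors' display: the labels (C1) to (C3) follow the reference to (C2) in the proof of Lemma 8, the direction $\ge$ in (C1) is an assumption based on the word "covered", and $m(f)$ is assumed to denote the number of machines with profile $f$, that is, $m(f) := \sum_{C = (B,f) \in \mathcal{C}} x_C$. The objective is not stated in this passage and is therefore omitted.

```latex
\begin{align*}
\sum_{C \in \mathcal{C}} n(C,t)\, x_C &\ge n(t)
  && \forall t \in T_{\mathrm{big}} \tag{C1}\\
\sum_{f \in F} y_{f,t} &\ge a(t)
  && \forall t \in T_{\mathrm{small}} \tag{C2}\\
\sum_{t \in T_{\mathrm{small}}} y_{f,t}\, \frac{t_j}{\|t\|_\infty} &\le m(f)\, f_j
  && \forall f \in F,\ j \in [d] \tag{C3}\\
x \in \mathbb{Z}^{\mathcal{C}}, \qquad y, x &\ge 0
\end{align*}
```

With this labelling, (C1) contributes $|T_{\mathrm{big}}|$ constraints and (C3) contributes $d|F|$ constraints, which matches the count used in the proof of Lemma 8.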

Lemma 8 An optimal solution to MILP can be found in time
Proof First, we bound the number of choices for non-zero integer variables. To do that, suppose that there is a finite solution and suppose that the continuous variables $y_{f,t}$ are fixed: this allows us to disregard constraints (C2), which contain only continuous variables. Then introduce slack variables such that all constraints are equality constraints and the MILP matches the form of Theorem 6. For the application of this theorem we can disregard the non-negativity constraints [6]. Thus, we are left with at most $|T_{\mathrm{big}}| + d|F| \le (d+1)T$ constraints. The coefficients are the constants $n(C,t)$, $t_i$, $\|t\|_\infty$ and $f_i$, all of which require at most $d^2/\epsilon^3$ bits to describe. By Theorem 6 there is an optimal solution with at most $2((d+1)T+1)\left(\log((d+1)T+1) + d^2/\epsilon^3 + 2\right)$ non-zero integer variables. As the number of non-zero integer variables is at most
Therefore, we can bound the number of choices for non-zero variables by
Using that $T \log T \le T^2$ and plugging in the definition of $T$, we bound this by
As the first part is (

Randomized Algorithm
In this subsection we sketch steps 3 and 4 of the algorithm, the integral assignment of small jobs to machines using the solution to the MILP.
For step 3, recall that $y_{f,t}$ is the amount of small jobs of type $t$ that are assigned to profile $f$. For each small job type $t$, let $\beta(f,t)$ denote the fraction of type $t$ assigned to profile $f$: $\beta(f,t) := y_{f,t} / \sum_{g \in F} y_{g,t}$.
For each small job $p$ of type $t$, pick a profile $f$ randomly with probability $\beta(f,t)$ and then pick a machine uniformly at random among the ones with profile $f$. Assign job $p$ to this machine.
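Step 3 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function name and the dictionary-based data layout are hypothetical, and every profile with a positive amount is assumed to have at least one machine.

```python
import random

def assign_small_jobs(jobs, y, machines_by_profile, rng=random):
    """Randomly assign small jobs to machines.

    jobs: list of (job_id, small_type) pairs.
    y: dict mapping (profile, small_type) -> amount y_{f,t} from the MILP solution.
    machines_by_profile: dict mapping profile -> list of machine ids.
    Returns a dict job_id -> machine id.
    """
    assignment = {}
    for job_id, t in jobs:
        # Profiles that receive a positive amount of type t.
        profiles = [f for (f, s) in y if s == t and y[(f, s)] > 0]
        weights = [y[(f, t)] for f in profiles]
        # random.choices normalises the weights, so profile f is drawn with
        # probability beta(f, t) = y[f, t] / sum_g y[g, t].
        f = rng.choices(profiles, weights=weights, k=1)[0]
        # Uniformly random machine among those with the chosen profile.
        assignment[job_id] = rng.choice(machines_by_profile[f])
    return assignment
```

Each job is handled in time proportional to the number of profiles that carry its type; precomputing the profile lists per type would avoid rescanning `y` for every job.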
For step 4, we call a machine with profile $f$ overloaded if the load of small jobs exceeds $f + \epsilon \cdot \mathbf{1}$ in some coordinate. We take all the small jobs on overloaded machines and distribute them among all machines using a simple linear-time sequential assignment. We will prove that the probability that the load on a machine in a coordinate exceeds the profile by more than $\epsilon$ is exponentially small. This implies that the expected overload on each machine is small; hence, the total overload over all the machines is small.
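A minimal Python sketch of step 4 follows. It uses the greedy grouping from the proof of Theorem 9 (close a group once its $\ell_1$-norm exceeds a threshold) and sends the groups to non-overloaded machines. The function name, the data layout, and the cyclic choice of target machine are assumptions made for illustration.

```python
def regroup_overloaded(small_jobs_on, profile, eps, threshold):
    """Move small jobs off overloaded machines in greedy groups.

    small_jobs_on: dict machine -> list of small jobs (tuples of d loads).
    profile: dict machine -> profile vector f of that machine.
    A machine is overloaded if its small-job load exceeds f_j + eps for some j.
    Returns a new dict machine -> list of small jobs.
    """
    def load(jobs, j):
        return sum(p[j] for p in jobs)

    overloaded = [
        k for k, jobs in small_jobs_on.items()
        if any(load(jobs, j) > profile[k][j] + eps
               for j in range(len(profile[k])))
    ]
    targets = [k for k in small_jobs_on if k not in overloaded]
    if overloaded and not targets:
        raise ValueError("no non-overloaded machine can receive the jobs")

    # Pull every small job off the overloaded machines, in arbitrary order.
    removed = [p for k in overloaded for p in small_jobs_on[k]]
    result = {k: ([] if k in overloaded else list(jobs))
              for k, jobs in small_jobs_on.items()}

    # Close a group as soon as its l1-norm exceeds the threshold.
    groups, current, size = [], [], 0.0
    for p in removed:
        current.append(p)
        size += sum(p)
        if size > threshold:
            groups.append(current)
            current, size = [], 0.0
    if current:
        groups.append(current)

    # Assumption: groups are handed to non-overloaded machines cyclically.
    for i, group in enumerate(groups):
        result[targets[i % len(targets)]].extend(group)
    return result
```

The whole procedure makes one pass over the machines and one pass over the removed jobs, in line with the linear-time claim.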
For the following proofs we fix a machine. For each small job $p$ and coordinate $j$, define the random variable $X_p^j$, with mean $\mu_p^j$, as the contribution of job $p$ to the $j$-th coordinate of the machine: $X_p^j = p_j$ if job $p$ is assigned to the machine, and $X_p^j = 0$ otherwise.
Both inequalities follow from the MILP constraints: the first follows as $\sum_{f \in F} y_{f,t} \ge a(t)$ and the second follows as $\sum_{t \in T_{\mathrm{small}}} y_{f,t}\, t_j/\|t\|_\infty \le m(f) f_j$.
We now apply Bernstein's inequality to our setting. We have that
Thus,
This implies that
Applying this to our setting, we get
As $f_i \le 1$ and by the proof of Lemma 9, this is at most
where $\delta := \max_{p :\, p \text{ small job}} \|p\|_\infty$ is the maximum coordinate of any small job. The last term is $(4\delta/3)\exp(-9/4\delta)$. The second term can be upper bounded by
We plug this in Eq. (1), bounding $\delta$ by $\epsilon^2/(4\ln(d/\epsilon))$, which is larger than $\epsilon^3/d$ if $d/\epsilon \ge 9$. As $d \ge 2$ and $\epsilon < 1/5$, this is a valid upper bound for all small jobs. This yields that the total expected load on overloaded machines is at most $1 + \epsilon + 2\delta$
This is at most $2\epsilon/d$.
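The claim that the bound on $\delta$ is larger than the small-job threshold when $d/\epsilon \ge 9$ can be checked directly. The following short derivation is added for illustration, with $x := d/\epsilon$ as auxiliary notation.

```latex
\[
\frac{\epsilon^2}{4\ln(d/\epsilon)} \ge \frac{\epsilon^3}{d}
\iff \frac{d}{\epsilon} \ge 4\ln\frac{d}{\epsilon}
\iff x \ge 4\ln x, \qquad x := d/\epsilon .
\]
\[
x = 9:\quad 4\ln 9 \approx 8.79 \le 9,
\qquad
\frac{\mathrm{d}}{\mathrm{d}x}\,(x - 4\ln x) = 1 - \frac{4}{x} > 0 \ \text{ for } x > 4 .
\]
```

So $x - 4\ln x$ is non-negative at $x = 9$ and increasing afterwards, and $d \ge 2$ together with $\epsilon < 1/5$ gives $x > 10$.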
We can now prove the following theorem.

Theorem 9
There is an algorithm that runs in $O\left(2^{(1/\epsilon)^{O(d \log\log d)}} + nd\right)$ time and finds a schedule such that the load on each machine is at most $1 + \epsilon$ with high probability.
Proof Let $\epsilon' := \epsilon/9$. First we prove the approximation ratio. For an overloaded machine $k$, let $L_k$ be the sum of the $\ell_1$-norms of all small jobs assigned to $k$, and let $L$ be the sum of the $\ell_1$-norms of all small jobs on all overloaded machines. By Lemma 10 we know that $E[L_k] \le 2\epsilon'$ for all machines $k$ and thus, by linearity of expectation, $E[L] \le 2m\epsilon'$. Therefore $\Pr[L > 4m\epsilon'] \le 1/2$ by Markov's inequality. (One can prove that the $L_k$ variables are negatively associated, and therefore by standard Chernoff-Hoeffding bounds the probability of having at least total overload $t$ is exponentially small for any $t > 0$.) Remove all small jobs from the overloaded machines and order them arbitrarily. Greedily group them together until the $\ell_1$-norm exceeds $4\epsilon'$ and then start a new group. Every group has size at most $4\epsilon' + \delta$. Now assign every group to a non-overloaded machine. The small jobs on the overloaded machines have now been redistributed such that the extra load on every machine is in expectation at most the average plus the largest small job size, i.e. $4\epsilon' + \delta \le 4\epsilon' + (\epsilon')^3/d \le 5\epsilon'$. All other machines exceeded their profile in each coordinate by at most $\epsilon'$. Additionally, from the mixed-integer linear program we lost another $\epsilon'$, since we only required that the big jobs and the profile add up to at most $1 + \epsilon'$. This gives a total of $7\epsilon'$ on the preprocessed instance, and factoring in the preprocessing we get $(1 + \epsilon')(7\epsilon') + \epsilon' \le 9\epsilon' = \epsilon$.
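The error accounting in this proof can be spelled out term by term. The display below is an illustrative summary, writing $\epsilon'$ for $\epsilon/9$; the labels under the braces are descriptive names, not notation from the paper.

```latex
\begin{align*}
\underbrace{4\epsilon' + \delta}_{\text{redistributed groups}}
 + \underbrace{\epsilon'}_{\text{profile overshoot}}
 + \underbrace{\epsilon'}_{\text{MILP slack}}
 &\le 5\epsilon' + \epsilon' + \epsilon' = 7\epsilon', \\
(1+\epsilon') \cdot 7\epsilon' + \epsilon'
 &= 8\epsilon' + 7(\epsilon')^2 \le 9\epsilon' = \epsilon
 \quad \text{whenever } \epsilon' \le 1/7 .
\end{align*}
```

The condition $\epsilon' \le 1/7$ holds comfortably, since $\epsilon < 1/5$ gives $\epsilon' < 1/45$.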
The preprocessing and randomized rounding steps can be implemented in $O(nd)$ time. To bound the time of solving the mixed-integer linear program, we use the fact that $ab \le a^2 + b^2$. Choosing a = 2 (
By simply repeating the rounding and grouping step until a solution is found, we get an $O(nd)$ time algorithm for assigning small jobs that returns a $(1+\epsilon)$-approximation with high probability.

Deterministic Algorithm
Recall that the MILP only gives an assignment of small job types to profiles, while we need an assignment of individual jobs to machines for a deterministic algorithm. This can be done in three steps using standard techniques. First, small job types are assigned integrally to profiles. Then, using a pessimistic estimator, small jobs are integrally assigned to machines having a fixed profile. Finally, a direct calculation shows that the total load on overloaded machines is at most $O(\epsilon m/d)$, so the small jobs from these machines can be redistributed over all machines in a round-robin fashion without increasing the loads too much.
From this and Theorem 9, we have our main theorem.