An Improved Approximation Algorithm for Knapsack Median Using Sparsification
Abstract
Knapsack median is a generalization of the classic k-median problem in which we replace the cardinality constraint with a knapsack constraint. It is currently known to be 32-approximable. We improve on the best known algorithms in several ways, including adding randomization and applying sparsification as a preprocessing step. The latter improvement produces the first LP for this problem with bounded integrality gap. The new algorithm obtains an approximation factor of 17.46. We also give a 3.05-approximation with small budget violation.
Keywords
Approximation algorithm · Combinatorial optimization · Randomized algorithm · Facility-location problems
1 Introduction
k-median is a classic problem in combinatorial optimization. Herein, we are given a set of clients \(\mathcal {C}\), a set of facilities \(\mathcal {F}\), and a symmetric distance metric c on \(\mathcal {C}\cup \mathcal {F}\). The goal is to open k facilities such that we minimize the total connection cost (distance to the nearest open facility) of all clients. A natural generalization of k-median is knapsack median (KM), in which we assign a nonnegative weight \(w_i\) to each facility \(i\in \mathcal {F}\), and instead of opening k facilities, we require that the sum of the open facility weights be within some budget B.
While KM is not known to be harder than k-median, it has thus far proved more difficult to approximate. k-median was first approximated within constant factor \(6\frac{2}{3}\) in 1999 [2], with a series of improvements leading to the current best-known factor of 2.674 [1]^{1}. KM was first studied in 2011 by Krishnaswamy et al. [6], who gave a bicriteria \(16+\epsilon \) approximation which slightly violated the budget. Then Kumar gave the first true constant-factor approximation for KM with factor 2700 [7], subsequently reduced to 34 by Charikar and Li [3] and then to 32 by Swamy [10].
This paper’s algorithm has a flow similar to Swamy’s: we first get a half-integral solution (except for a few ‘bad’ facilities), and then create pairs of half-facilities, opening one facility in each pair. By making several improvements, we reduce the approximation ratio to 17.46. The first improvement is a simple modification to the pairing process which guarantees that, for every half-facility, either it or its closest neighbor is open (versus having to go through two ‘jumps’ to get to an open facility). The second improvement is to randomly sample the half-integral solution and condition on the probability that any given facility is ‘bad’. The algorithm can be derandomized with linear loss in the runtime.
The third improvement deals with the bad facilities which inevitably arise due to the knapsack constraint. All previous algorithms used Kumar’s bound from [7] to bound the cost of nearby clients when bad facilities must be closed. However, we show that by using a sparsification technique similar in spirit to, but distinct from, that used in [8], we can focus on a subinstance in which the connection costs of clients are guaranteed to be evenly distributed throughout the instance. This allows for a much stronger bound than Kumar’s, and also results in an LP with bounded integrality gap, unlike previous algorithms.
Another alternative is to just open the few bad facilities and violate the budget by some small amount, as Krishnaswamy et al. did when first introducing KM. By preprocessing, we can ensure this violates the budget by at most \(\epsilon B\). We show that the bipoint-solution-based method from [8] can be adapted for KM using this budget-violating technique to get a 3.05-approximation.
1.1 Preliminaries
Let \(n = |\mathcal {F}| + |\mathcal {C}|\) be the size of the instance. For ease of analysis, we assume that each client has unit demand. (Indeed, our algorithm easily extends to the general case.) For a client j, the connection cost of j, denoted by \(\,\text {cost}\,(j)\), is the distance from j to the nearest open facility in our solution. The goal is to open a subset \(\mathcal {S}\subseteq \mathcal {F}\) of facilities such that the total connection cost is minimized, subject to the knapsack constraint \(\sum _{i \in \mathcal {S}} w_i \le B\).
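To make the objective concrete, the following is a minimal brute-force sketch of the knapsack-median objective on toy instances; the function name and the dictionary-based metric `c[(i, j)]` are our own illustration (exponential in \(|\mathcal {F}|\), so this only clarifies the problem, not the algorithms of this paper):

```python
from itertools import combinations

def knapsack_median_brute_force(F, C, c, w, B):
    """Exhaustive search over facility subsets respecting the budget B.
    Returns (best total connection cost, best open set)."""
    best_cost, best_set = float("inf"), None
    for r in range(1, len(F) + 1):
        for S in combinations(F, r):
            if sum(w[i] for i in S) > B:
                continue  # violates the knapsack constraint w(S) <= B
            # cost(j) = distance from client j to its nearest open facility
            cost = sum(min(c[(i, j)] for i in S) for j in C)
            if cost < best_cost:
                best_cost, best_set = cost, set(S)
    return best_cost, best_set
```

On a line metric with two facilities and two clients, tightening the budget forces a single facility open and the total cost jumps accordingly.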
In this paper, given a KM instance \(\mathcal {I}=(B,\mathcal {F},\mathcal {C},c,w)\), let \(\mathrm {OPT}_\mathcal {I}\) and \(\mathrm {OPT}_f\) be the cost of an optimal integral solution and the optimal value of the LP relaxation, respectively. Suppose \(\mathcal {S}\subseteq \mathcal {F}\) is a solution to \(\mathcal {I}\); let \(\,\text {cost}\,_\mathcal {I}(\mathcal {S})\) denote the cost of \(\mathcal {S}\). Let (x, y) denote the optimal (fractional) solution of the LP relaxation. Let \(C_j := \sum _{i \in \mathcal {F}} c_{ij} x_{ij}\) be the fractional connection cost of j. Given \(\mathcal {S}\subseteq \mathcal {F}\) and a vector \(v \in \mathbb {R}^{\mathcal {F}}\), let \(v(\mathcal {S}) := \sum _{i \in \mathcal {S}}v_i\). From now on, let us fix any optimal integral solution of the instance for the analysis.
2 An Improved Approximation Algorithm for Knapsack Median
2.1 Kumar’s Bound
The main technical difficulty of KM is related to the unbounded integrality gap of the LP relaxation. It is known that this gap remains unbounded even when we strengthen the LP with knapsack cover inequalities [6]. All previous constant-factor approximation algorithms for KM rely on Kumar’s bound from [7] to get around this gap. Specifically, Kumar’s bound is useful to bound the connection cost of a group of clients via some cluster center in terms of \(\mathrm {OPT}_{\mathcal {I}}\) instead of \(\mathrm {OPT}_f\). We now review this bound, and will improve it later.
Lemma 1
Proof
We can slightly strengthen the LP relaxation by adding the constraints: \(x_{ij} = 0\) for all \(c_{ij} > U_j\). (Unfortunately, the integrality gap is still unbounded after this step.) Thus we may assume that (x, y) satisfies all these constraints.
Lemma 2
Proof
This bound allows one to bound the cost of clients which rely on the bad facility.
2.2 Sparse Instances
Kumar’s bound can only be tight when the connection cost in the optimal solution is highly concentrated around a single client. However, if this were the case, we could guess the client for which this occurs, along with its optimal facility, which would give us a large advantage. On the other hand, if the connection cost is evenly distributed, we can greatly strengthen Kumar’s bound. This is the idea behind our definition of sparse instances below.
Let \(\mathrm {CBall}(j, r) := \{k \in \mathcal {C}: c_{jk} \le r\}\) denote the set of clients within radius r of client j. Let \(\lambda _j\) be the connection cost of j in the optimal integral solution. Also, let i(j) denote the facility serving j in the optimal solution.
Definition 1
We will show that the integrality gap is bounded on these sparse instances. We also give a polynomialtime algorithm to sparsify any knapsack median instance. Moreover, the solution of a sparse instance can be used as a solution of the original instance with only a small loss in the total cost.
Lemma 3
Given some knapsack median instance \(\mathcal {I}_0 = (B, \mathcal {F}, \mathcal {C}_0, c, w)\) and \(0< \delta , \epsilon < 1\), there is an efficient algorithm that outputs \(O(n^{2/\epsilon })\) pairs \((\mathcal {I},\mathcal {F}')\), where \(\mathcal {I}= (B, \mathcal {F}, \mathcal {C}, c, w)\) is a new instance with \(\mathcal {C}\subseteq \mathcal {C}_0\), and \(\mathcal {F}' \subseteq \mathcal {F}\) is a partial solution of \(\mathcal {I}\), such that at least one of these instances is \((\delta , \epsilon )\)-sparse.
Proof

Initially, \(\mathcal {C}:= \mathcal {C}_0\).

While the instance \((B, \mathcal {F}, \mathcal {C}, c, w)\) is not sparse, i.e., there exists a “bad” client j such that \(\sum _{k \in \mathrm {CBall}(j, \delta \lambda _j) } (\lambda _j - c_{jk} ) > \epsilon \mathrm {OPT}_\mathcal {I},\) remove all clients in \(\mathrm {CBall}(j, \delta \lambda _j)\) from \(\mathcal {C}\).
Now, while we do not know which client j is “bad” and which facility i serves client j in the optimal solution, we can still guess these pairs in \(O(n^2)\) time per iteration. Specifically, we will guess the number of iterations after which the above algorithm terminates, and the pair (j, i(j)) used in each iteration. Since \(\lambda _k \ge \lambda _j - c_{jk}\) for every \(k \in \mathrm {CBall}(j, \delta \lambda _j)\) by the triangle inequality, each iteration removes clients of total optimal cost more than \(\epsilon \mathrm {OPT}_\mathcal {I}\), so there are at most \(1/\epsilon \) iterations. Hence there are at most \(O(n^{2/\epsilon })\) possible cases, and we generate all of these new instances. Finally, we include all the facilities i(j) encountered during the process in the set \(\mathcal {F}'\) of the corresponding instance. \(\square \)
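The removal loop from the proof above can be sketched as follows. Here `lam` and `opt_est` stand in for \(\lambda _j\) and \(\mathrm {OPT}_\mathcal {I}\); in the actual algorithm these values are guessed by enumeration, not known, so this sketch (with names of our own choosing) only shows the client-removal mechanics:

```python
def sparsify_once(clients, c, lam, opt_est, delta, eps):
    """One run of the removal loop from the proof of Lemma 3.
    lam[j] plays the role of lambda_j (j's cost in a reference solution),
    opt_est the role of OPT_I; c[(j, k)] is the client-client distance."""
    C = set(clients)
    changed = True
    while changed:
        changed = False
        for j in sorted(C):
            ball = {k for k in C if c[(j, k)] <= delta * lam[j]}
            if sum(lam[j] - c[(j, k)] for k in ball) > eps * opt_est:
                C -= ball  # j was "bad": remove all of CBall(j, delta*lam_j)
                changed = True
                break
    return C
```

Each removal deletes a whole ball, so a tightly packed group of clients around a bad client disappears in one step while far-away clients survive.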
The following theorem says that if we have an approximate solution to a sparse instance, then its cost on the original instance can be blown up by a small constant factor.
Theorem 1
Note that our notion of sparsity differs from that of Li and Svensson in several ways. It is client-centric, and removes clients instead of facilities from the instance. On the negative side, removed clients’ costs blow up by \(\frac{1+\delta }{1-\delta }\), so our final approximation cannot guarantee better.
Proof
From now on, assume that we are given some arbitrary knapsack median instance \(\mathcal {I}_0 = (B, \mathcal {F}, \mathcal {C}_0, c, w)\). We will transform \(\mathcal {I}_0\) into a \((\delta , \epsilon )\)sparse instance \(\mathcal {I}\) and use Theorem 1 to bound the real cost at the end.
2.3 Improving Kumar’s Bound and Modifying the LP Relaxation
Lemma 4
Proof
 For clients \(j \in \mathcal {S}' = \mathcal {S}\cap \mathrm {CBall}(s, \delta U_s) \), by definition of sparsity, we have$$\begin{aligned} |\mathcal {S}'| U_s&= \sum _{j \in \mathcal {S}'} (U_s - c_{js}) + \sum _{j \in \mathcal {S}'} c_{js} \\&\le \epsilon \mathrm {OPT}_\mathcal {I}+ \beta \sum _{j \in \mathcal {S}' }C_j. \end{aligned}$$
 For clients \(j \in \mathcal {S}'' = \mathcal {S}\setminus \mathrm {CBall}(s, \delta U_s) \), we have \(\beta C_j \ge c_{js} \ge \delta U_s\) and we get an alternative bound \(U_s \le \frac{\beta }{\delta } C_j \). Thus,$$\begin{aligned} |\mathcal {S}''|U_s = \sum _{j \in \mathcal {S}''} U_s \le \sum _{j \in \mathcal {S}''} \frac{\beta }{\delta } C_j. \end{aligned}$$
2.4 Filtering Phase
We will apply the standard filtering method for facilitylocation problems (see [3, 10]). Basically, we choose a subset \(\mathcal {C}' \subseteq \mathcal {C}\) such that clients in \(\mathcal {C}'\) are far from each other. After assigning each facility to the closest client in \(\mathcal {C}'\), it is possible to lowerbound the opening volume of each cluster. Each client in \(\mathcal {C}'\) is called a cluster center.
Filtering algorithm. Initialize \(\mathcal {C}' := \mathcal {C}\). For each client \(j \in \mathcal {C}'\) in increasing order of \(C_j\), we remove all other clients \(j'\) such that \(c_{jj'} \le 4C_{j'} = 4\max \{C_j, C_{j'} \}\) from \(\mathcal {C}'\).
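This filtering pass can be sketched as follows, under naming of our own choosing (`C_frac[j]` for \(C_j\), a dictionary metric `c`); since clients are scanned in nondecreasing order of \(C_j\), every client \(j'\) a center removes satisfies \(C_{j'} = \max \{C_j, C_{j'}\}\):

```python
def filter_centers(clients, c, C_frac):
    """Standard filtering: scan clients by nondecreasing fractional cost
    C_j; each surviving center j removes every later client j' with
    c(j, j') <= 4 * C_{j'}.  M[j] collects j plus the clients it removed."""
    order = sorted(clients, key=lambda j: C_frac[j])
    removed = set()
    M = {}
    for idx, j in enumerate(order):
        if j in removed:
            continue  # j was already removed by an earlier center
        M[j] = [j]
        for k in order[idx + 1:]:  # every k here has C_k >= C_j
            if k not in removed and c[(j, k)] <= 4 * C_frac[k]:
                removed.add(k)
                M[j].append(k)
    return list(M), M  # surviving cluster centers, and their clusters M_j
```

The surviving centers are pairwise far apart, which is exactly what the volume bound of Lemma 5 exploits.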
For each \(j \in \mathcal {C}'\), define \(F_j = \{i \in \mathcal {F}:~ c_{ij} = \min _{k \in \mathcal {C}'} c_{ik} \}\), breaking ties arbitrarily. Let \(F'_j = \{i \in F_j: c_{ij} \le 2C_j \}\) and \(\gamma _j = \min _{i \notin F_j} c_{ij}\). Then define \(G_j = \{i \in F_j: c_{ij} \le \gamma _j \}\). We also reassign \(y_i := x_{ij}\) for \(i \in G_j\) and \(y_i := 0\) otherwise. For \(j \in \mathcal {C}'\), let \(M_j\) be the set containing j and all clients removed by j in the filtering process.
We note that the solution (x, y) may no longer be feasible for the LP after the reassignment step. For the rest of the paper, we will focus on rounding y into an integral vector. One important property is that the knapsack constraint still holds; in other words, the new sum \(\sum _{i \in \mathcal {F}} w_i y_i\) is still at most the budget B. This is due to the fact that \(x_{ij} \le y_i\): the opening variables only decrease in this step, and hence the knapsack constraint is preserved.
Lemma 5

All sets \(G_j\) are disjoint.

\(1/2 \le y(F_j') \) and \( y(G_j) \le 1\) for all \(j \in \mathcal {C}'\).

\(F_j' \subseteq G_j\) for all \(j \in \mathcal {C}'\).
Proof
For the first claim, observe that all \(F_j\)’s are disjoint and \(G_j \subseteq F_j\) by definition. Also, if \(\sum _{i \in F_j'}y_i = \sum _{i \in F_j'}x_{ij} < 1/2\), then \(\sum _{i \in \mathcal {F}\setminus F_j'}x_{ij} > 1/2. \) Since the radius of \(F_j'\) is \(2C_j\), this means that \(C_j > (1/2)(2C_j) = C_j\), which is a contradiction. Since we reassign \(y_i := x_{ij}\) for all \(i \in G_j\), the volume \(y(G_j)\) is now at most 1. Finally, we have \(2C_j \le \gamma _j\), which implies \(F'_j \subseteq G_j\). Otherwise, let \(i \notin F_j\) be the facility such that \(\gamma _j = c_{ij} < 2C_j\). Observe that facility i is claimed by another cluster center, say \(j'\), because \(c_{ij'} \le c_{ij} \le 2C_j.\) This implies that \(c_{jj'} \le c_{ij} + c_{ij'} \le 4C_j\), which is a contradiction. \(\square \)
2.5 A Basic \((23.09+\epsilon )\)-Approximation Algorithm
In this section, we describe a simple randomized \((23.09+\epsilon )\)-approximation algorithm. In the next section, we will derandomize it and give further insights to improve the approximation ratio to \(17.46+\epsilon \).
High-level ideas We reuse Swamy’s idea from [10] to first obtain an almost half-integral solution \(\hat{y}\). This solution \(\hat{y}\) has a very nice structure: for example, each client j only (fractionally) connects to at most two facilities, and there is at least one half-open facility in each \(G_j\). We shall refer to this set of two facilities as a bundle. In [10], the author applies a standard clustering process to get disjoint bundles and rounds \(\hat{y}\) by opening at least one facility per bundle. The drawback of this method is that we have to pay extra cost for bundles removed in the clustering step. In fact, it is possible to open at least one facility per bundle without filtering out any bundle. The idea here is inspired by the work of Charikar et al. [2]. In addition, instead of picking \(\hat{y}\) deterministically, sampling such a half-integral extreme point will be very helpful for the analysis.
Lemma 6
([10]) Any extreme point of \(\mathcal {P}\) is almost halfintegral: there exists at most 1 cluster center \(s \in \mathcal {C}'\) such that \(G_s\) contains variables \(\notin \{0,\frac{1}{2},1\}\). We call s a fractional client.

If \(j \ne s\), let \(i_1(j)\) be any halfintegral facility in \(F'_j\) (i.e. \(Y_{i_1(j)} = 1/2\); such a facility exists because \(Y(F_j') \ge 1/2\)). Else (\(j = s\)), let \(i_1(j)\) be the smallestweight facility in \(F'_j\) with \(Y_{i_1(j)} > 0\).

If \(Y_{i_1(j)} = 1\), let \(i_2(j) = i_1(j)\).

If \(Y(G_j) < 1\), then let \(\sigma (j)\) be the nearest client to j in \(\mathcal {C}'\). Define \(i_2(j) = i_1(\sigma (j))\).
 If \(Y(G_j) = 1\), then

If \(j \ne s\), let \(i_2(j)\) be the other halfintegral facility in \(G_j\).

Else (\(j = s\)), let \(i_2(j)\) be the smallestweight facility in \(G_j\) with \(Y_{i_2(j)} > 0\). If there are ties and \(i_1(j)\) is among these facilities then we let \(i_2(j) = i_1(j)\).


We call \(i_1(j), i_2(j)\) the primary facility and the secondary facility of j, respectively.
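The selection rules above, restricted to the purely half-integral case (no fractional client s, so every \(Y_i \in \{0, \frac{1}{2}, 1\}\)), can be sketched as follows; the container names are ours, not the paper's:

```python
def assign_primary_secondary(centers, G, Fp, Y, sigma):
    """Primary/secondary facility choice per cluster center, in the purely
    half-integral case.  G[j] and Fp[j] are the sets G_j and F'_j, Y the
    half-integral opening values, sigma[j] the nearest other center."""
    i1, i2 = {}, {}
    for j in centers:
        # primary: a facility of F'_j with Y_i >= 1/2 (one exists,
        # since Y(F'_j) >= 1/2)
        i1[j] = next(i for i in sorted(Fp[j]) if Y[i] >= 0.5)
    for j in centers:
        if Y[i1[j]] == 1:
            i2[j] = i1[j]                    # i1 fully open: pair with itself
        elif sum(Y[i] for i in G[j]) < 1:
            i2[j] = i1[sigma[j]]             # borrow the neighbor's primary
        else:
            # Y(G_j) = 1: take the other half-open facility of G_j
            i2[j] = next(i for i in sorted(G[j]) if Y[i] == 0.5 and i != i1[j])
    return i1, i2
```

A center with \(Y(G_j) = 1\) pairs its two half-facilities internally; a center with \(Y(G_j) = 1/2\) reaches out to its nearest neighbor's primary, exactly the "one jump" guarantee described above.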
Lemma 7
Without loss of generality, we can assume that all cycles of the directed graph with edges \(\{(j, \sigma (j)) : j \in \mathcal {C}'\}\) (if any) are of size 2. This means that this graph is bipartite.
Proof
Since the maximum out-degree is equal to 1, each (weakly) connected component of the graph formed by the edges \((j, \sigma (j))\) has at most one cycle. Consider any cycle \(j \rightarrow \sigma (j) \rightarrow \sigma ^2(j) \rightarrow \ldots \rightarrow \sigma ^k(j) \rightarrow j\). Then it is easy to see that \(c_{j\sigma (j)} = c_{\sigma ^k(j)j} \). The argument holds for any j in the cycle, so all edges on the cycle have the same length. Then we can simply redefine \(\sigma (\sigma ^k(j)) := \sigma ^{k-1}(j)\) and get a cycle of size 2 instead. We can also change the secondary facility of the client corresponding to the edge \((\sigma ^k(j), j)\) into \(\sigma ^{k-1}(j)\), because both are at the same distance from it. \(\square \)
Theorem 2
Proof
Assume \(\mathcal {I}\) is the sparse instance obtained from \(\mathcal {I}_0\). We will give a proof of feasibility and a cost analysis. Recall that s is the center where we may have fractional values \(Y_i\) with \(i \in G_s\).
 For all centers \(j \in \mathcal {C}'\) with \(Y(G_j) < 1\), we have$$\begin{aligned} w_{i_1(j)} \le 2 \sum _{i \in G_j}Y_i w_i. \end{aligned}$$Note that this is true for \(j \ne s\) because \(Y_{i_1(j)} = 1/2\). Otherwise, \(j = s\); by definition, \(w_{i_1(j)}\) is the smallest weight in the set \(F'_s\), which has volume at least 1/2. Thus, \(w_{i_1(j)} \le 2 \sum _{i \in F'_s} Y_i w_i \le 2 \sum _{i \in G_j}Y_i w_i. \)
 For all centers \(j \in \mathcal {C}'\) with \(Y(G_j) = 1\), we have$$\begin{aligned} w_{i_1(j)} + w_{i_2(j)} \le 2 \sum _{i \in G_j}Y_i w_i. \end{aligned}$$Equality holds when \(j \ne s\). Otherwise, \(j = s\), and we consider the following two cases:

If \(i_1(s) = i_2(s)\), the inequality follows because \(w_{i_1(s)} = w_{i_2(s)} \le \sum _{i \in G_s}Y_i w_i\), since \(i_2(s)\) has the smallest weight among the opened facilities of \(G_s\) and \(Y(G_s) = 1\).
 Else, we have \(i_2(s) \in G_s \setminus F'_s\) by definition of \(i_2(s)\). Since \(w_{i_1(s)} \ge w_{i_2(s)}\) and \(Y(F'_s) \ge 1/2\),$$\begin{aligned} \frac{1}{2}w_{i_1(s)} + \frac{1}{2}w_{i_2(s)}&\le Y(F'_s) w_{i_1(s)} + (1-Y(F'_s))w_{i_2(s)} \\&\le \sum _{i \in F'_s} Y_i w_i + \sum _{i \in G_s \setminus F'_s}Y_i w_i = \sum _{i \in G_s} Y_i w_i. \end{aligned}$$

Cost analysis
 If \(j \ne s\), then either \(Y(G_j) = 1\) or \(Y(G_j) = 1/2\).
 Case \(Y(G_j) = 1\): then \(Y_{i_1(j)} = Y_{i_2(j)} = 1/2\), we have$$\begin{aligned} \,\text {cost}\,(j) \le d_j c_{i_2(j) j} \le 2 d_j \sum _{i \in G_j} Y_i c_{ij} = 2B_j(Y). \end{aligned}$$
 Case \(Y(G_j) = 1/2\): we have$$\begin{aligned} \,\text {cost}\,(j) \le 3 d_j \gamma _j = 6 d_j \gamma _j (1 - Y(G_j)) \le 6B_j(Y). \end{aligned}$$

 If \(j = s\), we cannot bound the cost in terms of \(B_j(Y)\). Instead, we shall use Kumar’s bound.
 Case \(Y(G_j)=1\): \(i_2(j) \in G_j\). Recall that \(U_j\) is the upper bound on the connection cost of j. Our LP constraints guarantee that \(x_{ij} = 0\) for all \(c_{ij} > U_j\). Since \(Y_{i_2(j)}>0\), we also have \(y_{i_2(j)}>0\) or \(x_{i_2(j)j} > 0\), which implies that \(c_{i_2(j)j} \le U_j\). Thus,$$\begin{aligned} \,\text {cost}\,(j) \le d_j c_{i_2(j)j} \le d_j U_j. \end{aligned}$$
 Case \(Y(G_j)<1\): then there must exist some facility \(i \notin G_j\) such that \(x_{ij} > 0\). Since \(\gamma _j\) is the radius of \(G_j\), we have \(\gamma _j \le c_{ij} \le U_j\); and hence,$$\begin{aligned} \,\text {cost}\,(j) \le 3 d_j \gamma _j \le 3 d_j U_j. \end{aligned}$$
In either case, applying the improved Kumar’s bound to the cluster \(M_s\), where \(c_{ks} \le 4C_k\) for all \(k \in M_s\), we get$$\begin{aligned} \,\text {cost}\,(j)&\le 3 d_j U_j \\&\le 3\epsilon \mathrm {OPT}_\mathcal {I}+ \frac{3 \cdot 4}{\delta } \sum _{k \in M_s} C_k \\&\le 3\epsilon \mathrm {OPT}_\mathcal {I}+ \frac{12}{\delta } \mathrm {OPT}_f. \end{aligned}$$
2.6 A \((17.46+\epsilon )\)-Approximation Algorithm via Conditioning on the Fractional Cluster Center
Lemma 8
Proof
Finally, conditioning on the event \(\mathcal {E}\), we are able to combine certain terms and get the following improved bound.
Lemma 9
Proof
Theorem 3
Proof
 Case \(p \le 1/2\): By Lemma 8 and the fact that Algorithm 3 always returns a solution \(\mathcal {S}\) from the same distribution with minimum cost, we have$$\begin{aligned} \,\text {cost}\,_\mathcal {I}(\mathcal {S}) \le (10 +\min \{12\alpha /\delta , (12/\delta )(p\alpha + (1-p)(1-\alpha ))\} +3\epsilon ) \mathrm {OPT}_{\mathcal {I}}. \end{aligned}$$By Theorem 1, the approximation ratio is at most$$\begin{aligned} \max \left\{ \frac{1+\delta }{1-\delta }, 10 +3\epsilon +\min \{12\alpha /\delta , (12/\delta )(p\alpha + (1-p)(1-\alpha ))\} \right\} . \end{aligned}$$

If \(\alpha \le 1/2\), the ratio is at most \(\max \left\{ \frac{1+\delta }{1-\delta }, 10 +3\epsilon +6/\delta \right\} .\)
 If \(\alpha \ge 1/2\), we have$$\begin{aligned} (12/\delta )(p\alpha + (1-p)(1-\alpha )) = (12/\delta )(p(2\alpha -1) - \alpha +1) \le 6/\delta . \end{aligned}$$Again, the ratio is at most \(\max \left\{ \frac{1+\delta }{1-\delta }, 10 +3\epsilon +6/\delta \right\} .\)

 Case \(p \ge 1/2\): Observe that the event \(\mathcal {E}\) does happen at some point in the for loop at lines 8, 9, and 10. By Lemma 9 and the fact that \(6/p \le 12 < 12/\delta \), we have$$\begin{aligned} \,\text {cost}\,_{\mathcal {I}}(\mathcal {S}) \le (\max \{6/p, 12/\delta \}+4+3\epsilon ) \mathrm {OPT}_{\mathcal {I}} = ( 12/\delta +4+3\epsilon ) \mathrm {OPT}_{\mathcal {I}}. \end{aligned}$$By Theorem 1, the approximation ratio is bounded by \( \max \left\{ \frac{1+\delta }{1-\delta }, \frac{12}{\delta } + 3\epsilon +4 \right\} .\)
Note that in [10], Swamy considered a slightly more general version of KM where each facility also has an opening cost. It can be shown that Theorem 3 also extends to this variant.
3 A Bifactor 3.05-Approximation Algorithm for Knapsack Median
In this section, we develop a bifactor approximation algorithm for KM that outputs a pseudo-solution of cost at most \(3.05\,\mathrm {OPT_{\mathcal {I}}}\) and of weight bounded by \((1+\epsilon )B\). This is a substantial improvement upon the previous comparable result, which achieved a factor of \(16+\epsilon \) and violated the budget additively by the largest weight \(w_{\max }\) of a facility. It is not hard to observe that one can also use Swamy’s algorithm [10] to obtain an 8-approximation that opens a constant number of extra facilities (exceeding the budget B). Our algorithm works for the original problem formulation of KM where all facility costs are zero. It is inspired by a recent algorithm of Li and Svensson [8] for the k-median problem, which beat the long-standing best bound of \(3+\epsilon \). The overall approach consists of computing a so-called bipoint solution, which is a convex combination \(a\mathcal {F}_1+b\mathcal {F}_2\) of two integral pseudo-solutions \(\mathcal {F}_1\) and \(\mathcal {F}_2\) for appropriate factors \(a,b\ge 0\) with \(a+b=1\), and then rounding this bipoint solution to an integral one.
Depending on the value of a, Li and Svensson apply three different bipoint rounding procedures. We extend two of them to the case of KM. The rounding procedures of Li and Svensson have the inherent property of opening \(k+c\) facilities, where c is a constant. Li and Svensson find a way to preprocess the instance such that any pseudo-approximation algorithm for k-median that opens \(k+c\) facilities can be turned into a (proper) approximation algorithm by paying only an additional \(\epsilon \) in the approximation ratio. We did not find a way to prove a similar result for KM, and therefore our algorithms violate the facility budget by a factor of \(1+\epsilon \).
3.1 Pruning the Instance
The bifactor approximation algorithm that we will describe in Sect. 3.2 has the following property. It outputs a (possibly infeasible) pseudosolution of cost at most \(\alpha \mathrm {OPT_{\mathcal {I}}}\) such that the budget B is respected when we remove the two heaviest facilities from this solution. This can be combined with a simple reduction to the case where the weight of any facility is at most \(\epsilon B\). This ensures that our approximate solution violates the budget by a factor at most \(1+2\epsilon \) while maintaining the approximation factor \(\alpha \).
Lemma 10
Let \(\mathcal {I} = (B, \mathcal {F}, \mathcal {C}, c, w)\) be any KM instance. Assume there exists an algorithm A that computes for instance \(\mathcal {I}\) a solution that consists of a feasible solution and two additional facilities, and that has cost at most \(\alpha \mathrm {OPT_{\mathcal {I}}}\). Then there exists for any \(\epsilon >0\) a bifactor approximation algorithm \(A'\) which computes a solution of weight \((1+\epsilon )B\) and of cost at most \(\alpha \mathrm {OPT_{\mathcal {I}}}\).
Proof
Let \(\mathcal {I} = (B, \mathcal {F}, \mathcal {C}, c, w)\) be an instance of knapsack median, let \(\mathcal {F}_{\epsilon }\subseteq \mathcal {F}\) be the set of facilities whose weight exceeds \(\epsilon B\) and let \({\mathcal {S}}\) be some fixed optimum solution. Note that any feasible solution can have no more than \(1/\epsilon \) many facilities in \(\mathcal {F}_{\epsilon }\).
This allows us to guess the set \({\mathcal {S}}_{\epsilon }:={\mathcal {S}}\cap \mathcal {F}_{\epsilon }\) of “heavy” facilities in the optimum solution \({\mathcal {S}}\). To this end, we enumerate all \(O(\frac{1}{\epsilon }|\mathcal {F}|^{1/\epsilon })\) many subsets of \(\mathcal {F}_{\epsilon }\) of cardinality at most \(1/\epsilon \). At some iteration, we will consider precisely the set \({\mathcal {S}}_{\epsilon }\). We modify the instance as follows. The budget is adjusted to \(B':=B-w({\mathcal {S}}_{\epsilon })\). The weight of each facility in \({\mathcal {S}}_{\epsilon }\) is set to zero. The facilities in \(\mathcal {F}_{\epsilon }\setminus {\mathcal {S}}_{\epsilon }\) are removed from the instance. Let \(\mathcal {I}' = (B', \mathcal {F}\setminus (\mathcal {F}_{\epsilon }\setminus {\mathcal {S}}_{\epsilon }), \mathcal {C}, c, w')\) be the modified instance. Since \({\mathcal {S}}\) is a feasible solution to \(\mathcal {I}'\), it follows that \(\mathrm {OPT_{\mathcal {I'}}}\le \mathrm {OPT_{\mathcal {I}}}\). Therefore, the algorithm A from the statement outputs a solution \({\mathcal {S}}'\) whose cost is at most \(\alpha \mathrm {OPT_{\mathcal {I}}}\). If \({\mathcal {S}}'\subseteq {\mathcal {S}}_{\epsilon }\), we are done, since then \({\mathcal {S}}'\) is already a feasible solution under the original weight w. Otherwise, let \(f_1,f_2\) be the two heaviest facilities of \({\mathcal {S}}'\setminus {\mathcal {S}}_{\epsilon }\), where we set \(f_2=f_1\) if there is only one such facility. By the above-mentioned property of our algorithm, we have that \(w'({\mathcal {S}}'\setminus \{f_1,f_2\})\le B'\) and thus \(w({\mathcal {S}}'\setminus \{f_1,f_2\})\le B\). Since \(f_1,f_2 \not \in {\mathcal {S}}_{\epsilon }\), the weights \(w(f_1)\) and \(w(f_2)\) are bounded by \(\epsilon B\). Hence the total weight of solution \({\mathcal {S}}'\) under the original weight function is \(w({\mathcal {S}}')\le (1+2\epsilon )B\); running the argument with \(\epsilon /2\) in place of \(\epsilon \) gives the bound stated in the lemma. \(\square \)
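The enumeration of heavy-facility guesses in this proof can be sketched as follows (the function name is ours; the actual algorithm then builds and solves the modified instance \(\mathcal {I}'\) for each candidate set):

```python
from itertools import combinations

def candidate_heavy_sets(F, w, B, eps):
    """Enumerate every candidate for the set S_eps of heavy facilities
    (weight > eps*B) in the optimum: any feasible solution contains at
    most 1/eps of them, so trying all subsets up to that size suffices."""
    heavy = [i for i in F if w[i] > eps * B]
    for r in range(int(1 / eps) + 1):
        for S in combinations(heavy, r):
            if sum(w[i] for i in S) <= B:  # the guess must itself fit the budget
                yield set(S)
```

The number of candidates is \(O(\frac{1}{\epsilon }|\mathcal {F}|^{1/\epsilon })\), polynomial for any fixed \(\epsilon \).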
3.2 Computing and Rounding a Bipoint Solution
Extending a similar result for k-median [11], we can compute a so-called bipoint solution, which is a convex combination of two integral pseudo-solutions.
Theorem 4
We can compute in polynomial time two sets \(\mathcal {F}_1\) and \(\mathcal {F}_2\) of facilities and factors \(a,b\ge 0\) such that \(a+b=1\), \(w(\mathcal {F}_1)\le B\le w(\mathcal {F}_2)\), \(a\cdot w(\mathcal {F}_1)+b\cdot w(\mathcal {F}_2)\le B\), and \(a\cdot \mathrm {cost_{\mathcal {I}}}(\mathcal {F}_1)+b\cdot \mathrm {cost_{\mathcal {I}}}(\mathcal {F}_2)\le 2\cdot \mathrm {OPT_{\mathcal {I}}}\).
Proof
Williamson and Shmoys (Section 7.7, pp. 182–186 in [11]) prove an analogous theorem for the kmedian problem, which arises when we set \(w_i=1\) for all facilities i and \(B=k\). We can extend this proof to the case of nonuniform weights in a completely analogous manner. Moreover, instead of using the algorithm of Jain and Vazirani [5] for facility location (which has approximation ratio 3), we use a greedy algorithm of Jain et al. (Algorithm 2 in [4]) for facility location achieving a factor of 2. \(\square \)
We will now give an algorithm which, for a given knapsack median instance \(\mathcal {I} = (B, \mathcal {F}, \mathcal {C}, c, w)\), returns a pseudo-solution as in Lemma 10 with cost at most \(3.05\,\mathrm {OPT_{\mathcal {I}}}\).
We use Theorem 4 to obtain a bipoint solution of cost at most \(2\,\mathrm {OPT_{\mathcal {I}}}\). We will convert it into a pseudo-solution of cost at most 1.523 times that of the bipoint solution. Let \(a\mathcal {F}_1+b\mathcal {F}_2\) be the bipoint solution, where \(a+b = 1\), \(w(\mathcal {F}_1) \le B < w(\mathcal {F}_2)\) and \(a w(\mathcal {F}_1) + b w(\mathcal {F}_2) = B\). For each client \(j \in \mathcal {C}\), the closest elements in sets \(\mathcal {F}_1\) and \(\mathcal {F}_2\) are denoted by \(i_1(j)\) and \(i_2(j)\), respectively. Moreover, let \(d_1(j) = c_{i_1(j)j}\) and \(d_2(j) = c_{i_2(j)j}\). Then the (fractional) connection cost of j in our bipoint solution is \(ad_1(j) + bd_2(j)\). In a similar way, let \(d_1 = \sum _{j \in \mathcal {C}} d_1(j)\) and \(d_2 = \sum _{j \in \mathcal {C}} d_2(j)\). Then the bipoint solution has cost \(ad_1 + bd_2\).
We consider two candidate solutions. In the first, we just pick \(\mathcal {F}_1\), whose cost relative to the bipoint solution is bounded by \(\frac{d_1}{ad_1 + bd_2} = \frac{1}{a + b r_D}\), where \(r_D = \frac{d_2}{d_1}\). This, multiplied by 2, gives our approximation factor.
To obtain the second candidate solution we use the concept of stars. For each facility \(i \in \mathcal {F}_2\), define \(\pi (i)\) to be the facility from set \(\mathcal {F}_1\) which is closest to i. For a facility \(i \in {\mathcal {F}_1}\), define the star \(\mathcal {S}_i\) with root i and leaves \({L_i} = \{ i' \in \mathcal {F}_2 : \pi (i') = i \}\). Note that by the definition of stars, any client j with \(i_2(j) \in S_i\) has \(c_{i_2(j)i} \le c_{i_2(j) i_1(j)} \le d_2(j) + d_1(j)\) and therefore \(c_{j i} \le c_{j i_2(j)} + c_{i_2(j) i} \le 2d_2(j) + d_1(j)\).
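The star construction itself is a direct nearest-root grouping; a sketch, with a dictionary metric and names of our own choosing:

```python
def build_stars(F1, F2, c):
    """Group each facility of F2 under its nearest facility in F1 (its
    root pi(i)), forming the stars S_i used in the bipoint rounding."""
    leaves = {i: [] for i in F1}
    for i2 in F2:
        root = min(F1, key=lambda i1: c[(i1, i2)])  # pi(i2)
        leaves[root].append(i2)
    return leaves
```

Every \(\mathcal {F}_2\)-facility belongs to exactly one star, which is what lets the rounding choose, per star, either the root or the leaves.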
We will now analyze the cost of this solution. Both of the above knapsack LPs only reduce the connection cost in comparison to the original bipoint solution (or, equivalently, increase the saving with respect to the quantity \(d_1+2d_2\)); hence the total connection cost of the solution can be upper-bounded by \(d_1+2d_2-b(d_1+d_2)=(1+a)d_2+ad_1\).
Theorem 5
For any \(\epsilon >0\), there is a bifactor approximation algorithm for KM that computes a solution of weight at most \((1+\epsilon )B\) and of cost at most \(3.05\,\mathrm {OPT_{\mathcal {I}}}\).
Proof
As argued above, our algorithm computes a pseudo-solution S of cost at most \(3.05\,\mathrm {OPT_{\mathcal {I}}}\). Moreover, S consists of a feasible solution and two additional facilities. Hence, Lemma 10 implies the theorem. \(\square \)
4 Discussion
The proof of Theorem 3 implies that for every \((\delta ,\epsilon )\)-sparse instance \(\mathcal {I}\), there exists a solution \(\mathcal {S}\) such that \(\,\text {cost}\,_\mathcal {I}(\mathcal {S}) \le (4+12/\delta )\mathrm {OPT}_f + 3\epsilon \mathrm {OPT}_\mathcal {I}. \) Therefore, the integrality gap of \(\mathcal {I}\) is at most \(\frac{4+12/\delta }{1-3\epsilon }.\) Unfortunately, our client-centric sparsification process inflates the approximation factor to at least \(\frac{1+\delta }{1-\delta }\), so we must choose some \(\delta <1\) which balances this factor with that of Algorithm 3. In contrast, the facility-centric sparsification used in [8] incurs only a \(1+\epsilon \) factor in cost. We leave it as an open question whether the facility-centric version could also be used to get around the integrality gap of KM.
Our bifactor approximation algorithm achieves a substantially smaller approximation ratio at the expense of slightly violating the budget by opening two extra facilities. We leave it as an open question to obtain pre- and post-processing in the flavor of Li and Svensson to turn this into a (proper) approximation algorithm. It would be even more interesting to turn any bifactor approximation into an approximation algorithm by losing only a constant factor in the approximation ratio. We also leave it as an open question to extend the third bipoint rounding procedure of Li and Svensson to knapsack median, which would give an improved result.
Footnotes
 1.
The paper claims 2.611, but a very recent correction changes this to 2.674.
Acknowledgements
We thank the anonymous reviewers of this paper and its earlier ESA 2015 version for their careful reading of our manuscript and valuable comments.
References
 1.Byrka, J., Pensyl, T., Rybicki, B., Srinivasan, A., Trinh, K.: An improved approximation for k-median, and positive correlation in budgeted optimization. In: Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 737–756 (2015)
 2.Charikar, M., Guha, S., Tardos, É., Shmoys, D.B.: A constant-factor approximation algorithm for the k-median problem. In: ACM Symposium on Theory of Computing (STOC), pp. 1–10 (1999)
 3.Charikar, M., Li, S.: A dependent LP-rounding approach for the k-median problem. In: Automata, Languages, and Programming (ICALP), pp. 194–205 (2012)
 4.Jain, K., Mahdian, M., Markakis, E., Saberi, A., Vazirani, V.V.: Greedy facility location algorithms analyzed using dual fitting with factor-revealing LP. J. ACM 50(6), 795–824 (2003)
 5.Jain, K., Vazirani, V.V.: Approximation algorithms for metric facility location and \(k\)-median problems using the primal-dual schema and Lagrangian relaxation. J. ACM 48(2), 274–296 (2001). https://doi.org/10.1145/375827.375845
 6.Krishnaswamy, R., Kumar, A., Nagarajan, V., Sabharwal, Y., Saha, B.: The matroid median problem. In: Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1117–1130 (2011)
 7.Kumar, A.: Constant factor approximation algorithm for the knapsack median problem. In: Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 824–832 (2012)
 8.Li, S., Svensson, O.: Approximating \(k\)-median via pseudo-approximation. In: STOC, pp. 901–910 (2013)
 9.Schrijver, A.: Combinatorial Optimization: Polyhedra and Efficiency, vol. 24. Springer Science & Business Media, New York (2003)
 10.Swamy, C.: Improved approximation algorithms for matroid and knapsack median problems and applications. ACM Trans. Algorithms 28, 403–418 (2014)
 11.Williamson, D.P., Shmoys, D.B.: The Design of Approximation Algorithms. Cambridge University Press, Cambridge (2011)
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.