1 Introduction

In many modern scenarios one studies a variant of the matching problem with an additional family of side constraints. Such side constraints include distributional constraints (Ashlagi et al. 2020), ratio constraints (Yahiro et al. 2020), general multidimensional knapsack constraints (Nguyen et al. 2019), and pair restrictions (Hochbaum and Levin 2018). Here, we are motivated by applications in the design of observational studies that use so-called matching methods under fine balance constraints (Rosenbaum 2002; Rosenbaum et al. 2007; Stuart 2010; Sauppe et al. 2014; Karmakar et al. 2019; Rosenbaum 2020; Hochbaum et al. 2022a, b).

Based on this motivation we consider the following optimization problem. We have a (multi-)set P of n elements, each of which is associated with a value (or position) in the interval (0, 1] of the real line. That is, each member \(p\in P\) is associated with a rational number \({\Upsilon }(p)\) where \(0< {\Upsilon }(p) \le 1\). We are also given a partition of P into two sets \({{\mathcal {T}}}\) and \({{\mathcal {C}}}\), referred to as the treatment group and the potential control group, respectively. We have a distance measure, that is, for every pair of points \(t\in {{\mathcal {T}}}\) and \(c\in {{\mathcal {C}}}\), we have a non-negative distance \(d(t,c)\). We stress that this distance measure is between the elements of \({{\mathcal {T}}}\cup {{\mathcal {C}}}\) and not between their values. We are also given a rational upper bound \(\delta \) and a natural number K. The problem we consider is a two-phase optimization problem. The first phase is to choose a subset of \({{\mathcal {C}}}\), named the control group, which together with the entire treatment group forms the selected set of points. However, the control group is restricted by a constraint stated below. The second phase is to construct a minimum total distance (perfect) matching consisting of disjoint pairs of points, where each pair consists of a member of \({{\mathcal {T}}}\) and a member of the control group.

Next, we formalize the constraints of the first phase. A feasible solution defines a partition of the (0, 1] interval into at most K intervals of the form \((v_i,v_{i+1}]\) such that \(0=v_0\le v_1\le v_2 \le \cdots \le v_{K}=1\), subject to the constraints \(v_{i+1}-v_i \le \delta \) for all i, together with the following fine balance constraints: for each such interval that includes \(n_i\) points of \({{\mathcal {T}}}\), we need to choose exactly \(n_i\) points belonging to the same interval from \({{\mathcal {C}}}\). A solution to the first phase is feasible if it satisfies both constraints (the length of each interval is at most \(\delta \), and the fine balance constraints). In the second phase we construct a complete bipartite graph whose two sides are \({{\mathcal {T}}}\) and the control group. The length of an edge in this bipartite graph is the distance between the two corresponding elements, and the second phase goal is to compute a perfect matching in this bipartite graph minimizing the total distance of its edges. The goal of the two-phase optimization problem is to compute a feasible solution for the first phase whose optimal second phase solution has minimum cost. We denote this problem as Pr. We observe that without loss of generality, we have \(\frac{1}{\delta } \le K < \frac{2}{\delta }\), as given a feasible solution, we can unite two consecutive intervals as long as the length of the resulting interval is at most \(\delta \).
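To make the two first-phase constraints concrete, the following minimal sketch (ours, not part of the paper; the helper name `first_phase_feasible` and the floating-point tolerance are illustrative assumptions) checks a candidate breakpoint sequence and selected control group against both the interval-length and the fine balance constraints.

```python
from bisect import bisect_right

def first_phase_feasible(breakpoints, treat_vals, control_vals, delta, K):
    """Check whether a sorted sequence 0 = v_0 <= ... <= v_K = 1 defines a
    feasible first-phase solution: at most K intervals, each of length at
    most delta, and each interval (v_i, v_{i+1}] containing equally many
    treatment values and selected control values (fine balance)."""
    v = breakpoints
    if len(v) - 1 > K or v[0] != 0 or v[-1] != 1:
        return False
    t_sorted, c_sorted = sorted(treat_vals), sorted(control_vals)
    for lo, hi in zip(v, v[1:]):
        if hi - lo > delta + 1e-12:   # interval-length constraint
            return False
        # count values lying in the half-open interval (lo, hi]
        n_t = bisect_right(t_sorted, hi) - bisect_right(t_sorted, lo)
        n_c = bisect_right(c_sorted, hi) - bisect_right(c_sorted, lo)
        if n_t != n_c:                # fine balance violated
            return False
    return True
```

For instance, with \(\delta = 0.5\) and \(K = 2\), the breakpoints \((0, 0.5, 1]\) balance one treatment value against one control value in each interval.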

1.1 A motivating application in the design of observational studies subject to fine balance constraints

Each element has an associated continuous covariate whose support is in the interval (0, 1], representing, for example, an estimated propensity score of the individual. The input also defines a distance between every pair of elements, one from the treatment group and the other from the potential control group. We would like to enforce fine balance constraints on the covariate and to minimize the total distance of the matched pairs. Since the covariate is continuous, the input may contain an enormous number of different values in the support of the covariate (though still upper bounded by \(|{{\mathcal {T}}}\cup {{\mathcal {C}}}|\)), so a standard tool is to decrease the number of different values of the covariate prior to defining the fine balance constraints on the resulting (modified) covariate values. Observe that each such decrease in the number of values means that we treat two elements with different covariate values as having the same value of the modified covariate, which is natural as long as the two original values are similar. Thus, if we have an accuracy bound of \(\delta \in (0,1)\), we may consider two values \(x_1,x_2\) of the covariate as similar if \(|x_1-x_2|\le \delta \). We would like to partition the interval (0, 1] into sub-intervals such that if two values belong to a common sub-interval, they are similar, and in a second phase to consider the matching problem with fine balance constraints with respect to the modified covariate. Here, the modified covariate of an element is the identity of the sub-interval in the above partition to which the value of the covariate belongs. In order to bound the number of different values, earlier studies of such models pick the partition into sub-intervals in advance, and then solve the matching problem as a second phase optimization problem.
The requirement that in each sub-interval the number of points in the control group is equal to the number of points in the treatment group means that if we define a new covariate with values given by the index of the sub-interval containing each point, then our selected set of points will satisfy the so-called fine balance constraints for this new covariate. Our goal is to consider the combined problem. Namely, given an upper bound K on the number of different sub-intervals (each of which has length at most \(\delta \)), we would like to pick the best partition into sub-intervals. Here, the definition of best is that an optimal solution to the corresponding matching problem has the minimum total distance. Detailed reviews of matching-related methods used for covariate balancing problems are given by Stuart (2010) and Rosenbaum (2020). See also e.g. Rosenbaum (2002), Sauppe et al. (2014), Karmakar et al. (2019), Hochbaum et al. (2022a); Hochbaum et al. (2022b) for complexity classification results for variants of matching problems with fine balance constraints.

1.2 Reformulating Pr

We next note that Pr can be reformulated using relaxed fine balance constraints as follows. The reformulated problem is again a two-phase optimization problem. In its first phase we would like to find a partition of the (0, 1] interval into at most K intervals of the form \((v_i,v_{i+1}]\) such that \(0=v_0\le v_1\le v_2 \le \cdots \le v_{K}=1\), subject to the constraints \(v_{i+1}-v_i \le \delta \) for all i, and such that the following relaxed fine balance constraints are satisfied: for every \(i=0,1,\ldots ,K-1\), we have \(|\{ x\in {{\mathcal {C}}}: v_i< x \le v_{i+1} \} | \ge | \{ x\in {{\mathcal {T}}}: v_i < x \le v_{i+1} \}|\). Then, in the second phase of the reformulated problem, the goal is to select the control group and the matching so as to obtain a feasible solution for Pr whose cost is minimized. That is, in this reformulation the task of selecting the control group out of the potential control group is delayed from the first phase to the second phase. This reformulation is motivated by the following claim.

Claim 1.1

Assume that there is a feasible solution with objective function value X for Pr that in the first phase selects the sequence of values \(0=v_0\le v_1 \le \cdots \le v_K=1\) and this sequence is given as an input. Then, there is a polynomial time algorithm that selects a subset of the potential control group \({{\mathcal {C}}}'\) that together with the input sequence of values is a feasible solution for the first phase of Pr, such that there is a solution of the second phase whose objective function value is not larger than X.

The last claim is proved by defining a new covariate for the points in P, where a point has value j of this new covariate if it lies in the interval \((v_j,v_{j+1}]\). Then, we solve for the minimum total distance matching between \({{\mathcal {T}}}\) and a subset of \({{\mathcal {C}}}\) subject to fine balance constraints with respect to the new covariate. The latter problem is known to be polynomial time solvable (Rosenbaum 2002; Rosenbaum et al. 2007), so the claim holds. We will show the other direction of the last claim in Sect. 2. The last claim means that in the reformulated problem our goal is to solve the first phase so that the polynomial time algorithm for the second phase gives the cheapest solution, and this is also the case with respect to our original formulation of Pr.

1.3 Our results

In Sect. 3, we show that Pr is NP-hard. More precisely, we consider the decision problem of deciding if an instance of Pr admits a feasible solution whose total cost is zero. We let Dec-Pr denote this decision problem and we show that Dec-Pr is NP-complete. This claim also shows that it is NP-hard to approximate Pr within a bounded approximation ratio. Next, we consider a dual version of Dec-Pr where we would like to minimize the upper bound \(\delta \) on the length of the intervals so that the resulting instance of Pr admits a zero cost solution. We let Dual-Pr be this dual problem, and we show that Dual-Pr does not admit a polynomial time algorithm with a constant approximation ratio unless \(\text {P}=\text {NP}\). We obtain this result by modifying some of the gadgets used in the proof of the claim that Dec-Pr is NP-complete.

In Sect. 4 we show that Pr admits an algorithm with time complexity \(2^{O(|{{\mathcal {T}}}|)}\) times a polynomial of the input encoding length. This algorithm shows that the problem is fixed parameter tractable (FPT) with respect to the parameter \(|{{\mathcal {T}}}|\). Our algorithm is based on a dynamic programming approach, which translates into a polynomial time algorithm when the distance function is a monotone non-decreasing convex function of the absolute difference between the values associated with the two points. Such assumptions are sometimes satisfied in the motivating application exhibited above. We start our study in Sect. 2 with some preliminaries that are used later on.

We conclude this work in Sect. 5 by noting some guidelines for practitioners on applying the results of this paper in future work.

2 Preliminaries

Our first goal is to show that we can discretize the choice of the values \(v_j\) for \(j=1,2,\ldots ,K-1\) (recall that \(v_0=0\) and \(v_K=1\)). To state the claim, we let the potential point set \({{\mathcal {P}}}\) be defined as

$$\begin{aligned} {{\mathcal {P}}}= \{ x+y\cdot \delta : x\in P\cup \{ 0,1\}, \frac{-x}{\delta } \le y \le \frac{1-x}{\delta }, y\in \mathbb {Z}\} \end{aligned}$$

where \(\mathbb {Z}\) denotes the set of integers. Observe that \(|{{\mathcal {P}}}| = O(nK)\). We next prove that without loss of generality, for all j, the value \(v_j\) belongs to \({{\mathcal {P}}}\). More precisely, we prove the following lemma.
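The set \({{\mathcal {P}}}\) consists of all shifts of the values in \(P\cup \{0,1\}\) by integer multiples of \(\delta \) that stay within [0, 1]. A small sketch (ours; we assume rational inputs and use exact `Fraction` arithmetic to avoid floating-point drift) enumerating \({{\mathcal {P}}}\):

```python
from math import ceil, floor
from fractions import Fraction

def potential_points(P, delta):
    """Enumerate the potential breakpoint set: every value x + y*delta with
    x in P ∪ {0, 1} and y an integer, such that the result lies in [0, 1]."""
    delta = Fraction(delta)
    pts = set()
    for x in set(P) | {Fraction(0), Fraction(1)}:
        x = Fraction(x)
        y_min = ceil(-x / delta)        # smallest y with x + y*delta >= 0
        y_max = floor((1 - x) / delta)  # largest y with x + y*delta <= 1
        for y in range(y_min, y_max + 1):
            pts.add(x + y * delta)
    return sorted(pts)
```

Each of the \(O(n)\) base values contributes at most \(1/\delta + 1 = O(K)\) shifts, matching the \(|{{\mathcal {P}}}| = O(nK)\) bound above.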

Lemma 2.1

Given a feasible solution to the first phase of Pr, of the form \(0=v_0 \le v_1 \le \cdots \le v_K=1\) and a control group \({{\mathcal {C}}}' \subseteq {{\mathcal {C}}}\), there is another feasible solution \(0=v'_0 \le v'_1 \le \cdots \le v'_K=1\) and \({{\mathcal {C}}}'\) to the first phase of Pr for which \(v'_j \in {{\mathcal {P}}}\) for all j.

Observe that since the control group in the new solution is the same as in the original solution, the feasibility of the second phase problem is guaranteed and the cost of the optimal matching in the second phase is the same as it was with respect to the original solution. Thus, the lemma indeed proves that without loss of generality we can search for intervals where \(v'_j \in {{\mathcal {P}}}\) for all j. Next, we prove the lemma.

Proof

We consider the list \({{\mathcal {L}}}\) of values in \({{\mathcal {P}}}\) sorted from smallest to largest. Along \({{\mathcal {L}}}\), we let \(u_j < w_j\) be two consecutive values such that \(u_j \le v_j < w_j\). We let \(v'_j = u_j\) and apply this process for every value of j. Since the value of every point of P belongs to \({{\mathcal {P}}}\), no point lies in \((v'_j, v_j]\), and hence for all j the intervals \((v_j,v_{j+1}]\) and \((v'_j, v'_{j+1}]\) contain the same subset of points of \({{\mathcal {T}}}\cup {{\mathcal {C}}}\). We conclude that the number of points of \({{\mathcal {T}}}\) in each such interval is the same as the number of points of \({{\mathcal {C}}}'\) in that interval. Thus, to show the feasibility of the first phase solution we have defined, it suffices to show that \(v'_{j+1} -v'_j \le \delta \) for all j.

We know that \(v'_{j+1} \le v_{j+1} \le v_j+\delta \). By the definition of \({{\mathcal {P}}}\) and since \(v'_{j+1}\in {{\mathcal {P}}}\), we conclude that \(v'_{j+1}-\delta \in {{\mathcal {P}}}\) or \(v'_{j+1} \le \delta \). In the latter case \(v'_{j+1}-v'_j \le \delta \) holds trivially since \(v'_j \ge 0\). In the former case, \(v'_{j+1}-\delta \) is a value of \({{\mathcal {P}}}\) that is at most \(v_j\), so by the definition of \(u_j\) we have \(v'_j = u_j \ge v'_{j+1}-\delta \). Thus the claim holds. \(\square \)
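The rounding step of this proof, replacing each \(v_j\) by the largest candidate value \(u_j \le v_j\), can be sketched as follows (the helper name is ours, and we assume the candidate list already contains 0 and 1):

```python
from bisect import bisect_right

def round_down_breakpoints(v, cal_P):
    """Lemma 2.1 rounding sketch: map each breakpoint v_j to the largest
    value of the candidate set cal_P that does not exceed it (u_j in the
    proof).  Assumes 0 and 1 belong to cal_P, so the endpoints are fixed."""
    L = sorted(cal_P)
    return [L[bisect_right(L, x) - 1] for x in v]
```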

We next show that the core of Pr (and thus also of Dec-Pr) is the selection of points in \({{\mathcal {C}}}\). That is, we exhibit a polynomial time algorithm that, given a selected set \({{\mathcal {C}}}'\subseteq {{\mathcal {C}}}\) for which we are guaranteed a partition of (0, 1] into K intervals satisfying the constraints of the first phase of Pr, finds such a feasible partition of (0, 1]. Thus, we prove the following proposition.

Proposition 2.2

Assume that there is a feasible solution for Pr that selects the control group \({{\mathcal {C}}}'\). Then, there is a polynomial time algorithm that, given the choice \({{\mathcal {C}}}'\) as input, finds a sequence of values \(0=v_0\le v_1 \le \cdots \le v_K=1\) that together with \({{\mathcal {C}}}'\) is a feasible solution for the first phase of Pr.

Proof

For a given point \(v \in {{\mathcal {P}}}\), we let \({{\mathcal {P}}}_v\) be the subset of \(\{ x\in {{\mathcal {P}}}: v< x \le v+\delta \}\) such that a point belongs to \({{\mathcal {P}}}_v\) if the number of points in \({{\mathcal {C}}}'\) belonging to the interval (vx] equals the number of points in \({{\mathcal {T}}}\) in (vx]. Observe that we can construct each of these sets in polynomial time. We next construct a directed graph whose node set is \({{\mathcal {P}}}\) and its edge set is the following set \(\{ (v,{\tilde{v}}): v\in {{\mathcal {P}}}, {\tilde{v}} \in {{\mathcal {P}}}_v \}\). In this graph we would like to check if there is a (directed) path from the node 0 to the node 1 consisting of at most K edges. Observe that such a procedure can be implemented in polynomial time using standard algorithms assuming that K is given in unary.

If K is given in binary and is significantly larger than n, we can apply the following method. First, observe that without loss of generality, if there is an interval that does not contain the value of a point in P, then its length is exactly \(\delta \). This claim holds as otherwise we can replace the sequence of intervals between two consecutive values of points in P so as to satisfy this requirement, without increasing the number of intervals or modifying the intervals containing values of points in P. Based on this observation, we modify the constructed graph to allow edges of length \(\rho \) (for integer \(\rho \ge 2\)) corresponding to an interval of length \(\rho \delta \) containing no value of a point in P, and we keep the edges of the original graph only if their interval contains at least one value of a point in P, or ends at 1, or starts at 0, defining their length as 1. Then we delete all nodes of the graph corresponding to a point v whose distance from the values of points in \(P\cup \{ 0,1\}\) is strictly larger than \(\delta \). The resulting graph has polynomial size even if K is given in binary.

If such a path is found, then the values of the nodes appearing along the path in the original graph are the values that we output together with \({{\mathcal {C}}}'\) as the solution of the first phase. Observe that this means that if K is not given in unary, then we output an implicit description of the values \(v_j\) rather than an explicit one. The existence of such a path follows by Lemma 2.1. \(\square \)
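The unary-K case of this proof can be sketched as a breadth-first search; the code below is our illustration (with our own names), not the paper's pseudocode, and uses a plain double loop over candidate breakpoints rather than an optimized edge construction.

```python
from collections import deque
from bisect import bisect_right

def find_breakpoints(cal_P, treat_vals, control_sel, delta, K):
    """Proposition 2.2 sketch (unary K): nodes are candidate breakpoints in
    cal_P; an edge v -> w exists when 0 < w - v <= delta and the interval
    (v, w] contains equally many treatment and selected control values.
    BFS finds a 0-to-1 path of at most K edges, i.e., at most K intervals."""
    pts = sorted(set(cal_P) | {0, 1})
    t, c = sorted(treat_vals), sorted(control_sel)

    def balanced(lo, hi):  # fine balance on the interval (lo, hi]
        return (bisect_right(t, hi) - bisect_right(t, lo)
                == bisect_right(c, hi) - bisect_right(c, lo))

    parent, frontier, depth = {0: None}, deque([0]), 0
    while frontier and depth < K:
        nxt = deque()
        for v in frontier:
            for w in pts:
                if v < w <= v + delta and w not in parent and balanced(v, w):
                    parent[w] = v
                    if w == 1:  # reconstruct the breakpoint sequence
                        path, node = [], 1
                        while node is not None:
                            path.append(node)
                            node = parent[node]
                        return path[::-1]
                    nxt.append(w)
        frontier, depth = nxt, depth + 1
    return None
```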

The last proposition immediately translates into a fixed parameter tractable algorithm with respect to the parameter \(|{{\mathcal {C}}}|\). This is so, as for every subset of \({{\mathcal {C}}}\) of cardinality \(|{{\mathcal {T}}}|\), we can apply the algorithm of the proposition, and if it finds a feasible solution, we use a polynomial time algorithm for finding the best matching of the selected points in \({{\mathcal {C}}}\) and \({{\mathcal {T}}}\). Among all iterations that result in feasible solutions, we pick the best one. Thus, the number of iterations is at most \(\binom{|{{\mathcal {C}}}|}{|{{\mathcal {T}}}|} \le 2^{|{{\mathcal {C}}}|}\), each of which takes polynomial time. Thus, we conclude with the following corollary.
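This enumeration can be illustrated by the following toy sketch (ours, not the paper's algorithm). The `feasible_first_phase` oracle stands in for the polynomial-time test of Proposition 2.2, and the matching step tries all permutations, so it only suits tiny instances; the actual corollary uses a polynomial time matching algorithm per subset.

```python
from itertools import combinations, permutations

def fpt_over_controls(treat, control, dist, feasible_first_phase):
    """Corollary 2.3 illustration: enumerate every subset of the potential
    control group of cardinality |T|, keep those admitting a feasible
    first-phase partition, and return the cheapest perfect-matching cost
    among them (None if no subset is feasible)."""
    best = None
    for subset in combinations(control, len(treat)):
        if not feasible_first_phase(subset):
            continue
        cost = min(sum(dist[t][c] for t, c in zip(treat, perm))
                   for perm in permutations(subset))
        best = cost if best is None else min(best, cost)
    return best
```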

Corollary 2.3

Problem Pr admits an algorithm whose time complexity is at most \(2^{|{{\mathcal {C}}}|}\) times a polynomial of the input encoding length.

In Sect. 4, we will show that we can replace the parameter \(|{{\mathcal {C}}}|\) in Corollary 2.3 by the smaller parameter \(|{{\mathcal {T}}}|\) (note that due to fine balance constraints we can assume that \(|{{\mathcal {T}}}| \le |{{\mathcal {C}}}|\)).

3 NP-hardness of Pr

In this section we prove that Dec-Pr is NP-complete, implying that Pr is NP-hard to approximate, since the distance measure is non-negative and hence every solution has a non-negative cost. Our proof of NP-completeness is via a reduction from the standard SAT problem, defined as follows. In SAT the input defines a collection of n logical variables \(x_1,x_2,\ldots , x_n\), and a boolean formula in conjunctive normal form (CNF) over these variables. The problem is to decide if there is a truth assignment (i.e., assigning each variable either TRUE or FALSE) for which the value of the formula is TRUE. SAT is NP-complete (see e.g. Garey and Johnson 1979).

Theorem 3.1

The decision problem Dec-Pr is NP-complete.

Proof

It is clear that Dec-Pr is in NP: given an instance of Dec-Pr, we guess the control group and then apply Proposition 2.2 to solve the resulting problem given the control group. Then, we check that the output solution satisfies the constraints of the problem and verify that its total distance is zero. All these operations can be carried out in polynomial time.

In order to prove that Dec-Pr is NP-complete, we exhibit a polynomial reduction from SAT. Consider an instance of SAT with n variables denoted as \(x_1,\ldots ,x_n\) and m clauses \(c_1,\ldots ,c_m\), and assume that the ith clause \(c_i\) is a set of \(n_i\) literals. We let \(\delta =\frac{1}{5n}\) and \(K=10n\). Next we define the set of points in \({{\mathcal {T}}}\cup {{\mathcal {C}}}\) and their positions along (0, 1]. For every variable \(x_i\) in the SAT instance, we have 3m points that we refer to as the points corresponding to \(x_i\). There are m points \(\tau _i^j\) in \({{\mathcal {C}}}\) (for \(j=1,2,\ldots ,m\)) referred to as the points corresponding to a TRUE value of \(x_i\), and all these points (for the given value of i) are located at the coordinate \(4\delta \cdot i - \frac{3}{2} \cdot \delta \). We also have m points \(\phi _i^j\) in \({{\mathcal {T}}}\) (for \(j=1,2,\ldots ,m\)) corresponding to the variable \(x_i\), and all these points (for the given value of i) are located at the coordinate \(4\delta \cdot i - \frac{3}{4} \cdot \delta \). Last, we have m points \(\psi _i^j\) in \({{\mathcal {C}}}\) (for \(j=1,2,\ldots ,m\)) referred to as the points corresponding to a FALSE value of \(x_i\), and all these points (for the given value of i) are located at the coordinate \(4\delta \cdot i\). Next, we add \(\sum _{i=1}^m n_i\) elements to the treatment group and the same number of elements to the potential control group, where all these additional elements are located at coordinate 1. Out of these elements located at 1, there are \(n_i\) points of \({{\mathcal {T}}}\) corresponding to \(c_i\) that we denote as \(\alpha _{i}^j\) (for \(j=1,2,\ldots ,n_i\)) and \(n_i\) points of \({{\mathcal {C}}}\) corresponding to \(c_i\) that we denote as \(\beta _i^j\) (for \(j=1,2,\ldots ,n_i\)).

It remains to define the distance between every pair of points in \({{\mathcal {T}}}\times {{\mathcal {C}}}\). These distances are binary (that is, either 0 or 1), and we list the pairs with zero distance (all other pairs have unit distance). The following are the pairs with zero distance:

$$\begin{aligned}&\{ (\phi _{i'}^{j'},\beta _{i}^j): \forall i'=1,2,\ldots ,n, \ j'=1,2,\ldots ,m,\ i=1,2,\ldots ,m,\ j=1,2,\ldots ,n_i \}\\&\quad \cup \{ (\phi _{i'}^{j'},\tau _{i'}^j): \forall i'=1,2,\ldots ,n, \ j'=1,2,\ldots ,m,\ j=1,2,\ldots ,m \}\\&\quad \cup \{ (\phi _{i'}^{j'},\psi _{i'}^j): \forall i'=1,2,\ldots ,n, \ j'=1,2,\ldots ,m,\ j=1,2,\ldots ,m \} \\&\quad \cup \{ (\alpha _{i}^{j'},\beta _{i}^j): \forall i=1,2,\ldots ,m,\ j'=1,2,\ldots , n_i,\ j=2,3,\ldots ,n_i\}. \end{aligned}$$

In addition, for every i, j, if the jth literal of \(c_i\) is \(x_{i'}\), then \((\alpha _i^j,\tau _{i'}^i)\) is a zero distance pair, and if it is \(\lnot x_{i'}\), then \((\alpha _i^j,\psi _{i'}^i)\) is a zero distance pair. This defines the instance of Dec-Pr that we construct. Observe that this construction procedure takes polynomial time.

Next, assume that the input SAT instance is a YES instance, namely there is a truth assignment that assigns every variable \(x_i\) a value \(\zeta _i\) that is either TRUE or FALSE, for which every clause has at least one literal whose assigned value is TRUE. Consider Fig. 1 for an illustration. We first consider the set of intervals, defining a set of \(n+1\) intervals each of which has length \(\delta \) as follows. The last interval is \((1-\delta ,1]\), and it contains all elements located at 1. Then, for every i, if \(\zeta _i=\text {TRUE}\) we will have the interval \((4\delta i - \frac{7}{4} \delta , 4\delta i - \frac{3}{4} \delta ]\), and if \(\zeta _i=\text {FALSE}\), we will have the interval \((4\delta i -\delta , 4\delta i ]\). Observe that these intervals (for different values of i) are disjoint. We complete the set of intervals by adding auxiliary intervals of length at most \(\delta \) so that the total number of intervals is at most K and the entire collection of intervals is a set of disjoint intervals.

Next, we define the set of points in \({{\mathcal {C}}}\) that we select. We do not select the points of \({{\mathcal {C}}}\) contained in the auxiliary intervals, while we select all points located in our initial set of \(n+1\) intervals. Observe that for every interval that we may use in the resulting solution, the number of points from \({{\mathcal {T}}}\) in the interval equals the number of points in \({{\mathcal {C}}}\) in the interval, and in the auxiliary intervals there are no points of the treatment group. Thus, fine balance constraints are satisfied.

Last, it remains to consider the matching and verify that it consists (only) of zero cost edges. Consider a given clause \(c_i\) of the SAT formula and assume that its first literal that is assigned a TRUE value (in the given truth assignment) is the jth literal of \(c_i\). Assume that this literal corresponds to the variable \(x_{i'}\) (or its negation). Then, we match the jth element of \({{\mathcal {T}}}\) corresponding to \(c_i\) that is \(\alpha _{i}^j\) to either \(\tau _{i'}^i\) (if \(\zeta _{i'}=\text {TRUE}\)) or to \(\psi _{i'}^i\) (if \(\zeta _{i'}=\text {FALSE}\)), we match the other elements of \({{\mathcal {T}}}\) corresponding to \(c_i\) to \(\beta _i^2,\ldots , \beta _i^{n_i}\), and we match \(\beta _i^1\) to \(\phi _{i'}^i\). After applying this definition for all clauses, we complete the definition of the matching by using the following definition for each variable \(x_{i'}\). We match the remaining elements corresponding to each variable \(x_{i'}\) (that we select from the potential control group) to the remaining (so-far unmatched) elements of \({{\mathcal {T}}}\) corresponding to \(x_{i'}\). Observe that whenever we construct the matching for a given clause, the number of matched elements corresponding to a given variable \(x_{i'}\) from \({{\mathcal {C}}}\) equals the number of matched elements corresponding to \(x_{i'}\) from \({{\mathcal {T}}}\), so indeed the second phase matched pairs exist and have zero distance. Thus, every pair we have considered has zero distance and indeed the instance of Dec-Pr is a YES instance.

Fig. 1

An illustration of the input to Dec-Pr constructed by the reduction of Theorem 3.1 together with the zero cost matching corresponding to the instance of SAT with the formula \((x_1 \vee \lnot x_2) \wedge (\lnot x_1 \vee \lnot x_2)\) and the truth assignment \(x_1=\text {TRUE}\), \(x_2=\text {FALSE}\). The elements on top of the figure are from the potential control group, while the elements on the bottom are from the treatment group. Each element appears next to its label. For the points at position 1, the ones corresponding to the first clause are drawn on the left while the points corresponding to the second clause are drawn on the right

For the other direction, assume that the resulting Dec-Pr instance is a YES instance. We demonstrate a truth assignment for the SAT instance that satisfies all clauses. Observe that an interval of length at most \(\delta \) containing the points of the form \(\phi _{i}^j \) (for all j) may contain either the points \(\tau _{i}^j\) or the points \(\psi _i^j\) (or none) but not both. Thus, in order to satisfy the fine balance constraints, it must contain exactly one of the sets of points of the potential control group corresponding to \(x_i\) with a common location. That is, in the selected set of points in the feasible solution for Dec-Pr, either the entire set of points \(\tau _{i}^j\) for all j is selected or the entire set of points \(\psi _i^j\) for all j is selected (but not both). In the first case we let \(\zeta _i=\text {TRUE}\), and in the latter case we let \(\zeta _i=\text {FALSE}\). We apply this rule for every variable \(x_i\) in the SAT instance, so we indeed obtain a truth assignment.

Next, consider a given clause \(c_{i'}\) that has \(n_{i'}\) literals. Observe that an interval of (strictly) positive length at most \(\delta \) containing the point 1 contains all points at location 1 but no other point. By the fine balance constraint of this interval, we conclude that the selected set of points in \({{\mathcal {C}}}\) must contain all points of the potential control group whose position is 1. Because the set of points at this location has only \(\sum _{\iota } (n_{\iota }-1)\) points that form a zero distance pair with an element of the treatment group at location 1, and because there are \(\sum _{\iota } n_{\iota }\) such elements of the treatment group, we conclude that for every clause \(c_{i'}\) there must be at least one element of the treatment group corresponding to \(c_{i'}\) that forms a zero distance pair with an element of the control group whose position is strictly smaller than 1. However, such an element of the control group must correspond to a satisfied literal that appears in \(c_{i'}\). Therefore, the clause is indeed satisfied by the truth assignment we consider. Because this argument holds for all clauses, we conclude that the formula is satisfied, so the SAT instance is a YES instance. \(\square \)

Our next goal is to show that for every (positive) integer \(\rho \), the following \(\rho \)-gap decision problem is NP-complete. Proving this claim shows that unless \(\text {P}= \text {NP}\) we cannot approximate Dual-Pr within a factor of \(\rho \). The \(\rho \)-gap decision problem (denoted as Dual-Pr \('\)) is defined as follows. We are given an instance of Dec-Pr and our goal is to distinguish between two cases. In the first case there is a feasible solution for Dec-Pr that uses intervals of length at most \(\delta \). In the second case, for every partition of (0, 1] into intervals, and selection of \({{\mathcal {C}}}'\) that satisfies the fine balance constraints for which there is a matching between \({{\mathcal {C}}}'\) and \({{\mathcal {T}}}\) of zero cost, there is (at least) one interval of length larger than \(\rho \delta \). If the conditions of both cases fail to hold, there is no requirement regarding the output.

Theorem 3.2

Problem Dual-Pr \('\) is NP-hard.

Proof

We exhibit a gap producing reduction from SAT. Our reduction is similar to the one of Theorem 3.1, but replaces the variable gadgets with more complicated ones. We provide the details for completeness.

Consider an instance of SAT with n variables denoted as \(x_1,\ldots ,x_n\) and m clauses \(c_1,\ldots ,c_m\), and assume that the ith clause \(c_i\) is a set of \(n_i\) literals. We let \(\delta =\frac{1}{4\rho n}\) and \(K=8 \rho n\). Next, we define the set of points in \({{\mathcal {T}}}\cup {{\mathcal {C}}}\) and their position along (0, 1].

For every variable \(x_i\) in the SAT instance, we have \((3+4\rho )\cdot m\) points that we refer to as the points corresponding to \(x_i\). There are m points \(\tau _i^j\) in \({{\mathcal {C}}}\) (for \(j=1,2,\ldots ,m\)) referred to as the points corresponding to a TRUE value of \(x_i\) and all these points (for the given value of i) are located at the coordinate \(8\rho \delta i - 2\rho \cdot \delta \). We also have m points \(\phi _i^j\) in \({{\mathcal {T}}}\) (for \(j=1,2,\ldots ,m\)) corresponding to the variable \(x_i\) and all these points (for the given value of i) are located at the coordinate \(8\rho \delta i - (2\rho -\frac{1}{2}) \cdot \delta \). We also have m points \(\psi _i^j\) in \({{\mathcal {C}}}\) (for \(j=1,2,\ldots ,m\)) referred to as the points corresponding to a FALSE value of \(x_i\) and all these points (for the given value of i) are located at the coordinate \(8\rho \delta i + \delta \). We also have for every \(\ell =1,2,\ldots ,2\rho \), m points \(\gamma _{i,\ell }^j\) (for \(j=1,2,\ldots ,m\)) in \({{\mathcal {C}}}\) located at \(8\rho \delta i - (2\rho - \ell ) \cdot \delta \) and m points \(\nu _{i,\ell }^j\) (for \(j=1,2,\ldots ,m\)) in \({{\mathcal {T}}}\) located at \(8\rho \delta i - (2\rho - \ell -\frac{1}{2}) \cdot \delta \).

Next, we add \(\sum _{i=1}^m n_i\) elements to the treatment group and the same number of elements to the potential control group, where all these additional elements will have points with coordinate 1. Out of these elements located at 1, there are \(n_i\) points of \({{\mathcal {T}}}\) corresponding to \(c_i\) that we denote as \(\alpha _{i}^j\) (for \(j=1,2,\ldots ,n_i\)) and \(n_i\) points of \({{\mathcal {C}}}\) corresponding to \(c_i\) that we denote as \(\beta _i^j\) (for \(j=1,2,\ldots ,n_i\)).

It remains to define the distance between every pair of points in \({{\mathcal {T}}}\times {{\mathcal {C}}}\). These distances are binary (that is, either 0 or 1), and we list the pairs with zero distance (all other pairs have unit distance). These are the pairs \((\gamma _{i,\ell }^j, \nu _{i,\ell }^{j'})\) for every \(i,\ell \) and for all values of \(j,j'\), together with the pairs for which we defined a zero distance in the proof of Theorem 3.1, namely,

$$\begin{aligned}&\{ (\phi _{i'}^{j'},\beta _{i}^j): \forall i'=1,2,\ldots ,n, \ j'=1,2,\ldots ,m,\ i=1,2,\ldots ,m,\ j=1,2,\ldots ,n_i \}\\&\quad \cup \{ (\phi _{i'}^{j'},\tau _{i'}^j): \forall i'=1,2,\ldots ,n, \ j'=1,2,\ldots ,m,\ j=1,2,\ldots ,m \} \\&\quad \cup \{ (\phi _{i'}^{j'},\psi _{i'}^j): \forall i'=1,2,\ldots ,n, \ j'=1,2,\ldots ,m,\ j=1,2,\ldots ,m \} \\&\quad \cup \{ (\alpha _{i}^{j'},\beta _{i}^j): \forall i=1,2,\ldots ,m,\ j'=1,2,\ldots , n_i,\ j=2,3,\ldots ,n_i\}, \end{aligned}$$

and for every \(i,j\): if the jth literal of \(c_i\) is \(x_{i'}\), then \((\alpha _i^j,\tau _{i'}^i)\) is a zero distance pair, and if it is \(\lnot x_{i'}\), then \((\alpha _i^j,\psi _{i'}^i)\) is a zero distance pair. This defines the instance of Dual-Pr \('\) that we construct. Observe that this construction procedure takes polynomial time as long as \(\rho \) is polynomially bounded (or a constant).
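To make the bookkeeping of the construction concrete, the per-variable gadget can be sketched in Python. This is only an illustration of the coordinates defined above; the function name `variable_points` and the list-based representation are ours, not part of the reduction.

```python
def variable_points(i, m, rho, delta):
    """Coordinates of the (3 + 4*rho)*m points corresponding to variable x_i.

    Returns (T, C): positions of the treatment points (phi, nu) and of the
    potential control points (tau, psi, gamma), following the text.
    """
    base = 8 * rho * delta * i
    T = [base - (2 * rho - 0.5) * delta] * m            # phi_i^j
    C = [base - 2 * rho * delta] * m                    # tau_i^j  (TRUE points)
    C += [base + delta] * m                             # psi_i^j  (FALSE points)
    for ell in range(1, 2 * rho + 1):
        C += [base - (2 * rho - ell) * delta] * m       # gamma_{i,ell}^j
        T += [base - (2 * rho - ell - 0.5) * delta] * m  # nu_{i,ell}^j
    return T, C
```

A quick count confirms the \((3+4\rho )\cdot m\) points per variable: \(m\) copies each of \(\phi \), \(\tau \), \(\psi \), plus \(2\rho m\) copies each of \(\gamma \) and \(\nu \).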

Next, assume that the input SAT instance is a YES instance, namely, there is a truth assignment that assigns to every variable \(x_i\) a value \(\zeta _i\) (either TRUE or FALSE) such that every clause has at least one literal whose assigned value is TRUE.

We first define a set of intervals, each of length \(\delta \), as follows. The last interval is \((1-\delta ,1]\), and it contains all elements located at 1. Then, for every i, if \(\zeta _i=\text {TRUE}\) we will have the intervals \((8\rho \delta i - (2\rho +\frac{1}{2} -\ell ) \cdot \delta , 8\rho \delta i - (2\rho -\frac{1}{2} -\ell ) \cdot \delta ]\) for \(\ell =0,1,2,\ldots ,2\rho \). If \(\zeta _i=\text {FALSE}\), we will have the intervals \((8\rho \delta i - (2\rho -\ell ) \cdot \delta , 8\rho \delta i - (2\rho -\ell -1) \cdot \delta ]\) for \(\ell =0,1,2,\ldots ,2\rho \). Observe that these intervals are disjoint, and their union contains all elements of \({{\mathcal {T}}}\) and all points of the form \(\gamma _{\cdot , \cdot }^{\cdot }\). We complete the set of intervals by adding disjoint auxiliary intervals of length at most \(\delta \) so that the total number of intervals is at most K.
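The interval family used in this direction of the argument can be sketched as follows; `assignment_intervals` is a hypothetical helper name of ours, the auxiliary intervals are omitted, and the assignment is passed as a dictionary from 1-based variable indices to truth values.

```python
def assignment_intervals(zeta, rho, delta):
    """Length-delta intervals induced by a truth assignment zeta.

    For each variable i, emits the 2*rho + 1 shifted intervals from the
    text (shifted by delta/2 when zeta_i is TRUE), plus the last interval
    (1 - delta, 1] that holds the clause points located at 1.
    """
    ivs = []
    for i, val in sorted(zeta.items()):
        base = 8 * rho * delta * i
        for ell in range(2 * rho + 1):
            if val:   # zeta_i = TRUE
                ivs.append((base - (2 * rho + 0.5 - ell) * delta,
                            base - (2 * rho - 0.5 - ell) * delta))
            else:     # zeta_i = FALSE
                ivs.append((base - (2 * rho - ell) * delta,
                            base - (2 * rho - ell - 1) * delta))
    ivs.append((1 - delta, 1.0))
    return ivs
```

The intervals for a single variable are consecutive and touch at their endpoints, and the gadgets of distinct variables are \(8\rho \delta \) apart, so the family is pairwise disjoint as half-open intervals.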

Next, we define the set of points in \({{\mathcal {C}}}\) that we select. We do not select the points of \({{\mathcal {C}}}\) contained in the auxiliary intervals, while we select all points located in our initial set of intervals. Observe that for every interval used in the resulting solution, the number of points from \({{\mathcal {T}}}\) in the interval equals the number of selected points of \({{\mathcal {C}}}\) in the interval, and the auxiliary intervals contain no points of the treatment group. Thus, the fine balance constraints are satisfied.

Last, it remains to consider the matching and verify that it consists (only) of zero cost edges. For every \(i,j,\ell \), we match \(\gamma _{i,\ell }^j\) to \(\nu _{i,\ell }^j\). Next consider the other selected elements, where we use a matching that is identical to the one we used in the proof of Theorem 3.1. Thus, every pair we have considered has zero distance and indeed the instance of Dual-Pr \('\) is a YES instance for interval lengths at most \(\delta \).

Next, assume that the resulting Dual-Pr \('\) instance is a YES instance for interval lengths upper bounded by \(\rho \delta \). We next argue that the SAT instance is a YES instance. First, consider a given variable \(x_i\) of the SAT instance; it corresponds to a set of elements in \({{\mathcal {C}}}\cup {{\mathcal {T}}}\). Observe that in a zero-cost solution of Dual-Pr we must have all elements of the form \(\gamma _{i,\cdot }^{\cdot }\) being selected (otherwise at least one of the elements of the form \(\nu _{i,\ell }^j\) will be matched using a unit cost pair). We observe that every interval of length at most \(\rho \delta \) contains at most one of the sets of points \(\{ \tau _i^j:\forall j\}\) or \(\{ \psi _i^j: \forall j\}\) (or none), and if the interval contains one of these sets, then it does not contain elements corresponding to other variables of the SAT instance. By counting the number of elements of \({{\mathcal {C}}}\) and of \({{\mathcal {T}}}\) in such an interval that we select as one of the intervals in the solution for Dual-Pr \('\), we conclude that either we select the entire set \(\{ \tau _i^j:\forall j\}\) or we select the entire set \(\{ \psi _i^j: \forall j\}\). In the first case we let \(\zeta _i=\text {TRUE}\) while in the second case we let \(\zeta _i = \text {FALSE}\). The remainder of the proof is identical to the proof of Theorem 3.1. We conclude that the formula is satisfied, so the SAT instance is a YES instance. \(\square \)

4 Dynamic programming algorithms for Pr

Here, we show in Sect. 4.1 an FPT algorithm with respect to the parameter \(|{{\mathcal {T}}}|\) for solving Pr, and in Sect. 4.2 a polynomial time implementation for the most natural special cases of Pr. Note that the need to restrict attention to special cases when seeking a polynomial time algorithm that outputs an optimal solution follows directly from the NP-hardness of the problem.

4.1 The general case of Pr

Our next goal is to present a dynamic programming algorithm for solving Pr whose time complexity is polynomial if \(|{{\mathcal {T}}}| \in O(\log n)\).

A state in our dynamic programming table corresponds to a partial selection of points \(0=v_0<v_1< \cdots < v_j\) all of which are in \({{\mathcal {P}}}\) together with the subset \(S_j\) of \({{\mathcal {T}}}\) that is matched to points in the control group with coordinate at most \(v_j\). The selection of intervals and a corresponding selection of the subset of \({{\mathcal {C}}}\) satisfy the fine balance constraints with respect to the points in \({{\mathcal {T}}}\) with position at most \(v_j\). Therefore, we have that \(|\{ x\in {{\mathcal {T}}}{\setminus } S_j: {\Upsilon }(x) \le v_j\} | = |\{ x\in S_j: {\Upsilon }(x)> v_j\}|\). All three of j, \(v_j\), and \(S_j\) are recorded in the state. Thus, the number of states in the dynamic programming table will be \(O(K \cdot nK \cdot 2^{|{{\mathcal {T}}}|})\). We explain below how to eliminate the dependency on K.

For each state in the table, the dynamic programming algorithm computes the minimum cost partial solution that uses j feasible intervals whose right-most interval has a right endpoint v and matches control units with value at most v to treatment units in S. We let \(F(j,v,S)\) denote this cost.

We are clearly interested in \(\min _{j\in \{ 1,2,\ldots ,K\}} F(j,1, {{\mathcal {T}}})\). By definition, \(F(j,0,S)=0\) if \(j=0\) and \(S=\emptyset \), and otherwise \(F(j,0,S)=\infty \). In order to introduce the recursive formula we let \(D(S{\setminus } S', {{\mathcal {C}}}[v',v])\) denote the optimal matching cost of the points in \(S {\setminus } S'\) to the points in the control group whose positions are in the interval \((v',v]\) if the number of elements in \(S {\setminus } S'\) equals the number of points in the treatment group in the interval \((v',v]\) and \(\infty \) otherwise. Using this notation, the recursive formula is given as

$$\begin{aligned} F(j,v,S) = \min \{ F(j-1,v',S')+ D(S{\setminus } S', {{\mathcal {C}}}[v',v]): v-\delta \le v'<v, v'\in {{\mathcal {P}}}, S'\subseteq S \}. \end{aligned}$$

Observe that the solution we obtain using this dynamic programming formulation satisfies the fine balance constraints if its cost is finite. Moreover, the length of each interval is at most \(\delta \), so if the output has a finite cost, then it corresponds to a feasible solution. Furthermore, the time complexity of computing a value of D is \(O(|{{\mathcal {T}}}| (n|{{\mathcal {T}}}|+n\log n))\) using the successive shortest path algorithm (Ahuja et al. 1993), as there are at most \(|{{\mathcal {T}}}|\) paths computed by this algorithm and the bipartite graph in which the algorithm is applied has at most n nodes and at most \(n|{{\mathcal {T}}}|\) edges. The resulting dynamic programming algorithm has time complexity of at most \(O(K^3 \cdot n^3 \cdot 2^{2|{{\mathcal {T}}}|} \cdot |{{\mathcal {T}}}| \cdot (|{{\mathcal {T}}}|+\log n)) \).
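For intuition, the recursion for F can be sketched in Python on a toy instance. This is a minimal illustration under simplifying assumptions, not the claimed implementation: the candidate endpoint set (standing in for the set \({{\mathcal {P}}}\) of the text) is supplied by the caller, and D is computed by brute force over permutations rather than by successive shortest paths; all function names are ours.

```python
import math
from itertools import combinations, permutations

def D(A, ctrl, d):
    """Min-cost assignment of the treatment indices in A to distinct
    control indices in ctrl (brute force; fine for tiny examples)."""
    if len(A) > len(ctrl):
        return math.inf
    if not A:
        return 0.0
    return min(sum(d[t][c] for t, c in zip(A, p))
               for p in permutations(ctrl, len(A)))

def solve_pr(T, C, d, delta, K, endpoints):
    """Sketch of the table F(j, v, S).

    T, C are position lists, d[t][c] the distances, and `endpoints` a
    sorted candidate list for interval right endpoints, ending at 1.
    """
    nT = len(T)
    F = {(0, 0.0, frozenset()): 0.0}        # base case F(0, 0, emptyset) = 0
    for j in range(1, K + 1):
        for v in endpoints:
            for (jj, vp, Sp), cost in list(F.items()):
                if jj != j - 1 or not (v - delta <= vp < v):
                    continue
                treat_in = [t for t in range(nT) if vp < T[t] <= v]
                ctrl_in = [c for c in range(len(C)) if vp < C[c] <= v]
                rest = [t for t in range(nT) if t not in Sp]
                # fine balance: the newly matched set S \ S' must have the
                # same size as the treatment points inside (v', v]
                for A in combinations(rest, len(treat_in)):
                    val = cost + D(list(A), ctrl_in, d)
                    key = (j, v, Sp | set(A))
                    if val < F.get(key, math.inf):
                        F[key] = val
    full = frozenset(range(nT))
    return min((c for (j, v, S), c in F.items()
                if v == endpoints[-1] and S == full), default=math.inf)
```

On two treatment points at 0.3 and 0.7, three potential controls at 0.25, 0.65, 0.9, and two intervals \((0,0.5], (0.5,1]\), the sketch selects the two cheapest in-interval controls subject to fine balance.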

Next, we observe that if K is given as part of the input in binary, we can speed up the algorithm and replace the dependence on K by a dependence on \(\min \{ n,K\}\). In order to do so, we recall that if an interval contains no value of a point in P, does not end at 1, and does not start at 0, then its length is exactly \(\delta \). This means that for every point \(v\in {{\mathcal {P}}}\) that may serve as an endpoint of an interval, there is a set of O(n) values of j such that v is the endpoint of the jth interval in the solution. For a given point v, we can find this set of values in polynomial time. Obviously, we can delete from the dynamic programming table every state \((j,v,S)\) for which j does not belong to the corresponding set of v. Furthermore, we delete states corresponding to points v with distance larger than \(2\delta \) from every value of a point in P and with distance larger than \(2\delta \) from \(\{ 0,1\}\), while allowing the dynamic programming recursive formula an arbitrary number of consecutive intervals of length \(\delta \) containing no values of points in P. We let \({{\mathcal {P}}}'\) be the subset of \({{\mathcal {P}}}\) consisting of points whose states were not deleted by the above rule. The resulting recursive formula is

$$\begin{aligned}
F(j,v,S) = \min \bigl \{ &\min \{ F(j-1,v',S')+ D(S{\setminus } S', {{\mathcal {C}}}[v',v]): v-\delta \le v'<v,\ v'\in {{\mathcal {P}}}',\ S'\subseteq S \},\\
&\min \{ F(j-\rho ,v-\rho \delta ,S): \rho \in \{ 1,2,\ldots ,K\},\ v-\rho \delta \in {{\mathcal {P}}}' \} \bigr \}.
\end{aligned}$$

The algorithm for Pr computes the entries of the new dynamic programming table, and then outputs the solution corresponding to \(\min _{j\in \{ 1,2,\ldots , K\}}F(j,1,{{\mathcal {T}}})\). The resulting time complexity is obtained by replacing K by \(\min \{n,K\}\), that is, \(O(\min \{ n,K\}^3 \cdot n^3 \cdot 2^{2|{{\mathcal {T}}}|} \cdot |{{\mathcal {T}}}| \cdot (|{{\mathcal {T}}}|+\log n)) \). We summarize this result as follows.

Theorem 4.1

There is an algorithm that solves Pr with time complexity that is upper bounded by \(2^{2|{{\mathcal {T}}}|}\) times a polynomial in the input binary encoding length. This algorithm is in FPT with respect to the parameter \(|{{\mathcal {T}}}|\) of the problem.

4.2 The special case where the distance is a convex monotone non-decreasing function of the difference between the points

Next, we assume that the distance function satisfies \(d(t,c)=f(|{\Upsilon }(t)-{\Upsilon }(c)|)\) where f is a convex monotone non-decreasing function. That is, for two elements \(t\in {{\mathcal {T}}}, c\in {{\mathcal {C}}}\), their distance is the value of a convex monotone non-decreasing function of the absolute difference between the values associated with t and c. For this case, the running time of the last dynamic programming algorithm can be improved significantly. Observe that the instance of Dec-Pr that we constructed in the proof of Theorem 3.1 does not satisfy this assumption on the distance function.

Note that by the fine balance constraints, the number of points from the control group with positions in an interval \((v',v]\) that are matched to points in \({{\mathcal {T}}}\) with positions outside of this interval must equal the number of points in \({{\mathcal {T}}}\) with positions in this interval that are matched to points in \({{\mathcal {C}}}\) with positions outside of the interval. If this common number is non-zero for some interval, pick the minimum value of j for which the jth interval has a non-zero common number, and let \((v',v]\) be that interval. Then, we have a situation where \(c,c'\) are selected from \({{\mathcal {C}}}\) where \({\Upsilon }(c)\le v<{\Upsilon }(c')\), \(c'\) is matched to a point \(t'\) such that \({\Upsilon }(t')\le v\), and c is matched to a point t such that \({\Upsilon }(t)>v\) (and \(t,t'\in {{\mathcal {T}}}\)). But we prove below that \(d(t',c')+d(t,c) \ge d(t',c)+d(t,c')\) by convexity and monotonicity of f, so swapping the partners of c and \(c'\) does not increase the cost, and repeating this exchange yields an optimal solution without such crossings.

Claim 4.2

If \({\Upsilon }(c)\le v<{\Upsilon }(c')\), \({\Upsilon }(t')\le v<{\Upsilon }(t)\), then for \(d({\hat{t}},{\hat{c}})=f(|{\Upsilon }({\hat{t}})-{\Upsilon }({\hat{c}})|)\) with f being a convex monotone non-decreasing function, we have that \(d(t',c')+d(t,c) \ge d(t',c)+d(t,c')\).

Proof

Our proof is based on four cases with respect to the relative order of \({\Upsilon }(t'),{\Upsilon }(c)\) and the relative order of \({\Upsilon }(t),{\Upsilon }(c')\).

  • If \({\Upsilon }(t') \le {\Upsilon }(c) <{\Upsilon }(c') \le {\Upsilon }(t)\), then by monotonicity of f, \(f(|{\Upsilon }(t')-{\Upsilon }(c')|)\ge f(|{\Upsilon }(t')-{\Upsilon }(c)|)\) and \(f(|{\Upsilon }(t)-{\Upsilon }(c)|) \ge f(|{\Upsilon }(t)-{\Upsilon }(c')|)\).

  • If \({\Upsilon }(c)\le {\Upsilon }(t') <{\Upsilon }(t) \le {\Upsilon }(c')\), then again by monotonicity of f we have \(f(|{\Upsilon }(t')-{\Upsilon }(c')|) \ge f(|{\Upsilon }(t)-{\Upsilon }(c')|)\) and \(f(|{\Upsilon }(t)-{\Upsilon }(c)|) \ge f(|{\Upsilon }(t')-{\Upsilon }(c)|)\), and the claim holds.

  • If \({\Upsilon }(c) \le {\Upsilon }(t') < {\Upsilon }(c') \le {\Upsilon }(t)\), then by letting \(y={\Upsilon }(c')-{\Upsilon }(t')\), \(z_1={\Upsilon }(t')-{\Upsilon }(c)\), and \(z_2 = {\Upsilon }(t)-{\Upsilon }(c')\), we need to show that \(f(y)+f(y+z_1+z_2) \ge f(z_1)+f(z_2)\). By monotonicity of f, we know that \(f(y+z_1+z_2) \ge f(z_1+z_2)\) and \(f(y) \ge f(0)\). The claim in this case follows as by convexity of f, we have \(f(0)+ f(z_1+z_2) \ge f(z_1)+f(z_2)\).

  • In the remaining case, \({\Upsilon }(t')\le {\Upsilon }(c)<{\Upsilon }(t)\le {\Upsilon }(c')\). Then we let \(y={\Upsilon }(t)-{\Upsilon }(c)\), \(z_1 ={\Upsilon }(c)-{\Upsilon }(t')\), and \(z_2 ={\Upsilon }(c')-{\Upsilon }(t)\). We need to show that \(f(y+z_1+z_2)+f(y) \ge f(z_1)+f(z_2)\) which holds by the proof of the previous case.

\(\square \)
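The exchange inequality of Claim 4.2 can also be checked empirically for sample convex monotone non-decreasing functions. This is a sanity check of the statement, not part of the proof; `check_exchange` is our name, and the sampled positions realize the configuration \({\Upsilon }(c),{\Upsilon }(t')\le v<{\Upsilon }(c'),{\Upsilon }(t)\).

```python
import random

def check_exchange(f, trials=10000, seed=0):
    """Empirically verify d(t',c') + d(t,c) >= d(t',c) + d(t,c') for
    random configurations of Claim 4.2, with d induced by f on positions."""
    rng = random.Random(seed)
    for _ in range(trials):
        v = rng.uniform(0.1, 0.9)
        c, tp = rng.uniform(0, v), rng.uniform(0, v)   # positions at most v
        cp, t = rng.uniform(v, 1), rng.uniform(v, 1)   # positions above v
        lhs = f(abs(tp - cp)) + f(abs(t - c))          # d(t',c') + d(t,c)
        rhs = f(abs(tp - c)) + f(abs(t - cp))          # d(t',c)  + d(t,c')
        if lhs < rhs - 1e-12:                          # tolerance for floats
            return False
    return True
```

Running it with, for example, \(f(x)=x^2\) or the identity \(f(x)=x\) finds no violating configuration, as the claim guarantees.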

Therefore, by the last claim, without loss of generality there is an optimal solution in which this case does not occur. So we can restrict the states in the dynamic programming table to those \((j,v,S)\) subject to the constraint that S is the set of points in \({{\mathcal {T}}}\) with position at most v. This observation results in a state space consisting of at most \(O(\min \{ n,K\}^2 n)\) states, and the time complexity of the dynamic programming algorithm becomes \(O(\min \{ n,K\}^3 n^3 \cdot |{{\mathcal {T}}}| \cdot (|{{\mathcal {T}}}|+\log n))\), as summarized in the following corollary.

Corollary 4.3

If there is a convex monotone non-decreasing function f such that for every \(t\in {{\mathcal {T}}}, c\in {{\mathcal {C}}}\) the distance satisfies \(d(t,c)=f(|{\Upsilon }(t)-{\Upsilon }(c)|)\), then Pr admits a polynomial time algorithm.

Observe that the last corollary implies that given an instance of Pr one can find a feasible solution for the problem (if one exists) in polynomial time. This is done by introducing a new distance measure defined as \(f(|{\Upsilon }(t)-{\Upsilon }(c)|)\), with f being, for example, the identity function \(f(x)=x\).

5 Concluding remarks

We conclude this work with some remarks on how practitioners may use our results in future work.

First, we address the choice of \(K,\delta \) before applying the algorithms. Usually in practical applications of designing an experiment using a fine balance method, these values of \(K,\delta \) are not determined by an exogenous adversary, but can be chosen in order to improve the design of the experiment. A common practice, prior to this work, is to test several initial designs in which the practitioner either picks a given number of common-width intervals or determines intervals through ad-hoc clustering after visual inspection of the data. For each such initial design we have a corresponding pair \((K,\delta )\), where K is the number of intervals used and \(\delta \) is the maximum interval length in this design. We can always assume that \(\frac{K}{2}\le \frac{1}{\delta } \le K\). It then makes sense to consider the parameters \((K,\delta )\) obtained by keeping the \(\delta \) value of the initial design and increasing K by 1 or 2 above its value in the initial design. This yields a reasonable pair of values that should be able to produce an experiment whose design is better than the ones implied by the initial designs.
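A minimal sketch of this parameter choice, assuming the initial design is given as a list of half-open \((\text {left},\text {right}]\) intervals; the function name `suggested_parameters` and its signature are ours, not a prescribed procedure.

```python
def suggested_parameters(initial_intervals, extra=2):
    """From an initial design, read off delta as the maximum interval
    length and K as the number of intervals plus a small slack
    (the text suggests increasing K by 1 or 2)."""
    delta = max(r - l for l, r in initial_intervals)
    K = len(initial_intervals) + extra
    return K, delta
```

For instance, an initial design with three intervals of lengths 0.3, 0.3, and 0.4 yields \(\delta =0.4\) and \(K=5\), which satisfies \(\frac{K}{2}\le \frac{1}{\delta }\le K\).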

The next step would be to obtain an initial solution. To do so, we can use the algorithm of Sect. 4.2 with \(f(x)=x^2\), ignoring the original distance measure. If the original distance measure is correlated with the distance between the values used for this initial solution, then this is an initial solution worth examining. Other initial solutions are the ones suggested by the standard design methods.

Last, we should try to improve each such initial solution via a local search heuristic. This local search should use the original distance measure for directing the search, and a compact representation of the solution would be the set of values \(0=v_0 \le \cdots \le v_K=1\).

We hope that considering the selection of intervals based on an optimization criterion in a systematic way would lead to a better design of experiments in this context of fine balance constraints.