Even though (8) yields a strong relaxation for large K, it is computationally more challenging to solve than (3). To the best of our knowledge, no algorithm in the literature is known to solve (8). Because \(S^{K}\) lacks the quasi-concavity property, the aggregation vectors cannot be adjusted independently; an alternating-type method based on the \(K=1\) case may only provide weak bounds.
In this section, we present the first algorithm for solving (8). The idea of the algorithm is the same as before: a master problem generates an aggregation vector \((\lambda ^1, \ldots , \lambda ^K)\) and the sub-problem solves the K-surrogate relaxation corresponding to \((\lambda ^1, \ldots , \lambda ^K)\). The only differences from Algorithm 1 are that we replace the LP master problem by a MILP master problem and solve \(S^{K}(\lambda ^1,\ldots ,\lambda ^K)\) instead of \(S(\lambda )\).
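This master/sub-problem interplay can be sketched as a generic cutting-plane loop. In the following Python sketch, `solve_master` and `solve_subproblem` are hypothetical callables standing in for the master problem and \(S^{K}(\lambda)\) respectively; they are illustrative placeholders, not part of our implementation:

```python
def k_surrogate_dual(solve_master, solve_subproblem, eps=1e-6, max_iters=100):
    """Cutting-plane loop for the K-surrogate dual (8).

    solve_master(points)  -> (psi, lambdas): maximizes the minimum violation
                             psi over the generated points (as in (11)/(12)).
    solve_subproblem(lam) -> (x, bound):     solves the K-surrogate relaxation
                             S^K(lam), returning its solution and value.
    """
    points, best_bound = [], float("-inf")
    for _ in range(max_iters):
        psi, lambdas = solve_master(points)
        if psi <= eps:  # no multiplier cuts off all generated points
            break
        x, bound = solve_subproblem(lambdas)
        best_bound = max(best_bound, bound)  # dual bounds improve monotonically
        points.append(x)                     # new point for the master problem
    return best_bound
```

The loop terminates once the master problem can no longer find multipliers that cut off every generated point, mirroring the \(\epsilon\)-feasibility stopping criterion discussed for Algorithm 1.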
Generalizing Algorithm 1 Assume that we have a solution \({{\bar{x}}}\) of problem \(S^{K}(\lambda ^1, \ldots , \lambda ^K)\). In the next iteration, we need to ensure that \({{\bar{x}}}\) violates at least one of the aggregated constraints. This requirement can be written as the disjunctive constraint
$$\begin{aligned} \bigvee _{k=1}^K \left( \sum _{i \in {\mathscr {M}}} \lambda ^k_i g_i({{\bar{x}}}) > 0 \right) . \end{aligned}$$
(10)
As in (6), we replace the strict inequality by maximizing the activity of \(\sum _{i \in {\mathscr {M}}} \lambda ^k_i g_i({{\bar{x}}})\) for all \(k \in \{1,\ldots ,K\}\). The master problem then reads as
$$\begin{aligned} {\max _{\varPsi ,\lambda }} \quad&\varPsi \nonumber \\ \text {s.t.}\quad&\bigvee _{k=1}^K \left( \sum _{i \in {\mathscr {M}}} \lambda ^k_i g_i({{\bar{x}}}) \ge \varPsi \right)&\text { for all }{{\bar{x}}} \in {\mathscr {X}}, \nonumber \\&\left\| \lambda ^k\right\| _1 \le 1,\, \lambda ^k \in {\mathbb {R}}_+^m&\text { for all }k \in \{1,\ldots ,K\} , \end{aligned}$$
(11)
where \({\mathscr {X}}\) is the set of generated points of the sub-problems.
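For intuition, the objective that (11) maximizes can be evaluated directly for a fixed candidate \((\lambda ^1,\ldots ,\lambda ^K)\): it is the smallest, over the generated points, of the largest aggregated-constraint activity. A minimal sketch (the callable `g`, returning the vector \((g_i({{\bar{x}}}))_{i \in {\mathscr {M}}}\), is a hypothetical stand-in):

```python
def min_violation(lambdas, points, g):
    """Smallest over all generated points of the largest aggregated activity
    max_k sum_i lambda^k_i g_i(x).  This is the quantity Psi maximized by the
    master problem: it is positive iff every point in `points` is cut off by
    at least one of the K aggregations."""
    def activity(lam, gx):
        return sum(l * gi for l, gi in zip(lam, gx))
    return min(max(activity(lam, g(x)) for lam in lambdas) for x in points)
```

With \(K = 1\) this reduces to the single-aggregation violation used in Algorithm 1; for \(K \ge 2\), a point only needs to violate one of the K aggregated constraints.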
Solving the master problem The disjunction in the master problem (11) encodes \(K^{|{\mathscr {X}}|}\) many LPs, so an enumeration approach for tackling the disjunction is impractical. Instead, we use a so-called big-M formulation, which enables us to solve (11) with moderate running times. The MILP formulation of (11) reads
$$\begin{aligned} {\max _{\varPsi ,\lambda } }\quad&\varPsi \nonumber \\ \text {s.t.}\quad&\sum _{i \in {\mathscr {M}}} \lambda ^k_i g_i({{\bar{x}}}) \ge \varPsi - M (1 - z_k^{{{\bar{x}}}})&\text { for all }k \in \{1,\ldots ,K\},\, {{\bar{x}}} \in {\mathscr {X}}, \nonumber \\&\sum _{k = 1}^{K} z_k^{{{\bar{x}}}} = 1&\text { for all }{{\bar{x}}} \in {\mathscr {X}}, \nonumber \\&z_k^{{{\bar{x}}}} \in \{0,1\}&\text { for all }k \in \{1,\ldots ,K\}, \, {{\bar{x}}} \in {\mathscr {X}}, \nonumber \\&\left\| \lambda ^k\right\| _1 \le 1,\, \lambda ^k \in {\mathbb {R}}_+^m&\text { for all }k \in \{1,\ldots ,K\} , \end{aligned}$$
(12)
where M is a large constant. A binary variable \(z_k^{{{\bar{x}}}}\) indicates whether the k-th disjunction of (11) is used to cut off the point \({{\bar{x}}} \in {\mathscr {X}}\). Due to the normalization \(\left\| \lambda ^k\right\| _1 \le 1\), it is possible to bound M by \(\max _{i \in {\mathscr {M}}} |g_i({{\bar{x}}})|\). Moreover, since the optimal \(\varPsi \) values of (12) are non-increasing, we can also use the optimal value \(\varPsi _{prev}\) of the previous iteration as a bound. Thus, M can be bounded by \(\min \{\max _{i \in {\mathscr {M}}} |g_i({{\bar{x}}})|, \varPsi _{prev}\}\).
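The resulting bound on M can be computed per point \({{\bar{x}}}\); a small helper illustrating the rule \(M = \min \{\max _{i} |g_i({{\bar{x}}})|, \varPsi _{prev}\}\):

```python
def big_m_bound(g_values, psi_prev=None):
    """Valid big-M constant for a point x̄ in formulation (12).

    Because ||lambda^k||_1 <= 1, the activity sum_i lambda^k_i g_i(x̄) lies in
    [-max_i |g_i(x̄)|, max_i |g_i(x̄)|]; the optimal Psi of the previous
    iteration, if available, gives a second (often tighter) bound."""
    m = max(abs(v) for v in g_values)
    return m if psi_prev is None else min(m, psi_prev)
```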
Remark 2
Big-M formulations are typically not considered strong in MILPs, given their usually weak LP relaxations. Other formulations in extended spaces can yield better theoretical guarantees (see, e.g., [6, 10, 62]). The drawback of these extended formulations is that they require adding copies of the \(\lambda \) variables for each disjunction, whose number in our case grows rapidly.
In [61], the author proposes an alternative that does not create variable copies, but that can be costly to construct unless special structure is present.
In our case, as we discuss in Sect. 5.2.2, we do not require a tight LP relaxation of (11) and thus we opted to use (12).
The algorithm for the K-surrogate dual problem is stated in Algorithm 2. For details on the exact meaning of “subject to \(\epsilon \)-feasibility”, please see the discussion following Algorithm 1. The following example shows that Algorithm 2 can compute significantly better dual bounds than Algorithm 1.
Example 4
We briefly discuss the results of Algorithm 2 for the genpooling_lee1 instance from MINLPLib. After preprocessing, the instance consists of 20 nonlinear and 59 linear constraints, and 9 binary and 40 continuous variables. The classic surrogate dual, i.e., \(K=1\), could be solved to optimality, whereas for \(K=2\) and \(K=3\) the algorithm hit the iteration limit. Nevertheless, the dual bound \(-4829.6\) achieved for \(K=2\) and the dual bound \(-4864.87\) for \(K=3\) are significantly better than the dual bound of \(-5246.0\) for \(K=1\), see Fig. 5.
Convergence
In the following, we show that the dual bounds obtained by Algorithm 2 converge to the optimal value of the K-surrogate dual. The idea of the proof is similar to the one presented in [38] for the case of \(K=1\) and linear constraints.
Theorem 2
Denote by \(((\lambda ^t, \varPsi ^t))_{t \in {\mathbb {N}}}\) the sequence of values obtained after solving (11) in Algorithm 2 for \(\epsilon = 0\). Let OPT be the optimal value of the K-surrogate dual (8). The algorithm either
- (a) terminates in T steps, in which case \(\max _{1 \le t \le T} S^{K}(\lambda ^t) = OPT\), or
- (b) \(\sup _{t \ge 1} S^{K}(\lambda ^t) = OPT\).
Proof
As in Proposition 1, we denote the k-th sub-vector of \(\lambda ^t\) as \(\lambda ^{t,k}\). Let \(x^t \in X\) be an optimal solution obtained from solving \(S^{K}(\lambda ^t)\) at iteration t.
(a) If the algorithm terminates after T iterations, i.e., \(\varPsi ^T = 0\), then for any choice \(\lambda \in {\mathbb {R}}^{K m}\), at least one of the points \(x^1,\ldots ,x^T\) is feasible for \(S^{K}(\lambda )\). This implies \(OPT=\max _{1 \le t \le T} \{ c^{\mathsf {T}}x^t \}\).
(b) Now assume that the algorithm does not terminate in a finite number of steps, i.e., \(\varPsi ^t > 0\) for all \(t \ge 1\). Since \((\varPsi ^t)_{t\in {\mathbb {N}}}\) is a non-increasing sequence bounded below by 0, it converges to a value \(\varPsi ^* \ge 0\). The same holds for any subsequence of \((\varPsi ^t)_{t\in {\mathbb {N}}}\). Furthermore, the sequence \((\lambda ^t,x^t)_{t\in {\mathbb {N}}}\) belongs to a compact set: indeed, \(\left\| \lambda ^t\right\| _1 \le 1\) for all \(t\in {\mathbb {N}}\) and \(X\) is assumed to be compact. Therefore, there exists a convergent subsequence of \((\varPsi ^t, \lambda ^t, x^t)_{t\in {\mathbb {N}}}\). With slight abuse of notation, we denote this subsequence by \((\varPsi ^l, \lambda ^l, x^l)_{l\in {\mathbb {N}}}\). To summarize, we have that
- \(\lim _{l \rightarrow \infty } \varPsi ^l = \varPsi ^* \ge 0\),
- \(\lim _{l \rightarrow \infty } \lambda ^l = \lambda ^*\), and
- \(\lim _{l \rightarrow \infty } x^l = x^*\)

for some \((\varPsi ^*, \lambda ^*, x^*)\).
First, we show \(\varPsi ^* = 0\). Note that \(x^l\) is an optimal solution to \(S^{K}(\lambda ^l)\). This means that \(x^l\) satisfies all aggregation constraints, i.e., \(\sum _{i \in {\mathscr {M}}} {\lambda _{i}^{l,k}} \, g_i(x^l) \le 0\) for all \(k = 1, \ldots , K\), which is equivalent to the inequality \(\max _{1 \le k \le K} \sum _{i \in {\mathscr {M}}} {\lambda _{i}^{l,k}} \, g_i(x^l) \le 0\). After solving (11), we know that \(\varPsi ^l\) is equal to the minimum violation of the disjunction constraints for the points \(x^1,\ldots ,x^{l-1}\). This implies the inequality
$$\begin{aligned} \varPsi ^l = \min _{1 \le t \le l-1} \max _{1 \le k \le K} \sum _{i \in {\mathscr {M}}} {\lambda _{i}^{l,k}} \, g_i(x^t) \le \max _{1 \le k \le K} \sum _{i \in {\mathscr {M}}} {\lambda _{i}^{l,k}} \, g_i(x^{l-1}) , \end{aligned}$$
which uses the fact that the minimum over all points \(x^1,\ldots ,x^{l-1}\) is bounded by the value for \(x^{l-1}\). Both inequalities combined show that
$$\begin{aligned} \max _{1 \le k \le K} \sum _{i \in {\mathscr {M}}} {\lambda _{i}^{l,k}} \, g_i(x^l) \le 0 < {\varPsi ^l} \le \max _{1 \le k \le K} \sum _{i \in {\mathscr {M}}} {\lambda _{i}^{l,k}}\, g_i(x^{l-1}) \end{aligned}$$
for all \(l \ge 2\). Using the continuity of \(g_i\) and the fact that the maximum of finitely many continuous functions is continuous, we obtain
$$\begin{aligned} \max _{1 \le k \le K} \sum _{i \in {\mathscr {M}}} {(\lambda ^*)_{i}^{k}} \, g_i(x^*) \le 0 \le {\varPsi ^*} \le \max _{1 \le k \le K} \sum _{i \in {\mathscr {M}}} {(\lambda ^*)_{i}^{k}} \, g_i(x^*) , \end{aligned}$$
which shows \(\varPsi ^* = 0\).
Next, we show that \(\sup _{t \ge 1} S^{K}(\lambda ^t) = OPT\).
Clearly, \(\sup _{t \ge 1} S^{K}(\lambda ^t) \le OPT\).
Let us now prove that \(\sup _{t \ge 1} S^{K}(\lambda ^t) \ge OPT\).
Take any \(\epsilon > 0\) and let \({{\bar{\lambda }}}\) be such that \(S^{K}({{\bar{\lambda }}}) \ge OPT - \epsilon \). Such a \({{\bar{\lambda }}}\) always exists by the definition of OPT. Furthermore, after re-scaling we may assume \(\Vert {{\bar{\lambda }}}^k\Vert _1 \le 1\) for all \(k \in \{1, \ldots , K\}\).
By definition,
$$\begin{aligned} \varPsi ^l \ge \min _{1 \le t \le l-1} \max _{1 \le k \le K} \sum _{i \in {\mathscr {M}}} {{{\bar{\lambda }}}_{i}^k} \, g_i(x^t). \end{aligned}$$
Taking the limit as l goes to infinity, we obtain
$$\begin{aligned} 0 \ge \inf _{t \ge 1} \max _{1 \le k \le K} \sum _{i \in {\mathscr {M}}} {{{\bar{\lambda }}}_{i}^k} \, g_i(x^t). \end{aligned}$$
Let \({{\bar{x}}}\) be \(x^{t_0}\) if the infimum is achieved at some \(t_0\), or the accumulation point \(x^*\) if the infimum is not achieved.
Then,
$$\begin{aligned} \max _{1 \le k \le K} \sum _{i \in {\mathscr {M}}} {{{\bar{\lambda }}}_{i}^k} \, g_i({{\bar{x}}}) \le 0. \end{aligned}$$
This last inequality implies that \({{\bar{x}}}\) is feasible for \(S^{K}({{\bar{\lambda }}})\).
Hence,
$$\begin{aligned} OPT - \epsilon \le S^{K}({{\bar{\lambda }}}) \le c^{\mathsf {T}}{{\bar{x}}} \le \sup _{t \ge 1} S^{K}(\lambda ^t). \end{aligned}$$
Since \(\epsilon > 0\) is arbitrary, we conclude that \(\sup _{t \ge 1} S^{K}(\lambda ^t) \ge OPT\). \(\square \)
The proof of Theorem 2 shows that \((\varPsi ^t)_{t \in {\mathbb {N}}}\) always converges to zero. Therefore, the proof also establishes that Algorithm 2 terminates after a finite number of steps for any \(\epsilon > 0\).
Computational enhancements
We now discuss computational enhancements, additional to those discussed in Sect. 3.2, for improving the performance of Algorithm 2. Due to the increasing complexity of the master problem with each iteration, solving the MILP (11) is for most instances the bottleneck of Algorithm 2. For this reason, most of our computational enhancements are devoted to reducing the computational effort spent in the master problem.
As in the case \(K=1\), we also report techniques that we did not include in our final implementation.
Multiplier symmetry breaking
One difficulty of the K-surrogate dual is that (11) and (12) might contain many equivalent solutions. For example, for any permutation \(\pi \) of the set \(\{1,\ldots ,K\}\), the sub-problem \(S^{K}(\lambda )\) with \(\lambda = (\lambda ^1,\ldots ,\lambda ^K)\) is equivalent to \(S^{K}(\lambda ^{\pi })\) with \(\lambda ^\pi = (\lambda ^{\pi _1},\ldots ,\lambda ^{\pi _K})\). This can heavily impact the solution time of the master problem. We refer to [40] for an overview of symmetry in integer programming. Ideally, in order to break symmetry, we would like to impose \( \lambda ^1 \succeq _{\text {lex}} \lambda ^2 \succeq _{\text {lex}} \ldots \succeq _{\text {lex}} \lambda ^K \), where \(\succeq _{\text {lex}}\) denotes lexicographic order. Such an order can be imposed using linear constraints with additional binary variables and big-M constraints; however, this yields prohibitive running times. Two more efficient alternatives that only partially break symmetries are as follows. First, the constraints
$$\begin{aligned} \lambda ^1_1 \ge \lambda ^2_1 \ge \ldots \ge \lambda ^K_1 \end{aligned}$$
(13)
enforce that \(\lambda ^1,\ldots ,\lambda ^K\) are sorted with respect to the first component. The drawback of this sorting is that if \(\lambda ^k_1 = 0\) for all \(k \in \{1,\ldots ,K\}\), then (13) does not break any of the symmetry of (12). Our second idea for breaking symmetry is to use
$$\begin{aligned} \lambda ^{j}_j \ge \lambda _{j}^k \quad \text { for all }j \in \{1,\ldots ,K-1\} \text { and }k \in \{j+1,\ldots ,K\}. \end{aligned}$$
(14)
In our experiments, we observed that slightly better dual bounds could be computed when using (13) instead of (14). However, the overall impact was not significant, and we decided not to include symmetry breaking in our final implementation.
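For concreteness, the two partial symmetry-breaking rules can be stated as simple feasibility checks on a candidate \((\lambda ^1,\ldots ,\lambda ^K)\), here stored as a list of K coefficient lists (a sketch; for (14) we assume \(m \ge K-1\)):

```python
def satisfies_first_component_order(lambdas):
    """Constraint (13): the K multiplier vectors are sorted non-increasingly
    by their first component."""
    first = [lam[0] for lam in lambdas]
    return all(a >= b for a, b in zip(first, first[1:]))

def satisfies_diagonal_dominance(lambdas):
    """Constraint (14): lambda^j_j >= lambda^k_j for all j < k
    (0-based indices here, matching Python list indexing)."""
    K = len(lambdas)
    return all(lambdas[j][j] >= lambdas[k][j]
               for j in range(K - 1) for k in range(j + 1, K))
```

In a MILP model these checks become linear constraints on the \(\lambda \) variables; note that (13) breaks no symmetry when all first components are zero, which motivates the second rule.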
Early stopping of the master problem
Solving (12) to optimality in every iteration of Algorithm 2 is computationally expensive for \(K\ge 2\). On the one hand, the true optimal value of \(\varPsi \) is needed to decide whether the algorithm has terminated. On the other hand, to ensure progress of the algorithm it suffices to compute a feasible point \((\varPsi ,\lambda ^1,\ldots ,\lambda ^K)\) of (12) with \(\varPsi > 0\). We balance these two opposing goals with the following early stopping method.
While solving (12), we have access to both a valid dual bound \(\varPsi _{d}\) and a primal bound \(\varPsi _{p}\) such that the optimal \(\varPsi \) is contained in \([\varPsi _p,\varPsi _d]\). Note that the primal bound can be assumed to be nonnegative, as the vector of zeros is always feasible for (12). Let \(\varPsi ^t_d\) and \(\varPsi ^t_p\) be the dual and primal bounds obtained from the master problem in iteration t of Algorithm 2. We stop the master problem in iteration \(t+1\) as soon as \(\varPsi ^{t+1}_p \ge \alpha \varPsi ^t_d\) holds for a fixed \(\alpha \in (0,1]\). The parameter \(\alpha \) controls the trade-off between proving a good dual bound \(\varPsi ^{t+1}_d\) and saving time when solving the master problem. On the one hand, \(\alpha = 1\) implies
$$\begin{aligned} \varPsi ^{t+1}_p \ge \alpha \varPsi ^{t}_d \ge \alpha \varPsi ^{t+1}_d = \varPsi ^{t+1}_d , \end{aligned}$$
which can only be true if \(\varPsi ^{t+1}_p = \varPsi ^{t+1}_d\) holds. This equality proves optimality of the master problem in iteration \(t+1\). On the other hand, setting \(\alpha \) close to zero means that we would stop as soon as a non-trivial feasible solution to the master problem has been found. In our experiments, we observed that setting \(\alpha \) to 0.2 performs well.
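The stopping rule itself is a one-line test; a sketch with the default \(\alpha = 0.2\) from our experiments:

```python
def stop_master_early(psi_p_new, psi_d_prev, alpha=0.2):
    """Early-stopping rule: interrupt the master problem in iteration t+1 as
    soon as its primal bound reaches the fraction alpha of the previous
    iteration's dual bound.  alpha = 1 only permits stopping at optimality
    (since the dual bounds are non-increasing); a small alpha accepts any
    non-trivial feasible solution."""
    return psi_p_new >= alpha * psi_d_prev
```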
Constraint filtering
Another potential way of alleviating the computational burden of solving the master problem is to reduce the set of nonlinear constraints to only those that are needed for a good quality solution of (8). Of course, this set of constraints is unknown in advance and is challenging to compute because of the nonconvexity of the MINLP.
We tested different heuristics to preselect nonlinear constraints. We used the violation of the constraints with respect to the LP, MILP, and convex NLP relaxations of the MINLP as measures of the “importance” of nonlinear constraints. We also used the connectivity of nonlinear constraints in the variable-constraint graph for discarding some constraints. Unfortunately, we could not identify a good filtering rule that results in strong bounds for (8).
However, we were able to find a way of reducing the number of constraints considered in the master problem without imposing a strong a-priori filter: an adaptive filtering, which we call support stabilization. We specify this next.
Support stabilization
Direct implementations of Benders-based algorithms, much like column generation approaches, are known to suffer from convergence issues. Deriving “stabilization” techniques that avoid, among other issues, oscillations of the \(\lambda \) variables and tailing-off effects is a common goal for improving performance, see, e.g., [4, 16, 60].
We developed a support stabilization technique to address the exponential increase in complexity of the master problem (12) and to prevent the oscillations of the \(\lambda \) variables. Once Algorithm 2 finds a multiplier vector that improves the overall dual bound, we restrict the support to that of the improving dual multiplier. This restricts the search space and drastically improves solution times. Once stalling is detected (which corresponds to finding a local optimum of (8)), we remove the support restriction until another multiplier vector that improves the dual bound is found. This technique not only improves solution times, but also leads to better bounds on (8) in fewer iterations.
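A sketch of how the restricted support could be extracted from an improving multiplier vector (the representation as a list of K coefficient lists is our illustrative choice):

```python
def restricted_support(lambdas, tol=1e-9):
    """Index sets of the non-zero entries of an improving multiplier vector.
    While support stabilization is active, the master problem optimizes only
    over these entries and fixes all other lambda variables to zero; the
    restriction is lifted again once the bound stalls."""
    return [{i for i, v in enumerate(lam) if abs(v) > tol} for lam in lambdas]
```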
Trust-region stabilization
While the previous stabilization alleviates some of the computational burden in the master problem, the non-zero entries of the \(\lambda \) vectors can (and, in practice, do) vary significantly from iteration to iteration. To remedy this, we incorporated a classic stabilization technique: a box trust-region stabilization, see [14]. Given a reference solution \(({\hat{\lambda }}^1, \ldots , {\hat{\lambda }}^K)\), we impose the following constraint in (12)
$$\begin{aligned} \Vert (\lambda ^1, \ldots , \lambda ^K) - ({\hat{\lambda }}^1, \ldots , {\hat{\lambda }}^K) \Vert _\infty \le \delta \end{aligned}$$
for some parameter \(\delta \). This prevents the \(\lambda \) variables from oscillating excessively. In addition, by removing the trust region when stalling is detected, we are able to preserve the convergence guarantees of Theorem 2. In our implementation, we keep \(({\hat{\lambda }}^1, \ldots , {\hat{\lambda }}^K)\) fixed until we obtain a bound improvement or the algorithm stalls. When either happens, we remove the box and compute a new \(({\hat{\lambda }}^1, \ldots , {\hat{\lambda }}^K)\) with (12) without any stabilization added.
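As a sketch, the box condition \(\Vert \lambda - {\hat{\lambda }} \Vert _\infty \le \delta \) amounts to an entrywise check:

```python
def within_trust_region(lambdas, ref_lambdas, delta):
    """Box trust region in the infinity norm: every entry of every lambda^k
    may deviate from the corresponding entry of the reference multiplier by
    at most delta."""
    return all(abs(v - r) <= delta
               for lam, ref in zip(lambdas, ref_lambdas)
               for v, r in zip(lam, ref))
```

In the MILP (12), this condition becomes simple bound constraints \({\hat{\lambda }}^k_i - \delta \le \lambda ^k_i \le {\hat{\lambda }}^k_i + \delta \), which is why the box variant is computationally cheap.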
Remark 3
We also tested another stabilization technique, inspired by the column-generation smoothing of [65] and [47]. Let \(\lambda ^{best}\) be the best primal solution found so far and let \(\lambda ^{new}\) be the solution of the current master problem. Instead of using \(\lambda ^{new}\) as the new multiplier vector, we choose a convex combination of \(\lambda ^{best}\) and \(\lambda ^{new}\).
While this stabilization technique improved the performance of Algorithm 2 with respect to the algorithm with no stabilization, it performed significantly worse than the trust-region stabilization. Therefore, we did not include it in our final implementation.
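For illustration, the smoothing step is an entrywise convex combination (\(\beta \) is the smoothing weight; its tuning is not specified here):

```python
def smoothed_multiplier(lam_best, lam_new, beta):
    """Convex combination beta*lam_best + (1-beta)*lam_new of the incumbent
    and the latest master solution, as in column-generation smoothing."""
    return [[beta * b + (1 - beta) * n for b, n in zip(bk, nk)]
            for bk, nk in zip(lam_best, lam_new)]
```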