On generalized surrogate duality in mixed-integer nonlinear programming

The most important ingredient for solving mixed-integer nonlinear programs (MINLPs) to global ϵ\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\epsilon $$\end{document}-optimality with spatial branch and bound is a tight, computationally tractable relaxation. Due to both theoretical and practical considerations, relaxations of MINLPs are usually required to be convex. Nonetheless, current optimization solvers can often successfully handle a moderate presence of nonconvexities, which opens the door for the use of potentially tighter nonconvex relaxations. In this work, we exploit this fact and make use of a nonconvex relaxation obtained via aggregation of constraints: a surrogate relaxation. These relaxations were actively studied for linear integer programs in the 70s and 80s, but they have been scarcely considered since. We revisit these relaxations in an MINLP setting and show the computational benefits and challenges they can have. Additionally, we study a generalization of such relaxation that allows for multiple aggregations simultaneously and present the first algorithm that is capable of computing the best set of aggregations. We propose a multitude of computational enhancements for improving its practical performance and evaluate the algorithm’s ability to generate strong dual bounds through extensive computational experiments.


Introduction
We consider a mixed-integer nonlinear program (MINLP) of the form where X := {x ∈ R n− p × Z p | Ax ≤ b} is a compact mixed-integer linear set, each g i : R n → R is a factorable continuous function [41], and M := {1, . . . , m} denotes the index set of nonlinear constraints. Such a problem is called nonconvex if at least one g i is nonconvex, and convex otherwise. Many real-world applications are inherently nonlinear, and can be formulated as a MINLP. See, e.g., [28] for an overview.
The state-of-the-art algorithm for solving nonconvex MINLPs to globaloptimality is spatial branch and bound, see, e.g., [30,51,52], whose performance highly depends on the tightness of the relaxations used. Those relaxations are typically convex, and are iteratively refined by branching, cutting planes, and variable bound tightening, e.g., feasibility-and optimality-based bound tightening [8,50].
As a result of the rapid progress during the last decades, current solvers can often handle a moderate presence of nonconvex constraints efficiently. This progress opens the door for practical use of potentially tighter nonconvex relaxations in MINLP solvers. One example are MILP relaxations, see [11,43,68]. In this paper, we go one step further and explore a nonconvex relaxation referred to as surrogate relaxation [21].

Definition 1 (Surrogate relaxation)
For a given λ ∈ R m + , we call the following optimization problem a surrogate relaxation of (1) Let F ⊆ R n be the feasible region of the original problem (1), and let S λ ⊆ R n be the feasible region of a surrogate relaxation given in (2). Throughout the whole paper we assume that F is not empty. Clearly, F ⊆ S λ holds for every λ ∈ R m + , and as such (2) provides a valid lower bound of (1). Moreover, solving (2) might be computationally more convenient than solving the original problem (1), since there is only one nonconvex constraint in S(λ). Note that one may turn S λ into a continuous relaxation of (1) by removing the integrality restrictions. However, in this work we purposely choose to retain integrality in the relaxation and compare directly to the optimal value of (1) as opposed to its continuous relaxation.
The quality of the bound provided by S(λ) may be highly dependent on the value of λ, and therefore it is natural to consider the surrogate dual problem in order to obtain the tightest surrogate relaxation.
Definition 2 (Surrogate dual). The following optimization problem is the surrogate dual of (1): The function S is only lower semi-continuous [27], thus, there might be no λ such that S(λ) is equal to (3). The surrogate dual is closely related to the well-known Lagrangian dual max λ∈R m but always results in a bound that is at least as good [27,35], i.e., Figure 1 shows the difference between S(λ) and L(λ) on the two-dimensional instance of Example 1 below. In contrast to L : R m + → R, which is a continuous and concave function, S : R m + → R is only quasi-concave [27] and is, in general, discontinuous. As shown in Fig. 1, the main difficulty in optimizing S(λ) is that the function is most of the time "flat", and leads to nontrivial dual bounds for only a small subset of the λ-space. Contribution In this paper, we revisit surrogate duality in the context of mixedinteger nonlinear programming. To the best of our knowledge, surrogate relaxations have never been considered for solving general MINLPs. The first contribution of the paper is an algorithm capable of solving a generalized surrogate dual that allows for multiple aggregations of the nonlinear constraints simultaneously. We also prove its convergence guarantees. Secondly, we present computational enhancements to make the algorithm practical. Our developed algorithm allows us to compare the performance of the classical surrogate relaxation with the generalized one. Finally, we provide an exhaustive computational analysis using publicly available benchmark instances, and we show the practical utility of our proposed approach.
Structure The rest of the paper is organized as follows. In Sect. 2, we present a literature review of surrogate duality. Sect. 3 discusses an algorithm from the literature for solving the classic surrogate dual problem and our new computational enhancements. In Sect. 4, we review a generalization of surrogate relaxations from the literature. Afterward, in Sect. 5, we adapt an algorithm for the classic surrogate dual problem to the general case and prove its convergence. An exhaustive computational study on publicly available benchmark instances is given in Sect. 6. Afterward, Sect. 7 presents ideas for future work on how to include surrogate relaxations in the tree search of spatial branch and bound. Section 8 presents concluding remarks.

Background
Surrogate constraints were first introduced by Glover [21] in the context of zero-one linear integer programming problems. He defined the strength of a surrogate constraint according to the dual bound achieved by it-the same notion we use. Extensions were provided by Balas [5] and Geoffrion [20], although under an alternative notion of strength. Glover [22] provided a unified view on these approaches and proposed a generalization where only a subset of constraints are used in an aggregation, leaving the rest explicitly enforced. A theoretical analysis of surrogate duality in a nonlinear setting was first presented by Greenberg and Pierskalla [27]. This includes results such as quasi-concavity of the surrogate relaxation and a comparison with the Lagrangian dual. They also proposed a generalization using multiple disjoint aggregation constraints. A similar generalization allowing multiple aggregations was later proposed by Glover [23]. These were proposed without computational evaluation. Regarding the link with Lagrangian duality, Karwan and Rardin [35] presented necessary conditions for having no gap between the Lagrangian and surrogate duals. Constraint qualification conditions for the surrogate dual have been exhaustively studied, see [23,49,58] and references therein.
The first algorithmic method for solving (3) is attributed to Banerjee [7]. In the context of integer linear programming, he proposed a Benders' approach similar to the one considered by us in order to find the best multipliers. Karwan [34] expanded on this approach, including a refinement of that of Banerjee and subgradient-based methods. Independently, Dyer [17] proposed similar methods to those of Karwan. Karwan and Rardin [35,38] further developed both Benders-and subgradient-based approaches. Other search procedures for solving (3) involve consecutive Lagrangian dual searches [39,53] and heuristics [19].
From a different perspective, Karwan and Rardin [37] described the interplay between the branch-and-bound trees of an integer programming problem and its surrogate relaxations. Later on, Sarin et al. [54] showed how to integrate their Lagrangian-based multiplier search proposed in [53] into branch and bound.
Surrogate constraints have been used in various applications: in primal heuristics for IPs [24], knapsack problems with a quadratic objective function [15], the job shop problem [18], generalized assignment problems [46], among others. We refer the reader to [3,25] for reviews on surrogate duality methods including other applications and alternative methods for generating surrogate constraints not based on aggregations.
To the best of our knowledge, the efforts for practical implementations of multiplier search methods have mainly focused on linear integer programs. This focus can be explained by the maturity of computational optimization tools available during the time where most of these implementations have been developed. We are only aware of two exceptions: the entropy approach to nonlinear programming (see [59,67]) which uses an entropy-based surrogate reformulation instead of a weighted sum of the constraints; and the work by Nakagawa [45], who considered separable nonlinear integer programming and presented a novel, albeit expensive, algorithm for solving the surrogate dual.
Regarding the generalization of the surrogate dual that considers multiple aggregated constraints, we are not aware of any work considering a multiplier search method with provable guarantees or a computational implementation of a heuristic approach. We are only aware of the discussion by Karwan and Rardin [36] where they argue that the lack of desirable structures (such as quasi-concavity) may impair some multiplier searches for the surrogate dual generalizations proposed by Greenberg and Pierskalla [27] and Glover [23].

Surrogate duality in MINLPs
While surrogate duality in its broader definition can be applied to any MINLP, to the best of our knowledge, only mixed-integer linear programming problems have been considered for practical applications. Much less attention has been given to the general MINLP case, due to the potential nonconvexity of the resulting problems. Figure 2 illustrates the possible drawbacks and benefits of a nonconvex surrogate relaxation, namely, potentially tight convex relaxations (Fig. 2c), nonconvex ( Fig. 2a and 2d), and even disconnected (Fig. 2b) feasible regions.
In this section, we investigate the trade-off between the computational effort required to solve surrogate relaxations and the quality of the resulting bounds. We review how to solve the surrogate dual (3) via a Benders' algorithm (see [7,17,34]) and how to overcome some of the computational difficulties that arise. As we mentioned in Sect. 2, alternative algorithms for solving the surrogate dual exist. However, we use a Benders' approach because its extension to the generalized surrogate dual problem (Sect. 4) is more direct. It is unclear if, e.g., subgradient-based algorithms can be extended to work for the generalization.

Solving the surrogate dual via Benders
The Benders' algorithm is an iterative approach that alternates between solving a master and a sub-problem. The master problem searches for an aggregation vector λ ∈ R m + and the sub-problem solves (2) to evaluate S(λ). Note that the value of an optimal solutionx of S(λ), i.e., c Tx , is a valid dual bound for (1). To ensure that the Algorithm 1: Algorithm for the surrogate dual.
Input: MINLP of the form (1), threshold > 0 Output: optimal value D ∈ R of the surrogate dual, subject to -feasibility (λ, Ψ ) ← optimal solution of (6) for X 7: end while 8: return D pointx is not considered in later iterations, i.e.,x / ∈ S λ , the Benders' algorithm uses the master problem to compute a new vector λ ∈ R m + that ensures i∈M λ i g i (x) > 0. This can be done by maximizing constraint violation. More precisely, given X the set of previously generated points of the sub-problems, the master problem is the following linear program: The normalization constraint λ 1 ≤ 1 can be added without loss of generality, and it is used to avoid unboundedness of (6). The resulting scheme, formalized in Algorithm 1, terminates once the solution value of (6) is smaller than a fixed value ≥ 0. Note that, with this stopping criterion, the best bound D found by the algorithm is "subject to -feasibility". This means that when the algorithm stops, it holds that If a numerical feasibility tolerance of is used in a solver (which is usually the case), then for every surrogate relaxation, somex ∈ X would be considered feasible and thus prevent the dual bound from improving. The consideration of here is meant mostly for accommodating these numerical tolerances. An illustration of the algorithm for Example 1 is given in Fig. 3.

Remark 1
Instead of finding an aggregation vector that maximizes the violation of all points in X , Dyer [17] uses an interior point for the polytope in (6). This can be achieved by scaling Ψ in each constraint of (6) depending on the values g i (x) for each i ∈ M . In our experiments, however, we have observed that maximizing the violation significantly improves the quality of the computed dual bounds.
Using the analysis in [7], Karwan [34] proved the following in the linear case. We prove a stronger version of this theorem in Sect. 5 that also works for nonlinear constraints. This stronger result, in particular, shows finite convergence of the algorithm for > 0. Note that the convergence of the algorithm only relies on the solution of an LP and a nonconvex problem S(λ), and does not make any assumption on the nature of S(λ) besides the fact that it can be solved.

Algorithmic enhancements
In this section, we present computational enhancements that speed up Algorithm 1 and improve the quality of the dual bound that can be achieved from (3). For the sake of completeness, we also discuss techniques that have been tested but did not improve the quality of the computed dual bounds significantly.

Refined MILP relaxation
Instead of only using the initial linear constraints Ax ≤ b to define X in (1), we exploit a linear programming (LP) relaxation of (1) that is available in LP-based spatial branch and bound. This relaxation contains Ax ≤ b but also linear constraints that have been derived from, e.g., integrality restrictions of variables (e.g., MIR cuts [48] and Gomory cuts [26]), gradient cuts [32], RLT cuts [56], SDP cuts [57], or other valid underestimators for each g i with i ∈ M . Using a linear relaxation A x ≤ b with in the definition of S λ improves the value of (3) because a relaxed version of the Another way to strengthen the linear relaxation is to make use of objective cutoff information. If there is a feasible, but not necessarily optimal, solution x * to (1), the linear relaxation can be strengthened with the inequality c T x ≤ c T x * . This inequality preserves all optimal solutions of (1) and can improve the performance when solving (3) (for example, an objective cutoff can lead to dual reductions found by reduced-cost strengthening).

Dual objective cutoff in the sub-problem
An undesired phenomenon in Algorithm 1 is that the sequence of dual bounds provided by c Tx in Step 3 might not be monotone. One way to overcome this problem is to add a dual objective cutoff c T x ≥ D to the sub-problem S(λ), thus enforcing monotonicity in the sequence of dual bounds. This does not change the convergence/correctness guarantees of Algorithm 1 and it can improve the progress of the subsequent dual bounds. Moreover, such cutoff can be used to filter the set X and thus reduce the size of the LP (6). For example, in Fig. 3 the two last iterations could be avoided.
However, this cutoff has an unfortunate drawback: it increases degeneracy, which affects essential components of a branch-and-bound solver, e.g., pseudocost branching [9]. In the case of Algorithm 1, adding this cutoff significantly increases the time for solving the sub-problem, resulting in an overall negative effect on the algorithm. We confirmed this with extensive computational experiments and decided not to include this feature in our final implementation.
Fortunately, we can still carry dual information through different iterations and improve the performance of the algorithm.

Early stopping in the sub-problem
One important ingredient to speed up Algorithm 1, proposed by Karwan [34] and Dyer [17] independently, is an early stopping criterion while solving S(λ). In our setting, problem S(λ) is the bottleneck of Algorithm 1.
Assume that Algorithm 1 proved a dual bound D in some previous iteration. It is possible to stop the solving process of S(λ) if a pointx ∈ S λ with c Tx ≤ D has been found. The pointx both provides a new inequality for (6) violated by λ (asx ∈ S λ ) and shows S(λ) ≤ D, i.e., λ will not lead to a better dual bound. All convergence and correctness statements regarding Theorem 1 remain valid after this modification.
Furthermore, we can apply the same idea for any choice of D. In this scenario, D would act as a target dual bound that we want to prove. Since the Benders' algorithm is computationally expensive, one might require a minimum improvement in the dual bound. Empirically, we observed that solving S(λ) to global optimality for difficult MINLPs requires a lot of time. However, finding a good quality solution for S(λ) is usually fast. This allows us to early stop most of the sub-problems and only spend time on those sub-problems that will likely result in a dual bound that is at least as good as the target value D.

Initial empirical observations
For the implementation of Algorithm 1, we use the MINLP solver SCIP 1. to construct a linear relaxation A x ≤ b for (1), 2. to find a feasible solution x * to (1) to be used as an objective cutoff c T x ≤ c T x * when solving S(λ), and 3. to use it as a black box to solve each S(λ) sub-problem.
We provide extensive details of our implementation and the results in Sect. 6, but we would like to summarize some important observations to the reader.
Our proposed algorithmic enhancements proved to be key for obtaining a practical algorithm for the surrogate dual, especially the use of a refined MILP relaxation. The achieved dual bounds by only using the initial linear relaxation in Algorithm 1 were almost always dominated by the dual bounds obtained by the refined MILP relaxation. Thus, utilizing the refined MILP relaxation seems mandatory for obtaining strong surrogate relaxations. Our computational study in Sect. 6 shows that our algorithmic enhancements of Algorithm 1 allows us to compute dual bounds that close on average 35.0% more gap (w.r.t. the best known primal bound) than the dual bounds obtained by the refined MILP relaxations, i.e., S(0), on 469 affected instances. While the overall impact of this "classic" surrogate duality is positive, we observed that the dual bound deteriorates with increasing number of nonlinear constraints; aggregating a large number of nonconvex constraints into a single constraint may not capture the structure of the underlying MINLP. For this reason, in the next Section we propose the use of a generalized surrogate relaxation for solving MINLPs, which includes multiple aggregation constraints.

Generalized surrogate duality
In the following, we discuss a generalization of surrogate relaxations that has been introduced by [23]. Instead of a single aggregation, it allows for K ∈ N aggregations of the nonlinear constraints of (1). The nonnegative vector λ = (λ 1 of the nonlinear constraints. Note that, for convenience, we are denoting as λ k i the component λ (k−1)m+i of the λ vector, as it can also be seen as the i-th component of the λ k sub-vector. For a vector λ ∈ R K m + , the feasible region of the K -surrogate relaxation is given by the intersection where S λ k is the feasible region of the surrogate relaxation S(λ k ) for λ k ∈ R m + . It clearly follows that S K λ is a relaxation for (1). The best dual bound for (1) generated by a K -surrogate relaxation is given by which we call the K -surrogate dual. Note that scaling each λ k ∈ R m + individually by a positive scalar does not affect the value of S K (λ), i.e., for any α > 0. Therefore, it is possible to impose additional normalization constraints λ k 1 ≤ 1 for each k ∈ {1, . . . , K }. In [27], a related generalization was proposed, although not computationally tested. The paper considers a partition of constraints which are aggregated; equivalently, the support of sub-vectors λ k are assumed to be fixed and disjoint. Glover's generalization [23] does not make any assumption on the structure of the λ k sub-vectors. As we argue below, this makes a significant difference for two reasons: (a) selecting the "best" partition of constraints a-priori is a challenging task and (b) restricting the support of sub-vectors λ k to be disjoint can weaken the bound given by (8).
The function S K remains lower semi-continuous for any choice of K . The proof idea is similar to the one given by Glover [23] for the case of K = 1.

Proposition 1 If g i is continuous for every i ∈ M and X is compact, then S K :
We need to show that Since X is compact, there exists a subsequence (x l ) l∈N of (x τ ) τ ∈N such that lim l→∞ x l = x * . As (λ l ) l∈N is a subsequence of (λ τ ) τ ∈N , we have that Because the g i are continuous and the maximum of continuous functions is still continuous, it follows that max 1≤k≤K i∈M Hence, x * is feasible but not necessarily optimal for S K (λ * ). Therefore, One important difference to the classic surrogate dual is that S K (λ) is no longer quasi-concave, even for the case of K = 2 and two linear constraints, as the following example shows.
Since S K may not be quasi-concave, gradient descent-based algorithms for optimizing (3), as in [34], become inadequate for solving (8). Even though (8) is substantially more difficult to solve than (3), it might be beneficial to consider larger K to obtain tight relaxations for (1). The following proposition formalizes this. We omit its proof as it is elementary.

Proposition 2 The inequality
holds for any K ∈ N. Furthermore, sup λ∈R m 2 + S m (λ) is equal to the value of (1).

Fig. 4
A visualization of Example 2 that shows that S K is in general not quasi-concave for K > 1. The blue line shaded region (left) is the feasible set defined by two original inequalities. The orange region (center) depicts both S 2 λ and S 2 μ , while the green region (right) is their convex combination S 2 (λ+μ)/2 . The example shows S 2 (λ) = S 2 (μ) > S 2 ((λ + μ)/2), which proves that S 2 is not quasi-concave (color figure online) The following example shows that going from K = 1 to K = 2 can have a tremendous impact on the quality of the surrogate relaxation: Example 3 Consider the following NLP with four nonlinear constraints and four unbounded variables: It is easy to see that (0, 0, 0, 0) is the optimal solution. First, we show that the classic surrogate dual is unbounded. For an aggregation λ ∈ R 4 , the sole constraint in the corresponding surrogate relaxation is If either λ 1 = λ 2 or λ 3 = λ 4 , then the relaxation is clearly unbounded, as z and w are free variables with coefficient 0 in the objective. If λ 1 = λ 2 and λ 3 = λ 4 , the aggregation reads 2λ 1 x 3 + 2λ 3 y 2 ≤ 0. We split this case in two: (1) if λ 1 = 0, then (x, y, w, z) = (M, 0, 0, 0) is feasible for any M. (2) Otherwise, we can set (x, y, w, z) = (−M, 2M, 0, 0) and 2λ 1 x 3 + 2λ 3 y 2 = −2λ 1 M 3 + 8λ 3 M 2 . The last expression is negative for large enough M.

An algorithm for the K -surrogate dual
Even though (8) yields a strong relaxation for large K , it is computationally more challenging to solve than (3). To the best of our knowledge, there is no algorithm in the literature known that can solve (8). Due to the missing quasi-concavity property of S K , it is not possible to adjust each aggregation vector independently; an alternatingtype method based on the K = 1 case could provide weak bounds.
In this section, we present the first algorithm for solving (8). The idea of the algorithm is the same as before: a master problem will generate an aggregation vector (λ 1 , . . . , λ K ) and the sub-problem will solve the K -surrogate relaxation corresponding to (λ 1 , . . . , λ K ). The only differences to Algorithm 1 are that we replace the LP master problem by a MILP master problem and solve S K (λ 1 , . . . , λ K ) instead of S(λ).
In the next iteration, we need to make sure thatx is infeasible for at least one of the aggregated constraints. This can be written as a disjunctive constraint As in (6), we replace the strict inequality by maximizing the activity of i∈M λ k i g i (x) for all k ∈ {1, . . . , K }. The master problem then reads as where X is the set of generated points of the sub-problems.
Solving the master problem The disjunction in the master problem (11) is encoding K |X | many LPs, thus an enumeration approach for tackling the disjunction is impractical. Instead, we use a, so-called, big-M formulation. This enables us to solve (11) with moderate running times. Such MILP formulation of (11) reads where M is a large constant. A binary variable zx k indicates if the k-th disjunction of (11) is used to cut off the pointx ∈ X . Due to the normalization λ k 1 ≤ 1, it is possible to bound M by max i∈M |g i (x)|. Even more, since the optimal Ψ values of (12) are non-increasing, we could use the optimal Ψ prev of the previous iteration as a bound on M. Thus, it is possible to bound M by min{max i∈M |g i (x)|, Ψ prev }.

Remark 2
Big-M formulations are typically not considered strong in MILPs, given their usual weak LP relaxations. Other formulations in extended spaces can yield better theoretical guarantees (see, e.g., [6,10,62]). The drawback of these extended formulations is that they require to add copies of the λ variables depending on the number of disjunctions, which in our case is rapidly increasing.
In [61], the author proposes an alternative that does not create variable copies, but that can be costly to construct unless special structure is present.
In our case, as we discuss in Sect. 5.2.2, we do not require a tight LP relaxation of (11) and thus we opted to use (12).
The algorithm for the K -surrogate dual problem is stated in Algorithm 2. For details on the exact meaning of "subject to -feasibility", please see the discussion following Algorithm 1. The following example shows that Algorithm 2 can compute significantly better dual bounds than Algorithm 1.

Algorithm 2:
Algorithm for the K -surrogate dual.
Input: MINLP of the form (1), threshold > 0, K ∈ N aggregations Output: optimal value D ∈ R of the K -surrogate dual, subject to -feasibility (λ, Ψ ) ← optimal solution of (11) for X 7: end while 8: return D Example 4 We briefly discuss the results of Algorithm 2 for the genpooling_lee1 instance from MINLPLib. The instance consists of 20 nonlinear, 59 linear constraints, 9 binary, and 40 continuous variables after preprocessing. The classic surrogate dual, i.e., K = 1, could be solved to optimality, whereas for K = 2 and K = 3 the algorithm hit the iteration limit. Nevertheless, the dual bound −4829.6 achieved for K = 2 and the dual bound −4864.87 for K = 3 are significantly better than the dual bound of −5246.0 for K = 1, see Fig. 5.

Convergence
In the following, we show that the dual bounds obtained by Algorithm 2 converge to the optimal value of the K -surrogate dual. The idea of the proof is similar to the one presented by [38] for the case of K = 1 and linear constraints. (11) in Algorithm 2 for = 0. Let O P T be the optimal value of the K -surrogate dual (8).

The algorithm either (a) terminates in T steps, in which case
Proof As in Proposition 1, we denote the k-th sub-vector of λ t as λ t,k . Let x t ∈ X be an optimal solution obtained from solving S K (λ t ) at iteration t. (a) If the algorithm terminates after T iterations, i.e., Ψ T = 0, then for any choice λ ∈ R K m , there is at least one point x 1 , . . . , x T that is feasible for S K (λ). This implies O PT = max 1≤t≤T {c T x t }. (b) Now assume that the algorithm does not converge in a finite number of steps, i.e., Ψ t > 0 for all t ≥ 1. Since (Ψ t ) t∈N defines a decreasing sequence which is bounded below by 0, it must converge to a value Ψ * ≥ 0. The same holds for any subsequence of (Ψ t ) t∈N . Furthermore, the sequence (λ t , x t ) t∈N belongs to a compact set: indeed, λ t 1 ≤ 1 for all t ∈ N and X is assumed to be compact. Therefore, there exists a subsequence of (Ψ t , λ t , x t ) t∈N that converges. With slight abuse of notation, we denote this subsequence by (Ψ l , λ l , x l ) l∈N . To summarize, we have that -lim l→∞ Ψ l = Ψ * ≥ 0, -lim l→∞ λ l = λ * , and -lim l→∞ x l = x * .
First, we show Ψ * = 0. Note that x l is an optimal solution to S K (λ l ). This means that x l satisfies all aggregation constraints, i.e., i∈M λ l,k i g i (x l ) ≤ 0 for all k = 1, . . . , K , which is equivalent to the inequality max 1≤k≤K i∈M λ l,k i g i (x l ) ≤ 0.
After solving (11), we know that Ψ l is equal to the minimum violation of the disjunction constraints for the points x 1 , . . . , x l−1 . This implies the inequality which uses the fact that the minimum over all points x 1 , . . . , x l−1 is bounded by the value for x l−1 . Both inequalities combined show that max 1≤k≤K i∈M for all l ≥ 0. Using the continuity of g i and the fact that the maximum of finitely many continuous functions is continuous, we obtain Let us now prove that sup t≥1 S K (λ t ) ≥ O PT . Take any > 0 and letλ be such that S K (λ) ≥ O PT − . Suchλ always exists by definition of O PT . Furthermore, via a re-scaling we may assume λ k ≤ 1 for all k ∈ {1, . . . , K }.
By definition, Computing the limit when l goes to infinity, we obtain Letx be x t 0 if the infimum is achieved at t 0 or x * if the infimum is not achieved. Then, This last inequality implies thatx is feasible for S K (λ). Hence, Since > 0 is arbitrary, we conclude that sup t≥1 S K (λ t ) ≥ O PT .
The proof of Theorem 2 shows that (Ψ t ) t∈N always converges to zero. Therefore, the proof also establishes that Algorithm 2 terminates after a finite number of steps for any > 0.

Computational enhancements
We now discuss computational enhancements additional to those discussed in Sect. 3.2 for improving the performance of Algorithm 2. Due to the increasing complexity of the master problem with each iteration, solving the MILP (11) is for most instances the bottleneck of Algorithm 2. For this reason, most of our computational enhancements are devoted to reduce the computational effort spent in the master problem.
As in the case K = 1, we also report techniques that we did not include in our final implementation.

Multiplier symmetry breaking
One difficulty of the K -surrogate dual is that (11) and (12) might contain many equivalent solutions. For example, any permutation π of the set {1, . . . , K } implies that the sub-problem S K (λ) with λ = (λ 1 , . . . , λ K ) is equivalent to S K (λ π ) with λ π = (λ π 1 , . . . , λ π K ). This can heavily impact the solution time of the master problem. We refer to [40] for an overview of symmetry in integer programming. Ideally, in order to break symmetry, we would like to impose λ 1 lex λ 2 lex . . . lex λ K where lex indicates a lexicographical order. Such order can be imposed using linear constraints with additional binary variables and big-M constraints. However, this yields prohibitive running times. Two more efficient alternatives that only partially break symmetries, are as follows. First, the constraints enforce that λ 1 , . . . , λ K are sorted with respect to the first component. The drawback of this sorting is that if λ k 1 = 0 for all k ∈ {1, . . . , K }, then (13) does not break any of the symmetry of (12). Our second idea for breaking symmetry is to use In our experiments, we observed that slightly better dual bounds could be computed when using (13) instead of (14). However, the overall impact was not significant, and we decided to not include this in our final implementation.

Early stopping of the master problem
Solving (12) to optimality in every iteration of Algorithm 2 is computationally expensive for K ≥ 2. On one hand, the true optimal value of Ψ is needed to decide whether the algorithm terminated. On the other hand, to ensure progress of the algorithm it is enough to only compute a feasible point (Ψ , λ 1 , . . . , λ K ) of (12) with Ψ > 0. We balance these two opposing forces with the following early stopping method.
While solving (12), we have access to both a valid dual bound Ψ d and primal bound Ψ p such that the optimal Ψ is contained in [Ψ p , Ψ d ]. Note that the primal bound can be assumed to be nonnegative as the vector of zeros is always feasible for (12). Let Ψ t d and Ψ t p be the primal and dual bounds obtained from the master problem in iteration t of Algorithm 2. We stop the master problem in iteration t + 1 as soon as Ψ t+1 p ≥ αΨ t d holds for a fixed α ∈ (0, 1]. The parameter α controls the trade-off between proving a good dual bound Ψ t+1 d and saving time for solving the master problem. On the one hand, α = 1 implies which can only be true if Ψ t+1 holds. This equality proves optimality of the master problem in iteration t + 1. On the other hand, setting α close to zero means that we would stop as soon as a non-trivial feasible solution to the master problem has been found. In our experiments, we observed that setting α to 0.2 performs well.

Constraint filtering
Another potential way of alleviating the computational burden of solving the master problem, is to reduce the set of nonlinear constraints to only those that are needed for a good quality solution of (8). Of course, this set of constraints is unknown in advance and challenging to compute because of the nonconvexity of the MINLP.
We tested different heuristics to preselect nonlinear constraints. We used the violation of the constraints with respect to the LP, MILP, and convex NLP relaxation of the MINLP, as measures of "importance" of nonlinear constraints. We also used the connectivity of nonlinear constraints in the variable-constraint graph 1 for discarding some constraints. Unfortunately, we could not identify a good filtering rule that results in strong bounds for (8).
However, we were able to find a way of reducing the number of constraints considered in the master problem without imposing a strong a-priori filter: an adaptive filtering, which we call support stabilization. We specify this next.

Support stabilization
Direct implementations of Benders-based algorithms, much like column generation approaches, are known to suffer from convergence issues. Deriving "stabilization" techniques that can avoid oscillations of the λ variables and tailing-off effects, among others, are a common goal for improving performance, see, e.g., [4,16,60].
We developed a support stabilization technique to address the exponential increase in complexity of the master problem (12) and to prevent the oscillations of the λ variables. Once Algorithm 2 finds a multiplier vector that improves the overall dual bound, we restrict the support to that of the improving dual multiplier. This restricts the search space and drastically improves solution times. Once stalling is detected (which corresponds to finding a local optimum of (8)), we remove the support restriction until another multiplier vector that improves the dual bound is found. This technique not only improves solution times, but also leads to better bounds on (8) in fewer iterations.

Trust-region stabilization
While the previous stabilization alleviates some of the computational burden in the master problem, the non-zero entries of the λ vectors can (and do, in practice) vary significantly from iteration to iteration. To remedy this, we incorporated a classic stabilization technique: a box trust-region stabilization, see [14]. Given a reference solution (λ 1 , . . . ,λ k ), we impose the following constraint in (12) for some parameter δ. This prevents the λ variables from oscillating excessively. In addition, by removing the trust-region when stalling is detected, we are able to preserve the convergence guarantees of Theorem 2. In our implementation, we maintain a fixed (λ 1 , . . . ,λ k ) until we obtain a bound improvement or the algorithm stalls. When any of this happens, we remove the box and compute a new (λ 1 , . . . ,λ k ) with (12) without any stabilization added.

Remark 3
We also tested another stabilization technique inspired by column generation's smoothing by [65] and [47]. Let λ best be the best found primal solution so far and let λ new be the solution of the current master problem. Instead of using λ new as a new multiplier vector, we choose a convex combination between λ best and λ new .
While this stabilization technique improved the performance of Algorithm 2 with respect to the algorithm with no stabilization, it performed significantly worse than the trust-region stabilization. Therefore, we did not include it in our final implementation.

Computational experiments
In this section, we present a computational study of the classic and generalized surrogate duality on publicly available instances of the MINLPLib [42]. We conduct three main experiments to answer the following questions: 1. ROOTGAP: How much of the root gap with respect to the MILP relaxation can be closed by using the K -surrogate dual? 2. BENDERS: How much do the ideas of Sect. 5 improve the performance of Algorithm 2? 3. DUALBOUND: Can Algorithm 2 improve on the dual bounds obtained by the MINLP solver SCIP?
Our ideas are embedded in the MINLP solver SCIP [55]. We refer to [1,63,64] for an overview of the general solving algorithm and MINLP features of SCIP.

Experimental setup
In all experiments, Algorithm 2 is called after the root node has been completely processed by SCIP. All generated and initial linear inequalities are added to S λ .
For the ROOTGAP experiment, we run Algorithm 2 for one hour for each choice of K ∈ {1, 2, 3}. To measure the improvement in using K + 1 instead of K , we use the best found aggregation vector of K as an initial point for K + 1.
In the BENDERS experiment we focus on K = 3 and do not start with an initial point for the aggregation vector. This allows us to better analyze the impact of each proposed enhancement. We compare the following settings: -DEFAULT: Algorithm 2 with all enhancements described in Sects. 3.2 and 5.2.
-PLAIN: Plain version of Algorithm 2 with no enhancement.
-NOSTAB: Same as DEFAULT but without using the trust-region of Sect. 5.2.5 and support stabilization of Sect. 5.2.4. -NOSUPP: Same as DEFAULT but without using the support stabilization.
-NOEARLY: Same as DEFAULT but without using early termination for the master problem, described in Sect. 5.2.2. Each of the five settings uses a time limit of one hour.
Finally, in the DUALBOUND experiment we evaluate how much the dual bounds obtained by SCIP can be improved by Algorithm 2. First, we collect the dual bounds for all instances that could not be solved by SCIP within three hours. Afterward, we apply Algorithm 2 for K = 3, a time limit of three hours, and set a target dual bound (see Sect. 3

.2.3) of
where D is the dual bound obtained by default SCIP and P be the best known primal bound reported in the MINLPLib. This means that we aim for a gap closed reduction of at least 20% and early stop each sub-problem in Algorithm 2 that will provably lead to a smaller reduction.
During all three experiments, we use a gap limit of 10 −4 for each sub-problem of Algorithm 2. Additionally, we chose a dual feasibility tolerance of 10 −8 (SCIP's default is 10 −7 ) and a primal feasibility tolerance of 10 −7 (SCIP's default is 10 −6 ).

Stabilization details
The trust-region and support stabilization have been implemented as follows. Both stabilization methods are applied once an improving aggregation λ * could be found. Each entry λ i with λ * i = 0 is fixed to zero. Otherwise, the domain of λ i is restricted to the interval Once a new improving solution has been found, we update the trust region accordingly. We remove the trust region and support stabilization in case no improving solution could be found for 20 iterations.
Test set We used the publicly available instances of the MINLPLib [42], which at time of the experiments contained 1683 instances. We selected the instances that were available in OSiL format and consisted of nonlinear expressions that could be handled by SCIP, in total 1671 instances. Gap closed We use the gap closed in order to compare dual bounds. Let d 1 ∈ R and d 2 ∈ R be two dual bounds for (1) and p ∈ R a reference primal bound. The gap closed improvement is measured by the function GC : Performance evaluation To evaluate algorithmic performance we use the shifted geometric means, which provide a measure for relative differences. The shifted geometric This measure avoids results being dominated by outliers with large absolute values, and also avoids an over-representation of differences among very small values. See also the discussion in [1,2,29]. As shift values we use 10 seconds for averaging over running time and 5% for averaging over gap closed values.
Hardware and software The experiments were performed on a cluster of 64bit Intel Xeon X5672 CPUs at 3.2 GHz with 12 MB cache and 48 GB main memory. In order to safeguard against a potential mutual slowdown of parallel processes, we ran only one job per node at a time. We used a development version of SCIP with CPLEX 12.8.0.0 as LP solver [31], the graph automorphism package bliss 0.73 [33] for detecting MILP symmetry, the algorithmic differentiation code CppAD 20180000.0 [12] and Ipopt 3.12.11 with Mumps 4.10.0 [44] as NLP solver [13,66].

Computational results
ROOTGAP Experiment From all instances of MINLPLib, we filter those for which SCIP's MILP relaxation proves optimality in the root node, no primal solution is known, or SCIP aborted due to numerical issues in the LP solver. This leaves 633 instances for the ROOTGAP experiment. Figure 6 visualizes the achieved gap closed values via scatter plots. The plots show that for the majority of the instances we can close significantly more gap than the MILP relaxation. There are 173 instances for which K = 2 closes at least 1% more gap than K = 1, and even more gap can be closed using K = 3. There are 21 instances for which K = 1 could not close any gap, but K = 2 could close some. On 11 additional instances K = 3 could close gap, which was not possible with K = 2. Finally, comparing K = 2 and K = 3 shows that on 105 instances K = 3 could close at least 1% more gap than K = 2. Interestingly, for most of these instances K = 2 could already close at least 50% of the root gap.  Aggregated results are reported in Table 1 and we refer to the supplementary material for detailed instance-wise results. First, we observe an average gap reduction of 18.4% for K = 1, 21.4% for K = 2, and 23.4% for K = 3, respectively. The same tendency is true when considering groups of instances that are defined by a bound on the minimum number of nonlinear constraints. For example, for the 391 instances with at least 20 nonlinear constraints after preprocessing, K = 2 and K = 3 close 1.6% and 2.8% more gap than K = 1, respectively. Table 1 also reports results when filtering out the 164 instances for which less than 1% gap was closed by Algorithm 2. We consider these instances unaffected. On the 469 affected instances we close on average up to 46.9% of the gap, and we see that K = 3 closes 4.7% more gap than K = 2 and 11.9% more than K = 1.
These results show that using surrogate relaxations has a significant impact on reducing the root gap. Additionally, using the generalized surrogate dual for K = 2 and K = 3 reduces significantly more gap than the classic surrogate dual. BENDERS Experiment Table 2 reports aggregated results for the BENDERS experiment. We refer to the supplementary material for detailed instance-wise results.
First, we observe that the DEFAULT performs significantly better than PLAIN. Table 2 shows that on 100 of the 457 affected instances DEFAULT closes at least 1% more gap than PLAIN. Only on 36 instances PLAIN closes more gap, but over all The column "M"/"L" reports the number of instances for which DEFAULT could close at least 1% more/less root gap than the settings of the corresponding column. Column "rgc" reports the average root gap closed relative to our default settings (in %). Instances for which no setting could close at least 1% of the root gap are filtered out instances it closes on average 13.1% less gap than DEFAULT. On instances with a larger number of nonlinear constraints, DEFAULT performs even better: on the 107 instances with at least 50 nonlinear constraints, DEFAULT computes 35 times a better and only 1 time a worse dual bound than PLAIN. For these 107 instances, PLAIN closes 25.8% less gap than DEFAULT. Interestingly, there are instances for which PLAIN could not close any gap but DEFAULT could. There is no instance for which the opposite is true. Next, we analyze which components of Algorithm 2 are responsible for the significantly better performance of DEFAULT compared to PLAIN. Table 2 shows that DEFAULT dominates NOSTAB, NOSUPP, and NOEARLY with respect to the average gap closed and the difference between the number of wins and the number of losses on each subset of the instances. The most important component is the early termination of the master problem. By disabling this feature, Algorithm 2 closes 11.8% less gap on all instances and even 23.7% on those which have at least 50 nonlinear constraints.
Regarding both stabilization techniques, Table 2 seems to suggest that they are not crucial. However, both techniques are important to exploit the λ space in a more structured way, which help us to find better aggregation vectors faster. To visualize this, we use the instance genpooling_lee1 from Example 4. Figure 7 shows the achieved dual bounds in each iteration of Algorithm 2 for DEFAULT and NOSTAB. Both settings run with an iteration limit of 600. First, we observe that the achieved dual bound of −4775.26 with DEFAULT is significantly better than the dual bound of −5006.95 when using NOSTAB. The best dual bound is found after 97 iterations by DEFAULT and after 494 iterations with NOSTAB. The steepest improvement in the dual bounds for DEFAULT is seen between iterations 62 and 112, where the support was fixed and the trust region was enforced. After iteration 112 the algorithm removed both stabilizers and no further dual bound improvement could be found. In the iterations where no stabilization is used by DEFAULT, we do not observe any pattern indicating which of the two settings finds a better dual bound-the behavior seems rather random. This randomness, and the large time limit used, might explain the similar results for DEFAULT and NOSTAB of Table 2. DUALBOUND Experiment For this experiment, we include all instances which could not be solved by SCIP with default settings within three hours, have a final gap of at least ten percent, terminate without an error, and contain at least four nonlinear  constraints. This leaves in total 209 instances for the DUALBOUND experiment. The supplementary material reports detailed results on the subset of instances for which the Algorithm 2 was able to improve on the bound obtained by SCIP with default settings, which was the case for 53 of the 209 instances. On these, the average gap of 284.3% for SCIP with default settings could be reduced to an average gap of 142.8%. Two notable subsets of instances are the challenging polygon* instances and four of the facloc* instances. In these, Algorithm 2 finds better bounds than the reported best known dual bounds from the MINLPLib, as shown in Table 3.

Surrogate duality during the tree search
While we obtain strong dual bounds with Algorithm 2, complex instances still require branching in order to solve them to provable optimality. In this section, we show a promising technique to incorporate Algorithm 2 into spatial branch and bound. We discuss this technique in two instances of MINLPLib, while a full implementation is subject of future work.

Algorithm 3: Surrogate approximation
Input: node v, parent node p, parent aggregation candidates C p ⊆ Λ Output: D ∈ R valid dual bound for v, aggregation candidates end if 9: end for 10: return (D, ∅)

Fig. 8
Visualization of Algorithm 3 for the instance himmel16 of the MINLPLib. The size of the aggregation pool has been limited to three and we used bread-first-search as the node selection strategy. The colors determine |C v | at each node v: green for three, blue for two, orange for one, white for zero aggregations. Red square nodes could be pruned by Algorithm 3, i.e., the proven dual bound exceeds the value of an incumbent solution (color figure online) Since Algorithm 2 is too costly to be used in every node of a branch-and-bound tree, our technique focuses on extracting information of a single execution of Algorithm 2 in the root node, and reuses this information throughout the tree.
Let Λ := {λ 1 , . . . , λ L } ⊆ R K m be the set of aggregation vectors that have been computed during Algorithm 2 in the root node, and that imply a tighter dual bound than the MILP relaxation. Instead of using Algorithm 2 in a node v, we select an aggregation vector λ from Λ and solve S K v (λ), which is equal to S K (λ) except that the global linear relaxation is replaced with a locally valid linear relaxation for v. If S K v (λ) results in a better dual bound than the local MILP relaxation, i.e., S K v (0), then we skip the remaining aggregation vectors in Λ and continue with the tree search. If the dual bound does not improve, then we discard λ in the sub-tree with root v.
The intuition behind discarding aggregations as we search down the tree is twofold. First, since the aggregations are computed in the root node, their ability to provide good dual bounds is expected to deteriorate with the increasing depth of an explored node. Second, we would like to alleviate the computational load of checking for too many aggregations as the branch-and-bound tree-size increases. The idea is stated in Algorithm 3; the algorithm assumes C r = Λ for the root node r . In Fig. 8 we illustrate Algorithm 3 for the instance himmel16.
In principle, Algorithm 3 can lead to stronger dual bounds in local nodes of the branch-and-bound tree, which could result in a smaller tree. However, for the challenging instances of the DUALBOUND experiment we observed that solving S K v (λ) is too costly and almost always runs into the time limit. In these cases, Algorithm 3 was not able to improve the dual bound. An exception to this behavior is instance multiplants_mtg1c 2 , which contains 28 nonlinear constraints, 193 continuous variables, and 104 integer variables. SCIP with default settings proves a dual bound of −4096.04. This bound is improved by Algorithm 2 to −3161.13, and Algorithm 3 can further improve it to −2935.58. This last bound improvement happens in the early stages of the tree search; there were only seven nodes for which a better local dual bound was found, and the aggregation candidates were quickly filtered out. After this, SCIP was not able to improve the dual bound any further. In total, SCIP processed 491860 nodes.

Summary and future work
In this article, we studied theoretical and computational aspects of surrogate relaxations for MINLPs. We developed the first algorithm to solve a generalization of the surrogate dual problem that allows multiple aggregations of nonlinear constraints. We also proposed computational enhancements and discussed a first idea on how to exploit surrogate duality in a spatial branch-and-bound solver.
Our extensive computational study on the heterogeneous set of publicly available instances of the MINLPLib shows that surrogate duality can lead to significantly better dual bounds than using SCIP with default settings. Additionally, our experiments show that the presented computational enhancements are important to obtain good dual bounds for problems with a large number of nonlinear constraints. Finally, our tree experiments show that using the result of Algorithm 2 during the tree search can lead to significantly better dual bounds than solving MINLPs with standard spatial branch and bound, although in this latter direction the research is still preliminary.
There are two open questions related to generalized surrogate duality we would like to highlight. First, consider the case that each constraint of (1) is quadratic, i.e., g i (x) = x T Q i x + q T i x + b i for each i ∈ M . Adding the constraints i∈M λ k i Q i 0 for all k ∈ {1, . . . , K } to the master problem (12) enforces that each sub-problem is a convex mixed-integer quadratically-constrained program. This increases the complexity of the master problem but, at the same time, reduces the complexity of the sub-problems. It would be interesting to understand this trade-off computationally. Second, it remains an open question how a pure surrogate-based spatial branch-and-bound approach could perform in practice. Warm-start strategies based on the generated points X of a parent node and branching rules based on surrogates relaxations are some examples of tools that could be developed in order to successfully incorporate surrogate duality into branch and bound.

Conflict of interest
The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.