A Successive Linear Relaxation Method for MINLPs with Multivariate Lipschitz Continuous Nonlinearities

We present a novel method for mixed-integer optimization problems with multivariate and Lipschitz continuous nonlinearities. In particular, we do not assume that the nonlinear constraints are explicitly given but only that we can evaluate them and that we know their global Lipschitz constants. The algorithm is a successive linear relaxation method in which we alternate between solving a master problem, which is a mixed-integer linear relaxation of the original problem, and a subproblem, which is designed to tighten the linear relaxation of the next master problem by using the Lipschitz information about the respective functions. By doing so, we follow the ideas of Schmidt, Sirvent, and Wollner (Math Program 178(1):449-483 (2019) and Optim Lett 16(5):1355-1372 (2022)) and improve the handling of multivariate constraints. Although multivariate nonlinearities obviously increase modeling capabilities, their incorporation also significantly increases the computational burden of the proposed algorithm. We prove the correctness of our method and also derive a worst-case iteration bound. Finally, we show the generality of the addressed problem class and the proposed method by illustrating that both bilevel optimization problems with nonconvex quadratic lower levels and nonlinear mixed-integer models of gas transport can be tackled by our method. We provide the necessary theory for both applications and briefly illustrate the outcomes of the new method when applied to these two problems.


Introduction
Mixed-integer nonlinear optimization problems (MINLPs) form one of today's most important classes of optimization models. The reason is, at least, twofold. First, the capability of modeling nonlinearities makes it possible to include many sophisticated aspects of, e.g., physics, economics, engineering, or medicine. Second, the incorporation of integer variables makes it possible to model discrete decisions such as turning a machine on or off or investing in a product or not. Of course, this modeling flexibility comes at the price of models that are hard to solve for realistically sized instances since MINLPs are NP-hard in general [17,29]. Nevertheless, the theoretical and algorithmic advances of the last years and decades make it possible today to solve rather large-scale instances to global optimality in a reasonable amount of time [6], in particular if problem-specific structural properties of the model can be exploited algorithmically; see, e.g., [7,15,16,33,34] for the convex as well as [1,36,49,51] for the nonconvex case.
In addition, there has been a significant amount of research devoted to the cases in which only very few structural assumptions can be exploited. This is the framework considered in this paper since we assume that certain functions of the model are not given explicitly but can be evaluated and that some analytical properties, such as Lipschitz continuity, are known. To illustrate why this is important, let us sketch three areas of application in which only few assumptions on the structure of the model (or on specific parts of the model) can be made. First, many mixed-integer optimization problems subject to ordinary or partial differential equations fit into this context. In many cases, approaches in this field are driven by incorporating the so-called control-to-state map into the optimization model to "eliminate" the differential equation from the model; see, e.g., [5,8,23,52]. This mapping, however, cannot be stated explicitly in general, and one thus has to resort to exploiting analytical properties such as Lipschitz continuity and the ability to evaluate the mapping, at least in an approximate way. Second, and rather related to the first example, optimization models incorporating constraints that rely on calls to expensive simulation software can be cast in the framework of this paper as well if enough analytic information is known about the input-output mapping of the simulation code; see, e.g., [2,10]. Third and finally, bilevel optimization problems can also be interpreted as models in which a single constraint makes the problem much harder to solve [12,13,14]. In this case, it is the constraint that models the optimality of the decisions of the lower-level (or follower) problem given the decisions of the leader (which is the decision maker modeled in the upper-level problem). The best-reply function, which models the optimal response of the follower, can usually not be written in closed form, but it can be evaluated (by solving the lower-level problem for a given feasible point of the upper level) and, under suitable assumptions [13], analytical properties such as Lipschitz continuity can be established.
This paper addresses a special class of MINLPs for which closed-form expressions for the nonlinearities are not available but Lipschitz continuity is guaranteed with known Lipschitz constants. Hence, all three areas of application discussed above can be addressed by the method proposed in this paper. One part of our contribution, indeed, is that the case studies presented later in Sections 4 and 5 explicitly show the applicability of our method to bilevel problems with nonconvex and quadratic lower-level problems (Section 4) and to problems on gas transport networks that are subject to differential equations (Section 5). Both are classes of problems that received a lot of attention during the last years; see, e.g., [3,12,31] and [32,41]. Before we are able to tackle these problems, we first need to formally state the problem class under consideration, which is what we do in Section 2. Afterward, in Section 3, we describe the main rationale of the method, present it formally, and analyze it theoretically. The latter leads to a correctness theorem showing that the method finitely terminates with an approximate solution, and we further derive a worst-case iteration bound.
The main contributions of our work are the following. We develop an algorithm that only requires very weak assumptions to be applicable. Hence, the method can be applied to a very large set of problems that cannot be solved with classic MINLP solvers, which require that all constraints of the problem are given in closed form. To illustrate the generality of the method, we present the application to nonlinear gas network optimization as well as to bilevel problems with nonconvex quadratic lower-level problems. Our work clearly needs to be seen as a generalization of the works [47,48]. In particular, we generalize [47] to the multivariate case, for which we present a numerical scheme that is more effective than the one in [48] and that uses different geometries for the outer approximation than those used in [47]. Since our main workhorse is the Lipschitz continuity of the nonlinearities, we are still in line with the works [25,26,28,42-44,53], to name only a few. For a more detailed overview of this field, see the textbook [27] and the references therein as well as [47,48], where we discussed the positioning of the method in the literature in more detail.

Problem Definition
We consider the problem

  min_x  c^⊤ x  (1a)
  s.t.  Ax ≥ b,  x̲ ≤ x ≤ x̄,  x ∈ R^n × Z^m,  (1b)
       f_i(x_{I_i}) = x_{r_i}  for all i ∈ [p],  (1c)

where c ∈ R^{n+m}, A ∈ R^{q×(n+m)}, and b ∈ R^q are given data, x̲ ∈ R^n × Z^m and x̄ ∈ R^n × Z^m are finite bounds, and [p] := {1, . . ., p}. Hence, we consider a linear objective (1a), linear mixed-integer constraints (1b), and nonlinear constraints (1c) defined by functions f_i: R^{l_i} → R, i ∈ [p], that are Lipschitz continuous, where l_i = |I_i| is the number of their arguments. Moreover, I_i ⊆ [n+m] is the index set of the variables on which the function f_i depends. Without loss of generality, we further assume that r_i ∉ I_i holds for all i ∈ [p], i.e., no function depends on the variable it defines. In what follows, we also write x_I := (x_i)_{i∈I} for a given index set I. The main challenge when solving Problem (1) is that we assume that the nonlinear functions f_i are not given in closed form but that we can only evaluate them and that we know their Lipschitz constants.
The ε-relaxed version of the original problem (1) is given by

  min_x  c^⊤ x
  s.t.  Ax ≥ b,  x̲ ≤ x ≤ x̄,  x ∈ R^n × Z^m,  (2)
       |f_i(x_{I_i}) − x_{r_i}| ≤ ε  for all i ∈ [p],

where ε > 0 is a prescribed tolerance. A feasible point of (2) is called an ε-feasible point of Problem (1). Moreover, we call a global solution of (2) an approximate globally optimal solution in what follows.

The Algorithm
In this section, we introduce an iterative procedure to solve Problem (1) to approximate global optimality. The main idea is to relax all nonlinearities by utilizing the Lipschitz continuity of these functions. In each iteration, the relaxed problem, which we call the master problem, needs to be solved to global optimality. Subsequently, a subproblem is solved to tighten the relaxation for the next master problem. This procedure is repeated until an ε-feasible solution is found or until it can be shown that the original problem is infeasible.
The master problem in iteration k reads

  min_x  c^⊤ x
  s.t.  Ax ≥ b,  x̲ ≤ x ≤ x̄,  x ∈ R^n × Z^m,  (M(k))
       (x_{I_i}, x_{r_i}) ∈ Ω_i^k  for all i ∈ [p],

where Ω_i^k is a relaxation of the graph of the nonlinearity f_i. This relaxation will be stated in terms of mixed-integer linear constraints; see below. The idea is to partition the domain of f_i into a set of boxes that are indexed using indices in J_i^k = {1, . . ., |J_i^k|} and to linearly relax the graph over each box with the region Ω_i^k(j), j ∈ J_i^k, using the Lipschitz continuity of f_i. Hence, we obtain

  Ω_i^k = ⋃_{j ∈ J_i^k} Ω_i^k(j).  (3)

After solving the master problem, the boxes that contain the solution x^k are identified and split into smaller boxes to get a finer relaxation for the next iteration. The main purpose of solving the subproblem afterward is to find good splitting points for these boxes. To this end, the subproblem determines a point on the graph of each nonlinearity and, at the same time, tries to minimize the distance to the solution of the last master problem. Hence, the subproblem is given by

  min_x  Σ_{i∈[p]} ||(x_{I_i}, x_{r_i}) − (x^k_{I_i}, x^k_{r_i})||
  s.t.  f_i(x_{I_i}) = x_{r_i},  (x_{I_i}, x_{r_i}) ∈ Ω̃_i^k(j_i^k)  for all i ∈ [p],  (S(k))

where Ω̃_i^k(j_i^k) ⊆ Ω_i^k(j_i^k) is a suitably chosen subregion. The reason for the use of these subregions will be discussed in detail in Section 3.2.
Figure 1 depicts the subproblem (S(k)) with the regions Ω̃_i^k(j_i^k) and Ω_i^k(j_i^k) (left) and the corresponding regions Ω_i^{k+1}(j) and Ω_i^{k+1}(j+1) of the master problem in the next iteration (right).
3.1. Construction of the Master Problem's Feasible Region. We now describe in detail how we construct the relaxations Ω_i^k. First, we define the box

  B(v̲, v̄) := {x ∈ R^d : v̲ ≤ x ≤ v̄}

for given vectors v̲ ≤ v̄ in R^d and arbitrary dimension d.
For each i ∈ [p], we assume that we are given vectors v̲_i^k(j), v̄_i^k(j) ∈ R^{l_i} for j ∈ J_i^k such that the boxes B(v̲_i^k(j), v̄_i^k(j)) have pairwise disjoint interiors and cover the bounding box of x_{I_i}, i.e., we have

  ⋃_{j ∈ J_i^k} B(v̲_i^k(j), v̄_i^k(j)) = B(x̲_{I_i}, x̄_{I_i}).  (4)

Let L_i be the Lipschitz constant of f_i on B(x̲_{I_i}, x̄_{I_i}) ⊂ R^{l_i} w.r.t. a given norm ||·||, where any (weighted) norm on R^{l_i} can be used. Let m_i^k(j) be the center point of the box B(v̲_i^k(j), v̄_i^k(j)), i.e., m_i^k(j) = (v̲_i^k(j) + v̄_i^k(j))/2 holds. Due to the Lipschitz continuity of f_i, we have

  |f_i(x_{I_i}) − f_i(m_i^k(j))| ≤ L_i ||x_{I_i} − m_i^k(j)||  for all x_{I_i} ∈ B(v̲_i^k(j), v̄_i^k(j)).

Since the right-hand side attains its maximum over B(v̲_i^k(j), v̄_i^k(j)) in the vertices of the box, we can replace x_{I_i} with v̄_i^k(j). It thus holds

  |f_i(x_{I_i}) − f_i(m_i^k(j))| ≤ L_i ||v̄_i^k(j) − m_i^k(j)||.  (5)

With this, we can define the region Ω_i^k(j) for j ∈ J_i^k as the box

  Ω_i^k(j) := {(x_{I_i}, x_{r_i}) : x_{I_i} ∈ B(v̲_i^k(j), v̄_i^k(j)), |x_{r_i} − f_i(m_i^k(j))| ≤ L_i ||v̄_i^k(j) − m_i^k(j)||}.

The next step is to define the region Ω_i^k as the union of all Ω_i^k(j); see (3). Using (4), we obtain a covering of the bounding box of x_{I_i} and, via (5), we obtain bounds for x_{r_i}. Hence, it follows that Ω_i^k is a relaxation of the graph of the function f_i over B(x̲_{I_i}, x̄_{I_i}). For what follows, we abbreviate X_i^k := {B(v̲_i^k(j), v̄_i^k(j)) : j ∈ J_i^k}, which is the set of boxes used to define Ω_i^k. In our algorithm, we use Ω_i^k to replace the nonlinear constraints f_i(x_{I_i}) = x_{r_i} for all i ∈ [p] in Problem (1) in order to obtain a relaxation.

Lemma 3.2. The master problem (M(k)) can be modeled as a mixed-integer linear problem.
Proof. We can write (M(k)) using the following big-M formulation:

  min_x  c^⊤ x  (6a)
  s.t.  Ax ≥ b,  x̲ ≤ x ≤ x̄,  x ∈ R^n × Z^m,  (6b)
       v̲_i^k(j) − M(1 − z_i^k(j)) ≤ x_{I_i}  for all i ∈ [p], j ∈ J_i^k,  (6c)
       x_{I_i} ≤ v̄_i^k(j) + M(1 − z_i^k(j))  for all i ∈ [p], j ∈ J_i^k,  (6d)
       f_i(m_i^k(j)) − L_i ||v̄_i^k(j) − m_i^k(j)|| − M(1 − z_i^k(j)) ≤ x_{r_i}  for all i ∈ [p], j ∈ J_i^k,  (6e)
       x_{r_i} ≤ f_i(m_i^k(j)) + L_i ||v̄_i^k(j) − m_i^k(j)|| + M(1 − z_i^k(j))  for all i ∈ [p], j ∈ J_i^k,  (6f)
       Σ_{j ∈ J_i^k} z_i^k(j) = 1,  z_i^k(j) ∈ {0, 1}  for all i ∈ [p], j ∈ J_i^k.  (6g)

The rationale of this model is as follows. For each nonlinearity i ∈ [p] and each box j ∈ J_i^k, we introduce a binary variable z_i^k(j) that indicates whether the solution lies in this box or not. If z_i^k(j) = 1, the constraints (6c)-(6f) are equivalent to the definition of Ω_i^k(j). If z_i^k(j) = 0, then (6c)-(6f) are always fulfilled if the constant M is chosen sufficiently large. Constraint (6g) finally ensures that for each nonlinearity i ∈ [p] exactly one box j ∈ J_i^k is chosen. Let us remark that it is always possible in our setting to obtain finite and sufficiently large values of M by using the finite bounds on the variables in (1b).
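To make the x_{r_i}-bounds entering (6e)-(6f) concrete, the following small sketch (our own helper with hypothetical names, assuming the Euclidean norm for ||·||) computes the interval for the output coordinate over a single box: evaluate f at the box center and add and subtract the Lipschitz constant times the center-to-vertex distance.

```python
import math

def omega_interval(f, L, v_lo, v_hi):
    """Bounds on the output coordinate x_r over the box B(v_lo, v_hi):
    f at the box center m, plus/minus L times the center-to-vertex
    distance, which is the maximum of ||x - m|| over the box."""
    m = [(lo + hi) / 2.0 for lo, hi in zip(v_lo, v_hi)]
    r = math.sqrt(sum((hi - mi) ** 2 for hi, mi in zip(v_hi, m)))
    fm = f(m)
    return fm - L * r, fm + L * r
```

For instance, for f(y) = y_1 + y_2 on [0, 1]^2, which is Lipschitz with constant sqrt(2) w.r.t. the 2-norm, the computed interval is [0, 2] and contains every function value over the box.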
3.2. Construction of the Subproblem. We now introduce the chosen subregion of Ω_i^k(j) used in the subproblem (S(k)). We define this region as Ω̃_i^k(j) = Ω_i^k(j) ∩ Ω̂_i^k(j), using the further subregion

  Ω̂_i^k(j) := {(x_{I_i}, x_{r_i}) : v̲_i^k(j) + λ(v̄_i^k(j) − v̲_i^k(j)) ≤ x_{I_i} ≤ v̄_i^k(j) − λ(v̄_i^k(j) − v̲_i^k(j))}

for some λ ∈ (0, 1/2]. This ensures that the solution of the subproblem (S(k)) cannot be arbitrarily close to the edges of the used box if λ > 0 is chosen appropriately.
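As a small illustration (our own sketch with hypothetical names), the shrunk region in the x_{I_i}-coordinates can be computed componentwise; any splitting point chosen inside it keeps a distance of at least λ times the side length from every facet, so both children of a split have side lengths of at most (1 − λ) times the original ones.

```python
def shrunk_box(v_lo, v_hi, lam):
    """Shrink each coordinate range of a box by lam * (side length)
    on both ends, for lam in (0, 1/2]."""
    w_lo = [lo + lam * (hi - lo) for lo, hi in zip(v_lo, v_hi)]
    w_hi = [hi - lam * (hi - lo) for lo, hi in zip(v_lo, v_hi)]
    return w_lo, w_hi
```

For example, shrinking [0, 1] × [0, 2] with λ = 0.25 yields [0.25, 0.75] × [0.5, 1.5].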
Let us also note that, for each iteration k ∈ N, the subproblem (S(k)) can be separated into p smaller problems under reasonable assumptions. If the index sets Î_i := I_i ∪ {r_i} are non-overlapping, i.e., if

  Î_i ∩ Î_{i'} = ∅  for all i, i' ∈ [p] with i ≠ i',  (7)

holds, these smaller problems can be solved in parallel. The above assumption can always be satisfied by introducing additional auxiliary variables.
Lemma 3.3. Suppose that the index sets Î_i = I_i ∪ {r_i} are non-overlapping, i.e., (7) holds. Then, the subproblem (S(k)) is completely separable, i.e., we can solve the subproblem in iteration k by solving the p smaller problems

  min  ||(x_{I_i}, x_{r_i}) − (x^k_{I_i}, x^k_{r_i})||
  s.t.  f_i(x_{I_i}) = x_{r_i},  (x_{I_i}, x_{r_i}) ∈ Ω̃_i^k(j_i^k),  (8)

for i ∈ [p].

Proof. The constraints of (S(k)), i.e., f_i(x_{I_i}) = x_{r_i} and (x_{I_i}, x_{r_i}) ∈ Ω̃_i^k(j_i^k) for i ∈ [p], completely decouple along i ∈ [p] if (7) holds, and so does the objective function. Therefore, the solution of the subproblem (S(k)) can be obtained by solving Problem (8) for all i ∈ [p].
Note that, in the formal sense of complexity theory, the subproblem can be as hard to solve as the originally given MINLP. However, the split into multiple and thus much smaller subproblems can make a huge difference in practice. Moreover, we completely separate the mixed-integer aspects from the nonlinear aspects of the problem, which can also help considerably in solving the subproblems, although they remain hard in the formal sense.
3.3. Formal Statement of the Algorithm. Before we can formally introduce the algorithm, we need the following notation. Let B(v̲, v̄) ⊆ R^d be a box with an interior point x ∈ int B(v̲, v̄), i.e., v̲ < x < v̄. The point x splits the box into the set of boxes

  S(v̲, v̄, x) := {B(w̲, w̄) : (w̲_t, w̄_t) ∈ {(v̲_t, x_t), (x_t, v̄_t)} for all t ∈ [d]}.

We can utilize this notation to get a finer relaxation of graph(f_i) by splitting an element of X_i^k using the solution of the subproblem (S(k)) as the splitting point. This yields a set of smaller boxes that still fulfills Condition (4), as the following proposition states.

Proposition 3.4. For a given box B(v̲, v̄) ⊂ R^{l_i} and an interior point x ∈ int B(v̲, v̄), the set S(v̲, v̄, x) contains 2^{l_i} smaller boxes that have pairwise disjoint interiors and that completely cover the box B(v̲, v̄), i.e.,

  ⋃_{B(w̲, w̄) ∈ S(v̲, v̄, x)} B(w̲, w̄) = B(v̲, v̄).

We can now present the complete method, which is given in Algorithm 1. Before we prove its correctness, let us first discuss its basic functionality. After the master problem (M(k)) is solved in Step 3, it is checked in Step 8 if its solution is already ε-feasible for the original problem. To determine the boxes j_i^k ∈ J_i^k in Step 11, one can simply check the indicator variables z_i^k(j) of the MIP formulation (6). If the solution is not yet ε-feasible, then there are nonlinearities f_i that have feasibility violations larger than ε. For these nonlinearities, we refine the relaxation of the master problem in Step 15 and re-iterate.
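The splitting operator S(v̲, v̄, x) can be sketched directly (our own minimal implementation): each of the 2^d subboxes picks, per coordinate, either the lower part [v̲_t, x_t] or the upper part [x_t, v̄_t].

```python
from itertools import product

def split_box(v_lo, v_hi, x):
    """Split the box B(v_lo, v_hi) at the interior point x into 2**d
    subboxes with pairwise disjoint interiors that cover the box."""
    d = len(x)
    boxes = []
    for pick in product((0, 1), repeat=d):
        lo = [v_lo[t] if p == 0 else x[t] for t, p in enumerate(pick)]
        hi = [x[t] if p == 0 else v_hi[t] for t, p in enumerate(pick)]
        boxes.append((lo, hi))
    return boxes
```

Splitting [0, 3]^2 at the interior point (1, 2), say, yields four boxes whose areas sum to the area of the original box, in line with Proposition 3.4.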
Note that it is not necessary for the correctness of Algorithm 1 to solve the subproblem (S(k)) to global optimality.Our rationale, however, is that optimal solutions of the subproblems yield better splitting points that lead to faster convergence in practice.For the correctness of the algorithm it is sufficient to find feasible points of (S(k)) that are guaranteed to exist due to the following lemma.Proof.Because of Property ( 7), the nonlinear constraints f i (x Ii ) = xri in (S(k)) do not depend on each other.From Proposition 3.1 we know that the graph of f i over the initial region B( xIi , xIi ) lies entirely in Ω k i .Since the nonempty subregion Ωk i (j) restricts Ω k i in all but the last dimension, the claim follows.
Next, we prove that Algorithm 1 always terminates after finitely many iterations.
Algorithm 1 Successive linear relaxation method for Problem (1)
Require: Problem (1), a tolerance ε > 0, a parameter λ ∈ (0, 1/2], and initial boxes X_i^0 = {B(x̲_{I_i}, x̄_{I_i})} for all i ∈ [p].
Ensure: An approximate globally optimal and ε-feasible point for Problem (1) or an indication that Problem (1) is infeasible.
1: Initialize the relaxations Ω_i^0 from X_i^0 for all i ∈ [p].
2: for k = 0, 1, 2, . . . do
3:   Solve the master problem (M(k)) to global optimality.
4:   if (M(k)) is infeasible then
5:     return the indication that Problem (1) is infeasible.
6:   end if
7:   Let x^k denote the optimal solution of (M(k)).
8:   if |f_i(x^k_{I_i}) − x^k_{r_i}| ≤ ε for all i ∈ [p] then
9:     return x^k.
10:  end if
11:  Determine the boxes j_i^k ∈ J_i^k with (x^k_{I_i}, x^k_{r_i}) ∈ Ω_i^k(j_i^k) for all i ∈ [p].
12:  Solve the subproblem (S(k)) and let x̃^k denote the optimal solution.
13:  for all i ∈ [p] do
14:    if |f_i(x^k_{I_i}) − x^k_{r_i}| > ε then
15:      Set X_i^{k+1} ← (X_i^k \ {B(v̲_i^k(j_i^k), v̄_i^k(j_i^k))}) ∪ S(v̲_i^k(j_i^k), v̄_i^k(j_i^k), x̃^k_{I_i}).
16:    else
17:      Set X_i^{k+1} ← X_i^k.
18:    end if
19:  end for
20: end for

Theorem 3.6. There exists a constant K < ∞ such that Algorithm 1 either terminates with an approximate globally optimal solution x^{k*} or with the indication of infeasibility in an iteration k* ≤ K.
Proof. The box Ω_i^k(j) is bounded for each iteration k and all i ∈ [p] and j ∈ J_i^k. For the x_{r_i}-coordinate, the bounding inequalities are given by

  f_i(m_i^k(j)) − L_i ||v̄_i^k(j) − m_i^k(j)|| ≤ x_{r_i} ≤ f_i(m_i^k(j)) + L_i ||v̄_i^k(j) − m_i^k(j)||,

and we denote the corresponding side length of Ω_i^k(j) by d_{r_i}^k(j) = 2 L_i ||v̄_i^k(j) − m_i^k(j)||. If a box is not split in an iteration, it remains the same in Step 17 and so does this side length.
Next, we analyze how the side lengths of the boxes evolve. Let B(v̲_i^{k+1}(j), v̄_i^{k+1}(j)) be one of the smaller boxes that is added to X_i^k in Step 15. The side length d_{r_i}^{k+1}(j) of the corresponding box Ω_i^{k+1}(j) can be bounded from above via

  d_{r_i}^{k+1}(j) ≤ (1 − λ) d_{r_i}^k(j_i^k).

This means that d_{r_i}^k(j) is decreased by at least a factor of (1 − λ) with every split. Since ((1 − λ)^k)_{k∈N} is a geometric sequence with |1 − λ| < 1, it converges to zero, i.e., (1 − λ)^k → 0 as k → ∞. It follows that any box B(v̲_i^k(j), v̄_i^k(j)), including the first box B(x̲_{I_i}, x̄_{I_i}), can only be split finitely many times in Step 15 before the side length d_{r_i}^k(j) of Ω_i^k(j) fulfills d_{r_i}^k(j) ≤ ε. Since the index set [p] is finite, there are only finitely many boxes in X_i^k for all i ∈ [p] and all k. These boxes can only be split finitely many times. Hence, there exists an iteration K < ∞ in which no box is split in Step 15. This, however, can only be the case if the if-condition in Step 14 does not hold for any i ∈ [p]. Thus, we have |f_i(x^K_{I_i}) − x^K_{r_i}| ≤ ε for all i ∈ [p], which is the if-condition in Step 8. This means that the algorithm terminates in Step 9 in iteration K. Hence, it follows that there exists a K < ∞ such that Algorithm 1 terminates in Step 5 or 9 in an iteration k* ≤ K.
We close this section by stating and proving a result for the worst-case number of required iterations of Algorithm 1.
Theorem 3.7. Algorithm 1 terminates after at most

  K̄ = Σ_{i∈[p]} Σ_{k=0}^{S_i} 2^{k l_i}

iterations with

  S_i = ⌈log(ε / d_{r_i}^0) / log(1 − λ)⌉,

where d_{r_i}^0 denotes the x_{r_i}-side length of the initial box Ω_i^0(1).

Proof. From the proof of Theorem 3.6 we know that a box B(v̲_i^k(j_i^k), v̄_i^k(j_i^k)) can only be split finitely many times before the side length d_{r_i}^k(j) of Ω_i^k(j) satisfies d_{r_i}^k(j) ≤ ε. We can give an upper bound on how many splits this takes for the first box B(x̲_{I_i}, x̄_{I_i}) by solving the equation (1 − λ)^{S_i} d_{r_i}^0 = ε for S_i. Since a box will not be split anymore if d_{r_i}^k(j) ≤ ε, we can round this value up to obtain

  S_i = ⌈log(ε / d_{r_i}^0) / log(1 − λ)⌉.

For each box that is split, there are 2^{l_i} smaller boxes that are added to X_i^k. Therefore, for each i ∈ [p], the maximal number of iterations required until there are no boxes left in X_i^k that can be split is bounded from above by

  Σ_{k=0}^{S_i} 2^{k l_i}.  (9)

Since it is possible that in each iteration there is only a single nonlinearity i ∈ [p] for which a box is split, we have to sum up (9) over all i ∈ [p] to get K̄ as an upper bound for the required number of iterations of Algorithm 1.
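For intuition, the quantity S_i can be computed numerically. The following helper (our own, under the assumption from the proof of Theorem 3.6 that each split shrinks the x_{r_i}-side length by at least the factor 1 − λ) returns the number of splits after which a side length d0 is guaranteed to drop to ε or below.

```python
import math

def max_splits(d0, eps, lam):
    """Smallest integer S with (1 - lam)**S * d0 <= eps."""
    if d0 <= eps:
        return 0
    return math.ceil(math.log(eps / d0) / math.log(1.0 - lam))
```

For example, with d0 = 1, ε = 0.1, and λ = 0.5, four splits suffice, since 0.5^4 = 0.0625 ≤ 0.1 while 0.5^3 = 0.125 > 0.1.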
Remark 3.8. Theorem 3.7 states that choosing λ = 0.5 results in the lowest number of iterations in the worst case. Then, no subproblem (S(k)) is needed, as one can simply evaluate the nonlinearity f_i at the center point m_i^k(j_i^k) of the current box to obtain the splitting point. However, in practice it can be better to choose a smaller parameter λ, which allows the splitting point to be closer to the master problem's solution and which, thus, may result in a finer approximation of the nonlinearity near the optimal solution of Problem (1).

Application to Nonlinear Bilevel Problems
The method developed in the previous section can be applied to nonlinear bilevel problems with nonconvex lower-level models, which is an extremely challenging class of problems. To illustrate this, we consider optimistic MIQP-QP bilevel problems of the form

  min_{x,y}  (1/2) x^⊤ H_u x + c_u^⊤ x + (1/2) y^⊤ G_u y + d_u^⊤ y  (10)
  s.t.  Ax + By ≥ a,  x̲ ≤ x ≤ x̄,
       y ∈ argmin_{y'} {(1/2) y'^⊤ G_l y' + d_l^⊤ y' : Cx + Dy' ≥ b, y̲ ≤ y' ≤ ȳ},

where x ∈ R^{n_x} and y ∈ R^{n_y} denote the upper- and lower-level variables, which are finitely bounded by x̲, x̄, y̲, and ȳ. Further, we have matrices A ∈ R^{m_u×n_x}, B ∈ R^{m_u×n_y}, C ∈ R^{m_l×n_x}, and D ∈ R^{m_l×n_y} as well as right-hand side vectors a ∈ R^{m_u} and b ∈ R^{m_l}. In addition, we have c_u ∈ R^{n_x} and d_u, d_l ∈ R^{n_y}. Finally, H_u ∈ R^{n_x×n_x} and G_u ∈ R^{n_y×n_y} are positive semidefinite and symmetric matrices, while G_l ∈ R^{n_y×n_y} is a possibly indefinite and symmetric matrix. Thus, the upper level is a convex-quadratic problem over linear constraints and the lower-level problem is an x-parameterized, continuous, but nonconvex quadratic problem. Let ϕ(·) be the optimal-value function of the lower level, i.e.,

  ϕ(x) = min_y {(1/2) y^⊤ G_l y + d_l^⊤ y : Cx + Dy ≥ b, y̲ ≤ y ≤ ȳ}.  (11)

With this, we can rewrite Problem (10) equivalently as the single-level problem

  min_{x,y}  (1/2) x^⊤ H_u x + c_u^⊤ x + (1/2) y^⊤ G_u y + d_u^⊤ y  (12)
  s.t.  Ax + By ≥ a,  Cx + Dy ≥ b,  x̲ ≤ x ≤ x̄,  y̲ ≤ y ≤ ȳ,
       (1/2) y^⊤ G_l y + d_l^⊤ y ≤ ϕ(x);

see, e.g., [13]. We now reformulate Problem (12) so that it fits into the framework introduced above. To this end, we introduce the auxiliary variables η_1 and η_2 as well as the nonlinear function f: R^{n_y} → R with f(y) = (1/2) y^⊤ G_l y + d_l^⊤ y. Based on this notation, Problem (12) can be restated as

  min_{x,y,η_1,η_2}  (1/2) x^⊤ H_u x + c_u^⊤ x + (1/2) y^⊤ G_u y + d_u^⊤ y  (13)
  s.t.  Ax + By ≥ a,  Cx + Dy ≥ b,  x̲ ≤ x ≤ x̄,  y̲ ≤ y ≤ ȳ,
       η_1 = ϕ(x),  η_2 = f(y),  η_2 ≤ η_1.

Now, the method developed in Section 3 can be applied to (13) if (i) the nonconvex functions ϕ and f are Lipschitz continuous on the projections of the bilevel constraint region onto the decision spaces of the upper and lower level, respectively, i.e., on the domains F_x and F_y, and if (ii) the associated Lipschitz constants are computable.
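To illustrate that ϕ can be evaluated even though it cannot be optimized over, consider a toy one-dimensional lower level of the form min {g y^2/2 + d y : y ∈ [lo, hi], y ≥ x}, i.e., the upper-level variable tightens the lower bound on y. This is our own hypothetical instance, not one from the paper; for a one-dimensional quadratic, the minimum over an interval is attained at an endpoint or, if g > 0, possibly at the stationary point −d/g.

```python
def phi(x, g, d, lo, hi):
    """Optimal value of min { 0.5*g*y**2 + d*y : max(lo, x) <= y <= hi }."""
    lo_x = max(lo, x)
    if lo_x > hi:
        raise ValueError("lower level infeasible, Assumption 1 violated")
    obj = lambda y: 0.5 * g * y * y + d * y
    candidates = [lo_x, hi]
    if g > 0 and lo_x <= -d / g <= hi:
        candidates.append(-d / g)  # interior stationary point (convex case)
    return min(obj(y) for y in candidates)
```

For g = 2, d = −2 on [0, 3], we get ϕ(0) = −1 (attained at y = 1) and ϕ(2) = 0 (attained at the boundary y = 2); in the concave case g = −2, d = 0 on [−1, 2], ϕ(−1) = −4 is attained at the endpoint y = 2.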
What makes things more complicated compared to the general setup described in Section 3 is that we can only evaluate the optimal-value function ϕ(x) but cannot optimize over it. Thus, we cannot use the subproblem (S(k)) directly to obtain a new splitting point. On the other hand, following the strategy of taking the box center m as the new splitting point reduces solving the subproblem (S(k)) to evaluating ϕ(m). More precisely, using the box center corresponds to setting λ = 1/2. This, however, is only applicable if ϕ(m) is well-defined, which we ensure with the following assumption.
Assumption 1. For every x ∈ B(x̲, x̄), the lower-level problem (11) is feasible, i.e., T(x) := {y ∈ B(y̲, ȳ) : Cx + Dy ≥ b} ≠ ∅.

This assumption implies that −∞ < ϕ(x) < +∞ holds for all x ∈ B(x̲, x̄), i.e., a minimizer y of Problem (11) exists for all x ∈ B(x̲, x̄) and, thus, for every possible box center m.
Before we present the theoretical developments that are required to apply our method to the introduced class of bilevel problems, let us briefly discuss that interpreting approximate solutions of bilevel problems with nonconvex lower-level problems needs to be done with some care. In particular, it is shown in [4] that lower-level solutions that are only ε-feasible can lead to upper-level solutions that are arbitrarily far away from actual bilevel solutions. Since this also applies to our method, we later always report the difference of our solutions to the known optimal solutions of the bilevel problems in our test set.

4.1. Lipschitz Continuity Properties. To apply our method with the outlined modifications for the subproblem to Problem (13), it remains to show that properties (i) and (ii) are fulfilled. We start with the nonconvex function f. Since the relevant domain B(y̲, ȳ) of this function is compact, continuous differentiability of f implies global Lipschitz continuity of f on this set. Since B(y̲, ȳ) is convex and compact, the tightest Lipschitz constant can be computed by solving the optimization problem

  L_{G_l} = max {||G_l y + d_l|| : y ∈ B(y̲, ȳ)}.  (14)

Note that it would also be possible to compute the Lipschitz constant in (14) over the feasible set of the master problem, i.e., over the set F_y. However, this involves solving an optimization problem not over a simple box but over a more complex polytope. In our computational study, we test both variants. We denote the former method as the "fast" method and the latter as the "slow" method. In the absence of lower- and upper-level constraints except for simple variable bounds, both approaches coincide. Next, we continue with the more difficult case of proving Lipschitz continuity of the optimal-value function ϕ. To this end, we exploit a variant of the Hoffman Lemma; for the original version, see the main theorem in [24] or Lemma 5.8 in [50]. For the ease of presentation, we assume from now on that the finite bounds on y are part of the lower-level
inequality constraints and, thus, also part of the matrix C.

Lemma 4.1 (see Corollary 5.1 in [50]). There exists L_H > 0 such that for any x, x̃ ∈ B(x̲, x̄) the following holds: for any y ∈ T(x), we can find a point ỹ ∈ T(x̃) with

  ||y − ỹ|| ≤ L_H ||x − x̃||.

The scalar L_H is the so-called Hoffman constant. A sharp characterization of this constant and an algorithm to compute it can be found in [37,40]. Based on the introduced variant of the Hoffman Lemma, we can now establish Lipschitz continuity of the optimal-value function ϕ under Assumption 1. Our proof follows the idea of the proof of Corollary 5.2 in [50], where the Lipschitz continuity of the optimal-value function of a linear program with right-hand side perturbation is demonstrated. In contrast to this, we have a quadratic program with right-hand side perturbation. Let us finally note that the presented method is not restricted to nonconvex but quadratic problems in the lower level in general. If one has knowledge about the Lipschitz constant of a nonconvex lower-level problem with more general nonlinearities, the method can be applied as explained in this section.

4.2. Implementation Details. In this section, we discuss some implementation details to clarify how we modified and extended Algorithm 1 to obtain a more tailored method for the considered bilevel setup.

4.2.1. "Slow" and "Fast" Method for ϕ. Because T(x) ⊆ B(y̲, ȳ) holds for all x ∈ B(x̲, x̄), we immediately get that the constant L_{G_l} computed over B(y̲, ȳ) is also valid on T(x), so that a Lipschitz constant of ϕ can be obtained from L_{G_l} and the Hoffman constant L_H. Since we have to compute L_{G_l} for computing the Lipschitz constant of ϕ, we can distinguish between the "fast" and the "slow" method for ϕ as well.
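Solving (14) exactly means globally maximizing a norm over a box, which is itself a nonconvex problem. A cheap alternative can be sketched as follows (our own helper, not from the paper): bound each gradient component (G_l y + d_l)_i separately over the box. Since each component is linear in y, its extreme values over a box can be computed coordinatewise; the result is an upper bound on L_{G_l} rather than the tightest constant, because the componentwise maxima may be attained at different vertices. The 2-norm is assumed.

```python
import math

def lipschitz_upper_bound(G, d, y_lo, y_hi):
    """Upper bound on max ||G y + d||_2 over the box [y_lo, y_hi]."""
    bound_sq = 0.0
    for G_i, d_i in zip(G, d):
        # range of the linear function (G y + d)_i over the box
        comp_hi = d_i + sum(g * (h if g > 0 else l)
                            for g, l, h in zip(G_i, y_lo, y_hi))
        comp_lo = d_i + sum(g * (l if g > 0 else h)
                            for g, l, h in zip(G_i, y_lo, y_hi))
        bound_sq += max(abs(comp_lo), abs(comp_hi)) ** 2
    return math.sqrt(bound_sq)
```

For G_l = diag(2, 2) and d_l = 0 on [0, 1]^2, the bound is 2√2, which in this diagonal example coincides with the exact value of (14).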
4.2.2. Additional Nonlinearities. Bilinear nonlinearities of the form x_i y_j in the lower- or upper-level objective function can easily be reformulated to fit into our setup. If such nonlinearities occur, e.g., in the lower level, an additional variable y_k is introduced in the lower level. Moreover, the constraint y_k = x_i is added to the lower level, while the nonlinear objective term x_i y_j is replaced by the product y_k y_j. The resulting bilevel problem then fits into our setup.
4.2.3. Box Filtering. Figure 2 illustrates that, after splitting an initial bounding box, here [0, 3]^2, a few times, there might be boxes (such as [2.25, 3]^2) that do not include any point of the bilevel constraint region, which is colored red in Figure 2. To avoid further investigation of such boxes, we can detect them by checking whether the intersection of the bilevel constraint region with newly created boxes is empty. However, this requires solving an LP feasibility problem and thus creates some additional computational effort. The benefit, however, is that the constraints (6c)-(6f) and the corresponding binary variables are not added to the master problem for the filtered boxes. Since the bilevel constraint region is also accounted for in the master problem, these filtered boxes are never split anyway, i.e., the constraints and variables that are not added to the master problem would be redundant if added. Hence, the additional computational effort of box filtering should only be undertaken if necessary. Note that box filtering is not necessary if there are no lower- and upper-level constraints except for simple variable bounds, since in this case the intersection of the bilevel constraint region with every possible box created by the algorithm is never empty. In contrast, box filtering is necessary if the "slow" method is applied since, e.g., Problem (14) for computing the Lipschitz constant of the nonconvex function f might be infeasible on newly created boxes. Thus, these boxes must be filtered after each box splitting and before the Lipschitz constants are updated. This does not occur with the "fast" method, for which box filtering is therefore not necessary.
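The LP feasibility check behind box filtering can be sketched with SciPy's linprog (our own sketch; it assumes the constraint region is given in the form A_ub v ≤ b_ub, which may require rewriting ≥-constraints, and solves an LP with zero objective).

```python
import numpy as np
from scipy.optimize import linprog

def box_is_filtered(A_ub, b_ub, box_lo, box_hi):
    """Return True if the polytope { v : A_ub @ v <= b_ub } has empty
    intersection with the box [box_lo, box_hi], i.e., the box can be
    discarded; solves a pure LP feasibility problem."""
    res = linprog(c=np.zeros(len(box_lo)), A_ub=A_ub, b_ub=b_ub,
                  bounds=list(zip(box_lo, box_hi)), method="highs")
    return res.status == 2  # status 2 means "infeasible"
```

For instance, with the single constraint v_1 + v_2 ≤ 4, the box [2.25, 3]^2 is filtered (the minimum of v_1 + v_2 over it is 4.5 > 4), while [0, 1]^2 is kept.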
where E and F ≠ 0 are suitably chosen matrices and e is a suitably chosen vector. Now, for given x, x̃ ∈ B(x̲, x̄), let ŷ and ỹ be defined as in (15), where B(y̲, ȳ) is the lower level's feasible set in this case. Note that B(y̲, ȳ) = T(x) holds for all x ∈ B(x̲, x̄) because x only influences the lower-level objective function but not the lower level's feasible set. Thus, it holds

  ϕ(x̃) − ϕ(x) ≤ f(ŷ, x̃) − f(ŷ, x) ≤ ||F ŷ|| ||x̃ − x||.

Note that f(ŷ, x) ≤ f(ỹ, x) holds since ŷ minimizes f(y, x) over B(y̲, ȳ) and ỹ ∈ B(y̲, ȳ). This is not necessarily true in the case of general lower-level constraints, i.e., when optimizing in (15) over T(x) and T(x̃), respectively, since T(x) ≠ T(x̃) and ỹ ∉ T(x) might hold. Analogously, we obtain ϕ(x) − ϕ(x̃) ≤ ||F ỹ|| ||x − x̃|| and, consequently, the Lipschitz continuity of ϕ on B(x̲, x̄).

In our computational study below, we consider the QP-QP instances from the BASBLib library [38]. We first describe which instances need to be excluded because they do not fit into our setup. First, we exclude the three instances d_1992_01, b_1984_02, and dd_2012_02 because they contain nonlinear constraints.
Next, the two instances y_1996_02 and lmp_1987_01 are excluded due to nonconvex upper-level objective functions, i.e., the matrix H_u is not positive semidefinite for these instances. The instance sc_1998_01 is not considered because C is the zero matrix and, thus, the resulting optimization problem is not a "true" bilevel problem, because the lower-level problem is not constrained by upper-level variables. Such problems can easily be solved by backward induction, i.e., an optimal solution to the lower-level problem can be determined first and, given this lower-level optimal solution, the upper-level problem can be solved to optimality. Finally, we have to exclude the following four instances because they violate Assumption 1:
• tmh_2007_01 (lower level is not feasible for x = 9),
• b_1988_01 (lower level is not feasible for x = 9),
• b_1998_07 (lower level is not feasible for x = 9), and
• cw_1990_02 (lower level is not feasible for x = 7).
In total, 10 instances (out of 20) remain; see Table 1. Note that, due to the applied reformulations described in Section 4.2.2, the reported numbers of variables might differ from those reported in the BASBLib. Moreover, the reported number of constraints does not include additional constraints necessary due to these reformulations. Variable bounds are also not counted as constraints.
We implemented the algorithm in Python 3.7.9. All computations were conducted on a machine with an Intel(R) Core(TM) i7-8565U CPU with 4 cores, 1.8 GHz to 4.6 GHz, and 16 GB RAM. The master problem (M(k)) and the subproblem (S(k)) are modeled using Pyomo 5.7.2 [22] and solved with Gurobi 9.1.0 [21]. The Hoffman constant is computed with the algorithm described in [40] using the MATLAB code made publicly available by the authors of [40]. We set ε = 10⁻¹ and use a time limit of 5 h. If an instance is solved within the time limit, we decrease ε by subsequently dividing it by ten until ε = 10⁻⁵ is reached to see how much accuracy can be achieved by our algorithm within the given time limit. The obtained results for the case of box filtering and determining the Lipschitz constant with the "slow" method are summarized in Table 2, while the results for the "fast" method without box filtering are given in Table 3. Note that Table 2 contains results for 4 instances, while Table 3 contains results for 10 instances. The reason is that 6 instances only have simple variable bounds, so that there is no difference between the "slow" and the "fast" method as well as between box filtering and no box filtering (except for the additional computational effort due to the LP feasibility problems solved in the former case; see Section 4.2.3). In these cases, we only list the respective instances in Table 3. The tables are organized as follows. The first column states the ID of the instance and the second one the used ε for the termination criterion. The number of required iterations is denoted by k, "runtime" states the runtimes in seconds, and the "final ε" column contains the tolerance that is actually reached. Finally, the columns "diff to opt."
and "diff to opt. value" contain the 2-norm distance of our solution to the one reported in the BASBLib and the respective difference in the objective value. Before we discuss the results in detail, let us comment on two important aspects. First, it can be expected that our method performs rather badly if there are multiplicities, i.e., multiple optimal solutions. To get a finer relaxation, all boxes covering the multiple solutions have to be split at least once if ε-feasibility is not reached by the first split. Second, it is to be expected that our method performs rather well for instances with small variable ranges, e.g., ranges such as [0, 1]^{n_x} × [0, 1]^{n_y} instead of [0, 1000]^{n_x} × [0, 1000]^{n_y}, as well as small lower-level objective function ranges. In particular, the range of the lower level's objective function is small in the case of small Lipschitz constants. The Lipschitz constant derived here for the optimal-value function is valid for all quadratic programs with right-hand side perturbations that satisfy Assumption 1. However, for specific bilevel applications, much tighter Lipschitz constants might be derived by exploiting problem-specific structural properties.
In what follows, we comment on the obtained computational results in the light of these two aspects. Multiplicities are reported for instance as_1981_01. As can be seen in Table 2, for ε = 10^-2, this instance is not solved to ε-feasibility within the time limit despite the fact that an optimal solution is already reached. This effect is even enhanced if box filtering is deactivated and the "fast" method is used; see Table 3. In this case, the final ε is 47.3, which is very large. As this instance has rather many general constraints, box filtering with the "slow" method improves the solution process significantly. However, for instances with fewer general constraints like sa_1981_01 and sa_1981_02, the additional computational burden of the "slow" method and box filtering can outweigh its advantages. Finally, let us discuss the results in dependence of the tightness of the Lipschitz constants and the size of the variable ranges. To this end, we first order the considered instances by increasing Lipschitz constants of the function f: b_1998_02, b_1998_03, d_2000_01, d_1978_01, fl_1995_01, sa_1981_02, as_1981_01, sa_1981_01, b_1998_04, and b_1998_05. Indeed, the three instances with the lowest Lipschitz constants are solved for the smallest ε within the time limit; see Tables 2 and 3. In addition to these 3 instances, the "fast" method with no box filtering also solves the instances d_1978_01, sa_1981_01, and b_1998_04 at least for the initial ε. Besides having one of the lowest Lipschitz constants, the instance d_1978_01 has relatively small variable ranges. The other two instances sa_1981_01 and b_1998_04 have a relatively small number of lower- and upper-level variables, which leads to reduced dimensions of the boxes and therefore of the worst-case number of iterations. Nevertheless, the disadvantage of the large Lipschitz constant of b_1998_04 is reflected in the results, as for ε = 10^-1 already over 2 h are needed to compute an ε-feasible solution.
In total, our method solves 7 out of 10 instances. Interestingly, 2 out of the 3 unsolved instances even have a convex lower-level problem and could thus be solved with specialized methods such as, e.g., [30] that explicitly exploit this property. In addition, as our method only requires very weak assumptions and can be applied to a broad range of problems besides nonconvex bilevel problems, it cannot be expected that it outperforms specifically tailored methods like, e.g., the BASBL solver [39].

Application to Gas Network Optimization
In this section, we use Algorithm 1 to solve stationary gas network optimization problems. We start by modeling the gas network and state an implicit nonlinear pressure law function for gas flow in pipes. We cannot state this function explicitly, but it is possible to evaluate it rather cheaply. Then, we analyze its derivatives to derive suitable Lipschitz constants. Finally, numerical results on test instances show the successful application of our method.

5.1. Modeling. We model the gas network as a directed and weakly connected graph (V, A), where the arcs A are composed of pipes A_pi, short pipes A_sp, valves A_va, compressor stations A_cs, and control valves A_cv, i.e., A = A_pi ∪ A_sp ∪ A_va ∪ A_cs ∪ A_cv. The two main variables that describe the state of the gas flowing through the network are the pressure p and the mass flow q. Each node u ∈ V has a bounded pressure variable p_u ∈ [p̲_u, p̄_u] and a given mass flow q_u that is supplied to or withdrawn from the network. A node u ∈ V is called an entry node if q_u > 0, an exit node if q_u < 0, and an inner node if q_u = 0. In addition, each arc a ∈ A has a variable q_a ∈ [q̲_a, q̄_a] that models the mass flow in the arc.
The mass balance equation

q_u + Σ_{a ∈ δ^in(u)} q_a = Σ_{a ∈ δ^out(u)} q_a for all u ∈ V

ensures that no gas is gained or lost. Here, δ^in(u) and δ^out(u) denote the sets of in- and outgoing arcs of node u.
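As a small illustration, the node balance can be checked for a given flow vector; the node and arc names below are made up for the example:

```python
def mass_balance_residual(nodes_q, arcs, flow):
    """Residual of the node balance q_u + sum_{a in delta_in(u)} q_a
    = sum_{a in delta_out(u)} q_a for every node u.

    nodes_q: dict node -> supplied (+) / withdrawn (-) mass flow q_u
    arcs:    list of (u, v) tuples
    flow:    dict arc -> mass flow q_a on that arc
    A residual of zero at every node means no gas is gained or lost.
    """
    residual = {}
    for u, q_u in nodes_q.items():
        inflow = sum(flow[a] for a in arcs if a[1] == u)
        outflow = sum(flow[a] for a in arcs if a[0] == u)
        residual[u] = q_u + inflow - outflow
    return residual
```

For a simple path u → v → w with one unit supplied at u and withdrawn at w, all residuals vanish.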
A short pipe a = (u, v) ∈ A_sp directly connects its nodes u and v. Therefore, the related pressure values coincide, i.e., p_u = p_v. A valve a = (u, v) ∈ A_va is either open or closed. If it is open, it is modeled as a short pipe. If it is closed, the mass flow q_a has to be zero and the related pressures are decoupled, but the corresponding pressure difference between the nodes u and v cannot exceed a given value ∆p̄_a. This can be modeled by introducing a binary variable o_a that indicates whether the valve a is open (o_a = 1) or closed (o_a = 0). The valve model then reads as in (18). Compressor stations a ∈ A_cs have a fixed flow direction, i.e., q̲_a ≥ 0 holds, and can increase the pressure of the gas. We model this pressure increase by introducing the variable ∆p_a ∈ [0, ∆p̄_a]. Then, the compressor station model is given by (19). Note that this model is a significant simplification of how compressor stations in real gas networks operate. More complicated and realistic models can be found in, e.g., [41,45]. Control valves a ∈ A_cv are modeled similarly to compressor stations but decrease the gas pressure instead of increasing it. We again have q̲_a ≥ 0 and use (20) with ∆p_a ∈ [0, ∆p̄_a].
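The open/closed logic of the valve model can be checked pointwise. The following sketch encodes our reading of the disjunction; the exact inequalities of (18) are not reproduced in this text, so this feasibility check is an illustration, not the paper's model:

```python
def valve_constraints(q_a, p_u, p_v, o_a, q_lb, q_ub, dp_max):
    """Check the valve disjunction (a sketch, not the paper's exact
    Big-M inequalities (18)):
      open   (o_a = 1): flow within bounds and p_u = p_v (short pipe),
      closed (o_a = 0): q_a = 0, pressures decoupled, but
                        |p_u - p_v| <= dp_max.
    """
    if o_a == 1:
        return q_lb <= q_a <= q_ub and abs(p_u - p_v) < 1e-9
    return q_a == 0 and abs(p_u - p_v) <= dp_max
```

In a MIP formulation, the same disjunction would be expressed with Big-M inequalities activated by the binary o_a, as discussed for the master problem.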
Until now, we have modeled all gas network components in a (mixed-integer) linear way. The only components that are still missing are the pipes a ∈ A_pi. We will describe the pressure loss in a pipe as a function of the inflow pressure and the mass flow using a nonlinear and Lipschitz continuous function, which we analyze in the next section.
The goal is to minimize the overall activity of the compressor stations. Therefore, the full stationary gas network optimization problem is given by

min Σ_{a ∈ A_cs} ∆p_a
s.t. mass balance (16), short pipe model (17), valve model (18), compressor station model (19), control valve model (20), pipe model (21).

This model has the form of Problem (1). The pipe equations (21) constitute the Lipschitz nonlinearities (1c), while the other constraints fit (1b) as they are linear. Hence, we can use Algorithm 1 to solve it.
In contrast to the gas network model considered in [47], we do not restrict ourselves to tree-structured networks. As a consequence, the mass flows in the network cannot be precomputed. Therefore, the nonlinearity on each arc is multivariate as it depends on both the pressure and the mass flow. This means that Algorithm 1 can be applied to a much broader class of gas transport models.

Lipschitz Continuity of the Gas Flow Equation.
In this section, we derive and analyze the nonlinear pipe model (21). For the sake of readability, we henceforth omit the subscript a that indicates the pipe a ∈ A_pi.
Gas flow along a pipe can be modeled by the stationary momentum equation. In this ordinary differential equation (ODE), p, v, χ, and ρ model the pressure, velocity, mass flux, and density of the gas, and λ as well as D denote the pipe's friction coefficient and diameter; see, e.g., [19]. The relation between the mass flow q and the mass flux χ is given by q = Aχ, where A = πD^2/4 is the cross-sectional area of the pipe.
The pressure p and the density ρ are coupled by the equation of state for real gas, p = ρ z R_s T, where R_s denotes the specific gas constant. The compressibility factor z can be computed by the so-called AGA formula z(p) = 1 + αp, in which the constant α is determined by the pseudocritical pressure p_c, the pseudocritical temperature T_c, and the temperature T, which we assume to be constant; see, e.g., [35]. We only consider a positive compressibility factor, which is equivalent to 1 + αp > 0. We further introduce the speed of sound c, which is defined via 1/c^2 = ∂ρ/∂p, and the squared Mach number η = v^2/c^2.

Proof. We solve the equation of state (22) for ρ and obtain ρ = p/((1 + αp) R_s T). We can use this and the definition of the speed of sound to get c^2 = (1 + αp)^2 R_s T. We assume that the velocity of the gas is subsonic, i.e., η < 1 holds, as is the case for real-world gas networks. This is equivalent to p > |χ| √(R_s T). In what follows, we only consider pressures within the interval (|χ| √(R_s T), 1/|α|). Therefore, we have p^2 − χ^2 R_s T > 0 and 1 + αp > 0, which we will use many times throughout this section.
For a pipe (u, v) with length L and pressure p_u at node u, the pressure function is given, for x ∈ [0, L], by (25a); see [19,20]. From [20], we further know the following properties of F.
Lemma 5.2. The function F as defined in (25a) satisfies property (26) together with an accompanying bound on its second derivative.

The property in (26) implies that F is strictly increasing. Therefore, the inverse in (25a) is well-defined. To evaluate the pressure function p in (25a), Equation (27) needs to be solved. This can be done numerically using Newton's method since F is strictly increasing and convex; see [20]. In the same way, (27) can be solved for p_u if p and χ are given. We are interested in the pressure at the end of the pipe, i.e., for x = L, and need to find a Lipschitz constant for the resulting function p_v(p_u, χ). To this end, we make the following assumption.

Assumption 2. We assume that the variables χ and p_u of each pipe (u, v) are bounded by χ ∈ [χ̲, χ̄] and p_u ∈ [p̲_u, p̄_u].

The following lemma guarantees the Lipschitz continuity of p_v(p_u, χ) on F.

Proof. Let x̃, ỹ ∈ F. We define the auxiliary function g: [0, 1] → R as g(λ) := f(λx̃ + (1 − λ)ỹ). Now, we use the fundamental theorem of calculus to prove the claim.

Using this weighted 1-norm allows us to get tighter bounds in (5) compared to the usual 1-norm. To actually compute this weighted norm, one could solve max_{x ∈ F} |∂_{x_i} f(x)|, which is an NLP for each i ∈ [d]. In the case of Algorithm 1, the set F will always be a box. For the function p_v(p_u, χ), we give the optimal solution or at least a suitable upper bound for these NLPs in the following two theorems.
Let F = [p_u^-, p_u^+] × [χ^-, χ^+] be a given box with F ≠ ∅. If Assumption 2 holds, the derivative ∂_χ p_v(p_u, χ) ≤ 0 has the properties stated in (29) and (30). The proofs of these two theorems can be found in the appendix of this paper as they are rather long and technical.
Theorem 5.4 and Theorem 5.5 can be used to compute the bounds in (5) not only for the initial box [p̲_u, p̄_u] × [χ̲, χ̄] but also for every new box that is created in Step 15 of Algorithm 1. This can significantly tighten the bounds in (5) as the iteration proceeds.
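Since F is strictly increasing and convex, the Newton-based evaluation of the pressure function described above is straightforward. The following is a minimal, generic sketch; the functions `F` and `dF` are passed in as callables, and the quadratic test function used in the usage note is purely illustrative, not the actual pressure law:

```python
def invert_increasing_convex(F, dF, target, p0, tol=1e-12, max_iter=100):
    """Solve F(p) = target by Newton's method.

    For a strictly increasing and convex F (as is the case for the
    pressure function's F according to [20]), Newton's method is well
    suited; dF is the derivative of F.
    """
    p = p0
    for _ in range(max_iter):
        step = (F(p) - target) / dF(p)
        p -= step
        if abs(step) <= tol * max(1.0, abs(p)):
            return p
    raise RuntimeError("Newton's method did not converge")
```

For example, inverting F(p) = p^2 at target 2 starting from p0 = 2 recovers √2.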

Numerical Results
Now, we apply Algorithm 1 to two test problems from the GasLib library [46], which contains stationary gas network benchmark instances. To this end, we implemented our method in Python 3.8.10. The computations were done on a machine with an Intel(R) Core(TM) i7-8550U CPU with 4 cores, 1.8 GHz to 4.0 GHz, and 16 GB RAM. The master problems (M(k)) and the subproblems (S(k)) have been modeled using GAMS 36.2.0 [9]. We used the solver CPLEX 20.1.0.1 [11] to solve the master problems, which are MIPs, and the solver SNOPT 7.7.7 [18] for the subproblems, which are NLPs.
To improve the performance of our method, we detect boxes that lie outside the feasible set of a pipe. From (29) and (30) we know that this is the case if χ^- ≥ 0 and p_v(p_u^+, χ^-) < p̲_v holds or if χ^+ ≤ 0 and p_v(p_u^-, χ^+) > p̄_v holds. Additionally, we can fix the flow in pipes that are not part of a cycle. This allows us to reduce the dimension of the corresponding nonlinearities by one. To get well-scaled problems, we model all pressure values in bar instead of Pa and exclusively use mass flow values (in kg s^-1) instead of mass flux values. As a result, we have to scale all values of ∂_χ p_v(p_u, χ) that are used to compute the bounds in (5) by a factor of 10^-5/A.
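This box filtering step can be sketched as follows; `p_v` is passed in as a callable, and the toy surrogate in the usage note only mimics the monotonicity of the real pressure function (increasing in p_u, decreasing in χ):

```python
def box_outside_pipe_feasible_set(p_v, box, pv_lb, pv_ub):
    """Detect a box [pu_lo, pu_hi] x [chi_lo, chi_hi] that cannot
    contain a feasible point of the pipe constraint, following the two
    criteria derived from (29) and (30): for nonnegative flow, the
    largest attainable outflow pressure is p_v(pu_hi, chi_lo); for
    nonpositive flow, the smallest is p_v(pu_lo, chi_hi)."""
    pu_lo, pu_hi, chi_lo, chi_hi = box
    if chi_lo >= 0 and p_v(pu_hi, chi_lo) < pv_lb:
        return True  # even the largest outflow pressure is too low
    if chi_hi <= 0 and p_v(pu_lo, chi_hi) > pv_ub:
        return True  # even the smallest outflow pressure is too high
    return False
```

A filtered box can be discarded before the next master problem is built, shrinking the relaxation.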
To compute a sufficiently large value M for the master problem constraints (6c)-(6f), we use the difference between the values that can occur on the right-hand sides of the inequalities and the bounds of the variables on the left-hand side.

The first instance we solve is GasLib-11, which is shown in Figure 3. It contains 11 nodes including 3 entries and 3 exits, 8 pipes, 2 compressor stations, and a single valve. Because the network has a cycle, not all flows on all arcs are known a priori.
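The paper's concrete formula for M is not reproduced in this text, so the following sketch only illustrates the principle of deriving M from right-hand side ranges and variable bounds; the function and its arguments are hypothetical:

```python
def big_m_from_bounds(rhs_max, rhs_min, var_lb, var_ub):
    """A generic sufficiently large M for indicator-activated
    inequalities: take the largest gap between the values the
    right-hand side can attain and the bounds of the variable on the
    left-hand side, so the constraint is vacuous when deactivated.

    This is an illustration of the principle only, not the formula
    actually used for constraints (6c)-(6f).
    """
    return max(rhs_max - var_lb, var_ub - rhs_min, 0.0)
```

Keeping M as small as validity allows is important in practice, since oversized Big-Ms weaken the LP relaxation of the master problem.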
The second instance we solve is GasLib-24; see Figure 4. It consists of 24 nodes including 3 entries and 5 exits, 19 pipes, 3 compressor stations, a single control valve, a single short pipe, and a single resistor, which we replace by a short pipe. There are two cycles in the network. For both instances we tested the values 0.125, 0.25, 0.375, and 0.5 for the parameter λ. We chose the value ε = 0.1 (in bar) for the termination criterion. The resulting runtimes and numbers of iterations are listed in Table 4. In both cases, λ = 0.25 and λ = 0.375 yield better results than λ = 0.125 or λ = 0.5. For λ = 0.125, this can be explained by the fact that the boxes do not shrink fast enough when split because the splitting point can be quite close to the edge of its box. Choosing λ = 0.5, on the other hand, removes the possibility for the splitting points to be closer to the master problem solution and therefore potentially closer to the optimal solution of Problem (1).
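Our reading of the λ-dependent choice of the splitting point in Step 15 can be sketched as follows; this is a hypothetical reconstruction from the discussion above, not the paper's exact rule:

```python
def splitting_point(x_master, lo, hi, lam):
    """Choose the coordinate at which a box edge [lo, hi] is split:
    track the master problem solution's coordinate, but keep the split
    at least a fraction lam away from both edges so that both child
    boxes shrink. lam = 0.5 always splits in the middle, while a small
    lam lets the split sit close to the master solution (and thus close
    to a box edge)."""
    margin = lam * (hi - lo)
    return min(max(x_master, lo + margin), hi - margin)
```

This makes the observed trade-off visible: small λ risks slowly shrinking boxes, while λ = 0.5 ignores the master solution entirely.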
For GasLib-11, the best result was achieved for λ = 0.375, for which an ε-feasible solution was found in 47.41 s. The method terminated after 74 iterations with a mean iteration time of 0.64 s. The mean time to solve the master problem was 0.35 s and the mean time to solve the subproblems was 0.14 s. The remaining 0.16 s were used to (re-)build the model in each iteration.
For GasLib-24, the best result was achieved for λ = 0.25, for which an ε-feasible solution was found in 158.43 s. With 95 iterations, the mean iteration time was 1.67 s. This mean includes 1.05 s to solve the master problem, 0.23 s to solve the subproblems, and 0.39 s to (re-)build the model in each iteration. Figure 5 shows the progress of the maximal error over the course of the iterations for the best case of both test instances. One can see that it falls rapidly in the first iterations. After that, it fluctuates strongly with only a slight downward trend until the threshold max_{i ∈ [p]} |f_i(x_{I_i}) − x_{r_i}| ≤ ε is reached. This behavior is typical, as the approximation progress from splitting a box in Step 15 of Algorithm 1 is reduced as the boxes get smaller. The fluctuations can be explained by the master problem solution switching to a bigger box j_i^k ∈ J_i^k after the previous box has been refined a number of times in Step 15.

Conclusion
In this paper, we developed a successive linear relaxation method for solving MINLPs with nonlinearities that are not given in closed form. Instead, we only assume that these multivariate nonlinearities can be evaluated and that we know their global Lipschitz constants. We illustrated the flexibility of this class of models and of our method by showing that it can be applied to bilevel optimization models with nonlinear and nonconvex lower-level problems as well as to nonconvex MINLPs for gas transport problems that are constrained by differential equations. Moreover, we proved finite termination of the method (at approximate globally optimal solutions) and derived a worst-case iteration bound.
Finally, let us sketch three important topics for future research that are out of the scope of this paper. First, one can weaken the assumptions made in this paper as it is done in [47]. In particular, the assumption of exact function evaluations is rather strong in some applications such as, e.g., in the case of PDE constraints. Second, one could also incorporate general-purpose local solvers to compute feasible points that lead to upper bounds. Together with the lower bounds that we directly get from solving the master problem in every iteration, this could be used to obtain a gap and thus a further termination criterion besides the one based on the approximation accuracy that we used in this paper. Third, it is obvious (and also not expected) that the proposed method is not competitive with MINLP solution methods that explicitly use the structural properties of nonlinearities given in closed form. However, there are many possible ways of improving the general method proposed in this paper. For instance, the incorporation of presolving and dimension-reduction techniques would help a lot to improve the performance of the method.

We note that p^2 + χ^2 R_s T (1 + 2αp) > 0 holds because of (34b). Now, if χ = 0, then p_u = p and ∂p/∂χ = 0 hold. It follows that ∂^2 p/(∂χ ∂p_u) = 0.

Figure 1.
Visualization of the subproblem (S(k)) (left) and of the feasible region of the master problem (M(k)) in the next iteration (right) for a nonlinear function f_i: R → R.

Figure 2 .
Figure 2. Example for box filtering

Lemma 5.3.
Let f: F → R be a partially differentiable function on a compact and convex subset F ⊂ R^d with d ∈ N. Then, f is Lipschitz continuous on F with Lipschitz constant L = 1 w.r.t. the weighted 1-norm ‖x‖_w := Σ_{i=1}^d |x_i| w_i (28) with positive weights w ∈ R^d and w_i ≥ max_{x ∈ F} |∂_{x_i} f(x)| for all i ∈ [d].
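The resulting Lipschitz bound can be sketched directly; the weights are assumed to be valid bounds on the partial derivatives over the box, as in Lemma 5.3:

```python
def weighted_1norm_bound(grad_bounds, x, y):
    """Bound |f(x) - f(y)| via Lemma 5.3: with weights
    w_i >= max_{z in F} |df/dx_i(z)|, f is 1-Lipschitz w.r.t.
    ||v||_w = sum_i |v_i| * w_i, so
    |f(x) - f(y)| <= sum_i w_i * |x_i - y_i|."""
    return sum(w * abs(a - b) for w, a, b in zip(grad_bounds, x, y))
```

For the linear function f(x, y) = 2x + 3y with weights (2, 3), the bound is tight between opposite corners of the unit box.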
4.2.4. Tighter Lipschitz Constants for Box-Constrained Lower Levels. For instances with simple variable bounds on the lower level that are not influenced by upper-level decisions, we can compute tighter Lipschitz constants for ϕ. To this end, we now explicitly take into account bilinear terms of the form x_i y_j in the lower-level objective function and do not reformulate these terms as described in Section 4.2.2. Hence, the lower-level objective function is given by

Thus, L_F is a valid Lipschitz constant.

4.2.5. Big-Ms. As a valid Big-M in the master problem, we use the maximum of ‖ȳ − y̲‖_∞, ‖x̄ − x̲‖_∞, and max

Table 1.
Overview of the considered BASBLib instances. The numbers of lower- and upper-level constraints do not include simple variable bounds or the additional constraints obtained due to the reformulation discussed in Section 4.2.2. Contrarily, the additional variables due to this reformulation are counted.

Table 2.
Computational results with the "slow" method and box filtering. Runtimes are given in seconds.

Table 3.
Computational results with the "fast" method and no box filtering. Runtimes are given in seconds.

Table 4.
The runtimes and numbers of iterations of Algorithm 1 for different parameters λ. If no overall time is given, the time limit of 1000 s was reached before an ε-feasible solution could be found.