1 Introduction

Mixed-integer nonlinear optimization problems (MINLPs) form one of today’s most important classes of optimization models. The reason is at least twofold. First, the capability of modeling nonlinearities makes it possible to include many sophisticated aspects of, e.g., physics, economics, engineering, or medicine. Second, the incorporation of integer variables makes it possible to model discrete decisions such as turning a machine on or off or deciding whether or not to invest in a product. Of course, this modeling flexibility comes at the price of models that are hard to solve for realistically sized instances since MINLPs are NP-hard in general [17, 29]. Nevertheless, the theoretical and algorithmic advances of recent years and decades make it possible today to solve rather large-scale instances to global optimality in a reasonable amount of time [6], in particular if problem-specific structural properties of the model can be exploited algorithmically; see, e.g., [7, 15, 16, 33, 34] for the convex as well as [1, 36, 49, 51] for the nonconvex case.

In addition, there has been a significant amount of research devoted to the cases in which only very few structural assumptions can be exploited. This is the framework considered in this paper, since we assume that certain functions of the model are not given explicitly but can only be evaluated, and that some analytical properties such as Lipschitz continuity are known. To illustrate why this is important, let us sketch three areas of applications in which only few assumptions on the structure of the model (or on specific parts of the model) can be made. First, many mixed-integer optimization problems subject to ordinary or partial differential equations fit into this context. In many cases, approaches in this field are driven by incorporating the so-called control-to-state map into the optimization model to “eliminate” the differential equation from the model; see, e.g., [5, 8, 23, 52]. This mapping, however, cannot be stated explicitly in general, and one thus has to resort to exploiting analytical properties such as Lipschitz continuity and the ability to evaluate the mapping, at least in an approximate way. Second, and closely related to the first example, optimization models incorporating constraints that rely on calls to expensive simulation software can be cast in the framework considered in this paper as well if enough analytic information is known about the input-output mapping of the simulation code; see, e.g., [2, 10]. Third and finally, bilevel optimization problems can also be interpreted as models in which a single constraint makes the problem much harder to solve [12, 13, 14]. In this case, it is the constraint that models the optimality of the decisions of the lower-level (or follower) problem given the decisions of the leader (which is the decision maker modeled in the upper-level problem). The best-reply function, which models the optimal response of the follower, usually cannot be written in closed form, but it can be evaluated (by solving the lower-level problem for a given feasible point of the upper level) and, under suitable assumptions [13], analytical properties such as Lipschitz continuity can be established.

This paper addresses a special class of MINLPs for which closed-form expressions for the nonlinearities are not available but Lipschitz continuity is guaranteed with known Lipschitz constants. Hence, all three areas of application discussed above can be addressed by the method proposed in this paper. Indeed, one part of our contribution is that the case studies presented later in Sects. 4 and 5 explicitly show the applicability of our method to bilevel problems with nonconvex quadratic lower-level problems (Sect. 4) and to problems on gas transport networks that are subject to differential equations (Sect. 5). Both are classes of problems that have received a lot of attention in recent years; see, e.g., [4, 12, 31] and [32, 41]. Before we are able to tackle these problems, we first need to formally state the problem class under consideration, which is what we do in Sect. 2. Afterward, in Sect. 3, we describe the main rationale of the method, present it formally, and analyze it theoretically. The latter leads to a correctness theorem showing that the method terminates finitely with an approximate solution, and we further derive a worst-case iteration bound.

The main contributions of our work are the following. We develop an algorithm that requires only very weak assumptions and can hence be applied to a very large class of problems that cannot be solved with classic MINLP solvers, which require that all constraints of the problem are given in closed form. To illustrate the generality of the method, we present its application to nonlinear gas network optimization as well as to bilevel problems with nonconvex quadratic lower-level problems. Our work should be seen as a generalization of the works [47, 48]. In particular, we generalize [47] to the multidimensional case, for which we present a numerical scheme that is more effective than that of [48] and that uses different geometries for the outer approximation than those used in [47]. Since our main workhorse is the Lipschitz continuity of the nonlinearities, we are still in line with the works [25, 26, 28, 42–44, 53], to name only a few. For a more detailed overview of this field, see the textbook [27] and the references therein as well as [47, 48], in which we discuss the positioning of the method within the literature in more detail.

2 Problem Definition

We consider the problem

$$\begin{aligned} \min _{x} \quad&c^\top x \end{aligned}$$
(1a)
$$\begin{aligned} \text {s.t.} \quad&Ax \ge b, \quad \underline{x} \le x \le \bar{{x}}, \quad x \in \mathbb {R}^{n} \times \mathbb {Z}^{m}, \end{aligned}$$
(1b)
$$\begin{aligned}&f_i(x_{I_i}) = x_{r_i}, \quad i \in [p], \end{aligned}$$
(1c)

where \(c \in \mathbb {R}^{n+m}\), \(A \in \mathbb {R}^{q \times (n+m)}\), and \(b \in \mathbb {R}^q\) are given data, \(\underline{x} \in \mathbb {R}^n \times \mathbb {Z}^m\) and \(\bar{{x}} \in \mathbb {R}^n \times \mathbb {Z}^m\) are finite bounds, and \([p] \mathrel {{\mathop :}{=}}\{1, \dotsc , p\}\). Hence, we consider a linear objective (1a), linear mixed-integer constraints (1b), and nonlinear constraints defined by the functions \(f_i: \mathbb {R}^{l_i} \rightarrow \mathbb {R}\). All \(f_i\), \(i \in [p]\), are Lipschitz continuous functions and \(l_i = |{I_i}|\) is the number of their arguments. Moreover, \(I_i\subset [n]\) is the index set of the variables on which the function \(f_i\) depends. Without loss of generality, we further assume that \(r_i\in [n]\) with \(r_i\notin I_i\) for all \(i \in [p]\). In what follows, we also write \(x_{\mathcal {I}_i} = (x_{I_i}, x_{r_i}) \in \mathbb {R}^{l_i + 1}\).

The main challenge when solving Problem (1) is that the nonlinear functions \(f_i\) are not given in closed form; we can only evaluate them and we know their Lipschitz constants.
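In an implementation, each nonlinearity is thus accessed only through an oracle that evaluates it and reports its Lipschitz constant. A minimal sketch of such an interface in Python (the language of our implementation in Sect. 4.3; all names are illustrative and not part of any library):

from dataclasses import dataclass
from typing import Callable, List
import numpy as np

@dataclass
class LipschitzOracle:
    """Black-box access to one nonlinearity f_i of Problem (1)."""
    evaluate: Callable[[np.ndarray], float]  # returns f_i(x_{I_i})
    L: float                                 # known Lipschitz constant L_i
    I: List[int]                             # argument indices I_i
    r: int                                   # result index r_i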

The \(\varepsilon \)-relaxed version of the original problem (1) is given by

$$\begin{aligned} \min _{x} \quad&c^\top x \end{aligned}$$
(2a)
$$\begin{aligned} \text {s.t.} \quad&Ax \ge b, \quad \underline{x} \le x \le \bar{{x}}, \quad x \in \mathbb {R}^{n} \times \mathbb {Z}^{m}, \end{aligned}$$
(2b)
$$\begin{aligned}&|{f_i(x_{I_i}) - x_{r_i}}| \le \varepsilon , \quad i \in [p], \end{aligned}$$
(2c)

where \(\varepsilon > 0\) is a prescribed tolerance. A feasible point of (2) is called an \(\varepsilon \)-feasible point of Problem (1). Moreover, we call a global solution of (2) an approximate global optimal solution in what follows.

3 The Algorithm

In this section, we introduce an iterative procedure to solve Problem (1) to approximate global optimality. The main idea is to relax all nonlinearities by utilizing the Lipschitz continuity of these functions. In each iteration, the relaxed problem, which we will call the master problem, needs to be solved to global optimality. Subsequently, a subproblem is solved to tighten the relaxation for the next master problem. This procedure is then repeated until an \(\varepsilon \)-feasible solution is found or until it can be shown that the original problem is infeasible.

The master problem in iteration k reads

$$\begin{aligned} \min _{x} \quad&c^\top x \\ \text {s.t.} \quad&Ax \ge b, \quad \underline{x} \le x \le \bar{{x}}, \quad x \in \mathbb {R}^{n} \times \mathbb {Z}^{m}, \\&x_{\mathcal {I}_i} \in \Omega _i^k, \quad i \in [p], \end{aligned}$$
(M(k))

where \(\Omega _i^k\) is a relaxation of the graph of the nonlinearity \(f_i\). This relaxation will be stated in terms of mixed-integer linear constraints; see below. The idea behind this relaxation is to partition the domain of \(f_i\) into a set of boxes that are indexed by \(J_i^k = \{1,\dots , |{J_i^k}|\}\) and to linearly relax the graph over each box with the region \(\Omega _i^k(j)\), \(j \in J_i^k\), using the Lipschitz continuity of \(f_i\). Hence, we obtain

$$\begin{aligned} \Omega _i^k = \bigcup _{j \in J_i^k} \Omega _i^k(j). \end{aligned}$$
(3)

After solving the master problem, the boxes that contain the solution \(x^k\) are identified and split into smaller boxes to get a finer relaxation for the next iteration. The main purpose of solving the subproblem afterward is to find good splitting points for these boxes. To this end, the subproblem determines a point of the graph of each nonlinearity and, at the same time, tries to minimize the distance to the solution of the last master problem. Hence, the subproblem is given by

$$\begin{aligned} \min _{{\tilde{x}}} \quad \Vert {{\tilde{x}} - x^k}\Vert _2^2 \quad \text {s.t.} \quad f_i({\tilde{x}}_{I_i}) = {\tilde{x}}_{r_i}, \quad {\tilde{x}}_{\mathcal {I}_i} \in {\tilde{\Omega }}_i^k(j_i^k), \quad i \in [p], \end{aligned}$$
(4)

where \(j_i^k \in J_i^k\) denotes the box with \(x_{\mathcal {I}_i}^k \in \Omega _i^k(j_i^k)\) for all \(i \in [p]\) and \({\tilde{\Omega }}_i^k(j_i^k)\) is a suitably chosen subregion of \(\Omega _i^k(j_i^k)\). The reason for the use of these subregions will be discussed in detail in Sect. 3.2.

Figure 1 depicts the subproblem (4) with the regions \({\tilde{\Omega }}_i^k(j_i^k)\) and \(\Omega _i^k(j_i^k)\) (left) and the corresponding regions \(\Omega _i^{k+1}(j)\) and \(\Omega _i^{k+1}(j+1)\) of the master problem (M(k)) in the next iteration (right).

Before we continue with the discussion of the master problem, let us briefly explain our index notation. The index i of the respective nonlinearity is written as a subscript, while the iteration index k is written as a superscript. The box index j is written in parentheses because it can depend on both other indices i and k.

Fig. 1: Visualization of the subproblem (4) (left) and of the feasible region of the master problem (M(k)) in the next iteration (right) for a nonlinear function \(f_i:\mathbb {R}\rightarrow \mathbb {R}\)

3.1 Construction of the Master Problem’s Feasible Region

We now describe in detail how we construct the relaxations \(\Omega _i^k\). First, we define the box

$$\begin{aligned} B(\underline{v}, \bar{{v}}) \mathrel {{\mathop :}{=}}\{x \in \mathbb {R}^{d}:\underline{v} \le x \le \bar{{v}}\} \end{aligned}$$

for \(\underline{v}, \bar{{v}} \in \mathbb {R}^{d}\), \(\underline{v} \le \bar{{v}}\), and arbitrary dimension d.

For each \(i \in [p]\), we assume that we are given vectors \(\underline{v}_i^k(j), \bar{{v}}_i^k(j) \in \mathbb {R}^{l_i}\) for \(j \in J_i^k\) such that the boxes \(B(\underline{v}_i^k(j), \bar{{v}}_i^k(j))\) have pairwise disjoint interiors and cover the bounding box of \(x_{I_i}\), i.e., we have

$$\begin{aligned} B\left( \underline{x}_{I_i}, \bar{{x}}_{I_i}\right) = \bigcup _{j \in J_i^k} B\left( \underline{v}_i^k(j), \bar{{v}}_i^k(j)\right) . \end{aligned}$$
(4)

Let \(L_i\) be the Lipschitz constant of \(f_i\) on \(B(\underline{x}_{I_i}, \bar{{x}}_{I_i}) \subset \mathbb {R}^{l_i}\) w.r.t. a given norm \(\Vert {\cdot }\Vert \), where any (weighted) norm in \(\mathbb {R}^{l_i}\) can be used. Let \(m_i^k(j)\) be the center point of the box \(B(\underline{v}_i^k(j), \bar{{v}}_i^k(j))\), i.e.,

$$\begin{aligned} m_i^k(j) = \frac{1}{2} \left( \underline{v}_i^k(j) + \bar{{v}}_i^k(j) \right) \end{aligned}$$

holds. Due to the Lipschitz continuity of \(f_i\), we have

$$\begin{aligned} x_{r_i}&\le f_i(m_i^k(j)) + L_i \Vert {x_{I_i} - m_i^k(j)}\Vert , \\ x_{r_i}&\ge f_i(m_i^k(j)) - L_i \Vert {x_{I_i} - m_i^k(j)}\Vert \end{aligned}$$

for \(x_{I_i} \in B(\underline{v}_i^k(j), \bar{{v}}_i^k(j))\). Since \(\Vert {x_{I_i} - m_i^k(j)}\Vert \) attains its maximum over \(B(\underline{v}_i^k(j), \bar{{v}}_i^k(j))\) at the vertices of the box, we can replace \(x_{I_i}\) with \(\bar{{v}}_i^k(j)\). It thus holds

$$\begin{aligned} x_{r_i}&\le f_i(m_i^k(j)) + L_i \Vert {\bar{{v}}_i^k(j) - m_i^k(j)}\Vert = f_i(m_i^k(j)) + \frac{L_i}{2} \Vert {\bar{{v}}_i^k(j) - \underline{v}_i^k(j)}\Vert , \end{aligned}$$
(5a)
$$\begin{aligned} x_{r_i}&\ge f_i(m_i^k(j)) - L_i \Vert {\bar{{v}}_i^k(j) - m_i^k(j)}\Vert = f_i(m_i^k(j)) - \frac{L_i}{2} \Vert {\bar{{v}}_i^k(j) - \underline{v}_i^k(j)}\Vert \end{aligned}$$
(5b)

for \(x_{I_i} \in B(\underline{v}_i^k(j), \bar{{v}}_i^k(j))\). With this, we can define the region \(\Omega _i^k(j)\) for \(j \in J_i^k\) as the box

$$\begin{aligned} \begin{aligned} \Omega _i^k(j) \mathrel {{\mathop :}{=}}\Big \{(x_{I_i}, x_{r_i}) \in \mathbb {R}^{l_i + 1} :&\underline{v}_i^k(j) \le x_{I_i} \le \bar{{v}}_i^k(j),\\&x_{r_i} \le f_i(m_i^k(j)) + \frac{1}{2}L_i \Vert {\bar{{v}}_i^k(j) - \underline{v}_i^k(j)}\Vert ,\\&x_{r_i} \ge f_i(m_i^k(j)) - \frac{1}{2}L_i \Vert {\bar{{v}}_i^k(j) - \underline{v}_i^k(j)}\Vert \Big \}. \end{aligned} \end{aligned}$$

The next step is to define the region \(\Omega _i^k\) as the union of all \(\Omega _i^k(j)\); see (3). The covering property (4) yields a covering of the bounding box of \(x_{I_i}\), and the bounds (5) yield bounds for \(x_{r_i}\). Hence, it follows that \(\Omega _i^k\) is a relaxation of the graph of the function \(f_i\).
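To make this construction concrete, the following sketch computes the two bounds in (5) from a single evaluation of \(f_i\) at the box center (Python, which is also the language of our implementation in Sect. 4.3; the Euclidean norm is used here, but any fixed norm works, and all names are illustrative):

import numpy as np

def omega_bounds(f_i, L_i, v_lo, v_hi):
    """Bounds (5a)-(5b) on x_{r_i} over the box B(v_lo, v_hi)."""
    m = 0.5 * (v_lo + v_hi)                           # box center m_i^k(j)
    radius = 0.5 * L_i * np.linalg.norm(v_hi - v_lo)  # (L_i / 2) ||v_hi - v_lo||
    f_m = f_i(m)                                      # one oracle evaluation
    return f_m - radius, f_m + radius                 # extent in x_{r_i}-direction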

Proposition 3.1

For all \(i \in [p]\) and all k, the graph of \(f_i\) over \(B(\underline{x}_{I_i}, \bar{{x}}_{I_i})\) satisfies \({{\,\textrm{graph}\,}}(f_i) \subseteq \Omega _i^k\).

For what follows, we abbreviate

$$\begin{aligned} \mathcal {X}_i^k \mathrel {{\mathop :}{=}}\left\{ B(\underline{v}_i^k(1), \bar{{v}}_i^k(1)), \dots , B(\underline{v}_i^k(|{J_i^k}|), \bar{{v}}_i^k(|{J_i^k}|))\right\} , \end{aligned}$$

i.e., \(\mathcal {X}_i^k\) is the set of boxes that is used to define \(\Omega _i^k\). In our algorithm, we use \(\Omega _i^k\) to replace the nonlinear constraints \(f_i(x_{I_i}) = x_{r_i}\) for all \(i \in [p]\) in Problem (1) in order to obtain a relaxation.

Lemma 3.2

The master problem (M(k)) can be modeled as a mixed-integer linear problem.

Proof

We can write (M(k)) using the following Big-M formulation:

$$\begin{aligned} \min _{x, z} \quad&c^\top x \end{aligned}$$
(6a)
$$\begin{aligned} \text {s.t.} \quad&Ax \ge b, \quad \underline{x} \le x \le \bar{{x}}, \quad x \in \mathbb {R}^{n} \times \mathbb {Z}^{m}, \end{aligned}$$
(6b)
$$\begin{aligned}&x_{I_i} \ge \underline{v}_i^k(j) - M (1 - z_i^k(j)),&i \in [p], \ j \in J_i^k, \end{aligned}$$
(6c)
$$\begin{aligned}&x_{I_i} \le \bar{{v}}_i^k(j) + M (1 - z_i^k(j)),&i \in [p], \ j \in J_i^k, \end{aligned}$$
(6d)
$$\begin{aligned}&x_{r_i} \le f_i(m_i^k(j)) + \frac{1}{2}L_i \Vert {\bar{{v}}_i^k(j) - \underline{v}_i^k(j)}\Vert + M (1 - z_i^k(j)),&i \in [p], \ j \in J_i^k, \end{aligned}$$
(6e)
$$\begin{aligned}&x_{r_i} \ge f_i(m_i^k(j)) - \frac{1}{2}L_i \Vert {\bar{{v}}_i^k(j) - \underline{v}_i^k(j)}\Vert - M (1 - z_i^k(j)),&i \in [p], \ j \in J_i^k, \end{aligned}$$
(6f)
$$\begin{aligned}&\sum _{j \in J_i^k} z_i^k(j) = 1,&i \in [p], \end{aligned}$$
(6g)
$$\begin{aligned}&z_i^k(j) \in \{0, 1\},&i \in [p], \ j \in J_i^k. \end{aligned}$$
(6h)

The rationale of this model is as follows. For each nonlinearity \(i \in [p]\) and each box \(j \in J_i^k\), we introduce a binary variable \(z_i^k(j)\) that indicates whether the solution lies in this box or not. If \(z_i^k(j) = 1\), the constraints (6c)–(6f) are equivalent to the definition of \(\Omega _i^k(j)\). If \(z_i^k(j) = 0\), then (6c)–(6f) are always fulfilled if the constant M is chosen sufficiently large. Constraint (6g) finally ensures that for each nonlinearity \(i \in [p]\) exactly one box \(j \in J_i^k\) is chosen. \(\square \)

Let us remark that it is always possible in our setting to obtain finite and sufficiently large values of M by using the finite bounds on the variables in (1b). Also note that the present algorithm is a generalization of the algorithm in [47], with the main change being the use of boxes instead of polytopes. This change is necessary because the number of constraints needed to model the polytopes would increase exponentially with the number of arguments \(l_i\) of the nonlinearity \(f_i\).
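To illustrate the proof’s construction, the following sketch adds the constraints (6c)–(6h) for a single univariate nonlinearity (\(l_i = 1\)) to a Pyomo model; Pyomo is also used in our implementation (Sect. 4.3), but the function and all names below are ours and purely illustrative:

import pyomo.environ as pyo

def add_master_boxes(model, x_arg, x_res, boxes, f_mid, rad, M):
    """Big-M model (6) of (x_arg, x_res) in Omega_i^k for one univariate
    nonlinearity (l_i = 1); call once per nonlinearity on fresh names.

    boxes : list of intervals (lo, hi) covering the bounding box of x_arg
    f_mid : f_i evaluated at each interval midpoint m_i^k(j)
    rad   : (L_i / 2) * (hi - lo) for each interval, cf. (5)
    M     : sufficiently large constant derived from the bounds in (1b)
    """
    J = range(len(boxes))
    model.z = pyo.Var(J, domain=pyo.Binary)                               # (6h)
    model.one_box = pyo.Constraint(expr=sum(model.z[j] for j in J) == 1)  # (6g)
    model.arg_lo = pyo.Constraint(J, rule=lambda m, j:
        x_arg >= boxes[j][0] - M * (1 - m.z[j]))                          # (6c)
    model.arg_hi = pyo.Constraint(J, rule=lambda m, j:
        x_arg <= boxes[j][1] + M * (1 - m.z[j]))                          # (6d)
    model.res_hi = pyo.Constraint(J, rule=lambda m, j:
        x_res <= f_mid[j] + rad[j] + M * (1 - m.z[j]))                    # (6e)
    model.res_lo = pyo.Constraint(J, rule=lambda m, j:
        x_res >= f_mid[j] - rad[j] - M * (1 - m.z[j]))                    # (6f)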

3.2 Construction of the Subproblem

We now introduce the chosen subregion of \(\Omega _i^k(j)\) used in the subproblem (4). We define this region as

$$\begin{aligned} {\tilde{\Omega }}_i^k(j) = \Omega _i^k(j) \cap {\hat{\Omega }}_i^k(j), \end{aligned}$$

using the further subregion

$$\begin{aligned} {\hat{\Omega }}_i^k(j) = \{(x_{I_i}, x_{r_i}) \in \mathbb {R}^{l_i + 1}:(1 - \lambda ) \underline{v}_i^k(j) + \lambda \bar{{v}}_i^k(j) \le x_{I_i} \le \lambda \underline{v}_i^k(j) + (1 - \lambda ) \bar{{v}}_i^k(j)\}, \end{aligned}$$

for some \(\lambda \in (0, 1/2]\). This ensures that the solution of the subproblem (4) cannot come arbitrarily close to the boundary of the chosen box.

Let us also note that, for each iteration \(k \in \mathbb {N}\), the subproblem (4) can be separated into p smaller problems under reasonable assumptions. If the index sets \(\mathcal {I}_i= (I_i, r_i)\) are non-overlapping, i.e.,

$$\begin{aligned} \left( I_i\cup \{r_i\}\right) \cap \left( I_j \cup \{r_j\}\right) = \emptyset \quad \text {for all} \quad i,j \in [p],\ i \ne j, \end{aligned}$$
(7)

these smaller problems can be solved in parallel. The above assumption can always be satisfied by introducing additional auxiliary variables.

Lemma 3.3

Suppose that the index sets \(\mathcal {I}_i= (I_i, r_i)\) are non-overlapping, i.e., (7) holds. Then, the subproblem (4) is completely separable, i.e., we can solve the subproblem in iteration k by solving the p smaller problems given by

$$\begin{aligned} \min _{{\tilde{x}}_{\mathcal {I}_i}} \quad \Vert {{\tilde{x}}_{\mathcal {I}_i} - x_{\mathcal {I}_i}^k}\Vert _2^2 \quad \mathrm{s.t.} \quad f_i({\tilde{x}}_{I_i}) = {\tilde{x}}_{r_i}, \quad {\tilde{x}}_{\mathcal {I}_i} \in {\tilde{\Omega }}_i^k(j_i^k). \end{aligned}$$
(8)

Proof

The constraints of (4), i.e.,

$$\begin{aligned} f_i({\tilde{x}}_{I_i}) = {\tilde{x}}_{r_i}, \quad {\tilde{x}}_{\mathcal {I}_i} \in {\tilde{\Omega }}_i^k(j_i^k), \quad i \in [p], \end{aligned}$$

completely decouple along \(i \in [p]\) and so does the objective function

$$\begin{aligned} \Vert {{\tilde{x}} - x^k}\Vert _2^2 = \sum _{i \in [p]} \Vert {{\tilde{x}}_{\mathcal {I}_i} - x_{\mathcal {I}_i}^k}\Vert _2^2. \end{aligned}$$

Therefore, the solution of the subproblem (4) can be obtained by solving Problem (8) for all \(i \in [p]\). \(\square \)

Note that, in the formal sense of complexity theory, the subproblem can be as hard to solve as the originally given MINLP. However, the split into multiple and thus much smaller subproblems can make a huge difference in practice. Moreover, we completely separate the mixed-integer aspects from the nonlinear aspects of the problem, which can also be very helpful for solving the subproblems although they remain hard in the formal sense.
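Since feasible (not necessarily globally optimal) points of the subproblem suffice for the correctness of the method (see Sect. 3.3), each separated problem (8) can, for instance, be attacked with a local NLP solver. The following sketch eliminates \({\tilde{x}}_{r_i} = f_i({\tilde{x}}_{I_i})\), which is valid because \({\tilde{\Omega }}_i^k(j)\) restricts \(\Omega _i^k(j)\) only in the \(x_{I_i}\)-coordinates; SciPy’s bound-constrained local solver is one possible choice, and all names are illustrative:

import numpy as np
from scipy.optimize import minimize

def solve_subproblem(f_i, xk_I, xk_r, lo, hi, lam=0.25):
    """Local solve of the separated subproblem (8) for one nonlinearity.

    Minimizes the squared distance to the master solution (xk_I, xk_r)
    over the lambda-shrunken box of Sect. 3.2 after eliminating
    x_r = f_i(x_I); a local solution suffices for correctness.
    """
    lo_s = (1 - lam) * lo + lam * hi          # shrunken lower corner
    hi_s = lam * lo + (1 - lam) * hi          # shrunken upper corner
    obj = lambda xI: np.sum((xI - xk_I) ** 2) + (f_i(xI) - xk_r) ** 2
    res = minimize(obj, x0=np.clip(xk_I, lo_s, hi_s),
                   bounds=list(zip(lo_s, hi_s)))
    return res.x, float(f_i(res.x))           # splitting point on graph(f_i)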

3.3 Formal Statement of the Algorithm

Before we can formally introduce the algorithm, we need the following notation. Let \(B(\underline{v}, \bar{{v}}) \subseteq \mathbb {R}^{d}\) be a box with an interior point \(x \in {{\,\textrm{int}\,}}B(\underline{v}, \bar{{v}})\), i.e., \(\underline{v}< x < \bar{{v}}\). The point x splits the box into a set of boxes that we define as

$$\begin{aligned} S(\underline{v}, \bar{{v}}, x) \mathrel {{\mathop :}{=}}\{B(\underline{w}, \bar{{w}}):(\underline{w}_\ell = \underline{v}_\ell \wedge \bar{{w}}_\ell = x_\ell ) \vee (\underline{w}_\ell = x_\ell \wedge \bar{{w}}_\ell = \bar{{v}}_\ell ) \text { for all } \ell \in [d]\}. \end{aligned}$$

We can utilize this notation to obtain a finer relaxation of \({{\,\textrm{graph}\,}}(f_i)\) by splitting an element of \(\mathcal {X}_i^k\) using the solution of the subproblem (4) as the splitting point. This yields a set of smaller boxes that still fulfills the covering condition (4), as the following proposition states.

Proposition 3.4

For a given box \(B(\underline{v}, \bar{{v}}) \subset \mathbb {R}^{l_i}\) and an interior point \(x \in {{\,\textrm{int}\,}}B(\underline{v}, \bar{{v}})\), the set \(S(\underline{v}, \bar{{v}}, x)\) contains \(2^{l_i}\) smaller boxes that have pairwise disjoint interiors and that completely cover the box \(B(\underline{v}, \bar{{v}})\), i.e.,

$$\begin{aligned} B(\underline{v}, \bar{{v}}) = \bigcup _{b \in S(\underline{v}, \bar{{v}}, x)} b. \end{aligned}$$
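A minimal sketch of this splitting operation in Python (illustrative names; the box corners are NumPy arrays):

import itertools
import numpy as np

def split_box(v_lo, v_hi, x):
    """The set S(v_lo, v_hi, x): all 2^d subboxes of B(v_lo, v_hi)
    induced by the interior point x."""
    d = len(x)
    boxes = []
    for choice in itertools.product((False, True), repeat=d):
        mask = np.array(choice)
        lo = np.where(mask, x, v_lo)   # coordinate l: [x_l, v_hi_l] if chosen,
        hi = np.where(mask, v_hi, x)   # ... and [v_lo_l, x_l] otherwise
        boxes.append((lo, hi))
    return boxes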
[Algorithm 1 (pseudocode figure): solve the master problem (M(k)), check \(\varepsilon \)-feasibility, solve the subproblem (4), and split the boxes of violated nonlinearities.]

We can now present the complete method, which is given in Algorithm 1. Before we prove its correctness, let us first discuss its basic functionality. After the master problem (M(k)) is solved in Step 3, it is checked in Step 8 whether its solution is already \(\varepsilon \)-feasible for the original problem. To determine the boxes \(j_i^k \in J_i^k\) in Step 11, one can simply check the indicator variables \(z_i^k(j)\) of the MIP formulation (6). If the solution is not yet \(\varepsilon \)-feasible, then there are nonlinearities \(f_i\) whose feasibility violations are larger than \(\varepsilon \). For these nonlinearities, we refine the relaxation of the master problem in Step 15 and re-iterate.
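The following sketch condenses this functionality for the case of a single nonlinearity (p = 1); it reuses split_box and solve_subproblem from the sketches above, and solve_master is assumed to solve the MIP formulation (6) for the current box collection and to return a solution together with the index of its active box, or None if the master problem is infeasible:

def successive_relaxation(f, boxes0, eps, solve_master):
    """Illustrative condensation of Algorithm 1 for p = 1 nonlinearity."""
    boxes = list(boxes0)
    while True:
        sol = solve_master(boxes)               # solve (M(k))
        if sol is None:
            return None                         # Problem (1) is infeasible
        xI, xr, j = sol
        if abs(f(xI) - xr) <= eps:              # eps-feasibility check
            return xI, xr                       # approximate global solution
        xI_new, _ = solve_subproblem(f, xI, xr, *boxes[j])   # splitting point
        boxes = boxes[:j] + boxes[j + 1:] + split_box(*boxes[j], xI_new)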

Note that it is not necessary for the correctness of Algorithm 1 to solve the subproblem (4) to global optimality. Our rationale, however, is that optimal solutions of the subproblems yield better splitting points that lead to faster convergence in practice. For the correctness of the algorithm, it is sufficient to find feasible points of (4) that are guaranteed to exist due to the following lemma.

Lemma 3.5

All subproblems (4) are feasible if Property (7) is satisfied.

Proof

Because of Property (7), the nonlinear constraints \(f_i({\tilde{x}}_{I_i}) = {\tilde{x}}_{r_i}\) in (4) do not depend on each other. From Proposition 3.1, we know that the graph of \(f_i\) over the initial region \(B(\underline{x}_{I_i}, \bar{{x}}_{I_i})\) lies entirely in \(\Omega _i^k\). Since the nonempty subregion \({\tilde{\Omega }}_i^k(j)\) restricts \(\Omega _i^k\) in all but the last dimension, the claim follows. \(\square \)

Next, we prove that Algorithm 1 always terminates after finitely many iterations.

Theorem 3.6

There exists a constant \(K < \infty \) such that Algorithm 1 either terminates with an approximate global optimal solution \(x^{k^*}\) or with the indication of infeasibility in an iteration \(k^* \le K\).

Proof

The box \(\Omega _i^k(j)\) is bounded for each iteration k and all \(i \in [p]\) and \(j \in J_i^k\). For the \(x_{r_i}\)-coordinate, the bounding inequalities are given by

$$\begin{aligned} f_i(m_i^k(j)) - \frac{1}{2}L_i \Vert {\bar{{v}}_i^k(j) - \underline{v}_i^k(j)}\Vert \le x_{r_i} \le f_i(m_i^k(j)) + \frac{1}{2}L_i \Vert {\bar{{v}}_i^k(j) - \underline{v}_i^k(j)}\Vert . \end{aligned}$$

Therefore, the corresponding side length of the box \(\Omega _i^k(j)\) in its \(x_{r_i}\)-coordinate is

$$\begin{aligned} d_{r_i}^k(j) \mathrel {{\mathop :}{=}}L_i \Vert {\bar{{v}}_i^k(j) - \underline{v}_i^k(j)}\Vert . \end{aligned}$$

If \(d_{r_i}^k(j) \le \varepsilon \) holds, then the inequality

$$\begin{aligned} |{f_i(x_{I_i}^k) - x_{r_i}^k}| > \varepsilon \end{aligned}$$

in Step 14 of Algorithm 1 cannot be fulfilled. It follows that the box \(B(\underline{v}_i^k(j_i^k), \bar{{v}}_i^k(j_i^k))\) will not be split but remains the same in Step 17.

Next, we analyze how \(d_{r_i}^k(j)\) changes if a box is split into smaller boxes. Let

$$\begin{aligned} B\left( \underline{v}_i^{k+1}(j), \bar{{v}}_i^{k+1}(j)\right) \in S\left( \underline{v}_i^k(j_i^k), \bar{{v}}_i^k(j_i^k), {\tilde{x}}_i^k\right) \end{aligned}$$

be one of the smaller boxes that is added to \(\mathcal {X}_i^k\) in Step 15. The side length \(d_{r_i}^{k+1}(j)\) of the corresponding box \(\Omega _i^{k+1}(j)\) can be bounded from above via

$$\begin{aligned} d_{r_i}^{k+1}(j) &= L_i \Vert {\bar{{v}}_i^{k+1}(j) - \underline{v}_i^{k+1}(j)}\Vert \\ &\le (1 - \lambda ) L_i \Vert {\bar{{v}}_i^k(j_i^k) - \underline{v}_i^k(j_i^k)}\Vert = (1 - \lambda ) d_{r_i}^k(j_i^k). \end{aligned}$$

This means that \(d_{r_i}^k(j)\) is decreased by at least the factor \((1 - \lambda )\) each time a box is split in Step 15. Since \(\left( (1 - \lambda )^k\right) _{k \in \mathbb {N}}\) is a geometric sequence with \(|{1 - \lambda }| < 1\), it converges to zero, i.e., \((1 - \lambda )^k \rightarrow 0\) as \(k \rightarrow \infty \). It follows that any box \(B(\underline{v}_i^k(j), \bar{{v}}_i^k(j))\) (including the first box \(B(\underline{x}_{I_i}, \bar{{x}}_{I_i})\)) can only be split finitely many times in Step 15 before the side length \(d_{r_i}^k(j)\) of \(\Omega _i^k(j)\) fulfills \(d_{r_i}^k(j) \le \varepsilon \).

Moreover, for all \(i \in [p]\) and all k, there are only finitely many boxes in \(\mathcal {X}_i^k\), and each of these boxes can only be split finitely many times. Hence, there exists an iteration \(K < \infty \) in which no box is split in Step 15. This, however, can only be the case if the if-condition in Step 14 does not hold for any \(i \in [p]\). Thus, we have

$$\begin{aligned} |{f_i(x_{I_i}^K) - x_{r_i}^K}| \le \varepsilon \quad \text {for all } i \in [p], \end{aligned}$$

which is the if-condition in Step 8. This means that the algorithm terminates in Step 9 in an iteration K. Hence, it follows that there exists a \(K < \infty \) such that Algorithm 1 terminates in Step 5 or 9 in an iteration \(k^* \le K\). \(\square \)

We close this section by stating and proving a result for the worst-case number of required iterations of Algorithm 1.

Theorem 3.7

Algorithm 1 terminates after at most

$$\begin{aligned} K = \sum _{i \in [p]} \sum _{k = 0}^{S_i} 2^{k l_i} \end{aligned}$$

iterations with

$$\begin{aligned} S_i = \left\lceil \log _{(1 - \lambda )} \left( \frac{\varepsilon }{L_i \Vert {\bar{{x}}_{I_i} - \underline{x}_{I_i}}\Vert }\right) \right\rceil \end{aligned}$$

for \(i \in [p]\).

Proof

From the proof of Theorem 3.6, we know that a box \(B(\underline{v}_i^k(j_i^k), \bar{{v}}_i^k(j_i^k))\) can only be split finitely many times before the side length \(d_{r_i}^k(j)\) of \(\Omega _i^k(j)\) satisfies \(d_{r_i}^k(j) \le \varepsilon \). We can give an upper bound for how many iterations this takes for the first box \(B(\underline{x}_{I_i}, \bar{{x}}_{I_i})\) by solving the equation

$$\begin{aligned} (1 - \lambda )^k L_i \Vert {\bar{{x}}_{I_i} - \underline{x}_{I_i}}\Vert = \varepsilon \end{aligned}$$

for k, which yields

$$\begin{aligned} k = \log _{(1 - \lambda )} \left( \frac{\varepsilon }{L_i \Vert {\bar{{x}}_{I_i} - \underline{x}_{I_i}}\Vert }\right) . \end{aligned}$$

Since the box \(B(\underline{v}_i^k(j_i^k), \bar{{v}}_i^k(j_i^k))\) will not be split anymore once \(d_{r_i}^k(j) \le \varepsilon \) holds, we can round this value up to obtain

$$\begin{aligned} S_i = \left\lceil \log _{(1 - \lambda )} \left( \frac{\varepsilon }{L_i \Vert {\bar{{x}}_{I_i} - \underline{x}_{I_i}}\Vert }\right) \right\rceil . \end{aligned}$$

For each box that is split, there are \(2^{l_i}\) smaller boxes that are added to \(\mathcal {X}_i^k\). Therefore, for each \(i \in [p]\) the maximal number of iterations required until there are no boxes left in \(\mathcal {X}_i^k\) that can be split is bounded from above by

$$\begin{aligned} \sum _{k = 0}^{S_i} \left( 2^{l_i} \right) ^k = \sum _{k = 0}^{S_i} 2^{k \, l_i}. \end{aligned}$$
(9)

Since it is possible that in each iteration, there is only a single nonlinearity \(i \in [p]\) for which a box is split, we have to sum up (9) for each \(i \in [p]\) to get

$$\begin{aligned} K = \sum _{i \in [p]} \sum _{k = 0}^{S_i} 2^{k l_i} \end{aligned}$$

as an upper bound for the required number of iterations of Algorithm 1. \(\square \)

Remark 3.8

Theorem 3.7 states that choosing \(\lambda = 0.5\) results in the lowest worst-case number of iterations. In this case, no subproblem (4) needs to be solved, since one can simply evaluate the nonlinearity \(f_i\) at the center point \(m_i^k(j_i^k)\) of the current box to obtain the splitting point. However, in practice it can be better to choose a smaller parameter \(\lambda \), which allows the splitting point to be closer to the master problem’s solution and may thus result in a finer approximation of the nonlinearity near the optimal solution of Problem (1).
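For illustration, the worst-case bound of Theorem 3.7 can be evaluated numerically; the instance data in the following sketch are purely hypothetical:

import math

def worst_case_iterations(L, diam, dims, eps, lam=0.5):
    """Evaluates the bound K of Theorem 3.7; per nonlinearity i, L, diam,
    and dims contain L_i, the initial box diameter, and the dimension l_i."""
    K = 0
    for L_i, D_i, l_i in zip(L, diam, dims):
        S_i = math.ceil(math.log(eps / (L_i * D_i), 1 - lam))
        K += sum(2 ** (k * l_i) for k in range(S_i + 1))
    return K

# Hypothetical data: two bivariate nonlinearities with L_i = 2 and initial
# diameter 10; the bound grows quickly with l_i and with 1/eps.
print(worst_case_iterations([2, 2], [10, 10], [2, 2], eps=1e-1))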

4 Application to Nonlinear Bilevel Problems

The method developed in the previous section can be applied to nonlinear bilevel problems with nonconvex lower-level models, which is an extremely challenging class of problems. To illustrate this, we consider optimistic MIQP-QP bilevel problems of the form

$$\begin{aligned} \begin{aligned} \min _{x, y} \quad&\frac{1}{2} x^\top H_\textrm{u}x+ c_\textrm{u}^\top x+ \frac{1}{2} y^\top G_\textrm{u}y+ d_\textrm{u}^\top y\\ \text {s.t.} \quad&A x+ B y\le a, \quad \underline{x} \le x \le \bar{{x}}, \quad x\in \mathbb {R}^{{n_x}}, \\&y\in {{\,\mathrm{arg\,min}\,}}_{\tilde{y}} \bigg \{ \frac{1}{2} \tilde{y}^\top G_\textrm{l}\tilde{y}+ d_\textrm{l}^\top \tilde{y}: C x+ D \tilde{y}\le b, \ \underline{y} \le \tilde{y}\le \bar{{y}}, \ \tilde{y}\in \mathbb {R}^{n_y}\bigg \}, \end{aligned} \end{aligned}$$
(10)

where \(x\in \mathbb {R}^{n_x}\) and \(y\in \mathbb {R}^{n_y}\) denote the upper- and lower-level variables, which are finitely bounded by \(\underline{x}\), \(\bar{{x}}\), \(\underline{y}\), and \(\bar{{y}}\). Further, we have matrices \(A \in \mathbb {R}^{{m_\textrm{u}}\times {n_x}}\), \(B \in \mathbb {R}^{{m_\textrm{u}}\times {n_y}}\), \(C\in \mathbb {R}^{{m_\textrm{l}}\times {n_x}}\), \(D\in \mathbb {R}^{{m_\textrm{l}}\times {n_y}}\), as well as right-hand side vectors \(a\in \mathbb {R}^{m_\textrm{u}}\) and \(b\in \mathbb {R}^{m_\textrm{l}}\). In addition, we have \(c_\textrm{u}\in \mathbb {R}^{n_x}\) and \(d_\textrm{u}, d_\textrm{l}\in \mathbb {R}^{n_y}\). Finally, \(H_\textrm{u}\in \mathbb {R}^{{n_x}\times {n_x}}\), \(G_\textrm{u}\in \mathbb {R}^{{n_y}\times {n_y}}\) are positive semidefinite and symmetric matrices, while \(G_\textrm{l}\in \mathbb {R}^{{n_y}\times {n_y}}\) is a possibly indefinite and symmetric matrix. Thus, the upper level is a convex-quadratic problem over linear constraints and the lower-level problem

$$\begin{aligned} \min _{\tilde{y}} \quad \frac{1}{2} \tilde{y}^\top G_\textrm{l}\tilde{y}+ d_\textrm{l}^\top \tilde{y}\quad \text {s.t.} \quad C x+ D \tilde{y}\le b, \ \underline{y} \le \tilde{y}\le \bar{{y}}, \ \tilde{y}\in \mathbb {R}^{n_y}, \end{aligned}$$
(11)

is an \(x\)-parameterized, continuous, but nonconvex quadratic problem. Let \(\varphi (\cdot )\) be the optimal-value function of the lower level, i.e.,

$$\begin{aligned} \varphi ({x}) \mathrel {{\mathop :}{=}}\min _{\tilde{y}} \bigg \{ \frac{1}{2} \tilde{y}^\top G_\textrm{l}\tilde{y}+ d_\textrm{l}^\top \tilde{y}: C x+ D \tilde{y}\le b, \ \underline{y} \le \tilde{y}\le \bar{{y}}, \ \tilde{y}\in \mathbb {R}^{n_y}\bigg \}. \end{aligned}$$

With this, we can rewrite Problem (10) equivalently as the single-level problem

$$\begin{aligned} \min _{x, y} \quad&\frac{1}{2} x^\top H_\textrm{u}x+ c_\textrm{u}^\top x+ \frac{1}{2} y^\top G_\textrm{u}y+ d_\textrm{u}^\top y\end{aligned}$$
(12a)
$$\begin{aligned} \text {s.t.} \quad&A x+ B y\le a, \quad C x+ D y\le b, \end{aligned}$$
(12b)
$$\begin{aligned}&\underline{x} \le x \le \bar{{x}}, \quad \underline{y} \le y \le \bar{{y}}, \quad x\in \mathbb {R}^{n_x}, \quad y\in \mathbb {R}^{n_y}, \end{aligned}$$
(12c)
$$\begin{aligned}&\frac{1}{2} y^\top G_\textrm{l}y+ d_\textrm{l}^\top y\le \varphi (x), \end{aligned}$$
(12d)

see, e.g., [13]. We now reformulate Problem (12) so that it fits into the framework introduced above. To this end, we introduce the auxiliary variables \(\eta _1\) and \(\eta _2\) as well as the nonlinear function \(f: \mathbb {R}^{{n_y}} \rightarrow \mathbb {R}\) with \(f(y) = 1/2\, y^\top G_\textrm{l}y+ d_\textrm{l}^\top y\). Based on this notation, Problem (12) can be restated as

$$\begin{aligned} \min _{x, y, \eta _1, \eta _2} \quad&\frac{1}{2} x^\top H_\textrm{u}x+ c_\textrm{u}^\top x+ \frac{1}{2} y^\top G_\textrm{u}y+ d_\textrm{u}^\top y\end{aligned}$$
(13a)
$$\begin{aligned} \text {s.t.} \quad&A x+ B y\le a, \quad C x+ D y\le b, \end{aligned}$$
(13b)
$$\begin{aligned}&\underline{x} \le x \le \bar{{x}}, \quad \underline{y} \le y \le \bar{{y}}, \quad x\in \mathbb {R}^{n_x}, \quad y\in \mathbb {R}^{n_y}, \end{aligned}$$
(13c)
$$\begin{aligned}&\eta _2 - \eta _1 \le 0, \quad \varphi (x) = \eta _1, \quad f(y) = \eta _2, \quad \eta _1, \eta _2 \in \mathbb {R}. \end{aligned}$$
(13d)

Now, the method developed in Sect. 3 can be applied to (13) if (i) the nonconvex functions \(\varphi \) and f are Lipschitz continuous on the projections of the bilevel constraint region onto the decision spaces of the upper and lower level, respectively, i.e., on the domains

$$\begin{aligned} {\mathcal {F}}_x&\mathrel {{\mathop :}{=}}\left\{ x\in [\underline{x}, \bar{{x}}]:\exists y\in \mathbb {R}^{n_y}\text { such that } A x+ B y\le a, \, C x+ D y\le b, \, \underline{y} \le y \le \bar{{y}}\right\} , \\ {\mathcal {F}}_y&\mathrel {{\mathop :}{=}}\left\{ y\in [\underline{y}, \bar{{y}}]:\exists x\in \mathbb {R}^{n_x}\text { such that } A x+ B y\le a, \, C x+ D y\le b, \, \underline{x} \le x \le \bar{{x}}\right\} , \end{aligned}$$

and if (ii) the associated Lipschitz constants are computable.

What makes things more complicated compared to the general setup described in Sect. 3 is that we can only evaluate the optimal-value function \(\varphi (x)\) but cannot optimize over it. Thus, we cannot use the subproblem (4) directly to obtain a new splitting point. However, if we take the box center m as the new splitting point, solving the subproblem (4) reduces to evaluating \(\varphi (m)\). More precisely, using the box center corresponds to setting \(\lambda = 1/2\). This, however, is only applicable if \(\varphi (m)\) is well-defined, which we ensure with the following assumption.

Assumption 1

The set \({\mathcal {T}}(x) \mathrel {{\mathop :}{=}}\{ y\in \mathbb {R}^{{n_y}}:D y\le b - C x, \, \underline{y} \le y \le \bar{{y}}\}\) is nonempty for all \(x\in B(\underline{x}, \bar{{x}})\).

This assumption implies that \(-\infty< \varphi (x) < +\infty \) holds for all \(x\in B(\underline{x}, \bar{{x}})\), i.e., a minimizer \(y\) of Problem (11) exists for all \(x\in B(\underline{x}, \bar{{x}})\) and, thus, for every possible box center m.

Before we present the theoretical developments that are required to apply our method to the introduced class of bilevel problems, let us briefly discuss that approximate solutions of bilevel problems with nonconvex lower-level problems need to be interpreted with some care. In particular, it is shown in [3] that lower-level solutions that are only \(\varepsilon \)-feasible can lead to upper-level solutions that are arbitrarily far away from actual bilevel solutions. Since this also applies to our method, we later always report the difference between our solutions and the known optimal solutions of the bilevel problems in our test set.

4.1 Lipschitz Continuity Properties

To apply our method with the outlined modifications for the subproblem to Problem (13), it remains to show that the properties (i) and (ii) are fulfilled. We start with the nonconvex function f. Since the relevant domain \(B(\underline{y}, \bar{{y}})\) of this function is compact, continuous differentiability of f implies global Lipschitz continuity of f on this set. Since \(B(\underline{y}, \bar{{y}})\) is convex and compact, the tightest Lipschitz constant can be computed by solving the optimization problem

$$\begin{aligned} \max _{y\in B(\underline{y}, \bar{{y}})} || G_\textrm{l}y+ d_\textrm{l}||. \end{aligned}$$
(14)

Note that it would also be possible to compute the Lipschitz constant in (14) over the feasible set of the master problem, i.e., over the set \({\mathcal {F}}_y\). However, this requires solving an optimization problem not over a simple box but over a more complex polytope. In our computational study, we test both variants. We refer to the former as the “fast” method and to the latter as the “slow” method. In the absence of lower- and upper-level constraints except for simple variable bounds, both approaches coincide.
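Since \(y \mapsto ||G_\textrm{l}y + d_\textrm{l}||\) is convex, its maximum over the box \(B(\underline{y}, \bar{{y}})\) is attained at one of the \(2^{n_y}\) vertices. For small \(n_y\), the “fast” variant (14) can hence be computed by vertex enumeration, as the following illustrative sketch shows (NumPy arrays assumed; the “slow” variant over \({\mathcal {F}}_y\) requires an optimization solver instead):

import itertools
import numpy as np

def lipschitz_fast(G, d, y_lo, y_hi):
    """Solves (14) by vertex enumeration: ||G y + d|| is convex in y,
    so its maximum over a box is attained at one of its vertices."""
    n = len(d)
    best = 0.0
    for choice in itertools.product((False, True), repeat=n):
        y = np.where(np.array(choice), y_hi, y_lo)   # one box vertex
        best = max(best, float(np.linalg.norm(G @ y + d)))
    return best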

Next, we continue with the more difficult case of proving Lipschitz continuity of the optimal-value function \(\varphi \). To this end, we exploit a variant of the Hoffman Lemma; for the original version, see the main theorem in [24] or Lemma 5.8 in [50]. For ease of presentation, we assume from now on that the finite bounds on y are part of the lower-level inequality constraints and, thus, absorbed into the constraint system \(C x+ D \tilde{y}\le b\).

Lemma 4.1

(see Corollary 5.1 in [50]) Suppose Assumption 1 holds. There exists \(L_H > 0\) such that for any \(x,{\tilde{x}} \in B(\underline{x}, \bar{{x}})\) it holds: For any \(y\in {\mathcal {T}}(x)\), we can find a point \({\tilde{y}} \in {\mathcal {T}}({\tilde{x}})\) with

$$\begin{aligned} ||{\tilde{y}}-y|| \le L_H ||C (x-{\tilde{x}})|| \le L_H ||C|| \, ||{\tilde{x}}-x||. \end{aligned}$$

The scalar \(L_H\) is the so-called Hoffman constant. A sharp characterization of this constant and an algorithm to compute it can be found in [37, 40]. Based on the introduced variant of the Hoffman Lemma, we can now establish Lipschitz continuity of the optimal-value function \(\varphi \) under Assumption 1. Our proof follows the idea of the proof of Corollary 5.2 in [50]. There, the Lipschitz continuity of the optimal-value function of a linear program with right-hand side perturbation is demonstrated. In contrast to this, we have a quadratic program with right-hand-side perturbation.

Theorem 4.2

Suppose Assumption 1 holds. Then, there exists \(L > 0\) such that for any \(x,{\tilde{x}} \in B(\underline{x}, \bar{{x}})\) it holds

$$\begin{aligned} | \varphi ({\tilde{x}}) - \varphi (x) | \le L || {\tilde{x}} - x||. \end{aligned}$$

Proof

Let any \(x,{\tilde{x}} \in B(\underline{x}, \bar{{x}})\) be given. By Assumption 1, minimizers \(y\) and \({\tilde{y}}\) of Problem (11) exist for \(x\) and \({\tilde{x}}\), respectively. By Lemma 4.1, for \(y\) we can find a point \({\hat{y}} \in {\mathcal {T}}({\tilde{x}})\) such that

$$\begin{aligned} ||{\hat{y}} - y|| \le L_H ||C|| \, || {\tilde{x}} - x|| \end{aligned}$$

holds for some \(L_H > 0\). Based on this, we can conclude

$$\begin{aligned} \varphi ({\tilde{x}}) - \varphi (x) &\le \frac{1}{2} {\hat{y}}^\top G_\textrm{l}{\hat{y}} + d_\textrm{l}^\top {\hat{y}} - \left( \frac{1}{2} y^\top G_\textrm{l}y+ d_\textrm{l}^\top y\right) \\ &= \frac{1}{2} {\hat{y}}^\top G_\textrm{l}{\hat{y}} - \frac{1}{2} y^\top G_\textrm{l}y+ d_\textrm{l}^\top ({\hat{y}} - y) \\ &= \frac{1}{2} {\hat{y}}^\top G_\textrm{l}{\hat{y}} - \frac{1}{2} {\hat{y}}^\top G_\textrm{l}y+ \frac{1}{2} {\hat{y}}^\top G_\textrm{l}y- \frac{1}{2} y^\top G_\textrm{l}y + \frac{1}{2} y^\top G_\textrm{l}{\hat{y}} - \frac{1}{2} y^\top G_\textrm{l}{\hat{y}} + d_\textrm{l}^\top ({\hat{y}} - y) \\ &= \frac{1}{2} {\hat{y}}^\top G_\textrm{l}({\hat{y}} - y) + \frac{1}{2} y^\top G_\textrm{l}({\hat{y}} - y) + \frac{1}{2} {\hat{y}}^\top G_\textrm{l}y- \frac{1}{2} y^\top G_\textrm{l}{\hat{y}} + d_\textrm{l}^\top ({\hat{y}} - y) \\ &= \frac{1}{2} \left( {\hat{y}}^\top G_\textrm{l}+ d_\textrm{l}^\top + y^\top G_\textrm{l}+ d_\textrm{l}^\top \right) ({\hat{y}} - y) + \frac{1}{2} y^\top G_\textrm{l}^\top {\hat{y}} - \frac{1}{2} y^\top G_\textrm{l}{\hat{y}} \\ &\le \frac{1}{2} \left( ||G_\textrm{l}^\top {\hat{y}} + d_\textrm{l}|| + ||G_\textrm{l}^\top y+ d_\textrm{l}||\right) ||{\hat{y}} - y|| \\ &\le L_{G_\textrm{l}} ||{\hat{y}} - y|| \le L_H ||C|| \, L_{G_\textrm{l}} ||{\tilde{x}} - x||, \end{aligned}$$

where we use the symmetry of \(G_\textrm{l}\) and \(L_{G_\textrm{l}} \mathrel {{\mathop :}{=}}\max \{ || G_\textrm{l}y+ d_\textrm{l}|| : x\in B(\underline{x}, \bar{{x}}), y\in {\mathcal {T}}(x) \}\), which is well-defined due to Assumption 1.

Analogously, by Lemma 4.1, for \({\tilde{y}}\) we can find a point \({\hat{y}} \in {\mathcal {T}}(x)\) such that

$$\begin{aligned} ||{\hat{y}} - {\tilde{y}}|| \le L_H ||C|| \, ||x-{\tilde{x}}|| \end{aligned}$$

holds for the same \(L_H > 0\). With the same arguments as before, we obtain

$$\begin{aligned} \varphi (x) - \varphi ({\tilde{x}}) \le L_H ||C|| L_{G_\textrm{l}} ||{\tilde{x}}-x||. \end{aligned}$$

Consequently,

$$\begin{aligned} | \varphi ({\tilde{x}}) - \varphi (x) | \le L_H ||C|| L_{G_\textrm{l}} ||{\tilde{x}}-x|| \mathrel {{=}{\mathop :}}L ||{\tilde{x}}-x|| \end{aligned}$$

holds and the claim follows. \(\square \)

Let us finally note that the presented method is not restricted to lower-level problems that are nonconvex and quadratic. If the Lipschitz constant of the optimal-value function of a lower-level problem with more general nonlinearities is known, the method can be applied exactly as explained in this section.

4.2 Implementation Details

In this section, we discuss some implementation details to clarify how we modified and extended Algorithm 1 to get a more tailored method for the considered bilevel setup.

4.2.1 “Slow” and “Fast” Method for \(\varphi \)

Because \({\mathcal {T}}(x) \subseteq B(\underline{y}, \bar{{y}}) \) holds for all \(x\in B(\underline{x}, \bar{{x}})\), we immediately get

$$\begin{aligned} L_{G_\textrm{l}} = \max _{x\in B(\underline{x}, \bar{{x}}), y\in {\mathcal {T}}(x)} || G_\textrm{l}y+ d_\textrm{l}|| \le \max _{y\in B(\underline{y}, \bar{{y}})} || G_\textrm{l}y+ d_\textrm{l}||. \end{aligned}$$

Since we have to compute \(L_{G_\textrm{l}}\) for computing the Lipschitz constant of \(\varphi \), we can distinguish between the “fast” and the “slow” method for \(\varphi \) as well.

4.2.2 Additional Nonlinearities

Bilinear terms of the form \(x_i y_j\) in the lower- or upper-level objective function can easily be reformulated to fit into our setup. If such a term occurs, e.g., in the lower level, an additional variable \(y_k\) is introduced in the lower level. Moreover, the constraint \(y_k = x_i\) is added to the lower level, while the nonlinear objective term \(x_i y_j\) is replaced by the product \(y_ky_j\). The resulting bilevel problem then fits into our setup.
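For example, a lower-level objective term \(x_i \tilde{y}_j\) is replaced according to

$$\begin{aligned} \min _{\tilde{y}} \ \dots + x_i \tilde{y}_j + \dots \quad \rightarrow \quad \min _{\tilde{y}, \tilde{y}_k} \ \dots + \tilde{y}_k \tilde{y}_j + \dots \quad \text {s.t.} \quad \tilde{y}_k = x_i, \end{aligned}$$

so that the objective again only contains quadratic terms in lower-level variables, while the new coupling constraint is linear.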

4.2.3 Box Filtering

Figure 2 illustrates that, after splitting an initial bounding box (here \([0,3]^2\)) a few times, there might be boxes (such as \([2.25, 3]^2\)) that do not contain any point of the bilevel constraint region, which is colored red in Fig. 2. We can detect such boxes by checking whether their intersection with the bilevel constraint region is empty. This requires solving an LP feasibility problem and thus creates some additional computational effort. The benefit is that the constraints (6c)–(6f) and the corresponding binary variables are not added to the master problem for the filtered boxes. However, since the bilevel constraint region is also accounted for in the master problem, these filtered boxes are never split anyway, i.e., the constraints and variables that are not added to the master problem would be redundant if added. Hence, the additional computational effort of box filtering should only be undertaken if necessary.

Fig. 2: Example for box filtering

Note that box filtering is not necessary if there are no lower-level and upper-level constraints except for simple variable bounds since the intersection of the bilevel constraint region and every possible box created by the algorithm can never be empty. In contrast, box filtering is necessary if the “slow” method is applied since, e.g., Problem (14) for computing the Lipschitz constant of the nonconvex function f might be infeasible on newly created boxes. Thus, these boxes must be filtered after each box splitting and before Lipschitz constants are updated. This does not occur with the “fast” method, and box filtering is therefore not necessary for this method.

4.2.4 Tighter Lipschitz Constants for Box-Constrained Lower Levels

For instances with simple variable bounds on the lower level that are not influenced by upper-level decisions, we can compute tighter Lipschitz constants for \(\varphi \). To this end, we now explicitly take into account bilinear terms of the form \(x_i y_j\) in the lower-level objective function and do not reformulate these terms as described in Sect. 4.2.2. Hence, the lower-level objective function is given by

$$\begin{aligned} f(y, x) = \frac{1}{2} \begin{pmatrix} y\\ x\end{pmatrix}^\top \begin{bmatrix} E & F^\top \\ F & 0 \end{bmatrix} \begin{pmatrix} y\\ x\end{pmatrix} + \begin{pmatrix} e \\ 0 \end{pmatrix}^\top \begin{pmatrix} y\\ x\end{pmatrix}, \end{aligned}$$

where E and \(F \ne 0\) are suitably chosen matrices and e is a suitably chosen vector. Now, for given \({\hat{x}}, {\tilde{x}} \in B(\underline{x}, \bar{{x}})\), let \({\hat{y}}\) and \({\tilde{y}}\) be defined as

$$\begin{aligned} {\hat{y}} \mathrel {{\mathop :}{=}}{{\,\mathrm{arg\,min}\,}}_{y\in B(\underline{y}, \bar{{y}})} f(y, {\hat{x}}), \quad {\tilde{y}} \mathrel {{\mathop :}{=}}{{\,\mathrm{arg\,min}\,}}_{y\in B(\underline{y}, \bar{{y}})} f(y, {\tilde{x}}), \end{aligned}$$
(15)

where \(B(\underline{y}, \bar{{y}})\) is the lower level’s feasible set in this case. Note that \(B(\underline{y}, \bar{{y}}) = {\mathcal {T}}(x)\) holds for all \(x\in B(\underline{x}, \bar{{x}})\) because \(x\) only influences the lower-level objective function but not the lower level’s feasible set. Thus, it holds

$$\begin{aligned} \varphi ({\hat{x}}) - \varphi ({\tilde{x}}) &= f({\hat{y}},{\hat{x}}) - f({\tilde{y}}, {\tilde{x}}) \le f({\tilde{y}},{\hat{x}}) - f({\tilde{y}}, {\tilde{x}}) \\ &= {\tilde{y}}^\top F^\top {\hat{x}} - {\tilde{y}}^\top F^\top {\tilde{x}} \le ||F {\tilde{y}}|| \, || {\hat{x}} - {\tilde{x}} ||. \end{aligned}$$

Note that \(f({\hat{y}},{\hat{x}}) \le f({\tilde{y}}, {\hat{x}})\) holds since \({\hat{y}}\) minimizes \(f(y,{\hat{x}})\) over \(B(\underline{y}, \bar{{y}})\) and \({\tilde{y}} \in B(\underline{y}, \bar{{y}})\). This is not necessarily true in the case of general lower-level constraints, i.e., when optimizing in (15) over \({\mathcal {T}}({\hat{x}})\) and \({\mathcal {T}}({\tilde{x}})\), respectively, since \({\mathcal {T}}({\hat{x}}) \ne {\mathcal {T}}({\tilde{x}})\) and \({\tilde{y}} \notin {\mathcal {T}}({\hat{x}})\) might hold. Analogously, we obtain \(\varphi ({\tilde{x}}) - \varphi ({\hat{x}}) \le ||F {\hat{y}}|| \, || {\hat{x}} - {\tilde{x}} ||\) and, consequently,

$$\begin{aligned} | \varphi ({\hat{x}}) - \varphi ({\tilde{x}}) | \le L_F || {\hat{x}} - {\tilde{x}} || \end{aligned}$$

holds with \(L_F \mathrel {{\mathop :}{=}}\max \{||Fy||:y\in B(\underline{y}, \bar{{y}})\}\). Thus, \(L_F\) is a valid Lipschitz constant.

4.2.5 Big-Ms

As a valid Big-M in the master problem, we use the maximum of \(||\bar{{y}} - \underline{y}||_\infty \), \(||\bar{{x}} - \underline{x}||_\infty \), and

$$\begin{aligned} \max _{y\in B(\underline{y}, \bar{{y}})} \left( \frac{1}{2} y^\top G_\textrm{l}y+ d_\textrm{l}^\top y\right) - \min _{y\in B(\underline{y}, \bar{{y}})} \left( \frac{1}{2} y^\top G_\textrm{l}y+ d_\textrm{l}^\top y\right) . \end{aligned}$$

4.2.6 Lipschitz Constant Updates

The Lipschitz constants are updated after each box splitting since it is to be expected that the constants get smaller if they are computed on smaller sets.

4.3 Numerical Results

In our computational study below, we consider the QP-QP instances from the BASBLib library [38]. We first describe which instances need to be excluded because they do not fit into our setup. First, we exclude the three instances d_1992_01, b_1984_02, and dd_2012_02 because they contain nonlinear constraints. Next, the two instances y_1996_02 and lmp_1987_01 are excluded due to nonconvex upper-level objective functions, i.e., the matrix \(H_\textrm{u}\) is not positive semidefinite for these instances. The instance sc_1998_01 is not considered because C is the zero matrix and, thus, the resulting optimization problem is not a “true” bilevel problem, since the lower-level problem is not constrained by upper-level variables. Such problems can easily be solved by backward induction, i.e., an optimal solution to the lower-level problem can be determined first and, given this lower-level solution, the upper-level problem can be solved to optimality. Finally, we have to exclude the following four instances because they violate Assumption 1:

  • tmh_2007_01 (lower level is not feasible for \(x=9\)),

  • b_1988_01 (lower level is not feasible for \(x=9\)),

  • b_1998_07 (lower level is not feasible for \(x=9\)), and

  • cw_1990_02 (lower level is not feasible for \(x=7\)).

In total, 10 instances (out of 20) remain; see Table 1. Note that, due to the applied reformulations as described in Sect. 4.2.2, the reported number of variables might differ from those reported in the BASBLib. Moreover, the reported number of constraints does not include additional constraints necessary due to these reformulations. Variable bounds are also not counted as constraints.

Table 1 Overview of the considered BASBLib instances

We implemented the algorithm in Python 3.7.9. All computations were conducted on a machine with an Intel(R) Core(TM) i7-8565U CPU with 4 cores, 1.8 GHz to 4.6 GHz, and 16 GB RAM. The master problem (M(k)) and the subproblem (4) are modeled using Pyomo 5.7.2 [22] and solved with Gurobi 9.1.0 [21]. The Hoffman constant is computed with the algorithm described in [40] using the MATLAB code made publicly available by the authors of [40]. We set \(\varepsilon = 10^{-1}\) and use a time limit of 5 h. If an instance is solved within the time limit, we successively divide \(\varepsilon \) by ten until \(\varepsilon = 10^{-5}\) is reached to see how much accuracy our algorithm can achieve within the given time limit. The obtained results for the case of box filtering and determining the Lipschitz constant with the “slow” method are summarized in Table 2, while the results for the “fast” method without box filtering are given in Table 3. Note that Table 2 contains results for 4 instances, while Table 3 contains results for 10 instances. The reason is that 6 instances only have simple variable bounds so that there is no difference between the “slow” and the “fast” method or between box filtering and no box filtering (except for the additional computational effort due to the LP feasibility problems solved in the former case; see Sect. 4.2.3). In these cases, we only list the respective instances in Table 3. The tables are organized as follows. The first column states the ID of the instance and the second one the \(\varepsilon \) used for the termination criterion. The number of required iterations is denoted by k, “runtime” states the runtime in seconds, and the “final \(\varepsilon \)” column contains the tolerance that is actually reached. Finally, the columns “diff to opt.” and “diff to opt. value” contain the 2-norm distance of our solution to the one reported in the BASBLib and the respective difference in the objective value.

Table 2 Computational results with “slow” method and box filtering
Table 3 Computational results with “fast” method and no box filtering

Before we discuss the results in detail, let us comment on two important aspects. First, it can be expected that our method performs rather badly if there are multiplicities, i.e., multiple optimal solutions. To get a finer relaxation, all boxes covering the multiple solutions have to be split at least once if \(\varepsilon \)-feasibility is not reached by the first split. Second, it is to be expected that our method performs rather well for instances with small variable ranges, e.g., ranges such as \([0,1]^{n_x}\times [0,1]^{n_y}\) instead of \([0,1000]^{n_x}\times [0,1000]^{n_y}\), as well as small lower-level objective function ranges. In particular, the range of the lower level’s objective function is small in the case of small Lipschitz constants. The Lipschitz constant derived here for the optimal-value function is valid for all quadratic programs with right-hand-side perturbations that satisfy Assumption 1. However, for specific bilevel applications, much tighter Lipschitz constants might be derived by exploiting problem-specific structural properties.

In what follows, we comment on the obtained computational results in the light of these two aspects. Multiplicities are reported for instance as_1981_01. As can be seen in Table 2, for \(\varepsilon = 10^{-2}\), this instance is not solved to \(\varepsilon \)-feasibility within the time limit despite the fact that an optimal solution is already reached. This effect is even more pronounced if box filtering is deactivated and the “fast” method is used; see Table 3. In this case, the final \(\varepsilon \) is 47.3, which is very large. As this instance has rather many general constraints, box filtering with the “slow” method improves the solution process significantly. However, for instances with fewer general constraints like sa_1981_01 and sa_1981_02, the additional computational burden of the “slow” method and box filtering can outweigh its advantages.

Finally, let us discuss the results in dependence of the tightness of the Lipschitz constants and the size of the variable ranges. To this end, we first order the considered instances by increasing Lipschitz constants of the function f: b_1998_02, b_1998_03, d_2000_01, d_1978_01, fl_1995_01, sa_1981_02, as_1981_01, sa_1981_01, b_1998_04, and b_1998_05. Indeed, the three instances with the lowest Lipschitz constants are solved for the smallest \(\varepsilon \) within the time limit; see Tables 2 and 3. In addition to these 3 instances, the “fast” method with no box filtering also solves the instances d_1978_01, sa_1981_01, and b_1998_04, at least for the initial \(\varepsilon \). Besides having one of the lowest Lipschitz constants, the instance d_1978_01 has relatively small variable ranges. The other two instances sa_1981_01 and b_1998_04 have a relatively small number of lower- and upper-level variables, which reduces the dimensions of the boxes and therefore the worst-case number of iterations. Nevertheless, the disadvantage of the large Lipschitz constant of b_1998_04 is reflected in the results, since already more than 2 h are needed to compute an \(\varepsilon \)-feasible solution for \(\varepsilon = 10^{-1}\).

In total, our method solves 7 out of 10 instances. Interestingly, 2 out of the 3 unsolved instances even have a convex lower-level problem and could thus be solved with specialized methods such as the one in [30] that explicitly exploit this property. In addition, since our method only requires very weak assumptions and can be applied to a broad range of problems besides nonconvex bilevel problems, it cannot be expected to outperform specifically tailored methods like the BASBL solver [39].

5 Application to Gas Network Optimization

In this section, we use Algorithm 1 to solve stationary gas network optimization problems. We start by modeling the gas network and stating an implicit nonlinear pressure-law function for gas flow in pipes. We cannot state this function explicitly, but it can be evaluated rather cheaply. Then, we analyze its derivatives to derive suitable Lipschitz constants. Finally, numerical results on test instances show the successful application of our method.

5.1 Modeling

We model the gas network as a directed and weakly connected graph \((V, A)\), where the set of arcs \(A \) is composed of pipes \(A _\textrm{pi} \), short pipes \(A _\textrm{sp} \), valves \(A _\textrm{va} \), compressor stations \(A _\textrm{cs} \), and control valves \(A _\textrm{cv} \), i.e.,

$$\begin{aligned} A = A _\textrm{pi} \cup A _\textrm{sp} \cup A _\textrm{va} \cup A _\textrm{cs} \cup A _\textrm{cv}. \end{aligned}$$

The two main variables that describe the state of the gas flowing through the network are the pressure \(p \) and mass flow \(q \). Each node \(u \in V \) has a bounded pressure variable \(p _u \in [\underline{p}_u, \bar{{p}}_u ]\) and a given mass flow \(q _u \) that is supplied to or withdrawn from the network. A node \(u \in V \) is called an entry node if \(q _u > 0\), an exit node if \(q _u < 0\), and an inner node if \(q _u = 0\). In addition, each arc \(a \in A \) has a variable \(q _a \in [\underline{q}_a, \bar{{q}}_a ]\) that models the mass flow in the arc.

The balance equation

$$\begin{aligned} q _u + \sum _{a \in \delta ^{\textrm{in}}(u)} q _a = \sum _{a \in \delta ^{\textrm{out}}(u)} q _a \quad \text {for all } u \in V \end{aligned}$$
(16)

ensures that no gas is gained or lost. Here, \(\delta ^{\textrm{in}}(u) \) and \(\delta ^{\textrm{out}}(u) \) denote the sets of in- and outgoing arcs of node \(u \).

A short pipe \(a = (u, v) \in A _\textrm{sp} \) directly connects its nodes \(u \) and \(v \). Therefore, the related pressure values coincide:

$$\begin{aligned} p _u = p _v \quad \text {for all } a = (u, v) \in A _\textrm{sp}. \end{aligned}$$
(17)

A valve \(a = (u, v) \in A _\textrm{va} \) is either open or closed. If it is open, it is modeled as a short pipe. If it is closed, the mass flow \(q _a \) has to be zero and the pressures at \(u \) and \(v \) are decoupled, but their difference cannot exceed a given value \(\Delta \bar{{p}}_a \). This can be modeled by introducing a binary variable \(o_a \) that indicates whether the valve \(a \) is open (\(o_a = 1\)) or closed (\(o_a = 0\)). The valve model then reads

$$\begin{aligned} q _a&\ge \underline{q}_a o_a&\text {for all } a \in A _\textrm{va}, \end{aligned}$$
(18a)
$$\begin{aligned} q _a&\le \bar{{q}}_a o_a&\text {for all } a \in A _\textrm{va}, \end{aligned}$$
(18b)
$$\begin{aligned} p _u- p _v&\le \Delta \bar{{p}}_a (1 - o_a)&\text {for all } a = (u, v) \in A _\textrm{va}, \end{aligned}$$
(18c)
$$\begin{aligned} p _v- p _u&\le \Delta \bar{{p}}_a (1 - o_a)&\text {for all } a = (u, v) \in A _\textrm{va}. \end{aligned}$$
(18d)

Compressor stations \(a \in A _\textrm{cs} \) have a fixed flow direction, i.e., \(\underline{q}_a \ge 0\) holds, and can increase the pressure of the gas. We model this pressure increase by introducing the variable \(\Delta p _a \in [0, \Delta \bar{{p}}_a ]\). Then, the compressor station model is given by

$$\begin{aligned} p _v = p _u + \Delta p _a \quad \text {for all } a = (u, v) \in A _\textrm{cs}. \end{aligned}$$
(19)

Note that this model is a significant simplification of how compressor stations in real gas networks operate. More complicated and realistic models can be found in, e.g., [41, 45].

Control valves \(a \in A _\textrm{cv} \) are modeled similarly to compressor stations but decrease the gas pressure instead of increasing it. We again have \(\underline{q}_a \ge 0\) and

$$\begin{aligned} p _v = p _u- \Delta p _a \quad \text {for all } a = (u, v) \in A _\textrm{cv} \end{aligned}$$
(20)

with \(\Delta p _a \in [0, \Delta \bar{{p}}_a ]\).

Until now, we have modeled all gas network components in a (mixed-integer) linear way. The only components that are still missing are the pipes \(a \in A _\textrm{pi} \). We describe the pressure loss in a pipe as a function of the inflow pressure and the mass flow using a nonlinear and Lipschitz continuous function

$$\begin{aligned} p _v = p _{v, a}(p _u, q _a) \quad \text {for all } a = (u, v) \in A _\textrm{pi}, \end{aligned}$$
(21)

which we will analyze in the next section.

The goal is to minimize the overall activity of the compressor stations, i.e., the sum \(\sum _{a \in A _\textrm{cs}} \Delta p _a \) of all pressure increases. The full stationary gas network optimization problem thus consists of minimizing this objective subject to the constraints (16)–(21), the variable bounds, and the integrality of the valve variables \(o_a \).

This model has the form of Problem (1): the pipe equations (21) constitute the Lipschitz nonlinearities (1c), while all other constraints are linear and thus fit (1b). Hence, we can use Algorithm 1 to solve it.

In contrast to the gas network model considered in [47], we do not restrict ourselves to tree-structured networks. As a consequence, the mass flows in the network cannot be pre-computed, and the nonlinearity on each arc is multivariate since it depends on both the pressure and the mass flow. Hence, Algorithm 1 can be applied to a much broader class of gas transport models.

5.2 Lipschitz Continuity of the Gas Flow Equation

In this section, we derive and analyze the nonlinear pipe model (21). For the sake of readability, we henceforth omit the subscript \(a \) that indicates the pipe \(a \in A _\textrm{pi} \).

Gas flow along a pipe can be modeled by the stationary momentum equation. This ordinary differential equation (ODE) reads

$$\begin{aligned} \partial _{x}\left( p + \frac{\chi ^2}{\rho }\right) = -\frac{1}{2} \theta \frac{\chi |{\chi }|}{\rho }, \quad \chi = \rho v, \quad \theta = \frac{\lambda }{D} \end{aligned}$$

where \(p \), \(v\), \(\chi \), and \(\rho \) model the pressure, velocity, mass flux, and density of the gas, and \(\lambda \) as well as \(D\) denote the pipe’s friction coefficient and diameter; see, e.g., [19]. The relation between the mass flow \(q \) and the mass flux \(\chi \) is given by \(q = A\chi \), where \(A = \pi D^2 / 4\) is the cross-sectional area of the pipe.

The pressure \(p \) and density \(\rho \) are coupled by the equation of state

$$\begin{aligned} p = R_{\textrm{s}}T z \rho \end{aligned}$$
(22)

for real gas, where \(R_{\textrm{s}}\) denotes the specific gas constant. The compressibility factor \(z \) can be computed by the so-called AGA formula

$$\begin{aligned} z = 1 + \alpha p,\quad \alpha = 0.257 \frac{1}{p _\textrm{c}} - 0.533 \frac{T _\textrm{c}}{p _\textrm{c} T} < 0 \end{aligned}$$

with pseudocritical pressure \(p _\textrm{c} \), pseudocritical temperature \(T _\textrm{c} \), and temperature \(T \), which we assume to be constant; see, e.g., [35]. We only consider a positive compressibility factor, which is equivalent to

$$\begin{aligned} p < \frac{1}{|{\alpha }|}. \end{aligned}$$
(23)

We further introduce the speed of sound \(c \) which is defined via

$$\begin{aligned} \frac{1}{c ^2} = \frac{\partial \rho }{\partial p} \end{aligned}$$

and the squared Mach number

$$\begin{aligned} \eta = \frac{v^2}{c ^2}. \end{aligned}$$

Lemma 5.1

It holds that

$$\begin{aligned} \eta = R_{\textrm{s}}T \frac{\chi ^2}{p ^2}. \end{aligned}$$

Proof

We solve the equation of state (22) for \(\rho \) and obtain

$$\begin{aligned} \rho = \frac{p}{R_{\textrm{s}}T z}. \end{aligned}$$

We can use this and the definition of the speed of sound to get

$$\begin{aligned} \eta&= v^2 \frac{1}{c ^2} = \left( \frac{\chi }{\rho }\right) ^2 \frac{\partial \rho }{\partial p} = \left( \frac{\chi R_{\textrm{s}}T z}{p}\right) ^2 \frac{R_{\textrm{s}}T z- p R_{\textrm{s}}T \alpha }{(R_{\textrm{s}}T z)^2}\\&= \left( \frac{\chi R_{\textrm{s}}T z}{p}\right) ^2 \frac{1}{R_{\textrm{s}}T z ^2} = R_{\textrm{s}}T \frac{\chi ^2}{p ^2}. \end{aligned}$$

\(\square \)

We assume that the velocity of the gas is subsonic, i.e., \(\eta < 1\) holds, as is the case in real-world gas networks. This is equivalent to

$$\begin{aligned} p > |{\chi }| \sqrt{R_{\textrm{s}}T}. \end{aligned}$$
(24)

In what follows, we only consider pressures within the interval \((|{\chi }| \sqrt{R_{\textrm{s}}T}, 1 / |{\alpha }|)\). Therefore, we have

$$\begin{aligned} p ^2 - \chi ^2 R_{\textrm{s}}T> 0 \quad \text {and} \quad 1 + \alpha p > 0, \end{aligned}$$

which we will use many times throughout this section.
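The constant \(\alpha \) and this admissible pressure window are easy to compute. The following sketch does so in SI units; the gas parameters \(R_\textrm{s}\), \(T \), \(p _\textrm{c} \), and \(T _\textrm{c} \) are hypothetical example values chosen only for illustration. The constants defined here are reused in the sketches below.

```python
import math

# Hypothetical gas parameters in SI units (illustration only).
R_S = 500.0    # specific gas constant in J/(kg K)
T = 283.15     # gas temperature in K
P_C = 46.0e5   # pseudocritical pressure in Pa
T_C = 190.0    # pseudocritical temperature in K

# AGA formula: z = 1 + ALPHA * p with ALPHA < 0.
ALPHA = 0.257 / P_C - 0.533 * T_C / (P_C * T)

def pressure_window(chi):
    """The open interval (|chi| sqrt(R_s T), 1/|alpha|) from (24) and (23)."""
    lower = abs(chi) * math.sqrt(R_S * T)  # subsonic flow, see (24)
    upper = 1.0 / abs(ALPHA)               # positive compressibility, see (23)
    return lower, upper

lo, hi = pressure_window(chi=300.0)  # hypothetical mass flux in kg/(m^2 s)
print(lo / 1e5, hi / 1e5)            # window expressed in bar
```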

For a pipe \((u, v)\) with length \(L\) and pressure \(p _u \) at node \(u \), the pressure function reads

$$\begin{aligned} p (x, p _u, \chi )&= F^{-1}\left( F(p _u) - \frac{1}{2} R_{\textrm{s}}T \chi |{\chi }| \theta x\right) , \end{aligned}$$
(25a)
$$\begin{aligned} F(p)&= \frac{1}{\alpha } p + \left( \chi ^2 R_{\textrm{s}}T- \frac{1}{\alpha ^2}\right) \ln (|{1+\alpha p}|) - \chi ^2 R_{\textrm{s}}T \ln (p), \end{aligned}$$
(25b)

for \(x\in [0, L]\); see [19, 20]. From [20], we further know the following properties of F.

Lemma 5.2

The function F as defined in (25b) is differentiable for \(p \in (|{\chi }| \sqrt{R_{\textrm{s}}T}, 1 / |{\alpha }|)\) with

$$\begin{aligned} F'(p) = \frac{p ^2 - \chi ^2 R_{\textrm{s}}T}{p (1 + \alpha p)} > 0. \end{aligned}$$
(26)

The second derivative fulfills

$$\begin{aligned} F''(p) = \frac{p ^2 + \chi ^2 R_{\textrm{s}}T (1 + 2 \alpha p)}{p ^2 (1 + \alpha p)^2} > 0. \end{aligned}$$

The property in (26) implies that F is strictly increasing. Therefore, the inverse in (25a) is well-defined. To evaluate the pressure function \(p \) in (25a), the equation

$$\begin{aligned} F(p) = F(p _u) - \frac{1}{2} R_{\textrm{s}}T \chi |{\chi }| \theta x \end{aligned}$$
(27)

needs to be solved. This can be done numerically using Newton’s method since F is strictly increasing and convex; see [20]. In the same way, (27) can be solved for \(p _u \) if \(p \) and \(\chi \) are given.
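As an illustration, the following sketch evaluates the pressure function (25a) by applying Newton's method to (27), using F from (25b) and its derivative from (26). It reuses the hypothetical gas constants from the sketch above; the pipe data (friction coefficient, diameter, length) and the input values are likewise made-up examples. Since F is strictly increasing and convex and we start at \(p _u \), the Newton iterates decrease monotonically to the root for positive flow.

```python
def F(p, chi):
    """F from (25b)."""
    c2 = chi**2 * R_S * T
    return (p / ALPHA
            + (c2 - 1.0 / ALPHA**2) * math.log(abs(1.0 + ALPHA * p))
            - c2 * math.log(p))

def dF(p, chi):
    """F' from (26); positive on the admissible pressure window."""
    return (p**2 - chi**2 * R_S * T) / (p * (1.0 + ALPHA * p))

def p_v(p_u, chi, theta, length, tol=1e-8, max_iter=100):
    """Outflow pressure p(L, p_u, chi) via Newton's method on (27)."""
    rhs = F(p_u, chi) - 0.5 * R_S * T * chi * abs(chi) * theta * length
    p = p_u  # starting point
    for _ in range(max_iter):
        step = (F(p, chi) - rhs) / dF(p, chi)
        p -= step
        if abs(step) < tol:
            return p
    raise RuntimeError("Newton's method did not converge")

# Hypothetical pipe: lambda = 0.01, D = 1 m (theta = lambda / D), L = 10 km.
print(p_v(p_u=60.0e5, chi=300.0, theta=0.01, length=10_000.0) / 1e5)  # in bar
```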

We are interested in the pressure at the end of the pipe, i.e., for \(x = L\), and need to find a Lipschitz constant for

$$\begin{aligned} p _v = p _v (p _u, \chi ) \mathrel {{\mathop :}{=}}p (L, p _u, \chi ). \end{aligned}$$

To this end, we make the following assumption.

Assumption 2

We assume that the variables \(\chi \) and \(p _u \) of each pipe \((u, v)\) are bounded by

$$\begin{aligned} \chi \in [\underline{\chi }, \bar{{\chi }}] = [\underline{q} / A, \bar{{q}} / A], \quad p _u \in [\underline{p}_u, \bar{{p}}_u ] \subset \left( |{\chi }| \sqrt{R_{\textrm{s}}T}, \frac{1}{|{\alpha }|}\right) \end{aligned}$$

with

$$\begin{aligned} p _v (p _u, \chi ) \in \left( |{\chi }| \sqrt{R_{\textrm{s}}T}, \frac{1}{|{\alpha }|}\right) \quad \text {for all } (p _u, \chi ) \in \mathcal {F}\mathrel {{\mathop :}{=}}[\underline{p}_u, \bar{{p}}_u ] \times [\underline{\chi }, \bar{{\chi }}]. \end{aligned}$$

The following lemma guarantees the Lipschitz continuity of \(p _v (p _u, \chi )\) on \(\mathcal {F}\).

Lemma 5.3

Let \(f:\mathcal {F}\rightarrow \mathbb {R}\) be a partially differentiable function on a compact and convex subset \(\mathcal {F}\subset \mathbb {R}^d\) with \(d \in \mathbb {N}\). Then, f is Lipschitz continuous on \(\mathcal {F}\) with Lipschitz constant \(L = 1\) w.r.t. the weighted 1-norm

$$\begin{aligned} \Vert {{\tilde{x}}}\Vert _{w} \mathrel {{\mathop :}{=}}\sum _{i = 1}^d \left| {{\tilde{x}}_i}\right| w_i \end{aligned}$$
(28)

with positive weights \(w \in \mathbb {R}^d\) and \(w_i \ge \max _{x \in \mathcal {F}} \left| {\partial _{x_i}f(x)}\right| \) for all \(i \in [d]\).

Proof

Let \({\tilde{x}}, {\tilde{y}} \in \mathcal {F}\). We define the auxiliary function \(g:[0,1] \rightarrow \mathbb {R}\) as \(g(\lambda ) \mathrel {{\mathop :}{=}}f(\lambda {\tilde{x}} + (1-\lambda ) {\tilde{y}})\). Now, we use the fundamental theorem of calculus to prove the claim:

$$\begin{aligned} |{f({\tilde{x}}) - f({\tilde{y}})}|&= |{g(1) - g(0)}|\\&= \left| {\int _0^1 g^\prime (\lambda ) \,\textrm{d}\lambda }\right| \\&= \left| {\int _0^1 ({\tilde{x}} - {\tilde{y}})^\top \nabla f(\lambda {\tilde{x}} + (1-\lambda ) {\tilde{y}}) \,\textrm{d}\lambda }\right| \\&= \left| {\int _0^1 \sum _{i = 1}^d ({\tilde{x}}_i - {\tilde{y}}_i) \partial _{x_i}f(\lambda {\tilde{x}} + (1-\lambda ) {\tilde{y}}) \,\textrm{d}\lambda }\right| \\&\le \int _0^1 \sum _{i = 1}^d \left| {({\tilde{x}}_i - {\tilde{y}}_i) \partial _{x_i}f(\lambda {\tilde{x}} + (1-\lambda ) {\tilde{y}})}\right| \,\textrm{d}\lambda \\&\le \int _0^1 \sum _{i = 1}^d \left| {{\tilde{x}}_i - {\tilde{y}}_i}\right| \max _{x\in \mathcal {F}} \left| {\partial _{x_i}f(x)}\right| \,\textrm{d}\lambda \\&\le \left\| {{\tilde{x}} - {\tilde{y}}}\right\| _{w}. \end{aligned}$$

\(\square \)

Using this weighted 1-norm allows us to obtain tighter bounds in (5) than with the usual 1-norm. To actually compute the weights, one could solve \(\max _{x\in \mathcal {F}} \left| {\partial _{x_i}f(x)}\right| \), which is an NLP for each \(i \in [d]\). In the case of Algorithm 1, the set \(\mathcal {F}\) will always be a box. For the function \(p _v (p _u, \chi )\), we give the optimal solution, or at least a suitable upper bound, for these NLPs in the following two theorems.

Theorem 5.4

If Assumption 2 holds, the derivative \(\partial _{p _u}p _v (p _u, \chi ) > 0\) attains its maximum on a given box \(\emptyset \ne [p _u ^-, p _u ^+] \times [\chi ^-, \chi ^+] \subseteq [\underline{p}_u, \bar{{p}}_u ] \times [\underline{\chi }, \bar{{\chi }}]\) at \((p _u ^-, \chi ^+)\) if \(\chi ^+ \ge 0\) and at \((p _u ^+, \chi ^+)\) otherwise.

Theorem 5.5

Let \(\mathcal {F}= [p _u ^-, p _u ^+] \times [\chi ^-, \chi ^+] \subseteq [\underline{p}_u, \bar{{p}}_u ] \times [\underline{\chi }, \bar{{\chi }}]\) be a given box with \(\mathcal {F}\ne \emptyset \). If Assumption 2 holds, the derivative \(\partial _{\chi }p _v (p _u, \chi ) \le 0\) has the following properties:

$$\begin{aligned} \min _{(p _u, \chi ) \in \mathcal {F}\cap \mathbb {R}\times \mathbb {R}_{\ge 0}} \partial _{\chi }p _v (p _u, \chi )&= \partial _{\chi }p _v (p _u ^-, \chi ^+)&\text {if } \chi ^+ \ge 0,\\ \min _{(p _u, \chi ) \in \mathcal {F}\cap \mathbb {R}\times \mathbb {R}_{\le 0}} \partial _{\chi }p _v (p _u, \chi )&> \partial _{\chi }p _v (p _v (p _u ^-, \chi ^-), -\chi ^-)&\text {if } \chi ^- < 0. \end{aligned}$$

The proofs of these two theorems can be found in the appendix of this paper as they are rather long and technical.

Theorems 5.4 and 5.5 can be used to compute the bounds in (5) not only for the initial box \([\underline{p}_u, \bar{{p}}_u ] \times [\underline{\chi }, \bar{{\chi }}]\) but also for every new box that is created in Step 15 of Algorithm 1. This can significantly tighten the bounds in (5) as the iterations proceed.
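A sketch of how this can look in code is given below. It reuses the evaluator p_v from the previous sketch and restricts itself to boxes with nonnegative flow, where Theorems 5.4 and 5.5 locate the extremal derivative values at the corner \((p _u ^-, \chi ^+)\). Since the closed-form derivative expressions are only developed in the appendix, the sketch approximates the two partial derivatives by central finite differences; the step sizes and box data are hypothetical.

```python
def d_pu(p_u, chi, theta, length, h=1.0):
    """Central finite-difference estimate of the derivative of p_v w.r.t. p_u."""
    return (p_v(p_u + h, chi, theta, length)
            - p_v(p_u - h, chi, theta, length)) / (2.0 * h)

def d_chi(p_u, chi, theta, length, h=1e-4):
    """Central finite-difference estimate of the derivative of p_v w.r.t. chi."""
    return (p_v(p_u, chi + h, theta, length)
            - p_v(p_u, chi - h, theta, length)) / (2.0 * h)

def weights(pu_lo, pu_hi, chi_lo, chi_hi, theta, length):
    """Weights (w_1, w_2) for the norm (28) on the given box, following the
    corner rules of Theorems 5.4 and 5.5 in the nonnegative-flow case."""
    assert chi_lo >= 0.0, "sketch covers only the case chi >= 0"
    w1 = d_pu(pu_lo, chi_hi, theta, length)    # Theorem 5.4: max at (pu-, chi+)
    w2 = -d_chi(pu_lo, chi_hi, theta, length)  # Theorem 5.5: min at (pu-, chi+)
    return w1, w2

# Hypothetical box: pressures in Pa, mass flux in kg/(m^2 s).
print(weights(55.0e5, 65.0e5, 100.0, 300.0, theta=0.01, length=10_000.0))
```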

5.3 Numerical Results

Now, we apply Algorithm 1 to two test problems from the GasLib library [46], which contains stationary gas network benchmark instances. To this end, we implemented our method in Python 3.8.10. The computations were done on a machine with an Intel(R) Core(TM) i7-8550U CPU with 4 cores, 1.8 GHz to 4.0 GHz, and 16 GB RAM. The master problems (M(k)) and the subproblems (4) have been modeled using GAMS 36.2.0 [9]. We used the solver CPLEX 20.1.0.1 [11] to solve the master problems, which are MIPs, and the solver SNOPT 7.7.7 [18] for the subproblems, which are NLPs.

To improve the performance of our method, we detect boxes \([p _u ^-, p _u ^+] \times [\chi ^-, \chi ^+]\) in \(\mathcal {X}_i^k\) that lie outside the feasible set of a pipe. From (29) and (30), we know that this is the case if \(\chi ^- \ge 0\) and \(p _v (p _u ^+, \chi ^-) < \underline{p}_v \) holds or if \(\chi ^+ \le 0\) and \(p _v (p _u ^-, \chi ^+) > \bar{{p}}_v \) holds. Additionally, we can fix the flow in pipes that are not part of a cycle. This allows us to reduce the dimension of the corresponding nonlinearities by one. To get well-scaled problems, we model all pressure values in bar instead of Pa and exclusively use mass flow values (in kg  s\(^{-1}\)) instead of mass flux values. As a result, we have to scale all values of \(\partial _{\chi }p _v (p _u, \chi )\) that are used to compute the bounds in (5) by a factor of \(10^{-5}/A\).
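A minimal sketch of this filtering test follows, based on the evaluator p_v from above and the monotonicity of \(p _v \) in \(p _u \) and \(\chi \); the box data and pressure bounds are hypothetical example values.

```python
def box_infeasible(pu_lo, pu_hi, chi_lo, chi_hi, pv_lo, pv_hi, theta, length):
    """True if no point of the box can meet the outflow pressure bounds
    [pv_lo, pv_hi], exploiting that p_v increases in p_u and decreases in chi."""
    if chi_lo >= 0.0 and p_v(pu_hi, chi_lo, theta, length) < pv_lo:
        return True   # even the largest reachable outflow pressure is too low
    if chi_hi <= 0.0 and p_v(pu_lo, chi_hi, theta, length) > pv_hi:
        return True   # even the smallest reachable outflow pressure is too high
    return False

# Hypothetical box and bounds in Pa.
print(box_infeasible(55.0e5, 65.0e5, 100.0, 300.0,
                     pv_lo=50.0e5, pv_hi=70.0e5,
                     theta=0.01, length=10_000.0))  # False
```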

To compute a sufficiently large value M for the master problem constraints (6c)–(6f), we use the difference between the values that can occur on the right-hand sides of the inequalities and the bounds of the variables on the left-hand side. This leads to the formula

$$\begin{aligned} M \mathrel {{\mathop :}{=}}\max \left\{ \left( \bar{{p}}_u- \underline{p}_u \right) , \left( \bar{{q}} - \underline{q}\right) , \left( \frac{1}{|{\alpha }|} - \underline{p}_v \right) , \left( \bar{{p}}_v- \frac{\max \left\{ \bar{{q}}, |{\underline{q}}|\right\} }{A} \sqrt{R_{\textrm{s}}T}\right) \right\} . \end{aligned}$$
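For completeness, here is this formula as a small function, reusing the hypothetical constants from the sketches above; all bounds are made-up example values (pressures in Pa, mass flows in kg/s).

```python
def big_m(pu_lo, pu_hi, q_lo, q_hi, pv_lo, pv_hi, area):
    """Big-M value for the master problem constraints (6c)-(6f),
    following the displayed formula."""
    return max(
        pu_hi - pu_lo,
        q_hi - q_lo,
        1.0 / abs(ALPHA) - pv_lo,
        pv_hi - max(q_hi, abs(q_lo)) / area * math.sqrt(R_S * T),
    )

print(big_m(pu_lo=55.0e5, pu_hi=65.0e5, q_lo=-200.0, q_hi=200.0,
            pv_lo=50.0e5, pv_hi=70.0e5, area=math.pi / 4.0))
```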
Fig. 3 Schematic representation of GasLib-11

The first instance we solve is GasLib-11, which is shown in Fig. 3. It contains 11 nodes including 3 entries and 3 exits, 8 pipes, 2 compressor stations, and a single valve. Because the network has a cycle, not all flows on all arcs are known a priori.

Fig. 4 Schematic representation of GasLib-24

The second instance we solve is GasLib-24; see Fig. 4. It consists of 24 nodes including 3 entries and 5 exits, 19 pipes, 3 compressor stations, a single control valve, a single short pipe, and a single resistor, which we replace by a short pipe. There are two cycles in the network.

Table 4 The runtimes and numbers of iterations of Algorithm 1 for different parameters \(\lambda \)

For both instances, we tested the values 0.125, 0.25, 0.375, and 0.5 for the parameter \(\lambda \). We chose the value \(\varepsilon = 0.1\) (in bar) for the termination criterion. The resulting runtimes and numbers of iterations are listed in Table 4. In both cases, \(\lambda =0.25\) and \(\lambda =0.375\) yield better results than \(\lambda =0.125\) or \(\lambda =0.5\). For \(\lambda =0.125\), this can be explained by the fact that the boxes do not shrink fast enough when split, because the splitting point can be quite close to the boundary of its box. Choosing \(\lambda =0.5\), on the other hand, removes the possibility of placing the splitting points closer to the master problem solution and thus potentially closer to the optimal solution of Problem (1).

For GasLib-11, the best result was achieved for \(\lambda =0.375\) for which an \(\varepsilon \)-feasible solution was found in 47.41s. The method terminated after 74 iterations with a mean iteration time of 0.64s. The mean time to solve the master problem was 0.35s, and the mean time to solve the subproblems was 0.14s. The remaining 0.16s were used to (re-)build the model in each iteration.

For GasLib-24, the best result was achieved for \(\lambda =0.25\) for which an \(\varepsilon \)-feasible solution was found in 158.43s. With 95 iterations, the mean iteration time was 1.67s. This mean includes 1.05s to solve the master problem, 0.23s to solve the subproblems, and 0.39s to (re-)build the model in each iteration.

Fig. 5 The maximal error \(\max _{i \in [p]} |{f_i(x_{I_i}) - x_{r_i}}|\) in each iteration for GasLib-11 with \(\lambda =0.375\) (left) and GasLib-24 with \(\lambda =0.25\) (right) on a logarithmic scale

Figure 5 shows the progress of the maximal error over the course of the iterations for the best-performing choice of \(\lambda \) on both test instances. One can see that the error falls rapidly in the first iterations. After that, it fluctuates strongly with only a slight downward trend until the threshold \(\max _{i \in [p]} |{f_i(x_{I_i}) - x_{r_i}}| \le \varepsilon \) is reached. This behavior is typical, since the approximation progress gained from splitting a box in Step 15 of Algorithm 1 diminishes as the boxes get smaller. The fluctuations can be explained by the master problem solution switching to a larger box \(j_i^k \in J_i^k\) after the previous box has been refined a number of times in Step 15.

6 Conclusion

In this paper, we developed a successive linear relaxation method for solving MINLPs with nonlinearities that are not given in closed form. Instead, we only assume that these multivariate nonlinearities can be evaluated and that their global Lipschitz constants are known. We illustrated the flexibility of this class of models and of our method by showing that it can be applied to bilevel optimization models with nonlinear and nonconvex lower-level problems as well as to nonconvex MINLPs for gas transport problems that are constrained by differential equations. Moreover, we proved finite termination of the method at approximately globally optimal solutions and derived a worst-case iteration bound.

Finally, let us sketch three important topics for future research that are beyond the scope of this paper. First, one can weaken the assumptions made in this paper, as is done in [47]. In particular, the assumption of exact function evaluations is rather strong in some applications such as, e.g., the case of PDE constraints. Second, one could incorporate general-purpose local solvers to compute feasible points that lead to upper bounds. Together with the lower bounds that we directly obtain from solving the master problem in every iteration, this would yield an optimality gap and thus a further termination criterion besides the one based on the approximation accuracy used in this paper. Third, the proposed method is obviously not (and also not expected to be) competitive with MINLP solution methods that explicitly exploit the structural properties of nonlinearities given in closed form. However, there are many possible ways of improving the general method proposed in this paper. For instance, the incorporation of presolving and dimension-reduction techniques could improve its performance considerably.