Abstract
Bilevel problems are used to model the interaction between two decision makers in which the lower-level problem, the so-called follower’s problem, appears as a constraint in the upper-level problem of the so-called leader. One issue in many practical situations is that the follower’s problem is not explicitly known by the leader. For such bilevel problems with an unknown lower-level model, we propose the use of neural networks to learn the follower’s optimal response for given decisions of the leader based on available historical data of pairs of leader and follower decisions. Integrating the resulting neural network in a single-level reformulation of the bilevel problem leads to a challenging model with a black-box constraint. We exploit Lipschitz optimization techniques from the literature to solve this reformulation and illustrate the applicability of the proposed method with some preliminary case studies using academic and linear bilevel instances.
1 Introduction
Bilevel optimization has been a highly active field of research in the last decades and has gained increasing attention over the last years. The main reason is that this class of optimization problems can serve as a powerful modeling tool in situations in which one has to model hierarchical decision making; see, e.g., the recent surveys by Kleinert et al. [18] and Beck et al. [2] as well as the annotated bibliography by Dempe [10] to get an overview of the many applications of bilevel optimization. However, this ability also renders bilevel optimization problems very hard to solve both in theory and practice. For instance, even linear bilevel problems are strongly NP-hard [15].
As discussed in the survey by Beck et al. [2], bilevel optimization problems can also be subject to uncertainty. Moreover, the sources of potential uncertainties are even richer compared to usual, i.e., single-level, optimization. The reason is that not only the problem’s data can be uncertain but also the (observation of the) decisions of the two players can be noisy or uncertain. For instance, it might be the case that the leader is not sure about whether the follower can solve the respective lower-level problem to global optimality and might want to hedge against possibly occurring near-optimal solutions; see, e.g., Besançon et al. [3, 4]. In this short paper, we go even one step further and assume that the leader has no knowledge about an explicit formulation of the lower-level problem. Hence, the leader needs to solve a bilevel optimization problem in which the lower level is unknown. What might sound impossible at a first glance can be done, at least approximately, if the bilevel game is repeatedly played so that past pairs of leader decisions and respective responses of the follower can be observed and collected. The obtained set of pairs of decisions can then be used as training data to train a neural network that learns the best-response function of the follower.
The downside of this approach is that the input-output mapping of the neural network can again not be stated using a closed-form expression. This leads to the field of optimization under black-box constraints [23]. Fortunately, there is some recent research on deep neural networks with specific activation functions that shows that the input-output mapping of the neural network is Lipschitz continuous and that Lipschitz constants can actually be computed using specifically chosen semidefinite programs [11, 22]. This paves the way for applying the Lipschitz optimization method proposed in Schmidt et al. [25], see Schmidt et al. [26] as well, in order to solve the single-level reformulation of the bilevel problem with unknown lower level in which the optimal response of the follower is replaced by the mapping that describes the input-output behavior of the trained neural network. In the Lipschitz-based approach, the only nonlinearity, which is the one given by the input-output mapping of the neural network, is outer-approximated using the Lipschitz constant of the mapping. This outer approximation is tightened from iteration to iteration, leading to a union of polyhedra that form a relaxation of the graph of the nonlinearity, which enables the application of powerful state-of-the-art mixed-integer solvers.
The main contribution of this short note is the combination of the recent results about the Lipschitz continuity of special neural networks with recent publications on Lipschitz optimization. By carrying this out in a careful way and by slightly adapting the method proposed in Schmidt et al. [25], we obtain a convergent algorithm for this highly challenging situation. We further illustrate the applicability of our approach in a case study based on academic bilevel instances from the literature.
Although there have been some applications of bilevel optimization problems for machine learning (see, e.g., Table 1 in Khanduri et al. [17]), this is, to the best of our knowledge, the first paper that uses neural networks to solve general bilevel optimization problems. The only other paper we are aware of following this idea is the one by Vlah et al. [27], who apply deep convolutional neural networks to classic bilevel bidding problems in power markets. However, they focus on the specific application and the specific type of network and its training. In contrast, we focus on the general mathematical idea of replacing unknown lower levels with neural networks. Furthermore, the work by Borrero et al. [7] presents a sequential learning method for linear bilevel problems under incomplete information. Provided that the strategic players interact with each other multiple times, the authors develop feedback mechanisms to update the missing information for the lower-level objective. Nevertheless, this way to address uncertainty and learning clearly differs from our neural-network-based approach. The main idea of this short paper is to give a proof of concept for the proposed method and we, thus, make some simplifications such as that we only consider linear bilevel problems with a scalar upper-level variable and without coupling constraints. We discuss these assumptions in more detail in Sect. 2 and how the setting can be generalized in Sect. 6. Let us finally note that our approach is related to other methods for solving bilevel optimization problems that rely on using optimal-value or best-response functions (see, e.g., Lozano and Smith [20] and the references therein). However, our approach is different since we do not work with these functions themselves but with surrogates that we obtain from training a neural network.
The remainder of the paper is structured as follows. In Sect. 2, we introduce the class of bilevel problems that we consider and pose the main assumptions that are required in what follows. Afterward, in Sect. 3, we review the recent literature on computing Lipschitz constants for deep neural networks, which is a prerequisite for the overall solution approach discussed in Sect. 4. Section 5 contains the case studies that illustrate the applicability of our approach for small instances from the literature. Finally, we conclude in Sect. 6 and discuss some topics for future research.
2 Problem statement
We consider linear bilevel problems of the general form
where \(\Psi (x)\) is the set of optimal solutions of the x-parameterized problem
with \(c \in {\mathbb {R}}^{n_x}\), \(d, f \in \mathbb {R}^{n_y}\), \(A \in {\mathbb {R}}^{m \times n_x}\), \(C \in {\mathbb {R}}^{\ell \times n_x}\), \(D \in {\mathbb {R}}^{\ell \times n_y}\), \(a \in {\mathbb {R}}^{m}\), and \(b \in {\mathbb {R}}^{\ell }\). Problem (1) is called the upper-level or leader’s problem. The decision variables of the leader are \(x \in \mathbb {R}^{n_x}\). Problem (2) is called the lower-level or follower’s problem and has the decision variables \(y\in \mathbb {R}^{n_y}\). Note that we consider the optimistic bilevel problem. Hence, the leader also optimizes over y if the lower-level problem is not uniquely solvable. The set \(\Omega :=\{(x,y) \in X\times Y: \text {(1b), (1c), (2b)}\}\) is called the shared constraint set and the set \({\mathcal {F}}:=\{(x,y) \in \Omega :y \in \Psi (x)\}\) is called the bilevel feasible set. With \({\mathcal {F}}_x\), we denote its projection onto the x variables. For what follows, we make the ensuing assumptions.
Assumption 1

(i)
The upper-level decision \(x\in \mathbb {R}^{n_x}\) is scalar, i.e., \(n_x=1\).

(ii)
For all upper-level decisions x for which \(\Psi (x)\) is nonempty, we have \(|\Psi (x)| = 1\). Hence, if the lower level is feasible and bounded, then it has a unique optimal solution y. Consequently, we can write \(y = \Psi (x)\) instead of \(y \in \Psi (x)\).

(iii)
The solutionset mapping \(\Psi (\cdot )\) is Lipschitz continuous.
According to Assumption 1, \(\Psi (x)\) is a singleton, leading to a one-to-one correspondence between the follower’s optimal responses and the upper-level decisions. Moreover, the mapping \(\Psi (\cdot )\) is polyhedral, i.e., its graph is a finite union of polyhedra; see Theorem 3.1 in Dempe [9]. Furthermore, the bilevel feasible set is connected because we have no coupling constraints and, thus, \(\Psi _i(x)\) is a Lipschitz continuous and piecewise linear function in x for all \(i \in [n_y] :=\{1,\dots ,n_y\}\).
Lower-level uniqueness is particularly important when it comes to training a neural network to learn the optimal response of the follower for a given upper-level decision. It ensures that, during supervised learning, there is only a single output y to be learned for a given input x. The assumption of a scalar leader’s decision can be generalized and is mainly taken for the ease of presentation. We will discuss this in more detail in our conclusion in Sect. 6.
3 Using neural networks to approximate the follower’s response
Stating the bilevel problem (1) relies on the knowledge about an explicit formulation for the lower-level problem (2). In the case in which the follower’s problem (2) is not known by the leader but past pairs (x, y) of leader and follower decisions are available, we propose using this historical data to learn the optimal response \(y = \Psi (x)\) of the follower using neural networks.
Such (x, y)-pairs naturally arise in many applications. For instance, in pricing models, the upper-level variable x is the price set by the leader for a certain good and y is the amount of this good bought by the follower. Both the price and the bought goods can be observed and collected to obtain (x, y)-pairs to train a neural network. Similar situations appear in many other fields such as in bilevel optimization for market design, transportation, or security. Obviously, it is required that the game modeled by the bilevel problem is played repeatedly so that large enough training sets can be collected over time.
In what follows, we use a neural network to learn \(\Psi (\cdot )\), which will then replace the lower level and turn the bilevel model into the single-level problem
where \(g_i\) is the function corresponding to the neural network for the ith follower’s response \(y_i\), \(i \in [n_y]\). Thus, we assume \(g_i(x) \approx \Psi _i(x)\). In what follows, we exploit some recent results from the literature to show that \(g_i\) is Lipschitz continuous and that Lipschitz constants can indeed be computed. This property of the used neural networks is vital for the decomposition method we use to solve Problem (3); see Sect. 4.
3.1 Lipschitz constants of neural networks
In this paper, we use the LipSDP-Neuron method published in Fazlyab et al. [11] to compute Lipschitz constants of the neural network functions \(g_i\), \(i \in [n_y]\). That is, we compute a constant \(L_i \ge 0\) that satisfies \(|g_i(x) - g_i(x')| \le L_i |x - x'|\) for all \(x, x' \in [{\underline{x}}, {\bar{x}}]\)
and for all \(i\in [n_y]\). The main idea in Fazlyab et al. [11] is to replace the nonlinear activation functions at the nodes of a neural network by so-called incremental quadratic constraints, which then allows one to state the problem of estimating Lipschitz constants as a semidefinite program (SDP). It is worth noting that the most complex and, hence, the most accurate version of LipSDP in Fazlyab et al. [11] is shown to be wrong in Pauli et al. [22]. Thus, we use the second of the three versions of LipSDP, namely LipSDP-Neuron.
We now describe the quadratic constraints that we use to replace all activation functions \(\phi (x) = [\varphi (x_1), \dots , \varphi (x_n) ]^{\top }: \mathbb {R}^n \rightarrow \mathbb {R}^n\) in a network layer, where the same slope-restricted function \(\varphi : \mathbb {R}\rightarrow \mathbb {R}\) is applied to each component of \(\phi \). Here and in what follows, we call a function \(\varphi \) slope-restricted w.r.t. \(0\le \alpha< \beta < \infty \) if
holds for all \(x, y \in \mathbb {R}\). This definition states that the slope of the line connecting any two points x and y on the graph of \(\varphi \) is bounded by \(\alpha \) and \(\beta \). It is easy to see that the Rectified Linear Unit (ReLU) activation function defined as \(\varphi (x):=\max \{0, x\}\) is slope-restricted with \(\alpha = 0\) and \(\beta = 1\); see Goodfellow et al. [13]. Furthermore, if the activation function \(\varphi \) is slope-restricted w.r.t. \([\alpha , \beta ] = [0,1]\), then so is the vector-valued function \(\phi (x) = [\varphi (x_1), \dots , \varphi (x_n)]\), which contains all activation functions in a network layer [12], if we apply the definition in (4) componentwise.
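As a quick numerical illustration (our own sketch, not part of the method), the slope-restriction property (4) for ReLU can be checked on a sample grid by verifying \(0 \le (\varphi (x)-\varphi (y))/(x-y) \le 1\) for all sampled pairs:

```python
# Numerical check that ReLU is slope-restricted w.r.t. alpha = 0 and beta = 1,
# i.e., 0 <= (phi(x) - phi(y)) / (x - y) <= 1 for all x != y; see (4).
def relu(t: float) -> float:
    return max(0.0, t)

def slope_restricted(phi, alpha, beta, samples):
    for x in samples:
        for y in samples:
            if x == y:
                continue
            slope = (phi(x) - phi(y)) / (x - y)
            if not (alpha <= slope <= beta):
                return False
    return True

samples = [i / 10.0 for i in range(-50, 51)]  # grid on [-5, 5]
print(slope_restricted(relu, 0.0, 1.0, samples))  # True
```

The same check correctly rejects, e.g., \(\varphi (t) = 2t\), whose slope exceeds \(\beta = 1\).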
The following lemma shows that the slope property (4) can be written as a quadratic constraint. We use the notation \({\mathbb {S}}^n\) for the set of all symmetric \(n\times n\) matrices.
Lemma 1
(Based on Lemma 1 in Fazlyab et al. [11]) Suppose \(\varphi : \mathbb {R}\rightarrow \mathbb {R}\) is slope-restricted w.r.t. \(\alpha \) and \(\beta \). Moreover, we define the set
of diagonal matrices T with nonnegative entries. Then, for any \(T \in {\mathcal {T}}_n\), the function \(\phi (x)=[\varphi (x_1), \dots , \varphi (x_n) ]^{\top } :~{\mathbb {R}}^n \rightarrow ~ {\mathbb {R}}^n\) satisfies the quadratic constraint
for all \(x,y \in {\mathbb {R}}^n\).
The statement of the lemma above is based on Lemma 1 in Fazlyab et al. [11] and has been modified according to the corrections published in Pauli et al. [22]. There, the authors give a counterexample that shows that the original lemma in Fazlyab et al. [11] is wrong. To be more specific, they show that there exists a matrix T built according to the definition of the set \({\mathcal {T}}_n\) given in Fazlyab et al. [11] that violates (6).
Assuming that all activation functions \(\varphi : \mathbb {R}\rightarrow \mathbb {R}\) are the same, a feedforward neural network \(f: \mathbb {R}^{n_0} \rightarrow \mathbb {R}^{n_{\ell +1}}\) can be written compactly as
where \({{\textbf {x}}} = [(x^{0})^{\top }, \dots , (x^{\ell })^{\top }]^{\top }\) is the concatenation of the input values at every layer of the network, \(\ell \) is the number of layers, \(x^i \in \mathbb {R}^{n_i}\) for all \(i\in \{1,\dots ,\ell \}\), \(\phi \) is the vector-valued function, which applies the activation function \(\varphi \) to every entry of the input vector, and
holds, where \(W^i\) is the weight matrix connecting layer i with layer \(i+1\), and \(I_{n_i}\) is the \(n_i \times n_i\) identity matrix. The next theorem is the central result in Fazlyab et al. [11] and shows that the Lipschitz constant of a neural network is the solution of an SDP in which the matrix T defined in Lemma 1 serves as a decision variable.
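To make the compact representation (7) concrete, the following sketch evaluates a tiny one-hidden-layer ReLU network and checks the crude, well-known Lipschitz bound given by the product of the layer norms (here Frobenius norms, which upper-bound the spectral norms). The weights are made up for illustration; LipSDP typically yields much tighter constants than this product bound.

```python
import math

# Hypothetical weights of a tiny ReLU network f(x) = W1 * phi(W0 * x)
# (one hidden layer with two neurons, scalar input and output).
W0 = [[1.0], [-2.0]]          # shape 2 x 1
W1 = [[0.5, 1.5]]             # shape 1 x 2

def relu_vec(v):
    return [max(0.0, t) for t in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def f(x):
    # Feedforward pass x^0 -> x^1 -> output as in (7).
    return matvec(W1, relu_vec(matvec(W0, [x])))[0]

def frob(W):
    return math.sqrt(sum(w * w for row in W for w in row))

# Since ReLU is 1-Lipschitz, L <= prod_i ||W^i||_2 <= prod_i ||W^i||_F
# is a valid (typically loose) Lipschitz bound for the network.
L_crude = frob(W0) * frob(W1)

# Empirical check on a grid: |f(x) - f(y)| <= L_crude * |x - y|.
xs = [i / 20.0 for i in range(-40, 41)]
ok = all(abs(f(x) - f(y)) <= L_crude * abs(x - y) + 1e-12
         for x in xs for y in xs)
print(ok)  # True
```

For these weights the true Lipschitz constant is 3 (the slope of f for negative inputs), while the product bound gives roughly 3.54; the SDP-based LipSDP approach closes such gaps.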
Theorem 2
(Theorem 2 in Fazlyab et al. [11]) Consider an \(\ell \)-layer and fully connected neural network given by (7). Let \(n = \sum _{k=1}^{\ell } n_k\) be the total number of hidden neurons and suppose that the activation functions are slope-restricted w.r.t. \(\alpha \) and \(\beta \). Define \({\mathcal {T}}_n\) as in (5), A and B as in (8), and consider the matrix inequality
with
If (9) holds for some \((\rho , T) \in {\mathbb {R}}_{\ge 0} \times {\mathcal {T}}_n \,\), then \(\Vert f(x) - f(y) \Vert _2 \le \sqrt{\rho }\, \Vert x - y \Vert _2\) holds for all \(x,y \in {\mathbb {R}}^{n_0}\).
As a result of Theorem 2, a Lipschitz constant for multilayer networks can be computed by solving
where \((\rho , T) \in {\mathbb {R}}_{+} \times {\mathcal {T}}_n\) are the decision variables. Furthermore, \(M(\rho , T)\) is linear in \(\rho \) and T and the set \({\mathcal {T}}_n\) is convex. Hence, Problem (10) is a semidefinite program.
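For intuition, Problem (10) can be solved by hand in the smallest possible case: a network with a single hidden ReLU neuron, \(f(x) = w_1 \varphi (w_0 x)\). Then the matrix in (9) is \(2\times 2\) and, via a Schur-complement argument, the minimal feasible value is \(\rho ^* = w_0^2 w_1^2\), attained at \(t = w_1^2\), i.e., \(L = |w_0 w_1|\), which is the exact Lipschitz constant in this tiny case. The following sketch (with made-up weights) checks this reduction numerically using the trace/determinant criterion for \(2\times 2\) symmetric matrices:

```python
import math

# Illustrative scalar instance of the matrix inequality (9): one hidden ReLU
# neuron, f(x) = w1 * max(0, w0 * x). The weights are made up.
w0, w1 = 3.0, -1.5

def M_neg_semidefinite(rho, t):
    # With alpha = 0, beta = 1, the matrix from (9) reduces to
    #   M = [[-rho,     w0 * t       ],
    #        [ w0 * t, -2 t + w1**2 ]].
    a, b, d = -rho, w0 * t, -2.0 * t + w1 * w1
    # M <= 0 iff trace <= 0 and det >= 0 (2x2 symmetric case).
    return a + d <= 1e-12 and a * d - b * b >= -1e-12

# Minimizing rho subject to (9) over t > w1^2 / 2 gives, via the Schur
# complement, rho >= t^2 w0^2 / (2 t - w1^2), which is minimal at t = w1^2,
# hence rho* = w0^2 * w1^2 and L = sqrt(rho*) = |w0 * w1|.
t_opt = w1 * w1
rho_opt = w0 * w0 * w1 * w1
print(M_neg_semidefinite(rho_opt, t_opt))        # True: (9) holds
print(M_neg_semidefinite(rho_opt - 0.1, t_opt))  # False: rho* is tight
print(math.sqrt(rho_opt))                        # 4.5 = |w0 * w1|
```

For larger networks no such closed form is available and one solves the SDP (10) with a solver, as done with cvxpy and MOSEK in Sect. 5.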
4 Solution approach
The neural network constraint (3c) turns model (3) into a challenging problem. In particular, for complex and large neural networks, one cannot expect to obtain a closed-form expression for \(g_i\), \(i\in [n_y]\), that has reasonable properties required for optimization. In any case, the resulting constraints are nonlinear, nonconvex, and nonsmooth. However, we can evaluate these constraints and we can compute their Lipschitz constants. Thus, it is reasonable to make the following assumption.
Assumption 2
An oracle is available that evaluates \(g_i(\cdot )\) for all \(i\in [n_y]\) and all \(g_i\) are globally Lipschitz continuous on \({\underline{x}}\le x \le {\bar{x}}\) with known global Lipschitz constant \(L_i\).
To solve Problem (3), we use the decomposition method published in Schmidt et al. [25] but modify it slightly so that we can apply it to our setting. We assume Lipschitz continuity of the problematic nonlinearities and use the corresponding Lipschitz constants to build a mixed-integer linear problem (MILP) that is a relaxation of the original model (3). We refine this relaxation in every iteration until a satisfactory solution is found or until the problem is shown to be infeasible. A satisfactory solution is formally defined to be an \(\varepsilon \)-feasible point, i.e., a point that solves
where \(\varepsilon \ge 0\) is a user-specified tolerance. Note that this relaxation of Problem (3) only affects the nonlinear functions \(g_i\), \(i \in [n_y]\).
4.1 Core ideas and notation
The main idea of the decomposition method is to relax the graphs of \(g_i\) with the help of a set \(\Omega _i\). The set \(\Omega _i\) is given by linear constraints built around \(g_i\) using the corresponding Lipschitz constant \(L_i\). The resulting relaxation \(\Omega _i\) is then refined in each iteration k such that \(\Omega _i^k\) can be written as a union of polytopes
which converges towards the graph of \(g_i\) for \(k\rightarrow \infty \). The union of polytopes is uniquely defined in each iteration k by a set of points on the x-axis,
where \(x_{i}^{k,j} \in \mathbb {R}\) are scalar values for \(j \in \{0\} \cup J_i^k = \{0\} \cup \{1, \dots , J_i^k\}\) and
Note that while the \(x_i\) are scalar, the index i indicates that the sets \({\mathcal {X}}_i\) can evolve differently for every \(i\in [n_y]\). In other words, the relaxations of the \(g_i\) can be refined individually in each iteration k. The relaxation \(\Omega _i^k\) of \(g_i\) is improved by adding a new point \(x_i^{k,j}\) to \({\mathcal {X}}_i^k\). The polytopes of the union mentioned above are given by
for \(j \in J_i^k\). Visually speaking, (12) states that the polytopes \(\Omega _i^k(j)\) are all quadrilaterals with two vertices \(x_{i}^{k,j-1}\) and \(x_{i}^{k,j}\) on the graph of \(g_i\); see Fig. 1. To understand how a point is added to \({\mathcal {X}}_i\), we discuss the two problems solved in each iteration of the algorithm.
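For a single segment, such a quadrilateral can be described as the intersection of the two Lipschitz cones rooted at the graph points over the segment endpoints. The following membership test is a sketch under this assumption (the function and constants are made up for illustration):

```python
# Sketch of a membership test for one quadrilateral Omega_i^k(j) from (12),
# assuming it is the intersection of the two Lipschitz cones rooted at the
# graph vertices (x_prev, g(x_prev)) and (x_next, g(x_next)).
def in_polytope(x, y, x_prev, y_prev, x_next, y_next, L, tol=1e-9):
    if not (x_prev - tol <= x <= x_next + tol):
        return False
    # Cone rooted at the left graph vertex ...
    if not (y_prev - L * (x - x_prev) - tol <= y <= y_prev + L * (x - x_prev) + tol):
        return False
    # ... and at the right graph vertex.
    return y_next - L * (x_next - x) - tol <= y <= y_next + L * (x_next - x) + tol

# Example with g(x) = x^2 and Lipschitz constant L = 2 on [-1, 1]:
g = lambda x: x * x
x0, x1, L = -1.0, 1.0, 2.0
print(in_polytope(0.0, -1.0, x0, g(x0), x1, g(x1), L))  # True: lower cone vertex
print(in_polytope(0.0, -1.5, x0, g(x0), x1, g(x1), L))  # False: below both cones
```

By construction, the two graph vertices themselves always pass this test, and every point of the graph of g between them lies inside the quadrilateral whenever L is a valid Lipschitz constant.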
4.2 The master problem
The master problem (M(k)) is solved in each iteration over the sets \(\Omega _i^k\). The problem is formally given by
with \(\omega = (x,y)\) and \(\omega _i :=(x, y_i)\). Note that \(\omega _i\) is not the ith entry in \(\omega \), since \(\omega \in \mathbb {R}^{1 + n_y}\) is the vector containing x and y while \(\omega _i \in \mathbb {R}^2\) contains x and \(y_i\). If the solution \(\omega ^k\) with \(x^k\in \mathbb {R}\) and \(y^k\in \mathbb {R}^{n_y}\) of (M(k)) is \(\varepsilon \)-feasible, then Problem (3) is approximately solved. According to Schmidt et al. [25], the master problem (M(k)) can be modeled as the mixed-integer linear problem
The constraint \(\omega _i \in \Omega _i^k\), \(i\in [n_y]\), in (M(k)) is modeled here using big-M constraints. The binary variables z in (13i) and (13h) ensure that only one polytope is active for all \(i \in [n_y]\).
4.3 The subproblem
On the other hand, if the solution \(\omega ^k\) is not \(\varepsilon \)-feasible, the relaxation \(\Omega _i^k\) of the graph of \(g_i\) needs to be refined. This is achieved by solving a subproblem to find a point on the graph of \(g_i\) that is as close as possible to the solution \(\omega _i^k \in \Omega _i^k\) of the master problem (M(k)). We then use this point to refine the relaxation.
Our subproblem differs from the original method described in Schmidt et al. [25]. There, an optimization problem over nonlinearities and other linear constraints is solved. This is not possible in the setting we consider here, or is at least extremely challenging, due to the nonconvex and nonsmooth nature of neural networks.
The subproblem is solved only for the particular polytope \(\Omega _i^k(j_i^k) \subset \Omega _i^k\) with \(\omega _i^k \in \Omega _i^k(j_i^k)\). That is, we look for a point on the graph of \(g_i\) that is contained in \(\Omega _i^k(j_i^k)\). Additionally, the feasible set of the subproblem is further reduced to only allow for a solution in an inner approximation of the respective polytope. This way, we ensure that newly found points do not accumulate at an already existing value in \({\mathcal {X}}_i^k\). For a given \(j \in J_i^k\), this subset is defined as
with
and \(d_i^{k,j} = x_{i}^{k,j} - x_{i}^{k,j-1}\) is the length of the corresponding segment on the x-axis. The constant 0.25 can be replaced by any value in (0, 0.5). The left plot in Fig. 1 indicates the set \({\widetilde{\Omega }}_i^k\) with vertical dashed lines.
To solve the subproblem, we sample points \(\tilde{x}_i\) on the x-axis segment corresponding to \({\widetilde{\Omega }}_i^k(j_i^k)\) and evaluate \(g_i\) at these points. Then, the solution of the subproblem is given by the computed point \({\tilde{\omega }}_i :=(\tilde{x}_i, g_i(\tilde{x}_i))\) that is closest to the solution \(\omega _i^k = (x^k, y_i^k)\) of the master problem w.r.t. the Euclidean distance; see Fig. 1. In other words, we choose from a finite set of points in \({\widetilde{\Omega }}_i^k(j_i^k)\) the one that is on the graph of \(g_i\) and as close as possible to the solution \(\omega _i^k\) of (M(k)). Note that we solve the subproblem only for those \(i \in [n_y]\) that are not \(\varepsilon \)-feasible, i.e., if \(|g_i(x^k) - y_i^k| > \varepsilon \) holds.
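The sampling-based subproblem can be sketched as follows; the oracle g, the sample size p, and the function names are our placeholders for the trained network and the concrete choices made later in Sect. 5:

```python
import math

# Sketch of the sampling-based subproblem: on the shrunk segment of the
# polytope containing the master solution, evaluate g at p equidistant points
# and return the graph point closest to (x_k, y_k) in Euclidean distance.
def solve_subproblem(g, x_prev, x_next, x_k, y_k, p=100, shrink=0.25):
    d = x_next - x_prev                      # segment length on the x-axis
    lo, hi = x_prev + shrink * d, x_next - shrink * d
    candidates = [lo + i * (hi - lo) / (p - 1) for i in range(p)]
    # Pick the sampled graph point closest to the master solution.
    return min(((x, g(x)) for x in candidates),
               key=lambda pt: math.hypot(pt[0] - x_k, pt[1] - y_k))

# Example with g(x) = x^2 on the segment [-1, 1] and master solution (0, -1):
x_t, y_t = solve_subproblem(lambda x: x * x, -1.0, 1.0, 0.0, -1.0, p=101)
print((round(x_t, 6), round(y_t, 6)))  # (0.0, 0.0): the closest graph point
```

Note how the shrink factor keeps all candidates strictly inside the segment, so the returned point never coincides with an existing breakpoint.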
The solution \(\tilde{x}_i^k\) of the subproblem is added to the set \({\mathcal {X}}_i^k\) and thus refines the relaxation of \(g_i\); see Fig. 1 (right). While \(x^k\) in \(\omega _i^k\) is the solution of (M(k)) and thus shared by all \(i\in [n_y]\), the solution \(\tilde{x}_i^k\) of the subproblem is unique to each \(i \in [n_y]\).
4.4 Algorithm and convergence properties
The Lipschitz decomposition method is formally given in Algorithm 1. There, the master problem (M(k)) is solved in each iteration k and the algorithm checks if the solution \(\omega ^k\) is \(\varepsilon \)-feasible. If this is not the case, the polytope \(\Omega _i^k(j^k_i)\) containing the solution \(\omega _i^k\) is identified using the index of the variables \(z_i^{k,j}\) in the MILP (13) for all \(\varepsilon \)-infeasible \(i \in [n_y]\). Then, a new point \({\tilde{\omega }}_{i}^k = (\tilde{x}_i^k, \tilde{y}_i^k)\) is found on a subset \({\widetilde{\Omega }}_i^k(j^k_i) \subset \Omega _i^k(j^k_i)\) and \(\tilde{x}_i^k\) is added to \({\mathcal {X}}_i^k\), refining the approximation of \(g_i\). Notice that while x is scalar, the index i in \(\tilde{x}_i^k\) signals that the new point found on the x-axis in iteration k is specific to the relaxation of the function \(g_i\). Moreover, the number of points in \({\mathcal {X}}_i^k\) is at most k in every iteration, which means that \(J_i^k \le k\) holds for all \(i \in [n_y]\).
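For the scalar case \(n_y = 1\) with a linear objective, the whole loop can be sketched in a few lines: since a linear objective attains its minimum over a union of quadrilaterals at one of their vertices, the sketch below replaces the MILP master (13) by brute-force vertex enumeration and the subproblem by the sampling scheme from Sect. 4.3. The toy instance (g(x) = x², standing in for a trained network) and all names are ours:

```python
# Toy sketch of Algorithm 1 for a scalar leader decision, n_y = 1, and a
# linear objective F. Master: enumerate the vertices of all quadrilaterals
# (valid for linear F). Subproblem: sample the shrunk segment.
def lipschitz_decomposition(g, F, x_lo, x_hi, L, eps=1e-3, p=101, max_iter=500):
    xs = [x_lo, x_hi]                     # breakpoints defining the relaxation
    for _ in range(max_iter):
        # Master problem: minimize F over the union of quadrilaterals (12).
        best = None
        for a, b in zip(xs, xs[1:]):
            ga, gb = g(a), g(b)
            x_top = (gb - ga) / (2 * L) + (a + b) / 2   # upper cone crossing
            x_bot = (ga - gb) / (2 * L) + (a + b) / 2   # lower cone crossing
            verts = [(a, ga), (b, gb),
                     (x_top, ga + L * (x_top - a)),
                     (x_bot, ga - L * (x_bot - a))]
            for v in verts:
                if best is None or F(*v) < F(*best[0]):
                    best = (v, a, b)
        (xk, yk), a, b = best
        if abs(g(xk) - yk) <= eps:        # eps-feasible as in (11) -> done
            return xk, yk
        # Subproblem: pick the sampled graph point on the shrunk segment
        # closest to the master solution and refine the breakpoints.
        d = b - a
        lo, hi = a + 0.25 * d, b - 0.25 * d
        cand = [lo + i * (hi - lo) / (p - 1) for i in range(p)]
        x_new = min(cand, key=lambda x: (x - xk) ** 2 + (g(x) - yk) ** 2)
        xs = sorted(xs + [x_new])
    return None                           # no eps-feasible point found

# Toy run: minimize y subject to y = g(x) with g(x) = x^2 on [-1, 1] (L = 2).
sol = lipschitz_decomposition(lambda x: x * x, lambda x, y: y, -1.0, 1.0, 2.0)
print(sol)  # a point close to the true optimum (0, 0)
```

The growing list of breakpoints mirrors the growth of the MILP described below: each refinement adds one more quadrilateral to enumerate.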
We remark that the new point \(\tilde{x}_{i}^k\) found by the subproblem splits the original quadrilateral \(\Omega _i^k(j_i^k)\) into two smaller ones; see Fig. 1. For the two new polytopes, we use the notations \(\Omega _i^k(j_1^k)\) and \(\Omega _i^k(j_2^k)\). Hence, the union of polytopes \(\Omega _i^{k+1}\) that approximates \(g_i\) in iteration \(k+1\) is given by
and we have the following lemma.
Lemma 3
(See Lemma 4 in Schmidt et al. [25]) There exists a constant \(\delta > 0\) depending only on \(\varepsilon \) and L such that as long as Algorithm 1 does not terminate in Line 5 or 9, there exists a constant \(\delta ^k > \delta \) for every k with
Theorem 4
(See Theorem 1 in Schmidt et al. [25]) There exists a \(K<\infty \) such that Algorithm 1 either terminates with an approximate globally optimal point \(\omega ^k\) or with the indication of infeasibility in an iteration \(k \le K\).
Similarly to Theorem 1 in Schmidt et al. [25], Theorem 4 follows from the fact that, according to Lemma 3, the volume of \(\Omega _i^k\) decreases by a positive value \(\delta ^k\) in each iteration that is uniformly bounded away from zero.
However, due to the fact that \(\Omega _i\) is refined in each iteration to better approximate \(g_i\), the master problem (M(k)) grows linearly over the course of the iterations, leading to one additional binary variable \(z^{k,j}_i\) and some additional linear constraints. Consequently, the computational effort increases in every iteration.
5 Case study
In this section, we illustrate the applicability of Algorithm 1 on the basis of a set of exemplary case studies. We first discuss some implementation details and then present the results of the algorithm.
5.1 Implementation details
We now explain how we generate the (x, y)-pairs for the considered instances. We also discuss the used neural networks and their training, the sampling of points for the subproblem, and how we verify the solutions obtained with our algorithm.
We generate the (x, y)-pairs as follows. First, we solve the high-point relaxation of the bilevel problem but with a different objective function in which we either minimize or maximize the upper-level variable x to get the set \({\mathcal {F}}_x\). Then, we sample equidistantly in this interval, solve the parametric lower-level problem for the given values of x to obtain the lower-level problem’s solution y, and store the point (x, y) with \(y = \Psi (x)\) in our data set. The resulting data set entirely consists of bilevel feasible points. The obtained (x, y)-pairs are then used to train neural networks to learn the optimal follower’s response. Note that using the modified high-point relaxation to obtain the x-interval used for sampling is not possible in reality if the lower-level problem is not known. However, a generation procedure would not be required in reality at all since the set of (x, y)-pairs for training the neural networks consists of observed data from the past.
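The grid-sampling step can be sketched as follows; `solve_lower_level` is a hypothetical stand-in for solving the parametric problem (2), here replaced by a made-up closed-form best response:

```python
# Sketch of the training-data generation: sample the leader's decision x
# equidistantly over an interval and record the follower's optimal response.
# `solve_lower_level` is a placeholder for solving the parametric LP (2);
# the closed-form response below is made up for illustration.
def solve_lower_level(x):
    return max(0.0, 1.0 - x)       # hypothetical piecewise-linear response

def generate_pairs(x_min, x_max, n):
    xs = [x_min + i * (x_max - x_min) / (n - 1) for i in range(n)]
    return [(x, solve_lower_level(x)) for x in xs]

pairs = generate_pairs(0.0, 2.0, 5)
print(pairs)  # [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0), (1.5, 0.0), (2.0, 0.0)]
```

In the intended application no such generator exists; the pairs are simply the observed historical decisions.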
The lowerlevel problem of every instance is then assumed to be unknown. Instead, for every considered problem, the optimal follower’s response for a given leader’s decision x, i.e., \(\Psi _i(x)\), is learned with feedforward neural networks with ReLU activation functions at every node, for all \(i\in [n_y]\); see, e.g., Goodfellow et al. [13]. The functions learned with ReLU networks are inherently piecewise linear and, thus, Lipschitz continuous. Moreover, for all instances the weights of the network are updated via the Adam optimizer [24]. We vary the learning rate, the number of epochs, and the network architecture depending on the instance at hand.
The bilevel feasible set \({\mathcal {F}}\) and its projection \({\mathcal {F}}_x\) onto the x-variables also depend on the unknown lower level. Hence, we attempt to deduce \({\mathcal {F}}\) and \({\mathcal {F}}_x\) from the available (x, y)-pairs. Due to the fact that the interval \([{\underline{x}}, {\bar{x}}]\) with \({\underline{x}} \le x \le {\bar{x}}\) can be larger than \({\mathcal {F}}_x\), we do not use \({\underline{x}}\) and \({\bar{x}}\) to initialize \(\Omega _i\); see (12). Instead, we use the closed interval generated by the smallest and the largest x-values in the available training set as a proxy for \({\mathcal {F}}_x\).
Given the trained networks \(g_i\), we solve the master problem and, subsequently, the subproblems as described in Sect. 4. To solve the subproblem, we evaluate the functions \(g_i\) at \(p=100\) points on the corresponding x-axis segment. If s points have already been evaluated on this segment in earlier iterations, then only \(p-s\) new equidistantly distributed points \((x, g_i(x))\) are computed.
Finally, Algorithm 1 stops with the indication that Problem (3) is infeasible or with an \(\varepsilon \)-feasible solution, where \(\varepsilon = 10^{-5}\) is used in our experiments. The solution computed by our algorithm is compared with the solution obtained by solving the mixed-integer KKT reformulation with sufficiently large big-M constants; see, e.g., Kleinert et al. [18].
We implemented the LipSDP-Neuron method using the cvxpy 1.1.13 package and the SDP solver MOSEK 9.3.20. All occurring linear or mixed-integer linear problems (in particular, (M(k))) are solved using Gurobi 9.5.1. All neural networks have been trained using the Python library torch 1.11.0. All computations have been executed on an Intel Core i7-10510U CPU with 8 cores of 1.8 GHz each and 32 GB RAM.
5.2 Discussion of the results
We apply the Lipschitz decomposition method to 6 instances of linear bilevel problems from the literature. All instances have a scalar upper-level decision variable x so that Assumption 1 is satisfied. For all 6 instances, we use \(60\,\%\) of the (x, y)-pairs for training and the rest for validation. Alternative proportions of training and validation set sizes could be used as well if appropriate. Furthermore, given that \(\Psi _i(\cdot )\), \(i\in [n_y]\), is piecewise linear in our context, rather small data sets are often already enough to find good solutions for the considered instances.
Table 1 shows the training set sizes, the configurations of the neural networks, and the used learning rates. Table 2 contains the corresponding Lipschitz constants computed for the networks described in the previous table as well as the time required to compute them. Finally, Table 3 displays the solutions obtained with Algorithm 1 (as well as the required number of iterations) next to the ones computed by solving the KKT reformulation. All computation times are given in seconds.
5.2.1 The performance of Algorithm 1
As discussed in Sect. 4.4, the computational effort of solving the master problem naturally increases over the course of the iterations. This can also be seen clearly in Fig. 2 for four exemplary instances. Furthermore, the computed Lipschitz constants determine the volume of the polytopes \(\Omega _i^k\), \(i \in [n_y]\). Thus, according to Lemma 3 and Theorem 4, the convergence of Algorithm 1 directly depends on the constants \(L_i\).
One way to keep Algorithm 1 most efficient is to compute Lipschitz constants \(L_i\) that are as small as possible. To illustrate this, we consider the instance [21]
where \(\Psi (x)\) is the set of optimal solutions of the x-parameterized linear lower-level problem
The optimal solution of Problem (14) is (0, 1.5). As explained in Sect. 5.1, we use the smallest and the largest x-values in the available data set to initialize \([{\underline{x}}, {\bar{x}}]\), which is required in Line 1 of Algorithm 1. For this instance, the unknown lower level is substituted by a neural network that has 2 hidden layers with 5 nodes each and is trained on a set of 30 points. The learning rate is approximately 0.074. Using the trained network’s weights, we compute a Lipschitz constant of about 3.02. Finally, Algorithm 1 finds the approximate solution (0, 1.4999) in 1.21 seconds and 31 iterations for \(\varepsilon = 10^{-5}\).
Table 4 captures how different Lipschitz constants influence the computation time and the number of iterations of Algorithm 1 when applied to Problem (14). For all computations, we keep the tolerance \(\varepsilon = 10^{-5}\). This table clearly shows that the closer the Lipschitz constant used is to the true value of 2.5, the better the algorithm performs.
Moreover, constants smaller than 2.5 seemingly perform even better. For instance, using 0.5, Algorithm 1 finds a solution in only 10 iterations. Nevertheless, using values smaller than the actual Lipschitz constant of the function \(g_i(\cdot )\) can lead to the algorithm falsely reporting infeasibility.
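One simple way to obtain a valid, though generally conservative, Lipschitz constant from a trained ReLU network's weights is the product of the layers' spectral norms (valid because ReLU is itself 1-Lipschitz). This is not necessarily the estimator used for the constants reported in the tables; it is only a sketch of the cheapest sound bound.

```python
import numpy as np

def lipschitz_upper_bound(weights):
    """Product of layer spectral norms: a valid (often loose) Lipschitz
    constant for a feedforward ReLU network with these weight matrices."""
    L = 1.0
    for W in weights:
        L *= np.linalg.norm(W, 2)  # largest singular value of the layer
    return L

# Two 1x1 "layers" with slopes 2 and 3 give the exact constant 6
L = lipschitz_upper_bound([np.array([[2.0]]), np.array([[3.0]])])
```

As the discussion above indicates, overestimating the constant only slows the algorithm down, whereas replacing it with a value below the true constant risks false infeasibility reports; a sound upper bound like this one is therefore the safe side to err on.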
5.2.2 The accuracy of Algorithm 1
According to (11), an \(\varepsilon\)-feasible solution of an instance is \(\varepsilon\)-close to the graph of all functions \(g_i(\cdot )\). This does not mean that it is also necessarily close to the true optimal response \(\Psi _i(\cdot )\), \(i\in [n_y]\). Consequently, the accuracy of the method depends on the ability of the neural network to approximate the true optimal responses as well as possible.
Figure 3 illustrates the importance of a good approximation. The bottom plot shows that the \(\varepsilon\)-feasible solution obtained for instance (14), in this case (0.03, 1.67), is \(\varepsilon\)-close to (0.03, g(0.03)) but not to \((0.03, \Psi (0.03))\).
In this context, we also observed that larger training sets can lead to better solutions. Table 5 shows how the computed Lipschitz constants and the obtained solutions change with increasing data sets for the instance in Bard [1]. For this instance, the true solution is (7.2, 1.6). Since this solution is located at an extreme point of \({\mathcal {F}}_x\), the accuracy of Algorithm 1 directly depends on the proxy for \({\mathcal {F}}_x\) in this specific example.
6 Conclusion
In many practical situations, the leader of a bilevel optimization problem is not aware of an explicit formulation of the follower's problem. To cope with this issue, we proposed a method that uses neural networks to learn the follower's optimal reaction from past bilevel solutions. After training the network, we compute Lipschitz constants for the learned functions and use a Lipschitz decomposition method to solve the reformulated, single-level problem with neural-network constraints.
This short paper should serve as a proof of concept for the ideas sketched above. However, many aspects can be improved. First of all, the assumption of a scalar leader's decision seems rather strong. Very recently, a follow-up paper [14] of Schmidt et al. [25] appeared, in which a similar Lipschitz decomposition method is developed that can tackle the multidimensional case. This method can now be used to also consider bilevel optimization problems with an unknown follower problem and multiple decision variables of the leader. Second, we restricted ourselves in this paper to the case without coupling constraints. Fortunately, it is rather straightforward to use Algorithm 1 for linear bilevel problems with coupling constraints as well. Even if the bilevel feasible set is disconnected due to the presence of coupling constraints, the follower's optimal response function \(\Psi _i(\cdot )\) is learned as a piecewise linear function by the corresponding neural network. Since solutions are found at vertices of the feasible set, at least one feasible point is always available to serve as a solution, even if the line connecting two points is not bilevel feasible. On the other hand, matters become more complicated if nonlinear bilevel problems are considered. Hence, the setting of nonlinear problems with coupling constraints is a topic of future research. Third, the overall idea should be reasonable also in the mixed-integer case, at least if no coupling constraints are present. Fourth, there has been some recent work [6] on the relation between multilevel mixed-integer linear optimization problems and multistage stochastic mixed-integer linear optimization problems with recourse. Hence, it might be possible to exploit these relations to carry over learning-based techniques for two-stage stochastic optimization to bilevel optimization.
Fifth and finally, it would be very interesting to see an application of the concept discussed in this paper to a real-world situation, which is, however, beyond the scope of this short paper.
References
Bard, J.F.: Optimality conditions for the bilevel programming problem. Naval Res. Logist. Q. 31(1), 13–26 (1984). https://doi.org/10.1002/nav.3800310104
Beck, Y., Ljubic, I., Schmidt, M.: A survey on bilevel optimization under uncertainty. Technical report. http://www.optimization-online.org/DB_HTML/2022/06/8963.html (2022)
Besançon, M., Anjos, M.F., Brotcorne, L.: Near-optimal robust bilevel optimization. arXiv:1908.04040 (2019)
Besançon, M., Anjos, M.F., Brotcorne, L.: Complexity of near-optimal robust versions of multilevel optimization problems. Optim. Lett. 15, 2597–2610 (2021). https://doi.org/10.1007/s11590-021-01754-9
Bialas, W.F., Karwan, M.H.: Two-level linear programming. Manag. Sci. 30(8), 1004–1020 (1984)
Bolusani, S., Coniglio, S., Ralphs, T.K., Tahernejad, S.: A unified framework for multistage mixed integer linear optimization. In: Dempe, S., Zemkoho, A. (eds.) Bilevel Optimization: Advances and Next Challenges, pp. 513–560. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-52119-6_18
Borrero, J.S., Prokopyev, O.A., Sauré, D.: Learning in sequential bilevel linear programming. INFORMS J. Optim. (2022). https://doi.org/10.1287/ijoo.2021.0063
Clark, P., Westerberg, A.: A note on the optimality conditions for the bilevel programming problem. Naval Research Logistics (NRL) 35(5), 413–418 (1988). https://doi.org/10.1002/1520-6750(198810)35:5<413::AID-NAV3220350505>3.0.CO;2-6
Dempe, S.: Foundations of Bilevel Programming. Springer, Berlin (2002)
Dempe, S.: Bilevel optimization: theory, algorithms, applications and a bibliography. In: Dempe, S., Zemkoho, A. (eds.) Bilevel Optimization: Advances and Next Challenges, pp. 581–672. Springer, Berlin (2020). https://doi.org/10.1007/978-3-030-52119-6_20
Fazlyab, M., Robey, A., Hassani, H., Morari, M., Pappas, G.: Efficient and accurate estimation of Lipschitz constants for deep neural networks. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Inc. (2019). https://proceedings.neurips.cc/paper/2019/file/95e1533eb1b20a97777749fb94fdb944-Paper.pdf
Fazlyab, M., Morari, M., Pappas, G.J.: Safety verification and robustness analysis of neural networks via quadratic constraints and semidefinite programming. IEEE Trans. Autom. Control 67(1), 1–15 (2022). https://doi.org/10.1109/TAC.2020.3046193
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016). https://mitpress.mit.edu/books/deep-learning
Grübel, J., Krug, R., Schmidt, M., Wollner, W.: A successive linear relaxation method for MINLPs with multivariate Lipschitz continuous nonlinearities with applications to bilevel optimization and gas transport. Technical report. arXiv:2208.06444 (2022)
Hansen, P., Jaumard, B., Savard, G.: New branch-and-bound rules for linear bilevel programming. SIAM J. Sci. Stat. Comput. 13(5), 1194–1217 (1992). https://doi.org/10.1137/0913069
Haurie, A., Savard, G., White, D.: A note on: an efficient point algorithm for a linear two-stage optimization problem. Oper. Res. 38(3), 553–555 (1990). https://doi.org/10.1287/opre.38.3.553
Khanduri, P., Zeng, S., Hong, M., Wai, H.T., Wang, Z., Yang, Z.: A near-optimal algorithm for stochastic bilevel optimization via double-momentum. arXiv:2102.07367 (2021)
Kleinert, T., Labbé, M., Ljubic, I., Schmidt, M.: A survey on mixed integer programming techniques in bilevel optimization. EURO J. Comput. Optim. 9, 100007 (2021). https://doi.org/10.1016/j.ejco.2021.100007
Liu, Y.H., Hart, S.M.: Characterizing an optimal solution to the linear bilevel programming problem. Eur. J. Oper. Res. 73(1), 164–166 (1994). https://doi.org/10.1016/0377-2217(94)90155-4
Lozano, L., Smith, J.C.: A value-function-based exact approach for the bilevel mixed-integer programming problem. Oper. Res. 65(3), 768–786 (2017). https://doi.org/10.1287/opre.2017.1589
Moore, J.T., Bard, J.F.: The mixed integer linear bilevel programming problem. Oper. Res. 38(5), 911–921 (1990)
Pauli, P., Koch, A., Berberich, J., Kohler, P., Allgöwer, F.: Training robust neural networks using Lipschitz bounds. IEEE Control Syst. Lett. 6, 121–126 (2022). https://doi.org/10.1109/LCSYS.2021.3050444
Rios, L.M., Sahinidis, N.V.: Derivative-free optimization: a review of algorithms and comparison of software implementations. J. Glob. Optim. 56(3), 1247–1293 (2013). https://doi.org/10.1007/s10898-012-9951-y
Ruder, S.: An overview of gradient descent optimization algorithms. arXiv:1609.04747 (2016)
Schmidt, M., Sirvent, M., Wollner, W.: A decomposition method for MINLPs with Lipschitz continuous nonlinearities. Math. Program. 178(1), 449–483 (2019). https://doi.org/10.1007/s10107-018-1309-x
Schmidt, M., Sirvent, M., Wollner, W.: The cost of not knowing enough: mixed-integer optimization with implicit Lipschitz nonlinearities. Optim. Lett. (2021). https://doi.org/10.1007/s11590-021-01827-9 (forthcoming)
Vlah, D., Šepetanc, K., Pandžic, H.: Solving bilevel optimal bidding problems using deep convolutional neural networks (2022). https://doi.org/10.48550/arXiv.2207.05825
Acknowledgements
The authors thank the DFG for their support within RTG 2126 “Algorithmic Optimization”.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Cite this article
Molan, I., Schmidt, M.: Using neural networks to solve linear bilevel problems with unknown lower level. Optim. Lett. 17, 1083–1103 (2023). https://doi.org/10.1007/s11590-022-01958-7