## 1 The big-M reformulation: convenient but error-prone

We consider linear bilevel problems of the form

\begin{aligned} \min _{x\in \mathbb {R}^n, y\in \mathbb {R}^m} \quad c^\top x+ d^\top y\quad \text {s.t.} \quad A x+ By\ge a,\ y\in \Psi (x), \end{aligned}
(1)

where $$\Psi (x)$$ denotes the set of optimal solutions of the x-parameterized linear program

\begin{aligned} \max _{y} \quad f^\top y\quad \text {s.t.} \quad D y\le b - C x, \end{aligned}
(2)

with $$c \in \mathbb {R}^n$$, $$d, f \in \mathbb {R}^m$$, $$A \in \mathbb {R}^{k \times n}$$, $$B \in \mathbb {R}^{k \times m}$$, $$a \in \mathbb {R}^k$$, $$C \in \mathbb {R}^{\ell \times n}$$, $$D \in \mathbb {R}^{\ell \times m}$$, and $$b \in \mathbb {R}^\ell$$. In this setting, the upper-level player (or leader) anticipates the optimal reaction $$y$$ of the lower-level player (or follower). The set of optimal solutions $$\Psi (x)$$ is not a singleton if the follower is indifferent for a given $$x$$. In this case, Problem (1) establishes the so-called optimistic solution, i.e., the leader may select the solution $$y\in \Psi (x)$$ that is the most favorable one for the upper-level problem; see Dempe (2002). In general, bilevel problems are intrinsically nonconvex due to their hierarchical structure and even linear bilevel problems (1) are known to be strongly NP-hard; see Hansen et al. (1992). Although specialized techniques for this problem class exist, the “best-practice” still is the well-known mixed-integer single-level reformulation that relies on big-M values; see, e.g., Baringo and Conejo (2011), Garces et al. (2009), Baringo and Conejo (2011), Wogrin et al. (2011), Kazempour et al. (2011, 2012), Kazempour and Conejo (2012), Jenabi et al. (2013), Wogrin et al. (2013), Pozo et al. (2013), Pisciella et al. (2016), Maurovich-Horvat et al. (2015), Morales et al. (2014), Jaber Valinejad (2015), and the references therein for bilevel optimization problems in the field of power systems that are tackled with the classic big-M approach.

### 1.1 The method

This single-level reformulation is derived as follows. First, the lower-level problem (2) is replaced by its necessary and sufficient Karush–Kuhn–Tucker (KKT) conditions. This yields the mathematical program with complementarity constraints (MPCC)

\begin{aligned} \min _{x,y,\lambda } \quad&c^\top x+ d^\top y\end{aligned}
(3a)
\begin{aligned} \text {s.t.} \quad&A x+ By\ge a,\quad C x+ D y\le b, \end{aligned}
(3b)
\begin{aligned}&\lambda \in \Omega _D \mathrel {{\mathop :}{=}}\{ \lambda \ge 0:D^\top \lambda = f\}, \end{aligned}
(3c)
\begin{aligned}&\lambda ^\top (b - Cx- Dy) = 0, \end{aligned}
(3d)

in which Constraint (3c) denotes dual feasibility and (3d) are the KKT complementarity conditions of the lower-level problem. Next, the complementarity conditions are replaced by the mixed-integer reformulation

\begin{aligned} b- Cx- Dy\le M_P (1 - u), \quad \lambda \le M_D u, \quad u \in \{0,1\}^\ell , \end{aligned}
(4)

with sufficiently large big-M constants $$M_P$$ and $$M_D$$. Problem (3) as well as the mixed-integer reformulation (4) have been mentioned first by Fortuny-Amat and McCarl (1981). The advantage of this approach for OR practitioners is obvious: The single-level problem (3) in which the complementarity conditions (3d) are replaced by (4) can be easily implemented and the resulting model can be solved without further ado by standard mixed-integer solvers. However, this approach has some severe issues.

On the one hand, choosing the big-M constants too small can result in suboptimal solutions of Problem (1) as shown in Pineda and Morales (2019). Their counterexample also shows that the loss of optimality can be arbitrarily large in terms of the resulting objective function values. Unfortunately, many works do not discuss the selection of the constants at all or use trial-and-error procedures without any guarantee that the derived values yield a correct reformulation; see, e.g., the references compiled in Pineda and Morales (2019). Recently, it is shown in Kleinert et al. (2019) that verifying the correctness of given big-M values is as hard as solving the original bilevel problem, which is strongly NP-hard in general. Thus, unless valid big-M constants can be derived from problem-specific knowledge as it is done, e.g., in Böttger et al. (2021); Kleinert and Schmidt (2019) or Siddiqui and Gabriel (2013), the stated mixed-integer reformulation cannot be expected to yield correct results. On the other hand, large values of $$M_P$$ and $$M_D$$ may cause numerical instabilities. In the extreme case, too large values can indeed yield “solutions” that are actually infeasible for the original bilevel problem due to the products in (4) of very large constants and binary variables that are relaxed up to a certain tolerance by mixed-integer solvers. We illustrate the impact of different values for $$M_P$$ and $$M_D$$ in the following computational study.

### 1.2 Performance and reliability evaluation

We first briefly describe the computational setup that we use throughout this work as well as the details of our evaluation procedure. All computations in this and the remaining sections are carried out on a compute cluster with Xeon E3-1240 v6 CPUs at 3.7 GHz and 32 GB RAM; see Regionales Rechenzentrum Erlangen (2020) for more details. The code used for the computational studies in this work is implemented in C++-11 and has been compiled with GCC 7.3.0. All optimization problems are solved by CPLEX 12.10.

For the evaluation of the computational results, we mainly use performance profiles according to Dolan and Moré (2002). For every test instance i, we compute ratios $$r_{i,s}= t_{i,s} / \min \{t_{i,s}: s \in S\}$$, where S is the set of solution approaches under consideration and $$t_{i,s}$$ is the running time required by approach $$s \in S$$ for instance i. Each performance profile plot in this work shows the percentage of instances (y-axis) for which the performance ratio $$r_{i,s}$$ of approach s is within a factor $$\tau \ge 1$$ (log-scaled x-axis) of the best possible ratio.

For our analysis we use the linear bilevel instances described in Kleinert and Schmidt (2020). All of these instances stem from relaxing the integrality conditions of mixed-integer bilevel instances from the literature. We denote the various instance classes along with a reference to its original mixed-integer test set and relevant characteristics in Table 1. For our analysis, we clean up this instance set in the following way. We exclude 353 instances that can be solved by any of the methods tested in this work in less than 5 s. In addition, we exclude 26 instances that cannot be solved by any of the tested methods within the time limit of 1 h. This yields a total number of 698 instances in the cleaned test set $$\mathcal {I}$$.

We are now ready to compare the big-M approach for the choices $$M=M_D=M_P\in \{10^4, 10^5, 10^6\}$$. For better readability, we refer to the three instantiations by BigM-4, BigM-5, and BigM-6. As just discussed, the big-M approach may deliver wrong “solutions” in case the value M is chosen too small or “too large”. The latter may result in infeasible points because the lower-level complementarity may in fact not be fulfilled due to numerical tolerances. In order to circumvent this behavior, we tightened the integer feasibility tolerance of CPLEX to $$10^{-9}$$. An ex-post evaluation of the complementarity conditions revealed that with this setting, all solutions fulfill every complementarity condition up to a tolerance of $$10^{-6}$$. Thus, we consider these solutions to be feasible. In contrast, too small values for M may produce suboptimal solutions. Therefore, we run the following sanity check for every instance in $$\mathcal {I}$$. Let $$F^s_i$$ be the objective value of the best feasible solution, i.e., the best upper bound, and let $$\underline{F}^s_i$$ be the best lower bound found by approach $$s \in \{\text {BigM-4, BigM-5, BigM-6}\}$$ for instance $$i\in \mathcal {I}$$. We set unavailable values to $$+\infty$$ and $$-\infty$$, respectively. Further, let $$F^*_i$$ and $$\underline{F}^*_i$$ be the best upper and lower bound found by a provably correct solution approach. We will consider this provably correct approach as a black box for the moment and discuss it in more detail in the next section. We then check for every instance $$i$$ and approach s, if

\begin{aligned} F^{s}_i\ge \underline{F}^*_i\quad \text {and} \quad \underline{F}^{s}_i\le F^*_i\end{aligned}

hold. If this is not the case, we consider instance $$i$$ as not solved by s. Out of the 698 instances in $$\mathcal {I}$$, this happened 29 times for BigM-6, 50 times for BigM-5, and 147 times for BigM-4.

Figure 1 shows two performance profiles of the running times of the methods BigM-4, BigM-5, and BigM-6. The left performance profile is based on those 377 instances in $$\mathcal {I}$$ that all three methods solve. Apparently, a lower big-$$M$$ value is beneficial w.r.t. the running time and BigM-4 dominates the other two methods. Thus, there is an incentive to choose a small value of M. However, the picture changes, if we consider the 549 instances in $$\mathcal {I}$$ that at least one of the three methods solves; see Fig. 1 (right). It can be seen that BigM-6 solves the largest number of instances among the three tested approaches. To be more specific, BigM-6 solves 524, BigM-5 solves 503, and BigM-4 solves 411 instances. The reason is, as discussed above, that for a smaller value of M, the ex-post optimality check fails more often. This results in more instances counted as not solved. Note that there is, however, no dominance across the three methods w.r.t. the solved instances. There are 18 instances that BigM-5 solves but BigM-6 does not and 21 instances that BigM-4 solves but BigM-5 does not. Consequently, although it is not the fastest method, from a practical point of view, BigM-6 is the best choice. This is in line with results of a comparison of various values for $$M$$ on a different test set in Pineda et al. (2018). We recap however that even BigM-6 produces 29 “solutions” that are indeed not optimal.

## 2 The SOS1 reformulation: a lame duck?

Another way to solve the MPCC (3) is to omit the complementarity conditions (3d) initially and then branch on them instead.

### 2.1 The method

A first sketch of this approach has been proposed by Fortuny-Amat and McCarl (1981) and a more detailed evaluation along with first numerical results can be found in Bard and Moore (1990). A modern and convenient way to apply this method is to exploit special ordered sets of type 1 (SOS1). Such sets are introduced in Beale and Tomlin (1970) and require that at most one variable of a set of variables takes a nonzero value. With this, the complementarity conditions (3d) can be rephrased as follows:

\begin{aligned} s_i = (b- Cx- Dy)_i, \quad \{s_i, \lambda _i\} \text { is SOS1}, \quad i=1,\ldots , \ell . \end{aligned}
(5)

This technique is proposed in a more general MPCC context in Siddiqui and Gabriel (2013) and used in a bilevel context in Pineda et al. (2018). We highlight that it is big-M-free and fairly easy to implement. Modern mixed-integer solvers handle SOS1 constructs by automatically reformulating it to the mixed-integer formulation (4) if provably correct bounds on $$s_i$$ and $$\lambda _i$$ are available or by branching on $$s_i$$ and $$\lambda _i$$ otherwise. In this way, the benefits of using a highly evolved mixed-integer solver can be exploited, while the correctness of the obtained solutions is guaranteed—in contrast to the approach stated in Sect. 1. However, the theoretical worst-case complexity is the same for both approaches. We label this big-M-free method as SOS1 and evaluate its performance in comparison to BigM-6 in the following.

### 2.2 Performance evaluation

All results presented in this section follow the setup described in Sect. 1.2. Figure 2 shows a performance profile of the running times of BigM-6 and SOS1 on the 566 instances that at least one of the two methods solves. It can be seen that BigM-6 is the fastest method for more than 50% of the instances (see the leftmost point of the solid curve). In addition, it solves around 90% of the instances (see the rightmost point of the solid curve), although some “solved” instances are considered as not solved due to the ex-post optimality check described in Sect. 1.2. The valid black box used in this optimality check is exactly the SOS1-based approach. According to Fig. 2, BigM-6 can be considered the distinct winner approach, which is in line with the analysis in Pineda et al. (2018). Overall, this justifies why the big-M approach equipped with a large constant M is chosen instead of SOS1 in a lot of applications of OR.

## 3 The game changer: valid inequalities

The results presented up to this point only re-iterate folklore knowledge: Although it has its downsides, the big-M approach is the method of choice in practical linear bilevel optimization.

This section aims to change this thinking. We show that adding a simple inequality to the MPCC (3) changes the results drastically. This inequality has been proposed recently in Kleinert et al. (2020) and exploits the strong-duality condition of the follower problem. We briefly summarize its derivation in the following. For any leader decision $$x$$, the dual follower problem is given by

\begin{aligned} \min _{\lambda } \quad \lambda ^\top (b-Cx) \quad \text {s.t.} \quad \lambda \in \Omega _D= \{\lambda \ge 0:D^\top \lambda = f\}. \end{aligned}

For every lower-level primal-dual feasible point $$(y,\lambda )$$, weak duality

\begin{aligned} f^\top y\le \lambda ^\top b- \lambda ^\top Cx\end{aligned}

holds. Thus, strong duality can be enforced by

\begin{aligned} f^\top y\ge \lambda ^\top b- \lambda ^\top Cx. \end{aligned}

This is a bilinear constraint with products of primal leader variables $$x$$ and dual follower variables $$\lambda$$. However, one can derive a linear valid inequality from it by replacing each term $$C_{i\cdot } x$$ with an upper bound $$C_i^+\ge C_{i\cdot } x$$; see Kleinert et al. (2020). This yields the inequality

\begin{aligned} f^\top y\ge \lambda ^\top b- \lambda ^\top C^+, \end{aligned}
(6)

in which $$C^+$$ denotes the vector of upper bounds $$C_i^+$$. The linearization of the bilinear term $$\lambda ^\top Cx$$ in (6) can be seen as a special case of a McCormick inequality McCormick (1976) as discussed in Kleinert et al. (2020). Of course, other techniques to deal with these bilinearities are possible as well such as spatial branching Horst and Tuy (2013) or other iterative approaches based on convex optimization, see, e.g., Constante-Flores et al. (2022). The bounds required in (6) can be obtained, e.g., by exploiting variable bounds on $$x$$ or by solving auxiliary linear problems

\begin{aligned} \max \quad C_{i\cdot } x\quad \text {s.t.} \quad Ax + By \ge a,\ Cx + Dy \ge b, \end{aligned}
(7)

see also Kleinert et al. (2020). This requires the joint feasible set $$\{(x,y):Ax + By \ge a,\ Cx + Dy \ge b\}$$ of the upper and lower level to be bounded, which is the case for every tested instance in our instance set $$\mathcal {I}$$. Although solving the additional auxiliary LPs (7) can be done in polynomial time in general, it might be time consuming in practice. However, it has been shown in Kleinert et al. (2020) to pay off to add Inequality (6) to the SOS1 reformulation of Problem (1). A preliminary computational analysis revealed that this strategy is also very effective for the BigM-6 approach. Consequently, 688 instances in $$\mathcal {I}$$ are solved by at least one of the two methods BigM-6-R and SOS1-R compared to 566 for the methods BigM-6 and SOS1. More precisely, BigM-6-R solves 642 instances compared to 524 instances solved by BigM-6 and SOS1-R solves 615 instances compared to 421 instances solved by SOS1. Note that the “R” denotes that the valid inequality (6) has been added at the root node. In more detail, this means that the problem solved by the BigM-6 method is Problem (3) with (3d) replaced with (4) and the additional constraint (6). The SOS1 method solves Problem (3), extended by (6), with (3d) replaced with (5).

Figure 3 shows a performance profile that compares the running times of BigM-6-R and SOS1-R on the 688 instances that at least one method solves. We observe several interesting aspects. First, the SOS1-based approach is the faster method for over 65% of the instances (see the leftmost point of the dashed curve). This is very much in contrast to the results without the valid inequality. Second, the reliability advantage of the big-M-based approach decreases significantly. BigM-6-R solves 73 instances that SOS1-R cannot solve within the time limit, but SOS1-R also solves 46 instances that BigM-6-R cannot solve.

According to Fig. 3, the only reason for choosing BigM-6-R over the SOS1-based approach is that it “solves” more instances. However, this is a very ambivalent statement, because there is always the possibility that the “solutions” obtained by big-$$M$$-based methods are simply wrong due to a too small or too large value for $$M$$. In contrast, we highlight that the SOS1-based method either provides the correct optimal solution or terminates with a suboptimal solution and a trustworthy optimality gap. In fact, if we look at the 73 instances that only BigM-6-R solves, it turns out that the objective function values of the solutions provided by the SOS1-based solver always exactly match the objective function value of the “globally optimal” solutions provided by BigM-6-R.

## 4 Practical implications for linear bilevel optimization

The results presented in this note indicate the following. In general, the big-M approach is faster and “solves” more instances, especially if the value of M can be chosen small. Thus, whenever one is able to determine valid values of M, e.g., based on problem-specific knowledge, then one should use the big-M approach. On the other hand, if one does not have such problem-specific knowledge, one really should not use the big-M approach. In this note, we showed that there is also no need anymore to use it: The SOS1 approach equipped with the discussed root-node inequalities lead to comparable results on our test set. Moreover, this extended SOS1 approach is also easy to implement and, especially in the light of the validity of the obtained results, the small amount of extra work is worth the effort. To sum up, we hope that these results will change the “best-practice” in applied linear bilevel optimization and thus will lead to more trustworthy results of bilevel models in OR applications.