
1 Introduction

Bilevel optimization deals with optimization problems that contain another optimization problem within their constraints. Two decision-makers each seek their own optimal solution in this hierarchical, nested system. The upper-level problem is the first problem, and its decision-maker is called the leader. The lower-level problem forms a constraint in the leader’s problem, and its decision-maker is called the follower. The leader knows the follower’s objective and constraints, but the follower may have no knowledge of the leader’s. The decision-makers’ objectives are often in conflict, though they may also be cooperative. During the optimization process the leader acts first. The follower takes that decision as a parameter and tries to find the best reaction. The follower’s reaction in turn affects the leader’s decision, because the leader chooses knowing how the follower will react.

Bilevel optimization problems occur in many practical applications including transportation, management, environmental economics, engineering and design [43]. They also occur in machine learning: signal processing, meta-learning, hyperparameter optimization, reinforcement learning and neural architecture search can all be modelled as bilevel optimization problems [21]. However, a lack of efficient solution methods has prevented the uptake of bilevel optimization.

The aim of this paper is to propose a new approach based on Bayesian optimization (BO) using multiple acquisition functions (MACBP) to improve efficiency (defined in terms of function evaluations). BO is a surrogate-based method for optimizing black-box functions that are expensive to evaluate [16], making it a useful approach to solving bilevel problems. An example is the BOBP algorithm [23]. BOBP uses one lower confidence bound (LCB) acquisition function and obtains one decision point at a time. We propose using more than one acquisition function to improve the optimization process by making a wiser choice of acquisition points. Multiple acquisition functions have been used in BO, for example in the MACE algorithm for optimizing analog circuit design [33]. However, to the best of our knowledge no work has been done in this area for solving bilevel optimization problems.

Our contributions are twofold:

  • We use multiple acquisition functions, as no single acquisition function is appropriate for every problem [18]. We solve the resulting multi-objective optimization problem with evolutionary techniques, and select new points on the Pareto-front solution set.

  • We show empirically how using multiple acquisition functions affects optimization performance.

The rest of the paper is organised as follows. Background is provided in Sect. 2. The preliminaries for general bilevel optimization problems and BO are given in Sect. 3. The proposed method and algorithm details are explained in Sect. 4. In Sect. 5 the experimental setup is described. Finally, Sect. 6 concludes the paper and proposes future work.

2 Background

Bilevel optimization problems have been described in two areas. In game theory, von Stackelberg [50] proposed descriptive models of decision behaviour and built game-theoretic equilibria. In mathematical programming, they appear as problems containing a nested lower-level optimization problem as a constraint of an upper-level optimization problem [8]. The hierarchical structure of bilevel problems can cause difficulties such as non-convexity and having no relation between instances. Bilevel optimization is known to be strongly NP-hard [17].

A considerable number of exact approaches have been applied to bilevel problems. Karush-Kuhn-Tucker conditions [3] can be used to reformulate a bilevel problem as a single-level problem. Penalty functions compute the stationary points and local optima. Vertex enumeration has been used with a version of the Simplex method [6]. Gradient information for the follower problem can be extracted for use by the leader objective function. For integer and mixed-integer bilevel problems, reformulation [14], branch-and-bound [4] and parametric programming approaches have been applied [27].

Because exact methods are inefficient on complex bilevel problems, several kinds of meta-heuristics have been applied to them in the literature. Four categories are identified in [53]: the nested sequential approach [25], the single-level transformation approach, the multi-objective approach [41] and the co-evolutionary approach [31]. An algorithm based on a human evolutionary model for non-linear bilevel problems [34] and the Bilevel Evolutionary Algorithm based on Quadratic Approximations (BLEAQ) [45] have also been proposed. BLEAQ attempts to reduce the number of follower optimizations. The algorithm approximates the inducible region through the feasible region of the bilevel problem. The authors of [40] consider a single optimization problem at both levels and propose the Sequential Averaging Method (SAM) algorithm. Other recent works [32, 42] use a truncated back-propagation approach to approximate the (stochastic) gradient of the upper-level problem: they model an optimization algorithm that solves the lower-level problem as a dynamical system, and use it in place of the lower-level optimal solution. Another work [19] develops a two-timescale stochastic approximation algorithm (TTSA) for solving a bilevel problem, assuming that the follower problem is unconstrained and strongly convex and that the leader’s objective function is smooth.

Many practical problems in the field of economics can be modelled and solved as Stackelberg games [46, 47], including principal agency problems and policy decisions. Hierarchical decision-making processes in management [2, 51] and in engineering and optimal structure design are other practical examples [24, 48]. Network design and the toll setting problem are the most popular applications in the field of transportation [9, 11, 35]. Finding optimal chemical equilibria, planning the pre-positioning of defensive missile interceptors to counter an attacking threat, and interdicting nuclear weapons are other applications [10]. Inverse optimal control problems are bilevel optimization problems by nature [22, 37, 52]. There are also many applications in robotics, computer vision, communication theory and other fields. In the machine learning community, bilevel optimization has recently received significant attention and has become an important framework in applications. Some interesting topics are meta-learning [5, 15, 39], hyperparameter optimization [13, 42], reinforcement learning [19, 26] and signal processing [29].

3 Preliminaries

The description of the MACBP algorithm will be divided into three parts. Firstly, we explain bilevel programming problems and their structure. Secondly, we discuss Bayesian optimization (BO) and Gaussian processes (GP). Finally, we propose the MACBP algorithm for solving bilevel optimization problems.

3.1 Bilevel Optimization Problems

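For the upper-level objective function \(F:\mathbb {R}^{n}\times \mathbb {R}^{m}\rightarrow \mathbb {R}\) and lower-level objective function \(f:\mathbb {R}^{n}\times \mathbb {R}^{m}\rightarrow \mathbb {R}\), the bilevel optimization problem can be defined as

$$\begin{aligned} \begin{aligned} \underset{\textbf{x}_u \in X_U,\, \textbf{x}_l \in X_L}{\text {minimize}}\;\;&F(\textbf{x}_u,\textbf{x}_l) \\ \text {subject to}\;\;&\textbf{x}_l \in \underset{\textbf{x}_l \in X_L}{\arg \min }\,\{f(\textbf{x}_u,\textbf{x}_l) : g_j(\textbf{x}_u,\textbf{x}_l)\le 0,\; j=1,\dots ,J\}, \\&G_k(\textbf{x}_u,\textbf{x}_l)\le 0,\; k=1,\dots ,K \end{aligned} \end{aligned}$$
(1)

where \(\textbf{x}_u \in X_U\) and \(\textbf{x}_l \in X_L\) are the upper-level and lower-level decision variables with their decision spaces, and \(G_k\), \(g_j\) are the upper-level and lower-level constraints.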

Because the lower-level decision maker depends on the upper-level variables, for every decision \(\textbf{x}_u\) there is a follower-optimal decision \(\textbf{x}_{l}^{*}\). In bilevel optimization, the decision pair \(\textbf{x}^{*}=(\textbf{x}_{u}^{*},\textbf{x}_{l}^{*})\) is feasible for the upper level only if it satisfies all the upper-level constraints and the vector \(\textbf{x}_{l}^{*}\) is an optimal solution to the lower-level problem with the upper-level decision as a parameter.
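As an illustration of this nested structure, the following minimal sketch solves a toy bilevel problem by re-solving the follower problem for every leader decision. The toy objectives `leader_F` and `follower_f`, the search interval and the golden-section solver are assumptions made for this example only; they are not taken from the benchmark problems used later.

```python
def golden_section_min(func, lo, hi, tol=1e-8):
    """Minimise a unimodal 1-D function on [lo, hi] by golden-section search."""
    inv_phi = (5 ** 0.5 - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while abs(b - a) > tol:
        if func(c) < func(d):
            b = d
        else:
            a = c
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2


def leader_F(xu, xl):
    # Hypothetical upper-level objective F(x_u, x_l).
    return (xu - 1.0) ** 2 + xl ** 2


def follower_f(xu, xl):
    # Hypothetical lower-level objective f(x_u, x_l); its minimiser is x_l = x_u.
    return (xl - xu) ** 2


def follower_reaction(xu):
    # The follower treats x_u as a fixed parameter and optimises over x_l.
    return golden_section_min(lambda xl: follower_f(xu, xl), -5.0, 5.0, tol=1e-10)


def leader_value(xu):
    # The leader is evaluated at the follower's optimal reaction.
    return leader_F(xu, follower_reaction(xu))


xu_star = golden_section_min(leader_value, -5.0, 5.0, tol=1e-6)
xl_star = follower_reaction(xu_star)
print(xu_star, xl_star, leader_F(xu_star, xl_star))
```

For this toy problem the follower always replies with \(x_l = x_u\), so the leader effectively minimises \((x_u-1)^2 + x_u^2\), whose minimiser is \(x_u = 0.5\). Every leader evaluation costs one full follower optimization, which is what makes the nested approach expensive.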

3.2 Bayesian Optimization and Gaussian Process

BO is a method for optimizing expensive-to-evaluate black-box functions. Its two key components are a probabilistic surrogate model and an acquisition function. The surrogate model provides predictions and their uncertainties. BO commonly uses a GP [49] as the surrogate model, to obtain a posterior distribution \(\mathbb {P}(\textbf{f}| D )\) over the objective function \(\textbf{f}\) given the observed data \( D =\{(\textbf{x}_{i},\textbf{y}_{i})\}_{i=1}^{n}\). An acquisition function uses the posterior distribution to explore the search space, so the surrogate model is assisted by an acquisition function in choosing the next candidate or set of candidates \( X _{ cand } = \{\textbf{x}_{i}\}_{i=1}^{q}\). Although the objective function is expensive to evaluate, the surrogate-based acquisition function is not, so it can be optimized much more easily than the true function to yield \( X _{ cand }\).

Let us assume that we have a collection of points \(\{x_{1},\dots ,x_{n}\}\in \mathbb {R}^{d}\) and the objective function values at these points \(\{f(x_{1}),\dots ,f(x_{n})\}\). After we observe n points, the mean vector is obtained by evaluating a mean function \(\mu _{0}\) at each point \(x_{i}\), and the covariance matrix by evaluating a covariance function or kernel \(\varSigma _{0}\) at each pair of points \(x_{i},x_{j}\). The resulting prior distribution on \(\{f(x_{1}),\dots ,f(x_{n})\}\) is defined by

$$\begin{aligned} f(x_{1:n}) \sim N (\mu _{0}(x_{1:n}),\varSigma _{0}(x_{1:n},x_{1:n})) \end{aligned}$$
(2)

Let us suppose we wish to find a value of \(f(X_{cand})\) at some new candidate point \(X_{cand}\). For this purpose, the prior over \(\{f(x_{1:n}),f(X_{cand})\}\) is given by (2). Then we can compute the distribution of \(f(X_{cand})\) given the observations

$$\begin{aligned} f(X_{cand}) | f(x_{1:n}) \sim N (\mu _{n}(X_{cand}),\sigma _{n}^{2}(X_{cand})) \end{aligned}$$
(3)
$$\begin{aligned} \begin{aligned} \mu _{n}(X_{cand}) ={} \varSigma _{0}(X_{cand},x_{1:n})\varSigma _{0}(x_{1:n},x_{1:n})^{-1}(f(x_{1:n})-\mu _{0}(x_{1:n}))+\mu _{0}(X_{cand}) \end{aligned} \end{aligned}$$
(4)
$$\begin{aligned} {\begin{aligned} \sigma _{n}^{2}(X_{cand}) = \varSigma _{0}(X_{cand},X_{cand}) - \varSigma _{0}(X_{cand},x_{1:n})\varSigma _{0}(x_{1:n},x_{1:n})^{-1}\varSigma _{0}(x_{1:n},X_{cand}) \end{aligned}}\end{aligned}$$
(5)

This conditional distribution is called the posterior probability distribution in Bayesian statistics. It is the basis on which Bayesian optimization chooses the next point to evaluate.
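The posterior mean and variance can be computed directly from the prior mean and kernel. The sketch below is a minimal one-dimensional illustration using only the Python standard library; the zero prior mean, the squared-exponential kernel `rbf`, the small `jitter` term and the helper names are assumptions made for this example and are not the configuration used in the experiments.

```python
import math


def rbf(a, b, length=1.0):
    # Assumed prior covariance: squared-exponential kernel with unit variance.
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def gp_posterior(x_train, y_train, x_new, jitter=1e-10):
    """Posterior mean and variance at x_new, assuming a zero prior mean."""
    n = len(x_train)
    K = [[rbf(x_train[i], x_train[j]) + (jitter if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [rbf(x_new, xi) for xi in x_train]
    alpha = solve(K, y_train)   # K^{-1} (f(x_{1:n}) - mu_0), with mu_0 = 0
    v = solve(K, k_star)        # K^{-1} k(x_{1:n}, x_new)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_new, x_new) - sum(k * vi for k, vi in zip(k_star, v))
    return mean, max(var, 0.0)  # clip tiny negative values from round-off


x_obs = [0.0, 1.0, 2.0]
y_obs = [0.0, 1.0, 4.0]
print(gp_posterior(x_obs, y_obs, 1.0))   # at an observed point
print(gp_posterior(x_obs, y_obs, 10.0))  # far from all observations
```

At an observed point the posterior mean reproduces the observation and the variance collapses towards zero, while far from the data the posterior falls back to the prior mean and variance.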

Acquisition functions guide the search towards a promising next point, and they balance exploration and exploitation. Several acquisition functions have been developed over the years, such as probability of improvement (PI), expected improvement (EI) and upper confidence bound (UCB).

Probability of Improvement. The PI acquisition function measures the probability that an arbitrary x improves on the current best. Given the minimum objective function value \(\tau \) in the data set, the formulation is as follows [30]:

$$\begin{aligned} PI(x) = \varPhi (\lambda ) \end{aligned}$$
(6)

where \(\varPhi (\lambda )\) is the cumulative distribution function of the standard normal distribution and \(\lambda = (\tau - \mu (x))/(\sigma (x))\).

Expected Improvement. EI considers not only whether an observation at x improves on the current best, but also the magnitude of that improvement. The corresponding formulation can be expressed as [36]:

$$\begin{aligned} EI(x) = \sigma (x)(\lambda \varPhi (\lambda )+\phi (\lambda )) \end{aligned}$$
(7)

where \(\phi (\cdot )\) is the probability density function of the standard normal distribution and \(\lambda = (\tau - \mu (x))/(\sigma (x))\).

Upper Confidence Bound. Unlike EI and PI, this is not an improvement-based strategy. It tries to guide the search from an optimistic perspective. The formulation is:

$$\begin{aligned} UCB(x) = \mu (x) + \beta \sigma (x) \end{aligned}$$
(8)

where \(\beta \) is a parameter that represents the exploration-exploitation trade-off. We fix \(\beta =0.1\).
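The three acquisition functions in (6), (7) and (8) are simple closed-form expressions of the posterior mean and standard deviation. The following sketch evaluates them with the Python standard library; the function names are chosen for this example, and \(\sigma (x) > 0\) is assumed.

```python
import math


def norm_pdf(z):
    # Standard normal probability density function.
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    # Standard normal cumulative distribution function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_improvement(mu, sigma, tau):
    # Eq. (6): PI(x) = Phi(lambda), lambda = (tau - mu(x)) / sigma(x).
    lam = (tau - mu) / sigma
    return norm_cdf(lam)


def expected_improvement(mu, sigma, tau):
    # Eq. (7): EI(x) = sigma(x) * (lambda * Phi(lambda) + phi(lambda)).
    lam = (tau - mu) / sigma
    return sigma * (lam * norm_cdf(lam) + norm_pdf(lam))


def upper_confidence_bound(mu, sigma, beta=0.1):
    # Eq. (8): UCB(x) = mu(x) + beta * sigma(x), with beta fixed to 0.1.
    return mu + beta * sigma


# A point whose posterior mean equals the current best tau.
print(prob_improvement(1.0, 1.0, 1.0))
print(expected_improvement(1.0, 1.0, 1.0))
print(upper_confidence_bound(1.0, 2.0))
```

When the posterior mean equals the current best, \(\lambda = 0\), so PI is one half and EI reduces to \(\sigma (x)\phi (0)\).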

Algorithm 1. The MACBP algorithm for upper-level optimization.

4 Proposed Method

Bilevel problems have two levels of optimization tasks, such that the lower-level problem is a constraint of the upper-level problem. In general bilevel problems, the follower depends on the leader’s decision \(x_u\). The leader has no control over the follower’s decision \(x_l\). For every leader decision there is an optimal follower decision, which can be called the reaction. Because the follower problem is a parametric optimization problem that depends on the leader decision \(x_u\), it is very time-consuming to adopt a nested strategy that solves both levels sequentially. In the continuous domain, the computational cost is very high. During the optimization process, it is therefore important to choose the next leader decision \(x_u\) wisely in order to make the process faster. For this purpose we present the proposed algorithm, which we call MACBP, for solving bilevel problems by BO with multiple acquisition functions.

Problem Statement. Let us assume that we have an expensive black-box function that takes as input leader decisions \(x_u \in X_u\) from the leader decision space and follower decisions \(x_l \in X_l\) coming from the follower decision maker. The function returns a scalar fitness score:

$$\begin{aligned} F(x_u,x_l) : X_u \times X_l \rightarrow \mathbb {R} \end{aligned}$$
(9)

Given a budget of N, the leader makes a decision and the follower makes its decision accordingly. In every iteration the leader observes how the follower reacts to its decisions, and uses this information to choose the next leader decision so as to optimize the fitness score.

Algorithm Description. First we discuss fitting the decision data to the Gaussian process model. After observing n decision data \(\{(x_u^{i},y^{i})\}_{i=1}^{n}\), where \(y^{i} = F(x_u^{i},x_l^{i})\), let \(\hat{X}^n = (x_u^1,\dots ,x_u^n)\) and \(Y^n = (y^1,\dots ,y^n)\). We define the Gaussian process by a prior mean \(\mu (x_u)\) and a prior covariance function \(k(x_u,x_u^{'})\), and let \(K = k(\hat{X}^n,\hat{X}^n) \in \mathbb {R}^{n \times n}\). The posterior mean and covariance are then given by:

$$\begin{aligned} \begin{aligned} \mu ^{n}(x_u) = \mu (x_u) + k(x_u,\hat{X}^n)(K+\sigma _0^{2}I)^{-1}(Y^n-\mu (\hat{X}^n)) \end{aligned} \end{aligned}$$
(10)
$$\begin{aligned} \begin{aligned} k^{n}(x_u,x_u^{'}) = k(x_u,x_u^{'})-k(x_u,\hat{X}^n)(K+\sigma _0^{2}I)^{-1}k(\hat{X}^n,x_u^{'}) \end{aligned} \end{aligned}$$
(11)

After fitting the data to the model, we choose the next leader decision \(x_u^{n+1}\). Once we have found the follower’s optimal reaction \(x_l^{n+1}\) and the leader’s fitness score \(F(x_u^{n+1},x_l^{n+1})=y^{n+1}\), we update the Gaussian process model with the new decision data \((x_u^{n+1},y^{n+1})\). The details of the MACBP algorithm for upper-level optimization are given in Algorithm 1.
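The overall structure of this upper-level loop can be sketched as follows. This is only a schematic skeleton under stated assumptions: the callables `leader_F`, `solve_follower` and `propose` are hypothetical placeholders, and in particular `propose` stands in for the steps of fitting the GP and optimizing the acquisition functions, which are replaced here by a simple random perturbation so that the example stays self-contained.

```python
import random


def upper_level_loop(leader_F, solve_follower, propose, initial_xu, n_iter):
    """Collect (x_u, y) decision data by repeatedly proposing leader decisions."""
    data = []

    def evaluate(xu):
        xl = solve_follower(xu)               # follower's optimal reaction to x_u
        data.append((xu, leader_F(xu, xl)))   # store the decision data (x_u, y)

    for xu in initial_xu:                     # initial sampling phase
        evaluate(xu)
    for _ in range(n_iter):                   # sequential phase, one point per iteration
        evaluate(propose(data))
    best = min(data, key=lambda pair: pair[1])
    return best, data


# Toy usage with hypothetical objectives; the follower's reaction is x_l = x_u.
rng = random.Random(0)


def toy_leader(xu, xl):
    return (xu - 1.0) ** 2 + xl ** 2


def toy_follower(xu):
    return xu


def toy_propose(data):
    # Stand-in for "fit the surrogate, then optimize the acquisition functions".
    best_xu = min(data, key=lambda pair: pair[1])[0]
    return best_xu + rng.gauss(0.0, 0.5)


best, history = upper_level_loop(toy_leader, toy_follower, toy_propose,
                                 initial_xu=[-2.0, 0.0, 2.0], n_iter=20)
print(best, len(history))
```

Only the leader objective is treated as expensive here, so the evaluation budget is counted in calls to `leader_F`, one per initial sample and one per iteration.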

Fig. 1. The log optimality gap for the leader’s objective in SMD benchmark problems.

4.1 Multi-objective Optimization

Multi-objective optimization problems have several objectives to optimize simultaneously. They are formulated as

$$\begin{aligned} \begin{aligned} \underset{\textbf{x}\in X}{\text {minimize}}\;\;\mathbf {f(x)} = (f_{1}(\textbf{x}),\dots ,f_{d}(\textbf{x})) \end{aligned} \end{aligned}$$
(12)

for a vector-valued function \(\mathbf {f(x)} : \mathbb {R}\rightarrow \mathbb {R}^{d}\) and \( X \subseteq \mathbb {R}\). It is hard, and commonly impossible, to find a single optimal solution, as there may be conflicts between the objectives. The main goal for these problems is therefore to approximate the Pareto-front. We say that \(\mathbf {f(x)}\) dominates another solution \(\mathbf {f(x')}\), written \(\mathbf {f(x)} \succ \mathbf {f(x')}\), if \(f_{i}(\textbf{x})\succeq f_{i}(\textbf{x}')\) for all \(i=1,2,\dots ,d\) and there exists \(i'\in \{1,2,\dots ,d\}\) such that \(f_{i'}(\textbf{x}) \succ f_{i'}(\textbf{x}')\), where \(\succeq \) means "is at least as good as" and \(\succ \) means "is strictly better than". We can then express the Pareto-optimal solutions by \(P^{*} = \{\mathbf {f(x)}\) s.t. \(\not \exists \mathbf {x'}\in X : \mathbf {f(x')} \succ \mathbf {f(x)}\}\) and \(X^{*} = \{ \textbf{x} \in X\) s.t. \(\mathbf {f(x)} \in P^{*} \}\). A solution is Pareto-optimal if it is not dominated by any other point. The Pareto-set \(X^{*}\) is the set of all Pareto-optimal points, and the set \(P^{*}\) of their objective values is called the Pareto-front. There are many multi-objective optimization algorithms, such as the non-dominated sorting based genetic algorithm (NSGA-II) [12], the multi-objective evolutionary algorithm based on decomposition (MOEA/D) [55] and multi-objective optimization based on differential evolution (DEMO) [54].
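For a finite set of objective vectors, the dominance relation and the non-dominated set can be computed by direct pairwise comparison. The sketch below assumes minimization of every objective, so "better" means "smaller"; the function names and the sample points are chosen for illustration only.

```python
def dominates(a, b):
    """True if objective vector a dominates b when every objective is minimized."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


points = [(1, 4), (2, 2), (4, 1), (3, 3), (2, 5)]
print(pareto_front(points))  # [(1, 4), (2, 2), (4, 1)]
```

Here \((3,3)\) is dominated by \((2,2)\) and \((2,5)\) is dominated by \((1,4)\), while the three remaining points are mutually incomparable. This brute-force filter costs a quadratic number of comparisons, which is acceptable for small sets; evolutionary methods such as NSGA-II are used when the front of a continuous problem has to be approximated.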

Table 1. The summary of SMD benchmark problems

4.2 Multi-objective Acquisition Function in Bayesian Optimization

Different acquisition functions have different characteristics according to their structure and point selection strategy. Improvement-based strategies rely on the best selection so far at each iteration. For example, the PI function value decreases when the difference between the mean function and the best objective value so far is below zero, \(\mu (x)-F^{*}(x) < 0\). The EI function value at sampled points is always worse than the EI values at pending decision points. Uncertainty-based acquisition functions, for instance UCB, increase as \(\sigma (x)\) increases.

Because of the different selection strategies explained above, we use the multi-objective optimization method NSGA-II in this work to find the best trade-off between acquisition functions, that is, the Pareto-front of the acquisition functions. We then select the leader’s next decision point during the bilevel optimization process from this Pareto-front. In every iteration the multi-objective optimization problem constructed is:

$$\begin{aligned} \begin{aligned} \underset{\textbf{x}\in X}{\text {minimize}}\;\; \bigg \{-UCB(x), -PI(x), -EI(x)\bigg \} \end{aligned} \end{aligned}$$
(13)

After finding the Pareto-front of the multi-objective optimization problem (13), we select a point at random from the Pareto-optimal decision set.
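The selection step can be illustrated on a small finite candidate set. The sketch below is not the implementation used in this work: it replaces NSGA-II by a brute-force non-dominated filter over given candidates, and the candidate triples \((x, \mu (x), \sigma (x))\) and the value of \(\tau \) are made-up numbers for illustration.

```python
import math
import random


def acquisitions(mu, sigma, tau, beta=0.1):
    # Returns (UCB, PI, EI) as defined in Eqs. (8), (6) and (7); sigma > 0 assumed.
    lam = (tau - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(lam / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * lam * lam) / math.sqrt(2.0 * math.pi)
    return (mu + beta * sigma, cdf, sigma * (lam * cdf + pdf))


def select_next(candidates, tau, rng):
    """candidates: list of (x, mu, sigma). Pick x at random from the Pareto-optimal set."""
    # Objectives of problem (13): minimize (-UCB, -PI, -EI).
    objs = [tuple(-a for a in acquisitions(mu, s, tau)) for _, mu, s in candidates]

    def dominated(i):
        return any(all(q <= p for q, p in zip(objs[j], objs[i])) and objs[j] != objs[i]
                   for j in range(len(objs)))

    front = [candidates[i][0] for i in range(len(objs)) if not dominated(i)]
    return rng.choice(front), front


# Hypothetical posterior summaries at three candidate leader decisions.
candidates = [(0.0, 1.0, 0.5), (1.0, 0.2, 0.5), (2.0, 1.0, 0.1)]
chosen, front = select_next(candidates, tau=0.5, rng=random.Random(0))
print(front, chosen)
```

In this example the third candidate is worse than the first on all three acquisition values and is therefore dropped, whereas the first two trade a higher UCB against higher PI and EI and both remain on the front.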

5 Experiments

We evaluate the MACBP algorithm using two experiments. In the first, we choose a single point at each iteration with \(N_{iter}=50\) iterations and \(N_{init} = 20\) initial random samples, and compare the results with those of the three single acquisition functions EI, PI and UCB. In the second, we run the algorithm with the stopping criterion \(d < 10^{-5}\), where d is the difference between the result and the optimum value of the function, and compare our proposed method with other algorithms in terms of function evaluations in Table 2 and in terms of accuracy in Table 3. In both experiments we run the algorithm in sequential mode and use the Matern52 kernel for the GP. The parameters of the acquisition functions are given in Sect. 3.2. The first experiment is repeated 31 times to average out random fluctuations, and the optimality gap is presented on a log scale in Fig. 1.
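The two quantities used in these experiments can be written down in a few lines. The sketch below assumes that the gap d is the absolute difference between the best value found and the known optimum, and that the log scale is base 10; neither choice is stated explicitly above, and the helper names are introduced for this example.

```python
import math


def log_optimality_gap(best_value, optimum):
    # Assumed definition: log10 of the absolute gap to the known optimum.
    gap = abs(best_value - optimum)
    return math.log10(gap) if gap > 0 else float("-inf")


def should_stop(best_value, optimum, tol=1e-5):
    # Stopping criterion of the second experiment: d < 10^-5.
    return abs(best_value - optimum) < tol


# Upper-level evaluation budget of the first experiment.
N_ITER, N_INIT = 50, 20
total_upper_evaluations = N_ITER + N_INIT

print(log_optimality_gap(0.01, 0.0))
print(should_stop(1e-6, 0.0), should_stop(1e-3, 0.0))
print(total_upper_evaluations)
```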

The optimization is run on a single core of a 1.4 GHz quad-core i5 with 8 GB of 2133 MHz LPDDR3 RAM. Bayesian optimization is implemented in Python using BoTorch [38]; the SLSQP algorithm [28] is used for lower-level optimization, and the NSGA-II algorithm from the PyMOO library [7] is used for multi-objective optimization.

Table 2. Upper-level function evaluations for the proposed MACBP algorithm and other known algorithms for SMD1-SMD6
Table 3. Upper-level accuracy for the proposed MACBP algorithm and other known algorithms for SMD1-SMD6.

5.1 SMD Problems

We evaluated the MACBP algorithm on six standard benchmark problems proposed in [44]. These are called the SMD test problems; they are unconstrained and high-dimensional with controllable complexity, and they are scalable in terms of the number of decision variables. Each problem in the benchmark represents a different difficulty level in terms of convergence, complexity of interaction and lower-level multi-modality, as described in [44]. Table 1 provides details on the problems. For all functions we used 2D decision variables. The total number of function evaluations for the leader’s objective is \(N_{iter} + N_{init}\).

5.2 Results

Although bilevel optimization problems involve both the leader’s and the follower’s optimization problems, we consider only the leader’s performance, as it is the only one we model as an expensive black-box function. The optimality gap between the true optimal points and the approximated points over 50 iterations is plotted on a log scale in Fig. 1. As can be seen in Fig. 1, the proposed algorithm for bilevel optimization is competitive with the sequential Bayesian method at upper-level optimization with the UCB, EI and PI acquisition functions. We fixed the number of iterations in the first experiment to see how using multiple acquisition functions affects performance compared with single acquisition functions. As Fig. 1 shows, the multi-objective acquisition approach gave better results than EI, PI and UCB alone for SMD1, SMD3 and SMD6: at the end of the optimization it reached a point closer to the optimal value for these problems. The proposed algorithm performed better than UCB and PI on the SMD2 problem, but EI reached a point closer to the optimum at the end. For SMD4, PI reached the best point at the end of the iterations, with the proposed algorithm a close second.

In the second experiment, we can see from Table 3 that the MACBP algorithm reached better results for SMD4, SMD5 and SMD6 than the compared algorithms. For SMD1, we get closer to the optimal solution than the NBLE and BIDE algorithms. For SMD2, we reached better results than the NBLE and BLEAQ algorithms. Compared with BIDE and BLEAQ, the proposed algorithm gets better results for SMD3. In terms of function evaluations, Table 2 shows that our MACBP algorithm requires significantly fewer function evaluations than the other state-of-the-art algorithms in the literature.

6 Conclusion

In this paper, we proposed the MACBP algorithm, a Bayesian approach via multi-objective acquisition functions for bilevel optimization problems. We treated the leader’s objective as an expensive black-box function. We used multiple acquisition functions during the bilevel optimization process, and made our selection from a Pareto-front solution set in each iteration. We selected six popular SMD benchmark problems for the experiments. We compared our experimental results with a classic sequential setting of Bayesian optimization using each acquisition function individually. We also compared our results in terms of the function evaluations required at the upper level. The results show that the proposed MACBP algorithm is competitive with the existing well-known algorithms compared in the paper for solving bilevel optimization problems.