Abstract
In this work, we introduce new direct-search schemes for the solution of bilevel optimization (BO) problems. Our methods rely on a fixed-accuracy blackbox oracle for the lower-level problem, and deal with both smooth and potentially nonsmooth true objectives. We thus give the first analysis in the literature of direct-search schemes in these settings, providing convergence guarantees to approximate stationary points, as well as complexity bounds in the smooth case. We also propose the first adaptation of mesh adaptive direct-search schemes for BO. Preliminary numerical results on a standard set of bilevel optimization problems show the effectiveness of our new approaches.
1 Introduction
Bilevel optimization (see, e.g., [6, 9, 12, 13, 25] and references therein for a complete overview of the topic) has been the subject of increasing interest, thanks to its applications to hyperparameter tuning for machine learning algorithms and meta-learning (see, e.g., [17] and references therein). In this work, we are interested in the following bilevel optimization problem
wherein we assume that the upper-level function \(f(x,y):\mathbb {R}^{n_x}\times \mathbb {R}^{n_y}\rightarrow \mathbb {R}\) is continuous, and \(g(x,z):\mathbb {R}^{n_x}\times \mathbb {R}^{n_y}\rightarrow \mathbb {R}\) is such that the lower-level problem \(\min _{z \in Z} g(x,z)\) has a unique solution y(x) for every \(x \in \mathbb {R}^{n_x}\), with \(Z\subset \mathbb {R}^{n_y}\). Uniqueness of the lower-level solution, also known as the Low-Level Singleton (LLS) assumption, is quite common in many real-world applications, such as hyperparameter optimization, meta-learning, pruning, and semi-supervised learning on multilayer graphs (see, e.g., [17, 21, 42, 45]). While for simplicity we focus on the setting described above, it is important to point out that our analysis still holds, for a specific class of BO problems, even when dropping the LLS assumption (see Remark 2.1).
The algorithms we study here are derivative-free optimization (DFO) methods, which do not use derivatives of the upper-level objective function, but only the objective value itself. Importantly, in this setting we also assume the availability of some blackbox oracle generating an approximation \({\tilde{y}}(x)\) of y(x) for any given \(x \in \mathbb {R}^{n_x}\). Among DFO methods, we are interested in particular in direct-search methods (see, e.g., [2, 27]), which sample the objective at suitably chosen tentative points without building a model for it. These algorithmic schemes allow us to prove convergence guarantees under very mild assumptions on our bilevel optimization problem.
1.1 Previous work
Several gradient-based methods have been proposed in the literature to tackle bilevel optimization problems. These methods usually require the computation of the gradient of the true objective, called the “hypergradient”, and rely on the LLS and suitable smoothness assumptions (see, e.g., [17, 18, 20, 24, 29] and references therein). In another line of research, some asymptotic results based on relaxations of the LLS assumption were also analyzed (see, e.g., [30,31,32] and references therein). Calculating the hypergradient, however, can be a notoriously challenging and time-consuming task. It indeed requires handling \(\nabla _x y(x)\), which in turn involves the calculation of the Hessian matrix of the function g via implicit differentiation. In some contexts, the hypergradient might not be available at all due to the blackbox nature of the functions describing the problem. These are the reasons why the development of new and efficient zeroth-order/derivative-free approaches is crucial in the BO context.
As for derivative-free approaches, classic direct-search (see, e.g., [2, 10, 27]) and trust-region methods (see, e.g., [10, 27]) have been applied to BO in [11, 15, 37, 44]. In [37], a direct-search method for BO assuming the availability of the true objective is described. More specifically, their analysis does not allow for approximation errors in the solution of the lower-level problem, and relies on suitable assumptions making the true objective directionally differentiable. In [44], the analysis from [37] is extended to inexact lower-level solutions with a stepsize-based adaptive error. In [11], an algorithm applying trust-region methods both to the lower level and to the true objective is described, with an adaptive estimation error for the true objective depending on the trust-region radius; in that work, a strategy to recycle function evaluations for the lower-level problem is described as well. In [15], the analysis of another trust-region method with adaptive error for bilevel optimization is carried out. The authors report worst-case complexity estimates both in terms of upper-level iterations and of computational work from the lower-level problem, when considering a strongly convex lower-level problem solved by a suitable gradient descent approach. In the more recent works [8, 35], zeroth-order methods based on smoothing strategies [39] are analyzed. These studies, drawing inspiration from the complexity results provided in [22] for zeroth-order methods that handle nonsmooth and nonconvex objectives, offer complexity estimates tailored to the BO setting. They rely on the assumptions that the lower-level problem can be solved with fixed precision, and that gradient descent on the lower level converges either polynomially or exponentially, respectively.
Finally, min-max DFO problems (which can be seen as a particular instance of BO) have also recently been tackled in the literature [1, 36]. Relevant to our work are also direct-search methods in the presence of noise. While previous works analyze direct-search methods with adaptive deterministic [34] and stochastic noise [1, 4, 41], we are not aware of previous analyses of direct-search methods with bounded but non-adaptive noise.
1.2 Contributions
Our contributions can be summarized as follows.
-
We define and analyze the first inexact direct-search schemes for BO problems with general potentially nonsmooth true objectives. Those methods indeed never require exact lower-level problem solutions, but instead assume access to approximate solutions with fixed accuracy, a reasonable assumption in practice. We therefore operate in a different setting than the one considered in previous works on direct-search for BO, where true objectives are directionally differentiable [37, 44] and lower-level solutions are exact [37] or require an adaptive precision [44].
-
We analyze mesh based direct-search schemes for BO, extending in particular the classic mesh adaptive direct-search (MADS) scheme from [3]. This is, to the best of our knowledge, the first analysis of this scheme that considers both inexact objective evaluation and the simple decrease condition for new iterates used originally in [3].
-
We give the first convergence results for direct-search schemes with bounded and non-adaptive noise on the objective.
-
We give the first convergence guarantees to \((\delta , \epsilon )\)-Goldstein stationary points for direct-search schemes applied to general nonsmooth objectives. With respect to classic analyses considering Clarke stationary points (see, e.g., [5]), these are the first results for direct-search schemes involving a quantitative measure of approximate nonsmooth stationarity.
2 Background and preliminaries
We now introduce the main assumptions considered in the paper, along with a set of helpful preliminary results that will support the subsequent convergence theory. As anticipated in the introduction, we will always assume the existence of a unique minimizer y(x) for the lower-level problem, i.e., that the LLS assumption holds.
Assumption 2.1
For any \(x \in \mathbb {R}^{n_x}\), we have that \({{\,\textrm{argmin}\,}}_{z \in Z} g(x, z) = \{y(x)\}\).
Under Assumption 2.1, the bilevel optimization problem (1) can then be rewritten as
However, in practical applications, it is usually necessary to employ an iterative method to compute y(x). Therefore, one cannot expect to obtain an exact value of y(x), but rather some approximation. We will hence make use of the following assumption.
Assumption 2.2
For all \(x \in \mathbb {R}^{n_x}\) we can compute an approximation \({\tilde{y}}(x)\) of y(x) such that:
While the remaining assumptions introduced in this section are not always needed, in the rest of this manuscript we always assume that Assumptions 2.1 and 2.2 hold.
Remark 2.1
Our analysis extends to the case where \({{\,\textrm{argmin}\,}}_{z \in Z} g(x, z)\) is not a singleton, but an approximate solution \({\tilde{y}}(x)\) of the simple bilevel problem
is available for every \(x \in \mathbb {R}^{n_x}\). In fact our convergence proofs rely on (3) rather than the singleton assumption, where y(x) can be any solution of problem (4). We refer the reader to the recent work [8] for a detailed discussion on the complexity and regularity properties of the simple bilevel problem (4).
In the next proposition, we show how condition (3) can be satisfied by applying gradient descent to \(g(x, \cdot )\), under a suitable error bound condition on \(\nabla _{y} g(x, y)\) generalizing strong convexity (see, e.g., [23] for a detailed comparison with other conditions). We also give an explicit bound on the number of iterations needed to satisfy (3).
Proposition 2.1
Assume that there exists \(c_g>0\) such that for all \(y\in Z\),
Furthermore, let \(\nabla _y g\) be \(L_g\) Lipschitz continuous in y, uniformly in x. Define \(y_{0}(x)\) to be any arbitrary initialization mapping onto the domain of \(g(x,\cdot )\). Then consider the sequence,
Define the solution estimate to be:
It holds that \({\tilde{y}}(x)\) satisfies (3), for
Proof
This follows from the well-known iteration complexity of gradient descent for smooth nonconvex objectives. \(\square \)
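To make the oracle of Proposition 2.1 concrete, a minimal sketch follows: run gradient descent with stepsize \(1/L_g\) on \(z \mapsto g(x,z)\) for a prescribed number of iterations and return the iterate with the smallest observed gradient norm as the solution estimate. The function names and the specific choice of estimate are ours, for illustration only:

```python
import numpy as np

def lower_level_oracle(grad_g, y0, L_g, num_iters):
    """Sketch of an inexact lower-level solver (cf. Proposition 2.1).

    Runs gradient descent with stepsize 1/L_g on z -> g(x, z) (x is
    fixed, so grad_g maps z to the partial gradient of g in z) and
    returns the iterate with the smallest observed gradient norm.
    """
    y = np.asarray(y0, dtype=float)
    best_y, best_norm = y.copy(), np.linalg.norm(grad_g(y))
    for _ in range(num_iters):
        y = y - grad_g(y) / L_g              # standard gradient step
        g_norm = np.linalg.norm(grad_g(y))
        if g_norm < best_norm:               # keep the best iterate seen
            best_y, best_norm = y.copy(), g_norm
    return best_y
```

On a strongly convex quadratic lower level, e.g. \(g(x,z) = \frac{1}{2}\Vert z - y(x)\Vert ^2\), the sketch recovers y(x) up to any prescribed accuracy \(\varepsilon \) in finitely many iterations.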
We introduce now some technical assumptions on the objective function needed in our analysis.
Assumption 2.3
The function f is lower bounded by \(f_{\text {low}}\).
Assumption 2.4
The function f is Lipschitz continuous with respect to y with Lipschitz constant \(L_f\) (independent of x).
We remark that these assumptions are an adaptation to our bilevel setting of standard assumptions made in the analysis of direct-search methods [10, 34]. Assumption 2.2, together with Assumption 2.4, implies that \({\tilde{F}}(x):= f(x, {\tilde{y}}(x))\) is an approximation of F(x) with accuracy \(L_f\varepsilon \). Indeed,
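In code, the inexact true objective is simply a composition with the lower-level oracle; the wrapper below (names are ours, not the paper's) makes explicit that any upper-level function that is \(L_f\)-Lipschitz in y inherits the accuracy \(L_f\varepsilon \) from the oracle:

```python
import math

def F_tilde(f, y_tilde_oracle, x):
    """Inexact true objective F~(x) = f(x, y~(x)): evaluate the
    upper-level function at the oracle's approximate lower-level
    solution instead of the exact minimizer y(x)."""
    return f(x, y_tilde_oracle(x))
```

If the oracle error is \(\Vert {\tilde{y}}(x) - y(x)\Vert \le \varepsilon \) and f is \(L_f\)-Lipschitz in y, then \(|{\tilde{F}}(x) - F(x)| \le L_f \varepsilon \).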
Some regularity on the true objective F(x) will always be necessary for our analyses. We consider both the differentiable and the potentially non differentiable setting.
Assumption 2.5
F(x) is Lipschitz continuous with constant \(L_F\).
Assumption 2.6
The function F is continuously differentiable with Lipschitz continuous gradient, of Lipschitz constant L.
Note that if f is Lipschitz with respect to x, and y(x) is Lipschitz continuous with respect to x, then Assumption 2.5 is satisfied. Furthermore, in the strongly convex lower-level setting there is an explicit expression for \(\nabla F\) (see, e.g., [8, Equation (3)]), implying that its Lipschitz continuity follows from that of y(x) together with suitable regularity assumptions on f and g.
2.1 Algorithm
In this section, we introduce a general direct-search algorithm for bilevel optimization that embeds both directional direct-search methods with sufficient decrease and mesh adaptive direct-search methods with simple decrease, as defined in [10]. The methods in the first class sample tentative points along a suitable set of search directions and then select as the new iterate a point satisfying a sufficient decrease condition. The methods in the second class sample the points in a suitably defined mesh, and then select the new iterate according to a simple decrease condition. A tentative point t is hence accepted if the decrease condition
is satisfied, for \(\rho \) a nonnegative function. We have a sufficient decrease condition when \(\rho (t) > 0\) with \(\lim _{t \rightarrow 0^+} \rho (t)/t = 0\), and a simple decrease condition when \(\rho (t) = 0\). These two classes of decrease conditions lead to significant differences in convergence properties, and consequently require different choices of the algorithm parameters. They will therefore be analyzed separately in Sects. 3 and 4, respectively.
The detailed scheme (see Algorithm 1) follows the lines of the general schemes proposed in [10] and [27], with the addition of calls to the lower-level oracle \({\tilde{y}}(x)\), and an explicit reference to the mesh used in mesh-based schemes. At steps 3–6, the algorithm searches for a new iterate by testing the upper-level objective at \((t, {\tilde{y}}(t))\) for t in a subset \(S_k\) of the mesh \(M_k\). In case the search is not successful, the method generates a new iterate by selecting a set of search directions \(D_k\) and testing the upper-level objective at \((t, {\tilde{y}}(t))\) for t chosen along the search directions using a stepsize \(\alpha _k\) (see steps 7–12). Steps 9, 11 and 13 update the algorithm iterate and parameters based on the search step and the computed function evaluations. For the set of directions \(D_k\), we require in some cases a positive cosine measure, that is
for some \(\kappa > 0\).
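As an illustration of the poll step just described (a sketch under our own naming, not the paper's Algorithm 1), the fragment below polls the inexact objective along the coordinate directions and their negatives, a positive spanning set with cosine measure \(1/\sqrt{n}\), and accepts a trial point under the sufficient decrease \(\rho (\alpha ) = \frac{c}{2}\alpha ^2\):

```python
import numpy as np

def poll_step(F_tilde, x, alpha, c=1e-4):
    """One poll iteration of a GSS-style direct search (sketch).

    D = [e_1, ..., e_n, -e_1, ..., -e_n] is a positive spanning set
    with cosine measure 1/sqrt(n); a trial point t = x + alpha*d is
    accepted if it satisfies the sufficient decrease condition
    F~(t) < F~(x) - (c/2) * alpha**2.  Returns (new_x, success).
    """
    fx = F_tilde(x)
    n = x.size
    directions = np.vstack([np.eye(n), -np.eye(n)])
    for d in directions:
        t = x + alpha * d
        if F_tilde(t) < fx - 0.5 * c * alpha ** 2:
            return t, True          # successful iteration
    return x, False                 # unsuccessful: caller shrinks alpha
```

A full method would expand the stepsize after a success (\(\gamma \alpha \)) and contract it after a failure (\(\max (\theta \alpha , \alpha _{\min })\)), as in the schemes analyzed below.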
3 Sufficient decrease condition
In this section, we analyze directional direct-search methods using a sufficient decrease condition with \(\rho (t) = \frac{c}{2}t^2\). We first focus on potentially nonsmooth objectives, and then on smooth ones. In both cases we consider the scheme presented in Algorithm 2, which can be viewed as an adaptation to BO of classic generating set search (GSS) schemes (see, e.g., [26, Algorithm 3.2]). In order to handle the error introduced by the approximate solution of the lower level, we lower bound the stepsize with a constant \(\alpha _{\min }\). We further notice that, thanks to the sufficient decrease condition, maintaining a mesh is not necessary, and therefore we simply set \(M_k = \mathbb {R}^{n_x}\).
3.1 Nonsmooth objectives
First, we present convergence guarantees, and proofs thereof, for a variant of Algorithm 2 designed for the case of Lipschitz continuous true objectives, i.e., under Assumption 2.5. With respect to the general scheme presented as Algorithm 2, here \(D_k = \{g_k\}\) with \(g_k\) generated in the unit sphere. We remark that this is a standard choice for direct-search algorithms applied to nonsmooth objectives (see, e.g., [16, Algorithm \(\text {DFN}_{simple}\)]). The stepsize lower bound here must be strictly positive (i.e., \(\alpha _{\min }> 0\)). This, together with the sufficient decrease condition, ensures that the sequence generated by the algorithm is eventually constant, as proved in Lemma 3.1. We then use a novel argument to prove that the limit point of the sequence is a \((\delta , \epsilon )\)-Goldstein stationary point. Although such a notion of stationarity has recently gained attention in the analysis of zeroth-order smoothing-based approaches [22, 28, 40], including extensions to BO [8, 35], to the best of our knowledge, it has never been used in the analysis of direct-search methods. It is further important to notice that convergence of directional direct-search methods to \((\delta , \epsilon )\)-Goldstein stationary points in the nonsmooth case is a novel result also for classic optimization problems. We now recall some useful definitions. If \(B_{\delta }(x)\) is the ball of radius \(\delta \) centered at x, then the \(\delta \)-Goldstein subdifferential (see, e.g., [28]) is defined as
$$\begin{aligned} \partial _{\delta }F(x) = {{\,\textrm{conv}\,}}\left( \bigcup _{y \in B_{\delta }(x)} \partial F(y)\right) , \end{aligned}$$
with \(\partial F(y)\) the Clarke subdifferential of F at y,
and x is a \((\delta , \epsilon )\)-Goldstein stationary point for the function F if, for some \(g \in \partial _{\delta }F(x)\), we have \(\Vert g\Vert \le \epsilon \).
We can now proceed with our convergence analysis. As anticipated, we start by proving that the sequence of iterates generated by our method is eventually constant.
Lemma 3.1
Let Assumptions 2.3 and 2.4 hold. Then there exists \({\bar{k}}\in \mathbb {N}_0\) such that the sequence \(\{x_k\}\) generated by Algorithm 2 is constant for \(k \ge {\bar{k}}\).
Proof
Notice that \(\{{\tilde{F}}(x_k)\}\) is non-increasing, with \({\tilde{F}}(x_k) = {\tilde{F}}(x_{k + 1})\) after an unsuccessful step, and
after a successful step. Thus there can be at most
successful steps, where we used \({\tilde{F}}(x) \ge F(x) - L_f\varepsilon \ge f_{\text {low}} - L_f \varepsilon \) in the inequality. Since this quantity is finite, this implies that \(\{x_k\}\) is eventually constant. \(\square \)
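For completeness, the bound on the number of successful steps used above can be written out explicitly; this reconstruction uses only the per-step decrease \(\rho (\alpha _k) \ge \frac{c}{2}\alpha _{\min }^2\) and the lower bound \({\tilde{F}}(x_k) \ge f_{\text {low}} - L_f\varepsilon \) from the proof:

```latex
\#\{\text{successful steps}\}
\;\le\; \frac{\tilde{F}(x_0) - (f_{\text{low}} - L_f\varepsilon)}{\tfrac{c}{2}\,\alpha_{\min}^2}
\;=\; \frac{2\bigl(\tilde{F}(x_0) - f_{\text{low}} + L_f\varepsilon\bigr)}{c\,\alpha_{\min}^2}.
```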
We now prove convergence of our algorithm to \((\delta ,\epsilon )\)-Goldstein stationary points. In order to get our convergence result, we need to assume that the sequence \(\{g_k\}\) is dense in the unit sphere. We remark that such a dense sequence can be generated using a suitable quasirandom sequence (see, e.g., [19, 33]).
Theorem 3.1
Let Assumptions 2.3, 2.4 and 2.5 hold. Assume that \(\{g_k\}\) is dense in the unit sphere. Then the sequence \(\{x_k\}\) generated by Algorithm 2 is eventually constant, with the unique limit point \((\delta ,\epsilon )\)-Goldstein stationary, for
Proof
First, \(\{x_k\}\) is eventually constant as seen in Lemma 3.1. Let \({\bar{x}}\) be the unique limit point. By the stepsize updating rule, we have that every iteration must be unsuccessful with \(\alpha _k = \alpha _{\min }\) for k large enough. Then, there exists \({\bar{k}} \in \mathbb {N}\) large enough such that for every \(k \ge {\bar{k}}\)
implying
By the density of \(\{g_k\}\) it follows
for every d such that \(\Vert d\Vert = \alpha _{\min }\).
We now define the function \({\bar{F}}_{{\bar{x}}}(d):= F({\bar{x}} + d) + (\frac{c}{2} + \frac{2L_f\varepsilon }{\alpha _{\min }^2}) \Vert d\Vert ^2\). Since
for every d such that \(\Vert d\Vert = \alpha _{\min }\) by (18), there must be a \({\tilde{d}} \in {{\,\textrm{argmin}\,}}_{\Vert d\Vert \le \alpha _{\min }} {\bar{F}}_{{\bar{x}}}(d)\) with \(\Vert {\tilde{d}}\Vert < \alpha _{\min }\). We can conclude
Equivalently, \(g = (c + \frac{4L_f\varepsilon }{\alpha _{\min }^2}){\tilde{d}} \in \partial F({\bar{x}} + {\tilde{d}})\) and since \(\partial F({\bar{x}} + {\tilde{d}}) \subset \partial _{\alpha _{\min }} F({\bar{x}})\) we have \(g \in \partial _{\alpha _{\min }} F({\bar{x}})\). To conclude, observe \(\Vert g\Vert < c \alpha _{\min }+ \frac{4L_f\varepsilon }{\alpha _{\min }} \). \(\square \)
As a corollary of Theorem 3.1, for \(\alpha _{\min }\propto \sqrt{\varepsilon }\) we are able to get a \((\mathcal {O}(\sqrt{\varepsilon }), \mathcal {O}(\sqrt{\varepsilon }))-\)Goldstein stationary point. Interestingly, the order of magnitude \(\mathcal {O}(\sqrt{\varepsilon })\) of the approximation error coincides with that of typical gradient approximation methods [7], as well as with that of direct-search in the smooth setting, as we shall see in the next section.
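The balancing behind this choice can be checked directly: with \(\alpha _{\min } = 2\sqrt{L_f\varepsilon /c}\), the two terms \(c\,\alpha _{\min }\) and \(4L_f\varepsilon /\alpha _{\min }\) in the stationarity bound of Theorem 3.1 are equal, and their sum collapses to \(4\sqrt{c L_f \varepsilon } = \mathcal {O}(\sqrt{\varepsilon })\). A small numerical sketch (constant values illustrative):

```python
import math

def goldstein_bound(c, L_f, eps):
    """Stationarity bound c*a + 4*L_f*eps/a of Theorem 3.1 evaluated
    at the balancing stepsize a = alpha_min = 2*sqrt(L_f*eps/c)."""
    a = 2.0 * math.sqrt(L_f * eps / c)
    return c * a + 4.0 * L_f * eps / a    # = 4*sqrt(c*L_f*eps)
```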
Corollary 3.1
Let Assumptions 2.3, 2.4 and 2.5 hold. Assume that \(\{g_k\}\) is dense in the unit sphere. Then the sequence \(\{x_k\}\) generated by Algorithm 2 with \(\alpha _{\min }= 2\sqrt{\frac{L_f\varepsilon }{c}}\) is eventually constant, with the unique limit point \((\delta ,\epsilon )\)-Goldstein stationary, for
3.2 Smooth objectives
We now focus on the case where the objective F is smooth, in particular under Assumption 2.6. We consider here a variant of Algorithm 2 with \(D_k\) a positive spanning set. When the stepsize lower bound is strictly positive, we set as termination criterion in step 14
Our scheme can hence be seen as a variant of classic direct-search methods for smooth objectives [10, 26]. It is important to highlight that this is the first analysis of direct-search methods for smooth objectives under bounded noise. The only analysis of direct-search methods we are aware of in the smooth case is the one given in [14] under stochastic noise, where, however, the author only focuses on classic optimization problems.
We first extend to our bounded error setting a standard result that provides an upper bound on the gradient norm at unsuccessful iterations (see, e.g., [26, Theorem 3.3]).
Lemma 3.2
Let Assumptions 2.4 and 2.6 hold, together with (11). Let \(\{x_k\}\) be a sequence generated by Algorithm 2. If the iteration k is unsuccessful, then
Proof
Let \(d \in D_k\) be such that
We have
where we used (24) in the first inequality, the standard descent lemma in the second inequality, (9) in the third inequality, and that the step is unsuccessful in the last inequality. Therefore, since by assumption \(\Vert d\Vert = 1\)
implying the thesis. \(\square \)
In [34], convergence of a linesearch scheme is analyzed in the noisy case (i.e., additive noise smaller than the stepsize), and a result analogous to Lemma 3.2 is given.
We now prove convergence and complexity bounds both when \(\alpha _{\min }> 0\), extending those given in [43] for the exact oracle case, and when \(\alpha _{\min }=0\). We notice that in this second case we lose finite termination, and our guarantees are thus somewhat weaker: we are only able to prove that the stepsize converges to 0 and that at some point the gradient norm is \(\mathcal {O}(\sqrt{\varepsilon })\).
Theorem 3.2
Let Assumptions 2.3, 2.4 and 2.6 hold, together with (11) for every \(k \in \mathbb {N}_0\). Let \(\{x_k\}\) be a sequence generated by Algorithm 2.
1. If \(\alpha _{\min }> 0\), then the algorithm satisfies the termination condition (22) after \({\bar{k}}\) iterations, with
$$\begin{aligned} {\bar{k}} < 1 + \frac{2}{\alpha _{\min }^2c}({{\tilde{F}}}(x_0) - f_{low} + 2L_f\varepsilon )\left( 1 - \frac{\ln \gamma }{\ln \theta }\right) + \frac{\ln \alpha _{\min }- \ln \alpha _0}{\ln \theta } , \end{aligned}$$
(27)
and its last iterate \(x_{{\bar{k}}}\) is such that
$$\begin{aligned} \Vert \nabla F(x_{{\bar{k}}})\Vert \le \frac{1}{\kappa } \left( \frac{(L + c)\alpha _{\min }}{2} + \frac{2L_f\varepsilon }{\alpha _{\min }} \right) . \end{aligned}$$
(28)
2. If, furthermore, it holds that \(\alpha _{\min }= 2\sqrt{\frac{L_f \varepsilon }{L + c}}\), then
$$\begin{aligned} \Vert \nabla F(x_{{\bar{k}}})\Vert \le \frac{2}{\kappa } \sqrt{(c + L) L_f \varepsilon } . \end{aligned}$$
(29)
3. If \(\alpha _{\min }= 0\), then \(\alpha _k \rightarrow 0\), and if additionally \(\alpha _0 \ge \bar{\alpha }_{\min }= 2\sqrt{\frac{L_f \varepsilon }{L + c}}\), for some \({\bar{k}} \in \mathbb {N}_0\) we have
$$\begin{aligned} \Vert \nabla F(x_{{\bar{k}}})\Vert \le \frac{1}{\theta \kappa } \left( \frac{(L + c)\bar{\alpha }_{\min }}{2} + \frac{2L_f\varepsilon }{\bar{\alpha }_{\min }}\right) , \end{aligned}$$
(30)
and
$$\begin{aligned} F(x_k) \le F(x_{{\bar{k}}}) + 2L_f\varepsilon \quad \text { for all }k\ge {\bar{k}} . \end{aligned}$$
(31)
Proof
1. Let \(k_s\) and \(k_{ns}\) be the number of successful and unsuccessful steps, so that \(k_s + k_{ns} = k\). Reasoning as in Lemma 3.1, we obtain by (14)
Furthermore, since
we get
where we applied (32) in the second inequality. Combining the bounds on the successful and unsuccessful steps (32) and (34), we have
as desired.
2. Follows from a direct application of the first result.
3. Reasoning as in the first result, the number of successful steps with stepsize above a certain threshold is bounded, hence \(\alpha _k \rightarrow 0\). Furthermore, for any \({\bar{k}} \in \mathbb {N}_0\), if \(k \ge {\bar{k}}\)
which proves (31). Let \(\bar{\alpha }_{\min }= 2\sqrt{\frac{L_f \varepsilon }{L + c}}\). Since \(\alpha _0 \ge \bar{\alpha }_{\min }\), and \(\alpha _k \rightarrow 0\) with contraction factor \(\theta \), we must have \(\alpha _{{\bar{k}}} \in [\theta \bar{\alpha }_{\min }, \bar{\alpha }_{\min }]\) for some \({\bar{k}} \in \mathbb {N}_0\). Then (30) follows from (23) for \(\alpha _k = \alpha _{{\bar{k}}}\). \(\square \)
We now extend to our setting the \(\mathcal {O}(n^2/\epsilon ^2)\) complexity result given in [43, Corollary 2]. For a fixed precision \(\epsilon \), an approximation error \(\varepsilon = \mathcal {O}(\epsilon ^2)\) is required, as for classic gradient approximation schemes [7].
Corollary 3.2
Let Assumptions 2.3, 2.4 and 2.6 hold, together with (11) for every \(k \in \mathbb {N}_0\). Let \(\{x_k\}\) be a sequence generated by Algorithm 2. Assume also \(\varepsilon \le \epsilon ^2 \kappa ^2\), that at every iteration there are at most \(d_1n\) function evaluations, and that \(\kappa \ge d_2/\sqrt{n}\), for \(d_1, d_2 > 0\). Then if \(\alpha _{\min }~=~2\sqrt{\frac{L_f \varepsilon }{L + c}}\), the algorithm terminates after \(\mathcal {O}(n^2/\epsilon ^2)\) function evaluations with \(\Vert \nabla {F}(x_{{\bar{k}}})\Vert \le d_3 \epsilon \), for \(d_3 > 0\) depending only on c, L and \(L_f\).
Proof
This follows from points 1 and 2 of Theorem 3.2, plugging in the parameters specified in the assumptions. \(\square \)
4 Simple decrease condition
In this section, we analyze two methods based on a simple decrease condition (i.e., with \(\rho (t) = 0\) in (10)), one for potentially nonsmooth objectives and one for smooth objectives. Both methods follow the scheme presented in Algorithm 3, which is an adaptation to the BO setting of the mesh adaptive direct-search algorithm (MADS, see [2] and references therein). Again we lower bound the stepsize by a constant \(\alpha _{\min }\). The stepsize updating rule we use to handle unsuccessful iterations depends on the frame size parameter \(\Delta _k\), the contraction coefficient \(\theta \), and the smoothness of the true objective (i.e., the update differs between the smooth and the nonsmooth case).
It is a standard assumption in the analysis of MADS that all the iterates lie in a compact set (see, e.g., [3, Section 3]). In our framework, this can be ensured if the following boundedness assumption is satisfied.
Assumption 4.1
The set
is bounded.
The mesh, as defined in the literature (see, e.g., [5, 10] and references therein for further details), is a discrete set of points from which the algorithm selects candidate trial points. Its coarseness is parameterized by the mesh size parameter \(\delta \). The goal of each iteration is to find a mesh point whose objective function value improves on the incumbent value. Given a positive spanning set D and a center x, the related mesh is formally defined as follows:
where, with a slight abuse of notation, we use D also for the matrix \(D\in \mathbb {R}^{n \times p}\) with columns corresponding to the elements of the set D. We notice that the mesh is just a conceptual tool, and is never actually constructed.
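In this notation, the mesh is typically the countably infinite set \(\{x + \delta D z \,:\, z \in \mathbb {N}^p\}\) (see [3]). The sketch below (our own naming, for illustration only) enumerates a finite portion of it; as just noted, the mesh is never actually constructed in practice:

```python
import itertools
import numpy as np

def mesh_points(x, delta, D, z_max=1):
    """Enumerate the mesh points x + delta * D @ z for all integer
    vectors z in {0, ..., z_max}^p, a finite portion of the mesh
    generated by the matrix D (columns = directions)."""
    p = D.shape[1]
    points = []
    for z in itertools.product(range(z_max + 1), repeat=p):
        points.append(x + delta * D @ np.array(z, dtype=float))
    return points
```

For instance, with the minimal positive spanning set \(D = [e_1, e_2, -e_1-e_2]\) in \(\mathbb {R}^2\) and \(\delta = 0.5\), taking \(z \in \{0,1\}^3\) yields \(2^3 = 8\) mesh points.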
4.1 Nonsmooth objectives
With respect to the general scheme presented in Algorithm 3, here the stepsize updating rule for unsuccessful iterations is given by \(\alpha _u(\alpha _k, \Delta _k, \theta ) = \min (\Delta _k, \Delta _k^2, \theta \alpha _k)\), ensuring that \(\alpha _k \rightarrow 0\) and that the mesh gets infinitely dense if the algorithm gets stuck at a certain point. The set of search directions \(D_k\) must be such that
for all \(d \in D_k\), with \(b_i: \mathbb {R}_{> 0} \rightarrow \mathbb {R}_{> 0}\) such that \(\lim _{t \rightarrow 0}b_i(t) = 1\) for \(i\in \{1, 2\}\). Thus, with respect to the classic MADS scheme, here the frame size \(\Delta _k\) also defines a lower bound, and not only an upper bound, on the distance between the current iterate and the tentative points selected in the poll step. This adjustment is necessary due to the error in the true objective evaluation. As shown in the next lemma, condition (39) ensures that, as the stepsize converges to 0, the tentative steps get closer and closer to the boundary of a ball of radius \(\alpha _{\min }\).
Lemma 4.1
Assume that \(\alpha _{\min }> 0\) and that (39) holds. Then if \(\lim _{k \in K} \alpha _k = 0\), the set of limit points of \(\{\alpha _kD_k\}_{k \in K}\) is contained in \(S^{n_x - 1}(\alpha _{\min })\).
Proof
If \(\lim _{k \in K} \alpha _k = 0\), then it holds that, for \(k \in K\) large enough, \(\Delta _k = \alpha _{\min }\). Consider any sequence \(\{d_k\}\) with \(d_k \in D_k\). It holds that, for all \(d_k\),
where we applied (39) in the inequality. Analogously, we can prove \(\liminf _{k \in K} \Vert \alpha _k d_k\Vert \ge \Delta _k\), whence \(\lim _{k \in K} \Vert \alpha _k d_k\Vert = \alpha _{\min }\), which implies the thesis. \(\square \)
We now extend to this scheme the \((\delta , \epsilon )\)-Goldstein stationarity result proved under the sufficient decrease condition in Sect. 3.1. Also in this case, we are not aware of any analogous result for the standard MADS scheme, which is instead known to converge to Clarke stationary points [3].
We start with a lemma that extends a well known property of MADS (see, e.g., [3, Proposition 3.1]) to our bilevel setting.
Lemma 4.2
Let Assumptions 2.4, 2.5 and 4.1 hold. Then the sequence \(\{\alpha _k\}\) generated by Algorithm 3 is such that \(\liminf \alpha _k = 0\).
Proof
Since \(\{{\tilde{F}}(x_k)\}\) is non-increasing (and strictly decreasing for successful iterations), \(\{x_k\}\) is contained in the set \(\mathcal {L}_{\varepsilon }\), which is compact by Assumptions 2.5 and 4.1. Thus \(\liminf \alpha _k = 0\) follows from the finiteness of feasible points generated in \(\mathcal {L}_{\varepsilon }\) when keeping the parameter \(\alpha _k\) lower bounded, which can be proved with the same arguments used for MADS in [3, Proposition 3.1]. \(\square \)
We can now state our main result.
Theorem 4.1
Let Assumptions 2.4, 2.5 and 4.1 hold. Let K be a subset of unsuccessful iteration indices related to Algorithm 3. Let us further assume that:
-
\(\lim _{k \in K} x_k = {\bar{x}}\);
-
\(\lim _{k \in K} \alpha _k = 0\);
-
\(\{{\hat{D}}_k\}_{k \in K}\) is dense in the unit sphere, with \({\hat{D}}_k = \{ \frac{d}{\Vert d\Vert } \ | \ d\in D_k\}\);
-
Condition (39) holds.
Then, the limit point \({\bar{x}}\) of \(\{x_k\}_{k \in K}\) is \((\delta , \epsilon )\)-Goldstein stationary, for
Proof
Let \({\bar{d}} \in \mathbb {R}^n\) with \(\Vert {\bar{d}}\Vert = 1\), and let \(L \subset K\) be such that \(\lim _{k \in L} \frac{d_k}{\Vert d_k\Vert } \rightarrow {\bar{d}}\), with \(d_k \in D_k\). Then \(\alpha _k d_k \rightarrow \alpha _{\min }{\bar{d}}\) by Lemma 4.1. Now, for every \(k \in L\)
where the first inequality follows from (9), and we used that the step k is unsuccessful in the second inequality. Passing to the limit, we obtain
Now let \({\bar{F}}_{{\bar{x}}}(d) = F({\bar{x}} + d) + \frac{2L_f\varepsilon }{\alpha _{\min }^2} \Vert d\Vert ^2\). By applying (42) we get
and given that \({\bar{d}}\) is arbitrary, this holds for any d such that \(\Vert d\Vert = \alpha _{\min }\). The thesis then follows as in the proof of Theorem 3.1. \(\square \)
As in Sect. 3.1, here we also have a corollary showing that for \(\alpha _{\min }\propto \sqrt{\varepsilon }\) we are able to get a \((\mathcal {O}(\sqrt{\varepsilon }), \mathcal {O}(\sqrt{\varepsilon }))\)-Goldstein stationary point.
Corollary 4.1
Under the assumptions of Theorem 4.1, the limit point \({\bar{x}}\) of the sequence \(\{x_k\}\) generated by Algorithm 3 with \(\alpha _{\min }= 2\sqrt{L_f\varepsilon }\) is \((\delta ,\epsilon )\)-Goldstein stationary, for
4.2 Smooth objectives
We now consider the case where the true objective is smooth, i.e., Assumption 2.6 holds. With respect to the general scheme reported in Algorithm 3, we have \(\alpha _u(\alpha _k, \Delta _k, \theta ) = \min (\Delta _k, \Delta _k^2)\), and the algorithm uses (22) as termination condition in step 14, as in the smooth case of Sect. 3.2. As for \(D_k\), it must always satisfy \({{\,\textrm{cm}\,}}(D_k) \ge \kappa \) for some positive \(\kappa \) independent of k, as well as
for every \(d \in D_k\).
We remark that convergence of mesh based schemes for smooth objectives is well understood (see, e.g., [5, Chapter 7]), so that once again our main contribution here is the adaptation to the bilevel setting. We begin our analysis by extending Lemma 3.2 under the simple decrease condition and condition (45) on the search directions.
Lemma 4.3
Let Assumptions 2.4 and 2.6 hold, together with (11). Let \(\{x_k\}\) be a sequence generated by Algorithm 3. If the step k is unsuccessful, then
Proof
Since the step is unsuccessful, by considering \(d \in D_{k}\) such that
we have, reasoning as in (25) with \(c = 0\)
Finally, we get
\(\square \)
We now extend Theorem 3.2 to our mesh based scheme. The main difference is the absence of complexity estimates, which to our knowledge are not available for MADS schemes.
Theorem 4.2
Let Assumptions 2.4, 2.5 and 4.1 hold. Let \(\{x_k\}\) be a sequence generated by Algorithm 3.
1. If \(\alpha _{\min }> 0\), then the algorithm satisfies the termination condition (22) in a finite number of iterations, with the last iterate \(x_{{\bar{k}}}\) satisfying
   $$\begin{aligned} \Vert \nabla F(x_{{\bar{k}}})\Vert \le \frac{1}{\kappa } \left( \frac{b_2\alpha _{\min }L}{2} + \frac{2L_f\varepsilon }{\alpha _{\min }b_1} \right) . \end{aligned}$$(50)
2. If, furthermore, it holds that \(\alpha _{\min }= 2\sqrt{\frac{L_f \varepsilon }{b_1b_2L}}\), then
   $$\begin{aligned} \Vert \nabla F(x_{{\bar{k}}})\Vert \le \frac{2}{\kappa } \sqrt{L b_2 L_f \varepsilon / b_1} . \end{aligned}$$(51)
3. If \(\alpha _{\min }= 0\), then \(\liminf \alpha _k = 0\); if additionally \(\alpha _0 \ge \bar{\alpha }_{\min }= 2\sqrt{\frac{L_f \varepsilon }{b_1b_2L}}\), then for some \({\bar{k}} \in \mathbb {N}_0\) we have
   $$\begin{aligned} \Vert \nabla F(x_{{\bar{k}}})\Vert \le \frac{1}{\theta \kappa } \left( \frac{L\bar{\alpha }_{\min }b_2}{2} + \frac{2L_f\varepsilon }{b_1\bar{\alpha }_{\min }} \right) , \end{aligned}$$(52)
   and
   $$\begin{aligned} F(x_k) \le F(x_{{\bar{k}}}) + 2L_f\varepsilon \quad \text { for all }k\ge {\bar{k}}. \end{aligned}$$(53)
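The stepsize choice in point 2 can be checked by minimizing the right-hand side of (50) over \(\alpha _{\min }\): for constants \(A, B > 0\),

$$\begin{aligned} \min _{\alpha > 0} \left( A\alpha + \frac{B}{\alpha } \right) = 2\sqrt{AB}, \quad \text {attained at } \alpha ^* = \sqrt{B/A}, \end{aligned}$$

and taking \(A = b_2L/2\), \(B = 2L_f\varepsilon /b_1\) yields \(\alpha ^* = 2\sqrt{\frac{L_f \varepsilon }{b_1b_2L}}\), matching the value in point 2.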
Proof
1. Since the frame parameter \(\Delta _k\) is lower bounded, the mesh parameter \(\alpha _k\) is lower bounded as well; consequently, \(\bigcup _{k \in \mathbb {N}_0} M_k\) is a finite set, and the algorithm terminates in a finite number of iterations. By the termination criterion, at the last iteration \({\bar{k}}\) we have \(\Delta _{{\bar{k}}} = \alpha _{\min }\). Since the last iteration is unsuccessful, we hence get
where we applied Lemma 4.3 in the second inequality.
2. Follows from the previous point by replacing \(\alpha _{\min }\) with the given value in (50).
3. The property \(\liminf \alpha _k = 0\) follows from standard arguments used in the analysis of MADS schemes, already mentioned in the proof of Lemma 4.1. The result then follows from points 1 and 2 (similarly to point 3 in Theorem 3.2). \(\square \)
5 Numerical illustration
In this section, we evaluate the performance of the proposed algorithms on a collection of nonlinear bilevel optimization problems from the literature.
Three direct-search solvers derived from Algorithm 2 and Algorithm 3 were implemented in Matlab: Mesh-DS (related to Algorithm 3) with the mesh defined as in [5, Algorithm 8.2], Coordinate-DS (related to Algorithm 2) with \(D_k=[\mathcal {B}_{\oplus }, -\mathcal {B}_{\oplus }]\) (where \(\mathcal {B}_{\oplus }\) is the canonical basis of \(\mathbb {R}^n\)), and Random-DS (related to Algorithm 2) with \(D_k=[\frac{v}{\Vert v\Vert }, -\frac{v}{\Vert v\Vert }]\), where \(v \in \mathbb {R}^n\) is a pseudo-randomly generated vector. We note that Mesh-DS imposes a simple decrease condition to decide the acceptance of a candidate step, whereas Coordinate-DS and Random-DS use a sufficient decrease condition.
In our tests, the parameters used for Algorithm 2 and Algorithm 3 were set as follows: \(\alpha _{\min }=~10^{-6}\), \(\theta =\frac{1}{2}\), \(\alpha _0=1\), \(c=10^{-3}\), and \(\gamma =2\). For all the tested approaches, the optional search step (Step 1) was not included. Instead, in the poll step, when we observed a decrease along a specific direction, we further explored it by using a simple extrapolation strategy (i.e., we multiplied the step-size \(\alpha _k\) by \(\gamma \) and re-evaluated the function).
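To fix ideas, the following is a minimal numpy sketch of the Random-DS variant with sufficient decrease and the extrapolation strategy described above; it is a simplified illustration under our own naming choices, not the actual Matlab implementation (parameter names follow the paper where possible):

```python
import numpy as np

def direct_search(F, x0, alpha0=1.0, alpha_min=1e-6, theta=0.5,
                  gamma=2.0, c=1e-3, max_iter=2000, seed=0):
    """Simplified Random-DS: poll {v, -v} for a random unit vector v,
    accept on sufficient decrease, extrapolate successful directions,
    and contract the stepsize on unsuccessful polls."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    fx, alpha = F(x), alpha0
    for _ in range(max_iter):
        if alpha <= alpha_min:
            break
        v = rng.standard_normal(x.size)
        v /= np.linalg.norm(v)
        success = False
        for d in (v, -v):
            if F(x + alpha * d) < fx - c * alpha ** 2:   # sufficient decrease
                # extrapolation: multiply the stepsize by gamma while the
                # longer step still satisfies the decrease condition
                while F(x + gamma * alpha * d) < fx - c * (gamma * alpha) ** 2:
                    alpha *= gamma
                x = x + alpha * d
                fx, success = F(x), True
                break
        if not success:
            alpha *= theta                               # contract the stepsize
    return x, fx
```

On a smooth coercive objective, the stepsize is driven below alpha_min near a stationary point, triggering termination.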
In our implementation, the lower-level problem is solved using the fmincon Matlab procedure. To quantify the impact of inexact lower-level solutions on performance, we used two different accuracies when solving the lower-level problem (i.e., LL_tol \(\in \{10^{-3}, 10^{-6}\}\)). The remaining fmincon parameters were kept at their default values. A feasibility tolerance of \(10^{-6}\) for constraint violation was used in the solution of the lower-level problem.
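In the same spirit, the fixed-accuracy lower-level oracle \({\tilde{y}}(x)\) can be sketched as follows. This is a numpy stand-in for the fmincon call, assuming a smooth, strongly convex, unconstrained lower-level problem and access to \(\nabla _z g\); all names are illustrative:

```python
import numpy as np

def make_oracle(grad_z, z0, tol=1e-6, lr=0.1, max_iter=10_000):
    """Fixed-accuracy lower-level oracle sketch: plain gradient descent
    on z -> g(x, z), stopped once ||grad_z g(x, z)|| <= tol (the
    tolerance plays the role of LL_tol in the experiments)."""
    def oracle(x):
        z = np.asarray(z0, dtype=float).copy()
        for _ in range(max_iter):
            gz = grad_z(x, z)
            if np.linalg.norm(gz) <= tol:   # reached the requested accuracy
                break
            z -= lr * gz                    # gradient step on the lower level
        return z
    return oracle
```

Tightening tol (e.g., \(10^{-6}\) instead of \(10^{-3}\)) yields a more accurate \({\tilde{y}}(x)\) at the price of more lower-level work, which is exactly the trade-off explored in the experiments.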
The three solvers, Mesh-DS, Coordinate-DS, and Random-DS, were evaluated on 33 small-scale bilevel optimization problems from the BOLIB Matlab library [46]. This library collects academic and real-world problems. The dimensions of the tested instances, with respect to the upper-level problem, do not exceed 10 variables. Since initial points are not provided, we generated five instances of each problem by randomly selecting five different initial points, for a total of 165 problem instances.
The computational analysis is carried out using well-known tools from the literature, namely data and performance profiles (see, e.g., [38] for further details). We briefly recall their definitions. Given a set S of algorithms and a set P of problems, for \(s\in S\) and \(p \in P\), let \(t_{p,s}\) be the number of function evaluations required by algorithm s on problem p to satisfy the condition
where \(\alpha \in (0, 1)\) and \(\tilde{F}_{\text{low}}\) is the best objective function value achieved by any solver on problem p. Then, the performance and data profiles of solver s are defined by
where \(n_p\) is the dimension of problem p. We used a budget of 500 upper-level function evaluations in our experiments.
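The profiles can be computed directly from the matrix of evaluation counts \(t_{p,s}\); below is a minimal numpy sketch following the standard Moré–Wild definitions cited as [38] (function names are ours, with np.inf marking solver failures):

```python
import numpy as np

def performance_profile(T, tau):
    """rho_s(tau): fraction of problems on which solver s needs at most
    tau times the evaluations of the best solver. T[p, s] = t_{p,s},
    np.inf for failures. Assumes each problem is solved by some solver."""
    best = T.min(axis=1, keepdims=True)          # min_s t_{p,s} per problem
    return (T <= tau * best).mean(axis=0)

def data_profile(T, n_p, kappa):
    """d_s(kappa): fraction of problems solved by solver s within
    kappa * (n_p + 1) evaluations, with n_p the dimension of problem p."""
    budget = kappa * (np.asarray(n_p)[:, None] + 1)
    return (T <= budget).mean(axis=0)
```

For instance, with two solvers and three problems, a row like [2, 4] means the first solver solved that problem in half the evaluations of the second, so it counts toward the first solver's profile already at \(\tau = 1\).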
Figures 1 and 2 depict the resulting performance and data profiles, respectively, for two levels of accuracy \(\alpha \): \(10^{-3}\) and \(10^{-6}\). From Fig. 1, it can be observed that the Coordinate-DS approach performs best in terms of both efficiency (i.e., \(\tau =1\)) and robustness (i.e., larger \(\tau \)), particularly when the lower-level problem is solved accurately (i.e., LL_tol=\(10^{-6}\)). The data profiles (see Fig. 2) indicate that all the direct-search approaches perform similarly for small budgets. As the budget increases, the accuracy of the lower-level solution has a growing impact on solver performance. Overall, on the test problems, the mesh-based approach is slightly more effective for small budgets, i.e., less than \(25(n_x+1)\) evaluations. However, as the budget increases, the directional direct-search algorithms outperform the mesh-based approach.
6 Conclusion
In this work, we proposed an inexact direct-search based algorithmic framework for bilevel optimization, under the assumption that the lower-level problem can be solved within a fixed accuracy. We then proved convergence of two different classes of methods fitting our scheme, namely directional direct-search methods with sufficient decrease and mesh-based schemes with simple decrease. Our results include complexity estimates for a directional direct-search scheme tailored to BO with smooth true objective, extending previously known complexity estimates for the single-level case. We also considered the nonsmooth case and gave convergence guarantees to \((\delta , \epsilon )\)-Goldstein stationary points for both classes, thus nicely extending the known Clarke stationary point convergence properties of analogous schemes in the single-level case. A lower bound on the stepsize allows these methods to converge to a point with the desired stationarity properties in a finite number of iterations. Preliminary numerical results suggest that directional direct-search methods might lead to better performance than mesh-based strategies in this context.
Future developments include the extension of our algorithms to constrained and stochastic settings, as well as numerical comparisons with recent zeroth-order smoothing-based approaches for BO.
Data availability
The data analysed during the current study are available in the BOLIB library and the code will be made available by the authors upon reasonable request.
References
Anagnostidis, S.-K., Lucchi, A., Diouane, Y.: Direct-search for a class of stochastic min-max problems. In: Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, vol. 130, pp. 3772–3780. PMLR (2021)
Audet, C.: A Survey on Direct Search Methods for Blackbox Optimization and Their Applications. Springer, Berlin (2014)
Audet, C., Dennis, J.E., Jr.: Mesh adaptive direct search algorithms for constrained optimization. SIAM J. Optim. 17(1), 188–217 (2006)
Audet, C., Dzahini, K.J., Kokkolaras, M., Le Digabel, S.: StoMADS: stochastic blackbox optimization using probabilistic estimates. arXiv preprint arXiv:1911.01012 (2019)
Audet, C., Hare, W.: Derivative-Free and Blackbox Optimization. Springer, Cham (2017)
Beck, Y., Schmidt, M.: A Gentle and Incomplete Introduction to Bilevel Optimization (2021)
Berahas, A.S., Cao, L., Choromanski, K., Scheinberg, K.: A theoretical and empirical comparison of gradient approximations in derivative-free optimization. Found. Comput. Math. 22(2), 507–560 (2022)
Chen, L., Xu, J., Zhang, J.: On bilevel optimization without lower-level strong convexity. arXiv preprint arXiv:2301.00712 (2023)
Colson, B., Marcotte, P., Savard, G.: An overview of bilevel optimization. Ann. Oper. Res. 153, 235–256 (2007)
Conn, A.R., Scheinberg, K., Vicente, L.N.: Introduction to Derivative-Free Optimization. SIAM, Philadelphia (2009)
Conn, A.R., Vicente, L.N.: Bilevel derivative-free optimization and its application to robust optimization. Optim. Methods Softw. 27(3), 561–577 (2012)
Dempe, S.: Foundations of Bilevel Programming. Springer, Berlin (2002)
Dempe, S.: Bilevel optimization: theory, algorithms, applications and a bibliography. In: Bilevel Optimization: Advances and Next Challenges, pp. 581–672 (2020)
Dzahini, K.J.: Expected complexity analysis of stochastic direct-search. Comput. Optim. Appl. 81, 179–200 (2022)
Ehrhardt, M.J., Roberts, L.: Inexact derivative-free optimization for bilevel learning. J. Math. Imaging Vis. 63(5), 580–600 (2021)
Fasano, G., Liuzzi, G., Lucidi, S., Rinaldi, F.: A linesearch-based derivative-free approach for nonsmooth constrained optimization. SIAM J. Optim. 24(3), 959–992 (2014)
Franceschi, L., Frasconi, P., Salzo, S., Grazzi, R., Pontil, M.: Bilevel programming for hyperparameter optimization and meta-learning. In: International Conference on Machine Learning, pp. 1568–1577. PMLR (2018)
Grazzi, R., Franceschi, L., Pontil, M., Salzo, S.: On the iteration complexity of hypergradient computation. In: International Conference on Machine Learning, pp. 3748–3758. PMLR (2020)
Halton, J.H.: On the efficiency of certain quasi-random sequences of points in evaluating multi-dimensional integrals. Numer. Math. 2, 84–90 (1960)
Ji, K., Liang, Y.: Lower bounds and accelerated algorithms for bilevel optimization. J. Mach. Learn. Res. 24(22), 1–56 (2023)
Ji, K., Yang, J., Liang, Y.: Bilevel optimization: convergence analysis and enhanced design. In: International Conference on Machine Learning, pp. 4882–4892. PMLR (2021)
Jordan, M.I., Kornowski, G., Lin, T., Shamir, O., Zampetakis, M.: Deterministic nonsmooth nonconvex optimization. arXiv preprint arXiv:2302.08300 (2023)
Karimi, H., Nutini, J., Schmidt, M.: Linear convergence of gradient and proximal-gradient methods under the Polyak–Łojasiewicz condition. In: Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19–23, 2016, Proceedings, Part I 16, pp. 795–811. Springer, Berlin (2016)
Khanduri, P., Zeng, S., Hong, M., Wai, H.-T., Wang, Z., Yang, Z.: A near-optimal algorithm for stochastic bilevel optimization via double-momentum. Adv. Neural Inf. Process. Syst. 34, 30271–30283 (2021)
Kleinert, T., Labbé, M., Ljubić, I., Schmidt, M.: A survey on mixed-integer programming techniques in bilevel optimization. EURO J. Comput. Optim. 9, 100007 (2021)
Kolda, T.G., Lewis, R.M., Torczon, V.: Optimization by direct search: new perspectives on some classical and modern methods. SIAM Rev. 45(3), 385–482 (2003)
Larson, J., Menickelly, M., Wild, S.M.: Derivative-free optimization methods. Acta Numerica 28, 287–404 (2019)
Lin, T., Zheng, Z., Jordan, M.I.: Gradient-free methods for deterministic and stochastic nonsmooth nonconvex optimization. Adv. Neural Inf. Process. Syst. 35, 26160–26175 (2022)
Liu, B., Ye, M., Wright, S., Stone, P., Liu, Q.: BOME! Bilevel optimization made easy: a simple first-order approach. Adv. Neural Inf. Process. Syst. 35, 17248–17262 (2022)
Liu, R., Liu, X., Yuan, X., Zeng, S., Zhang, J.: A value-function-based interior-point method for non-convex bi-level optimization. In: International Conference on Machine Learning, pp. 6882–6892. PMLR (2021)
Liu, R., Liu, Y., Zeng, S., Zhang, J.: Towards gradient-based bilevel optimization with non-convex followers and beyond. Adv. Neural Inf. Process. Syst. 34, 8662–8675 (2021)
Liu, R., Mu, P., Yuan, X., Zeng, S., Zhang, J.: A generic first-order algorithmic framework for bi-level programming beyond lower-level singleton. In: International Conference on Machine Learning, pp. 6305–6315. PMLR (2020)
Liuzzi, G., Lucidi, S., Rinaldi, F., Vicente, L.N.: Trust-region methods for the derivative-free optimization of nonsmooth black-box functions. SIAM J. Optim. 29, 3012–3035 (2019)
Lucidi, S., Sciandrone, M.: A derivative-free algorithm for bound constrained optimization. Comput. Optim. Appl. 21, 119–142 (2002)
Maheshwari, C., Sastry, S.S., Ratliff, L., Mazumdar, E.: Convergent first-order methods for bi-level optimization and Stackelberg games. arXiv preprint arXiv:2302.01421 (2023)
Menickelly, M., Wild, S.M.: Derivative-free robust optimization by outer approximations. Math. Program. 179, 157–193 (2020)
Mersha, A.G., Dempe, S.: Direct search algorithm for bilevel programming problems. Comput. Optim. Appl. 49(1), 1–15 (2011)
Moré, J.J., Wild, S.M.: Benchmarking derivative-free optimization algorithms. SIAM J. Optim. 20, 172–191 (2009)
Nesterov, Y., Spokoiny, V.: Random gradient-free minimization of convex functions. Found. Comput. Math. 17, 527–566 (2017)
Rando, M., Molinari, C., Rosasco, L., Villa, S.: An optimal structured zeroth-order algorithm for non-smooth optimization. arXiv preprint arXiv:2305.16024 (2023)
Rinaldi, F., Vicente, L.N., Zeffiro, D.: A weak tail-bound probabilistic condition for function estimation in stochastic derivative-free optimization. arXiv preprint arXiv:2202.11074 (2022)
Venturini, S., Cristofari, A., Rinaldi, F., Tudisco, F.: Learning the right layers: a data-driven layer-aggregation strategy for semi-supervised learning on multilayer graphs. arXiv preprint arXiv:2306.00152 (2023)
Vicente, L.N.: Worst case complexity of direct search. EURO J. Comput. Optim. 1(1–2), 143–153 (2013)
Zhang, D., Lin, G.-H.: Bilevel direct search method for leader-follower problems and application in health insurance. Comput. Oper. Res. 41, 359–373 (2014)
Zhang, Y., Yao, Y., Ram, P., Zhao, P., Chen, T., Hong, M., Wang, Y., Liu, S.: Advancing model pruning via bi-level optimization. Adv. Neural Inf. Process. Syst. 35, 18309–18326 (2022)
Zhou, S., Zemkoho, A.B., Tin, A.: BOLIB: bilevel optimization library of test problems. arXiv preprint arXiv:1812.00230v3 (2020)
Funding
Open access funding provided by Università degli Studi di Padova within the CRUI-CARE Agreement.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Diouane, Y., Kungurtsev, V., Rinaldi, F. et al. Inexact direct-search methods for bilevel optimization problems. Comput Optim Appl (2024). https://doi.org/10.1007/s10589-024-00567-7