1 Introduction

Bilevel optimization (BO; see, e.g., [6, 9, 12, 13, 25] and references therein for a complete overview of the topic) has been the subject of increasing interest, thanks to its applications to hyperparameter tuning for machine learning algorithms and to meta-learning (see, e.g., [17] and references therein). In this work, we are interested in the following bilevel optimization problem:

$$\begin{aligned} \min _{(x,y) \in \mathbb {R}^{n_x}\times \mathbb {R}^{n_y}}~~~~ f(x,y),~~~~~~ \text{ s.t. }~~~~~~ y \in \displaystyle \arg \min _{z \in Z} g(x,z), \end{aligned}$$
(1)

where we assume that the upper-level function \(f:\mathbb {R}^{n_x}\times \mathbb {R}^{n_y}\rightarrow \mathbb {R}\) is continuous, and \(g:\mathbb {R}^{n_x}\times \mathbb {R}^{n_y}\rightarrow \mathbb {R}\) is such that the lower-level problem \(\min _{z \in Z} g(x,z)\) has a unique solution y(x) for every \(x \in \mathbb {R}^{n_x}\), with \(Z\subset \mathbb {R}^{n_y}\). Uniqueness of the lower-level solution, also known as the Low-Level Singleton (LLS) assumption, is a common assumption in many real-world applications, such as hyperparameter optimization, meta-learning, pruning, and semi-supervised learning on multilayer graphs (see, e.g., [17, 21, 42, 45]). While for simplicity we focus on the setting described above, it is important to point out that our analysis still holds, for a specific class of BO problems, even when the LLS assumption is dropped (see Remark 2.1).

The algorithms we study here are derivative-free optimization (DFO) methods, which do not use derivatives of the upper-level objective function, but only objective values. Importantly, in this setting we also assume the availability of a blackbox oracle generating an approximation \({\tilde{y}}(x)\) of y(x) for any given \(x \in \mathbb {R}^{n_x}\). Among DFO methods, we are interested in particular in direct-search methods (see, e.g., [2, 27]), which sample the objective at suitably chosen tentative points without building a model for it. These algorithmic schemes allow us to prove convergence guarantees under very mild assumptions on our bilevel optimization problem.

1.1 Previous work

Several gradient-based methods have been proposed in the literature to tackle bilevel optimization problems. These methods usually require the computation of the true objective gradient, called the “hypergradient”, and rely on the LLS and suitable smoothness assumptions (see, e.g., [17, 18, 20, 24, 29] and references therein). In another line of research, asymptotic results based on relaxations of the LLS assumption have also been analyzed (see, e.g., [30,31,32] and references therein). Computing the hypergradient can, however, be a notoriously challenging and time-consuming task. It requires handling \(\nabla _x y(x)\), which in turn involves the Hessian of g through implicit differentiation. In some contexts, the hypergradient might not be available at all due to the blackbox nature of the functions describing the problem. These are the reasons why the development of new and efficient zeroth-order/derivative-free approaches is crucial in the BO context.

As for derivative-free approaches, classic direct-search (see, e.g., [2, 10, 27]) and trust-region methods (see, e.g., [10, 27]) have been applied to BO in [11, 15, 37, 44]. In [37], a direct-search method for BO assuming the availability of exact true objective values is described. More specifically, the analysis therein does not allow for approximation errors in the solution of the lower-level problem, and relies on suitable assumptions making the true objective directionally differentiable. In [44], the analysis from [37] is extended to inexact lower-level solutions with a stepsize-based adaptive error. In [11], an algorithm applying trust-region methods both to the lower-level problem and to the true objective is described, with an adaptive estimation error for the true objective depending on the trust-region radius; in that work, a strategy to recycle function evaluations for the lower-level problem is described as well. In [15], the analysis of another trust-region method with adaptive error for bilevel optimization is carried out. The authors report worst-case complexity estimates both in terms of upper-level iterations and of computational work for the lower-level problem, when considering a strongly convex lower-level problem solved by a suitable gradient descent approach. In the more recent works [8, 35], zeroth-order methods based on smoothing strategies [39] are analyzed. These studies, drawing inspiration from the complexity results provided in [22] for zeroth-order methods that handle nonsmooth and nonconvex objectives, offer complexity estimates tailored to the BO setting. They rely on the assumptions that the lower-level problem can be solved with fixed precision, and that gradient descent on the lower level converges either polynomially or exponentially, respectively.

Finally, min-max DFO problems (which can be seen as a particular instance of BO) have also been tackled recently in the literature [1, 36]. Also relevant to our work are direct-search methods in the presence of noise. While previous works analyze direct-search methods with adaptive deterministic [34] and stochastic [1, 4, 41] noise, we are not aware of previous analyses of direct-search methods with bounded but non-adaptive noise.

1.2 Contributions

Our contributions can be summarized as follows.

  • We define and analyze the first inexact direct-search schemes for BO problems with general, potentially nonsmooth true objectives. These methods never require exact lower-level solutions; instead, they assume access to approximate solutions with fixed accuracy, a reasonable assumption in practice. We therefore operate in a setting different from the one considered in previous works on direct-search for BO, where true objectives are directionally differentiable [37, 44] and lower-level solutions are exact [37] or require an adaptive precision [44].

  • We analyze mesh-based direct-search schemes for BO, extending in particular the classic mesh adaptive direct-search (MADS) scheme from [3]. This is, to the best of our knowledge, the first analysis of this scheme that considers both inexact objective evaluations and the simple decrease condition for new iterates used originally in [3].

  • We give the first convergence results for direct-search schemes with bounded and non-adaptive noise on the objective.

  • We give the first convergence guarantees to \((\delta , \epsilon )\)-Goldstein stationary points for direct-search schemes applied to general nonsmooth objectives. With respect to classic analyses considering Clarke stationary points (see, e.g., [5]), these are the first results for direct-search schemes involving a quantitative measure of approximate nonsmooth stationarity.

2 Background and preliminaries

We now introduce the main assumptions considered in the paper, along with a set of helpful preliminary results that will support the subsequent convergence theory. As anticipated in the introduction, we will always assume the existence of a unique minimizer y(x) for the lower-level problem, i.e., that the LLS assumption holds.

Assumption 2.1

For any \(x \in \mathbb {R}^{n_x}\), we have that \({{\,\textrm{argmin}\,}}_{z \in Z} g(x, z) = \{y(x)\}\).

Under Assumption 2.1, the bilevel optimization problem (1) can then be rewritten as

$$\begin{aligned} \min _{x \in \mathbb {R}^{n_x}}~~~~ F(x):=f\left( x,y(x)\right) . \end{aligned}$$
(2)

However, in practical applications, it is usually necessary to employ an iterative method to compute y(x). Therefore, one cannot expect to obtain an exact value of y(x), but rather some approximation. We will hence make use of the following assumption.

Assumption 2.2

For all \(x \in \mathbb {R}^{n_x}\) we can compute an approximation \({\tilde{y}}(x)\) of y(x) such that:

$$\begin{aligned} \Vert {\tilde{y}}(x) - y(x)\Vert \le \varepsilon . \end{aligned}$$
(3)

While the remaining assumptions introduced in this section are not always needed, in the rest of this manuscript we always assume that Assumptions 2.1 and 2.2 hold.

Remark 2.1

Our analysis extends to the case where \({{\,\textrm{argmin}\,}}_{z \in Z} g(x, z)\) is not a singleton, provided that an approximate solution \({\tilde{y}}(x)\) of the simple bilevel problem

$$\begin{aligned} \min _{y\in \mathbb {R}^{n_y}} ~~~~ f(x,y),~~~~~~ \text{ s.t. }~~~~~~ y \in \displaystyle \arg \min _{z \in Z} g(x,z) \end{aligned}$$
(4)

is available for every \(x \in \mathbb {R}^{n_x}\). In fact, our convergence proofs rely on (3) rather than on the singleton assumption itself, with y(x) taken as any solution of problem (4). We refer the reader to the recent work [8] for a detailed discussion on the complexity and regularity properties of the simple bilevel problem (4).

In the next proposition, we show how condition (3) can be satisfied by applying gradient descent to \(g(x, \cdot )\), under a suitable error bound condition on \(\nabla _{y} g(x, y)\) generalizing strong convexity (see, e.g., [23] for a detailed comparison with other conditions). We also give an explicit bound on the number of iterations needed to satisfy (3).

Proposition 2.1

Assume that there exists \(c_g>0\) such that for all \(y\in Z\),

$$\begin{aligned} c_g\Vert y - y(x)\Vert \le \Vert \nabla _y g(x, y)\Vert . \end{aligned}$$
(5)

Furthermore, let \(\nabla _y g\) be \(L_g\)-Lipschitz continuous in y, uniformly in x. Let \(y_{0}(x)\) be an arbitrary initialization in the domain of \(g(x,\cdot )\), and consider the sequence

$$\begin{aligned} y_{k + 1}(x) = y_k(x) - \frac{1}{L_g} \nabla _y g(x, y_k(x)) . \end{aligned}$$
(6)

Define the solution estimate to be:

$$\begin{aligned} {\tilde{y}}(x) = {{\,\textrm{argmin}\,}}_{k \in [0: K(x)]} \Vert \nabla _y g(x, y_k(x))\Vert \end{aligned}$$
(7)

It holds that \({\tilde{y}}(x)\) satisfies (3), for

$$\begin{aligned} K(x) \ge \frac{2 L_g \left( g(x, y_{0}(x)) - g(x, y(x))\right) }{c_g^2\,\varepsilon ^2} . \end{aligned}$$
(8)

Proof

This follows from the well-known iteration complexity of gradient descent for smooth nonconvex objectives. \(\square \)
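For illustration, the construction of Proposition 2.1 can be implemented in a few lines. The following Python sketch (the experiments in Sect. 5 instead rely on Matlab's fmincon, so this is only an assumed, illustrative variant) runs the gradient iteration (6) for a prescribed number of steps K and returns the iterate selected by the rule (7); the names grad_g, L_g and K are placeholders to be supplied by the user.

```python
import numpy as np

def lower_level_oracle(x, y0, grad_g, L_g, K):
    """Approximate y(x) by gradient descent on g(x, .), cf. Proposition 2.1.

    x      : upper-level variable
    y0     : initialization y_0(x)
    grad_g : callable (x, y) -> gradient of g with respect to y
    L_g    : Lipschitz constant of grad_g in y
    K      : number of gradient steps, e.g. chosen according to (8)
    """
    y = np.asarray(y0, dtype=float)
    best_y, best_norm = y.copy(), np.linalg.norm(grad_g(x, y))
    for _ in range(K):
        y = y - grad_g(x, y) / L_g              # gradient step (6)
        g_norm = np.linalg.norm(grad_g(x, y))
        if g_norm < best_norm:                  # selection rule (7)
            best_y, best_norm = y.copy(), g_norm
    return best_y
```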

We introduce now some technical assumptions on the objective function needed in our analysis.

Assumption 2.3

The function f is lower bounded by \(f_{\text{ low }}\).

Assumption 2.4

The function f is Lipschitz continuous with respect to y with Lipschitz constant \(L_f\) (independent of x).

We remark that these assumptions are an adaptation to our bilevel setting of standard assumptions made in the analysis of direct-search methods [10, 34]. Assumptions 2.2 and 2.4 together imply that \({\tilde{F}}(x):= f(x, {\tilde{y}}(x))\) is an approximation of F(x) with accuracy \(L_f\varepsilon \). Indeed,

$$\begin{aligned} | {\tilde{F}}(x) - F(x) | = |f(x, {{\tilde{y}}}(x)) - f(x, {y}(x))| \le L_f \Vert {{\tilde{y}}}(x) - y(x)\Vert \le L_f\varepsilon . \end{aligned}$$
(9)

Some regularity of the true objective F(x) will always be necessary for our analyses. We consider both the differentiable and the potentially nondifferentiable setting.

Assumption 2.5

F(x) is Lipschitz continuous with constant \(L_F\).

Assumption 2.6

The function F is continuously differentiable, and its gradient is Lipschitz continuous with constant L.

Note that if f is Lipschitz with respect to x, and y(x) is Lipschitz continuous with respect to x, then Assumption 2.5 is satisfied. Furthermore, in the strongly convex lower-level setting there is an explicit expression for \(\nabla F\) (see, e.g., [8, Equation (3)]), implying that its Lipschitz continuity follows from that of y(x) together with suitable regularity assumptions on f and g.

2.1 Algorithm

In this section, we introduce a general direct-search algorithm for bilevel optimization that embeds both directional direct-search methods with sufficient decrease and mesh adaptive direct-search methods with simple decrease, as defined in [10]. The methods in the first class sample tentative points along a suitable set of search directions and then select as the new iterate a point satisfying a sufficient decrease condition. The methods in the second class sample points on a suitably defined mesh, and then select the new iterate according to a simple decrease condition. A tentative point t is hence accepted if the decrease condition

$$\begin{aligned} f(t, {\tilde{y}}(t)) < f(x_k, y_k) - \rho (\alpha _k) \end{aligned}$$
(10)

is satisfied, for \(\rho \) a nonnegative function. We have a sufficient decrease condition when \(\rho (t) > 0\) with \(\lim _{t \rightarrow 0^+} \rho (t)/t = 0\), and a simple decrease condition when \(\rho (t) = 0\). These two classes of decrease conditions lead to significant differences in convergence properties and consequently require different choices of the algorithm parameters. They will therefore be analyzed separately in Sects. 3 and 4, respectively.
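For illustration, the acceptance test (10) can be written as a single predicate, with the choice of \(\rho \) switching between the two regimes; this is only a schematic Python sketch with placeholder names (f, y_tilde, c), not the implementation used in Sect. 5.

```python
def accept(f, y_tilde, t, x_k, y_k, alpha_k, rho):
    """Decrease test (10): accept the tentative point t if it improves the
    incumbent value by more than the forcing term rho(alpha_k)."""
    return f(t, y_tilde(t)) < f(x_k, y_k) - rho(alpha_k)

def rho_sufficient(t, c=1e-3):
    """Sufficient decrease (Sect. 3): rho(t) = (c/2) t^2, positive with rho(t)/t -> 0."""
    return 0.5 * c * t ** 2

def rho_simple(t):
    """Simple decrease (Sect. 4): rho identically zero."""
    return 0.0
```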

Algorithm 1: DS for bilevel optimization

The detailed scheme (see Algorithm 1) follows the lines of the general schemes proposed in [10] and [27], with the addition of calls to the lower-level oracle \({\tilde{y}}(x)\) and an explicit reference to the mesh used in mesh-based schemes. At steps 3–6, the algorithm searches for a new iterate by testing the upper-level objective at \((t, {\tilde{y}}(t))\) for t in a subset \(S_k\) of the mesh \(M_k\). If the search is not successful, the method generates a new iterate by selecting a set of search directions \(D_k\) and testing the upper-level objective at \((t, {\tilde{y}}(t))\) for t chosen along the search directions using a stepsize \(\alpha _k\) (see steps 7–12). Steps 9, 11 and 13 update the iterate and the algorithm parameters based on the outcome of the search and poll steps and on the computed function evaluations. For the set of directions \(D_k\), we require in some cases a positive cosine measure, that is

$$\begin{aligned} {{\,\textrm{cm}\,}}(D_k) {\mathop {=}\limits ^{d}} \min _{v \ne 0_{\mathbb {R}^{n_x}}} \max _{d\in D_k} \frac{d^\top v}{\Vert d\Vert \Vert v\Vert } \ge \kappa , \end{aligned}$$
(11)

for some \(\kappa > 0\).
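For example, for the coordinate positive spanning set \(D_\oplus = \{\pm e_1, \dots , \pm e_{n_x}\}\) (the choice behind the Coordinate-DS solver of Sect. 5), the worst case is attained at \(v = (1, \dots , 1)^\top \), and a direct computation gives

$$\begin{aligned} {{\,\textrm{cm}\,}}(D_\oplus ) = \min _{v \ne 0} \max _{1 \le i \le n_x} \frac{|v_i|}{\Vert v\Vert } = \frac{1}{\sqrt{n_x}} , \end{aligned}$$

so that (11) holds with \(\kappa = 1/\sqrt{n_x}\).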

3 Sufficient decrease condition

In this section, we analyze directional direct-search methods using a sufficient decrease condition with \(\rho (t) = \frac{c}{2}t^2\). We first focus on potentially nonsmooth objectives, and then on smooth ones. In both cases we consider the scheme presented in Algorithm 2, which can be viewed as an adaptation to BO of classic generating set search (GSS) schemes (see, e.g., [26, Algorithm 3.2]). In order to handle the error introduced by the approximate solution of the lower-level problem, we lower bound the stepsize by a constant \(\alpha _{\min }\). We further notice that, thanks to the sufficient decrease condition, maintaining a mesh is not necessary, and we therefore simply set \(M_k = \mathbb {R}^{n_x}\).

Algorithm 2: Inexact directional DS for bilevel optimization
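Since the pseudocode figure is not reproduced here, the following Python sketch illustrates one possible organization of such a scheme, consistent with the description above (poll along the directions in \(D_k\), sufficient decrease test with \(\rho (t) = \frac{c}{2}t^2\), expansion by \(\gamma \) on success, contraction by \(\theta \) on failure, stepsize lower bound \(\alpha _{\min }\), and the stopping rule used in the smooth case of Sect. 3.2). It is only an assumed illustration with placeholder names, not the exact Algorithm 2; the experiments of Sect. 5 use a Matlab implementation.

```python
import numpy as np

def inexact_directional_ds(x0, f, y_tilde, directions, alpha0=1.0,
                           alpha_min=1e-6, gamma=2.0, theta=0.5, c=1e-3,
                           max_iter=1000):
    """Sketch of an inexact directional direct-search loop for problem (2).

    f          : upper-level objective f(x, y)
    y_tilde    : inexact lower-level oracle, x -> approximation of y(x)
    directions : callable k -> iterable of poll directions D_k (unit norm)
    """
    x = np.asarray(x0, dtype=float)
    alpha = alpha0
    f_x = f(x, y_tilde(x))                         # incumbent value F~(x_k)
    for k in range(max_iter):
        success = False
        for d in directions(k):
            t = x + alpha * np.asarray(d)
            f_t = f(t, y_tilde(t))
            if f_t < f_x - 0.5 * c * alpha ** 2:   # sufficient decrease (10)
                x, f_x, success = t, f_t, True
                break
        if success:
            alpha *= gamma                         # expand on success
        else:
            new_alpha = max(theta * alpha, alpha_min)
            if alpha == alpha_min and new_alpha == alpha_min:
                break                              # stepsize stalls at alpha_min, cf. (22)
            alpha = new_alpha
    return x, alpha
```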

3.1 Nonsmooth objectives

First, we present convergence guarantees, and proofs thereof, for a variant of Algorithm 2 designed for the case of Lipschitz continuous true objectives, i.e., under Assumption 2.5. With respect to the general scheme presented as Algorithm 2, here \(D_k = \{g_k\}\) with \(g_k\) generated on the unit sphere. We remark that this is a standard choice for direct-search algorithms applied to nonsmooth objectives (see, e.g., [16, Algorithm \(\text {DFN}_{simple}\)]). The stepsize lower bound here must be strictly positive (i.e., \(\alpha _{\min }> 0\)). This, together with the sufficient decrease condition, ensures that the sequence generated by the algorithm is eventually constant, as proved in Lemma 3.1. We then use a novel argument to prove that the limit point of the sequence is a \((\delta , \epsilon )\)-Goldstein stationary point. Although such a notion of stationarity has recently gained attention in the analysis of zeroth-order smoothing-based approaches [22, 28, 40], including extensions to BO [8, 35], to the best of our knowledge it has never been used in the analysis of direct-search methods. It is further important to notice that convergence of directional direct-search methods to \((\delta , \epsilon )\)-Goldstein stationary points in the nonsmooth case is a novel result also for classic optimization problems. We now recall some useful definitions. If \(B_{\delta }(x)\) is the ball of radius \(\delta \) centered at x, then the \(\delta \)-Goldstein subdifferential (see, e.g., [28]) is defined as

$$\begin{aligned} \partial _{\delta } F(x) = {{\,\textrm{conv}\,}}\left\{ \bigcup _{y \in B_{\delta }(x)} \partial F(y) \right\} , \end{aligned}$$
(12)

and x is a \((\delta , \epsilon )\)-Goldstein stationary point for the function F if \(\Vert g\Vert \le \epsilon \) for some \(g \in \partial _{\delta }F(x)\).
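For example, for the one-dimensional function \(F(x) = |x|\), whose Clarke subdifferential is \(\partial F(y) = \{\mathrm {sign}(y)\}\) for \(y \ne 0\) and \(\partial F(0) = [-1, 1]\), we have

$$\begin{aligned} \partial _{\delta } F(x) = [-1, 1] \quad \text{ whenever } |x| < \delta , \end{aligned}$$

so every point within distance \(\delta \) of the minimizer is \((\delta , \epsilon )\)-Goldstein stationary for any \(\epsilon \ge 0\), whereas only \(x = 0\) is Clarke stationary. This illustrates the quantitative, relaxed nature of this stationarity measure.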

We can now proceed with our convergence analysis. As anticipated, we start by proving that the sequence of iterates generated by our method is eventually constant.

Lemma 3.1

Let Assumptions 2.3 and 2.4 hold. Then there exists \({\bar{k}}\in \mathbb {N}_0\) such that the sequence \(\{x_k\}\) generated by Algorithm 2 is constant for \(k \ge {\bar{k}}\).

Proof

Notice that \(\{{\tilde{F}}(x_k)\}\) is non-increasing, with \({\tilde{F}}(x_k) = {\tilde{F}}(x_{k + 1})\) after an unsuccessful step, and

$$\begin{aligned} {\tilde{F}}(x_{k + 1}) < {\tilde{F}}(x_k) - \frac{c}{2}\alpha _k^2 \le {\tilde{F}}(x_k) - \frac{c}{2} \alpha _{\min }^2 \end{aligned}$$
(13)

after a successful step. Thus there can be at most

$$\begin{aligned} \frac{2\left( {\tilde{F}}(x_0) - \inf _{x \in \mathbb {R}^n} {\tilde{F}}(x)\right) }{c \alpha _{\min }^2} \le \frac{2\left( {\tilde{F}}(x_0) - f_{\text {low}} + L_f\varepsilon \right) }{c \alpha _{\min }^2} \, \end{aligned}$$
(14)

successful steps, where we used \({\tilde{F}}(x) \ge F(x) - L_f\varepsilon \ge f_{\text {low}} - L_f \varepsilon \) in the inequality. Since this quantity is finite, this implies that \(\{x_k\}\) is eventually constant. \(\square \)

We now prove convergence of our algorithm to \((\delta ,\epsilon )\)-Goldstein stationary points. In order to get our convergence result, we need to assume that the sequence \(\{g_k\}\) is dense in the unit sphere. We remark that such a dense sequence can be generated using a suitable quasirandom sequence (see, e.g., [19, 33]).
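One simple way to generate such a sequence in practice is to map a low-discrepancy sequence to the sphere; the Python sketch below (an assumed recipe relying on SciPy's quasi-Monte Carlo module, not a choice prescribed by our analysis) normalizes Halton points pushed through the Gaussian inverse CDF.

```python
import numpy as np
from scipy.stats import norm, qmc

def dense_sphere_directions(n, num, seed=0):
    """Generate `num` directions on the unit sphere of R^n from a scrambled
    Halton sequence; the sequence fills the sphere increasingly densely."""
    u = qmc.Halton(d=n, scramble=True, seed=seed).random(num)
    u = np.clip(u, 1e-12, 1 - 1e-12)   # keep the inverse CDF finite
    z = norm.ppf(u)                    # push to (approximately) Gaussian points
    return z / np.linalg.norm(z, axis=1, keepdims=True)

# Example: candidate directions g_k for Algorithm 2 in R^5
G = dense_sphere_directions(n=5, num=1000)
```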

Theorem 3.1

Let Assumptions 2.3, 2.4 and 2.5 hold. Assume that \(\{g_k\}\) is dense in the unit sphere. Then the sequence \(\{x_k\}\) generated by Algorithm 2 is eventually constant, with the unique limit point \((\delta ,\epsilon )\)-Goldstein stationary, for

$$\begin{aligned} \epsilon = \frac{4L_f\varepsilon }{\alpha _{\min }} + c\alpha _{\min }~~\text{ and } \quad \delta = \alpha _{\min }. \end{aligned}$$
(15)

Proof

First, \(\{x_k\}\) is eventually constant as seen in Lemma 3.1. Let \({\bar{x}}\) be the unique limit point. By the stepsize updating rule, we have that every iteration must be unsuccessful with \(\alpha _k = \alpha _{\min }\) for k large enough. Then, there exists \({\bar{k}} \in \mathbb {N}\) large enough such that for every \(k \ge {\bar{k}}\)

$$\begin{aligned} {\tilde{F}}({\bar{x}}) < {\tilde{F}}({\bar{x}} + \alpha _k g_k) + \frac{c}{2}\alpha _{\min }^2 = {\tilde{F}}({\bar{x}} + \alpha _{\min }g_k) + \frac{c}{2}\alpha _{\min }^2 \end{aligned}$$
(16)

implying

$$\begin{aligned} F({\bar{x}}) < F({\bar{x}} + \alpha _{\min }g_k) + \frac{c}{2}\alpha _{\min }^2 + 2L_f\varepsilon . \end{aligned}$$
(17)

By the density of \(\{g_k\}\) it follows

$$\begin{aligned} F({\bar{x}}) < F({\bar{x}} + d) + \frac{c}{2}\alpha _{\min }^2 + 2L_f\varepsilon \end{aligned}$$
(18)

for every d such that \(\Vert d\Vert = \alpha _{\min }\).

We now define the function \({\bar{F}}_{{\bar{x}}}(d):= F({\bar{x}} + d) + (\frac{c}{2} + \frac{2L_f\varepsilon }{\alpha _{\min }^2}) \Vert d\Vert ^2\). Since

$$\begin{aligned} {\bar{F}}_{{\bar{x}}}(0) < {\bar{F}}_{{\bar{x}}}(d) \end{aligned}$$
(19)

for every d such that \(\Vert d\Vert = \alpha _{\min }\) by (18), there must be a \({\tilde{d}} \in {{\,\textrm{argmin}\,}}_{\Vert d\Vert \le \alpha _{\min }} {\bar{F}}_{{\bar{x}}}(d)\) with \(\Vert {\tilde{d}}\Vert < \alpha _{\min }\). We can conclude

$$\begin{aligned} 0 \in \partial {\bar{F}}_{{\bar{x}}}({\tilde{d}}) = \partial F({{\bar{x}}} + {\tilde{d}}) + \left( c + \frac{4L_f\varepsilon }{\alpha _{\min }^2}\right) {\tilde{d}} \end{aligned}$$
(20)

Equivalently, \(g = -(c + \frac{4L_f\varepsilon }{\alpha _{\min }^2}){\tilde{d}} \in \partial F({\bar{x}} + {\tilde{d}})\), and since \(\partial F({\bar{x}} + {\tilde{d}}) \subset \partial _{\alpha _{\min }} F({\bar{x}})\) we have \(g \in \partial _{\alpha _{\min }} F({\bar{x}})\). To conclude, observe that \(\Vert g\Vert < c \alpha _{\min }+ \frac{4L_f\varepsilon }{\alpha _{\min }} \). \(\square \)

As a corollary of Theorem 3.1, for \(\alpha _{\min }\propto \sqrt{\varepsilon }\) we are able to get a \((\mathcal {O}(\sqrt{\varepsilon }), \mathcal {O}(\sqrt{\varepsilon }))\)-Goldstein stationary point. Interestingly, the order of magnitude \(\mathcal {O}(\sqrt{\varepsilon })\) of the approximation error coincides with that of typical gradient approximation methods [7], as well as with that of direct-search in the smooth setting, as we shall see in the next section.

Corollary 3.1

Let Assumptions 2.3, 2.4 and 2.5 hold. Assume that \(\{g_k\}\) is dense in the unit sphere. Then the sequence \(\{x_k\}\) generated by Algorithm 2 with \(\alpha _{\min }= 2\sqrt{\frac{L_f\varepsilon }{c}}\) is eventually constant, with the unique limit point \((\delta ,\epsilon )\)-Goldstein stationary, for

$$\begin{aligned} \epsilon = 4\sqrt{L_f\varepsilon c}~~\text{ and } \quad \delta = 2\sqrt{\frac{L_f\varepsilon }{c}} . \end{aligned}$$
(21)
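As a concrete illustration (with values chosen only for this example), take \(L_f = 1\), \(c = 10^{-3}\) (the forcing constant used in Sect. 5) and a lower-level accuracy \(\varepsilon = 10^{-6}\). Corollary 3.1 then prescribes \(\alpha _{\min }= 2\sqrt{10^{-6}/10^{-3}} \approx 6.3 \times 10^{-2}\), and the limit point is \((\delta , \epsilon )\)-Goldstein stationary with \(\delta \approx 6.3 \times 10^{-2}\) and \(\epsilon = 4\sqrt{10^{-6}\cdot 10^{-3}} \approx 1.3 \times 10^{-4}\).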

3.2 Smooth objectives

We now focus on the case where the objective F is smooth, i.e., under Assumption 2.6. We consider here a variant of Algorithm 2 with \(D_k\) a positive spanning set. When the stepsize lower bound is strictly positive, we use as termination criterion in step 14 the condition

$$\begin{aligned} \alpha _k = \alpha _{k + 1} = \alpha _{\min }. \end{aligned}$$
(22)

Our scheme can hence be seen as a variant of classic direct-search methods for smooth objectives [10, 26]. It is important to highlight that this is the first analysis of direct-search methods for smooth objectives under bounded noise. The only analysis of noisy direct-search methods we are aware of in the smooth case is the one given in [14] for stochastic noise, which, however, focuses on classic single-level optimization problems.

We first extend to our bounded-error setting a standard result that provides an upper bound on the gradient norm at unsuccessful iterations (see, e.g., [26, Theorem 3.3]).

Lemma 3.2

Let Assumptions 2.4 and 2.6 hold, together with (11). Let \(\{x_k\}\) be a sequence generated by Algorithm 2. If the iteration k is unsuccessful, then

$$\begin{aligned} \Vert \nabla F(x_k)\Vert \le \frac{1}{\kappa } \left( \frac{(L + c)\alpha _k}{2} + \frac{2L_f\varepsilon }{\alpha _k} \right) . \end{aligned}$$
(23)

Proof

Let \(d \in D_k\) be such that

$$\begin{aligned} -\nabla F(x_k)^\top d \ge \kappa \Vert \nabla F(x_k)\Vert \Vert d\Vert . \end{aligned}$$
(24)

We have

$$\begin{aligned}{} & {} \kappa \alpha _k \Vert \nabla F(x_k)\Vert \Vert d\Vert - \alpha _k^2 \frac{L}{2}\Vert d\Vert ^2 \nonumber \\{} & {} \le - \alpha _k \nabla F(x_k)^\top d - \alpha _k^2 \frac{L}{2}\Vert d\Vert ^2 \le F(x_k) - F(x_k + \alpha _k d) \nonumber \\{} & {} \le {\tilde{F}}(x_k) - {\tilde{F}}(x_k + \alpha _k d) + 2L_f\varepsilon \le \frac{c}{2}\alpha _k^2 + 2L_f\varepsilon , \end{aligned}$$
(25)

where we used (24) in the first inequality, the standard descent lemma in the second inequality, (9) in the third inequality, and that the step is unsuccessful in the last inequality. Therefore, since by assumption \(\Vert d\Vert = 1\)

$$\begin{aligned} \kappa \alpha _k \Vert \nabla F(x_k)\Vert = \kappa \alpha _k \Vert \nabla F(x_k)\Vert \Vert d\Vert \le \frac{c}{2}\alpha _k^2 + 2L_f\varepsilon + \alpha _k^2 \frac{L}{2}\Vert d\Vert ^2 = \frac{c}{2}\alpha _k^2 + 2L_f\varepsilon + \alpha _k^2 \frac{L}{2} , \end{aligned}$$
(26)

implying the thesis. \(\square \)

In [34], convergence of a linesearch scheme is analyzed in the noisy case (i.e., additive noise smaller than the stepsize), and a result analogous to Lemma 3.2 is given.
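The specific value of \(\alpha _{\min }\) appearing in point 2 of Theorem 3.2 below comes from balancing the two terms on the right-hand side of (23):

$$\begin{aligned} \min _{\alpha > 0} \frac{1}{\kappa }\left( \frac{(L + c)\alpha }{2} + \frac{2L_f\varepsilon }{\alpha }\right) = \frac{2}{\kappa }\sqrt{(L + c)L_f\varepsilon }, \qquad \text{ attained } \text{ at } \alpha = 2\sqrt{\frac{L_f\varepsilon }{L + c}} . \end{aligned}$$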

We now prove convergence and complexity bounds both when \(\alpha _{\min }> 0\), extending those given in [43] for the exact oracle case, and when \(\alpha _{\min }=0\). We notice that in this second case we lose finite termination, and our guarantees are thus somewhat weaker: we are only able to prove that the stepsize converges to 0 and that at some iteration the gradient norm is \(\mathcal {O}(\sqrt{\varepsilon })\).

Theorem 3.2

Let Assumptions 2.3, 2.4 and 2.6 hold, together with (11) for every \(k \in \mathbb {N}_0\). Let \(\{x_k\}\) be a sequence generated by Algorithm 2.

  1. 1.

    If \(\alpha _{\min }> 0\), then the algorithm satisfies the termination condition (22) after \({\bar{k}}\) iterations, with

    $$\begin{aligned} {\bar{k}} < 1 + \frac{2}{\alpha _{\min }^2c}({{\tilde{F}}}(x_0) - f_{low} + 2L_f\varepsilon )\left( 1 - \frac{\ln \gamma }{\ln \theta }\right) + \frac{\ln \alpha _{\min }- \ln \alpha _0}{\ln \theta } , \end{aligned}$$
    (27)

    and its last iterate \(x_{{\bar{k}}}\) is such that

    $$\begin{aligned} \Vert \nabla F(x_{{\bar{k}}})\Vert \le \frac{1}{\kappa } \left( \frac{(L + c)\alpha _{\min }}{2} + \frac{2L_f\varepsilon }{\alpha _{\min }} \right) . \end{aligned}$$
    (28)
  2. 2.

    If, furthermore, it holds that \(\alpha _{\min }= 2\sqrt{\frac{L_f \varepsilon }{L + c}}\), then

    $$\begin{aligned} \Vert \nabla F(x_{{\bar{k}}})\Vert \le \frac{2}{\kappa } \sqrt{(c + L) L_f \varepsilon } . \end{aligned}$$
    (29)
  3. 3.

    If \(\alpha _{\min }= 0\), then \(\alpha _k \rightarrow 0\), and if additionally \(\alpha _0 \ge \bar{\alpha }_{\min }= 2\sqrt{\frac{L_f \varepsilon }{L + c}}\), for some \({\bar{k}} \in \mathbb {N}_0\) we have

    $$\begin{aligned} \Vert \nabla F(x_{{\bar{k}}})\Vert \le \frac{1}{\theta \kappa } \left( \frac{(L + c)\bar{\alpha }_{\min }}{2} + \frac{2L_f\varepsilon }{\bar{\alpha }_{\min }}\right) , \end{aligned}$$
    (30)

    and

    $$\begin{aligned} F(x_k) \le F(x_{{\bar{k}}}) + 2L_f\varepsilon \quad \text { for all }k\ge {\bar{k}} . \end{aligned}$$
    (31)

Proof

1. Let \(k_s\) and \(k_{ns}\) be the number of successful and unsuccessful steps, so that \(k_s + k_{ns} = k\). Reasoning as in Lemma 3.1, we obtain by (14)

$$\begin{aligned} k_s < \frac{2}{\alpha _{\min }^2c}(F(x_0) - f_{\text {low}} + 2L_f\varepsilon ) . \end{aligned}$$
(32)

Furthermore, since

$$\begin{aligned} \alpha _{\min }\le \alpha _k \le \alpha _0\gamma ^{k_{s}}\theta ^{k_{ns} - 1} , \end{aligned}$$
(33)

we get

$$\begin{aligned} \begin{aligned} k_{ns}&\le 1 -\frac{1}{\ln (\theta )}(\ln (\alpha _0) - \ln (\alpha _{\min }) + k_{s}\ln (\gamma )) \\&\le 1 -\frac{1}{\ln (\theta )}(\ln (\alpha _0) - \ln (\alpha _{\min }) + \frac{2}{\alpha _{\min }^2c}({{\tilde{F}}}(x_0) - f_{\text {low}} + 2L_f\varepsilon )\ln (\gamma )) , \end{aligned} \end{aligned}$$
(34)

where we applied (32) in the second inequality. Combining the bounds on the successful and unsuccessful steps (32) and (34), we have

$$\begin{aligned} k = k_{s} + k_{ns} < 1 + \frac{2}{\alpha _{\min }^2c}({{\tilde{F}}}(x_0) - f_{low} + 2L_f\varepsilon )\left( 1 - \frac{\ln \gamma }{\ln \theta }\right) + \frac{\ln \alpha _{\min }- \ln \alpha _0}{\ln \theta } ,\nonumber \\ \end{aligned}$$
(35)

as desired.

2. Follows from a direct application of the first result.

3. Reasoning as in the first result, the number of successful steps with stepsize above a certain threshold is bounded, hence \(\alpha _k \rightarrow 0\). Furthermore, for any \({\bar{k}} \in \mathbb {N}_0\), if \(k \ge {\bar{k}}\)

$$\begin{aligned} F(x_k) \le {\tilde{F}}(x_{k}) + L_f\varepsilon \le {\tilde{F}}(x_{{\bar{k}}}) + L_f\varepsilon \le F(x_{{\bar{k}}}) + 2L_f\varepsilon , \end{aligned}$$
(36)

which proves (31). Let \(\bar{\alpha }_{\min }= 2\sqrt{\frac{L_f \varepsilon }{L + c}}\). Since \(\alpha _0 \ge \bar{\alpha }_{\min }\), and \(\alpha _k \rightarrow 0\) with contraction factor \(\theta \), we must have \(\alpha _{{\bar{k}}} \in [\theta \bar{\alpha }_{\min }, \bar{\alpha }_{\min }]\) for some \({\bar{k}} \in \mathbb {N}_0\). Then (30) follows from (23) for \(\alpha _k = \alpha _{{\bar{k}}}\). \(\square \)

We now extend to our setting the \(\mathcal {O}(n^2/\epsilon ^2)\) complexity result given in [43, Corollary 2]. For a fixed precision \(\epsilon \), an approximation error \(\varepsilon = \mathcal {O}(\epsilon ^2)\) is required, as for classic gradient approximation schemes [7].

Corollary 3.2

Let Assumptions 2.3, 2.4 and 2.6 hold, together with (11) for every \(k \in \mathbb {N}_0\). Let \(\{x_k\}\) be a sequence generated by Algorithm 2. Assume also that \(\varepsilon \le \epsilon ^2 \kappa ^2\), that at every iteration there are at most \(d_1n\) function evaluations, and that \(\kappa \ge d_2/\sqrt{n}\), for \(d_1, d_2 > 0\). Then, if \(\alpha _{\min }~=~2\sqrt{\frac{L_f \varepsilon }{L + c}}\), the algorithm terminates after \(\mathcal {O}(n^2/\epsilon ^2)\) function evaluations with \(\Vert \nabla {F}(x_{{\bar{k}}})\Vert \le d_3 \epsilon \), for \(d_3 > 0\) depending only on c, L and \(L_f\).

Proof

Follows from points 1 and 2 of Theorem 3.2, plugging in the parameters specified in the assumptions. \(\square \)

4 Simple decrease condition

In this section, we analyze two methods based on a simple decrease condition (i.e., with \(\rho (t)~=~0\) in (10)), one for potentially nonsmooth objectives and one for smooth objectives. Both methods follow the scheme presented in Algorithm 3, which is an adaptation to the BO setting of the mesh adaptive direct-search algorithm (MADS, see [2] and references therein). Again we lower bound the stepsize by a constant \(\alpha _{\min }\). The stepsize updating rule we use to handle unsuccessful iterations depends on the frame size parameter \(\Delta _k\), on the contraction coefficient \(\theta \), and on the smoothness of the true objective (i.e., the update differs between the smooth and the nonsmooth case).

It is a standard assumption in the analysis of MADS that all the iterates lie in a compact set (see, e.g., [3, Section 3]). In our framework, this can be ensured if the following boundedness assumption is satisfied.

Assumption 4.1

The set

$$\begin{aligned} \mathcal {L}_{\varepsilon } = \{x \in \mathbb {R}^{n_x} \ | \ F(x) \le F(x_0) + 2L_f\varepsilon \} \end{aligned}$$
(37)

is bounded.

The mesh, as defined in the literature (see, e.g., [5, 10] and references therein for further details), is a discrete set of points from which the algorithm selects candidate trial points. Its coarseness is parameterized by the mesh size parameter \(\delta \). The goal of each iteration of the algorithm is to find a mesh point whose objective function value improves on the incumbent value. Given a positive spanning set D and a center x, the related mesh is formally defined as follows:

$$\begin{aligned} M = \{ x + \delta Dy \ | \ y \in \mathbb {N}^p\} , \end{aligned}$$
(38)

where, with a slight abuse of notation, we use D also for the matrix \(D\in \mathbb {R}^{n \times p}\) with columns corresponding to the elements of the set D. We notice that the mesh is just a conceptual tool, and is never actually constructed.
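Purely for illustration (the names below are placeholders), a finite portion of the mesh (38) can be enumerated as follows in Python; as just noted, the algorithms never build it explicitly.

```python
import numpy as np
from itertools import product

def mesh_points(x, delta, D, max_coef=2):
    """Enumerate a finite portion of the mesh M = {x + delta * D @ y : y in N^p}
    from (38), truncating each nonnegative integer coefficient at `max_coef`.
    D is an n-by-p matrix whose columns form a positive spanning set."""
    n, p = D.shape
    pts = [x + delta * D @ np.array(y, dtype=float)
           for y in product(range(max_coef + 1), repeat=p)]
    return np.unique(np.round(np.array(pts), 12), axis=0)

# Example: mesh centered at the origin of R^2 with D = [I, -I] and delta = 0.5
D = np.hstack([np.eye(2), -np.eye(2)])
M = mesh_points(np.zeros(2), 0.5, D)
```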

Algorithm 3: Inexact mesh-based DS for bilevel optimization

4.1 Nonsmooth objectives

With respect to the general scheme presented in Algorithm 3, here the stepsize updating rule for unsuccessful iterations is given by \(\alpha _u(\alpha _k, \Delta _k, \theta ) = \min (\Delta _k, \Delta _k^2, \theta \alpha _k)\), ensuring that \(\alpha _k \rightarrow 0\) and that the mesh gets infinitely dense if the algorithm gets stuck at a certain point. The set of search directions \(D_k\) must be such that

$$\begin{aligned} \frac{\Delta _k}{\alpha _k} b_1(\alpha _k) \le \Vert d\Vert \le \frac{\Delta _k}{\alpha _k} b_2(\alpha _k) \end{aligned}$$
(39)

for all \(d \in D_k\), with \(b_i: \mathbb {R}_{> 0} \rightarrow \mathbb {R}_{> 0}\) such that \(\lim _{t \rightarrow 0}b_i(t) = 1\) for \(i\in \{1, 2\}\). Thus, with respect to the classic MADS scheme, the frame size \(\Delta _k\) here defines not only an upper bound but also a lower bound on the distance between the current iterate and the tentative points selected in the poll step. This adjustment is necessary due to the error in the evaluation of the true objective. As shown in the next lemma, condition (39) ensures that, as the stepsize converges to 0, the tentative steps get closer and closer to the boundary of a ball of radius \(\alpha _{\min }\).
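A schematic Python sketch of the quantities just introduced (placeholder names; in the nonsmooth case \(b_1\) and \(b_2\) are functions of \(\alpha _k\) tending to 1, replaced here by a constant for simplicity):

```python
import numpy as np

def alpha_unsuccessful_nonsmooth(alpha, Delta, theta):
    """Stepsize update after an unsuccessful iteration in the nonsmooth case:
    alpha_u = min(Delta, Delta^2, theta * alpha), which drives alpha to 0."""
    return min(Delta, Delta ** 2, theta * alpha)

def scale_direction(d_unit, alpha, Delta, b=1.0):
    """Scale a unit direction d_unit so that condition (39) holds: any norm in
    [b1 * Delta / alpha, b2 * Delta / alpha] is admissible, and here we take
    ||d|| = b * Delta / alpha, so that the tentative poll point x + alpha * d
    lies at distance b * Delta from the current iterate."""
    return (Delta / alpha) * b * np.asarray(d_unit, dtype=float)
```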

Lemma 4.1

Assume that \(\alpha _{\min }> 0\) and that (39) holds. Then if \(\lim _{k \in K} \alpha _k = 0\), the set of limit points of \(\{\alpha _kD_k\}_{k \in K}\) is contained in \(S^{n_x - 1}(\alpha _{\min })\).

Proof

If \(\lim _{k \in K} \alpha _k = 0\), then for \(k \in K\) large enough it holds that \(\Delta _k = \alpha _{\min }\). Consider any sequence \(\{d_k\}\) with \(d_k \in D_k\). It holds that

$$\begin{aligned} \limsup _{k \in K} \Vert \alpha _k d_k\Vert \le \limsup _{k \in K} \Delta _k b_2(\alpha _k) = \alpha _{\min }, \end{aligned}$$
(40)

where we applied (39) in the inequality. Analogously, we can prove \(\liminf _{k \in K} \Vert \alpha _k d_k\Vert \ge \alpha _{\min }\), whence \(\lim _{k \in K} \Vert \alpha _k d_k\Vert = \alpha _{\min }\), which implies the thesis. \(\square \)

We now extend to this scheme the \((\delta , \epsilon )\)-Goldstein stationarity result proved under the sufficient decrease condition in Sect. 3.1. Also in this case we are not aware of any analogous result for the standard MADS scheme, which is instead known to converge to Clarke stationary points [3].

We start with a lemma that extends a well known property of MADS (see, e.g., [3, Proposition 3.1]) to our bilevel setting.

Lemma 4.2

Let Assumptions 2.4, 2.5 and 4.1 hold. Then the sequence \(\{\alpha _k\}\) generated by Algorithm 3 is such that \(\liminf \alpha _k = 0\).

Proof

Since \(\{{\tilde{F}}(x_k)\}\) is non-increasing (and strictly decreasing at successful iterations), \(\{x_k\}\) is contained in the set \(\mathcal {L}_{\varepsilon }\), which is compact by Assumptions 2.5 and 4.1. Thus \(\liminf \alpha _k = 0\) follows from the finiteness of the set of mesh points that can be generated in \(\mathcal {L}_{\varepsilon }\) while keeping the parameter \(\alpha _k\) lower bounded, which can be proved with the same arguments used for MADS in [3, Proposition 3.1]. \(\square \)

We can now state our main result.

Theorem 4.1

Let Assumptions 2.4, 2.5 and 4.1 hold. Let K be a subset of the indices of unsuccessful iterations of Algorithm 3. Let us further assume that:

  • \(\lim _{k \in K} x_k = {\bar{x}}\);

  • \(\lim _{k \in K} \alpha _k = 0\);

  • \(\{{\hat{D}}_k\}_{k \in K}\) is dense in the unit sphere, with \({\hat{D}}_k = \{ \frac{d}{\Vert d\Vert } \ | \ d\in D_k\}\);

  • Condition (39) holds.

Then, the limit point \({\bar{x}}\) of \(\{x_k\}_{k \in K}\) is \((\delta , \epsilon )\)-Goldstein stationary, for

$$\begin{aligned} \epsilon = \frac{4L_f\varepsilon }{\alpha _{\min }}~~\text{ and } \quad \delta = \alpha _{\min }. \end{aligned}$$
(41)

Proof

Let \({\bar{d}} \in \mathbb {R}^{n_x}\) with \(\Vert {\bar{d}}\Vert = 1\), and let \(L \subset K\) be such that \(\lim _{k \in L} \frac{d_k}{\Vert d_k\Vert } = {\bar{d}}\), with \(d_k \in D_k\). Then \(\alpha _k d_k \rightarrow \alpha _{\min }{\bar{d}}\) by Lemma 4.1. Now, for every \(k \in L\)

$$\begin{aligned} F(x_k) - F(x_k + \alpha _k d_k) \le {\tilde{F}}(x_k) - {\tilde{F}}(x_k + \alpha _k d_k) + 2L_f\varepsilon \le 2L_f\varepsilon , \end{aligned}$$
(42)

where the first inequality follows from (9), and we used that the step k is unsuccessful in the second inequality. Passing to the limit, we obtain

$$\begin{aligned} F({\bar{x}}) \le F({\bar{x}} + \alpha _{\min }{\bar{d}}) + 2L_f\varepsilon . \end{aligned}$$
(43)

Now let \({\bar{F}}_{{\bar{x}}}(d) = F({\bar{x}} + d) + \frac{2L_f\varepsilon }{\alpha _{\min }^2} \Vert d\Vert ^2\). By applying (43) we get

$$\begin{aligned} {\bar{F}}_{{\bar{x}}}(0) \le {\bar{F}}_{{\bar{x}}}(\alpha _{\min }{\bar{d}}), \end{aligned}$$

and given that \({\bar{d}}\) is arbitrary, this holds for any d such that \(\Vert d\Vert = \alpha _{\min }\). The thesis then follows as in the proof of Theorem 3.1. \(\square \)

As in Sect. 3.1, here we also have a corollary showing that for \(\alpha _{\min }\propto \sqrt{\varepsilon }\) we are able to get a \((\mathcal {O}(\sqrt{\varepsilon }), \mathcal {O}(\sqrt{\varepsilon }))\)-Goldstein stationary point.

Corollary 4.1

Under the assumptions of Theorem 4.1, the limit point \({\bar{x}}\) of the sequence \(\{x_k\}\) generated by Algorithm 3 with \(\alpha _{\min }= 2\sqrt{L_f\varepsilon }\) is \((\delta ,\epsilon )\)-Goldstein stationary, for

$$\begin{aligned} \epsilon = \delta = 2\sqrt{L_f\varepsilon } . \end{aligned}$$
(44)

4.2 Smooth objectives

Now we consider the case where the true objective is smooth, i.e., Assumption 2.6 holds. With respect to the general scheme reported in Algorithm 3, we have \(\alpha _u(\alpha _k, \Delta _k, \theta ) = \min (\Delta _k, \Delta _k^2)\), and the termination condition in step 14 is (22), as in Sect. 3.2. As for \(D_k\), it must always satisfy \({{\,\textrm{cm}\,}}(D_k) \ge \kappa \) for some positive \(\kappa \) independent of k, as well as

$$\begin{aligned} \frac{\Delta _k}{\alpha _k}b_1 \le \Vert d\Vert \le \frac{\Delta _k}{\alpha _k}b_2 \end{aligned}$$
(45)

for every \(d \in D_k\).

We remark that convergence of mesh-based schemes for smooth objectives is well understood (see, e.g., [5, Chapter 7]), so that once again our main contribution here is the adaptation to the bilevel setting. We begin our analysis by extending Lemma 3.2 under the simple decrease condition and condition (45) on the search directions.

Lemma 4.3

Let Assumptions 2.4 and 2.6 hold, together with (11) and (45). Let \(\{x_k\}\) be a sequence generated by Algorithm 3. If step k is unsuccessful, then

$$\begin{aligned} \Vert \nabla F(x_{k})\Vert \le \frac{1}{\kappa } \left( \frac{b_2 \Delta _k L}{2} + \frac{2L_f\varepsilon }{b_1 \Delta _k} \right) . \end{aligned}$$
(46)

Proof

Since the step is unsuccessful, by considering \(d \in D_{k}\) such that

$$\begin{aligned} -\nabla F(x_k)^{\top } d \ge \kappa \Vert \nabla F(x_k)\Vert \Vert d\Vert \end{aligned}$$
(47)

we have, reasoning as in (25) with \(c = 0\)

$$\begin{aligned} \kappa \alpha _k \Vert \nabla F(x_k)\Vert \Vert d\Vert - \alpha _k^2 \frac{L}{2}\Vert d\Vert ^2 \le 2L_f\varepsilon . \end{aligned}$$
(48)

Finally, we get

$$\begin{aligned} \Vert \nabla F(x_{k})\Vert \le \frac{1}{\kappa } \left( \frac{\alpha _k L\Vert d_k\Vert }{2} + \frac{2L_f\varepsilon }{\alpha _k \Vert d_k\Vert } \right) \le \frac{1}{\kappa } \left( \frac{b_2 \Delta _k L}{2} + \frac{2L_f\varepsilon }{b_1 \Delta _k} \right) . \end{aligned}$$
(49)

\(\square \)

We now extend Theorem 3.2 to our mesh based scheme. The main difference is the absence of complexity estimates, which to our knowledge are not available for MADS schemes.

Theorem 4.2

Let Assumptions 2.4, 2.5 and 4.1 hold. Let \(\{x_k\}\) be a sequence generated by Algorithm 3.

  1. 1.

    If \(\alpha _{\min }> 0\), then the algorithm satisfies the termination condition (22) in a finite number of iterations, with the last iterate \(x_{{\bar{k}}}\) satisfying,

    $$\begin{aligned} \Vert \nabla F(x_{{\bar{k}}})\Vert \le \frac{1}{\kappa } \left( \frac{b_2\alpha _{\min }L}{2} + \frac{2L_f\varepsilon }{\alpha _{\min }b_1} \right) . \end{aligned}$$
    (50)
  2. 2.

    If, furthermore, it holds that \(\alpha _{\min }= 2\sqrt{\frac{L_f \varepsilon }{b_1b_2L}}\), then

    $$\begin{aligned} \Vert \nabla F(x_{{\bar{k}}})\Vert \le \frac{2}{\kappa } \sqrt{L b_2 L_f \varepsilon / b_1} . \end{aligned}$$
    (51)
  3. 3.

    If \(\alpha _{\min }= 0\), then \(\liminf \alpha _k = 0\), and if additionally \(\alpha _0 \ge \bar{\alpha }_{\min }= 2\sqrt{\frac{L_f \varepsilon }{b_1b_2L}}\), for some \({\bar{k}} \in \mathbb {N}_0\) we have

    $$\begin{aligned} \Vert \nabla F(x_{{\bar{k}}})\Vert \le \frac{1}{\theta \kappa } \left( \frac{L\bar{\alpha }_{\min }b_2}{2} + \frac{2L_f\varepsilon }{b_1\bar{\alpha }_{\min }} \right) , \end{aligned}$$
    (52)

    and

    $$\begin{aligned} F(x_k) \le F(x_{{\bar{k}}}) + 2L_f\varepsilon \quad \text { for all }k\ge {\bar{k}}. \end{aligned}$$
    (53)

Proof

1. Since the frame parameter \(\Delta _k\) is lower bounded, the mesh parameter \(\alpha _k\) is lower bounded as well, and, by the consequent finiteness of \(\bigcup _{k \in \mathbb {N}_0} M_k\), the algorithm terminates in a finite number of iterations. By the termination criterion, at the last iteration \({\bar{k}}\) we have \(\Delta _{{\bar{k}}} = \alpha _{\min }\). Since the last iteration is unsuccessful, we hence get

$$\begin{aligned} \Vert \nabla F(x_{{\bar{k}}})\Vert\le & {} \frac{1}{\kappa } \left( \frac{b_2 \Delta _k L}{2} + \frac{2L_f\varepsilon }{b_1 \Delta _k} \right) \nonumber \\= & {} \frac{1}{\kappa } \left( \frac{b_2 \alpha _{\min }L}{2} + \frac{2L_f\varepsilon }{b_1 \alpha _{\min }} \right) , \end{aligned}$$
(54)

where we applied Lemma 4.3 in the inequality.

2. Follows from the previous point replacing \(\alpha _{\min }\) with the given value in (50).

3. The property \(\liminf \alpha _k = 0\) follows from standard arguments used in the analysis of MADS schemes, already mentioned in the proof of Lemma 4.2. The result then follows from points 1 and 2 (similarly to point 3 in Theorem 3.2). \(\square \)

5 Numerical illustration

In this section, we evaluate the performance of the proposed algorithms on a collection of nonlinear bilevel optimization problems.

Three direct-search solvers derived from Algorithm 2 and Algorithm 3 were implemented in Matlab: Mesh-DS (related to Algorithm 3) with the mesh defined as in [5, Algorithm 8.2], Coordinate-DS (related to Algorithm 2) with \(D_k=[\mathcal {B}_{\oplus }, -\mathcal {B}_{\oplus }]\) (where \(\mathcal {B}_{\oplus }\) is the canonical basis of \(\mathbb {R}^n\)), and Random-DS (related to Algorithm 2) with \(D_k=[\frac{v}{\Vert v\Vert }, -\frac{v}{\Vert v\Vert }]\), where \(v \in \mathbb {R}^n\) is a pseudo-randomly generated vector. We note that for Mesh-DS, a simple decrease condition is imposed to decide the acceptance of a candidate step. In contrast, Coordinate-DS and Random-DS use a sufficient decrease condition to make this decision.

In our tests, the parameters of Algorithm 2 and Algorithm 3 were set as follows: \(\alpha _{\min }=~10^{-6}\), \(\theta =\frac{1}{2}\), \(\alpha _0=1\), \(c=10^{-3}\), and \(\gamma =2\). For all the tested approaches, the optional search step (Step 1) was not included. Instead, in the poll step, when a decrease was observed along a specific direction, we further explored it by using a simple extrapolation strategy (i.e., we multiplied the stepsize \(\alpha _k\) by \(\gamma \) and re-evaluated the function).

In our implementation, the lower-level problem is solved using the fmincon Matlab procedure. To quantify the impact of inexact lower-level solutions on performance, we used two different accuracies when solving the lower-level problem (i.e., LL_tol \(\in \{10^{-3}, 10^{-6}\}\)). The rest of the fmincon default parameters were kept unchanged. A feasibility tolerance of \(10^{-6}\) for constraint violation was used in the solution of the lower-level problem.

The three solvers, Mesh-DS, Coordinate-DS, and Random-DS, were evaluated on 33 small-scale bilevel optimization problems from the BOLIB Matlab library [46]. This library consists of a collection of academic and real-world problems. The dimensions of the tested instances, with respect to the upper-level problem, do not exceed 10 variables. Since initial points are not provided, we generated five instances of each problem by randomly selecting five different initial points, thus obtaining a total of 175 problem instances.

The computational analysis is carried out using well-known tools from the literature, namely data and performance profiles (see, e.g., [38] for further details). We briefly recall their definitions here. Given a set S of algorithms and a set P of problems, for \(s\in S\) and \(p \in P\), let \(t_{p,s}\) be the number of function evaluations required by algorithm s on problem p to satisfy the condition

$$\begin{aligned} {{\tilde{F}}}(x_k) \le {{\tilde{F}}}_{\text{ low }} + \alpha ({{\tilde{F}}}(x_0) - {{\tilde{F}}}_{\text{ low }}), \end{aligned}$$
(55)

where \(\alpha \in (0, 1)\) and \({{\tilde{F}}}_{\text{ low }}\) is the best objective function value achieved by any solver on problem p. Then, the performance and data profiles of solver s are defined by

$$\begin{aligned} \rho _s(\gamma )= & {} \frac{1}{|P|}\left| \left\{ p\in P: \frac{t_{p,s}}{\min \{t_{p,s'}:s'\in S\}}\le \gamma \right\} \right| ,\\ d_s(\kappa )= & {} \frac{1}{|P|}\left| \left\{ p\in P: t_{p,s}\le \kappa (n_p+1)\right\} \right| , \end{aligned}$$

where \(n_p\) is the dimension of problem p. We used a budget of 500 upper-level function evaluations in our experiments.
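For reference, both profiles can be computed directly from the matrix of evaluation counts \(t_{p,s}\) once the convergence test (55) has been applied; the following Python sketch (a placeholder helper, not the code used for the experiments) follows the definitions above.

```python
import numpy as np

def profiles(T, dims, gammas, kappas):
    """Compute performance and data profiles from T[p, s] = number of function
    evaluations solver s needs on problem p (np.inf if the test (55) fails).
    dims contains the problem dimensions n_p."""
    T, dims = np.asarray(T, dtype=float), np.asarray(dims)
    n_prob, n_solv = T.shape
    best = np.min(T, axis=1, keepdims=True)       # best solver per problem
    ratios = T / best                             # performance ratios
    perf = np.array([[np.mean(ratios[:, s] <= g) for s in range(n_solv)]
                     for g in gammas])            # rho_s(gamma)
    data = np.array([[np.mean(T[:, s] <= k * (dims + 1)) for s in range(n_solv)]
                     for k in kappas])            # d_s(kappa)
    return perf, data
```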

Fig. 1: Data profiles using two types of tolerances to get an approximate minimizer for the lower-level problem

Fig. 2: Performance profiles using two types of tolerances to get an approximate minimizer for the lower-level problem

Figures 1 and 2 depict the resulting data and performance profiles, respectively, considering two levels of accuracy \(\alpha \): \(10^{-3}\) and \(10^{-6}\). From Fig. 2, it can be observed that the Coordinate-DS approach performs best in terms of both efficiency (i.e., \(\tau =1\)) and robustness (i.e., larger \(\tau \)), particularly when the lower-level problem is solved accurately (i.e., LL_tol=\(10^{-6}\)). The data profiles (see Fig. 1) indicate that all the direct-search approaches perform similarly for small budgets. As the budget increases, the accuracy of the lower-level problem has a growing impact on solver performance. Overall, on the test problems, the mesh-based approach is slightly more effective for small budgets, i.e., fewer than \(25(n_x+1)\) function evaluations. However, as the budget increases, the directional direct-search algorithms appear to outperform the mesh-based approach.

6 Conclusion

In this work, we proposed an inexact direct-search algorithmic framework for bilevel optimization, under the assumption that the lower-level problem can be solved within a fixed accuracy. We then proved convergence of two different classes of methods fitting our scheme, namely directional direct-search methods with sufficient decrease and mesh-based schemes with simple decrease. Our results include complexity estimates for a directional direct-search scheme tailored to BO with smooth true objective, which extend previously known complexity estimates for the single-level case. We also considered the nonsmooth case and gave convergence guarantees to \((\delta , \epsilon )\)-Goldstein stationary points for both classes, thus nicely extending the known Clarke stationarity convergence properties of analogous schemes in the single-level case. A lower bound on the stepsize allows these methods to converge to a point with the desired stationarity properties in a finite number of iterations. Preliminary numerical results suggest that directional direct-search methods might lead to better performance than mesh-based strategies in this context.

Future developments include the extension of our algorithms to constrained and stochastic objectives, as well as numerical comparisons with recent zeroth-order smoothing-based approaches for BO.