Inexact Direct-Search Methods for Bilevel Optimization Problems

In this work, we introduce new direct search schemes for the solution of bilevel optimization (BO) problems. Our methods rely on a fixed accuracy black box oracle for the lower-level problem, and deal both with smooth and potentially nonsmooth true objectives. We thus analyze for the first time in the literature direct search schemes in these settings, giving convergence guarantees to approximate stationary points, as well as complexity bounds in the smooth case. We also propose the first adaptation of mesh adaptive direct search schemes for BO. Some preliminary numerical results on a standard set of bilevel optimization problems show the effectiveness of our new approaches.


Introduction
Bilevel optimization (see, e.g., [6,9,12,13,24] and references therein for a complete overview on the topic) has been subject of increasing interest, thanks to its application to hyperparameter tuning for machine learning algorithms and meta-learning (see, e.g., [17] and references therein).In this work, we are interested in the following bilevel optimization problem min (x,y)∈R nx×ny f (x, y), s.t.y ∈ arg min z∈Z g(x, z). ( wherein we assume that the upper-level function f (x, y) : R nx×ny → R is continuous, and g(x, z) : R nx×ny → R is such that the lower-level problem min z∈Z g(x, z) has a unique solution y(x) for every x ∈ R nx , and Z ⊂ R ny .Uniqueness of the lower-level problem solution, also known as the Low-Level Singleton (LLS) assumption, is a quite common assumption in many real world applications, such as hyperparameter optimization, meta-learning, pruning, semisupervised learning on multilayer graphs (see, e.g., [17,20,42,45]).While for simplicity we focus on the setting described above, it is important to point out that our analysis still holds, for a specific class of BO problems, even when dropping the LLS assumption (see Remark 2.1).
The algorithms we study here are derivative free optimization (DFO) methods, which do not use derivatives of the upper-level objective function, but rather only the objective value itself.
Importantly, in this setting we also assume the availability of some blackbox oracle generating an approximation ỹ(x) of y(x) for any given x ∈ R n x .Among DFO methods, we are interested in particular in direct-search methods (see, e.g., [2,26]), which sample the objective in suitably chosen tentative points without building a model for it.These algorithmic schemes allow us to prove convergence guarantees under very mild assumptions on our bilevel optimization problem.

Previous Work
Several gradient-based methods have been proposed in the literature to tackle bilevel optimization problems.Those methods usually require the computation of the true objective gradient, called "hypergradient", and rely on the LLS and suitable smoothness assumptions (see, e.g., [17,18,23,27,29] and references therein).In another line of research, some asymptotic results based on relaxations of the LLS assumption were also analyzed (see, e.g., [30,31,32] and references therein).Calculating the hypergradients can be however a notoriously challenging and time consuming task.It indeed requires the handling of ∇ x y(x), which in turns involves the calculation of the Hessian matrix related to the g function via the implicit differentiation theorem.In some contexts, the hypergradients might not be available at all due to the blackbox nature of the functions describing the problem.These are the reasons why the development of new and efficient zeroth-order/derivative-free approaches is crucial in the BO context.
As for derivative free approaches, classic direct-search (see, e.g., [2,10,26]) and trust-region methods (see, e.g., [10,26]) have been applied to BO in [11,15,37,44].In [37], a direct-search method for BO assuming the availability of the true objective is described.More specifically, their analysis does not allow for approximation errors in the solution of the lower-level problem, and relies on suitable assumptions making the true objective directionally differentiable.In [44], the analysis from [37] is extended considering lower-level inexact solutions with a stepsize-based adaptive error.In [11], an algorithm applying trust-region methods both in the inner level and on the true objective is described, with an adaptive estimation error for the true objective depending on the trust-region radius; in that work, a strategy to recycle function evaluations for the lower-level problem is described as well.In [15], the analysis of another trust-region method with adaptive error for bilevel optimization is carried out.The authors report worstcase complexity estimates both in terms of upper-level iterations and computational work from the lower-level problem, when considering a strongly convex lower-level problem solved by a suitable gradient descent approach.In the more recent works [8,35], zeroth-order methods based on smoothing strategies [39] are analyzed.These studies, drawing inspiration from the complexity results provided in [21] for zeroth-order methods that handle nonsmooth and nonconvex objectives, offer complexity estimates tailored for the BO setting.They rely on the assumptions that the lower-level problem can be solved with fixed precision, and that gradient descent on the lower level converges either polinomially or exponentially, respectively.
Finally, min-max DFO problems (which can be seen as a particular instance of BO) are also recently tackled in the literature [1,36].Relevant to our work are also direct-search methods under the presence of noise.While previous works analyze direct-search methods with adaptive deterministic [34] and stochastic noise [1,4,41], we are not aware of previous analyses of directsearch methods with bounded but non adaptive noise.

Contributions
Our contributions can be summarized as follows.
• We define and analyze the first inexact direct-search schemes for BO problems with general potentially nonsmooth true objectives.Those methods indeed never require exact lowerlevel problem solutions, but instead assume access to approximate solutions with fixed accuracy, a reasonable assumption in practice.We therefore operate in a different setting than the one considered in previous works on direct-search for BO, where true objectives are directionally differentiable [37,44] and lower-level solutions are exact [37] or require an adaptive precision [44].
• We analyze mesh based direct-search schemes for BO, extending in particular the classic mesh adaptive direct-search (MADS) scheme from [3].This is, to the best of our knowledge, the first analysis of this scheme that considers both inexact objective evaluation and the simple decrease condition for new iterates used originally in [3].
• We give the first convergence results for direct-search schemes with bounded and nonadaptive noise on the objective.
• We give the first convergence guarantees to (δ, ǫ)-Goldstein stationary points for directsearch schemes applied to general nonsmooth objectives.With respect to classic analyses considering Clarke stationary points (see, e.g., [5]), these are the first results for directsearch scheme involving some quantitative measure of approximate nonsmooth stationarity.

Background and Preliminaries
We now introduce the main assumptions considered in the paper, along with a set of helpful preliminary results that will support the subsequent convergence theory.As anticipated in the introduction, we will always assume the existence of a unique minimizer y(x) for the lower-level problem, i.e., that the LLS assumption holds.
Assumption 2.1 For any x ∈ R nx , we have that argmin z∈Z g(x, z) = {y(x)}.
Under Assumption 2.1, the bilevel optimization problem (1) can then be rewritten as min However, in practical applications, it is usually necessary to employ an iterative method to compute y(x).Therefore, one cannot expect to obtain an exact value of y(x), but rather some approximation.We will hance make use of the following assumption.Assumption 2.2 For all x ∈ R nx we can compute an approximation ỹ(x) of y(x) such that: While the remaining assumptions introduced in this section are not always needed, in the rest of this manuscript we always assume that Assumptions 2.1 and 2.2 hold.
Remark 2.1 Our analysis extends to the case where argmin z∈Z g(x, z) is not a singleton, but an approximate solution ỹ(x) of the simple bilevel problem is available for every x ∈ R nx .In fact our convergence proofs rely on (3) rather than the singleton assumption.We refer the reader to the recent work [8] for a detailed discussion on the complexity and regularity properties of the simple bilevel problem (4).
In the next proposition, we show how condition (3) can be satisfied, by applying gradient descent to g(x, •), under a suitable error bound condition on ∇ y g(x, y) generalizing strong convexity (see, e.g, [22] for a detailed comparison with other conditions).We also give an explicit bound on the number of iterations needed to satisfy (3).Proposition 2.1 Assume that there exists c g > 0 such that for all y ∈ Z , c g y − y(x) ≤ ∇ y g(x, y) . ( Furthermore, let ∇ y g be L g Lipschitz continuous in y, uniformly in x.Define y 0 (x) to be any arbitrary initialization mapping onto the domain of g(x, •).Then consider the sequence, Define the solution estimate to be: It holds that ỹ(x) satisfies (3), for Proof.This follows from the well known iteration complexity of gradient descent for smooth non convex objectives.
We introduce now some technical assumptions on the objective function needed in our analysis.

Assumption 2.3
The function f is lower bounded by f low .

Assumption 2.4
The function f is Lipschitz continuous with respect to y with Lipschitz constant L f (independent of x).
We remark that these assumptions are an adaptation to our bilevel setting of standard assumptions made in the analysis of direct-search methods [10,34].Assumption 2.2 together with Assumption 2.4 imply that F (x) := f (x, ỹ(x)) is an approximation of F (x) with accuracy L f ε.
Some regularity on the true objective F (x) will always be necessary for our analyses.We consider both the differentiable and the potentially non differentiable setting.
Assumption 2.6 The function F is continuously differentiable with Lipschitz continuous gradient, of Lipschitz constant L.
Note that if f is Lipschitz with respect to x, and y(x) Lipschitz continuous with respect to x, then Assumption 2.5 is satisfied.Furthermore, in the strongly convex lower-level setting there is an explicit expression for ∇F (see, e.g., [8,Equation (3)]), implying that its Lipschitz continuity follows from that of y(x) together with suitable regularity assumptions on f and g.

Algorithm
In this section, we introduce a general direct-search algorithm for bilevel optimization that embeds both directional direct-search methods with sufficient decrease and mesh adaptive directsearch methods with simple decrease, as defined in [10].The methods in the first class sample tentative points along a suitable set of descent directions and then select as the new iterate a point satisfying a sufficient decrease condition.The methods in the second class sample the points in a suitably defined mesh, and then select the new iterate according to a simple decrease condition.A tentative point t is hence accepted if the decrease condition is satisfied, for ρ nonnegative function.We have a sufficient decrease when ρ(t) > 0 with lim t→0 + ρ(t)/t = 0, and a simple decrease in case ρ(t) = 0.These two classes of decrease conditions lead to significant differences in convergence properties and consequently require different choices in the algorithm parameters.They will therefore be analyzed separately in Sections 3 and 4 respectively.
Algorithm 1: DS for bilevel optimization ) be an approximate minimizer for the lower-level problem in x 0 .Optional: Let ∆ 0 = α 0 be the initial frame size parameter.
Set x k+1 = t, declare the iteration successful, and go to step 13.

7:
Choose a set of descent directions D k , possibly depending on ∆ k and such that Declare the iteration as successful.Set Declare the iteration as unsuccessful.Set x k+1 = x k and y k+1 = y k . 12: end if

13:
Update the frame size parameter ∆ k and the stepsize α k .

14:
Optional: If some approximate stationarity condition is satisfied, terminate the algorithm.15: end for The detailed scheme (see Algorithm 1) follows the lines of the general schemes proposed in [10] and [26], with the addition of calls to the lower-level oracle ỹ(x), and an explicit reference to the mesh used in mesh-based schemes.At Step 1, the algorithm searches for a new iterate by testing the upper level objective in (t, ỹ(t)) for t in S k subset of the mesh M k .In case Step 1 is not successful, the method generates, at Step 2, a new iterate by selecting a set of descent directions D k and testing the upper level objective in (t, ỹ(t)) for t chosen along the descent directions using a stepsize α k .Step 3 and Step 4 perform updates on the algorithm iterate and parameters based on the outcome of Step 1 and 2. For the set of directions D k , we require in some cases a positive cosine measure, that is for some κ > 0.

Sufficient decrease condition
In this section, we analyze directional direct-search methods using a sufficient decrease condition with ρ(t) = c 2 t 2 .We first focus on potentially nonsmooth objectives, and then on smooth ones.In both cases we consider the scheme presented in Algorithm 2, which can be viewed as an adaption to BO of classic generating set of search directions (GSS) schemes (see, e.g., [25,Algorithm 3.2]).In order to handle the error introduced by the approximate solution in the lower level, we lower bound the stepsize with a constant α min .We further notice that, thanks to the sufficient decrease condition, maintaining a mesh is not necessary, and therefore we simply set M k = R nx .

Nonsmooth objectives
First, we present convergence guarantees and proofs thereof for a variant of Algorithm 2 designed for the case of Lipschitz continuous true objectives, i.e., under Assumption 2.5.With respect to the general scheme presented as Algorithm 2, here D k = {g k } with g k generated in the unit sphere.We remark that this is a standard choice for direct-search algorithms applied to nonsmooth objectives (see, e.g., [16, Algorithm DFN simple ]).The stepsize lower bound here must be strictly positive (i.e.α min > 0).This together with the sufficient decrease conditions ensures that the sequence generated by the algorithm is eventually constant, as proved in Lemma 3.1.We then use a novel argument to prove that the limit point of the sequence is a (δ, ǫ)-Goldstein stationary point.Although such a notion of stationarity has recently gained attention in the analysis of zeroth-order smoothing-based approaches [21,28,40], including extensions to BO [8,35], to the best of our knowledge, it has never been used for the analysis of direct-search methods.It is further important to notice that convergence of directional directsearch methods to (δ, ǫ)−Goldstein stationary points in the nonsmooth case is a novel result also for classic optimization problems.We now recall some useful definitions.If B δ (x) is the ball of radius δ centered in x, then the δ-Goldstein subdifferential (see, e.g., [28]) is defined as and x is an (δ, ǫ)-Goldstein stationary point for the function F if, for some g ∈ ∂ δ F (x), we have g ≤ ǫ.
Algorithm 2: Inexact directional DS for bilevel optimization 1: Initialization: Choose starting point x 0 ∈ R nx , stepsize lower bound α min ≥ 0, initial stepsize α 0 ≥ α min , coefficient for stepsize contraction 0 < θ < 1, coefficient for stepsize expansion γ ≥ 1, sufficient decrease condition coefficient c.Let y 0 = ỹ(x 0 ) be an approximate minimizer for the lower-level problem at x 0 .2: for k = 0, 1, 2, . . .do 3: Set x k+1 = t, declare the iteration successful, and go to step 13. 6: Choose a set of descent directions D k .For a given d ∈ D k , compute the approximate minimizer Declare the iteration as successful.Set Declare the iteration as unsuccessful.Set x k+1 = x k and y k+1 = y k . 12: end if

13:
If the iteration was successful then maintain or increase the corresponding stepsize parameter -set Else decrease the stepsize parameter, by choosing α k+1 = max{α min , θα k }. 14: [Optional] If some approximate stationarity condition is satisfied, terminate the algorithm.

15: end for
We can now proceed with our convergence analysis.As anticipated, we start by proving that the sequence of iterates generated by our method is eventually constant.Lemma 3.1 Let Assumptions 2.3 and 2.4 hold.Then there exists k ∈ N 0 such that the sequence {x k } generated by Algorithm 2 is constant for k ≥ k.
Proof.Notice that { F (x k )} is non-increasing, with F (x k ) = F (x k+1 ) after an unsuccessful step, and after a successful step.Thus there can be at most successful steps, where we used F (x) ≥ F (x) − L f ε ≥ f low − L f ε in the inequality.Since this quantity is finite, this implies that {x k } is eventually constant.
We now prove convergence of our algorithm to (δ, ǫ)-Goldstein stationary points.In order to get our convergence result, we need to assume that the sequence {g k } is dense in the unit sphere.We remark that such a dense sequence can be generated using a suitable quasirandom sequence (see, e.g., [19,33]).
Proof.First, {x k } is eventually constant as seen in Lemma 3.1.Let x be the unique limit point.By the stepsize updating rule, we have that every iteration must be unsuccessful with α k = α min for k large enough.Then, there exists k ∈ N large enough such that for every k ≥ k By the density of {g k } it follows for every d such that d = α min .We now define the function Fx (d for every d such that d = α min by ( 18), there must be a d ∈ argmin d ≤α min Fx (d) with d < α min .We can conclude Equivalently, g = (c + L f ε c is eventually constant, with the unique limit point (δ, ǫ)-Goldstein stationary, for

Smooth objectives
We now focus on the case where the objective F is smooth, in particular under Assumption 2.6.
We consider here a variant of Algorithm 2 with D k positive spanning set.When the stepsize lower bound is strictly positive we set as termination criterion α k = α k+1 = α min .Our scheme can hence be seen as a variant of classic direct-search methods for smooth objectives [10,25].
It is important to highlight that this is the first analysis of direct-search methods for smooth objectives under bounded noise.The only analysis of direct-search methods we are aware of in the smooth case is the one given in [14] under stochastic noise, where, however, the author only focuses on classic optimization problems.We first extend to our bounded error setting a standard result that allows to get an upper bound on the gradient norm for unsuccessful iterations (see, e.g., [25,Theorem 3.3]).Lemma 3.2 Let Assumptions 2.4 and 2.6 hold, together with (11).Let {x k } be a sequence generated by Algorithm 2. If the iteration k is unsuccessful, then Proof.Let d ∈ D k be such that We have where we used (23) in the first inequality, the standard descent lemma in the second inequality, (9) in the third inequality, and that the step is unsuccessful in the last inequality.Therefore, since by assumption d = 1 implying the thesis.
We now prove convergence and complexity bounds when α min > 0, extending those given in [43] for the exact oracle case, and α min = 0. We notice that in this second case we lose finite convergence and our guarantees are thus somewhat weaker, i.e., we are only able to prove that the stepsize converges to 0 and that at some point the gradient norm is O( √ ε).
Theorem 3.2 Let Assumptions 2.3, 2.4 and 2.6 hold, together with (11) for every k ∈ N 0 .Let {x k } be a sequence generated by Algorithm 2.
1.If α min > 0, then the algorithm terminates after k iterations, with and its last iterate xk is such that 2. If, furthermore, it holds that α min = 2 3. If α min = 0, then α k → 0, and if additionally α 0 ≥ ᾱmin = 2 L+c , for some k ∈ N 0 we have and Proof. 1.Let k s and k ns be the number of successful and unsuccessful steps, so that k s +k ns = k.Reasoning as in Lemma 3.1, we obtain by ( 14) Furthermore, since we get where we applied (31) in the second inequality.Combining the bounds on the successful and unsuccessful steps ( 31) and ( 33), we have as desired.
2. Follows from a direct application of the first result.
We now extend to our setting the O(n 2 /ǫ 2 ) complexity result given in [43,Corollary 2].For a fixed precision ǫ, an approximation error ε = O(ǫ 2 ) is required, as for classic gradient approximation schemes [7].Proof.Follows from point 1 and 2 of Theorem 3.2, plugging in the parameters specified in the assumptions.

Simple decrease condition
In this section, we analyze two methods based on simple decrease condition (i.e., with ρ(t) = 0, in (10)), one for potentially nonsmooth objectives and one for smooth objectives.Both methods follow the scheme presented in Algorithm 3, which is an adaptation to the BO setting of the mesh adaptive direct-search algorithm (MADS, see [2] and references therein).Again we lower bound the stepsize by a constant α min .The stepsize updating rule we use to handle unsuccessful iterations depends on the mesh size parameter ∆ k and the contraction coefficient θ, and smoothness of the true objective (i.e., update varies between the smooth and the nonsmooth case).
It is a standard assumption in the analysis of MADS that all the iterates lie in a compact set (see, e.g., [3,Section 3]).In our framework, this can be ensured if the following boundedness assumption is satisfied.
is bounded.
The mesh, as defined in the literature (see,e.g., [5,10] and references therein for further details), is a discrete set of points from which the algorithm selects candidate trial points.Its coarseness is parameterised by the mesh size parameter δ.The goal of each algorithm iteration is to get a mesh point whose objective function value improves with respect to the incumbent value.Given a positive spanning set D and a center x the related mesh is formally defined as follows: where, with a slight abuse of notation, we use D also for the matrix D ∈ R n×p with columns corresponding to the elements of the set D. We notice that the mesh is just a conceptual tool, and is never actually constructed.

3:
[Optional] Let M k be the mesh with size parameter α k , positive spanning set D and center x k .Select a finite subset S k of M k .

4:
if f (t, ỹ(t)) < f (x k , y k ) for some t ∈ S k then

5:
Set x k+1 = t, declare the iteration successful, and go to step 7.

7:
Choose a positive spanning set Declare the iteration as successful.Set Declare the iteration as unsuccessful.Set x k+1 = x k and y k+1 = y k . 12: end if

13:
If the iteration was successful

14:
[Optional] If some approximate stationarity condition is satisfied, terminate the algorithm.15: end for

Nonsmooth objectives
With respect to the general scheme presented in Algorithm 3, here the stepsize updating rule for unsuccessful iterations is given by α k , θα k ), ensuring that α k → 0 and the mesh gets infinitely dense if the algorithm gets stuck in a certain point.The set of search directions D k must be such that for all d ∈ D k , with b i : R >0 → R >0 such that lim t→0 b i (t) = 1 for i ∈ {1, 2}.Thus with respect to the classic MADS scheme here the frame size ∆ k defines also a lower bound and not only an upper bound on the distance between the current iterate and tentative points selected in the poll step.This adjustment is necessary due to the error on the true objective evaluation.As shown in the next lemma, Condition (38) ensures that as the stepsize converges to 0 the tentative steps get closer and closer the boundary of a ball of radius α min .
Lemma 4.1 Assume that α min > 0 and that (38) holds.Then if lim k∈K α k = 0, the set of limit points of Proof.If lim k∈K α k = 0 then it holds that, for k ∈ K large enough, ∆ k = α min .Consider where we applied (38) in the inequality.Analogously, we can prove lim inf k∈K α k d k ≥ ∆ k , whence lim k∈K α k d k = α min , which implies the thesis.
We now extend to this scheme the (δ, ǫ)-Goldstein stationarity result proved under the sufficient decrease condition in Section 3.1.Also in this case we are not aware of any analogous result for the standard MADS scheme, which is instead known to convergence to Clarke stationary points [3].We start with a lemma that extends a well known property of MADS (see, e.g., [3, Proposition 3.1]) to our bilevel setting.Lemma 4.2 Let Assumptions 2.4, 2.5 and 4.1 hold.Then the sequence {α k } generated by Algorithm ?? is such that lim inf α k = 0.
Proof.Since { F (x k )} is non-increasing (and strictly decreasing for successful iterations), {x k } is contained in the set L ε , which is compact by Assumptions 2.5 and 4.1.Thus lim inf α k = 0 follows from the finiteness of feasible points generated in L ε when keeping the parameter α k lower bounded, which can be proved with the same arguments used for MADS in [3,Proposition 3.1].
We can now state our main result.• Condition (38) holds.
Then, the limit point x of {x k } k∈K is (δ, ǫ)-Goldstein stationary, for Proof.Let d ∈ R n with d = 1, and let L ⊂ K be such that lim k∈L where the first inequality follows from ( 9), and we used that the step k is unsuccessful in the second inequality.Passing to the limit, we obtain

Smooth objectives
Now we consider the case where the true objective is smooth, i.e., Assumption 2.6 holds.With respect to the general scheme reported in Algorithm 3, we have α u (α k , ∆ k , θ) = min(∆ k , ∆ 2 k ), and the algorithm terminates if α k = α k+1 = α min .As for D k , it must always satisfy cm(D k ) ≥ κ for some positive κ independent from k, as well as for every d ∈ D k .We remark that convergence of mesh based schemes for smooth objectives is well understood (see, e.g., [5,Chapter 7]), so that once again our main contribution here is the adaptation to the bilevel setting.We begin our analysis by extending Lemma 3.2 under the simple decrease condition and condition (44) on the descent directions.
Lemma 4.3 Let Assumptions 2.4 and 2.6 hold, together with (11).Let {x k } be a sequence generated by Algorithm ??.If the step k is unsuccessful, then Proof.Since the step is unsuccessful, by considering d ∈ D k such that we have, reasoning as in (24) with c = 0 Finally, we get We now extend Theorem 3.2 to our mesh based scheme.The main difference is the absence of complexity estimates, which to our knowledge are not available for MADS schemes.1.If α min > 0, then the algorithm terminates in a finite number of iterations, with the last iterate xk satisfying, 2. If, furthermore, it holds that α min = 2 3. If α min = 0, then lim inf α k = 0, and if additionally α 0 ≥ ᾱmin = 2 and Proof. 1.Since the frame parameter ∆ k is lower bounded, the mesh parameter α k is lower bounded as well, and, by the subsequent finiteness of k∈N 0 M k , the algorithm terminates in a finite number of iterations.By the termination criterion, at the last iteration k we have ∆k = α min .Since the last iteration is unsuccessful, we hence get where we applied Lemma 4.3 in the second inequality.2. Follows from the previous point replacing α min with the given value in (49).
3. The property lim inf α k = 0 follows from standard arguments used in the analysis of MADS schemes, already mentioned in the proof of Lemma 4.1.The result then follows from point 1 and 2 (similarly to point 3 in Theorem 3.2).

Numerical illustration
In this section, we evaluate the performance of the proposed algorithms on a large collection of nonlinear bilevel optimization problems.Three direct-search solvers derived from Algorithm 2 and Algorithm 3 were implemented in Matlab: Mesh-DS (related to Algorithm 3) with the mesh defined as in [ , where v ∈ R n is a uniformly generated vector.
In our tests, the parameters used for Algorithm 2 and Algorithm 3 were set as follows: α min = 10 −6 , θ = 1 2 , α 0 = 1, c = 10 −3 , and γ = 2.For all the tested approaches, the optional search step (Step 1) was not included.Instead, in the poll step, when we observed a decrease along a specific direction, we further explored it by using a simple extrapolation strategy (i.e., we multiplied the step-size α k by γ and re-evaluated the function).
In our implementation, the lower-level problem is solved using the fmincon Matlab procedure.To quantify the impact of inexact lower-level solutions on the performances, we used 2 different accuracies when solving the lower-level problem (i.e., LL tol ∈ {10 −3 , 10 −6 }).The rest of the fmincon default parameters were kept unchanged.A feasibility tolerance of 10 −6 for constraints violation was used in the solution of the lower-level problem.
The three solvers, Mesh-DS, Coordinate-DS, and Random-DS, were evaluated using 33 small-scale bilevel optimization problems from the BOLIB Matlab library [46].This library consists of a collection of academic and real-world problems.The dimensions of the tested instances, with respect to the upper-level problem, do not exceed 10 variables.Since an initial point is not provided, we generated five problem instances by randomly selecting five different initial points, thus getting a total of 175 problem instances.
The computational analysis is carried out by using well-known tools from the literature, that is data and performance profiles (see,e.g., [38] for further details).We briefly recall here their definitions.Given a set S of algorithms and a set P of problems, for s ∈ S and p ∈ P , let t p,s be the number of function evaluations required by algorithm s on problem p to satisfy the condition where γ p ∈ (0, 1) and Flow is the best objective function value achieved by any solver on problem p.Then, the performance and data profiles of solver s are defined by  Figures 1-2 depict the resulting performance and data profiles, respectively, considering two levels of accuracy α: 10 −3 and 10 −6 .From Figure 2, it can be observed that the Coordinate-DS approach performs the best in terms of both efficiency (i.e., τ = 1) and robustness (i.e., larger τ ), particularly when the lower problem is solved accurately (i.e., LL tol=10 −6 ).The data profiles (see Figure 2) indicate that all the direct-search approaches perform similarly for small budgets.However, as the budget increases, the accuracy of the lower problem becomes impactful on the solver's performance.Overall, on the tested problems, the directional direct-search approaches seem to outperform the mesh-based direct-search approach.

Conclusion
In this work, we proposed an inexact direct-search based algorithmic framework for bilevel optimization, under the assumption that the lower-level problem can be solved within a fixed accuracy.We then proved convergence of two different classes of methods fitting our scheme, that is directional direct-search methods with sufficient decrease and mesh based schemes with simple decrease.Our results include complexity estimates for a directional direct-search scheme tailored for BO with smooth true objective, which extends previously known complexity estimates for the single level case.We also considered the nonsmooth case and gave convergence guarantees to (δ, ǫ)-Goldstein stationary points for both classes, thus nicely extending the known Clarke stationary point convergence properties of analogous schemes in the single level case.A lower bound on the stepsize allows these method to convergence to a point with the desired stationarity properties in a finite number of iterations.Preliminary numerical results suggest that directional direct-search methods might lead to better performance than mesh based strategies in this context.Future developments include the extensions of our algorithms to constrained and stochastic objectives, as well as numerical comparisons with recent zeroth order smoothing based approaches for BO.
for d k satisfying the condition and y k+1 = y α k d k k .

Theorem 3 . 1
Let Assumptions 2.3, 2.4 and 2.5 hold.Assume that {g k } is dense in the unite sphere.Then the sequence {x k } generated by Algorithm 2 is eventually constant, with the unique limit point (δ, ǫ)-Goldstein stationary, for ǫ = 4L f ε α min + cα min and δ = α min .

Theorem 4 . 1
Let Assumptions 2.4, 2.5 and 4.1 hold.Let K be a subset of unsuccessful iteration indices related to Algorithm ??.Let us further assume that:• lim k∈K x k = x; • lim k∈K α k = 0; • { Dk } k∈K is dense in the unit sphere, with Dk = { d d | d ∈ D k };

2 min d 2 .Corollary 4 . 1
Now let Fx (d) = F (x + d) + 2L f ε α By applying (41) we get Fx (0) ≤ Fx (α min d) , and given that d is arbitrary, this holds for any d such that d = α min .The thesis then follows as in the proof of Theorem 3.1.As in Section 3.1, here we also have a corollary showing that for α min ∝ √ ε we are able to get a (O( √ ε), O( √ ε))-Goldstein stationary point.Under the assumptions of Theorem 4.1, the limit point x of the sequence {x k } generated by Algorithm ?? with α min = 2 L f ε is (δ, ǫ)-Goldstein stationary, for

5 ,
Algorithm 8.2], Coordinate-DS (related to Algorithm 2) with D k = [B ⊕ , −B ⊕ ] (where B ⊕ is the canonical basis of R n ), and Random-DS (related to Algorithm 2) with D

Figure 1 :
Figure 1: Data profiles using two type of tolerances to get an approximate minimizer for the lower-level problem.

Figure 2 :
Figure 2: Performance profiles using two type of tolerances to get an approximate minimizer for the lower-level problem.
[7]and since ∂F (x + d) ⊂ ∂ α min F (x) we have g ∈ ∂ α min F (x).To conclude, observe g < cα min + Interestingly, the order of magnitude O( √ ε) of the approximation error coincides with that of typical gradient approximation methods[7], as well as with that of direct-search in the smooth setting, as we shall see in the next section.Corollary 3.1 Let Assumptions 2.3, 2.4 and 2.5 hold.Assume that {g k } is dense in the unite sphere.Then the sequence {x k } generated by Algorithm 2 with α min = 2