Abstract
In this work, we introduce new direct-search schemes for the solution of bilevel optimization (BO) problems. Our methods rely on a fixed-accuracy blackbox oracle for the lower-level problem, and deal with both smooth and potentially nonsmooth true objectives. We thus give the first analysis in the literature of direct-search schemes in these settings, providing convergence guarantees to approximate stationary points, as well as complexity bounds in the smooth case. We also propose the first adaptation of mesh adaptive direct-search schemes for BO. Preliminary numerical results on a standard set of bilevel optimization problems show the effectiveness of our new approaches.
1 Introduction
Bilevel optimization (see, e.g., [6, 9, 12, 13, 25] and references therein for a complete overview of the topic) has been the subject of increasing interest, thanks to its applications to hyperparameter tuning for machine learning algorithms and meta-learning (see, e.g., [17] and references therein). In this work, we are interested in the following bilevel optimization problem
wherein we assume that the upper-level function \(f(x,y):\mathbb {R}^{n_x}\times \mathbb {R}^{n_y}\rightarrow \mathbb {R}\) is continuous, and \(g(x,z):\mathbb {R}^{n_x}\times \mathbb {R}^{n_y}\rightarrow \mathbb {R}\) is such that the lower-level problem \(\min _{z \in Z} g(x,z)\) has a unique solution y(x) for every \(x \in \mathbb {R}^{n_x}\), with \(Z\subset \mathbb {R}^{n_y}\). Uniqueness of the lower-level solution, also known as the Low-Level Singleton (LLS) assumption, is quite common in many real-world applications, such as hyperparameter optimization, meta-learning, pruning, and semi-supervised learning on multilayer graphs (see, e.g., [17, 21, 42, 45]). While for simplicity we focus on the setting described above, it is important to point out that our analysis still holds, for a specific class of BO problems, even when dropping the LLS assumption (see Remark 2.1).
The algorithms we study here are derivative-free optimization (DFO) methods, which do not use derivatives of the upper-level objective function, but only the objective value itself. Importantly, in this setting we also assume the availability of some blackbox oracle generating an approximation \({\tilde{y}}(x)\) of y(x) for any given \(x \in \mathbb {R}^{n_x}\). Among DFO methods, we are interested in particular in direct-search methods (see, e.g., [2, 27]), which sample the objective at suitably chosen tentative points without building a model for it. These algorithmic schemes allow us to prove convergence guarantees under very mild assumptions on our bilevel optimization problem.
1.1 Previous work
Several gradient-based methods have been proposed in the literature to tackle bilevel optimization problems. These methods usually require the computation of the gradient of the true objective, called the “hypergradient”, and rely on the LLS and suitable smoothness assumptions (see, e.g., [17, 18, 20, 24, 29] and references therein). In another line of research, some asymptotic results based on relaxations of the LLS assumption were also analyzed (see, e.g., [30,31,32] and references therein). Calculating the hypergradient, however, can be a notoriously challenging and time-consuming task. It indeed requires handling \(\nabla _x y(x)\), which in turn involves the calculation of the Hessian matrix of the function g via implicit differentiation. In some contexts, the hypergradient might not be available at all due to the blackbox nature of the functions describing the problem. These are the reasons why the development of new and efficient zeroth-order/derivative-free approaches is crucial in the BO context.
As for derivative-free approaches, classic direct-search (see, e.g., [2, 10, 27]) and trust-region methods (see, e.g., [10, 27]) have been applied to BO in [11, 15, 37, 44]. In [37], a direct-search method for BO assuming the availability of the true objective is described. More specifically, their analysis does not allow for approximation errors in the solution of the lower-level problem, and relies on suitable assumptions making the true objective directionally differentiable. In [44], the analysis from [37] is extended to inexact lower-level solutions with a stepsize-based adaptive error. In [11], an algorithm applying trust-region methods both to the lower level and to the true objective is described, with an adaptive estimation error for the true objective depending on the trust-region radius; in that work, a strategy to recycle function evaluations for the lower-level problem is described as well. In [15], the analysis of another trust-region method with adaptive error for bilevel optimization is carried out. The authors report worst-case complexity estimates both in terms of upper-level iterations and of computational work from the lower-level problem, when considering a strongly convex lower-level problem solved by a suitable gradient descent approach. In the more recent works [8, 35], zeroth-order methods based on smoothing strategies [39] are analyzed. These studies, drawing inspiration from the complexity results provided in [22] for zeroth-order methods that handle nonsmooth and nonconvex objectives, offer complexity estimates tailored to the BO setting. They rely on the assumptions that the lower-level problem can be solved with fixed precision, and that gradient descent on the lower level converges either polynomially or exponentially, respectively.
Finally, min-max DFO problems (which can be seen as a particular instance of BO) have also recently been tackled in the literature [1, 36]. Relevant to our work are also direct-search methods in the presence of noise. While previous works analyze direct-search methods with adaptive deterministic [34] and stochastic noise [1, 4, 41], we are not aware of previous analyses of direct-search methods with bounded but non-adaptive noise.
1.2 Contributions
Our contributions can be summarized as follows.
-
We define and analyze the first inexact direct-search schemes for BO problems with general potentially nonsmooth true objectives. Those methods indeed never require exact lower-level problem solutions, but instead assume access to approximate solutions with fixed accuracy, a reasonable assumption in practice. We therefore operate in a different setting than the one considered in previous works on direct-search for BO, where true objectives are directionally differentiable [37, 44] and lower-level solutions are exact [37] or require an adaptive precision [44].
-
We analyze mesh based direct-search schemes for BO, extending in particular the classic mesh adaptive direct-search (MADS) scheme from [3]. This is, to the best of our knowledge, the first analysis of this scheme that considers both inexact objective evaluation and the simple decrease condition for new iterates used originally in [3].
-
We give the first convergence results for direct-search schemes with bounded and non-adaptive noise on the objective.
-
We give the first convergence guarantees to \((\delta , \epsilon )\)-Goldstein stationary points for direct-search schemes applied to general nonsmooth objectives. With respect to classic analyses considering Clarke stationary points (see, e.g., [5]), these are the first results for direct-search schemes involving a quantitative measure of approximate nonsmooth stationarity.
2 Background and preliminaries
We now introduce the main assumptions considered in the paper, along with a set of helpful preliminary results that will support the subsequent convergence theory. As anticipated in the introduction, we will always assume the existence of a unique minimizer y(x) for the lower-level problem, i.e., that the LLS assumption holds.
Assumption 2.1
For any \(x \in \mathbb {R}^{n_x}\), we have that \({{\,\textrm{argmin}\,}}_{z \in Z} g(x, z) = \{y(x)\}\).
Under Assumption 2.1, the bilevel optimization problem (1) can then be rewritten as
However, in practical applications, it is usually necessary to employ an iterative method to compute y(x). Therefore, one cannot expect to obtain an exact value of y(x), but rather some approximation. We will hence make use of the following assumption.
Assumption 2.2
For all \(x \in \mathbb {R}^{n_x}\) we can compute an approximation \({\tilde{y}}(x)\) of y(x) such that:
While the remaining assumptions introduced in this section are not always needed, in the rest of this manuscript we always assume that Assumptions 2.1 and 2.2 hold.
Remark 2.1
Our analysis extends to the case where \({{\,\textrm{argmin}\,}}_{z \in Z} g(x, z)\) is not a singleton, but an approximate solution \({\tilde{y}}(x)\) of the simple bilevel problem
is available for every \(x \in \mathbb {R}^{n_x}\). In fact our convergence proofs rely on (3) rather than the singleton assumption, where y(x) can be any solution of problem (4). We refer the reader to the recent work [8] for a detailed discussion on the complexity and regularity properties of the simple bilevel problem (4).
In the next proposition, we show how condition (3) can be satisfied by applying gradient descent to \(g(x, \cdot )\), under a suitable error bound condition on \(\nabla _{y} g(x, y)\) generalizing strong convexity (see, e.g., [23] for a detailed comparison with other conditions). We also give an explicit bound on the number of iterations needed to satisfy (3).
Proposition 2.1
Assume that there exists \(c_g>0\) such that for all \(y\in Z\),
Furthermore, let \(\nabla _y g\) be \(L_g\) Lipschitz continuous in y, uniformly in x. Define \(y_{0}(x)\) to be any arbitrary initialization mapping onto the domain of \(g(x,\cdot )\). Then consider the sequence,
Define the solution estimate to be:
It holds that \({\tilde{y}}(x)\) satisfies (3), for
Proof
This follows from the well-known iteration complexity of gradient descent for smooth nonconvex objectives. \(\square \)
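To make the oracle of Proposition 2.1 concrete, a minimal sketch follows: run gradient descent with stepsize \(1/L_g\) on \(z \mapsto g(x,z)\) for a prescribed number of iterations and return the iterate with the smallest observed gradient norm as the solution estimate. The function names and the specific choice of estimate are ours, for illustration only:

```python
import numpy as np

def lower_level_oracle(grad_g, y0, L_g, num_iters):
    """Sketch of an inexact lower-level solver (cf. Proposition 2.1).

    Runs gradient descent with stepsize 1/L_g on z -> g(x, z) (x is
    fixed, so grad_g maps z to the partial gradient of g in z) and
    returns the iterate with the smallest observed gradient norm.
    """
    y = np.asarray(y0, dtype=float)
    best_y, best_norm = y.copy(), np.linalg.norm(grad_g(y))
    for _ in range(num_iters):
        y = y - grad_g(y) / L_g              # standard gradient step
        g_norm = np.linalg.norm(grad_g(y))
        if g_norm < best_norm:               # keep the best iterate seen
            best_y, best_norm = y.copy(), g_norm
    return best_y
```

On a strongly convex quadratic lower level, e.g. \(g(x,z) = \frac{1}{2}\Vert z - y(x)\Vert ^2\), the sketch recovers y(x) up to any prescribed accuracy \(\varepsilon \) in finitely many iterations.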
We introduce now some technical assumptions on the objective function needed in our analysis.
Assumption 2.3
The function f is lower bounded by \(f_{\text {low}}\).
Assumption 2.4
The function f is Lipschitz continuous with respect to y with Lipschitz constant \(L_f\) (independent of x).
We remark that these assumptions are an adaptation to our bilevel setting of standard assumptions made in the analysis of direct-search methods [10, 34]. Assumption 2.2, together with Assumption 2.4, implies that \({\tilde{F}}(x):= f(x, {\tilde{y}}(x))\) is an approximation of F(x) with accuracy \(L_f\varepsilon \). Indeed,
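In code, the inexact true objective is simply a composition with the lower-level oracle; the wrapper below (names are ours, not the paper's) makes explicit that any upper-level function that is \(L_f\)-Lipschitz in y inherits the accuracy \(L_f\varepsilon \) from the oracle:

```python
import math

def F_tilde(f, y_tilde_oracle, x):
    """Inexact true objective F~(x) = f(x, y~(x)): evaluate the
    upper-level function at the oracle's approximate lower-level
    solution instead of the exact minimizer y(x)."""
    return f(x, y_tilde_oracle(x))
```

If the oracle error is \(\Vert {\tilde{y}}(x) - y(x)\Vert \le \varepsilon \) and f is \(L_f\)-Lipschitz in y, then \(|{\tilde{F}}(x) - F(x)| \le L_f \varepsilon \).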
Some regularity on the true objective F(x) will always be necessary for our analyses. We consider both the differentiable and the potentially non differentiable setting.
Assumption 2.5
F(x) is Lipschitz continuous with constant \(L_F\).
Assumption 2.6
The function F is continuously differentiable with Lipschitz continuous gradient, of Lipschitz constant L.
Note that if f is Lipschitz with respect to x, and y(x) is Lipschitz continuous with respect to x, then Assumption 2.5 is satisfied. Furthermore, in the strongly convex lower-level setting there is an explicit expression for \(\nabla F\) (see, e.g., [8, Equation (3)]), implying that its Lipschitz continuity follows from that of y(x) together with suitable regularity assumptions on f and g.
2.1 Algorithm
In this section, we introduce a general direct-search algorithm for bilevel optimization that embeds both directional direct-search methods with sufficient decrease and mesh adaptive direct-search methods with simple decrease, as defined in [10]. The methods in the first class sample tentative points along a suitable set of search directions and then select as the new iterate a point satisfying a sufficient decrease condition. The methods in the second class sample the points in a suitably defined mesh, and then select the new iterate according to a simple decrease condition. A tentative point t is hence accepted if the decrease condition
is satisfied, for \(\rho \) a nonnegative function. We have a sufficient decrease condition when \(\rho (t) > 0\) with \(\lim _{t \rightarrow 0^+} \rho (t)/t = 0\), and a simple decrease condition when \(\rho (t) = 0\). These two classes of decrease conditions lead to significant differences in convergence properties, and consequently require different choices of the algorithm parameters. They will therefore be analyzed separately in Sects. 3 and 4, respectively.
The detailed scheme (see Algorithm 1) follows the lines of the general schemes proposed in [10] and [27], with the addition of calls to the lower-level oracle \({\tilde{y}}(x)\), and an explicit reference to the mesh used in mesh-based schemes. At steps 3–6, the algorithm searches for a new iterate by testing the upper-level objective at \((t, {\tilde{y}}(t))\) for t in a subset \(S_k\) of the mesh \(M_k\). In case the search is not successful, the method generates a new iterate by selecting a set of search directions \(D_k\) and testing the upper-level objective at \((t, {\tilde{y}}(t))\) for t chosen along the search directions using a stepsize \(\alpha _k\) (see steps 7–12). Steps 9, 11 and 13 update the algorithm iterate and parameters based on the search step and the computed function evaluations. For the set of directions \(D_k\), we require in some cases a positive cosine measure, that is
for some \(\kappa > 0\).
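As an illustration of the poll step just described (a sketch under our own naming, not the paper's Algorithm 1), the fragment below polls the inexact objective along the coordinate directions and their negatives, a positive spanning set with cosine measure \(1/\sqrt{n}\), and accepts a trial point under the sufficient decrease \(\rho (\alpha ) = \frac{c}{2}\alpha ^2\):

```python
import numpy as np

def poll_step(F_tilde, x, alpha, c=1e-4):
    """One poll iteration of a GSS-style direct search (sketch).

    D = [e_1, ..., e_n, -e_1, ..., -e_n] is a positive spanning set
    with cosine measure 1/sqrt(n); a trial point t = x + alpha*d is
    accepted if it satisfies the sufficient decrease condition
    F~(t) < F~(x) - (c/2) * alpha**2.  Returns (new_x, success).
    """
    fx = F_tilde(x)
    n = x.size
    directions = np.vstack([np.eye(n), -np.eye(n)])
    for d in directions:
        t = x + alpha * d
        if F_tilde(t) < fx - 0.5 * c * alpha ** 2:
            return t, True          # successful iteration
    return x, False                 # unsuccessful: caller shrinks alpha
```

A full method would expand the stepsize after a success (\(\gamma \alpha \)) and contract it after a failure (\(\max (\theta \alpha , \alpha _{\min })\)), as in the schemes analyzed below.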
3 Sufficient decrease condition
In this section, we analyze directional direct-search methods using a sufficient decrease condition with \(\rho (t) = \frac{c}{2}t^2\). We first focus on potentially nonsmooth objectives, and then on smooth ones. In both cases we consider the scheme presented in Algorithm 2, which can be viewed as an adaptation to BO of classic generating set search (GSS) schemes (see, e.g., [26, Algorithm 3.2]). In order to handle the error introduced by the approximate solution of the lower level, we lower bound the stepsize with a constant \(\alpha _{\min }\). We further notice that, thanks to the sufficient decrease condition, maintaining a mesh is not necessary, and therefore we simply set \(M_k = \mathbb {R}^{n_x}\).
3.1 Nonsmooth objectives
First, we present convergence guarantees, and proofs thereof, for a variant of Algorithm 2 designed for the case of Lipschitz continuous true objectives, i.e., under Assumption 2.5. With respect to the general scheme presented as Algorithm 2, here \(D_k = \{g_k\}\) with \(g_k\) generated in the unit sphere. We remark that this is a standard choice for direct-search algorithms applied to nonsmooth objectives (see, e.g., [16, Algorithm \(\text {DFN}_{simple}\)]). The stepsize lower bound here must be strictly positive (i.e., \(\alpha _{\min }> 0\)). This, together with the sufficient decrease condition, ensures that the sequence generated by the algorithm is eventually constant, as proved in Lemma 3.1. We then use a novel argument to prove that the limit point of the sequence is a \((\delta , \epsilon )\)-Goldstein stationary point. Although such a notion of stationarity has recently gained attention in the analysis of zeroth-order smoothing-based approaches [22, 28, 40], including extensions to BO [8, 35], to the best of our knowledge, it has never been used in the analysis of direct-search methods. It is further important to notice that convergence of directional direct-search methods to \((\delta , \epsilon )\)-Goldstein stationary points in the nonsmooth case is a novel result also for classic optimization problems. We now recall some useful definitions. If \(B_{\delta }(x)\) is the ball of radius \(\delta \) centered at x, then the \(\delta \)-Goldstein subdifferential (see, e.g., [28]) is defined as
$$\begin{aligned} \partial _{\delta }F(x) = {{\,\textrm{conv}\,}}\left( \bigcup _{y \in B_{\delta }(x)} \partial F(y)\right) , \end{aligned}$$
with \(\partial F(y)\) the Clarke subdifferential of F at y,
and x is a \((\delta , \epsilon )\)-Goldstein stationary point for the function F if, for some \(g \in \partial _{\delta }F(x)\), we have \(\Vert g\Vert \le \epsilon \).
We can now proceed with our convergence analysis. As anticipated, we start by proving that the sequence of iterates generated by our method is eventually constant.
Lemma 3.1
Let Assumptions 2.3 and 2.4 hold. Then there exists \({\bar{k}}\in \mathbb {N}_0\) such that the sequence \(\{x_k\}\) generated by Algorithm 2 is constant for \(k \ge {\bar{k}}\).
Proof
Notice that \(\{{\tilde{F}}(x_k)\}\) is non-increasing, with \({\tilde{F}}(x_k) = {\tilde{F}}(x_{k + 1})\) after an unsuccessful step, and
after a successful step. Thus there can be at most
successful steps, where we used \({\tilde{F}}(x) \ge F(x) - L_f\varepsilon \ge f_{\text {low}} - L_f \varepsilon \) in the inequality. Since this quantity is finite, this implies that \(\{x_k\}\) is eventually constant. \(\square \)
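For completeness, the bound on the number of successful steps used above can be written out explicitly; this reconstruction uses only the per-step decrease \(\rho (\alpha _k) \ge \frac{c}{2}\alpha _{\min }^2\) and the lower bound \({\tilde{F}}(x_k) \ge f_{\text {low}} - L_f\varepsilon \) from the proof:

```latex
\#\{\text{successful steps}\}
\;\le\; \frac{\tilde{F}(x_0) - (f_{\text{low}} - L_f\varepsilon)}{\tfrac{c}{2}\,\alpha_{\min}^2}
\;=\; \frac{2\bigl(\tilde{F}(x_0) - f_{\text{low}} + L_f\varepsilon\bigr)}{c\,\alpha_{\min}^2}.
```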
We now prove convergence of our algorithm to \((\delta ,\epsilon )\)-Goldstein stationary points. In order to get our convergence result, we need to assume that the sequence \(\{g_k\}\) is dense in the unit sphere. We remark that such a dense sequence can be generated using a suitable quasirandom sequence (see, e.g., [19, 33]).
Theorem 3.1
Let Assumptions 2.3, 2.4 and 2.5 hold. Assume that \(\{g_k\}\) is dense in the unit sphere. Then the sequence \(\{x_k\}\) generated by Algorithm 2 is eventually constant, with the unique limit point \((\delta ,\epsilon )\)-Goldstein stationary, for
Proof
First, \(\{x_k\}\) is eventually constant as seen in Lemma 3.1. Let \({\bar{x}}\) be the unique limit point. By the stepsize updating rule, we have that every iteration must be unsuccessful with \(\alpha _k = \alpha _{\min }\) for k large enough. Then, there exists \({\bar{k}} \in \mathbb {N}\) large enough such that for every \(k \ge {\bar{k}}\)
implying
By the density of \(\{g_k\}\) it follows
for every d such that \(\Vert d\Vert = \alpha _{\min }\).
We now define the function \({\bar{F}}_{{\bar{x}}}(d):= F({\bar{x}} + d) + (\frac{c}{2} + \frac{2L_f\varepsilon }{\alpha _{\min }^2}) \Vert d\Vert ^2\). Since
for every d such that \(\Vert d\Vert = \alpha _{\min }\) by (18), there must be a \({\tilde{d}} \in {{\,\textrm{argmin}\,}}_{\Vert d\Vert \le \alpha _{\min }} {\bar{F}}_{{\bar{x}}}(d)\) with \(\Vert {\tilde{d}}\Vert < \alpha _{\min }\). We can conclude
Equivalently, \(g = (c + \frac{4L_f\varepsilon }{\alpha _{\min }^2}){\tilde{d}} \in \partial F({\bar{x}} + {\tilde{d}})\) and since \(\partial F({\bar{x}} + {\tilde{d}}) \subset \partial _{\alpha _{\min }} F({\bar{x}})\) we have \(g \in \partial _{\alpha _{\min }} F({\bar{x}})\). To conclude, observe \(\Vert g\Vert < c \alpha _{\min }+ \frac{4L_f\varepsilon }{\alpha _{\min }} \). \(\square \)
As a corollary of Theorem 3.1, for \(\alpha _{\min }\propto \sqrt{\varepsilon }\) we are able to get a \((\mathcal {O}(\sqrt{\varepsilon }), \mathcal {O}(\sqrt{\varepsilon }))-\)Goldstein stationary point. Interestingly, the order of magnitude \(\mathcal {O}(\sqrt{\varepsilon })\) of the approximation error coincides with that of typical gradient approximation methods [7], as well as with that of direct-search in the smooth setting, as we shall see in the next section.
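The balancing behind this choice can be checked directly: with \(\alpha _{\min } = 2\sqrt{L_f\varepsilon /c}\), the two terms \(c\,\alpha _{\min }\) and \(4L_f\varepsilon /\alpha _{\min }\) in the stationarity bound of Theorem 3.1 are equal, and their sum collapses to \(4\sqrt{c L_f \varepsilon } = \mathcal {O}(\sqrt{\varepsilon })\). A small numerical sketch (constant values illustrative):

```python
import math

def goldstein_bound(c, L_f, eps):
    """Stationarity bound c*a + 4*L_f*eps/a of Theorem 3.1 evaluated
    at the balancing stepsize a = alpha_min = 2*sqrt(L_f*eps/c)."""
    a = 2.0 * math.sqrt(L_f * eps / c)
    return c * a + 4.0 * L_f * eps / a    # = 4*sqrt(c*L_f*eps)
```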
Corollary 3.1
Let Assumptions 2.3, 2.4 and 2.5 hold. Assume that \(\{g_k\}\) is dense in the unit sphere. Then the sequence \(\{x_k\}\) generated by Algorithm 2 with \(\alpha _{\min }= 2\sqrt{\frac{L_f\varepsilon }{c}}\) is eventually constant, with the unique limit point \((\delta ,\epsilon )\)-Goldstein stationary, for
3.2 Smooth objectives
We now focus on the case where the objective F is smooth, in particular under Assumption 2.6. We consider here a variant of Algorithm 2 with \(D_k\) a positive spanning set. When the stepsize lower bound is strictly positive, we set as termination criterion in step 14
Our scheme can hence be seen as a variant of classic direct-search methods for smooth objectives [10, 26]. It is important to highlight that this is the first analysis of direct-search methods for smooth objectives under bounded noise. The only analysis of direct-search methods we are aware of in the smooth case is the one given in [14] under stochastic noise, where, however, the author only focuses on classic optimization problems.
We first extend to our bounded error setting a standard result that provides an upper bound on the gradient norm at unsuccessful iterations (see, e.g., [26, Theorem 3.3]).
Lemma 3.2
Let Assumptions 2.4 and 2.6 hold, together with (11). Let \(\{x_k\}\) be a sequence generated by Algorithm 2. If the iteration k is unsuccessful, then
Proof
Let \(d \in D_k\) be such that
We have
where we used (24) in the first inequality, the standard descent lemma in the second inequality, (9) in the third inequality, and that the step is unsuccessful in the last inequality. Therefore, since by assumption \(\Vert d\Vert = 1\)
implying the thesis. \(\square \)
In [34], convergence of a linesearch scheme is analyzed in the noisy case (i.e., additive noise smaller than the stepsize), and a result analogous to Lemma 3.2 is given.
We now prove convergence and complexity bounds both when \(\alpha _{\min }> 0\), extending those given in [43] for the exact oracle case, and when \(\alpha _{\min }=0\). We notice that in this second case we lose finite termination, and our guarantees are thus somewhat weaker: we are only able to prove that the stepsize converges to 0 and that at some point the gradient norm is \(\mathcal {O}(\sqrt{\varepsilon })\).
Theorem 3.2
Let Assumptions 2.3, 2.4 and 2.6 hold, together with (11) for every \(k \in \mathbb {N}_0\). Let \(\{x_k\}\) be a sequence generated by Algorithm 2.
1. If \(\alpha _{\min }> 0\), then the algorithm satisfies the termination condition (22) after \({\bar{k}}\) iterations, with
$$\begin{aligned} {\bar{k}} < 1 + \frac{2}{\alpha _{\min }^2c}({{\tilde{F}}}(x_0) - f_{low} + 2L_f\varepsilon )\left( 1 - \frac{\ln \gamma }{\ln \theta }\right) + \frac{\ln \alpha _{\min }- \ln \alpha _0}{\ln \theta } , \end{aligned}$$
(27)
and its last iterate \(x_{{\bar{k}}}\) is such that
$$\begin{aligned} \Vert \nabla F(x_{{\bar{k}}})\Vert \le \frac{1}{\kappa } \left( \frac{(L + c)\alpha _{\min }}{2} + \frac{2L_f\varepsilon }{\alpha _{\min }} \right) . \end{aligned}$$
(28)
2. If, furthermore, it holds that \(\alpha _{\min }= 2\sqrt{\frac{L_f \varepsilon }{L + c}}\), then
$$\begin{aligned} \Vert \nabla F(x_{{\bar{k}}})\Vert \le \frac{2}{\kappa } \sqrt{(c + L) L_f \varepsilon } . \end{aligned}$$
(29)
3. If \(\alpha _{\min }= 0\), then \(\alpha _k \rightarrow 0\), and if additionally \(\alpha _0 \ge \bar{\alpha }_{\min }= 2\sqrt{\frac{L_f \varepsilon }{L + c}}\), for some \({\bar{k}} \in \mathbb {N}_0\) we have
$$\begin{aligned} \Vert \nabla F(x_{{\bar{k}}})\Vert \le \frac{1}{\theta \kappa } \left( \frac{(L + c)\bar{\alpha }_{\min }}{2} + \frac{2L_f\varepsilon }{\bar{\alpha }_{\min }}\right) , \end{aligned}$$
(30)
and
$$\begin{aligned} F(x_k) \le F(x_{{\bar{k}}}) + 2L_f\varepsilon \quad \text { for all }k\ge {\bar{k}} . \end{aligned}$$
(31)
Proof
1. Let \(k_s\) and \(k_{ns}\) be the number of successful and unsuccessful steps, so that \(k_s + k_{ns} = k\). Reasoning as in Lemma 3.1, we obtain by (14)
Furthermore, since
we get
where we applied (32) in the second inequality. Combining the bounds on the successful and unsuccessful steps (32) and (34), we have
as desired.
2. Follows from a direct application of the first result.
3. Reasoning as in the first result, the number of successful steps with stepsize above a certain threshold is bounded, hence \(\alpha _k \rightarrow 0\). Furthermore, for any \({\bar{k}} \in \mathbb {N}_0\), if \(k \ge {\bar{k}}\)
which proves (31). Let \(\bar{\alpha }_{\min }= 2\sqrt{\frac{L_f \varepsilon }{L + c}}\). Since \(\alpha _0 \ge \bar{\alpha }_{\min }\), and \(\alpha _k \rightarrow 0\) with contraction factor \(\theta \), we must have \(\alpha _{{\bar{k}}} \in [\theta \bar{\alpha }_{\min }, \bar{\alpha }_{\min }]\) for some \({\bar{k}} \in \mathbb {N}_0\). Then (30) follows from (23) for \(\alpha _k = \alpha _{{\bar{k}}}\). \(\square \)
We now extend to our setting the \(\mathcal {O}(n^2/\epsilon ^2)\) complexity result given in [43, Corollary 2]. For a fixed precision \(\epsilon \), an approximation error \(\varepsilon = \mathcal {O}(\epsilon ^2)\) is required, as for classic gradient approximation schemes [7].
Corollary 3.2
Let Assumptions 2.3, 2.4 and 2.6 hold, together with (11) for every \(k \in \mathbb {N}_0\). Let \(\{x_k\}\) be a sequence generated by Algorithm 2. Assume also \(\varepsilon \le \epsilon ^2 \kappa ^2\), that at every iteration there are at most \(d_1n\) function evaluations, and that \(\kappa \ge d_2/\sqrt{n}\), for \(d_1, d_2 > 0\). Then if \(\alpha _{\min }~=~2\sqrt{\frac{L_f \varepsilon }{L + c}}\), the algorithm terminates after \(\mathcal {O}(n^2/\epsilon ^2)\) function evaluations with \(\Vert \nabla {F}(x_{{\bar{k}}})\Vert \le d_3 \epsilon \), for \(d_3 > 0\) depending only on c, L and \(L_f\).
Proof
This follows from points 1 and 2 of Theorem 3.2, plugging in the parameters specified in the assumptions. \(\square \)
4 Simple decrease condition
In this section, we analyze two methods based on a simple decrease condition (i.e., with \(\rho (t) = 0\) in (10)), one for potentially nonsmooth objectives and one for smooth objectives. Both methods follow the scheme presented in Algorithm 3, which is an adaptation to the BO setting of the mesh adaptive direct-search algorithm (MADS, see [2] and references therein). Again we lower bound the stepsize by a constant \(\alpha _{\min }\). The stepsize updating rule we use to handle unsuccessful iterations depends on the frame size parameter \(\Delta _k\), the contraction coefficient \(\theta \), and the smoothness of the true objective (i.e., the update differs between the smooth and the nonsmooth case).
It is a standard assumption in the analysis of MADS that all the iterates lie in a compact set (see, e.g., [3, Section 3]). In our framework, this can be ensured if the following boundedness assumption is satisfied.
Assumption 4.1
The set
is bounded.
The mesh, as defined in the literature (see, e.g., [5, 10] and references therein for further details), is a discrete set of points from which the algorithm selects candidate trial points. Its coarseness is parameterized by the mesh size parameter \(\delta \). The goal of each iteration is to find a mesh point whose objective function value improves on the incumbent value. Given a positive spanning set D and a center x, the related mesh is formally defined as follows:
where, with a slight abuse of notation, we use D also for the matrix \(D\in \mathbb {R}^{n \times p}\) with columns corresponding to the elements of the set D. We notice that the mesh is just a conceptual tool, and is never actually constructed.
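In this notation, the mesh is typically the countably infinite set \(\{x + \delta D z \,:\, z \in \mathbb {N}^p\}\) (see [3]). The sketch below (our own naming, for illustration only) enumerates a finite portion of it; as just noted, the mesh is never actually constructed in practice:

```python
import itertools
import numpy as np

def mesh_points(x, delta, D, z_max=1):
    """Enumerate the mesh points x + delta * D @ z for all integer
    vectors z in {0, ..., z_max}^p, a finite portion of the mesh
    generated by the matrix D (columns = directions)."""
    p = D.shape[1]
    points = []
    for z in itertools.product(range(z_max + 1), repeat=p):
        points.append(x + delta * D @ np.array(z, dtype=float))
    return points
```

For instance, with the minimal positive spanning set \(D = [e_1, e_2, -e_1-e_2]\) in \(\mathbb {R}^2\) and \(\delta = 0.5\), taking \(z \in \{0,1\}^3\) yields \(2^3 = 8\) mesh points.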
4.1 Nonsmooth objectives
With respect to the general scheme presented in Algorithm 3, here the stepsize updating rule for unsuccessful iterations is given by \(\alpha _u(\alpha _k, \Delta _k, \theta ) = \min (\Delta _k, \Delta _k^2, \theta \alpha _k)\), ensuring that \(\alpha _k \rightarrow 0\) and that the mesh gets infinitely dense if the algorithm gets stuck at a certain point. The set of search directions \(D_k\) must be such that
for all \(d \in D_k\), with \(b_i: \mathbb {R}_{> 0} \rightarrow \mathbb {R}_{> 0}\) such that \(\lim _{t \rightarrow 0}b_i(t) = 1\) for \(i\in \{1, 2\}\). Thus, with respect to the classic MADS scheme, here the frame size \(\Delta _k\) also defines a lower bound, and not only an upper bound, on the distance between the current iterate and the tentative points selected in the poll step. This adjustment is necessary due to the error in the true objective evaluation. As shown in the next lemma, condition (39) ensures that, as the stepsize converges to 0, the tentative steps get closer and closer to the boundary of a ball of radius \(\alpha _{\min }\).
Lemma 4.1
Assume that \(\alpha _{\min }> 0\) and that (39) holds. Then if \(\lim _{k \in K} \alpha _k = 0\), the set of limit points of \(\{\alpha _kD_k\}_{k \in K}\) is contained in \(S^{n_x - 1}(\alpha _{\min })\).
Proof
If \(\lim _{k \in K} \alpha _k = 0\), then it holds that, for \(k \in K\) large enough, \(\Delta _k = \alpha _{\min }\). Consider any sequence \(\{d_k\}\) with \(d_k \in D_k\). It holds that, for all \(d_k\),
where we applied (39) in the inequality. Analogously, we can prove \(\liminf _{k \in K} \Vert \alpha _k d_k\Vert \ge \Delta _k\), whence \(\lim _{k \in K} \Vert \alpha _k d_k\Vert = \alpha _{\min }\), which implies the thesis. \(\square \)
We now extend to this scheme the \((\delta , \epsilon )\)-Goldstein stationarity result proved under the sufficient decrease condition in Sect. 3.1. Also in this case, we are not aware of any analogous result for the standard MADS scheme, which is instead known to converge to Clarke stationary points [3].
We start with a lemma that extends a well known property of MADS (see, e.g., [3, Proposition 3.1]) to our bilevel setting.
Lemma 4.2
Let Assumptions 2.4, 2.5 and 4.1 hold. Then the sequence \(\{\alpha _k\}\) generated by Algorithm 3 is such that \(\liminf \alpha _k = 0\).
Proof
Since \(\{{\tilde{F}}(x_k)\}\) is non-increasing (and strictly decreasing for successful iterations), \(\{x_k\}\) is contained in the set \(\mathcal {L}_{\varepsilon }\), which is compact by Assumptions 2.5 and 4.1. Thus \(\liminf \alpha _k = 0\) follows from the finiteness of feasible points generated in \(\mathcal {L}_{\varepsilon }\) when keeping the parameter \(\alpha _k\) lower bounded, which can be proved with the same arguments used for MADS in [3, Proposition 3.1]. \(\square \)
We can now state our main result.
Theorem 4.1
Let Assumptions 2.4, 2.5 and 4.1 hold. Let K be a subset of unsuccessful iteration indices related to Algorithm 3. Let us further assume that:
-
\(\lim _{k \in K} x_k = {\bar{x}}\);
-
\(\lim _{k \in K} \alpha _k = 0\);
-
\(\{{\hat{D}}_k\}_{k \in K}\) is dense in the unit sphere, with \({\hat{D}}_k = \{ \frac{d}{\Vert d\Vert } \ | \ d\in D_k\}\);
-
Condition (39) holds.
Then, the limit point \({\bar{x}}\) of \(\{x_k\}_{k \in K}\) is \((\delta , \epsilon )\)-Goldstein stationary, for
Proof
Let \({\bar{d}} \in \mathbb {R}^n\) with \(\Vert {\bar{d}}\Vert = 1\), and let \(L \subset K\) be such that \(\lim _{k \in L} \frac{d_k}{\Vert d_k\Vert } \rightarrow {\bar{d}}\), with \(d_k \in D_k\). Then \(\alpha _k d_k \rightarrow \alpha _{\min }{\bar{d}}\) by Lemma 4.1. Now, for every \(k \in L\)
where the first inequality follows from (9), and we used that the step k is unsuccessful in the second inequality. Passing to the limit, we obtain
Now let \({\bar{F}}_{{\bar{x}}}(d) = F({\bar{x}} + d) + \frac{2L_f\varepsilon }{\alpha _{\min }^2} \Vert d\Vert ^2\). By applying (42) we get
and given that \({\bar{d}}\) is arbitrary, this holds for any d such that \(\Vert d\Vert = \alpha _{\min }\). The thesis then follows as in the proof of Theorem 3.1. \(\square \)
As in Sect. 3.1, here we also have a corollary showing that for \(\alpha _{\min }\propto \sqrt{\varepsilon }\) we are able to get a \((\mathcal {O}(\sqrt{\varepsilon }), \mathcal {O}(\sqrt{\varepsilon }))\)-Goldstein stationary point.
Corollary 4.1
Under the assumptions of Theorem 4.1, the limit point \({\bar{x}}\) of the sequence \(\{x_k\}\) generated by Algorithm 3 with \(\alpha _{\min }= 2\sqrt{L_f\varepsilon }\) is \((\delta ,\epsilon )\)-Goldstein stationary, for
4.2 Smooth objectives
We now consider the case where the true objective is smooth, i.e., Assumption 2.6 holds. With respect to the general scheme reported in Algorithm 3, we have \(\alpha _u(\alpha _k, \Delta _k, \theta ) = \min (\Delta _k, \Delta _k^2)\), and the algorithm uses (22) as termination condition in step 14, as in the smooth case of Sect. 3.2. As for \(D_k\), it must always satisfy \({{\,\textrm{cm}\,}}(D_k) \ge \kappa \) for some positive \(\kappa \) independent of k, as well as
for every \(d \in D_k\).
We remark that convergence of mesh based schemes for smooth objectives is well understood (see, e.g., [5, Chapter 7]), so that once again our main contribution here is the adaptation to the bilevel setting. We begin our analysis by extending Lemma 3.2 under the simple decrease condition and condition (45) on the search directions.
Lemma 4.3
Let Assumptions 2.4 and 2.6 hold, together with (11). Let \(\{x_k\}\) be a sequence generated by Algorithm 3. If the step k is unsuccessful, then
Proof
Since the step is unsuccessful, by considering \(d \in D_{k}\) such that
we have, reasoning as in (25) with \(c = 0\)
Finally, we get
\(\square \)
We now extend Theorem 3.2 to our mesh based scheme. The main difference is the absence of complexity estimates, which to our knowledge are not available for MADS schemes.
Theorem 4.2
Let Assumptions 2.4, 2.5 and 4.1 hold. Let \(\{x_k\}\) be a sequence generated by Algorithm 3.
1. If \(\alpha _{\min }> 0\), then the algorithm satisfies the termination condition (22) in a finite number of iterations, with the last iterate \(x_{{\bar{k}}}\) satisfying
   $$\begin{aligned} \Vert \nabla F(x_{{\bar{k}}})\Vert \le \frac{1}{\kappa } \left( \frac{b_2\alpha _{\min }L}{2} + \frac{2L_f\varepsilon }{\alpha _{\min }b_1} \right) . \end{aligned}$$(50)
2. If, furthermore, it holds that \(\alpha _{\min }= 2\sqrt{\frac{L_f \varepsilon }{b_1b_2L}}\), then
   $$\begin{aligned} \Vert \nabla F(x_{{\bar{k}}})\Vert \le \frac{2}{\kappa } \sqrt{L b_2 L_f \varepsilon / b_1} . \end{aligned}$$(51)
3. If \(\alpha _{\min }= 0\), then \(\liminf \alpha _k = 0\); if additionally \(\alpha _0 \ge \bar{\alpha }_{\min }= 2\sqrt{\frac{L_f \varepsilon }{b_1b_2L}}\), then for some \({\bar{k}} \in \mathbb {N}_0\) we have
   $$\begin{aligned} \Vert \nabla F(x_{{\bar{k}}})\Vert \le \frac{1}{\theta \kappa } \left( \frac{L\bar{\alpha }_{\min }b_2}{2} + \frac{2L_f\varepsilon }{b_1\bar{\alpha }_{\min }} \right) , \end{aligned}$$(52)
   and
   $$\begin{aligned} F(x_k) \le F(x_{{\bar{k}}}) + 2L_f\varepsilon \quad \text { for all }k\ge {\bar{k}}. \end{aligned}$$(53)
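The stepsize choice in point 2 can be checked by minimizing the right-hand side of (50) over \(\alpha _{\min }\): for constants \(A, B > 0\),

$$\begin{aligned} \min _{\alpha > 0} \left( A\alpha + \frac{B}{\alpha } \right) = 2\sqrt{AB}, \quad \text {attained at } \alpha ^* = \sqrt{B/A}, \end{aligned}$$

and taking \(A = b_2L/2\), \(B = 2L_f\varepsilon /b_1\) yields \(\alpha ^* = 2\sqrt{\frac{L_f \varepsilon }{b_1b_2L}}\), matching the value in point 2.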
Proof
1. Since the frame parameter \(\Delta _k\) is lower bounded, the mesh parameter \(\alpha _k\) is lower bounded as well; consequently, \(\bigcup _{k \in \mathbb {N}_0} M_k\) is a finite set, and the algorithm terminates in a finite number of iterations. By the termination criterion, at the last iteration \({\bar{k}}\) we have \(\Delta _{{\bar{k}}} = \alpha _{\min }\). Since the last iteration is unsuccessful, we hence get
where we applied Lemma 4.3 in the second inequality.
2. Follows from the previous point by replacing \(\alpha _{\min }\) with the given value in (50).
3. The property \(\liminf \alpha _k = 0\) follows from standard arguments used in the analysis of MADS schemes, already mentioned in the proof of Lemma 4.1. The result then follows from points 1 and 2 (similarly to point 3 in Theorem 3.2). \(\square \)
5 Numerical illustration
In this section, we evaluate the performance of the proposed algorithms on a collection of nonlinear bilevel optimization problems from the literature.
Three direct-search solvers derived from Algorithm 2 and Algorithm 3 were implemented in Matlab: Mesh-DS (related to Algorithm 3) with the mesh defined as in [5, Algorithm 8.2], Coordinate-DS (related to Algorithm 2) with \(D_k=[\mathcal {B}_{\oplus }, -\mathcal {B}_{\oplus }]\) (where \(\mathcal {B}_{\oplus }\) is the canonical basis of \(\mathbb {R}^n\)), and Random-DS (related to Algorithm 2) with \(D_k=[\frac{v}{\Vert v\Vert }, -\frac{v}{\Vert v\Vert }]\), where \(v \in \mathbb {R}^n\) is a pseudo-randomly generated vector. We note that Mesh-DS imposes a simple decrease condition to decide the acceptance of a candidate step, whereas Coordinate-DS and Random-DS use a sufficient decrease condition.
In our tests, the parameters used for Algorithm 2 and Algorithm 3 were set as follows: \(\alpha _{\min }=~10^{-6}\), \(\theta =\frac{1}{2}\), \(\alpha _0=1\), \(c=10^{-3}\), and \(\gamma =2\). For all the tested approaches, the optional search step (Step 1) was not included. Instead, in the poll step, when we observed a decrease along a specific direction, we further explored it by using a simple extrapolation strategy (i.e., we multiplied the step-size \(\alpha _k\) by \(\gamma \) and re-evaluated the function).
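To fix ideas, the following is a minimal numpy sketch of the Random-DS variant with sufficient decrease and the extrapolation strategy described above; it is a simplified illustration under our own naming choices, not the actual Matlab implementation (parameter names follow the paper where possible):

```python
import numpy as np

def direct_search(F, x0, alpha0=1.0, alpha_min=1e-6, theta=0.5,
                  gamma=2.0, c=1e-3, max_iter=2000, seed=0):
    """Simplified Random-DS: poll {v, -v} for a random unit vector v,
    accept on sufficient decrease, extrapolate successful directions,
    and contract the stepsize on unsuccessful polls."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    fx, alpha = F(x), alpha0
    for _ in range(max_iter):
        if alpha <= alpha_min:
            break
        v = rng.standard_normal(x.size)
        v /= np.linalg.norm(v)
        success = False
        for d in (v, -v):
            if F(x + alpha * d) < fx - c * alpha ** 2:   # sufficient decrease
                # extrapolation: multiply the stepsize by gamma while the
                # longer step still satisfies the decrease condition
                while F(x + gamma * alpha * d) < fx - c * (gamma * alpha) ** 2:
                    alpha *= gamma
                x = x + alpha * d
                fx, success = F(x), True
                break
        if not success:
            alpha *= theta                               # contract the stepsize
    return x, fx
```

On a smooth coercive objective, the stepsize is driven below alpha_min near a stationary point, triggering termination.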
In our implementation, the lower-level problem is solved using the fmincon Matlab procedure. To quantify the impact of inexact lower-level solutions on performance, we used two different accuracies when solving the lower-level problem (i.e., LL_tol \(\in \{10^{-3}, 10^{-6}\}\)). The remaining fmincon parameters were kept at their default values. A feasibility tolerance of \(10^{-6}\) for constraint violation was used in the solution of the lower-level problem.
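In the same spirit, the fixed-accuracy lower-level oracle \({\tilde{y}}(x)\) can be sketched as follows. This is a numpy stand-in for the fmincon call, assuming a smooth, strongly convex, unconstrained lower-level problem and access to \(\nabla _z g\); all names are illustrative:

```python
import numpy as np

def make_oracle(grad_z, z0, tol=1e-6, lr=0.1, max_iter=10_000):
    """Fixed-accuracy lower-level oracle sketch: plain gradient descent
    on z -> g(x, z), stopped once ||grad_z g(x, z)|| <= tol (the
    tolerance plays the role of LL_tol in the experiments)."""
    def oracle(x):
        z = np.asarray(z0, dtype=float).copy()
        for _ in range(max_iter):
            gz = grad_z(x, z)
            if np.linalg.norm(gz) <= tol:   # reached the requested accuracy
                break
            z -= lr * gz                    # gradient step on the lower level
        return z
    return oracle
```

Tightening tol (e.g., \(10^{-6}\) instead of \(10^{-3}\)) yields a more accurate \({\tilde{y}}(x)\) at the price of more lower-level work, which is exactly the trade-off explored in the experiments.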
The three solvers, Mesh-DS, Coordinate-DS, and Random-DS, were evaluated on 33 small-scale bilevel optimization problems from the BOLIB Matlab library [46]. This library collects academic and real-world problems. The dimensions of the tested instances, with respect to the upper-level problem, do not exceed 10 variables. Since initial points are not provided, we generated five instances of each problem by randomly selecting five different initial points, for a total of 165 problem instances.
The computational analysis is carried out using well-known tools from the literature, namely data and performance profiles (see, e.g., [38] for further details). We briefly recall their definitions. Given a set S of algorithms and a set P of problems, for \(s\in S\) and \(p \in P\), let \(t_{p,s}\) be the number of function evaluations required by algorithm s on problem p to satisfy the condition
where \(\alpha \in (0, 1)\) and \(\tilde{F}_{\text{low}}\) is the best objective function value achieved by any solver on problem p. Then, the performance and data profiles of solver s are defined by
where \(n_p\) is the dimension of problem p. We used a budget of 500 upper-level function evaluations in our experiments.
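The profiles can be computed directly from the matrix of evaluation counts \(t_{p,s}\); below is a minimal numpy sketch following the standard Moré–Wild definitions cited as [38] (function names are ours, with np.inf marking solver failures):

```python
import numpy as np

def performance_profile(T, tau):
    """rho_s(tau): fraction of problems on which solver s needs at most
    tau times the evaluations of the best solver. T[p, s] = t_{p,s},
    np.inf for failures. Assumes each problem is solved by some solver."""
    best = T.min(axis=1, keepdims=True)          # min_s t_{p,s} per problem
    return (T <= tau * best).mean(axis=0)

def data_profile(T, n_p, kappa):
    """d_s(kappa): fraction of problems solved by solver s within
    kappa * (n_p + 1) evaluations, with n_p the dimension of problem p."""
    budget = kappa * (np.asarray(n_p)[:, None] + 1)
    return (T <= budget).mean(axis=0)
```

For instance, with two solvers and three problems, a row like [2, 4] means the first solver solved that problem in half the evaluations of the second, so it counts toward the first solver's profile already at \(\tau = 1\).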
Figures 1 and 2 depict the resulting performance and data profiles, respectively, for two levels of accuracy \(\alpha \): \(10^{-3}\) and \(10^{-6}\). From Fig. 1, it can be observed that the Coordinate-DS approach performs best in terms of both efficiency (i.e., \(\tau =1\)) and robustness (i.e., larger \(\tau \)), particularly when the lower-level problem is solved accurately (i.e., LL_tol=\(10^{-6}\)). The data profiles (see Fig. 2) indicate that all the direct-search approaches perform similarly for small budgets. As the budget increases, the accuracy of the lower-level solution has a growing impact on solver performance. Overall, on the test problems, the mesh-based approach is slightly more effective for small budgets, i.e., less than \(25(n_x+1)\) evaluations. However, as the budget increases, the directional direct-search algorithms outperform the mesh-based approach.
6 Conclusion
In this work, we proposed an inexact direct-search based algorithmic framework for bilevel optimization, under the assumption that the lower-level problem can be solved within a fixed accuracy. We then proved convergence of two different classes of methods fitting our scheme, namely directional direct-search methods with sufficient decrease and mesh-based schemes with simple decrease. Our results include complexity estimates for a directional direct-search scheme tailored to BO with smooth true objective, extending previously known complexity estimates for the single-level case. We also considered the nonsmooth case and gave convergence guarantees to \((\delta , \epsilon )\)-Goldstein stationary points for both classes, thus nicely extending the known Clarke stationary point convergence properties of analogous schemes in the single-level case. A lower bound on the stepsize allows these methods to converge to a point with the desired stationarity properties in a finite number of iterations. Preliminary numerical results suggest that directional direct-search methods might lead to better performance than mesh-based strategies in this context.
Future developments include the extension of our algorithms to constrained and stochastic settings, as well as numerical comparisons with recent zeroth-order smoothing-based approaches for BO.
Data availability
The data analysed during the current study are available in the BOLIB library and the code will be made available by the authors upon reasonable request.
References
Anagnostidis, S.-K., Lucchi, A., Diouane, Y.: Direct-search for a class of stochastic min-max problems. In: Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, vol. 130, pp. 3772–3780. PMLR (2021)
Audet, C.: A Survey on Direct Search Methods for Blackbox Optimization and Their Applications. Springer, Berlin (2014)
Audet, C., Dennis, J.E., Jr.: Mesh adaptive direct search algorithms for constrained optimization. SIAM J. Optim. 17(1), 188–217 (2006)
Audet, C., Dzahini, K.J., Kokkolaras, M., Le Digabel, S.: StoMADS: stochastic blackbox optimization using probabilistic estimates. arXiv preprint arXiv:1911.01012 (2019)
Audet, C., Hare, W.: Derivative-Free and Blackbox Optimization. Springer, Cham (2017)
Beck, Y., Schmidt, M.: A Gentle and Incomplete Introduction to Bilevel Optimization (2021)
Berahas, A.S., Cao, L., Choromanski, K., Scheinberg, K.: A theoretical and empirical comparison of gradient approximations in derivative-free optimization. Found. Comput. Math. 22(2), 507–560 (2022)
Chen, L., Xu, J., Zhang, J.: On bilevel optimization without lower-level strong convexity. arXiv preprint arXiv:2301.00712 (2023)
Colson, B., Marcotte, P., Savard, G.: An overview of bilevel optimization. Ann. Oper. Res. 153, 235–256 (2007)
Conn, A.R., Scheinberg, K., Vicente, L.N.: Introduction to Derivative-Free Optimization. SIAM, Philadelphia (2009)
Conn, A.R., Vicente, L.N.: Bilevel derivative-free optimization and its application to robust optimization. Optim. Methods Softw. 27(3), 561–577 (2012)
Dempe, S.: Foundations of Bilevel Programming. Springer, Berlin (2002)
Dempe, S.: Bilevel optimization: theory, algorithms, applications and a bibliography. In: Bilevel Optimization: Advances and Next Challenges, pp. 581–672 (2020)
Dzahini, K.J.: Expected complexity analysis of stochastic direct-search. Comput. Optim. Appl. 81, 179–200 (2022)
Ehrhardt, M.J., Roberts, L.: Inexact derivative-free optimization for bilevel learning. J. Math. Imaging Vis. 63(5), 580–600 (2021)
Fasano, G., Liuzzi, G., Lucidi, S., Rinaldi, F.: A linesearch-based derivative-free approach for nonsmooth constrained optimization. SIAM J. Optim. 24(3), 959–992 (2014)
Franceschi, L., Frasconi, P., Salzo, S., Grazzi, R., Pontil, M.: Bilevel programming for hyperparameter optimization and meta-learning. In: International Conference on Machine Learning, pp. 1568–1577. PMLR (2018)
Grazzi, R., Franceschi, L., Pontil, M., Salzo, S.: On the iteration complexity of hypergradient computation. In: International Conference on Machine Learning, pp. 3748–3758. PMLR (2020)
Halton, J.H.: On the efficiency of certain quasi-random sequences of points in evaluating multi-dimensional integrals. Numer. Math. 2, 84–90 (1960)
Ji, K., Liang, Y.: Lower bounds and accelerated algorithms for bilevel optimization. J. Mach. Learn. Res. 24(22), 1–56 (2023)
Ji, K., Yang, J., Liang, Y.: Bilevel optimization: convergence analysis and enhanced design. In: International Conference on Machine Learning, pp. 4882–4892. PMLR (2021)
Jordan, M.I., Kornowski, G., Lin, T., Shamir, O., Zampetakis, M.: Deterministic nonsmooth nonconvex optimization. arXiv preprint arXiv:2302.08300 (2023)
Karimi, H., Nutini, J., Schmidt, M.: Linear convergence of gradient and proximal-gradient methods under the Polyak–Łojasiewicz condition. In: Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19–23, 2016, Proceedings, Part I 16, pp. 795–811. Springer, Berlin (2016)
Khanduri, P., Zeng, S., Hong, M., Wai, H.-T., Wang, Z., Yang, Z.: A near-optimal algorithm for stochastic bilevel optimization via double-momentum. Adv. Neural Inf. Process. Syst. 34, 30271–30283 (2021)
Kleinert, T., Labbé, M., Ljubić, I., Schmidt, M.: A survey on mixed-integer programming techniques in bilevel optimization. EURO J. Comput. Optim. 9, 100007 (2021)
Kolda, T.G., Lewis, R.M., Torczon, V.: Optimization by direct search: new perspectives on some classical and modern methods. SIAM Rev. 45(3), 385–482 (2003)
Larson, J., Menickelly, M., Wild, S.M.: Derivative-free optimization methods. Acta Numerica 28, 287–404 (2019)
Lin, T., Zheng, Z., Jordan, M.I.: Gradient-free methods for deterministic and stochastic nonsmooth nonconvex optimization. Adv. Neural Inf. Process. Syst. 35, 26160–26175 (2022)
Liu, B., Ye, M., Wright, S., Stone, P., Liu, Q.: BOME! Bilevel optimization made easy: a simple first-order approach. Adv. Neural Inf. Process. Syst. 35, 17248–17262 (2022)
Liu, R., Liu, X., Yuan, X., Zeng, S., Zhang, J.: A value-function-based interior-point method for non-convex bi-level optimization. In: International Conference on Machine Learning, pp. 6882–6892. PMLR (2021)
Liu, R., Liu, Y., Zeng, S., Zhang, J.: Towards gradient-based bilevel optimization with non-convex followers and beyond. Adv. Neural Inf. Process. Syst. 34, 8662–8675 (2021)
Liu, R., Mu, P., Yuan, X., Zeng, S., Zhang, J.: A generic first-order algorithmic framework for bi-level programming beyond lower-level singleton. In: International Conference on Machine Learning, pp. 6305–6315. PMLR (2020)
Liuzzi, G., Lucidi, S., Rinaldi, F., Vicente, L.N.: Trust-region methods for the derivative-free optimization of nonsmooth black-box functions. SIAM J. Optim. 29, 3012–3035 (2019)
Lucidi, S., Sciandrone, M.: A derivative-free algorithm for bound constrained optimization. Comput. Optim. Appl. 21, 119–142 (2002)
Maheshwari, C., Sastry, S.S., Ratliff, L., Mazumdar, E.: Convergent first-order methods for bi-level optimization and Stackelberg games. arXiv preprint arXiv:2302.01421 (2023)
Menickelly, M., Wild, S.M.: Derivative-free robust optimization by outer approximations. Math. Program. 179, 157–193 (2020)
Mersha, A.G., Dempe, S.: Direct search algorithm for bilevel programming problems. Comput. Optim. Appl. 49(1), 1–15 (2011)
Moré, J.J., Wild, S.M.: Benchmarking derivative-free optimization algorithms. SIAM J. Optim. 20, 172–191 (2009)
Nesterov, Y., Spokoiny, V.: Random gradient-free minimization of convex functions. Found. Comput. Math. 17, 527–566 (2017)
Rando, M., Molinari, C., Rosasco, L., Villa, S.: An optimal structured zeroth-order algorithm for non-smooth optimization. arXiv preprint arXiv:2305.16024 (2023)
Rinaldi, F., Vicente, L.N., Zeffiro, D.: A weak tail-bound probabilistic condition for function estimation in stochastic derivative-free optimization. arXiv preprint arXiv:2202.11074 (2022)
Venturini, S., Cristofari, A., Rinaldi, F., Tudisco, F.: Learning the right layers: a data-driven layer-aggregation strategy for semi-supervised learning on multilayer graphs. arXiv preprint arXiv:2306.00152 (2023)
Vicente, L.N.: Worst case complexity of direct search. EURO J. Comput. Optim. 1(1–2), 143–153 (2013)
Zhang, D., Lin, G.-H.: Bilevel direct search method for leader-follower problems and application in health insurance. Comput. Oper. Res. 41, 359–373 (2014)
Zhang, Y., Yao, Y., Ram, P., Zhao, P., Chen, T., Hong, M., Wang, Y., Liu, S.: Advancing model pruning via bi-level optimization. Adv. Neural Inf. Process. Syst. 35, 18309–18326 (2022)
Zhou, S., Zemkoho, A.B., Tin, A.: BOLIB: bilevel optimization library of test problems. arXiv preprint arXiv:1812.00230v3 (2020)
Funding
Open access funding provided by Università degli Studi di Padova within the CRUI-CARE Agreement.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Diouane, Y., Kungurtsev, V., Rinaldi, F. et al. Inexact direct-search methods for bilevel optimization problems. Comput Optim Appl (2024). https://doi.org/10.1007/s10589-024-00567-7