1 Introduction

In this paper we consider unconstrained black box optimization (BBO) or derivative-free optimization (DFO); see, e.g., [3, 12, 56]. The labels BBO and DFO are used in practice synonymously, though with slightly different emphasis. Algorithms for BBO/DFO repeatedly call an oracle (a black box, to which BBO refers) that returns for any given \(x\in {\mathbb {R}}^n\) a real function value f(x) uniquely determined by x, possibly also \(\infty \) or NaN (not a number). In this way, they try, using no other information about f (such as continuity, Lipschitz constants, differentiability, or derivative information, to which DFO refers), to find a minimizer of the underlying function f. However, although no such information is used for executing the algorithms, the motivation and analysis of the algorithms always assumes, at least informally, that the function f has reasonable mathematical properties. To be able to give performance guarantees, such properties are essential.

1.1 Related work

There is a huge amount of literature on BBO problems and how to solve them; we mention only a few pointers. A thorough survey of derivative-free optimization methods was given by Larson et al. [39]. Another useful paper, by Rios & Sahinidis [56], discusses the practical behaviour of derivative-free optimization software packages. The techniques for solving BBO problems fall into two classes, deterministic and randomized methods. We mainly discuss the randomized case; for deterministic methods see, e.g., the book by Conn et al. [12] and its many references. Randomized methods for BBO going back to Rastrigin [55], Polyak [49], and van Laarhoven & Aarts [58] were later discussed especially in the framework of evolutionary optimization [6, 28, 57]. There are also randomized BBO algorithms for deterministic problems, e.g., Bandeira et al. [7] and Holland [30]. For deterministic global BBO see, e.g., Hansen [27], and for stochastic global BBO see, e.g., Zhigljavsky [62]. Other useful BBO references are Audet & Hare [3], Moré & Wild [42], and Müller & Woodbury [43].

Previous BBO software of the optimization group at the University of Vienna includes the deterministic algorithms GRID [20, 21] and MCS [31] and the randomized algorithms SnobFit [32] and VXQR [46]. Software by many others is mentioned in Sect. 7.3.

1.2 Known complexity results

This section discusses known complexity results in the deterministic and randomized cases.

Throughout the paper, we use a scaled 2-norm \(\Vert p\Vert \) and its dual norm \(\Vert g\Vert _*\) of \(p,g\in {\mathbb {R}}^n\), defined in terms of a positive scaling vector \(s\in {\mathbb {R}}^n\) by

$$\begin{aligned} \Vert p\Vert :=\sqrt{\sum _i p_i^2/s_i^2},~\quad \Vert g\Vert _*:=\sqrt{\sum _i s_i^2g_i^2}. \end{aligned}$$
(1)

For the choice of a suitable scaling vector see Sect. 5.
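To make the scaled norms (1) concrete, here is a minimal Python sketch (not part of the paper's software; the function names are ours), assuming NumPy:

```python
import numpy as np

def scaled_norm(p, s):
    # ||p|| = sqrt(sum_i p_i^2 / s_i^2), the scaled 2-norm from (1)
    return np.sqrt(np.sum((p / s) ** 2))

def dual_norm(g, s):
    # ||g||_* = sqrt(sum_i s_i^2 g_i^2), its dual norm from (1)
    return np.sqrt(np.sum((s * g) ** 2))
```

For s equal to the all-ones vector, both reduce to the ordinary 2-norm, and the generalized Cauchy–Schwarz inequality \(|g^Tp|\le \Vert g\Vert _*\Vert p\Vert \) holds in the scaled coordinates.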

Assumptions

For the mathematical analysis of our algorithm we assume, as customary in the literature on complexity results, that

  1. (A1)

      the function f is continuously differentiable on \({\mathbb {R}}^n\), and its gradient \(g(x)\) is Lipschitz continuous on \({\mathbb {R}}^n\) with Lipschitz constant L;

  2. (A2)

      the level set \({{{\mathcal {L}}}}(x^0):=\{x\in {\mathbb {R}}^n\mid f(x)\le f(x^0)\}\) of f at \(x^0\) is compact.

Under these conditions, a global minimizer \(\widehat{x}\) of f exists and

$$\begin{aligned} {\widehat{f}}:=f(\widehat{x}):=\displaystyle \inf \{f(x) \mid x\in {\mathbb {R}}^n\} \end{aligned}$$
(2)

is finite. x is called an \(\varepsilon \)-approximate stationary point if

$$\begin{aligned} \Vert g(x)\Vert _* \le \varepsilon \end{aligned}$$
(3)

holds. For fixed \(\varepsilon >0\), \(\varepsilon \)-approximate stationary points may also exist in regions where the graph of f is sufficiently flat, although no stationary point is nearby. If such a point is encountered, optimization methods may slow down dramatically, so that a spurious apparent local minimizer is found.

If \(\sigma >0\) and the condition

$$\begin{aligned} f(x')\ge f(x)+g(x)^T(x'-x)+\frac{\sigma }{2}\Vert x'-x\Vert ^2 \ \ \hbox {for all}\, x,x'\in {\mathbb {R}}^n \end{aligned}$$
(4)

holds then f is called \(\sigma \)-strongly convex. In this case, the global optimizer \(\widehat{x}\) is unique, and (3) guarantees that iterates are near \({\widehat{x}}\).

Proposition 1

If f(x) is a \(\sigma \)-strongly convex quadratic function, then (3) implies \(f(x) - {\widehat{f}} \le \varepsilon ^2/(2\sigma )\) and \(\Vert x-{\widehat{x}}\Vert ^2\le \varepsilon ^2/\sigma ^2\) for \(x\in {\mathbb {R}}^n\).

Proof

For fixed x, the right-hand side of (4) is a convex quadratic function of \(x'\), minimal when its gradient with respect to \(x'\) vanishes. By (1), this is the case iff \(x'_i\) takes the value \(x_i-\displaystyle \frac{s_i^2}{\sigma }g_i(x)\) for \(i=1,\ldots ,n\), so that \({\widehat{f}}\ge f(x)-\displaystyle \frac{1}{2\sigma }\Vert g(x)\Vert _*^2\) for \(x\in {\mathbb {R}}^n\). Therefore we conclude from (3) and (4) that for \(x\in {\mathbb {R}}^n\)

$$\begin{aligned} f(x)-{\widehat{f}}\le \displaystyle \frac{1}{2\sigma }\Vert g(x)\Vert _*^2 \le \displaystyle \frac{\varepsilon ^2}{2\sigma } \ \ \hbox {and} \ \ \Vert x-{\widehat{x}}\Vert ^2\le \frac{2}{\sigma }\Big (f(x)-{\widehat{f}}-g({\widehat{x}})^T(x-{\widehat{x}})\Big ) \le \displaystyle \frac{\varepsilon ^2}{\sigma ^2}, \end{aligned}$$

since the gradient vanishes at \({\widehat{x}}\). \(\square \)
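As an illustration (ours, not part of the paper), the following Python sketch checks Proposition 1 on the simple strongly convex quadratic \(f(x)=\frac{\sigma }{2}\Vert x\Vert _2^2\) with \(s\) the all-ones vector; the chosen point attains equality in both bounds:

```python
import numpy as np

# f(x) = (sigma/2)*||x||^2 has unique minimizer xhat = 0, fhat = 0,
# and gradient g(x) = sigma*x; the scaling vector s is all ones.
sigma, eps, n = 2.0, 0.3, 4
x = (eps / sigma) * np.ones(n) / np.sqrt(n)   # chosen so that ||g(x)||_* = eps
g = sigma * x
f = 0.5 * sigma * np.dot(x, x)

assert np.linalg.norm(g) <= eps + 1e-12           # (3) holds
assert f <= eps**2 / (2 * sigma) + 1e-12          # f(x) - fhat <= eps^2/(2 sigma)
assert np.dot(x, x) <= eps**2 / sigma**2 + 1e-12  # squared distance to xhat
```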

In exact precision arithmetic, the exact gradient vanishes at a stationary point. But in finite precision arithmetic, optimization methods may get stuck in nearly flat regions, so that a spurious apparent local minimizer may be found. For finite termination, a theoretical criterion is needed that certifies an \(\varepsilon \)-approximate stationary point. For a given threshold \(\varepsilon >0\), a complexity bound for an unconstrained BBO method tells how many function evaluations, \(N(\varepsilon )\), are needed to find, with a given probability (or a related goal), a point \(x^{{\mathrm{best}}}\) whose function value \(f(x^{\mathrm{best}})\) is below the initial function value \(f(x^0)\) and whose unknown gradient norm \(\Vert g(x^{{\mathrm{best}}})\Vert _*\) is below \(\varepsilon \), i.e.,

$$\begin{aligned} f(x^{\mathrm{best}}) \le \sup \{f(x)\mid x\in {\mathbb {R}}^n, \Vert g(x)\Vert _* \le \varepsilon ,\,\hbox { and }\, f(x)\le f(x^0)\}. \end{aligned}$$
(5)

(5) says that, in terms of function values, \(x^{{\mathrm{best}}}\) is at least as good as the worst \(\varepsilon \)-approximate stationary point with function value at most \(f(x^0)\). Since gradients and Lipschitz constants are unknown to us, we cannot say which point satisfies (5). But the result implies that the final best point has a function value at least as good as that of some point whose gradient was small. If gradients are small only near local optimizers, this produces a point close to such an optimizer. If some iterate passes close to a non-global local optimizer or a saddle point, the algorithm may escape its neighborhood; in this case, only a variant with restarts would produce convergence to a point with a small gradient.

Under the assumptions (A1)–(A2), the appropriate asymptotic form for the expression \(N(\varepsilon )\), found by Vicente [60], Dodangeh & Vicente [17], Dodangeh, Vicente & Zhang [18], Gratton et al. [24], Bergou, Gorbunov & Richtárik [9], and Nesterov & Spokoiny [44, 45], depends on the properties (smooth, smooth convex, or smooth strongly convex) of f; cf. Sect. 2.1 below.

Table 1 Complexity results for randomized BBO in expectation (Bergou et al. [9] for all cases)

Bergou et al. [9] and Nesterov & Spokoiny [45] generalized this result to give algorithms with complexity results for the nonconvex, convex, and strongly convex cases shown in Table 1. In each case, the bounds are better by a factor of n than the best known complexity results for deterministic algorithms (by Dodangeh & Vicente [17], Vicente [60], and Konečný & Richtárik [37]) given in Table 2. Of course, being for a randomized algorithm, the performance guarantee obtained by Bergou et al. is slightly weaker, valid only in expectation. Moreover, their step sizes are generated without testing whether the function value improves. This is why the algorithms proposed by Bergou et al. [9] perform poorly numerically; see Fig. 3 in Sect. 7.

Table 2 Complexity results for deterministic BBO (Vicente [60] for the nonconvex case, Dodangeh & Vicente [17] for the convex and the strongly convex cases, Konečný & Richtárik [37] for all cases)

The best complexity bound for a direct search with probabilistic (rather than expectation) guarantees was found by Gratton et al. [24], but only for the nonconvex case. They used Chernoff bounds to prove that a complexity bound \({{\mathcal {O}}}(nR\varepsilon ^{-2})\) holds, where R is the number of random directions (independently uniformly distributed on the unit sphere) used in each iteration, satisfying

$$\begin{aligned} R>\log \left( 1-\frac{\ln (\gamma _1)}{\ln (\gamma _2)}\right) , \end{aligned}$$

where \(0<\gamma _1<1\) is a factor for reducing step sizes and \(\gamma _2>1\) is a factor for expanding step sizes. If \(\gamma _1=0.5\) and \(\gamma _2=2\), then \(R=2\).

1.3 Our contribution

We describe and test a new, practically very efficient randomized method, called VRBBO (short for Vienna randomized black box optimization), for which good local complexity results can be proved, and which is competitive in comparison with the state-of-the-art local and global BBO solvers. An algorithm loosely related to VRBBO (but without complexity guarantees) is the Hit-and-Run algorithm of Bélisle [8].

A basic version of VRBBO. In Sect. 2, an extrapolation step, called extrapolationStep, is discussed, and then a multi-line search with random directions, called MLS-basic, is constructed that tries extrapolationStep. Sect. 3 first introduces a basic version of our fixed decrease search algorithm, called FDS-basic, whose goal is to achieve a decrease in the function value; it repeatedly calls MLS-basic until the function value is decreased. Then a basic version of VRBBO, called VRBBO-basic, is introduced, which repeatedly calls FDS-basic until an \(\varepsilon \)-approximate stationary point is found; see Flowcharts (a)–(c) of Fig. 1.

Fig. 1

Flowchart for (a) VRBBO-basic, (b) FDS-basic, (c) MLS-basic. Here R is the number of random directions used by MLS-basic

Complexity results for VRBBO-basic. Sect. 4 derives our complexity bound for the nonconvex case, of the same order and with the same factor as the one found by Gratton et al. [24] using Chernoff bounds; the difference is that the constant factor of our bound is obtained from a result of Pinelis [48]. Both complexity results are better by a factor of R/n than those given in Table 1 and stronger than those given in Table 2. Our complexity bounds for the convex and strongly convex cases are proved with probability arbitrarily close to 1; these are new results, stronger than the bounds in Table 1, which are valid only in expectation. Table 3 summarizes our complexity results for all cases, matching Gratton et al. [24] for the nonconvex case. As discussed in Sect. 1.2, Gratton et al.'s results for the nonconvex case allow \(R=2\), while VRBBO-basic needs

$$\begin{aligned} R = \varOmega (\log \eta ^{-1}) \ \ \hbox {for a given}\, 0<\eta \le \frac{1}{2}. \end{aligned}$$

But \(\log \eta ^{-1}\) is not large for reasonable values of \(\eta \). In practice, VRBBO-basic works best with a much larger value \(R={{\mathcal {O}}}(n)\).
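The requirement \(R=\varOmega (\log \eta ^{-1})\) is realized by the concrete choice \(R=\lceil \log _2\eta ^{-1}\rceil \) used later in Theorem 1(ii); a tiny sketch (ours):

```python
import math

def num_directions(eta):
    # R = ceil(log2(1/eta)), the choice used in Theorem 1(ii);
    # this realizes R = Omega(log eta^{-1}).
    return math.ceil(math.log2(1.0 / eta))

print(num_directions(0.5), num_directions(0.01))  # -> 1 7
```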

Table 3 Complexity results for randomized BBO with probability \(1-\eta \), for fixed \(0<\eta \le \frac{1}{2}\) (Gratton et al. [24] for the nonconvex case with \(R=2\) and present paper for all cases, with \(R= \varOmega (\log \eta ^{-1})\))

Heuristic techniques. We add many new useful heuristic techniques – discussed in Sect. 5 – to VRBBO-basic that make it very competitive, in particular:

  • Several kinds of search directions ensure good practical performance.

  • Adaptive heuristic estimations for the Lipschitz constant are used.

  • A sensible scaling vector is estimated.

  • The gradient vector is estimated by a randomized finite difference approach.

These heuristic techniques improve the performance in practice, leading to the implementations of FDS and VRBBO documented in Sect. 6.

Numerical results. In Sect. 7 we compare all solvers (including some good global ones) on the unconstrained CUTEst test problems of Gould et al. [23] for optimization and on the test problems, called GlobalTest, of Jamil & Yang [33] for global optimization, with 2–5000 variables. The numerical results show that VRBBO matches the quality of state-of-the-art global algorithms for finding a global minimizer with reasonable accuracy. Although our theory only guarantees local minimizers, FDS together with our heuristic techniques turns VRBBO into an efficient global solver. For example, for large \(\varDelta \), FDS initially takes only large steps, hence has a global character.

2 A new line search technique

In this section, we describe a method that tries to achieve a decrease in the function value using line searches along specially chosen random directions. In our algorithm, random directions are used because it is known that randomized black box optimization methods have a worst-case complexity lower by a factor of n than that of deterministic algorithms (cf. [7]).

A line search then polls one or more points along the lines in each chosen direction starting at the current best point. Several such line searches are packaged together into a basic multi-line search, for which strong probabilistic results can be proved.

The details are chosen in such a way that failure to achieve the desired descent implies that, with probability arbitrarily close to one, a bound on the unknown gradient vector is obtained.

2.1 Probing a direction

Let \(\varDelta \ge 0\) be the threshold for improvements of the function value, and for \(x,p\in {\mathbb {R}}^n\) call \(f(x)-f(x\pm p)\) the gain along \(\pm p\). First we give a theoretical test that either results in a gain of \(\varDelta \) or more in the function value, or gives a small upper bound for the norm of at least one of the unknown gradients encountered, even though our algorithm never computes gradients.

Assumption (A1) implies that for every \(x,p\in {\mathbb {R}}^n\), we have

$$\begin{aligned} f(x+p)-f(x)=g(x)^Tp+\frac{1}{2}\gamma \Vert p\Vert ^2, \end{aligned}$$
(6)

where \(\gamma \) depends on x and p and satisfies one of

$$\begin{aligned} |\gamma |\le & {} L,~~~(\hbox {general case)} \end{aligned}$$
(7)
$$\begin{aligned} 0\le & {} \gamma \le L,~~~(\hbox {convex case)} \end{aligned}$$
(8)
$$\begin{aligned} 0<\sigma\le & {} \gamma \le L.~~~(\hbox {strongly convex case)} \end{aligned}$$
(9)

Here \(\sigma \) comes from (4). In all three cases,

$$\begin{aligned} g(x)^Tp-\frac{1}{2}L\Vert p\Vert ^2\le f(x+p)-f(x)\le g(x)^Tp+\frac{1}{2}L\Vert p\Vert ^2. \end{aligned}$$
(10)

Continuity and condition (A2) imply that a minimizer \({\widehat{x}}\) exists and

$$\begin{aligned} r_0:=\sup \Big \{\Vert x-{\widehat{x}}\Vert \mid x\in {\mathbb {R}}^n \ \ \hbox {and}\ \ f(x)\le f(x^0)\Big \}<\infty . \end{aligned}$$
(11)

(It is enough that this holds with \(x^0\) replaced by some point found during the iteration, which is then taken as \(x^0\)).

Proposition 2

Let \(x,p\in {\mathbb {R}}^n\) and \(\varDelta \ge 0\). Then (A1)  implies that

$$\begin{aligned} L\ge \frac{|f(x+p)+f(x-p)-2f(x)|}{\Vert p\Vert ^2} \end{aligned}$$
(12)

and at least one of the following holds:

  1. (i)

    \(f(x+p) < f(x) - \varDelta \),

  2. (ii)

    \(f(x+p) > f(x) + \varDelta \) and \(f(x-p) < f(x) -\varDelta \),

  3. (iii)

    \(|g^Tp| \le \varDelta +\displaystyle \frac{1}{2}L\Vert p\Vert ^2\).

Proof

Taking the sum of (10) and the formula obtained from it by replacing p with \(-p\) gives (12).

Assume that (iii) is violated, so that \(|g(x)^Tp| > \varDelta +\frac{1}{2}L\Vert p\Vert ^2\). If \(g(x)^Tp \le 0\), then by (10)

$$\begin{aligned} f(x+ p) -f(x) \le g(x)^Tp+\frac{1}{2}L \Vert p\Vert ^2=-|g^Tp|+\frac{1}{2}L\Vert p\Vert ^2 < -\varDelta . \end{aligned}$$
(13)

If \(g(x)^Tp\ge 0\), then similarly

$$\begin{aligned} f(x- p) -f(x) \le g(x)^T(-p)+\frac{1}{2}L \Vert p\Vert ^2=-|g^Tp|+\frac{1}{2}L\Vert p\Vert ^2 < -\varDelta . \end{aligned}$$
(14)

If (13) holds we conclude that (i) holds. If (14) holds we get the second half of (ii), and the first half follows from

$$\begin{aligned} f(x+p)-f(x)\ge g(x)^Tp-\frac{1}{2}L \Vert p\Vert ^2 > \varDelta . \square \end{aligned}$$

Proposition 2 will play a key role in the construction of our basic multi-line search MLS-basic detailed in Sect. 2.3:

  • It establishes the well-known (Evtushenko [22]) lower bound (12) for the Lipschitz constant L which can be used to find reasonable approximations for L.

  • If (i) holds, then the step p gives a gain of at least \(\varDelta \), called the sufficient gain.

  • If (ii) holds, then the step \(-p\) gives a sufficient gain.

  • If neither (i) nor (ii) holds (no sufficient gain is found along \(\pm p\)) then (iii) holds, giving a useful upper bound for the directional derivative.

In particular, this allows us to prove statements about the unknown gradient even though our algorithm never calculates one.
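The trichotomy of Proposition 2 can be sketched as a probing routine; the following Python code is our own illustration (names hypothetical, \(s\) the all-ones vector so that \(\Vert p\Vert \) is the 2-norm), not the paper's implementation:

```python
import numpy as np

def probe(f, x, p, Delta, L_est=0.0):
    # Probe direction p at x as in Proposition 2: return a step giving a
    # gain > Delta (cases (i)/(ii)), or None (case (iii), in which the
    # directional derivative bound |g^T p| <= Delta + L||p||^2/2 holds).
    # Also returns the Evtushenko lower bound (12) on the Lipschitz constant.
    fx, fp, fm = f(x), f(x + p), f(x - p)
    nrm2 = np.dot(p, p)                                 # ||p||^2 with s = ones
    L_new = max(L_est, abs(fp + fm - 2 * fx) / nrm2)    # (12)
    if fp < fx - Delta:                                 # case (i)
        return x + p, L_new
    if fp > fx + Delta and fm < fx - Delta:             # case (ii)
        return x - p, L_new
    return None, L_new                                  # case (iii)
```

For the quadratic \(f(x)=\Vert x\Vert _2^2\) the returned Lipschitz lower bound is exact, since (12) holds with equality for quadratics.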

2.2 Random search directions

For our complexity results, we need sufficiently many search directions p to satisfy an angle condition of the form

$$\begin{aligned} \sup \frac{g^Tp}{\Vert g\Vert _*\Vert p\Vert }\le -\varDelta ^{\mathrm{a}}<0. \end{aligned}$$
(15)

Here g is the gradient of the current best point and \(\varDelta ^{\mathrm{a}}>0\) is a tuning parameter for the angle condition.

Random directions are independent and identically distributed (i.i.d.), uniformly in \([-\frac{1}{2},\frac{1}{2}]^n\), computed by

$$\begin{aligned} p=\mathrm{rand}(n,1)-0.5, \end{aligned}$$
(16)

where \(\mathrm{rand}(n,1)\) generates a random vector uniformly distributed in \([0,1]^n\).

The following variant of the angle condition (15) plays a key role to get our complexity bounds.

Proposition 3

A random search direction generated by (16) and scaled by

$$\begin{aligned} p:=p(\delta /\Vert p\Vert ) \end{aligned}$$
(17)

satisfies \(\Vert p\Vert =\delta \) and, with probability \(\ge \displaystyle \frac{1}{2}\),

$$\begin{aligned} \Vert g(x)\Vert _*\Vert p\Vert \le 2\sqrt{cn} |g(x)^Tp| \end{aligned}$$
(18)

with a positive constant \(c\le 12.5\).

Proof

As defined earlier in Sect. 1.2, \(s\in {\mathbb {R}}^n\) is a scaling vector. Denote by \(s_i\) the ith component of s and define \(\overline{p}_i:=p_i/s_i\) and \(\overline{g}_i:=s_ig_i\). Then by (1), \(g^Tp=\overline{g}^T\overline{p}\) and \(\Vert g\Vert _*=\Vert \overline{g}\Vert _2\) and \(\Vert p\Vert =\Vert \overline{p}\Vert _2\); so the results of Sect. 9.1 apply after scaling and give \(c=c_0/4\le 12.5\). \(\square \)

A simulation result in Sect. 9.1 suggests that in practice \(c\approx 4/7\).
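The probabilistic guarantee of Proposition 3 is easy to probe empirically; the following Monte Carlo sketch (ours, with \(s\) the all-ones vector) generates directions by (16), scales them by (17), and counts how often (18) holds with \(c=12.5\):

```python
import numpy as np

rng = np.random.default_rng(1)
n, delta, trials = 20, 0.1, 10000
g = rng.standard_normal(n)                  # a fixed "gradient"; s = ones
c = 12.5                                    # the constant from Proposition 3

hits = 0
for _ in range(trials):
    p = rng.random(n) - 0.5                 # (16): uniform in [-1/2, 1/2]^n
    p = p * (delta / np.linalg.norm(p))     # (17): now ||p|| = delta
    # (18): ||g||_* ||p|| <= 2 sqrt(c n) |g^T p| ?
    if np.linalg.norm(g) * delta <= 2 * np.sqrt(c * n) * abs(g @ p):
        hits += 1

print(hits / trials)   # empirically well above the guaranteed 1/2
```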

2.3 A multi-line search

In this section, we construct a multi-line search algorithm, called MLS-basic. For each of several random directions [generated by (16), scaled by (17), and satisfying (18) with probability \(\ge \frac{1}{2}\)], it polls a few objective function values in a line search fashion, in the hope of finding sufficient gains of more than a multiple of \(\varDelta \).

2.3.1 An extrapolation step

As discussed in Sect. 1.3, the main ingredient of VRBBO-basic is FDS-basic, which repeatedly calls MLS-basic until at least a sufficient gain is found. The accelerating ingredient of MLS-basic is extrapolation, whose goal is to speed up progress towards a minimizer by expanding step sizes and computing the corresponding trial points and their function values as long as sufficient gains are found. We discuss how to construct such extrapolation steps, called extrapolationStep, in the hope of finding sufficient gains. extrapolationStep may perform extrapolation along either the search direction p or its opposite direction \(-p\).

Let \(\{x^k\}_{k\ge 0}\) be the sequence generated by VRBBO-basic. In the kth iteration of this algorithm, FDS-basic takes as input the \((k-1)\)th point \(\mathtt{xm}=x^{k-1}\) and its function value \(\mathtt{fm}=f_{k-1}\) generated by VRBBO-basic, and returns the kth point \(x^{k}=\mathtt{xm}\) and its function value \(f_k= \mathtt{fm}\) as output if at least a sufficient gain is found by MLS-basic; otherwise \(x^k=x^{k-1}\) and \(f_k=f_{k-1}\). In fact, after the kth iteration of VRBBO-basic, xm is the current trial point evaluated by extrapolationStep, obtained from a sufficient gain, accepted as a new point, and called the best point. Hence all points \(x^k\), for \(k=1,2,\ldots \), are best points found by extrapolationStep. The last point generated by VRBBO-basic is called the overall best point.

Care must be taken to ensure that the book-keeping needed for the evaluation of the lower bound for the Lipschitz constant comes out correctly. To ensure this during an extrapolation step, we always use \(\mathtt{xm}\) for the best point found by extrapolationStep such that the next evaluation is always at \(\mathtt{xm}+p\) and a former third evaluation point is at \(\mathtt{xm}-p\). The function values immediately after the next evaluation are then

$$\begin{aligned} \mathtt{fl}:=f(\mathtt{xm}-p),~~~\mathtt{fm}:=f(\mathtt{xm}),~~~\mathtt{fr}:=f(\mathtt{xm}+p). \end{aligned}$$
(19)

At this stage, we can compute the lower bound

$$\begin{aligned} \lambda :=\max \left( \lambda _{\mathrm{old}},|\mathtt{fl}+\mathtt{fr}-2\mathtt{fm}|/\delta ^2\right) \end{aligned}$$
(20)

for the Lipschitz constant L, valid by (12). Note that the initial \(\lambda _{\mathrm{old}}\) is the tuning parameter \(\lambda _{\max }\), however, it is updated by extrapolationStep and may be estimated by a heuristic formula.
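The update (20) is a one-liner; a sketch (ours), assuming \(\Vert p\Vert =\delta \) as guaranteed by the scaling (17):

```python
def lipschitz_lower_bound(fl, fm, fr, delta, lam_old):
    # Update (20): lower bound for L from three collinear points
    # xm - p, xm, xm + p with ||p|| = delta; valid by (12).
    return max(lam_old, abs(fl + fr - 2.0 * fm) / delta**2)
```

For a univariate quadratic such as \(f(x)=x^2\) (with \(L=2\)) the bound is exact: with \(\mathtt{xm}=3\) and \(\delta =0.5\) one gets \(|6.25+12.25-18|/0.25=2\).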

As defined earlier in Sect. 2.1, \(\mathtt{df}:= \mathtt{fm}-\mathtt{fr}\) is the gain. Given the tuning parameter \(0<\gamma _{\min }<1\), if the condition

$$\begin{aligned} \mathtt{df} >\overline{\varDelta }=\gamma _{\min }\varDelta \end{aligned}$$
(21)

holds, a sufficient gain is found and the best point is updated by overwriting \(\mathtt{xm}\) with \(\mathtt{xm}+p\), with the consequence that in this case

$$\begin{aligned} \mathtt{fl}:=f(\mathtt{xm}-2p),~~~\mathtt{fm}:=f(\mathtt{xm}-p),~~~\mathtt{fr}:=f(\mathtt{xm}). \end{aligned}$$
(22)

R denotes the number of random search directions used in MLS-basic and \(\mathbf{a}\) denotes the list of R extrapolation step sizes. All components of the initial list \(\mathbf{a}\) are one, i.e., \(\mathbf{a}(t)=1\) for \(t=1,\ldots ,R\). These components are expanded or reduced according to whether sufficient gains are found or not. Let \(n_{\mathrm{sg}}\) be the number of sufficient gains found by extrapolationStep. If the counter \(n_{\mathrm{sg}}\) remains zero, extrapolationStep could not find a sufficient gain. t is a counter over the R directions, taking the values \(0,\ldots ,R\). It does not change inside extrapolationStep, but is updated outside extrapolationStep (inside MLS-basic).

We must be careful to make sure that the estimate of the Lipschitz constant is correct, especially when an extrapolation step improving the function value is tried. This estimate is computed (i) after the opposite direction is tried: since there is no sufficient gain along the direction p, its opposite direction is tried, and \(\lambda \) is estimated by (20) according to (19); (ii) after the first sufficient gain is found: for this estimate, fl, fm, fr are needed, and since a sufficient gain was found, \(\lambda \) is estimated by (20) according to (22).

In summary, extrapolationStep first takes the initial step size \(\alpha _e=1\), which is needed to approximate a lower bound for the unknown Lipschitz constant L. It then chooses step sizes starting from \(\mathbf{a}(t)\), expanding them as long as sufficient gains are found. After a sufficient gain is found, \(\alpha _e\) is saved as the new \(\mathbf{a}(t)\). One of the following cases occurs:

  1. (i)

    A sufficient gain is found along the direction p.

  2. (ii)

    A sufficient gain is found along the direction \(-p\).

  3. (iii)

    No sufficient gain is found along \(\pm p\).

If either (i) or (ii) holds, extrapolationStep is successful, with at least one sufficient gain. If (iii) holds, extrapolationStep is unsuccessful, without a sufficient gain.
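The case distinction above can be condensed into a small Python sketch (ours, simplified relative to the actual pseudocode: the opposite direction is always tried on failure, and the step-size list \(\mathbf{a}\) is reduced to a single scalar):

```python
import numpy as np

def extrapolation_step(f, xm, fm, p, Delta_bar, a_t=1.0):
    # Minimal sketch of extrapolationStep: try xm + p; on a sufficient
    # gain (21), keep doubling the step size; otherwise try -p.
    # Returns best point, its value, last step size, and number of gains.
    n_sg = 0
    for q in (p, -p):                       # cases (i) and (ii)
        alpha = a_t
        x_new = xm + alpha * q
        f_new = f(x_new)
        while fm - f_new > Delta_bar:       # sufficient gain (21)
            n_sg += 1
            xm, fm = x_new, f_new
            alpha *= 2.0                    # expand the step size
            x_new = xm + alpha * q
            f_new = f(x_new)
        if n_sg > 0:
            return xm, fm, alpha, n_sg      # success along q
    return xm, fm, a_t, 0                   # case (iii): no sufficient gain
```

For example, on \(f(x)=x^2\) starting at \(x=4\) with \(p=-1\), the sketch accepts the points 3 and 1 before the doubled step overshoots.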

figure a

2.3.2 A basic version of the MLS algorithm

For each random direction generated, our basic multi-line search (MLS-basic) uses extrapolationStep as follows:

  • A step in the current direction is tried.

  • If a sufficient gain is found, a sequence of extrapolations is tried.

  • If a sufficient negative gain is found, a step in the opposite direction is tried.

  • If a sufficient gain is found in the opposite direction, a sequence of extrapolations is tried.

  • If no sufficient gain along \(\pm p\) is found, the step size is reduced.

In (24), the extrapolation step sizes must be reduced whenever no sufficient gain is found. Hence, they need to be controlled by the tuning parameter \(\alpha _{\min }\).
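The bullet points above can be condensed into a self-contained sketch (ours, with \(s\) the all-ones vector; the two-sided probe and step-size reduction are simplified relative to the actual pseudocode, and no extrapolation is shown):

```python
import numpy as np

def mls_basic(f, x, fx, R, delta, Delta_bar, rng):
    # Condensed sketch of MLS-basic: for each of R random directions,
    # probe +p and -p and accept any sufficient gain; shrink the step
    # size when both fail.
    n_sg = 0
    for _ in range(R):
        p = rng.random(x.size) - 0.5        # (16)
        p *= delta / np.linalg.norm(p)      # (17): ||p|| = delta
        for q in (p, -p):
            fq = f(x + q)
            if fx - fq > Delta_bar:         # sufficient gain (21)
                x, fx = x + q, fq
                n_sg += 1
                break
        else:
            delta *= 0.5                    # no gain along +/-p: reduce step
    return x, fx, n_sg
```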

figure b

(23) is motivated at the end of this section, since it is based on the details of Theorem 1, which asserts that one obtains either a sufficient gain of a multiple of \(\varDelta \) or, with probability arbitrarily close to 1, an upper bound on \(\Vert g\Vert _*\) for at least one of the unknown gradients encountered, even though our algorithm never computes gradients.

Theorem 1

Assume that (A1) holds, let nf be the counter for the number of function evaluations, let R be the number of random search directions, and let \(\overline{\varDelta }=\gamma _{\min }\varDelta \), with \(0<\gamma _{\min }<1\), be the required improvement on the function value in MLS-basic. Here nfmax is assumed to be sufficiently large.

  1. (i)

    f decreases by at least

    $$\begin{aligned} \overline{\varDelta }_f:=\overline{\varDelta }\max (\mathtt{nf}-2R,0) \end{aligned}$$
    (25)

    (Note that \(\overline{\varDelta }_f\) may be zero, catering for the case of no strict decrease).

  2. (ii)

    Suppose that \(0<\eta \le \frac{1}{2}\) and \(R:=\lceil \log _2\eta ^{-1} \rceil \). If f does not decrease by more than a multiple of \(\varDelta \) then, with probability \(\ge 1-\eta \), the original point or one of the points evaluated with better function values has a gradient g with

    $$\begin{aligned} \Vert g\Vert _* \le \sqrt{cn}\varGamma (\delta ), \end{aligned}$$
    (26)

    where c is the constant in Proposition 3 and \(\varGamma (\delta )\) is defined by

    $$\begin{aligned} \varGamma (\delta ):=L\delta +\frac{2\varDelta }{\delta } \ \ \hbox {for some}\, \delta >0. \end{aligned}$$
    (27)

Proof

(i) Clearly, the function value of the best point does not increase. Thus (i) holds if \(\mathtt{nf}-2R\le 0\). If this is not the case, then \(\mathtt{nf}\ge 2R+1\). In the for loop of MLS-basic, R directions p are generated, and for each at most two function values are computed, unless an extrapolation step is performed. In the latter case, at least \(\mathtt{nf}-2R\) additional function values are computed during the extrapolation stage, each with a sufficient gain of at least \(\overline{\varDelta }\). Thus the total sufficient gain is at least (25).

(ii) Assume that f does not decrease by more than \(\overline{\varDelta }\). For \(t=1,\ldots ,R\), let \(p^t\) be the tth random direction generated by (17), and let \(x^t\) be the best point obtained before searching in direction \(p^t\). Then, from Proposition 2, we get

$$\begin{aligned} |g(x^t)^Tp^t|\le \overline{\varDelta }+\frac{L}{2}\Vert p^t\Vert ^2\le \varDelta +\frac{L}{2}\Vert p^t\Vert ^2=\frac{\delta }{2}\varGamma (\delta ), \ \ \hbox {for all}\, t=1,\ldots ,R. \end{aligned}$$

Since the random direction is generated by (16), Proposition 3 implies that

$$\begin{aligned} \Vert g(x^t)\Vert _*=\Vert g(x^t)\Vert _*\Vert p^t\Vert /\delta \le \Big (2\sqrt{cn}|g(x^t)^Tp^t|\Big )/\delta \le \sqrt{cn}\varGamma (\delta ),\ \ \hbox {for all}\, t=1,\ldots ,R \end{aligned}$$

holds with probability \(\displaystyle \frac{1}{2}\) or more. Hence, for each \(t=1,\ldots ,R\), the bound \(\Vert g(x^t)\Vert _* \le \sqrt{cn}\varGamma (\delta )\) fails with probability at most \(\displaystyle \frac{1}{2}\). Therefore, the probability that (26) holds for at least one of the gradients \(g=g(x^t)\) (\(t\in \{1,\ldots ,R\} \)) is

$$\begin{aligned} 1-\displaystyle \prod _{t=1}^{R}\Pr \Big (\Vert g(x^t)\Vert _*> \sqrt{cn}\varGamma (\delta )\Big ) \ge 1-2^{-R}. \end{aligned}$$

\(\square \)

Note that (26) is guaranteed to hold although gradients are never computed. Since gradients and Lipschitz constants are unknown to us, we cannot say which point satisfies (26). But the result implies that the final best point has a function value at least as good as that of some point whose gradient was small. If gradients are small only near local optimizers, this produces a point close to such an optimizer. If some iterate passes close to a non-global local optimizer or a saddle point, the algorithm may escape its neighborhood; in this case, only a variant with restarts would produce convergence to a point with a small gradient.

As discussed earlier in Sect. 1.3, VRBBO-basic tries to find a point satisfying (5). The goal of the scaling of the search direction (16) is to bring the bound \(\sqrt{cn}\varGamma (\delta )\) in (26) below a given threshold \(\varepsilon >0\). This is done by minimizing the bound. For fixed \(\varDelta \), the scale-dependent factor (27) is smallest for the choice

$$\begin{aligned} {\widehat{\delta }}:=\sqrt{2\varDelta /L}. \end{aligned}$$

Accordingly, (23) is used to scale the random directions (16), safeguarded by sensible positive lower and upper bounds \(0<\delta _{\min }<\delta _{\max }<+\infty \). As seen in Sect. 2.3.1, \(\alpha _e\) is used for estimating L and here for adjusting \(\delta \). In our experience, using the variable \(\alpha _e\) in (23) is useful. In fact, \({\widehat{\delta }}\) is a special case of the term \(\sqrt{\alpha _e\gamma _{\delta }\varDelta /\lambda }\) in (23) with \(\alpha _e=1\) and \(\gamma _\delta =2\).
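The optimality of \({\widehat{\delta }}=\sqrt{2\varDelta /L}\) for (27) follows from the arithmetic–geometric mean inequality, giving \(\varGamma ({\widehat{\delta }})=2\sqrt{2L\varDelta }\); a quick numeric check (ours, with arbitrary sample values of L and \(\varDelta \)):

```python
import math

L, Delta = 4.0, 0.01
Gamma = lambda d: L * d + 2 * Delta / d          # (27)
d_hat = math.sqrt(2 * Delta / L)                 # claimed minimizer

# Gamma is convex for d > 0, so comparing with nearby points is a fair check.
assert Gamma(d_hat) <= Gamma(0.9 * d_hat)
assert Gamma(d_hat) <= Gamma(1.1 * d_hat)
assert abs(Gamma(d_hat) - 2 * math.sqrt(2 * L * Delta)) < 1e-12
```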

3 A randomized descent algorithm for BBO

In this section, we first consider a fixed decrease search that either finds an upper bound on the unknown gradient norm at one of the points generated by extrapolationStep, or stops after a bounded total number of function evaluations. Then a primary version of our algorithm is given.

3.1 Probing for fixed decrease

Based on the preceding results, we introduce the basic version of a fixed decrease search algorithm (FDS-basic) whose goal is to repeatedly call the basic multi-line search MLS-basic in the hope of finding sufficient gains of a multiple of \(\varDelta \). If either MLS-basic finds no sufficient gain (\(n_{\mathrm{sg}}=0\)) or nfmax is reached, FDS-basic ends. The main ingredient of VRBBO-basic is FDS-basic, which for large \(\varDelta \) takes many large steps, hence has a global character.
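The repeat-until-no-gain structure of FDS-basic can be sketched as follows (our own self-contained simplification; the inner sweep replaces MLS-basic by a plain two-sided probe with fixed \(\delta \), and \(s\) is the all-ones vector):

```python
import numpy as np

def fds_basic(f, x, fx, Delta_bar, R, delta, rng, nfmax=2000):
    # Skeleton of FDS-basic: repeatedly run a multi-line-search sweep over
    # R random directions and stop as soon as one full sweep finds no
    # sufficient gain, or nfmax function evaluations are reached.
    nf = 0
    while nf < nfmax:
        n_sg = 0
        for _ in range(R):                      # one simplified sweep
            p = rng.random(x.size) - 0.5        # (16)
            p *= delta / np.linalg.norm(p)      # (17)
            for q in (p, -p):
                fq = f(x + q); nf += 1
                if fx - fq > Delta_bar:         # sufficient gain (21)
                    x, fx, n_sg = x + q, fq, n_sg + 1
                    break
        if n_sg == 0:                           # no gain in a whole sweep
            break
    return x, fx, nf
```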

In the next algorithm, it is assumed that FDS-basic is tried in the kth iteration of VRBBO-basic.

figure c

Theorem 2

Assume that (A1) and (A2) hold, nfmax is sufficiently large, denote by \(f_0\) the initial value of f and let \(\overline{\varDelta }=\gamma _{\min }\varDelta \) with the tuning parameter \(0<\gamma _{\min }<1\). Then:

  1. (i)

    The number of function evaluations of FDS-basic is bounded by

    $$\begin{aligned} 2R+(2R+1)\displaystyle \frac{f_0-{\widehat{f}}}{\overline{\varDelta }}, \end{aligned}$$

    where \({\widehat{f}}\) is the global minimum value.

  2. (ii)

    Denote by \(K_f\) the number of calls to MLS-basic by FDS-basic and assume that

    $$\begin{aligned} 0<\eta \le \frac{1}{2}, \ \ R=\lceil \log _2 \eta ^{-1}\rceil , \ \ \hbox {and} \ \ 0<\delta _{\min }<\delta _{\max }<\infty . \end{aligned}$$

    Then FDS-basic finds a point x, with probability \(\ge 1-\eta \), satisfying

    $$\begin{aligned} \Vert g(x)\Vert _* \le \sqrt{cn}\min _{t=0:K_f}\varGamma (\delta ^t)\le \sqrt{cn} \Big (L\delta _{\min }+\sqrt{L'\varDelta }+\frac{2\varDelta }{\delta _{\max }}\Big ). \end{aligned}$$
    (28)

    Here c is the constant from Proposition 3 and, if \(\lambda ^0\) denotes the value of \(\lambda \) before the first execution of FDS-basic,

    $$\begin{aligned} L':=\frac{L^2\gamma _{\delta }}{\lambda ^0}+4L+4\frac{\lambda ^0+L}{\gamma _{\delta }} \ \ \hbox {with }\,\gamma _{\delta }>0. \end{aligned}$$
    (29)

Proof

By (A2), \({\widehat{f}}\) is finite. Denote by \(f_{\ell }\) the function value after the \(\ell \)th iteration of FDS-basic. In the worst case, in each iteration \(\ell \in \{1,\ldots ,k\}\) of FDS-basic a sufficient gain is found, i.e., the condition

$$\begin{aligned} f_{\ell }\le f_{\ell -1}-\overline{\varDelta }\,\hbox {for}\,\ell \in \{1,\ldots ,k\} \end{aligned}$$

holds. But in the \((k+1)\)th iteration FDS-basic cannot find any sufficient gain and ends. We then conclude that

$$\begin{aligned} {\widehat{f}}\le f_k\le f_0-\sum _{i=1}^{k}\overline{\varDelta }=f_0-k\overline{\varDelta }\end{aligned}$$

by (24), so that \(k\le (f_0-{\widehat{f}})/\overline{\varDelta }\).

In each of the iterations \(\ell =1,\ldots ,k\) in which a sufficient gain is found, \(2R+1\) function evaluations are used; but in the \((k+1)\)th iteration, only 2R function evaluations are used since there is no sufficient gain. Hence (i) follows.

(ii) \(K_f\) is finite by (i), and the choice of R gives \(2^{-R}\le \eta \). Hence by Theorem 1, with probability \(\ge 1-2^{-R}\ge 1-\eta \),

$$\begin{aligned} \Vert g\Vert _*\le \sqrt{cn}\min _{t=0:K_f}\varGamma (\delta ^t) \end{aligned}$$

holds. Thus it is sufficient to show that

$$\begin{aligned} \varGamma (\delta ) \le L\delta _{\min }+ \sqrt{L'\varDelta }+\frac{2\varDelta }{\delta _{\max }}. \end{aligned}$$
(30)

By the definition of \(\delta \) in (23), we have one of the following three cases:

Case 1: \(\delta =\displaystyle \sqrt{\frac{\gamma _{\delta }\varDelta }{\lambda }}\). In this case,

$$\begin{aligned} \varGamma (\delta )=L\delta +\frac{2\varDelta }{\delta } =L\sqrt{\frac{\gamma _{\delta }\varDelta }{\lambda }} +2\sqrt{\frac{\lambda \varDelta }{\gamma _{\delta }}}=\varLambda \sqrt{\varDelta }, \end{aligned}$$

where

$$\begin{aligned} \varLambda :=L\sqrt{\frac{\gamma _{\delta }}{\lambda }}+2\sqrt{\frac{\lambda }{\gamma _{\delta }}}. \end{aligned}$$
(31)

Case 2: \(\delta =\delta _{\min }\ge \displaystyle \sqrt{\frac{\gamma _{\delta }\varDelta }{\lambda }}\). In this case,

$$\begin{aligned} \varGamma (\delta )=L\delta _{\min }+\frac{2\varDelta }{\delta _{\min }} \le L\delta _{\min }+2\sqrt{\frac{\lambda \varDelta }{\gamma _{\delta }}} \le L\delta _{\min }+\varLambda \sqrt{\varDelta }. \end{aligned}$$

Case 3: \(\delta =\delta _{\max }\le \displaystyle \sqrt{\frac{\gamma _{\delta }\varDelta }{\lambda }}\). In this case,

$$\begin{aligned} \varGamma (\delta )=L\delta _{\max }+\frac{2\varDelta }{\delta _{\max }} \le L\sqrt{\frac{\gamma _{\delta }\varDelta }{\lambda }}+\frac{2\varDelta }{\delta _{\max }} \le \varLambda \sqrt{\varDelta }+\frac{2\varDelta }{\delta _{\max }}. \end{aligned}$$

Thus in each case,

$$\begin{aligned} \varGamma (\delta ) \le L\delta _{\min }+ \varLambda \sqrt{\varDelta }+\frac{2\varDelta }{\delta _{\max }}. \end{aligned}$$

As discussed earlier, L is unknown and is replaced by an approximation \(\lambda \). Proposition 2 implies that

$$\begin{aligned} \lambda ^0\le \lambda \le \max (\lambda ^0,L)\le \lambda ^0+L, \end{aligned}$$
(32)

where \(\lambda ^0\) is the initial value of \(\lambda \). Now (30) follows since by (31) and (32),

$$\begin{aligned} \displaystyle \varLambda ^2=\displaystyle \frac{L^2\gamma _{\delta }}{\lambda }+4L+\displaystyle \frac{4\lambda }{\gamma _{\delta }} \le L'. \end{aligned}$$

\(\square \)

3.2 A basic version of the VRBBO algorithm

We now have all ingredients to formulate VRBBO-basic. In each iteration it uses the fixed decrease search algorithm to update the best point. If no sufficient gain is found in the corresponding FDS-basic call, \(\varDelta \) is reduced by a factor of Q. Once \(\varDelta \) falls below a minimum threshold or nfmax is reached, VRBBO-basic stops.

[Figure d: pseudocode of VRBBO-basic]

As discussed above, \(\varDelta _{\max }\) and \(\lambda _{\max }\) are initially tuning parameters, but in an improved version of VRBBO-basic they will be estimated by the heuristic techniques of Sect. 5. From Lines 1 and 10, the kth call to FDS-basic uses

$$\begin{aligned} \varDelta _k=Q^{1-k}\varDelta _{\max }. \end{aligned}$$
(33)

This relation will be used in the next section to prove our complexity results.
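The outer loop of VRBBO-basic can be sketched schematically as follows (a sketch only, not the authors' code; `fds` stands for a call to FDS-basic, and its return convention `(gain_found, nf_used)` is an assumption):

```python
def vrbbo_basic_outline(fds, delta_max, delta_min, Q, nfmax):
    """Schematic outer loop of VRBBO-basic: the desired gain follows
    (33), Delta_k = Q**(1-k) * delta_max, and is reduced by the factor
    Q whenever FDS-basic finds no sufficient gain, until Delta drops
    below delta_min or the evaluation budget nfmax is exhausted."""
    delta, nf = delta_max, 0
    while delta >= delta_min and nf < nfmax:
        gain_found, nf_used = fds(delta)   # one fixed decrease search
        nf += nf_used
        if not gain_found:
            delta /= Q                     # reduce the desired gain
    return nf
```

With a (hypothetical) FDS that never finds a gain, the loop performs exactly the geometric schedule (33).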

4 Complexity analysis of VRBBO-basic

We now prove complexity results for nonconvex, convex, and strongly convex objective functions. We denote by \(N_k\) the total number of function evaluations used by VRBBO-basic up to iteration k.

4.1 The general (nonconvex) case

Theorem 3

Let \(\{x^k\}\) (\(k=0,1,2,\ldots \)) be the sequence generated by VRBBO-basic. Assume that (A1) and (A2) hold, nfmax is sufficiently large, and the parameters

$$\begin{aligned} 0<\eta \le \frac{1}{2}, 0<\gamma _{\min }<1, \varDelta _{\max }>0, \delta _{\max }>0, \varepsilon >0 \end{aligned}$$

are given. If the parameters are chosen such that

$$\begin{aligned} \varDelta _{\min }:= & {} \varTheta \left( \varepsilon ^2/n\right) , \end{aligned}$$
(34)
$$\begin{aligned} K:= & {} \Big \lceil \frac{\log \left( \varDelta _{\max }/\varDelta _{\min }\right) }{\log Q}\Big \rceil , \end{aligned}$$
(35)
$$\begin{aligned} R:= & {} \Big \lceil \log _2\eta ^{-1}\Big \rceil , \end{aligned}$$
(36)
$$\begin{aligned} \delta _{\min }:= & {} {{\mathcal {O}}}(\varepsilon /\sqrt{n}), \end{aligned}$$
(37)

then VRBBO-basic finds after at most \({{\mathcal {O}}}(n\varepsilon ^{-2})\) function evaluations with probability \(\ge 1-\eta \) a point x with

$$\begin{aligned} \Vert g(x)\Vert _*= {{\mathcal {O}}}(\varepsilon ). \end{aligned}$$
(38)

Proof

We conclude from (33)–(35) that

$$\begin{aligned} \varDelta _\ell =Q^{1-\ell }\varDelta _{\max }\le \varDelta _{\min }\ \ \hbox {for}\, \ell \ge K. \end{aligned}$$
(39)

Hence at most K steps of FDS-basic are performed. By (36), we have \(\eta _1:=2^{-R}\le \eta \). Thus, by Theorem 2(ii) together with (34) and (37), with probability \(\ge 1-\eta _1\ge 1-\eta \), at least one of the evaluated points satisfies

$$\begin{aligned} \Vert g\Vert _*\le \sqrt{cn}\min _{j=0:K}\varGamma \left( \delta ^j\right) \le \sqrt{cn} \Big (L\delta _{\min }+\sqrt{L'\varDelta _{\min }}+\frac{2\varDelta _{\min }}{\delta _{\max }}\Big )={{\mathcal {O}}}(\varepsilon ). \end{aligned}$$

From (33) and Theorem 2(i), the kth call to FDS-basic uses at most

$$\begin{aligned} 2R+(2R+1)\displaystyle \frac{f_0-{\widehat{f}}}{\gamma _{\min }Q^{1-k}\varDelta _{\max }} \end{aligned}$$

function evaluations; \({\widehat{f}}\) comes from (2) and is the global minimum value. Then

$$\begin{aligned} N_K\le & {} 1+\displaystyle \sum _{j=1}^K\Big (2R+(2R+1) \displaystyle \frac{f_0-{\widehat{f}}}{\gamma _{\min }Q^{1-j}\varDelta _{\max }}\Big )\\= & {} 1+2RK+(2R+1)\displaystyle \frac{f_0-{\widehat{f}}}{\gamma _{\min }\varDelta _{\max }}\frac{Q^K-1}{Q-1}. \end{aligned}$$

Choosing \(\varDelta _{\min }={{\mathcal {O}}}(\varepsilon ^2/n)\) as in (34) is possible, and K, R, \(\delta _{\min }\) can clearly be chosen to satisfy (35)–(37); these choices yield \(K={{\mathcal {O}}}(\log \displaystyle \frac{n}{\varepsilon ^2})\) and \( R={{\mathcal {O}}}(\log \eta ^{-1}) \). Finally, we conclude that \(N_{K}={{\mathcal {O}}}(nR\varepsilon ^{-2})= {{\mathcal {O}}}(n\varepsilon ^{-2})\). \(\square \)

4.2 The convex case

Theorem 4

Let \(\{x^k\}\) (\(k=0,1,2,\ldots \)) be the sequence generated by VRBBO-basic and let f be convex on \({{{\mathcal {L}}}}(x^0)\). Moreover, assume that (A1) and (A2) hold and nfmax is sufficiently large. Given \(0<\eta <1\), for any \(\varepsilon >0\), if (34)–(37) hold then VRBBO-basic finds after at most \({{\mathcal {O}}}(n\varepsilon ^{-1})\) function evaluations with probability \(\ge 1-\eta \) a point x satisfying (38) and

$$\begin{aligned} f(x)-{\widehat{f}}= {{\mathcal {O}}}\left( \varepsilon r_0\right) , \end{aligned}$$
(40)

where \(r_0\) is given by (11) and \({\widehat{f}}\) is the global minimum value.

Proof

For \(\ell \ge K\), (39) holds by (33)–(35). Hence at most K steps of FDS-basic are performed. In each step a point without sufficient gain is found that satisfies (28) by Theorem 2(ii). As discussed after Theorem 1, these points are unknown to us since the gradients and Lipschitz constants are unknown. The index set of these points is denoted by U; its size is at most K. The convex case is characterized by (8), so that

$$\begin{aligned} {\widehat{f}}\ge f_{\ell }+\left( g^{\ell }\right) ^T\left( {\widehat{x}}-x^{\ell }\right) \ \ \hbox {for all}\, \ell \ge 0. \end{aligned}$$

From (11), we have with probability \(\ge 1-\eta \)

$$\begin{aligned} f_{\ell -1}-f_{\ell } \le f_{\ell -1}-{\widehat{f}}\le & {} \left( g^{\ell -1}\right) ^T\left( x^{\ell -1}-{\widehat{x}}\right) \le \Vert g^{\ell -1}\Vert _*\Vert x^{\ell -1}-{\widehat{x}}\Vert \nonumber \\\le & {} r_0\sqrt{cn}\left( L\delta _{\min }+ \sqrt{L'\varDelta _\ell }+\frac{2\varDelta _\ell }{\delta _{\max }}\right) \ \ \hbox {for}\, \ell \in U. \end{aligned}$$
(41)

We consider the following three cases:

Case 1. The second term \(\sqrt{L'\varDelta _\ell }\) in (41) dominates the others. Then for \(\ell \in U\)

$$\begin{aligned} f_{\ell -1}-f_{\ell } \le f_{\ell -1}-{\widehat{f}} ={{\mathcal {O}}}\left( \sqrt{n\varDelta _\ell }\right) . \end{aligned}$$
(42)

Put \(U_1:=\{\ell \in U \mid (42)\,\hbox {holds} \}\).

Case 2. The first term \(\delta _{\min }\) in (41) dominates the others. Then for \(\ell \in U\)

$$\begin{aligned} f_{\ell -1}-f_{\ell } \le f_{\ell -1}-{\widehat{f}}= {{\mathcal {O}}}\left( \sqrt{n}\delta _{\min }\right) ={{\mathcal {O}}}(\varepsilon ). \end{aligned}$$
(43)

Put \(U_2:=\{\ell \in U \mid (43)\,\hbox {holds} \}\).

Case 3. The third term \(\displaystyle \frac{2\varDelta _\ell }{\delta _{\max }}\) in (41) dominates the others. Then for \(\ell \in U\)

$$\begin{aligned} f_{\ell -1}-f_{\ell } \le f_{\ell -1}-{\widehat{f}} ={{\mathcal {O}}}\left( \sqrt{n}\varDelta _\ell \right) . \end{aligned}$$
(44)

Put \(U_3:=\{\ell \in U \mid (44)\,\hbox {holds} \}\). Then we conclude from (33) and (42)–(44) that

$$\begin{aligned} \sum _{\ell \in U}\frac{f_{\ell -1}-f_{\ell }}{\varDelta _\ell }= & {} \sum _{\ell \in U_1}\frac{f_{\ell -1}-f_{\ell }}{\varDelta _\ell }+\sum _{\ell \in U_2}\frac{f_{\ell -1}-f_{\ell }}{\varDelta _\ell }+\sum _{\ell \in U_3}\frac{f_{\ell -1}-f_{\ell }}{\varDelta _\ell }\\\le & {} \sum _{\ell \in U_1}\frac{{{\mathcal {O}}}(\sqrt{n\varDelta _\ell })}{\varDelta _\ell }+{{\mathcal {O}}}(\varepsilon )\sum _{\ell \in U_2}\varDelta _\ell ^{-1} +\sum _{\ell \in U_3} \frac{{{\mathcal {O}}}(\sqrt{n}\varDelta _\ell )}{\varDelta _\ell } \\\le & {} \sum _{\ell \in U}{{\mathcal {O}}}(\sqrt{n}\varDelta _\ell ^{-1/2})+{{\mathcal {O}}}(\varepsilon )\sum _{\ell \in U}\varDelta _\ell ^{-1} +\sum _{\ell \in U} \frac{{{\mathcal {O}}}(\sqrt{n}\varDelta _\ell )}{\varDelta _\ell }\\\le & {} \varDelta _{\max }^{-1/2}\sum _{\ell \in U}{{\mathcal {O}}}\left( \sqrt{n}Q^{\frac{1}{2}(\ell -1)}\right) +\varDelta _{\max }^{-1}{{\mathcal {O}}}(\varepsilon )\sum _{\ell \in U}Q^{\ell -1}+{{\mathcal {O}}}(\sqrt{n})K. \end{aligned}$$

Hence (34) and (35) result in

$$\begin{aligned} \sum _{\ell \in U}\frac{f_{\ell -1}-f_{\ell }}{\varDelta _\ell }={{\mathcal {O}}}\left( n\varepsilon ^{-1}\right) +{{\mathcal {O}}}\left( n\varepsilon ^{-1}\right) +{{\mathcal {O}}}\left( \sqrt{n}\log \left( n\varepsilon ^{-1}\right) \right) ={{\mathcal {O}}}\left( n\varepsilon ^{-1}\right) , \end{aligned}$$

so that we get, in the same way as in the proof of Theorem 2,

$$\begin{aligned} N_{K}\le 1+2RK+(2R+1) \sum _{\ell \in U} \frac{f_{\ell -1}-f_{\ell }}{\gamma _{\min }\varDelta _\ell } = {{\mathcal {O}}}\left( nR\varepsilon ^{-1}\right) ={{\mathcal {O}}}\left( n \varepsilon ^{-1}\right) . \end{aligned}$$

According to Theorem 3, (38) holds for at least one of the evaluated points. As a result, (40) holds for at least one of the evaluated points with probability \(\ge 1-\eta \), by applying (11) and (38)–(41). \(\square \)

4.3 The strongly convex case

Theorem 5

Let \(\{x^k\}\) (\(k=0,1,2,\ldots \)) be the sequence generated by VRBBO-basic and let f be \(\sigma \)-strongly convex on \({{{\mathcal {L}}}}(x^0)\). Moreover, assume that (A1) and (A2) hold and nfmax is sufficiently large. Under the assumptions of Theorem 3, VRBBO-basic finds after at most

$$\begin{aligned} {{\mathcal {O}}}\left( n\log (n\varepsilon ^{-1})\right) \end{aligned}$$

function evaluations with probability \(\ge 1-\eta \) a point x satisfying (38),

$$\begin{aligned} f(x)-{\widehat{f}}={{\mathcal {O}}}\left( \displaystyle \frac{\varepsilon ^2}{2\sigma }\right) , \end{aligned}$$
(45)

and

$$\begin{aligned} \Vert x-{\widehat{x}}\Vert ={{\mathcal {O}}}\left( \displaystyle \frac{\varepsilon }{\sigma }\right) . \end{aligned}$$
(46)

Proof

For \(\ell \ge K\), (39) holds by (33)–(35). Hence at most K steps of FDS-basic are performed. In each step a point without sufficient gain is found that satisfies (28) by Theorem 2(ii). As discussed after Theorem 1, these points are unknown to us since the gradients and Lipschitz constants are unknown. The index set of these points is denoted by U; its size is at most K. The strongly convex case is characterized by (9), so that f has a global minimizer \({\widehat{x}}\) and

$$\begin{aligned} f(y)\ge f(x)+g(x)^T(y-x)+\frac{1}{2}\sigma \Vert y-x\Vert ^2 \end{aligned}$$

for any x and y in \({{{\mathcal {L}}}}(x^0)\). For fixed x, the right-hand side of this inequality is a convex quadratic function of y, minimal when its gradient vanishes. By (1), this is the case iff \(y_i\) takes the value \(x_i-\displaystyle \frac{s_i}{\sigma }g_i(x)\) for \(i=1,\ldots ,n\), and we conclude that \(f(y)\ge f(x)-\displaystyle \frac{1}{2\sigma }\Vert g(x)\Vert _*^2\) for \(y\in {{{\mathcal {L}}}}(x^0)\). Therefore

$$\begin{aligned} {\widehat{f}}\ge f(x)-\displaystyle \frac{1}{2\sigma }\Vert g(x)\Vert _*^2. \end{aligned}$$
(47)

Replacing x by \(x^{\ell -1}\) in (47) and using (38) gives, with probability \(\ge 1-\eta \),

$$\begin{aligned} f_{\ell -1}-f_{\ell } \le f_{\ell -1}-{\widehat{f}}\le \displaystyle \frac{\Vert g^{\ell -1}\Vert _*^2}{2\sigma } \le \frac{cn}{2\sigma } \Big (L\delta _{\min }+ \sqrt{L'\varDelta _\ell }+\frac{2\varDelta _\ell }{\delta _{\max }}\Big ) ^2. \end{aligned}$$
(48)

We consider the following three cases:

Case 1. The second term \(\sqrt{L'\varDelta _\ell }\) in (48) dominates the others. Then for \(\ell \in U\)

$$\begin{aligned} f_{\ell -1}-f_{\ell }={{\mathcal {O}}}\left( n\varDelta _\ell \right) . \end{aligned}$$
(49)

Put \( U_1:=\{\ell \in U \mid (49)\,\hbox {holds} \}\).

Case 2. The first term \(\delta _{\min }\) in (48) dominates the others. Then for \(\ell \in U\)

$$\begin{aligned} f_{\ell -1}-f_{\ell } ={{\mathcal {O}}}\left( n\delta _{\min }^2\right) ={{\mathcal {O}}}\left( \varepsilon ^2\right) . \end{aligned}$$
(50)

Put \(U_2:=\{\ell \in U \mid (50)\,\hbox {holds} \}\).

Case 3. The third term \(\displaystyle \frac{2\varDelta _\ell }{\delta _{\max }}\) in (48) dominates the others. Then for \(\ell \in U\)

$$\begin{aligned} f_{\ell -1}-f_{\ell } ={{\mathcal {O}}}\left( n\varDelta _\ell ^2\right) . \end{aligned}$$
(51)

Put \(U_3:=\{\ell \in U \mid (51)\,\hbox {holds} \}\). Then we conclude from (33) and (49)–(51) that

$$\begin{aligned} \sum _{\ell \in U}\frac{f_{\ell -1}-f_{\ell }}{\varDelta _\ell }= & {} \sum _{\ell \in U_1}\frac{f_{\ell -1}-f_{\ell }}{\varDelta _\ell }+\sum _{\ell \in U_2}\frac{f_{\ell -1}-f_{\ell }}{\varDelta _\ell }+\sum _{\ell \in U_3}\frac{f_{\ell -1}-f_{\ell }}{\varDelta _\ell }\\\le & {} \sum _{\ell \in U_1}\frac{{{\mathcal {O}}}(n\varDelta _\ell )}{\varDelta _\ell }+{{\mathcal {O}}}(\varepsilon ^2)\sum _{\ell \in U_2}\varDelta _\ell ^{-1} +\sum _{\ell \in U_3} \frac{{{\mathcal {O}}}(n\varDelta _\ell ^2)}{\varDelta _\ell } \\\le & {} \sum _{\ell \in U}\frac{{{\mathcal {O}}}(n\varDelta _\ell )}{\varDelta _\ell }+{{\mathcal {O}}}(\varepsilon ^2)\sum _{\ell \in U}\varDelta _\ell ^{-1} +\sum _{\ell \in U} \frac{{{\mathcal {O}}}(n\varDelta _\ell ^2)}{\varDelta _\ell } \\\le & {} {{\mathcal {O}}}(n)K+\frac{{{\mathcal {O}}}(\varepsilon ^2)}{\varDelta _{\max }}\sum _{\ell \in U}Q^{\ell -1}+{{\mathcal {O}}}(n)\varDelta _{\max }\sum _{\ell =1}^{\infty }Q^{1-\ell }. \end{aligned}$$

Hence (34) and (35) result in

$$\begin{aligned} \sum _{\ell \in U}\frac{f_{\ell -1}-f_{\ell }}{\varDelta _\ell }= {{\mathcal {O}}}(n\log (n\varepsilon ^{-1}))+{{\mathcal {O}}}(\varepsilon ^2){{\mathcal {O}}}(n\varepsilon ^{-2})+{{\mathcal {O}}}(n)={{\mathcal {O}}}\left( n\log (n\varepsilon ^{-1})\right) , \end{aligned}$$

and we then obtain, in the same way as in the proof of Theorem 2,

$$\begin{aligned} N_{K}\le 1+2RK+(2R+1) \sum _{\ell \in U} \frac{f_{\ell -1}-f_{\ell }}{\gamma _{\min }\varDelta _\ell } = {{\mathcal {O}}}(nR\log (n\varepsilon ^{-1}))={{\mathcal {O}}}(n\log (n\varepsilon ^{-1})). \end{aligned}$$

According to Theorem 3, (38) holds for at least one of the evaluated points. As a result, (45) holds for at least one of the evaluated points with probability \(\ge 1-\eta \), by substituting (38) into (47). Then (46) is obtained by applying (4) and (45) together with \(g({\widehat{x}})=0\), i.e.,

$$\begin{aligned} \Vert x-{\widehat{x}}\Vert ^2\le \frac{2}{\sigma }\Big (f(x)-{\widehat{f}}-g({\widehat{x}})^T(x-{\widehat{x}})\Big )={{\mathcal {O}}}\Big (\displaystyle \frac{\varepsilon ^2}{\sigma ^2}\Big ). \end{aligned}$$

\(\square \)

5 Some new heuristic techniques

In this section, we describe several heuristic techniques that improve the basic version of our algorithm (Algorithm 4, VRBBO-basic). While only convergence to a local minimizer is guaranteed, FDS together with our heuristic techniques turns VRBBO into an efficient global solver. Initially, for large \(\varDelta \), FDS takes many large steps, hence has a global character.

More specifically, we discuss the occasional use of alternative search directions (two cumulative directions and a random subspace direction) and heuristics for estimating key parameters left unspecified by the general theory – the initial desired gain, the Lipschitz constant, and the scaling vector. Moreover, we discuss how to estimate the gradient by finite differences with step sizes extracted from the extrapolation steps. In Sect. 6, we combine Algorithm 4 with these heuristic techniques, resulting in the global solver VRBBO.

5.1 Cumulative directions

We consider two possibilities to accumulate past directional information into a cumulative search direction:

  1. (i)

With xm and fm defined in Sect. 2.3.1, the first cumulative direction is model-independent and computed by

    $$\begin{aligned} p=\mathtt{xm}-x_{\mathrm{init}}, \end{aligned}$$
    (52)

    where \(x_{\mathrm{init}}\) is the initial point of the current improved version of MLS-basic. Here the idea is that many small improvement steps accumulate to a direction pointing from the starting point into a valley, so that more progress can be expected by going further into this cumulative direction.

  2. (ii)

    The second cumulative direction assumes a separable quadratic model of the form

    $$\begin{aligned} f\Big (\mathtt{xm}+\sum _{i\in I} \alpha _ip_i\Big ) \approx \mathtt{fm}-\sum _{i\in I}\varPsi _i(\alpha _i) \end{aligned}$$
    (53)

    with quadratic univariate functions \(\varPsi _i(\alpha )\) vanishing at \(\alpha =0\). Here I is the set of directions polled at least twice, and \(p_i\) is the corresponding direction as rescaled by an improved version of MLS-basic.

By construction, we have for any \(i\in I\) three function values at equispaced arguments. We write the quadratic interpolant as

$$\begin{aligned} f(\mathtt{xm}+\alpha p)=\mathtt{fm}-\displaystyle \frac{\alpha }{2}d+\displaystyle \frac{\alpha ^2}{2}h=\mathtt{fm}-\varPsi (\alpha ), \end{aligned}$$

where \(\varPsi (\alpha ):=\displaystyle \frac{\alpha }{2}(d-\alpha h)\). Let us recall the function values fl, fm, and fr satisfying either (19) or (22). If \(\mathtt{fr}<\mathtt{fm}\), the last evaluated point was the best one, so

$$\begin{aligned} \mathtt{fr}\le \min (\mathtt{fl},\mathtt{fm}). \end{aligned}$$

In this case, (22) holds and we have

$$\begin{aligned} d := 4\mathtt{fm}-3\mathtt{fr}-\mathtt{fl}, \end{aligned}$$
(54)

and

$$\begin{aligned} h := \mathtt{fr}+\mathtt{fl}-2\mathtt{fm}. \end{aligned}$$
(55)

Otherwise, the last evaluated point was not the best one, so \(\mathtt{fm}\le \min (\mathtt{fl},\mathtt{fr})\). In this case, (19) holds and we compute d by

$$\begin{aligned} d := \mathtt{fl}-\mathtt{fr} \end{aligned}$$
(56)

and h by (55).

Given the tuning parameter \(a>0\), the minimizer of the quadratic interpolant restricted to the interval \([-a,a]\) is

$$\begin{aligned} \alpha :={\left\{ \begin{array}{ll}{a} &{}\quad \hbox {if}\,~ d\ge 0, \\ {-a} &{} \quad \hbox {if}~\,d<0 \end{array}\right. } \end{aligned}$$
(57)

in case \(h\le 0\). Otherwise, we have

$$\begin{aligned} \alpha :={\left\{ \begin{array}{ll}\min (a,d/2h) &{}\quad \hbox {if}~\, d\ge 0, \\ \max (-a,d/2h) &{}\quad \hbox {if}~\,d<0. \end{array}\right. } \end{aligned}$$
(58)

Assuming the validity of the quadratic model (53), we find the model optimizer by additively accumulating the estimated steps \(\alpha p\) and gains \(\varPsi \) into a cumulative step q with anticipated gain r.
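The step computation (54)–(58) can be sketched as follows (a minimal sketch under the formulas above; the function name and argument convention are ours):

```python
def cum_step(fl, fm, fr, a):
    """Sketch of Sect. 5.1: d and h follow (54)-(56), and alpha
    minimizes the quadratic interpolant
    f(xm + alpha*p) = fm - (alpha/2)*d + (alpha**2/2)*h
    restricted to [-a, a], via (57)-(58)."""
    h = fr + fl - 2 * fm                 # (55), curvature estimate
    if fr < fm:                          # last evaluated point was the best
        d = 4 * fm - 3 * fr - fl         # (54)
    else:                                # middle point was the best
        d = fl - fr                      # (56)
    if h <= 0:                           # no positive curvature: boundary step (57)
        return a if d >= 0 else -a
    if d >= 0:                           # clipped interior minimizer (58)
        return min(a, d / (2 * h))
    return max(-a, d / (2 * h))
```

For symmetric data fl = fr with fm the smallest value, the interior minimizer \(\alpha=0\) is recovered, as expected.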

5.2 Random subspace direction

When sufficient gains are found, the trial points are accepted as new best points and saved in X, and their function values are saved in F. Denote by \(m_s\) the maximum number of points saved in X and by b the index of the newest best point.

Throughout the paper, \(A_{:k}\) denotes the kth column of a matrix A. Random subspace directions point into the low-dimensional affine subspace spanned by a number of good points kept from previous iterations. They are computed by

$$\begin{aligned} \alpha _{\mathrm{rand}}:=\mathrm{rand}(m_s-1,1)-0.5, \ \ \alpha _{\mathrm{rand}}:=\frac{\alpha _{\mathrm{rand}}}{\Vert \alpha _{\mathrm{rand}}\Vert }, \ \ p:=\displaystyle \sum _{i=1,i\ne b}^{m_s}(\alpha _{\mathrm{rand}})_i(X_{:i}-X_{:b}). \end{aligned}$$
(59)
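A sketch of (59), assuming the stored best points are the columns of a NumPy array X; the function name and the use of a NumPy random generator are our assumptions:

```python
import numpy as np

def random_subspace_direction(X, b, rng=None):
    """Sketch of (59): a random direction in the affine subspace
    spanned by the stored best points (columns of X) relative to the
    current best column b."""
    rng = rng or np.random.default_rng()
    ms = X.shape[1]
    alpha = rng.random(ms - 1) - 0.5           # rand(ms-1, 1) - 0.5
    alpha /= np.linalg.norm(alpha)             # normalize the weights
    idx = [i for i in range(ms) if i != b]     # skip the best point itself
    return (X[:, idx] - X[:, [b]]) @ alpha     # weighted sum of differences
```

The \(m_s-1\) random weights are matched with the \(m_s-1\) columns \(i\ne b\), as in the sum in (59).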

5.3 Choosing the initial \(\varDelta \)

First of all, we compute

$$\begin{aligned} \mathtt{dF}:=\displaystyle \mathrm{median}_{i=1:m_s}|F_i-F_b|. \end{aligned}$$
(60)

Then, if \(\mathtt{dF}\) is nonzero, we approximate the initial desired gain by

$$\begin{aligned} \varDelta :=\gamma _{\max }\min (\mathtt{dF},1), \end{aligned}$$
(61)

where \(\gamma _{\max }>0\) is a tuning parameter. Otherwise \(\varDelta :=\varDelta _{\max }\), where \(\varDelta _{\max }>0\) is the initial gain.
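A sketch of (60)–(61); the default values of the tuning parameters \(\gamma _{\max }\) and \(\varDelta _{\max }\) are illustrative assumptions:

```python
import numpy as np

def initial_delta(F, b, gamma_max=0.1, delta_max=1.0):
    """Sketch of (60)-(61): estimate the initial desired gain from the
    stored best function values F with current best index b."""
    dF = np.median(np.abs(F - F[b]))     # (60)
    if dF != 0:
        return gamma_max * min(dF, 1.0)  # (61)
    return delta_max                     # fall back to the initial gain
```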

5.4 Choosing the initial \(\lambda \)

The initial value for \(\lambda \) is the tuning parameter \(\lambda _{\max }\); subsequently, \(\lambda \) is updated by (20) whenever the best point is updated by extrapolationStep. We approximate the initial \(\lambda \) by a heuristic formula based on the previous best function values stored in F.

Let \(\lambda _{\mathrm{old}}\) be the old estimation for the Lipschitz constant and \(\gamma _{\lambda }>0\) be a factor for adjusting \(\lambda \). We compute \(\lambda \) by

$$\begin{aligned} \lambda :={\left\{ \begin{array}{ll}\displaystyle \frac{\gamma _{\lambda }}{\sqrt{n}} &{} \hbox {if}\,\mathtt{dF}=0\,\hbox {and}\,\lambda _{\mathrm{old}}=0, \\ \lambda _{\mathrm{old}} &{} \hbox {if}\,\mathtt{dF}=0\,\hbox {and}\,\lambda _{\mathrm{old}}\ne 0, \\ \gamma _{\lambda }\displaystyle \sqrt{\frac{\mathtt{dF}}{n}} &{} \hbox {otherwise}, \end{array}\right. } \end{aligned}$$
(62)

where dF is computed by (60).
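A sketch of (62), with dF from (60); the default value of \(\gamma _{\lambda }\) is an illustrative assumption:

```python
import math

def initial_lambda(dF, n, lambda_old, gamma_lambda=1.0):
    """Sketch of (62): heuristic estimate of the Lipschitz-constant
    approximation lambda, given dF from (60), the dimension n, and
    the old estimate lambda_old."""
    if dF == 0:
        # no spread in the stored values: reuse the old estimate if any
        return gamma_lambda / math.sqrt(n) if lambda_old == 0 else lambda_old
    return gamma_lambda * math.sqrt(dF / n)
```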

5.5 Choosing the scaling vector

The idea is to estimate a sensible scaling vector s with the goal of adjusting the search direction scaled by (17). We compute

$$\begin{aligned} \mathtt{dX}_{:i}:=X_{:i}-X_{:b}\ \ \hbox {for all}\, i=1,\ldots ,m_s \end{aligned}$$

and estimate the scaling vector componentwise by

$$\begin{aligned} s:=\displaystyle \sup _{i=1:m_s}\left( \mathtt{dX}_{:i}\right) , \ \ J = \{i\mid s_i=0\}, \ \ s_J = 1. \end{aligned}$$
(63)

Finally, the formula (17) is rewritten as

$$\begin{aligned} p:=s\circ p\ \ \hbox {and}\ \ p=p(\delta /\Vert p\Vert ), \end{aligned}$$
(64)

where \(\circ \) denotes componentwise multiplication and \(\delta \) is computed by (23).
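A sketch of (63)–(64) in NumPy terms (our naming; X stores the best points as columns and b is the current best column):

```python
import numpy as np

def scaled_direction(X, b, p, delta):
    """Sketch of (63)-(64): estimate a scaling vector s from the
    stored best points, then scale the search direction p to length
    delta."""
    dX = X - X[:, [b]]            # dX_{:i} = X_{:i} - X_{:b}
    s = dX.max(axis=1)            # componentwise supremum, (63)
    s[s == 0] = 1.0               # s_J = 1 on the zero set J
    p = s * p                     # componentwise product, (64)
    return p * (delta / np.linalg.norm(p))
```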

5.6 Estimating the gradient

With xm and fm defined in Sect. 2.3.1, finite difference quasi-Newton methods approximate the gradient with components

$$\begin{aligned} \widetilde{g}_i:=\frac{f(\mathtt{xm}+\alpha _i e_i)-\mathtt{fm}}{\alpha _i}, \end{aligned}$$

where \(e_i\) is the ith coordinate vector. The most popular choice for \(\alpha \) is the constant choice

$$\begin{aligned} \alpha _i:=\max \{1,\Vert \mathtt{xm}\Vert _\infty \}\sqrt{\varepsilon _m}, \end{aligned}$$
(65)

where \(\varepsilon _m\) is the machine precision. We make a different choice for \(\alpha \): after generating each coordinate search direction, we approximate the corresponding component of the gradient in a way that differs slightly from the forward finite difference approach, using the step size generated by extrapolationStep instead of the general choice (65). The reason for this change is that it avoids the additional cost of approximating the gradient by a separate algorithm. We now describe how the gradient is computed. If extrapolationStep cannot find a sufficient gain in the tth iteration (\(n_{\mathrm{sg}}=0\)), fr is computed and \(\mathbf{a}(t)\) is unchanged. Given the old best function value \(\mathtt{fm}_{\mathrm{old}}\), the tth component of the gradient is computed by

$$\begin{aligned} \widetilde{g}_{t} := \left( \mathtt{fr}-\mathtt{fm}_{\mathrm{old}}\right) /\mathbf{a}(t); \end{aligned}$$
(66)

otherwise, it is computed by

$$\begin{aligned} \widetilde{g}_{t} := \left( \mathtt{fm}-\mathtt{fm}_{\mathrm{old}}\right) /\mathbf{a}(t), \end{aligned}$$
(67)

where both \(\mathtt{fm}_{\mathrm{old}}\) and \(\mathbf{a}(t)\) are updated by extrapolationStep. Later we add this computation to an improved version of MLS-basic.
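The two cases (66)–(67) can be sketched as follows (our naming; the flag `sufficient_gain` encodes \(n_{\mathrm{sg}}\ne 0\)):

```python
def fd_gradient_component(fm, fm_old, fr, a_t, sufficient_gain):
    """Sketch of (66)-(67): the tth gradient component from function
    values produced by extrapolationStep, with step size a_t = a(t).
    If no sufficient gain was found, the trial value fr is used;
    otherwise the new best value fm is used."""
    if not sufficient_gain:
        return (fr - fm_old) / a_t       # (66)
    return (fm - fm_old) / a_t           # (67)
```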

6 The implemented version of VRBBO-basic

In this section, we discuss the implementation of VRBBO-basic with improvements that are of a heuristic nature, very important for efficiency, and do not change the order of our complexity results. Thus, VRBBO attains the same order of complexity as the method of Bergou et al., but with a guarantee that holds with probability arbitrarily close to 1; see Table 3. Numerical results in Sect. 7 show that, in practice, VRBBO matches the quality of state-of-the-art algorithms for unconstrained black box optimization problems. VRBBO [35] is implemented in Matlab; the source code is obtainable from https://doi.org/10.5281/zenodo.564816 and http://www.mat.univie.ac.at/~neum/software/VRBBO.

It includes many subalgorithms described earlier in Sects. 2–4. The others have a simpler structure; hence we skip their details (which can be found at the above website) and only state their goals and those tuning parameters which have not yet been defined. These subalgorithms are identifyDir, lbfgsDir, updateSY, updateXF, updateCum, enforceAngle, direction, MLS, FDS, and setScales, described below.

Before a direction is computed, its type must be identified by identifyDir. VRBBO calls direction to generate five kinds of direction vectors: coordinate directions, limited memory quasi-Newton directions, random subspace directions, random directions, and cumulative directions, as detailed in Sect. 5:

  • Coordinate directions are the coordinate axes \(e_i\), \(i=1,\ldots ,n\), used in a cyclic fashion. They enhance the global search properties, their effect decreasing on average with the number of function evaluations used. Moreover, they are used to approximate the gradient by the finite difference approach.

  • Limited memory quasi-Newton directions are computed by lbfgsDir (standard limited memory BFGS direction [61]). Due to rounding errors, the computed direction may not satisfy the angle condition (15); hence it must be modified by enforceAngle, as discussed in [36].

  • updateSY, updateXF, and updateCum are auxiliary routines for updating the data needed for calculating limited memory quasi-Newton steps, random subspace steps, and cumulative steps, respectively.

These subalgorithms use \(\mathtt{cum}\) (the cumulative step type), \(\mathtt{ms}_{\max }\) (the maximum number of best points kept), \(\mathtt{mq}_{\max }\) (the memory length for the L-BFGS approximation), \(0<\gamma _w<1\) and \(0<\gamma _{\mathrm{a}}<1\) (tiny parameters for the angle condition), scSub (whether to scale the random subspace direction), and scCum (whether to scale the cumulative direction) as tuning parameters.

We denote by C the number of coordinate directions, by R the number of random directions, and by S the number of subspace directions in each repeated call to a multi-line search algorithm, an improved version of MLS-basic called MLS; the cumulative direction and the L-BFGS direction are each computed once. Here T is the total number of directions of these five kinds, satisfying \(1\le T \le C+S+R+2\); C, R, and S are tuning parameters.

Denote by FDS the improved version of FDS-basic. Both setScales and FDS work by making repeated calls to MLS. In a line search fashion, MLS polls a few objective function values along several suitably chosen directions (implemented by direction), in the hope of reducing the objective by more than a multiple of \(\varDelta \). Schematically, it works as follows:

  • At first, at most C iterations with coordinate directions are used. They are used to approximate the gradient.

  • Then, the L-BFGS direction is used only once since the gradient has been estimated by the finite difference technique using the coordinate directions.

  • Next, except in the final iteration, at most S iterations with subspace directions are used. In our numerical experiments, these directions are very useful, especially after the coordinate and L-BFGS directions have been used.

  • After generating \(T-1\) directions without finding a sufficient gain, a cumulative direction is used as the final, Tth direction in the hope of finding a model-based gain.

MLS calls an improved version of extrapolationStep, which is identical to extrapolationStep except that it updates the cumulative step q and the cumulative gain r by updateCum whenever the second cumulative direction is used.

VRBBO initially calls the algorithm setScales to estimate a good scaling of norms, step lengths, and related control parameters. Then, in each iteration, it uses FDS, which aims to repeatedly reduce the function value by an amount of at least a multiple of \(\varDelta \) to update the best point. If no sufficient gain is found in a call to FDS, \(\varDelta \) is reduced by a factor of Q. Once \(\varDelta \) is below a minimum threshold or nfmax is reached, VRBBO is terminated.

An important question is the ordering of the search directions. In Sect. 7, it will be shown that using subspace directions (limited memory quasi-Newton and random subspace directions) after the coordinate directions is clearly preferable. Changing the ordering of the other directions has little effect on the efficiency of our algorithm.

Statement (i) of Theorem 1 remains valid when R is replaced by T, and statement (ii) remains valid with probability \(\ge 1-2^{C+S+2-T}=1-2^{-R}\).

Let \(T_0\) be the maximal number of multi-line searches in setScales, a tuning parameter. Then setScales uses \((2T+1)T_0\) function evaluations, which does not affect the order of the complexity bounds. Theorems 3–5 are valid with probability \(\ge 1-2^{2+C+S-T}=1-2^{-R}\), where 5 kinds of directions are used. Given the tuning parameter \(\mathtt{alg}\in \{0,1,2,3,4,5\}\) (algorithm type), we now discuss the factor in the bounds depending on the number of search directions used in MLS-basic. We distinguish the following cases:

  • In the first case (\(\mathtt{alg}=0\)), \(T=R <n\) random directions are used. Then the complexity results of Table 3 are valid. This variant of VRBBO is denoted by VRBBO-basic1.

  • In the second case (\(\mathtt{alg}=1\)), \(T=R \ge n\) random directions are used. Then the complexity results of Table 3 are valid, but with a factor of \(n^2\). This variant of VRBBO is denoted by VRBBO-basic2.

  • In the third case (\(\mathtt{alg}=2\)), random, random subspace, and cumulative directions are used, whose total number is \(T=S+R+1 <n\). The complexity results of Table 3 are valid. This variant of VRBBO is denoted by VRBBO-C-Q; it ignores the coordinate and limited memory quasi-Newton directions.

  • In the fourth case (\(\mathtt{alg}=3\)), coordinate, random, random subspace, and cumulative directions are used, whose total number is \(T=C+S+R+1 >n\). The complexity results of Table 3 are valid, but with a factor of \(n^2\). This variant of VRBBO is denoted by VRBBO-Q; it ignores the limited memory quasi-Newton directions.

  • In the fifth case (\(\mathtt{alg}=4\)), only the subspace directions are ignored. The total number of directions used is \(T=C+R+2>n\); hence the complexity results are valid, but with a factor of \(n^2\). This variant of VRBBO is denoted by VRBBO-S.

  • In the sixth case (\(\mathtt{alg}=5\)), coordinate, L-BFGS, random, random subspace, and cumulative directions are used successively, whose total number is \(T=C+S+R+2 >n\). The complexity results of Table 3 are valid, but with a factor of \(n^2\). This variant of VRBBO is the default version.

This defines six versions of VRBBO: the full algorithm and five simplified variants. In Sect. 7, we compare them and show that each simplification degrades the algorithm. This means that all heuristic components of VRBBO are necessary for the best performance.

7 Numerical results

In this section we compare our new solver with other state-of-the-art solvers on a large public benchmark.

7.1 Default parameters for VRBBO

For our tests we used the following parameter choices:

Table 4 The values of the tuning parameters

Although the best theoretical complexity is obtained for

$$\begin{aligned} R=\varOmega (\log \eta ^{-1}) \ \ \hbox {for a given }\,0<\eta \le \frac{1}{2}, \end{aligned}$$

the best numerical results are obtained for a much larger \(R\sim n\).

\(\varDelta _{\min }=0\) implies that the algorithm stops only due to nfmax or secmax; here secmax is the maximal time in seconds.

In recent years, there has been increasing interest in finding the best tuning parameter configuration for derivative-free solvers with respect to a benchmark problem set; see, e.g., [5, 50, 51]. Table 4 lists 7 integral, 2 binary, 2 ternary, and 14 continuous tuning parameters, giving a total of 25 parameters for tuning our algorithm. A small amount of tuning was done by hand; automatic tuning of VRBBO will be considered elsewhere.

7.2 Test problems used

We compare 30 competitive solvers from the literature (discussed in Sect. 7.3) on all 549 unconstrained problems from the CUTEst [23] collection of test problems for optimization and on the test problems of Jamil & Yang [33] for global optimization, with 2–5000 variables; in the case of variable dimension problems, all allowed dimensions in this range were used. For problems of dimension \(n\ge 21\), only the most robust and fastest solvers were compared. To prevent solvers from guessing the solution of toy problems with a simple solution (such as all zeros or all ones), we shifted the arguments by \(\xi _i = (-1)^{i-1}2/(2+i)\) for \(i=1,\ldots ,n\).

As discussed earlier, nfmax denotes the maximal number of function evaluations, secmax denotes the maximal time in seconds, and nf denotes the number of function evaluations actually used. We limited the budget available for each solver by allowing at most

$$\begin{aligned} \mathtt{secmax}:={\left\{ \begin{array}{ll}360&{} \hbox {if}\, 2\le n \le 20, \\ 420&{} \hbox {if}\,21\le n \le 100, \\ 720 &{} \hbox {if}\, 101\le n \le 1000, \\ 1800 &{} \hbox {if}\, 1001\le n \le 5000\\ \end{array}\right. } \end{aligned}$$

seconds of run time and at most

$$\begin{aligned} \mathtt{nfmax}:={\left\{ \begin{array}{ll}100n, 500n, 1000n &{} \hbox {if}\, 2\le n \le 20, \\ 100n, 500n, 1000n&{} \hbox {if}\, 21\le n \le 100,\\ 100n, 500n &{} \hbox {if}\, 101\le n \le 5000\\ \end{array}\right. } \end{aligned}$$

function evaluations for a problem with n variables. We ran all solvers by monitoring, in the function evaluation routine, the number of function values and the time used, until the corresponding bound was met or an error occurred. We saved the time and the number of function values at each improved function value and determined afterwards when the target accuracy was reached. To arrive at the above choices for nfmax and secmax, we made preliminary runs to ensure that the best solvers can solve most of the test problems. Both nfmax and secmax are input parameters for all solvers.
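
For concreteness, the two budget rules above can be expressed as a small helper (the function name is hypothetical; it is a sketch of the rules only):

```python
def budgets(n):
    """Return (secmax, list of nfmax values) for a problem of dimension n,
    following the time and evaluation limits stated above."""
    if 2 <= n <= 20:
        secmax = 360
    elif n <= 100:
        secmax = 420
    elif n <= 1000:
        secmax = 720
    elif n <= 5000:
        secmax = 1800
    else:
        raise ValueError("dimension outside the tested range 2..5000")
    # three evaluation budgets for n <= 100, two for larger n
    nfmax = [100 * n, 500 * n, 1000 * n] if n <= 100 else [100 * n, 500 * n]
    return secmax, nfmax
```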

A problem with dimension n is considered solved by the solver s if the target accuracy criterion

$$\begin{aligned} q_{s}:=\left( f_{s}-f_{\mathrm{opt}}\right) /\left( f_{\mathrm{init}}-f_{\mathrm{opt}}\right) \le \epsilon ={\left\{ \begin{array}{ll}10^{-4} &{} \hbox {if}\, 1\le n \le 100, \\ 10^{-3} &{} \hbox {if}\,101\le n \le 5000,\end{array}\right. } \end{aligned}$$

holds, where \(f_{\mathrm{init}}\) is the function value of the starting point (common to all solvers), \(f_{\mathrm{opt}}\) is the function value of the best point known to us, and \(f_{s}\) is the best function value found by the solver s. Otherwise, the problem is counted as unsolved, since either nfmax or secmax has been reached.
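
As a sketch, the criterion reads (hypothetical function name):

```python
def solved(f_s, f_init, f_opt, n):
    """Target-accuracy test: the relative gain q_s must reach the
    dimension-dependent threshold eps from the criterion above."""
    eps = 1e-4 if n <= 100 else 1e-3
    q_s = (f_s - f_opt) / (f_init - f_opt)
    return q_s <= eps
```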

Note that this amounts to testing for finding the global minimizer to some reasonable accuracy. We did not check which of the test problems were multimodal, so that descent algorithms might end up in a local minimum only.

The best point known to us was obtained through numerous attempts to find the best local or global minimizer for all test problems, by calling several gradient-based solvers such as LMBFG-DDOGL, LMBFG-EIG-MS, and LMBFG-EIG-curve-inf presented by Burdakov et al. [11], ASACG presented by Hager & Zhang [26], and LMBOPT implemented by Kimiaei et al. [36]. The condition \(\Vert g^k\Vert _\infty \le 10^{-5}\) was satisfied for all test problems except those listed in Sect. 9.2.

For more refined statistics, we use our test environment (Kimiaei & Neumaier [47]) for comparing optimization routines on the CUTEst test problem collection of Gould et al. [23]. A solver is called efficient if it has the lowest relative cost in function evaluations, and robust if it solves the highest number of problems, compared to the state-of-the-art BBO solvers. Performance profiles of Dolan & Moré [19] and data profiles of Moré & Wild [42] for the cost measure nf (number of function evaluations needed to reach the target) are displayed to identify which solvers are competitive (efficient and robust) for small to high dimensions. We denote by \({\mathcal {S}}\) the list of compared solvers, by \({\mathcal {P}}\) the list of problems, by \(n_{p}\) the dimension of the problem \(p\in {\mathcal {P}}\), and by \(c_{p,s}\) the cost measure of the solver s on the problem p. The performance profile of the solver s

$$\begin{aligned} \rho _{s}(\tau ):=\frac{1}{|{\mathcal {P}}|}\Big |\Big \{p\in {\mathcal {P}} ~\Big |~ pr_{p,s}\le \tau \Big \}\Big | \end{aligned}$$
(68)

is the fraction of problems on which the performance ratio \( pr_{p,s}:=\displaystyle \frac{c_{p,s}}{\min (c_{p,\overline{s}}\mid \overline{s}\in {\mathcal {S}})}\) of the solver s is at most \(\tau \), while the data profile of the solver s

$$\begin{aligned} \delta _{s}(\kappa ):=\frac{1}{|{\mathcal {P}}|}\Big |\Big \{p\in {\mathcal {P}}~ \Big |~cr_{p,s}\le \kappa \Big \}\Big | \end{aligned}$$
(69)

is the fraction of problems solved by the solver s within \(\kappa \) groups of \(n_{p}+1\) function evaluations, i.e., for which the cost ratio \(cr_{p,s}:=\displaystyle \frac{c_{p,s}}{n_{p}+1}\) is at most \(\kappa \).
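
Both profiles are easily computed from a matrix of cost measures; the following sketch (a hypothetical helper, assuming that problems solved by no solver have been removed beforehand, as in our figures) marks unsolved problem/solver pairs with infinity:

```python
import numpy as np

def profiles(costs, dims, taus, kappas):
    """Performance profiles rho(tau) and data profiles delta(kappa), cf. (68)-(69).

    costs[p, s] : cost measure (here: nf) of solver s on problem p; np.inf if unsolved.
    dims[p]     : dimension n_p of problem p.
    """
    costs = np.asarray(costs, dtype=float)
    best = costs.min(axis=1, keepdims=True)                # best cost over all solvers
    pr = costs / best                                      # performance ratios pr_{p,s}
    cr = costs / (np.asarray(dims, float)[:, None] + 1.0)  # cost ratios cr_{p,s}
    rho = np.array([(pr <= t).mean(axis=0) for t in taus])     # fraction with pr <= tau
    delta = np.array([(cr <= k).mean(axis=0) for k in kappas])  # fraction with cr <= kappa
    return rho, delta
```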

7.3 Codes compared

We compare VRBBO with the following solvers for unconstrained black box optimization. For some of the solvers we had to choose options different from the default to make them competitive; if nothing is said, the default options were used.

  • SNOBFIT, obtained from

    http://www.mat.univie.ac.at/~neum/software/snobfit/snobfit_v2.1.tar.gz

    is a combination of a branching strategy, which enhances the chance of finding a global minimum, with a sequential quadratic programming method based on fitted quadratic models, which gives good local properties, by Huyer & Neumaier [32].

  • NOMAD (version 3.9.1), obtained from https://www.gerad.ca/nomad

    is a Mesh Adaptive Direct Search algorithm (MADS) [1,2,3,4, 40]. NOMAD1 uses the following option set

    opts = nomadset('max_eval', nfmax, 'max_iterations', 2*nfmax, ...
        'min_mesh_size', '1e-008', 'initial_mesh_size', '10')

    while NOMAD2 uses the following option set

    opts = nomadset('max_eval', nfmax, 'max_iterations', 2*nfmax, ...
        'min_mesh_size', '1e-008', 'initial_mesh_size', '10', 'model_search', '0')
  • UOBYQA, NEWUOA, and BOBYQA obtained from

    https://www.pdfo.net/docs.html

    are model-based solvers by Powell [52,53,54].

  • STP-fs, STP-vs, and PSTP, obtained from the authors of Bergou et al. [9], are three versions of a stochastic direct search method with complexity guarantees.

  • BFO, obtained from

    https://github.com/m01marpor/BFO,

    is a trainable stochastic derivative-free solver for mixed integer bound-constrained optimization by Porcelli & Toint [50].

  • CMAES, obtained from

    http://cma.gforge.inria.fr/count-cmaes-m.php?Down=cmaes.m,

    is the stochastic covariance matrix adaptation evolution strategy by Auger & Hansen [6]. We used CMAES with the tuning parameters

    oCMAES.MaxFunEvals = nfmax; oCMAES.DispFinal = 0; oCMAES.DispModulo = 0;
    oCMAES.LogModulo = 0; oCMAES.SaveVariables = 0; oCMAES.MaxIter = nfmax;
    oCMAES.Restarts = 7;
  • GLOBAL, obtained from

    http://www.mat.univie.ac.at/~neum/glopt/contrib/global.f90,

    is a stochastic multistart clustering global optimization method by Csendes et al. [13]. We used GLOBAL with the tuning parameters

    oGLOBAL.MAXFNALL = nfmax; oGLOBAL.MAXFN = nfmax/5; oGLOBAL.DISPLAY = 'off';
    oGLOBAL.N100 = 300; oGLOBAL.METHOD = 'unirandi'; oGLOBAL.NG0 = 2;
  • DE, obtained from

    http://www.icsi.berkeley.edu/~storn/code.html,

    is the stochastic differential evolution algorithm by Storn & Price [57].

  • MCS, obtained from

    https://www.mat.univie.ac.at/~neum/software/mcs/,

    is a deterministic global optimization algorithm based on multilevel coordinate search by Huyer & Neumaier [31]. We used MCS with the tuning parameters

    iinit = 1; nfMCS = nfmax; smax = 5*n+10; stop = 3*n; local = 50;
    gamma = eps; hess = ones(n,n); prt = 0;
  • BCDFO, obtained from Anke Troeltzsch (personal communication), is a deterministic model-based trust-region algorithm for derivative-free bound-constrained minimization by Gratton et al. [25].

  • PSM, obtained from

    http://ferrari.dmat.fct.unl.pt/personal/alcustodio,

    is a deterministic pattern search method guided by simplex derivatives for use in derivative-free optimization proposed by Custódio & Vicente [15, 16].

  • FMINUNC, obtained from the Matlab Optimization Toolbox at

    https://ch.mathworks.com/help/optim/ug/fminunc.html,

    is a deterministic quasi-Newton or trust-region algorithm. We use FMINUNC with the options set by optimoptions as follows:

    opts = optimoptions(@fminunc, 'Algorithm', 'quasi-newton', 'Display', 'Iter', ...
        'MaxIter', Inf, 'MaxFunEvals', limits.nfmax, 'TolX', 0, 'TolFun', 0, ...
        'ObjectiveLimit', -1e-50);

    This is the standard quasi-Newton method, with step sizes found by the Wolfe conditions.

  • FMINSEARCH, obtained from the Matlab Optimization Toolbox at

    https://ch.mathworks.com/help/matlab/ref/fminsearch.html,

    is the deterministic Nelder–Mead simplex algorithm by Lagarias et al. [38]. We use FMINSEARCH with the options set by

    opts = optimset('Display', 'Iter', 'MaxIter', Inf, 'MaxFunEvals', limits.nfmax, ...
        'TolX', 0, 'TolFun', 0, 'ObjectiveLimit', -1e-50);
  • GCES is a globally convergent evolution strategy presented by Diouane et al. [17, 18]. The default parameters are used.

  • PSWARM, obtained from

    http://www.norg.uminho.pt/aivaz

    is a particle swarm pattern search algorithm for global optimization presented by Vaz & Vicente [59].

  • MDS, NELDER, and HOOKE, obtained from

    https://ctk.math.ncsu.edu/matlab_darts.html

    are multidirectional search, Nelder–Mead and Hooke–Jeeves algorithms, respectively, presented by Kelley [34]. The default parameters are used.

  • MDSMAX, NMSMAX, and ADSMAX, obtained from

    http://www.ma.man.ac.uk/~higham/mctoolbox/

    are multidirectional search, Nelder–Mead simplex and alternating directions method for direct search optimization algorithms, respectively, presented by Higham [29].

  • GLODS, obtained from

    http://ferrari.dmat.fct.unl.pt/personal/alcustodio/

    is Global and Local Optimization using Direct Search by Custódio & Madeira [14].

  • ACRS, obtained from

    http://www.iasi.cnr.it/~liuzzi/DFL/index.php/list3

    is a global optimization algorithm presented by Brachetti et al. [10].

  • SDBOX, obtained from

    http://www.iasi.cnr.it/~liuzzi/DFL/index.php/list3

    is a derivative-free algorithm for bound constrained optimization problems by Lucidi & Sciandrone [41].

  • DSPFD, available at

    pages.discovery.wisc.edu/%7Ecroyer/codes/dspfd_sources.zip,

    is a direct search Matlab code for derivative-free optimization by Gratton et al. [24]. The default parameters are used.

VRBBO and the other stochastic algorithms use random numbers, hence give slightly different results when run repeatedly. Due to run time constraints, each solver was run only once for each problem. However, we checked in preliminary tests that the summarized results reported were quite similar when another run was done.

Some of the other solvers have additional capabilities that were not used in our tests (e.g., allowing for bound constraints or integer constraints, or for noisy function values). Hence our conclusions say nothing about the performance of these solvers outside the task of global unconstrained black box optimization with noiseless function values (apart from rounding errors).

7.4 Results for small dimensions (\(n\le 20\))

Figure 2 shows a self-comparison and tuning of our solver with respect to the tuning parameter alg. As can be seen from the data and performance profiles, the two competitive versions of our solver are VRBBO and VRBBO-Q. In fact, VRBBO is somewhat more robust than VRBBO-Q, while VRBBO-Q is somewhat more efficient than VRBBO. As a result, VRBBO is recommended.

Results on CUTEst. Subfigures of Fig. 3 display three comparisons among all solvers at low to high budgets (\(\mathtt{nfmax}\in \{100n,500n,1000n\}\)) on CUTEst. The names of the solvers are given on the horizontal axis of these subfigures, sorted by the number of solved problems in descending order. As can be seen from these subfigures, UOBYQA is more robust and efficient than the others at low to high budgets. With increasing budget, the efficiency and robustness of VRBBO increase, making it the fourth most robust solver, with a low function evaluation cost compared to the other line search, direct search, and Nelder–Mead solvers, though not compared to some model-based solvers.

To make more detailed comparisons among the six most robust or efficient solvers, the data and performance profiles are shown in Fig. 4, whose subfigures confirm that UOBYQA is more robust and efficient than the others at low to high budgets.

In summary, for small scale problems from the CUTEst collection, the model-based solvers are recommended, and VRBBO is recommended for high budgets, since it is slightly more robust than some well-known model-based solvers (NEWUOA, BCDFO, and BOBYQA), although these solvers are slightly more efficient than VRBBO.

Results on GlobalTest. Subfigures of Fig. 3 display three comparisons among all solvers in low to high budgets (\(\mathtt{nfmax}\in \{100n,500n,1000n\}\)) on GlobalTest. The names of the solvers are given on the horizontal axis of these subfigures, sorted by the number of solved problems in descending order. As can be seen from these subfigures:

  • At low budget, BOBYQA and VRBBO rank first and second in robustness, while BOBYQA and NEWUOA rank first and second in efficiency.

  • For medium and large budgets, MCS and GLODS rank first and second in robustness. In fact, with increasing budget, the global solvers perform better. In this case, VRBBO and VRBBO-Q are comparable to the global solvers.

To make more accurate comparisons among the four most robust or efficient solvers, the data and performance profiles are shown in Fig. 4, whose subfigures confirm that MCS and GLODS are more robust and efficient than the others at medium and large budgets, respectively.

In summary, the global solvers are more robust than the local solvers in the GlobalTest collection, while our findings show that VRBBO is comparable to the global solvers in terms of the efficiency and robustness and is recommended for finding the global minimum.

Fig. 2

The first and third rows are performance profiles \(\rho (\tau )\) in dependence of a bound \(\tau \) on the performance ratio, see (68), while the second and fourth rows are data profiles \(\delta (\kappa )\) in dependence of a bound \(\kappa \) on the cost ratio, see (69). Problems solved by no solver are ignored

Fig. 3

Boxplots for small dimensions 2–20 in terms of nf

Fig. 4

Details as in Fig. 2

Fig. 5

Details as in Fig. 2

Fig. 6

Details as in Fig. 2

Fig. 7

Details as in Fig. 2

Fig. 8

Details as in Fig. 2

7.5 Results for medium dimensions (\(21\le n\le 100\))

For medium to very large scale problems, we excluded the model-based solvers from our comparison because they need \(\frac{1}{2}n(n+3)\) sample points to construct fully quadratic models. Some solvers were too slow and could not solve most problems even with an enlarged secmax. Therefore, we only compared solvers that were fast, efficient, or robust, and plotted the data and performance profiles for the most robust solvers on CUTEst and GlobalTest for medium to very large problems.

Results on CUTEst. We conclude from the performance profiles and the data profiles shown in Fig. 5 that VRBBO is much more efficient and robust than the other solvers.

Results on GlobalTest. We conclude from the performance profiles and the data profiles shown in Fig. 5, that:

  • At low budget, ADSMAX and VRBBO rank first and second in robustness, while VRBBO is much more efficient than ADSMAX.

  • At medium budget, ADSMAX, SDBOX, and VRBBO are the three most robust solvers, in this order.

  • At large budget, VRBBO is much more efficient and robust than the others.

In summary, although VRBBO is only comparable to the global solvers on small scale problems from GlobalTest, it is much more efficient and robust than they are on medium scale problems from GlobalTest.

7.6 Results for large dimensions (\(101\le n\le 1000\))

Results on CUTEst. As discussed in Sect. 7.5, VRBBO, FMINUNC, and SDBOX were the three most robust and efficient solvers on CUTEst. Fig. 6 displays the performance and data profiles comparing these solvers and shows that VRBBO is much more robust than the others and is recommended for large scale problems on CUTEst.

Results on GlobalTest. As discussed in Sect. 7.5, VRBBO, ADSMAX, and SDBOX were the three most robust and efficient solvers on GlobalTest. We conclude from the performance and data profiles shown in Fig. 7 that VRBBO is the most efficient and the second most robust solver for large scale problems on GlobalTest.

7.7 Results for very large dimensions (\(1001\le n\le 5000\))

Results on CUTEst. We conclude from the performance and data profiles shown in Fig. 8 that VRBBO is much more efficient and robust than the others for very large scale problems on CUTEst.

Results on GlobalTest. We conclude from the performance and data profiles shown in Fig. 9 that VRBBO is the most efficient and the second most robust solver for very large scale problems on GlobalTest.

Fig. 9

Details as in Fig. 2

Fig. 10

The plot of \(c_n\) versus the dimension n suggests that \(c_0\approx 16/7\)

8 Conclusion

We constructed an efficient randomized algorithm for unconstrained black box optimization problems. For the basic version of VRBBO with only random directions, the complexity bound for the nonconvex case, valid with probability arbitrarily close to 1, matches that found by Gratton et al. [24] for another algorithm. We also proved complexity bounds for VRBBO in the convex and strongly convex cases, with probability arbitrarily close to 1, essentially matching the bounds found by Bergou et al. [9], which are only valid in expectation.

An improved version of our algorithm has additional heuristic techniques that do not affect the order of the complexity results and that turn VRBBO into an efficient global solver, although our theory only guarantees local minimizers. In most cases this version even found either a global minimizer or, where this could not be verified, at least a point of quality similar to that found by the best competing global solvers.

The two competitive versions of our algorithm were VRBBO and VRBBO-Q, at low to high budgets on CUTEst and GlobalTest, due to the use of all the various directions and the additional heuristic techniques. As a consequence of our extensive numerical results, UOBYQA is our recommendation for small scale problems at low to high budgets on CUTEst, and MCS and GLODS on GlobalTest, while VRBBO is our recommendation for medium to very large scale problems at low to high budgets on CUTEst and GlobalTest.

9 Tools for VRBBO

9.1 Estimation of c

The following theorem was recently proved by Pinelis [48].

Theorem 6

There is a universal constant \(c_0\le 50\) such that for any fixed nonzero real vector q of any dimension n and any random vector p of the same dimension n with independent components uniformly distributed in \([-1,1]\), we have

$$\begin{aligned} \left( p^Tp\right) \left( q^Tq\right) \le c_0n\left( p^Tq\right) ^2 \end{aligned}$$
(70)

with probability \(\ge 1/2\).

Pinelis also proved the lower bound \(0.73<c_0\) for the best possible value of the constant \(c_0\). The true optimal value seems to be approximately 16/7, as suggested by numerical simulation. To estimate \(c_0\), we executed three times the Matlab commands (shown in Fig. 10)

figure e

using the algorithm PinConst below. All three outputs,

$$\begin{aligned} c_0=2.2582, c_0=2.2444, c_0=2.2714 \end{aligned}$$

are slightly smaller than \(16/7=2.2857...\).
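
Since PinConst is reproduced only as a figure in the typeset article, we sketch here one plausible Monte Carlo reconstruction of the estimate. The assumptions are ours: a fixed diagonal direction q (which simulation suggests is close to worst-case for the uniform cube distribution), and a hypothetical function name.

```python
import numpy as np

def estimate_c0(n=1000, trials=100000, seed=0):
    """Monte Carlo estimate of the smallest c such that
    (p'p)(q'q) <= c * n * (p'q)^2 holds with probability >= 1/2,
    for p uniform in [-1,1]^n and a fixed unit vector q."""
    rng = np.random.default_rng(seed)
    q = np.ones(n) / np.sqrt(n)                  # diagonal direction (assumed near worst-case)
    p = rng.uniform(-1.0, 1.0, size=(trials, n))
    ratio = (p * p).sum(axis=1) * (q @ q) / (n * (p @ q) ** 2)
    return float(np.median(ratio))               # the median is exceeded with probability 1/2
```

With the default parameters this sketch returns values around 2.2, consistent with the three runs reported above.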

figure f

9.2 A list of test problems with \(f_{\mathrm{opt}}\)

Here we list the CUTEst test problems for which our best point did not satisfy the condition

$$\begin{aligned} \Vert g^k\Vert _\infty \le 10^{-5}. \end{aligned}$$

Problem          dim    \(f_{\mathrm{opt}}\)    \(\Vert g\Vert _\infty \)    \(\Vert g\Vert _2\)

BROWNBS            2    -2.80e+00    2.08e-05    2.08e-05
DJTL               2    -8.95e+03    1.44e-04    1.44e-04
STRATEC           10     2.22e+03    8.10e-05    1.42e-04
SCURLY10:10       10    -1.00e+03    5.34e-04    5.50e-04
OSBORNEB          11     2.40e-01    3.47e-02    3.47e-02
ERRINRSM:50       50     3.77e+01    1.89e-05    1.89e-05
ARGLINC:50        50     1.01e+02    1.29e-05    5.28e-05
HYDC20LS          99     1.12e+01    5.54e-01    8.79e-01
PENALTY3:100     100     9.87e+03    2.01e-03    4.68e-03
SCOSINE:100      100    -9.30e+01    1.95e-02    3.58e-02
SCURLY10:100     100    -1.00e+04    5.74e-02    1.56e-01
NONMSQRT:100     100     1.81e+01    3.42e-05    6.51e-05
PENALTY2:200     200     4.71e+13    3.85e-04    1.07e-03
ARGLINB:200      200     9.96e+01    3.27e-04    2.68e-03
SPMSRTLS:499     499     1.69e+01    1.08e-05    3.59e-05
PENALTY2:500     500     1.14e+39    1.97e+26    4.08e+26
MSQRTBLS:529     529     1.13e-02    1.44e-05    1.03e-04
NONMSQRT:529     529     6.13e+01    2.17e-05    1.76e-04
SCOSINE         1000    -9.21e+02    3.38e-03    9.32e-03
SCURLY10        1000    -1.00e+05    5.49e+01    3.37e+02
COSINE          1000    -9.99e+02    5.00e-05    6.34e-05
PENALTY2:1000   1000     1.13e+83    2.53e+77    3.41e+77
SINQUAD:1000    1000    -2.94e+05    1.21e-05    1.52e-05
SPMSRTLS:1000   1000     3.19e+01    9.75e-05    2.26e-04
NONMSQRT:1024   1024     9.01e+01    1.73e-04    1.28e-03
MSQRTALS:4900   4900     7.60e-01    1.88e-03    3.56e-02
SPMSRTLS:4999   4999     2.05e+02    2.36e-03    9.27e-03
INDEFM:5000     5000    -5.02e+05    1.43e-05    2.00e-05
SBRYBND:5000    5000     2.58e-10    3.73e-04    3.50e-03
SCOSINE:5000    5000    -4.60e+03    6.32e-03    2.72e-02
NONCVXUN:5000   5000     1.16e+04    3.94e-05    7.19e-04