Introduction

Direct search is a broad class of methods for optimization without derivatives. It includes methods of simplicial type (Conn et al. 2009, Chapter 8), like the Nelder-Mead method and its numerous modifications (where typically one moves away from the worst point), and methods of directional type (Conn et al. 2009, Chapter 7) (where one tries to move along a direction defined by the best or a better point). This paper focuses on direct-search methods of the latter type applied to smooth (continuously differentiable) objective functions. The problem under consideration is the unconstrained minimization of a real-valued function, stated as \(\min _{x \in \mathbb R ^n} f(x)\).

Each iteration of the direct-search methods (of directional type) can be organized around a search step (optional) and a poll step, and it is the poll step that is responsible for the global convergence of the overall method (meaning the convergence to some form of stationarity independently of the starting point). In the poll step, the objective function is evaluated at a finite set of polling points defined by a step size parameter and a set of polling directions. When the objective function is continuously differentiable with Lipschitz continuous gradient, it suffices for global convergence that the polling directions form a positive spanning set (i.e., a set of vectors whose linear combinations with non-negative coefficients span \(\mathbb R ^n\)) and that new iterates are accepted only when they satisfy a sufficient decrease condition based on the step size parameter.

In this paper, we will prove that in at most \(\mathcal O (\epsilon ^{-2})\) iterations these methods are capable of driving the norm of the gradient of the objective function below \(\epsilon \). It was shown by Nesterov (2004, p. 29) that the steepest descent method for unconstrained optimization takes at most \(\mathcal O (\epsilon ^{-2})\) iterations or gradient evaluations for the same purpose (the stepsize is assumed to verify a sufficient decrease condition and another one to avoid too short steps, like a curvature type condition). Direct-search methods that poll using positive spanning sets are directional methods of descent type, and despite not using gradient information as in steepest descent, it is not unreasonable to expect that they share the same worst case complexity bound of the latter method in terms of the number of iterations, provided new iterates are only accepted based on a sufficient decrease condition. In fact, it is known that one of the directions of a positive spanning set necessarily makes an acute angle with the negative gradient (when the objective function is continuously differentiable), and, as we will see in the paper, this is what is needed to achieve the same power of \(\epsilon \) in terms of iterations as for steepest descent. There is an additional effort, in terms of objective function evaluations, related to not knowing in advance which of these directions is a descent direction and to searching for the corresponding decrease, but that effort is reflected in a power of \(n\).

More concretely, based on the properties of positive spanning sets and on the number of objective function evaluations taken in an unsuccessful poll step, we will then conclude that at most \(\mathcal O (n^2 \epsilon ^{-2})\) objective function evaluations are required to drive the norm of the gradient of the objective function below \(\epsilon \).

As a direct consequence of this result, and following what Nesterov (2004, p. 29) states for first order oracles, one can ensure an upper complexity bound for the following problem class (where one can only evaluate the objective function and not its derivatives):

Model: unconstrained minimization, \(f \in C^1_{\nu }(\mathbb R ^n)\), \(f\) bounded below.

Oracle: zero-order oracle (evaluation of \(f\)).

\(\epsilon \)-solution: \(f(x_*^\mathrm{appr}) \le f(x_0)\), \(\Vert \nabla f(x_*^\mathrm{appr}) \Vert \le \epsilon \)

where \(f\) is assumed smooth with Lipschitz continuous gradient (with constant \(\nu > 0\)), \(x_*^\mathrm{appr}\) is the approximate solution found, and \(x_0\) is the starting point chosen in a method. Our result thus says that the number of calls of the oracle is \(\mathcal O (n^2 \epsilon ^{-2})\), thereby establishing an upper complexity bound for the above problem class.

Such an analysis of worst case complexity contributes to a better understanding of the numerical performance of this class of derivative-free optimization methods. One knows that these methods, when strictly based on polling, are typically slow, although capable of solving most problem instances. They are also very appealing for parallel environments due to the natural way of parallelizing the poll step. The global rate of convergence of \(n^2 \epsilon ^{-2}\) indicates that their work (in terms of objective function evaluations) is roughly proportional to the inverse of the square of the targeted accuracy \(\epsilon \). It also tells us that such an effort is proportional to the square of the problem dimension \(n\), and this indication is certainly of relevance for computational budgetary considerations.

The structure of the paper is as follows. In “Direct-search algorithmic framework”, we describe the class of direct search under consideration. Then, in “Worst case complexity”, we analyze the worst case complexity or cost of such direct-search methods. Conclusions and extensions of our work are discussed in “Final remarks”. The notation \(\mathcal O (M)\) in our paper means a multiple of \(M\), where the constant multiplying \(M\) does not depend on the iteration counter \(k\) of the method under analysis (thus depending only on \(f\) or on algorithmic constants set at the initialization of the method). The dependence of \(M\) on the dimension \(n\) of the problem will be made explicit whenever appropriate. The vector norms will be \(\ell _2\) ones.

Direct-search algorithmic framework

We will follow the algorithmic description of generalized pattern search in Audet and Dennis (2002) (also adopted in Conn et al. 2009, Chapter 7 for direct-search methods of directional type). Such a framework can describe the main features of pattern search, generalized pattern search (GPS) (Audet and Dennis 2002), and generating set search (GSS) (Kolda et al. 2003).

As we said in the introduction, each iteration of the direct-search algorithms under study here is organized around a search step (optional) and a poll step. The evaluation process of the poll step is opportunistic, meaning that one moves to a poll point in \(P_k = \{x_k+\alpha _k d: \, d \in D_k \}\), where \(\alpha _k\) is a step size parameter and \(D_k\) a positive spanning set, once some type of decrease is found.

As in the GSS framework, we include provision to accept new iterates based on a sufficient decrease condition which uses a forcing function. Following the terminology in Kolda et al. (2003), \(\rho :(0,+\infty )\rightarrow (0,+\infty )\) will represent a forcing function, i.e., a non-decreasing (continuous) function satisfying \(\rho (t)/t \rightarrow 0\) when \(t \downarrow 0\). Typical examples of forcing functions are \(\rho (t) = c \, t^{p}\), for \(p > 1\) and \(c > 0\). We are now ready to describe in Algorithm 1 the class of methods under analysis in this paper.

[Algorithm 1: direct-search method of directional type]
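To make the framework concrete, the following is a minimal sketch of a direct-search method of this type (an illustration written for this text, not the paper's exact Algorithm 1): it polls along \(D = [I \; -I]\), accepts a poll point only under the sufficient decrease condition \(f(x_k + \alpha _k d) < f(x_k) - \rho (\alpha _k)\) with \(\rho (t) = c\,t^2\), and contracts the step size on unsuccessful iterations; all constants are arbitrary choices made here for concreteness.

```python
import numpy as np

def direct_search(f, x0, alpha0=1.0, eps_alpha=1e-8, max_iter=10000,
                  c=1e-4, p=2, beta=0.5, gamma=1.0):
    """Directional direct search with sufficient decrease (a sketch).

    Polls along the positive spanning set D = [I, -I] and accepts a
    poll point x + alpha*d only when f decreases by at least
    rho(alpha) = c*alpha**p; otherwise the step size is contracted
    by beta in (0, 1) (successful steps are scaled by gamma >= 1).
    """
    x, alpha = np.asarray(x0, dtype=float), alpha0
    fx = f(x)
    n = len(x)
    D = np.vstack([np.eye(n), -np.eye(n)])   # positive spanning set [I, -I]
    for _ in range(max_iter):
        if alpha < eps_alpha:
            break
        rho = c * alpha**p                    # forcing function rho(t) = c t^p
        for d in D:                           # opportunistic polling
            ft = f(x + alpha * d)
            if ft < fx - rho:                 # sufficient decrease found
                x, fx = x + alpha * d, ft
                alpha *= gamma                # successful iteration
                break
        else:
            alpha *= beta                     # unsuccessful: contract step
    return x, fx, alpha
```

On a smooth function bounded below, the step size is driven to zero along unsuccessful iterations (Lemma 1), and at those iterations the gradient is small by Theorem 1.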

As we will see later, the global convergence of these methods is heavily based on the analysis of the behavior of the step size parameter \(\alpha _k\), in particular, on having \(\alpha _k\) approaching zero. There are essentially two known ways of enforcing the existence of a subsequence of step size parameters converging to zero in direct search of directional type. One way allows the iterates to be accepted based solely on a simple decrease of the objective function, but restricts the iterates to discrete sets defined by some integral or rational requirements (see Kolda et al. 2003; Torczon 1997). Another way is to impose a sufficient decrease condition on the acceptance of new iterates, as we did in Algorithm 1.

Intuitively speaking, insisting on a sufficient decrease will make the function values decrease by a certain non-negligible amount each time a successful iteration is performed. Thus, under the assumption that the objective function \(f\) is bounded from below, it is possible to prove that there exists a subsequence of unsuccessful iterates driving the step size parameter to zero (see Kolda et al. 2003 or Conn et al. 2009, Theorems 7.1 and 7.11 and Corollary 7.2).

Lemma 1

Let \(f\) be bounded from below on \(L(x_0)=\{ x \in \mathbb R ^n: f(x) \le f(x_0) \}\). Then Algorithm 1 generates an infinite subsequence \(K\) of unsuccessful iterates for which \(\displaystyle \lim \nolimits _{k \in K} \alpha _k=0\).

Under this globalization strategy, one has plenty of freedom in choosing the positive spanning sets used for polling, as long as they do not deteriorate significantly (in the sense of coming close to losing the positive spanning property). To quantify such a deterioration, we recall first the cosine measure of a positive spanning set \(D_k\) (with non-zero vectors), which is defined by (see Kolda et al. 2003)

$$\begin{aligned} \mathrm{cm}(D_k) \; = \; \min _{0 \ne v \in \mathbb R ^n} \max _{d \in D_k} \frac{v^\top d}{\Vert v\Vert \Vert d \Vert }. \end{aligned}$$
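For illustration (this numerical sketch is added here and is not part of the original analysis), the cosine measure can be estimated by discretizing the set of unit vectors \(v\); in \(\mathbb R ^2\), the positive basis \(D = [I \; -I]\) used in coordinate search has \(\mathrm{cm}(D) = 1/\sqrt{n} = 1/\sqrt{2}\), a value recovered below.

```python
import numpy as np

def cosine_measure_2d(D, num_angles=3600):
    """Estimate cm(D) for directions in R^2 via an angle grid over unit v.

    cm(D) = min over unit v of max over d in D of (v.d)/(|v| |d|);
    the grid is a discretization, so this is only an estimate.
    """
    best = np.inf
    for theta in np.linspace(0.0, 2.0 * np.pi, num_angles, endpoint=False):
        v = np.array([np.cos(theta), np.sin(theta)])   # unit vector
        cos_max = max(float(v @ d) / np.linalg.norm(d) for d in D)
        best = min(best, cos_max)
    return best

D = [np.array([1.0, 0.0]), np.array([0.0, 1.0]),
     np.array([-1.0, 0.0]), np.array([0.0, -1.0])]     # [I, -I] in R^2
print(cosine_measure_2d(D))   # close to 1/sqrt(2), about 0.7071
```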

A positive spanning set (with non-zero vectors) has a positive cosine measure. Global convergence results essentially from the following fact, which is taken from Dolan et al. (2003) and Kolda et al. (2003) (see also Conn et al. 2009, Theorem 2.1 and Equation (7.14)) and describes the relationship between the size of the gradient and the step size parameter at unsuccessful iterations.

Theorem 1

Let \(D_k\) be a positive spanning set and \(\alpha _k > 0\) be given. Assume that \(\nabla f\) is Lipschitz continuous (with constant \(\nu > 0\)) in an open set containing all the poll points in \(P_k\). If \(f(x_k) \le f(x_k+\alpha _k d) + \rho (\alpha _k)\), for all \(d \in D_k\), i.e., the iteration \(k\) is unsuccessful, then

$$\begin{aligned} \Vert \nabla f(x_k) \Vert \; \le \; \mathrm{cm}(D_k)^{-1} \left( \frac{\nu }{2} \alpha _k \max _{d \in D_k} \Vert d\Vert +\frac{ \rho (\alpha _k) }{ \alpha _k \displaystyle \min _{d \in D_k} \Vert d\Vert } \right). \end{aligned}$$
(1)
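As a quick numerical sanity check of (1) (an illustration added here; the quadratic objective, the point, and all constants are arbitrary choices), consider \(f(x) = \frac{1}{2}\Vert x \Vert ^2\), whose gradient is Lipschitz continuous with \(\nu = 1\), polled with \(D = [I \; -I]\) and \(\rho (t) = t^2\):

```python
import numpy as np

# f(x) = 0.5 ||x||^2 has gradient x, Lipschitz continuous with nu = 1.
f = lambda x: 0.5 * float(x @ x)
grad = lambda x: x

nu, c = 1.0, 1.0                        # Lipschitz and forcing constants
rho = lambda t: c * t**2                # forcing function rho(t) = c t^2
x_k, alpha_k = np.array([0.1, 0.1]), 0.1
D = [np.array(d, dtype=float) for d in
     ([1, 0], [0, 1], [-1, 0], [0, -1])]   # [I, -I], cm = 1/sqrt(2)

# The iteration is unsuccessful: no poll point yields sufficient decrease.
unsuccessful = all(f(x_k) <= f(x_k + alpha_k * d) + rho(alpha_k) for d in D)

# Right-hand side of inequality (1), with max ||d|| = min ||d|| = 1.
cm = 1.0 / np.sqrt(2.0)
rhs = (1.0 / cm) * (0.5 * nu * alpha_k + rho(alpha_k) / alpha_k)
lhs = np.linalg.norm(grad(x_k))

print(unsuccessful, lhs <= rhs)         # both hold for this choice
```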

The positive spanning sets used when globalization is achieved by a sufficient decrease condition are then required to satisfy the following assumption (see Kolda et al. 2003), where the cosine measure stays sufficiently positive and the size of the directions neither approaches zero nor tends to infinity.

Assumption 1

All positive spanning sets \(D_k\) used for polling (for all \(k\)) must satisfy \(\mathrm{cm}(D_k) \ge \mathrm{cm}_\mathrm{min}\) and \(d_\mathrm{min} \le \Vert d \Vert \le d_\mathrm{max}\) for all \(d \in D_k\) (where \(\mathrm{cm}_\mathrm{min} > 0\) and \(0 < d_\mathrm{min} < d_\mathrm{max}\) are constants).

One can then easily see from Theorem 1 (under Assumption 1) that when \(\alpha _k\) tends to zero (see Lemma 1) so does the gradient of the objective function. Theorem 1 will also be used in the next section to measure the worst case complexity of Algorithm 1.

Worst case complexity

We will now derive the worst case complexity bounds on the number of successful and unsuccessful iterations for direct-search methods in the smooth case (Algorithm 1 obeying Assumption 1 and using a sufficient decrease condition corresponding to a specific forcing function \(\rho (\cdot )\)). We will consider the search step either empty or, when applied, using a number of function evaluations not much larger than the maximum number of function evaluations made in a poll step; more precisely, we assume that the number of function evaluations made in the search step is at most of the order of \(n\). This issue will become clear later in Corollary 2, when we multiply the number of iterations by the number of function evaluations made in each iteration.

As we know, each iteration of Algorithm 1 is either successful or unsuccessful. Therefore, in order to derive an upper bound on the total number of iterations, it suffices to derive separately upper bounds on the number of successful and unsuccessful iterations. The following theorem presents an upper bound on the number of successful iterations after the first unsuccessful one (which, from Lemma 1, always exists when the objective function is bounded from below). Note that when \(p=2\), one has \(p/\min (p-1,1) = 2\).

Theorem 2

Consider the application of Algorithm 1 when \(\rho (t) = c \, t^p\), \(p > 1\), \(c > 0\), and \(D_k\) satisfies Assumption 1. Let \(f\) satisfy \(f(x) \ge f_\mathrm{low}\) for \(x\) in \(L(x_0)=\{ x \in \mathbb R ^n: f(x) \le f(x_0) \}\) and be continuously differentiable with Lipschitz continuous gradient on an open set containing \(L(x_0)\) (with constant \(\nu > 0\)).

Let \(k_0\) be the index of the first unsuccessful iteration (which must exist from Lemma 1). Given any \(\epsilon \in (0, 1)\), assume that \(\Vert \nabla f(x_{k_0}) \Vert > \epsilon \) and let \(j_1\) be the first iteration after \(k_0\) such that \(\Vert \nabla f(x_{j_1+1}) \Vert \le \epsilon \). Then, to achieve \(\Vert \nabla f(x_{j_1+1})\Vert \le \epsilon \), starting from \(k_0\), Algorithm 1 takes at most \(|S_{j_1}(k_0)|\) successful iterations, where

$$\begin{aligned} |S_{j_1}(k_0)| \; \le \; \left\lceil \left( \frac{ f(x_{k_0}) - f_\mathrm{low} }{ c \, \beta _1^p L_1^p } \right) \epsilon ^{ - \frac{p}{ \min (p-1,1) } } \right\rceil , \end{aligned}$$
(2)

with

$$\begin{aligned} L_1 \; = \; \min \left(1, L_2^{ -\frac{1}{ \min (p-1,1) } } \right) \quad \text{ and} \quad L_2 \; = \; \mathrm{cm}_\mathrm{min}^{-1} ( \nu d_\mathrm{max} / 2 + d_\mathrm{min}^{-1} c). \end{aligned}$$
(3)

 

Proof

Let us assume that \(\Vert \nabla f(x_k) \Vert > \epsilon \), for \(k=k_0,\ldots ,j_1\).

If \(k\) is the index of an unsuccessful iteration, one has from (1) and Assumption 1 that

$$\begin{aligned} \Vert \nabla f(x_k) \Vert \; \le \; \mathrm{cm}_\mathrm{min}^{-1} \left( \frac{\nu }{2} d_\mathrm{max} \alpha _k + d_\mathrm{min}^{-1} c \, \alpha _k^{p-1} \right), \end{aligned}$$

which then implies, when \(\alpha _k < 1\),

$$\begin{aligned} \epsilon \; \le \; L_2 \alpha _k^{\min (p-1,1)}. \end{aligned}$$

If \(\alpha _k \ge 1\), then \(\alpha _k \ge \epsilon \). So, for any unsuccessful iteration, combining the two cases (\(\alpha _k \ge 1\) and \(\alpha _k < 1\)) and considering that \(\epsilon < 1\),

$$\begin{aligned} \alpha _k \; \ge \; L_1 \epsilon ^{ \frac{1}{ \min (p-1,1) } }. \end{aligned}$$

Since at an unsuccessful iteration the step size is multiplied by a factor of at least \(\beta _1\), and it is not reduced at successful iterations, one can backtrack from any successful iteration \(k\) to the previous unsuccessful iteration \(k_1\) (possibly \(k_1=k_0\)) and obtain \(\alpha _k \ge \beta _1 \alpha _{k_1}\). Thus, for any \(k = k_0,\ldots ,j_1\),

$$\begin{aligned} \alpha _k \; \ge \; \beta _1 L_1 \epsilon ^{ \frac{1}{ \min (p-1,1) } }. \end{aligned}$$
(4)

Let now \(k > k_0\) be the index of a successful iteration. From (4) and by the choice of the forcing function,

$$\begin{aligned} f(x_k)-f(x_{k+1}) \; \ge \; c \, \alpha _k^p \; \ge \; c \, \beta _1^p L_1^p \epsilon ^{ \frac{p}{ \min (p-1,1) } }. \end{aligned}$$

We then obtain, summing up over all successful iterations between \(k_0\) and \(j_1\), that

$$\begin{aligned} f(x_{k_0}) - f(x_{j_1+1}) \; \ge \; | S_{j_1}(k_0) | c \, \beta _1^p L_1^p \epsilon ^{ \frac{p}{ \min (p-1,1) } }, \end{aligned}$$

and the proof is completed since \(f(x_{j_1+1}) \ge f_\mathrm{low}\).\(\square \)
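For the common choice \(p=2\) (so that \(\min (p-1,1)=1\)), the bound (2)–(3) specializes to

$$\begin{aligned} |S_{j_1}(k_0)| \; \le \; \left\lceil \left( \frac{ f(x_{k_0}) - f_\mathrm{low} }{ c \, \beta _1^2 L_1^2 } \right) \epsilon ^{-2} \right\rceil , \quad \text{ with} \quad L_1 \; = \; \min \left( 1, L_2^{-1} \right) \quad \text{ and} \quad L_2 \; = \; \mathrm{cm}_\mathrm{min}^{-1} ( \nu d_\mathrm{max} / 2 + d_\mathrm{min}^{-1} c). \end{aligned}$$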

Next, we bound the number of unsuccessful iterations (after the first unsuccessful one).

Theorem 3

Let all the assumptions of Theorem 2 hold.

Let \(k_0\) be the index of the first unsuccessful iteration (which must exist from Lemma 1). Given any \(\epsilon \in (0, 1)\), assume that \(\Vert \nabla f(x_{k_0}) \Vert > \epsilon \) and let \(j_1\) be the first iteration after \(k_0\) such that \(\Vert \nabla f(x_{j_1+1}) \Vert \le \epsilon \). Then, to achieve \(\Vert \nabla f(x_{j_1+1})\Vert \le \epsilon \), starting from \(k_0\), Algorithm 1 takes at most \( |U_{j_1}(k_0)|\) unsuccessful iterations, where

$$\begin{aligned} |U_{j_1}(k_0)| \; \le \; \left\lceil L_3 |S_{j_1}(k_0)| + L_4 + \frac{ \log \left( \beta _1 L_1 \epsilon ^{ \frac{1}{ \min (p-1,1) } } \right) }{ \log (\beta _2) } \right\rceil , \end{aligned}$$

with

$$\begin{aligned} L_3 \; = \; - \frac{ \log (\gamma ) }{ \log (\beta _2) }, \qquad L_4 \; = \; - \frac{ \log (\alpha _{k_0}) }{ \log (\beta _2) }, \end{aligned}$$

and \(L_1\) given by (3).

 

Proof

Since either \(\alpha _{k+1} \le \beta _2 \alpha _k\) or \(\alpha _{k+1} \le \gamma \alpha _k\), we obtain by induction

$$\begin{aligned} \alpha _{j_1} \; \le \; \alpha _{k_0} \gamma ^{|S_{j_1} (k_0)|} \beta _2^{|U_{j_1}(k_0)|}, \end{aligned}$$

which, in turn, implies from \(\log (\beta _2)<0\)

$$\begin{aligned} |U_{j_1}(k_0)| \; \le \; - \frac{ \log (\gamma ) }{ \log (\beta _2) } |S_{j_1}(k_0)| - \frac{ \log (\alpha _{k_0}) }{ \log (\beta _2) } + \frac{ \log (\alpha _{j_1}) }{ \log (\beta _2) }. \end{aligned}$$

Thus, from \(\log (\beta _2)<0\) and the lower bound (4) on \(\alpha _k\), we obtain the desired result.\(\square \)

Using an argument similar to the one applied to bound the number of successful iterations in Theorem 2, one can easily show that the number of iterations required to reach the first unsuccessful one is bounded by

$$\begin{aligned} \left\lceil \frac{ f(x_0) - f_\mathrm{low} }{c \, \alpha _0^p} \right\rceil . \end{aligned}$$

Thus, since \(\epsilon \in (0,1)\), one has \(1 < \epsilon ^{-\frac{p}{ \min (p-1,1) }}\) and this number is, in turn, bounded by

$$\begin{aligned} \left\lceil \frac{ f(x_0) - f_\mathrm{low} }{c \, \alpha _0^p} \epsilon ^{ - \frac{p}{ \min (p-1,1) } } \right\rceil , \end{aligned}$$

which is of the same order in \(\epsilon \) as the number of successful and unsuccessful iterations counted in Theorems 2 and 3, respectively. Combining these two theorems, we can finally state our main result in the following corollary. Note, again, that \(p/\min (p-1,1) = 2\) when \(p=2\).

Corollary 1

Let all the assumptions of Theorem 2 hold.

To reduce the gradient below \(\epsilon \in (0,1)\), Algorithm 1 takes at most

$$\begin{aligned} \mathcal O \left( \epsilon ^{ - \frac{p}{ \min (p-1,1) } } \right) \end{aligned}$$
(5)

iterations. When \(p=2\), this number is \(\mathcal O \left( \epsilon ^{-2} \right)\).

The constant in \(\mathcal O ( \cdot )\) depends only on \(\text{ cm}_\mathrm{min}\), \(d_\mathrm{min}\), \(d_\mathrm{max}\), \(c\), \(p\), \(\beta _1\), \(\beta _2\), \(\gamma \), \(\alpha _0\), on the lower bound \(f_\mathrm{low}\) of \(f\) in \(L(x_0)\), and on the Lipschitz constant \(\nu \) of the gradient of \(f\).

 

Proof

Let \(j_1\) be the first iteration such that \(\Vert \nabla f(x_{j_1+1})\Vert \le \epsilon \).

Let \(k_0\) be the index of the first unsuccessful iteration (which must always exist as discussed in “Direct-search algorithmic framework”).

If \(k_0 < j_1\), then we apply Theorems 2 and 3, to bound the number of iterations from \(k_0\) to \(j_1\), and the argument above this corollary to bound the number of successful iterations until \(k_0-1\).

If \(k_0 \ge j_1\), then all iterations from \(0\) to \(j_1-1\) are successful, and we use the argument above this corollary to bound this number of iterations.\(\square \)

Interestingly, the fact that the best power of \(\epsilon \) is achieved when \(p=2\) seems to corroborate previous numerical experience (Vicente and Custódio 2012), where different forcing functions of the form \(\rho (t)=c \, t^p\) (with \(p>1\), \(p \ne 2\)) were tested but led to worse performance.

It is important now to analyze the dependence on the dimension \(n\) of the constants multiplying the power of \(\epsilon \) in (5). In fact, the lower bound \(\mathrm{cm}_\mathrm{min}\) on the cosine measure depends on \(n\), and the Lipschitz constant \(\nu \) might also depend on \(n\). The possible dependence of the Lipschitz constant \(\nu \) on \(n\) will be ignored: global Lipschitz constants appear in all existing worst case complexity bounds for smooth non-convex optimization, and it is well known that such constants may depend exponentially on the problem dimension \(n\) (see also Jarre 2012).

The dependence of \(\mathrm{cm}_\mathrm{min}\) on \(n\) is more critical and cannot be ignored. In fact, we have that \(\mathrm{cm}(D)=1/\sqrt{n}\) when \(D = [I \; -I]\) is the positive basis used in coordinate search (Kolda et al. 2003), and \(\mathrm{cm}(D)=1/n\) when \(D\) is the positive basis with uniform angles (Conn et al. 2009, Chapter 2) (by a positive basis it is meant a positive spanning set where no proper subset has the same property). Thus, looking at how the cosine measure appears in (2)–(3) and having in mind the existence of the case \(D = [I \; -I]\) (for which \(\mathrm{cm}(D)=1/\sqrt{n}\) and for which the maximum cost of function evaluations per iteration is \(2n\)), one can state the following result.

Corollary 2

Let all the assumptions of Theorem 2 hold. Assume that \(\mathrm{cm}(D_k)\) is a multiple of \(1/\sqrt{n}\) and the number of function evaluations per iteration is at most a multiple of \(n\).

To reduce the gradient below \(\epsilon \in (0,1)\), Algorithm 1 takes at most

$$\begin{aligned} \mathcal O \left( n (\sqrt{n})^{ \frac{p}{ \min (p-1,1) } } \epsilon ^{ - \frac{p}{ \min (p-1,1) } } \right) \end{aligned}$$

function evaluations. When \(p=2\), this number is \(\mathcal O \left( n^2 \epsilon ^{-2} \right)\).

The constant in \(\mathcal O ( \cdot )\) depends only on \(d_\mathrm{min}\), \(d_\mathrm{max}\), \(c\), \(p\), \(\beta _1\), \(\beta _2\), \(\gamma \), \(\alpha _0\), on the lower bound \(f_\mathrm{low}\) of \(f\) in \(L(x_0)\), and on the Lipschitz constant \(\nu \) of the gradient of \(f\).

It was shown by Nesterov (2004, p. 29) that the steepest descent method for unconstrained optimization takes at most \(\mathcal O (\epsilon ^{-2})\) iterations or gradient evaluations to drive the norm of the gradient of the objective function below \(\epsilon \). A similar worst case complexity bound of \(\mathcal O (\epsilon ^{-2})\) has been proved by Gratton, Toint, and co-authors (Gratton et al. 2008a, b) for trust-region methods and by Cartis et al. (2011) for adaptive cubic overestimation methods, when these algorithms are based on a Cauchy decrease condition. The worst case complexity bound on the number of iterations can be reduced to \(\mathcal O (\epsilon ^{-\frac{3}{2}})\) (in the sense that the negative power of \(\epsilon \) increases) for the cubic regularization of Newton's method (see Nesterov and Polyak 2006) and for the adaptive cubic overestimation method (see Cartis et al. 2011).

It has been proved by Cartis et al. (2010) that the worst case bound \(\mathcal O (\epsilon ^{-2})\) for steepest descent is sharp or tight, in the sense that there exists an example, dependent on an arbitrarily small parameter \(\tau > 0\), for which a steepest descent method (with a Goldstein-Armijo line search) requires, for any \(\epsilon \in (0,1)\), at least \(\mathcal O (\epsilon ^{-2+\tau })\) iterations to reduce the norm of the gradient below \(\epsilon \). The example constructed in Cartis et al. (2010) was given for \(n=1\).

It turns out that in the unidimensional case, a direct-search method of the type given in Algorithm 1 (where sufficient decrease is imposed using a forcing function \(\rho (\cdot )\)) can be cast as a steepest descent method with Goldstein-Armijo line search, when the objective function is monotonically decreasing (which happens to be the case in the example in Cartis et al. 2010) and one considers the case \(p=2\). In fact, when \(n=1\), and up to normalization, there is essentially one positive spanning set with \(2\) elements, \(\{ -1, 1 \}\). Thus, unsuccessful steps are nothing else than reductions of step size along the negative gradient direction. Also, since at unsuccessful iterations (see Theorem 1 and Assumption 1) one has \(\Vert g_k \Vert \le L_2 \alpha _k\) where \(L_2\) is the positive constant given in (3) and \(g_k = \nabla f(x_k)\), and since successful iterations do not decrease the step size, one obtains \(\alpha _k \ge L_1 \Vert g_k\Vert \) with \(L_1 = 1 / \max \{L_2,1\} \in (0,1]\). By setting \(\gamma _k = \alpha _k / \Vert g_k \Vert \), one can then see that successful iterations take the form \(x_{k+1} = x_k - \gamma _k g_k\) with \(f(x_{k+1}) \le f(x_k) - c \, \alpha _k^2 \le f(x_k) - c \, L_1 \gamma _k \Vert g_k\Vert ^2\) (note that if \(c\) is chosen in \((0,1)\), we have \(c \, L_1 \in (0,1)\)).

We now make a final comment (choosing the case \(p=2\) for simplicity) on the practicality of the worst case complexity bound derived, given that in derivative-free optimization methods like direct search one does not use the gradient of the objective function. Since, as we have said just above, \(\alpha _k \ge L_1 \Vert \nabla f(x_k) \Vert \) for all \(k\), we could have stated instead a worst case complexity bound on the number of iterations required to drive \(\alpha _k\) below a certain \(\epsilon \in (0,1)\) (something now measurable in a computer run of direct search) and, from that, a bound on the number of iterations required to drive \(\Vert \nabla f(x_k) \Vert \) below \(\epsilon / L_1\). However, we should point out that \(L_1\) depends on the Lipschitz constant \(\nu \) of \(\nabla f\), and in practice such an approach would suffer from the same impracticality.

Final remarks

The study of worst case complexity of direct search (of directional type) brings new insights about the differences and similarities of the various methods and their theoretical limitations. It was possible to establish a worst case complexity bound for those direct-search methods based on the acceptance of new iterates by a sufficient decrease condition (using a forcing function of the step size parameter) and when applied to smooth functions. Deviation from smoothness (see Audet and Dennis 2006; Vicente and Custódio 2012) poses several difficulties to the derivation of a worst case complexity bound, and such a study will be the subject of a separate paper (Garmanjani and Vicente 2012).

It should be pointed out that the results of this paper can be extended to problems with bound and linear constraints, where the number of positive generators of the tangent cones of the nearly active constraints is finite. In this case, it has been shown in Kolda et al. (2003) and Lewis and Torczon (2000) that a result similar to Theorem 1 can be derived, replacing the gradient of the objective function by

$$\begin{aligned} \chi (x_k) \; = \; \displaystyle \max _{\begin{matrix} x_k +w \in \Omega \\ \Vert w\Vert \le 1 \end{matrix}} -\nabla f(x_k)^\top w, \end{aligned}$$

where \(\Omega \) denotes the feasible region defined by the bound or linear constraints. (Note that \(\chi (\cdot )\) is a continuous and non-negative measure of stationarity; see Conn et al. 2000, pp. 449–451.) Once such a result is at hand, and one uses a sufficient decrease condition for accepting new iterates, one can show global convergence similarly as in the unconstrained case (see Lewis and Torczon 2000; Kolda et al. 2003). In terms of worst case complexity, one would also proceed similarly as in the unconstrained case.
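For intuition (a numerical sketch added here; the box constraints, the gradient vector, and the grid discretization are all choices made for this example), \(\chi (x)\) can be approximated in \(\mathbb R ^2\) by scanning feasible directions; for a point in the interior of \(\Omega \) at distance at least one from the bounds, \(\chi (x)\) reduces to \(\Vert \nabla f(x) \Vert \).

```python
import numpy as np

def chi(x, g, lower, upper, num_angles=3600):
    """Approximate chi(x) = max{-g.w : x + w in Omega, ||w|| <= 1} in R^2.

    Omega is the box [lower, upper]^2.  The maximum of a linear function
    over this convex set lies on its boundary, so we scan unit directions
    u and take the largest feasible step r(u) <= 1 along each one.
    """
    best = 0.0                              # w = 0 is always feasible
    for theta in np.linspace(0.0, 2.0 * np.pi, num_angles, endpoint=False):
        u = np.array([np.cos(theta), np.sin(theta)])
        # Largest t >= 0 with x + t*u inside the box, capped at 1.
        t = 1.0
        for i in range(2):
            if u[i] > 0:
                t = min(t, (upper - x[i]) / u[i])
            elif u[i] < 0:
                t = min(t, (lower - x[i]) / u[i])
        best = max(best, t * float(-g @ u))
    return best

x = np.array([0.0, 0.5])                    # interior point of [-2, 2]^2
g = np.array([0.3, -0.4])                   # stands in for grad f(x)
print(chi(x, g, -2.0, 2.0))                 # close to ||g|| = 0.5
```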

Another point we would like to stress is that, once we allow new iterates to be accepted based solely on a simple decrease of the objective function (together, for globalization purposes, with the restriction that the iterates must lie on discrete sets defined by some integral or rational requirements; Kolda et al. 2003; Torczon 1997), the worst case complexity bound on the number of iterations seems provable only under additional strong conditions, like the objective function satisfying an appropriate decrease rate. In fact, one knows for this class of methods that \(\Vert x_k - x_{k+1} \Vert \) is larger than a multiple of the stepsize \(\alpha _k\) (see Audet and Dennis 2002). Thus, if \(f(x_k) - f(x_{k+1}) \ge \theta \Vert x_k - x_{k+1} \Vert ^p\) holds for some positive constant \(\theta \), then we could proceed similarly as in our paper.

Finally, we would like to mention that the result of our paper has recently been the subject of comparison in Cartis et al. (2012). These authors derived the following worst case complexity bound (on the number of function evaluations required to drive the norm of the gradient below \(\epsilon \), for the version of their adaptive cubic overestimation algorithm that uses finite differences to compute derivatives):

$$\begin{aligned} \mathcal O \left( (n^2+5n) \frac{ 1 + |\log (\epsilon )|}{ \epsilon ^{3/2} } \right). \end{aligned}$$

The bound \(\mathcal O (n^2 \epsilon ^{-2})\) for direct search is worse in terms of the power of \(\epsilon \).