1 Introduction

Recent years have seen a growing interest in the analysis of the worst-case evaluation complexity of nonlinear (possibly non-convex) smooth optimization (for the non-convex case only, see [1, 5–9, 11, 14–17, 19, 20, 23, 26, 27, 29, 31, 32, 34, 35, 37, 41, 42, 44, 47, 51, 53–55] among others). In general terms, this analysis aims at giving (sometimes sharp) bounds on the number of evaluations of a minimization problem’s functions (objective and constraints, if relevant) and of their derivatives that are, in the worst case, necessary for certain algorithms to find an approximate critical point of the unconstrained, convexly constrained or general nonlinear optimization problem. It is not uncommon for such algorithms to involve extremely costly internal computations, provided the number of calls to the problem functions is kept as low as possible.

At variance with the convex case (see [3]), most of the research on the non-convex case to date focuses on finding first-, second- and third-order critical points. Evaluation complexity for first-order critical points was first investigated, for the unconstrained case, by Nesterov [46] and, for first- and second-order critical points, by Nesterov and Polyak [47] and by Cartis et al. [16]. Third-order critical points were studied in [1], motivated by highly nonlinear problems in machine learning. However, the analysis of evaluation complexity for orders higher than three is missing both concepts and results.

The purpose of the present paper is to improve on this situation in two ways. The first is to review optimality conditions of arbitrary orders \(q \ge 1\) for convexly constrained minimization problems, and the second is to describe a theoretical algorithm whose behaviour provides, for this class of problems, the first evaluation complexity bounds for such arbitrary orders of optimality.

The paper is organized as follows. After the present introduction, Sect. 2 discusses some preliminary results on tensor norms, a generalized Cauchy–Schwarz inequality and high-order error bounds from Taylor series. Section 3 investigates optimality conditions for convexly constrained optimization, while Sect. 4 proposes a trust-region-based minimization algorithm for solving this class of problems and analyses its evaluation complexity. An example is introduced in Sect. 5 to show that the new evaluation complexity bounds are essentially sharp. A final discussion is presented in Sect. 6.

2 Preliminaries

2.1 Basic Notations

In what follows, \(y^Tx\) denotes the Euclidean inner product of the vectors x and y of \(\mathbb {R}^n\) and \(\Vert x\Vert = (x^Tx)^{1/2}\) is the associated Euclidean norm. If \(T_1\) and \(T_2\) are tensors, \(T_1\otimes T_2\) is their tensor product. \(\mathcal{B}(x,\delta )\) denotes the ball of radius \(\delta \ge 0\) centred at x. If \(\mathcal{X}\) is a closed set, \(\partial \mathcal{X}\) denotes its boundary and \(\mathcal{X}^0\) denotes its interior. The vectors \(\{e_i\}_{i=1}^n\) are the coordinate vectors in \(\mathbb {R}^n\). The notation \(\lambda _{\min }[M]\) stands for the leftmost eigenvalue of the symmetric matrix M. If \(\{a_k\}\) and \(\{b_k\}\) are two infinite sequences of non-negative scalars converging to zero, we say that \(a_k= o(b_k)\) if and only if \(\lim _{k \rightarrow \infty } a_k/b_k = 0\) and, more generally, \(a(\alpha ) = o(\alpha )\) if and only if \(\lim _{\alpha \rightarrow 0} a(\alpha )/\alpha = 0\). The normal cone to a general convex set \(\mathcal{C}\) at \(x\in \mathcal{C}\) is defined by

$$\begin{aligned} \mathcal{N}_\mathcal{C}(x) \mathop {=}\limits ^\mathrm{def}\{ s \in \mathbb {R}^n \mid s^T(z-x) \le 0\, \text{ for } \text{ all }\, z \in \mathcal{C}\} \end{aligned}$$

and its polar, the tangent cone to \(\mathcal{C}\) at x, by

$$\begin{aligned} \mathcal{T}_\mathcal{C}(x) = \mathcal{N}_\mathcal{C}^*(x) \mathop {=}\limits ^\mathrm{def}\{ s \in \mathbb {R}^n \mid s^Tv \le 0\, \text{ for } \text{ all }\, v \in \mathcal{N}_\mathcal{C}(x)\}. \end{aligned}$$

Note that \(\mathcal{C}\subseteq x+\mathcal{T}_\mathcal{C}(x)\) for all \(x \in \mathcal{C}\). We also define \(P_{\mathcal{C}}[\cdot ]\) to be the orthogonal projection onto \(\mathcal{C}\). (See [25, Section 3.5] for a brief introduction to the relevant properties of convex sets and cones, or [39, Chapter 3] or [50, Part I] for an in-depth treatment.)
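Since the normal and tangent cones are used repeatedly below, the following minimal Python sketch (ours, not part of the source) illustrates the defining inequality of \(\mathcal{N}_\mathcal{C}(x)\) by sampling, for the unit box \(\mathcal{C}=[0,1]^2\) at the corner \(x=0\), where \(\mathcal{N}_\mathcal{C}(x)\) is the non-positive orthant. Sampling only provides a necessary check, of course.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.zeros(2)
Z = rng.random((1000, 2))        # sample points z in C = [0,1]^2

def maybe_in_normal_cone(s, x, Z, tol=1e-12):
    """Necessary check of s in N_C(x): s^T (z - x) <= 0 on the sampled z."""
    return bool(np.all((Z - x) @ s <= tol))

print(maybe_in_normal_cone(np.array([-1.0, -0.5]), x, Z))  # True: in R_- x R_-
print(maybe_in_normal_cone(np.array([-1.0, 0.5]), x, Z))   # False: ascends along e_2
```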

2.2 Tensor Norms and Generalized Cauchy–Schwarz Inequality

We will make substantial use of tensors and their norms in what follows and thus start by establishing some concepts and notation. If the notation \(T[v_1,\ldots ,v_j]\) stands for the tensor of order \(q-j\) resulting from the application of the qth-order tensor T to the vectors \(v_1,\ldots ,v_j\), the (recursively inducedFootnote 1) Euclidean norm \(\Vert \cdot \Vert _q\) on the space of qth-order tensors is given by

$$\begin{aligned} \Vert T\Vert _q \mathop {=}\limits ^\mathrm{def}\max _{\Vert v_1\Vert =\cdots =\Vert v_q\Vert =1} T[v_1,\ldots ,v_q]. \end{aligned}$$
(2.1)

(Observe that this value is always non-negative since we can flip the sign of \(T[v_1,\ldots ,v_q]\) by flipping that of one of the vectors \(v_i\).)

Note that definition (2.1) implies that

$$\begin{aligned} \Vert T[v_1,\ldots ,v_j]\Vert _{q-j}&= \max _{\Vert w_1\Vert =\cdots =\Vert w_{q-j}\Vert =1} T[v_1,\ldots ,v_j][w_1,\ldots ,w_{q-j}] \\&= \left( \max _{\Vert w_1\Vert =\cdots =\Vert w_{q-j}\Vert =1} T\left[ \frac{v_1}{\Vert v_1\Vert },\ldots ,\frac{v_j}{\Vert v_j\Vert },w_1,\ldots ,w_{q-j}\right] \right) \left( \prod _{i=1}^j\Vert v_i\Vert \right) \\&\le \left( \max _{\Vert w_1\Vert =\cdots =\Vert w_{q}\Vert =1} T[w_1,\ldots ,w_q] \right) \left( \prod _{i=1}^j\Vert v_i\Vert \right) \\&= \Vert T\Vert _q \,\prod _{i=1}^j\Vert v_i\Vert , \end{aligned}$$
(2.2)

a simple generalization of the standard Cauchy–Schwarz inequality for order-1 tensors (vectors) and of \(\Vert Mv\Vert \le \Vert M\Vert \,\Vert v\Vert \) which is valid for induced norms of matrices (order-2 tensors). Observe also that perturbation theory (see [40, Th. 7]) implies that \(\Vert T\Vert _q\) is continuous as a function of T.
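To make the generalized Cauchy–Schwarz inequality (2.2) concrete, here is a small numerical sketch (ours; the rank-one tensor is chosen only because its norm is then known in closed form): for \(T = u\otimes u\otimes u\) one has \(\Vert T\Vert _3 = \Vert u\Vert ^3\), so (2.2) with \(j=2\) can be checked directly.

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.standard_normal(4)
T = np.einsum('i,j,k->ijk', u, u, u)         # rank-one symmetric order-3 tensor

v1, v2 = rng.standard_normal(4), rng.standard_normal(4)
Tv1v2 = np.einsum('ijk,i,j->k', T, v1, v2)   # the order-1 tensor T[v1, v2]

lhs = np.linalg.norm(Tv1v2)                  # ||T[v1, v2]||
rhs = np.linalg.norm(u)**3 * np.linalg.norm(v1) * np.linalg.norm(v2)
print(lhs <= rhs + 1e-12)                    # True, as predicted by (2.2)
```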

If T is a symmetric tensor of order q, define the q-kernel of the multilinear q-form

$$\begin{aligned} T[v]^q \mathop {=}\limits ^\mathrm{def}T[\underbrace{v, \ldots ,v}_{q\, \text{ times }}] \end{aligned}$$

as

$$\begin{aligned} \ker ^q[T] \mathop {=}\limits ^\mathrm{def}\{v \in \mathbb {R}^n \mid T[v]^q = 0\} \end{aligned}$$

(see [12, 13]). Note that, in general, \(\ker ^q[T]\) is a union of cones; for \(q=1\), it is in fact a subspace. However, this is not true for general q-kernels: both \((0,1)^T\) and \((1,0)^T\) belong to the 2-kernel of the symmetric 2-form \(x_1x_2\) on \(\mathbb {R}^2\), but their sum does not.
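This 2-kernel example can be checked in a few lines (a sketch of ours): the form \(x_1x_2\) has symmetric matrix T with \(T[v]^2 = v^TTv\).

```python
import numpy as np

T = np.array([[0.0, 0.5], [0.5, 0.0]])   # symmetric matrix of the 2-form x1*x2
for v in ([0.0, 1.0], [1.0, 0.0], [1.0, 1.0]):
    v = np.array(v)
    print(v, v @ T @ v)   # 0.0, 0.0, 1.0: e_2 and e_1 are in ker^2[T], their sum is not
```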

We also note that, for symmetric tensors of odd order, \(T[v]^q = -T[-v]^q\) and thus that

$$\begin{aligned} -\min _{\Vert d\Vert \le 1}T[d]^q = -\min _{\Vert d\Vert \le 1}\left( -T[-d]^q\right) = -\min _{\Vert d\Vert \le 1}\left( -T[d]^q\right) = \max _{\Vert d\Vert \le 1}T[d]^q,\nonumber \\ \end{aligned}$$
(2.3)

where we used the symmetry of the unit ball with respect to the origin to deduce the second equality.

2.3 High-Order Error Bounds from Taylor Series

The tensors considered in what follows are symmetric and arise as high-order derivatives of the objective function f. For the pth derivative of a function \(f:\mathbb {R}^n \rightarrow \mathbb {R}\) to be Lipschitz continuous on the set \(\mathcal{S}\subseteq \mathbb {R}^n\), we require that there exists a constant \(L_{f,p}\ge 0\) such that, for all \(x,y \in \mathcal{S}\),

$$\begin{aligned} \Vert \nabla _x^p f(x) - \nabla _x^p f(y) \Vert _p \le L_{f,p} \Vert x - y \Vert , \end{aligned}$$
(2.4)

where \(\nabla _x^p h(x)\) is the pth-order symmetric derivative tensor of h at x.

Let \(T_{f,p}(x,s)\) denoteFootnote 2 the pth-order Taylor series approximation to \(f(x+s)\) at some \(x \in \mathbb {R}^n\) given by

$$\begin{aligned} T_{f,p}(x,s) \mathop {=}\limits ^\mathrm{def}f(x) + \sum _{j=1}^p \frac{1}{j!}\nabla _x^jf(x)[s]^j \end{aligned}$$
(2.5)

and consider the Taylor identity

$$\begin{aligned} \phi (1) - t_k(1) = \frac{1}{(k-1)!}\int _0^1 (1 - \xi )^{k-1} [\phi ^{(k)}(\xi ) - \phi ^{(k)}(0)] \, d\xi \end{aligned}$$
(2.6)

involving a given univariate \(C^k\) function \(\phi (\alpha )\) and its kth-order Taylor approximation \(t_k(\alpha ) = \sum _{i=0}^k \phi ^{(i)}(0) \alpha ^i / i!\) expressed in terms of the derivatives \(\phi ^{(i)}\), \(i=1,\ldots ,k\). Let \(x,s \in \mathbb {R}^n\). Then, picking \(\phi (\alpha ) = f(x+\alpha s)\) and \(k = p\), it follows immediately from the fact that \(t_p(1) = T_{f,p}(x,s)\), the identity

$$\begin{aligned} \displaystyle \int _0^1 (1-\xi )^{p-1} \, d\xi = \frac{1}{p}, \end{aligned}$$
(2.7)

together with (2.2), (2.4), (2.5) and (2.6), that, for all \(x,s \in \mathbb {R}^n\),

$$\begin{aligned} f(x+s)&\le T_{f,p}(x,s) + \frac{1}{(p-1)!}\int _0^1 (1-\xi )^{p-1} \big | \nabla _x^p f(x+\xi s)[s]^p - \nabla _x^p f(x)[s]^p \big | \,d\xi \\&\le T_{f,p}(x,s) + \left[ \int _0^1\frac{(1-\xi )^{p-1}}{(p-1)!}\, d\xi \right] \max _{\xi \in [0,1]}\big | \nabla _x^p f(x+\xi s)[s]^p - \nabla _x^p f(x)[s]^p \big | \\&\le T_{f,p}(x,s) + \frac{1}{p!}\, \Vert s\Vert ^p \, \max _{\xi \in [0,1]}\Vert \nabla _x^p f(x+\xi s) - \nabla _x^p f(x) \Vert _p \\&\le T_{f,p}(x,s) + \frac{L_{f,p}}{p!} \, \Vert s\Vert ^{p+1}. \end{aligned}$$
(2.8)

Similarly,

$$\begin{aligned} f(x+s)&\ge T_{f,p}(x,s) - \frac{1}{p!}\, \Vert s\Vert ^p \, \max _{\xi \in [0,1]}\Vert \nabla _x^p f(x+\xi s) - \nabla _x^p f(x) \Vert _p \\&\ge T_{f,p}(x,s) - \frac{L_{f,p}}{p!} \, \Vert s\Vert ^{p+1}. \end{aligned}$$
(2.9)

Inequalities (2.8) and (2.9) will be useful in our developments below, but immediately note that they in fact depend only on the weaker requirement that

$$\begin{aligned} \displaystyle \max _{\xi \in [0,1]} \Vert \nabla _x^p f(x+\xi s) - \nabla _x^p f(x) \Vert _p \le L_{f,p} \Vert s\Vert , \end{aligned}$$
(2.10)

for all x and s of interest, rather than relying on (2.4).
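The error bounds (2.8) and (2.9) are easy to verify numerically. The following sketch (ours, on an example of our choosing) uses \(f=\cos \) with \(p=2\), for which \(L_{f,2}=1\) is a valid Lipschitz constant of the second derivative, and checks \(|f(x+s)-T_{f,2}(x,s)| \le L_{f,2}\,|s|^3/2!\) on a grid.

```python
import numpy as np

def taylor2(x, s):
    # T_{f,2}(x,s) of (2.5) for f = cos
    return np.cos(x) - np.sin(x) * s - 0.5 * np.cos(x) * s**2

L = 1.0                                   # |f'''| = |sin| <= 1, so L_{f,2} = 1
x = np.linspace(-3.0, 3.0, 61)[:, None]
s = np.linspace(-1.0, 1.0, 41)[None, :]
err = np.abs(np.cos(x + s) - taylor2(x, s))
print(bool(np.all(err <= L * np.abs(s)**3 / 2 + 1e-14)))   # True
```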

3 Unconstrained and Convexly Constrained Problems

The problem we wish to solve is formally described as

$$\begin{aligned} \min _{x \in \mathcal{F}} f(x), \end{aligned}$$
(3.1)

where we assume that \(f:\mathbb {R}^n\longrightarrow \mathbb {R}\) is q-times continuously differentiable and bounded from below, and that f has Lipschitz continuous derivatives of orders 1 to q. We also assume that the feasible set \(\mathcal{F}\) is closed, convex and non-empty. Note that this formulation covers unconstrained optimization (\(\mathcal{F}= \mathbb {R}^n\)), as well as standard inequality (and linear equality) constrained optimization in its different forms: the set \(\mathcal{F}\) may be defined by simple bounds, and/or by polyhedral or more general convex constraints. We are tacitly assuming here that the cost of evaluating values and derivatives of the constraint functions possibly involved in the definition of \(\mathcal{F}\) is negligible.

3.1 High-Order Optimality Conditions

Given that our ambition is to work with high-order models, it seems natural to aim at finding high-order local minimizers. As is standard, we say that \(x_*\) is a local minimizer of f if and only if there exists a (sufficiently small) neighbourhood \(\mathcal{B}_*\) of \(x_*\) such that

$$\begin{aligned} f(x) \ge f(x_*) \text{ for } \text{ all } x \in \mathcal{B}_*\cap \mathcal{F}. \end{aligned}$$
(3.2)

However, we must immediately remember important intrinsic limitations. These are exemplified by the smooth two-dimensional problem

$$\begin{aligned} \min _{x \in \mathbb {R}^2} f(x)&= \left\{ \begin{array}{ll} x_2\left( x_2- e^{-1/x_1^2}\right) &{}\quad \text{ if }\, x_1 \ne 0,\\ x_2^2 &{}\quad \text{ if }\, x_1 = 0, \end{array}\right. \end{aligned}$$
(3.3)

which is a simplified version of a problem stated by Hancock nearly a century ago [38, p. 36], itself a variation of a famous problem stated even earlier by Peano [49, Nos. 133–136]. The contour lines of its objective function are shown in Fig. 1.

Fig. 1 Contour lines of the objective function in (3.3)

The first conclusion which can be drawn by examining this example is that, in general, assessing that a given point x (the origin in this case) is a local minimizer needs more than verifying that every direction from this point is an ascent direction. Indeed, this latter property holds in the example, but the origin is not a local minimizer (it is a saddle point). This is caused by the fact that objective function decrease may occur along specific arcs starting from the point under consideration, and these arcs need not be lines (such as \(x(\alpha ) = (\alpha , \frac{1}{2}e^{-1/\alpha ^2})\) for \(\alpha > 0\) in the example).

The second conclusion is that the characterization of a local minimizer cannot always be translated into a set of conditions involving only the Taylor expansion of f at \(x_*\). In our example, the difficulty arises because all coefficients of the Taylor expansion of \(e^{-1/x_1^2}\) vanish at the origin and, therefore, the (non-)minimizing nature of this point cannot be determined from the values of these coefficients. Thus, the gap between necessary and sufficient optimality conditions cannot be closed if one restricts one’s attention to using derivatives of the objective function at a putative solution of problem (3.1).

Note that worse situations may also occur, for instance if we consider the following variation on Hancock’s simplified example (3.3):

$$\begin{aligned} \min _{x \in \mathbb {R}^2} f(x) = \left\{ \begin{array}{ll} x_2\left( x_2- \sin (1/x_1) e^{-1/x_1^2}\right) &{}\quad \text{ if }\, x_1 \ne 0,\\ x_2^2 &{}\quad \text{ if }\, x_1 = 0, \end{array}\right. \end{aligned}$$
(3.4)

for which no continuous descent arc exists in a neighbourhood of the origin despite the origin not being a local minimizer.

3.1.1 Necessary Conditions for Convexly Constrained Problems

The above examples show that fully characterizing a local minimizer in terms of general continuous descent arcs is in general impossible. However, the fact that no such arc exists remains a necessary condition for such points, even if Hancock’s example shows that these arcs may not be amenable to a characterization using arc derivatives. In what follows, we therefore propose derivative-based necessary optimality conditions by focussing on a specific (yet reasonably general) class of descent arcs \(x(\alpha )\) of the form

$$\begin{aligned} x(\alpha ) = x_* + \sum _{i=1}^q\alpha ^i s_i + o(\alpha ^q), \end{aligned}$$
(3.5)

where \(\alpha > 0\). Such an arc-based approach was used by several authors for first- and second-order conditions (see [4, 10, 24, 33] for example). Note that, if \(s_{i_0}\) is the first nonzero \(s_i\) in the sum in the right-hand side of (3.5) (if any), we may redefine \(\alpha \) to be \(\alpha \Vert s_{i_0}\Vert ^{1/i_0}\) without modifying the arc, so that we may assume, without loss of generality, that \(\Vert s_{i_0}\Vert = 1\) whenever \((s_1, \ldots , s_q)\ne (0, \ldots , 0)\).

Define the qth-order descriptor set of \(\mathcal{F}\) at x by

$$\begin{aligned} \mathcal{D}_\mathcal{F}^q(x) \mathop {=}\limits ^\mathrm{def}\left\{ (s_1, \ldots , s_q ) \in \mathbb {R}^{n \times q} \mid x + \sum _{i=1}^q\alpha ^i s_i + o(\alpha ^q) \in \mathcal{F}\, \text{ for } \text{ all } \text{ sufficiently } \text{ small }\, \alpha \ge 0 \right\} . \end{aligned}$$
(3.6)

Note that \(\mathcal{D}_\mathcal{F}^q(x)\) is closed and always contains \((0, \ldots ,0)\), and that \(\mathcal{D}_\mathcal{F}^1(x) = \mathcal{T}_\mathcal{F}(x)\), the standard tangent cone to \(\mathcal{F}\) at x. Moreover, \(\mathcal{D}_\mathcal{F}^2(x) = \mathcal{T}_\mathcal{F}(x) \times 2 \mathcal{T}^2_\mathcal{F}(x)\), where \(\mathcal{T}^2_\mathcal{F}(x)\) is the inner second-order tangent set to \(\mathcal{F}\) at x, as defined in [10].Footnote 3 For example, if \(\mathcal{F}= \{ (x_1,x_2) \in \mathbb {R}^2 \mid x_2 \ge |x_1|^3 \}\), then one verifies that \(\mathcal{D}_\mathcal{F}^3(0) = \left[ \mathbb {R}\times \mathbb {R}_+\right] ^2 \times \left[ \mathbb {R}\times [1, \infty ) \right] \). We say that a feasible arc \(x(\alpha )\) is tangent to \(\mathcal{D}_\mathcal{F}^q(x)\) if (3.5) holds for some \((s_1,\ldots ,s_q) \in \mathcal{D}_\mathcal{F}^q(x)\).

Note that definition (3.6) implies that

$$\begin{aligned} s_i \in \mathcal{T}_\mathcal{F}(x_*), \text{ for } i \in \{1, \ldots , u \}, \end{aligned}$$
(3.7)

where \(s_u\) is the first nonzero \(s_\ell \).

We now consider some conditions that preclude the existence of feasible descent arcs of the form (3.5). These conditions involve the index sets \(\mathcal{P}(j,k)\) defined, for \(k \le j\), by

$$\begin{aligned} \mathcal{P}(j,k) \mathop {=}\limits ^\mathrm{def}\{ (\ell _1, \ldots ,\ell _k) \in \{1, \ldots , j \}^k \mid \sum _{i=1}^k \ell _i = j \}. \end{aligned}$$
(3.8)

For \(k \le j \le 4\), these are given in Table 1.

Table 1 The sets \(\mathcal{P}(j,k)\) for \(k\le j \le 4\)
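Since the sets \(\mathcal{P}(j,k)\) are purely combinatorial, they are easily generated; the following short utility (ours) reproduces Table 1.

```python
from itertools import product

def P(j, k):
    """All ordered k-tuples of integers in {1,...,j} summing to j, as in (3.8)."""
    return [t for t in product(range(1, j + 1), repeat=k) if sum(t) == j]

print(P(4, 3))   # [(1, 1, 2), (1, 2, 1), (2, 1, 1)]
print(P(2, 2))   # [(1, 1)]
```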

We now state necessary conditions for \(x_*\) to be a local minimizer.

[Theorem 3.1: statement rendered as an image in the source; it gives the qth-order necessary conditions (3.9), holding for all arc coefficients satisfying (3.10), for \(x_*\) to be a local minimizer of (3.1).]

Proof

Consider an arbitrary feasible arc of the form (3.5). Substituting this relation in the expression \(f(x(\alpha )) \ge f(x_*)\) (given by (3.2)) and collecting terms of equal degree in \(\alpha \), we obtain that, for sufficiently small \(\alpha \),

$$\begin{aligned} 0\le f(x(\alpha )) - f(x_*) = \sum _{j=1}^q c_j \alpha ^j + o(\alpha ^q), \end{aligned}$$
(3.11)

where

$$\begin{aligned} c_j \mathop {=}\limits ^\mathrm{def}\sum _{k=1}^j \frac{1}{k!} \Bigg (\sum _{(\ell _1, \ldots ,\ell _k) \in \mathcal{P}(j,k)} \nabla _x^k f(x_*)[s_{\ell _1}, \ldots , s_{\ell _k}]\Bigg ) \;\;\;\;\;\;\;\;(j = 1, \ldots , q)\nonumber \\ \end{aligned}$$
(3.12)

with \(\mathcal{P}(j,k)\) defined in (3.8). For this to be true, we need each coefficient \(c_j\) to be non-negative on the zero set of the coefficients \(c_1, \ldots , c_{j-1}\), subject to the requirement that the arc (3.5) must be feasible for \(\alpha \) sufficiently small, that is \((s_1, \ldots , s_j) \in \mathcal{D}_\mathcal{F}^j(x_*)\). First consider the case where \(j=1\) (in which case (3.10) is void). The fact that the coefficient of \(\alpha \) in (3.11) must be non-negative implies that \(\nabla _x^1f(x_*)[s_1] \ge 0\) for all \(s_1 \in \mathcal{T}_\mathcal{F}(x_*)\), which proves (3.9) for \(j=1\). Assume now that \(s_1 \in \mathcal{T}_\mathcal{F}(x_*)\) and that (3.10) holds for \(i=1\). This latter condition requires \(s_1\) to be in the zero set of the coefficient of \(\alpha \) in (3.11), that is

$$\begin{aligned} s_1 \in \mathcal{T}_\mathcal{F}(x_*) \cap \ker ^1[\nabla _x^1f(x_*)]. \end{aligned}$$

Then the coefficient of \(\alpha ^2\) in (3.11) must be non-negative, which yields, using \(\mathcal{P}(2,1) = \{(2)\}\) and \(\mathcal{P}(2,2)= \{(1,1)\}\) (see Table 1), that

$$\begin{aligned} \nabla _x^1f(x_*)[s_2] + \frac{1}{2}\nabla _x^2f(x_*)[s_1]^2 \ge 0, \end{aligned}$$
(3.13)

which is (3.9) for \(j=2\).

We may then proceed in the same manner for all coefficients from order \(j=3\) up to q, each time considering them on the zero set of the previous coefficients (that is (3.10)), and verify that (3.11) directly implies (3.9). \(\square \)
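The structure of the coefficients (3.12) can be cross-checked symbolically. In the following sketch (ours), f and the arc are arbitrary illustrative choices; the coefficients of the \(\alpha \)-expansion of \(f(x(\alpha ))-f(x_*)\) coincide with the values predicted by (3.12) (here \(c_1 = \nabla _x^1f(0)[s_1] = 1\), \(c_2 = \nabla _x^1f(0)[s_2]+\frac{1}{2}\nabla _x^2f(0)[s_1]^2 = 0\) and \(c_3 = \frac{1}{6}\nabla _x^3f(0)[s_1]^3 = -1/6\), the remaining terms of (3.12) vanishing).

```python
import sympy as sp

a, x1, x2 = sp.symbols('alpha x1 x2')
f = sp.sin(x1) + x2**2                       # sample objective with x_* = 0
s1, s2 = sp.Matrix([1, 0]), sp.Matrix([0, 1])
arc = a * s1 + a**2 * s2                     # arc (3.5) with s_3 = s_4 = 0

expansion = sp.series(f.subs({x1: arc[0], x2: arc[1]}), a, 0, 5).removeO()
print(sp.Poly(expansion, a).all_coeffs()[::-1])   # [0, 1, 0, -1/6, 1] = c_0,...,c_4
```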

Following a long tradition, we say that \(x_*\) is a qth-order critical point for problem (3.1) if the conclusions of this theorem hold for \(j \in \{1, \ldots , q \}\). Of course, a qth-order critical point need not be a local minimizer, but every local minimizer is a qth-order critical point. This theorem states conditions for qth-order criticality for smooth problems which are only necessary because not every feasible arc needs to be tangent to \(\mathcal{D}_\mathcal{F}^q(x_*)\), depending on the geometry of the feasible set in the neighbourhood of \(x_*\).

Note that, as the order j grows, (3.9) may be interpreted as imposing a condition on \(s_j\) (via \(\nabla _x^1f(x_*)[s_j]\)), given the directions \(\{s_i\}_{i=1}^{j-1}\) satisfying (3.10).

In more general situations, the fact that conditions (3.9) and (3.10) not only depend on the behaviour of the objective function in some well-chosen subspace, but also involve the geometry of all possible feasible arcs makes even the second-order condition (3.13) difficult to use, particularly in the case where \(\mathcal{F}\subset \mathbb {R}^n\). In what follows, we discuss, as far as we currently can, two resulting questions of interest.

  1. 1.

    Are there cases where these conditions reduce to checking homogeneous polynomials involving the objective function’s derivatives on a subspace?

  2. 2.

    If that is not the case, are there circumstances in which not only the complete left-hand side of (3.10) vanishes, but also each term of this left-hand side?

We start by deriving useful consequences of Theorem 3.1.

[Statement rendered as an image in the source: a consequence of Theorem 3.1 giving the first-order conditions (3.14) (namely \(-\nabla _x^1f(x_*) \in \mathcal{N}_\mathcal{F}(x_*)\)) and (3.15) (namely \(s_i \in \mathcal{T}_\mathcal{F}(x_*)\) for \(i=1,2\)).]

Proof

The fact that (3.9) for \(j=1\) reduces to \(\nabla _x^1f(x_*)[s_1] \ge 0\) for all \(s_1 \in \mathcal{T}_\mathcal{F}(x_*)\) implies that (3.14) holds. Also note that (3.9) and (3.10) impose that

$$\begin{aligned} s_1 \in \mathcal{T}_\mathcal{F}(x_*) \cap \ker ^1[\nabla _x^1f(x_*)] = \mathcal{T}_\mathcal{F}(x_*) \cap \text {span}\left\{ \nabla _x^1f(x_*) \right\} ^\perp , \end{aligned}$$
(3.16)

which, because of (3.14) and the polarity of \(\mathcal{N}_\mathcal{F}(x_*)\) and \(\mathcal{T}_\mathcal{F}(x_*)\), yields that \(s_1\) belongs to \(\partial \mathcal{T}_\mathcal{F}(x_*)\). Assume now that \(s_2 \not \in \mathcal{T}_\mathcal{F}(x_*)\). Then, for all \(\alpha \) sufficiently small, \(\alpha s_1 + \alpha ^2 s_2\) does not belong to \(\mathcal{T}_\mathcal{F}(x_*)\), and thus \(x(\alpha )=x_* + \alpha s_1 + \alpha ^2 s_2 + o(\alpha ^2)\) cannot belong to \(\mathcal{F}\), which is a contradiction. Hence, \(s_2 \in \mathcal{T}_\mathcal{F}(x_*)\) and (3.15) follows for \(i=2\), while, for \(i=1\), it follows from \(s_1 \in \mathcal{T}_\mathcal{F}(x_{*})\) and (3.14). \(\square \)

The first-order necessary condition (3.14) is well known for general first-order minimizers (see [48, Th. 12.9, p. 353] for instance).

Consider now the second-order conditions (3.13). If \(\mathcal{F}= \mathbb {R}^n\) (or if the convex constraints are inactive at \(x_*\)), then \(\nabla _x^1f(x_*) = 0\) because of (3.14), and (3.13) is nothing but the familiar condition that the Hessian of the objective function must be positive semi-definite. If \(x_*\) happens to lie on the boundary of \(\mathcal{F}\) and \(\nabla _x^1f(x_*) \ne 0\), (3.13) indicates that the effect of the curvature of the boundary of \(\mathcal{F}\) may be represented by the term \(\nabla _x^1f(x_*)[s_2]\), which is non-negative because of (3.15). Consider, for example, the problem

whose global solution is at the origin. In this case, it is easy to check that \(-\nabla _x^1f(0)= -e_1 \in \mathcal{N}_\mathcal{F}(0) = \text {span}\left\{ -e_1 \right\} \), that \(\nabla _x^2 f(0)=0\), and that second-order feasible arcs of the form (3.5) with \(x(0)=0\) may be chosen with \(s_1 = \pm e_2\) and \(s_2= \beta e_1\) where . This imposes \(\nabla _x^2f(0)[s_1]^2 \ge -1\), which (unsurprisingly) holds.

Interestingly, there are cases where the geometry of the set of locally feasible arcs is simple and manageable. In particular, suppose that the boundary of \(\mathcal{F}\) is locally polyhedral. Then, given \(\nabla _x^1f(x_*)\), either \(\mathcal{T}_\mathcal{F}(x_*) \cap \text {span}\left\{ \nabla _x^1f(x_*) \right\} ^\perp = \{0\}\), in which case conditions (3.9) and (3.10) are void, or there exists \(d \ne 0 \) in that intersection. It is then possible to define a locally feasible arc with \(s_1= d\) and \(s_2 = \cdots = s_q= 0\). As a consequence, the smallest possible value of \(\nabla _x^1f(x_*)[s_2]\) for feasible arcs starting from \(x_*\) is identically zero and this term therefore vanishes from (3.9) and (3.10). Moreover, because of the definition of \(\mathcal{P}(j,k)\) (see Table 1), all terms but that in \(\nabla _x^jf(x_*)[s_1]^j\) also vanish from these conditions, which then simplify to

$$\begin{aligned} \nabla _x^j f(x_*)[s_1]^j \ge 0\, \text{ for } \text{ all }\, s_1 \in \mathcal{T}_\mathcal{F}(x_*) \cap \left( \bigcap _{i=1}^{j-1}\ker ^i[\nabla _x^if(x_*)]\right) \end{aligned}$$
(3.17)

for \(j=2, \ldots , q\), which is a condition only involving subspaces and (for \(i \ge 2\)) cones. Analysis for first- and second orders in the polyhedral case can be found in [2, 30, 52] for instance. Further discussion of second-order (both necessary and sufficient) conditions for the more general problem can be found in [10] and the references therein.

3.1.2 Necessary Conditions for Unconstrained Problems

Consider now the case where \(x_*\) belongs to \(\mathcal{F}^0\), which is obviously the case if the problem is unconstrained. Then we have that \(\mathcal{D}_\mathcal{F}^q(x_*) = \mathbb {R}^{n \times q}\), and one is then free to choose the vectors \(\{s_i\}_{i=1}^q\) (and their sign) arbitrarily. Note first that, since \(\mathcal{N}_\mathcal{F}(x_*)= \{0\}\), (3.14) implies that, unsurprisingly,

$$\begin{aligned} \nabla _x^1f(x_*) = 0. \end{aligned}$$

For the second-order condition, we obtain from (3.9), again unsurprisingly, that, because \(\ker ^1[\nabla _x^1f(x_*)] = \mathbb {R}^n\),

$$\begin{aligned} \nabla _x^2f(x_*)\,\text{ is } \text{ positive } \text{ semi-definite } \text{ on }\,\mathbb {R}^n. \end{aligned}$$

Hence, if there exists a vector \(s_1 \in \ker ^2[\nabla _x^2f(x_*)]\), we have that \(\nabla _x^2f(x_*)[s_1] = 0\) (since the Hessian is positive semi-definite) and therefore that \(\nabla _x^2f(x_*)[s_1,s_2] = 0\) for all \(s_2 \in \mathbb {R}^n\). Thus, the term for \(k=1\) vanishes from (3.9), as well as all terms involving \(\nabla _x^2f(x_*)\) applied to a vector \(s_1 \in \ker ^2[\nabla _x^2f(x_*)]\). This implies in particular that the third-order condition may now be written as

$$\begin{aligned} \nabla _x^3f(x_*)[s_1]^3 = 0\, \text{ for } \text{ all }\, s_1 \in \ker ^2[\nabla _x^2f(x_*)], \end{aligned}$$
(3.18)

where the equality is obtained by considering both \(s_1\) and \(-s_1\).

Unfortunately, complications arise with fourth-order conditions, even when the objective function is a polynomial. Consider the following variant of Peano’s [49] problem:

$$\begin{aligned} \min _{x \in \mathbb {R}^2} f(x) = x_2^2 - \kappa _1 x_1^2x_2 + \kappa _2 x_1^4, \end{aligned}$$
(3.19)

where \(\kappa _1\) and \(\kappa _2\) are parameters. Then one can verify that

$$\begin{aligned} \nabla _x^1f(0) = 0,\qquad \nabla _x^2f(0) = \left( \begin{array}{cc} 0 &{} 0 \\ 0 &{} 2 \end{array}\right) , \end{aligned}$$

$$\begin{aligned} {[}\nabla _x^3f(0)]_{ijk} = \left\{ \begin{array}{ll} -2\kappa _1&{}\quad \text{ for }\, (i,j,k) \in \{ (1,1,2), (1,2,1), (2,1,1) \} = \mathcal{P}(4,3),\\ 0 &{}\quad \text{ otherwise, } \end{array}\right. \end{aligned}$$

and

$$\begin{aligned} {[}\nabla _x^4f(0)]_{ijk\ell } = \left\{ \begin{array}{ll} 24 \kappa _2 &{}\quad \text{ for } (i,j,k, \ell ) = (1,1,1,1) \\ 0 &{}\quad \text{ otherwise. } \end{array}\right. \end{aligned}$$

Hence,

$$\begin{aligned} \ker ^1[\nabla _x^1f(0)] = \mathbb {R}^2, \quad \ker ^2[\nabla _x^2f(0)] = \text {span}\left\{ e_1 \right\} \quad \text{ and }\quad \ker ^3[\nabla _x^3f(0)] = \text {span}\left\{ e_1 \right\} \cup \text {span}\left\{ e_2 \right\} . \end{aligned}$$

The necessary condition (3.9) then states that, if the origin is a minimizer, then, using the arc defined by \(s_1=e_1\) and \(s_2=\frac{1}{2}\kappa _1 e_2\), and the fact that \(\mathcal{P}(4,3)\) contains three elements,

$$\begin{aligned} 0 \le \frac{1}{2!}\nabla _x^2f(0)[s_2]^2 + \frac{3}{3!}\nabla _x^3f(0)[s_1,s_1,s_2] + \frac{1}{4!}\nabla _x^4f(0)[s_1]^4 = \kappa _2 - \frac{1}{4}\kappa _1^2. \end{aligned}$$

This shows that the condition \(\nabla _x^4f(x_*)[s_1]^4\ge 0\) on \(\cap _{i=1}^3 \ker ^i[\nabla _x^if(x_*)]\), although necessary, is arbitrarily weaker than the necessary condition

$$\begin{aligned} \kappa _2 \ge \frac{1}{4}\kappa _1^2 \end{aligned}$$
(3.20)

when \(\kappa _1\) grows. As was already the case for problem (3.3), the example with \(\kappa _1=3\) and \(\kappa _2=2\), say, shows that a function may admit a saddle point (\(x_*=0\)) which is a maximum along an arc (\(x(\alpha )=(\alpha , \frac{3}{2}\alpha ^2)\) in this case) while at the same time being minimal along every line passing through \(x_*\). Figure 2 shows the contour lines of the objective function of (3.19) for increasing values of \(\kappa _2\), keeping \(\kappa _1 = 3\).

Fig. 2 Contour lines of (3.19) for \(\kappa _1=3\) and \(\kappa _2= 2\) (left), \(\kappa _2=2.25\) (center) and \(\kappa _2=2.5\) (right)
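The behaviour just discussed is easily confirmed numerically; in the sketch below (ours), f from (3.19) with \(\kappa _1=3\) and \(\kappa _2=2\) is positive near the origin along every line, yet negative along the parabolic arc \(x_2 = \frac{3}{2}x_1^2\).

```python
import numpy as np

k1, k2 = 3.0, 2.0
f = lambda x1, x2: x2**2 - k1 * x1**2 * x2 + k2 * x1**4

t = 1e-2                                          # small step along each direction
for theta in np.linspace(0.0, np.pi, 7):          # lines through the origin
    print(f(t * np.cos(theta), t * np.sin(theta)) > 0)   # True in every direction
print(f(t, 1.5 * t**2))                           # -0.25 * t^4 < 0: a descent arc
```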

One may attribute the fact that not every term in (3.9) vanishes to the observation that switching the signs of \(s_1\) or \(s_2\) does not force any of the terms leading to (3.20) to be zero (as we have verified), because of the even-degree terms \(\nabla _x^2f(0)[s_2]^2\) and \(\nabla _x^4f(0)[s_1]^4\). Is this a feature of even orders only? Unfortunately, this is not the case: consider \(q=7\). Indeed, it is not difficult to verify that the terms whose multi-index \((\ell _1,\ldots ,\ell _k)\) is a permutation of (1, 2, 2, 2) belong to \(\mathcal{P}(7,4)\) and those whose multi-index is a permutation of (1, 1, 1, 1, 1, 2) belong to \(\mathcal{P}(7,6)\). Moreover, the contributions of these terms to the sum (3.9) cannot be distinguished by varying \(s_1\) or \(s_2\), for instance by switching their signs, as this technique yields only one equality in two unknowns. In general, we may therefore conclude that (3.9) must involve a mixture of terms with derivative tensors of various degrees.

3.1.3 Sufficient Conditions for Isolated Local Minimizers

Despite the limitations we have seen when considering the simplified Hancock example, we may still derive a sufficient condition for \(x_*\) to be an isolated minimizer, which is inspired by the standard second-order case (see Theorem 2.4 in Nocedal and Wright [48] for instance). This condition requires a constraint qualification in that the feasible set in the neighbourhood of \(x_*\) is required to be completely described by the arcs of the form (3.5) for small \(\alpha \).

[Theorem 3.3: statement rendered as an image in the source; it gives sufficient conditions, involving the arc-based description (3.21) of \(\mathcal{F}\) near \(x_*\) and the strict inequality (3.22), for \(x_*\) to be an isolated local minimizer.]

Proof

Consider any \(\delta _2 \in (0, \delta ]\) and, using the fact that \(\mathcal{F}\ne \{x_*\}\), an arbitrary \(y \in \mathcal{F}\cap \partial \mathcal{B}(x_*,\delta _2) \subseteq \mathcal{A}_\mathcal{F}^q(x_*, \delta )\), where we used (3.21) to obtain the last inclusion. Thus, there exists at least one arc \(x(\alpha )\) of the form (3.5) which is tangent to \(\mathcal{D}_\mathcal{F}^q\) (with associated nonzero \((s_1,\ldots ,s_q)\)) and a smallest \(\alpha _y \ge 0\) such that \(x(\alpha _y)=y\). For any such arc, let m be the smallest integer such that \(c_m \ne 0\), where \(c_j\) is defined by (3.12). The relations (3.9), (3.10) and (3.22) then imply that

$$\begin{aligned} c_m > 0. \end{aligned}$$
(3.23)

and (3.22) also ensures that \(m \in \{1, \ldots , q \}\). Now choose such an arc \(x(\alpha )\) with maximal m. From Taylor’s theorem and using (3.11) to obtain the form of the derivatives along the arc \(x(\alpha )\), we have that

$$\begin{aligned}&f(x(\alpha _y)) - f(x_*)\nonumber \\&\quad = \displaystyle \sum _{j=1}^{m-1} c_j \alpha _y^j + \alpha _y^m \displaystyle \sum _{k=1}^m \frac{1}{k!}\Bigg (\displaystyle \sum _{(\ell _1, \ldots ,\ell _k) \in \mathcal{P}(m,k)} \nabla _x^k f(x(\tau \alpha _y))[s_{\ell _1}, \ldots , s_{\ell _k}]\Bigg )\nonumber \\&\quad = \alpha _y^m \displaystyle \sum _{k=1}^m \frac{1}{k!}\Bigg (\displaystyle \sum _{(\ell _1, \ldots ,\ell _k) \in \mathcal{P}(m,k)} \nabla _x^k f(x(\tau \alpha _y))[s_{\ell _1}, \ldots , s_{\ell _k}]\Bigg ) \end{aligned}$$
(3.24)

for some \(\tau \in [0,1]\), where we used our assumption \(c_j=0\) for \(j=1,\ldots ,m-1\) to deduce the second equality. Observe that \(\Vert x(\tau \alpha _y)-x_*\Vert \le \delta _2\) because \(\alpha _y\) is the smallest \(\alpha \) such that \(\Vert x(\alpha ) - x_*\Vert =\delta _2\). Hence, we may choose \(\delta _2\) small enough to ensure, by continuity and because of (3.12), (3.23) and (3.24), that \(f(y) - f(x_*) = f(x(\alpha _y)) - f(x_*)> 0\). This proves the theorem, since y was chosen arbitrarily in a sufficiently small feasible neighbourhood of \(x_*\). \(\square \)

Note that the condition \(\mathcal{F}\ne \{x_*\}\) may be viewed as a form of Slater condition, and also that \(x_*\) is obviously a local isolated minimizer if it fails.

If we now return to our examples, we see that Theorem 3.3 excludes the origin being a local minimizer of, for example, (3.19) with \(\kappa _1=3\) and \(\kappa _2=2\), since the arc \(x(\alpha )=(\alpha , \frac{3}{2}\alpha ^2)\) must be considered in (3.21). The origin is not a local minimizer for either problem (3.3) or (3.4); accordingly, (3.22) fails for any q because the Taylor series of f is identically zero along the first coordinate axis (which defines the two admissible arcs \(x(\alpha ) = \pm \alpha e_1\)).

Of course, the assumptions of Theorem 3.3 may be difficult to check in general, but they may be tractable in some cases. Assume, for instance, that \(\mathcal{F}\) is polyhedral. Then, for sufficiently small \(\delta \), \(\mathcal{A}_\mathcal{F}^q(x_*,\delta ) \subset \mathcal{T}_\mathcal{F}(x_*)\) and we may use half-lines originating at \(x_*\) to define feasible arcs. This is the inspiration of the following less general, but easier to verify, alternative to Theorem 3.3.

[Theorem 3.4: statement rendered as an image in the source; it gives the alternative sufficient condition (3.25) along the directions of \(\mathcal{T}_\mathcal{F}(x_*)\), whose first part concerns the derivatives of orders below q and whose second part requires positivity of the qth-order term.]

Proof

If \(\mathcal{T}_\mathcal{F}(x_*)\) is reduced to the origin, then the inclusion \(\mathcal{F}\subseteq x_*+\mathcal{T}_\mathcal{F}(x_*)\) implies that \(\mathcal{F}= \{x_*\}\), and \(x_*\) is therefore an isolated minimizer. Let us therefore assume that there exists a nonzero \(s \in \mathcal{T}_\mathcal{F}(x_*)\). The second part of condition (3.25) and the continuity of the qth derivative then imply that

$$\begin{aligned} \nabla _x^qf(z)[s]^q > 0 \end{aligned}$$
(3.26)

for all z in a sufficiently small feasible neighbourhood of \(x_*\). Now, using Taylor’s expansion, we obtain that, for all \(s \in \mathcal{T}_\mathcal{F}(x_*)\) and all \(\tau \in (0,1)\),

$$\begin{aligned} f(x_*+\tau s) -f(x_*) = \sum _{i=1}^{q-1} \frac{\tau ^i}{i!} \nabla _x^if(x_*)[s]^i + \frac{\tau ^q}{q!}\nabla _x^qf(z)[s]^q \end{aligned}$$

for some \(z \in [x_*, x_*+\tau s]\). If \(\tau \) is sufficiently small, then this equality, the first part of (3.25) and (3.26) ensure that \(f(x_*+\tau s) >f(x_*)\). Since this strict inequality holds for an arbitrary nonzero \(s \in \mathcal{T}_\mathcal{F}(x_*) \supseteq \mathcal{F}- x_*\) and all \(\tau \) sufficiently small, \(x_*\) must be an isolated feasible minimizer. \(\square \)

Observe that, in Peano’s example (see (3.19) with \(\kappa _1=3\) and \(\kappa _2=2\)), the curvature of the objective function is positive along every line passing through the origin, but the order of the curvature varies with s (second order along \(s=e_2\) and fourth order along \(s=e_1\)), which precludes applying Theorem 3.4. Also note that, when \(q=2\), weaker sufficient conditions (exploiting the structure of \(\mathcal{D}_\mathcal{F}^2(x_*)\) to a larger extent) are known for several classes of problems, including semi-definite optimization (see [10] for details).

3.1.4 An Approach Using Taylor Models

As already noted, the conditions expressed in Theorem 3.1 may, in general, be very complicated to verify in an algorithm, due to their dependence on the geometry of the set of feasible arcs. To avoid this difficulty, we now explore a different approach. Let the symbol “globmin” represent global minimization and define, for some \(\Delta \in (0,1]\) and some \(j \in \{1, \ldots , p \}\),

$$\begin{aligned} \phi _{f,j}^\Delta (x) \mathop {=}\limits ^\mathrm{def}f(x)-\mathop {{\mathrm{globmin}}}\limits _{x+d \in \mathcal{F},\; \Vert d\Vert \le \Delta } T_{f,j}(x,d), \end{aligned}$$
(3.27)

the smallest value of the jth-order Taylor model \(T_{f,j}(x,s)\) achievable by a feasible point at distance at most \(\Delta \) from x. Note that \(\phi _{f,j}^\Delta (x)\) is a continuous function of x and \(\Delta \) for given \(\mathcal{F}\) and f (see [40, Th. 7]). The introduction of this quantity is in part motivated by the following theorem.

[Theorem 3.5: statement rendered as an image in the source; it relates the limiting behaviour of \(\phi _{f,j}^\Delta (x)/\Delta ^j\) as \(\Delta \rightarrow 0\) to the necessary condition (3.9) holding for all \((s_1,\ldots ,s_j)\) in the set \(\mathcal{Z}_\mathcal{F}^{f,j}(x)\) of arc coefficients satisfying (3.10).]

Proof

We start by rewriting the power series (3.11) of degree j, for any given arc \(x(\alpha )\) tangent to \(\mathcal{D}_\mathcal{F}^j(x)\), in the form

$$\begin{aligned} f(x(\alpha )) - f(x) = \sum _{i=1}^j c_i \alpha ^i + o(\alpha ^j) = T_{f,j}(x,s(\alpha )) - f(x) +o(\alpha ^j),\qquad \end{aligned}$$
(3.28)

where \(s(\alpha )\mathop {=}\limits ^\mathrm{def}x(\alpha )-x\), \(c_i\) is defined by (3.12) and where the last equality holds because f and \(T_{f,j}\) share the first j derivatives at x. This reformulation allows us to write that, for \(i \in \{1, \ldots , j \}\),

$$\begin{aligned} c_i = \frac{1}{i!} \left. \frac{\text {d}^i}{\text {d}\alpha ^i}\big [T_{f,j}(x,s(\alpha )) - f(x)\big ]\right| _{\alpha =0}. \end{aligned}$$
(3.29)

Assume now there exists an \((s_1,\ldots ,s_j) \in \mathcal{Z}_\mathcal{F}^{f,j}(x)\) such that (3.9) does not hold. In the notation just introduced, this means that, for this particular \((s_1,\ldots ,s_j)\),

$$\begin{aligned} c_i = 0 \text{ for } i \in \{1, \ldots , j-1 \} \text{ and } c_j < 0. \end{aligned}$$

Then, from (3.29),

$$\begin{aligned} \left. \frac{\text {d}^i}{\text {d}\alpha ^i}\big [T_{f,j}(x,s(\alpha ))-f(x)\big ]\right| _{\alpha =0} =0 \text{ for } i \in \{1, \ldots , j-1 \}, \end{aligned}$$
(3.30)

and thus the first \(j-1\) coefficients of the polynomial \(T_{f,j}(x,s(\alpha )) - f(x)\) vanish. Hence, using (3.28),

$$\begin{aligned} \left. \frac{\text {d}^j}{\text {d}\alpha ^j}\big [T_{f,j}(x,s(\alpha )) - f(x)\big ]\right| _{\alpha =0} = j! \lim _{\alpha \rightarrow 0}\frac{T_{f,j}(x,s(\alpha )) - f(x)}{\alpha ^j}. \end{aligned}$$
(3.31)

Now let \(i_0\) be the index of the first nonzero \(s_i\). Note that \(i_0 \in \{1, \ldots , j \}\) since otherwise the structure of the sets \(\mathcal{P}(j,k)\) implies that \(c_j=0\). Observe also that we may redefine the parameter \(\alpha \) as \(\alpha \Vert s_{i_0}\Vert ^{1/i_0}\), so that we may assume, without loss of generality, that \(\Vert s_{i_0}\Vert =1\). As a consequence, we obtain that, for sufficiently small \(\alpha \),

(3.32)

Hence, successively using the facts that \(c_j <0\), that (3.29) and (3.31) hold for all arcs \(x(\alpha )\) tangent to \(\mathcal{D}_\mathcal{F}^q(x)\), and that (3.32) and (3.27) hold, we may deduce that

The conclusion of the theorem immediately follows since \(\lim _{\Delta \rightarrow 0} \frac{\phi _{f,j}^\Delta (x)}{\Delta ^j}=0\). \(\square \)

This theorem has a useful consequence.

[Corollary of Theorem 3.5: statement rendered as an image in the source; it concludes that x is a qth-order critical point whenever the limits (3.33) of \(\phi _{f,j}^\Delta (x)/\Delta ^j\) vanish for \(j=1,\ldots ,q\).]

Proof

We successively apply Theorem 3.5 q times and deduce that x is a jth-order critical point for \(j = 1,\ldots ,q\). \(\square \)

This last result says that we may avoid the difficulty of dealing with the possibly complicated geometry of \(\mathcal{D}_\mathcal{F}^q(x)\) if we are ready to perform the global optimization occurring in (3.27) exactly and to find a way to compute or overestimate the limit in (3.33). Although this is a positive conclusion, these two challenges remain daunting. However, it is worthwhile noting that the standard way of computing first-, second- and third-order criticality measures for unconstrained problems follows exactly this approach. In the first-order case, it is easy to verify that

$$\begin{aligned} \Vert \nabla _x^1f(x)\Vert = \frac{1}{\Delta }\left[ - \min _{\Vert d\Vert \le \Delta }\nabla _x^1f(x)[d]\right] = \frac{1}{\Delta }\left[ f(x) - \mathop {{\mathrm{globmin}}}\limits _{\Vert d\Vert \le \Delta }\Big (f(x)+\nabla _x^1f(x)[d]\Big )\right] , \end{aligned}$$

where the first equality is justified by the convexity of \(\nabla _x^1f(x)[d]\) as a function of d. Because the left-hand side of the above relation is independent of \(\Delta \), the computation of the limit (3.33) for \(\Delta \) tending to zero is trivial when \(j=1\) and the limiting value is \(\Vert \nabla _xf(x)\Vert \). For the second-order case, assuming \(\Vert \nabla _x^1f(x)\Vert =0\),

$$\begin{aligned} \max \Big [ 0, -\lambda _{\min }[\nabla _x^2f(x)]\Big ] = \frac{2}{\Delta ^2}\left[ f(x) - \mathop {{\mathrm{globmin}}}\limits _{\Vert d\Vert \le \Delta }\Big ( f(x) + \frac{1}{2}\nabla _x^2f(x)[d]^2\Big ) \right] , \end{aligned}$$
(3.34)

the first global optimization problem being easily solvable by a trust-region-type calculation [25, Section 7.3] or directly by an equivalent eigenvalue analysis. As for the first-order case, the left-hand side of the equation is independent of \(\Delta \) and obtaining the limit for \(\Delta \) tending to zero is trivial.
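For illustration, the second-order case reduces to an eigenvalue computation, as the following sketch (ours) shows: \(\mathrm{globmin}_{\Vert d\Vert \le \Delta } \frac{1}{2}\nabla _x^2f(x)[d]^2 = \frac{\Delta ^2}{2}\min \big [\lambda _{\min }[\nabla _x^2f(x)],0\big ]\), so that the quotient in (3.34) is indeed independent of \(\Delta \).

```python
import numpy as np

H = np.array([[2.0, 0.0], [0.0, -3.0]])     # a Hessian with negative curvature
lam_min = np.linalg.eigvalsh(H)[0]          # eigenvalues come in ascending order

for Delta in (1.0, 0.1, 0.01):
    model_min = 0.5 * Delta**2 * min(lam_min, 0.0)   # globmin of the quadratic model
    print((0.0 - model_min) / Delta**2)              # always 1.5 = -lambda_min / 2
```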

Finally, if \(\mathcal{M}(x) \mathop {=}\limits ^\mathrm{def}\ker [\nabla _x^1f(x)] \cap \ker [\nabla _x^2f(x)]\) and \(P_{\mathcal{M}(x)}\) is the orthogonal projection onto that subspace,

(3.35)

where the first equality results from (2.1). In this case, the global optimization in the subspace \(\mathcal{M}(x)\) is potentially harder to solve exactly (a randomization argument is used in [1] to derive an upper bound on its value), although it still involves a subspace.Footnote 4

While we are unaware of a technique for making the global minimization in (3.27) easy in the even more complicated general case, we may think of approximating the limit in (3.33) by choosing a (user-supplied) value of \(\Delta >0\) small enoughFootnote 5 and considering the size of the quantity

$$\begin{aligned} \displaystyle \frac{1}{\Delta ^j}\phi _{f,j}^\Delta (x). \end{aligned}$$
(3.36)

Unfortunately, it is easy to see that, if \(\Delta \) is fixed at some positive value, a zero value of \(\phi _{f,j}^\Delta (x)\) alone is not a necessary condition for x being a local minimizer. Indeed, consider the univariate problem of minimizing \(f(x)=x^2(1-\alpha x)\) for \(\alpha >0\). One verifies that, for any \(\Delta > 0\), the choice \(\alpha = 2/\Delta \) yields that

$$\begin{aligned} \phi _{f,1}^\Delta (0) = 0, \;\;\;\;\phi _{f,2}^\Delta (0) = 0 \text{ but } \phi _{f,3}^\Delta (0) = \frac{4}{\alpha ^2} > 0, \end{aligned}$$
(3.37)

despite 0 being a local (but not global) minimizer. As a matter of fact, \(\phi _{f,j}^\Delta (x)\) gives more information than the mere potential proximity of a jth-order critical point: it is able to see beyond an infinitesimal neighbourhood of x and provides information on possible further descent beyond such a neighbourhood. Rather than a true criticality measure, it can be considered, for fixed \(\Delta \), as an indicator of further progress, but its use for terminating at a local minimizer is clearly imperfect.
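The values (3.37) can be reproduced by brute force, as in the following sketch (ours), which evaluates (3.27) for \(\mathcal{F}=\mathbb {R}\) on a fine grid.

```python
import numpy as np

Delta = 0.5
a = 2.0 / Delta                       # the choice alpha = 2/Delta of the text
d = np.linspace(-Delta, Delta, 200001)

T1 = np.zeros_like(d)                 # T_{f,1}(0,d) = f(0) + f'(0) d = 0
T2 = d**2                             # adds f''(0) d^2 / 2!
T3 = d**2 - a * d**3                  # adds f'''(0) d^3 / 3! = -a d^3
for j, T in ((1, T1), (2, T2), (3, T3)):
    print(j, 0.0 - T.min())           # 0.0, 0.0 and 4/a^2 = 0.25, matching (3.37)
```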

Despite this drawback, the above arguments would suggest that it is reasonable to consider a (conceptual) minimization algorithm whose objective is to find a point \(x_\epsilon \) such that

$$\begin{aligned} \phi _{f,j}^\Delta (x_\epsilon ) \le \epsilon \Delta ^j \text{ for } j = 1, \ldots ,q \end{aligned}$$
(3.38)

for some \(\Delta \in (0,1]\) sufficiently small and some \(q \in \{1, \ldots , p \}\). This condition implies an approximate minimizing property which we make more precise by the following result.

[Theorem 3.7: statement rendered as an image in the source; it asserts the approximate minimizing property (3.39) of any point \(x_\epsilon \) satisfying (3.38).]

Proof

Consider any d such that \(x_\epsilon +d \in \mathcal{F}\). Using the triangle inequality, we have that

$$\begin{aligned} f(x_\epsilon +d) = f(x_\epsilon +d)-T_{f,q}(x_\epsilon ,d) + T_{f,q}(x_\epsilon ,d) \ge -\,|f(x_\epsilon +d)-T_{f,q}(x_\epsilon ,d)| + T_{f,q}(x_\epsilon ,d). \end{aligned}$$
(3.40)

Now, condition (3.38) for \(j=q\) implies that, if \(\Vert d\Vert \le \Delta \),

$$\begin{aligned} T_{f,q}(x_\epsilon ,d) \ge T_{f,q}(x_\epsilon ,0) - \epsilon \Delta ^q = f(x_\epsilon ) - \epsilon \Delta ^q. \end{aligned}$$
(3.41)

Hence, substituting (2.9) and (3.41) into (3.40), using the assumed Lipschitz continuity of \(\nabla _x^q f\) and remembering again that \(\Vert d\Vert \le \Delta \le 1\), we deduce that

$$\begin{aligned} f(x_\epsilon +d) \ge f(x_\epsilon ) - \displaystyle \frac{L_{f,q}}{q!} \Vert d\Vert ^{q+1}-\epsilon \Delta ^q, \end{aligned}$$

and the desired result follows. \(\square \)

The size of the neighbourhood of \(x_\epsilon \) where f is “locally smallest”—in that the first part of (3.39) holds—therefore increases with the criticality order q, a feature potentially useful in various contexts such as global optimization.

Before turning to more algorithmic aspects, we briefly compare the results of Theorem 3.7 with what can be deduced on the local behaviour of the Taylor series \(T_{f,q}(x_*,s)\) if, instead of requiring the necessary condition (3.9) to hold exactly, this condition is relaxed to

$$\begin{aligned} \sum _{k=1}^j \frac{1}{k!}\left( \sum _{(\ell _1, \ldots ,\ell _k) \in \mathcal{P}(j,k)} \nabla _x^k f(x_*)[s_{\ell _1}, \ldots , s_{\ell _k}]\right) \ge -\epsilon \end{aligned}$$
(3.42)

while insisting that (3.10) should hold exactly. If \(j=q=1\), it is easy to verify that (3.42) for \(s_1 \in \mathcal{T}_\mathcal{F}(x_*)\) is equivalent to the condition that

$$\begin{aligned} \Vert P_{\mathcal{T}_\mathcal{F}(x_*)}[\nabla _x^1 f(x_*)] \Vert \le \epsilon , \end{aligned}$$
(3.43)

from which we deduce, using the Cauchy–Schwarz inequality, that

$$\begin{aligned} T_{f,1}(x_*,s) \ge T_{f,1}(x_*,0) - \epsilon \Delta \end{aligned}$$
(3.44)

for all \(s \in \mathcal{T}_\mathcal{F}(x_{*})\) with \(\Vert s\Vert \le \Delta \), that is (3.38) for \(j=1\). Thus, by Theorem 3.7, we obtain that (3.39) holds for \(j=1\).

4 Evaluation Complexity of Finding Approximate High-Order Critical Points

4.1 A Trust-Region Minimization Algorithm

Aware of the optimality conditions and their limitations, we may now consider an algorithm to achieve (3.38). This objective naturally suggests a trust-regionFootnote 6 formulation with adaptive model degree, in which the user specifies a desired criticality order q, assuming that derivatives of order \(1, \ldots , q\) are available when needed. We make this idea explicit in Algorithm 4.1.

[Algorithm 4.1 (higher-order trust region): the formal statement is rendered as an image in the source; its acceptance test uses the ratio \(\rho _k\) of achieved to predicted decrease in (4.1) and its radius update follows (4.2).]
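To fix ideas, the following schematic sketch (ours, under stated assumptions) shows the overall shape of such an iteration in Python. The oracles `model_gap` (returning \(\phi _{f,j}^{\Delta }(x)\) of (3.27)) and `argmin_model` (returning a feasible global minimizer of the Taylor model in the trust region, with the corresponding model decrease) are assumed to be supplied by the user and hide the global optimization effort; the constants and the precise radius update rule (4.2) of the actual Algorithm 4.1 appear only in the source image, so the values below are illustrative placeholders.

```python
def trust_region(f, model_gap, argmin_model, x, q, eps,
                 Delta=1.0, Delta_max=1.0, eta1=0.1, eta2=0.9,
                 gamma1=0.5, gamma3=2.0, max_iter=1000):
    """Schematic high-order trust-region loop in the spirit of Algorithm 4.1."""
    for _ in range(max_iter):
        # Termination test (3.38)/(4.15) for orders j = 1, ..., q.
        gaps = [model_gap(f, x, j, Delta) for j in range(1, q + 1)]
        if all(g <= eps * Delta**j for j, g in enumerate(gaps, start=1)):
            return x
        # Work with the first order j at which the test fails.
        j = next(j for j, g in enumerate(gaps, start=1) if g > eps * Delta**j)
        s, decrease = argmin_model(f, x, j, Delta)   # feasible step, ||s|| <= Delta
        rho = (f(x) - f(x + s)) / decrease           # achieved/predicted, as in (4.1)
        if rho >= eta1:                              # (very) successful: accept step
            x = x + s
        if rho >= eta2:                              # very successful: may enlarge
            Delta = min(gamma3 * Delta, Delta_max)
        elif rho < eta1:                             # unsuccessful: shrink
            Delta = gamma1 * Delta
    return x
```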

We first state a useful property of Algorithm 4.1, which ensures that a fixed fraction of the iterations \(1, 2, \ldots , k\) must be either successful or very successful. Indeed, if we define

$$\begin{aligned} \mathcal{S}\mathop {=}\limits ^\mathrm{def}\{ \ell \in \mathbb {N}_0 \mid \rho _\ell \ge \eta _1 \} \text{ and } \mathcal{S}_k \mathop {=}\limits ^\mathrm{def}\mathcal{S}\cap \{1, \ldots , k \}, \end{aligned}$$

the following bound holds.

[Lemma 4.1: statement rendered as an image in the source; it gives the bound (4.3) on the total number of iterations k in terms of \(|\mathcal{S}_k|\).]

Proof

The trust-region update (4.2) ensures that

$$\begin{aligned} \Delta _k \le \Delta _1 \gamma _2^{|\mathcal{U}_k|} \gamma _3^{|\mathcal{S}_k|}, \end{aligned}$$

where \(\mathcal{U}_k = \{1, \ldots , k \} \setminus \mathcal{S}_k\). This inequality then yields (4.3) by taking logarithms and using that \(|\mathcal{S}_k| \ge 1\) and \(k = |\mathcal{S}_k| + |\mathcal{U}_k|\). \(\square \)

4.2 Evaluation Complexity for Algorithm 4.1

We start our worst-case analysis by formalizing our assumptions. Let

$$\begin{aligned} \mathcal{L}_f\mathop {=}\limits ^\mathrm{def}\{ x +z \in \mathbb {R}^n \mid x \in \mathcal{F},\; f(x) \le f(x_1)\,\text{ and }\, \Vert z\Vert \le \Delta _{\max } \}. \end{aligned}$$
[Assumptions AS.1–AS.3: statement rendered as an image in the source; they require feasibility, sufficient smoothness of f, and the Lipschitz continuity (2.4) of its derivatives of orders 1 to q on \(\mathcal{L}_f\).]

For simplicity of notation, define \(L_f \mathop {=}\limits ^\mathrm{def}\max _{j\in \{1, \ldots , q \}} L_{f,j}\).

Algorithm 4.1 is required to start from a feasible \(x_1 \in \mathcal{F}\), which, together with the fact that the subproblem solution in Step 2 involves minimization over \(\mathcal{F}\), leads to AS.1. Note that AS.3 requires AS.2 and automatically holds if f is \(q+1\) times continuously differentiable and \(\mathcal{F}\) is bounded.

We now establish a lower bound on the trust-region radius.

[Lemma 4.2: statement rendered as an image in the source; it establishes the lower bound (4.4) on the trust-region radius, \(\Delta _k \ge \kappa _\Delta \epsilon \), with \(\kappa _\Delta \) defined in (4.5).]

Proof

Assume that, for some \(\ell \in \{1, \ldots , k \}\),

$$\begin{aligned} \Delta _\ell \le \frac{1-\eta _2}{L_f}\,\epsilon . \end{aligned}$$
(4.6)

From (4.1), we obtain that, for some \(j\in \{1, \ldots , q \}\),

$$\begin{aligned} | 1 - \rho _\ell | = \frac{|f(x_\ell +s_\ell )-T_{f,j}(x_\ell ,s_\ell )|}{T_{f,j}(x_\ell ,0)-T_{f,j}(x_\ell ,s_\ell )} < \frac{L_f \Vert s_\ell \Vert ^{j+1}}{j!\,\epsilon \Delta _\ell ^j} \le \frac{L_f \Delta _\ell }{j!\,\epsilon } \le 1-\eta _2, \end{aligned}$$
(4.7)

where we used (2.8) (implied by AS.3) and the fact that \(\phi _{f,j}^{\Delta _\ell }(x_\ell ) > \epsilon \Delta _\ell ^j\) to deduce the first inequality, the bound \(\Vert s_\ell \Vert \le \Delta _\ell \) to deduce the second, and (4.6) with \(j\ge 1\) to deduce the third. Thus, \(\rho _\ell \ge \eta _2\) and \(\Delta _{\ell +1} \ge \Delta _\ell \). The mechanism of the algorithm and the inequality \(\Delta _1 \ge \epsilon \) then ensure that, for all \(\ell \in \{1, \ldots , k\}\),

$$\begin{aligned} \Delta _\ell \ge \min \left[ \Delta _1, \frac{\gamma _1(1-\eta _2)\epsilon }{L_f}\right] \ge \kappa _\Delta \epsilon . \end{aligned}$$

\(\square \)

We next derive a simple lower bound on the objective function decrease at successful iterations.

[Lemma 4.3: statement rendered as an image in the source; it gives the guaranteed objective decrease at every successful iteration before termination.]

Proof

We have, successively using (4.1), the fact that \(\phi _{f,j}^{\Delta _k}(x_k) > \epsilon \Delta _k^j\) for some \(j \in \{1, \ldots , q \}\), and (4.4), that

$$\begin{aligned} f(x_k)-f(x_{k+1}) \ge \eta _1 [\, T_{f,j}(x_k,0)-T_{f,j}(x_k,s_k)\, ] = \eta _1 \phi _{f,j}^{\Delta _k}(x_k) > \eta _1 \kappa _\Delta ^j\, \epsilon ^{j+1} \ge \eta _1 \kappa _\Delta ^q\, \epsilon ^{q+1}. \end{aligned}$$

\(\square \)

Our worst-case evaluation complexity results can now be proved by summing the decreases guaranteed by this last lemma.

[Theorem 4.4: statement rendered as an image in the source; it gives the worst-case bounds (4.9)–(4.10) of order \(O(\epsilon ^{-(q+1)})\) on the numbers of successful and total iterations (and hence evaluations) needed to satisfy (3.38), together with the approximate optimality guarantees (4.12)–(4.13) at the resulting point \(x_\epsilon \).]

Proof

Let k be the index of an arbitrary iteration before termination. Using the definition of \(f_\mathrm{low}\), the nature of successful iterations, (4.11) and Lemma 4.3, we deduce that

$$\begin{aligned} f(x_0) - f_\mathrm{low} \ge f(x_0) - f(x_{k+1}) = \sum _{i\in \mathcal{S}_k} [f(x_i)-f(x_{i+1}) ] \ge |\mathcal{S}_k|\,[\kappa _\mathcal{S}^f]^{-1}\,\epsilon ^{q+1}, \end{aligned}$$
(4.14)

which proves (4.9).

We next call upon Lemma 4.1 to compute the upper bound on the total number of iterations before termination (obviously, there must be at least one successful iteration unless termination occurs for \(k=1\)) and add one for the evaluation at termination. Finally, (4.12) and (4.13) result from AS.3, Theorem 3.7 and the fact that \(\phi _{f,q}^{\Delta _{k_\epsilon }}(x_\epsilon ) \le \epsilon \, \Delta _{k_\epsilon }^q\) at termination. \(\square \)

Observe that, because of (4.2) and (4.4), \(\Delta _\epsilon \in [ \kappa _\Delta \epsilon , \Delta _{\max }]\). Theorem 4.4 generalizes the known bounds for the cases where \(\mathcal{F}= \mathbb {R}^n\) and \(q=1\) [46], \(q=2\) [16, 47] and \(q=3\) [1]. The results for \(q=2\) with \(\mathcal{F}\subset \mathbb {R}^n\) and for \(q>3\) appear to be new. The latter provide the first evaluation complexity bounds for general criticality order q. Note that, if \(q = 1\), bounds of the type \(O(\epsilon ^{-(p+1)/p})\) exist if one is ready to minimize models of degree \(p > q\) (see [9]). Whether similar improvements can be obtained for \(q>1\) remains an open question at this stage.

We also observe that the above theory remains valid if the termination rule

$$\begin{aligned} \phi _{f,j}^{\Delta _k}(x_k) \le \epsilon \Delta _k^j \text{ for } j \in \{1, \ldots , q \} \end{aligned}$$
(4.15)

used in Step 1 is replaced by a more flexible one, involving other acceptable termination circumstances, such as requiring that (4.15) or some other suitable condition holds. We also note that the global optimization effort involved in the computation of \(\phi _{f,j}^{\Delta _k}(x_k)\) \((j \in \{1, \ldots , q \})\) in Algorithm 4.1 might be limited by choosing \(\Delta _{\max }\) small enough.

We close this section with an important observation. The full AS.3 using (2.4) is only needed in our complexity analysis to deduce (4.12) in Theorem 4.4 using Theorem 3.7, itself depending on (2.4) via (2.8) and (2.9). However, in the derivation of the complexity bounds (4.9) and (4.10), the Lipschitz continuity implied by AS.3 is only used for deriving the first inequality of (4.7), in that Lipschitz continuity of \(\nabla _x^q f\) implies (2.8) along the segment \([x_k,x_k+s_k]\). Since it was discussed in Sect. 2.3 that (2.10) implies the same (2.8) along this segment, the weaker assumption

[AS.3b: statement rendered as an image in the source; it requires only the segment-wise bound (2.10) on the variation of \(\nabla _x^q f\), rather than the full Lipschitz continuity (2.4).]

is all that is required for deriving (4.7). AS.3b can therefore replace AS.3 in Theorem 4.4 for the limited purpose of ensuring (4.9)–(4.11).

5 Sharpness

Interestingly, an example was presented in [18] showing that the bound of \(O(\epsilon ^{-3})\) evaluations for \(q=2\) is essentially sharp for both the trust-region and regularization algorithms. This is significant, because requiring \(\phi _{f,2}^\Delta (x) \le \epsilon \Delta ^2\) is slightly stronger, for small \(\Delta \), than the standard condition

$$\begin{aligned} \Vert \nabla _x^1f(x)\Vert \le \epsilon \text{ and } \min \Big [0,\lambda _{\min }[\nabla _x^2f(x)]\Big ] \ge -\epsilon \end{aligned}$$
(5.1)

(used in [16, 47] for instance). Indeed, for one-dimensional problems and assuming \(\nabla _x^2 f(x) \le 0\), the former condition amounts to requiring that

$$\begin{aligned} \frac{1}{2}\left( -\nabla _{x}^2f(x)+2\frac{|\nabla _x^1f(x)|}{\Delta } \right) \le \epsilon , \end{aligned}$$
(5.2)

where the absolute value reflects the fact that \(s=\pm \Delta \) depending on the sign of the gradient. In the remainder of this section, we show that the example proposed in [18] can be extended to arbitrary order q, and thus that the complexity bounds (4.9)–(4.10) are essentially sharp for our trust-region algorithm.

The idea of our generalized example is to apply Algorithm 4.1 to a unidimensional objective function f for some fixed \(q\ge 1\) and \(\mathcal{F}= \mathbb {R}_+\) (hence guaranteeing AS.1), generating a sequence of iterates \(\{x_k\}_{k\ge 0}\) starting from the origin, i.e., \(x_0=x_1=0\). We first choose the sequence of derivative values up to order q to be, for all \(k \ge 1\),

$$\begin{aligned} \nabla _x^j f(x_k) = 0\,\text{ for }\, j \in \{1, \ldots , q-1 \}\, \text{ and }\, \nabla _x^q f(x_k) = -q!\,\left( \frac{1}{k+1}\right) ^{\frac{1}{q+1}+\delta },\qquad \end{aligned}$$
(5.3)

where \(\delta \in (0,1)\) is a (small) positive constant. This means that, at iterate \(x_k\), the qth-order Taylor model is given by

$$\begin{aligned} T_{f,q}(x_k,s) = f(x_k) - \left( \frac{1}{k+1}\right) ^{\frac{1}{q+1}+\delta } s^q, \end{aligned}$$

where the value of \(f(x_k)\) remains unspecified for now. The step is then obtained by minimizing this model in a trust-region of radius

$$\begin{aligned} \Delta _k = \left( \frac{1}{k+1}\right) ^{\frac{1}{q+1}+\delta }, \end{aligned}$$

yielding that

$$\begin{aligned} s_k = \Delta _k =\left( \frac{1}{k+1}\right) ^{\frac{1}{q+1}+\delta } \in (0,1). \end{aligned}$$
(5.4)

As a consequence, the model decrease is given by

$$\begin{aligned} T_{f,q}(x_k,0)-T_{f,q}(x_k,s_k) = -\frac{1}{q!}\nabla _x^q f(x_k)s_k^q = \left( \frac{1}{k+1}\right) ^{1+(q+1)\delta }. \end{aligned}$$
(5.5)

For our example, we then define the objective function decrease at iteration k to be

$$\begin{aligned} \Delta f_k \mathop {=}\limits ^\mathrm{def}f(x_k) - f(x_k+s_k) = \frac{\eta _1+\eta _2}{2}\left( \frac{1}{k+1}\right) ^{1+(q+1)\delta }, \end{aligned}$$
(5.6)

thereby ensuring that \(\rho _k \in [\eta _1,\eta _2)\) and \(x_{k+1}=x_k+s_k\) for each k. Summing up function decreases, we may then specify the objective function’s values at the iterates by

$$\begin{aligned} f(x_0) = \frac{\eta _1+\eta _2}{2}\,\zeta (1+(q+1)\delta ) \quad \text{ and }\quad f(x_{k+1}) = f(x_k) - \frac{\eta _1+\eta _2}{2}\left( \frac{1}{k+1}\right) ^{1+(q+1)\delta }, \end{aligned}$$
(5.7)

where \(\zeta (t)\mathop {=}\limits ^\mathrm{def}\sum _{k=1}^\infty k^{-t}\) is the Riemann zeta function. This function is finite for all \(t>1\) (and thus also for \(t=1+(q+1)\delta \)), thereby ensuring that \(f(x_k) \ge 0\) for all \(k\ge 0\). We also verify that

$$\begin{aligned} \frac{\Delta _{k+1}}{\Delta _k} = \left( \frac{k+1}{k+2}\right) ^{\frac{1}{q+1}+\delta } \in [\gamma _2, 1] \end{aligned}$$

in accordance with (4.2), provided . Observe also that (5.3) and (5.5) ensure that, for each \(k\ge 1\),

$$\begin{aligned} \phi _{f,j}^{\Delta _k}(x_k) = 0\quad \,\text{ for }\,\quad j \in \{1, \ldots , q-1 \} \end{aligned}$$
(5.8)

and

$$\begin{aligned} \phi _{f,q}^{\Delta _k}(x_k) = \left( \frac{1}{k+1}\right) ^{1+(q+1)\delta } = \left( \frac{1}{k+1}\right) ^{\frac{1}{q+1}+\delta } \Delta _k^{q}. \end{aligned}$$
(5.9)
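The bookkeeping of the example is elementary to verify; the sketch below (ours) checks that the model decrease (5.5) and the measure (5.9) agree with the choices (5.3)–(5.4) for a few iterations.

```python
import math

q, delta = 3, 0.01
for k in (1, 10, 100):
    t = (1.0 / (k + 1))**(1.0 / (q + 1) + delta)   # Delta_k = s_k, by (5.4)
    grad_q = -math.factorial(q) * t                # q-th derivative value (5.3)
    decrease = -grad_q * t**q / math.factorial(q)  # model decrease, as in (5.5)
    print(math.isclose(decrease, (1.0 / (k + 1))**(1 + (q + 1) * delta)),
          math.isclose(decrease, t * t**q))        # both True: (5.5) and (5.9)
```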

We now use Hermite interpolation to construct the objective function f on the successive intervals \([x_k,x_{k+1}]\) and define

$$\begin{aligned} f(x)=p_k(x-x_k)+f(x_{k+1})\quad \text{ for }\, x\in [x_k,x_{k+1}]\, \text{ and }\, k\ge 1, \end{aligned}$$
(5.10)

where \(p_k\) is the polynomial

$$\begin{aligned} p_k(s)=\sum _{i=0}^{2q+1}c_{i,k}s^i, \end{aligned}$$
(5.11)

with coefficients defined by the interpolation conditions

$$\begin{aligned} p_k(0)&= f(x_k)-f(x_{k+1}),\quad p_k(s_k)=0;\\ \nabla _s^j p_k(0)&= 0=\nabla _s^j p_k(s_k)\quad \text{ for }\,j \in \{1, \ldots , q-1 \}; \\ \nabla _s^q p_k(0)&= \nabla _x^q f(x_k),\quad \nabla _s^q p_k(s_k)=\nabla _x^q f(x_{k+1}). \end{aligned}$$
(5.12)

These conditions ensure that f(x) is q times continuously differentiable on \(\mathbb {R}_+\) and thus that AS.2 holds. They also impose the following values for the first \(q+1\) coefficients

$$\begin{aligned} c_{0,k}=f(x_k)-f(x_{k+1}),\quad c_{j,k}=0 \;\;(j \in \{1, \ldots , q-1 \}),\quad c_{q,k}=\frac{1}{q!}\nabla _x^q f(x_k); \end{aligned}$$
(5.13)

and the remaining \(q+1\) coefficients are solutions of the linear system

$$\begin{aligned} \left( \begin{array}{cccc} s_k^{q+1} & s_k^{q+2} & \ldots & s_k^{2q+1}\\ (q+1)s_k^q & (q+2)s_k^{q+1} & \ldots & (2q+1)s_k^{2q}\\ \vdots & \vdots & & \vdots \\ \frac{(q+1)!}{1!}s_k & \frac{(q+2)!}{2!}s_k^2 & \ldots & \frac{(2q+1)!}{(q+1)!}s_k^{q+1} \end{array} \right) \left( \begin{array}{c} c_{q+1,k}\\ c_{q+2,k}\\ \vdots \\ c_{2q+1,k} \end{array} \right) = r_k, \end{aligned}$$
(5.14)

where the right-hand side is given by

$$\begin{aligned} r_k = \left( \begin{array}{c} -\Delta f_k - \frac{1}{q!} \nabla _x^qf(x_k)s_k^q\\ - \frac{1}{(q-1)!} \nabla _x^qf(x_k)s_k^{q-1}\\ \vdots \\ - \nabla _x^qf(x_k)s_k \\ \nabla _x^qf(x_{k+1})-\nabla _x^qf(x_k) \end{array} \right) . \end{aligned}$$
(5.15)
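As a numerical sanity check of this construction, the following sketch (in Python; the values of q, k, \(\delta \), \(\eta _1\) and \(\eta _2\) are illustrative choices only, and the helper deriv is ours, not part of the paper) assembles the system (5.14)–(5.15), solves for the upper coefficients and verifies the interpolation conditions (5.12):

```python
# Sanity check of the Hermite construction (5.11)-(5.15); q, k, delta,
# eta1, eta2 below are illustrative values only.
import math
import numpy as np

q, k, delta = 3, 4, 0.1
eta1, eta2 = 0.1, 0.9

expo = 1.0 / (q + 1) + delta
s_k = (1.0 / (k + 1)) ** expo                          # the step (5.4)
gq_k = -math.factorial(q) * (1.0 / (k + 1)) ** expo    # q-th derivative at x_k, cf. (5.3)
gq_k1 = -math.factorial(q) * (1.0 / (k + 2)) ** expo   # q-th derivative at x_{k+1}
df_k = 0.5 * (eta1 + eta2) * (1.0 / (k + 1)) ** (1 + (q + 1) * delta)  # decrease (5.6)

# Coefficients fixed by (5.13); note c_q = (1/q!) * q-th derivative at x_k.
c = np.zeros(2 * q + 2)
c[0] = df_k
c[q] = gq_k / math.factorial(q)

# Matrix of (5.14): row j holds the j-th derivative of s^i at s = s_k,
# for the upper powers i = q+1, ..., 2q+1.
A = np.array([[math.factorial(i) // math.factorial(i - j) * s_k ** (i - j)
               for i in range(q + 1, 2 * q + 2)] for j in range(q + 1)])

# Right-hand side (5.15).
r = np.array([-df_k - gq_k * s_k ** q / math.factorial(q)]
             + [-gq_k * s_k ** (q - j) / math.factorial(q - j)
                for j in range(1, q)]
             + [gq_k1 - gq_k])
c[q + 1:] = np.linalg.solve(A, r)

# Verify the interpolation conditions (5.12).
def deriv(j, s):
    return sum(math.factorial(i) // math.factorial(i - j) * c[i] * s ** (i - j)
               for i in range(j, 2 * q + 2))

assert abs(deriv(0, 0.0) - df_k) < 1e-9 and abs(deriv(0, s_k)) < 1e-9
assert all(abs(deriv(j, 0.0)) < 1e-9 and abs(deriv(j, s_k)) < 1e-9
           for j in range(1, q))
assert abs(deriv(q, 0.0) - gq_k) < 1e-9 and abs(deriv(q, s_k) - gq_k1) < 1e-9
print("interpolation conditions (5.12) hold; coefficients:", c)
```

Note that the script uses \(c_{q,k} = \nabla _x^q f(x_k)/q!\), consistent with (5.13) and the condition \(\nabla _s^q p_k(0)=\nabla _x^q f(x_k)\).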

Observe now that the coefficient matrix of this linear system may be written as

$$\begin{aligned} \left( \begin{array}{cccc} s_k^{q+1} & & & \\ & s_k^{q} & & \\ & & \ddots & \\ & & & s_k \end{array} \right) M_q \left( \begin{array}{cccc} 1 & & & \\ & s_k & & \\ & & \ddots & \\ & & & s_k^{q} \end{array} \right) , \end{aligned}$$

where

$$\begin{aligned} M_q \mathop {=}\limits ^\mathrm{def}\left( \begin{array}{cccc} 1 & 1 & \ldots & 1\\ q+1 & q+2 & \ldots & 2q+1 \\ \vdots & \vdots & & \vdots \\ \frac{(q+1)!}{1!} & \frac{(q+2)!}{2!} & \ldots & \frac{(2q+1)!}{(q+1)!} \end{array} \right) \end{aligned}$$
(5.16)

is an invertible matrix independent of k (see Appendix). Hence,

$$\begin{aligned} \left( \begin{array}{c} c_{q+1,k}\\ c_{q+2,k}\\ \vdots \\ c_{2q+1,k} \end{array} \right) =\left( \begin{array}{cccc} 1 & & & \\ & s_k^{-1} & & \\ & & \ddots & \\ & & & s_k^{-q} \end{array} \right) M_q^{-1} \left( \begin{array}{cccc} s_k^{-(q+1)} & & & \\ & s_k^{-q} & & \\ & & \ddots & \\ & & & s_k^{-1} \end{array} \right) r_k. \end{aligned}$$
(5.17)
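Entrywise, this factorization is immediate to verify: for \(j,m \in \{1, \ldots , q+1 \}\), the (j, m) entry of the coefficient matrix in (5.14) is

$$\begin{aligned} \frac{(q+m)!}{(q+m-j+1)!}\, s_k^{q+m-j+1} = s_k^{q+2-j}\, \left[ M_q\right] _{jm}\, s_k^{m-1}, \end{aligned}$$

where the two outer powers of \(s_k\) are exactly the jth and mth diagonal entries of the two scaling matrices.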

Observe now that, because of (5.4), (5.6), (5.5) and (5.3),

$$\begin{aligned} |\Delta f_k| = O( s_k^{q+1}) \quad \text{ and }\quad |\nabla _x^qf(x_k)s_k^{q-j}| = O( s_k^{q+1-j}) \quad (j\in \{ 0, \ldots , q-1 \}) \end{aligned}$$

and, since \(\nabla _x^qf(x_k)< \nabla _x^qf(x_{k+1}) < 0\),

$$\begin{aligned} |\nabla _x^qf(x_{k+1})-\nabla _x^qf(x_k)| \le | \nabla _x^qf(x_k) | \le q!\,s_k. \end{aligned}$$

These bounds and (5.15) imply that \([r_k]_i\), the ith component of \(r_k\), satisfies

$$\begin{aligned} \left| [r_k]_i \right| = O(s_k^{q+2-i})\, \text{ for } \,i \in \{1, \ldots , q+1 \}. \end{aligned}$$

Hence, using (5.17) and the non-singularity of \(M_q\), we obtain that there exists a constant \(\kappa _q \ge 1\) independent of k such that

$$\begin{aligned} |c_{i,k}|s_k^{i-q-1} \le \kappa _q\, \text{ for } \, i \in \{ q+1, \ldots , 2q+1 \}, \end{aligned}$$
(5.18)

and thus that

$$\begin{aligned} |\nabla _s^{q+1} p_k(s)| \le \sum _{i=q+1}^{2q+1} i! \, |c_{i,k}|s_k^{i-q-1} \le \left( \sum _{i=q+1}^{2q+1} i! \right) \kappa _q \quad \text{ for all }\, s\in [0,s_k]. \end{aligned}$$
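To make the derivation of (5.18) explicit: setting \(u_k \mathop {=}\limits ^\mathrm{def}\mathrm{diag}\big (s_k^{-(q+1)}, \ldots , s_k^{-1}\big )\, r_k\), the bound on \([r_k]_i\) gives \(|[u_k]_i| = s_k^{-(q+2-i)}\, O(s_k^{q+2-i}) = O(1)\), and (5.17) then yields, for \(m \in \{0, \ldots , q \}\),

$$\begin{aligned} |c_{q+1+m,k}|\, s_k^{m} = \left| \big [M_q^{-1} u_k\big ]_{m+1}\right| \le \Vert M_q^{-1}\Vert _\infty \max _{1\le i\le q+1} |[u_k]_i| = O(1), \end{aligned}$$

which is (5.18) with \(i = q+1+m\) and \(\kappa _q\) any uniform upper bound (at least 1) on the right-hand side.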

Moreover, using successively (5.11), the triangle inequality, (5.13), (5.3), (5.4), (5.18) and \(\kappa _q \ge 1\), we obtain that, for \(j \in \{1, \ldots , q \}\),

$$\begin{aligned} |\nabla _s^j p_k(s)| &\le \sum _{i=j}^{2q+1} \frac{i!}{(i-j)!}\,|c_{i,k}|s^{i-j} \\ &= \frac{q!}{(q-j)!}\,|c_{q,k}|s^{q-j} + \sum _{i=q+1}^{2q+1} \frac{i!}{(i-j)!}\,|c_{i,k}|s^{i-q-1}s^{q+1-j} \\ &\le \frac{q!}{(q-j)!} + \sum _{i=q+1}^{2q+1} \frac{i!}{(i-j)!}\,|c_{i,k}|s_k^{i-q-1} \\ &\le \left( \sum _{i=q}^{2q+1} \frac{i!}{(i-j)!}\right) \kappa _q \end{aligned}$$

and thus all derivatives of order one up to q remain bounded on \([0,s_k]\), uniformly in k. Because of (5.10), we therefore obtain that AS.3 holds. Moreover, (5.13), (5.18), the inequalities \(|\nabla _x^q f(x_k)|\le q!\) and \(f(x_k)\ge 0\), (5.10) and (5.4) also ensure that f(x) is bounded below.

We have therefore shown that the bounds of Theorem 4.4 are essentially sharp: for every \(\delta >0\), Algorithm 4.1 applied to the problem of minimizing the lower-bounded objective function f just constructed (which satisfies AS.1–AS.3) will take, because of (5.8) and (5.9),

$$\begin{aligned} \left\lceil \frac{1}{\epsilon ^{\frac{q+1}{1+(q+1)\delta }}}\right\rceil \end{aligned}$$

iterations and evaluations of f and its first q derivatives to find an iterate \(x_k\) at which condition (4.15) holds. Moreover, it is clear that, in this example, the global rate of convergence is driven by the term of degree q in the Taylor series.
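This count can be read off (5.9) directly: if, as (5.9) suggests, termination in (4.15) occurs as soon as \(\phi _{f,q}^{\Delta _k}(x_k) \le \epsilon \Delta _k^q\), then the algorithm keeps iterating exactly as long as

$$\begin{aligned} \left( \frac{1}{k+1}\right) ^{\frac{1}{q+1}+\delta } > \epsilon , \quad \text{ i.e., as long as }\quad k+1 < \epsilon ^{-\frac{q+1}{1+(q+1)\delta }}, \end{aligned}$$

since \(\big (\frac{1}{q+1}+\delta \big )^{-1} = \frac{q+1}{1+(q+1)\delta }\); letting \(\delta \rightarrow 0\) recovers the \(O(\epsilon ^{-(q+1)})\) bound of Theorem 4.4.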

6 Discussion

We have analysed the necessary and sufficient optimality conditions of arbitrary order for convexly constrained nonlinear optimization problems, using approximations of the feasible region which generalize the idea of second-order tangent sets (see [10]) to orders beyond two. Using the resulting necessary conditions, we then proposed a criticality measure of arbitrary order for convexly constrained nonlinear optimization problems. As this measure can be extended to define \(\epsilon \)-approximate critical points of high order, we have then used it in a conceptual trust-region algorithm to show that, if the derivatives of the objective function up to order \(q \ge 1\) can be evaluated and are Lipschitz continuous, then this algorithm applied to the convexly constrained problem (3.1) needs at most \(O(\epsilon ^{-(q+1)})\) evaluations of f and its derivatives to compute an \(\epsilon \)-approximate qth-order critical point. Moreover, we have shown by an example that this bound is essentially sharp.

In the purely unconstrained case, this result recovers known results for \(q=1\) (first-order criticality with Lipschitz gradients) [46], \(q=2\) (second-order criticality with Lipschitz Hessians) [18, 47] and \(q=3\) (third-order criticality with Lipschitz continuous third derivative) [1], but extends them to arbitrary order. The results for the convexly constrained case appear to be new and provide in particular the first complexity bounds for second- and third-order criticality for such inequality constrained problems.

Because condition (4.15) measures different orders of criticality, we could choose a different \(\epsilon \) for every order (as in [18]), complicating the expression of the bound accordingly. However, as shown by our example, the worst-case behaviour of Algorithm 4.1 is dominated by that of \(\nabla _x^qf\), which makes the distinction between the various \(\epsilon \)-s less crucial.

Since global optimization is involved in the computation of the criticality measure \(\phi _{f,j}^\Delta (x)\), the algorithm discussed in the present paper remains, in general, of a theoretical nature. However, there may be cases where this computation is tractable for small enough \(\Delta \), for instance if the derivative tensors of the objective function are strongly structured. Such approaches may hopefully be of use for small-dimensional or structured highly nonlinear problems, such as those occurring in machine learning using deep learning techniques (see [1]).

The present framework for handling convex constraints is not free of limitations, resulting from our choice to transfer difficulties associated with the original problem to the subproblem solution, thereby sparing precious evaluations of f and its derivatives. In particular, the cost of evaluating any constraint function/derivative possibly defining the convex feasible set \(\mathcal{F}\) is neglected by the present approach, which must therefore be seen as a suitable framework to handle “cheap” inequality constraints such as simple bounds.

Questions of course arise from the results presented. The first is whether it is possible to extend the existing work (e.g., [10]) on bridging the gap between necessary and sufficient optimality conditions for orders one and two to higher orders, possibly by finding sufficient conditions ensuring (3.21) and by isolating problem classes where this constraint qualification automatically holds. From the complexity point of view, it is known that the cost of obtaining \(\epsilon \)-approximate first-order criticality for unconstrained and convexly constrained problems can be reduced to \(O(\epsilon ^{-(p+1)/p})\) evaluations if one is ready to define the step by using a regularization model of order \(p \ge 1\). In the unconstrained case, this was shown for \(p=2\) in [16, 47] and for general \(p \ge 1\) in [9], while the convexly constrained case was analysed (for \(p=2\)) in [17]. Whether this methodology and the associated improvements in evaluation complexity can be extended to criticality orders above one also remains open at this stage.