Abstract
High-order optimality conditions for convexly constrained nonlinear optimization problems are analysed. A corresponding (expensive) measure of criticality for arbitrary order is proposed and extended to define high-order \(\epsilon \)-approximate critical points. This new measure is then used within a conceptual trust-region algorithm to show that if derivatives of the objective function up to order \(q \ge 1\) can be evaluated and are Lipschitz continuous, then this algorithm applied to the convexly constrained problem needs at most \(O(\epsilon ^{-(q+1)})\) evaluations of f and its derivatives to compute an \(\epsilon \)-approximate qth-order critical point. This provides the first evaluation complexity result for critical points of arbitrary order in nonlinear optimization. An example is discussed, showing that the obtained evaluation complexity bounds are essentially sharp.
1 Introduction
Recent years have seen a growing interest in the analysis of the worst-case evaluation complexity of nonlinear (possibly non-convex) smooth optimization (for the non-convex case only, see [1, 5,6,7,8,9, 11, 14,15,16,17, 19,20,23, 26,27,29, 31, 32, 34,35,37, 41,42,44, 47, 51, 53,54,55] among others). In general terms, this analysis aims at giving (sometimes sharp) bounds on the number of evaluations of a minimization problem’s functions (objective and constraints, if relevant) and their derivatives that are, in the worst case, necessary for certain algorithms to find an approximate critical point for the unconstrained, convexly constrained or general nonlinear optimization problem. It is not uncommon that such algorithms may involve possibly extremely costly internal computations, provided the number of calls to the problem functions is kept as low as possible.
At variance with the convex case (see [3]), most of the research on the non-convex case to date focuses on finding first-, second- and third-order critical points. Evaluation complexity for first-order critical points was first investigated, for the unconstrained case, by Nesterov [46], and for first- and second-order points by Nesterov and Polyak [47] and by Cartis et al. [16]. Third-order critical points were studied in [1], motivated by highly nonlinear problems in machine learning. However, the analysis of evaluation complexity for orders higher than three lacks both concepts and results.
The purpose of the present paper is to improve on this situation in two ways. The first is to review optimality conditions of arbitrary orders \(q \ge 1\) for convexly constrained minimization problems, and the second is to describe a theoretical algorithm whose behaviour provides, for this class of problems, the first evaluation complexity bounds for such arbitrary orders of optimality.
The paper is organized as follows. After the present introduction, Sect. 2 discusses some preliminary results on tensor norms, a generalized Cauchy–Schwarz inequality and high-order error bounds from Taylor series. Section 3 investigates optimality conditions for convexly constrained optimization, while Sect. 4.1 proposes a trust-region-based minimization algorithm for solving this class of problems and analyses its evaluation complexity. An example is introduced in Sect. 5 to show that the new evaluation complexity bounds are essentially sharp. A final discussion is presented in Sect. 6.
2 Preliminaries
2.1 Basic Notations
In what follows, \(y^Tx\) denotes the Euclidean inner product of the vectors x and y of \(\mathbb {R}^n\) and \(\Vert x\Vert = (x^Tx)^{1/2}\) is the associated Euclidean norm. If \(T_1\) and \(T_2\) are tensors, \(T_1\otimes T_2\) is their tensor product. \(\mathcal{B}(x,\delta )\) denotes the closed ball of radius \(\delta \ge 0\) centred at x. If \(\mathcal{X}\) is a closed set, \(\partial \mathcal{X}\) denotes its boundary and \(\mathcal{X}^0\) its interior. The vectors \(\{e_i\}_{i=1}^n\) are the coordinate vectors in \(\mathbb {R}^n\). The notation \(\lambda _{\min }[M]\) stands for the leftmost eigenvalue of the symmetric matrix M. If \(\{a_k\}\) and \(\{b_k\}\) are two infinite sequences of non-negative scalars converging to zero, we say that \(a_k= o(b_k)\) if and only if \(\lim _{k \rightarrow \infty } a_k/b_k = 0\) and, more generally, \(a(\alpha ) = o(\alpha )\) if and only if \(\lim _{\alpha \rightarrow 0} a(\alpha )/\alpha = 0\). The normal cone to a general convex set \(\mathcal{C}\) at \(x\in \mathcal{C}\) is defined by
\[\mathcal{N}_\mathcal{C}(x) \mathop {=}\limits ^\mathrm{def}\{ v \in \mathbb {R}^n \mid v^T(y-x) \le 0 \text{ for all } y \in \mathcal{C}\}\]
and its polar, the tangent cone to \(\mathcal{C}\) at x, by
\[\mathcal{T}_\mathcal{C}(x) \mathop {=}\limits ^\mathrm{def}\{ d \in \mathbb {R}^n \mid d^Tv \le 0 \text{ for all } v \in \mathcal{N}_\mathcal{C}(x)\}.\]
Note that \(\mathcal{C}\subseteq x+\mathcal{T}_\mathcal{C}(x)\) for all \(x \in \mathcal{C}\). We also define \(P_{\mathcal{C}}[\cdot ]\) to be the orthogonal projection onto \(\mathcal{C}\). (See [25, Section 3.5] for a brief introduction to the relevant properties of convex sets and cones, or [39, Chapter 3] or [50, Part I] for an in-depth treatment.)
2.2 Tensor Norms and Generalized Cauchy–Schwarz Inequality
We will make substantial use of tensors and their norms in what follows and thus start by establishing some concepts and notation. The notation \(T[v_1,\ldots ,v_j]\) stands for the tensor of order \(q-j\) resulting from the application of the qth-order tensor T to the vectors \(v_1,\ldots ,v_j\); the (recursively induced) Euclidean norm \(\Vert \cdot \Vert _q\) on the space of qth-order tensors is then given by
\[\Vert T\Vert _q \mathop {=}\limits ^\mathrm{def}\max _{\Vert v_1\Vert = \cdots = \Vert v_q\Vert = 1} T[v_1,\ldots ,v_q]. \qquad \mathrm{(2.1)}\]
(Observe that this value is always non-negative since we can flip the sign of \(T[v_1,\ldots ,v_q]\) by flipping that of one of the vectors \(v_i\).)
Note that definition (2.1) implies that
\[\Vert T[v_1,\ldots ,v_j]\Vert _{q-j} \le \Vert T\Vert _q\, \Vert v_1\Vert \cdots \Vert v_j\Vert ,\]
a simple generalization of the standard Cauchy–Schwarz inequality for order-1 tensors (vectors) and of \(\Vert Mv\Vert \le \Vert M\Vert \,\Vert v\Vert \) which is valid for induced norms of matrices (order-2 tensors). Observe also that perturbation theory (see [40, Th. 7]) implies that \(\Vert T\Vert _q\) is continuous as a function of T.
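In the order-2 case, where the induced tensor norm coincides with the familiar spectral norm of a matrix, both instances of the inequality are easy to check numerically; a minimal sketch (the matrix and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))     # an order-2 tensor (a matrix)
norm_M = np.linalg.norm(M, 2)       # induced Euclidean (spectral) norm ||M||_2

for _ in range(200):
    u = rng.standard_normal(4)
    v = rng.standard_normal(4)
    # ||M[v]|| <= ||M||_2 ||v||  (application to one vector)
    assert np.linalg.norm(M @ v) <= norm_M * np.linalg.norm(v) + 1e-12
    # M[u, v] <= ||M||_2 ||u|| ||v||  (application to two vectors, CS-like)
    assert u @ M @ v <= norm_M * np.linalg.norm(u) * np.linalg.norm(v) + 1e-12
```

For higher-order tensors the same inequalities hold with \(\Vert \cdot \Vert _q\), but computing that norm exactly is itself a hard global optimization problem, which is why the matrix case is used here.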
If T is a symmetric tensor of order q, define the q-kernel of the multilinear q-form
\[T[v]^q \mathop {=}\limits ^\mathrm{def}T[\underbrace{v,\ldots ,v}_{q\ \text{times}}]\]
as
\[\ker ^q[T] \mathop {=}\limits ^\mathrm{def}\{ v \in \mathbb {R}^n \mid T[v]^q = 0 \}\]
(see [12, 13]). Note that, in general, \(\ker ^q[T]\) is a union of cones. For \(q=1\), the q-kernel is not only a union of cones but also a subspace. However, this is not true for general q-kernels: both \((0,1)^T\) and \((1,0)^T\) belong to the 2-kernel of the symmetric 2-form \(x_1x_2\) on \(\mathbb {R}^2\), but their sum does not.
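This non-subspace behaviour is immediate to verify; a minimal numerical check of the example just given:

```python
import numpy as np

# The symmetric 2-form T[v]^2 = v1*v2 on R^2 from the text.
T2 = lambda v: v[0] * v[1]

e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
assert T2(e1) == 0.0 and T2(e2) == 0.0   # both vectors lie in the 2-kernel
assert T2(e1 + e2) == 1.0                # but their sum does not
```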
We also note that, for symmetric tensors of odd order, \(T[v]^q = -T[-v]^q\) and thus that
where we used the symmetry of the unit ball with respect to the origin to deduce the second equality.
2.3 High-Order Error Bounds from Taylor Series
The tensors considered in what follows are symmetric and arise as high-order derivatives of the objective function f. For the pth derivative of a function \(f:\mathbb {R}^n \rightarrow \mathbb {R}\) to be Lipschitz continuous on the set \(\mathcal{S}\subseteq \mathbb {R}^n\), we require that there exists a constant \(L_{f,p}\ge 0\) such that, for all \(x,y \in \mathcal{S}\),
\[\Vert \nabla _x^p f(x) - \nabla _x^p f(y)\Vert _p \le L_{f,p}\, \Vert x-y\Vert ,\]
where \(\nabla _x^p f(x)\) is the pth-order symmetric derivative tensor of f at x.
Let \(T_{f,p}(x,s)\) denote the pth-order Taylor series approximation to \(f(x+s)\) at some \(x \in \mathbb {R}^n\) given by
\[T_{f,p}(x,s) \mathop {=}\limits ^\mathrm{def}f(x) + \sum _{i=1}^p \frac{1}{i!} \nabla _x^if(x)[s]^i\]
and consider the Taylor identity
involving a given univariate \(C^k\) function \(\phi (\alpha )\) and its kth-order Taylor approximation \(t_k(\alpha ) = \sum _{i=0}^k \phi ^{(i)}(0) \alpha ^i / i!\) expressed in terms of the ith derivatives \(\phi ^{(i)}\), \(i=1,\ldots ,k\). Let \(x,s \in \mathbb {R}^n\). Then, picking \(\phi (\alpha ) = f(x+\alpha s)\) and \(k = p\), it follows immediately from the fact that \(t_p(1) = T_{f,p}(x,s)\), the identity
(2.2), (2.4), (2.5) and (2.6) imply that, for all \(x,s \in \mathbb {R}^n\),
Similarly,
Inequalities (2.8) and (2.9) will be useful in our developments below, but immediately note that they in fact depend only on the weaker requirement that
for all x and s of interest, rather than relying on (2.4).
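These Taylor error bounds are easy to observe numerically. A sketch with \(f=\sin \), \(p=2\) and \(L_{f,2}=1\) (since \(|f'''| = |\cos | \le 1\)); the constants \(L_{f,p}/(p+1)!\) for the function error and \(L_{f,p}/p!\) for the gradient error are the standard ones and are assumed here to match (2.8) and (2.9):

```python
import math
import random

L = 1.0  # Lipschitz constant of f'' for f = sin, since |f'''| = |cos| <= 1
random.seed(0)
for _ in range(1000):
    x = random.uniform(-3.0, 3.0)
    s = random.uniform(-1.0, 1.0)
    # second-order Taylor model T_{f,2}(x,s) and its derivative in s
    T2 = math.sin(x) + math.cos(x) * s - math.sin(x) * s**2 / 2
    dT2 = math.cos(x) - math.sin(x) * s
    # |f(x+s) - T_{f,2}(x,s)| <= L/3! |s|^3   (function error)
    assert abs(math.sin(x + s) - T2) <= L / math.factorial(3) * abs(s)**3 + 1e-12
    # |f'(x+s) - d/ds T_{f,2}(x,s)| <= L/2! |s|^2   (derivative error)
    assert abs(math.cos(x + s) - dT2) <= L / math.factorial(2) * abs(s)**2 + 1e-12
```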
3 Unconstrained and Convexly Constrained Problems
The problem we wish to solve is formally described as
\[\min _{x \in \mathcal{F}} f(x), \qquad \mathrm{(3.1)}\]
where we assume that \(f:\mathbb {R}^n\longrightarrow \mathbb {R}\) is q-times continuously differentiable and bounded from below, and that f has Lipschitz continuous derivatives of orders 1 to q. We also assume that the feasible set \(\mathcal{F}\) is closed, convex and non-empty. Note that this formulation covers unconstrained optimization (\(\mathcal{F}= \mathbb {R}^n\)), as well as standard inequality (and linear equality) constrained optimization in its different forms: the set \(\mathcal{F}\) may be defined by simple bounds, and/or by polyhedral or more general convex constraints. We are tacitly assuming here that the cost of evaluating values and derivatives of the constraint functions possibly involved in the definition of \(\mathcal{F}\) is negligible.
3.1 High-Order Optimality Conditions
Given that our ambition is to work with high-order models, it seems natural to aim at finding high-order local minimizers. As is standard, we say that \(x_*\) is a local minimizer of f if and only if there exists a (sufficiently small) neighbourhood \(\mathcal{B}_*\) of \(x_*\) such that
\[f(x) \ge f(x_*) \text{ for all } x \in \mathcal{B}_* \cap \mathcal{F}. \qquad \mathrm{(3.2)}\]
However, we must immediately remember important intrinsic limitations. These are exemplified by the smooth two-dimensional problem
which is a simplified version of a problem stated by Hancock nearly a century ago [38, p. 36], itself a variation of a famous problem stated even earlier by Peano [49, Nos. 133–136]. The contour lines of its objective function are shown in Fig. 1.
The first conclusion which can be drawn by examining this example is that, in general, assessing that a given point x (the origin in this case) is a local minimizer needs more than verifying that every direction from this point is an ascent direction. Indeed, this latter property holds in the example, but the origin is not a local minimizer (it is a saddle point). This is caused by the fact that objective function decrease may occur along specific arcs starting from the point under consideration, and these arcs need not be lines (such as for \(\alpha \ge 0\) in the example).
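The same phenomenon already occurs for the polynomial ancestor of this example, Peano's problem [49]; assuming its classical form \(f(x_1,x_2) = (x_2 - x_1^2)(x_2 - 2x_1^2)\), a small numerical check shows that f increases along every line through the origin and yet decreases along a parabolic arc:

```python
import numpy as np

# Peano's classical example (assumed form): f(x1,x2) = (x2 - x1^2)(x2 - 2 x1^2)
f = lambda x1, x2: (x2 - x1**2) * (x2 - 2 * x1**2)

t = 1e-3
# along every sampled line through the origin, f is positive near 0 ...
for theta in np.linspace(0.0, 2 * np.pi, 360, endpoint=False):
    a, b = np.cos(theta), np.sin(theta)
    assert f(t * a, t * b) > 0.0
# ... yet f decreases along the feasible arc x2 = 1.5 x1^2
assert f(t, 1.5 * t**2) < 0.0
```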
The second conclusion is that the characterization of a local minimizer cannot always be translated into a set of conditions only involving the Taylor expansion of f at \(x_*\). In our example, the difficulty arises because all the coefficients of the Taylor expansion of \(e^{-1/x_1^2}\) vanish at the origin and, therefore, the (non-)minimizing nature of this point cannot be determined from the values of these coefficients. Thus, the gap between necessary and sufficient optimality conditions cannot be closed if one restricts one’s attention to using derivatives of the objective function at a putative solution of problem (3.1).
Note that worse situations may also occur, for instance if we consider the following variation on Hancock’s simplified example (3.3):
for which no continuous descent arc exists in a neighbourhood of the origin despite the origin not being a local minimizer.
3.1.1 Necessary Conditions for Convexly Constrained Problems
The above examples show that fully characterizing a local minimizer in terms of general continuous descent arcs is in general impossible. However, the fact that no such arc exists remains a necessary condition for such points, even if Hancock’s example shows that these arcs may not be amenable to a characterization using arc derivatives. In what follows, we therefore propose derivative-based necessary optimality conditions by focussing on a specific (yet reasonably general) class of descent arcs \(x(\alpha )\) of the form
\[x(\alpha ) = x + \sum _{i=1}^q \alpha ^i s_i + o(\alpha ^q), \qquad \mathrm{(3.5)}\]
where \(\alpha > 0\). Such an arc-based approach was used by several authors for first- and second-order conditions (see [4, 10, 24, 33] for example). Note that, if \(s_{i_0}\) is the first nonzero \(s_i\) in the sum in the right-hand side of (3.5) (if any), we may redefine \(\alpha \) to be \(\alpha \Vert s_{i_0}\Vert ^{1/i_0}\) without modifying the arc, so that we may assume, without loss of generality, that \(\Vert s_{i_0}\Vert = 1\) whenever \((s_1, \ldots , s_q)\ne (0, \ldots , 0)\).
Define the qth-order descriptor set of \(\mathcal{F}\) at x by
Note that \(\mathcal{D}_\mathcal{F}^q(x)\) is closed and always contains \((0, \ldots ,0)\), and that \(\mathcal{D}_\mathcal{F}^1(x) = \mathcal{T}_\mathcal{F}(x)\), the standard tangent cone to \(\mathcal{F}\) at x. Moreover, \(\mathcal{D}_\mathcal{F}^2(x) = \mathcal{T}_\mathcal{F}(x) \times 2 \mathcal{T}^2_\mathcal{F}(x)\), where \(\mathcal{T}^2_\mathcal{F}(x)\) is the inner second-order tangent set to \(\mathcal{F}\) at x, as defined in [10]. For example, if \(\mathcal{F}= \{ (x_1,x_2) \in \mathbb {R}^2 \mid x_2 \ge |x_1|^3 \}\), then one verifies that \(\mathcal{D}_\mathcal{F}^3(0) = \left[ \mathbb {R}\times \mathbb {R}_+\right] ^2 \times \left[ \mathbb {R}\times [1, \infty ) \right] \). We say that a feasible arc \(x(\alpha )\) is tangent to \(\mathcal{D}_\mathcal{F}^q(x)\) if (3.5) holds for some \((s_1,\ldots ,s_q) \in \mathcal{D}_\mathcal{F}^q(x)\).
Note that definition (3.6) implies that
where \(s_u\) is the first nonzero \(s_\ell \).
We now consider some conditions that preclude the existence of feasible descent arcs of the form (3.5). These conditions involve the index sets \(\mathcal{P}(j,k)\) defined, for \(k \le j\), by
\[\mathcal{P}(j,k) \mathop {=}\limits ^\mathrm{def}\left\{ (\ell _1,\ldots ,\ell _k) \in \mathbb {N}^k \;\Big |\; \ell _i \ge 1 \text{ for } i=1,\ldots ,k \text{ and } \sum _{i=1}^k \ell _i = j \right\} . \qquad \mathrm{(3.8)}\]
For \(k \le j \le 4\), these are given in Table 1.
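Table 1 is not reproduced here, but the properties used in the sequel (\(\mathcal{P}(2,1)=\{(2)\}\), and the permutations of (1, 2, 2, 2) lying in \(\mathcal{P}(7,4)\)) are consistent with \(\mathcal{P}(j,k)\) being the set of k-tuples of positive integers summing to j; a sketch under that assumption, where the helper `P` is hypothetical:

```python
from itertools import product

def P(j, k):
    """Assumed reconstruction of (3.8): all k-tuples of positive
    integers (l_1, ..., l_k) with l_1 + ... + l_k = j."""
    return [t for t in product(range(1, j - k + 2), repeat=k) if sum(t) == j]

assert P(2, 1) == [(2,)]
assert P(2, 2) == [(1, 1)]
assert sorted(P(3, 2)) == [(1, 2), (2, 1)]
# permutations of (1,2,2,2) indeed appear in P(7,4), as used later in the text
assert (1, 2, 2, 2) in P(7, 4) and (2, 2, 2, 1) in P(7, 4)
```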
We now state necessary conditions for \(x_*\) to be a local minimizer.
Proof
Consider an arbitrary feasible arc of the form (3.5). Substituting this relation in the expression \(f(x(\alpha )) \ge f(x_*)\) (given by (3.2)) and collecting terms of equal degree in \(\alpha \), we obtain that, for sufficiently small \(\alpha \),
where
with \(\mathcal{P}(i,k)\) defined in (3.8). For this to be true, we need each coefficient of \(\alpha ^j\) to be non-negative on the zero set of the coefficients \(1, \ldots , j-1\), subject to the requirement that the arc (3.5) must be feasible for \(\alpha \) sufficiently small, that is \(x(\alpha ) \in \mathcal{D}_\mathcal{F}^j(x_*)\). First consider the case where \(j=1\) (in which case (3.10) is void). The fact that the coefficient of \(\alpha \) in (3.11) must be non-negative implies that \(\nabla _x^1f(x_*)[s_1] \ge 0\) for all \(s_1 \in \mathcal{T}_\mathcal{F}(x_*)\), which proves (3.9) for \(j=1\). Assume now that \(s_1 \in \mathcal{T}_\mathcal{F}(x_*)\) and that (3.10) holds for \(i=1\). This latter condition requests \(s_1\) to be in the zero set of the coefficient in \(\alpha \) in (3.11), that is
Then the coefficient of \(\alpha ^2\) in (3.11) must be non-negative, which yields, using \(\mathcal{P}(2,1) = \{(2)\}\), \(\mathcal{P}(2,2)= \{(1,1)\}\) (see Table 1), that
which is (3.9) for \(j=2\).
We may then proceed in the same manner for all coefficients from order \(j=3\) up to q, each time considering them in the zero set of the previous coefficients (that is (3.10)), and verify that (3.11) directly implies (3.9). \(\square \)
Following a long tradition, we say that \(x_*\) is a qth-order critical point for problem (3.1) if the conclusions of this theorem hold for \(j \in \{1, \ldots , q \}\). Of course, a qth-order critical point need not be a local minimizer, but every local minimizer is a qth-order critical point. This theorem states conditions for qth-order criticality for smooth problems which are only necessary because not every feasible arc needs to be tangent to \(\mathcal{D}_\mathcal{F}^q(x_*)\), depending on the geometry of the feasible set in the neighbourhood of \(x_*\).
Note that, as the order j grows, (3.9) may be interpreted as imposing a condition on \(s_j\) (via \(\nabla _x^1f(x_*)[s_j]\)), given the directions \(\{s_i\}_{i=1}^{j-1}\) satisfying (3.10).
In more general situations, the fact that conditions (3.9) and (3.10) not only depend on the behaviour of the objective function in some well-chosen subspace, but involve the geometry of all possible feasible arcs, makes the second-order condition (3.13) difficult to use, particularly in the case where \(\mathcal{F}\subset \mathbb {R}^n\). In what follows we discuss, as far as we currently can, two resulting questions of interest.
1. Are there cases where these conditions reduce to checking homogeneous polynomials involving the objective function’s derivatives on a subspace?
2. If that is not the case, are there circumstances in which not only the complete left-hand side of (3.10) vanishes, but also each term of this left-hand side?
We start by deriving useful consequences of Theorem 3.1.
Proof
The fact that (3.9) for \(j=1\) reduces to \(\nabla _x^1f(x_*)[s_1] \ge 0\) for all \(s_1 \in \mathcal{T}_\mathcal{F}(x_*)\) implies that (3.14) holds. Also note that (3.9) and (3.10) impose that
which, because of (3.14) and the polarity of \(\mathcal{N}_\mathcal{F}(x_*)\) and \(\mathcal{T}_\mathcal{F}(x_*)\), yields that \(s_1\) belongs to \(\partial \mathcal{T}_\mathcal{F}(x_*)\). Assume now that \(s_2 \not \in \mathcal{T}_\mathcal{F}(x_*)\). Then, for all \(\alpha \) sufficiently small, \(\alpha s_1 + \alpha ^2 s_2\) does not belong to \(\mathcal{T}_\mathcal{F}(x_*)\) and thus \(x(\alpha )=x_* + \alpha s_1 + \alpha ^2 s_2 + o(\alpha ^2)\) cannot belong to \(\mathcal{F}\), which is a contradiction. Hence, \(s_2 \in \mathcal{T}_\mathcal{F}(x_*)\) and (3.15) follows for \(i=2\), while it follows from \(s_1 \in \mathcal{T}_\mathcal{F}(x_{*})\) and (3.14) for \(i=1\). \(\square \)
The first-order necessary condition (3.14) is well known for general first-order minimizers (see [48, Th. 12.9, p. 353] for instance).
Consider now the second-order conditions (3.13). If \(\mathcal{F}= \mathbb {R}^n\) (or if the convex constraints are inactive at \(x_*\)), then \(\nabla _x^1f(x_*) = 0\) because of (3.14), and (3.13) is nothing but the familiar condition that the Hessian of the objective function must be positive semi-definite. If \(x_*\) happens to lie on the boundary of \(\mathcal{F}\) and \(\nabla _x^1f(x_*) \ne 0\), (3.13) indicates that the effect of the curvature of the boundary of \(\mathcal{F}\) may be represented by the term \(\nabla _x^1f(x_*)[s_2]\), which is non-negative because of (3.15). Consider, for example, the problem
whose global solution is at the origin. In this case, it is easy to check that \(-\nabla _x^1f(0)= -e_1 \in \mathcal{N}_\mathcal{F}(0) = \text {span}\left\{ -e_1 \right\} \), that \(\nabla _x^2 f(0)=0\), and that second-order feasible arcs of the form (3.5) with \(x(0)=0\) may be chosen with \(s_1 = \pm e_2\) and \(s_2= \beta e_1\) where . This imposes \(\nabla _x^2f(0)[s_1]^2 \ge -1\), which (unsurprisingly) holds.
Interestingly, there are cases where the geometry of the set of locally feasible arcs is simple and manageable. In particular, suppose that the boundary of \(\mathcal{F}\) is locally polyhedral. Then, given \(\nabla _x^1f(x_*)\), either \(\mathcal{T}_\mathcal{F}(x_*) \cap \text {span}\left\{ \nabla _x^1f(x_*) \right\} ^\perp = \{0\}\), in which case conditions (3.9) and (3.10) are void, or there exists \(d \ne 0 \) in that intersection. It is then possible to define a locally feasible arc with \(s_1= d\) and \(s_2 = \cdots = s_q= 0\). As a consequence, the smallest possible value of \(\nabla _x^1f(x_*)[s_2]\) for feasible arcs starting from \(x_*\) is identically zero and this term therefore vanishes from (3.9) and (3.10). Moreover, because of the definition of \(\mathcal{P}(k,j)\) (see Table 1), all terms but that in \(\nabla _x^jf(x_*)[s_1]^j\) also vanish from these conditions, which then simplify to
for \(j=2, \ldots , q\), which is a condition only involving subspaces and (for \(i \ge 2\)) cones. Analysis for first- and second orders in the polyhedral case can be found in [2, 30, 52] for instance. Further discussion of second-order (both necessary and sufficient) conditions for the more general problem can be found in [10] and the references therein.
3.1.2 Necessary Conditions for Unconstrained Problems
Consider now the case where \(x_*\) belongs to \(\mathcal{F}^0\), which is obviously the case if the problem is unconstrained. Then we have that \(\mathcal{D}_\mathcal{F}^q(x_*) = \mathbb {R}^{n \times q}\), and one is then free to choose the vectors \(\{s_i\}_{i=1}^q\) (and their sign) arbitrarily. Note first that, since \(\mathcal{N}_\mathcal{F}(x_*)= \{0\}\), (3.14) implies that, unsurprisingly,
For the second-order condition, we obtain from (3.9), again unsurprisingly, that, because \(\ker ^1[\nabla _x^1f(x_*)] = \mathbb {R}^n\),
Hence, if there exists a vector \(s_1 \in \ker ^2[\nabla _x^2f(x_*)]\), we have that \(\nabla _x^2f(x_*)[s_1] = 0\) and therefore that \(\nabla _x^2f(x_*)[s_1,s_2] = 0\) for all \(s_2 \in \mathbb {R}^n\). Thus, the term for \(k=1\) vanishes from (3.9), as well as all terms involving \(\nabla _x^2f(x_*)\) applied to a vector \(s_1 \in \ker ^2[\nabla _x^2f(x_*)]\). This implies in particular that the third-order condition may now be written as
where the equality is obtained by considering both \(s_1\) and \(-s_1\).
Unfortunately, complications arise with fourth-order conditions, even when the objective function is a polynomial. Consider the following variant of Peano’s [49] problem:
where \(\kappa _1\) and \(\kappa _2\) are parameters. Then one can verify that
and
Hence,
The necessary condition (3.9) then states that, if the origin is a minimizer, then, using the arc defined by \(s_1=e_1\) and and the fact that \(\mathcal{P}(4,3)\) contains three elements,
This shows that the condition \(\nabla _x^4f(x_*)[s_1]^4\ge 0\) on \(\cap _{i=1}^3 \ker ^i[\nabla _x^if(x_*)]\), although necessary, is arbitrarily far away from the weaker necessary condition
when \(\kappa _1\) grows. As was already the case for problem (3.3), the example for \(\kappa _1=1\) and \(\kappa _2=2\), say, shows that a function may admit a saddle point (\(x_*=0\)) which is a maximum along an arc while at the same time being minimal along every line passing through \(x_*\). Figure 2 shows the contour lines of the objective function of (3.19) for increasing values of \(\kappa _2\), keeping \(\kappa _1 = 3\).
One may attribute the problem that not every term in (3.9) vanishes to the fact that switching the signs of \(s_1\) or \(s_2\) does not imply that any of the terms in (3.20) is zero (as we have verified), because of the terms \(\nabla _x^2f(0)[s_2]^2\) and \(\nabla _x^4f(0)[s_1]^4\). Is this a feature of even orders only? Unfortunately, this is not the case, as \(q=7\) shows. Indeed it is not difficult to verify that the terms whose multi-index \((\ell _1,\ldots ,\ell _k)\) is a permutation of (1, 2, 2, 2) belong to \(\mathcal{P}(7,4)\) and those whose multi-index is a permutation of (1, 1, 1, 1, 1, 2) belong to \(\mathcal{P}(7,6)\). Moreover, the contribution of these terms to the sum (3.9) cannot be distinguished by varying \(s_1\) or \(s_2\), for instance by switching their signs, as this technique yields only one equality in two unknowns. In general, we may therefore conclude that (3.9) must involve a mixture of terms with derivative tensors of various degrees.
3.1.3 Sufficient Conditions for Isolated Local Minimizers
Despite the limitations we have seen when considering the simplified Hancock example, we may still derive a sufficient condition for \(x_*\) to be an isolated minimizer, which is inspired by the standard second-order case (see Theorem 2.4 in Nocedal and Wright [48] for instance). This condition requires a constraint qualification in that the feasible set in the neighbourhood of \(x_*\) is required to be completely described by the arcs of the form (3.5) for small \(\alpha \).
Proof
Consider any \(\delta _2 \in (0, \delta ]\) and, using the fact that \(\mathcal{F}\ne \{x_*\}\), an arbitrary \(y \in \mathcal{F}\cap \partial \mathcal{B}(x_*,\delta _2) \subseteq \mathcal{A}_\mathcal{F}^q(x_*, \delta )\), where we used (3.21) to obtain the last inclusion. Thus, there exists at least one arc \(x(\alpha )\) of the form (3.5) which is tangent to \(\mathcal{D}_\mathcal{F}^q\) (with associated nonzero \((s_1,\ldots ,s_j)\)) and a smallest \(\alpha _y \ge 0\) such that \(x(\alpha _y)=y\). For any such arc, let m be the smallest integer such that \(c_m \ne 0\), where \(c_j\) is defined by (3.12). The relations (3.9), (3.10) and (3.22) then imply that
and (3.22) also ensures that \(m \in \{1, \ldots , q \}\). Now choose such an arc \(x(\alpha )\) with maximal m. From Taylor’s theorem, using (3.11) to obtain the form of the derivatives along the arc \(x(\alpha )\), we have that
for some \(\tau \in [0,1]\), where we used our assumption \(c_j=0\) for \(j=1,\ldots ,m-1\) to deduce the second equality. Observe that \(\Vert x(\tau \alpha _y)-x_*\Vert \le \delta _2\) because \(\alpha _y\) is the smallest \(\alpha \) such that \(\Vert x(\alpha ) - x_*\Vert =\delta _2\). Hence, we may choose \(\delta _2\) small enough to ensure, by continuity and (3.12), (3.23) and (3.24), that \(f(y) - f(x_*) = f(x(\alpha _y)) - f(x_*)> 0\). This proves the theorem since y is chosen arbitrarily in a sufficiently small feasible neighbourhood of \(x_*\). \(\square \)
Note that the condition \(\mathcal{F}\ne \{x_*\}\) may be viewed as a form of Slater condition, and also that \(x_*\) is obviously a local isolated minimizer if it fails.
If we now return to our examples, we see that Theorem 3.3 excludes that the origin is a local minimizer of example (3.19) with \(\kappa _1=1\) and \(\kappa _2=2\), since the arc must be considered in (3.21). The origin is not a local minimizer for either problem (3.3) or (3.4), since (3.22) fails for any q because the Taylor series of f is identically zero along the first coordinate axis (which defines two admissible arcs \(x(\alpha ) = \pm \alpha e_1\)).
Of course the assumptions of Theorem 3.3 may be difficult to check in general, but may be tractable in some cases. Assume for instance that \(\mathcal{F}\) is polyhedral. Then, for sufficiently small \(\delta \), \(\mathcal{A}_\mathcal{F}^q(x_*,\delta ) \subset \mathcal{T}_\mathcal{F}(x_*)\) and we may use half lines originating at \(x_*\) to define feasible arcs. This is the inspiration for the following less general but easier-to-verify alternative to Theorem 3.3.
Proof
If \(\mathcal{T}_\mathcal{F}(x_*)\) is reduced to the origin, then the inclusion \(\mathcal{F}\subseteq x_*+\mathcal{T}_\mathcal{F}(x_*)\) implies that \(\mathcal{F}= \{x_*\}\) and \(x_*\) is therefore an isolated minimizer. Let us therefore assume that there exists a nonzero \(s \in \mathcal{T}_\mathcal{F}(x_*)\). The second part of condition (3.25) and the continuity of the \((q+1)\)-th derivative then imply that
for all z in a sufficiently small feasible neighbourhood of \(x_*\). Now, using Taylor’s expansion, we obtain that, for all \(s \in \mathcal{T}_\mathcal{F}(x_*)\) and all \(\tau \in (0,1)\),
for some \(z \in [x_*, x_*+\tau s]\). If \(\tau \) is sufficiently small, then this equality, the first part (3.25) and (3.26) ensure that \(f(x_*+\tau s) >f(x_*)\). Since this strict inequality holds for an arbitrary nonzero \(s \in \mathcal{T}_\mathcal{F}(x_*) \supseteq \mathcal{F}- x_*\) and all \(\tau \) sufficiently small, \(x_*\) must be a feasible isolated minimizer. \(\square \)
Observe that, in Peano’s example (see (3.19) with \(\kappa _1=3\) and \(\kappa _2=2\)), we have that the curvature of the objective function is positive along every line passing through the origin, but that the order of the curvature varies with s (second order along \(s=e_2\) and fourth order along \(s=e_1\)), which precludes applying Theorem 3.3. Also note that, when \(q=2\), weaker sufficient conditions (exploiting the structure of \(\mathcal{D}_\mathcal{F}^2(x_*)\) to a larger extent) are known for several classes of problems, including semi-definite optimization (see [10] for details).
3.1.4 An Approach Using Taylor Models
As already noted, the conditions expressed in Theorem 3.1 may, in general, be very complicated to verify in an algorithm, due to their dependence on the geometry of the set of feasible arcs. To avoid this difficulty, we now explore a different approach. Let the symbol “globmin” represent global minimization and define, for some \(\Delta \in (0,1]\) and some \(j \in \{1, \ldots , p \}\),
\[\phi _{f,j}^\Delta (x) \mathop {=}\limits ^\mathrm{def}f(x) - \mathop {\mathrm{globmin}}_{x+s \in \mathcal{F},\, \Vert s\Vert \le \Delta } T_{f,j}(x,s), \qquad \mathrm{(3.27)}\]
where the global minimum in the right-hand side is the smallest value of the jth-order Taylor model \(T_{f,j}(x,s)\) achievable by a feasible point at distance at most \(\Delta \) from x. Note that \(\phi _{f,j}^\Delta (x)\) is a continuous function of x and \(\Delta \) for given \(\mathcal{F}\) and f (see [40, Th. 7]). The introduction of this quantity is in part motivated by the following theorem.
Proof
We start by rewriting the power series (3.11) for degree j, for any given arc \(x(\alpha )\) tangent to \(\mathcal{D}_\mathcal{F}^j(x)\) in the form
where \(s(\alpha )\mathop {=}\limits ^\mathrm{def}x(\alpha )-x\), \(c_i\) is defined by (3.12) and where the last equality holds because f and \(T_{f,j}\) share the first j derivatives at x. This reformulation allows us to write that, for \(i \in \{1, \ldots , j \}\),
Assume now there exists an \((s_1,\ldots ,s_j) \in \mathcal{Z}_\mathcal{F}^{f,j}(x)\) such that (3.9) does not hold. In the notation just introduced, this means that, for this particular \((s_1,\ldots ,s_j)\),
Then, from (3.29),
and thus the first \((j-1)\) coefficients of the polynomial \(T_{f,j}(x,s(\alpha )) - f(x)\) vanish. Thus, using (3.28),
Now let \(i_0\) be the index of the first nonzero \(s_i\). Note that \(i_0 \in \{1, \ldots , j \}\) since otherwise the structure of the sets \(\mathcal{P}(i,k)\) implies that \(c_j=0\). Observe also that we may redefine the parameter \(\alpha \) as \(\alpha \Vert s_{i_0}\Vert ^{1/i_0}\) so that we may assume, without loss of generality that \(\Vert s_{i_0}\Vert =1\). As a consequence, we obtain that, for sufficiently small \(\alpha \),
Hence, successively using the facts that \(c_j <0\), that (3.29) and (3.31) hold for all arcs \(x(\alpha )\) tangent to \(\mathcal{D}_\mathcal{F}^q(x)\), and that (3.32) and (3.27) hold, we may deduce that
The conclusion of the theorem immediately follows since \(\lim _{\Delta \rightarrow 0} \frac{\phi _{f,j}^\Delta (x)}{\Delta ^j}=0\). \(\square \)
This theorem has a useful consequence.
Proof
We successively apply Theorem 3.5 q times and deduce that x is a jth-order critical point for \(j = 1,\ldots ,q\). \(\square \)
This last result says that we may avoid the difficulty of dealing with the possibly complicated geometry of \(\mathcal{D}_\mathcal{F}^q(x)\) if we are ready to perform the global optimization occurring in (3.27) exactly and find a way to compute or overestimate the limit in (3.33). Although this is a positive conclusion, these two challenges remain daunting. However, it is worth noting that the standard approach to computing first-, second- and third-order criticality measures for unconstrained problems follows exactly the same approach. In the first-order case, it is easy to verify that
where the first equality is justified by the convexity of \(\nabla _x^1f(x)[d]\) as a function of d. Because the left-hand side of the above relation is independent of \(\Delta \), the computation of the limit (3.33) for \(\Delta \) tending to zero is trivial when \(j=1\) and the limiting value is \(\Vert \nabla _xf(x)\Vert \). For the second-order case, assuming \(\Vert \nabla _x^1f(x)\Vert =0\),
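In the unconstrained first-order case, this global minimization is explicit: the linear model \(\nabla _x^1f(x)[d]\) is minimized over the ball \(\Vert d\Vert \le \Delta \) at \(d = -\Delta \nabla _x^1f(x)/\Vert \nabla _x^1f(x)\Vert \), so the measure equals \(\Delta \Vert \nabla _xf(x)\Vert \). A small numerical confirmation, with a random vector standing in for the gradient:

```python
import numpy as np

rng = np.random.default_rng(1)
g = rng.standard_normal(3)            # stands for the gradient of f at x
Delta = 0.5

# minimizer of the linear Taylor model over the ball ||d|| <= Delta
d_star = -Delta * g / np.linalg.norm(g)
phi_1 = -(g @ d_star)                 # = f(x) - min T_{f,1}(x,d)

# closed form: the first-order measure equals Delta * ||grad f(x)||
assert np.isclose(phi_1, Delta * np.linalg.norm(g))

# no sampled feasible d achieves a smaller model value than d_star
D = rng.standard_normal((10000, 3))
D *= Delta / np.linalg.norm(D, axis=1, keepdims=True)
assert (D @ g).min() >= g @ d_star - 1e-12
```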
the first global optimization problem being easily solvable by a trust-region-type calculation [25, Section 7.3] or directly by an equivalent eigenvalue analysis. As for the first-order case, the left-hand side of the equation is independent of \(\Delta \) and obtaining the limit for \(\Delta \) tending to zero is trivial.
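In this second-order case (with \(\nabla _x^1f(x)=0\)), the global minimum of the pure quadratic model over the ball is \(\frac{1}{2}\Delta ^2 \min (\lambda _{\min }[\nabla _x^2f(x)],0)\), attained along the leftmost eigenvector; a sketch with a random symmetric matrix standing in for the Hessian:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
H = A + A.T                           # symmetric matrix standing in for the Hessian
Delta = 0.3

lam, V = np.linalg.eigh(H)
lam_min, v_min = lam[0], V[:, 0]      # leftmost eigenpair
# eigenvalue formula for min of 0.5 s^T H s over ||s|| <= Delta
model_min = 0.5 * Delta**2 * min(lam_min, 0.0)

q = lambda s: 0.5 * s @ H @ s
# the leftmost eigenvector scaled to the boundary attains the formula's value
assert np.isclose(q(Delta * v_min), 0.5 * Delta**2 * lam_min)
# and no sampled point of the ball does better
S = rng.standard_normal((10000, 4))
S *= rng.uniform(0.0, Delta, (10000, 1)) / np.linalg.norm(S, axis=1, keepdims=True)
assert min(q(s) for s in S) >= model_min - 1e-12
```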
Finally, if \(\mathcal{M}(x) \mathop {=}\limits ^\mathrm{def}\ker [\nabla _x^1f(x)] \cap \ker [\nabla _x^2f(x)]\) and \(P_{\mathcal{M}(x)}\) is the orthogonal projection onto that subspace,
where the first equality results from (2.1). In this case, the global optimization in the subspace \(\mathcal{M}(x)\) is potentially harder to solve exactly (a randomization argument is used in [1] to derive an upper bound on its value), although it still involves a subspace.
While we are unaware of a technique for making the global minimization in (3.27) easy in the even more complicated general case, we may think of approximating the limit in (3.33) by choosing a (user-supplied) value of \(\Delta >0\) small enoughFootnote 5 and consider the size of the quantity
Unfortunately, it is easy to see that, if \(\Delta \) is fixed at some positive value, a zero value of \(\phi _{f,j}^\Delta (x)\) alone is not a necessary condition for x being a local minimizer. Indeed consider the univariate problem of minimizing \(f(x)=x^2(1-\alpha x)\) for \(\alpha >0\). One verifies that, for any \(\Delta > 0\), the choice \(\alpha = 2/\Delta \) yields that
despite 0 being a local (but not global) minimizer. As a matter of fact, \(\phi _{f,j}^\Delta (x)\) gives more information than the mere potential proximity of a jth-order critical point: it is able to see beyond an infinitesimal neighbourhood of x and provides information on possible further descent beyond such a neighbourhood. Rather than a true criticality measure, it can be considered, for fixed \(\Delta \), as an indicator of further progress, but its use for terminating at a local minimizer is clearly imperfect.
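The univariate counterexample above is easy to verify numerically: with \(\alpha = 2/\Delta\), a brute-force maximization of \(f(0) - f(d)\) over \(|d| \le \Delta\) returns \(\Delta^2 > 0\) even though 0 is a strict local minimizer. A sketch, assuming the order is high enough (\(j \ge 3\)) that the Taylor model is exact:

```python
import numpy as np

# f(x) = x^2 (1 - alpha*x) with alpha = 2/Delta: x = 0 is a strict local
# (but not global) minimizer since f''(0) = 2 > 0, yet the measure at 0 is
# Delta^2 > 0, because it "sees" the descent at d = Delta beyond the basin.
def phi_at_zero(delta, n=100001):
    alpha = 2.0/delta
    d = np.linspace(-delta, delta, n)
    return np.max(-(d**2)*(1.0 - alpha*d))   # f(0) - T_{f,3}(0,d), T exact

for delta in (1.0, 0.5, 0.01):
    assert abs(phi_at_zero(delta) - delta**2) < 1e-6
print("phi_{f,3}^Delta(0) = Delta^2 despite 0 being a local minimizer")
```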
Despite this drawback, the above arguments would suggest that it is reasonable to consider a (conceptual) minimization algorithm whose objective is to find a point \(x_\epsilon \) such that
for some \(\Delta \in (0,1]\) sufficiently small and some \(q \in \{1, \ldots , p \}\). This condition implies an approximate minimizing property which we make more precise by the following result.
Proof
Consider \(x+d \in \mathcal{F}\). Using the triangle inequality, we have that
Now, condition (3.38) for \(j=q\) implies that, if \(\Vert d\Vert \le \Delta \),
Hence, substituting (2.9) and (3.41) into (3.40), using the assumed Lipschitz continuity of \(\nabla _x^q f\) and remembering again that \(\Vert d\Vert \le \Delta < 1\), we deduce that
and the desired result follows. \(\square \)
The size of the neighbourhood of \(x_\epsilon \) where f is “locally smallest”—in that the first part of (3.39) holds—therefore increases with the criticality order q, a feature potentially useful in various contexts such as global optimization.
Before turning to more algorithmic aspects, we briefly compare the results of Theorem 3.7 with what can be deduced on the local behaviour of the Taylor series \(T_{f,q}(x_*,s)\) if, instead of requiring the necessary condition (3.9) to hold exactly, this condition is relaxed to
while insisting that (3.10) should hold exactly. If \(j=q=1\), it is easy to verify that (3.42) for \(s_1 \in \mathcal{T}_\mathcal{F}(x_*)\) is equivalent to the condition that
from which we deduce, using the Cauchy–Schwarz inequality, that
for all \(s \in \mathcal{T}_\mathcal{F}(x_{*})\) with \(\Vert s\Vert \le \Delta \), that is (3.38) for \(j=1\). Thus, by Theorem 3.7, we obtain that (3.39) holds for \(j=1\).
4 Evaluation Complexity of Finding Approximate High-Order Critical Points
4.1 A Trust-Region Minimization Algorithm
Aware of the optimality conditions and their limitations, we may now consider an algorithm to achieve (3.38). This objective naturally suggests a trust-regionFootnote 6 formulation with adaptive model degree, in which the user specifies a desired criticality order q, assuming that derivatives of order \(1, \ldots , q\) are available when needed. We make this idea explicit in Algorithm 4.1.
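For intuition only, a heavily simplified instance of such a conceptual trust-region method can be sketched for the unconstrained case with \(q = 1\); the constants \(\eta_1, \eta_2\) and update factors below are illustrative placeholders, not those specified in Algorithm 4.1:

```python
import numpy as np

def conceptual_tr(f, grad, x, eps, Delta=1.0, Delta_max=1.0,
                  eta1=0.1, eta2=0.9, gamma_dec=0.5, gamma_inc=2.0,
                  max_iter=10000):
    """Hedged sketch of a first-order (q = 1) conceptual trust-region method:
    minimize the Taylor model in the ball of radius Delta, accept the step
    when the achieved/predicted decrease ratio rho is large enough, and stop
    when phi_{f,1}^Delta(x) <= eps*Delta."""
    for _ in range(max_iter):
        g = grad(x)
        phi = Delta*np.linalg.norm(g)      # = phi_{f,1}^Delta(x) when F = R^n
        if phi <= eps*Delta:               # termination test, order j = 1
            return x, Delta
        s = -Delta*g/np.linalg.norm(g)     # global model minimizer in the ball
        rho = (f(x) - f(x + s))/phi        # model decrease is exactly phi here
        if rho >= eta1:                    # (very) successful: accept the step
            x = x + s
            if rho >= eta2:
                Delta = min(gamma_inc*Delta, Delta_max)
        else:                              # unsuccessful: shrink the radius
            Delta = gamma_dec*Delta
    return x, Delta

f = lambda x: float(x @ x)                 # simple convex test problem
grad = lambda x: 2.0*x
x_eps, _ = conceptual_tr(f, grad, np.array([2.0, -1.0]), eps=1e-6)
print(np.linalg.norm(grad(x_eps)))         # small: approximate critical point
```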
We first state a useful property of Algorithm 4.1, which ensures that a fixed fraction of the iterations \(1, 2, \ldots , k\) must be either successful or very successful. Indeed, if we define
the following bound holds.
Proof
The trust-region update (4.2) ensures that
where \(\mathcal{U}_k = \{1, \ldots , k \} \setminus \mathcal{S}_k\). This inequality then yields (4.3) by taking logarithms and using that \(|\mathcal{S}_k| \ge 1\) and \(k = |\mathcal{S}_k| + |\mathcal{U}_k|\). \(\square \)
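The counting argument in this proof can be illustrated on a synthetic run, under assumed update factors \(\gamma_1 = 1/2\) (unsuccessful) and \(\gamma_2 = 2\) (successful, capped at \(\Delta_{\max}\)); these constants are placeholders, not those of (4.2):

```python
import math
import numpy as np

# Hedged sketch of the counting argument: with Delta <- gamma1*Delta on
# unsuccessful iterations (gamma1 < 1) and Delta <- min(gamma2*Delta,
# Delta_max) on successful ones (gamma2 > 1), one has
# Delta_k <= Delta_1 * gamma1^{|U_k|} * gamma2^{|S_k|}; taking logarithms
# bounds the number of unsuccessful iterations in terms of |S_k|.
rng = np.random.default_rng(1)
gamma1, gamma2, Delta_max = 0.5, 2.0, 1.0
Delta, Delta_1 = 1.0, 1.0
n_succ = n_unsucc = 0
for _ in range(1000):
    if rng.random() < 0.6:                 # a "successful" iteration
        Delta = min(gamma2*Delta, Delta_max)
        n_succ += 1
    else:                                  # an "unsuccessful" iteration
        Delta = gamma1*Delta
        n_unsucc += 1
    # the inequality the proof takes logarithms of:
    assert Delta <= Delta_1 * gamma1**n_unsucc * gamma2**n_succ * (1 + 1e-9)
    # the resulting bound on the unsuccessful count, given Delta_k >= Delta:
    bound = (n_succ*math.log(gamma2)
             + math.log(Delta_1/Delta)) / -math.log(gamma1)
    assert n_unsucc <= bound + 1e-6
print("counting bound verified on a random iteration pattern")
```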
4.2 Evaluation Complexity for Algorithm 4.1
We start our worst-case analysis by formalizing our assumptions. Let
For simplicity of notation, define \(L_f \mathop {=}\limits ^\mathrm{def}\max _{j\in \{1, \ldots , q \}} L_{f,j}\).
Algorithm 4.1 is required to start from a feasible \(x_1 \in \mathcal{F}\), which, together with the fact that the subproblem solution in Step 2 involves minimization over \(\mathcal{F}\), leads to AS.1. Note that AS.3 requires AS.2 and automatically holds if f is \(q+1\) times continuously differentiable and \(\mathcal{F}\) is bounded.
We now establish a lower bound on the trust-region radius.
Proof
Assume that, for some \(\ell \in \{1, \ldots , k \}\)
From (4.1), we obtain that, for some \(j\in \{1, \ldots , q \}\),
where we used (2.8) (implied by AS.3) and the fact that \(\phi _{f,j}^{\Delta _\ell }(x_\ell ) > \epsilon \Delta _\ell ^j\) to deduce the first inequality, the bound \(\Vert s_\ell \Vert \le \Delta _\ell \) to deduce the second, and (4.6) with \(j\ge 1\) to deduce the third. Thus, \(\rho _\ell \ge \eta _2\) and \(\Delta _{\ell +1} \ge \Delta _\ell \). The mechanism of the algorithm and the inequality \(\Delta _1 \ge \epsilon \) then ensure that, for all \(\ell \in \{1, \ldots , k\}\),
\(\square \)
We next derive a simple lower bound on the objective function decrease at successful iterations.
Proof
We have, using (4.1), the fact that \(\phi _{f,j}^{\Delta _k}(x_k) > \epsilon \Delta _k^j\) for some \(j \in \{1, \ldots , q \}\) and (4.4) successively, that
\(\square \)
Our worst-case evaluation complexity results can now be proved by summing the decreases guaranteed by this last lemma.
Proof
Let k be the index of an arbitrary iteration before termination. Using the definition of \(f_\mathrm{low}\), the nature of successful iterations, (4.11) and Lemma 4.3, we deduce that
which proves (4.9).
We next call upon Lemma 4.1 to compute the upper bound on the total number of iterations before termination (obviously, there must be at least one successful iteration unless termination occurs for \(k=1\)) and add one for the evaluation at termination. Finally, (4.12) and (4.13) result from AS.3, Theorem 3.7 and the fact that \(\phi _{f,q}^{\Delta _k}(x_\epsilon ) \le \epsilon \, \Delta _{k_\epsilon }^q\) at termination. \(\square \)
Observe that, because of (4.2) and (4.4), \(\Delta _\epsilon \in [ \kappa _\delta \epsilon , \Delta _{\max }]\). Theorem 4.4 generalizes the known bounds for the cases where \(\mathcal{F}= \mathbb {R}\) and \(q=1\) [46], \(q=2\) [16, 47] and \(q=3\) [1]. The results for \(q=2\) with \(\mathcal{F}\subset \mathbb {R}^n\) and for \(q>3\) appear to be new. The latter provide the first evaluation complexity bounds for general criticality order q. Note that, if \(q = 1\), bounds of the type \(O(\epsilon ^{-(p+1)/p})\) exist if one is ready to minimize models of degree \(p > q\) (see [9]). Whether similar improvements can be obtained for \(q>1\) remains an open question at this stage.
We also observe that the above theory remains valid if the termination rule
used in Step 1 is replaced by a more flexible one, involving other acceptable termination circumstances, such as requiring that (4.15) or some other suitable condition holds. We conclude this section by noting that the global optimization effort involved in the computation of \(\phi _{f,j}^{\Delta _k}(x_k)\) \((j \in \{1, \ldots , q \})\) in Algorithm 4.1 might be limited by choosing \(\Delta _{\max }\) small enough.
We end with an important observation. The full AS.3 using (2.4) is only needed in our complexity analysis to deduce (4.12) in Theorem 4.4 using Theorem 3.7, itself depending on (2.4) via (2.8) and (2.9). However, in the derivation of the complexity bounds (4.9) and (4.10), the Lipschitz continuity implied by AS.3 is only used for deriving the first inequality of (4.7), in that Lipschitz continuity of \(\nabla _x^q f\) implies (2.8) along the segment \([x_k,x_k+s_k]\). Since it was discussed in Sect. 2.3 that (2.10) implies the same (2.8) along this segment, the weaker assumption
is all that is required for deriving (4.7). AS.3b can therefore replace AS.3 in Theorem 4.4 for the limited purpose of ensuring (4.9)–(4.11).
5 Sharpness
Interestingly, an example was presented in [18] showing that the bound of \(O(\epsilon ^{-3})\) evaluations for \(q=2\) is essentially sharp for both the trust-region and regularization algorithms. This is significant, because requiring \(\phi _{f,2}^\Delta (x) \le \epsilon \Delta ^2\) is slightly stronger, for small \(\Delta \), than the standard condition
(used in [16, 47] for instance). Indeed, for one-dimensional problems and assuming \(\nabla _x^2 f(x) \le 0\), the former condition amounts to requiring that
where the absolute value reflects the fact that \(s=\pm \Delta \) depending on the sign of g. In the remainder of this section, we show that the example proposed in [18] can be extended to arbitrary order q, and thus that the complexity bounds (4.9)–(4.10) are essentially sharp for our trust-region algorithm.
The idea of our generalized example is to apply Algorithm 4.1 to a unidimensional objective function f for some fixed \(q\ge 1\) and \(\mathcal{F}= \mathbb {R}_+\) (hence guaranteeing AS.1), generating a sequence of iterates \(\{x_k\}_{k\ge 0}\) starting from the origin, i.e., \(x_0=x_1=0\). We first choose the sequences of derivative values up to order q to be, for all \(k \ge 1\),
where \(\delta \in (0,1)\) is a (small) positive constant. This means that, at iterate \(x_k\), the qth-order Taylor model is given by
where the value of \(f(x_k)\) remains unspecified for now. The step is then obtained by minimizing this model in a trust-region of radius
yielding that
As a consequence, the model decrease is given by
For our example, we then define the objective function decrease at iteration k to be
thereby ensuring that \(\rho _k \in [\eta _1,\eta _2)\) and \(x_{k+1}=x_k+s_k\) for each k. Summing up function decreases, we may then specify the objective function’s values at the iterates by
where \(\zeta (t)\mathop {=}\limits ^\mathrm{def}\sum _{k=1}^\infty k^{-t}\) is the Riemann zeta function. This function is finite for all \(t>1\) (and thus also for \(t=1+(q+1)\delta \)), thereby ensuring that \(f(x_k) \ge 0\) for all \(k\ge 0\). We also verify that
in accordance with (4.2), provided . Observe also that (5.3) and (5.5) ensure that, for each \(k\ge 1\),
and
We now use Hermite interpolation to construct the objective function f on the successive intervals \([x_k,x_{k+1}]\) and define
where \(p_k\) is the polynomial
with coefficients defined by the interpolation conditions
These conditions ensure that f(x) is q times continuously differentiable on \(\mathbb {R}_+\) and thus that AS.2 holds. They also impose the following values for the first \(q+1\) coefficients
and the remaining \(q+1\) coefficients are solutions of the linear system
where the right-hand side is given by
Observe now that the coefficient matrix of this linear system may be written as
where
is an invertible matrix independent of k (see Appendix). Hence,
Observe now that, because of (5.4), (5.6), (5.5) and (5.3),
and, since \(\nabla _x^qf(x_k)< \nabla _x^qf(x_{k+1}) < 0\),
These bounds and (5.15) imply that \([r_k]_i\), the ith component of \(r_k\), satisfies
Hence, using (5.17) and the non-singularity of \(M_q\), we obtain that there exists a constant \(\kappa _q \ge 1\) independent of k such that
and thus that
Moreover, using successively (5.11), the triangle inequality, (5.13), (5.3), (5.4), (5.18) and \(\kappa _q \ge 1\), we obtain that, for \(j \in \{1, \ldots , q \}\),
and thus, all derivatives of order one up to q remain bounded on \([0,s_k]\). Because of (5.10), we therefore obtain that AS.3 holds. Moreover (5.13), (5.18), the inequalities \(|\nabla _x^q f(x_k)|\le q!\) and \(f(x_k)\ge 0\), (5.10) and (5.4) also ensure that f(x) is bounded below.
We have therefore shown that the bounds of Theorem 4.4 are essentially sharp, in that, for every \(\delta >0\), Algorithm 4.1 applied to the problem of minimizing the lower-bounded objective function f just constructed and satisfying AS.1–AS.3 will take, because of (5.8) and (5.9),
iterations and evaluations of f and its first q derivatives to find an iterate \(x_k\) such that condition (4.15) holds. Moreover, it is clear that, in the example presented, the global rate of convergence is driven by the term of degree q in the Taylor series.
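The boundedness claims underlying this construction are easy to check numerically: the total objective decrease paid by the example is proportional to \(\zeta(1+(q+1)\delta)\), which is finite for every \(\delta > 0\). A small sketch with illustrative values of \(q\) and \(\delta\):

```python
# The construction spends a total objective decrease proportional to
# zeta(1 + (q+1)*delta) = sum_{k >= 1} k^{-(1+(q+1)*delta)}, which is
# finite for every delta > 0; this is what allows f(x_k) >= 0 for all k.
def zeta_partial(t, n):
    return sum(k**(-t) for k in range(1, n + 1))

q, delta = 3, 0.1                  # illustrative values, not from the paper
t = 1.0 + (q + 1)*delta            # = 1.4 here, safely above 1
# partial sums stabilize as n grows, confirming convergence
print(zeta_partial(t, 10**4), zeta_partial(t, 10**5))
```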
6 Discussion
We have analysed the necessary and sufficient optimality conditions of arbitrary order for convexly constrained nonlinear optimization problems, using approximations of the feasible region which generalize the idea of second-order tangent sets (see [10]) to orders beyond two. Using the resulting necessary conditions, we then proposed a measure of criticality for arbitrary order for convexly constrained nonlinear optimization problems. As this measure can be extended to define \(\epsilon \)-approximate critical points of high order, we have then used it in a conceptual trust-region algorithm to show that if derivatives of the objective function up to order \(q \ge 1\) can be evaluated and are Lipschitz continuous, then this algorithm applied to the convexly constrained problem (3.1) needs at most \(O(\epsilon ^{-(q+1)})\) evaluations of f and its derivatives to compute an \(\epsilon \)-approximate qth-order critical point. Moreover, we have shown by an example that this bound is essentially sharp.
In the purely unconstrained case, this result recovers known results for \(q=1\) (first-order criticality for Lipschitz gradients) [46], \(q=2\) (second-order criticalityFootnote 7 with Lipschitz Hessians) [18, 47] and \(q=3\) (third-order criticalityFootnote 8 with Lipschitz continuous third derivative) [1], but extends them to arbitrary order. The results for the convexly constrained case appear to be new and provide in particular the first complexity bound for second- and third-order criticality for such inequality constrained problems.
Because the condition (4.15) measures different orders of criticality, we could choose to use a different \(\epsilon \) for every order (as in [18]), complicating the expression of the bound accordingly. However, as shown by our example, the worst-case behaviour of Algorithm 4.1 is dominated by that of \(\nabla _x^qf\), which makes the distinction of the various \(\epsilon \)-s less crucial.
Because of the global optimization occurring in the definition of the criticality measure \(\phi _{f,j}^\Delta (x)\), the algorithm discussed in the present paper remains, in general, of a theoretical nature. However, there may be cases where this computation is tractable for small enough \(\Delta \), for instance if the derivative tensors of the objective function are strongly structured. Such approaches may hopefully be of use for small-dimensional or structured highly nonlinear problems, such as those occurring in machine learning using deep learning techniques (see [1]).
The present framework for handling convex constraints is not free of limitations, resulting from our choice to transfer difficulties associated with the original problem to the subproblem solution, thereby sparing precious evaluations of f and its derivatives. In particular, the cost of evaluating any constraint function/derivative possibly defining the convex feasible set \(\mathcal{F}\) is neglected by the present approach, which must therefore be seen as a suitable framework to handle “cheap inequality constraints” such as simple bounds.
Questions of course arise from the results presented. The first is whether it is possible to extend the existing work (e.g., [10]) on bridging the gap between necessary and sufficient optimality conditions for orders one and two to higher orders, possibly by finding sufficient conditions to ensure (3.21) and by isolating problem classes where this constraint qualification condition automatically holds. From the complexity point of view, it is known that the complexity of obtaining \(\epsilon \)-approximate first-order criticality for unconstrained and convexly constrained problems can be reduced to \(O(\epsilon ^{-(p+1)/p})\) if one is ready to define the step by using a regularization model of order \(p \ge 1\). In the unconstrained case, this was shown for \(p=2\) in [16, 47] and for general \(p \ge 1\) in [9], while the convexly constrained case was analysed (for \(p=2\)) in [17]. The question of whether this methodology and the associated improvements in evaluation complexity bounds can be extended to orders above one also remains open at this stage.
Notes
That it is the norm recursively induced by the standard Euclidean norm results from the observation that
$$\begin{aligned} \max _{\Vert v_1\Vert =\cdots =\Vert v_q\Vert =1} T[v_1,\ldots ,v_q] = \max _{\Vert v_q\Vert =1} \Bigg [\max _{\Vert v_1\Vert =\cdots =\Vert v_{q-1}\Vert =1} T[v_1,\ldots ,v_{q-1}]\Bigg ][v_q]. \end{aligned}$$
Unfortunately, double indices are necessary for most of our notation, as we need to distinguish both the function to which the relevant quantity is associated (the first index) and its order (the second index).
It would be possible to generalize the approach of [10] and define the inner jth-order tangent set (\(j>1\)) by \(\mathcal{T}_\mathcal{F}^j(x,s_1, \ldots , s_{j-1}) \mathop {=}\limits ^\mathrm{def}\{ s_j \in \mathbb {R}^n \mid x + \sum _{i=1}^j\frac{\alpha ^i}{i!} s_i + o(\alpha ^j) \in \mathcal{F}\}\) (with \(\mathcal{T}_\mathcal{F}^1(x) = \mathcal{T}_\mathcal{F}(x)\)), leading to \(\mathcal{D}_\mathcal{F}^q(x) = \prod _{j=1}^q \,j!\,\mathcal{T}_\mathcal{F}^j(x,s_1, \ldots , s_{j-1})\), but we prefer the equivalent (3.6) for notational convenience.
We saw in Sect. 3.1.2 that \(q=3\) is the highest order for which this is possible.
Note that a small \(\Delta \) has the advantage of limiting the global optimization effort.
A detailed account and a comprehensive bibliography on trust-region methods can be found in [25].
Using (3.34).
Using (3.35).
References
A. Anandkumar and R. Ge. Efficient approaches for escaping high-order saddle points in nonconvex optimization. arXiv.1602.05908, 2016.
A. Auslender and R. Cominetti. First and second order sensitivity analysis of nonlinear programs under directional constraint qualification conditions. Optimization 21 (1990), 351–363.
M. Baes. Estimate sequence methods: extensions and approximations. Technical Report IFOR Internal Report, ETH Zürich, Raemistrasse 101, Zürich, Switzerland, 2009.
A. Ben-Tal. Second order and related extremality conditions in nonlinear programming. Journal of Optimization Theory and Applications 31 (1980), 143–165.
W. Bian and X. Chen. Worst-case complexity of smoothing quadratic regularization methods for non-Lipschitzian optimization. SIAM Journal on Optimization 23 (2013), 1718–1741.
W. Bian and X. Chen. Linearly constrained non-Lipschitzian optimization for image restoration. Technical report, Department of Applied Mathematics, Polytechnic University of Hong Kong, Hong Kong, 2015.
W. Bian, X. Chen, and Y. Ye. Complexity analysis of interior point algorithms for non-Lipschitz and nonconvex minimization. Mathematical Programming, Series A 149 (2015), 301–327.
E. G. Birgin, J. L. Gardenghi, J. M. Martínez, S. A. Santos, and Ph. L. Toint. Evaluation complexity for nonlinear constrained optimization using unscaled KKT conditions and high-order models. SIAM Journal on Optimization 26 (2016), 951–967.
E. G. Birgin, J. L. Gardenghi, J. M. Martínez, S. A. Santos, and Ph. L. Toint. Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. Mathematical Programming, Series A. doi:10.1007/s10107-016-1065-8.
J. F. Bonnans, R. Cominetti, and A. Shapiro. Second order optimality conditions based on parabolic second order tangent sets. SIAM Journal on Optimization, 9 (1999), 466–492.
N. Boumal, P.-A. Absil, and C. Cartis. Global rates of convergence for nonconvex optimization on manifolds. Technical Report Technical Report arXiv:1605.08101, Oxford University, Oxford, UK, 2016.
O. A. Brezhneva and A. Tret’yakov. Optimality conditions for degenerate extremum problems with equality constraints. SIAM Journal on Control and Optimization 42 (2003), 729–743.
O. A. Brezhneva and A. Tret’yakov. The \(p\)th-order optimality conditions for inequality constrained optimization problems. Nonlinear Analysis 63 (2005), 1357–1366.
C. Cartis, N. I. M. Gould, and Ph. L. Toint. On the complexity of steepest descent, Newton’s and regularized Newton’s methods for nonconvex unconstrained optimization. SIAM Journal on Optimization 20 (2010), 2833–2852.
C. Cartis, N. I. M. Gould, and Ph. L. Toint. Adaptive cubic overestimation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Mathematical Programming, Series A 127 (2011), 245–295.
C. Cartis, N. I. M. Gould, and Ph. L. Toint. Adaptive cubic overestimation methods for unconstrained optimization. Part II: worst-case function-evaluation complexity. Mathematical Programming, Series A 130 (2011), 295–319.
C. Cartis, N. I. M. Gould, and Ph. L. Toint. An adaptive cubic regularization algorithm for nonconvex optimization with convex constraints and its function-evaluation complexity. IMA Journal of Numerical Analysis 32 (2012), 1662–1695.
C. Cartis, N. I. M. Gould, and Ph. L. Toint. Complexity bounds for second-order optimality in unconstrained optimization. Journal of Complexity 28 (2012), 93–108.
C. Cartis, N. I. M. Gould, and Ph. L. Toint. On the complexity of finding first-order critical points in constrained nonlinear optimization. Mathematical Programming, Series A 144 (2013), 93–106.
C. Cartis, N. I. M. Gould, and Ph. L. Toint. On the evaluation complexity of cubic regularization methods for potentially rank-deficient nonlinear least-squares problems and its relevance to constrained nonlinear optimization. SIAM Journal on Optimization 23 (2013), 1553–1574.
C. Cartis, N. I. M. Gould, and Ph. L. Toint. On the evaluation complexity of constrained nonlinear least-squares and general constrained nonlinear optimization using second-order methods. SIAM Journal on Numerical Analysis 53 (2015), 836–851.
C. Cartis, Ph. R. Sampaio, and Ph. L. Toint. Worst-case complexity of first-order non-monotone gradient-related algorithms for unconstrained optimization. Optimization 64 (2015), 1349–1361.
C. Cartis and K. Scheinberg. Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Mathematical Programming, Series A. doi:10.1007/s10107-017-1137-4.
R. Cominetti. Metric regularity, tangent sets and second-order optimality conditions. Applied Mathematics & Optimization, 21 (1990), 265–287.
A. R. Conn, N. I. M. Gould, and Ph. L. Toint. Trust-Region Methods. MPS-SIAM Series on Optimization. SIAM, Philadelphia, USA, 2000.
F. E. Curtis, D. P. Robinson, and M. Samadi. A trust-region algorithm with a worst-case iteration complexity of \(O(\epsilon ^{-3/2})\) for nonconvex optimization. Mathematical Programming, Series A 162 (2017), 1–32.
M. Dodangeh, L. N. Vicente, and Z. Zhang. On the optimal order of worst case complexity of direct search. Optimization Letters 10 (2016), 699–708.
J. P. Dussault. ARC\(_q\): a new adaptive regularization by cubics. Optimization Methods and Software. doi:10.1080/10556788.2017.1322080
R. Garmanjani, D. Júdice, and L. N. Vicente. Trust-region methods without using derivatives: Worst case complexity and the non-smooth case. SIAM Journal on Optimization, 26 (2016), 1987–2011.
J. Gauvin and R. Janin. Directional behavior of optimal solutions in nonlinear mathematical programming. Mathematics of Operations Research 13 (1988), 629–649.
D. Ge, X. Jiang, and Y. Ye. A note on the complexity of \({L}_p\) minimization. Mathematical Programming, Series A 21 (2011), 1721–1739.
S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, Series A, 156 (2016), 59–100.
N. I. M. Gould and S. Leyffer. An introduction to algorithms for nonlinear optimization. In J. F. Blowey, A. W. Craig, and T. Shardlow, editors, Frontiers in Numerical Analysis (Durham 2002), Heidelberg, Berlin, New York, 2003. Springer Verlag, pp. 109–197.
G. N. Grapiglia, J. Yuan, and Y. Yuan. Nonlinear stepsize control algorithms: Complexity bounds for first and second-order optimality. Journal of Optimization Theory and Applications, 171 (2016), 980–997.
G. N. Grapiglia, J. Yuan, and Y. Yuan. On the convergence and worst-case complexity of trust-region and regularization methods for unconstrained optimization. Mathematical Programming, Series A 152 (2015), 491–520.
S. Gratton, C. W. Royer, L. N. Vicente, and Z. Zhang. Direct search based on probabilistic descent. SIAM Journal on Optimization 25 (2015), 1515–1541.
S. Gratton, A. Sartenaer, and Ph. L. Toint. Recursive trust-region methods for multiscale nonlinear optimization. SIAM Journal on Optimization, 19 (2008), 414–444.
H. Hancock. The Theory of Maxima and Minima. The Athenaeum Press, Ginn & Co, New York, USA, 1917. Available on line at https://archive.org/details/theoryofmaximami00hancuoft.
J. B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms. Part 1: Fundamentals. Springer Verlag, Heidelberg, Berlin, New York, 1993.
W. Hogan. Point-to-set maps in mathematical programming. SIAM Review 15 (1973), 591–603.
F. Jarre. On Nesterov’s smooth Chebyshev-Rosenbrock function. Optimization Methods and Software 28 (2013), 478–500.
S. Lu, Z. Wei, and L. Li. A trust-region algorithm with adaptive cubic regularization methods for nonsmooth convex minimization. Computational Optimization and Applications 51 (2012), 551–573.
J. M. Martínez. On high-order model regularization for constrained optimization. Technical report, Department of Applied Mathematics, IMECC-UNICAMP, Campinas, Brasil, February 2017.
J. M. Martínez and M. Raydan. Cubic-regularization counterpart of a variable-norm trust-region method for unconstrained minimization. Technical report, Department of Mathematics, IMECC-UNICAMP, University of Campinas, Campinas, Brazil, November 2015.
J. J. Moreau. Décomposition orthogonale d’un espace hilbertien selon deux cônes mutuellement polaires. Comptes-Rendus de l’Académie des Sciences (Paris) 255 (1962), 238–240.
Yu. Nesterov. Introductory Lectures on Convex Optimization. Applied Optimization. Kluwer Academic Publishers, Dordrecht, The Netherlands, 2004.
Yu. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, Series A 108 (2006), 177–205.
J. Nocedal and S. J. Wright. Numerical Optimization. Series in Operations Research. Springer Verlag, Heidelberg, Berlin, New York, 1999.
G. Peano. Calcolo differenziale e principii di calcolo integrale. Fratelli Bocca, Roma, Italy, 1884.
R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, USA, 1970.
K. Scheinberg and X. Tang. Complexity in inexact proximal Newton methods. Technical report, Lehigh University, Bethlehem, USA, 2014.
A. Shapiro. Perturbation theory of nonlinear programs when the set of optimal solutions is not a singleton. Applied Mathematics & Optimization 18 (1988), 215–229.
K. Ueda and N. Yamashita. Convergence properties of the regularized Newton method for the unconstrained nonconvex optimization. Applied Mathematics & Optimization 62 (2010), 27–46.
K. Ueda and N. Yamashita. On a global complexity bound of the Levenberg-Marquardt method. Journal of Optimization Theory and Applications 147 (2010), 443–453.
L. N. Vicente. Worst case complexity of direct search. EURO Journal on Computational Optimization 1 (2013), 143–153.
Acknowledgements
The authors would like to thank Oliver Stein for suggesting reference [40], as well as Jim Burke and Adrian Lewis for interesting discussions. Thanks are also due to helpful referees whose comments have helped to improve the manuscript. The third author would also like to acknowledge the support provided by the Belgian Fund for Scientific Research (FNRS), the Leverhulme Trust (UK), Balliol College (Oxford), the Department of Applied Mathematics of the Hong Kong Polytechnic University, ENSEEIHT (Toulouse, France) and INDAM (Florence, Italy).
Communicated by Michael Overton.
Appendix: Non-singularity of \(M_q\)
We prove the non-singularity of the matrix \(M_q\) introduced in (5.16). Assume for the purpose of a contradiction that there exists a nonzero vector \(v= (c_{q+1,k},\ldots , c_{2q+1,k})^T \in \mathbb {R}^{q+1}\) such that \(M_qv = 0\). From the argument of Sect. 5, this amounts to saying that there exists a polynomial of the form (5.11) with one of the coefficients \(c_{q+1,k}, \ldots , c_{2q+1,k}\) being nonzero and which satisfies the interpolation conditions (5.12) (i.e., (5.13) and (5.14)) with the restriction that \(r_k\) given by (5.15) is identically zero. Since \(s_k > 0\), the fact that components 2 to q of \(r_k\) are zero implies that \(\nabla _x^qf(x_k) = q! \,c_{q,k}=0\), and hence (from the first component) \(\Delta f_k = 0\). The interpolation conditions thus specify that
where the last equality results from the fact that the last component of \(r_k\) is zero. Because \(p_k(s)\) is nonzero, this implies that \(p_k(s)\) must be of the form \(As^{q+1}(s-s_k)^{q+1}p_1(s)\) where A is a constant and \(p_1(s)\) is a polynomial in s. But, since \(p_k(s)\) is of degree \((2q+1)\) and \(s^{q+1}(s-s_k)^{q+1}\) of degree \(2q+2\), one must have that \(p_1(s) = 0 = p_k(s)\), which is impossible. Hence, \(M_q\) is non-singular.
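The appendix's argument can be sanity-checked numerically for small q, under the assumption (since (5.16) is not reproduced here) that, after scaling out \(s_k\) by taking \(s_k = 1\), the \((i,j)\) entry of \(M_q\) is the ith derivative of \(s^{q+1+j}\) at \(s=1\):

```python
from math import prod
import numpy as np

# Hedged numeric check: with s_k = 1, the (i, j) entry of the interpolation
# matrix is the i-th derivative of s^(q+1+j) at s = 1, i.e. the falling
# factorial (q+1+j)(q+j)...(q+2+j-i).  Nonsingularity matches the appendix's
# multiplicity argument: a nonzero polynomial in span{s^(q+1),...,s^(2q+1)}
# cannot vanish to order q+1 at both 0 and 1, since that would require
# 2q+2 > 2q+1 roots counted with multiplicity.
def M(q):
    return np.array([[prod(range(q + 2 + j - i, q + 2 + j))
                      for j in range(q + 1)]
                     for i in range(q + 1)], dtype=float)

for q in range(1, 7):
    assert abs(np.linalg.det(M(q))) > 1e-8   # invertible for these q
print("M_q nonsingular for q = 1..6")
```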
Cartis, C., Gould, N.I.M. & Toint, P.L. Second-Order Optimality and Beyond: Characterization and Evaluation Complexity in Convexly Constrained Nonlinear Optimization. Found Comput Math 18, 1073–1107 (2018). https://doi.org/10.1007/s10208-017-9363-y