We first outline the connection between \(\varLambda \)-poisedness of \(Y_k\) and fully linear models. We then prove global convergence of Algorithm 1 (i.e. convergence from any starting point) to first-order critical points, and determine its worst-case complexity.
Interpolation models are fully linear
To begin, we require some assumptions on the smoothness of \(r\).
Assumption 3.1
The function \(r\) is \(C^1\) and its Jacobian \(J(x)\) is Lipschitz continuous in \({\mathcal {B}}\), the convex hull of \(\cup _k B(x_k,\varDelta _{max})\), with constant \(L_J\). We also assume that \(r(x)\) and \(J(x)\) are uniformly bounded in the same region; i.e. \(\Vert r(x)\Vert \le r_{max}\) and \(\Vert J(x)\Vert \le J_{max}\) for all \(x\in {\mathcal {B}}\).
If the level set \(\{x: f(x)\le f(x_0)\}\) is bounded, which is assumed in [38], then \(x_k\) remains in this set for all k, so \({\mathcal {B}}\) is compact, from which Assumption 3.1 follows. A standard result follows, whose proof can be found in [4].
Lemma 3.2
If Assumption 3.1 holds, then \(\nabla f\) is Lipschitz continuous in \({\mathcal {B}}\) with constant
$$\begin{aligned} L_{\nabla f} :=r_{max}L_J + J_{max}^2. \end{aligned}$$
(3.1)
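The proof of Lemma 3.2 in [4] is a short computation; the following is a sketch, assuming the least-squares form \(f(x)=\tfrac{1}{2}\Vert r(x)\Vert ^2\), so that \(\nabla f(x)=J(x)^Tr(x)\). For \(x,y\in {\mathcal {B}}\),
$$\begin{aligned} \Vert \nabla f(x)-\nabla f(y)\Vert&= \Vert J(x)^Tr(x)-J(y)^Tr(y)\Vert \\&\le \Vert J(x)\Vert \,\Vert r(x)-r(y)\Vert + \Vert J(x)-J(y)\Vert \,\Vert r(y)\Vert \\&\le J_{max}\cdot J_{max}\Vert x-y\Vert + L_J\Vert x-y\Vert \cdot r_{max} = \left( r_{max}L_J+J_{max}^2\right) \Vert x-y\Vert , \end{aligned}$$
using \(\Vert r(x)-r(y)\Vert \le J_{max}\Vert x-y\Vert \), which follows from integrating \(J\) along the segment from y to x (contained in the convex set \({\mathcal {B}}\)) and the bound \(\Vert J\Vert \le J_{max}\).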
We now state the connection between \(\varLambda \)-poisedness of \(Y_k\) and full linearity of the models
(2.1) and \(m_k\) (2.5).
Lemma 3.3
Suppose Assumption 3.1 holds and \(Y_k\) is \(\varLambda \)-poised in \(B(x_k,\varDelta _k)\). Then (2.1) is a fully linear model for \(r\) in \(B(x_k,\varDelta _k)\) in the sense of Definition 2.4 with constants
$$\begin{aligned} \kappa _{ef}^r = \kappa _{eg}^r + \frac{L_J}{2} \qquad \text {and} \qquad \kappa _{eg}^r = \frac{1}{2}L_J\left( \sqrt{n}C+2\right) , \end{aligned}$$
(3.2)
in (2.13) and (2.14), where \(C={\mathcal {O}}(\varLambda )\). Under the same hypotheses, \(m_k\) (2.5) is a fully linear model for f in \(B(x_k,\varDelta _k)\) in the sense of Definition 2.3 with constants
$$\begin{aligned} \kappa _{ef}= & {} \kappa _{eg} + \frac{L_{\nabla f} + (\kappa _{eg}^r\varDelta _{max} + J_{max})^2}{2} \,\, \text {and} \,\, \nonumber \\ \kappa _{eg}= & {} L_{\nabla f} +\, \kappa _{eg}^r r_{max} + (\kappa _{eg}^r\varDelta _{max}+J_{max})^2, \end{aligned}$$
(3.3)
in (2.11) and (2.12), where \(L_{\nabla f}\) is from (3.1). We also have the bound \(\Vert H_k\Vert \le (\kappa _{eg}^r\varDelta _{max} + J_{max})^2\), independent of \(x_k\), \(Y_k\) and \(\varDelta _k\).
Proof
See Appendix A. \(\square \)
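To give some intuition for the final bound in Lemma 3.3 (the full argument is in Appendix A), the following sketch assumes that the Gauss–Newton model (2.5) is built from the interpolation model (2.1), so that its Hessian has the structure \(H_k=J_k^TJ_k\) with \(J_k\) the model Jacobian, and that full linearity of (2.1) provides a gradient-accuracy bound of the form \(\Vert J_k-J(x_k)\Vert \le \kappa _{eg}^r\varDelta _k\). Then
$$\begin{aligned} \Vert H_k\Vert = \Vert J_k^TJ_k\Vert = \Vert J_k\Vert ^2 \le \left( \Vert J(x_k)\Vert + \kappa _{eg}^r\varDelta _k\right) ^2 \le \left( \kappa _{eg}^r\varDelta _{max}+J_{max}\right) ^2, \end{aligned}$$
since \(\varDelta _k\le \varDelta _{max}\) and \(\Vert J(x_k)\Vert \le J_{max}\) by Assumption 3.1.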
Global convergence of DFO-GN
We begin with some nomenclature to describe certain iterations: we call an iteration (for which the safety phase is not called)
-
‘Successful’ if \(x_{k+1}=x_k+s_k\) (i.e. \(R_k\ge \eta _1\)), and ‘very successful’ if \(R_k\ge \eta _2\). Let \({\mathcal {S}}\) be the set of successful iterations k;
-
‘Model-Improving’ if \(R_k<\eta _1\) and the model-improvement phase is called (i.e. \(Y_k\) is not \(\varLambda \)-poised in \(B(x_k,\varDelta _k)\)); and
-
‘Unsuccessful’ if \(R_k<\eta _1\) and the model-improvement phase is not called.
The results below are largely based on corresponding results in [9, 38]. As such, we omit many details which can be found there; full proofs of these results are given in the extended technical report [4] accompanying this paper.
Assumption 3.4
We assume that \(\Vert H_k\Vert \le \kappa _H\) for all k, for some \(\kappa _H\ge 1\).
Lemma 3.5
(Lemma 4.3, [38]) Suppose Assumption 2.1 holds. If the model \(m_k\) is fully linear in \(B(x_k,\varDelta _k)\)
and
then either the k-th iteration is very successful or the safety phase is called.
The next result provides a lower bound on the size of the trust region step \(s_k\), which we will later use to determine that the safety phase is not called when \(\Vert g_k\Vert \) is bounded away from zero and \(\varDelta _k\) is sufficiently small. Note that [38, Lemma 4.4] shows that the safety phase is not called by requiring that the trust region subproblem (2.6) is solved to global optimality, a stronger condition than Assumption 2.1.
Lemma 3.6
Suppose Assumption 2.1 holds. Then the step \(s_k\) satisfies
$$\begin{aligned} \Vert s_k\Vert \ge c_2\min \left( \varDelta _k, \frac{\Vert g_k\Vert }{\max (\Vert H_k\Vert ,1)}\right) , \end{aligned}$$
(3.5)
where \(c_2 :=2c_1 / (1+\sqrt{1+2c_1})\).
Proof
Let \(h_k:=\max (\Vert H_k\Vert ,1)\ge 1\). Since
from (2.8), we have
Substituting this into (2.8), we get
For (3.7) to be satisfied, we require that \(\Vert s_k\Vert \) is larger than (or equal to) the positive root of the left-hand side of (3.7), which gives the first inequality below
where
; from which we recover (3.5). \(\square \)
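Since the intermediate displays are compact, we record the underlying calculation; this is a sketch, assuming (2.8) is the usual Cauchy decrease condition \(m_k(0)-m_k(s_k)\ge c_1\Vert g_k\Vert \min \left( \varDelta _k,\Vert g_k\Vert /h_k\right) \) and that \(m_k\) has the quadratic form (2.5). Bounding the model decrease from above,
$$\begin{aligned} c_1\Vert g_k\Vert \min \left( \varDelta _k,\frac{\Vert g_k\Vert }{h_k}\right) \le m_k(0)-m_k(s_k) = -g_k^Ts_k-\tfrac{1}{2}s_k^TH_ks_k \le \Vert g_k\Vert \,\Vert s_k\Vert + \tfrac{h_k}{2}\Vert s_k\Vert ^2, \end{aligned}$$
so \(\Vert s_k\Vert \) is at least the positive root of \(\tfrac{h_k}{2}t^2+\Vert g_k\Vert t-c_1\Vert g_k\Vert \min \left( \varDelta _k,\Vert g_k\Vert /h_k\right) \), namely
$$\begin{aligned} \Vert s_k\Vert \ge \frac{2c_1\Vert g_k\Vert \min \left( \varDelta _k,\Vert g_k\Vert /h_k\right) }{\Vert g_k\Vert +\sqrt{\Vert g_k\Vert ^2+2h_kc_1\Vert g_k\Vert \min \left( \varDelta _k,\Vert g_k\Vert /h_k\right) }} \ge \frac{2c_1}{1+\sqrt{1+2c_1}}\min \left( \varDelta _k,\frac{\Vert g_k\Vert }{h_k}\right) , \end{aligned}$$
where the last inequality uses \(h_k\min (\varDelta _k,\Vert g_k\Vert /h_k)\le \Vert g_k\Vert \); this recovers (3.5) with the constant \(c_2\) above.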
Lemma 3.7
In all iterations, \(\Vert g_k\Vert \ge \min \left( \epsilon _C, \varDelta _k/\mu \right) \). Also, if \(\Vert \nabla f(x_k)\Vert \ge \epsilon >0\), then
$$\begin{aligned} \Vert g_k\Vert \ge \epsilon _g :=\min \left( \epsilon _C, \frac{\epsilon }{1+\kappa _{eg}\mu }\right) > 0. \end{aligned}$$
(3.9)
Proof
Firstly, if the criticality phase is not called, then we must have \(\Vert g_k\Vert =\Vert g_k^{init}\Vert \ge \epsilon _C\). Otherwise, we have \(\varDelta _k\le \mu \Vert g_k\Vert \). Hence \(\Vert g_k\Vert \ge \min \left( \epsilon _C, \varDelta _k/\mu \right) \). The proof of (3.9) is given in [9, Lemma 10.11]. \(\square \)
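The factor \((1+\kappa _{eg}\mu )^{-1}\) in \(\epsilon _g\) comes from a short standard argument (see [9, Lemma 10.11]); the following sketch assumes that the criticality phase, when called, returns \(m_k\) fully linear in \(B(x_k,\varDelta _k)\) with \(\varDelta _k\le \mu \Vert g_k\Vert \), and that (2.12) is the usual gradient-accuracy condition of full linearity. Evaluating (2.12) at \(x_k\),
$$\begin{aligned} \epsilon \le \Vert \nabla f(x_k)\Vert \le \Vert \nabla f(x_k)-g_k\Vert + \Vert g_k\Vert \le \kappa _{eg}\varDelta _k + \Vert g_k\Vert \le \left( 1+\kappa _{eg}\mu \right) \Vert g_k\Vert , \end{aligned}$$
so \(\Vert g_k\Vert \ge \epsilon /(1+\kappa _{eg}\mu )\); if instead the criticality phase is not called, then \(\Vert g_k\Vert \ge \epsilon _C\), and combining the two cases gives (3.9).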
Lemma 3.8
Suppose Assumptions 2.1, 3.1 and 3.4 hold. If \(\Vert \nabla f(x_k)\Vert \ge \epsilon \) for all k, then \(\rho _k\ge \rho _{min} > 0\) for all k, where
$$\begin{aligned} \rho _{min} :=\min \left( \varDelta _0^{init}, \frac{\omega _C\epsilon }{\kappa _{eg}+1/\mu }, \, \frac{\alpha _1 \epsilon _g}{\kappa _H}, \, \alpha _1\left( \kappa _{eg} + \frac{2\kappa _{ef}}{c_1(1-\eta _2)}\right) ^{-1}\epsilon \right) . \qquad \end{aligned}$$
(3.10)
Proof
From Lemma 3.7, we also have \(\Vert g_k\Vert \ge \epsilon _g\) for all k. To find a contradiction, let k(0) be the first k such that \(\rho _k<\rho _{min}\). That is, we have
$$\begin{aligned} \rho _0^{init} \ge \rho _0 \ge \rho _1^{init} \ge \rho _1 \ge \cdots \ge \rho _{k(0)-1}^{init} \ge \rho _{k(0)-1} \ge \rho _{min} \qquad \text {and} \qquad \rho _{k(0)} < \rho _{min}. \nonumber \\ \end{aligned}$$
(3.11)
We first show that
$$\begin{aligned} \rho _{k(0)}=\rho _{k(0)}^{init}<\rho _{min}. \end{aligned}$$
(3.12)
From Algorithm 1, we know that either \(\rho _{k(0)}=\rho _{k(0)}^{init}\) or \(\rho _{k(0)}=\varDelta _{k(0)}\). Hence we must either have \(\rho _{k(0)}^{init}<\rho _{min}\) or \(\varDelta _{k(0)}<\rho _{min}\). In the former case, there is nothing to prove; in the latter, using Lemma B.1, we have that
$$\begin{aligned} \rho _{min} > \varDelta _{k(0)} \ge \min \left( \varDelta _{k(0)}^{init}, \frac{\omega _C \epsilon }{\kappa _{eg}+1/\mu }\right) \ge \min \left( \rho _{k(0)}^{init}, \frac{\omega _C \epsilon }{\kappa _{eg}+1/\mu }\right) . \end{aligned}$$
(3.13)
Since \(\rho _{min} \le \omega _C \epsilon / (\kappa _{eg}+1/\mu )\), we therefore conclude that (3.12) holds.
Since \(\rho _{min}\le \varDelta _0^{init}=\rho _0^{init}\), we therefore have \(k(0)>0\) and \(\rho _{k(0)-1} \ge \rho _{min} > \rho _{k(0)}^{init}\). This reduction in \(\rho \) can only happen from a safety step or an unsuccessful step, and we must have \(\rho _{k(0)}^{init}=\alpha _1\rho _{k(0)-1}\), so \(\rho _{k(0)-1} \le \rho _{min}/\alpha _1\). If we had a safety step, we know
, but if we had an unsuccessful step, we must have
. Hence in either case, we have
since \(\gamma _S<1\) and \(\gamma _{dec}<1\). Hence by Lemma 3.6 we have
Note that \(\rho _{min} \le \alpha _1 \epsilon _g / \kappa _H < (\alpha _1 c_2 \epsilon _g)/(\gamma _S \kappa _H)\), where in the last inequality we used the choice of \(\gamma _S\) in Algorithm 1. This inequality and the choice of \(\gamma _S\), together with (3.15), also imply
$$\begin{aligned} \varDelta _{k(0)-1} \le \frac{\gamma _S \rho _{min}}{\alpha _1 c_2} < \frac{\rho _{min}}{\alpha _1} \le \min \left( \frac{\epsilon _g}{\kappa _H}, \left( \kappa _{eg} + \frac{2\kappa _{ef}}{c_1(1-\eta _2)}\right) ^{-1}\epsilon \right) . \end{aligned}$$
(3.16)
Then since \(\varDelta _{k(0)-1} \le \epsilon _g/\kappa _H\), Lemma 3.6 gives us
and the safety phase is not called.
If \(m_{k(0)-1}\) is not fully linear, then we must have either a successful or model-improving iteration, so \(\rho _{k(0)}^{init}=\rho _{k(0)-1}\), contradicting (3.12). Thus \(m_{k(0)-1}\) must be fully linear. Now suppose that
Then using full linearity, we have
contradicting (3.16). That is, (3.17) is false and so, together with (3.16), we have (3.4). Hence Lemma 3.5 implies iteration \(k(0)-1\) was very successful (as we have already established that the safety phase was not called), so \(\rho _{k(0)}^{init}=\rho _{k(0)-1}\), contradicting (3.12). \(\square \)
Our first convergence result considers the case where we have finitely-many successful iterations.
Lemma 3.9
Suppose Assumptions 2.1, 3.1 and 3.4 hold. If there are finitely many successful iterations, then \(\lim _{k\rightarrow \infty }\varDelta _k=\lim _{k\rightarrow \infty }\rho _k=0\) and \(\lim _{k\rightarrow \infty }\Vert \nabla f(x_k)\Vert =0\).
Proof
The proof follows [9, Lemma 10.8], except we have to consider the possibility of safety phases in two places. First, to show \(\varDelta _k\rightarrow 0\), we note that \(\varDelta _k\) is reduced by a factor \(\max (\alpha _2,\omega _S)<1\) in safety phases. Secondly, we use the observation: if \(m_k\) is fully linear, \(\Vert g_k\Vert \) is sufficiently large, and \(\rho _k\le \varDelta _k\) are both sufficiently small, then Lemma 3.5 gives us either a very successful iteration or a safety step. In this case, a safety step is not called, because Lemma 3.6 implies \(\Vert s_k\Vert >\gamma _S\rho _k\). \(\square \)
Lemma 3.10
(Lemma 10.9, [9]) Suppose Assumptions 2.1, 3.1 and 3.4 hold. Then \(\lim _{k\rightarrow \infty }\varDelta _k=0\) and so \(\lim _{k\rightarrow \infty }\rho _k=0\).
Proof
The proof of [9, Lemma 10.9] shows \(\varDelta _k\rightarrow 0\); since \(\rho _k\le \varDelta _k\), we conclude \(\rho _k\rightarrow 0\). \(\square \)
Theorem 3.11
Suppose Assumptions 2.1, 3.1 and 3.4 hold. Then \(\liminf _{k\rightarrow \infty }\Vert \nabla f(x_k)\Vert =0\).
Proof
If \(|{\mathcal {S}}|<\infty \), then this follows from Lemma 3.9. Otherwise, it follows from Lemma 3.10 and Lemma 3.8. \(\square \)
Theorem 3.12
Suppose Assumptions 2.1, 3.1 and 3.4 hold. Then \(\lim _{k\rightarrow \infty }\Vert \nabla f(x_k)\Vert =0\).
Proof
If \(|{\mathcal {S}}|<\infty \), then the result follows from Lemma 3.9. Otherwise, the proof of [9, Theorem 10.13] applies, except for one modification: for \(k\in {\mathcal {K}}\) sufficiently large, iteration k is not unsuccessful, so must be a safety, successful or model-improving step. It cannot be a safety step by the same reasoning as in the proof of Lemma 3.9: since \(\Vert g_k\Vert \) is bounded away from zero for \(k\in {\mathcal {K}}\), and \(\varDelta _k\rightarrow 0\), if k is sufficiently large then Lemma 3.6 implies that \(\Vert s_k\Vert >\gamma _S\rho _k\). Hence iteration k must be successful or model-improving, and the remainder of the proof holds. \(\square \)
Worst-case complexity
Next, we bound the number of iterations and objective evaluations until \(\Vert \nabla f(x_k)\Vert \le \epsilon \). We know such a bound exists from Theorem 3.11. Let \(i_{\epsilon }\) be the last iteration before \(\Vert \nabla f(x_k)\Vert \le \epsilon \) for the first time.
Lemma 3.13
Suppose Assumptions 2.1, 3.1 and 3.4 hold. Let \(|{\mathcal {S}}_{i_{\epsilon }}|\) be the number of successful steps up to iteration \(i_{\epsilon }\). Then
where \(\epsilon _g\) is defined in (3.9), and \(\rho _{min}\) in (3.10).
Proof
For all \(k\in {\mathcal {S}}_{i_{\epsilon }}\), we have the sufficient decrease condition
Since \(\Vert g_k\Vert \ge \epsilon _g\) from Lemma 3.7 and \(\varDelta _k\ge \rho _k\ge \rho _{min}\) from Lemma 3.8, this means
Summing (3.22) over all \(k\in {\mathcal {S}}_{i_{\epsilon }}\), and noting that \(f\ge 0\), we get
from which (3.20) follows. \(\square \)
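The structure of this count is standard; the following is a sketch, assuming that \(R_k\) is the usual ratio of actual to predicted reduction, that (2.8) gives the Cauchy decrease \(m_k(0)-m_k(s_k)\ge c_1\Vert g_k\Vert \min \left( \varDelta _k,\Vert g_k\Vert /\max (\Vert H_k\Vert ,1)\right) \), and that \(f\ge 0\) (as for a least-squares objective). For \(k\in {\mathcal {S}}_{i_{\epsilon }}\),
$$\begin{aligned} f(x_k)-f(x_{k+1}) \ge \eta _1\left[ m_k(0)-m_k(s_k)\right] \ge \eta _1 c_1\Vert g_k\Vert \min \left( \varDelta _k,\frac{\Vert g_k\Vert }{\kappa _H}\right) \ge \eta _1 c_1\epsilon _g\min \left( \rho _{min},\frac{\epsilon _g}{\kappa _H}\right) , \end{aligned}$$
using Assumption 3.4, \(\Vert g_k\Vert \ge \epsilon _g\) and \(\varDelta _k\ge \rho _k\ge \rho _{min}\). Since \(f\) is not increased on any other iteration, summing over \(k\in {\mathcal {S}}_{i_{\epsilon }}\) telescopes the left-hand sides to at most \(f(x_0)\), giving \(|{\mathcal {S}}_{i_{\epsilon }}|\le f(x_0)/\left( \eta _1 c_1\epsilon _g\min (\rho _{min},\epsilon _g/\kappa _H)\right) \), which has the form of the bound (3.20).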
We now need to count the number of iterations of Algorithm 1 which are not successful. Following [12], we count each iteration of the loop inside the criticality phase (Algorithm 2) as a separate iteration—in effect, one ‘iteration’ corresponds to one construction of the model \(m_k\) (2.5). We also consider separately the number of criticality phases for which \(\varDelta _k\) is not reduced (i.e. \(\varDelta _k=\varDelta _k^{init}\)). Counting until iteration \(i_{\epsilon }\) (inclusive), we let
-
\({\mathcal {C}}^M_{i_{\epsilon }}\) be the set of criticality phase iterations \(k\le i_{\epsilon }\) for which \(\varDelta _k\) is not reduced (i.e. the first iteration of every call of Algorithm 2—see Remark 2.5 for further details);
-
\({\mathcal {C}}^U_{i_{\epsilon }}\) be the set of criticality phase iterations \(k\le i_{\epsilon }\) where \(\varDelta _k\) is reduced (i.e. all iterations except the first for every call of Algorithm 2);
-
\({\mathcal {F}}_{i_{\epsilon }}\) be the set of iterations where the safety phase is called;
-
\({\mathcal {M}}_{i_{\epsilon }}\) be the set of iterations where the model-improving phase is called; and
-
\({\mathcal {U}}_{i_{\epsilon }}\) be the set of unsuccessful iterations.
Lemma 3.14
Suppose Assumptions 2.1, 3.1 and 3.4 hold. Then we have the bounds
$$\begin{aligned} |{\mathcal {C}}^U_{i_{\epsilon }}| + |{\mathcal {F}}_{i_{\epsilon }}| + |{\mathcal {U}}_{i_{\epsilon }}|&\le |{\mathcal {S}}_{i_{\epsilon }}|\cdot \frac{\log {\overline{\gamma }}_{inc}}{|\log \alpha _3|} + \frac{1}{|\log \alpha _3|}\log \left( \frac{\varDelta _0^{init}}{\rho _{min}}\right) , \end{aligned}$$
(3.24)
$$\begin{aligned} |{\mathcal {C}}^M_{i_{\epsilon }}|&\le |{\mathcal {F}}_{i_{\epsilon }}| + |{\mathcal {S}}_{i_{\epsilon }}| + |{\mathcal {U}}_{i_{\epsilon }}|, \end{aligned}$$
(3.25)
$$\begin{aligned} |{\mathcal {M}}_{i_{\epsilon }}|&\le |{\mathcal {C}}^M_{i_{\epsilon }}| + |{\mathcal {C}}^U_{i_{\epsilon }}| + |{\mathcal {F}}_{i_{\epsilon }}| + |{\mathcal {S}}_{i_{\epsilon }}| + |{\mathcal {U}}_{i_{\epsilon }}|, \end{aligned}$$
(3.26)
where \(\alpha _3:=\max (\omega _C, \omega _S, \gamma _{dec}, \alpha _2)<1\) and \(\rho _{min}\) is defined in (3.10).
Proof
On each iteration \(k\in {\mathcal {C}}^U_{i_{\epsilon }}\), we reduce \(\varDelta _k\) by a factor of \(\omega _C\). Similarly, on each iteration \(k\in {\mathcal {F}}_{i_{\epsilon }}\) we reduce \(\varDelta _k\) by a factor of at least \(\max (\omega _S, \alpha _2)\), and for iterations in \({\mathcal {U}}_{i_{\epsilon }}\) by a factor of at least \(\max (\gamma _{dec},\alpha _2)\). On each successful iteration, we increase \(\varDelta _k\) by a factor of at most \({\overline{\gamma }}_{inc}\), and on all other iterations, \(\varDelta _k\) is either constant or reduced. Therefore, we must have
$$\begin{aligned} \rho _{min}&\le \varDelta _{i_{\epsilon }} \le \varDelta _0^{init} \cdot \omega _C^{|{\mathcal {C}}^U_{i_{\epsilon }}|} \cdot \max (\omega _S, \alpha _2)^{|{\mathcal {F}}_{i_{\epsilon }}|} \cdot \max (\gamma _{dec}, \alpha _2)^{|{\mathcal {U}}_{i_{\epsilon }}|} \cdot {\overline{\gamma }}_{inc}^{|{\mathcal {S}}_{i_{\epsilon }}|}, \end{aligned}$$
(3.27)
$$\begin{aligned}&\le \varDelta _0^{init} \cdot \alpha _3^{|{\mathcal {C}}^U_{i_{\epsilon }}| + |{\mathcal {F}}_{i_{\epsilon }}| + |{\mathcal {U}}_{i_{\epsilon }}|} \cdot {\overline{\gamma }}_{inc}^{|{\mathcal {S}}_{i_{\epsilon }}|}, \end{aligned}$$
(3.28)
from which (3.24) follows.
After every call of the criticality phase, we have either a safety, successful or unsuccessful step, giving us (3.25). Similarly, after every model-improving phase, the next iteration cannot call a subsequent model-improving phase, giving us (3.26). \(\square \)
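The step from (3.28) to (3.24) can be made explicit by taking logarithms in (3.28):
$$\begin{aligned} \log \rho _{min} \le \log \varDelta _0^{init} + \left( |{\mathcal {C}}^U_{i_{\epsilon }}| + |{\mathcal {F}}_{i_{\epsilon }}| + |{\mathcal {U}}_{i_{\epsilon }}|\right) \log \alpha _3 + |{\mathcal {S}}_{i_{\epsilon }}|\log {\overline{\gamma }}_{inc}, \end{aligned}$$
and since \(\log \alpha _3=-|\log \alpha _3|<0\), rearranging gives exactly (3.24).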
Assumption 3.15
The algorithm parameter \(\epsilon _C\) satisfies \(\epsilon _C \ge c_3\epsilon \) for some constant \(c_3>0\).
Note that Assumption 3.15 can be easily satisfied by appropriate parameter choices in Algorithm 1.
Theorem 3.16
Suppose Assumptions 2.1, 3.1, 3.4 and 3.15 hold. Then the number of iterations \(i_{\epsilon }\) (i.e. the number of times a model \(m_k\) (2.5) is built) until \(\Vert \nabla f(x_k)\Vert \le \epsilon \) is at most
where \(c_4 :=\min \left( c_3, (1 + \kappa _{eg}\mu )^{-1}\right) \) and
$$\begin{aligned} c_5 :=\min \left( \frac{\omega _C}{\kappa _{eg}+1/\mu }, \frac{\alpha _1 c_4}{\kappa _H}, \alpha _1\left( \kappa _{eg} + \frac{2\kappa _{ef}}{c_1(1-\eta _2)}\right) ^{-1}\right) . \end{aligned}$$
(3.30)
Proof
From Assumption 3.15 and Lemma 3.7, we have \(\epsilon _g = c_4\epsilon \). Similarly, from Lemma 3.8 we have \(\rho _{min}=\min (\varDelta _0^{init}, c_5\epsilon )\). Thus using Lemma 3.14, we can bound the total number of iterations by
$$\begin{aligned}&|{\mathcal {C}}^M_{i_{\epsilon }}| + |{\mathcal {C}}^U_{i_{\epsilon }}| + |{\mathcal {F}}_{i_{\epsilon }}| + |{\mathcal {S}}_{i_{\epsilon }}| + |{\mathcal {M}}_{i_{\epsilon }}| + |{\mathcal {U}}_{i_{\epsilon }}| \end{aligned}$$
(3.31)
$$\begin{aligned}&\quad \le 4|{\mathcal {S}}_{i_{\epsilon }}| + 4\left( |{\mathcal {C}}^U_{i_{\epsilon }}| + |{\mathcal {F}}_{i_{\epsilon }}| + |{\mathcal {U}}_{i_{\epsilon }}|\right) , \end{aligned}$$
(3.32)
$$\begin{aligned}&\quad \le 4|{\mathcal {S}}_{i_{\epsilon }}|\left( 1 + \frac{\log {\overline{\gamma }}_{inc}}{|\log \alpha _3|}\right) + \frac{4}{|\log \alpha _3|}\log \left( \frac{\varDelta _0^{init}}{\rho _{min}}\right) , \end{aligned}$$
(3.33)
and so (3.29) follows from this and Lemma 3.13. \(\square \)
We can summarize our results as follows:
Corollary 3.17
Suppose Assumptions 2.1, 3.1, 3.4 and 3.15 hold. Then for \(\epsilon \in (0,1]\), the number of iterations \(i_{\epsilon }\) (i.e. the number of times a model \(m_k\) (2.5) is built) until \(\Vert \nabla f(x_k)\Vert \le \epsilon \) is at most \({\mathcal {O}}(\kappa _H \kappa _d^2 \epsilon ^{-2})\), and the number of objective evaluations until \(i_{\epsilon }\) is at most \({\mathcal {O}}(\kappa _H \kappa _d^2 n \epsilon ^{-2})\), where \(\kappa _d:=\max (\kappa _{ef},\kappa _{eg})={\mathcal {O}}(n L_J^2)\).
Proof
From Theorem 3.16, we have \(c_4^{-1}={\mathcal {O}}(\kappa _{eg})\) and so
$$\begin{aligned} c_5^{-1}={\mathcal {O}}(\max (\kappa _{eg}, \kappa _H c_4^{-1}, \kappa _{ef}+\kappa _{eg}))={\mathcal {O}}(\kappa _H \kappa _d). \end{aligned}$$
(3.34)
To leading order, the number of iterations is
$$\begin{aligned} {\mathcal {O}}(\max (\kappa _H c_4^{-2}, c_4^{-1}c_5^{-1})\epsilon ^{-2})={\mathcal {O}}(\kappa _H \kappa _d^2 \epsilon ^{-2}), \end{aligned}$$
(3.35)
as required. In every type of iteration, we change at most \(n+1\) points, and so require no more than \(n+1\) evaluations. The result \(\kappa _d={\mathcal {O}}(n L_J^2)\) follows from Lemma 3.3. \(\square \)
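For completeness, the estimate \(\kappa _d={\mathcal {O}}(nL_J^2)\) can be read off from (3.1)–(3.3), treating \(\varLambda \), \(\varDelta _{max}\), \(r_{max}\) and \(J_{max}\) as constants independent of n (an assumption on the problem scaling): since \(C={\mathcal {O}}(\varLambda )\),
$$\begin{aligned} \kappa _{eg}^r = \tfrac{1}{2}L_J\left( \sqrt{n}C+2\right) ={\mathcal {O}}(\sqrt{n}L_J) \qquad \text {and} \qquad \left( \kappa _{eg}^r\varDelta _{max}+J_{max}\right) ^2={\mathcal {O}}(nL_J^2), \end{aligned}$$
so, to leading order in n, both \(\kappa _{eg}\) and \(\kappa _{ef}\) in (3.3) are \({\mathcal {O}}(nL_J^2)\), and hence \(\kappa _d=\max (\kappa _{ef},\kappa _{eg})={\mathcal {O}}(nL_J^2)\).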
Remark 3.18
Theorem 3.16 gives us a possible termination criterion for Algorithm 1—we loop until k exceeds the value (3.29) or until \(\rho _k \le \rho _{min}\). However, this would require us to know problem constants \(\kappa _{ef}\), \(\kappa _{eg}\) and \(\kappa _H\) in advance, which is not usually the case. Moreover, (3.29) is a worst-case bound and so unduly pessimistic.
Remark 3.19
In [12], the authors propose a different criterion to test whether the criticality phase should be entered:
rather than
as found here and in [9]. We are able to use our criterion because of Assumption 3.15. If this did not hold, we would have \(\epsilon _g \ll \epsilon \) and so \(\rho _{min}\ll \epsilon \), which would worsen the result in Theorem 3.16. In practice, Assumption 3.15 is reasonable, as we would not expect a user to prescribe a criticality tolerance much smaller than their desired solution tolerance.
The standard complexity bound for first-order methods is \({\mathcal {O}}(\kappa _H \kappa _d^2 \epsilon ^{-2})\) iterations and \({\mathcal {O}}(\kappa _H \kappa _d^2 n \epsilon ^{-2})\) evaluations [12], where \(\kappa _d={\mathcal {O}}(\sqrt{n})\) and \(\kappa _H=1\). Corollary 3.17 gives us the same count of iterations and evaluations, but the worse bounds \(\kappa _d={\mathcal {O}}(n)\) and \(\kappa _H={\mathcal {O}}(\kappa _d)\), coming from the least-squares structure (Lemma 3.3).
However, our model (2.5) is better than a simple linear model for f, as it captures some of the curvature information in the objective via the term \(J_k^TJ_k\). This means that DFO-GN produces models which are between fully linear and fully quadratic [9, Definition 10.4], which is the requirement for convergence of second-order methods. It therefore makes sense to also compare the complexity of DFO-GN with the complexity of second-order methods.
Unsurprisingly, the standard bound for second-order methods is in general worse than for first-order methods, namely \({\mathcal {O}}(\max (\kappa _H \kappa _d^2, \kappa _d^3) \epsilon ^{-3})\) iterations and \({\mathcal {O}}(\max (\kappa _H \kappa _d^2, \kappa _d^3) n^2 \epsilon ^{-3})\) evaluations [16], where \(\kappa _d = {\mathcal {O}}(n)\), to achieve second-order criticality for the given objective. Note that here \(\kappa _d:=\max (\kappa _{ef}, \kappa _{eg}, \kappa _{eh})\) for fully quadratic models. If \(\Vert \nabla ^2 f\Vert \) is uniformly bounded, then we would expect \(\kappa _H={\mathcal {O}}(\kappa _{eh})={\mathcal {O}}(\kappa _d)\).
Thus DFO-GN has the iteration and evaluation complexity of a first-order method, but the problem constants (i.e. dependency on n) of a second-order method. That is, assuming \(\kappa _H={\mathcal {O}}(\kappa _d)\) (as suggested by Lemma 3.3), DFO-GN requires \({\mathcal {O}}(n^3 \epsilon ^{-2})\) iterations and \({\mathcal {O}}(n^4 \epsilon ^{-2})\) evaluations, compared to \({\mathcal {O}}(n\epsilon ^{-2})\) iterations and \({\mathcal {O}}(n^2\epsilon ^{-2})\) evaluations for a first-order method, and \({\mathcal {O}}(n^3\epsilon ^{-3})\) iterations and \({\mathcal {O}}(n^5\epsilon ^{-3})\) evaluations for a second-order method.
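Concretely, these dimension dependencies follow by substituting the estimates for \(\kappa _d\) and \(\kappa _H\) into the iteration bounds above and multiplying by the per-iteration evaluation cost (at most \(n+1\) for linear models, \({\mathcal {O}}(n^2)\) for quadratic models, as reflected in the evaluation bounds quoted above):
$$\begin{aligned} \text {DFO-GN}&: \kappa _d={\mathcal {O}}(n),\ \kappa _H={\mathcal {O}}(n) \implies {\mathcal {O}}(n^3\epsilon ^{-2})\ \text {iterations},\ {\mathcal {O}}(n^4\epsilon ^{-2})\ \text {evaluations}, \\ \text {first-order}&: \kappa _d={\mathcal {O}}(\sqrt{n}),\ \kappa _H=1 \implies {\mathcal {O}}(n\epsilon ^{-2})\ \text {iterations},\ {\mathcal {O}}(n^2\epsilon ^{-2})\ \text {evaluations}, \\ \text {second-order}&: \kappa _d={\mathcal {O}}(n),\ \kappa _H={\mathcal {O}}(n) \implies {\mathcal {O}}(n^3\epsilon ^{-3})\ \text {iterations},\ {\mathcal {O}}(n^5\epsilon ^{-3})\ \text {evaluations}. \end{aligned}$$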
Remark 3.20
In Lemma 3.3, we used the result \(C={\mathcal {O}}(\varLambda )\) whenever \(Y_k\) is \(\varLambda \)-poised, and wrote \(\kappa _{eg}\) in terms of C; see Appendix A for details on the provenance of C with respect to the interpolation system (2.3). Our approach here matches the presentation of the first- and second-order complexity bounds from [12, 16]. However, [9, Theorem 3.14] shows that C may also depend on n. Including this dependence, we have \(C={\mathcal {O}}(\sqrt{n}\,\varLambda )\) for DFO-GN and general first-order methods, and \(C={\mathcal {O}}(n^2 \varLambda )\) for general second-order methods (where C is now adapted for quadratic interpolation). This would yield the alternative bounds \(\kappa _d={\mathcal {O}}(n)\) for first-order methods, \({\mathcal {O}}(n^2)\) for DFO-GN and \({\mathcal {O}}(n^3)\) for second-order methods. Either way, we conclude that the complexity of DFO-GN lies between first- and second-order methods.
Remark 3.21
(Discussion of Assumption 3.4) It is also important to note that when \(m_k\) is fully linear, we have an explicit bound \(\Vert {H}_{k}\Vert \le {\widetilde{\kappa }}_{H}={\mathcal {O}}(\kappa _d)\) from Lemma 3.3. This means that Assumption 3.4, which is typically necessary for first-order convergence (e.g. [9, 12]), is not required for Theorem 3.11 and our complexity analysis. To remove the assumption, we need to change Algorithm 1 in two places:
-
1.
Replace the test for entering the criticality phase with
-
2.
Require the criticality phase to output \(m_k\) fully linear and \(\varDelta _k\) satisfying
With these changes, the criticality phase still terminates, but instead of (B.1) we have
$$\begin{aligned} \min \left( \varDelta _k^{init}, \frac{\omega _C\epsilon }{\kappa _{eg}+1/\mu }, \frac{\omega _C\epsilon }{\kappa _{eg}+{\widetilde{\kappa }}_{H}/\mu }\right) \le \varDelta _k \le \varDelta _k^{init}. \end{aligned}$$
(3.38)
We can also augment Lemma 3.7 with the following, which can be used to arrive at a new value for \(\rho _{min}\).
Lemma 3.22
In all iterations,
. If
then
Ultimately, we arrive at complexity bounds which match Corollary 3.17, but replacing \(\kappa _H\) with \({\widetilde{\kappa }}_{H}\). However, Assumption 3.4 is still necessary for Theorem 3.12 to hold.