A Derivative-Free Gauss-Newton Method

We present DFO-GN, a derivative-free version of the Gauss-Newton method for solving nonlinear least-squares problems. As is common in derivative-free optimization, DFO-GN uses interpolation of function values to build a model of the objective, which is then used within a trust-region framework to give a globally-convergent algorithm requiring $O(\epsilon^{-2})$ iterations to reach approximate first-order criticality within tolerance $\epsilon$. This algorithm is a simplification of the method from [H. Zhang, A. R. Conn, and K. Scheinberg, A Derivative-Free Algorithm for Least-Squares Minimization, SIAM J. Optim., 20 (2010), pp. 3555-3576], where we replace quadratic models for each residual with linear models. We demonstrate that DFO-GN performs comparably to the method of Zhang et al. in terms of objective evaluations, as well as having a substantially faster runtime and improved scalability.


Introduction
Over the last 15–20 years, there has been a resurgence of interest and increased effort devoted to developing efficient methods for derivative-free optimization (DFO); that is, optimizing an objective using only function values. These methods are useful in many applications [7], for instance, when the objective function is a black-box function or legacy code (meaning manual computation of derivatives or algorithmic differentiation is impractical), has stochastic noise (so finite differencing is inaccurate), or is expensive to compute (so the evaluation of a full n-dimensional gradient is intractable). There are several popular classes of DFO methods, such as direct and pattern search, model-based, and evolutionary algorithms [22,15,26,9]. Here, we consider model-based methods, which capture curvature in the objective well [9] and have been shown to have good practical performance [18].
Model-based methods typically use a trust-region framework for selecting new iterates, which ensures global convergence provided we can build a sufficiently accurate model for the objective [3]. The model-building process most commonly uses interpolation of quadratic functions, as originally proposed by Winfield [33] and later developed by Conn, Scheinberg and Toint [8,4] and Powell [21,23]. Another common choice for model-building is to use radial basis functions [31,20]. Global convergence results exist in both cases [6,7,32]. Several codes for model-based DFO are available, including those by Powell [35] and others (see e.g. [7,9] and references therein).
Summary of contributions In this work, we consider nonlinear least-squares minimization, without constraints in the theoretical developments but allowing bounds in the implementation.
Model-based DFO is naturally suited to exploiting problem structure, and in this work we propose a method inspired by the classical Gauss-Newton method for derivative-based optimization (e.g. [19, Chapter 10]). This method, which we call DFO-GN (Derivative-Free Optimization using Gauss-Newton), is a simplification of the method by Zhang, Conn and Scheinberg [34]. It constructs linear interpolants for each residual, requiring exactly n + 1 points on each iteration, which is fewer than in common proposals that generally insist on (partial or full) quadratic local models for each residual. In addition to proving theoretical guarantees for DFO-GN in terms of global convergence and worst-case complexity, we provide an implementation that is a modification of Powell's BOBYQA and that we extensively test and compare with existing state-of-the-art DFO solvers. We show that little to nothing is lost by our simplified approach in terms of algorithm performance on a given evaluation budget, when applied to smooth and noisy, zero- and nonzero-residual problems. Furthermore, significant gains are made in terms of the reduced computational cost of the interpolation problem (leading to a runtime reduction of at least a factor of 7) and the memory cost of storing the models. These savings result in a substantially faster runtime and improved scalability of DFO-GN compared to the implementation DFBOLS of [34].
Relevant existing literature. In [34], each residual function is approximated by a quadratic interpolating model, using function values from p ∈ [n + 1, (n + 1)(n + 2)/2] points. A quadratic (or higher-order) model for the overall least-squares objective is built from the models for each residual function, which takes into account full quadratic terms in the models asymptotically but allows the use of simpler models early on in the run of the algorithm. The DFBOLS implementation in [34] is shown to perform better than Powell's BOBYQA on a standard least-squares test set. A similar derivative-free framework for nonlinear least-squares problems is POUNDERS by Wild [30]: it also constructs quadratic interpolation models for each residual, but takes them all into account in the objective model construction on each iteration. Its implementation allows parallel computation of each residual component, and accepts previously computed evaluations as input, providing extra information for the solver. We also note the connection to [2], which considers a Levenberg-Marquardt method for nonlinear least-squares when gradient evaluations are noisy; the framework is that of probabilistic local models, and it uses a regularization parameter rather than a trust region to ensure global convergence. That algorithm is applied and further developed for data assimilation problems, with careful quantification of noise and algorithm parameters. Using linear vector models for objectives which are a composition of a (possibly nonconvex) vector function with a (possibly nonsmooth) convex function, such as a sum of squares, was also considered in [10]. There, worst-case complexity bounds for a general model-based trust-region DFO method applied to such objectives are established. Our approach differs in that it is designed specifically for nonlinear least-squares, and uses an algorithmic framework that is much closer to the software of Powell [27].
Finally, we note a mild connection to the approach in [1], where multiple solutions to nonlinear inverse problems are sought by means of a two-phase method, where in the first phase, low accuracy solutions are obtained by building a linear regression model from a (large) cloud of points and moving each point to its corresponding, slightly perturbed, Gauss-Newton step.
Further details of contributions. In terms of theoretical guarantees, we extend the global convergence results in [34] to allow inexact solutions of the trust-region subproblem (given by the usual Cauchy decrease condition), a simplification of the so-called 'criticality phase' and 'safety step', and prove first-order convergence of the whole sequence of iterates x_k rather than of a subsequence. We also provide a worst-case complexity analysis and show an iteration count which matches that of Garmanjani, Júdice and Vicente [10], but with problem constants that correspond to second-order methods. This reflects the fact that we capture some of the curvature in the objective (since linear models for the residuals still give an approximately quadratic model for the least-squares objective), and so the complexity of DFO-GN sits between first- and second-order methods.
In the DFO-GN implementation, which is very much the focus of this work, the simplification from quadratic to linear models leads to a confluence of two approaches for analysing and improving the geometry of the interpolation set. We compare DFO-GN to Powell's general DFO solver BOBYQA and to the least-squares DFO solvers DFBOLS [34] (Fortran), POUNDERS [30] and our Python DFBOLS re-implementation Py-DFBOLS. The primary test set is that of Moré & Wild [18], where we additionally consider noisy variants of each problem, perturbing the test set appropriately with unbiased Gaussian noise (multiplicative and additive) and with additive χ² noise; we solve to low as well as high accuracy requirements for a given evaluation budget. We find, and show by means of performance and data profiles, that DFO-GN performs comparably to the best of these solvers in terms of objective evaluations, albeit with a small penalty for objectives with additive stochastic noise and an even smaller penalty for nonzero-residual problems. We then perform a runtime comparison between DFO-GN and Py-DFBOLS on the same test set and settings, comparing like for like, and find that DFO-GN is at least 7 times faster; see Table 1 for details. We further investigate the scalability of DFO-GN. We compare memory requirements and runtime for DFO-GN and DFBOLS on a particular nonlinear equation problem from CUTEst with growing problem dimension n; we find that both of these increase much more rapidly for the latter than for the former (for example, DFO-GN's runtime for n = 2500 is 2.5 times faster than the Fortran DFBOLS' runtime for n = 1400). To further illustrate that the improved scalability of DFO-GN does not come at the cost of performance, we compare the evaluation performance of DFO-GN and DFBOLS on 60 medium-size least-squares problems from CUTEst and find similarly good behaviour of DFO-GN as on the Moré & Wild set.
Implementation. Our Python implementation of DFO-GN is available on GitHub, and is released under the open-source GNU General Public License.
Structure of paper. In Section 2 we state the DFO-GN algorithm. We prove its global convergence to first-order critical points and worst case complexity in Section 3. Then we discuss the differences between this algorithm and its software implementation in Section 4. Lastly, in Section 5, we compare DFO-GN to other model-based derivative-free least-squares solvers on a selection of test problems, including noisy and higher-dimensional problems. We draw our conclusions in Section 6.

DFO-GN Algorithm
Here, our focus is unconstrained nonlinear least-squares minimization,

min_{x ∈ R^n} f(x) := (1/2) ‖r(x)‖² = (1/2) Σ_{i=1}^m r_i(x)², (2.1)

where r(x) := (r_1(x), . . . , r_m(x))^T maps R^n → R^m and is continuously differentiable with m × n Jacobian matrix [J(x)]_{i,j} = ∂r_i(x)/∂x_j, although these derivatives are not available. Typically m ≥ n in practice, but we do not require this for our method. Throughout, ‖·‖ refers to the Euclidean norm for vectors or the largest singular value for matrices, unless otherwise stated, and we define B(x, ∆) := {y ∈ R^n : ‖y − x‖ ≤ ∆} to be the closed ball of radius ∆ > 0 about x ∈ R^n.
In this section, we introduce the DFO-GN algorithm for solving (2.1) using linear interpolating models for r.

Linear Residual Models
In the classical Gauss-Newton method, we approximate r in the neighbourhood of an iterate x k by its linearization: r(y) ≈ r(x k ) + J(x k )(y − x k ), where J(x) ∈ R m×n is the Jacobian matrix of first derivatives of r. For DFO-GN, we use a similar approximation, but replace the Jacobian with an approximation to it calculated by interpolation.
Assume at iteration k we have a set of n + 1 interpolation points Y_k := {y_0, . . . , y_n} in R^n at which we have evaluated r. This set always includes the current iterate; for simplicity of notation, we assume y_0 = x_k. We then build the model

m_k(s) := r(x_k) + J_k s, (2.2)

by finding the unique J_k ∈ R^{m×n} satisfying the interpolation conditions

m_k(y_t − x_k) = r(y_t), for t = 1, . . . , n, (2.3)

noting that the remaining interpolation condition m_k(0) = r(x_k) is automatically satisfied by (2.2). We can find J_k by solving, for each i = 1, . . . , m, the n × n system

W_k j_{k,i} = (r_i(y_1) − r_i(x_k), . . . , r_i(y_n) − r_i(x_k))^T, where W_k has rows (y_t − x_k)^T, (2.4)

and where the rows of J_k are j_{k,i}^T. This system is invertible whenever the set of vectors {y_1 − x_k, . . . , y_n − x_k} is linearly independent. We ensure this in the algorithm by routines which improve the geometry of Y_k (in a specific sense to be discussed in Section 2.3).
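As an illustration, the system (2.4) can be solved for all m residuals at once by stacking the right-hand sides. The sketch below is illustrative only (names such as `interp_jacobian` are ours, not the DFO-GN implementation), and assumes the displacement matrix is invertible:

```python
import numpy as np

def interp_jacobian(xk, Y, r):
    """Estimate the m x n Jacobian of r at xk by linear interpolation.

    Y is a list of the n interpolation points other than xk; r is a
    callable returning an m-vector. Solves the n x n system whose rows
    are (y_t - xk)^T for all residuals simultaneously (a sketch, not
    the DFO-GN implementation)."""
    W = np.array([y - xk for y in Y])        # n x n displacement matrix
    R = np.array([r(y) - r(xk) for y in Y])  # n x m stacked right-hand sides
    # Solve W @ Jk.T = R, i.e. Jk (y_t - xk) = r(y_t) - r(xk) for each t
    return np.linalg.solve(W, R).T           # m x n interpolated Jacobian
```

For an affine r, this recovers the true Jacobian exactly, which is a convenient sanity check on any implementation.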
Having constructed the linear models (2.2) for each residual, we need to construct a model for the full objective f. To do this we simply take the sum of squares of the residual models, namely

m_k(s) := (1/2) ‖m_k(s)‖² = f(x_k) + g_k^T s + (1/2) s^T H_k s, (2.5)

where g_k := J_k^T r(x_k) and H_k := J_k^T J_k.
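A minimal sketch of forming the Gauss-Newton model (2.5) from the residual vector and interpolated Jacobian (the function name `gn_model` is illustrative):

```python
import numpy as np

def gn_model(rk, Jk):
    """Return a callable m_k(s) and the pair (g_k, H_k) for the
    Gauss-Newton model of f(x) = 0.5*||r(x)||^2 (illustrative sketch)."""
    fk = 0.5 * rk @ rk
    gk = Jk.T @ rk   # model gradient g_k = J_k^T r(x_k)
    Hk = Jk.T @ Jk   # model Hessian  H_k = J_k^T J_k (positive semidefinite)
    mk = lambda s: fk + gk @ s + 0.5 * s @ Hk @ s
    return mk, gk, Hk
```

Note that H_k is positive semidefinite by construction, so the model m_k is convex even though f itself need not be.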

Trust Region Framework
The DFO-GN algorithm is based on a trust-region framework [3]. In such a framework, we use our model (2.5) for the objective, and maintain a parameter ∆_k > 0 which characterizes the region in which we 'trust' our model to be a good approximation to the objective; the resulting 'trust region' is B(x_k, ∆_k). At each iteration, we use our model to find a new point where we expect the objective to decrease, by (approximately) solving the 'trust region subproblem'

s_k ≈ argmin_{‖s‖ ≤ ∆_k} m_k(s). (2.6)

If this new point x_k + s_k gives a sufficient objective reduction, we accept the step (x_{k+1} ← x_k + s_k); otherwise we reject the step (x_{k+1} ← x_k). We also use this information to update the trust region radius ∆_k. The measure of 'sufficient objective reduction' is the ratio

r_k = [f(x_k) − f(x_k + s_k)] / [m_k(0) − m_k(s_k)]. (2.7)

This framework applies to both derivative-based and derivative-free settings. However, in a DFO setting, we also need to update the interpolation set Y_k to incorporate the new point x_k + s_k, and include steps to ensure the geometry of Y_k does not become degenerate (see Section 2.3).
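The accept/reject test and radius update can be sketched as follows. This is a generic trust-region update, not DFO-GN's actual logic (which also involves the lower bound ρ_k and geometry phases); the parameter names and default values are illustrative:

```python
import numpy as np

def tr_update(f, xk, sk, mk, Dk,
              eta1=0.1, eta2=0.7, gamma_dec=0.5, gamma_inc=2.0):
    """One accept/reject and radius update of a basic trust-region
    iteration (generic sketch, not Algorithm 1 itself)."""
    pred = mk(np.zeros_like(sk)) - mk(sk)  # predicted decrease (> 0 by (2.8))
    actual = f(xk) - f(xk + sk)            # actual decrease
    rk = actual / pred                     # sufficient-decrease ratio (2.7)
    if rk >= eta1:
        xk = xk + sk                       # accept the step
        if rk >= eta2:
            Dk = gamma_inc * Dk            # very successful: grow the region
    else:
        Dk = gamma_dec * Dk                # reject and shrink the region
    return xk, Dk
```

When the model is exact, the ratio r_k equals 1 and the step is always very successful, which is the behaviour the fully linear bounds (2.11)-(2.12) approximate for small ∆_k.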
A minimal requirement on the calculation of s k to ensure global convergence is the following.
Assumption 2.1. Our method for solving (2.6) gives a step s_k satisfying the sufficient ('Cauchy') decrease condition

m_k(0) − m_k(s_k) ≥ c_1 ‖g_k‖ min(∆_k, ‖g_k‖ / max(‖H_k‖, 1)), (2.8)

for some c_1 ∈ (0, 1] independent of k. This standard condition is not onerous, and can be achieved with c_1 = 1/2 by one iteration of steepest descent with exact linesearch applied to the model m_k [3]. In Zhang et al. [34], convergence is only proven when (2.6) is solved to full optimality; here we use this weaker assumption.
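The steepest-descent construction mentioned above can be made concrete: minimize the model exactly along −g_k inside the trust region. This is the standard Cauchy-point computation (an illustrative sketch; `cauchy_step` is our name):

```python
import numpy as np

def cauchy_step(gk, Hk, Dk):
    """Exact linesearch along -g_k within the trust region B(0, Dk):
    this achieves the Cauchy decrease condition with c_1 = 1/2
    (standard construction; sketch only)."""
    gnorm = np.linalg.norm(gk)
    gHg = gk @ Hk @ gk
    t = Dk / gnorm                       # step to the trust-region boundary
    if gHg > 0:
        t = min(t, gnorm**2 / gHg)       # unconstrained 1-D minimizer
    return -t * gk
```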

Geometry Considerations
It is crucial that model-based DFO algorithms ensure the geometry of Y k does not become degenerate; an example where ignoring geometry causes algorithm failure is given by Scheinberg and Toint [28].
To describe the notion of 'good' geometry, we need the Lagrange polynomials of Y_k. In our context of linear approximation, the Lagrange polynomials are the basis {Λ_0(x), . . . , Λ_n(x)} for the (n + 1)-dimensional space of linear functions on R^n defined by

Λ_l(y_t) = δ_{l,t}, for all l, t ∈ {0, . . . , n}. (2.9)

Such polynomials exist whenever the matrix in (2.4) is invertible [7, Lemma 3.2]; when this condition holds, we say that Y_k is poised for linear interpolation. The notion of geometry quality is then given by the following [5]: for Λ > 0, we say that Y_k is Λ-poised in B(x_k, ∆_k) if

max_{l = 0, . . . , n} max_{y ∈ B(x_k, ∆_k)} |Λ_l(y)| ≤ Λ. (2.10)
In general, if Y_k is Λ-poised with a small Λ, then Y_k has 'good' geometry, in the sense that linear interpolation using the points of Y_k produces a more accurate model. The notion of model accuracy we use is given in [5,6]: a model m_k for f is fully linear in B(x_k, ∆_k) if

|f(x_k + s) − m_k(s)| ≤ κ_ef ∆_k², (2.11)
‖∇f(x_k + s) − ∇m_k(s)‖ ≤ κ_eg ∆_k, (2.12)

for all ‖s‖ ≤ ∆_k, where κ_ef and κ_eg are independent of s, x_k and ∆_k.
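For linear interpolation, the Λ-poisedness constant can be computed directly: each Lagrange polynomial is linear, so its maximum absolute value over a ball is available in closed form. A sketch (assuming Y is poised, so the interpolation matrix is invertible; the function name is ours):

```python
import numpy as np

def lambda_poisedness(xk, Y, Dk):
    """Compute the Lambda-poisedness constant of Y = {y_0,...,y_n} in
    B(xk, Dk) for linear interpolation. Each Lagrange polynomial is
    linear, Lam_l(y) = c_l + g_l.y, so its max abs value over the ball
    is |Lam_l(xk)| + Dk*||g_l||  (illustrative sketch)."""
    M = np.array([np.concatenate(([1.0], y)) for y in Y])  # (n+1)x(n+1)
    C = np.linalg.inv(M)   # column l holds the coefficients of Lam_l
    Lam = 0.0
    for l in range(len(Y)):
        val_at_xk = C[0, l] + C[1:, l] @ xk   # Lam_l(xk)
        grad = C[1:, l]                       # gradient of Lam_l
        Lam = max(Lam, abs(val_at_xk) + Dk * np.linalg.norm(grad))
    return Lam
```

For the canonical simplex {0, e_1, . . . , e_n} with x_k = 0 and ∆_k = 1, this returns 1 + √n, a well-conditioned configuration.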
In the case of a vector model, such as (2.2), we use an analogous definition from [13], which is equivalent, up to a change in constants, to the definition in [10].

Definition 2.4 (Fully linear, vector function). A vector model m_k is a fully linear model for r in B(x_k, ∆_k) if

‖r(x_k + s) − m_k(s)‖ ≤ κ_ef^r ∆_k² and ‖J(x_k + s) − J_m(s)‖ ≤ κ_eg^r ∆_k,

for all ‖s‖ ≤ ∆_k, where J_m is the Jacobian of m_k, and κ_ef^r and κ_eg^r are independent of s, x_k and ∆_k.
In Section 3.1, we show that if Y_k is Λ-poised, then m_k (2.2) and m_k (2.5) are fully linear in B(x_k, ∆_k), with constants that depend on Λ.

Full Algorithm Specification
A full description of the DFO-GN algorithm is provided in Algorithm 1.
In each iteration, if ‖g_k‖ is small, we apply a 'criticality phase'. This ensures that ∆_k is comparable in size to ‖g_k‖, which makes ∆_k, as well as ‖g_k‖, a good measure of progress towards optimality. After computing the trust region step s_k, we then apply a 'safety phase', also originally from Powell [24]. In this phase, we check whether s_k is too small compared to the lower bound ρ_k on the trust-region radius, and if so we reduce ∆_k and improve the geometry of Y_k, without evaluating r(x_k + s_k). The intention of this step is to detect, without the wasted effort of an objective evaluation, situations where the trust region step is unlikely to provide sufficient function decrease. If the safety phase is not called, we evaluate r(x_k + s_k) and determine how good the trust region step was, accepting any point which achieves sufficient objective decrease. There are two possible causes of the situation r_k < η_1 (i.e. a 'bad' trust region step): the interpolation set is not good enough, or ∆_k is too large. We first check the quality of the interpolation set, and only reduce ∆_k if necessary.
An important feature of DFO-GN, due to Powell [24], is that it maintains not only the (usual) trust region radius ∆_k (used in (2.6) and in checking Λ-poisedness), but also a lower bound on it, ρ_k. This mechanism is useful when we reject the trust region step but the geometry of Y_k is not good (the 'model improvement phase'). In this situation, we do not want to shrink ∆_k too much, because the step was likely rejected because of the poor geometry of Y_k, not because the trust region was too large. The algorithm floors ∆_k at ρ_k, and only shrinks ∆_k when we reject the trust region step and the geometry of Y_k is good (so the model m_k is accurate); in this situation, we know that reducing ∆_k will actually be useful.
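The interplay between ∆_k and ρ_k can be sketched as follows. This is only a schematic of the two-level radius logic described above; the parameter names and values (alpha1, alpha2, etc.) are illustrative, not those of Algorithm 1:

```python
def radii_update(ratio, Dk, rho_k, geometry_good,
                 eta1=0.1, gamma_dec=0.5, gamma_inc=2.0,
                 alpha1=0.1, alpha2=0.5):
    """Sketch of the two-level radius update: Dk is floored at rho_k,
    and rho_k itself only shrinks when the step fails while the model
    is trustworthy (illustrative parameter names, not Algorithm 1's)."""
    if ratio >= eta1:
        Dk = max(gamma_inc * Dk, rho_k)     # successful: possibly grow Dk
    elif not geometry_good:
        pass                                # model-improving phase: fix Y_k, keep radii
    elif Dk > rho_k:
        Dk = max(gamma_dec * Dk, rho_k)     # shrink Dk, floored at rho_k
    else:
        # Dk == rho_k and the model is accurate: reduce both radii
        Dk, rho_k = alpha2 * rho_k, alpha1 * rho_k
    return Dk, rho_k
```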

[Display of Algorithm 1, lines 16–24, omitted: these lines accept or reject the step and update the trust region radius according to the ratio r_k and the thresholds η_1 ≤ η_2.]

Remark 2.5. There are two different geometry-improving phases in Algorithm 1. The first modifies Y_k to ensure it is Λ-poised in B(x_k, ∆_k), and is called in the safety and model improvement phases. This can be achieved by [7, Algorithm 6.3], for instance, where the number of interpolation systems (2.4) to be solved depends only on Λ and n [7, Theorem 6.3].
The second, called in the criticality phase, also ensures Y_k is Λ-poised, but it additionally modifies ∆_k to ensure ∆_k ≤ µ‖g_k‖. This is a more complicated procedure [7, Algorithm 10.2], as there is a coupling between ∆_k and Y_k: ensuring Λ-poisedness in B(x_k, ∆_k) depends on ∆_k, and since g_k depends on Y_k, ∆_k in turn depends on Y_k. Full details of how to achieve this are given in Appendix B, where we show that this procedure terminates as long as ∇f(x_k) ≠ 0, and where we also prove a lower bound (2.16) on the resulting ∆_k. If the procedure terminates in one iteration, then ∆_k = ∆_k^init, and we have simply made Y_k Λ-poised, just as in the model-improving phase. Otherwise, we perform one of these model-improving iterations, followed by several iterations in which ∆_k is reduced and Y_k is made Λ-poised. The bound (2.16) tells us that these unsuccessful-type iterations do not occur when ∆_k^init (but not ∇f(x_k)) is sufficiently small.

Remark 2.6. In Lemma 3.4, we show that if Y_k is Λ-poised, then m_k is fully linear with constants that depend on Λ. For the highest level of generality, one may replace 'make Y_k Λ-poised' with 'make m_k fully linear' throughout Algorithm 1. Any strategy which achieves fully linear models would be sufficient for the convergence results in Section 3.
Remark 2.7. There are several differences between Algorithm 1 and its implementation, which we detail fully in Section 4. In particular, there is no criticality phase in the DFO-GN implementation, as we found it is not needed, but the safety step is preserved to keep continuity with the BOBYQA framework; also, the geometry-improving phases are replaced by a simplified calculation.

Convergence and complexity results
We first outline the connection between Λ-poisedness of Y_k and fully linear models. We then prove global convergence of Algorithm 1 (i.e. convergence from any starting point x_0) to first-order critical points, and determine its worst-case complexity.

Interpolation Models are Fully Linear
To begin, we require some assumptions on the smoothness of r.
Assumption 3.1. The function r is C^1 and its Jacobian J(x) is Lipschitz continuous in B, the convex hull of ∪_k B(x_k, ∆_max), with constant L_J. We also assume that r(x) and J(x) are uniformly bounded in the same region; i.e. ‖r(x)‖ ≤ r_max and ‖J(x)‖ ≤ J_max for all x ∈ B.
If the level set L := {x ∈ R^n : f(x) ≤ f(x_0)} is bounded, then x_k ∈ L for all k, so B is compact, from which Assumption 3.1 follows.
Proof. We choose x, y ∈ B and use the Fundamental Theorem of Calculus to compute ‖r(y) − r(x)‖ ≤ J_max ‖y − x‖. Now we use this, the identity ‖A‖ = ‖A^T‖, and ∇f(x) = J(x)^T r(x) to compute

‖∇f(y) − ∇f(x)‖ ≤ ‖J(y)‖ ‖r(y) − r(x)‖ + ‖J(y) − J(x)‖ ‖r(x)‖ ≤ (J_max² + r_max L_J) ‖y − x‖,

from which we recover (3.1).
We now state the connection between Λ-poisedness of Y_k and full linearity of the models m_k (2.2) and m_k (2.5).
Lemma 3.4. Suppose Assumption 3.1 holds and Y_k is Λ-poised in B(x_k, ∆_k). Then m_k (2.2) and m_k (2.5) are fully linear in B(x_k, ∆_k), with constants κ_ef and κ_eg in (2.11) and (2.12) that depend on Λ and on L_∇f from (3.1). We also have the bound ‖H_k‖ ≤ (κ_eg^r ∆_max + J_max)², independent of x_k, Y_k and ∆_k.
Proof. See Appendix A.

Global Convergence of DFO-GN
We begin with some nomenclature to describe certain iterations; we call an iteration (for which the safety phase is not called):
• 'successful' if x_{k+1} = x_k + s_k (i.e. r_k ≥ η_1), and 'very successful' if r_k ≥ η_2; let S be the set of successful iterations k;
• 'model-improving' if r_k < η_1 and the model-improvement phase is called (i.e. Y_k is not Λ-poised in B(x_k, ∆_k)); and
• 'unsuccessful' if r_k < η_1 and the model-improvement phase is not called.
The results below are largely based on corresponding results in [34,7].
Assumption 3.5. We assume that ‖H_k‖ ≤ κ_H for all k, for some κ_H ≥ 1.
Lemma 3.6. Suppose Assumption 2.1 holds. If the model m_k is fully linear in B(x_k, ∆_k) and ∆_k is sufficiently small relative to ‖g_k‖ (in particular, ∆_k ≤ ‖g_k‖/max(‖H_k‖, 1)), then either the k-th iteration is very successful or the safety phase is called.
Proof. We compute (3.10). By assumption, ∆_k ≤ ‖g_k‖ / max(‖H_k‖, 1); applying this to (2.8), and then using full linearity (2.11), we get r_k ≥ η_2. Thus the iteration is very successful if ‖s_k‖ ≥ γ_S ρ_k; otherwise the safety phase is called.
The next result provides a lower bound on the size of the trust region step ‖s_k‖, which we will later use to show that the safety phase is not called when ‖g_k‖ is bounded away from zero and ∆_k is sufficiently small.
Proof. For convenience of notation, let h_k := max(‖H_k‖, 1). Substituting this into (2.8), we obtain a quadratic condition on ‖s_k‖; for this to be satisfied, we require a lower bound on ‖s_k‖, from which we recover (3.13).
Proof. Firstly, if the criticality phase is not called, then we must have ‖g_k^init‖ ≥ ε_C. To show (3.19), first suppose ‖g_k^init‖ ≥ ε_C. Then g_k = g_k^init and (3.19) holds. Otherwise, the criticality phase is called and m_k is fully linear in B(x_k, ∆_k) with ∆_k ≤ µ‖g_k‖. In this case, ‖g_k‖ ≥ ‖∇f(x_k)‖ − κ_eg ∆_k ≥ ε − κ_eg µ‖g_k‖, and so ‖g_k‖ ≥ ε/(1 + κ_eg µ) and (3.19) holds.
Lemma 3.9. Suppose Assumptions 2.1, 3.1 and 3.5 hold. If ‖∇f(x_k)‖ ≥ ε > 0 for all k, then ρ_k ≥ ρ_min > 0 for all k, where ρ_min is given in (3.21).

Proof. From Lemma 3.8, we also have ‖g_k‖ ≥ ε_g > 0 for all k. To find a contradiction, let k_0 be the first k such that ρ_k < ρ_min. In the former case, there is nothing to prove; in the latter, using Lemma B.1, we have ρ_{k_0} < ρ_{k_0−1}. This reduction in ρ can only happen from a safety step or an unsuccessful step, and since γ_S < 1 and γ_dec < 1, ∆_{k_0−1} must be small. Hence by Lemma 3.7 we have ‖s_{k_0−1}‖ ≥ c_2 ∆_{k_0−1}, where in the last inequality we used the choice of γ_S in Algorithm 1. This inequality and the choice of γ_S, together with (3.26), also imply (3.27). If m_{k_0−1} were not fully linear, then we would have either a successful or a model-improving iteration, and ρ would not be reduced. Thus m_{k_0−1} must be fully linear. Now suppose (3.28) holds. Then using full linearity, we contradict (3.27). That is, (3.28) is false, and together with (3.27) we have (3.8). Hence Lemma 3.6 implies iteration k_0 − 1 was very successful (as we have already established that the safety phase was not called), so ρ_{k_0} = ρ_{k_0−1} ≥ ρ_min, a contradiction.

Our first convergence result considers the case where we have finitely many successful iterations.
Proof. Let k 0 = max(S) be the last successful iteration, after which ∆ k is never increased. For any k > k 0 , we possibly call the criticality phase, and then have either a safety phase, model-improving phase, or an unsuccessful step.
If the model is not fully linear, then either it is made fully linear by the criticality phase, or we have a safety or model-improving step. In the first case, the model is made fully linear at iteration k; in the second and third, it is fully linear at iteration k + 1. That is, there is at most one iteration until the model is fully linear again. Therefore there are infinitely many k > k_0 where m_k is fully linear and we have either a safety phase or an unsuccessful step. In both of these cases, ∆_k is multiplied by a factor of at most max(γ_dec, α_2, ω_S) < 1, so ∆_k → 0 as k → ∞. Since ρ_k ≤ ∆_k at all iterations, we must also have ρ_k → 0.
For each k > k_0, let j_k be the first iteration after k where the model is fully linear. Then from the above discussion we know 0 ≤ j_k − k ≤ 1, and hence (3.30) holds. As k → ∞, the first term of the right-hand side of (3.30) is bounded by L_∇f ∆_k → 0, while the second term is bounded by κ_eg ∆_{j_k} → 0; thus it remains to show that the last term goes to zero. By contradiction, suppose there exist ε > 0 and a subsequence k_i such that ‖g_{j_{k_i}}‖ ≥ ε > 0. Then Lemma 3.6 implies that for sufficiently small ∆_{j_{k_i}} (valid since ∆_{j_{k_i}} → 0), we get a very successful iteration or a safety step. Since j_{k_i} ≥ k_i > k_0, this must mean we get a safety step. However, Lemma 3.7 implies that for sufficiently large i, we have ‖s_{j_{k_i}}‖ ≥ c_2 ∆_{j_{k_i}} > γ_S ρ_{j_{k_i}}, so the safety step cannot be called, a contradiction.

Proof. If |S| < ∞, the proof of Lemma 3.10 gives the result. Thus, suppose there are infinitely many successful iterations (i.e. |S| = ∞).
For any k ∈ S, we have the sufficient decrease condition (2.8); since ‖g_k‖ ≥ ε_g (Lemma 3.8), this means that f(x_k) − f(x_{k+1}) is bounded below in terms of ∆_k. If we sum over all k ∈ S, the left-hand side must be finite, as it is bounded above by f(x_0), remembering that f ≥ 0 for least-squares objectives. The right-hand side is only finite if lim_{k∈S, k→∞} ∆_k = 0. The only time ∆_k is increased is when k ∈ S, and then it is increased by a factor of at most γ_inc. For any given k ∉ S, let j_k ∈ S be the last successful iteration before k (which exists whenever k is sufficiently large).

Proof. If |S| < ∞, then this follows from Lemma 3.10. Otherwise, it follows from Lemma 3.11 and Lemma 3.9.

Proof. If |S| < ∞, then the result follows from Lemma 3.10. Thus, suppose there are infinitely many successful iterations (i.e. |S| = ∞).
To find a contradiction, suppose there is a subsequence of successful iterations t_j with ‖∇f(x_{t_j})‖ ≥ ε_0 for some ε_0 > 0 (note: we do not consider any other iteration types, as x_k does not change for these). Hence by Lemma 3.8, we must have ‖g_{t_j}‖ ≥ ε > 0 for some ε, where without loss of generality we assume ε < ε_0. Let ℓ_j be the first iteration ℓ_j > t_j such that ‖g_{ℓ_j}‖ < ε, which is guaranteed to exist by Theorem 3.12. That is, there exist subsequences t_j < ℓ_j, and we consider the iterations in K := ∪_j {t_j, . . . , ℓ_j − 1}. Since ‖g_k‖ ≥ ε and ∆_k → 0 (Lemma 3.11), Lemma 3.6 implies that for sufficiently large k ∈ K, there can be no unsuccessful steps. That is, iteration k is a safety step, or else it must be successful or model-improving. By the same reasoning as in the proof of Lemma 3.10, since ‖g_k‖ ≥ ε, for k ∈ K sufficiently large, Lemma 3.7 implies that ‖s_k‖ ≥ c_2 ∆_k > γ_S ρ_k, so the safety step is never called.
For each successful iteration k ∈ K ∩ S, we have from (2.8), for k sufficiently large (so that ∆_k ≤ ε/κ_H), the decrease bound (3.36). Since for k ∈ K sufficiently large we have either successful or model-improving steps, and of these x_k changes only on successful steps, we have (for j sufficiently large) a bound on f(x_{t_j}) − f(x_{ℓ_j}). Since {f(x_k) : k ∈ K} is a monotone decreasing sequence by (3.36), and bounded below (as f ≥ 0 for least-squares problems), it must converge.
Similarly to Lemma 3.10, the first term goes to zero as j → ∞ since ∇f is continuous and ‖x_{t_j} − x_{ℓ_j}‖ → 0. Lastly, the third term is bounded by ε, by definition of ℓ_j.
Altogether, this means that, for sufficiently large j, ‖∇f(x_{t_j})‖ < ε_0, and we have our contradiction.

Worst-Case Complexity
Next, we bound the number of iterations and objective evaluations until ‖∇f(x_k)‖ < ε. We know such a bound exists from Theorem 3.12. Let i_ε be the last iteration before ‖∇f(x_{i_ε + 1})‖ < ε for the first time.
Lemma 3.14. Suppose Assumptions 2.1, 3.1 and 3.5 hold. Let |S_{i_ε}| be the number of successful steps up to iteration i_ε. Then |S_{i_ε}| is bounded above in terms of f(x_0), ε_g and ρ_min, where ε_g is defined in (3.19) and ρ_min in (3.21).
Proof. For all k ∈ S_{i_ε}, we have the sufficient decrease condition (2.8). Since ‖g_k‖ ≥ ε_g from Lemma 3.8 and ∆_k ≥ ρ_k ≥ ρ_min from Lemma 3.9, this gives the required bound on |S_{i_ε}|.

We now need to count the number of iterations of Algorithm 1 which are not successful. Following [10], we count each iteration of the loop inside the criticality phase (Algorithm 2) as a separate iteration; in effect, one 'iteration' corresponds to one construction of the model m_k (2.5). We also consider separately the number of criticality phases for which ∆_k is not reduced (i.e. ∆_k = ∆_k^init). Counting until iteration i_ε (inclusive), we let
• C_{i_ε}^M be the set of criticality phase iterations k ≤ i_ε for which ∆_k is not reduced (i.e. the first iteration of every call of Algorithm 2; see Remark 2.5 for further details);
• C_{i_ε}^U be the set of criticality phase iterations k ≤ i_ε where ∆_k is reduced (i.e. all iterations except the first for every call of Algorithm 2);
• F_{i_ε} be the set of iterations where the safety phase is called;
• M_{i_ε} be the set of iterations where the model-improving phase is called; and
• U_{i_ε} be the set of unsuccessful iterations.
Proof. On each iteration k ∈ C_{i_ε}^U, we reduce ∆_k by a factor of ω_C. Similarly, on each iteration k ∈ F_{i_ε} we reduce ∆_k by a factor of at least max(ω_S, α_2), and for iterations in U_{i_ε} by a factor of at least max(γ_dec, α_2). On each successful iteration, we increase ∆_k by a factor of at most γ_inc, and on all other iterations, ∆_k is either constant or reduced. Therefore, we obtain the stated bounds. After every call of the criticality phase, we have either a safety, successful or unsuccessful step, giving us (3.46). Similarly, after every model-improving phase, the next iteration cannot call a subsequent model-improving phase, giving us (3.47).
Assumption 3.16. The algorithm parameter ε_C satisfies ε_C ≥ c_3 ε for some constant c_3 > 0.
Note that Assumption 3.16 can be easily satisfied by appropriate parameter choices in Algorithm 1.
We can summarize our results as follows: the number of iterations until i_ε is O(ε^{−2}), and the number of objective evaluations until i_ε is at most n + 1 times that.

Proof. From Theorem 3.17, we have c_4^{−1} = O(κ_eg), and so, to leading order, the number of iterations is as required. In every type of iteration, we change at most n + 1 points, and so require no more than n + 1 evaluations. The result κ_d = O(n L_J²) follows from Lemma 3.4.
Remark 3.19. Theorem 3.17 gives us a possible termination criterion for Algorithm 1: we loop until k exceeds the value (3.50) or until ρ_k ≤ ρ_min. However, this would require us to know the problem constants κ_ef, κ_eg and κ_H in advance, which is not usually the case. Moreover, (3.50) is a worst-case bound and so unduly pessimistic.
Remark 3.20. In [10], the authors propose a different criterion to test whether the criticality phase should be entered: ‖g_k^init‖ ≤ ∆_k/µ rather than ‖g_k^init‖ ≤ ε_C as found here and in [7]. We are able to use our criterion because of Assumption 3.16. If this did not hold, ε_g, and hence ρ_min, would be smaller, which would worsen the result in Theorem 3.17. In practice, Assumption 3.16 is reasonable, as we would not expect a user to prescribe a criticality tolerance much smaller than their desired solution tolerance.
The standard complexity bound for first-order methods is given in [10]. However, our model (2.5) is better than a simple linear model for f, as it captures some of the curvature information in the objective via the term J_k^T J_k. This means that DFO-GN produces models which are between fully linear and fully quadratic [7, Definition 10.4], the latter being the requirement for convergence of second-order methods. It therefore makes sense to also compare the complexity of DFO-GN with the complexity of second-order methods.
Unsurprisingly, the standard bound for second-order methods is worse, in general, than for first-order methods [14], where κ_d = O(n), to achieve second-order criticality for the given objective; note that here κ_d := max(κ_ef, κ_eg, κ_eh) for fully quadratic models. If ∇²f is uniformly bounded, then we would expect these constants to be comparable. Thus DFO-GN has the iteration and evaluation complexity of a first-order method, but the problem constants (i.e. dependency on n) of a second-order method.

Remark 3.21. In Lemma 3.4, we used the result C = O(Λ) whenever Y_k is Λ-poised, and wrote κ_eg in terms of C; see Appendix A for details on the provenance of C with respect to the interpolation system (2.3). Our approach here matches the presentation of the first- and second-order complexity bounds from [10,14]. However, [7, Theorem 3.14] shows that C may also depend on n. Including this dependence, we have C = O(√n Λ) for DFO-GN and general first-order methods, and C = O(n² Λ) for general second-order methods (where C is now adapted for quadratic interpolation). This would yield the alternative bounds κ_d = O(n) for first-order methods, O(n²) for DFO-GN and O(n³) for second-order methods. Either way, we conclude that the complexity of DFO-GN lies between first- and second-order methods.

Discussion of Assumption 3.5
It is also important to note that when m_k is fully linear, we have an explicit bound ‖H_k‖ ≤ κ_H = O(κ_d) from Lemma 3.4. This means that Assumption 3.5, which is typically necessary for first-order convergence (e.g. [7,10]), is not required for Theorem 3.12 and our complexity analysis. To remove the assumption, we need to change Algorithm 1 in two places.
With these changes, the criticality phase still terminates, but with a modified version of the bound (B.1). We can also augment Lemma 3.8 with a corresponding result, which can be used to arrive at a new value for ρ_min.
Ultimately, we arrive at complexity bounds which match Corollary 3.18, but with κ_H replaced by its explicit counterpart from Lemma 3.4. However, Assumption 3.5 is still necessary for Theorem 3.13 to hold.

Implementation
In this section, we describe the key differences between Algorithm 1 and its software implementation DFO-GN. These differences largely come from Powell's implementation of BOBYQA [27] and are also features of DFBOLS, the implementation of the algorithm from Zhang et al. [34]. We also obtain a unified approach for analysing and improving the geometry of the interpolation set due to our particular choice of local Gauss-Newton-like models.

Geometry-Improving Phases
In practice, DFO algorithms are generally not run to very high tolerance levels, and so the asymptotic behaviour of such algorithms is less important than for other optimization methods. To this end, DFO-GN, like BOBYQA and DFBOLS, does not implement a criticality phase, but the safety step is implemented to encourage convergence.
In the geometry phases of the algorithm, we check the Λ-poisedness of Y_k by calculating all the Lagrange polynomials for Y_k (which are linear), then maximizing the absolute value of each in B(x_k, ∆_k). To modify Y_k to make it Λ-poised, we can repeat the following procedure [7, Algorithm 6.3]:
1. Select the point y_t ∈ Y_k (y_t ≠ x_k) for which max_{y ∈ B(x_k,∆_k)} |Λ_t(y)| is maximized (c.f. (2.10));
2. Replace y_t in Y_k with y_+, where y_+ maximizes |Λ_t(y)| over B(x_k, ∆_k) (4.1).
This procedure terminates after at most N iterations, where N depends only on Λ and n [7, Theorem 6.3], and in particular does not depend on x_k, Y_k or ∆_k. In DFO-GN, we follow BOBYQA and replace these geometry-checking and improvement algorithms (which are called in the safety and model-improvement phases of Algorithm 1) with simplified calculations. Firstly, instead of checking for the Λ-poisedness of Y_k, we check whether all interpolation points are within some distance of x_k, typically a multiple of ∆_k. If any point is sufficiently far from x_k, the geometry of Y_k is improved by selecting the point y_t furthest from x_k and moving it to y_+ satisfying (4.1). That is, we effectively perform one iteration of the full geometry-improving procedure.
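As a concrete illustration, the linear Lagrange polynomials and the simplified one-step geometry improvement can be sketched in a few lines of numpy. The function names and small example are ours, not DFO-GN's actual code:

```python
import numpy as np

def lagrange_coeffs(Y):
    """Coefficients (c0[t], g[t]) of the linear Lagrange polynomials for Y.

    Y is an (n+1) x n array of interpolation points whose first row is x_k.
    Lambda_t(y) = c0[t] + g[t] @ (y - x_k) satisfies Lambda_t(y_j) = delta_tj.
    """
    xk = Y[0]
    A = np.hstack([np.ones((len(Y), 1)), Y - xk])  # row j is [1, (y_j - x_k)^T]
    C = np.linalg.inv(A)           # column t holds the coefficients of Lambda_t
    return C[0, :], C[1:, :].T

def improve_geometry(Y, Delta):
    """One step of the simplified geometry improvement: move the point
    furthest from x_k to a maximizer of |Lambda_t| over B(x_k, Delta)."""
    xk = Y[0]
    t = 1 + int(np.argmax(np.linalg.norm(Y[1:] - xk, axis=1)))  # never move x_k
    c0, G = lagrange_coeffs(Y)
    g = G[t]
    # |c0 + g.(y - x_k)| over ||y - x_k|| <= Delta is maximal at x_k ± Delta*g/||g||
    sign = 1.0 if c0[t] >= 0 else -1.0
    Ynew = Y.copy()
    Ynew[t] = xk + sign * Delta * g / np.linalg.norm(g)
    return Ynew, t
```

Because the models are linear, the inner maximization over the ball has the closed-form solution used above; in the bound-constrained case DFO-GN instead uses the routine of Appendix C.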

Model Updating
In Algorithm 1, we only update Y_{k+1}, and hence the models, on successful steps. However, in our implementation, we always try to incorporate new information when it becomes available, and so we update Y_{k+1} = Y_k ∪ {x_k + s_k} \ {y_t} on all iterations except when the safety phase is called (since in the safety phase we never evaluate r(x_k + s_k)).
Regardless of how often we update the model, we need some criterion for selecting the point y_t ∈ Y_k to replace with y_+ := x_k + s_k. There are three common reasons for choosing a particular point to remove from the interpolation set:
• Furthest Point: It is the furthest away from x_k (or x_{k+1});
• Optimal Λ-poisedness: Replacing it with y_+ would give the maximum improvement in the Λ-poisedness of Y_k; that is, choose the t for which |Λ_t(y_+)| is maximized;
• Stable Update: Replacing it with x_k + s_k would induce the most stable update to the interpolation system (2.4). As introduced by Powell [25] for quadratic models, moving y_t to y_+ induces a low-rank update of the matrix W → W_new in the interpolation system, here (2.4). From the Sherman-Morrison-Woodbury formula, this induces a low-rank update of H = W^{-1}, of the form (4.2), for some σ_t ≠ 0 and low-rank matrices A_t and B_t. Under this measure, we would want to replace a point in the interpolation set when the resulting |σ_t| is maximal; i.e. the update (4.2) is 'stable'.
Two approaches for selecting y_t combine two of these reasons into a single criterion. Firstly, in BOBYQA, the index t is chosen by combining the 'furthest point' and 'stable update' measures, giving the criterion (4.3). Alternatively, Scheinberg and Toint [28] combine the 'furthest point' and 'optimal Λ-poisedness' measures, giving (4.4). In DFO-GN, we use the BOBYQA criterion (4.3). However, as we now show, in DFO-GN the two measures 'optimal Λ-poisedness' and 'stable update' coincide, meaning our framework allows a unification of the perspectives from [27] and [28], rather than having the indirect relationship via the bound σ_t ≥ Λ_t(y_+)².
To this end, define W to be the matrix in (2.4), and let H := W^{-1}. The Lagrange polynomials for Y_k can then be found by applying the interpolation conditions (2.9). That is, we have Λ_t(y) = 1 + g_t^T(y − y_t), (4.5) where g_t solves a linear system in W whose right-hand side is built from e_t, the usual coordinate vector in R^n, and e := [1 ··· 1]^T ∈ R^n. Now, we consider the 'stable update' measure. We will update the point y_t to y_+, which gives a new matrix W_new with inverse H_new. This change is a rank-1 update from W to W_new and, by the Sherman-Morrison formula, it induces a rank-1 update from H to H_new. For a general rank-1 update W_new = W + uv^T, the denominator in the Sherman-Morrison formula is σ = 1 + v^T W^{-1} u; evaluating this for our update gives σ_t = Λ_t(y_+), as expected.
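The coincidence σ_t = Λ_t(y_+) is easy to verify numerically. The sketch below uses a simplified square system with rows (y_j − x_k)^T and Lagrange polynomials vanishing at x_k, which mirrors, but need not match exactly, the system (2.4):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
xk = rng.standard_normal(n)
Y = xk + rng.standard_normal((n, n))   # interpolation points y_1, ..., y_n
W = Y - xk                             # row j is (y_j - x_k)^T

# Lagrange polynomial for point t: Lambda_t(y) = g_t.(y - x_k) with W g_t = e_t,
# so that Lambda_t(y_j) = delta_tj and Lambda_t(x_k) = 0.
t = 2
g_t = np.linalg.solve(W, np.eye(n)[t])
Lambda_t = lambda y: g_t @ (y - xk)

# Replacing row t of W by (y_plus - x_k)^T is the rank-1 update
# W_new = W + e_t v^T with v = y_plus - y_t; the Sherman-Morrison
# denominator is sigma_t = 1 + v^T W^{-1} e_t = 1 + v.g_t.
y_plus = xk + rng.standard_normal(n)
sigma_t = 1.0 + (y_plus - Y[t]) @ g_t

assert abs(Lambda_t(Y[t]) - 1.0) < 1e-8        # Lambda_t(y_t) = 1
assert abs(sigma_t - Lambda_t(y_plus)) < 1e-8  # sigma_t = Lambda_t(y_plus)
```

Since Λ_t is linear with Λ_t(y_t) = 1, we have 1 + g_t·(y_+ − y_t) = Λ_t(y_+), which is exactly the Sherman-Morrison denominator computed above.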

Termination Criteria
The specification in Algorithm 1 does not include any termination criteria. In the implementation of DFO-GN, we use the same termination criteria as DFBOLS [34], namely terminating whenever any of the following are satisfied:
• Small objective value: since f ≥ 0 for least-squares problems, we terminate when f(x_k) falls below a user-specified threshold. For nonzero residual problems (i.e. where f(x*) > 0 at the true minimum x*), it is unlikely that termination will occur by this criterion;
• Small trust region: ρ_k, which converges to zero as k → ∞ from Lemma 3.11, falls below a user-specified threshold; and
• Computational budget: a (user-specified) maximum number of evaluations of r is reached.
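These tests are cheap to implement; a minimal sketch, with illustrative threshold names rather than DFO-GN's actual parameters, is:

```python
def should_terminate(f_k, rho_k, n_evals, f_tol=1e-12, rho_end=1e-8, max_evals=1000):
    """Return the reason for stopping, or None to continue iterating."""
    if f_k <= f_tol:          # small objective value (f >= 0 for least squares)
        return "small objective"
    if rho_k <= rho_end:      # small trust-region lower bound rho_k
        return "small trust region"
    if n_evals >= max_evals:  # computational budget on evaluations of r
        return "budget"
    return None
```

For a nonzero residual problem f(x*) > 0, so in practice only the latter two tests are likely to fire.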

Other Implementation Differences
Addition of bound constraints This is allowed in the implementation of DFO-GN, as it is important in practical applications. That is, we solve (2.1) subject to a ≤ x ≤ b for given bounds a, b ∈ R^n. This requires no change to the logic as specified in Algorithm 1, but does require the addition of the same bound constraints in the algorithms for the trust-region subproblem (2.6) and for calculating geometry-improving steps (4.1). For the trust-region subproblem, we use the routine from DFBOLS, which is itself a slight modification of the routine from BOBYQA (which was specifically designed to produce feasible iterates in the presence of bound constraints). Calculating geometry-improving steps (4.1) is easier, since the Lagrange polynomials are linear rather than quadratic; that is, we need to maximize a linear objective subject to Euclidean ball and bound constraints. In this case, we use our own routine, given in Appendix C, which handles the bound constraints via an active-set method.
Internal representation of interpolation points In the interpolation system (2.4), we are required to calculate the vectors y_t − x_k. As the algorithm progresses, we expect ‖y_t − x_k‖ = O(∆_k) → 0 as k → ∞, and hence the computed differences y_t − x_k to be sensitive to rounding errors, especially if ‖x_k‖ is large. To reduce the impact of this, internally the points are stored with respect to a 'base point' x_b, which is periodically updated to remain close to the points in Y_k. That is, we actually maintain the shifted set {y_t − x_b : y_t ∈ Y_k}, where ‖y_t − x_b‖ and ‖x_k − x_b‖ are much smaller than ‖y_t‖ and ‖x_k‖, so the calculation is less sensitive to cancellation. Once we have determined J_k by solving (2.4), the model (2.1) is internally represented in the form c_k + J_k(y − x_b), where c_k := r(x_k) − J_k(x_k − x_b). Note that we take the argument of m_k to be y − x_b, rather than y − x_k as in (2.1).
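The re-centred representation leaves the model itself unchanged, which can be checked directly; the sketch below uses generic arrays for c_k, J_k and the shift:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 7, 4
J = rng.standard_normal((m, n))     # model Jacobian J_k
xb = rng.standard_normal(n)         # base point x_b
c = rng.standard_normal(m)          # model values at the base point

def model(c, xb, y):
    """Internal representation c_k + J_k (y - x_b) of the residual model."""
    return c + J @ (y - xb)

# Shifting the base point by db and updating c_k <- c_k + J_k db, as described
# in the text, represents exactly the same model.
db = rng.standard_normal(n)
c_new, xb_new = c + J @ db, xb + db

y = rng.standard_normal(n)
assert np.allclose(model(c, xb, y), model(c_new, xb_new, y))
```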
When we update x_b, say to x_b + ∆b, we rewrite m_k in terms of the new base point, and so we update x_b ← x_b + ∆b and c_k ← c_k + J_k ∆b. Lastly, the interpolation points have their stored representation shifted in the same way (each y_t − x_b becomes y_t − (x_b + ∆b)).

Other differences The following changes, which come from BOBYQA, are present in the implementation of DFO-GN:
• We accept any step (i.e. set x_{k+1} = x_k + s_k) where we see an objective reduction, that is, when r_k > 0. In fact, we always update x_k to be the best value found so far, even if that point came from a geometry-improving phase rather than a trust-region step;
• The reduction of ρ_k in an unsuccessful step (line 22) only occurs when r_k < 0;
• Since we update the model on every iteration, we only reduce ρ_k after 3 consecutive unsuccessful iterations; i.e. we only reduce ρ_k when ∆_k is small and after the model has been updated several times (reducing the likelihood of the unsuccessful steps being due to a bad interpolation set);
• The method for reducing ρ_k is usually ρ_{k+1} = α_1 ρ_k, but it is changed when ρ_k approaches ρ_end; see (4.14);
• In some calls of the safety phase, we only reduce ρ_k and ∆_k, without improving the geometry of Y_k.

Comparison to DFBOLS
As has been discussed at length, there are many similarities between DFO-GN and DFBOLS from Zhang et al. [34]. Although the algorithm described in [34] allows Gauss-Newton type models in principle, in practice DFO-GN is simpler in several respects: • The use of linear models for each residual (2.1) means we require only n + 1 interpolation points. In DFBOLS, quadratic models for each r i are used, which requires between n + 2 and (n + 1)(n + 2)/2 points, where the remaining degrees of freedom are taken up by minimizing the change in the model Hessian. This results in both a more complicated interpolation problem compared to our system (2.4), and a larger startup cost (where an initial Y 0 of the correct size is constructed, and r evaluated at each of these points); • As a result of using linear models, there is no ambiguity in how to construct the full model m k (2.5). In DFBOLS, simply taking a sum of squares of each residual's model gives a quartic. The authors drop the cubic and quartic terms, and choose the quadratic term from one of three possibilities, depending on the sizes of g k and f (x k ). This requires the introduction of three new algorithm parameters, each of which may require calibration.
• DFO-GN's method for choosing a point to replace when doing model updating, as discussed in Section 4.2, yields a unification of the geometric ('optimal Λ-poisedness') and algebraic ('stable update') perspectives on this update. In DFBOLS, the connection exists but is less direct, as it uses the same method as BOBYQA (4.3) with σ t ≥ Λ t (y + ) 2 . As discussed in [27], this bound may sometimes be violated as a result of rounding errors, and thus requires an extra geometry-improving routine to 'rescue' the algorithm from this problem. DFO-GN does not need or have this routine.
The first of these points also applies to Wild's POUNDERS [30], which builds quadratic models for each residual, and constructs a model for the full objective which is equivalent to a full Newton model (i.e. taking all available second-order information). It is also important to note that neither [34] nor [30] test the use of DFO-GN type linear models for each residual in practice.

Numerical Results
Now we compare the practical performance of DFO-GN to two versions of DFBOLS [34], and show that DFO-GN has comparable budget performance and significantly faster runtime. The two versions of DFBOLS are the original Fortran implementation from [34] and our own Python implementation (which we will call 'Py-DFBOLS'), designed to be as similar as possible to the implementation of DFO-GN. Both DFO-GN and Py-DFBOLS use Python 3.5.2. We also compare with the general-objective solver BOBYQA [27] and with POUNDERS [30], another least-squares DFO code which uses quadratic interpolation models for each residual.

Test Problems and Methodology
We tested the solvers on the test suite from Moré and Wild [18], a collection of 53 unconstrained nonlinear least-squares problems with dimension 2 ≤ n ≤ 12 and 2 ≤ m ≤ 65. For each problem, we optionally allowed evaluations of the residuals r_i to have stochastic noise; specifically, we considered multiplicative Gaussian, additive Gaussian and additive χ² noise models. To compare solvers, we use data and performance profiles [18]. First, for each solver S, each problem p and an accuracy level τ ∈ (0, 1), we determine the number of function evaluations N_p(S; τ) required for a problem to be 'solved', i.e. for an iterate to satisfy
f(x_k) ≤ f* + τ(f(x_0) − f*),   (5.1)
where f* is an estimate of the true minimum f(x*). A full list of the values used is provided in Appendix D. We define N_p(S; τ) = ∞ if this was not achieved in the maximum computational budget allowed. We can then compare solvers by looking at the proportion of test problems solved for a given computational budget. For data profiles, we normalize the computational effort by problem dimension, and plot (for solver S, accuracy level τ ∈ (0, 1) and problem suite P)
d_{S,τ}(α) := |{p ∈ P : N_p(S; τ) ≤ α(n_p + 1)}| / |P|,   for α ∈ [0, N_g],
where N_g is the maximum computational budget, measured in simplex gradients (i.e. N_g(n_p + 1) objective evaluations are allowed for problem p).
For performance profiles, we normalize the computational effort by the minimum effort needed by any solver (i.e. by problem difficulty). That is, we plot
π_{S,τ}(α) := |{p ∈ P : N_p(S; τ) ≤ α N*_p(τ)}| / |P|,   for α ≥ 1,
where N*_p(τ) := min_S N_p(S; τ) is the minimum budget required by any solver. For test runs where we added stochastic noise, we took average data and performance profiles over multiple runs of each solver; that is, for each α we take an average of d_{S,τ}(α) and π_{S,τ}(α). When plotting performance profiles, we took N*_p(τ) to be the minimum budget required by any solver in any run.
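Both profiles are straightforward to compute from the table of evaluation counts N_p(S; τ); a numpy sketch, with our own function names, is:

```python
import numpy as np

def data_profile(N, dims, alphas):
    """d_{S,tau}(alpha): fraction of problems solved within alpha simplex
    gradients, i.e. within alpha*(n_p + 1) evaluations. N[p] is the evaluation
    count for problem p (np.inf if unsolved) and dims[p] its dimension n_p."""
    N, dims = np.asarray(N, float), np.asarray(dims, float)
    return np.array([np.mean(N <= a * (dims + 1.0)) for a in alphas])

def performance_profile(N_all, alphas):
    """pi_{S,tau}(alpha): fraction of problems solver S solved within alpha
    times the best solver's count. N_all is a (solvers x problems) array."""
    N_all = np.asarray(N_all, float)
    best = N_all.min(axis=0)               # N*_p = min over solvers
    return np.array([[np.mean(Ns <= a * best) for a in alphas] for Ns in N_all])
```

For example, with two solvers and counts [[10, 20, inf], [20, 10, 30]], the first solver attains π(1) = 1/3 and π(2) = 2/3.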

Test Results
For our testing, we used a budget of N_g = 200 gradients (i.e. 200(n + 1) objective evaluations) for each problem, noise level σ = 10^{−2}, and took 10 runs of each solver (with runs scheduled using [29]). Most results use an accuracy level of τ = 10^{−5} in (5.1).
Firstly, Figure 1 shows two performance profiles under the low accuracy requirement τ = 10^{−1}. Here we see an important benefit of DFO-GN compared to BOBYQA, DFBOLS and POUNDERS: the smaller interpolation set means that it can begin the main iteration and make progress sooner. This is reflected in Figure 1, where DFO-GN is the fastest solver more frequently than any other, both with smooth and noisy objective evaluations. We also note that POUNDERS is the fastest solver more frequently than DFBOLS at this accuracy level. However, after the full budget of 200 gradients, POUNDERS is unable to solve a high proportion of problems to this accuracy; this holds both here and for the remainder of the results. We believe this is because it uses a built-in termination condition for sufficiently small trust-region radius, which is set too high to achieve (5.1) for many problems. In line with the results from [34], BOBYQA does not perform as well as DFBOLS or DFO-GN, as it does not exploit the least-squares problem structure. Next, Figure 2 shows results for accuracy τ = 10^{−5} and smooth objective evaluations. It is important to note that our simplification from quadratic to linear residual models has not led to a loss of performance at obtaining high accuracy solutions, and produces essentially identical long-budget performance. At this level, the advantage from the smaller startup cost is no longer seen, but, particularly in the performance profiles, we can still see the substantially higher startup cost of using (n + 1)(n + 2)/2 interpolation points.
Similarly, Figure 3 shows the same plots but for noisy problems (multiplicative Gaussian, additive Gaussian and additive χ 2 respectively). Here, DFO-GN suffers a small performance penalty (of approximately 5-10%) compared to DFBOLS, particularly when using 2n+1 and (n+ 1)(n + 2)/2 interpolation points, suggesting that the extra curvature and evaluation information in DFBOLS has some benefit for noisy problems. Also, the performance penalty is larger in the case of additive noise than multiplicative (approximately 10% vs. 5%). Note that additive noise makes all our test problems nonzero residual (i.e. f (x * ) > 0 for the true minimum x * ). Thus the worse performance of DFO-GN compared to DFBOLS for additive noise is similar to the derivative-based case, where for nonzero residual problems the Gauss-Newton method has a lower asymptotic convergence rate than Newton's method [19].
Also of note is that although BOBYQA suffers a substantial performance penalty when moving from smooth to noisy problems, this penalty (compared to DFO-GN and DFBOLS) is much less for additive χ 2 noise. This is likely because this noise model makes each residual function nonsmooth by taking square roots, but the change to the full objective is relatively benign -simply adding χ 2 random variables.
Nonzero Residual Problems We saw above that DFO-GN suffered a higher (but still small) loss of performance, compared to DFBOLS, for problems with additive noise, where all problems become nonzero residual. We investigate this further by extracting the results for the nonzero residual problems only from the test set results already presented; Figure 4 shows the resulting performance profiles for accuracy τ = 10^{−5}, for smooth objectives and multiplicative Gaussian noise (σ = 10^{−2}). We notice that in the smooth case, DFBOLS with 2n + 1 points is now the fastest solver on 30% of problems versus 20% for DFO-GN, compared to looking at all problems (seen in Figure 2b); but DFO-GN is comparably robust, and we do not see the same loss of performance as in the additive noise case. In the case of multiplicative Gaussian noise, DFO-GN, and in fact (Py-)DFBOLS, see a loss of performance compared to looking at all problems; also, the difference between DFO-GN and DFBOLS with increasing numbers of interpolation points is slightly clearer for nonzero residual problems (8% vs. 5% for all problems).

Conclusions to evaluation comparisons
The numerical results in this section show that DFO-GN performs comparably to DFBOLS in terms of evaluation counts, and outperforms BOBYQA and POUNDERS, in both smooth and noisy settings, and for low and high accuracy. DFO-GN exhibits a slight performance loss compared to DFBOLS for problems with additive noise and for noisy nonzero residual problems. We note that we also tested other noise models, such as multiplicative uniform noise and biased variants of the Gaussian noise; all of these performed either better (as in the case of uniform noise) or essentially indistinguishably from the results already presented. We also tried other noise variance levels, smaller than σ = 10^{−2}, for which the performance of both the DFO-GN and DFBOLS solvers varies comparably.

We may explain the similar performance of DFO-GN to DFBOLS, despite the higher-order models used by the latter, as being due to the general effectiveness of Gauss-Newton-like frameworks for nonlinear least-squares, especially for zero-residual problems, and furthermore by the usual remit of DFO algorithms, in which the asymptotic regimes are not really observed or targeted by the accuracy at which the problems are solved. Though an accuracy level of τ = 10^{−5} is common and considered reasonably high in DFO, to ensure that our results are robust, we also performed the same tests for higher accuracy levels τ ∈ {10^{−7}, 10^{−9}, 10^{−11}}. The resulting profiles are given in Appendix F. For smooth problems, DFO-GN is still able to solve essentially the same proportion of problems as (Py-)DFBOLS. For noisy problems, the results are more mixed: on average, DFO-GN does slightly worse for Gaussian noise, but slightly better for χ² noise. These results are the same whether looking at all problems or just nonzero residual problems.
This reinforces our previous conclusions, and gives us confidence that a Gauss-Newton framework for DFO is a suitable choice, and is robust to the level of accuracy required for a given problem.

Runtime Comparison
The use of linear models in DFO-GN also leads to reduced linear algebra costs. The interpolation system for DFO-GN (2.4) has size n. By comparison, for underdetermined quadratic interpolation, such as in (Py-)DFBOLS, the interpolation system has size p + n + 1 when we have p ≥ n + 2 interpolation points. Hence the DFBOLS system is at least twice the size of the DFO-GN system, and since dense factorizations scale cubically with size, we would expect the computational cost of solving this system to be at least 8 times larger than for DFO-GN. To verify this, in this section we compare the runtime of DFO-GN with Py-DFBOLS. The other solvers are implemented in lower-level languages (Fortran and C), and so a runtime comparison against DFO-GN would not be fair. The wall time required by each solver to run the above testing (with a budget of 200(n + 1) objective evaluations) on a Lenovo ThinkCentre M900 (with one 64-bit Intel i5 processor, 8GB of RAM) is shown in Table 1 for both smooth and noisy evaluations. We find that DFO-GN is 7-9x faster than Py-DFBOLS with n + 2 points, 11-25x faster than Py-DFBOLS with 2n + 1 points, and 42-387x faster than Py-DFBOLS with (n + 1)(n + 2)/2 points. In all cases, this is a substantial improvement, particularly given the small difference in performance (measured in function evaluations) between DFO-GN and (Py-)DFBOLS described in Section 5.2.
We note that these runtimes include objective evaluations. However, all solvers used the same Python implementation of the objective functions and the same total budget, so the inclusion of objective evaluations in the runtime will not materially affect the results.

Scalability Features
We saw in Section 5.3 that DFO-GN runs faster due to the lower cost of solving the interpolation linear system. Another important benefit is that storing the interpolation models for each residual requires only O(mn) memory, rather than O(mn²) for quadratic models. These two observations together suggest that DFO-GN should scale to large problems better than DFBOLS; in this section we demonstrate this. We consider Problem 29 from Moré, Garbow & Hillstrom [17] (which is Problem 33 (integreq) in the CUTEst set in Table 3). This is a zero-residual least-squares problem with m = n, arising from a one-dimensional integral equation discretized at n points in (0, 1). Both DFO-GN and DFBOLS solve this problem (terminating on the small objective value f(x) ≤ 10^{−12}) quickly, in no more than 20 iterations, regardless of n.
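To make the test problem concrete, the residual of the discretized integral equation, together with a plain derivative-based Gauss-Newton iteration using a forward-difference Jacobian, can be sketched as follows. The formula is transcribed from [17], so treat the constants as illustrative:

```python
import numpy as np

def integreq_residual(x):
    """Discrete integral equation (problem 29 of Moré, Garbow & Hillstrom)."""
    n = len(x)
    h = 1.0 / (n + 1)
    t = h * np.arange(1, n + 1)
    cube = (x + t + 1.0) ** 3
    r = np.empty(n)
    for i in range(n):
        lower = np.sum(t[:i + 1] * cube[:i + 1])          # sum_{j<=i} t_j (x_j+t_j+1)^3
        upper = np.sum((1.0 - t[i + 1:]) * cube[i + 1:])  # sum_{j>i} (1-t_j)(x_j+t_j+1)^3
        r[i] = x[i] + 0.5 * h * ((1.0 - t[i]) * lower + t[i] * upper)
    return r

def gauss_newton(r, x, iters=20, fd=1e-7):
    """Plain Gauss-Newton with a forward-difference Jacobian (no trust region)."""
    for _ in range(iters):
        rx = r(x)
        if rx @ rx <= 1e-14:
            break
        J = np.column_stack([(r(x + fd * e) - rx) / fd for e in np.eye(len(x))])
        x = x + np.linalg.lstsq(J, -rx, rcond=None)[0]
    return x

n = 20
h = 1.0 / (n + 1)
t = h * np.arange(1, n + 1)
x = gauss_newton(integreq_residual, t * (t - 1.0))  # standard start x_i = t_i(t_i - 1)
```

From the standard starting point this mildly nonlinear zero-residual system is driven below the 10^{−12} objective threshold in a handful of Gauss-Newton iterations, consistent with the behaviour reported above.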
In Figure 5 we compare the runtime and peak memory usage of DFBOLS and DFO-GN as n increases. Note that we are comparing DFO-GN (implemented in Python) against DFBOLS (implemented in Fortran) rather than Py-DFBOLS (as used in Section 5.3), to put ourselves at a substantial disadvantage. We see that for small n, DFBOLS has significantly lower runtime and memory requirements than DFO-GN (which is unsurprising, since it is implemented in Fortran rather than Python). However, as expected, both the runtime and memory usage increase much faster for DFBOLS than for DFO-GN as n is increased. In fact, for n > 500, DFO-GN actually runs faster than DFBOLS. For n > 1200, DFBOLS exceeds the memory capacity of the system. At this point, it has to store data on disk, and as a result the runtime increases very quickly. DFO-GN does not suffer from this issue, and can continue solving problems quickly for substantially larger n. For instance, DFO-GN solves the n = 2500 problem over 2.5 times faster than DFBOLS solves the much smaller n = 1400 problem.
Similarly to before, it is important to gain an understanding of whether this improved scalability comes at the cost of performance. To assess this, we consider a set of 60 medium-sized problems (25 ≤ n ≤ 120 and 25 ≤ m ≤ 400, with n ≈ 100 for most problems) from the CUTEst test set [11]. The full list of problems is given in Appendix E. For these problems, we compare DFO-GN with DFBOLS using a smaller budget of 50(n + 1) evaluations, commensurate with the greater cost of objective evaluation. Given this small budget, we only test DFBOLS with n + 2 and 2n + 1 interpolation points; using (n + 1)(n + 2)/2 interpolation points would mean that in most cases the full budget is entirely used building the initial sample set.
In Figure 6, we show data and performance profiles for accuracy τ = 10 −5 . As before, we see that DFO-GN has very similar performance to DFBOLS, and although we have gained improved scalability, we have not lost in terms of performance on medium-sized test problems.

Concluding Remarks
It is well-known that, for nonlinear least-squares problems, using only linear models for each residual is sufficient to approximate the objective well, especially for zero-residual problems. This forms the basis of the derivative-based Gauss-Newton and Levenberg-Marquardt methods, and has motivated our derivative-free, model-based trust-region variant here, called DFO-GN.
In [34,30], quadratic local models are constructed by interpolation for each residual, and these are aggregated to produce a quadratic model for the objective. By contrast, in DFO-GN we build linear models for each residual, which retains first-order convergence and worst-case complexity, and reduces both the computational cost of solving the interpolation problem (leading to a runtime reduction of at least a factor of 7) and the memory cost of storing the models (from O(mn²) to O(mn)). These savings result in a substantially faster runtime and improved scalability of DFO-GN compared to DFBOLS, the implementation from [34]. Furthermore, the simpler local models do not adversely affect the algorithm's numerical performance in terms of evaluation counts: DFO-GN performs as well as DFBOLS, and better than POUNDERS, the implementation from [30], on smooth test problems from the Moré & Wild and CUTEst collections. When the objective has noise, DFO-GN suffers a small performance penalty compared to DFBOLS (but not to POUNDERS), which is larger for additive than multiplicative noise, as additive noise makes all problems nonzero residual. Nonetheless, this, together with the substantial improvements in runtime and scalability, makes DFO-GN an appealing choice for both zero and nonzero residual problems, and in the presence of noise. We leave to future work showing a local quadratic rate of convergence for DFO-GN when applied to nondegenerate zero-residual problems, and more generally improving the performance of DFO methods in the presence of noise.
A Fully Linear Models

Using (A.1), and letting Ŵ_k be the interpolation matrix of the system (2.4) scaled by ∆_k^{−1}, a direct computation bounding ‖Ŵ_k^{−1}‖ shows that, for any y ∈ B(x_k, ∆_k), the bound (2.14) holds with κ^r_eg := L_J(1 + √n C/2), where C = O(Λ) [7, Theorem 3.14]. Next, we prove (2.13) by a computation using (2.14) and (A.1); hence (2.13) holds with κ^r_ef = κ^r_eg + L_J/2, as required. Since m^r_k is fully linear, (2.14) also gives a uniform bound on ‖J_k‖ for all k. Since H_k = J_k^T J_k, this means that ‖H_k‖ = ‖J_k‖² is uniformly bounded for all k.

B Geometry Improvement in Criticality Phase
Here, we describe the geometry-improvement step performed in the criticality phase of Algorithm 1, and prove its convergence. The proof of Lemma B.1 can be derived from the proof of [7, Lemma 10.5].

4: Solve the interpolation system for Y.
5: if m_k is fully linear in the sense of Definition 2.3, with associated constants κ_ef and κ_eg in (2.11) and (2.12) respectively given by (3.7), then terminate.

Otherwise, consider some iteration i where Algorithm 2 does not terminate; then the trust-region radius has been reduced by a constant factor at each such iteration. Since the model gradient remains bounded away from zero, termination must occur after finitely many reductions, so Algorithm 2 terminates in finite time, and the bound (B.1) follows.

C Calculating Geometry-Improving Steps
Here, we provide Algorithm 3, a method for solving the subproblem of calculating a geometry-improving step (4.1) subject to bound constraints. That is, we solve the problem of minimizing the linear objective g^T(y − x_k) subject to ‖y − x_k‖ ≤ ∆_k and a ≤ y ≤ b (C.1). The key step of Algorithm 3 is: find a tentative step length α_j ≥ 0 by finding the largest solution to ‖y_j + α_j s_j‖² = ∆_k².
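A compact numpy sketch of this active-set approach, for the shifted problem with x_k = 0 (so a ≤ 0 ≤ b), is below; it follows the algorithm's structure but simplifies edge-case handling:

```python
import numpy as np

def geometry_subproblem(g, Delta, a, b):
    """Minimize g.y subject to ||y|| <= Delta and a <= y <= b (with a <= 0 <= b).

    Free components move along -g; whenever a component hits its bound it is
    fixed there and the rest continue, until the ball boundary (or every
    bound) is reached."""
    y = np.zeros(len(g))
    free = np.abs(g) > 0
    while free.any():
        s = np.where(free, -g, 0.0)
        # largest alpha with ||y + alpha*s|| = Delta (fixed components included)
        A, B, C = s @ s, 2.0 * (y @ s), y @ y - Delta ** 2
        alpha_ball = (-B + np.sqrt(B * B - 4.0 * A * C)) / (2.0 * A)
        # first free component to hit its bound along s
        bound = np.where(s > 0, b, a)
        with np.errstate(divide="ignore", invalid="ignore"):
            alpha_bd = np.where(free, (bound - y) / s, np.inf)
        i = int(np.argmin(alpha_bd))
        if alpha_ball <= alpha_bd[i]:
            return y + alpha_ball * s     # trust-region boundary reached first
        y = y + alpha_bd[i] * s           # advance all free components...
        y[i] = bound[i]                   # ...and fix component i at its bound
        free[i] = False
    return y                              # every component fixed (or g_i = 0)
```

The returned point has the form y_i = clip(−ᾱ g_i, a_i, b_i) for the final cumulative step length ᾱ, which is exactly the KKT structure established in the proof of Lemma C.1; the sketch can be sanity-checked against randomly sampled feasible points.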
Lemma C.1. Suppose a ≤ x k ≤ b and a < b. Then Algorithm 3 returns a minimizer of (C.1).
Proof. Without loss of generality, we may assume that x_k = 0, as otherwise we can shift a → a − x_k, and similarly for b. Since (C.1) is convex, it suffices to show convergence to a KKT point.
Suppose y is a KKT point where y_i ∈ {a_i, b_i} for some i ∈ {1, ..., n}. If y_i = a_i, consider the point ỹ which equals y except that ỹ_i = b_i; then g^T y ≤ g^T ỹ, since y is a global minimizer. Since b_i − a_i > 0 by assumption, we must have g_i ≥ 0. Similarly, if y_i = b_i is to hold at a KKT point, we require g_i ≤ 0.
If g_i = 0, any value y_i ∈ [a_i, b_i] gives the same objective value, so we can take y_i = 0. Given this, which Algorithm 3 provides, the KKT conditions reduce to: there exist µ and λ_i for i = 1, ..., n satisfying (C.3) (for all i = 1, ..., n, where relevant). Since y_i = 0 whenever g_i = 0, and we always write g as a sum of ±e_i, we only need to check µ ≥ 0 and λ_i ≥ 0. For convenience, let γ_j := Σ_{l=0}^{j} β_l. Consider the start of some iteration j, where currently y_{j,i} = −γ_{j−1} g_i for all i ∈ I. The equation for α_j is then a quadratic, for which we take the largest solution. Hence y_{j,i} = −ᾱ_j g_i for all i ∈ I, where ᾱ_j := α_j + γ_{j−1} ≥ 0. If Algorithm 3 terminates at line 8 for this iteration j, then we have µ = 1/ᾱ_j > 0, and y_i ∈ {a_i, b_i} for all i ∉ I. Thus the i-th component of (C.3e) is satisfied for all i ∈ I with λ_i = 0. Now consider i ∉ I, and suppose g_i > 0. Then at some iteration l < j, we had −ᾱ_l g_i < a_i ≤ 0, and set y_{l+1,i} = a_i by choosing β_l so that −γ_l g_i = a_i; thus we have γ_l ≥ 0. Similarly, if g_i < 0, we had −ᾱ_l g_i > b_i ≥ 0, and needed β_l so that −γ_l g_i = b_i ≥ 0, so we also have γ_l ≥ 0 for this l. Thus 0 ≤ γ_j ≤ ᾱ_j for all j.
For termination at line 8 of Algorithm 3, and for (C.3e) to hold with λ_i ≥ 0, we need −ᾱ_j g_i ≤ a_i or −ᾱ_j g_i ≥ b_i, depending on the sign of g_i. Equivalently, we need ᾱ_j ≥ γ_l for all l < j. Since γ_l ≤ ᾱ_l, it suffices to show that ᾱ_{j−1} ≤ ᾱ_j for all j.
To show this, let ȳ_{j−2} be the vector with the fixed components of y_{j−2} (i.e. those i ∉ I at iteration j − 2), and zeros otherwise. Then if i is the index removed from I at the end of iteration j − 1, we obtain the equation (C.7) for ᾱ_{j−1}, and similarly the equation (C.8) for ᾱ_j, where c_i ∈ {a_i, b_i} is whichever value we fixed y_{j−1,i} to be. Equating (C.7) and (C.8) and using |ᾱ_{j−1} g_i| ≥ |c_i| (from above), we get an inequality which, together with ‖s_j‖ > 0 (from line 4 of Algorithm 3) and ᾱ_j ≥ 0 for all j (shown above), gives ᾱ_{j−1} ≤ ᾱ_j. This finishes the proof for when we terminate at line 8 of Algorithm 3. Now suppose we terminate at line 16 of Algorithm 3. Then we have fixed y_i ∈ {a_i, b_i} for all i (depending on the sign of g_i), with ‖y‖ ≤ ∆_k still valid. Hence we can take µ = 0 (satisfying (C.3c)) and λ_i = |g_i|, and we have a KKT point.

E Medium-Scale CUTEst Test Problems

Table 3. Details of medium-scale test problems from the CUTEst test set (showing 2f(x_0) and 2f*, as per Table 2). The set of problems is taken primarily from [12,17,16]. Some problems are variable-dimensional; the relevant parameters yielding the given (n, m) are provided. Problems marked * have box constraints. The value of n shown excludes fixed variables.

F Numerical Results for Higher Accuracy Levels
To supplement the results provided in Section 5, we provide here a number of the same comparisons of solvers, but for higher accuracy levels τ ∈ {10 −7 , 10 −9 , 10 −11 }, compared to τ = 10 −5 in the main text. For each set of plots below, the title indicates the content of the plots and the figure in the main text with the corresponding results for τ = 10 −5 .