Scalable subspace methods for derivative-free nonlinear least-squares optimization

We introduce a general framework for large-scale model-based derivative-free optimization based on iterative minimization within random subspaces. We present a probabilistic worst-case complexity analysis for our method, where in particular we prove high-probability bounds on the number of iterations required to achieve a given optimality level. This framework is specialized to nonlinear least-squares problems, with a model construction based on the Gauss–Newton method. The resulting method achieves scalability by constructing local linear interpolation models to approximate the Jacobian, and computes new steps at each iteration in a subspace of user-determined dimension. We then describe a practical implementation of this framework, which we call DFBGN. We outline efficient techniques for selecting the interpolation points and search subspace, yielding an implementation with a low per-iteration linear algebra cost (linear in the problem dimension) that also achieves fast objective decrease as measured by evaluations. Extensive numerical results demonstrate that DFBGN has improved scalability, yielding strong performance on large-scale nonlinear least-squares problems.


Introduction
An important class of nonlinear optimization methods is so-called derivative-free optimization (DFO). In DFO, we consider problems where derivatives of the objective (and/or constraints) are not available to be evaluated, and we only have access to function values. This topic has received growing attention in recent years, and is primarily used for objectives which are black-box (so analytic derivatives or algorithmic differentiation are not available), and expensive to evaluate or noisy (so finite differencing is impractical or inaccurate). There are many types of DFO methods, such as model-based, direct and pattern search, implicit filtering and others (see [50] for a recent survey), and these techniques have been used in a variety of applications [1].
Here, we consider model-based DFO methods for unconstrained optimization, which are based on iteratively constructing and minimizing interpolation models for the objective. We also specialize these methods to nonlinear least-squares problems, by constructing interpolation models for each residual term rather than for the full objective [79,73,18].
This paper aims to provide a method that answers a key question regarding model-based DFO: how to improve the scalability of this class of methods. Existing model-based DFO techniques are primarily designed for small- to medium-scale problems, as the linear algebra cost of each iteration (largely due to the cost of constructing interpolation models) means that their runtime increases rapidly for large problems. There are several settings where scalable DFO algorithms may be useful, such as data assimilation [9,3], machine learning [67,35], generating adversarial examples for deep neural networks [2,68], image analysis [30], and as a possible proxy for global optimization methods [19].
To address this, we introduce RSDFO, a scalable algorithmic framework for model-based DFO. At each iteration of RSDFO we select a random low-dimensional subspace, build and minimize a model to compute a step in this space, then change the subspace at the next iteration. We provide a probabilistic worst-case complexity analysis of RSDFO. To our knowledge, this is the first subspace model-based DFO method with global complexity and convergence guarantees. We then describe how this general framework can be specialized to the case of nonlinear least-squares minimization through a model construction technique inspired by the Gauss–Newton method, yielding a new algorithm RSDFO-GN with associated worst-case complexity bounds. We then present an efficient implementation of RSDFO-GN, which we call DFBGN. DFBGN is available on GitHub and includes several algorithmic features that yield strong performance on large-scale problems and a low per-iteration linear algebra cost that is typically linear in the problem dimension.

Existing Literature
The contributions in this paper are connected to several areas of research. We briefly review these topics below.
Block Coordinate Descent There is a large body of work on (derivative-based) block coordinate descent (BCD) methods, typically motivated by machine learning applications. BCD extends coordinate search methods [75] by updating a subset of the variables at each iteration, typically using a coordinate-wise variant of a first-order method. For nonconvex problems, the first convergence result for a randomized coordinate descent method based on proximal gradient descent was given in [58]. Here, the sampling of coordinates was uniform and required step sizes based on Lipschitz constants associated with the objective. This was extended in [52] to general randomized block selection with a nonmonotone linesearch-type method (to allow for unknown Lipschitz constants), and to a (possibly deterministic) 'essentially cyclic' block selection and extrapolation (but requiring Lipschitz constants) in [77]. Several extensions of this approach have been developed, including the use of stochastic gradients [76], parallel block updating [32] and inexact step calculations [32,78].
BCD methods have been extended to nonlinear least-squares problems, leading to so-called Subspace Gauss–Newton methods. These are derivative-based methods where a Gauss–Newton step is computed for a subset of variables. This approach was initially proposed in [70] for parameter estimation in climate models (where derivative estimates were computed using implicit filtering [49]), and analyzed in quadratic regularization and trust-region settings for general unconstrained objectives in [15,16].
Sketching Sketching is an alternative dimensionality reduction technique for least-squares problems, reducing the number of residuals rather than the number of variables. Sketching ideas have been applied to linear [46,53,74] and nonlinear [31] least-squares problems, to model-based DFO for nonlinear least-squares [13], and to subsampling algorithms for finite sum-of-functions minimization such as Newton's method [66,6].
There are also alternative approaches to sketching which lead to subspace-type methods, where local gradients and Hessians are estimated only within a subspace (possibly used in conjunction with random subsampling). Sketching in this context has been applied to, for example, Newton's method [59,40,6], BFGS [39], and SAGA [41], as well as to trust-region and quadratic regularization methods [15,16].
Random Embeddings for Global Optimization Some global optimization methods have been proposed which randomly project a high-dimensional problem into a low-dimensional subspace and solve this smaller problem using existing (global or local) methods. Though applicable to general global optimization problems (as a more sophisticated variant of random search), this technique has been explored particularly for defeating the curse of dimensionality when optimizing functions which have low effective dimensionality [64,72,17]. For the latter class, often only one random subspace projection is needed, though the addition of constraints leads to multiple embeddings being required [17]. Our approach here differs from these works in both theoretical and numerical aspects, as it is focused on a specific random subspace technique for local optimization.
Probabilistic Model-Based DFO For model-based DFO, several algorithms have been developed and analyzed where the local model at each iteration is only sufficiently accurate with a certain probability [5,21,11]. Similar analysis also exists for derivative-based algorithms [20,43]. Our approach is based on deterministic model-based DFO within subspaces, and we instead require a very weak probabilistic condition on the (randomly chosen) subspaces (Assumption 2.9).
Randomized Direct Search DFO In randomized direct search methods, iterates are perturbed in a random subset of directions (rather than a positive spanning set) when searching for local improvement. In this framework, effectively only a random subspace is searched in each iteration. Worst-case complexity bounds for this technique are given under predetermined step length regimes in [10,36], and with adaptive step sizes in [42,44], where [44] extends [42] to linearly constrained problems.
Large-Scale DFO There have been several alternative approaches considered for improving the scalability of DFO. These often consider problems with specific structure which enables efficient model construction, such as partial separability [24,60], sparse Hessians [4], and minimization over the convex hull of finitely many points [27]. On the other hand, there is a growing body of literature on 'gradient sampling' techniques for machine learning problems. These methods typically consider stochastic first-order methods but with a gradient approximation based on finite differencing in random directions [56]. This framework has led to variants of methods such as stochastic gradient descent [34], SVRG [51] and Adam [22], for example. We note that linear interpolation along orthogonal directions (more similar to traditional model-based DFO) has been shown to outperform gradient sampling as a gradient estimation technique [8,7].
Subspace DFO Methods A model-based DFO method with similarities to our subspace approach is the moving ridge function method from [45]. Here, existing objective evaluations are used to determine an 'active subspace' which captures the largest variability in the objective, and an interpolation model is built within this subspace. We also note the VXQR method from [57], which performs line searches along a direction chosen from a subspace determined by previous iterates. Neither of these methods includes convergence theory. By comparison, aside from our focus on nonlinear least-squares problems, both our general theoretical framework and our implemented method select their working subspaces randomly, and we provide (probabilistic) convergence guarantees.

Contributions
We introduce RSDFO (Randomized Subspace Derivative-Free Optimization), a generic model-based DFO framework that relies on constructing a model in a subspace at each iteration. Our novel approach enables model-based DFO methods to be applied in a large-scale regime by giving the user explicit control over the subspace dimension, and hence control over the per-iteration linear algebra cost of the method. This framework is then specialized to the case of nonlinear least-squares problems, yielding a new algorithm RSDFO-GN (Randomized Subspace DFO with Gauss–Newton). The subspace model construction framework of RSDFO-GN is based on DFO Gauss–Newton methods [18,14], and retains the same theoretical guarantees as RSDFO. We then describe a practical implementation of RSDFO-GN, which we call DFBGN (Derivative-Free Block Gauss–Newton). Compared to existing methods, DFBGN reduces the linear algebra cost of model construction and the initial objective evaluation cost by allowing fewer interpolation points at every iteration. In order for DFBGN to have both scalability and a similar evaluation efficiency to existing methods (i.e. objective reduction achieved for a given number of objective evaluations), several modifications to the theoretical framework, regarding the selection of interpolation points and the search subspace, are necessary.

Theoretical Results
We consider a generic theoretical framework, RSDFO, where the subspace dimension is a user-chosen algorithm hyperparameter and no specific model construction approach is prescribed. Our framework is not specific to a least-squares problem structure: it holds for any objective with Lipschitz continuous gradient, and allows for a general class of random subspace constructions (not relying on a specific class of embeddings or projections). The theoretical results here extend the approach and techniques in [15,16] to model-based DFO methods. In particular, we use the notion of a well-aligned subspace (Definition 2.8) from [15,16], one in which sufficient decrease is achievable, and assume that our search subspace is well-aligned with some probability (Assumption 2.9). This is achieved provided we select a sufficiently large subspace dimension (depending on the desired failure probability and subspace alignment quality).
We derive a high-probability worst-case complexity bound for RSDFO. Specifically, our main results bound the probability that min_{j≤k} ‖∇f(x_j)‖ exceeds a given tolerance, and hence provide high-probability bounds on the first iteration at which a given first-order optimality level is achieved (see Theorem 2.18 and Corollary 2.19). This result then implies a variety of alternative convergence results, such as expectation bounds and almost-sure convergence. Based on [15,16], we give several constructions for determining our random subspace, and show that for constructions based on Johnson–Lindenstrauss transformations we can achieve convergence with a subspace dimension that is independent of the ambient dimension.
Our analysis matches the standard deterministic O(ε^{-2}) complexity bounds for model-based DFO methods (e.g. [33,18]). Compared to the analysis of derivative-based methods (e.g. BCD [77] and probabilistically accurate models [20]), we need to incorporate the possibility that the interpolation model is not accurate (not fully linear, see Definition 2.1). However, unlike [5,21,11] we do not assume that full linearity is a stochastic property; instead, our stochasticity comes from the subspace selection, and we explicitly handle non-fully linear models similarly to [26,18]. This gives us a framework which is close to standard model-based DFO and requires only weak probabilistic conditions. Compared to the analysis of derivative-based random subspace methods in [15,16], our analysis is complicated substantially by the possibility of inaccurate models and the intricacies of model-based DFO algorithms.
We then consider RSDFO-GN, which explicitly describes how interpolation models can be constructed for nonlinear least-squares problems, thus providing a concrete implementation of RSDFO in this context. We prove that RSDFO-GN retains the same complexity bounds as RSDFO.
Implementation Although RSDFO-GN gives a reduced linear algebra cost for model construction compared to existing methods, it is important that we have an implementation that achieves this reduced cost without sacrificing overall performance (in the sense of objective decrease achieved within a given number of objective evaluations).
We introduce a practical, implementable variant of RSDFO-GN called DFBGN, which is based on the solver DFO-LS [14]. DFBGN includes an efficient, geometry-aware approach for selecting interpolation points (and hence directions in our subspace) for removal, and an adaptive randomized approach for selecting new interpolation points/subspace directions. We study the per-iteration linear algebra cost of DFBGN, and show that it is linear in the problem dimension, a substantial improvement over existing methods, whose cost is cubic in the problem dimension. Our per-iteration linear algebra costs are also linear in the number of residuals, the same as existing methods, but with a substantially smaller constant (quadratic in the subspace dimension, which is user-determined, rather than quadratic in the problem dimension).

Numerical Results
We compare DFBGN with DFO-LS (which itself is shown to have state-of-the-art performance in [14]) on collections of both medium-scale (approx. 100 dimensions) and large-scale (approx. 1000 dimensions) test problems. We show that DFBGN with a full-sized subspace has similar performance to DFO-LS in terms of objective evaluations, but improved performance in terms of runtime. As the dimension of the subspace is reduced (i.e. the size of the interpolation set is reduced), we demonstrate a tradeoff between reduced linear algebra costs and increased evaluation counts required to achieve a given objective reduction. The flexibility of DFBGN allows this tradeoff to be explicitly managed. When tested on large-scale problems, DFO-LS frequently reaches a reasonable runtime limit without making substantial progress, whereas DFBGN with a small subspace size can perform many more iterations and hence make better progress than DFO-LS. In the case of expensive objectives with small evaluation budgets, we show that DFBGN can make progress with few objective evaluations in a similar way to DFO-LS (which has a mechanism to make progress from as few as 2 objective evaluations, independent of problem dimension), but with substantially lower linear algebra costs.

Structure of paper
In Section 2 we describe RSDFO and provide our probabilistic worst-case complexity analysis. We specialize RSDFO to RSDFO-GN in Section 3. Then we describe the practical implementation DFBGN and its features in Section 4. Our numerical results are given in Section 5.
Implementation A Python implementation of DFBGN is available on GitHub.
Notation We use ‖·‖ to refer to the Euclidean norm of vectors and the operator 2-norm of matrices, and B(x, ∆) for x ∈ R^n and ∆ > 0 to denote the closed ball {y ∈ R^n : ‖y − x‖ ≤ ∆}.

Random Subspace Model-Based DFO
In this section we outline our general model-based DFO algorithmic framework based on minimization in random subspaces. We consider the nonconvex problem min_{x ∈ R^n} f(x), where we assume that f : R^n → R is continuously differentiable, but that access to its gradient is not possible (e.g. for the reasons described in Section 1). In a standard model-based DFO framework (e.g. [26,50]), at each iteration k we construct a quadratic model m_k : R^n → R which approximates f near our iterate x_k,
m_k(x_k + s) = f(x_k) + g_k^T s + (1/2) s^T H_k s,
for some g_k ∈ R^n and symmetric H_k ∈ R^{n×n}. Based on this model, we build a globally convergent algorithm using a trust-region framework [25]. This algorithmic framework is suitable provided that, when necessary, we can guarantee m_k is a sufficiently accurate model for f near x_k. Details about how to construct sufficiently accurate models based on interpolation are given in [26].
Our core idea here is to construct interpolation models which only approximate the objective in a subspace, rather than in the full space R n .This allows us to use interpolation sets with fewer points, since we do not have to capture the objective's behaviour outside our subspace, which improves the scalability of the method.
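To make this concrete, here is a minimal numerical sketch (not the paper's RSDFO/DFBGN construction; the toy objective, the `subspace_linear_model` helper and all parameter values are hypothetical) of building a linear interpolation model for f restricted to a p-dimensional subspace, using only p + 1 evaluations rather than the n + 1 or more a full-space model would need:

```python
import numpy as np

def subspace_linear_model(f, x, Q, delta):
    """Interpolate f at x and at the p points x + delta * q_i (q_i = columns
    of Q) to build a linear model m(shat) = f(x) + ghat @ shat that
    approximates f(x + Q @ shat) for small shat."""
    p = Q.shape[1]
    fx = f(x)
    # For this choice of interpolation points the linear system is diagonal,
    # so it reduces to a forward difference along each subspace direction.
    ghat = np.array([(f(x + delta * Q[:, i]) - fx) / delta for i in range(p)])
    return fx, ghat

# Toy usage: f(x) = 0.5*||x||^2 has gradient x, so the exact subspace
# gradient is Q.T @ x; the interpolation model should recover it.
rng = np.random.default_rng(0)
n, p = 100, 5
f = lambda x: 0.5 * x @ x
x = rng.standard_normal(n)
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))  # orthonormal columns
fx, ghat = subspace_linear_model(f, x, Q, delta=1e-6)
print(np.allclose(ghat, Q.T @ x, atol=1e-4))  # model gradient ≈ Q^T ∇f(x) → True
```

Only p + 1 evaluations of f are needed, and the linear algebra involves p-dimensional quantities only; this is the source of the evaluation and linear algebra savings described above.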
In this section, we outline our general algorithmic framework and provide a worst-case complexity analysis showing convergence to first-order stationary points with high probability.We then describe how this framework may be specialized to the case of nonlinear least-squares minimization.

RSDFO Algorithm
In our general framework, which we call RSDFO (Randomized Subspace DFO), we modify the above approach by randomly choosing a p-dimensional subspace (where p < n is user-chosen) and constructing an interpolation model defined only in that subspace. Specifically, in each iteration k we randomly choose a p-dimensional affine space Y_k ⊂ R^n given by Y_k := {x_k + Q_k ŝ : ŝ ∈ R^p} for some matrix Q_k ∈ R^{n×p}. We then construct a model which interpolates f at points in Y_k and ultimately obtain a local quadratic model for f only on Y_k. That is, given Q_k, we assume that we have m̂_k : R^p → R given by
m̂_k(ŝ) = f(x_k) + ĝ_k^T ŝ + (1/2) ŝ^T Ĥ_k ŝ,    (2.4)
where ĝ_k ∈ R^p and Ĥ_k ∈ R^{p×p} are the low-dimensional model gradient and Hessian respectively, adopting the convention of using hats on variables to denote low-dimensional quantities. In Section 3 we specialize this to a model construction process for nonlinear least-squares problems.
For our trust-region algorithm, we (approximately) minimize m̂_k inside the trust region to get a tentative low-dimensional step
ŝ_k ≈ argmin_{‖ŝ‖ ≤ ∆_k} m̂_k(ŝ),    (2.5)
for the current trust-region radius ∆_k > 0, yielding a tentative step s_k = Q_k ŝ_k. We thus also get the computational advantage coming from solving a p-dimensional trust-region subproblem.
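A hedged sketch of this step (using the simple Cauchy-point approximation to the subproblem solution; the paper does not prescribe this particular solver, and all names and values here are illustrative):

```python
import numpy as np

def cauchy_step(ghat, Hhat, delta):
    """Approximately minimize mhat(shat) = ghat@shat + 0.5*shat@Hhat@shat over
    ||shat|| <= delta along -ghat (the Cauchy point). This cheap approximate
    solution already provides the decrease that trust-region theory requires."""
    gnorm = np.linalg.norm(ghat)
    if gnorm == 0.0:
        return np.zeros_like(ghat)
    curv = ghat @ Hhat @ ghat
    # Unconstrained minimizer along -ghat, clipped to the trust region.
    t = delta / gnorm if curv <= 0 else min(gnorm**2 / curv, delta / gnorm)
    return -t * ghat

# One subspace step: solve in R^p, then map back to R^n via s = Q @ shat.
rng = np.random.default_rng(1)
n, p, delta = 50, 4, 1.0
x = rng.standard_normal(n)
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
ghat = Q.T @ x                 # subspace gradient of f(x) = 0.5*||x||^2
Hhat = np.eye(p)               # subspace model Hessian
shat = cauchy_step(ghat, Hhat, delta)
s = Q @ shat                   # tentative full-space step
model_decrease = -(ghat @ shat + 0.5 * shat @ Hhat @ shat)
print(np.linalg.norm(shat) <= delta and model_decrease > 0)  # → True
```

The p-dimensional subproblem involves only O(p²) work per solver iteration instead of O(n²), which is where the linear algebra savings of the subspace approach come from.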
In our setting we are only interested in the approximation properties of m̂_k in the space Y_k, and so we introduce the following notion of a "sufficiently accurate" model: we say m̂ : R^p → R is Q-fully linear in B(x, ∆) if
|f(x + Qŝ) − m̂(ŝ)| ≤ κ_ef ∆²,    (2.6a)
‖Q^T ∇f(x + Qŝ) − ∇m̂(ŝ)‖ ≤ κ_eg ∆,    (2.6b)
for all ŝ ∈ R^p with ‖ŝ‖ ≤ ∆. The constants κ_ef and κ_eg must be independent of Q, m̂, x and ∆.
The gradient condition (2.6b) comes from noting that if f̂(ŝ) := f(x + Qŝ), then ∇f̂(ŝ) = Q^T ∇f(x + Qŝ). We note that if we have a full-dimensional subspace, p = n, and take Q = I, then we recover the standard notion of fully linear models [26, Definition 6.1].
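This chain rule identity is easy to verify numerically; the sketch below (the toy quadratic and all dimensions are arbitrary choices, not from the paper) compares a finite-difference gradient of f̂(ŝ) = f(x + Qŝ) against Q^T ∇f(x + Qŝ):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 3
A = rng.standard_normal((n, n))
A = A + A.T                         # symmetric, so grad f(x) = A @ x
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x

x = rng.standard_normal(n)
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
shat = rng.standard_normal(p)

# fhat(shat) := f(x + Q shat); central-difference its gradient in R^p.
fhat = lambda s: f(x + Q @ s)
h = 1e-6
I = np.eye(p)
fd_grad = np.array([(fhat(shat + h * I[i]) - fhat(shat - h * I[i])) / (2 * h)
                    for i in range(p)])
print(np.allclose(fd_grad, Q.T @ grad_f(x + Q @ shat), atol=1e-5))  # → True
```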

Complete RSDFO Algorithm
The complete RSDFO algorithm is stated in Algorithm 1. The overall structure is common to model-based DFO methods [26]. In particular, we assume that we have procedures to verify whether or not a model is Q_k-fully linear in B(x_k, ∆_k) and (if not) to generate a Q_k-fully linear model. When we specialize RSDFO to nonlinear least-squares problems in Section 3, we will describe how we can obtain such procedures.
After defining our subspace and model, we first perform a criticality step, which guarantees that, whenever we suspect we are close to first-order stationarity as measured by ‖ĝ_k‖, we have an accurate model and an appropriately sized trust-region radius. Often this criticality step is formulated as its own subroutine with an extra inner loop [26], but following [33] we only perform one criticality step per iteration and avoid nested loops.
[Algorithm 1 (RSDFO), summarized: at each iteration k, either reuse the previous subspace and construct a Q_k-fully linear model (if CHECK_MODEL=TRUE), or define a new subspace by randomly sampling Q_k and construct a reduced model m̂_k : R^p → R (2.4), which need not be Q_k-fully linear; approximately solve the subspace trust-region subproblem in R^p (2.5) and calculate the step (2.7); then accept/reject the step and update the trust-region radius.]
The remainder of Algorithm 1 broadly follows standard trust-region methods. If the computed step is much shorter than the trust-region radius, we enter a safety step (originally due to Powell [61]), which is similar to an unsuccessful iteration (i.e. ρ_k < η_1: we reject the step and decrease ∆_k) but without evaluating f(x_k + s_k), hence saving one objective evaluation. Our updating mechanism for ∆_k (2.8) also takes the computed step size into account, and is the same as in [63,18].
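The accept/reject logic with a Powell-style safety step can be sketched as follows (a simplified illustration only: the parameter names and values such as `betaF` and `eta1`, and the simple update rule, are hypothetical stand-ins rather than the paper's mechanism (2.8)):

```python
import numpy as np

def accept_and_update(f, x, s, model_decrease, delta, fx,
                      betaF=0.1, eta1=0.1, gamma_dec=0.5, gamma_inc=2.0):
    """One accept/reject step with a safety step: if the step is much shorter
    than delta, reject WITHOUT evaluating f (saving one evaluation);
    otherwise use the usual actual-vs-predicted decrease ratio test."""
    if np.linalg.norm(s) < betaF * delta:           # safety step: no f evaluation
        return x, fx, gamma_dec * delta, False
    fnew = f(x + s)
    rho = (fx - fnew) / max(model_decrease, 1e-15)  # ratio of decreases
    if rho >= eta1:                                 # successful: accept, grow radius
        return x + s, fnew, gamma_inc * delta, True
    return x, fx, gamma_dec * delta, False          # unsuccessful: reject, shrink

# Usage on a toy problem: a good step is accepted and the radius grows.
f = lambda x: 0.5 * x @ x
x = np.array([1.0, 1.0]); fx = f(x)
s = np.array([-0.5, -0.5])                          # a descent step
x2, fx2, delta2, ok = accept_and_update(f, x, s, model_decrease=fx - f(x + s),
                                        delta=1.0, fx=fx)
print(ok and fx2 < fx)  # → True
```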
An important feature of RSDFO is that in some iterations we reuse the previous subspace, Y_k = Y_{k−1}, corresponding to the flag CHECK_MODEL=TRUE. In this case, we had an inaccurate model in iteration k − 1 and require that our new model m̂_k is accurate (Q_k-fully linear). This mechanism essentially ensures that ∆_k is not decreased too quickly as a result of inaccurate models, and is mostly decreased to achieve sufficient objective reduction.
We now give our convergence and worst-case complexity analysis of Algorithm 1.

Assumptions and Preliminary Results
We begin our analysis with some basic assumptions and preliminary results.
Assumption 2.2 (Smoothness). The objective function f : R^n → R is bounded below by f_low and continuously differentiable, and ∇f is L_∇f-Lipschitz continuous in the extended level set {y ∈ R^n : ‖y − x‖ ≤ ∆_max for some x with f(x) ≤ f(x_0)}, for some constant L_∇f > 0.
We also need two standard assumptions for trust-region methods: model Hessians that are uniformly bounded above, and sufficiently accurate solutions to the trust-region subproblem (2.5).
Lemma 2.6. Suppose Assumptions 2.3 and 2.4 hold, and we run RSDFO with m̂_k being Q_k-fully linear and ∆_k ≤ min(µ, κ_H^{-1}) ‖ĝ_k‖. Then the criticality and safety steps are not called, and iteration k is very successful.
Proof. Since m̂_k is Q_k-fully linear and ∆_k ≤ µ‖ĝ_k‖, the criticality step is not called. From Lemma 2.5 and ∆_k ≤ ‖ĝ_k‖/κ_H, we have ‖ŝ_k‖ ≥ c_2 ∆_k ≥ β_F ∆_k, and so the safety step is not called. From Assumptions 2.3 and 2.4, we obtain a sufficient model decrease proportional to ‖ĝ_k‖∆_k, since ∆_k ≤ ‖ĝ_k‖/κ_H by assumption. Next, since m̂_k is Q_k-fully linear, from (2.6a) we can bound |ρ_k − 1|, and hence conclude the result.
Our key new assumption is on the quality of our subspace selection, as suggested in [15,16]. We say that Q_k is well-aligned (Definition 2.8) if
‖Q_k^T ∇f(x_k)‖ ≥ α_Q ‖∇f(x_k)‖,    (2.16)
for some α_Q ∈ (0, 1) independent of k, and we assume (Assumption 2.9) that (a) each Q_k is well-aligned with probability at least 1 − δ_S, conditioned on the algorithm history, and (b) ‖Q_k‖ ≤ Q_max uniformly.
Of these two properties, (a) is needed for our complexity analysis, while (b) is only needed in order to construct Q_k-fully linear models (in Section 3). We will discuss how to achieve Assumption 2.9 in more detail in Section 2.6.
Lemma 2.10. In all iterations k of RSDFO where the criticality step is not called, we have either ‖ĝ_k‖ ≥ ε_C, or ∆_k ≤ µ‖ĝ_k‖ with m̂_k Q_k-fully linear; moreover, if Q_k is well-aligned and ‖∇f(x_k)‖ ≥ ε, then
‖ĝ_k‖ ≥ ε_g(ε) := min(ε_C, α_Q ε / (1 + κ_eg µ)).    (2.17)
Proof. The first part follows immediately from the entry condition of the criticality step. To prove (2.17), suppose the criticality step is not called in iteration k and ‖ĝ_k‖ < ε_C. Then we have ∆_k ≤ µ‖ĝ_k‖ and m̂_k is Q_k-fully linear, and so from (2.6b) we have
‖Q_k^T ∇f(x_k) − ĝ_k‖ ≤ κ_eg ∆_k ≤ κ_eg µ ‖ĝ_k‖.    (2.18)
Since Q_k is well-aligned, we conclude from (2.16) and (2.18) that
(1 + κ_eg µ) ‖ĝ_k‖ ≥ ‖Q_k^T ∇f(x_k)‖ ≥ α_Q ‖∇f(x_k)‖,
and we are done, since ‖∇f(x_k)‖ ≥ ε.

Counting iterations
We now provide a series of results counting the number of iterations of RSDFO of different types, following the style of analysis from [20,16]. First we introduce some notation to enumerate our iterations. Suppose we run RSDFO until the end of iteration K. We then define the following subsets of {0, . . . , K}:
• C is the set of iterations in {0, . . . , K} where the criticality step is called.
• A is the set of well-aligned iterations in {0, . . . , K}, where (2.16) holds, and A^C is the set of poorly aligned iterations, where (2.16) does not hold.
• L is the set of iterations in {0, . . . , K} where m̂_k is Q_k-fully linear.
• L^C is the set of iterations in {0, . . . , K} where m̂_k is not Q_k-fully linear.
We also write S and VS for the sets of successful and very successful iterations, and D(∆) for the set of iterations with ∆_k ≥ ∆ (with D^C(∆) its complement). In particular, we have the partitions {0, . . . , K} = A ∪ A^C = L ∪ L^C = D(∆) ∪ D^C(∆), for any ∆ > 0. First, we bound the number of successful iterations with large ∆_k using standard arguments from trust-region methods. Throughout, we use #(·) to refer to the cardinality of a set of iterations.
Proof. Since ‖∇f(x_k)‖ ≥ ε, from Lemma 2.10 we have ‖ĝ_k‖ ≥ ε_g(ε). Summing the sufficient decrease over the successful iterations with ∆_k ≥ ∆, where the last line follows from ‖ĝ_k‖ ≥ ε_g(ε), the result follows.
Lemma 2.12. Suppose Assumptions 2.2, 2.3 and 2.4 hold, and (2.27) holds. From this and k ∈ L, the assumptions of Lemma 2.6 are met, so k ∉ F, giving a contradiction. Next, we suppose k ∈ A ∩ D^C(∆) ∩ L ∩ C and again look for a contradiction. In this case, we have ‖ĝ_k‖ > µ^{-1} ∆_k and k ∈ L, so the criticality step is not entered; i.e. k ∈ C^C, a contradiction. Hence we have #(A ∩ D^C(∆) ∩ L ∩ C) = 0 and we are done.
Hence the total decrease in log(∆_k) must be fully matched by the initial gap log(∆_0/∆) plus the maximum possible amount that log(∆_k) can be increased above log(∆). That is, we must have the corresponding counting inequality, which gives us (2.29).

Proof. We follow similar reasoning to the proof of Lemma 2.13. For every iteration k ∈ VS ∩ D^C(∆), we increase ∆_k by a factor of at least γ_inc, since k is very successful, and we are done.
Proof. After every iteration k where m̂_k is not Q_k-fully linear and either the criticality step is called or ρ_k < η_2, we always set ∆_{k+1} ≤ ∆_k and CHECK_MODEL=TRUE. This means that the following iteration lies in L, and we are done.
We are now in a position to bound the total number of well-aligned iterations.
Lemma 2.16. Suppose Assumptions 2.2, 2.3 and 2.4 hold. Then #(A), the number of well-aligned iterations, is bounded in terms of K. Here, the values C_1 and C_2 are defined in Lemma 2.13, C_3 is defined in Lemma 2.14, φ(·, ε) is defined in Lemma 2.11, and ε_g(ε) and ∆*(ε) are defined in Lemmas 2.10 and 2.12 respectively.
Next, we have a chain of bounds in which the first inequality follows from Lemma 2.11, the second follows from Lemma 2.13 and ∆_min ≤ γ_inc^{-1} ∆_0, and the last line follows from Lemma 2.15 and VS ⊂ S. Now we use Lemma 2.12, where the last line follows from Lemma 2.11. Separately, we use Lemma 2.15, and apply Lemma 2.12 with ∆_min ≤ ∆*(ε) to get (2.58). We then obtain (2.63), where the third inequality follows from Lemma 2.11 and from Lemma 2.14 with (2.64). Since C_3 ∈ (0, 1/2), we can rearrange (2.63) to conclude (2.65). Now, we combine (2.53) and (2.65); substituting the result into (2.69) and rearranging, we get the desired result. That C_4 > 0 follows from C_1 > 0 and C_3 ∈ (0, 1/2).

Overall Complexity Bound
The key remaining step is to compare #(A) with K. Since each event "Q_k is well-aligned" is effectively an independent Bernoulli trial with success probability at least 1 − δ_S, we derive the result below based on a concentration bound for Bernoulli trials [23, Lemma 2.1].
Proof. The CHECK_MODEL=FALSE case of this proof has a general framework based on [42, Lemma 4.5] (also followed in [16]), with a probabilistic argument from [23, Lemma 2.1]. First, we consider only the subsequence of iterations K_1 := {k_0, . . . , k_J} ⊂ {0, . . . , K} in which Q_k is resampled (i.e. where CHECK_MODEL=FALSE, so Q_k is drawn anew rather than reusing Q_{k−1}). For convenience, we define A_1 := A ∩ K_1. Let T_{k_j} be the indicator function for the event "Q_{k_j} is well-aligned", so that #(A_1) = Σ_{j=0}^{J} T_{k_j}. Since T_{k_j} ∈ {0, 1}, and denoting p_{k_j} := P[T_{k_j} = 1 | x_{k_j}], for any t > 0 we can bound the conditional moment generating function of T_{k_j}, where the inequality follows from the identity px + log(1 − p + pe^{−x}) ≤ x²/8. Using the tower property of conditional expectations and the fact that, since k_j ∈ K_1, T_{k_j} depends only on x_{k_j} and not on any previous iteration, we then obtain a bound on the unconditional moment generating function, where the second-last line follows from (2.72) and the last line follows by induction. This means that we can bound the lower tail of #(A_1), where the inequalities follow from Markov's inequality and (2.78) respectively. Taking t = λ / Σ_{j=0}^{J} p_{k_j}, we get a Chernoff-type bound. Finally, we take λ = δ Σ_{j=0}^{J} p_{k_j} for some δ ∈ (0, 1) and note that p_{k_j} ≥ 1 − δ_S (from Assumption 2.9), to conclude the stated bound, or equivalently, the bound over {0, . . . , K} using the partition. Now we must consider the iterations for which CHECK_MODEL=TRUE, i.e. those in K_1^C. Further, the algorithm guarantees that if k ∈ K_1^C, then k > 0 and k ∈ A if and only if k − 1 ∈ A. These are the key implications of RSDFO that we will now use.
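The Bernoulli concentration at the heart of this argument can be illustrated empirically (a Monte-Carlo sanity check with arbitrary parameter values, not a computation from the paper):

```python
import numpy as np

# Monte-Carlo illustration of the Bernoulli concentration bound: if each
# iteration is well-aligned independently with probability >= 1 - deltaS,
# then #(A) <= (1 - delta)(1 - deltaS)K occurs with probability at most
# about exp(-delta^2 (1 - deltaS) K / 2) (a standard Chernoff lower-tail bound).
rng = np.random.default_rng(3)
K, deltaS, delta, trials = 2000, 0.2, 0.3, 5000
aligned_counts = rng.binomial(K, 1 - deltaS, size=trials)   # #(A) per run
threshold = (1 - delta) * (1 - deltaS) * K
empirical = np.mean(aligned_counts <= threshold)
chernoff_bound = np.exp(-delta**2 * (1 - deltaS) * K / 2)
print(empirical <= chernoff_bound)  # → True
```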
Proof. First, fix some arbitrary k ≥ 0. Let ε_k := min_{j≤k} ‖∇f(x_j)‖ and let A_k be the number of well-aligned iterations in {0, . . . , k}. If ε_k > 0, from Lemma 2.16 we obtain a bound on A_k. For any δ > 0 satisfying the required condition, we can then compute the corresponding tail bound using Lemma 2.17. Defining ψ accordingly, we have, from our assumption on δ_S, the stated inequality. Hence we obtain the desired probability bound, and we note that this result still holds if ε_k = 0, as lim_{ε→0} ψ(ε) = ∞. Now we fix ε > 0 and choose k satisfying (2.92). We use the fact that ψ(·) is non-increasing to get (2.102), and (2.93) follows. Lastly, we fix ε, and use (2.93) and the definition of K_ε to get (2.107), from which we get (2.94).
Corollary 2.19. Suppose the assumptions of Theorem 2.18 hold. Then for k ≥ k_0 (for some k_0), we have
P[min_{j≤k} ‖∇f(x_j)‖ ≤ C k^{-1/2}] ≥ 1 − e^{−ck},
for some constants c, C > 0. Alternatively, for ε ∈ (0, ε_0) (for some ε_0), we have
P[K_ε ≤ C ε^{-2}] ≥ 1 − e^{−c ε^{-2}},
for constants c, C > 0.
Remark 2.20. All of the above analysis holds with minimal modifications if we replace the trust-region mechanisms in RSDFO with more standard trust-region updating mechanisms. This includes, for example, having no safety step (i.e. β_F = 0), and replacing (2.8) with a standard increase/decrease rule for some η ∈ (0, 1). The corresponding requirement on the trust-region updating parameters to prove a version of Theorem 2.18 is simply γ_inc > γ_dec^{-2} (provided we also set γ_C = γ_dec).

Remarks on Complexity Bound
Our final complexity bounds for RSDFO in Corollary 2.19 are comparable to those for probabilistic direct search [42, Corollary 4.9]. They also match, in the order of ε, the standard bounds for (full space) model-based DFO methods for general objectives [71,33] and nonlinear least-squares problems [19].
Following [42], we may also derive complexity bounds on the expected first-order optimality measure (of O(k^{-1/2})) and the expected worst-case complexity (of O(ε^{-2}) iterations) for RSDFO.
Theorem 2.21. Suppose the assumptions of Theorem 2.18 hold. Then for k ≥ k_0, the iterates of RSDFO satisfy E[min_{j≤k} ‖∇f(x_j)‖] = O(k^{-1/2}), with constants determined by c, C > 0 from (2.109), and for ε ∈ (0, ε_0) we have E[K_ε] = O(ε^{-2}). Here, k_0 and ε_0 are the same as in Corollary 2.19.
Proof. First, for k ≥ k_0 define the random variable H_k by (2.115). Then since min_{j≤k} ‖∇f(x_j)‖ ≤ H_k, we get the first bound, and so from Theorem 2.18 we have the tail bound, where we use the identity E[X] = ∫_0^∞ P[X > t] dt for non-negative random variables X (e.g. [69, eqn. (1.9)]) to get the expectation bound, where C_1 comes from k_0(ε) = Θ(ε^{-2}), which concludes our proof.
Furthermore, we also get almost-sure convergence of lim inf type (2.120), similar to [26, Theorem 10.12] in the deterministic case. However, P[inf_{k≥0} ‖∇f(x_k)‖ > ε] ≤ P[min_{j≤k} ‖∇f(x_j)‖ > ε] for all k, and so this probability must be zero. The result follows from the union bound applied to any sequence ε_n = n^{-1}, for example.
In particular, if ‖∇f(x_k)‖ > 0 for all k, then Theorem 2.22 implies lim inf_{k→∞} ‖∇f(x_k)‖ = 0 almost surely.

Selecting a Subspace Dimension
We now specify how to generate our subspaces Q_k to be probabilistically well-aligned and uniformly bounded (Assumption 2.9). These requirements are quite weak, and so there are several possible approaches for constructing Q_k. First, we discuss the case where Q_k is chosen to be a random matrix with orthonormal columns, and show that we need to choose our subspace dimension p ∼ √n. This is related to how we ultimately select Q_k in the practical implementation DFBGN (Section 4). However, we then show that by instead taking Q_k to be a Johnson–Lindenstrauss transform, we can choose a value of p independent of the ambient dimension n.

Random Orthogonal Basis
First, we consider the case where Q_k is a randomly generated matrix with orthonormal columns. We have the below result, a consequence of [54, Theorem 9].

Theorem 2.23. Suppose the columns of Q_k ∈ R^{n×p} form an orthonormal basis for a randomly generated p-dimensional subspace of R^n. Then for any fixed vector v and all θ ∈ (0, 1), the concentration bound of [54, Theorem 9] on ‖Q_k^T v‖ holds.
For some δ_S ∈ (0, 1), choosing θ appropriately we get (2.124). Therefore Assumption 2.9 is achieved provided we set p accordingly. In particular, since our theory holds if we take α_Q > 0 arbitrarily small, our only limitation on the subspace dimension p is δ_S < 1/(1 + C_4) from Theorem 2.18, yielding the minimum requirement p > 64 log, where C_4 depends only on our choice of trust-region algorithm parameters. The boundedness condition Assumption 2.9(b) holds with Q_max = 1 automatically. This gives us considerable scope to use very small subspace dimensions, which means our subspace approach has a strong chance of providing a substantial reduction in linear algebra costs.
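To illustrate this construction (a sketch of ours, not the paper's implementation; the helper name is hypothetical), a matrix Q_k with orthonormal columns spanning a uniformly random p-dimensional subspace can be sampled via the reduced QR factorization of a Gaussian matrix. For a fixed vector v, the quantity ‖Q_k^T v‖^2/‖v‖^2 then concentrates around p/n, which is the effect behind the p ∼ √n requirement above:

```python
import numpy as np

def random_orthonormal_subspace(n, p, rng):
    """Sample Q in R^{n x p} with orthonormal columns spanning a random
    p-dimensional subspace, via reduced QR of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
    return Q

rng = np.random.default_rng(0)
n, p = 100, 10
Q = random_orthonormal_subspace(n, p, rng)
# Orthonormal columns, so Assumption 2.9(b) holds with Q_max = 1.
orth_err = np.abs(Q.T @ Q - np.eye(p)).max()

# For a fixed v, ||Q^T v||^2 / ||v||^2 concentrates around p/n.
v = rng.standard_normal(n)
mean_ratio = np.mean([
    np.linalg.norm(random_orthonormal_subspace(n, p, rng).T @ v) ** 2
    / np.linalg.norm(v) ** 2
    for _ in range(300)
])
```

Averaged over many samples, `mean_ratio` is close to p/n = 0.1, reflecting the shrinkage of the projected gradient that the well-alignment condition must compensate for.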

Johnson-Lindenstrauss Embeddings
We can improve on the requirement (2.126) on p by using Q_k with non-orthonormal columns. Specifically, we take Q_k to be a Johnson-Lindenstrauss transform (JLT) [74]. The application of these techniques to random subspace optimization algorithms follows [15,16].
By taking v = ∇f(x_k) in iteration k, and noting (1 − ε)^2 ≤ 1 − ε for all ε ∈ (0, 1), we have that Assumption 2.9(a) holds if we take Q_k = S^T, where S is any suitable JLT (such as the Gaussian or hashing constructions below). That is, Assumption 2.9(a) is satisfied using either of these constructions and p = Ω((1 − α_Q)^{-2} |log δ_S|). Importantly, these constructions allow us to choose a subspace dimension p which has no dependence on the ambient dimension n (i.e. p = O(1) as n → ∞), an improvement on (2.126). We note that the requirement δ_S < 1/(1 + C_4) in Theorem 2.18 yields a very mild dependence of p on the choice of trust-region updating parameters.
We conclude by noting that the uniform boundedness property Assumption 2.9(b) is trivial if S is a hashing matrix. If S is Gaussian, by Bernstein's inequality we can choose Q_max large enough that P[‖S‖ > Q_max] is small. Then, we generate our final Q_k by sampling S independently until ‖S‖ ≤ Q_max holds (which almost surely takes finite time). A union bound argument then gives us Assumption 2.9(a) for this choice of Q_k, and so Assumption 2.9 is completely satisfied with the same (asymptotic) requirements on p.
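The two sketch families mentioned above can be illustrated numerically (our own sketch, with hypothetical helper names; not the paper's code). Both the Gaussian and hashing constructions satisfy E[‖Sv‖^2] = ‖v‖^2 for any fixed v, with p chosen independently of n:

```python
import numpy as np

def gaussian_sketch(p, n, rng):
    """Gaussian JLT: i.i.d. entries N(0, 1/p), so E[||S v||^2] = ||v||^2."""
    return rng.standard_normal((p, n)) / np.sqrt(p)

def hashing_sketch(p, n, rng):
    """Hashing matrix: one random +/-1 entry per column, also unbiased
    for squared norms."""
    S = np.zeros((p, n))
    S[rng.integers(0, p, size=n), np.arange(n)] = rng.choice([-1.0, 1.0], size=n)
    return S

rng = np.random.default_rng(1)
n, p = 1000, 25          # note p is chosen independently of n
v = rng.standard_normal(n)
ratios = {}
for make in (gaussian_sketch, hashing_sketch):
    ratios[make.__name__] = np.mean(
        [np.linalg.norm(make(p, n, rng) @ v) ** 2 for _ in range(400)]
    ) / np.linalg.norm(v) ** 2
```

Both averaged ratios come out close to 1, consistent with the norm-preservation property that underpins Assumption 2.9(a) for Q_k = S^T.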

Random Subspace Nonlinear Least-Squares Method
We now describe how RSDFO (Algorithm 1) can be specialized to the unconstrained nonlinear least-squares problem (3.1), where r : R^n → R^m is given by r(x) := [r_1(x), . . ., r_m(x)]^T. We assume that r is differentiable, but that access to the Jacobian J : R^n → R^{m×n} is not possible. In addition, we typically assume that m ≥ n (regression), but everything here also applies to the case m < n (inverse problems). We now introduce the algorithm RSDFO-GN (Randomized Subspace DFO with Gauss-Newton), which is a randomized subspace version of a model-based DFO variant of the Gauss-Newton method [18]. Following the construction from [18], we assume that we have selected the p-dimensional search space Y_k defined by Q_k ∈ R^{n×p} (as in RSDFO above). Then, we suppose that we have evaluated r at p + 1 points Y_k := {x_k, y_1, . . ., y_p} ⊂ Y_k (which typically are all close to x_k). Since y_t ∈ Y_k for each t = 1, . . ., p, from (2.3) we have y_t = x_k + Q_k ŝ_t for some ŝ_t ∈ R^p. Given this interpolation set, we first wish to construct a local subspace linear model m̂_k(ŝ) := r(x_k) + Ĵ_k ŝ for r (3.2). To do this, we choose the approximate subspace Jacobian Ĵ_k ∈ R^{m×p} by requiring that m̂_k interpolate r at our interpolation points Y_k. That is, we impose m̂_k(ŝ_t) = r(y_t) for all t = 1, . . ., p, which yields the p × p linear system (3.4) (with m right-hand sides). Our linear subspace model m̂_k (3.2) naturally yields a local subspace quadratic model (3.5) for f, as in the classical Gauss-Newton method (c.f. (2.4)), where ĝ_k := Ĵ_k^T r(x_k) and Ĥ_k := Ĵ_k^T Ĵ_k.
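To make this concrete, the following sketch (our own code with hypothetical names, not the RSDFO-GN implementation) recovers Ĵ_k from the interpolation conditions and forms ĝ_k and Ĥ_k. As a sanity check, for a linear residual r(x) = Ax − b the interpolation is exact and Ĵ_k = A Q_k:

```python
import numpy as np

def subspace_gn_model(r, xk, Qk, S_hat):
    """Build the subspace Gauss-Newton model from interpolation.
    Columns of S_hat are the steps s_t with y_t = x_k + Q_k s_t.
    Imposing m(s_t) = r(y_t) gives the p x p system S_hat^T J_hat^T = rhs
    (with m right-hand sides), then g_hat = J_hat^T r(x_k) and
    H_hat = J_hat^T J_hat."""
    p = S_hat.shape[1]
    rk = r(xk)
    rhs = np.stack([r(xk + Qk @ S_hat[:, t]) - rk for t in range(p)])  # p x m
    J_hat = np.linalg.solve(S_hat.T, rhs).T   # m x p approximate Jacobian
    return J_hat, J_hat.T @ rk, J_hat.T @ J_hat

# Check on a linear residual r(x) = Ax - b, where J_hat = A Q_k exactly.
rng = np.random.default_rng(2)
m, n, p = 8, 5, 2
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)
xk = rng.standard_normal(n)
Qk, _ = np.linalg.qr(rng.standard_normal((n, p)))
S_hat = np.eye(p) + 0.1 * rng.standard_normal((p, p))
J_hat, g_hat, H_hat = subspace_gn_model(lambda x: A @ x - b, xk, Qk, S_hat)
jac_err = np.abs(J_hat - A @ Qk).max()
```

For nonlinear r the same code gives the first-order accurate subspace model used by the trust-region step.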

Constructing Q k -Fully Linear Models
We now describe how we can achieve Q k -fully linear models of the form (3.5) in RSDFO-GN.
As in [18], we will need to define the Lagrange polynomials and Λ-poisedness of an interpolation set. Given that our interpolation set Y_k lies inside Y_k, we consider the (low-dimensional) Lagrange polynomials associated with Y_k. These are the linear functions ℓ̂_0, . . ., ℓ̂_p : R^p → R defined by the interpolation conditions ℓ̂_t(ŝ_{t'}) = δ_{t,t'} for all t, t' = 0, . . ., p, with the convention ŝ_0 = 0 corresponding to the interpolation point x_k. The Lagrange polynomials exist and are unique whenever Ŵ_k (3.4) is invertible, which we typically ensure through judicious updating of Y_k at each iteration.
Note that since ℓ̂_0(0) = 1, for the set Y_k to be Λ-poised we require Λ ≥ 1. In general, a larger Λ indicates that Y_k has "worse" geometry, which leads to a less accurate approximation of f. This notion of Λ-poisedness (in a subspace) is sufficient to construct Q_k-fully linear models (3.5) for f.

Lemma 3.2. Suppose Assumption 2.9(b) holds, J(x) is Lipschitz continuous, and r and J are uniformly bounded above in a neighborhood of the iterates. Then if Y_k is Λ-poised, the model (3.5) is Q_k-fully linear with constants κ_ef, κ_eg = O(p^2 Λ^2).

Proof. Consider the low-dimensional functions r̂_k : R^p → R^m and f̂_k : R^p → R given by r̂_k(ŝ) := r(x_k + Q_k ŝ) and f̂_k(ŝ) := (1/2)‖r̂_k(ŝ)‖^2 respectively. We note that r̂_k is continuously differentiable with Jacobian Ĵ_k(ŝ) = J(x_k + Q_k ŝ) Q_k. By Assumption 2.9(b), it is straightforward to show that both r̂_k and Ĵ_k are uniformly bounded above and that Ĵ_k is Lipschitz continuous (with a Lipschitz constant Q_max^2 times larger than for J(x)). We can then consider m̂_k (3.2) and (3.5) to be interpolation models for r̂_k and f̂_k in the low-dimensional ball B(0, ∆_k) ⊂ R^p. From [18, Lemma 3.3], we conclude that (3.5) is a fully linear model for f̂_k with constants κ_ef, κ_eg = O(p^2 Λ^2). The Q_k-fully linear property follows immediately from this, noting that ∇f̂_k(ŝ) = Q_k^T ∇f(x_k + Q_k ŝ).

Given this result, the procedures in [26, Chapter 6] allow us to check and/or guarantee the Λ-poisedness of an interpolation set, and we have met all the requirements needed to fully specify RSDFO-GN.
Lastly, we note that underdetermined linear interpolation, where (3.4) is underdetermined and solved in a minimal norm sense, has been recently shown to yield a property similar to Q k -full linearity [47,Theorem 3.6].
Complete RSDFO-GN Algorithm A complete statement of RSDFO-GN is given in Algorithm 2. This exactly follows RSDFO (Algorithm 1), but where we ask that the interpolation set satisfies the conditions described above.

Complexity Analysis for RSDFO-GN
We are now in a position to specialize our complexity analysis for RSDFO to RSDFO-GN. For this, we need to impose a smoothness assumption on r.
Assumption 3.3. The extended level set L := {y ∈ R^n : ‖y − x‖ ≤ ∆_max for some x with f(x) ≤ f(x_0)} is bounded, r is continuously differentiable, and the Jacobian J is Lipschitz continuous on L.
This smoothness requirement allows us to immediately apply the complexity analysis for RSDFO, yielding the following result.
Proof. Assumption 3.3 implies that both r and J are uniformly bounded above on L, which is sufficient for Lemma 3.2 to hold whenever we check/ensure Λ-poisedness of the interpolation set. In addition, from [18, Lemma 3.2] and taking f_low = 0, we have that Assumption 2.2 is satisfied. Therefore the result follows directly from Corollary 2.19.

Bounds on Objective Evaluations
In each iteration of RSDFO, we require at most p + 1 objective evaluations: at most p to form the model m̂_k (and this holds regardless of whether we need m̂_k to be Q_k-fully linear or not), and one evaluation for x_k + s_k. Hence the above bounds on K also hold for the number of objective evaluations required to first achieve ‖∇f(x_k)‖ ≤ ε, up to a constant factor of p + 1.
Interpolation Models for RSDFO Using existing techniques for constructing Λ-poised interpolation sets for quadratic interpolation (as outlined in [26]), the specific model construction ideas presented here can also be applied to general objective problems, and thus provide a concrete implementation of RSDFO.

Linear Algebra Cost of RSDFO-GN
In RSDFO-GN, the interpolation linear system (3.4) is solved in two steps: factorize the interpolation matrix Ŵ_k, then back-solve for each right-hand side. Thus, the cost of the linear algebra is:
1. Model construction costs O(p^3) to compute the factorization of Ŵ_k, and O(mp^2) for the back-substitution solves with m right-hand sides; and
2. Lagrange polynomial construction costs O(p^3) in total, due to one back-solve for each of the p + 1 polynomials (using the pre-existing factorization of Ŵ_k).
By updating the factorization of Ŵ_k (or Ŵ_k^{-1} directly, e.g. via the Sherman-Morrison formula), we can replace the O(p^3) factorization cost with an O(p^2) updating cost (c.f. [62]). However, the dominant O(mp^2) model construction cost remains, and in practice we have observed that the factorization needs to be recomputed from scratch to avoid the accumulation of rounding errors.
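The O(p^2) updating cost can be illustrated with a generic Sherman-Morrison sketch (our own code, not DFBGN's implementation): replacing one interpolation point induces a rank-one change W + u v^T, whose inverse follows from W^{-1} without refactorizing:

```python
import numpy as np

def sherman_morrison(Winv, u, v):
    """(W + u v^T)^{-1} from W^{-1} in O(p^2) flops,
    versus O(p^3) to refactorize W from scratch."""
    Wu = Winv @ u
    return Winv - np.outer(Wu, v @ Winv) / (1.0 + v @ Wu)

rng = np.random.default_rng(3)
p = 20
W = rng.standard_normal((p, p)) + 10 * p * np.eye(p)  # comfortably invertible
u, v = rng.standard_normal(p), rng.standard_normal(p)
updated = sherman_morrison(np.linalg.inv(W), u, v)
err = np.abs(updated - np.linalg.inv(W + np.outer(u, v))).max()
```

As the text notes, repeated updates of this kind accumulate rounding errors in practice, which is why periodic refactorization from scratch is still needed.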
In the case of a full-space method with p = n, such as [18], these costs become O(n^3) for the factorization (or O(n^2) if Sherman-Morrison is used) plus O(mn^2) for the back-solves. When n grows large, this linear algebra cost rapidly dominates the total runtime of these algorithms and limits the efficiency of full-space methods. This issue is discussed in more detail, with numerical results, in [65, Chapter 7.2].
In light of this discussion, we now turn our attention to building an implementation of RSDFO-GN that has both strong performance (in terms of objective evaluations) and low linear algebra cost.

DFBGN: An Efficient Implementation of RSDFO-GN
An important tenet of DFO is that objective evaluations are often expensive, and so algorithms should be efficient in reusing information, hence limiting the total objective evaluations required to achieve a given decrease. However, because we require our model to sit within our active space Y_k, we do not have a natural process by which to reuse evaluations between iterations, when the space changes. We dedicate this section to outlining an implementation of RSDFO-GN, which we call DFBGN (Derivative-Free Block Gauss-Newton). DFBGN is designed to be efficient in its objective queries while only building low-dimensional models, and hence is also efficient in terms of linear algebra. Specifically, we design DFBGN to achieve two aims:
• Low computational cost: we want our implementation to have a per-iteration linear algebra cost which is linear in the ambient dimension;
• Efficient use of objective evaluations: our implementation should follow the principles of other DFO methods and make progress with few objective evaluations. In particular, we hope that, when run with 'full-space models' (i.e. p = n), our implementation has (close to) state-of-the-art performance.
We will assess the second point in Section 5 by comparison with DFO-LS [14], an open-source model-based DFO Gauss-Newton solver which explores the full space (i.e. p = n).
Remark 4.1.As discussed in [14], DFO-LS has a mechanism to build a model with fewer than n + 1 interpolation points.However, in that context we modify the model so that it varies over the whole space R n , which enables the interpolation set to grow to the usual n + 1 points and yield a full-dimensional model.There, the goal is to make progress with very few evaluations, but here our goal is scalability, so we keep our model low-dimensional throughout and instead change the subspace at each iteration.

Subspace Interpolation Models
Similar to Section 3, we assume that, at iteration k, our interpolation set has p + 1 points {x_k, y_1, . . ., y_p} ⊂ R^n with 1 ≤ p ≤ n. However, we assume that these points are already given, and use them to determine the space Y_k (as defined by Q_k). That is, given these points, we compute the QR factorization (4.2) of the matrix W_k^T of interpolation directions y_t − x_k, where Q_k ∈ R^{n×p} has orthonormal columns and R_k ∈ R^{p×p} is upper triangular, and invertible provided W_k^T is full rank, which we guarantee by judicious replacement of interpolation points. This gives us the Q_k that defines Y_k via (2.3) (in this case Q_k has orthonormal columns), and in this way all our interpolation points are in Y_k.
Since each y_t ∈ Y_k, from (4.2) we have y_t = x_k + Q_k ŝ_t, where ŝ_t is the t-th column of R_k. Hence we have Ŵ_k = R_k^T in (3.4), and so m̂_k (3.2) is given by solving (4.3) via forward substitution, since R_k^T is lower triangular. This ultimately gives us our local model m̂_k via (3.5).
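The points-to-subspace pipeline above can be sketched as follows (our own code with hypothetical names, not the DFBGN source): QR-factorize the interpolation directions, then recover Ĵ_k by forward substitution with the lower-triangular R_k^T. For a linear residual the recovered Ĵ_k again equals A Q_k exactly:

```python
import numpy as np

def forward_sub(L, B):
    """Solve L X = B for lower-triangular L by forward substitution."""
    X = np.zeros_like(B, dtype=float)
    for i in range(L.shape[0]):
        X[i] = (B[i] - L[i, :i] @ X[:i]) / L[i, i]
    return X

def subspace_jacobian(rk, Y_dirs, r_vals):
    """Sketch of (4.2)-(4.3): QR-factorize the interpolation directions
    (columns y_t - x_k) to get Q_k, R_k, then solve
    R_k^T Jhat_k^T = [r(y_t) - r(x_k)]_t by forward substitution."""
    Qk, Rk = np.linalg.qr(Y_dirs)            # n x p and p x p
    D = (r_vals - rk[:, None]).T             # p x m right-hand sides
    return Qk, forward_sub(Rk.T, D).T        # Q_k and the m x p Jhat_k

# Check on a linear residual r(x) = Ax - b, where Jhat_k = A Q_k exactly.
rng = np.random.default_rng(4)
n, p, m = 30, 5, 40
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)
xk = rng.standard_normal(n)
Y_dirs = rng.standard_normal((n, p))
r_vals = np.column_stack([A @ (xk + Y_dirs[:, t]) - b for t in range(p)])
Qk, Jhat = subspace_jacobian(A @ xk - b, Y_dirs, r_vals)
jac_err = np.abs(Jhat - A @ Qk).max()
```

Note the contrast with RSDFO-GN: here the interpolation points determine Q_k, rather than the other way around.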
We reiterate that, compared to RSDFO-GN, we have used the interpolation set Y_k to determine both Q_k and m̂_k, rather than first sampling Q_k and then finding interpolation points Y_k ⊂ Y_k with which to construct m̂_k. This difference is crucial in allowing the reuse of interpolation points between iterations, and hence lowering the objective evaluation requirements of model construction.
Remark 4.2.As discussed in [65,Chapter 7.3], we can equivalently recover this construction by asking for a full-space model m k : R n → R m given by m k (s) = r(x k ) + J k s such that the interpolation conditions m k (y t − x k ) = r(y t ) are satisfied and J k has minimal Frobenius norm.

Complete DFBGN Algorithm
A complete statement of DFBGN is given in Algorithm 3. Compared to RSDFO-GN, we include specific steps to manage the interpolation set, which in turn dictates the choice of subspace Y_k. Specifically, one issue with our approach is that our new iterate x_k + s_k is in Y_k, so if we were to simply add x_k + s_k into the interpolation set, Y_k would not change across iterations, and we would never explore the whole space. On the other hand, unlike RSDFO and RSDFO-GN, we do not want to completely resample Q_k, as this would require too many objective evaluations. Instead, in DFBGN we delete a subset of points from the interpolation set and add new directions orthogonal to the existing directions, which ensures that Q_{k+1} ≠ Q_k in every iteration. We also note that DFBGN does not include some important algorithmic features present in RSDFO-GN, DFO-LS or other model-based DFO methods, and hence is quite simple to state. These features are not necessary for a variety of reasons, which we now outline.
No Criticality and Safety Steps Compared to RSDFO-GN, the implementation of DFBGN does not include criticality steps (which is also the case in DFO-LS) or safety steps. These steps ultimately function to ensure we do not have ‖ĝ_k‖ ≪ ∆_k. In DFBGN, we ensure ∆_k does not get too large compared to ‖s_k‖ through (2.8), while ‖s_k‖ is linked to ‖ĝ_k‖ through Lemma 2.5. If ‖s_k‖ is much smaller than ∆_k and our step produces a poor objective decrease, then we will set ∆_{k+1} ← ‖s_k‖ for the next iteration. Although Lemma 2.5 allows ‖s_k‖ to be large even if ‖ĝ_k‖ is small, in practice we do not observe ∆_k ≫ ‖ĝ_k‖ without DFBGN setting ∆_{k+1} ← ‖s_k‖ after a small number of iterations.
No Model-Improving Steps An important feature of model-based DFO methods is model-improving procedures, which change the interpolation set to ensure Λ-poisedness (Definition 3.1), or equivalently ensure that the local model for f is fully linear. In RSDFO-GN, for instance, model improvement is performed when CHECK_MODEL = TRUE, whereas in [26, Algorithm 10.1] there are dedicated model-improvement phases.
Instead, DFBGN ensures accurate interpolation models via a geometry-aware (i.e. Λ-poisedness aware) process for deleting interpolation points at each iteration, where they are replaced by new points in directions (from x_{k+1}) which are orthogonal to Q_k and selected at random. The processes for deleting interpolation points, and for choosing a suitable number of points to remove, p_drop, at each iteration are considered in Sections 4.3 and 4.4 respectively. The process for generating new interpolation points, Algorithm 5, is outlined in Section 4.3.

Algorithm 3 DFBGN: Derivative-Free Block Gauss-Newton for solving (3.1). (Selected steps: approximately solve the subspace trust-region subproblem (2.5) in R^p and calculate the step; set Y_{k+1} = Y_{k+1}^init ∪ {x_{k+1} + ∆_{k+1} d_1, . . ., x_{k+1} + ∆_{k+1} d_q}.)
A downside of our approach is that the new orthogonal directions are not chosen by minimizing a model of the objective (i.e. not attempting to reduce the objective), as we have no information about how the objective varies outside Y_k. This is the fundamental trade-off between a subspace approach and standard methods (such as DFO-LS): we can reduce the linear algebra cost, but must spend objective evaluations to change the search space between iterations.
Linear Algebra Cost of DFBGN As in Section 3.3, our approach in DFBGN yields substantial reductions in the required linear algebra costs compared to DFO-LS:
• Model construction costs O(np^2) for the factorization (4.2) and O(mp^2) for the back-substitution solves (4.3), rather than O(n^3) and O(mn^2) respectively for DFO-LS; and
• Lagrange polynomial construction costs O(p^3) rather than O(n^3).
As well as these reductions, we also get a smaller trust-region subproblem (2.5) (in R^p rather than R^n) and smaller memory requirements for storing the model Jacobian: we only store Ĵ_k and Q_k, requiring O((m + n)p) memory rather than O(mn) for storing the full m × n Jacobian. However, in (2.5), we do have the extra cost of projecting ŝ_k ∈ R^p into the full space R^n, which requires a multiplication by Q_k, costing O(np). In addition to the reduced linear algebra costs, the smaller interpolation set means we have a lower evaluation cost to construct the initial model: p + 1 evaluations rather than n + 1.
No particular choice of p is needed for this method, and anything from p = 1 (i.e. coordinate search) to p = n (i.e. full space search) is allowed. However, unsurprisingly, we shall see that larger values of p give better performance in terms of evaluations, except for the very low-budget phase, where smaller values of p benefit from a lower initialization cost. Hence, we expect that our approach with small p is useful when the O(mn^2 + n^3) per-iteration linear algebra cost of DFO-LS is too great, and reducing the linear algebra cost is worth (possibly) needing more objective evaluations to achieve a given accuracy. As a result, p should in general be set as large as possible, given the linear algebra costs the user is willing to bear.

Table 1. Comparison of per-iteration linear algebra costs of DFO-LS and DFBGN (Algorithm 3) with subspace dimension p ∈ {1, . . ., n}. *Note that the trust-region subproblem is solved using a truncated CG method [25, Chapter 7.5.1], originally from [63], in both DFO-LS and DFBGN.
In Table 1, we compare the linear algebra costs of DFO-LS and DFBGN. The overall per-iteration cost of DFO-LS is O(mn^2 + n^3) and the cost of DFBGN is O(mp^2 + np^2 + p^3), depending on the choice of p ∈ {1, . . ., n}. The key benefit is that our dependency on the underlying problem dimension n decreases from cubic in DFO-LS to linear in DFBGN (provided p ≪ n). We also note that both methods have linear cost in the number of residuals m, but with a factor that is significantly smaller in DFBGN than in DFO-LS: O(p^2) compared to O(n^2).

Remark 4.3. In every iteration we must compute the QR factorization (4.2). However, we note, similar to [18, Section 4.2], that adding, removing and changing interpolation points all induce simple changes to W_k^T (adding or removing columns, and low-rank updates). This means that (4.2) can be computed with cost O(np) per iteration using the updating methods in [37, Section 12.5]. In our implementation, however, we do not do this, as we find that these updates introduce errors that accumulate at every iteration and reduce the accuracy of the resulting interpolation models. To maintain the numerical performance of our method, we need to recompute (4.2) from scratch regularly (e.g. every 10 iterations), and so would not see the O(np) per-iteration cost on average.

Remark 4.4. The default parameter choices for DFBGN are the same as DFO-LS, namely: ∆_max = 10^10, ∆_end = 10^{-8}, γ_dec = 0.5, γ_inc = 2, γ̄_inc = 4, η_1 = 0.1, and η_2 = 0.7. DFBGN also uses the same default choice ∆_0 = 0.1 max(‖x_0‖_∞, 1). The default choice of p_drop is discussed in Section 4.4.
Adaptive Choice of p One approach that we have considered is to allow p to vary between iterations of DFBGN, rather than being constant throughout. Instead of adding p_drop new points at the end of each iteration (line 15), we implement a variable p by adding at least one new point to the interpolation set, continuing until some criterion is met. This criterion is designed to keep p small when such a p allows us to make reasonable progress, but to grow p up to p ≈ n when necessary.
We have tested several possible criteria, comparing some combination of model gradient and Hessian, trust-region radius, trust-region step length, and predicted decrease from the trust-region step, and found the most effective to be comparing the model gradient and Hessian with the trust-region radius. Specifically, we continue adding new directions until a condition of this form holds (c.f. Lemma 2.10 and [18, Lemma 3.22]) for some α > 0 (we use α = 0.2(n − p)/n for an interpolation set with p + 1 points). However, our numerical testing has shown that DFBGN with p fixed outperforms this approach for all budget and accuracy levels, on both medium- and large-scale problems, and so we do not consider it further here. We defer further study of this approach to future work, to see if alternative adaptive choices for p can be beneficial.

Algorithm 4 Mechanism for removing points from the interpolation set in DFBGN.

Interpolation Set Management
We now provide more details about how we manage the interpolation set in DFBGN. Specifically, we discuss how points are chosen for removal from Y_k, and how new interpolation points are calculated.

Geometry Management
In the description of DFBGN, there are no explicit mechanisms to ensure that the interpolation set is well-poised. DFBGN ensures that the interpolation set has good geometry through two mechanisms:
• We use a geometry-aware mechanism for removing points, based on [63,18], which requires the computation of Lagrange polynomials. This mechanism is given in Algorithm 4, and is called in lines 10 and 13 of DFBGN, as well as to select a point to replace in line 12; and
• Adding new directions that are orthogonal to existing directions, and of length ∆_k, means adding these new points never causes the interpolation set to have poor poisedness.
Together, these two mechanisms mean that any points causing poor poisedness are quickly removed and replaced by high-quality interpolation points (orthogonal to existing directions, and within distance ∆_k of the current iterate). The linear algebra cost of Algorithm 4 is O(p^3) to compute the p Lagrange polynomials with cost O(p^2) each (since we already have a factorization of W_k^T). Then for each t we must evaluate θ_t (4.5), with cost O(p) to maximize ℓ̂_t (since ℓ̂_t is linear and varies only in directions in Y_k), and O(n) to calculate ‖y_t − x_{k+1}‖. This gives a total cost of O(p^3 + np).

Alternative Point Removal Mechanism Instead of Algorithm 4, we could have used a simpler mechanism for removing points, such as removing the points furthest from the current iterate (with total cost O(np)). However, this leads to a substantial performance penalty. In Figure 1, we compare these two approaches for selecting points to be removed, namely Algorithm 4 and distance to x_{k+1}, on the (CR) test set with p = n/10 and p = n (using the default value of p_drop, as detailed in Section 4.4). For more details on the numerical testing framework, see Section 5.1 below. We see that the geometry-aware criterion (4.5) gives substantially better performance than the cheaper criterion.

Figure 1. Performance profiles (in evaluations) for DFBGN when p = n/10 and p = n, comparing removing points with the geometry-aware Algorithm 4 and without Lagrange polynomials (by distance to the current iterate). We use accuracy level τ = 10^{-5}, and results are an average of 10 runs, each with a budget of 100(n + 1) evaluations. The problem collection is (CR). See Section 5.1 for details on the testing framework.
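A hypothetical scoring rule in the spirit of Algorithm 4 can be sketched as follows. This is our own illustration: the exact criterion θ_t from (4.5) is not reproduced, and the fourth-power distance weighting is an assumption of ours, chosen only to show how Lagrange-polynomial size and distance can be combined:

```python
import numpy as np

def points_to_drop(S_hat, dists, Delta, n_drop):
    """Hedged sketch of geometry-aware point removal (spirit of Algorithm 4).
    Columns of S_hat are interpolation steps s_t; dists[t] = ||y_t - x_{k+1}||.
    For linear models, ell_t(s) = e_t^T S_hat^{-1} s, so its maximum absolute
    value over B(0, Delta) is Delta * ||row t of S_hat^{-1}||. Points with
    large Lagrange values (poor geometry) and/or far from the iterate score
    worst and are dropped first."""
    Sinv = np.linalg.inv(S_hat)
    lagrange_max = Delta * np.linalg.norm(Sinv, axis=1)
    # Hypothetical weighting: penalize points outside the trust region.
    score = lagrange_max * np.maximum(np.asarray(dists) / Delta, 1.0) ** 4
    return list(np.argsort(score)[::-1][:n_drop])

# With equal geometry, the far-away point is removed first.
drop = points_to_drop(np.eye(3), dists=[1.0, 1.0, 5.0], Delta=1.0, n_drop=1)
```

Replacing the score with pure distance recovers the cheaper mechanism that Figure 1 shows to perform substantially worse.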
Algorithm 5 Mechanism for generating new directions in DFBGN.
3: Perform the QR factorization Q̃R̃ = Ã and return d_1, . . ., d_q as the columns of Q̃.

Generation of New Directions
We now detail how the new directions d_1, . . ., d_q are created in line 15 of DFBGN (Algorithm 3). The same approach is suitable for generating the initial directions s_1, . . ., s_p in line 1 of DFBGN, using Ã = A below (i.e. no Q required).
Suppose our current subspace is defined by the orthonormal columns of Q ∈ R^{n×p_1}, and we wish to generate q new orthonormal vectors that are also orthogonal to the columns of Q (with p_1 + q ≤ n). When called in line 15 of DFBGN, we will have p_1 = p − p_drop and q = p_drop. We use the approach in Algorithm 5. From the QR factorization, the columns of Q̃ are orthonormal, and if Ã is full rank (which occurs with probability 1; see Lemma 4.5 below) then we also have col(Q̃) = col(Ã). So, to confirm the columns of Q̃ are orthogonal to Q, we only need to check that the columns of Ã are orthogonal to Q. Let ã_i be the i-th column of Ã and q_j be the j-th column of Q. Then, if a_i is the i-th column of A, we have ã_i = a_i − QQ^T a_i, and so q_j^T ã_i = q_j^T a_i − q_j^T a_i = 0, as required. The cost of Algorithm 5 is O(nq) to generate A, O(np_1 q) to form Ã, and O(nq^2) for the QR factorization. Since p_1, q ≤ p (p_1 is the number of directions remaining in the interpolation set and q is the number of new directions to be added), the whole process has cost at most O(np^2). This bound is tight, up to constant factors, as we could take p_1 = q = p/2, for instance.

Lemma 4.5. With probability 1, the matrix Ã in Algorithm 5 has full column rank.

Proof. Let a_i and ã_i be the i-th columns of A and Ã respectively. From [29, Proposition 7.1], A has full column rank with probability 1, and each a_i ∉ col(Q) with probability 1. Now suppose we have constants c_1, . . ., c_q so that Σ_{i=1}^q c_i ã_i = 0. Then since ã_i = a_i − QQ^T a_i, we have Σ_{i=1}^q c_i a_i = QQ^T (Σ_{i=1}^q c_i a_i) (4.7). The right-hand side is in col(Q), so since a_i ∉ col(Q), we must have Σ_{i=1}^q c_i a_i = 0. Thus c_1 = · · · = c_q = 0 since A has full column rank, and so Ã has full column rank.
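Algorithm 5 is short enough to sketch directly (our own code, with hypothetical helper names): draw a Gaussian matrix, project out the existing subspace, and QR-factorize the result:

```python
import numpy as np

def new_orthogonal_directions(Q, q, rng):
    """Sketch of Algorithm 5: draw a Gaussian A, project out col(Q)
    (A_tilde = A - Q Q^T A), then QR-factorize A_tilde; the q columns of
    the resulting Q_tilde are orthonormal and orthogonal to col(Q)."""
    A = rng.standard_normal((Q.shape[0], q))   # O(nq)
    A_tilde = A - Q @ (Q.T @ A)                # O(n p_1 q)
    Q_tilde, _ = np.linalg.qr(A_tilde)         # O(n q^2)
    return Q_tilde

rng = np.random.default_rng(5)
n, p1, q = 50, 10, 5
Q, _ = np.linalg.qr(rng.standard_normal((n, p1)))
D = new_orthogonal_directions(Q, q, rng)
orth_new = np.abs(D.T @ D - np.eye(q)).max()   # new directions orthonormal
orth_old = np.abs(Q.T @ D).max()               # and orthogonal to col(Q)
```

Both residuals are at rounding-error level, matching the orthogonality argument and Lemma 4.5 above.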

Selecting an Appropriate Value of p drop
An important component of DFBGN that we have not yet specified is how many points to remove from the interpolation set at each iteration, p drop ∈ {1, . . ., p}.
On one hand, a large p_drop enables us to change the subspace by a large amount between iterations, ensuring we explore the whole of R^n quickly, rather than searching in unproductive subspaces for many iterations. On the other hand, a small p_drop means we require few objective evaluations per iteration, and so are more likely to use our evaluation budget efficiently. We consider two choices of p_drop, aimed at each of these possible benefits: p_drop = p/10 to change subspaces quickly, and p_drop = 1 (the minimum possible value) to use few objective evaluations.
Another approach that we consider is a compromise between these two choices. We note that p_drop = 1 is useful to make progress with few evaluations, so we use this value while we are making progress, which we consider to occur on successful iterations (i.e. ρ_k ≥ η_1). When we are not progressing (i.e. unsuccessful steps with ρ_k < η_1), we use the larger value p_drop = p/10.
We compare these three approaches on the (CR) problem collection with a budget of 100(n + 1) evaluations in Figure 2. Since these different choices of p_drop are similar when p is small, we show results for subspace dimensions p = n/2 and p = n. We first see that, even with p = n/2, the three approaches all perform similarly. However, for p = n the compromise choice p_drop ∈ {1, p/10} performs better than the two constant choices. In addition, p_drop = 1 outperforms p_drop = p/10 for small performance ratios, but is less robust and solves fewer problems overall.
Given these results, in DFBGN we use the compromise choice as the default mechanism: p drop = 1 on successful iterations and p drop = p/10 on unsuccessful iterations.
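The default rule is simple enough to state as code (a sketch of ours; how the implementation rounds p/10 to an integer is our assumption, shown here as floor with a minimum of one):

```python
def choose_p_drop(p, rho, eta1):
    """DFBGN's default compromise (Section 4.4): drop one point on
    successful iterations (rho >= eta_1), and p/10 points otherwise.
    Rounding p/10 to an integer (floor, at least 1) is our assumption."""
    return 1 if rho >= eta1 else max(1, p // 10)
```

For example, with p = 100 and η_1 = 0.1, a successful step (ρ_k = 0.2) drops one point, while an unsuccessful step (ρ_k = 0.05) drops ten.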

Relationship to model-improvement phases
The CHECK_MODEL flag in RSDFO-GN is important for ensuring we do not reduce ∆_k too quickly without first ensuring the quality of the interpolation model. For a similar purpose, DFO-LS incorporates a second trust-region radius which is also involved in ensuring ∆_k does not decrease too quickly [14]. In DFBGN, on unsuccessful iterations, when our model is performing poorly, our interpolation set is changed quickly. This results in DFBGN recovering a high-quality model after a smaller decrease in ∆_k. To demonstrate this, in Figure 4 we show the results of DFBGN with this p_drop for the same problems as in Figure 3 above. In both cases, we still see oscillations in ∆_k, but their magnitude is substantially reduced: it takes fewer iterations to get successful steps, and ∆_k stays well above ∆_end. This leads to both problems being solved to high accuracy.
In Figure 4, we also see that, as we approach the solution, ‖ĝ_k‖ and ∆_k decrease at the same rate, as we would hope. For drcavty1 after iteration 150, we also see the phenomenon described above, where ∆_k can become much larger than ‖ĝ_k‖ due to many successful iterations, before an unsuccessful iteration with ‖s_k‖ small means that ∆_k returns to the level of ‖ĝ_k‖ regularly.
Alternative Mechanism for Recovering High-Quality Models An alternative approach for avoiding unnecessary decreases in ∆ k while the interpolation model quality is improved is to simply decrease ∆ k more slowly on unsuccessful iterations.This corresponds to setting γ dec to be closer to 1, which is the default choice of DFO-LS for noisy problems (see [14,Section 3.1]), and aligns with our theoretical requirements on the trust-region parameters (Theorem 2.18).
In Figure 5, we compare the DFBGN default choices, of p drop ∈ {1, p/10} and γ dec = 0.5, with p drop = 1 and γ dec ∈ {0.5, 0.98} on the (CR) problem collection.For small values of p (where the different choices of p drop are essentially identical), the choice of γ dec has almost no impact on the performance of DFBGN.For larger values of p, using γ dec = 0.98 with p drop = 1 performs comparably well to the DFBGN default (γ dec = 0.5 with p drop ∈ {1, p/10}).However, we opt for keeping γ dec = 0.5, to allow us to use the larger value for noisy problems (just as in DFO-LS), and to reduce the risk of overfitting our trust-region parameters to a particular problem collection.

Numerical Results
In this section we compare the performance of DFBGN (Algorithm 3) to that of DFO-LS. We note that DFO-LS has been shown to have state-of-the-art performance compared to other solvers in [14]. As described in Section 4.2, the implementation of DFBGN is based on the decision to reduce the linear algebra cost of the algorithm at the expense of more objective evaluations per iteration. However, we still maintain the goal of DFBGN achieving (close to) state-of-the-art performance when it is run as a 'full space' method (i.e. p = n). Here, we will investigate this trade-off in practice.

Testing Framework
In our testing, we will compare a Python implementation of DFBGN (Algorithm 3) against DFO-LS version 1.0.2 (also implemented in Python). The implementation of DFBGN is available on Github. We will consider both the standard version of DFO-LS, and one where we use a reduced initialization cost of n/100 evaluations (c.f. Remark 4.1). This will allow us to compare both the overall performance of DFBGN and its performance with small budgets (since DFBGN also has a reduced initialization cost of p + 1 evaluations). We compare these against DFBGN with the choices p ∈ {n/100, n/10, n/2, n} and the adaptive choice of p_drop ∈ {1, p/10} (Section 4.4). All default settings are used for both solvers, and since both are randomized (DFO-LS uses random initial directions only, and DFBGN is randomized through Algorithm 5), we run 10 instances of each problem under all solver configurations.
Test Problems We will consider two collections of nonlinear least-squares test problems, both taken from the CUTEst collection [38]. The first, denoted (CR), is a collection of 60 medium-scale problems (with 25 ≤ n ≤ 110 and n ≤ m ≤ 400). Full details of the (CR) collection may be found in [18, Table 3]. The second, denoted (CR-large), is a collection of 28 large-scale problems (with 1000 ≤ n ≤ 5000 and n ≤ m ≤ 9998). This collection is a subset of problems from (CR), with their dimension increased substantially. Full details of the (CR-large) collection are given in Appendix B. Note that the 12 hour runtime limit (described below) was only relevant for (CR-large).
Measuring Solver Performance For every problem, we allow all solvers a budget of at most 100(n + 1) objective evaluations (i.e. evaluations of the full vector r(x)). This dimension-dependent choice may be understood as equivalent to 100 evaluations of r(x) and the Jacobian J(x) via finite differencing. However, given the importance of linear algebra cost to our comparisons, we also allow each solver a maximum runtime of 12 hours for each instance of each problem. For each solver S, each problem instance P, and accuracy level τ ∈ (0, 1), we calculate

N(S, P, τ) := # evaluations of r(x) required to find a point x with f(x) ≤ f(x*) + τ(f(x_0) − f(x*)),   (5.1)

where f(x*) is an estimate of the minimum of f as listed in [18, Table 3] for (CR) and Appendix B for (CR-large). If this objective decrease is not achieved by a solver before its budget or runtime limit is hit, we set N(S, P, τ) = ∞. We then compare solver performance on a problem collection P by plotting either data profiles [55]

d_{S,τ}(α) := |{P ∈ P : N(S, P, τ) ≤ α(n_P + 1)}| / |P|,

where n_P is the dimension of problem instance P and α ∈ [0, 100] is an evaluation budget (in "gradients", i.e. multiples of n_P + 1), or performance profiles [28]

π_{S,τ}(α) := |{P ∈ P : N(S, P, τ) ≤ α N_min(P, τ)}| / |P|,

where N_min(P, τ) is the minimum value of N(S, P, τ) over all solvers S, and α ≥ 1 is a performance ratio. In some instances, we will plot profiles based on runtime rather than objective evaluations; for this, we simply replace "number of evaluations of r(x)" with "runtime" in (5.1). When we plot the objective reduction achieved by a given solver, we normalize the objective value to lie in [0, 1], which corresponds to the best τ achieved in (5.1) after a given number of evaluations (again measured in "gradients") or runtime.
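The profile definitions above can be made concrete in a short sketch. The following is a minimal illustration, not the authors' benchmarking harness: the solver names, problem names, dimensions, and evaluation counts are all invented for the example, and unsolved instances are recorded as infinity, as in the text.

```python
import math

# Illustrative evaluation counts N(S, P, tau) at one accuracy level tau:
# N[solver][problem] = evaluations of r(x) needed to satisfy (5.1),
# or math.inf if the budget/runtime limit was hit first.
N = {
    "solver_A": {"prob1": 120, "prob2": math.inf, "prob3": 300},
    "solver_B": {"prob1": 180, "prob2": 250, "prob3": math.inf},
}
dims = {"prob1": 50, "prob2": 100, "prob3": 80}  # problem dimensions n_P


def data_profile(N_solver, dims, alpha):
    """Fraction of problems solved within alpha * (n_P + 1) evaluations,
    i.e. a budget of alpha 'gradients'."""
    solved = sum(1 for p, n_eval in N_solver.items()
                 if n_eval <= alpha * (dims[p] + 1))
    return solved / len(N_solver)


def performance_profile(N, solver, alpha):
    """Fraction of problems the given solver solves within a factor alpha
    of the best evaluation count N_min(P, tau) over all solvers."""
    count = 0
    for p in N[solver]:
        n_min = min(N[s][p] for s in N)  # best solver's count on problem p
        if N[solver][p] <= alpha * n_min:
            count += 1
    return count / len(N[solver])


print(data_profile(N["solver_A"], dims, 5))      # budget of 5 "gradients"
print(performance_profile(N, "solver_A", 2.0))   # performance ratio alpha = 2
```

Sweeping `alpha` over a grid and plotting the resulting fractions reproduces the shape of the data and performance profiles used in the figures below.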

Results Based on Evaluations
We begin our comparisons by considering the performance of DFO-LS and DFBGN when measured in terms of evaluations.

Medium-Scale Problems (CR) First, in Figure 6, we show the results for different accuracy levels for the (CR) problem collection (with n ≈ 100). For the lowest accuracy level τ = 0.5, DFO-LS with reduced initialization cost is the best-performing solver, followed by DFBGN with p = n/2. These correspond to methods with lower initialization costs than DFO-LS and DFBGN with p = n, so this is likely a large driver behind their performance. DFBGN with full space size p = n performs similarly to DFO-LS, and DFBGN with p = n/10 and p = n/100 perform worst (as they are optimizing in a very small subspace at each iteration).
However, as we look at higher accuracy levels, we see that DFO-LS (with and without reduced initialization cost) performs best, and the DFBGN methods perform worse. The performance gap is more noticeable for small values of p. As expected, this means that DFBGN requires more evaluations to achieve these levels of accuracy, and benefits from being allowed to use a larger p. Notably, DFBGN with p = n has only a slight performance loss compared to DFO-LS, even though it uses p/10 evaluations on unsuccessful iterations (rather than 1-2 for DFO-LS); this indicates that our choice of p_drop provides a suitable compromise between solver robustness and evaluation efficiency.
Large-Scale Problems (CR-large) Next, in Figure 7, we show the same plots but for the (CR-large) problem collection, with n ≈ 1000. Compared to Figure 6, the situation is quite different.
At the lowest accuracy level, τ = 0.5, DFBGN with a small subspace dimension (p = n/10 and p = n/100) gives the best-performing solvers, followed by the full-space solvers (DFO-LS and DFBGN with p = n). For higher accuracy levels, the performance of DFBGN with small p deteriorates compared with the full-space methods. DFBGN with p = n/2 is the worst-performing DFBGN variant at low accuracy levels, and performs similarly to DFBGN with small p at high accuracy levels. DFO-LS with reduced initialization cost is the worst-performing solver for this dataset. Unlike the medium-scale results above, we no longer have a clear trend in the performance of DFBGN as we vary p. Instead, we have a combination of two factors coming into play, which have opposite impacts on the performance of DFBGN as we vary p. On one hand, we have the number of evaluations required for DFBGN (with a given p) to reach the desired accuracy level. On the other hand, we have the number of iterations that DFBGN can perform before reaching the 12 hour runtime limit.
DFBGN with small p requires more evaluations to reach a given level of accuracy (as seen with the medium-scale results), but can perform many evaluations before timing out due to its low per-iteration linear algebra cost. This is reflected in it solving many problems to low accuracy, but few problems to high accuracy. By contrast, DFBGN with p = n can perform fewer iterations before timing out (and hence sees fewer evaluations), but requires many fewer evaluations to solve problems, particularly for high accuracy. This manifests in its good performance at both low and high accuracy levels. The middle ground, DFBGN with p = n/2, has its performance negatively impacted by both issues: it requires more evaluations to solve problems than p = n (especially for high accuracy), but also has a relatively high per-iteration linear algebra cost, and so times out more often than small p.
Both variants of DFO-LS show worse performance here than for the medium-scale problems. This is because, as suggested by the analysis in Table 1, they are both affected by the runtime limit. DFO-LS with reduced initialization cost is particularly affected, because of the high cost of the SVD (of the full m × n Jacobian) at each iteration for these problems. We note that this cost is only noticeable for these large-scale problems, and this variant of DFO-LS is still useful for small- and medium-scale problems, as discussed in [14].
We can verify the impact of the timeout on DFO-LS and DFBGN by considering the proportion of problem instances from (CR-large) for which the solver terminated because of the timeout. These results are presented in Table 2. DFO-LS reaches the 12 hour maximum much more frequently than DFBGN: over 90%, compared with 35% for DFBGN with p = n/100 and 66% for DFBGN with p = n (see Remark 5.1 below). For DFBGN with different values of p, we see the same behaviour as in Figure 7. That is, DFBGN with small p times out the least frequently, as its low per-iteration runtime means it performs enough iterations to terminate naturally. DFBGN with p = n times out more frequently (due to the high per-iteration runtime), but not as often as with p = n/2, as its superior budget performance at high accuracy levels means it fully solves more problems, even with comparatively fewer iterations. We note that Table 2 does not measure what accuracy level was achieved before the timeout, which is better captured in the performance profiles of Figure 7.
Remark 5.1. DFBGN with p = n has a similar per-iteration linear algebra cost to DFO-LS, and hence can perform a similar number of iterations before reaching the runtime limit. However, DFBGN performs more objective evaluations per iteration, because of the choice of p_drop. Since DFBGN with p = n has a similar performance to DFO-LS when measured on budget, it exhausts its evaluation budget in fewer iterations, and so terminates naturally (rather than timing out) more often than DFO-LS.

To further see the impact of this issue, we now consider how the solvers perform on a variable-dimension test problem as we increase the underlying dimension. We run each solver, with the same settings as above, on the CUTEst problem arwhdne for different choices of problem dimension n. In Figure 9 we plot the objective reduction for each solver against budget and runtime for DFO-LS and DFBGN, showing n = 100, n = 1000 and n = 2000. We see that, when measured on evaluations, both DFO-LS variants achieve the fastest objective reduction, and that DFBGN with small p achieves the slowest objective reduction. This is in line with our results from Section 5.2. However, when we instead consider objective decrease against runtime, we see that DFBGN with small p gives the fastest decrease: the larger number of iterations needed by these variants (as seen by the larger number of evaluations) is offset by the substantially reduced per-iteration linear algebra cost. When viewed against runtime, both DFO-LS variants can only achieve a small objective decrease in the allowed 12 hours, even though they show fast decrease against budget, and would achieve higher accuracy than DFBGN if the linear algebra cost were negligible.

Results for Small Budgets
Another benefit of DFBGN is its small initialization cost of p + 1 evaluations. When n is large, a user is more likely to be limited to a budget of fewer than n evaluations. Here, we examine how DFBGN compares to DFO-LS with reduced initialization cost for small budgets.
We recall from Remark 4.1 that DFO-LS with reduced initialization cost progressively increases the dimension of the subspace of its interpolation model until it reaches the whole space R^n (after approximately n + 1 evaluations), whereas DFBGN restricts the dimension at all iterations.
In Figure 10 we consider three variable-dimensional CUTEst problems from (CR) and (CR-large), all using n = 1000 and n = 2000. We show the objective decrease against budget for 10 runs of each solver, restricted to a maximum of n + 1 evaluations. We see that the smaller the p used in DFBGN, the faster DFBGN is able to make progress (due to the lower number of initial evaluations). However, this is offset by the faster objective decrease achieved by larger p values (after the higher initialization cost): if the user can afford a larger p, both in terms of linear algebra and initial evaluations, then this is usually a better option. An exception is the problem vardimne, whose simple structure means DFBGN with small p solves the problem to very high accuracy with very few evaluations, substantially outperforming both DFBGN with larger p and DFO-LS with reduced initialization cost.
In Figure 10 we also show the decrease for DFO-LS with full initialization cost and DFBGN with p = n, but these use the full budget on initialization, and so make no progress. In addition, we show DFO-LS with a reduced initialization cost of n/100 evaluations. This variant performs well, in most cases initially matching the decrease of DFBGN with p = n/100, but achieving a faster objective reduction against budget; this matches our previous observations. However, the extra linear algebra cost means that DFO-LS with reduced initialization does not end up using the full budget, instead hitting the 12 hour timeout. This is clearest when comparing the results for n = 1000 with n = 2000: DFO-LS with reduced initialization cost begins by achieving a similar decrease in both cases, but hits the timeout sooner with n = 2000, and so terminates after fewer objective evaluations (with a correspondingly smaller objective decrease).
We analyze this more systematically in Figure 11, where we show data profiles (measured on budget) of DFBGN and DFO-LS on the (CR-large) problem collection, for low accuracy and small budgets. These results confirm our conclusions: DFBGN with small p can make progress on many problems with a very small budget (fewer than n + 1 evaluations), and outperforms DFO-LS with reduced initialization cost, whose slow runtime is a limiting factor. However, once the budget exceeds n + 1 evaluations, DFO-LS and DFBGN with p = n become the best-performing solvers (when measuring on evaluations only). They are also able to achieve a higher level of accuracy than DFBGN with small p.

Conclusion
DFBGN is able to make progress with few objective evaluations (an important consideration for DFO techniques). A Python version of DFBGN is available on Github. For medium-scale problems, DFBGN operating in the full ambient space (p = n) has similar performance to DFO-LS [14] when measured by objective evaluations, validating the techniques introduced in the practical implementation. However, DFBGN (with any choice of subspace dimension) has substantially faster runtime, which makes it much more effective than DFO-LS at solving large-scale problems from CUTEst, even when working in a very low-dimensional subspace. Further, in the case of expensive objective evaluations, working in a subspace means that DFBGN can make progress with very few evaluations, many fewer than the n + 1 needed for standard methods to build their initial model. Overall, the implementation of DFBGN is suitable for large-scale problems both when objective evaluations are cheap (and linear algebra costs dominate) and when evaluations are expensive (and the initialization cost of standard methods is impractical).
Future work will focus on extending the ideas from the DFBGN implementation to the case of general objectives with quadratic models. This will bring the available software in line with the theoretical guarantees for RSDFO. We note that model-based DFO for nonlinear least-squares problems has been adapted to include sketching methods, which use randomization to reduce the number of residuals considered at each iteration [13]. We also defer to future work the development of techniques for nonlinear least-squares problems which combine sketching (i.e. dimensionality reduction in the observation space) with our subspace approach (i.e. dimensionality reduction in variable space), and further study of methods for adaptively selecting a subspace dimension (cf. Section 4.2).
Figure 1. Performance profiles (in evaluations) for DFBGN with p = n/10 and p = n, comparing removal of interpolation points using the geometry-aware Algorithm 4 against removal without Lagrange polynomials (by distance to the current iterate). We use accuracy level τ = 10^-5, and results are an average of 10 runs, each with a budget of 100(n + 1) evaluations. The problem collection is (CR). See Section 5.1 for details of the testing framework.

Lemma 4.5. The matrix A has full column rank with probability 1.
Figure. Performance profiles (in evaluations) comparing different choices of p_drop, for DFBGN with p = n/2 and p = n, at accuracy level τ = 10^-5. The choice 'p_drop mixed' uses p_drop = 1 for successful iterations and p_drop = p/10 for unsuccessful iterations. Results are an average of 10 runs, each with a budget of 100(n + 1) evaluations. The problem collection is (CR). See Section 5.1 for details of the testing framework.

Figure 5. Performance profiles (in evaluations) comparing different choices of p_drop and γ_dec for DFBGN, at accuracy level τ = 10^-5. The choice 'p_drop mixed' uses p_drop = 1 for successful iterations and p_drop = p/10 for unsuccessful iterations. Results are an average of 10 runs, each with a budget of 100(n + 1) evaluations. The problem collection is (CR). See Section 5.1 for details of the testing framework.

Figure 6. Performance profiles (in evaluations) comparing DFO-LS (with and without reduced initialization cost) with DFBGN (various p choices) for different accuracy levels. Results are an average of 10 runs for each problem, with a budget of 100(n + 1) evaluations and a 12 hour runtime limit per instance. The problem collection is (CR).

Figure 7. Performance profiles (in evaluations) comparing DFO-LS (with and without reduced initialization cost) with DFBGN (various p choices) for different accuracy levels. Results are an average of 10 runs for each problem, with a budget of 100(n + 1) evaluations and a 12 hour runtime limit per instance. The problem collection is (CR-large).

Figure 12. Data profiles (in runtime) comparing DFO-LS (with and without reduced initialization cost) with DFBGN (various p choices) for different accuracy levels and budgets. Results are an average of 10 runs for each problem, with a budget of n + 1 or 2(n + 1) evaluations and a 12 hour runtime limit per instance. The problem collection is (CR-large).

Table 2. Proportion of problem instances from (CR-large) for which each solver terminated on the maximum 12 hour runtime.