1 Introduction

An important class of nonlinear optimization methods is so-called derivative-free optimization (DFO). In DFO, we consider problems where derivatives of the objective (and/or constraints) are not available to be evaluated, and we only have access to function values. This topic has received growing attention in recent years, and is primarily used for objectives which are black-box (so analytic derivatives or algorithmic differentiation are not available), and expensive to evaluate or noisy (so finite differencing is impractical or inaccurate). There are many types of DFO methods, such as model-based, direct and pattern search, implicit filtering and others (see [55] for a recent survey), and these techniques have been used in a variety of applications [1].

Here, we consider model-based DFO methods for unconstrained optimization, which are based on iteratively constructing and minimizing interpolation models for the objective. We also specialize these methods for nonlinear least-squares problems, by constructing interpolation models for each residual term rather than for the full objective [19, 79, 85].

This paper addresses a key question regarding model-based DFO: how to improve the scalability of this class of methods. Existing model-based DFO techniques are primarily designed for small- to medium-scale problems, as the linear algebra cost of each iteration (largely due to the cost of constructing interpolation models) means that their runtime increases rapidly for large problems. There are several settings where scalable DFO algorithms may be useful, such as data assimilation [3, 10], machine learning [39, 71], generating adversarial examples for deep neural networks [2, 75], image analysis [34], and as a possible proxy for global optimization methods [21].

To address this, we introduce RSDFO, a scalable algorithmic framework for model-based DFO. At each iteration of RSDFO we select a random low-dimensional subspace, build and minimize a model to compute a step in this space, then change the subspace at the next iteration. We provide a probabilistic worst-case complexity analysis of RSDFO. To our knowledge, this is the first subspace model-based DFO method with global complexity and convergence guarantees. We then describe how this general framework can be specialized to the case of nonlinear least-squares minimization through a model construction technique inspired by the Gauss–Newton method, yielding a new algorithm RSDFO-GN with associated worst-case complexity bounds. We then present an efficient implementation of RSDFO-GN, which we call DFBGN. DFBGN is available on GitHub (Footnote 1) and includes several algorithmic features that yield strong performance on large-scale problems and a low per-iteration linear algebra cost that is typically linear in the problem dimension.

1.1 Existing literature

The contributions in this paper are connected to several areas of research. We briefly review these topics below.

Block coordinate descent There is a large body of work on (derivative-based) block coordinate descent (BCD) methods, typically motivated by machine learning applications. BCD extends coordinate search methods [81] by updating a subset of the variables at each iteration, typically using a coordinate-wise variant of a first-order method. For nonconvex problems, the first convergence result for a randomized coordinate descent method based on proximal gradient descent was given in [62]. Here, the sampling of coordinates was uniform and required step sizes based on Lipschitz constants associated with the objective. This was extended in [57] to general randomized block selection with a nonmonotone linesearch-type method (to allow for unknown Lipschitz constants), and to a (possibly deterministic) ‘essentially cyclic’ block selection and extrapolation (but requiring Lipschitz constants) in [83]. Several extensions of this approach have been developed, including the use of stochastic gradients [82], parallel block updating [36] and inexact step calculations [36, 84].

BCD methods have been extended to nonlinear least-squares problems, leading to so-called Subspace Gauss–Newton methods. These are derivative-based methods where a Gauss–Newton step is computed for a subset of variables. This approach was initially proposed in [74] for parameter estimation in climate models—where derivative estimates were computed using implicit filtering [53]—and analyzed in quadratic regularization and trust-region settings for general unconstrained objectives in [16, 17, 72].

Sketching Sketching is an alternative dimensionality reduction technique for least-squares problems, reducing the number of residuals rather than the number of variables. Sketching ideas have been applied to linear [50, 58, 80] and nonlinear [35] least-squares problems, to model-based DFO for nonlinear least-squares [14], and to subsampling algorithms for finite sum-of-functions minimization such as Newton’s method [7, 70].

There are also alternative approaches to sketching which lead to subspace-type methods, where local gradients and Hessians are estimated only within a subspace (possibly used in conjunction with random subsampling). Sketching in this context has been applied to, for example, Newton’s method [7, 44, 63], BFGS [43], and SAGA [45], as well as to trust-region and quadratic regularization methods [16, 17, 72].

Random embeddings for global optimization Some global optimization methods have been proposed which randomly project a high-dimensional problem into a low-dimensional subspace and solve this smaller problem using existing (global or local) methods. Though applicable to general global optimization problems (as a more sophisticated variant of random search), this technique has been explored particularly for defeating the curse of dimensionality when optimising functions which have low effective dimensionality [18, 68, 78]. For the latter class, often only one random subspace projection is needed, though the addition of constraints leads to multiple embeddings being required [18]. Our approach here differs from these works in both theoretical and numerical aspects, as it is focused on a specific random subspace technique for local optimization.

Probabilistic model-based DFO For model-based DFO, several algorithms have been developed and analyzed where the local model at each iteration is only sufficiently accurate with a certain probability [5, 12, 23]. Similar analysis also exists for derivative-based algorithms [22, 47]. Our approach is based on deterministic model-based DFO within subspaces, and we instead require a very weak probabilistic condition on the (randomly chosen) subspaces (Assumption 4).

Randomized direct search DFO In randomized direct search methods, iterates are perturbed in a random subset of directions (rather than a positive spanning set) when searching for local improvement. In this framework, effectively only a random subspace is searched in each iteration. Worst-case complexity bounds for this technique are given under predetermined step length regimes in [11, 40], and with adaptive step sizes in [46, 48], where [48] extends [46] to linearly constrained problems.

Large-scale DFO There have been several alternative approaches considered for improving the scalability of DFO. These often consider problems with specific structure which enable efficient model construction, such as partial separability [26, 64], sparse Hessians [4], and minimization over the convex hull of finitely many points [31]. On the other hand, there is a growing body of literature on ‘gradient sampling’ techniques for machine learning problems. These methods typically consider stochastic first-order methods but with a gradient approximation based on finite differencing in random directions [60], i.e. approximations of the form \(\nabla f({\varvec{x}}) \approx \frac{f({\varvec{x}}+h{\varvec{u}})-f({\varvec{x}})}{h}{\varvec{u}}\) for a random Gaussian vector \({\varvec{u}}\) (Footnote 2). This framework has led to variants of methods such as stochastic gradient descent [38], SVRG [56] and Adam [24], for example. We note that linear interpolation to orthogonal directions, which is more similar to traditional model-based DFO, has been shown to outperform gradient sampling as a gradient estimation technique [8, 9].
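As a purely illustrative sketch (not taken from any of the cited works), the two gradient estimators mentioned above can be compared at an equal evaluation budget. The function names, the step size `h` and the use of coordinate directions as the orthogonal set are assumptions made for this example; plain Python lists are used only to keep it dependency-free.

```python
import math
import random


def gaussian_fd_gradient(f, x, h=1e-6, num_samples=1, rng=random):
    """Average of ((f(x + h*u) - f(x)) / h) * u over Gaussian directions u."""
    n, fx = len(x), f(x)
    g = [0.0] * n
    for _ in range(num_samples):
        u = [rng.gauss(0.0, 1.0) for _ in range(n)]
        slope = (f([xi + h * ui for xi, ui in zip(x, u)]) - fx) / h
        g = [gi + slope * ui / num_samples for gi, ui in zip(g, u)]
    return g


def coordinate_interp_gradient(f, x, h=1e-6):
    """Linear interpolation along the n coordinate directions (an orthogonal set)."""
    n, fx = len(x), f(x)
    g = []
    for i in range(n):
        y = list(x)
        y[i] += h  # perturb one coordinate at a time
        g.append((f(y) - fx) / h)
    return g
```

With `num_samples = n`, both estimators use \(n+1\) objective evaluations. The Gaussian estimator is correct only on average over \({\varvec{u}}\), so a single batch of \(n\) samples typically carries an error comparable to \(\Vert \nabla f({\varvec{x}})\Vert \), whereas the interpolation estimate has error of order `h` for smooth \(f\).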

Subspace DFO methods A model-based DFO method with similarities to our subspace approach is the moving ridge function method from [49]. Here, existing objective evaluations are used to determine an ‘active subspace’ which captures the largest variability in the objective and build an interpolation model within this subspace. We also note the VXQR method from [61], which performs line searches along a direction chosen from a subspace determined by previous iterates. Neither of these methods includes convergence theory. By comparison, aside from our focus on nonlinear least-squares problems, both our general theoretical framework and our implemented method select their working subspaces randomly, and we provide (probabilistic) convergence guarantees. Lastly, the unpublished works [77, 86] propose a similar construction to ours, but based on full minimization of the objective within each subspace, and allowing potentially multiple simultaneous parallel subspace minimizations.

1.2 Contributions

We introduce RSDFO (Random Subspace Derivative-Free Optimization), a generic model-based DFO framework that relies on constructing a model in a subspace at each iteration. Our novel approach enables model-based DFO methods to be applied in a large-scale regime by giving the user explicit control over the subspace dimension, and hence control over the per-iteration linear algebra cost of the method. This framework is then specialized to the case of nonlinear least-squares problems, yielding a new algorithm RSDFO-GN (Random Subspace DFO with Gauss–Newton). The subspace model construction framework of RSDFO-GN is based on DFO Gauss–Newton methods [15, 19], and retains the same theoretical guarantees as RSDFO. We then describe a practical implementation of RSDFO-GN, which we call DFBGN (Derivative-Free Block Gauss–Newton; see Footnote 3). Compared to existing methods, DFBGN reduces the linear algebra cost of model construction and the initial objective evaluation cost by allowing fewer interpolation points at every iteration. In order for DFBGN to have both scalability and a similar evaluation efficiency to existing methods (i.e. objective reduction achieved for a given number of objective evaluations), several modifications to the theoretical framework, regarding the selection of interpolation points and the search subspace, are necessary.

Theoretical results We consider a generic theoretical framework RSDFO, where the subspace dimension is a user-chosen algorithm hyperparameter, and no specific model construction approach is specified. Our framework is not specific to a least-squares problem structure, and holds for any objective with Lipschitz continuous gradient, and allows for a general class of random subspace constructions (not relying on a specific class of embeddings or projections). The theoretical results here extend the approach and techniques in [16, 17, 72] to model-based DFO methods. In particular, we use the notion of a well-aligned subspace (Definition 2) from [16, 17, 72], one in which sufficient decrease is achievable, and assume that our search subspace is well-aligned with some probability (Assumption 4). This is achieved provided we select a sufficiently large subspace dimension (depending on the desired failure probability and subspace alignment quality).

We derive a high-probability worst-case complexity bound for RSDFO. Specifically, our main bounds are of the form \({\mathbb {P}}\left[ \min _{j\le k} \Vert \nabla f({\varvec{x}}_j)\Vert \le C k^{-1/2}\right] \ge 1 - e^{-ck}\) and \({\mathbb {P}}\left[ K_{\epsilon } \le C\epsilon ^{-2}\right] \ge 1-e^{-c\epsilon ^{-2}}\), where \(K_{\epsilon }\) is the first iteration to achieve first-order optimality \(\epsilon \) (see Theorem 1 and Corollary 1). This result then implies a variety of alternative convergence results, such as expectation bounds and almost-sure convergence. Based on [16, 17, 54, 72], we give several constructions for determining our random subspace, and show that we can achieve convergence with a subspace dimension that is independent of the ambient dimension.

Our analysis matches the standard deterministic \({\mathcal {O}}(\epsilon ^{-2})\) complexity bounds for model-based DFO methods built on linear interpolating models (e.g. [37]). However, when measuring the complexity in objective evaluations, our method yields a lower explicit dependency on the (ambient) problem dimension. Compared to the analysis of derivative-based methods (e.g. BCD [83] and probabilistically accurate models [22]) we need to incorporate the possibility that the interpolation model is not accurate (not fully linear, see Definition 1). However, unlike [5, 12, 23] we do not assume that full linearity is a stochastic property; instead, our stochasticity comes from the subspace selection and we explicitly handle non-fully linear models similar to [19, 29]. This gives us a framework which is similar to standard model-based DFO, with weak probabilistic conditions. Compared to the analysis of derivative-based random subspace methods in [16, 17, 72], our analysis is complicated substantially by the possibility of inaccurate models and the intricacies of model-based DFO algorithms. Although our approach could have considered situations where models are always guaranteed to be fully linear, we have developed our analysis to cope with this greater generality and to closely align with the traditional analysis of model-based DFO methods [19, 29]. The possibility of inaccurate models is similarly considered in [5, 12, 22, 23], but as an event that happens with some probability.

We then consider RSDFO-GN, which explicitly describes how interpolation models can be constructed for nonlinear least-squares problems, thus providing a concrete implementation of RSDFO in this context. Here we consider quadratic local models formed by linear interpolation for each residual function, which have strong practical performance [19]. We prove that RSDFO-GN retains the same \({\mathcal {O}}(\epsilon ^{-2})\) complexity bound as RSDFO, again matching existing deterministic bounds [19]. However, as in the general case, RSDFO-GN has an oracle complexity bound with a lower dependency on problem dimension compared to existing results.

In both cases, our subspace approach improves on existing oracle complexity analysis in terms of its dependency on problem dimension. Moreover, our method benefits from a significantly reduced linear algebra cost per iteration, and so also improves on existing complexity bounds when measuring the algorithm’s overall computational cost. For high-dimensional problems, both of these considerations (cost of objective evaluations and of linear algebra) are potentially relevant to overall algorithm performance.

Implementation Although it has beneficial evaluation and linear algebra complexity results, because of the random subspace framework, RSDFO-GN is not able to recycle objective evaluation information across multiple iterations. This is a key element of the practical success of model-based DFO methods tailored to the setting where objective evaluations are expensive. To address this, we introduce a practical, implementable variant of RSDFO-GN called DFBGN, which is based on the solver DFO-LS [15]. DFBGN achieves its practicality by using existing interpolation points to determine the relevant search subspace, coupled with a geometry-aware approach for selecting interpolation points for removal (inspired by the approach from [67]), and an adaptive randomized approach for selecting new interpolation points/subspace directions. We study the per-iteration linear algebra cost of DFBGN, and show that it is linear in the problem dimension, a substantial improvement over existing methods, which are cubic in the problem dimension, and equal to RSDFO-GN (although with significantly better practical performance than RSDFO-GN in terms of objective evaluations). This improvement comes from being able to perform almost all computations in the subspace, including model construction, step calculation and geometry-aware point removal. Our per-iteration linear algebra costs are also linear in the number of residuals, the same as existing methods, but with a substantially smaller constant (quadratic in the subspace dimension, which is user-determined, rather than quadratic in the problem dimension).

Numerical results We compare DFBGN with DFO-LS (which itself is shown to have state-of-the-art performance in [15]) on collections of both medium-scale (approx. 100 dimensions) and large-scale test problems (approx. 1000 dimensions). We show that DFBGN with a full-sized subspace has similar performance to DFO-LS in terms of objective evaluations, but shows improved performance on runtime (Footnote 4). This indicates that DFBGN’s practical approach for recycling objective evaluations across iterations yields state-of-the-art performance while inheriting the low linear algebra cost of RSDFO-GN. As the dimension of the subspace reduces (i.e. the size of the interpolation set reduces), we demonstrate a tradeoff between reduced linear algebra costs and increased evaluation counts required to achieve a given objective reduction. The flexibility of DFBGN allows this tradeoff to be explicitly managed. When tested on large-scale problems, DFO-LS frequently reaches a reasonable runtime limit without making substantial progress, whereas DFBGN with small subspace size can perform many more iterations and hence make better progress than DFO-LS. In the case of expensive objectives with small evaluation budgets, we show that DFBGN can make progress with few objective evaluations in a similar way to DFO-LS (which has a mechanism to make progress from as few as 2 objective evaluations independent of problem dimension), but with substantially lower linear algebra costs.

Structure of paper In Sect. 2 we describe RSDFO and provide our probabilistic worst-case complexity analysis. We specialize RSDFO to RSDFO-GN in Sect. 3. Then we describe the practical implementation DFBGN and its features in Sect. 4. Our numerical results are given in Sect. 5.

Implementation A Python implementation of DFBGN is available on GitHub (Footnote 5).

Notation We use \(\Vert \cdot \Vert \) to refer to the Euclidean norm of vectors and the operator 2-norm of matrices, and \(B({\varvec{x}},\varDelta )\) for \({\varvec{x}}\in {\mathbb {R}}^n\) and \(\varDelta >0\) to be the closed ball \(\{{\varvec{y}}\in {\mathbb {R}}^n : \Vert {\varvec{y}}-{\varvec{x}}\Vert \le \varDelta \}\).

2 Random subspace model-based DFO

In this section we outline our general model-based DFO algorithmic framework based on minimization in random subspaces. We consider the nonconvex problem

$$\begin{aligned} \min _{{\varvec{x}}\in {\mathbb {R}}^n} f({\varvec{x}}), \end{aligned}$$
(1)

where we assume that \(f:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\) is continuously differentiable, but that access to its gradient is not possible (e.g. for the reasons described in Sect. 1). In a standard model-based DFO framework (e.g. [30, 55]), at each iteration k we construct a quadratic model \(m_k:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\) which approximates f near our iterate \({\varvec{x}}_k\):

$$\begin{aligned} f({\varvec{x}}_k + {\varvec{s}}) \approx m_k({\varvec{s}}) :=f({\varvec{x}}_k) + {\varvec{g}}_k^T {\varvec{s}}+ \frac{1}{2}{\varvec{s}}^T H_k {\varvec{s}}, \end{aligned}$$
(2)

for some \({\varvec{g}}_k\in {\mathbb {R}}^n\) and \(H_k\in {\mathbb {R}}^{n\times n}\) symmetric. Based on this model, we build a globally convergent algorithm using a trust-region framework [27]. This algorithmic framework is suitable provided that, when necessary, we can guarantee \(m_k\) is a sufficiently accurate model for f near \({\varvec{x}}_k\). Details about how to construct sufficiently accurate models based on interpolation are given in [28, 30].

Our core idea here is to construct interpolation models which only approximate the objective in a subspace, rather than in the full space \({\mathbb {R}}^n\). This allows us to use interpolation sets with fewer points, since we do not have to capture the objective’s behaviour outside our subspace, which improves the scalability of the method.

In this section, we outline our general algorithmic framework and provide a worst-case complexity analysis showing convergence to first-order stationary points with high probability. We then describe how this framework may be specialized to the case of nonlinear least-squares minimization.

2.1 RSDFO algorithm

In our general framework, which we call RSDFO (Random Subspace DFO), we modify the above approach by randomly choosing a p-dimensional subspace (where \(p\le n\) is user-chosen) and constructing an interpolation model defined only in that subspace (Footnote 6). Specifically, in each iteration k we randomly choose a p-dimensional affine space \({\mathcal {Y}}_k \subset {\mathbb {R}}^n\) given by the range of \(Q_k\in {\mathbb {R}}^{n\times p}\), i.e.

$$\begin{aligned} {\mathcal {Y}}_k = \{{\varvec{x}}_k + Q_k {\hat{{\varvec{s}}}} : {\hat{{\varvec{s}}}}\in {\mathbb {R}}^p\}. \end{aligned}$$
(3)

We then construct a model which interpolates f at points in \({\mathcal {Y}}_k\) and ultimately construct a local quadratic model for f only on \({\mathcal {Y}}_k\). That is, given \(Q_k\), we assume that we have \({\hat{m}}_k:{\mathbb {R}}^p\rightarrow {\mathbb {R}}\) given by

$$\begin{aligned} f({\varvec{x}}_k + Q_k{\hat{{\varvec{s}}}}) \approx {\hat{m}}_k({\hat{{\varvec{s}}}}) := f({\varvec{x}}_k) + {\hat{{\varvec{g}}}}_k^T{\hat{{\varvec{s}}}} + \frac{1}{2}{\hat{{\varvec{s}}}}^T {\hat{H}}_k {\hat{{\varvec{s}}}}, \end{aligned}$$
(4)

where \({\hat{{\varvec{g}}}}_k\in {\mathbb {R}}^p\) and \({\hat{H}}_k\in {\mathbb {R}}^{p\times p}\) are the low-dimensional model gradient and Hessian respectively, adopting the convention of using hats on variables to denote low-dimensional quantities. In Sect. 3 we specialize this to a model construction process for nonlinear least-squares problems.

For our trust-region algorithm, we (approximately) minimize \({\hat{m}}_k\) inside the trust region to get a tentative step

$$\begin{aligned} {\hat{{\varvec{s}}}}_k \approx {\text {arg\,min}}_{{\hat{{\varvec{s}}}}\in {\mathbb {R}}^p} {\hat{m}}_k({\hat{{\varvec{s}}}}), \quad \text {s.t.} \quad \Vert {\hat{{\varvec{s}}}}\Vert \le \varDelta _k, \end{aligned}$$
(5)

for the current trust-region radius \(\varDelta _k>0\), yielding a tentative step \({\varvec{s}}_k = Q_k {\hat{{\varvec{s}}}}_k \in {\mathbb {R}}^n\). We thus also get the computational advantage coming from solving a p-dimensional trust-region subproblem.
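The following sketch illustrates (3)–(5) in the simplest setting, assuming a linear model (\({\hat{H}}_k=0\)) built by interpolating f at \({\varvec{x}}_k\) and at \({\varvec{x}}_k+\varDelta _k {\varvec{q}}_i\) for each column \({\varvec{q}}_i\) of \(Q_k\). The function names and the choice of scaled Gaussian columns are hypothetical and used for illustration only; this is not the DFBGN implementation.

```python
import math
import random


def random_subspace(n, p, rng=random):
    """Return p Gaussian directions in R^n, scaled by 1/sqrt(p) (the columns of Q)."""
    return [[rng.gauss(0.0, 1.0) / math.sqrt(p) for _ in range(n)] for _ in range(p)]


def subspace_linear_model(f, x, Q, delta):
    """Reduced gradient g_hat from interpolating f at x and x + delta * q_i."""
    fx = f(x)
    g_hat = []
    for q in Q:
        y = [xi + delta * qi for xi, qi in zip(x, q)]
        g_hat.append((f(y) - fx) / delta)
    return fx, g_hat


def subspace_tr_step(g_hat, Q, delta):
    """Minimize fx + g_hat^T s_hat over ||s_hat|| <= delta, then map s = Q s_hat."""
    norm_g = math.sqrt(sum(g * g for g in g_hat))
    if norm_g == 0.0:
        s_hat = [0.0] * len(g_hat)
    else:
        # For a linear model the minimizer lies on the trust-region boundary.
        s_hat = [-delta * g / norm_g for g in g_hat]
    n = len(Q[0])
    s = [sum(Q[j][i] * s_hat[j] for j in range(len(Q))) for i in range(n)]
    return s_hat, s
```

Under these assumptions, building the model costs \(p+1\) objective evaluations and the step computation involves only p-dimensional quantities until the final product \(Q_k{\hat{{\varvec{s}}}}_k\).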

In our setting we are only interested in the approximation properties of \({\hat{m}}_k\) in the space \({\mathcal {Y}}_k\), and so we introduce the following notion of a “sufficiently accurate” model:

Definition 1

Given \(Q\in {\mathbb {R}}^{n\times p}\), a model \({\hat{m}}:{\mathbb {R}}^p\rightarrow {\mathbb {R}}\) is Q-fully linear in \(B({\varvec{x}},\varDelta )\subset {\mathbb {R}}^n\) if

$$\begin{aligned} |f({\varvec{x}}+Q{\hat{{\varvec{s}}}}) - {\hat{m}}({\hat{{\varvec{s}}}})|&\le \kappa _{\mathrm{ef}}\varDelta ^2, \end{aligned}$$
(6a)
$$\begin{aligned} \Vert Q^T \nabla f({\varvec{x}}+Q{\hat{{\varvec{s}}}}) - \nabla {\hat{m}}({\hat{{\varvec{s}}}})\Vert&\le \kappa _{\mathrm{eg}}\varDelta , \end{aligned}$$
(6b)

for all \({\hat{{\varvec{s}}}}\in {\mathbb {R}}^p\) with \(\Vert {\hat{{\varvec{s}}}}\Vert \le \varDelta \). The constants \(\kappa _{\mathrm{ef}}\) and \(\kappa _{\mathrm{eg}}\) must be independent of Q, \({\hat{m}}\), \({\varvec{x}}\) and \(\varDelta \).

The gradient condition (6b) comes from noting that if \({\hat{f}}({\hat{{\varvec{s}}}}):=f({\varvec{x}}+Q{\hat{{\varvec{s}}}})\) then \(\nabla {\hat{f}}({\hat{{\varvec{s}}}}) = Q^T \nabla f({\varvec{x}}+Q{\hat{{\varvec{s}}}})\). We note that if we have full-dimensional subspaces \(p=n\) and take \(Q=I\), then we recover the standard notion of fully linear models [29, Definition 3.1]. In our analysis, we will generally assume that Definition 1 is satisfied by finding \({\hat{{\varvec{g}}}}_k\) using linear interpolation and taking \({\hat{H}}_k\) to be zero, but underdetermined quadratic interpolation techniques could instead be used [28, 30].
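To spell out the chain-rule step behind (6b) (a routine calculation, included only as a reading aid), write \(Q_{ji}\) for the entries of Q. Each component of the reduced gradient is then

$$\begin{aligned} \frac{\partial {\hat{f}}}{\partial {\hat{s}}_i}({\hat{{\varvec{s}}}}) = \sum _{j=1}^{n} \frac{\partial f}{\partial x_j}({\varvec{x}}+Q{\hat{{\varvec{s}}}})\, Q_{ji} = \left[ Q^T \nabla f({\varvec{x}}+Q{\hat{{\varvec{s}}}})\right] _i, \qquad i=1,\ldots ,p. \end{aligned}$$

In particular, for a linear model \({\hat{m}}({\hat{{\varvec{s}}}}) = f({\varvec{x}}) + {\hat{{\varvec{g}}}}^T{\hat{{\varvec{s}}}}\) (i.e. \({\hat{H}}=0\)) we have \(\nabla {\hat{m}}\equiv {\hat{{\varvec{g}}}}\), so (6b) asks that \({\hat{{\varvec{g}}}}\) stays within \(\kappa _{\mathrm{eg}}\varDelta \) of the projected gradient \(Q^T\nabla f\) throughout the trust region.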

figure a

Complete RSDFO algorithm The complete RSDFO algorithm is stated in Algorithm 1. The overall structure is common to model-based DFO methods [29, 30]. In particular, we assume that we have procedures to verify whether or not a model is \(Q_k\)-fully linear in \(B({\varvec{x}}_k,\varDelta _k)\) and (if not) to generate a \(Q_k\)-fully linear model. When we specialize RSDFO to nonlinear least-squares problems in Sect. 3, we will describe how we can obtain such procedures.

The broad structure of RSDFO is as follows:

  1. First generate a subspace \(Q_k\) and, by linear interpolation on a new set of points in the subspace, generate an interpolating model \({\hat{m}}_k\).

  2. If we suspect we are close to first-order stationarity, perform one iteration of a criticality step [29, 30] to ensure we have an accurate model and appropriately sized trust-region radius.

  3. Compute a step by solving the trust-region subproblem (5).

  4. Evaluate the quality of the step and use this to determine the new iterate \({\varvec{x}}_{k+1}\) and trust-region radius \(\varDelta _{k+1}\). Our updating mechanism follows [19, 67]. In particular, we consider a very short step to be unsuccessful and invoke a safety step [65] without evaluating \(f({\varvec{x}}_k+{\varvec{s}}_k)\).
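The broad structure above can be sketched as a short loop. This is a deliberately simplified illustration and not Algorithm 1: it omits the criticality step, the safety step and the CHECK_MODEL logic, always uses a linear model, and assumes the usual trust-region ratio of actual to predicted decrease. All parameter names and default values (`eta1`, `eta2`, `gamma_dec`, `gamma_inc`, `delta_min`) are hypothetical choices for this example.

```python
import math
import random


def rsdfo_sketch(f, x0, p, delta0=1.0, max_iters=200, eta1=0.1, eta2=0.7,
                 gamma_dec=0.5, gamma_inc=2.0, delta_min=1e-8, seed=0):
    """Simplified random-subspace trust-region loop with linear models."""
    rng = random.Random(seed)
    x, fx, delta = list(x0), f(x0), delta0
    n = len(x)
    for _ in range(max_iters):
        if delta < delta_min:
            break
        # Step 1: draw p random directions and interpolate f along them.
        Q = [[rng.gauss(0.0, 1.0) / math.sqrt(p) for _ in range(n)]
             for _ in range(p)]
        g_hat = [(f([xi + delta * qi for xi, qi in zip(x, q)]) - fx) / delta
                 for q in Q]
        norm_g = math.sqrt(sum(g * g for g in g_hat))
        if norm_g == 0.0:
            delta *= gamma_dec
            continue
        # Step 3: minimizer of the linear model inside the trust region.
        s_hat = [-delta * g / norm_g for g in g_hat]
        s = [sum(Q[j][i] * s_hat[j] for j in range(p)) for i in range(n)]
        x_new = [xi + si for xi, si in zip(x, s)]
        f_new = f(x_new)
        # Step 4: compare actual decrease with the model's predicted decrease.
        predicted = delta * norm_g
        rho = (fx - f_new) / predicted
        if rho >= eta1:
            x, fx = x_new, f_new  # accept the step
            if rho >= eta2:
                delta *= gamma_inc
        else:
            delta *= gamma_dec  # reject the step and shrink the region
    return x, fx
```

Each iteration of this sketch uses \(p+2\) objective evaluations (\(p\) interpolation points plus the trial point, with \(f({\varvec{x}}_k)\) reused), regardless of the ambient dimension n. Because a fresh subspace is drawn every time, no evaluations are reused between iterations.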

An important feature of RSDFO is that in some iterations, we reuse the previous subspace, \({\mathcal {Y}}_k={\mathcal {Y}}_{k-1}\), corresponding to the flag CHECK_MODEL=TRUE. In this case, we had an inaccurate model in iteration \(k-1\) and require that our new model \({\hat{m}}_k\) is accurate (\(Q_k\)-fully linear). This mechanism essentially ensures that \(\varDelta _k\) is not decreased too quickly as a result of inaccurate models, and is mostly decreased to achieve sufficient objective reduction.

We now give our convergence and worst-case complexity analysis of Algorithm 1. For brevity, we defer proofs based on standard model-based DFO techniques to Appendix A.

2.2 Assumptions and preliminary results

We begin our analysis with some basic assumptions and preliminary results.

Assumption 1

(Smoothness) The objective function \(f:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\) is bounded below by \(f_{\mathrm{low}}\) and continuously differentiable, and \(\nabla f\) is \(L_{\nabla f}\)-Lipschitz continuous in the level set \(\{{\varvec{x}}\in {\mathbb {R}}^n : f({\varvec{x}}) \le f({\varvec{x}}_0)\}\), for some constant \(L_{\nabla f}>0\).

We also need two standard assumptions for trust-region methods: model Hessians that are uniformly bounded above, and sufficiently accurate solutions to the trust-region subproblem (5).

Assumption 2

(Bounded model Hessians) We assume that \(\Vert {\hat{H}}_k\Vert \le \kappa _H\) for all k, for some \(\kappa _H\ge 1\).

Assumption 3

(Cauchy decrease) Our method for solving the trust-region subproblem (5) gives a step \({\hat{{\varvec{s}}}}_k\) satisfying the sufficient decrease condition

$$\begin{aligned} {\hat{m}}_k({\varvec{0}}) - {\hat{m}}_k({\hat{{\varvec{s}}}}_k) \ge c_1 \Vert {\hat{{\varvec{g}}}}_k\Vert \min \left( \varDelta _k, \frac{\Vert {\hat{{\varvec{g}}}}_k\Vert }{\max (\Vert {\hat{H}}_k\Vert ,1)}\right) , \end{aligned}$$
(9)

for some \(c_1\in [1/2, 1]\) independent of k.

A useful consequence, needed for the analysis of our trust-region radius updating scheme, is the following.

Lemma 1

(Lemma 3.6, [19]) Suppose Assumption 3 holds. Then

$$\begin{aligned} \Vert {\hat{{\varvec{s}}}}_k\Vert \ge c_2 \min \left( \varDelta _k, \frac{\Vert {\hat{{\varvec{g}}}}_k\Vert }{\max (\Vert {\hat{H}}_k\Vert , 1)}\right) , \end{aligned}$$
(10)

where \(c_2 := 2c_1 / (1+\sqrt{1+2c_1})\).

Lemma 2

Suppose Assumptions 2 and 3 hold, and we run RSDFO with \(\beta _F \le c_2\) (where \(c_2\) is introduced in Lemma 1). If \({\hat{m}}_k\) is \(Q_k\)-fully linear in \(B({\varvec{x}}_k,\varDelta _k)\) and

$$\begin{aligned} \varDelta _k \le c_0 \Vert {\hat{{\varvec{g}}}}_k\Vert , \qquad \text {where} \qquad c_0 := \min \left( \mu , \frac{1}{\kappa _H}, \frac{c_1(1-\eta _2)}{2\kappa _{\mathrm{ef}}}\right) , \end{aligned}$$
(11)

then the criticality and safety steps are not called, and \(\rho _k\ge \eta _2\).

Proof

See Appendix A.1. \(\square \)

Remark 1

The requirement \(\beta _F \le c_2\) in Lemma 2 is not restrictive. Since we have \(c_1 \ge 1/2\) in Assumption 3, it suffices to choose \(\beta _F \le \sqrt{2}-1\), for example.
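The arithmetic behind this choice can be checked directly from the definition of \(c_2\) in Lemma 1, by rationalizing the denominator:

$$\begin{aligned} c_2 = \frac{2c_1}{1+\sqrt{1+2c_1}} = \frac{2c_1\left( \sqrt{1+2c_1}-1\right) }{(1+2c_1)-1} = \sqrt{1+2c_1}-1. \end{aligned}$$

This is increasing in \(c_1\), so \(c_1\ge 1/2\) gives \(c_2 \ge \sqrt{2}-1 \approx 0.414\), and any \(\beta _F \le \sqrt{2}-1\) satisfies \(\beta _F\le c_2\).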

Our key new assumption is on the quality of our subspace selection, as introduced in [16, 17, 72]:

Definition 2

The matrix \(Q_k\) is well-aligned if

$$\begin{aligned} \Vert Q_k^T \nabla f({\varvec{x}}_k)\Vert \ge \alpha _Q \Vert \nabla f({\varvec{x}}_k)\Vert , \end{aligned}$$
(12)

for some \(\alpha _Q>0\) independent of k.

Assumption 4

(Subspace quality) Our subspace selection (determined by \(Q_k\)) satisfies the following two properties:

  (a) At each iteration k of RSDFO in which CHECK_MODEL = FALSE, our subspace selection \(Q_k\) is well-aligned for some fixed \(\alpha _Q>0\) with probability at least \(1-\delta _S\), for some \(\delta _S\in (0,1)\), independently of \(\{Q_0,\ldots ,Q_{k-1}\}\).

  (b) \(\Vert Q_k\Vert \le Q_{\max }\) for all k and some \(Q_{\max }>0\).

Of these two properties, (a) is needed for our complexity analysis, while (b) is only needed in order to construct \(Q_k\)-fully linear models (in Sect. 3). Note that if Assumption 4 holds then (12) and \(\Vert Q_k\Vert \le Q_{\max }\) together imply that we must have \(\alpha _Q \le Q_{\max }\). The constructions we will consider will be based on \(\alpha _Q\in (0,1)\). We will discuss how to achieve Assumption 4 in more detail in Sect. 2.6.
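To give a feel for Assumption 4(a), the following sketch estimates empirically how often a random matrix satisfies (12). It assumes, purely for illustration, that Q has i.i.d. \(N(0,1/p)\) entries; the constructions actually analysed are those of Sect. 2.6, and the function names here are hypothetical.

```python
import math
import random


def is_well_aligned(Q, grad, alpha_q):
    """Check ||Q^T grad|| >= alpha_q * ||grad||, with Q given as a list of p columns."""
    proj = [sum(qi * gi for qi, gi in zip(q, grad)) for q in Q]
    norm_proj = math.sqrt(sum(v * v for v in proj))
    norm_grad = math.sqrt(sum(g * g for g in grad))
    return norm_proj >= alpha_q * norm_grad


def alignment_frequency(n, p, alpha_q, trials=2000, seed=0):
    """Fraction of random Gaussian subspaces that are well-aligned with a fixed vector."""
    rng = random.Random(seed)
    grad = [rng.gauss(0.0, 1.0) for _ in range(n)]  # stand-in for the gradient
    hits = 0
    for _ in range(trials):
        Q = [[rng.gauss(0.0, 1.0) / math.sqrt(p) for _ in range(n)]
             for _ in range(p)]
        hits += is_well_aligned(Q, grad, alpha_q)
    return hits / trials
```

For this particular Gaussian choice, \(\Vert Q^T{\varvec{g}}\Vert ^2/\Vert {\varvec{g}}\Vert ^2\) is distributed as a chi-squared variable with p degrees of freedom divided by p, for any fixed nonzero \({\varvec{g}}\). The observed frequency therefore depends on p and \(\alpha _Q\) but not on n, and increases with p for fixed \(\alpha _Q\in (0,1)\).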

Lemma 3

In all iterations k of RSDFO where the criticality step is not called, we have \(\Vert {\hat{{\varvec{g}}}}_k\Vert \ge \min (\epsilon _C, \mu ^{-1}\varDelta _k)\). If the criticality step is not called in iteration k, \(Q_k\) is well-aligned and \(\Vert \nabla f({\varvec{x}}_k)\Vert \ge \epsilon \), then

$$\begin{aligned} \Vert {\hat{{\varvec{g}}}}_k\Vert \ge \epsilon _g(\epsilon ) := \min \left( \epsilon _C, \frac{\alpha _Q \epsilon }{\kappa _{\mathrm{eg}}\mu + 1}\right) > 0. \end{aligned}$$
(13)

Proof

See Appendix A.2. \(\square \)

2.3 Counting iterations

We now provide a series of results counting the number of iterations of RSDFO of different types, following the style of analysis from [17, 22, 72]. First we introduce some notation to enumerate our iterations. Suppose we run RSDFO until the end of iteration K. We then define the following subsets of \(\{0,\ldots ,K\}\):

  • \({\mathcal {C}}\) is the set of iterations in \(\{0,\ldots ,K\}\) where the criticality step is called.

  • \({\mathcal {F}}\) is the set of iterations in \(\{0,\ldots ,K\}\), where the safety step is called (i.e. \(\Vert {\hat{{\varvec{s}}}}_k\Vert <\beta _F \varDelta _k\)).

  • \(\mathcal {VS}\) is the set of very successful iterations in \(\{0,\ldots ,K\}\), where \(\rho _k \ge \eta _2\).

  • \({\mathcal {S}}\) is the set of successful iterations in \(\{0,\ldots ,K\}\), where \(\rho _k \ge \eta _1\). Note that \(\mathcal {VS}\subset {\mathcal {S}}\).

  • \({\mathcal {U}}\) is the set of unsuccessful iterations in \(\{0,\ldots ,K\}\), where \(\rho _k<\eta _1\).

  • \({\mathcal {A}}\) is the set of well-aligned iterations in \(\{0,\ldots ,K\}\), where (12) holds.

  • \({\mathcal {A}}^C\) is the set of poorly aligned iterations in \(\{0,\ldots ,K\}\), where (12) does not hold.

  • \({\mathcal {D}}(\varDelta )\) is the set of iterations in \(\{0,\ldots ,K\}\) where \(\varDelta _k \ge \varDelta \) for some \(\varDelta >0\).

  • \({\mathcal {D}}^C(\varDelta )\) is the set of iterations in \(\{0,\ldots ,K\}\) where \(\varDelta _k < \varDelta \).

  • \({\mathcal {L}}\) is the set of iterations in \(\{0,\ldots ,K\}\) where \({\hat{m}}_k\) is \(Q_k\)-fully linear in \(B({\varvec{x}}_k,\varDelta _k)\).

  • \({\mathcal {L}}^C\) is the set of iterations in \(\{0,\ldots ,K\}\) where \({\hat{m}}_k\) is not \(Q_k\)-fully linear in \(B({\varvec{x}}_k,\varDelta _k)\).

In particular, we have the partitions, for any \(\varDelta >0\),

$$\begin{aligned} \{0,\ldots ,K\} = {\mathcal {C}} \cup {\mathcal {F}} \cup {\mathcal {S}} \cup {\mathcal {U}} = {\mathcal {A}}\cup {\mathcal {A}}^C = {\mathcal {D}}(\varDelta )\cup {\mathcal {D}}^C(\varDelta ) = {\mathcal {L}}\cup {\mathcal {L}}^C. \end{aligned}$$
(14)
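To make the bookkeeping concrete, the following is a minimal sketch of how these index sets could be built from a record of a run. The `IterationRecord` type, its fields and the example log are hypothetical and introduced only for illustration; they are not part of RSDFO. We assume that \(\rho _k\) is only computed in iterations where neither the criticality nor the safety step is called, and that \(\eta _1\le \eta _2\).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IterationRecord:
    """Hypothetical log entry for one iteration (illustrative only)."""
    kind: str             # "criticality", "safety" or "step"
    rho: Optional[float]  # ratio rho_k, only set when kind == "step"
    delta: float          # trust-region radius Delta_k
    well_aligned: bool    # True if (12) holds for Q_k
    fully_linear: bool    # True if the model is Q_k-fully linear


def index_sets(log, eta1, eta2, delta_threshold):
    """Return the index sets of Sect. 2.3 as Python sets of iteration numbers."""
    all_k = set(range(len(log)))
    steps = {k for k in all_k if log[k].kind == "step"}
    sets = {
        "C": {k for k in all_k if log[k].kind == "criticality"},
        "F": {k for k in all_k if log[k].kind == "safety"},
        "S": {k for k in steps if log[k].rho >= eta1},
        "VS": {k for k in steps if log[k].rho >= eta2},
        "U": {k for k in steps if log[k].rho < eta1},
        "A": {k for k in all_k if log[k].well_aligned},
        "D": {k for k in all_k if log[k].delta >= delta_threshold},
        "L": {k for k in all_k if log[k].fully_linear},
    }
    # Complements with respect to {0, ..., K}.
    for name in ("A", "D", "L"):
        sets[name + "^C"] = all_k - sets[name]
    return sets


# Made-up log of four iterations, with eta1 = 0.1 and eta2 = 0.7.
log = [
    IterationRecord("step", 0.9, 1.0, True, True),
    IterationRecord("step", 0.05, 2.0, False, True),
    IterationRecord("safety", None, 1.0, True, False),
    IterationRecord("criticality", None, 0.5, True, True),
]
sets = index_sets(log, eta1=0.1, eta2=0.7, delta_threshold=1.0)
```

Each of the four decompositions in (14) can then be checked directly on the returned sets, which is a convenient sanity test when instrumenting an implementation.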

First, we bound the number of successful iterations with large \(\varDelta _k\) using standard arguments from trust-region methods. Throughout, we use \(\#(\cdot )\) to refer to the cardinality of a set of iterations.

Lemma 4

Suppose Assumptions 1, 2 and 3 hold. If \(\Vert \nabla f({\varvec{x}}_k)\Vert \ge \epsilon \) for all \(k=0,\ldots ,K\), then

$$\begin{aligned} \#({\mathcal {A}}\cap {\mathcal {D}}(\varDelta )\cap {\mathcal {S}}) \le \phi (\varDelta , \epsilon ) := \frac{f({\varvec{x}}_0)-f_{\mathrm{low}}}{\eta _1 c_1\epsilon _g(\epsilon ) \min (\epsilon _g(\epsilon )/\kappa _H, \varDelta )}, \end{aligned}$$
(15)

for all \(\varDelta >0\).

Proof

See Appendix A.3. \(\square \)

Lemma 5

Suppose Assumptions 1, 2 and 3 hold, and \(\beta _F\le c_2\). If \(\Vert \nabla f({\varvec{x}}_k)\Vert \ge \epsilon \) for all \(k=0,\ldots ,K\), then

$$\begin{aligned} \#({\mathcal {A}}\cap {\mathcal {D}}^C(\varDelta )\cap {\mathcal {L}}{\setminus }\mathcal {VS}) = 0, \end{aligned}$$
(16)

for all \(\varDelta >0\) satisfying

$$\begin{aligned} \varDelta \le \varDelta ^*(\epsilon ) := \min \left( \mu \epsilon _g(\epsilon ), \frac{\epsilon _g(\epsilon )}{\kappa _H}, \left( \kappa _{\mathrm{eg}} + \frac{2\kappa _{\mathrm{ef}}}{c_1(1-\eta _2)}\right) ^{-1} \alpha _Q \epsilon , \frac{\alpha _Q\epsilon }{\kappa _{\mathrm{eg}}+\mu ^{-1}}\right) .\nonumber \\ \end{aligned}$$
(17)

Proof

See Appendix A.4. \(\square \)
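The quantities \(\epsilon _g(\epsilon )\), \(\phi (\varDelta ,\epsilon )\) and \(\varDelta ^*(\epsilon )\) from (13), (15) and (17) are simple closed-form expressions, and the sketch below evaluates them. The `Constants` container and all numerical values in it are hypothetical placeholders chosen for illustration; in practice these constants are problem- and algorithm-dependent and typically unknown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constants:
    """Hypothetical values for the constants in Lemmas 3-5 (illustrative only)."""
    eps_C: float = 1.0
    mu: float = 1.0
    alpha_Q: float = 0.5
    kappa_eg: float = 2.0
    kappa_ef: float = 2.0
    kappa_H: float = 1.0
    c1: float = 0.5
    eta1: float = 0.1
    eta2: float = 0.7
    f_gap: float = 10.0  # stands for f(x_0) - f_low


def eps_g(eps, c):
    """Lower bound (13) on the norm of the subspace model gradient."""
    return min(c.eps_C, c.alpha_Q * eps / (c.kappa_eg * c.mu + 1.0))


def phi(delta, eps, c):
    """Bound (15) on the number of iterations in A, D(delta) and S."""
    g = eps_g(eps, c)
    return c.f_gap / (c.eta1 * c.c1 * g * min(g / c.kappa_H, delta))


def delta_star(eps, c):
    """Threshold radius (17) below which Lemma 5 applies."""
    g = eps_g(eps, c)
    return min(
        c.mu * g,
        g / c.kappa_H,
        c.alpha_Q * eps / (c.kappa_eg + 2.0 * c.kappa_ef / (c.c1 * (1.0 - c.eta2))),
        c.alpha_Q * eps / (c.kappa_eg + 1.0 / c.mu),
    )
```

Note that \(\phi (\varDelta ,\epsilon )\) is non-increasing in \(\varDelta \), and that \(\varDelta ^*(\epsilon )\) scales linearly with \(\epsilon \) once \(\epsilon \) is small enough that \(\epsilon _g(\epsilon )<\epsilon _C\).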

Lemma 6

Suppose Assumptions 1, 2 and 3 hold. Then we have

$$\begin{aligned} \#({\mathcal {D}}(\max (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}})^{-1}\varDelta ){\setminus }{\mathcal {S}})&\le C_1 \#({\mathcal {D}}({\overline{\gamma }}_{\mathrm{inc}}^{-1}\varDelta )\cap {\mathcal {S}}) + C_2, \end{aligned}$$
(18)

for all \(\varDelta \le \varDelta _0\), where

$$\begin{aligned} C_1 := \frac{\log ({\overline{\gamma }}_{\mathrm{inc}})}{\log (1/\max (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}}))} \qquad \text {and} \qquad C_2 := \frac{\log (\varDelta _0/\varDelta )}{\log (1/\max (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}}))}.\nonumber \\ \end{aligned}$$
(19)

Proof

See Appendix A.5. \(\square \)

Lemma 7

Suppose Assumptions 1, 2 and 3 hold. Then

$$\begin{aligned} \#({\mathcal {D}}^C(\gamma _{\mathrm{inc}}^{-1}\varDelta )\cap \mathcal {VS}) \le C_3\cdot \#({\mathcal {D}}^C(\min (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}}, \beta _F)^{-1}\varDelta ){\setminus } \mathcal {VS}), \end{aligned}$$
(20)

for all \(\varDelta \le \min (\varDelta _0, \gamma _{\mathrm{inc}}^{-1}\varDelta _{\max })\), where

$$\begin{aligned} C_3 := \frac{\log (1/\min (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}}, \beta _F))}{\log (\gamma _{\mathrm{inc}})}. \end{aligned}$$
(21)

Proof

See Appendix A.6. \(\square \)
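The constants \(C_1\), \(C_2\) and \(C_3\) in (19) and (21) depend only on the trust-region updating parameters. As a small sketch, the function below evaluates them; the function name and the parameter values in the example are hypothetical, and we assume \(\gamma _C,\gamma _F,\gamma _{\mathrm{dec}},\beta _F\in (0,1)\) and \({\overline{\gamma }}_{\mathrm{inc}}\ge \gamma _{\mathrm{inc}}>1\).

```python
import math


def lemma_constants(gamma_C, gamma_F, gamma_dec, beta_F,
                    gamma_inc, gamma_inc_bar, delta0, delta):
    """Evaluate C_1, C_2 from (19) and C_3 from (21)."""
    g_max = max(gamma_C, gamma_F, gamma_dec)
    g_min = min(gamma_C, gamma_F, gamma_dec, beta_F)
    C1 = math.log(gamma_inc_bar) / math.log(1.0 / g_max)
    C2 = math.log(delta0 / delta) / math.log(1.0 / g_max)
    C3 = math.log(1.0 / g_min) / math.log(gamma_inc)
    return C1, C2, C3


# Illustrative parameter values only (not recommended defaults).
C1, C2, C3 = lemma_constants(
    gamma_C=0.5, gamma_F=0.5, gamma_dec=0.5, beta_F=0.5,
    gamma_inc=5.0, gamma_inc_bar=6.0, delta0=1.0, delta=0.01,
)
```

With these values \(\gamma _{\mathrm{inc}}=5>0.5^{-2}=4\), so \(C_3=\log 2/\log 5<1/2\).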

Lemma 8

Suppose Assumptions 1, 2 and 3 hold. Then

$$\begin{aligned} \#({\mathcal {A}}\cap {\mathcal {D}}^C(\varDelta )\cap {\mathcal {L}}^C{\setminus }\mathcal {VS}) \le \#({\mathcal {A}}\cap {\mathcal {D}}^C(\varDelta )\cap {\mathcal {L}}) + 1, \end{aligned}$$
(22)

for all \(\varDelta > 0\).

Proof

See Appendix A.7. \(\square \)

We are now in a position to bound the total number of well-aligned iterations.

Lemma 9

Suppose Assumptions 1, 2 and 3 hold, and both \(\beta _F\le c_2\) and \(\gamma _{\mathrm{inc}} > \min (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}}, \beta _F)^{-2}\) hold. Then if \(\Vert \nabla f({\varvec{x}}_k)\Vert \ge \epsilon \) for all \(k=0,\ldots ,K\), we have

$$\begin{aligned} \#({\mathcal {A}})&\le \psi (\epsilon ) + \frac{C_4}{1+C_4} (K+1), \end{aligned}$$
(23)

where

$$\begin{aligned} \psi (\epsilon )&:=\frac{1}{1+C_4}\left[ (C_1+2)\phi (\varDelta _{\min }(\epsilon ), \epsilon )\right. \nonumber \\&\quad +\, \frac{4\phi (\gamma _{\mathrm{inc}}^{-1}\min (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}}, \beta _F)\varDelta _{\min }(\epsilon ), \epsilon )}{1-2C_3} \end{aligned}$$
(24)
$$\begin{aligned}&\quad \left. +\, C_2 + \frac{2}{1-2C_3} + 1\right] , \end{aligned}$$
(25)
$$\begin{aligned} \varDelta _{\min }(\epsilon )&:=\min \left( {\overline{\gamma }}_{\mathrm{inc}}^{-1}\varDelta _0, \min (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}}){\overline{\gamma }}_{\mathrm{inc}}^{-1}\varDelta ^*(\epsilon )\right) , \end{aligned}$$
(26)
$$\begin{aligned} C_4&:=\max \left( C_1, \frac{4C_3}{1-2C_3}\right) > 0. \end{aligned}$$
(27)

In these expressions, the values \(C_1\) and \(C_2\) are defined in Lemma 6, \(C_3\) is defined in Lemma 7, \(\phi (\cdot ,\epsilon )\) is defined in Lemma 4, and \(\epsilon _g(\epsilon )\) and \(\varDelta ^*(\epsilon )\) are defined in Lemmas 3 and 5 respectively.
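As a sketch of how (23)–(27) fit together, the code below evaluates \(C_4\), \(\psi (\epsilon )\) and the resulting bound on \(\#({\mathcal {A}})\). To keep the example self-contained, the two values of \(\phi \) appearing in (24) are passed in as plain numbers (`phi_large` and `phi_small`, both hypothetical names), and \(C_1\), \(C_2\), \(C_3\) are assumed to have been computed already.

```python
def c4(C1, C3):
    """Constant C_4 from (27); requires C_3 in (0, 1/2)."""
    if not 0.0 < C3 < 0.5:
        raise ValueError("C_3 must lie in (0, 1/2)")
    return max(C1, 4.0 * C3 / (1.0 - 2.0 * C3))


def psi(phi_large, phi_small, C1, C2, C3):
    """psi(eps) from (24)-(25).

    phi_large stands for phi(Delta_min(eps), eps) and phi_small for
    phi(gamma_inc^{-1} min(gamma_C, gamma_F, gamma_dec, beta_F) Delta_min(eps), eps).
    """
    bracket = ((C1 + 2.0) * phi_large
               + 4.0 * phi_small / (1.0 - 2.0 * C3)
               + C2 + 2.0 / (1.0 - 2.0 * C3) + 1.0)
    return bracket / (1.0 + c4(C1, C3))


def aligned_iterations_bound(K, phi_large, phi_small, C1, C2, C3):
    """Right-hand side of (23): upper bound on the number of well-aligned iterations."""
    C4 = c4(C1, C3)
    return psi(phi_large, phi_small, C1, C2, C3) + C4 / (1.0 + C4) * (K + 1)
```

For instance, with the made-up inputs `C1 = 2`, `C2 = 3`, `C3 = 0.25`, `phi_large = 10` and `phi_small = 20`, we get \(C_4=2\) and \(\psi =208/3\).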

Proof

For ease of notation, we will write \(\varDelta _{\min }\) in place of \(\varDelta _{\min }(\epsilon )\). We begin by noting that \(\gamma _{\mathrm{inc}} > \min (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}}, \beta _F)^{-2}\) implies that \(C_3\in (0,1/2)\), which we will use later.

Next, we have

$$\begin{aligned} \#({\mathcal {A}}\cap {\mathcal {D}}(\varDelta _{\min }))&= \#({\mathcal {A}}\cap {\mathcal {D}}(\varDelta _{\min })\cap {\mathcal {S}}) + \#({\mathcal {A}}\cap {\mathcal {D}}(\varDelta _{\min }){\setminus }{\mathcal {S}}), \end{aligned}$$
(28)
$$\begin{aligned}&\le \phi (\varDelta _{\min }, \epsilon ) + \#({\mathcal {A}}\cap {\mathcal {D}}(\max (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}})^{-1}{\overline{\gamma }}_{\mathrm{inc}}\varDelta _{\min }){\setminus }{\mathcal {S}}) \nonumber \\&\quad + \#({\mathcal {A}}\cap {\mathcal {D}}(\varDelta _{\min })\cap {\mathcal {D}}^C(\max (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}})^{-1}{\overline{\gamma }}_{\mathrm{inc}}\varDelta _{\min }){\setminus }{\mathcal {S}}), \end{aligned}$$
(29)
$$\begin{aligned}&\le \phi (\varDelta _{\min }, \epsilon ) + C_1 \#({\mathcal {D}}(\varDelta _{\min })\cap {\mathcal {S}}) + C_2 \nonumber \\&\quad + \#({\mathcal {A}}\cap {\mathcal {D}}^C(\max (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}})^{-1}{\overline{\gamma }}_{\mathrm{inc}}\varDelta _{\min }){\setminus }{\mathcal {S}}), \end{aligned}$$
(30)
$$\begin{aligned}&= \phi (\varDelta _{\min }, \epsilon ) + C_1 \#({\mathcal {D}}(\varDelta _{\min })\cap {\mathcal {S}}) + C_2 \nonumber \\&\quad + \#({\mathcal {A}}\cap {\mathcal {D}}^C(\max (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}})^{-1}{\overline{\gamma }}_{\mathrm{inc}}\varDelta _{\min })\cap {\mathcal {L}}{\setminus }{\mathcal {S}}) \nonumber \\&\quad + \#({\mathcal {A}}\cap {\mathcal {D}}^C(\max (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}})^{-1}{\overline{\gamma }}_{\mathrm{inc}}\varDelta _{\min })\cap {\mathcal {L}}^C{\setminus }{\mathcal {S}}), \end{aligned}$$
(31)
$$\begin{aligned}&\le \phi (\varDelta _{\min }, \epsilon ) + C_1 \#({\mathcal {D}}(\varDelta _{\min })\cap {\mathcal {S}}) + C_2 \nonumber \\&\qquad + 2 \#({\mathcal {A}}\cap {\mathcal {D}}^C(\max (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}})^{-1}{\overline{\gamma }}_{\mathrm{inc}}\varDelta _{\min })\cap {\mathcal {L}}{\setminus }{\mathcal {S}}) \nonumber \\&\qquad + \#({\mathcal {A}}\cap {\mathcal {D}}^C(\max (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}})^{-1}{\overline{\gamma }}_{\mathrm{inc}}\varDelta _{\min })\cap {\mathcal {L}}\cap {\mathcal {S}}) + 1, \end{aligned}$$
(32)

where the first inequality follows from Lemma 4, the second inequality follows from Lemma 6 and \(\varDelta _{\min }\le {\overline{\gamma }}_{\mathrm{inc}}^{-1}\varDelta _0\), and the last line follows from Lemma 8 and \(\mathcal {VS}\subset {\mathcal {S}}\). Now we use Lemma 5 with \(\varDelta _{\min }\le \max (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}}){\overline{\gamma }}_{\mathrm{inc}}^{-1}\varDelta ^*(\epsilon )\) to get

$$\begin{aligned} \#({\mathcal {A}}\cap {\mathcal {D}}(\varDelta _{\min }))&\le \phi (\varDelta _{\min }, \epsilon ) + C_1 \#({\mathcal {D}}(\varDelta _{\min })\cap {\mathcal {S}}) + C_2 \nonumber \\&\quad + \#({\mathcal {A}}\cap {\mathcal {D}}^C(\max (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}})^{-1}{\overline{\gamma }}_{\mathrm{inc}}\varDelta _{\min })\cap {\mathcal {L}}\cap {\mathcal {S}}) + 1, \end{aligned}$$
(33)
$$\begin{aligned}&= \phi (\varDelta _{\min }, \epsilon ) + C_1 \#({\mathcal {A}}\cap {\mathcal {D}} (\varDelta _{\min })\cap {\mathcal {S}}) \nonumber \\&\quad + C_1 \#({\mathcal {A}}^C\cap {\mathcal {D}} (\varDelta _{\min })\cap {\mathcal {S}}) + C_2 \nonumber \\&\quad + \#({\mathcal {A}}\cap {\mathcal {D}}^C(\varDelta _{\min })\cap {\mathcal {L}} \cap {\mathcal {S}}) \nonumber \\&\quad + \#({\mathcal {A}}\cap {\mathcal {D}}(\varDelta _{\min })\cap {\mathcal {D}}^C (\max (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}})^{-1}{\overline{\gamma }}_{\mathrm{inc}} \varDelta _{\min })\nonumber \\&\quad \cap {\mathcal {L}}\cap {\mathcal {S}}) + 1, \end{aligned}$$
(34)
$$\begin{aligned}&\le \phi (\varDelta _{\min }, \epsilon ) + C_1 \#({\mathcal {A}}\cap {\mathcal {D}} (\varDelta _{\min })\cap {\mathcal {S}}) \nonumber \\&\quad + C_1 \#({\mathcal {A}}^C\cap {\mathcal {D}} (\varDelta _{\min })\cap {\mathcal {S}}) + C_2 \nonumber \\&\quad + \#({\mathcal {A}}\cap {\mathcal {D}}^C(\varDelta _{\min })) + \#({\mathcal {A}}\cap {\mathcal {D}}(\varDelta _{\min })\cap {\mathcal {S}}) + 1, \end{aligned}$$
(35)
$$\begin{aligned}&\le (C_1+2)\phi (\varDelta _{\min }, \epsilon ) + C_1 \#({\mathcal {A}}^C \cap {\mathcal {D}}(\varDelta _{\min })\cap {\mathcal {S}}) + C_2 \nonumber \\&\quad + \#({\mathcal {A}}\cap {\mathcal {D}}^C(\varDelta _{\min })) + 1, \end{aligned}$$
(36)

where the last line follows from Lemma 4.

Separately, we use Lemma 8, and apply Lemma 5 with \(\varDelta _{\min }\le \varDelta ^*(\epsilon )\) to get

$$\begin{aligned} \#({\mathcal {A}}\cap {\mathcal {D}}^C(\varDelta _{\min }))&= \#({\mathcal {A}} \cap {\mathcal {D}}^C(\varDelta _{\min })\cap \mathcal {VS}) + \#({\mathcal {A}} \cap {\mathcal {D}}^C(\varDelta _{\min })\cap {\mathcal {L}}{\setminus }\mathcal {VS}) \nonumber \\&\quad + \#({\mathcal {A}}\cap {\mathcal {D}}^C(\varDelta _{\min })\cap {\mathcal {L}}^C {\setminus }\mathcal {VS}), \end{aligned}$$
(37)
$$\begin{aligned}&\le \#({\mathcal {A}}\cap {\mathcal {D}}^C(\varDelta _{\min }) \cap \mathcal {VS}) + \#({\mathcal {A}}\cap {\mathcal {D}}^C(\varDelta _{\min }) \cap {\mathcal {L}}{\setminus }\mathcal {VS}) \nonumber \\&\quad + \#({\mathcal {A}}\cap {\mathcal {D}}^C(\varDelta _{\min }) \cap {\mathcal {L}}) + 1, \end{aligned}$$
(38)
$$\begin{aligned}&= \#({\mathcal {A}}\cap {\mathcal {D}}^C(\varDelta _{\min }) \cap \mathcal {VS}) + 2 \#({\mathcal {A}}\cap {\mathcal {D}}^C(\varDelta _{\min }) \cap {\mathcal {L}}{\setminus }\mathcal {VS}) \nonumber \\&\quad + \#({\mathcal {A}}\cap {\mathcal {D}}^C(\varDelta _{\min }) \cap {\mathcal {L}}\cap \mathcal {VS}) + 1, \end{aligned}$$
(39)
$$\begin{aligned}&= \#({\mathcal {A}}\cap {\mathcal {D}}^C(\varDelta _{\min })\cap \mathcal {VS}) + \#({\mathcal {A}}\cap {\mathcal {D}}^C(\varDelta _{\min }) \cap {\mathcal {L}}\cap \mathcal {VS}) + 1, \end{aligned}$$
(40)
$$\begin{aligned}&\le 2\#({\mathcal {A}}\cap {\mathcal {D}}^C(\varDelta _{\min })\cap \mathcal {VS}) + 1. \end{aligned}$$
(41)

We then get

$$\begin{aligned} \#({\mathcal {A}}\cap {\mathcal {D}}^C(\varDelta _{\min }))&\le 2\#({\mathcal {A}}\cap {\mathcal {D}}^C(\gamma _{\mathrm{inc}}^{-1} \min (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}}, \beta _F)\varDelta _{\min } )\cap \mathcal {VS}) \nonumber \\&\quad + 2\#({\mathcal {A}}\cap {\mathcal {D}}(\gamma _{\mathrm{inc}}^{-1} \min (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}}, \beta _F)\varDelta _{\min })\nonumber \\&\quad \cap {\mathcal {D}}^C(\varDelta _{\min })\cap \mathcal {VS}) + 1, \end{aligned}$$
(42)
$$\begin{aligned}&\le 2\#({\mathcal {D}}^C(\gamma _{\mathrm{inc}}^{-1}\min (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}}, \beta _F)\varDelta _{\min })\cap \mathcal {VS}) \nonumber \\&\quad + 2\#({\mathcal {A}}\cap {\mathcal {D}}(\gamma _{\mathrm{inc}}^{-1} \min (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}}, \beta _F)\varDelta _{\min })\cap \mathcal {VS}) + 1, \end{aligned}$$
(43)
$$\begin{aligned}&\le 2 C_3 \#({\mathcal {D}}^C(\varDelta _{\min }){\setminus }\mathcal {VS})\nonumber \\&\quad + 2\phi (\gamma _{\mathrm{inc}}^{-1}\min (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}}, \beta _F)\varDelta _{\min }, \epsilon ) + 1, \end{aligned}$$
(44)
$$\begin{aligned}&= 2 C_3 \#({\mathcal {A}}\cap {\mathcal {D}}^C(\varDelta _{\min }) {\setminus }\mathcal {VS}) + 2 C_3 \#({\mathcal {A}}^C\cap {\mathcal {D}}^C( \varDelta _{\min }){\setminus }\mathcal {VS}) \nonumber \\&\quad + 2\phi (\gamma _{\mathrm{inc}}^{-1}\min (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}}, \beta _F)\varDelta _{\min }, \epsilon ) + 1, \end{aligned}$$
(45)
$$\begin{aligned}&\le 2 C_3 \#({\mathcal {A}}\cap {\mathcal {D}}^C(\varDelta _{\min })) + 2 C_3 \#({\mathcal {A}}^C\cap {\mathcal {D}}^C(\varDelta _{\min }) {\setminus }\mathcal {VS}) \nonumber \\&\quad + 2\phi (\gamma _{\mathrm{inc}}^{-1}\min (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}}, \beta _F)\varDelta _{\min }, \epsilon ) + 1, \end{aligned}$$
(46)

where the third inequality follows from Lemmas 4 and 7 with

$$\begin{aligned} \varDelta _{\min }\le & {} {\overline{\gamma }}_{\mathrm{inc}}^{-1}\varDelta _0 \le \gamma _{\mathrm{inc}}^{-1}\varDelta _0 \le \min (\varDelta _0, \gamma _{\mathrm{inc}}^{-1}\varDelta _{\max }) \nonumber \\\le & {} \min (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}}, \beta _F)^{-1}\min (\varDelta _0, \gamma _{\mathrm{inc}}^{-1}\varDelta _{\max }). \end{aligned}$$
(47)

Since \(C_3\in (0,1/2)\), we can rearrange (46) to conclude that

$$\begin{aligned}&\#({\mathcal {A}}\cap {\mathcal {D}}^C(\varDelta _{\min })) \nonumber \\&\quad \le \frac{1}{1-2C_3}\left[ 2 C_3 \#({\mathcal {A}}^C\cap {\mathcal {D}}^C(\varDelta _{\min }){\setminus }\mathcal {VS})\right. \nonumber \\&\qquad \left. +\, 2\phi (\gamma _{\mathrm{inc}}^{-1}\min (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}}, \beta _F)\varDelta _{\min }, \epsilon ) + 1\right] . \end{aligned}$$
(48)

Now, we combine (36) and (48) to get

$$\begin{aligned} \#({\mathcal {A}})&= \#({\mathcal {A}}\cap {\mathcal {D}}(\varDelta _{\min })) + \#({\mathcal {A}}\cap {\mathcal {D}}^C(\varDelta _{\min })), \end{aligned}$$
(49)
$$\begin{aligned}&\le (C_1+2)\phi (\varDelta _{\min }, \epsilon ) + C_1 \#({\mathcal {A}}^C\cap {\mathcal {D}}(\varDelta _{\min })\cap {\mathcal {S}})\nonumber \\&\quad + C_2 + 2\#({\mathcal {A}}\cap {\mathcal {D}}^C(\varDelta _{\min })) + 1, \end{aligned}$$
(50)
$$\begin{aligned}&\le (C_1+2)\phi (\varDelta _{\min }, \epsilon ) + \frac{4\phi (\gamma _{\mathrm{inc}}^{-1}\min (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}}, \beta _F)\varDelta _{\min }, \epsilon )}{1-2C_3} \nonumber \\&\quad + C_1 \#({\mathcal {A}}^C\cap {\mathcal {D}}(\varDelta _{\min })\cap {\mathcal {S}}) \nonumber \\&\quad + C_2 + \frac{2}{1-2C_3} + 1 + \frac{4C_3}{1-2C_3} \#({\mathcal {A}}^C\cap {\mathcal {D}}^C(\varDelta _{\min }){\setminus }\mathcal {VS}),\end{aligned}$$
(51)
$$\begin{aligned}&\le (C_1+2)\phi (\varDelta _{\min }, \epsilon ) + \frac{4\phi (\gamma _{\mathrm{inc}}^{-1}\min (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}}, \beta _F)\varDelta _{\min }, \epsilon )}{1-2C_3}\nonumber \\&\quad + C_2 + \frac{2}{1-2C_3} + 1 \nonumber \\&\quad + \max \left( C_1, \frac{4C_3}{1-2C_3}\right) \left[ \#({\mathcal {A}}^C \cap {\mathcal {D}}(\varDelta _{\min })\cap {\mathcal {S}}) \!+\! \#({\mathcal {A}}^C\cap {\mathcal {D}}^C(\varDelta _{\min }) {\setminus }\mathcal {VS})\right] . \end{aligned}$$
(52)

Since \({\mathcal {A}}^C\cap {\mathcal {D}}(\varDelta _{\min })\cap {\mathcal {S}}\) and \({\mathcal {A}}^C\cap {\mathcal {D}}^C(\varDelta _{\min }){\setminus }\mathcal {VS}\) are disjoint subsets of \({\mathcal {A}}^C\), we have

$$\begin{aligned}&\#({\mathcal {A}}^C\cap {\mathcal {D}}(\varDelta _{\min })\cap {\mathcal {S}}) +\#({\mathcal {A}}^C\cap {\mathcal {D}}^C(\varDelta _{\min }){\setminus }\mathcal {VS}) \le \#({\mathcal {A}}^C) \nonumber \\&\quad = (K+1) - \#({\mathcal {A}}). \end{aligned}$$
(53)

Substituting this into (52) and rearranging, we get the desired result. That \(C_4>0\) follows from \(C_1>0\) and \(C_3\in (0,1/2)\). \(\square \)
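For completeness, the rearrangement used in the last step of the proof of Lemma 9 can be written out explicitly. Here B is shorthand (introduced only for this remark) for the bracketed term in (24)–(25), so that \(\psi (\epsilon ) = B/(1+C_4)\) and the first five terms on the right-hand side of (52) sum to B. Substituting (53) into (52) gives

$$\begin{aligned} \#({\mathcal {A}})&\le B + C_4\left[ (K+1) - \#({\mathcal {A}})\right] , \\ (1+C_4)\, \#({\mathcal {A}})&\le B + C_4 (K+1), \\ \#({\mathcal {A}})&\le \psi (\epsilon ) + \frac{C_4}{1+C_4}(K+1), \end{aligned}$$

which is (23).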

2.4 Overall complexity bound

The key remaining step is to compare \(\#({\mathcal {A}})\) with K. Since each event “\(Q_k\) is well-aligned” is effectively an independent Bernoulli trial with success probability at least \(1-\delta _S\), we derive the result below from a concentration bound for Bernoulli trials [25, Lemma 2.1].

Lemma 10

Suppose Assumptions 1, 2, 3 and 4 hold. Then we have

$$\begin{aligned} {\mathbb {P}}\left[ \#({\mathcal {A}}) + 1 \le (1-\delta _S)(1-\delta )(K+1)\right] \le e^{-\delta ^2 (1-\delta _S)K / 4}, \end{aligned}$$
(54)

for all \(\delta \in (0,1)\).

Proof

The CHECK_MODEL=FALSE case of this proof has a general framework based on [46,  Lemma 4.5]—also followed in [17, 72]—with a probabilistic argument from [25,  Lemma 2.1].

First, we consider only the subsequence of iterations \({\mathcal {K}}_1:=\{k_0,\ldots ,k_J\}\subset \{0,\ldots ,K\}\) when \(Q_k\) is resampled (i.e. where CHECK_MODEL=FALSE, so \(Q_k\ne Q_{k-1}\)). For convenience, we define \({\mathcal {A}}_1 :={\mathcal {A}}\cap {\mathcal {K}}_1\) and \({\mathcal {A}}^C_1 :={\mathcal {A}}^C\cap {\mathcal {K}}_1\).

Let \(T_{k_j}\) be the indicator function for the event “\(Q_{k_j}\) is well-aligned”, and so \(\#({\mathcal {A}}_1) = \sum _{j=0}^{J}T_{k_j}\). Since \(T_{k_j}\in \{0,1\}\), and denoting \(p_{k_j}:={\mathbb {P}}\left[ T_{k_j}=1 |\, {\varvec{x}}_{k_j}\right] \), for any \(t>0\) we have

$$\begin{aligned} {\mathbb {E}}\left[ e^{-t (T_{k_j}-p_{k_j})} |\, {\varvec{x}}_{k_j}\right]= & {} p_{k_j} e^{-t (1-p_{k_j})} + (1-p_{k_j})e^{t p_{k_j}} \nonumber \\= & {} e^{t p_{k_j} + \log (1-p_{k_j}+p_{k_j} e^{-t})} \le e^{t^2 p_{k_j} / 2}, \end{aligned}$$
(55)

where the inequality follows from the bound \( px + \log (1-p+pe^{-x}) \le px^2/2\), valid for all \(p\in [0,1]\) and \(x\ge 0\), shown in [25,  Lemma 2.1].
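The scalar bound \(px + \log (1-p+pe^{-x}) \le px^2/2\) used in (55) is easy to sanity-check numerically. The sketch below evaluates the gap between the two sides on a grid of \((p,x)\) values; the grid sizes and the cut-off `x_max` are arbitrary choices, and a grid check is of course not a proof.

```python
import math


def lhs(p, x):
    """Left-hand side p*x + log(1 - p + p*exp(-x))."""
    return p * x + math.log(1.0 - p + p * math.exp(-x))


def worst_gap(n_p=101, n_x=201, x_max=20.0):
    """Largest value of lhs(p, x) - p*x^2/2 over a grid in [0, 1] x [0, x_max]."""
    worst = -math.inf
    for i in range(n_p):
        p = i / (n_p - 1)
        for j in range(n_x):
            x = x_max * j / (n_x - 1)
            worst = max(worst, lhs(p, x) - p * x * x / 2.0)
    return worst
```

Up to rounding error, the largest gap on the grid is zero, attained when \(x=0\) or \(p=0\), where both sides vanish.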

Using the tower property of conditional expectations and the fact that, since \(k_j\in {\mathcal {K}}_1\), \(T_{k_j}\) only depends on \({\varvec{x}}_{k_j}\) and not any previous iteration, we then get

$$\begin{aligned}&{\mathbb {E}}\left[ e^{-t (\#({\mathcal {A}}_1)-\sum _{j=0}^{J} p_{k_j})}\right] \nonumber \\&\quad = {\mathbb {E}}\left[ e^{-t \sum _{j=0}^{J} (T_{k_j} - p_{k_j})}\right] , \end{aligned}$$
(56)
$$\begin{aligned}&\quad = {\mathbb {E}}\left[ {\mathbb {E}}\left[ e^{-t \sum _{j=0}^{J}(T_{k_j} - p_{k_j})} | \, Q_0,\ldots ,Q_{k_J-1},{\varvec{x}}_0,\ldots ,{\varvec{x}}_{k_J} \right] \right] , \end{aligned}$$
(57)
$$\begin{aligned}&\quad = {\mathbb {E}}\left[ e^{-t \sum _{j=0}^{J-1} (T_{k_j} - p_{k_j})} {\mathbb {E}}\left[ e^{-t (T_{k_J} - p_{k_J})} | \, Q_0,\ldots ,Q_{k_J-1},{\varvec{x}}_0,\ldots ,{\varvec{x}}_{k_J}\right] \right] , \end{aligned}$$
(58)
$$\begin{aligned}&\quad = {\mathbb {E}}\left[ e^{-t \sum _{j=0}^{J-1} (T_{k_j} - p_{k_j})} {\mathbb {E}}\left[ e^{-t (T_{k_J} - p_{k_J})} | \, {\varvec{x}}_{k_J}\right] \right] , \end{aligned}$$
(59)
$$\begin{aligned}&\quad \le e^{t^2 p_{k_J}/2} {\mathbb {E}}\left[ e^{-t \sum _{j=0}^{J-1} (T_{k_j} - p_{k_j})}\right] , \end{aligned}$$
(60)
$$\begin{aligned}&\quad \le e^{t^2 (\sum _{j=0}^{J} p_{k_j}) /2}, \end{aligned}$$
(61)

where the second-last line follows from (55) and the last line follows by induction. This means that

$$\begin{aligned} {\mathbb {P}}\left[ \#({\mathcal {A}}_1) \le \sum _{j=0}^{J} p_{k_j} - \lambda \right]&= {\mathbb {P}}\left[ e^{-t \left( \#({\mathcal {A}}_1)-\sum _{j=0}^{J} p_{k_j}\right) } > e^{t\lambda } \right] , \end{aligned}$$
(62)
$$\begin{aligned}&\le e^{-t\lambda } {\mathbb {E}}\left[ e^{-t \left( \#({\mathcal {A}}_1)-\sum _{j=0}^{J} p_{k_j}\right) }\right] , \end{aligned}$$
(63)
$$\begin{aligned}&\le e^{t^2 \left( \sum _{j=0}^{J} p_{k_j}\right) /2 - t\lambda }, \end{aligned}$$
(64)

where the inequalities follow from Markov’s inequality and (61) respectively. Taking \(t=\lambda / \sum _{j=0}^{J} p_{k_j}\), we get

$$\begin{aligned} {\mathbb {P}}\left[ \#({\mathcal {A}}_1) \le \sum _{j=0}^{J} p_{k_j} - \lambda \right] \le e^{-\lambda ^2 / \left( 2\sum _{j=0}^{J} p_{k_j}\right) }. \end{aligned}$$
(65)
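To give a feel for how tight (65) is, the sketch below compares it with the exact lower tail in the simplest special case, where the trials are independent with a common success probability p (so \(\sum _j p_{k_j} = np\) with \(n=J+1\)). This i.i.d. setting is an assumption made for the example only; the proof above covers the more general conditional setting. The function names are hypothetical.

```python
import math


def binomial_lower_tail(n, p, m):
    """Exact P[Bin(n, p) <= m] for real m."""
    m = min(n, math.floor(m))
    if m < 0:
        return 0.0
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(m + 1))


def tail_bound(n, p, lam):
    """Right-hand side of (65) with sum_j p_{k_j} = n * p."""
    return math.exp(-lam**2 / (2.0 * n * p))


def check(n, p, lam):
    """Return (exact tail, bound) for the event #(A_1) <= n*p - lam."""
    return binomial_lower_tail(n, p, n * p - lam), tail_bound(n, p, lam)
```

For example, `check(50, 0.9, 5.0)` compares the exact probability of at most 40 successes in 50 trials with the bound \(e^{-25/90}\); the bound holds but is far from tight in this regime.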

Finally, we take \(\lambda =\delta \sum _{j=0}^{J} p_{k_j}\) for some \(\delta \in (0,1)\) and note that \(p_{k_j}\ge (1-\delta _S)\) (from Assumption 4), to conclude

$$\begin{aligned} {\mathbb {P}}\left[ \#({\mathcal {A}}_1) \le (1-\delta )(1-\delta _S)(J+1)\right]\le & {} {\mathbb {P}}\left[ \#({\mathcal {A}}_1) \le (1-\delta )\sum _{j=0}^{J} p_{k_j}\right] \nonumber \\\le & {} e^{-\delta ^2 \left( \sum _{j=0}^{J} p_{k_j}\right) /2}, \end{aligned}$$
(66)

or equivalently, using the partition \({\mathcal {K}}_1 = {\mathcal {A}}_1 \cup {\mathcal {A}}_1^C\),

$$\begin{aligned} {\mathbb {P}}\left[ \#({\mathcal {A}}_1) \le (1-\delta )(1-\delta _S)[\#({\mathcal {A}}_1) + \#({\mathcal {A}}_1^C)]\right] \le e^{-\delta ^2 (1-\delta _S)[\#({\mathcal {A}}_1) + \#({\mathcal {A}}_1^C)]/2}. \nonumber \\ \end{aligned}$$
(67)

Now we must consider the iterations for which CHECK_MODEL=TRUE (so \(Q_k=Q_{k-1}\)), which we denote \({\mathcal {K}}_1^C\). The algorithm ensures that if \(k\in {\mathcal {K}}_1^C\), then \(k+1\in {\mathcal {K}}_1\) (unless we are in the last iteration we consider, \(k=K\)). Further, the algorithm guarantees that if \(k\in {\mathcal {K}}_1^C\), then \(k>0\) and \(k\in {\mathcal {A}}\) if and only if \(k-1\in {\mathcal {A}}\). These are the key implications of RSDFO that we will now use.

Firstly, we have \(\#({\mathcal {K}}_1^C) \le \#({\mathcal {K}}_1)+1\), and so

$$\begin{aligned} K+1 = \#({\mathcal {K}}_1) + \#({\mathcal {K}}_1^C) \le 2[\#({\mathcal {A}}_1) + \#({\mathcal {A}}_1^C)] + 1, \end{aligned}$$
(68)

which means (67) becomes

$$\begin{aligned} {\mathbb {P}}\left[ \#({\mathcal {A}}_1) \le (1-\delta )(1-\delta _S)[\#({\mathcal {A}}_1) + \#({\mathcal {A}}_1^C)]\right] \le e^{-\delta ^2 (1-\delta _S)K/4}. \end{aligned}$$
(69)

Setting \(\alpha :=\delta + \delta _S - \delta \delta _S\), we have \((1-\delta )(1-\delta _S) = 1-\alpha \), and so

$$\begin{aligned} {\mathbb {P}}\left[ \#({\mathcal {A}}_1) \le \frac{1-\alpha }{\alpha } \#({\mathcal {A}}_1^C)\right]= & {} {\mathbb {P}}\left[ \#({\mathcal {A}}_1) \le (1-\alpha )[\#({\mathcal {A}}_1) + \#({\mathcal {A}}_1^C)]\right] \nonumber \\\le & {} e^{-\delta ^2 (1-\delta _S)K/4}. \end{aligned}$$
(70)

Secondly, we have \(\#({\mathcal {K}}_1^C \cap {\mathcal {A}}^C) \le \#({\mathcal {A}}_1^C)+1\), and so \(\#({\mathcal {A}}^C) \le 2\#({\mathcal {A}}_1^C)+1\). This and \({\mathcal {A}}_1\subset {\mathcal {A}}\) give

$$\begin{aligned} {\mathbb {P}}\left[ \#({\mathcal {A}}) \le \frac{1-\alpha }{2\alpha } [\#({\mathcal {A}}^C)-1]\right] \le e^{-\delta ^2 (1-\delta _S)K/4}. \end{aligned}$$
(71)

We then note that \(K+1=\#({\mathcal {A}})+\#({\mathcal {A}}^C)\), and so

$$\begin{aligned}&{\mathbb {P}}\left[ \#({\mathcal {A}}) \le \frac{1-\alpha }{2\alpha } [K+1-\#({\mathcal {A}})-1]\right] \le e^{-\delta ^2 (1-\delta _S)K/4}, \end{aligned}$$
(72)
$$\begin{aligned}&{\mathbb {P}}\left[ \#({\mathcal {A}}) + \frac{1-\alpha }{1+\alpha } \le \frac{1-\alpha }{1+\alpha } (K+1)\right] \le e^{-\delta ^2 (1-\delta _S)K/4}, \end{aligned}$$
(73)
$$\begin{aligned}&{\mathbb {P}}\left[ \#({\mathcal {A}}) + 1 \le (1-\alpha )(K+1)\right] \le e^{-\delta ^2 (1-\delta _S)K/4}, \end{aligned}$$
(74)

since \(\alpha >0\). \(\square \)

Theorem 1

Suppose Assumptions 1, 2, 3 and 4 hold, and we have \(\beta _F\le c_2\), \(\delta _S < 1/(1+C_4)\) for \(C_4\) defined in Lemma 9, and \(\gamma _{\mathrm{inc}} > \min (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}}, \beta _F)^{-2}\). Then for any \(\epsilon >0\) and

$$\begin{aligned} k \ge \frac{2(\psi (\epsilon )+1)}{1-\delta _S-C_4/(1+C_4)}, \end{aligned}$$
(75)

we have

$$\begin{aligned} {\mathbb {P}}\left[ \min _{j\le k} \Vert \nabla f({\varvec{x}}_j)\Vert \le \epsilon \right] \ge 1 - \exp \left( -k\frac{(1-\delta _S-C_4/(1+C_4))^2}{16(1-\delta _S)}\right) . \end{aligned}$$
(76)

Alternatively, if \(K_{\epsilon } :=\min \{k : \Vert \nabla f({\varvec{x}}_k)\Vert \le \epsilon \}\) for any \(\epsilon >0\), then

$$\begin{aligned}&{\mathbb {P}}\left[ K_{\epsilon } \le \left\lceil \frac{2(\psi (\epsilon )+1)}{1-\delta _S-C_4/(1+C_4)} \right\rceil \right] \nonumber \\&\quad \ge 1 - \exp \left( -\frac{(\psi (\epsilon )+1) [1-\delta _S-C_4/(1+C_4)]}{8(1-\delta _S)}\right) , \end{aligned}$$
(77)

where \(\psi (\epsilon )\) is defined in Lemma 9.
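The two quantities in Theorem 1 are straightforward to evaluate once \(\psi (\epsilon )\), \(\delta _S\) and \(C_4\) are known. The following sketch computes the iteration threshold (75) and the probability lower bound (76); the function names are hypothetical and the inputs are treated as given numbers.

```python
import math


def gap(delta_S, C4):
    """The quantity 1 - delta_S - C_4/(1+C_4), positive iff delta_S < 1/(1+C_4)."""
    g = 1.0 - delta_S - C4 / (1.0 + C4)
    if g <= 0.0:
        raise ValueError("Theorem 1 requires delta_S < 1/(1+C_4)")
    return g


def iteration_threshold(psi_eps, delta_S, C4):
    """Right-hand side of (75)."""
    return 2.0 * (psi_eps + 1.0) / gap(delta_S, C4)


def success_probability_lower_bound(k, delta_S, C4):
    """Right-hand side of (76)."""
    g = gap(delta_S, C4)
    return 1.0 - math.exp(-k * g**2 / (16.0 * (1.0 - delta_S)))
```

For instance, with the made-up values \(\psi (\epsilon )=9\), \(\delta _S=0.25\) and \(C_4=1\), the threshold is \(k\ge 80\), and at \(k=80\) the bound (76) evaluates to \(1-e^{-5/12}\), which coincides with the right-hand side of (77).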

Proof

First, fix some arbitrary \(k\ge 0\). Let \(\epsilon _k := \min _{j\le k}\Vert \nabla f({\varvec{x}}_j)\Vert \) and \(A_k\) be the number of well-aligned iterations in \(\{0,\ldots ,k\}\). If \(\epsilon _k>0\), from Lemma 9, we have

$$\begin{aligned} A_k \le \psi (\epsilon _k) + \frac{C_4}{1+C_4} (k+1). \end{aligned}$$
(78)

For any \(\delta >0\) such that

$$\begin{aligned} \delta < 1 - \frac{C_4}{(1+C_4)(1-\delta _S)}, \end{aligned}$$
(79)

we have \((1-\delta _S)(1-\delta ) > C_4 / (1+C_4)\), and so we can compute

$$\begin{aligned}&{\mathbb {P}}\left[ \psi (\epsilon _k) \le \left[ (1-\delta _S)(1-\delta )-\frac{C_4}{1+C_4}\right] (k+1) - 1\right] \nonumber \\&\quad \le {\mathbb {P}}\left[ A_k + 1 \le (1-\delta _S)(1-\delta )(k+1)\right] , \end{aligned}$$
(80)
$$\begin{aligned}&\quad \le e^{-\delta ^2 (1-\delta _S)k/4}, \end{aligned}$$
(81)

using Lemma 10. Defining

$$\begin{aligned} \delta := \frac{1}{2}\left[ 1 - \frac{C_4}{(1+C_4)(1-\delta _S)}\right] , \end{aligned}$$
(82)

we have

$$\begin{aligned} (1-\delta _S)(1-\delta ) = \frac{1}{2}\left[ 1-\delta _S + \frac{C_4}{1+C_4}\right] > \frac{C_4}{1+C_4}, \end{aligned}$$
(83)

since \(1-\delta _S > C_4 / (1+C_4)\) from our assumption on \(\delta _S\). Hence we get

$$\begin{aligned} {\mathbb {P}}\left[ \psi (\epsilon _k) \le \frac{1}{2}\left( 1-\delta _S-\frac{C_4}{1+C_4}\right) (k+1) - 1\right] \le e^{-k [1-\delta _S-C_4/(1+C_4)]^2 / \left[ 16(1-\delta _S)\right] },\nonumber \\ \end{aligned}$$
(84)

and we note that this result still holds if \(\epsilon _k=0\), as \(\lim _{\epsilon \rightarrow 0}\psi (\epsilon )=\infty \).
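The exponent in (84) comes from substituting the choice (82) of \(\delta \) into the bound \(e^{-\delta ^2(1-\delta _S)k/4}\) from Lemma 10. Written out,

$$\begin{aligned} \delta&= \frac{1-\delta _S-C_4/(1+C_4)}{2(1-\delta _S)}, \\ \frac{\delta ^2 (1-\delta _S)}{4}&= \frac{[1-\delta _S-C_4/(1+C_4)]^2}{4(1-\delta _S)^2}\cdot \frac{1-\delta _S}{4} = \frac{[1-\delta _S-C_4/(1+C_4)]^2}{16(1-\delta _S)}. \end{aligned}$$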

Now we fix \(\epsilon >0\) and choose k satisfying (75). We use the fact that \(\psi (\cdot )\) is non-increasing to get

$$\begin{aligned} {\mathbb {P}}\left[ \epsilon _k \ge \epsilon \right]&\le {\mathbb {P}}\left[ \psi (\epsilon _k) \le \psi (\epsilon )\right] , \end{aligned}$$
(85)
$$\begin{aligned}&\le {\mathbb {P}}\left[ \psi (\epsilon _k) \le \frac{1}{2}(1-\delta _S-C_4/(1+C_4))k - 1\right] , \end{aligned}$$
(86)
$$\begin{aligned}&\le {\mathbb {P}}\left[ \psi (\epsilon _k) \le \frac{1}{2}(1-\delta _S-C_4/(1+C_4))(k+1) - 1\right] , \end{aligned}$$
(87)

and (76) follows. Lastly, we fix

$$\begin{aligned} k = \left\lceil \frac{2(\psi (\epsilon )+1)}{1-\delta _S-C_4/(1+C_4)} \right\rceil , \end{aligned}$$
(88)

and we use (76) and the definition of \(K_{\epsilon }\) to get

$$\begin{aligned} {\mathbb {P}}\left[ K_{\epsilon } \ge k\right]&= {\mathbb {P}}\left[ \epsilon _k \ge \epsilon \right] , \end{aligned}$$
(89)
$$\begin{aligned}&\le e^{-k[1-\delta _S-C_4/(1+C_4)]^2 / [16(1-\delta _S)]}, \end{aligned}$$
(90)
$$\begin{aligned}&\le \exp \left( -\frac{(\psi (\epsilon )+1)[1-\delta _S-C_4/(1+C_4)]}{8(1-\delta _S)}\right) , \end{aligned}$$
(91)

and we get (77). \(\square \)

Corollary 1

Suppose the assumptions of Theorem 1 hold. Then for \(k \ge k_0\) for some \(k_0\), we have

$$\begin{aligned} {\mathbb {P}}\left[ \min _{j\le k} \Vert \nabla f({\varvec{x}}_j)\Vert \le \frac{C {\kappa _H^{1/2} \kappa _d}}{{\alpha _Q}\sqrt{k}}\right] \ge 1 - e^{-ck}, \end{aligned}$$
(92)

for some constants \(c,C>0\), and where \(\kappa _d:=\max (\kappa _{\mathrm{ef}}, \kappa _{\mathrm{eg}})\). Alternatively, for \(\epsilon \in (0,\epsilon _0)\) for some \(\epsilon _0\), we have

$$\begin{aligned} {\mathbb {P}}\left[ K_{\epsilon } \le \widetilde{C} {\kappa _H \kappa _d^2} {\alpha _Q^{-2}} \epsilon ^{-2}\right] \ge 1 - e^{-\widetilde{c}{\kappa _H \kappa _d^2}{\alpha _Q^{-2}}\epsilon ^{-2}}, \end{aligned}$$
(93)

for constants \(\widetilde{c},\widetilde{C}>0\).

Proof

For \(\epsilon \) sufficiently small, \(\epsilon _g(\epsilon )\) and \(\varDelta _{\min }(\epsilon )\) are equal to a multiple of \(\alpha _Q\epsilon /\kappa _d\) and \(\alpha _Q\epsilon /(\kappa _H \kappa _d)\) respectively, and so \(\psi (\epsilon )=\alpha _1 \kappa _H \kappa _d^2 \alpha _Q^{-2} \epsilon ^{-2} + \alpha _2=\varTheta (\kappa _H \kappa _d^2 \alpha _Q^{-2} \epsilon ^{-2})\), for some constants \(\alpha _1,\alpha _2>0\).

Therefore for k sufficiently large, the choice

$$\begin{aligned} \epsilon = \sqrt{\frac{2\alpha _1 {\kappa _H \kappa _d^2 {\alpha _Q^{-2}}}}{(1-\delta _S-C_4/(1+C_4))k - 2-2\alpha _2}} = \varTheta ({\kappa _H^{1/2} \kappa _d} {\alpha _Q^{-1}} k^{-1/2}), \end{aligned}$$
(94)

is sufficiently small that \(\psi (\epsilon )=\alpha _1 {\kappa _H \kappa _d^2} {\alpha _Q^{-2}} \epsilon ^{-2} + \alpha _2\), and gives (75) with equality. The first result then follows from (76).

The second result follows immediately from \(\psi (\epsilon )=\varTheta ({\kappa _H \kappa _d^2} {\alpha _Q^{-2}} \epsilon ^{-2})\) and (77). \(\square \)
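The choice of \(\epsilon \) in (94) can be checked numerically: plugging it into the small-\(\epsilon \) form of \(\psi \) should turn (75) into an equality. In the sketch below, `scale` stands for \(\kappa _H \kappa _d^2 \alpha _Q^{-2}\) and `gap` for \(1-\delta _S-C_4/(1+C_4)\); these names, and the values of \(\alpha _1\) and \(\alpha _2\) used in any example, are hypothetical since the constants are not given explicitly.

```python
import math


def psi_small_eps(eps, alpha1, alpha2, scale):
    """Small-epsilon form psi(eps) = alpha1 * scale * eps^{-2} + alpha2."""
    return alpha1 * scale / eps**2 + alpha2


def epsilon_for_budget(k, alpha1, alpha2, scale, gap):
    """The choice (94) of epsilon for a given iteration count k."""
    denom = gap * k - 2.0 - 2.0 * alpha2
    if denom <= 0.0:
        raise ValueError("k is too small for (94) to be well defined")
    return math.sqrt(2.0 * alpha1 * scale / denom)


def threshold(eps, alpha1, alpha2, scale, gap):
    """Right-hand side of (75) using the small-epsilon form of psi."""
    return 2.0 * (psi_small_eps(eps, alpha1, alpha2, scale) + 1.0) / gap
```

With `alpha1 = alpha2 = 1`, `scale = 4` and `gap = 0.25`, taking \(k=1000\) gives \(\epsilon =\sqrt{8/246}\), and evaluating `threshold` at this \(\epsilon \) recovers 1000, as expected.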

Remark 2

All the above analysis holds with minimal modifications if we replace the trust-region mechanisms in RSDFO with more standard trust-region updating mechanisms. This includes, for example, having no safety step (i.e. \(\beta _F=0\)), and replacing (8) with

$$\begin{aligned} {\varvec{x}}_{k+1} = {\left\{ \begin{array}{ll}{\varvec{x}}_k + {\varvec{s}}_k, &{} \rho _k \ge \eta , \\ {\varvec{x}}_k, &{} \rho _k< \eta , \end{array}\right. } \quad \text {and} \quad \varDelta _{k+1} = {\left\{ \begin{array}{ll}\min (\gamma _{\mathrm{inc}}\varDelta _k, \varDelta _{\mathrm{max}}), &{} \rho _k \ge \eta , \\ \gamma _{\mathrm{dec}}\varDelta _k, &{} \rho _k < \eta , \end{array}\right. }\nonumber \\ \end{aligned}$$
(95)

for some \(\eta \in (0,1)\). The corresponding requirement on the trust-region updating parameters to prove a version of Theorem 1 is simply \(\gamma _{\mathrm{inc}} > \gamma _{\mathrm{dec}}^{-2}\) (provided we also set \(\gamma _C=\gamma _{\mathrm{dec}}\)).
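The simplified update (95) can be sketched in a few lines. The function name and signature below are hypothetical, and the full-space step \({\varvec{s}}_k\) and ratio \(\rho _k\) are assumed to have been computed elsewhere.

```python
def simple_trust_region_update(x, s, delta, rho, eta,
                               gamma_inc, gamma_dec, delta_max):
    """Apply the update (95): accept the step and expand, or reject and shrink."""
    if rho >= eta:
        # Successful iteration: move to x_k + s_k and increase the radius,
        # capped at delta_max.
        x_next = [xi + si for xi, si in zip(x, s)]
        delta_next = min(gamma_inc * delta, delta_max)
    else:
        # Unsuccessful iteration: stay at x_k and decrease the radius.
        x_next = list(x)
        delta_next = gamma_dec * delta
    return x_next, delta_next
```

For example, with \(\gamma _{\mathrm{inc}}=5\) and \(\gamma _{\mathrm{dec}}=0.5\) (which satisfy \(\gamma _{\mathrm{inc}} > \gamma _{\mathrm{dec}}^{-2}=4\)), a step with \(\rho _k=0.5\ge \eta =0.1\) from \(\varDelta _k=1\) is accepted and the radius becomes \(\min (5,\varDelta _{\max })\).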

2.5 Remarks on complexity bound

Our final complexity bounds for RSDFO in Corollary 1 are comparable to probabilistic direct search [46,  Corollary 4.9]. They also match—in their dependencies on \(\epsilon \), \(\kappa _H\) and \(\kappa _d\)—the standard bounds for (full space) model-based DFO methods for general objective [37, 76] and nonlinear least-squares [21] problems.

Following [46], we may also derive complexity bounds on the expected first-order optimality measure (of \({\mathcal {O}}(k^{-1/2})\)) and the expected worst-case complexity (of \({\mathcal {O}}(\epsilon ^{-2})\) iterations) for RSDFO.

Theorem 2

Suppose the assumptions of Theorem 1 hold. Then for \(k\ge k_0\), the iterates of RSDFO satisfy

$$\begin{aligned} {\mathbb {E}}\left[ \min _{j\le k} \Vert \nabla f({\varvec{x}}_j)\Vert \right] \le C{\kappa _H^{1/2} \kappa _d}{\alpha _Q^{-1}} k^{-1/2} + \Vert \nabla f({\varvec{x}}_0)\Vert e^{-ck}, \end{aligned}$$
(96)

for \(c,C>0\) and \(\kappa _d\) from (92), and for \(\epsilon \in (0,\epsilon _0)\) we have

$$\begin{aligned} {\mathbb {E}}\left[ K_{\epsilon }\right] \le \widetilde{C}_1 {\kappa _H \kappa _d^2}{\alpha _Q^{-2}}\epsilon ^{-2} + \frac{1}{\widetilde{c}_1}, \end{aligned}$$
(97)

for constants \(\widetilde{c}_1,\widetilde{C}_1>0\). Here, \(k_0\) and \(\epsilon _0\) are the same as in Corollary 1.

Proof

First, for \(k\ge k_0\) define the random variable \(H_k\) as

$$\begin{aligned} H_k :={\left\{ \begin{array}{ll} C{\kappa _H^{1/2} \kappa _d}{\alpha _Q^{-1}} k^{-1/2}, &{} \quad \text {if } \min _{j\le k} \Vert \nabla f({\varvec{x}}_j)\Vert \le C{\kappa _H^{1/2} \kappa _d}{\alpha _Q^{-1}} k^{-1/2}, \\ \Vert \nabla f({\varvec{x}}_0)\Vert &{} \quad \text {otherwise}. \end{array}\right. } \end{aligned}$$
(98)

Then since \(\min _{j\le k} \Vert \nabla f({\varvec{x}}_j)\Vert \le H_k\), we get

$$\begin{aligned} {\mathbb {E}}\left[ \min _{j\le k} \Vert \nabla f({\varvec{x}}_j)\Vert \right]\le & {} {\mathbb {E}}\left[ H_k\right] \nonumber \\\le & {} C{\kappa _H^{1/2} \kappa _d}{\alpha _Q^{-1}} k^{-1/2} \nonumber \\&+ \Vert \nabla f({\varvec{x}}_0)\Vert \, {\mathbb {P}}\left[ \min _{j\le k} \Vert \nabla f({\varvec{x}}_j)\Vert > C{\kappa _H^{1/2} \kappa _d}{\alpha _Q^{-1}} k^{-1/2}\right] ,\nonumber \\ \end{aligned}$$
(99)

and we get the first result by applying Corollary 1.

Next, if \(\epsilon \in (0,\epsilon _0)\) and

$$\begin{aligned} k\ge k_0(\epsilon ) :=\frac{2(\psi (\epsilon )+1)}{1-\delta _S-C_4/(1+C_4)} = \varTheta ({\kappa _H \kappa _d^2}{\alpha _Q^{-2}}\epsilon ^{-2}), \end{aligned}$$
(100)

then from Theorem 1 we have

$$\begin{aligned} {\mathbb {P}}\left[ K_{\epsilon } \le k\right] = {\mathbb {P}}\left[ \min _{j\le k} \Vert \nabla f({\varvec{x}}_j)\Vert \le \epsilon \right] \ge 1 - e^{-\widetilde{c}_1 k}, \end{aligned}$$
(101)

where \(\widetilde{c}_1 :=(1-\delta _S-C_4/(1+C_4))^2 /[16(1-\delta _S)]\). We use the identity \({\mathbb {E}}\left[ X\right] = \int _{0}^{\infty } {\mathbb {P}}\left[ X>t\right] dt\) for non-negative random variables X (e.g. [73,  eqn. (1.9)]) to get

$$\begin{aligned} {\mathbb {E}}\left[ K_{\epsilon }\right]\le & {} k_0(\epsilon ) + \int _{k_0(\epsilon )}^{\infty } {\mathbb {P}}\left[ K_{\epsilon } > t\right] dt \le k_0(\epsilon ) \nonumber \\&+ \sum _{k=k_0(\epsilon )}^{\infty } e^{-\widetilde{c}_1 k} = k_0(\epsilon ) + \frac{e^{-\widetilde{c}_1 k_0(\epsilon )}}{1-e^{-\widetilde{c}_1}}, \end{aligned}$$
(102)

where \(\widetilde{C}_1\) comes from \(k_0(\epsilon )=\varTheta ({\kappa _H \kappa _d^2}{\alpha _Q^{-2}}\epsilon ^{-2})\), which concludes our proof. \(\square \)

Furthermore, we also get almost-sure convergence of \(\liminf \) type, similar to [29,  Theorem 5.8] or [30,  Theorem 10.12] in the deterministic case.

Theorem 3

Suppose the assumptions of Theorem 1 hold. Then the iterates of RSDFO satisfy \(\inf _{k\ge 0} \Vert \nabla f({\varvec{x}}_k)\Vert = 0\) almost surely.

Proof

From Theorem 1, for any \(\epsilon >0\) we have

$$\begin{aligned} \lim _{k\rightarrow \infty } {\mathbb {P}}\left[ \min _{j\le k} \Vert \nabla f({\varvec{x}}_j)\Vert > \epsilon \right] = 0. \end{aligned}$$
(103)

However, \({\mathbb {P}}\left[ \inf _{k\ge 0}\Vert \nabla f({\varvec{x}}_k)\Vert> \epsilon \right] \le {\mathbb {P}}\left[ \min _{j\le k} \Vert \nabla f({\varvec{x}}_j)\Vert > \epsilon \right] \) for all k, and so

$$\begin{aligned} {\mathbb {P}}\left[ \inf _{k\ge 0}\Vert \nabla f({\varvec{x}}_k)\Vert > \epsilon \right] = 0. \end{aligned}$$
(104)

The result follows from the union bound applied to any sequence \(\epsilon \rightarrow 0\), e.g. \(\epsilon _k = k^{-1}\). \(\square \)

In particular, if \(\Vert \nabla f({\varvec{x}}_k)\Vert > 0\) for all k, then Theorem 3 implies \(\liminf _{k\rightarrow \infty } \Vert \nabla f({\varvec{x}}_k)\Vert = 0\) almost surely.

2.6 Selecting a subspace dimension

We now specify how to generate our subspaces \(Q_k\) to be probabilistically well-aligned and uniformly bounded (Assumption 4). These requirements are quite weak, and so there are several possible approaches for constructing \(Q_k\). The simplest case is to use no embedding, taking \(Q_k=I_{n\times n}\), which gives us \(p=n\) and \(Q_{\max }=1\) in Assumption 4. However, our overall complexity can be reduced with alternative approaches.

One approach to achieve this is by using Johnson-Lindenstrauss transforms (JLTs) [80]. The application of these techniques to random subspace optimization algorithms follows [16, 17, 72].

Definition 3

A random matrix \(S\in {\mathbb {R}}^{p\times n}\) is a \(({\beta },\delta )\)-JLT if, for any point \({\varvec{v}}\in {\mathbb {R}}^n\), we have

$$\begin{aligned} {\mathbb {P}}\left[ (1-{\beta })\Vert {\varvec{v}}\Vert ^2 \le \Vert S {\varvec{v}}\Vert ^2 \le (1+{\beta }) \Vert {\varvec{v}}\Vert ^2\right] \ge 1-\delta . \end{aligned}$$
(105)

Many different approaches for constructing \(({\beta },\delta )\)-JLT matrices have been proposed. Two common examples are:

  • If S is a random Gaussian matrix with independent entries \(S_{i,j}\sim N(0,1/p)\) and \(p = \varOmega ({\beta }^{-2}|\log \delta |)\), then S is a \(({\beta },\delta )\)-JLT (see [13,  Theorem 2.13], for example).

  • We say that S is an s-hashing matrix if it has exactly s nonzero entries per column (indices sampled independently), which take values \(\pm 1/\sqrt{s}\) selected independently with probability 1/2. If S is an s-hashing matrix with \(s=\varTheta ({\beta }^{-1}|\log \delta |)\) and \(p=\varOmega ({\beta }^{-2}|\log \delta |)\), then S is a \(({\beta },\delta )\)-JLT [52].

By taking \({\varvec{v}}= \nabla f({\varvec{x}}_k)\) in iteration k, and noting \((1-{\beta })^2 \le 1-{\beta }\) for all \({\beta }\in (0,1)\), we have that Assumption 4(a) holds if we take \(Q_k=S^T\), where S is any \((1-\alpha _Q, \delta _S)\)-JLT. That is, Assumption 4(a) is satisfied using either of the constructions above and \(p=\varOmega ((1-\alpha _Q)^{-2}|\log \delta _S|)\). We note that we need \(\alpha _Q<1\) to use this construction.
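The two JLT constructions above can be sketched in NumPy as follows. This is a minimal illustration, not the implementation used in our software: the helper names `gaussian_sketch` and `hashing_sketch`, the dimensions and the random seed are all assumptions chosen for the example, and the test vector `v` is only a stand-in for \(\nabla f({\varvec{x}}_k)\).

```python
import numpy as np


def gaussian_sketch(p, n, rng):
    """Return S in R^{p x n} with independent N(0, 1/p) entries."""
    return rng.normal(loc=0.0, scale=1.0 / np.sqrt(p), size=(p, n))


def hashing_sketch(p, n, s, rng):
    """Return an s-hashing matrix: exactly s nonzeros per column, each +-1/sqrt(s)."""
    S = np.zeros((p, n))
    for j in range(n):
        rows = rng.choice(p, size=s, replace=False)  # s distinct row indices for column j
        signs = rng.choice([-1.0, 1.0], size=s)      # independent signs, probability 1/2 each
        S[rows, j] = signs / np.sqrt(s)
    return S


# Hypothetical sizes, for illustration only.
rng = np.random.default_rng(0)
n, p, s = 200, 20, 3
S_gauss = gaussian_sketch(p, n, rng)
S_hash = hashing_sketch(p, n, s, rng)

Q_k = S_hash.T                 # subspace matrix Q_k = S^T, as in the text
v = rng.normal(size=n)         # stand-in for the gradient at x_k
ratio = np.linalg.norm(Q_k.T @ v) / np.linalg.norm(v)  # to be compared with alpha_Q
```

The quantity `ratio` is the realized value of \(\Vert Q_k^T {\varvec{v}}\Vert /\Vert {\varvec{v}}\Vert \) for one draw; Assumption 4(a) asks that it is at least \(\alpha _Q\) with probability at least \(1-\delta _S\).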

Alternatively, following [54], we may take \(Q_k=\sqrt{n/p}\, Z_{:,1:p}\), where \(Z_{:,1:p}\) comprises the first p columns of \(Z\in {\mathbb {R}}^{n\times n}\) sampled from the Haar distribution (i.e. a uniform distribution over \(n\times n\) orthogonal matrices). In this construction, the columns of \(Q_k\) are orthogonal. From [54,  Lemma 1], we have that \(Q_k\) satisfies Assumption 4(a) for any p and \(\alpha _Q\) with failure probability

$$\begin{aligned} \delta _S = I_{\alpha _Q^2 p/n}(p/2, (n-p)/2), \end{aligned}$$
(106)

where \(I_{q}(\alpha ,\beta )\) is the regularized incomplete beta function. Although this does not give us a simple criterion for choosing p in terms of \(\alpha _Q\) and \(\delta _S\), [54,  Figure 1] gives numerical evidence that p can be chosen independently of n.Footnote 7 We note that [77] considered a similar construction based on the Grassmann manifold.
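A scaled subsampled Haar matrix can be generated without forming the full \(n\times n\) matrix Z. The sketch below relies on the standard fact that the Q factor of an \(n\times p\) Gaussian matrix, after a sign correction, has the same distribution as the first p columns of a Haar matrix; the function name `haar_subspace` is hypothetical and the code is illustrative rather than the implementation used in our experiments.

```python
import numpy as np


def haar_subspace(n, p, rng):
    """Return Q = sqrt(n/p) * Z[:, :p], with Z Haar-distributed on the orthogonal group."""
    G = rng.normal(size=(n, p))
    Z, R = np.linalg.qr(G)          # reduced QR: Z is n x p with orthonormal columns
    # Multiply column j by sign(R[j, j]); this removes the sign ambiguity of the QR
    # factorization so that the columns are uniformly distributed.
    Z = Z * np.sign(np.diag(R))
    return np.sqrt(n / p) * Z
```

If SciPy is available, the failure probability (106) could be evaluated with the regularized incomplete beta function `scipy.special.betainc(p / 2, (n - p) / 2, alpha_Q**2 * p / n)`, where `alpha_Q` is a user-chosen value.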

Value of \(Q_{\max }\) If S is chosen to be Gaussian, then [6,  Corollary 3.11] gives the upper bound \(Q_{\max } = {\mathcal {O}}(\sqrt{n/p})\) with high probability. Following a union bound argument, by generating Gaussian S and rejecting those with large norm, we can achieve Assumption 4 for this construction while maintaining \(p={\mathcal {O}}(1)\). If S is a hashing matrix, then we have \(\Vert Q_k\Vert \le \Vert Q_k\Vert _F = \sqrt{n}\), and so \(Q_{\max }=\sqrt{n}\) suffices to achieve Assumption 4. Lastly, if \(Q_k\) is a subsampled Haar matrix, we simply get \(Q_{\max }=\sqrt{n/p}\).
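The rejection step for Gaussian matrices described above can be sketched as follows. The threshold constant `c`, the retry limit `max_tries` and the function name `bounded_gaussian_subspace` are assumptions made for this illustration; the analysis only requires some bound of the form \(Q_{\max }={\mathcal {O}}(\sqrt{n/p})\).

```python
import numpy as np


def bounded_gaussian_subspace(n, p, rng, c=2.0, max_tries=100):
    """Sample Q = S^T with S_ij ~ N(0, 1/p), rejecting draws with ||Q||_2 > c*sqrt(n/p).

    Returns (Q, q_max), where q_max = c*sqrt(n/p) is the enforced norm bound.
    """
    q_max = c * np.sqrt(n / p)
    for _ in range(max_tries):
        Q = rng.normal(scale=1.0 / np.sqrt(p), size=(n, p))
        if np.linalg.norm(Q, 2) <= q_max:   # spectral norm of the candidate
            return Q, q_max
    raise RuntimeError("no draw satisfied the norm bound; consider increasing c")
```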

Thus, we have presented three different random ensembles from which \(Q_k\) may be generated, each allowing us to use subspace dimension \(p={\mathcal {O}}(1)\), but requiring \(Q_{\max }={\mathcal {O}}(\sqrt{n})\). We note that the RSDFO framework and complexity analysis allow for different ensembles and/or bounds on p or \(Q_{\max }\), including any with improved dependencies on n, if possible.

Remark 3

We conclude this section by noting that our analysis raises the question of whether scaling Q by a (small) constant factor would improve the performance and complexity of the algorithm (by decreasing both \(\alpha _Q\) and \(Q_{\max }\)). This, and more broadly how to optimally design an embedding, is a difficult and important question that we defer to future work.

2.7 Complexity for general linear interpolation

In the case of linear interpolation models for a general objective problem (for which RSDFO may be applied), reasoning similar to Lemma 11 and using the standard fully linear error bounds from [28] or [30,  Theorems 2.11, 2.12, 3.14] gives \(\kappa _{\mathrm{ef}},\kappa _{\mathrm{eg}}={\mathcal {O}}(Q_{\max }^2 p \varLambda )\). Since we may take \(\kappa _H=1\) for linear models and noting that these methods still require at most \(p+1\) evaluations per iteration, this yields a high probability complexity of \({\mathcal {O}}(Q_{\max }^4 p^2 \epsilon ^{-2})\) iterations or \({\mathcal {O}}(Q_{\max }^4 p^3 \epsilon ^{-2})\) evaluations.

This means that RSDFO with a full-space model (i.e. \(p=n\) and \(Q_k=I\)) requires \({\mathcal {O}}(n^2 \epsilon ^{-2})\) iterations and \({\mathcal {O}}(n^3 \epsilon ^{-2})\) evaluations. However, with careful subspace generation using the methods in Sect. 2.6, with \(p={\mathcal {O}}(1)\) and \(Q_{\max }={\mathcal {O}}(\sqrt{n})\), we again get \({\mathcal {O}}(n^2 \epsilon ^{-2})\) iterations but a strict improvement to only \({\mathcal {O}}(n^2 \epsilon ^{-2})\) evaluations. Our linear algebra cost also reduces from \({\mathcal {O}}(n^3)\) to \({\mathcal {O}}(n)\) flops per iteration, with a corresponding reduction in the overall linear algebra cost over the whole algorithm from \({\mathcal {O}}(n^5 \epsilon ^{-2})\) to \({\mathcal {O}}(n^3 \epsilon ^{-2})\) flops. A detailed summary of the linear algebra cost of RSDFO is given for the nonlinear least-squares case in Sect. 3.3; similar results apply here.
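As a check on the arithmetic, substituting the two parameter regimes into the bounds \({\mathcal {O}}(Q_{\max }^4 p^2 \epsilon ^{-2})\) and \({\mathcal {O}}(Q_{\max }^4 p^3 \epsilon ^{-2})\), with \(\varLambda \) treated as a constant, gives

$$\begin{aligned} p=n,\; Q_{\max }=1:&\quad {\mathcal {O}}(Q_{\max }^4 p^2 \epsilon ^{-2}) = {\mathcal {O}}(n^2 \epsilon ^{-2}) \text { iterations}, \quad {\mathcal {O}}(Q_{\max }^4 p^3 \epsilon ^{-2}) = {\mathcal {O}}(n^3 \epsilon ^{-2}) \text { evaluations}, \\ p={\mathcal {O}}(1),\; Q_{\max }={\mathcal {O}}(\sqrt{n}):&\quad {\mathcal {O}}(Q_{\max }^4 p^2 \epsilon ^{-2}) = {\mathcal {O}}(n^2 \epsilon ^{-2}) \text { iterations}, \quad {\mathcal {O}}(Q_{\max }^4 p^3 \epsilon ^{-2}) = {\mathcal {O}}(n^2 \epsilon ^{-2}) \text { evaluations}. \end{aligned}$$

Multiplying the iteration counts by the per-iteration flop counts gives \({\mathcal {O}}(n^2 \epsilon ^{-2})\cdot {\mathcal {O}}(n^3) = {\mathcal {O}}(n^5 \epsilon ^{-2})\) flops in the full-space case and \({\mathcal {O}}(n^2 \epsilon ^{-2})\cdot {\mathcal {O}}(n) = {\mathcal {O}}(n^3 \epsilon ^{-2})\) flops in the subspace case.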

Instead of linear models, we may use (possibly underdetermined) quadratic interpolation to construct fully linear models. Details of these procedures may be found in [28, 30].

3 Random subspace nonlinear least-squares method

We now describe how RSDFO (Algorithm 1) can be specialized to the unconstrained nonlinear least-squares problem

$$\begin{aligned} \min _{{\varvec{x}}\in {\mathbb {R}}^n} f({\varvec{x}}) :=\frac{1}{2}\Vert {\varvec{r}}({\varvec{x}})\Vert ^2 = \frac{1}{2}\sum _{i=1}^{m} r_i({\varvec{x}})^2, \end{aligned}$$
(107)

where \({\varvec{r}}:{\mathbb {R}}^n\rightarrow {\mathbb {R}}^m\) is given by \({\varvec{r}}({\varvec{x}}):=[r_1({\varvec{x}}), \ldots , r_m({\varvec{x}})]^T\). We assume that \({\varvec{r}}\) is differentiable, but that access to the Jacobian \(J:{\mathbb {R}}^n\rightarrow {\mathbb {R}}^{m\times n}\) is not possible. In addition, we typically assume that \(m\ge n\) (regression), but everything here also applies to the case \(m<n\) (inverse problems). We now introduce the algorithm RSDFO-GN (Random Subspace DFO with Gauss–Newton), which is a random subspace version of a model-based DFO variant of the Gauss–Newton method [19].

Following the construction from [19], we assume that we have selected the p-dimensional search space \({\mathcal {Y}}_k\) defined by \(Q_k\in {\mathbb {R}}^{n\times p}\) (as in RSDFO above). Then, we suppose that we have evaluated \({\varvec{r}}\) at \(p+1\) points \(Y_k :=\{{\varvec{x}}_k,{\varvec{y}}_1,\ldots ,{{\varvec{y}}_p}\} \subset {\mathcal {Y}}_k\) (which typically are all close to \({\varvec{x}}_k\) and not recycled from previous iterations). Since \({\varvec{y}}_t\in {\mathcal {Y}}_k\) for each \(t=1,\ldots ,p\), from (3) we have \({\varvec{y}}_t = {\varvec{x}}_k + Q_k {\hat{{\varvec{s}}}}_t\) for some \({\hat{{\varvec{s}}}}_t\in {\mathbb {R}}^p\).

Given this interpolation set, we first wish to construct a local subspace linear model for \({\varvec{r}}\):

$$\begin{aligned} {\varvec{r}}({\varvec{x}}_k + Q_k {\hat{{\varvec{s}}}}) \approx {\hat{{\varvec{m}}}}_k({\hat{{\varvec{s}}}}) = {\varvec{r}}({\varvec{x}}_k) + {\hat{J}}_k {\hat{{\varvec{s}}}}. \end{aligned}$$
(108)

To do this, we choose the approximate subspace Jacobian \({\hat{J}}_k\in {\mathbb {R}}^{m\times p}\) by requiring that \({\hat{{\varvec{m}}}}_k\) interpolate \({\varvec{r}}\) at our interpolation points \(Y_k\). That is, we impose

$$\begin{aligned} {\hat{{\varvec{m}}}}_k({\hat{{\varvec{s}}}}_t) = {\varvec{r}}({\varvec{y}}_t), \qquad \forall t=1,\ldots ,p, \end{aligned}$$
(109)

which yields the \(p\times p\) linear system (with m right-hand sides)

$$\begin{aligned} {\hat{W}}_k {\hat{J}}_k^T :=\begin{bmatrix} {\hat{{\varvec{s}}}}_1^T \\ \vdots \\ {\hat{{\varvec{s}}}}_p^T \end{bmatrix} {\hat{J}}_k^T = \begin{bmatrix} ({\varvec{r}}({\varvec{y}}_1)-{\varvec{r}}({\varvec{x}}_k))^T \\ \vdots \\ ({\varvec{r}}({\varvec{y}}_p)-{\varvec{r}}({\varvec{x}}_k))^T \end{bmatrix}. \end{aligned}$$
(110)

Our linear subspace model \({\hat{{\varvec{m}}}}_k\) (108) naturally yields a local subspace quadratic model for f, as in the classical Gauss–Newton method, namely (cf. (4)),

$$\begin{aligned} f({\varvec{x}}_k+Q_k {\hat{{\varvec{s}}}}) \approx {\hat{m}}_k({\hat{{\varvec{s}}}}) :=\frac{1}{2}\Vert {\hat{{\varvec{m}}}}_k({\hat{{\varvec{s}}}})\Vert ^2 = f({\varvec{x}}_k) + {\hat{{\varvec{g}}}}_k^T {\hat{{\varvec{s}}}} + \frac{1}{2} {\hat{{\varvec{s}}}}^T {\hat{H}}_k {\hat{{\varvec{s}}}}, \end{aligned}$$
(111)

where \({\hat{{\varvec{g}}}}_k :={\hat{J}}_k^T {\varvec{r}}({\varvec{x}}_k)\) and \({\hat{H}}_k :={\hat{J}}_k^T {\hat{J}}_k\).
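The construction (108)–(111) can be sketched in NumPy as below. This is an illustrative sketch under simplifying assumptions: the function name `build_gn_model` is hypothetical, the residual `r` is any user-supplied callable, and a general dense solver is used for the interpolation system (110) rather than a reusable factorization.

```python
import numpy as np


def build_gn_model(r, x_k, Q_k, S_hat):
    """Build the subspace Gauss-Newton model (108)-(111).

    r     : callable mapping R^n to R^m (the residual vector)
    x_k   : current iterate, shape (n,)
    Q_k   : subspace matrix, shape (n, p)
    S_hat : array of shape (p, p) whose t-th row is s_hat_t (this is W_hat_k in (110))
    Returns (J_hat, g_hat, H_hat).
    """
    r0 = r(x_k)
    # Right-hand side of (110): rows r(y_t) - r(x_k), where y_t = x_k + Q_k s_hat_t.
    rhs = np.array([r(x_k + Q_k @ s_t) - r0 for s_t in S_hat])
    J_hat = np.linalg.solve(S_hat, rhs).T   # solves W_hat_k J_hat^T = rhs; J_hat is m x p
    g_hat = J_hat.T @ r0                    # model gradient in (111)
    H_hat = J_hat.T @ J_hat                 # Gauss-Newton model Hessian in (111)
    return J_hat, g_hat, H_hat
```

When \({\varvec{r}}\) is affine, say \({\varvec{r}}({\varvec{x}})=A{\varvec{x}}-{\varvec{b}}\), the interpolation conditions recover \({\hat{J}}_k = A Q_k\) exactly (up to rounding), which gives a simple sanity check for an implementation.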

3.1 Constructing \(Q_k\)-fully linear models

We now describe how we can achieve \(Q_k\)-fully linear models of the form (111) in RSDFO-GN.

As in [19], we will need to define the Lagrange polynomials and \(\varLambda \)-poisedness of an interpolation set. Given our interpolation set \(Y_k\) lies inside \({\mathcal {Y}}_k\), we consider the (low-dimensional) Lagrange polynomials associated with \(Y_k\). These are the linear functions \({\hat{\ell }}_0,\ldots ,{\hat{\ell }}_p:{\mathbb {R}}^p\rightarrow {\mathbb {R}}\), defined by the interpolation conditions

$$\begin{aligned} {\hat{\ell }}_t({\hat{{\varvec{s}}}}_{t'}) = \delta _{t, t'}, \qquad \forall t,t'=0,\ldots ,p, \end{aligned}$$
(112)

with the convention \({\hat{{\varvec{s}}}}_0={\varvec{0}}\) corresponding to the interpolation point \({\varvec{x}}_k\). The Lagrange polynomials exist and are unique whenever \({\hat{W}}_k\) (110) is invertible, which we typically ensure through judicious updating of \(Y_k\) at each iteration.

Definition 4

For any \(\varLambda >0\), the set \(Y_k\) is \(\varLambda \)-poised in the p-dimensional ball \(B({\varvec{x}}_k,\varDelta _k)\cap {\mathcal {Y}}_k\) if

$$\begin{aligned} \max _{t=0,\ldots ,p} \, \max _{\Vert {\hat{{\varvec{s}}}}\Vert \le \varDelta _k} |{\hat{\ell }}_t({\hat{{\varvec{s}}}})| \le \varLambda . \end{aligned}$$
(113)

Note that since \({\hat{\ell }}_0({\varvec{0}})=1\), for the set \(Y_k\) to be \(\varLambda \)-poised we require \(\varLambda \ge 1\). In general, a larger \(\varLambda \) indicates that \(Y_k\) has “worse” geometry, which leads to a less accurate approximation for f. This notion of \(\varLambda \)-poisedness (in a subspace) is sufficient to construct \(Q_k\)-fully linear models (111) for f.
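For linear interpolation, the smallest \(\varLambda \) satisfying (113) has a closed form, since a linear function \(c+{\varvec{g}}^T{\hat{{\varvec{s}}}}\) attains maximum absolute value \(|c|+\varDelta _k\Vert {\varvec{g}}\Vert \) over the ball \(\Vert {\hat{{\varvec{s}}}}\Vert \le \varDelta _k\). The following sketch computes it; the function name `poisedness_constant` is hypothetical, and explicitly inverting \({\hat{W}}_k\) is done here only for readability.

```python
import numpy as np


def poisedness_constant(W_hat, delta):
    """Smallest Lambda in (113) for linear interpolation with s_hat_0 = 0.

    W_hat : array of shape (p, p) whose t-th row is s_hat_t, t = 1, ..., p
    delta : trust-region radius Delta_k
    """
    W_inv = np.linalg.inv(W_hat)
    p = W_hat.shape[0]
    # For t >= 1: ell_t(s) = g_t^T s with W_hat g_t = e_t, so g_t is column t of W_hat^{-1}.
    grads = [W_inv[:, t] for t in range(p)]
    # The Lagrange polynomials sum to one, so ell_0 = 1 - sum_t ell_t has constant
    # term 1 and gradient -sum_t g_t (minus the row sums of W_hat^{-1}).
    g0 = -W_inv.sum(axis=1)
    values = [1.0 + delta * np.linalg.norm(g0)]
    values += [delta * np.linalg.norm(g) for g in grads]
    return max(values)
```

For instance, with scaled coordinate directions \({\hat{{\varvec{s}}}}_t=\varDelta _k {\varvec{e}}_t\) this formula gives \(\varLambda = 1+\sqrt{p}\), whereas nearly collinear directions give a much larger value.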

Lemma 11

Suppose Assumption 4(b) holds, \(J({\varvec{x}})\) is Lipschitz continuous, and \({\varvec{r}}\) and J are uniformly bounded above in \(\cup _{k\ge 0} B({\varvec{x}}_k,\varDelta _{\max })\). If \(Y_k \subset B({\varvec{x}}_k,\varDelta _k)\cap {\mathcal {Y}}_k\) and \(Y_k\) is \(\varLambda \)-poised in \(B({\varvec{x}}_k,\varDelta _k)\cap {\mathcal {Y}}_k\), then \({\hat{m}}_k\) (111) is a \(Q_k\)-fully linear model for f, with \(\kappa _{\mathrm{ef}},\kappa _{\mathrm{eg}}={\mathcal {O}}( {Q_{\max }^4} p^2 \varLambda ^2)\).

Proof

Consider the low-dimensional functions \({\hat{{\varvec{r}}}}:{\mathbb {R}}^p\rightarrow {\mathbb {R}}^m\) and \({\hat{f}}:{\mathbb {R}}^p\rightarrow {\mathbb {R}}\) given by \({\hat{{\varvec{r}}}}({\hat{{\varvec{s}}}}):={\varvec{r}}({\varvec{x}}_k+Q_k{\hat{{\varvec{s}}}})\) and \({\hat{f}}({\hat{{\varvec{s}}}}) :=\frac{1}{2}\Vert {\hat{{\varvec{r}}}}({\hat{{\varvec{s}}}})\Vert ^2\) respectively. We note that \({\hat{{\varvec{r}}}}\) is continuously differentiable with Jacobian \({\hat{J}}({\hat{{\varvec{s}}}}) = J({\varvec{x}}_k+Q_k{\hat{{\varvec{s}}}}) Q_k\). Then since \(\Vert Q_k\Vert \le Q_{\max }\) from Assumption 4(b), it is straightforward to show that both \({\hat{{\varvec{r}}}}\) and \({\hat{J}}\) are uniformly bounded above and \({\hat{J}}\) is Lipschitz continuous (with a Lipschitz constant \(Q_{\max }^2\) times larger than for \(J({\varvec{x}})\)).

We can then consider \({\hat{{\varvec{m}}}}_k\) (108) and \({\hat{m}}_k\) (111) to be interpolation models for \({\hat{{\varvec{r}}}}\) and \({\hat{f}}\) in the low-dimensional ball \(B({\varvec{0}},\varDelta _k)\subset {\mathbb {R}}^p\). From [19,  Lemma 3.3], we conclude that \({\hat{m}}_k\) is a fully linear model for \({\hat{f}}\) with constants \(\kappa _{\mathrm{ef}},\kappa _{\mathrm{eg}}={\mathcal {O}}(p^2 \varLambda ^2)\). The \(Q_k\)-fully linear property follows immediately from this, noting that \(\nabla {\hat{f}}({\hat{{\varvec{s}}}}) = Q_k^T \nabla f({\varvec{x}}_k+Q_k{\hat{{\varvec{s}}}})\). The dependency on \(Q_{\max }\) follows as \(\kappa _{\mathrm{ef}},\kappa _{\mathrm{eg}}={\mathcal {O}}(L_{{\hat{J}}}^2)\), where \(L_{{\hat{J}}}\) is the Lipschitz constant of \({\hat{J}}\) [19,  Lemma 3.2]. \(\square \)

Given this result, the procedures in [28] or [30,  Chapter 6] allow us to check and/or guarantee the \(\varLambda \)-poisedness of an interpolation set, and we have met all the requirements needed to fully specify RSDFO-GN.

Lastly, we note that underdetermined linear interpolation, where (110) is underdetermined and solved in a minimal norm sense, has been recently shown to yield a property similar to \(Q_k\)-full linearity [51,  Theorem 3.6].

Complete RSDFO-GN algorithm A complete statement of RSDFO-GN is given in Algorithm 2. This exactly follows RSDFO (Algorithm 1), but where we ask that the interpolation set satisfies the conditions: \(Y_k \subset B({\varvec{x}}_k,\varDelta _k)\cap {\mathcal {Y}}_k\) and \(Y_k\) is \(\varLambda \)-poised in \(B({\varvec{x}}_k,\varDelta _k)\cap {\mathcal {Y}}_k\). From Lemma 11, this is sufficient to guarantee \(Q_k\)-full linearity of \({\hat{m}}_k\).

figure b

3.2 Complexity analysis for RSDFO-GN

We are now in a position to specialize our complexity analysis for RSDFO to RSDFO-GN. For this, we need to impose a smoothness assumption on \({\varvec{r}}\).

Assumption 5

The level set \({\mathcal {L}}:=\{{\varvec{x}}\in {\mathbb {R}}^n : f({\varvec{x}}) \le f({\varvec{x}}_0)\}\) is bounded, \({\varvec{r}}\) is continuously differentiable, and the Jacobian J is Lipschitz continuous on \({\mathcal {L}}\).

This smoothness requirement allows us to immediately apply the complexity analysis for RSDFO, yielding the following result.

Corollary 2

Suppose Assumptions 5, 2, 3 and 4 hold, and we have \(\beta _F\le c_2\), \(\delta _S < 1/(1+C_4)\) for \(C_4\) defined in Lemma 9, and \(\gamma _{\mathrm{inc}} > \min (\gamma _C, \gamma _F, \gamma _{\mathrm{dec}}, \beta _F)^{-2}\). Then for the iterates generated by RSDFO-GN and k sufficiently large,

$$\begin{aligned} {\mathbb {P}}\left[ \min _{j\le k} \Vert \nabla f({\varvec{x}}_j)\Vert \le \frac{C{\kappa _H^{1/2} Q_{\max }^4 p^2}}{\sqrt{k}}\right] \ge 1 - e^{-ck}, \end{aligned}$$
(114)

for some constants \(c,C>0\). Alternatively, for \(\epsilon \in (0,\epsilon _0)\) for some \(\epsilon _0\), we have

$$\begin{aligned} {\mathbb {P}}\left[ K_{\epsilon } \le \widetilde{C}{\kappa _H Q_{\max }^8 p^4}\epsilon ^{-2}\right] \ge 1 - e^{-\widetilde{c}{\kappa _H Q_{\max }^8 p^4}\epsilon ^{-2}}, \end{aligned}$$
(115)

for constants \(\widetilde{c},\widetilde{C}>0\).

Proof

Assumption 5 implies that both \({\varvec{r}}\) and J are uniformly bounded above on \({\mathcal {L}}\), which is sufficient for Lemma 11 to hold. Hence, whenever we check/ensure that \(Y_k \subset B({\varvec{x}}_k,\varDelta _k)\cap {\mathcal {Y}}_k\) and \(Y_k\) is \(\varLambda \)-poised in \(B({\varvec{x}}_k,\varDelta _k)\cap {\mathcal {Y}}_k\) we are checking/guaranteeing that \({\hat{m}}_k\) is \(Q_k\)-fully linear in \(B({\varvec{x}}_k,\varDelta _k)\). In addition, from [19,  Lemma 3.2] and taking \(f_{\mathrm{low}}=0\), we have that Assumption 1 is satisfied. Therefore the result follows directly from Corollary 1 and \(\kappa _d={\mathcal {O}}( Q_{\max }^4 p^2)\) from Lemma 11. \(\square \)

We also note that it is reasonable to assume \(\kappa _H={\mathcal {O}}(\kappa _d)\) [19,  Lemma 3.3], and so the overall iteration complexity of RSDFO-GN is \({\mathcal {O}}(Q_{\max }^{12} p^6 \epsilon ^{-2})\) with high probability. Furthermore, each iteration of RSDFO or RSDFO-GN requires at most \(p+1\) objective evaluations (p to form the model \({\hat{m}}_k\), regardless of whether we need \(Q_k\)-full linearity or not, and one for \({\varvec{x}}_k+{\varvec{s}}_k\)). Hence the evaluation complexity of RSDFO-GN is \({\mathcal {O}}(Q_{\max }^{12} p^7 \epsilon ^{-2})\) with high probability.

If we use a full-space model (so \(p=n\) and \(Q_k=I\)), then with \(Q_{\max }=1\) we get a high probability complexity of \({\mathcal {O}}(n^6 \epsilon ^{-2})\) iterations and \({\mathcal {O}}(n^7 \epsilon ^{-2})\) evaluations. However, if we apply one of the subspace generation techniques from Sect. 2.6, with \(p={\mathcal {O}}(1)\) and \(Q_{\max }={\mathcal {O}}(\sqrt{n})\), then we get the same complexity of \({\mathcal {O}}(n^6 \epsilon ^{-2})\) iterations, but an improved evaluation complexity of \({\mathcal {O}}(n^6 \epsilon ^{-2})\). Thus RSDFO-GN with careful subspace generation can give a strict improvement in the evaluation complexity compared to standard full-space methods.
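The exponent bookkeeping behind this comparison can be reproduced with a few lines of Python. The helper `rsdfo_gn_exponents` is a hypothetical name; it simply substitutes \(p={\mathcal {O}}(n^{a})\) and \(Q_{\max }={\mathcal {O}}(n^{b})\) into the bounds \({\mathcal {O}}(Q_{\max }^{12} p^6 \epsilon ^{-2})\) and \({\mathcal {O}}(Q_{\max }^{12} p^7 \epsilon ^{-2})\) stated above.

```python
from fractions import Fraction


def rsdfo_gn_exponents(p_exp, qmax_exp):
    """Powers of n in the RSDFO-GN bounds when p = O(n^p_exp) and Q_max = O(n^qmax_exp).

    Returns (iteration exponent, evaluation exponent), based on the bounds
    O(Q_max^12 p^6 eps^-2) iterations and O(Q_max^12 p^7 eps^-2) evaluations.
    """
    iters = 12 * qmax_exp + 6 * p_exp
    evals = 12 * qmax_exp + 7 * p_exp
    return iters, evals


full_space = rsdfo_gn_exponents(Fraction(1), Fraction(0))    # p = n, Q_max = 1
subspace = rsdfo_gn_exponents(Fraction(0), Fraction(1, 2))   # p = O(1), Q_max = O(sqrt(n))
```

Here `full_space` holds the exponents for the full-space regime and `subspace` those for the random subspace regime, so the two evaluation exponents can be compared directly.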

In the next section we show that RSDFO-GN also improves on full-space methods in the linear algebra cost at each iteration, reducing from \({\mathcal {O}}(mn^2+n^3)\) flops per iteration to \({\mathcal {O}}(m+n)\) flops per iteration. Hence in the standard least-squares setting where \(m\ge n\), the linear algebra cost of achieving \(\epsilon \) first-order optimality reduces from \({\mathcal {O}}(mn^8 \epsilon ^{-2})\) to \({\mathcal {O}}(mn^6 \epsilon ^{-2})\) flops.

3.3 Linear algebra cost of RSDFO-GN

In RSDFO-GN, the interpolation linear system (110) is solved in two steps, namely: factorize the interpolation matrix \({\hat{W}}_k\), then back-solve for each right-hand side. Thus, the cost of the linear algebra is:

  1. Model construction costs \({\mathcal {O}}(p^3)\) to compute the factorization of \({\hat{W}}_k\), and \({\mathcal {O}}(mp^2)\) for the back-substitution solves with m right-hand sides; and

  2. Lagrange polynomial construction costs \({\mathcal {O}}(p^3)\) in total, due to one backsolve for each of the \(p+1\) polynomials (using the pre-existing factorization of \({\hat{W}}_k\)).

By updating the factorization or \({\hat{W}}_k^{-1}\) directly (e.g. via the Sherman-Morrison formula), we can replace the \({\mathcal {O}}(p^3)\) factorization cost with an \({\mathcal {O}}(p^2)\) updating cost (cf. [66]). However, the dominant \({\mathcal {O}}(mp^2)\) model construction cost remains, and in practice we have observed that the factorization needs to be recomputed from scratch to avoid the accumulation of rounding errors. We also have a cost, typically of \({\mathcal {O}}(np)\) or \({\mathcal {O}}(np^2)\), to construct \(Q_k\) using our randomized procedures from Sect. 2.6, and \({\mathcal {O}}(np)\) from projecting the computed step \({\hat{{\varvec{s}}}}_k\) to the full space. Hence in our case where \(p={\mathcal {O}}(1)\), the linear algebra cost per iteration is \({\mathcal {O}}(m+n)\).

In the case of a full-space method where \(p=n\) such as in [19], these costs become \({\mathcal {O}}(n^3)\) for the factorization (or \({\mathcal {O}}(n^2)\) if Sherman-Morrison is used) plus \({\mathcal {O}}(mn^2)\) for the back-solves. When n grows large, this linear algebra cost rapidly dominates the total runtime of these algorithms and limits the efficiency of full-space methods. This issue is discussed in more detail, with numerical results, in [69,  Chapter 7.2]. This per-iteration cost is substantially higher than that of RSDFO-GN with random subspace generation.

In light of this discussion, we now turn our attention to building an implementation of RSDFO-GN that has both strong performance (in terms of objective evaluations) and low linear algebra cost.

4 DFBGN: an efficient implementation of RSDFO-GN

An important tenet of DFO is that objective evaluations are often expensive, and so algorithms should be efficient in reusing information, hence limiting the total objective evaluations required to achieve a given decrease. However, because we require our model to sit within our active space \({\mathcal {Y}}_k\), we do not have a natural process by which to reuse evaluations between iterations when the space changes. We dedicate this section to outlining an implementation of RSDFO-GN, which we call DFBGN (Derivative-Free Block Gauss–Newton). DFBGN is designed to be efficient in its objective queries while still only building low-dimensional models, and hence is also efficient in terms of linear algebra. Specifically, we design DFBGN to achieve two aims:

  • Low computational cost: we want our implementation to have a per-iteration linear algebra cost which is linear in the ambient dimension;

  • Efficient use of objective evaluations: our implementation should follow the principles of other DFO methods and make progress with few objective evaluations. In particular, we hope that, when run with ‘full-space models’ (i.e. \(p=n\)), our implementation should have (close to) state-of-the-art performance.

We will assess the second point in Sect. 5 by comparison with DFO-LS [15], an open-source model-based DFO Gauss–Newton solver which explores the full space (i.e. \(p=n\)).

Remark 4

As discussed in [15], DFO-LS has a mechanism to build a model with fewer than \(n+1\) interpolation points. However, in that context we modify the model so that it varies over the whole space \({\mathbb {R}}^n\), which enables the interpolation set to grow to the usual \(n+1\) points and yield a full-dimensional model. There, the goal is to make progress with very few evaluations, but here our goal is scalability, so we keep our model low-dimensional throughout and instead change the subspace at each iteration.

4.1 Efficiency of DFBGN versus RSDFO-GN

To motivate the utility of DFBGN, we begin by showing a comparison of DFBGN against a direct implementation of RSDFO-GN. We implement RSDFO-GN by constructing \(Q_k\)-fully linear interpolation models at each iteration by setting \({\hat{{\varvec{s}}}}_t\) in (110) to be the t-th coordinate vector in \({\mathbb {R}}^p\), and generate \(Q_k\) using the different approaches described in Sect. 2.6. To ensure consistency between the two algorithms, we use identical trust-region management procedures, algorithm parameters and starting points for our comparison.

We test DFBGN and RSDFO-GN on several CUTEst problems [42] with dimension \(n\approx 100\), drawn from the (CR) collection described in Sect. 5.1, with subspace dimensions \(p\in \{n/10, n/4, n/2, n\}\). For brevity, we show results for arwhdne and \(p=n/2\), but the results are similar for all problems and values of p.Footnote 8 All solvers were run for a maximum of \(10(n+1)\) evaluations or until \(\varDelta _k \le 10^{-8}\), and because both RSDFO-GN and DFBGN are random we perform 10 independent runs of each solver/problem combination.

Fig. 1 Normalized objective value (versus evaluations and iterations) for 10 runs of RSDFO-GN and DFBGN on CUTEst problem arwhdne with \(p=n/2\). We also show different methods for generating \(Q_k\) for RSDFO-GN. These results use a budget of \(10(n+1)\) evaluations

In Fig. 1 we plot the objective decrease attained by each solver versus the number of objective evaluationsFootnote 9 and iterations. We see that DFBGN significantly outperforms all variants of RSDFO-GN (i.e. all approaches for generating \(Q_k\)) for all choices of subspace dimension p tested when measured in terms of evaluations. The primary benefit of DFBGN in this context is that it reuses objective evaluations between iterations, rather than having to fully resample an interpolation set whenever the subspace \(Q_k\) is redrawn. This is most clearly seen by the relative performance of RSDFO-GN being better when measured on iterations than on evaluations (noting that DFBGN can perform many more iterations within a given evaluation budget).

4.2 Subspace interpolation models

Similar to Sect. 3, we assume that, at iteration k, our interpolation set has \(p+1\) points \(\{{\varvec{x}}_k,{\varvec{y}}_1,\ldots ,{\varvec{y}}_p\}\subset {\mathbb {R}}^n\) with \(1\le p\le n\). However, we assume that these points are already given, and use them to determine the space \({\mathcal {Y}}_k\) (as defined by \(Q_k\)). That is, given

$$\begin{aligned} W_k :=\begin{bmatrix}({\varvec{y}}_1-{\varvec{x}}_k)^T \\ \vdots \\ ({\varvec{y}}_p-{\varvec{x}}_k)^T\end{bmatrix} \in {\mathbb {R}}^{p\times n}, \end{aligned}$$
(116)

we compute the QR factorization

$$\begin{aligned} W_k^T = Q_k R_k, \end{aligned}$$
(117)

where \(Q_k\in {\mathbb {R}}^{n\times p}\) has orthonormal columns and \(R_k\in {\mathbb {R}}^{p\times p}\) is upper triangular—and invertible provided \(W_k^T\) is full rank, which we guarantee by judicious replacement of interpolation points. This gives us the \(Q_k\) that defines \({\mathcal {Y}}_k\) via (3)—in this case \(Q_k\) has orthonormal columns—and in this way all our interpolation points are in \({\mathcal {Y}}_k\).

Since each \({\varvec{y}}_t\in {\mathcal {Y}}_k\), from (117) we have \({\varvec{y}}_t = {\varvec{x}}_k + Q_k {\hat{{\varvec{s}}}}_t\), where \({\hat{{\varvec{s}}}}_t\) is the t-th column of \(R_k\). Hence we have \({\hat{W}}_k = R_k^T\) in (110) and so \({\hat{{\varvec{m}}}}_k\) (108) is given by solving

$$\begin{aligned} R_k^T {\hat{J}}_k^T = \begin{bmatrix} ({\varvec{r}}({\varvec{y}}_1)-{\varvec{r}}({\varvec{x}}_k))^T \\ \vdots \\ ({\varvec{r}}({\varvec{y}}_p)-{\varvec{r}}({\varvec{x}}_k))^T \end{bmatrix}, \end{aligned}$$
(118)

via forward substitution, since \(R_k^T\) is lower triangular. This ultimately gives us our local model \({\hat{m}}_k\) via (111).
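The steps (116)–(118) can be sketched in NumPy as follows. This is a simplified illustration rather than the DFBGN source code: the function name `dfbgn_model` and its argument layout are assumptions, and `np.linalg.solve` is a general dense solver that does not exploit the triangular structure of \(R_k^T\) (a triangular solver such as `scipy.linalg.solve_triangular` with `lower=True` would).

```python
import numpy as np


def dfbgn_model(r_xk, r_y, x_k, Y):
    """Build Q_k and the model (111) from existing points, following (116)-(118).

    r_xk : r(x_k), shape (m,)
    r_y  : array of shape (p, m) whose t-th row is r(y_t)
    x_k  : current iterate, shape (n,)
    Y    : array of shape (p, n) whose t-th row is the interpolation point y_t
    Returns (Q, R, J_hat, g_hat, H_hat).
    """
    W = Y - x_k                            # (116): rows y_t - x_k, shape (p, n)
    Q, R = np.linalg.qr(W.T)               # (117): W^T = Q R, Q is n x p, R is p x p
    rhs = r_y - r_xk                       # rows r(y_t) - r(x_k)
    J_hat = np.linalg.solve(R.T, rhs).T    # (118): R^T J_hat^T = rhs, R^T lower triangular
    g_hat = J_hat.T @ r_xk                 # model gradient in (111)
    H_hat = J_hat.T @ J_hat                # model Hessian in (111)
    return Q, R, J_hat, g_hat, H_hat
```

Note that no subspace is sampled here: `Q` is determined entirely by the interpolation points supplied by the caller.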

We reiterate that compared to RSDFO-GN, we have used the interpolation set \(Y_k\) to determine both \(Q_k\) and \({\hat{m}}_k\), rather than first sampling \(Q_k\), then finding interpolation points \(Y_k\subset {\mathcal {Y}}_k\) with which to construct \({\hat{m}}_k\). This difference is crucial in allowing the reuse of interpolation points between iterations, and hence lowering the objective evaluation requirements of model construction.

Remark 5

As discussed in [69,  Chapter 7.3], we can equivalently recover this construction by asking for a full-space model \({\varvec{m}}_k:{\mathbb {R}}^n\rightarrow {\mathbb {R}}^m\) given by \({\varvec{m}}_k({\varvec{s}})={\varvec{r}}({\varvec{x}}_k)+J_k {\varvec{s}}\) such that the interpolation conditions \({\varvec{m}}_k({\varvec{y}}_t-{\varvec{x}}_k)={\varvec{r}}({\varvec{y}}_t)\) are satisfied and \(J_k\) has minimal Frobenius norm.
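The equivalence in Remark 5 can be checked numerically on random data. The sketch below assumes that the subspace model is lifted to the full space via \(J_k = {\hat{J}}_k Q_k^T\) and compares it with the minimal Frobenius norm solution obtained from the pseudoinverse of \(W_k\); the dimensions and the random seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, p = 7, 5, 3
W = rng.normal(size=(p, n))         # rows y_t - x_k, as in (116)
D = rng.normal(size=(p, m))         # rows r(y_t) - r(x_k)

Q, R = np.linalg.qr(W.T)            # (117)
J_hat = np.linalg.solve(R.T, D).T   # subspace Jacobian from (118), shape (m, p)
J_sub = J_hat @ Q.T                 # lifted full-space Jacobian, shape (m, n)

# Minimal Frobenius norm solution of the interpolation conditions W J^T = D.
J_min = (np.linalg.pinv(W) @ D).T
```

Both `J_sub` and `J_min` satisfy the interpolation conditions, and they agree up to rounding error whenever \(W_k\) has full row rank.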

4.3 Complete DFBGN algorithm

A complete statement of DFBGN is given in Algorithm 3. Compared to RSDFO-GN, we include specific steps to manage the interpolation set, which in turn dictates the choice of subspace \({\mathcal {Y}}_k\). Specifically, one issue with our approach is that our new iterate \({\varvec{x}}_k+{\varvec{s}}_k\) is in \({\mathcal {Y}}_k\), so if we were to simply add \({\varvec{x}}_k+{\varvec{s}}_k\) into the interpolation set, \({\mathcal {Y}}_k\) would not change across iterations, and we would never explore the whole space. On the other hand, unlike RSDFO and RSDFO-GN we do not want to completely resample \(Q_k\) as this would require too many objective evaluations. Instead, in DFBGN we delete a subset of points from the interpolation set and add new directions orthogonal to the existing directions, which ensures that \(Q_{k+1}\ne Q_k\) in every iteration.Footnote 10

We also note that DFBGN does not include some important algorithmic features present in RSDFO-GN, DFO-LS or other model-based DFO methods, and hence is quite simple to state. These features are not necessary for a variety of reasons, which we now outline.

figure c

No criticality and safety steps Compared to RSDFO-GN, the implementation of DFBGN does not include criticality (which is also the case in DFO-LS) or safety steps. These steps ultimately function to ensure we do not have \(\Vert {\hat{{\varvec{g}}}}_k\Vert \ll \varDelta _k\). In DFBGN, we ensure \(\varDelta _k\) does not get too large compared to \(\Vert {\varvec{s}}_k\Vert \) through (8), while \(\Vert {\varvec{s}}_k\Vert \) is linked to \(\Vert {\hat{{\varvec{g}}}}_k\Vert \) through Lemma 1. If \(\Vert {\varvec{s}}_k\Vert \) is much smaller than \(\varDelta _k\) and our step produces a poor objective decrease, then we will set \(\varDelta _{k+1}\leftarrow \Vert {\varvec{s}}_k\Vert \) for the next iteration. Although Lemma 1 allows \(\Vert {\varvec{s}}_k\Vert \) to be large even if \(\Vert {\varvec{g}}_k\Vert \) is small, in practice we do not observe \(\varDelta _k\gg \Vert {\varvec{g}}_k\Vert \) without DFBGN setting \(\varDelta _{k+1}\leftarrow \Vert {\varvec{s}}_k\Vert \) after a small number of iterations.

No model-improving steps An important feature of model-based DFO methods is the use of model-improving procedures, which change the interpolation set to ensure \(\varLambda \)-poisedness (Definition 4), or equivalently ensure that the local model for f is fully linear. In RSDFO-GN for instance, model-improvement is performed when CHECK_MODEL=TRUE, whereas in [29,  Algorithm 4.1] or [30,  Algorithm 10.1] there are dedicated model-improvement phases.

Instead, DFBGN ensures accurate interpolation models via a geometry-aware (i.e. \(\varLambda \)-poisedness aware) process for deleting interpolation points at each iteration, where they are replaced by new points in directions (from \({\varvec{x}}_{k+1}\)) which are orthogonal to \(Q_k\) and selected at random. The process for deleting interpolation points at each iteration is considered in Sect. 4.4, and the choice of a suitable number of points to remove, \(p_{\mathrm{drop}}\), is considered in Sect. 4.5. The process for generating new interpolation points, Algorithm 5, is outlined in Sect. 4.4.

A downside of our approach is that the new orthogonal directions are not chosen by minimizing a model for the objective (i.e. not attempting to reduce the objective), as we have no information about how the objective varies outside \({\mathcal {Y}}_k\). This is the fundamental trade-off between a subspace approach and standard methods (such as DFO-LS); we can reduce the linear algebra cost, but must spend objective evaluations to change the search space between iterations.

Linear algebra cost of DFBGN As in Sect. 3.3, our approach in DFBGN yields substantial reductions in the required linear algebra costs compared to DFO-LS:

  • Model construction costs \({\mathcal {O}}(np^2)\) for the factorization (117) and \({\mathcal {O}}(mp^2)\) for the forward-substitution solves (118), rather than \({\mathcal {O}}(n^3)\) and \({\mathcal {O}}(mn^2)\) respectively for DFO-LS; and

  • Lagrange polynomial construction costs \({\mathcal {O}}(p^3)\) rather than \({\mathcal {O}}(n^3)\).Footnote 11

As well as these reductions, we also get a smaller trust-region subproblem (5)—in \({\mathbb {R}}^p\) rather than \({\mathbb {R}}^n\)—and smaller memory requirements for storing the model Jacobian: we only store \({\hat{J}}_k\) and \(Q_k\), requiring \({\mathcal {O}}((m+n)p)\) memory rather than \({\mathcal {O}}(mn)\) for storing the full \(m\times n\) Jacobian. However, in (5), we do have the extra cost of projecting \({\hat{{\varvec{s}}}}_k\in {\mathbb {R}}^p\) into the full space \({\mathbb {R}}^n\), which requires a multiplication by \(Q_k\), costing \({\mathcal {O}}(np)\). In addition to the reduced linear algebra costs, the smaller interpolation set means we have a lower evaluation cost to construct the initial model of \(p+1\) evaluations (rather than \(n+1\)).

No particular choice of p is needed for this method, and anything from \(p=1\) (i.e. coordinate search) to \(p=n\) (i.e. full space search) is allowed. However, unsurprisingly, we shall see that larger values of p give better performance in terms of evaluations, except for the very low-budget phase, where smaller values of p benefit from a lower initialization cost. Hence, we expect that our approach with small p is useful when the \({\mathcal {O}}(mn^2+n^3)\) per-iteration linear algebra cost of DFO-LS is too great, and reducing the linear algebra cost is worth (possibly) needing more objective evaluations to achieve a given accuracy. As a result, p should in general be set as large as possible, given the linear algebra costs the user is willing to bear.

In Table 1, we compare the linear algebra costs of DFO-LS and DFBGN. The overall per-iteration cost of DFO-LS is \({\mathcal {O}}(mn^2+n^3)\) and the cost of DFBGN is \({\mathcal {O}}(mp^2+np^2+p^3)\), depending on the choice of \(p\in \{1,\ldots ,n\}\). The key benefit is that our dependency on the underlying problem dimension n decreases from cubic in DFO-LS to linear in DFBGN (provided \(p\ll n\)). We also note that both methods have linear cost in the number of residuals m, but with a factor that is significantly smaller in DFBGN than in DFO-LS—\({\mathcal {O}}(p^2)\) compared to \({\mathcal {O}}(n^2)\).

Table 1 Comparison of per-iteration linear algebra costs of DFO-LS and DFBGN (Algorithm 3) with subspace dimension \(p\in \{1,\ldots ,n\}\)

Remark 6

In every iteration we must compute the QR factorization (117). However, we note, similar to [19,  Section 4.2], that adding, removing and changing interpolation points all induce simple changes to \({\hat{W}}_k^T\) (adding or removing columns, and low-rank updates). This means that (117) can be computed with cost \({\mathcal {O}}(np)\) per iteration using the updating methods in [41,  Section 12.5]. In our implementation, however, we do not do this, as we find that these updates introduce errorsFootnote 12 that accumulate at every iteration and reduce the accuracy of the resulting interpolation models. To maintain the numerical performance of our method, we need to recompute (117) from scratch regularly (e.g. every 10 iterations), and so would not see the \({\mathcal {O}}(np)\) per-iteration cost, on average.

Remark 7

The default parameter choices for DFBGN are the same as DFO-LS, namely: \(\varDelta _{\mathrm{max}}=10^{10}\), \(\varDelta _{\mathrm{end}}=10^{-8}\), \(\gamma _{\mathrm{dec}}=0.5\), \(\gamma _{\mathrm{inc}}=2\), \({\overline{\gamma }}_{\mathrm{inc}}=4\), \(\eta _1=0.1\), and \(\eta _2=0.7\). DFBGN also uses the same default choice \(\varDelta _0=0.1\max (\Vert {\varvec{x}}_0\Vert _{\infty },1)\). The default choice of \(p_{\mathrm{drop}}\) is discussed in Sect. 4.5.

Adaptive choice of p One approach that we have considered is to allow p to vary between iterations of DFBGN, rather than being constant throughout. Instead of adding \(p_{\mathrm{drop}}\) new points at the end of each iteration (line 15), we implement a variable p by adding at least one new point to the interpolation set, continuing until some criterion is met. This criterion is designed to allow p to be small when such a p allows us to make reasonable progress, but to grow p up to \(p\approx n\) when necessary.

We have tested several possible criteria—comparing some combination of model gradient and Hessian, trust-region radius, trust-region step length, and predicted decrease from the trust-region step—and found the most effective to be comparing the model gradient and Hessian with the trust-region radius. Specifically, we continue adding new directions until (c.f. Lemma 3 and [19,  Lemma 3.22])

$$\begin{aligned} \frac{\Vert {\varvec{g}}_k\Vert }{\max (\Vert H_k\Vert ,1)} \ge \alpha \varDelta _k, \end{aligned}$$
(119)

for some \(\alpha >0\) (we use \(\alpha =0.2(n-p)/n\) for an interpolation set with \(p+1\) points). However, our numerical testing has shown that DFBGN with p fixed outperforms this approach for all budget and accuracy levels, on both medium- and large-scale problems, and so we do not consider it further here. We delegate further study of this approach to future work, to see if alternative adaptive choices for p can be beneficial.
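As an illustration, the test (119) can be written as a short function. This is a sketch under our own assumptions: the matrix norm is taken to be the spectral norm, and the function name and argument layout are hypothetical rather than taken from the DFBGN code.

```python
import numpy as np

def enough_directions(g, H, delta, n, p):
    """Return True if criterion (119) holds, so no further directions are added.
    g, H: model gradient and Hessian; delta: trust-region radius;
    n: problem dimension; p: current number of directions."""
    alpha = 0.2 * (n - p) / n                        # choice of alpha quoted in the text
    hess_norm = max(np.linalg.norm(H, 2), 1.0)       # assumed: spectral norm
    return bool(np.linalg.norm(g) / hess_norm >= alpha * delta)

# Example: a small model gradient relative to the radius triggers more directions
g, H = np.array([0.1, 0.0]), np.eye(2)
grow = not enough_directions(g, H, delta=1.0, n=100, p=2)
```

Note that \(\alpha =0\) when \(p=n\), so the criterion is always satisfied once the full space is reached.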

4.4 Interpolation set management

We now provide more details about how we manage the interpolation set in DFBGN. Specifically, we discuss how points are chosen for removal from \(Y_k\), and how new interpolation points are calculated.

4.4.1 Geometry management

The statement of DFBGN contains no dedicated model-improving steps to ensure that the interpolation set is well-poised. Instead, DFBGN maintains good geometry of the interpolation set through two mechanisms:

  • We use a geometry-aware mechanism for removing points, based on [19, 67], which requires the computation of Lagrange polynomials. This mechanism is given in Algorithm 4, and is called in lines 10 and 13 of DFBGN, as well as to select a point to replace in line 12; and

  • Adding new directions that are orthogonal to existing directions, and of length \(\varDelta _k\), means adding these new points never causes the interpolation set to have poor poisedness.

Together, these two mechanisms mean that any points causing poor poisedness are quickly removed, and replaced by high-quality interpolation points (orthogonal to existing directions, and within distance \(\varDelta _k\) of the current iterate). We note that the simpler approach of removing points based on distance to the current iterate alone does not perform as well as this method (see Appendix B.1 for details).

figure d

The linear algebra cost of Algorithm 4 is \({\mathcal {O}}(p^3)\) to compute p Lagrange polynomials with cost \({\mathcal {O}}(p^2)\) each (since we already have a factorization of \({\hat{W}}_k^T\)). Then for each t we must evaluate \(\theta _t\) (120), with cost \({\mathcal {O}}(p)\) to maximize \(\ell _t({\varvec{x}})\) (since \(\ell _t\) is linear and varies only in directions \({\mathcal {Y}}_k\)), and \({\mathcal {O}}(n)\) to calculate \(\Vert {\varvec{y}}_t-{\varvec{x}}_{k+1}\Vert \). This gives a total cost of \({\mathcal {O}}(p^3+np)\).Footnote 13

4.4.2 Generation of new directions

We now detail how new directions \({\varvec{d}}_1,\ldots ,{\varvec{d}}_q\) are created in line 15 of DFBGN (Algorithm 3). The same approach is suitable for generating the initial directions \({\varvec{d}}_1,\ldots ,{\varvec{d}}_p\) in line 1 of DFBGN, using \(\widetilde{A}=A\) below (i.e. no Q required).

figure e

Suppose our current subspace is defined by the orthonormal columns of \(Q\in {\mathbb {R}}^{n\times p_1}\), and we wish to generate q new orthonormal vectors that are also orthogonal to the columns of Q (with \(p_1+q\le n\)). When called in line 15 of DFBGN, we will have \(p_1=p-p_{\mathrm{drop}}\) and \(q=p_{\mathrm{drop}}\). We use the approach in Algorithm 5. From the QR factorization, the columns of \(\widetilde{Q}\) are orthonormal, and if \(\widetilde{A}\) is full rank (which occurs with probability 1; see Lemma 12 below) then we also have \({\text {col}}(\widetilde{Q}) = {\text {col}}(\widetilde{A})\). So, to confirm the columns of \(\widetilde{Q}\) are orthogonal to Q, we only need to check that the columns of \(\widetilde{A}\) are orthogonal to Q. Let \({\varvec{\widetilde{a}}}_i\) be the i-th column of \(\widetilde{A}\) and \({\varvec{q}}_j\) be the j-th column of Q. Then, if \({\varvec{a}}_i\) is the i-th column of A, we have

$$\begin{aligned} {\varvec{\widetilde{a}}}_i^T{\varvec{q}}_j = {\varvec{a}}_i^T(I-QQ^T){\varvec{q}}_j = {\varvec{a}}_i^T({\varvec{q}}_j-Q{\varvec{e}}_j) = 0, \end{aligned}$$
(121)

as required.

The cost of Algorithm 5 is \({\mathcal {O}}(nq)\) to generate A, \({\mathcal {O}}(np_1 q)\) to form \(\widetilde{A}\) and \({\mathcal {O}}(nq^2)\) for the QR factorization. Since \(p_1\) is the number of directions remaining in the interpolation set and q is the number of new directions to be added, we have \(p_1,q\le p\), and so the whole process has cost at most \({\mathcal {O}}(np^2)\). This bound is tight, up to constant factors, as we could take \(p_1=q=p/2\), for instance.
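The procedure described above can be sketched as follows. Since the statement of Algorithm 5 is not reproduced here, we assume that A has i.i.d. standard Gaussian entries (consistent with the full-rank argument in Lemma 12); the function name is hypothetical.

```python
import numpy as np

def new_orthogonal_directions(Q, q, rng):
    """Generate q orthonormal vectors orthogonal to the orthonormal columns of Q."""
    n = Q.shape[0]
    A = rng.standard_normal((n, q))        # assumed: i.i.d. Gaussian entries, O(nq)
    A_tilde = A - Q @ (Q.T @ A)            # project out col(Q), O(n p_1 q)
    Q_tilde, _ = np.linalg.qr(A_tilde)     # reduced QR, O(n q^2)
    return Q_tilde

# Usage: extend a 4-dimensional subspace of R^10 by 3 new directions
rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.standard_normal((10, 4)))
D = new_orthogonal_directions(Q, 3, rng)

# Initial directions: pass an n x 0 matrix, so that A_tilde = A
D0 = new_orthogonal_directions(np.zeros((10, 0)), 4, rng)
```

Computing \(Q(Q^TA)\) rather than \((QQ^T)A\) avoids forming the \(n\times n\) projection matrix, which keeps the cost at \({\mathcal {O}}(np_1q)\).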

Lemma 12

The matrix \(\widetilde{A}\) has full column rank with probability 1.

Proof

Let \({\varvec{a}}_i\) and \({\varvec{\widetilde{a}}}_i\) be the i-th columns of A and \(\widetilde{A}\) respectively. From [33,  Proposition 7.1], A has full column rank with probability 1, and each \({\varvec{a}}_i\notin {\text {col}}(Q)\) with probability 1. Now suppose we have constants \(c_1,\ldots ,c_q\) so that \(\sum _{i=1}^{q}c_i{\varvec{\widetilde{a}}}_i={\varvec{0}}\). Then since \(\widetilde{{\varvec{a}}}_i={\varvec{a}}_i-QQ^T{\varvec{a}}_i\), we have

$$\begin{aligned} \sum _{i=1}^{q}c_i{\varvec{a}}_i = \sum _{i=1}^{q}c_i QQ^T{\varvec{a}}_i. \end{aligned}$$
(122)

The right-hand side is in \({\text {col}}(Q)\), so since \({\varvec{a}}_i\notin {\text {col}}(Q)\), we must have \(\sum _{i=1}^{q}c_i{\varvec{a}}_i={\varvec{0}}\). Thus \(c_1=\cdots =c_q=0\) since A has full column rank, and so \(\widetilde{A}\) has full column rank. \(\square \)

4.5 Selecting an appropriate value of \({\varvec{p_{\mathrm{drop}}}}\)

An important component of DFBGN that we have not yet specified is how many points to remove from the interpolation set at each iteration, \(p_{\mathrm{drop}}\in \{1,\ldots ,p\}\).

On one hand, a large \(p_{\mathrm{drop}}\) enables us to change the subspace by a large amount between iterations, ensuring we explore the whole of \({\mathbb {R}}^n\) quickly, rather than searching in unproductive subspaces for many iterations. However, a small \(p_{\mathrm{drop}}\) means we require few objective evaluations per iteration, and so are more likely to use our evaluation budget efficiently.

In DFBGN we use a compromise choice as the default mechanism: \(p_{\mathrm{drop}}=1\) on successful iterations and \(p_{\mathrm{drop}}=p/10\) on unsuccessful iterations. Our extensive testing shows that this is a successful choice in practice, because it ensures that the trust-region radius \(\varDelta _k\) does not decrease too quickly compared to the first-order optimality measure \(\Vert {\hat{{\varvec{g}}}}_k\Vert \). We detail our choices and approach in Appendix B.2.
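The default rule can be summarized by the short sketch below. The rounding of \(p/10\) to an integer and the lower bound of one point are our own assumptions, since the text does not specify them; the function name is hypothetical.

```python
def choose_p_drop(p, successful):
    """Number of interpolation points to remove at the end of an iteration."""
    if successful:
        return 1
    # Assumed: round p/10 down, but always remove at least one point
    return max(1, p // 10)
```

For example, with \(p=100\) this removes 1 point after a successful iteration and 10 points after an unsuccessful one.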

5 Numerical results

In this section we compare the performance of DFBGN (Algorithm 3) to that of DFO-LS. We note that DFO-LS has been shown to have state-of-the-art performance compared to other solvers in [15]. As described in Sect. 4.3, the implementation of DFBGN is based on the decision to reduce the linear algebra cost of the algorithm at the expense of more objective evaluations per iteration. However, we still maintain the goal of DFBGN achieving (close to) state-of-the-art performance when it is run as a ‘full space’ method (i.e. \(p=n\)). Here, we will investigate this tradeoff in practice.

5.1 Testing framework

In our testing, we will compare a Python implementation of DFBGN (Algorithm 3) against DFO-LS version 1.0.2 (also implemented in Python). The implementation of DFBGN is available on Github.Footnote 14 We will consider both the standard version of DFO-LS, and one where we use a reduced initialization cost of n/100 evaluations (c.f. Remark 4). This will allow us to compare both the overall performance of DFBGN and its performance with small budgets (since DFBGN also has a reduced initialization cost of \(p+1\) evaluations). We compare these against DFBGN with the choices \(p\in \{n/100, n/10, n/2, n\}\) and the adaptive choice of \(p_{\mathrm{drop}}\in \{1,p/10\}\) (Sect. 4.5). All default settings are used for both solvers, and since both are randomized (DFO-LS uses random initial directions only, and DFBGN is randomized through Algorithm 5), we run 10 instances of each problem under all solver configurations.

Test problems We will consider two collections of nonlinear least-squares test problems, both taken from the CUTEst collection [42]. The first, denoted (CR), is a collection of 60 medium-scale problems (with \(25\le n\le 110\) and \(n\le m \le 400\)). Full details of the (CR) collection may be found in [19,  Table 3]. The second, denoted (CR-large), is a collection of 28 large-scale problems (with \(1000 \le n \le 5000\) and \(n\le m \le 9998\)). This collection is a subset of problems from (CR), with their dimension increased substantially. Full details of the (CR-large) collection are given in Appendix C. Note that the 12 h runtime limit was only relevant for (CR-large) in all cases.

Measuring solver performance For every problem, we allow all solvers a budget of at most \(100(n+1)\) objective evaluations (i.e. evaluations of the full vector \({\varvec{r}}({\varvec{x}})\)). This dimension-dependent choice may be understood as equivalent to 100 evaluations of \({\varvec{r}}({\varvec{x}})\) and the Jacobian \(J({\varvec{x}})\) via finite differencing. However, given the importance of linear algebra cost to our comparisons, we allow each solver a maximum runtime of 12 h for each instance of each problem.Footnote 15 For each solver S, each problem instance P, and accuracy level \(\tau \in (0,1)\), we calculate

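$$\begin{aligned} N(S,P,\tau ) :=\text {number of evaluations of } {\varvec{r}}({\varvec{x}}) \text { before } f({\varvec{x}}) \le f({\varvec{x}}^*) + \tau \left( f({\varvec{x}}_0)-f({\varvec{x}}^*)\right) , \end{aligned}$$
(123)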

where \(f({\varvec{x}}^*)\) is an estimate of the minimum of f as listed in [19,  Table 3] for (CR) and Appendix C for (CR-large). If this objective decrease is not achieved by a solver before its budget or runtime limit is hit, we set \(N(S,P,\tau )=\infty \). We then compare solver performances on a problem collection \({\mathcal {P}}\) by plotting either data profiles [59]

$$\begin{aligned} d_{S,\tau } (\alpha ) :=\frac{1}{|{\mathcal {P}}|}\left| \{P\in {\mathcal {P}} : N(S,P,\tau ) \le \alpha (n_P+1)\}\right| , \end{aligned}$$
(124)

where \(n_P\) is the dimension of problem instance P and \(\alpha \in [0,100]\) is an evaluation budget (in “gradients”, or multiples of \(n+1\)), or performance profiles [32]

$$\begin{aligned} \pi _{S,\tau } (\alpha ) :=\frac{1}{|{\mathcal {P}}|}\left| \{P\in {\mathcal {P}} : N(S,P,\tau ) \le \alpha N_{\min }(P,\tau )\}\right| , \end{aligned}$$
(125)

where \(N_{\min }(P,\tau )\) is the minimum value of \(N(S,P,\tau )\) for any solver S, and \(\alpha \ge 1\) is a performance ratio. In some instances, we will plot profiles based on runtime rather than objective evaluations. For this, we simply replace “number of evaluations of \({\varvec{r}}({\varvec{x}})\)” with “runtime” in (123).
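Given a table of values \(N(S,P,\tau )\) for a fixed \(\tau \), the profiles (124) and (125) can be computed as in the following sketch. The data layout (nested dictionaries keyed by solver and problem names) and the function names are hypothetical, and we treat a problem that no solver solves as unsolved in the performance profile.

```python
import math

def data_profile(N, dims, alpha):
    """Eq. (124): N[P] is the evaluation count for problem P (math.inf if
    unsolved), dims[P] is the problem dimension n_P."""
    solved = sum(1 for P in N if N[P] <= alpha * (dims[P] + 1))
    return solved / len(N)

def performance_profile(N_all, solver, alpha):
    """Eq. (125): N_all[S][P] is the evaluation count of solver S on problem P."""
    problems = list(N_all[solver])
    count = 0
    for P in problems:
        n_min = min(N_all[S][P] for S in N_all)
        # Require a finite count so that unsolved problems are never counted
        if math.isfinite(N_all[solver][P]) and N_all[solver][P] <= alpha * n_min:
            count += 1
    return count / len(problems)

# Toy example with two solvers and two problems of dimension n_P = 9
N_all = {"A": {"p1": 50, "p2": math.inf}, "B": {"p1": 100, "p2": 300}}
dims = {"p1": 9, "p2": 9}
d_A = data_profile(N_all["A"], dims, alpha=10)      # 0.5
pi_B = performance_profile(N_all, "B", alpha=2)     # 1.0
```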

When we plot the objective reduction achieved by a given solver, we normalize the objective value to be in [0, 1] by plotting

$$\begin{aligned} \frac{f({\varvec{x}})-f({\varvec{x}}^*)}{f({\varvec{x}}_0)-f({\varvec{x}}^*)}, \end{aligned}$$
(126)

which corresponds to the best \(\tau \) achieved in (123) after a given number of evaluations (again measured in “gradients”) or runtime.
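A sketch of this normalization, computed from the sequence of objective values returned by a solver, is given below. Taking the best value seen so far after each evaluation is our reading of "best \(\tau \) achieved"; the function names are hypothetical.

```python
import math

def normalized_objective(f_values, f0, f_star):
    """Best-so-far value of (126) after each evaluation."""
    best, out = math.inf, []
    for f in f_values:
        best = min(best, f)
        out.append((best - f_star) / (f0 - f_star))
    return out

def evals_to_accuracy(f_values, f0, f_star, tau):
    """Number of evaluations until the normalized value first drops to tau,
    or math.inf if this never happens within the given history."""
    for k, v in enumerate(normalized_objective(f_values, f0, f_star), start=1):
        if v <= tau:
            return k
    return math.inf

# Example history with f(x_0) = 10 and f(x^*) = 0
history = [10.0, 6.0, 7.0, 2.0, 1.5]
n_half = evals_to_accuracy(history, 10.0, 0.0, tau=0.5)    # 4
n_tenth = evals_to_accuracy(history, 10.0, 0.0, tau=0.1)   # math.inf
```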

5.2 Results based on evaluations

We begin our comparisons by considering the performance of DFO-LS and DFBGN when measured in terms of evaluations.

Fig. 2

Performance profiles (in evaluations) comparing DFO-LS (with and without reduced initialization cost) with DFBGN (various p choices) for different accuracy levels. Results are an average of 10 runs for each problem, with a budget of \(100(n+1)\) evaluations and a 12 h runtime limit per instance. The problem collection is (CR)

Medium-scale problems (CR) First, in Fig. 2, we show the results for different accuracy levels for the (CR) problem collection (with \(n\approx 100\)). For the lowest accuracy level \(\tau =0.5\), DFO-LS with reduced initialization cost is the best-performing solver, followed by DFBGN with \(p=n/2\). These correspond to methods with lower initialization costs than DFO-LS and DFBGN with \(p=n\), so this is likely a large driver behind their performance. DFBGN with full space size \(p=n\) performs similarly to DFO-LS, and DFBGN with \(p=n/10\) and \(p=n/100\) perform worst (as they are optimizing in a very small subspace at each iteration).

However, as we look at higher accuracy levels, we see that DFO-LS (with and without reduced initialization cost) performs best, and the DFBGN methods perform worse. The performance gap is more noticeable for small values of p. As expected, this means that DFBGN requires more evaluations to achieve these levels of accuracy, and benefits from being allowed to use a larger p. Notably, DFBGN with \(p=n\) has only a slight performance loss compared to DFO-LS, even though it uses p/10 evaluations on unsuccessful iterations (rather than 1–2 for DFO-LS); this indicates that our choice of \(p_{\mathrm{drop}}\) provides a suitable compromise between solver robustness and evaluation efficiency.

Fig. 3

Performance profiles (in evaluations) comparing DFO-LS (with and without reduced initialization cost) with DFBGN (various p choices) for different accuracy levels. Results are an average of 10 runs for each problem, with a budget of \(100(n+1)\) evaluations and a 12 h runtime limit per instance. The problem collection is (CR-large)

Large-scale problems (CR-large) Next, in Fig. 3, we show the same plots but for the (CR-large) problem collection, with \(n\approx 1000\). Compared to Fig. 2, the situation is quite different.

At the lowest accuracy level, \(\tau =0.5\), DFBGN with small subspaces (\(p=n/10\) and \(p=n/100\)) gives the best-performing solvers, followed by the full-space solvers (DFO-LS and DFBGN with \(p=n\)). For higher accuracy levels, the performance of DFBGN with small p deteriorates compared with the full-space methods. DFBGN with \(p=n/2\) is the worst-performing DFBGN variant at low accuracy levels, and performs similarly to DFBGN with small p at high accuracy levels. DFO-LS with reduced initialization cost is the worst-performing solver for this dataset.

Unlike the medium-scale results above, we no longer have a clear trend in the performance of DFBGN as we vary p. Instead, we have a combination of two factors coming into play, which have opposite impacts on the performance of DFBGN as we vary p. On one hand, we have the number of evaluations required for DFBGN (with a given p) to reach the desired accuracy level. On the other hand, we have the number of iterations that DFBGN can perform before reaching the 12 h runtime limit.

DFBGN with small p requires more evaluations to reach a given level of accuracy (as seen with the medium-scale results), but can perform many evaluations before timing out due to its low per-iteration linear algebra cost. This is reflected in it solving many problems to low accuracy, but few problems to high accuracy. By contrast, DFBGN with \(p=n\) is allowed to perform fewer iterations before timing out (and hence see fewer evaluations), but requires many fewer evaluations to solve problems, particularly for high accuracy. This manifests in its good performance for low and high accuracy levels. The middle ground, DFBGN with \(p=n/2\), has its performance negatively impacted by both issues: it requires many more evaluations to solve problems than \(p=n\) (especially for high accuracy), but also has a relatively high per-iteration linear algebra cost and so times out more often than DFBGN with small p.

Both variants of DFO-LS show worse performance here than for the medium-scale problems. This is because, as suggested by the analysis in Table 1, they are both affected by the runtime limit. DFO-LS with reduced initialization cost is particularly affected, because of the high cost of the SVD (of the full \(m\times n\) Jacobian) at each iteration for these problems. We note that this cost is only noticeable for these large-scale problems, and this variant of DFO-LS is still useful for small- and medium-scale problems, as discussed in [15].

Table 2 Proportion of problem instances from (CR-large) for which each solver terminated on the maximum 12 h runtime

We can verify the impact of the timeout on DFO-LS and DFBGN by considering the proportion of problem instances for (CR-large) for which the solver terminated because of the timeout. These results are presented in Table 2. DFO-LS reaches the 12 h maximum much more frequently than DFBGN: over 90% rather than 35% for DFBGN with \(p=n/100\) or 66% for DFBGN with \(p=n\) (see Remark 8 below). For DFBGN with different values of p, we see the same behaviour as in Fig. 3. That is, DFBGN with small p times out the least frequently, as its low per-iteration runtime means it performs enough iterations to terminate naturally. For DFBGN with \(p=n\), we time out more frequently (due to the high per-iteration runtime), but not as often as with \(p=n/2\), as its superior performance in terms of evaluations for high accuracy levels means it fully solves more problems, even with comparatively fewer iterations. We note that Table 2 does not measure what accuracy level was achieved before the timeout, which is better captured in the performance profiles in Fig. 3.

Remark 8

DFBGN with \(p=n\) has a similar per-iteration linear algebra cost to DFO-LS. Hence it can perform a similar number of iterations before reaching the runtime limit. However, DFBGN performs more objective evaluations per iteration, because of the choice of \(p_{\mathrm{drop}}\). Since DFBGN with \(p=n\) has a similar performance to DFO-LS when measured by evaluations (as seen in Fig. 2), this means that it has a superior performance when measured by runtime. Additionally, if multiple objective evaluations can be run in parallel, then DFBGN would also be able to benefit from this, unlike DFO-LS.

Remark 9

For completeness, the technical report associated with this work [20,  Appendix A] compares DFBGN with DFO-LS on the low-dimensional collection of test problems from Moré and Wild [59]. We do not include this discussion here as these problems are low-dimensional, which is not the main use case for DFBGN.

Fig. 4

Data profiles comparing the runtime of DFO-LS (with and without reduced initialization cost) with DFBGN (various p choices) for different accuracy levels. Results are an average of 10 runs for each problem, with a budget of \(100(n+1)\) evaluations and a 12 h runtime limit per instance. The problem collection is (CR-large)

5.3 Results based on runtime

We have seen above that DFBGN performs well compared to DFO-LS on the (CR-large) problem collection, as the 12 h timeout causes DFO-LS to terminate after relatively few objective evaluations. In Fig. 4, we show the same comparison for (CR-large) as in Fig. 3, but showing data profiles of problems solved versus runtime (rather than evaluations). Here, all DFBGN variants perform similarly to or better than DFO-LS for low accuracy levels, since DFBGN has a lower per-iteration runtime than DFO-LS, and this is the regime where DFBGN performs best (on evaluations). For high accuracy levels, DFBGN with \(p=n\) is the best-performing solver, as it uses large enough subspaces to solve many problems to high accuracy. By contrast, DFBGN with small p and DFO-LS perform similarly at high accuracy levels: the impact of the timeout on DFO-LS roughly matches the reduced robustness of DFBGN with small p at these accuracy levels. Again, as we observed above, DFO-LS with reduced initialization cost is the worst-performing solver, due to the high cost of the SVD at each iteration.

Fig. 5

Normalized objective value (versus evaluations and runtime) for 10 runs of DFO-LS and DFBGN on CUTEst problem arwhdne. These results use a budget of \(100(n+1)\) evaluations and a 12 h runtime limit per instance

To further see the impact of this issue, we now consider how the solvers perform for a variable-dimension test problem, as we increase the underlying dimension. We run each solver, with the same settings as above, on the CUTEst problem arwhdne for different choices of problem dimension n.Footnote 16 In Fig. 5 we plot the objective reduction for each solver against objective evaluations and runtime for DFO-LS and DFBGN, showing \(n=100\), \(n=1000\) and \(n=2000\).

We see that, when measured on evaluations, both DFO-LS variants achieve the fastest objective reduction, and that DFBGN with small p achieves the slowest objective reduction. This is in line with our results from Sect. 5.2. However, when we instead consider objective decrease against runtime, we see that DFBGN with small p gives the fastest decrease—the larger number of iterations needed by these variants (as seen by the larger number of evaluations) is offset by the substantially reduced per-iteration linear algebra cost. When viewed against runtime, both DFO-LS variants can only achieve a small objective decrease in the allowed 12 h, even though they are showing fast decrease against number of evaluations, and would achieve higher accuracy than DFBGN if the linear algebra cost were negligible.

Fig. 6

Normalized objective value (versus evaluations) for 10 runs of DFO-LS and DFBGN on different CUTEst problems (all with \(n=1000\) and \(n=2000\)). These results use a budget of \(n+1\) evaluations and a 12 h runtime limit per instance

5.4 Results for small budgets

Another benefit of DFBGN is that it has a small initialization cost of \(p+1\) evaluations. When n is large, it is more likely for a user to be limited by a budget of fewer than n evaluations. Here, we examine how DFBGN compares for small budgets to DFO-LS with reduced initialization cost.

We recall from Remark 4 that DFO-LS with reduced initialization cost progressively increases the dimension of the subspace of its interpolation model, until it reaches the whole space \({\mathbb {R}}^n\) (after approximately \(n+1\) evaluations), while in DFBGN we restrict the dimension at all iterations.

In Fig. 6 we consider three variable-dimensional CUTEst problems from (CR) and (CR-large), all using \(n=1000\) and \(n=2000\). We show the objective decrease against number of evaluations for 10 runs of each solver, restricted to a maximum of \(n+1\) evaluations. We see that the smaller the value of p used in DFBGN, the sooner DFBGN is able to make progress (due to the lower number of initial evaluations). However, this is offset by the faster objective decrease achieved by larger p values (after the higher initialization cost); if the user can afford a larger p, both in terms of linear algebra and initial evaluations, then this is usually a better option. An exception to this is the problem vardimne, where its simple structure means DFBGN with small p solves the problem to very high accuracy with very few evaluations, substantially outperforming both DFBGN with larger p, and DFO-LS with reduced initialization cost.

In Fig. 6 we also show the decrease for DFO-LS with full initialization cost and DFBGN with \(p=n\), but they use the full budget on initialization, and so make no progress. However, in addition, we show DFO-LS with a reduced initialization cost of n/100 evaluations. This variant performs well, in most cases matching the decrease of DFBGN with \(p=n/100\) initially, but achieving a faster objective reduction against number of evaluations—this matches with our previous observations. However, the extra cost of the linear algebra means that DFO-LS with reduced initialization does not end up using the full budget, instead hitting the 12 h timeout. This is most clear when comparing the results for \(n=1000\) with \(n=2000\), where DFO-LS with reduced initialization cost begins by achieving a similar decrease in both cases, but hits the timeout more quickly with \(n=2000\), and so terminates after fewer objective evaluations (with a corresponding smaller objective decrease).

We analyze this more systematically in Fig. 7, where we show data profiles (measured on number of evaluations) of DFBGN and DFO-LS on the (CR-large) problem collection, for low accuracy and small budgets. These results verify our conclusions: DFBGN with small p can make progress on many problems with a very small budget (fewer than \(n+1\) evaluations), and outperforms DFO-LS with reduced initialization cost, which is held back by its slow runtime. However, once we reach a budget of more than \(n+1\) evaluations, DFO-LS and DFBGN with \(p=n\) become the best-performing solvers (when measuring on evaluations only). They are also able to achieve a higher level of accuracy compared to DFBGN with small p.

Fig. 7. Data profiles (in evaluations) comparing DFO-LS (with and without reduced initialization cost) with DFBGN (various p choices) for different accuracy levels and budgets. Results are an average of 10 runs for each problem, with a budget of \(n+1\) or \(2(n+1)\) evaluations and a 12 h runtime limit per instance. The problem collection is (CR-large).

Fig. 8. Data profiles (in runtime) comparing DFO-LS (with and without reduced initialization cost) with DFBGN (various p choices) for different accuracy levels and budgets. Results are an average of 10 runs for each problem, with a budget of \(n+1\) or \(2(n+1)\) evaluations and a 12 h runtime limit per instance. The problem collection is (CR-large).

Lastly, Fig. 8 shows the same results as Fig. 7, but with profiles measured on runtime rather than evaluations. We note that we are only measuring the linear algebra costs, as the cost of objective evaluation for our problems is negligible. Here, the benefits of DFBGN with small p are not seen. This is because the problems that can be solved by DFBGN with small p using very few evaluations are likely easier, and so can likely be solved by DFBGN with large p in few iterations. Thus, the runtime needed for DFBGN with large p to solve these problems is not large: even though each iteration is more expensive, the number of iterations is small. In this setting, therefore, the benefit of DFBGN with small p is not lower linear algebra costs but fewer evaluations, which is likely to be the more relevant issue in this small-budget regime.

6 Conclusions and future work

The development of scalable derivative-free optimization algorithms is an active area of research with many applications. In model-based DFO, the high per-iteration linear algebra cost associated (primarily) with interpolation model creation and point management is a barrier to its utility for large-scale problems. To address this, we introduce three model-based DFO algorithms for large-scale problems.

First, RSDFO is a general framework for model-based DFO based on model construction and minimization in random subspaces, and is suitable for general smooth nonconvex objectives. This is specialized to nonlinear least-squares problems in RSDFO-GN, a version of RSDFO based on Gauss–Newton interpolation models built in subspaces. Lastly, we introduce DFBGN, a practical implementation of RSDFO-GN. In all cases, the scalability of these methods arises from the construction and minimization of models in p-dimensional subspaces of the ambient space \({\mathbb {R}}^n\). The subspace dimension can be specified by the user to reflect the computational resources available for linear algebra calculations.
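To make the idea of building and minimizing a Gauss–Newton interpolation model in a random subspace concrete, the following is a minimal sketch of a single iteration. It is not the DFBGN implementation: the function name `subspace_gn_step`, the choice of an orthonormal basis obtained from a QR factorization of a Gaussian matrix, and the simple step truncation used in place of a proper trust-region subproblem solver are all assumptions made for illustration.

```python
import numpy as np

def subspace_gn_step(r, x, p, delta, rng):
    """One illustrative derivative-free Gauss-Newton step in a random subspace.

    r:     function returning the residual vector r(x)
    x:     current iterate (length n)
    p:     subspace dimension (p <= n)
    delta: interpolation spacing, also used as the trust-region radius
    """
    n = x.size
    # Assumption: random p-dimensional subspace with orthonormal basis Q (n x p)
    Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
    r0 = r(x)
    # Linear interpolation of the residuals along the p basis directions.
    # Together with r(x), this uses p + 1 evaluations rather than n + 1.
    J_hat = np.column_stack(
        [(r(x + delta * Q[:, i]) - r0) / delta for i in range(p)]
    )
    # Gauss-Newton step in the subspace: minimize ||r0 + J_hat s||_2 over s
    s, *_ = np.linalg.lstsq(J_hat, -r0, rcond=None)
    # Crude safeguard: truncate the step to the trust-region radius
    norm_s = np.linalg.norm(s)
    if norm_s > delta:
        s *= delta / norm_s
    return x + Q @ s

# Hypothetical test problem: linear residuals r(x) = A x - b
rng = np.random.default_rng(0)
A = rng.standard_normal((60, 50))
b = rng.standard_normal(60)
r = lambda x: A @ x - b
x0 = np.zeros(50)
x1 = subspace_gn_step(r, x0, p=5, delta=1.0, rng=rng)
print(np.linalg.norm(r(x0)), np.linalg.norm(r(x1)))
```

The linear algebra in this sketch involves only matrices with p columns, which is the source of the reduced per-iteration cost. A practical method would also accept or reject the step, update the trust-region radius, and reuse interpolation points between iterations.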

We prove high-probability worst-case complexity bounds for RSDFO, and show that RSDFO-GN inherits the same bounds, with oracle and flop complexities that have an improved dependency on the ambient dimension compared to full-space methods. In terms of selecting the subspace dimension, we show that by using matrices based on Johnson–Lindenstrauss transformations, we can choose p to be independent of the ambient dimension n. Our analysis extends to DFO the techniques in [16, 17, 72], and yields similar results to probabilistic direct search [46] and standard model-based DFO [19, 37]. Our results also imply almost-sure global convergence to first-order stationary points.
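The claim that p can be chosen independently of n can be illustrated numerically. The sketch below uses a scaled Gaussian matrix, which is one standard Johnson–Lindenstrauss ensemble, and checks how well it preserves the norm of a vector standing in for a gradient. The function name `sketched_norm_ratio` and the values of n and p are hypothetical. The ensembles and constants used in the analysis are given earlier in the paper and are not reproduced here.

```python
import numpy as np

def sketched_norm_ratio(n, p, rng):
    """Return ||Q^T g|| / ||g|| for a scaled Gaussian sketch Q (n x p)."""
    g = rng.standard_normal(n)                      # stand-in for a gradient
    # Entries are N(0, 1/p), so E[||Q^T g||^2] = ||g||^2
    Q = rng.standard_normal((n, p)) / np.sqrt(p)
    return np.linalg.norm(Q.T @ g) / np.linalg.norm(g)

rng = np.random.default_rng(0)
p = 100
# Same p for two very different ambient dimensions
ratios = {n: sketched_norm_ratio(n, p, rng) for n in (1_000, 10_000)}
print(ratios)
```

For this ensemble the distribution of the ratio depends only on p and not on n, so the same subspace dimension retains a comparable fraction of the gradient norm in both cases.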

Our practical implementation of RSDFO-GN, DFBGN, has very low computational requirements: asymptotically, its cost is linear in the ambient dimension, rather than cubic as for standard model-based DFO. After the extensive algorithm development described here, our implementation is simple and combines several techniques for modifying the interpolation set, which allow it to still make progress with few objective evaluations (an important consideration for DFO techniques). A Python version of DFBGN is available on Github (see Footnote 17).

For medium-scale problems, DFBGN operating in the full ambient space (\(p=n\)) has similar performance to DFO-LS [15] when measured by objective evaluations, validating the techniques introduced in the practical implementation. However, DFBGN (with any choice of subspace dimension) has substantially faster runtime, which means it is much more effective than DFO-LS at solving large-scale problems from CUTEst, even when working in a very low-dimensional subspace. Further, in the case of expensive objective evaluations, working in a subspace means that DFBGN can make progress with very few evaluations, many fewer than the \(n+1\) needed for standard methods to build their initial model. Overall, the implementation of DFBGN is suitable for large-scale problems both when objective evaluations are cheap (and linear algebra costs dominate) and when evaluations are expensive (and the initialization cost of standard methods is impractical).

Future work will focus on extending the ideas from the implementation of DFBGN to the case of general objectives with quadratic models. This will bring the available software in line with the theoretical guarantees for RSDFO. We note that model-based DFO for nonlinear least-squares problems has been adapted to include sketching methods, which use randomization to reduce the number of residuals considered at each iteration [14]. We also leave to future work the development of techniques for nonlinear least-squares problems which combine sketching (i.e. dimensionality reduction in the observation space) with our subspace approach (i.e. dimensionality reduction in the variable space), and further study of methods for adaptively selecting the subspace dimension (cf. Sect. 4.3).