1 Introduction

In this paper, we consider the unconstrained nonlinear least squares problem

$$\begin{aligned} \begin{array}{ll} \min &{} f(x){:}{=}\frac{1}{2}\Vert E(x)\Vert ^2_2\\ \mathop {\mathrm{s.t.~}}&{} x\in {\mathbb {R}}^n, \end{array} \end{aligned}$$
(1)

with high-dimensional \(x\in {\mathbb {R}}^n\) and continuously differentiable \(E:{\mathbb {R}}^n\rightarrow {\mathbb {R}}^r\) (\(r\ge n\)), possibly expensive. However, we assume that no derivative information is available.

1.1 Related work

In recent years, a large body of literature has appeared on least squares problems and their applications. Here we mention only one useful book and one paper:

  • Ortega and Rheinboldt (2000) wrote an excellent book covering both algorithms and their analysis.

  • Yuan (2011) wrote an excellent paper covering Levenberg–Marquardt algorithms, quasi-Newton algorithms, and trust region algorithms, together with their local analysis without a nonsingularity assumption.

Derivative-free unconstrained nonlinear black box least squares solvers can be classified in two ways: according to how the Jacobian matrix is estimated, and according to whether they are based on line search or on trust region techniques:

  • Quasi-Newton approximation. Sorber et al. (2012) introduced MINLBFGS (a limited memory BFGS algorithm) and MINLBFGSDL (a trust region algorithm using a dogleg algorithm and limited memory BFGS approximation).

  • Finite difference approximation. There are many trust region methods using the finite difference method for the Jacobian matrix estimation such as CoDoSol and STRSCNE by Bellavia et al. (2004, 2012), NMPNTR by Kimiaei (2016), NATRN and NATRLS by Amini et al. (2016, 2016), LSQNONLIN from the MATLAB Toolbox, NLSQERR (an adaptive trust region strategy) by Deuflhard (2011), and DOGLEG by Nielsen (2012). They are suitable for small- and medium-scale problems. Line search methods using the finite difference approximation are NLEQ (a damped affine invariant Newton method) by Nowak and Weimann (1990) and MINFNCG (a family of nonlinear conjugate gradient methods) by Sorber et al. (2012).

FMINUNC by MATLAB Optimization Toolbox is an efficient solver for small- and medium-scale problems. It uses the finite difference method to estimate the gradient vector and the standard quasi-Newton method to estimate the Hessian matrix. In fact, FMINUNC disregards the least squares structure and only has access to function values. Nevertheless, it will be shown that FMINUNC is more efficient than LSQNONLIN using the least squares structure.

To solve the least squares problem (1), trust region methods use linear approximations of the residual vector to build surrogate quadratic models whose accuracy is increased by restricting the feasible points. These methods use a computational measure to decide whether the agreement between the actual reduction of the objective function and the predicted reduction of the surrogate quadratic model function is good. If this agreement is good, the iteration is called successful and the trust region radius is expanded; otherwise, the iteration is called unsuccessful and the trust region radius is reduced; for more details see Conn et al. (2000), Nocedal and Wright (1999).

The efficiency of trust region methods depends on how the trust region radius is updated (see, e.g., Ahookhosh et al. 2013; Amini et al. 2016, 2016; Esmaeili and Kimiaei 2014a, 2014b, 2015; Fan 2006; Fan and Pan 2009, 2010; Kimiaei 2017; Yu and Pu 2008) and on whether non-monotone techniques are applied (see, e.g., Ahookhosh and Amini 2011; Ahookhosh et al. 2013, 2015; Amini et al. 2016, 2016; Deng et al. 1993; Grippo et al. 1986; Grippo and Sciandrone 2007; Kimiaei 2016, 2017; Yu and Pu 2008). Rounding errors may lead to two problems:

  (i) The model function may not decrease numerically for some iterations. In this case, if there is also no decrease in the function value at such iterations, the trust region radii are nevertheless expanded, possibly several times; this expansion is unnecessary.

  (ii) The model function may decrease numerically but the objective function may not decrease when the iterates are near a valley that is deep, with a small creek at the bottom and steep sides. In this case, the trust region radii are reduced, possibly many times, producing a very small radius or even a failure.

Non-monotone techniques can be used in the hope of overcoming the second problem.

1.2 Overview of the new method

We suggest in Sect. 2 a new trust region-based limited memory algorithm for unconstrained black box least squares problems, called LMLS. This algorithm uses

  • a non-monotone ratio and an adaptive radius formula to quickly reach the minimizer when the valley is narrow;

  • a Broyden-like algorithm to obtain a decrease in the function value when the trust region radius is very small and the iteration is unsuccessful;

  • a finite difference approximation in a subspace with random basis indices to estimate the Jacobian matrix;

  • either a Gauss–Newton or a dogleg algorithm in a subspace with random basis indices to solve the trust region subproblems.

Numerical results for small- to large-scale problems are given in Sect. 3, showing that the new method is suitable for large-scale problems and is more robust and efficient than solvers using limited memory and standard quasi-Newton approximations.

2 The trust region method

In this section, we construct an improved trust region algorithm for handling problems in high dimensions:

  • In Sect. 2.1 a Gauss-Newton direction in a subspace with random basis indices is introduced.

  • In Sect. 2.2 a non-monotone term and an adaptive technique are constructed to quickly reach the minimizer in the presence of a narrow valley.

  • In Sect. 2.3 a dogleg algorithm in a subspace with random basis indices is discussed.

  • In Sect. 2.4 a Broyden-like technique is suggested based on the old best points.

  • In Sect. 2.5 our algorithm using new enhancements is introduced.

We write J(x) for the Jacobian matrix of the residual vector E at x. Then the gradient vector is \( g(x){:}{=}\nabla f(x){:}{=}J(x)^TE(x) \) and the Hessian matrix is

$$\begin{aligned} G(x){:}{=}J(x)^TJ(x)+\nabla ^2 E(x)^TE(x) \end{aligned}$$

If the residual vector E(x) is small, the second term in G(x) is small. Hence, we approximate G(x) by the Gauss-Newton Hessian matrix \(J(x)^TJ(x)\). We define the quadratic surrogate objective function

$$\begin{aligned} {\mathcal {Q}}(p){:}{=}\displaystyle \frac{1}{2}\Vert E+Jp\Vert ^2=f+p^T g+\displaystyle \frac{1}{2}(Jp)^TJp, \end{aligned}$$
(2)

where \(f{:}{=}f(x)\), \(E{:}{=}E(x)\), \(J{:}{=}J(x)\), \(g{:}{=}g(x){:}{=}J^TE\). We denote by \(A_{:k}\) the kth column of a matrix A.

A trust region method finds a minimizer of the constrained problem

$$\begin{aligned} \begin{array}{ll} \text {min}~{\mathcal {Q}}(p)\\ \text {s.t}\ \ p \in {\mathbb {R}}^{n}\ \ \text{ and }\ \ \Vert p\Vert \le \varDelta ,\\ \end{array} \end{aligned}$$
(3)

whose constraint restricts feasible points by the trust region radius \(\varDelta >0\). This problem is called the trust region subproblem. Given a solution p of (3), we define the actual reduction in the objective function by

$$\begin{aligned} df {:}{=} f-f(x+p) \end{aligned}$$
(4)

and the predicted reduction in the model function by

$$\begin{aligned} dq {:}{=} {\mathcal {Q}}(0)-{\mathcal {Q}}(p). \end{aligned}$$
(5)

The agreement between the actual and the predicted reduction around the current iterate x is measured by the monotone trust region ratio

$$\begin{aligned} \rho {:}{=}\frac{df}{dq}. \end{aligned}$$
(6)

If this agreement is good according to a heuristic formula discussed in Sect. 2.2, the iteration is called successful, \(x+p\) is accepted as a new point, called a best point, and the radius is expanded; otherwise, the iteration is called unsuccessful and the radius is reduced.
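To make the acceptance mechanism concrete, the following MATLAB fragment sketches one monotone trust region test based on (4)-(6). The residual handle E_fun, the threshold gamma_t, and the expansion/reduction factors 2 and 0.5 are illustrative assumptions; LMLS itself updates the radius by the formulas of Sect. 2.2.

    % Minimal sketch of the acceptance test (4)-(6); E_fun, x, p, J, Delta
    % and gamma_t are assumed to be given.
    E   = E_fun(x);      f  = 0.5*norm(E)^2;      % current residual and value
    Et  = E_fun(x + p);  ft = 0.5*norm(Et)^2;     % trial residual and value
    df  = f - ft;                                 % actual reduction (4)
    dq  = -(p'*(J'*E)) - 0.5*norm(J*p)^2;         % predicted reduction (5)
    rho = df/dq;                                  % monotone ratio (6)
    if rho >= gamma_t                             % good agreement: successful
        x = x + p;  Delta = 2*Delta;
    else                                          % poor agreement: unsuccessful
        Delta = 0.5*Delta;
    end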

2.1 A new subspace Gauss–Newton method

In this subsection, we have two goals: estimating the Jacobian matrix and constructing a Gauss-Newton direction in a subspace with random basis indices.

Let \(m_{\mathop {\mathrm{sn}}}\) be the subspace dimension. The Jacobian matrix in a subspace with random basis indices is estimated by a new subspace random finite difference scheme, called SRFD, which uses the following steps (a simplified MATLAB sketch is given after the list):

  (1) Initially, the subspace basis index set S is chosen as a random subset of \(\{1,\ldots ,n\}\) with \(m_{\mathop {\mathrm{sn}}}\) members, and its complement is \(S^c{:}{=}\{1,\ldots ,n\}\setminus S\).

  (2) Next, if the complement \(S^c_{\mathop {\mathrm{old}}}\) of the old subspace basis index set is not empty, an index set \({\mathcal {I}}\) is chosen before the new subspace basis index set is determined. If \(S^c_{\mathop {\mathrm{old}}}\) has at least \(m_{\mathop {\mathrm{sn}}}\) members, \({\mathcal {I}}\) is a random subset of \(\{1,\ldots ,|S^c_{\mathop {\mathrm{old}}}|\}\) with \(m_{\mathop {\mathrm{sn}}}\) members; otherwise, \({\mathcal {I}}\) is a random permutation of \(\{1,\ldots ,|S^c_{\mathop {\mathrm{old}}}|\}\). Then the new subspace basis index set is \(S{:}{=}S^c_{\mathop {\mathrm{old}}}({\mathcal {I}})\) and its complement is \(S^c{:}{=}S^c_{\mathop {\mathrm{old}}}\setminus S\). If \(S^c_{\mathop {\mathrm{old}}}\) is empty, the subspace basis index set and its complement are restarted and chosen as in step (1).

  (3) For each \(i\in S\),

    • the step size is computed by

      $$\begin{aligned} h_i{:}{=} \left\{ \begin{array}{ll} \gamma _s\ \ &{}\hbox {if }x_i=0, \\ \gamma _s(\mathop {\mathrm{sign}}x_i)\max \Big \{|x_i|,\displaystyle \frac{\Vert x\Vert _1}{n}\Big \}\ \ &{} \hbox {otherwise,}\\ \end{array} \right. \end{aligned}$$

      where \(0<\gamma _s<1\) is a tiny factor and \(\mathop {\mathrm{sign}}x_i\) denotes the sign of \(x_i\), taking one of the values \(-1\) (if \(x_i<0\)), 0 (if \(x_i=0\)), and 1 (if \(x_i>0\)).

    • the random approximation coordinate direction p discussed in Kimiaei (2020) is used with the difference that its ith component is updated by \(p_i=p_i+h_i\).

    • the new trial residual \(E(x+p)\) and the new column \((E(x+p)-E)/h_i\) of the Jacobian matrix are computed.
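The following MATLAB sketch, a simplification under stated assumptions, illustrates steps (1) and (3): it draws a fresh random index set S instead of recycling \(S^c_{\mathop {\mathrm{old}}}\) as in step (2), and it replaces the random approximation coordinate direction of Kimiaei (2020) by a plain coordinate perturbation. The function name SRFD_sketch and the handle E_fun are illustrative.

    % Simplified sketch of SRFD: one subspace Jacobian estimate.
    % E_fun: residual handle, x: current point, E = E_fun(x),
    % m_sn: subspace dimension, gamma_s: tiny step size factor.
    function [J_S, S, S_c] = SRFD_sketch(E_fun, x, E, m_sn, gamma_s)
    n   = numel(x);
    S   = sort(randperm(n, m_sn));        % step (1): random basis indices
    S_c = setdiff(1:n, S);                % complement of S
    J_S = zeros(numel(E), m_sn);
    for k = 1:m_sn
        i = S(k);                         % step (3): finite difference column
        if x(i) == 0
            h = gamma_s;
        else
            h = gamma_s*sign(x(i))*max(abs(x(i)), norm(x, 1)/n);
        end
        p = zeros(n, 1);  p(i) = h;       % coordinate perturbation
        J_S(:, k) = (E_fun(x + p) - E)/h; % new Jacobian column
    end
    end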

It is well known that standard quasi-Newton methods are more robust than limited memory quasi-Newton ones, but they cannot handle problems in high dimensions; for standard quasi-Newton methods, see (Dennis and Moré 1977; Dennis and Walker 1981; Nocedal 1992; Schnabel 1989), and for limited memory quasi-Newton methods, see (Liu and Nocedal 1989; Nazareth 1979; Nocedal 1992). On the other hand, finite difference methods are more efficient than standard quasi-Newton ones. Hence, if used in a subspace with random basis indices, they can be more efficient than limited memory quasi-Newton methods for small- up to large-scale problems.

Using S and \(S^c\) generated and updated by SRFD, we construct a new subspace Gauss-Newton direction by

$$\begin{aligned} (p_{\mathop {\mathrm{sn}}})_{i}{:}{=}\,0 \ \ \text{ for } i\in S^c \text{ and } \ \ (p_{\mathop {\mathrm{sn}}})_S{:}{=}-(J_{:S}^TJ_{:S})^{-1}J_{:S}^TE. \end{aligned}$$
(7)
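In MATLAB, the subspace step of (7) can be obtained from a least squares solve rather than by forming the normal equations; this is a minimal sketch assuming \(J_{:S}\) (stored as J_S) has full column rank.

    p_sn    = zeros(n, 1);   % zero components outside the subspace
    p_sn(S) = -(J_S\E);      % least squares solution of J_S*q = E, i.e. (7)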

2.2 New non-monotone and adaptive strategies

In this subsection, a new non-monotone term – stronger than the objective function f – is constructed and a new adaptive radius formula to update \(\varDelta \) is derived from it. They help LMLS in finite precision arithmetic to quickly reach the minimizer in the cases where the valley is deep with a small creek at the bottom and steep sides.

Our non-monotone term is updated not only at successful iterations but also at the unsuccessful iterations that may occur before a successful iteration is found. The update is based on an estimated increase in f, defined below, which depends on whether a decrease in f has been found. This yields a rather strong non-monotone term when no decrease in f has been found and a rather weak one otherwise. A strong non-monotone term increases the chance of accepting a point with a better function value, or at least with a little progress in the function value, instead of repeatedly solving trust region subproblems at high computational cost.

We denote by X a list of best points and by F the list of their function values. Let \(m_{\mathop {\mathrm{rs}}}\) be the maximal number of good points saved in X. X and F are updated by updateXF, which works as follows: as long as \(m_{\mathop {\mathrm{rs}}}\) is not exceeded, points with good function values are appended to X and their function values to F; otherwise, the worst stored point and its function value are replaced by the new best point and its function value, respectively. A minimal sketch is given below.
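A minimal MATLAB sketch of updateXF, under the assumptions that X stores points as columns and that a stored point is only overwritten when the new point improves on the worst stored value; the function name updateXF_sketch is illustrative.

    % Sketch of updateXF: X holds points as columns, F their function values.
    function [X, F] = updateXF_sketch(X, F, x_new, f_new, m_rs)
    if size(X, 2) < m_rs
        X = [X, x_new];  F = [F, f_new];     % memory not yet full: append
    else
        [f_worst, j] = max(F);               % worst stored point
        if f_new < f_worst
            X(:, j) = x_new;  F(j) = f_new;  % replace it by the new point
        end
    end
    end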

Let \(\gamma _t\in (0,1)\), \(\underline{\gamma }\in (0,1)\), \(\gamma ^{{\mathrm{init}}}>0\), and \(\overline{\gamma }>1\) be the tuning parameters and let

$$\begin{aligned} f_{\max }^k {:}{=} \max _{i=1:m_{\mathop {\mathrm{rs}}}}\{F^k_i\} \ \ \hbox { for all }k=0,1,2,\cdots . \ \ \end{aligned}$$
(8)

Before the new non-monotone term is constructed, an estimated increase in f is computed by

$$\begin{aligned} \delta _f^k{:}{=} \left\{ \begin{array}{ll} \gamma ^{{\mathrm{init}}}|f^0| &{}\hbox { if }k=0, f^0 \in (0,\infty ), \\ 1 &{}\hbox { if }k=0, f^0\in \{-\infty ,0,\infty \},\\ \displaystyle \frac{1}{\overline{\gamma }}(f^{k-1}-f(x^{k-1}+ p^{k-1})) &{}\hbox { if }k\ge 1, f(x^{k-1}+ p^{k-1}) <f^{k-1},\\ \max (\overline{\gamma }\delta _f^{k-1},\underline{\gamma }(|f(x^{k-1}+ p^{k-1})|+|f_{\max }^k|)) &{}\hbox { if }k\ge 1, f(x^{k-1}+ p^{k-1})\ge f^{k-1}. \end{array}\right. \end{aligned}$$
(9)

Accordingly, the new non-monotone formula is defined by

$$\begin{aligned} f_{\mathop {\mathrm{nm}}}^k{:}{=}\left\{ \begin{array}{ll} f^0 &{}\hbox { if }k=0, \\ f^k+\delta _f^k &{}\hbox { if }k\ge 1 \end{array}\right. \end{aligned}$$
(10)

and the new adaptive radius is constructed by

$$\begin{aligned} \varDelta _{\mathop {\mathrm{nm}}}^k {:}{=} \lambda ^k \sqrt{ f_{\mathop {\mathrm{nm}}}^k}, \end{aligned}$$
(11)

where \(\varDelta _{\mathop {\mathrm{nm}}}^0>0\) is a tuning parameter and \(\lambda ^k\) is updated according to

$$\begin{aligned} \lambda ^k{:}{=} \left\{ \begin{array}{ll} \sigma _1\lambda ^{k-1},\ \ &{} \hbox {if }{\rho }_{\mathop {\mathrm{nm}}}^{k-1} < \gamma _t, \\ \min (\overline{\lambda },\max (\sigma _2\lambda ^{k-1},\underline{\lambda })), \ \ &{} \hbox {otherwise}.\\ \end{array} \right. \end{aligned}$$
(12)

Here \(\lambda ^0>0\), \( 0< \sigma _1< 1 < \sigma _2 \), and \( \overline{\lambda }> \underline{\lambda }> 0 \) are the tuning parameters and the new non-monotone trust region ratio is defined by

$$\begin{aligned} \rho _{\mathop {\mathrm{nm}}}^{k-1}{:}{=}\frac{f_{\mathop {\mathrm{nm}}}^{k-1}-f(x^{k-1}+ p^{k-1})}{\widetilde{\mathcal {Q}}^{k-1}(0)-\widetilde{\mathcal {Q}}^{k-1}(p^{k-1})}, \end{aligned}$$
(13)

where \(p^{k-1}\) is a solution of the following trust region subproblem in the subspace with the random basis index set S generated by SRFD:

$$\begin{aligned} \begin{array}{ll} \text {min}~\widetilde{\mathcal {Q}}^{k-1}(p){:}{=}\displaystyle \frac{1}{2}\Vert E^{k-1}+J_{:S}^{k-1}p\Vert ^2=f^{k-1}+p^T g^{k-1}_S+\displaystyle \frac{1}{2}(J_{:S}^{k-1}p)^TJ_{:S}^{k-1}p\\ \text {s.t.}\ \ p \in {\mathbb {R}}^{m_{\mathop {\mathrm{sn}}}}\ \ \text{ and }\ \ \Vert p\Vert \le \varDelta _{\mathop {\mathrm{nm}}}^{k-1}\\ \end{array} \end{aligned}$$
(14)

with \(f^{k-1}{:}{=}f(x^{k-1})\), \(E^{k-1}{:}{=}E(x^{k-1})\), \(J^{k-1}_{:S}{:}{=}J_{:S}(x^{k-1})\), and \(g^{k-1}_S{:}{=}(J_{:S}^{k-1})^TE^{k-1}\).
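A MATLAB sketch of the updates (9)-(12) at an iteration \(k\ge 1\), assuming the quantities of the text are available under the illustrative names f_old (\(f^{k-1}\)), f_trial (\(f(x^{k-1}+p^{k-1})\)), f_new (\(f^k\)), fmax (\(f_{\max }^k\)), and rho_nm (\(\rho _{\mathop {\mathrm{nm}}}^{k-1}\)), and that delta_f and lambda carry their values from the previous iteration:

    % Updates of the estimated increase (9), the non-monotone term (10),
    % lambda (12) and the adaptive radius (11). gbar > 1, 0 < gund < 1,
    % 0 < sigma1 < 1 < sigma2 and lmin < lmax are the tuning parameters.
    if f_trial < f_old
        delta_f = (f_old - f_trial)/gbar;                    % decrease found
    else
        delta_f = max(gbar*delta_f, gund*(abs(f_trial) + abs(fmax)));
    end
    f_nm = f_new + delta_f;                                  % (10)
    if rho_nm < gamma_t
        lambda = sigma1*lambda;                              % unsuccessful
    else
        lambda = min(lmax, max(sigma2*lambda, lmin));        % successful
    end
    Delta_nm = lambda*sqrt(f_nm);                            % (11)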

2.3 A subspace dogleg algorithm

We define the Cauchy step by

$$\begin{aligned}&p_{\mathop {\mathrm{c}}}{:}{=}-t^*g_S, \nonumber \\&t^*{:}{=}\displaystyle \mathop {\mathrm{argmin}}\{ \widetilde{\mathcal {Q}}(-tg_S)\mid t\ge 0,~\Vert tg_S\Vert \le \varDelta _{\mathop {\mathrm{nm}}}\}. \end{aligned}$$
(15)

The goal is to solve the trust region subproblem (14) such that

$$\begin{aligned} \Vert p\Vert \le \varDelta _{\mathop {\mathrm{nm}}}\ \ \text{ and } \ \ \widetilde{\mathcal {Q}}(p)\le \widetilde{\mathcal {Q}}(p_{\mathop {\mathrm{c}}}) \end{aligned}$$
(16)

hold. After the subspace Gauss-Newton direction has been computed by (7), if it lies outside the trust region, a subspace dogleg algorithm, called subDogleg, is used; it produces an estimated step enforcing (16).

The model function \(\widetilde{\mathcal {Q}}\) is reduced by the subspace Gauss–Newton step (7) if \(dq_{\mathop {\mathrm{sn}}} {:}{=} \widetilde{\mathcal {Q}}(0)-\widetilde{\mathcal {Q}}(p_{\mathop {\mathrm{sn}}})>0\). subDogleg first checks whether \(dq_{\mathop {\mathrm{sn}}}>0\) or not. Then we have one of the following cases:

Case 1. If \(dq_{\mathop {\mathrm{sn}}}>0\), the scaled steepest descent step

$$\begin{aligned} p_{\mathop {\mathrm{sd}}}{:}{=}-\displaystyle \frac{g^T_Sg_S}{(J_Sg_S)^T(J_Sg_S)}g_S \end{aligned}$$
(17)

is computed. If \(p_{\mathop {\mathrm{sd}}}\) lies outside the trust region, an estimated solution of (3) is the Cauchy step computed by (15); otherwise it is the dogleg step

$$\begin{aligned} p_{\mathop {\mathrm{dg}}}{:}{=}p_{\mathop {\mathrm{sd}}}+t(p_{\mathop {\mathrm{sn}}}-p_{\mathop {\mathrm{sd}}}); \end{aligned}$$
(18)

In both cases the chosen step lies on the trust region boundary. Here t is found by solving the equation \(\Vert p_{\mathop {\mathrm{sd}}}+t(p_{\mathop {\mathrm{sn}}}-p_{\mathop {\mathrm{sd}}})\Vert =\varDelta _{\mathop {\mathrm{nm}}}\). If the condition \(dp{:}{=}(p_{\mathop {\mathrm{sd}}})^T(p_{\mathop {\mathrm{sn}}}-p_{\mathop {\mathrm{sd}}})\le 0\) holds, the positive root is computed by

$$\begin{aligned} t {:}{=} \frac{-dp+\sqrt{dp^2+\Vert p_{\mathop {\mathrm{sn}}}-p_{\mathop {\mathrm{sd}}}\Vert ^2(\varDelta _{\mathop {\mathrm{nm}}}^2-\Vert p_{\mathop {\mathrm{sd}}}\Vert ^2)}}{\Vert p_{\mathop {\mathrm{sn}}}-p_{\mathop {\mathrm{sd}}}\Vert ^2}\in (0,1). \end{aligned}$$
(19)

Otherwise, t is computed by

$$\begin{aligned} t {:}{=} \frac{\varDelta _{\mathop {\mathrm{nm}}}^2-\Vert p_{\mathop {\mathrm{sd}}}\Vert ^2}{dp+\sqrt{dp^2+\Vert p_{\mathop {\mathrm{sn}}}-p_{\mathop {\mathrm{sd}}}\Vert ^2(\varDelta _{\mathop {\mathrm{nm}}}^2-\Vert p_{\mathop {\mathrm{sd}}}\Vert ^2)}}\in (0,1); \end{aligned}$$
(20)

e.g., see Nielsen 2012.

Case 2. If \(dq_{\mathop {\mathrm{sn}}}\le 0\), the model function \(\widetilde{\mathcal {Q}}\) is convex since its Hessian matrix \(J_{:S}^TJ_{:S}\) is symmetric and positive semidefinite. An estimated solution of (3) is then \(p_{\mathop {\mathrm{sd}}}\) computed by (17) if it lies inside the trust region; otherwise it is the step \(p{:}{=}\varDelta _{\mathop {\mathrm{nm}}} (p_{\mathop {\mathrm{sd}}}/\Vert p_{\mathop {\mathrm{sd}}}\Vert )\), obtained by scaling \(p_{\mathop {\mathrm{sd}}}\) to the trust region boundary.
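The following MATLAB sketch of subDogleg works in subspace coordinates (all vectors have dimension \(m_{\mathop {\mathrm{sn}}}\)); q_sn denotes the subspace part of the Gauss–Newton step (7), g_S stands for \(J_{:S}^TE\), and Delta for \(\varDelta _{\mathop {\mathrm{nm}}}\). The branching is a slight simplification of Cases 1 and 2: whenever the dogleg step does not apply, the fallback is the steepest descent direction, scaled to the boundary if necessary.

    Jg    = J_S*g_S;
    p_sd  = -((g_S'*g_S)/(Jg'*Jg))*g_S;             % scaled steepest descent (17)
    dq_sn = -(q_sn'*g_S) - 0.5*norm(J_S*q_sn)^2;    % model decrease of q_sn
    if dq_sn > 0 && norm(p_sd) < Delta              % Case 1 with p_sd inside
        d  = q_sn - p_sd;  dp = p_sd'*d;
        c  = norm(d)^2*(Delta^2 - norm(p_sd)^2);
        if dp <= 0
            t = (-dp + sqrt(dp^2 + c))/norm(d)^2;                 % (19)
        else
            t = (Delta^2 - norm(p_sd)^2)/(dp + sqrt(dp^2 + c));   % (20)
        end
        p = p_sd + t*d;                             % dogleg step (18)
    elseif norm(p_sd) <= Delta
        p = p_sd;                                   % Case 2 with p_sd inside
    else
        p = (Delta/norm(p_sd))*p_sd;                % scaled to the boundary
    end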

2.4 Broyden-like technique

Before a successful iteration is found by a trust region algorithm, the trust region subproblems may be solved many times with high computational cost. Instead, our idea is to use a new algorithm based on the previous best points in the hope of finding a point with good function value.

Whenever LMLS cannot decrease the function value, a new Broyden-like technique, called BroydenLike, is used in the hope of obtaining a decrease in the function value. Let \(x^1,\ldots ,x^{m_{\mathop {\mathrm{rs}}}}\) be the \(m_{\mathop {\mathrm{rs}}}\) best points stored in X. Then a point in the affine space spanned by these points has the form

$$\begin{aligned} x_z{:}{=}Xz, \ \ z\in {\mathbb {R}}^{m_{\mathop {\mathrm{rs}}}}, \ \ e^Tz=1, \end{aligned}$$
(21)

where \(e\in {\mathbb {R}}^{m_{\mathop {\mathrm{rs}}}}\) is a vector all of whose components are one. Given \(B{:}{=}\left( \begin{array}{l} X\\ e\end{array}\right) \), the linear approximation \(E(x_z)\approx Bz\) is used to replace (1) by the surrogate problem

$$\begin{aligned} \begin{array}{ll} \min &{}\displaystyle \frac{1}{2}\Vert B z\Vert ^2_2 \\ \mathop {\mathrm{s.t.~}}&{} e^Tz=1. \end{array} \end{aligned}$$
(22)

This is an equality constrained convex quadratic problem in \(m_{\mathop {\mathrm{rs}}}\) variables and hence can be solved in closed form. A QR factorization \(B=QR\) is computed, where Q has orthonormal columns and R is a square upper triangular matrix. Setting \(Z{:}{=}R^{-1}\), we make the substitution \(z{:}{=}Zy\), define \(a^T{:}{=}e^TZ\), and obtain the \(m_{\mathop {\mathrm{rs}}}\)-dimensional minimal norm linear feasibility problem

$$\begin{aligned} \begin{array}{ll} \min &{}\displaystyle \frac{1}{2}\Vert y\Vert ^2_2\\ \mathop {\mathrm{s.t.~}}&{} a^Ty=1 \end{array} \end{aligned}$$
(23)

whose solution is \(y{:}{=}a/\Vert a\Vert _2^2\). Hence, a new trial point for the next algorithm is

$$\begin{aligned} x^{{\mathrm{trial}}}{:}{=}x_z{:}{=}XR^{-1}a/\Vert a\Vert _2^2. \end{aligned}$$
(24)
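A MATLAB sketch of the computation (21)-(24), assuming the stored points are the columns of X and that the stacked matrix B has full column rank so that R is nonsingular:

    m_rs    = size(X, 2);
    B       = [X; ones(1, m_rs)];        % stacked matrix used in (22)
    [~, R]  = qr(B, 0);                  % economy QR: R square upper triangular
    a       = (ones(1, m_rs)/R)';        % a' = e'*R^{-1}
    y       = a/norm(a)^2;               % minimal norm solution of a'*y = 1, cf. (23)
    z       = R\y;                       % z = R^{-1}*y
    x_trial = X*z;                       % trial point (24)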

BroydenLike tries to find a point with a better function value when no decrease in f is found along p. It takes X, F, x, E, B, f, \(\delta _f\), and \(f_{\mathop {\mathrm{nm}}}\) as input and uses the following tuning parameters:

  • \(\underline{\gamma }\in (0,1)\) (tiny factor for adjusting \(\delta _f\)),

  • \(\overline{\gamma }>1\) (parameter for expanding \(\delta _f\)),

  • \(\gamma _s\) (tiny parameter for the finite difference step size), \(m_{\mathop {\mathrm{rs}}}\) (memory for affine space),

  • \(0<\gamma _r<1\) (tiny parameter for adjusting the scaled random directions).

It returns a new value of \(\delta _f\) and \(f_{\mathop {\mathrm{nm}}}\) (and f, X, F, E, S, and \(J_S\) if a decrease in f is found) as output.

Figure a: pseudocode of BroydenLike

Since the Jacobian matrix may be singular or ill-conditioned, a new point may move toward a maximizer or a saddle point. To remedy this disadvantage, BroydenLike does not accept such a point when its function value is the largest one.

2.5 A limited memory trust region algorithm

We now describe all steps of the new limited memory algorithm, called LMLS, which uses the new subspace direction (7), the new non-monotone term (10), the new adaptive radius formula (11), and BroydenLike.

In each iteration, an estimated solution of the trust region subproblem (14) is computed. Whenever the condition \(\rho _{\mathop {\mathrm{nm}}}\ge \gamma _t\) holds, the iteration is successful: the non-monotone term (10) and the adaptive radius formula (11) are updated and the Jacobian matrix is re-estimated in a subspace with random basis indices by SRFD. Otherwise, the iteration is unsuccessful. In this case, BroydenLike is performed in the hope of finding a decrease in the function value. If such a decrease is found, the iteration becomes successful; otherwise it remains unsuccessful, and the radius is reduced and the non-monotone term (10) updated until a decrease in the function value is found and the iteration becomes successful. A drastically simplified sketch of this loop is given below.
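The following self-contained MATLAB sketch only illustrates the control flow of successful and unsuccessful iterations; as a drastic simplification, full-space forward differences, plain Gauss–Newton steps truncated to the trust region, the monotone ratio, and a fixed radius update replace SRFD, subDogleg, the non-monotone strategy of Sect. 2.2, and BroydenLike. The Rosenbrock residual used here is only an example.

    E_fun = @(x) [10*(x(2) - x(1)^2); 1 - x(1)];   % example residual (assumption)
    x = [-1.2; 1];  n = numel(x);
    Delta = 1;  gamma_t = 0.1;  nfmax = 200;  nf = 0;
    E = E_fun(x);  f = 0.5*norm(E)^2;
    while nf < nfmax && norm(E) > 1e-10
        J = zeros(numel(E), n);                    % forward difference Jacobian
        for i = 1:n
            h = 1e-7*max(abs(x(i)), 1);
            xe = x;  xe(i) = xe(i) + h;
            J(:, i) = (E_fun(xe) - E)/h;  nf = nf + 1;
        end
        p = -(J\E);                                % Gauss-Newton step
        if norm(p) > Delta, p = (Delta/norm(p))*p; end
        Et = E_fun(x + p);  ft = 0.5*norm(Et)^2;  nf = nf + 1;
        dq = -(p'*(J'*E)) - 0.5*norm(J*p)^2;       % predicted reduction
        if (f - ft)/dq >= gamma_t                  % successful iteration
            x = x + p;  E = Et;  f = ft;  Delta = 2*Delta;
        else                                       % unsuccessful iteration
            Delta = 0.5*Delta;
        end
    end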

LMLS solves the unconstrained nonlinear black box least squares problem (1). The algorithm takes the initial point \(x^0\) and the maximal number of function evaluations (nfmax) as input. It uses the following tuning parameters:

  • \(m_{\mathop {\mathrm{sn}}}\) (subspace dimension),

  • \(\gamma _t\in (0,1)\) (parameter for trust region),

  • \( 0< \sigma _1< 1 < \sigma _2 \) (parameters for updating \(\lambda \)),

  • \(\gamma ^{{\mathrm{init}}}\) (parameter for updating the initial \(\delta _f\)),

  • \(\underline{\gamma }\in (0,1)\) (tiny factor for adjusting \(\delta _f\)),

  • \(\overline{\gamma }>1\) (parameter for expanding \(\delta _f\)),

  • \(\underline{\lambda }\) (lower bound for \(\lambda \)),

  • \(\overline{\lambda }\) (upper bound for \(\lambda \)),

  • \(\gamma _s\) (tiny parameter for adjusting finite difference step sizes),

  • \(0<\gamma _r<1\) (tiny parameter for adjusting random approximation coordinate directions).

It returns an approximate solution \(x^{{\mathrm{best}}}\) of the nonlinear least squares problem (1) as output.

Figure b: pseudocode of LMLS

LMLS was implemented in MATLAB; the source code is available at

http://www.mat.univie.ac.at/~neum/software/LMLS.

3 Numerical results

We updated the test environment constructed by Kimiaei and Neumaier (2019) to use the test problems suggested by Lukšan et al. (2018). LMLS is compared with nonlinear least squares and unconstrained optimization solvers; for some of them, listed in the first subsection, we had to choose options different from the default to make them competitive.

3.1 Codes compared

Least squares solvers:

  • CoDoSol is a solver for constrained nonlinear systems of equations, obtained from

http://codosol.de.unifi.it.

It combines the Newton method with a trust region strategy; see Bellavia et al. (2012). The following options were used:

$$\begin{aligned} \mathtt{parms}=[\mathtt{maxit},\mathtt{maxnf},\mathtt{tr},\mathtt{delta},\mathtt{scaling},\mathtt{outflag}]=[\mathtt{inf},\mathtt{nfmax},1,-1,0,2]. \end{aligned}$$

Note that \(\mathtt{delta}=-1\) means that \(\varDelta _0=1\). According to our numerical results, CoDoSol was not sensitive to the initial radius; hence, the default was used.

  • STRSCNE is a solver for constrained nonlinear systems of equations, obtained from

http://codosol.de.unifi.it.

It combines the Newton method with a trust region procedure; see Bellavia et al. (2004). The option

$$\begin{aligned} \mathtt{parms}=[\mathtt{maxit},\mathtt{maxnf},\mathtt{delta},\mathtt{outflag}]=[\mathtt{inf},\mathtt{nfmax},-1,0] \end{aligned}$$

was used. Note that \(\mathtt{delta}=-1\) means that \(\varDelta _0=1\). According to our numerical results, STRSCNE was not sensitive to the initial radius; hence, the default was used for \(\varDelta _0\).

  • NLEQ1 is a damped affine invariant Newton method for nonlinear systems, obtained from

    http://elib.zib.de/pub/elib/codelib/nleq1_m/nleq1.m and suggested by Deuflhard (2011) and written by Nowak and Weimann (1990). The default tuning parameters were used; only \(\mathtt{iopt.jacgen}=2\), \(\mathtt{iopt.qrank1}=1\), and \(\mathtt{wk.fmin = 1e-50}\) were selected.

  • NLEQ2 is the same as NLEQ1; only \(\mathtt{iopt.qrank1}=0\) was selected.

  • MINLBFGS is an L-BFGS algorithm with line search, obtained from

    https://www.tensorlab.net/

    and written by Sorber et al. (2012). The default parameters were used, except for m. MINLBFGS1 and MINLBFGS2 are MINLBFGS with \(m=\min (n,30)\) and \(m=\min (n,100)\), respectively.

  • MINLBFGSDL is an L-BFGS algorithm with a dogleg trust region, used with the option set \(\mathtt{MaxIter = nfmax}\), \(\mathtt{TolFun = 1e-50}\), \(\mathtt{TolX = 1e-50}\). The default values were used for the remaining parameters; \(\mathtt{PlaneSearch}\) and M were varied. MINLBFGSDL1, MINLBFGSDL2, MINLBFGSDL3, and MINLBFGSDL4 are MINLBFGSDL with

    (1) \(\mathtt{PlaneSearch = false}\) and \(M = \min (30,n)\),

    (2) \(\mathtt{PlaneSearch = false}\) and \(M = \min (100,n)\),

    (3) \(\mathtt{PlaneSearch = true}\) and \(M = \min (30,n)\),

    (4) \(\mathtt{PlaneSearch = true}\) and \(M = \min (100,n)\).

According to our results, MINLBFGSDL1 was the best.

  • MINFNCG is a nonlinear conjugate gradient solver, obtained from

    https://www.tensorlab.net/

    and written by Sorber et al. (2012). We used the following option set

    $$\begin{aligned} \mathtt{MaxIter = nfmax};\ \ \mathtt{TolFun = 1e-50}; \ \ \mathtt{TolX = 1e-50}. \end{aligned}$$

    The other tuning parameter was \(\mathtt{Beta}\in \{\mathtt{HS},\mathtt{HSm},{} \mathtt{PR},\mathtt{FR},\mathtt{PRm},\mathtt{SD}\}\). MINFNCG1, MINFNCG2, MINFNCG3, MINFNCG4, MINFNCG5, and MINFNCG6 are MINFNCG with \(\mathtt{Beta}=\mathtt{HS}\), \(\mathtt{Beta}=\mathtt{HSm}\), \(\mathtt{Beta}=\mathtt{PR}\), \(\mathtt{Beta}=\mathtt{FR}\), \(\mathtt{Beta}=\mathtt{PRm}\), and \(\mathtt{Beta}=\mathtt{SD}\), respectively.

  • NLSQERR is a global unconstrained Gauss-Newton method with an error oriented convergence criterion and an adaptive trust region strategy (Deuflhard 2011), obtained from

    http://elib.zib.de/pub/elib/codelib/NewtonLib/index.html

    The following options were used

    $$\begin{aligned}&\mathtt{iniscalx} = 0;\ \ \mathtt{rescalx} = 0; \ \ \mathtt{xthrsh} = \mathop {\mathrm{ones}}(n, 1); \\&\mathtt{xtol} = \mathtt{1.e-50};\ \ \mathtt{ftol} = \mathtt{1.e-50};\ \ \mathtt{kmax} = \mathtt{nfmax};\\&\mathtt{printmon} = 2;\ \ \mathtt{printsol} = 1;\ \ \mathtt{fid} = 1; \ \ \mathtt{numdif} = 1; \\&\mathtt{lambda0} = \mathtt{eps}; \ \ \mathtt{lambdamin} = \mathtt{1e-50}. \end{aligned}$$
  • NMPNTR, a non-monotone projected Newton trust region method, is a bound constrained solver (Kimiaei 2016). NMPNTR1, NMPNTR2, NMPNTR3, and NMPNTR4 are NMPNTR with \(\varDelta _0=1\), \(\varDelta _0=10\), \(\varDelta _0=100\), and \(\varDelta _0=500\), respectively. According to our results, NMPNTR2 was the best.

  • NATRN is a non-monotone trust region algorithm (Amini et al. 2016) using the full finite difference approximation. The subproblem was solved in the same way as in LMLS. NATRN1, NATRN2, NATRN3, and NATRN4 are NATRN with \(\varDelta _0=1\), \(\varDelta _0=10\), \(\varDelta _0=100\), and \(\varDelta _0=500\), respectively. According to our results, NATRN1 had the best performance.

  • NATRLS is a non-monotone line search and trust region algorithm (Amini et al. 2016) using the full finite difference approximation. The subproblem was solved in the same way as in LMLS. NATRLS1, NATRLS2, NATRLS3, and NATRLS4 are NATRLS with \(\varDelta _0=1\), \(\varDelta _0=10\), \(\varDelta _0=100\), and \(\varDelta _0=500\), respectively. According to our results, NATRLS1 had the best performance.

  • LSQNONLIN1, obtained from the MATLAB Optimization Toolbox at

    https://de.mathworks.com/help/optim/ug/lsqnonlin.html

    is a nonlinear least squares solver with the following options:

    options = optimoptions(@lsqnonlin, ‘Algorithm’, ‘levenberg-marquardt’, ‘FiniteDifferenceType’, ‘forward’, ‘MaxIter’, Inf, ‘MaxFunEvals’, nfmax, ‘TolX’, 0, ‘SpecifyObjectiveGradient’, ‘false’).

  • LSQNONLIN2, obtained from the MATLAB Optimization Toolbox at

    https://de.mathworks.com/help/optim/ug/lsqnonlin.html

    is a nonlinear least squares solver with the following options:

    options = optimoptions(@lsqnonlin, ‘Algorithm’, ‘trust-region-reflective’, ‘FiniteDifferenceType’, ‘forward’, ‘MaxIter’, Inf, ‘MaxFunEvals’, nfmax, ‘TolX’, 0, ‘SpecifyObjectiveGradient’, ‘false’).

  • DOGLEG is Powell’s dogleg method for least squares problems, the best algorithm from the immoptibox toolbox (Nielsen 2012), available at

    http://www2.imm.dtu.dk/projects/immoptibox/

    The following option was used

    $$\begin{aligned} \mathtt{opts}=[\varDelta _0,\mathtt{tolg},\mathtt{tolx},\mathtt{tolr},\mathtt{maxeval}]=[\varDelta _0,\mathtt{1e-50},\mathtt{1e-50},\mathtt{1e-50},\mathtt{nfmax}]. \end{aligned}$$

    DOGLEG1, DOGLEG2, DOGLEG3, and DOGLEG4 are DOGLEG with \(\varDelta _0=1\), \(\varDelta _0=10\), \(\varDelta _0=100\), and \(\varDelta _0=500\), respectively. The best version was DOGLEG1.

Unconstrained solvers:

  • FMINUNC, obtained from the MATLAB Optimization Toolbox at

    https://ch.mathworks.com/help/optim/ug/fminunc.html,

    is a standard quasi-Newton algorithm. We used FMINUNC with

    opts = optimoptions(@fminunc, ‘Algorithm’, ‘quasi-newton’, ‘Display’, ‘Iter’, ‘MaxIter’, Inf, ‘MaxFunEvals’, nfmax, ‘TolX’, 0, ‘TolFun’, 0, ‘ObjectiveLimit’, -1e-50).

  • FMINUNC1 is FMINUNC with the limited memory quasi-Newton approximation of Liu and Nocedal (1989). It was added to FMINUNC by the present authors. The same options as for FMINUNC were used; only the memory \(m=10\) was added to the option set.

3.2 Default for tuning parameters of LMLS

The tuning parameters for our new method (LMLS) are chosen as

$$\begin{aligned} \boxed {\begin{array}[c]{llll} m_{\mathop {\mathrm{nm}}}=10;&{} m_{\mathop {\mathrm{rs}}}=10;&{} \gamma _r=10^{-30}; &{} \gamma _p=5\varepsilon _m;\\ \gamma _s=\sqrt{\varepsilon _m};&{} \gamma _t=0.1;&{} \sigma _1=0.5;&{} \lambda ^0=1; \\ \sigma _2=1; &{}\gamma ^{{\mathrm{init}}}=10^{-8};&{} \gamma _m=10^{-30};&{} \underline{\lambda }=10^{-4};\\ \varDelta ^0=10;&{} \varDelta ^{\min }=10^{-6};&{} \overline{\lambda }=10^5.&{}\\ \end{array}} \end{aligned}$$

The remaining tuning parameter \(m_{\mathop {\mathrm{sn}}}\) is varied in the experiment: LMLS1, LMLS2, LMLS3, and LMLS4 are LMLS with \(m_{\mathop {\mathrm{sn}}}=3\), \(m_{\mathop {\mathrm{sn}}}=\min (10,n)\), \(m_{\mathop {\mathrm{sn}}}=\min (30,n)\), and \(m_{\mathop {\mathrm{sn}}}=\min (100,n)\), respectively.
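For reference, the defaults from the box above can be collected in a MATLAB struct as in the following sketch; the field names are illustrative, \(\varepsilon _m\) is the machine precision (eps in MATLAB), and \(m_{\mathop {\mathrm{sn}}}\) is set separately per variant.

    % Default tuning parameters of LMLS collected in a struct
    % (field names are illustrative; values are those of the box above).
    em   = eps;                                   % machine precision
    tune = struct('m_nm', 10, 'm_rs', 10, 'gamma_r', 1e-30, 'gamma_p', 5*em, ...
                  'gamma_s', sqrt(em), 'gamma_t', 0.1, 'sigma1', 0.5, 'lambda0', 1, ...
                  'sigma2', 1, 'gamma_init', 1e-8, 'gamma_m', 1e-30, 'lambda_min', 1e-4, ...
                  'Delta0', 10, 'Delta_min', 1e-6, 'lambda_max', 1e5);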

3.3 Test problems used, initial point, and stopping tests

Test problems suggested by Lukšan et al. (2018) are classified in Table 1 according to whether they are overdetermined (\(r>n\)) or not (\(r=n\)). A shift for these problems is chosen as in Kimiaei and Neumaier (2020), namely \(\chi _i {:}{=} (-1)^{i-1}2/(2+i)\) for all \(i=1,\ldots ,n\). This means that the initial point is chosen as \(x^0_i=\chi _i\) for all \(i=1,\ldots ,n\) and the initial function value is computed by \(f^0{:}{=}f(x^0)\), while the other function values are computed by \(f^\ell {:}{=}f(x^{\ell }+\chi )\) for all \(\ell \ge 1\); see the sketch below.
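A small MATLAB sketch of this choice for a given dimension n; the objective handle f_fun is an assumption used only for illustration.

    chi = ((-1).^(0:n-1) .* (2./(2 + (1:n))))';   % chi_i = (-1)^(i-1)*2/(2+i)
    x0  = chi;                                     % initial point
    f0  = f_fun(x0);                               % initial function value
    % later evaluations are shifted: f_ell = f_fun(x_ell + chi) for ell >= 1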

Table 1 A classification of test problems

We denote by \(\mathtt{nf}\) and \(\mathtt{msec}\) the number of function evaluations and the time in milliseconds, respectively. nfmax and secmax are the corresponding upper bounds, chosen as

$$\begin{aligned} \mathtt{nfmax}\in \left\{ \begin{array}{ll} \{10n,50n,100n,500n\} &{}\hbox { if }1\le n \le 100, \\ \{10n,50n,100n\} &{}\hbox { if }101\le n \le 1000\\ \{10n,100n\} &{}\hbox { if }1001\le n \le 10000 \end{array}\right. \end{aligned}$$

and

$$\begin{aligned} \mathtt{secmax}{:}{=} \left\{ \begin{array}{ll} 300 &{}\hbox { if }1\le n \le 100, \\ 800 &{}\hbox { if }101\le n \le 10000.\\ \end{array}\right. \end{aligned}$$

Denote by \(f^0\) the function value of the starting point (common to all solvers), by \(f^{so}\) the best function value found by the solver so, and by \(f^{\mathop {\mathrm{opt}}}\) the best function value known to us. Then the problem counts as solved by the solver so if

$$\begin{aligned} q^{so}{:}{=}(f^{so}-f^{\mathop {\mathrm{opt}}})/(f^0-f^{\mathop {\mathrm{opt}}})\le \left\{ \begin{array}{ll} 10^{-8} &{}\hbox { if }1\le n \le 100, \\ 10^{-3} &{}\hbox { if }101\le n \le 10000. \end{array}\right. \end{aligned}$$

Otherwise, the problem counts as unsolved by it: either nfmax or secmax is exceeded, or the solver fails. A sketch of this test is given below.
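In MATLAB, the test reads as follows, with f_so and f_opt the best function value found by the solver and the best known value, respectively (names chosen here only for illustration):

    q_so = (f_so - f_opt)/(f0 - f_opt);        % relative gain achieved by the solver
    if n <= 100, tol = 1e-8; else, tol = 1e-3; end
    solved = (q_so <= tol);                    % true if the problem counts as solved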

3.4 The efficiency and robustness of a solver

For a given collection S of solvers, the strength of a solver \(so \in S\), relative to an ideal solver that matches on each problem the best solver, is measured for a given cost measure \(c_s\) by the number \(e_{so}\) defined by

$$\begin{aligned} e_{so}{:}{=}\left\{ \begin{array}{ll} \displaystyle \frac{\displaystyle \min _{s\in S}c_{s}}{c_{so}},~~~&{} \quad \hbox {if the solver }so\hbox { solves the problem},\\ 0,~~~&{} \quad \hbox {otherwise}, \end{array} \right. \end{aligned}$$

called the efficiency of the solver so with respect to this cost measure. Two cost measures nf and msec are used.
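For one problem, the efficiencies of all solvers can be computed as in the following sketch; cost is the vector of costs (nf or msec) of the solvers and solved the corresponding logical vector, and, as an assumption, the minimum is taken over the solvers that solved the problem.

    e = zeros(size(cost));              % efficiency 0 for unsolved runs
    best = min(cost(solved));           % cost of the best successful solver
    e(solved) = best./cost(solved);     % efficiencies in (0,1]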

The robustness of a solver is the number of test problems it can solve. Efficiency and robustness together determine which solver is competitive; in fact, the robustness of a solver is more important than its efficiency. We use two different performance plots in terms of the robustness and efficiency of solvers:

  • The first performance plot is the data profile by Moré and Wild (2009) for nf and msec; it shows the percentage of problems solved within a given number of function evaluations or time in milliseconds.

  • The second performance plot is the performance profile by Dolan and Moré (2002) for nf/(best nf) and msec/(best msec); it shows the percentage of problems solved within a factor \(\tau \) of the best solver.

All tables and data/performance profiles are given in Sects. 4.2–4.4. In Sects. 3.5–3.7, we summarize them in two new performance plots.

3.5 Small scale: \(n\in [1,100]\)

A comparison among LMLS1, LMLS2, LMLS3, LMLS4, and the solvers using quasi-Newton approximations is shown in Subfigures (a) and (b) of Fig. 1, which show that

  • LMLS4 using the full estimated Jacobian matrix is the best in terms of the number of solved problems and the nf efficiency;

  • LMLS3, LMLS2, and LMLS1 are more efficient than the solvers using quasi-Newton approximations (FMINUNC, FMINUNC1, MINLBFGS1, and MINLBFGSDL1) in terms of the nf efficiency;

  • FMINUNC and MINLBFGSDL1 are comparable with LMLS3 (only for very large budgets) in terms of the number of solved problems, but LMLS3, LMLS2, and LMLS1 are more efficient than FMINUNC and MINLBFGSDL1 in terms of the nf efficiency not only for very large budgets but also for small up to large budgets.

To determine whether our new non-monotone and adaptive radius strategies are effective, we compare LMLS4 with solvers using other non-monotone and adaptive radius strategies, shown in Subfigures (c) and (d) of Fig. 1. All solvers use the full Jacobian matrix, and the trust region subproblems are solved by the same algorithm. As can be seen, LMLS4 is much more efficient and robust than NATRLS1, NMPNTR2, and NATRN1 in terms of the number of solved problems and the nf efficiency.

We compare LMLS4 with the four well-known solvers LSQNONLIN1, CoDoSol1, NLEQ1, and DOGLEG1, shown in Subfigures (e) and (f) of Fig. 1. It is seen that LMLS4 and CoDoSol1 are the two best solvers in terms of the nf efficiency, while LMLS4 and DOGLEG1 are the two best solvers in terms of the number of solved problems.

Another comparison is among LMLS3, LMLS2, and LMLS1, which use the Jacobian matrix in an adaptive subspace basis index set, and LSQNONLIN1 and NLEQ1, which use the full Jacobian matrix. We conclude from Subfigures (g) and (h) of Fig. 1 that

  • LMLS3 is the best in terms of the number of solved problems and the nf efficiency;

  • LMLS2 is the second best solver in terms of the number of solved problems and the nf efficiency for medium, large, and very large budgets;

  • LMLS1 with the lowest subspace dimension is more efficient than LSQNONLIN1 in terms of the number of solved problems and the nf efficiency; it is even more efficient than NLEQ1 for very large budgets in terms of the nf efficiency.

As a result, LMLS is competitive for small-scale problems in comparison with the state-of-the-art solvers.

3.6 Medium scale: \(n\in [101,1000]\)

In this subsection, we compare LMLS1, LMLS2, LMLS3, and LMLS4, which use estimated Jacobian matrices in a subspace with random basis indices, with FMINUNC using standard BFGS approximations and FMINUNC1 using limited memory BFGS approximations (Fig. 2).

From Subfigures (a) and (b) of Fig. 2, we conclude that

  • LMLS4, LMLS3, and LMLS2 are the three best solvers in terms of the nf efficiency, in that order;

  • LMLS4 is the best solver in terms of the number of solved problems; only for large budgets is FMINUNC the best.

Fig. 1 Performance plots for small-scale problems. a, b: a comparison of limited memory solvers; c, d: a comparison of LMLS in a full subspace with random basis indices with solvers using other non-monotone and adaptive radius techniques; e, f: a comparison of LMLS in a full subspace with random basis indices with other well-known solvers; g, h: a comparison of the low-dimensional LMLS1, LMLS2, and LMLS3 with NLEQ1 and LSQNONLIN1 using the full estimated Jacobian

Fig. 2 a, b: Performance plots for medium-scale problems

Fig. 3 a, b: Performance plots for large-scale problems

Table 2 Results for small-scale and small budget
Table 3 Results for small-scale and medium budget

3.7 Large scale: \(n\in [1001,10000]\)

In this subsection, we compare LMLS1, LMLS2, LMLS3, and LMLS4, which use estimated Jacobian matrices in a subspace with random basis indices, with FMINUNC1 using limited memory BFGS approximations.

In terms of the nf efficiency and the number of solved problems, Subfigures (a) and (b) of Fig. 3 show that

  • LMLS4, LMLS3, and LMLS2 are the three best solvers, in that order;

  • LMLS1 with the lowest subspace dimension is more efficient than FMINUNC1 for small budgets, while FMINUNC1 is more efficient than LMLS1 for large budgets.

4 Additional material for LMLS

4.1 Summarizing tables

In all tables, efficiencies are given as percentages, rounded (towards zero) to integers. Larger efficiencies imply a better average behavior, while a zero efficiency indicates failure.

#100 is the total number of problems for which the solver so was best with respect to nf (\(e_{so}=1=100\%\)). !100 is the total number of problems solved for which the solver so was better than all other solvers with respect to nf.

We denote the time in seconds without the setup time for the objective function by sec. In tables, a sign

  • n indicates that nf \(\ge \mathtt{nfmax}\) was reached.

  • t indicates that sec \(\ge \mathtt{secmax}\) was reached.

  • f indicates that the algorithm failed for other reasons.

\(T_{\mathop {\mathrm{mean}}}\) is the mean of the time in seconds needed by a solver to solve the test problems chosen from the list of test problems \({\mathcal {P}}\), ignoring the times for unsolved problems. It can be a good measure when solvers have approximately the same number of solved problems.

4.2 Tables and data/performance profiles for \(1\le n\le 100\)

This section contains Tables 2, 3, 4, 5 and Figs. 4, 5, 6, 7, summaries of which were discussed in Sect. 3.5.

Table 4 Results for small-scale and large budget
Table 5 Results for small-scale and very large budget
Fig. 4 a, b: Data profiles for nf/(best nf) and msec/(best msec), respectively; \(\rho \) designates the fraction of problems solved within the given number of function evaluations or time in milliseconds. c, d: Performance profiles for nf/(best nf) and msec/(best msec), respectively; \(\rho \) designates the fraction of problems solved within a given factor of the best solver. Problems solved by no solver are ignored

Fig. 5 Details as in Fig. 4

Fig. 6 Details as in Fig. 4

Fig. 7 Details as in Fig. 4

Table 6 Results for medium-scale and small budget
Table 7 Results for medium-scale and medium budget
Table 8 Results for medium-scale and large budget

4.3 Tables and data/performance profiles for \(101 \le n \le 1000\)

This section contains Tables 6, 7, 8 and Figs. 8, 9, 10, summaries of which were discussed in Sect. 3.6.

Fig. 8 Details as in Fig. 4

Fig. 9 Details as in Fig. 4

Fig. 10 Details as in Fig. 4

Table 9 Results for large-scale and small budget
Table 10 Results for large-scale and large budget
Fig. 11 Details as in Fig. 4

4.4 Tables and data/performance profiles for \(1001 \le n \le 10000\)

This section contains Tables 9, 10 and Figs. 11, 12, summaries of which were discussed in Sect. 3.7.

Fig. 12 Details as in Fig. 4