A new limited memory method for unconstrained nonlinear least squares

This paper suggests a new limited memory trust region algorithm for large unconstrained black box least squares problems, called LMLS. Main features of LMLS are a new non-monotone technique, a new adaptive radius strategy, a new Broyden-like algorithm based on the previous good points, and a heuristic estimation for the Jacobian matrix in a subspace with random basis indices. Our numerical results show that LMLS is robust and efficient, especially in comparison with solvers using traditional limited memory and standard quasi-Newton approximations.


Introduction
In this paper, we consider the unconstrained nonlinear least squares problem

min_{x ∈ R^n} f(x) := (1/2) ||E(x)||_2^2, (1)

with high-dimensional x ∈ R^n and a continuously differentiable, possibly expensive residual function E : R^n → R^r (r ≥ n). We assume that no derivative information is available.

Related work
In recent years, a huge amount of literature on least squares and its applications has appeared. Here we mention only a useful book and a useful paper: • An excellent book (2000), covering both algorithms and their analysis. • An excellent paper by Yuan (2011), covering Levenberg-Marquardt algorithms, quasi-Newton algorithms, and trust region algorithms together with their local analysis without a nonsingularity assumption.
Derivative free unconstrained nonlinear black box least squares solvers can be classified in two ways: according to how the Jacobian matrix is estimated, and according to whether they are based on line search or on trust region techniques: • Quasi-Newton approximation. Sorber et al. (2012) introduced MINLBFGS (a limited memory BFGS algorithm) and MINLBFGSDL (a trust region algorithm using a dogleg algorithm and a limited memory BFGS approximation). • Finite difference approximation. There are many trust region methods using finite differences to estimate the Jacobian matrix, such as CoDoSol and STRSCNE by Bellavia et al. (2004, 2012), NMPNTR by Kimiaei (2016), NATRN and NATRLS, LSQNONLIN from the MATLAB Optimization Toolbox, NLSQERR (an adaptive trust region strategy) by Deuflhard (2011), and DOGLEG by Nielsen (2012). They are suitable for small- and medium-scale problems. Line search methods using finite difference approximations are NLEQ (a damped affine invariant Newton method) by Nowak and Weimann (1990) and MINFNCG (a family of nonlinear conjugate gradient methods) by Sorber et al. (2012).
FMINUNC from the MATLAB Optimization Toolbox is an efficient solver for small- and medium-scale problems. It uses finite differences to estimate the gradient vector and a standard quasi-Newton method to estimate the Hessian matrix. In fact, FMINUNC disregards the least squares structure and only has access to function values. Nevertheless, it will be shown that FMINUNC is more efficient than LSQNONLIN, which uses the least squares structure.
To solve the least squares problem (1), trust region methods use linear approximations of the residual vector to build surrogate quadratic models whose accuracy is increased by restricting the feasible points to a trust region around the current iterate. These methods use a computational measure to identify whether the agreement between the actual reduction of the objective function and the predicted reduction of the surrogate quadratic model is good or not. If this agreement is good, the iteration is called successful and the trust region radius is expanded; otherwise, the iteration is called unsuccessful and the trust region radius is reduced; for more details see Conn et al. (2000) and Nocedal and Wright (1999).
The efficiency of trust region methods depends on how the trust region radius is updated (see, e.g., Ahookhosh et al. 2013; Kimiaei 2014a, b, 2015, 2017; Fan 2006; Fan and Pan 2009, 2010; Yu and Pu 2008) and on whether non-monotone techniques are applied (see, e.g., Ahookhosh and Amini 2011; Ahookhosh et al. 2013, 2015; Deng et al. 1993; Grippo et al. 1986; Grippo and Sciandrone 2007; Kimiaei 2016, 2017; Yu and Pu 2008). Rounding errors may lead to two problems: (i) The model function may not decrease numerically for some iterations. In this case, if there is no decrease in the function value for such iterations, the trust region radius may be expanded several times unnecessarily. (ii) The model function may decrease numerically while the objective function does not, which happens when the iterates are near a valley that is deep with a small creek at the bottom and steep sides. In this case, the trust region radius may be reduced many times, leading to quite a small radius, or even to a failure.
Non-monotone techniques can be used in the hope of overcoming the second problem.

Overview of the new method
We suggest in Sect. 2 a new trust region-based limited memory algorithm for unconstrained black box least squares problems, called LMLS. This algorithm uses • a non-monotone ratio and an adaptive radius formula to quickly reach the minimizer when the valley is narrow; • a Broyden-like algorithm to get a decrease in the function value when the trust region radius is very small and the iteration is unsuccessful; • a finite difference approximation in a subspace with random basis indices to estimate the Jacobian matrix; • either a Gauss-Newton or a dogleg algorithm in a subspace with random basis indices to solve the trust region subproblems.
Numerical results for small- to large-scale problems are given in Sect. 3, showing that the new method is suitable for large-scale problems and is more robust and efficient than solvers using limited memory and standard quasi-Newton approximations.

The trust region method
In this section, we construct an improved trust region algorithm for handling problems in high dimensions: • In Sect. 2.1 a Gauss-Newton direction in a subspace with random basis indices is introduced. • In Sect. 2.2 a non-monotone term and an adaptive technique are constructed to quickly reach the minimizer in the presence of a narrow valley. • In Sect. 2.3 a dogleg algorithm in a subspace with random basis indices is discussed. • In Sect. 2.4 a Broyden-like technique is suggested based on the old best points. • In Sect. 2.5 our algorithm using new enhancements is introduced.
We write J(x) for the Jacobian matrix of the residual vector E at x. Then the gradient vector is g(x) := ∇f(x) := J(x)^T E(x) and the Hessian matrix is G(x) := J(x)^T J(x) + Σ_{i=1}^r E_i(x) ∇²E_i(x). If the residual vector E(x) is small, the second term in G(x) is small. Hence we approximate G(x) by the Gauss-Newton Hessian matrix J(x)^T J(x) and define the quadratic surrogate objective function Q(p) := f + g^T p + (1/2) p^T J^T J p, where f := f(x), E := E(x), J := J(x), g := g(x) := J^T E. We denote by A_{:k} the kth column of a matrix A. A trust region method finds a minimizer of the constrained problem min_p Q(p) subject to ||p|| ≤ Δ, whose constraint restricts the feasible points by the trust region radius Δ > 0. This problem (3) is called the trust region subproblem. Given a solution p of (3), we define the actual reduction in the objective function by ared := f(x) − f(x + p) and the predicted reduction in the model function by pred := Q(0) − Q(p). The agreement between the actual and predicted reduction around the current iterate x is measured by the monotone trust region ratio ρ := ared/pred. If this agreement is good according to a heuristic formula discussed in Sect. 2.2, the iteration is called successful, x + p is accepted as a new point, called a best point, and the radius is expanded; otherwise the iteration is called unsuccessful and the radius is reduced.
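Under the scaling f(x) = (1/2)||E(x)||² (which matches g = JᵀE above), the model and reduction quantities can be sketched as follows; this is an illustrative computation, not the paper's code:

```python
import numpy as np

def gn_quantities(E, J, p):
    """Gauss-Newton surrogate Q(p) = f + g^T p + (1/2) p^T (J^T J) p with
    f = ||E||^2 / 2 (a common scaling; the paper's may differ), and the
    predicted reduction pred = Q(0) - Q(p)."""
    f = 0.5 * (E @ E)          # current objective value
    g = J.T @ E                # gradient of f at the current point
    Jp = J @ p
    Qp = f + g @ p + 0.5 * (Jp @ Jp)   # model value at p
    pred = f - Qp              # predicted reduction, since Q(0) = f
    return f, g, Qp, pred
```

The trust region ratio is then ared/pred with ared = f(x) − f(x + p).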

A new subspace Gauss-Newton method
In this subsection, we have two goals: estimating the Jacobian matrix and constructing a Gauss-Newton direction in a subspace with random basis indices.
Let m_sn be the subspace dimension. The Jacobian matrix in a subspace with random basis indices is estimated by a new subspace random finite difference method, called SRFD, using the following steps: (1) At first, an initial subspace basis index set S is chosen as a random subset of {1, . . . , n} with m_sn members, and its complement is S^c := {1, . . . , n} \ S.
(2) Next, if the complement S^c_old of the old subspace basis index set is not empty, a new index set I is identified before the new subspace basis index set is determined: if S^c_old has at least m_sn members, I is a random subset of {1, . . . , |S^c_old|} with m_sn members; otherwise, I is a permutation of {1, . . . , |S^c_old|}. Then the new subspace basis index set is S := S^c_old(I) and its complement is S^c := S^c_old \ S. If S^c_old is empty, the subspace basis index set and its complement are restarted and chosen in the same way as the initial subspace basis index set and its complement, respectively.
(3) For any i ∈ S, • the step size is computed by a rule involving a tiny factor 0 < γ_s < 1 and sign(x_i), where sign(x_i) takes one of the values −1 (if x_i < 0), 0 (if x_i = 0), and 1 (if x_i > 0); • the random approximation coordinate direction p discussed in Kimiaei (2020) is used, with the difference that its ith component is updated by the finite difference quotient.

It is well known that standard quasi-Newton methods are more robust than limited memory quasi-Newton ones, but they cannot handle problems in high dimensions; for standard quasi-Newton methods, see Dennis and Moré (1977), Dennis and Walker (1981), Nocedal (1992), and Schnabel (1989), and for limited memory quasi-Newton methods, see Liu and Nocedal (1989), Nazareth (1979), and Nocedal (1992). On the other hand, finite difference methods are more efficient than standard quasi-Newton ones. Hence, if used in a subspace with random basis indices, they can be more efficient than limited memory quasi-Newton methods for small- up to large-scale problems.
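The index bookkeeping of steps (1) and (2) and the coordinate-wise finite differences of step (3) can be sketched as follows. This is a simplified sketch: the step-size rule below is not the paper's exact rule with γ_s and sign(x_i), and zero-based indexing replaces {1, . . . , n}:

```python
import numpy as np

def srfd_indices(n, m_sn, S_c_old, rng):
    """Pick the next random subspace index set S (at most m_sn members),
    cycling through the unused indices so all coordinates get covered."""
    S_c_old = np.asarray(S_c_old, dtype=int)
    if S_c_old.size == 0:                       # restart as at initialization
        S = rng.choice(n, size=m_sn, replace=False)
        S_c = np.setdiff1d(np.arange(n), S)
    else:                                       # draw from the unused indices
        k = min(m_sn, S_c_old.size)
        I = rng.choice(S_c_old.size, size=k, replace=False)
        S = S_c_old[I]
        S_c = np.setdiff1d(S_c_old, S)          # S^c := S^c_old \ S
    return S, S_c

def srfd_jacobian_cols(E, x, S, gamma_s=1e-6):
    """Forward differences on the selected coordinates only: one extra
    residual evaluation per index in S (step size simplified; the sign-0
    case of the paper's rule is replaced by a positive step)."""
    E0 = E(x)
    JS = np.empty((E0.size, len(S)))
    for j, i in enumerate(S):
        h = gamma_s * max(1.0, abs(x[i])) * (1.0 if x[i] >= 0 else -1.0)
        xp = x.copy(); xp[i] += h
        JS[:, j] = (E(xp) - E0) / h
    return JS
```

This costs |S| residual evaluations per Jacobian update instead of n.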
Using S and S c generated and updated by SRFD, we construct a new subspace Gauss-Newton direction by
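A minimal sketch of such a subspace Gauss-Newton direction: the subspace coefficients solve a small linear least squares problem in the columns J_S, and are then scattered into a full-space step that vanishes outside S. The exact formula (7) is not reproduced here; this is the standard construction such a direction would follow:

```python
import numpy as np

def subspace_gauss_newton(E0, JS, S, n):
    """Solve min_q ||J_S q + E||_2 for the subspace coefficients q, then
    embed them as a full-space direction p with p_i = 0 for i outside S."""
    q = np.linalg.lstsq(JS, -E0, rcond=None)[0]  # small least squares solve
    p = np.zeros(n)
    p[np.asarray(S, dtype=int)] = q              # scatter into full space
    return p
```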

New non-monotone and adaptive strategies
In this subsection, a new non-monotone term (stronger than the objective function f) is constructed, and a new adaptive radius formula to update Δ is derived from it. They help LMLS in finite precision arithmetic to quickly reach the minimizer in cases where the valley is deep with a small creek at the bottom and steep sides. Our non-monotone term is updated not only for successful but also for unsuccessful iterations that may occur before a successful iteration is found. This choice is based on an estimated increase in f, defined below, which is updated according to whether a decrease in f is found or not. It lets us generate a somewhat strong non-monotone term when no decrease in f is found and a somewhat weak non-monotone term otherwise. Somewhat strong non-monotone terms increase the chance of finding a point with a better function value, or at least a point with a little progress in the function value, instead of solving trust region subproblems with high computational costs.
We denote by X a list of best points and by F a list of corresponding function values. Let m_rs be the maximum number of good points saved in X. In order to update X and F, we use updateXF, which works as follows. If m_rs is not exceeded, points with good function values are saved in X and their function values in F. Otherwise, the worst point and its function value are found and replaced by the best point and its function value, respectively.
Let γ_t ∈ (0, 1), γ ∈ (0, 1), γ_init > 0, and γ̄ > 1 be tuning parameters. Before the new non-monotone term is constructed, an estimated increase in f is computed. Accordingly, the new non-monotone formula is defined by (10) and the new adaptive radius is constructed by (11), where Δ⁰_nm > 0 is a tuning parameter and λ_k is updated accordingly. Here λ_0 > 0, 0 < σ_1 < 1 < σ_2, and λ̄ > λ > 0 are tuning parameters, and the new non-monotone trust region ratio ρ_nm is defined with p_{k−1} a solution of the trust region subproblem (14) in the subspace with the random basis index set S produced by SRFD.
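Since the exact formulas (10) and (11) are displayed equations in the paper, the sketch below only illustrates the general shape of such a construction: an Ahookhosh/Amini-style convex combination of the worst recent function value and the current one, and a ratio with the non-monotone reference value in the numerator. The names and the mixing parameter eta are illustrative assumptions, not the paper's definitions:

```python
def nonmonotone_term(F_recent, f_cur, eta=0.5):
    """A generic non-monotone reference value: a convex combination of the
    worst recent function value and the current one. The paper's formula
    (10) additionally involves an estimated increase in f."""
    return eta * max(F_recent) + (1.0 - eta) * f_cur

def nonmonotone_ratio(f_nm, f_trial, pred):
    """Non-monotone trust region ratio: the reference value f_nm replaces
    f(x) in the numerator, making acceptance less strict."""
    return (f_nm - f_trial) / pred
```

Because f_nm ≥ f(x), the ratio is larger than the monotone one, so trial points near a narrow valley are rejected less often.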

A subspace dogleg algorithm
We define the Cauchy step by (15). The goal is to solve the trust region subproblem (14) such that the conditions (16) hold. After the subspace Gauss-Newton direction is computed by (7), if it is outside the trust region, a subspace dogleg algorithm, called subDogleg, is used, resulting in an estimated step enforcing (16). The model function Q is reduced if dq_sn := Q(0) − Q(p_sn) > 0. subDogleg first identifies whether dq_sn > 0 or not. Then we have one of the following cases: Case 1. If dq_sn > 0, the scaled steepest descent step (17) is computed. If it is outside the trust region, an estimated solution of (3) is the Cauchy step computed by (15); otherwise, it is the dogleg step (18); in both cases the step lies on the trust region boundary. Here t is found by solving the equation ||p_sd + t(p_sn − p_sd)|| = Δ_nm: if the condition dp := (p_sd)^T (p_sn − p_sd) ≤ 0 holds, the positive root is computed directly from the quadratic formula; otherwise, t is computed from a numerically more stable equivalent expression; see, e.g., Nielsen (2012). Case 2. If dq_sn ≤ 0, the model function Q is still convex since the matrix J_S^T J_S is symmetric and positive semidefinite. An estimated solution of (3) is either p_sd computed by (17), if it is inside the trust region, or the scaled step p := Δ_nm (p_sd/||p_sd||) otherwise.
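Case 1 follows the classical dogleg geometry; a minimal full-space sketch (assuming B = JᵀJ is nonsingular for the Gauss-Newton step, and collapsing the stable two-branch root computation into the plain quadratic formula):

```python
import numpy as np

def dogleg(g, B, Delta):
    """Classical dogleg step for min g^T p + 0.5 p^T B p, ||p|| <= Delta,
    with B = J^T J symmetric positive definite (a textbook sketch of
    subDogleg's Case 1, not the paper's subspace implementation)."""
    p_sn = np.linalg.solve(B, -g)              # (Gauss-)Newton step
    if np.linalg.norm(p_sn) <= Delta:
        return p_sn                            # interior: take the full step
    p_sd = -(g @ g) / (g @ B @ g) * g          # scaled steepest descent step
    if np.linalg.norm(p_sd) >= Delta:
        return -Delta * g / np.linalg.norm(g)  # Cauchy step on the boundary
    # find t with ||p_sd + t (p_sn - p_sd)|| = Delta (positive root)
    d = p_sn - p_sd
    a, b, c = d @ d, 2.0 * (p_sd @ d), p_sd @ p_sd - Delta ** 2
    t = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return p_sd + t * d
```

The returned step always lies inside or on the trust region boundary.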

Broyden-like technique
Before a successful iteration is found by a trust region algorithm, the trust region subproblems may be solved many times with high computational cost. Instead, our idea is to use a new algorithm based on the previous best points in the hope of finding a point with good function value.
Whenever LMLS cannot decrease the function value, a new Broyden-like technique, called BroydenLike, is used in the hope of getting a decrease in the function value. Let x_1, . . . , x_{m_rs} be the m_rs best points stored in X. Then a point in the affine space spanned by these points has the form of a linear combination with coefficient vector y satisfying e^T y = 1, where e ∈ R^{m_rs} is the vector all of whose components are one.
This is an equality constrained convex quadratic problem in m_rs variables and hence can be solved in closed form. A QR factorization B = QR is computed, where Q is an orthogonal matrix and R is a square upper triangular matrix. Setting Z := R^{−1}, we make the substitution y := Z z, define a^T := e^T Z, and obtain the m_rs-dimensional minimal norm linear feasibility problem a^T z = 1, whose solution is z := a/||a||^2, giving y := Z z. Hence, a new trial point for the next algorithm is obtained. BroydenLike tries to find a point with a better function value when no decrease in f is found along p. It takes X, F, x, E, B, f, δ_f, and f_nm as input and uses the following tuning parameters: γ ∈ (0, 1) (tiny factor for adjusting δ_f), γ̄ > 1 (parameter for expanding δ_f), γ_s (tiny parameter for the finite difference step size), m_rs (memory for the affine space), and 0 < γ_r < 1 (tiny parameter for adjusting the scaled random directions).
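The closed-form solution via QR can be sketched as follows, assuming B stacks the stored residual vectors column-wise (an assumption; the paper defines B in an equation not reproduced here). With B = QR and y = Zz (Z := R⁻¹), ||By|| = ||z||, and the constraint eᵀy = 1 becomes aᵀz = 1 with a := Zᵀe, whose minimal norm solution is z = a/||a||²:

```python
import numpy as np

def broyden_like_point(X_pts, B):
    """Minimize ||B y||_2 subject to e^T y = 1 in closed form via QR, and
    return the trial point X y in the affine span of the stored points.
    X_pts has the stored points as columns; B is assumed column-compatible."""
    Q, R = np.linalg.qr(B)             # B = Q R, R square upper triangular
    Z = np.linalg.inv(R)               # Z := R^{-1}; substitution y = Z z
    a = Z.T @ np.ones(B.shape[1])      # e^T y = 1 becomes a^T z = 1
    z = a / (a @ a)                    # minimal norm solution of a^T z = 1
    y = Z @ z                          # affine weights, sum(y) = 1
    return X_pts @ y, y
```

Since ||By|| = ||QRy|| = ||Ry|| = ||z||, minimizing ||z|| over the constraint plane indeed minimizes the original objective.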
It returns a new value of δ_f and f_nm (and f, X, F, E, S, and J_S if a decrease in f is found) as output, and saves x in X and f in F by updateXF.
Since the Jacobian matrix may be singular or indefinite, a new point may move toward a maximizer or a saddle point. To remedy this disadvantage, BroydenLike never accepts a point whose function value is the largest.

A limited memory trust region algorithm
We describe all steps of the new limited memory algorithm, called LMLS, using the new subspace direction (7), the new non-monotone technique (10), the new adaptive radius strategy (11), and BroydenLike.
In each iteration, an estimated solution of the trust region subproblem (14) is found. Whenever the condition ρ_nm ≥ γ_t holds, the iteration is successful; both the non-monotone term (10) and the adaptive radius formula (11) are updated, and the Jacobian matrix in a subspace with random basis indices is estimated by SRFD. Otherwise, the iteration is unsuccessful. In this case, BroydenLike is performed in the hope of finding a decrease in the function value. If a decrease is found, the iteration becomes successful; otherwise, it remains unsuccessful, and the radius is reduced and the non-monotone term (10) is updated until a decrease in the function value is found and the iteration becomes successful.
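A heavily simplified, monotone skeleton of such an iteration (full-space finite difference Jacobian, Gauss-Newton step clipped to the trust region, textbook radius update) is sketched below. The real LMLS instead uses SRFD subspaces, the non-monotone ratio, the adaptive radius (11), and BroydenLike on unsuccessful iterations:

```python
import numpy as np

def lmls_sketch(E, x, Delta=1.0, max_iter=100, eta=1e-4, tol=1e-8):
    """Toy trust region Gauss-Newton loop for min 0.5*||E(x)||^2; a
    structural sketch only, not the LMLS algorithm."""
    x = x.astype(float).copy()
    n = x.size
    f = lambda z: 0.5 * float(np.sum(E(z) ** 2))
    for _ in range(max_iter):
        E0 = E(x)
        J = np.empty((E0.size, n))
        for i in range(n):                      # forward difference Jacobian
            h = 1e-7 * max(1.0, abs(x[i]))
            xp = x.copy(); xp[i] += h
            J[:, i] = (E(xp) - E0) / h
        g = J.T @ E0
        if np.linalg.norm(g) <= tol:            # approximate stationarity
            break
        p = np.linalg.lstsq(J, -E0, rcond=None)[0]   # Gauss-Newton step
        if np.linalg.norm(p) > Delta:           # clip to the trust region
            p *= Delta / np.linalg.norm(p)
        pred = -(g @ p + 0.5 * float(np.sum((J @ p) ** 2)))
        if pred <= 0:                           # no predicted decrease
            Delta *= 0.5
            continue
        rho = (f(x) - f(x + p)) / pred          # monotone trust region ratio
        if rho >= eta:                          # successful: accept, expand
            x, Delta = x + p, 2.0 * Delta
        else:                                   # unsuccessful: shrink radius
            Delta *= 0.5
    return x
```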
It returns the best point x_best as an approximate solution of the nonlinear least squares problem.

Numerical results
We updated the test environment constructed by Kimiaei and Neumaier (2019) to use the test problems suggested by Lukšan et al. (2018). LMLS is compared with unconstrained least squares and unconstrained optimization solvers; for some of them we had to choose options different from the default (described in the first subsection) to make them competitive.

Least squares solvers:
• CoDoSol is a solver for constrained nonlinear systems of equations, obtained from http://codosol.de.unifi.it.
Note that delta = −1 means that Δ_0 = 1. According to our numerical results, CoDoSol was not sensitive to the initial radius; hence, the default was used.
• STRSCNE is a solver for constrained nonlinear systems of equations, obtained from http://codosol.de.unifi.it.
• FMINUNC1 is FMINUNC with the limited memory quasi-Newton approximation by Liu and Nocedal (1989), added to FMINUNC by the present authors. The same option set as for FMINUNC was used; only the memory m = 10 was added to the option set.

Test problems used, initial point, and stopping tests
Test problems suggested by Lukšan et al. (2018) are classified in Table 1 according to whether they are overdetermined (r > n) or not (r = n). A shifted starting point for these problems is chosen as in Kimiaei and Neumaier (2020). Denote by f_0 the function value of the starting point (common to all solvers), by f_so the best function value found by the solver so, and by f_opt the best function value known to us. Then the problem counts as solved by the solver so if the target accuracy condition is satisfied. Otherwise, the problem counts as unsolved by it: either nfmax or secmax was exceeded, or the solver failed.
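The target accuracy condition is assumed here to be of the usual relative form f_so − f_opt ≤ τ (f_0 − f_opt) for a tolerance τ; the paper states it as a displayed equation not reproduced above. A one-line sketch:

```python
def solved(f0, f_so, f_opt, tau=1e-6):
    """Relative target accuracy test (a common convention; the paper's
    exact threshold is assumed to be of this form)."""
    return (f_so - f_opt) <= tau * (f0 - f_opt)
```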

The efficiency and robustness of a solver
For a given collection S of solvers, the strength of a solver so ∈ S (relative to an ideal solver that matches on each problem the best solver) is measured, for any given cost measure c_so, by the number e_so, defined as the ratio of the cost of the best solver to c_so if the solver so solves the problem, and 0 otherwise; e_so is called the efficiency of the solver so with respect to this cost measure. Two cost measures, nf and msec, are used. The robustness of a solver is the number of test problems it can solve. Efficiency and robustness are two adequate tools to determine which solver is competitive; in fact, the robustness of a solver is more important than its efficiency. We use two different performance plots in terms of the robustness and efficiency of solvers: • The first is the data profile by Moré and Wild (2009) for nf/(best nf) and msec/(best msec): the percentage of problems solved within a given number of function evaluations and time in milliseconds. • The second is the performance profile by Dolan and Moré (2002) for nf/(best nf) and msec/(best msec): the percentage of problems solved within a factor τ of the best solvers.
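With e_so = (best cost)/c_so for solved problems and 0 otherwise, the per-problem efficiencies can be computed as:

```python
def efficiency(costs, is_solved):
    """Per-problem efficiencies e_so: the best solver's cost divided by each
    solver's cost if it solved the problem, else 0 (so e_so = 1 = 100%
    exactly for the best solver)."""
    best = min(c for c, ok in zip(costs, is_solved) if ok)
    return [best / c if ok else 0.0 for c, ok in zip(costs, is_solved)]
```

Averaging these values over all problems gives the percentage entries reported in the tables.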
All tables and data/performance profiles are given in Sects. 4.2-4.4. In Sects. 3.5-3.7, we summarize them as two new performance plots.

Small scale: n ∈ [1, 100]
A comparison among LMLS1, LMLS2, LMLS3, LMLS4, and solvers using quasi-Newton is shown in Subfigures (a) and (b) of Fig. 1, so that • LMLS4 using the full estimated Jacobian matrix is the best in terms of the number of solved problems and the nf efficiency; • LMLS3, LMLS2, and LMLS1 are more efficient than solvers using quasi-Newton approximation (FMINUNC, FMINUNC1, MINFLBFGS1, and MINFLBFGSDL1) in terms of the nf efficiency; • FMINUNC and MINFLBFGSDL1 are comparable with LMLS3-only for very large budget-in terms of the number of solved problems but LMLS3, LMLS2, and LMLS1 are more efficient than FMINUNC and MINFLBFGSDL1 in terms of the nf efficiency not only for very large budget but also for small up to large budgets.
To determine whether our new non-monotone and adaptive radius strategies are effective or not, we compare LMLS4 with solvers using other non-monotone and adaptive radius strategies, shown in Subfigures (c) and (d) of Fig. 1. All solvers use the full Jacobian matrix and the trust region subproblems are solved by the same algorithm. As can be seen, LMLS4 is much more efficient and robust than NATRLS1, NMPGTR2, and NATRN1 in terms of the number of solved problems and the nf efficiency.
We compare LMLS4 with the four well-known solvers LSQNONLIN1, CoDoSol1, NLEQ1, and DOGLEG1, shown in Subfigures (e) and (f) of Fig. 1. It is seen that LMLS4 and CoDoSol1 are the two best solvers in terms of the nf efficiency, while LMLS4 and DOGLEG1 are the two best solvers in terms of the number of solved problems. Another comparison is among LMLS3, LMLS2, and LMLS1, using the Jacobian matrix in an adaptive subspace basis index set, and LSQNONLIN1 and NLEQ1, using the full Jacobian matrix. We conclude from Subfigures (g) and (h) of Fig. 1 that • LMLS3 is the best in terms of the number of solved problems and the nf efficiency; • LMLS2 is the second best solver in terms of the number of solved problems and the nf efficiency for medium, large, and very large budgets; • LMLS1 with the lowest subspace dimension is more efficient than LSQNONLIN1 in terms of the number of solved problems and the nf efficiency; it is even more efficient than NLEQ1 for very large budgets in terms of the nf efficiency.
As a result, LMLS is competitive for small-scale problems in comparison with the state-of-the-art solvers.

Medium scale: n ∈ [101, 1000]
In this subsection, we compare LMLS1, LMLS2, LMLS3, and LMLS4 using the estimated Jacobian matrices in a subspace with random basis indices with FMINUNC using standard BFGS approximations and FMINUNC1 using limited memory BFGS ones (Fig. 2).
From Subfigures (a) and (b) of Fig. 2, we conclude that • LMLS4, LMLS3, and LMLS2 are the three best solvers in terms of the nf efficiency, in this order; • LMLS4 is the best solver in terms of the number of solved problems; only FMINUNC is best for large budgets.

Large scale: n ∈ [1001, 10000]
In this subsection, we compare LMLS1, LMLS2, LMLS3, LMLS4 using the estimated Jacobian matrices in a subspace with random basis indices with FMINUNC1 using limited memory BFGS approximations.
In terms of the nf efficiency and the number of solved problems, Subfigures (a) and (b) of Fig. 3 show that • LMLS4, LMLS3, and LMLS2 are the three best solvers, in this order; • LMLS1 with the lowest subspace dimension is more efficient than FMINUNC1 for small budgets, while FMINUNC1 is more efficient than LMLS1 for large budgets.

Summarizing tables
In all tables, efficiencies are given as percentages, rounded (towards zero) to integers. Larger efficiencies imply a better average behavior, while a zero efficiency indicates failure. #100 is the total number of problems for which the solver so was best with respect to nf (e so = 1 = 100%). !100 is the total number of problems solved for which the solver so was better than all other solvers with respect to nf.
We denote the time in seconds without the setup time for the objective function by sec. In tables, a sign
• n indicates that nf ≥ nfmax was reached;
• t indicates that sec ≥ secmax was reached;
• f indicates that the algorithm failed for other reasons.
T_mean is the mean of the time in seconds needed by a solver to solve the test problems chosen from the list of test problems P, ignoring the times for unsolved problems. It can be a good measure when solvers have approximately the same number of solved problems.

This section contains Tables 2, 3, 4, 5 and Figs. 4, 5, 6, 7, summaries of which were discussed in Sect. 3.5. Fig. 4a, b show data profiles and Fig. 4c, d performance profiles for nf/(best nf) and msec/(best msec), respectively; ρ designates the fraction of problems solved within the number of function evaluations and time in milliseconds used by the best solver, and problems solved by no solver are ignored. This section also contains Tables 6, 7, 8 and Figs. 8, 9, 10, summaries of which were discussed in Sect. 3.6.

Funding Open access funding provided by University of Vienna.

Conflict of interest The authors declare that they have no conflict of interest.
Human and animal rights This study does not contain any studies with human participants or animals performed by any of the authors.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.