Abstract
This paper deals with regularized Newton methods, a flexible class of unconstrained optimization algorithms that is competitive with line search and trust region methods and potentially combines attractive elements of both. The particular focus is on combining regularization with limited memory quasi-Newton methods by exploiting the special structure of limited memory algorithms. Global convergence of regularization methods is shown under mild assumptions, and the details of regularized limited memory quasi-Newton updates are discussed, including their compact representations. Numerical results using all large-scale test problems from the CUTEst collection indicate that our regularized version of L-BFGS is competitive with state-of-the-art line search and trust-region L-BFGS algorithms and previous attempts at combining L-BFGS with regularization, while potentially outperforming some of them, especially when nonmonotonicity is involved.
1 Introduction
Let \(f:\mathbb {R}^n\rightarrow \mathbb {R}\), \(n\in \mathbb {N}\), be a twice continuously differentiable function, and consider the nonlinear minimization problem
$$\begin{aligned} \min _{\textbf{x}\in \mathbb {R}^n} \; f(\textbf{x}). \end{aligned}$$(1)
Methods of Newton or quasi-Newton type are commonly acknowledged to be some of the most efficient algorithms for the solution of such problems. Given a current iterate \(\textbf{x}_k\), these methods compute the iteration step \(\textbf{d}_k\) by solving a (quasi-)Newton equation of the form
$$\begin{aligned} \textbf{B}_k \textbf{d}_k = -\nabla f(\textbf{x}_k), \end{aligned}$$(2)
where \(\textbf{B}_k\in \mathbb {R}^{n\times n}\) is either the Hessian \(\nabla ^2 f(\textbf{x}_k)\) or an approximation thereof. When n is large, the matrix \(\textbf{B}_k\) is usually not stored explicitly. Instead, one uses so-called limited memory quasi-Newton methods, which require the storage of a few vector pairs
$$\begin{aligned} (\textbf{s}_i, \textbf{y}_i), \qquad \textbf{s}_i := \textbf{x}_{i+1}-\textbf{x}_i, \quad \textbf{y}_i := \nabla f(\textbf{x}_{i+1})-\nabla f(\textbf{x}_i), \end{aligned}$$
and use this information to construct an implicit approximation to the Hessian matrix. This approximation is never formed explicitly; instead, the pairs \((\textbf{s}_k,\textbf{y}_k)\) are used to directly evaluate matrix–vector products of the form \(\textbf{B}_k \textbf{x}\) or \(\textbf{B}_k^{-1} \textbf{y}\) as necessary. Arguably the most successful quasi-Newton schemes are the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method [10] and its limited memory counterpart L-BFGS [6, 19, 22]. Other examples include symmetric rank-one (SR1), Powell-symmetric-Broyden (PSB), Davidon–Fletcher–Powell (DFP), the so-called Broyden class, and many more; see [10, 18, 31].
In today’s optimization landscape, L-BFGS is the de facto standard for smooth large-scale optimization. The method is usually combined with a line search technique to ensure global convergence [19]. There have also been efforts dedicated to making quasi-Newton methods compatible with the trust-region framework; see [2, 4, 12] for L-BFGS and [1] for L-SR1. This is facilitated by the fact that most quasi-Newton formulas admit a so-called compact representation of the form
$$\begin{aligned} \textbf{B}_k = \textbf{B}_{0,k} + \textbf{A}_k \textbf{Q}_k^{-1} \textbf{A}_k^{\textsf{T}}, \end{aligned}$$(3)
where \(\textbf{B}_{0,k}\in \mathbb {R}^{n\times n}\), \(\textbf{A}_k\in \mathbb {R}^{n\times s}\), \(\textbf{Q}_k\in \mathbb {R}^{s\times s}\), and \(s\ll n\). (We put \(\textbf{Q}_k^{-1}\) instead of \(\textbf{Q}_k\) in the above equation because this will be more convenient later on.) The initial matrix \(\textbf{B}_{0,k}\) is usually a multiple of the identity or some other diagonal matrix. Decompositions of the above form have been given by many authors [3, 6, 9], and they are immensely useful in optimization methods since they usually allow the computation of matrix operations involving \(\textbf{B}_k\) in the lower dimension s. In particular, they facilitate the efficient computation of quasi-Newton directions and the solution of trust-region subproblems; see the references above.
In this paper, we will pursue a different globalization technique which can be seen as a (less well-known) sibling of line search and trust-region methods, the so-called regularized Newton methods [15, 17, 29, 30, 32]. These are generally characterized by regularized quasi-Newton equations of the form
$$\begin{aligned} (\textbf{B}_k+\mu _k \textbf{I})\, \textbf{d}_k = -\nabla f(\textbf{x}_k), \end{aligned}$$
where \(\mu _k\ge 0\) is called the regularization parameter. The attractive feature of these methods is that they combine some of the respective benefits of line search and trust-region methods, and moreover they are highly compatible with compact representations of quasi-Newton matrices. We will therefore present an algorithmic framework designed to efficiently combine limited memory and regularization techniques, with the following benefits:

The step computation is almost as cheap as for line search L-BFGS algorithms. More specifically, the cost of each successful iteration (in the m-step BFGS case) is 4mn plus the solution of a \(2m \times 2m\) symmetric linear system. In particular, no inner loop is necessary for the computation of eigenvalue decompositions or trust-region solutions.

At the same time, the step quality is close to that of trust-region type limited memory algorithms because the regularization parameter \(\mu _k\) mimics the Lagrange multiplier arising in trust-region subproblems. The method can therefore be considered as a kind of “implicit” trust-region algorithm.

As a result of the above, the proportion of accepted steps is very high, leading to a relatively low number of function and gradient evaluations (on a level with trust-region type methods) while at the same time preserving the “cheap” steps of line search methods.
The use of regularization techniques has another important benefit over line search methods. In the line search setting, many authors advocate trying the “full” step size \(t_k=1\) first, the motivation being that L-BFGS and similar methods are fundamentally algorithms of Newton type and the full step size may lead to fast convergence. However, the step size also serves the purpose of adapting the algorithm to the nonlinearity of the problem, and reinitializing the line search procedure with \(t_k=1\) at each step makes it hard to carry this information over from one step to the next. In contrast, the regularization approach that we advocate here provides a more seamless transition between the full (quasi-)Newton step and a truncated version thereof (similar to trust-region methods), which suggests that algorithms of this type may be able to handle nonlinear or nonconvex problems more effectively.
The idea of combining limited memory and regularization techniques is not entirely new. Multiple authors [15, 26, 28] have advocated modifying the secant equation in quasi-Newton methods to instead approximate the sum \(\nabla ^2 f(\textbf{x}_k)+\mu _k \textbf{I}\). However, none of these methods fully exploit the quasi-Newton approximation of the Hessian and the compact representation (3). The method we present takes full advantage of these tools.
In addition to the algorithm, the paper also contains a general convergence result for regularized Newton methods which, to the authors’ knowledge, does not exist in this generality in the literature. In particular, the convergence result does not assume any specific quasi-Newton formula and allows for \(\textbf{B}_k + \mu _k \textbf{I}\) to be indefinite. This may be of interest to researchers in the field and provide a basis for future research on related methods.
This paper is organized as follows. Section 2 contains a detailed description of a general class of regularized quasi-Newton methods. Global convergence results for this class of methods are presented in Sect. 3 under fairly mild assumptions. In Sect. 4, we show how compact representations of limited memory quasi-Newton methods can be exploited to create efficient implementations of the algorithm. We also give a compact representation of the PSB formula that appears to be new. The numerical experiments in Sect. 5 indicate that the new technique is competitive with other attempts at regularizing L-BFGS [28] as well as line search and trust-region based L-BFGS methods [2, 19]. We close with some final remarks in Sect. 6.
1.1 Notation
Matrices and vectors will be denoted by boldface letters \(\textbf{M}\) and \(\textbf{v}\), respectively. Given a matrix \(\textbf{M}\in \mathbb {R}^{s\times s}\), we write \(\textbf{L}(\textbf{M})\), \(\textbf{D}(\textbf{M})\), and \(\textbf{U}(\textbf{M})\) for the strictly lower, diagonal, and strictly upper parts of \(\textbf{M}\), respectively. In particular, it always holds that
$$\begin{aligned} \textbf{M} = \textbf{L}(\textbf{M}) + \textbf{D}(\textbf{M}) + \textbf{U}(\textbf{M}). \end{aligned}$$
The gradient of the smooth function f evaluated at an iterate \( \textbf{x}_k \) will often be denoted by \( \textbf{g}_k \). We denote sequences by \(\{s_k\}\) and write \(\{s_k\}_{k\in \mathcal {S}}\) for the subsequence induced by an infinite index set \(\mathcal {S} = \{k_1, k_2, \ldots \}\subseteq \mathbb {N}\) with \(k_i < k_{i+1}\) for all i. Similarly, \(s_k \rightarrow _{\mathcal {S}} s\) means that \(\{s_k\}_{k\in \mathcal {S}}\) converges to s.
2 Regularized quasi-Newton methods
As discussed in the introduction, the fundamental principle underlying the methods in this paper is that of regularized Newton and quasi-Newton methods, which are generally characterized by regularized quasi-Newton equations of the form
$$\begin{aligned} (\textbf{B}_k+\mu _k \textbf{I})\, \textbf{d}_k = -\nabla f(\textbf{x}_k), \end{aligned}$$(4)
where \(\textbf{B}_k\) is either the Hessian \(\nabla ^2 f(\textbf{x}_k)\) or an approximation thereof, and \(\mu _k\ge 0\) is the regularization parameter. Clearly, if \(\mu _k=0\), then (4) reduces to the standard quasi-Newton equation \(\textbf{B}_k \textbf{d}_k = -\nabla f(\textbf{x}_k)\). On the other hand, if \(\mu _k\) is large, then the matrix \(\textbf{B}_k+\mu _k \textbf{I}\) will be invertible, and the step \(\textbf{d}_k\) produced by (4) will essentially be the negative gradient direction (up to normalization; see Lemma 1).
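This limiting behavior is easy to observe numerically. The following is a small illustration (the matrix and vector are made up for demonstration and not taken from the paper):

```python
# Illustrative example: the regularized step d = -(B + mu*I)^{-1} g
# interpolates between the quasi-Newton step (mu -> 0) and a scaled
# steepest descent step -g/mu (mu -> infinity).
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
B = (M + M.T) / 2                 # symmetric, possibly indefinite
g = rng.standard_normal(n)

def reg_step(B, g, mu):
    """Solve the regularized quasi-Newton equation (B + mu*I) d = -g."""
    return np.linalg.solve(B + mu * np.eye(B.shape[0]), -g)

d_large = reg_step(B, g, 1e8)     # essentially -g / mu

# Up to normalization, d_large points along the negative gradient.
cos = (d_large @ -g) / (np.linalg.norm(d_large) * np.linalg.norm(g))
print(cos)
```

For large \(\mu\) the computed direction is nearly parallel to \(-\textbf{g}\), which is the intuition formalized in Lemma 1 below.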
2.1 Mathematical motivation
The virtues of the regularization approach can be understood by recognizing that this essentially amounts to minimizing the regularized quadratic model
$$\begin{aligned} \textbf{d} \mapsto f(\textbf{x}_k) + \textbf{g}_k^{\textsf{T}}\textbf{d} + \frac{1}{2}\textbf{d}^{\textsf{T}}\textbf{B}_k \textbf{d} + \frac{\mu _k}{2}\Vert \textbf{d}\Vert ^2, \end{aligned}$$(5)
which differs from the conventional Newton model by Tikhonov regularization. Thus, a positive value of \(\mu _k\) may dampen the impact of negative eigenvalues of \(\textbf{B}_k\) on the search direction, prevent excessively long steps in negative curvature directions, and possibly guarantee that the model (5) admits a unique minimizer (i.e., that the matrix \(\textbf{B}_k+\mu _k\textbf{I}\) is positive definite). The anticipated setting is that \(\mu _k\) will initially be kept sufficiently large to guarantee global convergence, eventually decreasing rapidly enough so as to not impede fast local convergence.
A more rigorous interpretation is given by trust-region methods. Indeed, if \(\textbf{d}_k:=-(\textbf{B}_k+\mu _k\textbf{I})^{-1} \textbf{g}_k\) for some \(\mu _k\ge 0\), and if \(\Delta :=\Vert \textbf{d}_k\Vert \), then \(\textbf{d}_k\) is a stationary point of the trust-region subproblem
$$\begin{aligned} \min _{\textbf{d}\in \mathbb {R}^n} \; q_k(\textbf{d}) \quad \text {subject to}\quad \Vert \textbf{d}\Vert \le \Delta , \end{aligned}$$
where
$$\begin{aligned} q_k(\textbf{d}) := f(\textbf{x}_k) + \textbf{g}_k^{\textsf{T}}\textbf{d} + \frac{1}{2}\textbf{d}^{\textsf{T}}\textbf{B}_k \textbf{d} \end{aligned}$$(6)
is the standard quadratic approximation of f around \(\textbf{x}_k\). If \(\textbf{B}_k+\mu _k \textbf{I}\) is positive definite, then \(\textbf{d}_k\) is in fact a solution of this auxiliary problem. It follows that regularized Newton methods can be interpreted as “implicit” trust-region methods whereby the regularization parameter is controlled instead of the trust-region radius.
Finally, it is also interesting to analyze how the regularization technique affects the conditioning of the quadratic model (5). Assuming for the moment that \(\textbf{B}_k\) is positive definite (as it is, e.g., in BFGS-type methods), the regularization parameter improves the condition number of the underlying matrix in the sense that
$$\begin{aligned} \kappa \big (\textbf{B}_k+\mu _k\textbf{I}\big ) = \frac{\lambda _{\max }(\textbf{B}_k)+\mu _k}{\lambda _{\min }(\textbf{B}_k)+\mu _k} \le \frac{\lambda _{\max }(\textbf{B}_k)}{\lambda _{\min }(\textbf{B}_k)} = \kappa \big (\textbf{B}_k\big ), \end{aligned}$$
where \(\lambda _{\max }(\textbf{B}_k),\lambda _{\min }(\textbf{B}_k)>0\) are the largest and smallest eigenvalues of \(\textbf{B}_k\), respectively.
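This conditioning effect can be checked numerically; the matrix below is made up for illustration:

```python
# Numerical check of the conditioning claim: for positive definite B,
# kappa(B + mu*I) = (lmax + mu)/(lmin + mu) <= lmax/lmin = kappa(B).
import numpy as np

rng = np.random.default_rng(1)
n = 6
M = rng.standard_normal((n, n))
B = M @ M.T + 0.01 * np.eye(n)    # positive definite, ill-conditioned

mu = 1.0
cond_B = np.linalg.cond(B)
cond_reg = np.linalg.cond(B + mu * np.eye(n))
print(cond_B, cond_reg)
```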
2.2 Basic algorithm
To control the regularization parameter \(\mu _k\), we consider the quadratic approximation \(q_k\) of f from (6) and borrow some terminology from trust-region algorithms. Given a candidate step \(\textbf{d}_k=-(\textbf{B}_k+\mu _k\textbf{I})^{-1}\textbf{g}_k\), define the predicted reduction of f as
$$\begin{aligned} \text {pred}_k := f(\textbf{x}_k) - q_k(\textbf{d}_k) = -\textbf{g}_k^{\textsf{T}}\textbf{d}_k - \frac{1}{2}\textbf{d}_k^{\textsf{T}}\textbf{B}_k\textbf{d}_k = -\frac{1}{2}\textbf{g}_k^{\textsf{T}}\textbf{d}_k + \frac{\mu _k}{2}\Vert \textbf{d}_k\Vert ^2, \end{aligned}$$(7)
where the last equality uses the definition of \(\textbf{d}_k\). (Note that, in particular, the matrix \(\textbf{B}_k\) need not be available for the computation of \(\text {pred}_k\).) This quantity will be compared to the actual or achieved reduction in step k,
$$\begin{aligned} \text {ared}_k := f(\textbf{x}_k) - f(\textbf{x}_k+\textbf{d}_k). \end{aligned}$$(8)
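The fact that \(\textbf{B}_k\) is not needed follows from substituting \(\textbf{B}_k\textbf{d}_k=-\textbf{g}_k-\mu _k\textbf{d}_k\) into the quadratic model. A quick numerical sanity check of this identity (with made-up data):

```python
# Check: if (B + mu*I) d = -g, then
#   pred = -g^T d - 0.5 d^T B d  =  -0.5 g^T d + 0.5 mu ||d||^2,
# so pred can be computed from g, d, mu alone, without B.
import numpy as np

rng = np.random.default_rng(2)
n = 4
M = rng.standard_normal((n, n))
B = (M + M.T) / 2
g = rng.standard_normal(n)
mu = 5.0

d = np.linalg.solve(B + mu * np.eye(n), -g)

pred_with_B = -g @ d - 0.5 * d @ B @ d
pred_without_B = -0.5 * g @ d + 0.5 * mu * (d @ d)
print(pred_with_B, pred_without_B)
```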
Similar to trust-region methods [8], we use the ratio between these quantities to control the regularization parameter. To this end, we distinguish between three cases: unsuccessful (u), successful (s), and highly successful (h) steps. Special care also needs to be taken because there is no a priori guarantee that \(\text {pred}_k\) is positive (since \(\textbf{B}_k\) may be indefinite); such steps are treated in the same manner as unsuccessful ones.
Algorithm 1 (Regularized quasi-Newton method)
Choose \(\textbf{x}_0\in \mathbb {R}^n\) and parameters \(\mu _0>0\); \(p_{\min },c_1\in (0,1)\); \(c_2\in (c_1,1)\); \(\sigma _1\in (0,1)\); \(\sigma _2>1\).
 Step 1.:

If a suitable stopping criterion is satisfied, terminate.
 Step 2.:

(Step computation) Choose \(\textbf{B}_k\in \mathbb {R}^{n\times n}\) and attempt to solve the regularized quasi-Newton equation
$$\begin{aligned} (\textbf{B}_k+\mu _k \textbf{I})\, \textbf{d}_k=-\nabla f(\textbf{x}_k). \end{aligned}$$(9)
If this equation admits no solution \(\textbf{d}_k\), or if \(\text {pred}_k \le p_{\min } \Vert \textbf{g}_k\Vert \Vert \textbf{d}_k\Vert \), set \(\textbf{x}_{k+1}:=\textbf{x}_k\), \(\mu _{k+1}:=\sigma _2\mu _k\), and go to Step 4. Otherwise, go to Step 3.
 Step 3.:

(Variable update) Set \(\varrho _k:=\text {ared}_k/\text {pred}_k\) and perform one of the following steps:
Step 3u (\(\varrho _k\le c_1\)). Set \(\textbf{x}_{k+1}:=\textbf{x}_k\) and \(\mu _{k+1}:=\sigma _2 \mu _k\).
Step 3s (\(c_1<\varrho _k\le c_2\)). Set \(\textbf{x}_{k+1}:=\textbf{x}_k+\textbf{d}_k\) and \(\mu _{k+1}:=\mu _k\).
Step 3h (\(c_2<\varrho _k\)). Set \(\textbf{x}_{k+1}:=\textbf{x}_k+\textbf{d}_k\) and \(\mu _{k+1}:=\sigma _1\mu _k\).
 Step 4.:

Set \(k\leftarrow k+1\) and go to Step 1.
The condition \(\text {pred}_k > p_{\min } \Vert \textbf{g}_k\Vert \Vert \textbf{d}_k\Vert \) in Step 2 is a sufficient descent criterion similar to the angle condition in line search methods or the Cauchy condition in trust-region methods. The quantity \(p_{\min } \Vert \textbf{g}_k\Vert \Vert \textbf{d}_k\Vert \) is the minimal predicted reduction in objective value (relative to the sizes of \(\textbf{g}_k\) and \(\textbf{d}_k\)) required for a step to be attempted.
As hinted above, in what follows, we will refer to a step as unsuccessful if it passes through Step 3u or skips Step 3 because of the checks in Step 2. (In particular, \(\textbf{d}_k\) may not be defined in an unsuccessful step.)
The parameters \(c_1,c_2,\sigma _1, \sigma _2\) are used to classify steps and adjust the regularization accordingly (increase if the step was unsuccessful, decrease if the step was highly successful).
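To fix ideas, the overall loop of Algorithm 1 can be sketched in a few lines. This is an illustrative toy implementation (made-up parameter values, the exact Hessian as \(\textbf{B}_k\) instead of a limited memory matrix, and the 2D Rosenbrock function as a test problem), not the implementation evaluated in Sect. 5:

```python
# Toy sketch of Algorithm 1 with B_k = exact Hessian (illustrative only;
# the paper's focus is on limited memory B_k). Test function: Rosenbrock.
import numpy as np

def f(x):
    return 100 * (x[1] - x[0]**2)**2 + (1 - x[0])**2

def grad(x):
    return np.array([-400 * x[0] * (x[1] - x[0]**2) - 2 * (1 - x[0]),
                     200 * (x[1] - x[0]**2)])

def hess(x):
    return np.array([[1200 * x[0]**2 - 400 * x[1] + 2, -400 * x[0]],
                     [-400 * x[0], 200.0]])

def regularized_newton(x, mu=1.0, p_min=1e-4, c1=0.1, c2=0.9,
                       sigma1=0.5, sigma2=4.0, tol=1e-8, max_iter=500):
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:                    # Step 1: stopping test
            break
        B = hess(x)                                    # Step 2: step computation
        try:
            d = np.linalg.solve(B + mu * np.eye(len(x)), -g)
        except np.linalg.LinAlgError:
            mu *= sigma2                               # no solution: increase mu
            continue
        pred = -g @ d - 0.5 * d @ B @ d
        if pred <= p_min * np.linalg.norm(g) * np.linalg.norm(d):
            mu *= sigma2                               # insufficient predicted decrease
            continue
        rho = (f(x) - f(x + d)) / pred                 # Step 3: ratio test
        if rho <= c1:                                  # unsuccessful (3u)
            mu *= sigma2
        else:                                          # successful (3s) or highly (3h)
            x = x + d
            if rho > c2:
                mu *= sigma1
    return x

x_star = regularized_newton(np.array([-1.2, 1.0]))
print(x_star)
```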
Algorithm 1 is closely related to trust-region methods. The main difference between trust-region methods and our regularization framework lies in the update of the parameter \( \mu _k \). The former uses an indirect way to compute \( \mu _k \) (via a trust-region radius), whereas here we update the regularization parameter directly. While the indirect update follows a well-understood and well-motivated philosophy, its actual computation is sometimes time-consuming and costly. We therefore expect superior behavior of the direct update, in particular for large-scale problems.
The report [28] presents a method which is formally almost identical to Algorithm 1 (except for a slightly different update of the regularization parameter). The main difference is that [28] focuses on the matrices \( \textbf{B}_k \) being updated by a limited memory BFGS scheme (without using compact representations, as we shall do in Sect. 4). The convergence theory in [28] assumes a bounded level set condition; this is not required in our subsequent analysis, which is substantially more general since we only assume boundedness of \(\{\textbf{B}_k\}\) (allowing for other quasi-Newton formulas or indefiniteness) and boundedness of the objective from below, a strictly weaker requirement which holds, for instance, for the exponential function despite its unbounded level sets.
3 General convergence analysis
As we shall see, Algorithm 1 provides a powerful framework for the application of quasi-Newton-type updates. Before turning to this discussion (which is the main motivation for this paper), we dedicate the present section to a simple convergence analysis. Since the algorithm in its general form does not prescribe a particular choice of the matrices \(\textbf{B}_k\), which may or may not be approximations of the Hessian \(\nabla ^2 f(\textbf{x}_k)\), it is natural to carry out the analysis under correspondingly general assumptions. The only assumption we make throughout this section is the following.
Assumption 1
(Boundedness) \(\{\textbf{B}_k\}\subseteq \mathbb {R}^{n\times n}\) is a bounded sequence.
Most practically relevant quasi-Newton schemes should have no issues satisfying the above assumption, especially when the gradient \(\nabla f\) is Lipschitz continuous on an appropriate level set. Indeed, many of these techniques yield Hessian approximations which satisfy additional properties such as symmetry (which we omitted because it is unnecessary for the theory below) or positive definiteness.
Lemma 1
(Gradient approximation) Let Assumption 1 hold, and let \(\mu _k\rightarrow \infty \). Then \(\textbf{B}_k+\mu _k\textbf{I}\) is invertible for sufficiently large \(k\in \mathbb {N}\), and
$$\begin{aligned} \mu _k \big (\textbf{B}_k+\mu _k\textbf{I}\big )^{-1} \rightarrow \textbf{I} \quad \text {as } k\rightarrow \infty . \end{aligned}$$
The above result defines more precisely the intuitive relationship mentioned in Sect. 2; that is, if the regularization parameter is sufficiently large, then the regularized Newton equation (9) admits a unique solution, and the resulting vector will approximate the negative gradient direction as \(\mu _k\rightarrow \infty \).
Another consequence of Lemma 1 is that the method performs infinitely many successful steps. This follows from the fact that \(\textbf{d}_k\) becomes ever smaller and approaches the (local) steepest descent direction when \(\mu _k\rightarrow \infty \), thus leading to a local descent step which satisfies the sufficient decrease condition from Step 2 of the algorithm.
Lemma 2
(Well-definedness) Let Assumption 1 hold, and assume that \(\textbf{g}_k\ne 0\) for all k. Then Algorithm 1 performs infinitely many successful or highly successful steps.
Proof
Assume for the sake of contradiction that there exists \(k_0\in \mathbb {N}\) such that all steps \(k\ge k_0\) are unsuccessful. In particular, this implies \(\mu _k\rightarrow \infty \) as \(k\rightarrow \infty \) and \(\textbf{x}_k=\textbf{x}_{k_0}\) for all \(k\ge k_0\). Since \(\{\textbf{B}_k\}\) is a bounded sequence, it follows from Lemma 1 that \(\textbf{B}_k+\mu _k \textbf{I}\) is invertible for sufficiently large k, that \(\textbf{d}_k\rightarrow 0\), and \(\textbf{d}_k/\Vert \textbf{d}_k\Vert \rightarrow -\textbf{g}_{k_0}/\Vert \textbf{g}_{k_0}\Vert \). Moreover, the regularized Newton equation (9) implies that \(\mu _k \Vert \textbf{d}_k\Vert \rightarrow \Vert \textbf{g}_{k_0}\Vert \). It is easy to deduce from these limit relations that
$$\begin{aligned} \text {pred}_k > p_{\min } \Vert \textbf{g}_{k_0}\Vert \Vert \textbf{d}_k\Vert \quad \text {for all sufficiently large } k \end{aligned}$$
(simply divide this inequality by \( \Vert \textbf{d}_k \Vert \) and recall that \( p_{\min } \in (0,1) \)). Hence, the algorithm must eventually perform only Step 3u, which means that \(\text {ared}_k\le c_1 \text {pred}_k\) for all \(k\ge k_0\) sufficiently large. It then follows that
$$\begin{aligned} f(\textbf{x}_{k_0})-f(\textbf{x}_{k_0}+\textbf{d}_k) \le c_1\, \text {pred}_k = c_1 \Big ( -\frac{1}{2}\textbf{g}_{k_0}^{\textsf{T}}\textbf{d}_k + \frac{\mu _k}{2}\Vert \textbf{d}_k\Vert ^2 \Big ). \end{aligned}$$(10)
We now divide both sides of this inequality by \(t_k:=\Vert \textbf{d}_k\Vert \). Recalling that \(\textbf{d}_k/\Vert \textbf{d}_k\Vert \rightarrow -\textbf{g}_{k_0}/\Vert \textbf{g}_{k_0}\Vert \), it follows that the left-hand side becomes
$$\begin{aligned} \frac{f(\textbf{x}_{k_0})-f(\textbf{x}_{k_0}+\textbf{d}_k)}{t_k} = -\textbf{g}_{k_0}^{\textsf{T}}\frac{\textbf{d}_k}{t_k} + o(1) \rightarrow \Vert \textbf{g}_{k_0}\Vert . \end{aligned}$$(11)
Conversely, recalling that \(\mu _k \Vert \textbf{d}_k\Vert \rightarrow \Vert \textbf{g}_{k_0}\Vert \), the right-hand side of (10) divided by \(t_k\) satisfies
$$\begin{aligned} \frac{c_1\, \text {pred}_k}{t_k} = c_1 \Big ( -\frac{1}{2}\textbf{g}_{k_0}^{\textsf{T}}\frac{\textbf{d}_k}{t_k} + \frac{\mu _k \Vert \textbf{d}_k\Vert }{2} \Big ) \rightarrow c_1 \Vert \textbf{g}_{k_0}\Vert . \end{aligned}$$(12)
Since \(c_1\in (0,1)\), it then follows from (11), (12) that \(\Vert \textbf{g}_{k_0}\Vert =0\), a contradiction. \(\square \)
The following result builds upon the welldefinedness of the algorithm and shows that it achieves asymptotic stationarity.
Theorem 1
(Global convergence I) Let Assumption 1 hold, let f be bounded from below, and let \(\{\textbf{x}_k\}\) be generated by Algorithm 1. Then
$$\begin{aligned} \liminf _{k\rightarrow \infty }\Vert \textbf{g}_k\Vert = 0. \end{aligned}$$
In particular, given any \(\varepsilon >0\), the algorithm terminates with \(\Vert \textbf{g}_k\Vert <\varepsilon \) after finitely many iterations.
Proof
Let \(\mathcal {S}\subseteq \mathbb {N}\) be the set of indices of successful or highly successful steps. Note that \(|\mathcal {S}|=\infty \) by Lemma 2. Assume for the sake of contradiction that
$$\begin{aligned} \liminf _{k\rightarrow \infty }\Vert \textbf{g}_k\Vert > 0. \end{aligned}$$(13)
Since every step \(k\in \mathcal {S}\) is successful, we have by definition that
$$\begin{aligned} f(\textbf{x}_k)-f(\textbf{x}_{k+1}) = \text {ared}_k> c_1\, \text {pred}_k > c_1 p_{\min } \Vert \textbf{g}_k\Vert \Vert \textbf{d}_k\Vert \quad \text {for all } k\in \mathcal {S}. \end{aligned}$$(14)
By (13), there exist \(k_0\in \mathbb {N}\) and \(\varepsilon >0\) such that \(\Vert \textbf{g}_k\Vert \ge \varepsilon \) for all \(k\ge k_0\). Using the fact that f is bounded from below, we obtain
$$\begin{aligned} \sum _{k\in \mathcal {S}} \Vert \textbf{g}_k\Vert \Vert \textbf{d}_k\Vert < \infty \end{aligned}$$
and, in particular, \(\textbf{d}_k\rightarrow _{\mathcal {S}}0\). Since every step \(k\in \mathcal {S}\) is successful, we have \((\textbf{B}_k+\mu _k \textbf{I})\textbf{d}_k = -\textbf{g}_k\) for all \(k\in \mathcal {S}\). This implies that \(\{\mu _k\}_{k\in \mathcal {S}}\) cannot have a bounded subsequence [since this together with \(\textbf{d}_k\rightarrow _{\mathcal {S}}0\) would violate (13)]. Hence, \(\mu _k\rightarrow _{\mathcal {S}}+\infty \). In particular, the algorithm also performs infinitely many unsuccessful steps (i.e., \(|\mathbb {N}{\setminus }\mathcal {S}|=\infty \)), and \(\mu _k\rightarrow +\infty \) since \(\mu _k\) cannot decrease during unsuccessful iterations.
Now, since \(\mathcal {S}\) and \(\mathbb {N}{\setminus }\mathcal {S}\) are infinite, we may choose an infinite set \(\mathcal {S}'\subseteq \mathcal {S}\) such that \(k-1\in \mathbb {N}{\setminus }\mathcal {S}\) whenever \(k\in \mathcal {S}'\). Since \(\textbf{x}_k\) is not updated in unsuccessful steps, it follows from (14) that
$$\begin{aligned} \sum _{k=k_0}^{\infty }\Vert \textbf{x}_{k+1}-\textbf{x}_k\Vert \le \frac{1}{c_1 p_{\min }\varepsilon }\sum _{k=k_0}^{\infty }\big (f(\textbf{x}_k)-f(\textbf{x}_{k+1})\big ) < \infty . \end{aligned}$$
Hence \(\{\textbf{x}_k\}_{k\in \mathbb {N}}\) is a Cauchy sequence, and thus convergent. Let \(\bar{\textbf{x}}\) denote its limit point. In particular, we then obtain \(\textbf{x}_{k-1}\rightarrow _{\mathcal {S}'}\bar{\textbf{x}}\); thus, using \(\mu _k\rightarrow +\infty \) and arguing as in the proof of Lemma 2, it follows that the steps \(k-1\), \(k\in \mathcal {S}'\), must be successful for sufficiently large \(k\in \mathcal {S}'\). This is a contradiction. \(\square \)
Note that the counterpart of Theorem 1 also holds for trust-region methods under the same set of assumptions. Moreover, the technique of proof used here is related to the corresponding one known for trust-region methods. Nevertheless, we stress that one has to be careful in translating the standard trust-region proof to our regularization framework since well-known properties of the solution of the trust-region subproblem may not hold in our case.
Similar to the theory of trust-region methods, we can use Theorem 1 to obtain a stronger convergence result under an additional assumption.
Theorem 2
(Global convergence II) Let Assumption 1 hold, let f be bounded from below, and let \(\{\textbf{x}_k\}\) be generated by Algorithm 1. Suppose that \( \nabla f \) is uniformly continuous on a set \( X \subseteq \mathbb {R}^n \) satisfying \( \{\textbf{x}_k\} \subseteq X \). Then \(\lim _{k\rightarrow \infty }\Vert \textbf{g}_k\Vert = 0\); in particular, every accumulation point of \(\{\textbf{x}_k\}\) is a stationary point of f.
Proof
Assume there exist \( \delta > 0 \) and a subsequence \( \{ \textbf{x}_k \}_{k\in K} \) such that
$$\begin{aligned} \Vert \textbf{g}_k \Vert \ge \delta \quad \text {for all } k \in K. \end{aligned}$$
Since \( \liminf _{k\rightarrow \infty }\Vert \textbf{g}_k\Vert = 0 \) by Theorem 1, we can find, for each \( k \in K \), an index \( \ell (k) > k \) such that
$$\begin{aligned} \Vert \textbf{g}_{\ell (k)} \Vert< \frac{\delta }{2} \quad \text {and}\quad \Vert \textbf{g}_{l} \Vert \ge \frac{\delta }{2} \quad \text {for all } k \le l < \ell (k). \end{aligned}$$
For an arbitrary \( k \in K \) and a successful or highly successful iteration l with \( k \le l < \ell (k) \), we obtain
$$\begin{aligned} f(\textbf{x}_l)-f(\textbf{x}_{l+1})> c_1 p_{\min } \Vert \textbf{g}_l\Vert \Vert \textbf{d}_l\Vert \ge c_1 p_{\min } \frac{\delta }{2}\, \Vert \textbf{x}_{l+1}-\textbf{x}_l\Vert . \end{aligned}$$
The same inequality holds for l being unsuccessful simply because \( \textbf{x}_{l+1} = \textbf{x}_l \) in this case. This implies
$$\begin{aligned} f(\textbf{x}_k)-f(\textbf{x}_{\ell (k)}) \ge c_1 p_{\min } \frac{\delta }{2}\, \Vert \textbf{x}_{\ell (k)}-\textbf{x}_k\Vert \end{aligned}$$
for all \( k \in K \). Since f is bounded from below and \( \{ f(\textbf{x}_k) \} \) is monotonically decreasing, we obtain \( f(\textbf{x}_k) - f(\textbf{x}_{\ell (k)}) \rightarrow 0 \) as \( k \rightarrow \infty \). This implies \( \Vert \textbf{x}_{\ell (k)} - \textbf{x}_k \Vert \rightarrow _K 0 \). The uniform continuity of \( \nabla f \) on the set X therefore yields
$$\begin{aligned} \Vert \textbf{g}_{\ell (k)} - \textbf{g}_k \Vert \rightarrow _K 0. \end{aligned}$$
On the other hand, the choice of the index \( \ell (k) \) implies
$$\begin{aligned} \Vert \textbf{g}_{\ell (k)} - \textbf{g}_k \Vert \ge \Vert \textbf{g}_k\Vert - \Vert \textbf{g}_{\ell (k)}\Vert > \delta - \frac{\delta }{2} = \frac{\delta }{2} \quad \text {for all } k \in K. \end{aligned}$$
This contradiction completes the proof. \(\square \)
We close this section by noting that regularization techniques like in Algorithm 1 are sometimes used in order to prove local fast convergence properties for Newton-type methods. This corresponds to the choice \( \textbf{B}_k:= \nabla ^2 f(\textbf{x}_k) \) as the exact Hessian. Using a more refined update of the regularization parameter, assuming a local error bound condition and the Hessian of f to be locally Lipschitz continuous, it is possible to verify local quadratic convergence for convex objective functions, cf. [17, 29, 30]. Since our focus is on large-scale problems, our subsequent analysis concentrates on \( \textbf{B}_k \) being given by limited memory quasi-Newton matrices.
4 Regularized quasi-Newton matrices
This section provides the details of limited memory type implementations of quasi-Newton methods. Some of the material below can be applied with minimal modifications to full memory quasi-Newton methods, but we forgo these investigations due to our focus on large-scale optimization.
In keeping with conventional limited memory notation, we assume an algorithmic framework where the last m variable steps \(\textbf{s}_i:=\textbf{x}_{i+1}-\textbf{x}_i\) are tracked together with the corresponding gradient differences \(\textbf{y}_i:=\textbf{g}_{i+1}-\textbf{g}_i\), where we recall that \(\textbf{g}_i=\nabla f(\textbf{x}_i)\). For convenience of notation, we aggregate these in the matrices
$$\begin{aligned} \textbf{S}_k:=[\textbf{s}_{k-m},\ldots ,\textbf{s}_{k-1}]\in \mathbb {R}^{n\times m}, \qquad \textbf{Y}_k:=[\textbf{y}_{k-m},\ldots ,\textbf{y}_{k-1}]\in \mathbb {R}^{n\times m}. \end{aligned}$$
If fewer than m previous iterates are available, that is, if \(k<m\), we set
$$\begin{aligned} \textbf{S}_k:=[\textbf{s}_{0},\ldots ,\textbf{s}_{k-1}], \qquad \textbf{Y}_k:=[\textbf{y}_{0},\ldots ,\textbf{y}_{k-1}]. \end{aligned}$$
These definitions may seem like a mere matter of notation, but there are actually quite pragmatic arguments why \(\textbf{S}\) and \(\textbf{Y}\) should be treated as matrices instead of collections of vectors. Many limited memory operations can be formulated as loops over the recurring index \(i=1,\ldots ,m\), and the matrix notation sometimes allows us to formulate the underlying calculations as matrix–vector operations (instead of a sequence of vector–vector operations). This approach should be used whenever possible in practical implementations because it leverages the power of low-level BLAS (basic linear algebra subprograms) and parallelism, providing a significant increase in computational efficiency.
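The point can be illustrated with a small example (sizes made up): both variants below compute \(\textbf{S}^{\textsf{T}}\textbf{g}\), but the second does so in a single aggregated call.

```python
# Illustration of the remark above: m separate vector-vector products
# versus one aggregated matrix-vector product (a single BLAS gemv call).
import numpy as np

rng = np.random.default_rng(3)
n, m = 10_000, 5
S = rng.standard_normal((n, m))   # columns play the role of stored steps
g = rng.standard_normal(n)

v_loop = np.array([S[:, i] @ g for i in range(m)])   # loop over pairs
v_matrix = S.T @ g                                   # one matrix-vector product

print(np.allclose(v_loop, v_matrix))
```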
Remark 1
(Rejected quasi-Newton updates) For the sake of simplicity and to avoid notational overhead, we assume that the algorithm always “accepts” the data pair \((\textbf{s}_k,\textbf{y}_k)\) in each successful iteration. This is not the case for some quasi-Newton schemes, especially for nonconvex objective functions. In general, quasi-Newton updates are typically accepted or rejected using a so-called cautious updating scheme (see Sect. 5); when a pair \((\textbf{s}_k, \textbf{y}_k)\) is rejected, the matrices \(\textbf{S}_k,\textbf{Y}_k\) of previous steps simply remain as they were.
Most limited memory quasi-Newton methods implicitly generate a so-called compact representation of the form
$$\begin{aligned} \textbf{B}_k = \textbf{B}_{0,k} + \textbf{A}_k \textbf{Q}_k^{-1} \textbf{A}_k^{\textsf{T}}, \end{aligned}$$(15)
where \(\textbf{Q}_k\in \mathbb {R}^{s\times s}\) is a nonsingular symmetric matrix, \(\textbf{A}_k\in \mathbb {R}^{n\times s}\), and \(s\ll n\) is a constant depending on the particular quasi-Newton scheme. For instance, \(s=2m\) in limited memory BFGS methods, and \(s=m\) for limited memory SR1.
The above representation provides a very convenient framework for the regularization approach: given a parameter \(\mu \ge 0\) (e.g., one of the values \(\mu _k\) from Algorithm 1), the regularized Hessian approximation can be written as
$$\begin{aligned} \textbf{B}_k + \mu \textbf{I} = (\textbf{B}_{0,k}+\mu \textbf{I}) + \textbf{A}_k \textbf{Q}_k^{-1} \textbf{A}_k^{\textsf{T}}. \end{aligned}$$(16)
This facilitates the application of low-rank update formulas to compute the regularized Newton step both explicitly and cheaply. To this end, let \(\hat{\textbf{B}}_k:=\textbf{B}_k+\mu \textbf{I}\) and \(\hat{\textbf{B}}_{0,k}:=\textbf{B}_{0,k}+\mu \textbf{I}\). Then the Sherman–Morrison–Woodbury formula implies that
$$\begin{aligned} \hat{\textbf{B}}_k^{-1} = \hat{\textbf{B}}_{0,k}^{-1} - \hat{\textbf{B}}_{0,k}^{-1}\textbf{A}_k \big (\textbf{Q}_k+\textbf{A}_k^{\textsf{T}} \hat{\textbf{B}}_{0,k}^{-1}\textbf{A}_k\big )^{-1}\textbf{A}_k^{\textsf{T}}\hat{\textbf{B}}_{0,k}^{-1}, \end{aligned}$$(17)
provided that \(\hat{\textbf{B}}_{0,k}\) is nonsingular. Note that \(\hat{\textbf{B}}_{0,k}\) is usually a diagonal matrix whose inversion is trivial. Moreover, the inner matrix \(\textbf{Q}_k+\textbf{A}_k^{\textsf{T}} \hat{\textbf{B}}_{0,k}^{-1}\textbf{A}_k\) is of size \(s\times s\), so that its inversion can be carried out cheaply in relation to the dimension n. By the Woodbury matrix identity, the invertibility of this inner matrix is equivalent to that of \(\hat{\textbf{B}}_k\).
In the following, we shall mainly assume that the initial matrix \(\textbf{B}_{0,k}\) is chosen as a scalar multiple of the identity, \(\textbf{B}_{0,k}:=\gamma _k \textbf{I}\). Writing \({\hat{\gamma }}_k:=\gamma _k+\mu \), it then follows that
$$\begin{aligned} \hat{\textbf{B}}_k^{-1} = {\hat{\gamma }}_k^{-1}\textbf{I} - {\hat{\gamma }}_k^{-2}\,\textbf{A}_k \big (\textbf{Q}_k+{\hat{\gamma }}_k^{-1}\textbf{A}_k^{\textsf{T}} \textbf{A}_k\big )^{-1}\textbf{A}_k^{\textsf{T}}. \end{aligned}$$
The practical efficiency of quasi-Newton methods significantly depends on the memorization and reuse of previously computed quantities. To this end, observe that the quasi-Newton recurrence implies
$$\begin{aligned} \textbf{d}_k = -{\hat{\gamma }}_k^{-1}\big (\textbf{g}_k - \textbf{A}_k \textbf{p}_k\big ), \end{aligned}$$(18)
where
$$\begin{aligned} \textbf{p}_k := \big (\textbf{Q}_k+{\hat{\gamma }}_k^{-1}\textbf{A}_k^{\textsf{T}} \textbf{A}_k\big )^{-1} \big ({\hat{\gamma }}_k^{-1}\textbf{A}_k^{\textsf{T}}\textbf{g}_k\big ). \end{aligned}$$(19)
Thus, the main computational cost occurs in forming the product \(\textbf{A}_k^{\textsf{T}}\textbf{g}_k\), the solution of an \(s\times s\) symmetric linear equation to obtain \(\textbf{p}_k\), and the product \(\textbf{A}_k \textbf{p}_k\). In addition, the matrices \(\textbf{A}_k\) and \(\textbf{Q}_k\) need to be updated in each iteration, and the matrix \(\textbf{A}_k^{\textsf{T}}\textbf{A}_k\) needs to be available. As we shall see later, it is possible to reduce the cost of these computations by using the inherent dependencies between the underlying formulas.
Remark 2
(Regularized secant equation) Instead of compact representations, it is also possible to combine regularization and quasi-Newton techniques by directly approximating the sum \(\nabla ^2 f(\textbf{x}_k)+\mu \textbf{I}\); see [28]. This idea is based on the fact that the regularized Hessian satisfies (approximately) the modified secant equation
$$\begin{aligned} \big (\nabla ^2 f(\textbf{x}_k)+\mu \textbf{I}\big ) \textbf{s}_k \approx \textbf{y}_k + \mu \textbf{s}_k. \end{aligned}$$
Thus, an approximation \(\hat{\textbf{B}}_k\) to \(\nabla ^2 f(\textbf{x}_k)+\mu \textbf{I}\) can be constructed by taking a modified initial guess \(\hat{\textbf{B}}_{0,k}:=\textbf{B}_{0,k}+\mu \textbf{I}\) and applying an arbitrary quasi-Newton scheme to the modified pair \((\textbf{S}_k,\hat{\textbf{Y}}_k):=(\textbf{S}_k,\textbf{Y}_k+\mu \textbf{S}_k)\). For certain quasi-Newton schemes like SR1 and PSB, this actually yields the same results as the approach based on compact representations (see Sects. 4.2, 4.3). In general, however, the two approaches are different.
4.1 Broyden–Fletcher–Goldfarb–Shanno (BFGS)
The BFGS update is often considered the most successful quasi-Newton scheme. Throughout this section, let \(\textbf{B}_{0,k}= \gamma _k \textbf{I}\) for some \(\gamma _k\in \mathbb {R}\). Following [6], the compact representation of L-BFGS is given by
$$\begin{aligned} \textbf{B}_k = \gamma _k \textbf{I} - \begin{bmatrix} \gamma _k \textbf{S}_k&\textbf{Y}_k \end{bmatrix} \begin{bmatrix} \gamma _k \textbf{S}_k^{\textsf{T}}\textbf{S}_k &{} \textbf{L}_k \\ \textbf{L}_k^{\textsf{T}} &{} -\textbf{D}_k \end{bmatrix}^{-1} \begin{bmatrix} \gamma _k \textbf{S}_k^{\textsf{T}} \\ \textbf{Y}_k^{\textsf{T}} \end{bmatrix}, \end{aligned}$$
where
$$\begin{aligned} \textbf{D}_k := \textbf{D}\big (\textbf{S}_k^{\textsf{T}}\textbf{Y}_k\big ), \qquad \textbf{L}_k := \textbf{L}\big (\textbf{S}_k^{\textsf{T}}\textbf{Y}_k\big ) \end{aligned}$$
(recall that \( \textbf{D}(\cdot ) \) denotes the diagonal part and \( \textbf{L} (\cdot ) \) the strict lower triangle of a given matrix). This can be written in the form (15) by defining
$$\begin{aligned} \textbf{A}_k := [\textbf{S}_k , \, \textbf{Y}_k], \qquad \textbf{Q}_k := -\begin{bmatrix} \gamma _k^{-1} \textbf{S}_k^{\textsf{T}}\textbf{S}_k &{} \gamma _k^{-1}\textbf{L}_k \\ \gamma _k^{-1}\textbf{L}_k^{\textsf{T}} &{} -\textbf{D}_k \end{bmatrix}. \end{aligned}$$(22)
Note that \(\textbf{Q}_k\in \mathbb {R}^{2m\times 2m}\).
The BFGS formula has a significant advantage in that the well-definedness of the updates can be controlled. More specifically, assuming that \(\textbf{s}_k^{\textsf{T}}\textbf{y}_k>0\) for all k, it can be shown that the BFGS matrix \(\textbf{B}_k\) is positive definite, so that the regularized BFGS matrix \(\hat{\textbf{B}}_k=\textbf{B}_k+\mu \textbf{I}\) is also positive definite and therefore nonsingular. By the Woodbury matrix identity, this implies that the inner matrix \(\textbf{Q}_k+{\hat{\gamma }}_k^{-1} \textbf{A}_k^{\textsf{T}} \textbf{A}_k\) in (17) is invertible, and thus the regularized Newton step is well-defined for all \(\mu \ge 0\).
In practice, the well-definedness is controlled by means of a so-called cautious updating mechanism [16]. The previous limited memory data is only updated with the next pair \((\textbf{s}_k,\textbf{y}_k)\) if
$$\begin{aligned} \textbf{s}_k^{\textsf{T}} \textbf{y}_k \ge \varepsilon \Vert \textbf{s}_k\Vert ^2, \end{aligned}$$(23)
where \(\varepsilon >0\) is some predefined constant. This guarantees that the L-BFGS matrices \(\textbf{B}_k\) are uniformly positive definite. If \(\nabla f\) is Lipschitz continuous on the set of iterates (or an appropriate level set), then (23) also guarantees that \(\{\textbf{B}_k\}\) is bounded.
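As a sketch, the acceptance test is a one-liner. We assume here the common cautious condition \(\textbf{s}^{\textsf{T}}\textbf{y}\ge \varepsilon \Vert \textbf{s}\Vert ^2\); the precise form of the condition and the value of \(\varepsilon\) are implementation details that may differ from the paper's choice.

```python
# Sketch of a cautious updating test. The condition shown here,
# s^T y >= eps * ||s||^2, is one common choice; the paper's exact
# condition (23) may differ in detail.
import numpy as np

def accept_pair(s, y, eps=1e-8):
    """Accept the pair (s, y) only if it provides safely positive
    curvature, keeping the L-BFGS matrix positive definite."""
    return s @ y >= eps * (s @ s)

s = np.array([1.0, 0.0])
y_good = np.array([2.0, 1.0])    # positive curvature: accepted
y_bad = np.array([-1.0, 1.0])    # negative curvature: rejected
print(accept_pair(s, y_good), accept_pair(s, y_bad))
```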
4.1.1 Updating L-BFGS information
We now describe how the L-BFGS information can be updated in an efficient manner. To avoid repetition, we only describe the case where the previous information is already “full”, i.e., where at least m previous data pairs \((\textbf{s}_i,\textbf{y}_i)\) are available. The modifications necessary to treat the initial steps essentially amount to reindexing and will not be detailed here.
Much of the computational effort of regularized L-BFGS can be mitigated by memorizing certain intermediate results. Motivated by a related trust-region approach in [2], we track, in addition to the matrices \(\textbf{S}_k\) and \(\textbf{Y}_k\), the quantities
Both of these quantities are needed for the computation of the regularized quasi-Newton step (18), (19), but they also occur elsewhere in the iteration and updating process, so that memorizing them saves redundant computational effort. Recall that \(\textbf{A}_k=[\textbf{S}_k , \, \textbf{Y}_k]\), so that in particular
Hence, the matrix \(\textbf{A}_k^{\textsf{T}}\textbf{A}_k\) contains the blocks \(\textbf{S}_k^{\textsf{T}}\textbf{S}_k\), \(\textbf{L}_k\), and \(\textbf{D}_k\) from (22) as submatrices.
When passing from k to \(k+1\), these matrices and vectors can be updated as follows. If the data pair \((\textbf{s}_k,\textbf{y}_k)\) is rejected, then \(\textbf{A}_k\) remains unchanged, and we may update \(\textbf{A}_k^{\textsf{T}}\textbf{g}_k\) by direct computation. If the data pair is accepted, then the updating process requires more care since both \(\textbf{A}_k^{\textsf{T}}\textbf{A}_k\) and \(\textbf{A}_k^{\textsf{T}}\textbf{g}_k\) need to be incremented. In this case, the new matrices \( \textbf{S}_{k+1} \) and \( \textbf{Y}_{k+1} \) consist of the last \( m-1 \) columns of the old matrices \( \textbf{S}_{k} \) and \( \textbf{Y}_{k} \), respectively, to which the new vectors \(\textbf{s}_k\) and \(\textbf{y}_k\) are appended as the last column. We then begin by computing the vectors
where \(\textbf{p}_k\) is given by (19); as well as the scalar quantities \((\alpha _1, \alpha _2, \alpha _3):= (\textbf{s}_k^{\textsf{T}} \textbf{s}_k, \textbf{s}_k^{\textsf{T}} \textbf{y}_k, \textbf{y}_k^{\textsf{T}} \textbf{y}_k)\). This information is then used to update \(\textbf{A}_k^{\textsf{T}} \textbf{A}_k\) blockwise using the formulas
where “\(*\)” is given by symmetry, and expressions of the form \((\textbf{S}_k^{\textsf{T}}\textbf{S}_k)_{2:m,2:m}\) or \(\textbf{v}_{2:m}\) denote submatrices and vectors built from the subscripted index ranges. Finally, we have \(\textbf{Y}_{k+1}^{\textsf{T}}\textbf{S}_{k+1}=(\textbf{S}_{k+1}^{\textsf{T}}\textbf{Y}_{k+1})^{\textsf{T}}\), and the new vector \(\textbf{A}_{k+1}^{\textsf{T}} \textbf{g}_{k+1}\) is by definition equal to \(\textbf{w}\).
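The intermediate vectors of the paper are not reproduced above, but the general bookkeeping pattern can be illustrated for one Gram block (function names are ours): when a column is appended to a full window, the retained block of \(\textbf{S}_k^{\textsf{T}}\textbf{S}_k\) is shifted and only the new row and column are computed, at a cost of \(O(mn)\).

```python
import numpy as np

def update_window(S, G, s, m):
    """Maintain S (n x p, p <= m) and its Gram matrix G = S^T S under
    a sliding window: drop the oldest column when full, append s, and
    update G in O(mn) by reusing the retained block of inner products."""
    if S.shape[1] == m:
        S = S[:, 1:]               # discard the oldest column
        G = G[1:, 1:]              # ... and its row/column of G
    u = S.T @ s                    # only the new inner products
    S = np.hstack([S, s[:, None]])
    G = np.block([[G, u[:, None]],
                  [u[None, :], np.array([[float(s @ s)]])]])
    return S, G
```

The same shift-and-border pattern applies to the \(\textbf{S}_k^{\textsf{T}}\textbf{Y}_k\) and \(\textbf{Y}_k^{\textsf{T}}\textbf{Y}_k\) blocks of \(\textbf{A}_k^{\textsf{T}}\textbf{A}_k\).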
4.1.2 Computational complexity
Let us now comment on the complexity involved in the computation of the regularized quasi-Newton step. Assuming that the product \(\textbf{A}_k^{\textsf{T}}\textbf{g}_k\) has been formed, the main cost is the solution of a \(2 m\times 2 m\) symmetric linear system to form \(\textbf{p}_k\), and the multiplication of \(\textbf{p}_k\) with the \(n\times 2 m\) matrix \(\textbf{A}_k\). Hence, the complexity of the regularized quasi-Newton equation is \(2 m n+O(m^3)\) multiplications.
When a step is successful, the existing data needs to be updated according to the formulas developed above. The dominating cost of this is 2mn multiplications for the computation of \(\textbf{w}=\textbf{A}_{k+1}^{\textsf{T}}\textbf{g}_{k+1}\). Hence, the overall computational effort is at most 2mn multiplications for an unsuccessful step, and 4mn for a successful step.
The computational cost of the \(2m \times 2m\) linear equation (19) for the computation of \(\textbf{p}_k\) is of order \(O(m^3)\). Thus, if \(m\ll n\), this cost is negligible in comparison to \(mn\). The slight computational overhead induced by this linear equation can be mitigated further by using the Schur complement of \(\textbf{Q}_k + \hat{\gamma }_k^{-1}\textbf{A}_k^{\textsf{T}}\textbf{A}_k\) to reduce the \(2m \times 2m\) inversion to two \(m\times m\) Cholesky factorizations. See [4] for more details.
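The step computation can be sketched as follows, under the assumption that the compact form reads \(\textbf{B}_k=\gamma _k\textbf{I}+\textbf{A}_k\textbf{Q}_k^{-1}\textbf{A}_k^{\textsf{T}}\) (our reading of (15); the signs and scalings of the paper's (17)–(19) may differ). The Woodbury identity then reduces the regularized solve to one small symmetric system.

```python
import numpy as np

def regularized_step(A, Q, gamma, mu, g):
    """Compute d = -(B + mu*I)^{-1} g for B = gamma*I + A Q^{-1} A^T
    via the Woodbury identity: only one small (2m x 2m) system is
    solved, so the cost is 2mn + O(m^3) multiplications."""
    ghat = gamma + mu
    p = np.linalg.solve(Q + (A.T @ A) / ghat, A.T @ g)   # small system
    return -(g - A @ p / ghat) / ghat
```

The identity behind this is \((\hat{\gamma }\textbf{I}+\textbf{A}\textbf{Q}^{-1}\textbf{A}^{\textsf{T}})^{-1} = \hat{\gamma }^{-1}\textbf{I}-\hat{\gamma }^{-1}\textbf{A}(\textbf{Q}+\hat{\gamma }^{-1}\textbf{A}^{\textsf{T}}\textbf{A})^{-1}\textbf{A}^{\textsf{T}}\hat{\gamma }^{-1}\), which exhibits the inner matrix mentioned after (17).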
4.2 Symmetric rank-one (SR1)
For SR1, the compact representation takes on the form
where \(\textbf{D}_k\) and \(\textbf{L}_k\) are again given by (21). This can be written in the form (15) by defining
Note that \(\textbf{Q}_k\in \mathbb {R}^{m\times m}\) in this case.
If \(\textbf{B}_{0,k}=\gamma _k \textbf{I}\), then (26) can be simplified to
The well-definedness of the SR1 update is hard to guarantee in practice because the underlying rank-one formula involves a denominator of the form \((\textbf{y}_k-\textbf{B}_k\textbf{s}_k)^{\textsf{T}}\textbf{s}_k\), which can vanish. Thus, when applying formula (16) to the SR1 setting, it is important to clarify how this situation is handled. Note that it is not possible to predict which new data \((\textbf{s}_{k+1},\textbf{y}_{k+1})\) might lead to ill-conditioning because this crucially depends on the previous information \((\textbf{S}_k,\textbf{Y}_k)\). In fact, even the discarding of old data at some point during the iteration might have an influence and change the well-definedness of the SR1 update.
Fortunately, there is a simple and effective way of skipping ill-conditioned updates “on the fly”, i.e., during the computation of the quasi-Newton step. This effectively amounts to skipping an intermediate pair \((\textbf{s}_i,\textbf{y}_i)\) when necessary and continuing the SR1 update with \((\textbf{s}_{i+1},\textbf{y}_{i+1})\) instead. It was observed in [6] that ill-definedness of one of these updates amounts to the singularity of a leading principal submatrix of \(\textbf{Q}_k\), or equivalently, to a vanishing pivot element during a triangularization of \(\textbf{Q}_k\). When this occurs, it is proposed in [6] to skip the update by essentially ignoring the current row and column of \(\textbf{Q}_k\), and the current column of \(\textbf{A}_k\) (which contains the corresponding vectors \(\textbf{s}_i\) and \(\textbf{y}_i\)).
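For intuition, here is a dense recursive sketch of SR1 with the classical skipping safeguard; the threshold r is an illustrative choice of ours, and the compact approach above instead detects the same situation through the pivots of \(\textbf{Q}_k\).

```python
import numpy as np

def sr1_update(B, s, y, r=1e-8):
    """One dense SR1 update, skipped when the denominator
    (y - B s)^T s is small relative to ||s|| * ||y - B s||.
    Returns the new matrix and whether the pair was used."""
    v = y - B @ s
    denom = float(v @ s)
    if abs(denom) <= r * np.linalg.norm(s) * np.linalg.norm(v):
        return B, False            # ill-conditioned: skip this pair
    return B + np.outer(v, v) / denom, True
```

After an accepted update the secant condition \(\textbf{B}_{+}\textbf{s}=\textbf{y}\) holds exactly, which is the defining property of SR1.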
The above procedure can be adapted to the regularized SR1 setting by observing that the SR1 update “commutes” with the regularization in a certain sense. More specifically, if \(\textbf{B}_k={\text {SR1}}(\textbf{B}_{0,k},\textbf{S},\textbf{Y})\) denotes the SR1 update, then
for all \(\mu \ge 0\), provided that the left-hand side exists. Moreover, an easy calculation shows that the matrix \(\textbf{Q}_k+\textbf{A}_k^{\textsf{T}} \hat{\textbf{B}}_{0,k}^{-1} \textbf{A}_k\) from (16), which needs to be inverted for the computation of the regularized Newton step, coincides (up to scaling) with the analogue of \(\textbf{Q}_k\) which would arise for the SR1 update corresponding to \(\hat{\textbf{B}}_{0,k}\) and \(\textbf{Y}_k+\mu \textbf{S}_k\).
4.2.1 Updating L-SR1 information
The quantities involved in the L-SR1 computations can be updated in a similar fashion to the L-BFGS case; see Sect. 4.1. We again maintain the quantities
These can be formed and updated as before. Moreover, they can be used to directly form the matrices \(\textbf{A}_k\) and \(\textbf{Q}_k\), the product \(\textbf{A}_k^{\textsf{T}}\textbf{g}_k\), and the matrix \(\textbf{A}_k^{\textsf{T}}\textbf{A}_k\).
4.2.2 Computational complexity
The computational cost of the regularized L-SR1 method is as follows. In each successful iteration, the quantities (29) are updated, and the matrix \(\textbf{A}_k=\textbf{Y}_k-\textbf{B}_{0,k}\textbf{S}_k\) is formed. Using the techniques from Sect. 4.1, these operations require 3mn multiplications.
Moreover, the quasi-Newton step needs to be calculated in each step, which entails the solution of an \(m\times m\) symmetric linear system to obtain \(\textbf{p}_k\), and the multiplication of \(\textbf{p}_k\) with the \(n\times m\) matrix \(\textbf{A}_k\), requiring another mn multiplications.
In total, the cost of a successful step is therefore 4mn multiplications, and the cost of an unsuccessful step is mn multiplications (down from 2mn in the BFGS case).
4.3 Powell-symmetric-Broyden (PSB)
As a third example, we include the classical PSB formula from [25]. This approach is interesting because the PSB update is always well-defined and has certain well-known minimality properties. The PSB update is given by
The compact representation of PSB is given in the next theorem.
Note that there is a related representation in [3] for a multipoint secant version of PSB. The two representations coincide when \(m=1\).
Theorem 3
(Compact representation of PSB) The PSB formula admits the compact representation
where \(\textbf{W}_k:=\textbf{Y}_k-\textbf{B}_{0,k}\textbf{S}_k\), \(\textbf{U}_k\) is the (non-strictly) upper triangular part of \(\textbf{S}_k^{\textsf{T}}\textbf{S}_k\), \(\textbf{L}_k\) is the strictly lower triangular part of \(\textbf{S}_k^{\textsf{T}}\textbf{W}_k\), and \(\textbf{D}_k\) is the diagonal part of \(\textbf{S}_k^{\textsf{T}}\textbf{W}_k\).
Proof
To simplify some technical details, we restrict the proof to the case where \(k = m\) (i.e., the algorithm has performed exactly m steps, and the matrices \(\textbf{S}_k\) and \(\textbf{Y}_k\) are “full”). Observe first that (30) can be rewritten as
Therefore, we can write \(\textbf{B}_k=\textbf{M}_k+\textbf{N}_k\), where \(\textbf{M}_k,\textbf{N}_k\) are recursively defined through the formulas
where \(\textbf{V}_i:=\textbf{I}-(\textbf{s}_i^{\textsf{T}}\textbf{s}_i)^{-1}\textbf{s}_i\textbf{s}_i^{\textsf{T}}\) for all i. Observe now that \(\textbf{V}_0 \cdot \ldots \cdot \textbf{V}_{k-1}=\textbf{I}-\textbf{S}_k \textbf{U}_k^{-1}\textbf{S}_k^{\textsf{T}}\) by [6, Lem. 2.1], so that
We proceed by using (finite) induction to show that
where \(\tilde{\textbf{L}}_i:=\textbf{L}(\textbf{S}_i^{\textsf{T}}\textbf{Y}_i)\) and \( \tilde{\textbf{D}}_i:= \textbf{D} (\textbf{S}_i^{\textsf{T}}\textbf{Y}_i)\). Before we verify this formula, we show that it yields the desired compact representation of the PSB formula. Indeed, using (32) and the definitions of the matrices \( \tilde{\textbf{L}}_k, \tilde{\textbf{D}}_k, \textbf{L}_k, \textbf{D}_k \), respectively, we obtain
On the other hand, exploiting the fact that
using \( \textbf{W}_k=\textbf{Y}_k-\textbf{B}_{0,k}\textbf{S}_k \), and expanding (31), it is easy to see that we obtain the same expression.
Hence it remains to verify (32) by induction. For \( i = 1 \), we have
Together with the observation that
an elementary calculation shows that (32) holds for \( i = 1 \). Suppose the statement is true for some \( i = 1,\ldots ,k-1\). Using the induction hypothesis together with (33), a straightforward calculation shows that
On the other hand, let us calculate the expression (32) for \( i + 1 \). Based on the partitions
we obtain
hence
Using these expressions and expanding (32) with i replaced by \( i+1 \), and taking into account once again the definition of \( \textbf{V}_i \), an elementary calculation shows that the resulting matrix \( \textbf{N}_{i+1} \) coincides with the one obtained previously. This completes the induction. \(\square \)
If \(\textbf{B}_{0,k}=\gamma _k \textbf{I}\) for some \(\gamma _k\in \mathbb {R}\), then (31) can be rewritten as
where \(\textbf{A}_k=[\textbf{S}_k, \textbf{Y}_k]\) as before. This form of \(\textbf{B}_k\) has the advantage that all involved quantities can be obtained as submatrices of the product \(\textbf{A}_k^{\textsf{T}}\textbf{A}_k\).
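For reference, the rank-two PSB correction (30) can be sketched in dense recursive form (a sketch only; the limited memory method of course never forms \(\textbf{B}_k\)):

```python
import numpy as np

def psb_update(B, s, y):
    """One Powell-symmetric-Broyden update: a symmetric rank-two
    correction that is always well defined (only s != 0 is needed)
    and enforces the secant condition B_new @ s == y."""
    v = y - B @ s
    ss = float(s @ s)
    return (B + (np.outer(v, s) + np.outer(s, v)) / ss
              - float(v @ s) * np.outer(s, s) / ss**2)
```

Unlike BFGS and SR1, no curvature or denominator condition is required, which is the well-definedness advantage mentioned at the start of this subsection.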
4.3.1 Updating and complexity
As before, the L-PSB quantities can be updated in a similar fashion to the L-BFGS case; see Sect. 4.1. We again maintain the quantities
These can be updated as before and used to compute the quasi-Newton direction via the inverse formula (16). The complexity of the L-PSB step equals that of L-BFGS.
5 Numerical experiments
The benchmark implementation described here can be found online at https://github.com/dmsteck/paperregularizedqnbenchmark [27].
In this section, we compare a selection of regularized quasi-Newton methods (Algorithm 1) amongst each other and with existing L-BFGS-type line search and trust region algorithms from the literature.
Algorithms were tested on all large-scale (\(n\ge 1000\)) problems from the CUTEst collection [14]. The implementation was done in Python 3 using the PyCUTEst interface [13]. All problems were run with the initial points supplied by the library. We excluded test problems on which all algorithms failed within the threshold of 100,000 iterations (see below). We also omitted FLETCBV2 because its initial point is a stationary point. The final test set after these considerations consists of 77 problems.
The results for different algorithms are compared using performance profiles [11] based on the number of function evaluations. Note that the regularization methods evaluate the function exactly once per successful or unsuccessful step, so that the number of function evaluations equals the number of iterations. Furthermore, aside from function or gradient evaluations, all tested methods have a similar computational complexity per step (see [2, 19] and Sect. 4), so that function evaluations provide a simple yet meaningful baseline metric.
Note that we did not account for gradient evaluations in our comparison; the regularization methods (and the trust-region comparison method in Sect. 5.2) evaluate \(\nabla f\) exactly once in every successful iteration, whereas Wolfe-based line search methods evaluate \(\nabla f\) within the inner line search loop. Hence, accounting for gradient evaluations would benefit many of our methods in the subsequent comparisons. However, to keep things simple, we have avoided a more granular breakdown and focused exclusively on function evaluations.
Whenever an algorithm did not solve a particular problem to within tolerance (see below), the number of function evaluations was set to \(+\infty \) for the purpose of comparison.
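For completeness, here is a small sketch of how Dolan–Moré profiles [11] are computed from such a cost table (the function name is ours):

```python
import numpy as np

def performance_profile(T, taus):
    """Dolan-More performance profile: T[i, j] is the cost (here,
    function evaluations) of solver j on problem i, with np.inf for
    failures; each problem is assumed solved by at least one solver.
    Returns rho[j, k] = fraction of problems on which solver j is
    within a factor taus[k] of the best solver."""
    ratios = T / T.min(axis=1, keepdims=True)   # performance ratios
    return np.array([[float(np.mean(ratios[:, j] <= tau)) for tau in taus]
                     for j in range(T.shape[1])])
```

Failed runs have infinite ratio, so they never count as "within factor \(\tau \)", which matches the convention above.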
5.1 Comparison of regularized limited memory methods
We implemented the following four regularization-based algorithms:
 reg-L-BFGS: Algorithm 1 using the L-BFGS technique as set out in Sect. 4.1;
 reg-L-BFGS-sec: Algorithm 1 using the regularized secant version of L-BFGS as discussed in Remark 2 (see also [28]);
 reg-L-SR1: Algorithm 1 using the L-SR1 technique as set out in Sect. 4.2;
 reg-L-PSB: Algorithm 1 using the L-PSB technique as set out in Sect. 4.3.
The implementations all use the same hyperparameters
To guarantee well-definedness, reg-L-BFGS and reg-L-BFGS-sec are implemented using the cautious updating scheme (23) with \(\varepsilon :=10^{-8}\). The reg-L-SR1 and reg-L-PSB algorithms benefit from indefinite Hessian approximations [5, 7] and were therefore not combined with the cautious updating scheme. However, for these methods, the cautious scheme was still applied to the update of the rolling initial approximation (37); see below.
Inspired by a technique from [2], all algorithms begin with a single Moré–Thuente line search along the normalized negative gradient direction prior to the main iteration loop (see Sect. 5.2 for more details). This has the advantage of providing an initial memory pair \((\textbf{s}_0, \textbf{y}_0)\) that passes the cautious update check (23), and reducing the impact of any initial backtracking on the iteration numbers.
The algorithms were terminated as soon as either
The initial estimate \(\textbf{B}_{0,k}\) in step k is defined by the standard formula
In addition, we adopted a lower threshold \(\mu _{\min }:=10^{-4}\) for the regularization parameter. This improved the practical behavior of the method (particularly in the L-BFGS case) and also prevented the regularization parameter from becoming zero in limited-precision arithmetic (Fig. 1).
It may seem that the above choices favor high regularization parameters over low ones and could therefore impede fast asymptotic convergence. What we found empirically is that Algorithm 1 (with L-BFGS) often behaves best when the regularization parameter is changed infrequently. This suggests that the parameter should be increased sharply when necessary (to avoid repeated increases), and decreased only when the step quality is very good. This is reflected in our choice of parameters.
Note also that limited memory methods rarely achieve actual superlinear convergence; the typical behavior is asymptotically linear [19], and classical results for inexact Newton methods (e.g., [23, Thm. 7.1]) indicate that a small but nondecaying value of \(\mu _k\) will typically preserve linear convergence. This indicates that the choices made here are sound from a theoretical point of view.
Comparable studies in other papers [2, 28] indicate that regularized methods may benefit from a nonmonotonicity strategy. Therefore, and to obtain a larger dataset, we also implemented nonmonotone versions of all algorithms, with \(M:=8\) chosen as the nonmonotonicity offset; this was incorporated into the methods by replacing the reference value \(f(\textbf{x}_k)\) in the regularization control (8) and the line search routines by \(\max _{0\le i < M} f(\textbf{x}_{k-i})\) for \(k\ge M\). The initial steps \(k=0,\ldots ,M-1\) were treated without modification.
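The resulting reference value can be sketched as follows, where f_hist holds the history \(f(\textbf{x}_0),\ldots ,f(\textbf{x}_k)\):

```python
def nonmonotone_reference(f_hist, M=8):
    """Reference value for the acceptance tests: plain f(x_k) during
    the initial steps k = 0, ..., M-1, and the maximum of the last M
    function values max_{0 <= i < M} f(x_{k-i}) once k >= M."""
    if len(f_hist) <= M:           # k = len(f_hist) - 1 < M
        return f_hist[-1]
    return max(f_hist[-M:])
```

Replacing \(f(\textbf{x}_k)\) by this maximum relaxes the acceptance test, allowing occasional increases of the objective.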
Figure 2 illustrates the relative behavior of the monotone and nonmonotone implementations. All algorithms seem to benefit from a nonmonotonicity strategy. It should be emphasized that our choice of such a strategy is rather simple, but we believe it suffices to illustrate the general picture.
Overall, somewhat unsurprisingly, L-BFGS turns out to be by far the most efficient quasi-Newton scheme even in the context of regularization. The regularized variants of L-SR1 and L-PSB are moderately competitive but fall short of the overall performance of reg-L-BFGS and reg-L-BFGS-sec.
An interesting observation we made during our testing is that L-SR1 and, in particular, L-PSB were actually more efficient when used with a more “optimistic” regularization scheme (i.e., lower regularization parameters). This is somewhat surprising because these methods generate indefinite Hessian approximations, which should, intuitively, benefit the most from regularization; on the other hand, L-BFGS generates an approximation that is positive definite anyway, which suggests that regularization may be less necessary there. The numerical evidence we observed contradicts this intuition.
We can only give a partial explanation for this phenomenon. It is well-known that BFGS and L-BFGS are related to the classical conjugate gradient method [21], which suggests that L-BFGS imposes some kind of relationship (a generalized “conjugacy”) on successive search directions (see also the discussion after [2, Eq. 65]). We are unaware of a rigorous definition of such a property, but the relationship between successive search directions may be preserved in a certain way when L-BFGS is used with a regularization parameter that changes infrequently. On the other hand, L-SR1 and L-PSB are generally considered to generate more accurate approximations of the exact Hessian (especially when it is indefinite), which indicates that these methods behave more similarly to a conventional Newton algorithm and therefore benefit from a quicker reduction of the regularization parameter.
The regularized secant version reg-L-BFGS-sec is interesting because it is rather simple to implement (by using the standard two-loop recursion) yet rivals the robustness of reg-L-BFGS; see Fig. 1.
Finally, Table 1 shows the proportion of accepted steps and the total number of solved problems for all four regularization-based algorithms. The L-BFGS algorithms stand out for their high number of solved problems overall, and acceptance ratios of around 99% in the nonmonotone case. Interestingly, reg-L-PSB achieves around 99% acceptance in the monotone case, tapering off to around 72% for the nonmonotone implementation.
5.2 Comparison to existing algorithms
Let us now measure reg-L-BFGS against relevant algorithms available in the literature. The “reference” algorithms we use are:
 armijo-L-BFGS: the ordinary L-BFGS method with Armijo line search and the cautious updating scheme (23);
 wolfe-L-BFGS: the Liu–Nocedal L-BFGS method [19] with Moré–Thuente line search [20];
 eig-L-BFGS: a slightly simplified version of the EIG\((\infty , 2)\) trust region L-BFGS algorithm from [2].
The Armijo search uses standard backtracking by repeatedly halving the step size \(t_k\) until
where (in this context) \(\textbf{d}_k\) is the quasi-Newton step. The Moré–Thuente line search uses the implementation of Diane O’Leary [24], translated into Python. It terminates when
The eig-L-BFGS algorithm is based on the EIG\((\infty , 2)\) implementation available at https://gratton.perso.enseeiht.fr/LBFGS/index.html. To make the comparison fair, we slightly simplified this algorithm by replacing the two-stage initial line search of EIG\((\infty , 2)\) with a single Moré–Thuente search along the normalized negative gradient direction (consistent with the implementation of the regularization methods; see Sect. 5.1). Furthermore, the stopping criteria, cautious update mechanism, and trust region control parameters of EIG\((\infty , 2)\) were brought in line with the other implementations.
Note that a nonmonotone implementation of the EIG\((\infty , 2)\) algorithm from [2] is not available, so we have excluded it from the corresponding comparisons.
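The Armijo backtracking described above can be sketched as follows; the sufficient-decrease constant sigma, the initial step, and the iteration cap are illustrative choices of ours, not the paper's.

```python
import numpy as np

def armijo_backtracking(f, x, d, g, sigma=1e-4, t0=1.0, max_halvings=50):
    """Backtracking by repeated halving until the Armijo condition
    f(x + t*d) <= f(x) + sigma * t * g^T d  holds, where g is the
    gradient at x and d is a descent direction."""
    fx = f(x)
    slope = float(g @ d)           # directional derivative, negative
    t = t0
    for _ in range(max_halvings):
        if f(x + t * d) <= fx + sigma * t * slope:
            break
        t *= 0.5
    return t
```

On a well-scaled quadratic with the full quasi-Newton step, the unit step size typically passes the test immediately, so no halving occurs.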
The algorithms in this section all use the stopping criteria
depending on whether the algorithm is of line search or trust region type. Here, \(t_k\) is the line search step size and \(\Delta _k\) denotes the trust-region radius.
Figure 3 illustrates the performance of the three algorithms mentioned above and reg-L-BFGS. The L-BFGS algorithm with Moré–Thuente line search [20] is competitive on the fastest problems. Similar to the results in [2], however, we found that this and similar Wolfe–Powell based algorithms were noticeably less efficient than the others due to an excessive number of function evaluations.
reg-L-BFGS and eig-L-BFGS perform very similarly in the monotone case, with eig-L-BFGS (the algorithm based on [2]) attaining a slight advantage. This is not entirely surprising, as regularization can be seen as an approximation of trust region algorithms. In exchange, eig-L-BFGS requires a (low-dimensional) eigenvalue decomposition in every iteration where the trust region is active, whereas reg-L-BFGS only solves a symmetric linear equation.
Figure 4 compares the behavior of the monotone and nonmonotone algorithms. The nonmonotone version of reg-L-BFGS seems to outperform both eig-L-BFGS and the nonmonotone versions of armijo-L-BFGS and wolfe-L-BFGS (see Fig. 3).
Note again that our comparison above is based exclusively on function evaluations, not CPU times. It may be interesting to also conduct an analysis of CPU times, but this would effectively require another programming language due to the lack of optimizing compilation in languages like Python or MATLAB, which incurs significant overhead on loops and repeated assignment operations. We anticipate that realistic CPU times would slightly favor the line search L-BFGS methods due to the bookkeeping effort associated with limited memory updating in the regularized methods (see Sect. 4.1).
Finally, Table 2 shows the average ratio of accepted steps and the total number of solved problems for the algorithms from this section and reg-L-BFGS. The interpolation-based Moré–Thuente line search achieves around 95% acceptance in both the monotone and nonmonotone implementations. Somewhat unsurprisingly, the trust-region based eig-L-BFGS algorithm achieves the highest acceptance rate in the monotone case. reg-L-BFGS again stands out with 99% acceptance in the nonmonotone case.
Remark 3
(Further improvements) It is possible to incorporate further modifications and improvements into the regularized quasi-Newton schemes, but we have abstained from doing so in order to facilitate a fair comparison. For instance, it may be beneficial to update the quasi-Newton information in rejected steps, since the trial function value and gradient provide meaningful information [28]. Note that this technique is covered by the framework of Algorithm 1 since we allow \(\textbf{B}_k\) to be chosen anew in each iteration.
6 Final remarks
The results and numerical evidence in this paper demonstrate that regularization is a powerful globalization technique for limited memory quasi-Newton methods.
The numerical results in particular indicate that regularization techniques can substantially improve the efficiency and robustness of L-BFGS on large-scale nonlinear problems or when nonmonotonicity strategies are employed. An intuitive explanation of this phenomenon is that regularization “stabilizes” the Hessian approximation in the sense that its condition number becomes smaller, which may make the method less susceptible to step jumps or “discontinuities” induced by nonmonotonicity or extreme nonlinearity.
We hope that the findings presented here will stimulate more research into these techniques, for example, on quantitative convergence results or on how to integrate regularization with BFGS in a full-memory context.
Data availability statement
The data for the test problems used in this paper are available from reference [14].
Code availability
The software implementing the algorithms is available online; see [27].
References
Brust, J., Erway, J.B., Marcia, R.F.: On solving L-SR1 trust-region subproblems. Comput. Optim. Appl. 66(2), 245–266 (2017). https://doi.org/10.1007/s1058901698683
Burdakov, O.P., Gong, L., Zikrin, S., Yuan, Y.X.: On efficiently combining limited-memory and trust-region techniques. Math. Program. Comput. 9(1), 101–134 (2017). https://doi.org/10.1007/s1253201601097
Burdakov, O.P., Martínez, J.M., Pilotta, E.A.: A limited-memory multipoint symmetric secant method for bound constrained optimization. Ann. Oper. Res. 117, 51–70 (2002). https://doi.org/10.1023/A:1021561204463. Operations research and systems (CLAIO 2000), Part II (Mexico City)
Burke, J.V., Wiegmann, A., Xu, L.: Limited memory BFGS updating in a trust–region framework. Tech. rep., University of Washington (2008)
Byrd, R.H., Khalfan, H.F., Schnabel, R.B.: Analysis of a symmetric rank-one trust region method. SIAM J. Optim. 6(4), 1025–1039 (1996). https://doi.org/10.1137/S1052623493252985
Byrd, R.H., Nocedal, J., Schnabel, R.B.: Representations of quasi-Newton matrices and their use in limited memory methods. Math. Program. 63(2, Ser. A), 129–156 (1994). https://doi.org/10.1007/BF01582063
Conn, A.R., Gould, N.I.M., Toint, P.L.: Testing a class of methods for solving minimization problems with simple bounds on the variables. Math. Comput. 50(182), 399–430 (1988). https://doi.org/10.2307/2008615
Conn, A.R., Gould, N.I.M., Toint, P.L.: Trust-region methods. MPS/SIAM Ser. Optim. SIAM, Philadelphia (2000). https://doi.org/10.1137/1.9780898719857
DeGuchy, O., Erway, J.B., Marcia, R.F.: Compact representation of the full Broyden class of quasi-Newton updates. Numer. Linear Algebra Appl. 25(5), e2186, 15 (2018). https://doi.org/10.1002/nla.2186
Dennis, J.E., Jr., Moré, J.J.: Quasi-Newton methods, motivation and theory. SIAM Rev. 19(1), 46–89 (1977). https://doi.org/10.1137/1019005
Dolan, E.D., Moré, J.J.: Benchmarking optimization software with performance profiles. Math. Program. 91(2, Ser. A), 201–213 (2002). https://doi.org/10.1007/s101070100263
Erway, J.B., Jain, V., Marcia, R.F.: Shifted L-BFGS systems. Optim. Methods Softw. 29(5), 992–1004 (2014). https://doi.org/10.1080/10556788.2014.894045
Fowkes, J., Roberts, L.: PyCUTEst: Python interface to the CUTEst optimization test environment. https://jfowkes.github.io/pycutest. Accessed May 2019
Gould, N.I.M., Orban, D., Toint, P.L.: CUTEst: a constrained and unconstrained testing environment with safe threads for mathematical optimization. Comput. Optim. Appl. 60(3), 545–557 (2015). https://doi.org/10.1007/s1058901496873
Li, D.H., Fukushima, M.: A modified BFGS method and its global convergence in nonconvex minimization. J. Comput. Appl. Math. 129(1–2), 15–35 (2001). https://doi.org/10.1016/S03770427(00)005409
Li, D.H., Fukushima, M.: On the global convergence of the BFGS method for nonconvex unconstrained optimization problems. SIAM J. Optim. 11(4), 1054–1064 (2001). https://doi.org/10.1137/S1052623499354242
Li, D.H., Fukushima, M., Qi, L., Yamashita, N.: Regularized Newton methods for convex minimization problems with singular solutions. Comput. Optim. Appl. 28(2), 131–147 (2004). https://doi.org/10.1023/B:COAP.0000026881.96694.32
Liu, C., Vander Wiel, S.A.: Statistical quasi-Newton: a new look at least change. SIAM J. Optim. 18(4), 1266–1285 (2007). https://doi.org/10.1137/040614700
Liu, D.C., Nocedal, J.: On the limited memory BFGS method for large scale optimization. Math. Program. 45(3, (Ser. B)), 503–528 (1989). https://doi.org/10.1007/BF01589116
Moré, J.J., Thuente, D.J.: Line search algorithms with guaranteed sufficient decrease. ACM Trans. Math. Softw. 20(3), 286–307 (1994). https://doi.org/10.1145/192115.192132
Nazareth, L.: A relationship between the BFGS and conjugate gradient algorithms and its implications for new algorithms. SIAM J. Numer. Anal. 16(5), 794–800 (1979). https://doi.org/10.1137/0716059
Nocedal, J.: Updating quasiNewton matrices with limited storage. Math. Comput. 35(151), 773–782 (1980). https://doi.org/10.2307/2006193
Nocedal, J., Wright, S.J.: Numerical Optimization. Springer Series in Operations Research and Financial Engineering, 2nd edn. Springer, New York (2006)
O’Leary, D.: A Matlab implementation of a MINPACK line search algorithm by Jorge J. Moré and David J. Thuente (1991). https://www.cs.umd.edu/users/oleary/software/. Accessed 26 Feb 2019
Powell, M.J.D.: A new algorithm for unconstrained optimization. In: Nonlinear Programming (Proc. Sympos., Univ. of Wisconsin, Madison, Wis., 1970), pp. 31–65. Academic Press, New York (1970)
Schraudolph, N.N., Yu, J., Günter, S.: A stochastic quasi-Newton method for online convex optimization. In: Artificial Intelligence and Statistics, pp. 436–443 (2007)
Steck, D.: dmsteck/paperregularizedqnbenchmark: Initial release (2022). https://doi.org/10.5281/zenodo.7479177. https://github.com/dmsteck/paperregularizedqnbenchmark
Sugimoto, S., Yamashita, N.: A regularized limited-memory BFGS method for unconstrained minimization problems. Technical report 2014–001. Kyoto University, Department of Applied Mathematics and Physics (2014)
Ueda, K., Yamashita, N.: Convergence properties of the regularized Newton method for the unconstrained nonconvex optimization. Appl. Math. Optim. 62(1), 27–46 (2010). https://doi.org/10.1007/s0024500990949
Ueda, K., Yamashita, N.: A regularized Newton method without line search for unconstrained optimization. Comput. Optim. Appl. 59(1–2), 321–351 (2014). https://doi.org/10.1007/s105890149656x
Wei, Z., Li, G., Qi, L.: New quasi-Newton methods for unconstrained optimization problems. Appl. Math. Comput. 175(2), 1156–1188 (2006). https://doi.org/10.1016/j.amc.2005.08.027
Zhang, H., Ni, Q.: A new regularized quasi-Newton method for unconstrained optimization. Optim. Lett. 12(7), 1639–1658 (2018). https://doi.org/10.1007/s115901181236z
Funding
Open Access funding enabled and organized by Projekt DEAL.
Ethics declarations
Conflict of interest
Neither author has any relevant financial or nonfinancial interest to disclose.
Cite this article
Kanzow, C., Steck, D. Regularization of limited memory quasiNewton methods for largescale nonconvex minimization. Math. Prog. Comp. 15, 417–444 (2023). https://doi.org/10.1007/s12532023002384
Keywords
 Limited memory methods
 QuasiNewton methods
 LBFGS
 Regularized Newton methods
 Global convergence
 Largescale optimization