1 Introduction

Let \(f:\mathbb {R}^n\rightarrow \mathbb {R}\), \(n\in \mathbb {N}\), be a twice continuously differentiable function, and consider the nonlinear minimization problem

$$\begin{aligned} \mathop {\textrm{minimize}}\limits _{\textbf{x}\in \mathbb {R}^n}\, f(\textbf{x}). \end{aligned}$$
(1)

Methods of Newton or quasi-Newton type are commonly acknowledged to be some of the most efficient algorithms for the solution of such problems. Given a current iterate \(\textbf{x}_k\), these methods compute the iteration step \(\textbf{d}_k\) by solving a (quasi-)Newton equation of the form

$$\begin{aligned} \textbf{B}_k \textbf{d}_k = -\nabla f(\textbf{x}_k), \end{aligned}$$
(2)

where \(\textbf{B}_k\in \mathbb {R}^{n\times n}\) is either the Hessian \(\nabla ^2 f(\textbf{x}_k)\) or an approximation thereof. When n is large, the matrix \(\textbf{B}_k\) is usually not stored explicitly. Instead, one uses so-called limited memory quasi-Newton methods, which require the storage of a few vector pairs

$$\begin{aligned} \textbf{s}_k:=\textbf{x}_{k+1}-\textbf{x}_k, \qquad \textbf{y}_k:=\nabla f(\textbf{x}_{k+1})-\nabla f(\textbf{x}_k), \end{aligned}$$

and use this information to construct an implicit approximation to the Hessian matrix. This approximation is never formed explicitly; instead, the pairs \((\textbf{s}_k,\textbf{y}_k)\) are used to directly evaluate matrix–vector products of the form \(\textbf{B}_k \textbf{x}\) or \(\textbf{B}_k^{-1} \textbf{y}\) as necessary. Arguably the most successful quasi-Newton schemes are the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method [10] and its limited memory counterpart L-BFGS [6, 19, 22]. Other examples include symmetric rank-one (SR1), Powell-symmetric-Broyden (PSB), Davidon–Fletcher–Powell (DFP), the so-called Broyden class, and many more; see [10, 18, 31].
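To make the implicit-product idea concrete, the inverse product \(\textbf{B}_k^{-1}\textbf{g}\) can be evaluated directly from the stored pairs via the well-known two-loop recursion. The following NumPy sketch (our own illustration with our own naming, using the common scaling \(\gamma _k=\textbf{s}^{\textsf{T}}\textbf{y}/\textbf{y}^{\textsf{T}}\textbf{y}\) for the initial matrix) shows that no \(n\times n\) matrix is ever formed:

```python
import numpy as np

def lbfgs_two_loop(g, pairs):
    """Evaluate H_k g (an approximation of B_k^{-1} g) using only the
    stored pairs (s_i, y_i), ordered from oldest to newest."""
    q = g.astype(float).copy()
    rhos = [1.0 / (y @ s) for s, y in pairs]
    alphas = []
    # First loop: newest pair to oldest.
    for (s, y), rho in zip(reversed(pairs), reversed(rhos)):
        alpha = rho * (s @ q)
        alphas.append(alpha)
        q -= alpha * y
    # Initial matrix H_0 = gamma * I with the usual scaling.
    s_last, y_last = pairs[-1]
    q *= (s_last @ y_last) / (y_last @ y_last)
    # Second loop: oldest pair to newest.
    for (s, y), rho, alpha in zip(pairs, rhos, reversed(alphas)):
        beta = rho * (y @ q)
        q += (alpha - beta) * s
    return q
```

With a single stored pair, the resulting operator satisfies the secant condition \(\textbf{H}\textbf{y}=\textbf{s}\) exactly (independently of the initial scaling), which provides a cheap sanity check.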

In today’s optimization landscape, L-BFGS is the de facto standard for smooth large-scale optimization. The method is usually combined with a line search technique to ensure global convergence [19]. There have also been efforts dedicated to making quasi-Newton methods compatible with the trust-region framework; see [2, 4, 12] for L-BFGS and [1] for L-SR1. This is facilitated by the fact that most quasi-Newton formulas admit a so-called compact representation of the form

$$\begin{aligned} \textbf{B}_k=\textbf{B}_{0,k}+\textbf{A}_k \textbf{Q}_k^{-1}\textbf{A}_k^{\textsf{T}}, \end{aligned}$$
(3)

where \(\textbf{B}_{0,k}\in \mathbb {R}^{n\times n}\), \(\textbf{A}_k\in \mathbb {R}^{n\times s},\textbf{Q}_k\in \mathbb {R}^{s\times s}\) and \(s\ll n\). (We put \(\textbf{Q}_k^{-1}\) instead of \(\textbf{Q}_k\) in the above equation because this will be more convenient later on.) The initial matrix \(\textbf{B}_{0,k}\) is usually a multiple of the identity or some other diagonal matrix. Decompositions of the above form have been given by many authors [3, 6, 9], and they are immensely useful in optimization methods since they usually allow the computation of matrix operations involving \(\textbf{B}_k\) in the lower dimension s. In particular, they facilitate the efficient computation of quasi-Newton directions and the solution of trust-region subproblems; see the references above.
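For illustration, the product \(\textbf{B}_k\textbf{v}\) under a representation of the form (3) with \(\textbf{B}_{0,k}=\gamma \textbf{I}\) reduces to \(O(ns)\) work plus one \(s\times s\) solve. A minimal sketch (generic, with our own naming):

```python
import numpy as np

def apply_compact(v, gamma, A, Q):
    """Matrix-vector product B v for B = gamma*I + A Q^{-1} A^T.
    Only the n x s matrix A is stored; the solve is in dimension s."""
    # A.T @ v costs O(n s); the linear solve involves only an s x s matrix.
    return gamma * v + A @ np.linalg.solve(Q, A.T @ v)
```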

In this paper, we will pursue a different globalization technique which can be seen as a (less well-known) sibling of line search and trust-region methods, the so-called regularized Newton methods [15, 17, 29, 30, 32]. These are generally characterized by regularized quasi-Newton equations of the form

$$\begin{aligned} (\textbf{B}_k+\mu _k \textbf{I}) \textbf{d}_k = -\nabla f(\textbf{x}_k), \end{aligned}$$

where \(\mu _k\ge 0\) is called the regularization parameter. The attractive feature of these methods is that they combine some of the respective benefits of line search and trust-region methods, and moreover they are highly compatible with compact representations of quasi-Newton matrices. We will therefore present an algorithmic framework designed to efficiently combine limited memory and regularization techniques, with the following benefits:

  • The step computation is almost as cheap as for line search L-BFGS algorithms. More specifically, the cost of each successful iteration (in the m-step BFGS case) is 4mn plus the solution of a \(2m \times 2m\) symmetric linear system. In particular, no inner loop is necessary for the computation of eigenvalue decompositions or trust-region solutions.

  • At the same time, the step quality is close to that of trust-region type limited memory algorithms because the regularization parameter \(\mu _k\) mimics the Lagrange multiplier arising in trust-region subproblems. The method can therefore be considered as a kind of “implicit” trust-region algorithm.

  • As a result of the above, the proportion of accepted steps is very high, leading to a relatively low number of function and gradient evaluations (on a level with trust-region type methods) while at the same time preserving the “cheap” steps of line search methods.

The use of regularization techniques has another important benefit over line search methods. In the line search setting, many authors advocate trying the “full” step size \(t_k=1\) first, the motivation being that L-BFGS and similar methods are fundamentally algorithms of Newton type and the full step size may lead to fast convergence. However, the step size also serves the purpose of adapting the algorithm to the nonlinearity of the problem, and re-initializing the line search procedure with \(t_k=1\) at each step makes it hard to carry this information over from one step to the next. In contrast, the regularization approach that we advocate here provides a more seamless transition between the full (quasi-)Newton step and a truncated version thereof (similar to trust-region methods), which suggests that algorithms of this type may be able to handle nonlinear or nonconvex problems more effectively.

The idea of combining limited memory and regularization techniques is not entirely new. Multiple authors [15, 26, 28] have advocated modifying the secant equation in quasi-Newton methods to instead approximate the sum \(\nabla ^2 f(\textbf{x}_k)+\mu _k \textbf{I}\). However, none of these methods fully exploit the quasi-Newton approximation of the Hessian and the compact representation (3). The method we present takes full advantage of these tools.

In addition to the algorithm, the paper also contains a general convergence result for regularized Newton methods which, to the authors’ knowledge, does not exist in this generality in the literature. In particular, the convergence result does not assume any specific quasi-Newton formula and allows for \(\textbf{B}_k + \mu _k \textbf{I}\) to be indefinite. This may be of interest to researchers in the field and provide a basis for future research on related methods.

This paper is organized as follows. Section 2 contains a detailed description of a general class of regularized quasi-Newton methods. Global convergence results for this class of methods are presented in Sect. 3 under fairly mild assumptions. In Sect. 4, we show how compact representations of limited memory quasi-Newton methods can be exploited to create efficient implementations of the algorithm. We also give a compact representation of the PSB formula that appears to be new. The numerical experiments in Sect. 5 indicate that the new technique is competitive with other attempts at regularizing L-BFGS [28] as well as line search and trust-region based L-BFGS methods [2, 19]. We close with some final remarks in Sect. 6.

1.1 Notation

Matrices and vectors will be denoted by boldface letters \(\textbf{M}\) and \(\textbf{v}\), respectively. Given a matrix \(\textbf{M}\in \mathbb {R}^{s\times s}\), we write \(\textbf{L}(\textbf{M})\), \(\textbf{D}(\textbf{M})\), and \(\textbf{U}(\textbf{M})\) for the strictly lower, diagonal, and strictly upper parts of \(\textbf{M}\), respectively. In particular, it always holds that

$$\begin{aligned} \textbf{M}= \textbf{L}(\textbf{M}) + \textbf{D}(\textbf{M}) + \textbf{U}(\textbf{M}). \end{aligned}$$

The gradient of the smooth function f evaluated at an iterate \( \textbf{x}_k \) will often be denoted by \( \textbf{g}_k \). We denote sequences by \(\{s_k\}\) and write \(\{s_k\}_{k\in \mathcal {S}}\) for the subsequence induced by an infinite index set \(\mathcal {S} = \{k_1, k_2, \ldots \}\subseteq \mathbb {N}\) with \(k_i < k_{i+1}\) for all i. Similarly, \(s_k \rightarrow _{\mathcal {S}} s\) means that \(\{s_k\}_{k\in \mathcal {S}}\) converges to s.

2 Regularized quasi-Newton methods

As discussed in the introduction, the fundamental principle underlying the methods in this paper is that of regularized Newton and quasi-Newton methods, which are generally characterized by regularized quasi-Newton equations of the form

$$\begin{aligned} (\textbf{B}_k+\mu _k \textbf{I}) \textbf{d}_k = -\nabla f(\textbf{x}_k), \end{aligned}$$
(4)

where \(\textbf{B}_k\) is either the Hessian \(\nabla ^2 f(\textbf{x}_k)\) or an approximation thereof, and \(\mu _k\ge 0\) is the regularization parameter. Clearly, if \(\mu _k=0\), then (4) reduces to the standard quasi-Newton equation \(\textbf{B}_k \textbf{d}_k = -\nabla f(\textbf{x}_k)\). On the other hand, if \(\mu _k\gg 0\) is large, then the matrix \(\textbf{B}_k+\mu _k \textbf{I}\) will be invertible, and the step \(\textbf{d}_k\) produced by (4) will essentially be the negative gradient direction (up to normalization; see Lemma 1).
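The two regimes can be observed numerically. The following sketch (with synthetic data of our own choosing) computes the regularized step for \(\mu _k=0\), recovering the plain quasi-Newton step, and for a very large \(\mu _k\), where the direction nearly coincides with the negative gradient:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20
M = rng.standard_normal((n, n))
B = M + M.T                      # symmetric, possibly indefinite
g = rng.standard_normal(n)

def reg_step(mu):
    """Solve (B + mu*I) d = -g for the regularized step d."""
    return np.linalg.solve(B + mu * np.eye(n), -g)

d_newton = reg_step(0.0)         # plain quasi-Newton step (B invertible here)
d_large = reg_step(1e8)          # essentially the scaled negative gradient
cos = (d_large @ -g) / (np.linalg.norm(d_large) * np.linalg.norm(g))
```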

2.1 Mathematical motivation

The virtues of the regularization approach can be understood by recognizing that this essentially amounts to minimizing the regularized quadratic model

$$\begin{aligned} {\hat{q}}_k(\textbf{d}):= f(\textbf{x}_k)+\textbf{g}_k^{\textsf{T}}\textbf{d}+ \frac{1}{2}\textbf{d}^{\textsf{T}}\textbf{B}_k\textbf{d}+\frac{\mu _k}{2}\Vert \textbf{d}\Vert ^2, \end{aligned}$$
(5)

which differs from the conventional Newton model by Tikhonov regularization. Thus, a positive value of \(\mu _k\) may dampen the impact of negative eigenvalues of \(\textbf{B}_k\) on the search direction, prevent excessively long steps in negative curvature directions, and possibly guarantee that the model (5) admits a unique minimizer (i.e., that the matrix \(\textbf{B}_k+\mu _k\textbf{I}\) is positive definite). The anticipated setting is that \(\mu _k\) will initially be kept sufficiently large to guarantee global convergence, eventually decreasing rapidly enough so as to not impede fast local convergence.

A more rigorous interpretation is given by trust-region methods. Indeed, if \(\textbf{d}_k:=-(\textbf{B}_k+\mu _k\textbf{I})^{-1} \textbf{g}_k\) for some \(\mu _k\ge 0\), and if \(\Delta :=\Vert \textbf{d}_k\Vert \), then \(\textbf{d}_k\) is a stationary point of the trust-region subproblem

$$\begin{aligned} \mathop {\textrm{minimize}}\limits _{\Vert \textbf{d}\Vert \le \Delta }\, q_k(\textbf{d}), \end{aligned}$$

where

$$\begin{aligned} q_k(\textbf{d}):=f(\textbf{x}_k)+\textbf{g}_k^{\textsf{T}}\textbf{d}+\frac{1}{2}\textbf{d}^{\textsf{T}}\textbf{B}_k\textbf{d} \end{aligned}$$
(6)

is the standard quadratic approximation of f around \(\textbf{x}_k\). If \(\textbf{B}_k+\mu _k \textbf{I}\) is positive definite, then \(\textbf{d}_k\) is in fact a solution of this auxiliary problem. It follows that regularized Newton methods can be interpreted as “implicit” trust-region methods whereby the regularization parameter is controlled instead of the trust-region radius.

Finally, it is also interesting to analyze how the regularization technique affects the conditioning of the quadratic model (5). Assuming for the moment that \(\textbf{B}_k\) is positive definite (as it is, e.g., in BFGS-type methods), the regularization parameter improves the condition number of the underlying matrix in the sense that

$$\begin{aligned} \kappa (\textbf{B}_k+\mu _k \textbf{I})=\frac{\lambda _{\max }(\textbf{B}_k)+\mu _k}{\lambda _{\min }(\textbf{B}_k)+\mu _k} \le \frac{\lambda _{\max }(\textbf{B}_k)}{\lambda _{\min }(\textbf{B}_k)}=\kappa (\textbf{B}_k), \end{aligned}$$

where \(\lambda _{\max }(\textbf{B}_k),\lambda _{\min }(\textbf{B}_k)>0\) are the largest and smallest eigenvalues of \(\textbf{B}_k\), respectively.
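This monotone improvement of the condition number is easy to verify numerically; the helper below (our own sketch) computes \(\kappa (\textbf{B}+\mu \textbf{I})\) from the extreme eigenvalues of a symmetric positive definite \(\textbf{B}\):

```python
import numpy as np

def cond_regularized(B, mu=0.0):
    """Spectral condition number of B + mu*I for symmetric positive
    definite B, computed from the extreme eigenvalues of B."""
    w = np.linalg.eigvalsh(B)    # eigenvalues in ascending order
    return (w[-1] + mu) / (w[0] + mu)
```

Since \((\lambda _{\max }+\mu )/(\lambda _{\min }+\mu )\) is decreasing in \(\mu \), larger regularization parameters always yield better-conditioned systems in this setting.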

2.2 Basic algorithm

To control the regularization parameter \(\mu _k\), we consider the quadratic approximation \(q_k\) of f from (6) and borrow some terminology from trust-region algorithms. Given a candidate step \(\textbf{d}_k=-(\textbf{B}_k+\mu _k\textbf{I})^{-1}\textbf{g}_k\), define the predicted reduction of f as

$$\begin{aligned} \text {pred}_k:=f(\textbf{x}_k)-q_k(\textbf{d}_k) = -\textbf{g}_k^{\textsf{T}}\textbf{d}_k - \frac{1}{2}\textbf{d}_k^{\textsf{T}}\textbf{B}_k \textbf{d}_k = \frac{\mu _k}{2}\Vert \textbf{d}_k\Vert ^2-\frac{1}{2}\textbf{g}_k^{\textsf{T}}\textbf{d}_k, \end{aligned}$$
(7)

where the last equality uses the definition of \(\textbf{d}_k\). (Note that, in particular, the matrix \(\textbf{B}_k\) need not be available for the computation of \(\text {pred}_k\).) This quantity will be compared to the actual or achieved reduction in step k,

$$\begin{aligned} \text {ared}_k:= f(\textbf{x}_k)-f(\textbf{x}_k+\textbf{d}_k). \end{aligned}$$
(8)

Similar to trust-region methods [8], we use the ratio between these quantities to control the regularization parameter. To this end, we distinguish three cases: unsuccessful (u), successful (s), and highly successful (h) steps. Special care is also needed because there is no a priori guarantee that \(\text {pred}_k\) is positive (since \(\textbf{B}_k\) may be indefinite); such steps are treated in the same manner as unsuccessful ones.
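In code, the two control quantities take only a few lines; note that the \(\textbf{B}_k\)-free form of (7) is what makes the predicted reduction cheap to evaluate (a sketch with our own naming):

```python
import numpy as np

def pred_reduction(mu, g, d):
    """Predicted reduction f(x_k) - q_k(d_k) via the last expression in
    (7); B_k is not needed once d solves (B_k + mu*I) d = -g."""
    return 0.5 * mu * (d @ d) - 0.5 * (g @ d)

def ared_reduction(f, x, d):
    """Actual (achieved) reduction (8)."""
    return f(x) - f(x + d)
```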

Algorithm 1 (Regularized quasi-Newton method)

Choose \(\textbf{x}_0\in \mathbb {R}^n\) and parameters \(\mu _0>0\); \(p_{\min },c_1\in (0,1)\); \(c_2\in (c_1,1)\); \(\sigma _1\in (0,1)\); \(\sigma _2>1\).

Step 1. If a suitable stopping criterion is satisfied, terminate.

Step 2. (Step computation) Choose \(\textbf{B}_k\in \mathbb {R}^{n\times n}\) and attempt to solve the regularized quasi-Newton equation

$$\begin{aligned} (\textbf{B}_k+\mu _k \textbf{I}) \textbf{d}_k=-\nabla f(\textbf{x}_k). \end{aligned}$$
(9)

If this equation admits no solution \(\textbf{d}_k\), or if \(\text {pred}_k \le p_{\min } \Vert \textbf{g}_k\Vert \Vert \textbf{d}_k\Vert \), set \(\textbf{x}_{k+1}:=\textbf{x}_k\), \(\mu _{k+1}:=\sigma _2\mu _k\), and go to Step 4. Otherwise, go to Step 3.

Step 3. (Variable update) Set \(\varrho _k:=\text {ared}_k/\text {pred}_k\) and perform one of the following steps:

Step 3u (\(\varrho _k\le c_1\)). Set \(\textbf{x}_{k+1}:=\textbf{x}_k\) and \(\mu _{k+1}:=\sigma _2 \mu _k\).

Step 3s (\(c_1<\varrho _k\le c_2\)). Set \(\textbf{x}_{k+1}:=\textbf{x}_k+\textbf{d}_k\) and \(\mu _{k+1}:=\mu _k\).

Step 3h (\(c_2<\varrho _k\)). Set \(\textbf{x}_{k+1}:=\textbf{x}_k+\textbf{d}_k\) and \(\mu _{k+1}:=\sigma _1\mu _k\).

Step 4. Set \(k\leftarrow k+1\) and go to Step 1.

The condition \(\text {pred}_k > p_{\min } \Vert \textbf{g}_k\Vert \Vert \textbf{d}_k\Vert \) in Step 2 is a sufficient descent criterion similar to the angle condition in line search methods or the Cauchy condition in trust-region methods. The quantity \(p_{\min } \Vert \textbf{g}_k\Vert \Vert \textbf{d}_k\Vert \) is the minimal predicted reduction in objective value (relative to \(\textbf{g}_k\) and \(\textbf{d}_k\)) that must be attained for a step to be attempted.

As hinted above, in what follows, we will refer to a step as unsuccessful if it passes through Step 3u or skips Step 3 because of the checks in Step 2. (In particular, \(\textbf{d}_k\) may not be defined in an unsuccessful step.)

The parameters \(c_1,c_2,\sigma _1, \sigma _2\) are used to classify steps and adjust the regularization accordingly (increase if the step was unsuccessful, decrease if the step was highly successful).
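Taken together, Algorithm 1 fits in a short routine. The sketch below is our own minimal implementation, with illustrative parameter values (not prescribed by the algorithm) and, for simplicity, \(\textbf{B}_k=\nabla ^2 f(\textbf{x}_k)\) rather than a limited memory quasi-Newton matrix; it follows Steps 1–4 literally:

```python
import numpy as np

def regularized_newton(f, grad, hess, x0, mu0=1.0, p_min=1e-4,
                       c1=0.1, c2=0.75, sigma1=0.5, sigma2=4.0,
                       tol=1e-8, max_iter=1000):
    """Sketch of Algorithm 1 with B_k chosen as the exact Hessian;
    the parameter values are illustrative only."""
    x, mu = np.asarray(x0, dtype=float), mu0
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:                    # Step 1
            break
        B = hess(x)                                    # Step 2
        try:
            d = np.linalg.solve(B + mu * np.eye(x.size), -g)
        except np.linalg.LinAlgError:
            mu *= sigma2                               # no solution: enlarge mu
            continue
        pred = 0.5 * mu * (d @ d) - 0.5 * (g @ d)      # B-free form (7)
        if pred <= p_min * np.linalg.norm(g) * np.linalg.norm(d):
            mu *= sigma2                               # insufficient descent
            continue
        rho = (f(x) - f(x + d)) / pred                 # Step 3
        if rho <= c1:                                  # Step 3u
            mu *= sigma2
        elif rho <= c2:                                # Step 3s
            x = x + d
        else:                                          # Step 3h
            x = x + d
            mu *= sigma1
    return x
```

On a standard nonconvex test function such as the Rosenbrock function, this sketch converges to the minimizer from the usual starting point without any line search.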

Algorithm 1 is closely related to trust-region methods. The main difference between trust-region methods and our regularization framework lies in the update of the parameter \( \mu _k \). The former compute \( \mu _k \) indirectly (via a trust-region radius), whereas here we update the regularization parameter directly. While the indirect update follows a well-understood and well-motivated philosophy, carrying it out can be computationally costly. We therefore expect superior behavior of the direct update, in particular for large-scale problems.

The report [28] presents a method which is formally almost identical to Algorithm 1 (except for a slightly different update of the regularization parameter). The main difference is that [28] focuses on the matrices \( \textbf{B}_k \) being updated by a limited memory BFGS scheme (without using compact representations, as we shall do in Sect. 4). The convergence theory in [28] assumes a bounded level set condition; this is not required in our subsequent analysis, which is substantially more general since we only assume boundedness of \(\{\textbf{B}_k\}\) (allowing for other quasi-Newton formulas or indefiniteness) and boundedness of the objective from below (the exponential function, for example, is bounded from below even though all of its level sets are unbounded).

3 General convergence analysis

As we shall see, Algorithm 1 provides a powerful framework for the application of quasi-Newton type updates. Before turning to this discussion (which is the main motivation for this paper), we shall dedicate the present section to a simple convergence analysis. Due to the non-specificity of the algorithm in its general form, it will be convenient to carry out the convergence analysis under rather general assumptions. To this end, we shall make no assumption on the particular choice of the matrices \(\textbf{B}_k\), which may or may not be approximations of the Hessian \(\nabla ^2 f(\textbf{x}_k)\). The only assumption we make throughout this section is the following.

Assumption 1

(Boundedness) \(\{\textbf{B}_k\}\subseteq \mathbb {R}^{n\times n}\) is a bounded sequence.

Most practically relevant quasi-Newton schemes should have no issues satisfying the above assumption, especially when the gradient \(\nabla f\) is Lipschitz continuous on an appropriate level set. Indeed, many of these techniques yield Hessian approximations which satisfy additional properties such as symmetry (which we omitted because it is unnecessary for the theory below) or positive definiteness.

Lemma 1

(Gradient approximation) Let Assumption 1 hold, and let \(\mu _k\rightarrow \infty \). Then \(\textbf{B}_k+\mu _k\textbf{I}\) is invertible for sufficiently large \(k\in \mathbb {N}\), and

$$\begin{aligned} \lim _{k\rightarrow \infty } \frac{(\textbf{B}_k+\mu _k\textbf{I})^{-1}\textbf{z}}{\Vert (\textbf{B}_k+\mu _k\textbf{I})^{-1}\textbf{z}\Vert }=\frac{\textbf{z}}{\Vert \textbf{z}\Vert } \quad \text {for all }\textbf{z}\in \mathbb {R}^n{\setminus }\{0\}. \end{aligned}$$

The above result makes precise the intuitive relationship mentioned in Sect. 2: if the regularization parameter is sufficiently large, then the regularized Newton equation (9) admits a unique solution, and the resulting vector approximates the negative gradient direction as \(\mu _k\rightarrow \infty \).

Another consequence of Lemma 1 is that the method performs infinitely many successful steps. This follows from the fact that \(\textbf{d}_k\) becomes ever smaller and approaches the (local) steepest descent direction when \(\mu _k\rightarrow \infty \), thus leading to a local descent step which satisfies the sufficient decrease condition from Step 2 of the algorithm.

Lemma 2

(Well-definedness) Let Assumption 1 hold, and assume that \(\textbf{g}_k\ne 0\) for all k. Then Algorithm 1 performs infinitely many successful or highly successful steps.

Proof

Assume for the sake of contradiction that there exists \(k_0\in \mathbb {N}\) such that all steps \(k\ge k_0\) are unsuccessful. In particular, this implies \(\mu _k\rightarrow \infty \) as \(k\rightarrow \infty \) and \(\textbf{x}_k=\textbf{x}_{k_0}\) for all \(k\ge k_0\). Since \(\{\textbf{B}_k\}\) is a bounded sequence, it follows from Lemma 1 that \(\textbf{B}_k+\mu _k \textbf{I}\) is invertible for sufficiently large k, that \(\textbf{d}_k\rightarrow 0\), and \(\textbf{d}_k/\Vert \textbf{d}_k\Vert \rightarrow - \textbf{g}_{k_0}/\Vert \textbf{g}_{k_0}\Vert \). Moreover, the regularized Newton equation (9) implies that \(\mu _k \Vert \textbf{d}_k\Vert \rightarrow \Vert \textbf{g}_{k_0}\Vert \). It is easy to deduce from these limit relations that

$$\begin{aligned} \text {pred}_k=\frac{\mu _k}{2}\Vert \textbf{d}_k\Vert ^2-\frac{1}{2}\textbf{g}_k^{\textsf{T}}\textbf{d}_k > p_{\min } \Vert \textbf{g}_k\Vert \Vert \textbf{d}_k\Vert \quad \text {for sufficiently large }k \end{aligned}$$

(simply divide this inequality by \( \Vert \textbf{d}_k \Vert \) and recall that \( p_{\min } \in (0,1) \)). Hence, the algorithm must eventually perform only Step 3u, which means that \(\text {ared}_k\le c_1 \text {pred}_k\) for all \(k\ge k_0\) sufficiently large. It then follows that

$$\begin{aligned} \begin{aligned} f(\textbf{x}_{k_0}+\textbf{d}_k)-f(\textbf{x}_{k_0}) = -\text {ared}_k&\ge - c_1 \text {pred}_k \\&= \frac{c_1}{2}\textbf{g}_{k_0}^{\textsf{T}}\textbf{d}_k-\frac{c_1 \mu _k}{2}\Vert \textbf{d}_k\Vert ^2 \quad \text {for }k\ge k_0. \end{aligned} \end{aligned}$$
(10)

We now divide both sides of this inequality by \(t_k:=\Vert \textbf{d}_k\Vert \). Recalling that \(\textbf{d}_k/\Vert \textbf{d}_k\Vert \rightarrow -\textbf{g}_{k_0}/\Vert \textbf{g}_{k_0}\Vert \), it follows that the left-hand side becomes

$$\begin{aligned} \frac{f\left( \textbf{x}_{k_0}+t_k \frac{\textbf{d}_k}{\Vert \textbf{d}_k\Vert } \right) -f(\textbf{x}_{k_0})}{t_k} \rightarrow \nabla f(\textbf{x}_{k_0})^{\textsf{T}}\frac{-\textbf{g}_{k_0}}{\Vert \textbf{g}_{k_0}\Vert }=-\Vert \textbf{g}_{k_0}\Vert . \end{aligned}$$
(11)

Conversely, recalling that \(\mu _k \Vert \textbf{d}_k\Vert \rightarrow \Vert \textbf{g}_{k_0}\Vert \), the right-hand side of (10) divided by \(t_k\) satisfies

$$\begin{aligned} \frac{c_1}{2}\textbf{g}_{k_0}^{\textsf{T}}\frac{\textbf{d}_k}{\Vert \textbf{d}_k\Vert }-\frac{c_1 \mu _k}{2}\Vert \textbf{d}_k\Vert \rightarrow \frac{c_1}{2} \textbf{g}_{k_0}^{\textsf{T}}\frac{-\textbf{g}_{k_0}}{\Vert \textbf{g}_{k_0}\Vert }-\frac{c_1}{2} \Vert \textbf{g}_{k_0}\Vert =-c_1 \Vert \textbf{g}_{k_0}\Vert . \end{aligned}$$
(12)

Since \(c_1\in (0,1)\), it then follows from (11), (12) that \(\Vert \textbf{g}_{k_0}\Vert =0\), a contradiction. \(\square \)

The following result builds upon the well-definedness of the algorithm and shows that it achieves asymptotic stationarity.

Theorem 1

(Global convergence I) Let Assumption 1 hold, let f be bounded from below, and let \(\{\textbf{x}_k\}\) be generated by Algorithm 1. Then

$$\begin{aligned} \liminf _{k\rightarrow \infty }\Vert \textbf{g}_k\Vert = 0. \end{aligned}$$

In particular, given any \(\varepsilon >0\), the algorithm terminates with \(\Vert \textbf{g}_k\Vert <\varepsilon \) after finitely many iterations.

Proof

Let \(\mathcal {S}\subseteq \mathbb {N}\) be the set of indices of successful or highly successful steps. Note that \(|\mathcal {S}|=\infty \) by Lemma 2. Assume for the sake of contradiction that

$$\begin{aligned} \liminf _{k\rightarrow \infty }\Vert \textbf{g}_k\Vert >0. \end{aligned}$$
(13)

Since every step \(k\in \mathcal {S}\) is successful, we have by definition that

$$\begin{aligned} f(\textbf{x}_k)-f(\textbf{x}_{k+1})\ge c_1 \text {pred}_k \ge p_{\min } c_1 \Vert \textbf{g}_k\Vert \Vert \textbf{d}_k\Vert \quad \text {for every }k\in \mathcal {S}. \end{aligned}$$

By (13), there exist \(k_0\in \mathbb {N}\) and \(\varepsilon >0\) such that \(\Vert \textbf{g}_k\Vert \ge \varepsilon \) for all \(k\ge k_0\). Using the fact that f is bounded from below, we obtain

$$\begin{aligned} \begin{aligned} \infty > \sum _{k\in \mathbb {N}} \big [ f(\textbf{x}_k)-f(\textbf{x}_{k+1}) \big ]&= \sum _{k\in \mathcal {S}} \big [ f(\textbf{x}_k)-f(\textbf{x}_{k+1}) \big ] \\&\ge p_{\min } c_1 \varepsilon \sum _{k\in \mathcal {S},\,k\ge k_0} \Vert \textbf{d}_k\Vert \end{aligned} \end{aligned}$$
(14)

and, in particular, \(\textbf{d}_k\rightarrow _{\mathcal {S}}0\). Since every step \(k\in \mathcal {S}\) is successful, we have \((\textbf{B}_k+\mu _k \textbf{I})\textbf{d}_k = -\textbf{g}_k\) for all \(k\in \mathcal {S}\). This implies that \(\{\mu _k\}_{k\in \mathcal {S}}\) cannot have a bounded subsequence [since this together with \(\textbf{d}_k\rightarrow _{\mathcal {S}}0\) would violate (13)]. Hence, \(\mu _k\rightarrow _{\mathcal {S}}+\infty \). In particular, the algorithm also performs infinitely many unsuccessful steps (i.e., \(|\mathbb {N}{\setminus }\mathcal {S}|=\infty \)), and \(\mu _k\rightarrow +\infty \) since \(\mu _k\) cannot decrease during unsuccessful iterations.

Now, since \(\mathcal {S}\) and \(\mathbb {N}{\setminus }\mathcal {S}\) are infinite, we may choose an infinite set \(\mathcal {S}'\subseteq \mathcal {S}\) such that \(k-1\in \mathbb {N}{\setminus }\mathcal {S}\) whenever \(k\in \mathcal {S}'\). Since \(\textbf{x}_k\) is not updated in unsuccessful steps, it follows from (14) that

$$\begin{aligned} \begin{aligned} \infty > p_{\min } c_1 \varepsilon \sum _{k\in \mathcal {S},\,k\ge k_0} \Vert \textbf{d}_k\Vert&= p_{\min } c_1 \varepsilon \sum _{k\in \mathcal {S},\,k\ge k_0} \Vert \textbf{x}_{k+1} - \textbf{x}_k \Vert \\&= p_{\min } c_1 \varepsilon \sum _{k\ge k_0} \Vert \textbf{x}_{k+1} - \textbf{x}_k \Vert . \end{aligned} \end{aligned}$$

Hence \(\{\textbf{x}_k\}_{k\in \mathbb {N}}\) is a Cauchy sequence, and thus convergent. Let \(\bar{\textbf{x}}\) denote its limit point. In particular, we then obtain \(\textbf{x}_{k-1}\rightarrow _{\mathcal {S}'}\bar{\textbf{x}}\); thus, using \(\mu _k\rightarrow +\infty \) and arguing as in the proof of Lemma 2, it follows that the steps \(k-1\), \(k\in \mathcal {S}'\), must be successful for sufficiently large \(k\in \mathcal {S}'\). This is a contradiction. \(\square \)

Note that the counterpart of Theorem 1 also holds for trust-region methods under the same set of assumptions. Moreover, the technique of proof used here is related to the corresponding one known for trust-region methods. Nevertheless, we stress that one has to be careful in translating the standard trust-region proof to our regularization framework since well-known properties of the solution of the trust-region subproblem may not hold in our case.

Similar to the theory of trust-region methods, we can use Theorem 1 to obtain a stronger convergence result under an additional assumption.

Theorem 2

(Global convergence II) Let Assumption 1 hold, let f be bounded from below, and let \(\{\textbf{x}_k\}\) be generated by Algorithm 1. Suppose that \( \nabla f \) is uniformly continuous on a set \( X \subseteq \mathbb {R}^n \) satisfying \( \{\textbf{x}_k\} \subseteq X \). Then \(\lim _{k\rightarrow \infty }\Vert \textbf{g}_k\Vert = 0\); in particular, every accumulation point of \(\{\textbf{x}_k\}\) is a stationary point of f.

Proof

Assume there exists \( \delta > 0 \) and a subsequence \( \{ \textbf{x}_k \}_{k\in K} \) such that

$$\begin{aligned} \Vert \textbf{g}_k \Vert \ge 2 \delta \quad \text {for all } k \in K. \end{aligned}$$

Since \( \liminf _{k\rightarrow \infty }\Vert \textbf{g}_k\Vert = 0 \) by Theorem 1, we can find, for each \( k \in K \), an index \( \ell (k) > k \) such that

$$\begin{aligned} \Vert \textbf{g}_l \Vert \ge \delta \quad \text {for all } k \le l< \ell (k), \qquad \text {and} \qquad \Vert \textbf{g}_{\ell (k)} \Vert < \delta , \quad k \in K. \end{aligned}$$

For an arbitrary \( k \in K \) and a successful or highly successful iteration l with \( k \le l < \ell (k) \), we obtain

$$\begin{aligned} f(\textbf{x}_l) - f(\textbf{x}_{l+1}) \ge c_1 \text {pred}_l \ge p_{\min } c_1 \Vert \textbf{g}_l \Vert \Vert \textbf{d}_l \Vert \ge p_{\min } c_1 \delta \Vert \textbf{x}_{l+1} - \textbf{x}_l \Vert . \end{aligned}$$

The same inequality holds for l being unsuccessful simply because \( \textbf{x}_{l+1} = \textbf{x}_l \) in this case. This implies

$$\begin{aligned} p_{\min } c_1 \delta \Vert \textbf{x}_{\ell (k)} - \textbf{x}_k \Vert \le p_{\min } c_1 \delta \sum _{l= k}^{\ell (k)-1} \Vert \textbf{x}_{l+1} - \textbf{x}_l \Vert&\le \sum _{l= k}^{\ell (k)-1} \big ( f(\textbf{x}_l) - f(\textbf{x}_{l+1}) \big ) \\&= f(\textbf{x}_k) - f(\textbf{x}_{\ell (k)}) \end{aligned}$$

for all \( k \in K \). Since f is bounded from below and \( \{ f(\textbf{x}_k) \} \) is monotonically decreasing, we obtain \( f(\textbf{x}_k) - f(\textbf{x}_{\ell (k)}) \rightarrow 0 \) as \( k \rightarrow \infty \). This implies \( \Vert \textbf{x}_{\ell (k)} - \textbf{x}_k \Vert \rightarrow _K 0 \). The uniform continuity of \( \nabla f \) on the set X therefore yields

$$\begin{aligned} \Vert \nabla f ( \textbf{x}_{\ell (k)} ) - \nabla f( \textbf{x}_k ) \Vert \rightarrow _K 0. \end{aligned}$$

On the other hand, the choice of the index \( \ell (k) \) implies

$$\begin{aligned} \Vert \nabla f ( \textbf{x}_{\ell (k)} ) - \nabla f( \textbf{x}_k ) \Vert \ge \Vert \nabla f( \textbf{x}_k ) \Vert - \Vert \nabla f ( \textbf{x}_{\ell (k)} ) \Vert \ge 2 \delta - \delta = \delta . \end{aligned}$$

This contradiction completes the proof. \(\square \)

We close this section by noting that regularization techniques like in Algorithm 1 are sometimes used in order to prove local fast convergence properties for Newton-type methods. This corresponds to the choice \( \textbf{B}_k:= \nabla ^2 f(\textbf{x}_k) \) as the exact Hessian. Using a more refined update of the regularization parameter, assuming a local error bound condition and the Hessian of f to be locally Lipschitz continuous, it is possible to verify local quadratic convergence for convex objective functions, cf. [17, 29, 30]. Since our focus is on large-scale problems, our subsequent analysis concentrates on \( \textbf{B}_k \) being computed by limited memory quasi-Newton matrices.

4 Regularized quasi-Newton matrices

This section provides the details of limited memory type implementations of quasi-Newton methods. Some of the material below can be applied with minimal modifications to full memory quasi-Newton methods, but we forgo these investigations due to our focus on large-scale optimization.

In keeping with conventional limited memory notation, we assume an algorithmic framework where the last m variable steps \(\textbf{s}_i:=\textbf{x}_{i+1}-\textbf{x}_i\) are tracked together with the corresponding gradient differences \(\textbf{y}_i:=\textbf{g}_{i+1}-\textbf{g}_i\), where we recall that \(\textbf{g}_i=\nabla f(\textbf{x}_i)\). For convenience of notation, we aggregate these in the matrices

$$\begin{aligned} \textbf{S}_k:= [\textbf{s}_{k-m} \,\cdots \, \textbf{s}_{k-1}] \in \mathbb {R}^{n\times m} \quad \text {and}\quad \textbf{Y}_k:= [\textbf{y}_{k-m} \,\cdots \, \textbf{y}_{k-1}]\in \mathbb {R}^{n\times m}. \end{aligned}$$

If fewer than m previous iterates are available, that is, if \(k<m\), we set

$$\begin{aligned} \textbf{S}_k:= [\textbf{s}_0 \,\cdots \, \textbf{s}_{k-1}]\in \mathbb {R}^{n\times k} \quad \text {and}\quad \textbf{Y}_k:= [\textbf{y}_0 \,\cdots \, \textbf{y}_{k-1}]\in \mathbb {R}^{n\times k}. \end{aligned}$$

These definitions may seem like a mere matter of notation, but there are actually quite pragmatic arguments why \(\textbf{S}\) and \(\textbf{Y}\) should be treated as matrices instead of collections of vectors. Many limited memory operations can be formulated as loops over the recurring index \(i=1,\ldots ,m\), and the matrix notation sometimes allows us to formulate the underlying calculations as matrix–vector operations (instead of a sequence of vector–vector operations). This approach should be used whenever possible in practical implementations because it leverages the power of low-level BLAS (basic linear algebra subprograms) and parallelism, providing a significant increase in computational efficiency.
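A minimal sketch of this matrix-based bookkeeping (our own naming): the new pair \((\textbf{s}_k,\textbf{y}_k)\) is appended as the last column of \(\textbf{S}\) and \(\textbf{Y}\), the oldest columns are dropped once m pairs are stored, and products such as \(\textbf{S}^{\textsf{T}}\textbf{v}\) then become single BLAS-2 calls instead of m separate dot products:

```python
import numpy as np

def update_memory(S, Y, s, y, m):
    """Append (s, y) as the new last columns of S and Y, keeping at
    most the m most recent pairs (oldest columns are discarded)."""
    S = np.column_stack([S, s])
    Y = np.column_stack([Y, y])
    if S.shape[1] > m:
        S, Y = S[:, -m:], Y[:, -m:]
    return S, Y

# A product like S^T v is then one matrix-vector (BLAS-2) operation:
#   STv = S.T @ v
```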

Remark 1

(Rejected quasi-Newton updates) For simplicity and to avoid notational overhead, we assume that the algorithm “accepts” the data pair \((\textbf{s}_k,\textbf{y}_k)\) in every successful iteration. This is not the case for some quasi-Newton schemes, especially for nonconvex objective functions. Quasi-Newton updates are typically accepted or rejected using a so-called cautious updating scheme (see Sect. 5); when a pair \((\textbf{s}_k, \textbf{y}_k)\) is rejected, the matrices \(\textbf{S}_k,\textbf{Y}_k\) of previous steps simply remain as they were.

Most limited memory quasi-Newton methods implicitly generate a so-called compact representation of the form

$$\begin{aligned} \textbf{B}_k=\textbf{B}_{0,k}+\textbf{A}_k \textbf{Q}_{k}^{-1} \textbf{A}_k^{\textsf{T}}, \end{aligned}$$
(15)

where \(\textbf{Q}_k\in \mathbb {R}^{s\times s}\) is a nonsingular symmetric matrix, \(\textbf{A}_k\in \mathbb {R}^{n\times s}\), and \(s\ll n\) is a constant depending on the particular quasi-Newton scheme. For instance, \(s=2m\) in limited memory BFGS methods, and \(s=m\) for limited memory SR1.

The above representation provides a very convenient framework for the regularization approach: given a parameter \(\mu \ge 0\) (e.g., one of the values \(\mu _k\) from Algorithm 1), the regularized Hessian approximation can be written as

$$\begin{aligned} \textbf{B}_k+\mu \textbf{I} =(\textbf{B}_{0,k}+\mu \textbf{I}) +\textbf{A}_k \textbf{Q}_k^{-1} \textbf{A}_k^{\textsf{T}}. \end{aligned}$$

This facilitates the application of low-rank update formulas to compute the regularized Newton step both explicitly and cheaply. To this end, let \(\hat{\textbf{B}}_k:=\textbf{B}_k+\mu \textbf{I}\) and \(\hat{\textbf{B}}_{0,k}:=\textbf{B}_{0,k}+\mu \textbf{I}\). Then the Sherman–Morrison–Woodbury formula implies that

$$\begin{aligned} \hat{\textbf{B}}_k^{-1}=\hat{\textbf{B}}_{0,k}^{-1}- \hat{\textbf{B}}_{0,k}^{-1} \textbf{A}_k (\textbf{Q}_k +\textbf{A}_k^{\textsf{T}}\hat{\textbf{B}}_{0,k}^{-1} \textbf{A}_k)^{-1} \textbf{A}_k^{\textsf{T}} \hat{\textbf{B}}_{0,k}^{-1}, \end{aligned}$$
(16)

provided that \(\hat{\textbf{B}}_{0,k}\) is nonsingular. Note that \(\hat{\textbf{B}}_{0,k}\) is usually a diagonal matrix whose inversion is trivial. Moreover, the inner matrix \(\textbf{Q}_k+\textbf{A}_k^{\textsf{T}} \hat{\textbf{B}}_{0,k}^{-1}\textbf{A}_k\) is of size \(s\times s\), so that its inversion can be carried out cheaply in relation to the dimension n. By the Woodbury matrix identity, the invertibility of this inner matrix is equivalent to that of \(\hat{\textbf{B}}_k\).

In the following, we shall mainly assume that the initial matrix \(\textbf{B}_{0,k}\) is chosen as a scalar multiple of the identity, \(\textbf{B}_{0,k}:=\gamma _k \textbf{I}\). Writing \({\hat{\gamma }}_k:=\gamma _k+\mu \), it then follows that

$$\begin{aligned} \hat{\textbf{B}}_k^{-1}={\hat{\gamma }}_k^{-1}\textbf{I} - {\hat{\gamma }}_k^{-2} \textbf{A}_k (\textbf{Q}_k+ {\hat{\gamma }}_k^{-1} \textbf{A}_k^{\textsf{T}} \textbf{A}_k)^{-1} \textbf{A}_k^{\textsf{T}}. \end{aligned}$$
(17)

The practical efficiency of quasi-Newton methods depends significantly on the memorization and reuse of previously computed quantities. To this end, observe that the quasi-Newton recurrence implies

$$\begin{aligned} \textbf{s}_k = -\hat{\textbf{B}}_k^{-1} \textbf{g}_k= -{\hat{\gamma }}_k^{-1}\textbf{g}_k+ {\hat{\gamma }}_k^{-2} \textbf{A}_k \textbf{p}_k, \end{aligned}$$
(18)

where

$$\begin{aligned} \textbf{p}_k:=(\textbf{Q}_k+{\hat{\gamma }}_k^{-1}\textbf{A}_k^{\textsf{T}}\textbf{A}_k)^{-1} \textbf{A}_k^{\textsf{T}} \textbf{g}_k. \end{aligned}$$
(19)

Thus, the main computational cost occurs in forming the product \(\textbf{A}_k^{\textsf{T}}\textbf{g}_k\), the solution of an \(s\times s\) symmetric linear equation to obtain \(\textbf{p}_k\), and the product \(\textbf{A}_k \textbf{p}_k\). In addition, the matrices \(\textbf{A}_k\) and \(\textbf{Q}_k\) need to be updated in each iteration, and the matrix \(\textbf{A}_k^{\textsf{T}}\textbf{A}_k\) needs to be available. As we shall see later, it is possible to reduce the cost of these computations by using the inherent dependencies between the underlying formulas.
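The formulas (17)–(19) can be checked numerically. The following sketch uses synthetic data (the values of \(\gamma_k\), \(\textbf{Q}_k\), \(\textbf{A}_k\), and \(\mu\) are illustrative choices, not prescribed by the text) and compares the low-rank step against a dense solve of \((\textbf{B}_k+\mu \textbf{I})\textbf{d}=-\textbf{g}_k\):

```python
import numpy as np

rng = np.random.default_rng(1)
n, s, mu = 40, 6, 0.1
gamma = 2.0                       # B_{0,k} = gamma * I (illustrative value)

# Hypothetical compact data: A_k tall and skinny, Q_k symmetric nonsingular.
A = rng.standard_normal((n, s))
Q = rng.standard_normal((s, s))
Q = Q + Q.T
B = gamma * np.eye(n) + A @ np.linalg.solve(Q, A.T)   # B_k as in (15)
g = rng.standard_normal(n)                            # current gradient

# Regularized step via (17)-(19): only s x s solves and n x s products.
ghat = gamma + mu
p = np.linalg.solve(Q + (A.T @ A) / ghat, A.T @ g)    # p_k from (19)
step = -g / ghat + (A @ p) / ghat**2                  # s_k from (18)

# Dense reference solve of (B_k + mu*I) d = -g_k.
ref = np.linalg.solve(B + mu * np.eye(n), -g)
assert np.allclose(step, ref)
```

The low-rank route never forms the \(n\times n\) matrix \(\textbf{B}_k\); it only touches \(s\times s\) systems and products with the thin matrix \(\textbf{A}_k\), which is the source of the \(O(mn)\) per-step cost discussed below.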

Remark 2

(Regularized secant equation) Instead of compact representations, it is also possible to combine regularization and quasi-Newton techniques by directly approximating the sum \(\nabla ^2 f(\textbf{x}_k)+\mu \textbf{I}\); see [28]. This idea is based on the fact that the regularized Hessian satisfies (approximately) the modified secant equation

$$\begin{aligned} (\nabla ^2 f(\textbf{x}_k)+\mu \textbf{I}) \textbf{s}_k \approx \textbf{y}_k +\mu \textbf{s}_k. \end{aligned}$$

Thus, an approximation \(\hat{\textbf{B}}_k\) to \(\nabla ^2 f(\textbf{x}_k)+\mu \textbf{I}\) can be constructed by taking a modified initial guess \(\hat{\textbf{B}}_{0,k}:=\textbf{B}_{0,k}+\mu \textbf{I}\) and applying an arbitrary quasi-Newton scheme to the modified pair \((\textbf{S}_k,\hat{\textbf{Y}}_k):=(\textbf{S}_k,\textbf{Y}_k+\mu \textbf{S}_k)\). For certain quasi-Newton schemes like SR1 and PSB, this actually yields the same results as the approach based on compact representations (see Sects. 4.2, 4.3). In general, however, the two approaches are different.

4.1 Broyden–Fletcher–Goldfarb–Shanno (BFGS)

The BFGS update is often considered the most successful quasi-Newton scheme. Throughout this section, let \(\textbf{B}_{0,k}= \gamma _k \textbf{I}\) for some \(\gamma _k\in \mathbb {R}\). Following [6], the compact representation of L-BFGS is given by

$$\begin{aligned} \textbf{B}_k=\gamma _k \textbf{I}- \begin{bmatrix} \textbf{S}_k&\textbf{Y}_k \end{bmatrix} \begin{bmatrix} \gamma _k^{-1} \textbf{S}_k^{\textsf{T}} \textbf{S}_k &{}\quad \gamma _k^{-1}\textbf{L}_k \\ \gamma _k^{-1}\textbf{L}_k^{\textsf{T}} &{}\quad -\textbf{D}_k \end{bmatrix}^{-1} \begin{bmatrix} \textbf{S}_k^{\textsf{T}} \\ \textbf{Y}_k^{\textsf{T}} \end{bmatrix}, \end{aligned}$$
(20)

where

$$\begin{aligned} \textbf{D}_k:= \textbf{D}(\textbf{S}_k^{\textsf{T}}\textbf{Y}_k) \quad \text {and}\quad \textbf{L}_k:=\textbf{L}(\textbf{S}_k^{\textsf{T}}\textbf{Y}_k) \end{aligned}$$
(21)

(recall that \( \textbf{D}(\cdot ) \) denotes the diagonal part and \( \textbf{L} (\cdot ) \) the strict lower triangle of a given matrix). This can be written in the form (15) by defining

$$\begin{aligned} \textbf{A}_k:= \begin{bmatrix} \textbf{S}_k&\textbf{Y}_k \end{bmatrix} \quad \text {and}\quad \textbf{Q}_k:= \begin{bmatrix} -\gamma _k^{-1} \textbf{S}_k^{\textsf{T}} \textbf{S}_k &{}\quad -\gamma _k^{-1} \textbf{L}_k \\ -\gamma _k^{-1} \textbf{L}_k^{\textsf{T}} &{}\quad \textbf{D}_k \end{bmatrix}. \end{aligned}$$
(22)

Note that \(\textbf{Q}_k\in \mathbb {R}^{2 m\times 2 m}\).

The BFGS formula has a significant advantage in that the well-definedness of the updates can be controlled. More specifically, assuming that \(\textbf{s}_k^{\textsf{T}}\textbf{y}_k>0\) for all k, it can be shown that the BFGS matrix \(\textbf{B}_k\) is positive definite, so that the regularized BFGS matrix \(\hat{\textbf{B}}_k=\textbf{B}_k+\mu \textbf{I}\) is also positive definite and therefore nonsingular. By the Woodbury matrix identity, this implies that the inner matrix \(\textbf{Q}_k+{\hat{\gamma }}_k^{-1} \textbf{A}_k^{\textsf{T}} \textbf{A}_k\) in (17) is invertible, and thus the regularized Newton step is well-defined for all \(\mu \ge 0\).
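These properties can be verified on synthetic data. The sketch below (our own test setup, not from the text) draws curvature pairs from a convex quadratic, so that \(\textbf{s}_i^{\textsf{T}}\textbf{y}_i>0\) automatically, builds the compact matrix (20)–(22), and checks the secant equation on the newest pair together with positive definiteness:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 50, 5

# Curvature pairs from a convex quadratic f(x) = 0.5 x'Hx, so s_i'y_i > 0.
M = rng.standard_normal((n, n))
H = M @ M.T + n * np.eye(n)
S = rng.standard_normal((n, m))
Y = H @ S

# One common scaling choice for the initial matrix B_{0,k} = gamma * I.
gamma = (Y[:, -1] @ Y[:, -1]) / (S[:, -1] @ Y[:, -1])

SY = S.T @ Y
D = np.diag(np.diag(SY))          # D_k: diagonal part of S'Y
L = np.tril(SY, -1)               # L_k: strict lower triangle of S'Y

# Compact L-BFGS representation (20)-(22).
Ak = np.hstack([S, Y])
Mid = np.block([[(S.T @ S) / gamma, L / gamma],
                [L.T / gamma, -D]])
B = gamma * np.eye(n) - Ak @ np.linalg.solve(Mid, Ak.T)

assert np.allclose(B @ S[:, -1], Y[:, -1])   # secant equation, newest pair
assert np.allclose(B, B.T)                   # symmetry
assert np.linalg.eigvalsh(B).min() > 0       # positive definiteness
```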

In practice, the well-definedness is controlled by means of a so-called cautious updating mechanism [16]. The previous limited memory data is only updated with the next pair \((\textbf{s}_k,\textbf{y}_k)\) if

$$\begin{aligned} \textbf{y}_k^{\textsf{T}}\textbf{s}_k \ge \varepsilon \Vert \textbf{s}_k\Vert ^2, \end{aligned}$$
(23)

where \(\varepsilon >0\) is some predefined constant. This guarantees that the L-BFGS matrices \(\textbf{B}_k\) are uniformly positive definite. If \(\nabla f\) is Lipschitz continuous on the set of iterates (or an appropriate level set), then (23) also guarantees that \(\{\textbf{B}_k\}\) is bounded.
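A minimal sketch of the cautious test (23) (the function name and example vectors are ours):

```python
import numpy as np

def cautious_accept(s, y, eps=1e-8):
    """Return True iff the pair (s, y) passes the cautious test (23)."""
    return float(y @ s) >= eps * float(s @ s)

s = np.array([1.0, 0.0])
assert cautious_accept(s, np.array([2.0, 1.0]))       # positive curvature: keep
assert not cautious_accept(s, np.array([-1.0, 0.5]))  # negative curvature: reject
```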

4.1.1 Updating L-BFGS information

We now describe how the L-BFGS information can be updated in an efficient manner. To avoid repetition, we only describe the case where the previous information is already “full”, i.e., where at least m previous data pairs \((\textbf{s}_i,\textbf{y}_i)\) are available. The modifications necessary to treat the initial steps essentially amount to re-indexing and will not be detailed here.

Much of the computational effort of regularized L-BFGS can be mitigated by memorizing certain intermediate results. Motivated by a related trust-region approach in [2], we track, in addition to the matrices \(\textbf{S}_k\) and \(\textbf{Y}_k\), the quantities

$$\begin{aligned} \textbf{A}_k^{\textsf{T}}\textbf{A}_k \in \mathbb {R}^{2 m\times 2 m} \quad \text {and}\quad \textbf{A}_k^{\textsf{T}} \textbf{g}_k \in \mathbb {R}^{2 m}. \end{aligned}$$

Both of these quantities are necessary for the computation of the regularized quasi-Newton step (18), (19), but they also occur in other places of the iteration and updating process, so that memorizing them can save redundant computational effort. Recall that \(\textbf{A}_k=[\textbf{S}_k , \, \textbf{Y}_k]\), so that in particular

$$\begin{aligned} \textbf{A}_k^{\textsf{T}}\textbf{A}_k= \begin{bmatrix} \textbf{S}_k^{\textsf{T}} \textbf{S}_k &{}\quad \textbf{S}_k^{\textsf{T}} \textbf{Y}_k \\ \textbf{Y}_k^{\textsf{T}} \textbf{S}_k &{}\quad \textbf{Y}_k^{\textsf{T}} \textbf{Y}_k \end{bmatrix} \quad \text {and}\quad \textbf{A}_k^{\textsf{T}}\textbf{g}_k= \begin{bmatrix} \textbf{S}_k^{\textsf{T}} \textbf{g}_k \\ \textbf{Y}_k^{\textsf{T}} \textbf{g}_k \end{bmatrix}. \end{aligned}$$

Hence, the matrix \(\textbf{A}_k^{\textsf{T}}\textbf{A}_k\) contains the blocks \(\textbf{S}_k^{\textsf{T}}\textbf{S}_k\), \(\textbf{L}_k\), and \(\textbf{D}_k\) from (22) as submatrices.

When passing from k to \(k+1\), these matrices and vectors can be updated as follows. If the data pair \((\textbf{s}_k,\textbf{y}_k)\) is rejected, then \(\textbf{A}_k\) remains unchanged, and we may update \(\textbf{A}_k^{\textsf{T}}\textbf{g}_k\) by direct computation. If the data pair is accepted, then the updating process requires more care since both \(\textbf{A}_k^{\textsf{T}}\textbf{A}_k\) and \(\textbf{A}_k^{\textsf{T}}\textbf{g}_k\) need to be incremented. In this case, the new matrices \( \textbf{S}_{k+1} \) and \( \textbf{Y}_{k+1} \) consist of the last \( m-1 \) columns of the old matrices \( \textbf{S}_{k} \) and \( \textbf{Y}_{k} \), respectively, to which the new vectors \(\textbf{s}_k\) and \(\textbf{y}_k\) are appended as the last column. We then begin by computing the vectors

$$\begin{aligned} \textbf{v}:= \textbf{A}_k^{\textsf{T}} \textbf{s}_k=-{\hat{\gamma }}_k^{-1} \textbf{A}_k^{\textsf{T}} \textbf{g}_k+{\hat{\gamma }}_k^{-2} (\textbf{A}_k^{\textsf{T}}\textbf{A}_k) \textbf{p}_k, \qquad \textbf{w}:= \textbf{A}_{k+1}^{\textsf{T}} \textbf{g}_{k+1}, \end{aligned}$$
(24)

where \(\textbf{p}_k\) is given by (19); as well as the scalar quantities \((\alpha _1, \alpha _2, \alpha _3):= (\textbf{s}_k^{\textsf{T}} \textbf{s}_k, \textbf{s}_k^{\textsf{T}} \textbf{y}_k, \textbf{y}_k^{\textsf{T}} \textbf{y}_k)\). This information is then used to update \(\textbf{A}_k^{\textsf{T}} \textbf{A}_k\) blockwise using the formulas

$$\begin{aligned} \textbf{S}_{k+1}^{\textsf{T}} \textbf{S}_{k+1}&= \begin{bmatrix} (\textbf{S}_k^{\textsf{T}}\textbf{S}_k)_{2:m, 2:m} &{}\quad \textbf{v}_{2:m} \\ * &{}\quad \alpha _1 \end{bmatrix}, \end{aligned}$$
(25a)
$$\begin{aligned} \textbf{S}_{k+1}^{\textsf{T}} \textbf{Y}_{k+1}&= \begin{bmatrix} (\textbf{S}_k^{\textsf{T}} \textbf{Y}_k)_{2:m,2:m} &{}\quad \textbf{w}_{1:m-1}-(\textbf{A}_k^{\textsf{T}}\textbf{g}_k)_{2:m} \\ \textbf{v}_{m+2:2 m}^{\textsf{T}} &{}\quad \alpha _2 \end{bmatrix}, \end{aligned}$$
(25b)
$$\begin{aligned} \textbf{Y}_{k+1}^{\textsf{T}} \textbf{Y}_{k+1}&= \begin{bmatrix} (\textbf{Y}_k^{\textsf{T}} \textbf{Y}_k)_{2:m,2:m} &{}\quad \textbf{w}_{m+1:2 m-1}-(\textbf{A}_k^{\textsf{T}} \textbf{g}_k)_{m+2:2 m} \\ * &{}\quad \alpha _3 \end{bmatrix}, \end{aligned}$$
(25c)

where “\(*\)” is given by symmetry, and expressions of the form \((\textbf{S}_k^{\textsf{T}}\textbf{S}_k)_{2:m,2:m}\) or \(\textbf{v}_{2:m}\) denote the submatrices and subvectors formed from the subscripted index ranges. Finally, we have \(\textbf{Y}_{k+1}^{\textsf{T}}\textbf{S}_{k+1}=(\textbf{S}_{k+1}^{\textsf{T}}\textbf{Y}_{k+1})^{\textsf{T}}\), and the new vector \(\textbf{A}_{k+1}^{\textsf{T}} \textbf{g}_{k+1}\) is by definition equal to \(\textbf{w}\).
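The sliding-window structure can be checked numerically. The sketch below (random data; \(\textbf{v}\) is computed directly rather than via the recycled expression (24)) verifies the block update (25a) for \(\textbf{S}^{\textsf{T}}\textbf{S}\) against a full recomputation:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 30, 5

S = rng.standard_normal((n, m))
Y = rng.standard_normal((n, m))
s_new = rng.standard_normal(n)           # accepted step s_k

A = np.hstack([S, Y])
v = A.T @ s_new                          # in the algorithm, available via (24)

# Shift the window: drop the oldest column, append the new one.
S1 = np.hstack([S[:, 1:], s_new[:, None]])

# Block update (25a): reuse the old trailing block and the entries of v.
StS = S.T @ S
StS1 = np.empty((m, m))
StS1[: m - 1, : m - 1] = StS[1:, 1:]     # old (2:m, 2:m) block survives
StS1[: m - 1, m - 1] = v[1:m]            # v_{2:m}
StS1[m - 1, : m - 1] = v[1:m]            # "*" block, filled by symmetry
StS1[m - 1, m - 1] = s_new @ s_new       # alpha_1

assert np.allclose(StS1, S1.T @ S1)
```

The updates (25b) and (25c) for \(\textbf{S}^{\textsf{T}}\textbf{Y}\) and \(\textbf{Y}^{\textsf{T}}\textbf{Y}\) follow the same pattern with the corresponding slices of \(\textbf{v}\) and \(\textbf{w}\).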

4.1.2 Computational complexity

Let us now comment on the complexity involved in the computation of the regularized quasi-Newton step. Assuming that the product \(\textbf{A}_k^{\textsf{T}}\textbf{g}_k\) has been formed, the main cost is the solution of a \(2 m\times 2 m\) symmetric linear system to form \(\textbf{p}_k\), and the multiplication of \(\textbf{p}_k\) with the \(n\times 2 m\) matrix \(\textbf{A}_k\). Hence, the complexity of the regularized quasi-Newton equation is \(2 m n+O(m^3)\) multiplications.

When a step is successful, the existing data needs to be updated according to the formulas developed above. The dominating cost of this is 2mn multiplications for the computation of \(\textbf{w}=\textbf{A}_{k+1}^{\textsf{T}}\textbf{g}_{k+1}\). Hence, the overall computational effort is at most 2mn multiplications for an unsuccessful step, and 4mn for a successful step.

The computational cost of the \(2m \times 2m\) linear equation (19) for the computation of \(\textbf{p}_k\) is of order \(O(m^3)\). Thus, if \(m\ll n\), this cost is negligible in comparison to mn. The slight computational overhead induced by this linear equation can be mitigated further by using the Schur complement of \(\textbf{Q}_k + {\hat{\gamma }}_k^{-1}\textbf{A}_k^{\textsf{T}}\textbf{A}_k\) to reduce the \(2m \times 2m\) inversion to two \(m\times m\) Cholesky factorizations. See [4] for more details.

4.2 Symmetric rank-one (SR1)

For SR1, the compact representation takes on the form

$$\begin{aligned} \textbf{B}_k=\textbf{B}_{0,k}+(\textbf{Y}_k-\textbf{B}_{0,k}\textbf{S}_k) (\textbf{D}_k+\textbf{L}_k+\textbf{L}_k^{\textsf{T}}-\textbf{S}_k^{\textsf{T}}\textbf{B}_{0,k}\textbf{S}_k)^{-1} (\textbf{Y}_k-\textbf{B}_{0,k}\textbf{S}_k)^{\textsf{T}}, \nonumber \\ \end{aligned}$$
(26)

where \(\textbf{D}_k\) and \(\textbf{L}_k\) are again given by (21). This can be written in the form (15) by defining

$$\begin{aligned} \textbf{A}_k:=\textbf{Y}_k -\textbf{B}_{0,k}\textbf{S}_k \quad \text {and}\quad \textbf{Q}_k:=\textbf{D}_k+\textbf{L}_k+\textbf{L}_k^{\textsf{T}}-\textbf{S}_k^{\textsf{T}}\textbf{B}_{0,k}\textbf{S}_k. \end{aligned}$$
(27)

Note that \(\textbf{Q}_k\in \mathbb {R}^{m\times m}\) in this case.

If \(\textbf{B}_{0,k}=\gamma _k \textbf{I}\), then (26) can be simplified to

$$\begin{aligned} \textbf{B}_k = \gamma _k \textbf{I} + (\textbf{Y}_k - \gamma _k \textbf{S}_k)(\textbf{D}_k + \textbf{L}_k + \textbf{L}_k^{\textsf{T}} - \gamma _k \textbf{S}_k^{\textsf{T}}\textbf{S}_k)^{-1}(\textbf{Y}_k - \gamma _k \textbf{S}_k)^{\textsf{T}}. \end{aligned}$$
(28)
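A useful numerical check of (28) exploits the hereditary property of SR1 on quadratic models: if \(\textbf{y}_i=\textbf{H}\textbf{s}_i\) for a fixed symmetric \(\textbf{H}\), the SR1 matrix satisfies the secant equation on every stored pair, not just the newest one. The sketch below (our own synthetic setup) verifies this for the compact form:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, gamma = 30, 5, 1.0

# Quadratic-model data y_i = H s_i with H symmetric: SR1 then enjoys the
# hereditary property B_k s_i = y_i for all stored pairs.
M = rng.standard_normal((n, n))
H = (M + M.T) / 2
S = rng.standard_normal((n, m))
Y = H @ S

SY = S.T @ Y
D = np.diag(np.diag(SY))          # diagonal part of S'Y
L = np.tril(SY, -1)               # strict lower triangle of S'Y

# Compact L-SR1 representation (28) with B_{0,k} = gamma * I.
W = Y - gamma * S
Q = D + L + L.T - gamma * (S.T @ S)
B = gamma * np.eye(n) + W @ np.linalg.solve(Q, W.T)

assert np.allclose(B @ S, Y)      # secant equation holds for every pair
assert np.allclose(B, B.T)        # symmetry
```

Note that \(\textbf{Q}_k\) is nonsingular here only because the random data happens to be well conditioned; as discussed next, this cannot be taken for granted in general.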

The well-definedness of the SR1 update is hard to guarantee in practice because the underlying rank-one formula involves a denominator of the form \((\textbf{y}_k-\textbf{B}_k\textbf{s}_k)^{\textsf{T}}\textbf{s}_k\), which can vanish. Thus, when applying formula (16) in the SR1 setting, it is important to clarify how this situation is handled. Note that it is not possible to predict which new data \((\textbf{s}_{k+1},\textbf{y}_{k+1})\) might lead to ill-conditioning because this crucially depends on the previous information \((\textbf{S}_k,\textbf{Y}_k)\). In fact, even the discarding of old data at some point during the iteration can change the well-definedness of the SR1 update.

Fortunately, there is a simple and effective way of skipping ill-conditioned updates “on the fly”, i.e., during the computation of the quasi-Newton step. This effectively amounts to skipping an intermediate pair \((\textbf{s}_i,\textbf{y}_i)\) when necessary and continuing the SR1 update with \((\textbf{s}_{i+1},\textbf{y}_{i+1})\) instead. It was observed in [6] that ill-definedness of one of these updates amounts to the vanishing of a leading principal minor of \(\textbf{Q}_k\), or equivalently, to a vanishing pivot element during a triangularization of \(\textbf{Q}_k\). When this occurs, it is proposed in [6] to skip the update by essentially ignoring the current row and column of \(\textbf{Q}_k\), and the current column of \(\textbf{A}_k\) (which contains the corresponding vectors \(\textbf{s}_i\) and \(\textbf{y}_i\)).

The above procedure can be adapted to the regularized SR1 setting by observing that the SR1 update “commutes” with the regularization in a certain sense. More specifically, if \(\textbf{B}_k={\text {SR1}}(\textbf{B}_{0,k},\textbf{S},\textbf{Y})\) denotes the SR1 update, then

$$\begin{aligned} {\text {SR1}}(\textbf{B}_{0,k}+\mu \textbf{I},\textbf{S},\textbf{Y}+\mu \textbf{S})= {\text {SR1}}(\textbf{B}_{0,k},\textbf{S},\textbf{Y})+\mu \textbf{I} \end{aligned}$$

for all \(\mu \ge 0\), provided that the left side exists. Moreover, an easy calculation shows that the matrix \(\textbf{Q}_k+\textbf{A}_k^{\textsf{T}} \hat{\textbf{B}}_{0,k}^{-1} \textbf{A}_k\) from (16), which needs to be inverted for the computation of the regularized Newton step, coincides (up to scaling) with the analogue of \(\textbf{Q}_k\) which would arise for the SR1 update corresponding to \(\hat{\textbf{B}}_{0,k}\) and \(\textbf{Y}_k+\mu \textbf{S}_k\).
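The commutation identity can be confirmed directly on the compact form (28). In the sketch below (random data, helper function name ours), building the SR1 matrix from the shifted initial guess and modified gradient differences reproduces the original matrix plus \(\mu \textbf{I}\):

```python
import numpy as np

def sr1_compact(gamma, S, Y):
    """Compact SR1 matrix (28) with initial matrix gamma * I."""
    SY = S.T @ Y
    Q = np.diag(np.diag(SY)) + np.tril(SY, -1) + np.tril(SY, -1).T \
        - gamma * (S.T @ S)
    W = Y - gamma * S
    return gamma * np.eye(S.shape[0]) + W @ np.linalg.solve(Q, W.T)

rng = np.random.default_rng(4)
n, m, gamma, mu = 20, 4, 1.0, 0.3
S = rng.standard_normal((n, m))
Y = rng.standard_normal((n, m))

# SR1(B0 + mu*I, S, Y + mu*S) = SR1(B0, S, Y) + mu*I
B = sr1_compact(gamma, S, Y)
B_reg = sr1_compact(gamma + mu, S, Y + mu * S)
assert np.allclose(B_reg, B + mu * np.eye(n))
```

The identity holds exactly because the modified data leaves both \(\textbf{W}=\textbf{Y}-\gamma \textbf{S}\) and the inner matrix \(\textbf{Q}\) unchanged.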

4.2.1 Updating L-SR1 information

The quantities involved in the L-SR1 computations can be updated in a similar fashion to the L-BFGS case; see Sect. 4.1. We again maintain the quantities

$$\begin{aligned} \textbf{S}_k^{\textsf{T}}\textbf{S}_k,\,\textbf{S}_k^{\textsf{T}}\textbf{Y}_k,\, \textbf{Y}_k^{\textsf{T}}\textbf{Y}_k\in \mathbb {R}^{m\times m} \quad \text {and}\quad \textbf{S}_k^{\textsf{T}}\textbf{g}_k,\,\textbf{Y}_k^{\textsf{T}}\textbf{g}_k\in \mathbb {R}^m. \end{aligned}$$
(29)

These can be formed and updated as before. Moreover, they can be used to directly form the matrices \(\textbf{A}_k\) and \(\textbf{Q}_k\), the product \(\textbf{A}_k^{\textsf{T}}\textbf{g}_k\), and the matrix \(\textbf{A}_k^{\textsf{T}}\textbf{A}_k\).

4.2.2 Computational complexity

The computational cost of the regularized L-SR1 method is as follows. In each successful iteration, the quantities (29) are updated, and the matrix \(\textbf{A}_k=\textbf{Y}_k-\textbf{B}_{0,k}\textbf{S}_k\) is formed. Using the techniques from Sect. 4.1, these operations require 3mn multiplications.

Moreover, the quasi-Newton step needs to be calculated in each step, which entails the solution of an \(m\times m\) symmetric linear system to obtain \(\textbf{p}_k\), and the multiplication of \(\textbf{p}_k\) with the \(n\times m\) matrix \(\textbf{A}_k\), requiring another mn multiplications.

In total, the cost of a successful step is therefore 4mn multiplications, and the cost of an unsuccessful step is mn multiplications (down from 2mn in the BFGS case).

4.3 Powell-symmetric-Broyden (PSB)

As a third example, we include the classical PSB formula from [25]. This approach is interesting because the PSB update is always well-defined and has certain well-known minimality properties. The PSB update is given by

$$\begin{aligned} \textbf{B}_{k+1} = \textbf{B}_k+\frac{(\textbf{y}_k-\textbf{B}_k \textbf{s}_k)\textbf{s}_k^{\textsf{T}}+\textbf{s}_k(\textbf{y}_k-\textbf{B}_k \textbf{s}_k)^{\textsf{T}}}{\textbf{s}_k^{\textsf{T}}\textbf{s}_k}- \frac{(\textbf{y}_k-\textbf{B}_k \textbf{s}_k)^{\textsf{T}}\textbf{s}_k}{(\textbf{s}_k^{\textsf{T}}\textbf{s}_k)^2} \textbf{s}_k \textbf{s}_k^{\textsf{T}}. \end{aligned}$$
(30)
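Two defining properties of (30), namely that the update is well-defined for any \(\textbf{s}_k\ne 0\) and enforces the secant equation while preserving symmetry, can be checked with a short sketch (random data of our choosing):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20

B = rng.standard_normal((n, n))
B = (B + B.T) / 2                 # current symmetric approximation B_k
s = rng.standard_normal(n)        # any nonzero step; no curvature condition
y = rng.standard_normal(n)

# One PSB update (30).
r = y - B @ s                     # residual y_k - B_k s_k
ss = s @ s
B1 = B + (np.outer(r, s) + np.outer(s, r)) / ss \
       - (r @ s) / ss**2 * np.outer(s, s)

assert np.allclose(B1 @ s, y)     # secant equation B_{k+1} s_k = y_k
assert np.allclose(B1, B1.T)      # symmetry is preserved
```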

The compact representation of PSB is given in the next theorem.

Note that there is a related representation in [3] for a multipoint secant version of PSB. The two representations coincide when \(m=1\).

Theorem 3

(Compact representation of PSB) The PSB formula admits the compact representation

$$\begin{aligned} \textbf{B}_k=\textbf{B}_{0,k}+ \begin{bmatrix} \textbf{S}_k&\textbf{W}_k \end{bmatrix} \begin{bmatrix} 0 &{}\quad \textbf{U}_k \\ \textbf{U}_k^{\textsf{T}} &{}\quad \textbf{L}_k+\textbf{D}_k+\textbf{L}_k^{\textsf{T}} \end{bmatrix}^{-1} \begin{bmatrix} \textbf{S}_k&\textbf{W}_k \end{bmatrix}^{\textsf{T}}, \end{aligned}$$
(31)

where \(\textbf{W}_k:=\textbf{Y}_k-\textbf{B}_{0,k}\textbf{S}_k\), \(\textbf{U}_k\) is the (non-strictly) upper triangular part of \(\textbf{S}_k^{\textsf{T}}\textbf{S}_k\), \(\textbf{L}_k\) is the strictly lower triangular part of \(\textbf{S}_k^{\textsf{T}}\textbf{W}_k\), and \(\textbf{D}_k\) is the diagonal part of \(\textbf{S}_k^{\textsf{T}}\textbf{W}_k\).

Proof

To simplify some technical details, we restrict the proof to the case where \(k = m\) (i.e., the algorithm has performed exactly m steps, and the matrices \(\textbf{S}_k\) and \(\textbf{Y}_k\) are “full”). Observe first that (30) can be rewritten as

$$\begin{aligned} \textbf{B}_{k+1}= \left( \textbf{I}-\frac{\textbf{s}_k \textbf{s}_k^{\textsf{T}}}{\textbf{s}_k^{\textsf{T}}\textbf{s}_k} \right) \textbf{B}_k \left( \textbf{I}-\frac{\textbf{s}_k \textbf{s}_k^{\textsf{T}}}{\textbf{s}_k^{\textsf{T}}\textbf{s}_k} \right) + \begin{bmatrix} \textbf{s}_k&\textbf{y}_k \end{bmatrix} \begin{bmatrix} 0 &{}\quad \textbf{s}_k^{\textsf{T}}\textbf{s}_k \\ \textbf{s}_k^{\textsf{T}}\textbf{s}_k &{}\quad \textbf{s}_k^{\textsf{T}}\textbf{y}_k \end{bmatrix}^{-1} \begin{bmatrix} \textbf{s}_k&\textbf{y}_k \end{bmatrix}^{\textsf{T}}. \end{aligned}$$

Therefore, we can write \(\textbf{B}_k=\textbf{M}_k+\textbf{N}_k\), where \(\textbf{M}_k,\textbf{N}_k\) are recursively defined through the formulas

$$\begin{aligned} \textbf{M}_0=\textbf{B}_{0,k}, \qquad&\textbf{M}_{i+1}=\textbf{V}_i \textbf{M}_i \textbf{V}_i,\\ \textbf{N}_0=0, \qquad&\textbf{N}_{i+1}=\textbf{V}_i \textbf{N}_i \textbf{V}_i + \begin{bmatrix} \textbf{s}_i&\textbf{y}_i \end{bmatrix} \begin{bmatrix} 0 &{}\quad \textbf{s}_i^{\textsf{T}}\textbf{s}_i \\ \textbf{s}_i^{\textsf{T}}\textbf{s}_i &{}\quad \textbf{s}_i^{\textsf{T}}\textbf{y}_i \end{bmatrix}^{-1} \begin{bmatrix} \textbf{s}_i&\textbf{y}_i \end{bmatrix}^{\textsf{T}}, \end{aligned}$$

where \(\textbf{V}_i:=\textbf{I}-(\textbf{s}_i^{\textsf{T}}\textbf{s}_i)^{-1}\textbf{s}_i\textbf{s}_i^{\textsf{T}}\) for all i. Observe now that \(\textbf{V}_0 \cdot \ldots \cdot \textbf{V}_{k-1}=\textbf{I}-\textbf{S}_k \textbf{U}_k^{-1}\textbf{S}_k^{\textsf{T}}\) by [6, Lem. 2.1], so that

$$\begin{aligned} \textbf{M}_k= \big (\textbf{I}-\textbf{S}_k \textbf{U}_k^{-\textsf{T}}\textbf{S}_k^{\textsf{T}} \big ) \textbf{B}_{0,k}\big ( \textbf{I}-\textbf{S}_k \textbf{U}_k^{-1}\textbf{S}_k^{\textsf{T}} \big ). \end{aligned}$$

We proceed by using (finite) induction to show that

$$\begin{aligned} \begin{aligned} \textbf{N}_i&=\textbf{S}_i \textbf{U}_i^{-\textsf{T}} \textbf{Y}_i^{\textsf{T}}+\textbf{Y}_i \textbf{U}_i^{-1} \textbf{S}_i^{\textsf{T}} \\&\qquad - \textbf{S}_i \textbf{U}_i^{-\textsf{T}} \big ( \tilde{\textbf{L}}_i+\tilde{\textbf{D}}_i+\tilde{\textbf{L}}_i^{\textsf{T}} \big ) \textbf{U}_i^{-1} \textbf{S}_i^{\textsf{T}} \quad \text {for all } i = 1, \ldots , k, \end{aligned} \end{aligned}$$
(32)

where \(\tilde{\textbf{L}}_i:=\textbf{L}(\textbf{S}_i^{\textsf{T}}\textbf{Y}_i)\) and \( \tilde{\textbf{D}}_i:= \textbf{D} (\textbf{S}_i^{\textsf{T}}\textbf{Y}_i)\). Before we verify this formula, we show that it yields the desired compact representation of the PSB formula. Indeed, using (32) and the definitions of the matrices \( \tilde{\textbf{L}}_k, \tilde{\textbf{D}}_k, \textbf{L}_k, \textbf{D}_k \), respectively, we obtain

$$\begin{aligned} \textbf{B}_k&= \textbf{M}_k + \textbf{N}_k \\&= \textbf{B}_{0,k}- \textbf{S}_k \textbf{U}_k^{-\textsf{T}} \textbf{S}_k^{\textsf{T}} \textbf{B}_{0,k}- \textbf{B}_{0,k}\textbf{S}_k \textbf{U}_k^{-1} \textbf{S}_k^{\textsf{T}} + \textbf{S}_k \textbf{U}_k^{-\textsf{T}} \textbf{Y}_k^{\textsf{T}} + \textbf{Y}_k \textbf{U}_k^{-1} \textbf{S}_k^{\textsf{T}} \\&\quad - \textbf{S}_k \textbf{U}_k^{-\textsf{T}} \big ( \textbf{L}_k + \textbf{D}_k + \textbf{L}_k^{\textsf{T}} \big ) \textbf{U}_k^{-1} \textbf{S}_k^{\textsf{T}}. \end{aligned}$$

On the other hand, exploiting the fact that

$$\begin{aligned} \begin{bmatrix} 0 &{}\quad \textbf{U}_k \\ \textbf{U}_k^{\textsf{T}} &{}\quad \textbf{L}_k+\textbf{D}_k+\textbf{L}_k^{\textsf{T}} \end{bmatrix}^{-1} = \begin{bmatrix} - \textbf{U}_k^{-\textsf{T}} \big ( \textbf{L}_k+\textbf{D}_k+\textbf{L}_k^{\textsf{T}} \big ) \textbf{U}_k^{-1} &{}\quad \textbf{U}_k^{-\textsf{T}} \\ \textbf{U}_k^{-1} &{}\quad 0 \end{bmatrix}, \end{aligned}$$

using \( \textbf{W}_k=\textbf{Y}_k-\textbf{B}_{0,k}\textbf{S}_k \), and expanding (31), it is easy to see that we obtain the same expression.

Hence it remains to verify (32) by induction. For \( i = 1 \), we have

$$\begin{aligned} \textbf{S}_1 = \begin{bmatrix} \textbf{s}_0 \end{bmatrix}, \quad \textbf{Y}_1 = \begin{bmatrix} \textbf{y}_0 \end{bmatrix}, \quad \textbf{U}_1^{-1} = \frac{1}{\textbf{s}_0^{\textsf{T}} \textbf{s}_0}, \quad \tilde{\textbf{L}}_1 = \begin{bmatrix} 0 \end{bmatrix}, \quad \tilde{\textbf{D}}_1 = \textbf{s}_0^{\textsf{T}} \textbf{y}_0. \end{aligned}$$

Together with the observation that

$$\begin{aligned} \begin{bmatrix} 0 &{}\quad \textbf{s}_i^{\textsf{T}} \textbf{s}_i \\ \textbf{s}_i^{\textsf{T}} \textbf{s}_i &{}\quad \textbf{s}_i^{\textsf{T}} \textbf{y}_i \end{bmatrix}^{-1} = \begin{bmatrix} - \frac{\textbf{s}_i^{\textsf{T}} \textbf{y}_i}{( \textbf{s}_i^{\textsf{T}} \textbf{s}_i )^2} &{}\quad \frac{1}{\textbf{s}_i^{\textsf{T}} \textbf{s}_i} \\ \frac{1}{\textbf{s}_i^{\textsf{T}} \textbf{s}_i} &{}\quad 0 \end{bmatrix}, \end{aligned}$$
(33)

an elementary calculation shows that (32) holds for \( i = 1 \). Suppose now that the statement holds for some \( i \in \{1,\ldots ,k-1\}\). Using the induction hypothesis together with (33), a straightforward calculation shows that

$$\begin{aligned} \textbf{N}_{i+1}&= \textbf{V}_i \textbf{S}_i \textbf{U}_i^{-\textsf{T}} \textbf{Y}_i^{\textsf{T}} \textbf{V}_i + \textbf{V}_i \textbf{Y}_i \textbf{U}_i^{-1} \textbf{S}_i^{\textsf{T}} \textbf{V}_i \\&\quad - \textbf{V}_i \textbf{S}_i \textbf{U}_i^{- \textsf{T}} \big ( \tilde{\textbf{L}}_i+\tilde{\textbf{D}}_i+\tilde{\textbf{L}}_i^{\textsf{T}} \big ) \textbf{U}_i^{-1} \textbf{S}_i^{\textsf{T}} \textbf{V}_i \\&\quad - \frac{\textbf{s}_i^{\textsf{T}} \textbf{y}_i}{( \textbf{s}_i^{\textsf{T}} \textbf{s}_i )^2} \textbf{s}_i \textbf{s}_i^{\textsf{T}} + \frac{1}{\textbf{s}_i^{\textsf{T}} \textbf{s}_i} \textbf{s}_i \textbf{y}_i^{\textsf{T}} + \frac{1}{\textbf{s}_i^{\textsf{T}} \textbf{s}_i} \textbf{y}_i \textbf{s}_i^{\textsf{T}}. \end{aligned}$$

On the other hand, let us calculate the expression (32) for \( i + 1 \). Based on the partitions

$$\begin{aligned} \textbf{S}_{i+1}&= \begin{bmatrix} \textbf{S}_i&\textbf{s}_i \end{bmatrix}, \qquad \textbf{Y}_{i+1} = \begin{bmatrix} \textbf{Y}_i&\textbf{y}_i \end{bmatrix}, \\ \textbf{U}_{i+1}&= \begin{bmatrix} \textbf{U}_i &{}\quad \textbf{S}_i^{\textsf{T}} \textbf{s}_i \\ 0 &{}\quad \textbf{s}_i^{\textsf{T}} \textbf{s}_i \end{bmatrix} \quad \Longrightarrow \quad \textbf{U}_{i+1}^{-1} = \begin{bmatrix} \textbf{U}_i^{-1} &{}\quad - \frac{1}{\textbf{s}_i^{\textsf{T}} \textbf{s}_i} \textbf{U}_i^{-1} \textbf{S}_i^{\textsf{T}} \textbf{s}_i \\ 0 &{}\quad \frac{1}{\textbf{s}_i^{\textsf{T}} \textbf{s}_i} \end{bmatrix}, \\ \tilde{\textbf{L}}_{i+1}&= \begin{bmatrix} \tilde{\textbf{L}}_i &{}\quad 0 \\ \textbf{s}_i^{\textsf{T}} \textbf{Y}_i &{}\quad 0 \end{bmatrix}, \qquad \tilde{\textbf{D}}_{i+1} = \begin{bmatrix} \tilde{\textbf{D}}_i &{}\quad 0 \\ 0 &{}\quad \textbf{s}_i^{\textsf{T}} \textbf{y}_i \end{bmatrix}, \end{aligned}$$

we obtain

$$\begin{aligned} \textbf{S}_{i+1} \textbf{U}_{i+1}^{-\textsf{T}}&= \begin{bmatrix} \textbf{V}_i \textbf{S}_i \textbf{U}_i^{-\textsf{T}}&\frac{1}{\textbf{s}_i^{\textsf{T}} \textbf{s}_i} \textbf{s}_i \end{bmatrix}, \\ \textbf{S}_{i+1} \textbf{U}_{i+1}^{-\textsf{T}} \textbf{Y}_{i+1}^{\textsf{T}}&= \textbf{V}_i \textbf{S}_i \textbf{U}_i^{-\textsf{T}} \textbf{Y}_i^{\textsf{T}} + \frac{1}{\textbf{s}_i^{\textsf{T}} \textbf{s}_i} \textbf{s}_i \textbf{y}_i^{\textsf{T}}, \\ \tilde{\textbf{L}}_{i+1} + \tilde{\textbf{D}}_{i+1} + \tilde{\textbf{L}}_{i+1}^{\textsf{T}}&= \begin{bmatrix} \tilde{\textbf{L}}_{i} + \tilde{\textbf{D}}_{i} + \tilde{\textbf{L}}_{i}^{\textsf{T}} &{}\quad \textbf{Y}_i^{\textsf{T}} \textbf{s}_i \\ \textbf{s}_i^{\textsf{T}} \textbf{Y}_i &{}\quad \textbf{s}_i^{\textsf{T}} \textbf{y}_i \end{bmatrix}, \end{aligned}$$

hence

$$\begin{aligned}&\textbf{S}_{i+1} \textbf{U}_{i+1}^{-\textsf{T}} \big ( \tilde{\textbf{L}}_{i+1} + \tilde{\textbf{D}}_{i+1} + \tilde{\textbf{L}}_{i+1}^{\textsf{T}} \big ) \textbf{U}_{i+1}^{-1} \textbf{S}_{i+1}^{\textsf{T}} \\&\quad = \textbf{V}_i \textbf{S}_i \textbf{U}_i^{-\textsf{T}} \big ( \tilde{\textbf{L}}_{i} + \tilde{\textbf{D}}_{i} + \tilde{\textbf{L}}_{i}^{\textsf{T}} \big ) \textbf{U}_i^{-1} \textbf{S}_i^{\textsf{T}} \textbf{V}_i + \frac{1}{\textbf{s}_i^{\textsf{T}} \textbf{s}_i} \textbf{V}_i \textbf{S}_i \textbf{U}_i^{-\textsf{T}} \textbf{Y}_i^{\textsf{T}} \textbf{s}_i \textbf{s}_i^{\textsf{T}} \\&\qquad + \frac{1}{\textbf{s}_i^{\textsf{T}} \textbf{s}_i} \textbf{s}_i \textbf{s}_i^{\textsf{T}} \textbf{Y}_i \textbf{U}_i^{-1} \textbf{S}_i^{\textsf{T}} \textbf{V}_i + \frac{\textbf{s}_i^{\textsf{T}} \textbf{y}_i}{( \textbf{s}_i^{\textsf{T}} \textbf{s}_i )^2} \textbf{s}_i \textbf{s}_i^{\textsf{T}}. \end{aligned}$$

Using these expressions and expanding (32) with i replaced by \( i+1 \), and taking into account once again the definition of \( \textbf{V}_i \), an elementary calculation shows that the resulting matrix \( \textbf{N}_{i+1} \) coincides with the one obtained previously. This completes the induction. \(\square \)

If \(\textbf{B}_{0,k}=\gamma _k \textbf{I}\) for some \(\gamma _k\in \mathbb {R}\), then (31) can be rewritten as

$$\begin{aligned} \textbf{B}_k=\gamma _k \textbf{I}+\textbf{A}_k \begin{bmatrix} 0 &{}\quad \textbf{U}_k \\ \textbf{U}_k^{\textsf{T}} &{}\quad \textbf{D}(\textbf{S}_k^{\textsf{T}}\textbf{Y}_k) + \gamma _k \textbf{D}(\textbf{S}_k^{\textsf{T}}\textbf{S}_k)+ \textbf{L}(\textbf{S}_k^{\textsf{T}}\textbf{Y}_k)+\textbf{L}(\textbf{S}_k^{\textsf{T}}\textbf{Y}_k)^{\textsf{T}} \end{bmatrix}^{-1} \textbf{A}_k^{\textsf{T}}, \nonumber \\ \end{aligned}$$
(34)

where \(\textbf{A}_k=[\textbf{S}_k, \textbf{Y}_k]\) as before. This form of \(\textbf{B}_k\) has the advantage that all involved quantities can be obtained as submatrices of the product \(\textbf{A}_k^{\textsf{T}}\textbf{A}_k\).
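The representation (34) can be validated against the recursive update (30). The following sketch (our own synthetic data, with \(k=m\) as in the proof of Theorem 3) applies m PSB updates starting from \(\gamma \textbf{I}\) and compares the result with the compact form:

```python
import numpy as np

rng = np.random.default_rng(6)
n, m, gamma = 25, 4, 1.5

S = rng.standard_normal((n, m))
Y = rng.standard_normal((n, m))

# Reference: apply the recursive PSB update (30) m times from gamma * I.
B = gamma * np.eye(n)
for i in range(m):
    s, y = S[:, i], Y[:, i]
    r = y - B @ s
    ss = s @ s
    B = B + (np.outer(r, s) + np.outer(s, r)) / ss \
          - (r @ s) / ss**2 * np.outer(s, s)

# Compact representation (34).
SY, SS = S.T @ Y, S.T @ S
U = np.triu(SS)                                  # non-strict upper triangle
P = np.diag(np.diag(SY)) + gamma * np.diag(np.diag(SS)) \
    + np.tril(SY, -1) + np.tril(SY, -1).T
K = np.block([[np.zeros((m, m)), U], [U.T, P]])
Ak = np.hstack([S, Y])
Bc = gamma * np.eye(n) + Ak @ np.linalg.solve(K, Ak.T)

assert np.allclose(B, Bc)
```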

4.3.1 Updating and complexity

As before, the L-PSB quantities can be updated in a similar fashion to the L-BFGS case; see Sect. 4.1. We again maintain the quantities

$$\begin{aligned} \textbf{S}_k^{\textsf{T}}\textbf{S}_k,\,\textbf{S}_k^{\textsf{T}}\textbf{Y}_k,\, \textbf{Y}_k^{\textsf{T}}\textbf{Y}_k\in \mathbb {R}^{m\times m} \quad \text {and}\quad \textbf{S}_k^{\textsf{T}}\textbf{g}_k,\,\textbf{Y}_k^{\textsf{T}}\textbf{g}_k\in \mathbb {R}^m. \end{aligned}$$
(35)

These can be updated as before and used to compute the quasi-Newton direction via the inverse formula (16). The complexity of the L-PSB step equals that of L-BFGS.

5 Numerical experiments

The benchmark implementation described here can be found online at https://github.com/dmsteck/paper-regularized-qn-benchmark [27].

In this section, we compare a selection of regularized quasi-Newton methods (Algorithm 1) amongst each other and with existing L-BFGS type line search and trust region algorithms from the literature.

Algorithms were tested on all large-scale (\(n\ge 1000\)) problems from the CUTEst collection [14]. The implementation was done in Python 3 using the PyCUTEst interface [13]. All problems were solved from the initial points supplied by the library. We excluded test problems on which all algorithms failed within the threshold of 100,000 iterations (see below). We also omitted FLETCBV2 because its initial point is a stationary point. The final test set after these considerations consists of 77 problems.

The results for different algorithms are compared using performance profiles [11] based on the number of function evaluations. Note that the regularization methods evaluate the function exactly once per successful or unsuccessful step, so that the number of function evaluations equals the number of iterations. Furthermore, aside from function or gradient evaluations, all tested methods have a similar computational complexity per step (see [2, 19] and Sect. 4), so that function evaluations provide a simple yet meaningful baseline metric.

Note that we did not account for gradient evaluations in our comparison; the regularization methods (and the trust-region comparison method in Sect. 5.2) evaluate \(\nabla f\) exactly once in every successful iteration, whereas Wolfe-based line search methods evaluate \(\nabla f\) within the inner line search loop. Hence, accounting for gradient evaluations would benefit many of our methods in the subsequent comparisons. However, to keep things simple, we have avoided a more granular breakdown and focused exclusively on function evaluations.

Whenever an algorithm did not solve a particular problem to within tolerance (see below), the number of function evaluations was set to \(+\infty \) for the purpose of comparison.
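The profiles follow the standard Dolan–Moré construction [11]; a minimal sketch, with failed runs encoded as \(+\infty \) as described above:

```python
import numpy as np

def performance_profile(T, taus):
    """Dolan-More performance profile from a cost matrix T.

    T[p, s] is the number of function evaluations solver s needed on
    problem p (np.inf if the run failed).  Returns rho with
    rho[i, s] = fraction of problems on which solver s was within a
    factor taus[i] of the best solver on that problem.
    """
    best = np.min(T, axis=1, keepdims=True)   # best cost per problem
    ratios = T / best                         # performance ratios
    n_prob = T.shape[0]
    return np.array([(ratios <= tau).sum(axis=0) / n_prob for tau in taus])
```

Since problems on which every algorithm failed are excluded from the test set, each row of `T` contains at least one finite entry and the ratios are well defined.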

5.1 Comparison of regularized limited memory methods

We implemented the following four regularization-based algorithms:

regLBFGS: Algorithm 1 using the L-BFGS technique as set out in Sect. 4.1;

regLBFGSsec: Algorithm 1 using the regularized secant version of L-BFGS as discussed in Remark 2 (see also [28]);

regLSR1: Algorithm 1 using the L-SR1 technique as set out in Sect. 4.2;

regLPSB: Algorithm 1 using the L-PSB technique as set out in Sect. 4.3.

The implementations all use the same hyperparameters

$$\begin{aligned} m=5, \quad \mu _0=1, \quad p_{\min }=c_1=10^{-4}, \quad c_2=0.9, \quad \sigma _1 = 0.5, \quad \sigma _2 = 4. \end{aligned}$$
(36)

To guarantee well-definedness, regLBFGS and regLBFGSsec are implemented using the cautious updating scheme (23) with \(\varepsilon :=10^{-8}\). The regLSR1 and regLPSB algorithms benefit from indefinite Hessian approximations [5, 7] and therefore were not combined with the cautious updating scheme. However, for these methods, the cautious scheme was still applied to the update of the rolling initial approximation (37); see below.
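The precise form of (23) appears earlier in the paper; a common variant of such a cautious test (in the spirit of Li and Fukushima), shown here purely as an illustrative sketch, accepts a pair only when its curvature is sufficiently positive:

```python
import numpy as np

EPS = 1e-8  # the paper's epsilon for the cautious scheme

def cautious_accept(s, y, eps=EPS):
    """Accept the pair (s, y) only if y^T s >= eps * ||s||^2.

    This keeps the L-BFGS approximation positive definite even when
    the objective is nonconvex; rejected pairs are simply not stored.
    The exact test (23) in the paper may differ in detail.
    """
    return float(y @ s) >= eps * float(s @ s)
```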

Inspired by a technique from [2], all algorithms begin with a single Moré–Thuente line search along the normalized negative gradient direction prior to the main iteration loop (see Sect. 5.2 for more details). This has the advantage of providing an initial memory pair \((\textbf{s}_0, \textbf{y}_0)\) that passes the cautious update check (23), and of reducing the impact of any initial backtracking on the iteration counts.

The algorithms were terminated as soon as one of the following conditions was satisfied:

$$\begin{aligned} \Vert \textbf{g}_k\Vert _{\infty }<10^{-4}, \qquad k \ge 10^5, \qquad \text {or}\quad \mu _k > 10^{15}. \end{aligned}$$

The initial estimate \(\textbf{B}_{0,k}\) in step k is defined by the standard formula

$$\begin{aligned} \textbf{B}_{0,k}= \gamma _k \textbf{I}, \qquad \gamma _k = \frac{\textbf{y}_k^{\textsf{T}}\textbf{y}_k}{\textbf{y}_k^{\textsf{T}}\textbf{s}_k}. \end{aligned}$$
(37)
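A guarded evaluation of (37) might look as follows; the fallback value used when \(\textbf{y}_k^{\textsf{T}}\textbf{s}_k \le 0\) is an assumption of this sketch, not taken from the paper.

```python
import numpy as np

def initial_scaling(s, y, gamma_fallback=1.0):
    """Scaling gamma_k for the initial matrix B_{0,k} = gamma_k * I,
    computed via (37) as (y^T y) / (y^T s).

    When y^T s is not positive (e.g. the most recent pair failed the
    cautious check), a default scaling is returned instead.
    """
    ys = float(y @ s)
    if ys <= 0.0:
        return gamma_fallback
    return float(y @ y) / ys
```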

In addition, we adopted a lower threshold \(\mu _{\min }:=10^{-4}\) for the regularization parameter. This improved the practical behavior of the method (particularly in the L-BFGS case) and also prevented the regularization parameter from becoming zero in limited-precision arithmetic (Fig. 1).

It may seem that the above choices lead to a preference for large regularization parameters over small ones and could therefore impede fast asymptotic convergence. What we have found empirically is that Algorithm 1 (with L-BFGS) often behaves best when the regularization parameter is changed infrequently. This suggests that the parameter should be increased sharply when necessary (to avoid repeated increases), and decreased only when the step quality is very good. This is reflected in our choice of parameters.

Note also that limited memory methods rarely achieve actual superlinear convergence; the typical behavior is asymptotically linear [19], and classical results for inexact Newton methods (e.g., [23, Thm. 7.1]) indicate that a small but non-decaying value of \(\mu _k\) will typically preserve linear convergence. This indicates that the choices made here are sound from a theoretical point of view.

Comparable studies in other papers [2, 28] indicate that regularized methods may benefit from a nonmonotonicity strategy. Therefore, and to obtain a larger dataset, we also implemented nonmonotone versions of all algorithms, where \(M:=8\) was chosen as the nonmonotonicity offset; this was incorporated into the methods by replacing the reference value \(f(\textbf{x}_k)\) in the regularization control (8) and the line search routines by \(\max _{0\le i < M} f(\textbf{x}_{k-i})\) for \(k\ge M\). The initial steps \(k=0,\ldots ,M-1\) were treated without modification.
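The reference-value bookkeeping described above can be sketched with a bounded history; \(M=8\) as in the experiments, and the class name is purely illustrative.

```python
from collections import deque

class NonmonotoneReference:
    """Maintains the reference value max_{0 <= i < M} f(x_{k-i}) for
    the nonmonotone acceptance test; for k < M the plain monotone
    value f(x_k) is used, matching the text."""

    def __init__(self, M=8):
        self.M = M
        self.hist = deque(maxlen=M)  # last M function values
        self.k = -1                  # current iteration index

    def push(self, fk):
        """Record f(x_k) at the start of iteration k."""
        self.hist.append(fk)
        self.k += 1

    def reference(self):
        if self.k < self.M:
            return self.hist[-1]     # steps k = 0, ..., M-1: monotone
        return max(self.hist)
```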

Fig. 1 Performance profiles based on the number of function evaluations for the four algorithms from Sect. 5.1: monotone case (left), nonmonotone case (right)

Fig. 2 Performance profiles based on the number of function evaluations for the four algorithms from Sect. 5.1: monotone versus nonmonotone (index n) algorithms

Figure 2 illustrates the relative behavior of the monotone and nonmonotone implementations. All algorithms seem to benefit from the nonmonotonicity strategy. It should be emphasized that our strategy is rather simple, but we believe it is sufficient to illustrate the general picture.

Overall, somewhat unsurprisingly, L-BFGS turns out to be by far the most efficient quasi-Newton scheme even in the context of regularization. The regularized variants of L-SR1 and L-PSB are moderately competitive but fall short of the overall performance of regLBFGS and regLBFGSsec.

An interesting observation we made during our testing is that L-SR1 and, in particular, L-PSB were actually more efficient when used with a more “optimistic” regularization scheme (i.e., lower regularization parameters). This is somewhat surprising because these methods generate indefinite Hessian approximations which should, intuitively, benefit the most from regularization; on the other hand, L-BFGS generates an approximation which is positive definite anyway, which suggests that regularization may be less necessary here. The numerical evidence we observed contradicts this intuition.

We can only give a partial explanation for this phenomenon. It is well-known that BFGS and L-BFGS are related to the classical conjugate gradient method [21], which suggests that L-BFGS imposes some kind of relationship (a generalized “conjugacy”) on successive search directions (see also the discussion after [2, Eq. 65]). We are unaware of a rigorous definition of such a property, but the relationship of successive search directions may be preserved in a certain way when L-BFGS is used with a regularization parameter that changes infrequently. On the other hand, L-SR1 and L-PSB are generally considered to generate more accurate approximations of the exact Hessian (especially when it is indefinite), which indicates that these methods behave more similarly to a conventional Newtonian algorithm and therefore benefit from a quicker reduction of regularization parameters.

The regularized secant version regLBFGSsec is interesting because it is rather simple to implement (by using the standard two-loop recursion) yet rivals the robustness of regLBFGS; see Fig. 1.

Table 1 Average proportion of accepted steps and total problems solved for all algorithms from Sect. 5.1

Finally, Table 1 shows the proportion of accepted steps and the total number of solved problems for all four regularization-based algorithms. The L-BFGS algorithms stand out for their high number of solved problems overall, and acceptance ratios of around 99% in the nonmonotone case. Interestingly, regLPSB achieves around 99% acceptance in the monotone case, dropping to around 72% for the nonmonotone implementation.

5.2 Comparison to existing algorithms

Let us now measure regLBFGS against relevant algorithms available in the literature. The “reference” algorithms we use are:

armijoLBFGS: the ordinary L-BFGS method with Armijo line search and the cautious updating scheme (23);

wolfeLBFGS: the Liu–Nocedal L-BFGS method [19] with Moré–Thuente line search [20];

eigLBFGS: a slightly simplified version of the EIG\((\infty , 2)\) trust region L-BFGS algorithm from [2].

The Armijo search uses standard backtracking by repeatedly halving the step size \(t_k\) until

$$\begin{aligned} f(\textbf{x}_k + t_k \textbf{d}_k) \le f(\textbf{x}_k) + c_1 t_k \textbf{g}_k^{\textsf{T}}\textbf{d}_k, \end{aligned}$$

where (in this context) \(\textbf{d}_k\) is the quasi-Newton step. The Moré–Thuente line search uses the implementation of Diane O’Leary [24], translated into Python. It terminates when

$$\begin{aligned} f(\textbf{x}_k + t_k \textbf{d}_k) \le f(\textbf{x}_k) + c_1 t_k \textbf{g}_k^{\textsf{T}}\textbf{d}_k \quad \text {and}\quad | \nabla f(\textbf{x}_k + t_k \textbf{d}_k)^{\textsf{T}} \textbf{d}_k | \le -0.9\, \textbf{g}_k^{\textsf{T}}\textbf{d}_k. \end{aligned}$$
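For concreteness, the backtracking Armijo search described above can be sketched as follows; the iteration cap is an assumption of this sketch, and the Moré–Thuente search (which additionally enforces the curvature condition) is considerably more involved.

```python
import numpy as np

def armijo_backtracking(f, xk, fk, gk, dk, c1=1e-4, t0=1.0, max_iter=60):
    """Backtracking by repeated halving until the Armijo condition
    f(x_k + t d_k) <= f(x_k) + c1 * t * g_k^T d_k holds.

    Returns the accepted step size and new function value, or
    (None, fk) if no acceptable step is found within max_iter halvings.
    """
    slope = float(gk @ dk)        # directional derivative g_k^T d_k < 0
    t = t0
    for _ in range(max_iter):
        f_new = f(xk + t * dk)
        if f_new <= fk + c1 * t * slope:
            return t, f_new
        t *= 0.5                  # halve the step, as in the text
    return None, fk
```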

The eigLBFGS algorithm is based on the EIG\((\infty , 2)\) implementation available at https://gratton.perso.enseeiht.fr/LBFGS/index.html. In order to make the comparison fair, we have slightly simplified this algorithm by replacing the two-stage initial line search from EIG\((\infty , 2)\) with a single Moré–Thuente search along the normalized negative gradient direction (which is consistent with the implementation of the regularization methods; see Sect. 5.1). Furthermore, the stopping criteria, cautious update mechanism, and trust region control parameters of EIG\((\infty , 2)\) were brought in line with the other implementations.

Fig. 3 Performance profiles based on the number of function evaluations for regLBFGS and the three algorithms from Sect. 5.2: monotone case (left), nonmonotone case (right)

Note that a nonmonotone implementation of the EIG\((\infty , 2)\) algorithm from [2] is not available, so we have excluded it from the corresponding comparisons.

The algorithms in this section all use the stopping criteria

$$\begin{aligned} \Vert \textbf{g}_k\Vert _{\infty }<10^{-4}, \qquad k \ge 10^5, \qquad \text {or}\quad {\left\{ \begin{array}{ll} \Delta _k< 10^{-15}, \\ t_k < 10^{-15}, \end{array}\right. } \end{aligned}$$

depending on whether the algorithm is of line search or trust region type. Here, \(t_k\) is the line search step size and \(\Delta _k\) denotes the trust-region radius.

Figure 3 illustrates the performance of the three algorithms mentioned above and regLBFGS. The L-BFGS algorithm with Moré–Thuente line search [20] is competitive on the fastest problems. Similar to the results in [2], however, we found that this and similar Wolfe–Powell based algorithms were noticeably less efficient than others due to the excessive number of function evaluations.

regLBFGS and eigLBFGS perform very similarly in the monotone case, with eigLBFGS (the algorithm based on [2]) attaining a slight advantage. This is not entirely surprising, as regularization can be seen as an approximation of trust-region algorithms. On the other hand, eigLBFGS requires a (low-dimensional) eigenvalue decomposition in every iteration where the trust region is active, whereas regLBFGS only solves a symmetric linear equation.

Fig. 4 Performance profiles based on the number of function evaluations for regLBFGS and the three algorithms from Sect. 5.2: monotone versus nonmonotone (index n) algorithms

Figure 4 compares the behavior of monotone and nonmonotone algorithms. The nonmonotone version of regLBFGS seems to outperform both eigLBFGS and nonmonotone versions of armijoLBFGS and wolfeLBFGS (see Fig. 3).

Note again that our comparison above is based exclusively on function evaluations, not CPU times. It may be interesting to also conduct an analysis of CPU times, but this would effectively require another programming language due to the lack of optimizing compilation in languages like Python or MATLAB, which incurs significant overhead on loops and repeated assignment operations. We anticipate that realistic CPU times would slightly favor the line search L-BFGS methods due to the bookkeeping effort associated with limited memory updating in the regularized methods (see Sect. 4.1).

Table 2 Average proportion of accepted steps and total problems solved for all algorithms from Sect. 5.2 (“non-m.” = non-monotone)

Finally, Table 2 shows the average ratio of accepted steps and the total number of solved problems for regLBFGS and all algorithms from this section. The interpolation-based Moré–Thuente line search achieves around 95% acceptance in both the monotone and nonmonotone implementations. Somewhat unsurprisingly, the trust-region based eigLBFGS algorithm achieves the highest acceptance rate in the monotone case. regLBFGS again stands out with 99% acceptance in the nonmonotone case.

Remark 3

(Further improvements) It is possible to incorporate further modifications and improvements into the regularized quasi-Newton schemes, but we have abstained from doing so in order to facilitate a fair comparison. For instance, it may be beneficial to update the quasi-Newton information in rejected steps since the trial function value and gradient provide meaningful information [28]. Note that this technique is covered by the framework of Algorithm 1 since we allow \(\textbf{B}_k\) to be chosen anew in each iteration.

6 Final remarks

The results and numerical evidence in this paper demonstrate conclusively that regularization is a powerful globalization technique for limited memory quasi-Newton methods.

The numerical results in particular indicate that regularization techniques can substantially improve the efficiency and robustness of L-BFGS on large-scale nonlinear problems or when nonmonotonicity strategies are employed. An intuitive explanation of this phenomenon lies in the fact that regularization “stabilizes” the Hessian approximation in the sense that the condition number becomes smaller, which may make the method less susceptible to step jumps or “discontinuities” induced by nonmonotonicity or extreme nonlinearity.

We hope that the findings presented here will facilitate more research into these techniques, for example, on quantitative convergence results or on how to integrate regularization with BFGS in a full-memory context.