Implementing and modifying Broyden class updates for large scale optimization

We consider Broyden class updates for large scale optimization problems in n dimensions, restricting attention to the case when the initial second derivative approximation is the identity matrix. Under this assumption we present an implementation of the Broyden class based on a coordinate transformation on each iteration. It requires only $2nk + O(k^2) + O(n)$ multiplications on the kth iteration and stores $nK + O(K^2) + O(n)$ numbers, where K is the total number of iterations. We investigate a modification of this algorithm by a scaling approach and show a substantial improvement in performance over the BFGS method. We also study several adaptations of the new implementation to the limited memory situation, presenting algorithms that work with a fixed amount of storage independent of the number of iterations. We show that one such algorithm retains the property of quadratic termination. The practical performance of the new methods is compared with the performance of Nocedal's (Math Comput 35:773–782, 1980) method, which is considered the benchmark in limited memory algorithms. The tests show that the new algorithms can be significantly more efficient than Nocedal's method. Finally, we show how a scaling technique can significantly improve both Nocedal's method and the new generalized conjugate gradient algorithm.


Introduction
There is a variety of methods for unconstrained optimization calculations. In general the following problem is studied: given an at least twice continuously differentiable objective function f in n unknowns, we seek its minimum for x on a domain or on the whole n dimensional space. As an important step towards this, we seek a stationary point x* where the gradient of f vanishes. This stationary point is approached iteratively, by a sequence of points x_k in n dimensions, beginning with an initial point x_0. We go from step to step along a so-called search direction, without using the Hessian matrix of the objective function.
However, in the so-called quasi-Newton algorithms for large scale unconstrained optimization calculations, an approximation B k = (H k ) −1 to the second derivative matrix is used successfully for the computation of the search direction d k .
An early and influential method for nonlinear optimization is the DFP (Davidon-Fletcher-Powell) algorithm, a "variable metric method"; such methods had a huge impact on optimization. It is a predecessor of the BFGS scheme that we use in this article and, like that scheme, uses derivative information (Fletcher and Powell [2], Powell [6-12]).
As the stopping criterion in our algorithms we use the condition that the gradient at x_k, which we shall denote by g_k, has length at most ε; this is reached after only finitely many steps.

Algorithm 1
Step 0 Let any starting vector x 0 and any positive definite symmetric matrix H 0 be given, often the identity matrix. Set k ∶= 0.

Step 1
If the stopping criterion ‖g_k‖_2 ≤ ε for a small positive ε is satisfied then stop. Else: calculate the search direction from formula (1.1).

Step 2 Compute the vectors δ_k := x_{k+1} − x_k = α_k d_k and γ_k := g_{k+1} − g_k, where the steplength α_k is determined by a line search that reduces the value of the objective function f and provides δ_k^T γ_k > 0 (e.g., the Wolfe conditions).
Step 3 Form the matrix H_{k+1} by applying the usual Broyden class update formula, where its parameter φ_k is chosen so that H_{k+1} is positive definite. The search direction formula referred to in Step 1 is

(1.1)  d_k := −H_k g_k.

Here I is the identity matrix. Then increase k by one and go back to Step 1. In Step 3 we restrict the parameter φ_k to choices giving a positive definite matrix H_{k+1}, since this ensures that the search directions calculated in Step 1 are downhill.
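The displayed Broyden class update formula itself did not survive reproduction here. For reference, a standard way of writing the family of updates to the inverse approximation H_k, in the notation δ_k = x_{k+1} − x_k, γ_k = g_{k+1} − g_k (the parameterization convention used by the paper may differ), is:

```latex
H_{k+1}^{\mathrm{BFGS}}
  = \Bigl(I - \tfrac{\delta_k \gamma_k^T}{\delta_k^T \gamma_k}\Bigr) H_k
    \Bigl(I - \tfrac{\gamma_k \delta_k^T}{\delta_k^T \gamma_k}\Bigr)
    + \tfrac{\delta_k \delta_k^T}{\delta_k^T \gamma_k},
\qquad
H_{k+1}^{\mathrm{DFP}}
  = H_k - \tfrac{H_k \gamma_k \gamma_k^T H_k}{\gamma_k^T H_k \gamma_k}
    + \tfrac{\delta_k \delta_k^T}{\delta_k^T \gamma_k},
\qquad
H_{k+1}(\phi_k) = (1-\phi_k)\, H_{k+1}^{\mathrm{BFGS}} + \phi_k\, H_{k+1}^{\mathrm{DFP}}.
```

With this parameterization φ_k = 0 recovers the BFGS update, consistent with the choice φ_k = 0 made in the numerical experiments of Sect. 3.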
For large scale problems, that is to say, large n, the standard method of Algorithm 1 has the unfavourable feature of requiring O(n²) operations per iteration and storage for n²/2 + O(n) numbers. The limited memory updating technique presented by Nocedal [4] computes a second derivative approximation by updating a simple matrix (usually the identity or some other diagonal matrix) using the δ and γ vectors from the most recent m iterations, where m is a prescribed integer. Nocedal's method stores 2nm + O(n) numbers and requires 4nm + O(n) + O(m) multiplications per iteration.
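For comparison, Nocedal's method never forms H_k explicitly: the product H_k g_k is obtained by the well-known two-loop recursion over the m stored (δ, γ) pairs, which accounts for the multiplication count quoted above. A minimal sketch (function and variable names are ours, not the paper's):

```python
import numpy as np

def two_loop(g, deltas, gammas, h0=1.0):
    """Two-loop recursion: returns H_k @ g, where H_k is the limited memory
    BFGS inverse-Hessian approximation built from the stored pairs.
    deltas[i] = x_{i+1} - x_i, gammas[i] = g_{i+1} - g_i (most recent last);
    h0 scales the initial diagonal matrix H_0 = h0 * I."""
    q = g.copy()
    rhos = [1.0 / (d @ y) for d, y in zip(deltas, gammas)]
    alphas = []
    # backward pass: most recent pair first
    for d, y, rho in zip(reversed(deltas), reversed(gammas), reversed(rhos)):
        a = rho * (d @ q)
        alphas.append(a)
        q -= a * y
    r = h0 * q
    # forward pass: oldest pair first
    for (d, y, rho), a in zip(zip(deltas, gammas, rhos), reversed(alphas)):
        b = rho * (y @ r)
        r += (a - b) * d
    return r
```

Each pass costs 2nm multiplications plus lower order terms, matching the 4nm + O(n) + O(m) count.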
In Sect. 2 of this paper we present an implementation of Broyden class updates that, for H_0 = I, stores only one new n-vector per iteration. Our approach differs from the work quoted above by following a geometrical motivation based on Lemma 1 below, which uses the span of the first k gradient vectors to establish certain invariance properties with respect to multiplication by H_k for the algorithms that follow, rather than relying on algebraic simplifications, and by applying to the entire Broyden class.
This algorithm is the basis for the investigations of this paper. We demonstrate that it is computationally efficient [requiring only 2nk + O(k²) + O(n) multiplications on the kth iteration] and leads in a natural way to a modification by a scaling technique that, in our test cases, substantially reduces the number of function calls required to find the solution within a given accuracy. These numerical results are presented in Sect. 3. The reduction can be close to 60 percent. Section 4 offers a number of limited memory modifications of the algorithms presented in Sect. 2. The numerical results presented in Sect. 5 indicate that these methods are superior to Nocedal's method. In Sect. 6 we apply a scaling technique to both Nocedal's method and the ones developed in Sect. 4, arriving at the surprising result that scaling is so beneficial to Nocedal's method that it closes the performance gap observed in Sect. 5.

A new implementation of Broyden class updates
The algorithm presented in this section relies on the simplifications that occur if the initial second derivative approximation B_0 is the identity matrix. We note that, due to the invariance properties of the Broyden class updates with respect to linear transformations of the variables, any other positive definite choice of B_0 is equivalent to setting B_0 = I and scaling the axes by a corresponding transformation, so that such choices of B_0 can be treated within our framework. We note that other choices of B_0 are uncommon in practical large scale calculations.
The following lemma and the ensuing comments form the basis for our implementation.
Lemma 1 Let Algorithm 1 with H_0 = I be applied to a twice continuously differentiable function f. Define the subspace S_k := span{g_0, …, g_k}. Then, for all k, the inclusion δ_k ∈ S_k is satisfied. Let in addition the vectors s and t belong to S_k and to the orthogonal complement of S_k, respectively. Then we have the relations B_k s ∈ S_k, H_k s ∈ S_k and B_k t = H_k t = t. Proof We prove the lemma by induction over k, noting that for k = 0 the statements (2.3)-(2.5) follow directly from δ_0 being parallel to d_0 := −g_0 and B_0 = H_0 = I. We now assume that Eqs. (2.3)-(2.5) hold for k and consider them for k + 1. By the induction hypothesis we have the inclusions δ_k ∈ S_k ⊆ S_{k+1} and, by definition, the vector γ_k in (1.2) is also contained in S_{k+1}. Using B_k δ_k, γ_k ∈ S_{k+1}, we find B_{k+1} t = B_k t = t for any vector t belonging to the orthogonal complement of S_{k+1}, which is equivalent to H_{k+1} t = t.
Hence, since B_{k+1} is symmetric, orthogonality yields (2.4) with k replaced by k + 1 for any vector s ∈ S_{k+1}. Finally the definition of δ_{k+1} and the second inclusion of (2.4) with k replaced by k + 1 give δ_{k+1} ∈ S_{k+1}. ◻

Let the conditions of Lemma 1 be satisfied; by assumption we begin with the identity matrix as the start matrix. This is an essential feature of our method.
We define ν_k := dim S_k and let Q_k be an orthogonal matrix whose first ν_k columns span the subspace S_k. Consider the change of variables x′ = Q_k^T x, which suggests the definitions δ_k′ := Q_k^T δ_k, γ_k′ := Q_k^T γ_k and g_k′ := Q_k^T g_k. Thus, using Lemma 1, we find that Q_k^T H_k Q_k has the block-diagonal form diag(Ĥ_k, I), where Ĥ_k is a ν_k × ν_k matrix, and we apply the updates to Ĥ_k from now on. We notice as well that the last n − ν_k components of δ_k′ and g_k′ are zero. Let us now assume that g_{k+1} ∈ S_k, implying S_{k+1} = S_k and ν_{k+1} = ν_k. Then the last n − ν_k components of g_{k+1}′, and hence also those of γ_k′, are zero. It thus follows that, in transformed variables, the updated matrix is diag(Ĥ_{k+1}, I), where Ĥ_{k+1} is obtained by updating the ν_k × ν_k matrix Ĥ_k instead of H_k, using the ν_k vectors δ̂_k and γ̂_k containing the first ν_k components of δ_k′ and γ_k′, respectively. These choices come once more from Lemma 1.
We now turn to the case g_{k+1} ∉ S_k, giving ν_{k+1} = ν_k + 1, and consider the change of variables defined by Q_{k+1}, where the first ν_k columns of Q_{k+1} agree with those of Q_k and the (ν_k + 1)st column is the normalized component of g_{k+1} orthogonal to S_k. Because the first ν_k columns are the same, the corresponding identities between the transformed quantities hold. In addition, the last n − (ν_k + 1) components of g_{k+1}″, and therefore also those of γ_k″, are zero. Hence the transformed matrix again has block-diagonal form, where Ĥ_{k+1} is now the (ν_k + 1) × (ν_k + 1) matrix obtained by updating using the vectors δ̂_k and γ̂_k formed by the first ν_k + 1 components of δ_k″ and γ_k″. The following new algorithm applies the above observations, exploiting the fact that all the information that is required from the n × n matrix H_k is contained in the first ν_k columns of Q_k and in the ν_k × ν_k matrix Ĥ_k.

We shall take
We decompose the current approximation to the inverse of the Hessian matrix into H_k = Q_k Ĥ_k Q_k^T + R R^T, where the columns of R span the orthogonal complement of the space spanned by the columns of Q_k. Then −H_k g_k gives the above d_k, where we use R^T g_k = 0.
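This decomposition can be checked numerically. The following sketch (dimensions and names are hypothetical) verifies that, for a gradient lying in the column span of Q_k, the full O(n²) product −H_k g_k agrees with the reduced O(nν_k) computation that only uses the small matrix Ĥ_k:

```python
import numpy as np

np.random.seed(1)
n, nu = 8, 3
# random orthonormal basis of IR^n; first nu columns play the role of Q_k,
# the remaining columns (R) span the orthogonal complement
M, _ = np.linalg.qr(np.random.randn(n, n))
Q, R = M[:, :nu], M[:, nu:]
Hhat = np.random.randn(nu, nu)
Hhat = Hhat @ Hhat.T + nu * np.eye(nu)      # symmetric positive definite
H = Q @ Hhat @ Q.T + R @ R.T                # full inverse-Hessian approximation
g = Q @ np.random.randn(nu)                 # a gradient in span(Q_k), so R.T @ g = 0
d_full = -H @ g                             # O(n^2) computation
d_reduced = -Q @ (Hhat @ (Q.T @ g))         # O(n*nu) computation
assert np.allclose(d_full, d_reduced)
```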

Step 1
If the stopping criterion is satisfied then End. Else: calculate the search direction by (2.16).

Step 2
As in Algorithm 1.

Step 3b
If ν_{k+1} = ν_k, then, using the vectors δ̂_k and γ̂_k, apply the Broyden class update to Ĥ_k. Else apply the Broyden class update to the matrix H̄. End. Increase k by one and go back to Step 1.
The algorithm requires storage for nν_K + O((ν_K)²) + O(n) numbers, where K is the total number of iterations and ν_K ≤ K + 1.
By storing appropriate intermediate results, the kth iteration of Algorithm 2 can be done in 3nν_k + O((ν_k)²) + O(n) multiplications in total, where ν_k ≤ k + 1. In fact, we note that the temporary ν_k vectors can be computed from the stated identities (‖·‖ denoting the Euclidean norm), where the elements of the ν_{k+1} × ν_k matrix Q_{k+1}^T Q_k are zero except for the diagonal elements, which are one. The last of these equations is a consequence of (2.17). The value (t_2^k)^T g_k need not be zero. Finally we note that the update of Ĥ_k itself is only an O((ν_k)²) process.
Equation (2.17) shows that the columns of the matrix Q_k are calculated by applying the Gram-Schmidt process to the gradients {g_0, …, g_k}. It is well known that this calculation is ill-conditioned if the gradients are almost linearly dependent, which due to rounding leads to a loss of orthogonality in the columns of Q_k. In the following paragraphs we present a technique overcoming this.
Let at the kth iteration of Algorithm 2 the columns of the n × ν_k matrix G_k be the gradients that define S_k. By construction of Q_k there exists a nonsingular upper triangular ν_k × ν_k matrix, R_k say, so that the identity G_k = Q_k R_k is satisfied, the elements of R_k being the coefficients of the Gram-Schmidt process just described. We exploit this relation by storing G_k and R_k instead of Q_k, noting that the work for computing products of the form Q_k^T v and Q_k w, where v ∈ IR^n and w ∈ IR^{ν_k}, increases only by O((ν_k)²) multiplications.

The update of R_k and G_k is done as follows. We note that the computation of the Cholesky factorisation in this fashion, using orthogonality, does not worsen condition numbers. We also note from (2.17) and (2.18) that the norm ρ_k of the component of g_{k+1} orthogonal to S_k satisfies ρ_k² = ‖g_{k+1}‖² − ‖t_2^k‖², and that the argument of the square root is nonnegative in exact arithmetic, as Q_k contains orthonormal columns. The ill-conditioned calculation of the new column of Q_k is thus avoided; this also saves nν_k multiplications.

Cancellation will, however, occur in the computation of the difference ‖g_{k+1}‖² − ‖t_2^k‖² if g_{k+1} is almost contained in the column span of G_k. In our practical implementation we will, therefore, ignore the component of g_{k+1} orthogonal to the column span of G_k, and thus keep G_k and R_k unchanged, if inequality (2.30) holds, where C ≥ 0 (the value C := 0.1 gave good results in our numerical experiments). With this choice, inequality (2.30) failed on most iterations, so the above modification was hardly ever invoked. We also noticed that choosing C := 0.2 or C := 0.05 led to only very minor changes in the number of function evaluations or iterations required for the solution of our test problems. Since rounding errors could cause the argument of the square root in (2.29) to become negative, we replace (2.30) by (2.31); the two are equivalent in exact arithmetic.
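The acceptance test and the cancellation-avoiding norm computation can be sketched as follows (a hypothetical reading of the tests around (2.30)/(2.31); function and variable names are ours):

```python
import numpy as np

def accept_new_gradient(G, R, g, C=0.1):
    """Decide whether the component of g orthogonal to span(G) is large
    enough to extend the basis, using only G and the triangular factor R
    (G = Q R) so that Q never has to be stored explicitly.
    Returns (accept, t, rho) with t = Q^T g and rho = ||(I - Q Q^T) g||."""
    # t = Q^T g computed as R^{-T} (G^T g); O(n*nu + nu^2) multiplications
    t = np.linalg.solve(R.T, G.T @ g)
    # squared norm of the orthogonal component via Pythagoras; comparing
    # against C^2 ||g||^2 avoids taking the square root of a quantity that
    # rounding may have driven negative (the (2.31) form of the test)
    gap = g @ g - t @ t
    if gap < (C ** 2) * (g @ g):
        return False, t, 0.0        # ignore the tiny orthogonal component
    return True, t, np.sqrt(gap)
```

When the test fails, G and R are left unchanged, exactly as described above.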
Algorithm 3 incorporates the changes suggested above. It thus requires only 2nν_k + O((ν_k)²) + O(n) multiplications on the kth iteration.

Algorithm 3
Step 0 Let any starting vector x_0 be given. If g_0 = 0 then End. Else: set k := 0, ν_0 := 1, and let Ĥ_0 and R_0 be the 1 × 1 matrices given by (Ĥ_0)_{11} = 1 and (R_0)_{11} = ‖g_0‖, respectively. Let the column of the n × 1 matrix G_0 be the vector g_0 and let t_1^0 ∈ IR¹ be given by (t_1^0)_1 = ‖g_0‖.

Step 1
If the stopping criterion is satisfied then stop. Else: calculate the search direction from the formulae (2.18) and (2.32).

Step 2 As in Algorithm 1.

Step 3a
Compute the vector

Step 3b
If ν_{k+1} = ν_k then use the vectors δ̂_k and γ̂_k and apply the Broyden class update to Ĥ_k. Else (i.e., ν_{k+1} = ν_k + 1) apply the Broyden class update to the matrix H̄. End. Increase k by one and go back to Step 1.
The following observations will lead to an alternative formulation of Algorithm 3. This is relevant to the next section, where we present a limited memory update. Let us assume that on iteration k the gradient g_{k+1} has been included into the matrix G_{k+1}. We note that, because of (2.32), the vector δ_{k+1} = α_{k+1} d_{k+1} is contained in the column span of G_{k+1}. Moreover, we find δ_{k+1}^T g_{k+1} ≠ 0, as d_{k+1} is a downhill search direction; thus the columns of the matrices G_{k+1} and (G_k | δ_{k+1}) span identical subspaces. One could replace the last columns of G_{k+1} and R_{k+1} by δ_{k+1} and its coefficient vector with respect to Q_{k+1}, respectively, without changing the underlying matrix Q_{k+1}. Employing this exchange of columns whenever ν_k is increased gives an equivalent algorithm, in which the vectors δ_k rather than the gradients g_k are stored. Note that δ_{k+1} cannot be included into G_{k+1} directly, as it is available only at the beginning of iteration k + 1. Now Algorithm 4 is obtained from Algorithm 3 by inserting Step 2a below after Step 2.

Algorithm 4 As in Algorithm 3, but insert after Step 2 the following
Step 2a If ν_k = ν_{k−1} + 1, then replace the last columns of G_k and R_k by δ_k and −α_k t_3^k, respectively.
As indicated, Algorithm 4 will be important in Sect. 4. Nonetheless, we do not recommend its general use, since inequality (2.30), whose purpose is to make sure that no cancellation occurs for the g_k vectors, does not guarantee this for the corresponding δ_k vectors, which may lead to a loss of orthogonality in the Gram-Schmidt process.
Due to the structure given in (2.10), the new algorithms provide full information on the subspace on which the second derivative approximation is based on the gradient information gathered so far, and on the orthogonal subspace, on which it is still the identity matrix. In the final algorithm of this section we intend to enhance the performance of Algorithm 3 by a scaling technique (as suggested in a similar setting, for instance, in Powell [11] and Siegel [14]) aimed at not distorting the second derivative information already gained on the previous iterations. This is achieved by multiplying the identity matrix in (2.10) by a scalar, which we shall denote by σ_k, thus changing the curvature of our second derivative approximation on the subspace orthogonal to the gradients from 1 to σ_k^{−1}.
Of course, the reasoning for choosing σ_k is heuristic; it applies to a subspace on which no curvature information has been gathered so far. Our approach is to take δ_k^T γ_k / δ_k^T δ_k, as in Oren and Luenberger [5], as an approximation of the one dimensional curvature encountered on iteration k, and to choose (σ_k)^{−1} to be the recursively computed geometric mean of all such curvature approximations encountered so far, thus implementing the assumption that the average curvature encountered so far should be a reasonable indicator for the curvature on the not yet discovered subspace. For such a curvature approximation it is reasonable to take geometric averages to balance between the subspaces, as all updates are essentially products in these methods; no particular subspace is thereby emphasised.
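The recursive geometric mean mentioned here (the expression (2.48) used in Algorithm 5) can be sketched as follows; the function name and the exact recursion are our reading, not the paper's code:

```python
import math

def update_sigma_inv(prev, curv, k):
    """One step of the recursive geometric mean: given the mean `prev` of
    the first k curvature estimates c_i = (delta_i^T gamma_i)/(delta_i^T delta_i)
    and the new estimate `curv` = c_k, return the geometric mean of all k+1
    estimates:  m_k = (m_{k-1}^k * c_k)^(1/(k+1)).  For k = 0 the previous
    value is ignored and m_0 = c_0."""
    return (prev ** k * curv) ** (1.0 / (k + 1))
```

Working in logarithms (a running arithmetic mean of log c_i) is an equivalent and numerically safer alternative when the curvature estimates vary over many orders of magnitude.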
The resulting Algorithm 5 is obtained from Algorithm 3, using the well-known recursive expression for the geometrical mean in (2.48), as follows:

Step 3a
As in Algorithm 3 .

Step 3b
As in Algorithm 3, but:

Numerical results-full BFGS implementations
In this section we compare the performances of Algorithms 3, 3HH (Algorithm 3 with Householder factorisation) and 5, all using φ_k = 0 in the Broyden class updates of Ĥ.
Our test problems, motivated by the need to be able to create problems for both small and very large values of n, are derived from the physical situation described in Siegel [13]. Our stopping condition is the inequality ‖g_k‖_2 ≤ ε, where ε > 0 (in our work we chose ε = 10^{−2}, 10^{−4}, 10^{−6}, 10^{−8}).
All algorithms were implemented in double precision. Our line search routine finds steplengths α_k that satisfy Wolfe's [15] conditions with the choice of constants c_1 = 0.01 and c_2 = 0.9. It uses function gradients as well as function values, and is a slightly modified version of the line search used in the TOLMIN Fortran package (see Powell [11]). We implemented the algorithms so that they were compatible (see the first paragraph of Sect. 2) with choosing the initial second derivative approximation given by (3.2), noting that this initial scaling is often used in practice (Oren and Luenberger [5]). Moreover, it is compatible with the choice of σ_0 in Algorithm 5.
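The Wolfe conditions with these constants can be sketched as a simple check (names are ours; the paper's line search is the TOLMIN routine, not this code). Note that the curvature condition implies γ_k^T d_k ≥ (c_2 − 1) g_k^T d_k > 0 and hence δ_k^T γ_k > 0, as required in Step 2 of Algorithm 1:

```python
import numpy as np

def wolfe_ok(f, grad, x, d, alpha, c1=0.01, c2=0.9):
    """Check the (weak) Wolfe conditions for steplength alpha along the
    descent direction d: sufficient decrease (Armijo) and curvature."""
    fx, gx = f(x), grad(x)
    slope = gx @ d                      # negative for a descent direction
    x_new = x + alpha * d
    armijo = f(x_new) <= fx + c1 * alpha * slope
    curvature = grad(x_new) @ d >= c2 * slope
    return armijo and curvature
```

For the quadratic f(x) = ½‖x‖², starting at x = (1, 1) with d = −g, the full step α = 1 satisfies both conditions, a tiny step fails the curvature condition, and an overlong step fails the Armijo condition.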
For Algorithm 3 and Algorithm 5 we use C := 0.1. Anticipating the well-known robustness of the Householder approach, we use C := 0.0 in Algorithm 3HH. Table 1 gives the results of 20 runs for each pair of n and ε, where the first column entry gives the average number of function evaluations and the second the average number of iterations. We draw the following conclusions: • Generally the performance of the native BFGS algorithm is very similar to both Algorithm 3 with C := 0.1 and Algorithm 3HH (with Householder factorisation). We view this as an indication that the measures introduced to retain stability in the orthogonalization process required by the new algorithms are successful. In fact, in the course of our numerical tests we continuously monitored the orthogonality of the Q_k matrices implied by the matrices R_k and G_k for Algorithm 3 and by the

Derivation of limited memory algorithms
In this section we consider the case when, due to limitations in the storage available on the computer, the number of columns of G is restricted by a number m, say. We therefore have to devise a procedure for removing a column from G and making corresponding changes to R and Ĥ. Our strategy will be to delete the gradient or δ-vector with the smallest iteration index. As a consequence of this procedure the column spaces of the matrices G will no longer agree with the subspaces S_k, so the limited memory algorithm usually generates different search directions, even in exact arithmetic. For the deleting procedure it will turn out to be advantageous if the gradient or δ-vector to be removed is the last column of G.
We therefore replace the Gram-Schmidt process used so far by the "inverse" Gram-Schmidt procedure, which is outlined below. First we consider an algorithm in which gradients are stored.
When we add a column to G_k we define the matrix G_{k+1} := (g_{k+1} | G_k) and the preliminary matrix R_prel^{k+1}. Thus the equation Q_prel^{k+1} R_prel^{k+1} = G_{k+1} implies the same underlying preliminary basis matrix as before. The matrix R_prel^{k+1} possesses a special sparsity structure: its first column is dense, while its remaining columns contain R_k shifted one column to the right. Therefore we can obtain an upper triangular matrix R_{k+1} by premultiplying R_prel^{k+1} by an orthogonal matrix X^k, a product of Givens rotations X_i^k, where X_i^k is different from the identity matrix only in its ith and (i + 1)st rows and is defined by making the (i + 1)st and ith components of the first column of the matrix zero and positive, respectively. Therefore the equation Q_{k+1} R_{k+1} = G_{k+1} defines the new basis matrix. We note that we have to take this change to Q_{k+1} into account when calculating the vectors δ̂_k and γ̂_k and updating the matrix Ĥ_k; see Steps 3a and 3b of Algorithm 6.
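The retriangularization step can be sketched concretely. Under our reading of the construction, the preliminary basis is (Q_k | q_{k+1}) and R_prel carries the coefficients of g_{k+1} in its first column with ρ_k in the last row; sweeping Givens rotations over adjacent rows from the bottom up then restores upper triangular form (all names and the exact layout are our assumptions):

```python
import numpy as np

def inverse_gs_update(Rk, t, rho):
    """Sketch of the inverse Gram-Schmidt update.  Given G_k = Q_k R_k,
    the new matrix G_{k+1} = (g_{k+1} | G_k) satisfies
    G_{k+1} = (Q_k | q_{k+1}) @ Rprel with Rprel = [[t, Rk], [rho, 0]],
    where t = Q_k^T g_{k+1} and rho = ||(I - Q_k Q_k^T) g_{k+1}||.
    Givens rotations on adjacent rows, applied bottom to top, zero the
    first column below its top entry; returns (X, R_new) with X orthogonal
    and R_new = X @ Rprel upper triangular."""
    nu = Rk.shape[0]
    Rprel = np.zeros((nu + 1, nu + 1))
    Rprel[:nu, 0] = t
    Rprel[nu, 0] = rho
    Rprel[:nu, 1:] = Rk
    X = np.eye(nu + 1)
    R = Rprel.copy()
    for i in range(nu - 1, -1, -1):            # rows (i, i+1), bottom to top
        a, b = R[i, 0], R[i + 1, 0]
        r = np.hypot(a, b)
        c, s = (1.0, 0.0) if r == 0.0 else (a / r, b / r)
        rot = np.array([[c, s], [-s, c]])      # zeroes R[i+1, 0]
        R[[i, i + 1], :] = rot @ R[[i, i + 1], :]
        X[[i, i + 1], :] = rot @ X[[i, i + 1], :]
    return X, R
```

The new basis is Q_{k+1} = (Q_k | q_{k+1}) X^T, and each rotation costs O(ν_k) multiplications, so the whole sweep is O((ν_k)²).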
In case δ-vectors are stored we proceed as above, obtaining the matrices G_{k+1} = (g_{k+1} | G_k) and R_{k+1}. We replace the first columns of G_{k+1} and R_{k+1} by δ_{k+1} and −α_{k+1} Ĥ_{k+1} t_1^{k+1}, respectively, once these vectors are available. By premultiplying R_{k+1} by an orthogonal matrix, Y^{k+1} say, we restore R_{k+1} to upper triangular structure, again keeping in mind that this operation changes the underlying basis matrix Q_{k+1}; see Step 2a of Algorithm 7 below.
As required, the updates of G have the property that the last column of G contains the gradient or δ-vector with the smallest iteration number. We now outline our deleting procedure, giving a description for the case that gradients are stored.
Consider the case ν_{k+1} = m + 1 and denote by G_del^{k+1} the matrix formed by the first m columns of G_{k+1} and by R_del^{k+1} the mth principal minor of R_{k+1}, noting that R_del^{k+1} is upper triangular and that the matrix Q_del^{k+1} is formed by the first m columns of Q_{k+1} = G_{k+1}(R_{k+1})^{−1}. We therefore let Ĥ_del^{k+1} be the matrix obtained by deleting the last row and column from Ĥ_{k+1}. We note that for any two vectors v_1 and v_2 in the column space of Q_del^{k+1} we have the identity v_1^T H_{k+1} v_2 = v_1^T H_del^{k+1} v_2, where H_del^{k+1} is defined as in (2.10) with Ĥ_del^{k+1} replacing Ĥ_{k+1}, and the orthogonal complement is spanned by the remaining columns of any orthogonal n × n matrix whose first m + 1 columns agree with those of Q_{k+1}. Our deleting procedure thus leaves the approximation to the inverse Hessian unchanged on the column space of Q_del^{k+1}. Moreover, assuming that Ĥ_{k+1} is positive definite, the matrices Ĥ_del^{k+1} and thus H_del^{k+1} inherit this property, as any principal minor of a positive definite matrix is itself positive definite.
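Both claims about the deleting procedure can be checked numerically; the following sketch (all dimensions and names hypothetical) verifies that the bilinear form is preserved on the retained subspace and that positive definiteness is inherited:

```python
import numpy as np

np.random.seed(5)
n, m = 9, 4
M, _ = np.linalg.qr(np.random.randn(n, n))   # orthonormal basis of IR^n
Q = M[:, :m + 1]                             # first m+1 columns of Q_{k+1}
Hhat = np.random.randn(m + 1, m + 1)
Hhat = Hhat @ Hhat.T + (m + 1) * np.eye(m + 1)   # positive definite
H = Q @ Hhat @ Q.T + M[:, m + 1:] @ M[:, m + 1:].T

Qdel = Q[:, :m]                              # drop the last basis column
Hhat_del = Hhat[:m, :m]                      # delete last row and column
Hdel = Qdel @ Hhat_del @ Qdel.T + M[:, m:] @ M[:, m:].T

v1 = Qdel @ np.random.randn(m)               # vectors in span(Q_del)
v2 = Qdel @ np.random.randn(m)
assert np.isclose(v1 @ H @ v2, v1 @ Hdel @ v2)       # form unchanged
assert np.all(np.linalg.eigvalsh(Hhat_del) > 0)      # PD inherited
```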
Algorithm 6 employs the inverse Gram-Schmidt process and the deleting procedure outlined above.

Step 1
As in Algorithm 3.

Step 2
As in Algorithm 3.

Step 3a
Compute the vector where X k is the product of Givens rotations defined in the paragraph containing equation (4.4).

Step 3b
If ν_{k+1} = ν_k then use the vectors δ̂_k and γ̂_k and apply the Broyden class update to Ĥ_k. Else (i.e., ν_{k+1} = ν_k + 1) apply the Broyden class update to the matrix X^k H̄ (X^k)^T, where H̄ is given by (2.15). End.

Step 4
If ν_{k+1} = m + 1, then delete the last component of the vector t_1^{k+1}, the last column of G_{k+1}, and the last row and column of R_{k+1} and of Ĥ_{k+1} (the updated Ĥ_k), and set ν_{k+1} := m. End.

Increase k by one and go back to Step 1.
Algorithm 7 below (the algorithm that stores δ-vectors rather than gradients) is obtained from Algorithm 6 by replacing the variable name G by Δ and inserting Step 2a after Step 2.
Algorithm 7 As in Algorithm 6, except: Step 2a If Δ was changed on the previous iteration, then replace the first columns of Δ_k and R_k by δ_k and −α_k Ĥ_k t_1^k, respectively, and let the resulting matrices overwrite the original ones. Let Y^k be an orthogonal matrix such that Y^k R_k is upper triangular and redefine the variables accordingly. End.
• As the additional Givens rotations are applied to m-vectors and m × m matrices, they require only O(m²) multiplications. The total number of multiplications per iteration is thus 2nm + O(m²) + O(n) for both Algorithm 6 and Algorithm 7. • Consider Algorithm 6 and assume that on iterations k − 1 and k the test ρ_k ≥ C‖g_{k+1}‖ is satisfied, this being the usual case. Then the vector γ_k is contained in the column span of the matrix G_{k+1} defined in (4.13), as g_{k+1} and g_k are the first two columns of G_{k+1}. Moreover δ_k is also in the vector space spanned by the columns of G_{k+1}, since d_k is calculated from (2.32). The update of Ĥ_k in Step 3b is thus compatible with a Broyden update of the underlying matrix H_k. Hence the resulting matrix H_{k+1} satisfies the quasi-Newton equation. Unfortunately, however, d_{k+1} is not calculated until the deleting procedure of Step 4 has changed the underlying matrix H_{k+1}. A "standard" proof of quadratic termination is thus not possible.
• We now turn to Algorithm 7 and note that, by design of Step 1, d_k, and thus δ_k, are contained in the column span of G_k. Thus the replacement taking place in Step 2a does not change the subspace spanned by the columns of G_k. • By design of Algorithm 7 the vectors δ_k and g_{k+1} are linear combinations of the columns of G_{k+1}. However, the gradient g_k, and thus the vector γ_k, will in general not be contained in this space. Thus the update in Step 3b does not correspond to applying the Broyden formula to H_k, δ_k and γ_k, but we can enforce that the vector used will be in the mentioned space: we replace γ_k by γ_k^P, the projection of γ_k onto the column space of G_{k+1}. Note that the update gives a positive definite matrix H_{k+1} too, since the relation δ_k^T γ_k^P = δ_k^T γ_k > 0 is a consequence of δ_k being in the span of the columns of G_{k+1}. • If we use the BFGS update in Step 3b, corresponding to φ_k = 0, Algorithm 7 has the quadratic termination property, expressed in the following result:

Theorem 1 Let Algorithm 7, with φ_k = 0 in the updating formula employed in Step 3b, be applied to a strictly convex quadratic function f, and let the line searches be exact. Then the algorithm finds the minimum of f in at most L iterations, where L is the number of distinct eigenvalues of the Hessian matrix, A say, of f.
Proof By induction we will show that the search directions d_k generated by Algorithm 7 satisfy the equations (4.23). Under the assumptions made in the statement of the theorem, Algorithm 7 is thus equivalent to the conjugate gradient method, for which quadratic termination in at most L steps is a well known result. By definition, (4.23) holds for k = 0. We now assume that (4.23) has been established for all search directions with iteration index less than or equal to k and consider iteration k. From the theory of the conjugate gradient method, g_{k+1} is orthogonal to the column span of G_k. Thus t_2^k formed in Step 3a is the zero vector by this orthogonality and (4.22). This gives g_{k+1} ≠ 0 unless the minimum has been found, since otherwise d_k would have to vanish. This implies ρ_k = ‖g_{k+1}‖, so the test Not (2.31) [i.e., ‖t_2^k‖² < (1 − C²)‖g_{k+1}‖²] holds. From the structure of (4.1) and the definition of X^k in the paragraph containing Eq. (4.4), we deduce that X^k is the permutation matrix defined by the corresponding equations, the e_i being the ith unit vector. Therefore we obtain in Steps 3a and 3b the relations (4.29). The first row of both sides of (4.29) implies that the first two columns of Q_{k+1} = G_{k+1}(R_{k+1})^{−1} are the vectors ‖g_{k+1}‖^{−1} g_{k+1} and ‖δ_k‖^{−1} δ_k, giving the formula (4.30). Applying the update with φ_k = 0 to the matrix (4.30) and to the vectors δ̂_k and γ̂_k given by (4.31) and (4.27), respectively, we obtain a matrix Ĥ_{k+1} whose first column is given by (4.32). The deletion process of Step 4 changes neither the nonzero elements of the first column of Ĥ_{k+1} nor the first two columns of Q_{k+1}. Thus, recalling (4.28), the new search direction d_{k+1} formed in Step 1 of the next iteration is

(4.33)  d_{k+1} = −Q_{k+1} Ĥ_{k+1} t_1^{k+1},

which, because of the identities above, is equivalent to (4.23) postdated by one iteration, as required. ◻ It is interesting to observe that Algorithm 7 can be seen as a generalization of the conjugate gradient algorithm.
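The quadratic termination statement can be checked numerically with plain linear CG, to which Algorithm 7 reduces under the theorem's assumptions. On a Hessian with L distinct eigenvalues, CG with exact line searches finds the minimizer in at most L steps (a sketch, not the paper's code):

```python
import numpy as np

def cg_minimize(A, b, x0, iters):
    """Linear conjugate gradient applied to f(x) = 0.5 x^T A x - b^T x
    (exact line searches on a strictly convex quadratic).  Returns the
    iterate after at most `iters` steps."""
    x = x0.copy()
    g = A @ x - b                       # gradient of f
    d = -g
    for _ in range(iters):
        if np.linalg.norm(g) < 1e-12:
            break
        alpha = (g @ g) / (d @ (A @ d)) # exact line search steplength
        x = x + alpha * d
        g_new = A @ x - b
        beta = (g_new @ g_new) / (g @ g)
        d = -g_new + beta * d           # Fletcher-Reeves direction update
        g = g_new
    return x
```

With eigenvalues {1, 2, 3} (multiplicities aside), L = 3 and the gradient vanishes after three iterations up to rounding.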
In the limited memory setting we have, unlike in the classic quasi-Newton case, an additional option for dealing with the situation in which the test ‖t_2^k‖² < (1 − C²)‖g_{k+1}‖² fails: we can restart the entire algorithm, since (given that usually m ≪ n) the loss of second derivative information caused by the restart is acceptable. Moreover, failure of the above test may be an indicator of ill-conditioning, in which case a restart would be appropriate anyway.
We build this modification into Algorithm 7 to arrive at the following Algorithm 8, which, due to its importance, we spell out in full detail. We note that it also has the quadratic termination property, since the test Not (2.31) [i.e., ‖t_2^k‖² < (1 − C²)‖g_{k+1}‖²] cannot fail if the objective function is a strictly convex quadratic function.
In order to avoid too many restarts we have also introduced the condition that no restart is carried out unless at least m steps have been performed without a restart.

Algorithm 8
Step 0 As in Step 0 of Algorithm 3, with m ≥ 2.

Step 1
If the stopping criterion is satisfied then stop. Else: calculate the search direction from the formulae (2.18) and (2.32). End.

Step 2
As in Algorithm 5.

Step 2a
As in Algorithm 7. End.

Step 3a
Compute the vector t_2^k given by (2.33). If Not (2.31) holds (i.e., ‖t_2^k‖² < (1 − C²)‖g_{k+1}‖²), then set the variables as in (4.11)-(4.17), where X^k is the product of Givens rotations defined in the paragraph containing equation (4.4), and where the argument of the square root defining ρ_k is nonnegative by the properties of R_k. End.
If (2.31) holds and at least m iterations have been carried out without a restart, then restart the algorithm by increasing k by one, setting ν_k := 1 and letting Ĥ_k and R_k be the 1 × 1 matrices with (Ĥ_k)_{11} = 1 and (R_k)_{11} = ‖g_k‖. Moreover, let the column of the n × 1 matrix G_k be the vector g_k and let t_1^k ∈ IR¹ be given by (t_1^k)_1 = ‖g_k‖. End. Go to Step 1.

Step 3b
If ν_{k+1} = ν_k, then use the vectors δ̂_k and γ̂_k and apply the Broyden class update to Ĥ_k. Else (ν_{k+1} = ν_k + 1), use the vectors δ̂_k and γ̂_k and apply the Broyden class update to the matrix H̄ given by (2.15). End.

Step 4
If ν_{k+1} = m + 1, then delete the last component of the vector t_1^{k+1}, the last column of G_{k+1}, and the last row and column of R_{k+1} and of Ĥ_{k+1}, and set ν_{k+1} := m. End.
Increase k by one and go back to Step 1.

Numerical results-limited memory case
In this section we first compare the performance of Algorithms 6, 7 and 8, all using φ_k = 0, with Nocedal's [4] limited memory BFGS update. Given the poor performance of Algorithm 6, we restrict the comparison for very large problems to Nocedal's method and the generalized conjugate gradient Algorithms 7 and 8.
We use the test problems introduced in Sect. 3, implement all algorithms in double precision and use the same line search as in Sect. 3. Again we implemented all algorithms so that they were compatible with choosing the initial second derivative approximation given by (3.2). When Algorithm 8 restarts, it also updates this initial scaling. For Algorithms 6, 7 and 8 we used C := 0.1. Table 2 gives the results obtained from 20 (for n ≤ 1000) and 10 (for n ≥ 2000) runs for m = 10 and ε = 10^{−6}, where the first column entry gives the average number of function evaluations, the second the average number of iterations and the third the percentage of cases in which the solution was not found within 10,000 function calls (those runs not being included in the averages stated in the first two columns). We observe the following: compared with the method of [3], the number of function calls is often reduced by more than two thirds, whilst at the same time only half the storage is required for the same choice of m.

Scaling techniques for limited memory methods
In this section we ask in how far scaling techniques can be used, in analogy to their positive effect shown in Sect. 3, to further improve the performance of both the generalized conjugate gradient method (Algorithm 7) and Nocedal's limited memory method. It is well known that scaling techniques can offer substantial improvements for Nocedal's method (Liu and Nocedal [3]); we are referring to the L-BFGS method, see also Al-Baali et al. [1]. However, the updating technique in Step 3 of their Algorithm 2.1 is different from the one we use.
Following the line of reasoning of Sect. 2, we obtain the scaling version of Algorithm 7, which we shall refer to as Algorithm 9, by using a geometric average for scaling.

Step 3b
As in Algorithm 7, but, for the case ν_{k+1} = ν_k + 1, apply the update to the matrix diag(Ĥ_k, σ_k) instead of to H̄ given by (2.15).
Step 4 as in Algorithm 7.
For Nocedal's method the idea is to replace the identity matrix as the initial second derivative approximation by σ_k I, where we calculate the series of σ_k in the same way as above. The final Table 3 gives a comparison of Nocedal's limited memory method with the generalized conjugate gradient methods A7, A8 and A9, for ε = 10^{−6} and m = 10.

Conclusions/observations
In line with the results of Liu and Nocedal, applying the scaling approach to Nocedal's limited memory method provides a significant improvement. Algorithm 9 is indeed superior to the original generalized conjugate gradient Algorithm 7; however, the improvement is comparatively small, and it performs worse than Algorithm 8. There is little to choose between Nocedal's method with scaling and Algorithm 8 (generalized conjugate gradient with restarts). We also tried the obvious idea of applying the scaling approach to the generalized conjugate gradient method with restarts of Algorithm 8. Unfortunately this did not give any improvement: we observed that scaling removes the need for restarts in most cases, so the resulting algorithm performs almost identically to Algorithm 9.
In summary, in this paper we introduced a new way of looking at the Broyden class of quasi-Newton methods for the case in which the initial second derivative approximation is the identity matrix. We exploited the resulting computational simplifications to derive several new algorithms (both quasi-Newton and limited memory) that are particularly efficient in housekeeping cost (storage and number of multiplications per iteration) and in the number of iterations and function calls required to find the solution within a given accuracy. Of particular interest is an algorithm that can be viewed as a generalized conjugate gradient method. As with Nocedal's limited memory algorithm, restart and scaling modifications offer improvements for this generalized conjugate gradient method as well.