Symmetric Perry conjugate gradient method

A family of new conjugate gradient methods is proposed based on Perry's idea; for any line search, the methods satisfy the descent property or the sufficient descent property. In addition, based on a scaling technique and a restarting strategy, a family of scaling symmetric Perry conjugate gradient methods with restarting procedures is presented. The memoryless BFGS method and the SCALCG method are special cases of the two families of new methods, respectively, and several concrete new algorithms are suggested. Under Wolfe line searches, the global convergence of the two families of new methods is proven via spectral analysis, for uniformly convex functions and for nonconvex functions. Preliminary numerical comparisons with the CG_DESCENT and SCALCG algorithms show that the new algorithms are effective for large-scale unconstrained optimization problems. Finally, a remark on further research is given.


Introduction
The classical conjugate gradient (CG) method with line search is as follows:

x_{k+1} = x_k + α_k d_k, (1)

where the directions d_k are given by

d_1 = -g_1, d_{k+1} = -g_{k+1} + β_k d_k, (2)

and g_k = g(x_k) = ∇f(x_k). Different choices of the parameter β_k correspond to different CG methods, such as the HS method [15], the FR method [7], the PRP method [22,23], the LS method [16], the PRP+ method [8], the DY method [5] and so on. Several survey articles, such as [11], cover the history of the conjugate gradient method.
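As an illustration, the classical CG scheme above can be sketched in a few lines. This is a hedged sketch: the function names, the backtracking Armijo search standing in for the Wolfe searches analyzed in the paper, and the restart safeguard are our own illustrative choices, not part of the paper.

```python
import numpy as np

def cg_minimize(f, grad, x0, beta_rule="PRP", tol=1e-8, max_iter=2000):
    """Classical nonlinear CG: x_{k+1} = x_k + alpha_k d_k with d_1 = -g_1 and
    d_{k+1} = -g_{k+1} + beta_k d_k. A backtracking Armijo search and a
    restart safeguard stand in for the Wolfe search analyzed in the paper."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    d = -g
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        # backtracking Armijo line search (illustrative only)
        alpha, fx = 1.0, f(x)
        while f(x + alpha * d) > fx + 1e-4 * alpha * (g @ d):
            alpha *= 0.5
        x_new = x + alpha * d
        g_new = grad(x_new)
        y = g_new - g
        if beta_rule == "FR":
            beta = (g_new @ g_new) / (g @ g)
        elif beta_rule == "PRP":
            beta = (g_new @ y) / (g @ g)
        elif beta_rule == "HS":
            beta = (g_new @ y) / (d @ y)
        else:                                  # "DY"
            beta = (g_new @ g_new) / (d @ y)
        d = -g_new + beta * d
        if g_new @ d >= 0:                     # safeguard: restart on ascent
            d = -g_new
        x, g = x_new, g_new
    return x
```

On a strictly convex quadratic, any of the four β_k rules drives the gradient below the tolerance.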
In [17], the Perry conjugate gradient algorithm [21] was generalized and the line search directions were formulated as in (3), where s_k = x_{k+1} - x_k = α_k d_k, y_k = g_{k+1} - g_k, α_k is the step length of the line search and σ is a preset parameter; the matrix P_{k+1} in (4) is called the Perry iteration matrix, and u_k is any vector in R^n such that y_k^T u_k ≠ 0. In [17], the case u_k = y_k was discussed. When u_k = s_k, the CG_DESCENT algorithm [10-12] can be deduced, and the D-L method [4] can be derived under the restriction σ > 0. Recently, we also studied the case u_k = s_k in [19] and presented the RSPDCGs algorithm.
In this paper, a family of symmetric Perry conjugate gradient methods is proposed; that is, the line search directions are formulated by

d_1 = -g_1, d_{k+1} = -Q_{k+1} g_{k+1} = -g_{k+1} + β_k d_k + γ_k y_k, ∀k ≥ 1, (5)

where the coefficients β_k and γ_k are determined by the matrix Q_{k+1} defined in (6), which is called the symmetric Perry iteration matrix. When σ y_k^T s_k > 0, for any line search, the directions defined by (5) satisfy the descent property [1]

d_{k+1}^T g_{k+1} < 0, (7)

or the sufficient descent property [8]

d_{k+1}^T g_{k+1} ≤ -c_0 ||g_{k+1}||^2 (c_0 > 0). (8)

This paper is organized as follows. In Sect. 2, the family of symmetric Perry conjugate gradient methods is deduced; then the spectra of the iteration matrix are analyzed, its sufficient descent property is proved, and several concrete algorithms are proposed. In Sect. 3, the scaling technique and the restarting strategy are applied to the symmetric Perry conjugate gradient methods, and thus a family of scaling Perry conjugate gradient methods with restarting procedures is developed. In Sect. 4, the global convergence of the two families of new methods under the Wolfe line searches is proven by spectral analysis of the conjugate gradient iteration matrix. In Sect. 5, preliminary numerical results are reported. A remark on further research is given in Sect. 6.

The symmetric Perry conjugate gradient method
In [21], A. Perry changed the CG update parameter β_k of the HS conjugate gradient method [15] into

β_k^P = (y_k - s_k)^T g_{k+1} / (y_k^T d_k),

and formulated the line search directions

d_{k+1} = -g_{k+1} + β_k^P d_k = -Q_{k+1} g_{k+1}, k = 1, 2, . . . , (9)

which satisfy y_k^T d_{k+1} = -s_k^T g_{k+1}, where

Q_{k+1} = D_{k+1} + s_k s_k^T / (y_k^T s_k) and D_{k+1} = I - s_k y_k^T / (s_k^T y_k). (10)

In [17], (10) and (9) were replaced by (11) and (12), respectively, where u, v ∈ R^n and σ is a parameter. It then follows from (11) and (12) that the generalized Perry conjugate gradient method ((1) and (3)) can be obtained [17].
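Perry's reformulation can be checked numerically: the direction -g_{k+1} + β_k^P d_k coincides with -Q_{k+1} g_{k+1} for Q_{k+1} as in (10) and satisfies y_k^T d_{k+1} = -s_k^T g_{k+1}. A small sketch with random data (the variable names are ours):

```python
import numpy as np

# Random data standing in for one CG iteration (names are illustrative).
rng = np.random.default_rng(0)
n = 6
alpha = 0.3
d = rng.standard_normal(n)          # d_k
s = alpha * d                       # s_k = x_{k+1} - x_k = alpha_k d_k
y = rng.standard_normal(n)          # y_k = g_{k+1} - g_k
g = rng.standard_normal(n)          # g_{k+1}

# Perry's update parameter and direction
beta_P = (y - s) @ g / (y @ d)
d_next = -g + beta_P * d

# The same direction written as d_{k+1} = -Q_{k+1} g_{k+1}, with Q from (10)
Q = np.eye(n) - np.outer(s, y) / (s @ y) + np.outer(s, s) / (y @ s)
assert np.allclose(d_next, -Q @ g)

# Perry's relation: y_k^T d_{k+1} = -s_k^T g_{k+1}
assert np.isclose(y @ d_next, -s @ g)
```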
In this paper, we choose a suitable u such that Q_{k+1} is a symmetric matrix, so that the line search directions d_k may satisfy (7) or (8). Setting Q_{k+1} = Q_{k+1}^T leads to (13); therefore, the vector u can be taken as in (14), with u^T y_k = aσ. We note that the matrix Q_{k+1} defined by (13) is independent of the nonzero constant a, so we can choose a = 1. Thus, from (13) and (14) we obtain the matrix Q_{k+1} defined by (6). The method formulated by (1) and (5) is called the symmetric Perry conjugate gradient method, denoted SPCG, and the directions generated by (5) are called symmetric Perry conjugate gradient directions; they will be proven to be descent directions in Sect. 2.2.
From the above discussion, a family of new nonlinear conjugate gradient algorithms (Algorithm 1) can be obtained as follows: Step 1. Choose an initial point x_1 and ε ≥ 0. Set k = 1.

Remark 1
In this paper, to ensure convergence of the algorithm, we adopt the Wolfe line search conditions

f(x_k + α_k d_k) ≤ f(x_k) + b_1 α_k g_k^T d_k (15)

and

g(x_k + α_k d_k)^T d_k ≥ b_2 g_k^T d_k, (16)

where 0 < b_1 < b_2 < 1. The stopping criterion ||g_k|| ≤ ε can be replaced by other forms. For different choices of σ, several concrete forms of the algorithm will be discussed in Sect. 2.2.
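The two Wolfe conditions can be expressed as a small acceptance test. The sketch below assumes the standard sufficient-decrease and curvature forms with constants b_1 and b_2; the function and parameter names are ours.

```python
import numpy as np

def satisfies_wolfe(f, grad, x, d, alpha, b1=1e-4, b2=0.9):
    """Check the Wolfe conditions with 0 < b1 < b2 < 1:
        f(x + alpha d) <= f(x) + b1 * alpha * g^T d   (sufficient decrease)
        grad(x + alpha d)^T d >= b2 * g^T d           (curvature)"""
    g_d = grad(x) @ d
    armijo = f(x + alpha * d) <= f(x) + b1 * alpha * g_d
    curvature = grad(x + alpha * d) @ d >= b2 * g_d
    return armijo and curvature
```

On f(x) = ||x||^2/2 from x = 1 along d = -1, the exact minimizing step alpha = 1 passes both tests, while a very small step fails the curvature condition.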

Spectral analysis
Here, we analyze the spectra of the Perry matrix and the symmetric Perry matrix.
Theorem 1 Let P_{k+1} be defined by (4). Then, when σ(y_k^T s_k) ≠ 0, P_{k+1} is a nonsingular matrix and the eigenvalues of P_{k+1} consist of 1 (with multiplicity n - 2), λ_1^{(k+1)} and λ_2^{(k+1)}, where λ_1^{(k+1)} and λ_2^{(k+1)} are given by (17) and (18). Proof From the fundamental algebra formula for the determinant of a rank-two update of the identity, the Perry matrix (4) is nonsingular when σ y_k^T s_k ≠ 0. Since P_{k+1}ξ = ξ for every ξ ∈ span{s_k, y_k}^⊥ ⊂ R^n, the matrix P_{k+1} has the eigenvalue 1 with multiplicity n - 2, corresponding to the eigenvectors ξ ∈ span{s_k, y_k}^⊥. By the relationships between the trace and the eigenvalues of a matrix and between the determinant and the eigenvalues of a matrix, the other two eigenvalues are the roots of a quadratic polynomial, and are thus given by (17) and (18), respectively.
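The structural core of Theorem 1 is that a rank-two correction of the identity built from s_k and y_k fixes every vector orthogonal to span{s_k, y_k}. A numerical sketch, using Perry's original matrix (10) as a representative member of the family (the exact matrix (4) is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 7
s = rng.standard_normal(n)
y = rng.standard_normal(n)

# A representative Perry-type iteration matrix: the identity plus a
# rank-two correction whose column space lies in span{s_k, y_k}.
P = np.eye(n) - np.outer(s, y) / (s @ y) + np.outer(s, s) / (y @ s)

# Any xi orthogonal to both s and y is fixed by P, so 1 is an eigenvalue
# of multiplicity n - 2.
basis, _ = np.linalg.qr(np.column_stack([s, y, rng.standard_normal(n)]))
xi = basis[:, 2]
assert np.allclose(P @ xi, xi)

eigs = np.linalg.eigvals(P)
assert np.sum(np.isclose(eigs, 1.0)) >= n - 2

# The two remaining eigenvalues are tied to P through its trace and
# determinant, as in the proof.
assert np.isclose(eigs.sum().real, np.trace(P))
assert np.isclose(np.prod(eigs).real, np.linalg.det(P))
```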
According to Theorem 1, the following theorem for the symmetric Perry matrix Q_{k+1} defined by (6) can be deduced.

Theorem 2 Let λ_min^{(k+1)} and λ_max^{(k+1)} be the minimum and maximum eigenvalues of Q_{k+1}, respectively, where Q_{k+1} is defined by (6). If σ(y_k^T s_k) > 0, then (21)-(24) hold. Proof When σ(y_k^T s_k) > 0, from (14), (17), (18) and the relations (25) and (26), it can be proven that λ_min^{(k+1)} and λ_max^{(k+1)} are given by (21) and (22), respectively. A simple calculation based on (25) and (26), together with (20), then shows that (23) and (24) hold, which implies that Q_{k+1} is a symmetric positive definite matrix when σ(y_k^T s_k) > 0.
From the above theorem, we can easily obtain the following corollary.
According to (30), let σ = c y_k^T y_k / (s_k^T y_k) with c > 0 and s_k^T y_k ≠ 0. Then it follows from (6) and (30) that the directions defined by (5) with this σ satisfy the sufficient descent property (8) for any function and any line search. Thus, the method defined by (1) and (5) with σ = c y_k^T y_k / (s_k^T y_k) is called the symmetric Perry descent conjugate gradient algorithm, denoted SPDCG, or SPDCG(c) to indicate the dependence on the positive constant c. In particular, due to Corollary 1, when c = 1 the method is called the symmetric Perry descent conjugate gradient algorithm with optimal condition number, denoted SPDOC; the corresponding iteration matrix is denoted Q_{k+1}^{SPDOC}.
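As a hedged numerical check of the descent property, the sketch below uses the memoryless BFGS matrix, which the text identifies as the σ = 1 member of the family: whenever y_k^T s_k > 0 this matrix is symmetric positive definite, so d^T g < 0 for every nonzero gradient.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 8
checked = 0
while checked < 100:
    g, s, y = (rng.standard_normal(n) for _ in range(3))
    sy = s @ y
    if sy <= 0:                # the descent theory needs y_k^T s_k > 0 (Wolfe)
        continue
    # Memoryless BFGS matrix, the sigma = 1 member of the family per the text.
    Q = (np.eye(n) - (np.outer(s, y) + np.outer(y, s)) / sy
         + (1 + (y @ y) / sy) * np.outer(s, s) / sy)
    d = -Q @ g
    assert d @ g < 0           # descent property (7) holds
    checked += 1
```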

Scaling technology and restarting strategy
According to S.S. Oren and E. Spedicato's idea [20], D.F. Shanno applied a scaling technique to the memoryless BFGS update formula (31) and developed self-scaling conjugate gradient algorithms [25]; that is, he transformed the memoryless BFGS update formula (31) into a scaled form. In the same way, the symmetric Perry matrix Q_{k+1} defined by (6) can be scaled as in (37), and the line search directions defined by (5) are then rewritten as (38), from which a family of scaling Perry conjugate gradient methods can be deduced.
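A sketch of a self-scaling memoryless BFGS direction in Shanno's spirit, computed matrix-free and checked against the explicitly formed matrix. The spectral scaling θ = s^T y / y^T y (Oren-Spedicato) is an assumed concrete choice, and the paper's scaled matrix (37) may differ in detail.

```python
import numpy as np

def scaled_mlbfgs_direction(g, s, y, theta=None):
    """Matrix-free direction d = -H g for a scaled memoryless BFGS matrix
        H = theta*I - theta*(s y^T + y s^T)/(y^T s)
              + (1 + theta*(y^T y)/(y^T s)) * (s s^T)/(y^T s),
    with the Oren-Spedicato scaling theta = (s^T y)/(y^T y) by default."""
    sy = s @ y
    if theta is None:
        theta = sy / (y @ y)
    # only inner products and vector updates, no n-by-n matrix is formed
    return (-theta * g
            + theta * ((y @ g) / sy) * s
            + theta * ((s @ g) / sy) * y
            - (1 + theta * (y @ y) / sy) * ((s @ g) / sy) * s)

# Check against the explicitly formed matrix.
rng = np.random.default_rng(2)
n = 5
g, s, y = (rng.standard_normal(n) for _ in range(3))
theta = (s @ y) / (y @ y)
sy = s @ y
H = (theta * np.eye(n)
     - theta * (np.outer(s, y) + np.outer(y, s)) / sy
     + (1 + theta * (y @ y) / sy) * np.outer(s, s) / sy)
assert np.allclose(scaled_mlbfgs_direction(g, s, y), -H @ g)
```

The matrix-free form is the one used in practice, since CG-type methods avoid storing n-by-n matrices.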
Based on the Beale-Powell restarting strategy [24] (see also [2,3,25,26]), we define the following scheme to compute the directions. At the r-th step, we use the directions defined by (38). For k > r, the directions d_{k+1} are computed by the double update scheme (40), with the parameter choices (41) and (42), where σ, ρ and σ̄ are three preset parameters. Examining κ_2(H_{k+1}), the spectral condition number of H_{k+1}, shows that the optimal choices for σ in (41) and for σ̄ in (42) are those given in (46) and (47), respectively, which make κ_2(H_{k+1}) optimal. Let ḡ_{k+1} = H_{r+1} g_{k+1} and ȳ_k = H_{r+1} y_k; then the directions d_{k+1} defined by (40) can be reformulated as (48). Hence, we can introduce the following scaling symmetric Perry conjugate gradient method with restarting procedures (SSPCGRP).
Step 6. If the Powell restarting criterion (39) holds, then calculate the directions d_{k+1} via (38) with different σ and ρ, set y_r = y_k and s_r = s_k (store y_r and s_r), set Nrestart = Nrestart + 1 and k = k + 1, and go to Step 3. Otherwise, go to Step 7.
In Algorithm 2, k and Nrestart record the number of iterations and the number of restarting procedures, respectively.
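The Powell restarting criterion (39) is commonly the near-orthogonality test |g_{k+1}^T g_k| ≥ ν ||g_{k+1}||^2 with ν = 0.2, which is consistent with the bound used later in the proof of Theorem 6; the exact form of (39) is assumed here.

```python
import numpy as np

def powell_restart(g_new, g_old, nu=0.2):
    """Beale-Powell restart test: restart when successive gradients stay far
    from orthogonal, i.e. |g_{k+1}^T g_k| >= nu * ||g_{k+1}||^2; nu = 0.2 is
    the classical choice, assumed here to match criterion (39)."""
    return abs(g_new @ g_old) >= nu * (g_new @ g_new)
```

Exactly orthogonal successive gradients never trigger a restart, while nearly parallel ones do.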
When ρ = 1 and c_1 = 1, the iteration matrix of (38) is defined by (6) with σ = y_k^T y_k / (s_k^T y_k). In this case the SSPCGRP algorithm is called the symmetric Perry descent conjugate gradient method with optimal condition numbers and restarting procedures, denoted SPDOCRP.
When the parameter choices of (46)-(48) used by N. Andrei in [2] are adopted, the SSPCGRP algorithm becomes the SCALCG algorithm with the spectral choice for θ_{k+1} [2]; it is also called the Andrei-Perry conjugate gradient method with restarting procedures.

Convergence
In this section, we analyze the convergence of the symmetric Perry conjugate gradient method (Algorithm 1) and the scaling symmetric Perry conjugate gradient method with restarting procedures (Algorithm 2). For this, we assume that the objective function f(x) satisfies the following assumptions:

H1. f is bounded below in R^n and f is continuously differentiable in a neighborhood N of the level set {x ∈ R^n : f(x) ≤ f(x_1)}.

H2. The gradient g(x) is Lipschitz continuous in N; that is, there exists a constant L > 0 such that ||g(x) - g(x̄)|| ≤ L||x - x̄|| for all x, x̄ ∈ N.

Next, we introduce the spectral condition lemma on global convergence for an objective function satisfying H1 and H2, which comes from [18], Theorem 4.1.

Lemma 1 Let the objective function f(x) satisfy H1 and H2. Assume that the line search directions of a nonlinear conjugate gradient method satisfy

d_k = -M_k g_k, (50)

where M_k is the conjugate gradient iteration matrix, a symmetric positive semidefinite matrix. For a nonlinear conjugate gradient method (1) and (50) satisfying the sufficient descent condition (8), if its line search satisfies the Wolfe conditions (15) and (16) and the spectral condition (51) holds, where Λ_k is the maximum eigenvalue of M_k, then lim inf_{k→∞} ||g_k|| = 0. Moreover, if Λ_k ≤ Λ for all k, where Λ is a positive constant, then lim_{k→∞} ||g_k|| = 0.

Remark 2
If M_k is a symmetric positive definite matrix, then the spectral condition (51) can be rewritten as (52). In fact, by (50) it can be derived that

cos θ_k = (g_k^T M_k g_k) / (||g_k|| ||M_k g_k||) ≥ λ_k / Λ_k,

where θ_k is the angle between d_k and -g_k, and λ_k and Λ_k are the minimum and maximum eigenvalues of M_k, respectively. Zoutendijk's condition (Theorem 2.1 of [8]) then asserts that (52) implies the results of Lemma 1.
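The angle bound in Remark 2 is easy to verify numerically for a random symmetric positive definite M_k (the construction of M is our own toy choice):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
A = rng.standard_normal((n, n))
M = A @ A.T + np.eye(n)       # a symmetric positive definite iteration matrix
g = rng.standard_normal(n)
d = -M @ g                    # direction of the form (50)

lam = np.linalg.eigvalsh(M)   # eigenvalues in ascending order
cos_theta = (-(d @ g)) / (np.linalg.norm(d) * np.linalg.norm(g))
# cos(theta_k) = g^T M g / (||g|| ||M g||) >= lambda_min / Lambda_max
assert cos_theta >= lam[0] / lam[-1] - 1e-12
```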
In what follows, the convergence of the resulting algorithms is proved by estimating the spectral bounds of the iteration matrix and applying Lemma 1; this proof technique is called the spectral method. It should be pointed out that the technique can also be applied to non-symmetric conjugate gradient methods, provided the positive square root of the maximum eigenvalue of M_k^T M_k, that is, the maximum singular value of M_k, is substituted for the maximum eigenvalue of M_k in Lemma 1 (see Theorem 3.1 in [17]).

The convergence for uniformly convex functions
Here, we first prove the global convergence of the symmetric Perry conjugate gradient method (SPCG), the scheme (1) and (5) with Q_{k+1} defined by (6), for uniformly convex functions. For this, we introduce the following basic assumption, which is an equivalent condition for a uniformly convex differentiable function:

H3. There exists a constant μ > 0 such that (g(x) - g(x̄))^T (x - x̄) ≥ μ ||x - x̄||^2 for all x, x̄ ∈ N.
Proof Assume that g_k ≠ 0 for all k ∈ N. Below, by induction, we first prove that the line search direction d_k defined by (5) satisfies the sufficient descent property (8).
It then follows from (16) that s_k^T y_k ≥ -(1 - b_2)α_k d_k^T g_k > 0, so (30) and the assumptions H2 and H3 imply the required bound. Hence, by induction, the sufficient descent property (8) holds. Next, we prove that λ_max^{(k+1)}, the maximum eigenvalue of Q_{k+1} defined by (6), is uniformly bounded above. From the above analysis it can be derived that s_k^T y_k > 0, so from (23) in Theorem 2 it can be deduced that λ_max^{(k+1)} is uniformly bounded. Therefore, Lemma 1 yields lim_{k→∞} ||g_k|| = 0.
Remark 3 Theorem 3 shows that the memoryless BFGS quasi-Newton method and the SPDCG method are convergent for uniformly convex functions under the Wolfe line searches. In fact, the global convergence of the SPDCG method and the mBFGS method follows from the inequalities (56). Next, we prove the global convergence of the SSPCGRP method for uniformly convex functions.
Proof First, we note that ȳ_k^T s̄_k = y_k^T s_k for all k ≥ 1. If y_k^T s_k > 0, the minimum and maximum eigenvalues of the matrix H can be expressed in terms of ω_k = (y_k^T y_k)(s_k^T s_k) / (s_k^T y_k)^2 and ω_r = (y_r^T y_r)(s_r^T s_r) / (s_r^T y_r)^2. In what follows, by induction, we prove that y_k^T s_k > 0 and the sufficient descent property (8) hold for all k.
If the Powell restarting criterion (39) never holds for any k ≥ 1, the iteration matrix is ρQ_{k+1}. Thus, similarly to Theorem 3, it can easily be shown that the results of Theorem 4 hold.
Suppose that k_0 is the first natural number such that the Powell restarting criterion (39) is true; then Nrestart ≥ 1. Similarly to Theorem 3, it can be obtained that, for k = 1, 2, . . . , k_0, ȳ_k^T s̄_k = y_k^T s_k > 0 together with the corresponding descent bound, and the required inequality follows from (16). If the Powell restarting criterion (39) holds at step k + 2, then d_{k+2} is calculated by (38); thus Theorem 2 and the assumptions H2 and H3 give the descent bound. If the Powell restarting criterion (39) does not hold at step k + 2, then d_{k+2} is calculated by (48) with (46) and (47) (i.e., (40)-(42)); since Nrestart ≥ 1, the same bound follows from (40)-(42) and Theorem 2. By (57), (58) and the assumptions H2 and H3, the corresponding estimate holds, which, together with (16), implies the positivity of the next curvature product. By induction, s_k^T y_k > 0 for all k, and the directions generated by the SSPCGRP algorithm (Algorithm 2) satisfy the sufficient descent property (8). Next, we prove that κ_2(H_{k+1}) is bounded above. Since s̄_k = H_{r+1}^{-1/2} s_k and ȳ_k = H_{r+1}^{1/2} y_k, it can be derived from (57), (58) and the assumptions H2 and H3 that the relevant quantities are bounded; thus the boundedness of κ_2(H_{k+1}) follows from (44), (58) and these two bounds. If d_{k+1} is calculated by (38), the iteration matrix is defined by (37); so Corollary 1, (28), (29) and the assumptions H2 and H3 imply the corresponding bound, where ψ(·) is defined by (29). Hence, the spectral condition number of the iteration matrix of Algorithm 2 is uniformly bounded above, which, by Remark 2, shows that the results of Theorem 4 are true. From (56) and this theorem, it follows that the SPDRP algorithm and the SCALCG algorithm with the spectral choice [2] are globally convergent for uniformly convex functions under the Wolfe line searches.

The convergence for general nonlinear functions
For general nonlinear functions, we first have the following result for the symmetric Perry conjugate gradient method.

Theorem 5 Assume that H1 and H2 hold and let the sequence {x_k} be generated by the SPDCG algorithm. If the line searches satisfy the Wolfe conditions (15) and (16), then lim_{k→∞} ||y_k|| = 0 implies that lim inf_{k→∞} ||g_k|| = 0.
Proof Denote the maximum eigenvalue of the iteration matrix Q_{k+1} by λ_max^{(k+1)}. The Wolfe condition (16) leads to s_k^T y_k ≥ -(1 - b_2)s_k^T g_k, which, together with (33), implies (59). Thus λ_max^{(k+1)} admits the bound (60), where c_5 = (1 + c)/c. Now assume that lim_{k→∞} ||y_k|| = 0 but lim inf_{k→∞} ||g_k|| = ε > 0. Then there exists a positive integer N_0 such that the corresponding estimates hold for j > N_0. Therefore, Lemma 1 and (33) imply that lim_{k→∞} ||g_k|| = 0, which contradicts the above assumption. So lim_{k→∞} ||y_k|| = 0 implies that lim inf_{k→∞} ||g_k|| = 0.
Next, we prove the global convergence of the SSPCGRP algorithm (Algorithm 2) for general nonlinear functions.

Theorem 6
Assume that H1 and H2 hold. Let the sequence {x_k} be generated by the SSPCGRP algorithm with σ = c y_k^T y_k / (s_k^T y_k) and ν_0 ≤ ρ ≤ ν_1 in (38), where c, ν_0 and ν_1 are positive constants. If the line searches satisfy the Wolfe conditions (15) and (16), then lim_{k→∞} ||y_k|| = 0 implies that lim inf_{k→∞} ||g_k|| = 0.
Proof Suppose, to the contrary, that lim inf_{k→∞} ||g_k|| ≠ 0 while ||y_k|| → 0. Then, for some ε > 0, there exists a positive integer N_1 such that ||g_{k+1}|| > ε and ||y_k|| ≤ 0.8ε for k ≥ N_1. Thus g_{k+1}^T g_k ≥ 0.2||g_{k+1}||^2 for k ≥ N_1, which means that the directions d_{k+1} are calculated by (38) for k ≥ N_1, that is, d_{k+1} = -ρQ_{k+1}g_{k+1}, where Q_{k+1} is defined by (6). Since σ = c y_k^T y_k / (s_k^T y_k) and ν_0 ≤ ρ ≤ ν_1, the bounds of Theorem 2 apply. For convenience, we also denote the maximum eigenvalue of ρQ_{k+1} by λ_max^{(k+1)}. Analogously to (60), the bound (61) can be derived, where c_5 = (1 + c)/c. Substituting λ_max^{(N_1)} and N_1 for λ_max^{(1)} and 1 in (60) and arguing as in Theorem 5, it can be obtained from Lemma 1 and (61) that lim_{k→∞} ||y_k|| = 0 implies that lim inf_{k→∞} ||g_k|| = 0.
The above two theorems show that the SPDCG(c) algorithm and the SPDRP(c_1, c_2) algorithm are globally convergent for nonconvex functions under the Wolfe line searches, provided lim_{k→∞} ||y_k|| = 0. This condition for global convergence, lim_{k→∞} ||y_k|| = 0, was used by J.Y. Han et al. in [14].

Numerical experiments
In this section, we test our algorithms, SPDCG and SPDRP, and compare them with the CG_DESCENT algorithm [12], the SCALCG algorithm with the spectral choice [2], the mBFGS algorithm (a special form of the SPCG algorithm with σ = 1) and the RSPDCGs algorithm [19], whose line search directions are formulated by (62). In the numerical experiments we let η = 10^{-5}. The RSPDCGs algorithm is fully described in [19].
The numerical experiments use two groups of test functions. The first group (145 test functions) is taken from the CUTEr library [9] (see http://www.cuter.rl.ac.uk/) and is used only to test the mBFGS, SPDCG, RSPDCGs and CG_DESCENT algorithms. In order to compare with the SCALCG algorithm, the second group consists of the 73 unconstrained problems, except the 71st, in the SCALCG Fortran software package coded by N. Andrei (see http://camo.ici.ro/forum/SCALCG/).
For the second group, each test function is run ten times, with the number of variables set to 1000, 2000, . . . , 10000, respectively. The starting points are those given in the SCALCG code.
The SPDCG, mBFGS and RSPDCGs algorithms are coded on the basis of the CG_DESCENT package (C language, Version 5.3) with minor revisions, and implement the approximate Wolfe line searches with the default parameters of CG_DESCENT [10,12]. The CG_DESCENT package can be obtained from Hager's web page at http://www.math.ufl.edu/~hager/.
In addition, in order to compare with the SCALCG algorithm, all subroutines of the SPDRP algorithm are written in Fortran 77 in double precision, and the SPDRP algorithm uses the Wolfe line searches from the SCALCG Fortran code.
The termination criterion for all algorithms is ||g||_∞ < 10^{-6}, where ||·||_∞ is the infinity norm of a vector. The maximum number of iterations is 500n, where n is the number of variables. The tests are performed on a PC (Dell Inspiron 530) with an Intel Core 2 Duo E4600 processor at 2.40 GHz and 2.00 GB of RAM, using the gcc and g77 compilers.
The SPDCG algorithm is a special form of the SPCG algorithm (Algorithm 1) with σ = c y_k^T y_k / (s_k^T y_k) in (5) (see also Step 6 in Algorithm 1); the line search directions are then formulated by (63), and the iteration matrix is defined by (32). For the SPDCG algorithm, we tested several different values of c in the β_k of (63) on the first group of test functions and found that the performance [6] is slightly better when c = 1 (the SPDOC algorithm) than for other values of c. For the first group of test functions, to compare the mBFGS and SPDOC algorithms with the RSPDCGs and CG_DESCENT algorithms, we divide the group into two parts: large scale problems, whose numbers of variables are not less than 100 (72 test functions), and small scale problems, whose numbers of variables are less than 100 (73 test functions). The results show that, for large scale problems, the performance of the SPDOC algorithm is similar to that of the CG_DESCENT algorithm and both outperform the others, while the RSPDCGs algorithm performs better than mBFGS; for small scale problems, the mBFGS algorithm is best. This suggests that the SPDOC and CG_DESCENT algorithms are more suitable for solving large scale problems. It should be pointed out that, although the mBFGS algorithm gives better results on the small problems, it fails to solve three problems of the CUTEr test set. In the recent paper [13], it is observed that for the small ill-conditioned quadratic PALMER test problems in CUTEr, the gradients generated by the conjugate gradient method quickly lose orthogonality due to numerical errors; see Hager and Zhang's paper [13] for a strategy for handling these ill-conditioned problems.
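The performance measure cited above ([6]) is typically the Dolan-More performance profile; a minimal sketch (the helper name and the toy timing table are ours):

```python
import numpy as np

def performance_profile(T, taus):
    """Dolan-More performance profile: T[i, j] is the cost of solver j on
    problem i, with np.inf marking a failure. Returns rho[j][m], the fraction
    of problems solver j solves within a factor taus[m] of the best solver.
    Failures have ratio inf, so they never count for finite tau."""
    best = T.min(axis=1, keepdims=True)       # best cost on each problem
    ratios = T / best
    return np.array([[np.mean(ratios[:, j] <= tau) for tau in taus]
                     for j in range(T.shape[1])])
```

For example, with three problems and two solvers where each solver wins once and solver 2 fails once, the profile of solver 1 dominates for large tau.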
The SPDRP algorithm is a special case of the SSPCGRP algorithm (Algorithm 2) with ρ = 1 and σ = c_1 y_k^T y_k / (s_k^T y_k) in (38), and with σ̄ = c_2 y_k^T H_{r+1} y_k / (s_k^T y_k) and σ = c_2 y_r^T y_r / (s_r^T y_r) in (46)-(48). For the SPDRP algorithm, we also tested several different values of c on the second group of test functions and found that the performance is slightly better when c_1 = c_2 = 1 (i.e., the SPDOCRP algorithm) than for other values of c_1 and c_2. In addition, for the SPDCG algorithm, inequality (33) shows that the descent degree of the line search directions increases with the value of c, but the performance of the algorithm is not directly proportional to c. In fact, the line search directions generated by the SPDCG algorithm vary with the value of c. What kind of criterion can be used to evaluate the performance of an algorithm, and does such a criterion exist? These are still open problems. Of course, the condition number and the descent property are two important factors.
Finally, it should be pointed out that Version 5.3 of CG_DESCENT (C code) uses a new formula for β_k, i.e., η ≡ 0 in (62), instead of the one presented in [10]. In [19], we proved that the β_k formulated by (64) makes the spectral condition number of the iteration matrix defined by (4) with u = s_k optimal.

Conclusion
In [18], we presented a rank-one updating formula for the iteration matrix of conjugate gradient methods. For the SSPCGRP algorithm, when ρσ = 1 with σ = y_k^T y_k / (s_k^T y_k) and σ̄ = y_r^T y_r / (s_r^T y_r), the formulas (38), (46), (47) and (48) were suggested by D.F. Shanno in [25] and [26]. When all of the scaling parameters ρ, σ and σ̄ are equal to 1, the SSPCGRP algorithm becomes the memoryless BFGS conjugate gradient method with restarting procedures. Therefore, it is worth studying further how the parameters σ and ρ should be chosen to construct more effective nonlinear conjugate gradient algorithms. The condition number of Q_{k+1} defined by (6) depends only on the parameter σ, and the condition number of ρQ_{k+1} is the same as that of Q_{k+1} (see (37)); hence we let ρ = 1 in the SSPCGRP algorithm. That is to say, σ scales the symmetric Perry iteration matrix Q_{k+1}, so the symmetric Perry conjugate gradient methods have a self-scaling property. Similarly, σ can alter the maximum and minimum eigenvalues of the Perry iteration matrix P_{k+1} defined by (4), and P_{k+1} is likewise a self-scaling matrix. Thus, the parameter σ in condition (12) is a self-scaling factor, which can alter the condition number of the iteration matrix of the conjugate gradient method.
From (30) and (23), we find that if we impose the restriction y_k^T s_k > δ||s_k||^2, 0 < δ < 1, then, under the Lipschitz condition, |y_k^T s_k| > δ||s_k||^2 ≥ (δ/L)||s_k|| ||y_k||, i.e., the angle between y_k and s_k is bounded away from π/2, which gives condition (65). Therefore, according to Lemma 1, if a descent algorithm satisfies condition (65), then it is globally convergent for nonconvex functions; so (65) is also an interesting restarting strategy [19]. In fact, (65) is a uniformly convex condition. Based on (6), the update (66) can be constructed, and a new family of quasi-Newton methods can be obtained from (66); this family belongs to Huang's family but does not belong to Broyden's family. Hence, it is worth probing further to develop new and more effective unconstrained optimization algorithms.
For example, we let σ = c y_k^T H_k y_k / (s_k^T y_k) in (66).