1 Introduction

Consider the problem of minimizing a strictly convex quadratic,

$$ \min f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^TA\mathbf{x}- \mathbf{b}^T\mathbf{x}, $$
(1.1)

where \(A\in\mathbb{R}^{n\times n}\) is a real symmetric positive definite matrix and \(\mathbf{b}\in\mathbb{R}^{n}\). The Barzilai and Borwein (BB) method for solving (1.1) takes the negative gradient as its search direction and updates the approximate solution iteratively by

$$ \mathbf{x}_{k+1} = \mathbf{x}_k -\alpha_k\, \mathbf{g}_k, $$
(1.2)

where \(\mathbf{g}_{k}=\nabla f(\mathbf{x}_{k})\) and \(\alpha_{k}\) is determined by the information obtained at the points \(\mathbf{x}_{k-1}\) and \(\mathbf{x}_{k}\). Specifically, denote \(\mathbf{s}_{k-1}=\mathbf{x}_{k}-\mathbf{x}_{k-1}\) and \(\mathbf{y}_{k-1}=\mathbf{g}_{k}-\mathbf{g}_{k-1}\). Since the matrix \(D_{k}=\alpha_{k}^{-1} I\), where I is the identity matrix, can be regarded as an approximation to the Hessian of f at \(\mathbf{x}_{k}\), Barzilai and Borwein [2] chose the stepsize \(\alpha_{k}\) so that \(D_{k}\) satisfies a certain quasi-Newton property:

$$ D_k = \arg\min_{D=\alpha^{-1} I} \|D \mathbf{s}_{k-1} - \mathbf{y}_{k-1}\|, $$
(1.3)

where here and below ∥⋅∥ denotes the Euclidean norm. Solving (1.3) yields

$$ \alpha_k=\frac{\mathbf{s}_{k-1}^T\mathbf {s}_{k-1}}{\mathbf{s}_{k-1}^T\mathbf{y}_{k-1}}. $$
(1.4)

Compared with the classical steepest descent (SD) method of Cauchy [4], which takes as its stepsize the exact one-dimensional minimizer along \(\mathbf{x}_{k}-\alpha\mathbf{g}_{k}\),

$$ \alpha_k^{SD}=\arg\min_{\alpha>0} f( \mathbf{x}_k-\alpha\mathbf{g} _k), $$
(1.5)

the BB method often requires less computational work and converges much faster. Owing to this simplicity and efficiency, the BB method has been extended and applied in many settings. To mention just a few of them, Raydan [11] proposed an efficient globalized Barzilai-Borwein algorithm for unconstrained optimization by incorporating the nonmonotone line search of Grippo et al. [8]. Raydan's algorithm was further generalized by Birgin et al. [3] to the minimization of differentiable functions on closed convex sets, yielding an efficient projected gradient method. Efficient projected algorithms based on BB-like methods have also been designed (see [6, 12]) for the special quadratic programs arising from training support vector machines, which have a single linear constraint in addition to box constraints. The BB method has also received much attention for finding sparse approximate solutions to large underdetermined linear systems of equations arising in signal/image processing and statistics (for example, see [13]).
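To make the iteration concrete, here is a minimal Python sketch of the BB iteration (1.2) with stepsize (1.4). The sketch is ours, not from [2]; the first stepsize is taken as the exact SD stepsize (1.5), since (1.4) requires two previous points, and the diagonal test problem at the end is an assumed illustrative choice.

```python
import numpy as np

def bb_quadratic(A, b, x1, tol=1e-10, max_iter=100):
    """Barzilai-Borwein method (1.2)/(1.4) for min 0.5 x'Ax - b'x."""
    x = x1.copy()
    g = A @ x - b
    alpha = (g @ g) / (g @ (A @ g))       # first step: exact SD stepsize (1.5)
    for k in range(max_iter):
        x_new = x - alpha * g             # update (1.2)
        g_new = A @ x_new - b
        if np.linalg.norm(g_new) < tol:
            return x_new, k + 1
        s, y = x_new - x, g_new - g
        alpha = (s @ s) / (s @ y)         # BB stepsize (1.4)
        x, g = x_new, g_new
    return x, max_iter

# illustrative data (assumed): A = diag(1, 100), b = 0
A = np.diag([1.0, 100.0])
x_star, iters = bb_quadratic(A, np.zeros(2), np.array([1.0, 1.0]))
```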

Considerable attention has also been paid to the theoretical properties of the BB method, in spite of the potential difficulties caused by its heavily nonmonotone behavior. These analyses are carried out in the unconstrained quadratic case (as is the one in this paper). Specifically, Barzilai and Borwein [2] presented an interesting R-superlinear convergence result for their method when the dimension is two. For a general n-dimensional strongly convex quadratic function, the BB method is also convergent (see [10]) and the convergence rate is R-linear (see [7]). A further analysis of the asymptotic behavior of BB-like methods can be found in [5].

In this paper, we focus on the analysis of the BB method for two-dimensional quadratic functions. Though simple, the two-dimensional case has a special meaning for the BB method. As was just mentioned, the BB method is significantly faster than the SD method in practical computations, but there is still a lack of theoretical evidence that the BB method is better than the SD method in the general n-dimensional case. Nevertheless, the notorious zigzagging behavior of the SD method is well known (see Akaike [1]); namely, the search directions of the SD method usually tend to two orthogonal directions when the method is applied to quadratic functions of any dimension. Unlike the SD method, however, the BB method does not produce zigzags, due to its R-superlinear convergence in the two-dimensional case. This explains to some extent the efficiency of the BB method over the SD method.

Our analysis begins with the assumption that the gradient norms at the first two iterations are fixed (see Sect. 2). We show that a superlinear convergence step occurs within at most three consecutive steps. This sharpens the previous analyses by Barzilai and Borwein [2] and Yuan [14], which only show that a superlinear convergence step occurs within at most four consecutive steps. Meanwhile, we provide a better convergence relation, namely (2.13), for the BB method. The influence of the condition number on the convergence rate is studied in Sect. 3. We find that the convergence rate of the BB method is related to both the starting point and the conditioning of the problem. Some remarks are made at the end of Sect. 3.

2 A New Analysis on the BB Method

We focus on the BB method for the quadratic function (1.1) with n=2. In this case, since the method is invariant under translations and rotations, we assume that

$$ A=\begin{pmatrix} 1 & 0 \\ 0 & \lambda\end{pmatrix} \quad\mbox{and}\quad \mathbf{b}=\mathbf{0}, $$
(2.1)

where λ≥1, as in Barzilai and Borwein [2]. Assume that x 1 and x 2 are given with

$$ g_1^{(i)}\ne0, \qquad g_2^{(i)}\ne0, \quad\mbox{for $i=1$ and $2$}. $$
(2.2)

To analyze ∥g k ∥ for all k≥3, we denote \(\mathbf{g}_{k}=(g_{k}^{(1)},\, g_{k}^{(2)})^{T}\) and define

$$ q_k=\frac{ (g_k^{(1)} )^2}{ (g_k^{(2)} )^2}. $$
(2.3)

Then, since \(\mathbf{s}_{k-1}=-\alpha_{k-1}\mathbf{g}_{k-1}\) and \(\mathbf{y}_{k-1}=A\mathbf{s}_{k-1}\), it follows from (1.4) that

$$\alpha_k=\frac{\mathbf{g}_{k-1}^T\mathbf{g}_{k-1}}{\mathbf{g}_{k-1}^T A \mathbf{g}_{k-1}}=\frac{q_{k-1}+1}{q_{k-1}+\lambda}. $$

Noticing that x k+1=x k α k g k and g k =Ax k , we have that

$$\mathbf{g}_{k+1} = (I-\alpha_k A) \mathbf{g}_k. $$

Writing the above relation in componentwise form and using the expression for \(\alpha_{k}\), we have

$$g_{k+1}^{(1)}=(1-\alpha_k)\,g_k^{(1)}=\frac{\lambda-1}{\lambda+q_{k-1}}\,g_k^{(1)}, \qquad g_{k+1}^{(2)}=(1-\lambda\alpha_k)\,g_k^{(2)}=-\frac{(\lambda-1)\,q_{k-1}}{\lambda+q_{k-1}}\,g_k^{(2)}. $$

Therefore we get for all k≥2,

$$ \begin{cases} (g_{k+1}^{(1)} )^2 = \frac{(\lambda-1)^2}{(\lambda+q_{k-1})^2}\, (g_k^{(1)} )^2, \\[3pt] (g_{k+1}^{(2)} )^2 = \frac{(\lambda-1)^2\, q_{k-1}^2}{(\lambda+q_{k-1})^2}\, (g_k^{(2)} )^2. \end{cases} $$
(2.4)

In the case λ=1, where the objective function has spherical contours, the method takes the unit stepsize \(\alpha_{2}=1\) and gives the exact solution at the third iteration. If \(g_{2}^{(1)}=0\) but \(g_{2}^{(2)}\ne0\), we have \(q_{2}=0\) and hence, by (2.4), \(g_{k}^{(1)}=0\) for k≥3 and \(g_{4}^{(2)}=0\), which means that the method gives the exact solution in at most four iterations. The same is true if \(g_{2}^{(2)}=0\) but \(g_{2}^{(1)}\ne0\), by the symmetry of the first and second components. If \(g_{1}^{(1)}=0\) but \(g_{1}^{(2)}\ne0\), we have \(q_{1}=0\) and \(g_{3}^{(2)}=0\). Then, by treating \(\mathbf{x}_{2}\) and \(\mathbf{x}_{3}\) as two starting points, we must have \(\mathbf{g}_{k}=\mathbf{0}\) for some k≤5. The symmetric argument applies when \(g_{1}^{(2)}=0\) but \(g_{1}^{(1)}\ne0\). Thus we may assume that λ>1 and that assumption (2.2) holds, for otherwise the method terminates finitely.

Now, substituting (2.4) into the definition of q k+1, we can obtain the following recurrence relation

$$ q_{k+1} = \frac{q_k}{q_{k-1}^2}. $$
(2.5)

In other words, the positive sequence {q k} depends only upon the initial values q 1 and q 2. If the starting points x 1 and x 2 are given, then g 1 and g 2 are fixed and so are q 1 and q 2. However, as λ increases, \(\frac{\lambda-1}{\lambda+q_{k-1}}\) approaches 1 from the left and hence \((g_{k}^{(1)})^{2}\) and \((g_{k}^{(2)})^{2}\) become larger. If q 1 and q 2 were unchangeable, we would be able to conclude from the relation (2.4) that the convergence of the BB method slows down as the problem becomes more ill-conditioned. As analyzed in Sect. 3, however, this is not the case, since q 1 and q 2 are closely related to the starting point and the condition number λ.
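As a numerical sanity check of (2.5) (an illustrative sketch of ours; the values of λ and x 1 are assumed), one can follow a BB trajectory for the problem (2.1) and confirm that the computed q k satisfy the recurrence:

```python
import numpy as np

lam = 10.0
A = np.diag([1.0, lam])                  # matrix of (2.1), b = 0
x_prev = np.array([1.3, 0.7])            # x_1 (assumed)
g_prev = A @ x_prev
alpha = (g_prev @ g_prev) / (g_prev @ (A @ g_prev))   # SD step gives x_2
x = x_prev - alpha * g_prev
g = A @ x
qs = [g_prev[0]**2 / g_prev[1]**2, g[0]**2 / g[1]**2]
for k in range(5):
    s, y = x - x_prev, g - g_prev
    alpha = (s @ s) / (s @ y)            # BB stepsize (1.4)
    x_prev, g_prev = x, g
    x = x - alpha * g
    g = A @ x
    qs.append(g[0]**2 / g[1]**2)
    print(qs[-1] * qs[-3]**2 / qs[-2])   # should print 1.0 by (2.5)
```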

To proceed with our analysis, we denote M k =lnq k . It follows from the recurrence relation (2.5) that

$$ M_{k+1}=M_k-2\,M_{k-1}, $$
(2.6)

which implies the analytical expression of M k: the characteristic equation of (2.6) is \(r^{2}-r+2=0\), whose complex roots \(\frac{1\pm\sqrt{7}\,\mathrm{i}}{2}\) have modulus \(\sqrt{2}\) and argument \(\arctan(\sqrt{7})\), so that

$$ M_k = \sqrt{2}^k \tau\cos\bigl(\phi+ k\, \arctan(\sqrt {7}) \bigr), $$
(2.7)

where τ and ϕ are constants determined only by q 1 and q 2. If it happens that

$$ \bigl(g_i^{(1)}\bigr)^2= \bigl(g_i^{(2)}\bigr)^2\quad\mbox{for $i=1$ and $2$}, $$
(2.8)

we know from (2.4) and (2.5) that q k ≡1 and \((g_{k}^{(1)})^{2}=(g_{k}^{(2)})^{2}\) for all k≥1, which indicates that the method is identical to the SD method and the generated gradient norm sequence {∥g k ∥} is only linearly convergent with factor (λ−1)/(λ+1). In this case, the value of τ in (2.7) is zero. In the following, we assume that (2.8) does not hold and hence τ≠0. Further, without loss of generality, we assume that τ>0.
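This degenerate case is easy to reproduce numerically; in the following minimal sketch (ours; λ and the starting point are assumed so that (2.8) holds at the first iterate), every printed ratio equals (λ−1)/(λ+1):

```python
import numpy as np

lam = 10.0
A = np.diag([1.0, lam])
x = np.array([1.0, 1.0 / lam])   # chosen so that g_1 = (1, 1)^T, i.e. q_1 = 1
g = A @ x
for k in range(5):
    # with q_k = 1 the BB stepsize (1.4) and the SD stepsize (1.5)
    # coincide; both equal 2/(1 + lam)
    alpha = (g @ g) / (g @ (A @ g))
    x = x - alpha * g
    g_new = A @ x
    print(np.linalg.norm(g_new) / np.linalg.norm(g))   # = (lam-1)/(lam+1)
    g = g_new
```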

To improve the result of Barzilai and Borwein [2], we analyze the whole gradient norm ∥g k∥ from the beginning (previously, only the second component of g k, namely \(g_{k}^{(2)}\), was analyzed at the first stage). In fact, we have from (2.4) that

$$ \|\mathbf{g}_{k+1}\|^2 = (\lambda-1)^2\, r_k\, \|\mathbf{g}_k\|^2, $$
(2.9)

where

$$r_k = \frac{q_k+q_{k-1}^2}{(1+q_k)(\lambda+q_{k-1})^2}. $$

Notice that the quantity r k has the following properties:

(i) \(r_k \le 1\) for all k≥1;

(ii) if \(q_k<1\) and \(q_{k-1}<1\), then

$$r_k \le\frac{q_k+q_{k-1}^2}{\lambda^2}\le2\max\bigl\{q_k, \, q_{k-1}^2\bigr\}; $$

(iii) if \(q_k>1\) and \(q_{k-1}>1\), then

$$r_k = \frac{q_k^{-1}+q_{k-1}^{-2}}{(1+q_k^{-1})(1+\lambda q_{k-1}^{-1})^2}\le2 \max\bigl\{q_k^{-1}, \, q_{k-1}^{-2}\bigr\}. $$

Using the above properties of r k , we have from (2.9) that

$$\|\mathbf{g}_{k+1}\|^2 \le2(\lambda-1)^2 \, u_k\, \|\mathbf{g}_k\|^2, $$

where

$$u_k = \begin{cases} \max\{q_k, q_{k-1}^2\}, &\mbox{if}\ q_k<1\ \mbox {and}\ q_{k-1}<1;\\ \max\{q_k^{-1}, q_{k-1}^{-2}\}, &\mbox{if}\ q_k>1\ \mbox{and}\ q_{k-1}>1; \\ \frac{1}{2}, &\mbox{otherwise.} \end{cases} $$
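This bound can be spot-checked along an actual BB trajectory; the sketch below (ours; the problem data are assumed) asserts \(\|\mathbf{g}_{k+1}\|^2\le2(\lambda-1)^2 u_k\|\mathbf{g}_k\|^2\) at each step:

```python
import numpy as np

lam = 10.0
A = np.diag([1.0, lam])
x_prev = np.array([1.1, -0.6])           # x_1 (assumed)
g_prev = A @ x_prev
alpha = (g_prev @ g_prev) / (g_prev @ (A @ g_prev))   # SD step gives x_2
x = x_prev - alpha * g_prev
g = A @ x
qs = [g_prev[0]**2 / g_prev[1]**2, g[0]**2 / g[1]**2]
for k in range(6):
    s, y = x - x_prev, g - g_prev
    alpha = (s @ s) / (s @ y)            # BB stepsize (1.4)
    x_prev, g_prev = x, g
    x = x - alpha * g
    g = A @ x
    qk, qkm1 = qs[-1], qs[-2]            # q_k and q_{k-1}
    if qk < 1 and qkm1 < 1:
        u = max(qk, qkm1**2)
    elif qk > 1 and qkm1 > 1:
        u = max(1 / qk, 1 / qkm1**2)
    else:
        u = 0.5
    assert g @ g <= 2 * (lam - 1)**2 * u * (g_prev @ g_prev) * (1 + 1e-10)
    qs.append(g[0]**2 / g[1]**2)
```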

Consequently,

$$ \|\mathbf{g}_{k+3}\|^2 \le8(\lambda-1)^6 \, \Biggl(\prod_{j=0}^2 u_{k+j} \Biggr) \|\mathbf{g}_k\|^2. $$
(2.10)

Denoting

$$h_{k+j}=\cos\bigl(\phi+ (k+j)\arctan(\sqrt{7}) \bigr), $$

we can obtain from (2.10) and (2.7) that

$$\|\mathbf{g}_{k+3}\|^2 \le8(\lambda-1)^6 \, \exp\Biggl(\tau\, \sqrt{2}^k\, \sum_{j=0}^{2} v_{k+j} \Biggr) \|\mathbf{g}_k\|^2, $$

where for j=0,1,2,

$$v_{k+j} = \begin{cases} \max\{\sqrt{2}^j h_{k+j},\,\sqrt{2}^{j+1}h_{k+j-1} \}, &\mbox{if}\ h_{k+j}<0\ \mbox{and}\ h_{k+j-1}<0; \\[3pt] \max\{-\sqrt{2}^j h_{k+j},\,-\sqrt{2}^{j+1}h_{k+j-1} \}, &\mbox{if}\ h_{k+j}>0\ \mbox{and}\ h_{k+j-1}>0; \\[3pt] 0, &\mbox{otherwise.} \end{cases} $$

Noticing that \(\sum_{j=0}^{2} v_{k+j}\) is a function of the single variable ϕ, we can verify that

$$ \max_{\phi\in[0,\, 2\pi]}\sum_{j=0}^{2} v_{k+j} = \cos\biggl(\frac{\pi}{2}+\arctan(\sqrt{7}) \biggr) =- \frac{\sqrt{14}}{4} $$
(2.11)

(a rigorous proof can be found in the Appendix). Thus we can obtain

$$\|\mathbf{g}_{k+3}\|^2 \le8(\lambda-1)^6 \, \exp\biggl(-\frac {\sqrt{14}}{4}\,\tau\,\sqrt{2}^k \biggr) \| \mathbf{g}_k\|^2, $$

or, equivalently,

$$ \|\mathbf{g}_{k+3}\| \le2\sqrt{2}(\lambda-1)^3 \exp \biggl(-\frac {\sqrt{14}}{8}\,\tau\,\sqrt{2}^k \biggr) \| \mathbf{g}_k\|. $$
(2.12)

A corollary of (2.12) is that \(\frac{\|\mathbf{g}_{k+3}\| }{\|\mathbf{g} _{k}\|}=\prod_{i=0}^{2} \frac{\|\mathbf{g}_{k+i+1}\|}{\|\mathbf {g}_{k+i}\|}\) tends to zero as k→∞ and hence

$$\lim_{k\rightarrow\infty} \min\biggl\{\frac{\|\mathbf{g}_{k+1}\| }{\|\mathbf{g}_{k}\|},\, \frac{\|\mathbf{g}_{k+2}\|}{\|\mathbf{g}_{k+1}\|},\, \frac{\| \mathbf{g}_{k+3}\|}{\|\mathbf{g}_{k+2}\|} \biggr\} = 0. $$

This means that the BB method has a Q-superlinear convergence step within at most three consecutive steps. This sharpens the analyses in Barzilai and Borwein [2] and Yuan [14], which only show that a superlinear convergence step occurs within at most four consecutive steps.
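The limit above can be observed numerically; the following sketch (ours; the problem data are assumed) prints the smallest of each block of three consecutive gradient-norm ratios along a BB trajectory, and these minima decay rapidly toward zero:

```python
import numpy as np

lam = 10.0
A = np.diag([1.0, lam])
x_prev = np.array([2.0, 0.3])            # x_1 (assumed)
g_prev = A @ x_prev
alpha = (g_prev @ g_prev) / (g_prev @ (A @ g_prev))   # SD step gives x_2
x = x_prev - alpha * g_prev
g = A @ x
norms = [np.linalg.norm(g_prev), np.linalg.norm(g)]
for k in range(12):
    s, y = x - x_prev, g - g_prev
    alpha = (s @ s) / (s @ y)            # BB stepsize (1.4)
    x_prev, g_prev = x, g
    x = x - alpha * g
    g = A @ x
    norms.append(np.linalg.norm(g))
    if norms[-1] < 1e-14:                # stop near machine precision
        break
ratios = [norms[i + 1] / norms[i] for i in range(len(norms) - 1)]
for i in range(0, len(ratios) - 2, 3):
    print(min(ratios[i:i + 3]))          # one superlinear step per triple
```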

For any integer k≥2, we can write \(k=3l+i_0\) for some integers l≥0 and \(i_0\in\{2,3,4\}\). Notice by (2.9) and \(r_k\le1\) that \(\|\mathbf{g}_k\|\le(\lambda-1)^{k-2}\|\mathbf{g}_2\|\) for any k≥2. From this and (2.12), we can obtain

$$ \|\mathbf{g}_{k}\| \le\bigl[\sqrt{2}(\lambda-1)\bigr]^{k-2} \exp\bigl(-c_1\,\tau\,\bigl(\sqrt{2}^{k}-4\bigr) \bigr) \|\mathbf{g}_2\|, $$
(2.13)

where

$$c_1=\frac{\sqrt{14}+4\sqrt{7}}{56}\approx0.2558. $$

The relation (2.13) indicates that the gradient norm sequence {∥g k∥} is R-superlinearly convergent with R-order \(\sqrt{2}\), which is the same as before. As shown in Sect. 3, however, the convergence relation (2.13) improves the previous one in Yuan [14], because our analysis provides the R-superlinear factor exp(−c 1 τ), which is better than the previous one.

We sum up the above analysis into the following theorem.

Theorem 2.1

Consider the BB method for solving the quadratic function (1.1) with n=2 and (2.1). Suppose that g 1 and g 2 satisfy (2.2) but not (2.8). Then the method is R-superlinearly convergent and gives the convergence relation (2.13).

Two assumptions on the two starting points x 1 and x 2 have been used in the above theorem. If the relation (2.2) does not hold, namely, if at least one component of g 1 or g 2 is zero, then g k =0 for some k≤5 and the method terminates finitely. In exact arithmetic, if (2.8) holds, we have \((g_{k}^{(1)})^{2}=(g_{k}^{(2)})^{2}\) for all k≥1 and the method is only linearly convergent, giving \(\|\mathbf{g}_{k+1}\|=\frac{\lambda-1}{\lambda+1}\|\mathbf{g}_{k}\|\) for all k≥1. In practical computations, however, this equality is usually destroyed by numerical errors. Therefore the superlinear convergence of the BB method can always be observed numerically in the two-dimensional case.

3 Influence of x_1 and λ on the Convergence Rate

To begin with, we notice from (2.6) that the sequence {M k} satisfies the same recurrence relation as the sequence {m k} in Yuan [14] (see the relation (3.1.44) there; a similar sequence is also defined in Barzilai and Borwein [2]). Specifically, using the analytical expression of m k,

$$ m_k = \sqrt{2}^k \theta\cos\bigl(\phi+ k \arctan( \sqrt{7}) \bigr), $$
(3.1)

where θ is also assumed to be positive, the following convergence relation was established in Yuan [14]:

$$ \|\mathbf{g}_k\| \le\sqrt{2}|t_2|(\lambda-1)^{k-2} \lambda^{(2\cos(\frac{3}{2}\arctan (\sqrt{7}))\,\theta\,(\sqrt{2})^{k-8})}, $$
(3.2)

where \(|t_{2}|=|g_{2}^{(2)}|\). Further, the relation (3.1.41) in Yuan [14] indicates that \(\lambda^{2m_{k}}=(g_{k}^{(1)})^{2}/(g_{k}^{(2)})^{2}=q_{k}\). It follows from this and the definition \(M_{k}=\ln q_{k}\) that \(m_{k}=M_{k}/(2\ln\lambda)\). Then, by comparing the expressions (2.7) and (3.1), we get the following relation between the values of τ and θ:

$$ \theta=\frac{\tau}{2 \ln\lambda}. $$
(3.3)

Substituting this into the convergence relation (3.2), we obtain

$$ \|\mathbf{g}_k\| \le\sqrt{2}\,|t_2|\,( \lambda-1)^{k-2}\exp\bigl(-{c}_2\,\tau\sqrt {2}^k \bigr) $$
(3.4)

where

$$c_2 = \frac{-\cos(\frac{3}{2}\arctan(\sqrt{7}))}{16}=\frac{\sqrt {8-5\sqrt{2}}}{64} \approx0.0151. $$

It is obvious that our new estimate (2.13) is an improvement over (3.4), since \(c_1\approx0.2558\) is much larger than \(c_2\approx0.0151\).

We now analyze how the starting point x 1 and the problem condition λ influence the convergence rate of the BB method. To this end, we assume that the starting point \(\mathbf{x}_{1}=(x_{1}^{(1)},\, x_{1}^{(2)})^{T}\) is given and that an SD step is taken at the first iteration. Denoting

$$ C=\frac{(x_1^{(1)})^2}{(x_1^{(2)})^2}, $$
(3.5)

it is easy to see from g k =A x k and the definition of q k in (2.3) that

$$ q_1=\frac{C}{\lambda^2}. $$
(3.6)

Since the SD step provides the orthogonality condition \(\mathbf{g}_{2}^{T}\mathbf{g}_{1}=0\) and the dimension n is two, we can see that

$$ q_2=\frac{1}{q_1}. $$
(3.7)

Recall that \(M_{k}=\ln q_{k}\). By (2.7), we obtain the following nonlinear system for τ and ϕ:

$$ \begin{cases} \sqrt{2}\, \tau\cos(\phi+ \arctan (\sqrt{7}) ) = \ln q_1, \\ 2\, \tau\cos(\phi+ 2\, \arctan(\sqrt{7}) ) = \ln q_2. \end{cases} $$
(3.8)

Summing the two relations in this system and noticing that (3.7) gives \(\ln q_{2}=-\ln q_{1}\), we obtain \(\sqrt{2}\cos(\phi+\arctan(\sqrt{7}))+2\cos(\phi+2\arctan(\sqrt{7}))=0\), from which we can solve

$$\phi=-\arctan\frac{\sqrt{7}}{7}. $$

Then by the first relation in (3.8) and (3.6), we can obtain

$$ \tau= \frac{2\sqrt{14}}{7} \ln\frac{C}{\lambda^2}. $$
(3.9)
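The closed forms for ϕ and τ can be checked directly against the system (3.8); a small illustrative verification (ours; the values of C and λ are assumed):

```python
import numpy as np

lam, C = 5.0, 0.3                        # assumed illustrative values
q1 = C / lam**2                          # by (3.6)
q2 = 1.0 / q1                            # by (3.7)
a = np.arctan(np.sqrt(7.0))
phi = -np.arctan(np.sqrt(7.0) / 7.0)
tau = 2.0 * np.sqrt(14.0) / 7.0 * np.log(C / lam**2)   # (3.9)
lhs1 = np.sqrt(2.0) * tau * np.cos(phi + a)            # should equal ln q_1
lhs2 = 2.0 * tau * np.cos(phi + 2.0 * a)               # should equal ln q_2
print(np.isclose(lhs1, np.log(q1)), np.isclose(lhs2, np.log(q2)))
```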

In this special case, we obtain the following theorem by substituting this value into (2.13) and using \(\|\mathbf{g}_{2}\|\le(\lambda-1)\|\mathbf{g}_{1}\|\) in Theorem 2.1. Since τ is assumed to be positive in Sect. 2 without loss of generality, we replace it by |τ| here, because the value of τ in (3.9) may be negative. If C is fixed, it is interesting to notice that the absolute value of θ in (3.3) tends to the constant \(\frac{2\sqrt{14}}{7}\) as λ goes to infinity; namely, \(\lim_{\lambda\rightarrow\infty}|\theta|=\frac{2\sqrt{14}}{7}\).

Theorem 3.1

Consider the BB method for solving the quadratic function (1.1) with n=2 and (2.1). Suppose that the starting point \(\mathbf{x}_{1}=(x_{1}^{(1)},\, x_{1}^{(2)})^{T}\) is given and that an SD step is taken at the first iteration. If \(x_{1}^{(1)}x_{1}^{(2)}\ne0\) and \(C\ne\lambda^{2}\), then the method is R-superlinearly convergent and satisfies the convergence relation

$$ \|\mathbf{g}_k\|\le\bigl[\sqrt{2}(\lambda-1) \bigr]^{k-1} \exp\biggl(-\frac{1+2\sqrt{2}}{14}\biggl|\ln\frac{C}{\lambda^2} \biggr|\bigl(\sqrt{2}^{k}-4 \bigr) \biggr) \| \mathbf{g}_1\|. $$
(3.10)

If the starting point x 1 is such that \(x_{1}^{(1)}x_{1}^{(2)}=0\), it is easy to see that the BB method gives the solution in at most four iterations. If \(C=\lambda^{2}\), we have \(q_{k}=1\) and \(\|\mathbf{g}_{k+1}\|=\frac{\lambda-1}{\lambda+1}\|\mathbf{g}_{k}\|\) for all k≥1, which implies that the method is only linearly convergent.

If \(C\ne\lambda^{2}\), the exponential term in (3.10) dominates the convergence rate of the gradient norm. Consider the term \(|\ln\frac{C}{\lambda^{2}}|\) as a function of λ with C held fixed. This function is monotonically decreasing for \(\lambda^{2}\in(1,C)\) and monotonically increasing for \(\lambda^{2}\in(C,\infty)\) (note that the first case cannot occur if C≤1). Therefore we have the following statements:

(i) the convergence rate of ∥g k∥ is decreasing for \(\lambda^{2}\in(1, C)\);

(ii) the convergence rate of ∥g k∥ is increasing for \(\lambda^{2}\in(C, \infty)\).

Let us now consider the regions of x 1 for which the convergence rate of ∥g k∥ is decreasing and increasing, respectively. First, we see that for a fixed value of λ, the value of \(|\ln\frac{C}{\lambda^{2}}|\) becomes larger as \(C<\lambda^{2}\) decreases or as \(C>\lambda^{2}\) increases. This indicates that the convergence is faster when the starting point is close to either of the two eigenvectors of the Hessian. Further, we see that

(iii) when \(\mathbf{x}_{1}\in\varOmega_{1}(\lambda)=\{\mathbf{x}: |x^{(1)}|>\lambda|x^{(2)}|>0\}\), the convergence rate of ∥g k∥ tends to decrease as λ increases;

(iv) when \(\mathbf{x}_{1}\in\varOmega_{2}(\lambda)=\{\mathbf{x}: 0<|x^{(1)}|<\lambda|x^{(2)}|\}\), the convergence rate of ∥g k∥ tends to increase as λ increases.

Then, for any positive number l>0, denoting by \(\mathcal{B}(l)=\{\mathbf{x}: \|\mathbf{x}\|\le l\}\) the ball of radius l, we can obtain

$$ r(\lambda):=\frac{\mbox{Measure of $\varOmega _{1}(\lambda)\cap \mathcal{B}(l)$}}{\mbox{Measure of $\varOmega_{2}(\lambda)\cap\mathcal {B}(l)$}} = \frac{\arctan\frac{1}{\lambda}}{ \frac{\pi}{2} - \arctan\frac{1}{\lambda}}. $$
(3.11)

Since λ>1, we have \(\arctan\frac{1}{\lambda}<\frac{\pi}{4}\) and hence r(λ)<1. In addition,

$$ \lim_{\lambda\rightarrow\infty} r(\lambda) = 0. $$
(3.12)

Therefore we can conclude that the BB method is more likely to converge faster as the problem conditioning worsens, and this likelihood tends to one as λ goes to infinity.
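For concreteness, a small computation of r(λ) in (3.11) for a few assumed values of λ (an illustrative sketch of ours):

```python
import numpy as np

# r(lam) < 1 for lam > 1, and r(lam) -> 0 as lam -> infinity, per (3.12)
for lam in [2.0, 10.0, 100.0, 1000.0]:
    t = np.arctan(1.0 / lam)
    print(lam, t / (np.pi / 2.0 - t))
```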

To some extent, the analysis in the previous paragraph is similar to that in Nocedal et al. [9] for the SD method in the two-dimensional case, although the SD method is only linearly convergent. As shown by Fig. 12 in Nocedal et al. [9] and the related discussion, for a fixed starting point, the convergence rate of the SD method improves as the condition number tends to infinity. The two-dimensional analysis, for either the BB method or the SD method, is special and need not carry over to higher dimensions. It remains under investigation how the problem conditioning influences the convergence of the BB method for higher-dimensional problems.