The behavior of the Gauss-Radau upper bound of the error norm in CG

Consider the problem of solving systems of linear algebraic equations Ax = b with a real symmetric positive definite matrix A using the conjugate gradient (CG) method. To stop the algorithm at the appropriate moment, it is important to monitor the quality of the approximate solution. One of the most relevant quantities for measuring this quality is the A-norm of the error. This quantity cannot be easily computed; however, it can be estimated. In this paper we discuss and analyze the behavior of the Gauss-Radau upper bound on the A-norm of the error, based on viewing CG as a procedure for approximating a certain Riemann-Stieltjes integral.
This upper bound depends on a prescribed underestimate μ of the smallest eigenvalue of A. We concentrate on explaining a phenomenon observed during computations: in later CG iterations, the upper bound loses its accuracy and becomes almost independent of μ. We construct a model problem that is used to demonstrate and study the behavior of the upper bound in dependence of μ, and develop formulas that are helpful in understanding this behavior. We show that the above-mentioned phenomenon is closely related to the convergence of the smallest Ritz value to the smallest eigenvalue of A.
It occurs when the smallest Ritz value is a better approximation to the smallest eigenvalue than the prescribed underestimate μ. We also suggest an adaptive strategy for improving the accuracy of the upper bounds in the previous iterations.


Introduction
Our aim in this paper is to explain the origin of the problems that have been noticed [1] when computing Gauss-Radau quadrature upper bounds of the A-norm of the error in the conjugate gradient (CG) algorithm for solving linear systems Ax = b with a symmetric positive definite matrix of order N.
Using a fixed node μ smaller than the smallest eigenvalue of A and the Gauss-Radau quadrature rule, an upper bound for the A-norm of the error can be easily computed. Note that it is useful to have an upper bound of the error norm for stopping the CG iterations. In theory, the closer μ is to the smallest eigenvalue, the closer the bound is to the error norm.
Concerning the approximation properties of the upper bound, we observed in many examples that in earlier iterations the bound approximates the A-norm of the error quite well, and that the quality of approximation improves with increasing iterations. However, in later CG iterations, the bound suddenly becomes worse: it is delayed, almost independent of μ, and no longer represents a good approximation to the A-norm of the error. Such behavior of the upper bound can also be observed in exact arithmetic. Therefore, the problem of the loss of accuracy of the upper bound in later iterations is not directly linked to rounding errors and has to be explained.
The Gauss quadrature bounds of the error norm were obtained by using the connection of CG to the Lanczos algorithm and modifications of the tridiagonal matrix which is generated by this algorithm and implicitly by CG. This is why we start in Section 2 with the Lanczos algorithm. In Section 3 we discuss the relation with CG and how the Gauss-Radau upper bound is computed. A model problem showing the problems arising with the Gauss-Radau bound in "exact" arithmetic is constructed in Section 4. In Sections 5 to 7 we give an analysis that explains that the problems start when the distance of the smallest Ritz value to the smallest eigenvalue becomes smaller than the distance of μ to the smallest eigenvalue. We also explain why the Gauss-Radau upper bound becomes almost independent of μ. In Section 8 we present an algorithm for improving the upper bounds in previous CG iterations such that the relative accuracy of the upper bounds is guaranteed to be smaller than a prescribed tolerance. Conclusions are given in Section 9.
Assuming that k < n, the Lanczos algorithm (Algorithm 1) computes an orthonormal basis v_1, . . . , v_{k+1} of the Krylov subspace K_{k+1}(A, v). The basis vectors v_j satisfy the matrix relation

A V_k = V_k T_k + β_{k+1} v_{k+1} e_k^T,

where e_k is the last column of the identity matrix of order k, V_k = [v_1 · · · v_k], and T_k is the k × k symmetric tridiagonal matrix of the recurrence coefficients computed in Algorithm 1, with diagonal entries α_1, . . . , α_k and off-diagonal entries β_2, . . . , β_k. The coefficients β_j being positive, T_k is a so-called Jacobi matrix. If A is positive definite, then T_k is positive definite as well. In the following we will assume for simplicity that the eigenvalues of A are simple and sorted such that λ_1 < λ_2 < · · · < λ_N.
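The recurrence above can be sketched in a few lines of numpy. This is a minimal version without reorthogonalization (all names are ours, not the paper's):

```python
import numpy as np

def lanczos(A, v, k):
    """Plain Lanczos (no reorthogonalization): returns V with k+1 orthonormal
    columns spanning K_{k+1}(A, v), the k-by-k Jacobi matrix T_k, and beta_{k+1}."""
    n = A.shape[0]
    V = np.zeros((n, k + 1))
    alpha = np.zeros(k)
    beta = np.zeros(k)               # off-diagonal coefficients; beta[-1] is beta_{k+1}
    V[:, 0] = v / np.linalg.norm(v)
    for j in range(k):
        w = A @ V[:, j]
        if j > 0:
            w -= beta[j - 1] * V[:, j - 1]
        alpha[j] = V[:, j] @ w
        w -= alpha[j] * V[:, j]
        beta[j] = np.linalg.norm(w)
        V[:, j + 1] = w / beta[j]
    T = np.diag(alpha) + np.diag(beta[:-1], 1) + np.diag(beta[:-1], -1)
    return V, T, beta[-1]
```

For a well-conditioned matrix and few steps, the computed basis satisfies the three-term matrix relation A V_k = V_k T_k + β_{k+1} v_{k+1} e_k^T up to rounding.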

Approximation of the eigenvalues
The eigenvalues of T_k (Ritz values) are usually used as approximations to the eigenvalues of A. The quality of the approximation can be measured using β_{k+1} and the last components of the normalized eigenvectors of T_k. In more detail, consider the spectral decomposition of T_k in the form

T_k = S_k Θ_k S_k^T,  S_k^T S_k = I_k,  Θ_k = diag(θ^{(k)}_1, . . . , θ^{(k)}_k),

where I_k is the identity matrix of order k, and assume that the Ritz values are sorted such that θ^{(k)}_1 < θ^{(k)}_2 < · · · < θ^{(k)}_k. Denote by s^{(k)}_{i,j} the entries and by s^{(k)}_{:,j} the columns of S_k. Then it holds that

min_i |λ_i − θ^{(k)}_j| ≤ β_{k+1} |s^{(k)}_{k,j}|.  (1)

Since the Ritz values θ^{(k)}_j can be seen as Rayleigh quotients, one can improve the bound (1) using the gap theorem; see [14, p. 244] or [15, p. 206]. In particular, let λ be an eigenvalue of A closest to θ^{(k)}_j; then

|λ − θ^{(k)}_j| ≤ (β_{k+1} s^{(k)}_{k,j})² / gap_j,

where gap_j denotes the distance of θ^{(k)}_j to the remaining part of the spectrum of A. In the following we will be interested in the situation when the smallest Ritz value θ^{(k)}_1 approximates the smallest eigenvalue λ_1, giving the bounds

λ_1 ≤ θ^{(k)}_1 ≤ λ_1 + (β_{k+1} s^{(k)}_{k,1})² / (λ_2 − θ^{(k)}_1).  (3)

It is known (see, for instance, [16]) that the squares of the last components of the eigenvectors of T_k can be expressed in terms of quantities generated by the Lanczos process. When the smallest Ritz value θ^{(k)}_1 converges to λ_1, the last component squared (s^{(k)}_{k,1})² converges to zero; see also (3).
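The residual bound (1) is easy to check numerically. The following sketch (our own construction, with an assumed diagonal test matrix) runs a few Lanczos steps and compares each Ritz value with the closest eigenvalue:

```python
import numpy as np

# Check of the residual bound (1): each Ritz value lies within
# beta_{k+1}|s_{k,j}| of some eigenvalue of A (diagonal test matrix assumed).
lam = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
A = np.diag(lam)
v0 = np.ones(6) / np.sqrt(6.0)
k = 4
V = np.zeros((6, k + 1)); V[:, 0] = v0
alpha = np.zeros(k); beta = np.zeros(k)
for j in range(k):
    w = A @ V[:, j] - (beta[j - 1] * V[:, j - 1] if j > 0 else 0.0)
    alpha[j] = V[:, j] @ w
    w -= alpha[j] * V[:, j]
    beta[j] = np.linalg.norm(w)
    V[:, j + 1] = w / beta[j]
T = np.diag(alpha) + np.diag(beta[:k - 1], 1) + np.diag(beta[:k - 1], -1)
theta, S = np.linalg.eigh(T)     # Ritz values (ascending) and eigenvectors of T_k
```

Here `beta[k - 1]` plays the role of β_{k+1} and the last row of `S` contains the components s^{(k)}_{k,j}.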

Modification of the tridiagonal matrix
Given μ < θ^{(k)}_1, let us consider the problem of finding the coefficient α^{(μ)}_{k+1} such that the matrix T^{(μ)}_{k+1}, obtained from T_{k+1} by replacing its (k + 1, k + 1) entry α_{k+1} with α^{(μ)}_{k+1}, has the prescribed μ as an eigenvalue. The connection of this problem to the Gauss-Radau quadrature rule will be explained in Section 3.
In [17, pp. 331-334] it has been shown that the modified coefficient at iteration k + 1 can be written as α^{(μ)}_{k+1} = μ + y_k, where y_k is the last component of the vector y, the solution of the linear system (T_k − μI_k) y = β²_{k+1} e_k. From [10, Section 3.4], the modified coefficients α^{(μ)}_{k+1} can also be computed recursively. Using the spectral factorization of T_k, we can now prove the following lemma. For μ < θ^{(k)}_1, the matrix T_k − μI_k in (6) is positive definite and, therefore, nonsingular. Hence,

α^{(μ)}_{k+1} = μ + β²_{k+1} e_k^T (T_k − μI_k)^{-1} e_k = μ + β²_{k+1} Σ_{i=1}^{k} (s^{(k)}_{k,i})² / (θ^{(k)}_i − μ),

so that (8) holds with all terms of the sum positive. From (8) it is obvious that if we take λ = θ^{(k+1)}_1, which satisfies μ < λ < θ^{(k)}_1 (because of the interlacing of the Ritz values), we obtain α^{(λ)}_{k+1} = α_{k+1} by construction.
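A minimal numerical sketch of this construction (our own code, following the solve-and-append recipe described above): the last entry of the solution of (T_k − μI)y = β²_{k+1}e_k gives the modified diagonal entry, and μ indeed becomes an eigenvalue of the extended matrix.

```python
import numpy as np

# Modify the (k+1,k+1) entry of the extended tridiagonal matrix so that a
# prescribed mu < theta_1^{(k)} becomes an eigenvalue: solve
# (T_k - mu I) y = beta_{k+1}^2 e_k and set alpha^{(mu)}_{k+1} = mu + y_k.
alpha = np.array([2.0, 3.0, 2.5, 4.0])      # diagonal of an assumed T_k
beta = np.array([1.0, 0.8, 1.2])            # off-diagonal of T_k
T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
beta_next = 0.9                             # beta_{k+1}
mu = 0.5 * np.linalg.eigvalsh(T)[0]         # below the smallest Ritz value
k = T.shape[0]
e_k = np.zeros(k); e_k[-1] = 1.0
y = np.linalg.solve(T - mu * np.eye(k), beta_next**2 * e_k)
alpha_mu = mu + y[-1]
Tmu = np.zeros((k + 1, k + 1))              # extended matrix T^{(mu)}_{k+1}
Tmu[:k, :k] = T
Tmu[k, k - 1] = Tmu[k - 1, k] = beta_next
Tmu[k, k] = alpha_mu
evals = np.linalg.eigvalsh(Tmu)
```

The check works because the vector (y/β_{k+1}, −1) is then an eigenvector of T^{(μ)}_{k+1} with eigenvalue μ.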

CG and error norm estimation
When solving a linear system Ax = b with a symmetric and positive definite matrix A, the CG method (Algorithm 2) is the method of choice. In exact arithmetic, the CG iterates x_k minimize the A-norm of the error ‖x − y‖_A over the manifold x_0 + K_k(A, r_0), and the residual vectors r_k are collinear with the Lanczos vectors v_{k+1}. Thanks to this close relationship between the CG and Lanczos algorithms, it can be shown (see, for instance, [16]) that the recurrence coefficients computed in both algorithms are connected via α_1 = γ_0^{-1} and, for k ≥ 1,

α_{k+1} = 1/γ_k + δ_k/γ_{k−1},  β_{k+1} = √δ_k / γ_{k−1},  (10)

where δ_k = ‖r_k‖²/‖r_{k−1}‖². Writing (10) in matrix form, we find out that CG implicitly computes the L D L^T factorization T_k = L_k D_k L_k^T, where D_k = diag(γ_0^{-1}, . . . , γ_{k−1}^{-1}) and L_k is unit lower bidiagonal with subdiagonal entries √δ_1, . . . , √δ_{k−1}.

Algorithm 2 Conjugate gradient algorithm.
Hence the matrix T k is known implicitly in CG.
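The correspondence between the CG coefficients and the Lanczos matrix can be verified directly. The sketch below (our own helper names) assumes the standard relations α_{j+1} = 1/γ_j + δ_j/γ_{j−1} and β_{j+1} = √δ_j/γ_{j−1} with δ_j = ‖r_j‖²/‖r_{j−1}‖², assembles T_k = L_k D_k L_k^T from the CG coefficients, and should reproduce the Jacobi matrix that Lanczos computes from v_1 = r_0/‖r_0‖:

```python
import numpy as np

def cg_coeffs(A, b, k):
    """k CG steps with x_0 = 0; returns the iterate and the coefficients
    gamma_j (step lengths) and delta_j = ||r_j||^2 / ||r_{j-1}||^2."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    gammas, deltas = [], []
    for _ in range(k):
        Ap = A @ p
        gamma = (r @ r) / (p @ Ap)
        x = x + gamma * p
        r_new = r - gamma * Ap
        delta = (r_new @ r_new) / (r @ r)
        gammas.append(gamma)
        deltas.append(delta)
        r, p = r_new, r_new + delta * p
    return x, np.array(gammas), np.array(deltas)

def tridiag_from_cg(gammas, deltas):
    """Assemble T_k from the CG coefficients: alpha_1 = 1/gamma_0 and, for j >= 1,
    alpha_{j+1} = 1/gamma_j + delta_j/gamma_{j-1}, beta_{j+1} = sqrt(delta_j)/gamma_{j-1}."""
    k = len(gammas)
    alpha = np.empty(k)
    beta = np.empty(k - 1)
    alpha[0] = 1.0 / gammas[0]
    for j in range(1, k):
        alpha[j] = 1.0 / gammas[j] + deltas[j - 1] / gammas[j - 1]
        beta[j - 1] = np.sqrt(deltas[j - 1]) / gammas[j - 1]
    return np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
```

In exact arithmetic the two matrices agree exactly; in double precision they agree up to rounding for moderate k.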

Modification of the factorization of T_{k+1}
Similarly as in Section 2.2, we can ask how to modify the Cholesky factorization of T_{k+1} that is available in CG such that the resulting matrix T^{(μ)}_{k+1}, given implicitly in factorized form, has the prescribed eigenvalue μ. In more detail, we look for a coefficient γ^{(μ)}_k to replace γ_k in the factorization. This problem was solved in [10], leading to an updating formula for computing the modified coefficients γ^{(μ)}_k. Moreover, γ^{(μ)}_k can be obtained directly from the modified coefficient α^{(μ)}_{k+1} and vice versa; see [10, p. 173 and 181].
Let A = QΛQ^T be the spectral decomposition of A, with Q = [q_1, . . . , q_N] orthonormal and Λ = diag(λ_1, . . . , λ_N). As we said above, for simplicity of notation, we assume that the eigenvalues of A are distinct and ordered as λ_1 < λ_2 < · · · < λ_N. Let us define the weights ω_i by

ω_i = (q_i^T v_1)²,  v_1 = r_0 / ‖r_0‖,

and the (nondecreasing) stepwise constant distribution function ω(λ) with a finite number of points of increase λ_1, λ_2, . . . , λ_N,

ω(λ) = 0 for λ < λ_1,  ω(λ) = Σ_{i=1}^{j} ω_i for λ_j ≤ λ < λ_{j+1},  ω(λ) = 1 for λ_N ≤ λ.

Having the distribution function ω(λ) and an interval [η, ξ] such that η < λ_1 < λ_2 < · · · < λ_N < ξ, for any continuous function f one can define the Riemann-Stieltjes integral ∫_η^ξ f(λ) dω(λ); see, for instance, [18]. Using the optimality of CG, it can be shown that CG implicitly determines the nodes and weights of the k-point Gauss quadrature approximation to the Riemann-Stieltjes integral (14). The nodes are given by the eigenvalues of T_k, and the weights by the squared first components of the normalized eigenvectors of T_k. For f(λ) = λ^{-1}, the corresponding Gauss quadrature rule can be written in the form

∫_η^ξ λ^{-1} dω(λ) = (T_k^{-1})_{1,1} + R_k,  (15)

where (T_k^{-1})_{1,1} represents the Gauss quadrature approximation, and the remainder R_k is nothing but the scaled and squared A-norm of the kth error, ‖x − x_k‖²_A / ‖r_0‖², i.e., the quantity of our interest.
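The Gauss quadrature interpretation can be checked numerically: with x_0 = 0 (so r_0 = b), the integral equals b^T A^{-1} b/‖b‖², the computable part is (T_k^{-1})_{1,1}, and the remainder is ‖x − x_k‖²_A/‖b‖². A small self-contained sketch under this assumed setup:

```python
import numpy as np

# With x_0 = 0 and f(lambda) = 1/lambda, the Gauss rule gives
#   b^T A^{-1} b = ||b||^2 (T_k^{-1})_{1,1} + ||x - x_k||_A^2 ,
# i.e. the quadrature remainder is the scaled squared A-norm of the error.
rng = np.random.default_rng(2)
n, k = 20, 7
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.linspace(0.1, 5.0, n)) @ Q.T
b = rng.standard_normal(n)
x = np.linalg.solve(A, b)
# Lanczos with v_1 = b / ||b||
V = np.zeros((n, k + 1)); V[:, 0] = b / np.linalg.norm(b)
al = np.zeros(k); be = np.zeros(k)
for j in range(k):
    w = A @ V[:, j] - (be[j - 1] * V[:, j - 1] if j > 0 else 0.0)
    al[j] = V[:, j] @ w; w -= al[j] * V[:, j]
    be[j] = np.linalg.norm(w); V[:, j + 1] = w / be[j]
T = np.diag(al) + np.diag(be[:k - 1], 1) + np.diag(be[:k - 1], -1)
e1 = np.zeros(k); e1[0] = 1.0
xk = V[:, :k] @ np.linalg.solve(T, np.linalg.norm(b) * e1)   # CG iterate x_k
gauss = np.linalg.norm(b)**2 * np.linalg.inv(T)[0, 0]
err2 = (x - xk) @ A @ (x - xk)
```

The identity `b @ x == gauss + err2` then holds up to rounding, which is exactly the splitting in (15).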
To approximate the integral (14), one can also apply a modified quadrature rule. In this paper we consider the Gauss-Radau quadrature rule, which consists in prescribing a node 0 < μ ≤ λ_1 and choosing the other nodes and weights to maximize the degree of exactness of the quadrature rule. We can write the corresponding Gauss-Radau quadrature rule in the form

∫_η^ξ λ^{-1} dω(λ) = ((T^{(μ)}_k)^{-1})_{1,1} + R^{(μ)}_k,  (16)

where the remainder R^{(μ)}_k is negative, and T^{(μ)}_k is given by (5).
The idea of deriving (basic) quadrature-based bounds in CG is to consider the Gauss quadrature rule (15) at iteration k, and an (eventually modified) quadrature rule at iteration k + 1, where T̃_{k+1} = T_{k+1} when using the Gauss rule and T̃_{k+1} = T^{(μ)}_{k+1} in the case of using the Gauss-Radau rule. From the equations (15) and (16) we get

‖x − x_k‖²_A = ‖r_0‖² [ (T̃^{-1}_{k+1})_{1,1} − (T^{-1}_k)_{1,1} ] + ‖r_0‖² R̃_{k+1}.

The term in square brackets represents either a lower bound on ‖x − x_k‖²_A (Gauss rule, because of the positive remainder) or an upper bound (Gauss-Radau rule, because of the negative remainder). In both cases, the term in square brackets can easily be evaluated using the available CG-related quantities. In particular, the lower bound is given by γ_k ‖r_k‖², and the upper bound by γ^{(μ)}_k ‖r_k‖², where γ^{(μ)}_k can be updated using (12).
To summarize the results of [5, 7, 12] and [1, 10, 11] related to the Gauss and Gauss-Radau quadrature bounds for the A-norm of the error in CG, it has been shown that

γ_k ‖r_k‖² < ‖x − x_k‖²_A ≤ γ^{(μ)}_k ‖r_k‖² ≤ (‖r_k‖² / ‖p_k‖²) (‖r_k‖² / μ)  (18)

for k < n − 1, and μ such that 0 < μ ≤ λ_1. Note that in the special case k = n − 1 it holds that ‖x − x_{n−1}‖²_A = γ_{n−1} ‖r_{n−1}‖². If the initial residual r_0 has a nontrivial component in the eigenvector corresponding to λ_1, then λ_1 is an eigenvalue of T_n. If in addition μ is chosen such that μ = λ_1, then γ_{n−1} = γ^{(μ)}_{n−1} and the second inequality in (18) changes to equality. The last inequality is strict also for k = n − 1.
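The lower and upper bounds in the bracketed difference of quadrature approximations can also be evaluated directly from the tridiagonal matrices, without the recursive formulas. The following sketch (our own construction; the Gauss-Radau modification follows the solve-and-append recipe of Section 2.2) checks that the Gauss rule gives a lower bound and the Gauss-Radau rule an upper bound on ‖x − x_k‖²_A:

```python
import numpy as np

# Gauss rule at iteration k+1: lower bound on ||x - x_k||_A^2.
# Gauss-Radau rule with prescribed node mu < lambda_1: upper bound.
rng = np.random.default_rng(3)
n, m = 25, 9                          # m = k + 1 Lanczos steps
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam = np.sort(rng.uniform(0.05, 4.0, n))
A = Q @ np.diag(lam) @ Q.T
b = rng.standard_normal(n)
x = np.linalg.solve(A, b)
mu = 0.9 * lam[0]                     # underestimate of the smallest eigenvalue
V = np.zeros((n, m + 1)); V[:, 0] = b / np.linalg.norm(b)
al = np.zeros(m); be = np.zeros(m)
for j in range(m):
    w = A @ V[:, j] - (be[j - 1] * V[:, j - 1] if j > 0 else 0.0)
    al[j] = V[:, j] @ w; w -= al[j] * V[:, j]
    be[j] = np.linalg.norm(w); V[:, j + 1] = w / be[j]
k = m - 1
jac = lambda a, o: np.diag(a) + np.diag(o, 1) + np.diag(o, -1)
Tk, Tk1 = jac(al[:k], be[:k - 1]), jac(al, be[:m - 1])
ek = np.zeros(k); ek[-1] = 1.0
y = np.linalg.solve(Tk - mu * np.eye(k), be[k - 1]**2 * ek)
Tmu = Tk1.copy(); Tmu[k, k] = mu + y[-1]      # Gauss-Radau matrix T^{(mu)}_{k+1}
nb2 = np.linalg.norm(b)**2
first = lambda M: np.linalg.inv(M)[0, 0]
lower = nb2 * (first(Tk1) - first(Tk))
upper = nb2 * (first(Tmu) - first(Tk))
xk = V[:, :k] @ np.linalg.solve(Tk, np.linalg.norm(b) * np.eye(k)[:, 0])
err2 = (x - xk) @ A @ (x - xk)
```

In a practical CG run these brackets are of course evaluated as γ_k‖r_k‖² and γ^{(μ)}_k‖r_k‖²; the explicit matrix computation above only serves to illustrate where the bounds come from.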
The rightmost bound in (18), which will be called the simple upper bound in the following, was derived in [1]. The norm ‖p_k‖ is not available in CG, but the ratio ‖r_k‖²/‖p_k‖² can be computed efficiently using the recurrence (19). Note that at an iteration ℓ ≤ k we can obtain a more accurate bound by applying the basic bounds (18) to the last term in (20); see [1] for details on the construction of more accurate bounds. In practice, however, one runs the CG algorithm and estimates the error in a backward way, i.e., k − ℓ iterations back. The adaptive choice of the delay k − ℓ when using the Gauss quadrature lower bound was discussed recently in [19]. In the following we will concentrate on the analysis of the behavior of the basic Gauss-Radau upper bound γ^{(μ)}_k ‖r_k‖², see (21), in dependence of the choice of μ. As already mentioned, we observed in many examples that in earlier iterations the bound approximates the squared A-norm of the error quite well, but in later iterations it becomes worse: it is delayed and almost independent of μ. We observed that this phenomenon is related to the convergence of the smallest Ritz value to the smallest eigenvalue λ_1. In particular, the bound gets worse if the smallest Ritz value approximates λ_1 better than μ. This often happens during finite precision computations, when the convergence of CG is delayed because of rounding errors and there are clusters of Ritz values approximating individual eigenvalues of A. Usually, such clusters arise around the largest eigenvalues. At some iteration, each eigenvalue of A can be approximated by a Ritz value while the A-norm of the error still does not reach the required level of accuracy, and the process will continue and place more and more Ritz values in the clusters. In this situation, it can happen that λ_1 is tightly (that is, to a high relative accuracy) approximated by a Ritz value while the CG process still continues.
Note that if A has well separated eigenvalues and we run the experiment in exact arithmetic, then λ_1 is usually tightly approximated by a Ritz value only in the last iterations. The above observation is key for constructing an example in which we can readily observe the studied phenomenon also in exact arithmetic, and which will motivate our analysis.

The model problem and a numerical experiment
In the construction of the motivating example we use results presented in [16, 20-23]. Based on the work by Chris Paige [24], Anne Greenbaum [20] proved that the results of finite precision CG computations can be interpreted (up to some small inaccuracies) as the results of the exact CG algorithm applied to a larger system with a system matrix having many eigenvalues distributed throughout "tiny" intervals around the eigenvalues of the original matrix. The experiments show that "tiny" means of a size comparable to u‖A‖, where u is the roundoff unit. This result was used in [21] to predict the behavior of finite precision CG. Inspired by [20-22], we will construct a linear system Ax = b with similar properties as the one suggested by Greenbaum [20]. However, we want to emphasize and visualize some phenomena concerning the behavior of the basic Gauss-Radau upper bound (21). Therefore, we choose the size of the intervals around the original eigenvalues larger than u‖A‖.
We start with the test problem Λ̂ŷ = w from [23], where w = m^{-1/2}(1, . . . , 1)^T, Λ̂ = diag(λ̂_1, . . . , λ̂_m), and

λ̂_i = λ̂_1 + ((i − 1)/(m − 1)) (λ̂_m − λ̂_1) ρ^{m−i},  i = 1, . . . , m.

The diagonal matrix Λ̂ and the vector w determine the stepwise distribution function ω̂(λ) with points of increase λ̂_i and the individual jumps (weights) ω̂_j = m^{-1}, j = 1, . . . , m. We construct a blurred distribution function ω(λ) having clusters of points of increase around the original eigenvalues λ̂_i. We consider each cluster to have the same radius δ, and let the number c_i of points in the ith cluster grow linearly from 1 to p, so that the total number of points is N = Σ_{i=1}^m c_i. The blurred eigenvalues λ^{(i)}_j lie in the intervals [λ̂_i − δ, λ̂_i + δ], with the corresponding weights given by ω^{(i)}_j = (m c_i)^{-1}, i.e., the weights that correspond to the ith cluster are equal, and their sum is ω̂_i. Having defined the blurred distribution function ω(λ), we can construct the corresponding Jacobi matrix T ∈ R^{N×N} in a numerically stable way using the Gragg and Harrod rkpw algorithm [25]. Note that the mapping from the nodes and weights of the computed quadrature to the recurrence coefficients is generally well-conditioned [26, p. 59]. To construct the above-mentioned Jacobi matrix T we used Matlab's vpa arithmetic with 128 digits. Finally, we define the double precision data A and b that will be used for experimenting as the Jacobi matrix T rounded to double precision and b = e_1, where e_1 ∈ R^N is the first column of the identity matrix. We decided to use double precision input data since we can easily compare the results of our computations performed in Matlab's vpa arithmetic with the results obtained using double precision arithmetic for the same input data. In our experiment we choose m = 12, λ̂_1 = 10^{-6}, λ̂_m = 1, ρ = 0.8, δ = 10^{-10}, and p = 4, resulting in N = 30. Let us run the "exact" CGQ algorithm of [10] on the model problem (24) constructed above, where exact arithmetic is simulated using Matlab's variable precision with digits=128. Let λ_1 be the exact smallest eigenvalue of A.
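The construction of the blurred distribution can be sketched as follows. Several details are our assumptions, flagged in the comments: the exact eigenvalue formula attributed to [23], the rounding used for the linearly growing cluster sizes c_i, and the placement of the points inside each cluster; with these choices the parameters of the experiment give N = 30.

```python
import numpy as np

# Sketch of the blurred distribution. Assumptions (not fully specified in the
# text): the eigenvalue formula of the test problem from [23], rounding of the
# linearly growing cluster sizes via np.round, and uniform placement of the
# c_i points inside each cluster of radius delta.
m, p, rho, delta = 12, 4, 0.8, 1e-10
l1, lm = 1e-6, 1.0
i = np.arange(1, m + 1)
lam_hat = l1 + (i - 1) / (m - 1) * (lm - l1) * rho**(m - i)
c = np.round(1 + (i - 1) * (p - 1) / (m - 1)).astype(int)   # cluster sizes 1..p
nodes, weights = [], []
for li, ci in zip(lam_hat, c):
    pts = np.linspace(li - delta, li + delta, ci) if ci > 1 else np.array([li])
    nodes.extend(pts)
    weights.extend([1.0 / (m * ci)] * ci)   # equal weights summing to 1/m
nodes = np.array(nodes)
weights = np.array(weights)
```

The pairs (nodes, weights) define the blurred distribution function ω(λ); the paper then maps them to the Jacobi matrix T with the Gragg-Harrod rkpw algorithm in extended precision, a step not reproduced here.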
We use four different values of μ for the computation of the Gauss-Radau upper bound (21): μ_3 = (1 − 10^{-3})λ_1, μ_8 = (1 − 10^{-8})λ_1, μ_16, which denotes the double precision number closest to λ_1 such that μ_16 ≤ λ_1, and μ_50 = (1 − 10^{-50})λ_1, which is almost like the exact value. Note that γ^{(μ)}_k is updated using (12). The Gauss-Radau upper bounds in Fig. 1 first overestimate, and then closely approximate, ‖x − x_{k−1}‖_A (starting from iteration 5). However, at some point, the Gauss-Radau upper bounds start to differ significantly from ‖x − x_{k−1}‖_A and represent worse approximations, except for μ_50. We observe that for a given μ_i, i = 3, 8, 16, the upper bounds are delayed when the distance of θ^{(k)}_1 to λ_1 becomes smaller than the distance of μ_i to λ_1, i.e., when

θ^{(k)}_1 − λ_1 < λ_1 − μ_i.  (25)

If (25) holds, then the smallest Ritz value θ^{(k)}_1 is a better approximation to λ_1 than μ_i. This moment is emphasized using vertical dashed lines that connect the value θ^{(k)}_1 with the corresponding bound at the first iteration k for which (25) holds. Moreover, below a certain level, the upper bounds become almost independent of μ_i, i = 3, 8, 16, and visually coincide with the simple upper bound. The closer μ is to λ_1, the later this phenomenon occurs.
Depending on the validity of (25), we distinguish between phase 1 and phase 2 of convergence. If the inequality (25) does not hold, i.e., if μ is a better approximation to λ 1 than the smallest Ritz value, then we say we are in phase 1. If (25) holds, then the smallest Ritz value is closer to λ 1 than μ and we are in phase 2. Obviously, the beginning of phase 2 depends on the choice of μ and on the convergence of the smallest Ritz value to the smallest eigenvalue. Note that for μ = μ 50 we are always in phase 1 before we stop the iterations.
In the given experiment, as well as in many practical problems, the delay of the upper bounds is not large (just a few iterations), and the bounds can still provide useful information for stopping the algorithm. However, we have also encountered examples where the delay of the Gauss-Radau upper bound was about 200 iterations; see, e.g., [1, Fig. 10] or [19, Fig. 2] concerning the matrix s3dkt3m2. Hence, we believe that this phenomenon deserves attention and explanation.

Analysis
The upper bounds are computed from the modified tridiagonal matrices (5) discussed in Section 2.2, which differ only in the coefficient at position (k + 1, k + 1). Therefore, the first step of the analysis is to understand how the choice of μ and the validity of the condition (25) influence the value of the modified coefficient α^{(μ)}_{k+1}; see (8). We will compare its value to the modified coefficient α^{(λ_1)}_{k+1}, for which phase 2 does not occur; see Fig. 1 for μ_50. Based on that understanding we will then address further important questions. First, our aim is to explain the behavior of the basic Gauss-Radau upper bound (21) in phase 2, in particular its closeness to the simple upper bound (18). Second, for practical reasons, without knowing λ_1, we would like to be able to detect phase 2, i.e., the first iteration k for which the inequality (25) starts to hold. Finally, we address the problem of how to improve the accuracy of the basic Gauss-Radau upper bound (21) in phase 2.
We first analyze the relation between the two modified coefficients

α^{(μ)}_{k+1} = μ + β²_{k+1} Σ_{i=1}^{k} η^{(μ)}_{i,k}  and  α^{(λ_1)}_{k+1} = λ_1 + β²_{k+1} Σ_{i=1}^{k} η^{(λ_1)}_{i,k},  (26)

where

η^{(μ)}_{i,k} = (s^{(k)}_{k,i})² / (θ^{(k)}_i − μ)  and  η^{(λ_1)}_{i,k} = (s^{(k)}_{k,i})² / (θ^{(k)}_i − λ_1);  (27)

see (8).

Assumptions
Let us describe in more detail the situation we are interested in. In the analysis that follows we will assume implicitly the following.
1. λ_1 is well separated from λ_2, so that we can use the gap theorem mentioned in Section 2.1, in particular relation (3) bounding η^{(λ_1)}_{1,k}.
2. μ is a tight underestimate to λ_1 such that (30) holds.
3. The smallest Ritz value θ^{(k)}_1 converges to λ_1 with increasing k, so that there is an iteration index k from which (25) holds.
Let us briefly comment on these assumptions. The assumption that λ_1 is well separated from λ_2 is used later to prove that η^{(λ_1)}_{1,k} is bounded away from zero; see (33).
If there is a cluster of eigenvalues around λ_1, one can still observe the discussed phenomenon of the loss of accuracy of the upper bound, but a theoretical analysis would be much more complicated. Note that the first assumption is also often satisfied for a system matrix Â that models finite precision CG behavior, if the original matrix A has well separated eigenvalues λ_1 and λ_2. Using the results of Greenbaum [20], we know that Â can have many eigenvalues distributed throughout tiny intervals around the eigenvalues of A. We have constructed the model matrix Â in many numerical experiments, using the procedure suggested in [20]. We found out that the constructed Â usually has clusters of eigenvalues around the larger eigenvalues of A, while a smaller eigenvalue of A is usually approximated by just one eigenvalue of Â. Therefore, the analysis presented below can be applied to the matrix Â that models the finite precision CG behavior.
If μ is not a tight underestimate, then the Gauss-Radau upper bound is usually not a very good approximation of the A-norm of the error. Then the condition (25) can hold from the beginning and phase 1 need not happen.
Finally, in theory, the smallest Ritz value need not converge to λ_1 until the last iteration [27]. But in that case there will not be any problem with the Gauss-Radau upper bound. However, in practical computations we very often observe the convergence of θ^{(k)}_1 to λ_1. In particular, in cases of matrices Â with clustered eigenvalues that model the finite precision behavior of CG, θ^{(k)}_1 usually approximates λ_1 to a high relative accuracy before the A-norm of the error reaches its ultimate level of accuracy.

The modified coefficient α^{(μ)}_{k+1}
Below we would like to compare α^{(λ_1)}_{k+1}, for which phase 2 does not occur, with α^{(μ)}_{k+1}, for which phase 2 occurs; see Fig. 1. Using (27) and (30), we are able to compare the individual η-terms. In particular, for i > 1 the terms η^{(μ)}_{i,k} and η^{(λ_1)}_{i,k} are close to each other, since θ^{(k)}_i − μ ≈ θ^{(k)}_i − λ_1. Hence, α^{(μ)}_{k+1} can significantly differ from α^{(λ_1)}_{k+1} only in the first term of the sum in (26). If θ^{(k)}_1 is a better approximation to λ_1 than μ in the sense of (25), then (31) shows that η^{(λ_1)}_{1,k} can be much larger than η^{(μ)}_{1,k}. If, on the other hand, (25) does not hold for all k we are interested in, then phase 2 will not occur, and since μ is assumed to be a tight approximation to λ_1, we have η^{(μ)}_{i,k} ≈ η^{(λ_1)}_{i,k} for all i. In the following we discuss phase 1 and phase 2 in more detail. In phase 1, λ_1 − μ < θ^{(k)}_1 − λ_1 and, therefore, all components η^{(μ)}_{i,k} (including η^{(μ)}_{1,k}) are not sensitive to small changes of μ; see (27). In other words, the coefficients α^{(μ)}_{k+1} are approximately the same for various choices of μ.
Let us denote

h_k = (λ_1 − μ) / (θ^{(k)}_1 − λ_1).

In fact, we can write θ^{(k)}_1 − μ = θ^{(k)}_1 − λ_1 + λ_1 − μ and use the Taylor expansion of 1/(1 + h_k). It yields

1/(θ^{(k)}_1 − μ) = (1/(θ^{(k)}_1 − λ_1)) (1 − h_k + h_k² − · · · ).

Obviously, h_k is an increasing function of the iteration number k; the numerator is constant while the denominator is decreasing in absolute value. The size of h_k also depends on how well μ approximates λ_1. If μ is a tight approximation to λ_1, then, at the beginning of the CG iterations, the denominator of h_k can be large compared to the numerator, h_k is small, and the right-hand side of the above expansion is close to 1/(θ^{(k)}_1 − λ_1). We observed that the first term of the sum of the η^{(μ)}_{i,k} is then usually the largest one.
Let us now discuss phase 2. First recall that for any 0 < μ < λ_1 it holds that η^{(μ)}_{i,k} < η^{(λ_1)}_{i,k}; see (27). As before, suppose that λ_1 is well separated from λ_2 and that (30) holds. Phase 2 begins when θ^{(k)}_1 is a better approximation to λ_1 than μ, i.e., when (25) holds. Since θ^{(k)}_1 is a tight approximation to λ_1 in phase 2, (3) and (25) imply that η^{(λ_1)}_{1,k} is bounded away from zero; see (33). On the other hand, (3) also implies that (s^{(k)}_{k,1})² goes to zero as θ^{(k)}_1 converges to λ_1, and since the denominator θ^{(k)}_1 − μ approaches λ_1 − μ > 0, the term η^{(μ)}_{1,k} goes to zero as well. Therefore,

α^{(μ)}_{k+1} ≈ μ + β²_{k+1} Σ_{i=2}^{k} η^{(μ)}_{i,k},

and the sum on the right-hand side is almost independent of μ. Note that for two values 0 < μ < λ < λ_1 such that both satisfy (25), the corresponding coefficients α^{(μ)}_{k+1} and α^{(λ)}_{k+1} are close to each other, where we have used (27) and the assumption (34). Therefore, α^{(μ)}_{k+1} is relatively insensitive to small changes of μ, and the same is true for the upper bound (21).

The coefficient α_{k+1}
The coefficient α_{k+1} can also be written in the form (8) with μ replaced by θ^{(k+1)}_1, and the results of Lemmas 1 and 2 are still valid, even though, in practice, μ must be smaller than λ_1. Using (28) we can express the difference between the coefficients; it holds that

α_{k+1} − α^{(λ_1)}_{k+1} = (θ^{(k+1)}_1 − λ_1) + β²_{k+1} Σ_{i=1}^{k} (η^{(θ^{(k+1)}_1)}_{i,k} − η^{(λ_1)}_{i,k}).  (36)

If the smallest Ritz value θ^{(k+1)}_1 is close to λ_1, then the second term on the right-hand side of (36) will be negligible in comparison to the first one, since the differences of the η-terms can be bounded using (29), and since η^{(λ_1)}_{1,k} is bounded away from zero; see (33). Therefore, one can expect that

α_{k+1} − α^{(λ_1)}_{k+1} ≈ θ^{(k+1)}_1 − λ_1.  (37)

The size of the term on the right-hand side is related to the speed of convergence of the smallest Ritz value θ^{(k)}_1 to λ_1. For example, if the convergence of θ^{(k)}_1 to λ_1 is superlinear, i.e., if the corresponding convergence factor ρ_k satisfies ρ_k → 0, then α_{k+1} and α^{(λ_1)}_{k+1} are close.

Numerical experiments
Let us demonstrate numerically the theoretical results described in the previous sections using our model problem. To compute the following results we again use Matlab's vpa arithmetic with 128 decimal digits. We first consider μ = μ_3 = (1 − 10^{-3})λ_1, for which we have λ_1 − μ = 10^{-9}. The switch from phase 1 to phase 2 occurs at iteration 13. Figure 2 displays the first term η^{(μ)}_{1,k} and the maximum term max_i η^{(μ)}_{i,k}, as well as the sum ζ^{(μ)}_k defined by (9) (see Lemma 1), as a function of the iteration number k. In phase 1 the first term η^{(μ)}_{1,k} is the largest one. As predicted, after the start of phase 2, the first term decreases quite fast. The behavior of the first term is completely different for μ = (1 − 10^{-50})λ_1, which almost corresponds to using the exact smallest eigenvalue λ_1.
The maximum term of the sum is then almost always the first one; see Fig. 4. Remember that, for this value of μ, we are always in phase 1.
Finally, in Fig. 5 we present a comparison of the sums ζ^{(μ)}_k for μ_3, μ_8, and μ_50. We observe that from the beginning up to iteration 12, all sums visually coincide. Starting from iteration 13 we enter phase 2 for μ = μ_3, and the corresponding sum starts to depart from the other two, which still visually coincide. In Fig. 6 we plot the coefficients α^{(μ)}_k for the considered values of μ together with α_k, so that we can compare the observed behavior with the predicted one. Phase 2 starts for μ_3 at iteration 13, and for μ_8 at iteration 15; see also Fig. 1. For k ≤ 13 we observe that the coefficients α^{(μ)}_k are approximately the same for the various choices of μ, as explained in Section 5.2, and α_k is larger. For k ≥ 16, the first terms η^{(μ)}_{1,k} corresponding to μ_3 and μ_8 are small and the associated coefficients almost coincide, while the bound corresponding to μ_16 is still in phase 1. We can also observe that α_k can be very close to α^{(λ_1)}_k when the smallest Ritz value θ^{(k)}_1 is a tight approximation to λ_1, i.e., in later iterations. We know that the closeness of α_k to α^{(λ_1)}_k depends on the speed of convergence of the smallest Ritz value to λ_1; see (37) and the corresponding discussion.

The Gauss-Radau bound in phase 2
Our aim in this section is to investigate the relation between the basic Gauss-Radau upper bound (21) and the simple upper bound; see (18). Recall the notation for the ratio ‖r_k‖²/‖p_k‖²; see (19). In particular, we would like to explain why the two bounds almost coincide in phase 2. Note that using (13) we obtain an expression for γ^{(μ)}_k in terms of the modified coefficient, and from (8) it follows that

α^{(μ)}_{k+1} − μ = β²_{k+1} e_k^T (T_k − μI)^{-1} e_k.

In the following lemma we give another expression for e_k^T (T_k − μI)^{-1} e_k.
Proof. Since ‖μT_k^{-1}‖ < 1, we obtain using a Neumann series

(T_k − μI)^{-1} = T_k^{-1} (I − μT_k^{-1})^{-1} = Σ_{j=0}^{∞} μ^j T_k^{-(j+1)}.

We now express the terms on the right-hand side using the CG coefficients and the quantities from the spectral factorization of T_k. Using T_k = L_k D_k L_k^T we obtain e_k^T T_k^{-1} e_k = γ_{k−1}. After some algebraic manipulation (see, e.g., [28, p. 1369]) we obtain the stated expression for e_k^T (T_k − μI)^{-1} e_k. Based on the previous lemma we can now express the coefficient γ^{(μ)}_k.

Proof. We start with (39). Using the previous lemma and relation (19), the expression (41) for γ^{(μ)}_k follows.
Obviously, using (41), the basic Gauss-Radau upper bound (21) and the simple upper bound in (18) are close to each other if and only if the condition (42) holds, which can also be written in the form (43). Under the assumptions formulated in Section 5.1, in particular that λ_1 is well separated from λ_2 and that μ is a tight underestimate to λ_1 in the sense of (30), the sum of terms on the left-hand side of (43) can be replaced by its tight upper bound (44), which simplifies the explanation of the dependence of the sum in (43) on μ.
The second term in (44) is independent of μ, and its size depends only on the behavior of the underlying Lanczos process. Here, the quantity (45) can be seen as the relative accuracy to which the ith Ritz value approximates an eigenvalue. If the first term in (47) becomes sufficiently small, which will happen for k sufficiently large, then it is smaller than the term on the right-hand side. As already mentioned, the sum of positive terms in (47) depends only on approximation properties of the underlying Lanczos process, which are not easy to predict in general. Inspired by our model problem described in Section 4, we can just give an intuitive explanation of why the sum could be small in phase 2.
Phase 2 occurs in later CG iterations and is related to the convergence of the smallest Ritz value to the smallest eigenvalue. If the smallest eigenvalue is well approximated by the smallest Ritz value (to a high relative accuracy), then one can expect that many eigenvalues of A are relatively well approximated by Ritz values. If the eigenvalue λ_j of A is well separated from the other eigenvalues and is well approximated by a Ritz value, then the corresponding term (45), measuring the relative accuracy to which λ_j is approximated, is going to be small.
In particular, in our model problem, the smallest eigenvalues are well separated from each other, and in phase 2 they are well approximated by Ritz values. Therefore, the corresponding terms (45) are small. The Ritz values that have not yet converged in phase 2 approximate eigenvalues in clusters that do not correspond to the smallest eigenvalues, i.e., clusters for which the terms (46) are small; see also Figs. 2 and 3. In our model problem, the sum of positive terms in (47) is thus small in phase 2 because either (45) or (46) is small. Therefore, one can expect that the validity of (47) will mainly depend on the size of the first term in (47); see Fig. 7.
The size of the sum of positive terms in (47) obviously depends on the clustering and the distribution of the eigenvalues, and we cannot guarantee in general that it will be small in phase 2. For example, it need not be small if the smallest eigenvalues of A are clustered. For our model problem it is not hard to detect phase 2 from the coefficients that are available during the computations. We first observe, see Fig. 7, that the coefficients and the corresponding bounds (21) and (18) visually coincide from the beginning up to some iteration k_1. From iteration k_1 + 1, the Gauss-Radau upper bound (21) starts to be a much better approximation to the squared A-norm of the error than the simple upper bound (18). When phase 2 occurs, the Gauss-Radau upper bound (21) loses its accuracy and, starting from some iteration k_2 (approximately when (48) holds), it again visually coincides with the simple upper bound (18). We observe that phase 2 occurs at an iteration k where the two coefficients (49) significantly differ. To measure the agreement between the coefficients (49), we can use the easily computable relative distance (50). We will consider this relative distance to be small if it is smaller than 0.5. The behavior of the term in (50) for various values of μ is shown in Fig. 8. The index k_1 = 12 is the same for all considered values of μ. For μ_3 we get k_2 = 15 (red circle), for μ_8 we get k_2 = 18 (magenta circle), for μ_16 we get k_2 = 25 (blue circle), and finally, for μ_50 there is no index k_2.
As explained in the previous section, in more complicated cases we cannot guarantee in general that the relative distance (50) behaves as in our model problem. For example, in many practical problems we sometimes observe a staircase behavior of the A-norm of the error, where a few iterations of stagnation are followed by a few iterations of rapid convergence. In such cases, the quantity (50) can oscillate several times, and it can be impossible to use it for detecting phase 2. Therefore, in general, we are not able to detect the beginning of phase 2 using (50) reliably. Nevertheless, in particular cases, the formulas (41) and (50) can be helpful.

Fig. 8 The behavior of the relative distance in (50) for various values of μ

Upper bounds with a guaranteed accuracy
In some applications it might be of interest to obtain upper bounds on the A-norm of the error that are sufficiently accurate. From the previous sections we know that the basic Gauss-Radau upper bound at iteration k can be delayed and can therefore significantly overestimate the quantity of interest. Nevertheless, going back in the convergence history, we can easily find an iteration index ℓ ≤ k such that for all 0 ≤ i ≤ ℓ a sufficiently accurate upper bound is available. To find such an ℓ, we use the ideas described in [12] and [19]. For integers k ≥ j ≥ ℓ ≥ 0, we introduce auxiliary quantities for which one can show that if (54) holds, then (53) holds as well. The adaptive strategy just described for obtaining an ℓ giving a sufficiently accurate upper bound is summarized in Algorithm 3. Moreover, if (54) holds, then τ also represents an upper bound on the sum of the relative errors of the improved lower and upper bounds. In other words, if ℓ is such that (54) is satisfied, then both the improved Gauss-Radau upper bound and the improved Gauss lower bound are sufficiently accurate. For a heuristic strategy focused on improving the accuracy of the Gauss lower bound, see [19].

Algorithm 3 CG with the improved Gauss-Radau upper bound
In the previous sections we have seen that the basic Gauss-Radau upper bound is delayed, in particular in phase 2. The delay of the basic Gauss-Radau upper bound at iteration k can be defined as the smallest nonnegative integer j such that (55) holds. Having sufficiently accurate lower and upper bounds (e.g., if (54) is satisfied), we can approximately determine the delay of the basic Gauss-Radau upper bound as the smallest j satisfying (55), where ε_k in (55) is replaced by its tight lower bound Δ_{ℓ:k}.

Conclusions
In this paper we discussed and analyzed the behavior of the Gauss-Radau upper bound on the A-norm of the error in CG. In particular, we concentrated on the phenomenon observed during computations showing that, in later CG iterations, the upper bound loses its accuracy, is almost independent of μ, and visually coincides with the simple upper bound. We explained that this phenomenon is closely related to the convergence of the smallest Ritz value to the smallest eigenvalue of A. It occurs when the smallest Ritz value is a better approximation to the smallest eigenvalue than the prescribed underestimate μ. We developed formulas that can be helpful in understanding this behavior. Note that the loss of accuracy of the Gauss-Radau upper bound is not directly linked to rounding errors in the computation of the bound, but rather to the finite precision behavior of the underlying Lanczos process. In more detail, in exact arithmetic the phenomenon can occur when solving linear systems with clustered eigenvalues. However, the results of finite precision CG computations can be seen (up to some small inaccuracies) as the results of the exact CG algorithm applied to a larger system whose matrix has clustered eigenvalues. Therefore, one can expect that the discussed phenomenon can occur in practical computations not only when A has clustered eigenvalues, but also whenever orthogonality is lost in the CG algorithm.