Singular value decomposition in extended double precision arithmetic

A well-known and successful algorithm to compute the singular value decomposition (SVD) of a matrix was published by Golub and Reinsch (Numer. Math. 14:403–420, 1970), together with an implementation in Algol. We give an updated implementation in extended double precision arithmetic in the C programming language. Extended double precision is native for Intel x86 processors and provides improved accuracy at full hardware speed. The complete program for computing the SVD is listed. Additionally, a comprehensive explanation of the original algorithm of Golub and Reinsch (Numer. Math. 14:403–420, 1970) is given at an elementary level without referring to the more general results of Francis (Comput. J. 4:265–271, 1961, 1962).

In the C programming language, the type specifier long double is reserved for the declaration of extended double precision variables. Some compilers, like the GNU compiler, Clang, and Intel's C/C++ compiler, implement long double using 80-bit extended double precision numbers on x86 architectures. Extended double precision thus becomes easily available for computations and should be used, as it offers improved accuracy at full computational speed. We give a C implementation of the singular value decomposition (SVD) in extended double precision arithmetic. The program is based on the algorithm published by Golub and Reinsch in [3].
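As a minimal illustration of the extra accuracy (function names are ours, not from the na59 package): an increment of half a double ulp is lost in double arithmetic but survives in extended precision whenever long double carries more significand bits, as it does on x86 with the compilers named above.

```c
#include <float.h>

/* (1 + DBL_EPSILON/2) - 1 rounds to 0 in double arithmetic: the
   increment falls below the last bit of the 53-bit significand. */
double lost_in_double(void)
{
    volatile double x = 1.0 + DBL_EPSILON / 2;  /* volatile: force the double rounding */
    return x - 1.0;
}

/* The same increment survives in extended precision whenever
   long double has more significand bits than double (64 on x86). */
long double kept_in_long_double(void)
{
    volatile long double x = 1.0L + (long double)DBL_EPSILON / 2;
    return x - 1.0L;
}
```

On platforms where long double coincides with double (some non-x86 targets), the second function returns 0 as well; the point of the paper is precisely that x86 offers more at no cost.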
Let A ∈ R^{m,n} be a real m × n matrix and let ℓ := min{m, n}. It is well known that A = U Σ V^T, where U ∈ R^{m,m}, Σ ∈ R^{m,n}, and V ∈ R^{n,n} are such that U^T U = I_m, V^T V = I_n and Σ_{i,i} = σ_i, i = 1, ..., ℓ, all other elements of Σ being zero. The numbers σ_1, ..., σ_ℓ are the non-negative square roots of the ℓ largest eigenvalues of A^T A. We shall assume that σ_1 ≥ σ_2 ≥ ... ≥ σ_ℓ ≥ 0. The computation of U, V, and Σ splits into two parts. The first part is a reduction of A to a bidiagonal matrix. The second part is an implicit application of the QR algorithm (with shifts) to this bidiagonal matrix. We describe bidiagonalization in Section 2. Section 3 is a reminder about the QR algorithm for tridiagonal symmetric matrices and gives an elementary explanation of Francis' method to use implicit shifts. Section 4 explains how QR steps can implicitly be performed on a bidiagonal matrix. Section 5 gives details on how to compute the shift parameter reliably and Section 6 gives a stopping criterion for the QR iteration. Section 7 outlines how to arrange for computations in extended precision and in Section 8, a complete C function for computing the SVD is given. Test cases are investigated in Section 9 and we conclude in Section 10.

Bidiagonalization
A = (a_{i,j}) is a real matrix with m rows i = 0, ..., m−1, counting down from top to bottom, and with n columns j = 0, ..., n−1, counting across from left to right.
With the pair of rows 0 and i, i = 1, ..., m−1, we perform plane rotations from the left,

    a_{0,j} := cos·a_{0,j} + sin·a_{i,j},
    a_{i,j} := −sin·a_{0,j} + cos·a_{i,j},    j = 0, ..., n−1,

overwriting the old with the new values. (This is P_i^T A with the orthogonal m×m matrix P_i^T that acts as the rotation with entries cos, sin, −sin, cos in rows/columns 0 and i.) We choose cos and sin as follows: let p := a_{0,0}, q := a_{i,0}, r := ±√(p² + q²) with the same sign as p, cos := p/r and sin := q/r (cos = 1, sin = 0 if q = 0). Then a_{i,0} gets eliminated, i.e., its new value is 0. Note that plane rotations from the left leave the sum of squares in a column invariant. Thus, while a_{i,0} gets annihilated, a_{0,0}² gets bigger: the erasing is some sort of collecting. After i = m−1, a_{0,0}² is positive unless A has a zero column. In the same way, with the pair of columns 1 and j, j = 2, ..., n−1, we perform plane rotations from the right,

    a_{i,1} := a_{i,1}·cos + a_{i,j}·sin,
    a_{i,j} := −a_{i,1}·sin + a_{i,j}·cos,    i = 0, ..., m−1,

again overwriting the old with the new values. Thus we do not only generate zeros in the leftmost column but also zeros in the top row after the first two entries, which here are called d_0 and e_1. Plane rotations from the right leave the sum of squares in a row invariant, thus d_0² + e_1² > 0 unless there is a zero row. Such plane rotations, from the left or from the right, chosen in order to erase a selected matrix entry are called Givens rotations. Together, we have main step number one. More such main steps follow, all a repetition of the first one, but each time with another column and another row less.
Let us assume for the moment that m ≥ n. Then, with n − 1 such main steps we can reduce the matrix A to bidiagonal form B. Ignoring the trivial (zero) rows i = n, ..., m−1, one has the n × n upper bidiagonal matrix

    B with diagonal elements d_0, ..., d_{n−1} and superdiagonal elements e_1, ..., e_{n−1}, all other elements being zero.    (2)

With e_0 := 0, each column of B has the same form. The remaining B is square with det(B) = d_0 ··· d_{n−1} and det(B^T B) = d_0² ··· d_{n−1}².
If required, we accumulate all plane rotations from the left in U^T and all plane rotations from the right in V. These are orthogonal matrices of order m and n. It is helpful for understanding the coming steps to extend the m×n matrix A to the right by the m×m unit matrix I_m and down by the n×n unit matrix I_n. With these two extensions we have a working area with m + n rows/columns of length m + n. At the beginning it is

    [ A    I_m ]
    [ I_n      ]

where A = U B V^T. After the plane rotations from the left and from the right it is

    [ B    U^T ]
    [ V        ]

All steps of the second part of the algorithm (to be described below) follow this scheme of applying plane rotations from the left and from the right to further reduce B to diagonal form while updating U^T and V.
The case m < n was omitted in the Algol program published in [3], but is included now. It is best to consider A^T and to apply the algorithm to the n × m matrix A^T. Anyway, as described in Section 7, a copy of A in extended precision format is needed. At the moment this copy is created, one may as well copy A^T. The C function given in Section 8 does so and then computes V^T instead of U^T and U instead of V. It is easy to take this swap into account when returning the matrices U and V. Therefore, we may continue to assume m ≥ n in the following.
The present version of the SVD uses exclusively plane rotations. All Householder reflections from the version of 1967 [3] have been replaced, although they need only half the number of multiplications. But plane rotations will be needed anyway in the second part of the algorithm, and all of them can be done by calling two functions, viz.
PREPARE() to compute the desired values cos and sin for the next plane rotation, and ROTATE() to apply this plane rotation to a pair of BLAS-style vectors x[ ], y[ ] (defined by equidistant storage in memory, for example matrix rows, columns, diagonals, or single elements).
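The ROTATE() operation can be sketched as follows; the strided-vector view (one increment argument per vector) mirrors the description above, while the exact signature in the na59 package may differ:

```c
/* Apply the plane rotation (c, s) to two strided vectors of length n:
   an increment of 1 walks along a row of a row-major matrix, an
   increment equal to the row length walks down a column, and n = 1
   with arbitrary increments addresses single elements. */
void rotate(int n, long double c, long double s,
            long double *x, int incx, long double *y, int incy)
{
    for (int i = 0; i < n; i++, x += incx, y += incy) {
        long double xi = *x, yi = *y;
        *x =  c * xi + s * yi;
        *y = -s * xi + c * yi;
    }
}
```

With this single routine, rotations from the left (pairs of rows) and from the right (pairs of columns) differ only in the increments passed.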
As said above, the second part of the algorithm will consist in applying further plane rotations from the left and from the right to B with the goal of reducing B to a diagonal matrix. When e_j = 0 occurs for some index j > 0 (iteration indices are dropped), then the matrix B can be split into two bidiagonal submatrices of order j (rows 0, ..., j−1 of B) and n − j (rows j, ..., n−1 of B), which may be diagonalized independently of each other. At any time, the second part of the algorithm will iteratively diagonalize the southernmost remaining bidiagonal submatrix of B with non-vanishing off-diagonal elements. The position of this submatrix is described by two indices ℓ and k, both in the range 0, ..., n−1, defining three diagonal blocks of B as illustrated in (3): rows 0, ..., ℓ−1; rows ℓ, ..., k; and rows k+1, ..., n−1. Here, ℓ = 0 means an empty first block, and k = n−1 means an empty third block. The third "bidiagonal" block in fact already is a diagonal matrix and can be ignored for all further computations; |d_{k+1}|, ..., |d_{n−1}| are singular values. The middle block with lower row index ℓ and upper row index k is characterized by non-vanishing off-diagonal elements e_{ℓ+1}, ..., e_k and can be diagonalized independently of the upper bidiagonal block (rows 0, ..., ℓ−1 of B). The elements e_1, ..., e_{ℓ−1} may be zero or not. After each iteration step taken in the second part, k and ℓ are updated by scanning for zero elements e_j. When k = ℓ = 0, then B is completely diagonalized. When ℓ = k > 0, then e_k = 0, |d_k| is a singular value, and the trivial 1×1 matrix (d_k) may be split off the middle block in (3); thus k will be decremented by 1. When ℓ < k, then one continues to operate on the bidiagonal submatrix of B in rows ℓ, ..., k, which will be denoted by B_{ℓ,k} (4). In practice, zero tests must be replaced by tests |e_j| ≤ tol. A proper choice of tol will be discussed in Section 6.
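The scan for k and ℓ after each iteration step can be sketched like this (function name and in/out convention are ours, not the paper's):

```c
#include <math.h>

/* Update k and ell by scanning the superdiagonal e[] (entry e[j]
   couples rows j-1 and j; e[0] is unused). First k is decreased past
   converged 1x1 blocks with |e[k]| <= tol, then ell is moved up until
   |e[ell]| <= tol or ell = 0, so that e[ell+1], ..., e[k] are all
   above the tolerance. k == ell == 0 signals complete diagonalization. */
void find_block(const long double *e, long double tol, int *k, int *ell)
{
    while (*k > 0 && fabsl(e[*k]) <= tol)
        (*k)--;
    *ell = *k;
    while (*ell > 0 && fabsl(e[*ell]) > tol)
        (*ell)--;
}
```

After the call, either k = ell = 0 (done) or rows ell, ..., k hold the southernmost bidiagonal block to iterate on.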

Symmetric QR steps with shifts
This section is a reminder about one of the most successful algorithms of linear algebra, the QR iteration for the diagonalization of matrices. Heinz Rutishauser invented it in 1961 (then in the form of an equivalent LR iteration). The name comes from the basic iteration step for the matrix X =: X^(0) to be diagonalized:

    X^(i) = Q_i R_i,    X^(i+1) := R_i Q_i,    i = 0, 1, ...,

with Q_i orthogonal and R_i upper triangular. The step from X^(i) to X^(i+1) is a similarity transformation: all X^(i) have the same eigenvalues. The absolute values of these invariant eigenvalues, arranged in decreasing order, play a decisive role in analyzing the asymptotic behavior of the sequence X^(i). The QR iteration will be applied to the symmetric tridiagonal matrix X := B^T B, where B is initially given by (2) and more generally by (4), but in this section it is considered in its own right. To describe a single QR step applied to a tridiagonal symmetric matrix, the iteration index (i) is dropped. Let X have diagonal elements δ_0, ..., δ_{n−1} and off-diagonal elements ε_1, ..., ε_{n−1}, where ε_j couples rows j−1 and j. As a consequence of the tridiagonal form, each QR step is rather special: (1) Q is a product of n − 1 rotations Q_j in plane (j−1, j), j = 1, ..., n−1, to eliminate ε_j; (2) X stays symmetric and tridiagonal (the minor steps are X → Q_j^T X Q_j). The basic iteration step for the QR algorithm with shifts is described as

    X − sI = QR,    X̄ := RQ + sI,

where Q and R are orthogonal and upper triangular, respectively, but not the same as above, depending on the value of s. As before, one has X̄ = Q^T X Q. The value s is called the shift. It should be chosen close to one of the eigenvalues of X. The closer the shift is to such an eigenvalue, the smaller is some diagonal element of R (in most cases the last one) and the smaller is the (last) row and column of X̄. Good shifts make Rutishauser's LR or QR iteration so successful. In [5], Wilkinson proved global convergence of the sequence X^(i) to a diagonal matrix for two different strategies to choose a shift.
The convergence rate usually is cubic, i.e., for sufficiently small η, |ε_{n−1}| ≤ η implies that the next iteration reduces |ε_{n−1}| to O(|ε_{n−1}|³). We devote Section 5 to Wilkinson's idea and proposals for the shift.
The factorization X − sI = QR and the recombination X̄ = RQ + sI may be combined into what is known as a QR step with implicit shift. This was found by Francis [1] in a much more general setting, but shall be explained here on an elementary level. Assume that no off-diagonal element of X vanishes, i.e., ε_j ≠ 0 for j = 1, ..., n−1. In this case, the orthogonal matrix Q is unique up to a scaling of its columns by ±1. The matrix Q can always be written as a product of n − 1 rotations Q_j in plane (j−1, j), j = 1, ..., n−1. The first rotation Q_1^T is to be chosen such that it eliminates the entry ε_1 of X − sI (schematically, just noting the upper 2 × 2 minors), i.e.,

    cos = (δ_0 − s)/r,    sin = ε_1/r,    r = ±√((δ_0 − s)² + ε_1²),    (5)

(or the negatives of these values). Since ε_1 ≠ 0, there cannot be a zero division in (5) and moreover sin ≠ 0. It is known that

    Q^T X Q = Q_{n−1}^T ··· Q_1^T X Q_1 ··· Q_{n−1}    (6)

is tridiagonal, but the matrices X_i, i = 1, ..., n−2, deviate from the tridiagonal form. Schematically, X_1 := Q_1^T X Q_1 carries an extra element β_2 in positions (2, 0) and (0, 2). The extra element β_2 is called the bulge element. It is defined by β_2 = sin·ε_2 with sin from (5), so that β_2 ≠ 0. Now choose a rotation Q̄_2^T in plane (1, 2), such that multiplication of X_1 from the left with Q̄_2^T eliminates β_2 in position (2, 0). With ε'_1 denoting the current entry of X_1 in position (1, 0), this is achieved by the choice

    cos = ε'_1/ρ,    sin = β_2/ρ,    ρ = ±√(ε'_1² + β_2²).    (7)

In X_2 := Q̄_2^T X_1 Q̄_2, the bulge β_2 has moved down one position along the diagonal to become β_3 = sin·ε_3, with the value sin defined in (7), now in positions (3, 1) and (1, 3). Since β_2 ≠ 0, one has sin ≠ 0 again and therefore β_3 ≠ 0. Continuing this way, chasing down the bulge along the diagonal, one arrives at X_{n−2}. A last rotation Q̄_{n−1}^T in plane (n−2, n−1) is chosen such that multiplication of X_{n−2} from the left with Q̄_{n−1}^T eliminates β_{n−1} in position (n−1, n−3). Then

    X_{n−1} = Q̄_{n−1}^T ··· Q̄_2^T Q_1^T X Q_1 Q̄_2 ··· Q̄_{n−1}    (8)

is tridiagonal again. Because all bulges are non-zero, all rotations Q̄_i, i = 2, ..., n−1, are uniquely determined (up to phase factors, i.e., up to multiplications by ±1). This means that once Q_1 is set, the need to chase down the bulge until it disappears and to re-establish the tridiagonal form uniquely determines Q̄_2, ..., Q̄_{n−1} (up to phase factors).
But tridiagonal form will also be established by using the rotations Q_2, ..., Q_{n−1} as in (6). The conclusion is Q̄_j = Q_j for j = 2, ..., n−1 (up to phase factors). The choice of Q_1 according to (5) (which is the only place where the shift explicitly enters the computation) and the choice of Q̄_2, ..., Q̄_{n−1} such as to chase down the bulge will make (8) an implementation of the QR step. Thus, the symmetric QR step with shift for the matrix X = B^T B can be done as follows: (1) Choose a good shift s.
(2) Choose Q_1^T according to (5), compute X_1 := Q_1^T X Q_1, and chase down the bulge, successively building X_j := Q̄_j^T X_{j−1} Q̄_j, j = 2, ..., n−1. Then X̄ := X_{n−1} is the next matrix X.

Implicit QR steps on bidiagonal matrix
The QR iteration could explicitly be applied to the matrix X = B^T B with B = B_{ℓ,k} from (4); the double index of B_{ℓ,k} will be dropped now. The non-trivial diagonal and off-diagonal entries of X then are

    X_{j,j} = d_j² + e_j²  (j = ℓ, ..., k)   and   X_{j−1,j} = X_{j,j−1} = d_{j−1} e_j  (j = ℓ+1, ..., k).    (9)

However, for reasons of economy and accuracy, one should avoid dealing with B^T B. Rather, it is preferable to work with B alone. In fact it is possible to do the similarity transformations on X with two-sided plane rotations, viz. B → P_j^T B Q_j, j = ℓ, ..., k−1, where all matrices Q_j and P_j^T are Givens rotations. To be precise, assume that

    d_{j−1} ≠ 0 and e_j ≠ 0 for j = ℓ+1, ..., k    (10)

(according to (9) this means that X has no vanishing off-diagonal elements), and choose Q_ℓ as if performing the first minor step of one QR iteration step (with shift) for X. From (5) and (9) it can be seen that this means to choose Q_ℓ as a rotation in the plane (ℓ, ℓ+1) with

    cos = (d_ℓ² − s)/r,    sin = d_ℓ e_{ℓ+1}/r,    r = ±√((d_ℓ² − s)² + (d_ℓ e_{ℓ+1})²),    (11)

(or the negatives of these values), depending on the shift parameter s. Multiplying B with Q_ℓ from the right gives (schematically) a matrix B_{ℓ+1} that is no longer an upper bidiagonal matrix because of the bulge element b_{ℓ+1} = sin·d_{ℓ+1} in position (ℓ+1, ℓ). The bulge element cannot vanish, since sin ≠ 0 and d_{ℓ+1} ≠ 0, see (10). To eliminate b_{ℓ+1}, multiply B_{ℓ+1} from the left by a rotation P_ℓ^T in the plane (ℓ, ℓ+1). Up to a phase factor, this rotation must be given by

    cos = d'_ℓ/ρ,    sin = b_{ℓ+1}/ρ,    ρ = ±√(d'_ℓ² + b_{ℓ+1}²),    (12)

where d'_ℓ denotes the current (ℓ, ℓ) entry. The bulge has moved and becomes b'_{ℓ+1} = sin·e_{ℓ+2} in position (ℓ, ℓ+2). Since b_{ℓ+1} ≠ 0, also b'_{ℓ+1} ≠ 0.
To eliminate b'_{ℓ+1}, multiply from the right with a rotation Q̄_{ℓ+1} like in (12), but now in the plane (ℓ+1, ℓ+2). Because of b'_{ℓ+1} ≠ 0, this rotation must have a component sin ≠ 0. The multiplication moves the bulge to b_{ℓ+2} = sin·d_{ℓ+2} in position (ℓ+2, ℓ+1). To eliminate the bulge b_{ℓ+2}, multiply from the left by a rotation P_{ℓ+1}^T in the plane (ℓ+1, ℓ+2). Since b_{ℓ+2} ≠ 0, this rotation once more is uniquely determined (up to a phase factor) with sin ≠ 0. Continuing this way, the bulge initially introduced by multiplication with Q_ℓ is chased down in knight's moves until it disappears. This is illustrated in Fig. 2, where the bulge at different moments in the computation is notated as x, x', x'', and x'''. All moves in north-eastern direction are effected by plane rotations from the left and all moves in south-western direction are effected by plane rotations from the right. The rotations Q̄_{ℓ+1}, ..., Q̄_{k−1} from the right and P_ℓ^T, ..., P_{k−1}^T from the left are all uniquely determined (up to phase factors) once Q_ℓ is chosen. In the end, with Q̄ := Q_ℓ · Q̄_{ℓ+1} ··· Q̄_{k−1} and P := P_ℓ ··· P_{k−1}, the matrix P^T B Q̄ =: B̄ is bidiagonal again. Therefore

(B̄)^T B̄ = Q̄^T B^T B Q̄ = Q̄^T X Q̄
is a tridiagonal matrix again. The first rotation Q_ℓ was explicitly chosen as needed to perform a step of the QR iteration. It was seen in Section 3 that this, together with the fact that Q̄^T X Q̄ is tridiagonal, is enough to conclude that Q̄^T X Q̄ is the result of a full QR step performed on X.
The conclusion is that one step of the QR iteration (with shift s) for X = B^T B can be done implicitly as follows: (1) Depending on the chosen shift s, set Q_ℓ as in (11) and multiply B from the right with Q_ℓ. (2) Multiply with further rotations Q̄_{ℓ+1}, ..., Q̄_{k−1} and P_ℓ^T, ..., P_{k−1}^T to chase down the bulge until it disappears.
The above arguments depend on (10), but a violation of these conditions is even an advantage, since then savings are possible. In case |e_k| ≤ tol, |d_k| is accepted as a singular value and k is decremented by 1. In case |e_i| ≤ tol for some ℓ ≤ i < k, e_i is neglected and the matrix B is split into two bidiagonal matrices which are diagonalized separately. The saving for d_i = 0 is almost as obvious: one can chase this zero to the right with plane rotations P̄_{i+1}^T, ..., P̄_k^T from the left in rows i and j = i+1, ..., k, so that the matrix B gets a zero row. Now B can be split into two bidiagonal matrices to be diagonalized separately. In case |d_i| ≤ tol, the effect is the following. Multiplication with P̄_{i+1}^T annihilates e_{i+1} and creates new elements x_{i+1} in position (i, i+2) and η_{i+1} in position (i+1, i). Multiplication with P̄_{i+2}^T annihilates x_{i+1} (in row i) and creates new elements x_{i+2} in position (i, i+3) and η_{i+2} in position (i+2, i). Finally, multiplication with P̄_k^T annihilates x_{k−1} (in row i) and creates η_k in position (k, i). The result is a matrix in which (all modified elements being overlined) row i contains only the diagonal element d̄_i, while column i contains the new elements η_{i+1}, ..., η_k below the diagonal. By the orthogonality of plane rotations, which leave the sum of squares in a column invariant,

    d̄_i² + η_{i+1}² + ··· + η_k² = d_i² ≤ tol²,

which shows that all elements d̄_i, η_{i+1}, ..., η_k have a magnitude less than or equal to tol and can be neglected. B can again be split into two bidiagonal matrices. The case of a vanishing or negligible diagonal element d_i is called cancellation.

Choosing a shift
The way shifts are chosen is an essential and characteristic part of the SVD. The shift has only one purpose: to decrease |e_k|, the last off-diagonal entry in the current iteration matrix B = B_{ℓ,k} (again, the double index (ℓ, k) will be dropped). Remember that a good shift is close to one of the eigenvalues of the matrix X = B^T B. Following [3], the shift is taken as an eigenvalue of the trailing 2×2 submatrix of X, which by (9) is

    [ y² + g²    hy    ]
    [   hy     h² + z² ]      with y := d_{k−1}, g := e_{k−1}, z := d_k, h := e_k.

We want the root s which is closer to h² + z², i.e., the root t := h² + z² − s closest to zero of the quadratic

    t² + 2fhy·t − h²y² = 0,      f := (y² + g² − h² − z²)/(2hy).

It is good numerical practice to first compute the larger root t_1 and then the smaller one t_2 from t_1 t_2 = −h²y². As h and y are non-zero one can form f as above. Let w := √(f² + 1); then the quadratic has the roots hy(−f ± w). If f > 0, then hy(−f − w) is the one with bigger modulus and t = hy/(f + w) is the one with smaller modulus. If f < 0, then hy(−f + w) is the one with bigger modulus and hy/(f − w) is the one with smaller modulus. The shift to be chosen thus is

    s = h² + z² − hy/(f + w)   for f > 0,      s = h² + z² − hy/(f − w)   for f < 0.

The only place where the shift enters into the computation of the implicit QR step is (11). Evidently the values for cos and sin in (11) do not change if d_ℓ² − s is replaced by d_ℓ − s/d_ℓ and if d_ℓ e_{ℓ+1} is replaced by e_{ℓ+1}. This fact will be used in the program of Section 8.
For this second choice of shift (called shift strategy (b) in [5]), Wilkinson proved global convergence with a guaranteed quadratic convergence rate. Almost always convergence is cubic, as confirmed by the test cases examined in Section 9. One therefore expects s → z² as h → 0 with y bounded. After convergence, k is decremented until, with enough QR steps, k = 0 is reached.

Test for convergence
When |e_k| ≤ tol, then |d_k| is accepted as a new singular value and k is decremented by 1. When |e_i| ≤ tol for some i < k, then the matrix B is split into two parts and the singular values of both parts are computed separately, as said above. The cancellation case |d_i| ≤ tol was considered in Section 4. It remains to choose the tolerance tol.
Here, we repeat the proposal from [3] to choose

    tol := mach · max_{0 ≤ i ≤ n−1} (|d_i| + |e_i|)    (with e_0 := 0),

with mach the machine precision and B the initial bidiagonal matrix from (2).
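A sketch of this choice in code (the function name is ours; for long double computations, mach = LDBL_EPSILON):

```c
#include <float.h>
#include <math.h>

/* Stopping tolerance: machine precision times the largest row sum
   |d_i| + |e_i| of the initial bidiagonal matrix B (e[0] unused,
   treated as zero), following our reading of the proposal in [3]. */
long double svd_tol(int n, const long double *d, const long double *e)
{
    long double norm = 0.0L;
    for (int i = 0; i < n; i++) {
        long double row = fabsl(d[i]) + (i > 0 ? fabsl(e[i]) : 0.0L);
        if (row > norm) norm = row;
    }
    return LDBL_EPSILON * norm;
}
```

The tolerance is computed once, from the initial B, and kept fixed for all subsequent QR iterations.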

Extended precision arithmetic
On the preceding pages, all algorithmic aspects of computing the real matrices A, U, V, and B have been described without mentioning their common data type. Now we are specific. We assume that a user keeps A stored as a variable in standard IEEE double floating point format and wants to get U, V, and Σ back as variables in double format. We call the corresponding variables external and use lower case letters to denote them. We perform the SVD computation inside a C function SVD in long double extended precision format. This function needs internal copies of data type long double of the external variables. The internal variables shall be denoted by upper case letters. We thus have the pairs a – A, u – P, v – Q, and d – D, plus the internal variable E; the matrix B has no external counterpart, as it is only computed internally. Its diagonal is stored in D (which will be overwritten by the singular values on the diagonal of Σ) and its superdiagonal is stored in E. Explanations for the internal variables P and Q will be given in a moment.
When SVD is called, at first the C library function malloc is invoked to dynamically allocate memory for the internal variables. Then matrix A is copied from a to A, if m ≥ n. If, however, m < n, then A T is copied to A, as announced in Section 2.
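The copy (and, for m < n, the transposition) can be sketched as follows; the helper name and row-major storage convention are ours, not the na59 interface:

```c
#include <stdlib.h>

/* Copy the external double matrix a (m x n, row-major) into a freshly
   allocated internal long double matrix. If m < n, the transpose is
   stored instead, so the internal matrix always has max(m,n) rows and
   min(m,n) columns, as described in Section 2. Returns NULL on
   allocation failure; the caller frees the result. */
long double *copy_internal(int m, int n, const double *a)
{
    int ma = m >= n ? m : n, mi = m >= n ? n : m;
    long double *A = malloc((size_t)ma * mi * sizeof *A);
    if (!A) return NULL;
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            if (m >= n) A[i*mi + j] = a[i*n + j];  /* straight copy */
            else        A[j*mi + i] = a[i*n + j];  /* store A^T     */
    return A;
}
```

Each double value converts to long double exactly, so no accuracy is lost in this step.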
With ma := max{m, n} and mi := min{m, n}, A can always be considered an ma × mi matrix. All rotations from the left will be accumulated in P (of dimension ma × ma) and all rotations from the right will be accumulated in Q (of dimension mi × mi). As seen in Section 2, at the end of the computation P holds U^T and Q holds V, if m ≥ n. If, however, m < n, then Q holds U and P holds V^T. Of course, P and Q must be initialized as identity matrices before the accumulations can start.
As already said in the introduction, on the Intel x86 architecture all extended precision arithmetic operations are done with a 64-bit significand at full hardware speed. Also, the exponent of long double has 15 bits and thus four more bits than are provided for double. Therefore, there will be no problems with exponent overflow or underflow when squaring elements a_{i,j} of A.
At the end of the computation in function SVD, the internal variables are rounded to double format and copied to the external variables a, u, v, and d. This is the moment to copy the singular values in descending order from D to d. Rows of P and columns of Q must be copied in the corresponding order. Finally, the allocated memory for the internal variables must be freed (the 64-bit precision results are lost at this point).
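Sorting into descending order with the corresponding row/column exchanges can be sketched as a simple selection sort (the na59 code may organize this differently):

```c
/* Sort the singular values in D into descending order, swapping rows
   of P and columns of Q along (P is ma x ma and holds U^T, Q is
   mi x mi and holds V, for m >= n; both row-major), so that the
   decomposition A = P^T Sigma Q^T is preserved. */
void sort_singular_values(int ma, int mi, long double *D,
                          long double *P, long double *Q)
{
    for (int i = 0; i < mi; i++) {
        int big = i;
        for (int j = i + 1; j < mi; j++)
            if (D[j] > D[big]) big = j;
        if (big == i) continue;
        long double t = D[i]; D[i] = D[big]; D[big] = t;
        for (int c = 0; c < ma; c++) {       /* swap rows i, big of P */
            t = P[i*ma + c]; P[i*ma + c] = P[big*ma + c]; P[big*ma + c] = t;
        }
        for (int r = 0; r < mi; r++) {       /* swap columns i, big of Q */
            t = Q[r*mi + i]; Q[r*mi + i] = Q[r*mi + big]; Q[r*mi + big] = t;
        }
    }
}
```

Swapping a row of P together with the matching column of Q applies the same permutation on both sides of Σ, so A = P^T Σ Q^T is unchanged.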

C program
An implementation of the SVD in the C programming language is available from the web page https://www.unibw.de/eit mathematik/forschung/eprecsvd as the na59 package.
In the header file SVD.h one may switch on or off computations in extended double precision and choose between Rayleigh shifts (corresponding to shift strategy (a) in [5]) and Wilkinson shifts (shift strategy (b) in [5]).
The file SVD.c contains an implementation of the SVD algorithm. All constants of data type long double are marked by a suffix W. The printf statements are included for testing only and can be omitted. At the end of SVD a function testSVD is called, which is implemented in SVDtestFunc.c. Its purpose is to test whether the fundamental identities U^T U = I_m, V^T V = I_n, and AV = UΣ (nearly) hold. The calling of this function can also be omitted, of course.
The file SVDtest2.c contains an implementation of all the test cases discussed in the following section.

Test results
Computations were carried out on an Intel i7-9850H processor; programs were compiled with the GNU compiler, version 10.2.0. The results obtained by computations in extended double precision arithmetic are denoted Ũ, Ṽ, and Σ̃. These results were rounded to standard double precision format to give Û, V̂, and Σ̂. The singular values on the diagonal of Σ̂ are denoted σ̂_k.
When exact singular values are known, they are rounded to standard double precision format and denoted σ_k. In the cases where exact singular values are not known, numerical approximations were computed via an implementation of the SVD in 256-bit arithmetic, based on the GNU MPFR multiple precision library. The singular values obtained from a multiple precision computation were then rounded to standard double precision format and taken as substitutes for the unknown exact singular values. The abbreviation μ(A) := max_{i,j} |a_{i,j}| for a matrix A = (a_{i,j}) is used below.
A second example is the 10 × 7 Hilbert matrix A. In this case, exact singular values are not known, and substitutes for them were obtained from a computation in 256-bit arithmetic, as described above. After a total of 8 QR iteration steps, the results given in Table 2 were obtained. A third example is taken from [2]: an 18 × 12 matrix of rank 6, so that σ_7 = ... = σ_12 = 0. Substitutes for the exact singular values σ_1, ..., σ_6 of A were obtained from a computation in 256-bit arithmetic, as described above. After a total of 12 QR iteration steps, the results shown in Table 3 were found. Again, the fundamental identities were checked. A fourth example is taken from [3]. It is the 20 × 21 matrix with exact singular values σ_{21−k} = k(k + 1), k = 1, ..., 20.
The method converged after 39 QR iterations, with the results shown in Table 4. In this example, all values |σ_k − σ̂_k| were 0, i.e., computed and exact values were identical when both were rounded to standard double precision. The fundamental identities were checked here as well.
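The matrix of the second example can be generated as follows, assuming the conventional Hilbert definition a_{i,j} = 1/(i + j + 1) in the 0-based indexing of Section 2 (this standard definition is our assumption):

```c
/* Fill the m x n Hilbert matrix in row-major order, with the
   conventional entries a_{i,j} = 1/(i + j + 1), i and j from 0. */
void hilbert_matrix(int m, int n, double *a)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            a[i*n + j] = 1.0 / (i + j + 1);
}
```

Hilbert matrices are severely ill-conditioned, which makes them a natural stress test for the accuracy gained by extended precision.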

Conclusions
An implementation of the SVD in extended precision arithmetic substantially improves the accuracy of the computed results. This improvement comes at no loss in computational speed when extended precision arithmetic is native, as on the x86 architecture. Only a minor programming effort is necessary to use the capabilities of extended precision arithmetic, and the same program also runs in standard double precision arithmetic. So it is advantageous to use the updated program. We have also given a full, elementary explanation of the algorithm of Golub and Reinsch; in [3], not all details were explained on an elementary level.