On the convergence of Krylov methods with low-rank truncations

Low-rank Krylov methods are one of the few options available in the literature to address the numerical solution of large-scale general linear matrix equations. These routines amount to well-known Krylov schemes that have been equipped with a couple of low-rank truncations to maintain a feasible storage demand in the overall solution procedure. However, such truncations may affect the convergence properties of the adopted Krylov method. In this paper we show how the truncation steps have to be performed in order to maintain the convergence of the Krylov routine. Several numerical experiments validate our theoretical findings.


Introduction
We are interested in the numerical solution of general linear matrix equations of the form

  sum_{i=1}^p A_i X B_i^T + C_1 C_2^T = 0,   (1)

where A_i ∈ R^{n_A×n_A}, B_i ∈ R^{n_B×n_B} are large matrices that allow matrix-vector products A_i v, B_i w to be efficiently computed for all i = 1, ..., p, and any v ∈ R^{n_A}, w ∈ R^{n_B}. Moreover, C_1, C_2 are supposed to be of low rank, i.e., C_1 ∈ R^{n_A×q}, C_2 ∈ R^{n_B×q}, q ≪ n_A, n_B. For the sake of simplicity we consider the case n_A = n_B ≡ n in the following, so that the solution X ∈ R^{n×n} is a square matrix, but our analysis can be applied to the rectangular case, n_A ≠ n_B, as well.
Many common linear matrix equations can be written as in (1). For instance, if p = 2 and B_1 = A_2 = I_n, with I_n the identity matrix of order n, we get the classical Sylvester equation. Moreover, if B_2 = A_1, A_2 = B_1, and C_1 = C_2, the Lyapunov equation is attained. These equations are ubiquitous in signal processing and control and systems theory. See, e.g., [1,11,65]. The discretization of certain elliptic PDEs yields Lyapunov and Sylvester equations as well. See, e.g., [14,45].
Generalized Lyapunov and Sylvester equations amount to a Lyapunov/Sylvester operator plus a general linear operator:

  A X + X A^T + sum_{i=1}^m N_i X N_i^T + C C^T = 0,  and  A X + X B + sum_{i=1}^m N_i X M_i + C_1 C_2^T = 0.

See, e.g., [8,29]. These equations play an important role in model order reduction of bilinear and stochastic systems, see, e.g., [8,9,16], and many problems arising from the discretization of PDEs can be formulated as generalized Sylvester equations as well. See, e.g., [45,49,68].
General multiterm linear matrix equations of the form (1) have been attracting attention in the recent literature because they arise in many applications such as the discretization of deterministic and stochastic PDEs, see, e.g., [5,48], PDE-constrained optimization problems [61], data assimilation [20], matrix regression problems arising in computational neuroscience [36], fluid-structure interaction problems [68], and many more.
Even when the coefficient matrices A_i and B_i in (1) are sparse, the solution X is, in general, dense and it cannot be stored for large-scale problems. However, for particular instances of (1), such as the ones above, and under certain assumptions on the coefficient matrices, a fast decay in the singular values of X can be proved and, thus, the solution admits accurate low-rank approximations of the form S_1 S_2^T ≈ X, S_1, S_2 ∈ R^{n×t}, t ≪ n, so that only the low-rank factors S_1 and S_2 need to be computed and stored. See, e.g., [4,8,29,46].
For the general multiterm linear equation (1), robust low-rank approximability properties of the solution have not been established so far, even though X turns out to be numerically low-rank in many cases. See, e.g., [20,61]. In the rest of the paper we thus assume that the solution X to (1) admits accurate low-rank approximations.
The efficient computation of the low-rank factors S_1 and S_2 is the task of the so-called low-rank methods, and many different algorithms have been developed in the last decade for both generalized and standard Lyapunov and Sylvester equations. A non-exhaustive list of low-rank methods for such equations includes projection methods proposed in, e.g., [19,29,48,53,55], low-rank (bilinear) ADI iterations [8,10,38], sign function methods [6,7], and Riemannian optimization methods [33,67]. We refer the reader to [56] for a thorough presentation of low-rank techniques.
To the best of our knowledge, few options are present in the literature for the efficient numerical solution of general equations (1): A greedy low-rank method by Kressner and Sirković [32], and low-rank Krylov procedures (e.g., [8,20,35,61]) which are the focus of this paper.
Krylov methods for matrix equations can be seen as standard Krylov subspace schemes applied to the n²×n² linear system

  A x = −vec(C_1 C_2^T),  A := sum_{i=1}^p B_i ⊗ A_i,   (2)

where ⊗ denotes the Kronecker product and vec : R^{n×n} → R^{n²} is such that vec(X) is the vector obtained by stacking the columns of the matrix X one on top of the other. These methods construct the Krylov subspace

  K_m(A, vec(C_1 C_2^T)) = span{vec(C_1 C_2^T), A vec(C_1 C_2^T), ..., A^{m−1} vec(C_1 C_2^T)},   (3)

and compute an approximate solution of the form vec(X_m) = V_m y_m ≈ vec(X), where V_m = [v_1, ..., v_m] ∈ R^{n²×m} has orthonormal columns and is such that Range(V_m) = K_m(A, vec(C_1 C_2^T)), with y_m ∈ R^m. The vector y_m can be computed in different ways which depend on the selected Krylov method. The most common schemes are based either on a (Petrov-)Galerkin condition on the residual vector or on a minimization procedure for the residual norm; see, e.g., [51].
The coefficient matrix A in (2) is never assembled explicitly in the construction of K_m(A, vec(C_1 C_2^T)); its Kronecker structure is instead exploited to efficiently perform matrix-vector products. Moreover, to keep the memory demand low, the basis vectors of K_m(A, vec(C_1 C_2^T)) must be stored in low-rank format. To this end, the Arnoldi procedure used to compute V_m has to be equipped with a couple of low-rank truncation steps. In particular, a low-rank truncation is performed after the "matrix-vector product" A v_m, where v_m denotes the last basis vector, and during the orthogonalization process. See, e.g., [61, Section 3], [35, Section 2], [20, Section 3] and section 2.
In principle, the truncation steps can affect the convergence of the Krylov method and the well-established properties of Krylov schemes (see, e.g., [51]) may no longer hold. However, it has been numerically observed that Krylov methods with low-rank truncations often achieve the desired accuracy, even when the truncation strategy is particularly aggressive. See, e.g., [20,61].
In this paper we establish some theoretical foundations to explain the convergence of Krylov methods with low-rank truncations. In particular, the full orthogonalization method (FOM) [51, Section 6] and the generalized minimal residual method (GMRES) proposed in [52] are analyzed.
We assume that two different truncation steps are performed within our routine and, to show that the convergence is maintained, we interpret these truncations in two distinct ways. First, the truncation performed after the matrix-vector product A v_m is seen as an inexact matrix-vector product and results from [58] are employed. Second, the low-rank truncations that take place during the orthogonalization procedure are viewed as a structured perturbation of the new basis vector that preserves orthogonality; the perturbed vector is still orthogonal to the previous ones.
We would like to underline the fact that the schemes studied in this paper significantly differ from the tensorized Krylov methods analysed in, e.g., [34]. Indeed, our A is not a Laplace-like operator in general, i.e., A cannot be written as A = I_n ⊗ M_1 + M_2 ⊗ I_n for some matrices M_1, M_2 ∈ R^{n×n}.

The following is a synopsis of the paper. In section 2 we review the low-rank formulation of FOM and GMRES, and their convergence is proved in section 3. In particular, in sections 3.1-3.2 the two different interpretations of the low-rank truncation steps are presented. Some implementation aspects of these low-rank truncations are discussed in section 4. It is well known that Krylov methods must be equipped with effective preconditioning techniques in order to achieve a fast convergence in terms of number of iterations. Due to some peculiar aspects of our setting, the preconditioners must be carefully designed, as we discuss in section 5. Short recurrence methods like CG, MINRES and BICGSTAB can be very appealing in our context due to their small memory requirements and low computational efforts per iteration. Even though their analysis can be cumbersome since the computed basis is not always orthogonal (e.g., the orthogonality may be lost in finite precision arithmetic), their application to the solution of (1) is discussed in section 6.
Several numerical examples reported in section 7 support our theoretical analysis. The paper finishes with some conclusions given in section 8.
Throughout the paper we adopt the following notation. The matrix inner product is defined as ⟨X, Y⟩_F := trace(Y^T X), so that the induced norm is ‖X‖_F = ⟨X, X⟩_F^{1/2}. In the paper we repeatedly use the identity vec(Y)^T vec(X) = ⟨X, Y⟩_F, so that ‖vec(X)‖_2² = ‖X‖_F². Moreover, the cyclic property of the trace operator allows for a cheap evaluation of matrix inner products with low-rank matrices. Indeed, if X = X_1 X_2^T and Y = Y_1 Y_2^T, with X_1, X_2 ∈ R^{n×r_1}, Y_1, Y_2 ∈ R^{n×r_2}, then ⟨X, Y⟩_F = trace((X_2^T Y_2)(Y_1^T X_1)) and only matrices of small dimensions r_i are involved in such a computation. Therefore, even if it is not explicitly stated, we will always assume that matrix inner products with low-rank matrices are cheaply computed without assembling any dense n×n matrix. For the sake of simplicity we will omit the subscript in ‖·‖_F and write only ‖·‖.
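For illustration, the following minimal Python/NumPy sketch implements this trace trick; the function name and the sanity check are ours and not part of any library.

```python
import numpy as np

def lowrank_frob_inner(X1, X2, Y1, Y2):
    """<X, Y>_F = trace(Y^T X) for X = X1 X2^T and Y = Y1 Y2^T.

    By the cyclic property of the trace, only the small Gram matrices
    Y1^T X1 and X2^T Y2 are formed, never a dense n x n product.
    """
    return np.trace((Y1.T @ X1) @ (X2.T @ Y2))

# sanity check against the dense evaluation
rng = np.random.default_rng(0)
n, r1, r2 = 400, 4, 3
X1, X2 = rng.standard_normal((n, r1)), rng.standard_normal((n, r1))
Y1, Y2 = rng.standard_normal((n, r2)), rng.standard_normal((n, r2))
assert np.isclose(lowrank_frob_inner(X1, X2, Y1, Y2),
                  np.trace((Y1 @ Y2.T).T @ (X1 @ X2.T)))
```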
The k-th singular value of a matrix M ∈ R^{m_1×m_2} is denoted by σ_k(M), where the singular values are assumed to be ordered in a decreasing fashion. The condition number of M is denoted by κ(M) = σ_1(M)/σ_p(M), where p = rank(M), i.e., σ_p(M) is the smallest nonzero singular value of M.
As already mentioned, I_n denotes the identity matrix of order n, and the subscript is omitted whenever the dimension of I is clear from the context. The i-th canonical basis vector of R^n is denoted by e_i, while 0_m is a vector of length m whose entries are all zero.
The brackets [·] are used to concatenate matrices of conforming dimensions. In particular, a Matlab-like notation is adopted: [M, N] denotes the matrix obtained by stacking M and N one next to the other, whereas [M; N] is the one obtained by stacking M and N one on top of the other, i.e., [M; N] = [M^T, N^T]^T. The notation diag(M, N) is used to denote the block diagonal matrix with diagonal blocks M and N.

Low-rank FOM and GMRES
In this section we review the low-rank formulations of FOM (LR-FOM) and GMRES (LR-GMRES) for the solution of the multiterm matrix equation (1).
Low-rank Krylov methods compute an approximate solution of the form

  vec(X_m) = x_0 + V_m y_m ≈ vec(X).   (4)

In the following we will always assume the initial guess x_0 to be the zero vector 0_{n²}; in Remark 3.2 such a choice is motivated. Therefore, the m orthonormal columns of V_m = [v_1, ..., v_m] ∈ R^{n²×m} in (4) span the Krylov subspace (3), and y_m ∈ R^m. One of the peculiarities of low-rank Krylov methods is that the basis vectors must be stored in low-rank format. We thus write v_j = vec(V_{1,j} V_{2,j}^T), where V_{1,j}, V_{2,j} ∈ R^{n×s_j}, s_j ≪ n, for all j = 1, ..., m.
The basis V_m can be computed by a reformulation of the underlying Arnoldi process (see, e.g., [51, Section 6.4]) that exploits the Kronecker structure of A and the low-rank format of the basis vectors. In particular, at the m-th iteration, the n²-vector v̄ = A v_m must be computed. For sparse matrices A_i, B_i, a naive implementation of this operation costs O(nnz(A)) floating point operations (flops), where nnz(A) denotes the number of nonzero entries of A. However, it can be replaced by the linear combination

  v̄ = vec( sum_{i=1}^p (A_i V_{1,m})(B_i V_{2,m})^T ),

where 2p s_m matrix-vector products with matrices of order n are performed. The cost of this operation is O((max_i nnz(A_i) + max_i nnz(B_i)) p s_m) flops, and it is thus much cheaper than computing v̄ naively via the matrix-vector product with A, since nnz(A) = O(max_i nnz(A_i) · max_i nnz(B_i)), s_m is supposed to be small, and p is in general moderate. A similar argument carries over when (some of) the matrices A_i, B_i are not sparse but still allow efficient matrix-vector products.
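A minimal Python/NumPy sketch of this factored matrix-vector product (the function name is ours); it returns the factors of v̄ = A v_m without ever forming A:

```python
import numpy as np

def lr_matvec(A_list, B_list, V1, V2):
    """Factors W1, W2 with vec(W1 W2^T) = (sum_i B_i kron A_i) vec(V1 V2^T).

    W1 W2^T = sum_i (A_i V1)(B_i V2)^T; only 2*p*s products with matrices
    of order n are performed, the n^2 x n^2 matrix A is never assembled.
    """
    W1 = np.hstack([Ai @ V1 for Ai in A_list])   # n x (p*s)
    W2 = np.hstack([Bi @ V2 for Bi in B_list])   # n x (p*s)
    return W1, W2
```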
Moreover, the low-rank format is preserved in the computation of v̄: we have v̄ = vec(V̄_1 V̄_2^T) with V̄_1 = [A_1 V_{1,m}, ..., A_p V_{1,m}] and V̄_2 = [B_1 V_{2,m}, ..., B_p V_{2,m}] ∈ R^{n×p s_m}. In order to avoid an excessive increment in the column dimension p s_m of V̄_1, V̄_2, it is necessary to perform a column compression of the factors V̄_1 and V̄_2, i.e., the factors are overwritten by [V̄_1, V̄_2] ← trunc(V̄_1, I, V̄_2, ε_A). With trunc(L, M, N, ε_trunc) we denote any routine that computes low-rank approximations of the product L M N^T with a desired accuracy of order ε_trunc, so that the discarded part has norm of order ε_trunc ‖L M N^T‖. Algorithm 1 illustrates a standard approach for such compressions that is based on thin QR factorizations and an SVD thereafter (see, e.g., [35, Section 2.2.1]) and is used in the remainder of the paper. Some alternative truncation schemes are discussed in section 4.
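A minimal Python sketch of such a QR+SVD-based compression; the relative Frobenius-norm truncation criterion used below is our assumption on how the threshold is interpreted:

```python
import numpy as np

def trunc(L, M, N, eps):
    """Low-rank factors T1, T2 with T1 T2^T ~ L M N^T (cf. Algorithm 1).

    Thin QR factorizations of the tall factors, an SVD of the small core,
    and truncation of the trailing singular values.
    """
    QL, RL = np.linalg.qr(L)                      # L = QL RL
    QN, RN = np.linalg.qr(N)                      # N = QN RN
    U, s, Wt = np.linalg.svd(RL @ M @ RN.T)
    # suffix[k] = ||(s_k, ..., s_end)||_2 is the error when keeping k triplets
    suffix = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]
    k = next((k for k in range(1, s.size) if suffix[k] <= eps * suffix[0]),
             s.size)
    T1 = QL @ (U[:, :k] * s[:k])                  # absorb Sigma_k on the left
    T2 = QN @ Wt[:k].T
    return T1, T2
```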
The vector vec(V̄_1 V̄_2^T) ≈ v̄ returned by the truncation algorithm is then orthogonalized with respect to the previous basis vectors vec(V_{1,j} V_{2,j}^T), j = 1, ..., m. Such an orthogonalization step can be implemented by performing, e.g., the modified Gram-Schmidt procedure, and the low-rank format of the quantities involved can be exploited and maintained in the result. The vector formulation of the orthogonalization step is given by

  ṽ = v̄ − sum_{j=1}^m h_{j,m} v_j,  h_{j,m} = v_j^T v̄,   (5)

and, since vec(V_{1,j} V_{2,j}^T) = v_j, we can reformulate (5) as

  Ṽ = [V̄_1, V_{1,1}, ..., V_{1,m}] Θ_m [V̄_2, V_{2,1}, ..., V_{2,m}]^T,   (6)

where Θ_m = diag(I_s, −h_{1,m} I_{s_1}, ..., −h_{m,m} I_{s_m}), vec(Ṽ) = ṽ, and the m coefficients h_{j,m} are collected in the m-th column of an upper Hessenberg matrix H_m ∈ R^{m×m}. Obviously, the result Ṽ has factors with increased column dimensions, so that a truncation of the matrix in (6) is performed as well. In particular, if ε_orth is a given threshold, we compute

  [V̂_1, V̂_2] = trunc([V̄_1, V_{1,1}, ..., V_{1,m}], Θ_m, [V̄_2, V_{2,1}, ..., V_{2,m}], ε_orth).

The result is then normalized to obtain the (m+1)-th basis vector, namely V_{1,m+1} V_{2,m+1}^T = V̂_1 V̂_2^T/‖V̂_1 V̂_2^T‖. The upper Hessenberg matrix H̄_m ∈ R^{(m+1)×m} is defined such that its square principal submatrix is given by H_m and e_{m+1}^T H̄_m e_m = h_{m+1,m} := ‖V̂_1 V̂_2^T‖.

The difference between FOM and GMRES lies in the computation of the vector y_m in (4). In FOM, a Galerkin condition on the residual vector is imposed, namely

  V_m^T (vec(C_1 C_2^T) + A V_m y_m) = 0.   (7)

If no truncation steps are performed during the Arnoldi procedure, the Arnoldi relation

  A V_m = V_m H_m + h_{m+1,m} v_{m+1} e_m^T = V_{m+1} H̄_m   (8)

is fulfilled, and it is easy to show that imposing the Galerkin condition (7) is equivalent to solving the m×m linear system

  H_m y_m = −‖C_1 C_2^T‖ e_1   (9)

for y_m = y^{fom}_m. Moreover, in the exact setting where (8) holds, the norm of the residual vector vec(C_1 C_2^T) + A V_m y^{fom}_m can be cheaply computed as h_{m+1,m} |e_m^T y^{fom}_m|. See, e.g., [51, Proposition 6.7]. We show later that this is possible also when the low-rank truncations are performed and an inexact version of (8) is taken into account.
In GMRES, the vector y_m = y^{gm}_m is computed by solving a least squares problem, which corresponds to the Petrov-Galerkin orthogonality condition

  vec(C_1 C_2^T) + A V_m y_m ⊥ A K_m(A, vec(C_1 C_2^T)).   (10)

If (8) holds, y^{gm}_m can be computed as

  y^{gm}_m = argmin_{y ∈ R^m} ‖ ‖C_1 C_2^T‖ e_1 + H̄_m y ‖_2,   (11)

and, following the discussion in [51, Section 6.5.3], this reduced least squares problem can be cheaply solved by applying m Givens rotations Ω_i ∈ R^{(m+1)×(m+1)} to H̄_m and to the right-hand side. If Ū_m := Ω_m···Ω_1 H̄_m and ḡ_m := −‖C_1 C_2^T‖ Ω_m···Ω_1 e_1 ∈ R^{m+1}, then the vector y^{gm}_m is given by the solution of the m×m linear system U_m y^{gm}_m = g_m, where U_m denotes the square principal submatrix of Ū_m and g_m collects the first m components of ḡ_m. Moreover, ‖vec(C_1 C_2^T) + A V_m y^{gm}_m‖ = |e_{m+1}^T ḡ_m|. See, e.g., [51, Proposition 6.9]. As for FOM, we will show that this is possible also in the case of GMRES equipped with low-rank truncations.

If at the m-th iteration the residual norm ‖vec(C_1 C_2^T) + A V_m y_m‖ is sufficiently small, we recover the solution X_m. Clearly, the full X_m is not constructed explicitly as this is a large, dense matrix. However, since we have assumed that the solution X to (1) admits accurate low-rank approximations, we can compute low-rank factors S_1, S_2 ∈ R^{n×t}, t ≪ n, such that S_1 S_2^T ≈ X_m. Also this operation can be performed by exploiting the low-rank format of the basis vectors. In particular,

  [S_1, S_2] = trunc([V_{1,1}, ..., V_{1,m}], Y_m, [V_{2,1}, ..., V_{2,m}], ε),  Y_m := diag((e_1^T y_m) I_{s_1}, ..., (e_m^T y_m) I_{s_m}).

The low-rank FOM and GMRES procedures are summarized in Algorithm 2. For the sake of simplicity, we collect the two routines in the same pseudo-algorithm as they differ only in the convergence check, provided a Givens rotation approach similar to the one presented for GMRES is adopted also for FOM. This allows for a cheap evaluation of the residual norm without solving the linear system (9) at each iteration.
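A sketch of this final compression step, reusing the trunc routine sketched after Algorithm 1 (the list-based storage of the basis factors is our assumption):

```python
import numpy as np
from scipy.linalg import block_diag

def assemble_solution(V1_list, V2_list, y, eps):
    """Factors S1, S2 with S1 S2^T ~ X_m, given vec(X_m) = V_m y_m.

    Since v_j = vec(V1_j V2_j^T), X_m = sum_j (e_j^T y) V1_j V2_j^T, i.e. a
    product [V1_1,...,V1_m] Y_m [V2_1,...,V2_m]^T with block diagonal Y_m.
    """
    L = np.hstack(V1_list)
    N = np.hstack(V2_list)
    Ym = block_diag(*[yj * np.eye(V1.shape[1])
                      for yj, V1 in zip(y, V1_list)])
    return trunc(L, Ym, N, eps)   # trunc as sketched above
```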
Algorithm 2: LR-FOM and LR-GMRES.

At each iteration step m of Algorithm 2 we perform three low-rank truncations, and these operations substantially influence the overall solution procedure. If the truncation tolerances ε_A and ε_orth are chosen too large, the whole Krylov method may break down. Therefore, in the following sections we discuss how to adaptively choose the truncation tolerances ε_A and ε_orth to maintain convergence. Moreover, the low-rank truncation has its own computational workload, which can be remarkable, especially if the ranks of the basis vectors involved are quite large. In section 4 we discuss some computationally appealing alternatives to Algorithm 1.

A convergence result
In this section we show that the convergence of LR-FOM and LR-GMRES is guaranteed if the thresholds ε_A and ε_orth for the low-rank truncations in lines 3 and 7 of Algorithm 2 are properly chosen and if the routine used in the truncation steps satisfies certain properties.

The truncation that takes place in line 19, after the iterative process has terminated, to recover the low-rank factors of the approximate solution is not discussed. Indeed, it does not affect the convergence of the Krylov method and it is justified by the assumption that the exact solution X admits accurate low-rank approximations.

Inexact matrix-vector products
We start by analyzing the truncation step in line 3 of Algorithm 2 assuming, for the moment, that the one in line 7 is not performed. In this way the generated basis V_m is ensured to be orthogonal. In section 3.2 we will show that the truncation in line 7 of Algorithm 2 preserves the orthogonality of the constructed basis, so that the results we show here still hold.
The low-rank truncation performed in line 3 of Algorithm 2 can be understood as an inexact matrix-vector product with A. Indeed, at the m-th iteration, we can write

  vec(V̄_1 V̄_2^T) = A v_m − vec(E_m),   (12)

where E_m is the matrix discarded when trunc(V̄_1, I, V̄_2, ε_A) is computed, and the vector vec(V̄_1 V̄_2^T) can thus be seen as the result of an inexact matrix-vector product with A. Following the discussion in [58], the Arnoldi relation (8) must be replaced with the inexact counterpart

  A V_m − [vec(E_1), ..., vec(E_m)] = V_{m+1} H̄_m,   (13)

and Range(V_m) is no longer a Krylov subspace generated by A.
The vectors y^{fom}_m and y^{gm}_m can still be calculated as in (9) and (11), respectively, but these computations are no longer equivalent to imposing the Galerkin and Petrov-Galerkin conditions (7) and (10), since the Arnoldi relation (8) no longer holds; different constraints must be taken into account.

Proposition 3.1 (See [58]). Let (13) hold and define W_m := A V_m − [vec(E_1), ..., vec(E_m)]. If y^{gm}_m is computed as in (11), then q^{gm}_m := W_m y^{gm}_m is such that vec(C_1 C_2^T) + q^{gm}_m ⊥ Range(W_m). Similarly, if y^{fom}_m is computed as in (9), then q^{fom}_m := W_m y^{fom}_m is such that vec(C_1 C_2^T) + q^{fom}_m ⊥ Range(V_m).

Consequently, H_m is not a true Galerkin projection of A onto Range(V_m). One may want to compute the vectors y^{fom}_m and y^{gm}_m by employing the true projection T_m := V_m^T A V_m in place of H_m in (9) and (11), so that the reduced problems represent a better approximation (cf. [25]) of the original equation and the orthogonality conditions imposed are in terms of the true residual. However, the computation of T_m requires storing the matrix [vec(E_1), ..., vec(E_m)], and this is impracticable as the benefits in terms of memory demand coming from the low-rank truncations are completely lost due to the allocation of both V_m and [vec(E_1), ..., vec(E_m)]. A different option is to store the matrix A V_m and compute an explicit projection of A onto the current subspace, but also this strategy leads to an unfeasible increment in the memory requirements of the overall solution process, as the storage demand grows by a factor of p. Therefore, in all the numerical experiments reported in section 7, the matrix H_m arising from the orthonormalization procedure is employed in the computation of y^{fom}_m and y^{gm}_m.

If (13) holds and vec(X_m) = V_m y_m is the approximate solution of (2) computed by projection onto Range(V_m), then, at the m-th iteration, the true residual vector r̄_m := vec(C_1 C_2^T) + A V_m y_m can be expressed as

  r̄_m = r_m + [vec(E_1), ..., vec(E_m)] y_m,   (14)

where r_m := vec(C_1 C_2^T) + V_{m+1} H̄_m y_m is the computed residual vector.
In [58, Section 4] it has been shown that the residual gap δ_m := r̄_m − r_m between the true residual and the computed one can be bounded by

  ‖δ_m‖ ≤ sum_{k=1}^m |e_k^T y_m| ‖E_k‖.   (15)

Since |e_k^T y_m| decreases as the iterations proceed (see, e.g., [58, Lemma 5.1-5.2]), ‖E_k‖ is allowed to increase while still maintaining a small residual gap and preserving the convergence of the overall solution process. This phenomenon is often referred to as relaxation.
Theorem 3.1 (See [58]). Let ε > 0 and let r̄^{gm}_m := vec(C_1 C_2^T) + A V_m y^{gm}_m be the true GMRES residual after m iterations of the inexact Arnoldi procedure. If

  ‖E_k‖ ≤ (σ_m(H̄_m)/m) (1/‖r^{gm}_{k−1}‖) ε   (16)

for every k ≤ m, then ‖r̄^{gm}_m − r^{gm}_m‖ ≤ ε. Moreover, if r̄^{fom}_m := vec(C_1 C_2^T) + A V_m y^{fom}_m is the true FOM residual after m iterations of the inexact Arnoldi procedure, and if

  ‖E_k‖ ≤ (1/(m κ(H_m))) (1/‖r^{gm}_{k−1}‖) ε   (17)

for every k ≤ m, then ‖r̄^{fom}_m − r^{fom}_m‖ ≤ ε.

Notice that the bound in (17) depends on the norm of the computed GMRES residual. This can be easily computed when Algorithm 2 is performed, since ‖r^{gm}_m‖ = |e_{m+1}^T ḡ_m| is available at line 14 of Algorithm 2. However, if the FOM residual r^{fom}_{k−1} exists for every k ≤ m, ‖r^{gm}_{k−1}‖ can be replaced by ‖r^{fom}_{k−1}‖ in (17).

The quantities involved in the estimates (15)-(16)-(17) are not available at iteration k < m, making the latter of theoretical interest only. To have practically usable truncation thresholds, the quantities in (15)-(16)-(17) must be approximated by computable values. Following the suggestions in [58], we replace m by the maximum number m_max of allowed iterations, σ_{m_max}(H̄_{m_max}) by σ_{n²}(A), and we approximate σ_1(H_{m_max}) by σ_1(A) when computing κ(H_{m_max}) in (17). The extreme singular values of A can be computed once and for all at the beginning of the iterative procedure, e.g., by the Lanczos method, which must be carefully designed to avoid the construction of A and to exploit its Kronecker structure. Approximations of σ_1(A) and σ_{n²}(A) coming, e.g., from some particular features of the problem of interest can also be employed. To conclude, we propose to use the following practical truncation thresholds ε^{(k)}_A in line 3 of Algorithm 2 in place of ε_A:

  ε^{(k)}_A = (σ_{n²}(A)/m_max) (1/‖r^{gm}_{k−1}‖) ε   (18)

for LR-GMRES, and

  ε^{(k)}_A = (σ_{n²}(A)/(σ_1(A) m_max)) (1/‖r^{gm}_{k−1}‖) ε   (19)

for LR-FOM.
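In code, the practical thresholds (18)-(19) amount to a one-line update per iteration; the following sketch (names ours) assumes estimates of the extreme singular values of A are available:

```python
def relaxed_tolerance(res_prev, eps, m_max, sigma_min_A, sigma_max_A=None):
    """Practical threshold eps_A^(k) for line 3 of Algorithm 2.

    res_prev    : computed residual norm at iteration k-1,
    eps         : target accuracy of the final residual,
    sigma_min_A : estimate of sigma_{n^2}(A); sigma_max_A (an estimate of
                  sigma_1(A)) is passed only for the LR-FOM variant.
    """
    tol = sigma_min_A * eps / (m_max * res_prev)  # LR-GMRES, cf. (18)
    if sigma_max_A is not None:                   # LR-FOM, cf. (19)
        tol /= sigma_max_A
    return tol
```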
Allowing ‖E_k‖ to grow is remarkably important in our setting, especially for the memory requirements of the overall procedure. Indeed, if the truncation step in line 3 of Algorithm 2 is not performed, the rank of the basis vectors increases very quickly since, at the m-th iteration, the factors of the new basis vector have p s_m columns, i.e., the ranks can grow by a factor of p at each iteration. Therefore, at the first iterations the rank of the basis vectors is low by construction, and having a very stringent tolerance in the computation of their low-rank approximations is not an issue. When the iterations proceed, the rank of the basis vectors increases but, at the same time, the increment in the thresholds for computing low-rank approximations of such vectors leads to more aggressive truncations, with consequent remarkable gains in the memory allocation.
The interpretation of the truncation in line 3 of Algorithm 2 in terms of an inexact Krylov procedure has already been proposed in [18] for the more general case of GMRES applied to (2) where A is a tensor and the approximate solution is represented in the tensor-train (TT) format. However, also in the tensor setting, the results in Theorem 3.1 hold if and only if the matrix V_m has orthonormal columns. In general, the low-rank truncation in line 7 can destroy the orthogonality of the basis. In the next section we show that V_m has orthonormal columns if the truncation step is performed in an appropriate way.
We conclude this section with a couple of remarks.
Remark 3.2. We have always assumed the initial guess x_0 ∈ R^{n²} in (4) to be zero. This choice is motivated by the discussion in [58, Section 3] and [39], where the authors show why this is a good practice in the framework of inexact Krylov methods.
Remark 3.3. Since the norm of the true residual is not directly available, the quantity ‖r_m‖ + sum_{j=1}^m ε^{(j)}_A · |e_j^T y_m|, where ε^{(j)}_A denotes one of the values in (18)-(19) depending on the selected procedure, must be computed to have a reliable stopping criterion in Algorithm 2. This means that the linear system U_m y_m = g_m has to be solved at each iteration m. This does not significantly increase the computational workload because U_m ∈ R^{m×m} is of small dimension and already given in triangular form.

Structured perturbations of the basis
In this section we show that the low-rank truncations performed during the Gram-Schmidt procedure in line 7 of Algorithm 2 preserve the orthogonality of the basis, i.e., V_m still has orthonormal columns, and that the results presented in section 3.1 remain valid.
Proposition 3.2. The matrix V_{m+1} = [v_1, ..., v_{m+1}] ∈ R^{n²×(m+1)} computed by performing m iterations of Algorithm 2, where the low-rank truncations are computed by Algorithm 1, has orthonormal columns.
Proof. At the m-th iteration, the (m+1)-th basis vector is computed by performing (6) and then normalizing the result. In particular, if Θ_m = diag(I_s, −h_{1,m} I_{s_1}, ..., −h_{m,m} I_{s_m}), then

  V̂_1 V̂_2^T = [V̄_1, V_{1,1}, ..., V_{1,m}] Θ_m [V̄_2, V_{2,1}, ..., V_{2,m}]^T − F_{1,m} F_{2,m}^T,   (20)

where F_{1,m} F_{2,m}^T is the matrix discarded during the application of Algorithm 1.
Let [V̄_1, V_{1,1}, ..., V_{1,m}] = Q_1 R_1 and [V̄_2, V_{2,1}, ..., V_{2,m}] = Q_2 R_2 denote the skinny QR factorizations performed during trunc, and let R_1 Θ_m R_2^T = U Σ W^T be the SVD computed thereafter, so that V̂_1 V̂_2^T = Q_1 U_{k_m} Σ_{k_m} W_{k_m}^T Q_2^T, where U_{k_m}, W_{k_m}, Σ_{k_m} contain the leading k_m singular vectors and, respectively, singular values, and k_m is the smallest index such that the truncation criterion of Algorithm 1 with threshold ε_orth is fulfilled. Accordingly, F_{1,m} F_{2,m}^T = Q_1 (U Σ W^T − U_{k_m} Σ_{k_m} W_{k_m}^T) Q_2^T and, since Q_1, Q_2, W and U are orthogonal matrices,

  (V̂_1 V̂_2^T)^T F_{1,m} F_{2,m}^T = 0,  and, in particular,  ⟨V̂_1 V̂_2^T, F_{1,m} F_{2,m}^T⟩_F = 0.

By pre- and post-multiplying (20) by Q_1^T and Q_2, respectively, we thus get

  Q_1^T V̂_1 V̂_2^T Q_2 = R_1 Θ_m R_2^T − (U Σ W^T − U_{k_m} Σ_{k_m} W_{k_m}^T) = U_{k_m} Σ_{k_m} W_{k_m}^T.

Since the previous basis vectors and the discarded matrix all have factors whose columns lie in Range(Q_1) and Range(Q_2), we can see V̂_1 V̂_2^T as the result of a specific Gram-Schmidt procedure in which the truncation acts only on the small core R_1 Θ_m R_2^T, so that v_j^T vec(V̂_1 V̂_2^T) = 0 for j = 1, ..., m, and, since ⟨V̂_1 V̂_2^T, F_{1,m} F_{2,m}^T⟩_F = 0, vec(V_{1,m+1} V_{2,m+1}^T) = vec(V̂_1 V̂_2^T)/‖V̂_1 V̂_2^T‖ has unit norm.
As shown in the proof of Proposition 3.2, to maintain the orthogonality of the basis it is crucial that V̂_1 V̂_2^T and F_{1,m} F_{2,m}^T are block-orthogonal to each other, i.e., (V̂_1 V̂_2^T)^T F_{1,m} F_{2,m}^T = 0, see, e.g., [24], and not only orthogonal with respect to the matrix inner product ⟨·, ·⟩_F. This is due to the QR-SVD-based truncation we perform. In general, it may happen that the computed basis V_{m+1} is no longer orthogonal if different truncation strategies are adopted. In this case, the theory developed in, e.g., [30] may be exploited to estimate the distance of the computed basis from orthogonality, and such a value can be incorporated in the bounds (15)-(16)-(17) to preserve the convergence of the overall iterative scheme.
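As a quick numerical illustration of this block-orthogonality (a self-contained sketch with synthetic data):

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, k = 200, 20, 5
L, N = rng.standard_normal((n, r)), rng.standard_normal((n, r))
M = np.diag(2.0 ** -np.arange(r))        # rapidly decaying singular values

QL, RL = np.linalg.qr(L)
QN, RN = np.linalg.qr(N)
U, s, Wt = np.linalg.svd(RL @ M @ RN.T)
kept = QL @ (U[:, :k] * s[:k]) @ Wt[:k] @ QN.T   # the retained part
disc = QL @ (U[:, k:] * s[k:]) @ Wt[k:] @ QN.T   # the discarded part

# block-orthogonality, not just <kept, disc>_F = 0
print(np.linalg.norm(kept.T @ disc))             # ~ machine precision
```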
In spite of Proposition 3.2, in finite precision arithmetic the computed basis V_{m+1} may fall short of being orthogonal, and the employment of a modified Gram-Schmidt procedure with reorthogonalization - as outlined in Algorithm 2 - is recommended. See, e.g., [22,23] for some discussions about the loss of orthogonality in the Gram-Schmidt procedure.
The truncations performed during the orthogonalization procedure constitute another source of inexactness that must be taken into account. The inexact Arnoldi relation (13) becomes

  A V_m − [vec(E_1 + F_{1,1} F_{2,1}^T), ..., vec(E_m + F_{1,m} F_{2,m}^T)] = V_{m+1} H̄_m,

and one can derive results similar to the ones in Theorem 3.1 for this inexact Arnoldi relation, obtaining estimates for ‖E_k + F_{1,k} F_{2,k}^T‖. In principle, it may be interesting to study how to distribute the allowed inexactness between the two truncation steps. Since the rank of the iterates grows less dramatically during the orthogonalization step compared to what happens after the multiplication with A, we allow ‖E_k‖ to grow in accordance with Theorem 3.1, while keeping the threshold ε_orth small. Indeed, the matrix [V̄_1, V_{1,1}, ..., V_{1,m}] Θ_m [V̄_2, V_{2,1}, ..., V_{2,m}]^T in line 7 of Algorithm 2 is, in general, very rank-deficient, and a significant reduction in the number of columns to be stored takes place even when the trunc function is applied with a small threshold.
In particular, at the m-th iteration, we can set

  ε_orth = ε^{(m)}_A,   (21)

with ε^{(m)}_A as in (18)-(19), where ε is the desired accuracy of the final solution in terms of relative residual norm. This means that ‖E_k + F_{1,k} F_{2,k}^T‖ fulfills the estimates in (15)-(16)-(17) (with ε replaced by 2ε), and the convergence is thus preserved. The vectors y^{fom}_m and y^{gm}_m can still be computed as in (9) and (11), and Proposition 3.1 holds also when the low-rank truncations in line 7 of Algorithm 2 are performed.

Proposition 3.3. Let W_m := A V_m − [vec(E_1 + F_{1,1} F_{2,1}^T), ..., vec(E_m + F_{1,m} F_{2,m}^T)]. If y^{gm}_m is computed as in (11), where H̄_m stems from the low-rank Arnoldi procedure illustrated in Algorithm 2 with low-rank truncations performed by Algorithm 1, then q^{gm}_m := W_m y^{gm}_m is such that vec(C_1 C_2^T) + q^{gm}_m ⊥ Range(W_m). Similarly, if y^{fom}_m is computed as in (9), where H_m is the principal square submatrix of the aforementioned H̄_m, then q^{fom}_m := W_m y^{fom}_m is such that vec(C_1 C_2^T) + q^{fom}_m ⊥ Range(V_m).

Proof. We only need to prove that V_m^T [vec(F_{1,1} F_{2,1}^T), ..., vec(F_{1,m} F_{2,m}^T)] = 0, as the rest of the proof comes from [58, Propositions 3.2-3.3].
Using the same arguments as in the proof of Proposition 3.2, we can show that F_{1,j} F_{2,j}^T is orthogonal to V_{1,i} V_{2,i}^T for all i, j = 1, ..., m with i ≠ j + 1. Therefore, the only possibly nonzero components of V_m^T [vec(F_{1,1} F_{2,1}^T), ..., vec(F_{1,m} F_{2,m}^T)] are on the first subdiagonal. These entries are of the form ⟨V_{1,ℓ+1} V_{2,ℓ+1}^T, F_{1,ℓ} F_{2,ℓ}^T⟩_F, and we show they are zero for every ℓ = 1, ..., m−1. We have V_{1,ℓ+1} V_{2,ℓ+1}^T = V̂_1 V̂_2^T/‖V̂_1 V̂_2^T‖ at iteration ℓ, and in the proof of Proposition 3.2 we have already shown that ⟨V̂_1 V̂_2^T, F_{1,ℓ} F_{2,ℓ}^T⟩_F = 0. This completes the proof.

The true residual norm can be bounded, following the discussion in Remark 3.3, by

  ‖r̄_m‖ ≤ ‖r_m‖ + sum_{j=1}^m ε^{(j)}_A |e_j^T y_m|,   (22)

so that the right-hand side in (22) must be computed to check convergence.

Alternative truncation strategies
As we discussed above, to keep the low-rank Krylov methods computationally feasible, the quantities involved in the solution process have to be compressed so that their rank, i.e., the number of columns of the low-rank factors, is kept small. Let N M L^T, with factors N, L ∈ R^{n×m}, M ∈ R^{m×m}, be the quantity to be compressed, and assume that rank(N M L^T) = m. So far we have used a direct approach based on QR and SVD decompositions in Algorithm 1, which essentially computes a partial SVD of N M L^T corresponding to all m nonzero singular values. This procedure relies heavily on dense linear algebra computations and can, hence, become quite expensive. This is especially due to the QR decompositions, which are expensive if the rectangular factors N, L have many columns. Moreover, if N M L^T has a very small numerical rank, say k ≪ m, then Algorithm 1 generates a substantial computational overhead because m − k singular vectors are computed only to be thrown away. Nevertheless, thanks to the complete knowledge of all singular values, this procedure is able to correctly assess the truncation error in the Frobenius norm, so that the required accuracy of the truncation is always met.

Following the discussion in, e.g., [12,43,61], a more economical alternative could be to compute only a partial SVD N M L^T ≈ U_k Σ_k W_k^T associated with the k singular values that are larger than the given truncation threshold. If also the (k+1)-th singular value is computed, one has the truncation error in the 2-norm:

  ‖N M L^T − U_k Σ_k W_k^T‖_2 = σ_{k+1}(N M L^T).

Obviously, the results of the previous section are still valid if this form of truncation is used. Approximations of the dominant singular values and corresponding singular vectors can be computed by iterative methods for large-scale SVD computations such as, e.g., Lanczos bidiagonalization (see, e.g., [3,37,60]) or Jacobi-Davidson methods; see [28]. To apply these methods, only the matrix-vector products N(M(L^T x)) and L(M^T(N^T x)) are required. For achieving the compression goal one could, e.g., compute k_max ≥ k triplets and, if required, neglect any singular vectors corresponding to singular values below a certain threshold. However, in general we do not know in advance how many singular values will be larger than a given threshold. Choosing k_max too small can lead to very inaccurate truncations that do not satisfy the required thresholds (15)-(17), (21) and, therefore, endanger the convergence of the low-rank Krylov method. Some of the aforementioned iterative SVD methods converge, in theory, monotonically, i.e., the singular values are found in a decreasing sequence starting with the largest one. Hence, the singular value finding iteration can be kept running until a sufficiently small singular value approximation, e.g., σ < ε_trunc ‖N M L^T‖_2, is detected. In the practical situations arising within low-rank Krylov methods, the necessary number of singular triplets can be O(10²) or larger, and it may be difficult to ensure that the iterative SVD algorithms do not miss some of the largest singular values or that no singular values are detected several times. Due to the sheer number of occurrences where compression is required in Algorithm 2, preliminary tests with iterative SVD methods did not yield any substantial savings compared to the standard approach in Algorithm 1.
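For completeness, a sketch of such an iterative partial SVD via SciPy's Lanczos-based svds; wrapping the factored products in a LinearOperator is the only problem-specific ingredient (the fixed number of triplets k is a simplification of the adaptive stopping discussed above):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, svds

def partial_svd(N, M, L, k):
    """Leading k singular triplets of N M L^T without forming the product."""
    op = LinearOperator(
        (N.shape[0], L.shape[0]),
        matvec=lambda x: N @ (M @ (L.T @ x)),     # x -> N M L^T x
        rmatvec=lambda x: L @ (M.T @ (N.T @ x)),  # x -> (N M L^T)^T x
        dtype=np.float64,
    )
    U, s, Wt = svds(op, k=k)     # svds returns singular values in ascending order
    idx = np.argsort(s)[::-1]
    return U[:, idx], s[idx], Wt[idx]
```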

Preconditioning
It is well known that Krylov methods require preconditioning in order to obtain fast convergence in terms of number of iterations, and low-rank Krylov methods are no exception. However, due to the peculiarity of our framework, the preconditioning operator must possess some supplementary features with respect to standard preconditioners for linear systems. Indeed, in addition to being effective in reducing the number of iterations at a reasonable computational cost, the preconditioning operator must not dramatically increase the memory requirements of the solution process.
Given a nonsingular operator P (or its inverse P^{−1}), if we employ right preconditioning, the original system (2) is transformed into

  A P^{−1} x̂ = −vec(C_1 C_2^T),  x̂ = P x,   (23)

so that, at each iteration m, we have to apply P^{−1} to the current basis vector vec(V_{1,m} V_{2,m}^T). Note that we restrict ourselves to right preconditioning because it has the advantage that one can still monitor the true unpreconditioned residual without extra work within the Krylov routine. Of course, in principle also left and two-sided preconditioning can be used.
The preconditioning operation must be able to exploit the low-rank format of V_{1,m} V_{2,m}^T. Therefore, a naive operation of the form P^{−1} vec(V_{1,m} V_{2,m}^T) is not admissible in our context, as this would require the allocation of the dense n×n matrix V_{1,m} V_{2,m}^T. One way to overcome this numerical difficulty is to employ a preconditioning operator P which allows for a representation in terms of a short Kronecker sum, namely

  P = sum_{i=1}^ℓ P̂_i ⊗ P_i,  with a small ℓ.   (24)

This means that the operation vec(Y_m) = P^{−1} vec(V_{1,m} V_{2,m}^T) is equivalent to solving the matrix equation

  sum_{i=1}^ℓ P_i Y_m P̂_i^T = V_{1,m} V_{2,m}^T.   (25)

In our setting, the operator P often amounts to an approximation of A in (2), obtained by either dropping some terms in the sum or replacing some of them by a multiple of the identity. See, e.g., [45,48,62]. Another option that has not been fully explored in the matrix equation literature so far is the case of polynomial preconditioners (see, e.g., [40,66]), where P^{−1} amounts to a fixed low-degree polynomial evaluated in A. Alternatively, we can formally set P = A in (24) and inexactly solve equation (25) by a few iterations of another Krylov method (e.g., Algorithm 2), leading to an inner-outer Krylov method; see, e.g., [57].
Clearly, equation (25) must be easy to solve. For instance, if ℓ = 1, then Y_m = (P_1^{−1} V_{1,m})(P̂_1^{−1} V_{2,m})^T and an exact application of the preconditioner can be carried out. Similarly, when ℓ = 2 and a fixed number of ADI iterations is performed at each Krylov iteration m, it is easy to show that we are still working in an exact preconditioning framework. See, e.g., [8,16]. In all these cases, the results presented in the previous sections still hold provided A is replaced by the preconditioned matrix A P^{−1}.
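For the ℓ = 1 case, a minimal sketch of an exact preconditioner application on the low-rank factors (class and variable names are ours; P_1 and P̂_1 are assumed sparse and nonsingular):

```python
import scipy.sparse as sp
from scipy.sparse.linalg import splu

class KroneckerPreconditioner:
    """Apply P^{-1}, P = P1_hat kron P1, to a basis vector in factored form.

    The sparse LU factors are computed once; each application then amounts
    to a pair of solves with the low-rank factors as right-hand sides, so
    the rank of the basis vector is not increased.
    """
    def __init__(self, P1, P1_hat):
        self.luA = splu(sp.csc_matrix(P1))
        self.luB = splu(sp.csc_matrix(P1_hat))

    def apply(self, V1, V2):
        # P^{-1} vec(V1 V2^T) = vec((P1^{-1} V1)(P1_hat^{-1} V2)^T)
        return self.luA.solve(V1), self.luB.solve(V2)
```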
Equation (25) is often solved iteratively and, in general, this procedure leads to the computation of a low-rank approximation Z_{1,m} Z_{2,m}^T of Y_m, which has to be interpreted as a variable preconditioning scheme with a different preconditioning operator at each outer iteration. In this case, a flexible variant of Algorithm 2 must be employed, which consists of a standard flexible Krylov procedure equipped with the low-rank truncations presented in the previous sections. See, e.g., [59, Section 10] for some details about flexible Krylov methods, and [50], [51, Section 9.4.1] for a discussion about flexible GMRES.
We must mention that the employment of a flexible procedure at least doubles the memory requirements of the solution process. Indeed, both the preconditioned and unpreconditioned bases must be stored and, in general, rank(Z_{1,m} Z_{2,m}^T) ≥ rank(V_{1,m} V_{2,m}^T) for all m. This aspect must be taken into account when designing the preconditioner. See Example 7.1.
At first glance, the presence of a variable preconditioning procedure can complicate the derivations illustrated in sections 3.1-3.2 for the safe selection of the low-rank truncation thresholds that guarantee the convergence of the solution method. Indeed, if at iteration m, Z_{1,m} Z_{2,m}^T is the result of the preconditioning step (25), we still want to truncate the matrix [A_1 Z_{1,m}, ..., A_p Z_{1,m}][B_1 Z_{2,m}, ..., B_p Z_{2,m}]^T in order to moderate the storage demand, and one may wonder whether the inexactness of step (25) plays a role in such a truncation. Thanks to the employment of a flexible strategy, we are going to show that the tolerances for the low-rank truncations, namely ε_A and ε_orth in Algorithm 2, can still be computed as illustrated in sections 3.1-3.2. Flexible Krylov methods are characterized not only by a preconditioner that changes at each iteration, but also by the fact that the solution is recovered by means of the preconditioned basis. In particular, vec(X_m) = Z_m y_m with Z_m = [vec(Z_{1,1} Z_{2,1}^T), ..., vec(Z_{1,m} Z_{2,m}^T)]; see, e.g., [50]. This is a key ingredient in our analysis.
We start our discussion by considering flexible Krylov methods with no truncations. For this class of solvers the relation

  A Z_m = V_{m+1} H̄_m   (26)

holds, see, e.g., [51, Equation (9.22)], and span{vec(Z_{1,1} Z_{2,1}^T), ..., vec(Z_{1,m} Z_{2,m}^T)} is not a Krylov subspace in general. Therefore, also for flexible Krylov methods with no low-rank truncations we must consider constraints different from the ones in (7) and (10), and results similar to the ones in Proposition 3.1 with W_m = A Z_m hold. See, e.g., [51, Proposition 9.2].
If we now introduce a low-rank truncation of the matrix

  [A_1 Z_{1,k}, ..., A_p Z_{1,k}][B_1 Z_{2,k}, ..., B_p Z_{2,k}]^T,   (27)

then the relation (26) becomes

  A Z_m − [vec(E_1), ..., vec(E_m)] = V_{m+1} H̄_m,   (28)

where the matrices E_k are the ones discarded when (27) is performed. If ‖E_k‖ satisfies the inequalities in Theorem 3.1, then the convergence of the low-rank flexible Krylov procedure is still guaranteed, in the sense that the residual norm keeps decreasing as long as span{vec(Z_{1,1} Z_{2,1}^T), ..., vec(Z_{1,m} Z_{2,m}^T)} grows. However, the matrix H_m no longer represents a projection of A onto the current subspace, and the approximation of σ_{m_max}(H̄_{m_max}) and σ_1(H_{m_max}) in the right-hand sides of (15)-(16)-(17) by the corresponding singular values of A may no longer be effective. In our numerical experience, approximating σ_{m_max}(H̄_{m_max}) and σ_1(H_{m_max}) by the smallest and largest singular values of the preconditioned matrix A P^{−1}, i.e., mimicking what is done in case of exact applications of P, provides satisfactory results. Obtaining computable approximations of σ_{m_max}(H̄_{m_max}) and σ_1(H_{m_max}) for the inner-outer approach is not straightforward. In this case, a practical approach may be to still approximate σ_{m_max}(H̄_{m_max}) and σ_1(H_{m_max}) by σ_{n²}(A) and σ_1(A), respectively. These approximations may be very rough as they completely neglect the role of the preconditioner, so that they may lead to quite conservative truncation thresholds. However, at the moment, we do not see any other viable alternative.
The introduction of the low-rank truncations that lead to (28) implies that the constraints imposed on the residual vector are no longer in terms of the space spanned by Z_m, and the results presented in Proposition 3.1 with W_m = A Z_m − [vec(E_1), ..., vec(E_m)] hold.
In flexible Krylov methods, the orthogonalization procedure involves only the unpreconditioned basis V_m, so that the truncation step in line 7 of Algorithm 2 is not really affected by the preconditioning procedure, and the results in Propositions 3.2-3.3 are still valid. The truncation threshold ε_orth can still be selected as proposed in section 3.2.

Short recurrence methods
Short recurrence Krylov methods can be very appealing in our context as only a fixed, usually small, number of basis vectors has to be stored. In case of symmetric problems, i.e., equation (1) where all the coefficient matrices A_i and B_i are symmetric, the low-rank MINRES algorithm proposed in [44] can be employed in the solution process.
If A in (2) is also positive definite, the low-rank CG method illustrated in [27] is a valid candidate for the solution of equation (1). Notice that, in general, it is not easy to characterize the spectral distribution of A in terms of the spectra of the coefficient matrices A_i and B_i. However, it can be shown that if A_i and B_i are positive definite for all i, then also A is positive definite.
Short recurrence methods can be appealing also in case of a nonsymmetric A and low-rank variants of BICGSTAB [64], QMR [21] or other methods can be employed to solve equation (1).
See, e.g., [8,61] for implementations of low-rank MINRES, CG and BICGSTAB. In all short recurrence Krylov methods, the constructed basis V_m is not orthogonal in practice, and this loss of orthogonality must be taken into account in the bounds for the allowed inexactness proposed in Theorem 3.1. In [58, Section 6], the authors propose to incorporate the smallest singular value of the computed basis, namely σ_m(V_m), in the right-hand sides of (15)-(16)-(17) to guarantee the convergence of the method. However, no practical approximation of σ_m(V_m) is proposed in [58].
A different approach that can be pursued is the one illustrated in [13]. In this paper the authors propose to select bounds of the form

  ‖E_k‖ ≤ ε/‖r_k‖,   (29)

where r_k is the current computed residual vector, and in [63] the authors studied the effects of such a choice on the convergence of a certain class of inexact Krylov methods. In particular, in [63] it is shown how the residual gap δ_m remains small if E_k fulfills (29) for all k ≤ m. Even though the true residual and the computed one are then close to each other, this does not imply that the residual norm is actually always small, and we thus have to assume that the norm of the computed residual goes to zero, as it is done in [63].

Numerical examples
In this section we present some numerical results that confirm the theoretical analysis derived in the previous sections. To this end, we consider some general multiterm linear matrix equations of the form (1) stemming from the discretization of certain deterministic and stochastic PDEs.
We apply the LR-GMRES variant of Algorithm 2 in the solution process and we always select Algorithm 1 for the low-rank truncations.
We report the number of performed iterations, the rank of the computed solution, the computational time needed to calculate such a solution together with the relative residual norm achieved, and the storage demand. For the latter, we document the number of columns s = sum_{j=1}^{m+1} s_j of the matrix [V_{1,1}, ..., V_{1,m+1}], where m is the number of iterations needed to converge. Similarly, if a flexible strategy is adopted, we also report the number of columns z of [Z_{1,1}, ..., Z_{1,m}].
This means that, for equations of the form (1) where n_A = n_B = n, we have to allocate 2s (respectively, 2(s + z)) vectors of length n. If n_A ≠ n_B, the memory requirements amount to s (s + z) vectors of length n_A and s (s + z) vectors of length n_B.
The solution process is stopped as soon as the upper bound on the residual norm in (22), normalized by ‖C_1 C_2^T‖_F, gets smaller than 10^{−6}. As already mentioned, we always assume that the exact solution X admits accurate low-rank approximations. Nevertheless, if S_1, S_2 are the low-rank factors computed by Algorithm 2, we also report the real relative residual norm in the following, to confirm the reliability of our numerical procedure. Once again, the real residual norm can be computed at low cost by exploiting the low rank of S_1 S_2^T and the cyclic property of the trace operator. All results were obtained with Matlab R2017b [42] on a Dell machine with 2.4GHz processors and 250 GB of RAM.
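A sketch of this cheap residual evaluation (the function name is ours; the dense n×n residual is never formed):

```python
import numpy as np

def true_residual_norm(A_list, B_list, S1, S2, C1, C2):
    """||sum_i A_i S1 S2^T B_i^T + C1 C2^T||_F via small Gram matrices.

    The residual is itself low rank, R = R1 R2^T, and
    ||R1 R2^T||_F^2 = trace((R1^T R1)(R2^T R2)).
    """
    R1 = np.hstack([C1] + [Ai @ S1 for Ai in A_list])
    R2 = np.hstack([C2] + [Bi @ S2 for Bi in B_list])
    return np.sqrt(abs(np.trace((R1.T @ R1) @ (R2.T @ R2))))
```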
Example 7.1. We consider a slight modification of Example 4 in [45]. In particular, the continuous problem we have in mind is the convection-diffusion equation

  −ν Δu + w · ∇u = 1,  in (0, 1)²,  u = 0 on the boundary,   (30)

where ν > 0 is the viscosity parameter and the convection vector w is given by w = (φ_1(x) ψ_1(y), φ_2(x) ψ_2(y)) = ((1 − (2x + 1)²) y, −2(2x + 1)(1 − y²)). The centered finite differences discretization of equation (30) yields the matrix equation

  ν T X + ν X T + Φ_1 B X Ψ_1 + Φ_2 X B^T Ψ_2 − 1 1^T = 0,   (31)

where T ∈ R^{n×n} is the negative discrete laplacian, B ∈ R^{n×n} corresponds to the discretization of the first derivative, Φ_i and Ψ_i are diagonal matrices collecting the nodal values of the corresponding functions φ_i, ψ_i, i = 1, 2, and 1 ∈ R^n is the vector of all ones. See [45] for more details.

Even though equation (31) amounts to a generalized Sylvester equation, the solution schemes available in the literature and tailored to this kind of problems cannot be applied to equation (31) in general. Indeed, to the best of our knowledge, all the existing methods for large-scale generalized equations rely on a splitting of the overall discrete operator into a Sylvester part and a remainder, and the fixed-point iteration induced by this splitting is supposed to be convergent. See, e.g., [8,29,53]. However, the latter property may be difficult to meet in case of the convection-diffusion equation, especially for dominant convection.
We thus have to interpret (31) as a general multiterm matrix equation of the form (1), and we solve it by preconditioned LR-GMRES. Following the discussion in [45], we use the operator

  L : X ↦ (ν T + ψ̄_1 Φ_1 B) X + X (ν T + φ̄_2 B^T Ψ_2)

as preconditioner, where ψ̄_1, φ̄_2 ∈ R are the mean values of ψ_1(y) and φ_2(x) on (0, 1), respectively. At each LR-GMRES iteration, we approximately invert L by performing 10 iterations of the extended Krylov subspace method for Sylvester equations derived in [14]. Since this scheme gives a different preconditioner every time it is called, we must employ the flexible variant of LR-GMRES. To avoid an excessive increment in the memory requirements due to the allocation of both the preconditioned and unpreconditioned bases, we do not store the outcome of the preconditioning step at full accuracy, i.e., at iteration k, the low-rank factors of the computed preconditioned vector are truncated with a rather large threshold ε_precond before being appended to the preconditioned basis. This procedure leads to a lower storage demand in the overall solution process and to less time consuming preconditioning steps. On the other hand, the effectiveness of the preconditioner in reducing the total iteration count may be weakened, especially for large ε_precond. In the results reported in the following we have always set ε_precond = 10^{−3}.
In Table 1 we report the results for different values of n and ν. We notice that the number of iterations is very robust with respect to the problem dimension n, and thus the mesh size. Unfortunately, this does not lead to a storage demand that is also independent of n. The rank of the basis vectors, i.e., the number of columns of the matrices [V_{1,1}, ..., V_{1,m+1}] and [Z_{1,1}, ..., Z_{1,m}], increases with the problem size. This trend is probably inherited from some intrinsic properties of the continuous problem.

Example 7.2. We consider the stochastic steady-state diffusion problem

  −∇ · (a ∇u) = f,  in D = (0, 1)²,  u = 0 on ∂D,   (32)

where f is a deterministic function, and we suppose a to be a random field of the form

  a = a(x, ω) = a_0(x) + sum_{i=1}^r a_i(x) σ_i(ω),   (33)

where σ_i : Ω → Γ_i ⊂ R are real-valued independent random variables (RVs).
The stochastic Galerkin method discussed in, e.g., [2,17,47,48,62] leads to a discrete problem that can be written as a matrix equation of the form

  sum_{i=0}^{r} K_i X G_i^T + f_0 g_0^T = 0,   (34)

where K_i ∈ R^{n_x×n_x}, G_i ∈ R^{n_σ×n_σ}, and f_0 ∈ R^{n_x}, g_0 ∈ R^{n_σ}. See, e.g., [47,48]. We solve equation (34) by LR-GMRES, and the operators

  P_mean = G_0 ⊗ K_0,  and  P_Ullmann = ( sum_{i=0}^{r} (trace(K_i^T K_0)/trace(K_0^T K_0)) G_i ) ⊗ K_0,

are selected as preconditioners. P_mean is usually referred to as the mean-based preconditioner, see, e.g., [47,48] and the references therein, while Ullmann proposed P_Ullmann in [62].
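A sketch of how the G-side factor of P_Ullmann can be assembled for sparse matrices (the function name is ours; K_list and G_list are assumed to be lists of scipy.sparse matrices with K_list[0] = K_0):

```python
def ullmann_G_factor(K_list, G_list):
    """G_tilde = sum_i trace(K_i^T K_0)/trace(K_0^T K_0) * G_i.

    trace(K_i^T K_0) = <K_i, K_0>_F reduces to an elementwise product of
    the sparse matrices, so no matrix-matrix product is required.
    """
    K0 = K_list[0]
    denom = K0.multiply(K0).sum()
    return sum((Ki.multiply(K0).sum() / denom) * Gi
               for Ki, Gi in zip(K_list, G_list))
```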
Both P_mean and P_Ullmann are very well suited to our framework, as their application amounts to the solution of a couple of linear systems, so that the rank of the current basis vector does not increase. See the discussion in section 5. Moreover, supposing that these linear systems can be solved exactly by, e.g., a sparse direct solver, there is no need to employ flexible GMRES, so that only one basis has to be stored. In particular, in all our tests, we precompute once and for all the LU factors of the matrices which define the selected preconditioner, so that only triangular systems are solved during the LR-GMRES iterations; the computational time of these decompositions is always included in the reported results.
We generate instances of (34) with the help of the S-IFISS package version 1.04, available at https://personalpages.manchester.ac.uk/staff/david.silvester/ifiss/sifiss.html; see [54]. The S-IFISS routine stoch_diff_testproblem_pc is executed to generate two instances of (34). The first equation (Data 1) is obtained by using a spatial discretization with 2^7 points in each dimension and r = 2 RVs in (33), which are approximated by polynomial chaos expansions of length ℓ = 100, leading to n_x = 16129, n_σ = 5151, and r + 1 = 3. The second instance (Data 2) was generated with 2^8 grid points, r = 5, and chaos expansions of length ℓ = 10, resulting in n_x = 65025, n_σ = 3003, and r + 1 = 6.

Table 2 summarizes the results, and apparently problem Data 2 is much more challenging than Data 1. This is mainly due to the number of terms in (34). Indeed, the effectiveness of the preconditioners may deteriorate as r increases, even though the actual capability of P_mean and P_Ullmann of reducing the iteration count is related to the coefficients of the KL expansion (33). See, e.g., [47, Theorem 3.8] and [62, Corollary 5.4]. Moreover, r + 1 terms are involved in the products in line 2 of Algorithm 2, and a sizable r leads, in general, to a faster growth of the rank of the basis vectors, so that a larger number of columns is retained during the truncation step in line 3. As a result, the computational cost of our iterative scheme increases as well, leading to a rather time consuming routine.
If the discrete operator stemming from the discretization of (32) is well posed, then it is also symmetric positive definite and the CG method can be employed in the solution process. See, e.g., [47, Section 3]. We thus also apply the (preconditioned) low-rank variant of CG (LR-CG) to the matrix equation (34). To this end, we adopt the LR-CG implementation proposed in [8]. With the notation of [8, Algorithm 1], we truncate all the iterates X_{k+1}, R_{k+1}, P_{k+1} and Q_{k+1}. In particular, the threshold for the truncation of X_{k+1} is set to 10^{−12}, while the value on the right-hand side of (29) is used at the k-th LR-CG iteration for the low-rank truncation of all the other iterates. We want to point out that, in the LR-CG implementation proposed in [8], the residual matrix R_{k+1} is explicitly calculated by means of the current approximate solution X_{k+1}. We compute the residual norm before truncating R_{k+1}, so that what we are actually evaluating is the true residual norm and not an upper bound thereof.
The results are collected in Table 3, where the column "Mem." reports the maximum number of columns that had to be stored in the low-rank factors of all the iterates X_{k+1}, R_{k+1}, P_{k+1}, Q_{k+1}, and Z_{k+1}. Except for Data 1 with P_Ullmann as a preconditioner, where LR-GMRES and LR-CG show similar results, especially in terms of memory requirements, LR-CG allows for a much lower storage demand with a consequent reduction in the total computational efforts, while achieving the prescribed accuracy. However, for Data

Table 1:
Example 7.1. Results for different values of n and ν.