1 Introduction

In many fields of scientific computing, it is important to solve linear matrix equations of the form

$$ \begin{array}{@{}rcl@{}} \mathcal{A}(X) = B, \end{array} $$
(1)

where \(\mathcal {A}: \mathbb {R}^{n\times s} \rightarrow \mathbb {R}^{n\times s}\) is a linear operator, and \(X, B \in \mathbb {R}^{n\times s}\). In the present work, we focus mainly on the case where the representation of \(\mathcal {A}\) is a large sparse nonsymmetric matrix and \(s \ll n\).

The most basic form of (1) is the standard linear system of equations Ax = b, where \(A \in \mathbb {R}^{n\times n}\) and \(\boldsymbol {b} \in \mathbb {R}^{n}\). For this problem, short-recurrence Krylov subspace methods, such as the conjugate gradient squared (CGS) method [1], bi-conjugate gradient stabilized (BiCGSTAB) method [2], and BiCGStab2 method [3], are typical iterative solvers that are widely used. The CGS, BiCGSTAB, and BiCGStab2 methods are known to belong to the so-called Lanczos-type product methods (LTPMs) (also called hybrid BiCG methods), wherein the residuals are defined by the product of the stabilizing polynomials and BiCG residuals. The BiCGstab(L) method [4] and GPBiCG method [5] also belong to LTPMs and can be viewed as two different generalizations of the abovementioned typical methods. Moreover, the GPBiCGstab(L) method [6], which unifies BiCGstab(L) and GPBiCG, has been developed recently. Numerical experiments have been performed to demonstrate that the GPBiCGstab(L) method has superior convergence when compared to the conventional LTPMs, especially for linear systems with complex spectra.

Because \(\mathcal {A}\) is defined on a finite-dimensional space, there always exists a standard linear system of equations that is mathematically equivalent to (1), to which LTPMs can in principle be applied. However, it is often difficult to construct the representation matrix of \(\mathcal {A}\), i.e., the coefficient matrix of the converted linear system. For example, it is well known that the Sylvester equation

$$ \begin{array}{@{}rcl@{}} \mathcal{A}(X) = AX - XC = B,\quad A \in \mathbb{R}^{n\times n},\quad C \in \mathbb{R}^{s\times s} \end{array} $$
(2)

can be converted to a standard linear system \(\tilde {A} \boldsymbol {x} = \tilde {\boldsymbol {b}}\) with \(\tilde {A} := I_{s} \otimes A - C^{\top } \otimes I_{n} \in \mathbb {R}^{ns\times ns}\) and \(\tilde {\boldsymbol {b}} := \textbf {vec}(B) \in \mathbb {R}^{ns}\), but it is costly to construct \(\tilde {A}\) explicitly when ns is large. Here, ⊗ denotes the Kronecker product, Ik is the identity matrix of order k, and \(\textbf {vec}: \mathbb {R}^{n\times s} \rightarrow \mathbb {R}^{ns}\) is the vectorization operator, i.e., \(\textbf {vec}(V) = [\boldsymbol {v}_{1}^{\top }, \boldsymbol {v}_{2}^{\top }, \dots , \boldsymbol {v}_{s}^{\top }]^{\top } \in \mathbb {R}^{ns}\) for \(V = [\boldsymbol {v}_{1}, \boldsymbol {v}_{2}, \dots , \boldsymbol {v}_{s}] \in \mathbb {R}^{n\times s}\). More generally, there are also cases where the representation matrix of \(\mathcal {A}\) cannot be expressed explicitly. Therefore, application of iterative solvers to the matrix equations without explicitly using the representation matrix is desirable. The global Krylov subspace methods (cf., e.g., [7,8,9,10,11,12]) are known to be such matrix-free approaches and generate approximate solutions using the matrix Krylov subspace \(\mathcal K_{k}(\mathcal {A}, R_{0}) := \left \{ {\sum }_{i=0}^{k-1} c_{i} \mathcal {A}^{i}(R_{0}) \mid c_{i} \in \mathbb {R} \right \}\), where \(R_{0} := B-\mathcal {A}(X_{0})\) is the initial residual with an initial guess X0, and \(\mathcal {A}^{i}\) indicates that \(\mathcal {A}\) acts i times. Thus, the global methods can be implemented without using the representation matrix if the linear transformation \(\mathcal {A}(V)\) is available for an arbitrary \(V \in \mathbb {R}^{n\times s}\).
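As a concrete illustration, the following minimal MATLAB sketch (with our own variable names) checks on a tiny random instance that the matrix-free evaluation \(AX - XC\) agrees with the Kronecker representation \(\tilde{A}\,\textbf{vec}(X)\); in MATLAB, X(:) computes vec(X).

```matlab
% Verify the equivalence of the Sylvester operator and its Kronecker form
n = 5; s = 3;
A = rand(n); C = rand(s); X = rand(n, s);
Atilde = kron(eye(s), A) - kron(C', eye(n));  % representation matrix of (2)
y1 = Atilde * X(:);                           % vec(A(X)) via the Kronecker form
Y2 = A*X - X*C;                               % matrix-free evaluation of A(X)
norm(y1 - Y2(:))                              % ~1e-15 in double precision
```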

Several LTPMs have been extended thus far to their global versions. For example, the Gl-CGS-type methods [13, 14], Gl-BiCGSTAB method [9], and Gl-GPBiCG method [15] have been proposed, where “Gl-” represents the global version of the method. Moreover, relating to the methods based on the global Lanczos process [9], the global quasi-minimal residual (Gl-QMR) method [16] and the global bi-conjugate residual (Gl-BiCR) type methods [17] have also been developed. However, to the best of our knowledge, the global versions of the classical BiCGstab(L) and recent GPBiCGstab(L) methods have not been studied previously. We therefore develop a novel global GPBiCGstab(L) method (including Gl-BiCGstab(L)) for further convergence improvement. Then, based on the aforementioned works, we restrict the target matrix equation to a large sparse nonsymmetric linear system with multiple right-hand sides AX = B and discuss the preconditioned algorithms of Gl-GPBiCGstab(L). It is comparatively easy to extend the original GPBiCGstab(L) algorithm [6, Algorithm 3] naively to its global version. However, when applying the so-called right preconditioning to this naive Gl-GPBiCGstab(L), several concerns arise regarding numerical stability and computational costs. To overcome these problems, we also derive a refined variant of Gl-GPBiCGstab(L), which is mathematically equivalent to the naive version but uses alternative recursion formulas to update the iteration matrices. This refined variant enables right preconditioning to be applied with greater robustness and at lower computational cost. Through numerical experiments on model problems (i.e., the linear system AX = B and the Sylvester equation AX − XC = B), we show that the refined Gl-GPBiCGstab(L) (with or without preconditioning) is more effective than other typical Gl-LTPMs.

The remainder of this paper is organized as follows. In Section 2, we describe a naive Gl-GPBiCGstab(L) algorithm for (1) by extending the original GPBiCGstab(L). In Section 3, we derive a refined variant of Gl-GPBiCGstab(L) by partly using different recursion formulas. In Section 4, we present the preconditioned algorithms of the naive and refined Gl-GPBiCGstab(L) for the representative matrix equation AX = B as well as note their advantages and disadvantages. In Section 5, numerical experiments are presented to demonstrate that the proposed methods have superior convergence when compared to the conventional ones. Finally, some concluding remarks and a note on the future directions are given in Section 6.

2 Global GPBiCGstab(L) algorithm

This section presents the extension of the GPBiCGstab(L) algorithm proposed in [6] to its global version for solving linear matrix equations of the form (1).

2.1 Extension to global algorithms

We first describe a generic method for constructing global algorithms. When an algorithm of the LTPMs for the standard linear system of equations is given, its global version can be derived directly through the following simple steps (see also [9, 10]):

  1. Applying the given LTPM algorithm to \(\tilde {A}\boldsymbol {x} = \tilde {\boldsymbol {b}}\) that is equivalent to (1), where \(\tilde {A} \in \mathbb {R}^{ns\times ns}\) is a representation matrix of \(\mathcal {A}\) and \(\tilde {\boldsymbol {b}} := \textbf {vec}(B)\).

  2. Rewriting the operations in the algorithm using the following equivalent reformulations. For \(\boldsymbol {x}, \boldsymbol {y} \in \mathbb {R}^{ns}\) and \(a \in \mathbb {R}\),

    $$ \begin{array}{@{}rcl@{}} \text{linear transformation: }\tilde{A} \boldsymbol{x} &\longrightarrow& \mathcal{A}(X), \end{array} $$
    (3)
    $$ \begin{array}{@{}rcl@{}} \text{vector update (AXPY): } a\boldsymbol{x} + \boldsymbol{y} &\longrightarrow& aX+Y, \end{array} $$
    (4)
    $$ \begin{array}{@{}rcl@{}} \text{inner product (DOT): }(\boldsymbol{x}, \boldsymbol{y}) := \boldsymbol{x}^{\top} \boldsymbol{y} &\longrightarrow & \langle X, Y\rangle_{F} := \text{tr}(X^{\top} Y), \end{array} $$
    (5)

    where \(X, Y \in \mathbb {R}^{n\times s}\) are the matrices satisfying x = vec(X) and y = vec(Y). Note that the 2-norm ∥x∥2 is also converted to the associated Frobenius norm (F-norm) \(\|X\|_{F} := \sqrt {\langle X, X\rangle _{F}}\).

From (3), the resulting global algorithm can be implemented without using the representation matrix \(\tilde {A}\) explicitly. This is the main advantage of the global methods in actual computations. Moreover, when the transformation by \(\mathcal {A}\) is based mainly on matrix–matrix products, we can also implement it with high-performance level-3 BLAS routines.
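For illustration, the reformulations (4) and (5) are one-liners in MATLAB. A minimal sketch (the handle names are ours):

```matlab
% Global building blocks acting on n-by-s iteration matrices
dotF  = @(X, Y) sum(sum(X .* Y));  % F-inner product (5); equals trace(X'*Y)
normF = @(X) sqrt(dotF(X, X));     % associated F-norm
axpy  = @(a, X, Y) a*X + Y;        % matrix AXPY (4)
% e.g., the initial residual and its norm for a given operator handle Aop:
% R0 = B - Aop(X0); res0 = normF(R0);
```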

2.2 Simple derivation

Herein, we describe an algorithm of the Gl-GPBiCGstab(L) method obtained by the simple transformations noted above. For details regarding the GPBiCGstab(L) method itself, we refer to [6] and the next section.

By applying the original GPBiCGstab(L) algorithm [6, Algorithm 3] to \(\tilde {A}\boldsymbol {x} = \tilde {\boldsymbol {b}}\) and rewriting the iteration process with (3)–(5), the naive extension to the Gl-GPBiCGstab(L) algorithm is obtained as shown in Algorithm 1. The notations in the algorithm are based on MATLAB conventions. For example, the variables R and P consist of n × s submatrices Ri and Pi for \(i=0,1,\dots , j\); that is, \(\mathrm {R} = [\mathrm {R}_{0}; \mathrm {R}_{1};{\dots } ;\mathrm {R}_{j}]\) and \(\mathrm {P} = [\mathrm {P}_{0}; \mathrm {P}_{1};{\dots } ; \mathrm {P}_{j}]\), where \([\mathrm {V}_{0}; \mathrm {V}_{1}; \dots ; \mathrm {V}_{j}] := [\mathrm {V}_{0}^{\top },\mathrm {V}_{1}^{\top },\dots , \mathrm {V}_{j}^{\top }]^{\top }\), and the submatrix Vi theoretically corresponds to \(\mathcal {A}^{i}(\mathrm {V}_{0})\).

Algorithm 1: Gl-GPBiCGstab(L) for (1) (naive)

Remark 1

When using MATLAB, there are several approaches for storing the variables comprising n × s submatrices, such as R and P above. One method is to construct \(\mathrm {R} = [\mathrm {R}_{0}; \mathrm {R}_{1};{\dots } ;\mathrm {R}_{j}]\) directly as an n(j + 1) × s long rectangular matrix. Another approach is to use a multidimensional array. Because the matrix R can be viewed as a three-dimensional tensor, it can be stored in an n × s × (j + 1) three-dimensional array. In this case, the i th submatrix Ri is accessed as R(:,:,i) in MATLAB. Alternatively, we can use a structure array. For example, the matrix R can be defined using the MATLAB command ‘struct’ as follows:

R_struct = struct('R', {R0, R1, ..., Rj}),

where the first argument ‘R’ is the field name and second argument is a cell array containing the submatrices. In this case, the i th submatrix Ri is accessed as R_struct(i).R. Because the implementation using the structure array is faster than the other approaches in our experience, we use this in the actual computations.
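For illustration, the three storage options can be written as follows (a minimal sketch for j = 2; variable names are ours):

```matlab
% Three ways to store R consisting of n-by-s submatrices R0, R1, R2
Rlong   = [R0; R1; R2];               % n(j+1)-by-s long rectangular matrix
R3d     = cat(3, R0, R1, R2);         % n-by-s-by-(j+1) three-dimensional array
Rstruct = struct('R', {R0, R1, R2});  % structure array with field 'R'
```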

Line 19 of Algorithm 1 requires solving a minimization problem

$$ \begin{array}{@{}rcl@{}} \underset{\zeta_{i}, \eta}{\min} \left\| \mathrm{R}_{0} - \sum\limits_{i=1}^{L} \zeta_{i}\mathrm{R}_{i} - \eta \mathrm{Y} \right\|_{F} \end{array} $$

for L + 1 scalar variables \(\zeta _{1},\dots ,\zeta _{L}\) and η, where \(\mathrm {R}_{i} \in \mathbb {R}^{n\times s}\) for \(i=0,1,\dots ,L\) and \(\mathrm {Y} \in \mathbb {R}^{n\times s}\) are given in the iteration process. Note that the integer L is a predetermined constant in Gl-GPBiCGstab(L). The above problem is equivalent to the standard least-squares problem

$$ \begin{array}{@{}rcl@{}} &&\quad\quad\quad\quad\quad \underset{\boldsymbol{\zeta}, \eta}{\min} \left\| \mathbf{vec}(\mathrm{R}_{0}) - M \left[\begin{array}{cc} \boldsymbol{\zeta}\\ \eta \end{array}\right] \right\|_{2}, \\ M & := & \left[\mathbf{vec}(\mathrm{R}_{1}), \dots, \mathbf{vec}(\mathrm{R}_{L}), \mathbf{vec}(\mathrm{Y})\right],\quad \boldsymbol{\zeta} := [\zeta_{1},\dots,\zeta_{L}]^{\top}, \end{array} $$
(6)

and can be converted to the normal equation

$$ \begin{array}{@{}rcl@{}} M^{\top} M \left[\begin{array}{cc} \boldsymbol{\zeta}\\ \eta \end{array}\right] = M^{\top} \mathbf{vec}(\mathrm{R}_{0}). \end{array} $$
(7)

As is well known, we can use a direct method such as Cholesky factorization to solve (7), and a more stable approach such as QR factorization is useful for solving the least-squares problem (6) directly when M is ill-conditioned.
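A minimal MATLAB sketch of this parameter computation, assuming that R0 and Y are n-by-s matrices and that the three-dimensional array Rs holds \(\mathrm{R}_{1},\dots,\mathrm{R}_{L}\) (all names are ours):

```matlab
% Build M = [vec(R1), ..., vec(RL), vec(Y)] and solve for zeta and eta
[n, s, L] = size(Rs);
M = [reshape(Rs, n*s, L), Y(:)];
coef = (M.'*M) \ (M.'*R0(:));  % normal equation (7)
% coef = M \ R0(:);            % QR-based solve of (6) when M is ill-conditioned
zeta = coef(1:L); eta = coef(end);
```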

Note that the first iteration of GPBiCGstab(L) corresponds to that of BiCGstab(L) and that η needs to be 0. Hence, the first iteration (i.e., one BiCGstab(L) iteration) is displayed separately from the main GPBiCGstab(L) iterations in the original algorithm [6, Algorithm 3], whereas we display all iterations in the same form using a branch at line 19 in Algorithm 1.

3 Alternative implementation of Gl-GPBiCGstab(L)

In this section, we present the derivation of another Gl-GPBiCGstab(L) algorithm that is mathematically equivalent to Algorithm 1 but exploits alternative recursion formulas for updating the iteration matrices. The resulting algorithm uses a slightly simpler expression and is also useful for designing a robust right preconditioned Gl-GPBiCGstab(L) with fewer additional costs, as shown in the next section.

3.1 Basic concepts

Similar to standard LTPMs, the global versions of the LTPMs can be characterized by generating the residuals Rk defined by the form

$$ \begin{array}{@{}rcl@{}} R_{k} := H_{k}(\mathcal{A})\left[ R_{k}^{gbcg} \right], \end{array} $$
(8)

where Hk(λ) is the so-called stabilizing polynomial of degree k satisfying Hk(0) = 1, and \(R_{k}^{gbcg}\) is the k th Gl-BiCG residual that is generated by the coupled two-term recurrences

$$ \begin{array}{@{}rcl@{}} R_{k+1}^{gbcg} &=& R_{k}^{gbcg} - \alpha_{k} \mathcal{A}\left( P_{k}^{gbcg}\right), \end{array} $$
(9)
$$ \begin{array}{@{}rcl@{}} P_{k+1}^{gbcg} &=& R_{k+1}^{gbcg} - \beta_{k} P_{k}^{gbcg} \end{array} $$
(10)

with recurrence coefficients \(\alpha _{k}, \beta _{k} \in \mathbb {R}\) and direction matrix \(P_{k}^{gbcg} \in \mathbb {R}^{n\times s}\) of Gl-BiCG [9]. Note that \(H_{k}(\mathcal {A})\) is a linear operator of the polynomial form, i.e., \(H_{k}(\mathcal {A}) = \mathcal I - \omega _{1}\mathcal {A} - {\dots } - \omega _{k} \mathcal {A}^{k}\) with \(\omega _{j} \in \mathbb {R}\) and an identity operator \(\mathcal I\), and we define its operation for \(X \in \mathbb {R}^{n\times s}\) as

$$ \begin{array}{@{}rcl@{}} H_{k}(\mathcal{A})[X] := X - \omega_{1}\mathcal{A}(X) - {\dots} - \omega_{k} \mathcal{A}^{k}(X). \end{array} $$

Specific choices of the stabilizing polynomials give specific Gl-LTPMs.
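Since \(H_{k}(\mathcal {A})\) involves only repeated applications of \(\mathcal {A}\), its action can be evaluated with k operator calls. A minimal sketch, assuming a function handle Aop for the action of \(\mathcal {A}\) and a coefficient vector omega (our names):

```matlab
function HX = applyH(Aop, omega, X)
% Evaluate H_k(A)[X] = X - omega(1)*A(X) - ... - omega(k)*A^k(X)
    HX = X;
    V = X;
    for j = 1:numel(omega)
        V = Aop(V);              % V now holds A^j(X)
        HX = HX - omega(j) * V;  % subtract omega_j * A^j(X)
    end
end
```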

The original GPBiCGstab(L) [6] is a novel framework for the standard LTPMs that uses the following comprehensive stabilizing polynomials.

$$ \begin{array}{@{}rcl@{}} H_{k+L}(\lambda) &:=& \left( 1-\zeta_{k,1} \lambda - {\dots} -\zeta_{k,L} \lambda^{L}\right)H_{k}(\lambda) - \eta_{k} \lambda G_{k-1}(\lambda), \end{array} $$
(11)
$$ \begin{array}{@{}rcl@{}} G_{k-1}(\lambda) &:=& \frac{H_{k-L}(\lambda) - H_{k}(\lambda)}{\lambda}, \end{array} $$
(12)

where H0(λ) := 1, η0 := 0, and k is a multiple of L. The L + 1 coefficients \(\zeta _{k,1}, \dots , \zeta _{k,L}, \eta _{k} \in \mathbb {R}\) are independent parameters. The stabilizing polynomials reduce to those of BiCGstab(L) if ηk is always set to 0, and they become equivalent to those of GPBiCG when L = 1. Thus, we can also derive CGS, BiCGSTAB, and BiCGStab2 from the above framework. In the practical computation of GPBiCGstab(L), the L + 1 parameters are determined by locally minimizing the residual norm in each iteration.

In the following subsections, we show the derivation of the Gl-GPBiCGstab(L) algorithm that generates the residuals (8) with the stabilizing polynomials (11) and (12). The derivation is somewhat different from the one in [6].

3.2 Recursion formulas for updating the iteration matrices

Herein, the operators \(H_{k}(\mathcal {A})\) and \(G_{k-1}(\mathcal {A})\) are denoted as Hk and Gk− 1, respectively, and we omit the parentheses for simplicity; specifically, \(H_{k}(\mathcal {A})[X]\), \(G_{k-1}(\mathcal {A})[X]\), and \(\mathcal {A}(X)\) are expressed as HkX, Gk− 1X, and \(\mathcal {A}X\), respectively.

Let Xk, \(R_{k} = \mathrm {H}_{k}R_{k}^{gbcg}\), and \(P_{k} = \mathrm {H}_{k}P_{k}^{gbcg}\) be the approximation, residual, and direction matrix of Gl-GPBiCGstab(L), respectively. These matrices are updated to Xk+L, \(R_{k+L} = \mathrm {H}_{k+L}R_{k+L}^{gbcg}\), and \(P_{k+L} = \mathrm {H}_{k+L}P_{k+L}^{gbcg}\), respectively, and this updating process is counted as one cycle below. We now introduce the following auxiliary matrices for \(j=1,2,\dots ,L\).

$$ \begin{array}{@{}rcl@{}} R_{k}^{(j)}:=\mathrm{H}_{k}R_{k+j}^{gbcg},\quad P_{k}^{(j)}:=\mathrm{H}_{k}P_{k+j}^{gbcg}, \end{array} $$
(13)
$$ \begin{array}{@{}rcl@{}} Y_{k}^{(j)}:=\mathcal{A}\mathrm{G}_{k-1}R_{k+j}^{gbcg},\quad U_{k}^{(j)}:=\mathcal{A}\mathrm{G}_{k-1}P_{k+j}^{gbcg},\quad Z_{k}^{(j)} := \mathrm{G}_{k-1}R_{k+j}^{gbcg}. \end{array} $$
(14)

From (11), (13), and (14), the updated residual and direction matrices can be expanded as

$$ \begin{array}{@{}rcl@{}} R_{k+L} &=& R_{k}^{(L)} - \zeta_{k,1}\mathcal{A}R_{k}^{(L)} - {\dots} - \zeta_{k,L}\mathcal{A}^{L}R_{k}^{(L)} - \eta_{k}Y_{k}^{(L)}, \end{array} $$
(15)
$$ \begin{array}{@{}rcl@{}} P_{k+L} &=& P_{k}^{(L)} - \zeta_{k,1}\mathcal{A}P_{k}^{(L)} - {\dots} - \zeta_{k,L}\mathcal{A}^{L}P_{k}^{(L)} - \eta_{k}U_{k}^{(L)}, \end{array} $$
(16)

and the associated approximation can be expressed as

$$ \begin{array}{@{}rcl@{}} X_{k+L} = X_{k}^{(L)} + \zeta_{k,1}R_{k}^{(L)} + {\dots} + \zeta_{k,L}\mathcal{A}^{L-1}R_{k}^{(L)} + \eta_{k}Z_{k}^{(L)}, \end{array} $$
(17)

where we note that \(Y_{k}^{(L)} = \mathcal {A}Z_{k}^{(L)}\) holds. When the auxiliary matrices in (15)–(17) are obtained, we can determine the L + 1 parameters \(\zeta _{k,1}, \dots , \zeta _{k,L}, \eta _{k}\) to minimize the residual norm ∥Rk+LF, as described in Section 2.2.

As mentioned in [6], the stabilizing polynomial Hk+L(λ) consists of two parts: multiplication of an L th degree polynomial with the previous Hk(λ) and the relaxation term \(-\eta_{k} \lambda G_{k-1}(\lambda)\). We therefore describe the generation of auxiliary matrices for each part separately.

3.2.1 Multiplication of the L th degree polynomial

We describe an iteration process to generate the auxiliary matrices \(\mathcal {A}^{i}R_{k}^{(L)}\) and \(\mathcal {A}^{i}P_{k}^{(L)}\) for \(i=0,1,\dots ,L\), and the associated approximation \(X_{k}^{(L)}\) in (15)–(17). This process corresponds to the well-known L-times-BiCG steps in BiCGstab(L), and the description below is based on that in [6, Section 2.2].

Assume that the approximation \(X_{k}^{(0)} := X_{k}\), the corresponding residual \(R_{k}^{(0)} := R_{k}\), and the direction matrix \(P_{k}^{(0)} := P_{k}\) are given at the beginning of the cycle. Then, the j th repetition (\(j=1,2,\dots ,L\)) of the BiCG steps is described as follows.

From the Gl-BiCG recursions (9) and (10), applying operators \(\mathcal {A}^{i}\mathrm {H}_{k}\) to \(R_{k+j}^{gbcg}\) for \(i=0,1,\dots , j-1\) and to \(P_{k+j}^{gbcg}\) for \(i=0,1,\dots , j\), we have

$$ \begin{array}{@{}rcl@{}} \mathcal{A}^{i}\mathrm{H}_{k}R_{k+j}^{gbcg} &=& \mathcal{A}^{i}\mathrm{H}_{k}R_{k+j-1}^{gbcg} - \alpha_{k+j-1} \mathcal{A}^{i+1}\mathrm{H}_{k} P_{k+j-1}^{gbcg},\quad i=0,1,\dots, j-1, \\ \mathcal{A}^{i}\mathrm{H}_{k}P_{k+j}^{gbcg} &=& \mathcal{A}^{i}\mathrm{H}_{k}R_{k+j}^{gbcg} - \beta_{k+j-1} \mathcal{A}^{i}\mathrm{H}_{k}P_{k+j-1}^{gbcg},\quad i=0,1,\dots, j, \end{array} $$

where it is noted that \(\mathrm {H}_{k} \mathcal {A}V = \mathcal {A} \mathrm {H}_{k}V\) holds for an arbitrary \(V \in \mathbb {R}^{n\times s}\). Rewriting the above recursions with the auxiliary matrices (13) gives

$$ \begin{array}{@{}rcl@{}} \mathcal{A}^{i}R_{k}^{(j)} &=& \mathcal{A}^{i}R_{k}^{(j-1)} - \alpha_{k}^{(j-1)} \mathcal{A}^{i+1}P_{k}^{(j-1)},\quad i=0,1,\dots, j-1, \end{array} $$
(18)
$$ \begin{array}{@{}rcl@{}} \mathcal{A}^{i}P_{k}^{(j)} &=& \mathcal{A}^{i}R_{k}^{(j)} - \beta_{k}^{(j-1)} \mathcal{A}^{i}P_{k}^{(j-1)},\quad i=0,1,\dots, j, \end{array} $$
(19)

where \(\alpha _{k}^{(j)} := \alpha _{k+j}\) and \(\beta _{k}^{(j)} := \beta _{k+j}\). The approximation \(X_{k}^{(j)}\) associated with \(R_{k}^{(j)}\) can be expressed as

$$ \begin{array}{@{}rcl@{}} X_{k}^{(j)} = X_{k}^{(j-1)} + \alpha_{k}^{(j-1)}P_{k}^{(j-1)}. \end{array} $$
(20)

Following [6, Eqs. (14) and (15)] and using the iteration matrices, we compute the Gl-BiCG coefficients \(\alpha _{k}^{(j)}\) and \(\beta _{k}^{(j)}\) as follows:

$$ \begin{array}{@{}rcl@{}} &&\quad\alpha_{k}^{(j)} = \frac{\rho_{k}^{(j)}}{\sigma_{k}^{(j)}},\quad \beta_{k}^{(j)} = \frac{\rho_{k}^{(j+1)}}{\sigma_{k}^{(j)}},\\ \rho_{k}^{(j)}&:=&\langle \tilde{R}_{0}, \mathcal{A}^{j}R_{k}^{(j)}\rangle_{F},\quad \sigma_{k}^{(j)} := \langle \tilde{R}_{0}, \mathcal{A}^{j+1} P_{k}^{(j)}\rangle_{F}, \end{array} $$
(21)

where \(\tilde {R}_{0}\) is the initial shadow residual.

Thus, when \(X_{k}^{(j-1)}\), \(\mathcal {A}^{i}R_{k}^{(j-1)}\), and \(\mathcal {A}^{i}P_{k}^{(j-1)}\) for \(i=0,1,\dots ,j-1\) are given at the beginning of the j th repetition, the new matrices \(X_{k}^{(j)}\), \(\mathcal {A}^{i}R_{k}^{(j)}\), and \(\mathcal {A}^{i}P_{k}^{(j)}\) for \(i=0,1,\dots ,j\) can be generated using (18)–(21). This updating scheme is summarized as follows:

$$ \begin{array}{cccccc} P_{k}^{(j-1)} \quad & X_{k}^{(j-1)}\quad & \quad\xrightarrow{\text{Eq.}(20)}\quad & \quad X_{k}^{(j)}\quad & \quad & \quad P_{k}^{(j)} \\ & & & &\quad \overset{\text{Eq.}(19)}{\nearrow}_{i=0} \quad & \\ \mathcal{A}P_{k}^{(j-1)} \quad& \quad R_{k}^{(j-1)} \quad & \xrightarrow[i=0]{\text{Eq.}(18)} \quad & \quad R_{k}^{(j)} \quad & \quad & \quad \mathcal{A}P_{k}^{(j)} \\ {\vdots} & {\vdots} & {\vdots} & {\vdots} & {\vdots} & {\vdots} \\ \mathcal{A}^{j-1}P_{k}^{(j-1)} \quad & \quad \mathcal{A}^{j-2}R_{k}^{(j-1)}\quad & \quad \xrightarrow[i=j-2]{\text{Eq.}(18)} \quad & \quad \mathcal{A}^{j-2}R_{k}^{(j)}\quad & \quad & \quad \mathcal{A}^{j-1}P_{k}^{(j)} \\ \downarrow^{\text{Apply } \mathcal{A}} \quad& & & & \quad\overset{\text{Eq.}(19)}{\nearrow}_{i=j-1} & \\ \mathcal{A}^{j}P_{k}^{(j-1)}\quad &\quad \mathcal{A}^{j-1}R_{k}^{(j-1)} \quad & \quad \xrightarrow[i=j-1]{\text{Eq.}(18)} \quad & \mathcal{A}^{j-1}R_{k}^{(j)} \quad & \quad & \quad \mathcal{A}^{j}P_{k}^{(j)} \\ & & & \quad\downarrow^{\text{Apply } \mathcal{A}} \quad & \quad \overset{\text{Eq.}(19)}{\nearrow}_{i=j}\quad & \\ & & & \quad \mathcal{A}^{j}R_{k}^{(j)} \quad & & \end{array} $$
(22)

Note that \(\mathcal {A}^{j}P_{k}^{(j-1)}\) and \(\mathcal {A}^{j}R_{k}^{(j)}\) are obtained by explicitly applying \(\mathcal {A}\) to \(\mathcal {A}^{j-1}P_{k}^{(j-1)}\) and \(\mathcal {A}^{j-1}R_{k}^{(j)}\), respectively. By repeating scheme (22) for \(j=1,2,\dots , L\), we obtain the auxiliary matrices \(X_{k}^{(L)}\), \(\mathcal {A}^{i}R_{k}^{(L)},\) and \(\mathcal {A}^{i}P_{k}^{(L)}\) for \(i=0,1,\dots ,L\). This process is identical to that used in Algorithm 1.
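A minimal MATLAB sketch of this cycle, assuming a function handle Aop for the action of \(\mathcal {A}\), cell arrays R and P whose (i + 1)-th entries hold \(\mathcal {A}^{i}R_{k}^{(\cdot )}\) and \(\mathcal {A}^{i}P_{k}^{(\cdot )}\) (only the first entries are set on entry), and the initial shadow residual Rt0 (all names are ours):

```matlab
% L-times BiCG steps, scheme (22)
rho = sum(sum(Rt0 .* R{1}));               % rho_k^(0)
for j = 1:L
    P{j+1} = Aop(P{j});                    % A^j P^(j-1) by explicit application
    sigma  = sum(sum(Rt0 .* P{j+1}));      % sigma_k^(j-1)
    alpha  = rho / sigma;                  % alpha_k^(j-1), Eq. (21)
    X = X + alpha * P{1};                  % Eq. (20)
    for i = 0:j-1
        R{i+1} = R{i+1} - alpha * P{i+2};  % Eq. (18)
    end
    R{j+1} = Aop(R{j});                    % A^j R^(j) by explicit application
    rho  = sum(sum(Rt0 .* R{j+1}));        % rho_k^(j)
    beta = rho / sigma;                    % beta_k^(j-1), Eq. (21)
    for i = 0:j
        P{i+1} = R{i+1} - beta * P{i+1};   % Eq. (19)
    end
end
```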

3.2.2 Relaxation

Next, we describe the iterations to generate the auxiliary matrices \(Y_{k}^{(L)}\), \(U_{k}^{(L)}\), and \(Z_{k}^{(L)}\) for (15)–(17). This process corresponds to [6, Section 3.1], but we exploit different recursion formulas, and the resulting algorithm will thus be different from Algorithm 1.

First, we consider the generation of \(Y_{k}^{(L)} = \mathcal {A}\mathrm {G}_{k-1}R_{k+L}^{gbcg}\) and \(U_{k}^{(L)} = \mathcal {A}\mathrm {G}_{k-1}P_{k+L}^{gbcg}\). We introduce additional auxiliary matrices

$$ \begin{array}{@{}rcl@{}} S_{k}^{(j)} := \mathrm{H}_{k-L}R_{k+j}^{gbcg},\quad Q_{k}^{(j)} := \mathrm{H}_{k-L}P_{k+j}^{gbcg} \end{array} $$

for \(j=0,1,\dots ,L\), where \(S_{k}^{(0)} = R_{k-L}^{(L)}\) and \(Q_{k}^{(0)} = P_{k-L}^{(L)}\) hold. Using the relation \(\mathcal {A}\mathrm {G}_{k-1} = \mathrm {H}_{k-L} - \mathrm {H}_{k}\), we have

$$ \begin{array}{@{}rcl@{}} \mathcal{A}\mathrm{G}_{k-1}R_{k+L}^{gbcg} &=& \mathrm{H}_{k-L}R_{k+L}^{gbcg} - \mathrm{H}_{k}R_{k+L}^{gbcg},\\ \mathcal{A}\mathrm{G}_{k-1}P_{k+L}^{gbcg} &=& \mathrm{H}_{k-L}P_{k+L}^{gbcg} - \mathrm{H}_{k}P_{k+L}^{gbcg}, \end{array} $$

which can be rewritten as follows:

$$ \begin{array}{@{}rcl@{}} Y_{k}^{(L)} &=& S_{k}^{(L)} - R_{k}^{(L)}, \end{array} $$
(23)
$$ \begin{array}{@{}rcl@{}} U_{k}^{(L)} &=& Q_{k}^{(L)} - P_{k}^{(L)}. \end{array} $$
(24)

Because \(R_{k}^{(L)}\) and \(P_{k}^{(L)}\) are generated as the i = 0 entries in the L repetitions of (22), it remains to consider the method of obtaining \(S_{k}^{(L)}\) and \(Q_{k}^{(L)}\). At the j th repetition, applying \(\mathcal {A}^{i}\mathrm {H}_{k-L}\) for \(i=0,1,\dots , L-j\) to the Gl-BiCG recursions, we have

$$ \begin{array}{@{}rcl@{}} \mathcal{A}^{i}\mathrm{H}_{k-L}R_{k+j}^{gbcg} &=& \mathcal{A}^{i}\mathrm{H}_{k-L}R_{k+j-1}^{gbcg} - \alpha_{k+j-1} \mathcal{A}^{i+1}\mathrm{H}_{k-L}P_{k+j-1}^{gbcg}, \\ \mathcal{A}^{i}\mathrm{H}_{k-L}P_{k+j}^{gbcg} &=& \mathcal{A}^{i}\mathrm{H}_{k-L}R_{k+j}^{gbcg} - \beta_{k+j-1} \mathcal{A}^{i}\mathrm{H}_{k-L}P_{k+j-1}^{gbcg}. \end{array} $$

Rewriting these recursions with the auxiliary matrices gives

$$ \begin{array}{@{}rcl@{}} \mathcal{A}^{i}S_{k}^{(j)} &=& \mathcal{A}^{i}S_{k}^{(j-1)} - \alpha_{k}^{(j-1)} \mathcal{A}^{i+1}Q_{k}^{(j-1)}, \end{array} $$
(25)
$$ \begin{array}{@{}rcl@{}} \mathcal{A}^{i}Q_{k}^{(j)} &=& \mathcal{A}^{i}S_{k}^{(j)} - \beta_{k}^{(j-1)} \mathcal{A}^{i}Q_{k}^{(j-1)}. \end{array} $$
(26)

Thus, when \(\mathcal {A}^{i}S_{k}^{(j-1)}\) for \(i=0,1,\dots ,L-j\) and \(\mathcal {A}^{i}Q_{k}^{(j-1)}\) for \(i=0,1,\dots ,L-j+1\) are given at the beginning of the j th repetition, the new matrices \(\mathcal {A}^{i}S_{k}^{(j)}\) and \(\mathcal {A}^{i}Q_{k}^{(j)}\) for \(i=0,1,\dots ,L-j\) can be generated using (21), (25), and (26). This is summarized by the following scheme:

$$ \begin{array}{cccccc} Q_{k}^{(j-1)} & & & & & Q_{k}^{(j)} \\ & & & & \overset{\text{Eq.}(26)}{\nearrow}_{i=0} & \\ \mathcal{A}Q_{k}^{(j-1)} & S_{k}^{(j-1)} & \xrightarrow[i=0]{\text{Eq.}(25)} & S_{k}^{(j)} & & \mathcal{A}Q_{k}^{(j)} \\ {\vdots} & {\vdots} & {\vdots} & {\vdots} & {\vdots} & {\vdots} \\ \mathcal{A}^{L-j}Q_{k}^{(j-1)} & \mathcal{A}^{L-j-1}S_{k}^{(j-1)} & \xrightarrow[i=L-j-1]{\text{Eq.}(25)} & \mathcal{A}^{L-j-1}S_{k}^{(j)} & & \mathcal{A}^{L-j}Q_{k}^{(j)} \\ & & & & \overset{\text{Eq.}(26)}{\nearrow}_{i=L-j} & \\ \mathcal{A}^{L-j+1}Q_{k}^{(j-1)} & \mathcal{A}^{L-j}S_{k}^{(j-1)} & \xrightarrow[i=L-j]{\text{Eq.}(25)} & \mathcal{A}^{L-j}S_{k}^{(j)} & & \end{array} $$
(27)

Here, the starting matrices \(\mathcal {A}^{i}S_{k}^{(0)} = \mathcal {A}^{i}R_{k-L}^{(L)}\) for \(i=0,1,\dots ,L-1\) and \(\mathcal {A}^{i}Q_{k}^{(0)} = \mathcal {A}^{i}P_{k-L}^{(L)}\) for \(i=0,1,\dots ,L\) are generated in the previous cycle. We note that when j increases, the number of matrix updates decreases in the scheme (27), whereas it increases in the scheme (22). Thus, repeating (27) for \(j=1,2,\dots ,L\) gives \(S_{k}^{(L)}\) and \(Q_{k}^{(L)}\).

We now consider the computation of \(Z_{k}^{(L)} = \mathrm {G}_{k-1}R_{k+L}^{gbcg}\). From \(\mathcal {A}\mathrm {G}_{k-1} = \mathrm {H}_{k-L} - \mathrm {H}_{k}\) and the Gl-BiCG recursion for the residuals, we have

$$ \begin{array}{@{}rcl@{}} \mathcal{A}\mathrm{G}_{k-1}P_{k+j-1}^{gbcg} &=& \mathrm{H}_{k-L}P_{k+j-1}^{gbcg} - \mathrm{H}_{k}P_{k+j-1}^{gbcg},\\ \mathrm{G}_{k-1}R_{k+j}^{gbcg} &=& \mathrm{G}_{k-1}R_{k+j-1}^{gbcg} - \alpha_{k+j-1} \mathcal{A}\mathrm{G}_{k-1}P_{k+j-1}^{gbcg}, \end{array} $$

and rewriting them gives

$$ \begin{array}{@{}rcl@{}} U_{k}^{(j-1)} &=& Q_{k}^{(j-1)} - P_{k}^{(j-1)}, \end{array} $$
(28)
$$ \begin{array}{@{}rcl@{}} Z_{k}^{(j)} &=& Z_{k}^{(j-1)} - \alpha_{k}^{(j-1)} U_{k}^{(j-1)}. \end{array} $$
(29)

Therefore, we generate \(Z_{k}^{(L)}\) by repeating (29) with (28) for \(j=1,2,\dots ,L\), where \(U_{k}^{(0)} := Q_{k}^{(0)} - P_{k}^{(0)}\) and \(Z_{k}^{(0)} := \mathrm {G}_{k-1}R_{k}^{gbcg}\). Because it holds that

$$ \begin{array}{@{}rcl@{}} \mathrm{G}_{k+L-1} = \zeta_{k,1}\mathrm{H}_{k} + \zeta_{k,2}\mathcal{A}\mathrm{H}_{k} + {\dots} + \zeta_{k,L}\mathcal{A}^{L-1}\mathrm{H}_{k} + \eta_{k}\mathrm{G}_{k-1}, \end{array} $$

application to \(R_{k+L}^{gbcg}\) produces a recursion formula for computing the next starting matrix \(Z_{k+L}^{(0)}\) as follows:

$$ \begin{array}{@{}rcl@{}} Z_{k+L}^{(0)} = \zeta_{k,1}R_{k}^{(L)} + \zeta_{k,2}\mathcal{A}R_{k}^{(L)} + {\dots} + \zeta_{k,L}\mathcal{A}^{L-1}R_{k}^{(L)} + \eta_{k}Z_{k}^{(L)}. \end{array} $$
(30)

We now have all the auxiliary matrices required for the cycles.
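Once all the auxiliary matrices are available, the end-of-cycle updates (15)–(17), (23), (24), and (30) reduce to linear combinations. A minimal sketch, reusing the cell arrays R and P from Section 3.2.1 and writing S, Q, and Z for \(S_{k}^{(L)}\), \(Q_{k}^{(L)}\), and \(Z_{k}^{(L)}\), with zeta and eta obtained from the local residual minimization (all names are ours):

```matlab
Y = S - R{1};                        % Eq. (23)
U = Q - P{1};                        % Eq. (24)
Xnew = X    + eta * Z;               % Eq. (17), relaxation part
Rnew = R{1} - eta * Y;               % Eq. (15), relaxation part
Pnew = P{1} - eta * U;               % Eq. (16), relaxation part
Znew = eta * Z;                      % Eq. (30), relaxation part
for i = 1:L
    Xnew = Xnew + zeta(i) * R{i};    % + zeta_i A^(i-1) R_k^(L)
    Rnew = Rnew - zeta(i) * R{i+1};  % - zeta_i A^i R_k^(L)
    Pnew = Pnew - zeta(i) * P{i+1};  % - zeta_i A^i P_k^(L)
    Znew = Znew + zeta(i) * R{i};    % + zeta_i A^(i-1) R_k^(L), Eq. (30)
end
```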

3.3 Refined Gl-GPBiCGstab(L) algorithm

By combining (15)–(30), we obtain the refined Gl-GPBiCGstab(L) algorithm shown in Algorithm 2. We refer to Section 2 for the notation. Algorithms 1 and 2 generate the same approximations in exact arithmetic, but they use different formulas, especially for computing Y and U, so their numerical behaviors differ in finite-precision arithmetic. The total computational costs of Algorithms 1 and 2 are identical. However, Algorithm 2 has a slightly simpler implementation because, unlike Algorithm 1, it has no branch in the loops over j and no recursions for Y and U. Moreover, this simplicity is useful when applying preconditioning. As we show in later sections, Algorithm 2 with right preconditioning is numerically more stable and has a lower computational cost than Algorithm 1.

Algorithm 2: Gl-GPBiCGstab(L) for (1) (refined)

We briefly describe the related Gl-LTPMs derived from Gl-GPBiCGstab(L). If the parameter η is set to 0 in Algorithms 1 and 2, the computations for the relaxation vanish, and both algorithms reduce to the same Gl-BiCGstab(L) algorithm. To the best of our knowledge, this is also a new Gl-LTPM. In contrast, Algorithms 1 and 2 with L = 1 are mathematically equivalent to Gl-GPBiCG [15, Algorithm 2] (and also Gl-GPBiCG-plus [15, Algorithm 4]), but they use different implementations. As in the case of standard LTPMs, the global versions of CGS, BiCGSTAB, and BiCGStab2 can be derived from the framework of Gl-GPBiCG; therefore, Gl-GPBiCGstab(L) includes them as well. Thus, Algorithms 1 and 2 can be reduced to various global methods with different implementations. However, because the main purpose of this study is to develop an effective Gl-GPBiCGstab(L) method, we do not discuss these simplified algorithms or their implementations further. We consider several key algorithms and compare their convergence later in numerical experiments.

4 Preconditioning for linear systems with multiple right-hand sides

In this section, we focus on linear systems with multiple right-hand sides

$$ \begin{array}{@{}rcl@{}} AX = B,\quad A\in \mathbb{R}^{n\times n},\quad B \in \mathbb{R}^{n\times s}, \end{array} $$
(31)

which constitute one of the most important types of linear matrix equations [7,8,9,10]. Here, A is a large sparse nonsymmetric and nonsingular matrix. Because the preconditioning technique is useful for enhancing the convergence of Gl-LTPMs when solving (31), we discuss the preconditioned algorithms of Gl-GPBiCGstab(L).

There are several approaches for converting the above system into a well-conditioned system using a preconditioner \(K \approx A\). Multiplying (31) by K− 1 from the left side gives the left preconditioned system

$$ \begin{array}{@{}rcl@{}} \hat{A}X = \hat{B},\quad \hat{A} := K^{-1}A,\quad \hat{B}:= K^{-1}B. \end{array} $$

In general, the left preconditioned algorithm is easy to implement by replacing only the multiplications by A with those by K− 1A, but the residuals generated in the iterations change to \(\hat {R}_{k} = K^{-1}R_{k}\); we therefore need special care in setting the stopping rule. Conversely, the right preconditioned system is given in the form

$$ \begin{array}{@{}rcl@{}} \hat{A}\hat{X} = B,\quad \hat{A} := AK^{-1},\quad \hat{X}:= KX. \end{array} $$
(32)

In this case, the residuals coincide with the standard ones, i.e., \(\hat {R}_{k} = R_{k}\), and the approximations generated in the iterations change to \(\hat {X}_{k} = KX_{k}\). The approximation \(X_{k} = K^{-1}\hat {X}_{k}\) must be computed accurately once the iterations are terminated. Alternatively, to obtain the approximations during the iterations, it is well known that we can update Xk recursively instead of \(\hat {X}_{k}\) through some changes of variables (cf., e.g., [2, 18]). We describe such algorithms of Gl-GPBiCGstab(L) with right preconditioning.
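For concreteness, the following minimal sketch sets up the action of a right ILU(0) preconditioner on an n-by-s matrix, as used in our experiments below (the handle names are ours):

```matlab
% ILU(0) factorization K = Lf*Uf ~ A and the actions needed in (32)
setup.type = 'nofill';       % no fill-in, i.e., ILU(0)
[Lf, Uf] = ilu(A, setup);    % A must be a sparse matrix
Kinv  = @(V) Uf \ (Lf \ V);  % V -> K^{-1} V (two triangular solves, column-wise)
AKinv = @(V) A * Kinv(V);    % V -> A K^{-1} V, the operator in (32)
```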

4.1 Naive right preconditioned algorithm

Algorithm 3: Right preconditioned Gl-GPBiCGstab(L) for (31) (naive)

Algorithm 3 is a naive right preconditioned Gl-GPBiCGstab(L) algorithm that is obtained by applying Algorithm 1 to (32). Similar to [2, 18], several variables are changed to update Xk recursively, and a hat symbol ‘\(\hat {~}\) ’ on a variable denotes that K− 1 acts on the underlying variable (without the hat). For example, we have the following relations in Algorithm 3:

$$ \begin{array}{@{}rcl@{}} \mathrm{R} & = & \left[\mathrm{R}_{0}; \mathrm{R}_{1}; {\dots} ;\mathrm{R}_{j}\right] = \left[R_{k}^{(j)}; \left( AK^{-1}\right)R_{k}^{(j)}; {\dots} ;\left( AK^{-1}\right)^{j}R_{k}^{(j)}\right],\\ \hat{\mathrm{R}} & = & \left[\hat{\mathrm{R}}_{0}; \hat{\mathrm{R}}_{1}; {\dots} ;\hat{\mathrm{R}}_{j}\right] = \left[\!K^{-1}R_{k}^{(j)}; K^{-1}\left( AK^{-1}\right)R_{k}^{(j)}; {\dots} ;K^{-1}\left( AK^{-1}\right)^{j}R_{k}^{(j)}\!\right], \end{array} $$
$$ \begin{array}{@{}rcl@{}} \mathrm{P} & = & [\mathrm{P}_{0}; \mathrm{P}_{1}; {\dots} ;\mathrm{P}_{j-1}] = \left[\left( AK^{-1}\right)P_{k}^{(j)}; \left( AK^{-1}\right)^{2}P_{k}^{(j)}; {\dots} ;\left( AK^{-1}\right)^{j}P_{k}^{(j)}\right],\\ \hat{\mathrm{P}} & = & \left[\hat{\mathrm{P}}_{0}; \hat{\mathrm{P}}_{1}; {\dots} ;\hat{\mathrm{P}}_{j}\right] = \left[\!K^{-1}P_{k}^{(j)}; K^{-1}\left( AK^{-1}\right)P_{k}^{(j)}; {\dots} ;K^{-1}\left( AK^{-1}\right)^{j}P_{k}^{(j)}\right], \end{array} $$

where we note that \(P_{k}^{(j)}\) itself is not needed for updating Xk, and we set \(\mathrm {P}_{0} := (AK^{-1})P_{k}^{(j)}\). Similar relationships also hold between other variables with and without hats.

We briefly describe some concerns regarding Algorithm 3. Because \(\hat {\mathrm {R}}\) is computed by a linear combination of \(\hat {\mathrm {R}}_{i}\) for \(i=0,1,\dots ,L\) and \(\hat {\mathrm {Y}}\) in line 26, it may differ significantly from K− 1R owing to the accumulation of rounding errors. In our experience, this causes numerical instabilities, such as stagnation and divergence of the residual norms in the late stages of the iterations (see Section 5.1 below). In our preliminary experiments, we confirmed that the convergence can be improved by computing \(\hat {\mathrm {R}}\) explicitly in the form K− 1R. However, this approach is costly because it requires additional multiplications by K− 1. We note that because \(\hat {\mathrm {Y}}\) and \(\hat {\mathrm {U}}\) are computed by their coupled recursions with the starting matrix \(\hat {\mathrm {Y}} = \hat {\mathrm {R}}^{\prime } - \hat {\mathrm {R}}\) given in line 26, it is difficult to remove this line without using additional multiplications by K− 1.

4.2 Refined right preconditioned algorithm

We can overcome the above difficulty by exploiting Algorithm 2. Algorithm 4 presents the refined right preconditioned Gl-GPBiCGstab(L) algorithm obtained by applying Algorithm 2 to (32). Here, because \(\hat {\mathrm {U}}\) is computed directly in the form \(\hat {\mathrm {U}} = \hat {\mathrm {Q}}_{0} - \hat {\mathrm {P}}_{0}\), we do not need \(\hat {\mathrm {Y}}\), which enables removal of the linear combination for \(\hat {\mathrm {R}}\). Then, unlike in Algorithm 3, multiplication by K− 1 is used only to compute the following variables in the j th repetition.

$$ \begin{array}{@{}rcl@{}} \hat{\mathrm{R}}_{j-1} &=& K^{-1}\mathrm{R}_{j-1} = K^{-1}\left( AK^{-1}\right)^{j-1}R_{k}^{(j)},\\ \hat{\mathrm{P}}_{j} &=& K^{-1}\mathrm{P}_{j-1} = K^{-1}\left( AK^{-1}\right)^{j}P_{k}^{(j)}. \end{array} $$

Hence, \(\hat {\mathrm {R}}\) is always a good approximation of K− 1R in finite-precision arithmetic, and we expect that Algorithm 4 is numerically more stable than Algorithm 3.

Note that because multiplications with K− 1 are used in different parts of Algorithms 3 and 4, the algorithms with η = 0 result in different formulations of the right preconditioned Gl-BiCGstab(L).

Algorithm 4: Right preconditioned Gl-GPBiCGstab(L) for (31) (refined)

4.3 Computational costs and memory requirements

We compare the computational costs and memory requirements among the related right preconditioned algorithms. We consider the original Gl-GPBiCG and Gl-GPBiCG-plus with right preconditioning given in [15] as well as the naive and refined variants of Gl-BiCGstab(L) and Gl-GPBiCGstab(L) with right preconditioning. Note that the evaluations for the naive and refined Gl-GPBiCG derived from Gl-GPBiCGstab(L) are obtained by substituting L = 1 in the results of Algorithms 3 and 4, respectively.

Below, “MM” denotes a matrix–matrix product with A. Although Gl-GPBiCG and Gl-GPBiCG-plus require two MMs per iteration while the others require 2L MMs per cycle, all methods require one MM per unit increase of the Krylov dimension on average. A single multiplication by K− 1 is also required per MM for all the methods. The numbers of AXPYs and DOTs per MM as well as the memory requirements are summarized in Table 1, where AXPY and DOT correspond to the forms (4) and (5), respectively. Following [6], the form aX or X + Y is counted as 1/2 AXPYs. The computational costs for checking the stopping rule are not included. The memory requirements indicate the number of n × s matrices that must be stored in the algorithms. The memory for A and B is not counted.

Table 1 Computational cost per MM and memory requirements with right preconditioning

From Table 1, we see that Algorithm 4 has the additional advantage of requiring lower computational cost and less memory than Algorithm 3. For a modest value of L, e.g., L ≤ 4, the costs and memory requirements of Algorithm 4 are not high compared with those of the Gl-GPBiCG and Gl-BiCGstab(L) algorithms.

5 Numerical experiments

We present numerical experiments to show the effectiveness of the proposed Gl-GPBiCGstab(L) method. Numerical calculations were carried out in double-precision floating-point arithmetic on a PC (Intel Core i7-1185G7 CPU with 32 GB of RAM) running MATLAB R2021a. The right-hand side \(B \in \mathbb {R}^{n\times s}\) was given by a random matrix. The initial guess X0 and initial shadow residual \(\tilde {R}_{0}\) were set to O and R0 (= B), respectively. The least-squares problem (6) was converted to the normal equation (7) and solved using the backslash command in MATLAB. The other computational conditions were set individually for each example, as noted below.

5.1 Comparison of the naive and refined algorithms

We first compare convergence between the naive and refined Gl-GPBiCGstab(L) algorithms with and without preconditioning. Algorithms 1–4 were applied to (31). Following [6], the coefficient matrix A was given by the following Toeplitz matrix:

$$ \begin{array}{@{}rcl@{}} A := \left[\begin{array}{cccc} 2 & 1 & & \\ 0 & 2 & 1 & \\ 0 & 0 & 2 & {\ddots} \\ 0 & 0 & 0 & {\ddots} \\ 1.4 & 0 & 0 & {\ddots} \\ & 1.4 & 0 & {\ddots} \\ & & 1.4 & {\ddots} \\ & & & \ddots \end{array}\right] \in \mathbb{R}^{500 \times 500}. \end{array} $$

The iterations were stopped when the relative residual norms ∥RkF/∥BF were less than 10− 14. The parameters s and L were set to s = 1,2,4,8,16,32 and L = 2,4,8, respectively. The maximum number of MMs was set to 2n. ILU(0) [19] was used as the preconditioner in Algorithms 3 and 4.
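For reference, this test problem can be set up with a few lines of MATLAB (a sketch of our own construction of the displayed banded Toeplitz matrix):

```matlab
n = 500; e = ones(n, 1);
A  = spdiags([1.4*e, 2*e, e], [-4, 0, 1], n, n);  % 2 on the diagonal, 1 above,
                                                  % 1.4 on the 4th subdiagonal
B  = rand(n, s);                                  % random right-hand side
X0 = zeros(n, s);                                 % initial guess, so R0 = B
```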

Figure 1 shows the convergence histories of the relative residual norms of Algorithms 3 and 4, i.e., the naive and refined Gl-GPBiCGstab(L) with right preconditioning, for s = 1 and 32. The plots indicate the number of MMs along the horizontal axis versus \(\log _{10}\) of the relative residual F-norm along the vertical axis. Note that “MV” (matrix–vector product) and 2-norm are utilized for s = 1 instead of MM and F-norm, respectively. Table 2 shows the number of MMs required for successful convergence of the four algorithms. The symbol ‡ in Table 2 indicates that the residual norms stagnate or diverge at the late stages of the iterations, as displayed in Fig. 1.

Fig. 1: Convergence histories of Algorithms 3 and 4 (naive and refined Gl-GPBiCGstab(L) with right preconditioning, respectively) for s = 1 (left) and s = 32 (right)

Table 2 Number of MMs for Algorithms 1–4 for the Toeplitz matrix

From Fig. 1 and Table 2, we observe the following. The numbers of MMs required for successful convergence of Algorithms 1 and 2 (without preconditioning) are comparable for each value of s and L. In the preconditioned case, by contrast, Algorithm 4 converges faster than in the non-preconditioned case for all s and L, whereas Algorithm 3 often does not converge. As noted in Section 4.1, \(\hat {\mathrm {R}}\) in Algorithm 3 may differ from K− 1R because of rounding errors, and this difference is expected to grow with increasing L. Indeed, Algorithm 3 becomes unstable as L increases for a fixed s. Similar results have been observed for other matrices. Moreover, because stagnation of the residual norms occurs even when s = 1, there is an inherent problem in the original GPBiCGstab(L) with right preconditioning, but the numerical instability seems to grow with increasing s. For the above reasons, we conclude that our refined Gl-GPBiCGstab(L) algorithm is more robust than the naive version with right preconditioning.

5.2 Experiments for linear systems with multiple right-hand sides

Next, we apply several Gl-LTPMs with preconditioning to (31) and compare their convergence. We use Gl-BiCGSTAB, Gl-GPBiCG, Gl-BiCGstab(L), and Gl-GPBiCGstab(L), which can be obtained from Algorithm 4. For Gl-GPBiCG, we also use the Gl-GPBiCG-plus implementation [15, Algorithm 5]. Table 3 displays the abbreviations of the solvers and their corresponding algorithms. Right preconditioning was applied to all methods, and the ILU(0) preconditioner was used.

Table 3 Solvers used in Section 5.2
Table 4 Characteristics of test matrices for linear systems with multiple right-hand sides (31)

We consider test matrices from the SuiteSparse Matrix Collection [20]. Table 4 shows the dimension (n), number of nonzero entries (nnz), and 2-norm condition number (κ2) of each matrix. The condition number is displayed only if it is given in the above collection. The iterations were stopped when the relative residual norms ∥Rk∥F/∥B∥F were less than 10− 10. The maximum number of MMs was set to 2n. For Gl-BiCGstab(L) and Gl-GPBiCGstab(L), we used L = 2,4. The number of right-hand sides s was set to 16.

Figure 2 displays the convergence histories of the relative residual norms of Gl-LTPMs with right preconditioning for sme3Db and garon2. We refer to Section 5.1 for the plotting conventions. Table 5 shows the number of MMs required for successful convergence (MMs), the computation time (Time), and the explicitly computed relative residual norm (referred to as the true relative residual norm) ∥B − AXk∥F/∥B∥F (TRR) at the time of termination of the solvers. Following [6], for each problem the smallest MMs and Time are displayed in bold italics, and the second smallest values in italics only. “Inf” and “NaN” in the table indicate that the iterated values became “infinity” or “not a number”, respectively, in MATLAB and that the iterations could not proceed thereafter.

Fig. 2: Convergence histories of the relative residual norms of Gl-LTPMs with right preconditioning for sme3Db (left) and garon2 (right)

From Fig. 2 and Table 5, we observe the following. The fastest convergence in terms of the number of MMs is achieved by the proposed Gl-GPBiCGstab(L) for all problems. In particular, Gl-GPBiCGstab(4) often converges faster than not only the other Gl-LTPMs but also Gl-GPBiCGstab(2). As noted in [6], moderately increasing L seems to be useful for reducing the number of MMs required for successful convergence. These are important measures of the effectiveness of Gl-GPBiCGstab(L) as a global Krylov subspace method because the number of MMs coincides with the dimension of the matrix Krylov subspace. In terms of computation time, Gl-BiCGstab(L) and Gl-GPBiCGstab(L) are comparable to the conventional Gl-GPBiCG-plus on average. Comparing the true relative residual norms, there are cases where Gl-BiCGstab(L) and Gl-GPBiCGstab(L) generate slightly more accurate approximate solutions than Gl-GPBiCG-plus.

Table 5 Numbers of MMs, computation times, and true relative residual norms for Gl-LTPMs with right preconditioning for linear systems with multiple right-hand sides (31)

We also remark briefly on the numerical results when the stopping criterion was set to ∥Rk∥F/∥B∥F < 10− 12, because the observations differ slightly from the above. With respect to the convergence speed in terms of the number of MMs, Gl-BiCGstab(4) and Gl-GPBiCGstab(4) are superior to the other Gl-LTPMs. Moreover, only these two solvers converge for all the test matrices; the other conventional Gl-LTPMs fail to converge at the late stages of the iterations for some problems. On the other hand, the attainable accuracy in terms of the true residual norm is limited for all the solvers; that is, a so-called large residual gap (the difference between Rk and B − AXk) appears in most cases. This problem could be mitigated by combining Gl-LTPMs with the techniques described in [21].

5.3 Experiments for the Sylvester equation

Herein, we present the numerical results of the Gl-LTPMs for the Sylvester equation (2). The solvers shown in Table 6 are applied to (2) without preconditioning.

Table 6 Solvers used in Section 5.3

Note that the linear operation of \(\mathcal {A}\) on \(X \in \mathbb {R}^{n\times s}\) is defined as \(\mathcal {A}(X) := AX - XC\) for the given matrices \(A\in \mathbb {R}^{n\times n}\) and \(C\in \mathbb {R}^{s\times s}\), and “MM” is replaced by “OP” to denote an operation with \(\mathcal {A}\). Motivated by [15, 22], we set the test matrices A and C as displayed in Table 7. The matrices are derived from the SuiteSparse Matrix Collection [20], except for the tridiagonal matrix in problems #11 and #12. The tridiagonal matrix C = [cij] is defined by \(c_{i+1,i} := 11\), \(c_{ii} := -2\), and \(c_{i,i+1} := -9\) for each i (otherwise, \(c_{ij} := 0\)). The value s is determined by the size of C, and the maximum number of OPs was set to 2sn. All other computational conditions were similar to those in Section 5.2.
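A minimal sketch of this setup (our own construction from the stated entries; Aop denotes the operation counted as one OP):

```matlab
e = ones(s, 1);
C = spdiags([11*e, -2*e, -9*e], [-1, 0, 1], s, s);  % c_{i+1,i}=11, c_{ii}=-2, c_{i,i+1}=-9
Aop = @(X) A*X - X*C;                               % the Sylvester operator in (2); one OP per call
```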

Table 7 Characteristics of the test matrices for the Sylvester equation (2)

Figure 3 displays the convergence histories of the relative residual norms of the Gl-LTPMs for problems #9 and #11. The number of OPs and \(\log _{10}\) of the relative residual F-norms are plotted on the horizontal and vertical axes, respectively. Table 8 shows the number of OPs required for successful convergence (OPs), computation time (Time), and true relative residual norm \(\|B-\mathcal {A}(X_{k})\|_{F}/\|B\|_{F}\) (TRR) at the time of termination. The other notations are as defined in Table 5.

Fig. 3: Convergence histories of the relative residual norms of Gl-LTPMs for problems #9 (left) and #11 (right)

Table 8 Numbers of OPs, computation times, and true relative residual norms for Gl-LTPMs for the Sylvester equation (2)

From Fig. 3 and Table 8, we observe the following. Similar to the results in Section 5.2, Gl-GPBiCGstab(L) (especially with L = 4) often converges faster than the other Gl-LTPMs with respect to the number of OPs. The conventional Gl-GPBiCG-plus is efficient in terms of computation time. However, Gl-GPBiCG-plus does not converge for problem #11, whereas Gl-BiCGstab(L) and Gl-GPBiCGstab(L) do. Similar situations have occasionally occurred in our experience, and the proposed methods appear to be more robust than the conventional Gl-LTPMs.

We also note that the numerical results with the stopping criterion ∥Rk∥F/∥B∥F < 10− 12 are similar to the above, although Gl-BiCGstab(4) and Gl-GPBiCGstab(4) have a slightly larger residual gap for problems #9 and #10. We will not discuss the residual gap further in the present study, but will seek more refined algorithms of Gl-LTPMs based on [21] in the future.

6 Concluding remarks

We have proposed a novel global Lanczos-type product method, Gl-GPBiCGstab(L), for solving linear matrix equations. The original GPBiCGstab(L) can easily be extended naively to its global version, but the resulting method suffers from numerical instability when right preconditioning is applied for solving linear systems with multiple right-hand sides. We therefore reconstructed Gl-GPBiCGstab(L) using alternative recursion formulas to update the iteration matrices. The resulting refined algorithm with right preconditioning is more robust and computationally cheaper than the naive version. Moreover, the results of numerical experiments show that the refined Gl-GPBiCGstab(L) converges quickly and stably compared with other Gl-LTPMs for linear systems with multiple right-hand sides and for the Sylvester equation.

Based on the above results, there are two main prospects for the proposed approach. First, we can apply the proposed approach to more difficult classes of matrix equations, such as the general coupled matrix equations including the generalized Sylvester equation [11]. Second, the proposed approach can be carried over to block Krylov subspace methods, such as the block BiCGstab(L) [23] and block GPBiCG [24] methods. Because these block-type methods are effective, especially when solving linear systems with multiple right-hand sides, it is natural to develop a block version of GPBiCGstab(L), as noted in [6]. We have not elaborated on these points in the present work but expect to discuss them in future studies.