Abstract
With the recent realization of exascale performance by Oak Ridge National Laboratory’s Frontier supercomputer, reducing communication in kernels like QR factorization has become even more imperative. Low-synchronization Gram–Schmidt methods, first introduced in Świrydowicz et al. (Numer. Lin. Alg. Appl. 28(2):e2343, 2020), have been shown to improve the scalability of the Arnoldi method in high-performance distributed computing. Block versions of low-synchronization Gram–Schmidt show further potential for speeding up algorithms, as column-batching allows for maximizing cache usage with matrix–matrix operations. In this work, low-synchronization block Gram–Schmidt variants from Carson et al. (Linear Algebra Appl. 638:150–195, 2022) are transformed into block Arnoldi variants for use in block full orthogonalization methods (BFOM) and block generalized minimal residual methods (BGMRES). An adaptive restarting heuristic is developed to handle instabilities that arise with the increasing condition number of the Krylov basis. The performance, accuracy, and stability of these methods are assessed via a flexible benchmarking tool written in MATLAB. The modularity of the tool additionally permits generalized block inner products, like the global inner product.
1 Introduction and motivation
Oak Ridge National Laboratory reported in May 2022 that its Frontier supercomputer is the first machine to have achieved true exascale performance.^{Footnote 1} That is, for the first time ever, a supercomputer performed more than 1 exaflop (i.e., 10^{18} double-precision floating-point operations) in a single second. This astounding development is clear motivation for our work. Exascale computing is no longer a next-generation dream; it is reality, and the need for highly parallelized algorithms that take full advantage of exaflop computational potential while reducing global communication between nodes is urgent.
To this end we build on the low-synchronization (“low-sync”) Gram–Schmidt methods of Barlow [1], Świrydowicz et al. [2], Yamazaki et al. [3], Thomas et al. [4], and Bielich et al. [5], as well as our own earlier work with block versions of these methods [6, 7]. Gram–Schmidt methods are an essential backbone in orthogonalization routines like QR factorization and in iterative methods like Krylov subspace methods for linear systems, matrix functions, and matrix equations [8,9,10]. Block Krylov subspace methods in particular make better use of L3 cache via matrix–matrix operations and feature often in communication-avoiding Krylov subspace methods, such as s-step [11, 12], enlarged [13], and randomized [14] methods.
As in most realms of life, there is no such thing as a free lunch here. While low-sync variations have the potential to speed up highly parallelized implementations of Gram–Schmidt [3], they introduce new floating-point errors and thus potential instability, due to the reformulation of inner products and normalizations. Instability surfaces in the loss of orthogonality between basis vectors and can lead to breakdowns or wildly inaccurate approximations in downstream applications [15, 16]. Stability bounds for some low-sync variants have been established, but it often takes much longer to carry out a rigorous stability analysis than to derive and deploy new methods [1, 4, 6, 7]. It can also happen that a backward error bound is established and later challenged by an obscure edge case [17, 18]. With this tension in mind, we have not only extended low-sync variants of block Gram–Schmidt to block Arnoldi but also developed a benchmarking tool for the community to explore the efficiency, stability, and accuracy of these new algorithms, in a similar vein as the BlockStab^{Footnote 2} comparison tool developed in tandem with a recent block Gram–Schmidt survey [7]. We refer to this new tool as LowSyncBlockArnoldi^{Footnote 3} and encourage the reader to explore the tool in parallel with the text.
Established in this earlier work is the fact that block variants of low-sync Gram–Schmidt are less stable than their columnwise counterparts. However, when these skeletons are transferred to block Arnoldi and used to solve linear systems, we gain the option to restart the process. Restarting can be effective at mitigating stability issues in communication-avoiding algorithms [19, 20]. As long as each node redundantly computes residual or error estimates and checks the stability via local quantities, restarting does not introduce additional synchronization points. Furthermore, adaptive restarting allows for robustness, as we can use basic lookahead heuristics to foresee a breakdown and salvage progress without giving up completely at the first sign of trouble.
Given the modularity of our framework, we are also able to treat generalized block inner products, as described in [21, 22]. We focus in particular on the classical and global inner products.
The paper is organized as follows. In Section 2 we summarize terms, definitions, and concepts from high-performance computing (HPC), generalized block inner products, block Gram–Schmidt algorithms, and block Krylov subspace methods with static restarting. We present new low-synchronization block Arnoldi skeletons in Section 3 and derive an adaptive restarting heuristic in Section 4. Section 5 features a more in-depth discussion of the LowSyncBlockArnoldi benchmarking tool as well as examples demonstrating how to compare different block Arnoldi variants. We summarize our findings in Section 6.
2 Background
This work is a combination of the generalized inner product framework of Frommer, Lund, and Szyld [21, 22] and the skeleton-muscle framework for block Gram–Schmidt (BGS) by Carson, Lund, Rozložník, and Thomas [6, 7]. Throughout the text, we focus on solving a linear system with multiple right-hand sides

\(A \boldsymbol{X} = \boldsymbol{B}, \qquad \text{(1)}\)

where \(A \in {\mathbb {C}}^{n \times n}\) is large and sparse (i.e., with \(\mathcal {O}\left (n\right )\) nonzero entries) and \(\boldsymbol {B} \in {\mathbb {C}}^{n \times s}\) is a tall-skinny (i.e., s ≪ n) matrix.
We employ standard numerical linear algebra notation throughout. In particular, A^{∗} denotes the Hermitian transpose of A, \(\left \|\cdot \right \|\) refers to the Euclidean 2-norm, unless otherwise specified, and \({\widehat {e}_{k}}\) denotes the kth standard unit vector with the kth entry equal to 1 and all others 0.
In the following subsections, we define key concepts in HPC, block GramSchmidt methods, and block Krylov subspace methods.
2.1 Communication in high-performance computing
As floating-point operations have become faster and less energy-intensive, communication—the memory operations between levels of cache on a node or between parallelized processors on a network—has become a bottleneck in distributed computing. How expensive a memory operation is depends on the physical aspects of a specific system, specifically the latency, or the amount of time needed to pack and transmit a message, and the bandwidth, or how much information can be transmitted at a time. To improve the performance of bandwidth-limited algorithms like Krylov subspace methods, it is therefore advantageous to increase the computational intensity, or the ratio between floating-point and memory operations [23]. We pay particular attention to synchronization points (“sync points”), i.e., the steps in an algorithm that initiate a broadcast or reduce pattern to synchronize a quantity on all processors. Reducing calls to kernels with sync points is a straightforward way to improve computational intensity [24].
Sync points in Krylov subspace methods arise primarily in the orthonormalization procedure, such as Arnoldi or Lanczos, both of which are reformulations of the Gram–Schmidt method, a standard method for orthonormalizing a basis one (block) vector at a time. For large n, vectors are typically partitioned rowwise and distributed among processors, meaning that any time an operation like an inner product or normalization is performed—which is at least once per (block) vector in Gram–Schmidt—a sync point is inevitable.
Other possibly communicationintensive kernels include applications of the operator A^{Footnote 4} and applications of \(\boldsymbol {\mathcal {V}}_{m}\), an n × ms Krylov basis matrix. We count each operation separately from sync points (block inner products and vector norms) in LowSyncBlockArnoldi; see Section 5.
2.2 Generalized block inner products
A block vector is a tall-skinny matrix \(\boldsymbol {X} \in {\mathbb {C}}^{n \times s}\), and a block matrix is a matrix of s × s matrices, e.g.,
We use a mixture of Matlab and block-indexing notation to handle block objects. In particular, we write \(\boldsymbol {\mathcal {V}}_{k}\) to denote the first k block vectors of the block-partitioned matrix \(\boldsymbol {\mathcal {V}} = \begin {bmatrix} \boldsymbol {V}_{1} & \boldsymbol {V}_{2} & {\cdots } & \boldsymbol {V}_{m} \end {bmatrix}\) instead of \(\boldsymbol {\mathcal {V}}_{:,1:~ks}\) (i.e., the first ks columns). In a similar vein, s × s block entries of \({\mathscr{H}}\) are denoted as H_{j,k} instead of as H_{(j− 1)s+ 1:js,(k− 1)s+ 1:ks}. We denote block generalizations of the standard unit vectors \({\widehat {e}_{k}}\) as \(\widehat {\boldsymbol {E}}_{k} := {\widehat {e}_{k}} \otimes I_{s}\), where ⊗ is the Kronecker product and I_{s} the identity matrix of size s.
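To make the Kronecker-product notation concrete, here is a small NumPy sketch (Python rather than the toolbox’s MATLAB; the helper name is ours, not from the paper):

```python
import numpy as np

# Hypothetical helper: builds the block standard unit vector
# E_k = e_k (x) I_s from the notation above.
def block_unit_vector(k, m, s):
    e_k = np.zeros((m, 1))
    e_k[k - 1, 0] = 1.0             # k-th standard unit vector (1-based k)
    return np.kron(e_k, np.eye(s))  # (m*s) x s block unit vector

# Multiplying on the right by E_k extracts the k-th block column.
m, s = 3, 2
V = np.arange(24.0).reshape(4, m * s)   # a 4 x (ms) blocked matrix
E2 = block_unit_vector(2, m, s)
assert np.array_equal(V @ E2, V[:, s:2 * s])
```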
Blocking is a batching technique that can reduce the number of calls to the operator A applied to individual column vectors, maximize computational intensity by filling up the local cache with BLAS-3 operations, and reduce the total number of sync points by performing inner products and normalization en masse [25, 26]. In the context of Krylov subspaces, blocking can also lead to enriched subspaces by sharing information across column vectors instead of treating each right-hand side as an isolated problem. How much information is shared across columns depends on the choice of block inner product.
Let \({{\mathbb {S}}}\) be a ∗-subalgebra of \({{\mathbb {C}}}^{s \times s}\) with identity; i.e., \(I \in {{\mathbb {S}}}\), and whenever \( S,T \in {\mathbb {S}}\) and \(\alpha \in {\mathbb {C}}\), then \(\alpha S +T, ST, S^{*} \in {\mathbb {S}}\).
Definition 1
A mapping 〈〈⋅,⋅〉〉 from \({\mathbb {C}}^{n\times s} \times {\mathbb {C}}^{n \times s}\) to \({\mathbb {S}}\) is called a block inner product onto \({\mathbb {S}}\) if it satisfies the following conditions for all \(\boldsymbol {X},\boldsymbol {Y},\boldsymbol {Z} \in {\mathbb {C}}^{n \times s}\) and \(C \in {\mathbb {S}}\):

(i)
\({\mathbb {S}}\)linearity: \({\langle \langle \boldsymbol {X}+\boldsymbol {Y},\boldsymbol {Z} C } \rangle \rangle _{{\mathbb {S}}}= {\langle \langle \boldsymbol {X},\boldsymbol {Z} \rangle \rangle }_{{\mathbb {S}}} C + {\langle \langle \boldsymbol {Y},\boldsymbol {Z} \rangle \rangle }_{{\mathbb {S}}} C\);

(ii)
symmetry: \({\langle \langle \boldsymbol {X},\boldsymbol {Y} \rangle \rangle _{{\mathbb {S}}} = \langle \langle \boldsymbol {Y},\boldsymbol {X} \rangle \rangle }_{{\mathbb {S}}}^{*}\);

(iii)
definiteness: \({\langle \langle \boldsymbol {X},\boldsymbol {X} \rangle \rangle }_{{\mathbb {S}}}\) is positive definite if X has full rank, and \({\langle \langle \boldsymbol {X},\boldsymbol {X} \rangle \rangle }_{{\mathbb {S}}} = 0\) if and only if X = 0.
Definition 2
A mapping N which maps each full-rank \(\boldsymbol {X} \in {\mathbb {C}}^{n \times s}\) to a matrix \(N(\boldsymbol {X}) \in {\mathbb {S}}\) is called a scaling quotient if for all such X, there exists \(\boldsymbol {Y} \in {\mathbb {C}}^{n \times s}\) such that X = YN(X) and \({\langle \langle \boldsymbol {Y},\boldsymbol {Y} \rangle \rangle }_{{\mathbb {S}}} = I_{s}\).
The scaling quotient is closely related to the intra-orthogonalization routine discussed in Section 2.3. Block notions of orthogonality and normalization arise organically from Definitions 1 and 2.
Definition 3
Let \(\boldsymbol {X}, \boldsymbol {Y} \in {\mathbb {C}}^{n \times s}\) and \(\{\boldsymbol {X}_{j} \}_{j=1}^{m} \subset {\mathbb {C}}^{n \times s}\).

(i)
X,Y are block orthogonal if \({\langle \langle \boldsymbol {X},\boldsymbol {Y} \rangle \rangle }_{{\mathbb {S}}} = 0_{s}\).

(ii)
X is block normalized if N(X) = I_{s}.

(iii)
X_{1},…,X_{m} are block orthonormal if \({\langle \langle \boldsymbol {X}_{i},\boldsymbol {X}_{j} \rangle \rangle }_{{\mathbb {S}}} = \delta _{ij} I_{s}\).
A set of vectors \(\{\boldsymbol {X}_{j}\}_{j=1}^{m} \subset {\mathbb {C}}^{n \times s}\) block spans a space \({\mathscr{K}} \subseteq {\mathbb {C}}^{n \times s}\), and we write \({\mathscr{K}} = \text {span}^{\mathbb {S}}\{\boldsymbol {X}_{j}\}_{j=1}^{m}\), if

\({\mathscr{K}} = \left \{ {\sum }_{j=1}^{m} \boldsymbol {X}_{j} C_{j} : C_{j} \in {\mathbb {S}} \right \}.\)
The set \(\{\boldsymbol {X}_{j}\}_{j=1}^{m}\) constitutes a block orthonormal basis for \({\mathscr{K}} = \text {span}^{\mathbb {S}}\{\boldsymbol {X}_{j}\}_{j=1}^{m}\) if it is block orthonormal.
In this work, we consider only the classical and global block paradigms, described in Table 1. These paradigms represent the two extremes of information-sharing, with the classical approach maximizing information shared among columns and the global approach minimizing it; see, e.g., [22, Theorem 3.3]. Moreover, the global paradigm leads to a lower complexity per iteration in Krylov subspace methods, because what are matrix–matrix products in the classical paradigm get reduced to scaling operations in the global one. Many other paradigms are also possible; see, e.g., [27, 28].
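The two paradigms can be contrasted in a short NumPy sketch (ours, not from the paper; the 1/s scaling in the global product is one common convention and may differ from the convention used here):

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 50, 3
X = rng.standard_normal((n, s))
Y = rng.standard_normal((n, s))

# Classical paradigm: S = C^{s x s}; the block inner product is the
# full s x s Gram matrix, so all columns exchange information.
ip_classical = X.conj().T @ Y

# Global paradigm: S = {alpha * I_s}; the block inner product is a
# scaled trace, so the columns share only a single scalar.
ip_global = (np.trace(X.conj().T @ Y) / s) * np.eye(s)

# The global result lies in the one-dimensional subalgebra {alpha*I_s}.
assert ip_global[0, 1] == 0.0
assert ip_global[0, 0] == ip_global[1, 1]
```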
2.3 Block Gram–Schmidt
Block Gram–Schmidt (BGS) is a routine for orthonormalizing a set of block vectors \(\{\boldsymbol {X}_{j} \}_{j=1}^{m} \subset {\mathbb {C}}^{n \times s}\). Writing

\(\boldsymbol {\mathcal {X}} = \begin {bmatrix} \boldsymbol {X}_{1} & \boldsymbol {X}_{2} & {\cdots } & \boldsymbol {X}_{m} \end {bmatrix} \in {\mathbb {C}}^{n \times ms},\)
we define a BGS method as one that returns a block orthonormal \(\boldsymbol {\mathcal {Q}} \in {\mathbb {C}}^{n \times ms}\) and a block upper triangular \(\mathcal {R} \in {\mathbb {C}}^{ms \times ms}\) such that \(\boldsymbol {\mathcal {X}} = \boldsymbol {\mathcal {Q}} \mathcal {R}\). Important measures in the analysis of BGS methods are the condition number of \(\boldsymbol {\mathcal {X}}\),

\(\kappa (\boldsymbol {\mathcal {X}}) := \sigma _{\max }(\boldsymbol {\mathcal {X}}) / \sigma _{\min }(\boldsymbol {\mathcal {X}}),\)

i.e., the ratio between the largest and smallest singular values of \(\boldsymbol {\mathcal {X}}\), and the loss of orthogonality (LOO),

\(\left \| I_{ms} - {\langle \langle \boldsymbol {\mathcal {Q}},\boldsymbol {\mathcal {Q}} \rangle \rangle }_{{\mathbb {S}}} \right \|, \qquad \text{(3)}\)

where \({\langle \langle {\cdot ,\cdot } \rangle \rangle _{{\mathbb {S}}}}\) is a generalized inner product as described in Section 2.2.
When we discuss the stability of BGS methods, we refer to bounds on the loss of orthogonality in terms of machine precision, ε. We assume IEEE double precision here, so \(\varepsilon = \mathcal {O}\left (10^{-16}\right )\).
For categorizing BGS variants, we recycle the skeleton-muscle notation from [7, 12], where skeleton refers to the inter-orthogonalization routine between block vectors, and the muscle refers to the intra-orthogonalization routine between the columns of a single block vector. As a prototype, consider the Block Modified Gram–Schmidt (BMGS) skeleton, given by Algorithm 1. Here, IntraOrtho denotes a generic muscle that takes \(\boldsymbol {X} \in {\mathbb {C}}^{n \times s}\) and returns \(\boldsymbol {Q} \in {\mathbb {C}}^{n \times s}\) and \(R \in {\mathbb {C}}^{s \times s}\) such that \({\langle \langle \boldsymbol {Q},\boldsymbol {Q} \rangle \rangle } _{{\mathbb {S}}} = I_{s}\) and X = QR. For the classical paradigm, this could be any implementation of a QR factorization: a columnwise Gram–Schmidt routine, Householder QR (HouseQR), Cholesky QR (CholQR), etc. As for the global paradigm, there is only one possible muscle, given by the global scaling quotient, which effectively reduces to normalizing block vectors with a scaled Frobenius norm. Consequently, intra-orthogonalization does not actually occur in the global paradigm, as the columns of block vectors are not orthogonalized with respect to one another at all.
We regard a single call to either \({\langle \langle {\cdot , \cdot }\rangle \rangle _{\mathbb {S}}}\) or IntraOrtho as one sync point, which is only possible in practice if single-reduce algorithms like CholQR [29] or TSQR/AllReduceQR [30, 31] are employed for IntraOrtho.
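A minimal NumPy sketch of the skeleton-muscle pattern (ours, not the BlockStab or LowSyncBlockArnoldi implementation), pairing the BMGS skeleton with a CholQR muscle:

```python
import numpy as np

def cholqr(X):
    # Single-reduce IntraOrtho muscle: R = chol(X^* X), Q = X R^{-1}.
    R = np.linalg.cholesky(X.conj().T @ X).conj().T
    return X @ np.linalg.inv(R), R

def bmgs(X_blocks):
    # BMGS skeleton: one block inner product (sync point) per previous
    # block vector, then one IntraOrtho call for the diagonal block.
    m = len(X_blocks)
    s = X_blocks[0].shape[1]
    Q_blocks, R = [], np.zeros((m * s, m * s))
    for k, W in enumerate(X_blocks):
        W = W.copy()
        for j, Qj in enumerate(Q_blocks):
            Rjk = Qj.conj().T @ W           # inter-orthogonalization
            R[j*s:(j+1)*s, k*s:(k+1)*s] = Rjk
            W = W - Qj @ Rjk
        Qk, Rkk = cholqr(W)                 # intra-orthogonalization
        R[k*s:(k+1)*s, k*s:(k+1)*s] = Rkk
        Q_blocks.append(Qk)
    return Q_blocks, R
```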
2.4 Block Krylov subspace methods
The mth block Krylov subspace for A and B (with respect to \({\mathbb {S}}\)) is defined as

\({\mathscr{K}}^{{\mathbb {S}}}_{m}(A, \boldsymbol {B}) := \text {span}^{{\mathbb {S}}} \{\boldsymbol {B}, A\boldsymbol {B}, \ldots , A^{m-1}\boldsymbol {B}\}.\)
Block Arnoldi is often used to compute a basis for \({\mathscr{K}}^{{\mathbb {S}}}_{m}(A, \boldsymbol {B})\), and it is typically implemented with BMGS as the skeleton; see Algorithm 2. BMGS-Arnoldi accrues a high number of sync points due to the inner for-loop, where an increasing number of inner products is performed per block column.
Performing m steps of a block Arnoldi routine returns the block Arnoldi relation

\(A \boldsymbol {\mathcal {V}}_{m} = \boldsymbol {\mathcal {V}}_{m+1} {\mathscr{H}}_{m+1,m} = \boldsymbol {\mathcal {V}}_{m} {\mathscr{H}}_{m} + \boldsymbol {V}_{m+1} H_{m+1,m} \widehat {\boldsymbol {E}}_{m}^{*},\)

where \(\boldsymbol {\mathcal {V}}_{m}\) \({\mathbb {S}}\)-spans \({\mathscr{K}}^{{\mathbb {S}}}_{m}(A, \boldsymbol {B})\) and \({\mathscr{H}}_{m}\) denotes the ms × ms principal submatrix of \({\mathscr{H}}_{m+1,m}\).
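As an illustration, the following NumPy sketch (ours, not part of LowSyncBlockArnoldi) runs BMGS-based block Arnoldi with the classical inner product and a CholQR muscle, and verifies the block Arnoldi relation numerically:

```python
import numpy as np

def block_arnoldi_bmgs(A, B, m):
    # Sketch of Algorithm 2 (BMGS skeleton, classical inner product,
    # CholQR muscle): returns m+1 basis block vectors V and the
    # (m+1)s x ms block Hessenberg matrix H with A V_m = V_{m+1} H.
    n, s = B.shape

    def cholqr(X):
        R = np.linalg.cholesky(X.conj().T @ X).conj().T
        return X @ np.linalg.inv(R), R

    V = [cholqr(B)[0]]
    H = np.zeros(((m + 1) * s, m * s))
    for k in range(m):
        W = A @ V[k]                        # block "matvec"
        for j in range(k + 1):              # inner loop: one sync each
            Hjk = V[j].conj().T @ W
            H[j*s:(j+1)*s, k*s:(k+1)*s] = Hjk
            W = W - V[j] @ Hjk
        Q, Rkk = cholqr(W)                  # IntraOrtho: one more sync
        H[(k+1)*s:(k+2)*s, k*s:(k+1)*s] = Rkk
        V.append(Q)
    return V, H
```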
2.4.1 Block full orthogonalization methods with low-rank modifications
We define

\(\boldsymbol {X}_{m} := \boldsymbol {\mathcal {V}}_{m} \left ({\mathscr{H}}_{m} + {\mathscr{M}}\right )^{-1} \widehat {\boldsymbol {E}}_{1} B, \qquad \text{(6)}\)

where \(\widehat {\boldsymbol {E}}_{1} = \widehat {\boldsymbol {e}}_{1} \otimes I_{s}\) is a standard block unit vector, as the (modified) block full orthogonalization method (BFOM) for approximating (1). When \({\mathscr{M}} = 0\), we recover BFOM, which minimizes the error in the A-weighted \({\mathbb {S}}\)-norm for A Hermitian positive definite [21]. There are infinitely many choices for \({\mathscr{M}}\), but perhaps only a few useful ones, some of which are discussed in [22]. We will concern ourselves here with just \({\mathscr{M}} = {\mathscr{H}}_{m}^{-*} \big (\widehat {\boldsymbol {E}}_{m} H_{m+1,m}^{*} H_{m+1,m}\big ) \widehat {\boldsymbol {E}}_{m}^{*}\), which gives rise to a block generalized minimal residual method (BGMRES) [32,33,34]. As in [22], we implement BGMRES as a modified BFOM here, with an eye towards downstream applications like f(A)B where the BFOM form is explicitly needed. In practice, there may be computational savings with a less modular implementation; see, e.g., [35,36,37].
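To connect the pieces, here is a self-contained NumPy sketch of unmodified BFOM (\({\mathscr{M}} = 0\)); the solver layout and names are ours, not the paper’s. It checks the Galerkin property that the residual is block orthogonal to the basis:

```python
import numpy as np

rng = np.random.default_rng(3)
n, s, m = 24, 2, 5
A = rng.standard_normal((n, n)) + n * np.eye(n)   # safely nonsingular
B = rng.standard_normal((n, s))

def cholqr(X):
    R = np.linalg.cholesky(X.conj().T @ X).conj().T
    return X @ np.linalg.inv(R), R

# Block Arnoldi (BMGS skeleton) for K_m(A, B); note B = V_1 * Bfac.
V0, Bfac = cholqr(B)
V, H = [V0], np.zeros(((m + 1) * s, m * s))
for k in range(m):
    W = A @ V[k]
    for j in range(k + 1):
        Hjk = V[j].conj().T @ W
        H[j*s:(j+1)*s, k*s:(k+1)*s] = Hjk
        W = W - V[j] @ Hjk
    Q, Rkk = cholqr(W)
    H[(k+1)*s:(k+2)*s, k*s:(k+1)*s] = Rkk
    V.append(Q)

# Unmodified BFOM coefficients: Xi_m = H_m^{-1} (e_1 (x) I_s) Bfac.
Vm, Hm = np.hstack(V[:m]), H[:m*s, :m*s]
E1Bfac = np.zeros((m * s, s))
E1Bfac[:s] = Bfac
Xm = Vm @ np.linalg.solve(Hm, E1Bfac)

# Galerkin condition: residual block orthogonal to the Krylov basis.
Rm = B - A @ Xm
assert np.linalg.norm(Vm.conj().T @ Rm) < 1e-8
```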
2.4.2 Static restarting and cospatial factors
Restarting is a well-established technique for reconciling a growing basis with memory limitations. Define the residual of (6) as

\(\boldsymbol {R}_{m} := \boldsymbol {B} - A \boldsymbol {X}_{m}. \qquad \text{(7)}\)

The basic idea of restarts is to use R_{m} to build a new Krylov subspace, which we then use to approximate the error \(\boldsymbol {E}_{m} := A^{-1} \boldsymbol {B} - \boldsymbol {X}_{m}\), which solves AE = R_{m} in exact arithmetic. Building a new Krylov subspace from R_{m} directly is not a great idea, because it would require an extra computation with A. Furthermore, we need a cheap, accurate, and ideally locally computable way to approximate \(\left \|\boldsymbol {R}_{m}\right \|\) from one cycle to the next in order to monitor convergence. In [22] a static restarting method for low-rank modified BFOM is introduced that satisfies these requirements. By “static,” we mean the basis size m is fixed from one restart cycle to the next, in contrast to adaptive or dynamic restart cycle lengths. We restate [22, Theorem 4.1], which enables an efficient residual approximation and restarting procedure.
Theorem 2.1
Suppose \({\mathscr{M}} = \boldsymbol {M} \widehat {\boldsymbol {E}}_{m}^{*}\), where \(\boldsymbol {M} \in {\mathbb {C}}^{ms \times s}\) and \(\widehat {\boldsymbol {E}}_{m} = \widehat {\boldsymbol {e}}_{m} \otimes I_{s}\). Define \(\boldsymbol {U}_{m} := \boldsymbol {\mathcal {V}}_{m+1} \begin {bmatrix} \boldsymbol {M} \\ H_{m+1,m} \end {bmatrix}\) and let \(\boldsymbol {\Xi }_{m} := ({\mathscr{H}}_{m} + {\mathscr{M}})^{-1} \widehat {\boldsymbol {E}}_{1} B\) be the block coefficient vector for the approximation \(\boldsymbol {X}_{m} = \boldsymbol {\mathcal {V}}_{m} \boldsymbol {\Xi }_{m}\) (6) of the system (1). With R_{m} as in (7) it then holds that
We refer to the s × s matrix \(\widehat {\boldsymbol {E}}_{m}^{*} \boldsymbol {\Xi }_{m}\) as a cospatial factor, and (8) as the cospatial residual relation. The term cospatial refers to the fact that the columns of R_{m} and those of U_{m} span the same space. Moreover, in exact arithmetic, it is not hard to see that
and the right-hand term can be computed locally (and possibly redundantly on each processor) for m ≪ n.
If the approximate residual norm does not meet the desired tolerance, then we can compute the Arnoldi relation for \({\mathscr{K}}_{m}(A, \boldsymbol {U}_{m})\) to obtain \(\boldsymbol {\mathcal {V}}_{m+1}^{(2)}\), \({\mathscr{H}}_{m}^{(2)}\), \(H_{m+1,m}^{(2)}\), and B^{(2)}, where the superscript here and later denotes association to the restarted Krylov subspace. We then approximate E_{m} as
and update X_{m} as
The process is repeated, applying Theorem 2.1 iteratively, until the desired residual tolerance is reached.
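The cycle just described can be sketched in NumPy as follows; for clarity this toy driver restarts from the explicitly computed residual, whereas the cospatial relation lets the actual method avoid forming \(\boldsymbol {R}_{m}\) with an extra application of A. The names and the SPD test matrix are our own:

```python
import numpy as np

rng = np.random.default_rng(4)
n, s, m = 60, 2, 4
G = rng.standard_normal((n, n))
A = G.T @ G / n + 2 * np.eye(n)     # SPD with modest condition number
B = rng.standard_normal((n, s))

def cholqr(X):
    R = np.linalg.cholesky(X.conj().T @ X).conj().T
    return X @ np.linalg.inv(R), R

def bfom_cycle(A, B, m):
    # One cycle of unmodified BFOM (M = 0) applied to A E = B.
    s = B.shape[1]
    V0, Bfac = cholqr(B)
    V, H = [V0], np.zeros(((m + 1) * s, m * s))
    for k in range(m):
        W = A @ V[k]
        for j in range(k + 1):
            Hjk = V[j].conj().T @ W
            H[j*s:(j+1)*s, k*s:(k+1)*s] = Hjk
            W = W - V[j] @ Hjk
        Q, Rkk = cholqr(W)
        H[(k+1)*s:(k+2)*s, k*s:(k+1)*s] = Rkk
        V.append(Q)
    E1Bfac = np.zeros((m * s, s))
    E1Bfac[:s] = Bfac
    return np.hstack(V[:m]) @ np.linalg.solve(H[:m*s, :m*s], E1Bfac)

# Restart loop: approximate the error from the current residual,
# accumulate the correction, and repeat until the tolerance is met.
X = np.zeros((n, s))
for cycle in range(10):
    R = B - A @ X
    if np.linalg.norm(R) < 1e-10 * np.linalg.norm(B):
        break
    X = X + bfom_cycle(A, R, m)

assert np.linalg.norm(B - A @ X) < 1e-8 * np.linalg.norm(B)
```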
Remark 1
The analysis in [21, 22] is carried out in exact arithmetic. Therefore, when we replace Algorithm 2 with low-sync versions in Section 3, all the results summarized in this section still hold, because all block Gram–Schmidt variants generate the same QR factorization in exact arithmetic.
3 Low-synchronization variants of block Arnoldi
To distinguish between block Arnoldi variants, we default to the name of the underlying block Gram–Schmidt skeleton. We specify a configuration as ip-skel∘(musc): inner product, skeleton, and muscle, respectively. This naturally leads to a bit of an “alphabet soup,” for which we ask the reader’s patience, as it is crucial to precisely define algorithmic configurations for benchmarking. Please refer often to Table 2, which summarizes acronyms for all the Gram–Schmidt skeletons we consider in this text. Note that the coefficient in front of the number of sync points per cycle is often used to describe low-sync methods; e.g., BCGS-PIP is a “one-sync” method, while BMGS-SVL is a “three-sync” method.
Remark 2
The methods presented here are closely related to but not quite the same as the block methods used by Yamazaki et al. in [3], where BMGS, BCGS-PIP, and BCGSI+LS are employed as Gram–Schmidt skeletons in s-step Arnoldi (also known as communication-avoiding Arnoldi) [11, 12, 23], which is used to solve a linear system with a single right-hand side. Recall that we are solving (1), i.e., multiple right-hand sides simultaneously.
Remark 3
In the pseudocode for each algorithm, intermediate quantities like W and U are defined explicitly each iteration for readability. In general, we purposefully avoid redefining quantities in a given iteration and instead only set an output (i.e., entries in B, \(\boldsymbol {\mathcal {V}}_{m}\), or \({\mathscr{H}}_{m+1,m}\)) once all computations pertaining to that value are complete. This approach simplifies mathematical analysis. Exceptions include Algorithms 1 and 2, where W is redefined inside the for-loop as projected components are subtracted away from it. In practice, it is preferable to save storage by overwriting block vectors of \(\boldsymbol {\mathcal {V}}_{m}\) instead of allocating separate memory for W and U, for which there anyway may not be space.
3.1 BCGS-PIP and BCGS-PIO
A simple idea for reducing the number of sync points in BMGS is to condense the for-loop in lines 4–7 of Algorithm 2 into a single inner product and subtraction,

\({\mathscr{H}}_{1:k,k} = {\langle \langle \boldsymbol {\mathcal {V}}_{k}, \boldsymbol {W} \rangle \rangle }_{{\mathbb {S}}}, \quad \boldsymbol {W} \gets \boldsymbol {W} - \boldsymbol {\mathcal {V}}_{k} {\mathscr{H}}_{1:k,k}.\)

This exchange gives rise to what is commonly referred to as the block classical Gram–Schmidt (BCGS) method. It is, however, rather unstable, with a loss of orthogonality worse than \(\mathcal {O}\left (\varepsilon \right ) \kappa ^{2}([\boldsymbol {B} A\boldsymbol {\mathcal {V}}_{m}])\) [6]. By making a correction based on the block Pythagorean theorem (as derived in, e.g., [6, Section 2.1]), we can guarantee a loss of orthogonality bounded by \(\mathcal {O}\left (\varepsilon \right ) \kappa ^{2}([\boldsymbol {B} A\boldsymbol {\mathcal {V}}_{m}])\), as long as \(\mathcal {O}\left (\sqrt {\varepsilon }\right ) \kappa ([\boldsymbol {B} A\boldsymbol {\mathcal {V}}_{m}]) \leq 1\).

One version of the corrected algorithm is given as Algorithm 3. The acronym “PIP” stands for “Pythagorean (variant) with Inner Product,” due to how the factor H_{k+ 1,k} is computed. An alternative formulation, BCGS-PIO (where “PIO” stands for “Pythagorean with Intra-Orthogonalization”), is also possible and is given as Algorithm 4. Note that in line 5, we use \(\sim \) to denote that a full block vector need not be computed or stored here, just the 2s × 2s scaling quotient Ω. For subtle reasons, BCGS-PIO appears to be less reliable in practice (see Section 4).
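The core of the PIP correction can be sketched in a few lines of NumPy (our illustration of a single orthogonalization step with the classical inner product, not the paper’s code):

```python
import numpy as np

rng = np.random.default_rng(5)
n, s, k = 50, 2, 3
# V_k: an orthonormal basis of k block vectors (here via dense QR).
Vk = np.linalg.qr(rng.standard_normal((n, k * s)))[0]
W = rng.standard_normal((n, s))

# One BCGS-PIP step: a single block inner product with [V_k, W]
# returns the projection coefficients S and the Gram matrix P
# together, i.e., one sync point instead of two.
Z = np.hstack([Vk, W])
Gram = Z.conj().T @ W
S, P = Gram[:k * s, :], Gram[k * s:, :]

# Block Pythagorean correction: the Gram matrix of the projected
# vector is P - S^* S, so H_{k+1,k} comes from a small Cholesky
# factorization, which fails exactly when the basis has become too
# ill-conditioned (the flag exploited in Section 4).
Omega = P - S.conj().T @ S
Hk1k = np.linalg.cholesky(Omega).conj().T
Vnew = (W - Vk @ S) @ np.linalg.inv(Hk1k)

assert np.linalg.norm(Vnew.conj().T @ Vnew - np.eye(s)) < 1e-10
assert np.linalg.norm(Vk.conj().T @ Vnew) < 1e-10
```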
3.2 BMGS-SVL and BMGS-LTS
Barlow developed and analyzed one of the first stabilized low-sync Gram–Schmidt methods by using the Schreiber–Van Loan representation of products of Householder transformations [1, 38]. Under modest conditions, this method—which we denote here as BMGS-SVL—has loss of orthogonality like BMGS. Its success depends on tracking the loss of orthogonality via an auxiliary matrix \(\mathcal {T}\) (as defined in lines 1, 2, 6, and 9 of Algorithm 5) and using this matrix to make corrections each iteration. A closely related method is BMGS-LTS, which is identical to BMGS-SVL except that the \(\mathcal {T}\) matrix is formed via lower-triangular solves instead of matrix products. A column version of BMGS-LTS was first developed by Świrydowicz et al. [2] and generalized to blocks by Carson et al. [7]. Although BMGS-LTS appears to behave identically to BMGS-SVL in practice, a formal analysis for the former remains open. We present Arnoldi versions of BMGS-SVL and BMGS-LTS as Algorithm 5, with different colors highlighting the small differences between the methods. In both methods, the main inner product in line 4 is performed as in BCGS. Meanwhile \(\mathcal {T}\) acts as a kind of buffer, storing the loss of orthogonality per iteration, which is used in successive iterations to make small corrections to the computation in line 4. Balabanov and Grigori use a similar technique to stabilize randomized sketches of inner products, where instead of explicitly computing and storing \(\mathcal {T}\), they solve least squares problems to compute \({\mathscr{H}}_{1:k,k}\) [14, 39].
3.3 BMGS-CWY/BMGS-ICWY
A columnwise version of this algorithm was first presented by Świrydowicz et al. as [2, Algorithm 8]. To the best of our knowledge, we are the first to develop a blockwise formulation, which we refer to here as BMGS-CWY-Arnoldi, where CWY stands for “compact WY,” an alternative way to represent Householder transformations that was originally used to derive this algorithm. A related Arnoldi algorithm, not treated in either [2] or [4], is based on the inverse CWY (ICWY) form, and is given simultaneously with BMGS-CWY in Algorithm 6.
It is important to note that BMGS-CWY-Arnoldi would not reduce to [2, Algorithm 8] or [4, Algorithm 6] for s = 1, as we have one total sync point, due to the lack of a reorthonormalization step for V_{k}. Algorithm 6 was largely derived by transforming BMGS-CWY and BMGS-ICWY from [7] into a block Arnoldi routine. The most challenging part is tracking how the \(\mathcal {R}\) factor in the Gram–Schmidt formulation maps to \({\mathscr{H}}_{m+1,m}\) and determining where to scale by the off-diagonal entry H_{k,k− 1} each iteration. It is also possible to compute only with \(\mathcal {R}\) and reconstruct \({\mathscr{H}}_{m+1,m}\) after \(\boldsymbol {\mathcal {V}}_{m+1}\) is finished; this approach proved to be much less stable in practice, however, due to the growing condition number of \(\mathcal {R}\).
3.4 BCGSI+LS
One of the most intriguing of all the low-sync algorithms is DCGS2 [5], referred to as CGSI+LS in [7]. This algorithm is a reformulation of reorthogonalized CGS with a single sync point, derived by “delaying” normalization to the next iteration, where operations are batched in a kind of s-step approach (where s = 2). The columnwise version exhibits \(\mathcal {O}({\varepsilon })\) loss of orthogonality; a rigorous proof of the backward stability bounds remains open, however. The block version, BCGSI+LS, does not exhibit perfect \(\mathcal {O}({\varepsilon })\) LOO; see numerical results in [7].
Bielich et al. present a columnwise Arnoldi based on DCGS2 as Algorithm 4 in [5]. Our Algorithm 7 is a direct block generalization of this algorithm with slight reformulations to match the aesthetics of Algorithm 6 and principles of Remark 3. Note that, as in Algorithm 6, we are able to compute \({\mathscr{H}}_{m}\) directly, but we must track an auxiliary matrix J and scale several quantities by H_{k− 1,k− 2}. An alternative version of Algorithm 7 based more directly on BCGSI+LS from [7, Algorithm 7] is included in the code but not described here.
4 Adaptive restarting
Reproducibility and stability are not mutually exclusive. This realization is precisely the motivation for an adaptive restarting routine and can be demonstrated by a simple example.
Consider the tridiag test case from Section 5.1 with n = 100. Notably, both A and B are deterministic quantities; neither is defined with random elements. In Matlab, it is possible to specify the number of threads on which a script is executed via the built-in maxNumCompThreads function.^{Footnote 5} We solve AX = B with Algorithms 3 and 4 while varying the multithreading setting from 1 to 16 on a standard node of the Mechthild cluster; see the beginning of Section 5 for more details about the cluster. For both algorithms, we employ a variant of Matlab’s Cholesky routine chol, which stores a flag when chol determines a matrix is too ill-conditioned to be factorized. This flag is fed to the linear solver driver of LowSyncBlockArnoldi (bfom), which halts the process when the flag is true. Throughout the following discussion, we refer to this flag as the “NaN-flag,” because ignoring it leads to computations with ill-defined quantities.
Figure 1 displays the loss of orthogonality (3) and \(\kappa ([\boldsymbol {B} A \boldsymbol {\mathcal {V}}_{k}])\) for different thread counts. The condition numbers for all thread counts and both methods are hardly affected, except for some slight deviation for BCGS-PIP and 16 threads. The LOO plots are more telling: for both methods, changing the thread count directly affects the LOO and how many iterations the method can compute before encountering a NaN-flag. We allowed for a maximum basis size of m = 50, but no method can compute that far. BCGS-PIO with 8 threads gives up first at 16 iterations; BCGS-PIP with 1 and 4 threads makes it all the way to 35 iterations. Among the BCGS-PIO methods, there are orders of magnitude differences between the attained LOO.
This situation is perplexing on the surface: the problem is static, and the same code has been run every time. The only variable is the thread count.
There are two subtle issues that affect reproducibility in this case: 1) the configuration of math kernel libraries according to the parameters of the operating system and hardware,^{Footnote 6} and 2) guaranteed stability bounds. As for stability bounds, it is important to note that both BCGS-PIO and BCGS-PIP have a complete backward stability analysis [6]. Both methods have \(\mathcal {O}\left (\varepsilon \right ) \kappa ^{2}([\boldsymbol {B} A \boldsymbol {\mathcal {V}}_{k}])\) loss of orthogonality, as long as \(\kappa ([\boldsymbol {B} A \boldsymbol {\mathcal {V}}_{k}]) \leq \mathcal {O}\left (\frac {1}{\sqrt {\varepsilon }}\right ) = \mathcal {O}\left (10^{8}\right )\) and as long as the IntraOrtho for BCGS-PIO behaves no worse than CholQR. (For this test, we used HouseQR, Matlab’s built-in qr routine, which is unconditionally stable and therefore behaves better than CholQR [15].) For both methods, \(\kappa ([\boldsymbol {B} A \boldsymbol {\mathcal {V}}_{k}])\) exceeds \(\mathcal {O}\left (10^{8}\right )\) around iteration 15. At that point, the assumptions for the LOO bounds are no longer satisfied. The fact that either algorithm continues to compute something useful after that point is a lucky accident.
Computing \(\kappa ([\boldsymbol {B} A \boldsymbol {\mathcal {V}}_{k}])\) every iteration to check whether the LOO bounds are satisfied is not practical. We therefore propose a simple adaptive restarting regime based on whether chol raises a NaN-flag, which happens whenever chol is fed a numerically non-positive-definite matrix. When a NaN-flag is raised, we give up computing a new basis vector and go back to the last safely computed basis vector, which is then used to restart. Simultaneously, the maximum basis size m is also reduced. It is possible that an algorithm exhausts its maximum allowed restarts and basis size before converging; indeed, we have observed this often for BCGS-PIP in examples not reported here. At the same time, there are many scenarios in which restarting is an adequate band-aid, thus allowing computationally cheap, one-sync algorithms like BCGS-PIP to salvage progress and converge, oftentimes faster than competitors. See Section 5 for demonstrations.
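A minimal NumPy sketch of the flag mechanism (ours; np.linalg.cholesky plays the role of Matlab’s chol, and the adaptive driver policy is only described in comments):

```python
import numpy as np

def cholqr_flagged(X):
    # Mirrors the "NaN-flag": LAPACK's Cholesky fails (and NumPy raises
    # LinAlgError) when the Gram matrix is numerically non-positive
    # definite, which here signals a dangerously ill-conditioned basis.
    try:
        R = np.linalg.cholesky(X.conj().T @ X).conj().T
        return X @ np.linalg.inv(R), R, False
    except np.linalg.LinAlgError:
        return None, None, True

# Numerically rank-deficient block vector: the flag is raised, and an
# adaptive driver would discard this block vector, restart from the
# last safely computed one, and shrink the maximum basis size m.
_, _, flag_bad = cholqr_flagged(np.ones((100, 3)))

# Well-conditioned block vector: no flag, the cycle continues.
rng = np.random.default_rng(6)
_, _, flag_good = cholqr_flagged(rng.standard_normal((100, 3)))

assert flag_bad and not flag_good
```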
Remark 4
The restarted framework outlined in Section 2.4.2 does not change fundamentally with adaptive cycle lengths; only the notation becomes more complicated. We omit the details here.
5 Numerical benchmarks
Our treatment of BGS and block Krylov methods is hardly exhaustive. It is not our goal to determine the optimal block Arnoldi configuration at this stage, but rather to demonstrate the functionality of a benchmarking tool for the fair comparison of possible configurations on different problems. To this end, we restrict ourselves to the options below:

- inner products: cl (classical), gl (global)
- skeletons: Table 2
- muscles: CholQR, which has \(\mathcal {O}\left (\varepsilon \right ) \kappa ^{2}\) loss of orthogonality guaranteed only for \(\mathcal {O}\left (\varepsilon \right ) \kappa ^{2} < 1\), but is a simple, single-reduce algorithm. In practice, we would recommend TSQR/AllReduceQR [30, 31], which has \(\mathcal {O}({\varepsilon })\) loss of orthogonality and the same number of sync points, but is difficult to program in Matlab due to limited parallelization and message-passing features. Other low-sync muscles are programmed in LowSyncBlockArnoldi as well, and the user can easily integrate their own. Note that BCGSPIP does not require a muscle, and BMGSCWY, BMGSICWY, and BCGSI+LS only call a muscle once, in the first iteration of a new basis. BMGSSVL and BMGSLTS are forced to use their column-wise counterparts MGSSVL and MGSLTS (both 3-sync), respectively, and global methods are forced to use the global muscle (i.e., normalization without intra-orthogonalization via the scaled Frobenius norm).
- modification: none (FOM), harmonic (GMRES)
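For reference, the classical and global inner products and the global "muscle" mentioned above can be sketched as follows. This is illustrative Python/NumPy, not code from the package, and the 1/s scaling is an assumption for illustration, as conventions for the global inner product vary.

```python
import numpy as np

def cl_inner_prod(X, Y):
    # Classical block inner product: an s x s matrix, one all-reduce.
    return X.T @ Y

def gl_inner_prod(X, Y):
    # Global block inner product: a scaled trace, i.e., a scalar multiple
    # of the identity (the 1/s scaling here is an assumed convention).
    s = X.shape[1]
    return (np.trace(X.T @ Y) / s) * np.eye(s)

def gl_muscle(X):
    # Global "muscle": normalization by the scaled Frobenius norm,
    # with no intra-orthogonalization of the columns.
    s = X.shape[1]
    nrm = np.linalg.norm(X, 'fro') / np.sqrt(s)
    return X / nrm, nrm * np.eye(s)
```

Note that the global path never forms (or factors) a full s × s Gram matrix; only a single scalar per block vector is communicated, which is the source of its lower cost per iteration.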
All results are generated by the LowSyncBlockArnoldi Matlab package. A single script (paper_script.m) comprises all the calls for generating the results in this manuscript. LowSyncBlockArnoldi is written as modularly as possible, to facilitate the exchange of inner products, skeletons, muscles, and modifications. While the timings reported certainly do not reflect the optimal performance for any of the methods, they do reflect a fair comparison across implementations and provide insights for possible speedups when these methods are ported to more complex architectures. The code is also written so that sync points (inner_prod and intra_ortho) and other potentially communication-intensive operations (matvec and basis_eval) are separate functions that can be tuned individually.
Every test script (including the example from Section 4) has been executed in Matlab R2019b on 16 threads of a single, standard node of Linux Cluster Mechthild at the Max Planck Institute for Dynamics of Complex Technical Systems in Magdeburg, Germany.^{Footnote 7} A standard node comprises 2 Intel Xeon Silver 4110 (Skylake) CPUs with 8 cores each (64KB L1 cache, 1024KB L2 cache), a clock rate of 2.1 GHz (3.0 GHz max), and 12MB shared L3 cache each. We further focus on small problems that easily fit in the L3 cache, which is easy to guarantee with sparse A, n ≤ 10^{4}, and s ≤ 10. Given that the latency between CPUs on a single node is small relative to exascale machines, we expect small improvements observed in these test cases to translate to bigger gains in a more complex setting.
For the timings, we measure the total time spent to reach a specified error tolerance. We run each test 5 times and average over the timings. We also calculate several intermediate measures, namely counts for A-calls, applications of \(\boldsymbol {\mathcal {V}}_{k}\), and sync points. In addition, we plot the convergence history in terms of the following quantities per iteration: relative residual, relative error, \(\kappa ([\boldsymbol {B} A \boldsymbol {\mathcal {V}}_{k}])\), and loss of orthogonality (LOO) (3). When a ground-truth solution X_{∗} is provided, the relative error is calculated as \(\|\boldsymbol{X}_k - \boldsymbol{X}_*\|_F / \|\boldsymbol{X}_*\|_F\).
For all our examples, X_{∗} is computed by Matlab's built-in backslash operator. The residual is approximated by (9) and is scaled by \(\|\boldsymbol{B}\|_F\). A summary of the parameters for all benchmarks can be found in Table 3. Except for tridiag and lapl_2d, all examples are taken from the SuiteSparse Matrix Collection [40]. Via the suite_sparse.m script, it is possible to run tests on any benchmark from this collection.
5.1 tridiag
The operator A is defined as a sparse, tridiagonal matrix with 1 on the off-diagonals and −1, −2, …, −n on the diagonal, where n is also the size of A. Clearly A is symmetric. The right-hand side B has two columns: the first has identical entries \(\frac {1}{\sqrt {n}}\), and the second has entries 1, 2, …, n. This example is procedural, in the sense that a user can choose a desired n. At the same time, a larger n necessarily leads to a worse condition number.
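The setup above can be reproduced with the following sketch, using Python/SciPy as a stand-in for the Matlab construction; the function name is illustrative.

```python
import numpy as np
from scipy import sparse

def tridiag_problem(n):
    # Sparse tridiagonal A with 1 on the off-diagonals and -1, -2, ..., -n
    # on the diagonal; symmetric by construction.
    diag = -np.arange(1, n + 1, dtype=float)
    off = np.ones(n - 1)
    A = sparse.diags([off, diag, off], [-1, 0, 1], format='csr')
    # Block right-hand side with s = 2 columns: a normalized constant
    # column and the column 1, 2, ..., n.
    B = np.column_stack([np.full(n, 1.0 / np.sqrt(n)),
                         np.arange(1, n + 1, dtype=float)])
    return A, B
```

Since the diagonal entries range from −1 to −n, the conditioning of A degrades as n grows, which is the trade-off noted above.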
Figure 2 presents the total run time per configuration as well as operator counts as a bar chart; see Table 4 in the Appendix for more details. The fastest methods are the stabilized low-sync variants. Despite being the computationally cheapest classical method per iteration, clBCGSPIP is notably slower than clBMGS, because its inherent instability requires restarting 3 times (and therefore additional applications of A and \(\boldsymbol {\mathcal {V}}_{k}\)) before converging. The method with the fewest \(\boldsymbol {\mathcal {V}}_{k}\) evaluations is clBMGS, which is to be expected, since the basis is split up and applied one block column at a time in the innermost loop; see Algorithm 1.
The fastest global method, glBCGSPIP, is significantly slower even than the slowest classical method. In fact, all global methods require over 6 times as many total iterations as the fastest classical method to converge; this is in line with the theory of Section 2.4. In this particular case, the floating-point savings per iteration do not outweigh the sheer amount of time needed for all the extra A-calls. Nevertheless, the one-sync global methods (glBCGSPIP, glBMGSCWY, glBMGSICWY, and glBCGSI+LS) have relatively low sync counts, compared even to clBMGS.
Figures 3 and 4 display convergence histories for a subset of the methods in Table 4 in the Appendix. The convergence histories for all global BMGS variants are very similar; we omit BMGSSVL and BMGSCWY, as they are visually identical to BMGSLTS and BMGSICWY, respectively. BMGS is identical to BMGSSVL and BMGSLTS and is therefore also omitted.
Both the classical and global variants of BCGSPIP show the robustness of the adaptive restarting procedure in action. In the global case, the LOO exceeds \(\mathcal {O}\left (10^{10}\right )\), and it reaches \(\mathcal {O}\left (1\right )\) in clBCGSPIP. Despite the loss of orthogonality, restarting allows the methods to recover and eventually converge. All other low-sync variants remain stable, only restarting once the basis size limit of m = 70 has been reached. Although hardly perceptible, BMGSICWY does have a slightly worse LOO than that of BMGSLTS, which can be seen by zooming in on the last few iterations of the global plots in Fig. 3 or of the classical plots in Fig. 4.
We also note that the residual estimate (7) for all methods follows the same qualitative trend as that of the error. In the worst case, clBCGSPIP, the residual is nearly 3 orders of magnitude lower than the error in some places, which could lead to premature convergence. For all other methods, the difference is between 1 and 2 orders of magnitude. We would thus recommend setting the residual tolerance a couple of orders of magnitude lower in practice, to ensure that the true error is sufficiently small.
5.2 1138_bus
Now we turn to a slightly more complicated matrix. The matrix A comes from a power network problem and is real and symmetric positive definite, while the entries of B are drawn randomly from the uniform distribution. Moreover, we apply an incomplete LU (ILU) preconditioner with no fill, using Matlab's built-in ilu.
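A minimal sketch of this kind of setup follows, using SciPy's spilu as a rough stand-in for Matlab's ilu with no fill. The two are not identical (spilu is a thresholded ILU rather than exact ILU(0)), and the matrix below is a placeholder, not 1138_bus.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spilu, LinearOperator, gmres

# Placeholder symmetric, diagonally dominant test matrix; in the
# benchmark, A would be 1138_bus and B a random block right-hand side.
n = 100
A = sparse.diags([np.ones(n - 1), 4.0 * np.ones(n), np.ones(n - 1)],
                 [-1, 0, 1], format='csc')
b = np.ones(n)

# Incomplete LU preconditioner, applied as a LinearOperator M ~= A^{-1}.
ilu = spilu(A)
M = LinearOperator((n, n), matvec=ilu.solve)

# Preconditioned (unblocked) GMRES; info == 0 indicates convergence.
x, info = gmres(A, b, M=M)
print(info, np.linalg.norm(A @ x - b))
```

For a tridiagonal matrix the incomplete factorization generates essentially no fill, so the preconditioner is nearly exact and GMRES converges in very few iterations; for 1138_bus the ILU is only approximate, which is why the choice of preconditioner matters so much in the results below.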
Even with the preconditioner, none of the global methods converges. We adjusted the thread count to see if it would aid convergence, to no avail. This is perhaps an extreme case of [22, Theorem 3.3], wherein the global method is much less accurate than the classical method in the first cycle and cannot manage to catch up even after restarting. A preconditioner better attuned to the structure of the problem may alleviate stagnation for global methods, but we do not explore this here.
In Fig. 5 we see the performance results for the convergent classical methods; more details can be found in Table 5 in the Appendix. Most notably, the one-sync methods BMGSCWY, BMGSICWY, and BCGSI+LS improve over BMGS only slightly in terms of timings. BCGSPIP is much slower, due to a quick loss of orthogonality and the need to restart more often. However, it is clear that sync counts for all one-sync methods are drastically reduced compared to that of BMGS.
We examine the convergence histories of clBCGSPIP and clBMGSICWY more closely in Fig. 6. Although not discernible on the graph, we found that clBCGSPIP restarts every 28 iterations, meaning that in the first cycle it encountered a NaN flag and reduced the maximum basis size to m = 28 for all subsequent cycles. Instability in the first cycle thus hinders clBCGSPIP greatly. On the other hand, BMGSICWY (as well as the other variants) is stable enough to exhaust the entire basis size allowance, which allows for further error reduction in the first cycle.
5.3 circuit_2
The next example comes from a circuit simulation problem. The matrix A is real but not symmetric or positive definite. We again apply an ILU preconditioner with no fill.
All the one-sync classical and global methods converge, and their performance data is presented in Fig. 7, with further details in Table 6 in the Appendix. In fact, some global methods, like glBCGSPIP, are even faster than some classical methods, because they require the same number of iterations to converge and therefore perform fewer floating-point operations.
Figure 8 demonstrates how close in accuracy the global and classical BCGSPIP variants are for this problem. The global method even has a slightly better LOO, but it should be noted that global LOO is measured according to a different inner product than classical LOO; see Section 2.2 and (3).
5.4 rajat03
Another circuit simulation problem highlights slightly different behavior. In this case, A is again real but neither symmetric nor positive definite, and we again use an ILU preconditioner with no fill.
Figure 9 summarizes the performance results, with details given in Table 7 in the Appendix. It should be noted right away that clBCGSPIP fails to converge for this problem, while glBCGSPIP converges and takes second place in terms of the timings. More specifically, clBCGSPIP encounters a NaN flag it cannot resolve: every time it reduces the basis size, it hits a NaN flag again. Global methods, however, do not use Cholesky at all, so non-positive-definite factors do not pose a problem unless their trace is numerically zero, which occurs with very low probability. Otherwise, clBMGSCWY shows a small improvement over clBMGS.
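The following toy sketch (illustrative Python, not code from the package) shows why: an exactly rank-deficient block vector defeats Cholesky on the s × s Gram matrix, while the trace used by the global normalization remains safely positive. The 1/s scaling is an assumed convention.

```python
import numpy as np

# A block vector with exactly dependent columns yields a singular,
# numerically non-positive-definite Gram matrix.
rng = np.random.default_rng(0)
v = rng.standard_normal(50)
V = np.column_stack([v, v])
G = V.T @ V

# Classical path: Cholesky on the s x s Gram matrix fails outright.
try:
    np.linalg.cholesky(G)
    chol_ok = True
except np.linalg.LinAlgError:
    chol_ok = False

# Global path: only the trace of G is needed, which stays positive
# whenever V is nonzero.
s = V.shape[1]
gl_norm = np.sqrt(np.trace(G) / s)    # scaled Frobenius norm of V
print(chol_ok, gl_norm > 0)
```

This is exactly the failure mode behind the unresolvable NaN flag above: shrinking the basis does not help clBCGSPIP if the offending block stays numerically rank-deficient, whereas the global normalization never sees the problem.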
Table 7 in the Appendix confirms that none of the methods requires restarting despite how high the condition number becomes in later iterations; see also Fig. 10. It is again interesting to see how close the error and residual plots are between the global and classical methods. In fact, the residual for the global method underestimates the error by a couple of orders of magnitude.
5.5 Kaufhold
This example treats a nearly numerically singular matrix with an extremely high condition number. Notably, the norm of A is nearly \(\mathcal {O}\left (10^{15}\right )\). The matrix is real but neither symmetric nor positive definite, and it was designed to trigger a bug in Gaussian elimination in a 2002 version of Matlab. We again apply an ILU preconditioner with no fill.
Figure 11 shows clBCGSPIP to be the fastest of the classical one-sync methods, but the improvement over clBMGS is small. The global methods are all much slower. A look at the convergence histories in Fig. 12 shows a stubborn error curve despite significant progress in the initial iterations. For both BCGSPIP methods, the LOO is moderately high in the first cycle, matching the high condition numbers, but the situation is not bad enough to trigger a NaN flag, and the LOO drops after restarting.
5.6 t2d_q9
We now examine a nonlinear diffusion problem, specifically a biquadratic mesh of a temperature field. The matrix A is real but not symmetric or positive definite, and we again use an ILU preconditioner with no fill.
Figure 13 shows that both BCGSPIP variants are the fastest overall, with clBMGS in second-to-last place; see Table 9 in the Appendix for more details. Interestingly, even glBMGS is faster than clBMGS in this scenario.
Both BCGSI+LS variants are rather slow in this example. Despite having just one sync per iteration, BCGSI+LS generally has a higher complexity than the other one-sync methods, which manifests here as a disadvantage.
The convergence behavior for the BCGSPIP variants is given in Fig. 14. Here we see that, despite the global condition number varying more than the classical one, the global LOO is overall much lower. This phenomenon is not unique to this example; it just happens to be more noticeable here.
5.7 lapl_2d
Our last problem is taken directly from [21, Section 5.4], a discretized twodimensional Laplacian matrix. A is thus banded, real, and symmetric positive definite. We do not apply a preconditioner and look at all skeletons considered in the text.
Figure 15 shows the performance results; more details can be found in Table 4 in the Appendix. All one-sync classical methods except for clBCGSI+LS beat clBMGS, along with a number of global methods. The slowest classical methods are the three-sync ones, and some one-sync global methods follow behind. The fastest method, clBCGSPIP, also happens to have the highest A count and number of applications of \(\boldsymbol {\mathcal {V}}_{k}\), due to its high number of restarts. Both clBMGSCWY and clBMGSICWY, however, have fewer sync, A, and \(\boldsymbol {\mathcal {V}}_{k}\) counts, and are very close in terms of timings.
The methods with the highest sync counts are clBMGSSVL and clBMGSLTS. The reason is that they cannot use CholQR as a muscle,^{Footnote 8} and this problem requires many iterations to converge. LowSyncBlockArnoldi is written to count sync points within the muscles as well, and with MGSSVL and MGSLTS each contributing 1 + 3s per call, the total number of sync points eventually passes that of clBMGS, which can use a communication-light muscle like CholQR.
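The sync-point bookkeeping described above amounts to simple arithmetic; the following sketch illustrates it with hypothetical per-iteration counts (the skeleton's sync count of 2 is an assumption for illustration, not a figure from the paper).

```python
def total_syncs(n_iter, skeleton_syncs, muscle_calls, muscle_syncs):
    # Sync points for a run: skeleton sync points plus sync points
    # incurred inside the muscle, counted per iteration as in
    # LowSyncBlockArnoldi.
    return n_iter * (skeleton_syncs + muscle_calls * muscle_syncs)

s = 10  # block size (hypothetical)
# A BMGSSVL-style method with the MGSSVL muscle: 1 + 3s syncs per call.
svl = total_syncs(n_iter=100, skeleton_syncs=2, muscle_calls=1,
                  muscle_syncs=1 + 3 * s)
# BMGS with a communication-light muscle like CholQR: 1 sync per call.
bmgs = total_syncs(n_iter=100, skeleton_syncs=2, muscle_calls=1,
                   muscle_syncs=1)
print(svl, bmgs)
```

Because the muscle term scales with the block size s, a 1 + 3s muscle dominates the total as soon as the iteration count is large, which is exactly the effect observed for clBMGSSVL and clBMGSLTS on this problem.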
6 Conclusions and outlook
Stability bounds and floating-point analysis are challenging to work out rigorously, and it is therefore simultaneously important to search for counterexamples and edge cases while trying to prove conjectured bounds. In general, rigorous loss of orthogonality and backward error bounds for all these methods could lead to new insights and improvements in the quest for a reliable, scalable Krylov subspace solver. Our flexible benchmarking tool can aid in that process, and it can easily be extended to accommodate new algorithm configurations, test cases, and measures.
At the same time, low-sync block Arnoldi algorithms with adaptive restarting are clearly already useful and robust enough for a wide variety of problems, especially where A is reasonably conditioned and memory limitations cap basis sizes. In every benchmark, we observed that at least one low-sync method outperformed both the classical and global BMGS-based Arnoldi methods. More research is needed to determine which low-sync skeletons are best for which problems and architectures, particularly via computational models that account not only for operation counts but also for performance variations relative to block size [26, 41, 42]. Most likely the best configuration allows for switching between skeletons and muscles depending on convergence behavior.
For scenarios where the basic adaptive restarting procedure is not sufficient to rescue convergence, it might be possible to improve the heuristics with a cheap estimate of the loss of orthogonality, computed via, e.g., a randomized sketched inner product [39]. With such a cheap estimate, we could not only decrease the basis size when there are problems but also increase it again in later cycles. Randomized algorithms themselves are known to reduce communication, and a thorough comparison and combination of the methods proposed here and in [39] could lead to a powerful Krylov subspace method well suited for exascale architectures.
Global methods are unfortunately less promising. They are almost always slower than even the slowest classical method, due to requiring more cycles, and thus more operator calls and sync points, to converge. However, the benchmarks do suggest that, in cases with a good preconditioner known to guarantee convergence in a few iterations, global methods may become competitive again, especially in single-node or “laptop” applications, where their reduced computational intensity per iteration is favorable.
Notes
http://www.top500.org/news/ornls-frontier-first-to-break-the-exaflop-ceiling/. Accessed 8 August 2022.
The term matvec is often used to refer to the multiplication of A with a vector. Because we will be focusing on block vectors, we refrain from this term to avoid confusion.
https://mathworks.com/help/matlab/ref/maxnumcompthreads.html. Accessed 8 August 2022.
https://www.mpi-magdeburg.mpg.de/cluster/mechthild. Accessed 8 August 2022.
References
Barlow, J.L.: Block modified Gram-Schmidt algorithms and their analysis. SIAM J. Matrix Anal. Appl. 40(4), 1257–1290 (2019). https://doi.org/10.1137/18M1197400
Świrydowicz, K., Langou, J., Ananthan, S., Yang, U., Thomas, S.: Low synchronization Gram-Schmidt and generalized minimum residual algorithms. Numer. Lin. Alg. Appl. 28(2), e2343 (2020). https://doi.org/10.1002/nla.2343
Yamazaki, I., Thomas, S., Hoemmen, M., Boman, E.G., Świrydowicz, K., Elliott, J.J.: Low-synchronization orthogonalization schemes for s-step and pipelined Krylov solvers in Trilinos. In: Proceedings of the 2020 SIAM Conference on Parallel Processing for Scientific Computing (PP), pp. 118–128 (2020). https://doi.org/10.1137/1.9781611976137.11
Thomas, S., Carson, E., Rozložník, M., Carr, A., Świrydowicz, K.: Iterated Gauss-Seidel GMRES. arXiv:2205.07805v2 (2022)
Bielich, D., Langou, J., Thomas, S., Świrydowicz, K., Yamazaki, I., Boman, E.G.: Low-synch Gram-Schmidt with delayed reorthogonalization for Krylov solvers. Parallel Comput. 112, 102940 (2022). https://doi.org/10.1016/j.parco.2022.102940
Carson, E., Lund, K., Rozložník, M.: The stability of block variants of classical Gram-Schmidt. SIAM J. Matrix Anal. Appl. 42(3), 1365–1380 (2021). https://doi.org/10.1137/21M1394424
Carson, E., Lund, K., Rozložník, M., Thomas, S.: Block Gram-Schmidt algorithms and their stability properties. Linear Algebra Appl. 638, 150–195 (2022). https://doi.org/10.1016/j.laa.2021.12.017
Saad, Y.: Iterative methods for sparse linear systems, 2nd edn., p. 528. SIAM. https://doi.org/10.1137/1.9780898718003 (2003)
Güttel, S.: Rational Krylov approximation of matrix functions: numerical methods and optimal pole selection. GAMMMitteilungen 36(1), 8–31 (2013). https://doi.org/10.1002/gamm.201310002
Simoncini, V.: Analysis of the rational Krylov subspace projection method for largescale algebraic Riccati equations. SIAM J. Matrix Anal. Appl. 37 (4), 1655–1674 (2016). https://doi.org/10.1137/16M1059382
Carson, E.: Communication-Avoiding Krylov Subspace Methods in Theory and Practice. Ph.D. Thesis, Department of Computer Science, University of California, Berkeley (2015). http://escholarship.org/uc/item/6r91c407
Hoemmen, M.: Communication-avoiding Krylov subspace methods. Ph.D. Thesis, Department of Computer Science, University of California, Berkeley. http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-37.pdf (2010)
Grigori, L., Moufawad, S., Nataf, F.: Enlarged Krylov subspace conjugate gradient methods for reducing communication. SIAM J. Matrix Anal. Appl. 37(2), 744–773 (2016). https://doi.org/10.1137/140989492
Balabanov, O., Grigori, L.: Randomized block Gram-Schmidt process for solution of linear systems and eigenvalue problems. arXiv:2111.14641 (2021)
Higham, N.J.: Accuracy and stability of numerical algorithms, 2nd edn. Appl. Math., p. 663. SIAM Publications, https://doi.org/10.1137/1.9780898718027 (2002)
Huckle, T., Neckel, T.: Bits and Bugs: a scientific and historical review of software failures in computational science. Softw. Environ. Tools, vol. 29. SIAM Publications, https://doi.org/10.1137/1.9781611975567 (2019)
Giraud, L., Langou, J., Rozložník, M., van den Eshof, J.: Rounding error analysis of the classical Gram-Schmidt orthogonalization process. Numer. Math. 101, 87–100 (2005). https://doi.org/10.1007/s00211-005-0615-4
Smoktunowicz, A., Barlow, J.L., Langou, J.: A note on the error analysis of classical Gram-Schmidt. Numer. Math. 105(2), 299–313 (2006). https://doi.org/10.1007/s00211-006-0042-1
Carson, E.: The adaptive s-step conjugate gradient method. SIAM J. Matrix Anal. Appl. 39(3), 1318–1338 (2018). https://doi.org/10.1137/16M1107942
Carson, E.C.: An adaptive s-step conjugate gradient algorithm with dynamic basis updating. Appl. Math. 65, 123–151 (2020). https://doi.org/10.21136/AM.2020.0136-19
Frommer, A., Lund, K., Szyld, D.B.: Block Krylov subspace methods for functions of matrices. Electron. Trans. Numer. Anal. 47, 100–126 (2017)
Frommer, A., Lund, K., Szyld, D.B.: Block Krylov subspace methods for functions of matrices II: modified block FOM. SIAM J. Matrix Anal. Appl. 41(2), 804–837 (2020). https://doi.org/10.1137/19M1255847
Ballard, G., Carson, E., Demmel, J.W., Hoemmen, M., Knight, N., Schwartz, O.: Communication lower bounds and optimal algorithms for numerical linear algebra. Acta Numer. 23, 1–155 (2014). https://doi.org/10.1017/S0962492914000038
Anzt, H., Boman, E.G., Falgout, R., Ghysels, P., Heroux, M., Li, X., Curfman McInnes, L., Mills, R.T., Rajamanickam, S., Rupp, K., Smith, B., Yamazaki, I., Yang, U.M.: Preparing sparse solvers for exascale computing. Philos. Trans. Royal Soc. A 378(2166), 20190053 (2020). https://doi.org/10.1098/rsta.2019.0053
Baker, A.H., Dennis, J.M., Jessup, E.R.: On improving linear solver performance: a block variant of GMRES. SIAM J. Sci. Comput. 27(5), 1608–1626 (2006). https://doi.org/10.1137/040608088
Birk, S.: Deflated shifted block Krylov subspace methods for Hermitian positive definite matrices. Ph.D. Thesis, Fakultät für Mathematik und Naturwissenschaften, Bergische Universität Wuppertal. http://elpub.bib.uni-wuppertal.de/servlets/DocumentServlet?id=4880 (2015)
Dreier, N.A.: Hardware-oriented Krylov methods for high-performance computing. Ph.D. Thesis, Fachbereich Mathematik und Informatik der Mathematisch-Naturwissenschaftlichen Fakultät der Westfälischen Wilhelms-Universität Münster. https://www.proquest.com/docview/2607316034/abstract/A334B3B058D24AF2PQ/1 (2020)
Dreier, N.A., Engwer, C.: Strategies for the vectorized block conjugate gradients method. In: Vermolen, F.J., Vuik, C. (eds.) Numerical Mathematics and Advanced Applications ENUMATH 2019. Lecture Notes in Computational Science and Engineering, vol. 139, pp. 381–388. Springer (2020). https://doi.org/10.1007/978-3-030-55874-1_37
Yamamoto, Y., Nakatsukasa, Y., Yanagisawa, Y., Fukaya, T.: Roundoff error analysis of the Cholesky QR2 algorithm. Electron. Trans. Numer. Anal. 44, 306–326 (2015)
Demmel, J., Grigori, L., Hoemmen, M., Langou, J.: Communication-optimal parallel and sequential QR and LU factorizations. SIAM J. Sci. Comput. 34(1), 206–239 (2012). https://doi.org/10.1137/080731992
Mori, D., Yamamoto, Y., Zhang, S.L.: Backward error analysis of the AllReduce algorithm for Householder QR decomposition. Jpn. J. Ind. Appl. Math. 29(1), 111–130 (2012). https://doi.org/10.1007/s13160-011-0053-x
Simoncini, V.: Ritz and pseudo-Ritz values using matrix polynomials. Linear Algebra Appl. 241–243, 787–801 (1996). https://doi.org/10.1016/0024-3795(95)00682-6
Simoncini, V., Gallopoulos, E.: Convergence properties of block GMRES and matrix polynomials. Linear Algebra Appl. 247, 97–119 (1996). https://doi.org/10.1016/0024-3795(95)00093-3
Simoncini, V., Gallopoulos, E.: A hybrid block GMRES method for nonsymmetric systems with multiple right-hand sides. J. Comput. Appl. Math. 66, 457–469 (1996). https://doi.org/10.1016/0377-0427(95)00198-0
Gutknecht, M.H.: Block Krylov space methods for linear systems with multiple right-hand sides: an introduction. In: Siddiqi, A.H., Duff, I.S., Christensen, O. (eds.) Modern Mathematical Models, Methods and Algorithms for Real World Systems, pp. 420–447. Anamaya, New Delhi (2007)
Gutknecht, M.H., Schmelzer, T.: Updating the QR decomposition of block tridiagonal and block Hessenberg matrices. Appl. Numer. Math. 58(6), 871–883 (2008). https://doi.org/10.1016/j.apnum.2007.04.010
Gutknecht, M.H., Schmelzer, T.: The block grade of a block Krylov space. Linear Algebra Appl. 430, 174–185 (2009). https://doi.org/10.1016/j.laa.2008.07.008
Schreiber, R., Van Loan, C.: A storage-efficient WY representation for products of Householder transformations. SIAM J. Sci. Statist. Comput. 10(1), 53–57 (1989). https://doi.org/10.1137/0910005
Balabanov, O., Grigori, L.: Randomized Gram–Schmidt process with application to GMRES. SIAM J. Sci. Comput. 44(3), 1450–1474 (2022). https://doi.org/10.1137/20M138870X
Davis, T.A., Hu, Y.: The University of Florida sparse matrix collection. ACM Trans. Math. Softw. 38(1), 1–25 (2011). https://doi.org/10.1145/2049662.2049663
Boman, E.G., Higgins, A.J., Szyld, D.B.: Optimal size of the block in block GMRES on GPUs: computational model and experiments. E-print 220430, Department of Mathematics, Temple University, Philadelphia, PA. https://www.math.temple.edu/szyld/reports/BGMRES_GPU_rev.report.pdf (2022)
Parks, M. L., Soodhalter, K. M., Szyld, D. B.: A block recycled GMRES method with investigations into aspects of solver performance. arXiv:1604.01713v1 (2016)
Acknowledgements
The author is indebted to Stéphane Gaudreault, Teodor Nikolov, and Erin Carson for stimulating discussions that inspired this work. The author is also grateful to Jens Saak and Martin Köhler for answering questions about the Mechthild cluster and multithreading in MATLAB and to two anonymous reviewers for their constructive feedback.
Funding
Open Access funding enabled and organized by Projekt DEAL. K. Lund is a contracted employee of Max Planck Institute for Dynamics of Complex Technical Systems and did not receive any additional funding to support this project.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Ethics approval and consent to participate
The author certifies that this manuscript has been submitted to only one journal at this time, that the work is original, and that the results are not fabricated or skewed. The work is entirely the author’s own, and to the best of the author’s ability, the work is complete in its own right and without error or misappropriation.
Consent for publication
As the sole author, K. Lund provides consent for publication.
Competing interests
The author declares no competing interests.
Additional information
Author contribution
K. Lund is the sole author of the manuscript and associated code.
Availability of supporting data
All code and scripts to reproduce plots can be found at https://gitlab.mpi-magdeburg.mpg.de/lund/lowsyncblockarnoldi.
Human and animal ethics
Not applicable
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix. Raw data from tests
A subset of raw data corresponding to the performance plots in Section 5 is provided below. Many headers are abbreviated for space reasons: “Accel.” refers to “acceleration” or “speedup”; “Ct.” refers to “Count”; and “Iter.” refers to “Iteration”.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Lund, K.: Adaptively restarted block Krylov subspace methods with low-synchronization skeletons. Numer. Algor. 93, 731–764 (2023). https://doi.org/10.1007/s11075-022-01437-1
Keywords
- Gram-Schmidt
- Krylov subspace methods
- Arnoldi method
- Block methods
- Stability
- Loss of orthogonality
- Low-synchronization methods
- High-performance computing
- Communication-avoiding methods