1 Introduction and motivation

Oak Ridge National Laboratory reported in May 2022 that its Frontier supercomputer is the first machine to have achieved true exascale performance. That is, for the first time ever, a supercomputer performed more than 1 exaflop (i.e., \(10^{18}\) double-precision floating-point operations) in a single second. This astounding development is clear motivation for our work. Exascale computing is no longer a next-generation dream; it is reality, and the need for highly parallelized algorithms that take full advantage of exaflop computational potential while reducing global communication between nodes is urgent.

To this end we build on the low-synchronization ("low-sync") Gram-Schmidt methods of Barlow [1], Świrydowicz et al. [2], Yamazaki et al. [3], Thomas et al. [4], and Bielich et al. [5], as well as our own earlier work on block versions of these methods [6, 7]. Gram-Schmidt methods are an essential backbone of orthogonalization routines like QR factorization and of iterative methods like Krylov subspace methods for linear systems, matrix functions, and matrix equations [8,9,10]. Block Krylov subspace methods in particular make better use of L3 cache via matrix-matrix operations and feature often in communication-avoiding Krylov subspace methods, such as s-step [11, 12], enlarged [13], and randomized [14] methods.

As in most realms of life, there is no such thing as a free lunch here. While low-sync variations have the potential to speed up highly parallelized implementations of Gram-Schmidt [3], they introduce new floating-point errors and thus potential instability, due to the reformulation of inner products and normalizations. Instability surfaces as a loss of orthogonality between basis vectors and can lead to breakdowns or wildly inaccurate approximations in downstream applications [15, 16]. Stability bounds for some low-sync variants have been established, but it often takes much longer to carry out a rigorous stability analysis than to derive and deploy new methods [1, 4, 6, 7]. It can also happen that a backward error bound is established and later challenged by an obscure edge case [17, 18]. With this tension in mind, we have not only extended low-sync variants of block Gram-Schmidt to block Arnoldi but also developed a benchmarking tool for the community to explore the efficiency, stability, and accuracy of these new algorithms, in a similar vein to the BlockStab comparison tool developed in tandem with a recent block Gram-Schmidt survey [7]. We refer to this new tool as LowSyncBlockArnoldi and encourage the reader to explore the tool in parallel with the text.

This earlier work established that block variants of low-sync Gram-Schmidt are less stable than their column-wise counterparts. However, when these skeletons are transferred to block Arnoldi and used to solve linear systems, we gain the option to restart the process. Restarting can be effective at mitigating stability issues in communication-avoiding algorithms [19, 20]. As long as each node redundantly computes residual or error estimates and checks the stability via local quantities, restarting does not introduce additional synchronization points. Furthermore, adaptive restarting allows for robustness, as we can use basic look-ahead heuristics to foresee a breakdown and salvage progress without giving up completely at the first sign of trouble.

Given the modularity of our framework, we are also able to treat generalized block inner products, as described in [21, 22]. We focus in particular on the classical and global inner products.

The paper is organized as follows. In Section 2 we summarize terms, definitions, and concepts from high-performance (HPC) computing, generalized block inner products, block Gram-Schmidt algorithms, and block Krylov subspace methods with static restarting. We present new low-synchronization block Arnoldi skeletons in Section 3, and derive an adaptive restarting heuristic in Section 4. Section 5 features a more in-depth discussion of the LowSyncBlockArnoldi benchmarking tool as well as examples demonstrating how to compare different block Arnoldi variants. We summarize our findings in Section 6.

2 Background

This work is a combination of the generalized inner product framework of Frommer, Lund, and Szyld [21, 22] and the skeleton-muscle framework for block Gram-Schmidt (BGS) by Carson, Lund, Rozložník, and Thomas [6, 7]. Throughout the text, we focus on solving a linear system with multiple right-hand sides

$$ A \boldsymbol{X} = \boldsymbol{B}, $$
(1)

where \(A \in {\mathbb {C}}^{n \times n}\) is large and sparse (i.e., with \(\mathcal {O}\left (n\right )\) nonzero entries) and \(\boldsymbol {B} \in {\mathbb {C}}^{n \times s}\) is a tall-skinny (i.e., \(s \ll n\)) matrix.

We employ standard numerical linear algebra notation throughout. In particular, \(A^{*}\) denotes the Hermitian transpose of A, \(\left \|\cdot \right \|\) refers to the Euclidean 2-norm unless otherwise specified, and \({\widehat {e}_{k}}\) denotes the kth standard unit vector, with the kth entry equal to 1 and all others 0.

In the following subsections, we define key concepts in HPC, block Gram-Schmidt methods, and block Krylov subspace methods.

2.1 Communication in high-performance computing

As floating-point operations have become faster and less energy-intensive, communication, i.e., the memory operations between levels of cache on a node or between parallelized processors on a network, has become a bottleneck in distributed computing. How expensive a memory operation is depends on physical aspects of the specific system, namely the latency, or the amount of time needed to pack and transmit a message, and the bandwidth, or how much information can be transmitted at a time. To improve the performance of bandwidth-limited algorithms like Krylov subspace methods, it is therefore advantageous to increase the computational intensity, or the ratio between floating-point and memory operations [23]. We pay particular attention to synchronization points ("sync points"), i.e., the steps in an algorithm that initiate a broadcast or reduce pattern to synchronize a quantity across all processors. Reducing calls to kernels with sync points is a straightforward way to improve computational intensity [24].

Sync points in Krylov subspace methods arise primarily in the orthonormalization procedure, such as Arnoldi or Lanczos, both of which are reformulations of the Gram-Schmidt method, a standard method for orthonormalizing a basis one (block) vector at a time. For large n, vectors are typically partitioned row-wise and distributed among processors, meaning that any time an operation like an inner product or normalization is performed—which is at least once per (block) vector in Gram-Schmidt—a sync point is inevitable.

Other possibly communication-intensive kernels include applications of the operator A and applications of \(\boldsymbol {\mathcal {V}}_{m}\), an n × ms Krylov basis matrix. We count each operation separately from sync points (block inner products and vector norms) in LowSyncBlockArnoldi; see Section 5.

2.2 Generalized block inner products

A block vector is a tall-skinny matrix \(\boldsymbol {X} \in {\mathbb {C}}^{n \times s}\), and a block matrix is a matrix of s × s matrices, e.g.,

$$ \mathcal{H} = \begin{bmatrix} H_{1,1} & H_{1,2} & {\cdots} & H_{1,p} \\ H_{2,1} & H_{2,2} & {\cdots} & H_{2,p} \\ {\vdots} & {\vdots} & {\ddots} & {\vdots} \\ H_{q,1} & H_{q,2} & {\cdots} & H_{q,p} \end{bmatrix} \in {\mathbb{C}}^{qs \times ps}. $$

We use a mixture of Matlab- and block-indexing notation to handle block objects. In particular, we write \(\boldsymbol {\mathcal {V}}_{k}\) to denote the first k block vectors of the block-partitioned matrix \(\boldsymbol {\mathcal {V}} = \begin {bmatrix} \boldsymbol {V}_{1} & \boldsymbol {V}_{2} & {\cdots } & \boldsymbol {V}_{m} \end {bmatrix}\) instead of \(\boldsymbol {\mathcal {V}}_{:,1:ks}\) (i.e., the first ks columns). In a similar vein, the s × s block entries of \({\mathscr{H}}\) are denoted \(H_{j,k}\) instead of \({\mathscr{H}}_{(j-1)s+1:js,\,(k-1)s+1:ks}\). We denote block generalizations of the standard unit vectors \({\widehat {e}_{k}}\) as \(\widehat {\boldsymbol {E}}_{k} := {\widehat {e}_{k}} \otimes I_{s}\), where ⊗ is the Kronecker product and \(I_{s}\) the identity matrix of size s.
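For readers following along in Matlab, these conventions translate directly into column ranges. The fragment below is a hypothetical illustration; the variable names are not taken from LowSyncBlockArnoldi.

```matlab
% Illustration of the block-indexing conventions (hypothetical helper code,
% not part of LowSyncBlockArnoldi).
n = 100; s = 3; m = 5; k = 2;
V = randn(n, m*s);                           % block-partitioned V = [V_1 ... V_m]
Vk    = V(:, 1:k*s);                         % \mathcal{V}_k: the first k block vectors
V_blk = V(:, (k-1)*s+1 : k*s);               % V_k: the k-th block vector alone
e_k   = zeros(m, 1); e_k(k) = 1;
Ek    = kron(e_k, eye(s));                   % \widehat{E}_k = e_k (x) I_s, size ms x s
```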

Blocking is a batching technique that can reduce the number of calls to the operator A applied to individual column vectors, maximize computational intensity by filling up the local cache with BLAS3 operations, and reduce the total number of sync points by performing inner products and normalization en masse [25, 26]. In the context of Krylov subspaces, blocking can also lead to enriched subspaces by sharing information across column vectors instead of treating each right-hand side as an isolated problem. How much information is shared across columns depends on the choice of block inner product.

Let \({{\mathbb {S}}}\) be a \(*\)-subalgebra of \({{\mathbb {C}}}^{s \times s}\) with identity; i.e., \(I_s \in {{\mathbb {S}}}\), and whenever \( S,T \in {\mathbb {S}}\) and \(\alpha \in {\mathbb {C}}\), then \(\alpha S +T,\ ST,\ S^{*} \in {\mathbb {S}}\).

Definition 1

A mapping 〈〈⋅,⋅〉〉 from \({\mathbb {C}}^{n\times s} \times {\mathbb {C}}^{n \times s}\) to \({\mathbb {S}}\) is called a block inner product onto \({\mathbb {S}}\) if it satisfies the following conditions for all \(\boldsymbol {X},\boldsymbol {Y},\boldsymbol {Z} \in {\mathbb {C}}^{n \times s}\) and \(C \in {\mathbb {S}}\):

  (i) \({\mathbb {S}}\)-linearity: \(\langle\langle \boldsymbol{X}+\boldsymbol{Y},\boldsymbol{Z} C \rangle\rangle_{{\mathbb {S}}} = \langle\langle \boldsymbol{X},\boldsymbol{Z} \rangle\rangle_{{\mathbb {S}}}\, C + \langle\langle \boldsymbol{Y},\boldsymbol{Z} \rangle\rangle_{{\mathbb {S}}}\, C\);

  (ii) symmetry: \(\langle\langle \boldsymbol{X},\boldsymbol{Y} \rangle\rangle_{{\mathbb {S}}} = \langle\langle \boldsymbol{Y},\boldsymbol{X} \rangle\rangle_{{\mathbb {S}}}^{*}\);

  (iii) definiteness: \(\langle\langle \boldsymbol{X},\boldsymbol{X} \rangle\rangle_{{\mathbb {S}}}\) is positive definite if \(\boldsymbol{X}\) has full rank, and \(\langle\langle \boldsymbol{X},\boldsymbol{X} \rangle\rangle_{{\mathbb {S}}} = 0\) if and only if \(\boldsymbol{X} = 0\).

Definition 2

A mapping N that maps every full-rank \(\boldsymbol {X} \in {\mathbb {C}}^{n \times s}\) to a matrix \(N(\boldsymbol {X}) \in {\mathbb {S}}\) is called a scaling quotient if for all such X, there exists \(\boldsymbol {Y} \in {\mathbb {C}}^{n \times s}\) such that X = YN(X) and \({\langle \langle \boldsymbol {Y},\boldsymbol {Y} \rangle \rangle }_{{\mathbb {S}}} = I_{s}\).

The scaling quotient is closely related to the intraorthogonalization routine discussed in Section 2.3. Block notions of orthogonality and normalization arise organically from Definitions 1 and 2.

Definition 3

Let \(\boldsymbol {X}, \boldsymbol {Y} \in {\mathbb {C}}^{n \times s}\) and \(\{\boldsymbol {X}_{j} \}_{j=1}^{m} \subset {\mathbb {C}}^{n \times s}\).

  (i) \(\boldsymbol{X}, \boldsymbol{Y}\) are block orthogonal if \(\langle\langle \boldsymbol{X},\boldsymbol{Y} \rangle\rangle_{{\mathbb {S}}} = 0_{s}\).

  (ii) \(\boldsymbol{X}\) is block normalized if \(N(\boldsymbol{X}) = I_{s}\).

  (iii) \(\boldsymbol{X}_{1},\ldots,\boldsymbol{X}_{m}\) are block orthonormal if \(\langle\langle \boldsymbol{X}_{i},\boldsymbol{X}_{j} \rangle\rangle_{{\mathbb {S}}} = \delta_{ij} I_{s}\).

A set of vectors \(\{\boldsymbol {X}_{j}\}_{j=1}^{m} \subset {\mathbb {C}}^{n \times s}\) block spans a space \({\mathscr{K}} \subseteq {\mathbb {C}}^{n \times s}\), and we write \({\mathscr{K}} = \text {span}^{\mathbb {S}}\{\boldsymbol {X}_{j}\}_{j=1}^{m}\) if

$$ \mathscr{K} = \Big\{\sum\limits_{j=1}^{m} \boldsymbol{X}_{j} {\Gamma}_{j} : {\Gamma}_{j} \in {\mathbb{S}} \text{ for } j = 1, \ldots, m \Big\}. $$

The set \(\{\boldsymbol {X}_{j}\}_{j=1}^{m}\) constitutes a block orthonormal basis for \({\mathscr{K}} = \text {span}^{\mathbb {S}}\{\boldsymbol {X}_{j}\}_{j=1}^{m}\) if it is block orthonormal.

In this work, we consider only the classical and global block paradigms, described in Table 1. These paradigms represent the two extremes of information-sharing, with the classical approach maximizing information shared among columns and the global approach minimizing it; see, e.g., [22, Theorem 3.3]. Moreover, the global paradigm leads to a lower complexity per iteration in Krylov subspace methods, because matrix-matrix products in the classical paradigm reduce to scaling operations in the global one. Many other paradigms are also possible; see, e.g., [27, 28].

Table 1 Choices of \({\mathbb {S}}\), \(\langle \langle {\cdot ,\cdot } \rangle \rangle _{{\mathbb {S}}}\), and N in the classical and global block paradigms
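To make Table 1 concrete, the sketch below implements both paradigms as function handles. The classical choices follow directly from the definitions above; for the global paradigm we assume the common convention of the trace inner product scaled by 1/s and the Frobenius norm scaled by 1/sqrt(s), consistent with the "scaled Frobenius norm" mentioned in Section 2.3, though the definitive scalings are those fixed in [21, 22]. All names are illustrative.

```matlab
% Sketch of the classical and global paradigms of Table 1. The global
% scaling shown here (trace inner product divided by s, Frobenius norm
% divided by sqrt(s)) is one common convention; see [21, 22] for the
% definitive choices. All names are illustrative.
function P = block_paradigm(kind)
    switch kind
        case 'cl'   % classical: S = C^{s x s}
            P.ip = @(X, Y) X' * Y;
            P.sq = @(X) qr_R(X);        % scaling quotient: R factor of an economy QR
        case 'gl'   % global: S = {alpha * I_s}
            P.ip = @(X, Y) (trace(X' * Y) / size(X, 2)) * eye(size(X, 2));
            P.sq = @(X) (norm(X, 'fro') / sqrt(size(X, 2))) * eye(size(X, 2));
    end
end

function R = qr_R(X)
    [~, R] = qr(X, 0);                  % economy-size QR
end
```

Called as, e.g., P = block_paradigm('gl'), the handles P.ip and P.sq then return, respectively, the block inner product and the scaling quotient of two block vectors.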

2.3 Block Gram-Schmidt

Block Gram-Schmidt (BGS) is a routine for orthonormalizing a set of block vectors \(\{\boldsymbol {X}_{j} \}_{j=1}^{m} \subset {\mathbb {C}}^{n \times s}\). Writing

$$ \boldsymbol{\mathcal{X}} := \begin{bmatrix} \boldsymbol{X}_{1} & \boldsymbol{X}_{2} & {\cdots} & \boldsymbol{X}_{m}\end{bmatrix} \in {\mathbb{C}}^{n \times ms}, $$

we define a BGS method as one that returns a block orthonormal \(\boldsymbol {\mathcal {Q}} \in {\mathbb {C}}^{n \times ms}\) and a block upper triangular \(\mathcal {R} \in {\mathbb {C}}^{ms \times ms}\) such that \(\boldsymbol{\mathcal {X}} = \boldsymbol{\mathcal {Q}} \mathcal {R}\). Important measures in the analysis of BGS methods are the condition number of \(\boldsymbol {\mathcal {X}}\),

$$ \kappa(\boldsymbol{\mathcal{X}}) := \frac{\sigma_{\max}(\boldsymbol{\mathcal{X}})}{\sigma_{\min}(\boldsymbol{\mathcal{X}})}, $$
(2)

i.e., the ratio between the largest and smallest singular values of \(\boldsymbol {\mathcal {X}}\), and the loss of orthogonality (LOO),

$$ \left\|I - {\langle\langle \boldsymbol{\mathcal{Q}},\boldsymbol{\mathcal{Q}} \rangle\rangle }_{{\mathbb{S}}}\right\|, $$
(3)

where \({\langle \langle {\cdot ,\cdot } \rangle \rangle _{{\mathbb {S}}}}\) is a generalized block inner product as described in Section 2.2.

When we discuss the stability of BGS methods, we refer to bounds on the loss of orthogonality in terms of machine precision, ε. We assume IEEE double precision here, so \(\varepsilon = \mathcal {O}\left (10^{-16}\right )\).

For categorizing BGS variants, we recycle the skeleton-muscle notation from [7, 12], where the skeleton refers to the interorthogonalization routine between block vectors and the muscle refers to the intraorthogonalization routine applied to the columns of a single block vector. As a prototype, consider the Block Modified Gram-Schmidt (BMGS) skeleton, given by Algorithm 1. Here, IntraOrtho denotes a generic muscle that takes \(\boldsymbol {X} \in {\mathbb {C}}^{n \times s}\) and returns \(\boldsymbol {Q} \in {\mathbb {C}}^{n \times s}\) and \(R \in {\mathbb {C}}^{s \times s}\) such that \({\langle \langle \boldsymbol {Q},\boldsymbol {Q} \rangle \rangle } _{{\mathbb {S}}} = I_{s}\) and X = QR. For the classical paradigm, this could be any implementation of a QR factorization: a column-wise Gram-Schmidt routine, Householder QR (HouseQR), Cholesky QR (CholQR), etc. As for the global paradigm, there is only one possible muscle, given by the global scaling quotient, which effectively reduces to normalizing block vectors with a scaled Frobenius norm. Consequently, intraorthogonalization does not actually occur in the global paradigm, as the columns of block vectors are not orthogonalized with respect to one another at all.

Algorithm 1 \([\boldsymbol{\mathcal{Q}}, \mathcal{R}] = \texttt{BMGS}(\boldsymbol{\mathcal{X}})\)

We regard a single call to either \({\langle \langle {\cdot , \cdot }\rangle \rangle _{\mathbb {S}}}\) or IntraOrtho as one sync point, which is only possible in practice if single-reduce algorithms like CholQR [29] or TSQR/AllReduceQR [30, 31] are employed for IntraOrtho.
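For concreteness, the following Matlab sketch mirrors the structure of Algorithm 1 for the classical paradigm, with the muscle passed in as a function handle to emphasize the skeleton-muscle split. Names and the exact loop organization are illustrative and not taken from LowSyncBlockArnoldi.

```matlab
% Sketch of the BMGS skeleton (Algorithm 1) for the classical paradigm.
% The muscle is passed as a function handle; names are illustrative only.
function [Q, R] = bmgs_sketch(X, s, intra_ortho)
    [n, ms] = size(X); m = ms / s;
    Q = zeros(n, ms); R = zeros(ms, ms);
    [Q(:, 1:s), R(1:s, 1:s)] = intra_ortho(X(:, 1:s));
    for k = 1:m-1
        idx = k*s+1 : (k+1)*s;
        W = X(:, idx);
        for j = 1:k                                  % one sync point per block inner product
            cols = (j-1)*s+1 : j*s;
            R(cols, idx) = Q(:, cols)' * W;
            W = W - Q(:, cols) * R(cols, idx);
        end
        [Q(:, idx), R(idx, idx)] = intra_ortho(W);   % one sync point in the muscle
    end
end
```

Called as, e.g., [Q, R] = bmgs_sketch(X, s, @(W) qr(W, 0)), block step k incurs k sync points in the inner loop plus one in the muscle.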

2.4 Block Krylov subspace methods

The mth block Krylov subspace for A and \(\boldsymbol{B}\) (with respect to \({\mathbb {S}}\)) is defined as

$$ \mathscr{K}^{{\mathbb{S}}}_{m}(A, \boldsymbol{B}) := \text{span}^{\mathbb{S}}\{\boldsymbol{B}, A\boldsymbol{B}, \ldots, A^{m-1}\boldsymbol{B}\}. $$
(4)

Block Arnoldi is often used to compute a basis for \({\mathscr{K}}^{{\mathbb {S}}}_{m}(A, \boldsymbol {B})\), and it is typically implemented with BMGS as the skeleton; see Algorithm 2. BMGS-Arnoldi accrues a high number of sync points due to the inner for-loop, where an increasing number of inner products is performed per block column.

Algorithm 2 \([\boldsymbol{\mathcal{V}}_{m+1}, {\mathscr{H}}_{m+1,m}, B] = \texttt{BMGS-Arnoldi}(A, \boldsymbol{B}, m)\)

Performing m steps of a block Arnoldi routine returns the block Arnoldi relation

$$ A \boldsymbol{\mathcal{V}}_{m} = \boldsymbol{\mathcal{V}}_{m} \mathcal{H}_{m} + \boldsymbol{V}_{m+1} H_{m+1,m}, $$
(5)

where \(\boldsymbol {\mathcal {V}}_{m}\) \({\mathbb {S}}\)-spans \({\mathscr{K}}^{{\mathbb {S}}}_{m}(A, \boldsymbol {B})\) and \({\mathscr{H}}_{m}\) denotes the ms × ms principal submatrix of \({\mathscr{H}}_{m+1,m}\).
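A corresponding sketch of BMGS-Arnoldi (Algorithm 2), again for the classical paradigm and with a CholQR muscle, shows where the sync points accumulate; as before, all names are illustrative rather than taken from LowSyncBlockArnoldi. In exact arithmetic the outputs satisfy the Arnoldi relation (5), with \(\boldsymbol{\mathcal{V}}_{m+1}\) stored block-column-wise in V and \({\mathscr{H}}_{m+1,m}\) in H.

```matlab
% Sketch of BMGS-Arnoldi (Algorithm 2) for the classical paradigm with a
% CholQR muscle; names are illustrative, not LowSyncBlockArnoldi's.
function [V, H, B0] = bmgs_arnoldi_sketch(A, B, m)
    [n, s] = size(B);
    V = zeros(n, (m+1)*s); H = zeros((m+1)*s, m*s);
    [V(:, 1:s), B0] = chol_qr(B);                    % V_1 and B = N(B)
    for k = 1:m
        W = A * V(:, (k-1)*s+1 : k*s);               % expand the basis
        for j = 1:k                                  % BMGS inner loop: k sync points
            cols = (j-1)*s+1 : j*s;
            H(cols, (k-1)*s+1:k*s) = V(:, cols)' * W;
            W = W - V(:, cols) * H(cols, (k-1)*s+1:k*s);
        end
        [V(:, k*s+1:(k+1)*s), H(k*s+1:(k+1)*s, (k-1)*s+1:k*s)] = chol_qr(W);
    end
end

function [Q, R] = chol_qr(W)
    R = chol(W' * W);                                % single-reduce CholQR muscle
    Q = W / R;
end
```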

2.4.1 Block full orthogonalization methods with low-rank modifications

We define

$$ \boldsymbol{X}_{m} := \boldsymbol{\mathcal{V}}_{m} \big(\mathcal{H}_{m} + \mathcal{M} \big)^{{-1}} \widehat{\boldsymbol{E}}_{1} B, $$
(6)

where \(\widehat {\boldsymbol {E}}_{1} = \widehat {e}_{1} \otimes I_{s}\) is a standard block unit vector, as the (modified) block full orthogonalization method (BFOM) approximation of (1). When \({\mathscr{M}} = 0\), we recover BFOM, which minimizes the error in the A-weighted \({\mathbb {S}}\)-norm for A Hermitian positive definite [21]. There are infinitely many choices for \({\mathscr{M}}\), but perhaps only a few useful ones, some of which are discussed in [22]. We concern ourselves here with just \({\mathscr{M}} = {\mathscr{H}}_{m}^{{-*}} \big (\widehat {\boldsymbol {E}}_{m} H_{m+1,m}^{*} H_{m+1,m}\big ) \widehat {\boldsymbol {E}}_{m}^{*}\), which gives rise to a block generalized minimal residual method (BGMRES) [32,33,34]. As in [22], we implement BGMRES as a modified BFOM here, with an eye towards downstream applications like \(f(A)\boldsymbol{B}\) where the BFOM form is explicitly needed. In practice, there may be computational savings with a less modular implementation; see, e.g., [35,36,37].
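The fragment below sketches how (6) can be evaluated once a basis is available, assuming the conventions of the Arnoldi sketch above (V of size n × (m+1)s, H of size (m+1)s × ms, B0 the s × s factor with \(\boldsymbol{V}_1 B_0 = \boldsymbol{B}\)). The harmonic branch builds the BGMRES modification \({\mathscr{M}}\), which only touches the last block column. This is a sketch of the formulas, not the bfom driver of LowSyncBlockArnoldi.

```matlab
% Sketch of the (modified) BFOM approximation (6). 'modification' is either
% 'none' (BFOM) or 'harmonic' (BGMRES). Names are illustrative only.
function X = bfom_approx(V, H, B0, s, modification)
    ms   = size(H, 2);
    Hm   = H(1:ms, 1:ms);                            % principal ms x ms submatrix
    Hlow = H(ms+1:ms+s, ms-s+1:ms);                  % H_{m+1,m}
    M    = zeros(ms, ms);
    if strcmp(modification, 'harmonic')              % M = H_m^{-*} E_m H'H E_m^*
        M(:, ms-s+1:ms) = (Hm') \ [zeros(ms-s, s); Hlow' * Hlow];
    end
    Xi = (Hm + M) \ [B0; zeros(ms-s, s)];            % (H_m + M)^{-1} E_1 B0
    X  = V(:, 1:ms) * Xi;                            % X_m = V_m * Xi_m
end
```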

2.4.2 Static restarting and cospatial factors

Restarting is a well-established technique for reconciling a growing basis with memory limitations. Define the residual of (6) as

$$ \boldsymbol{R}_{m} := \boldsymbol{B} - A \boldsymbol{X}_{m}. $$
(7)

The basic idea of restarting is to use \(\boldsymbol{R}_{m}\) to build a new Krylov subspace, from which we approximate the error \(\boldsymbol {E}_{m} := A^{{-1}} \boldsymbol {B} - \boldsymbol {X}_{m}\), which solves \(A\boldsymbol{E} = \boldsymbol{R}_{m}\) in exact arithmetic. Building a new Krylov subspace directly from \(\boldsymbol{R}_{m}\) is not a great idea, because it would require an extra computation with A. Furthermore, we need a cheap, accurate, and ideally locally computable way to approximate \(\left \|\boldsymbol {R}_{m}\right \|\) from one cycle to the next in order to monitor convergence. In [22], a static restarting method for low-rank modified BFOM is introduced that satisfies these requirements. By "static," we mean that the basis size m is fixed from one restart cycle to the next, in contrast to adaptive or dynamic restart cycle lengths. We restate [22, Theorem 4.1], which enables an efficient residual approximation and restarting procedure.

Theorem 2.1

Suppose \({\mathscr{M}} = \boldsymbol {M} \widehat {\boldsymbol {E}}_{m}^{*}\), where \(\boldsymbol {M} \in {\mathbb {C}}^{ms \times s}\) and \(\widehat {\boldsymbol {E}}_{m} = \widehat {\boldsymbol {e}}_{m} \otimes I_{s}\). Define \(\boldsymbol {U}_{m} := \boldsymbol {\mathcal {V}}_{m+1} \begin {bmatrix} \boldsymbol {M} \\ -H_{m+1,m} \end {bmatrix}\) and let \(\boldsymbol {\Xi }_{m} := ({\mathscr{H}}_{m} + {\mathscr{M}})^{{-1}} \widehat {\boldsymbol {E}}_{1} B\) be the block coefficient vector for the approximation \(\boldsymbol {X}_{m} = \boldsymbol {\mathcal {V}}_{m} \boldsymbol {\Xi }_{m}\) (6) of the system (1). With Rm as in (7) it then holds that

$$ \boldsymbol{R}_{m} = \boldsymbol{U}_{m} \big(\widehat{\boldsymbol{E}}_{m}^{*} \boldsymbol{\Xi}_{m}\big). $$
(8)

We refer to the s × s matrix \(\widehat {\boldsymbol {E}}_{m}^{*} \boldsymbol {\Xi }_{m}\) as a cospatial factor, and (8) as the cospatial residual relation. The term cospatial refers to the fact that the columns of \(\boldsymbol{R}_{m}\) and those of \(\boldsymbol{U}_{m}\) span the same space. Moreover, in exact arithmetic, it is not hard to see that

$$ \|{\boldsymbol{R}_{m}}\|_{\text{F}} = \left\|{\begin{bmatrix} \boldsymbol{M} \\ -H_{m+1,m} \end{bmatrix} \big(\widehat{\boldsymbol{E}}_{m}^{*} \boldsymbol{\Xi}_{m}\big)}\right\|_{\text{F}}, $$
(9)

and the right-hand term can be computed locally (and possibly redundantly on each processor) for \(ms \ll n\).

If the approximate residual norm does not meet the desired tolerance, then we can compute the Arnoldi relation for \({\mathscr{K}}_{m}(A, \boldsymbol {U}_{m})\) to obtain \(\boldsymbol {\mathcal {V}}_{m+1}^{(2)}\), \({\mathscr{H}}_{m}^{(2)}\), \(H_{m+1,m}^{(2)}\), and \(B^{(2)}\), where the superscript here and later denotes association with the restarted Krylov subspace. We then approximate \(\boldsymbol{E}_{m}\) as

$$ \boldsymbol{D}_{m} := \boldsymbol{\mathcal{V}}_{m}^{(2)} (\mathcal{H}_{m}^{(2)} + \mathcal{M}^{(2)})^{{-1}} \widehat{\boldsymbol{E}}_{1} B^{(2)} \big(\widehat{\boldsymbol{E}}_{m}^{*} \boldsymbol{\Xi}_{m}\big), $$

and update \(\boldsymbol{X}_{m}\) as

$$ \boldsymbol{X}_{m}^{(2)} := \boldsymbol{X}_{m} + \boldsymbol{D}_{m}. $$

The process is repeated, applying Theorem 2.1 iteratively, until the desired residual tolerance is reached.
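Putting Theorem 2.1 together with the update formulas above, a restarted driver can be sketched as follows. It reuses the hypothetical bmgs_arnoldi_sketch from Section 2.4 and inlines the harmonic (BGMRES) modification; the accumulated cospatial factor C carries the products \(\widehat{\boldsymbol{E}}_{m}^{*}\boldsymbol{\Xi}_{m}\) from one cycle to the next, and the residual estimate is the locally computable quantity (9). Again, names and structure are illustrative, not those of LowSyncBlockArnoldi's bfom driver.

```matlab
% Sketch of a restarted harmonic BFOM (BGMRES) driver built on Theorem 2.1.
% Relies on the bmgs_arnoldi_sketch helper from Section 2.4; all names are
% illustrative and do not mirror the bfom driver of LowSyncBlockArnoldi.
function [X, res_est] = restarted_bfom_sketch(A, B, m, tol, max_cycles)
    [n, s] = size(B);
    X = zeros(n, s);
    C = eye(s);                                      % accumulated cospatial factor
    normB = norm(B, 'fro');
    U = B;                                           % cycle 1 starts from B itself
    for cycle = 1:max_cycles
        [V, H, B0] = bmgs_arnoldi_sketch(A, U, m);
        ms   = m * s;
        Hm   = H(1:ms, 1:ms);
        Hlow = H(ms+1:ms+s, ms-s+1:ms);              % H_{m+1,m}
        Mvec = (Hm') \ [zeros(ms-s, s); Hlow' * Hlow];   % modification M = Mvec * E_m^*
        Mmat = zeros(ms, ms); Mmat(:, ms-s+1:ms) = Mvec;
        Xi   = (Hm + Mmat) \ [B0; zeros(ms-s, s)];   % block coefficients of this cycle
        X    = X + V(:, 1:ms) * (Xi * C);            % correction D_m, scaled by C
        C    = Xi(ms-s+1:ms, :) * C;                 % accumulate E_m^* Xi_m
        res_est = norm([Mvec; -Hlow] * C, 'fro') / normB;   % locally computable (9)
        if res_est < tol, return; end
        U = V * [Mvec; -Hlow];                       % U_m: cospatial with the residual
    end
end
```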

Remark 1

The analysis in [21, 22] is carried out in exact arithmetic. Therefore, when we replace Algorithm 2 with low-sync versions in Section 3, all the results summarized in this section still hold, because all block Gram-Schmidt variants generate the same QR factorization in exact arithmetic.

3 Low-synchronization variants of block Arnoldi

To distinguish between block Arnoldi variants, we default to the name of the underlying block Gram-Schmidt skeleton. We specify a configuration as ip-skel∘(musc): inner product, skeleton, and muscle, respectively. This naturally leads to a bit of an "alphabet soup," for which we ask the reader's patience, as it is crucial to precisely define algorithmic configurations for benchmarking. Please refer often to Table 2, which summarizes acronyms for all the Gram-Schmidt skeletons we consider in this text. Note that the coefficient in front of the number of sync points per cycle is often used to describe low-sync methods; e.g., BCGS-PIP is a "one-sync" method, while BMGS-SVL is a "three-sync" method.

Table 2 Acronyms for BGS skeletons. Here "m-cycle" refers to a restart cycle, or the construction of \(\boldsymbol {\mathcal {V}}_{m+1}\). Loss of orthogonality is defined in (3), and here κ is shorthand for \(\kappa ([\boldsymbol {B} A \boldsymbol {\mathcal {V}}_{m}])\). The loss of orthogonality bound for BMGS-LTS is a conjecture, and those for BMGS-CWY, BMGS-ICWY, and BCGSI+LS are unknown

Remark 2

The methods presented here are closely related to but not quite the same as the block methods used by Yamazaki et al. in [3], where BMGS, BCGS-PIP, and BCGSI+LS are employed as Gram-Schmidt skeletons in s-step Arnoldi (also known as communication-avoiding Arnoldi) [11, 12, 23], which is used to solve a linear system with a single right-hand side. Recall that we are solving (1), i.e., multiple right-hand sides simultaneously.

Remark 3

In the pseudocode for each algorithm, intermediate quantities like W and U are defined explicitly each iteration for readability. In general, we purposefully avoid redefining quantities within a given iteration and instead only set an output (i.e., entries of B, \(\boldsymbol {\mathcal {V}}_{m}\), or \({\mathscr{H}}_{m+1,m}\)) once all computations pertaining to that value are complete. This approach simplifies mathematical analysis. Exceptions include Algorithms 1 and 2, where W is redefined inside the for-loop as projected components are subtracted away from it. In practice, it is preferable to save storage by overwriting block vectors of \(\boldsymbol {\mathcal {V}}_{m}\) instead of allocating separate memory for W and U, for which there may not be space anyway.

3.1 BCGS-PIP and BCGS-PIO

A simple idea for reducing the number of sync points in BMGS is to condense the for-loop in lines 4–7 of Algorithm 2 into a single inner product and subtraction,

$$ \begin{array}{@{}rcl@{}} \mathcal{H}_{1:k,k} & =& {\langle\langle \boldsymbol{\mathcal{V}}_{k},\boldsymbol{W} \rangle\rangle }_{{\mathbb{S}}}, \\ \boldsymbol{W} & =& \boldsymbol{W} - \boldsymbol{\mathcal{V}}_{k} \mathcal{H}_{1:k,k}. \end{array} $$

This exchange gives rise to what is commonly referred to as the block classical Gram-Schmidt (BCGS) method. It is, however, rather unstable, with a loss of orthogonality worse than \(\mathcal {O}\left (\varepsilon \right ) \kappa ^{2}([\boldsymbol {B} A\boldsymbol {\mathcal {V}}_{m}])\) [6]. By making a correction based on the block Pythagorean theorem (as derived in, e.g., [6, Section 2.1]), we can instead guarantee a loss of orthogonality bounded by \(\mathcal {O}\left (\varepsilon \right ) \kappa ^{2}([\boldsymbol {B} A\boldsymbol {\mathcal {V}}_{m}])\), as long as \(\mathcal {O}\left (\sqrt {\varepsilon }\right ) \kappa ([\boldsymbol {B} A\boldsymbol {\mathcal {V}}_{m}]) \leq 1\).

One version of the corrected algorithm is given as Algorithm 3. The acronym "PIP" stands for "Pythagorean (variant) with Inner Product," due to how the factor \(H_{k+1,k}\) is computed. An alternative formulation, BCGS-PIO (where "PIO" stands for "Pythagorean with IntraOrthogonalization"), is also possible and is given as Algorithm 4. Note that in line 5, we use \(\sim \) to denote that a full block vector need not be computed or stored here, just the 2s × 2s scaling quotient Ω. For subtle reasons, BCGS-PIO appears to be less reliable in practice (see Section 4).

Algorithm 3 \([\boldsymbol{\mathcal{V}}_{m+1}, {\mathscr{H}}_{m+1,m}, B] = \texttt{BCGS-PIP-Arnoldi}(A, \boldsymbol{B}, m)\)

Algorithm 4 \([\boldsymbol{\mathcal{V}}_{m+1}, {\mathscr{H}}_{m+1,m}, B] = \texttt{BCGS-PIO-Arnoldi}(A, \boldsymbol{B}, m)\)
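As an illustration of why a BCGS-PIP-type Arnoldi needs only one sync point per block column, the fragment below sketches a single block step under the classical paradigm: the stacked product \([\boldsymbol{\mathcal{V}}_k\ \boldsymbol{W}]^{*} \boldsymbol{W}\) is formed in one reduction, and the diagonal block is recovered from the block Pythagorean theorem. Variable names are illustrative and do not follow Algorithm 3 line by line.

```matlab
% One block step of a BCGS-PIP-type Arnoldi iteration (classical paradigm).
% Vk is n x ks (the first k block vectors), W = A * V_k is n x s.
s     = size(W, 2);                      % block width
P     = [Vk, W]' * W;                    % single sync point: stacked inner products
Hcol  = P(1:end-s, :);                   % H_{1:k,k} = Vk' * W
Omega = P(end-s+1:end, :);               % W' * W
Hdiag = chol(Omega - Hcol' * Hcol);      % block Pythagorean correction; chol fails
                                         % if the argument is not numerically HPD
Vnext = (W - Vk * Hcol) / Hdiag;         % V_{k+1}, so that W = [Vk Vnext]*[Hcol; Hdiag]
```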

3.2 BMGS-SVL and BMGS-LTS

Barlow developed and analyzed one of the first stabilized low-sync Gram-Schmidt methods by using the Schreiber-Van Loan representation of products of Householder transformations [1, 38]. Under modest conditions, this method, which we denote here as BMGS-SVL, has a loss of orthogonality like that of BMGS. Its success depends on tracking the loss of orthogonality via an auxiliary matrix \(\mathcal {T}\) (as defined in lines 1, 2, 6, and 9 of Algorithm 5) and using this matrix to make corrections each iteration. A closely related method is BMGS-LTS, which is identical to BMGS-SVL except that the \(\mathcal {T}\) matrix is formed via lower-triangular solves instead of matrix products. A column version of BMGS-LTS was first developed by Świrydowicz et al. [2] and generalized to blocks by Carson et al. [7]. Although BMGS-LTS appears to behave identically to BMGS-SVL in practice, a formal analysis for the former remains open. We present Arnoldi versions of BMGS-SVL and BMGS-LTS together as Algorithm 5, with different colors highlighting the small differences between the methods. In both methods, the main inner product in line 4 is performed as in BCGS. Meanwhile, \(\mathcal {T}\) acts as a kind of buffer, storing the loss of orthogonality per iteration, which is used in successive iterations to make small corrections to the computation in line 4. Balabanov and Grigori use a similar technique to stabilize randomized sketches of inner products, where instead of explicitly computing and storing \(\mathcal {T}\), they solve least squares problems to compute \({\mathscr{H}}_{1:k,k}\) [14, 39].

Algorithm 5 \([\boldsymbol{\mathcal{V}}_{m+1}, {\mathscr{H}}_{m+1,m}, B] = \texttt{BMGS-SVL}/\texttt{BMGS-LTS-Arnoldi}(A, \boldsymbol{B}, m)\)

3.3 BMGS-CWY/BMGS-ICWY

A column-wise version of this algorithm was first presented by Świrydowicz et al. as [2, Algorithm 8]. To the best of our knowledge, we are the first to develop a block-wise formulation, which we refer to here as BMGS-CWY-Arnoldi, where CWY stands for "compact WY," an alternative way to represent Householder transformations that was used to derive the original algorithm. A related Arnoldi algorithm, not treated in either [2] or [4], is based on the inverse CWY (ICWY) form and is given simultaneously with BMGS-CWY in Algorithm 6.

It is important to note that BMGS-CWY-Arnoldi would not reduce to [2, Algorithm 8] or [4, Algorithm 6] for s = 1, as we have one total sync point, due to the lack of a reorthonormalization step for Vk. Algorithm 6 was largely derived by transforming BMGS-CWY and BMGS-ICWY from [7] into a block Arnoldi routine. The most challenging part is tracking how the \(\mathcal {R}\) factor in the Gram-Schmidt formulation maps to \({\mathscr{H}}_{m+1,m}\) and determining where to scale by the off-diagonal entry Hk,k− 1 each iteration. It is also possible to compute only with \(\mathcal {R}\) and reconstruct \({\mathscr{H}}_{m+1,m}\) after \(\boldsymbol {\mathcal {V}}_{m+1}\) is finished; this approach proved to be much less stable in practice, however, due to the growing condition number of \(\mathcal {R}\).

Algorithm 6 \([\boldsymbol{\mathcal{V}}_{m+1}, {\mathscr{H}}_{m+1,m}, B] = \texttt{BMGS-CWY}/\texttt{BMGS-ICWY-Arnoldi}(A, \boldsymbol{B}, m)\)

3.4 BCGSI+LS

One of the most intriguing of all the low-sync algorithms is DCGS2 [5], referred to as CGSI+LS in [7]. This algorithm is a reformulation of reorthogonalized CGS with a single sync point derived by “delaying” normalization to the next iteration, where operations are batched in a kind of s-step approach (where s = 2). The column-wise version exhibits \(\mathcal {O}({\varepsilon })\) loss of orthogonality; a rigorous proof of the backward stability bounds remains open, however. The block version, BCGSI+LS, does not exhibit perfect \(\mathcal {O}({\varepsilon })\) LOO; see numerical results in [7].

Bielich et al. present a column-wise Arnoldi based on DCGS2 as Algorithm 4 in [5]. Our Algorithm 7 is a direct block generalization of this algorithm with slight reformulations to match the aesthetics of Algorithm 6 and principles of Remark 3. Note that, as in Algorithm 6, we are able to compute \({\mathscr{H}}_{m}\) directly, but we must track an auxiliary matrix J and scale several quantities by Hk− 1,k− 2. An alternative version of Algorithm 7 based more directly on BCGSI+LS from [7, Algorithm 7] is included in the code but not described here.

Algorithm 7 \([\boldsymbol{\mathcal{V}}_{m+1}, {\mathscr{H}}_{m+1,m}, B] = \texttt{BCGSI+LS-Arnoldi}(A, \boldsymbol{B}, m)\)

4 Adaptive restarting

Reproducibility and stability are not mutually exclusive. This realization is precisely the motivation for an adaptive restarting routine and can be demonstrated by a simple example.

Consider the tridiag test case from Section 5.1 with n = 100. Notably, both A and B are deterministic quantities; neither is defined with random elements. In Matlab, it is possible to specify the number of threads on which a script is executed via the built-in maxNumCompThreads function. We solve AX = B with Algorithms 3 and 4 while varying the multithreading setting from 1 to 16 on a standard node of the Mechthild cluster; see the beginning of Section 5 for more details about the cluster. For both algorithms, we employ a variant of Matlab's Cholesky routine chol that returns a flag whenever chol determines the matrix it is given is numerically not positive definite and thus cannot be factorized. This flag is fed to the linear solver driver of LowSyncBlockArnoldi (bfom), which halts the process when the flag is true. Throughout the following discussion, we refer to this flag as the "NaN-flag," because ignoring it leads to computations with ill-defined quantities.

Figure 1 displays the loss of orthogonality (3) and \(\kappa ([\boldsymbol {B} A \boldsymbol {\mathcal {V}}_{k}])\) for different thread counts. The condition numbers for both methods are hardly affected by the thread count, except for some slight deviation for BCGS-PIP with 16 threads. The LOO plots are more telling: for both methods, changing the thread count directly affects the LOO and how many iterations the method can compute before encountering a NaN-flag. We allowed for a maximum basis size of m = 50, but no method can compute that far. BCGS-PIO with 8 threads gives up first, at 16 iterations; BCGS-PIP with 1 and 4 threads makes it all the way to 35 iterations. Among the BCGS-PIO runs, there are orders-of-magnitude differences in the attained LOO.

Fig. 1 Multithreading example for tridiag problem with n = 100, no restarts, and maximum basis size m = 50

This situation is perplexing on the surface: the problem is static, and the same code has been run every time. The only variable is the thread count.

There are two subtle issues that affect reproducibility in this case: 1) the configuration of math kernel libraries according to the parameters of the operating system and hardware, and 2) guaranteed stability bounds. As for stability bounds, it is important to note that both BCGS-PIO and BCGS-PIP have a complete backward stability analysis [6]. Both methods have \(\mathcal {O}\left (\varepsilon \right ) \kappa ^{2}([\boldsymbol {B} A \boldsymbol {\mathcal {V}}_{k}])\) loss of orthogonality, as long as \(\kappa ([\boldsymbol {B} A \boldsymbol {\mathcal {V}}_{k}]) \leq \mathcal {O}\left (\frac {1}{\sqrt {\varepsilon }}\right ) = \mathcal {O}\left (10^{8}\right )\) and as long as the IntraOrtho for BCGS-PIO behaves no worse than CholQR. (For this test, we used HouseQR, Matlab's built-in qr routine, which is unconditionally stable and therefore behaves better than CholQR [15].) For both methods, \(\kappa ([\boldsymbol {B} A \boldsymbol {\mathcal {V}}_{k}])\) exceeds \(\mathcal {O}\left (10^{8}\right )\) around iteration 15. At that point, the assumptions for the LOO bounds are no longer satisfied. The fact that either algorithm continues to compute something useful after that point is a lucky accident.

Computing \(\kappa ([\boldsymbol {B} A \boldsymbol {\mathcal {V}}_{k}])\) every iteration to check whether the LOO bounds are satisfied is not practical. We therefore propose a simple adaptive restarting regime based on whether chol raises a NaN-flag, which happens whenever chol is fed a numerically non-positive definite matrix. When a NaN-flag is raised, we give up computing a new basis vector and go back to the last safely computed basis vector, which is then used to restart. Simultaneously, the maximum basis size m is also reduced. It is possible that an algorithm exhausts its maximum allowed restarts and basis size before converging; indeed, we have observed this often for BCGS-PIP in examples not reported here. At the same time, there are many scenarios in which restarting is an adequate band-aid, thus allowing computationally cheap, one-sync algorithms like BCGS-PIP to salvage progress and converge, oftentimes faster than competitors. See Section 5 for demonstrations.
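The heuristic can be sketched in a few lines, relying only on the two-output form of Matlab's chol; the variable names and the shrinking rule below are illustrative and do not reproduce the bfom driver of LowSyncBlockArnoldi.

```matlab
% Sketch of the NaN-flag heuristic, in the notation of the BCGS-PIP step
% from Section 3.1. Only the two-output call to chol is Matlab's actual
% interface; everything else is illustrative.
[Hdiag, flag] = chol(Omega - Hcol' * Hcol);   % Gram-matrix factorization for block k
if flag > 0
    % NaN-flag: the Gram matrix is numerically non-positive definite.
    % Discard the current block vector, shrink the maximum basis size to
    % the last safely computed index, and restart from there.
    m_max = k - 1;
    restart_needed = true;
else
    % Safe to accept the new basis vector V_{k+1}.
    V(:, k*s+1:(k+1)*s) = (W - V(:, 1:k*s) * Hcol) / Hdiag;
    restart_needed = false;
end
```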

Remark 4

The restarted framework outlined in Section 2.4.2 does not change fundamentally with adaptive cycle lengths; only the notation becomes more complicated. We omit the details here.

5 Numerical benchmarks

Our treatment of BGS and block Krylov methods is hardly exhaustive. It is not our goal to determine the optimal block Arnoldi configuration at this stage, but rather to demonstrate the functionality of a benchmarking tool for the fair comparison of possible configurations on different problems. To this end, we restrict ourselves to the options below:

  • inner products: cl (classical), gl (global)

  • skeletons: Table 2

  • muscles: CholQR, which has \(\mathcal {O}\left (\varepsilon \right ) \kappa ^{2}\) loss of orthogonality guaranteed only for \(\mathcal {O}\left (\varepsilon \right ) \kappa ^{2} < 1\), but is a simple, single-reduce algorithm. In practice, we would recommend TSQR/AllReduceQR [30, 31], which has \(\mathcal {O}({\varepsilon })\) loss of orthogonality and the same number of sync points, but is difficult to program in Matlab due to limited parallelization and message-passing features. Other low-sync muscles are programmed in LowSyncBlockArnoldi as well, and the user can easily integrate their own. Note that BCGS-PIP does not require a muscle, and BMGS-CWY, BMGS-ICWY, and BCGSI+LS only call a muscle once, in the first iteration of a new basis. BMGS-SVL and BMGS-LTS are forced to use their column-wise counterparts MGS-SVL and MGS-LTS (both 3-sync), respectively, and global methods are forced to use the global muscle (i.e., normalization without intraorthogonalization via the scaled Frobenius norm).

  • modification: none (FOM), harmonic (GMRES)

All results are generated by the LowSyncBlockArnoldi Matlab package. A single script (paper_script.m) comprises all the calls for generating the results in this manuscript. LowSyncBlockArnoldi is written as modularly as possible, to facilitate the exchange of inner products, skeletons, muscles, and modifications. While the timings reported certainly do not reflect the optimal performance for any of the methods, they do reflect a fair comparison across implementations and provide insights for possible speed-ups when these methods are ported to more complex architectures. The code is also written so that sync points (inner_prod and intra_ortho) and other potentially communication-intensive operations (matvec and basis_eval) are separate functions that can be tuned individually.

Every test script (including the example from Section 4) has been executed in Matlab R2019b on 16 threads of a single, standard node of the Linux Cluster Mechthild at the Max Planck Institute for Dynamics of Complex Technical Systems in Magdeburg, Germany. A standard node comprises 2 Intel Xeon Silver 4110 (Skylake) CPUs with 8 cores each (64 KB L1 cache, 1024 KB L2 cache), a clock rate of 2.1 GHz (3.0 GHz max), and 12 MB shared L3 cache each. We further focus on small problems that easily fit in the L3 cache, which is easy to guarantee with sparse A, \(n \leq 10^{4}\), and s ≤ 10. Given that the latency between CPUs on a single node is small relative to that of exascale machines, we expect the small improvements observed in these test cases to translate to bigger gains in more complex settings.

For the timings, we measure the total time spent to reach a specified error tolerance. We run each test 5 times and average the timings. We also record several intermediate measures, namely counts of A-calls, applications of \(\boldsymbol {\mathcal {V}}_{k}\), and sync points. In addition, we plot the convergence history in terms of the following quantities per iteration: relative residual, relative error, \(\kappa ([\boldsymbol {B} A \boldsymbol {\mathcal {V}}_{k}])\), and loss of orthogonality (LOO) (3). When a ground-truth solution \(\boldsymbol{X}_{*}\) is provided, the error is calculated as

$$ \left\|{\boldsymbol{X}_{k} - \boldsymbol{X}_{*}}\right\|_{\text{F}} / \left\|{\boldsymbol{X}_{*}}\right\|_{\text{F}}. $$

For all our examples, \(\boldsymbol{X}_{*}\) is computed by Matlab's built-in backslash operator. The residual is approximated by (9) and is scaled by \(\left \|{\boldsymbol {B}}\right \|_{\text{F}}\). A summary of the parameters for all benchmarks can be found in Table 3. Except for tridiag and lapl_2d, all examples are taken from the SuiteSparse Matrix Collection [40]. Via the suite_sparse.m script, it is possible to run tests on any matrix from this collection.

Table 3 Test properties and parameter choices

5.1 tridiag

The operator A is defined as a sparse, tridiagonal matrix with 1 on the off-diagonals and − 1,− 2,…,−n on the diagonal, where n is also the size of A. Clearly A is symmetric. The right-hand side B has two columns, where the first has identical entries \(\frac {1}{\sqrt {n}}\) and the second has entries 1,2,…,n. This example is procedural, in the sense that a user can choose a desired n. At the same time, a larger n necessarily leads to a worse condition number.
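A construction matching this description is, for example, the following; the actual tridiag generator shipped with LowSyncBlockArnoldi may differ in details.

```matlab
% Construction matching the description of the tridiag test problem.
n = 100;
e = ones(n, 1);
A = spdiags([e, -(1:n)', e], [-1, 0, 1], n, n);   % symmetric tridiagonal operator
B = [e / sqrt(n), (1:n)'];                        % two right-hand sides
```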

Figure 2 presents the total run time per configuration as well as operator counts as a bar chart; see Table 4 in the Appendix for more details. The fastest methods are the stabilized low-sync variants. Despite being the computationally cheapest classical method per iteration, cl-BCGS-PIP is notably slower than cl-BMGS, because its inherent instability requires restarting 3 times (and therefore additional applications of A and \(\boldsymbol {\mathcal {V}}_{k}\)) before converging. The method with the fewest \(\boldsymbol {\mathcal {V}}_{k}\) evaluations is cl-BMGS, which is to be expected, since the basis is split up and applied one block column at a time in the inner-most loop; see Algorithm 1.

Fig. 2 Results from tridiag example

The fastest global method, gl-BCGS-PIP, is significantly slower even than the slowest classical method. In fact, all global methods require over 6 times as many total iterations as the fastest classical method to converge; this is in line with the theory of Section 2.4. In this particular case, the floating-point savings per iteration do not outweigh the sheer amount of time needed for all the extra A-calls. Nevertheless, the one-sync global methods (gl-BCGS-PIP, gl-BMGS-CWY, gl-BMGS-ICWY, and gl-BCGSI+LS) have relatively low sync counts, compared even to cl-BMGS.

Figures 3 and 4 display convergence histories for a subset of the methods in Table 4 in the Appendix. The convergence histories for all global BMGS variants are very similar; we omit BMGS-SVL and BMGS-CWY, as they are visually identical to BMGS-LTS and BMGS-ICWY, respectively. BMGS is identical to BMGS-SVL and BMGS-LTS and is therefore also omitted.

Fig. 3 Convergence histories of some global variants for tridiag example

Fig. 4 Convergence histories of some classical variants for tridiag example

Both the classical and global variants of BCGS-PIP show the robustness of the adaptive restarting procedure in action. In the global case, the LOO exceeds \(\mathcal {O}\left (10^{-10}\right )\), and for cl-BCGS-PIP it reaches \(\mathcal {O}\left (1\right )\). Despite the loss of orthogonality, restarting allows the methods to recover and eventually converge. All other low-sync variants remain stable, restarting only once the basis size limit of m = 70 has been reached. Although hardly perceptible, BMGS-ICWY does have a slightly worse LOO than BMGS-LTS, which can be seen by zooming in on the last few iterations of the global plots in Fig. 3 or of the classical plots in Fig. 4.

We also note that the residual estimate (9) for all methods follows the same qualitative trend as the error. In the worst case, cl-BCGS-PIP, the residual is nearly 3 orders of magnitude lower than the error in some places, which could lead to a premature declaration of convergence. For all other methods, the difference is between 1 and 2 orders of magnitude. We would thus recommend setting the residual tolerance a couple orders of magnitude lower than the target error in practice, to ensure that the true error is small enough.

5.2 1138_bus

Now we turn to a slightly more complicated matrix. The matrix A comes from a power network problem and is real and symmetric positive definite, while entries of B are drawn randomly from the uniform distribution. Moreover we apply an incomplete LU (ILU) preconditioner with no fill, using Matlab ’s built-in ilu.

Even with the preconditioner, none of the global methods converges. We adjusted the thread count to see if it would aid convergence, to no avail. This is perhaps an extreme case of [22, Theorem 3.3], wherein the global method is much less accurate than the classical method in the first cycle and cannot manage to catch up even after restarting. A preconditioner better attuned to the structure of the problem may alleviate stagnation for global methods, but we do not explore this here.

In Fig. 5 we see the performance results for the convergent classical methods; more details can be found in Table 5 in the Appendix. Most notably, the one-sync methods BMGS-CWY, BMGS-ICWY, and BCGSI+LS improve over BMGS only slightly in terms of timings. BCGS-PIP is much slower, due to a quick loss of orthogonality and the consequent need to restart more often. However, it is clear that the sync counts for all one-sync methods are drastically reduced compared to that of BMGS.

Fig. 5 Performance results for 1138_bus example

We examine the convergence histories of cl-BCGS-PIP and cl-BMGS-ICWY more closely in Fig. 6. Although not discernible on the graph, we found that cl-BCGS-PIP actually restarts every 28 iterations, meaning in the first cycle it encountered a NaN-flag and reduced the maximum basis size to m = 28 for all subsequent cycles. Instability in the first cycle thus hinders cl-BCGS-PIP greatly. On the other hand, BMGS-ICWY (as well as the other variants) is stable enough to exhaust the entire basis size allowance, which allows for further error reduction in the first cycle.

Fig. 6 Subset of convergence histories for 1138_bus example

5.3 circuit_2

The next example comes from a circuit simulation problem. The matrix A is real but not symmetric or positive definite. We again apply an ILU preconditioner with no fill.

All the one-sync classical and global methods converge, and their performance data is presented in Fig. 7 with further details in Table 6 in the Appendix. In fact, some global methods, like gl-BCGS-PIP, are even faster than some classical methods, due to the fact that they require the same number of iterations to converge, and therefore fewer floating-point operations.

Fig. 7 Performance results for circuit_2 example

Figure 8 demonstrates how close in accuracy the global and classical BCGS-PIP variants are for this problem. The global method even has a slightly better LOO, but it should be noted that global LOO is measured according to a different inner product than classical LOO; see Section 2.2 and (3).

Fig. 8 Convergence histories of the BCGS-PIP variants for the circuit_2 example

5.4 rajat03

Another circuit simulation problem highlights slightly different behavior. In this case, A is again real but neither symmetric nor positive definite, and we again use an ILU preconditioner with no fill.

Figure 9 summarizes the performance results, with details given in Table 7 in the Appendix. It should be noted right away that cl-BCGS-PIP fails to converge for this problem, while gl-BCGS-PIP does converge and takes second place in terms of the timings. More specifically, cl-BCGS-PIP encounters a NaN-flag it cannot resolve: every time it reduces the basis size, it encounters another NaN-flag. Because global methods do not use Cholesky at all, however, non-positive definite factors do not pose a problem, unless their trace is numerically zero, which occurs with very low probability. Otherwise, cl-BMGS-CWY shows a small improvement over cl-BMGS.

Fig. 9 Performance results for rajat03 example

Table 7 in the Appendix confirms that none of the methods requires restarting despite how high the condition number becomes in later iterations; see also Fig. 10. It is again interesting to see how close the error and residual plots are between the global and classical methods. In fact, the residual for the global method underestimates convergence by a couple orders of magnitude.

Fig. 10 Convergence histories of the two fastest variants for the rajat03 example

5.5 Kaufhold

This example treats a nearly numerically singular matrix with an extremely high condition number. Notably, the norm of A is nearly \(\mathcal {O}\left (10^{15}\right )\). The matrix is real but neither symmetric nor positive definite, and it was designed to trigger a bug in Gaussian elimination in a 2002 version of Matlab. We again apply an ILU preconditioner with no fill.

Figure 11 shows cl-BCGS-PIP to be the fastest of the classical one-sync methods, but the improvement over cl-BMGS is small. The global methods are all much slower. A look at the convergence histories in Fig. 12 shows a stubborn error curve despite significant progress in the initial iterations. For both BCGS-PIP methods the LOO is moderately high in the first cycle, matching the high condition numbers, but the situation is not bad enough to trigger a NaN-flag, and the LOO drops after restarting.

Fig. 11 Performance results for Kaufhold example

Fig. 12 Convergence histories of the BCGS-PIP variants for the Kaufhold example

5.6 t2d_q9

We now examine a nonlinear diffusion problem, specifically a temperature field discretized on a biquadratic mesh. The matrix A is real but not symmetric or positive definite, and we again use an ILU preconditioner with no fill.

Figure 13 shows that both BCGS-PIP variants are the fastest overall, with cl-BMGS in second-to-last place; see Table 9 in the Appendix for more details. Interestingly, even gl-BMGS is faster than cl-BMGS in this scenario.

Fig. 13 Performance results for t2d_q9 example

Both BCGSI+LS variants are rather slow in this example. Despite having just one sync per iteration, BCGSI+LS does generally have a higher complexity than its one-sync counterparts, which manifests here as a disadvantage.

The convergence behavior for the BCGS-PIP variants is given in Fig. 14. Here we see that, despite the global condition number varying much more than the classical one, the global LOO is overall much lower. This phenomenon is not unique to this example; it just happens to be more noticeable here.

Fig. 14 Convergence histories of the BCGS-PIP variants for the t2d_q9 example

5.7 lapl_2d

Our last problem is taken directly from [21, Section 5.4], a discretized two-dimensional Laplacian matrix. A is thus banded, real, and symmetric positive definite. We do not apply a preconditioner and look at all skeletons considered in the text.

Figure 15 shows the performance results; more details can be found in Table 4 in the Appendix. All one-sync classical methods except for cl-BCGSI+LS beat cl-BMGS, as do a number of global methods. The slowest classical methods are the three-sync ones, with some one-sync global methods following behind. The fastest method, cl-BCGS-PIP, also happens to have the highest A count and the most applications of \(\boldsymbol {\mathcal {V}}_{k}\), due to its high number of restarts. Both cl-BMGS-CWY and cl-BMGS-ICWY, however, have lower sync, A, and \(\boldsymbol {\mathcal {V}}_{k}\) counts and are very close in terms of timings.

Fig. 15 Performance results for lapl_2d example

The methods with the highest sync counts are cl-BMGS-SVL and cl-BMGS-LTS. The reason is that they cannot use CholQR as a muscle, and this problem requires many iterations to converge. LowSyncBlockArnoldi is written to count sync points within the muscles as well, and with MGS-SVL and MGS-LTS each contributing 1 + 3s per call, the total number of sync points eventually passes that of cl-BMGS, which can use a communication-light muscle like CholQR.

6 Conclusions and outlook

Stability bounds and floating-point analysis are challenging to work out rigorously, and it is therefore simultaneously important to search for counterexamples and edge cases while trying to prove conjectured bounds. In general, rigorous loss of orthogonality and backward error bounds for all these methods could lead to new insights and improvements in the quest for a reliable, scalable Krylov subspace solver. Our flexible benchmarking tool can aid in that process, and it can easily be extended to accommodate new algorithm configurations, test cases, and measures.

At the same time, low-sync block Arnoldi algorithms with adaptive restarting are clearly already useful and robust enough for a wide variety of problems, especially where A is reasonably conditioned and memory limitations cap basis sizes. In every benchmark, we have observed that at least one low-sync method outperformed both the classical and global BMGS-based Arnoldi methods. More research is needed to determine which low-sync skeletons are best for which problems and architectures, particularly computational models that account not only for operation counts but also for performance variations relative to block size [26, 41, 42]. Most likely the best configuration allows for switching between skeletons and muscles depending on convergence behavior.

For scenarios where the basic adaptive restarting procedure is not sufficient to rescue convergence, it might be possible to improve the heuristics with a cheap estimate of the loss of orthogonality computed via, e.g., a randomized sketched inner product [39]. With such a cheap estimate, we could not only decrease the basis size when problems arise but also increase it again in later cycles. Randomized algorithms themselves are known to reduce communication, and a thorough comparison and combination of the methods proposed here and in [39] could lead to a powerful Krylov subspace method well suited for exascale architectures.

Global methods are unfortunately less promising. They are almost always slower than even the slowest classical method, because they require more cycles, and thus more operator calls and sync points, to converge. However, the benchmarks do suggest that, in cases with a good preconditioner known to guarantee convergence in a few iterations, global methods may become competitive again, especially in single-node or "laptop" applications, where their reduced computational cost per iteration is favorable.