1 Introduction

A sparse triangular solver is an important computational kernel for an iterative linear solver in various numerical simulations. It is the main component of the Gauss–Seidel (GS) smoother, SOR method and IC/ILU preconditioning, which are used as building blocks in various computational science or engineering analyses (Beauwens 2004; Saad 2003; Meurant 1999). Therefore, the development of a fast multithreaded sparse triangular solver is essential to accelerate these analyses when conducted on not only a single computing node but also a large-scale cluster system of nodes. For example, the performance of the solver significantly influences the total simulation time of large-scale partial differential equation analysis using a multigrid solver with the GS, IC, or ILU smoother (Wallin et al. 2006; Buckeridge and Scheichl 2010). However, it is well known that the sparse triangular solver, which consists of forward and backward substitutions, cannot be straightforwardly parallelized (Dongarra et al. 1990, 1998). Thus, in this paper, we discuss an effective approach to developing a high-performance multithreaded sparse triangular solver.

There are various methods for parallelizing a sparse triangular solver or related techniques, and we focus on the parallel ordering (reordering) method, which is one of the most common approaches. There are several well-known orderings, such as dissection and domain decomposition orderings, but multi-color ordering is the most commonly used technique. It has been used in various applications to parallelize, for example, the ICCG method. However, it is well known that multi-color ordering entails a trade-off between convergence and the number of synchronizations (Doi and Washio 1999). An increase in the number of colors typically results in better convergence, but it also leads to an increase in the number of synchronizations, which is proportional to the number of colors. This trade-off between convergence and parallelism is a common issue for parallel ordering techniques (Duff and Meurant 1989).

One solution to the above trade-off problem is block multi-coloring. In this method, multi-color ordering is applied to blocks of unknowns. The technique has been investigated in several contexts. The concept of block coloring or block independent sets can be found in (Saad 2003). In an early work (Block et al. 1990), it was discussed for the parallel SOR method. For the parallelization of IC/ILU preconditioned iterative solvers, it was first investigated for the finite difference method, that is, structured grid analysis (Iwashita and Shimasaki 2003; Iwashita et al. 2005). In this research, block coloring proved to be effective for improving convergence without increasing the number of thread synchronizations. Following these research activities, the algebraic block multi-color ordering method was introduced for general sparse linear systems (Iwashita et al. 2012). Although there are various options for coloring or blocking methods (Jones and Plassmann 1994; Iwashita and Shimasaki 2002), this technique has been used in various applications because of its advantages in terms of convergence, data locality, and the number of synchronizations (Semba et al. 2013; Tsuburaya et al. 2015). In particular, several high-performance implementations of the HPCG benchmark adopt the technique, which shows the effectiveness of the method in a fast multigrid solver with a parallel GS smoother (Park et al. 2014; Kumahata et al. 2016; Yoshifuji et al. 2016; Vermij et al. 2017; Ruiz et al. 2018). However, block multi-coloring has a drawback with respect to SIMD vectorization: because the unknowns in each block have data dependencies with one another, they cannot be processed in parallel, which prevents SIMD instructions from being applied across rows within a block.

Because the sparse triangular solver is a memory-intensive kernel, its performance on earlier computers was not substantially affected by the use of SIMD instructions. However, to increase floating-point performance, recent processors have enhanced their SIMD instructions, and the SIMD width (vector length) is becoming larger. For example, the Intel Xeon (Skylake) (Hammond et al. 2018), Intel Xeon Phi (Sodani et al. 2016), and Fujitsu A64FX (ARM SVE) (Feldman 2018) processors are equipped with 512-bit SIMD instructions. We note that ARM SVE supports vector lengths of up to 2,048 bits (Stephens et al. 2017). Considering this trend, we aim to develop a parallel sparse triangular solver in which both multithreading and SIMD vectorization are used efficiently.

In this paper, we propose a new parallel ordering technique in which SIMD vectorization can be used and the advantages of block multi-color ordering, that is, fast convergence and fewer synchronizations, are preserved. The technique is called “hierarchical block multi-color ordering” and it has a mathematically equivalent solution process (convergence) to block multi-color ordering. Moreover, the number of synchronizations in the multithreaded substitutions is the same as that of block multi-color ordering. We conduct seven numerical tests using finite element electromagnetic field analysis code and matrix data obtained from a matrix collection, and confirm the effectiveness of the proposed method in the context of the parallel ICCG solver.

2 Sparse triangular solver

In this paper, we consider the following n-dimensional linear system of equations:

$$\begin{aligned} {{\varvec{A}}}{{\varvec{x}}}= {{\varvec{b}}}. \end{aligned}$$
(1)

We discuss the case in which the linear system (1) is solved using an iterative linear solver involving IC(0)/ILU(0) preconditioning, the Gauss-Seidel (GS) method (smoother), or the SOR method. When we discuss a parallel ICCG (more precisely, IC(0)-CG) solver for (1), we assume that the coefficient matrix \({{\varvec{A}}}\) is symmetric and positive definite or positive semi-definite. For the parallelization of the iterative solvers that we consider, the most problematic part is the sparse triangular solver kernel. For example, in an IC/ILU preconditioned Krylov subspace iterative solver, the other computational kernels consist of inner products, matrix-vector multiplications, and vector updates, which can be parallelized straightforwardly. The sparse triangular solver kernel is given by the following forward and backward substitutions:

$$\begin{aligned} {{\varvec{y}}}= {{\varvec{L}}}^{-1} {{\varvec{r}}}, \end{aligned}$$
(2)

and

$$\begin{aligned} {{\varvec{z}}}= {{\varvec{U}}}^{-1} {{\varvec{y}}}, \end{aligned}$$
(3)

where \({{\varvec{r}}}\), \({{\varvec{y}}}\), and \({{\varvec{z}}}\) are n-dimensional vectors. Matrices \({{\varvec{L}}}\) and \({{\varvec{U}}}\) are, respectively, lower and upper triangular matrices with the same nonzero patterns as the lower and upper triangular parts of \({{\varvec{A}}}\). In ILU (IC) preconditioning, the preconditioning step is given by (2) and (3), and triangular matrices \({{\varvec{L}}}\) and \({{\varvec{U}}}\) are derived from the following incomplete factorization:

$$\begin{aligned} {{\varvec{A}}}\simeq {{\varvec{L}}}{{\varvec{U}}}. \end{aligned}$$
(4)

The iteration steps in the GS and SOR methods or smoothers can be expressed by similar substitutions. The substitution is an inherently sequential process, and it cannot be parallelized (multithreaded) straightforwardly.
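
To make this data dependence explicit, the following is a minimal C sketch (our illustration, not the implementation used in this paper) of the forward substitution (2). The strictly lower triangular part of \({{\varvec{L}}}\) is assumed to be stored in the compressed sparse row (CSR) format and its diagonal in a separate array; all identifiers are hypothetical.

```c
/* A minimal sketch (ours) of the sequential forward substitution (2).
 * The strictly lower triangular part of L is assumed to be stored in the CSR
 * format (ptr, col, val) and its diagonal in diag; all names are hypothetical. */
void forward_substitution(int n, const int *ptr, const int *col,
                          const double *val, const double *diag,
                          const double *r, double *y)
{
    for (int i = 0; i < n; i++) {           /* rows must be processed in order */
        double s = r[i];
        for (int jj = ptr[i]; jj < ptr[i + 1]; jj++)
            s -= val[jj] * y[col[jj]];      /* needs y[j] of earlier rows j < i */
        y[i] = s / diag[i];
    }
}
```

Row i cannot be processed until every earlier row j with a nonzero \(l_{ij}\) has been completed, which is exactly the dependence that parallel ordering methods are designed to break.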

3 Parallel ordering method

A parallel ordering (reordering) method is one of the most popular parallelization methods for a sparse triangular solver, that is, forward and backward substitutions. It transforms the coefficient matrix into an appropriate form for parallel processing by reordering the unknowns or their indices. Let the reordered linear system of (1) be denoted by

$$\begin{aligned} \bar{{{\varvec{A}}}} \bar{{{\varvec{x}}}} = \bar{{{\varvec{b}}}}. \end{aligned}$$
(5)

Then, the reordering is given by the transformation:

$$\begin{aligned} \bar{{{\varvec{x}}}} = {{\varvec{P}}}_{\pi } {{\varvec{x}}}, \end{aligned}$$
(6)

where \({{\varvec{P}}}_{\pi }\) is a permutation matrix. When we consider index set \(I=\{1, 2, \ldots , n\}\) that corresponds to the index of each unknown, the reordering is the permutation of the elements of I. In the present paper, the reordering function of the index is denoted by \(\pi\); that is, the i-th unknown of the original system is moved to the \(\pi (i)\)-th unknown of the reordered system. In the reordering technique, the coefficient matrix and right-hand side are given as follows:

$$\begin{aligned} \bar{{{\varvec{A}}}}={{\varvec{P}}}_{\pi } {{\varvec{A}}}{{\varvec{P}}}_{\pi }^{\top }, \quad \bar{{{\varvec{b}}}}={{\varvec{P}}}_{\pi } {{\varvec{b}}}. \end{aligned}$$
(7)
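
As an illustration of (6) and (7), the following minimal C sketch (ours, not from the paper) applies a permutation to a matrix stored in the CSR format and to the right-hand side vector; perm is a hypothetical 0-based array holding \(\pi\). Column indices within each reordered row are left unsorted, which is sufficient for illustration.

```c
/* A minimal sketch (ours) of forming the reordered system (7) for a matrix
 * stored in the compressed sparse row (CSR) format.
 * perm[i] = pi(i) gives the new (0-based) index of the i-th unknown. */
#include <stdlib.h>

typedef struct {
    int n;        /* dimension                  */
    int *ptr;     /* row pointers, length n+1   */
    int *col;     /* column indices, length nnz */
    double *val;  /* values, length nnz         */
} csr_t;

void reorder_system(const csr_t *A, const double *b, const int *perm,
                    csr_t *Abar, double *bbar)
{
    int n = A->n, nnz = A->ptr[n];
    Abar->n   = n;
    Abar->ptr = calloc(n + 1, sizeof(int));
    Abar->col = malloc(nnz * sizeof(int));
    Abar->val = malloc(nnz * sizeof(double));

    /* row i of A becomes row perm[i] of Abar: count its nonzeros first */
    for (int i = 0; i < n; i++)
        Abar->ptr[perm[i] + 1] = A->ptr[i + 1] - A->ptr[i];
    for (int i = 0; i < n; i++)
        Abar->ptr[i + 1] += Abar->ptr[i];

    /* element (i, j) of A becomes element (perm[i], perm[j]) of Abar */
    for (int i = 0; i < n; i++) {
        int dst = Abar->ptr[perm[i]];
        for (int jj = A->ptr[i]; jj < A->ptr[i + 1]; jj++, dst++) {
            Abar->col[dst] = perm[A->col[jj]];   /* columns stay unsorted */
            Abar->val[dst] = A->val[jj];
        }
    }

    /* right-hand side: bbar = P_pi b */
    for (int i = 0; i < n; i++)
        bbar[perm[i]] = b[i];
}
```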

3.1 Equivalence of orderings

We consider the case in which the two linear systems (1) and (5) are solved using an identical iterative method. The approximate solution vectors at the j-th iteration for (1) and (5) are denoted by \({{\varvec{x}}}^{(j)}\) and \(\bar{{{\varvec{x}}}}^{(j)}\), respectively. If it holds that

$$\begin{aligned} \bar{{{\varvec{x}}}}^{(j)} = {{\varvec{P}}}_{\pi } {{\varvec{x}}}^{(j)} \end{aligned}$$
(8)

at every j-th step under the setting \(\bar{{{\varvec{x}}}}^{(0)} ={{\varvec{P}}}_{\pi } {{\varvec{x}}}^{(0)}\) for the initial guesses, then we say that these two solution processes are equivalent. For example, in the Jacobi method and most Krylov subspace methods, reordering does not affect convergence; that is, the solution process for any reordered system is (mathematically) equivalent to that for the original system. However, for the iterative solvers considered in this paper, such as IC/ILU preconditioned iterative solvers, the solution processes are typically not equivalent because of the sequentiality involved in the triangular solver (substitutions). Nevertheless, there are special cases in which the reordered system has a solution process equivalent to that of the original system. In these cases, we say that the two (original and new) orderings are equivalent, or that \(\pi\) is an equivalent reordering.

We define the equivalence of two orderings as follows. For the GS and SOR methods, equivalence is given by (8) under the proper setting of the initial guesses. For IC(0)/ILU(0) preconditioning, we denote the incomplete factorization matrices of \(\bar{{{\varvec{A}}}}\) by \(\bar{{{\varvec{L}}}}\) and \(\bar{{{\varvec{U}}}}\); the preconditioning step of the reordered linear system is then given by \(\bar{{{\varvec{z}}}} = (\bar{{{\varvec{L}}}} \bar{{{\varvec{U}}}})^{-1} \bar{{{\varvec{r}}}}\). If \(\bar{{{\varvec{z}}}}={{\varvec{P}}}_{\pi } {{\varvec{z}}}\) is satisfied under \(\bar{{{\varvec{r}}}} = {{\varvec{P}}}_{\pi } {{\varvec{r}}}\), then we say that the orderings are equivalent. For example, the ICCG (IC(0)-CG) method exhibits an equivalent solution process for the original and reordered linear systems when the orderings are equivalent.

The condition for an equivalent reordering is given as follows: when the following ER condition is satisfied, \(\pi\) is an equivalent reordering.

ER (Equivalent Reordering) Condition —

$$\begin{aligned}&\forall i_{1}, i_{2} \in I \ \mathrm{such} \ \mathrm{that} \ a_{i_{1}, i_{2}} \ne 0 \ \vee \ a_{i_{2}, i_{1}} \ne 0, \nonumber \\&\quad \text{ sgn } (i_{1}-i_{2}) = \text{ sgn } (\pi (i_{1})-\pi (i_{2})), \end{aligned}$$
(9)

where \(a_{i_{1}, i_{2}}\) denotes the \(i_{1}\)-th row \(i_{2}\)-th column element of \({{\varvec{A}}}\). For a further explanation, we introduce an ordering graph, which is the directed graph that corresponds to the coefficient matrix. Each node of the graph corresponds to an unknown or its index. An edge between two nodes \(i_{1}\) and \(i_{2}\) exists only when the \(i_{1}\)-th row \(i_{2}\)-th column element or \(i_{2}\)-th row \(i_{1}\)-th column element is nonzero. The direction of the edge (arrow) shows the order of two nodes. Figure 1 shows an example of the ordering graph. Using the ordering graph, (9) can be rewritten as the statement that the new and original orderings have the same ordering graph. In (Doi and Lichnewsky 1991), the authors stated that the ordering graph provides a unique class of mutually equivalent orderings. In the appendix, we provide a proof sketch of the relationship between (9) and the equivalence of orderings.
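
For concreteness, the ER condition (9) can be checked directly from the nonzero pattern. The following is a minimal C sketch (ours, with hypothetical identifiers) for a matrix stored in the CSR format, where perm holds \(\pi\) in 0-based form; because the condition is symmetric in \(i_{1}\) and \(i_{2}\), scanning every stored off-diagonal nonzero covers both disjuncts of (9).

```c
/* A minimal sketch (ours) that checks the ER condition (9) for a matrix stored
 * in the CSR format (ptr, col); perm[i] = pi(i), 0-based. */
static int sgn(int x) { return (x > 0) - (x < 0); }

int is_equivalent_reordering(int n, const int *ptr, const int *col,
                             const int *perm)
{
    for (int i = 0; i < n; i++)
        for (int jj = ptr[i]; jj < ptr[i + 1]; jj++) {
            int j = col[jj];
            if (i != j && sgn(i - j) != sgn(perm[i] - perm[j]))
                return 0;   /* the order of a dependent pair changed */
        }
    return 1;               /* pi preserves the ordering graph */
}
```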

Fig. 1 Example of an ordering graph

4 Hierarchical block multi-color ordering method

In this paper, we propose a new parallel ordering method for the vectorization and parallelization of a sparse triangular solver. Additionally, the proposed ordering is intended to inherit the advantages of convergence, number of synchronizations, and data locality from block multi-color ordering (BMC). The proposed parallel ordering is called hierarchical block multi-color ordering (HBMC), which is equivalent to BMC in terms of convergence.

In the proposed technique, we first order the unknowns using BMC, and then reorder them again. We focus on explaining this secondary reordering because we use the conventional algorithm given in (Iwashita et al. 2012) for the application of BMC. Hereafter, the linear system that has already been reordered by BMC is regarded as the original system and is written as (1), and the secondary reordering is denoted by \(\pi\). Thus, the final reordered linear system based on HBMC is given by (5).

4.1 Block multi-color ordering (BMC)

In this subsection, we briefly introduce BMC and some notation required for the explanation of HBMC. In BMC, all unknowns are divided into blocks of the same size, and the multi-color ordering is applied to the blocks. Because blocks that have an identical color are mutually independent, the forward and backward substitutions are parallelized based on the blocks in each color. The number of (thread) synchronizations of parallelized (multithreaded) substitution is given by \(n_{c}-1\), where \(n_{c}\) is the number of colors. Figure 2 shows the coefficient matrix that results from BMC.

In the present paper, the block size and k-th block in color c are denoted by \(b_{s}\) and \(b_{k}^{(c)}\), respectively. In BMC, each unknown (or its index) is assigned to a certain block, as shown in Fig. 3, where n(c) is the number of blocks in color c.

4.2 Hierarchical block multi-color ordering (HBMC)

In the proposed HBMC, a new (hierarchical) block structure is introduced. First, we define a level-1 block (or multithreaded block) as follows. The block consists of w consecutive blocks of BMC in each color. When the \(k'\)-th level-1 block in color c is written as \(\bar{b}_{k'}^{(c)}\),

$$\begin{aligned} \bar{b}_{k'}^{(c)} = \bigcup _{k=k_{s}+1}^{k_{s}+w} b_{k}^{(c)}, \end{aligned}$$
(10)

where \(k_{s}=(k'-1) \times w\). We note that parameter w is determined by the SIMD width (vector length) of the target processor. It is typically 4 or 8, and is expected to become larger in the future.

Fig. 2 Coefficient matrix based on BMC

Fig. 3 Blocks of BMC and secondary reordering for HBMC

Fig. 4 Secondary reordering in a level-1 block

In our technique, secondary reordering is performed on each level-1 block, as shown in Fig. 3. Without loss of generality, we describe the reordering process for a single level-1 block, that is, the blocks from \(b_{k_{s}+1}^{(c)}\) to \(b_{k_{s}+w}^{(c)}\) of BMC. In the first step, we pick up the first (top) unknown of each block, \(b_{k_{s}+1}^{(c)}\), \(b_{k_{s}+2}^{(c)}, \ldots , b_{k_{s}+w}^{(c)}\), and order the picked unknowns. These w unknowns are mutually independent because the blocks in each color are independent in BMC. In the next step, we pick up the second unknown of each block, which are again mutually independent, and order them after the previously ordered unknowns. We repeat this process until no unknowns remain; in total, the pick-up process is performed \(b_s\) times. Figure 4 shows the secondary reordering process in level-1 blocks when \(n_{c}=2\), \(b_{s}=2\), and \(w=4\), where each unknown is associated with a diagonal element of the coefficient matrix. In the figure, the colored elements represent nonzero elements. Although only one level-1 block exists in each color in Fig. 4, there are typically many more level-1 blocks in each color. After the reordering process is complete, the level-2 block structure appears in the reordered coefficient matrix; it is given by the w\(\times\)w (small) diagonal matrices. The level-2 block structure is used for SIMD vectorization of the substitutions. Figure 5 shows the form of the coefficient matrix based on HBMC.

To summarize, the reordering process from BMC to HBMC is as follows: In the first step, we generate level-1 blocks in each color, each of which consists of w blocks of BMC. A level-1 block thus involves \(b_{s} w\) unknowns in total. In the second step, we reorder the unknowns in each level-1 block. For the explanation, we introduce local labels for the unknowns in the level-1 block, which run from 1 to \(b_{s} w\). For the application of HBMC, we reorder the unknowns in accordance with the value of \(\mathrm{mod}(i_{l}-1, w)\) in the level-1 block, where \(i_{l}\) is the local label. Finally, we obtain the coefficient matrix form shown in Fig. 5 using the above procedure.
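
The picking procedure described above amounts to a simple index permutation inside each level-1 block. The following minimal C sketch (ours; the 0-based index arithmetic is an illustration of the description above, not code from the paper) fills the corresponding entries of a global permutation array perm for one level-1 block that starts at global index base:

```c
/* A minimal sketch (ours) of the secondary reordering inside one level-1
 * block.  The block starts at global 0-based index "base" and contains w
 * consecutive BMC blocks of bs unknowns each.  The picking process places the
 * p-th unknown of the j-th BMC block at local position p*w + j, so that the
 * first unknowns of all w blocks come first, then the second unknowns, etc. */
void hbmc_local_permutation(int base, int bs, int w, int *perm)
{
    for (int j = 0; j < w; j++)        /* BMC block inside the level-1 block */
        for (int p = 0; p < bs; p++)   /* position inside the BMC block      */
            perm[base + j * bs + p] = base + p * w + j;
}
```

Applying this map to every level-1 block of every color yields the secondary reordering \(\pi\).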

4.2.1 Equivalence between BMC and HBMC

We prove that HBMC is equivalent to BMC; that is, the convergence rates of the linear solvers based on the two orderings are the same. Because the secondary reordering for HBMC is performed locally in each level-1 block, the order between two unknowns that belong to two different level-1 blocks is preserved in the final order. Consequently, it holds that

$$\begin{aligned}&\forall i_{1} \in \bar{b}_{k_{1}}^{(c_{1})}, i_{2} \in \bar{b}_{k_{2}}^{(c_{2})} \ \mathrm{such} \ \mathrm{that} \ c_{1} \ne c_{2} \, \vee \, k_{1} \ne k_{2}, \nonumber \\&\quad \text{ sgn } (i_{1}-i_{2}) = \text{ sgn } (\pi (i_{1})-\pi (i_{2})). \end{aligned}$$
(11)
Fig. 5 Coefficient matrix based on HBMC

From (11), if the local ordering subgraphs of BMC and HBMC that correspond to each level-1 block are identical, then the two orderings are equivalent. Next, we examine the reordering process in a level-1 block. In the secondary reordering process of HBMC, the order of unknowns that belong to different BMC blocks changes. However, the reordering process for these unknowns does not affect the ordering graph, that is, the convergence. In BMC, the unknowns that belong to two different blocks in the same color have no data relationship with one another; that is, there are no edges between them in the ordering graph. Therefore, even if we change the order of unknowns that belong to different BMC blocks, this does not affect the ordering graph. Consequently, we now pay attention to the influence of reordering inside a BMC block. When we analyze the above picking process, we can confirm that the order of the unknowns that belong to the same BMC block is preserved in the final order. In each block, we pick up the first unknown, and then the second, and continue this process. Therefore, the order does not change for these unknowns:

$$\begin{aligned} \forall i_{1}, i_{2} \in b_{k}^{(c)} \ \text{ sgn } (i_{1}-i_{2}) = \text{ sgn } (\pi (i_{1})-\pi (i_{2})). \end{aligned}$$
(12)

Considering the mutual independence of the BMC blocks in each color together with (11) and (12), we can conclude that the secondary reordering \(\pi\) does not change the form of the ordering graph. This proves that HBMC is equivalent to BMC.

Fig. 6 Ordering graphs in a five-point stencil problem and coefficient matrix based on HBMC

As an example that shows the relationship between BMC and HBMC, Fig. 6 demonstrates the ordering of nodes (unknowns) in a five-point finite difference analysis. Fig. 6a and b show that BMC and HBMC have identical ordering graphs. Consequently, the two orderings are equivalent in terms of convergence. Figure 6c shows the coefficient matrix based on HBMC, which involves the hierarchical block structures.

4.3 Parallelization and vectorization of forward and backward substitutions

Corresponding to the colors of the unknowns, solution vector \(\bar{{{\varvec{x}}}}\) and coefficient matrix \(\bar{{{\varvec{A}}}}\) are split as

$$\begin{aligned} \bar{{{\varvec{x}}}}= \left( \begin{array}{c} \bar{{{\varvec{x}}}}_{1} \\ \bar{{{\varvec{x}}}}_{2} \\ \vdots \\ \bar{{{\varvec{x}}}}_{n_{c}} \\ \end{array}\right) , \end{aligned}$$
(13)

and

$$\begin{aligned} \bar{{{\varvec{A}}}}= \left( \begin{array}{cccc} \bar{{{\varvec{C}}}}_{1,1} &{} \bar{{{\varvec{C}}}}_{1,2} &{} \ldots &{} \bar{{{\varvec{C}}}}_{1,n_{c}}\\ \bar{{{\varvec{C}}}}_{2,1} &{} \bar{{{\varvec{C}}}}_{2,2} &{} \ldots &{} \bar{{{\varvec{C}}}}_{2,n_{c}}\\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ \bar{{{\varvec{C}}}}_{n_{c},1} &{} \bar{{{\varvec{C}}}}_{n_{c},2} &{} \ldots &{} \bar{{{\varvec{C}}}}_{n_{c},n_{c}} \\ \end{array}\right) , \end{aligned}$$
(14)

where \(\bar{{{\varvec{x}}}}_{c}\) corresponds to the unknowns with color c, and \(\bar{{{\varvec{C}}}}_{c,d}\) represents the relationship between \(\bar{{{\varvec{x}}}}_{c}\) and \(\bar{{{\varvec{x}}}}_{d}\). Hereafter, we assume that the size of \(\bar{{{\varvec{x}}}}_{c}\) is a multiple of \(b_{s}w\); in the analysis program, this assumption is satisfied by padding with dummy unknowns. Let the number of level-1 blocks assigned to color c be denoted by \(\bar{n}(c)\); then diagonal block \(\bar{{{\varvec{C}}}}_{c,c}\) of \(\bar{{{\varvec{A}}}}\) is given by the following block diagonal matrix:

$$\begin{aligned} \bar{{{\varvec{C}}}}_{c,c} = \left( \begin{array}{cccc} \bar{{{\varvec{B}}}}_{1}^{(c)} &{} &{} &{} \mathbf{0} \\ &{} \bar{{{\varvec{B}}}}_{2}^{(c)} &{} &{} \\ &{} &{} \ddots &{} \\ \mathbf{0} &{} &{} &{} \bar{{{\varvec{B}}}}_{\bar{n}(c)}^{(c)} \\ \end{array}\right) , \end{aligned}$$
(15)

where \(\bar{{{\varvec{B}}}}_{k}^{(c)}\) is the \(b_{s}w\)\(\times\)\(b_{s}w\) matrix that corresponds to the unknowns in the k-th level-1 block with color c, which we denote by \(\bar{b}_{k}^{(c)}\). Moreover, matrix \(\bar{{{\varvec{B}}}}_{k}^{(c)}\) is written as

$$\begin{aligned} \bar{{{\varvec{B}}}}_{k}^{(c)} = \left( \begin{array}{cccc} \bar{{{\varvec{D}}}}_{1}^{(k, c)} &{} \bar{{{\varvec{E}}}}_{1,2}^{(k,c)} &{} \ldots &{} \bar{{{\varvec{E}}}}_{1,b_{s}}^{(k,c)} \\ \bar{{{\varvec{E}}}}_{2,1}^{(k,c)} &{} \bar{{{\varvec{D}}}}_{2}^{(k, c)} &{} \ldots &{} \bar{{{\varvec{E}}}}_{2,b_{s}}^{(k,c)} \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ \bar{{{\varvec{E}}}}_{b_{s},1}^{(k,c)}&{} \bar{{{\varvec{E}}}}_{b_{s},2}^{(k,c)} &{} \ldots &{} \bar{{{\varvec{D}}}}_{b_{s}}^{(k, c)} \\ \end{array}\right) , \end{aligned}$$
(16)

where \(\bar{{{\varvec{D}}}}_{l}^{(k, c)}, (l=1, 2, \ldots , b_{s})\) are w\(\times\)w diagonal matrices.

The forward substitution included in ILU(0)/IC(0) preconditioners or GS and SOR methods uses a lower triangular matrix with the same nonzero element pattern as the lower triangular part of \(\bar{{{\varvec{A}}}}\). From (14) and (15), lower triangular matrix \(\bar{{{\varvec{L}}}}\) is written as

$$\begin{aligned} \bar{{{\varvec{L}}}}= \left( \begin{array}{cccc} \bar{{{\varvec{L}}}}_{1,1} &{} &{} &{} \\ \bar{{{\varvec{L}}}}_{2,1} &{} \bar{{{\varvec{L}}}}_{2,2} &{} &{} \mathbf{0} \\ \vdots &{} \ddots &{} \ddots &{} \\ \bar{{{\varvec{L}}}}_{n_{c},1} &{} \bar{{{\varvec{L}}}}_{n_{c},2} &{} \ldots &{} \bar{{{\varvec{L}}}}_{n_{c},n_{c}} \\ \end{array}\right) , \end{aligned}$$
(17)

and diagonal block \(\bar{{{\varvec{L}}}}_{c,c}\) is given by

$$\begin{aligned} \bar{{{\varvec{L}}}}_{c,c} = \left( \begin{array}{cccc} \bar{{{\varvec{L}}}}_{1}^{(c)} &{} &{} &{} \mathbf{0} \\ &{} \bar{{{\varvec{L}}}}_{2}^{(c)} &{} &{} \\ &{} &{} \ddots &{} \\ \mathbf{0} &{} &{} &{} \bar{{{\varvec{L}}}}_{\bar{n}(c)}^{(c)} \\ \end{array}\right) , \end{aligned}$$
(18)

where \(\bar{{{\varvec{L}}}}_{k}^{(c)}\) is the \(b_{s}w\)\(\times\)\(b_{s}w\) lower triangular matrix that corresponds to block \(\bar{b}_{k}^{(c)}\). The forward substitution for the reordered linear system is given by

$$\begin{aligned} \bar{{{\varvec{L}}}} \bar{{{\varvec{y}}}} = \bar{{{\varvec{r}}}}, \end{aligned}$$
(19)

where \(\bar{{{\varvec{r}}}}\) is the residual vector in the case of ILU (IC) preconditioning. Let \(\bar{{{\varvec{y}}}}_{c}\) and \(\bar{{{\varvec{r}}}}_{c}\) represent, respectively, the segments of \(\bar{{{\varvec{y}}}}\) and \(\bar{{{\varvec{r}}}}\) that correspond to color c, and \(\bar{{{\varvec{y}}}}_{k}^{(c)}\) and \(\bar{{{\varvec{r}}}}_{k}^{(c)}\) represent the subsegments in the segments that correspond to block \(\bar{b}_{k}^{(c)}\); that is,

$$\begin{aligned} \bar{{{\varvec{y}}}}_{c}= \left( \begin{array}{c} \bar{{{\varvec{y}}}}_{1}^{(c)} \\ \bar{{{\varvec{y}}}}_{2}^{(c)} \\ \vdots \\ \bar{{{\varvec{y}}}}_{\bar{n}(c)}^{(c)} \\ \end{array}\right) , \ \text{ and } \ \ \bar{{{\varvec{r}}}}_{c}= \left( \begin{array}{c} \bar{{{\varvec{r}}}}_{1}^{(c)} \\ \bar{{{\varvec{r}}}}_{2}^{(c)} \\ \vdots \\ \bar{{{\varvec{r}}}}_{\bar{n}(c)}^{(c)} \\ \end{array}\right) . \end{aligned}$$
(20)

Then, from (17) and (19), the forward substitution for \(\bar{{{\varvec{y}}}}_{c}\) is given by

$$\begin{aligned} \bar{{{\varvec{y}}}}_{c} = \bar{{{\varvec{L}}}}_{c,c}^{-1} \bar{{{\varvec{q}}}}_{c}, \end{aligned}$$
(21)

where

$$\begin{aligned} \bar{{{\varvec{q}}}}_{c} = \bar{{{\varvec{r}}}}_{c} - \sum _{d=1}^{c-1} \bar{{{\varvec{L}}}}_{c,d} \bar{{{\varvec{y}}}}_{d}. \end{aligned}$$
(22)

Because vector segments \(\bar{{{\varvec{y}}}}_{d} (d=1, \ldots , c-1)\) are computed prior to the substitution (21) and shared among all threads, \(\bar{{{\varvec{q}}}}_{c}\) is a given vector in (21). When the segment of \(\bar{{{\varvec{q}}}}_{c}\) that corresponds to block \(\bar{b}_{k}^{(c)}\) is denoted by \(\bar{{{\varvec{q}}}}_{k}^{(c)}\) as in (20), from (18), the forward substitution (21) is expressed as \(\bar{n}(c)\) independent steps:

$$\begin{aligned} \bar{{{\varvec{y}}}}_{k}^{(c)}=(\bar{{{\varvec{L}}}}_{k}^{(c)})^{-1} \bar{{{\varvec{q}}}}_{k}^{(c)} \ (k=1, \ldots , \bar{n}(c)). \end{aligned}$$
(23)

Consequently, the forward substitution (21) for color c can be multithreaded with the degree of parallelism given by the number of level-1 blocks of each color, which is approximately \(n/(n_{c} \cdot b_{s} \cdot w)\). Each thread processes one or more level-1 blocks in parallel.
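
The following minimal OpenMP sketch (ours) illustrates this loop structure; the per-block work, that is, the update (22) followed by the solve (23), is left as a stub here, and a sketch of its vectorized body is given below.

```c
/* A minimal structural sketch (ours): multithreading of the forward
 * substitution (21)-(23) with OpenMP.  hbmc_data_t is a hypothetical
 * container for the reordered matrix and vector data. */
#include <omp.h>

typedef struct hbmc_data hbmc_data_t;

static void process_level1_block(hbmc_data_t *d, int c, int k)
{
    /* compute q_k^(c) by (22), then solve L_k^(c) y_k^(c) = q_k^(c) by (23);
       see the vectorized sketch further below */
    (void)d; (void)c; (void)k;
}

void forward_substitution_hbmc(hbmc_data_t *d, int nc, const int *nblk)
{
    for (int c = 0; c < nc; c++) {                  /* colors in order            */
        #pragma omp parallel for schedule(static)
        for (int k = 0; k < nblk[c]; k++)           /* independent level-1 blocks */
            process_level1_block(d, c, k);
        /* implicit barrier here: one synchronization per color boundary,
           i.e., nc - 1 synchronizations per substitution                     */
    }
}
```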

Next, we explain how to vectorize each step of (23). We consider the procedure for the k-th level-1 block of color c: \(\bar{{{\varvec{y}}}}_{k}^{(c)}=(\bar{{{\varvec{L}}}}_{k}^{(c)})^{-1} \bar{{{\varvec{q}}}}_{k}^{(c)}\). From (16), lower triangular matrix \(\bar{{{\varvec{L}}}}_{k}^{(c)}\) is written as

$$\begin{aligned} \bar{{{\varvec{L}}}}_{k}^{(c)} = \left( \begin{array}{cccc} \tilde{{{\varvec{D}}}}_{1}^{(k, c)} &{} &{} &{} \\ \bar{{{\varvec{L}}}}_{2,1}^{(k,c)} &{} \tilde{{{\varvec{D}}}}_{2}^{(k, c)} &{} &{} \mathbf{0} \\ \vdots &{} \ddots &{} \ddots &{} \\ \bar{{{\varvec{L}}}}_{b_{s},1}^{(k,c)}&{} \bar{{{\varvec{L}}}}_{b_{s},2}^{(k,c)} &{} \ldots &{} \tilde{{{\varvec{D}}}}_{b_{s}}^{(k, c)} \\ \end{array}\right) , \end{aligned}$$
(24)

where \(\tilde{{{\varvec{D}}}}_{l}^{(k, c)}\) are diagonal matrices. We split \(\bar{{{\varvec{y}}}}_{k}^{(c)}\) and \(\bar{{{\varvec{q}}}}_{k}^{(c)}\) into \(b_{s}\) segments, each of size w. Let \(\bar{{{\varvec{y}}}}_{l}^{(k, c)}\) and \(\bar{{{\varvec{q}}}}_{l}^{(k, c)}\) represent the segments that correspond to the level-2 block of the k-th level-1 block of color c; that is,

$$\begin{aligned} \bar{{{\varvec{y}}}}_{k}^{(c)}= \left( \begin{array}{c} \bar{{{\varvec{y}}}}_{1}^{(k, c)} \\ \bar{{{\varvec{y}}}}_{2}^{(k, c)} \\ \vdots \\ \bar{{{\varvec{y}}}}_{b_{s}}^{(k, c)} \\ \end{array}\right) , \ \text{ and } \ \ \bar{{{\varvec{q}}}}_{k}^{(c)}= \left( \begin{array}{c} \bar{{{\varvec{q}}}}_{1}^{(k, c)} \\ \bar{{{\varvec{q}}}}_{2}^{(k, c)} \\ \vdots \\ \bar{{{\varvec{q}}}}_{b_{s}}^{(k, c)} \\ \end{array}\right) . \end{aligned}$$
(25)

Then, from (24), the forward substitution for level-1 block \(\bar{b}_{k}^{(c)}\) is given by the following \(b_{s}\) sequential steps:

$$\begin{aligned} \bar{{{\varvec{y}}}}_{l}^{(k, c)} = (\tilde{{{\varvec{D}}}}_{l}^{(k, c)})^{-1} \bar{{{\varvec{t}}}}_{l}^{(k, c)}, (l=1,2,\ldots ,b_{s}), \end{aligned}$$
(26)

where

$$\begin{aligned} \bar{{{\varvec{t}}}}_{l}^{(k, c)} =\bar{{{\varvec{q}}}}_{l}^{(k, c)} - \sum _{m=1}^{l-1} \bar{{{\varvec{L}}}}_{l,m}^{(k,c)} \bar{{{\varvec{y}}}}_{m}^{(k, c)}. \end{aligned}$$
(27)

In the l-th step of (26), because \(\tilde{{{\varvec{D}}}}_{l}^{(k, c)}\) is a diagonal matrix and each element of \(\bar{{{\varvec{t}}}}_{l}^{(k, c)}\) can be calculated in parallel, the step consists of w independent operations that can be efficiently vectorized. In other words, the l-th step of (26) consists of a simple matrix-vector multiplication and element-wise vector updates that are directly vectorized.
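
The following is a minimal C sketch (ours) of the substitution (26) and (27) for a single level-1 block. Because two different BMC blocks of the same color are mutually independent, the couplings inside a level-1 block connect only unknowns of the same SIMD lane, so each w\(\times\)w block of \(\bar{{{\varvec{L}}}}_{k}^{(c)}\) is diagonal; for clarity, the sketch stores these diagonals densely, whereas the actual implementation uses a sparse format (Sect. 4.4.2). All identifiers are hypothetical.

```c
/* A minimal sketch (ours) of the forward substitution (26)-(27) inside one
 * level-1 block.  low[(l*bs + m)*w + v] holds the coupling between SIMD lane v
 * of level-2 segments m and l (zero where no coupling exists); dinv[l*w + v]
 * holds the reciprocal diagonal of D_l; q and y are the block segments. */
void solve_level1_block(int bs, int w,
                        const double *low,   /* size bs*bs*w, layout (l, m, v) */
                        const double *dinv,  /* size bs*w                      */
                        const double *q,     /* size bs*w                      */
                        double *y)           /* size bs*w, output              */
{
    for (int l = 0; l < bs; l++) {           /* bs sequential steps, eq. (26)  */
        #pragma omp simd
        for (int v = 0; v < w; v++) {        /* w independent SIMD lanes       */
            double t = q[l * w + v];
            for (int m = 0; m < l; m++)      /* accumulate (27)                */
                t -= low[(l * bs + m) * w + v] * y[m * w + v];
            y[l * w + v] = t * dinv[l * w + v];
        }
    }
}
```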

The backward substitution is parallelized (multithreaded) and vectorized in a similar manner, although it is performed inversely from color \(n_{c}\) to 1.

4.4 Implementation of HBMC

4.4.1 Reordering process

In this section, we discuss the reordering process. In our technique, any algorithm (heuristic) for an implementation of BMC can be used. In the application of BMC, we set the number of BMC blocks assigned to each thread to a multiple of w, except for one of the threads (typically the last-numbered thread). Under this condition, the application of HBMC, that is, the secondary reordering from BMC, can be performed independently by each thread. Therefore, the reordering process is fully multithreaded.

4.4.2 Storage format

In the implementation of the sparse triangular solver, a storage format for sparse matrices (Barrett et al. 1994) is typically used. For example, the factorization matrices in an IC/ILU preconditioned solver are stored in memory using such a format. Although there are several standard formats, the sliced ELL (SELL) format (Liu et al. 2013) is the most efficient for exploiting SIMD instructions, and we used it in our implementation. In the SELL format, the slice size is an important parameter. In HBMC, we naturally set the slice size to w because the forward and backward substitutions are vectorized every w rows. This naturally connects our implementation to the SELL-C-\(\sigma\) format (Kreutzer et al. 2014), which is a refined version of SELL.
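
A minimal sketch (ours, with hypothetical identifiers) of the SELL layout with slice size w, together with a scalar reference SpMV over it, is given below; in the actual implementation, each slice-wide column access maps to one SIMD load.

```c
/* A minimal sketch (ours) of the sliced ELL (SELL) storage with slice size w.
 * Within a slice, entries are stored column by column so that the j-th
 * nonzeros of the w rows are contiguous; rows shorter than the slice width
 * are padded with zero values (and a valid column index, e.g., the row index). */
typedef struct {
    int n;          /* number of rows (padded to a multiple of w)          */
    int w;          /* slice size = SIMD width                             */
    int *slice_ptr; /* offset of each slice in col/val, length n/w + 1     */
    int *col;       /* column indices, slice-local column-major order      */
    double *val;    /* values, same layout as col                          */
} sell_t;

/* y = A x for a matrix stored in SELL (scalar reference version) */
void sell_spmv(const sell_t *A, const double *x, double *y)
{
    int nslice = A->n / A->w;
    for (int s = 0; s < nslice; s++) {
        int width = (A->slice_ptr[s + 1] - A->slice_ptr[s]) / A->w;
        for (int i = 0; i < A->w; i++)
            y[s * A->w + i] = 0.0;
        for (int j = 0; j < width; j++) {
            #pragma omp simd
            for (int i = 0; i < A->w; i++) {     /* one SIMD-wide column    */
                int off = A->slice_ptr[s] + j * A->w + i;
                y[s * A->w + i] += A->val[off] * x[A->col[off]];
            }
        }
    }
}
```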

4.4.3 Multithreaded and vectorized substitutions

The program for each of the forward and backward substitutions consists of nested loops. The outermost loop is over the colors. After the computations for each color, thread synchronization is required; therefore, the number of synchronizations in each substitution is \(n_{c}-1\), which is the same as for BMC and the standard multi-color ordering (MC). The second loop is over the level-1 blocks. Because the level-1 blocks in each color are mutually independent, each thread processes one or more level-1 blocks in parallel. In each level-1 block, the substitution is split into \(b_{s}\) steps, each of which is vectorized with a SIMD width of w. For the vectorized substitution, we used the OpenMP SIMD directive or the Intel intrinsic functions for SIMD instructions. Figure 7 shows a sample C program code for multithreaded and vectorized forward substitution using OpenMP and Intel AVX-512 intrinsic functions. In our implementation, we pad with dummy unknowns so that the number of unknowns in each color is a multiple of \(b_{s} w\). Consequently, we can use intrinsic SIMD functions for aligned elements in the code.

Additionally, we discuss the special nonzero pattern that appears in \(\bar{{{\varvec{L}}}}_{k}^{(c)}\), which corresponds to a level-1 block. In this matrix, all nonzero elements lie on \(2b_{s}-1\) diagonal lines. Although a hybrid storage format that exploits this special structure is conceivable, it does not typically result in better performance because of the additional cost of processing the diagonal blocks and the other off-diagonal elements separately. We confirmed this in preliminary tests.

Finally, we discuss the data access locality. The access pattern for the vector elements in HBMC is different from that in BMC. Therefore, the data access locality can be different between two orderings. However, because the secondary reordering for HBMC is performed inside a level-1 block, the data access locality barely changes; at least, from the viewpoint of the last-level cache, both orderings are considered to be similar.

Fig. 7 C program code of multithreaded and vectorized forward substitution with OpenMP directives and Intel AVX-512 intrinsic functions

Table 1 Information about the test environments
Table 2 Matrix information for the test problems

5 Numerical results

5.1 Computers and test problems

We conducted seven numerical tests on three types of computational nodes to evaluate the proposed reordering technique in the context of the ICCG method: the computational nodes were a Cray XC40, a Cray CS400 (2820XT), and a Fujitsu CX2550 (M4). The two Cray systems are operated by the Academic Center for Computing and Media Studies, Kyoto University, whereas the Fujitsu system is at the Information Initiative Center, Hokkaido University. Table 1 lists information about the computational nodes and compilers used. In the numerical tests, we used all cores of each computational node for execution.

The program code was written in C with OpenMP for thread parallelization. For vectorization, we used the intrinsic functions of the Intel Advanced Vector Extensions. The AVX2 (256-bit SIMD) instruction set was used for the Xeon (Broadwell) processor, whereas the AVX-512 (512-bit SIMD) instruction set was used for the Xeon Phi (KNL) and Xeon (Skylake) processors. Although we also developed a vectorized program using the OpenMP SIMD directive, its performance was slightly worse than that of the version with intrinsic functions in most of the test cases. Thus, in this paper, we only report the results obtained using the intrinsic functions.

For the test problems, we used a linear system that arises from finite element electromagnetic field analysis and six linear systems picked up from the SuiteSparse Matrix Collection (Davis and Hu 2011). We selected symmetric positive definite matrices that are mainly derived from computational science or engineering problems, and have a relatively large dimension compared with other matrices in the collection.

In the electromagnetic field analysis test, the linear system that arises from the finite element discretization of the IEEJ standard benchmark model (Nakata et al. 1991) was used. The basic equation for the problem is given as

$$\begin{aligned} \nabla \times (\nu \nabla \times {{\varvec{A}}}_{m}) = {{\varvec{J}}}_{0}, \end{aligned}$$
(28)

where \({{\varvec{A}}}_{m}\), \(\nu\), and \({{\varvec{J}}}_{0}\) are the magnetic vector potential, magnetic reluctivity, and excitation current density, respectively. The analysis solved (28) using a finite edge-element method with hexahedral elements. The boundary condition is given by setting electric walls around the model. Applying the Galerkin method to (28), we obtained the linear system of equations for the test problem. Because no gauge condition is enforced in the air region of the model, the coefficient matrix of the resulting linear system is positive semi-definite (Igarashi 2001). Although the linear system has multiple solutions, the distribution of the magnetic flux density derived from them is unique. The linear system was solved using the shifted ICCG method with the shift parameter set to 0.3. This dataset is denoted by Ieej. Table 2 lists the matrix information for the test problems. The right-hand side vector for all linear systems except Ieej is the vector of ones.

In this paper, we report a performance comparison of four multithreaded ICCG solvers. The solver denoted by MC is based on multi-color ordering, which is the most popular parallel ordering method. The solver denoted by BMC is based on the block multi-color ordering method. The solvers denoted by HBMC (csr_spmv) and HBMC (sell_spmv) are based on the proposed HBMC, where the former uses the compressed sparse row (CSR) format (Barrett et al. 1994) for the sparse matrix-vector multiplication (SpMV) and the latter uses the SELL format. In MC and BMC, the CSR format was used.

For the blocking method in BMC and HBMC, we used the simplest of the heuristics introduced in (Iwashita et al. 2012), in which the unknown with the smallest index is picked up for the newly generated block. For the coloring of nodes or blocks, the greedy algorithm was used in all the solvers. The convergence criterion was that the relative residual norm (2-norm) be less than \(10^{-7}\).
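
As a reference, the following is a minimal C sketch (ours) of a greedy coloring applied to the nodes or blocks; adj_ptr and adj are hypothetical arrays describing the adjacency of the (block) graph in a CSR-like form.

```c
/* A minimal sketch (ours) of greedy coloring: visit the vertices in order and
 * assign the smallest color not used by an already-colored neighbor.
 * color[] receives 0-based colors; the return value is the number of colors. */
#include <stdlib.h>

int greedy_coloring(int n, const int *adj_ptr, const int *adj, int *color)
{
    int ncolor = 0;
    char *used = calloc(n, 1);          /* colors forbidden for the current vertex */
    for (int v = 0; v < n; v++)
        color[v] = -1;
    for (int v = 0; v < n; v++) {
        for (int jj = adj_ptr[v]; jj < adj_ptr[v + 1]; jj++)
            if (color[adj[jj]] >= 0)
                used[color[adj[jj]]] = 1;   /* mark colors of colored neighbors */
        int c = 0;
        while (used[c])
            c++;
        color[v] = c;
        if (c + 1 > ncolor)
            ncolor = c + 1;
        for (int jj = adj_ptr[v]; jj < adj_ptr[v + 1]; jj++)  /* reset marks */
            if (color[adj[jj]] >= 0)
                used[color[adj[jj]]] = 0;
    }
    free(used);
    return ncolor;
}
```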

5.2 Numerical results

5.2.1 Equivalence of orderings in convergence and use of SIMD instructions

First, we examine the equivalence of BMC and HBMC in terms of convergence. Table 3 lists the number of iterations of the solvers tested on Cray XC40, where the block size of BMC and HBMC was set to 32. Equivalence was confirmed by the numerical results. Moreover, to examine the entire solution process, Fig. 8 shows the convergence behaviors of BMC and HBMC in the G3_circuit and Ieej tests. In the figure, the two lines of the relative residual norms overlap, which indicates that the solvers had an equivalent solution process. The equivalence of convergence was also confirmed in all test cases (seven datasets \(\times\) three block sizes \(\times\) three computational nodes). Furthermore, Table 3 shows the advantage in terms of convergence of BMC over MC, which coincides with the results reported in (Iwashita et al. 2012).

Next, we checked the use of SIMD instructions in the solver using the Intel VTune Amplifier (application performance snapshot) in the G3_circuit test conducted on the Fujitsu CX2550. The snapshot showed that the percentage of packed (SIMD) floating-point instructions in the solver based on HBMC (sell_spmv) reached 99.7%, whereas it was 12.7% in the solver using BMC.

5.2.2 Performance comparison

Table 4 (a) shows the performance comparison of four solvers on Cray XC40. In the tests, HBMC attained the best performance for all datasets, except Audikw_1. In the Thermal2 and G3_circuit tests, HBMC (sell_spmv) was more than two times faster than the standard MC solver. When HBMC (csr_spmv) was compared with BMC, it attained better performance in 18 out of 21 cases (seven datasets \(\times\) three block sizes), which demonstrates the effectiveness of HBMC for the sparse triangular solver. In all test cases, HBMC (sell_spmv) outperformed HBMC (csr_spmv), which implies that an efficient use of SIMD instructions was important on the Xeon Phi-based system.

Table 3 Comparison of the number of iterations
Fig. 8 Convergence behavior of BMC and HBMC

Table 4 (b) shows the test results on Cray CS400 (Xeon Broadwell). HBMC attained the best performance for all datasets. When HBMC (csr_spmv) was compared with BMC, it attained better performance in 16 out of 21 cases, which shows the effectiveness of HBMC. Table 4 (b) also indicates that using the SELL format for the coefficient matrix mostly led to an improvement in solver performance.

Table 4 Numerical results: execution time (sec.)

Table 4 (c) shows the test results on the Fujitsu CX2550 (Xeon Skylake). In these numerical tests, HBMC outperformed MC and BMC for six out of seven datasets. For the Audikw_1 dataset, HBMC did not outperform BMC on Xeon Phi and Skylake, although it was better than BMC on Xeon Broadwell. This result is attributed to the effect of increasing the slice size, which is given by the SIMD width w. In the SELL format, some zero elements are treated as nonzero within a slice. When there is a significant imbalance in the number of nonzero elements per row within a slice, the number of processed elements increases substantially compared with the CSR-based implementation. The Audikw_1 dataset has this property: the number of processed elements in SELL increased by 40% compared with CSR, whereas the increase was 10% for the G3_circuit dataset. For this type of dataset, the number of processed elements often grows as the slice size increases. The slice size was set to 8 for the Xeon Phi and Skylake processors and 4 for the Xeon Broadwell processor; on Broadwell, the increase in the number of elements when changing from CSR to SELL was 28%, which resulted in better performance for HBMC compared with BMC. In the future, for further acceleration of the solver, we will develop an implementation that introduces remedies for this SELL issue, for example, splitting rows that include an extremely large number of nonzero elements compared with other rows.
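
The padding overhead discussed above can be estimated directly from the row pointer of the CSR matrix. The following minimal C sketch (ours) computes the ratio of elements processed in SELL with slice size w to the number of stored nonzeros, which is the quantity quoted above:

```c
/* A minimal sketch (ours): estimate how many matrix elements are processed
 * when a CSR matrix with row pointer ptr is converted to SELL with slice size
 * w.  Each slice is padded to the length of its longest row, so the returned
 * ratio grows when row lengths inside a slice are strongly imbalanced.
 * Note: this assumes no row sorting (sigma = 1), unlike SELL-C-sigma. */
double sell_overhead(int n, const int *ptr, int w)
{
    long long processed = 0;
    for (int s = 0; s < n; s += w) {
        int width = 0;
        for (int i = s; i < s + w && i < n; i++) {
            int len = ptr[i + 1] - ptr[i];
            if (len > width)
                width = len;
        }
        processed += (long long)width * w;      /* padded slice size */
    }
    return (double)processed / (double)ptr[n];  /* relative to nnz   */
}
```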

6 Related works

The parallelization of the sparse triangular solver for iterative solvers has been mainly investigated in the context of GS or IC/ILU smoothers for multigrid solvers, the SOR method, and IC/ILU preconditioned iterative solvers. Most of these parallelization techniques are classified into two classes: domain decomposition type methods and parallel orderings (Duff and van der Vorst 1999). A simple but commonly used technique in the former class is the additive Schwarz smoother or preconditioner. The hybrid Jacobi and GS smoother and block Jacobi IC/ILU preconditioning are typical examples, and they are used in many applications (Vollaire and Nicolas 1998; Baker et al. 2010). However, these techniques typically suffer from a decline in convergence when the number of threads (processes) is increased. Although there are some remedies for the deterioration in convergence, for example, the overlapping technique (Radicati di Brozolo and Robert 1989), it is generally difficult to compensate for it when many cores are used.

A parallel ordering or reordering technique is a standard method to parallelize the sparse triangular solver. We focus on the parallelization of IC/ILU preconditioners or smoothers; however, there are many studies that discuss the application of parallel ordering for GS and SOR methods, for example, (Park et al. 2015). Ref. (Duff and van der Vorst 1999) provides an overview of early works on the parallel ordering method when applied to IC/ILU preconditioned iterative solvers. Parallel orderings were mainly investigated in the context of a structured grid problem (a finite difference analysis), and the concepts of typical parallel ordering techniques, such as red-black, multi-color, zebra, domain decomposition (four or eight-corner), and nested dissection, were established in the 1990s. In these early research activities, a major issue was the trade-off problem between convergence and the degree of parallelism. After Duff and Meurant indicated the problem in (Duff and Meurant 1989), both analytical and numerical investigations were conducted in (Doi and Washio 1999; Iwashita et al. 2005; Doi and Lichnewsky 1991; Kuo and Chan 1990; Doi and Lichnewsky 1990; Eijkhout 1991; Doi 1991; Benzi et al. 1999). The concept of equivalence of orderings and some remedies for the trade-off problem, such as the use of a relatively large number of colors in multi-color ordering or block coloring, were introduced as the results of these research activities.

In practical engineering and science domains, unstructured problems are solved more frequently than structured problems. Therefore, parallel ordering techniques were enhanced for unstructured problems, and several heuristics were proposed. Typical examples are hierarchical interface decomposition (HID) (Hénon and Saad 2006) and heuristics for nodal or block multi-coloring (Iwashita et al. 2012; Jones and Plassmann 1994; Iwashita and Shimasaki 2002). These techniques and other related methods have been used in various application domains, such as CFD, computational electromagnetics, and structure analyses (Semba et al. 2013; Tsuburaya et al. 2015; Rivera et al. 2010; Nakajima 1999; Naumov 2011).

Finally, we address the recently reported research results that are related to parallel linear solvers that involve sparse triangular solvers. Gonzaga de Oliveira et al. reported their intensive numerical test results to evaluate various reordering techniques in the ICCG method in (2018). Gupta introduced a blocking framework to generate a fast and robust preconditioner based on ILU factorization in (2017). Chen et al. developed a couple of ILU-based preconditioners on GPUs in (2018). Ruiz et al. reported the evaluation results of HPCG implementations using nodal and block multi-color orderings on the ARM-based system, which confirmed the superiority of the block coloring method in (2018).

In this paper, we proposed a parallel ordering that is different from the techniques described above. To the best of our knowledge, there is no other parallel ordering method that vectorizes the sparse triangular solver while maintaining the same convergence and number of synchronizations as block multi-color ordering. Because the vectorization of SpMV has been intensively investigated (Liu et al. 2013; Kreutzer et al. 2014), one of the conventional approaches is the use of multi-color ordering, in which the substitution is represented as an SpMV in each color. However, multi-color ordering suffers from problems of convergence and data locality, which are also indicated in the latest report (Ruiz et al. 2018). Considering the numerical results and mathematical properties of the proposed hierarchical block multi-color ordering, it can be regarded as an effective technique for multithreading and vectorizing the sparse triangular solver.

7 Conclusions

In this paper, we proposed a new parallel ordering technique, hierarchical block multi-color ordering (HBMC), for vectorizing and multithreading the sparse triangular solver. HBMC was designed to maintain the advantages of block multi-color ordering (BMC) in terms of convergence and the number of synchronizations. In the method, the coefficient matrix is transformed into a matrix with a hierarchical block structure. The level-1 blocks are mutually independent in each color, which is exploited for multithreading. Within each level-2 block, the substitution is converted into \(w (= \hbox {SIMD width})\) independent steps, which are efficiently processed by SIMD instructions. In this paper, we analytically demonstrated the equivalence of HBMC and BMC in terms of convergence. Furthermore, numerical tests were conducted to examine the proposed method using seven datasets on three types of computational nodes. The numerical results confirmed the equivalence of the convergence of HBMC and BMC. Moreover, the numerical tests indicated that HBMC outperformed BMC in 18 out of 21 test cases (seven datasets \(\times\) three systems), which confirms the effectiveness of the proposed method.

In the future, we will examine our technique for other application problems, particularly a large-scale multigrid application and an HPCG benchmark. Moreover, we will examine the effect of the matrix format for preconditioners and coefficient matrices on the performance of the solver. We also intend to introduce a sophisticated storage format or its related technique to our solver. As other research issues, we will examine the effect of other coloring and blocking strategies on the performance of the solver.