Hierarchical Block Multi-Color Ordering: A New Parallel Ordering Method for Vectorization and Parallelization of the Sparse Triangular Solver in the ICCG Method

In this paper, we propose a new parallel ordering method, called hierarchical block multi-color ordering, to vectorize and parallelize the sparse triangular solver. In this method, the parallel forward and backward substitutions can be vectorized while preserving the advantages of block multi-color ordering, namely fast convergence and fewer thread synchronizations. To evaluate the proposed method in a parallel ICCG (Incomplete Cholesky Conjugate Gradient) solver, numerical tests were conducted using five test matrices on three types of computational nodes. The numerical results indicate that the proposed method outperforms the conventional block and nodal multi-color ordering methods in 13 out of 15 test cases, which confirms the effectiveness of the method.


1. Introduction.
A sparse triangular solver is an important computational kernel of iterative linear solvers in various numerical simulations. It is the main component of the Gauss-Seidel (GS) smoother, the SOR method, and IC/ILU preconditioning, which are used as building blocks in various computational science and engineering analyses [1][2][3]. Therefore, the development of a fast multithreaded sparse triangular solver is essential to accelerate these analyses, whether they are conducted on a single computing node or on a large-scale cluster system of nodes. For example, the performance of the solver significantly influences the total simulation time of large-scale partial differential equation analysis using a multigrid solver with the GS, IC, or ILU smoother [4,5]. However, it is well known that the sparse triangular solver, which consists of forward and backward substitutions, cannot be straightforwardly parallelized [6,7]. Thus, in this paper, we discuss an effective approach to developing a high-performance multithreaded sparse triangular solver.
There are various methods for parallelizing a sparse triangular solver or its related techniques, and we focus on the parallel ordering (reordering) method, which is one of the most common methods for parallelization of a sparse triangular solver. There are several well-known orderings, such as dissection and domain decomposition orderings, but multi-color ordering is the most commonly used technique. It has been used in various applications to parallelize, for example, the ICCG method. However, it is well known that the multi-color ordering entails a trade-off problem between convergence and the number of synchronizations [8]. An increase in the number of colors typically results in better convergence, but it also leads to an increase in the number of synchronizations, which is proportional to the number of colors. The trade-off problem between convergence and parallelism is a common issue for parallel ordering techniques [9].
One of the solutions for the above trade-off problem is block multi-coloring, in which multi-color ordering is applied to blocks of unknowns. The technique has been investigated in several contexts. The concept of block coloring or block independent sets can be seen in [2]. An early work [10] discussed it for the parallel SOR method. For the parallelization of the IC/ILU preconditioned iterative solver, it was first investigated in finite difference (structured grid) analyses [11,12], where block coloring proved effective for improving convergence without increasing the number of thread synchronizations. Following these research activities, the algebraic block multi-color ordering method was introduced for general sparse linear systems in [13]. Although there are various options for the coloring or blocking method [14,15], this technique has been used in various applications because of its advantages in terms of convergence, data locality, and the number of synchronizations [16,17]. In particular, several high-performance implementations of the HPCG benchmark adopt the technique, which shows the effectiveness of the method in a fast multigrid solver with the parallel GS smoother [18][19][20][21][22]. However, the block multi-coloring method has a drawback in its implementation using SIMD vectorization: the calculations in the innermost loop of the parallel substitutions are performed sequentially, which prevents the efficient use of SIMD instructions.
Because the sparse triangular solver is a memory-intensive kernel, its performance on earlier processors was not substantially affected by the use of SIMD instructions. However, to increase floating-point performance, recent processors have enhanced their SIMD instructions, and the SIMD width (vector length) is becoming larger. For example, the Intel Xeon (Skylake) [23], Intel Xeon Phi [24], and Fujitsu A64FX (ARM SVE) [25] processors are equipped with 512-bit SIMD instructions. We note that ARM SVE supports a vector length of up to 2,048 bits [26]. Considering this trend, we aim to develop a parallel sparse triangular solver in which both multithreading and SIMD vectorization are used efficiently.
In this paper, we propose a new parallel ordering technique in which SIMD vectorization can be used while the advantages of block multi-color ordering, that is, fast convergence and fewer synchronizations, are preserved. The technique is called "hierarchical block multi-color ordering," and its solution process (convergence) is mathematically equivalent to that of block multi-color ordering. Moreover, the number of synchronizations in the multithreaded substitutions is the same as that of block multi-color ordering. We conduct five numerical tests using a finite element electromagnetic field analysis code and matrix data obtained from a matrix collection, and confirm the effectiveness of the proposed method in the context of the parallel ICCG solver.
2. Sparse Triangular Solver. In this paper, we consider the following n-dimensional linear system of equations:

Ax = b. (2.1)

We discuss the case in which the linear system (2.1) is solved using an iterative linear solver involving IC(0)/ILU(0) preconditioning, the Gauss-Seidel (GS) smoother, or the SOR method. When we discuss a parallel ICCG (precisely, IC(0)-CG) solver for (2.1), we assume that the coefficient matrix A is symmetric and positive (semi-)definite. For the parallelization of the iterative solvers that we consider, the most problematic part is the sparse triangular solver kernel. For example, in an IC/ILU preconditioned Krylov subspace iterative solver, the other computational kernels consist of inner products, matrix-vector multiplications, and vector updates, which can be parallelized straightforwardly. The sparse triangular solver kernel is given by the following forward and backward substitutions:

Ly = r, (2.2)

and

Uz = y, (2.3)

where r, y, and z are n-dimensional vectors. Matrices L and U are, respectively, lower and upper triangular matrices with the same nonzero patterns as the lower and upper triangular parts of A. In ILU (IC) preconditioning, the preconditioning step is given by (2.2) and (2.3), and the triangular matrices L and U are derived from the incomplete factorization

A ≈ LU. (2.4)

The iteration steps in the GS and SOR methods (smoothers) can be expressed by similar substitutions. The substitution is an inherently sequential process, and it cannot be parallelized (multithreaded) straightforwardly.
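As a concrete illustration of this sequentiality, the following minimal C sketch performs the forward substitution (2.2) for a lower triangular matrix stored in the CRS format (the function and variable names are ours, not from any particular solver). Each y[i] depends on previously computed entries y[j] with j < i, so the outer loop carries a dependence and cannot be multithreaded directly.

```c
#include <assert.h>

/* Forward substitution L*y = r for a lower triangular matrix stored in
 * CRS (row_ptr/col_idx/val), with the diagonal stored last in each row.
 * Each y[i] depends on earlier entries y[j], j < i, so the outer loop
 * carries a dependence and cannot be multithreaded directly. */
static void forward_subst(int n, const int *row_ptr, const int *col_idx,
                          const double *val, const double *r, double *y)
{
    for (int i = 0; i < n; i++) {
        double s = r[i];
        int diag = row_ptr[i + 1] - 1;          /* last entry of row i: l_ii */
        for (int jj = row_ptr[i]; jj < diag; jj++)
            s -= val[jj] * y[col_idx[jj]];      /* needs earlier y values */
        y[i] = s / val[diag];
    }
}
```

The backward substitution (2.3) has the same structure with the row loop running in reverse.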

3. Parallel Ordering Method.
A parallel ordering (reordering) method is one of the most popular parallelization methods for a sparse triangular solver, that is, forward and backward substitutions. It transforms the coefficient matrix into a form appropriate for parallel processing by reordering the unknowns or their indices. Let the reordered linear system of (2.1) be denoted by

Āx̄ = b̄. (3.1)

Then, the reordering is given by the transformation

x̄ = P_π x, (3.2)

where P_π is a permutation matrix. When we consider the index set I = {1, 2, . . . , n} that corresponds to the index of each unknown, the reordering is a permutation of the elements of I. In the present paper, the reordering function of the index is denoted by π; that is, the i-th unknown of the original system is moved to the π(i)-th unknown of the reordered system. In the reordering technique, the coefficient matrix and right-hand side are given by

Ā = P_π A P_π^T, b̄ = P_π b. (3.3)

3.1. Equivalence of orderings. We consider the case in which the two linear systems (2.1) and (3.1) are solved using an identical iterative method. The approximate solution vectors at the j-th iteration for (2.1) and (3.1) are denoted by x^(j) and x̄^(j), respectively. If it holds that

x̄^(j) = P_π x^(j) (3.4)

at every j-th step under the setting x̄^(0) = P_π x^(0) for the initial guesses, then we can say that these two solution processes are equivalent. For example, in the Jacobi method and most Krylov subspace methods, reordering does not affect convergence; that is, the solution process for any reordered system is (mathematically) equivalent to that for the original system. However, in the case of the iterative solvers that we consider in this paper, such as the IC/ILU preconditioned iterative solver, the solution processes are typically inequivalent. This is because of the sequentiality involved in the triangular solver (substitutions). However, there are special cases in which the reordered system has an equivalent solution process to the original system.
In these cases, we say that the two (original and new) orderings are equivalent, or that π is an equivalent reordering. In the GS and SOR methods, equivalence is given by (3.4) under the proper setting of the initial guess. In IC(0)/ILU(0) preconditioning, equivalence is given as follows. We denote the incomplete factorization matrices of Ā by L̄ and Ū. Moreover, the preconditioning step of the reordered linear system is given by z̄ = (L̄Ū)^{-1} r̄. If z̄ = P_π z is satisfied under r̄ = P_π r, then we say that the orderings are equivalent. For example, the ICCG (IC(0)-CG) method exhibits an equivalent solution process for both the original and reordered linear systems when the orderings are equivalent.
The condition for an equivalent reordering is given as follows: when the following ER condition is satisfied, π is an equivalent reordering.
ER (Equivalent Reordering) Condition - for all i_1, i_2 ∈ I such that a_{i_1,i_2} ≠ 0 or a_{i_2,i_1} ≠ 0,

i_1 < i_2 ⟺ π(i_1) < π(i_2), (3.5)

where a_{i_1,i_2} denotes the i_1-th row, i_2-th column element of A. For further explanation, we introduce the ordering graph, which is the directed graph that corresponds to the coefficient matrix. Each node of the graph corresponds to an unknown or its index. An edge between two nodes i_1 and i_2 exists only when the i_1-th row, i_2-th column element or the i_2-th row, i_1-th column element is nonzero. The direction of the edge (arrow) shows the order of the two nodes. Figure 3.1 shows an example of the ordering graph. Using the ordering graph, (3.5) can be rewritten as the statement that the new and original orderings have the same ordering graph. In [27], the authors stated that the ordering graph provides a unique class of mutually equivalent orderings. In the appendix, we provide a proof sketch of the relationship between (3.5) and the equivalence of orderings.
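The ER condition (3.5) is simple to check mechanically. The following sketch uses a toy edge-list representation of the nonzero pattern (our own simplification, not the paper's data structure) and returns whether a given permutation preserves the direction of every edge of the ordering graph.

```c
#include <assert.h>

/* Check the ER condition: for every pair (i1, i2) connected by a nonzero
 * of A (in either triangle), the order of the two indices must be
 * preserved by the permutation perm (perm[i] = pi(i), 0-based). Returns 1
 * if pi is an equivalent reordering in the sense of (3.5). The edge list
 * is a toy stand-in for the nonzero pattern of A. */
static int is_equivalent_reordering(int n_edges, const int (*edges)[2],
                                    const int *perm)
{
    for (int e = 0; e < n_edges; e++) {
        int i1 = edges[e][0], i2 = edges[e][1];
        /* (i1 < i2) must hold exactly when (pi(i1) < pi(i2)) */
        if ((i1 < i2) != (perm[i1] < perm[i2]))
            return 0;   /* an edge of the ordering graph flipped direction */
    }
    return 1;
}
```

Swapping two unconnected unknowns leaves the ordering graph unchanged, whereas swapping two connected ones flips an edge and breaks equivalence.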

4. Hierarchical Block Multi-Color Ordering Method. In this paper, we propose a new parallel ordering method for the vectorization and parallelization of a sparse triangular solver. Additionally, the proposed ordering is intended to inherit the advantages of block multi-color ordering (BMC) in terms of convergence, the number of synchronizations, and data locality. The proposed parallel ordering is called hierarchical block multi-color ordering (HBMC), and it is equivalent to BMC in terms of convergence.
In the technique, we first order the unknowns using BMC, and then reorder them again. We focus on explaining the secondary reordering because we use the conventional algorithm given in [13] for the application of BMC. Therefore, the linear system already reordered by BMC is written as (2.1), and the secondary reordering is denoted by π. Thus, the final reordered linear system based on HBMC is given by (3.1).

4.1. Block multi-color ordering (BMC).
In this subsection, we briefly introduce BMC and some notation required for the explanation of HBMC. In BMC, all unknowns are divided into blocks of the same size, and multi-color ordering is applied to the blocks. Because blocks that have an identical color are mutually independent, the forward and backward substitutions are parallelized over the blocks in each color. The number of (thread) synchronizations of the parallelized (multithreaded) substitution is given by n_c − 1, where n_c is the number of colors. Figure 4.1 shows the coefficient matrix that results from BMC.
In the present paper, the block size and the k-th block in color c are denoted by b_s and b_k^(c), respectively. In BMC, each unknown (or its index) is assigned to a certain block, as shown in Fig. 4.2, where n(c) is the number of blocks in color c.
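To make the index mapping concrete, the following sketch builds the BMC permutation for a toy setting in which the unknowns are grouped into consecutive blocks of size b_s and each block has already been assigned a color; the unknowns are then renumbered color by color. This only illustrates the renumbering, not the blocking heuristic of [13], and all names are ours.

```c
#include <assert.h>

/* Build the BMC permutation for a toy setting: unknowns are grouped into
 * consecutive blocks of size bs, and block k has color block_color[k]
 * in 0..nc-1. Unknowns are renumbered color by color, keeping the block
 * order and the order inside each block. perm[i] gives the new (0-based)
 * index of old unknown i. Assumes n is a multiple of bs. */
static void bmc_permutation(int n, int bs, int nc,
                            const int *block_color, int *perm)
{
    int next = 0;
    for (int c = 0; c < nc; c++)              /* colors in order */
        for (int k = 0; k < n / bs; k++)      /* blocks in original order */
            if (block_color[k] == c)
                for (int l = 0; l < bs; l++)  /* unknowns inside the block */
                    perm[k * bs + l] = next++;
}
```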

4.2. Hierarchical block multi-color ordering (HBMC). In the proposed HBMC, a new (hierarchical) block structure is introduced. First, we define a level-1 block (or multithreaded block) as follows: the block consists of w consecutive blocks of BMC in each color. The k-th level-1 block in color c is written as

b̄_k^(c) = b_{k_s+1}^(c) ∪ b_{k_s+2}^(c) ∪ · · · ∪ b_{k_s+w}^(c), (4.1)

where k_s = (k − 1) × w. We note that the parameter w is determined by the length of the SIMD vector instruction (SIMD width) of the targeted processor. It is typically 4 or 8, and will be larger in the future. In our technique, the secondary reordering is performed on each level-1 block, as shown in Fig. 4.2. Without loss of generality, we describe the reordering process for one level-1 block, that is, the blocks b_{k_s+1}^(c), . . . , b_{k_s+w}^(c). In the first step, we pick up the first unknown of each of the w blocks and order the picked unknowns. These w unknowns are mutually independent because the blocks in each color are independent in BMC. In the next step, we pick up the second unknown of each block, which are again mutually independent, and order them after the previously ordered unknowns. We repeat the process until no unknowns remain. In total, the pick-up process is performed b_s times. Figure 4.3 shows the secondary reordering process in the first level-1 block when b_s = 2 and w = 4, where each unknown is associated with the diagonal element of the coefficient matrix. In the figure, the colored elements represent nonzero elements. After the reordering process is complete, the reordered coefficient matrix exhibits a second-level block structure, given by w × w (small) diagonal matrices. The level-2 block structure is used for SIMD vectorization of the substitutions.

4.2.1. Equivalence between BMC and HBMC. We prove that HBMC is equivalent to BMC; that is, the convergence rates of the linear solvers based on the two orderings are the same. Because the secondary reordering for HBMC is performed locally in each level-1 block, the order between two unknowns that belong to two different level-1 blocks is preserved in the final order.
Consequently, for any two unknowns i_1 and i_2 that belong to different level-1 blocks, it holds that

i_1 < i_2 ⟺ π(i_1) < π(i_2). (4.2)

From (4.2), if the local ordering subgraphs of BMC and HBMC that correspond to each level-1 block are identical, then the two orderings are equivalent. Next, we examine the reordering process in a level-1 block. In the secondary reordering process of HBMC, the order of unknowns that belong to different BMC blocks changes. However, the reordering process for these unknowns does not affect the ordering graph, that is, the convergence. In BMC, the unknowns that belong to two different blocks in the same color have no data relationship with one another; that is, there are no edges between them in the ordering graph. Therefore, even if we change the order of unknowns that belong to different BMC blocks, this does not affect the ordering graph. Consequently, we now focus on the influence of reordering inside a BMC block. When we analyze the above picking process, we can confirm that the order of the unknowns that belong to the same BMC block is preserved in the secondary reordering: in each block, we pick up the first unknown, then the second, and continue this process. Therefore, the order does not change for these unknowns; that is, for i_1 and i_2 in the same BMC block,

i_1 < i_2 ⟹ π(i_1) < π(i_2). (4.3)

When we consider the mutual independence among the BMC blocks in each color, together with (4.2) and (4.3), we can demonstrate that the secondary reordering π does not change the form of the ordering graph. This proves that HBMC is equivalent to BMC.
As an example showing the relationship between BMC and HBMC, Fig. 4.5 demonstrates the ordering of nodes (unknowns) in a five-point finite difference analysis. Figures 4.5 (a) and (b) show that BMC and HBMC have identical ordering graphs. Consequently, the two orderings are equivalent in terms of convergence. Figure 4.5 (c) shows the coefficient matrix based on HBMC, which involves the hierarchical block structures.
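In index terms, the secondary reordering inside one level-1 block described in Section 4.2 is a transpose of a b_s × w layout: we pick the first unknown of each of the w BMC blocks, then the second, and so on. A minimal sketch with 0-based local indices (names ours):

```c
#include <assert.h>

/* Secondary (HBMC) reordering inside one level-1 block: the block holds
 * w consecutive BMC blocks of size bs each. Picking the first unknown of
 * each BMC block, then the second, and so on, amounts to transposing a
 * bs-by-w layout: local index j*bs + l (l-th unknown of the j-th BMC
 * block) moves to position l*w + j. */
static void hbmc_local_perm(int bs, int w, int *perm)
{
    for (int j = 0; j < w; j++)          /* j-th BMC block in the level-1 block */
        for (int l = 0; l < bs; l++)     /* l-th unknown inside that block */
            perm[j * bs + l] = l * w + j;
}
```

For b_s = 2 and w = 4, as in Fig. 4.3, the two unknowns of BMC block j land at positions j and j + 4, so the order inside each BMC block is preserved, as required by (4.3).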

4.3. Parallelization and vectorization of forward and backward substitutions.
Corresponding to the colors of the unknowns, the solution vector x̄ and coefficient matrix Ā are split as

x̄ = (x̄_1^T, x̄_2^T, . . . , x̄_{n_c}^T)^T, Ā = (C̄_{c,d}), (c, d = 1, 2, . . . , n_c), (4.4)

where x̄_c is the segment of x̄ that corresponds to color c, and C̄_{c,d} is the submatrix of Ā in block row c and block column d. Hereafter, we assume that the size of x̄_c is a multiple of b_s w. In the analysis program, this assumption is satisfied by introducing dummy unknowns. Let the number of level-1 blocks assigned to color c be denoted by n̄(c); then, the diagonal block C̄_{c,c} of Ā is given by the following block diagonal matrix:

C̄_{c,c} = diag(B̄_1^(c), B̄_2^(c), . . . , B̄_{n̄(c)}^(c)), (4.5)

where B̄_k^(c) is the b_s w × b_s w matrix that corresponds to the unknowns in the k-th level-1 block with color c, which we denote by b̄_k^(c). Owing to the secondary reordering, B̄_k^(c) has the level-2 block structure

B̄_k^(c) = (Ē_{l,m}^(k,c)), (l, m = 1, 2, . . . , b_s), (4.6)

where each Ē_{l,m}^(k,c) is a w × w block, and the diagonal blocks satisfy

Ē_{l,l}^(k,c) = D̄_l^(k,c), (l = 1, 2, . . . , b_s), (4.7)

where D̄_l^(k,c), (l = 1, 2, . . . , b_s) are w × w diagonal matrices. The forward substitution included in the ILU(0)/IC(0) preconditioners and the GS and SOR methods uses a lower triangular matrix with the same nonzero element pattern as the lower triangular part of Ā. From (4.5)-(4.7), the lower triangular matrix L̄ is written as

L̄ = (L̄_{c,d}), L̄_{c,d} = O (c < d), (4.8)

and the diagonal block L̄_{c,c} is given by

L̄_{c,c} = diag(L̄_1^(c), L̄_2^(c), . . . , L̄_{n̄(c)}^(c)), (4.9)

where L̄_k^(c) is the b_s w × b_s w lower triangular block that corresponds to b̄_k^(c). The forward substitution for the reordered linear system is given by

L̄ȳ = r̄, (4.10)

where r̄ is the residual vector in the case of ILU (IC) preconditioning. Let ȳ_c and r̄_c represent, respectively, the segments of ȳ and r̄ that correspond to color c:

ȳ = (ȳ_1^T, . . . , ȳ_{n_c}^T)^T, r̄ = (r̄_1^T, . . . , r̄_{n_c}^T)^T. (4.11)

Then, from (4.8) and (4.10), the forward substitution for ȳ_c is given by

L̄_{c,c} ȳ_c = r̄_c − Σ_{d=1}^{c−1} L̄_{c,d} ȳ_d =: q̄_c. (4.12)

Because the vector segments ȳ_d (d = 1, . . . , c − 1) are computed prior to the substitution (4.12) and shared among all threads, q̄_c is a given vector in (4.12). When the segments of q̄_c and ȳ_c that correspond to block b̄_k^(c) are denoted by q̄_k^(c) and ȳ_k^(c), split as in (4.11),

q̄_c = (q̄_1^(c)T, . . . , q̄_{n̄(c)}^(c)T)^T, (4.13)

then, from (4.9), the forward substitution (4.12) is expressed as n̄(c) independent steps:

L̄_k^(c) ȳ_k^(c) = q̄_k^(c), (k = 1, . . . , n̄(c)). (4.14)

Consequently, the forward substitution (4.12) for color c can be multithreaded with the degree of parallelism given by the number of level-1 blocks in each color, which is approximately n/(n_c · b_s · w). Each thread processes one or more level-1 blocks in parallel. Next, we explain how to vectorize each step of (4.14).
We consider the procedure for the k-th level-1 block of color c, that is, the solution of (4.14) for ȳ_k^(c). Because of the level-2 block structure of L̄_k^(c), the substitution is split into b_s sequential steps. In the l-th step, the diagonal block D̄_l^(k,c) is diagonal, so the corresponding w unknowns can be calculated in parallel; the step consists of w independent calculations that can be efficiently vectorized. In other words, the l-th step consists of a simple matrix-vector multiplication and element-wise vector updates that are directly vectorized.
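To illustrate the b_s-step structure, the following sketch performs the forward substitution (4.14) for one level-1 block in a simplified case in which the off-diagonal level-2 blocks are also diagonal, so each SIMD lane owns an independent b_s × b_s lower triangle (a toy stand-in for L̄_k^(c); in the real matrix the off-diagonal level-2 blocks are general w × w blocks). The inner loop over the w lanes is the one that maps to SIMD.

```c
#include <assert.h>

/* Forward substitution for one level-1 block after HBMC, in a toy setting
 * where every level-2 block is diagonal: lane i owns its own bs x bs
 * lower triangle Ldense + i*bs*bs (row-major). y and r use the HBMC
 * layout: entry l*w + i is the l-th unknown of lane (BMC block) i.
 * The l-loop is sequential (bs steps); the i-loop is vectorizable. */
static void hbmc_block_forward(int bs, int w, const double *Ldense,
                               const double *r, double *y)
{
    for (int l = 0; l < bs; l++) {        /* bs sequential steps */
        for (int i = 0; i < w; i++) {     /* w independent lanes: SIMD */
            double s = r[l * w + i];
            for (int m = 0; m < l; m++)   /* previously computed steps */
                s -= Ldense[(i * bs + l) * bs + m] * y[m * w + i];
            y[l * w + i] = s / Ldense[(i * bs + l) * bs + l];
        }
    }
}
```

In the paper's implementation, the l-th step is a small matrix-vector product plus element-wise updates; here each lane's dependence on earlier steps plays that role.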
The backward substitution is parallelized (multithreaded) and vectorized in a similar manner, although it is performed inversely from color n c to 1.

4.4. Implementation of HBMC.
4.4.1. Reordering process. In this section, we discuss the reordering process. In the technique, any algorithm (heuristic) for an implementation of BMC can be used. In the application of BMC, we set the number of BMC blocks assigned to each thread to a multiple of w, except for one thread (typically the last-numbered one). Under this condition, the application of HBMC, that is, the secondary reordering from BMC, can be performed independently in each thread. Therefore, the reordering process is fully multithreaded.

4.4.2. Storage format.
In the implementation of a sparse triangular solver, a storage format for sparse matrices [28] is typically used. For example, the factorization matrices in an IC/ILU preconditioned solver are stored in memory using such a format. Although there are several standard formats, the sliced ELL (SELL) format [29] is the most efficient for exploiting the benefit of SIMD instructions, and we used it in our implementation. In the SELL format, the slice size is an important parameter. In HBMC, we naturally set the slice size to w because the forward and backward substitutions are vectorized every w rows. This leads naturally to the concept of the SELL-C-σ format [30], a sophisticated version of SELL.
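A minimal sketch of a CRS-to-SELL conversion with slice size w follows (the struct and function names are ours, not the paper's code). Short rows are padded with explicit zeros, and the values within a slice are laid out so that the j-th nonzeros of the w rows are contiguous, matching one vector load per SIMD step.

```c
#include <assert.h>
#include <stdlib.h>

/* Sliced ELL (SELL) storage with slice size w. Inside a slice, entry j of
 * the w rows is stored contiguously (column-major within the slice), so a
 * vectorized kernel can load one entry per SIMD lane with a single
 * vector load. Assumes n is a multiple of w; padding entries stay 0.0. */
typedef struct {
    int n, w;
    int *slice_ptr;   /* start of each slice in col/val: n/w + 1 entries */
    int *col;
    double *val;
} sell_matrix;

static void crs_to_sell(int n, int w, const int *row_ptr, const int *col_idx,
                        const double *vals, sell_matrix *s)
{
    int n_slices = n / w;
    s->n = n; s->w = w;
    s->slice_ptr = malloc((n_slices + 1) * sizeof(int));
    s->slice_ptr[0] = 0;
    for (int sl = 0; sl < n_slices; sl++) {
        int max_nnz = 0;                     /* longest row in this slice */
        for (int i = sl * w; i < (sl + 1) * w; i++) {
            int nnz = row_ptr[i + 1] - row_ptr[i];
            if (nnz > max_nnz) max_nnz = nnz;
        }
        s->slice_ptr[sl + 1] = s->slice_ptr[sl] + max_nnz * w;
    }
    s->col = calloc(s->slice_ptr[n_slices], sizeof(int));
    s->val = calloc(s->slice_ptr[n_slices], sizeof(double));
    for (int sl = 0; sl < n_slices; sl++)
        for (int i = sl * w; i < (sl + 1) * w; i++)
            for (int jj = row_ptr[i]; jj < row_ptr[i + 1]; jj++) {
                int j = jj - row_ptr[i];     /* j-th nonzero of row i */
                int pos = s->slice_ptr[sl] + j * w + (i - sl * w);
                s->col[pos] = col_idx[jj];
                s->val[pos] = vals[jj];
            }
}
```

The padding is what causes the extra processed elements discussed in Section 5.2 when row lengths within a slice are imbalanced.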

4.4.3. Multithreaded and vectorized substitutions.
The program for each forward and backward substitution consists of nested loops. The outermost loop is over the colors. After the computations for each color, thread synchronization is required; therefore, the number of synchronizations in each substitution is given by n_c − 1, which is the same as in BMC and the standard multi-color ordering (MC). The second loop is over the level-1 blocks. Because the level-1 blocks in each color are mutually independent, each thread processes one or more level-1 blocks in parallel. In each level-1 block, the substitution is split into b_s steps, each of which is vectorized with a SIMD width of w. For the vectorized substitution, we used the OpenMP SIMD directive or the Intel intrinsic functions for SIMD instructions. Figure 4.6 shows a sample C program code for the multithreaded and vectorized forward substitution using OpenMP and Intel AVX-512 intrinsic functions.
Additionally, we discuss the special nonzero pattern that appears in L̄_k^(c), the matrix that corresponds to the level-1 block. In this matrix, all nonzero elements lie on 2b_s − 1 diagonal lines. Although we could use a hybrid storage format that exploits this special structure, it does not typically result in better performance because of the additional cost of processing the diagonal block and the other off-diagonal elements separately. We confirmed this in preliminary tests.
Finally, we discuss data access locality. The access pattern for the vector elements in HBMC differs from that in BMC; therefore, the data access locality can differ between the two orderings. However, because the secondary reordering for HBMC is performed inside a level-1 block, the data access locality barely changes; at least from the viewpoint of the last-level cache, the two orderings can be considered similar.

5. Computers and Test Problems.
We conducted five numerical tests on three types of computational nodes to evaluate the proposed reordering technique in the context of the ICCG method: the computational nodes were Cray XC40, Cray CS400 (2820XT), and Fujitsu CX2550 (M4). The two Cray systems are operated by the Academic Center for Computing and Media Studies, Kyoto University, whereas the Fujitsu system is at the Information Initiative Center, Hokkaido University. Table 4.1 lists information about the computational nodes and compilers used. In the numerical tests, we used all cores of the computational node for execution.
The program code was written in C with OpenMP for the thread parallelization. For vectorization, we used the intrinsic functions of the Intel Advanced Vector Extensions. The AVX2 (256-bit SIMD) instruction set was used for the Xeon (Broadwell) processor, whereas the AVX-512 (512-bit SIMD) instruction set was used for the Xeon Phi (KNL) and Xeon (Skylake) processors. Although we also developed a vectorized program using the OpenMP SIMD directive, its performance was slightly worse than the version with the intrinsic functions in most of the test cases. Thus, in this paper, we only report the results obtained using the intrinsic functions.
For the test problems, we used a linear system that arises from finite element electromagnetic field analysis and four linear systems taken from the SuiteSparse Matrix Collection. We selected symmetric positive-definite matrices that are mainly derived from computational science or engineering problems and have a relatively large dimension compared with other matrices in the collection.
In the electromagnetic field analysis test, the linear system that arises from the finite element discretization of the IEEJ standard benchmark model [31] was used. The basic equation for the problem is given as

∇ × (ν ∇ × A_m) = J_0, (5.1)

where A_m, ν, and J_0 are the magnetic vector potential, magnetic reluctivity, and excitation current density, respectively. The analysis solved (5.1) using a finite edge-element method with hexahedral elements. Applying the Galerkin method to (5.1), we obtained the linear system of equations for the test problem. The resulting linear system was solved using the shifted ICCG method with the shift parameter set to 0.3. The dataset of this linear system is denoted by Ieej. Table 5.1 lists the matrix information for the test problems.
In this paper, we report a performance comparison of four multithreaded ICCG solvers. The solver denoted by "MC" is based on multi-color ordering, which is the most popular parallel ordering method. The solver "BMC" is based on the block multi-color ordering method. The solvers "HBMC (crs spmv)" and "HBMC (sell spmv)" are based on the proposed HBMC; the former uses the compressed row storage (CRS) format [28] for the implementation of the sparse matrix-vector multiplication (SpMV), and the latter uses the SELL format. In MC and BMC, the CRS format was used.
For the blocking method in BMC and HBMC, we used the simplest of the heuristics introduced in [13], in which the unknown with the minimal index is picked for the newly generated block. For the coloring of nodes or blocks, the greedy algorithm was used in all solvers. The convergence criterion was that the relative residual norm (2-norm) be less than 10^{-7}.
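For reference, the following is a minimal sketch of greedy coloring on an adjacency-matrix graph representation (our own simplification, not the paper's implementation): each node, visited in index order, receives the smallest color not already used by a previously colored neighbour.

```c
#include <assert.h>

#define MAXN 32

/* Greedy coloring of an undirected graph given as an adjacency matrix:
 * nodes are visited in index order and each receives the smallest color
 * not used by an already-colored neighbour. Writes color[i] in 0..nc-1
 * and returns nc, the number of colors used. Adjacent nodes always end
 * up in different colors, which is what makes the nodes (or blocks) of
 * one color mutually independent. */
static int greedy_coloring(int n, const int adj[][MAXN], int *color)
{
    int nc = 0;
    for (int i = 0; i < n; i++) {
        int used[MAXN] = {0};
        for (int j = 0; j < i; j++)
            if (adj[i][j]) used[color[j]] = 1;  /* colors taken by neighbours */
        int c = 0;
        while (used[c]) c++;                    /* smallest free color */
        color[i] = c;
        if (c + 1 > nc) nc = c + 1;
    }
    return nc;
}
```

Applied to blocks instead of nodes, the same routine yields the block coloring used by BMC and HBMC.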

5.2. Numerical results.
5.2.1. Equivalence of orderings in convergence and use of SIMD instructions. First, we examine the equivalence of BMC and HBMC in terms of convergence. Table 5.2 lists the numbers of iterations of the solvers tested on Cray XC40, where the block size of BMC and HBMC was set to 32. The equivalence was confirmed by the numerical results. Moreover, to examine the entire solution process, Figure 5.1 shows the convergence behavior of BMC and HBMC in the G3 circuit and Ieej tests. In the figure, the two lines of the relative residual norms overlap, which indicates that the solvers had equivalent solution processes. The equivalence of convergence was also confirmed in all test cases (five datasets × three block sizes × three computational nodes). Furthermore, Table 5.2 shows the advantage in convergence of BMC over MC, which coincides with the results reported in [13].
Next, we checked the use of SIMD instructions in the solver using the Intel VTune Amplifier (application performance snapshot) in the G3 circuit test conducted on Fujitsu CX2550. The snapshot showed that the percentage of packed floating-point instructions in the solver based on HBMC (sell spmv) reached 99.7%, whereas that in the solver using BMC was 12.7%.

5.2.2. Performance comparison. Table 5.3 (a) shows the performance comparison of the four solvers in the numerical tests on Cray XC40. In the tests, HBMC attained the best performance for all datasets except Audikw 1. In the Thermal2 and G3 circuit tests, HBMC (sell spmv) was more than two times faster than the standard MC solver. When HBMC (crs spmv) was compared with BMC, it attained better performance in 11 out of 15 cases (five datasets × three block sizes), which demonstrates the effectiveness of HBMC for the sparse triangular solver. Moreover, in all test cases, HBMC (sell spmv) outperformed HBMC (crs spmv), which implies that the efficient use of SIMD instructions was important on the Xeon Phi-based system. Table 5.3 (b) shows the test results on Cray CS400 (Xeon Broadwell). In these tests, HBMC attained the best performance for all datasets. When HBMC (crs spmv) was compared with BMC, it attained better performance in 13 out of 15 cases, which shows the effectiveness of HBMC. Table 5.3 (b) also indicates that using the SELL format for the coefficient matrix mostly improved solver performance. Table 5.3 (c) shows the test results on Fujitsu CX2550 (Xeon Skylake). In these tests, HBMC outperformed MC and BMC for four out of five datasets. For the Audikw 1 dataset, HBMC did not outperform BMC on Xeon Phi and Skylake, although it was better than BMC on Xeon Broadwell. This result is attributed to the effect of increasing the slice size, which is given by the SIMD width w. In the SELL format, some zero elements are treated as nonzero within a slice.
When there was a significant imbalance in the number of nonzero elements per row within a slice, the number of elements processed increased considerably compared with the implementation with the CRS format. The Audikw 1 dataset had this property: for Audikw 1, the number of processed elements in SELL increased by 40% compared with that in CRS, whereas the increase was only 10% for the G3 circuit dataset. For this type of dataset, when the slice size increases, the number of processed elements often increases. The slice size was set to 8 for the Xeon Phi and Skylake processors and 4 for the Xeon Broadwell processor. On the Broadwell processor, the increase in the number of elements when changing CRS to SELL was 28%, which resulted in better performance for HBMC compared with BMC. In the future, for further acceleration of the solver, we will develop an implementation that introduces remedies for this SELL issue, for example, splitting a row that includes an extremely large number of nonzero elements into two.

6. Related Works. The parallelization of the sparse triangular solver for iterative solvers has mainly been investigated in the context of GS or IC/ILU smoothers for multigrid solvers, the SOR method, and IC/ILU preconditioned iterative solvers. Most of these parallelization techniques are classified into two classes: domain decomposition type methods and parallel orderings [32]. A simple but commonly used technique in the former class is the additive Schwarz smoother or preconditioner. The hybrid Jacobi and GS smoother and block Jacobi IC/ILU preconditioning are typical examples, and they are used in many applications [33,34]. However, these techniques typically suffer from a decline in convergence when the number of threads (processes) is increased. Although there are some remedies for the deterioration in convergence, for example, the overlapping technique [35], it is generally difficult to compensate for it when many cores are used.
A parallel ordering or reordering technique is a standard method to parallelize the sparse triangular solver. We focus on the parallelization of IC/ILU preconditioners or smoothers; however, there are many studies that discuss the application of parallel ordering to the GS and SOR methods, for example, [36]. Ref. [32] provides an overview of early works on the parallel ordering method applied to IC/ILU preconditioned iterative solvers. Parallel orderings were mainly investigated in the context of structured grid problems (finite difference analyses), and the concepts of typical parallel ordering techniques, such as red-black, multi-color, zebra, domain decomposition (four- or eight-corner), and nested dissection orderings, were established in the 1990s. In these early research activities, a major issue was the trade-off problem between convergence and the degree of parallelism. After Duff and Meurant indicated the problem in [9], both analytical and numerical investigations were conducted in [8, 12, 27, 37-41]. The concept of equivalence of orderings and some remedies for the trade-off problem, such as the use of a relatively large number of colors in multi-color ordering or block coloring, were introduced as results of these research activities.
In practical engineering and science domains, unstructured problems are solved more frequently than structured problems. Therefore, parallel ordering techniques were enhanced for unstructured problems, and several heuristics were proposed. Typical examples are hierarchical interface decomposition (HID) [42] and heuristics for nodal or block multi-coloring [13][14][15]. These techniques and other related methods have been used in various application domains, such as CFD, computational electromagnetics, and structural analyses [16, 17, 43-45].
Finally, we review recently reported research results related to parallel linear solvers that involve sparse triangular solvers. Gonzaga de Oliveira et al. reported intensive numerical test results evaluating various reordering techniques in the ICCG method in [46]. Gupta introduced a blocking framework to generate a fast and robust preconditioner based on ILU factorization in [47]. Chen et al. developed ILU-based preconditioners on GPUs in [48]. Ruiz et al. reported evaluation results of HPCG implementations using nodal and block multi-color orderings on an ARM-based system in [22], which confirmed the superiority of the block coloring method.
In this paper, we propose a parallel ordering that differs from the techniques described above. To the best of our knowledge, no existing parallel ordering method vectorizes the sparse triangular solver while maintaining the same convergence and number of synchronizations as block multi-color ordering. Since the vectorization of SpMV has been intensively investigated [29,30], one conventional approach is the use of multi-color ordering, in which the substitution is represented as an SpMV in each color. However, multi-color ordering suffers from problems with convergence and data locality, as also indicated in the latest report [22]. Considering its numerical results and mathematical properties, the proposed hierarchical block multi-color ordering can be regarded as an effective technique for multithreading and vectorizing the sparse triangular solver.
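A minimal sketch of this conventional multi-color approach (Python/NumPy, hypothetical names; dense matrices for brevity): under a valid coloring, the within-color block of the lower triangular factor is diagonal, so the forward substitution reduces to one SpMV-like, vectorizable update per color:

```python
import numpy as np

def mc_forward_solve(L, b, colors):
    # Forward substitution under multi-color ordering. `colors` lists
    # the index sets color by color; a valid coloring makes the
    # within-color off-diagonal block of L zero, so updating one color
    # is a single SpMV-like, fully vectorizable step, with one thread
    # synchronization needed per color.
    x = np.zeros_like(b, dtype=float)
    d = np.diag(L)
    for c in colors:
        # x[c] is still zero here, so L[c] @ x only picks up
        # contributions from previously computed colors
        x[c] = (b[c] - L[c] @ x) / d[c]
    return x
```

The price of this formulation is the convergence and data-locality penalty of nodal multi-coloring noted above, which is what motivates the block-structured alternative.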
7. Conclusions. In this paper, we proposed a new parallel ordering method, hierarchical block multi-color ordering (HBMC), for vectorizing and multithreading the sparse triangular solver. HBMC was designed to maintain the advantages of block multi-color ordering (BMC) in terms of convergence and the number of synchronizations. In the method, the coefficient matrix is transformed into a matrix with a hierarchical block structure. The level-1 blocks are mutually independent in each color, which is exploited for multithreading. Within each level-2 block, the substitution is converted into w (= SIMD width) independent steps, which are efficiently processed by SIMD instructions. We demonstrated analytically the equivalence of HBMC and BMC in convergence. Furthermore, numerical tests were conducted to examine the proposed method using five datasets on three types of computational nodes. The numerical results confirmed the equivalence of the convergence of HBMC and BMC. Moreover, the tests indicated that HBMC outperformed BMC in 13 out of 15 test cases (five datasets × three systems), which confirms the effectiveness of the proposed method. In the best case (G3_circuit on the Cray XC40), HBMC was 2.3 times faster than BMC.
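The level-2 mechanism summarized above can be sketched schematically as follows (Python/NumPy, hypothetical names; w independent bidiagonal chains stand in for the reordered unknowns of a level-2 block — an illustration of the access pattern, not the authors' implementation):

```python
import numpy as np

def vectorized_chain_substitution(d, l, b):
    # d, l, b are (w, m) arrays describing w independent bidiagonal
    # lower triangular systems:
    #   d[k, j] * x[k, j] + l[k, j] * x[k, j-1] = b[k, j]
    # Since the w chains never couple, step j updates all chains at
    # once -- the w-wide, SIMD-friendly access pattern that HBMC
    # exposes inside a level-2 block.
    w, m = b.shape
    x = np.zeros((w, m))
    x[:, 0] = b[:, 0] / d[:, 0]
    for j in range(1, m):          # m sequential steps, each w wide
        x[:, j] = (b[:, j] - l[:, j] * x[:, j - 1]) / d[:, j]
    return x
```

Each step of the loop maps naturally onto one SIMD fused multiply-add over the w chains, which is the intended execution pattern on vector units.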
In the future, we will examine our technique for other application problems, particularly a large-scale multigrid application and the HPCG benchmark. Moreover, for further acceleration of the solver, we intend to introduce a sophisticated storage format or related techniques into our solver. As further research topics, we will examine the effect of other coloring and blocking strategies on the performance of the solver.
Appendix A. Equivalent reordering. In this paper, we only provide a sketch of the proof with respect to the ER condition because of space limitations. In the GS and SOR cases, the proof is given through a double use of induction. The outer induction is over the iteration steps. Within each iteration step, we apply induction again in a row-by-row manner. The proof is relatively straightforward.
In the IC/ILU preconditioning case, the proof consists of two parts. We first prove the equivalence of the preconditioners, that is, $\bar{L} = P_{\pi} L P_{\pi}^{T}$ and $\bar{U} = P_{\pi} U P_{\pi}^{T}$. We consider the right-looking factorization process with overwriting, and first prove that the updating process in the factorization is equivalent. Next, we show that the application of the updating process to $a_{i,j}$ is equivalent to that to $\bar{a}_{\pi(i),\pi(j)}$ (the $\pi(i)$-th row, $\pi(j)$-th column element of $\bar{A}$) when (3.5) holds. This leads to the equivalence of the preconditioners. In the second part, the equivalence of the substitutions is shown using induction, similar to the proof for the GS method.
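The chain of identities underlying the first part can be summarized as follows (a sketch in the notation above, where $M = LU$ and $\bar{M} = \bar{L}\bar{U}$ denote the original and reordered preconditioners and $P_{\pi}$ is orthogonal):

```latex
\bar{A} = P_{\pi} A P_{\pi}^{T}
\;\Longrightarrow\;
\bar{L} = P_{\pi} L P_{\pi}^{T}, \quad
\bar{U} = P_{\pi} U P_{\pi}^{T}
\;\Longrightarrow\;
\bar{M}^{-1}
= \bigl(\bar{L}\,\bar{U}\bigr)^{-1}
= P_{\pi} (L U)^{-1} P_{\pi}^{T}
= P_{\pi} M^{-1} P_{\pi}^{T}.
```

That is, applying the reordered preconditioner to a reordered vector yields the reordered result of the original preconditioner, so the preconditioned iterates coincide up to the permutation $\pi$.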