Distributed-memory Hierarchical Interpolative Factorization

The hierarchical interpolative factorization ({HIF}) offers an efficient way for solving or preconditioning elliptic partial differential equations. By exploiting locality and low-rank properties of the operators, the HIF achieves quasi-linear complexity for factorizing the discrete elliptic operator and linear complexity for solving the associated linear system. In this paper, the distributed-memory {HIF} ({DHIF}) is introduced as a parallel and distributed-memory implementation of the HIF. The DHIF organizes the processes in a hierarchical structure and keep the communication as local as possible. The computation complexity is $O\left(\frac{N\log N}{P}\right)$ and $O\left(\frac{N}{P}\right)$ for constructing and applying the DHIF, respectively, where $N$ is the size of the problem and $P$ is the number of processes. The communication complexity is $O\left(\sqrt{P}\log^3 P\right)\alpha + O\left(\frac{N^{2/3}}{\sqrt{P}}\right)\beta$ where $\alpha$ is the latency and $\beta$ is the inverse bandwidth. Extensive numerical examples are performed on the {TACC Stampede} system with up to 4096 processes. The numerical results agree with the complexity analysis and demonstrate the efficiency and scalability of the {DHIF}.


Background
This paper proposes an efficient distributed-memory algorithm for solving elliptic partial differential equations (PDEs) of the form, with a certain boundary condition, where a(x) > 0, b(x) and f (x) are given functions and u(x) is an unknown function. Since this elliptic equation is of fundamental importance to problems in physical sciences, solving (1) effectively has a significant impact in practice. Discretizing this with local schemes such as the finite difference or finite element methods leads to a sparse linear system, where A ∈ R N ×N is a sparse symmetric matrix with O(N ) nonzero entries with N being the number of the discretization points, and u and f are the discrete approximations of the functions u(x) and f (x), respectively. For many practical applications, one often needs to solve (1) on a sufficient fine mesh for which N can be very large, especially for three-dimensional (3D) problems. Hence, there is a practical need for developing fast and parallel algorithms for the efficient solution of (1).

Previous work
A great deal of effort in the field of scientific computing has been devoted to the efficient solution of (2). Beyond the O(N 3 ) complexity naïve matrix inversion approach, one can classify the existing fast algorithms into the following groups. The first one consists of the sparse direct algorithms, which take advantage of the sparsity of the discrete problem. The most noticeable example in this group is the nested dissection multifrontal method (MF) method [14,16,26]. By carefully exploring the sparsity and the locality of the problem, the multifrontal method factorizes the matrix A (and thus A −1 ) as a product of sparse lower and upper triangular matrices. For 3D problems, the factorization step costs O(N 2 ) operations, while the application step takes O(N 4/3 ) operations. Many parallel implementations [3,4,30] of the multifrontal method were proposed and they typically work quite well for problem of moderate size. However, as the problem size goes beyond a couple of millions, most implementations, including the distributed-memory ones, hit severe bottlenecks in memory consumption.
The second group consists of iterative solvers [9,15,33,34], including famous algorithms such as the conjugate gradient (CG) method and the multigrid method. Each iteration of these algorithms typically takes O(N ) steps and hence the overall cost for solving (2) is proportional to the number of iterations required for convergence. For problems with smooth coefficient functions a(x) and b(x), the number of iterations typically remains rather small and the optimal linear complexity is achieved. However, if the coefficient functions lack regularity or have high contrast, the iteration number typically grows quite rapidly as the problem size increases.
The third group contains the methods based on structured matrices [6][7][8]11]. These methods, for example, the H-matrix [18,20], the H 2 -matrix [19], and the hierarchically semi-separable matrix (HSS) [10,42], are shown to have efficient approximations of linear or quasi-linear complexity for the matrices A and A −1 . As a result, the algebraic operations of these matrices are of linear or quasi-linear complexities as well. More specifically, the recursive inversion and the rocket-style inversion [1] are two popular methods for the inverse operation. For distributed-memory implementations, however, the former lacks parallel scalability [24,25], while the latter demonstrates scalability only for 1D and 2D problems [1]. For 3D problems, these methods typically suffer from large prefactors that make them less efficient for practical large-scale problems.
A recent group of methods explores the idea of integrating the MF method with the hierarchical matrix [17,21,28,32,[38][39][40][41] or block low-rank matrix [2,35,36] approach in order to leverage the efficiency of both methods. Instead of directly applying the hierarchical matrix structure to the 3D problems, these methods apply it to the representation of the frontal matrices (i.e., the interactions between the lower-dimensional fronts). These methods are of linear or quasi-linear complexities in theory with much small prefactors. However, due to the combined complexity, the implementation is highly non-trivial and quite difficult for parallelization [27,43].
More recently, the hierarchical interpolative factorization (HIF) [22,23] is proposed as a new way for solving elliptic PDEs and integral equations. As compared to the multifrontal method, the HIF includes an extra step of skeletonizing the fronts in order to reduce the size of the dense frontal matrices. Based on the key observation that the number of skeleton points on each front scales linearly as the one-dimensional fronts, the HIF factorizes the matrix A (and thus A −1 ) as a product of sparse matrices that contains only O(N ) nonzero entries in total. In addition, the factorization and application of the HIF are of complexities O(N log N ) and O(N ), respectively, for N being the total number of degrees of freedom (DOFs) in (2). In practice, the HIF shows significant saving in terms of computational resources required for 3D problems.

Contribution
This paper proposes the first distributed-memory hierarchical interpolative factorization (DHIF) for solving very large-scale problems. In a nutshell, the DHIF organizes the processes in an octree structure in the same way that the HIF partitions the computation domain. In the simplest setting, each leaf node of the computation domain is assigned a single process. Thanks to the locality of the operator in (1), this process only communicates with its neighbors and all algebraic computations are local within the leaf node. At higher levels, each node of the computation domain is associated with a process group that contains all processes in the subtree starting from this node. The computations are all local within this process group via parallel dense linear algebra, and the communications are carried out between neighboring process groups. By following this octree structure, we make sure that both the communication and computations in the distributed-memory HIF are evenly distributed. As a result, the distributed-memory HIF implementation achieves O( N log N P ) and O( N P ) parallel complexity for constructing and applying the factorization, respectively, where N is the number of DOFs and P is the number of processes.
We have performed extensive numerical tests. The numerical results support the complexity analysis of the distributed-memory HIF and suggest that the DHIF is a scalable method up to thousands of processes and can be applied to solve large-scale elliptic PDEs.

Organization
The rest of this paper is organized as follows. In Sect. 2, we review the basic tools needed for both HIF and DHIF, and review the sequential HIF. Section 3 presents the DHIF as a parallel extension of the sequential HIF for 3D problems. Complexity analyses for memory usage, computation time, and communication volume are given at the end of this section. The numerical results detailed in Sect. 4 show that the DHIF is applicable to large-scale problems and achieves parallel scalability up to thousands of processes. Finally, Sect. 5 concludes with some extra discussions on future work.

Preliminaries
This section reviews the basic tools and the sequential HIF. First, we start by listing the notations that are widely used throughout this paper.

Notations
In this paper, we adopt MATLAB notations for simple representation of submatrices. For example, given a matrix A and two index sets, s 1 and s 2 , A(s 1 , s 2 ) represents the submatrix of A with the row indices in s 1 and column indices in s 2 . The next two examples explore the usage of MATLAB notation ":." With the same settings, A(s 1 , :) represents the submatrix of A with row indices in s 1 and all columns. Another usage of notation ":" is to create regularly spaced vectors for integer values i and j, for instance, i : j is the same as In order to simplify the presentation, we consider the problem (1) with periodic boundary condition and assume that the domain Ω = [0, 1) 3 and is discretized with a grid of size n × n × n for n = 2 L m, where L = O(log n) and m = O(1) are both integers. In the rest of this paper, L + 1 is known as the number of levels in the hierarchical structure and L is the level number of the root level. We use N = n 3 to denote the total number of DOFs, which is the dimension of the sparse matrix A in (2). Furthermore, each grid point x j is defined as where h = 1/n, j = (j 1 , j 2 , j 3 ) and 0 ≤ j 1 , j 2 , j 3 < n.
In order to fully explore the hierarchical structure of the problem, we recursively bipartite each dimension of the grid into L + 1 levels. Let the leaf level be level 0 and the root level be level L. At level , a cell indexed with j is of size m2 × m2 × m2 and each point in the cell is in the range, m2 j 1 + (0 : m2 − 1) × m2 j 2 + (0 : m2 − 1) × m2 j 3 + (0 : m2 − 1) , for j = (j 1 , j 2 , j 3 ) and 0 ≤ j 1 , j 2 , j 3 < 2 L− . C j denotes the grid point set of the cell at level indexed with j.
A cell C j owns three faces: top, front, and left. Each of these three faces contains the grid points on the first frame in the corresponding direction. For example, the front face contains the grid points in m2 j 1 + (0 : m2 − 1) × m2 j 2 × m2 j 3 + (0 : m2 − 1) . Besides these three in-cell faces (top, front, and left) that are owned by a cell, each cell is also adjacent to three out-of-cell faces (bottom, back, right) owned by its neighbors. Each of these three faces contains the grid points on the next to the last frame in the corresponding dimension. As a result, these faces contain DOFs that belong to adjacent cells. For example, the bottom face of C j contains the grid points in m2 (j 1 + 1) × m2 j 2 + (0 : m2 − 1) × m2 j 3 + (0 : m2 − 1) . These six faces are the surrounding faces of C j . One also defines the interior of C j to be I j = m2 j 1 + (1 : m2 − 1) × m2 j 2 + (1 : m2 − 1) × m2 j 3 + (1 : m2 − 1) for the same j = (j 1 , j 2 , j 3 ) and 0 ≤ j 1 , j 2 , j 3 < 2 L− . Figure 1 shows an illustration of a cell, its faces, and its interior. These definitions and notations are summarized in Table 1. Also included here are some notations used for the processes, which will be introduced later.

Sparse elimination
Suppose that A is a symmetric matrix. The row/column indices of A are partitioned into three sets I F R where I refers to the interior point set, F refers to the surrounding face point set, and R refers to the rest point set. We further assume that there is no interaction between the indices in I and the ones in R. As a result, one can write A in the following form Let the LDL T decomposition of A II be A II = L I D I L T I , where L I is lower triangular matrix with unit diagonal. According to the block Gaussian elimination of A given by (4),  Grid gap size Level number in the hierarchical structure L Level number of the root level in the hierarchical structure e 1 , e 2 , e 3 Unit vector along each dimension 0 Zero vector Point on the grid indexed with j Ω The set of all points on the grid C j Cell at level with index j C C = {C j } j is the set of all cells at level The set of active DOFs at level with process group index j p j , p The process group at level with/without index j one defines the sparse elimination to be where B FF = A FF − A FI A −1 II A T FI is the associated Schur complement and the explicit expressions for S I is The sparse elimination removes the interaction between the interior points I and the corresponding surrounding face points F and leaves A RF and A RR untouched. We call the entire point set, I F R, the active point set. Then, after the sparse elimination, the interior points are decoupled from other points, which is conceptually equivalent to eliminate the interior points from the active point set. After this, the new active point set can be regarded as F R. Figure 2 illustrates the impact of the sparse elimination. The dots in the figure represent the active points. Before the sparse elimination (left), edge points, face points, and interior points are active, while after the sparse elimination (right) the interior points are eliminated from the active point set.

Skeletonization
Skeletonization is a tool for eliminating redundant point set from a symmetric matrix that has low-rank off-diagonal blocks. The key step in skeletonization uses the interpolative decomposition [12,29] of low-rank matrices.
Let A be a symmetric matrix of the form, where A RF is a numerically low-rank matrix. The interpolative decomposition of A RF is (up to a permutation) where T F is the interpolation matrix, F is the skeleton point set, F is the redundant point set, and F = F F . Applying this approximation to A results and be symmetrically factorized as The factor Q F is generated by the block Gaussian elimination, which is defined to be Meanwhile, the factor S F is introduced in the sparse elimination: to what happens in Sect. 2.2, the skeletonization eliminates the redundant point set F from the active point set.
The point elimination idea of the skeletonization is illustrated in Fig. 3. Before the skeletonization (left), the edge points, interior points, skeleton face points (red), and redundant face points (pink) are all active, while after the skeletonization (right) the redundant face points are eliminated from the active point set.

Sequential HIF
This section reviews the sequential hierarchical interpolative factorization (HIF) for 3D elliptic problems (1) with the periodic boundary condition. Without loss of generality, we discretize (1) with the seven-point stencil on a uniform grid, which is defined in Sect. 2.1.
, and u j approximates the unknown function u(x) at x j . The corresponding linear system is We first introduce the notion of active and inactive DOFs.
is a diagonal matrix and A¯ is a zero matrix.  (10). With these notations, the sequential HIF in [22] is summarized as follows. A more illustrative representation of the sequential HIF is given on the left column of Fig. 5.
• Preliminary Let A 0 = A be the sparse symmetric matrix in (17), 0 be the initial active DOFs of A, which includes all indices. • Level for = 0, . . . , L − 1.
-Preliminary Let A denote the matrix before any elimination at level .
is the corresponding active DOFs. Let us recall the notations in Sect. 2.1. C j denotes the active DOFs in the cell at level indexed with j. F j and I j denote the surrounding faces and interior active DOFs in the corresponding cell, respectively.
-Sparse elimination We first focus on a single cell at level indexed with j, i.e., C j . To simplify the notation, we drop the superscript and subscript for now and introduce C = C j , I = I j , F = F j , and R = R j . Based on the discretization and previous level eliminations, the interior active DOFs interact only with itself and its surrounding faces. The interactions of the interior active DOFs and the rest DOFs are empty and the corresponding matrix is zero, A (R, I) = 0. Hence, by applying sparse elimination, we have, where the explicit definitions of B FF and S I are given in the discussion of sparse elimination. This factorization eliminates I from the active DOFs of A . Looping over all cells C j at level , we obtain Now all the active interior DOFs at level are eliminated from . -Skeletonization Each face at level not only interacts within its own cell but also interacts with faces of neighbor cells. Since the interaction between any two different faces is low rank, this leads us to apply skeletonization. The skeletonization for any face F ∈ F gives,  (21). Looping over all faces at level , we obtain where W = I∈I S I F ∈F S F Q F . The active DOFs for the next level are now defined as, • Level L Finally, A L and L are the matrix and active DOFs at level L. Up to a permutation, A L can be factorized as Combining all these factorization results and A −1 is factorized into a multiplicative sequence of matrices W and each W corresponding to level is again a multiplicative sequence of sparse matrices, S I , S F and Q F . Due to the fact that any S I , S F or Q F contains a small non-trivial (i.e., neither identity nor empty) matrix of size O( N 1/3 2 L− )×O( N 1/3 2 L− ), the overall complexity for strong and applying W is O(N /2 ). Hence, the application of the inverse of A is of O(N ) computation and memory complexity.

Distributed-memory HIF
This section describes the algorithm for the distributed-memory HIF.

Process tree
For simplicity, assume that there are 8 L processes. We introduce a process tree that has L + 1 levels and resembles the hierarchical structure of the computation domain. Each node of this process tree is called a process group. First at the leaf level, there are 8 L leaf process groups denoted as {p 0 j } j . Here j = (j 1 , j 2 , j 3 ), 0 ≤ j 1 , j 2 , j 3 < 2 L , and the superscript 0 refers to the leaf level (level 0). Each group at this level only contains a single process. Each node at level 1 of the process tree is constructed by merging 8 leaf processes. More precisely, we denote the process group at level 1 as p 1 j for j = (j 1 , j 2 , j 3 ), 0 ≤ j 1 , j 2 , j 3 < 2 L−1 , and p 1 j = j c /2 =j p 0 j c . Similarly, we recursively define the node at level as p j = j c /2 =j p −1 j c . Finally, the process group p L 0 at the root includes all processes. Figure 4 illustrates the process tree. Each cube in the process tree is a process group. Discretizing (1) with seven-point stencil on the grid provides the linear system Au = f , where A is a sparse N × N SPD matrix, u ∈ R N is the unknown function at grid points, and f ∈ R N is the given function at grid points.

Distributed-memory method
Given the process tree (Sect. 3.1) with 8 L processes and the sequential HIF structure (Sect. 2.4), the construction of the distributed-memory hierarchical interpolative factorization (DHIF) consists of the following steps.
• Preliminary Construct the process tree with 8 L processes. Each process group p 0 j owns the data corresponding to cell C 0 j and the set of active DOFs in C 0 j is denoted as 0 j , where j = (j 1 , j 2 , j 3 ) and 0 ≤ j 1 , j 2 , j 3 < 2 L . Set A 0 = A and let the process group p 0 j own A 0 (:, 0 j ), which is a sparse tall-skinny matrix with O(N /P) nonzero entries. • Level for = 0, . . . , L − 1. -Preliminary Let A denote the matrix before any elimination at level . j denotes the active DOFs owned by the process group p j for j = (j 1 , j 2 , j 3 ), 0 ≤ j 1 , j 2 , j 3 < 2 L− , and the nonzero submatrix of A (:, j ) is distributed among the process group p j using the two-dimensional block-cyclic distribution.
-Sparse elimination The process group p j owns A (:, j ), which is sufficient for performing sparse elimination for I j . To simplify the notation, we define I = I j as the active interior DOFs of cell C j , F = F j as the surrounding faces, and R = R j as the rest active DOFs. Sparse elimination at level within the process group p j performs essentially where -Skeletonization For each face F owned by p j , the corresponding matrices A (:, F) are stored locally. Similar to the parallel sparse elimination part, most operations are local at the process group p j and can be carried out using the dense parallel linear algebra efficiently. By forming a parallel interpolative decomposition (ID) for A RF = A R F T F A R F , the parallel skeletonization can be, conceptually, written as where the definitions of Q F and S F are given in the discussion of skeletonization.
, A F F and T F are all owned by p j , it requires only local operations to form Similarly, L F , which is the LDL T factor of B F F , is also formed within the process group p j . Moreover, since non-trivial blocks in Q F and S F are both local, this implies that the applications of Q F and S F are local operations. As a result, the parallel skeletonization factorizes A conceptually as and we can define We would like to emphasize that the factors W are evenly distributed among the process groups at level and that all non-trivial blocks are stored locally.
-Merging and redistribution Toward the end of the factorization at level , we need to merge the process groups and redistribute the data associated with the active DOFs in order to prepare for the work at level +1. For each process group at level + 1, p +1 j , for j = (j 1 , j 2 , j 3 ), 0 ≤ j 1 , j 2 , j 3 < 2 L− −1 , we first form its active , the complexities for message and bandwidth are bounded by the cost for parallel dense linear algebra. Actually, as we shall see in the numerical results, its cost is far lower than that of the parallel dense linear algebra.
• Level L factorization The parallel factorization at level L is quite similar to the sequential one. After factorizations from all previous levels, Consequently, we form the DHIF for A and A −1 as and The factors, W are evenly distributed among all processes and the application of F −1 is basically a sequence of parallel dense matrix-vector multiplications.
In Fig. 5, we illustrate an example of DHIF for problem of size 24 × 24 × 24 with m = 6 and L = 2. The computation is distributed on a process tree with 4 3 = 64 processes. Particularly, Fig. 5 highlights the DOFs owned by process groups involving p 0 (0,1,0) , i.e., p 0 (0,1,0) , p 1 (0,0,0) , and p 2 (0,0,0) . Yellow points denote interior active DOFs, blue and brown points denote face active DOFs, and black points denote edge active DOFs. Meanwhile, we also have unfaded and faded groups of points. Unfaded points are owned by the process groups involving p 0 (0,1,0) . In other words, p 0 (0,1,0) is the owner for part of the unfaded points. The faded points are owned by other process groups. In the second row and the fourth row, we also see faded brown points, which indicates the required communication to process p 0 (0,1,0) . Here Fig. 5 works through two levels of the elimination processes of the DHIF step by step.

Memory complexity
There are two places in the distributed algorithm that require heavy memory usage. The first one is to store the original matrix A and its updated version A for each level .  on 8 processes, the overall memory requirement for each W on a process is O( N P · 2 − ). Therefore, O( N P ) memory is required on each process to store all W s.

Computation complexity
The majority of the computation work goes to the construction of S I , Q F , and S F . As stated in the previous section, at level , each non-trivial dense matrix in these factors The construction adopts the matrix-matrix multiplication, the interpolative decomposition (pivoting QR), the LDL T factorization, and the triangular matrix inversion. Each of these operation is of cubic computation complexities and the corresponding parallel computation cost over 8 processes is O( N P ). Since there are only a constant number of these operations per process at a single level, the total computational complexity across all O(log N ) levels is O (   N log N  P ). The application computational complexity is simply the complexity of applying each nonzero entries in W s once, hence, the overall computational complexity is the same as the memory complexity O( N P ).

Communication complexity
The communication complexity is composed of three parts: the communication in the parallel dense linear algebra, the communication after sparse elimination, and the merging and redistribution step within DHIF. It is clear to see that the communication cost for the second part is bounded by either of the rest. Hence, we will simply derive the communication cost for the first and third parts. Here, we adopt the simplified communication model, T comm = α + β, where α is the latency and β is the inverse bandwidth. At level , the parallel dense linear algebra involves the matrix-matrix multiplication, the ID, the LDL T factorization, and the triangular matrix inversion for matrices of size All these basic operations are carried out on a process group of size 8 . Following the discussion in [5], the communication cost for these operations is bounded by Summing over all levels, one can control the communication cost of the parallel dense linear algebra part by On the other hand, the merging and redistribution step at level involves 8 +1 processes and redistributes matrices of The current implementation adopts the MPI routine MPI_AllToAll to handle the redistribution on a 2D process mesh. Further, we assume the all-to-all communication sends and receives long messages. The standard upper bound for the cost of this routine is O( [37]. Therefore, the overall cost is The complexity of the latency part is not scalable. However, empirically, the cost for this communication is relatively small in the actual running time.

Numerical results
Here we present a couple of numerical examples to demonstrate the parallel efficiency of the distributed-memory HIF. The algorithm is implemented in C++11 and all inter-process communication is expressed via the message passing interface (MPI). The distributed-memory dense linear algebra computation is done through the Elemental library [31]. All numerical examples are performed on Edison at the National Energy Research Scientific Computing center (NERSC). The numbers of processes used are always powers of two, ranging from 1 to 8192. The memory allocated for each process is limited to 2 GB.
All numerical results are measured in two ways: the strong scaling and weak scaling. The strong scaling measurement fixes the problem size and increases the number of processes. For a fixed problem size, let T S P be the running time of P processes. The strong scaling efficiency is defined as, In the case that T S 1 is not available, e.g., the fixed problem cannot fit into the single process memory, we adopt the first available running time, T S m , associating with the smallest number of processes, m, as a reference. And the modified strong scaling efficiency is, The weak scaling measurement fixes the ratio between the problem size and the number of processes. For a fixed ratio, we define the weak scaling efficiency as, where T W m is the first available running time with m processes and T W P is the running time of P processes.
The notations used in the following tables and figures are listed in Table 2. For simplicity, all examples are defined over Ω = [0, 1) 3 with periodic boundary condition, discretized on a uniform grid, n×n×n, with n being the number of points in each dimension and N = n 3 . The PDEs defined in (1) are discretized using the second-order central difference method with seven-point stencil, which is the same as (16). Octrees are adopted to partition the computation domain with the block size at leaf level bounded by 64.

Example 1
We first consider the problem in (1) with a(x) ≡ 1 and b(x) ≡ 0.1. The relative precision of the ID is set to be = 10 −3 . Table 3, given the tolerance = 10 −3 the relative error remains well below this for all N and P. The number of skeleton points on the root level, | L |, grows linearly as the one-dimensional problem size increases. The empirical linear scaling of the root skeleton size strongly supports the quasi-linear scaling for the factorization, linear scaling  for the application, and linear scaling for memory cost. The column labeled with m f in Table 3, or alternatively Fig. 6c, illustrates the perfect strong scaling for the memory cost. Since the bottleneck for most parallel algorithms is the memory cost, this point is especially important in practice. Perfect distribution of the memory usage allows us to solve very large problem on a massive number of processes, even through the communication penalty on massive parallel computing would be relatively large. The factorization time and application time show good scaling up to thousands of processes. Figure 6a, b presents the strong scaling plot for the running time of factorization and application, respectively. Together with Fig. 6d, which illustrates the timing for each part of the factorization, we conclude that the communication cost beside the parallel dense linear algebra (labeled with "El") remains small comparing to the cost of the parallel dense linear algebra. It is the parallel dense linear algebra part that stops the strong scaling. As it is also well known that parallel dense linear algebra achieves good weak scaling, so does our DHIF imple-  Table 3 shows the number of iterations for solving Au = f using the GMRES algorithm with a relative tolerance of 10 −12 and with the DHIF as a preconditioner. As the numbers in the entire column are equal to 6, this shows that DHIF serves an excellent preconditioner with the iteration number almost independent of the problem size.

Example 2
The second example is a problem of (1) with high-contrast random field a(x) and b(x) ≡ 0.1. The high-contrast random field a(x) is defined as follows, 1. Generate uniform random value a j between 0 and 1 for each discretization point; 2. Convolve the random value a j with an isotropic three-dimensional Gaussian with standard deviation 1; 3. Quantize the field via a j = 0.1, a j ≤ 0.5 1000, a j > 0.5 .
The given tolerance is set to be 10 −5 . Figure 7 shows a slice in a realization of the random field. The corresponding matrix A is clearly of high contrast. Solving such a problem is harder than Example 1 due to the raise of the condition number. The performance results of our algorithm are presented in Table 4. As we expect, the relative error for solving is lower than that in Table 3 and the number of iterations in GMRES is higher. Table 4 and Fig. 8 demonstrate the efficiency of the DHIF for high-contrast random field. Almost all comments regarding the numerical results in Example 1 apply here. To focus on the difference between Example 1 and Example 2, the most noticeable difference is about the relative error, e s . Though we give a higher relative precision = 10 −5 , the relative error for Example 2 is about 3 × 10 −3 , which is about ten times larger than e s in Example 1. The reason for the decrease of accuracy is most likely the increase of the condition number for Example 2. This also increases the number of iterations in GMRES. However, both e s and n iter remain roughly constant for varying problem sizes. This means that DHIF still serves as a robust and efficient solver and preconditioner for such problems. Another difference is the number of skeleton points on the root level, | L |. Due to the fact that the field a(x) is random, and different rows in Table 4 actually adopt different realizations, the small fluctuation of | L | for the same N and different P is expected. Overall | L | still grows linearly as n = N 1/3 increases. This again supports the complexity analysis given above.

Example 3
The third example provides a concrete comparison between the proposed DHIF and multigrid method (hypre [13]). The problem behaves similar as example 2 without randomness, (1) with high-contrast field a(x) and b(x) ≡ 0.1. The high-contrast field a(x) is defined as follows, where n is the number of grid points on each dimension.  We adopt GMRES iterative method in both DHIF and hypre to solve the elliptic problem to a relative error 10 −12 . The given tolerance in DHIF is set to be 10 −4 . And SMG interface in hypre is used as preconditioner for the problem on regular grids. The numerical results for DHIF and hypre are given in Table 5.
As we can read from Table 5, there are a few advantages of DHIF over hypre in the given settings. First, DHIF is faster than hypre's SMG except for small problems with small numbers of processes. And the number of iterations grows as the problem size grows in hypre, while it remains almost the same in DHIF. In truly large problems, the advantages of DHIF are more pronounced. Second, the scalability of DHIF appears to be better than that of hypre's SMG. Finally, DHIF only requires powers of two numbers of processes, whereas hypre's SMG requires powers of eight for 3D problems.

Conclusion
In this paper, we introduced the distributed-memory hierarchical interpolative factorization (DHIF) for solving discretized elliptic partial differential equations in 3D. The computational and memory complexity for DHIF is respectively, where N is the total number of DOFs and P is the number of processes. The communication cost is where α is the latency and β is the inverse bandwidth. Not only the factorization is efficient, the application can also be done in O( N P ) operations. Numerical examples in Sect. 4 illustrate the efficiency and parallel scaling of the algorithm. The results show that DHIF can be used both as a direct solver and as an efficient preconditioner for iterative solvers.
We have described the algorithm using the periodic boundary condition in order to simplify the presentation. However, the implementation can be extended in a straightforward way to problems with other type of boundary conditions. The discretization adopted here is the standard Cartesian grid. For more general discretizations such as finite element methods on unstructured meshes, one can generalize the current implementation by combining with the idea proposed in [36].
Here we have only considered the parallelization of the HIF for differential equations. As shown in [22], the HIF is also applicable to solving integral equations with non-oscillatory kernels. Parallelization of this algorithm is also of practical importance.