Parallel Aggregation Based on Compatible Weighted Matching for AMG

We focus on the extension of the MLD2P4 package of parallel Algebraic MultiGrid (AMG) preconditioners, with the objective of improving its robustness and efficiency when dealing with sparse linear systems arising from anisotropic PDE problems on general meshes. We present a parallel implementation of a new coarsening algorithm for symmetric positive definite matrices, which is based on a weighted matching approach. We discuss preliminary results obtained by combining this coarsening strategy with the AMG components available in MLD2P4, on linear systems arising from applications considered in the Horizon 2020 Project "Energy oriented Centre of Excellence for computing applications" (EoCoE).

Standard AMG methods rely on heuristics to drive the coarsening process among variables; for example, the strength-of-connection heuristic is derived from a characterization of the algebraically smooth vectors that is theoretically well understood only for M-matrices.
A new approach for coarsening sparse symmetric positive definite (s.p.d.) matrices has been proposed in recent papers [10,11], following some theoretical and algorithmic developments [5,14]: coarsening based on compatible weighted matching. It defines a pairwise aggregation of unknowns where each pair is the result of a maximum weight matching in the matrix adjacency graph. Specifically, the aggregation scheme uses a maximum product matching in order to enhance the diagonal dominance of a matrix representing the hierarchical complement of the resulting coarse matrix, thereby improving the convergence properties of a corresponding compatible relaxation scheme. The matched nodes are aggregated to form coarse index spaces, and piecewise constant or smoothed interpolation operators are applied for the construction of a multigrid hierarchy. No reference is made to any a priori knowledge on the matrix origin and/or any definition of strength of connection; however, information about the smooth error may be used to define edge weights assigned to the original matrix graph. More aggressive coarsening can be obtained by combining multiple steps of the pairwise aggregation.
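The pairwise aggregation step described above can be sketched in a few lines of C. The following is an illustrative sketch, not BootCMatch code: given a matching stored as an array match[i] holding the partner of node i (or -1 if i is unmatched), each matched pair and each unmatched node becomes one coarse aggregate, and the piecewise constant prolongator then has exactly one unit entry per fine row, so it can be represented by the fine-to-coarse map agg.

```c
#include <stddef.h>

/* Illustrative sketch (not the actual BootCMatch routine): build the
 * fine-to-coarse map induced by a pairwise matching.  match[i] is the
 * matched partner of node i, or -1 if i is unmatched.  On return,
 * agg[i] is the coarse index of fine node i; the piecewise constant
 * prolongator P is the n x nc Boolean matrix with P(i, agg[i]) = 1.
 * Returns nc, the dimension of the coarse space. */
int build_aggregates(int n, const int *match, int *agg)
{
    int nc = 0;                    /* number of coarse aggregates so far */
    for (int i = 0; i < n; i++) agg[i] = -1;
    for (int i = 0; i < n; i++) {
        if (agg[i] >= 0) continue; /* already assigned via its partner */
        agg[i] = nc;
        int j = match[i];
        if (j >= 0) agg[j] = nc;   /* matched pair -> same aggregate */
        nc++;
    }
    return nc;
}
```

Applying the same routine to the coarse graph realizes the "multiple steps of pairwise aggregation" mentioned above: two sweeps yield aggregates of up to four nodes, and so on.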
This method has been implemented in a C-language software framework called BootCMatch: Bootstrap AMG based on Compatible Weighted Matching, which incorporates the coarsening method in an adaptive algorithm that computes a composite AMG for general s.p.d. matrices with a prescribed convergence rate through a bootstrap process [10,12]. The main computational kernel of this coarsening approach is the computation of a maximum product matching, accounting for the largest fraction of the time needed to build the AMG preconditioner.
Let G = (V, E, C) be a weighted undirected graph associated with a symmetric matrix A, where V is the set of vertices, E the set of edges, and C = (c_ij), i,j = 1,...,n, the matrix of positive real edge weights. A matching in G is a subset of edges M ⊆ E such that no two edges share a vertex. The number of edges in M is the cardinality of the matching; a maximum cardinality matching contains the largest possible number of edges. A maximum product weighted matching is a matching that maximizes the product of the weights of its edges. By a simple weight transformation, the computation of a maximum product weighted matching can be formulated as a general maximum weight matching, that is, a matching that maximizes the sum of the weights of its edges; this is also called an assignment problem. Such problems are widely studied in combinatorial optimization and are often used in sparse direct linear solvers for reordering and scaling of matrices [13,17].
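The weight transformation is the standard logarithmic one: since all weights are positive, maximizing the product over a matching is equivalent to maximizing a sum,

```latex
\max_{M} \prod_{(i,j)\in M} c_{ij}
\;=\;
\max_{M} \exp\Big(\sum_{(i,j)\in M} \log c_{ij}\Big),
```

so setting the transformed weights to \(\tilde{c}_{ij} = \log c_{ij}\) turns the maximum product problem into an ordinary assignment problem with weights \(\tilde{c}_{ij}\).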
Three algorithms for maximum product weighted matching are used in BootCMatch:

MC64: the algorithm implemented in the MC64 routine of the HSL library (http://www.hsl.rl.ac.uk). It works on bipartite graphs, i.e., graphs where the vertex set is partitioned into two subsets V_r and V_c (for example, the row and column indices of A) and each (i, j) ∈ E connects i ∈ V_r and j ∈ V_c; it finds optimal matchings with a worst-case computational complexity O(n(n + nnz) log n), where n is the matrix dimension and nnz the number of nonzero entries.

Half-approximate: the algorithm described in [19]. It is a greedy algorithm capable of finding, with complexity O(nnz), a matching whose total weight is at least half the optimal weight and whose cardinality is at least half the maximum cardinality.

Auction-type: a version of the near-optimal auction algorithm, based on a notion of approximate optimality called ε-complementary slackness, first proposed by Bertsekas [3] and implemented in the SPRAL library as described in [17]. The original auction algorithm has a worst-case computational complexity O(n · nnz · log(cn)) in the case of integer weights, where c = max_ij |c_ij|; the SPRAL version, however, reduces the cost per iteration and the average number of iterations, producing a near-optimal matching at a much lower cost than the (optimal) MC64.
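To illustrate the flavor of the half-approximate approach, the sketch below implements the simplest greedy variant: sort the edges by decreasing weight and accept an edge whenever both endpoints are still free. Note that this sorted variant costs O(nnz log nnz) because of the sort, whereas the algorithm of [19] used in BootCMatch achieves O(nnz); both carry the same half-approximation guarantee on the total weight.

```c
#include <stdlib.h>

/* Illustrative sketch of greedy half-approximate weighted matching
 * (sorted variant, O(nnz log nnz); the linear-time algorithm of [19]
 * instead repeatedly picks locally dominant edges). */
typedef struct { int u, v; double w; } edge;

static int by_weight_desc(const void *a, const void *b)
{
    double wa = ((const edge *)a)->w, wb = ((const edge *)b)->w;
    return (wa < wb) - (wa > wb);   /* descending order of weight */
}

/* On return, match[i] is the partner of vertex i, or -1 if unmatched.
 * Returns the cardinality of the matching. */
int greedy_matching(int nvert, int nedges, edge *e, int *match)
{
    int card = 0;
    for (int i = 0; i < nvert; i++) match[i] = -1;
    qsort(e, nedges, sizeof(edge), by_weight_desc);
    for (int k = 0; k < nedges; k++) {
        /* accept the heaviest remaining edge if both endpoints are free */
        if (match[e[k].u] < 0 && match[e[k].v] < 0) {
            match[e[k].u] = e[k].v;
            match[e[k].v] = e[k].u;
            card++;
        }
    }
    return card;
}
```

On the weighted path 0-1-2-3 with weights 1, 2, 1, the greedy algorithm picks the middle edge (weight 2) and stops, while the optimal matching takes the two outer edges (total weight 2, cardinality 2): the guarantee on half the optimal weight is met, and the example shows why the result can be non-optimal.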
Here we discuss a parallel version of the coarsening algorithm described above and its integration into the library of parallel AMG preconditioners MLD2P4, to obtain preconditioners that are robust and efficient on large and sparse linear systems arising from anisotropic elliptic PDE problems discretized on general meshes. The algorithm is implemented with a decoupled approach, where each parallel process performs matching-based coarsening on the part of the matrix owned by the process itself. This requires an extension of the MLD2P4 software framework, in order to efficiently interface the BootCMatch functions implementing the coarsening and to combine the new functionality with the other AMG components of MLD2P4. Computational experiments have been performed on linear systems coming from applications in the Horizon 2020 Project "Energy oriented Centre of Excellence for computing applications" (EoCoE, http://www. eocoe.eu).

Interfacing MLD2P4 with BootCMatch
MLD2P4 is a package of parallel AMG and domain decomposition preconditioners, designed to provide scalable and easy-to-use preconditioners [6,7,8,9], which has been successfully exploited in different applications (see, e.g., [2,4]). It is based on the PSBLAS computational framework [16] and can be used with the PSBLAS Krylov solvers. MLD2P4 is written in Fortran 2003 using an object-oriented approach [15]; thanks to its layered software architecture, its classes and functionalities can be easily extended and reused. It is freely available from https://github.com/sfilippone/mld2p4-2.
Two classes are used to represent AMG preconditioners in MLD2P4: the base preconditioner and the multilevel preconditioner. The multilevel preconditioner holds the coarse matrices of the AMG hierarchy, the corresponding smoothers (stored as base preconditioners), and the prolongation and restriction operators. An aggregator object, encapsulated in the base preconditioner structure, hosts the aggregation method used for building the AMG hierarchy. The basic aggregation available in MLD2P4 is the decoupled smoothed aggregation algorithm presented in [20], where the final prolongator is obtained by applying a suitable smoother to a tentative prolongator. In particular, the method base_aggregator_build_tprol builds the tentative prolongator and stores other information about it and the corresponding restriction operator.
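For reference, in the smoothed aggregation of [20] the final prolongator P is obtained from the tentative prolongator \(\hat{P}\) by one step of damped Jacobi smoothing: with D = diag(A), a typical choice is

```latex
P = \left(I - \omega D^{-1} A\right) \hat{P},
\qquad
\omega = \frac{4}{3\,\rho(D^{-1}A)},
```

where \(\rho(\cdot)\) denotes the spectral radius. The damping value shown is the one commonly suggested in the smoothed aggregation literature; the actual default used by MLD2P4 may differ.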
The MLD2P4 basic_aggregator_type class can be extended to define new aggregation strategies, possibly adding attributes and methods or overriding the basic ones. In order to interface the aggregation algorithm implemented in BootCMatch, the class bcmatch_aggregator_type was defined; it extends basic_aggregator_type and provides the bcmatch_aggregator_build_tprol method, which overrides the basic method and performs matching-based aggregation. A schematic picture of the new class is given in Fig. 1.
When implementing the interface, we had to take into account that MLD2P4 is written in Fortran while BootCMatch is written in C, and that the two packages use different data structures for storing sparse matrices and vectors. Fortran 2003 provides a standardized way of creating procedures, variables and types that are interoperable with C. We exploited this feature by creating in MLD2P4 two derived types that are interoperable with C and correspond to the bcm_CSRMatrix and bcm_vector structures used in BootCMatch.
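The C-side view of such an interoperability layer can be sketched as follows; the field names here are assumptions for exposition, not the actual BootCMatch declarations. A Fortran 2003 derived type declared with BIND(C) and ISO_C_BINDING kinds matches a C struct like this member by member, so the local sparse matrix can be handed to the C coarsening routines without copying.

```c
/* Illustrative CSR container (field names are hypothetical, chosen to
 * mirror the usual compressed-sparse-row layout; the real BootCMatch
 * bcm_CSRMatrix may differ).  A BIND(C) Fortran derived type with
 * C_INT / C_DOUBLE / C_PTR components is interoperable with it. */
typedef struct {
    int     nrows;  /* number of rows                     */
    int     ncols;  /* number of columns                  */
    int     nnz;    /* number of stored nonzeros          */
    int    *irp;    /* row pointers, length nrows + 1     */
    int    *ja;     /* column indices, length nnz         */
    double *val;    /* nonzero values, length nnz         */
} csr_matrix_c;
```

A vector is handled analogously, with a length field and a pointer to the data.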
Furthermore, BootCMatch is sequential, and thus a strategy was needed to apply it in the parallel MLD2P4 environment. To this end, BootCMatch was interfaced with the decoupled parallel aggregation scheme available in MLD2P4: each process passes to the aggregator the local part of the matrix it holds, so that the aggregation algorithm runs on that submatrix and produces a local tentative prolongator. A subroutine called mld_base_map_to_tprol was then developed to create a global tentative prolongator from the local ones. Of course, the resulting parallel matching-based aggregation is generally different from the sequential one.
The base_aggregator_mat_asb method performs several actions: it applies a smoother to the tentative prolongator (if required), computes the restriction operator as R = P^T, and creates the coarse-level matrix A_c using the Galerkin approach, i.e., A_c = R A P. This method was overridden to take into account that the BootCMatch tentative prolongator has a more general form than the basic tentative prolongator implemented in MLD2P4. The base_aggregator_update_level method was also overridden by bcmatch_aggregator_update_level, which takes care of projecting onto the next coarse level the information needed by the matching-based aggregation.

Computational Experiments
In order to illustrate the behaviour of the parallel AMG preconditioners described in the previous sections, we show the results obtained on linear systems derived from two applications in the framework of the Horizon 2020 EoCoE Project.
A first set of linear systems comes from a groundwater modelling application developed at the Jülich Supercomputing Centre (JSC); it deals with the numerical simulation of the filtration of 3D incompressible single-phase flows through anisotropic porous media. The linear systems arise from the discretization of an elliptic equation with no-flow boundary conditions, modelling the pressure field, which is obtained by combining the continuity equation with Darcy's law. The discretization is performed by a cell-centered finite volume scheme (two-point flux approximation) on a Cartesian grid [1]. The systems considered in this work have s.p.d. matrices with dimension 10^6 and 6940000 nonzero entries distributed over seven diagonals. The anisotropic permeability tensor in the elliptic equation is randomly generated from a lognormal distribution with mean 1 and three standard deviation values, i.e., 1, 2 and 3, corresponding to the three systems M1-Filt, M2-Filt and M3-Filt. The systems were generated by using a Matlab code implementing the basics of reservoir simulations and can be regarded as simplified samples of systems arising in ParFlow, an integrated parallel watershed computational model for simulating surface and subsurface fluid flow, used at JSC.
A second set of linear systems comes from computational fluid dynamics simulations for wind farm design and management, carried out at the Barcelona Supercomputing Center (BSC) by using Alya, a multi-physics simulation code targeted at HPC applications [21]. The systems arise in the numerical solution of Reynolds-Averaged Navier-Stokes equations coupled with a modified k − ε model; the space discretization is obtained by using stabilized finite elements, while the time integration is performed by combining a backward Euler scheme with a fractional step method, which splits the computation of the velocity and pressure fields and thus requires the solution of two linear systems at each time step. Here we show the results concerning two systems associated with the pressure field, which are representative of systems arising in simulations with discretization meshes of different sizes. The systems, denoted by M1-Alya and M2-Alya, have s.p.d. matrices of dimensions 790856 and 2224476, with 20905216 and 58897774 nonzeros, respectively, corresponding to up to 27 entries per row.
All the systems were preconditioned by using an AMG Kcycle [18] with decoupled unsmoothed double-pairwise matching-based aggregation. One forward/backward Gauss-Seidel sweep was used as pre/post-smoother with systems M1-Filt, M2-Filt and M3-Filt; one block-Jacobi sweep, with ILU(0) factorization of the blocks, was applied as pre/post-smoother with M1-Alya and M2-Alya. UMFPACK (http://faculty.cse.tamu.edu/davis/suitesparse.html) was used to solve the coarsest-level systems, through the interface provided by MLD2P4. The experiments were performed by using the three matching algorithms described in Sect. 1. The truncated Flexible Conjugate Gradient method FCG(1) [18], implemented in PSBLAS, was chosen to solve all the systems, because of the variability introduced in the preconditioner by the Kcycle. The zero vector was chosen as starting guess and the iterations were stopped when the 2-norm of the residual achieved a reduction by a factor of 10^-6. A generalized row-block distribution of the matrices was used, computed by using the Metis graph partitioner (http://glaros.dtc.umn.edu/gkhome/metis/metis/overview).
The experiments were carried out on the yoda Linux cluster, operated by the Naples Branch of the CNR Institute for High-Performance Computing and Networking. Its compute nodes consist of 2 Intel Sandy Bridge E5-2670 8-core processors and 192 GB of RAM, connected via InfiniBand. Given the size and the sparsity of the linear systems, at most 64 cores, running as many parallel processes, were used; 4 cores per node were considered, according to the memory bandwidth requirements of the linear systems. PSBLAS 3.4 and MLD2P4 2.2, installed on top of MVAPICH 2.2, were used together with a development version of BootCMatch and the version of UMFPACK available in SuiteSparse 4.5.3. The codes were compiled with the GNU 4.9.1 compiler suite.
We first discuss the results for the systems M1-Filt, M2-Filt and M3-Filt. In Table 1 we report the number of iterations performed by FCG with the preconditioners based on the different matching algorithms, as the number of processes, np, varies. In Table 2 we report the corresponding execution times (in seconds) and speedup values. The times include the construction of the preconditioner and the solution of the preconditioned linear system. In general, the preconditioned FCG solver shows reasonable algorithmic scalability, i.e., the number of iterations for all systems does not vary too much with the number of processes. A larger variability in the iterations can be observed with M3-Filt, due to the higher anisotropy of this problem and its interaction with the decoupled aggregation strategy. A more in-depth analysis (not shown here for the sake of space) reveals that the coarsening ratio between consecutive AMG levels ranges between 3.6 and 3.8 for the MC64 and auction algorithms, while it is between 3.0 and 3.3 for the half-approximate one, except with 64 processes, where it reduces to 2.8 for M2-Filt and M3-Filt. None of the three matching algorithms produces preconditioners that are clearly superior in reducing the number of FCG iterations; indeed, for these systems there is no advantage in using the optimal matching algorithm implemented in MC64, since the non-optimal ones appear very competitive. The times corresponding to the half-approximate and auction algorithms are generally smaller, mainly because the time needed to build the AMG hierarchies is reduced. The speedup decreases as the anisotropy of the problem grows, due to the larger number of FCG iterations and the well-known memory-bound nature of our computations. The largest speedups are obtained with MC64, because of the larger time required by MC64 on a single core.
In the case of M1-Alya and M2-Alya, only MC64 is able to produce an effective coarsening (with a ratio greater than 3.8), hence avoiding a significant increase of the number of iterations when the number of processes grows. Therefore, in Table 3, we only report results (iterations, time and speedup) for this case. The preconditioned solver shows good algorithmic scalability in general; the number of iterations on a single core is much smaller because in this case the AMG smoother reduces to an ILU factorization. A speedup of 42.8 is achieved for M1-Alya, which reduces to 29.3 for the larger matrix M2-Alya; in our opinion, this can be considered satisfactory. Note that we did not achieve convergence to the desired accuracy by using the basic classical smoothed aggregation algorithm available in MLD2P4.
In conclusion, the results discussed here show the potential of parallel matching-based aggregation and provide an encouraging basis for further work in this direction, including the application of non-decoupled parallel matching algorithms and the design of ad-hoc coarsening strategies for some classes of smoothers, according to the principles of compatible relaxation [14,22].