In order to illustrate the behaviour of the parallel AMG preconditioners described in the previous sections, we show the results obtained on linear systems derived from two applications in the framework of the Horizon 2020 EoCoE Project.
A first set of linear systems comes from a groundwater modelling application developed at the Jülich Supercomputing Centre (JSC); it deals with the numerical simulation of the filtration of 3D incompressible single-phase flows through anisotropic porous media. The linear systems arise from the discretization of an elliptic equation with no-flow boundary conditions, modelling the pressure field, which is obtained by combining the continuity equation with Darcy’s law. The discretization is performed by a cell-centered finite volume scheme (two-point flux approximation) on a Cartesian grid [1]. The systems considered in this work have s.p.d. matrices of dimension \(10^6\), with 6940000 nonzero entries distributed over seven diagonals. The anisotropic permeability tensor in the elliptic equation is randomly generated from a lognormal distribution with mean 1 and three standard deviation values, i.e., 1, 2 and 3, corresponding to the three systems M1-Filt, M2-Filt and M3-Filt. The systems were generated by using a MATLAB code implementing the basics of reservoir simulation and can be regarded as simplified samples of the systems arising in ParFlow, an integrated parallel watershed computational model for simulating surface and subsurface fluid flow, used at JSC.
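To give an idea of the structure of these matrices, the following minimal sketch assembles a seven-diagonal matrix of the same kind on a small Cartesian grid, using a two-point flux approximation with harmonic averaging of a lognormal cell permeability. It is not the MATLAB code used to generate M1-Filt, M2-Filt and M3-Filt; the grid size, the unit spacing, the scalar permeability and the lognormal parametrization are only illustrative assumptions.

```python
# Illustrative sketch (not the original MATLAB generator): cell-centered TPFA assembly
# of a seven-diagonal matrix on a Cartesian grid with a lognormal permeability field.
# Unit grid spacing and a scalar (isotropic) permeability per cell are assumed here.
import numpy as np
import scipy.sparse as sp

def tpfa_matrix(nx, ny, nz, mean=1.0, std=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # Lognormal cell permeability with prescribed mean and standard deviation
    # (this parametrization of the underlying normal is our assumption).
    s2 = np.log(1.0 + (std / mean) ** 2)
    K = np.exp(np.log(mean) - 0.5 * s2 + np.sqrt(s2) * rng.standard_normal((nx, ny, nz)))

    n = nx * ny * nz
    idx = np.arange(n).reshape(nx, ny, nz)
    rows, cols, vals = [], [], []
    diag = np.zeros(n)

    def add_faces(Kl, Kr, il, ir):
        # Face transmissibility: harmonic average of the two adjacent cell permeabilities.
        # No-flow boundary faces simply contribute nothing.
        t = (2.0 * Kl * Kr / (Kl + Kr)).ravel()
        i, j = il.ravel(), ir.ravel()
        rows.extend([i, j]); cols.extend([j, i]); vals.extend([-t, -t])
        np.add.at(diag, i, t); np.add.at(diag, j, t)

    add_faces(K[:-1, :, :], K[1:, :, :], idx[:-1, :, :], idx[1:, :, :])   # x-direction faces
    add_faces(K[:, :-1, :], K[:, 1:, :], idx[:, :-1, :], idx[:, 1:, :])   # y-direction faces
    add_faces(K[:, :, :-1], K[:, :, 1:], idx[:, :, :-1], idx[:, :, 1:])   # z-direction faces

    A = sp.coo_matrix((np.concatenate(vals),
                       (np.concatenate(rows), np.concatenate(cols))), shape=(n, n))
    return (A + sp.diags(diag)).tocsr()   # symmetric, seven nonzero diagonals

A = tpfa_matrix(20, 20, 20, std=1.0)      # with pure no-flow boundaries the matrix has a
                                          # constant nullspace; source/sink terms or a
                                          # reference pressure make the actual systems s.p.d.
```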
A second set of linear systems comes from computational fluid dynamics simulations for wind farm design and management, carried out at the Barcelona Supercomputing Center (BSC) by using Alya, a multi-physics simulation code targeted at HPC applications [21]. The systems arise in the numerical solution of the Reynolds-Averaged Navier-Stokes equations coupled with a modified \(k\)-\(\varepsilon\) turbulence model; the space discretization is obtained by using stabilized finite elements, while the time integration is performed by combining a backward Euler scheme with a fractional step method, which splits the computation of the velocity and pressure fields and thus requires the solution of two linear systems at each time step. Here we show the results concerning two systems associated with the pressure field, which are representative of systems arising in simulations with discretization meshes of different sizes. The systems, denoted by M1-Alya and M2-Alya, have s.p.d. matrices of dimensions 790856 and 2224476, with 20905216 and 58897774 nonzeros, respectively, corresponding to at most 27 nonzero entries per row.
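The schematic sketch below shows why a fractional-step (pressure-correction) splitting of this kind leads to two linear solves per time step; it is a generic sketch, not Alya's implementation, and the solvers and operators passed as arguments are hypothetical placeholders. The pressure solve, whose matrix is s.p.d., is the one giving rise to M1-Alya and M2-Alya.

```python
# Schematic fractional-step (pressure-correction) time step: one momentum solve for the
# intermediate velocity and one s.p.d. pressure solve per step. Generic sketch, not Alya;
# M_solve, P_solve, G and D are placeholders for the discrete momentum solver, pressure
# solver, gradient operator and divergence operator.
def fractional_step(u_n, p_n, dt, M_solve, P_solve, G, D):
    # 1) Momentum step (backward Euler): solve for the intermediate velocity u*.
    u_star = M_solve(u_n, p_n, dt)
    # 2) Pressure step: solve the s.p.d. system (D G) dp = (1/dt) D u* for the pressure
    #    increment; systems of this kind are the ones considered in the experiments.
    dp = P_solve(D @ u_star / dt)
    # 3) Correction: project u* onto the discretely divergence-free space and update p.
    return u_star - dt * (G @ dp), p_n + dp
```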
All the systems were preconditioned by using an AMG Kcycle [18] with decoupled unsmoothed double-pairwise matching-based aggregation. One forward/backward Gauss-Seidel sweep was used as pre/post-smoother for the systems M1-Filt, M2-Filt and M3-Filt; one block-Jacobi sweep, with ILU(0) factorization of the blocks, was applied as pre/post-smoother for M1-Alya and M2-Alya. UMFPACK (http://faculty.cse.tamu.edu/davis/suitesparse.html) was used to solve the coarsest-level systems, through the interface provided by MLD2P4. The experiments were performed by using the three matching algorithms described in Sect. 1. The truncated Flexible Conjugate Gradient method FCG(1) [18], implemented in PSBLAS, was chosen to solve all the systems, because of the variability introduced into the preconditioner by the Kcycle. The zero vector was chosen as the starting guess and the iterations were stopped when the 2-norm of the residual was reduced by a factor of \(10^{6}\) with respect to its initial value. A generalized row-block distribution of the matrices was used, computed by using the Metis graph partitioner (http://glaros.dtc.umn.edu/gkhome/metis/metis/overview).
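The variability of the Kcycle preconditioner is what motivates a flexible Krylov method. The following minimal sketch, which is not the PSBLAS implementation, illustrates FCG(1) together with the stopping criterion used in the experiments: the new search direction is A-orthogonalized only against the previous one, and `prec` stands for one application of the (possibly variable) preconditioner. The function name and the Jacobi preconditioner in the usage lines are illustrative.

```python
# Minimal sketch of FCG(1), the truncated flexible CG method (not the PSBLAS code).
import numpy as np

def fcg1(A, b, prec, tol=1e-6, maxit=1000):
    x = np.zeros_like(b)                     # zero starting guess, as in the experiments
    r = b - A @ x
    res0 = np.linalg.norm(r)
    d_prev = Ad_prev = None
    for it in range(maxit):
        if np.linalg.norm(r) <= tol * res0:  # stop on a 10^6 reduction of the residual 2-norm
            break
        w = prec(r)                          # (possibly variable) preconditioner application
        d = w.copy()
        if d_prev is not None:               # FCG(1): A-orthogonalize against one direction only
            d -= (w @ Ad_prev) / (d_prev @ Ad_prev) * d_prev
        Ad = A @ d
        alpha = (d @ r) / (d @ Ad)
        x += alpha * d
        r -= alpha * Ad
        d_prev, Ad_prev = d, Ad
    return x, it

# Usage with a fixed Jacobi preconditioner on a small s.p.d. test matrix.
n = 100
A = np.diag(2.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)
x, its = fcg1(A, np.ones(n), prec=lambda r: r / np.diag(A))
```

With a fixed s.p.d. preconditioner, FCG(1) coincides with standard preconditioned CG; truncating the orthogonalization to a single previous direction keeps the cost per iteration low while tolerating the level-dependent variations introduced by the Kcycle.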
The experiments were carried out on the yoda Linux cluster, operated by the Naples Branch of the CNR Institute for High-Performance Computing and Networking. Its compute nodes consist of two Intel Sandy Bridge E5-2670 8-core processors and 192 GB of RAM, and are connected via InfiniBand. Given the size and sparsity of the linear systems, at most 64 cores, running as many parallel processes, were used; only 4 cores per node were employed, because of the memory bandwidth requirements of the computations. PSBLAS 3.4 and MLD2P4 2.2, installed on top of MVAPICH 2.2, were used together with a development version of BootCMatch and the version of UMFPACK available in SuiteSparse 4.5.3. The codes were compiled with the GNU 4.9.1 compiler suite.
Table 1. M1-Filt, M2-Filt and M3-Filt: number of iterations
Table 2. M1-Filt, M2-Filt and M3-Filt: execution time and speedup
We first discuss the results for the systems M1-Filt, M2-Filt and M3-Filt. In Table 1 we report the number of iterations performed by FCG with the preconditioners based on the different matching algorithms, as the number of processes, np, varies. In Table 2 we report the corresponding execution times (in seconds) and speedup values; the times include the construction of the preconditioner and the solution of the preconditioned linear system. In general, the preconditioned FCG solver shows reasonable algorithmic scalability, i.e., for all systems the number of iterations does not vary significantly with the number of processes. A larger variability in the iterations can be observed for M3-Filt, owing to the higher anisotropy of this problem and its interaction with the decoupled aggregation strategy. A more in-depth analysis (not shown here for the sake of space) reveals that the coarsening ratio between consecutive AMG levels ranges between 3.6 and 3.8 for the MC64 and auction algorithms, while it is between 3.0 and 3.3 for the half-approximation algorithm, except with 64 processes, where it decreases to 2.8 for M2-Filt and M3-Filt. None of the three matching algorithms produces preconditioners that are clearly superior in reducing the number of FCG iterations; indeed, for these systems there is no advantage in using the optimal matching algorithm implemented in MC64, since the non-optimal ones are very competitive. The execution times obtained with the half-approximation and auction algorithms are generally smaller, mainly because the time needed to build the AMG hierarchies is reduced. The speedup decreases as the anisotropy of the problem grows, because of the larger number of FCG iterations and the well-known memory-bound nature of our computations. The largest speedups are obtained with MC64, because of the larger single-core time required by this algorithm.
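To make the role of the coarsening ratio more concrete, the sketch below mimics, in a much simplified form, how a pairwise matching-based aggregation determines it: matched unknowns form pairs, unmatched ones remain singleton aggregates, and two consecutive pairwise steps (double-pairwise aggregation) therefore give a ratio of at most 4, with values such as 3.0-3.8 reflecting how many vertices were left unmatched. The greedy matching and the \(|a_{ij}|\) edge weights are illustrative stand-ins, not the MC64, half-approximation or auction algorithms, nor the weights used by BootCMatch.

```python
# Simplified illustration of double-pairwise aggregation and its coarsening ratio.
# Greedy matching on |a_ij| weights is a stand-in for the actual matching algorithms.
import numpy as np
import scipy.sparse as sp

def pairwise_aggregation(A):
    """Greedily match each unknown with its heaviest still-unmatched neighbour."""
    A = sp.csr_matrix(A)
    n = A.shape[0]
    agg = -np.ones(n, dtype=int)          # aggregate index of each unknown (-1: unassigned)
    n_agg = 0
    for i in range(n):
        if agg[i] >= 0:
            continue
        row = A[i]
        cand = [(abs(v), j) for j, v in zip(row.indices, row.data) if j != i and agg[j] < 0]
        agg[i] = n_agg
        if cand:                          # matched: the pair becomes one coarse unknown...
            agg[max(cand)[1]] = n_agg
        n_agg += 1                        # ...otherwise i stays a singleton aggregate
    P = sp.csr_matrix((np.ones(n), (np.arange(n), agg)), shape=(n, n_agg))
    return P                              # piecewise-constant (unsmoothed) prolongator

A = sp.random(2000, 2000, density=0.005, random_state=0)
A = A + A.T + 10.0 * sp.eye(2000)         # generic sparse symmetric test matrix
P1 = pairwise_aggregation(A)
P2 = pairwise_aggregation(P1.T @ A @ P1)  # second pairwise step on the coarse matrix
print("coarsening ratio:", A.shape[0] / P2.shape[1])   # at most 4, less if vertices stay unmatched
```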
In the case of M1-Alya and M2-Alya, only MC64 is able to produce an effective coarsening (with a ratio greater than 3.8), thus avoiding a significant increase in the number of iterations as the number of processes grows. Therefore, in Table 3 we report results (iterations, execution time and speedup) only for this case. The preconditioned solver generally shows good algorithmic scalability; the number of iterations on a single core is much smaller because, on one process, the block-Jacobi smoother reduces to a global ILU(0) factorization. A speedup of 42.8 is achieved for M1-Alya, which reduces to 29.3 for the larger matrix M2-Alya; in our opinion, this can be considered satisfactory. Note that we did not achieve convergence to the desired accuracy by using the basic classical smoothed aggregation algorithm available in MLD2P4.
Table 3. M1-Alya and M2-Alya: number of iterations, execution time and speedup, using weighted matching based on MC64.
In conclusion, the results discussed here show the potential of parallel matching-based aggregation and provide an encouraging basis for further work in this direction, including the application of non-decoupled parallel matching algorithms and the design of ad-hoc coarsening strategies for some classes of smoothers, according to the principles of compatible relaxation [14, 22].