Non-conforming Elements in Nek5000: Pressure Preconditioning and Parallel Performance

Adaptive mesh refinement (AMR) is an important component of modern numerical solvers, as it allows to control the computational error during the simulation, increasing the reliability of the numerical modelling and giving the possibility to study a broad range of different phenomena even without knowing the physics a priori. In this work we present selected aspects of the implementation and parallel performance of a new h-type AMR framework developed for the high-order CFD solver Nek5000; the development was done within the ExaFLOW EU project. We utilise in this case the natural domain decomposition inherent to the spectral element method (SEM), which constitutes the main source of parallelism and provides meshing flexibility that can be exploited in AMR. We use standard libraries for parallel mesh management (p4est) and partitioning (ParMetis) and focus on developing efficient preconditioners for the pressure problem solved on non-conforming meshes. Two different approaches are considered: an additive overlapping Schwarz and a hybrid Schwarz-multigrid method. The strong scaling is shown on the example of the simulation of the turbulent flow around a NACA4412 wing section at Re = 200, 000.

refinement), changing the polynomial order in a particular element (p-refinement), or splitting the element into smaller ones (h-refinement). In this work we concentrate on an h-refinement framework and its implementation in Nek5000 [8], which is a highly parallel and efficient SEM solver for the incompressible Navier-Stokes equations. In its established version, Nek5000 only supports conformal elements at constant polynomial order throughout the domain.
The present work was started within EU project CRESTA, where the nonconforming solver for advection-diffusion problem was developed and the basic AMR tasks were implemented using existing external libraries. As h-refinement affects the element connectivity resulting in non-conforming meshes, a special grid manager is required to perform local refinement/coarsening and to build globally consistent meshes. For this task the p4est library [1] has been chosen, as it is designed to manipulate domains composed of multiple, non-overlapping logical cubic sub-domains, which can be represented by a recursive tree structure. This library provides element connectivity information for the dual graph, which is later manipulated by ParMETIS [10] producing a new element-to-processor mapping. The final step of grid refinement/coarsening and redistribution is performed within the non-conforming version of Nek5000, which utilises the so-called conformingspace/nonconforming-mesh approach based on the previous work of Fischer et al. [7,11]. As the solver complexity grows special care has been taken to develop efficient tools that can be used within AMR framework. A more detailed description of them and the related scaling tests can be found in [17].
The goal of ExaFLOW is to extend results of CRESTA to the full incompressible Navier-Stokes equations focusing on proper adaptation of the pressure preconditioners for nonconforming SEM. Defining a robust parallel preconditioning strategy has received much attention in past decades, as the linear sub-problem associated with the divergence-free constraint (pressure-Poisson equation) can become very illconditioned. In the context of SEM two possible approaches based on the additive overlapping Schwarz method [4,6] and the hybrid Schwarz-multigrid method [5,12] were proposed and implemented in Nek5000, leading to a significant reduction of pressure iterations.
In the present paper, we discuss the modifications necessary to adapt Nek5000 for the h-type AMR framework. The article is organised as follows. A short description of SEM and pressure preconditioners is given in Sects. 2 and 3. The following Sects. 4 and 5 describe the algorithmic modifications and parallel performance of the code. Finally, Sect. 6 provides conclusion and future work.

SEM Discretisation of the Navier-Stokes Equations
We review briefly the discretisation of the incompressible Navier-Stokes equations to introduce notation and point out algorithm parts that require modification. The more in-depth derivation can be found in e.g. [4]. The temporal discretisation is based on a semi-implicit scheme in which the nonlinear term is treated explicitly and the remaining unsteady Stokes problem is solved implicitly. To avoid spurious pressure modes our spatial discretisation is based on the P N − P N −2 SEM, where velocity and pressure spaces are spanned by Lagrangian interpolants on the Gauss-Lobatto-Legendre (GLL) and Gauss-Legendre (GL) quadrature points, respectively. Note that the basis for velocity is continuous across element interfaces, whereas the basis for pressure is not. Assuming f n incorporates all nonlinear and source terms treated explicitly at time t n , the matrix form of the Stokes problem after applying the Uzawa decoupling reads: where is the Stokes Schur complement governing the pressure, p = p n − p n−1 is the pressure update, and g is the inhomogeneity arising from Gaussian elimination. In these equations H = − 1 Re A + β 0 t B and D are the discrete Helmholtz and divergence operators, respectively. β 0 , A and B denote here a coefficient from time derivative, a discrete Laplacian and a diagonal mass matrix associated with the velocity mesh. Applying the Uzawa decoupling we use the inverse mass matrix B −1 as approximation of the inverse Helmholtz operator H −1 , giving rise to a splitting error. Note that for this splitting method the diagonality of the mass matrix B is crucial to avoid costly matrix inversion.
All operators H, A, B and E are symmetric positive definite (SPD) and can be solved with a preconditioned conjugate gradient (PCG) method. Moreover, E has properties similar to a Poisson operator, and is often referred to as a consistent Poisson operator. The systems involving H and E are solved iteratively with E being more challenging, and in the next section we will present the preconditioning strategy for the pressure equation, We close this section by shortly presenting the SEM operators. SEM introduces a globally unstructured and locally structured basis by tessellating the domain into K non-overlapping subdomains (deformed quadrilaterals), = K k=1 k , and representing functions in each subdomain in terms of tensor-product polynomials on a reference subdomainˆ = [−1, 1] d . In this approach every function or operator is represented by its local counterparts, which in case of functions takes the form of a sum over the subdomains Here, f k i and h i are the nodal values of the function in k and the base functions inˆ , respectively, with i representing the natural ordering of nodes inˆ . Combining the coefficients f k i one can build global f and local f L representations of the function. Each global degree of freedom occurs only once in the global representation, but has multiple copies of faces, edges and vertices related to k in the local one. To enforce function continuity, the global-to-local mapping is defined as the matrix-vector product f L = Qf , where Q is a binary operator duplicating the basis coefficients in adjoining subdomains. The action Q T f L sums multiple contributions to the global degree of freedom from their local values. The assembled global stiffness matrix A takes the form where a block diagonal matrix A L is the unassembled stiffness matrix with each diagonal block consisting of the local stiffness matrix A k ij = dh i dx dh j dx dx. In practise, the global stiffness matrix is never formed explicitly, and the gather-scatter operator QQ T is used instead. This operator contains all information about element connectivity.

Pressure Preconditioner
An efficient solution of Eq. (3) requires finding an SPD preconditioning matrix M −1 which can be inexpensively applied and which reduces the condition number of M −1 E. Preconditioners based on domain decomposition are a natural choice for SEM as the data is structured within an element but is otherwise unstructured.
An overlapping additive Schwarz preconditioner for Eq.
For the local problems restriction and prolongation operators, R k and R T k , are Boolean matrices that transfer data to and from the subdomain, andÂ k is a local stiffness matrix which can be inverted with e.g. a fast diagonalisation method. Note that action of R k and R T k are similar to the gather-scatter operator QQ T . The coarse grid problem corresponds to the Poisson problem solved on the element vertices only, with R T 0 being the linear operator interpolating the coarse grid solution onto the tensor product array of GL points. Unlike in [4,6],Â 0 is defined using local SEM-based Neumann operators performing the projection of local stiffness matrices A k evaluated on the GLL quadrature points onto the set of coarse base functions b i representing the linear finite element base on the GLL grid. The coarse base functions are defined inˆ as a tensor-product of the onedimensional linear functions. The local contribution toÂ 0 is given by b T i A k b j , and the fullÂ 0 is finally assembled by local-to-global mapping summing contributions to the global degree of freedom from their local counterparts.Â 0 is one of few matrices formed explicitly in Nek5000.
On the other hand, the hybrid Schwarz-multigrid preconditioner is based on the multiplicative Schwarz method, which for the two-level scheme takes the form, and leads to the following two-level multigrid scheme, where g, r, e and u are right-hand side, residual, coarse-grid error and solution of equation Au = g, respectively. This method can be extended to a general multilevel solver performing a full V cycle [5,12]. Notice that by replacing step ii) with r = g we obtain the additive Schwarz preconditioner.

Adaptation for Non-conforming Meshes
The important advantage of SEM in the context of AMR is its spatial decomposition into elements that can easily be split into smaller ones, and use of the local representation of the operators which decouples intra-and inter-element operations. As h-type AMR using the conforming-space/nonconforming-mesh approach leaves the approximation spaces unchanged, most of the tensor-product operations evaluated element-by-element are preserved, limiting the changes in the algorithm.
The inter-element operations are mostly performed by the gather-scatter operator QQ T which has to be redefined to include spectral interpolation at the nonconforming faces. Following [7] we consider a non-conforming face shared by one low resolution element (parent) and two (in 3D four) high resolution elements (children). We introduce a local parent-to-child interpolation operator J cp which is a spectral interpolation operator with entries where ζ cp j represents the mapping of GLL points from the child face to its parent. This operator is locally applied to give the desired nodal values on the child face, after Q copies data form the parent to the children. Building a block-diagonal matrix J L with local matrices J cp one can redefine scatter J L Q and gather-scatter J L QQ T J T L operators, respectively. For more discussion see Fig. 6 and Sect. 4 in [7]. The next crucial modification is diagonalisation of the global mass matrix Q T B L Q (B L is a block-diagonal built of local mass matrices), whose inverse is required in Eqs. (1) and (2). It is non-diagonal due to the fact that the quadrature points in the elements along the non-conforming faces do not coincide. A diagonalisation procedure is given in [7] and consists of building the global vectorb and finally setting the lumped mass matrixB ij = δ ijb i .ê andê L denote here the global and local vectors containing all ones. The additive Schwarz preconditioner requires two significant modifications. The first one is related to the assembly of the coarse grid operatorÂ 0 , which gets more complex for non-conforming meshes. This is due to the fact that the non-conforming mesh introduces hanging vertices located in the middle of faces or edges. These hanging vertices are not global degrees of freedom and cannot be included inÂ 0 . To remove them from consideration one has to modify the set of local coarse base functions b i , which are thus dependent on the shape of the refined region as well as the position and orientation of the child face with respect to the parent one. Unlike the conforming case, where all b i could be represented by a tensor product of two or three linear functions, the non-conforming mesh requires 5 basic components in two and 21 in three dimensions to assemble all the possible shapes of b i .
The last missing components are the restriction and prolongation operators, R k and R T k , for the local Poisson problem. Taking into account the similarity between these operators with QQ T and following the previous development we use an operator similar to J L QQ T J T L , replacing J L with the interpolation operator defined on the GL quadrature points. Although this choice seems to be optimal as it preserves properties of the preconditioner and J T L is well defined, our numerical experiments showed a significant increase of pressure iterations in some cases. It was found to be caused by the noise introduced by J T L in the Schwarz operator. To reduce this noise we replaced the transposed interpolation operator with the inverse one, getting a significant reduction of iterations. Unfortunately, such a preconditioner is no longer SPD and PCG cannot be used as an iterative solver in this case. The other problem is the definition of J −1 L , as J cp can be inverted for square matrices only, thus excluding p-refinement strategies. To avoid this problem we define a child-to-parent interpolation operator J pc with the entries where ∂ p and ∂ c are the parent and child common faces, ζ p j is a parent GLL point at the face ∂ p , and ζ pc j represents the mapping of ζ p j to the child face ∂ c . This operator is locally applied to give the desired nodal values on the child face, before Q T sums data form the children and the parent. Building block-diagonal matrix J −1 L consisting of local matrices J pc one can redefine the gather-scatter J L QQ T J −1 L operator such that it is appropriate for the pressure preconditioner. In a similar way we modify the multiplicative Schwarz method, as it shares a number of features with the additive one. In this case we distinguish between Schwarz (acting at single level) and restriction (connecting different levels) operators and apply J L QQ T J −1 L and J L QQ T J T L to each of them, respectively. Unlike the additive preconditioner, the hybrid one requires also the redefinition of the diagonal weight matrix that indicates the number of sub-domains sharing a given node, and is used to accommodate for overlapping regions. Its value is important as it reduces the largest eigenvalue of the MA operator and defines the smoothing properties of the additive Schwarz step (see [4] and the references therein). In the conforming case its definition is straightforward, however the non-conforming case is more involved as hanging nodes are not real degrees of freedom. In the current implementation the information about node multiplicity on the non-conforming faces is hidden to the parent element, so the parent element sees only one neighbour instead of two (four in 3D). Although this choice gives a preconditioner that significantly reduces the number of pressure iterations, its performance for the studied cases is slightly worse than the performance of the additive Schwarz preconditioner. This can be caused by a non-optimal value of the weight matrix, or by the fact that the hybrid preconditioner is superior over the additive one for high-aspect ratio elements (that are not present in our adaptive simulations).

Parallel Performance
The parallel performance test is based on the one of the ExaFLOW flagship calculations, and consists of the turbulent flow around a NACA4412 wing section with 5 • angle of attack, at a Reynolds number based on inflow velocity U ∞ and chord length c of Re c = 200,000. It was previously studied in a series of wellresolved large-eddy simulations conducted with the conforming Nek5000 version, and discussed in detail in [19]. This flow configuration was chosen to illustrate the significant benefit of using AMR, in particular when it comes to the farfield region in the computational domain, but for this article we will only briefly discuss the  Fig. 1 (a) Volume visualisation of that part of the domain covered by refinement levels higher than one for the turbulent flow around a wing profile. The wing vicinity and wake region are resolved and a colour indicates different refinement levels. (b) Strong scaling of the non-conforming Nek5000 solver for the same case performed on Beskow. The plot shows the time per time step as a function of node number. Each node consists of 32 cores strong scaling results. We omit here a weak scaling test, as Nek5000 uses iterative solvers and with the current example we cannot provide meaningful data.
The initial coarse and conforming mesh consisted of 2190 elements with polynomial order N = 7 and was evolved for 7.2 time units c/U ∞ to evolve the refinement process using spectral error indicators [13,14], and allowing for 6 refinements levels. The resulting non-conforming grid was built of 224,272 elements with 76.37 × 10 6 degrees of freedom, resolving the wing surface and the wake, Fig. 1a. This final mesh was used to test the parallel performance of the non-conforming solver using the petascale Cray XC40 system Beskow at PDC (Stockholm). This system consists of 2060 nodes with 32 cores per node and 2.438 PFlops peak performance. We compare our results with the scaling tests of the conforming Nek5000 presented in Offermans et al. [15]. The most relevant test in this article is pipe flow at Re τ = 360 (upper-right plot in their Fig. 5), as it is similar in size with the discussed wing case. We should mention here that our goal is not to improve the parallel performance of the conforming code, but rather to retain it despite of a work imbalance introduced by an additional operator in the direct stiffness summation of the non-conforming solver.
To be able to compare to the conforming solver, we focus on the time evolution loop only, excluding code initialisation, finalisation, mesh rebuilding within AMR and I/O operations. The result of the strong scaling test is presented in Fig. 1b showing the time per time step as a function of node count. This plot is almost identical with the reference one in [15]. Both show slight super-linear scaling between 32 and 256 nodes despite growing work imbalance for the non-conforming solver. We also reach the strong scaling limit at around 256 nodes, which for the conforming solver on Beskow was estimated to be between 30,000 and 50,000 degrees of freedom per core [15]. This shows that the parallel performance of the non-conforming and conforming solvers is almost the same and proves the efficiency of our implementation.
The maximum number of the compute nodes used in the test was not set by the parallel properties of the non-conforming Nek5000, but by the quality of the domain partitioning provided by ParMETIS. Within ExaFLOW we developed a new grid partitioning scheme for Nek5000 (not discussed in this paper) that takes into account a core distribution among the nodes, and consists of two steps: inter-and intra-node partitioning. Although this two-level partitioning scheme significantly improves the efficiency of a coarse grid operations for XXT, especially during the setup phase, it relies on the quality of an inter-node partitioning. If the first step gives subdomains with disjoint graphs, the second step cannot be performed. We found that the probability of getting disjoint graphs increases with decreasing number of elements per node, virtually prohibiting the runs with less than 1000 elements per node. However, this limit can differ between simulations. We note however that in the standard production use of the solver this limitation is not critical, as according to [15] it is usually close to the strong scaling limit of conforming Nek5000.

Conclusions
Within the ExaFLOW project we developed a fully functional SEM-based h-type adaptive mesh refinement (AMR) solver for the incompressible Navier-Stokes equations. This allows for much larger flow cases to be run at reduced cost, as the high resolution grid is placed only in those region where it is needed. At the same time the simulation quality is improved, as the computational error can be controlled during the run.
We have optimised for non-conforming meshes the pressure preconditioners based on the additive overlapping Schwarz and hybrid Schwarz-multigrid methods. To achieve this we modified the base functions for the assembly of a coarse-grid operator to remove hanging nodes, and redefined the direct stiffness summation operator to include spectral interpolation at the non-conforming faces and edges. We introduced two operators J L QQ T J T L and J L QQ T J −1 L for the different steps in the pressure calculation. The last crucial modification was the diagonalisation of the global mass matrix.
Using real flow cases we show our AMR implementation to be correct and efficient. An important success is the fact that parallel performance of the conforming and non-conforming solvers is very similar, despite the increased complexity of the non-conforming one.
In the future we are going to investigate other definitions of the weight matrix for the hybrid Schwarz-multigrid method, and to test different pressure preconditioners based on the restricted additive Schwarz method [2,3]. We are going as well to work on the quality of the graph partition, as the two-level partitioning would not accept disjoint graphs on the node's subdomain. The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.