$p$-Multilevel preconditioners for HHO discretizations of the Stokes equations with static condensation

We propose a $p$-multilevel preconditioner for Hybrid High-Order discretizations (HHO) of the Stokes equation, numerically assess its performance on two variants of the method, and compare with a classical Discontinuous Galerkin scheme. We specifically investigate how the combination of $p$-coarsening and static condensation influences the performance of the $V$-cycle iteration for HHO. Two different static condensation procedures are considered, resulting in global linear systems with a different number of unknowns and non-zero elements. An efficient implementation is proposed where coarse level operators are inherited using $L^2$-orthogonal projections defined over mesh faces and the restriction of the fine grid operators is performed recursively and matrix-free. The various resolution strategies are thoroughly validated on two- and three-dimensional problems.


Introduction
In this work we develop and numerically validate p-multigrid solution strategies for nonconforming polytopal discretizations of the Stokes equations, governing the creeping flow of incompressible fluids.
For the sake of simplicity, we focus on a Newtonian fluid with uniform density and unit kinematic viscosity. Given a polygonal or polyhedral domain Ω ⊂ R d , d ∈ {2, 3}, with boundary ∂Ω, the Stokes problem consist in finding the velocity field u : Ω → R d , and the pressure field p : Ω → R, such that in Ω, (1a) −n · ∇u + pn = g N on ∂Ω N , where n denotes the unit vector normal to ∂Ω pointing out of Ω, g D and g N denote, respectively, the prescribed velocity on the Dirichlet boundary ∂Ω D ⊂ ∂Ω and the prescribed traction on the Neumann boundary ∂Ω N ∂Ω \ ∂Ω D , while f : Ω → R d is a given body force. For the sake of simplicity, it is assumed in what follows that both ∂Ω D and ∂Ω N have non-zero (d − 1)-dimensional Hausdorff measure (otherwise, additional closure conditions are needed).
Our focus is on new generation discretization methods for problem (1) that support general polytopal meshes and high-order: Hybrid High-Order (HHO) and Discontinuous Galerkin (DG) methods.
Hybrid High-Order discretizations of the Stokes equations have been originally considered in [2] and later extended in [34] to incorporate robust handling of large irrotational body forces. Other extensions include their application to the Brinkman problem, considered in [18], and to the full Navier-Stokes equations [35,36,17]; see also [30,Chapters 8 and 9] for further details. In this work, we consider two HHO schemes that are novel variations of existing schemes with improved features. The first scheme, based on a hybrid approximation of the velocity along with a discontinuous approximation of the pressure, is a variation of the one considered in [30,Chapter 8] including two choices for the polynomial degree of the element velocity unknowns in the spirit of [21] (see also [30,Section 5.1]). The second scheme, inspired by the Hybridizable Discontinuous Galerkin (HDG) method of [47], hinges on hybrid approximations of both the velocity and the pressure and includes, with respect to the above reference, a different treatment of viscous terms that results in improved orders of convergence. In both cases, the Dirichlet condition on the velocity is enforced weakly in the spirit of [17].
Since the pioneering works [27,26,25,23,28] dating back to the late 1980s, DG methods have gained significant popularity in computational fluid mechanics, boosted by the 1997 landmark papers [12,13] on the treatment of viscous terms. The extension of DG methods to general polyhedral meshes was systematically considered in [31] and [32]. Crucially, this extension paved the way to adaptive mesh coarsening by agglomeration, a strategy proposed in [8] and exploited in [9,14] in practical CFD applications to provide high-order accurate geometry representation with arbitrarily coarse meshes. More recent developments, including hp-versions and the support of meshes with small faces, can be found in [4,3]; see also the recent monograph [19]. Our focus is on an equal-order approximation with stabilized pressure-velocity coupling in the spirit of [24] and a treatment of the viscous term based on the Bassi-Rebay 2 (BR2) method of [13]. Related works include [10,29]; see also [32,Chapter 6] and references therein.
p-Multilevel solvers are well suited for both HHO and DG methods because the process of building coarse level operators based on polynomial degree reduction is straightforward and inexpensive. The purpose of applying iterative solvers to coarse problems is twofold: on one hand, a coarser operator translates into a global sparse matrix of smaller size with fewer non-zero entries, resulting in cheaper matrix-vector products; on the other hand, coarse level iterations are best suited to smooth out the low-frequency components of the error, that are hardly dumped by fine level iterations. In the context of DG discretizations, p-multilevel solvers have been fruitfully utilized in practical applications see, e.g., [38,44,11,48,39]. h-,p-and hp-Multigrid solvers for DG discretizations of elliptic problems have been considered in [5], where uniform convergence with respect to the number of levels for the W-cycle iteration has been proved, and in [16]. Multigrid solvers for HDG discretizations of scalar elliptic problems were considered in [22] and, more recently, in [37,42], where a comparison with DG is carried out. p-Multivel solvers for HDG methods with application to compressible flow simulations have been recently considered in [40]. Preconditioners for DG and HDG discretizations of the Stokes problem have been considered in [6,41,1,15,20] and [46], respectively. Finally, an h-multigrid method for HHO discretizations of scalar diffusion problems has been recently proposed in [43]. The main novelty consists, in this case, in the use of the local potential reconstruction in the prolongation operator.
In this work we propose and numerically assess p-multilevel solution strategies for HHO discretizations of the Stokes equations. We specifically investigate how the combination of p-coarsening and static condensation influences the performance of the V-cycle iteration. To this end, we compare different static condensation strategies. In order to preserve computational efficiency, statically condensed coarse level operators are inherited using local L 2 -orthogonal projections defined over mesh faces. Restriction of fine grid operators is performed recursively and matrix-free, relying on L 2 -orthogonal basis functions to further reduce the computational burden. Performance assessment is based on accuracy and efficiency of p-multilevel solvers considering DG discretizations as a reference for comparison. Highorder accurate solutions approximating smooth analytical velocity and pressure fields are computed over standard and severely graded h-refined mesh sequences in both two and three space dimensions. Interestingly, the static condensation strategy plays a crucial role in case of graded meshes.
The rest of this work is organized as follows. In Section 2 we state the HHO and DG schemes considered in the numerical tests. The p-multilevel strategy is discussed in Section 3 and computational aspects are discussed in Section 4. Section 5 contains an extensive panel of numerical results that enable one to assess and compare several solution strategies. Finally, some conclusions are drawn in Section 6.

Three nonconforming methods for the Stokes problem
In this section we describe two HHO and one DG methods for the approximation of problem (1) that will be used to assess the performance of the p-multilevel preconditioner. In order to lay the ground for future works on the full nonlinear Navier-Stokes equations, the corresponding discrete problems are formulated in terms of the annihilation of residuals.

Discrete setting
We consider meshes of the domain Ω corresponding to couples M h (T h , F h ), where T h is a finite collection of polygonal (if d = 2) or polyhedral (if d = 3) elements such that h max while F h is a finite collection of line segments (if d = 2) or polygonal faces (if d = 3). For the sake of brevity, in what follows the term "face" will be used in both two and three space dimensions. It is assumed henceforth that the mesh M h matches the geometrical requirements detailed in [30,Definition 1.4]. This covers, essentially, any reasonable partition of Ω into polyhedral sets, not necessarily convex. For each mesh element T ∈ T h , the faces contained in the element boundary ∂T are collected in the set F T , and, for each mesh face F ∈ F h , T F is the set containing the one or two mesh elements sharing F. We define three disjoint subsets of the set F T : the set of Dirichlet boundary faces Hybrid High-Order methods hinge on local polynomial spaces on mesh elements and faces. For given integers ≥ 0 and n ≥ 1, we denote by P n the space of n-variate polynomials of total degree ≤ (in short, of degree ). For X mesh element or face, we denote by P (X) the space spanned by the restriction to X of functions in P d . When X is a mesh face, the resulting space is isomorphic to P d−1 (see [30,Proposition 1.23]). At the global level, we will need the broken polynomial space Let again X denote a mesh element or face. The local L 2 -orthogonal projector π X : L 2 (X) → P (X) is such that, for all q ∈ L 2 (X), ∫ X (q − π X q)r = 0 ∀r ∈ P (X).
Notice that, above and in what follows, we omit the measure from integrals as it can always be inferred from the context. The L 2 -orthogonal projector on P (X) d , obtained applying π X component-wise, is denoted by π X .

Local reconstructions and face residuals
The HHO discretizations of the Stokes problem considered in this work hinge on velocity reconstructions devised at the element level and obtained assembling diffusive potential reconstructions component-wise. In what follows, we let a mesh element T ∈ T h be fixed, denote by k ≥ 0 the degree of polynomials attached to mesh faces, and by k ∈ {k, k + 1} the degree of polynomials attached to mesh elements.

Scalar potential reconstruction
The velocity reconstruction is obtained leveraging, for each component, the scalar potential reconstruction originally introduced in [33] in the context of scalar diffusion problems (see also [21] and [30, Section 5.1] for its generalization to the case of different polynomial degrees on elements and faces). Define the local scalar HHO space The scalar potential reconstruction operator p k+1 T : V k ,k T → P k+1 (T) maps a vector of polynomials of V k ,k T onto a polynomial of degree (k + 1) over T as follows: Computing p k+1 T for each T ∈ T h requires to solve a small linear system. This is an embarrassingly parallel task that can fully benefit from parallel architectures.

Velocity reconstruction
Define, in analogy with (2), the following vector-valued HHO space for the velocity: The velocity reconstruction P k+1

Face residuals
Let T ∈ T h and F ∈ F T . The stabilization bilinear form for the HHO discretization of the viscous term in the momentum equation (1a) hinges on the face residual R k T F : where the scalar face residual r k ,k

HHO schemes
We consider two HHO schemes based, respectively, on discontinuous and hybrid approximations of the pressure. In both cases, the Dirichlet boundary condition is enforced weakly, considering a symmetric variation of the method discussed in [18].

An HHO scheme with discontinuous pressure
Let again k ≥ 0 and k ∈ {k, k + 1} denote the polynomial degrees of the face and element unknowns, respectively, and let a mesh element T ∈ T h be fixed. Given (u T , p T ) ∈ V k ,k T × P k (T), the local residuals r mnt I,T ((u T , p T ); ·) : V k ,k T → R of the discrete momentum conservation equation and r cnt I,T (u T ; ·) : P k (T) → R of the discrete mass conservation equation are such that, respectively: For all v T ∈ V k ,k T and all q T ∈ P k (T), In the expression of r mnt I,T ((u T , p T ); ·), η > 0 is a user-dependent parameter that has to be taken large enough to ensure coercivity. The penalty term where the parameter η appears, along with the consistency terms in the second line and the term involving the boundary datum g D in the fourth line, are responsible for the weak enforcement of the Dirichlet boundary condition for the velocity.
Define the global vector HHO space Scheme I (HHO-dp: HHO scheme with discontinuous pressure).

An HHO scheme with hybrid pressure
An interesting variation of Scheme I is obtained combining the HHO discretization of the viscous term with k = k + 1 with a hybrid approximation of the pressure inspired by [47].
→ R of the discrete momentum and r cnt I I,T (u T ; ·) : V k,k T → R of the discrete mass conservation equations for the HHO scheme with hybrid pressure are such that, for all v T ∈ V k+1,k T and all q T ∈ V k,k T , As before, η > 0 is a penalty parameter that has to be taken large enough to ensure coercivity. The boxed terms are the ones that distinguish the local residuals on the momentum and mass conservation equations for the HHO scheme with hybrid pressure from Scheme I with k = k + 1.
Define the global scalar HHO space The global residuals r mnt The HHO method (7) yields a velocity approximation that is pointwise divergence free (as can be checked adapting the argument of [47,Proposition 1]) and improves by one order the h-convergence rates of the HDG method proposed in [47], since it relies on an HHO discretization of the viscous term (cf. the discussion in [21] and also [30, Section 5.1.6]).
A key point consists in using element unknowns for the velocity one degree higher than face unknowns. Notice that seeking the velocity in the space V k+1,k T as opposed to V k,k T does not alter the number of globally coupled unknowns, as all velocity degrees of freedom attached to the mesh elements can be removed from the global linear system by static condensation procedures similar to the ones discussed in Section 4.1.2.

DG scheme
The third approximation of the Stokes problem is based on discontinuous approximations of both the velocity and the pressure. Specifically, we use the BR2 formulation for the vector Laplace operator (see [13] and also [32,Section 5.3.2]) together with a stabilized equal order pressure-velocity coupling. Fix a polynomial degree k ≥ 1 and let T ∈ T h . We define the local discrete gradient G k T : where, for any F ∈ F i,D T , the jump of v across F is defined as Introducing, for all F ∈ F i,D T , the jump lifting operator L k FT : , the local residual r mnt I I I,T ((u h , p h ); ·) : P k (T) d → R of the discrete momentum equation and r cnt I I I,T ((u h , p h ); ·) : P k (T) → R of the discrete mass equation are such that, for all v T ∈ P k (T) d and all q T ∈ P k (T), where, for all ϕ ∈ H 1 (T h ) and all F ∈ F h , with the understanding that the average operator acts componentwise when applied to vector and tensor functions, and The global residuals r mnt I I I,h ((u h , p h ); ·) : P k (T h ) d → R and r cnt I I I,h ((u h , p h ); ·) : P k (T h ) → R are obtained by elementby-element assembly of local residuals.
Scheme index Scheme label Fine discrete space Coarse discrete spaces Coarsest level I HHO-dp Table 1: Notation for the p-multilevel solver. We only consider the equal-order version of Scheme I, where both element and face velocity unknowns have the same polynomial degree.

p-Multilevel solution strategy
We consider L coarse problems, indexed as = 1, ..., L. Given a polynomial degree k ≥ 0 (for Schemes I and II) or k ≥ 1 (for Scheme III), we set k 0 k, the reference polynomial degree on the fine level, and denote by k the polynomial degree at level . Coarsening is achieved taking k +1 < k . The notation for the three schemes discussed in Section 2 is summarized in Table 1. Notice that, for the sake of simplicity, we only consider the equal-order version of Scheme I, where both element and face velocity unknowns have the same polynomial degree.

Intergrid transfer operators
Denoting by X ∈ T h ∪ F h a mesh element or face, the prolongation operator I ,X +1 : P k +1 (X) → P k (X) from level + 1 to level is the injection P k +1 (X) → P k (X). The prolongation operator I 0 from level to level 0 can be recursively defined by the composition of one level prolongation operators: The restriction operator I +1 ,X : P k (X)→P k +1 (X) from level to level + 1 is simply taken equal to the L 2 -orthogonal projector on P k +1 (X), that is, for all w X, ∈ P k (X), we set The restriction operator I 0 from level 0 to level is again obtained by composition: It can be checked that I +1 ,X is the transpose of I ,X +1 with respect to the L 2 (X)-inner product. When applied to vectorvalued functions, intergrid transfer operators act component-wise and are denoted using boldface font by I ,X +1 , I +1 ,X . The global restriction operator while the global restriction operator for DG spaces I +1 : P k (T h ) d → P k +1 (T h ) d is obtained patching the element restriction operators: For all v h, ∈ P k (T h ) d ,

Inherited multilevel operators
For any = 1, . . . , L set, for the sake of brevity, The coarse residuals for the momentum and mass continuity equations for the schemes of Section 2 corresponding to a velocity-pressure couple at level are obtained evaluating the corresponding fine residuals defined in Section 2.3 at the prolongation of the given function, i.e.: For = 1, . . . , L, • Scheme I (HHO-dp).
Besides the formal definition given above, coarse level operators can be efficiently inherited from the fine operators relying on the restriction and prolongation operators. This computationally efficient strategy, also known as Galerkin projection, is detailed in Section 4.2 focusing on Scheme I.

Multilevel V-cycle iteration
The approximate solution w h, to the global problem at level < L can be improved by means of one V-cycle iteration, as described in the following algorithm: Compute the coarse grid correction (recursion up to level L): , 0) Apply the coarse grid correction: where d h, +1 is the restriction of the defect and c h, +1 is the coarse grid correction. All applications of prolongation and restriction operators involved in the multilevel V-cycle iteration are performed matrix-free, that is, without assembling the global sparse matrices associated to the operators I +1 , I +1 .
In the pre-and post-smoothing steps, a few iterations of the Generalised Minimal Residual (GMRES) method preconditioned with an Incomplete Lower-Upper (ILU) factorization are performed in order to reduce the error e h, = w h, − w h, . Indeed, the components of the error associated to the highest-order basis functions at level are expected to be damped very fast, while the components of the error associated to lower-order basis functions are smoothed at a later stage when the recursion reaches coarser levels.
In the numerical tests of Section 5 we consider one V-cycle iteration as a preconditioner for the FGMRES (Flexible GMRES) iteration applied to solve the global problem A h,0 w h,0 = b h,0 . We employ the solver and preconditioner framework provided by the PETSc library [7].

Computational aspects
In what follows, we discuss some computational aspects for the Scheme I (HHO with discontinuous pressure). Algebraic objects are denoted using sans serif font, with boldface distinguishing matrices from vectors.

Algebraic expression for the local residuals
We assume that local bases for each polynomial space attached to mesh elements and faces have been fixed, so that bases for the global approximations spaces for the velocity and the pressure can be obtained by taking the Cartesian product of the latter. Possible choices of local bases are discussed in [30 The unknowns for a mesh element T ∈ T h correspond to the coefficients of the expansions of the velocity and pressure in the selected local bases. Assuming that the velocity unknowns are ordered so that element velocities come first and boundary velocities next, these coefficients are collected in the vectors where the block partition of the vector U T is the one naturally induced by the selected ordering of velocity unknowns.
The local matrices corresponding to the HHO discretization of the viscous term (first two lines of the right-hand side of (3a)) and of the pressure-velocity coupling (first line of the right-hand side of (3b)) are where again the block partition is the one induced by the ordering of velocity unknowns. Details on the construction of the matrix A T can be found in [30,Appendix B.2].
Remark 1 (Block structure). Denoting by N the number of faces of T, the block structure of the matrix A T can be further detailed as follows: Assume that the velocity unknowns attached to T and its faces are ordered by component. Since the viscous term is modelled in (1a) applying the Laplace operator to each velocity component, each block in the decomposition (13) is itself block-diagonal, and be efficiently constructed starting from the corresponding matrix for the scalar Laplace operator.
Introducing the vector representations R mnt and R cnt I,T of the residual linear forms defined by (3), G ∂T of the terms involving the boundary data in the last line of (3a), F T of the term involving the volumetric body force in the last line of (3a), and G ∂T of the last term in the right-hand side of (3b), it holds

Static condensation strategies
The discrete problem (5) is obtained enforcing that the global residuals be zero, which requires the solution of a global linear system. The size of this linear system can be reduced by statically condensing the element velocity unknowns and, possibly, the pressure unknowns corresponding to high-order modes inside each element. In what follows, we discuss two possible static condensations procedures leading to global systems with different features.
HHO-dp v-cond: Static condensation of velocity element unknowns. The first static condensation procedure hinges on the observation that, given a mesh element T ∈ T h , the velocity unknowns collected in U T are not directly coupled with unknowns attached to mesh elements other than T. As a result, enforcing that the residuals in the left-hand side of (14) be zero, U T can be locally eliminated by expressing it in terms of U ∂T and P T by computing the Schur complement of the block A TT in the matrix in the right-hand side of (14). With this static condensation strategy, the zero residual condition translates into HHO-dp v&p-cond: Static condensation of velocity element unknowns and pressure modes. The second static condensation strategy was originally suggested in [2] in the framework of HHO methods and later detailed in [34,Section 6]. Assume that the basis for the pressure inside each mesh element T ∈ T h is selected so that the first degree of freedom corresponds to the mean value of the pressure inside T and the remaining basis functions are L 2 -orthogonal to the first (this condition typically requires the use of modal bases). Let now a mesh element T ∈ T h be fixed. The above choice for the pressure basis induces the following partitions of the pressure unknowns and of the pressure-velocity coupling matrix: where P T ∈ R is the mean value of the pressure inside T, P T is the vector corresponding to high-order pressure modes, and the matrix B T has been partitioned row-wise according to this decomposition. Enforcing that the residuals be zero in (14) and rearranging the unknowns and equations, we infer that the discrete solution satisfies The only unknowns that are globally coupled are those collected in the subvector can be eliminated by expressing them in terms of the former. After performing this local elimination, the condition (16) that the residuals associated with T be zero becomes: where S v&p T denotes the Schur complement of the top left block of the matrix in (16), that is, Remark 2 (Differences between the static condensation strategies). The two static condensation strategies outlined above coincide for k=0. For k ≥ 1, the first, obvious difference is that the second results in a smaller global system, since high-order pressure unknowns are eliminated in addition to element-based velocity unknowns. There is, however, a second, more subtle difference. As a matter of fact, while the block S v&p ∂T ∂T in (17) is full, the block S v ∂T ∂T in (15) preserves the pattern of A ∂T ∂T (which is composed of block-diagonal blocks, see Remark 1). As a result, the first static condensation strategy results in a sparser, albeit larger, matrix. The numerical tests in the next section show that sparsity prevails over size, so that the first static condensation strategy is in fact the more efficient.
Notice that this difference would disappear if we replaced the Laplace operator in the momentum equation (1a) by div(ν∇ s ·), with ∇ s denoting the symmetric part of the gradient operator applied to vector-valued fields, as would be required for a viscosity coefficient ν : Ω → R + variable in space.

Inheritance by means of Galerkin projections
We show in this section how the operators can be inherited from level to + 1. For X mesh element or face, we let {ψ X, 1 , ψ X, 2 , ..., ψ X, P } be a basis of P k (X) (with P denoting the dimension of this vector space) and Q } a basis of P k +1 (X) (with Q denoting the dimension of this vector space). The algebraic counterpart I +1 ,X of the local restriction operator I +1 ,X defined by (9) reads ..,Q, j=1,..., P , and the algebraic counterpart I ,X +1 of the local prolongation operator I ,X +1 is Interestingly, when using hierarchical orthonormal bases and the basis for P k +1 (X) is obtained by restriction of the basis for P k (X), both the prolongation and restriction operators are represented by unit diagonal rectangular matrices. In particular, for the local restriction operator it holds .., Q and all j = 1, ..., P.
As a result, intergrid transfer operators need not need be computed nor stored in memory.
With a little abuse of notation, we also denote by I +1 ,X and I ,X +1 the local restriction and prolongation operators applied to vector-valued variables, which are obtained assembling component-wise the corresponding operators acting on scalar-valued variables. The matrix A +1 T discretizing the viscous term at level + 1 can be inherited from the corresponding matrix A T at the level applying the restriction operators block-wise (compare with (13)):  Applying this procedure recursively shows that, for any level ≥ 1, the matrix A T can be obtained from the fine matrix A 0 T . Note that pre-and post-multiplication of the matrix blocks by the restriction and the prolongation operators, respectively, results in a block shrink. When using orthonormal basis functions, these matrix multiplications can be avoided altogether and replaced with inexpensive sub-block extractions.
In order to further reduce the computational costs, Galerkin projections can be performed on the statically condensed fine grid operator, so that static condensation of coarse grid operators is avoided altogether. For example, having computed the fine-level block of the Schur complement S 0 ∂T ∂T (given by either formula (15) or (17)), the corresponding block S +1 ∂T ∂T at level + 1 is computed applying recursively the relation: To conclude the resulting sub-blocks are assembled into the global matrix.

Mesh sequences
In order to assess and compare the performance of p-multilevel preconditioners, we consider four h-refined mesh sequences of the two-dimensional domain (−1, 1) 2 , see Figure 1, and three h-refined mesh sequences of the threedimensional domain (0, 1) 3 , see Figure 2. In two space dimensions, we consider both standard and graded meshes  composed of triangular and trapezoidal elements. In three space dimensions, we consider standard meshes composed of prismatic and pyramidal elements and graded meshes composed of tetrahedral elements. While standard meshes have homogeneous meshsize, graded meshes feature mesh elements that become narrower and narrower while approaching the domain boundaries, mimicking computational grids commonly employed in CFD to capture boundary layers. In order to build h-refined graded mesh sequences, the mesh nodes are first positioned according to Gauss-Lobatto quandrature rules of increasing order and then randomly displaced by a small fraction of their distance. Accordingly, the reduction of the meshsize is non-linear in case of graded h-refined mesh sequences.

Manufactured analytical solution
We consider the following smooth analytical behaviours of the velocity and pressure fields: where {i, j} is the canonical basis of R 2 while, for d = 3, we set Ω (0, 1) 3 and where {i, j, k } is the canonical basis of R 3 . Dirichlet boundary conditions are enforced on all but one faces of Ω, where Neumann boundary conditions are enforced instead. The boundary data and forcing term are inferred from the exact solution.

Multilevel solver options
We consider high-order and higher-order versions of the HHO and DG schemes corresponding to the polynomial degrees k=3 and k=6, respectively. The theoretical h-convergence rates for DG are k+1 for the velocity error in the L 2 -norm and k for the velocity gradient and the pressure error in the L 2 -norm. The theoretical h-convergence rates for HHO are k+2 for the velocity reconstruction error in the L 2 -norm and k+1 for the gradient of the velocity reconstruction and the pressure error in the L 2 -norm. For the HHO-hp scheme, both the element velocity and the reconstructed velocity display the same convergence rates, but the former is additionally divergence free on standard meshes. For this reason, the element velocity field is used in the error computations. For all the numerical test cases, we report in the tables the L 2 -errors on the velocity ("u h " column), velocity gradients ("Gu h " column), pressure ("p h " column), and divergence ("Du h " column).
The solution of the linear systems is based on a FGMRES iterative solver preconditioned with a p-multilevel V-cycle iteration of three levels (L = 2): for k 0 =k=3 (fine level), we set k 1 =2 on the intermediate level and k 2 =k L =1 on the coarse level; for k 0 =k=6 (fine level), we set k 1 =3 on the intermediate level and k 2 =k L =1 on the coarse level.
On the fine and intermediate levels, the pre-and post-smoothing strategy consist in two iterations of ILU preconditioned GMRES. On the coarse level, we employ an LU solver when working in two space dimensions and ILU preconditioned GMRES solver when working in three space dimensions. Since enforcing looser tolerances on the coarse level does not alter the number of outer FGMRES iterations, we impose a three orders of magnitude decrease of the true (unpreconditioned) relative residual in three space dimensions. The relative residual decrease for the outer FGMRES solver is set to 10 −13 when k=3 and to 10 −14 when k=6.

Performance evaluation
For all the numerical test cases we compare the performance and efficiency of solver strategies based on: • Number of FGMRES outer iterations ("ITs" column); • Number of coarse solver iterations ("ITs L " column). Note that one iterations means that a direct solver is employed; • Wall clock time required for linear system solution ("CPU time Sol." column); • Wall clock time required for matrix assembly ("CPU time Ass." column); We remark that the computational cost of building the Schur complement is included since static condensation is performed element-by-element during matrix assembly. • Wall clock time required for matrix assembly plus linear system solution ("CPU time Tot." column); • Efficiency with respect to linear scaling of the computational expense with respect to the number of DOFs ("Eff." column). 100% efficiency means that for a fourfold increase of the number of DOFs we get a fourfold increase of the total (matrix assembly plus linear system solution) wall clock time.

Comparison based on matrix dimension and matrix non-zero entries
The cost of a Krylov iteration scales linearly with the number of Matrix Non-Zero entries (MNZs) plus the number of Krylov spaces times the matrix dimension (equal to the number of Degrees Of Freedom, DOFs), see, e.g., [45]. Multilevel Krylov solvers utilize only a few smoother iterations on the fine and intermediate levels and iteratively solve on the coarse level, where the number of MNZs and DOFs is favourable, see Section 3.3. Accordingly, with respect to solver efficiency, the most relevant discretization-dependent parameters are the MNZs of the fine and coarse matrices and the number of and DOFs of the coarse level: fine level MNZs influence the cost of the most expensive matrix-vectors products, performed once per smoother iteration; coarse level MNZs influence the cost of the least expensive matrix-vector products, performed once per iteration of the coarse solver (that is, many times per multilevel iteration); the number of DOFs of the coarse level influences the cost of the Gram-Schmidt orthogonalization carried out within the GMREs algorithm on the coarse level. Static condensation of the element based unknowns is an effective means of improving solver efficiency in the context of hybridized methods. For HHO-dp, we compare the uncondensed (HHO-dp uncond) implementation to the static condensation strategies described in Section 4.1. We recall that both static condensation procedures involve the local elimination of velocity unknowns attached to mesh elements, and the difference lays in the treatment of pressure degrees of freedom. According to (17), all pressure modes except the constant value are statically condensed in the HHO-dp v&p-cond strategy, while, according to (15), pressure modes are not statically condensed in the HHO-dp v-cond strategy. For HHO-hp, we consider static condensation of the element unknowns for both the velocity and the pressure (HHO-hp v&p-cond), so that the only skeletal unknowns appear in the global systems.
Roughly speaking, DOFs and MNZs of HHO discretizations are associated with element variables and face variables. DG discretizations rely only on element variables. The formulas for computing DOFs and MNZs reported in Table 2 show that: • The number of DOFs associated with element variables is proportional to the dimension of the polynomial space is the ratio between the stencil of face variables and element variables, respectively. This simple observation allows to interpret the results of Tables 3-5 and 4-6, where the DOFs and MNZs counts for the methods and implementations considered in this work are reported. Placeholders correspond to combinations of meshes, polynomial degrees, schemes, and static condensation options that are either not possible or haven't been considered in numerical tests. The data are reported only for the finest grids of each mesh sequence for k ∈ {1, 3, 6} (the case k=1 is also included as it is relevant for estimating the efficiency of the coarse solver). Some comments regarding the DOFs counts reported in Tables 3-5 are as follows. As expected, the HHO-dp uncond DOFs count is the largest. In 2D and 3D, HHO-dp v&p-cond and DG, respectively, have the fewest DOFs count on the coarse level (k=1). This can easily be interpreted based on (18), as the condition is harder to meet in 3D than in 2D. In 2D, the number of coarse level DOFs for HHO-dp v-cond, HHO-hp, and DG are very similar. In 2D and 3D, higher-order statically condensed HHO shows some advantage over DG in terms of DOFs.
Some comments regarding the MNZs counts reported in Tables 3-5 are as follows. In 2D, HHO-dp v&p-cond and v-cond have fewer MNZs than DG, at all polynomial degrees. In 3D, HHO-dp v&p-cond and v-cond have fewer MNZs than DG for both k=3 and k=6, with HHO-dp v-cond being the most efficient. HHO-dp v-cond is very close to DG for k=1. The fact that HHO-dp v-cond outperforms HHO-dp v&p-cond is due to increased fill-in of the Schur complement matrix arising from (17), see Remark 2. HHO-hp v&p-cond improves DG only for k=6, while DG is significantly better for both k=1 and k=3. Similar to strategy (17) for HHO-dp, the aforementioned static condensation procedure increases the fill-in of the blocks pertaining to skeletal velocity unknowns with respect to the uncondensed operator.

Comparison of static condensation strategies
In this section we evaluate the performance of the multilevel solution strategy for Scheme I (HHO-dp) comparing the two approaches for static condensation described in Section 4.1; see in particular (17) (HHO-dp v&p-cond) and (15) (HHO-dp v-cond). We also consider the uncondensed formulation (HHO-dp uncond) as a reference to evaluate the performance gains. HHO-dp HHO-dp v-cond     In case of regular 2D mesh sequences, the results reported in Table 7 confirm that static condensation leads to significant gains (on average, the computation time halves) when compared with the uncondensed implementation. The results reported in Table 8, where graded 2D mesh sequences are considered, show that the HHO-dp v&p-cond strategy (static condensation of both velocity element unknowns and high-order pressure modes) leads to a suboptimal performance of the multigrid preconditioner in case of stretched elements: notice the increasingly high number of FGMRES iterations when the mesh is refined. A similar behavior, even if less pronounced, is observed for the uncondensed implementation. The results reported in Table 9, where 3D mesh sequences are considered, confirm the strategy HHO-dp v-cond (static condensation of element velocity unknowns only) leads to the best performance in terms of execution times, both in the case of standard and graded meshes. We remark that the gains are to be ascribed to fewer FGMRES iterations and a smaller number of matrix non-zero entries, see Table 6.
It is interesting to remark that accuracy and convergence rates are not influenced by the static condensation procedure as soon as the relative residual drop satisfies the prescribed criterion. Solver fails to converge for HHO v&p-cond over fine graded triangular meshes, see Table 8. Note that the prescribed maximum number of iteration (1k) of the FMGRES solver is reached and the convergence rates are spoiled.

Comparison based on accuracy and efficiency of the solver strategy
In this section we compare the three nonconforming discretizations of the Stokes problem presented in Section 2 based on accuracy and performance of the multilevel solver strategy. For the HHO scheme HHO-dp, in accordance with the results of Section 5.4, the static condensation strategy v-cond is used for all meshes in both two and three space dimensions. For the HHO scheme HHO-hp, we consider static condensation of the element unknowns for both the velocity and the pressure (HHO-hp v&p-cond), so that only skeletal unknowns appear in the global systems. The results for 2D regular and graded sequences are reported in Tables 10-12 and 11-13, respectively. The results for 3D mesh sequences are reported in Tables 14-15. As a first point, we remark that the theoretical convergence rates are confirmed for all the test cases performed on regular 2D and 3D mesh sequences. When higher-order (k=6) discretizations are considered and machine precision is reached, the converge rates deteriorates, as expected. Turning to graded mesh sequences, we observe a slightly suboptimal convergence of HHO-hp with respect to HHO-dp over graded triangular meshes at higher-order. Note that velocity gradients and pressure fields reach an asymptotic sixth order convergence rate and the divergence error is a bit higher than expected, compare for example with the result on standard meshes. Interestingly, all the nonconforming discretizations suffer from a convergence degradation for card(T h ) between 192 and 1546 over the graded tetrahedral mesh sequence. This is probably due to mesh elements of extremely bad quality generated as a result of grading plus random node displacement, see Section 5.1. Overall, both HHO-dp and HHO-hp outperform DG in terms of accuracy with order of magnitudes gains observed moving towards finer meshes. This is due to better asymptotic convergence rates (one order higher) as well as better accuracy on coarse meshes.
p-Multilevel solvers guarantee uniform convergence with respect to the mesh density when standard 2D and 3D mesh sequences are considered: note that the number of FGMRES iterations is almost uniform all along the mesh sequence. Interestingly, HHO-dp discretizations show uniform convergence with respect to the mesh density on graded quadrilateral meshes, while DG is the most affected by mesh grading, especially for k=6. For HHO-hp, the number of iterations increases with mesh density on graded quadrilateral meshes. Nevertheless, the number of iterations over coarse meshes is remarkably small and grows up to match the iterations count of HHO-dp over fine meshes. The solver convergence deteriorates with the mesh density in case of graded triangular and tetrahedral mesh sequences: the iterations increase is clearly visible but not pathological in case HHO discretizations.
Interestingly, p-multilevel solvers deliver almost uniform convergence with respect to the polynomial degree when applied to HHO discretizations: moving from high-order (k=3) to higher-order (k=6) entails a mild iterations increase for HHO, while the iteration count doubles for DG. In 2D this behaviour has a strong impact on computation times: HHO is up to three and eight times faster than DG at high-order and higher-order, respectively. HHO-dp outperforms DG because of the reduced number of matrix non-zero entries and the reduced matrix dimension, see Tables 3 and 4: the former influences the cost of smoothing iterations while the latter strongly influences the cost of the LU factorization on the coarse level.
Let us consider the performance of the multilevel solver in 3D. HHO-dp is two times and four-to-five times faster than DG in terms of solution times for k=3 and k=6, respectively. HHO-hp is slower than HHO-dp in terms of solution times and faster than DG by a small amount, with the exception of the pyramidal elements mesh sequence for k=3. The difference in computational cost between HHO-dp and HHO-hp is essentially due to the number of MNZs, see Table 6, while the number of FGMRES iterations is comparable. Since in 3D the coarse level solver is generally more efficient for DG, the HHO advantage results from the efficiency of the smoothers and the reduced number of FGMRES iterations. In particular, we remark that DG has fewer DOFs than HHO for k=1, see Table 5. Moreover, DG and HHO-dp v-cond have a comparable MNZs count for k=1, significantly smaller than the MNZs count of HHO-hp v&p-cond, see Table 6.
Overall, the gain in terms of total execution times is less significant than in 2D. When working with HHO in three space dimensions, assembly times are a considerable fraction of the total computation time: matrix assembly is twice as expensive as linear system solution for HHO-dp for k=6. As opposite, for DG, solution times dominate. Increased assembly costs are essentially due to the increased expense of solving local problems involved in static condensation. An important observation is that, since the assembly procedure is perfectly scalable while ILU preconditioned smoothers are not, HHO discretizations might show better scalability results as compared to DG in massively parallel computations.
We conclude this section commenting about solver efficiency (last column in Tables 10-15). It is clear that higherorder discretizations (k=6) achieve better efficiency than high-order discretizations (k=3), in both 2D and 3D. This outlines the intrinsic limitation of p-multilevel solution strategies: when considering fine meshes, the performance of the coarse solver might limit the efficiency because the number of DOFs and MNZs on the coarse level can not be chosen arbitrarily low. Accordingly, p-multilevel solver are best suited for those situations where arbitrarily coarse meshes with higher-order polynomials can be employed.

Scalability
In this section we include basic scalability results for p-multilevel solvers applied to HHO-dp discretizations. Even if a complete analysis and comparison of the parallel performance of nonconforming discretizations is outside the scope of the paper, we ought to show that Additive Schwarz Method (ASM) preconditioners are an effective means of achieving satisfactory parallel efficiency. We consider the finest grid of the pyramidal elements mesh sequence (counting of 24k elements) and a HHO-dp scheme with k = 5. Static condensation acts on the sole velocity unknowns (HHO-dp v-cond), as described in (15). The multilevel solver strategy is the same employed in serial computations for k = 6, but smoother preconditioners are suitably designed, as outlined in what follows.
The parallel implementation is based on the distributed memory paradigm and requires to partition the computational mesh in several subdomains. In case of HHO methods, not only the mesh but also the mesh skeleton needs to be partitioned: as a result, each mesh entity (element or face) belongs to one and only subdomain. Each subdomain is assigned to a different computing unit that performs matrix assembly for the local mesh elements pertaining to the subdomain. Mesh partitioning directly reflects into matrix partitioning in the sense that all entries of the matrix rows (PETSc matrix implementation is row-major) pertaining to local mesh entities are allocated and stored in local memory. Once matrix assembly is completed, the linear system is approximately solved in each subdomain. Depending on the preconditioner strategy, the solver performance might degrade increasing the number of subdomains, see e.g. [49].
A commonly used ASM preconditioner strategy for DG discretizations consists in employing an ILU decomposition in each subdomain matrix suitably extended to include the matrix rows of ghost elements, that is, neighbors of local mesh elements that pertain to a different subdomain. This implies that the local matrix is extended to encompass the stencil of the DG discretizations, see [39] for additional details. We consider a similar strategy for HHO discretizations: each subdomain matrix is extended to include the matrix rows of ghost faces, that is, faces of the local mesh elements that pertain to a different subdomain. Interestingly, even if the resulting local matrix does not encompass the stencil of the HHO discretization, mass conservation defect takes into account all element's faces.
As a result of the ASM described above, the amount of overlap between subdomain matrices, i.e., the number of matrix entries that are repeated in more than one subdomain, is smaller for HHO than for DG. Consider, for example, two subdomains sharing a face: if the face is local for subdomain A, it is a ghost face for subdomain B and vice-versa. Accordingly, only one of the two subdomain matrices is extended for HHO discretizations. As opposite, since each of the two mesh elements sharing the face has a ghost neighbor, both subdomain matrices are extended for DG discretizations.
Scalability is measured on an AMD EPYC cluster of four nodes and 256 cores, increasing the number of execution units from 16 to 256: in particular we consider a total of five steps doubling the number of execution units at each step. Notice that, when running on 256 subdomains, each subdomain counts of approximately 96 local elements. The results reported in Table 16 confirm that the ASM preconditioner strategy provides satisfactory parallel performance: the number of outer FGMRES iterations is uniform while increasing the number of execution units, and only a mild increase in the iteration count is observed for the ASM preconditioned GMRES solvers on the coarse level. The efficiency parameter (last column in Table 16) measures strong scalability: 100% efficiency with N execution units would imply a N/16 fold reduction of total computation time with respect to the baseline computation performed with 16 execution units.

Conclusions
The multilevel V-cycle iteration based on p-coarsened operators and ILU preconditioned Krylov smoothers is an effective solution strategy for high-order accurate HHO discretizations of the Stokes equations. The global linear system resulting from the spatial HHO discretization can be solved up to machine precision in a reasonable amount of V-cycle preconditioned FGMRES iterations (less than 20). This is remarkable considering that severely graded mesh sequences have been tackled in both 2D and 3D.
Comparing p-multilevel solvers for HHO and DG discretizations based on FGMRES iteration count, we can conclude that the former are more robust than the latter with respect to both the meshsize and the polynomial degree. When standard h-refined mesh sequences are considered, HHO formulations show uniform convergence with respect to the meshsize, irrespectively of the considered polynomial degree. On graded h-refined mesh sequences, the iteration count increases over finer meshes, more severely so for DG discretizations. Similarly, when doubling the polynomial degree (passing from k = 3 to k = 6) for a fixed meshsize, we observe that the iteration count is more stable for HHO schemes.
Since code ruse and code optimization are still possible (note that HHO implementation is more recent and probably less optimized), we avoid drawing conclusions regarding computation times. Nevertheless, the following observations suggest that p-multilevel solution strategies are a compelling choice in case of HHO formulations: • HHO has a clear advantage over DG both in terms of matrix dimension and number of non-zero entries when the polynomial degree is sufficiently high; • p-multilevel solvers for HHO show better solver robustness with respect to the polynomial degree.   Table 7: Evaluation of p-multilevel solution strategies for solving high-order k=3 HHO-dp over 2D regular mesh sequences. Solvers are applied to uncondensed and statically condensed matrices (identified by different colors) considering two alternative Schur complement implementations, see text for details. See Section 5.2.2 for solver options.
HHO-dp v-cond            Table 16: Parallel performance of p-multilevel solution strategies applied to HHO-dp discretizations with k = 5.