## 1 Introduction

Exa-scale performance of numerical algorithms is determined by two factors, node-level performance and distributed-memory scalability to thousands of nodes over an Infiniband-type fabric. Additionally, the final application efficiency in terms of time-to-solution is strongly influenced by the choice of numerical methods, where a high sequential efficiency is essential. The project ExaDG aims to bring together these three pillars to create an algorithmic framework for the next generation of solvers for partial differential equations (PDEs). The guiding principles of the project are as follows: If we define the overall goal to be a minimum of computational cost to reach a predefined accuracy, this aim can be split into three components, namely the efficiency of the discretization in terms of the number of degrees of freedom (DoFs) and time steps, the efficiency of the solvers in terms of iteration counts, and the efficiency of the implementation :

\displaystyle \begin{aligned} \begin{aligned} E &= \frac{\mbox{accuracy}}{\mbox{computational cost}} \\ &= \underbrace{\frac{\mbox{accuracy}}{\mbox{DoFs}\cdot\mbox{timesteps}}}_{\mbox{discretization}} \cdot \underbrace{\frac{1}{\mbox{iterations}}}_{\mbox{solvers/preconditioners}} \cdot \underbrace{\frac{\mbox{DoFs}\cdot\mbox{timesteps}\cdot\mbox{iterations}}{\mbox{computational cost}}}_{\mbox{implementation}} \; . \end{aligned}{} \end{aligned}
(1)

We define computational cost as the product of compute resources (cores, nodes) times the wall time resulting in the metric of CPUh, the typical currency of supercomputing facilities.

Regarding the first metric, the type of discretizations in space and time are often the first decision to be made. Numerical schemes that involve as few unknowns and as few time steps as possible to reach the desired accuracy will be more efficient. This goal can be reached by using higher order methods which have a higher resolution capability, especially for problems with a large range of scales and some regularity in the solution . However several possibilities and profound knowledge regarding the performance capability of potential algorithms on modern hardware are still required to select those algorithms and implementations that are optimal with respect to the guiding metric of accuracy versus time-to-solution. High-order (dis-)continuous finite element methods are the basic building block of the ExaDG project due to their generality and geometric flexibility.

Regarding the second metric, solvers are deemed efficient if they keep the number of iterations minimal. We emphasize that “iterations” are defined in a low-level way as the number of operator evaluations, which is also accurate when nesting several iterative schemes within each other. Note that we assume that large-scale systems must be addressed by iterative solvers; in a finite element context sparse direct solvers are not scalable due to fill-in and complex dependencies during factorizations. One class of efficient solvers of particular interest to ExaDG are multigrid methods with suitable smoothers, which have developed to be the gold standard of solvers for elliptic and parabolic differential equations over the last decades. Here, the concept of iterations would accumulate several matrix-vector products within a multigrid cycle that in turn is applied in an outer Krylov subspace solver. Due to the grid transfer and the coarse grid solver, such methods are inherently challenging for highly parallel environments. As part of our efforts in ExaDG, we have developed an efficient yet flexible implementation in the deal.II finite element library [1, 15].

Third, the evaluation of discretized operators and smoothers remains the key component determining computational efficiency of a PDE solver. The relevant metric in this context is the throughput measured as the number of degrees of freedom (unknowns) processed per second (DoFs/s). An important contribution of our efforts is to both tune the implementation of a specific algorithm, but more importantly to also adapt algorithms towards a higher throughput. This means that an algorithm is preferred if it increases the DoFs/s metric, even if it leads to lower arithmetic performance in GFlop/s or lower memory throughput in GB/s. Operator evaluation in PDE solvers only involves communication with the nearest neighbors in terms of a domain decomposition of the mesh, which makes the node-level performance the primary concern in this regard. Since iterative solvers only require the action of the matrix on a vector (and a preconditioner), they are amenable to matrix-free evaluation where the final matrix entries are neither computed nor stored globally in memory in some generic sparse matrix format (e.g., compressed row storage). While matrix-free methods were historically often considered because they lower main memory requirements and allow to fit larger problems in memory , their popularity is currently increasing because they need to move less memory: Sparse matrix-vector products are limited by the memory bandwidth on all major computing platforms, so a matrix-free alternative promises to deliver a (much) higher performance.

The outline of this article is as follows. We begin with an introduction of matrix-free algorithms and a presentation of node-level performance results in Sect. 2. In Sect. 3, we describe optimizations of the conjugate gradient method for efficient memory access and communication. Next, we detail our multigrid developments, focusing on performance numbers and the massively parallel setup in Sect. 4 and on the development of better smoothers in Sect. 5. Application results in the field of computational fluid dynamics are presented in Sect. 6, where the efficiency and parallel scalability of our discontinuous Galerkin incompressbile turbulent flow solver are shown. An extension of the kernels to up to 6D PDEs is briefly presented in Sect. 7. We conclude with an outlook in Sect. 8.

## 2 Node-Level Performance Through Matrix-Free Implementation

An intuitive example of a matrix-free implementation is a finite difference method implemented by its stencil rather than an assembled sparse matrix . For finite element discretizations with sufficient structure of the underlying mesh and low-order shape functions, a small number of stencils allows to represent the operator of a large-scale problem . Such methods are used in the German exascale project TerraNeo, utilizing the regular data structures in hierarchical hybrid grids and embedded into a highly scalable multigrid solver for Stokes systems [31, 32]. By suitable interpolations, the stencils can be extended from the affine coarse grid assumption to also treat smoothly deformed geometries and variable coefficients .

For higher-order methods, finite element discretizations lead to fat stencils, making the direct evaluation inefficient even when done through stencils. An alternative matrix-free scheme used in ExaDG is to not compute the explicit DoF coupling and instead turn to integrals underlying the finite element scheme. As an example, we consider the constant-coefficient Laplacian

\displaystyle \begin{aligned} -\nabla^2 u = f \quad \mbox{in }\Omega, \quad u=g\quad \mbox{on }\partial \Omega,\end{aligned}
(2)

whose weak form in a finite-dimensional setting is

\displaystyle \begin{aligned} \left(\nabla \varphi_i, \nabla u_h\right)_{\Omega_h} = (\varphi_i, f)_{\Omega_h},\end{aligned}
(3)

where u h(x) =∑j=1:nφ j(x)u j is the finite element interpolant of the solution with n degrees of freedom, φ i denotes the test functions with i = 1, …, n, f is some right hand side, and Ωh is the finite element representation of the computational domain Ω. The left-hand side of this equation represents a finite element operator, mapping a vector of coefficients u = [u i]i to an output vector v = [v i]i by evaluating the weak form for all test functions φ i separately. A matrix-free implementation is obtained by evaluating the element-wise integrals

\displaystyle \begin{aligned} \begin{aligned} &\big[(\nabla \varphi_{i}, \nabla u_h)_{\Omega_h}\big]_{i=1:n}=\sum_{K} \int_{\hat{K}} \left(\mathcal J_{K}^{-\mathsf T} \hat{\nabla} \varphi_{i} \right)^{\mathsf T} \left(\mathcal J_{K}^{-\mathsf T} \sum_{j=1}^{n_{\mathrm{dof,ele}}}\hat{\nabla} \varphi_j u_{j}^{(K)}\right) \mathrm{det}(\mathcal J_{K})\, \mathrm{d}\widehat{\mathbf{x}}\\ &{\approx}\sum_{K} I_K^{\mathsf T} \Bigg[\hspace{-1pt}\sum_{q=1}^{n_q}(\hat{\nabla} \varphi_{i_K}(\hat{x}_q))^{\mathsf T} \hspace{-2pt}\underbrace{\mathcal J_{K}^{-1} \mathcal J_{K}^{-\mathsf T} \mathrm{det}(\mathcal J_{K}) w_q}_{\mbox{physics at quadrature point}} \hspace{-3pt} \sum_{j=1}^{n_{\mathrm{dof,ele}}}\hspace{-3pt}\hat{\nabla} \varphi_j(\hat{x}_q) u_{j}^{(K)}\Bigg]_{i_K=1:n_{\mathrm{dof,ele}}} \end{aligned} \end{aligned}
(4)

by quadrature on n q points per cell K. Here, K denotes the elements in the mesh, $$\hat {x}$$ the coordinates of the reference element $$\hat {K} = (0,1)^d$$, $$\mathcal J_{K}$$ denotes the Jacobian of the mapping from the reference to the real cell, and w q the quadrature weight. The operator I K denotes the index mapping from n dof,ele element-local to global unknowns and defines the element-related unknowns u (K) = I Ku.

On element K, the formulation of Eq. (4) consists of two nested sums over the elemental unknowns $$u_j^{(K)}$$, j ∈ n dof,ele, and the quadrature points q. The result is tested against all test functions $$\varphi _{i_K}$$ on the reference element, which are related to the global test functions φ i through I K. Since the metric terms do not depend on the shape function indices i K and j, and the sum over j does not depend on i K, the summations in the equation can be broken up into (1) an dn q × n dof,ele matrix operation to evaluate the reference element derivative of u (K) at the quadrature points, (2) the application of metric terms as well as other physics terms at n q quadrature points, and (3) an n dof,ele × dn q matrix operation to test by all n dof,ele test functions and perform the summation over the quadrature points. The separation of point-wise physics evaluation at quadrature points is a common abstraction in integration-based matrix-free methods [29, 41, 49, 50].

For high-order finite element methods, the naive evaluation would involve all shape functions at all quadrature points, which is of complexity $$\mathcal O(k^{2d})$$ for polynomials of degree k in d dimensions per element, or $$\mathcal O(k^{d})$$ per unknown, similarly to the fat stencil of the final matrix.

At this point, the structure in the reference-cell shape function and quadrature points can be utilized to lower the computational complexity. If the multi-dimensional shape functions are the tensor product of 1D shape functions, and if the quadrature formula is a tensor product of 1D formulas, the so-called sum-factorization algorithm can be used to group common factors along the various dimensions and break down the work into one-dimensional interpolations. Figure 1 visualizes the process of computing the interpolation of nodal values, visualized by black disks, to the values at the quadrature points. Rather than using a naive interpolation of cost 2(k + 1) 2d operations, it can be done in 2d(k + 1)d+1 operations instead. In matrix-vector notation, the interpolation of the gradient with respect to $$\hat {x}$$, evaluated at quadrature points, can be written as

\displaystyle \begin{aligned} \begin{bmatrix} {\partial \boldsymbol u_h}/{\partial \hat x_1}\\ {\partial \boldsymbol u_h}/{\partial \hat x_2}\\ \vdots\\ {\partial \boldsymbol u_h}/{\partial \hat x_d} \end{bmatrix} = \begin{bmatrix} I \otimes \ldots \otimes I \otimes D_1\\ I \otimes \ldots \otimes D_2 \otimes I\\ \vdots\\ D_d \otimes I \otimes \ldots \otimes I \end{bmatrix} \begin{bmatrix} S_d \otimes \ldots \otimes S_2 \otimes S_1 \end{bmatrix} \boldsymbol u^{(K)}. \end{aligned}
(5)

Here, S 1, …, S d denote the $$n_{q}^{\mathrm {1D}}\times (k+1)$$ interpolation matrices from the nodal values to the quadrature points, obtained by evaluating the 1D basis at all 1D quadrature points, and D 1, …, D d the $$n_{q}^{\mathrm {1D}}\times n_{q}^{\mathrm {1D}}$$ matrices of the derivatives of the Lagrangian basis in quadrature points. In this form, the multiplication by Kronecker matrices is implemented by small matrix-matrix multiplications.

Sum factorization was initially developed in the context of spectral element methods by Orszag , see  for an overview of the developments. In , sum factorization was compared against assembled matrices with the goal to find the best evaluation strategy among assembled matrices and matrix-free schemes. For hexahedral elements considered in this work, the memory consumption and arithmetic complexity indicate that this is the case already for quadratic basis functions [11, 49], with a growing gap for higher polynomial degrees.

### 2.1 Implementation of Sum Factorization in the deal.II Library

As part of the ExaDG project, we have developed efficient implementations in the deal.II finite element library [1, 4] with the following main features, see  for a detailed performance analysis:

• support for both continuous  and discontinuous finite elements on uniform and adaptively refined meshes with hanging nodes and deformed elements,

• support for arbitrary polynomial expansions on quadrilateral and hexahedral element shapes as well as tensor product quadrature rules,

• minimization of arithmetic operations by using available symmetries, such as the even-odd decomposition  and a switch between the collocation derivative (5) for $$n_q^{\mathrm {1D}} \approx k+1$$ quadrature points or an alternative variant based on derivatives of the original polynomials as used in  and discussed in ,

• flexible implementation of operations at quadrature points,

• vectorization across several elements to optimally use SIMD units (AVX, AVX-512, AltiVec) of modern processors,

• applicability to modern multi-core CPUs as well as GPUs [51, 57],

• data access optimizations such as element-based loops for DG elements [50, 56],

• and MPI implementation with tight data exchange as well as MPI-only and shared-memory models [43, 48, 54].

The concept of matrix-free evaluation with sum factorization has been widely adopted by now, like in the deal.II , DUNE [5, 40, 60], Firedrake , mfem , Nek5000  or Nektar++  projects. These fast evaluation techniques are directly applicable to explicit time stepping schemes, as we have demonstrated for wave propagation in [42, 53, 65,66,67,68] and the compressible Navier–Stokes equations . The proposed developments make matrix-free evaluation of high-order DG operators reach a throughput in unknowns per second almost as high as for optimized 5-wide finite difference stencils in a CFD context , despite delivering much higher accuracy.

### 2.2 Efficiency of Matrix-Free Implementation

In Fig. 2, we give an overview of the achieved performance with our framework applied to the discontinuous Galerkin interior penalty (IPDG) discretization of the 3D Laplacian on an affine geometry. The most advanced implementation presented in  is used, namely a cell-based loop with a Hermite-like basis for minimal data access . The figure lists the throughput, which is measured by recording the run time of the matrix-vector product in an experiment with around 50 million DoFs (too large to fit into caches), and reporting the normalized quantity DoFs/s obtained by dividing the number of DoFs by the measured run time. The code is run on a single node of six dual-socket HPC systems from the last decade with a shared-memory parallelization with OpenMP, threads pinned to logical cores with the close affinity rule, and using streaming stores to avoid the read-for-ownership data transfer  on the result vector. As systems, we consider a 2 × 8 core AMD Opteron 6128 system from 2010, a 2 × 8 core Intel Xeon 2680 Sandy Bridge from 2012 (as used in the SuperMUC phase 1 installation in Garching, Germany), a 2 × 8 core Intel Xeon 2630 v3 (Haswell) representing a medium-core count chip from 2014, a 2 × 14 core Intel Xeon 2697 v3 (Haswell) representing a high-core count chip of the same generation (as used in the SuperMUC phase 2 installation), a 2 × 20 core Intel Xeon 2698 v4 (Broadwell) system from 2016, and a 2 × 24 core Intel Xeon Platinum 8174 from 2017, labeled ‘Skylake’ in the remainder of this work, and installed in the SuperMUC-NG supercomputer. The chips are operated at 2.0 GHz, 2.7 GHz, 2.4 GHz, 2.6 GHz, 2.2 GHz, and 2.3 GHz, respectively, and all run with fully populated memory interfaces. The Intel machines are run with 2-way hyperthreading, e.g. with 96 threads for the Xeon Platinum Skylake.

The throughput results in Fig. 2 demonstrate the advancements of hardware during the last decade. In particular the increased width of vectorization, from 2 to 4 doubles with Sandy Bridge and from 4 to 8 doubles with Skylake, are clearly visible. Furthermore, the comparison between Sandy Bridge and the smaller Haswell system reveals the benefit of fused multiply-add (FMA) instructions and higher L1 cache bandwidth of the latter: For low polynomial degrees with a modest number of FMA instructions, Sandy Bridge with its higher frequency can approximately deliver the same performance as Haswell. As the polynomial degree is increased, the arithmetic work is increasingly dominated by FMAs in the sum factorization sweeps similar to (5) as shown in , and Haswell pulls ahead. Finally, while we observe a throughput of up to 5.7 billion DoFs/s on Skylake (with up to 1.35 TFlop/s for k = 8), we observe a relatively strong decrease of performance for polynomial degrees k ≥ 13: This is because the vectorization across elements leads to an excessive size of the temporary data within sum factorization—here, a different vectorization strategy could lead to better results. However, we consider the polynomial degrees 3 ≤ k ≤ 8 most interesting for practical simulations, where almost constant throughput in terms of DoFs/s is reached. This somewhat surprising result, given the expected $$\mathcal O(1/k)$$ complexity of throughput for sum factorization, is because face integrals and memory access with an $$\mathcal O(1)$$ complexity are dominant. Compared to our initial implementation in 2015, which achieved a throughput of 0.32 billion DoFs/s on Sandy Bridge with degree k = 3, the progress in software technologies allowed us to reach 1.02 billion DoFs/s on the same system. For Intel Skylake, where memory access is more important, the software progress of our project is more than 4×.

Figure 3 shows the throughput normalized by the number of cores for polynomial degree k = 4 over the different hardware generations. For operator evaluation with discontinuous elements and face integrals, approximately 200 floating point operations per unknown are involved with our optimized implementations . At the same time, we must access at least 16 byte (read one double, write one double) plus some neighbors that are not cached, so the arithmetic intensity is around 8–12 Flop/Byte, close to the machine balance of the Skylake Xeon. This means that both memory bandwidth and arithmetic performance are relevant for performance (on one Skylake node, we measured memory throughput of around 160 GB/s, compared to the STREAM limit of 205 GB/s). Likewise, continuous elements evaluated on an affine mesh have seen a considerable increase in throughput per core (arithmetic intensity of 7 Flop/Byte). However, the improvement has been much more modest for continuous elements evaluated on curved elements. In this setting, separate metric terms for all quadrature points and all elements are needed (as opposed to a single term per element in the affine mesh case), reducing the arithmetic intensity to around 1.2 Flop/Byte.Footnote 1

Figure 3 also contains the evolution of performance of a sparse matrix-vector product for tri-linear continuous finite elements. The performance is much lower due to the aforementioned memory bandwidth limit, and has hardly improved per core on Skylake over the dated Opteron architecture. This illustrates the effect of the so-called memory wall. We emphasize that the sparse matrix-vector product for k = 1 is more than three times slower than even the matrix-free evaluation for k = 4 on curved elements. Hence, high-order methods with matrix-free implementations are faster per unknown on newer hardware, in addition to their higher accuracy.

## 3 Performance-Optimized Conjugate Gradient Methods

The developments of matrix-free implementations presented in the previous section result in a throughput for evaluation of the IPDG operator in Fig. 2 of up to 5.7 billion DoFs/s on Skylake. This is equivalent in time to the mere access of 4.5 doubles per DoF (either reading or writing). In other words, our developments have made the operator evaluation so fast that the matrix-vector product may no longer be the dominant operation in algorithms like the conjugate gradient (CG) method preconditioned by the diagonal, or Chebyshev smoothers. These algorithms involve access to between 6 and 18 vectors for vector updates, the application of the diagonal preconditioner, and inner products. For optimal application performance it is therefore necessary to look into the access to vectors. As proposed in our work [51, 56, 65], merging the vector operations can improve throughput by up to a factor of two, and in particular for the DG case with cell-based loops which allow for a single pass through data [48, 56]. Fusion of different steps of a scheme has also been proposed for explicit time integrators in .

For the assessment of optimization opportunities on the algorithm level that goes beyond the matrix-vector product, we consider a high-order finite element benchmark problem suggested by the US exascale initiative “Center for Efficient Exascale Discretization” (CEED). The benchmark involves a continuous finite element discretization of the Laplacian (3), using matrix-free operator evaluation within a conjugate gradient solver preconditioned by the matrix diagonal. In this study, we consider the case BP5 , see also https://ceed.exascaleproject.org/bps/, which integrates the weak form (4) of polynomial degree k using a Gauss–Lobatto quadrature formula with $$n_q^{\mathrm {1D}}=k+1$$ quadrature points on a cube with deformed elements. While this integration is not exact, it is the typical spectral element setup with an identity interpolation matrix S i = I in Eq. (5).

Figure 4 lists the contributors to run time for the plain conjugate gradient method preconditioned by the point-Jacobi method as a function of the problem size for the polynomial degree k = 6. Here, the metric terms $$\mathcal J_K$$ are computed on the fly from a tri-linear representation of the geometry. Three different performance regimes can be distinguished in the graph: To the left, there is not enough parallelism given the domain decomposition on 48 MPI ranks and batches of 8 elements due to vectorization—indeed, at least 85,000 DoFs are needed to saturate all cores and SIMD lanes. Furthermore, the synchronization barriers due to the inner products in the conjugate gradient method also lead to a slowdown. As the problem size and parallelism increase, the run times decrease significantly and reach a minimum for a problem size around one million DoFs. Here, all data involved in the algorithm fits into the approximately 110 MB of L2+L3 cache on the processors. As the size is further increased, caches are exhausted and most data must be fetched from slow main memory. As a consequence, the run time of the solver increases significantly, and the vector updates, the diagonal preconditioner, and the inner products take a significant share. Note that all vector operations use the hardware optimally with a memory throughput of 205 GB/s.

In order to improve performance, we have therefore developed conjugate gradient implementations with merged vector operations by loop fusion. Figure 5 compares three variants of the conjugate gradient solver: the plain conjugate gradient method runs all vector operations through high-level vector interfaces with separate loops for addition and inner products. In the “merged dot products”, we have merged the dot product p TAp following the matrix-vector product into the loop over the elements, and merged the vector updates to the residual and solution with the dot product for r TP −1r. Here, r denotes the residual vector, p the search direction of the conjugate gradient method, A the matrix operator (represented in a matrix-free way), and P −1 the diagonal preconditioner. However, the improvements with this algorithm are relatively modest.

Much more performance can be gained by creating a conjugate gradient variant we call “fully merged”: Here, each CG iteration performs a single loop through all vector entries and ideally reads 5 vectors (solution, residual, search direction, temporary vector to hold the matrix-vector product, and diagonal of preconditioner) and writes four (solution, residual, search direction, temporary vector). All vector updates of the previous CG iteration are scheduled before the matrix-vector product and all inner products are scheduled after the matrix-vector product. The vector operations are interleaved with the loop over elements, ensuring that dependencies due to the access pattern of the loop and the MPI communication are fulfilled (this leads to slightly more access in practice). This approach applies the preconditioner several times with partial sums to construct the inner products with a single MPI_Allreduce, trading some local computations for the decreased memory access. Of course, fusing the preconditioner into the loop assumes that it is both cheap to apply and does not involve long-range coupling between the DoFs. The results in Fig. 5 show that performance in the saturated limit, i.e., for large sizes beyond 107 DoFs, is 2.5 times faster than with the plain CG iteration. Interestingly, this also improves performance for the sizes fitting into caches, which is due to less synchronization and reducing access to the slower L3 cache.

To put the performance of the fully merged case on Intel Skylake into perspective, we compare with executing the plain CG method on an Nvidia V100 GPU using the implementation from [51, 57]: even though the GPU runs with around 700 GB/s of memory throughput, the performance is higher on Intel Skylake with only 200 GB/s from RAM memory because the merged loops significantly increase data locality. Furthermore, on the GPU we do not compute the metric terms on the fly, but load a precomputed tensor $$\mathcal J_{K}^{-1} \mathcal J_{K}^{-\mathsf T} \mathrm {det}(\mathcal J_{K}) w_q$$ which is faster due to reduced register pressure, see also the analysis for BP5 in . We also note that the GPU results with our implementation are faster than an implementation with the OCCA library described in  with up to 0.6 billion DoFs/s on a V100 of the Summit supercomputer. The reason is that our implementation uses a continuous finite element storage that does not duplicate the unknowns at shared vertices, edges and faces, which reduces the memory access by about a factor of two. Furthermore, the results from  involve a separate gather/scatter step with additional memory transfer to enforce continuity, while this is part of the operator evaluation within a single loop in our code.

Figure 6 lists the achieved throughput with a fully merged conjugate gradient solver for polynomial degrees k = 2, …, 8, the most interesting regime for our solvers. We use a tri-linear representation of geometry and compute the geometric factors on the fly. Throughput is somewhat lower for quadratic and cubic elements because the geometry data located in the vertices is still noticable.

The results in Fig. 3 motivate the analysis of the representation of the geometry in the matrix-vector product, with results presented in Fig. 7. The figure lists both the throughput of the matrix-vector product in the left panel and the throughput of the complete CG iteration with merged vector operations. Highest performance is obtained for the affine mesh case where our implementation can compress the memory access of the Jacobian. While this case is excluded from the CEED BP5 specification that requires a deformed geometry , it is an interesting baseline to compare against. Using separate tensors for each quadrature point, “variable tensor cached”, is equally fast as the affine case as long as data fits into caches. However, performance drops once the big geometric arrays must be fetched from main memory. For the case the geometry is computed on the fly from a tri-linear representation of the mesh, i.e., the vertices, the matrix-vector product is slower than the affine variant. For the conjugate gradient solver, however, we observe that the two reach essentially the same performance for five million and more DoFs, as they are both limited by the memory bandwidth from vector access. The “tri-linear compute” case involves a higher Flop/s rate with almost 700 GFlop/s, as compared to the throughput of 330 GFlop/s for the affine mesh case. This means that the merged vector operations allow us to fit additional computations behind the unavoidable memory transfer without affecting application performance. Finally, an isoparametric representation of the geometry (labeled “isopara compute” in Fig. 7) can also be computed on the fly by sum factorization from a kth degree polynomial . While this case is obviously slower than the precomputed variable-tensor case from caches, it leverages higher performance when data must be fetched from main memory.

The tri-linear and isoparametric cases are not equivalent, as only the latter represents higher order curved boundaries. Intermediate polynomial degrees for the geometry are conceivable, which would land between the two in terms of application throughput. To combine the higher performance of the former, we plan to investigate the tradeoffs in more detail in the future, e.g. by using a k-degree representation on a single layer of elements at the boundary and a tri-quadratic representation in the domain’s interior.

Finally, Fig. 8 shows the weak scaling of the BP5 benchmark problem up to the full size of the SuperMUC-NG machine with 6336 nodes and 304,128 cores. The data is normalized by reporting the number of DoF per node, so ideal weak scaling would correspond to coinciding lines. While the saturated performance is scaling well, giving a sustained performance of up to 4.4 PFlop/s,Footnote 2 most of the in-cache performance advantage is lost due to the communication latency over MPI, see also  for limits with MPI in PDE solvers. Defining the strong scaling limit as the point where throughput reduces to 80% of saturated performance , it is reached for wall times of 56 μs on 1 node. On 512 nodes, the strong scaling limit is already around 180 μs, whereas it is 245 μs on the full SuperMUC-NG machine. Note that even though most optimizations presented in this section have addressed the node-level performance, we have also considered the strong scaling in our work—indeed, the strong scaling on SuperMUC-NG is excellent with a limit around 5 times lower than the BlueGene-Q results presented in .

## 4 Geometric Multigrid Methods in Distributed Environments

Multigrid methods are efficient solvers for the linear systems arising from the discretization of elliptic problems, see  for a recent efficiency evaluation and  for a projection of elliptic solver performance to the exascale setting. They apply simple iterative schemes called smoothers on a hierarchy of coarser problem representations. On each level of the hierarchy, the smoothers address the high-frequency content of the solution by smoothening the error. On a sufficiently coarse level with a small number of unknowns, a direct solver can be applied. The multigrid algorithm can be realized by a V-cycle as illustrated in Fig. 9 or some related cycle (W-cycle or F-cycle). In the matrix-free high-order finite element context, variants of the Chebyshev iteration around a simple additive scheme, such as point-Jacobi or approximate block-Jacobi with some rank-d approximation of the cell matrix, are state of the art. The results in this section are based on this selection. Overlapping Schwarz schemes are a new development detailed in Sect. 5 below.

In terms of finding the coarser representations for the multigrid hierarchy, high-order finite element and discontinuous Galerkin methods permit a range of options. The hierarchy can both be constructed by coarser meshes (h-multigrid), by lowering the polynomial degree (p-multigrid), by a discontinuous-continuous transfer as well as algebraically based on the matrix entries only (algebraic multigrid). The latter do not fit into a matrix-free context, since they explicitly rely on a sparse matrix and also often are not robust enough as the degree increases. As it has been shown by the work , scalability to the largest supercomputers is much more favorable if knowledge about coarsening by a mesh can be provided. In other words, geometric multigrid is to be preferred over algebraic multigrid in case there is such structure in the problem.

For these reasons, we have developed a comprehensive geometric multigrid framework with deal.II. In , a hybrid multigrid solver with all possibilities of h-, p-, and algebraic coarsening has been combined in a flexible framework, with the possibility to perform an additional c-transfer from discontinuous to continuous function spaces for the DG case. In terms of the h-MG method on adaptive meshes, the deal.II library implements the local smoothing algorithm [9, 36, 37] where smoothing is done level by level. Our work  developed a communication-efficient coarsening strategy for this setup, at the cost of a load imbalance for smoothing on the multigrid levels with adaptively refined meshes. The tradeoffs in this choice and the associated costs have been quantified by a performance model in .

Figure 10 shows the results of two strong scaling experiments of the multigrid V-cycle with the h-multigrid infrastructure of the deal.II library. The uniform grid and a typical adaptively refined case are compared for the same problem size of 137 million and 46 billion DoFs, respectively, see  for details on the experiment. Differences in run time are primarily due to the load imbalance for the level operations. The results demonstrate optimal parallel scaling of both the uniform and adaptively refined cases down to around 10−2 s, with a slightly better strong scaling of the adaptive case due to the slower baseline. This performance barrier—typical for strong scaling of multigrid schemes in general—can be explained by the specific type of global communication in this algorithm: from the fine mesh level with many unknowns distributed among a large number of cores, we transfer residuals to coarser meshes with restriction operators until the coarse grid solver is either run on a single core or with few cores in a tightly coupled manner. Then, the coarse-grid corrections are broadcast during prolongation, involving all processors again. The communication pattern of a multigrid V-cycle thus relates to a tree-based implementation of MPI_Allreduce, with the difference that the communication tree is induced by the grid and substantial operations, namely smoothing and level transfer, are intermixed with the communication. In this particular case, nine matrix-vector products with nearest-neighbor communication are performed per level (eight in the smoother and one for the residual before restriction). In addition, two vertical nearest-neighbor exchange operations are done in restriction and prolongation. A typical matrix-vector product with up to 26 neighbors takes around 10−4 s on the chosen Intel Sandy Bridge system when run on a few thousands of nodes . When done on seven levels plus the coarse mesh for the uniformly refined 137 million DoFs case, the expected saturated limit of around 8 ms is exactly seen in the figure. On the newer SuperMUC-NG machine, a latency barrier per V-cycle of around 2–4 ms per V-cycle has been measured, depending on the number of matrix-vector products for the level smoothers. This limit is attractive compared to alternative solvers for elliptic problems such as the fast multipole method or the fast Fourier transform [30, 35].

Multigrid schemes are at the heart of incompressible flow solvers through the pressure Poisson equation, as detailed in Sect. 6 below. Applications of matrix-free geometric multigrid to continuum mechanics were presented in  and to electronic calculations with sparse multivectors in [16, 17].

As an example of the large-scale suitability of the developed multigrid framework, Fig. 11 shows two scaling experiments on the SuperMUC-NG supercomputer with up to 304,128 cores of Intel Skylake. Black dashed lines denote ideal strong scaling along a line and weak scaling with a factor of 8 between the lines. The computational domain is a cube meshed by hexahedral elements, using the affine mesh code path for matrix-free algorithms discussed in Sect. 2. A consistent Gaussian quadrature with $$n_q^{\mathrm {1D}}=k+1$$ points is chosen. We run a conjugate gradient solver to a relative tolerance of 10−3 compared to the initial unpreconditioned residual. This setup is motivated by applications where a very good initial guess is already available, e.g. by extrapolation of solutions from the old time step [22, 45], and only a correction is needed. More accurate solves are obtained by tighter tolerances or by full multigrid setups . The multigrid V-cycle is run in single precision to increase throughput, together with a double precision correction through the outer CG solver. This setup has been shown in  to increase throughput by around 1.8 × without affecting the multigrid convergence.

In the left panel of Fig. 11, we present results for a continuous Galerkin discretization with a polynomial degree k = 4. A pure geometric coarsening down to a single mesh element is used. A Chebyshev iteration of degree five based on the matrix diagonal, i.e., point Jacobi, is used on all levels for pre- and post-smoothing. The maximal eigenvalue $$\tilde {\lambda }_{\max {}}$$ is estimated by 15 iterations of a conjugate gradient solver and the Chebyshev parameters are set to smoothen in a range $$[0.06\tilde {\lambda }_{\max {}},1.2\tilde {\lambda }_{\max {}}]$$. As a coarse solver, we use a Chebyshev iteration with the degree set to reduce the residual by 103 in terms of the Chebyshev a-priori error estimate . We observe ideal weak scaling and strong scaling to around 10−2 s. More importantly, the absolute run time is excellent: For instance, the 8.6 billion DoF case on 1536 cores is solved in 1.4 s, i.e., 4.0 million DoFs are solved per core per second.

The right panel of Fig. 11 shows the result for multigrid applied to an IPDG discretization with k = 5. Here, we use a transfer from the discontinuous space to the associated continuous finite element space with k = 5 on the finest mesh level (see  for the theoretical background and  for the multigrid context) and then progress by h-coarsening to a single element. On the DG level, we use a Chebyshev smoother around a block-Jacobi method, with the block-Jacobi problems inverted by the fast diagonalization method . On all continuous finite element levels, a Chebyshev iteration around the point-Jacobi method is used. The degree of the Chebyshev polynomial is six. This solver setup achieves a multigrid convergence rate of about 0.025, i.e., reduces the residual by 3 orders of magnitude with two V-cycles. If used in a full multigrid setting , a single V-cycle on the finest level would suffice to solve the problem to discretization accuracy. Merged vector operations with a Hermite-like basis for the Chebyshev iteration are used according to . The final application performance of the largest computation on 1.9 trillion DoFs is 5.9 PFlop/s, with 5.6 PFlop/s done in single precision and 0.27 PFlop/s in double precision. The limiting factor is mostly memory transfer, however, with an application throughput of around 175 GB/s per node (the STREAM limit of one node is 205 GB/s).

## 5 Fast Tensor Product Schwarz Smoothers

In Sect. 4, we have discussed a scalable implementation of geometric multigrid methods, obtaining an efficient solver in the sense of cost per iteration. It employs the matrix-free operator implementation from Sect. 2 in order to reduce the computational cost for residuals and grid transfer. The missing building block for our cost model in Eq. (1) is an efficient implementation (in terms of computational cost per DoF) of an efficient smoother (in terms of number of multigrid iterations).

The main challenge consists of finding preconditioners whose cost is similar to operator evaluation. So far, we have discussed Chebyshev smoothers, which can be implemented matrix-free in a straight-forward fashion. Alas, their performance is not robust for higher order elements. Likewise, from an arithmetic cost point of view sparse matrices can be competitive at most for moderate polynomial degrees k = 2, 3  or when done via auxiliary spaces of linear elements on a subdivided grid using some matrix-based preconditioner. However, Fig. 3 shows that even sparse matrices for linear elements are up to 10 times slower than the matrix-free operator evaluation. It seems that only the two SPPEXA projects ExaDUNE and ExaDG have addressed this question in [6, 76]. While  focuses on iterative solution of cell problems for multigrid smoothing, we consider domain decomposition based smoothers in the form of multilevel additive and multiplicative Schwarz methods based on low-rank tensor approximations. They consist of a subdivision of the mesh on each level into small subdomains consisting either of a single cell, or of the patch of cells sharing a common vertex. On each of these subdomains, local finite element problems are solved. Comparing with operator application, these smoothers share the structural property of evaluation of local operators on mesh cells or on a patch of cells. They differ by the fact that the smoothers involve local inverses instead of local forward operators, and that these local inverses in general are not amenable to a tensor decomposition like sum factorization. There is one exception though, namely separable differential operators. In d dimensions these can be written in the form

$$\displaystyle \begin{gathered} {} L = I_d \otimes \dots \otimes I_2\otimes L_1 + \dots + L_d \otimes I_{d-1} \otimes \dots \otimes I_1, \end{gathered}$$
(6)

where L k are one-dimensional differential operators and I k are identity operators in directions k = 1, …, d. This representation transfers to finite element operators with tensor product shape functions in a straight-forward way, reading I k as one-dimensional mass matrices M k.

Due to , the inverse of L can be represented as the product

$$\displaystyle \begin{gathered} {} L^{-1} = Q \Lambda^{-1}Q^{\mathsf T}, \end{gathered}$$
(7)

with the diagonal matrix Λ = I d ⊗⋯ ⊗ I 2 ⊗ Λ1 + ⋯ + Λd ⊗ I d−1 ⊗⋯ ⊗ I 1, where I k denote identity matrices, and a rank-1 decomposition Q = Q d ⊗⋯ ⊗ Q 1. The tensor factors are obtained by solving d generalized eigenvalue problems

\displaystyle \begin{gathered} \begin{aligned} \Lambda_k &= Q_k^{\mathsf{T}} L_k Q_k,\\ I_k &= Q_k^{\mathsf{T}} M_k Q_k, \qquad k=1,\dots,d. \end{aligned} \end{gathered}
(8)

Thus, the computational effort for computing the inverse has been reduced from $$\mathcal O(k^{3d})$$ to $$\mathcal O(dk^3)$$ and for the application of local solvers from $$\mathcal O(k^{2d})$$ to $$\mathcal O(dk^{d+1})$$ by exploiting sum factorization. Based on this technique, we have implemented a geometric multigrid method in  based on earlier work in [37,38,39].

### 5.1 The Laplacian on Cartesian Meshes

In order to test our concept and to obtain a performance baseline for more complicated cases, we first attend to the case where the decomposition described above can be applied in a straightforward way, namely the additive Schwarz method with subdomains equal to mesh cells. As Table 1 shows, it yields an efficient preconditioner with less than 25 conjugate gradient steps for a gain of accuracy of 108. While it is uniform in the mesh level, it is not uniform in the polynomial degree due to the increasing penalty parameter of the interior penalty method. The computational effort for a smoothing step based on local solvers in the form (7) is below the effort for a matrix-free operator application for polynomial degrees between 3 and 15 in three dimensions because it only involves operations on cells. The setup time for computing Q and Λ is even less. Thus, in the context of the performance analysis of the conjugate gradient method in Sect. 2, it barely adds to the cost per iteration step, but reduces the number of matrix-vector products when comparing to the accumulated numbers within a Chebyshev/point Jacobi method, and almost independently of polynomial degree.

In view of application to incompressible flow, we also study vertex patches as typical subdomains for smoothing. First, we observe that a regular vertex patch with 2d cells attached to a vertex inherits the low-rank tensor product structure from its cells, possibly after renumbering due to changes in orientation. Thus, we can apply the same method as on a single cell, resulting effectively in a factor 2d in the complexity estimates above. Patches around vertices with irregular topology like 3 or 5 cells in two dimensions do not possess a tensor product structure. Fortunately, on meshes obtained by refinement of a coarse mesh, they are all determined by irregularities of the coarse mesh and thus small in number.

Vertex patches lead to overlapping decompositions with overlap of at least 4 and 8 in two and three dimensions, respectively. From the analysis of Schwarz methods, it becomes clear that a multiplicative method is required for highest multigrid convergence rates. In order to parallelize such a smoother and to avoid race conditions, mesh cells are colorized, that is, they are separated into “colors” such that patches of the same “color” do not share any common face or cell. As a consequence, the multiplicative method coincides with an additive method within each color, such that we can execute the local solvers in parallel within each color, and the colors sequentially. Typical convergence results for the Laplacian are reported in Table 2, suggesting that this scheme is almost a direct solver.

The vertex patch has 4 and 8 times as many unknowns as a single mesh cell in two and three dimensions, respectively. Thus, the effort for a smoothing step with 16 colors and the optimizations described above turns out to be about 20 to 24 times the effort of a matrix-free operator application, measured over polynomial degrees from 3 to 15. This seems excessive at first glance, but it must be kept in mind that the Chebyshev smoother of degree 6 used in Fig. 11 also involves 12 matrix-vector products for pre- and post-smooting. Futhermore, the current scheme comes with a reduction of the number of steps by a factor 10 compared to the additive cell smoother for the Laplacian, which makes it almost competitive . Finally, the iteration counts are independent of the polynomial degree, making the scheme attractive for higher degrees. Moreover, we point out that this smoother also allows for the solution of a Stokes problem in four iteration steps .

### 5.2 General Geometry

As soon as the mesh cells are not Cartesian anymore, the special structure of separable operators in (6) is lost and the inverse cannot be computed according to (7). In this case, we have two options: solving the local problems iteratively, as in , or approximately. A possible approximation which recovers the situation of the previous subsection consists of replacing a non-Cartesian mesh cell by an approximating (hyper-)rectangle, then inverting the separable differential operator on the rectangle (omitting the prefix hyper from here on).

Such a surrogate rectangle can be obtained from the following procedure: first, we compute the arc length of all edges. From these, we obtain the length of the rectangle in each of its natural directions by averaging over all parallel edges (in a topological sense). Thus, the geometry of the rectangle is determined up to its position and orientation in space. Given the fact that the Laplacian is invariant under translation and rotation, these do not matter and we can choose a rectangle centered at the origin with edges parallel to the coordinate directions. Different differential operators may require different approximations here.

The convergence theory of Schwarz methods allows for inexact local solvers as long as they are spectrally equivalent. Naturally, the deviation from exactness enters into the convergence speed of the method. Additionally, inexact local solvers can amplify the solution, such that a smaller relaxation parameter may be necessary. This is exhibited in Table 3, where we compare the efficiency of multigrid with exact local solvers and the method with surrogate rectangles as described above. We see that a reduction of the relaxation parameter $$\widehat {\omega }=0.7$$ for exact local solvers to $$\widehat {\omega }=0.49$$ is necessary for robust convergence. We point out though, that while the inexact methods need more iteration steps, they are much faster than exact inverses, since they use the Kronecker representation (7) of the approximate inverse. For instance, the setup cost is 3000 times higher, with a growing gap for higher polynomial degrees.

### 5.3 Linear Elasticitiy

In order to provide an outlook on how to apply this concept to more general problems, we consider linear elasticity, namely the Lamé-Navier equations, with the bilinear form

$$\displaystyle \begin{gathered} a(\mathbf{u},\mathbf{v}) = 2 \mu \bigl(\upvarepsilon(\mathbf{u}),\upvarepsilon(\mathbf{v})\bigr) + \lambda\bigl(\nabla\cdot \mathbf{u},\nabla\cdot\mathbf{v}\bigr). \end{gathered}$$
(9)

Here, $$\upvarepsilon (\mathbf {u}) = \frac 12(\nabla \mathbf {u} + \nabla \mathbf {u} ^{\mathsf T})$$ is the strain tensor of the displacement field u and $$\bigl (\cdot ,\cdot \bigr )$$ denote the appropriate DG discretization with interior penalty terms.

Consider a Cartesian vertex patch, that is, a patch with all faces aligned with the coordinate planes and with tensor product shape functions on each cell. As before, let M k be the one-dimensional mass matrix in direction k and L k the matrix representing the Laplacian including all face terms introduced by the interior penalty formulation. Furthermore, let G k be the matrix associated to the first derivative, again including the DG interface terms which arise in products of the form $$G_k^{\mathsf T}\otimes G_l$$. With these notions and the three-dimensional Laplacian

$$\displaystyle \begin{gathered} L = M_3\otimes M_2\otimes L_1 + M_3 \otimes L_2\otimes M_1 + L_3 \otimes M_2 \otimes M_1, \end{gathered}$$
(10)

we can write the bilinear form a(., .) on the patch in matrix form

\displaystyle \begin{aligned} A_p &= \mu \begin{bmatrix} L + M_3 \otimes M_2 \otimes L_1 & M_3\otimes G_2^{\mathsf T} \otimes G_1 & G_3^{\mathsf T} \otimes M_2 \otimes G_1 \\ M_3 \otimes G_2 \otimes G_1^{\mathsf T} & L + M_3 \otimes L_2 \otimes M_1 & G_3^{\mathsf T} \otimes G_2 \otimes M_1 \\ G_3\otimes M_2\otimes G_1^{\mathsf T} & G_3 \otimes G_2^{\mathsf T} \otimes M_1 & L + L_3 \otimes M_2 \otimes M_1 \end{bmatrix}\\ &\qquad \qquad + \lambda \begin{bmatrix} M_3 \otimes M_2 \otimes L_1 & M_3 \otimes G_2 \otimes G_1^{\mathsf T} & G_3 \otimes M_2 \otimes G_1^{\mathsf T} \\ M_3 \otimes G_2^{\mathsf T} \otimes G_1 & M_3 \otimes L_2 \otimes M_1 & G_3 \otimes G_2^{\mathsf T} \otimes M_1 \\ G_3^{\mathsf T} \otimes M_2 \otimes G_1 & G_3^{\mathsf T} \otimes G_2 \otimes M_1 & L_3 \otimes M_2 \otimes M_1 \end{bmatrix} \end{aligned}
(11)

Clearly, this matrix lacks the simple structure of Kronecker products we employed in the previous subsections. Nevertheless, we have Korn’s inequality , and thus the block diagonal of the left matrix is spectrally equivalent to the matrix itself. Consequently, we expect that

$$\displaystyle \begin{gathered} \tilde A_p = \mu\begin{bmatrix} L + M_3 \otimes M_2 \otimes L_1 \\ &L + M_3 \otimes L_2 \otimes M_1 \\ &&L + L_3 \otimes M_2 \otimes M_1 \end{bmatrix}, \end{gathered}$$
(12)

which has the desired Kronecker product structure, is a good local solver. Indeed, Table 4 confirms this expectation. Iteration counts remain almost constant over a wide range of mesh levels and polynomial degrees. Comparing to Table 2, we lose less than a factor two, typically requiring 4 steps instead of 3.

While Korn’s inequality helped us with the left matrix in (11), the matrix corresponding to the “grad-div” term in the Lamé–Navier equations has a nontrivial kernel and thus its inverse cannot be approximated by a block diagonal. We confirm this in Table 5. After augmenting $$\tilde A_p$$ by the diagonal terms of the grad-div matrix, we vary μ and λ. As expected, iteration counts increase when λ ≫ μ to the point, where the method becomes infeasible.

The case λ ≫ μ corresponds to an almost incompressible material. Thus, this behavior has to be addressed from two sides. First, the discretization must be suitable . Then, the local solvers must be able to reduce the divergence sufficiently. Here, we have to find ways to implement a smoother like in  in an efficient way. Its structure prevents us from utilizing the tensor product techniques, namely the fast diagonalization method, used so far.

As an outlook, we describe a solution approach for two dimensions, which has been developed in a recent bachelor’s thesis . The diagonal blocks of the matrix A p are

\displaystyle \begin{aligned} A_1 = \left(2\mu+\lambda\right) M_2 \otimes L_1 + \mu L_2 \otimes M_1 , \quad A_2 = \mu M_2 \otimes L_1 + \left(2\mu+\lambda\right) L_2 \otimes M_1 . \end{aligned}
(13)

Both A 1 and A 2 admit a fast diagonalization, for instance

\displaystyle \begin{aligned} A_2^{-1} = \bigl(Q_2 \otimes Q_1\bigr) \bigl( I_2 \otimes \Lambda_1 + \Lambda_2 \otimes I_1 \bigr)^{-1} \bigl(Q_2 \otimes Q_1 \bigr)^{\mathsf T} . \end{aligned}
(14)

Given the off-diagonal block $$B=\mu G_2^{\mathsf T}\otimes G_1+\lambda G_2\otimes G_1^{\mathsf T}$$, the Schur complement of A p is

$$\displaystyle \begin{gathered} S = A_1 - B^{\mathsf T} A_2^{-1} B . \end{gathered}$$
(15)

While this is not a sum of Kronecker products, Kronecker singular value decomposition (KSVD), see [72, 73], can be utilized to construct an approximation of the Schur complement which is fast diagonalizable. We proceed as follows:

1. A.1

compute the fast diagonalizations of A 1 and A 2

2. A.2

compute the rank-ρ Λ KSVD of the inverse diagonal matrix in Eq. 14

\displaystyle \begin{aligned} \bigl( I \otimes \Lambda^{(1)} + \Lambda^{(2)} \otimes I \bigr)^{-1} \approx \sum_{i=1}^{\rho_\Lambda} C_i \otimes D_i \end{aligned}
(16)
3. A.3

compute the rank-2 KSVD

\displaystyle \begin{aligned} \widehat{S} := E_1 \otimes F_1 + E_2 \otimes F_2 \approx \widetilde{S} \end{aligned}
(17)

of the approximate Schur complement

\displaystyle \begin{aligned} \widetilde{S} := A_1 - B^{\mathsf T} \left[\sum_{i=1}^{\rho_\Lambda} Q_2 C_i^{-1} Q_2^{\mathsf T} \otimes Q_1 D_i^{-1} Q_1^{\mathsf T}\right] B \end{aligned}
(18)
4. A.4

compute the fast diagonalization of $${\widehat {S}}$$.

Then, Gaussian block elimination provides an approximate inverse

\displaystyle \begin{aligned} A_p^{-1} \approx \begin{bmatrix} I & -A_1^{-1} B \\ 0 & I \end{bmatrix} \begin{bmatrix} A_1^{-1} &0 \\ 0 & \widehat{S}^{-1} \end{bmatrix} \begin{bmatrix} I & 0 \\ -B^{\mathsf T} A_1^{-1} & I \end{bmatrix}. \end{aligned}
(19)

Implementation and evaluation of these smoothers are still work in progress, but the thesis  suggests fast and robust convergence at least in a finite difference context.

The take-home message from this section is that an efficient approximate solution of the local problems in Schwarz smoothers is possible using low-rank tensor representations and can be achieved with effort similar to a matrix-free operator application in the best case. Finding such low-rank representations is nevertheless highly dependent on the differential equation and geometry. Further investigation will be directed in particular at dealing with the grad-div operator.

## 6 High-performance Simulations of Incompressible Flows

Computational fluid dynamics (CFD) simulations of turbulent flows at large Reynolds number, e.g., Re > 106, are among those problems that typically require a huge amount of computational resources in order to resolve the turbulent flow structures in space and time, and have been addressed as an application by the ExaDG project. The underlying model problem is given by the incompressible Navier–Stokes equations

\displaystyle \begin{aligned} \frac{\partial \mathbf{u}}{\partial t} + \nabla \cdot \left(\mathbf{u} \otimes \mathbf{u}\right) - \nu \nabla^2 \mathbf{u} + \nabla p &= \mathbf{f} \; ,{} \end{aligned}
(20)
\displaystyle \begin{aligned} \nabla \cdot \mathbf{u} &= 0 \; .{} \end{aligned}
(21)

Scale-resolving simulations for engineering applications typically involve beyond $$\mathcal {O}(10^{10}-10^{11})$$ unknowns (DoFs) and $$\mathcal {O}(10^5-10^7)$$ time steps. High-performance implementations for this type of problem are therefore of paramount importance for the CFD community. It is important to stress that implementing a given algorithm optimally for a given hardware, i.e., an implementation that performs close to the hardware limits, is only one step to achieve the goal of providing efficient flow solvers for engineering problems as emphasized in the introduction. While the previous sections discussed the second and third term in Eq. (1), namely the performance of matrix-free evaluation routines and fast multigrid solvers for high-order discretizations, we now also include discretization aspects into the discussion. The implementation makes use of the fast matrix-free evaluation routines and multigrid solvers discussed in previous sections.

We use a method of lines approach with high-order DG discretizations in space and splitting methods with BDF time integration. Splitting methods separate the solution of the incompressible Navier–Stokes equations into sub-problems such as a Poisson equation for the pressure and a (convection–)diffusion equation for the velocity and are among the most efficient solvers currently known. In a first contribution , we highlighted that previous discretization methods lack robustness, on the one hand in the limit of small time step sizes, and on the other hand in under-resolved scenarios where the spatial discretization only resolves the largest scales of the flow. The stability problem for small time step sizes has been addressed in detail in  where we found that a proper DG discretization of velocity-pressure coupling terms is essential to achieve robustness at small time steps. In , we presented the first high-order DG incompressible flow solver that is robust in the under-resolved regime and that relies completely on efficient matrix-free evaluation routines. The developed discretization approach is attractive as it provides a generic solver for turbulent flow simulations that is robust and accurate without the use of explicit turbulence models. Such a technique is known as implicit large-eddy simulation in the literature and has the advantage that it does not require turbulence model parameters. While this property of high-order DG discretizations is already known from discontinuous Galerkin discretizations of the compressible Navier–Stokes equations, the work  has been the first demonstrating this property for DG discretizations of the incompressible Navier–Stokes equations. The key ingredient for a robust high-order, L 2-conforming DG discretization for incompressible flows turns out to be the use of consistent stabilization terms that enforce the divergence-free constraint and inter-element mass conservation in a weak sense. These requirements can also be included into the finite element function spaces by using so-called H(div)-conforming (normal-continuous) discretizations that are exactly (pointwise) divergence-free by using Raviart–Thomas elements. As investigated in detail in , such an approach has indeed very similar discretization properties when compared with the stabilized L 2-conforming approach in practically relevant, under-resolved application scenarios. The model has been extended to moving meshes in .

A detailed performance analysis has been undertaken in  where we discuss the incompressible flow solver w.r.t. its efficiency according to Eq. (1). Based on this efficiency model, we have then compared matrix-free solvers based on incompressible and compressible Navier–Stokes formulations in  for under-resolved turbulent incompressible flows. The compressible solver uses explicit time integration and therefore only requires one operator evaluation in every Runge–Kutta stage as opposed to the incompressible solver involving the solution of linear system of equations such as a pressure Poisson equation within every time step. Simple explicit solvers are often considered efficient due to better parallel scalability since implicit Krylov solvers with multigrid preconditioning involve global communication. However, our work shows a significant performance advantage of the incompressible formulation over the compressible one on the node-level for sufficient workload. Albeit speed-up factors are higher, it is difficult to achieve a performance advantage for the algorithmically simple, explicit-in-time compressible solver in the strong-scaling limit in terms of absolute run time. In our experience, the potential to outperform an implicit solver at some point in the strong-scaling limit has not materialized. We see it as a future challenge to devise optimal PDE solvers providing good performance over a wide range of problems and hardware platforms due to this high degree of interdisciplinarity.

We have applied this solver framework to conduct direct numerical simulations of turbulent channel flow in , the first direct numerical simulation of the turbulent flow over a periodic hill at Re ≈ 104 in , and to large-eddy simulation of the FDA benchmark nozzle problem in . Furthermore, we have developed multiscale wall modeling approaches that allow to use the proposed highly efficient schemes also for industrial cases with even higher Reynolds numbers than what is feasible for wall-resolved large eddy simulation .

Here, we show performance results obtained on SuperMUC-NG with Intel Skylake CPUs. We study the three-dimensional Taylor–Green vortex problem as a standard benchmark to assess the accuracy and computational efficiency of incompressible turbulent flow solvers. Regarding discretization accuracy and from a physical point of view, the quantity of interest is the kinetic energy dissipation rate shown in Fig. 12 as a function of time 0 ≤ t ≤ T = 20 for increasing Reynolds numbers Re = 100, 200, 400, 800, 1600, 3000, 10, 000, . The first direct numerical simulation for the Re = 1600 case with a high-order DG scheme of the incompressible Navier–Stokes equations with a resolution of 10243 and polynomial degrees k = 3, 7 has been shown in . Here, we show results for effective resolutions up to 30723 (corresponding to 0.99 ⋅ 1011 DoFs) for the highest Reynolds number cases. Despite these fine resolutions, grid-converged results are achieved only up to Re = 3000. The inviscid problem (Re = ) is most challenging, and the results in Fig. 12 suggest that even finer resolutions are required for grid-convergence, a goal that might be achievable in the foreseeable future. The largest problem with 0.99 ⋅ 1011 DoFs involved 6.6 ⋅ 104 time steps and required 11.4 h of wall time on 152, 064 cores. In terms of degrees of freedom solved per time step per core, this results in a throughput of 1.05 MDoFs/s/core.

Figure 13 shows strong scaling results for the TGV problem at Re = 1600 for effective resolutions of 1283, 2563, 10243, 20483 and polynomial degree k = 3. We assess strong scalability in terms of absolute run times for the whole application (including mesh-generation, setup of data structures, solvers, preconditioners, and postprocessing) rather than normalized speed-up factors as the aim of strong scalability is not only reducing but also minimizing time-to-solution, i.e., demonstrating strong-scalability of a code with poor serial performance is meaningless. The results in Fig. 13 reveal that we are able to perform the TGV simulations in realtime (t wall ≤ T = 20s) for spatial resolutions up to 1283. These numbers can be considered outstanding and we are not aware of other high-order DG solvers achieving this performance, see also the discussions in [22, 24]. The minimum wall time in the strong-scaling limit increases on finer meshes due to more time steps (the time step size is restricted according to the CFL condition, Δt ∼ 1∕h, for the mixed explicit–implicit splitting solver used here). For this reason, we also show strong scalability in terms of the wall time per time step, to allow extrapolations of how many time steps can be solved within a given wall time limit which is the typical use case for large-eddy and direct numerical simulations of turbulent flows. In this metric, the curves level off around 0.02 − 0.03 s of wall time per time step, independently of the spatial resolution. The SuperMUC-NG machine with 3 ⋅ 105 cores is too small to show the strong scaling limit for the largest problem size with 20483 resolution considered here. A parallel efficiency of 80.6% is achieved with a speed-up factor of 79.8 when scaling from 3072 cores to 304, 128 cores.

## 7 hyper.deal: Extending the Matrix-Free Kernels to Higher Dimensions

The matrix-free kernels developed within the ExaDG project have been implemented in a recursive manner which enables compilation with arbitrary spatial dimension. In order to be compatible with the mesh infrastructure of deal.II which is restricted to dimensions up to 3, we have developed schemes working on a tensor product of two deal.II meshes. This allows extension to 2+2, 2+3, and 3+3 dimensions. The corresponding framework is currently under development as the deal.II-extension hyper.deal .

The major application that we have in mind are kinetic problems in phase space where we use the tensor product of a spatial and a velocity mesh. However, other applications might arise such as parameter-dependent flow problems. Table 6 gives an overview of computational times on a six-dimensional Vlasov–Poisson problem, which involves an advection in the 6D space of the particle density in x and v space and the solution of a 3D Poisson equation for finding the electric potential that in turn specifies the electric field that transports the density field (cf.  for the same application tackled with a semi-Lagrangian solver).

Figure 14 lists the throughput of the matrix-free evaluation of cell integrals for the multi-dimensional advection in three to six spatial dimensions for polynomial degrees k = 2, 3, 4, 5 for AVX2 and AVX-512 vectorization over elements, respectively, without any application-specific tuning at this stage. While throughput is very good in 3D and 4D as well as 5D up to k = 4, performance drops significantly in 6D because the local arrays in sum factorization exhaust caches, especially with AVX-512. Vectorization strategies within an element  are currently under development.