Abstract
This text presents contributions to efficient highorder finite element solvers in the context of the project ExaDG, part of the DFG priority program 1648 Software for Exascale Computing (SPPEXA). The main algorithmic components are the matrixfree evaluation of finite element and discontinuous Galerkin operators with sum factorization to reach a high nodelevel performance and parallel scalability, a massively parallel multigrid framework, and efficient multigrid smoothers. The algorithms have been applied in a computational fluid dynamics context. The software contributions of the project have led to a speedup by a factor 3 − 4 depending on the hardware. Our implementations are available via the deal.II finite element library.
Download conference paper PDF
1 Introduction
Exascale performance of numerical algorithms is determined by two factors, nodelevel performance and distributedmemory scalability to thousands of nodes over an Infinibandtype fabric. Additionally, the final application efficiency in terms of timetosolution is strongly influenced by the choice of numerical methods, where a high sequential efficiency is essential. The project ExaDG aims to bring together these three pillars to create an algorithmic framework for the next generation of solvers for partial differential equations (PDEs). The guiding principles of the project are as follows:
If we define the overall goal to be a minimum of computational cost to reach a predefined accuracy, this aim can be split into three components, namely the efficiency of the discretization in terms of the number of degrees of freedom (DoFs) and time steps, the efficiency of the solvers in terms of iteration counts, and the efficiency of the implementation [22]:
We define computational cost as the product of compute resources (cores, nodes) times the wall time resulting in the metric of CPUh, the typical currency of supercomputing facilities.
Regarding the first metric, the type of discretizations in space and time are often the first decision to be made. Numerical schemes that involve as few unknowns and as few time steps as possible to reach the desired accuracy will be more efficient. This goal can be reached by using higher order methods which have a higher resolution capability, especially for problems with a large range of scales and some regularity in the solution [19]. However several possibilities and profound knowledge regarding the performance capability of potential algorithms on modern hardware are still required to select those algorithms and implementations that are optimal with respect to the guiding metric of accuracy versus timetosolution. Highorder (dis)continuous finite element methods are the basic building block of the ExaDG project due to their generality and geometric flexibility.
Regarding the second metric, solvers are deemed efficient if they keep the number of iterations minimal. We emphasize that “iterations” are defined in a lowlevel way as the number of operator evaluations, which is also accurate when nesting several iterative schemes within each other. Note that we assume that largescale systems must be addressed by iterative solvers; in a finite element context sparse direct solvers are not scalable due to fillin and complex dependencies during factorizations. One class of efficient solvers of particular interest to ExaDG are multigrid methods with suitable smoothers, which have developed to be the gold standard of solvers for elliptic and parabolic differential equations over the last decades. Here, the concept of iterations would accumulate several matrixvector products within a multigrid cycle that in turn is applied in an outer Krylov subspace solver. Due to the grid transfer and the coarse grid solver, such methods are inherently challenging for highly parallel environments. As part of our efforts in ExaDG, we have developed an efficient yet flexible implementation in the deal.II finite element library [1, 15].
Third, the evaluation of discretized operators and smoothers remains the key component determining computational efficiency of a PDE solver. The relevant metric in this context is the throughput measured as the number of degrees of freedom (unknowns) processed per second (DoFs/s). An important contribution of our efforts is to both tune the implementation of a specific algorithm, but more importantly to also adapt algorithms towards a higher throughput. This means that an algorithm is preferred if it increases the DoFs/s metric, even if it leads to lower arithmetic performance in GFlop/s or lower memory throughput in GB/s. Operator evaluation in PDE solvers only involves communication with the nearest neighbors in terms of a domain decomposition of the mesh, which makes the nodelevel performance the primary concern in this regard. Since iterative solvers only require the action of the matrix on a vector (and a preconditioner), they are amenable to matrixfree evaluation where the final matrix entries are neither computed nor stored globally in memory in some generic sparse matrix format (e.g., compressed row storage). While matrixfree methods were historically often considered because they lower main memory requirements and allow to fit larger problems in memory [8], their popularity is currently increasing because they need to move less memory: Sparse matrixvector products are limited by the memory bandwidth on all major computing platforms, so a matrixfree alternative promises to deliver a (much) higher performance.
The outline of this article is as follows. We begin with an introduction of matrixfree algorithms and a presentation of nodelevel performance results in Sect. 2. In Sect. 3, we describe optimizations of the conjugate gradient method for efficient memory access and communication. Next, we detail our multigrid developments, focusing on performance numbers and the massively parallel setup in Sect. 4 and on the development of better smoothers in Sect. 5. Application results in the field of computational fluid dynamics are presented in Sect. 6, where the efficiency and parallel scalability of our discontinuous Galerkin incompressbile turbulent flow solver are shown. An extension of the kernels to up to 6D PDEs is briefly presented in Sect. 7. We conclude with an outlook in Sect. 8.
2 NodeLevel Performance Through MatrixFree Implementation
An intuitive example of a matrixfree implementation is a finite difference method implemented by its stencil rather than an assembled sparse matrix [33]. For finite element discretizations with sufficient structure of the underlying mesh and loworder shape functions, a small number of stencils allows to represent the operator of a largescale problem [8]. Such methods are used in the German exascale project TerraNeo, utilizing the regular data structures in hierarchical hybrid grids and embedded into a highly scalable multigrid solver for Stokes systems [31, 32]. By suitable interpolations, the stencils can be extended from the affine coarse grid assumption to also treat smoothly deformed geometries and variable coefficients [7].
For higherorder methods, finite element discretizations lead to fat stencils, making the direct evaluation inefficient even when done through stencils. An alternative matrixfree scheme used in ExaDG is to not compute the explicit DoF coupling and instead turn to integrals underlying the finite element scheme. As an example, we consider the constantcoefficient Laplacian
whose weak form in a finitedimensional setting is
where u _{h}(x) =∑_{j=1:n}φ _{j}(x)u _{j} is the finite element interpolant of the solution with n degrees of freedom, φ _{i} denotes the test functions with i = 1, …, n, f is some right hand side, and Ω_{h} is the finite element representation of the computational domain Ω. The lefthand side of this equation represents a finite element operator, mapping a vector of coefficients u = [u _{i}]_{i} to an output vector v = [v _{i}]_{i} by evaluating the weak form for all test functions φ _{i} separately. A matrixfree implementation is obtained by evaluating the elementwise integrals
by quadrature on n _{q} points per cell K. Here, K denotes the elements in the mesh, \(\hat {x}\) the coordinates of the reference element \( \hat {K} = (0,1)^d\), \(\mathcal J_{K}\) denotes the Jacobian of the mapping from the reference to the real cell, and w _{q} the quadrature weight. The operator I _{K} denotes the index mapping from n _{dof,ele} elementlocal to global unknowns and defines the elementrelated unknowns u ^{(K)} = I _{K}u.
On element K, the formulation of Eq. (4) consists of two nested sums over the elemental unknowns \(u_j^{(K)}\), j ∈ n _{dof,ele}, and the quadrature points q. The result is tested against all test functions \(\varphi _{i_K}\) on the reference element, which are related to the global test functions φ _{i} through I _{K}. Since the metric terms do not depend on the shape function indices i _{K} and j, and the sum over j does not depend on i _{K}, the summations in the equation can be broken up into (1) an dn _{q} × n _{dof,ele} matrix operation to evaluate the reference element derivative of u ^{(K)} at the quadrature points, (2) the application of metric terms as well as other physics terms at n _{q} quadrature points, and (3) an n _{dof,ele} × dn _{q} matrix operation to test by all n _{dof,ele} test functions and perform the summation over the quadrature points. The separation of pointwise physics evaluation at quadrature points is a common abstraction in integrationbased matrixfree methods [29, 41, 49, 50].
For highorder finite element methods, the naive evaluation would involve all shape functions at all quadrature points, which is of complexity \(\mathcal O(k^{2d})\) for polynomials of degree k in d dimensions per element, or \(\mathcal O(k^{d})\) per unknown, similarly to the fat stencil of the final matrix.
At this point, the structure in the referencecell shape function and quadrature points can be utilized to lower the computational complexity. If the multidimensional shape functions are the tensor product of 1D shape functions, and if the quadrature formula is a tensor product of 1D formulas, the socalled sumfactorization algorithm can be used to group common factors along the various dimensions and break down the work into onedimensional interpolations. Figure 1 visualizes the process of computing the interpolation of nodal values, visualized by black disks, to the values at the quadrature points. Rather than using a naive interpolation of cost 2(k + 1) ^{2d} operations, it can be done in 2d(k + 1)^{d+1} operations instead. In matrixvector notation, the interpolation of the gradient with respect to \(\hat {x}\), evaluated at quadrature points, can be written as
Here, S _{1}, …, S _{d} denote the \(n_{q}^{\mathrm {1D}}\times (k+1)\) interpolation matrices from the nodal values to the quadrature points, obtained by evaluating the 1D basis at all 1D quadrature points, and D _{1}, …, D _{d} the \(n_{q}^{\mathrm {1D}}\times n_{q}^{\mathrm {1D}}\) matrices of the derivatives of the Lagrangian basis in quadrature points. In this form, the multiplication by Kronecker matrices is implemented by small matrixmatrix multiplications.
Sum factorization was initially developed in the context of spectral element methods by Orszag [61], see [19] for an overview of the developments. In [12], sum factorization was compared against assembled matrices with the goal to find the best evaluation strategy among assembled matrices and matrixfree schemes. For hexahedral elements considered in this work, the memory consumption and arithmetic complexity indicate that this is the case already for quadratic basis functions [11, 49], with a growing gap for higher polynomial degrees.
2.1 Implementation of Sum Factorization in the deal.II Library
As part of the ExaDG project, we have developed efficient implementations in the deal.II finite element library [1, 4] with the following main features, see [50] for a detailed performance analysis:

support for both continuous [49] and discontinuous finite elements on uniform and adaptively refined meshes with hanging nodes and deformed elements,

support for arbitrary polynomial expansions on quadrilateral and hexahedral element shapes as well as tensor product quadrature rules,

minimization of arithmetic operations by using available symmetries, such as the evenodd decomposition [69] and a switch between the collocation derivative (5) for \(n_q^{\mathrm {1D}} \approx k+1\) quadrature points or an alternative variant based on derivatives of the original polynomials as used in [49] and discussed in [29],

flexible implementation of operations at quadrature points,

vectorization across several elements to optimally use SIMD units (AVX, AVX512, AltiVec) of modern processors,

applicability to modern multicore CPUs as well as GPUs [51, 57],

data access optimizations such as elementbased loops for DG elements [50, 56],

and MPI implementation with tight data exchange as well as MPIonly and sharedmemory models [43, 48, 54].
The concept of matrixfree evaluation with sum factorization has been widely adopted by now, like in the deal.II [1], DUNE [5, 40, 60], Firedrake [63], mfem [2], Nek5000 [28] or Nektar++ [13] projects. These fast evaluation techniques are directly applicable to explicit time stepping schemes, as we have demonstrated for wave propagation in [42, 53, 65,66,67,68] and the compressible Navier–Stokes equations [24]. The proposed developments make matrixfree evaluation of highorder DG operators reach a throughput in unknowns per second almost as high as for optimized 5wide finite difference stencils in a CFD context [75], despite delivering much higher accuracy.
2.2 Efficiency of MatrixFree Implementation
In Fig. 2, we give an overview of the achieved performance with our framework applied to the discontinuous Galerkin interior penalty (IPDG) discretization of the 3D Laplacian on an affine geometry. The most advanced implementation presented in [50] is used, namely a cellbased loop with a Hermitelike basis for minimal data access [56]. The figure lists the throughput, which is measured by recording the run time of the matrixvector product in an experiment with around 50 million DoFs (too large to fit into caches), and reporting the normalized quantity DoFs/s obtained by dividing the number of DoFs by the measured run time. The code is run on a single node of six dualsocket HPC systems from the last decade with a sharedmemory parallelization with OpenMP, threads pinned to logical cores with the close affinity rule, and using streaming stores to avoid the readforownership data transfer [33] on the result vector. As systems, we consider a 2 × 8 core AMD Opteron 6128 system from 2010, a 2 × 8 core Intel Xeon 2680 Sandy Bridge from 2012 (as used in the SuperMUC phase 1 installation in Garching, Germany), a 2 × 8 core Intel Xeon 2630 v3 (Haswell) representing a mediumcore count chip from 2014, a 2 × 14 core Intel Xeon 2697 v3 (Haswell) representing a highcore count chip of the same generation (as used in the SuperMUC phase 2 installation), a 2 × 20 core Intel Xeon 2698 v4 (Broadwell) system from 2016, and a 2 × 24 core Intel Xeon Platinum 8174 from 2017, labeled ‘Skylake’ in the remainder of this work, and installed in the SuperMUCNG supercomputer. The chips are operated at 2.0 GHz, 2.7 GHz, 2.4 GHz, 2.6 GHz, 2.2 GHz, and 2.3 GHz, respectively, and all run with fully populated memory interfaces. The Intel machines are run with 2way hyperthreading, e.g. with 96 threads for the Xeon Platinum Skylake.
The throughput results in Fig. 2 demonstrate the advancements of hardware during the last decade. In particular the increased width of vectorization, from 2 to 4 doubles with Sandy Bridge and from 4 to 8 doubles with Skylake, are clearly visible. Furthermore, the comparison between Sandy Bridge and the smaller Haswell system reveals the benefit of fused multiplyadd (FMA) instructions and higher L1 cache bandwidth of the latter: For low polynomial degrees with a modest number of FMA instructions, Sandy Bridge with its higher frequency can approximately deliver the same performance as Haswell. As the polynomial degree is increased, the arithmetic work is increasingly dominated by FMAs in the sum factorization sweeps similar to (5) as shown in [50], and Haswell pulls ahead. Finally, while we observe a throughput of up to 5.7 billion DoFs/s on Skylake (with up to 1.35 TFlop/s for k = 8), we observe a relatively strong decrease of performance for polynomial degrees k ≥ 13: This is because the vectorization across elements leads to an excessive size of the temporary data within sum factorization—here, a different vectorization strategy could lead to better results. However, we consider the polynomial degrees 3 ≤ k ≤ 8 most interesting for practical simulations, where almost constant throughput in terms of DoFs/s is reached. This somewhat surprising result, given the expected \(\mathcal O(1/k)\) complexity of throughput for sum factorization, is because face integrals and memory access with an \(\mathcal O(1)\) complexity are dominant. Compared to our initial implementation in 2015, which achieved a throughput of 0.32 billion DoFs/s on Sandy Bridge with degree k = 3, the progress in software technologies allowed us to reach 1.02 billion DoFs/s on the same system. For Intel Skylake, where memory access is more important, the software progress of our project is more than 4×.
Figure 3 shows the throughput normalized by the number of cores for polynomial degree k = 4 over the different hardware generations. For operator evaluation with discontinuous elements and face integrals, approximately 200 floating point operations per unknown are involved with our optimized implementations [50]. At the same time, we must access at least 16 byte (read one double, write one double) plus some neighbors that are not cached, so the arithmetic intensity is around 8–12 Flop/Byte, close to the machine balance of the Skylake Xeon. This means that both memory bandwidth and arithmetic performance are relevant for performance (on one Skylake node, we measured memory throughput of around 160 GB/s, compared to the STREAM limit of 205 GB/s). Likewise, continuous elements evaluated on an affine mesh have seen a considerable increase in throughput per core (arithmetic intensity of 7 Flop/Byte). However, the improvement has been much more modest for continuous elements evaluated on curved elements. In this setting, separate metric terms for all quadrature points and all elements are needed (as opposed to a single term per element in the affine mesh case), reducing the arithmetic intensity to around 1.2 Flop/Byte.^{Footnote 1}
Figure 3 also contains the evolution of performance of a sparse matrixvector product for trilinear continuous finite elements. The performance is much lower due to the aforementioned memory bandwidth limit, and has hardly improved per core on Skylake over the dated Opteron architecture. This illustrates the effect of the socalled memory wall. We emphasize that the sparse matrixvector product for k = 1 is more than three times slower than even the matrixfree evaluation for k = 4 on curved elements. Hence, highorder methods with matrixfree implementations are faster per unknown on newer hardware, in addition to their higher accuracy.
3 PerformanceOptimized Conjugate Gradient Methods
The developments of matrixfree implementations presented in the previous section result in a throughput for evaluation of the IPDG operator in Fig. 2 of up to 5.7 billion DoFs/s on Skylake. This is equivalent in time to the mere access of 4.5 doubles per DoF (either reading or writing). In other words, our developments have made the operator evaluation so fast that the matrixvector product may no longer be the dominant operation in algorithms like the conjugate gradient (CG) method preconditioned by the diagonal, or Chebyshev smoothers. These algorithms involve access to between 6 and 18 vectors for vector updates, the application of the diagonal preconditioner, and inner products. For optimal application performance it is therefore necessary to look into the access to vectors. As proposed in our work [51, 56, 65], merging the vector operations can improve throughput by up to a factor of two, and in particular for the DG case with cellbased loops which allow for a single pass through data [48, 56]. Fusion of different steps of a scheme has also been proposed for explicit time integrators in [14].
For the assessment of optimization opportunities on the algorithm level that goes beyond the matrixvector product, we consider a highorder finite element benchmark problem suggested by the US exascale initiative “Center for Efficient Exascale Discretization” (CEED). The benchmark involves a continuous finite element discretization of the Laplacian (3), using matrixfree operator evaluation within a conjugate gradient solver preconditioned by the matrix diagonal. In this study, we consider the case BP5 [29], see also https://ceed.exascaleproject.org/bps/, which integrates the weak form (4) of polynomial degree k using a Gauss–Lobatto quadrature formula with \(n_q^{\mathrm {1D}}=k+1\) quadrature points on a cube with deformed elements. While this integration is not exact, it is the typical spectral element setup with an identity interpolation matrix S _{i} = I in Eq. (5).
Figure 4 lists the contributors to run time for the plain conjugate gradient method preconditioned by the pointJacobi method as a function of the problem size for the polynomial degree k = 6. Here, the metric terms \(\mathcal J_K\) are computed on the fly from a trilinear representation of the geometry. Three different performance regimes can be distinguished in the graph: To the left, there is not enough parallelism given the domain decomposition on 48 MPI ranks and batches of 8 elements due to vectorization—indeed, at least 85,000 DoFs are needed to saturate all cores and SIMD lanes. Furthermore, the synchronization barriers due to the inner products in the conjugate gradient method also lead to a slowdown. As the problem size and parallelism increase, the run times decrease significantly and reach a minimum for a problem size around one million DoFs. Here, all data involved in the algorithm fits into the approximately 110 MB of L2+L3 cache on the processors. As the size is further increased, caches are exhausted and most data must be fetched from slow main memory. As a consequence, the run time of the solver increases significantly, and the vector updates, the diagonal preconditioner, and the inner products take a significant share. Note that all vector operations use the hardware optimally with a memory throughput of 205 GB/s.
In order to improve performance, we have therefore developed conjugate gradient implementations with merged vector operations by loop fusion. Figure 5 compares three variants of the conjugate gradient solver: the plain conjugate gradient method runs all vector operations through highlevel vector interfaces with separate loops for addition and inner products. In the “merged dot products”, we have merged the dot product p ^{T}Ap following the matrixvector product into the loop over the elements, and merged the vector updates to the residual and solution with the dot product for r ^{T}P ^{−1}r. Here, r denotes the residual vector, p the search direction of the conjugate gradient method, A the matrix operator (represented in a matrixfree way), and P ^{−1} the diagonal preconditioner. However, the improvements with this algorithm are relatively modest.
Much more performance can be gained by creating a conjugate gradient variant we call “fully merged”: Here, each CG iteration performs a single loop through all vector entries and ideally reads 5 vectors (solution, residual, search direction, temporary vector to hold the matrixvector product, and diagonal of preconditioner) and writes four (solution, residual, search direction, temporary vector). All vector updates of the previous CG iteration are scheduled before the matrixvector product and all inner products are scheduled after the matrixvector product. The vector operations are interleaved with the loop over elements, ensuring that dependencies due to the access pattern of the loop and the MPI communication are fulfilled (this leads to slightly more access in practice). This approach applies the preconditioner several times with partial sums to construct the inner products with a single MPI_Allreduce, trading some local computations for the decreased memory access. Of course, fusing the preconditioner into the loop assumes that it is both cheap to apply and does not involve longrange coupling between the DoFs. The results in Fig. 5 show that performance in the saturated limit, i.e., for large sizes beyond 10^{7} DoFs, is 2.5 times faster than with the plain CG iteration. Interestingly, this also improves performance for the sizes fitting into caches, which is due to less synchronization and reducing access to the slower L3 cache.
To put the performance of the fully merged case on Intel Skylake into perspective, we compare with executing the plain CG method on an Nvidia V100 GPU using the implementation from [51, 57]: even though the GPU runs with around 700 GB/s of memory throughput, the performance is higher on Intel Skylake with only 200 GB/s from RAM memory because the merged loops significantly increase data locality. Furthermore, on the GPU we do not compute the metric terms on the fly, but load a precomputed tensor \(\mathcal J_{K}^{1} \mathcal J_{K}^{\mathsf T} \mathrm {det}(\mathcal J_{K}) w_q\) which is faster due to reduced register pressure, see also the analysis for BP5 in [71]. We also note that the GPU results with our implementation are faster than an implementation with the OCCA library described in [29] with up to 0.6 billion DoFs/s on a V100 of the Summit supercomputer. The reason is that our implementation uses a continuous finite element storage that does not duplicate the unknowns at shared vertices, edges and faces, which reduces the memory access by about a factor of two. Furthermore, the results from [29] involve a separate gather/scatter step with additional memory transfer to enforce continuity, while this is part of the operator evaluation within a single loop in our code.
Figure 6 lists the achieved throughput with a fully merged conjugate gradient solver for polynomial degrees k = 2, …, 8, the most interesting regime for our solvers. We use a trilinear representation of geometry and compute the geometric factors on the fly. Throughput is somewhat lower for quadratic and cubic elements because the geometry data located in the vertices is still noticable.
The results in Fig. 3 motivate the analysis of the representation of the geometry in the matrixvector product, with results presented in Fig. 7. The figure lists both the throughput of the matrixvector product in the left panel and the throughput of the complete CG iteration with merged vector operations. Highest performance is obtained for the affine mesh case where our implementation can compress the memory access of the Jacobian. While this case is excluded from the CEED BP5 specification that requires a deformed geometry [29], it is an interesting baseline to compare against. Using separate tensors for each quadrature point, “variable tensor cached”, is equally fast as the affine case as long as data fits into caches. However, performance drops once the big geometric arrays must be fetched from main memory. For the case the geometry is computed on the fly from a trilinear representation of the mesh, i.e., the vertices, the matrixvector product is slower than the affine variant. For the conjugate gradient solver, however, we observe that the two reach essentially the same performance for five million and more DoFs, as they are both limited by the memory bandwidth from vector access. The “trilinear compute” case involves a higher Flop/s rate with almost 700 GFlop/s, as compared to the throughput of 330 GFlop/s for the affine mesh case. This means that the merged vector operations allow us to fit additional computations behind the unavoidable memory transfer without affecting application performance. Finally, an isoparametric representation of the geometry (labeled “isopara compute” in Fig. 7) can also be computed on the fly by sum factorization from a kth degree polynomial [50]. While this case is obviously slower than the precomputed variabletensor case from caches, it leverages higher performance when data must be fetched from main memory.
The trilinear and isoparametric cases are not equivalent, as only the latter represents higher order curved boundaries. Intermediate polynomial degrees for the geometry are conceivable, which would land between the two in terms of application throughput. To combine the higher performance of the former, we plan to investigate the tradeoffs in more detail in the future, e.g. by using a kdegree representation on a single layer of elements at the boundary and a triquadratic representation in the domain’s interior.
Finally, Fig. 8 shows the weak scaling of the BP5 benchmark problem up to the full size of the SuperMUCNG machine with 6336 nodes and 304,128 cores. The data is normalized by reporting the number of DoF per node, so ideal weak scaling would correspond to coinciding lines. While the saturated performance is scaling well, giving a sustained performance of up to 4.4 PFlop/s,^{Footnote 2} most of the incache performance advantage is lost due to the communication latency over MPI, see also [62] for limits with MPI in PDE solvers. Defining the strong scaling limit as the point where throughput reduces to 80% of saturated performance [29], it is reached for wall times of 56 μs on 1 node. On 512 nodes, the strong scaling limit is already around 180 μs, whereas it is 245 μs on the full SuperMUCNG machine. Note that even though most optimizations presented in this section have addressed the nodelevel performance, we have also considered the strong scaling in our work—indeed, the strong scaling on SuperMUCNG is excellent with a limit around 5 times lower than the BlueGeneQ results presented in [29].
4 Geometric Multigrid Methods in Distributed Environments
Multigrid methods are efficient solvers for the linear systems arising from the discretization of elliptic problems, see [30] for a recent efficiency evaluation and [35] for a projection of elliptic solver performance to the exascale setting. They apply simple iterative schemes called smoothers on a hierarchy of coarser problem representations. On each level of the hierarchy, the smoothers address the highfrequency content of the solution by smoothening the error. On a sufficiently coarse level with a small number of unknowns, a direct solver can be applied. The multigrid algorithm can be realized by a Vcycle as illustrated in Fig. 9 or some related cycle (Wcycle or Fcycle). In the matrixfree highorder finite element context, variants of the Chebyshev iteration around a simple additive scheme, such as pointJacobi or approximate blockJacobi with some rankd approximation of the cell matrix, are state of the art. The results in this section are based on this selection. Overlapping Schwarz schemes are a new development detailed in Sect. 5 below.
In terms of finding the coarser representations for the multigrid hierarchy, highorder finite element and discontinuous Galerkin methods permit a range of options. The hierarchy can both be constructed by coarser meshes (hmultigrid), by lowering the polynomial degree (pmultigrid), by a discontinuouscontinuous transfer as well as algebraically based on the matrix entries only (algebraic multigrid). The latter do not fit into a matrixfree context, since they explicitly rely on a sparse matrix and also often are not robust enough as the degree increases. As it has been shown by the work [70], scalability to the largest supercomputers is much more favorable if knowledge about coarsening by a mesh can be provided. In other words, geometric multigrid is to be preferred over algebraic multigrid in case there is such structure in the problem.
For these reasons, we have developed a comprehensive geometric multigrid framework with deal.II. In [27], a hybrid multigrid solver with all possibilities of h, p, and algebraic coarsening has been combined in a flexible framework, with the possibility to perform an additional ctransfer from discontinuous to continuous function spaces for the DG case. In terms of the hMG method on adaptive meshes, the deal.II library implements the local smoothing algorithm [9, 36, 37] where smoothing is done level by level. Our work [15] developed a communicationefficient coarsening strategy for this setup, at the cost of a load imbalance for smoothing on the multigrid levels with adaptively refined meshes. The tradeoffs in this choice and the associated costs have been quantified by a performance model in [15].
Figure 10 shows the results of two strong scaling experiments of the multigrid Vcycle with the hmultigrid infrastructure of the deal.II library. The uniform grid and a typical adaptively refined case are compared for the same problem size of 137 million and 46 billion DoFs, respectively, see [15] for details on the experiment. Differences in run time are primarily due to the load imbalance for the level operations. The results demonstrate optimal parallel scaling of both the uniform and adaptively refined cases down to around 10^{−2} s, with a slightly better strong scaling of the adaptive case due to the slower baseline. This performance barrier—typical for strong scaling of multigrid schemes in general—can be explained by the specific type of global communication in this algorithm: from the fine mesh level with many unknowns distributed among a large number of cores, we transfer residuals to coarser meshes with restriction operators until the coarse grid solver is either run on a single core or with few cores in a tightly coupled manner. Then, the coarsegrid corrections are broadcast during prolongation, involving all processors again. The communication pattern of a multigrid Vcycle thus relates to a treebased implementation of MPI_Allreduce, with the difference that the communication tree is induced by the grid and substantial operations, namely smoothing and level transfer, are intermixed with the communication. In this particular case, nine matrixvector products with nearestneighbor communication are performed per level (eight in the smoother and one for the residual before restriction). In addition, two vertical nearestneighbor exchange operations are done in restriction and prolongation. A typical matrixvector product with up to 26 neighbors takes around 10^{−4} s on the chosen Intel Sandy Bridge system when run on a few thousands of nodes [52]. When done on seven levels plus the coarse mesh for the uniformly refined 137 million DoFs case, the expected saturated limit of around 8 ms is exactly seen in the figure. On the newer SuperMUCNG machine, a latency barrier per Vcycle of around 2–4 ms per Vcycle has been measured, depending on the number of matrixvector products for the level smoothers. This limit is attractive compared to alternative solvers for elliptic problems such as the fast multipole method or the fast Fourier transform [30, 35].
Multigrid schemes are at the heart of incompressible flow solvers through the pressure Poisson equation, as detailed in Sect. 6 below. Applications of matrixfree geometric multigrid to continuum mechanics were presented in [18] and to electronic calculations with sparse multivectors in [16, 17].
As an example of the largescale suitability of the developed multigrid framework, Fig. 11 shows two scaling experiments on the SuperMUCNG supercomputer with up to 304,128 cores of Intel Skylake. Black dashed lines denote ideal strong scaling along a line and weak scaling with a factor of 8 between the lines. The computational domain is a cube meshed by hexahedral elements, using the affine mesh code path for matrixfree algorithms discussed in Sect. 2. A consistent Gaussian quadrature with \(n_q^{\mathrm {1D}}=k+1\) points is chosen. We run a conjugate gradient solver to a relative tolerance of 10^{−3} compared to the initial unpreconditioned residual. This setup is motivated by applications where a very good initial guess is already available, e.g. by extrapolation of solutions from the old time step [22, 45], and only a correction is needed. More accurate solves are obtained by tighter tolerances or by full multigrid setups [51]. The multigrid Vcycle is run in single precision to increase throughput, together with a double precision correction through the outer CG solver. This setup has been shown in [51] to increase throughput by around 1.8 × without affecting the multigrid convergence.
In the left panel of Fig. 11, we present results for a continuous Galerkin discretization with a polynomial degree k = 4. A pure geometric coarsening down to a single mesh element is used. A Chebyshev iteration of degree five based on the matrix diagonal, i.e., point Jacobi, is used on all levels for pre and postsmoothing. The maximal eigenvalue \(\tilde {\lambda }_{\max {}}\) is estimated by 15 iterations of a conjugate gradient solver and the Chebyshev parameters are set to smoothen in a range \([0.06\tilde {\lambda }_{\max {}},1.2\tilde {\lambda }_{\max {}}]\). As a coarse solver, we use a Chebyshev iteration with the degree set to reduce the residual by 10^{3} in terms of the Chebyshev apriori error estimate [74]. We observe ideal weak scaling and strong scaling to around 10^{−2} s. More importantly, the absolute run time is excellent: For instance, the 8.6 billion DoF case on 1536 cores is solved in 1.4 s, i.e., 4.0 million DoFs are solved per core per second.
The right panel of Fig. 11 shows the result for multigrid applied to an IPDG discretization with k = 5. Here, we use a transfer from the discontinuous space to the associated continuous finite element space with k = 5 on the finest mesh level (see [3] for the theoretical background and [27] for the multigrid context) and then progress by hcoarsening to a single element. On the DG level, we use a Chebyshev smoother around a blockJacobi method, with the blockJacobi problems inverted by the fast diagonalization method [58]. On all continuous finite element levels, a Chebyshev iteration around the pointJacobi method is used. The degree of the Chebyshev polynomial is six. This solver setup achieves a multigrid convergence rate of about 0.025, i.e., reduces the residual by 3 orders of magnitude with two Vcycles. If used in a full multigrid setting [51], a single Vcycle on the finest level would suffice to solve the problem to discretization accuracy. Merged vector operations with a Hermitelike basis for the Chebyshev iteration are used according to [56]. The final application performance of the largest computation on 1.9 trillion DoFs is 5.9 PFlop/s, with 5.6 PFlop/s done in single precision and 0.27 PFlop/s in double precision. The limiting factor is mostly memory transfer, however, with an application throughput of around 175 GB/s per node (the STREAM limit of one node is 205 GB/s).
5 Fast Tensor Product Schwarz Smoothers
In Sect. 4, we have discussed a scalable implementation of geometric multigrid methods, obtaining an efficient solver in the sense of cost per iteration. It employs the matrixfree operator implementation from Sect. 2 in order to reduce the computational cost for residuals and grid transfer. The missing building block for our cost model in Eq. (1) is an efficient implementation (in terms of computational cost per DoF) of an efficient smoother (in terms of number of multigrid iterations).
The main challenge consists of finding preconditioners whose cost is similar to operator evaluation. So far, we have discussed Chebyshev smoothers, which can be implemented matrixfree in a straightforward fashion. Alas, their performance is not robust for higher order elements. Likewise, from an arithmetic cost point of view sparse matrices can be competitive at most for moderate polynomial degrees k = 2, 3 [55] or when done via auxiliary spaces of linear elements on a subdivided grid using some matrixbased preconditioner. However, Fig. 3 shows that even sparse matrices for linear elements are up to 10 times slower than the matrixfree operator evaluation. It seems that only the two SPPEXA projects ExaDUNE and ExaDG have addressed this question in [6, 76]. While [6] focuses on iterative solution of cell problems for multigrid smoothing, we consider domain decomposition based smoothers in the form of multilevel additive and multiplicative Schwarz methods based on lowrank tensor approximations. They consist of a subdivision of the mesh on each level into small subdomains consisting either of a single cell, or of the patch of cells sharing a common vertex. On each of these subdomains, local finite element problems are solved. Comparing with operator application, these smoothers share the structural property of evaluation of local operators on mesh cells or on a patch of cells. They differ by the fact that the smoothers involve local inverses instead of local forward operators, and that these local inverses in general are not amenable to a tensor decomposition like sum factorization. There is one exception though, namely separable differential operators. In d dimensions these can be written in the form
where L _{k} are onedimensional differential operators and I _{k} are identity operators in directions k = 1, …, d. This representation transfers to finite element operators with tensor product shape functions in a straightforward way, reading I _{k} as onedimensional mass matrices M _{k}.
Due to [58], the inverse of L can be represented as the product
with the diagonal matrix Λ = I _{d} ⊗⋯ ⊗ I _{2} ⊗ Λ_{1} + ⋯ + Λ_{d} ⊗ I _{d−1} ⊗⋯ ⊗ I _{1}, where I _{k} denote identity matrices, and a rank1 decomposition Q = Q _{d} ⊗⋯ ⊗ Q _{1}. The tensor factors are obtained by solving d generalized eigenvalue problems
Thus, the computational effort for computing the inverse has been reduced from \(\mathcal O(k^{3d})\) to \(\mathcal O(dk^3)\) and for the application of local solvers from \(\mathcal O(k^{2d})\) to \(\mathcal O(dk^{d+1})\) by exploiting sum factorization. Based on this technique, we have implemented a geometric multigrid method in [76] based on earlier work in [37,38,39].
5.1 The Laplacian on Cartesian Meshes
In order to test our concept and to obtain a performance baseline for more complicated cases, we first attend to the case where the decomposition described above can be applied in a straightforward way, namely the additive Schwarz method with subdomains equal to mesh cells. As Table 1 shows, it yields an efficient preconditioner with less than 25 conjugate gradient steps for a gain of accuracy of 10^{8}. While it is uniform in the mesh level, it is not uniform in the polynomial degree due to the increasing penalty parameter of the interior penalty method. The computational effort for a smoothing step based on local solvers in the form (7) is below the effort for a matrixfree operator application for polynomial degrees between 3 and 15 in three dimensions because it only involves operations on cells. The setup time for computing Q and Λ is even less. Thus, in the context of the performance analysis of the conjugate gradient method in Sect. 2, it barely adds to the cost per iteration step, but reduces the number of matrixvector products when comparing to the accumulated numbers within a Chebyshev/point Jacobi method, and almost independently of polynomial degree.
In view of application to incompressible flow, we also study vertex patches as typical subdomains for smoothing. First, we observe that a regular vertex patch with 2^{d} cells attached to a vertex inherits the lowrank tensor product structure from its cells, possibly after renumbering due to changes in orientation. Thus, we can apply the same method as on a single cell, resulting effectively in a factor 2^{d} in the complexity estimates above. Patches around vertices with irregular topology like 3 or 5 cells in two dimensions do not possess a tensor product structure. Fortunately, on meshes obtained by refinement of a coarse mesh, they are all determined by irregularities of the coarse mesh and thus small in number.
Vertex patches lead to overlapping decompositions with overlap of at least 4 and 8 in two and three dimensions, respectively. From the analysis of Schwarz methods, it becomes clear that a multiplicative method is required for highest multigrid convergence rates. In order to parallelize such a smoother and to avoid race conditions, mesh cells are colorized, that is, they are separated into “colors” such that patches of the same “color” do not share any common face or cell. As a consequence, the multiplicative method coincides with an additive method within each color, such that we can execute the local solvers in parallel within each color, and the colors sequentially. Typical convergence results for the Laplacian are reported in Table 2, suggesting that this scheme is almost a direct solver.
The vertex patch has 4 and 8 times as many unknowns as a single mesh cell in two and three dimensions, respectively. Thus, the effort for a smoothing step with 16 colors and the optimizations described above turns out to be about 20 to 24 times the effort of a matrixfree operator application, measured over polynomial degrees from 3 to 15. This seems excessive at first glance, but it must be kept in mind that the Chebyshev smoother of degree 6 used in Fig. 11 also involves 12 matrixvector products for pre and postsmooting. Futhermore, the current scheme comes with a reduction of the number of steps by a factor 10 compared to the additive cell smoother for the Laplacian, which makes it almost competitive [76]. Finally, the iteration counts are independent of the polynomial degree, making the scheme attractive for higher degrees. Moreover, we point out that this smoother also allows for the solution of a Stokes problem in four iteration steps [39].
5.2 General Geometry
As soon as the mesh cells are not Cartesian anymore, the special structure of separable operators in (6) is lost and the inverse cannot be computed according to (7). In this case, we have two options: solving the local problems iteratively, as in [6], or approximately. A possible approximation which recovers the situation of the previous subsection consists of replacing a nonCartesian mesh cell by an approximating (hyper)rectangle, then inverting the separable differential operator on the rectangle (omitting the prefix hyper from here on).
Such a surrogate rectangle can be obtained from the following procedure: first, we compute the arc length of all edges. From these, we obtain the length of the rectangle in each of its natural directions by averaging over all parallel edges (in a topological sense). Thus, the geometry of the rectangle is determined up to its position and orientation in space. Given the fact that the Laplacian is invariant under translation and rotation, these do not matter and we can choose a rectangle centered at the origin with edges parallel to the coordinate directions. Different differential operators may require different approximations here.
The convergence theory of Schwarz methods allows for inexact local solvers as long as they are spectrally equivalent. Naturally, the deviation from exactness enters into the convergence speed of the method. Additionally, inexact local solvers can amplify the solution, such that a smaller relaxation parameter may be necessary. This is exhibited in Table 3, where we compare the efficiency of multigrid with exact local solvers and the method with surrogate rectangles as described above. We see that a reduction of the relaxation parameter \(\widehat {\omega }=0.7\) for exact local solvers to \(\widehat {\omega }=0.49\) is necessary for robust convergence. We point out though, that while the inexact methods need more iteration steps, they are much faster than exact inverses, since they use the Kronecker representation (7) of the approximate inverse. For instance, the setup cost is 3000 times higher, with a growing gap for higher polynomial degrees.
5.3 Linear Elasticitiy
In order to provide an outlook on how to apply this concept to more general problems, we consider linear elasticity, namely the LaméNavier equations, with the bilinear form
Here, \(\upvarepsilon (\mathbf {u}) = \frac 12(\nabla \mathbf {u} + \nabla \mathbf {u} ^{\mathsf T})\) is the strain tensor of the displacement field u and \(\bigl (\cdot ,\cdot \bigr )\) denote the appropriate DG discretization with interior penalty terms.
Consider a Cartesian vertex patch, that is, a patch with all faces aligned with the coordinate planes and with tensor product shape functions on each cell. As before, let M _{k} be the onedimensional mass matrix in direction k and L _{k} the matrix representing the Laplacian including all face terms introduced by the interior penalty formulation. Furthermore, let G _{k} be the matrix associated to the first derivative, again including the DG interface terms which arise in products of the form \(G_k^{\mathsf T}\otimes G_l\). With these notions and the threedimensional Laplacian
we can write the bilinear form a(., .) on the patch in matrix form
Clearly, this matrix lacks the simple structure of Kronecker products we employed in the previous subsections. Nevertheless, we have Korn’s inequality [10], and thus the block diagonal of the left matrix is spectrally equivalent to the matrix itself. Consequently, we expect that
which has the desired Kronecker product structure, is a good local solver. Indeed, Table 4 confirms this expectation. Iteration counts remain almost constant over a wide range of mesh levels and polynomial degrees. Comparing to Table 2, we lose less than a factor two, typically requiring 4 steps instead of 3.
While Korn’s inequality helped us with the left matrix in (11), the matrix corresponding to the “graddiv” term in the Lamé–Navier equations has a nontrivial kernel and thus its inverse cannot be approximated by a block diagonal. We confirm this in Table 5. After augmenting \(\tilde A_p\) by the diagonal terms of the graddiv matrix, we vary μ and λ. As expected, iteration counts increase when λ ≫ μ to the point, where the method becomes infeasible.
The case λ ≫ μ corresponds to an almost incompressible material. Thus, this behavior has to be addressed from two sides. First, the discretization must be suitable [34]. Then, the local solvers must be able to reduce the divergence sufficiently. Here, we have to find ways to implement a smoother like in [39] in an efficient way. Its structure prevents us from utilizing the tensor product techniques, namely the fast diagonalization method, used so far.
As an outlook, we describe a solution approach for two dimensions, which has been developed in a recent bachelor’s thesis [64]. The diagonal blocks of the matrix A _{p} are
Both A _{1} and A _{2} admit a fast diagonalization, for instance
Given the offdiagonal block \(B=\mu G_2^{\mathsf T}\otimes G_1+\lambda G_2\otimes G_1^{\mathsf T}\), the Schur complement of A _{p} is
While this is not a sum of Kronecker products, Kronecker singular value decomposition (KSVD), see [72, 73], can be utilized to construct an approximation of the Schur complement which is fast diagonalizable. We proceed as follows:

A.1
compute the fast diagonalizations of A _{1} and A _{2}

A.2
compute the rankρ _{Λ} KSVD of the inverse diagonal matrix in Eq. 14
$$\displaystyle \begin{aligned} \bigl( I \otimes \Lambda^{(1)} + \Lambda^{(2)} \otimes I \bigr)^{1} \approx \sum_{i=1}^{\rho_\Lambda} C_i \otimes D_i \end{aligned} $$(16) 
A.3
compute the rank2 KSVD
$$\displaystyle \begin{aligned} \widehat{S} := E_1 \otimes F_1 + E_2 \otimes F_2 \approx \widetilde{S} \end{aligned} $$(17)of the approximate Schur complement
$$\displaystyle \begin{aligned} \widetilde{S} := A_1  B^{\mathsf T} \left[\sum_{i=1}^{\rho_\Lambda} Q_2 C_i^{1} Q_2^{\mathsf T} \otimes Q_1 D_i^{1} Q_1^{\mathsf T}\right] B \end{aligned} $$(18) 
A.4
compute the fast diagonalization of \({\widehat {S}}\).
Then, Gaussian block elimination provides an approximate inverse
Implementation and evaluation of these smoothers are still work in progress, but the thesis [64] suggests fast and robust convergence at least in a finite difference context.
The takehome message from this section is that an efficient approximate solution of the local problems in Schwarz smoothers is possible using lowrank tensor representations and can be achieved with effort similar to a matrixfree operator application in the best case. Finding such lowrank representations is nevertheless highly dependent on the differential equation and geometry. Further investigation will be directed in particular at dealing with the graddiv operator.
6 Highperformance Simulations of Incompressible Flows
Computational fluid dynamics (CFD) simulations of turbulent flows at large Reynolds number, e.g., Re > 10^{6}, are among those problems that typically require a huge amount of computational resources in order to resolve the turbulent flow structures in space and time, and have been addressed as an application by the ExaDG project. The underlying model problem is given by the incompressible Navier–Stokes equations
Scaleresolving simulations for engineering applications typically involve beyond \(\mathcal {O}(10^{10}10^{11})\) unknowns (DoFs) and \(\mathcal {O}(10^510^7)\) time steps. Highperformance implementations for this type of problem are therefore of paramount importance for the CFD community. It is important to stress that implementing a given algorithm optimally for a given hardware, i.e., an implementation that performs close to the hardware limits, is only one step to achieve the goal of providing efficient flow solvers for engineering problems as emphasized in the introduction. While the previous sections discussed the second and third term in Eq. (1), namely the performance of matrixfree evaluation routines and fast multigrid solvers for highorder discretizations, we now also include discretization aspects into the discussion. The implementation makes use of the fast matrixfree evaluation routines and multigrid solvers discussed in previous sections.
We use a method of lines approach with highorder DG discretizations in space and splitting methods with BDF time integration. Splitting methods separate the solution of the incompressible Navier–Stokes equations into subproblems such as a Poisson equation for the pressure and a (convection–)diffusion equation for the velocity and are among the most efficient solvers currently known. In a first contribution [45], we highlighted that previous discretization methods lack robustness, on the one hand in the limit of small time step sizes, and on the other hand in underresolved scenarios where the spatial discretization only resolves the largest scales of the flow. The stability problem for small time step sizes has been addressed in detail in [21] where we found that a proper DG discretization of velocitypressure coupling terms is essential to achieve robustness at small time steps. In [23], we presented the first highorder DG incompressible flow solver that is robust in the underresolved regime and that relies completely on efficient matrixfree evaluation routines. The developed discretization approach is attractive as it provides a generic solver for turbulent flow simulations that is robust and accurate without the use of explicit turbulence models. Such a technique is known as implicit largeeddy simulation in the literature and has the advantage that it does not require turbulence model parameters. While this property of highorder DG discretizations is already known from discontinuous Galerkin discretizations of the compressible Navier–Stokes equations, the work [23] has been the first demonstrating this property for DG discretizations of the incompressible Navier–Stokes equations. The key ingredient for a robust highorder, L _{2}conforming DG discretization for incompressible flows turns out to be the use of consistent stabilization terms that enforce the divergencefree constraint and interelement mass conservation in a weak sense. These requirements can also be included into the finite element function spaces by using socalled H(div)conforming (normalcontinuous) discretizations that are exactly (pointwise) divergencefree by using Raviart–Thomas elements. As investigated in detail in [26], such an approach has indeed very similar discretization properties when compared with the stabilized L _{2}conforming approach in practically relevant, underresolved application scenarios. The model has been extended to moving meshes in [20].
A detailed performance analysis has been undertaken in [22] where we discuss the incompressible flow solver w.r.t. its efficiency according to Eq. (1). Based on this efficiency model, we have then compared matrixfree solvers based on incompressible and compressible Navier–Stokes formulations in [24] for underresolved turbulent incompressible flows. The compressible solver uses explicit time integration and therefore only requires one operator evaluation in every Runge–Kutta stage as opposed to the incompressible solver involving the solution of linear system of equations such as a pressure Poisson equation within every time step. Simple explicit solvers are often considered efficient due to better parallel scalability since implicit Krylov solvers with multigrid preconditioning involve global communication. However, our work shows a significant performance advantage of the incompressible formulation over the compressible one on the nodelevel for sufficient workload. Albeit speedup factors are higher, it is difficult to achieve a performance advantage for the algorithmically simple, explicitintime compressible solver in the strongscaling limit in terms of absolute run time. In our experience, the potential to outperform an implicit solver at some point in the strongscaling limit has not materialized. We see it as a future challenge to devise optimal PDE solvers providing good performance over a wide range of problems and hardware platforms due to this high degree of interdisciplinarity.
We have applied this solver framework to conduct direct numerical simulations of turbulent channel flow in [45], the first direct numerical simulation of the turbulent flow over a periodic hill at Re ≈ 10^{4} in [46], and to largeeddy simulation of the FDA benchmark nozzle problem in [25]. Furthermore, we have developed multiscale wall modeling approaches that allow to use the proposed highly efficient schemes also for industrial cases with even higher Reynolds numbers than what is feasible for wallresolved large eddy simulation [47].
Here, we show performance results obtained on SuperMUCNG with Intel Skylake CPUs. We study the threedimensional Taylor–Green vortex problem as a standard benchmark to assess the accuracy and computational efficiency of incompressible turbulent flow solvers. Regarding discretization accuracy and from a physical point of view, the quantity of interest is the kinetic energy dissipation rate shown in Fig. 12 as a function of time 0 ≤ t ≤ T = 20 for increasing Reynolds numbers Re = 100, 200, 400, 800, 1600, 3000, 10, 000, ∞. The first direct numerical simulation for the Re = 1600 case with a highorder DG scheme of the incompressible Navier–Stokes equations with a resolution of 1024^{3} and polynomial degrees k = 3, 7 has been shown in [22]. Here, we show results for effective resolutions up to 3072^{3} (corresponding to 0.99 ⋅ 10^{11} DoFs) for the highest Reynolds number cases. Despite these fine resolutions, gridconverged results are achieved only up to Re = 3000. The inviscid problem (Re = ∞) is most challenging, and the results in Fig. 12 suggest that even finer resolutions are required for gridconvergence, a goal that might be achievable in the foreseeable future. The largest problem with 0.99 ⋅ 10^{11} DoFs involved 6.6 ⋅ 10^{4} time steps and required 11.4 h of wall time on 152, 064 cores. In terms of degrees of freedom solved per time step per core, this results in a throughput of 1.05 MDoFs/s/core.
Figure 13 shows strong scaling results for the TGV problem at Re = 1600 for effective resolutions of 128^{3}, 256^{3}, 1024^{3}, 2048^{3} and polynomial degree k = 3. We assess strong scalability in terms of absolute run times for the whole application (including meshgeneration, setup of data structures, solvers, preconditioners, and postprocessing) rather than normalized speedup factors as the aim of strong scalability is not only reducing but also minimizing timetosolution, i.e., demonstrating strongscalability of a code with poor serial performance is meaningless. The results in Fig. 13 reveal that we are able to perform the TGV simulations in realtime (t _{wall} ≤ T = 20s) for spatial resolutions up to 128^{3}. These numbers can be considered outstanding and we are not aware of other highorder DG solvers achieving this performance, see also the discussions in [22, 24]. The minimum wall time in the strongscaling limit increases on finer meshes due to more time steps (the time step size is restricted according to the CFL condition, Δt ∼ 1∕h, for the mixed explicit–implicit splitting solver used here). For this reason, we also show strong scalability in terms of the wall time per time step, to allow extrapolations of how many time steps can be solved within a given wall time limit which is the typical use case for largeeddy and direct numerical simulations of turbulent flows. In this metric, the curves level off around 0.02 − 0.03 s of wall time per time step, independently of the spatial resolution. The SuperMUCNG machine with 3 ⋅ 10^{5} cores is too small to show the strong scaling limit for the largest problem size with 2048^{3} resolution considered here. A parallel efficiency of 80.6% is achieved with a speedup factor of 79.8 when scaling from 3072 cores to 304, 128 cores.
7 hyper.deal: Extending the MatrixFree Kernels to Higher Dimensions
The matrixfree kernels developed within the ExaDG project have been implemented in a recursive manner which enables compilation with arbitrary spatial dimension. In order to be compatible with the mesh infrastructure of deal.II which is restricted to dimensions up to 3, we have developed schemes working on a tensor product of two deal.II meshes. This allows extension to 2+2, 2+3, and 3+3 dimensions. The corresponding framework is currently under development as the deal.IIextension hyper.deal [59].
The major application that we have in mind are kinetic problems in phase space where we use the tensor product of a spatial and a velocity mesh. However, other applications might arise such as parameterdependent flow problems. Table 6 gives an overview of computational times on a sixdimensional Vlasov–Poisson problem, which involves an advection in the 6D space of the particle density in x and v space and the solution of a 3D Poisson equation for finding the electric potential that in turn specifies the electric field that transports the density field (cf. [44] for the same application tackled with a semiLagrangian solver).
Figure 14 lists the throughput of the matrixfree evaluation of cell integrals for the multidimensional advection in three to six spatial dimensions for polynomial degrees k = 2, 3, 4, 5 for AVX2 and AVX512 vectorization over elements, respectively, without any applicationspecific tuning at this stage. While throughput is very good in 3D and 4D as well as 5D up to k = 4, performance drops significantly in 6D because the local arrays in sum factorization exhaust caches, especially with AVX512. Vectorization strategies within an element [50] are currently under development.
8 Outlook
Our work in the ExaDG project presented in this text has resulted in a highly competitive finite element framework. We have demonstrated excellent performance both for the pure operator evaluation, demonstrated e.g. by the CEED benchmark problems, as well as on an application level in computational fluid dynamics. We plan to engage in benchmarking also in the future to establish bestpractices for the highorder finite element community. Furthermore, the evolving hardware landscape requires a continued effort, with increasing pressure to additional performance improvements on throughput architectures such as GPUs and FPGAs. In addition, we plan to extend our hybrid hpmultigrid framework to also handle hpadaptive meshes. Finally, while the results from the Schwarzbased multigrid smoothers are very promising from a mathematical point of view, further steps are necessary to make them perform optimally on massively parallel hardware, and it is not yet clear how an optimal implementation compares in timetosolution against the simpler Chebyshevbased ingredients we have considered on the large scale so far.
Notes
 1.
 2.
The LINPACK performance of SuperMUCNG according to the top500 list is 19.4 PFlop/s. Considering that we use an iterative solver for PDE with optimization of throughput, this is an extremely good value.
References
Alzetta, G., Arndt, D., Bangerth, W., Boddu, V., Brands, B., Davydov, D., Gassmoeller, R., Heister, T., Heltai, L., Kormann, K., Kronbichler, M., Maier, M., Pelteret, J.P., Turcksin, B., Wells, D.: The deal.II library, version 9.0. J. Numer. Math. 26(4), 173–184 (2018). https://doi.org/10.1515/jnma20180054
Anderson, R., Barker, A., Bramwell, J., Cerveny, J., Dahm, J., Dobrev, V., Dudouit, Y., Fisher, A., Kolev, T., Stowell, M., Tomov, V.: MFEM: modular finite element methods (2019). mfem.org
Antonietti, P.F., Sarti, M., Verani, M., Zikatanov, L.T.: A uniform additive Schwarz preconditioner for highorder discontinuous Galerkin approximations of elliptic problems. J. Sci. Comput. 70(2), 608–630 (2017). https://doi.org/10.1007/s1091501602599
Arndt, D., Bangerth, W., Davydov, D., Heister, T., Heltai, L., Kronbichler, M., Maier, M., Pelteret, J.P., Turcksin, B., Wells, D.: The deal.II finite element library: Design, features, and insights. Comput. Math. Appl. (2020). https://doi.org/10.1016/j.camwa.2020.02.022
Bastian, P., Engwer, C., Fahlke, J., Geveler, M., Göddeke, D., Iliev, O., Ippisch, O., Milk, R., Mohring, J., Müthing, S., Ohlberger, M., Ribbrock, D., Turek, S.: Hardwarebased efficiency advances in the EXADUNE project. In: Bungartz, H.J., Neumann, P., Nagel, W.E. (eds.) Software for Exascale computing—SPPEXA 20132015, pp. 3–23. Springer, Cham (2016)
Bastian, P., Müller, E.H., Müthing, S., Piatkowski, M.: Matrixfree multigrid blockpreconditioners for higher order discontinuous Galerkin discretisations. J. Comput. Phys. 394, 417–439 (2019). https://doi.org/10.1016/j.jcp.2019.06.001
Bauer, S., Drzisga, D., Mohr, M., Rüde, U., Waluga, C., Wohlmuth, B.: A stencil scaling approach for accelerating matrixfree finite element implementations. SIAM J. Sci. Comput. 40(6), C748–C778 (2018)
Bergen, B., Hülsemann, F., Rüde, U.: Is 1.7 × 10^{10} unknowns the largest finite element system that can be solved today? In: Proceeding of ACM/IEEE Conference Supercomputing (SC’05), pp. 5:1–5:14 (2005). https://doi.org/10.1109/SC.2005.38
Brandt, A.: Multilevel adaptive solutions to boundaryvalue problems. Math. Comput. 31, 333–390 (1977). https://doi.org/10.1090/S0025571819770431719X
Brenner, S.C.: Korn’s inequalities for piecewise H ^{1} vector fields. Math. Comput. 73(247), 1067–1087 (2004)
Brown, J.: Efficient nonlinear solvers for nodal highorder finite elements in 3D. J. Sci. Comput. 45(1–3), 48–63 (2010)
Cantwell, C.D., Sherwin, S.J., Kirby, R.M., Kelly, P.H.J.: Form h to p efficiently: Selecting the optimal spectral/hp discretisation in three dimensions. Math. Model. Nat. Phenom. 6, 84–96 (2011)
Cantwell, C.D., Moxey, D., Comerford, A., Bolis, A., Rocco, G., Mengaldo, G., De Grazia, D., Yakovlev, S., Lombard, J.E., Ekelschot, D., Jordi, B., Xu, H., Mohamied, Y., Eskilsson, C., Nelson, B., Vos, P., Biotto, C., Kirby, R.M., Sherwin, S.J.: Nektar++: An opensource spectral/hp element framework. Comput. Phys. Comm. 192, 205–219 (2015). https://doi.org/10.1016/j.cpc.2015.02.008
Charrier, D.E., Hazelwood, B., Tutlyaeva, E., Bader, M., Dumbser, M., Kudryavtsev, A., Moskovsky, A., Weinzierl, T.: Studies on the energy and deep memory behaviour of a cacheoblivious, taskbased hyperbolic PDE solver. Int. J. High Perf. Comput. Appl. 33(5), 973–986 (2019). https://doi.org/10.1177/1094342019842645
Clevenger, T.C., Heister, T., Kanschat, G., Kronbichler, M.: A flexible, parallel, adaptive geometric multigrid method for FEM. Technical report, arXiv:1904.03317 (2019)
Davydov, D., Kronbichler, M.: Algorithms and data structures for matrixfree finite element operators with MPIparallel sparse multivectors. ACM Trans. Parallel Comput. (2020). https://doi.org/10.1145/3399736
Davydov, D., Heister, T., Kronbichler, M., Steinmann, P.: Matrixfree locally adaptive finite element solution of densityfunctional theory with nonorthogonal orbitals and multigrid preconditioning. Phys. Status Solidi B: Basic Solid State Phys. 255(9), 1800069 (2018). https://doi.org/10.1002/pssb.201800069
Davydov, D., Pelteret, J.P., Arndt, D., Kronbichler, M., Steinmann, P.: A matrixfree approach for finitestrain hyperelastic problems using geometric multigrid. Int. J. Numer. Meth. Eng. (2020). https://doi.org/10.1002/nme.6336
Deville, M.O., Fischer, P.F., Mund, E.H.: Highorder Methods for Incompressible Fluid Flow, vol. 9. Cambridge University, Cambridge (2002)
Fehn, N., Heinz, J., Wall, W.A., Kronbichler, M.: Highorder arbitrary LagrangianEulerian discontinuous Galerkin methods for the incompressible NavierStokes equations. Technical report, arXiv:2003.07166 (2020).
Fehn, N., Wall, W.A., Kronbichler, M.: On the stability of projection methods for the incompressible Navier–Stokes equations based on highorder discontinuous Galerkin discretizations. J. Comput. Phys. 351, 392–421 (2017). https://doi.org/10.1016/j.jcp.2017.09.031
Fehn, N., Wall, W.A., Kronbichler, M.: Efficiency of highperformance discontinuous Galerkin spectral element methods for underresolved turbulent incompressible flows. Int. J. Numer. Meth. Fluids 88(1), 32–54 (2018). https://doi.org/10.1002/fld.4511
Fehn, N., Wall, W.A., Kronbichler, M.: Robust and efficient discontinuous Galerkin methods for underresolved turbulent incompressible flows. J. Comput. Phys. 372, 667–693 (2018). https://doi.org/10.1016/j.jcp.2018.06.037
Fehn, N., Wall, W.A., Kronbichler, M.: A matrixfree highorder discontinuous Galerkin compressible Navier–Stokes solver: a performance comparison of compressible and incompressible formulations for turbulent incompressible flows. Int. J. Numer. Meth. Fluids 89(3), 71–102 (2019). https://doi.org/10.1002/fld.4683
Fehn, N., Wall, W.A., Kronbichler, M.: Modern discontinuous Galerkin methods for the simulation of transitional and turbulent flows in biomedical engineering: a comprehensive LES study of the FDA benchmark nozzle model. Int. J. Numer. Meth. Biomed. Eng. 35(12), e3228 (2019). https://doi.org/10.1002/cnm.3228
Fehn, N., Kronbichler, M., Lehrenfeld, C., Lube, G., Schroeder, P.W.: Highorder DG solvers for underresolved turbulent incompressible flows: a comparison of L ^{2} and H(div) methods. Int. J. Numer. Meth. Fluids 91(11), 533–556 (2019). https://doi.org/10.1002/fld.4763
Fehn, N., Munch, P., Wall, W.A., Kronbichler, M.: Hybrid multigrid methods for highorder discontinuous Galerkin discretizations. J. Comput. Phys. (2020). https://doi.org/10.1016/j.jcp.2020.109538
Fischer, P., Kerkemeier, S., Peplinski, A., Shaver, D., Tomboulides, A., Min, M., Obabko, A., Merzari, E.: Nek5000 Web page (2019). https://nek5000.mcs.anl.gov
Fischer, P., Min, M., Rathnayake, T., Dutta, S., Kolev, T., Dobrev, V., Camier, J.S., Kronbichler, M., Warburton, T., Świrydowicz, K., Brown, J.: Scalability of highperformance PDE solvers. Int. J. High Perf. Comput. Appl. (2020). https://doi.org/10.1177/1094342020915762
Gholami, A., Malhotra, D., Sundar, H., Biros, G.: FFT, FMM, or multigrid? A comparative study of stateoftheart Poisson solvers for uniform and nonuniform grids in the unit cube. SIAM J. Sci. Comput. 38(3), C280–C306 (2016). https://doi.org/10.1137/15M1010798
Gmeiner, B., Rüde, U., Stengel, H., Waluga, C., Wohlmuth, B.: Towards textbook efficiency for parallel multigrid. Numer. Math.Theory Me. Appl. 8(1), 22–46 (2015)
Gmeiner, B., Huber, M., John, L., Rüde, U., Wohlmuth, B.: A quantitative performance study for Stokes solvers at the extreme scale. J. Comput. Sci. 17, 509–521 (2016). https://doi.org/10.1016/j.jocs.2016.06.006. http://www.sciencedirect.com/science/article/pii/S1877750316301077. Recent Advances in Parallel Techniques for Scientific Computing
Hager, G., Wellein, G.: Introduction to High Performance Computing for Scientists and Engineers. CRC Press, Boca Raton (2011)
Hansbo, P., Larson, M.G.: Discontinuous Galerkin methods for incompressible and nearly incompressible elasticity by Nitsche’s method. Comput. Methods Appl. Mech. Eng. 191, 1895–1908 (2002)
Ibeid, H., Olson, L., Gropp, W.: FFT, FMM, and multigrid on the road to exascale: performance challenges and opportunities. J. Parallel Distrib. Comput. 136, 63–74 (2020). https://doi.org/10.1016/j.jpdc.2019.09.014
Janssen, B., Kanschat, G.: Adaptive multilevel methods with local smoothing for H ^{1} and H ^{curl}conforming high order finite element methods. SIAM J. Sci. Comput. 33(4), 2095–2114 (2011). https://doi.org/10.1137/090778523
Kanschat, G.: Multilevel methods for discontinuous Galerkin FEM on locally refined meshes. Comput. Struct. 82(28), 2437–2445 (2004). https://doi.org/10.1016/j.compstruc.2004.04.015
Kanschat, G.: Robust smoothers for high order discontinuous Galerkin discretizations of advectiondiffusion problems. J. Comput. Appl. Math. 218, 53–60 (2008). https://doi.org/10.1016/j.cam.2007.04.032
Kanschat, G., Mao, Y.: Multigrid methods for H ^{div}conforming discontinuous Galerkin methods for the Stokes equations. J. Numer. Math. 23(1), 51–66 (2015). https://doi.org/10.1515/jnma20150005
Kempf, D., Hess, R., Müthing, S., Bastian, P.: Automatic code generation for highperformance discontinuous Galerkin methods on modern architectures. Technical report, arXiv:1812.08075 (2018)
Knepley, M.G., Brown, J., Rupp, K., Smith, B.F.: Achieving high performance with unified residual evaluation. Technical report, arXiv:1309.1204 (2013)
Kormann, K.: A timespace adaptive method for the Schrödinger equation. Commun. Comput. Phys. 20(1), 60–85 (2016). https://doi.org/10.4208/cicp.101214.021015a
Kormann, K., Kronbichler, M.: Parallel finite element operator application: graph partitioning and coloring. In: Proceeding of 7th IEEE International Conference eScience, pp. 332–339 (2011). https://10.1109/eScience.2011.53
Kormann, K., Reuter, K., Rampp, M.: A massively parallel semiLagrangian solver for the sixdimensional Vlasov–Poisson equation. Int. J. High Perform. Comput. Appl. 33(5), 924–947 (2019). https://doi.org/10.1177/1094342019834644
Krank, B., Fehn, N., Wall, W.A., Kronbichler, M.: A highorder semiexplicit discontinuous Galerkin solver for 3D incompressible flow with application to DNS and LES of turbulent channel flow. J. Comput. Phys. 348, 634–659 (2017). https://doi.org/10.1016/j.jcp.2017.07.039
Krank, B., Kronbichler, M., Wall, W.A.: Direct numerical simulation of flow over periodic hills up to Re_{h} = 10, 595. Flow Turbulence Combust. 101, 521–551 (2018). https://doi.org/10.1007/s1049401899413
Krank, B., Kronbichler, M., Wall, W.A.: A multiscale approach to hybrid RANS/LES wall modeling within a highorder discontinuous Galerkin scheme using function enrichment. Int. J. Numer. Meth. Fluids 90, 81–113 (2019). https://doi.org/10.1002/fld.4712
Kronbichler, M., Allalen, M.: Efficient highorder discontinuous Galerkin finite elements with matrixfree implementations. In: Bungartz, H.J., Kranzlmüller, D., Weinberg, V., Weismüller, J., Wohlgemuth, V. (eds.) Advances and New Trends in Environmental Informatics, pp. 89–110. Springer, Berlin (2018). https://doi.org/10.1007/9783319996547_7
Kronbichler, M., Kormann, K.: A generic interface for parallel cellbased finite element operator application. Comput. Fluids 63, 135–147 (2012). https://doi.org/10.1016/j.compfluid.2012.04.012
Kronbichler, M., Kormann, K.: Fast matrixfree evaluation of discontinuous Galerkin finite element operators. ACM Trans. Math. Softw. 45(3), 29:1–29:40 (2019). https://doi.org/10.1145/3325864
Kronbichler, M., Ljungkvist, K.: Multigrid for matrixfree highorder finite element computations on graphics processors. ACM Trans. Parallel Comput. 6(1), 2:1–2:32 (2019). https://doi.org/10.1145/3322813
Kronbichler, M., Wall, W.A.: A performance comparison of continuous and discontinuous Galerkin methods with fast multigrid solvers. SIAM J. Sci. Comput. 40(5), A3423–A3448 (2018). https://doi.org/10.1137/16M110455X
Kronbichler, M., Schoeder, S., Müller, C., Wall, W.A.: Comparison of implicit and explicit hybridizable discontinuous Galerkin methods for the acoustic wave equation. Int. J. Numer. Meth. Eng. 106(9), 712–739 (2016). https://doi.org/10.1002/nme.5137
Kronbichler, M., Kormann, K., Pasichnyk, I., Allalen, M.: Fast matrixfree discontinuous Galerkin kernels on modern computer architectures. In: Kunkel, J.M., Yokota, R., Balaji, P., Keyes, D.E. (eds.) ISC High Performance 2017, LNCS 10266, pp. 237–255 (2017). https://doi.org/10.1007/978331958667013
Kronbichler, M., Diagne, A., Holmgren, H.: A fast massively parallel twophase flow solver for microfluidic chip simulation. Int. J. High Perf. Comput. Appl. 32(2), 266–287 (2018). https://doi.org/10.1177/1094342016671790
Kronbichler, M., Kormann, K., Fehn, N., Munch, P., Witte, J.: A Hermitelike basis for faster matrixfree evaluation of interior penalty discontinuous Galerkin operators. Technical report, arXiv:1907.08492 (2019)
Ljungkvist, K.: Matrixfree finiteelement computations on graphics processors with adaptively refined unstructured meshes. In: Proceedings of the 25th High Performance Computing Symposium, HPC ’17, pp. 1:1–1:12. Society for Computer Simulation International, San Diego (2017). http://dl.acm.org/citation.cfm?id=3108096.3108097
Lynch, R.E., Rice, J.R., Thomas, D.H.: Direct solution of partial difference equations by tensor product methods. Numer. Math. 6, 185–199 (1964). https://doi.org/10.1007/BF01386067
Munch, P., Kormann, K., Kronbichler, M.: hyper.deal: An efficient, matrixfree finiteelement library for highdimensional partial differential equations. Technical report, arXiv:2002.08110 (2020)
Müthing, S., Piatkowski, M., Bastian, P.: Highperformance implementation of matrixfree highorder discontinuous Galerkin methods. Technical report, arXiv:1711.10885 (2017)
Orszag, S.A.: Spectral methods for problems in complex geometries. J. Comput. Phys. 37, 70–92 (1980)
Raffenetti, K., Amer, A., Oden, L., Archer, C., Bland, W., Fujita, H., Guo, Y., Janjusic, T., Durnov, D., Blocksome, M., Si, M., Seo, S., Langer, A., Zheng, G., Takagi, M., Coffman, P., Jose, J., Sur, S., Sannikov, A., Oblomov, S., Chuvelev, M., Hatanaka, M., Zhao, X., Fischer, P., Rathnayake, T., Otten, M., Min, M., Balaji, P.: Why is MPI so slow?: Analyzing the fundamental limits in implementing MPI3.1. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’17, pp. 62:1–62:12. ACM, New York (2017). https://doi.org/10.1145/3126908.3126963
Rathgeber, F., Ham, D.A., Mitchell, L., Lange, M., Luporini, F., McRae, A.T.T., Bercea, G.T., Markall, G.R., Kelly, P.H.J.: Firedrake: automating the finite element method by composing abstractions. ACM Trans. Math. Soft. 43(3), 24:1–24:27 (2017). https://doi.org/10.1145/2998441
Schmidt, S.: Fast, tensorbased solution of problems involving incompressibility, Bachelor thesis. Heidelberg University, Heidelberg (2019)
Schoeder, S., Kormann, K., Wall, W.A., Kronbichler, M.: Efficient explicit time stepping of high order discontinuous Galerkin schemes for waves. SIAM J. Sci. Comput. 40(6), C803–C826 (2018). https://doi.org/10.1137/18M1185399
Schoeder, S., Kronbichler, M., Wall, W.: Arbitrary highorder explicit hybridizable discontinuous Galerkin methods for the acoustic wave equation. J. Sci. Comput. 76, 969–1006 (2018). https://doi.org/10.1007/s1091501806492
Schoeder, S., Sticko, S., Kreiss, G., Kronbichler, M.: Highorder cut discontinuous Galerkin methods with local time stepping for acoustics. Int. J. Numer. Meth. Eng. (2020). https://doi.org/10.1002/nme.6343
Schoeder, S., Wall, W.A., Kronbichler, M.: ExWave: A high performance discontinuous Galerkin solver for the acoustic wave equation. Soft. X 9, 49–54 (2019). https://doi.org/10.1016/j.softx.2019.01.001
Solomonoff, A.: A fast algorithm for spectral differentiation. J. Comput. Phys. 98(1), 174–177 (1992). https://doi.org/10.1016/00219991(92)90182X
Sundar, H., Biros, G., Burstedde, C., Rudi, J., Ghattas, O., Stadler, G.: Parallel geometricalgebraic multigrid on unstructured forests of octrees. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, p. 43. IEEE Computer Society, Silver Spring (2012)
Świrydowicz, K., Chalmers, N., Karakus, A., Warburton, T.: Acceleration of tensorproduct operations for highorder finite element methods. Int. J. High Perf. Comput. Appl. 33(4), 735–757 (2019). https://doi.org/10.1177/1094342018816368
Van Loan, C.F.: The ubiquitous Kronecker product. J. Comput. Appl. Math. 123(1–2), 85–100 (2000)
Van Loan, C.F., Pitsianis, N.: Approximation with Kronecker products. In: Linear Algebra for Large Scale and Realtime Applications, pp. 293–314. Springer, Berlin (1993)
Varga, R.S.: Matrix Iterative Analysis, 2nd edn. Springer, Berlin (2009)
Wichmann, K.R., Kronbichler, M., Löhner, R., Wall, W.A.: Practical applicability of optimizations and performance models to complex stencilbased loop kernels in CFD. Int. J. High Perf. Comput. Appl. 33(4), 602–618 (2019). https://doi.org/10.1177/1094342018774126
Witte, J., Arndt, D., Kanschat, G.: Fast tensor product Schwarz smoothers for highorder discontinuous Galerkin methods. Technical report, arXiv:1910.11239 (2019)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2020 The Author(s)
About this paper
Cite this paper
Arndt, D. et al. (2020). ExaDG: HighOrder Discontinuous Galerkin for the ExaScale. In: Bungartz, HJ., Reiz, S., Uekermann, B., Neumann, P., Nagel, W. (eds) Software for Exascale Computing  SPPEXA 20162019. Lecture Notes in Computational Science and Engineering, vol 136. Springer, Cham. https://doi.org/10.1007/9783030479565_8
Download citation
DOI: https://doi.org/10.1007/9783030479565_8
Published:
Publisher Name: Springer, Cham
Print ISBN: 9783030479558
Online ISBN: 9783030479565
eBook Packages: Mathematics and StatisticsMathematics and Statistics (R0)