Exa-Dune -- Flexible PDE Solvers, Numerical Methods and Applications

In the Exa-Dune project we have developed, implemented and optimised numerical algorithms and software for the scalable solution of partial differential equations (PDEs) on future exascale systems exhibiting a heterogeneous massively parallel architecture. In order to cope with the increased probability of hardware failures, one aim of the project was to add flexible, application-oriented resilience capabilities into the framework. Continuous improvement of the underlying hardware-oriented numerical methods have included GPU-based sparse approximate inverses, matrix-free sum-factorisation for high-order discontinuous Galerkin discretisations as well as partially matrix-free preconditioners. On top of that, additional scalability is facilitated by exploiting massive coarse grained parallelism offered by multiscale and uncertainty quantification methods where we have focused on the adaptive choice of the coarse/fine scale and the overlap region as well as the combination of local reduced basis multiscale methods and the multilevel Monte-Carlo algorithm. Finally, some of the concepts are applied in a land-surface model including subsurface flow and surface runoff.


Introduction
In the Exa-Dune project we extend the Distributed and Unified Numerics Environment (DUNE) [7,6] by hardware-oriented numerical methods and hardware-aware implementation techniques developed in the (now) FEAT3 [55] project, with the goal of providing an exascale-ready software framework for the numerical solution of a large variety of partial differential equation (PDE) systems with state-of-the-art numerical methods. This includes higher-order discretisation schemes, multi-level iterative solvers, unstructured and locally-refined meshes, multiscale methods and uncertainty quantification, while achieving close-to-peak performance and exploiting the underlying hardware.
In the first funding period we concentrated on node-level performance, as the framework and in particular its algebraic multigrid solver already show very good scalability in MPI-only mode, as documented by the inclusion of DUNE's solver library in the High-Q-Club, the codes scaling to the full machine in Jülich at the time, with close to half a million cores. Improving the node-level performance in light of future exascale hardware involved multithreading ("MPI+X") and in particular exploiting SIMD parallelism (vector extensions of modern CPUs and accelerator architectures). These aspects were addressed within the finite element assembly and iterative solution phases. Matrix-free methods evaluate the discrete operator without storing a matrix, as the name implies, and promise to achieve a substantial fraction of peak performance. Matrix-based approaches, on the other hand, are limited by memory bandwidth (at least) in the solution phase and thus typically exhibit only a small fraction of the peak (GFLOP/s) performance of a node, but decades of research have led to robust and efficient (in terms of the number of iterations) iterative linear solvers for practically relevant systems. Importantly, a comparison of matrix-free and matrix-based methods needs to take the order of the method into account. For low-order methods it is imperative that a matrix entry can be recomputed in less time than it takes to read it from memory, to counteract the memory wall problem. This requires exploiting the problem structure as much as possible, i.e., relying on constant coefficients, (locally) regular mesh structure and linear element transformations [28,37]. In these cases it is even possible to apply stencil-type techniques, as developed in the EXA-STENCIL project [40].
On the other hand, for high-order methods with tensor-product structure the complexity of matrix-free operator evaluation can be much lower than that of matrix-vector multiplication, meaning that fewer floating-point operations have to be performed, which at the same time can be executed at a higher rate due to reduced memory pressure and better suitability for vectorization [50,12,39]. This makes high-order methods extremely attractive for exascale machines [48,51].
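For orientation, this complexity gap can be quantified with the standard operation-count estimate for tensor-product bases (a generic textbook argument, not a result specific to Exa-Dune): a matrix-based element apply touches all entries of the dense element matrix, while sum factorisation applies one-dimensional interpolation and differentiation matrices dimension by dimension.

```latex
% Per-element cost of applying a DG operator of polynomial degree p in d dimensions,
% with (p+1)^d degrees of freedom per element:
%   matrix-based:   all (p+1)^{2d} element-matrix entries are loaded and multiplied;
%   sum-factorised: d one-dimensional sweeps, each costing O((p+1)^{d+1}) multiply-adds.
\mathcal{C}_{\mathrm{mat}} = \mathcal{O}\!\left((p+1)^{2d}\right),
\qquad
\mathcal{C}_{\mathrm{sumfac}} = \mathcal{O}\!\left(d\,(p+1)^{d+1}\right),
\qquad
\frac{\mathcal{C}_{\mathrm{mat}}}{\mathcal{C}_{\mathrm{sumfac}}}
  = \mathcal{O}\!\left(\frac{(p+1)^{d-1}}{d}\right).
```

The ratio grows with the polynomial degree, which is why the advantage of matrix-free evaluation is pronounced for high-order discretisations but absent for low-order ones.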
In the second funding phase we have mostly concentrated on the following aspects: 1. Asynchronicity and fault tolerance: High-level C++ abstractions form the basis of transparent error handling using exceptions in a parallel environment, fault-tolerant multigrid solvers as well as communication-hiding Krylov methods. 2. Hardware-aware solvers for PDEs: We investigated matrix-based sparse approximate inverse preconditioners including novel machine-learning approaches, vectorization through multiple right-hand sides, as well as matrix-free high-order Discontinuous Galerkin (DG) methods and partially matrix-free robust preconditioners based on algebraic multigrid (AMG).
In the community, there is broad consensus on the assumptions about exascale systems, which did not change much during the course of this six-year project. A report by the Exascale Mathematics Working Group to the U.S. Department of Energy's Advanced Scientific Computing Research Program [16] summarises these challenges as follows, in line with [35] and more recently the Exascale Computing Project: (i) The anticipated power envelope of 20 MW implies strong limitations on the amount and organisation of the hardware components, an even stronger necessity to fully exploit them, and eventually even power-awareness in algorithms and software. (ii) The main performance difference from peta- to exascale will come from a 100-1000 fold increase in parallelism at the node level, leading to extreme levels of concurrency and increasing heterogeneity through specialised accelerator cores and wide vector instructions. (iii) The amount of memory per 'core' and the memory and interconnect bandwidth / latency will only increase at a much smaller rate, hence increasing the demand for lower memory footprints and higher data locality. (iv) Finally, hardware failure rates were expected to increase proportionally (or worse) with the growing number of components, and the mean-time-between-failures (MTBF) to decrease correspondingly. Recent studies have indeed confirmed this expectation [30], although not at the projected rate. First exascale systems are scheduled for 2020 in China [42], 2021 in the US and 2023 [25] in Europe. Although the details are not yet fully disclosed, it seems that the number of nodes will not be larger than 10^5 and will thus remain in the range of previous machines such as the BlueGene. The major challenge will thus be to exploit the node-level performance of more than 10 TFLOP/s. The rest of this paper is organised as follows.
In Section 2 we lay the foundations of asynchronicity and resilience, while Section 3 discusses several aspects of hardware-aware and scalable iterative linear solvers. These building blocks will then be used in Sections 4 and 5 to drive localized reduced basis and multilevel Monte-Carlo methods. Finally, Section 6 covers our surface-subsurface flow application.

Asynchronicity and Fault Tolerance
As predicted in the first funding period, latency has indeed become a major issue, both within a single node as well as between different MPI ranks. The core concept underlying all latency- and communication-hiding techniques is asynchronicity. This is also crucial to efficiently implement certain local-failure local-recovery methods. Following the Dune philosophy, we have designed a generic layer that abstracts the use of asynchronicity in MPI from the user. In the following, we first describe this layer and its implementation, followed by representative examples of how to build middleware infrastructure on it, and of its use for s-step Krylov methods and fault tolerance beyond global checkpoint-restart techniques.

Abstract layer for asynchronicity
We first introduce a general abstraction for asynchronicity in parallel MPI applications, which we developed for Exa-Dune. While we integrated these abstractions with the Dune framework, most of the code can easily be imported into other applications, and is available as a standalone library.
The C++ API for MPI was dropped in MPI-3, since it offered no real advantage over the C bindings beyond being a simple wrapper layer. Most MPI users coding in C++ still use the C bindings and write their own C++ interface layer, in particular in more generic software frameworks. At the same time, the C++11 standard introduced high-level concurrency concepts, in particular the future/promise construct, to enable an asynchronous program flow while maintaining value semantics. We adopt this approach as a first principle in our MPI layer to handle asynchronous MPI operations and propose a high-level C++ MPI interface, which we provide in Dune under the generic interface of Dune::Communication together with a specific implementation for MPI. An additional issue of the concrete MPI library in conjunction with C++ is the error handling concept. In C++, exceptions are the advocated approach to error propagation. As an exception changes the local code path on, e.g., the failing process in a hard-fault scenario, exceptions can easily lead to a deadlock. As we discuss later, the introduction of our asynchronous abstraction layer enables global error handling in an exception-friendly manner.
In concurrent environments, a C++ future decouples the value from the actual computation (the promise). The program flow can continue while a thread computes the actual result and publishes it via the promise to the future. The MPI C and Fortran interfaces offer asynchronous operations, but in contrast to thread-based parallelism, the user does not specify the operation that runs concurrently. In fact, MPI on its own does not offer any real concurrency at all; instead it provides a handle-based programming interface to avoid certain cases of deadlocks: the control flow is allowed to continue without finishing the communication, while the communication usually only proceeds when calls into the MPI library are executed.
We developed a C++ layer on top of the asynchronous MPI operations, which follows the design of the C++11 future. Note that the actual std::future class cannot be used for this purpose. As different implementations like the thread-based std::future, task-based TBB futures, and our new MPIFuture are available, usability greatly benefits from a dynamically typed interface. This is a reasonable approach, as std::future already uses a dynamic interface internally, and MPI operations are coarse-grained, so that the additional overhead of virtual function calls is negligible. At the same time the user expects a future to offer value semantics, which contradicts the usual pointer semantics used for dynamic polymorphism. In Exa-Dune we decided to use type-erasure to offer a clean and still flexible user interface. An MPIFuture is responsible for handling all state associated with an MPI operation. The future holds a mutable MPI_Request and MPI_Status to access information on the current operation, and it holds buffer objects which manage the actual data. These buffers offer great additional value, as we do not access the raw data directly, but can include data transformations and varying ownership. For example, it is now possible to directly send an std::vector<double>, where the receiver automatically resizes its std::vector according to the incoming data stream.
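The combination of value semantics and dynamic dispatch via type erasure can be sketched in a few lines (illustrative names, not the actual Dune::MPIFuture interface; a std::future stands in here for the MPI-request-backed backend):

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <memory>
#include <type_traits>
#include <utility>

// Minimal sketch of a type-erased future with value semantics. Different
// backends (std::future, a task-based future, an MPI-request-based future)
// hide behind one virtual interface, while the wrapper itself is copyable-
// by-move and used like a plain value.
template <class T>
class AnyFuture {
  struct Concept {
    virtual ~Concept() = default;
    virtual bool ready() = 0;
    virtual T get() = 0;
  };
  template <class F>
  struct Model final : Concept {
    F f;
    explicit Model(F&& f_) : f(std::move(f_)) {}
    bool ready() override {
      return f.wait_for(std::chrono::seconds(0)) == std::future_status::ready;
    }
    T get() override { return f.get(); }
  };
  std::unique_ptr<Concept> impl_;

public:
  template <class F>
  explicit AnyFuture(F&& fut)
      : impl_(std::make_unique<Model<std::decay_t<F>>>(std::forward<F>(fut))) {}
  bool ready() { return impl_->ready(); }
  T get() { return impl_->get(); }  // blocks until the value is available
};

int answer() {
  AnyFuture<int> fut(std::async(std::launch::async, [] { return 42; }));
  return fut.get();
}
```

An MPI-backed Model would additionally hold the MPI_Request/MPI_Status pair and the managed receive buffer; the virtual-call overhead is negligible compared to the cost of an MPI operation.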
This abstraction layer enables different use cases, highlighted below: 1. Parallel C++ exception handling: Exceptions are the recommended way to handle faults in C++ programs. As exceptions alter the execution path of a single rank, they are not directly suitable for parallel programs. As asynchronicity allows for moderately diverging execution paths, we can use it to implement parallel error propagation using exceptions. 2. Solvers and preconditioners tolerant to hard and soft faults: This functionality is used for failure propagation, restoration of MPI in case of a hard fault, and asynchronous in-memory checkpointing. 3. Asynchronous Krylov solvers: Scalar products in Krylov methods require global communication. Asynchronicity can be used to hide the latency and improve strong scalability. 4. Asynchronous parallel IO: The layer allows any non-blocking MPI operation to be transformed into a truly asynchronous operation. This also makes it possible to support asynchronous IO, hiding the latency of write operations and overlapping them with the computation of the next iteration or time step.

5. Parallel localized reduced basis methods: Asynchronicity will be used to mitigate the load imbalance inherent in the error-estimator-guided adaptive online enrichment of local reduced bases.

Parallel C++ exception handling
In parallel numerical algorithms, unexpected behaviour can occur quite frequently: a solver could diverge, the input of one component (e.g., the mesher) could be inappropriate for another component (e.g., the discretiser), etc. A well-written code should detect unexpected behaviour and provide users with a possibility to react appropriately in their own programs, instead of simply terminating with some error code. For C++, exceptions are the recommended method to handle this. With well-placed exceptions and corresponding try-catch blocks, it is possible to accomplish a more robust program behaviour. However, the current MPI specification [44] does not define any way to propagate exceptions from one rank (process) to another. In the case of unexpected behaviour within the MPI layer itself, MPI programs simply terminate, possibly after a time-out. This is a design decision that unfortunately implies a severe disadvantage in C++, when combined with the ideally asynchronous progress of computation and communication: an exception that is thrown locally by some rank can currently lead to a communication deadlock, or ultimately even to undesired program termination. Even though throwing an exception across a communication is technically illegal use of the MPI standard (a peer no longer participates in the communication), this restriction undesirably conflicts with the C++ concept of error handling. Building on top of the asynchronicity layer, we have developed an approach to enable parallel C++ exceptions. We follow C++11 techniques, e.g., we use future-like abstractions to handle asynchronous communication. Our currently implemented interface requires ULFM [11], an MPI extension to restore communicators after rank losses, which is scheduled for inclusion in MPI-4. We also provide a fallback solution for non-ULFM MPI installations, which employs an additional communicator for propagation and can, by construction, not handle hard faults, i.e., the loss of a node resulting in the loss of rank(s) in some communicator.
To detect exceptions in the code we have extended the Dune::MPIGuard, which previously only implemented the scope guard concept to detect and react to local exceptions. Our extension revokes the MPI communicator using the ULFM functionality if an exception is detected, so that it is now possible to use communication inside a block protected by the scope guard. This makes it superfluous to call the finalize and reactivate methods of the MPIGuard before and after each communication.
try {
  MPIGuard guard(comm);
  do_something();
  communicate(comm);
} catch (...) {
  comm.shrink();
  recover(comm);
}

Listing 1 shows an example of how to use the MPIGuard and recover the communicator in a node-loss scenario. In this example, an exception that is thrown only on a few ranks in do_something() will not lead to a deadlock, since the MPIGuard revokes the communicator. Details of the implementation and further descriptions are available in a previous publication [18]. We provide the "black-channel" fallback implementation as a standalone version. This library uses the profiling (PMPI) interface of the MPI standard, which makes it possible to redefine MPI functions. During MPI initialization, the library creates an opaque communicator, called the blackchannel, on which a pending MPI_Irecv request is waiting. Once a communicator is revoked, the revoking rank sends messages to the pending blackchannel request. To avoid deadlocks, we use MPI_Waitany to wait for a request, which also listens for the blackchannel request. All blocking communication is redirected to non-blocking calls using the PMPI interface. The library is linked via LD_PRELOAD, which makes it usable without recompilation, and it could be removed easily once a proper ULFM implementation is available in MPI. Figure 1 shows a benchmark comparing the time needed to duplicate a communicator, revoke it, and restore a valid state. The benchmark was performed on PALMA2, the HPC cluster of the University of Münster. Three implementations are compared: OpenMPI_BC and IntelMPI_BC use the blackchannel library based on OpenMPI and IntelMPI, respectively; OpenMPI_ULFM uses the ULFM implementation provided by fault-tolerance.org, which is based on OpenMPI. We performed 100 measurements for each implementation. The blackchannel implementation is competitive with the ULFM implementation.

Compressed in-memory checkpointing for linear solvers
The previously described parallel exception propagation, rank loss detection and communicator restoration via the ULFM extension allow us to implement a flexible in-memory checkpointing technique with the potential to recover from hard faults on the fly without any user interaction. Our implementation establishes a backup and recovery strategy which is in part based on a local-failure local-recovery (LFLR) [54] approach, and involves lossy compression techniques to reduce the memory footprint as well as bandwidth pressure. The contents of this subsection have not been published previously.
Modified solver interface. To enable the use of exception propagation as illustrated in the previous section, and to implement different backup and recovery approaches, we kept all necessary modifications within Dune-ISTL, the linear solver library. We embed the solver initialisation and the iterative loop in a try-catch block, and provide additional entry and execution points for recovery and backup, see Listing 2 for details. Default settings are provided on the user level, i.e., in Dune-PDELab. This implementation ensures that the iterative solving process is active until the convergence criterion is reached. An exception inside the try-block on any rank is detected by the MPIGuard and propagated to all other ranks, so that all ranks will jump to the catch-block.
This catch-block can be specialised for different kinds of exceptions: e.g., if a solver has diverged and a corresponding exception is thrown, it could define a specific routine for a modified restart with a possibly more robust setting and/or initial guess. The catch-block in Listing 2 shows, as an example, a possible solution in the scenario of a communicator failure, e.g., a node loss which is detected by using the ULFM extension to MPI, encapsulated by our wrapper for MPI exceptions. Following the detection and propagation, all still valid ranks end up in the catch-block and the communicator must be re-established in some way (Listing 2, line 20). This can be done by shrinking the communicator or by replacing lost nodes with previously allocated spare ones. After the communicator reconstitution, a user-provided stack of functions can be executed (Listing 2, line 21) to react to the exception. If there is no on-exception function, or none of them returns true, the exception is re-thrown to the next higher level, e.g., from the linear solver to the application level, or, in the case of nested solvers, to the enclosing solver, e.g., in optimisation or uncertainty quantification.
Furthermore, there are two additional entry points for user-provided function stacks: in line 5 of Listing 2 a stack of recovery functions is executed, and if it returns true, the solver expects that some modification, i.e., recovery, has been done. In this case it may be necessary for the other participating ranks to update some data, like resetting their local right-hand side to the initial values. The backup function stack in line 16 allows the user to provide functions for backup creation etc., after an iteration has finished successfully.
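The control flow of the modified solver interface can be sketched as follows (a hypothetical skeleton with our own names, not the actual Dune-ISTL code): a try-block around the iteration, recovery functions before the loop, backup functions after each successful iteration, and on-exception handlers in the catch-block that either handle the fault or re-throw to the next higher level.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <vector>

// Sketch of the resilient solver skeleton with three user-provided
// function stacks (names are illustrative).
struct ResilientSolver {
  std::vector<std::function<bool()>> recoveryFuncs;     // true: recovered
  std::vector<std::function<void(int)>> backupFuncs;    // arg: iteration no.
  std::vector<std::function<bool()>> onExceptionFuncs;  // true: handled

  // Runs iterate(i) until it reports convergence; returns iterations done.
  int solve(const std::function<bool(int)>& iterate, int maxIt) {
    for (auto& recover : recoveryFuncs)
      if (recover()) break;  // e.g. initial values were restored from a backup
    int it = 0;
    while (it < maxIt) {
      try {
        bool converged = iterate(it);
        for (auto& backup : backupFuncs) backup(it);  // checkpoint this iterate
        ++it;
        if (converged) break;
      } catch (...) {
        bool handled = false;
        for (auto& handler : onExceptionFuncs) handled = handler() || handled;
        if (!handled) throw;  // escalate, e.g. to the application level
      }
    }
    return it;
  }
};

// Small demo: one simulated fault at iteration 3 is handled in the
// catch-block; the solver then continues and "converges" at iteration 5.
int demoIterations() {
  ResilientSolver solver;
  int faults = 1;
  solver.onExceptionFuncs.push_back([] { return true; });  // pretend to recover
  auto iterate = [&faults](int it) {
    if (it == 3 && faults-- > 0) throw std::runtime_error("simulated fault");
    return it >= 5;  // converged
  };
  return solver.solve(iterate, 100);
}
```

The demo performs six loop passes in total: three clean iterations, one faulted and repeated iteration, and two further iterations until convergence.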
Recovery approaches. First, regardless of these solver modifications, we describe the recovery concepts, which are implemented in an exemplary recovery interface class providing functions that can be passed to the entry points within the modified solver. The interoperability of these components and the available backup techniques are described later. Our recovery class supports three different methods to recover from a data loss. The first approach is a global rollback to a backup, potentially involving lossy compression: progress on non-faulty ranks may be lost, but the restored data originate from the same state, i.e., iteration. This means there is no asynchronous progression in the recovered iterative process, but possibly just an error introduced through the backup technique used, e.g., through lossy compression. This compression error can reduce the quality of the recovery and lead to additional iterations of the solver, but is still superior to a restart, as seen later. For the second and third approaches, we follow the local-failure local-recovery strategy and re-initialize the data lost on the faulty rank by using a backup. The second, slightly simpler strategy uses these data to continue with the solver iterations. The third method additionally smoothes out the possibly deteriorated (because of compression) data by solving a local auxiliary problem [29,31]. This problem is set up by restricting the global operator A to the purely local degrees of freedom of the faulty rank, with index set F, plus a Dirichlet boundary layer. The boundary layer can be obtained by extending F to some set J using the ghost layer, or possibly the connectivity pattern of the operator A. The Dirichlet values on the boundary layer are set to their corresponding values on the neighbouring ranks, so that additional communication is necessary. If this problem is solved iteratively and backup data are available, the computation can be accelerated by initializing the local iterate with the data from the backup.
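The local auxiliary solve can be illustrated in one dimension (a toy sketch, not Dune code): for the Laplacian stencil (-1, 2, -1), the operator restricted to the lost index range is tridiagonal, and the couplings to the Dirichlet layer move to the right-hand side.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Recover the lost entries x[first..last] by solving the global operator
// (here: the 1D Laplacian stencil -1, 2, -1 with zero interior load)
// restricted to those indices, with Dirichlet values x[first-1], x[last+1]
// taken from the still-healthy neighbouring entries.
std::vector<double> recoverLocal(const std::vector<double>& x,
                                 int first, int last /* inclusive */) {
  const int m = last - first + 1;
  // Restricted tridiagonal system A_FF y = b_F; boundary couplings -> rhs.
  std::vector<double> a(m, -1.0), d(m, 2.0), c(m, -1.0), b(m, 0.0);
  b.front() += x[first - 1];
  b.back()  += x[last + 1];
  // Thomas algorithm: forward elimination, then back substitution.
  for (int i = 1; i < m; ++i) {
    const double w = a[i] / d[i - 1];
    d[i] -= w * c[i - 1];
    b[i] -= w * b[i - 1];
  }
  std::vector<double> y(m);
  y[m - 1] = b[m - 1] / d[m - 1];
  for (int i = m - 2; i >= 0; --i)
    y[i] = (b[i] - c[i] * y[i + 1]) / d[i];
  return y;
}

// Demo: for a discrete linear function (harmonic for this stencil), the
// auxiliary solve reproduces the lost entries exactly.
bool demoRecoverExact() {
  std::vector<double> x{0, 1, 2, 3, 4, 5, 6, 7};
  auto y = recoverLocal(x, 2, 4);  // pretend entries 2..4 were lost
  for (int k = 0; k < 3; ++k)
    if (std::abs(y[k] - x[2 + k]) > 1e-12) return false;
  return true;
}
```

In the actual parallel setting, fetching the Dirichlet values from the neighbouring ranks is what makes the additional communication necessary; a backup of the lost entries can serve as the initial guess for an iterative version of this solve.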
Backup techniques. Our current implementation provides two different techniques for compressed backups as well as a basic class which allows 'zero'-recovery (zeroing of lost data) if the user wants to use the auxiliary solver in case of data loss without storing any additional data during the iterative procedure.
The next backup class uses a multigrid hierarchy for lossy data compression. It should therefore only be used if a multigrid operator is already part of the solving process, because otherwise the hierarchy has to be built beforehand and introduces additional overhead. Compressing the iteration vector with the multigrid hierarchy currently involves a global communication. In addition, there is no adaptive control of the compression depth (i.e., the hierarchy level where the backup is stored); it has to be specified by the user, see a previous publication for details [29].
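The idea of hierarchy-based compression can be illustrated with a one-level stand-in (an assumed toy scheme, not the Dune implementation): store only the values at coarse-grid points and prolongate by linear interpolation on decompression, so that smooth iterates lose little while rough fine-grid detail is discarded.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Keep every second entry ("restriction to the coarse grid").
std::vector<double> restrictToCoarse(const std::vector<double>& fine) {
  std::vector<double> coarse;
  for (std::size_t i = 0; i < fine.size(); i += 2) coarse.push_back(fine[i]);
  return coarse;  // roughly half the memory footprint
}

// Linear interpolation back to the fine grid; n must be odd so that every
// fine point has coarse neighbours on both sides.
std::vector<double> prolongate(const std::vector<double>& coarse, std::size_t n) {
  std::vector<double> fine(n);
  for (std::size_t i = 0; i < n; ++i)
    fine[i] = (i % 2 == 0) ? coarse[i / 2]
                           : 0.5 * (coarse[i / 2] + coarse[i / 2 + 1]);
  return fine;
}

// Demo: a linear (very smooth) vector survives the round trip exactly;
// the compression error is exactly the detail interpolation cannot capture.
bool demoHierarchyCompression() {
  std::vector<double> fine(9);
  for (int i = 0; i < 9; ++i) fine[i] = i;
  auto rec = prolongate(restrictToCoarse(fine), 9);
  for (int i = 0; i < 9; ++i)
    if (std::abs(rec[i] - fine[i]) > 1e-12) return false;
  return true;
}
```

A real multigrid backup would use several levels and the solver's own restriction and prolongation operators, which is why it is only cheap when a multigrid hierarchy already exists.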
We also implemented a compressed backup technique based on SZ compression [41]. SZ allows compression to a specified accuracy target and can yield better compression rates than multigrid compression. The compression itself is purely local and does not involve any additional communication. We provide an SZ backup with a fixed user-specified compression target as well as a fully adaptive one which couples the compression target to the residual norm within the iterative solver. For the former we achieve an increasing compression rate as we approach the approximate solution, as seen in Figure 2 (top, pink lines), at the price of an increased overhead in case of a data loss (cf. Figure 3). The backup with adaptive compression target (blue lines) gives more constant compression rates, and a better recovery in case of faults, in particular in the second half of the iterative procedure of the solver.
The increased compression rate for the fixed SZ backup is obtained because, during the iterative process, the solution becomes smoother and can thus be compressed better by the algorithm. For the adaptive method this gain is counteracted by the demand for a higher compression accuracy.
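The essential contract of an error-bounded lossy compressor, and the adaptive coupling of the tolerance to the residual norm, can be sketched as follows (uniform scalar quantisation as an assumed stand-in for SZ; the coupling eps = alpha * ||r|| is our illustrative choice):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Quantise to a point-wise accuracy target eps: each value is mapped to the
// nearest multiple of 2*eps, so the reconstruction error is bounded by eps.
// The small integers compress well with a subsequent entropy coder.
std::vector<std::int32_t> compress(const std::vector<double>& v, double eps) {
  std::vector<std::int32_t> q(v.size());
  for (std::size_t i = 0; i < v.size(); ++i)
    q[i] = static_cast<std::int32_t>(std::lround(v[i] / (2.0 * eps)));
  return q;
}

std::vector<double> decompress(const std::vector<std::int32_t>& q, double eps) {
  std::vector<double> v(q.size());
  for (std::size_t i = 0; i < q.size(); ++i) v[i] = q[i] * 2.0 * eps;
  return v;  // point-wise error bounded by eps
}

// Adaptive variant: tie the accuracy target to the current residual norm,
// so early (inaccurate) iterates are compressed more aggressively.
double adaptiveTarget(double alpha, double residualNorm) {
  return alpha * residualNorm;
}

// Demo: the round-trip error never exceeds the requested target.
bool demoErrorBound() {
  std::vector<double> v{0.1, -3.7, 2.25, 100.0};
  const double eps = 0.01;
  auto w = decompress(compress(v, eps), eps);
  for (std::size_t i = 0; i < v.size(); ++i)
    if (std::abs(v[i] - w[i]) > eps + 1e-12) return false;
  return true;
}
```

SZ itself is far more sophisticated (prediction plus quantisation plus entropy coding), but the fixed-versus-adaptive trade-off discussed above is captured by the choice of eps alone.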
All backup techniques require communicating a data volume smaller than that of four full checkpoints, see Figure 2 (bottom). Furthermore, this bandwidth requirement is distributed over all 68 iterations (in the fault-free scenario) and could be decreased further by a lower checkpoint frequency.
The chosen backup technique is instantiated before the recovery class and passed to it. Further backup techniques can be implemented by using the provided base class and overloading the virtual functions.
Bringing the approaches together. The recovery class provides three functions which are added to the function stacks within the modified solver interface. The backup routine is added to the stack of backup functions of the specified iterative solver and generates backups of the current iterate by using the provided backup class. To adapt the numerical as well as the communication overhead to different fault scenarios and machine characteristics, the backup creation frequency can be varied. After its creation, the backup is sent to a remote rank where it is kept in memory but never written to disk; in the following this is called a 'remote backup'. Currently the backup propagation happens circularly by rank. It is also possible to trigger writing a backup to disk.
In the near future we will implement on-the-fly recovery functions that are executed if an exception is thrown. These will be provided to the other two function stacks and will differ depending on the availability of the ULFM extensions. If the extension is not available, we can only detect and propagate exceptions, but not recover a communicator in case of hard faults, i.e., node losses (cf. Section 2.2). In this scenario the function provided to the on-exception stack will only write out the global state: fault-free nodes write the data of the current iterate, whereas for faulty nodes the corresponding remote backup is written. Subsequently, the user will be able to provide a flag to the executable which modifies the backup object instantiation to read in the stored checkpoint data. Afterwards, the recovery function of our interface will overwrite the initial values of the solver with the checkpointed and possibly smoothed data, as described above. If the ULFM extensions are available, the recovery can be realised without any user interaction: during the backup class instantiation, a global communication ensures that it is the first and therefore fault-free start of the parallel execution. If the process is a respawned one which replaces a lost rank, this communication is matched by a send operation from the rank which holds the corresponding remote backup; this communication is initiated by the on-exception function. In addition to this message, the remote backup rank sends the stored compressed backup, so that the respawned rank can use it to recover the lost data.
So far, we have not fully implemented rebuilding the solver and preconditioner hierarchy, and the re-assembly of the local systems, in case of a node loss. This can be done with, e.g., message logging [13], or similar techniques which allow recomputing the individual data on the respawned rank without additional communication. Figure 3 shows the effect of various combinations of different backup and recovery techniques in case of a data loss on one rank after iteration 60. The problem is an anisotropic Poisson problem with zero Dirichlet boundary conditions which reaches the convergence criterion after 68 iterations in a fault-free scenario (black line). It is executed in parallel on 52 ranks with approximately 480 000 degrees of freedom per rank; one rank loss thus corresponds to a loss of around 2% of the data. For solving, a conjugate gradient solver with an algebraic multigrid preconditioner is applied. In addition to the residual norm we show, at the bottom left, the number of iterations needed to solve the auxiliary problem when using the different backups as initial guess. The different backup techniques are colour-coded (multigrid: red; adaptive SZ compression: blue; fixed SZ compression: pink; no backup: green). For the SZ techniques we consider two cases each, with a different compression accuracy (fixed compression) respectively a different additional scaling coefficient (adaptive SZ). Recovery techniques are coded with different line styles: global roll-back recovery is indicated by solid lines, simple local recovery by dotted lines, and recovery via an auxiliary problem solve by dashed lines. We observe that zero recovery, multigrid compression and a fixed SZ backup with a low accuracy target are not competitive if no auxiliary problem is solved: the number of iterations needed until convergence then increases significantly.
By applying an auxiliary solver the convergence can be almost fully restored (one additional global iteration), but the auxiliary solver needs a large number of iterations (multigrid: 28; SZ: 70; no backup: 132). The other backup techniques only need 8 auxiliary solver iterations. When using adaptive or very accurate fixed SZ compression, the convergence behaviour can be nearly preserved even when only a local recovery or a global roll-back is applied. The adaptive compression technique has a similar data overhead as the fixed SZ compression (cf. Figure 2, bottom) but gives slightly better results: both adaptive SZ compression approaches introduce only one additional iteration for all recovery approaches. For the accurate fixed SZ compression (SZfixed_*_1e-7) we see two additional iterations when using local or global recovery, but if we apply the auxiliary solver we also have only one additional iteration until convergence.

Communication aware Krylov solvers
In Krylov methods, multiple scalar products must be computed per iteration, which involves global sums in a parallel setting. As a first improvement we merged the evaluation of the convergence criterion with the computation of a scalar product. This does not affect the computed values, but the iteration terminates one iteration later. However, it reduces the number of global reductions per iteration from 3 to 2 and thus already saves communication overhead.
As a second step we modify the algorithm such that only one global communication is performed per iteration; this algorithm can also be found in the paper by Chronopoulos and Gear [15]. Another optimization is to overlap the two scalar products with the application of the operator and preconditioner, respectively; this algorithm was first proposed by Gropp [27]. A fully elaborated version was then presented by Ghysels and Vanroose [27]. This version needs only one global reduction per iteration, which is overlapped with the application of both the preconditioner and the operator; it is shown in Algorithm 2.
With the new communication interface described above, we are able to compute multiple sums in one reduction pattern and overlap the communication with computation. To apply these improvements in Krylov solvers, the algorithm must be adapted such that the communication is independent of the overlapping computation. For this adaptation we extend the ScalarProduct interface by a function that can be passed multiple pairs of vectors for which the scalar products should be computed. The function returns a Future which contains a std::vector<field_type> once it has finished. The runtime improvement of the algorithm strongly depends on the problem size and on the hardware. On large systems the communication overhead makes up a large part of the runtime. The maximum speedup, compared to the standard version, is 3 for reducing the number of global reductions and 2 for overlapping communication and computation, so that a maximum speedup of 6 is possible. The optimization also increases the memory requirements and the number of vector operations per iteration. An overview of the runtime and memory requirements of the methods can be found in Table 1. We use an SSOR preconditioner in an additive overlapping Schwarz setup. The problem matrix is generated from a 5-point finite difference model problem. With fewer cores the current implementation is faster than our optimized one, but with higher core counts our optimized versions outperform it. The test was executed on the helics3 cluster of the University of Heidelberg, with 5600 cores on 350 nodes. We expect the speedup to increase further on larger systems, since communication becomes more expensive. The overlap of communication and computation does not yet fully come into play, since the MPI version used does not completely support asynchronous progress.
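The single-reduction idea can be sketched with the Chronopoulos/Gear CG variant in plain serial C++ (a sketch: the merged dot-product pair below stands in for one MPI_Iallreduce over both values; the model operator is a 1D Laplacian):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

using Vec = std::vector<double>;

// Model operator: tridiag(-1, 2, -1), i.e. a 1D Laplacian with Dirichlet ends.
Vec applyLaplace1D(const Vec& v) {
  const int n = static_cast<int>(v.size());
  Vec w(n);
  for (int i = 0; i < n; ++i)
    w[i] = 2 * v[i] - (i > 0 ? v[i - 1] : 0.0) - (i + 1 < n ? v[i + 1] : 0.0);
  return w;
}

// Chronopoulos/Gear CG: per iteration a single merged reduction computes
// (r,r) and (r,Ar) together. Assumes the initial guess x = 0.
int cgSingleReduction(Vec& x, const Vec& b, double tol, int maxIt) {
  const int n = static_cast<int>(x.size());
  Vec r = b, p(n, 0.0), s(n, 0.0);
  Vec w = applyLaplace1D(r);
  double gamma = 0, delta = 0;
  for (int i = 0; i < n; ++i) { gamma += r[i] * r[i]; delta += r[i] * w[i]; }
  double alpha = gamma / delta, beta = 0.0;
  for (int it = 0; it < maxIt; ++it) {
    for (int i = 0; i < n; ++i) {        // recurrences, no reduction needed
      p[i] = r[i] + beta * p[i];
      s[i] = w[i] + beta * s[i];
      x[i] += alpha * p[i];
      r[i] -= alpha * s[i];
    }
    w = applyLaplace1D(r);
    double gammaNew = 0, deltaNew = 0;   // the one merged (global) reduction
    for (int i = 0; i < n; ++i) { gammaNew += r[i] * r[i]; deltaNew += r[i] * w[i]; }
    if (std::sqrt(gammaNew) < tol) return it + 1;
    beta = gammaNew / gamma;
    alpha = gammaNew / (deltaNew - beta * gammaNew / alpha);
    gamma = gammaNew;
  }
  return maxIt;
}

// Demo: solve a 10x10 system and check the true residual of the result.
bool demoCG() {
  const int n = 10;
  Vec b(n, 1.0), x(n, 0.0);
  cgSingleReduction(x, b, 1e-10, 200);
  Vec Ax = applyLaplace1D(x);
  double res = 0;
  for (int i = 0; i < n; ++i) res = std::max(res, std::abs(Ax[i] - b[i]));
  return res < 1e-6;
}
```

In a distributed implementation, the reduction over gammaNew and deltaNew would be posted as one non-blocking allreduce and overlapped with the next operator (and preconditioner) application, which is precisely what the extended ScalarProduct interface with its Future return value enables.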

Hardware-aware, Robust and Scalable Linear Solvers
In this section we highlight improved concepts for high-performance iterative solvers. We provide matrix-based robust solvers on GPUs using sparse approximate inverses and optimize algorithm parameters using machine learning. On CPUs we significantly improve the node-level performance by using optimal matrix-free operators for Discontinuous Galerkin methods, specialized partially matrix-free preconditioners as well as vectorized linear solvers.

Strong smoothers on the GPU: Fast Approximate Inverses with conventional and Machine Learning approaches
In continuation of the first project phase, we enhanced the assembly of sparse approximate inverses (SPAI), a kind of preconditioner that we had shown to be very effective within the DUNE solver before [26,9]. Concerning the assembly of such matrices we have investigated three strategies regarding their numerical efficacy (that is, their quality in approximating A −1 ), the computational complexity of the actual assembly and, ultimately, the total efficiency of the amortised assembly combined with all applications during a system solution. For all strategies, this includes dedicated performance engineering for different hardware architectures with a focus on the exploitation of GPUs.
SPAI-1. As a starting point we have developed, implemented and tuned a fast SPAI-1 assembly routine based on MKL/LAPACK routines (CPU) and on the cuBLAS/cuSPARSE libraries, performing up to four times faster on the GPU. This implementation is based on the batched solution of QR decompositions that arise in Householder transformations during the SPAI minimisation process. In many cases we observe that the resulting preconditioner features a high quality, comparable to Gauss-Seidel methods. Most importantly, this result still holds true when taking into account the total time-to-solution, which includes the assembly time of the SPAI, even on a single core where the advantages of SPAI preconditioning over forward/backward substitution during the iterative solution process are not yet exploited. More systematic experiments with respect to these statements as well as their extension to larger test architectures are currently being conducted.
SAINV. To describe our new GPU implementation, we write the row-wise updates in the right-looking, outer-product form of the A-biconjugation process of the SAINV factorisation as follows: the assembly of the preconditioner is based on a loop over the existing rows i ∈ {1, . . . , N} of Z (initialised as the unit matrix I N ), where every iteration of the loop generally calls three operations, namely a sparse matrix-vector multiplication, a dot product and an update of the remaining rows i + 1, . . . , N based on a drop parameter ε. In our implementation we use the ELLPACK and CSR formats, pre-allocating a fixed number of nonzeros for the matrix Z, namely ω times the average number of nonzeros per row of A. With a fixed row size, no reallocation of the arrays of the matrix format is needed and the row-wise update can be computed in parallel. This idea is based on the observation that while the density ω for typical drop tolerances is not strictly limited, it generally falls into the interval ]0, 3[. As the SpMV and dot kernels are well established, we take a closer look at the row-wise update, which is described in more detail in Algorithm 3. We first compute the values to be added and store them in a variable α. Then we iterate over all nonzero entries of α (which of course has the same sparsity pattern as z i ) and check whether the computed value exceeds a certain drop tolerance. If this condition is met, we have three conditions for an insertion into the matrix Z:
1. check if there is already an existing nonzero value in the j-th row at the column index of the value α n , and search for the new minimal entry of this row;
2. else check if there is still room in the j-th row, so that we can simply insert the value α n into that row, and search for the new minimal entry of this row;
3. else check if the value α n is greater than the current minimum. If this condition is satisfied, replace the old minimal value with α n and search for the new minimal entry of this row.
If none of these conditions is met, we drop the computed value without updating the current column, and repeat these steps for the next nonzero values of the current row. Capping the number of values per row also has the following disadvantages: with a too small maximum of nonzeros per row, a qualitative A-orthogonalization cannot be performed. To avoid this case we only take values of ω greater than one, which appears to be sufficient. Also, if a row has already reached the maximum number of nonzeros, additional but relatively small values may be dropped. This can become an issue if the sum of these small numbers would have led to a relevant entry in a later iteration. For a comparison, Figure 5 depicts the time-to-solution for V-cycle multigrid using different strong smoothers on a P100 GPU. All smoothers are constructed using 8 Richardson iterations with (reasonably damped, if necessary) preconditioners such as Jacobi, Gauss-Seidel, ILU-0, SPAI-1, SPAI-ε and SAINV. We set up the benchmark case from a 2D Poisson problem, in the isotropic case and with two-sided anisotropies in the grid to harden the problem even for well-ordered ILU approaches. The SPAI approaches are the best choice for the smoother on the GPU.
Machine Learning. Finally, we started investigating how to construct approximate inverses using methods from Machine Learning [53]. The basic idea here is to treat A −1 as a discrete function in the course of a function regression process. The entries of the system matrix are represented vector-wise in the input layer (cf. Figure 6). In the same way, our output layer contains the entries of the approximate inverse. Between these layers we can add a number of hidden layers consisting of hidden neurons. How many hidden neurons we need to create strong approximate inverses is a key design decision and we discuss this below. In general our supervised training algorithm is backward propagation with random initialisation.
Alongside a linear propagation function i total = W · o total + b, with the total (layer) net input i total , the weight matrix W, the vector of bias weights b and the total output of the previous layer o total , we use the rectified linear unit (ReLU) function as activation function α(x), so that the output y of each neuron is y := α(Σ j o j · w ij ).
Here o j is the output of the preceding (sending) units and w ij are the corresponding weights between the neurons. For the optimization we use the L2 error function and update the weights with w ij (t+1) = w ij (t) + γ · o i · δ j , with the output o i of the sending unit and learning rate γ. The error signal δ j of the gradient descent method is δ j = f′(i j ) · (t j − o j ) if neuron j is an output neuron with target value t j , and δ j = f′(i j ) · Σ k∈S (δ k · w kj ) if neuron j is a hidden neuron, where S denotes the units receiving input from neuron j.
For details concerning the test/training algorithm we refer to a previous publication [53]. For the defect correction prototype, we find a significant speedup for a moderately anisotropic Poisson problem, see Figure 7.
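The update rule can be made concrete with a minimal sketch. The following Python fragment applies the quoted delta rule to a single ReLU layer on one training sample (layer sizes and data are purely illustrative) and checks that the L2 error decreases:

```python
def relu(x):
    return x if x > 0.0 else 0.0

def relu_prime(x):
    return 1.0 if x > 0.0 else 0.0

def forward(W, b, o_prev):
    # i_total = W * o_total + b, then the ReLU activation
    net = [sum(wij * oj for wij, oj in zip(row, o_prev)) + bi
           for row, bi in zip(W, b)]
    return net, [relu(n) for n in net]

def train_step(W, b, o_prev, target, gamma):
    net, out = forward(W, b, o_prev)
    # output-neuron error signals: delta_j = f'(i_j) * (t_j - o_j)
    delta = [relu_prime(nj) * (tj - oj) for nj, tj, oj in zip(net, target, out)]
    for j, row in enumerate(W):
        for i in range(len(row)):
            row[i] += gamma * o_prev[i] * delta[j]   # w_ij += gamma * o_i * delta_j
        b[j] += gamma * delta[j]
    return sum((tj - oj) ** 2 for tj, oj in zip(target, out))

W = [[0.5, -0.2]]                      # one output neuron, two inputs
b = [0.1]
sample, target = [1.0, 2.0], [1.0]
loss_before = train_step(W, b, sample, target, gamma=0.1)
_, out_after = forward(W, b, sample)
loss_after = sum((t - o) ** 2 for t, o in zip(target, out_after))
```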

Autotuning with artificial neural networks
Inspired by our use of approximate inverses generated by artificial neural networks (ANNs), we exploit feed-forward neural networks (FNNs) for the automatic tuning of solver parameters. We were able to show that such an approach can provide much better a-priori choices for the parametrisation of iterative linear solvers. In detailed studies for 2D Poisson problems we conducted benchmarks for many test matrices and autotuning systems, using FNNs as well as convolutional neural networks (CNNs) to predict the relaxation parameter ω of an SOR solver. In Figure 8 we depict 100 randomly chosen samples of this study. It can be seen that the NN-driven system can compete even with good a-priori choices of ω, whilst 'bad' choices (labeled constant) might lead to a stalling solver.
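Why a learned prediction of ω pays off can be illustrated with a small experiment. The following Python sketch counts SOR iterations on a 1D Poisson model problem for the naive choice ω = 1 and for the classical optimal ω, which here plays the role of the value the network is trained to predict:

```python
import math

def sor_iterations(n, omega, tol=1e-8, maxiter=10000):
    # SOR sweeps for the 1D Poisson system tridiag(-1, 2, -1) x = b, b = 1
    b = [1.0] * n
    x = [0.0] * n
    for it in range(1, maxiter + 1):
        for i in range(n):
            left = x[i - 1] if i > 0 else 0.0
            right = x[i + 1] if i < n - 1 else 0.0
            gs = (b[i] + left + right) / 2.0      # Gauss-Seidel value
            x[i] += omega * (gs - x[i])           # over-relaxation
        r = max(abs(b[i] - (2.0 * x[i]
                            - (x[i - 1] if i > 0 else 0.0)
                            - (x[i + 1] if i < n - 1 else 0.0)))
                for i in range(n))
        if r < tol:
            return it
    return maxiter

n = 50
# known optimal relaxation parameter for this model problem
omega_opt = 2.0 / (1.0 + math.sin(math.pi / (n + 1)))
its_plain = sor_iterations(n, 1.0)        # the 'constant' a-priori choice
its_tuned = sor_iterations(n, omega_opt)  # what the ANN tries to predict
```

The gap between the two iteration counts is what the autotuning system recovers without knowing the closed-form optimum.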

Further development of sum-factorized matrix-free DG methods
While we were able to achieve good node-level performance with our matrix-free DG methods in the first funding period, our initial implementations still did not utilize more than about 10% of the theoretical peak FLOP throughput. In the second funding period, we systematically improved on those results by focusing on several aspects of our implementation.
Introduction of block-based DOF processing. Our implementation is based on Dune-PDELab, a very flexible discretization framework for both continuous and discontinuous discretizations of PDEs. In order to support a wide range of discretizations, PDELab has a powerful system for mapping DOFs to vector and matrix entries. Due to this flexibility, the mapping process is rather expensive. On the other hand, Discontinuous Galerkin DOFs will always be blocked in a cell-wise manner. This can be exploited by only ever mapping the first degree of freedom associated with each cell and then assuming that all subsequent values for this cell are directly adjacent to the first entry. We have added a special 'DG codepath' to Dune-PDELab which implements this optimization.
Avoiding unnecessary memory transfers. As all of the values for each cell are stored in consecutive locations in memory, we can further optimize the framework behavior by skipping the customary gather/scatter steps before and after the assembly of each cell and facet integral. This is implemented by replacing the data buffer normally passed to the integration kernels with a dummy buffer that stores a pointer to the first entry in the global vector/matrix and directly operates on the global values. This is completely transparent to the integration kernels, as they only ever access the global data through a well-defined interface on these buffer objects. Together, these two changes have allowed us to reduce the overhead of the framework infrastructure on assembly times from more than 100% to less than 5%.
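The effect of the two optimizations can be illustrated with a small sketch. The following Python fragment (illustrative names, not the PDELab interface) maps only the first DOF of each cell and hands the kernel a view into the global vector, so no gather/scatter copies occur:

```python
from array import array

def cell_view(global_vec, cell, block_size):
    # map only the first DOF of the cell; the rest follow contiguously
    first = cell * block_size
    return memoryview(global_vec)[first:first + block_size]

def integration_kernel(local_dofs, scale):
    # toy 'kernel': writes through the view, i.e. directly into global storage
    for i in range(len(local_dofs)):
        local_dofs[i] = scale * (i + 1)

block_size, cells = 4, 3
u = array('d', [0.0] * block_size * cells)   # cell-wise blocked global vector
for cell in range(cells):
    integration_kernel(cell_view(u, cell, block_size), scale=float(cell + 1))
```

The kernel never learns whether it writes into a scratch buffer or into the global vector, which is exactly the transparency the dummy-buffer trick relies on.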
Explicit vectorization. The DG implementation used in the first phase of the project was written as scalar code and relied on the compiler's auto-vectorization support to utilize the SIMD instruction set of the processor, which we tried to facilitate by providing compile-time loop bounds and aligned data structures. In the second phase, we have switched to explicit vectorization with a focus on AVX2, which is a common foundational instruction set across all current x86-based HPC processors. We exploit the possibilities of our C++ code base and use a well-abstracted library which wraps the underlying compiler intrinsics [23]. In a separate project [34], we are extending this functionality to other SIMD instruction sets like AVX512.
Loop reordering and fusion. While vectorization is required to fully utilize modern CPU architectures, it is not sufficient. We also have to feed the execution units with a substantial number of mutually independent chains of computation (≈ 40-50 on current CPUs). This amount of parallelism can only be extracted from typical DG integration kernels by fusing and reordering computational loops. In contrast to other implementations of matrix-free DG assembly [22,43], we do not group computations across multiple cells or facets, but instead across quadrature points and multiple input/output variables. In 3D, this works very well for scalar PDEs that contain both the solution itself and its gradient, which adds up to four quantities that exactly fit into an AVX2 register.
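The core of the sum-factorized kernels can be sketched independently of vectorization. The following Python fragment evaluates a 2D tensor-product polynomial at all quadrature points once naively and once via two 1D contractions, which is the complexity reduction (from O(p²q²) to O(p²q + pq²) per cell) that the kernels exploit; all sizes are illustrative:

```python
def naive_eval(C, B):
    # C[i][j]: coefficients; B[a][i]: value of 1D basis function i at quad point a
    p, q = len(C), len(B)
    return [[sum(C[i][j] * B[a][i] * B[b][j] for i in range(p) for j in range(p))
             for b in range(q)] for a in range(q)]

def sumfact_eval(C, B):
    p, q = len(C), len(B)
    # first contraction (x-direction): T[a][j] = sum_i B[a][i] * C[i][j]
    T = [[sum(B[a][i] * C[i][j] for i in range(p)) for j in range(p)]
         for a in range(q)]
    # second contraction (y-direction): U[a][b] = sum_j T[a][j] * B[b][j]
    return [[sum(T[a][j] * B[b][j] for j in range(p)) for b in range(q)]
            for a in range(q)]

C = [[1.0, 2.0], [3.0, 4.0]]                 # 2x2 coefficient tensor
B = [[0.5, 0.5], [0.25, 0.75], [1.0, 0.0]]   # 2 basis functions, 3 quad points
u_naive = naive_eval(C, B)
u_fast = sumfact_eval(C, B)
```

In the actual kernels the innermost loops of such contractions are the ones fused across quadrature points and across the solution and its gradient.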
Results. Table 2 compares the throughput and the hardware efficiency of our matrix-free code for two diffusion-reaction problems A (axis-parallel grid, constant coefficients per cell) and B (affine geometries, variable coefficients per cell) with a matrix-based implementation. Figure 9 compares throughput and floating point performance of our implementation for these problems as well as an additional problem C with multi-linear geometries, demonstrating that we are able to achieve more than 50% of the theoretical peak FLOP rate on this machine as well as a good computational processing rate as measured in DOFs/sec. While our work in this project was mostly focused on scalar diffusion-advection-reaction problems, we have also applied the techniques shown here to projection-based Navier-Stokes solvers [51]. One important lesson learned was the unsustainable amount of work required to extend our approach to different problems and/or hardware architectures. This has led us to develop a Python-based code generator in a new project [34], which provides powerful abstractions for the building blocks listed above. This toolbox can be extended and combined in new ways to achieve performance comparable to hand-optimized code. Especially for more complex problems involving systems of equations, there is a large number of possible ways to group variables and their derivatives into sum factorization kernels, due to our approach of vectorizing over multiple quantities within a single cell. The resulting search space is too large for manual exploration, which the above project solved by adding benchmark-driven automatic comparison of those variants. Finally, initial results show good scalability of our code, as shown by the strong scaling results in Figure 10. Our implementation scales well down to a local problem size of just 18 cells, at which point we still need to improve the asynchronicity of ghost data communication and assembly.

Hybrid solvers for Discontinuous Galerkin schemes
In Section 3.3 we concentrated on the performance of matrix-free operator application. This is sufficient for instationary problems with explicit time integration, but for stationary problems or implicit time integration, (linear) algebraic systems need to be solved. This requires, besides the operator application, robust and scalable preconditioners. For this we extended hybrid AMG-DG preconditioners [8] in joint work with Eike Müller from the University of Bath, UK [10]. In a solver for matrices arising from higher-order DG discretizations, the basic idea is to perform all computations on the DG system in a matrix-free fashion and to explicitly assemble only a matrix in a low-order subspace, which is significantly smaller. In the sense of subspace correction methods [58] we employ a splitting into the local spaces V p T , where V p T is the finite element space of polynomial degree p on element T, and a coarse space V c , which is either the lowest-order conforming finite element space V 1 h on the mesh T h or the space of piecewise constants V 0 h . Note that the symmetric weighted interior penalty DG method from [21] reduces to the cell-centered finite volume method with two-point flux approximation on V 0 h . Note also that the system on V c can be assembled without assembling the large DG system. For solving the blocks related to V p T , two approaches have been implemented. In the first (named partially matrix-free), these diagonal blocks are factorized using LAPACK and each iteration uses a backsolve. In the second approach the diagonal blocks are solved iteratively to low accuracy using matrix-free sum factorization. Both variants can be used in an additive or multiplicative fashion. Figure 11 shows that the partially matrix-free variant is optimal for polynomial degree p ≤ 5, but starting from p = 6 the fully matrix-free version starts to outperform all other options.
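The partially matrix-free idea can be sketched as follows. The Python fragment below factorizes two toy diagonal blocks once and applies the block preconditioner by backsolves only, standing in for the LAPACK-based variant (all data illustrative):

```python
def lu_factor(A):
    # Doolittle LU without pivoting (fine for the SPD toy blocks below)
    n = len(A)
    LU = [row[:] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            LU[i][k] /= LU[k][k]
            for j in range(k + 1, n):
                LU[i][j] -= LU[i][k] * LU[k][j]
    return LU

def lu_solve(LU, b):
    n = len(b)
    y = b[:]
    for i in range(n):                      # forward substitution
        y[i] -= sum(LU[i][j] * y[j] for j in range(i))
    x = y[:]
    for i in reversed(range(n)):            # backward substitution
        x[i] = (x[i] - sum(LU[i][j] * x[j] for j in range(i + 1, n))) / LU[i][i]
    return x

# two cell-local diagonal blocks, factorized once in the setup phase
blocks = [[[4.0, 1.0], [1.0, 3.0]], [[5.0, 2.0], [2.0, 4.0]]]
factors = [lu_factor(B) for B in blocks]

def apply_block_preconditioner(r):
    # each application: one cheap backsolve per cell block
    return lu_solve(factors[0], r[:2]) + lu_solve(factors[1], r[2:])

z = apply_block_preconditioner([1.0, 0.0, 0.0, 1.0])
```

The full DG operator itself never needs to be stored; only these small dense blocks are kept.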
In order to demonstrate the robustness of our hybrid AMG-DG method we use the permeability field of the SPE10 benchmark problem [14] within a heterogeneous elliptic problem. This is considered to be a hard test problem in the porous media community. The DG method from [21] is employed. Figure 12 depicts results for different variants and polynomial degrees, run in parallel on 20 cores. A moderate increase of the iteration numbers with the polynomial degree can be observed. With respect to time-to-solution (not reported), the additive (block Jacobi) partially matrix-free variant is to be preferred for polynomial degrees larger than one. Fig. 12: Convergence history for the SPE10 benchmark. The relative energy norm is shown for polynomial degrees 1 (red squares), 2 (blue upward triangles) and 3 (green downward triangles). Results for the block-SSOR smoother are marked by filled symbols and results for the block-Jacobi smoother by empty symbols, cf. [10].

Horizontal vectorization of Block Krylov methods
Methods like the Multiscale FEM (see Section 4), optimization and inverse problems need to invert the same operator for many right-hand-side vectors. Conceptually, instead of solving Ax i = b i for i = 1, . . . , k separately, this leads to a block problem AX = B with X = (x 1 , . . . , x k ) and B = (b 1 , . . . , b k ). Such problems can be solved using Block Krylov solvers. The benefit is that the approximation space can grow faster, as the solver orthogonalizes the updates for all right-hand sides. Even for a single right-hand side, Block-Krylov-based enriched Krylov methods can be used to accelerate the solution process. Preconditioners and the actual Krylov solver can be sped up using horizontal vectorization. Assuming k right-hand sides, we observe that the scalar product yields a k × k dense matrix and has O(k²) complexity. The mentioned larger approximation space improves the convergence rate only for weaker preconditioners; therefore we pursued a different strategy and approximate the scalar product matrix by a sparse matrix, so that we again retain O(k) complexity. In particular we consider the case of a diagonal or block-diagonal matrix. The diagonal matrix basically results in k independent solvers running in parallel, so that the performance gain is solely based on SIMD vectorization and the associated favorable memory layout.
For the implementation in Dune-ISTL we use platform-portable C++ abstractions of SIMD intrinsics, building on the Vc library [38] and some Dune-specific extensions. We use this to exchange the underlying data type of the right-hand-side and solution vectors, so that we no longer store scalars, but SIMD vectors. This is possible with generic programming techniques, like C++ templates, and yields a row-wise storage of the dense matrices X and B. This row-wise storage is optimal and ensures a high arithmetic intensity. The implementations of the Krylov solvers have to be adapted to the SIMD data types, since some operations, like casts and branches, are not available generically for SIMD data types. As a side effect, all preconditioners, including the AMG, are now fully vectorized. Performance tests using 256 right-hand-side vectors for a 3D Poisson problem show nearly optimal speedup (see Figure 13). The tests are carried out on a Haswell-EP system (E5-2698v3, 16 cores, AVX2, 4 lanes), for which the theoretical maximum speedup over scalar execution on a single core is 64; we observe a speedup of 50.

Adaptive Multiscale Methods
The main goal in the second funding phase was a distributed adaptive multilevel implementation of the localized reduced basis multiscale method (LRBMS) [49]. Like the Multiscale FEM (MsFEM), LRBMS is designed for heterogeneous multiscale and large-scale problems. It performs particularly well for problems that exhibit scale separation, with effects on both a fine and a coarse scale contributing to the global behavior. Unlike MsFEM, LRBMS is best applied in multi-query settings in which a parameterized PDE needs to be solved many times for different parameters. As an amalgam of domain decomposition and model order reduction techniques, the computational domain is partitioned into a coarse grid, with each macroscopic grid cell representing a subdomain for which, in an offline precompute stage, local reduced bases are constructed. Appropriate coupling is then applied to produce a global solution approximation from localized data. For increased approximation fidelity we can integrate localized global solution snapshots into the bases, or the local bases can be enriched adaptively in the online stage, controlled by a localized a-posteriori error estimator.

Continuous problem and discretization
We consider elliptic parametric multiscale problems on a domain Ω ⊂ R d , where we look for p(µ) ∈ H 1 0 (Ω) such that b(p(µ), q; µ) = l(q) for all q ∈ H 1 0 (Ω), with parameters µ ∈ P ⊂ R p , p ∈ N. We let ε > 0 be the multiscale parameter associated with the fine scale. For demonstration purposes we consider a particular linear elliptic problem setup in Ω ⊂ R d (d = 2, 3) that exhibits a multiplicative splitting in the quantities affected by the multiscale parameter ε. It is a model for the so-called global pressure p(µ) ∈ H 1 0 (Ω) in two-phase flow in porous media, where the total scalar mobility λ(µ) is parameterized, κ ε denotes the heterogeneous permeability tensor and f the external forces. Hence, we seek p(µ) that satisfies, weakly in H 1 0 (Ω), −∇ · (λ(µ)κ ε ∇p(µ)) = f in Ω. With A(x; µ) := λ(µ)κ ε (x) this gives rise to the forms b(p, q; µ) := ∫ Ω A(µ)∇p · ∇q dx and l(q) := ∫ Ω f q dx. For the discretization we first require a triangulation T H of Ω for the macro level. We call the elements T ∈ T H subdomains. We then require that each subdomain be covered with a fine partition τ h (T) in such a way that T H and τ h := ∪ T ∈T H τ h (T) are nested. We denote by F H the faces of the coarse triangulation and by F h the faces of the fine triangulation.
Let V(τ h ) ⊂ H 2 (τ h ) denote an approximating subspace of the broken Sobolev space on τ h . The DG bilinear form b h and the right-hand side l h are chosen according to the SWIPDG method [20], with a coupling bilinear form b e h for each face e. The LRBMS method allows for a variety of discretizations, i.e. approximation spaces V(τ h ). As a particular choice of an underlying high-dimensional approximation space we choose V(τ h ) = Q k h := ⊕ T ∈T H Q k,T h , where the local spaces Q k,T h are discontinuous finite element spaces of polynomial order k on τ h (T).

Model Reduction
For model order reduction in the LRBMS method we choose the reduced space Q red := ⊕ T ∈T H Q T red ⊂ Q k h with local reduced approximation spaces Q T red ⊂ Q k,T h . We denote by p red (µ) the reduced solution of (3) in Q red . This formulation naturally leads to solving a sparse blocked linear system, similar to a DG approximation with high polynomial degree on the coarse subdomain grid.
The construction of the subdomain reduced spaces Q T red is again very flexible. Initialization with shape functions on T up to order k ensures a minimum fidelity. Basis extensions can be driven by a discrete weak greedy approach which incorporates localized solutions of the global system. Depending on available computational resources, and given a suitable localizable a-posteriori error estimator η(p red (µ), µ), we can forego computing global high-dimensional solutions altogether and rely only on online enrichment to extend Q T red 'on the fly'. With online enrichment, given a reduced solution p red (µ) for some arbitrary µ ∈ P, we first compute local error indicators η T (p red (µ), µ) for all T ∈ T H . If η T (p red (µ), µ) is greater than some prescribed bound δ tol > 0, we solve on an overlay region N (T) ⊃ T and extend Q T red with p N(T ) (µ)| T . Inspired by results in [17] we choose the overlay region's diameter diam(N (T)) of the order O(diam(T)|log(diam(T))|). In practice we use the completely online/offline decomposable error estimator developed in [49, Sec. 4], which in turn is based on the idea of producing a conforming reconstruction of the diffusive flux λ(µ)κ ε ∇ h p h (µ) in some Raviart-Thomas-Nédélec space V l h (τ h ) ⊂ H div (Ω), as presented in [21,57]. This process is repeated until either a maximum number of enrichment steps is reached or η T (p red (µ), µ) ≤ δ tol .
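The enrichment loop can be summarized schematically. In the following Python sketch the error estimator and the effect of a basis extension are replaced by illustrative stand-ins (each extension is assumed to reduce the local indicator by a fixed factor); the point is the marking/extension control flow:

```python
def online_enrichment(indicators, delta_tol, max_steps, reduction=0.5):
    # indicators: eta_T per subdomain T; basis_sizes: dim of each Q_T_red
    basis_sizes = {T: 1 for T in indicators}
    for _ in range(max_steps):
        marked = [T for T, eta in indicators.items() if eta > delta_tol]
        if not marked:
            break                               # all subdomains below tolerance
        for T in marked:
            basis_sizes[T] += 1                 # extend Q_T_red with p_N(T)|_T
            indicators[T] *= reduction          # re-estimate after enrichment
    return basis_sizes, indicators

eta = {"T0": 1.0, "T1": 0.1, "T2": 0.4}
sizes, eta_final = online_enrichment(eta, delta_tol=0.2, max_steps=10)
```

Only the subdomains that actually need it are enriched, which is also the source of the load imbalance discussed in the implementation section below.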

Implementation
We base our MPI-parallel implementation of LRBMS on the serial version developed previously. In this setup the high-dimensional quantities and all grid structures are implemented in Dune. The model order reduction as such is implemented in Python using pyMOR [45]. The model reduction algorithms in pyMOR follow a solver-agnostic design principle. Abstract interfaces allow for example projections, greedy

algorithms or reduced data reconstruction to be written without knowing details of the PDE solver backend. The global macro grid T H can be any MPI-enabled Dune grid manager with adjustable overlap size for the domain decomposition; we currently use Dune-YaspGrid. The fine grids τ h (T) are constructed using the same grid manager as on the macro scale, with MPI subcommunicators. These are currently limited to a size of one (rank-local); however, the overall scalability could benefit from dynamically sizing these subcommunicators to balance communication overhead and computational intensity, as demonstrated in [36, Sec. 2.2]. The assembly of the local (coupling) bilinear forms is done in Dune-GDT [24], with pyMOR/Python bindings facilitated through Dune-XT [46], where Dune-Grid-Glue [19] generates the necessary grid views for the SWIPDG coupling between otherwise unrelated grids. Switching to Dune-Grid-Glue constitutes a major step forward in the robustness of the overall algorithm, compared to our previous manually implemented approach to matching independent local grids for the assembly of coupling matrices.

Algorithm 4 Schematic representation of the LRBMS pipeline.
Require: training set P train ⊂ P, reconstruction operator R h (p red (µ))
Offline phase:
  generate T H
  for all T ∈ T H do
    create τ h (T)
    init Q T red with DG shape functions of order k
  optional: greedy basis generation over P train
Online phase:
  compute p red (µ) for arbitrary µ
  optional adaptive enrichment:
  for all T ∈ T H do
    η ← η T (p red (µ), µ)
    while η ≥ δ tol do
      compute p N(T ) (µ)
      extend Q T red with p N(T ) (µ)| T
      recompute p red (µ) and η ← η T (p red (µ), µ)
We have identified three major challenges in parallelizing all the steps in LRBMS:

1. Global solutions p h (µ) of the blocked linear system need to be computed with an appropriate MPI-parallel iterative solver. With the serial implementation already using Dune-ISTL as the backend for matrix and vector data, we only had to generate an appropriate communication configuration for the blocked SWIPDG matrix structure to make the BiCGStab solver usable in our context. We then tested this setup on the SuperMUC Petascale System in Garching. The results in Figure 14 show nearly ideal speedup from 64 nodes with 1792 MPI ranks up to a full island with 512 nodes and 14336 ranks.

2. (Global) Reduced systems also need a distributed solver. By design, all reduced quantities in pyMOR are, at the basic, unabstracted level, NumPy arrays [56]. Therefore we cannot easily re-use the Dune-ISTL-based solvers for the high-dimensional systems. Our current implementation gathers the reduced system matrices from all MPI ranks on rank 0, recombines them, solves the system with a direct solver and scatters the solution. There is great potential in making this step more scalable, by either using a distributed sparse direct solver like MUMPS [3] or translating the data into the Dune-ISTL backend.
3. Adaptive online enrichment is inherently load-imbalanced due to its guidance by localized error estimators. The load imbalance results from one rank idling while waiting to receive updates to a basis on a subdomain in its overlap region from another rank. This idle time can be minimized by encapsulating the update in an MPIFuture as described in Subsection 2.1. This will allow the rank to continue its own enrichment process until the updated basis is actually needed in a subsequent step.

Uncertainty Quantification
The solution of stochastic partial differential equations (SPDEs) is characterized by extremely high dimensions and poses great (computational) challenges. Multilevel Monte Carlo (MLMC) algorithms attract great interest due to their superiority over the standard Monte Carlo approach. Being based on Monte Carlo (MC), MLMC retains the properties of independent sampling. To overcome the slow convergence of MC, where many computationally expensive PDEs have to be solved, MLMC properly combines many cheap MC estimates on coarse levels with few expensive ones on fine levels, achieving (much) faster convergence. One of the critical components of MLMC algorithms is the way in which the coarser levels are selected. The exact definition of the levels is an open question and different approaches exist. In the first funding phase, the Multiscale FEM was used as a coarser level in MLMC. During the second phase, the developed parallel MLMC algorithms for uncertainty quantification were further enhanced. The main focus was on exploring the capabilities of the renormalization approach for defining the coarser levels in the MLMC algorithm, and on using MLMC as a coarse-grained parallelization approach.
Here, we employ MLMC to exemplarily compute the mean flux through saturated porous media with prescribed pressure drop and known distribution of the random coefficients.
Mathematical problem. As a model problem in R 2 or R 3 , we consider steady-state single-phase flow in a random porous medium, −∇ · (k(x, ω)∇p(x, ω)) = 0, subject to the boundary conditions p| x=0 = 1 and p| x=1 = 0 and zero flux on the other boundaries. Here p is the pressure, k is a scalar permeability and ω is a random vector. The quantity of interest is the mean (expected value) E[Q] of the total flux Q through the inlet of the unit cube, i.e., Q(ω) := ∫ x=0 k(x, ω)∂ n p(x, ω) dx. Both the coefficient k(x, ω) and the solution p(x, ω) are subject to uncertainty, characterized by the random vector ω in a properly defined random space Ω. For generating permeability fields we consider the practically relevant covariance C(x, y) = σ² exp(−||x − y||²/λ). An algorithm based on forward and inverse Fourier transforms over the circulant embedding of the covariance matrix is used to generate the permeability fields. For solving the deterministic PDEs a finite volume method on a cell-centered grid is used [32]. More details and further references can be found in a previous paper [47].
Monte Carlo simulations. To quantify the uncertainty and compute the mean of the flux we use an MLMC algorithm. Let ω M be a random vector over a properly defined probability space, and Q M be the corresponding flux. It is known that E[Q M ] can be made arbitrarily close to E[Q] by choosing M sufficiently large. The standard MC algorithm converges very slowly, proportionally to the variance over the square root of the number of samples, which often makes it unfeasible. MLMC introduces L levels, with the L-th level coinciding with the considered problem, and exploits the telescopic sum identity E[Q L ] = E[Q 0 ] + Σ L l=1 E[Q l − Q l−1 ], where the notation Y l = Q l − Q l−1 is also used. The main idea of MLMC is to properly define the levels, and to combine a large number of cheap simulations, which are able to approximate the variance well, with a small number of expensive correction simulations providing the needed accuracy. For details on Monte Carlo and MLMC we refer to previous publications [32,47] and the references therein. Here, the target is to estimate the mean flux on a fine grid, and we define the levels as discretizations on coarser grids. In order to define the permeability at the coarser levels we use the renormalization approach. MLMC runs the computations at each level with the same tolerance. However, in order to evaluate the number of samples needed per level, one has to know the variance at each level. Because the exact variance is not known in advance, MLMC starts by performing simulations with a prescribed, moderate number of samples per level. The results are used to estimate the variance at each level, and thus to evaluate the number of samples needed per level. This procedure can be repeated several times in an Estimate-Solve cycle. At each estimation step, information from all levels is needed, which leads to a synchronization point in the parallel realization of the algorithm. This may require dynamic redistribution of the resources after each new evaluation.
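The Estimate step can be made concrete. The following Python sketch computes the per-level sample numbers from estimated level variances and costs using the standard MLMC sample-allocation formula for a target RMS tolerance eps; the numbers are illustrative:

```python
import math

def mlmc_sample_allocation(variances, costs, eps):
    # N_l = ceil( 2/eps^2 * sqrt(V_l/C_l) * sum_k sqrt(V_k * C_k) ),
    # which minimizes total cost subject to the sampling-error budget
    total = sum(math.sqrt(v * c) for v, c in zip(variances, costs))
    return [int(math.ceil(2.0 / eps ** 2 * math.sqrt(v / c) * total))
            for v, c in zip(variances, costs)]

# coarse levels: large variance but cheap; fine levels: small variance, expensive
V = [1.0, 0.25, 0.0625]   # estimated variances of Y_l
C = [1.0, 4.0, 16.0]      # estimated cost per sample of Y_l
N = mlmc_sample_allocation(V, C, eps=0.1)
```

Most samples land on the cheap coarse level, while only a few expensive correction samples are needed on the finest level, which is the source of the MLMC speedup.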
MLMC can provide coarse graining in the parallelization. A well-balanced algorithm has to account for several factors: (i) how many processes should be allocated per level; (ii) how many processes should be allocated per deterministic problem, including the permeability generation; (iii) how to parallelize the permeability generation; (iv) which of the parallelization algorithms for deterministic problems available in Exa-Dune should be used; (v) whether each level should be parallelized separately and, if not, how to group the levels for efficient parallelization. The last factor is the one providing coarse-grained parallelization opportunities. For the generation of the permeability, we use the parallel MPI implementation of the FFTW library. As the deterministic solver, we use a parallel implementation of the conjugate gradient scheme preconditioned with AMG, provided by Dune-ISTL. Both of them have their own internal domain decomposition.
We briefly discuss one Estimate-Solve cycle of the MLMC algorithm. Without loss of generality we assume 3-level MLMC. Suppose that we have already computed the required number of samples per level (i.e., we are after Estimate and before Solve). Let us denote by N_i, i ∈ {0, 1, 2}, the number of required realizations per level for Y_i, by p_i the number of processes allocated to Y_i, by p_i^gl the respective size of the group of processes working on a single realization, by n the number of realizations assigned to each group of levels, by t_i the respective time for solving a single problem once, and finally by p_total the total number of available processes. Then the total CPU time for the current Estimate-Solve cycle is T_CPU^total = Σ_{i=0}^{2} N_i t_i p_i^gl. Ideally each process should take T_CPU^p = T_CPU^total / p_total. Dividing the CPU time needed for one Y_i by T_CPU^p yields the (in general non-integer) ideal number of processes p_i^ideal for that level. To obtain an integer value for the number of processes allocated per level, we first construct a set of all possible splits of the original problem into subproblems (e.g., parallelize level 2 separately and the combination of levels 0 and 1, or parallelize all levels simultaneously, etc.). Each element of this set is evaluated independently, and all combinations of upper and lower bounds around p_i^ideal are calculated such that each candidate p_i is divisible by p_i^gl, Σ_{i=0}^{2} p_i ≤ p_total and p_i ≤ N_i p_i^gl. Traversing, computing and summing the computational time needed for each element gives a time estimate, and we select the element (grouping of levels) with minimal computational time. The distribution of the work on a single level can be tackled with a similar approach; due to the large dimension of the search tree a heuristic technique can be employed. Here we consider a simple predefined group size for each deterministic problem, having in mind that when using AMG the work for a single realization at a given level is proportional to the number of unknowns at that level.
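A minimal sketch of the per-level allocation using the notation above (N_i, t_i, p_i^gl, p_total); the simple rounding shown here stands in for the full search over level groupings, and all names are illustrative.

```python
import math

def allocate_processes(n_samples, t_solve, grp, p_total):
    """Sketch of the per-level process allocation (illustrative only).

    n_samples[i]: realizations N_i required for Y_i; t_solve[i]: time t_i
    for one realization; grp[i]: group size p_i^gl working on one
    realization; p_total: total number of available processes."""
    # Total CPU time of the cycle: sum over levels of N_i * t_i * p_i^gl.
    t_total = sum(n * t * g for n, t, g in zip(n_samples, t_solve, grp))
    t_per_proc = t_total / p_total            # ideal work per process
    alloc = []
    for n, t, g in zip(n_samples, t_solve, grp):
        p_ideal = n * t * g / t_per_proc      # ideal share for this level
        # Round down to a multiple of the group size, keep at least one
        # group, and never allocate more than N_i groups.
        p = max(g, g * math.floor(p_ideal / g))
        alloc.append(min(p, n * g))
    return alloc
```

Rounding down keeps the total allocation below p_total; the real algorithm additionally compares all admissible groupings of levels and picks the one with minimal estimated time.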
Numerical experiments. Results for a typical example are shown in Figure 15. The parameters are σ = 2.75, λ = 0.3. The tests were performed on SuperMUC-NG at LRZ Munich, on dual Intel Skylake Xeon Platinum 8174 nodes. Due to the stochasticity of the problem, we plot the speedup multiplied by the ratio of the tolerances. Renormalization has proven to be a very good approach for defining the coarser levels in MLMC. The proposed parallelization algorithm gives promising scalability results and is only weakly coupled to the number of samples that MLMC estimates. Although the search for an optimal solution is an NP-hard problem, the small number of levels enables a full traversal of the search tree. The algorithm can be further improved by automatically selecting the number of processes per group that solves a single problem. One may also consider introducing interrupts between the MPI communicators on a level to further improve the performance.

Land-Surface Flow Application
To test some of the approaches developed in the Exa-Dune project, especially the usage of sum-factorized operator evaluation for more complex problems, we developed an application that simulates coupled surface-subsurface flow for larger geographical areas. This topic is highly relevant for a number of environmental questions, from soil protection and groundwater quality up to weather and climate prediction.
One of the central aims of the new approach developed in the project is to be able to attach a physical meaning to the parameter functions used in each grid cell. This is not possible with the traditional structured grid approaches, as the necessary resolution would be prohibitive. To avoid the excessive memory requirements of completely unstructured grids, we build on previous results for block-structured meshes and use a two-dimensional unstructured grid on the surface, which is extruded in a structured way in the vertical direction. However, such grids require more flexible discretization schemes than the usual cell-centred Finite-Volume approaches.
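The semi-structured layout can be illustrated by a simple index map; the function names and the vertical-fastest numbering are illustrative assumptions, not the DUNE grid interface.

```python
def extruded_cell_index(surface_cell, layer, n_layers):
    """Index map for the semi-structured grid sketched above: a 2D
    unstructured surface mesh extruded into n_layers structured vertical
    layers. Numbering the vertical direction fastest keeps each column
    contiguous (illustrative, not the DUNE grid interface)."""
    return surface_cell * n_layers + layer

def column(surface_cell, n_layers):
    """All cells below one surface cell, i.e. one vertical pillar."""
    return [extruded_cell_index(surface_cell, k, n_layers)
            for k in range(n_layers)]
```

Only the 2D surface connectivity has to be stored unstructured; the vertical neighbours are implicit in the index arithmetic, which is what saves memory compared to a fully unstructured 3D grid.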

Modelling and numerical approach
To describe subsurface flow we use the Richards equation [52], ∂θ(ψ)/∂t − ∇·[k(ψ)(∇ψ + e_g)] + q_w = 0, where θ is the volumetric water content, ψ the soil water potential, k the hydraulic conductivity, e_g the unit vector pointing in the direction of gravity and q_w a volumetric source or sink term. In nearly all existing models for coupled surface-subsurface flow, the kinematic-wave approximation is used for surface flow, which only considers the surface slope as driving force and does not even provide a correct approximation of the steady-state solution. The physically more realistic shallow-water equations are rarely used, as they are computationally expensive. We use the diffusive-wave approximation, which retains the effect of water height on run-off, yields a realistic steady-state solution and is a realistic approximation for flow on vegetation-covered ground [2]: ∂h/∂t − ∇·(D ∇(h + z)) + f_c = 0, where h is the height of water over the surface level z, f_c is a source-sink term (which is used for the coupling) and the diffusion coefficient D is given by D = h^α / (C ‖∇(h + z)‖^{1−γ}), with ‖·‖ referring to the Euclidean norm and α, γ and C empirical constants. In the following we use α = 5/3 and γ = 1/2 to obtain Manning's formula, with a friction coefficient of C = 1.
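The diffusion coefficient of the diffusive-wave approximation can be sketched as a small helper (illustrative, with the Manning choice α = 5/3, γ = 1/2, C = 1 as defaults); note how it blows up as the surface gradient vanishes, which motivates the reformulation discussed below.

```python
def diffusion_coefficient(h, grad_norm, alpha=5.0 / 3.0, gamma=0.5, C=1.0):
    """Diffusive-wave coefficient D = h^alpha / (C * |grad(h+z)|^(1-gamma)).

    Illustrative sketch of the formula above; alpha = 5/3, gamma = 1/2
    recover Manning's formula. D diverges as grad_norm -> 0."""
    return h ** alpha / (C * grad_norm ** (1.0 - gamma))
```

For a unit water height, halving the slope norm from 1 to 0.25 already doubles D, and D grows without bound on nearly flat terrain.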
Both equations are discretised with a cell-centered Finite-Volume scheme and alternatively with a SWIPDG scheme in space (see Section 3) and an appropriate diagonally implicit Runge-Kutta scheme in time for the subsurface and an explicit Runge-Kutta scheme for the surface flow. Upwinding is used for the calculation of conductivity in subsurface flow [5] and for the water height in the diffusion term in surface flow.
First tests have shown that the formulation of the diffusive-wave approximation from the literature, as given by equation (4), does not result in a numerically stable solution if the gradient becomes very small, as then a gradient approaching zero is multiplied by a diffusion coefficient going to infinity. Much better behaviour is achieved by rewriting the equation as ∂h/∂t − ∇·( (h^α/C) · ∇(h + z)/‖∇(h + z)‖^{1−γ} ) + f_c = 0, where the rescaled gradient ∇(h + z)/‖∇(h + z)‖^{1−γ} always goes to zero when ∇(h + z) goes to zero, as long as γ < 1, and the new diffusion coefficient h^α/C is bounded.
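Numerically, the two formulations agree wherever the gradient does not vanish, but only the rescaled one has a well-defined limit at zero gradient. A scalar sketch with hypothetical helper names:

```python
def flux_naive(h, grad, alpha=5.0 / 3.0, gamma=0.5, C=1.0):
    """D * grad with the unbounded D from the original formulation;
    division by zero occurs for grad = 0 (illustrative sketch)."""
    return h ** alpha / (C * abs(grad) ** (1.0 - gamma)) * grad

def flux_rescaled(h, grad, alpha=5.0 / 3.0, gamma=0.5, C=1.0):
    """Equivalent rewritten flux (h^alpha / C) * grad / |grad|^(1-gamma):
    the rescaled gradient vanishes with grad for gamma < 1 and the
    coefficient h^alpha / C stays bounded (illustrative sketch)."""
    if grad == 0.0:
        return 0.0                    # the well-defined limit
    return (h ** alpha / C) * grad / abs(grad) ** (1.0 - gamma)
```

In a discretisation this means the flux no longer depends on the ratio of two quantities that both tend to zero, which is what removed the instability on nearly flat, nearly dry cells.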
Due to the very different time scales of surface and subsurface flow, an operator-splitting approach is used for the coupled system. A new coupling condition has been implemented, which is a kind of Dirichlet-Neumann coupling but guarantees a mass-conservative solution. With a given height of water on the surface (from the initial condition or the last time step, modified by precipitation and evaporation), subsurface flow is calculated with a kind of Signorini boundary condition: during infiltration, all surface water is infiltrated in one time step as long as the necessary gradient is not larger than the pressure resulting from the water ponding on the surface; during evaporation, potential evaporation rates are maintained as long as the pressure at the surface does not fall below a given minimal pressure. The advantages of the new approach are that it does not require tracking of the sometimes complicated boundary between wet and dry surface elements, that it yields no unphysical results, and that the solution is mass-conservative even if it is not iterated until convergence.
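The infiltration limit at the heart of this coupling can be sketched in scalar form; the names and the simple ponding bound are illustrative assumptions, not the implemented boundary condition.

```python
def infiltration_flux(h_surface, q_potential, dt):
    """Scalar sketch of the mass-conservative infiltration limit above:
    within one time step the subsurface may absorb at most the water
    actually ponding on the surface (hypothetical names, illustrative
    simplification of the Signorini-type condition)."""
    q_max = h_surface / dt            # all ponded water in one step
    return min(q_potential, q_max)
```

Capping the flux by the available water is what makes the splitting mass-conservative without tracking the wet/dry boundary explicitly.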
Parallelisation is obtained by an overlapping or non-overlapping domain decomposition (depending on the grid). However, only the two-dimensional surface grid is partitioned, whereas the vertical direction is kept on one process due to the strong coupling. Thus there is also no need to communicate the surface water height for the coupling, as the relevant data is always stored on the same process. The linear equation systems are solved with the BiCGStab solver from DUNE-ISTL with Block-ILU0 as preconditioner. The much larger mesh size in the horizontal direction compared to the vertical direction results in strong coupling of the unknowns in the vertical direction. The Block-ILU0 scheme provides an almost exact solver for the strongly coupled blocks in the vertical direction and is thus very effective. Furthermore, one generally has a limited number of cells in the vertical direction and extends the mesh in the horizontal direction to simulate larger regions, so the good properties of the solver are retained when scaling up the size of the system.

Performance Optimisations
As the time steps in the explicit scheme for surface flow can become very small due to the stability limit, a significant speedup can be achieved with a semi-implicit scheme, where the non-linear coefficients are calculated from the solution of the previous time step or iteration. However, if the surface is nearly completely dry this can lead to numerical problems, so an explicit scheme is still used under nearly dry conditions, with automatic switching between the two.
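The effect of freezing the non-linear coefficient can be illustrated on a scalar model problem dh/dt = −c(h)·h; this is a sketch of the semi-implicit idea only, not the surface-flow discretisation itself.

```python
def semi_implicit_step(h, dt, coeff):
    """One step for dh/dt = -coeff(h) * h with the nonlinear coefficient
    frozen at the previous solution, leaving a linear (here scalar) solve.
    Stable for any dt if coeff >= 0 (illustrative model problem)."""
    c = coeff(h)                      # lagged nonlinear coefficient
    return h / (1.0 + dt * c)         # implicit linear update

def explicit_step(h, dt, coeff):
    """The explicit counterpart, which needs dt * coeff(h) < 2 to stay
    stable and can go negative for large steps."""
    return h * (1.0 - dt * coeff(h))
```

With a stiff coefficient the explicit step overshoots into unphysical negative water heights, while the semi-implicit step stays positive regardless of the step size.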
While matrix-free DG solvers with sum-factorization can yield excellent per-node performance (Section 3.3), they are rather tedious to implement for new partial differential equations. Therefore, a code-generation framework is currently being developed in a project related to Exa-Dune [33]. The framework is used to implement an alternative optimized version of the solver for the Richards equation, as this is the computationally most expensive part of the simulations. Expensive material functions like the van Genuchten model, which involves several power functions, are replaced by cubic spline approximations, which allow a fast vectorized incorporation of flexible material functions to simulate strongly heterogeneous systems. Sum-factorisation is used in the operator evaluations for the DG discretization with a selectable polynomial degree.
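The idea of replacing an expensive retention curve by a pre-tabulated approximation can be sketched as follows; the van Genuchten parameters are made up, and a piecewise-linear table stands in for the cubic splines actually used, purely to keep the sketch short.

```python
import bisect

def van_genuchten_theta(psi, theta_r=0.05, theta_s=0.4, a=2.0, n=2.0):
    """van Genuchten retention curve theta(psi); parameters illustrative."""
    if psi >= 0.0:
        return theta_s                # saturated for non-negative potential
    m = 1.0 - 1.0 / n
    return theta_r + (theta_s - theta_r) * (1.0 + (a * abs(psi)) ** n) ** (-m)

def tabulate(f, lo, hi, num):
    """Pre-tabulate an expensive material function on [lo, hi]."""
    xs = [lo + i * (hi - lo) / (num - 1) for i in range(num)]
    return xs, [f(x) for x in xs]

def interpolate(xs, ys, x):
    """Evaluate the table (piecewise-linear here; the generated code uses
    cubic splines, which also give smooth derivatives)."""
    i = min(max(bisect.bisect_right(xs, x) - 1, 0), len(xs) - 2)
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return (1.0 - t) * ys[i] + t * ys[i + 1]
```

Evaluating the table involves only a lookup and a few multiply-adds with a fixed control flow, which is what makes the approximation easy to vectorize compared to the power functions of the original model.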
A special pillar grid has been developed as a first realisation of a 2.5D grid [33]. It adds a vertical dimension to a two-dimensional grid (which is either structured or unstructured). However, as the current version still produces a full three-dimensional mesh, future developments are necessary to exploit the full possibilities of the approach.

Scalability and Performance Tests
Extensive tests covering infiltration as well as exfiltration have been performed (e.g. Figure 16) to test the implementation and the new coupling condition. Good scalability is achieved in strong as well as weak scaling experiments on up to 256 nodes and 4096 cores of the bwForCluster in Heidelberg (Figure 17). Simulations for a large region with topographical data taken from a digital elevation model (Figure 18) have been conducted as well.
With the generated code-based solver for the Richards equation, a substantial fraction of the system's peak performance (up to 60 % on a Haswell CPU) can be utilized due to the combination of sum factorization and vectorisation (Figure 19). For the Richards equation (as for other PDEs before), the number of millions of degrees of freedom processed per second is independent of the polynomial degree with this approach. We measure a total speedup of 3 compared to the naive implementation in test simulations on an Intel Haswell Core i7-5600U 2.6 GHz CPU with first-order DG basis functions on a structured 32 × 32 × 32 mesh for subsurface and a 32 × 32 grid for surface flow. Even higher speedups are expected with higher-order basis functions and matrix-free iterative solvers. The fully-coupled combination of the Richards solver obtained with the code generation framework and surface run-off has been tested with DG up to fourth order on structured as well as unstructured grids. Parallel simulations are possible as well.

Conclusion
In Exa-Dune we extended the DUNE software framework in several directions to make it ready for the exascale architectures of the future, which will exhibit a significant increase in node-level performance through massive parallelism in the form of cores and vector instructions. Software abstractions are now available that enable asynchronicity as well as parallel exception handling, and several use cases for these abstractions have been demonstrated in this paper: resilience in multigrid solvers and several variants of asynchronous Krylov methods. Significant progress has been achieved in hardware-aware iterative linear solvers: we developed preconditioners for the GPU based on sparse approximate inverses, developed matrix-free operator application and preconditioners for higher-order DG methods, and our solvers are now able to vectorize over multiple right-hand sides. These building blocks have then been used to implement adaptive localized reduced basis methods, multilevel Monte-Carlo methods and a coupled surface-subsurface flow solver on up to 14k cores. The Exa-Dune project has spawned a multitude of research projects, running and planned, as well as further collaborations in each of the participating groups. We conclude that the DUNE framework has made a major leap forward due to the Exa-Dune project, and work on the methods investigated here will continue in future research projects.