ExaStencils – Advanced Multigrid Solver Generation

Present-day stencil codes are implemented in general-purpose programming languages, such as Fortran, C, Java or Python (or derivates thereof), combined with harnesses for parallelism, such as OpenMP, OpenCL or MPI. Project ExaStencils pursued a domain-specific approach with a language, called ExaSlang, that is stratified into four layers of abstraction, the most abstract being the formulation in continuous mathematics and the most concrete a full, automatically generated implementation. At every layer, the corresponding language expresses not only computational directives but also domain knowledge of the problem and platform to be leveraged for optimization. We describe the approach, the software technology behind it and several case studies that demonstrate its feasibility and versatility: high-performance stencil codes can be engineered, ported and optimized more easily and effectively.

We begin with the delineation of the domain and the mathematical challenges, distinctly from the computer science challenges, that ExaStencils addressed (see Section 2). We restricted our attention to the development of smoothers for geometric multigrid methods and the required analysis tools. In this domain, local Fourier analysis (LFA) is the method of choice to analyze the developed smoothers and the entire multigrid method. To make LFA useful for our purpose, we extended the method to periodic stencils, covering block smoothers and varying coefficients [11].
The first major computer science challenge was to cover, with one single source program, a wider range of multigrid solvers than is possible with contemporary implementations based on general-purpose host languages such as Fortran or C++ (see Section 3). To this end, we decided not to build ExaSlang on an existing, general-purpose host language but to make it an external DSL. We were able to demonstrate its flexibility early on in the project by providing a common ExaSlang program for Jacobi, Gauss-Seidel, and red-black solvers for finite differences and finite volumes with constant and linear interpolation and with restriction [53]. Later on, we introduced a way to specify data layout transformations simply by a linear expression, a feature that can aid the development process significantly [44] (see Subsections 3.1-3.2). A smaller task was to decide how to describe aspects of the execution platform in a platform-description language (TPDL) [73] (see Subsection 3.3).
The second major computer science challenge was to reach high performance with our approach to code generation on a wide range of architectures (see Section 4). One important aspect here is what information to provide at which layer (see Subsection 4.1). While the syntax of ExaSlang is partly inspired by Scala, Matlab and LaTeX, our target language is C++ with additional features that depend on the execution platform (see Subsection 4.2). We demonstrated that, with the help of our optimization techniques, weak scaling could be achieved on traditional cluster architectures such as the JUQUEEN supercomputer at the Jülich Supercomputing Center (JSC) [75,49]. We also achieved weak scaling on the GPU cluster Piz Daint at the Swiss National Supercomputing Centre (CSCS). Furthermore, we demonstrated the automatic generation of solvers that can be executed on emerging architectures such as ARM [51] and FPGA platforms [77,78].
One new concept that ExaStencils introduced into high-performance computing is that of feature-based domain-specific optimization [5] (see Subsection 4.6). The central idea is to view a source code, such as a stencil implementation or an application, as a member of a program family or software product line rather than as an isolated individual, and to describe the source code by its commonalities and variabilities with respect to the other family members in terms of features. A feature represents a concept of the domain (e.g., a type of smoother or grid) that may be selected and combined with others on demand. With this approach, a large search space of configuration choices can be reviewed automatically at the level of domain concepts, and the most performant choices for the application and execution platform at hand can be identified. To this end, we devised a framework of sampling and machine learning approaches [36,39,81] that allow us to derive a performance model of a given code that is parameterized in terms of its features. This way, we can express performance behavior in terms of concepts of the domain and automatically determine optimal configurations that are tailored to the problem at hand, which we have demonstrated in the domain of stencil codes [30,31,32,36,39,81,59] and beyond (e.g., databases, video encoders, compilers, and compression tools). Our framework integrates well with the other parts of ExaStencils that use and gather domain and configuration knowledge in different phases.
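A minimal sketch of the underlying idea, not the actual framework of [36,39,81]: treat each configuration as a binary feature vector and fit additive per-feature performance influences by least squares. All feature names and runtime numbers below are invented for illustration.

```python
import numpy as np

# Hypothetical features of a stencil-code family:
# [red-black ordering, temporal blocking, vectorization, GPU offload]
configs = np.array([
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 1],
], dtype=float)

# Synthetic "measured" runtimes: a base cost plus one influence per feature.
base = 10.0
true_influence = np.array([2.0, -3.0, -1.5, -2.5])  # seconds added/saved
runtimes = base + configs @ true_influence

# Fit a performance model parameterized in the features:
# runtime(c) ~= b + sum_i w_i * c_i, solved by least squares.
X = np.hstack([np.ones((len(configs), 1)), configs])
coeffs, *_ = np.linalg.lstsq(X, runtimes, rcond=None)

def predict(c):
    """Predicted runtime of an unmeasured configuration."""
    return coeffs[0] + np.asarray(c, dtype=float) @ coeffs[1:]
```

Once such a model is fitted from a sample of measured configurations, the most performant configuration can be picked without running every variant of the product line.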
Project ExaStencils came with several case studies whose breadth was to demonstrate the practicality and flexibility of the approach (see Section 5). The case studies are a central deliverable of ExaStencils. They include studies close to real-world problems: the simulation of non-Newtonian and non-isothermal fluids (see Subsection 5.3) and a molecular dynamics simulation (see Subsection 5.4).
We conducted two additional studies at the fringes of ExaStencils, exploring alternative approaches (see Section 6). In one, the option of an internal rather than external DSL was explored: ExaSlang 4 was embedded in the host languages Java and Ruby to study the trade-off between the effort of the language implementation and the performance gain of the target code [17]. The outcome was that an embedding is possible but, as expected, with a loss of performance (see Subsection 6.1). In the second study, we implemented a simple multigrid solver in SPIRAL [10] (see Subsection 6.2). The success of the SPIRAL project [24,61] a decade ago was a strong motivator for project ExaStencils. SPIRAL can handle simple algebraic multigrid solvers but would have to be extended for more complex ones.

Project Peripherals
Attempts at abstraction and automation in programming have received increased attention in the past two decades in the area of software engineering. High-performance computing has been comparatively conservative in going down this road. The reason is that the demands on performance are much higher than in general software engineering, and the architectures used to achieve it are more complex, notably with large numbers of loosely coupled processors.
The potential of an effective automation grows as the application domain shrinks. Promising domains are much smaller than those of any general-purpose programming language. The extreme is the compiler FFTW [25], which targets a single numerical problem, the fast Fourier transform. As just mentioned, SPIRAL widened the domain to linear transforms (and, lately, beyond [82]). By now, quite a number of optimizing code generators have been proposed that target stencil codes. Patus [18] has a strong focus on autotuning. The strong point of Pochoir [83] is the cache obliviousness of its target code. Pochoir addresses constant stencils and is based on the C dialect Intel Cilk, with limited portability to other platforms. Devito [52] performs symbolic computation via SymPy and targets shared-memory architectures, just like Snowflake [90]. Firedrake [63] is primarily for finite-element methods and has adopted multigrid lately. One recent component of it is Coffee [54], which addresses the local assembly in finite-element methods. STELLA [34] and its spiritual successor GridTools generate code for stencil computations on distributed structured grids for climate and weather modeling.
For the domain of image processing, many approaches based on code generation exist that are conceptually similar to the idea behind project ExaStencils. Notable projects include Halide [62], HIPAcc [56], PolyMage [58], and Mint [86]. The specification of image filter kernels is related to the concept of stencils, and many of the basic parallelization techniques are comparable. However, ExaSlang is fundamentally based on computational domains that are multidimensional, as opposed to the usually two-dimensional data structures used to represent images. Image processing DSLs usually target shared-memory parallel systems, i.e., single compute nodes such as multi-core CPUs or a single GPU. On the other hand, ExaStencils aims at the domain of HPC and, consequently, supports distributed-memory systems and respective parallelization techniques.

Multigrid
The goal of the ExaStencils project has been the automatic generation of efficient stencil codes for geometric multigrid solvers [85] on (semi-)structured grids. Multigrid methods form a class of iterative solvers for large and sparse linear systems. These methods have originally been developed in the context of the numerical solution of partial differential equations, which often involves the solution of such linear systems.
Multigrid methods combine two complementary processes: a smoothing process and a coarse-grid correction. While each of the two processes is by itself not sufficient to solve the problem efficiently, their combination yields a fast solver.
The smoothing process is usually a straightforward iterative method that converges rather slowly when used on its own. Combining such a process with a coarse-grid correction accelerates the convergence of the resulting method by augmenting the iteration with information obtained on a coarser grid. Since the grid determines the resolution at which the solution is being computed, a multigrid method considers the problem at different resolutions.
A multigrid method performs stencil computations on a hierarchy of grids, from a fine one to successively coarser ones. The overall cost of the method can be reduced further by applying the idea of the coarse-grid correction recursively, i.e., instead of two grids, we consider a hierarchy of successively coarser grids. The recursion follows a so-called cycling strategy, e.g., a V-cycle or a W-cycle (see Figure 1). The cycling strategy determines how much work is performed at which level of the grid hierarchy, which also has an effect on the convergence rate of the method. On the coarser grids, less processing power is required.
In summary, to construct a multigrid method, one must choose a set of components: a smoother, an interpolation, a restriction, a coarse-grid approximation, and a cycling strategy. The choice of components influences the number of iterations required to obtain an adequate solution, the computational cost per iteration, and the communication pattern of the method. Furthermore, the behavior of the method depends on the linear system to be solved.

Fig. 1: Cycling strategies across the grid hierarchy. The hierarchy levels are denoted by Ω_h, Ω_2h, and Ω_4h, progressing from the finest to the coarsest grid. Light circles solve the coarse-grid system directly, and dark circles solve the coarse-grid system recursively. Down arrows symbolize restrictions, up arrows interpolations.
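The interplay of these components can be made concrete in a small sketch. The following is our own minimal V-cycle for the 1D Poisson problem with weighted-Jacobi smoothing, full-weighting restriction, and linear interpolation; it illustrates the recursive algorithmic structure and is not ExaStencils-generated code.

```python
import numpy as np

def smooth(u, f, h, omega=2/3, steps=2):
    """Weighted-Jacobi smoothing for -u'' = f with homogeneous Dirichlet BCs."""
    for _ in range(steps):
        u[1:-1] += omega * 0.5 * (u[:-2] + u[2:] - 2*u[1:-1] + h*h*f[1:-1])
    return u

def residual(u, f, h):
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (2*u[1:-1] - u[:-2] - u[2:]) / (h*h)
    return r

def v_cycle(u, f, h):
    if len(u) <= 3:                       # coarsest grid: solve directly
        u[1] = h*h*f[1] / 2
        return u
    u = smooth(u, f, h)                   # pre-smoothing
    r = residual(u, f, h)
    rc = np.zeros((len(u) - 1)//2 + 1)    # full-weighting restriction
    rc[1:-1] = 0.25 * (r[1:-2:2] + 2*r[2:-1:2] + r[3::2])
    ec = v_cycle(np.zeros_like(rc), rc, 2*h)   # recursive coarse-grid correction
    e = np.zeros_like(u)                  # linear interpolation of the correction
    e[2:-1:2] = ec[1:-1]
    e[1::2] = 0.5 * (e[0:-1:2] + e[2::2])
    u += e
    return smooth(u, f, h)                # post-smoothing

n = 64                                    # n intervals, n - 1 interior points
h = 1.0 / n
x = np.linspace(0.0, 1.0, n + 1)
f = np.pi**2 * np.sin(np.pi * x)          # exact solution: sin(pi x)
u = np.zeros(n + 1)
for _ in range(8):
    u = v_cycle(u, f, h)
```

Each V-cycle reduces the algebraic error by a roughly constant factor, so a handful of cycles suffices to push the iteration error below the discretization error.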

Advancing the Mathematical Analysis of Multigrid Methods
At the start of project ExaStencils stood the search for a way to estimate the number of iterations that a multigrid method requires to produce an adequate solution. The choice of multigrid components determines the operations that must be performed per iteration. Hence, to estimate the time required per iteration, one just needs to place these operations in an appropriate order and estimate the duration of their execution. While this is by no means a trivial endeavor, estimating the number of iterations needed is inherently different. In this section, we deal with the latter problem.
To estimate the number of iterations, we must consider the mathematical properties of a given multigrid method for a given set of components and a given problem. For this purpose, we must determine the contraction properties of the iteration operator of the multigrid method for a given configuration. To this end, local Fourier analysis (LFA) tells us whether the repeated application of the iteration operator lets the error converge to zero. In particular, we are interested in the resulting rate of convergence.
For project ExaStencils, we had to extend the capabilities of LFA. We wanted to analyze block smoothers, which have a higher arithmetic density than standard smoothers (see Subsection 2.3), and complex problems for which no feasible way of applying LFA was known. Furthermore, we aimed at an automation of the analysis.
At the beginning of project ExaStencils, we considered using and expanding an existing LFA software package [88]. However, in the course of the project, it quickly became evident that we needed a more general approach, something beyond a collection of templates that allow one to fill in some blank spots but do not facilitate the reuse or recombination of components. Since we intended to explore many possible combinations of various multigrid components, it was not feasible to program an individual LFA by hand for every combination. We developed LFA Lab [65], which is based on the principle of combining a set of primitive expressions into complicated ones, enabling a much more flexible analysis.
A multigrid method, as well as its components, can be characterized by its iteration operator. An iteration operator describes the propagation of the approximation error during the execution of the method. If we denote the error after the k-th iteration with e^(k) and the iteration matrix of the method with E, we have the following equation:

  e^(k+1) = E e^(k) .

The spectral radius ρ(E) and the operator norm ‖E‖ of E characterize the asymptotic and worst-case error reduction factors of the method, respectively. LFA determines these quantities in a simplified setting. Let h, with 0 < h ∈ ℝ, be the step size of the infinite d-dimensional grid G_h := h · ℤ^d. We consider stencil operators on the space of bounded grid functions on G_h; given stencil coefficients s_y, such an operator acts as (A u)(x) = Σ_y s_y u(x + h y). Stencil operators have a particularly simple form when transformed via the discrete-time Fourier transform.
The discrete-time Fourier transform (DTFT) F is a linear isometry that maps the space ℓ_2(G_h) of square-summable grid functions onto the space L_2(Θ_h) of square-integrable functions on the bounded frequency domain Θ_h. In other words, it represents a function on an infinite grid by a function on a continuous and bounded domain. A stencil operator A in Fourier space, i.e., the operator F A F^{-1}, is just the multiplication by a function â : Θ_h → ℂ. We call the function â the symbol of A. Thus, the symbol of a stencil operator is a function that encodes the entire behavior of the infinite-dimensional operator.
The symbol â of a stencil operator reveals the desired information about the operator A. We have that

  ρ(A) = ess-sup_{θ ∈ Θ_h} |â(θ)|  and  ‖A‖ = ess-sup_{θ ∈ Θ_h} |â(θ)| ,

where ess-sup denotes the essential supremum. These formulas mean that, to compute the spectral radius or the operator norm of a stencil operator, we must just compute the largest value of the absolute modulus of its symbol â. Note that, in the definition of stencil operators, we have assumed that the stencil s does not depend on the position x. However, it can be useful to consider operators whose stencil is allowed to change with the position.
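As a concrete illustration (our own sketch, not LFA Lab code): for the standard five-point Poisson stencil, the symbol can be written down in closed form and its modulus sampled over the frequency domain.

```python
import numpy as np

h = 1.0
theta = np.linspace(-np.pi, np.pi, 257)   # sampled (scaled) frequency domain
t1, t2 = np.meshgrid(theta, theta)

# Symbol of the 2D five-point Poisson stencil (1/h^2)[[0,-1,0],[-1,4,-1],[0,-1,0]]:
# each stencil entry s_y contributes s_y * exp(i y . theta).
a_hat = (4 - 2*np.cos(t1) - 2*np.cos(t2)) / h**2

# Spectral radius and operator norm as the sampled essential supremum of |a_hat|.
rho = np.max(np.abs(a_hat))
```

The sampled maximum, 8/h², matches the familiar upper bound on the eigenvalues of the discrete Laplacian, attained at the highest frequency θ = (π, π).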
A stencil that depends arbitrarily on the position has no particularly simple form in the Fourier domain. However, we were able to show that periodic stencils do have a simple representation [11,66]. A periodic stencil depends on the position, but has the same entries repeated periodically across the entire domain. While this does not accurately represent stencils that vary across the entire domain, at least some variability is reflected in the analysis. We showed that a periodic stencil operator is described, after a proper reordering of the frequency domain, by a matrix of ordinary symbols, more precisely, by matrix symbols from the space L_2^{n×m}(Θ_h') for some appropriate positive h'. Furthermore, we were able to show that there is a one-to-one relationship between matrix symbols and periodic stencils.
Using matrix symbols, similar results for the spectral radius and operator norm hold. For an operator A given by a periodic stencil, we have that

  ρ(A) = ess-sup_{θ} ρ(â(θ))  and  ‖A‖ = ess-sup_{θ} ‖â(θ)‖ .

Thus, to obtain the norm and spectral radius of the operator, we must find the largest value of the norm and spectral radius of the matrix â(θ). The framework of periodic stencils and matrix symbols allows for more advanced problems to be analyzed. It also lends itself to automation via software. Operators that have a matrix symbol can be combined in many ways such that the combination also has a matrix symbol. Thus, we can create a flexible LFA software by using the idea of providing first a set of primitive expressions and then means of combination and abstraction [1].
For example, the iteration operator of the weighted Jacobi method is

  E = I − ω D^{-1} A ,

where A is the system matrix, D the diagonal part of A, I the identity matrix and ω ∈ ℝ a weighting factor. If we assume that the behavior of A can be modelled sufficiently accurately by a (periodic) stencil operator on an infinite grid, we can replace each matrix by the infinite-dimensional operator given by the corresponding (periodic) stencils to simplify the analysis. Listing 1 shows the computation of the spectral radius of the iteration operator of the Jacobi method for the stencil resulting from the discretization of the Poisson equation using our software LFA Lab [65]. LFA Lab can be used as a simple Python [67] library, but it is essentially an embedded DSL for LFA. The user provides the (periodic) stencils, which yield the corresponding operators. Then, the user combines these operators with interpolation and restriction into an expression describing the desired iteration operator.

Listing 1: Implementation of an LFA of the weighted (ω = 0.8) Jacobi smoother for the solution of the Poisson equation using LFA Lab.
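Listing 1 itself is not reproduced here. As a plain-NumPy stand-in for the same analysis (our own sketch rather than LFA Lab's API), the symbol of E = I − ωD⁻¹A for the five-point Poisson stencil can be sampled directly:

```python
import numpy as np

omega, h = 0.8, 1.0
theta = np.linspace(-np.pi, np.pi, 256, endpoint=False)
t1, t2 = np.meshgrid(theta, theta)

a = (4 - 2*np.cos(t1) - 2*np.cos(t2)) / h**2   # symbol of A (5-point Poisson)
d = 4 / h**2                                   # symbol of D (diagonal of A)
e = 1 - omega * a / d                          # symbol of E = I - omega D^-1 A

rho_all = np.max(np.abs(e))                    # supremum over all frequencies

# Restricted to the high frequencies, the supremum is the smoothing factor.
high = (np.abs(t1) >= np.pi/2) | (np.abs(t2) >= np.pi/2)
smoothing_factor = np.max(np.abs(e[high]))
```

Over all frequencies the supremum is 1 (the smooth mode θ = 0 is not reduced), which is exactly why the smoother must be paired with a coarse-grid correction; over the high frequencies alone, the classical smoothing factor 0.6 for ω = 0.8 emerges.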
The code in Listing 1 is essentially a direct implementation of the formula for the iteration operator. However, keep in mind that this formula actually describes a combination of infinite-dimensional operators. When calling the symbol method, the software automatically determines a way to represent the given iteration operator via its Fourier matrix symbol, which is a non-trivial procedure.
To compute the matrix symbol of an expression given by the user, appropriate sampling parameters must be determined. For this purpose, LFA Lab has two stages. The first stage extracts and analyzes the expression tree of the formula that the user entered. The second stage then samples the matrix symbol to obtain the spectral radius and operator norm. The power of the software lies in the fact that arbitrarily complex expressions can be analyzed.
A more complex example is the analysis of the two-grid method. It has the iteration operator

  E_TG = S (I − P A_c^{-1} R A) S ,

where A_c is the coarse-grid operator, P the interpolation, R the restriction, and S the iteration operator of the smoother. Assume that we already have an analysis for the smoother. If we can express P, R, and A_c using Fourier matrix symbols, we can combine these with the analysis of the smoother we already have to analyze the two-grid method in its entirety.
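In one dimension, this combination can be carried out in a few lines. The following is our own NumPy sketch with one pre-smoothing step of ω = 2/3 Jacobi and textbook transfer operators (the normalizations of P and R are our choice): the frequencies θ and θ + π couple, so every operator is represented by a 2×2 matrix symbol.

```python
import numpy as np

def two_grid_symbol(theta, omega=2/3, h=1.0):
    """2x2 Fourier matrix symbol of E_TG = (I - P A_c^{-1} R A) S
    for 1D Poisson, one weighted-Jacobi pre-smoothing step."""
    th = np.array([theta, theta + np.pi])        # the two coupled harmonics
    a = (2 - 2*np.cos(th)) / h**2                # fine-grid operator A
    S = np.diag(1 - omega * (1 - np.cos(th)))    # weighted-Jacobi smoother
    ac = (2 - 2*np.cos(2*theta)) / (2*h)**2      # coarse-grid operator A_c
    P = (1 + np.cos(th)).reshape(2, 1)           # linear interpolation
    R = ((1 + np.cos(th)) / 4).reshape(1, 2)     # full-weighting restriction
    cgc = np.eye(2) - (P @ (R * a)) / ac         # coarse-grid correction
    return cgc @ S

# Worst-case spectral radius over the reduced frequency interval (0, pi/2].
rho = max(np.max(np.abs(np.linalg.eigvals(two_grid_symbol(t))))
          for t in np.linspace(0.01, np.pi/2, 200))
```

The sampled worst case reproduces the classic two-grid convergence factor of 1/3 for this setting, far below the smoothing-only analysis.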
In summary, we have constructed a powerful and flexible LFA software package. The flexibility comes from a small set of primitive expressions and means of combination and abstraction. The periodic stencil operators are the primitive expressions, mathematical operations are a means of combination, and the Python programming language provides the opportunity of abstraction. The software is used to estimate the convergence rate of a multigrid method for a given set of components and a given problem. The estimate takes the form of the number of iterations the multigrid method needs to achieve adequate accuracy.

Advancing Multigrid Components
The choice of the smoothing component in a multigrid method is not always straightforward. Some problems require advanced smoothers. One example is the steady-state Stokes equations, which can be written as

  −Δu + ∇p = f  in Ω,
  ∇ · u = 0    in Ω,

for a given domain Ω with boundary ∂Ω. Here, u is the vector-valued fluid velocity, p is the pressure, and f describes an external force. That an advanced smoother is needed can be easily appreciated when considering that, since the system contains a zero block, a pointwise relaxation is not possible [85].
This linear system of PDEs can be discretized on staggered grids or by using appropriate finite elements. Here, we consider staggered grids in two dimensions. For the Stokes equations, the efficient Vanka smoother has a relatively high computational cost and unsatisfactory parallelization properties due to a process of overlapping block-smoothing steps (see Figure 2). This made us consider the Triad smoother as an alternative: it provides low computational cost in combination with good parallelization properties [19].
Both block smoothers are based on a collective updating process of the unknowns inside one block [87]. As depicted in Figure 2, the Vanka blocks consist of five unknowns, including one pressure and two velocity components in each direction, while the Triad smoother comprises the simultaneous update of three unknowns, including one unknown of each kind.
Numerical results for periodic boundary conditions, in combination with parallelization properties and computational work, show the potential of the Triad relaxation method. However, numerical results for the Stokes system with Dirichlet boundary conditions show that the Triad method in its original form has one issue: the convergence rate deteriorates tremendously. To illustrate this, we applied the method to the Stokes equations discretized on a staggered grid in the unit square, with periodic and with Dirichlet boundary conditions. As right-hand side, we chose in both cases

  f_{u_x}(x, y) = 2π² sin(πx) sin(πy) + π cos(πx),
  f_{u_y}(x, y) = 2π² cos(πx) cos(πy) − π sin(πy),

and f_p ≡ 0; the initial guess was zero. Figure 3 shows the different convergence behaviors.
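The text does not state the exact solution behind this right-hand side; it is consistent with the manufactured solution u = (sin(πx) sin(πy), cos(πx) cos(πy)) and p = sin(πx) + cos(πy), an assumption that the following sketch checks numerically with central differences.

```python
import numpy as np

pi = np.pi
ux = lambda x, y: np.sin(pi*x) * np.sin(pi*y)   # assumed velocity, x component
uy = lambda x, y: np.cos(pi*x) * np.cos(pi*y)   # assumed velocity, y component
p  = lambda x, y: np.sin(pi*x) + np.cos(pi*y)   # assumed pressure

f_ux = lambda x, y: 2*pi**2 * np.sin(pi*x)*np.sin(pi*y) + pi*np.cos(pi*x)
f_uy = lambda x, y: 2*pi**2 * np.cos(pi*x)*np.cos(pi*y) - pi*np.sin(pi*y)

def lap(g, x, y, h=1e-4):    # central-difference Laplacian
    return (g(x+h, y) + g(x-h, y) + g(x, y+h) + g(x, y-h) - 4*g(x, y)) / h**2

def dx(g, x, y, h=1e-6):
    return (g(x+h, y) - g(x-h, y)) / (2*h)

def dy(g, x, y, h=1e-6):
    return (g(x, y+h) - g(x, y-h)) / (2*h)

x, y = 0.3, 0.7                                  # an arbitrary interior point
r1 = -lap(ux, x, y) + dx(p, x, y) - f_ux(x, y)   # x-momentum residual
r2 = -lap(uy, x, y) + dy(p, x, y) - f_uy(x, y)   # y-momentum residual
div = dx(ux, x, y) + dy(uy, x, y)                # continuity residual
```

All three residuals vanish to discretization accuracy, so the stated right-hand side indeed corresponds to a divergence-free flow.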
The convergence properties of the Triad smoother can be improved with the following idea: repeat the relaxation process four times while changing the unknowns contained in one block after each iteration, i.e., rotating the "L"-shaped pattern that describes the block to be relaxed. This algorithm, illustrated in Figure 4, improves the convergence significantly. Applying the four iterations of the proposed smoother is very similar to applying one iteration of Vanka. Thus, a smoother that had not been considered as an option before becomes a viable alternative to established smoothers for systems of PDEs. In addition, the order in which the boxes are updated can be varied. For more details and further results, see the dissertation of Lisa Claus [19].

Stencil-Specific Programming in ExaStencils
The central elements of the ExaStencils approach are its stencil-specific programming language ExaSlang and its code generator. In addition, there is a language for specifying properties of the execution platform. The two languages are discussed in this section; the code generator is the subject of the following section.

The Domain-Specific Language ExaSlang
The idea of ExaStencils is to support the domain-specific programming and optimization of stencil codes by providing different layers of abstraction, specifying the various aspects of the stencil code at the respectively suitable layer, and exploiting domain information available at that layer for an optimization of the specification.ExaSlang comes in four layers: from ExaSlang 1, the most abstract, to ExaSlang 4, the most concrete (see Figure 5).
• Layer 1: the continuous problem
This is the layer for the scientist or engineer who needs the solution of the PDE. The problem is specified as a continuous equation. The present implementation supports Unicode and LaTeX symbols. Optional inputs are the specification of discretization and solver options used to autogenerate lower layers. There is also support for an automatic finite-difference discretization of operators.

• Layer 2: the discrete problem
This is the most abstract layer that provides an executable description. Discretized functions are fields (data type, grid location), tied to a computational grid.

• Layer 4: the complete program
This layer of ExaSlang can describe a full application, including communication, input/output, evaluation and visualization, but is still more abstract than C++ or Java in that it contains language constructs specific to multigrid. Optimizations at this layer include the tuning of communication patterns and data structure layouts.
Yet more concrete, but not accessible to the user, is an intermediate representation (IR) in which most code refinements take place and which forms the basis for generating the target C++ code. One may write one's program at the ExaSlang layer of one's choice and let the code generator refine it to a more concrete layer. One may also modify generated code to implement certain aspects that cannot be expressed at the chosen programming layer. However, the IR is not meant to be modified by the user.
Through its four layers, ExaSlang evolves from a declarative language at layer 1 to an imperative programming language at layer 4. Besides the standard data types that represent floating-point and integer numbers or strings, domain-specific data types represent vectors and matrices to be used for coupled systems of equations. The assembly of global vectors and matrices is not supported, since we focus on local computations using stencils. In contrast, stencils are declared globally in ExaSlang programs. Furthermore, fields, corresponding to vectors that may represent the PDE's right-hand side or an approximation of the unknown function, are declared globally as well. In these declarations, certain settings important to parallelization may be specified by the user. One example is the size of ghost layers (also called halo layers), as depicted in Listing 5 in Section 3.2.

An ExaSlang Example
To illustrate how our language stack can be used to implement different aspects of solvers for partial differential equations (PDEs), let us take the Poisson equation −Δu = f for a given domain Ω, here the unit square, with Dirichlet boundary conditions g = cos(πx) − sin(2πy) and the right-hand side f = π² cos(πx) − 4π² sin(2πy). Listing 2 shows an exemplary layer 1 code for this specification. From it, our generator is able to derive representations at subsequent layers. The refinement methods employed in this process are described in Subsection 4.1 further below. Listings 3 and 4 illustrate variants similar to the auto-generated ones, expressing the discretized version of the given equation (Listing 3) and the multigrid algorithm used to solve it (Listing 4). Based on them, a complete layer 4 program can be assembled, comparable to the one illustrated in Listing 2. In the example code at the lower layers, parts of the source code have been omitted for the sake of compactness. A complete specification and examples of other PDEs are part of Sebastian Kuckuk's dissertation [47].

The Target-Platform Description Language
To be able to optimize a code adequately, one must know details of the execution platform. Our code generation process is governed by more than one hundred parameters that select specific code refinements or set device-specific properties. Examples include the use of vector units on CPUs and the corresponding instruction set, e.g., SSE or AVX on x86 CPUs, NEON on ARM-based CPUs or even QPX for IBM's POWER architecture. This yields a design space that is too large for users to be able to specify a (near-)optimal configuration of code generation settings. However, we may be able to derive sensible parameters from a structured description of the target platform. To this end, one element of ExaSlang is the so-called target platform description language (TPDL) [73]. One design goal was to increase the modularity and reusability of hardware component descriptions, such as CPUs or accelerators, to let users compose systems from a repository of ready-made parametric snippets. By treating these in a fashion similar to the class concept in object-oriented programming languages, users can create instantiations with parameters set appropriately. A short example describing an accelerator card is provided in Listing 6. It enumerates a number of technical hardware details, but also contains important software information on the compute capability, i.e., which features of the target technology CUDA may be used.
A small yet flexible library supports information retrieval from a TPDL specification. This enables DSL developers to check for certain information and to aggregate or evaluate characteristics of the target platform. Developers need not worry about the instantiations and their parameters but may just use the discrete specification, since all the required processing has already been done. For many recurring tasks, predefined queries are available. Listing 7 shows a predefined query that returns all working units in a system. Its result is filtered to retain only working units that consume more than 50 W of electrical power.

Listing 7: Query to enumerate working units consuming more than 50 W of power.
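TPDL syntax is not reproduced here, but the flavor of such a query can be mimicked in plain Python (an illustrative model with made-up component names, not the actual TPDL library):

```python
from dataclasses import dataclass

@dataclass
class WorkingUnit:       # illustrative stand-in for a TPDL component
    name: str
    kind: str            # e.g. "cpu" or "accelerator"
    power_watts: float

# A small hypothetical system description.
system = [
    WorkingUnit("host-cpu", "cpu", 95.0),
    WorkingUnit("gpu-0", "accelerator", 250.0),
    WorkingUnit("io-controller", "other", 12.0),
]

# Analogue of the predefined query in Listing 7: enumerate all working
# units, then filter for those consuming more than 50 W.
hungry = [u.name for u in system if u.power_watts > 50.0]
```

The point of such queries is that generator components can reason over the platform description declaratively instead of hard-coding device knowledge.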

Refinement of ExaSlang Programs
The overall goal of ExaStencils has been to enable users to choose the layer most appropriate for them and code exclusively at this layer, e.g., by providing a continuous formulation of the problem to be solved at layer 1 and nothing more. Ideally, our framework would then automatically derive suitable discretizations (layer 2), solver components (layer 3) and parallelism particulars (layer 4). However, in practice, this requires domain knowledge whose automatic inference is beyond the capabilities of present software technology. We address this issue by introducing hint specifications, which allow us to progress automatically to subsequent layers of ExaSlang. For example, at layer 1, discretization hints may be supplied. As Listing 8 shows, continuous functions are discretized at certain points of a computational grid, such as node positions. We also support the specification of operator discretization using finite differences. An optional renaming is also possible at this stage. Then, we combine the equation provided at layer 1 and the discretization hints to synthesize a discretized form of the equation at layer 2.
While layer 2 expresses the problem to be solved, its solution is specified at layer 3. This can be achieved either by implementing a suitable iterative solver by hand or by issuing a directive for our generate solver interface. Listing 9 illustrates such a directive. For cases in which no further modification is required, matching solver hints may be provided at the upper layers to set up the code in Listing 9 automatically, allowing users to work exclusively at one layer. The generated solver is by default a geometric multigrid variant. In the concrete case of Listing 9, our framework would generate a standard V-cycle using three pre- and post-smoothing steps of a red-black Gauss-Seidel (RBGS) smoother and a conjugate gradient (CG) coarse-grid solver. Frequently, minor adaptations of the generated solver are necessary. One way to make them is to take the implementation generated at layer 3, adapt it, and replace the original generate solver directive with the result. A more generic approach is to add modifiers to the generate solver directive. They usually target a certain stage of the multigrid solver, e.g., the restriction or the correction, at one or more levels of the hierarchy. These stages can either be replaced completely with custom layer 3 code, or arbitrary layer 3 statements may be added to be executed before or after the stage. A more complex option exists for smoother stages, as illustrated in Listing 10 for the case of a block smoother to be used when solving the Stokes equations on a staggered grid (see Section 5).
Listing 10: ExaSlang 3 code for a smoother stage for the 2D Stokes problem.
After the algorithmic specification at layer 3 is completed, it must be transformed to a program at layer 4. This requires, most importantly, the addition of data layout information for fields, loops to compose kernels, and suitable communication statements. Listings 11 and 12 illustrate the first half of an RBGS smoother. This description of features is by no means complete. More detail is available in Sebastian Kuckuk's dissertation [47].
We designed the language stack of ExaSlang to enable maximum flexibility for users. They can either work at one level exclusively and make use of the hint system to generate more concrete specifications automatically, or implement different parts of their application at multiple levels, or mix both approaches.

Generation of Target Code
After the domain-specific program has been refined to a complete program specification (ExaSlang 4), the IR code is subjected to many non-algorithmic transformations (recall Figure 5 on page 12). Among others, parallelization techniques are applied to the code (see Subsection 4.4), memory layouts are selected and applied to variables (see Subsection 4.3), required code fragments are inserted, and finally the program is written to disk as C++ files.
To express all these steps in a short syntax, we developed the code-transformation framework Athariac [74]. It allows us to modify the tree-based representation of a program, called the abstract syntax tree (AST), by using simple yet powerful rewrite rules. Essentially, we specify a pattern in the AST, such as the node that represents a certain ExaSlang statement, and then define a structure with which to replace it. Of course, we can also remove nodes by specifying an empty replacement. A single rewrite rule is called a transformation, and a group of transformations is called a strategy. In Listing 13, a simple strategy for code-generation-time evaluation of mathematical expressions is presented. First, the two identifiers E and PI are replaced by their constant numerical values. Then, we look for constant numerical values to be summed, e.g., 2.1 + 3.1415, and replace them with the actual result of the addition. Note that the constant-folding transformation must be applied multiple times for expressions involving multiple additions.
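The rewrite-rule mechanism can be illustrated with a minimal sketch. The following Python snippet is not Athariac code; the tuple encoding of the AST and all helper names are invented for illustration. It mimics the strategy of Listing 13: identifiers are first resolved to constants, then constant additions are folded until a fixed point is reached.

```python
# Toy rewrite-rule engine: expressions are nested tuples ("+", lhs, rhs),
# identifiers are plain strings, literals are numbers.
import math

def replace_identifiers(node):
    """Transformation 1: replace the identifiers E and PI by literals."""
    if node == "E":
        return math.e
    if node == "PI":
        return math.pi
    return node

def fold_constants(node):
    """Transformation 2: evaluate an addition of two numeric literals."""
    if isinstance(node, tuple) and node[0] == "+":
        lhs, rhs = node[1], node[2]
        if isinstance(lhs, (int, float)) and isinstance(rhs, (int, float)):
            return lhs + rhs
    return node

def apply(transformation, node):
    """Apply a transformation bottom-up over the whole tree."""
    if isinstance(node, tuple):
        node = (node[0],) + tuple(apply(transformation, c) for c in node[1:])
    return transformation(node)

def strategy(ast):
    """A strategy: chain transformations, folding until a fixed point."""
    ast = apply(replace_identifiers, ast)
    while True:
        new_ast = apply(fold_constants, ast)
        if new_ast == ast:
            return ast
        ast = new_ast

# ("+", 2.1, ("+", "PI", "E")) encodes the expression 2.1 + (PI + E)
expr = ("+", 2.1, ("+", "PI", "E"))
print(strategy(expr))  # a single folded numeric value
```

A real strategy would match on typed AST node classes rather than tuples, but the control flow — pattern match, replace, repeat to a fixed point — is the same.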
By chaining and conditional execution of strategies in a fixed order, ExaSlang programs can be refined for specific program configurations. Depending on the selected code-generation parameters and input-program size, between 200 and 300 transformations are required to generate C++ code. In Figure 6, the trajectory of a program from the intermediate representation to C++ output is depicted to illustrate the basic approach that applies to every program. Most of the optimizations are applied along this trajectory. Naturally, a concrete code-generation path may differ, as users can skip certain optimizations, or transformations may not be applied because their preconditions are not satisfied. For example, it does not make sense to generate code for heterogeneous target devices if none exist. In Figure 7, program sizes during the code-refinement process of the three-dimensional optical flow application (see Subsection 5.3) are depicted for a selection of different parallelization techniques. For all variants, we see the different stages of the process. First, information on the program is gathered and preparations are made, i.e., array sizes and memory layouts are set. Next, minor code refinements are applied, such as the replacement of stencil convolutions with their corresponding computational rules. Then, depending on the parallelization technique, memory layouts are imposed and loops are generated or modified. The execution times of Athariac's code generator are usually on the same order of magnitude as an invocation of the target C++ compiler, i.e., within a few seconds to minutes. More detail on Athariac can be found in a recent issue of the Proceedings of the IEEE [74] and in Christian Schmitt's dissertation [72].

Target-Specific Optimization
Stencil codes have a very simple structure, but they are difficult to optimize since the memory bandwidth usually acts as a performance brake. In project ExaStencils, a diverse set of optimizations has been developed to overcome this. We describe a small selection briefly here. More detail on all implemented techniques, including those not presented here, can be found in Stefan Kronawitter's dissertation [43].

Data Layout Transformations
Much of the effort of adapting stencil codes from one application or execution platform to another goes into making the data layout fit for best efficiency. One major limiting factor is usually the available memory bandwidth. Examples of such adaptations are color splitting for multi-color kernels or switching between an array of structs (AoS) and a struct of arrays (SoA).
ExaSlang admits flexible layout transformation directives for the specification of arbitrary affine transformations, which are then applied automatically by the code generator [44]. Even though different variants of a color splitting or an SoA-to-AoS transformation may be most commonly used, the ExaStencils code generator is by no means restricted to them. Moreover, the only modification required to adapt the memory layout of any field is the insertion of an affine function that specifies the transformation. No other part of the application, including data initialization and communication, must be modified to make the new layout generally available. Explicit, albeit very simple, modifications to other parts are only necessary if they are meant to use different data arrangements. In this case, additional fields must be inserted for each layout, and copy kernels become necessary.
This approach avoids unnecessary changes in the source code and constitutes a big advance in the ease of testing and evaluating different memory layout schemes in order to identify the best one. There are other systems that offer similar transformation devices, but not in this generality.
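To make the idea concrete, here is a hypothetical Python sketch: switching a two-component field of n cells between AoS and SoA storage amounts to exchanging a single affine index function, while the initialization kernel stays untouched. All names are illustrative; ExaSlang expresses this with a layout transformation directive instead.

```python
# Two affine layouts for a field with n cells and 2 components per cell.
n = 4

def aos(i, c):          # array of structs: layout(i, c) = 2*i + c
    return 2 * i + c

def soa(i, c):          # struct of arrays: layout(i, c) = i + n*c
    return i + n * c

def init(layout):
    """Layout-agnostic kernel: it only sees the affine index function."""
    buf = [0.0] * (2 * n)
    for i in range(n):
        for c in range(2):
            buf[layout(i, c)] = 10.0 * i + c
    return buf

print(init(aos))  # [0.0, 1.0, 10.0, 11.0, 20.0, 21.0, 30.0, 31.0]
print(init(soa))  # [0.0, 10.0, 20.0, 30.0, 1.0, 11.0, 21.0, 31.0]
```

The same kernel body produces both data arrangements; only the affine function changes, which is precisely what makes a single-directive layout switch possible.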

Polyhedral Code Exploration
Besides a layout modification, an affine transformation can lead to the most efficient implementation of a frequent structure in stencil codes: the loop nest. A popular approach to selecting such transformations automatically is the polyhedron model for loop optimization [22]. As for the data layout, the search here is also for best data locality. However, established automatic transformation techniques based on the PLuTo algorithm [13,6,12] fail to yield optimal results.
Our first attempt to select better transformations was a specialized variant of the PLuTo algorithm available in the ExaStencils code generator. While it is capable of detecting very good schedules for some stencil codes, it can also produce problematic ones. For example, it fails completely for RBGS kernels. Thus, we developed a new, optimized, multi-dimensional polyhedral search-space exploration for the ExaStencils code generator [46] that obtains better results in several cases than existing approaches, such as different PLuTo variants or PolyMage [58]. It also has the capability of specializing the search for the domain of stencil codes, which reduces the exploration effort dramatically without significantly impairing performance. An extreme but still beneficial approach is to choose the first schedule selected by our specialized search without any further evaluation. This may not lead to the best performance, but it avoids the overhead of a complete exploration, and the performance improvement is still satisfactory: in most experiments it was only a few percentage points below the best variant explored.

Vectorization
A third optimization focuses on the vector units available in most processor architectures, which provide single-instruction-multiple-data (SIMD) parallelism. Their use is typically mandatory for highest performance. However, each architecture comes with its own vector instruction set. Intel x86 features Streaming SIMD Extensions (SSE) and Advanced Vector Extensions (AVX) in several different versions, IBM's BlueGene/Q provides the Quad Processing eXtension (QPX), and some ARM processors implement the Neon instruction set. Even though all of these sets target the same problem, their implementations differ not only in detail but also in key aspects, such as the way in which data can be fetched from main memory.
As a remedy, contemporary compilers are equipped with rudimentary automatic vectorization capabilities, most of which are, unfortunately, not very effective. On top of that, the more advanced compilers can exclude some popular architectures and be costly or not widely available. Since the ExaStencils code generator comes with its own vectorization phase [43,74], it avoids any dependence on a special target compiler. Currently, it supports Intel SSE3, AVX, and AVX2, as well as IBM's QPX and ARM Neon.

Parallelization
To parallelize ExaSlang applications automatically, mainly two concepts must be implemented [49]. First, data must be partitioned and distributed across the available compute resources and, second, data between the partitions must be synchronized periodically. We realize the former by splitting our computational domain into blocks, which are further subdivided into fragments. Each fragment holds a part of the computational grid and all data associated with it. This hierarchical approach is depicted in Figure 8 and permits an efficient mapping to different execution platforms. For instance, blocks can be mapped to MPI ranks while each fragment inside is handled by a distinct OpenMP thread. Mapping single fragments to accelerators is also possible, as explained later on. The synchronization of data can be controlled by users at layer 4 via communicate directives. They specify the field to be communicated and can additionally be parameterized to communicate only certain parts, e.g., specific ghost layers. Each communicate directive triggers the generation of a function that implements the corresponding behavior. This permits the reuse of functions, which can become quite lengthy, if the same communication is reissued throughout an application. The function contains code specifying an MPI exchange for fragments residing in different blocks and simple copy kernels for fragments inside the same block. For the inter-block case, this includes copying the required data to buffers, calling the appropriate MPI functions, waiting for their completion, and copying back the data received. The communication buffers required are set up automatically by our framework and, additionally, are shared between multiple communication routines where possible. This minimizes the memory footprint as well as the allocation and deallocation overhead.
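The intra-block part of such a generated communication routine can be sketched as follows. This is an illustrative Python toy for a 1D grid with one ghost layer per side, not generated code; the inter-block case would replace the direct copies with MPI send/receive calls through shared buffers.

```python
# Two neighboring 1D fragments synchronize their shared ghost cells
# via simple copy kernels (the intra-block case).
GHOST = 1

def make_fragment(values):
    # layout: [left ghost] + interior + [right ghost]
    return [0.0] * GHOST + list(values) + [0.0] * GHOST

def communicate(left, right):
    """Fill the ghost cells shared by two adjacent fragments."""
    left[-1] = right[GHOST]        # left fragment's right ghost gets
                                   # right fragment's first interior cell
    right[0] = left[-1 - GHOST]    # and vice versa

a = make_fragment([1.0, 2.0, 3.0])
b = make_fragment([4.0, 5.0, 6.0])
communicate(a, b)
print(a)  # [0.0, 1.0, 2.0, 3.0, 4.0]
print(b)  # [3.0, 4.0, 5.0, 6.0, 0.0]
```

Generating one such function per communicate directive, as described above, lets identical exchanges reuse the same routine and its buffers.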
Kernels updating data in one fragment can additionally be parallelized.This can be done either with OpenMP or, if available, via an accelerator, e.g., a GPU or reconfigurable hardware such as an FPGA.

Accelerators
To accelerate computations with Nvidia GPUs, multiple steps during code generation are required, and we developed a back-end to emit corresponding CUDA code (CUDA is a proprietary application programming interface and de-facto standard for Nvidia GPUs). First, fields must be extended to manage a CPU and a GPU version. At this point, we also introduce flags that keep track of field data being changed on the host or device side. Next, compute kernels are transformed to their CUDA counterparts and wrapped by an interface function passing variables and fields used inside the kernel as parameters. Then we replace the original kernel with a user-controlled condition branching to either the original CPU kernel or the new GPU kernel. Both variants are extended to set the flags described previously after execution. The same flags can also be used to control the use of the added copy operations between host and device. This ensures that data is always synchronized correctly while the overhead is minimized.
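The flag mechanism can be sketched as follows. All names in this Python toy are hypothetical; a real back-end would issue CUDA memory copies where this sketch copies lists.

```python
# Dirty-flag tracking for a field with a host and a device copy:
# data is transferred only when the consuming side is stale.
class Field:
    def __init__(self, data):
        self.host = list(data)
        self.device = list(data)
        self.host_dirty = False     # host copy newer than device copy?
        self.device_dirty = False   # device copy newer than host copy?

    def to_device(self):
        if self.host_dirty:         # copy only when actually needed
            self.device = list(self.host)
            self.host_dirty = False

    def to_host(self):
        if self.device_dirty:
            self.host = list(self.device)
            self.device_dirty = False

def cpu_kernel(f):                  # runs on the host
    f.to_host()
    f.host = [2 * x for x in f.host]
    f.host_dirty = True             # flag set after execution

def gpu_kernel(f):                  # stand-in for a CUDA kernel launch
    f.to_device()
    f.device = [x + 1 for x in f.device]
    f.device_dirty = True

f = Field([1.0, 2.0])
cpu_kernel(f)    # host becomes [2.0, 4.0]
gpu_kernel(f)    # device first receives the host data, then adds 1
f.to_host()
print(f.host)    # [3.0, 5.0]
```

The user-controlled branch described above simply decides which of the two kernel variants to call; the flags guarantee correctness either way.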
As an alternative to CUDA, SYCL is a new technology for the inclusion of the OpenCL ecosystem into the world of C++. It strives to combine the strengths of both worlds, e.g., by allowing the use of custom data structures and templates inside computational kernels, or by providing an implementation of the parallel C++ Standard Template Library that can be executed not only on multicore CPUs but also on accelerators. Furthermore, it aims at a reduction of OpenCL boilerplate code by enabling the direct issuance of parallel_for statements instead of forcing the user to declare kernels, map their arguments, etc. Additionally, it automatically detects and initiates memory transfers between host and device.
We had to specialize our code-generation workflow slightly to address the characteristics of SYCL. While, in ExaSlang, the dimensionality of a kernel can be arbitrarily large, SYCL only supports up to three-dimensional kernels, i.e., 3D iteration domains. Because of this, we implemented a custom mapping to linearize a kernel with more than three dimensions to a single dimension while maintaining correct indexing. Furthermore, SYCL currently has no built-in support for reduction operators, such as the sum of all data items. Since multigrid methods depend on reduction operators for the computation of norms used to approximate the reduction of the residual, we implemented our own reduction operator and used a template function that is called in the corresponding places. Our approach uses standard parallelization techniques that should perform well on CPUs and GPUs.
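Such a linearization can be sketched in a few lines; this Python toy assumes row-major ordering, while the generated SYCL code performs the equivalent index arithmetic inline in the kernel.

```python
# Flatten an N-dimensional iteration domain to one dimension and
# reconstruct the original multi-index from the 1-d work-item id.
from functools import reduce
from operator import mul

def linear_size(shape):
    """Total number of work items for an N-d domain."""
    return reduce(mul, shape, 1)

def delinearize(flat, shape):
    """Recover the N-d index (row-major) from a 1-d work-item id."""
    idx = []
    for extent in reversed(shape):
        idx.append(flat % extent)
        flat //= extent
    return tuple(reversed(idx))

shape = (2, 3, 4, 5)            # a 4D kernel, beyond SYCL's 3D limit
assert linear_size(shape) == 120
print(delinearize(0, shape))    # (0, 0, 0, 0)
print(delinearize(119, shape))  # (1, 2, 3, 4)
```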
SYCL separates work items into work groups in the same sense in which these concepts are used by CUDA and OpenCL. While this concept is not paramount, our code generator can optionally specify the granularity of work groups and modify data accesses inside the computational kernels correspondingly.
SYCL is merely a specification, as of yet without a complete and optimized implementation. We verified our SYCL back-end with the commercial product ComputeCpp by Codeplay and also with the experimental open-source implementation triSYCL [33]. Although SYCL still needs to mature, it has great potential as a unified middleware for a multitude of devices: in addition to targeting CPUs and GPUs, the two major vendors of reconfigurable hardware, Xilinx and Intel/Altera, are actively working on accepting SYCL code. This will allow us to extend the pool of ExaSlang-supported hardware targets without having to develop a separate code-generator back-end, such as the one presented in the next paragraph.

Reconfigurable Hardware
A particularly interesting target platform for scientific computing is reconfigurable hardware, such as in field-programmable gate arrays (FPGAs). They are increasingly used in high-performance computing because of their high throughput and energy efficiency. We started by extending an existing code generator for image processing to produce a description of a simple V-cycle for further processing using C/C++-based high-level synthesis via Vivado HLS for Xilinx FPGAs [68,69]. This laid the basis for a custom Vivado HLS back-end in our code generator, resulting in the ability to emit algorithm descriptions for Vivado HLS stemming from user input in ExaSlang [77]. To test the capabilities of the HLS back-end, we considered a starting grid of size 4096 × 4096 and used a V(2,2)-cycle. For the coarse-grid solution at a size of 32 × 32, a corresponding number of Jacobi smoother iterations was applied. The biggest difference to the execution on a CPU or GPU is that all multigrid operators were laid out spatially on the FPGA chip. Another essential consideration in mapping algorithms to reconfigurable hardware is the handling of memory and data transfers. FPGAs provide only very scarce amounts of on-chip memory, called block RAM, that can be accessed at full speed. This implies that data needs to be stored off-chip, e.g., in RAM that can be found on the FPGA board or even in the host computer's memory. As a consequence, these transfers need to be supported by the hardware with corresponding buffers, which are mapped to first-in-first-out (FIFO) hardware buffers. In an FPGA, FIFO buffers can either be composed of registers or implemented by on-chip block RAM.
Another important concern in designing reconfigurable hardware descriptions is the usage of computational resources on the chip. For our initial experiments, we used single-precision floating-point data types and were able to fit the complete design on a mid-range Kintex 7 FPGA. On this chip, our design was able to outperform an Intel i7-3770 in terms of throughput by a factor of approximately 3. Switching from single to double precision does not change any performance aspects but merely requires more chip resources. As a result, we had to switch to the larger Virtex 7 FPGA to house the design but were able to retain the performance advantage.
Based on these results, we worked on incorporating a more relevant coarse-grid solution algorithm into our hardware designs. By a clever arrangement of the required memory buffers and transfers, we were able to implement a conjugate-gradient solver and map it to the Virtex 7 using double-precision floating-point numbers [78]. This algorithm is usually not well suited for FPGAs because of its random memory-access patterns, which break the preferable pipelining model. As a consequence, data needs to be stored on-chip in block RAM to retain performance. However, this takes away resources required by other multigrid components, e.g., smoothers and inter-grid operators. We conducted several experiments and were able to outperform an Intel Xeon E5-1620 v3 in many cases [78]. As an optimization, we overlapped successive V-cycles, i.e., we were able to reuse multigrid operators on the FPGA when they were no longer required by the previous V-cycle iteration, provided the data dependences were respected. Theoretically, this should nearly double the throughput, which was confirmed in our numerical experiments.

Compositional Optimization
The ExaStencils framework enables the automatic generation of geometric multigrid solvers from a high-level representation, either by using the generate solver interface or by means of a direct specification of its algorithmic components at layer 3. However, in many cases, the construction of an efficient and scalable multigrid method is a difficult task requiring extensive knowledge in numerical mathematics. Therefore, it is typically performed manually by a domain expert. Part of ExaStencils' vision is to automate all steps from the specification of a problem in the continuous domain (layer 1) to the generation of an efficient implementation on the target platform. To achieve this goal, we treat the construction of geometric multigrid methods for solving systems of linear equations as a program optimization task. We have developed a context-free grammar for the automatic construction of solver instances from a given set of algorithmic options, whose production rules are shown in Figure 9 for the case of pointwise smoothers. Each rule defines the list of expressions by which a certain production symbol can be replaced. To generate a multigrid expression, starting with the symbol S, this process is repeated recursively until the produced expression contains only terminal symbols or the empty string λ. A more detailed description, including the semantics of the expressions generated following this grammar, can be found elsewhere [79].
Fig. 9: A formal grammar for the generation of multigrid solvers.
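The derivation process can be sketched as follows. The production rules in this Python toy are invented placeholders and do not reproduce Figure 9; they only illustrate how nonterminals are expanded recursively until only terminal symbols remain (λ is modeled as an empty option).

```python
# Toy grammar-based construction of a solver expression. Nonterminals
# are strings in angle brackets; everything else is a terminal symbol.
import random

RULES = {
    "<S>":      [["<smooth>", "<coarse>", "<smooth>"]],
    "<smooth>": [["Jacobi"], ["RBGS"], []],              # [] plays the role of λ
    "<coarse>": [["restrict", "<S>", "prolongate"], ["CG"]],
}

def derive(symbol, rng, depth=0):
    if symbol not in RULES:                 # terminal symbol: emit as-is
        return [symbol]
    options = RULES[symbol]
    if depth > 3:                           # bound recursion: avoid options
        options = [o for o in options if "<S>" not in o] or options
    choice = rng.choice(options)
    return [t for part in choice for t in derive(part, rng, depth + 1)]

rng = random.Random(42)
print(" ".join(derive("<S>", rng)))   # prints a terminal-only expression
```

A real implementation derives expressions whose semantics encode complete multigrid cycles, as described in [79]; the recursion-depth bound here stands in for the finite level hierarchy.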
The effectiveness of an iterative method is determined by two objectives: its rate of convergence and its compute performance on the target platform. To estimate both objectives for a given multigrid solver expression in reasonable time, we employ automated local Fourier analysis (LFA), as described in Subsection 2.2, for the first and a roofline performance model [89] for the second objective. Our goal is to obtain the set of Pareto-optimal solvers with respect to both objectives. A direct enumeration and evaluation of all possible configurations leads to a number of operations that increases exponentially with the number of multigrid levels. Thus, a brute-force search is infeasible in most cases. As a remedy, we employ evolutionary computation, a family of stochastic optimization methods stemming from the field of artificial intelligence and inspired by the principle of biological evolution. These methods evolve a population of candidate solutions, which are iteratively improved through the use of so-called genetic operators. The operators are specifically designed for the given problem representation, in our case the set of expressions that can be generated according to the grammar described in Figure 9. After the optimization is completed, we transform the expression of each Pareto-optimal solver to an algorithmic representation, which is emitted in the form of ExaSlang 3 code that can then serve to evaluate the solver's rate of convergence and compute performance on the target hardware. As a first step, we have implemented our optimization approach in Python, using the libraries LFA Lab [65] and DEAP [23]. We point the interested reader to preliminary results of the optimization of multigrid solvers for the steady-state heat equation on a multi-core CPU [79].
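The Pareto-selection step can be illustrated with a small sketch; the candidate solvers and their (convergence factor, runtime estimate) scores below are invented for demonstration, with both objectives to be minimized.

```python
# Pareto-optimal selection over two objectives.
def dominates(a, b):
    """a dominates b if it is no worse in both objectives, better in one."""
    return all(x <= y for x, y in zip(a, b)) and a != b

def pareto_front(candidates):
    """Keep every candidate not dominated by any other candidate."""
    return {name: obj for name, obj in candidates.items()
            if not any(dominates(other, obj)
                       for other in candidates.values())}

candidates = {
    "V(1,1)+Jacobi": (0.40, 1.0),
    "V(2,2)+RBGS":   (0.10, 1.8),
    "V(1,1)+RBGS":   (0.25, 1.2),
    "V(3,3)+Jacobi": (0.30, 2.5),   # dominated by V(2,2)+RBGS
}
print(sorted(pareto_front(candidates)))
# ['V(1,1)+Jacobi', 'V(1,1)+RBGS', 'V(2,2)+RBGS']
```

In the actual workflow, LFA supplies the convergence factor, the roofline model supplies the runtime estimate, and an evolutionary algorithm generates the candidate expressions instead of a fixed list.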
In the future, we intend to integrate this implementation into the generate solver interface for the automatic construction of efficient and scalable geometric multigrid solvers based on a given specification.

Feature-Based Domain-Specific Optimization
One new concept that ExaStencils introduced into high-performance computing is that of feature-based domain-specific optimization [5]. The central idea is to look at a source program (e.g., a stencil code or application) not as an isolated individual but as a member of a program family or software product line, and to specify it by its commonalities and variabilities with respect to the other family members in terms of features. A feature represents a domain concept (e.g., a type of smoother or grid) that may be selected and combined with others on demand. With this approach, a wide search space of configuration choices can be reviewed automatically at the level of domain concepts, and the most performant choices for the application and execution platform at hand can be identified.
Features and configuration options can have a significant influence on the performance of the generated code. The influences of some might already be known to the developer of the system, but other influences may be opaque, as may be influences that arise from interactions among features, called feature interactions. Also, the influences that one may expect based on theory and domain knowledge may not match the actual influences in the program, which often depend on implementation details.
Ideally, a developer or user knows the optimal choices for all features. The goal must then be to exploit domain knowledge to identify the most performant configuration. But since, as just explained, domain knowledge is unevenly developed and often remains incomplete, ExaStencils resorts to machine learning to derive a feature-specific performance model [81] that provides a comprehensible description of the effects of features and feature interactions on performance. The machine-learning techniques employ a set of configurations, the learning set or sample set, as input to learn from.
To this end, we have developed a framework of configuration-sampling and machine-learning approaches [36,39,81,59] that allows us to derive a performance model of a given code that is parameterized in terms of its features. This way, we can express performance behavior in terms of concepts of the domain and determine automatically optimal configurations that are tailored to the problem at hand, which we have demonstrated in the domain of stencil codes [30,32,31] and beyond (e.g., databases, video encoders, compilers, and compression tools) [36,39,81]. Our framework integrates well with the other parts of ExaStencils that use and gather domain and configuration knowledge in different phases. It is similar in spirit to the performance-modeling tool Extra-P, developed in the SPPEXA projects Catwalk and ExtraPeak [14], though we concentrate on flexibility in choosing and combining different sampling and learning techniques.
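In spirit, a learned performance-influence model is a sum of terms for individual features and feature interactions. The following Python sketch, with invented feature names and coefficients, shows how such a model predicts the performance of a configuration.

```python
# A performance-influence model as a dictionary mapping feature
# combinations to their (learned) influence on execution time.
MODEL = {
    (): 10.0,                        # base execution time
    ("vectorize",): -2.0,            # individual feature influences
    ("tile",): -1.5,
    ("vectorize", "tile"): 1.0,      # a feature interaction term
}

def predict(config):
    """Sum all model terms whose features are all enabled in config."""
    return sum(coef for features, coef in MODEL.items()
               if all(config.get(f, False) for f in features))

print(predict({"vectorize": True, "tile": True}))  # 10 - 2 - 1.5 + 1 = 7.5
print(predict({"vectorize": True}))                # 10 - 2 = 8.0
```

Because every term names concrete features, such a model is directly readable by a domain expert, which is what makes it useful for validating domain knowledge as described below.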
To demonstrate the usefulness of the machine-learning approach, we performed experiments on a large number of configurable software systems from different domains, among them different multigrid solvers including an implementation in ExaSlang [30,31,32]. To settle on a suitable learning set, we analyzed the influence of different sampling strategies on the accuracy of the performance-influence models [81].
Performance-influence models are not only meant to determine optimal configurations but also to validate and refine domain knowledge. We must be sure that the observed influences on the performance of the system match our postulated knowledge; otherwise the system or the knowledge base must be revised. In a case study of a multigrid system working on triangular meshes, albeit not written in ExaSlang, we were able to validate the domain knowledge of the developers of the system [31].
For complex systems, learning a performance-influence model that captures accurately the performance behavior of all individual configurations is often time-consuming, and the resulting models can become very large (i.e., contain many terms describing only a small influence on the performance). However, it is not always necessary to learn such a complex model; it might be beneficial to learn a simpler model faster that is less accurate but accurate enough for a given use case [39]. In a set of experiments, we have demonstrated the tradeoff between accuracy, complexity, and time to learn a performance-influence model, and shown that even simple models can predict all configurations with high accuracy. Depending on the use case of the performance-influence model, it can be of greater value to get quickly a rough impression of the most relevant influences of configuration options rather than to wait for a long time for a more detailed but also more complex model.

Case Studies
An essential ingredient of ExaStencils was a collection of case studies to test our approach for target performance and versatility. The more illustrative ones are sketched in this section. Section 5.1 deals with scalar elliptic PDEs, Section 5.2 with image-processing applications, and Section 5.3 with aspects of computational fluid dynamics. Section 5.4 details how generated solvers can be coupled with existing code, in our case molecular dynamics simulations. Finally, Section 5.5 goes beyond code generation and discusses a design-decision study of porous-media applications.

Scalar Elliptic Partial Differential Equations
A prominent and well-researched example is the PDE −∇ · (a∇u) = f, which describes the steady state of the distribution of a physical quantity u : ℝ^d → ℝ, such as heat in a solid medium, for a given domain Ω of dimensionality d, suitable boundary conditions, right-hand side f : ℝ^d → ℝ, and a thermal conductivity a : ℝ^d → ℝ of the material to be simulated. Assuming a finite-difference discretization on a uniform grid, it can be observed that a constant a leads to a stencil with constant coefficients, whereas a variable a leads to one with variable coefficients. In the latter case, we always store all stencil coefficients, rather than values of a, unless specified otherwise. Lastly, if a is 1, the equation simplifies to the Poisson equation introduced in Section 3.2, which is given by −∇²u = −∆u = f. In previous publications [75,49,44,43], we demonstrated our ability to generate highly optimized and massively parallel geometric multigrid solvers for this model problem. For example, we achieve weak scalability up to the full JUQUEEN supercomputer, i.e., using up to 458 752 cores across 28 672 nodes, for moderately sized problems of 1 to 16 million unknowns per core. This is a good result considering that the amount of data to be communicated, e.g., via MPI, is quite high compared to the required amount of computation. Figure 10 summarizes the obtained time to solution for up to 7.3 · 10¹² unknowns using a hybrid MPI/OpenMP parallelization, for which one V(3,3)-cycle takes about 3.6 s. A similar scaling behavior can also be observed in a scaling experiment on Piz Daint (see Figure 11). With a largest problem of 7.3 · 10¹² unknowns solved, the ExaStencils software reaches for the Poisson equation a size comparable to the finite-element approach [27] developed in the TerraNeo project [8], where the solution of a stabilized tetrahedral linear finite-element discretization of a Stokes system is demonstrated with 1.7 · 10¹³ unknowns. This is based on a Uzawa-type smoother [20] for the Stokes system. A first direct comparison has been made recently [41].
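The effect of a variable coefficient on the stencil can be made concrete with a small sketch. This Python toy discretizes the 1D analogue −(a u′)′ = f with unit grid spacing, sampling a at cell interfaces; the sampling choice is our own illustration, not the generated code.

```python
# 1D finite-difference stencil for -(a u')' = f with grid spacing h = 1:
# at point i the stencil is (-a_{i-1/2}, a_{i-1/2} + a_{i+1/2}, -a_{i+1/2}).
def stencil_at(i, a):
    """Return the 3-point stencil (west, center, east) at grid point i."""
    a_w = 0.5 * (a[i - 1] + a[i])   # a at interface i - 1/2
    a_e = 0.5 * (a[i] + a[i + 1])   # a at interface i + 1/2
    return (-a_w, a_w + a_e, -a_e)

const = [1.0] * 5
var   = [1.0, 2.0, 4.0, 2.0, 1.0]
print(stencil_at(1, const), stencil_at(2, const))  # identical everywhere
print(stencil_at(1, var), stencil_at(2, var))      # entries differ per point
```

With constant a the stencil collapses to the familiar (−1, 2, −1), so only one set of coefficients needs to be stored; with variable a, the coefficients must be stored per grid point, exactly as described above.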
In a further single-node case study, we compared ExaStencils against the High-Performance Geometric Multigrid (HPGMG) benchmark [2]. Here, on one core of an Intel Xeon E5-2630 v2, the vectorized code generated by Athariac (see Subsections 4.2 and 4.3) was able to solve 14.6 times as many unknowns per second as HPGMG. A closer inspection, using likwid [84], revealed that this performance gain is mainly due to a higher MFLOPS rate and a much better memory-bandwidth exploitation of the solvers generated by ExaStencils. Further details, including also software-productivity metrics (e.g., lines of code and Halstead complexity measures), are available elsewhere [74]. As an example, see also our case study on molecular dynamics (Subsection 5.4).

Image Processing
Next, we consider applications in variational image processing. State-of-the-art denoising algorithms based on total generalized variation have been implemented in ExaSlang [26] using a preconditioned Douglas-Rachford iteration.
Computation of the optical flow refers to the approximation of the apparent motion between two or more images that are part of an image sequence I. We used this application to illustrate ExaSlang's higher-dimensional data types [76].
Let us assume that I(x, y, t) refers to a specific pixel (x, y) in an image sequence at time t, where t could refer to a certain frame in a video stream. Furthermore, we assume that a moving object's intensity does not change over time. Then, for small movements (corresponding to small time differences between two images), we may describe the movement of an intensity value at a pixel (x, y, t) as follows:

I(x + dx, y + dy, t + dt) = I(x, y, t).

After performing a Taylor expansion and reordering factors, we can define the partial image derivatives I_x = ∂I/∂x, I_y = ∂I/∂y and I_t = ∂I/∂t. This leads to the temporal gradient ∇_θ I = (I_x, I_y, I_t)^T and the optical flow vector (u, v) = (dx/dt, dy/dt), which can be transformed to the following system of PDEs to be solved:

−α∆u + I_x (I_x u + I_y v) = −I_x I_t,
−α∆v + I_y (I_x u + I_y v) = −I_y I_t.

Here, α denotes the diffusion coefficient, which we set to 1 in this example. Furthermore, we set the time gradient I_t also to 1 for simplification purposes. After discretization, we obtain a stencil that can be directly mapped to ExaSlang and used in a smoother kernel, as illustrated in Listing 14.
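For illustration, a scalar Python sketch of the corresponding Jacobi update for the u-component (assuming α = 1, unit grid spacing, and a 5-point Laplacian; v is treated analogously) could look as follows. Listing 14 expresses the same kind of computation in ExaSlang on the vector-valued field.

```python
# One Jacobi update of u at interior point (i, j) for the system
# -alpha*Lap(u) + Ix*(Ix*u + Iy*v) = -Ix*It, discretized with the
# 5-point Laplacian (h = 1).
def jacobi_update_u(u, v, Ix, Iy, It, i, j, alpha=1.0):
    neigh = u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]
    rhs = -Ix[i][j] * It[i][j] - Ix[i][j] * Iy[i][j] * v[i][j]
    return (alpha * neigh + rhs) / (4.0 * alpha + Ix[i][j] ** 2)

# toy 3x3 data: zero image gradients leave a constant u unchanged
u = [[1.0] * 3 for _ in range(3)]
v = [[0.0] * 3 for _ in range(3)]
zero = [[0.0] * 3 for _ in range(3)]
print(jacobi_update_u(u, v, zero, zero, zero, 1, 1))  # 1.0
```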
All corresponding variables, such as the unknowns u and v, are represented in ExaSlang as vectors of corresponding dimensionality. Thanks to our code-generation approach, we can easily conduct parameter studies. For example, we varied the number of pre- and post-smoothing steps between 0 and 6 to study their influence on the convergence rate (see Christian Schmitt's dissertation [72] for details). Furthermore, we switched vectorization for Intel's AVX2 instruction set on and off. For our measurements, we used a single compute node equipped with an Intel i7-6700 CPU. We varied the number of threads between one (1T) and eight (8T) and used likwid [84] for our measurements, which is a supported back-end for ExaSlang's profiling capabilities.
Listing 14: Jacobi smoother definition using slots for the flow field.
Fig. 12: Parameter study of different parallelization approaches and V-cycle configurations for the two-dimensional optical flow application.
A comparison of the energy consumption in relation to the application's execution time is depicted in Figure 12. Clearly, the application is limited by the available memory bandwidth, as a speedup can be observed when going from one to two threads, but additional usage of CPU cores does not improve the speedup further. The use of AVX2 yields a small performance improvement compared to the scalar variants and, consequently, results in lower energy consumption. The V-cycle configurations that performed best in this case study are the V(4,2)-cycle, the V(3,3)-cycle, and the V(2,4)-cycle.

Computational Fluid Dynamics
Our next area of application is computational fluid dynamics (CFD), where we considered the solution of the Stokes and Navier-Stokes equations [47,48].
Let us turn first to the Stokes equations as introduced in Subsection 2.3. We discretized the given linear system of PDEs using finite differences and finite volumes on staggered grids. Additionally, more advanced multigrid components are required, including overlapping block smoothers that may additionally be colored. In ExaSlang, both can be expressed concisely and intuitively, as demonstrated in Subsection 4.1. Choosing a suitable coarse-grid solver is also straightforward in our framework, as it can be configured via our generate solver directive. Currently, we support conjugate gradients (CG), conjugate residuals (CR), the biconjugate gradient stabilized method (BiCGSTAB) and the minimal residual method (MINRES), all with optional restart mechanisms, as well as simply applying the smoother. Of course, implementing one's own solver at layer 3 or layer 4 is still possible. We found that, in the collection of variants above, BiCGSTAB with an added restart mechanism works reasonably well for the present test case [47].
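The generate solver directive is resolved inside the generator; as a plain Python sketch of what such a generated coarse-grid solver amounts to (the function names and the restart wrapper are ours, not ExaSlang's), consider textbook conjugate gradients wrapped in a simple restart mechanism:

```python
import numpy as np

def cg(A, b, x0, tol=1e-12, maxiter=100):
    """Textbook conjugate gradients for a symmetric positive definite A."""
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    rr = r @ r
    for _ in range(maxiter):
        if np.sqrt(rr) < tol:
            break
        Ap = A @ p
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

def restarted(solver, A, b, x0, restarts=3, **kw):
    """Restart mechanism: rebuild the Krylov space from the current iterate."""
    x = x0
    for _ in range(restarts):
        x = solver(A, b, x, **kw)
    return x

# toy SPD system standing in for a coarse-grid problem
n = 32
A = (np.diag(2.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1)
     + np.diag(-np.ones(n - 1), -1))
b = np.ones(n)
x = restarted(cg, A, b, np.zeros(n), maxiter=40)
```

Restarting discards the accumulated search directions, which costs some convergence speed but bounds memory and can improve robustness, which is why the generated solvers offer it as an option.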
Extending this work towards the full Navier-Stokes equations is the next step. For incompressible fluids, they can be given as

∂u/∂t − ν Δu + (u · ∇) u + (1/ρ) ∇p = f_u,
∇ · u = 0

on a given domain Ω, which is square in 2D and cubic in 3D, with suitable boundary conditions and with u, p and f_u as before, ρ as the density and ν as the kinematic viscosity given by the ratio of the dynamic viscosity µ and ρ. The switch from Stokes to Navier-Stokes raises two challenges. First, since the equations become time-dependent, a suitable time integration is required. We chose a traditional implicit Euler scheme, which can be implemented easily at layer 3 or layer 4. For each time step, one can then call either a generated solver variant or a custom implementation. Second, the employed solver has to be able to deal with non-linear equations. To this end, we allow expressing suitable linearizations, such as those based on Newton's method and Picard iterations [47,21], at layer 2. The linearization chosen can be applied either locally or globally, and both variants can be expressed in tandem with the discretization approach at layer 2. Then, non-linear multigrid solvers based on the full approximation scheme can be tailored at layer 3 automatically. Scaling tests on the JUWELS supercomputer exhibit good strong scalability for 2.7 · 10⁸ unknowns and good weak scalability on up to 24 576 cores, the largest number available to us at the time of the experiment [47]. The results are summarized in Figures 13 and 14. This approach can also be extended to the scope of our most sophisticated application, namely the simulation of non-Newtonian and non-isothermal fluids. Originally, we ported a legacy FORTRAN code to ExaSlang 4 [48], resulting not only in a considerable reduction in code complexity and size of about one order of magnitude, but also in a gain in performance of, on average, also about one order of magnitude [48]. One prerequisite is the support of staggered non-uniform grids by our code generation framework and DSL, which we extended
accordingly. After implementing fully parallel grid setup routines, we are now also able to generate applications executable on GPUs and cluster architectures from the same ExaSlang code. This application also serves as a motivation to extend our generator to apply domain-specific optimizations, such as our sophisticated loop-carried common-subexpression elimination [45], which are able to improve performance even further. Finally, using the previously described layer 2 and layer 3 capabilities allows for even more expressive and concise specifications, which, moreover, better reflect concepts familiar to domain scientists. Lastly, we should mention that we invested some time to explore alternative approaches, such as the Lattice-Boltzmann method, which can also be implemented in our DSL [64]. An early evaluation showed that applications generated with the ExaStencils framework are about a factor of 2 slower than the state-of-the-art multiphysics framework waLBerla.
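To make the combination of implicit Euler time stepping and Picard linearization mentioned above concrete, here is a small, hypothetical Python sketch for the 1D viscous Burgers equation u_t + u u_x = ν u_xx (the convection velocity is frozen at the previous Picard iterate; in ExaStencils, such solvers are generated from the layer-2/3 specification rather than hand-written):

```python
import numpy as np

def implicit_euler_picard(u0, dt, nu, n_picard=10, h=None):
    """One implicit-Euler step for 1D viscous Burgers with zero Dirichlet
    ends, linearized by Picard iteration: the convection coefficient is
    frozen at the previous iterate, giving a linear system per iteration."""
    n = u0.size
    h = h if h is not None else 1.0 / (n + 1)
    u = u0.copy()
    for _ in range(n_picard):
        A = np.zeros((n, n))
        for i in range(n):
            c = u[i]  # frozen convection velocity from the previous iterate
            A[i, i] = 1.0 + dt * 2.0 * nu / h**2
            if i > 0:
                A[i, i - 1] = dt * (-nu / h**2 - c / (2.0 * h))
            if i < n - 1:
                A[i, i + 1] = dt * (-nu / h**2 + c / (2.0 * h))
        u = np.linalg.solve(A, u0)  # RHS is the previous time-step value
    return u

# one time step on a smooth initial profile (interior values only)
u0 = np.sin(np.pi * np.linspace(0.0, 1.0, 33))[1:-1]
u_new = implicit_euler_picard(u0, dt=1e-3, nu=0.1)
```

Newton's method would instead linearize around the full Jacobian; both choices can be expressed at layer 2, and the full approximation scheme then applies the multigrid hierarchy directly to the non-linear problem.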

Molecular Dynamics Simulation
To illustrate the versatility of the ExaStencils approach, we demonstrated that the ExaStencils compiler could be used to generate a multigrid solver that integrates into a large, existing simulation code. For this purpose, we considered a molecular dynamics simulation code used in practice [71,70], a simulation that computes the motion of the electrons and nuclei of molecules or groups of molecules.
The simulation is a so-called real-space density functional theory (RSDFT) simulation. A simulation in real space computes the wave functions on a discretized Cartesian grid, which avoids the costly fast Fourier transforms (FFTs) that methods working in Fourier space have to perform. The simulation computes the motion of the particles using Kohn-Sham density functional theory, which is implemented using a Car-Parrinello molecular dynamics scheme.
The RSDFT simulation requires the computation of the electron-electron Coulomb interaction potential of the charge density. This potential is computed efficiently by solving the corresponding Poisson equation with a multigrid method. The RSDFT code contains a manually derived multigrid solver, which we wanted to replace with a solver generated by the ExaStencils compiler.
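The generated solver is a full multigrid implementation specified in ExaSlang; to recall the principle it rests on, here is a minimal two-grid cycle for the 1D Poisson equation in Python (a textbook sketch under our own naming, not the generated code):

```python
import numpy as np

def jacobi(u, f, h, sweeps, w=2.0 / 3.0):
    """Weighted Jacobi smoothing for u'' = f with zero Dirichlet ends."""
    for _ in range(sweeps):
        u[1:-1] = (1 - w) * u[1:-1] + w * 0.5 * (u[:-2] + u[2:] - h * h * f[1:-1])
    return u

def residual(u, f, h):
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (u[:-2] - 2 * u[1:-1] + u[2:]) / h**2
    return r

def two_grid(u, f, h, nu1=2, nu2=2):
    u = jacobi(u, f, h, nu1)                      # pre-smoothing
    rf = residual(u, f, h)
    m = rf[::2].size - 2                          # interior coarse points
    rc = np.zeros(m + 2)
    rc[1:-1] = 0.25 * (rf[1:-2:2] + 2 * rf[2:-1:2] + rf[3::2])  # full weighting
    A = (np.diag(-2.0 * np.ones(m)) + np.diag(np.ones(m - 1), 1)
         + np.diag(np.ones(m - 1), -1)) / (2 * h)**2
    ec = np.zeros_like(rc)
    ec[1:-1] = np.linalg.solve(A, rc[1:-1])       # exact coarse-grid solve
    e = np.zeros_like(u)
    e[::2] = ec                                   # linear interpolation back
    e[1:-1:2] = 0.5 * (ec[:-1] + ec[1:])
    return jacobi(u + e, f, h, nu2)               # post-smoothing

N = 64
h = 1.0 / N
x = np.linspace(0.0, 1.0, N + 1)
f = np.sin(np.pi * x)     # right-hand side of u'' = f
u = np.zeros(N + 1)
for _ in range(5):
    u = two_grid(u, f, h)
```

Recursing on the coarse-grid solve instead of solving it directly yields the V-cycle that the generated solver implements for the 3D Poisson problem in RSDFT.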
However, integrating a solver into an existing code base requires some additional steps beyond those necessary to generate a stand-alone solver. The RSDFT code has its own set of grid data structures for the storage of the wave functions. To obtain optimal performance, the ExaStencils compiler generates a set of internal data structures that are most suitable for the computation (see Subsection 4.3). These data structures are usually not compatible with the ones an application code uses. While it would be possible to force the compiler to use the RSDFT data structures, this would deprive the compiler of some of its optimization opportunities. As a workaround, we used the compiler to generate a set of data transfer routines that copy the grid functions from the RSDFT data structures to the internal ones and vice versa.
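As an illustration of what such generated transfer routines do (the layout details here are hypothetical and chosen only for the example), consider copying a halo-free application array into a ghost-layer-padded internal array and back:

```python
import numpy as np

HALO = 1  # assumed ghost-layer width of the internal layout

def to_internal(app_field, nx, ny):
    """Copy the application's flat, halo-free field into the padded
    internal layout the generated solver computes on."""
    internal = np.zeros((ny + 2 * HALO, nx + 2 * HALO))
    internal[HALO:-HALO, HALO:-HALO] = np.asarray(app_field).reshape(ny, nx)
    return internal

def from_internal(internal):
    """Copy the interior of the internal layout back to the flat layout."""
    return internal[HALO:-HALO, HALO:-HALO].ravel().copy()

field = np.arange(12.0)  # toy 4x3 application field
roundtrip = from_internal(to_internal(field, nx=4, ny=3))
```

The copy overhead is incurred only at the solver interface; inside the solve, the compiler is free to use whatever layout its optimizations prefer.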
In Table 1, we compare the performance of the automatically generated multigrid solver with that of the legacy solver that was manually written in C. Note that the two solvers implement different coarse-grid solution strategies. Thus, the overall timings are not directly comparable, because the number of iterations changes between the implementations. However, the time per iteration is indicative of the quality of the generated solver's code, and we see that it is comparable to the time per iteration of the legacy code. The generated code is often even slightly faster. We can conclude that the ExaStencils approach is also feasible in situations that involve an already existing code base.

Porous Media
A further study, in cooperation with SPPEXA project EXA-DUNE [7], concerned the design decisions made for stencil codes in the domain of porous media [29]. The goal of this study was to help application engineers understand the complexity of stencil computations and the relation between the mathematical model of an application and the stencil used to solve the partial differential equation underlying the application. To model these design decisions, we use feature orientation as introduced in Subsection 4.6. An example of a design decision is the choice of the type of boundary conditions that have to be satisfied by a given application. We differentiate between decisions at the mathematical and the stencil level, and we devise one feature model for each of these levels. Since decisions at the mathematical level may affect decisions that can be made at the stencil level, we also include constraints between these two levels.
To demonstrate the usefulness of feature modeling, we considered a set of applications that deviate from the standard 5-point stencil used in many benchmarks. We demonstrated that feature models can be used beyond simple mathematical problems by considering problems of different complexity. Specifically, we started with the simple heat equation and went on to more complex equations, such as the advection-diffusion equation with operator splitting or Richards' equation. For all of these equations, we were able to express the design decisions at the mathematical and the stencil level as well as their dependencies. The result is a comprehensive set of feature models that can be used to formally reason about the variabilities in stencil codes (e.g., to identify dependencies among configuration decisions).
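The shape of such cross-level constraints can be pictured with a toy model in a few lines of Python (all feature names here are invented for illustration; the real models cover many more decisions):

```python
# Toy feature model: a configuration picks features at the mathematical
# level (boundary conditions) and the stencil level (boundary handling,
# sweep structure). Cross-level constraints encode that mathematical
# choices restrict the admissible stencil choices.
def valid(cfg):
    constraints = [
        # mathematical level: exactly one boundary-condition type
        cfg["dirichlet_bc"] != cfg["neumann_bc"],
        # cross-level: Neumann boundaries need one-sided boundary stencils
        (not cfg["neumann_bc"]) or cfg["one_sided_boundary_stencil"],
        # stencil level: operator splitting implies per-direction stencils
        (not cfg["operator_splitting"]) or cfg["per_direction_stencils"],
    ]
    return all(constraints)

cfg = {"dirichlet_bc": True, "neumann_bc": False,
       "one_sided_boundary_stencil": False,
       "operator_splitting": True, "per_direction_stencils": True}
ok = valid(cfg)
```

Reasoning over such propositional constraints is what allows dependencies among configuration decisions to be identified automatically.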

ExaSlang 4 Embedded, not External
In the second funding period, ExaStencils was part of an international collaboration with Shigeru Chiba, co-funded by the Japan Science and Technology Agency (JST). This project has already been reported to and reviewed by JST as part of the Japanese strategic basic research programme CREST [17]. The summary is also available at the author's Web site.
The project aimed at a software architecture for embedded domain-specific languages. The cooperation of the DFG and JST in SPPEXA gave Chiba's team the opportunity to explore the applicability of its software architecture in the context of ExaStencils. ExaSlang 4 is a stand-alone language that needs its own compiler and development environment, such as an editor and a debugger, and that incurs a relatively large development cost. Embedding is an approach to reduce the size and cost of the infrastructure needed. An embedded DSL is implemented on top of a general-purpose host language as a library with a programming interface that looks like a language. Since an embedded DSL is a library, its development cost is usually smaller, and programmers can reuse the environment of the host language.
The result of the project was an embedded-DSL architecture based on run-time metaprogramming. In this architecture, a part of the running program is reified at run time; that is, its abstract syntax tree is dynamically extracted. Then it is translated to efficient binary code to be run. For example, the host language could be Java, while the program is translated to C/C++ and compiled to machine code. The idea is to view the extracted code as a program not in the host language but in the DSL, with a slightly different semantics. Hence, we can apply domain-specific optimizations, such as partial evaluation, during translation.
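Python offers the same kind of run-time reification, so the mechanism can be sketched in a few lines (this only mirrors the idea; Bytespresso does the equivalent for Java):

```python
import ast
import inspect

def smooth(u, i):
    # toy stencil expression whose AST we want to reify
    return 0.25 * (u[i - 1] + 2 * u[i] + u[i + 1])

# Reify the running function: recover its source at run time and parse it
# into an abstract syntax tree. A Bytespresso-like system would now treat
# this tree as a DSL program and translate it to C/C++.
try:
    source = inspect.getsource(smooth)
except OSError:  # source not available, e.g., in an interactive session
    source = "def smooth(u, i):\n    return 0.25 * (u[i - 1] + 2 * u[i] + u[i + 1])\n"
tree = ast.parse(source)
op_names = [type(node).__name__ for node in ast.walk(tree)
            if isinstance(node, ast.operator)]
```

Once the tree is in hand, the translator is free to reinterpret it, for instance mapping the arithmetic operators onto a stencil IR and applying partial evaluation before emitting C/C++.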
To evaluate this architecture, Chiba's team developed two software systems. One is Bytespresso [15], which they used to implement an embedded DSL that mimics ExaSlang 4. They succeeded in developing the DSL with a similar programming interface and then measured the execution performance. Although the computational part alone suffered a slowdown by a factor of 3, the total execution including compilation was twice as fast in the largest case measured, 12 grid levels, which is still small-scale. This demonstrates that, for small-scale problems, the embedded version of ExaSlang 4 is more suitable for interactive execution than the external version.
Chiba's team also explored the expressive power of the proposed architecture with Yadriggy [16], a framework for embedding DSLs in Ruby. Compared to the ExaSlang 4 implementation in Java using Bytespresso, whose source-code syntax is far from the original, the Ruby-embedded version on Yadriggy features a syntax much closer to the original. This is due to the flexible syntax of Ruby and the development support by Yadriggy. For example, the code fragment depicted in Listing 15 is ExaSlang 4 code from the optical flow case study (Subsection 5.2, Listing 14) embedded in Ruby. This internal DSL code is not interpreted as Ruby code and, hence, the language does not have to adhere to Ruby's semantics.

A Multigrid Solver in SPIRAL
Forerunner and motivator for the vision of ExaStencils was the US project SPIRAL [61,24]. SPIRAL's initial and foremost domain has been linear transformations; recently, it moved on to small-scale linear algebra [82]. ExaStencils' domain is a subdomain of multigrid computations. At project halftime, we put a very simple case of multigrid on SPIRAL: a solver with a Richardson smoother for a discretized square 2D Poisson equation with Dirichlet boundary conditions [10]. The central step is to bring the smoother into an algebraic form of about a dozen rewrite equations. It was an effort of a few days, but the present implementation of SPIRAL does not support the broader domain of ExaStencils. For the generation of efficient code, special adaptations are required, as illustrated in the Bachelor's thesis of Sebastian Schweikl [80]. This is a consequence of the fact that SPIRAL's target domain has been signal processing, not the solution of PDEs. The applicability of SPIRAL in its present form to our domain is very limited.
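For reference, the Richardson smoother used in that experiment is the simplest of all smoothers; a plain Python rendering for the 5-point discrete Poisson operator (our sketch of the numerics, not SPIRAL's rewrite-rule formulation) looks as follows:

```python
import numpy as np

def richardson(u, f, h, omega, sweeps):
    """Richardson iteration u <- u + omega * (f - A u) for the 5-point
    Laplacian A u = -Laplace u with homogeneous Dirichlet boundaries."""
    for _ in range(sweeps):
        Au = (4 * u[1:-1, 1:-1] - u[:-2, 1:-1] - u[2:, 1:-1]
              - u[1:-1, :-2] - u[1:-1, 2:]) / h**2
        u[1:-1, 1:-1] += omega * (f[1:-1, 1:-1] - Au)
    return u

n = 17
h = 1.0 / (n - 1)
f = np.ones((n, n))
u = np.zeros((n, n))
# the eigenvalues of A lie in (0, 8/h^2); omega = h^2/8 guarantees convergence
u = richardson(u, f, h, omega=h**2 / 8, sweeps=200)
```

The algebraic simplicity of this update, a single scaled residual correction, is precisely what made it expressible as a handful of SPIRAL rewrite equations.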

Effort
ExaStencils was not the only project in SPPEXA to address the solution of PDEs via multigrid methods. However, it was the only one that took a revolutionary rather than an evolutionary software approach, i.e., that did not build on existing software tools or application software but started afresh with the automatic, domain-specific synthesis and optimization of application software via a set of dedicated domain-aware software tools. The challenge was to identify the domain knowledge that is useful in aiding optimization, to build the domain-aware tools, and to generate target codes that demonstrate that this approach is realistic and promising.
A central ingredient of the approach is the stratification of the DSL ExaSlang into four layers of abstraction, each addressing a different set of aspects of the stencil code. The domain knowledge is twofold. Knowledge of the application problem is provided by hints that enable the ExaSlang compiler to optimize at the next, more concrete layer. These hints can be supplied as code annotations or, conceptually, by other means, e.g., configuration files or GUIs, as is the case in the ExaSlang Level 1 editor documented in Tim Ammenhäuser's Bachelor's thesis [4]. Knowledge of the execution platform is provided by a dedicated platform-description language.
The main effort in the project concerned the design and implementation of the ExaSlang compiler and code generator, which had to be built from scratch, since ExaSlang is an external DSL. This effort led to three of the six dissertations that emerged from the project [43,47,72] and to two habilitation theses [40,35]. In a sideline, a comparative study was conducted with an internal DSL mimicking the most concrete layer of the external ExaSlang, which proved the approach doable but not the first choice for exascale software.
Another major effort was to drive the development of performance prediction further. On the software side, this will soon be realized by one cumulative dissertation introducing the, also revolutionary, generation of performance prediction models via machine learning in the setting of software product-line engineering [28]. On the mathematics side, one dissertation advanced local Fourier analysis for the convergence prediction of multigrid codes [66]. Another dissertation in the realm of multigrid mathematics advanced the technology of multigrid smoothers [19].
The final major element was the set of case studies conducted to illustrate the impact of the ExaStencils approach.

Outreach
Let us sketch how ExaStencils profited from and nurtured other research projects.
There was a bilateral exchange of ideas and best practices with SPPEXA project TerraNeo [8], which deals with multigrid solvers on block-structured grids for applications in geoscience. Both ExaStencils and TerraNeo have dealt, in loose collaboration, with high-performance multigrid codes. One result was an evaluation from the perspective of code generation technology [41]. A further systematic comparison of performance regarding the algorithms, data structures and their implementation is left to the future. However, the generative programming technologies of ExaStencils are already being leveraged for the redesign of the TerraNeo code base [38], where they help to simplify the software structure and to reduce the coding effort.
In cooperation with SPPEXA project EXA-DUNE [7], we developed a semi-automatic variability extraction approach that generates a family of applications based on a given application [29]. An ExaStencils-generated application solves the same problem as the EXA-DUNE application but uses different sets of methods from the DUNE framework [9] with the goal of optimizing performance. For example, applications might use different grid implementations or different finite-element maps provided by the DUNE framework. The goal of this approach is to ease the burden of having to understand the DUNE framework when adapting a given application to new use cases and optimizing performance. In a first evaluation, the approach was able to identify over 90% of the alternative but compatible methods that a developer of the framework had identified. Using this automated approach, we were even able to identify a bug and an inconsistency in the DUNE framework.
A molecular dynamics simulation on the application platform RSDFT [71] profited from the ExaStencils approach. The port of a development in ExaSlang to RSDFT was successful.
Applied researchers working with triangular meshes profited from the ExaStencils approach of validating and refining domain knowledge via a performance-influence model [31]. In simpler words, an automatic learning procedure confirmed their design choices and their effects on performance and made them aware of others.
In a technically related BMBF-funded project, called HighPerMeshes, which also aims at creating a DSL, but for the scientific-computing domain of unstructured meshes [37], the parallelization technology SYCL was evaluated [3]. This fueled our interest in creating a SYCL back-end for our code generator, which, in turn, provided valuable insights into possible code generation strategies and optimizations for project HighPerMeshes.
SPIRAL [24,61] was ExaStencils' main motivator and an orientation point for positioning its contribution [10]. At present, SPIRAL is not as mature a platform for multigrid solver development as ExaStencils, and it would support the algebraic rather than the geometric form of multigrid.
Similarly, HIPAcc [55,56], a DSL embedded in C++ and a source-to-source translator for image-processing applications that can target a wide variety of parallel accelerator architectures, inspired us on how to deal with both domain and architecture knowledge. In turn, at the start of ExaStencils, we introduced language constructs to HIPAcc to process and represent data at different resolutions, which enables the specification of applications that work on image pyramids as well as of 2D stencil-based multigrid methods. By decoupling the algorithm from its schedule, HIPAcc allows the generation of efficient stencil-code implementations for single accelerators [57].
A recent DFG-funded project on ocean modeling (grant no. 320126236) challenges the extensibility of ExaStencils' application domain by considering time-dependent hyperbolic partial differential equations, discretized using finite-volume or higher-order discontinuous Galerkin methods on block-structured triangular grids. Since, so far, an explicit time-stepping scheme is being applied, no solvers like multigrid are required. However, it has been shown already that the ExaStencils framework is capable of generating efficient and scalable code for such applications on GPU clusters [50].

Potential
The ExaSlang compiler and code generator remain at the stage of a prototype, but we believe that they have demonstrated that a stratified optimizing DSL approach can provide dramatically increased convenience for the application programmer and high flexibility concerning the execution platform. The effort of adapting a complex code to a different application or platform can be reduced significantly. The price to be paid is good knowledge of the theory on which the application rests and a significant development cost in implementing the DSL. The latter can be reduced by going the internal rather than the external way, at the price of reduced expressiveness and flexibility in terms of target-code optimization.
Our choice not to go with an embedded DSL tends to be unpopular in present-day DSL development but was essential: we did not want to be bound by any language limitations in exploring the potential of the ExaStencils approach to domain-specific programming and optimization. To achieve high portability and execution performance of the generated programs, we selected C/C++ as the target language.
Our case studies covered the range from textbook examples to realistic applications. The latter are represented by a simulation of non-Newtonian flow, fully developed in the ExaSlang world, and a molecular dynamics simulation, which used an ExaStencils solver through an automatically generated compatibility interface. We have demonstrated that we can reach the expected performance, achieve portability to most contemporary HPC platforms, and profit from a substantial decrease in the development time of new PDE models that can be expressed in ExaSlang.
Since the start of our project, code generation has become a standard technique in many HPC codes and, thus, the concepts developed in ExaStencils are being used or can be adapted to other domains. Currently, a popular approach is the generation of single compute kernels from an embedded DSL for existing software frameworks. In addition, C++ template-based embedded DSLs are common. Frequently, both the concrete DSL and the resulting implementations are very application-specific, but the intermediate representations and code transformations can be defined quite generally and then optimized for the specific case. In the future, it should be possible to extend the ExaStencils approach to other HPC applications to enable more holistic optimizations than currently established approaches offer.

Fig. 2: Unknowns that are included in the blocks of the smoothing steps.

Fig. 3: Convergence behaviors of the two block smoothers applied to the Stokes equations discretized using staggered grids in the unit square with periodic and Dirichlet boundary conditions.

Fig. 4: Order of iterations of the advanced block smoother.

Fig. 6: The transformation trajectory of an ExaSlang program from its IR to C++.

Fig. 7: AST size (number of nodes) during code generation for different variants of the 3D optical flow application, starting at ExaSlang 4.

Fig. 11: Weak scaling on Piz Daint for constant coefficients (CC) and variable coefficients (VC). In all cases, the number of iterations required is nearly constant. Performance differences between CPU and GPU reflect the hardware characteristics.

Table 1: Timings of the solution of the Poisson equation discretized on a grid of n points on JURECA [42], using the legacy solver and an automatically generated multigrid solver.