Implicitly Extrapolated Geometric Multigrid on Disk-Like Domains for the Gyrokinetic Poisson Equation from Fusion Plasma Applications

The gyrokinetic Poisson equation arises as a subproblem of Tokamak fusion reactor simulations. It is often posed on disk-like cross sections of the Tokamak that are represented in generalized polar coordinates. On the resulting curvilinear anisotropic meshes, we discretize the differential equation by finite differences or low order finite elements. Using an implicit extrapolation technique similar to multigrid τ-extrapolation, the approximation order can be increased. This technique can be naturally integrated in a matrix-free geometric multigrid algorithm. Special smoothers are developed to deal with the mesh anisotropy arising from the curvilinear coordinate system and mesh grading.


Introduction
In the context of Tokamak fusion plasma, a Poisson equation has to be solved on disk-like domains which correspond to the poloidal cross sections of the Tokamak geometry; see, e.g., [5,10,37]. In its most simplified form, this cross section takes a circular shape, but deformed (cf. Fig. 1, center) or more realistic D-shaped geometries were found to be advantageous; see, e.g., [5,10]. For more details on the problem setting and the physical details, we refer the reader to [5,10,13,32,37-39]. Here, we propose a tailored solver for

−∇ · (α∇u) = f in Ω,   u = 0 on ∂Ω,   (1.1)

where Ω ⊂ R² is a disk-like domain, f : Ω → R, and α : Ω → R is a varying coefficient, also referred to as the density profile.
The purpose of this article is to develop a problem-specific solver for this scenario. Our particular problem setting is taken from [5,13,32,38,39]. The solution of this system is a part of the iterative solution process in large gyrokinetic codes such as Gysela [13], where the 2D problem must be solved repeatedly over many time steps on hundreds or thousands of such 2D cross sections. Note that the solution of the gyrokinetic equation on each cross section is only a moderately sized problem, but each simulation run with the plasma physics code may require the solution of this subproblem millions of times. Even small improvements in the methods can lead to substantial reductions in the computation time of the field solver. This motivates and justifies the effort for a dedicated development and for optimizing the algorithms.
The article will study several mathematical difficulties and propose special techniques to handle each of them. In particular:

- The domain geometry and the variable coefficient formally lead to a linear system with a sparse matrix. However, the conventional assembly and the repeated access to this matrix will limit performance. In fact, on many modern architectures, memory access can be more critical for performance than the floating point operations. Therefore, as an alternative to working with sparse matrix formats, we will here develop techniques that are suitable for a matrix-free implementation.
- The smoothness of the domain and the regularity of the coefficients justify the use of higher order discretizations, promising to reach better accuracy with fewer unknowns. Conventionally, however, higher order discretizations lead to sparse matrices with a more complex structure, which are also more densely populated. Solving for these discretizations would thus incur an increased cost. Here we will focus on a nonstandard alternative, where higher order is achieved using computationally cheap low order discretizations only, combined with an implicit extrapolation step [18] to reach higher order.
- The solver algorithm itself must be scalable or, in other words, asymptotically optimal. Recall that linear (or close to linear) algorithmic complexity is a necessary condition for reaching scalability. Beyond this, we are here not only interested in the order of complexity, but also in keeping the absolute cost minimal. Our focus is on geometric multigrid methods, which belong to the most efficient solvers for elliptic problems, also in terms of absolute cost. In the development below, we will carefully optimize them for the given mesh structures. This requires in particular dealing with the anisotropy of the meshes and thus introducing special zebra line smoothers.
- Finally, all algorithms must be suitable for parallelization. Although developing an optimized parallel implementation is beyond the scope of the current article, we design the algorithms such that parallelization will be possible in future work. This means that all algorithmic components, e.g., smoothers or grid transfer operators, are constructed so as to avoid sequential bottlenecks or global dependencies, except for the limitations caused by the multigrid hierarchy itself. In the envisioned application, many cross sections will be computed independently, leading to a significant degree of coarse grain parallelism. Beyond this, the solution on each of the 2D cross sections must be able to exploit node level parallelism and instruction level parallelism. Therefore, all algorithms developed here are ready to exploit intra-node parallelism. Additional care has also been taken so that vectorization of the loop kernels and execution of the solver on accelerators such as GPUs remain possible.
In the article, we will consider two illustrative domain shapes. The first one is a simple circle or circular annulus and the second one is a deformed circle as introduced in [5]; see Fig. 1. These domains can be described in curvilinear coordinates. In its simplest form, the geometry can be described by polar coordinates, but for a more realistic geometry, more general transformations must be used.
One drawback of the curvilinear coordinates is the introduction of an artificial singularity at the origin of the mapping. Further challenges for our solver come from the anisotropy of the transformed meshes and a varying coefficient α describing a physical density. Additionally, the meshes may be refined anisotropically to account for particular physical effects.
Multigrid methods can achieve optimal complexity for many problems and are among the most efficient solvers for elliptic model problems such as (1.1); see, e.g., [6,35]. However, multigrid methods for curvilinear (e.g., polar) meshes are less commonly studied; cf. [1,3,23,33,35] for some results. When additional difficulties arise, such as generalized polar coordinates, varying coefficients, and locally refined, anisotropic meshes, the multigrid components must be suitably modified and adapted to maintain excellent convergence rates at low cost per iteration. Designing the algorithms for parallel execution on modern computer architectures and achieving a low memory footprint put additional constraints on the design of the multigrid components for coarsening, prolongation, and smoothing of the iterates.
We present a geometric multigrid algorithm using special line smoothers tailored to support parallel scalability. Additionally, we propose an implicit extrapolation scheme based on [15,16,18] with the goal of improving the order of differential convergence. Note that this refers to the convergence of the algorithm with respect to the solution of the PDE, as opposed to algebraic convergence when considering the multigrid method as a linear system solver.
Conventional Richardson-style extrapolation for PDEs relies on global asymptotic error expansions for the discrete solution and thus requires strict smoothness assumptions [22] to guarantee the existence of such expansions. An interesting variant of Richardson extrapolation that can save cost for higher dimensional problems is called splitting or multi-parameter extrapolation; see, e.g., [20,26,30].
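The order-raising mechanism behind Richardson-style extrapolation can be illustrated with a minimal, self-contained sketch: two second order central differences with step sizes h and h/2 are combined so that the h² error term cancels. This explicit variant is only an illustration of the principle; the implicit variants used later in this article operate on local quantities instead.

```python
import math

def central_diff(f, x, h):
    # Second order central difference with error expansion
    # f'(x) + c2*h**2 + c4*h**4 + ...
    return (f(x + h) - f(x - h)) / (2.0 * h)

def richardson(f, x, h):
    # Combine step sizes h and h/2 to cancel the h**2 error term:
    # (4*D(h/2) - D(h)) / 3 is a fourth order approximation of f'(x).
    return (4.0 * central_diff(f, x, h / 2.0) - central_diff(f, x, h)) / 3.0

# Example: d/dx sin(x) at x = 1 is cos(1)
x, h = 1.0, 0.1
exact = math.cos(x)
err_plain = abs(central_diff(math.sin, x, h) - exact)
err_extra = abs(richardson(math.sin, x, h) - exact)
```

Halving h reduces err_plain by a factor of about 4 but err_extra by about 16, reflecting the jump from second to fourth order.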
The combination of such extrapolation methods with iterative multilevel solvers, such as the multigrid method, is in many ways natural [29,36] and has recently seen renewed interest [9,19]. In particular, it is attractive to combine Richardson-style extrapolation with cascadic multigrid methods [4,31], as studied, e.g., in [25,34]. Since the cascadic multigrid method is not optimal in the L²-norm, special extrapolation formulas must be developed [7,8].
τ-extrapolation and other implicit extrapolation variants rely on extrapolation applied to local quantities, such as the residual or the local energy, and thus they can also be used when only local smoothness is guaranteed. In this article, we will use implicit variants of extrapolation as proposed in [15,27,28]. These methods are related to τ-extrapolation, which has been proposed in combination with multigrid solvers [6,14]. Different from τ-extrapolation, our method is based on the existence of an underlying energy functional and is thus limited in its current form to self-adjoint PDEs.
The remainder of the article is organized as follows. In Sect. 2, we present the detailed problem setting and the geometry motivated from fusion plasma applications. In Sect. 3, we briefly introduce our five and nine point finite difference stencils as well as the finite elements combined with nonstandard numerical integration. We also briefly discuss the handling of the mesh singularity at the origin. In the main section, Sect. 4, we introduce the new geometric multigrid algorithm with its tailored smoothers and the implicit extrapolation step. Numerical results are presented in Sect. 5.

Curvilinear Coordinates and Model Problem Representations
In this paper, we will consider physical domains Ω_{r_1} that can be described by a mapping from a logical domain (r_1, 1.3) × [0, 2π) onto Ω_{r_1} for r_1 ∈ [0, 1.3). Except for the singularity arising for r_1 = 0, the mapping is invertible. We will later present different strategies to handle the artificial singularity. Note that r_max = 1.3 is purely problem-specific and will be used throughout this paper only for simplicity of the presentation; other values are possible.
First, we will consider the circular geometry, which can be described by the polar coordinate transformation F_P(r, θ) = (x, y) with

x(r, θ) = r cos(θ),   y(r, θ) = r sin(θ);   (2.1)

see Fig. 1 (left). A generalized transformation F_GP(r, θ) = (x, y) is given by

x(r, θ) = (1 − κ) r cos(θ) − δ r²,   y(r, θ) = (1 + κ) r sin(θ).   (2.2)

The particular model setting is based on [5,32,39] to describe more realistic Tokamak cross sections. According to [5,39], we use κ = 0.3 and δ = 0.2. The resulting domain is illustrated in Fig. 1 (center). Note that (2.2) reduces for κ = δ = 0 to (2.1). Nevertheless, we will also give explicit formulas for (2.1) since we consider the polar coordinate transformation as a case of particular interest. In our simulations, we either use α ≡ 1 to consider the Poisson equation or use a typical density profile given by

α(r, θ) = α(r) = (2 / (2.6 + 3.14)) (1.3 + arctan((1 − r)/0.09));   (2.3)

see Fig. 1 (right). The density profile (2.3) is motivated by [12,32,38] and models the rapid decay from the core to the edge region of the separatrix in the Tokamak.
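The two coordinate transformations can be evaluated as in the following sketch. The explicit form of the generalized mapping, x = (1 − κ) r cos θ − δ r², y = (1 + κ) r sin θ, is our reading of the deformed-circle geometry of [5] and should be checked against that reference.

```python
import math

KAPPA, DELTA = 0.3, 0.2   # deformation parameters from [5,39]

def F_P(r, theta):
    # Polar coordinate transformation (2.1)
    return (r * math.cos(theta), r * math.sin(theta))

def F_GP(r, theta, kappa=KAPPA, delta=DELTA):
    # Generalized polar transformation; this explicit form
    # (x = (1-kappa)*r*cos(theta) - delta*r^2, y = (1+kappa)*r*sin(theta))
    # is an assumed reconstruction of the mapping in [5].
    x = (1.0 - kappa) * r * math.cos(theta) - delta * r * r
    y = (1.0 + kappa) * r * math.sin(theta)
    return (x, y)

# Sanity check: for kappa = delta = 0 the two mappings coincide
assert F_GP(0.7, 1.2, kappa=0.0, delta=0.0) == F_P(0.7, 1.2)
```

For κ = 0.3 and δ = 0.2, the mapping shrinks the horizontal axis, stretches the vertical axis, and shifts the circles by δr², producing the deformed geometry of Fig. 1 (center).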
Concerning the mesh, we use local refinements to pass from the core to the edge region of the separatrix; see, e.g., [24] or Fig. 1. In polar coordinates with α = α(r), (1.1) takes the form

−∂_r α ∂_r u − α (∂²_{rr} u + (1/r) ∂_r u + (1/r²) ∂²_{θθ} u) = f,   (2.4)

where the term in the last parenthesis corresponds to the well-known Laplacian operator expressed in polar coordinates.

Remark 2.1
Note that we do not always explicitly distinguish between the functions u(r, θ) and u(x, y), nor between the operators ∇ = ∇_{x,y} and ∇ = ∇_{r,θ} in (1.1) and (2.4), defined for the corresponding variables. We will do this to a certain extent by marking operators and functions expressed in Cartesian coordinates with an accent in the following. However, to not overload the notation, we waive this notational overhead whenever the distinction becomes clear from the context.
In the following, we will consider the energy functional to be minimized corresponding to (1.1). For a scalar coefficient, it writes, after transformation to the logical domain,

J(u) = ∫ (α/2) (DF_*^{−⊤} ∇_{r,θ} u) · (DF_*^{−⊤} ∇_{r,θ} u) |det DF_*| d(r, θ) − ∫ f u |det DF_*| d(r, θ),   (2.5)

where * ∈ {P, GP}, DF_* is the Jacobian matrix, and Ω_{r_1}^* is the physical domain for either (2.1) or (2.2). In the remaining part of the paper, we mostly use the index ·_* to refer to both transformations likewise. For the inverse transformations, see [5]; due to space limitations, we refer to (2.6) and (2.7) therein for the explicit expressions. In order to simplify the notation for (2.5), we define the coefficient matrix

( a^{rr}_*    ½ a^{rθ}_* )
( ½ a^{θr}_*   a^{θθ}_* )  := α |det DF_*| DF_*^{−1} DF_*^{−⊤}.   (2.8)

Note that (2.8) is symmetric and thus a^{rθ}_* = ½ a^{rθ}_* + ½ a^{θr}_*. Additionally, note that a^{rθ}_P = 0, i.e., for the polar coordinate transformation the offdiagonal entries are zero as long as the diffusion coefficient α remains scalar.

Discretization Methods
In this paper, we will use linear finite elements and compact finite differences to construct a geometric multigrid algorithm on anisotropic grids represented by curvilinear coordinates. We use particular finite difference stencils from [18] which maintain the symmetry of the energy functional also for anisotropic grids. For finite elements, we introduce nonstandard integration rules that are advantageous when implicit extrapolation is used within the multigrid algorithm; cf. [16,18].
We first consider an anisotropic hierarchical grid with two levels. This is suitable for applying the extrapolation method developed in [18]. The two-level hierarchical grid is given in tensor-product form on the logical domain (r_1, 1.3) × [0, 2π) (see Fig. 2) with nodes

r_1 < r_2 < … < r_{n_r} = 1.3,   0 = θ_1 < θ_2 < … < θ_{n_θ} < 2π,
h_i := r_{i+1} − r_i,   k_j := θ_{j+1} − θ_j,

where n_r and n_θ denote the (odd) number of nodes in r- and θ-direction, respectively. We denote h := max_i h_i and k := max_j k_j.
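A tensor-product grid with graded radial spacing can be set up, for example, as follows. This is an illustrative sketch: the helper `graded_radii` and its geometric grading toward the outer edge are assumptions mirroring the anisotropy bound max_i h_i = 8 min_i h_i used in Sect. 5, not the paper's exact construction.

```python
import numpy as np

def graded_radii(r1, rmax, n_r, ratio=8.0):
    # Radii r_1 < ... < r_{n_r} = rmax whose spacings h_i shrink geometrically
    # toward the outer edge, with max_i h_i / min_i h_i == ratio (n_r >= 3).
    n_h = n_r - 1
    factors = ratio ** (-np.arange(n_h) / (n_h - 1))  # decays from 1 to 1/ratio
    h = factors / factors.sum() * (rmax - r1)
    return np.concatenate(([r1], r1 + np.cumsum(h)))

r = graded_radii(1e-6, 1.3, n_r=9)                        # odd number of radial nodes
theta = np.linspace(0.0, 2.0 * np.pi, 9, endpoint=False)  # periodic direction
h = np.diff(r)                                            # radial mesh sizes h_i
k = np.full(theta.size, 2.0 * np.pi / theta.size)         # angular mesh sizes k_j
```

The angular direction is kept uniform here, consistent with the constant k_j per circle assumed by the smoother selection in Sect. 4.1.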

Finite Element Discretization
We briefly recapitulate the nonstandard integration rule from [16,18,21]. In [18], it was already shown that the nonstandard quadrature rules may be better suited than standard quadrature for linear elements when approaching the singularity as r 1 → 0.

Remark 3.1 (Nonstandard Finite Element integration)
We use nodal P 1 basis functions with a nonstandard numerical integration rule; see [16,18,21]. Let us note that this nonstandard integration rule does not use standard edge-midpoint quadrature. In this approach, each additive term of a reformulated integrand will be evaluated at one term-specific evaluation node only.
For any two finite element basis functions ϕ_α and ϕ_β with Δ ∈ supp(ϕ_α) ∩ supp(ϕ_β), the bilinear form on the logical domain is given by (3.2), where ϕ_α and ϕ_β, α, β ∈ {1, 2, 3}, are the corresponding functions on the reference element T obtained via the mapping F^{−1}(Δ) = T; cf. Fig. 3. For the nonstandard quadrature rule, we first transform (3.2) by using (3.1). The numerical approximation of the integral (3.2) is then obtained by evaluating each additive term of the transformed integrand at its term-specific node. The linear form is approximated analogously, where z_i, i ∈ {1, 2, 3}, are the corner nodes; cf. Fig. 3.

Finite Difference Discretizations
For completeness, we additionally present the finite difference stencils as they will be used here; we refer to [18] for further detail. For any rectangular grid element (r_i, r_{i+1}) × (θ_j, θ_{j+1}) of the logical domain, we consider the discretized local energy functional corresponding to (1.1), where a^{rr}_*, a^{rθ}_*, and a^{θθ}_* are implicitly given by F_* as defined in (2.1) or (2.2), respectively.
Note that a^{rθ}_* = 0 if F_* = F_P is the standard polar coordinate transformation; in general, this only holds for scalar diffusion α. For this case, we obtain the five point stencil (3.9) and quadratic error convergence. Note that this stencil differs slightly in u_{s+1,t} and u_{s−1,t} when compared to [18]. This results from the use of the trapezoidal rule instead of the midpoint rule; it was chosen so that the five point stencil (3.9) is the reduced version of the nine point stencil (3.11). In the case of a transformation where a^{rθ}_* ≠ 0, we have to use a seven or nine point stencil to obtain a quadratic discretization error. The nine point stencil used here is given by (3.11) with right hand side (3.10). We refer to [18] for its derivation.
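A matrix-free application of a five point stencil of this type might look as follows. This is an illustrative sketch for the polar case using a standard flux-based discretization of −(1/r) ∂_r(r α ∂_r u) − (1/r²) ∂_θ(α ∂_θ u); the paper's trapezoidal-rule stencil differs in its exact coefficients.

```python
import numpy as np

def apply_A(u, r, dr, dtheta, alpha=lambda r: 1.0):
    # Matrix-free application of a five point stencil for
    #   -(1/r) d/dr(r*alpha*du/dr) - (1/r^2) d/dtheta(alpha*du/dtheta)
    # on a uniform logical (r, theta) grid, periodic in theta.  The first and
    # last radial rows are treated as Dirichlet rows and left untouched.
    n_r, n_t = u.shape
    Au = np.zeros_like(u)
    for i in range(1, n_r - 1):
        rm, rp = r[i] - 0.5 * dr, r[i] + 0.5 * dr   # half-way radii
        am, ap = alpha(rm), alpha(rp)
        for j in range(n_t):
            jm, jp = (j - 1) % n_t, (j + 1) % n_t   # periodic neighbors
            radial = (rp * ap * (u[i, j] - u[i + 1, j])
                      + rm * am * (u[i, j] - u[i - 1, j])) / (r[i] * dr**2)
            angular = alpha(r[i]) * (2.0 * u[i, j] - u[i, jm] - u[i, jp]) \
                      / (r[i]**2 * dtheta**2)
            Au[i, j] = radial + angular
    return Au
```

No matrix is ever stored: each application recomputes the stencil weights from r and α on the fly, which is the memory-traffic advantage motivating the matrix-free design in the introduction.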

Handling of the Artificial Singularity
In the following, we propose some ways to handle the artificial singularity for r → 0. All our proposals are based on the idea of retaining a symmetric operator.

The Origin as Discretization Node
A natural approach consists in integrating the origin, i.e., the node (r_1, θ) = (0, 0), into the mesh. This, however, requires an adaptation of the discretization rules since the logical nodes (0, θ_j), j = 1, …, n_θ, all coincide geometrically.
For our finite difference stencils, we modify the discretization around the origin as follows. Let us consider an arbitrary node (r_2, θ_j), 1 < j < n_θ − 1. We remove all interactions of the stencil with (r_1, θ_{j−1}) and (r_1, θ_{j+1}); see Fig. 4 (top left). We then take the interaction with (r_1, θ_j) and set it also as the connection from (r_1, θ_j) to (r_2, θ_j) to obtain a symmetric matrix. The diagonal entry for (r_1, 0) is then given by the negative sum of the values over all angles.
For our finite element discretization, we integrate the basis functions over the triangles with nodes (r_1, θ_j), (r_2, θ_j), (r_2, θ_{j+1}); see Fig. 4 (center). Note that it is important to pass (r_1, θ_j) to the assembly of the transformation onto the reference element, although it physically corresponds to (r_1, 0) for all 1 ≤ j ≤ n_θ. If (r_1, 0) is passed for all angles, the orthogonality of the mesh (i.e., the tridiagonal structure of T) is lost locally, and the connections of (r_1, θ_{j_1}) to (r_2, θ_{j_1}) and of (r_1, θ_{j_2}) to (r_2, θ_{j_2}) can differ for j_1 ≠ j_2.

Artificial Boundary Conditions
A simple and often used workaround to overcome the problem of the artificial singularity is to choose 0 < r_1 ≪ 1 and to enforce Dirichlet or Neumann boundary conditions on the artificial boundary {r_1} × [0, 2π]. A direct drawback is that these conditions are hard or even impossible to determine in practical cases. This workaround is however used in the Gysela implementation as presented in [13].

Discretization Across the Origin
Another approach that we propose is the discretization across the origin. Instead of explicitly using (0, 0) as discretization node or imposing boundary conditions at r 1 > 0, we first assemble the stiffness matrix for r 1 > 0 without any condition on (r 1 , θ j ), 1 ≤ j ≤ n θ .
Note that this may lead to an unsymmetric operator if nonsymmetric domains and seven point stencils are considered. In this case, one could copy the values from the first half circle to the second half circle to retain a symmetric operator.

Geometric Multigrid for Curvilinear Coordinates
Multigrid methods are among the most efficient solvers for elliptic model problems such as (1.1); see, e.g., [6,35]. Multigrid methods for meshes in polar coordinates were considered in, e.g., [1,3,23,33,35], but remain less studied. In the following sections, we will develop special multigrid components for the model problem in curvilinear coordinates such as the generalized polar coordinates proposed in (2.2).
In order to define the notation, we first define a hierarchy of L + 1 grids with Ω_{l−1} ⊂ Ω_l, 1 ≤ l ≤ L, and |Ω_L| = n_r · n_θ. To identify matrices and vectors on grid Ω_l, we use the subindex l, 0 ≤ l ≤ L. The iterates of step m are characterized by a superindex m, m ≥ 0. The restriction operator from grid l to grid l − 1 is denoted by I_l^{l−1}, and I_{l−1}^l represents the interpolation from grid l − 1 to grid l. The presmoothing operation with ν_1 steps is denoted by S^{ν_1}, the postsmoothing operation with ν_2 steps by S^{ν_2}. The multigrid cycle for the new iterate u^{m+1} is then given recursively for 0 ≤ l ≤ L.
The multigrid cycle:
- Presmooth: u_l^{m+1/3} = S^{ν_1}(u_l^m, K_l, f_l).
- Restrict the residual: r_{l−1} = I_l^{l−1}(f_l − K_l u_l^{m+1/3}).
- Solve the coarse-grid system K_{l−1} e_{l−1} = r_{l−1}, exactly on the coarsest grid l − 1 = 0 and by recursive application of the cycle otherwise, starting from ♦.
- Interpolate the correction: e_l^{m+2/3} = I_{l−1}^l e_{l−1}, and correct: u_l^{m+2/3} = u_l^{m+1/3} + e_l^{m+2/3}.
- Postsmooth: u_l^{m+1} = S^{ν_2}(u_l^{m+2/3}, K_l, f_l).

In the recursive call, ♦ stands for zero as a first approximation and in further calls (W-cycle) for an approximation taken from the previous cycle.
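The recursive cycle can be illustrated on a 1D model problem. This hedged sketch uses damped Jacobi smoothing, full weighting, and linear interpolation in place of the zebra line smoothers and tailored transfer operators developed in this article.

```python
import numpy as np

def smooth(u, f, h, nu, omega=2.0/3.0):
    # Damped Jacobi smoothing for the 1D stencil (-1, 2, -1)/h^2
    # (standing in for the zebra line smoothers of this paper).
    for _ in range(nu):
        res = np.zeros_like(u)
        res[1:-1] = f[1:-1] - (2.0 * u[1:-1] - u[:-2] - u[2:]) / h**2
        u = u + omega * (h**2 / 2.0) * res
    return u

def vcycle(u, f, h, nu1=1, nu2=1):
    n = u.size - 1                       # number of intervals (a power of two)
    if n == 2:                           # coarsest grid: one interior dof
        u = u.copy()
        u[1] = 0.5 * h**2 * f[1]
        return u
    u = smooth(u, f, h, nu1)             # presmoothing
    res = np.zeros_like(u)
    res[1:-1] = f[1:-1] - (2.0 * u[1:-1] - u[:-2] - u[2:]) / h**2
    rc = np.zeros(n // 2 + 1)            # restrict the residual (full weighting)
    rc[1:-1] = 0.25 * (res[1:-2:2] + 2.0 * res[2:-1:2] + res[3::2])
    ec = vcycle(np.zeros_like(rc), rc, 2.0 * h)   # recursive coarse-grid solve
    e = np.zeros_like(u)                 # interpolate the correction (linear)
    e[::2] = ec
    e[1::2] = 0.5 * (ec[:-1] + ec[1:])
    return smooth(u + e, f, h, nu2)      # postsmoothing

# Usage: solve -u'' = pi^2 sin(pi x), u(0) = u(1) = 0, exact solution sin(pi x)
n = 64
h = 1.0 / n
x = np.linspace(0.0, 1.0, n + 1)
f = np.pi**2 * np.sin(np.pi * x)
u = np.zeros(n + 1)
for _ in range(20):
    u = vcycle(u, f, h)
```

Passing zero as the initial coarse iterate corresponds to the choice ♦ = 0 above; a W-cycle would instead reuse the approximation from the previous visit.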

Optimized Zebra Line Smoothers
For highly anisotropic problems, point relaxation and standard coarsening (i.e., coarsening by a factor of 2 in each dimension) do not yield satisfactory results. Pointwise smoothing then has only poor smoothing properties with respect to weakly-coupled degrees of freedom (dofs); cf. [35, Sec. 5.1]. In the context of multigrid, we speak of strong coupling of one dof to another if the corresponding offdiagonal entry of the considered matrix is relatively large compared to the other offdiagonal entries of the same dof. If the entry is relatively small, we speak of weak coupling.
If the anisotropy is aligned with the grid, standard coarsening can be kept and only the smoothing operation has to be adapted to obtain good multigrid performance. Line relaxations are block relaxations where all the connections between degrees of freedom of one line are taken into account to update this line in one single step. Using line relaxation, errors become smooth if strongly connected degrees of freedom are updated together. For a more detailed introduction to line smoothers, see, e.g., [35,Sec. 5.1].
For compact finite difference stencils and linear nodal basis functions, zebra line smoothers correspond to Gauß-Seidel line relaxation methods where all even and all odd lines (rows or columns), respectively, are processed simultaneously. For operators where the anisotropy changes across the domain, alternating zebra relaxation has been proposed; see [33]. The polar coordinate transformed Laplace operator yields strong connections on circle lines in the interior part of the domain and strong connections on radial lines in the outer part; cf. (2.4). Consequently, alternating zebra relaxation was proposed for the unit disk [1]. We will now briefly introduce zebra relaxation and then explain our particular choice of smoothers for all parts of the (deformed) domain from Fig. 1 described by curvilinear coordinates.

Fig. 5 Circle (left) and radial (second to left) zebra coloring for the equidistantly discretized annulus with r_1 = 1e−6. Nonzero pattern of K_{l,BB} restricted to an interior circle (second to right) and of K_{l,BB} restricted to one radial direction (right), assuming a circle-by-circle numbering of the nodes with n_{l,r} = n_{l,θ} = 9. The periodic boundary conditions at (r_{l,i}, θ_{n_{l,θ}}) = (r_{l,i}, 2π) introduce the interaction in the upper right and lower left corners of the circle relaxation operator (second to right). The first and last line of the radial relaxation operator (right) only have one entry since Dirichlet boundary conditions were set there; entries ·_{1,9} and ·_{57,65} were put on the right hand side. Only the corresponding nonzero rows and columns are printed.

Let n_l = n_{l,r} · n_{l,θ} be the number of nodes on grid l ∈ {0, …, L}. Furthermore, let B_l and W_l be disjoint index sets such that B_l ∪ W_l = {1, 2, …, n_l}; by reordering, an equivalent permuted system (u_l, K_l, f_l) is obtained. In order to smooth the coarse degrees of freedom first, we will always color them in black.
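One zebra sweep can be sketched as follows. This is a simplified model in which every circle line is coupled to its neighbors by a single scalar; `A_line` carries the periodic tridiagonal line matrices of the kind shown in Fig. 5, and the dense `np.linalg.solve` stands in for a specialized banded or cyclic solver.

```python
import numpy as np

def zebra_circle_relax(u, f, A_line, coupling):
    # One zebra sweep: all lines of one color are updated simultaneously,
    # then all lines of the other color.  A_line[i] is the matrix coupling
    # the dofs of circle i (tridiagonal plus periodic corner entries);
    # coupling[i] is a scalar coupling to the two neighboring circles --
    # a simplification of the true, theta-dependent stencil entries.
    n_r, _ = u.shape
    for parity in (0, 1):
        for i in range(1 + parity, n_r - 1, 2):
            rhs = f[i] + coupling[i] * (u[i - 1] + u[i + 1])
            u[i] = np.linalg.solve(A_line[i], rhs)  # direct line solve (sketch)
    return u

# Model problem: 5-point stencil, periodic in theta, Dirichlet rows i = 0, n_r-1
n_r, n_t = 8, 8
M = 4.0 * np.eye(n_t)
for j in range(n_t):
    M[j, (j - 1) % n_t] -= 1.0   # periodic tridiagonal line matrix
    M[j, (j + 1) % n_t] -= 1.0
A_line, coupling = [M] * n_r, np.ones(n_r)
rng = np.random.default_rng(0)
u_exact = np.zeros((n_r, n_t))
u_exact[1:-1] = rng.standard_normal((n_r - 2, n_t))
f = np.zeros_like(u_exact)
for i in range(1, n_r - 1):
    f[i] = M @ u_exact[i] - u_exact[i - 1] - u_exact[i + 1]
u = np.zeros_like(u_exact)
for _ in range(300):
    u = zebra_circle_relax(u, f, A_line, coupling)
```

Since the lines of one color do not couple with each other, each inner loop could be executed fully in parallel, which is the property motivating the zebra ordering.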

Remark 4.1
Note that the zebra-line Gauss-Seidel preconditioner is not triangular but block-triangular. That means that all nonzero entries shown in Fig. 5 (also those in the upper triangular part) remain on the left hand side of the system; the shown entries all belong to the same line (row or column). For larger finite difference stencils or hierarchical finite element bases, K_{l,BB} and K_{l,WW} from (4.2) may have more than three nonzeros per row. Then, either more colors have to be used or a part of the upper triangular matrix has to be brought to the right hand side.
Let us consider the annulus Ω_{h_i} := [r_i, r_i + h_i] × [0, 2π] as an individual domain, with a constant discretization parameter k_j = k in the second dimension, i.e., n_θ k = 2π. From [1], we know that the smoothing factors of circle and radial relaxation, μ_{CZ,h_i,k_j} and μ_{RZ,h_i,k_j}, on Ω_{h_i} are given in terms of q_{i,j} = k_j/h_i as well as C_C ∈ {0.23, 0.34}, depending on r_i ≥ 0, and C_R = 0.23, independently of r_i. From Fig. 6 (left), we see that both relaxations behave very differently on different annuli of size h_i of the global domain. We see that radial relaxation is prohibitive around the origin but shows good smoothing behavior for r → 1.3. Circle relaxation shows good smoothing behavior around the origin but does not provide essential smoothing where the mesh was refined and for r → 1.3. In order to obtain a reasonable smoothing procedure on the entire domain, we thus have to combine circle relaxation with radial relaxation. In [1], alternating zebra relaxation, consisting of one step with each smoothing operator, was proposed.
To reduce the workload and to optimize the smoothing operation, we propose the following smoothing procedure. Since circle relaxation leads to good smoothing around the origin, we color the nodes around the origin in circle lines. For each following circle with radius r_i > r_1, we then check, in accordance with (4.3), whether radial relaxation yields the better smoothing factor (4.4), and change to radial relaxation if this is the case. Note that we use the fact that k_j is constant on each circle line represented by r_i, i = 1, …, n_r. We then obtain a decomposition of the domain into two subdomains in which different relaxation methods are used; see Fig. 6.
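The per-circle selection between the two relaxation types can be sketched as follows. The switch condition used here, comparing the physical angular spacing r_i·k with the radial spacing h_i, is a hypothetical stand-in for the paper's criterion (4.4), not the criterion itself.

```python
import numpy as np

def choose_relaxation(r, k):
    # Label each circle line i with the zebra relaxation type to use.
    # Switch condition (an assumed stand-in for (4.4)): use circle
    # relaxation while the cell aspect ratio r_i*k/h_i is below 1,
    # radial relaxation once it exceeds 1.
    h = np.diff(r)                    # radial mesh sizes h_i
    labels = []
    for i in range(h.size):
        aspect = r[i] * k / h[i]      # physical cell aspect ratio on circle i
        labels.append("circle" if aspect < 1.0 else "radial")
    return labels

r = np.linspace(1e-6, 1.3, 14)                       # radii of the circle lines
labels = choose_relaxation(r, 2.0 * np.pi / 64)
```

Because the aspect ratio grows monotonically with r_i on this grid, the labels switch exactly once, reproducing the two-subdomain decomposition of Fig. 6.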
Although the decomposition rule (4.4) was developed for a domain described by polar coordinates, we also use this as a rule of thumb for the deformed geometries described by transformation (2.2). See Sect. 5.2 for a numerical evaluation.
The optimized presmoothing operation S^{ν_1}(u_l^m, K_l, f_l) is then given with six colors: black (for circle and radial, denoted B_C and B_R), white (for circle and radial, denoted W_C and W_R), and orange (denoted O_C and O_R), which itself is not smoothed. The values u_{l,O_*}^{m,i} on the orange colored part of the domain contain the interface boundary conditions for each half-step of the smoother. Note that only those values next to the interior interface, which represent the interface boundary conditions, have to be updated in each step of the iterative process. The larger the stencil, the more lines have to be updated in practice.
Note that (4.5) is not parallelized across the two different smoothers (circle and radial) but that the radial smoothers use information from the circle smoothers. If no information is exchanged, an increase in iterations is to be expected. However, if the color of the outermost circle-smoother line is smoothed first, then for compact FE or FD stencils such as provided in this paper, the two sequential colors of the radial smoothers can be executed in parallel with the second color of the circle smoother. Since the circle color lines are of larger size than the radial color lines, both parallel steps are expected to finish at similar times.

Coarsening and Intergrid Transfer Operators
The coarsening and intergrid transfer operators use classical choices. We always employ standard coarsening, and, if the additional extrapolation algorithm is not used, we use bilinear interpolation, which is also well-defined for anisotropic meshes. In case of implicit extrapolation, we use bilinear interpolation for l = 1, …, L − 1 only, and the transfer between the two finest grids is adapted. As presented in Sect. 4.3, extrapolation will only affect the transfer between the two finest grid levels. In case (0, 0) is an actual discretization node and is chosen as first coarse node, we have to adapt the restriction and prolongation there. Our restriction operator is always defined as the adjoint of the prolongation; see (4.6).
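The prolongation and its plain adjoint as restriction can be sketched in 1D. This is an illustrative sketch: `prolongation_matrix` is a hypothetical helper, and the piecewise-linear weights remain well-defined on graded (nonuniform) grids.

```python
import numpy as np

def prolongation_matrix(r_coarse, r_fine):
    # Piecewise-linear interpolation matrix P from a coarse 1D grid to a
    # fine grid obtained by standard coarsening; every coarse node also
    # appears as a fine node.  The restriction is then the plain adjoint
    # R = P.T without any extra scaling constant (cf. (4.6)).
    P = np.zeros((r_fine.size, r_coarse.size))
    for i, rf in enumerate(r_fine):
        J = np.searchsorted(r_coarse, rf)
        if J < r_coarse.size and r_coarse[J] == rf:
            P[i, J] = 1.0                    # fine node coincides with coarse node
        else:
            w = (rf - r_coarse[J - 1]) / (r_coarse[J] - r_coarse[J - 1])
            P[i, J - 1], P[i, J] = 1.0 - w, w  # linear weights on a graded grid
    return P

r_c = np.array([0.0, 0.4, 1.0, 1.3])         # nonuniform coarse radii
r_f = np.array([0.0, 0.2, 0.4, 0.7, 1.0, 1.15, 1.3])
P = prolongation_matrix(r_c, r_f)
R = P.T                                      # adjoint restriction, no extra scaling
```

Omitting the usual scaling factor in R reflects the situation described in Remark 4.2, where the right hand side itself already carries the local O(h_i k_j) scaling.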

Remark 4.2 (Scaling between prolongation and restriction)
Note that there is no scaling constant in definition (4.6) since, for the finite element discretizations as well as for our tailored finite difference schemes, the right hand side is locally scaled with O(h_i k_j), 1 ≤ i ≤ n_r, 1 ≤ j ≤ n_θ; cf. [18] for details on the derivation of the finite difference stencils. As a potential source of implementation errors, this has to be taken into account.

Implicit Extrapolation
In this section, we introduce the implicit extrapolation step within our multigrid algorithm, based on the extrapolation strategy of [15,16,18]. The extrapolation step is only conducted between the two finest levels of the multigrid hierarchy, affecting the operators on and the interpolation between Ω_L and Ω_{L−1}. Let us assume that the coarse degrees of freedom are ordered before the fine degrees of freedom. By using the indices ·_c for coarse and ·_f for fine nodes, we obtain a corresponding 2 × 2 block partitioning of the system on Ω_L. The new smoother on the finest level is the previously defined smoother only acting on the fine nodes.

Remark 4.3
Only the fine grid nodes are smoothed on the first level and the nodes belonging to the coarse grid are excluded from the smoothing operation. This differs from the introduction of τ -extrapolation in [2,6,14]. The weaker smoother may lead to a reduced algebraic convergence of the multigrid iteration, but it has the advantage that the fixed point of the multigrid iteration is uniquely defined. For more details, we refer to [15,p. 173] and the references therein.
Before presenting the extrapolated multigrid cycle, we must also introduce the modified intergrid transfer operators between Ω_{L−1} and Ω_L. In [15], only constant coefficients were considered. Note that in our applications, due to the transformation of the physical domain, even α ≡ 1 leads to nonconstant coefficients; cf. Sect. 2. Nonconstant coefficients were considered with hierarchical bases in [16]. In contrast to [16], we use the intergrid transfer operator given in [15]; this results from the discretization by nodal basis functions.
The proof of Remark 4.4 is based on the relation between linear nodal, quadratic, and h- and p-hierarchical basis functions. The transfer operator I_{L−1}^L is part of the transformation between a nodal and a hierarchical basis; see also [18, Sec. 4.4.1]. The necessary relations [15, (55) and (56)] are formally proven for nonconstant coefficients in [18, Lemma 4.2, Theorem 4.3]. In particular, the extrapolated residual can be written such that the term in brackets corresponds to the residual computation of the quadratic approach.

Remark 4.5
We note that the direct discretization of a PDE with higher order finite elements will typically lead to a denser matrix structure and consequently to a higher flop cost per matrix-vector multiplication or smoother application. Here, we construct an equivalent higher order discretization by way of a clever recombination of low order components as they arise canonically in a multigrid solver. In this way, we avoid the explicit setup of any more expensive higher order discrete operator. In other words, the implicit extrapolation multigrid method leads to a qualitatively equivalent high order discretization at reduced cost. This reduces the memory footprint by avoiding the setup of more densely populated matrices, and it avoids the corresponding memory traffic and higher flop cost incurred in each iteration of an iterative solution process. In fact, except for the computation of the extrapolated residual in the restriction phase, the cost of the extrapolated multigrid algorithm is the same as for a standard low order discretization. Of course, this alone does not account for other solver costs, such as those induced by the possibly slower (algebraic) convergence of the extrapolated multigrid algorithm (meaning that more iterations are needed) and the need to solve the discrete system to higher (algebraic) accuracy in order to exploit the lower discretization error. Because of these two effects, computing a proper solution with the extrapolated multigrid algorithm is still expected to be more expensive than solving for a low order discretization. For an in-depth analysis of the so-called textbook efficiency of parallel multigrid algorithms, see also [11,17].

Numerical Results
In this section, we study (1.1) with α as given in (2.3) to model the density of the fusion plasma according to [32,38]. As a test case for our new method, we use the manufactured solution

u(x, y) = (1.3² − r²(x, y)) cos(2πx) sin(2πy),   (5.1)

where r(x, y) is defined by (2.1) or (2.2). The right hand side f and the Dirichlet boundary conditions on (r, θ) ∈ {1.3} × [0, 2π] are given accordingly. This example is taken from [39].
We thank Edoardo Zoni for providing his Python script for symbolic differentiation; in the interest of saving space, we refrain from presenting the right hand side explicitly. We use an anisotropic discretization in r ∈ [r_1, 1.3] with r_{i+1} = r_i + h_i to account for the density profile drop in the edge area of the separatrix; cf. [12,38]. We restrict the anisotropy to h = max_i h_i = 8 min_i h_i.
For our multigrid algorithm, we conduct one step of pre- and one step of post-smoothing, i.e., ν = ν_1 + ν_2 = 2. In prospect of a parallel implementation, we only use V-cycles. We use a strong convergence criterion by demanding a relative residual reduction by a factor of 10^8. The maximum number of iterations is set to 150. In all tables, we provide the finest mesh size as n_r × n_θ. We also provide the iteration count of the multigrid algorithm needed for convergence as its, as well as the mean residual reduction factor ρ. Note that the measured ρ is generally slightly smaller than the theoretical reduction factor and becomes more precise when more iterations are executed. For all simulations, we present the error of the iterative solution compared to the exact solution evaluated at the nodes in the (weighted) ‖·‖_2-norm and in the ‖·‖_∞-norm. We also provide the error reduction order as ord. for both norms. The order is calculated from the error reduction between one row and its predecessor, i.e., ord = log(err_{k−1}/err_k) / log(gridsize_k/gridsize_{k−1}).
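The stopping rule and the reported quantities can be summarized in a small driver loop. Here `cycle` is a hypothetical placeholder for one V-cycle of the multigrid algorithm; any contraction works for illustration.

```python
def iterate_to_tolerance(cycle, x0, residual_norm, tol=1e-8, max_it=150):
    """Iterate until the residual norm drops by the factor 1/tol
    relative to the initial residual, or max_it is reached.

    Returns the final iterate, the iteration count its, and the
    mean residual reduction factor rho = (res/res0)^(1/its).
    """
    x = x0
    res0 = residual_norm(x)
    res, its = res0, 0
    while res > tol * res0 and its < max_it:
        x = cycle(x)               # one V-cycle (placeholder)
        res = residual_norm(x)
        its += 1
    rho = (res / res0) ** (1.0 / its) if its > 0 else 1.0
    return x, its, rho
```

The measured ρ reported in the tables corresponds to the geometric mean of the per-iteration reduction factors, which is why it stabilizes as more iterations are executed.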
The orders were computed with error norms rounded to the fifth significant digit.
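The reported order can be computed as in the following sketch, where n denotes the number of mesh intervals per direction, so that the mesh size scales like 1/n.

```python
import math

def convergence_order(err_coarse, err_fine, n_coarse, n_fine):
    """Observed convergence order between two successive refinements:
    ord = log(err_coarse / err_fine) / log(n_fine / n_coarse)."""
    return math.log(err_coarse / err_fine) / math.log(n_fine / n_coarse)
```

For a second order method, halving the mesh size (doubling n) reduces the error by a factor of four, giving ord = 2.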
Remark 5.1 (Residual and algebraic error convergence) As mentioned in Remark 4.4, the implicitly extrapolated multigrid algorithm can be considered as a multigrid algorithm based on a second order discretization. Consequently, we formulate the convergence requirement in terms of the residual, whose norm is directly available from the multigrid context; cf. (4.9).
We test several different configurations and provide comparisons in the following sections.
- In Sect. 5.1, we show that neither circle nor radial relaxation alone is sufficient to obtain fast convergence of our multigrid algorithm. Our choice of optimized circle-radial relaxation always leads to fast convergence.
- In Sect. 5.2, we show that (4.5) results in an optimal domain decomposition to execute the optimized smoothing operations.
- In Sect. 5.3, we compare the multigrid algorithm based on finite elements with nonstandard integration and on finite difference discretizations for different approaches to handle the artificial singularity from Sect. 3.3.
- In Sect. 5.4, we proceed similarly to Sect. 5.3 by using the multigrid algorithm with implicit extrapolation as described in Sect. 4.3.

Table 1 Multigrid without extrapolation based on finite difference discretizations on circular and deformed geometry with r_1 = 1e-8 and Dirichlet boundary conditions on the innermost circle. Iteration counts its (max. its = 150), mean residual reduction factor ρ, and errors of the iterative solution to the exact solution evaluated at the nodes in the ‖·‖_2 and ‖·‖_∞ norms.

Remark 5.2 (Expectations on convergence orders) For the non-extrapolated versions of our solvers, we expect quadratic error convergence in the ℓ_2-norm, as, e.g., predicted by the derivation of the finite difference stencils. For the implicitly extrapolated version, we expect quartic convergence on a non-refined grid, and orders close to 3.7 if a fast descending diffusion coefficient and local grid refinement are used. For more details, see [18].

Multigrid with Circle, Radial, and Optimized Circle-Radial Relaxation
In this section, we study different smoothing procedures: circle smoothing, radial smoothing, and our optimized circle-radial smoothing described in Sect. 4.1 and denoted as optimized smoothing in Table 1. For r_1 = 0, radial relaxation is prohibitive since the origin couples all directions. We thus choose r_1 = 1e-8 and enforce Dirichlet boundary conditions on the innermost circle to avoid an additional influence of the artificial singularity. From Table 1, we see that neither circle nor radial smoothing alone is sufficient to obtain satisfactory residual reduction factors. Note that pointwise smoothers yielded even worse results. The optimized smoother, although it also smooths each node only once, yields good results (i.e., quadratic error reduction).

Multigrid with Optimized Circle-Radial Relaxation
In this section, we numerically show the optimality of our circle-radial domain decomposition by testing it against other decompositions. As a basic rule, we use (4.5). We compare this optimized smoother with decompositions in which ±n (additional or fewer) circles, n ∈ N, are colored circle by circle, while the remaining part is colored in a radial manner. Table 2 shows that the optimal residual reduction factor, as well as the minimum number of iterations, is obtained with rule (4.5).

Table 2 Here, ±n means that n circles more or fewer are colored circle-wise instead of radial-wise than proposed by (4.5). Further notation as in Table 1.

Table 3 Multigrid without extrapolation based on finite element and finite difference discretizations on circular and deformed geometry with r_1 = 0. Error reduction given by ord.; further notation as in Table 1.

Multigrid Based on Different Discretizations
In this section, we consider different ways to handle the artificial singularity as proposed in Sect. 3.3. We consider the case where r_1 = 0, i.e., where (0, 0) is a node of the grid, as well as r_1 ∈ {1e-2, 1e-5, 1e-8}. For r_1 > 0, we consider the case of Dirichlet boundary conditions as well as our strategy of discretizing across the origin; cf. Sect. 3.3.3. We observe that we obtain identical results for the configurations with Dirichlet boundary conditions on the innermost circle and with discretizing across the origin, respectively, as r_1 → 0; cf. Table 5.
We consider the multigrid algorithm based on the finite element discretization with nonstandard integration techniques and on the finite difference five point and nine point stencils, respectively, where the latter is only used when the deformed geometry is considered.
From Tables 3, 4, and 5, we first see that the multigrid algorithm needs about twice as many iterations if (0, 0) is used as an explicit mesh node. For the circular geometry and r_1 > 0, we only need 13 iterations to reduce the residual by a factor of 10^8. The number of iterations and the residual reduction factors are higher in the case of the deformed geometry, but the number of iterations still lies only between 41 and 47. The convergence of the iterative scheme is (almost) independent of the choice of how to handle r_1 > 0. The number of iterations of our multigrid algorithm is independent of the discretization parameter. The error convergence over the different levels of discretization is unsatisfactory for the finite element discretization if r_1 = 0. For r_1 = 1e-2, our strategy to discretize across the origin also leads to unsatisfactory results; r_1 = 1e-2 might still be too large for this heuristic. For r_1 ∈ {1e-5, 1e-8}, we obtain identical results with this heuristic and with Dirichlet boundary conditions on the innermost circle. We have optimal quadratic error convergence in the ℓ_2- as well as in the ∞-norm.

(Table caption) Multigrid with extrapolation based on finite element and finite difference discretizations on circular and deformed geometry with r_1 = 0. Further notation as in Table 3.

Extrapolated Multigrid Based on Different Discretizations
We now consider the multigrid algorithm as in the previous section, only adding our extrapolation step between the two finest grids. Note that the analysis of [15] predicts an accuracy equivalent to P2 elements, and thus O(h^3) in the L^2-norm, up from O(h^2). The meshes enjoy an approximate symmetry, which can lead to a cancellation of odd error terms. In fact, we will observe below that the approximation order comes close to O(h^4) in some cases.
From Tables 6 and 7, we again see that the multigrid algorithm needs about twice as many iterations if (0, 0) is used as an explicit mesh node.
For the circular geometry and r_1 > 0, we need fewer than 40 iterations to reduce the residual by a factor of 10^8. The number of iterations and the residual reduction factors are higher in the case of the deformed geometry, but the iteration counts still lie only between 73 and 85. Again, the number of iterations is independent of the discretization parameter.
The slower convergence (compared to the previous section) is due to the influence of a strengthened Cauchy inequality; for more details, we refer to Remark 4.5 and [15]. The error convergence over the different levels of discretization is unsatisfactory for the finite element discretization if r_1 = 0.
For r_1 > 0, the convergence of the iterative scheme is independent of the choice of how to handle the innermost circle, provided r_1 is sufficiently small. For r_1 ∈ {1e-5, 1e-8}, we obtain almost identical results with the heuristic of discretizing across the origin and with Dirichlet boundary conditions on the innermost circle. We observe an error convergence order between 3.5 and 4.0 in the ℓ_2-norm and of about 3.0 in the ∞-norm.

Conclusion
We have presented a novel scalable geometric multigrid solver for a Poisson equation arising in gyrokinetic fusion plasma models. We have developed a new optimized circle-radial smoothing procedure to take into account the anisotropies of the underlying partial differential equation and of the mesh, particularly in the edge area of the separatrix in the Tokamak. Furthermore, we have constructed an implicit extrapolation scheme that leads to third order error convergence in the ∞-norm and shows an error convergence order between 3.5 and 4.0 in the ℓ_2-norm. For meshes and geometries simpler than those considered here, e.g., without deformation, artificial singularity, and anisotropic mesh refinement, we expect convergence orders of up to four, since the odd order terms in the error expansions then vanish due to symmetry.
When using implicit extrapolation, the iteration counts are slightly larger, but they remain modest and independent of the mesh size, so that the solver is asymptotically optimal. This is necessary for algorithmic scalability. Further improvements may be possible based on more efficient smoothing procedures. The numerical results for our multigrid algorithm based on finite elements with nonstandard integration and on the finite difference nine point stencil are almost identical. Our extrapolated finite difference stencil thus gives rise to a matrix-free implementation with low memory footprint and high precision. With the fast implicitly extrapolated multigrid method, we have constructed an algorithm to compute cost-effective high precision approximations to the solution of the gyrokinetic Poisson equation.
Author Contributions All authors developed the numerical algorithm; the implementation was done by M.J.K. and C.K.; all authors wrote and revised the manuscript.
Funding Open Access funding enabled and organized by Projekt DEAL. The first two authors gratefully acknowledge the funding from the European Union's Horizon 2020 research and innovation program under grant agreement no. 824158.

Availability of Data and Material
No particular data was used.

Conflict of interest/Competing interests None.
Code Availability Code can be shared upon request.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.