An accelerated Poisson solver based on multidomain spectral discretization

This paper presents a numerical method for variable coefficient elliptic PDEs with mostly smooth solutions on two dimensional domains. The PDE is discretized via a multi-domain spectral collocation method of high local order (order 30 and higher have been tested and work well). Local mesh refinement results in highly accurate solutions even in the presence of local irregular behavior due to corner singularities, localized loads, etc. The system of linear equations attained upon discretization is solved using a direct (as opposed to iterative) solver with $O(N^{1.5})$ complexity for the factorization stage and $O(N \log N)$ complexity for the solve. The scheme is ideally suited for executing the elliptic solve required when parabolic problems are discretized via time-implicit techniques. In situations where the geometry remains unchanged between time-steps, very fast execution speeds are obtained since the solution operator for each implicit solve can be pre-computed.


Introduction
This manuscript describes a direct solver for elliptic PDEs with variable coefficients, such as, e.g., (1) [Au](x) = g(x), x ∈ Ω, where A is a variable coefficient elliptic differential operator where Ω is a rectangular domain in R 2 with boundary Γ = ∂Ω, where all coefficient functions (c, c i , c ij ) are smooth, and where f and g are given functions. The generalization to domains that are either unions of rectangles, or can via local parameterizations be mapped to a union of rectangles is relatively straight-forward [10,Sec. 6.4]. The technique is specifically developed to accelerate implicit time stepping techniques for parabolic PDEs such as, e.g., the heat equation When (3) is discretized using an implicit time-stepping scheme (e.g. backwards Euler or Crank-Nicolson), one is required to solve for each time-step an equation of the form (1), see Section 7.6.
With the ability to combine very high order discretizations with a highly efficient means of timestepping parabolic equations, we believe that the proposed method will be particularly well suited for numerically solving the Navier-Stokes equation at low Reynolds numbers. The proposed solver is direct and builds an approximation to the solution operator of (1) via a hierarchical divide-and-conquer approach. It is conceptually related to classical nested dissection and multifrontal methods [1,2,3], but provides tight integration between the direct solver and the discretization procedure. Locally, the scheme relies on high order spectral discretizations, and collocation of the differential operator. We observe that while classical nested dissection and multifrontal solvers slow down dramatically as the discretization order is increased [5, Table 3], the proposed method retains high efficiency regardless of the discretization order. The method is an evolution of the scheme described in [10,9], and later refined in [4,5,6]. One novelty of the present work is that it describes how problems with body loads can be handled efficiently (the previous papers [10,4,5] consider the case where g = 0 in (1)). A second novelty is that local mesh refinement is introduced to enable the method to accurately solve problems involving concentrated loads, singularities at re-entrant corners, and other phenomena that lead to localized loss of regularity in the solution. (In contrast, the previous papers [10,4,5,6] restrict attention to uniform grids.) The principal advantage of the proposed solver, compared to commonly used solvers for (1), is that it is direct (as opposed to iterative), which makes it particularly well suited for problems for which efficient pre-conditioners are difficult to find, such as, e.g., problems with oscillatory solutions. The cost to build the solution operator is in the most basic version of the scheme O(N 3/2 ), where N is the number of discretization points. However, the practical efficiency of the solver is very high and the superlinear scaling is hardly visible until N > 10 7 . When the number of discretization points is higher than 10 7 , the scheme can be modified to attain linear complexity by implementing techniques analogous to those described in [4]. Once the solution operator has been built, the time required to apply it to execute a solve given a boundary condition and a body load is either O(N log N ) for the basic scheme, or O(N ) for the accelerated scheme, with a small scaling constant in either case. In Section 7, we demonstrate that even when N = 10 6 , the time for solving (1) with a precomputed solution operator is approximately one second on a standard office laptop.
The discretization scheme we use is related to earlier work on spectral collocation methods on composite ("multi-domain") grids, such as, e.g., [8,13], and in particular Pfeiffer et al [11]. The differences and similarities between the various techniques is discussed in detail in [10]. Our procedure is also conceptually related to so-called "reduction to the interface" methods, see [7] and the references therein. Such "interface" methods also use local solution operators defined on boundaries but typically rely on variational formulations of the PDE, rather than the collocation techniques that we employ.
The manuscript is organized as follows: Section 2 provides a high level description of the proposed method. Sections 3 and 4 describe the local discretization scheme. Section 5 describes the nested dissection type solver used to solve the system of linear equations resulting from the discretization. Section 6 describes how local mesh refinement can be introduced to the scheme. Section 7 provides results from numerical experiments that establish the efficiency of the proposed method.

Overview of algorithm
The proposed method is based on a hierarchical subdivision of the computational domain, as illustrated in Figure 1 for the case of Ω = [0, 1] 2 . In the uniform mesh version of the solver, the tree of boxes is built by recursively splitting the original box in halves. The splitting continues until each box is small enough that the solution, and its first and second derivatives, can accurately be resolved on a local tensor product grid of p × p Chebyshev nodes (where, say, p = 10 or p = 20).
Once the tree of boxes has been constructed, the actual solver consists of two stages. The first, or "build", stage consists of a single upwards pass through the tree of boxes, starting with the leaves and going up to larger boxes. On each leaf, we place a local p × p tensor product grid of Chebyshev nodes, and then discretize the restriction of (1) via a basic collocation scheme, as in [12]. By performing dense linear algebraic operations on matrices of size at most p 2 × p 2 , we form for each leaf a local solution operator and an approximation to the local Dirichlet-to-Neumann (DtN) operator, as described in Section 3. The build stage then continues with an upwards pass through the tree (going from smaller boxes to larger) where for each parent box, we construct approximations to its local solution operator and its local DtN operator by "merging" the corresponding operators for its children, cf. Section 4. The end result of the "build stage" is a hierarchical representation of the overall solution operator for (1). Once this solution operator is available, the "solve stage" takes as input a given boundary data f and a body load g, and constructs an approximation to the solution u valid throughout the domain via two passes through the tree: first an upwards pass (going from smaller boxes to larger) where "particular solutions" that satisfy the inhomogeneous equation are built, and then a downwards pass where the boundary conditions are corrected.
The global grid of collocation points used in the upwards and downwards passes is obtained by placing on the edge of each leaf a set of q Gaussian interpolation nodes (a.k.a. Legendre nodes). Observe that this parameter q is in principle distinct from the local parameter p which specifies the order of the local Chebyshev grids used to construct the solution operators on the leaves. However, we typically choose p = q + 1 or p = q + 2.

Leaf computation
In this section, we describe how to numerically build the various linear operators (represented as dense matrices) needed for a given leaf Ω τ in the hierarchical tree. To be precise, let u be the solution to the local equation (4) [Au](x) = g(x), for some given (local) Dirichlet data d. We then build approximations to two linear operators that both take d and g as their inputs. The first operator outputs the local solution u on Ω τ and the second outputs the boundary fluxes of u on Γ τ .
3.1. Notation. We work with two sets of interpolation nodes on the domain Ω τ . First, let {y j } 4q j=1 denote the nodes obtained by placing q Gaussian nodes on each of the four sides of Ω τ . Next, let denote the nodes in a p × p Chebyshev grid on Ω τ . We partition the index vector for the nodes in the Chebyshev grid as {1, 2, . . . , p 2 } = I ce ∪ I ci so that I ce holds the (Chebyshev) exterior nodes and I ci holds the (Chebyshev) interior nodes. Let u c , u ci , u ce , and u ge denote vectors holding approximations to the values of the solution u at the interpolation nodes: Let v ge ∈ R 4q denote a vector holding boundary fluxes of u on the Gaussian grid, so that v ge (j) ≈ [∂ 1 u](y hj ) when y j lies on a vertical boundary, v ge (j) ≈ [∂ 2 u](y hj ) when y j lies on a horizontal boundary.
Observe that our sign convention for boundary fluxes means that a positive flux sometimes represents flow into the box and sometimes out of the box. Finally, let d ge and g ci denote tabulations of the  Figure 1. The square domain Ω is split into 4 × 4 leaf boxes. These are then gathered into a binary tree of successively larger boxes as described in Section 2.
One possible enumeration of the boxes in the tree is shown, but note that the only restriction is that if box τ is the parent of box σ, then τ < σ.
boundary data and the body load, Our objective is now to construct the matrices that map {d ge , g ci } to v ge and u c .
3.2. Discretization on the Cheyshev grid. In order to execute the local solve on Ω τ of (4), we use a classical spectral collocation technique, as described, e.g., in [12]. To this end, let D (1) and D (2) denote the p 2 × p 2 spectral differentiation matrices on the p × p Chebyshev grid. (In other words, for any function u that is a tensor product of polynomials of degree at most p − 1, the differentiation matrix exactly maps a vector of collocated function values to the vector of collocated values of its derivative.) Further, let A denote the matrix where C ij are diagonal matrices with entries {c ij (x k )} p 2 k=1 , and C i and C are defined analogously. Next, partition the matrix A to separate interior and exterior nodes via and A ci,ce = A(I ci , I ce ).
Collocating (4) at the interior nodes then results in the discretized equation where d ce = {d(x i )} i∈Ice encodes the local Dirichlet data d.

3.3.
Solving on the Chebyshev grid. While solving (5) gives the solution at the interior Chebychev nodes, it does not give a map to the boundary fluxes v ge that we seek. These are found by following the classic approach of writing the solution as the superposition of the homogeneous and particular solutions. Specifically, the solution to (4) is split as where w is a particular solution and where φ is a homogeneous solution Discretizing (6) on the Chebyshev grid, and collocating at the internal nodes, we get the equation Observing that w ce = 0, the particular solution is given by Analogously, the discretization of (7) on the Chebyshev grid yields A ci,ce φ ce + A ci,ci φ ci = 0.
Since φ ce = d ce , the homogeneous solution is given by Figure 2. Notation for the merge operation described in Section 4. Given two leaf boxes Ω α and Ω β , their union is denoted Ω τ = Ω α ∪ Ω β . The sets J 1 (black circles) and J 2 (black diamonds) form the exterior nodes, while J 3 (white circles) consists of the interior nodes.

3.4.
Interpolation and differentiation. Section 3.3 describes how to locally solve the BVP (4) on the Chebyshev grid via the superposition of the homogeneous and particular solutions. Note that this computation assumes that the local Dirichlet data d is given on the Chebyshev exterior nodes. In reality, this data will be provided on the Gaussian nodes, and we therefore need to introduce an interpolation operator that moves data between the different grids. To be precise, let L ce,ge denote a matrix of size 4(p − 1) × 4q that maps a given data vector d ge to a different vector (10) d ce = L ce,ge d ge 4(p − 1) × 1 4(p − 1) × 4q 4q × 1 as follows: An entry of d ce corresponding to an interior node is defined simply via a standard interpolation from the Gaussian to the Chebyshev nodes on the local edge alone. An entry of d ce corresponding to a corner node is defined as the average value of the two extrapolated values from the Gaussian nodes on the two edges connecting to the corner. (Observe that except for the four rows corresponding to the corner nodes, the matrix L ce,ge is a 4 × 4 block diagonal matrix.) Combining (8), (9), and (10), the solution to (4) on the Chebyshev grid is given by All that remains is now to determine the vector v ge of boundary fluxes on the Gaussian nodes. To this end, let us define a combined interpolation and differentiation matrix D ge,c of size 4q × p 2 via where L loc is a q × p interpolation matrix from a set of p Chebyshev nodes to a set of q Gaussian nodes, and where I s , I e , I n , I w are four index sets, each of length p, that point to the south, east, north, and west sides of the exterior nodes in the Chebyshev grid. By differentiating the local solution on the Chebyshev grid defined by (11), the boundary fluxes v ge are given by where H ge,ci = D ge,c F c,ci and T ge,ge = D ge,c S c,ge .

Merging two leaves
Consider a rectangular box τ consisting of two leaf boxes α and β, and suppose that all local operators for α and β defined in Section 3 have been computed. Our objective is now to construct the Dirichlet-to-Neumann operator for the bigger box τ from the local operators for its children. In this operation, only sets of Gaussian nodes on the boundaries will take part, cf. Figure 2. We group these nodes into three sets, indexed by vectors J 1 , J 2 , and J 3 , defined as follows: J 1 Edge nodes of box α that are not shared with box β. J 2 Edge nodes of box β that are not shared with box α. J 3 Edge notes that line the interior edge shared by α and β. We also define J τ ge = J 1 ∪ J 2 and J τ gi = J 3 as the exterior and interior nodes for the parent box τ . Finally, we let h α , h β ∈ R 4q denote two vectors that hold the boundary fluxes for the two local particular solutions w α and w β , cf. (12), Then the equilibrium equations for each of the two leaves can be written . Now partition the two equations in (14) using the notation shown in Figure 2 (The subscript "ge" is suppressed in (15) and (16) since all nodes involved are Gaussian exterior nodes.) Combine the two equations for v 3 in (15) and (16) to obtain the equation Using the relation (17) in combination with (15), we find that We now define the operators Constructing the approximate solution on the shared edge u τ gi can be viewed an upward pass to compute the approximate boundary flux by where w τ gi = X τ gi,gi h β 3 − h α 3 , followed by a downward pass u τ gi = S gi,ge u τ ge + w τ gi . Remark 1 (Physical interpretation of merge). The quantities w τ gi and h τ ge have a simple physical meaning. The vector w τ gi introduced above is simply a tabulation of the particular solution w τ associated with τ on the interior boundary Γ 3 , and h τ ge is the normal derivative of w τ . To be precise, w τ is the solution to the inhomogeneous problem, cf. (6) We can re-derive the formula for w| Γ 3 using the original mathematical operators as follows: First observe that for x ∈ Ω α , we have A(w τ − w α ) = g − g = 0, so the DtN operator T α applies to the function w τ − w α : (20) and (21) to eliminate (∂ n w τ )| 3 and obtain Observe that in effect, we can write the particular solution w τ as The function w τ must of course be smooth across Γ 3 , so the functionŵ τ must have a jump that exactly offsets the discrepancy in the derivatives of w α and w β . This jump is precisely of size h α − h β .

5.
The full solver for a uniform grid 5.1. Notation. Suppose that we are given a rectangular domain Ω, which has hierarchically been split into a binary tree of successively smaller patches, as described in Section 2. We then define two sets of interpolation nodes. First, {x i } M i=1 denotes the set of nodes obtained by placing a p × p tensor product grid of Chebyshev nodes on each leaf in the tree. For a leaf τ , let I τ c denote an index vector pointing to the nodes in {x i } M i=1 that lie on leaf τ . Thus the index vector for the set of nodes in τ can be partitioned into exterior and interior nodes as follows The second set of interpolation nodes {y j } N j=1 is obtained by placing a set of q Gaussian ("Legendre") interpolation nodes on the edge of each leaf. For a node τ in the tree (either a leaf or a parent), let I τ ge denote an index vector that marks all Gaussian nodes that lie on the boundary of Ω τ . For a parent node τ , let I τ gi denote the Gaussian nodes that are interior to τ , but exterior to its two children (as in Section 4).
Once the collocation points have been set up, we introduce a vector u ∈ R M holding approximations to the values of the potential u on the Gaussian collocation points, We refer to subsets of this vector using the short-hand u τ ge = u(I τ ge ), and u τ gi = u(I τ gi ) for the exterior and interior nodes respectively. At the very end of the algorithm, approximations to u on the local Chebyshev tensor product grids are constructed. For a leaf node τ , let the vectors u τ c , u τ ce , and u τ ci denote the vectors holding approximations to the potential on sets of collocation points in the Chebyshev grid marked by I τ c , I τ ce , and I τ ci , respectively. Observe that these vectors are not subvectors of u.
Before proceeding to the description of the algorithm, we introduce two sets of auxiliary vectors. First, for any parent node τ , let the vector w τ gi denote the computed values of the local particular solution w τ that solves (6) on Ω τ , as tabulated on the interior line marked by I τ gi . Also, define h τ as the approximate boundary fluxes of w τ as defined by (13) for a leaf and by (18) for a parent.

The build stage.
Once the domain is partitioned into a hierarchical tree, we execute a "build stage" in which the following matrices are constructed for each box τ : S τ For a box τ , the solution operator that maps Dirichlet data ψ on ∂Ω τ to values of u at the interior nodes. In other words, u τ c = S τ c,ge ψ τ ge on a leaf or u τ gi = S τ gi,ge ψ τ ge on a parent box. T τ For a box τ , the matrix that maps Dirichlet data ψ on ∂Ω τ to the flux v on the boundary.
In other words, v τ ge = T τ ge,ge ψ ge . F τ For a leaf box, the matrix that maps the body load to the particular solution on the interior of the leaf assuming the Dirichlet data is zero on the boundary. In other words w τ c = F τ c,ci g ci . H τ For a leaf box, the matrix that maps the body load to the flux on the boundary of the leaf.
In other words h τ ge = H τ ge,ci g τ ci . X τ For a parent box τ with children α and β, the matrix that maps the fluxes of the particular solution for the children on the interior of a parent to the particular solution on the interior nodes. In other words The build stage consists of a single sweep over all nodes in the tree. Any ordering of the boxes in which a parent box is processed after its children can be used. For each leaf box τ , approximations S τ and F τ to the solution operators for the homogeneous and particular solutions are constructed. Additionally, approximations T τ and H τ to the local DtN map T τ for the homogeneous and particular solutions are constructed using the procedure described in Section 3. For a parent box τ with children α and β, we construct the solution operators X τ gi,gi and S τ gi,ge , and the DtN operator T τ ge,ge via the process described in Section 4. Algorithm 1 summarizes the build stage.
5.3. The solve stage. After the "build stage" described in Algorithm 1 has been completed, an approximation to the global solution operator of (1) has been computed, and represented through the various matrices (H τ , F τ , etc.) described in Section 5.2. Then given specific boundary data f and a body load g, the corresponding solution u to (1) can be found through a "solve stage" that involves two passes through the tree, first an upwards pass (from smaller to larger boxes), and then a downwards pass. In the upward pass, the particular solutions and normal derivatives of the particular solution are computed and stored in the vectors w and h respectively. Then by sweeping down the tree applying the solution operators S to the Dirichlet boundary data for each box τ and adding the particular solution, the approximate solution u is computed. Algorithm 2 summarizes the solve stage.

Algorithm 1 (Build stage for problems with body load)
This algorithm builds all solution operators required to solve the non-homogeneous BVP (1). It is assumed that if node τ is a parent of node σ, then τ < σ.
Let α and β be the children of τ . Partition I α ge and I β ge into vectors I 1 , I 2 , and I 3 as shown in Figure 2.
end if end for We observe that the vectors w τ gi can all be stored on a global vector w ∈ R N . Since each boundary collocation node y j belongs to precisely one index vector I τ gi , we simply find that w τ gi = w(I τ gi ).
Remark 2 (Efficient storage of particular solutions). For notational simplicity, we describe Algorithm 2 (the "solve stage") in a way that assumes that for each box τ , we explicitly store a corresponding vector h τ ge that represents the boundary fluxes for the local particular solution. In practice, these vectors can all be stored on a global vector h ∈ R N , in a manner similar to how we store w. For any box τ with children α and β, we store on h the difference between the boundary fluxes, so that h( In other words, as soon as the boundary fluxes have been computed for a box α, we add its contributions to the vector h(I α ge ) with the appropriate signs and then delete it. This becomes notationally less clear, but is actually simpler to code. 5.4. Algorithmic complexity. In this section, we determine the asymptotic complexity of the direct solver. The analysis is very similar to the analysis seen in [5] for no body load. Let N leaf = 4q denote the number of Gaussian nodes on the boundary of a leaf box, and let p 2 denote the number of Chebychev nodes used in the leaf computation. In the asymptotic analysis, we set p = q + 2, so that p ∼ q. Let L denote the number of levels in the binary tree. This means there are 2 L boxes. Thus the total number of discretization nodes N is approximately 2 L q 2 .
In processing a leaf, the dominant cost involves matrix inversion (or factorization followed by triangular solve) and matrix-matrix multiplications. The largest matrices encountered are of size O(q 2 ) × O(q 2 ), making the cost to process one leaf O(q 6 ). Since there are N/q 2 leaf boxes, the total cost of pre-computing approximate DtN operators for all the bottom level is ∼ (N/q 2 ) × q 6 ∼ N q 4 .

Algorithm 2 (Solver for problems with body load)
This algorithm constructs an approximation u to the solution u of (1). It uses the matrices that represent the solution operator that were constructed using Algorithm 3. It is assumed that if node τ is a parent of node σ, then τ < σ.
Upwards pass -construct all particular solutions: if (τ is a leaf) # Compute the boundary fluxes of the local particular solution.
Let α and β denote the children of τ . # Compute the local particular solution. w τ gi = X τ gi,gi −h α 3 + h β 3 . # Compute the boundary fluxes of the local particular solution.
Downwards pass -construct all potentials: # Use the provided Dirichlet data to set the solution on the exterior of the root. u(I 1 ge ) = {f (y k )} k∈I 1 ge . for τ = 1, 2, 3, . . . , N boxes if (τ is a parent) # Add the homogeneous term and the particular term. u τ gi = S τ gi,ge u τ ge + w τ gi . else # Add the homogeneous term and the particular term. u τ c = S τ c,ge u τ ge + F τ c,ci g τ ci . end end for Finally, consider the cost of the solve stage (Algorithm 2). We first apply at each of the 2 L leaves the operators H τ ge,ci , which are all of size 4q × (p − 2) 2 , making the overall cost ∼ 2 L q 3 = N q since p ∼ q and N ∼ 2 L q 2 . In the upwards sweep, we apply at level matrices of size O(2 − /2 N 0.5 ) × O(2 − /2 N 0.5 ) on 2 boxes, adding up to an overall cost of The cost of the downwards sweep is the same. However, the application of the matrices F τ c,ci at the leaves is more expensive since these are of size O(q 2 ) × O(q 2 ), which adds up to an overall cost of 2 L q 4 = N q 2 .
The analysis of the asymptotic storage requirements perfectly mirrors the analysis of the flop count for the solve stage, since each matrix that is stored is used precisely once in a matrix-vector multiplication. In consequence, the amount of storage required is Remark 3 (A storage efficient version). The storage required for all solution operators can become prohibitive when the local order q is high, due to the term N q 2 in (22). One way to handle this problem is to not store the local solution operators for a leaf, but instead simply perform small dense solves each time the "solve stage" is executed. This makes the solve stage slower, obviously, but has the benefit of completely eliminating the N q 2 term in (22). In fact, in this modified version, the overall storage required is ∼ N L ≈ N log 2 (N/q 2 ), so we see that the storage costs decrease as q increases (as should be expected since we do all leaf computations from scratch in this case). Figure  8 provides numerical results illustrating the memory requirements of the various approaches.

Local refinement
When solving a boundary value problem like (1) it is common to have a localized loss of regularity due to, e.g., corners on the boundary, a locally non-smooth body load or boundary condition, or a localized loss of regularity in the coefficient functions in the differential operator. A common approach to efficiently regain high accuracy without excessively increasing the number of degrees of freedom used, is to locally refine the mesh near the troublesome location. In this manuscript, we assume the location is known and given, and that we manually specify the degree of local refinement. The difficulty that arises is that upon refinement, the collocation nodes on neighboring patches do not necessarily match up. To remedy this, interpolation operators are introduced to transfer information between patches. (The more difficult problem of determining how to automatically detect regions that require mesh refinement is a topic of current research.) 6.1. Refinement criterion. Suppose we desire to refine our discretization at some pointx in the computational domain (the pointx can be either in the interior or on the boundary). Consider as an example the situation depicted in Figure 5. For each level of refinement, we split any leaf box that containsx and any "close" leaf boxes into a 2 × 2 grid of equal-sized leaf boxes. In Figure 5 we perform one level of refinement and find there are 6 leaf boxes "close" tox, which is represented by the green dot. These 6 boxes are refined into smaller leaf boxes.
A leaf box Ω τ is close tox if the distance d τ fromx to the box Ω τ satisfies d τ ≤ tl τ , where t = √ 2 and 2l τ is the length of one side of the leaf box Ω τ . In Figure 5 we show circles of size tl γ and tl β at the points in Ω γ and Ω β closest tox (in this case the boxes are all the same size so l γ = l β ). We seex is "close" to Ω γ , but not "close" to Ω β . Just as in section 5.1, we place a p × p tensor product grid of Chebyshev nodes on each new leaf and a set of q Gaussian ("Legendre") interpolation nodes on the edge of each leaf. The vector {y j } N j=1 holds the locations of all Gaussian nodes across all leaves in the domain.
6.2. Refined mesh. Notice that with the refined grid the nodes along common boundaries are no longer aligned. Figure 6 is an example of such a grid. This is a problem during the build stage of Ω γ Ω β Figure 5. A sample domain where we desire to refine the grid atx, shown by the green circle. For leaf box Ω γ the shortest distance tox satisfies d γ < tl γ , so Ω γ is refined. The maximum distance tl γ is shown by the blue circle, which is centered at the closest point from Ω γ tox. For leaf box Ω β the shortest distance tox does not satisfy d β < tl β , so Ω β is not refined. The maximum distance tl β is shown by the red circle, which is centered at the closest point from Ω β tox. the method since the merge operation is performed by equating the Neumann data on the common boundary. We begin the discussion on how to address this problem by establishing some notation. Define two boxes as neighbors if they are on the same level of the tree and they are adjacent. In the case that only one of two neighbors has been refined, such as Ω α and Ω β in Figure 6, special attention needs to be paid to the nodes on the common boundary. In order to merge boxes with different number of Gaussian nodes on the common edge, interpolation operators will be required. The next section describes this process in detail.
Consider the nodes on the common boundary between the two leaf boxes Ω α and Ω β . Let q denote the number of Gaussian nodes on one side of each leaf. Let {J α,i } q i=1 denote the index vector for the common boundary nodes from Ω α and {J β,i } 2q i=1 denote the index vector for the common boundary nodes from Ω β . That is, recalling that y holds the locations of all Gaussian nodes in the domain, y(J α,i ) contains the q Gaussian nodes on the Eastern side of box Ω α and y(J β,i ) contains the 2q nodes on the Western side of box Ω β . 6.3. Modifications to build stage. Once the grid with the Gaussian and Chebyshev nodes is constructed, as described in section 6.2, the build stage starts with the construction of all leaf operators as described in Section 3. Then, for simplicity of presentation, boxes are merged from the lowest level moving up the tree. After merging the children of a refined parent, such as Ω β in Figure  6 it is seen that the parent's exterior nodes do not align with the exterior nodes of any neighbor which has not been refined.
Recalling the index notation used in section 6.2, we form the interpolation matrix P up,W mapping data on y(J α,i ) to data on y(J β,i ) and the interpolation matrix P down,W mapping data on y(J β,i ) to data on y(J α,i ). Observe that when interpolating from two sets of q Gaussian nodes to a set of q Gaussian nodes, the interpolation must be done as two separate interpolations from q to q/2 nodes. The matrix P down,W is a block diagonal matrix consisting of two q/2 × q matrices (assuming q is divisible by 2).
For a refined parent, such as Ω β in Figure 6, we form the operators T β , S β , and X β and form the interpolation operators for every side of the parent, regardless of whether the exterior nodes align with the neighbor's exterior nodes. Observe that in the case of Ω β in Figure 6 the Eastern and Northern sides of Ω β will have P down = P up = I, where I is the identity matrix.
Then interpolation operators mapping the entire boundary data between fine and coarse grids are given by block diagonal matrices P up and P down whose diagonal blocks are the interpolation operators for each edge. The interpolation operators for Ω β are P β up = blkdiag(P β up,S , P β up,E , P β up,N , P β up,E ) and P β down = blkdiag(P β down,S , P β down,E , P β down,N , P β down,E ). (The text blkdiag denotes the function that forms a block diagonal matrix from its arguments.) Then we form the new operators T β new and S β new for the parent box Ω β as follows Now T β new is a map defined on the same set of points as all of the neighbors of Ω β and Neumann data can be equated on all sides.
Next, suppose a refined parent does not have a neighbor on one of its sides. Then on that side we use P down = P up = I. This could happen if the parent is on the boundary of our domain Ω. For example, suppose Ω α in Figure 6 was also refined. Then Ω α would not have a neighbor on its Western side. Additionally, if multiple levels of refinement are done then a refined parent could have no neighbors on one side. For example, suppose the Northwestern child of Ω β in Figure 6 was refined. Then the Northwestern child would not have a neighbor on its Western side since box Ω α is on a different level of the tree.
Forming the interpolation operators for each side of Ω β before we perform any following merge operations is the easiest approach. The alternative would be to form an interpolation operator every time two boxes are merged and the nodes do not align.
6.4. Modifications to solve stage. On the upwards pass of the solve stage, the fluxes for the particular solution must be calculated on the same nodes so the particular solution can be calculated on those nodes. This is easily achieved by applying the already computed interpolation operator P down to obtain h β new = P down h β old . In the downwards pass of the solve stage, the application of the solution operators results in the approximate solution at the coarse nodes on the Western and Southern sides of Ω β . The solution operator S β new now maps the solution on the coarse nodes on the Western and Southern sides of Ω β (and the dense nodes on the Eastern and Northern sides) to the solution on the interior of Ω β . However, we also need the solution on the dense nodes on the Western and Southern sides of Ω β . Let u ge,new denote the solution on the boundary of Ω β with the coarse nodes on the Western and Southern edges. Then the approximate solution on the dense nodes is given by u ge,old = P up u ge,new .

Numerical experiments
In this section, we present the results of numerical experiments that illustrate the performance of the scheme proposed. Section 7.1 reports on the computational cost and memory requirements. Sections 7.2-7.5 report on the accuracy of the proposed solution technique for a variety of problems where local mesh refinement is required. Finally, Section 7.6 illustrates the use of the proposed method in the acceleration of an implicit time stepping scheme for solving a parabolic partial differential equation.
For each experiment, the error is calculated by comparing the approximate solution with a reference solution u ref constructed using a highly over resolved grid. Errors are measured in ∞ -norm, on all Chebyshev nodes on leaf boundaries.
In all of the experiments, each leaf is discretized using a p × p tensor product mesh of Chebyshev nodes. The number of Legendre nodes per leaf edge is set to q = p − 1. In all experiments except the one described in Section 7.5, the computational domain is the square Ω = [0, 1] 2 discretized into n × n leaf boxes, making the total number of degrees of freedom roughly N ≈ p 2 × n 2 (to be precise, N = n 2 × (p − 1) 2 + 2n × (p − 1) + 1).
The proposed method was implemented in Matlab and all experiments were run on a laptop computer with a 4 core Intel i7-3632QM CPU running at 2.20 GHz with 12 GB of RAM. 7.1. Computational speed. The experiments in this section illustrate the computational complexity and memory requirements of the direct solver. Recall that the asymptotic complexity of the method scales as nested dissection or multifrontal methods, with execution times scaling as O(N 3/2 ) and O(N log N ) for the "build" and "solve" stages, respectively. The asymptotic memory requirement is O (N log N ).
The computational complexity and memory requirements of the proposed method depend only on the domain and the computational mesh; the choice of PDE is irrelevant. In the experiments reported here, we used Ω = [0, 1] 2 with a uniform mesh. Figure 7 reports the time in seconds for the (a) build and (b) solve stages of the proposed solution technique when there is a body load (BL), when the leaf computation is done on the fly as described in Remark 3 (BL(econ)), and when there is no body load (NBL). Results for two different orders of discretization (q = 8 and 16) are shown. Notice that as expected the constant scaling factor for both stages is larger for the higher order discretization. Figure 8 reports on memory requirements. Letting R denote total memory used, we plot R/N versus the number of discretization points N , where R is measured in terms of number of floating point numbers. We see that storing the solution operators on the leaves is quite costly in terms of memory requirements. The trade-off to be considered here is whether the main priority is to conserve memory, or to maximize the speed of the solve stage, cf. Remark 3. As an illustration, we see that for a problem with q = 8 and n = 128, for a total of 10 6 unknowns, the solve stage takes 1.7 seconds with the solution operators stored versus 9.8 seconds for performing the local solves on the fly. In situations where the solution is only desired in prescribed local regions of the geometry, computing the operators on the fly is ideal.    . The error for the variable coefficient problem described in Section 7.2. As before, q denotes the number of Legendre nodes along one side of a leaf. Figure 9 reports the l ∞ error versus the number of discretization points N . We get no accuracy for q = 4, but as q is increased, the errors rapidly decrease.  Figure 10. Error for Helmholtz equation with κ = 20 and a very concentrated body load, demonstrating the ability to improve the solution with refinement. We use local Chebyshev grids with 17 × 17 Chebyshev nodes per leaf, and n × n leaves, before refinement. For a problem like this with a concentrated body load we can improve the error just as much by refining the discretization at the troublesome location as we can from doubling the number of leaves, which would give the same grid at the target location.
with Ω = [0, 1] × [0, 1] and a very concentrated Gaussian for the body load, g = exp(−α|x −x| 2 ) with α = 3000. In this case, we chose the Dirichlet boundary data to equal the solution to the free space equation −∆u − κ 2 u = g with a radiation condition at infinity. In other words, u is the convolution between g and the free space fundamental solution. We computed the boundary data and the reference solution by numerically evaluating this convolution to very high accuracy.
To test the refinement strategy, we build a tree first with a uniform grid, i.e. n × n leaf boxes then add n ref levels of refinement around the pointx. Figure 10 reports the l ∞ norm of the error versus n ref for four choices of uniform starting discretization. When n = 4 one level of refinement (i.e. 28 leaf boxes) results in approximately the same accuracy as when n = 8 and no levels of refinement (i.e. 64 leaf boxes). 7.4. Discontinuous body load. In section, we consider a Poisson boundary value problem on Ω = [0, 1] 2 with an indicator function body load g that has support [1/4, 1/2] × [1/4, 1/2]. Observe that the lines of discontinuity of g coincide with edges of leaves in the discretization. Figure 11 reports the l ∞ error versus the number of discretization points N with uniform refinement for four different orders of discretization. Note that the approximate solution and its first derivative are continuous through the boundaries of the leaves (even on the boundaries where the jump in the body load occurs) since the algorithm enforces them by derivation.
Remark 5. Applying the scheme to a problem where a discontinuity in the body load does not align with the leaf boundaries results in a very low accuracy approximation to the solution. For a problem analogous to the one described in this section, we observed slow convergence and attained no better than two or three digits of accuracy on the most finely resolved mesh.  Figure 11. The error for a problem with a discontinuous body load. The discontinuities align with the edges of the leaves so we still get 10 digits of accuracy. In the legend, q denotes the number of Legendre nodes along one side of a leaf.

7.5.
Tunnel. This section reports on the performance of the solution technique when applied to the Helmholtz Dirichlet boundary value problem −∆u − κ 2 u = g, x ∈ Ω, u = f x ∈ ∂Ω with κ = 60 where the domain Ω is "tunnel" as illustrated in Figure 12. The body load is taken to be a Gaussian g = exp(−α|x −x| 2 ) with α = 300 andx = [1, 3/4]. The Dirichlet boundary data is given by Note that f (x) is continuous on ∂Ω and with this choice of wave number κ the domain Ω is about 10 wavelengths wide and 115 wavelengths long. The presence of the re-entrant corners results in a solution that has strong singularities which require local refinement in order for the method to achieve high accuracy. Figure 13 reports the l ∞ error versus the number of refinements into the corners with three choices of coarse grid. We use q = 16 for all examples and h gives the width and height of each leaf box. When h = 1/4, the discretization is only sufficient to resolve the Helmholtz equation with κ = 60 within 1% of the exact solution. When h = 1/8, the solution technique stalls at 5 digits of accuracy independent of the number of refinement levels.
Remark 6 (Symmetries). This problem is rich in symmetries that can be used to accelerate the build stage. In our implementation, we chose to exploit the fact that the tunnel is made up of four L-shaped pieces glued together. The DtN operator and corresponding solution operators were constructed for one L-shape. Then creating the solver for the entire geometry involved simply gluing the 4 L-shaped geometries together via three merge operations.   Applying the Crank-Nicolson time stepping scheme with a time step size k results in having to solve the following elliptic problem at each time step: where A = ∆ − ∂/∂x 1 is our partial differential operator.
Observe that the algorithm does not change for this problem. The build stage execution time and memory requirement are identical to those seen in Section 7.1. The execution time for the solve stage for each individual time step is nearly identical to the solve stage execution time shown in Section 7.1. The only new step in the solve stage is the need to evaluate (I/k + A/2)u n at each time step. Figure 14 reports the l ∞ error vs. the time step size k at three different times t = 0.025, 0.1, and 0.5. Note that even with a low order time stepping scheme, it is still feasible to high accuracy (i.e. use small time steps) since processing each time step is very inexpensive.
We use 16 leaf boxes per side with q = 16. This gives more than enough nodes to obtain the accuracy shown in Figure 14 and the error shown is limited by the accuracy of Crank-Nicolson and not by the discretization in space.

Concluding remarks
We have described an algorithm for solving non-homogenous linear elliptic PDEs in two dimensions based on a multidomain spectral collocation discretizations. The solver is designed explicitly for being combined with a nested dissection type direct solver. Its primary advantage over existing methods is that it enables the use of very high order local discretization without jeopardizing computational efficiency in the direct solver. The scheme is an evolution on previously published methods [10,4,5]. The novelty in this work is that the scheme has been extended to allow for problems involving body loads, and for local refinement.
The scheme is particularly well suited for executing the elliptic solve required solving parabolic problems using implicit time-stepping techniques in situations where the domain is fixed, so that the elliptic solve is the same in every time step. In this environment, the cost of computing an explicit solution operator is amortized over many time-steps, and can also be recycled when the same equation is solved for different initial conditions.
The fact that the method can with ease incorporate high order local discretizations, and allows for very efficient implicit time-stepping appears to make it particularly well suited for solving the Navier-Stokes equations at low Reynolds numbers. Such a solver is currently under development and will be reported in future publications. Other extensions currently under way includes the development of adaptive refinement criteria (as opposed to the supervised adaptivity used in this work), and the extension to problems in three dimensions, analogous to the work in [6] for homogeneous equations.