## 1 Introduction

Fractional diffusion equations are becoming increasingly important in modeling phenomena where nonlocal effects are significant, including fractured media in material science [46, 48], transport in complex media [47, 51], stable Lévy processes in finance [16, 37], Gaussian random fields in spatial statistics [40, 41], image denoising [28], among many others. In addition to physics-based modeling, fractional operators have also been used in controlling the smoothness of priors in Bayesian inverse problems involving distributed parameters [15]. While there has been much work devoted to the formulation and discretization of fractional diffusion in the homogenous case with constant fractional order and material properties (diffusivity, permeability, etc.), there has been comparatively little work that addresses the practically useful heterogenous case with variable coefficients, particularly in multiple spatial dimensions. Our motivation in this work is to develop methods for the efficient discretization of the fractional diffusion operator with variable fractional order and properties.

A standard formulation of fractional diffusion involves the integral representation of the fractional Laplacian $$(-\varDelta )^s$$ of order $$s\in (0,1)$$, where the fractional order characterizes the global smoothness of the solution [11, 27, 31]. The model may be extended to the variable order and diffusivity case by defining these quantities as functions in a physical domain; see e.g. [17, 19, 29, 48]. The specific form of the variable-order fractional diffusion operator we consider here is defined in a bounded domain $$\varOmega \subset {\mathbb {R}}^d$$ with $$d=1,2,3$$ as:

\begin{aligned} {\mathcal {L}} u(x):= \text {p.v.}\int _\varOmega \frac{a(x,y) \left( u(x) - u(y) \right) }{|x-y|^{d+s(x)+s(y)}}\,{{\textrm{d}}}y, \end{aligned}
(1)

where $$a(x,y): \varOmega \times \varOmega \rightarrow {\mathbb {R}}^+$$ denotes the diffusion coefficient and $$s(x): {\mathbb {R}}^d\rightarrow (0,1)$$ is the variable-order function. Formal settings for the above operator with appropriate volume constraints to insure well-posedness are described in Sect. 2 below.

The operator (1) becomes the classical integral fractional Laplacian operator (or the Riesz potential) when s(x) and a(xy) are constant functions and $$\varOmega ={\mathbb {R}}^d$$. The numerical approximation and analysis of this integral fractional Laplacian have been extensively studied in the literature. We refer to [24, 25, 33, 34, 44] for finite difference methods, [53] for a collocation approach, and [12, 30, 42, 57] for particle methods as well as random walk approaches. In terms of finite element methods, we refer to [2, 3] for a classical conforming scheme, [20] based on volume constraints, [9] for a non-conforming approach by the Dunford-Taylor integral formulation, [39, 49] for Petrov–Galerkin approaches, and [36] based on the well-known Caffarelli–Silvestre extension (cf. [14]).

In contrast, the numerical study for the variable-order fractional operator (1) has just started in recent years. For one dimensional problems, we refer to [6, 35, 58, 60] for the numerical approximations of different formulations of variable-order fractional diffusion operators. In multiple spatial dimensions, finite difference schemes were presented in [5] for discretizing the operator (1) in Cartesian geometries. A finite element approach is presented in [19] for domains where the variability consists of uniform inclusions in an otherwise homogeneous medium. There is still however no complete treatment of general variable order and properties for arbitrary geometries in the multidimensional setting. In this work, we extend the considerations from the finite-difference method on cartesian grids developed in [5] to the quasi-uniform meshes for general polytopes, and approximate variable-order fractional diffusion problems involving (1) using a linear finite element discretization scheme.

A primary challenge in the finite element discretization of (1) is that the non-local fractional kernel results in a dense stiffness matrix with a prohibitive $$O(N^2)$$ quadratic growth in memory footprint and arithmetic operations, where N denotes the number of degrees of freedom. In [3], the stiffness matrix is decomposed into the near-field and the far-field based on interactions between element pairs and an $${\mathcal {H}}^2$$ approximation is constructed from a kernel interpolation [32]. The corresponding near-field matrix is sparse while the far-field matrix can be compressed as a hierarchically low-rank matrix, thus reducing the complexity of the assembly process to O(N), with constants depending on the order of the quadrature and polynomial approximation of the kernel. We also refer to [8, 12, 36, 38, 59] for related hierarchical low-rank approximations, also all in the context of constant-order fractional problems. The variable coefficient case, however, requires additional considerations that are not present in the constant coefficient setting. For example, the work in [3, 4] takes the advantage of the constant-order property to evaluate the integral fractional Laplacian locally; see [3, Lemma 4], a procedure that cannot be readily extended to the variable-order case.

In our approach, the non-singular far-field matrix is expressed as the sum of two terms, where the first is a matrix that fits the structure for the $${\mathcal {H}}^2$$ approximation,

\begin{aligned} \int _\varOmega \int _\varOmega V(x) \gamma _{{\mathcal {T}}}(x,y) U(y) \,{{\textrm{d}}}y\,{{\textrm{d}}}x, \end{aligned}
(2)

where U and V are finite element functions and $$\gamma _{{\mathcal {T}}}$$ is a de-singularized kernel depending on the subdivision $${\mathcal {T}}$$ of $$\varOmega$$. The second term is a sparse weighted mass matrix with the density function $$\rho _{{\mathcal {T}}}(x) = \int _\varOmega \gamma _{{\mathcal {T}}}(x,y)\,{{\textrm{d}}}y$$. We compute this weighted mass matrix by quadrature on each element. The bottleneck of this procedure is that a direct evaluation of the density function at each quadrature point is O(N), resulting in a quadratic overall complexity. To remedy this growth, we propose a fast multipole method to obtain the values at all the quadrature points in linear time complexity.

Another challenge of (1) is that the singularity of the kernel requires special quadrature rules in order not to slow down the rate of convergence as the discretization is refined. To address this challenge, we adapt techniques from boundary element computations (see e.g. [45]) to resolve the singularities of the integrand when computing the entries in element stiffness matrices. Thus, we extend the techniques developed for the constant-order problem [1, 2] and achieve a treatment similar to that in [5] which relied on a singularity subtraction technique that is particularly convenient to express on a cartesian finite difference grid.

The main contributions of this work are:

• We propose a decomposition of the stiffness matrix for the finite element approximation of (1) into three sub-matrices. All three sub-matrices can be computed in O(N), with construction criteria depending on the distance between shape function pairs and element pairs in the subdivision. This strategy allows the handling of general domains with spatially-variable order and coefficients as well as bounded exterior regions.

• We generalize the variable-order setting discussed in [19] from element-wise constant functions to element-wise analytic functions. The numerical integration techniques for singular integral kernels developed by [1, 2] are generalized as well. The quadrature error is shown to converge at rates controllable simply by the order of the quadrature used. We also note that the proposed computation techniques can be applied to more general nonlocal kernels introduced by [22].

• We propose an $${\mathcal {H}}^2$$ representation to approximate the form (2) and a variable-kernel fast multipole method for computing the density $$\rho _{{\mathcal {T}}}(x)$$ at quadrature points. The variable-kernel fast multipole computation is cast as a matrix vector product, where the matrix is constructed as an $${\mathcal {H}}^2$$ matrix via a collocation method and the vector consists of quadrature weights. The errors in these approximations are shown to converge at rates readily controllable by the order of the polynomial used in the kernel approximation.

The rest of this paper is organized as follows. Section 2 presents a weak formulation of the problem and its finite-element discretization. Section 3 describes the decomposition of the stiffness matrix into three sub-matrices that handle different aspects of the problem. Sections 4 to 6 discuss the numerical approximation of these individual sub-matrices and analyze the resulting discretization errors. Specifically, Sect. 4 introduces a tensor-product quadrature rule to directly compute the singular near-field interactions. Section 5 presents an $$\mathcal{H}^2$$ representation of the sub-matrix that approximates the non-singular part of the near-field as well as the far-field. Section 6 describes the fast multipole machinery to efficiently account for global interactions at quadrature points. Section 7 presents numerical experiments to illustrate the convergence of the finite element approximation as well as the linear-complexity of the proposed assembly process. Section 8 concludes with directions for future work.

Notation. In the following, we set $$a\lesssim b$$ if $$a\le Cb$$ with C denoting a positive constant independent of a, b, and the discretization parameters (e.g. the mesh size h, the number of degrees of freedom N, the quadrature order n, and the polynomial degree p). We set $$a\sim b$$ when $$a\lesssim b$$ and $$b\lesssim a$$.

## 2 Weak Formulation and Finite Element Discretization

Our weak formulation for the nonlocal symmetric operator (1) starts from the definition of a generalized nonlocal divergence operator, $${\mathcal {D}}$$, introduced in [23]. For a vector field $$v: {\mathbb {R}}^d\times {\mathbb {R}}^d\rightarrow {\mathbb {R}}$$,

\begin{aligned} {\mathcal {D}}[v](x) = \int _{{\mathbb {R}}^d} (v(x,y)+ v(y, x)) \cdot \alpha (x, y)\,{{\textrm{d}}}y. \end{aligned}

Here the vector field $$\alpha (x,y): {\mathbb {R}}^d \times {\mathbb {R}}^d \rightarrow {\mathbb {R}}^d$$ satisfies the antisymmetric property $$\alpha (x,y) = -\alpha (y,x)$$. An adjoint operator $${\mathcal {D}}^*$$ corresponding to $${\mathcal {D}}$$ under the $$L^2({\mathbb {R}}^d)$$ inner product may be written as

\begin{aligned} {\mathcal {D}}^*[u] (x,y):= - (u(y) - u(x)) \alpha (x, y), \end{aligned}

with $$-{\mathcal {D}}^*$$ interpreted as a nonlocal gradient.

Similarly to the classical diffusion operator which is defined with a second-order diffusion coefficient a (a symmetric tensor representing diffusivity, permeability, or related material properties), the nonlocal diffusion operator can be defined as

\begin{aligned} {\mathcal {L}} u:= {\mathcal {D}}[a\cdot {\mathcal {D}}^*[u]](x):= -2 \int _{{\mathbb {R}}^d} (u(y)-u(x))\gamma (x,y) \,{{\textrm{d}}}y \end{aligned}
(3)

with

\begin{aligned} \gamma := \alpha (x,y)^T a(x,y) \alpha (x,y). \end{aligned}

In order to simplify our discussion, we set a to be diagonal, and assume that there exists an analytic function $$\kappa (x)$$ so that $$\kappa (x)\ge \delta >0$$ for some positive constant $$\delta$$, and that $$a(x,y) = \sqrt{\kappa (x)\kappa (y)} I$$ with $$I\in {\mathbb {R}}^{d\times d}$$ denoting the identity matrix.Footnote 1 We are interested in the numerical approximation of the diffusion operator $${\mathcal {L}}$$ with $$\alpha$$ defined by

\begin{aligned} \alpha (x,y) = \frac{y-x}{|y-x|^{\tfrac{d+s(x)+s(y)}{2} + 1}}, \end{aligned}

where s(x) is an analytic function satisfying that

\begin{aligned} 0\le {\underline{s}}\le s(x)\le {{\overline{s}}} < 1. \end{aligned}
(4)

Plugging the above definition of $$\alpha$$ into (3) yields

\begin{aligned} \mathcal {L}u(x) = -2 \int _{{\mathbb {R}}^d} \frac{a(x,y)(u(x)-u(y))}{|x-y|^{d+s(x)+s(y)}} \,{{\textrm{d}}}y, \end{aligned}

a spatially variable-order generalization of the integral fractional Laplacian.

### 2.1 Volume-Constrained Problems

We shall apply the operator $${\mathcal {L}}$$ in a bounded domain $$\varOmega :={{\overline{{\varOmega ^{\text {int}}}}}}\cup {\varOmega ^{\text {ext}}}\subset {\mathbb {R}}^d$$ with $${\varOmega ^{\text {int}}}\cap {\varOmega ^{\text {ext}}}= \emptyset$$. Here the domain $${\varOmega ^{\text {int}}}$$ has Lipschitz boundary and $${\varOmega ^{\text {ext}}}$$ is an enclosing region of $${\varOmega ^{\text {int}}}$$ satisfying that $$0< |{\varOmega ^{\text {ext}}}|<\infty$$ or $${\varOmega ^{\text {ext}}}= {\mathbb {R}}^d \setminus {\varOmega ^{\text {int}}}$$. Unlike classical, second-order elliptic problems, where imposing boundary conditions on $$\partial {\varOmega ^{\text {int}}}$$ is sufficient to guarantee well-posedness, a general non-local fractional operator requires that volume constraints be imposed on $${\varOmega ^{\text {ext}}}$$ [23].Footnote 2 These conditions are in a sense the equivalent of the boundary conditions of the local operator. Here we focus on the Dirichlet volume-constrained problem which can be stated as: given a data function f supported in $${\varOmega ^{\text {int}}}$$, we seek $$u \in \varOmega$$ satisfying

\begin{aligned} \begin{array}{ll} - {\mathcal {L}}_\varOmega u = f, &{} \quad \text {in } {\varOmega ^{\text {int}}}, \\ u =0, &{} \quad \text {in } {\varOmega ^{\text {ext}}}, \end{array} \end{aligned}
(5)

with the operator

\begin{aligned} {\mathcal {L}}_\varOmega u = -2 \int _{\varOmega } \frac{a(x,y)(u(x)-u(y))}{|x-y|^{d+s(x)+s(y)}} \,{{\textrm{d}}}y. \end{aligned}

### 2.2 Weak Formulation

Define the energy space

\begin{aligned} {\mathbb {V}}:= \{v\in L^2(\varOmega ): \Vert v\Vert _{{\mathbb {V}}}<\infty \text { and } v = 0 \text { in } {\varOmega ^{\text {ext}}}\}, \end{aligned}

where the energy norm $$\Vert .\Vert _{{\mathbb {V}}}$$ is given by $$\Vert v\Vert _{{\mathbb {V}}}^2:= \Vert v\Vert _{L^2(\varOmega )}^2 + |v|_{{\mathbb {V}}}^2$$ with

\begin{aligned} |v|_{{\mathbb {V}}}^2:= \int _{\varOmega }\int _{\varOmega } (u(x) - u(y))^2\gamma (x,y) \,{{\textrm{d}}}y\,{{\textrm{d}}}x \quad \text {and}\quad \gamma (x,y) = \frac{a(x,y)}{|x-y|^{d+s(x)+s(y)}}. \end{aligned}

Clearly, the energy space $${\mathbb {V}}$$ is a Hilbert space. By utilizing a standard argument (e.g., Proposition 2.2 and 2.4 in [2]), we can show the following Poincaré inequality

\begin{aligned} \Vert v\Vert _{L^2(\varOmega )} \lesssim |v|_{{\mathbb {V}}}, \quad \text {for } v\in {\mathbb {V}}. \end{aligned}
(6)

Multiplying the first equation of (5) with a test function $$v\in {\mathbb {V}}$$, and integrating the resulting equation over $$\varOmega$$, we have

\begin{aligned} \begin{aligned} \int _\varOmega L_{\varOmega } u v\,{{\textrm{d}}}x&= -2\int _{\varOmega }\int _{\varOmega }(u(x)-u(y))\gamma (x,y) v(x) \,{{\textrm{d}}}y \,{{\textrm{d}}}x \\&= -2\int _{\varOmega }\int _{\varOmega } (u(y)-u(x))\gamma (x,y) v(y) \,{{\textrm{d}}}y \,{{\textrm{d}}}x. \\ \end{aligned} \end{aligned}

To obtain the second equality we switched the order of the double integral and used the fact that $$\gamma (x,y) = \gamma (y,x)$$. Summing up the two double integrals on the right hand side leads to the definition of the bilinear form

\begin{aligned} {\mathcal {A}}(u,v):= -\int _{\varOmega } L_\varOmega u v\,{{\textrm{d}}}x =\int _{\varOmega }\int _\varOmega (u(x)-u(y))(v(x)-v(y))\gamma (x,y) \,{{\textrm{d}}}y\,{{\textrm{d}}}x. \end{aligned}

So our weak formulation reads: find $$u\in {\mathbb {V}}$$ satisfying that

\begin{aligned} {\mathcal {A}}(u,v) = \int _{{\varOmega ^{\text {int}}}} f v\,{{\textrm{d}}}x \quad {\text {for all}}v\in {\mathbb {V}}. \end{aligned}
(7)

Clearly, $${\mathcal {A}}(u,v)\le \Vert u\Vert _{{\mathbb {V}}}\Vert v\Vert _{{\mathbb {V}}}$$ by the Cauchy–Schwarz inequality. So $${\mathcal {A}}(.,.)$$ is bounded in $${\mathbb {V}}$$. Thanks to (6), we obtain the coercivity of the bilinear form. Thus, the Lax-Milgram lemma guarantees that the weak formulation (8) admits a unique solution $$u\in {\mathbb {V}}$$.

### 2.3 Finite Element Discretization

We consider a simplicial finite element mesh that subdivides the interior and exterior polytope regions $${\varOmega ^{\text {int}}}$$ and $${\varOmega ^{\text {ext}}}$$. We denote by $${\mathcal {T}}:={\mathcal {T}}(\varOmega )$$ the resulting subdivision of $$\varOmega$$, and assume that $${\mathcal {T}}(\varOmega )$$ matches the boundary of $${\varOmega ^{\text {int}}}$$, i.e., $${{\overline{{\varOmega ^{\text {int}}}}}}\cap {{\overline{{\varOmega ^{\text {ext}}}}}}$$ consists of faces from $${\mathcal {T}}$$, allowing us to identify $${\mathcal {T}}^\text {int}$$ and $${\mathcal {T}}^\text {ext}$$ as the subdivisions of the interior and exterior regions. Let $${\mathcal {N}}:= \{{{\textbf{x}}}_i\}_{i=1}^M$$ be the set of vertices associated with the subdivision $${\mathcal {T}}$$. In particular, we set the first N ($$N<M$$) nodes to be the interior nodes in $${\varOmega ^{\text {int}}}$$ and denote the collection of them to be $${\mathcal {N}}^\circ$$. Also, we let $${\mathcal {N}}^c:={\mathcal {N}}\backslash {\mathcal {N}}^\circ$$. Let $${\mathbb {V}}({\mathcal {T}})\subset {\mathbb {V}}$$ be the conforming continuous piecewise linear finite element space associated with $${\mathcal {T}}$$. For each node $${{\textbf{x}}}_i\in {\mathcal {N}}$$, we set $$\psi _i \in {\mathbb {V}}({\mathcal {T}})$$ to be the corresponding linear shape function. The above notations allow us to write the discrete solution as $$U= \sum _{j=1}^N u_j \psi _j$$ with $${\underline{U}} = (u_1,u_2,\ldots , u_N)^T\in {\mathbb {R}}^N$$ and express the test function as $$V = \sum _{i=1}^N v_i\psi _i$$ with the coefficient vector $${\underline{V}} = (v_1,v_2,\ldots , v_N)^T\in {\mathbb {R}}^N$$. We also denote the support of $$\psi _i$$ by $${\mathcal {S}}_i$$, and the patch of each cell $$\tau \in {\mathcal {T}}$$ by $${\mathcal {S}}_\tau$$, namely (see Fig. 1)

\begin{aligned} {\mathcal {S}}_i:= \bigcup _{\tau \in {{\mathcal {T}}} | {{\textbf{x}}}_i\in \tau } \tau , \qquad {\mathcal {S}}_\tau := \bigcup _{ \begin{array}{c} \tau ' \in {\mathcal {T}} \\ \tau \cap \tau '\ne \emptyset \end{array} } \tau ' \,. \end{aligned}

A finite element discretization with respect to (7) seeks $$U\in {\mathbb {V}}({\mathcal {T}})$$ so that $$U = 0$$ in $${\varOmega ^{\text {ext}}}$$ and

\begin{aligned} {\mathcal {A}}(U,V) = \int _{{\varOmega ^{\text {int}}}} f V\,{{\textrm{d}}}x \quad {\text {for all}}V\in {\mathbb {V}}({\mathcal {T}}). \end{aligned}
(8)

Because the above numerical scheme is conforming, namely $${\mathbb {V}}({\mathcal {T}})\subset {\mathbb {V}}$$, the well-posedness follows from the continuous formulation (7). Using the finite element mesh, we can rewrite the discrete bilinear form in (8) as

\begin{aligned} {\mathcal {A}}(U,V) = \sum _{\tau ,\tau '\in {\mathcal {T}}} {\mathcal {A}}_{\tau ,\tau '}(U,V), \end{aligned}

where

\begin{aligned} {\mathcal {A}}_{\tau ,\tau '}(U,V):= \int _\tau \int _{\tau '} (U(x)-U(y))(V(x)-V(y))\gamma (x,y)\,{{\textrm{d}}}y\,{{\textrm{d}}}x. \end{aligned}
(9)

The stiffness matrix assembly, i.e., the construction of the matrix $${\underline{A}} \in {\mathbb {R}}^{N\times N}$$ with entries $${\underline{A}}_{ij} = {\mathcal {A}}(\psi _j,\psi _i)$$, is based on contributions of element pairs ($$\tau , \tau '$$) from (9), with each such contribution involving interactions (9) between shape functions that are supported on either $$\tau$$ or $$\tau '$$. In other words, if we let $${\mathcal {I}}(\tau ,\tau ')\subset \{1,\ldots ,N\}$$ be the union of the global index sets of the vertices of elements $$\tau$$ and $$\tau '$$ so that for $$i \in {\mathcal {I}}(\tau ,\tau ')$$, we have $$\tau \subset {\mathcal {S}}_i$$ or $$\tau ' \subset {\mathcal {S}}_i$$. Then we can define the local stiffness matrix of an element pair as

\begin{aligned} {\underline{A}}^{\tau ,\tau '} = \{{\mathcal {A}}_{\tau ,\tau '}(\psi _j,\psi _i)\}_{i,j\in {\mathcal {I}}(\tau ,\tau ')} \end{aligned}
(10)

and assemble these elemental contributions into a global stiffness matrix $${\underline{A}}$$.

## 3 Decomposition of the Stiffness Matrix

A direct assembly of the elemental stiffness matrices $${\underline{A}}^{\tau ,\tau '}$$ of (10) into a global stiffness matrix $${\underline{A}}$$ will obviously result in a scheme with quadratic complexity in both storage and operations. In this section, we outline a splitting scheme that decomposes the stiffness matrix into three components, in order to obtain a linear complexity algorithm. The splitting reflects distinct computational characteristics of the problem, and the resulting decomposition is primarily motivated by the different linear-complexity construction algorithms for the three matrix components. We describe the details of these matrices for the two-dimensional case in Sects. 45, and 6, respectively.

Our consideration for the decomposition of $${\underline{A}}$$ starts with the relation between two elements $$\tau ,\tau '\in {\mathcal {T}}$$. Our goal is to write the bilinear form in (8) as the sum:

\begin{aligned} {\mathcal {A}}(U,V) = \sum _{\begin{array}{c} \tau \in {\mathcal {T}}^\text {int}, \tau '\in {\mathcal {T}} \\ \tau \cap \tau ' \ne \emptyset \end{array}} {\mathcal {A}}_{\tau ,\tau '}(U,V) + \sum _{\begin{array}{c} \tau ,\tau '\in {\mathcal {T}}^\text {int} \\ \tau \cap \tau ' = \emptyset \end{array}} {\mathcal {K}}_{\tau ,\tau '}(U,V) + \sum _{\begin{array}{c} \tau \in {\mathcal {T}}^\text {int}, \tau '\in {\mathcal {T}} \\ \tau \cap \tau ' = \emptyset \end{array}} \! \! {\mathcal {M}}_{\tau ,\tau '}(U,V).\nonumber \\ \end{aligned}
(11)

The first term of the sum handles the singular integrals that arises when $$\tau$$ and $$\tau '$$ are not separated. When $$\tau$$ and $$\tau '$$ are separated, the bilinear form $${\mathcal {A}}_{\tau ,\tau '}(.,.)$$, which is now non-singular, can be split into two components: $${\mathcal {K}}_{\tau ,\tau '}(.,.)$$ which represents effects of $$\tau$$ on all non-neighboring elements of the mesh, and $${\mathcal {M}}_{\tau ,\tau '}(.,.)$$ which represents effects from all non-neighboring elements of the mesh (including elements in $${\mathcal {T}}^\text {ext}$$) on $$\tau$$. Formal definitions for these two forms will be given later in (13). We will show that under a suitable mesh setting the matrix associated with the first term turns out to be sparse, whereas hierarchical matrix algorithms and fast multipole methods allow us to construct the last two terms of (11) in linear complexity.

### 3.1 Case $$\tau \cap \tau ' \ne \emptyset$$

When $$\tau$$ and $$\tau '$$ are direct neighbors or $$\tau = \tau '$$, we define the corresponding near-field bilinear form as

\begin{aligned} \begin{aligned} {\mathcal {B}}(U,V)&:= \sum _{\begin{array}{c} \tau \in {\mathcal {T}}^\text {int}, \tau '\in {\mathcal {T}} \\ \tau \cap \tau ' \ne \emptyset \end{array}} {\mathcal {A}}_{\tau ,\tau '}(U,V)&=\sum _{\tau \in {\mathcal {T}}^\text {int}} \sum _{\tau '\subset {\mathcal {S}}_{\tau }} {\mathcal {A}}_{\tau ,\tau '}(U,V). \end{aligned} \end{aligned}
(12)

Here we note that the elements $$\tau$$ in the above form are restricted to those in $${\varOmega ^{\text {int}}}$$ since both U and V vanish in $${\varOmega ^{\text {ext}}}$$. Denoting the near-field matrix $${\underline{B}} \in {\mathbb {R}}^{N\times N}$$ with entries $${\underline{B}}_{ij} = {\mathcal {B}}(\psi _j,\psi _i)$$, we assemble it directly from element-pair local stiffness matrices $${\underline{A}}^{\tau , \tau '}$$ and call this process $$\texttt {DirectEval}(\tau ,\tau ')$$. The primary challenge here is the accurate evaluation of these local stiffness matrices which involves singular integrands with spatially-varying fractional order. As we describe in detail in Sect. 4, we generalize ideas from boundary element methods for transferring the singular integrands in (9) to analytical integrands (cf. [1]) to handle the case of variable order.

We further assume that the triangulation $${\mathcal {T}}$$ is shape-regular and quasi-uniform, i.e., there exist two positive constants $$c_{\texttt {sr}}$$ and $$c_{\texttt {u}}$$ so that for all $$\tau \in {\mathcal {T}}$$ and $$\tau \subset \varOmega$$, there holds that

\begin{aligned} h_\tau \le c_{\texttt {sr}} \rho _\tau \quad \text { and }\quad \max _{\tau \in {\mathcal {T}}}h_\tau \le c_{\texttt {u}} \min _{\tau \in {\mathcal {T}}} h_\tau , \end{aligned}

with $$h_\tau$$ and $$\rho _\tau$$ denoting the size of $$\tau$$ and the maximum size of the inscribed ball in $$\tau$$. Thus $${\underline{B}}$$ is a sparse matrix and the maximum number of the nonzero column entries, namely

\begin{aligned} \max _{i=1,\ldots , N}\#\{{{\textbf{x}}}_j \in \overline{{\mathcal {S}}}_\tau , \tau \subset {\mathcal {S}}_i\} \end{aligned}

is uniformly bounded and only depends on $$c_{\texttt {sr}}$$, insuring linear complexity.

### 3.2 Case $$\tau \cap \tau ' = \emptyset$$

When the elements $$\tau$$ and $$\tau '$$ are separated, $$|x - y| > 0$$ for all $$x \in \tau$$ and $$y \in \tau '$$, the singularity of the integrand is avoided but another difficulty is introduced because of the quadratic number of the element pairs that have to be considered. A different strategy is hence needed for constructing this contribution to the stiffness matrix.

We start by observing that if both $$\tau ,\tau '\in {\varOmega ^{\text {ext}}}$$, we immediately get $${\mathcal {A}}_{\tau ,\tau '}(\psi _j,\psi _i)=0$$. We can therefore fix $$\tau \in {\mathcal {T}}^{\text {int}}$$ and further consider this case by whether $$\tau '$$ is located in $${\varOmega ^{\text {int}}}$$ or in $${\varOmega ^{\text {ext}}}$$.

#### 3.2.1 Subcase $$\tau , \tau ' \in {\mathcal {T}}^\text {int}$$

Since the $${\mathcal {A}}_{\tau ,\tau '}(.,.)$$ is not singular, we can rewrite it as

\begin{aligned} {\mathcal {A}}_{\tau ,\tau '}(U,V)&= \int _{\tau }\int _{\tau '} \bigg (U(x)V(x)+U(y)V(y)-U(x)V(y)-U(y)V(x)\bigg ) \gamma (x,y)\,{{\textrm{d}}}y\,{{\textrm{d}}}x \nonumber \\&= \bigg [\int _\tau U(x)V(x) \int _{\tau '} \gamma (x,y)\,{{\textrm{d}}}y\,{{\textrm{d}}}x + \int _{\tau '} U(x)V(x) \int _{\tau }\gamma (x,y) \,{{\textrm{d}}}y\,{{\textrm{d}}}x\bigg ] \nonumber \\&\quad -\, \bigg [\int _\tau U(x) \int _{\tau '} \gamma (x,y)V(y)\,{{\textrm{d}}}y\,{{\textrm{d}}}x + \int _{\tau '} U(x) \int _{\tau }\gamma (x,y)V(y) \,{{\textrm{d}}}y\,{{\textrm{d}}}x \bigg ] \nonumber \\&=:{\mathcal {M}}_{\tau ,\tau '}(U,V) + {\mathcal {K}}_{\tau ,\tau '}(U,V) \end{aligned}
(13)

We sum $${\mathcal {K}}_{\tau ,\tau '}$$ for all elements $$\tau ,\tau '$$ in $${\varOmega ^{\text {int}}}$$. By the symmetry of choosing between $$\tau$$ and $$\tau '$$, we can derive that

\begin{aligned} \begin{aligned} {\mathcal {K}}(U,V)&:= \sum _{\tau \in {\mathcal {T}}^\text {int}}\sum _{\tau '\in {\mathcal {T}}^\text {int}} {\mathcal {K}}_{\tau ,\tau '}(U,V) \\&= - 2 \int _{{\varOmega ^{\text {int}}}}\int _{{\varOmega ^{\text {int}}}}\gamma _{{\mathcal {T}}}(x,y) U(x)V(y) \,{{\textrm{d}}}y\,{{\textrm{d}}}x, \end{aligned} \end{aligned}
(14)

where $$\gamma _{{\mathcal {T}}}$$ is a mesh-dependent kernel

\begin{aligned} \gamma _{{\mathcal {T}}}(x,y):= \left\{ \begin{array}{ll} 0, &{}\quad x\in \tau , y\in \tau ' \text { for some }\tau ,\tau '\in {\mathcal {T}} \text { with } \tau \cap \tau '\ne \emptyset , \\ \gamma (x,y),&{}\quad \text {otherwise}. \\ \end{array} \right. \end{aligned}
(15)

This mesh-dependent kernel evaluates to zero precisely in the integration regions that have already been handled by the first case above, i.e., when $$\tau \cap \tau ' \ne \emptyset$$.

Following a similar argument, we sum $${\mathcal {M}}_{\tau ,\tau '}$$ for all $$\tau ,\tau '\subset {\varOmega ^{\text {int}}}$$ to define

\begin{aligned} \begin{aligned} {{\mathcal {M}}^{\text {int}}}(U,V)&:=\sum _{\tau \in {\mathcal {T}}^\text {int}} \sum _{\tau '\in {\mathcal {T}}^\text {int}}{\mathcal {M}}_{\tau ,\tau '}(U,V) \\&= 2\sum _{\tau \in {\mathcal {T}}^\text {int}} \int _\tau U(x)V(x) \bigg (\int _{{\varOmega ^{\text {int}}}\backslash {\mathcal {S}}_\tau } \gamma (x,y) \,{{\textrm{d}}}y\bigg ) \,{{\textrm{d}}}x. \\ \end{aligned} \end{aligned}
(16)

#### 3.2.2 Subcase $$\tau \in {\mathcal {T}}^\text {int}$$, $$\tau ' \in {\mathcal {T}}^\text {ext}$$

In this case, we can simplify the form $${\mathcal {A}}_{\tau ,\tau '}(.,.)$$ knowing that $$U(y)=V(y)=0$$ for $$y\in \tau '$$ due to $$U,V\in {\mathbb {V}}({\mathcal {T}})$$. The second term in (13) vanishes and $${\mathcal {A}}_{\tau ,\tau '}(.,.)$$ is simply $${\mathcal {M}}_{\tau ,\tau '}(U,V)$$, which we sum for all $$\tau \in {\mathcal {T}}^\text {int}$$ and $$\tau '\in {\mathcal {T}}^\text {ext}$$ to obtain

\begin{aligned} \begin{aligned} {{\mathcal {M}}^{\text {ext}}}(U,V)&:= 2\sum _{\tau \in {\mathcal {T}}^\text {int}}\sum _{\tau '\in {\mathcal {T}}^\text {ext}} \int _\tau U(x)V(x) \bigg (\int _{\tau '} \gamma (x,y) \,{{\textrm{d}}}y\bigg ) \,{{\textrm{d}}}x \\&= 2\sum _{\tau \in {\mathcal {T}}^\text {int}} \int _\tau U(x)V(x) \bigg (\int _{{\varOmega ^{\text {ext}}}\backslash {\mathcal {S}}_\tau } \gamma (x,y) \,{{\textrm{d}}}y\bigg ) \,{{\textrm{d}}}x. \end{aligned} \end{aligned}
(17)

Here we also have the factor 2 in the above equation because we accounted for the symmetric case $$\tau '\in {\mathcal {T}}^\text {int}$$ and $$\tau \in {\mathcal {T}}^\text {ext}$$.

Gathering (16) and (17), we can define the form

\begin{aligned} \begin{aligned} {\mathcal {M}}(U,V)&:= {{\mathcal {M}}^{\text {ext}}}(U,V) + {{\mathcal {M}}^{\text {int}}}(U,V) \\&= 2\sum _{\tau \in {\mathcal {T}}^\text {int}} \int _\tau U(x)V(x) \bigg (\int _{\varOmega \backslash {\mathcal {S}}_\tau } \gamma (x,y) \,{{\textrm{d}}}y\bigg ) \,{{\textrm{d}}}x \\&= 2\sum _{\tau \in {\mathcal {T}}^\text {int}} \int _\tau U(x)V(x) \rho _{{\mathcal {T}}}(x)\,{{\textrm{d}}}x, \end{aligned} \end{aligned}
(18)

where

\begin{aligned} \rho _{{\mathcal {T}}}(x):= \int _{\varOmega \backslash {\mathcal {S}}_\tau } \gamma (x,y) \,{{\textrm{d}}}y = \int _{\varOmega } \gamma _{{\mathcal {T}}}(x,y) \,{{\textrm{d}}}y \end{aligned}
(19)

recalling that $$\gamma _{{\mathcal {T}}}$$ is defined in (15). In this form, it is easy to see that a matrix $${\underline{M}}$$ with entries $${\underline{M}}_{i,j}={\mathcal {M}}(\psi _j,\psi _i)$$ for $$i,j=1,\ldots , N$$ has the footprint of a mass matrix and may be assembled from element contributions. If i and j are the indices for the shape functions defined on $$\tau$$, an element stiffness matrix is written as:

\begin{aligned} {\underline{M}}^\tau = \left\{ \int _\tau \psi _j(x)\psi _i(x) \rho _{{\mathcal {T}}}(x) \,{{\textrm{d}}}x\right\} _{ij}. \end{aligned}
(20)

$${\underline{M}}$$ is in fact a weighted mass matrix with $$\rho _{{\mathcal {T}}}(x)$$ playing the role of a density function. The evaluation of $$\rho _{{\mathcal {T}}}(x)$$ is required at all quadrature points, with each evaluation requiring a global integration. This appears to demand quadratic complexity in arithmetic operations. However, a method akin to a fast multipole method for spatially-varying kernels can evaluate $$\rho _{{\mathcal {T}}}(x)$$ at all quadrature points in linear complexity as we show in Sect. 6.

### 3.3 Assembly of Complete Stiffness Matrix

The bilinear forms in (12), (14), (18) provide the three components, $${\underline{B}}$$, $${\underline{K}}$$, and $${\underline{M}}$$ of the stiffness matrix $${\underline{A}}$$ of the problem, where $${\underline{K}}_{i,j} = {\mathcal {K}}(\psi _j,\psi _i)$$ for $$i,j=1,\ldots ,N$$. The matrices $${\underline{B}}$$ and $${\underline{M}}$$ are sparse and can be stored directly. The matrix $${\underline{K}}$$ however is formally dense. We take advantage of the structure of the bilinear form $${\mathcal {K}}(.,.)$$ with its de-singularized kernel to store $${\underline{K}}$$ in linear complexity using $${\mathcal {H}}^2$$-matrix compression techniques [32]. We also use construction algorithms that approximate the smooth kernel using piecewise polynomial interpolants and build the compressed matrix in linear complexity. We call this procedure $${\texttt {HMATRIX}}({\mathcal {T}})$$ and describe it in Sect. 5.

The overall algorithm can then be summarized as shown in Algorithm 1.

## 4 Direct computation of the singular near-field integrals

In this section, we describe our 2d implementation for computing the singular integrals $${\mathcal {A}}_{\tau ,\tau '}(\psi _j,\psi _i)$$ when two triangles $$\tau ,\tau '$$ are touching each other. The implementation is based on techniques popularized in boundary element methods; see e.g. [45, Chapter 5]. We refer to [1, 3, 4] for constant-order problems and to [21] for a more general class of nonlocal problems.

Let $${{\hat{\tau }}}$$ be the reference triangle with vertices $$(0,0)^T$$, $$(1,0)^T$$ and $$(1,1)^T$$. For each triangle $$\tau \in {\mathcal {T}}$$, we denote by $$\chi _\tau : {{\hat{\tau }}}\rightarrow \tau$$ an affine transformation from the reference triangle to $$\tau$$. When mapping a touching element pair $$(\tau , \tau ')$$ to $${\hat{\tau }}$$ we distinguish three cases depending on the number of shared vertices (see Fig. 2):

• If $$\tau$$ and $$\tau '$$ share only one vertex, the affine mappings $$\chi _\tau$$ and $$\chi _{\tau '}$$ satisfy $$\chi _\tau ((0,0)^T) = {\chi _{\tau '}} ((0,0)^T)$$;

• if $$\tau$$ and $$\tau '$$ share a common edge, we assume that $$\chi _\tau (({{\hat{x}}}_1,0)^T) = {\chi _{\tau '}} (({{\hat{x}}}_1,0)^T)$$ for $${{\hat{x}}}_1\in [0,1]$$;

• if $$\tau = \tau '$$, we set $$\chi _\tau = \chi _{\tau '}$$.

Under the above assumptions, we set $$(x,y) = \chi _{\tau ,\tau '}({{\hat{x}}},{{\hat{y}}}):= (\chi _\tau ({{\hat{x}}}),\chi _{\tau '}({{\hat{y}}}))$$.

Suppose that the supports of basis functions $$\psi _i$$ and $$\psi _j$$ contain either $$\tau$$ or $$\tau '$$, we shall compute $${\mathcal {A}}_{\tau ,\tau '}(\psi _j,\psi _i)$$ on the reference elements, namely

\begin{aligned} \begin{aligned} {\mathcal {A}}_{\tau ,\tau '}(\psi _j,\psi _i)&= |J_\tau | |J_{\tau '}| \int _{{{\hat{\tau }}}}\int _{{{\hat{\tau }}}} \bigg ( a(\chi _\tau ({{\hat{x}}}),\chi _{\tau '}({{\hat{y}}})) \\&\quad \times \frac{[\psi _j(\chi _\tau ({{\hat{x}}})) - \psi _j(\chi _{\tau '}({{\hat{y}}}))][\psi _i(\chi _\tau ({{\hat{x}}})) - \psi _i(\chi _{\tau '}({{\hat{y}}}))]}{|\chi _\tau ({{\hat{x}}}) - \chi _{\tau '}({{\hat{y}}})|^{2 + s(\chi _\tau ({{\hat{x}}})) + s(\chi _{\tau '}({{\hat{y}}}))}} \bigg )\,{{\textrm{d}}}{{\hat{x}}}\,{{\textrm{d}}}{{\hat{y}}}, \end{aligned} \end{aligned}
(21)

where $$J_\tau$$ and $$J_{\tau '}$$ denote the Jacobian of $$\chi _\tau$$ and $$\chi _{\tau '}$$ respectively and $$|J_\tau |$$ and $$|J_{\tau '}|$$ are the absolute values of the corresponding determinants. The difficulties in the evaluation of the above integral are primarily due to the singularities at $$\chi _{\tau }({{\hat{x}}}) = \chi _{\tau '}({{\hat{y}}})$$. The strategy for accurate evaluation is to split the integration domain $${{\hat{\tau }}}\times {{\hat{\tau }}}$$ into several subregions (depending on the relation between $$\tau$$ and $$\tau '$$) so that the integrand in each subregion can be transformed into $$[0,1]^4$$ and is analytic. We then compute the resulting integrals with tensor-product Gaussian quadrature schemes.

### 4.1 Vertex-Sharing Case

When $$\tau$$ and $$\tau '$$ share only one vertex, we use the transformations $$({{\hat{x}}},{{\hat{y}}}) = {\mathfrak {T}}_V^{(i)}(\xi ,\varvec{\eta })$$ with $$\xi \in (0,1)$$ and $$\varvec{\eta }\in (0,1)^3 \subset {\mathbb {R}}^3$$, satisfying

\begin{aligned} {\mathfrak {T}}_V^{(1)}(\xi ,\varvec{\eta }) = (\xi ,\xi \eta _1,\xi \eta _2,\xi \eta _2\eta _3) \quad \text {and}\quad {\mathfrak {T}}_V^{(2)}(\xi ,\varvec{\eta }) = (\xi \eta _2,\xi \eta _2\eta _3,\xi ,\xi \eta _1). \end{aligned}
(22)

So we can decompose $${{\hat{\tau }}}\times {{\hat{\tau }}}$$ into two regions and write

\begin{aligned} {\mathcal {A}}_{\tau ,\tau '}(\psi _j,\psi _i) = |J_\tau | |J_{\tau '}|\sum _{k=1}^2 \int _{(0,1)^4}a(x,y) \frac{(\varPhi ^{(k)}_{V,j}(\varvec{\eta }) \varPhi ^{(k)}_{V,i}(\varvec{\eta }))\xi ^{3-s(x)-s(y)} \eta _2}{|d_V^{(k)}(\varvec{\eta })|^{2+s(x)+s(y)}} \,{{\textrm{d}}}\varvec{\eta }\,{{\textrm{d}}}\xi .\nonumber \\ \end{aligned}
(23)

Here we recall that $$(x,y) = \chi _{\tau ,\tau '}\circ {\mathfrak {T}}_V^{(k)}(\xi ,\varvec{\eta })$$ for $$k=1,2$$. $$\varPsi _{V,i}^{(k)}$$ provides the difference between $$\psi _i(x)-\psi _i(y)$$ for all the five associated shape functions and rescaled by $$\xi ^{-1}$$, namely

\begin{aligned} \begin{array}{ll} \{\eta _2-1,1-\eta _1,\eta _1,\eta _2\eta _3-\eta _2,-\eta _2\eta _3\}, &{} \quad \text {for } k = 1, \\ \{1-\eta _2,\eta _2-\eta _2\eta _3,\eta _2\eta _3,\eta _1-1,-\eta _1\}, &{} \quad \text {for } k = 2. \\ \end{array} \end{aligned}

The denominator $$d_V^{(k)}$$ comes from the difference $$x-y$$ but is rescaled by extracting the factor $$\xi$$ to give

\begin{aligned} d_V^{(1)}(\varvec{\eta }) = J_\tau (1,\eta _1)^T - J_{\tau '}(\eta _2,\eta _2\eta _3)^T \quad \text {and}\quad d_V^{(2)}(\varvec{\eta }) = J_\tau (\eta _2,\eta _2\eta _3)^T - J_{\tau '}(1,\eta _1)^T. \end{aligned}
(24)

The integrand in (23) is now non-singular and can thus be approximated by the tensorized Gaussian quadrature rule with order n. Here we apply an extra transformation for the variable $$\xi$$ with $$\xi = \zeta ^{1/(4-2{{\overline{s}}})}$$ and rewrite (23) as

\begin{aligned} \begin{aligned} {\mathcal {A}}_{\tau ,\tau '}(\psi _j,\psi _i)&= \frac{|J_\tau | |J_{\tau '}|}{4-2{{\overline{s}}}}\sum _{k=1}^2 \int _{(0,1)^4}\bigg (a(x,y)\zeta ^{\frac{2{{\overline{s}}} -s(x)-s(y)}{4-2{{\overline{s}}}}}\eta _2\\&\quad \times \frac{\varPhi ^{(k)}_{V,j}(\varvec{\eta })\varPhi ^{(k)}_{V,i}(\varvec{\eta })}{|d_V^{(k)} (\varvec{\eta })|^{2+s(x)+s(y)}}\bigg ) \,{{\textrm{d}}}\varvec{\eta }\,{{\textrm{d}}}\zeta , \end{aligned} \end{aligned}
(25)

where $$(x,y) = \chi _{\tau ,\tau '}\circ {\mathfrak {T}}_V^{(k)}(\zeta ^{1/(4-2{{\overline{s}}})},\varvec{\eta })$$. We shall apply the tensorized Gaussian quadrature rule to the above integral and denote the corresponding approximation of $${\mathcal {A}}_{\tau ,\tau '}(\psi _j,\psi _i)$$ by $$Q^n_{V}(\psi _j,\psi _i)$$. The reason to compute $${\mathcal {A}}_{\tau ,\tau '}(\psi _j,\psi _i)$$ by (25) rather than (23) is to improve the quadrature approximation rates for a more general setting of s(x) and a(xy); see Remark 3 below for details.

To analyze the quadrature error, we follow the argument from Section 5.3.2 in [45], which is based on the derivative-free quadrature error estimates [18]. The error estimate for $$\int _{(0,1)^4}f(x_1,\ldots ,x_4)\,{{\textrm{d}}}{\textbf{x}}$$ requires that for each direction $$x_i$$, the integrand f can be analytically extended from [0, 1] to $${\mathcal {E}}_{\rho _i}\subset {\mathbb {C}}$$, where $${\mathcal {E}}_{\rho _i}$$ is a closed ellipse with the focus points 0 and 1, and where $$\rho _i>\tfrac{1}{2}$$ denotes the sum of semimajor and semiminor axes. Then, the error between the exact integral and its n-th order tensor product Gaussian quadrature, denoted by $$Q^n f$$, can be estimated by (cf. [45, Theorem 5.3.15])

\begin{aligned} \bigg |\int _{(0,1)^4} f\,{{\textrm{d}}}{\textbf{x}} - Q^n f\bigg | \lesssim \sum _{i=1}^4(2\rho _i)^{-2n} \max _{\begin{array}{c} {\textbf{x}}\in [0,1]^4\\ z\in \partial {{\mathcal {E}}}_{\rho _i} \end{array}} |f(x_1,\ldots ,x_{i-1},z,x_{i+1},\ldots , x_4)|. \end{aligned}
(26)

To verify that the above estimate can by applied to (23), we shall check that the integrand in (25) (or (23)) can be analytically extended to $${\mathcal {E}}_\rho$$ for each component with $$\rho >\tfrac{1}{2}$$. To this end, we first note that the mappings $${\mathfrak {T}}_V^{(k)}$$ in (22) are component-wise analytic in $${\mathbb {C}}$$. Thus, the analyticity also holds for the mapping $$\chi _{\tau ,\tau '}\circ {\mathfrak {T}}_V^{(k)}$$ as well as the diffusion coefficient a(xy) and $$s(x)+s(y)$$ due to the analyticity assumption for a(., .) and s(.). So $$\zeta ^{(2{{\overline{s}}}-s(x)-s(y))/(4-2{{\overline{s}}})}$$ (or $$\xi ^{3-s(x)-s(y)}$$) and $$|d_V^{(k)}(\varvec{\eta })|^{2+s(x)+s(y)}$$ are analytic. Noting that the product $$(\varPhi ^{(k)}_{V,j}(\varvec{\eta })\varPhi ^{(k)}_{V,i}(\varvec{\eta }))$$ is a polynomial with degree no more than four, there exists an analytic extension of it in the complex space.

Now we estimate the maximum of the integrand for each component in $$(\zeta ,\varvec{\eta })$$. For the numerator, we choose $$\rho \in (\tfrac{1}{2}, 1)$$ so that

\begin{aligned} \sup _{\begin{array}{c} \zeta \in {\mathcal {E}}_\rho \\ \varvec{\eta }\in [0,1]^3 \end{array}}|a(x,y)(\varPhi ^{(k)}_{V,j} (\varvec{\eta })\varPhi ^{(k)}_{V,i}(\varvec{\eta })) \zeta ^{\frac{2{{\overline{s}}}-s(x)-s(y)}{4-2{{\overline{s}}}}}\eta _2| \lesssim 1 \end{aligned}
(27)

and

\begin{aligned} \sup _{\begin{array}{c} \eta _\ell \in {\mathcal {E}}_\rho \\ \eta _k\in [0,1], k\ne \ell \\ \zeta \in [0,1] \end{array}}|a(x,y)(\varPhi ^{(k)}_{V,j} (\varvec{\eta })\varPhi ^{(k)}_{V,i}(\varvec{\eta })) \zeta ^{\frac{2{{\overline{s}}}-s(x)-s(y)}{4-2{{\overline{s}}}}} \eta _2| \lesssim 1,\quad \text {for } \ell = 1,2,3. \end{aligned}

For the denominator, we further assume that h is sufficiently small so that $$|d_V^{(k)}(\varvec{\eta })| \le 1$$ (since $$|d_V^{(k)}(\varvec{\eta })|\sim h$$). So there holds

\begin{aligned} |d_V^{(k)}(\varvec{\eta })|^{2+s(x)+s(y)} \ge |d_V^{(k)}(\varvec{\eta })|^{2+2{{\overline{s}}}}. \end{aligned}
(28)

This implies that

\begin{aligned} \sup _{\begin{array}{c} \zeta \in {\mathcal {E}}_\rho \\ \varvec{\eta }\in [0,1]^3 \end{array}}\frac{1}{ |d_V^{(k)}(\varvec{\eta })|^{2+s(x)+s(y)} } \lesssim h^{-2-2{{\overline{s}}}}\quad \text {and}\quad \sup _{\begin{array}{c} \eta _\ell \in {\mathcal {E}}_\rho \\ \eta _k\in [0,1], k\ne \ell \\ \xi \in [0,1] \end{array}}\frac{1}{ |d_V^{(k)}(\varvec{\eta })|^{2+s(x)+s(y)} } \lesssim h^{-2-2{{\overline{s}}}}. \end{aligned}

Gathering the above estimates and invoking (26) for the approximation (25) (or (23)) one sees that when h is sufficiently small, there holds

\begin{aligned} |{\mathcal {A}}_{\tau ,\tau '}(\psi _j,\psi _i) - Q_{V}^n(\psi _j,\psi _i)| \lesssim h^{2-2{{\overline{s}}}} (2\rho )^{-2n}. \end{aligned}
(29)

### 4.2 Edge-Sharing Case

When $$\tau$$ and $$\tau '$$ share a common edge, we shall use the the following transformations $$({\hat{x}},{{\hat{y}}}) = {\mathfrak {T}}_E^{(k)}(\xi ,\varvec{\eta })$$ for $$k=1,\ldots ,5$$ and $$(\xi ,\varvec{\eta })\in [0,1]^4$$:

\begin{aligned} \begin{aligned} {\mathfrak {T}}_E^{(1)}(\xi ,\varvec{\eta })&= (\xi ,\xi \eta _1\eta _3,\xi (1-\eta _1\eta _2), \xi \eta _1(1-\eta _2)), \\ {\mathfrak {T}}_E^{(2)}(\xi ,\varvec{\eta })&= (\xi ,\xi \eta _1,\xi (1-\eta _1\eta _2\eta _3),\xi \eta _1\eta _2(1-\eta _3)), \\ {\mathfrak {T}}_E^{(3)}(\xi ,\varvec{\eta })&= (\xi (1-\eta _1\eta _2), \xi \eta _1(1-\eta _2), \xi , \xi \eta _1\eta _2\eta _3), \\ {\mathfrak {T}}_E^{(4)}(\xi ,\varvec{\eta })&= (\xi (1-\eta _1\eta _2\eta _3), \xi \eta _1\eta _2(1-\eta _3), \xi , \xi \eta _1), \\ {\mathfrak {T}}_E^{(5)}(\xi ,\varvec{\eta })&= (\xi (1-\eta _1\eta _2\eta _3), \xi \eta _1(1-\eta _2\eta _3), \xi , \xi \eta _1\eta _2). \end{aligned} \end{aligned}

We again use the change of variable $$\xi = \zeta ^{1/(4-2{{\overline{s}}})}$$ to write (21) as

\begin{aligned} \begin{aligned} {\mathcal {A}}_{\tau ,\tau '}(\psi _j,\psi _i)&= \frac{|J_\tau ||J_{\tau '}|}{4-2{{\overline{s}}}} \sum _{k=1}^5 \int _{(0,1)^4} \bigg (a(x,y) \zeta ^{\frac{2{{\overline{s}}}-s(x)-s(y)}{4 -2{{\overline{s}}}}}J_E^{(k)}(\varvec{\eta }) \\&\quad \times \frac{\varPsi _{E,j}^{(k)}(\varvec{\eta }) \varPsi _{E,i}^{(k)}(\varvec{\eta })}{|d_E^{(k)} (\varvec{\eta })|^{2+s(x)+s(y)}}\bigg ) \,{{\textrm{d}}}\varvec{\eta }\,{{\textrm{d}}}\zeta . \end{aligned} \end{aligned}
(30)

Here $$(x,y) = \chi _{\tau ,\tau '}\circ {\mathfrak {T}}_E^{(k)}(\zeta ^{1/(4-2{{\overline{s}}})},\varvec{\eta })$$ for $$k=1,\ldots ,5$$. In the numerator above, $$J_E^{(1)} = \eta _1^2$$ and $$J_E^{(k)} = \eta _1^2\eta _2$$ for $$k=2, \ldots , 5$$ are Jacobians. The functions $$\varPsi _{E,i}^{(k)}$$ are the transformations (rescaled by $$\xi ^{-1}$$) of $$\psi _i(x)-\psi _i(y)$$ for all four associated shape functions (defined on the corresponding four vertices of $$\tau \cup \tau '$$):

\begin{aligned} \begin{array}{ll} \{-\eta _1\eta _2,\eta _1(1-\eta _3), \eta _1\eta _3, -\eta _1(1-\eta _2)\}, &{} \quad \text {for } k = 1, \\ \{-\eta _1\eta _2\eta _3,-\eta _1(1-\eta _2),\eta _1,-\eta _1\eta _2(1-\eta _3)\}, &{} \quad \text {for } k = 2, \\ \{\eta _1\eta _2, -\eta _1(1-\eta _2\eta _3),\eta _1(1-\eta _2),-\eta _1\eta _2\eta _3\}, &{} \quad \text {for } k = 3, \\ \{\eta _1\eta _2\eta _3, \eta _1(1-\eta _2), \eta _1\eta _2(1-\eta _3),-\eta _1\}, &{} \quad \text {for } k = 4, \\ \{\eta _1\eta _2\eta _3, -\eta _1(1-\eta _2),\eta _1(1-\eta _2\eta _3), -\eta _1\eta _2\}, &{} \quad \text {for } k = 5. \\ \end{array} \end{aligned}

In the denominator, the rescaled distances $$d_E^{(k)}(\varvec{\eta })$$ are

\begin{aligned} \begin{aligned} d_E^{(1)}(\varvec{\eta })&= J_\tau (1,\eta _1\eta _3)^T - J_{\tau '}(1-\eta _1\eta _2,\eta _1(1-\eta _2))^T \\ d_E^{(2)}(\varvec{\eta })&= J_\tau (1,\eta _1)^T - J_{\tau '}(1-\eta _1\eta _2\eta _3,\eta _1\eta _2(1-\eta _3))^T, \\ d_E^{(3)}(\varvec{\eta })&= J_\tau (1-\eta _1\eta _2,\eta _1(1-\eta _2))^T - J_{\tau '}(1,\eta _1\eta _2\eta _3)^T \\ d_E^{(4)}(\varvec{\eta })&= J_\tau (1-\eta _1\eta _2\eta _3, \eta _1\eta _2(1-\eta _3))^T - J_{\tau '}(1,\eta _1)^T, \\ d_E^{(5)}(\varvec{\eta })&= J_\tau (1-\eta _1\eta _2\eta _3,\eta _1(1-\eta _2\eta _3))^T - J_{\tau '}(1,\eta _1\eta _2)^T. \end{aligned} \end{aligned}

Following a similar argument as in the previous case, we can show that the function in the above integral can be analytically extended to $${\mathcal {E}}_\rho$$ with some $$\rho \in (\tfrac{1}{2},1)$$ for each component of $$(\xi ,\varvec{\eta })$$. The detailed proof is omitted for brevity. So we can apply the n-th order tensorized Gaussian quadrature rule, denoted by $$Q_E^n(\psi _j,\psi _i)$$, to approximate the integral. Assuming that the mesh size h is small enough so that $$d_E^{(k)}(\varvec{\eta }) < 1$$ for $$k=1,\ldots , 5$$, the quadrature error can be estimated with

\begin{aligned} |{\mathcal {A}}_{\tau ,\tau '}(\psi _j,\psi _i) - Q_{E}^n(\psi _j,\psi _i)| \lesssim h^{2-2{{\overline{s}}}} (2\rho )^{-2n}. \end{aligned}
(31)

### 4.3 Identical Case

When $$\tau =\tau '$$ we use the following transformations $$({{\hat{x}}},{{\hat{y}}}) = {\mathfrak {T}}_I^{(k)}(\xi ,\varvec{\eta })$$ for $$k=1,\ldots ,6$$ and $$(\xi ,\varvec{\eta })\in [0,1]^4$$:

\begin{aligned} \begin{aligned} {\mathfrak {T}}_I^{(1)}(\xi ,\varvec{\eta })&= (\xi , \xi (1-\eta _1+\eta _1\eta _2), \xi (1-\eta _1\eta _2\eta _3), \xi (1-\eta _1)), \\ {\mathfrak {T}}_I^{(2)}(\xi ,\varvec{\eta })&= (\xi , \xi (1-\eta _2+\eta _2\eta _3), \xi (1-\eta _1\eta _2), \xi \eta _1(1-\eta _2)), \\ {\mathfrak {T}}_I^{(3)}(\xi ,\varvec{\eta })&= (\xi (1-\eta _1\eta _2\eta _3), \xi \eta _1(1-\eta _2\eta _3), \xi , \xi \eta _1(1-\eta _2)), \\ {\mathfrak {T}}_I^{(k)}(\xi ,\varvec{\eta })&= {\mathfrak {S}}\circ {\mathfrak {T}}_I^{(k-3)}(\xi ,\varvec{\eta }),\quad \text {for } k = 4,5,6. \end{aligned} \end{aligned}

with $${\mathfrak {S}}({{\hat{x}}},{{\hat{y}}}) = ({{\hat{y}}},{{\hat{x}}})$$. Thanks to the symmetry property between the mappings $${\mathfrak {T}}_I^{(k)}$$ and $${\mathfrak {T}}_I^{(k+3)}$$ for $$k=1,2,3$$ (i.e., the operator $${\mathfrak {S}}$$) as well as the symmetry property for the kernel function, we arrive at

\begin{aligned} \begin{aligned} {\mathcal {A}}_{\tau ,\tau '}(\psi _j,\psi _i)&= \frac{|J_\tau |^2}{2-{{\overline{s}}}} \sum _{k=1}^3\int _{(0,1)^4} \bigg (\zeta ^{\frac{2{{\overline{s}}} -s(x)-s(y)}{4-2{{\overline{s}}}}}\eta _1^{2-s(x)-s(y)}\eta _2^{1-s(x)-s(y)} \\&\quad \times \frac{a(x,y)\varPsi _{I,j}^{(k)}(\eta _3) \varPsi _{I,i}^{(k)}(\eta _3)}{|d_I^{(k)}(\eta _3)|^{2+s(x) +s(y)}}\bigg ) \,{{\textrm{d}}}\varvec{\eta }\,{{\textrm{d}}}\zeta , \end{aligned} \end{aligned}
(32)

where $$(x,y) = \chi _{\tau ,\tau '}\circ {\mathfrak {T}}_I^{(k)}(\xi ,\varvec{\eta })$$ with $$\xi = \zeta ^{1/(4-2{{\overline{s}}})}$$, $$\varPsi _{I,i}^{(k)}(\eta _3)$$ are the three rescaled shape functions ($$\psi _i(x)-\psi _j(y)$$) defined on $$\tau$$ provided by

\begin{aligned} \begin{array}{ll} \{-\eta _3,\eta _3-1,1\}, &{} \quad \text {for } k = 1, \\ \{-1,1-\eta _3,\eta _3\}, &{} \quad \text {for } k = 2, \\ \{\eta _3,-1,1-\eta _3\}, &{} \quad \text {for } k = 3, \\ \end{array} \end{aligned}

and

\begin{aligned} d_I^{(1)}(\eta _3) = J_\tau (\eta _3,1)^T,\, d_I^{(2)}(\eta _3) = J_\tau (1,\eta _3)^T,\, d_I^{(3)}(\eta _3) = J_\tau (\eta _3,1-\eta _3)^T \end{aligned}

are the rescaled distances between x and y. Notice that the integrand in (32) could be singular at $$\eta _2 = 0$$ due to the term $$\eta _2^{1-s(x)-s(y)}$$ when $$s(x)+s(y)>1$$. To resolve this, we use the change of variable $$\eta _2 = t^{1/(2-2{{\overline{s}}})}$$ to write

\begin{aligned} \begin{aligned} {\mathcal {A}}_{\tau ,\tau '}(\psi _j,\psi _i)&= \frac{|J_\tau |^2}{(2-{{\overline{s}}})(2-2{{\overline{s}}})} \sum _{k=1}^3\int _{(0,1)^4} \bigg (\zeta ^{\frac{2{{\overline{s}}} -s(x)-s(y)}{4-2{{\overline{s}}}}}\eta _1^{2-s(x) -s(y)}t^{\tfrac{2{{\overline{s}}}-s(x)-s(y)}{2-2{{\overline{s}}}} } \\&\quad \times \frac{a(x,y)\varPsi _{I,j}^{(k)}(\eta _3) \varPsi _{I,i}^{(k)}(\eta _3)}{|d_I^{(k)}(\eta _3)|^{2+s(x) +s(y)}}\bigg ) \,{{\textrm{d}}}\eta _1\,{{\textrm{d}}}t\,{{\textrm{d}}}\eta _3\,{{\textrm{d}}}\zeta , \end{aligned} \end{aligned}
(33)

where we recall that $$(x,y) = \chi _{\tau ,\tau '}\circ {\mathfrak {T}}_I^{(k)}(\zeta ^{1/(4-2{{\overline{s}}})},\eta _1, t^{1/(2-2{{\overline{s}}})},\eta _3)$$. We then apply the tensorized Gaussian quadrature to (33). If $${{\overline{s}}}<\tfrac{1}{2}$$, we can also simply apply the same quadrature rule to (32). Denoting $$Q_I^n(\psi _j,\psi _i)$$ the resulting quadrature approximation with order n based on (32) or (33). Following the arguments above, we obtain that when h is small enough, there holds

\begin{aligned} |{\mathcal {A}}_{\tau ,\tau '}(\psi _j,\psi _i) - Q_{I}^n(\psi _j,\psi _i)| \lesssim h^{2-2{{\overline{s}}}} (2\rho )^{-2n}, \end{aligned}
(34)

for some $$\rho \in (\tfrac{1}{2},1]$$.

### 4.4 Quadrature Error for the Singular Near-Field Formulation

The quadrature schemes provided by the proceeding subsections for all $$\tau \in {\varOmega ^{\text {int}}}$$ and $$\tau '\in {\mathcal {S}}_\tau$$ form an approximation of the bilinear form $${\mathcal {B}}(U,V)$$ for $$U,V\in {\mathbb {V}}({\mathcal {T}})$$. We denote this approximation by $$Q^n_{{\mathcal {B}}}(U,V)$$. The following proposition shows the corresponding consistency error. The proof follows the standard arguments for the quadrature approximation for bilinear forms (see e.g. [3, Theorem 10] and [45, Theorem 5.3.29]). Here we provide a proof for completeness.

### Proposition 1

(quadature error for near-field approximations) For $$U,V\in {\mathbb {V}}({\mathcal {T}})$$, let $$Q^n_{{\mathcal {B}}}(U,V)$$ be the approximation of $${\mathcal {B}}(U,V)$$ by replacing $${\mathcal {A}}_{\tau ,\tau '}(U,V)$$ with $${\mathcal {Q}}^n_V$$, $${\mathcal {Q}}^n_E$$ or $${\mathcal {Q}}^n_I$$ defined in Sects. 4.1 to 4.3 depending on the relations between $$\tau$$ and $$\tau '$$. When the mesh size h is small enough, there exists $$\rho \in (\tfrac{1}{2},1)$$ such that

\begin{aligned} |{\mathcal {B}}(U,V) - {\mathcal {Q}}^n_{{\mathcal {B}}}(U,V)| \lesssim h^{-2{{\overline{s}}}}(2\rho )^{-2n} \Vert U\Vert _{L^2({\varOmega ^{\text {int}}})}\Vert V\Vert _{L^2({\varOmega ^{\text {int}}})}. \end{aligned}

### Proof

Denote $$e^{ij}_{\tau ,\tau '} = {\mathcal {A}}_{\tau ,\tau '}(\psi _j,\psi _i) - Q^n(\psi _j,\psi _i)$$, where $$Q^n$$ is the tensor product Gaussian quadrature form $$Q_V^n$$, $$Q_E^n$$, or $$Q_I^n$$. According to the quadrature error estimates Eqs. (29), (31), and (34), there holds that for h sufficiently small,

\begin{aligned} |e^{ij}_{\tau ,\tau '}| \lesssim h^{2-2{{\overline{s}}}}(2\rho )^{-2n}. \end{aligned}

Now we set $$U=\sum _{j=1}^Nu_j\psi _j$$ and $$V=\sum _{i=1}^N v_i\psi _i$$. To estimate the target error, we use the definition (12) and write

\begin{aligned} \begin{aligned} |{\mathcal {B}}(U,V) - {\mathcal {Q}}^n_{{\mathcal {B}}}(U,V)|&\le \sum _{\begin{array}{c} \tau \in {\mathcal {T}}^{\text {int}} \\ \tau '\in {\mathcal {S}}_\tau \end{array}} \sum _{i,j\in {\mathcal {I}}(\tau ,\tau ')} |e^{ij}_{\tau ,\tau '}u_jv_i| \\&\lesssim h^{2-2{{\overline{s}}}}(2\rho )^{-2n} \sum _{\begin{array}{c} \tau \in {\mathcal {T}}^{\text {int}} \\ \tau '\in {\mathcal {S}}_\tau \end{array}} \sum _{i,j\in {\mathcal {I}}(\tau ,\tau ')} |u_jv_i|. \end{aligned} \end{aligned}
(35)

Here, we recall that $${\mathcal {I}}(\tau ,\tau ')\subset \{1,\ldots , N\}$$ is the index set whose associated global shape functions are non-zero on $$\tau \cup \tau '$$. Utilizing the Cauchy–Schwarz inequality, we have

\begin{aligned} \begin{aligned} \sum _{i,j\in {\mathcal {I}}(\tau ,\tau ')} |u_jv_i|&\le \#{\mathcal {I}}(\tau ,\tau ') \bigg (\sum _{j\in {\mathcal {I}}(\tau ,\tau ')}|u_j|^2\bigg )^{1/2} \bigg (\sum _{i\in {\mathcal {I}}(\tau ,\tau ')}|v_i|^2\bigg )^{1/2} \\&\lesssim h^{-2}\Vert U\Vert _{L^2(\tau \cup \tau ')}\Vert V\Vert _{L^2(\tau \cup \tau ')}, \end{aligned} \end{aligned}
(36)

where $$\#{\mathcal {I}}(\tau ,\tau ')$$ denotes the cardinality of $${\mathcal {I}}(\tau ,\tau ')$$ and where for the last inequality we used the fact that $$\#{\mathcal {I}}(\tau ,\tau ')\lesssim 1$$ and that

\begin{aligned} \sum _{j\in {\mathcal {I}}(\tau ,\tau ')}|u_j|^2\lesssim h^{-2}\Vert U\Vert _{L^2(\tau \cup \tau ')}^2. \end{aligned}

We note that due to the shape-regularity property for $${\mathcal {T}}$$, the number of cells in $${\mathcal {S}}_\tau$$ is uniformly bounded for all $$\tau \in {\varOmega ^{\text {int}}}$$. So we insert (36) into (35) and continue to bound the summation in (35) by the Cauchy–Schwarz inequality. This leads to

\begin{aligned} \begin{aligned} \sum _{\begin{array}{c} \tau \in {\mathcal {T}}^{\text {int}} \\ \tau '\in {\mathcal {S}}_\tau \end{array}} \sum _{i,j\in {\mathcal {I}}(\tau ,\tau ')} |u_jv_i|&\lesssim h^{-2}\sum _{\tau \in {\mathcal {T}}^{\text {int}} } \sum _{\tau '\in {\mathcal {S}}_\tau } \Vert U\Vert _{L^2(\tau \cup \tau ')}\Vert V\Vert _{L^2(\tau \cup \tau ')} \\&\lesssim h^{-2}\bigg (\sum _{\begin{array}{c} \tau \in {\mathcal {T}}^{\text {int}} \\ \tau '\in {\mathcal {S}}_\tau \end{array}} \Vert U\Vert _{L^2(\tau \cup \tau ')}^2\bigg )^{1/2} \bigg (\sum _{\begin{array}{c} \tau \in {\mathcal {T}}^{\text {int}} \\ \tau '\in {\mathcal {S}}_\tau \end{array}} \Vert V\Vert _{L^2(\tau \cup \tau ')}^2\bigg )^{1/2} \\&\lesssim h^{-2} \Vert U\Vert _{L^2({\varOmega ^{\text {int}}})}\Vert V\Vert _{L^2({\varOmega ^{\text {int}}})}. \end{aligned} \end{aligned}

Together with (35), we arrive at

\begin{aligned} |{\mathcal {B}}(U,V) - {\mathcal {Q}}^n_{{\mathcal {B}}}(U,V)| \lesssim h^{-2{{\overline{s}}}}(2\rho )^{-2n} \Vert U\Vert _{L^2({\varOmega ^{\text {int}}})}\Vert V\Vert _{L^2({\varOmega ^{\text {int}}})} \end{aligned}

as desired. $$\square$$

We end the section with the following remarks.

### Remark 1

(order of the quadrature rule) In order to get the convergence rate $$h^{\beta }$$ with $$\beta >0$$ for the consistency error in Proposition 1, we need to set

\begin{aligned} h^{-2{{\overline{s}}}}(2\rho )^{-2n} \le Ch^{\beta } \end{aligned}

with some positive constant $$C>1$$. This implies that the quadrature order should be chosen to satisfy that

\begin{aligned} n\ge \frac{(\beta +2{{\overline{s}}})\log (1/h)-\log C}{2\log (2\rho )}. \end{aligned}

### Remark 2

(near-field consistency) Note that by the fact that $$\tau '\in {\mathcal {S}}_\tau$$ and that $${\mathcal {S}}_\tau$$ is quasi-uniform, the error estimate we obtained in Proposition 1 improves the results from [3, Theorem 10] by removing the factor N (the number of degrees of freedom).

### Remark 3

(a general setting for s(x) and a(xy)) The above implementation for the matrix $${\underline{B}}$$ can be also extended when s(x) has jumps across the edges. Here we assume that for each triangle $$\tau \in {\mathcal {T}}$$, $$s(\chi _\tau ({{\hat{x}}}))$$ can be analytically extended to a complex neighborhood of the reference triangle $${{\hat{\tau }}}$$, denoted by $${{\hat{\tau }}}^*$$. Similar assumptions can be also applied to a(., y) and a(x, .). Following the argument in [45, Lemma 5.3.19], we can show that $$s(\chi _\tau \circ {\mathfrak {T}}_V^{(k)}(\zeta ,\varvec{\eta }))$$ is analytic for $$\zeta \in {\mathcal {E}}_{\rho _1/h}$$ with $$\rho _1>0$$ sufficiently small and is analytic for $$\eta _\ell \in {\mathcal {E}}_{\rho _2}$$ for some $$\rho _2\in (\tfrac{1}{2}, 1)$$ with $$\ell =1,2,3$$. Under the above restriction for $$\zeta$$, we shall update the estimate (27) with

\begin{aligned} \sup _{\begin{array}{c} \zeta \in {\mathcal {E}}_\rho \\ \varvec{\eta }\in [0,1]^3 \end{array}}|a(x,y)(\varPhi ^{(k)}_{V,j} (\varvec{\eta })\varPhi ^{(k)}_{V,i}(\varvec{\eta })) \zeta ^{\frac{2{{\overline{s}}}-s(x)-s(y)}{4-2{{\overline{s}}}}}\eta _2| \lesssim h^{-\frac{{{\overline{s}}}-{\underline{s}}}{2-{{\overline{s}}}}}. \end{aligned}

This implies the new error estimate for approximating (25)

\begin{aligned} \begin{aligned} |{\mathcal {A}}_{\tau ,\tau '}(\psi _j,\psi _i) - Q_V^n(\psi _j,\psi _i)|&\lesssim \left( \frac{\rho _1}{h}\right) ^{-2n}h^{2-2{{\overline{s}}} -\frac{{{\overline{s}}}-{\underline{s}}}{2-{{\overline{s}}}}} + h^{2-2{{\overline{s}}}}(2\rho _2)^{-2n} \\&\lesssim h^{2n+2-2{{\overline{s}}}-\frac{{{\overline{s}}} -{\underline{s}}}{2-{{\overline{s}}}}} + h^{2-2{{\overline{s}}}}(2\rho _2)^{-2n}. \end{aligned} \end{aligned}

On the other hand, if we approximate (25), we have

\begin{aligned} |{\mathcal {A}}_{\tau ,\tau '}(\psi _j,\psi _i) - Q_V^n(\psi _j,\psi _i)| \lesssim h^{2n+2{\underline{s}}-2{{\overline{s}}}-1} + h^{2-2{{\overline{s}}}}(2\rho _2)^{-2n}. \end{aligned}

Here we note that for the first term on the right-hand side above, the convergence rate is lower than the previous approach and could even be negative when $$n=1$$. We can analogously analyze the quadrature for Eqs. (30) and (33) and follow the argument in Proposition 1 to obtain that

\begin{aligned} |{\mathcal {B}}(U,V) - {\mathcal {Q}}^n_{{\mathcal {B}}}(U,V)| \lesssim h^{2n-2\overline{s}-\frac{\overline{s}-\underline{s}}{2-\overline{s}}} + h^{-2{{\overline{s}}}}(2\rho _2)^{-2n}. \end{aligned}

Hence, when choosing the quadrature order according to Remark 1 to achieve the rate $$O(h^\beta )$$, a sufficient condition is to set

\begin{aligned} n\ge \frac{1}{2} \left( \beta + 2 \overline{s} + \frac{\overline{s}-\underline{s}}{2-\overline{s}} \right) . \end{aligned}

In our numerical simulations in Sect. 7, we set $$n=1$$ when assembling $${\underline{B}}$$.

## 5 $${\mathcal {H}}^2$$-Matrix Approximation of the Non-singular Interactions

We now consider the construction of a hierarchical matrix approximation of the second term in (11) so as to avoid the quadratic complexity that a direct computation would entail. The construction here is now dealing with a de-singularized kernel, since the singularities due to element-pairs that are touching have been resolved by the integrations of the previous section.

### 5.1 Construction of the $${\mathcal {H}}^2$$ Matrix Structure

One of the key approximations in hierarchical matrix representations involves clustering neighboring vertices and representing their net effect on other, sufficiently far-away, clusters by appropriate 2-dimensional polynomials. Therefore the first step of the construction is to generate a hierarchy of spatial clusters for the mesh vertices. We do this by partitioning the interior vertices using a KD-tree, with repeated plane splits along coordinate directions. The construction is recursive starting from the whole point set as the topmost cluster. The points within each cluster are first sorted by projecting along the largest dimension of their bounding box. The sorted point clusters are then split along their median into two children clusters, with the recursion stopping when the cardinality of leaf clusters reaches a specified parameter m. This procedure produces a complete binary cluster tree $${\mathscr {T}}$$ that has $$L = 1 + \log (N/m)$$ levels with leaves of size no larger than m.

The resulting cluster tree $${\mathscr {T}}$$ together with an admissibility condition provides the structure and the starting point for constructing the $${\mathcal {H}}$$-matrix approximation $$\widetilde{{\underline{K}}}$$ of $${\underline{K}}$$. Specifically, let $$\sigma$$ and $$\sigma '$$ be the vertex index sets for two clusters at the same level $$\ell$$ in $${\mathscr {T}}$$ and $$\omega _{\sigma }$$ and $$\omega _{\sigma '}$$ their corresponding bounding boxes, respectively. The matrix block $$\widetilde{{\underline{K}}}_{\sigma ,\sigma '}$$ with rows $$\sigma$$ and columns $$\sigma '$$ may be represented as a single low rank approximation if the corresponding bounding boxes satisfy the admissibility condition

\begin{aligned} \max \{{{\,\textrm{diam}\,}}(\omega _{\sigma }^e), {{\,\textrm{diam}\,}}(\omega _{\sigma '}^e)\} \le \lambda _1{{\,\textrm{dist}\,}}(\omega _{\sigma }^e,\omega _{\sigma '}^e) \end{aligned}
(37)

for some $$\lambda _1 > 0$$. Here $$\omega _{\sigma }^e$$ and $$\omega _{\sigma '}^e$$ are the bounding-box extensions containing all elements in the support of the basis functions of nodes in $$\sigma$$ and $$\sigma '$$. This extends the elements in the clusters whose bounding boxes are $$\omega _{\sigma }$$ and $$\omega _{\sigma '}$$ by a band that is one-element wide (see Fig. 3a), and therefore insures that no element pair ($$\tau$$, $$\tau '$$) from the two bounding boxes $$\omega _{\sigma }^e$$ and $$\omega _{\sigma '}^e$$ satisfying (37) involves singular integrals, and that the kernel $$\gamma _{{\mathcal {T}}}(x, y) = \gamma (x, y)$$ in this cluster pair. We denote by $${P^{\text {far}}}$$ the collection of the cluster pairs $$(\sigma ,\sigma ')\in {\mathscr {T}}\times {\mathscr {T}}$$ that satisfy the admissibility condition and describe the computation of their low rank approximation in Sect. 5.3 below.

When the admissibility condition is not satisfied for clusters at the leaf level L, the entries in the blocks $$\widetilde{{\underline{K}}}_{\sigma ,\sigma '}$$ are computed by direct numerical quadrature. Since we have already accounted for the singular integrals involving element pairs $$(\tau , \tau ')$$ with $$\tau \cap \tau ' \ne 0$$ directly in Sect. 4, the integrals here involve only smooth integrands and we describe their computation in Sect. 5.2 below. We will denote by $${P^{\text {near}}}$$ the set of leaf-level cluster pairs that violate the admissibility condition and are computed directly. They represent the non-singular smooth part of the near field. Hence, $${P^{\text {far}}}\cup {P^{\text {near}}}$$ covers all the cluster pairs.

The construction of the hierarchical structure of the matrix is recursive and starts from the root of the matrix quadtree $${\mathscr {T}}\times {\mathscr {T}}$$. At every level l, the pairs $$(\sigma ,\sigma ')$$ with $$\sigma$$ in the first (row) tree and $$\sigma '$$ in the second (column) tree are considered. If a pair satisfies the admissibility condition, a low rank approximation of it is constructed from a suitable polynomial approximation of the kernel, and the corresponding matrix block is no longer subdivided. If the admissibility condition is violated, the children of $$\sigma$$ and $$\sigma '$$ at level $$l+1$$ are considered. The recursion terminates when the leaf level L is reached. The blocks at that level with clusters that do not satisfy the admissibility condition are computed directly as described next; this includes the diagonal blocks with $$\sigma = \sigma '$$.

### 5.2 Direct Computation of the Smooth Near-Field

Let us first consider the (non singular) near-field entries. For the entry $$\widetilde{{\underline{K}}}_{ij}$$ for $$i,j=1,\ldots ,N$$ with $$i\in \sigma$$, $$j\in \sigma '$$, and $$(\sigma ,\sigma ')\in {P^{\text {near}}}$$, we have

\begin{aligned} \begin{aligned} {\underline{K}}_{ij}&= \sum _{\begin{array}{c} \tau \subset {\mathcal {S}}_i, \tau '\subset {\mathcal {S}}_j \\ \tau \cap \tau '=\emptyset \end{array}} {\mathcal {K}}_{\tau ,\tau '}(\psi _j,\psi _i) \\&= -2 \sum _{\begin{array}{c} \tau \subset {\mathcal {S}}_i, \tau '\subset {\mathcal {S}}_j \\ \tau \cap \tau '=\emptyset \end{array}} \int _\tau \int _{\tau '}\gamma (x,y)\psi _j(x)\psi _i(y)\,{{\textrm{d}}}y\,{{\textrm{d}}}x \\&= -2 \sum _{\begin{array}{c} \tau \subset {\mathcal {S}}_i, \tau '\subset {\mathcal {S}}_j \\ \tau \cap \tau '=\emptyset \end{array}}|J_\tau ||J_{\tau '}|\int _{{{\hat{\tau }}}}\int _{{{\hat{\tau }}}} \gamma (\chi _\tau ({{\hat{x}}}),\chi _{\tau '}({{\hat{y}}})){{\hat{\psi }}}_j({{\hat{x}}}){{\hat{\psi }}}_i({{\hat{y}}}) \,{{\textrm{d}}}{{\hat{y}}}\,{{\textrm{d}}}{{\hat{x}}}, \end{aligned} \end{aligned}
(38)

where $${{\hat{\psi }}}_i$$ are the corresponding shape functions defined on $${{\hat{\tau }}}$$. The sum is over non-touching element pairs in the support of $$\psi _i$$ and $$\psi _j$$ (see Fig. 3b). The computation for every such element pair (the integral on the right-hand side above) is based on a strategy similar to the one introduced in Sect. 4. Here we apply the Duffy coordinates $${\mathfrak {T}}: [0,1]^4\rightarrow {{\hat{\tau }}}\times {{\hat{\tau }}}$$ satisfying that

\begin{aligned} {\mathfrak {T}}(\xi ,\varvec{\eta }) = (\xi ,\xi \eta _1,\eta _2,\eta _2\eta _3) = ({{\hat{x}}},{{\hat{y}}}). \end{aligned}

We then again apply the tensor-product Gaussian quadrature scheme with order n to the transformed integral and denote by $$\widetilde{{\underline{K}}}_{ij}$$ the resulting approximation of $${\underline{K}}_{ij}$$.

### 5.3 Approximation of the Far-Field

Let’s consider a matrix block $$\widetilde{{\underline{K}}}_{\sigma ,\sigma '}$$ with $$(\sigma ,\sigma ')\in {P^{\text {far}}}$$, which is represented as a low rank block. The ij-th entry of that block, $$(\widetilde{{\underline{K}}}_{\sigma ,\sigma '})_{ij}$$ for $$i\in \sigma$$ and $$j\in \sigma '$$ may be written as:

\begin{aligned} \begin{aligned} (\widetilde{{\underline{K}}}_{\sigma ,\sigma '})_{ij}&= \sum _{\begin{array}{c} \tau \subset {\mathcal {S}}_i,\tau '\subset {\mathcal {S}}_j\\ \tau \cap \tau '=\emptyset \end{array}}\widetilde{{\mathcal {K}}}_{\tau ,\tau '}(\psi _j,\psi _i)\\&:= \sum _{\begin{array}{c} \tau \subset {\mathcal {S}}_i,\tau '\subset {\mathcal {S}}_j \\ \tau \cap \tau '=\emptyset \end{array}} -2\int _\tau \int _{\tau '} I^p_{\sigma ,\sigma '}\gamma (x,y)\psi _i(x)\psi _j(y)\,{{\textrm{d}}}y \,{{\textrm{d}}}x, \end{aligned} \end{aligned}
(39)

where $$I^{p}_{\sigma ,\sigma '}$$ is the nodal interpolant on the domain $$\omega _{\sigma }^e\times \omega _{\sigma '}^e$$ using a tensor-product Chebyshev polynomial of order p,

\begin{aligned} I^{p}_{\sigma ,\sigma '}\gamma (x, y) = \sum _{\alpha =1}^p \sum _{\beta =1}^p \gamma (\xi _{\sigma , \alpha }, \xi _{\sigma ', \beta }) \ell _{\sigma , \alpha }(x) \ell _{\sigma ', \beta }(y) \quad \text {for } x \in \omega ^e_\sigma , y \in \omega ^e_{\sigma '} \end{aligned}
(40)

where $$\{\xi _{\sigma ,\alpha },\xi _{\sigma ',\beta }\}$$ are Chebyshev nodes and $$\{\ell _{\sigma ,\alpha }, \ell _{\sigma ',\beta }\}$$ are the corresponding d-dimensional Lagrange basis polynomials in $$\omega ^e_\sigma$$ and $$\omega ^e_{\sigma '}$$, respectively. This allows us to write the ij-th entry of $$\widetilde{{\underline{K}}}_{\sigma ,\sigma '}$$ as

\begin{aligned} \sum _{\alpha =1}^p \sum _{\beta =1}^p \left[ \sum _{\begin{array}{c} \tau \subset {\mathcal {S}}_i \end{array}} \int _\tau \psi _i(x) \ell _{\sigma , \alpha } (x) dx \right] \left[ -2 \gamma (\xi _{\sigma , \alpha }, \xi _{\sigma ', \beta }) \right] \left[ \sum _{\begin{array}{c} \tau '\subset {\mathcal {S}}_j \end{array}} \int _{\tau '} \psi _j(y) \ell _{\sigma ', \beta } (y) dy \right] \end{aligned}
(41)

which in factored form is given by

\begin{aligned} (\widetilde{{\underline{K}}}_{\sigma ,\sigma '})_{ij} = \sum _{\alpha =1}^p \sum _{\beta =1}^p \left( {\underline{U}}_\sigma \right) _{i\alpha } \left( {\underline{S}}_{\sigma , \sigma '} \right) _{\alpha \beta } \left( {\underline{V}}_{\sigma '}^T \right) _{\beta j} = \left( {\underline{U}}_\sigma \, {\underline{S}}_{\sigma , \sigma '} \, {\underline{V}}_{\sigma '}^T \right) _{ij}. \end{aligned}
(42)

The representation $${\underline{U}}_\sigma \, {\underline{S}}_{\sigma , \sigma '} \, {\underline{V}}_{\sigma '}^T$$ is a rank-p factorization of the $$\widetilde{{\underline{K}}}_{\sigma ,\sigma '}$$ block. It is written in the bases $${\underline{U}}_\sigma$$ and $${\underline{V}}_{\sigma '}$$. These bases are of size ($$\#\sigma \times p$$) and ($$\#\sigma ' \times p$$) and are common to all block rows $$\sigma$$ and block columns $$\sigma '$$ in $$\widetilde{{\underline{K}}}$$, respectively. Individual matrix blocks have their own small $${\underline{S}}_{\sigma , \sigma '}$$ ($$p \times p$$) factors. Evaluation of the bases can be done exactly using $$\lceil \tfrac{p+1}{2}\rceil$$-order Gaussian quadrature schemes since the integrands involved in their entries are polynomials of order no more than $$p+1$$.

There is one final step needed to achieve linear overall complexity, since the approximation of the admissible blocks by the low rank approximation $${\underline{U}}_\sigma \, {\underline{S}}_{\sigma , \sigma '} \, {\underline{V}}_{\sigma '}^T$$ above would result in $$O(N \log N)$$ complexity. In order to remove the $$\log N$$ factor we can build nested bases to avoid generating and storing $${\underline{U}}_\sigma$$ and $${\underline{V}}_{\sigma '}$$ explicitly for all levels of the hierarchy. This can be done by expressing the p polynomial bases used in the approximation over a region $$\omega ^e_\sigma$$ at level l in terms of the approximating polynomials over the subregions of its children clusters in the cluster tree $${\mathscr {T}}$$. In practice, this allows us to generate and store the bases $${\underline{U}}_\sigma$$ and $${\underline{V}}_{\sigma '}$$ explicitly at the leaf level only, with small inter-level transfer matrices that allow the implicit generation of the bases at coarser levels, recursively. This hierarchical (nested) basis of the hierarchically partitioned matrix is called the $${\mathcal {H}}^2$$ representation and attains the optimal complexity [10]. While asymptotically optimal, the thus constructed matrix does not generally have optimal constants, as it uses a generic polynomial basis for the construction. As a result, we are able to further compress the matrix algebraically and reduce the ranks of the matrix blocks and the overall memory footprint of the matrix. The details of this algebraic compression process and a demonstration of its effectiveness are described in [12, 56].

### 5.4 Consistency

There are two approximation errors that need to be analyzed: (i) the local quadrature error for $${\mathcal {K}}_{\tau ,\tau '}(\psi _j,\psi _i)$$ when $$(\sigma ,\sigma ')\in {P^{\text {near}}}$$, and (ii) the local interpolation error for $$I_{\sigma ,\sigma '}^p$$ when $$(\sigma ,\sigma ')\in {P^{\text {far}}}$$. For $$U,V\in {\mathbb {V}}({\mathcal {T}})$$, we denote by $$\widetilde{{\mathcal {K}}}^{\text {near}}(U,V)$$ the bilinear form associated with the near-field part, i.e., the non-zero entries in (38), and $$\widetilde{{\mathcal {K}}}^{\text {far}}(U,V)$$ the far-field part defined in (39). So $${\mathcal {K}}(U,V)$$ is approximated by

\begin{aligned} {\mathcal {K}}(U,V) \approx {{\widetilde{K}}}(U,V):= \widetilde{{\mathcal {K}}}^{\text {near}}(U,V) + \widetilde{{\mathcal {K}}}^{\text {far}}(U,V). \end{aligned}

We can also similarly decompose $${\mathcal {K}}(.,.)$$ to the near-field part $${\mathcal {K}}^{\text {near}}(.,.)$$ and the far-field part $${\mathcal {K}}^{\text {far}}(.,.)$$ and we shall estimate their errors separately.

#### 5.4.1 Near-Field

In order to bound the error from the near-field part, we first note that the cardinality of $${P^{\text {near}}}$$ is uniformly bounded. Hence, the near-field entries in $$\widetilde{{\underline{K}}}$$ form a sparse matrix. In order to show this, we first note that thanks to the quasi-uniformity assumption on $${\mathcal {T}}$$, for each leaf cluster $$\sigma \in {\mathscr {T}}$$, $$\omega _\sigma$$ is regular and satisfies that $$|\omega _\sigma |\sim 2^{-l_{\max }}|{\varOmega ^{\text {int}}}|$$. For $$(\sigma ,\sigma ')\in {P^{\text {near}}}$$ and $$x\in \omega _{\sigma }^e$$ and $$y\in \omega _{\sigma '}^e$$, there holds

\begin{aligned} \begin{aligned} |x-y|&\le {{\,\textrm{dist}\,}}(\omega _{\sigma }^e,\omega _{\sigma '}^e) + {{\,\textrm{diam}\,}}(\omega _{\sigma }^e) + {{\,\textrm{diam}\,}}(\omega _{\sigma '}^e) \\&< \bigg (2+\frac{1}{\lambda _1}\bigg )\max \{{{\,\textrm{diam}\,}}(\omega _{\sigma }^e), {{\,\textrm{diam}\,}}(\omega _{\sigma '}^e)\} \\&\le C_1(2^{-l_{\max }/2} + h) =: r, \end{aligned} \end{aligned}
(43)

where for the second inequality above we used the fact that $$(\sigma ,\sigma ')$$ is non-admissible and where for the last inequality we applied the setting for the partition of the cluster tree so that for each leaf $$\sigma \in {\mathscr {T}}$$ there holds

\begin{aligned} {{\,\textrm{diam}\,}}(\omega _{\sigma }^e) \sim {{\,\textrm{diam}\,}}(\omega _{\sigma }) + 2h \sim |\omega _{\sigma }|^{1/2} + 2h \sim (2^{-l_{\max }} |{\varOmega ^{\text {int}}}|)^{1/2} + 2h. \end{aligned}
(44)

The estimate (43) implies that given an index i and a cluster $$\sigma$$ so that $${\mathcal {S}}_i\subset \omega _{\sigma }^e$$, the union of the near-field leaves $$\sigma$$, namely $$(\sigma ,\sigma ')\in {P^{\text {near}}}$$, is covered by $$\varSigma = \cup _{x\in \omega _{\sigma }^e}B_r(x)$$. Since $$|\varSigma |\lesssim r^2 + |\omega _{\sigma }^e|\lesssim 2^{-l_{\max }} + h^2$$, we then utilize (44) to derive that

\begin{aligned} \#\{\sigma ': (\sigma ,\sigma ')\in {P^{\text {near}}}\} \lesssim \frac{|\varSigma |}{|\omega _{\sigma }^e|}\lesssim \frac{2^{-l_{\max }} + h^2}{|\omega _{\sigma }^e|} \lesssim 1, \end{aligned}

where we note that the above hidden constant depends on $$|{\varOmega ^{\text {int}}}|$$, $$\lambda _1$$ as well as the quasi-uniformity constant.

Let $$Q^n_K(\psi _j,\psi _i)$$ denote the approximation of $$K_{\tau ,\tau '}(\psi _j,\psi _i)$$ in Sect. 5.2. Following the argument from Sect. 4.1 (see also [45, Lemma 5.3.20 & Theorem 5.3.24]), we have that when h is sufficiently small,

\begin{aligned} |{\mathcal {K}}_{\tau ,\tau '}(\psi _j,\psi _i)-Q_{K}^n(\psi _j,\psi _i)| \lesssim h^{2-2{{\overline{s}}}}(2\rho )^{-2n}, \end{aligned}
(45)

for some $$\rho \in (\tfrac{1}{2}, 1)$$. Using the argument from Proposition 1 and applying the local error estimate (45), we can show the consistency of the near-field part by

\begin{aligned} |{\mathcal {K}}^{\text {near}}(U,V)-\widetilde{{\mathcal {K}}}^{\text {near}}(U,V)| \lesssim h^{-2{{\overline{s}}}}(2\rho )^{-2n} \Vert U\Vert _{L^2({\varOmega ^{\text {int}}})} \Vert V\Vert _{L^2({\varOmega ^{\text {int}}})}. \end{aligned}

#### 5.4.2 Far-Field

For the consistency error from the far-field part, we first note that the error estimate for the kernel interpolation $$I^p_{\sigma ,\sigma '}$$ [10, Theorem 4.22 & Remark 4.23] (see also [32, Lemma 5.1])

\begin{aligned} \begin{aligned} \Vert \gamma (x,y) - I^p_{\sigma ,\sigma '}\gamma (x,y)\Vert _{L^\infty (\tau \times \tau ')}&\lesssim c_1^p\Vert \gamma (x,y)\Vert _{L^\infty (\omega _{\sigma }^e\times \omega _{\sigma '}^e)}\\&\lesssim \frac{c_1^p}{{{\,\textrm{dist}\,}}(\omega _{\sigma }^e,\omega _{\sigma '}^e)^{2+2{{\overline{s}}}}}. \end{aligned} \end{aligned}
(46)

where $$c_1=\min \{\tfrac{c_0\lambda _1}{c_0\lambda _1+1},\tfrac{c_0\lambda _1}{2}\}$$ for some $$c_0>0$$ and where the hidden constant depends only on p and d. We follow the proof of [45, Theorem 7.3.18] to obtain that

\begin{aligned} |{\mathcal {K}}^{\text {far}}(U,V)-\widetilde{{\mathcal {K}}}^{\text {far}}(U,V)| \lesssim c_1^p h^{-2-2{{\overline{s}}}} \Vert U\Vert _{L^2({\varOmega ^{\text {int}}})} \Vert V\Vert _{L^2({\varOmega ^{\text {int}}})}. \end{aligned}
(47)

#### 5.4.3 Overall Error

Gathering (45) and (47) gives the consistency error for the $${\mathcal {H}}$$-matrix approximation.

### Proposition 2

(consistency for the $${\mathcal {H}}^2$$-approximation) For $$U,V\in {\mathbb {V}}({\mathcal {T}})$$, let $${{\widetilde{K}}}(U,V)$$ be the approximation of $${\mathcal {K}}(U,V)$$ by the $${\mathcal {H}}^2$$-approximation using Chebyshev polynomials with order p together with the admissibility condition (37). A tensor-product Gaussian quadrature rule with order n is used to compute the near-field part of the $${\mathcal {H}}$$-matrix; see (39). Then there exist constants $$\rho \in (\tfrac{1}{2}, 1]$$ and $$c_1\in (0,1)$$ so that

\begin{aligned} |{\mathcal {K}}(U,V)-\widetilde{{\mathcal {K}}}(U,V)| \lesssim h^{-2{{\overline{s}}}} (c_1^p h^{-2} + (2\rho )^{-2n}) \Vert U\Vert _{L^2({\varOmega ^{\text {int}}})} \Vert V\Vert _{L^2({\varOmega ^{\text {int}}})}. \end{aligned}

### Remark 4

(general settings for s(x) and a(xy)) The consistency result above relies on the analyticity of s(x). For the case where there are jumps across element edges, the strategy for showing exponential convergence for the $${\mathcal {H}}^2$$-approximation versus polynomial degree will no longer hold since the $$I_{\sigma ,\sigma '}^p$$ is not well-defined. A more refined argument, following the approach in [10, Section 9.2 and Theorem 9.5], is needed to establish that the low rank approximation indeed allows the operator and solution errors to retain an exponential convergence with the rank/degree. The technical details are, however, beyond the scope of the present work.

## 6 Fast Multipole Acceleration for Computing the Weighted Mass Matrix

We finally consider the construction of the matrix approximation $${\underline{M}}$$ of the third term in (11). $${\underline{M}}$$ has the footprint of a mass matrix but is weighted by a global density function whose direct computation would require an expensive $$O(N^2)$$ computation.

### 6.1 Element Computations

For each cell $$\tau \in {\mathcal {T}}^{\text {int}}$$, we shall first compute the local contributions $${\underline{M}}^\tau$$ defined in (20) by the n-th order tensor-product quadrature scheme used in (38) but only for x. We use this scheme primarily for the convenience of having the same code and analysis as the previous sections. Letting $$i,j=1,2,3$$ be the indices of the local shape functions in $$\tau$$, the quadrature scheme leads to

\begin{aligned} {\underline{M}}^\tau (\tau )_{i,j} = \int _{\tau }\psi _j(x)\psi _i(x)\rho _{{\mathcal {T}}}(x)\,{{\textrm{d}}}x \approx \sum _{\ell =1}^{N_\tau } \psi _j(q_\ell )\psi _i(q_\ell )\rho _{{\mathcal {T}}}(q_\ell ) w_\ell \end{aligned}
(48)

with $$N_\tau$$ denoting the number of quadrature points and $${\mathcal {Q}}_\tau := \{q_\ell \}_{\ell =1}^{N_\tau }$$ and $${\mathcal {W}}_\tau := \{w_\ell \}_{\ell =1}^{N_\tau }$$ are the quadrature points and weights. Denote the collection of all the quadrature points for $$\tau \in {\mathcal {T}}^{\text {int}}$$ by $${\mathcal {Q}}^{\text {int}}$$, i.e.,

\begin{aligned} {\mathcal {Q}}^{\text {int}}:= \bigcup _{\tau \in {\mathcal {T}}^{\text {int}}} {\mathcal {Q}}_\tau . \end{aligned}

We similarly define $${\mathcal {W}}^{\text {int}}$$ for the quadrature weights. We first show that evaluation of $$\rho _{{\mathcal {T}}}(q)$$ for every $$q\in {\mathcal {Q}}^{\text {int}}$$ requires O(N) operations. Recalling the definition of $$\rho _{{\mathcal {T}}}(x)$$ from (18), we write the computation for $$q_i\in {\mathcal {Q}}^{\text {int}}$$ as

\begin{aligned} \rho _{{\mathcal {T}}}(q_i) = \sum _{\tau '\in {\mathcal {T}}} \int _{\tau '} \gamma _{{\mathcal {T}}}(q_i,y) \,{{\textrm{d}}}y. \end{aligned}

We shall again approximate the right-hand side above by quadrature. Denoting $${\mathcal {Q}}^{\text {ext}}$$ and $${\mathcal {W}}^{\text {ext}}$$ the set of quadrature points and weights for $$\int _{{\varOmega ^{\text {ext}}}}\gamma (q_i,y)\,{{\textrm{d}}}y$$, we set $${\mathcal {Q}} = {\mathcal {Q}}^{\text {int}}\cup {\mathcal {Q}}^{\text {ext}}$$ and $${\mathcal {W}} = {\mathcal {W}}^{\text {int}}\cup {\mathcal {W}}^{\text {ext}}$$. Using the quadrature scheme generated by $${\mathcal {Q}}$$ and $${\mathcal {W}}$$, we have

\begin{aligned} \rho _{{\mathcal {T}}}(q_i) \approx \sum _{q_j\in {\mathcal {Q}}}\gamma _{{\mathcal {T}}}(q_i,q_j) w_j. \end{aligned}
(49)

Assuming a suitable subdivision of $${\mathcal {T}}$$ so that the cardinality $$\#({\mathcal {Q}})$$ is O(N), the computation of the right-hand side above obviously requires O(N) operations.

In principle, we can use the kernel-independent fast multipole method [54] to accelerate the evaluations of $$\rho _{{\mathcal {T}}}(q_i)$$ for all $$q_i\in {\mathcal {Q}}^{\text {int}}$$. Here we consider (49) as a N-body problem by treating $${\mathcal {Q}}$$, $${\mathcal {Q}}^{\text {int}}$$ and $${\mathcal {W}}$$ as source points, target points and source densities, respectively. For the numerical simulation, one could use available fast multipole open source libraries such as exafmm [52], PVFMM [43], or PBBFMM3D [50]. Unfortunately, these libraries only support a kernel with constant order and constant diffusion coefficients. Our alternative solution is to interpret the general fast multipole method as a hierarchical matrix–vector product in the $${\mathcal {H}}^2$$ format [55] and again use an $${\mathcal {H}}^2$$-approximation as we describe below.

### 6.2 An $${\mathcal {H}}^2$$-Approximation for Density Evaluation

We introduce a collocation approach using an $${\mathcal {H}}^2$$-matrix to compute $$\rho _{{\mathcal {T}}}(q_i)$$ for all $$q_i\in {\mathcal {Q}}^{\text {int}}$$. To this end, denote the space $${\mathbb {V}}_s$$ the span of Dirac delta distributions for the source points $${\mathcal {Q}}$$, namely

\begin{aligned} {\mathbb {V}}_s:= \text {span}\{\delta _q: q\in {\mathcal {Q}}\}. \end{aligned}

Here $$\delta _q$$ denotes the Dirac delta distribution at q. We similarly define the space $${\mathbb {V}}_t$$ for the target points $${\mathcal {Q}}^{\text {int}}$$. We first consider the following rectangular matrix

\begin{aligned} ({\underline{K}}_\rho )_{ij} = \gamma _{{\mathcal {T}}}(q_i,q_j) = \int _\varOmega \int _\varOmega \gamma _{{\mathcal {T}}}(x,y)\delta _{q_i}(x)\delta _{q_j}(y)\,{{\textrm{d}}}y\,{{\textrm{d}}}x,\quad \text {for } q_i\in {\mathcal {Q}}^{\text {int}} \text { and } q_j\in {\mathcal {Q}}. \end{aligned}

Letting $${{\underline{\rho }}} = (\rho _{{\mathcal {T}}}(q_i))_{q_i\in {\mathcal {Q}}^{\text {int}}}^T$$ and $${\underline{w}}:= (w_j)_{q_j\in {\mathcal {Q}}}^T$$, we have

\begin{aligned} {{\underline{\rho }}} = {\underline{K}}_\rho {\underline{w}}. \end{aligned}
(50)

Therefore, in order to generate the necessary density values at all quadrature points we need to generate an $${\mathcal {H}}^2$$-approximation of $${\underline{K}}_\rho$$ and perform the multiplication in (50) efficiently. The $${\mathcal {H}}^2$$-approximation algorithm for $${\underline{K}}_\rho$$ is similar to the one presented in Sect. 5 for $${\underline{K}}$$ and starts by constructing two cluster trees $${\mathscr {T}}_t$$ and $${\mathscr {T}}_s$$ for the row index set for $${\mathbb {V}}_t$$ and the column index set for $${\mathbb {V}}_s$$, respectively. When building the $${\mathcal {H}}^2$$-matrix, we do not need to extend the bounding boxes for the clusters $$\sigma \in {\mathscr {T}}_t$$ and $$\sigma ' \in {\mathscr {T}}_s$$; consequently, we may define the interpolation operator $$I_{\sigma ,\sigma '}^p$$ for $$\gamma (x,y)$$ in $$\omega _{\sigma }\times \omega _{\sigma '}$$. The admissibility condition is given by

\begin{aligned} \max \{{{\,\textrm{diam}\,}}(\omega _{\sigma }), {{\,\textrm{diam}\,}}(\omega _{\sigma '})\} \le \lambda _2{{\,\textrm{dist}\,}}(\omega _{\sigma },\omega _{\sigma '}). \end{aligned}
(51)

for some fixed $$\lambda _2 >0$$. We similarly define $${P^{\text {far}}}$$ to be the collection of all the admissible blocks $$(\sigma ,\sigma ')$$ and define $${P^{\text {near}}}$$ to be the rest of the blocks. We shall further assume that $$\lambda _2$$ is small enough to guarantee that $${{\,\textrm{dist}\,}}(\omega _{\sigma },\omega _{\sigma '}) \ge 3h$$ so that $$\gamma _{{\mathcal {T}}}(x,y) = \gamma (x,y)$$, and is therefore smooth. Hence, the interpolation $$I_{\sigma ,\sigma '}^p \gamma _{{\mathcal {T}}}$$ makes sense for $$(\sigma ,\sigma ')\in {P^{\text {far}}}$$ and satisfies that

\begin{aligned} \Vert \gamma _{{\mathcal {T}}}(x,y) - I^p_{\sigma ,\sigma '}\gamma _{{\mathcal {T}}}(x,y)\Vert _{L^\infty (\omega _{\sigma }\times \omega _{\sigma '})} \lesssim c_2^p \Vert \gamma (x,y)\Vert _{L^\infty (\omega _{\sigma }\times \omega _{\sigma '})} \lesssim c_2^p h^{-2-2{{\overline{s}}}}, \end{aligned}
(52)

where $$c_2 = \min \{\tfrac{c_0\lambda _2}{c_0\lambda _2+1},\tfrac{c_0\lambda _2}{2}\}$$.

Once the matrix $${\underline{K}}_\rho$$ is constructed, the matrix–vector multiplication in (50) can be performed via standard multilevel methods for $${\mathcal {H}}^2$$ matrices which involve a pair of upward and downward passes over the basis trees and multiplication by the small low rank blocks at all levels of the hierarchy. The operation can be done in O(N) (cf. [13]).

### 6.3 Consistency

Given $$U,V\in {\mathbb {V}}({\mathcal {T}})$$, let us first denote the quadrature approximation of $${\mathcal {M}}(U,V)$$ by $${\mathcal {Q}}_{{\mathcal {M}}}^n(U,V)$$. We also denote by $${\mathcal {Q}}_{{\mathcal {M}}}^{n,p}(U,V)$$ the resulting bilinear form when $$\{\rho _{{\mathcal {T}}}(q_i)\}_{q_i\in {\mathcal {Q}}^{\text {int}}}$$ are approximated using the $${\mathcal {H}}^2$$ matrix–vector product. Here we recall that p is the degree of the Chebyshev polynomials. By the triangle inequality, the consistency error between $${\mathcal {M}}(U,V)$$ and $${\mathcal {Q}}_{{\mathcal {M}}}^{n,p}(U,V)$$ can be bounded with

\begin{aligned} |{\mathcal {M}}(U,V) - {\mathcal {Q}}_{{\mathcal {M}}}^{n,p}(U,V)| \le |{\mathcal {M}}(U,V) - {\mathcal {Q}}_{{\mathcal {M}}}^{n}(U,V)| +|{\mathcal {Q}}_{{\mathcal {M}}}^{n}(U,V) - {\mathcal {Q}}_{{\mathcal {M}}}^{n,p}(U,V)|\nonumber \\ \end{aligned}
(53)

We first estimate the quadrature error, namely the first error on the right-hand side above. Set $$e^{ij}_\tau$$ to be the error of the quadrature approximation in (48), following the argument in Sect. 4.1, we have that there exists a constant $$\rho \in (\tfrac{1}{2}, 1]$$ so that

\begin{aligned} |e^{ij}_\tau |\lesssim (2\rho )^{-2n} h^2\Vert \rho _{{\mathcal {T}}}\Vert _{L^\infty ({\mathbb {R}}^2)} \lesssim (2\rho )^{-2n} h^{2-2{{\overline{s}}}}. \end{aligned}

Here we note for the last inequality we used the fact that

\begin{aligned} \Vert \rho _{{\mathcal {T}}}\Vert _{L^\infty ({\mathbb {R}}^2)} \lesssim \int _{ B_{1}(0)\backslash B_{\varepsilon h}(0)} \frac{1}{|y|^{2+2{{\overline{s}}}}}\,{{\textrm{d}}}y + \int _{{\mathbb {R}}^2\backslash B_{1}(0)} \frac{1}{|y|^{2+2{\underline{s}}}}\,{{\textrm{d}}}y \lesssim h^{-2{{\overline{s}}}}, \end{aligned}

where $$B_{r}(0)$$ is a ball entered at origin with radius r and where $$\varepsilon >0$$ is sufficiently small. Using the above local error estimate, we again follow the same argument in Proposition 1 to derive that

\begin{aligned} |{\mathcal {M}}(U,V) - {\mathcal {Q}}_{{\mathcal {M}}}^{n}(U,V)| \lesssim (2\rho )^{-2n} h^{-2{{\overline{s}}}}. \end{aligned}
(54)

In order to estimate the error from the $${\mathcal {H}}^2$$-approximation for $$\rho _{{\mathcal {T}}}$$, we let $$\sigma \in {\mathscr {T}}_t$$ and $$\sigma '\in {\mathscr {T}}_s$$. For $$q_i\in {\mathcal {Q}}^{\text {int}}$$, let $${{\widetilde{\rho }}}_{{\mathcal {T}}}(q_i)$$ be the resulting approximation of $$\rho _{{\mathcal {T}}}(q_i)$$. We invoke the interpolation error estimate (52) as well as $$w_j\sim h^2$$ to bound the error

\begin{aligned} \begin{aligned} |{{\widetilde{\rho }}}_{{\mathcal {T}}}(q_i) - \rho _{{\mathcal {T}}}(q_i)|&\le \sum _{\begin{array}{c} \sigma \in {\mathscr {T}}_t \\ i\in \sigma \end{array}}\sum _{\begin{array}{c} \sigma '\in {\mathscr {T}}_s\\ (\sigma ,\sigma ')\in P^{\text {far}} \end{array}} \sum _{j\in \sigma '}|\gamma _{{\mathcal {T}}}(q_i,q_j) - I^p_{\sigma ,\sigma '}\gamma _{{\mathcal {T}}}(q_i,q_j)|w_j \\&\lesssim h^2 c_2^{p} \sum _{\begin{array}{c} \sigma \in {\mathscr {T}}_t \\ i\in \sigma \end{array}}\sum _{\begin{array}{c} \sigma '\in {\mathscr {T}}_s\\ (\sigma ,\sigma ')\in P^{\text {far}} \end{array}} \sum _{j\in \sigma '}\Vert \gamma _{{\mathcal {T}}}\Vert _{L^\infty (\omega _{\sigma }\times \omega _{\sigma '})} \lesssim c_2^p h^{-2-2{{\overline{s}}}}, \end{aligned} \end{aligned}

where for the last inequality we used the fact the number of far-field indices j is bounded by $$N\sim h^{-2}$$. This means that the $${\mathcal {H}}^2$$-approximation for $$\rho _{{\mathcal {T}}}(q_\ell )$$ leads to the quadrature formula in (48) perturbed by the error $$Cc_2^p h^{-2{{\overline{s}}}}$$. Thus we again apply the argument in Proposition 1 to obtain that

\begin{aligned} |{\mathcal {Q}}_{{\mathcal {M}}}^{n}(U,V) - {\mathcal {Q}}_{{\mathcal {M}}}^{n,p}(U,V)| \lesssim c_2^p h^{-2-2{{\overline{s}}}} \Vert U\Vert _{L^2({\varOmega ^{\text {int}}})} \Vert V\Vert _{L^2({\varOmega ^{\text {int}}})}. \end{aligned}
(55)

Gathering the errors (54) and (55) into (53), we conclude that

### Proposition 3

(consistency for $${\mathcal {M}}(U,V)$$) For $$U,V\in {\mathbb {V}}({\mathcal {T}})$$, let $${\mathcal {Q}}_{{\mathcal {M}}}^{n,p}(U,V)$$ be the resulting approximation of $${\mathcal {M}}(U,V)$$ by quadrature with order n as well as $${\mathcal {H}}^2$$-approximation for the quadrature points. Then there exists a constant $$\rho \in (\tfrac{1}{2},1]$$ and $$c_2\in (0,1)$$ so that

\begin{aligned} |{\mathcal {M}}(U,V) - {\mathcal {Q}}_{{\mathcal {M}}}^{n,p}(U,V)| \lesssim ((2\rho )^{-2n} + c_2^p h^{-2}) h^{-2{{\overline{s}}}} \Vert U\Vert _{L^2({\varOmega ^{\text {int}}})} \Vert V\Vert _{L^2({\varOmega ^{\text {int}}})}. \end{aligned}

### 6.4 Overall Consistency

We conclude this section with the following theorem by combining the consistency error estimates from Propositions 1 to 3. For simplicity, we will use the same n-th order tensor-product Gaussian quadrature to approximate the integral and the same polynomial degree p for the $${\mathcal {H}}$$-matrices in Sects. 5 and 6.

### Theorem 1

(total error) For $$U,V\in {\mathbb {V}}({\mathcal {T}})$$, define the final approximation of $${\mathcal {A}}(U,V)$$ by

\begin{aligned} {\mathcal {A}}_h(U,V):= {\mathcal {Q}}_{{\mathcal {B}}}^n(U,V) +{{\widetilde{K}}}(U,V) + {\mathcal {Q}}_{{\mathcal {M}}}^{n,p}(U,V) \end{aligned}

where the bilinear forms on right-hand side are defined in Propositions 1 to 3, respectively. Then there holds that

\begin{aligned} |{\mathcal {A}}(U,V) - {\mathcal {A}}_h(U,V)| \lesssim h^{-2{{\overline{s}}}}((2\rho )^{-2n} + h^{-2} c_1^p + h^{-2} c_2^p) \Vert U\Vert _{L^2({\varOmega ^{\text {int}}})} \Vert V\Vert _{L^2({\varOmega ^{\text {int}}})}. \end{aligned}

## 7 Numerical Illustrations

In this section, we present numerical examples to illustrate the performance of our proposed finite element algorithm. In particular, we report the decay of $$L^2({\varOmega ^{\text {int}}})$$ errors with respect to a sequence of the quasi-uniform meshes as the mesh size h is systematically reduced, and the increase in computational cost as N increases to verify the linear complexity of the algorithm. Our numerical implementation is based on the Deal.II (version 9.4) finite element library [7] which supports simplex meshes and the H2Opus library [56] for hierarchical matrices. We use the TimerOutput class in deal.II to record the computation time. Solutions are obtained by a conjugate gradient solver. No attempt was made to fine tune algorithmic parameters, nor to parallelize or optimize the code, which was executed on a single core of a standard-issue laptop computer.

In the construction of the hierarchical matrix approximations for constructing $${\underline{K}}$$ and $${\underline{M}}$$, we use the slightly more convenient geometric admissibility condition $$\lambda \Vert C_\sigma - C_{\sigma '} \Vert \ge (D_{\sigma } + D_{\sigma '})/2$$ where C and D refer to the center and diameter of the bounding box ($$\omega$$ or $$\omega ^e$$) used for a cluster. We use $$\lambda =0.75$$, leaf size $$m=128$$, and approximate the kernel function $$\gamma _{{\mathcal {T}}}$$ using degree 10 Legendre polynomials. This guarantees that the $${\mathcal {H}}$$-matrix approximation does not dominate the total approximation error. We use 3-point Gaussian quadrature (namely the quadrature order $$n=1$$) to compute element integrals when assembling $${\underline{K}}$$ and $${\underline{M}}$$. We also point out that since H2Opus currently supports square matrices only, we expand the target space $${\mathbb {V}}_t$$ to $${\mathbb {V}}_s$$ for the quadrature evaluations by the $${\mathcal {H}}^2$$-approximation mentioned in Sect. 6.2, but only use the subset of values of the matrix–vector product that correspond to interior quadrature points.

### 7.1 Tests for the Integral Fractional Laplacian

We first consider a classical fractional diffusion problem involving the integral fractional Laplacian, namely $$a(x,y)\equiv 1$$, the order function $$s(x) \equiv s$$ is a constant in (0, 1), and the exterior domain $${\varOmega ^{\text {ext}}}={\mathbb {R}}^2\backslash {\varOmega ^{\text {int}}}$$. So the solution u satisfies $$u\in {{\widetilde{H}}}^s({\varOmega ^{\text {int}}})$$ and

\begin{aligned} \int _{{\mathbb {R}}^2}\int _{{\mathbb {R}}^2} \frac{({{\widetilde{u}}}(x)-{{\widetilde{u}}}(y))({{\widetilde{v}}}(x) - {{\widetilde{v}}}(y))}{|x-y|^{2+2s}} \,{{\textrm{d}}}y\,{{\textrm{d}}}x = \int _{\varOmega ^{\text {int}}}f v\,{{\textrm{d}}}x, \quad \text {for all } v\in {{\widetilde{H}}}^s({\varOmega ^{\text {int}}}),\nonumber \\ \end{aligned}
(56)

where $$\widetilde{.}$$ denotes the zero extension from $${\varOmega ^{\text {int}}}$$ to $${\mathbb {R}}^2$$ and the fractional Sobolev space

\begin{aligned} {{\widetilde{H}}}^s({\varOmega ^{\text {int}}}):= \{v\in L^2({\varOmega ^{\text {int}}}): {{\widetilde{v}}}\in H^s({\mathbb {R}}^2)\}. \end{aligned}

#### 7.1.1 An Extra Step

One bottleneck in generating a linear finite approximation for the above problem is to deal with the integral on the unbounded domain $${\mathbb {R}}^2\backslash {\varOmega ^{\text {int}}}$$. Here we borrow the assembling strategy from [1] and briefly introduce the implementation below. We set an auxiliary triangulation $${\mathcal {T}}_B$$ for a ball $$\varOmega _B$$ centered at the origin with radius R and containing the triangulation of $${\varOmega ^{\text {int}}}$$. We set R large enough so that the distance between $${\varOmega ^{\text {int}}}$$ and $$\partial \varOmega _B$$ is strictly positive. This guarantees that the patch $${\mathcal {S}}_\tau$$ for each cell $$\tau$$ in $${\varOmega ^{\text {int}}}$$ is contained in B. Whence, we follow Sects. 4 and 5 exactly to assemble $${\underline{B}}$$ and $${\underline{K}}$$. To compute $${\underline{M}}$$, according to (18), we can split the discrete bilinear form $${\mathcal {M}}(.,.)$$ as

\begin{aligned} \begin{aligned} {\mathcal {M}}(U,V)&= 2\sum _{\tau \in {\mathcal {T}}^{\text {int}}}\int _\tau U(x)V(x)\bigg (\int _{\varOmega _B\backslash {\mathcal {S}} _\tau } \gamma (x,y)\,{{\textrm{d}}}y\bigg ) \,{{\textrm{d}}}x \\&\quad + 2\sum _{\tau \in {\mathcal {T}}^{\text {int}}}\int _\tau U(x)V(x)\bigg (\underbrace{\int _{{\mathbb {R}}^2\backslash \varOmega _B} \gamma (x,y)\,{{\textrm{d}}}y}_{:=\rho _B(x)}\bigg ) \,{{\textrm{d}}}x. \end{aligned} \end{aligned}

Denote $${\underline{M}}^\text {in}$$ and $${\underline{M}}^{\text {out}}$$ the associated weighted mass matrices for the two bilinear forms on the right-hand side of the equation above. We apply the fast multipole approximation technique of Sect. 6 to $${\underline{M}}^{\text {in}}$$. For $${\underline{M}}^{\text {out}}$$, we use the fact that $$\rho _B$$ is radial in $${\mathbb {R}}^2\backslash \varOmega _B$$ and thus rewrite $$\rho _B$$ in polar coordinates with (cf. [1, Section A.5])

\begin{aligned} \rho _B(x) = \frac{1}{2s} \int _0^{2\pi } \frac{1}{t(\theta ,x)^{2s}} \,{{\textrm{d}}}\theta , \end{aligned}
(57)

where

\begin{aligned} t(\theta ,x):= \sqrt{\mu ^2 + R^2 - |x|^2} - \mu , \quad \text {with } \mu =x_1\cos \theta + x_2\sin \theta . \end{aligned}

Thus when assembling $${\underline{M}}^{\text {out}}$$ by using quadrature formulas for each $$\tau \in {\varOmega ^{\text {int}}}$$, we evaluate $$\rho _B$$ at each quadrature point by approximating the integral in (57) by numerical integration. Here we use a 9-point Gaussian quadrature formula. The stiffness matrix corresponding to the weak problem (56) is now decomposed into four sub-matrices, i.e.,

\begin{aligned} {\underline{A}} = {\underline{B}} + {\underline{K}} + {\underline{M}}^{\text {in}} + {\underline{M}}^{\text {out}}. \end{aligned}

#### 7.1.2 Simulation and Results

We set $${\varOmega ^{\text {int}}}$$ to be the unit ball. For the auxiliary ball $$\varOmega _B$$, we set its radius $$R=1.1$$. We shall test the convergence of the finite element approximation by using the well-known analytic solution

\begin{aligned} u(x) = \frac{2^{-2s}}{\varGamma (1+s)\varGamma (1+s)} (1-|x|^2)^s,\quad \text {in }{\varOmega ^{\text {int}}}\end{aligned}

so that $$f=1$$ in $${\varOmega ^{\text {int}}}$$. Starting from a coarse grid $${\mathcal {T}}_1$$ for $$\varOmega _B$$, we generate a sequence for meshes $$\{{\mathcal {T}}_j\}_{j=1}^6$$ by refining the mesh globally. We set $${{\overline{s}}} = 0.9$$ for the computation of the sub-matrix $${\underline{B}}$$.

The left panel of Fig. 4 reports the $$L^2({\varOmega ^{\text {int}}})$$-error between u and its finite element approximation $$U_j$$ against the number of degrees of freedom when $$s=0.7$$. The slope of the log-log error plot implies that $$U_j$$ converges to u in the first order, which is the optimal rate that can be reached and is limited only by the reduced regularity of the solution itself, which has singular derivatives at the boundary. The right panel of Fig. 4 reports the CPU time for assembling the sub-matrices $${\underline{B}}$$, $${\underline{K}}$$ and $${\underline{M}}^{\text {in}}$$, respectively. As the number of degrees of freedom increases, we observe a linear complexity for all three assembly routines.

Figure 5 depicts the tree structure of the $${\mathcal {H}}$$-matrix for $${\underline{K}}$$ (left panel). Here the blocks in red are computed directly while the blocks in green are approximated by the low-rank matrices based on the Lagrange interpolation. The right plot of Fig. 5 illustrates the tree structure of the $${\mathcal {H}}$$-matrix that is used to compute the density function $$\rho _{{\mathcal {T}}}$$; see Sect. 6 for details. Note that the matrix on the right is larger than the one on the left because it includes all interior and exterior degrees of freedom and we are using a 3-point Gaussian quadrature for the density function $$\rho _{{\mathcal {T}}}$$ in (18).

### 7.2 Tests for Variable Order

Next we test our algorithm with variable-order FDEs. We let constant $$s^*\in (\tfrac{1}{2},1)$$ be the so-called background order in $$\varOmega$$. This means that the variable order function $$s(x) = s^*$$ in $${\varOmega ^{\text {ext}}}$$. In $${\varOmega ^{\text {int}}}$$, we consider a tensor product bump function that is supported on a square in $${\varOmega ^{\text {int}}}$$ and centered at the point $$(x^c_1,x^c_2)$$ with the size $$\ell$$. Specifically, we let

\begin{aligned} s(x) = s^* + \eta \times \text {bump}(x_1-x_1^c, \ell )\times \text {bump}(x_2-x_2^c;\ell ), \end{aligned}
(58)

where

\begin{aligned} \text {bump}(x;\ell ) = \left\{ \begin{aligned} \exp \left( -\frac{1}{1-r^2}\right) ,\, r=\frac{2x}{\ell },&\quad |r|<1, \\ 0,&\quad |r|\ge 1 \\ \end{aligned} \right. \end{aligned}

and $$\eta$$ is a fixed constant satisfying that $$s(x)\in [\tfrac{1}{2},1)$$. For simplicity, we fix the diffusion coefficient $$a(x,y)\equiv 1$$.

Our computational domain $$\varOmega$$ is set to be a square $$(-2,2)^2$$ and the interior domain $${\varOmega ^{\text {int}}}$$ is set to be $$(-1,1)^2$$. We construct a sequence of uniform grids $$\{{\mathcal {T}}_j\}_{j=1}^6$$ with the mesh size $$h_j = \sqrt{2}/j^{j+1}$$ generated by globally refining the coarse grid $${\mathcal {T}}_1$$; see the coarsest grid $${\mathcal {T}}_1$$ in Fig. 6.

### Remark 5

We note that the subdivision in $${\varOmega ^{\text {ext}}}$$ mainly contributes to the assembly of matrix $${\underline{M}}$$ by computing the density function $$\rho _{{\mathcal {T}}}(x)$$ as in (18). By utilizing the decay property of $$\gamma (x,y)$$, graded meshes in $${\varOmega ^{\text {ext}}}$$ can be used in order to reduce the computational cost. This strategy will be explored in future work. Here we rely on quasi-uniform meshes in $${\varOmega ^{\text {ext}}}$$ to guarantee that the error from $$\rho _{{\mathcal {T}}}$$ does not affect the total error.

For spatially varying fractional order, we do not have analytic solutions to examine the rate of convergence of our numerical scheme. As an alternative, we perform a comparison test by computing the difference between the finite element solution $$U_j$$ on $${\mathcal {T}}_j$$ and a finite difference approximation $${{\widetilde{U}}}_j$$ developed in [5] using a cartesian grid $$\widetilde{{\mathcal {T}}}_j$$ with the same node locations. We also perform self-convergence tests.

The left plot of Fig. 7 reports the $$L^2({\varOmega ^{\text {int}}})$$-error decay between $$U_j$$ and $${{\widetilde{U}}}_j$$ for $$f\equiv 20$$ and for the variable order function s(x) in (58) with $$s^* = 0.7$$, $$\eta = 0.2$$, $$\ell =1.0$$, and $$(x_c,y_c) = (-0.4,0.4)$$; see Fig. 6. We also report the self-convergence of $$U_j$$, namely the error $$e_{u,j} = U_{j+1}-U_j$$, in both $$L^2({\varOmega ^{\text {int}}})$$ and $$L^\infty ({\varOmega ^{\text {int}}})$$ norms, by estimating the error on each grid by using the next finer grid as the reference solution. It is seen that all three errors exhibit a first order decay. Such error behavior is similar to the constant-order integral fractional Laplacian case when $$s \ge 1/2$$, and might be expected here since $${\underline{s}} \ge 1/2$$. We also observe the singular behavior of the solution at $$\partial {\varOmega ^{\text {int}}}$$; see the approximate solution along $$x_1=-0.4$$ in the right plot of Fig. 8. The right panel of Fig. 7 shows the performance of the assembly routines for matrices $${\underline{B}}$$, $${\underline{K}}$$ and $${\underline{M}}$$. We again observe a linear complexity in time for each assembly procedure, though the time for $${\underline{M}}$$ is relatively large due the quasi-uniform triangulation of $$\varOmega$$ as mentioned in Remark 5.

Figure 8 shows the finite element approximation $$U_6$$ with 16,129 degrees of freedom in $${\varOmega ^{\text {int}}}$$ (131,072 cells total in $$\varOmega$$). In the left panel we see a faster diffusion rate in the variable-order region $$[-0.9,0.1]\times [-0.1,0.9]$$. This diffusive behavior of the solution can also be observed along $$x_1=-0.4$$ shown in the right panel of Fig. 8.

In Fig. 9, we test the same problem using the same parameters except that the coefficient bump function is negative, namely $$\eta =-0.2$$. As shown in the left panel, we again obtain first-order convergence in $$L^2({\varOmega ^{\text {int}}})$$ when comparing against the solution obtained from the finite difference method. In the right panel, we instead observe a less diffusive behavior of the solution in the bump region. Figure 10 displays the estimated condition number of the system (computed from the CG iterates) against the number of degrees of freedom. We observe that the condition number is nearly $$O(h^{-2\,s^*})$$, where $$s^*$$ denotes the background order.

The first-order convergence rate O(h) observed in the examples above is primarily due to the lack of regularity of the solution itself and is essentially the best that can be obtained for these solutions because of the singular derivatives at the boundary. One may wonder whether the proposed linear finite element discretization can produce second-order convergence when the solution has sufficient regularity.

To verify higher-order convergence, and since we cannot readily manufacture a solution analytically for variable order problems, we consider the following numerical alternative. We start from a given smooth solution and apply the forward operator discretized by a finite difference scheme to obtain a right-hand side. We then use this right-hand side in the finite element solver to recover the given solution.

Specifically, we seek the solution u with right-hand side data $$F_j = I_j A_{\text {FD},j}u$$, where $$A_{\text {FD},j}u$$ is the discrete right hand side data produced by the finite difference method [5] using the cartesian grid $$\widetilde{{\mathcal {T}}}_j$$ and $$I_j$$ is the Lagrange nodal interplant on $${\mathbb {V}}({\mathcal {T}}_j)$$. We fix u to be the smooth function $$u(x,y) = (1-x^2)^{15}(1-y^2)^{15}$$ and set the same variable order function s(x) in the previous variable order test. Here we note that we choose the power 15 so that u is smooth enough to mitigate the effect of consistency errors in the right-hand side introduced by the finite difference discretization (i.e., the approximation error of $$I_j A_{\text {FD}, j}$$). Figure 11 shows the results of this test. The right panel of Fig. 11 shows the convergence of the right hand side data as the mesh is refined. While in general only a first-order rate of convergence for $$r_j:=F_{j+1}-F_j$$ in $$L^2({\varOmega ^{\text {int}}})$$-norm is expected due to the singularity at the boundary, we obtain here second-order convergence since the solution is sufficiently smooth. The left plot of Fig. 11 shows that the proposed finite element approximation can indeed obtain second-order convergence when solution regularity allows it.

## 8 Conclusions

We presented an asymptotically optimal finite element methodology for modeling non-local fractional diffusion operators with spatial variation in fractional order and material coefficients and in general geometries. In the finite-element formulation, triangle pairs in the spatial mesh are the basic units that generate elemental stiffness matrices to be assembled into a global stiffness matrix. We address the singularities in evaluating touching triangle pairs through specialized mapping and quadrature schemes designed to handle the variable-order case. The computational complexity due to the quadratic number of interacting triangle pairs is overcome through (i) the construction of the hierarchical matrix approximations for representing the effect of every interior node on all other ones, and (ii) a generalized variable-order fast multipole method that computes the cumulative effect of all triangles on every interior node. The overall complexity for building the complete discrete operator is optimal, O(N), both in memory and in operations. We show the consistency of the method and the ability to control its accuracy through the number of quadrature points in the direct integrations and the degree of the polynomial interpolant of the kernel in the far field. Numerical experiments verify the accuracy and complexity of the methods proposed. The techniques are general and should apply to a broader class of nonlocal kernels.

There are a number of directions we are pursing in further work. First, the current work focused on the construction and application of the discrete operator. We are developing scalable preconditioners to allow the iterative solution of general fractional diffusion problems and are pursuing a geometric multilevel strategy that would allow end-to-end solutions in linear complexity. Second, the computations involved in building and applying the operator have high arithmetic intensity and substantial concurrency. We are developing performant implementations that are GPU-accelerated and can take advantage of multiple cores in a node and multiple distributed-memory nodes. This should produce dramatic improvements in absolute efficiency and make it quite practical to work with fractional operators in various application domains. We are also interested in extending the present finite element development to the case of anisotropic variability in fractional order and coefficients as well to three-dimensional geometry, a setting that has not yet been addressed sufficiently in the literature and where graded meshes will be particularly important. Finally, we are working on incorporating the forward solver in the inner loop of an inverse problem that seeks to recover the spatial distribution of fractional order and coefficients. In this context, the flexibility and linear complexity in memory and operations of the forward solver are essential from a practical point of view. We plan to report on these developments elsewhere.