Fast Barycentric-Based Evaluation Over Spectral/hp Elements

As the use of spectral/hp element methods, and high-order finite element methods in general, continues to spread, community efforts to create efficient, optimized algorithms associated with fundamental high-order operations have grown. Core tasks such as solution expansion evaluation at quadrature points, stiffness and mass matrix generation, and matrix assembly have received tremendous attention. With the expansion of the types of problems to which high-order methods are applied, and correspondingly the growth in types of numerical tasks accomplished through high-order methods, the number and types of these core operations broaden. This work focuses on solution expansion evaluation at arbitrary points within an element. This operation is core to many postprocessing applications such as evaluation of streamlines and pathlines, as well as to field projection techniques such as mortaring. We expand barycentric interpolation techniques developed on an interval to 2D (triangles and quadrilaterals) and 3D (tetrahedra, prisms, pyramids, and hexahedra) spectral/hp element methods. We provide efficient algorithms for their implementations, and demonstrate their effectiveness using the spectral/hp element library Nektar++ by running a series of baseline evaluations against the ‘standard’ Lagrangian method, where an interpolation matrix is generated and matrix-multiplication applied to evaluate a point at a given location. We present results from a rigorous series of benchmarking tests for a variety of element shapes, polynomial orders and dimensions. We show that when the point of interest is to be repeatedly evaluated, the barycentric method performs at worst 50% slower, when compared to a cached matrix evaluation.
However, when the point of interest changes repeatedly so that the interpolation matrix must be regenerated in the ‘standard’ approach, the barycentric method yields far greater performance, with a minimum speedup factor of 7×. Furthermore, when derivatives of the solution evaluation are also required, the barycentric method in general slightly outperforms the cached interpolation matrix method across all elements and orders, with an up to 30% speedup. Finally, we investigate a real-world example of scalar transport using a non-conformal discontinuous Galerkin simulation, in which we observe around 6× speedup in computational time for the barycentric method compared to the matrix-based approach.
We also explore the complexity of both interpolation methods and show that the barycentric interpolation method requires O(k) storage compared to a best-case space complexity of O(k²) for the Lagrangian interpolation matrix method.


Introduction
The application of high-order numerical methods continues to expand, now spanning aeronautics (e.g., [13]) to biomedical engineering (e.g., [3]), in large part due to two factors: the numerical accuracy, combined with low dissipation and dispersion errors, that they can obtain for certain problem classes [7], and the computational efficiency they can achieve when balancing approximation power against computational density [14]. Most high-order finite element (also called spectral/hp element) simulation codes mimic their traditional finite element counterparts in terms of software organization, such as the development of local operators that are then "assembled" (either in the strict continuous Galerkin sense or in the weak discontinuous Galerkin sense) into a global system that is advanced in some way [5,6,10]. Tremendous effort has been expended to create optimized elemental operations that evaluate, for instance, the solution expansions at the quadrature points, which allows for rapid evaluation of the solution at the points of integration needed for computing the stiffness matrix entries, the forcing terms, and other desirable quantities.
However, in the case of history points (positions in the field at which one wants to track a particular quantity of interest over time) [16], pathlines/streamlines [17], isosurface evaluation [9], or refinement and mortaring [12], evaluation of solution expansions at arbitrary point locations is required. From the perspective of point evaluation over the entire domain, these operations require two phases: given a point in the domain, first finding the element in which that point resides, and then rapidly evaluating the solution expansion on that individual element. Optimization strategies such as octrees and R-trees have been implemented to accelerate the first of these two tasks (e.g., [8]); however, a concerted effort has not been placed on generalized elemental operations such as arbitrary point evaluation.
The purpose of this work is to address the need for efficient arbitrary point evaluation in high-order (spectral/hp) element methods. We are building upon the barycentric polynomial interpolation work of Berrut and Trefethen [1], which has been shown to have a myriad of applications within the polynomial approximation world [11]. In this work, we mathematically generalize barycentric polynomial interpolation to arbitrary simplicial elements, and then demonstrate the effectiveness of this approach on the canonical high-order finite element shapes in 2D (triangles and quadrilaterals) and 3D (tetrahedra, prisms, pyramids, and hexahedra). The operational efficiency range of our work is designed for polynomial degrees often run within the spectral/hp community, from polynomial degree 2 to 10. The algorithms presented herein are also implemented in the publicly available high-order finite element library Nektar++ [2,15].
The paper proceeds as follows: In Sects. 2 and 3, we lay out the mathematical details of our work. In Sect. 2, we highlight the one-dimensional building blocks of barycentric interpolation and provide the mathematical framework for considering interpolation over finite element shapes generated through successive application of Duffy transformations of an orthotope element, and then in Sect. 3 we provide details on expanding these ideas to both tensor-product constructed 2D and 3D elements as used in Nektar++. In Sect. 5, we present both algorithmic and implementation details. We provide details that facilitate reproducibility of our results as well as complexity and storage analysis of the proposed strategy. In Sect. 6, we present various test cases that demonstrate the efficacy and efficiency of our proposed work, as well as a capstone example that highlights a real-world application of our strategy. We summarize our contributions and conclude with possible future work in Sect. 7.

Univariate Barycentric Interpolation
Berrut and Trefethen [1] propose barycentric Lagrange interpolation for stable, efficient evaluation of polynomial interpolants. With η ∈ [−1, 1] the independent variable, let {z_j}_{j=0}^{k} ⊂ [−1, 1] denote k + 1 unique nodes. Defining Q_k as the space of polynomials of degree k or less in one variable, any p ∈ Q_k can be represented in its barycentric form

$$p(\eta) = \frac{\displaystyle\sum_{j=0}^{k} \frac{w_j}{\eta - z_j}\, p_j}{\displaystyle\sum_{j=0}^{k} \frac{w_j}{\eta - z_j}}, \qquad (1)$$

where the weights w_j are given by

$$w_j = \frac{1}{\prod_{0 \le m \le k,\ m \ne j} (z_j - z_m)}. \qquad (2)$$

The weights are independent of the polynomial p, and depend only on the nodal configuration. The barycentric form (1) reveals that, given the values {p_j}_{j=0}^{k} of a polynomial, evaluation of p at an arbitrary location η can be accomplished without solving linear systems or evaluating cardinal Lagrange interpolants.
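As a concrete illustration, the weight computation (2) and the evaluation (1) can be sketched in a few lines of Python. This is an expository sketch only, not the C++ implementation in Nektar++:

```python
import numpy as np

def bary_weights(z):
    """Barycentric weights w_j = 1 / prod_{m != j} (z_j - z_m), cf. (2)."""
    w = np.ones(len(z))
    for j in range(len(z)):
        for m in range(len(z)):
            if m != j:
                w[j] /= z[j] - z[m]
    return w

def bary_eval(z, w, p, eta):
    """Evaluate the interpolant at eta via the barycentric form (1)."""
    d = eta - z
    if np.any(d == 0.0):                    # eta coincides with a node
        return p[int(np.argmin(np.abs(d)))]
    t = w / d                               # O(k) work once weights are stored
    return np.dot(t, p) / np.sum(t)

# Four nodes reproduce any cubic exactly, e.g. p(x) = x^3:
z = np.array([-1.0, -1/3, 1/3, 1.0])
w = bary_weights(z)
p = z**3
print(bary_eval(z, w, p, 0.5))              # 0.125 = 0.5^3
```

Note that the weight computation is a one-time O(k²) setup, after which every evaluation is O(k).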
In this paper, we focus on the algorithmic advantages of using the barycentric form, with proper extensions to some standard multivariate non-tensorial domains that are popular in high-order (spectral/hp) finite element methods. These algorithmic advantages stem from the univariate ones: the map η → p(η) using formula (1) requires O(k) arithmetic operations, whereas the same map using standard linear expansions frequently requires O(k²) operations.
Throughout our discussion, we consider the nodes z_j, and subsequently, through (2), the barycentric weights w_j, as given and fixed. To emphasize the O(k) complexity of the operations considered in the following section, we define

$$S_r(v, \eta) := \sum_{j=0}^{k} \frac{w_j\, v_j}{(\eta - z_j)^r}, \qquad (3)$$

which, given r and v, ostensibly requires only O(k) complexity to evaluate η → S_r(v, η). Note in particular that S_r(v, η) is linear in v and that

$$\frac{\mathrm{d}}{\mathrm{d}\eta}\, S_r(v, \eta) = -r\, S_{r+1}(v, \eta). \qquad (4)$$

We can write (1) as

$$p(\eta) = \frac{S_1(p, \eta)}{S_1(\mathbb{1}, \eta)}, \qquad (5)$$

where 𝟙 denotes the vector of ones, which clearly demonstrates the O(k) complexity if p is furnished.

Alternatives to Barycentric Form
Consider a linear expansion representation of p ∈ Q_k, written in terms of either an (arbitrary) basis {φ_j}_{j=0}^{k} of Q_k or the cardinal Lagrange functions of the interpolation points {z_j}_{j=0}^{k}:

$$p(\eta) = \sum_{j=0}^{k} c_j\, \phi_j(\eta), \qquad (6)$$

where the φ_j basis functions can be, e.g., cardinal Lagrange interpolants, monomials, or orthogonal polynomials:

$$\phi_j(\eta) = \ell_j(\eta) \ \ \text{(cardinal Lagrange)}, \qquad \phi_j(\eta) = \eta^j \ \ \text{(monomial)}, \qquad \phi_j(\eta) = \psi_j(\eta) \ \ \text{(orthogonal)}, \qquad (7a\text{--}c)$$

with ψ_j any orthogonal polynomial family, such as Legendre or Chebyshev polynomials. Our main focus is the computational complexity of evaluating a polynomial interpolant given the data values {p_j}_{j=0}^{k}. Direct evaluation of the monomial (7b) or orthogonal polynomial (7c) (k+1)-term expansions appears to require O(k²) complexity, but clever rearrangement of elements in the summation can lower this to O(k) in both cases, using either Horner's algorithm (monomials) or Clenshaw's algorithm [4] (orthogonal polynomials). Viewed in this way, there are two advantages to utilizing the barycentric form. First, the initial one-time computation of the barycentric weights is cheaper than for the monomials. Furthermore, with fixed nodal locations z_j, the weights w_j need not be recomputed if p is changed; in other words, the weights w_j do not depend on p. Hence if k and the nodes z_j are given and fixed, then the weights w_j can be precomputed and (re-)used for any p ∈ Q_k. Our first task in this paper is to generalize the barycentric procedure to first- and second-derivative evaluation.
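Both rearrangements mentioned above are classical; for reference, a minimal Python sketch (ours, for illustration only) of Horner's scheme for monomial expansions and Clenshaw's recurrence for Chebyshev expansions:

```python
def horner(c, x):
    """Evaluate sum_j c[j] * x**j in O(k) flops; c in increasing-degree order."""
    acc = 0.0
    for cj in reversed(c):
        acc = acc * x + cj
    return acc

def clenshaw_chebyshev(c, x):
    """Evaluate sum_j c[j] * T_j(x) in O(k) flops via Clenshaw's recurrence."""
    b1 = b2 = 0.0
    for cj in reversed(c[1:]):
        b1, b2 = cj + 2.0 * x * b1 - b2, b1
    return c[0] + x * b1 - b2

print(horner([1.0, 2.0, 3.0], 2.0))              # 1 + 2*2 + 3*4 = 17.0
print(clenshaw_chebyshev([0.0, 0.0, 1.0], 0.5))  # T_2(0.5) = 2*0.25 - 1 = -0.5
```

Both loops make a single O(k) pass over the coefficients, in contrast to the naive O(k²) term-by-term evaluation.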

Barycentric Evaluation of Derivatives
A formula for the derivative of a polynomial can be derived from the barycentric form (1) that inherits its advantages. We first note two auxiliary expressions for the derivative of the numerator and denominator rational functions,

$$\frac{\mathrm{d}}{\mathrm{d}\eta}\, S_1(p, \eta) = -S_2(p, \eta), \qquad (8)$$

$$\frac{\mathrm{d}}{\mathrm{d}\eta}\, S_1(\mathbb{1}, \eta) = -S_2(\mathbb{1}, \eta), \qquad (9)$$

each of which ostensibly also requires only O(k) complexity to evaluate if the weights w_j are precomputed. Then, by directly differentiating (1), we have

$$p'(\eta) = \frac{S_2\big(p(\eta)\mathbb{1} - p,\ \eta\big)}{S_1(\mathbb{1}, \eta)}, \qquad (10)$$

where the vector p(η)𝟙 − p has entries p(η) − p_j. Therefore, this "barycentric" form for p′(η) requires (i) an evaluation of p(η) that can be accomplished in O(k) complexity using (1), and (ii) an additional O(k) evaluation of the summation above. Thus, η → p′(η) can be evaluated with only O(k) complexity. A similar computation shows that

$$p''(\eta) = \frac{2\, p'(\eta)\, S_2(\mathbb{1}, \eta) - 2\, S_3\big(p(\eta)\mathbb{1} - p,\ \eta\big)}{S_1(\mathbb{1}, \eta)}. \qquad (11)$$

Again, since η → p(η) and η → p′(η) can be evaluated with O(k) effort through (1) and (10), some extra O(k) effort to evaluate S_2 and S_3 above yields an evaluation η → p″(η) that can also be accomplished in O(k) time.
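A direct Python transcription of the derivative formula (again an expository sketch, not library code) shows how the terms of the value evaluation are reused by the derivative:

```python
import numpy as np

def bary_weights(z):
    """Barycentric weights, cf. (2)."""
    w = np.ones(len(z))
    for j in range(len(z)):
        for m in range(len(z)):
            if m != j:
                w[j] /= z[j] - z[m]
    return w

def bary_eval_with_deriv(z, w, p, eta):
    """Return (p(eta), p'(eta)) in O(k) flops; assumes eta is not a node."""
    d = eta - z
    t = w / d                              # shared between value and derivative
    denom = np.sum(t)                      # S_1(1, eta)
    val = np.dot(t, p) / denom             # barycentric form, cf. (1)
    der = np.dot(t / d, val - p) / denom   # derivative form, cf. (10)
    return val, der

# p(x) = x^3 on four nodes: p(0.5) = 0.125, p'(0.5) = 3 * 0.25 = 0.75
z = np.array([-1.0, -1/3, 1/3, 1.0])
val, der = bary_eval_with_deriv(z, bary_weights(z), z**3, 0.5)
print(val, der)                            # ~0.125  ~0.75
```

The intermediate vector `t` is computed once and shared, which is the same reuse the templated Nektar++ kernel exploits later in the paper.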

Tensorization of the Barycentric Form
All the evaluations considered above can be generalized to tensorial formulations, which is the main topic of this section. To that end, we introduce some multidimensional notation. Let η := (η_1, ..., η_d) ∈ [−1, 1]^d be a d-dimensional vector. Given some multi-index k ∈ N_0^d, we introduce the tensorial space of polynomials Q_k defined by k:

$$Q_{\boldsymbol{k}} := \mathrm{span}\{\boldsymbol{\eta}^{\boldsymbol{j}} \,:\, \boldsymbol{j} \le \boldsymbol{k}\},$$

where we have adopted the standard multi-index notation, and j ≤ k is true if all the component-wise inequalities are true. Fixing k, we consider the case of representing an element p of Q_k through its values on a discrete tensorial grid of size ∏_{q=1}^{d} (k_q + 1). As in the univariate case, linear expansions in cardinal Lagrange, monomial, and/or orthogonal polynomials are common, but we will exercise the barycentric form. Let a tensorial grid on the [−1, 1]^d orthotope be given, with k_q + 1 points in direction q:

$$Z_q := \{z_{q,0},\, z_{q,1},\, \ldots,\, z_{q,k_q}\} \subset [-1, 1], \qquad q = 1, \ldots, d.$$

The tensorization of these grids results in the multidimensional grid {η_j}_{j ≤ k} defined by

$$\boldsymbol{\eta}_{\boldsymbol{j}} := \big(z_{1,j_1},\, z_{2,j_2},\, \ldots,\, z_{d,j_d}\big).$$

Given this tensorial configuration of nodes, we define univariate barycentric weights associated with each dimension in a fashion similar to (2),

$$w_{q,j} = \frac{1}{\prod_{0 \le m \le k_q,\ m \ne j} (z_{q,j} - z_{q,m})}, \qquad 0 \le j \le k_q, \quad q = 1, \ldots, d.$$

Then, given the data {p_j}_{j ≤ k} for p ∈ Q_k, the multidimensional barycentric form of p is

$$p(\boldsymbol{\eta}) = \frac{\displaystyle\sum_{\boldsymbol{j} \le \boldsymbol{k}} \left( \prod_{q=1}^{d} \frac{w_{q,j_q}}{\eta_q - z_{q,j_q}} \right) p_{\boldsymbol{j}}}{\displaystyle\sum_{\boldsymbol{j} \le \boldsymbol{k}} \prod_{q=1}^{d} \frac{w_{q,j_q}}{\eta_q - z_{q,j_q}}}. \qquad (18)$$

Given y ∈ [−1, 1]^d, an evaluation of p(y) above requires O(∏_{q=1}^{d} k_q) operations, corresponding to the complexity of the summations.
In order to evaluate p(y), we proceed by iteratively constructing p_q from p_{q+1} for q = d − 1, ..., 1, where p_d := p and

$$p_q(\eta_1, \ldots, \eta_q) := p(\eta_1, \ldots, \eta_q,\, y_{q+1}, \ldots, y_d).$$

Via the barycentric form, "constructing" p_q amounts to evaluating this function on the q-dimensional tensorial grid Z_1 × ··· × Z_q. Then, for example, to first generate p_{d−1}, we must evaluate

$$p_{d-1}(z_{1,j_1}, \ldots, z_{d-1,j_{d-1}}) = p(z_{1,j_1}, \ldots, z_{d-1,j_{d-1}},\, y_d)$$

for every j ≤ (k_1, ..., k_{d−1}). This can be accomplished via univariate procedures in O(k_d) complexity per grid point since, for each fixed j, the map η_d → p(z_{1,j_1}, ..., z_{d−1,j_{d−1}}, η_d) is a polynomial of degree k_d, and hence obeys the barycentric formula. In this way, we proceed to iteratively construct

$$p_q \ \leftarrow\ p_{q+1}, \qquad q = d-1, \ldots, 1, \qquad (24)$$

amounting at step q to ∏_{j=1}^{q} (k_j + 1) evaluations, each of computational complexity O(k_{q+1}). In summary, computing y → p(y) can be accomplished with the procedure above, which entails repeated use of the univariate barycentric form (5). As mentioned earlier, the procedure (24) is slightly more expensive than direct evaluation of the multidimensional barycentric form (18). In particular, the (k-asymptotic) cost of the direct evaluation (18) is ∏_{q=1}^{d} k_q. On the other hand, the dimension-by-dimension approach (24) incurs additional lower-order costs, and has complexity scaling as

$$\sum_{q=0}^{d-1} k_{q+1} \prod_{j=1}^{q} k_j \ \le\ d \prod_{q=1}^{d} k_q,$$

where the inequality is a very crude bound. Thus, while the algorithm described by (24) is formally more expensive than direct evaluation (18), the actual additional cost is relatively small. In particular, for the physically relevant cases of d = 2, 3, this minor increase in cost is acceptable for the achieved gain in implementation ease.
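For d = 2 the dimension-by-dimension sweep can be sketched as follows (illustrative Python under our own node choices, not the library implementation): the data are first collapsed along dimension 2 by one univariate barycentric evaluation per node of dimension 1, and the resulting line of values is evaluated once more along dimension 1.

```python
import numpy as np

def bary_weights(z):
    w = np.ones(len(z))
    for j in range(len(z)):
        for m in range(len(z)):
            if m != j:
                w[j] /= z[j] - z[m]
    return w

def bary_eval(z, w, p, eta):
    t = w / (eta - z)                  # assumes eta is not a grid node
    return np.dot(t, p) / np.sum(t)

def bary_eval_2d(z1, z2, P, y):
    """Evaluate p(y) from grid data P[i, j] = p(z1[i], z2[j])."""
    w1, w2 = bary_weights(z1), bary_weights(z2)
    # collapse dimension 2 at y[1]: one univariate evaluation per z1 node
    line = np.array([bary_eval(z2, w2, P[i], y[1]) for i in range(len(z1))])
    # then a single univariate evaluation along dimension 1
    return bary_eval(z1, w1, line, y[0])

# p(x1, x2) = x1^2 * x2 has degree (2, 1), so a 3 x 2 grid is exact:
z1, z2 = np.array([-1.0, 0.0, 1.0]), np.array([-1.0, 1.0])
P = z1[:, None] ** 2 * z2[None, :]
print(bary_eval_2d(z1, z2, P, (0.5, 0.25)))    # 0.5^2 * 0.25 = 0.0625
```

The inner list comprehension performs the k_1 + 1 collapses of cost O(k_2) each, and the final call is the single O(k_1) evaluation, matching the complexity count above.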

Tensorial Functions
We end this section with a brief remark on a direct simplification that can be employed in the special case that one has prior knowledge that the polynomial p is tensorial, i.e., of the form

$$p(\boldsymbol{\eta}) = \prod_{j=1}^{d} p_j(\eta_j),$$

for some univariate polynomials {p_j}_{j=1}^{d} satisfying deg p_j ≤ k_j. In many high-order FEM simulations, basis functions in η space are often tensorial polynomials, so this situation does occur in practice. In this tensorial case, computing the barycentric weights is simpler, since we need only compute the k_j + 1 univariate weights for dimension j associated with the grid Z_j. Thus, we need only compute ∑_{j=1}^{d} (k_j + 1) weights, as opposed to the full set of ∏_{j=1}^{d} (k_j + 1) multivariate weights associated with (18). Evaluating η → p(η) is likewise faster in this case: once the univariate weights are computed, each univariate barycentric evaluation requires only O(k_j) operations, and the d univariate results are multiplied together.

Derivatives
Section 3.1 discusses how we accomplish the evaluation η → p(η) algorithmically by iteratively evaluating along each dimension. This procedure is directly extensible to evaluation of (Cartesian) partial derivatives. For example, if p ∈ Q_k, suppose for some multi-index λ ∈ N_0^d we wish to evaluate the order-λ derivative

$$p^{(\boldsymbol{\lambda})}(\boldsymbol{\eta}) := \frac{\partial^{|\boldsymbol{\lambda}|}}{\partial \eta_1^{\lambda_1} \cdots \partial \eta_d^{\lambda_d}}\, p(\boldsymbol{\eta}).$$

We are mostly concerned with |λ| ≤ 2, i.e., at most "second" derivatives, but the procedure we describe applies to derivatives of arbitrary order. The dimension-by-dimension approach can be accomplished with essentially the same procedure as articulated in Sect. 3.1: define

$$p_q(\eta_1, \ldots, \eta_q) := \frac{\partial^{\lambda_{q+1} + \cdots + \lambda_d}}{\partial \eta_{q+1}^{\lambda_{q+1}} \cdots \partial \eta_d^{\lambda_d}}\, p(\eta_1, \ldots, \eta_q,\, y_{q+1}, \ldots, y_d),$$

so that constructing p_q from p_{q+1} at a single grid point requires evaluation of the order-λ_{q+1} derivative along dimension q + 1. Through the procedures outlined in Sect. 2.2, we can accomplish this in O(k_{q+1}) complexity. We must evaluate the derivative at each of the ∏_{j=1}^{q} (k_j + 1) grid points associated with dimensions 1, ..., q. Therefore, the outline and complexity of this procedure are precisely as given in (24), except that p(y) is replaced by p^{(λ)}(y).
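For instance, with d = 2 and λ = (1, 0), the sweep first collapses dimension 2 by plain evaluation and then applies the univariate derivative formula of Sect. 2.2 along dimension 1. An illustrative Python sketch, under the same assumptions as the earlier sketches:

```python
import numpy as np

def bary_weights(z):
    """Barycentric weights, cf. (2)."""
    w = np.ones(len(z))
    for j in range(len(z)):
        for m in range(len(z)):
            if m != j:
                w[j] /= z[j] - z[m]
    return w

def bary_eval(z, w, p, eta):
    """Plain barycentric evaluation (1); assumes eta is not a node."""
    t = w / (eta - z)
    return np.dot(t, p) / np.sum(t)

def bary_deriv(z, w, p, eta):
    """First derivative via the barycentric derivative formula, cf. (10)."""
    d = eta - z
    t = w / d
    val = np.dot(t, p) / np.sum(t)
    return np.dot(t / d, val - p) / np.sum(t)

def bary_partial_x1(z1, z2, P, y):
    """Order-(1,0) derivative: collapse dimension 2 by plain evaluation,
    then differentiate the resulting line of values along dimension 1."""
    w1, w2 = bary_weights(z1), bary_weights(z2)
    line = np.array([bary_eval(z2, w2, P[i], y[1]) for i in range(len(z1))])
    return bary_deriv(z1, w1, line, y[0])

# p = x1^2 * x2, so dp/dx1 = 2 x1 x2; at (0.5, 0.25) this is 0.25
z1, z2 = np.array([-1.0, 0.0, 1.0]), np.array([-1.0, 1.0])
P = z1[:, None] ** 2 * z2[None, :]
print(bary_partial_x1(z1, z2, P, (0.5, 0.25)))    # 0.25
```

Only the final univariate pass changes relative to plain evaluation, so the overall operation count is identical up to constants.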

Nontensorial Multidimensional Formulations Via Duffy Transformations
The goal of this section is to describe how barycentric interpolation in the tensorial case over d-dimensional orthotopes of Sect. 3 can be utilized for efficient evaluation of polynomial approximations in certain nontensorial cases. We focus on (potentially) non-tensorial polynomial approximations in a d-dimensional variable ξ. The variable ξ will be related to the tensorial variable η through "collapsed coordinates" effected by a Duffy transformation, as described in Sect. 4.1. The transformation can be applied to a variety of common FEM element types, see Table 1. A description of how barycentric evaluation procedures can be used to evaluate polynomials on these potentially nontensorial geometries is given in Sect. 4.2; specifically, the procedure is given by (32). That section also gives precise conditions on the polynomial space to which p must belong so that the evaluation is exact (Theorem 4.1). Section 4.3 specializes the evaluation exactness conditions to common element types used in the spectral/hp community. Section 4.4 closes the section by discussing extension of the evaluation routines to derivative evaluations through use of the chain rule.

Table 1: Multidimensional domains resulting from collapsed coordinates, multi-indices a and b identifying the associated multivariate Duffy map, and ancestor functions g_{a,b} defined in (33).

Collapsed Coordinates
We consider polynomials in the d variables ξ. The particular type of ξ-polynomials we consider are defined via a mapping from η space to ξ space, where η ∈ [−1, 1]^d is the tensorial variable considered in Sect. 3. The essential building block that allows us to specify the η ↔ ξ relationship is the Duffy transformation, which is a variable transformation in two dimensions. For η ∈ [−1, 1]^d and some fixed i, j ∈ {1, ..., d} with i ≠ j, define D_{i,j} as the Duffy transformation that "collapses" dimension i with respect to (or along) dimension j and is the identity map on all the other dimensions,

$$\big(D_{i,j}(\boldsymbol{\eta})\big)_q = \begin{cases} \dfrac{(1 + \eta_i)(1 - \eta_j)}{2} - 1, & q = i, \\[4pt] \eta_q, & q \ne i. \end{cases} \qquad (29)$$

Various domains in d dimensions can be created by composing Duffy maps. For composing c < d maps, we let a ∈ [d]^c have components that represent the dimensions that are collapsed by a Duffy transformation, and let b ∈ [d]^c have components specifying along which dimensions the collapse occurs. We place the following restrictions on the entries of a and b: the components of a must be distinct, and a_m ≠ b_m for every m = 1, ..., c, so that each dimension is collapsed at most once and never along itself. (30) We now define the variable ξ as the image of η under a composition of Duffy transformations defined by a and b,

$$\boldsymbol{\xi} = D_{\boldsymbol{a},\boldsymbol{b}}(\boldsymbol{\eta}) := D_{a_c, b_c} \circ \cdots \circ D_{a_1, b_1}(\boldsymbol{\eta}), \qquad \boldsymbol{\xi} \in E(\boldsymbol{a}, \boldsymbol{b}) := D_{\boldsymbol{a},\boldsymbol{b}}\big([-1, 1]^d\big). \qquad (31)$$

We are interested primarily in these domains for dimensions d = 2, 3. Table 1 illustrates various standard geometric domains that are the result of particular choices of coordinate collapses.
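For concreteness, the single collapse D_{1,2} in d = 2 (square to triangle) and its inverse can be written in a few lines of Python. The explicit formula follows the standard collapsed-coordinate convention and is a sketch for exposition, consistent with the form of (29) assumed here:

```python
def duffy(eta1, eta2):
    """D_{1,2}: collapse dimension 1 along dimension 2; [-1,1]^2 -> triangle."""
    xi1 = 0.5 * (1.0 + eta1) * (1.0 - eta2) - 1.0
    return xi1, eta2

def duffy_inv(xi1, xi2):
    """Inverse map; singular on the collapsed edge xi2 = 1."""
    eta1 = 2.0 * (1.0 + xi1) / (1.0 - xi2) - 1.0
    return eta1, xi2

# Round trip away from the singular edge:
print(duffy_inv(*duffy(0.3, -0.4)))          # (0.3, -0.4) up to rounding
# The whole edge eta2 = 1 collapses to the single vertex (-1, 1):
print(duffy(-0.7, 1.0), duffy(0.7, 1.0))     # both (-1.0, 1.0)
```

The second print illustrates why the inverse map, and hence derivative evaluation, degenerates at the collapsed vertex.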
A visual example with d = 2 is also shown in Fig. 1 which gives the Duffy transformations between reference triangles and reference quadrilaterals.

Evaluation of Polynomials
The previous section illustrates how various standard domains that are used to tessellate space in finite element simulations are constructed. This section considers how we can employ barycentric evaluation in η space to accomplish evaluation of polynomials in ξ space. In what follows we assume that the dimension d, the grid size multi-index k, and the Duffy transformation parameters c, a, and b are all given and fixed.
Recall that on the tensorial domain η ∈ [−1, 1]^d we have a tensorial grid Z_k comprised of k_q + 1 points in dimension q, resulting in a total of ∏_{q=1}^{d} (k_q + 1) points in Z_k. The image of this grid in ξ space is the result of applying the Duffy transformation:

$$\boldsymbol{y}_{\boldsymbol{j}} := D_{\boldsymbol{a},\boldsymbol{b}}(\boldsymbol{\eta}_{\boldsymbol{j}}), \qquad \boldsymbol{j} \le \boldsymbol{k}.$$

Assume that data values are furnished from a given function p, and are provided on the grid y_j. Naturally, these values can be considered as data values in η space under the (inverse) Duffy transformation, and therefore the barycentric routines developed in tensorial form in Sect. 3 can be applied.
The main result of this section is the provision of conditions on p in ξ space under which applying the tensorial barycentric interpolation procedure in η space results in an exact evaluation. More precisely, we consider the following algorithmic set of steps given a point ξ in E(a, b), and a function p:

$$\boldsymbol{\xi} \ \longrightarrow\ \boldsymbol{\eta} = D_{\boldsymbol{a},\boldsymbol{b}}^{-1}(\boldsymbol{\xi}) \ \longrightarrow\ \text{tensorial barycentric evaluation of the data } \{p(\boldsymbol{y}_{\boldsymbol{j}})\}_{\boldsymbol{j} \le \boldsymbol{k}} \text{ at } \boldsymbol{\eta}. \qquad (32)$$

Thus, our main result below gives conditions on p so that the output on the right of (32) equals the correct evaluation p(ξ). To proceed, we need a more involved deconstruction of the element identifier multi-indices a and b. The particular rules in Sect. 4.1 that define the possible values of a and b ensure that a collection of tree structures can be constructed from a and b. Let each dimension 1, 2, ..., d correspond to a node. The directed edges correspond to drawing an arrow from node b_j ending at node a_j for each j = 1, ..., c. Since the indices {a_j}_{j=1}^{c} are all distinct, at most one arrow lands at each node, and therefore this construction forms a collection of trees. With this structure, we now define an 'ancestor function' on the set of dimensions, which identifies which indices are ancestors of any dimension,

$$g_{\boldsymbol{a},\boldsymbol{b}} : [d] \to 2^{[d]}, \qquad g_{\boldsymbol{a},\boldsymbol{b}}(q) := \{q\} \cup \{\text{ancestors of node } q \text{ in its tree}\}, \qquad (33)$$

where 2^{[d]} denotes the power set (set of subsets) of [d]. Note that we have also included the index q in g_{a,b}(q), so that g_{a,b}(q) is always non-empty. The identification of a and b for typical geometries in d = 2, 3 is given in Table 1. Finally, through the ancestor function g_{a,b}, we can articulate which functions in ξ space are exactly evaluated via the barycentric form in η space.

Theorem 4.1 With d, k, c, a, and b all given and fixed, define the following multi-index set:

$$A := \Big\{ \boldsymbol{\alpha} \in \mathbb{N}_0^d \ :\ \sum_{i \,:\, q \in g_{\boldsymbol{a},\boldsymbol{b}}(i)} \alpha_i \ \le\ k_q \ \text{ for every } q \in [d] \Big\}, \qquad (34)$$

which defines a polynomial space,

$$P = P(A) := \mathrm{span}\{ \boldsymbol{\xi}^{\boldsymbol{\alpha}} \ :\ \boldsymbol{\alpha} \in A \}. \qquad (35)$$

Then, for every p ∈ P (a polynomial in ξ space), the procedure in (32) that utilizes the barycentric evaluation algorithm of Sect. 3 exactly evaluates p(ξ).
Proof Let p ∈ P, and let α ∈ N_0^d denote the degree of p, i.e., the polynomial degree of p in dimension q is α_q. Since p(ξ) = p(D_{a,b}(η)), the result is proven if we can show that p̃ := p ∘ D_{a,b} ∈ Q_k, since the barycentric form in η space is exact on this space of polynomials. Note that each Duffy transformation D_{a,b} defined through (29) and (30) is a (multivariate) polynomial, so that p̃ = p ∘ D_{a,b} is a polynomial, and we need only show that its maximum degree in dimension q is less than or equal to k_q for every q = 1, ..., d.
Fixing q ∈ [d], the degree of p̃ in dimension q is discernible from the ancestor function g_{a,b}. Let deg_q(f) denote the dimension-q degree of a polynomial f. Then the Duffy transformation definition (29) implies that for any two distinct dimensions i, j,

$$\deg_i(f \circ D_{i,j}) = \deg_i(f), \qquad \deg_j(f \circ D_{i,j}) = \deg_i(f) + \deg_j(f).$$

Since D_{a,b} is a composition of univariate Duffy maps, the degree of p̃ along any dimension q can be determined by tracing the history of which dimensions collapse onto q, i.e., is determined by g_{a,b}(q). Thus,

$$\deg_q(\tilde{p}) \ \le\ \sum_{i \,:\, q \in g_{\boldsymbol{a},\boldsymbol{b}}(i)} \alpha_i.$$

By the assumption on the index set A to which α belongs, this last term is bounded by k_q.
Given a tensorial grid Z_k in η space, Theorem 4.1 precisely describes what type of ξ-polynomial space membership p should have so that the procedure (32) exactly evaluates p.

Specializations
This section describes certain specializations of the apparatus in the previous section. Our specializations will be the two- and three-dimensional domains shown in Table 1. The goal is to show how the exactness condition of Theorem 4.1 manifests on these domains, in particular, to articulate the polynomial space P defined in (35) on which the barycentric evaluation procedure (32) is exact. We will describe P for a given degree index k, and will also present a special "isotropic" case when the number of points is the same in every dimension, i.e., when k = (k, k, ..., k) for some non-negative scalar integer k.

Quadrilaterals
We consider d = 2, with c = 0 Duffy maps. In this case, we have η = ξ, and both variables take values on [−1, 1]². Then, given degree k = (k_1, k_2), the set A in (34) corresponds to all multi-indices j satisfying j ≤ k. Therefore, the polynomial space P in (35) is equal to Q_k.

Triangles
As with quadrilateral elements we have d = 2, but we now take c = 1, and a Duffy transformation collapse defined by a = 1, b = 2. Then, the η ↔ ξ map is given by

$$\xi_1 = \frac{(1 + \eta_1)(1 - \eta_2)}{2} - 1, \quad \xi_2 = \eta_2; \qquad \eta_1 = \frac{2(1 + \xi_1)}{1 - \xi_2} - 1, \quad \eta_2 = \xi_2.$$

The constraints in the definition of A are given by α_1 ≤ k_1 and α_1 + α_2 ≤ k_2, so that the space P is

$$P = \mathrm{span}\{\xi_1^{\alpha_1} \xi_2^{\alpha_2} \ :\ \alpha_1 \le k_1,\ \alpha_1 + \alpha_2 \le k_2\};$$

in the isotropic case k = (k, k), this space equals P_k, where we have used P_k to denote the set of bivariate polynomials of total degree at most k.
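To see Theorem 4.1 at work on the triangle, the following self-contained Python sketch samples a total-degree-2 polynomial on the Duffy image of a 3 × 3 grid and recovers its value at an arbitrary interior point via procedure (32). This is illustrative only; node placement and conventions are our assumptions, not the Nektar++ defaults:

```python
import numpy as np

def bary_weights(z):
    w = np.ones(len(z))
    for j in range(len(z)):
        for m in range(len(z)):
            if m != j:
                w[j] /= z[j] - z[m]
    return w

def bary_eval(z, w, p, eta):
    d = eta - z
    if np.any(d == 0.0):
        return p[int(np.argmin(np.abs(d)))]
    t = w / d
    return np.dot(t, p) / np.sum(t)

def bary_eval_2d(z1, z2, P, y):
    w1, w2 = bary_weights(z1), bary_weights(z2)
    line = np.array([bary_eval(z2, w2, P[i], y[1]) for i in range(len(z1))])
    return bary_eval(z1, w1, line, y[0])

duffy = lambda e1, e2: (0.5 * (1 + e1) * (1 - e2) - 1, e2)     # square -> triangle
duffy_inv = lambda x1, x2: (2 * (1 + x1) / (1 - x2) - 1, x2)   # singular at x2 = 1

p = lambda x1, x2: x1**2 + x1 * x2      # total degree 2: lies in P for k = (2, 2)

z = np.array([-1.0, 0.0, 1.0])          # 3 points per direction (k = 2)
# Sample p on the Duffy image of the tensorial grid:
P = np.array([[p(*duffy(e1, e2)) for e2 in z] for e1 in z])

xi = (-0.5, 0.2)                        # an interior point of the triangle
eta = duffy_inv(*xi)                    # procedure (32): map back, then evaluate
print(bary_eval_2d(z, z, P, eta))       # p(-0.5, 0.2) = 0.25 - 0.1 = 0.15
```

The recovery is exact (to rounding) because p ∘ D has degree at most (2, 2) in η, exactly the condition of Theorem 4.1.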

Hexahedrons
We now move to three dimensions, so d = 3, and take c = 0 Duffy maps, again implying that ξ = η. Therefore, the space P on which the barycentric procedure is exact is Q_k.

Prisms
As with hexahedral elements we have d = 3, but we now take c = 1, and a Duffy transformation collapse defined by a = 1, b = 2. Then, the η ↔ ξ map is given by

$$\xi_1 = \frac{(1 + \eta_1)(1 - \eta_2)}{2} - 1, \qquad \xi_2 = \eta_2, \qquad \xi_3 = \eta_3.$$

The constraints in the definition of A are given by α_1 ≤ k_1, α_1 + α_2 ≤ k_2, and α_3 ≤ k_3, so that the space P is

$$P = \mathrm{span}\{\xi_1^{\alpha_1} \xi_2^{\alpha_2} \xi_3^{\alpha_3} \ :\ \alpha_1 \le k_1,\ \alpha_1 + \alpha_2 \le k_2,\ \alpha_3 \le k_3\}.$$

Tetrahedrons
Also with d = 3, now with c = 2 and Duffy transformation collapses defined by a = (1, 2) and b = (2, 3), the η ↔ ξ map is given by

$$\xi_1 = \frac{(1 + \eta_1)(1 - \eta_2)}{2} - 1, \qquad \xi_2 = \frac{(1 + \eta_2)(1 - \eta_3)}{2} - 1, \qquad \xi_3 = \eta_3.$$

The constraints in the definition of A are given by α_1 ≤ k_1, α_1 + α_2 ≤ k_2, and α_1 + α_2 + α_3 ≤ k_3, so that the space P is

$$P = \mathrm{span}\{\xi_1^{\alpha_1} \xi_2^{\alpha_2} \xi_3^{\alpha_3} \ :\ \alpha_1 \le k_1,\ \alpha_1 + \alpha_2 \le k_2,\ \alpha_1 + \alpha_2 + \alpha_3 \le k_3\};$$

in the isotropic case k = (k, k, k), this space equals P_k, where we have used P_k to denote the set of trivariate polynomials of total degree at most k.

Pyramids
Finally, we again take d = 3 and c = 2, with Duffy transformation collapses defined by a = (1, 2) and b = (3, 3), so that dimensions 1 and 2 each collapse along dimension 3. Then, the η ↔ ξ map is given by

$$\xi_1 = \frac{(1 + \eta_1)(1 - \eta_3)}{2} - 1, \qquad \xi_2 = \frac{(1 + \eta_2)(1 - \eta_3)}{2} - 1, \qquad \xi_3 = \eta_3.$$

The constraints in the definition of A are given by α_1 ≤ k_1, α_2 ≤ k_2, and α_1 + α_2 + α_3 ≤ k_3, so that the space P is

$$P = \mathrm{span}\{\xi_1^{\alpha_1} \xi_2^{\alpha_2} \xi_3^{\alpha_3} \ :\ \alpha_1 \le k_1,\ \alpha_2 \le k_2,\ \alpha_1 + \alpha_2 + \alpha_3 \le k_3\}.$$

Derivatives and Gradients
The results of Sect. 4.2 lead naturally to derivative evaluations. In particular, by defining the function p̃ := p ∘ D_{a,b} in η space, and writing p(ξ) = p̃(η(ξ)), we can translate derivatives of p to those of p̃, which can be efficiently evaluated using the results from previous sections. Using the chain rule, the gradient of p can be written in terms of the gradient of p̃,

$$\nabla_{\boldsymbol{\xi}}\, p = J\, \nabla_{\boldsymbol{\eta}}\, \tilde{p},$$

where ∇_η is the standard d-variate gradient operator with respect to the Euclidean variables η, and we have defined the d × d Jacobian matrix of the ξ → η map,

$$J_{i,j} := \frac{\partial \eta_j}{\partial \xi_i}.$$

Note that the individual Duffy maps D_{a,b} defined in (29) that collapse dimension a onto dimension b are invertible whenever η_b ≠ 1, so that the Jacobian above is well defined away from these points. The formula above shows that since the gradient of p̃ can be evaluated efficiently through the procedures in Sect. 3.2, so, too, can the gradient of p. In particular, this procedure exactly evaluates gradients (away from singularities of the Duffy transformation) if p ∈ P(A), where P(A) is given in (35). Similarly, components of the Hessian of p can be evaluated as

$$\frac{\partial^2 p}{\partial \xi_i\, \partial \xi_j} = \left(\frac{\partial^2 \boldsymbol{\eta}}{\partial \xi_i\, \partial \xi_j}\right)^{\!T} \nabla_{\boldsymbol{\eta}}\, \tilde{p} \ +\ \left(\frac{\partial \boldsymbol{\eta}}{\partial \xi_i}\right)^{\!T} H_{\boldsymbol{\eta}}(\tilde{p})\, \frac{\partial \boldsymbol{\eta}}{\partial \xi_j},$$

where ∂²η/∂ξ_i∂ξ_j ∈ R^d and ∂η/∂ξ_i ∈ R^d are componentwise derivatives, and H_η(p̃) is the d × d Hessian of p̃. Again, since the Hessian of p̃ can be efficiently evaluated through the procedures in Sect. 3.2, the Hessian of p also inherits this asymptotic efficiency. This procedure again exactly evaluates Hessians (away from singularities of the Duffy transformation) if p ∈ P(A). If p ∈ P(A), higher-order derivatives of p may likewise be computed exactly from those of p̃ using Faà di Bruno's formula, with O(∏_{j=1}^{d} k_j) complexity stemming from the multivariate barycentric procedures described earlier.
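The triangle case makes the chain rule concrete: with ξ_1 = (1 + η_1)(1 − η_2)/2 − 1 and ξ_2 = η_2 (the standard collapsed-coordinate convention, assumed here), the Jacobian entries are ∂η_1/∂ξ_1 = 2/(1 − η_2) and ∂η_1/∂ξ_2 = (1 + η_1)/(1 − η_2). The following Python sketch (ours, for verification only) checks the resulting geometric-factor formulas against a directly differentiated polynomial:

```python
# p(xi1, xi2) = xi1^2 + xi1 * xi2, with known gradient (2*xi1 + xi2, xi1).
duffy = lambda e1, e2: (0.5 * (1 + e1) * (1 - e2) - 1, e2)

def grad_via_chain_rule(eta1, eta2):
    """Gradient of p in xi, from partials of p_tilde = p o D in eta."""
    xi1, xi2 = duffy(eta1, eta2)
    # partials of p_tilde by direct differentiation of the composition:
    dpt_de1 = (2 * xi1 + xi2) * 0.5 * (1 - eta2)
    dpt_de2 = (2 * xi1 + xi2) * (-(1 + eta1) / 2) + xi1
    # geometric factors of the inverse Duffy map:
    g1 = 2.0 / (1.0 - eta2)               # d eta1 / d xi1
    g2 = g1 * (eta1 + 1.0) / 2.0          # d eta1 / d xi2
    return dpt_de1 * g1, dpt_de2 + g2 * dpt_de1

eta = (0.25, 0.2)
xi1, xi2 = duffy(*eta)
print(grad_via_chain_rule(*eta))          # matches the direct gradient below
print((2 * xi1 + xi2, xi1))
```

The factors `g1` and `g2` are exactly the geometric factors that appear later in the triangle evaluation routine (Algorithm 4).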

Algorithmic and Implementation Details
In this section, we present the implementation details of the barycentric Lagrange interpolation in terms of the data structures and algorithms involved. The implementation follows the high-level algorithms described in Sects. 3 and 4, but some details differ in service of computational routine optimization. The implementation of these concepts can be accessed in the open-source spectral/hp element library Nektar++ [2,15].

Algorithm
The foundation of the implementation is the kernel that performs the barycentric interpolation itself as given in Eq. (1); that is, it takes the coordinate of a single arbitrary point and the stored physical polynomial values at each quadrature point in the expansion, and returns the interpolated value at the arbitrary point. This kernel has been templated to perform the interpolation only in a specific direction based on the integer template parameter DIR, and also to return the derivative value and second-derivative value by reference parameter based on the boolean template parameters DERIV and DERIV2. Templating here is meant in the sense of C++ templates; i.e., these expressions are evaluated at compile time to reduce branching overheads and enable compiler inlining. For example, when DERIV and DERIV2 are not required, setting these template variables to false allows for performance gains by removing the if-branch tests from the generated object code. The reasoning behind unifying the physical-value evaluation and the derivative interpolations is that we can make use of terms computed in the physical evaluation in the derivative interpolations, saving repeated calculations (cf. (10) and (11)). An example kernel for physical, first- and second-derivative values is shown in Algorithm 1.
The next important method is the tensor-product function, which constructs the tensor line/square by calling the barycentric interpolation kernel on quadrature points, and is therefore dimension dependent and operates on the reference element in the appropriate form. The 1D version performs the barycentric interpolation directly on the provided point and returns the physical, first- and second-derivative values in the ξ_1 direction. In 2D and 3D, we chose to implement only the first derivative to reduce the overall complexity of the tensor-product method. The 2D version constructs an interpolation in the ξ_1 direction to give an intermediate step of physical values and derivative values at the expansion quadrature points in the same ξ_1 direction. The quadrature derivative values in the ξ_1 direction can then be evaluated in the ξ_2 direction to produce the single derivative value in the ξ_1 direction at the point provided. Likewise, the quadrature physical values in the ξ_1 direction can then be evaluated in the ξ_2 direction with the derivative output enabled to return both the single derivative value in the ξ_2 direction and the physical value at the provided point. The tensor product in 3D is similar, except we now also consider the ξ_3 direction, and so our intermediary steps consist of constructing the tensor square. Structuring it in this manner allows for a minimum number of calls to the barycentric interpolation kernel. An example tensor-product function in 2D is shown in Algorithm 2.
This tensor-product function operates on the reference element, which in 2D is a quadrilateral and in 3D a hexahedron. Additionally, it can be extended to non-reference shape types by collapsing coordinates and performing the correct quadrature point mapping as described in Sect. 4.1. This is achieved by overriding the existing reference-element interpolation function, which evaluates the expansion at a single (arbitrary) point of the domain, to also give it the capability to evaluate the derivative in each direction as needed. This function is a wrapper around a virtual function that is defined for each shape type and therefore allows for the shape-dependent coordinate collapsing. Example structures of these functions for a triangular shape type are shown in Algorithms 3 and 4.

Complexity Analysis
Given a function p(η_DIR) evaluated at the k + 1 quadrature points Q = {z_0, z_1, ..., z_k}_DIR, we perform the following steps to find the interpolated values of p(η) and ∂p(η)/∂η_DIR at a given point η ∉ Q. (The evaluation kernel is listed in Algorithm 1: it loops over the quadrature points z_dir, uses a precomputed derivative matrix D_z, and takes the flags deriv and deriv2 to optionally return the first and second derivatives.)

(a) Precompute the barycentric weights w for the quadrature points. The storage requirement for the weights is of size k + 1; they depend only on the fixed points z_j and are computed once per quadrature rule.
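The O(k) evaluation of p(η) and its first derivative can be sketched in a few lines of Python using the second barycentric form (a simplified stand-in for Algorithm 1, not the actual implementation; it does not handle the case η ∈ Q, and all names are hypothetical):

```python
import numpy as np

def bary_weights(z):
    """Precompute w_j = 1 / prod_{i != j} (z_j - z_i); O(k^2) once per rule."""
    diff = z[:, None] - z[None, :]
    np.fill_diagonal(diff, 1.0)
    return 1.0 / diff.prod(axis=1)

def bary_eval_deriv(eta, z, w, p):
    """O(k) evaluation of p(eta) and p'(eta) via the second barycentric
    form.  With s_j = w_j / (eta - z_j):
        p(eta)  = sum_j s_j p_j / sum_j s_j
        p'(eta) = sum_j s_j (p(eta) - p_j) / (eta - z_j) / sum_j s_j
    Assumes eta does not coincide with a quadrature point."""
    d = eta - z
    s = w / d
    val = (s @ p) / s.sum()
    dval = (s @ ((val - p) / d)) / s.sum()
    return val, dval
```

Both the value and the derivative reuse the same length-(k + 1) intermediate vector s, mirroring how the terms of Algorithm 1 share precomputed quantities.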

[Algorithm 3, the coordinate-collapsing routine CollapseCoords for a triangle, maps ξ to the collapsed coordinate η with special handling of the singular vertex ξ2 = 1; its listing is omitted here.]

Algorithm 4 Overview of the barycentric solution and first-derivative evaluation for a triangle.
1: procedure PhysEvaluate(ξ, p)
2:   η ← CollapseCoords(ξ)
3:   p_η, dp/dη1, dp/dη2 ← BaryTensorDeriv(η, p)
4:   set up geometric factor for x derivative: G1 = 2/(1 − η2)
5:   set up geometric factor for y derivative: G2 = G1 · (η1 + 1)/2
6:   dp/dξ1 = dp/dη1 · G1
7:   dp/dξ2 = dp/dη2 + G2 · dp/dη1
8:   out ← p_η, dp/dξ1, dp/dξ2
9: end procedure

(b) Calculate p(η) using step 32 of Algorithm 1, for which we use the precomputed weights w from the previous step and calculate the terms A and F. The storage requirement for each of these terms is of size k + 1, and the number of flops required to calculate them is 3(k + 1) and 2(k + 1), respectively. (If we reuse the term t1 from step 18, calculating F needs only k + 1 flops.) Thus, the total complexity of finding p(η) using the barycentric method is O(k), which is consistent with the evaluation in [1].

(c) Calculate ∂p(η)/∂η_DIR as shown in step 36 of Algorithm 1, which uses the precomputed weights w and the terms A and F from the previous step. The additional terms B and C are evaluated as per steps 23 and 24 of Algorithm 1. The storage requirement for these terms is of size k + 1 each. The calculation of term B requires 4(k + 1) flops (or 3(k + 1) using the precomputed t1). Similarly, calculating term C requires 2(k + 1) flops (or k + 1 if we use the precomputed t2). Therefore, the total complexity of evaluating ∂p(η)/∂η_DIR is O(k).
(d) Applying a similar analysis for ∂²p(η)/∂η²_DIR, we need to evaluate the additional terms D and E, as shown in steps 27 and 28 of Algorithm 1. We need additional storage of size k + 1 for each of these terms. The computational complexity for both D and E is O(k).
Note that the analysis presented above is independent of the direction (DIR). For higher dimensions, we follow the same procedure in each individual direction. For example, in 2D we require the evaluation of p for DIR = 1 and DIR = 2; the algorithm therefore performs roughly twice the number of calculations of the 1D case, and the complexity of the evaluation is still O(k). Similarly, for the first derivative we need the individual evaluations ∂p/∂η1 and ∂p/∂η2, so the computational complexity of the derivative evaluation is O(k). By extension, the computational complexity for the second derivative is also O(k).

Baseline Evaluations
To investigate how this implementation of the barycentric interpolation affects the efficiency of evaluation for physical and derivative values, a number of tests were run across various cases. The barycentric interpolation method has been implemented within the Nektar++ spectral/hp element framework [2,15] in a discontinuous Galerkin (DG) setting and is compared with two variants of the already existing standard Lagrange interpolation method: the first where the interpolation matrix is recalculated every iteration, and the second where the interpolation matrix is stored across iterations, mimicking a scenario involving history points. All test cases below were carried out on a single core of a dual-socket Intel Xeon Gold 5120 system equipped with 256 GB of RAM, with the solver pinned to a specific core to reduce the influence of core and socket reassignment by the kernel mid-process.

Construction of the Baseline Tests
These baseline tests are constructed for the desired elemental shape using the hierarchical modified basis of Karniadakis & Sherwin [10] of order P with tensor products of P + 2 points in each direction. We make use of Gauss-Lobatto-Legendre (GLL) points in non-collapsed directions and Gauss-Radau (GLR) points in collapsed directions to avoid evaluation at singularities. The physical values at these points are provided by the polynomial p(ξ) = ξ1^2 + ξ2^2 − ξ3^2, which also allows for an analytical solution at any ξ for the physical and derivative values in each direction. The physical and derivative values are sampled on the constructed shape on a collocation grid that is again constructed from GLL/GLR points, like the quadrature rule. However, we use a fixed collocation grid size while varying the order P, so that the collocation grid is distinct from the quadrature rule in most cases. To ensure the same number of points is sampled for all shape dimensions, we choose 64 total points, which by symmetry gives 64^1 points in 1D, 8^2 in 2D, and 4^3 in 3D. This creates some special considerations when the collocation grid matches exactly with the quadrature points used within the shape, which we discuss in the relevant sections below. We average the timings, in 1D from 10^6 evaluations and in 2D/3D from 10^5 evaluations, to ensure results are not affected by system noise or other external factors. The tests are performed for a range of polynomial orders.
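A minimal Python sketch of this test construction (hypothetical names; equispaced points stand in for the GLL/GLR nodes used in practice) illustrates the fixed 64-point collocation grid and the analytic test polynomial:

```python
import numpy as np

def collocation_grid(dim, total=64):
    """Fixed-size sampling grid: 64 = 64^1 = 8^2 = 4^3 points, so the
    same number of points is sampled for every shape dimension."""
    n = round(total ** (1.0 / dim))
    pts1d = np.linspace(-1.0, 1.0, n)  # stand-in for GLL/GLR nodes
    grids = np.meshgrid(*([pts1d] * dim), indexing="ij")
    return np.stack([g.ravel() for g in grids], axis=-1)

def p_exact(xi):
    """Test polynomial p(xi) = xi1^2 + xi2^2 - xi3^2; coordinates absent
    in lower dimensions simply drop out of the sum."""
    xi = np.atleast_2d(xi)
    signs = np.array([1.0, 1.0, -1.0])[: xi.shape[1]]
    return (signs * xi**2).sum(axis=1)
```

Because p is polynomial, both it and its directional derivatives can be evaluated analytically at any sample point, giving an exact reference against which the interpolated values can be checked.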

Fig. 2 Segment baseline interpolation times for (a) the solution evaluation only, (b) including first derivatives, and (c) including second derivatives, comparing the recalculated matrix, stored matrix, and barycentric methods.
Figure 2 shows that recalculating the interpolation matrix every cycle is notably the slowest of the three methods, whereas the barycentric interpolation and the stored interpolation matrix method are closer in performance, with the barycentric interpolation being, on average across the orders, 33% slower for the solution evaluation only, 20% slower when including first derivatives, and 18% slower when including second derivatives. However, the barycentric interpolation wins in terms of storage complexity: it requires only O(k) storage for the weights w_j, where k and the points z_j are given and fixed, whereas the stored interpolation matrix has a best-case space complexity of O(k^2). An interesting feature is the minor reduction in interpolation time for the stored matrix method at orders 3, 8, 13, and 18. These basis orders correspond to the number of quadrature points being a multiple of five, which we theorize matches the cache line size of the CPU being used; a match between this cache line size and the quadrature point array sizes in our implementation results in memory optimizations for the interpolation matrix multiplications. This phenomenon should also be present for the recalculated matrix method; however, the effect is not visible on the graph due to the larger time scale. It can be seen that the baseline computational cost increases when moving from interpolating the physical values only (Fig. 2a) to also including the first derivatives (Fig. 2b) and then the second derivatives (Fig. 2c) for all methods.

Extension to Traditional Tensor-Product Expansions
For the traditional tensor-product expansions, we now consider a quadrilateral element in 2D and a hexahedron in 3D. Figure 3 shows the results for the quadrilateral element. Generally, the results are similar to those for the segment. An obvious unique feature is the spike in the recalculated matrix timing result at order 6. The spike corresponds to 8 quadrature points in each direction, which is the same as the number we are sampling on, and therefore the points are collocated. Consequently, the routine in which the interpolation matrix is constructed has to handle this collocation, which results in the increased cost. We can also see the barycentric interpolation method handling this collocation, resulting in a speed-up compared to neighboring orders, evident in Fig. 3a. On average, the barycentric interpolation method is 30% slower across the orders for the solution evaluation only when compared to the stored interpolation matrix method. Figure 3b, which includes the first derivative evaluations, shows that the barycentric interpolation method is on average 15% faster across the orders than the stored matrix variant of the Lagrangian method, indicating the computational cost savings from unifying the derivative call.
The same collocation trend is present in the hexahedral element, shown in Fig. 4, with the spike now at order 2, which corresponds to 4 quadrature points in each direction. The trends are again similar to the 1D and 2D results. On average across the orders, for the solution evaluation only, the barycentric interpolation method is 48% slower than the stored matrix interpolation method. The most notable difference compared to the 2D results is in the first derivative timings (Fig. 4b): disregarding order 2, the barycentric interpolation method is on average 10% slower at orders ≤ 11, while at orders > 11 it is on average 9% faster.

Extension to General Expansions
We now compare the most complicated of the available shape types in two and three dimensions, the triangle and the tetrahedron, which require the use of Duffy transformations. The results shown in Figs. 5 and 6 align closely with their tensor-product expansion counterparts, the quadrilateral and hexahedron, respectively. A unique feature can now be seen in the recalculated matrix interpolation method, which appears to show an odd/even cyclical trend. We believe this is again due to collocated points, this time as a consequence of the collapsing of the element and of the routine used to calculate the interpolation matrix making use of a floor function.

Speed-Up Factor
To further compare the methods, Fig. 7 shows the speed-up factor when going from the recalculated matrix variant of the Lagrangian interpolation method to the barycentric interpolation method for segments, quadrilaterals, and hexahedra. In 1D the speed-up increases as the order increases, in 2D it stays approximately the same, and in 3D it decreases. Including the first derivatives (Fig. 7b) causes the speed-up factor in 1D to reduce when compared with the evaluation-only version (Fig. 7a). In 2D, the speed-up factor remains approximately consistent between the two versions, and in 3D it increases. A minimum speed-up factor of approximately 7 is observed across all tests, occurring for hexahedra above order 17 when calculating the solution evaluation only.

Real-World Usage Example
To investigate the performance of the barycentric interpolation method in a less artificial setting, we now consider a real-world problem containing a nonconformal interface, again posed in a DG setting within the Nektar++ spectral/hp element framework. In general, we can imagine two scenarios: the first in which the nonconformal interface is fixed in time, and the second where the interface changes at each timestep to account for, e.g., a grid rotation or translation. We investigate both settings in this section.

Handling the Nonconformal Interface
To handle the transfer of information across a nonconformal interface, we adopt a point-to-point interpolation approach as outlined in [12], which involves minimizing an objective function to find an arbitrary point on a curved element edge, utilizing the inverse of a parametric mapping to the reference element. In our implementation, this minimization problem is solved via a gradient-descent method with a quasi-Newton search direction and backtracking line search, which makes repeated calls inside the loop to determine the physical, first- and second-derivative values. For a stationary interface, the location of this arbitrary point in the reference element can be cached once calculated; however, to mimic a moving interface, where the minimization routine must be run every timestep, we disable this caching in order to also evaluate the performance impact of the new barycentric interpolation method. This is the equivalent of the comparison to the first Lagrange interpolation method discussed above, where the interpolation matrix is recalculated every iteration.
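The structure of such a minimization can be sketched generically in Python; this is a plain steepest-descent with Armijo backtracking rather than the quasi-Newton variant used in the implementation, and all names are hypothetical. In the actual setting, each objective or gradient call would correspond to a barycentric evaluation of the elemental mapping, which is why the cost of those evaluations dominates the routine:

```python
import numpy as np

def backtracking_minimize(f, grad, xi0, alpha0=1.0, rho=0.5, c=1e-4,
                          tol=1e-10, max_iter=100):
    """Gradient descent with Armijo backtracking line search.
    f: objective (e.g. squared distance between the mapped point and the
    target physical point); grad: its gradient with respect to the
    reference coordinate xi."""
    xi = np.asarray(xi0, dtype=float)
    for _ in range(max_iter):
        g = grad(xi)
        if np.linalg.norm(g) < tol:
            break                      # converged: gradient ~ 0
        d = -g                         # steepest-descent direction
        alpha, f0 = alpha0, f(xi)
        # shrink the step until the Armijo sufficient-decrease test passes
        while f(xi + alpha * d) > f0 + c * alpha * (g @ d):
            alpha *= rho
        xi = xi + alpha * d
    return xi
```

For a quadratic objective such as the squared-distance function used in the inverse mapping, this converges to the minimizer in a handful of iterations, each requiring a few evaluations of f and grad.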

Test Case
We select a standard linear transport equation u_t + ∇ · F(u) = 0 within a domain Ω = [−1, 1]^2, so that F(u) = vu for a constant velocity v = (1, 0), and an initial condition that is non-polynomial, u(x, 0) = sin(2πx) cos(2πy). The domain consists of a single nonconformal interface with unstructured quadrilateral subdomains on either side, as visualized in Fig. 8 together with the initial condition for u. This means that the interpolation is performed on the trace edges of the elements located at the nonconformal interface, which in this 2D example are segments. A polynomial order of P = 8 is considered, and we select Q = P + 2 = 10 quadrature points in each coordinate direction. We select a timestep size of Δt = 10^−3 and run for 10 cycles (i.e., t = 10), which is the equivalent of 10^4 timesteps.
We initially obtain a baseline time for both interpolation methods with the cache enabled. The cache is then disabled and both methods are run again, allowing us to compare the cached (static) version with the non-cached (moving) version as a demonstration of the high computational cost incurred by calling this minimization routine every timestep. The results for the cached and non-cached versions are shown in Table 2. They demonstrate that with the cache enabled, the timings for both methods are practically identical, because the minimization occurs only in the first timestep, which incurs a negligible cost over this timescale. However, the non-cached results (where the minimization procedure is run every timestep) show a slowdown of around 13× for the Lagrangian method, but only 3× for the barycentric method, when compared with the cached results. We can then calculate the performance impact of these methods on the minimization routine alone, which shows that the routine using the barycentric method is around 6× faster than the equivalent routine using the Lagrangian method. This is a significant speed-up, as the minimization routine accounts for a large proportion of the total computational time: for the Lagrangian method this routine occupies 92% of the total time, whereas using the barycentric approach reduces this to 66%. The speed-up is realized in the total time taken for all 10^4 timesteps being reduced from 216 s to 48 s.

Conclusions
In the context of spectral/hp and high-order finite elements, solution expansion evaluation at arbitrary points in the domain has been a core capability needed for postprocessing operations such as visualization (streamlines/streaklines and isosurfaces) as well as for interfacing methods such as mortaring. The process of evaluating a high-order expansion at an arbitrary point in the domain consists of two parts: determining in which particular element the point lies, and evaluating the expansion within that element. This work focused on efficient solution expansion evaluation at arbitrary points within an element. We expanded barycentric interpolation techniques developed on an interval to 2D (triangles and quadrilaterals) and 3D (tetrahedra, prisms, pyramids, and hexahedra) spectral/hp element methods. We provided efficient algorithms for their implementation, and demonstrated their effectiveness using the spectral/hp element library Nektar++. The barycentric method shows a minimum speed-up factor of 7 when compared with the non-cached interpolation matrix version of the Lagrangian method across all tests, demonstrating a good performance uplift, and culminating in an approximately 6× computational time speed-up for the real-world example of advection across a nonconformal interface. In the artificial tests, the barycentric method exhibits slightly worse performance than the stored interpolation matrix version of the Lagrangian method when evaluating purely physical values, with slowdowns of between 10% and 50% across all orders depending on element type. However, if first derivatives are also required, the barycentric method can outperform the stored interpolation matrix method by up to 35%.
Funding The first and fifth authors acknowledge support from the EPSRC Platform Grant PRISM under grant EP/R029423/1 and the ELEMENT project under grant EP/V001345/1. The second and fourth authors acknowledge support from ARO W911NF-15-1-0222 (Program Manager Dr. Mike Coyle). The third author acknowledges support from NSF DMS-1848508.

Conflict of interest The authors declare no conflicts of interest.
Code Availability All code is available in the Nektar++ repository at https://gitlab.nektar.info.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.