FEMPAR: An Object-Oriented Parallel Finite Element Framework

FEMPAR is an open source object oriented Fortran200X scientific software library for the high-performance scalable simulation of complex multiphysics problems governed by partial differential equations at large scales, by exploiting state-of-the-art supercomputing resources. It is a highly modularized, flexible, and extensible library, that provides a set of modules that can be combined to carry out the different steps of the simulation pipeline. FEMPAR includes a rich set of algorithms for the discretization step, namely (arbitrary-order) grad, div, and curl-conforming finite element methods, discontinuous Galerkin methods, B-splines, and unfitted finite element techniques on cut cells, combined with h-adaptivity. The linear solver module relies on state-of-the-art bulk-asynchronous implementations of multilevel domain decomposition solvers for the different discretization alternatives and block-preconditioning techniques for multiphysics problems. FEMPAR is a framework that provides users with out-of-the-box state-of-the-art discretization techniques and highly scalable solvers for the simulation of complex applications, hiding the dramatic complexity of the underlying algorithms. But it is also a framework for researchers that want to experience with new algorithms and solvers, by providing a highly extensible framework. In this work, the first one in a series of articles about FEMPAR, we provide a detailed introduction to the software abstractions used in the discretization module and the related geometrical module. We also provide some ingredients about the assembly of linear systems arising from finite element discretizations, but the software design of complex scalable multilevel solvers is postponed to a subsequent work.


Introduction
Even though the origins of the FE method trace back to the 50s, the field has drastically evolved during the last six decades, leading to increasingly complex algorithms to improve accuracy, stability, and performance. The use of the p-version of the FE method and its exponential convergence makes high-order approximations an excellent option in many applications [1]. Adaptive mesh refinement driven by a posteriori error estimates, i.e., h-adaptivity, is an essential ingredient to reduce computational cost in an automatic way [2]. For smooth solutions, p-adaptivity or hybrid hp-adaptivity can further reduce computational cost for a target level of accuracy [3]. Originally, FE methods were restricted to nodal Lagrangian bases for structural problems. The extension of FE methods to other applications, like porous media flow or electromagnetism, motivated the design of more complex bases and require different mappings from the reference to the physical space, complicating the implementation of these techniques in standard FE codes. Saddle-point problems also require particular mixed FE discretizations for stability purposes [4,5]. More recently, novel FE formulations have been proposed within the frame of exterior calculus, e.g., for mixed linear elasticity problems [6]. Physics-compatible discretization are also gaining attention, e.g., in the field of incompressible fluid mechanics. Divergence-free mixed FEs satisfy mass conservation up to machine precision, but their implementation is certainly challenging [7]. During the last decade, a huge part of the computational mechanics community has embraced isogeometric analysis techniques [8], in which the discretization spaces are defined in terms of NURBS (or simply splines), leading to smoother global spaces. In the opposite direction, discontinuous galerkin (DG) methods have also been actively developed, and novel approaches, like hybridizable DG and Petrov-Galerkin DG methods, have been proposed [9,10]. As the discretization methods become more and more complex, the efficient implementation of these techniques is more complicated. It also poses a challenge in the design of scientific software libraries, which should be extensible and provide a framework for the (easy) implementation of novel techniques, to be resilient to new algorithmic trends.
The hardware in which scientific codes run evolves even faster. During 40 years, core performance has been steadily increasing, as predicted by Moore's law. In some years, supercomputers will reach 1 exaflop/s, a dramatic improvement in computational power that will not only affect the extreme scale machines but radically transform the whole range of platforms, from desktops to high performance computing (HPC) clouds. The ability to efficiently exploit the forthcoming 100x boost of computational performance will have a tremendous impact on scientific discoveries/economic benefits based on computational science, reaching almost every field of research. However, all the foreseen exascale growth in computational power will be delivered by increasing hardware parallelism (in distinct forms), and the efficient exploitation of these resources will not be a simple task. HPC architectures will combine general-purpose fat cores, finegrain many-cores accelerators (GPUs, DSPs, FPGAs, Intel MIC, etc.), and multiple-level disruptive-technology memories, with high non-uniformity as common denominator [11]. This (inevitable) trend challenges algorithm/software design. Traditional bulk-synchronous message passing interface (MPI) approaches are likely to face significant performance obstacles. Significant progress is already being made by MPI?X [12] (with X=OpenMP, CUDA, OpenCL, OmpSs, Kokkos, etc.) hybrid execution models. Going a step further, asynchronous many-task execution models (e.g., Charm?? [13], Legion [14], or HPX [15]) and their supporting run-time systems hold great promise [16].
Traditionally, researchers in the field of scientific computing used to develop codes with a very reduced number of developers, e.g., a university department, and a limited life span. The software engineering behind scientific codes was poor. Codes were rigid and non-extensible, and developed for a target application and a specific numerical method. However, the increasing levels of complexity both in terms of algorithms and hardware make the development of scientific software that can efficiently run state-of-the-art numerical algorithms on HPC resources a real challenge. Considering to start from scratch a project of this kind has an ever increasing level of complexity. Furthermore, due to the huge resources required to carry out such a project, it is natural to develop a framework that will be resilient to new algorithmic and hardware trends, in order to maximize life time, and to be applicable to a broad range of applications. In this sense, object-oriented (OO) programming, which provides modularity of codes and data-hiding, is the key for the software design of flexible and scalable (in terms of developers) projects.
There is a number of open source OO FE libraries available through the Internet, e.g., deal.II [17,18], FEniCS [19], GRINS [20], Nektar?? [21], MOOSE [22], MFEM [23], FreeFem?? [24], and DUNE [25]. In general, these libraries aim to provide all the machinery required to simulate complex problems governed by partial differential equations (PDE) using FE techniques. In any case, every library has its main goal and distinctive features. Some libraries, like FreeFem?? or FEniCS, have extremely simple user interfaces. FEniCS has its own domain specific language for weak forms to automatically generate the corresponding FE code (preventing p-adaptivity) and includes a collection of Python wrappers to provide userfriendly access to the services of the library. Other sophisticated libraries like deal.II or DUNE have a slightly more demanding learning curve. In general, parallel adaptivity is at most partially supported; as far as we know, none of the libraries above have support for parallel hpadaptivity, unless DG methods are being used. Some libraries are restricted to a particular cell topology, e.g., deal.II is limited to hexahedral/quadrilateral (n-cubes) meshes, while FEniCS only supports simulations on triangular/tetrahedral (n-simplices) meshes.
In general, these libraries provide modules for some of the different steps in the simulation pipeline, which involves the set-up of the mesh, the construction of the FE space, the integration and assembly of the weak form, the solution of the resulting linear system, and the visualization of the computed solution. The solution of the linear system is clearly segregated from the discretization step in all the scientific software libraries described above (for parallel computations); the linear system is transferred to a generalpurpose sparse linear algebra library, mainly PETSc [26][27][28], Hypre [29], and Trilinos [30,31]. As a result, the coupling between the discretization step and the linear solver step is somehow weak, since they rely on general purpose solvers, which usually involve simple interfaces. The strong point of these general purpose numerical linear algebra libraries is to be problem-independent, but it also limits their performance for specific applications, since they cannot fully exploit the underlying properties of the PDE operator and the numerical discretization. 1 This segregation has a clear impact on the type of methods to be used. This black-box approach to general-purpose linear solvers has favoured the use of algebraic multigrid methods, the de facto linear solver [29]. On the other hand, geometric multigrid methods and domain decomposition (DD) methods, which are very specific to mesh-based PDE solvers, are not common, even though they can be superior to algebraic methods in many cases. A geometric multigrid method that exploits the hp-adaptive structure of the FE space is included in deal.II, but it can only be used in the serial case. In parallel scenarios, DD methods have certainly evolved during the last decade. Modern DD methods do not (necessarily) rely on a static condensation of the internal variables, which requires sparse direct methods for the local subdomain problems. Instead, inexact solvers can be used, e.g., multigrid methods, and linear complexity DD preconditioners can be defined (see [33,34]). The definition of two-level DD methods resembles the one of FE methods, by exchanging the FE and subdomain concepts, and their definition is strongly related to the one of multiscale FEs [35]. Furthermore, multilevel extensions can be naturally defined. In short, state-of-the-art multilevel DD methods can be understood (in their inexact version) as a non-conforming multigrid method. Even though the mathematical theory of the DD methods is very sound, high performance implementations are quite recent (see [36][37][38]). On the other hand, we are not aware of any general purpose FE code that integrates a DD algorithm in the solution workflow. DD methods require sub-assembled matrices to be used, and are not supported by the majority of the existing advanced OO FE libraries. Analogously, the use of block-preconditioning is in general poorly supported, because it involves the discretization of additional operators to define the approximated Schur complement, and the corresponding block-based assembly of matrices.
On the other hand, based on the supercomputing trends, the segregation between time discretization, linearization, space discretization, and linear system solve, will progressively blur. As an example, nonlinear preconditioning and parallel-in-time solvers are two natural ways to attain the higher levels of concurrency of the forthcoming exascale supercomputers [36,39]. These facts will complicate even more the rigid workflow of current advanced FE libraries. In this sense, current efforts in PETSc to provide nonlinear preconditioning interfaces can be found in [40], relying on call-back functions, and the XBraid solver [41] aims to provide time-parallelism in a non-intrusive way.

The FEMPAR Project
In this work, we present FEMPAR, an OO FE framework for the solution of PDEs, designed from inception to be highly scalable on supercomputers and to easily handle complex multiphysics problems. The first public release of FEMPAR has almost 300K lines of code written in (mostly) OO Fortran and makes intensive use of the features defined in the 2003 and 2008 standards of the language. The source code that is complementary to this work corresponds to the first public release of FEMPAR, i.e., version 1.0.0. It is available at a git repository [42]. In particular, the first public release was assigned the git tag FEMPAR-1.0.0, in accordance with the ''Semantic Versioning'' system. 2 FEMPAR is very rich in terms of FE technology. In particular, it includes not only Lagrangian FEs, but also curl-and div-conforming ones, e.g., Nédélec (edge) and Raviart-Thomas FEs. The library supports n-cube and n-simplex meshes, and arbitrary high-order bases for all the FEs included. Continuous and discontinuous spaces can be used, providing all the machinery for the integration of DG facet (i.e., edges in 2D and faces in 3D) terms. Recently, in a beta version of the code, B-splines have also been added, together with the support for cut cell methods (using XFEM-type techniques) and hp-adaptivity, but we will not discuss these developments for the sake of brevity.
Moreover, FEMPAR has been developed with the aim to provide a framework that will allow developers to implement complex techniques that are not well-suited in the traditional segregated workflow commented above. FEM-PAR also provides a highly scalable built-in numerical linear algebra module based on state-of-the-art domain decomposition solvers. FEMPAR can provide partially assembled matrices, required for DD solvers; the multilevel BDDC solver in FEMPAR has scaled up to almost half a million cores and 1.75 million MPI tasks (subdomains) in the JUQUEEN Supercomputer [34,37]. It includes an abstract framework to construct applications and preconditioners based on multilevel nonoverlapping partitions.
Even though every block within the library preserves modularity, the interface between discretization and numerical linear algebra modules within FEMPAR is very rich and focused on PDE-based linear systems. In the path to the exascale, FEMPAR has been designed to permit an asynchronous implementation of multilevel methods, both in terms of multiphysics FEs and multilevel solvers, which have been exploited, e.g., in [37]. It is a unique feature that is not available in other similar libraries. The library also allows the user to define blocks in multiphysics applications, that can be used to easily implement complex block preconditioners [43][44][45]. All these blocks are very customizable, which has already been used to develop scalable DD solvers for electromagnetics problems and block preconditioners for multiphysics problems, e.g., magnetohydrodynamics [44]. These distinctive features of FEMPAR, however, are not discussed in this article but in a forthcoming one. A general discussion of the main ingredients of our implementation of the discretization step using FElike approximations is first necessary, which is the purpose of this work.
This work is more than an overview article with the main features of the library. It is a detailed description of the software abstractions being used within FEMPAR to develop an efficient, modular, and extensible implementation of FE methods and supporting modules in a broad sense. To this end, we enrich the discussion with code snippets that describe data structures, bindings, and examples of use. 3 This document is intended to be used as a guide for new FEMPAR developers that want to get familiarized with its software abstractions. But it can also be a useful tool for developers of FE codes that want to learn how to implement FE methods in an advanced OO framework. In any case, due to the size of the library itself, many details cannot be exposed, to keep a reasonable article length. The article can be read in different ways, since it is not necessary to fully understand all the preceding sections to grasp the main ideas of a section. For instance, the section about the abstract implementation of polytopes in arbitrary dimensions and its related algorithms is quite technical and a reader that is not particularly interested in the internal design of this type and its bindings implementations can skip it. Experienced FE researchers can skip the short section with the basics of FE methods, and only look at this one (if needed) when referred in subsequent sections.
The article is organized as follows. In Sect. 3 we present a concise mathematical description of the FE framework. The main mathematical abstractions are expressed in software by means of a set of derived data types and their associated TBPs, which are described in subsequent sections. In particular, the main software abstractions in FEMPAR and their roles in the solution of the problem are: • The polytope, which describes a set of admissible geometries and permits the automatic, dimensionindependent generation of reference cells and structured domains. The mathematics underlying the polytope are presented in Sect. 3.14, while its software implementation in Sect. 4. • The polynomial abstraction and related data types, which are presented in Sects. 3.4 and 5, respectively. These sections describe how shape functions bases can be generated for arbitrary orders and for n-cube and n-simplex topologies. • The reference FE in Sect. 6, which describes the reference cell and defines a set of basis functions and degrees of freedom on each cell. • The triangulation in Sect. 7, which represents a discrete approximation of the physical domain X. • A set of tools required to perform numerical integration (e.g., quadratures and geometrical maps) produced by the reference FE and described in Sect. 8 for cell integrals and in Sect. 9 for facet integrals. • The FE space described in Sect. 10, built from a triangulation and a set of reference FEs, which represents a global space of FE functions. • The discrete integration, an abstract class to be extended by the user to define an affine FE operator, which describes the numerical integration of the weak form of the problem to be solved, described in Sect. 11.2. • The linear (affine) operator in Sect. 11, whose root is the solution of the problem at hand, constructed using the FE space and a discrete integration.
• An example of a user driver in Sect. 12, in which the different ingredients previously described are used to simulate a problem governed by PDEs, the Stokes system.
A (very simplified) graphical overview of the main software abstractions in FEMPAR and some of their relationships is shown in Fig. 1.

The FE Framework
In this section, we briefly introduce all the mathematical abstractions behind the FE method for the discretization of problems governed by PDEs. For a more detailed exposition of the topics, we refer to [69][70][71][72]. The FEs described below (and many other not covered herein) can be formulated and analyzed using the finite element exterior calculus framework [6], which makes use of exterior algebra and exterior calculus concepts. In this framework, one can define FEs, e.g., div and curl-conforming ones, in arbitrary space dimensions, using the concept of differential k-forms. However, we have decided not to use such presentation of FE methods to simplify the exposition for readers not familiar with these abstractions.

The Boundary Value Problem in Weak Form
We are interested in problems governed by PDEs posed in a physical domain X & R d with boundary C ¼ : oX. In practice d ¼ 2; 3 but we are also interested in d [ 3 for some particular applications (see Sect. 3.14). Let us consider a differential operator A, e.g., the Laplace operator ÀD, and a force term f : X ! R. Let us also consider a partition of C into a Dirichlet boundary C D and a Neumann boundary C N , and the corresponding boundary data u D : C D ! R and g N : C N ! R. The boundary value problem reads as follows: find uðxÞ such that AuðxÞ ¼ f ðxÞ in X; B D uðxÞ ¼ u D ðxÞ on C D ; The operator B D is a trace operator and B N is the flux operator. Other boundary conditions, e.g., Robin (mixed) conditions can also be considered. We assume that the unknown uðxÞ in (1) can be a scalar, vector, or tensor field. (The case of multi-field problems is considered in Sect. 3.11.) For FE analysis, we must consider the weak form of (1). The weak formulation can be stated in an abstract setting as follows. Let us consider an abstract problem determined by a Banach space X (trial space), a reflexive Banach space Y (test space), a continuous bilinear form a : X Â Y ! R, and a continuous linear form ' : Y ! R. The abstract problem is stated as: find u 2 X such that aðu; vÞ ¼ 'ðvÞ; for any v 2 Y: The link between the two formulations is the following. Let DðXÞ be the space of C 1 functions with compact support in X; the dual space DðXÞ 0 is the space of distributions. We have that: aðu; uÞ ¼ : hAu; ui X ; 'ðuÞ ¼ : hg N ; ui C N þ hf ; ui X ; for any u 2 DðXÞ; where the derivatives are understood in distributional sense. e.g., For the Laplace operator, the bilinear form reads aðu; vÞ ¼ : R X ru Á rvdX. Furthermore, homogeneous Dirichlet boundary conditions, i.e., u ¼ 0 on C D , are usually enforced in a strong way; the functions in Y satisfy these boundary conditions. The extension to non-homogeneous boundary conditions is straightforward. One can define an arbitrary extension Eu D of the Dirichlet data, i.e., Eu D ¼ u D on C D . Next, we define the function u 0 ¼ : u À Eu D with zero trace on C D and solve (2) for u 0 with the right-hand side 'ðvÞ À aðEu D ; vÞ: ð3Þ  n is the outward normal. Let us assume that j 2 L 1 ðXÞ dÂd , f 2 H À1 ðXÞ, g N 2 H À 1 2 ðC N Þ, and u D 2 H 1 2 ðC D Þ. Let us also consider an extension Eu D 2 H 1 ðXÞ such that Eu D ¼ u D on C D . The weak form of the problem reads as: find u 0 2 H 1 0 ðXÞ such that Z for any v 2 H 1 0 ðXÞ: The solution is u ¼ Example 3.2 (Stokes problem) The Stokes problem consists on finding a velocity field u and a pressure field p such that Àr Á ðlðuÞÞ þ rp ¼ f ; r Á u ¼ 0; and (for example) u ¼ u D on C, where ðuÞ ¼ 1 2 ðru þ ru T Þ is the strain tensor. The weak form of the problem consists of finding ðu 0 ; pÞ 2 X ¼ : for any ðv; qÞ 2 X, where Eu D 2 H 1 0 ðXÞ Â Ã d is an extension of the Dirichlet data, i.e.,

Space Discretization with FEs
Problem (2) is an infinite-dimensional problem. In order to end up with a computable one, we must introduce finitedimensional subspaces with some approximability properties. We restrict ourselves to FE schemes in a broad sense that involve conforming and non-conforming spaces. Thus, our aim is to explicitly build spaces X h (and Y h ) with some approximability properties. If the discrete spaces are subspaces of the original ones (conforming), i.e., X h & X and Y h & Y, the discrete problem reads as: find u h 2 X h such that This is the Petrov-Galerkin problem. In the particular case when X h ¼ Y h , we have a Galerkin problem. The previous problem can be ill-posed for some choices of the FE spaces, e.g., using discrete spaces that do not satisfy the inf-sup condition for indefinite problems [5]. In some cases, judiciously chosen perturbations of aðÁ; ÁÞ and 'ðÁÞ, represented with a h ðÁ; ÁÞ and ' h ðÁÞ respectively, can stabilize the problem and make it stable and optimally convergent, circumventing the inf-sup condition restriction. In the most general case, we can describe any FE space as: find u h 2 X h such that replacing the continuous bilinear form by a general discrete bilinear form. One can also define the affine operator and state (4) as: find u h 2 X h such that F h ðu h Þ ¼ 0. This statement is the one being used for the practical implementation of FE operators in FEMPAR (see Sect. 11).
In order to define FE spaces, we require a triangulation T h of the domain X into a set fKg of cells. This triangulation is assumed to be conforming, i.e., for two neighbour cells K þ ; K À 2 T h , its intersection K þ \ K À is a whole kface (k\d) of both cells (note that k-face refers to a geometrical entity, e.g. cells, faces, edges and vertices for d ¼ 3, see Sect. 3.14). In practice, the cells must be expressed as a particular type of mapping over a set of admissible geometries (polytopes, see Sect. 3.14). Thus, for every element K 2 T h , we assume that there is a reference cellK K and a diffeomorphism U K :K ! K. In what follows, we usually use the notationx ¼ : U À1 K ðxÞ. The definition of the functional space also relies on a reference functional space as follows: (1) we define a functional space in the reference cellK; (2) we define a set of functions in the physical cell K via a function mapping; (3) we define the global space as the assemble of cell-based spaces plus continuity constraints between cells. In order to present this process, we introduce the concept of reference FE, FE, and FE space, respectively.

The FE Concept in the Reference and Physical Spaces
Using the abstract definition of Ciarlet, a FE is represented by the triplet fK; V; Rg, where K is a compact, connected, Lipschitz subset of R d , V is a vector space of functions, and R is a set of linear functionals that form a basis for the dual space V 0 . The elements of R are the so-called DOFs of the FE. We denote the number of moments as n R . The moments can be written as r a for a 2 N R ¼ : f1; . . .; n R g.
We can also define the basis f/ a g a2N R for V such that r a ð/ b Þ ¼ d ab for a; b 2 N R . These functions are the socalled shape functions of the FE, and there is a one-to-one mapping between shape functions and DOFs. Given a function v, we define the local interpolator for the FE at hand as It is easy to check that the interpolation operator is in fact a projection.
In the reference space, we build reference FEs ðK;V;RÞ as follows. First, we consider a bounded set of possible cell geometries, denoted byK; see the definition of polytopes in Sect. 3.14. OnK, we build a functional spaceV and a set of DOFsR. We consider some examples of reference FEs in Sects. 3.8, 3.9, and 3.10.
In the physical space, the FE triplet ðK; V; RÞ on a mesh cell K 2 T h relies on: (1) a reference FE ðK;V;RÞ, (2) a geometrical mapping U K such that K ¼ : U K ðKÞ, and (3) a linear bijective function mappingŴ K :V !V. The functional space in the physical space is defined as The set of DOFs in the physical space is defined as R ¼ : fr W À1 K :r 2Rg. Given the set of shape functions f/ a : a 2 NRg in the reference FE, it is easy to check that f/ a K ¼ : W K ð/ a Þ : a 2 NRg are the set of shape functions of the FE in the physical space. The reference FE spaceV is usually a polynomial space. Thus, the first ingredient is to define bases of polynomials; see Sect. 3.4. The analytical expression of the basis of shape functions is not straightforward for complicated definitions of moments; this topic is covered in Sect. 3.5. After that, we will consider how to build global (and conforming) FE spaces in Sect. 3.6, and how to integrate the bilinear forms in the corresponding weak formulation in Sect. 3.7. We finally provide three examples of FEs in Sects. 3.8, 3.9, and 3.10.

Construction of Polynomial Spaces
Local FE spaces are usually polynomial spaces. Given an order k 2 N and a set N k of distinct points (nodes) in R (we will indistinctly represent nodes by their index i or position x i ), we define the corresponding set of Lagrangian polynomials f' k 0 ; . . .; ' k k g as: We can also define the Lagrangian basis L k ¼ f' k i : 0 i kg. This set of polynomials are a basis for k-th order polynomials. We note that ' k m ðx l Þ ¼ d ml , for 0 m; l k.
For multi-dimensional spaces, we can define the set of nodes as the Cartesian product of 1D nodes. Given a dtuple order k, we define the corresponding set of nodes for n-cubes as: define the multi-dimensional Lagrange basis This Cartesian product construction leads to a basis for the local FE spaces usually used on n-cubes, i.e., the space of polynomials that are of degree less or equal to k with respect to each variable x 1 ; . . .; x d . We can define monomials by a d-tuple a as p a ðxÞ ¼ : P d i¼1 x a i i , and the polynomial space of order k as Q k ¼ spanfp a ðxÞ : 0 a i k i ; The definition of polynomial spaces on n-simplices is slightly different. It requires the definition of the space of polynomials of degree equal or less than k in the variables x 1 ; . . .; x d . It does not involve a full Cartesian product of 1D Lagrange polynomials (or monomials) but a truncated space, i.e., the corresponding polynomial space of order k is P k ¼ spanfp a ðxÞ : jaj kg, with jaj ¼ : P d i¼1 a i . Analogously as for n-cubes, a basis for the dual space of P k are the values at the set of nodesÑ k ¼ : fs 2 N k1 : jsj kg. It generates the typical grad-conforming FEs on n-simplices.

Construction of the Shape Functions Basis
The analytical expression of shape functions can become very complicated for high order FEs and non-trivial definitions of DOFs, e.g., for electromagnetic applications. Furthermore, to have a code that provides a basis for an arbitrary high order, an automatic generator of shape functions must be implemented. When the explicit construction of the shape functions is not obvious, we proceed as follows. Let us consider a FE defined by fK; V; Rg. 4 First, we generate a pre-basis fw b g b2R that spans the local FE space V, e.g., a Lagrangian polynomial basis (see Sect. 3.4). On the other hand, given the set of local DOFs, we proceed as follows. The shape functions can be written as where w b are the elements of the prebasis. By definition, the shape functions must satisfy r a ð/ b Þ ¼ d ab for a; b 2 N R . As a result, let us define C ab ¼ : r a ðw b Þ. We have (using Einstein's notation): 4 In this section, we do not make difference between reference and physical spaces, e.g., using theÁ symbol. In any case, all the following developments are usually performed at the reference FE level.
or in compact form, CU T ¼ I, and thus U T ¼ C À1 . As a result, U ab ¼ C À1 ba . The shape functions are computed as a linear combination of the pre-basis functions.

Global FE Space and Conformity
Finally, we must define the global FE space. Conforming FE spaces are defined as: X h ¼ : fv 2 X : vj K 2 Vg: The main complication in this definition is to enforce the conformity of the FE space, i.e., X h & X. In fact, the conformity constraint is the one that motivates the choice ofR and W, and as a consequence, R. In practice, the conformity constraint must be re-stated as a continuity constraint over FE DOFs. In general, these constraints are implicitly enforced via a global DOF numbering, even though it is not possible in general for adaptive schemes with non-conforming meshes and/or variable order cells, which require more involved constraints. Let us define by Cartesian product of local DOFs for all cells. We define the global DOFs as the quotient space of M h by an equivalence relation $ . Using standard notation, given $ , the equivalence class of a 2 M h with respect to $ is represented with ½a ¼ : fb 2 M h : a $ bg, and the correspond- The set N h is the set of global DOF and ½Á represents the local-to-global DOF map. We assume that the equivalence relation is such that if two elements ðb; KÞ; ðb 0 ; K 0 Þ 2 M h are such that ðb; KÞ $ ðb 0 ; K 0 Þ, then K 6 ¼ K 0 . 5 Using the one-to-one mapping between moments and shape functions, the same operator allows one to define global shape functions / a ¼ P ðb;KÞ $ a / b K . We assume that the choices above are such that they satisfy the conformity constraint, i.e., Let us consider an infinite-dimensional spaceX such that (1) X h &X & X and (2) for every function v 2X and global DOF a 2 N h , all the local DOFs b; b 0 2 ½a are such that r b ðvÞ ¼ r b 0 ðvÞ, i.e., local DOF related to the same global DOF are continuous among cells. The global interpolator is defined as: It is easy to check that it is in fact a projector. In any case, we use projection operator to refer to other projectors that involve the solution of a global FE system, e.g., based on the minimization of the L 2 or H 1 norm. Below, we provide details about how to choose the local DOFsR, the function map W, and the equivalence relation $ such that the conformity property is satisfied for grad, div, and curl-conforming FE spaces. The case of nonconforming methods, e.g., DG methods, can readily be considered. In this case, the conformity constraint is not required, which leads to much more flexibility in the definition of DOFs. On the other side, these schemes require numerical perturbations of the continuous bilinear and linear forms in (4) that involve integrals over the facets of FEs to weakly enforce the conformity. (Facets are ðd À 1Þfaces, e.g., faces in 3D and edges in 2D).
Once we have defined a basis for the FE spaces X h and Y h using the FE machinery presented above, every FE function u h can be uniquely represented by a vector u 2 In fact, problem (4) can be restated as: find u 2 R jN h j such that for any a 2 N h : We have ended up with a finite-dimensional linear problem, i.e., a linear system. We note that in general, the trial space moments can be different from the ones of the test space, as soon as the cardinality is the same. In matrix form, the problem can be stated as: Assuming that the bilinear form can be split into cell contributions as a h ðÁ; ÁÞ ¼ P K2T h a K ðÁ; ÁÞ, e.g., by replacing R X by P K2T h R K , the construction of the matrix is implemented through a cell-wise assembly process, as follows: The FE affine operator (5) can be represented as , it can be represented with a matrix and a vector of size jN h j.

Numerical Integration
In general, the local bilinear form can be stated as: where the evaluation of F ðxÞ involves the evaluation of shape function derivatives. Let us represent the Jacobian of the geometrical mapping with J K ¼ : oU K ox . We can rewrite the cell integration in the reference cell, and next consider a quadrature rule Q defined by a set of points/weights ðx gp ; w gp Þ, as follows: Here, the main complication is the evaluation of F Uðx gp Þ. By construction, the evaluation of this functional only requires the evaluation of o a / b K Uðx gp Þ for some values of the multi-index a (idem for the test functions). Usually, jaj 2 in C 0 FEs, since higher-order derivatives would require higher inter-cell continuity. The second derivatives, which only have sense for broken cellwise integrals, are in fact only needed for some method based on stabilization techniques based on the pointwise evaluation of residuals in the interior of cells [46].
Let us consider the case of zero and first derivatives, i.e., the evaluation of / b K U K ðx gp Þ and r/ b K U K ðx gp Þ. The values of the shape functions (times the geometrical mapping) on the quadrature points is determined as follows: whereas shape function gradients are computed as: where we have used some elementary differentiation rules and the inverse function theorem in the last equality; rx represents the gradient in the reference space. Thus, one only needs to provide the values of the Jacobian, its inverse, and its determinant, from one side, and the value of the shape functions Wð/ b Þ and their gradients rxWð/ b Þ in the reference space, on the other side, at all quadrature points, to compute all the entries of the FE matrices; second order derivatives can be treated analogously. Quadrature rules forK being an n-cube can readily be obtained as a tensor product of a 1D quadrature rule, e.g., the Gauss-Legendre quadrature. Symmetric quadrature rules on triangles and tetrahedra for different orders can be found, e.g., in [69]. In any case, to create arbitrarily large quadrature rules for n-simplices, one can consider the socalled Duffy transformation [73,74].
As it is well known, considering n-cube topologies for K, Gauss quadratures with n points per direction can integrate exactly 2n À 1 order polynomials. e.g., For a Lagrangian reference FE of order p and an affine geometrical map, we choose n ¼ p þ ceilingð1=2Þ ¼ p þ 1 per direction to integrate exactly a mass matrix. For n-simplex meshes, we use either symmetric quadratures (if available) or tensor product rules plus the Duffy transformation [73,74]. The latter case is based on introducing a change of variables that transform our n-simplex integration domain into an n-cube, and integrate on the n-cube using tensor product quadratures. It is worth noting that this change of variables introduces a non-constant Jacobian. The determinant of the Jacobian is of order at most d À 1 with respect to each variable. To integrate a mass matrix exactly, we must be able to integrate exactly polynomials of order 2p þ d À 1. Therefore, we need to take n ¼ p þ ceilingðd=2Þ to exactly integrate mass matrices. For n-simplices, we consider the reference FE space P k spanned by the pre-basis fp a ðxÞ : 0 a i k; i ¼ 1; . . .; dg and the set of nodesÑ k (see Sect. 3.4). The set of local DOFs N R K are the nodal values. Since the pre-basis elements are not shape functions, we proceed as in Sect. 3.5 to generate the expression of the shape functions basis for arbitrary order reference FEs on n-simplices. The global FE space is determined by the following equivalence relation. The set of local DOFs for n-cubes is M h ¼ : fðs; KÞ : s 2 N k1 ; K 2 T h g due to the one-to-one mapping between DOFs and nodes; we replace the set of nodes byÑ k for n-simplices. Furthermore, we say that ðs; KÞ $ ðs 0 ; K 0 Þ iff x s ¼ x s 0 . The implementation of this equivalence relation, and thus, the global numbering, relies on the ownership relation between n-faces and DOFs (e.g., in 3D we can say whether a DOF belongs to a vertex, edge, or face) and a permutation between the local node numbering in K þ to the one in K À for nodes on F. See Sect. 3.14 for more details. With such global DOF definition, it is easy to check that the global FE space functions are C 0 and thus grad-conforming.

Grad
Since Lagrangian moments involve point-wise evaluations of functions and H 1 0 Analogously, shape functions are computed as / a ¼ : P ði;s;KÞ $ a ' k1 s e i ; e i represents the i-th canonical basis vector of R d . We proceed analogously for n-simplices.
The verification that two nodes are in the same position is not straightforward. First, for every node s in K, we can assign an n-face owner F (e.g., a vertex, edge, face, or cell); cell DOFs are not replicated. Given a node s 2 N k1 of cell K that belongs to the n-face F, it can be determined by an index s F with respect to F and K. Analogously, another node that belongs to the same n-face but cell K 0 , is represented by s 0 F . On the other hand, one can define a permutation mapping that, given the local index of a node within the n-face F with respect to K, it provides the index in the n-face F with respect to K 0 (see Sects. 3.13 and 3.16 for more details).

Div-Conforming FEs
We present the so-called Raviart-Thomas FEs for vector fields [5]; the implementation of Brezzi-Douglas-Marini FEs is analogous. In this case, the order being used is different at every space dimension. Let us start with Raviart-Thomas FEs on n-cubes. In 2D, the space reads as V k ¼ : Q ðkþ1;kÞ Â Q ðk;kþ1Þ , whereas in 3D it reads as To check whether x s ¼ x s 0 holds, we can proceed similarly as for Lagrangian elements. The shape functions are built as in Sect. 3.5. We consider a Lagrangian pre-basis for V, and compute the shape functions via a change-of-basis. The function mapping reads as follows: the mappingŴ K U À1 K is the so-called contravariant Piola transformation. One can check that the definition of this mapping together with the assembly defined above leads to a global FE space that is div-conforming; i.e., its functions have continuous normal component across inter-cell facets. Thus, X h & Hðdiv; XÞ [5].
On n-simplices, the reference FE space is . . ., and the basis for R is composed of the following boundary and interior DOFs: In this case, the generation of the pre-basis is not a Lagrangian FE space of functions, but it can easily be expressed as the span of vector functions with components in a selected subset of P kþ1 .

Curl-Conforming FEs
The weak formulation of electromagnetic problems involve the functional space Hðcurl; XÞ. Conforming FE spaces for Hðcurl; XÞ must preserve the continuity of the tangential component of the field. The so-called edge elements (or Nédélec elements) are curl-conforming FEs [72]. As Raviart-Thomas elements, the edge elements pre-basis on n-cubes involves different orders per dimension and per component. In  respectively, where the edge map UÊ is defined as the one for the face. The boundary DOFs can be indexed by a triplet ðF; s; KÞ, where F can be an edge or a face in 3D, following the same ideas as for Raviart-Thomas elements.
In this case, the function mapping reads as follows: the mappingŴ K U À1 K is the so-called covariant Piola transformation, which leads to a global FE space that is curl-conforming [72], i.e., its functions have continuous tangential component across inter-cell facets.
On n-simplices, the space reads as: The basis for R in 3D is composed of the following boundary and interior DOFs: 7 In 2D, only the first two types of DOFs are required, where the first one is now related to facets (edges in 2D) and the second one are interior DOFs owned by the cell. As for Raviart-Thomas elements, the pre-basis functions are not Lagrangian shape functions, but they can again be expressed as the span of vector functions with components in a selected subset of P kþ1 . We refer to [75] for a discussion about the actual generation of a pre-basis for the space (19) in FEMPAR.

Cartesian Product of FEs for Multi-field Problems
Many problems governed by PDEs involve more than one field, e.g., the Navier-Stokes equations or any multi-physics problem. Let us consider a PDE that involves a set of unknown fields ðu 1 ; . . .; u n Þ 2 X 1 Â . . . Â X n , defined as the Cartesian product of functional spaces. We can proceed as above, and define a FE space for every field space separately, leading to a global FE space X 1 h Â . . . Â X n h defined by composition of FE spaces. To define the global numbering of DOFs in the multi-field case, we consider that two DOFs are equivalent if they are related to the same field and satisfy the equivalence relation of the FE space of this field. The Cartesian product of FE spaces is enough to define volume-coupling multi-physics problems governed on the same physical domain, i.e., the different physics are defined on the whole domain and coupled through volume terms in the formulation. However, many multi-physics problems are interface-based, i.e., the coupling between different physics that are defined on different subdomains is through transmission conditions on the interface. This is the case, e.g., of fluid-structure problems (see, e.g., [76][77][78][79]). In these cases, different FE spaces could be defined on different parts of the global mesh, i.e., one must describe the set of subdomains ðX 1 ; . . .; X n Þ of the whole domain X in which the corresponding FE spaces are defined.

Non-conforming Methods
Up to now, we have considered a global FE space that is conforming, i.e., X h & X. Alternatively, one can consider FE schemes that are not conforming. Since the original bilinear form has no sense in general for a non-conforming FE space X h , one shall consider a stabilized bilinear form a h that is well-posed (stable and continuous) in the discrete setting. In general, these schemes replace the required inter-cell continuity for conformity by a weak imposition of such continuity. Thus, the inter-cell continuity is imposed weakly through penalty-like terms. DG methods are schemes of this type [71].
In one sense, non-conforming FE spaces are simpler than conforming ones, since the conformity is not required; one has more flexibility in the definition of local DOFs and the equivalence class concept is not needed, since a DOF never belongs to more than one element. However, the bilinear form usually requires the integration of facet terms, i.e., terms of the type: X The integration of facet terms is far more complicated than cell terms.
Let us first briefly illustrate a simple application of nonconforming methods, namely the FE discretization of the Poisson problem using the so-called interior penalty (IP) 7 We note that we can take J T F v instead of J T F ðv Â nÞ in the definition of the face moments, since the rows of the Jacobian matrix are the transformation of the axes in the reference faceF 0 to the actual faceF of the reference cell and the space of test functions is invariant to rotations. family of DG formulations [71]. Dirichlet boundary conditions constraints, say uðxÞ ¼ u D ðxÞ on the whole boundary C of the domain X, are to be weakly imposed, as it is natural in such kind of formulations. The global discrete trial space X h is composed of functions that are continuous within each cell, but discontinuous across cells, i.e., as the set of interior and boundary facets of T h , respectively, the discrete weak form underlying this family of methods reads as: find u h 2 X h such that X where s is a fixed constant that characterizes the particular method at hand, c is a facet-wise positive constant referred to as penalty parameter, and jFj denotes the surface of the facet; s and c should be suitably chosen such that the bilinear form a h ðu h ; v h Þ on the left-hand side of (20) is well-posed (stable and continuous) in the discrete setting, and the resulting FE formulation enjoys optimal rates of convergence [71]. Finally, if we denote as K þ and K À the two cells that share a given facet, then f fw h g g and ½½w h denote mean values and jumps of w h across cells facets: with n þ , n À being the facet outward unit normals, and w þ h , w À h the restrictions of w h to the facet, both from either the perspective of K þ and K À , respectively.
The computation and assembly of DOFs related to interior nodes is straightforward. With regard to the facet terms, assuming that we are sitting on an interior facet F 2 F X h , four facet-wise matrices, namely A F K À K þ , and A F K À K À , are computed. (The case of boundary facets F 2 F C h is just a degenerated case of the one corresponding to interior facets where only a single facet-wise matrix A F K þ K þ has to be computed; we omit this sort of facets from the discussion in order to keep the presentation short.) These hold all partial contributions of the facet to the corresponding global entries of the coefficient matrix. The entries of, e.g., A F K þ K À , are defined (for our particular problem at hand) as: with indices a and b ranging from 1 to the number of shape functions N R of K þ and K À , respectively.

Facet Integration
As mentioned in Sect. 3.7 for the case of cell integrals, facet integrals involved in the computation of the facetwise matrix (22) cannot be in general computed analytically. These are instead computed using quadrature rules. In general, the bilinear form that contains the facet terms can be stated as We can consider a reference facetF, and a mapping U F : F ! F from the reference to the physical space. Let us represent the Jacobian of the geometrical mapping with J F ¼ : oU F ox , which has values in R ðdÀ1ÞÂd . We can rewrite the facet integral in the reference facet, and next consider a quadrature rule Q onF defined by a set of points/weights ðx gp ; w gp Þ, as follows: jJ F j is defined as: for d ¼ 2; 3, respectively. The expression of the shape functions and their gradients in the physical space in terms of the ones in the reference space are computed by using the cell-wise maps. Thus, two mappings U K þ and U K À among the reference cellK and the cells K þ and K À in physical space, respectively, are involved in the numerical evaluation of interior facet integrals. We can also consider the reference facetF and a map U F from this reference facet to F (analogously as U K and K but in one dimension less in the reference space). We can define a quadrature rule ðx gp ; w gp Þ inF. We can also define the reference facetF AE ofK such that U K AE ðF AE Þ ¼ F, and the map UFAE fromF tô F AE . With this map, we can define the quadrature ðx AE gp ¼ : UFAEðx gp Þ; w gp Þ with respect to the reference cellK.
However, the same facet F has (in general) a different orientation depending on the cell used as reference, and so, a different index might be assigned to the same facet quadrature points from the perspective of either cell, i.e., We adopt the convention that facet quadrature points identifiers are in the local numbering space of K þ , and these local identifiers are translated into the local numbering space of K À . This is represented by the permutation PðgpÞ such that The logic underlying this translation is equivalent to the one discussed in Sect. 3.16; see Fig. 2 for an explanatory illustration. As a result, we have Let us consider the evaluation of zero and first order derivatives on facets, i.e., the evaluation of / b K a U K a ðx gp Þ and r/ b K a U K a ðx gp Þ for a 2 fþ; Àg, where the quadrature points belong to a quadrature in the reference facetF. We note that the introduction ofŴ is not needed for nonconforming methods, since there is no continuity to be enforced, and we will consider it to be the identity operator for simplicity. The values of the shape functions (times the geometrical mapping) on the facet quadrature points is evaluated as follows: whereas shape function gradients are computed as: Without loss of generality, let us focus on the first integral in (22). Replacing the mean value and jump operators by their definition in (21), and taking into account that / b K À and r/ a K þ vanish on K þ and K À (by construction of X h and Y h ), respectively, we end up with the following integral to be computed numerically: This integral is first mapped back to the reference facet F & R dÀ1 , and then it is approximated by the following sum over quadrature points: Using these ideas, we can compute all the terms related to facet integrals. Furthermore, outward normals on facets can be computed as: for d ¼ 2; 3, respectively, and a 2 fþ; Àg; o is 0 or 1 and is used to enforce the normal to be outwards. Tangent vector(s) for a given facet can be easily computed out of the normal vector.

Polytopes
One of the motivations of FEMPAR is to develop a framework that can deal with arbitrary space dimensions. It permits to readily implement space-time formulations, which are posed in 4D. Other higher-dimensional applications include systems of PDEs posed in the phase space, e.g., the 7D (including time) Vlasov-Maxwell equations for the simulation of plasmas.
In this section, we provide the mathematical abstraction of cell topologies based on the concept of polytope. This  abstract concept is of practical importance, because it allows us to develop algorithms and codes that can be applied to any topology that fits into the framework. The framework developed herein is very general and includes triangles and quadrilaterals in 2D, and tetrahedra, hexahedra, prysms, and pyramids in 3D. Furthermore, it can also be extended to arbitrary dimensions, to define not only n-cubes and n-simplices but many other topologies. A polytope is mathematically defined as the convex hull of a finite set of points. As a consequence, a polytope is a polyhedron. In the frame of FEMPAR, we consider polytopes that can be expressed as the image of the composition of two operators. The definition of topologies for reference FEs based on this idea can be found in [25].
The main topological information consumed by FE codes is the description of the d-dim polytope boundary as the assemble of ðd À 1Þ-dim polytopes, proceeding recursively till 0-dim objects are obtained (vertices); we use the contraction k-dim object to say object of dimension k. These lower-dimensional entities describing the polytope boundary are denoted herein as n-faces. Usually, the nomenclature used to describe n-faces in FEs is restricted to 3D problems. In FEMPAR and in the following exposition, we use a dimension-independent nomenclature in order to accommodate higher-dimensional problems. We consider the space dimension d 2 N þ and a d-dim polytope. We define the d-face as the polytope itself. The set of ðd À 1Þ-dim polygons that compose the boundary of the polytope are its ðd À 1Þ-faces; ðd À 1Þ-faces are usually denoted as facets. We can proceed recursively, i.e., defining the ðk À 1Þ-faces of the polytope as the set of facets of its k-faces till reaching 0-faces. In 3D, 3-faces are called cells, 2-dim faces are faces, 1-dim faces are edges, and 0-dim faces are vertices. Herein, we use the term n-faces to denote all these objects. In this work, we denote by vefs the set of n-faces of dimension lower than the space dimension, e.g., it only includes vertices, edges, and faces in 3D. Let us introduce some notation. We represent the set of bitmaps of size m with B m . The bitmaps ð1; 1; . . .; 1Þ and ð0; 0; . . .; 0Þ are represented with 1 and 0, respectively. Given a domain h & R d we use the notation ah þ b, a 2 R, b 2 R d to denote the domain fax þ b : x 2 hg. e j represents the j-th canonical basis vector of R d .
Let us define first the directional extrusion h ðj;a;bÞ of h with respect to the direction e j of type ða; bÞ. a determines the topology of the extrusion, namely a prysm-type extrusion (1) or a pyramid-type extrusion (0) (see also [25]). b determines whether we want to perform the aextrusion (1) or do-nothing (0). Based on this, we have the following definition. The directional extrusion can be used recursively to define polytopes and their n-faces. An n-face is determined by a topology t 2 B d , an extrusion e 2 B d , and an anchor vertex v 2 R d , using a recursive procedure as follows. The use of directional extrusions to get different polytopes and n-faces is illustrated in Figs. 3 and 4. One can observe how all the lower dimensional n-faces after directional extrusion lead to one dimension larger n-faces for different values of a.
For our purposes, the anchor vertex v has only 0/1 entries, and thus, it can be represented as an element v of B d . As a result, an n-face can be uniquely represented with ðt; e; vÞ. Based on this definition, we can define a set of ddim polytopes by recursion. d-dim polytopes are given by t, and represented as n-faces with ðt; 1; 0Þ, i.e., using the origin 0 as anchor vertex and performing extrusions in all directions. On the other hand, a vertex v (with only 0/1 coordinates) is an n-face with ðt; 0; vÞ. Some examples of n-face constructions using this procedure can be found in Figs. 3 and 4. Furthermore, in these figures we show all nfaces of the 3-cube and 3-simplex, with all the e and v values. In our implementation of polytopes, we use Hasse diagrams based on the composition of extrusion and anchor vertex bitmaps to label the different n-faces of a polytope.
Given a bitmap t and a bit a, we define the bit operation that modifies the j bit of t to a with t:o j ðaÞ. Given the chain on n-faces (27), let us assume that h iÀ1 is represented by ðt; e 0 ; vÞ. Using the definition of the n-face, every element of the chain has a geometrical representation. We start with the vertex 0, next obtain the line segment fðx; 0; 0Þ : x 2 ½0; 1g, next the square fðx; y; 0Þ : x; y 2 ½0; 1g, and finally the unit cube. The previous definition is not only useful to represent d-dim objects but all its n-faces. See Figs. 3 and 4. For a given n-face h ðt; e; vÞ, we want to define the set S h of all n-faces of h. In order to do so, we introduce the following concepts. we can obtain all its n-faces recursively as follows. Let S 0 ¼ : fvg; we define S ¼ : S d based on the following recursion: All the resulting n-faces can also be written with the ðt; e; vÞ notation commented above. In order to define this chain as in (28) (i.e., only based on the bitmap notation), we note the following. Given the n-face h ðt; e; vÞ, the n-face h þ e j ðt; e; v:o j ð1ÞÞ. With this ingredient, we can implement the generator of all n-faces of an n-face using the bitmap notation.

Node Generation and Indexing
FE spaces are polynomial spaces, e.g., Lagrangian polynomials. (Let us note that div-and curl-conforming FEs also rely on Lagrangian polynomials for the definition of the pre-bases and the definition of the equivalence classes.) In order to express these polynomials, one must define a set of points (nodes). In the following, we define a node generator for a given order on an arbitrary polytope, using lexicographical notation. 8 Definition 3.7 (Set of nodes) Let us consider a polytope h 2 R d represented by ðt; 1; 0Þ. Its set N k of equidistant Lagrangian nodes of order k, in lexicographical notation, are generated recursively as follows: Given a node a 2 N d in lexicographical notation and assuming an equidistant distribution of nodes, its space coordinates x a 2 R d can readily be obtained, x a ¼ : a=k. We note that for n-cubes we recover the typical tensor product definition of nodes and the corresponding truncated subset of nodes for n-simplices. Other node generators can also be considered, especially for very high-order elements (e.g., Fekete points). It is basic in FE analysis to have an ownership relation between n-faces and nodes. In particular, it is basic to enforce continuity between FEs by enforcing continuity of nodal values. In order to generate the nodes of the polytope that belong to an n-face, we use the following construction.
First, we generate the local set of nodes, using the definition above, for the n-face. Given a k-face ðt; e; vÞ in R d , we consider the reference k-dim polytope ðt 0 ; 1; 0Þ, where t 0 is the restriction of t to the components that are extruded, i.e., t 0 ¼ : t m 'g with the mapping m 'g : f1; . . .; kg ! fj 2 f1; . . .; dg : eðjÞ ¼ 1g. Next, we define the local nodes of the n-face as the nodes of the reference polytope. It defines the n-faces nodes and their local coordinates. Finally, we define the linear mapping from the reference k-dim polytope to the k-face. The map can be defined with k þ 1 independent conditions. It can be defined by enforcing that the mapping maps the anchor vertex of the reference polytope to the one of the k-face and the same for the extrusion of the anchor vertex to all directions: Since the mapping is linear, it can be written as: Form the first constraints we get that a 0 ¼ v. For the other constraints, we get: Thus, we get: and thus: We could also obtain the expression for the inverse of the mapping m analogously. We can readily use the mapping for lexicographical coordinates. As a result, given a k-face, we can define its nodes with a local numbering based on the lexicographical label of the reference k-face. The localto-global lexicographical label (where global is the label of the d-dim base polytope) is obtained by applying the mapping (32).

Global DOF Numbering and Conformity
A basic ingredient in FE analysis is the imposition of continuity among FEs in order to build conforming global FE spaces. This process is mathematically defined with equivalence classes on DOFs (see Sect. 3.6). For example, functions in the Lagrangian FE space are related to geometrical nodes, and to impose continuity of a function among FEs is equivalent to impose continuity of nodal values in the same spatial position (see Sect. 3.8). In the following, we provide a mechanism to identify nodes in two different cells that share the same position to implement the required equivalence class. The situation is slightly more involved for div-conforming and curl-conforming FE spaces. In these cases, one can still determine a DOF with a node plus n-face ownership (see Sects. 3.9 and 3.10, respectively). Thus, the equivalence class in these situations can be formulated as in Lagrangian FEs (determine nodes with the same position) at every n-face separately. Following Sect. 3.6, a node within a cell of our triangulation can be represented as (b, K), where b is the local cell-wise index of the node and K is the cell global index. Given an n-face F of the cell, the same node can be represented with ðb 0 ; F; KÞ, where b 0 is an n-face-based local index. For example, node 8 (cell-wise local index) in the cell of Fig. 6 can also be determined as the node 1 (facetwise local index) of the n-face 8 (see Fig. 5). This facetwise local index is determined by the coordinate system being used at the n-face. For example, the nodes of n-face 8 in Fig. 6 are ordered as (8, 12) (i.e., first 8 and then 12). On the other hand, node indices are represented with the coordinates in a lexicographical coordinate system, as presented in (31). For example, node with b ¼ 8 Let us consider an n-face F in our triangulation, two cells K þ (source cell) and K À (target cell) sharing the nface, and nodes ðs 0 þ ; F; K þ Þ and ðs 0À ; F; K À Þ (with n-facewise local indices). The question that must be answered is: are nodes ðs 0 þ ; F; K þ Þ and ðs 0À ; F; K À Þ in the same spatial position? This question can be answered with the map p F in (15) that, given the position of the node in the coordinate system of F in K þ , provides the one in K À .
We note that this mapping is trivial when using structured (possibly locally adapted) n-cube meshes, since the local ordering of nodes in an n-face based on increasing local index leads to the same ordering for all cells containing that n-face; we say that the mesh is properly oriented in this case. However, 2D or 3D unstructured mesh generators might not return properly oriented meshes, and thus the FE code has to deal with the explicit construction and application of permutations. We also note that one can always end up with oriented meshes for n-simplices by simple cell-wise permutations (see, e.g., [72,Sect. 5.5] and [80]). After reading n-simplex meshes, these meshes are always properly oriented in FEMPAR before proceeding to any computation. While this is also true for 2D n-cube meshes, 3D n-cube meshes cannot be properly oriented in general [81].
Let us consider the reference polytopeK associated to K þ and K À . In general, the n-face F has a different n-face local index with respect to the two cells; its corresponding storing the number of entities, the start position in l(:) of the list associated to each entity, and the identifiers associated of all lists gathered in a single array, respectively reference n-face is represented withF þ andF À for K þ and K À , respectively. In general, the map between nodes of these two n-faces can be defined by using (32), which is invertible (since it is linear and full rank). Using this approach, the map can be generated for arbitrary dimension and polytope topology. However, for the particular case of 2D/3D n-cube meshes, we have implemented this procedure in a more computationally efficient manner. In particular, the required permutations (mappings) are expressed in terms of a set of tables, which are stored and set up (filled) by the so-called reference_fe_t abstract data type in FEMPAR. We refer to Sect. 6.3 for detailed implementation details. (Recall that n-simplices meshes do not actually require this procedure as they can always be properly (re)oriented.) Let us consider the case of 3D n-cube meshes. Vertices are trivial because there is only one node and no permutation is needed. For edges and faces, we rely on the three following concepts: • Rotation index: Provides the local index of the anchor vertex of F À with respect to the coordinate system of F þ . When FEs are sharing two edges, we have the following situations. The edge can have the same anchor vertex seen from both elements, or not. For faces, the anchor vertex can be in 4 positions. It is called rotation because it represents a map that keeps invariant the reference faceF À and makes the anchor vertices of the source and target cells coincide. • Orientation index: Given two cells sharing an n-face with the same anchor vertex, the orientation index codes the map from the coordinate system of the n-face with respect to the first cell to the one with respect to the second one. 9 For edges, this map is always the identity, because two cells sharing an edge with the same anchor node provide the same edge-wise node coordinates to its nodes. For faces, the situation is more complex, because it involves 2 different possible situations. The orientation index is equal to 0 for the identity permutation and 1 when we have to swap indices. We denote the base face as the face with the lowest local index (face [011|000] in Fig. 3). Next, we consider two cubes that share a face, restricted to the following scenario: (1) the face is the base face in at least one of the cubes; (2) the face has the same anchor vertex in the two cubes. It is trivial to compute the orientation index in these cases. The orientation index in the more general case of two cubes sharing a face only restricted to (2), i.e., two arbitrary faces with the same anchor vertex, can be obtained by composition as follows. If two faces have the same orientation index with the base face, they have an orientation index equal to 1, and 0 otherwise. • Permutation index: An index obtained by composition of the rotation and orientation indices (i.e., it ranges from 1 and 2, and 1 and 8 for edges and faces, respectively), that codifies the final mapping between coordinates of two cells as the composition of a rotation and a orientation map. We note that the composition of all possible rotations and orientations cover all the possible relative positions of cells for a conforming mesh.

Implementation of polytope_t and Related Data Types
In FEMPAR, the reference FE cell geometry is defined by the polytope_t data type; see Listing 1. The input needed to define the polytope is the space dimension num_dimensions and the topology t in the 32-bit integer topology. Using the ideas in (27), (28), and (29), we create the set of all n-faces of the polytope ðt; e; vÞ in the (private) fill_polytope_chain TBP, which is in turn invoked by the (public) create TBP. All n-faces of the polytope have the same topology, and can be uniquely determined by a 32-bit integer that represents the composition of ðe; vÞ. We note that the ordering of n-faces based on ðe; vÞ mixes n-faces of different dimensions and it is non-consecutive in general. Thus, we consider an ordering based first on the nface dimension, and next by ðe; vÞ. The set of all n-faces generated by the recursion (29) are stored in n_-face_array, an array of size number_n_faces. This array in particular provides the ðe; vÞ associated to each n-face. The inverse mapping (from ðe; vÞ to the actual numbering) is stored in the ijk_to_index array.
It is also possible to iterate over facets of an n-face, based on (30). The create_facet_iterator TBP of polytope_t creates a facet_iterator_t instance for a given n-face. facet_iterator_t is defined in Listing 2. The n-face ðe; vÞ is stored in root, the topology can be extracted from its polytope pointer member variable. The iterator over facets is described by two integers, component and coordinate, using the ideas in (30). The complexity of the traversal over facets is coded in facet_iterator_next and facet_iterator_ has_finished.  With regard to the implementation of nodes within FEMPAR, we provide the node_array_t data type to represent the set of nodes defined in (31); see Listing 3. It is constructed from a polytope and the order. It provides a create TBP, where we perform (31) and fill all the resulting nodes in the node_array array member variable. We number the nodes using a consecutive numbering with increasing lexicographical index. The node array provides the lexicographical label in one integer. The inverse is stored in ijk_to_index. The total number of nodes is stored in num_nodes. Finally, the space coordinates of nodes are stored in coordinates.
We also provide the node_iterator_t object (see Listing 4), which iterates over the nodes of an n-face (stored in n_face) using (31) and (32). It has a pointer to the node_array of the base polytope. Internally, it goes through the nodes of n_face (using (31)) (the current node being stored in displacement), which can be translated to the base polytope node numbering using (32) (stored in coordinate); the coordinate is computed on demand by calling the TBP node_iterator_cur-rent_ijk. The own_boundary logical allows one to iterate over the nodes considering the n-face as an open or closed set. We note that the create TBP of node_array_t relies on node_iterator_t.

The polynomial_t Abstraction
In FEMPAR, the definition of shape functions is not hardcoded, as usually done in most FE codes. Such approach has severe limitations: (1) it is not practical for high order discretizations, and the code cannot be written for an arbitrary order; (2) it involves a huge number of code lines with the analytical expression of shape functions for a given set of available orders (see the discussion in [82]); and (3) it does not allow for dimension-independent code. Instead, we consider a framework based on the concepts in Sect. 3.5, in which one considers a pre-basis, defines the moments, and performs a change of basis. The pre-basis is defined using the product of 1D functions (e.g., the Cartesian product), and the 1D function generator is written in terms of the (arbitrary) order. Our machinery for the generation of 1D functions has been restricted for the moment to polynomial functions in one variable, namely Lagrangian polynomials, monomials, and B-splines, but the implementation can be extended to other choices. The product of 1D functions can be a Cartesian product of 1D Lagrange polynomials (or monomials), to define Q k spaces on n-cubes, or a reduced combination of monomials to define P k spaces on n-simplices. The definition of the reference FE functional space relies on the polynomial_t data type in Listing 5, which represents a polynomial in one variable, i.e., pðxÞ ¼ P k i¼0 a i x k . Thus, a 1D polynomial is defined in terms of its order k and a set of k þ 1 coefficients fa i g k i¼0 , stored in order and the coefficients array, respectively. Different type extensions of polynomial_t have been considered so far, namely lagrange_polyno-mial_t and monomial_t. The first one generates a Lagrangian polynomial as in Sect. 3.4, in which the coefficients array has in its first order entries the coordinates of the nodes and in the last entry the coefficient (7). The monomial_t extension represents x k where k is its order. It is just a trivial case of polynomial_t for optimization purposes that is uniquely defined by the order (the coefficients array is not needed). We also consider the polynomial_basis_t data type, which is just a set (array) of (polymorphic) polynomials.
Up to this point, we have defined Lagrange polynomials and monomials in one variable. lagrange_polyno-mial_t and monomial_t also provide the binding generate_basis that generates a Lagrangian and monomial basis of polynomials, for a given order k. The result of this subroutine is a polynomial_basis_t that includes as many polynomials as the polynomial space dimension. In the case of the Lagrangian basis, it implements the basis L k in Sect. 3.4, whereas the binding for monomials simply implements fx i g k i¼0 . The next step is to generate higher dimensional spaces. We consider two types of spaces. The first one is a space that can be generated as the Cartesian product of 1D spaces, implemented in the data type tensor_pro-duct_polynomial_space_t. This data type is defined through the number of space dimensions and as many polynomial_basis_t as space dimensions. This data type can be applied to any combination of 1D spaces. e.g., In the case of 1D Lagrange bases (possibly with different order and nodes per dimension), it leads to Listing 5. The polynomial_t data type and related data types. the multi-dimensional basis in (8). Thus, with this data type and Lagrangian 1D bases we generate the Lagrangian FE spaces on top of n-cube cells, i.e., the Q k space of polynomials. 10 Furthermore, we also consider the trun-cated_tensor_product_polynomial_space_t extension that generates Lagrangian FE spaces on n-simplices, i.e., the P k space of polynomials. In this case, the generate_basis TBPs of monomial_t should be used to create the monomial 1D bases per direction and the order should also be the same for all directions. Otherwise, the resulting multi-variable function would have no sense. Next, the combination of 1D monomials only involves terms such that jaj k (see Sect. 3.4), to generate a prebasis for FE spaces on tetrahedra, i.e., the P k space of polynomials.
We note that with these abstract representations of polynomial spaces one can define the reference FE local space. However, unless one considers 1D Lagrangian basis and tensor product polynomials on n-cubes, the resulting basis is not the shape functions basis. Even in the case of Lagrangian n-simplices, a change-of-basis is needed, using the procedure in Sect. 3.5 taking nodal values as moments. In Sect. 9.5, we show how we can define the shape function basis for the case of div-conforming FEs of arbitrary order. The same ideas apply for grad-conforming Lagrangian FEs on n-simplices and curl-conforming FEs in general, but are not included for the sake of brevity. 6 The reference_fe_t Abstraction In this section, we introduce the reference_fe_t data type. This data type is the OO representation of the standard mathematical definition of a reference FE presented in Sect. 3.3, namely, a reference cell geometryK, a functional spaceV, and a set of DOFsR defined on top of it. The reference_fe_t is a central abstraction in a FE library and must be judiciously designed to be extensible and reusable. In particular, it must not only accommodate Lagrangian FEs, but also other (more involved/general) spaces like Raviart-Thomas or edge FEs, DG methods, and B-spline patches. An extensible and reusable design of reference_fe_t should allow one to, e.g., easily incorporate new local functional spaces that were not originally considered, and to do so without having to rewrite (and thus recompile) any code that is grounded on the set of methods provided by reference_fe_t. To this end, in FEMPAR, reference_fe_t is an abstract data type that serves as a template equipped with a set of member variables and deferred bindings that subclasses have to set up and implement (i.e., override), respectively, in order to complete the description of the concrete FE space at hand. The definition of the reference_fe_t data type, a classification of its member variables into three different categories (corresponding to the three ingredients in Ciarlet's definition), and an enumeration of its most relevant regular and deferred bindings, are shown in Listing 6. Listing 6. The reference_fe_t abstract type, a classification of its member variables, and an enumeration of its most relevant regular and deferred bindings. This section is structured as follows. The member variables in each of the three aforementioned categories are covered in detail in Sects. 6.1-6.3, respectively. In Sect. 6.4, we discuss the OO design pattern chosen in FEMPAR for the creation of reference_fe_t polymorphic instances, and describe the arguments that uniquely define a subclass of this data type; these are in line with its mathematical definition. In Sect. 6.5, we enumerate and briefly describe the subclasses of refer-ence_fe_t currently available in FEMPAR. We note that the section is not self-contained as most of the deferred bindings of reference_fe_t are not covered here. These involve interactions with other data types in our OO design, and will be described in the sections in which these interactions are exposed. Code comments in Listing 6 serve as a table of contents with the article sections in which these deferred bindings are covered.

The Reference Cell Topology
The reference cellK is a polytope. Therefore, following Sect. 3.14, it can be described with the topology, coded as a set of d bits, where d is the dimension of the polytope. The reference cell topology is generated using poly-tope_t described in Sect. 4, which offers methods like composition and local numbering of n-faces. Polytope topologies include triangles and quadrilaterals in 2D, and tetrahedra, hexahedra, prysms, and pyramids in 3D. The member variables in charge of the description of the reference cell topologyK are shown in Lines 5-10 of Listing 6. The user must provide the topology and dimension of the polytope to defineK, stored in the member variables topology and num_dimensions, respectively. A set of getters return this basic information, and other related data that can be generated out of them, e.g., the number of n-faces in the boundary of the cell is stored in the num_n_faces member variable. The list of vertex identifiers per each n-face and the list of facets (of dimension n À 1) per each n-face are stored in ver-tices_n_face and facets_n_face, respectively; see Fig. 5 for an illustration of these member variables and the data type list_t used in FEMPAR to store and traverse lists.
The FEMPAR data type list_t stores a set of (variable-sized) lists of integer identifiers, one per each entity; in this particular scenario, entities are n-faces. As shown in Fig. 5, the current implementation of this data type uses a compressed storage layout as, e.g., in compressed storage formats for sparse graphs. In order to preserve encapsulation and data hiding, list_t offers a rich set of TBPs that lets users to set up (step by a step) a new list_t instance; this type also provides a list_iterator_t type that lets them to sequentially read/write each of the integer identifiers of the list associated to an entity. The code snippet in Listing 7 illustrates how to iterate and print the identifiers of those vertices belonging to the n-face with identifier n_face_lid.
The number of n-faces of any dimension can be easily computed from ptr_n_faces_x_dim. We note that ptr_n_faces_x_dim is not a list_t instance, since we adopt the convention that n-faces are numbered from the lowest to highest dimension, and thus only the p array of the list is actually needed (see Fig. 5). In the example in Fig. 5, the value of this array is f1; 5; 9; 10g, since we have 4 vertices (dimension 0), 4 facets or edges (dimension 1), and 1 cell (dimension 2).

The Reference FE Space
For a given cell topology, different definitions of functional spaces and sets of DOFs are possible, e.g., the ones of the nodal Lagrangian grad-conforming reference FE in Sect. 3.8, the Raviart-Thomas div-conforming reference FE in Sect. 3.9, or the curl-conforming Nédélec reference FE in Sect. 3.10. The member variables of refer-ence_fe_t required to describe the functional spaceV with support onK are encompassed within Lines 13-16 of Listing 6. Listing 7. User-level code that illustrates how to print to screen those (local within cell) vertex identifiers belonging to n-face with (local within cell) identifier n_face_lid.
The local FE spaceV is determined by the member variables fe_type, (in some cases) field_type, and order. fe_type uniquely identifies the concrete FE space at hand. Possible values are provided by means of the public parameter constants fe_type_lagrangian, fe_type_raviart_thomas, and fe_type_nedelec corresponding to the reference_fe_t implementors currently supported in FEMPAR; see Sect. 6.5 for additional details on those. field_type identifies the ''type'' of physical field being discretized, i.e., whether it is scalar, vector-valued, etc. There are FE spaces that are inherently vector-valued such as, e.g., Raviart-Thomas and edge FEs. However, Lagrangian FEs can be either used to discretize scalar, vector, or tensor-valued fields, and field_type must be provided. We assume thatV can be parameterized with respect to an order, which is stored in order. Out of these values, we can generate additional data, e.g., the number of shape functions is stored in num_shape_functions. For example, for (scalar-valued) bi-quadratic (2D) and tri-quadratic (3D) Lagrangian FEs, the field_type is scalar, num_components is equal to 1, order is equal to 2, and num_shape_functions is equal to 9 and 27, respectively.

The Set of Local DOFs
Additional data is required to describe the set of DOFsR forV. In particular, the member variables encompassed within Lines 19-23 of Listing 6 serve this purpose.
The conformity member variable determines whether the global FE space X h is conforming with respect to the infinite-dimensional space X , i.e., whether X h & X or not. It is used to describe the n-face that owns every DOF, which is required to enforce conformity of the global FE space through equivalence classes (see Sect. 3). e.g., For Lagrangian FEs, setting it to .true. results in a gradconforming global FE space, whereas setting it to .false. it results in a discontinuous space for DG methods. It is conceptually possible to set it to .true. on some cells and false on others, leading to the CDG method in [83]. On the other hand, the continuity member variable is only determined by X , and tells us whether X admits a trace operator. Roughly speaking, it tells us whether we must enforce some type of continuity at the discrete level to preserve conformity, e.g., full, tangential, or normal traces for H 1 ðXÞ, Hðcurl; XÞ, and Hðdiv; XÞ, respectively. The value of continuity is .false. when X ¼ L 2 ðXÞ, since no continuity is required. When continuity is .false., conformity must be .true.. continuity is barely used (see discussion in next paragraph).
The value of conformity is used to generate the own_dofs_n_face member variable of type list_t. This member variable stores, for every n-face, the DOFs it owns; see Fig. 6. For CG methods, the notion of ownership is related to the geometrical location. For DG FEs, although node functionals are still geometrically located on the boundary of the cell, they are nevertheless owned by the cell, and considered as interior DOFs, since there is no global conformity to be enforced. This array is heavily used to generate the global DOF numbering. 11 On the other hand, the dofs_n_face member variable, determines, for a given n-face, the set of DOFs such that their respective shape functions are non-zero on the n-face. The continuity member variable is (currently) only used for DG methods in parallel distributed-memory environments. In particular, in order to decide whether to associate or not a global DOF identifier to nodes on the interface facets of ghost cells (and thus to be able to define nonsingular sub-assembled matrices for the DD methods in [84] for DG discretizations). The dofs_n_face member variable is used when continuity is .true. and a global DOF numbering is to be generated, and also might be used by triangulation subclasses (see Sect. 7) in order to extract the coordinates of those nodes on top of a vertex, edge, or face (using the dofs_n_face member variable of the reference_fe_t instance that describes the geometry of the cell). For example, in Fig. 6, the list corresponding to n-face with identifier 8 in dofs_n_face is {4,8,12,16}.
The reference_fe_t data type plays a crucial role in the algorithm in charge of assigning global DOF identifiers to node functionals distributed over the interior of the triangulation cells and their boundary n-faces. (This algorithm, which is is covered in detail in Sect. 10, is grounded on the notion of equivalence classes introduced in Sect. 3.) In particular, the function-like (regular) binding referred to as permute_dof_lid_n_face (see Line 32 of Listing 6) implements the mapping p F in (15). This function takes as input the so-called permutation index in Sect. 3.16, the local index of a node within an n-face of given dimension (e.g., in 3D, either 0 for vertices, 1 for edges, and 2 for faces) from the perspective of a source cell, and returns the local index of a node within that n-face from the perspective of the target cell. 12 This is in 11 We can consider three levels of DOF numbering: the cell-wise DOF numbering (referred to as local DOFs), the subdomain-wise DOF numbering (referred to as global DOFs), and a full domain global DOFs. The latter numbering is never created/required in FEMPAR. In serial environments, the latter two match. 12 We note that the responsibility of determining the permutation index does not lay on reference_fe_t, but on the abstraction of FEMPAR that represents the mesh of the computational domain; see Sect. 7. particular the transformation that we have to apply when global DOF identifiers have been already assigned to n-face nodes in the source cell, and we want to transfer them to n-face nodes in the target cell; see Sect. 10.3. This binding, implemented in reference_fe_t, ultimately relies on its own_dof_permutations(:) member variable; see Line 23 in Listing 6. This allocatable array is indexed with the n-face dimension (i.e., 1 for edges, and 2 for faces). For each n-face dimension larger than 0, it contains a rank-2 allocatable array (i.e., type(allo-catable_array_ip2_t) is the base type of the array), which serves as a lookup table for the implementation of the aforementioned transformation. In particular, the rows are indexed with the local index of the node identifier on top of the n-face from the perspective of the source cell, and the columns with the permutation index; see Sect. 3.16. The entry in the corresponding row and column of the table provides the local index of the node within the n-face from the perspective of the target cell. These lookup tables are filled within the fill_own_dofs_permutations deferred binding of reference_fe_t. We note that this latter binding, and permute_dof_lid_n_face, are declared as overridable bindings in Listing 6 on purpose. This lets, e.g., subclasses of reference_fe_t to be used in conjunction with (properly oriented; see Sect. 3.16) n-simplex meshes to implement the former such that the own_dof_permutations(:) member variable is not allocated nor filled, and the latter such that always returns the identity transformation.

Creating reference_fe_t Polymorphic Instances
Central to any OO software system relying on abstract data types is the approach chosen to create polymorphic instances at runtime. For simplicity, FEMPAR follows the so-called simple factory design pattern [85]. It takes the form of a single stand-alone function, called make_reference_fe, which selects the dynamic type of the polymorphic instance to be returned at runtime based on the values of its dummy arguments topology and fe_type. (For example, assuming the topology of an hexahedron and fe_type_lagrangian, then it will select its dynamic type to be hex_lagrangian_ref-erence_fe_t, i.e., the concrete data type implementing Lagrangian-type FE spaces on top of n-cubes.) Before returning, it calls a deferred binding of refer-ence_fe_t, called create, which is responsible to leave the reference_fe_t in a fully functional state. The interface of this deferred binding is shown in Listing 8.
We remark that field_type is only a free parameter for Lagrangian FEs (i.e., for a particular refer-ence_fe_t subclass). In other words, it must be field_type_vector for Raviart-Thomas and edge elements. We note that despite its fix set of dummy arguments interface, it has been proven to be sufficient to fully describe all subclasses currently available in FEMPAR; see Sect. 6.5. However, in the event that it is needed, and with extensibility in mind, a single parameter dictionary of \key; value [ pairs might have been used instead; FEMPAR indeed relies on an implementation of this data type where key is a string (typically denoting the name of the parameter), and value a scalar or arbitrary rank array of intrinsic or even user-defined types. 13

Enumeration of reference_fe_t Subclasses
There is a rather complex data type hierarchy rooted at reference_fe_t in FEMPAR, which has been judiciously designed with code re-use as the main driver. (For example, Lagrangian FE spaces on top of n-cubes and n-simplices share member variables and code that can be gathered into a common base data type.) For the sake of brevity, in this work we do not cover in full detail the implementation of the data types in this hierarchy (except those details given in Sects. 5 and 9.5). However, for completeness, it is convenient to enumerate those ref-erence_fe_t subclasses that, at present, are available in this hierarchy. These subclasses, which lay at the leaves of the hierarchy, are the following ones: • hex and tet_lagrangian_reference_fe_t.
Space of polynomials of arbitrary degree k on top of n-cubes (i.e., tensor-product like spaces Q k ) and nsimplices (i.e., P k ), respectively, for the discretization of either scalar-valued, vector-valued or tensor-valued fields; see Sect. 3.8. By selecting the ownership relationship among node functionals and n-faces appropriately (see Sect. 6.3), this FE space can be either globally continuous, or entirely discontinuous across cell boundaries. • hex and tet_raviart_thomas_refer-ence_fe_t. The vector-valued Raviart-Thomas FE of arbitrary degree k on top of n-cubes, and n-simplices, resp., suitable for the mixed Laplacian problem and some fluid flow problems. Global FE functions of this space (in its conformal variant) have continuous normal components across cell faces; see Sect. 3.9 for details.
• hex and tet_nedelec_reference_fe_t. The vector-valued curl-conforming Nédélec FE of arbitrary degree k on top of n-cubes, and n-simplices, resp., suitable for electromagnetic problems. Global FE functions of this space (in its conformal variant) have continuous tangential components across cell faces; see Sect. 3.9 for details. • void_reference_fe_t. A software artifact that represents a FE space with no DOFs at all, neither at the cell interiors, nor at their boundary n-faces. This sort of software resource has been proven extremely efficient for: (1) the numerical solution of a PDE on a subdomain of our original discretized domain (which thus has to be aligned with the cells boundaries); (2) the numerical solution of a PDE using XFEM-like discretization techniques (which are grounded on FE spaces that do not assign DOFs to cells exterior to the embedded domain); (3) to simplify the implementation of discretization methods for PDE problems that involve coupling at the interface level, e.g., fluidstructure interaction.
Apart from these reference_fe_t subclasses, there are already concluded developments within this hierarchy in a beta version of the code, such as B-splines [8], and other scheduled developments, such as div-conforming FEs [7]. 7 The Description of the Physical Domain: The triangulation_t Abstraction A central abstraction in all FE numerical simulation codes is the one that describes the triangulation/mesh T h of the physical domain X & R d in which our problem is posed.
(In practice, the mesh generation for X introduces a geometrical error, and the mesh is in fact over an approximated domain X h ). In FEMPAR, this abstraction is called tri-angulation_t. With flexibility, and code reuse in mind, this is an abstract data type. In Sect. 7.1, we introduce triangulation_t, and the mechanism that it provides to its subclasses in order to preserve encapsulation and data hiding, while still letting subclasses to store and access to data efficiently. For completeness, in Sect. 7.2, we introduce details underlying the implementation of a particular concrete subclass of triangulation_t.

An Abstract Triangulation Representation and Its Software Implementation
In this section, we present an abstract (conceptual) representation of a triangulation that FEMPAR exposes to userlevel applications and other library software abstractions that are grounded on it (see, e.g., Sect. 10). This conceptual representation is provided by a set of abstract derived data types (and the methods bounded to them) to which we have converged as a result of our experience in accommodating a wide range of state-of-the-art FE discretizations and solver techniques within a single framework, from desktops/laptops, to high-end distributed-memory supercomputers (see Sect. 2). For the sake of brevity, in this work we restrict ourselves to a subset of this representation that only provides support to the implementation of high-order conforming and nonconforming FE discretizations grounded on conforming meshes in a serial computing environment. We stress, however, that the actual (complete) representation also incorporates concepts to express the mesh in a distributedmemory environment (e.g., the set of cells of a subdomain is divided into local cells and a layer of cells owned by remote subdomains, which we denote as ghost cells). It also provides support to the implementation of high-order hp-adaptive (i.e., on locally refined, non-conforming meshes) conforming and non-conforming FEs (using hanging node constraints [82] and subface integration over a facet between cells of different refinement level, respectively) and to the implementation of XFEM-type techniques (see [60] and references therein); provided an implicit representation of the geometry of the domain, a background mesh is able to know whether a cell is interior, exterior or cut by the domain, and in the latter case, to provide the coordinates of the intersection points. This extra expressivity comes in the form of additional data types and an extended set of methods for those data types that are covered in this section. We stress, however, that neither the former nor the latter ones will be covered in this section.
Although our abstract representation of a triangulation has been proven to have high expressivity, we do not claim, however, that our triangulation representation is universally applicable to the implementation of arbitrary numerical discretization and solver techniques. It indeed has been designed such that extra extensions are foreseen to satisfy further requirements.
The triangulation representation encompasses both topological and geometric data. A triangulation is conceived as a partition of X into a set of cells (d-faces). Each cell is uniquely identified by a global identifier in the range cell gid ¼ 1; . . .; num cells. 14 Apart from the cells, a triangulation is also composed by a set of lower dimensional objects, i.e., a set of k-faces, for k ¼ 0; . . .; d À 1. We will also refer to elements in this set as ''vefs'', provided that in the d ¼ 3 case, it is composed of vertices, edges, and faces. Each of the objects in this set is uniquely identified by a global identifier in the range vef gid ¼ 1; . . .; num vefs. 15 Apart from the cells and vefs, a triangulation also encompasses adjacency data. This sort of data describes how n-faces in a mesh are related to each other. We denote by F the set of all n-faces in the mesh, by F k the set of all kfaces, and by F i and F k i the i-th n-face (of arbitrary dimension) and the i-th k-face (of fixed dimension k), respectively. In conforming meshes, there are mainly two relevant types of adjacency relationships, namely composition (m-faces that are part of a k-face for m\k) and neighbourhood (m-faces around a given k-face for m [ k). Following [87], the set of m-faces adjacent to F k i , is denoted by F k i hF m i (i.e., the operator hÁi selects from the set the m-faces adjacent to the one in the left). A triangulation conforming with FEMPAR abstract representation should be able to provide the composition data F 3 i hFi, and the neighbourship data F i hF 3 i, that is, n-faces that compose each cell and cells around n-faces.
A triangulation also includes geometry data. Cell geometries are represented by a map U K of a polytopeK in the reference space to the physical space (see Sect. 3). This map is represented as a function of a scalar FE space (e.g., grounded on high-order Lagrangian FEs or B-splines), with its DOF values being the vectors of node coordinates (i.e., point_t instances) in the physical space.
At the core of the software design in charge of providing the triangulation-related data covered so far is an abstract data type named triangulation_t. (The rationale behind this data type being abstract will be made clear in the course of this section.) This data type is defined as shown in Listing 9. triangulation_t is conceived as a template to which all subclasses have to conform. On the one hand, it is composed by a (minimal) set of member variables encompassing data common to any triangulation. In particular, any triangulation is embedded in a num_dimensions-dimensional space, and is composed of a total number of num_cells (num_dimensionsdimensional) cells and num_vefs vefs, respectively; see Lines 3-5 of Listing 9, respectively. On the other hand, triangulation_t is equipped with a set of deferred methods that the subclasses of triangulation_t must implement; see Lines 11-18. The rationale underlying these methods requires further elaboration, to be discussed in the sequel. 14 We note that the actual conceptual representation of the triangulation in FEMPAR differences among local (to subdomain) cell identifiers and global cell identifiers (among the whole triangulation of the domain) in a distributed-memory context. The second sort of identifiers are coded as long precision integers, i.e., integer(igp), in order to accommodate simulations with more than 2 31 À 1 global cells. 15 As mentioned in the case of cells, the actual conceptual representation of the triangulation in FEMPAR differences among local (to a subdomain) vef identifiers and global vef identifiers (among the whole triangulation of the domain) in a distributedmemory context. Again the latter ones are long precision integers.
In order to construct a conceptual view of triangu-lation_t suitable for the user (and library) code needs, FEMPAR relies on the so-called iterator OO design pattern [88]. Iterators are data types that provide sequential traversals over the full sets of objects that all together (conceptually) comprise triangulation_t as a meshlike container. There are several different iterators available, each one related to a different set of objects to be traversed. For example, cell_iterator_t provides traversals over the set composed of all cells, while vef_iterator_t over the one composed of all vefs. 16 In our software design, iterators are created and freed by a set of public TBPs provided by triangulation_t; see Lines 11-18 of Listing 9. Thus, for example, the expression call triangulation%create_cell_iterator(cell) creates an iterator on the cell client-space instance, while call triangulation%free_-cell_iterator(cell) frees it. Iterators sequentially traverse objects in increasing order by their global identifiers. However, we note that triangulation_t subclasses are completely free to decide how to internally label these objects. 17 As the reader might have already noted from the minimal set of member variables in Listing 9 (among others), our software design is such that we want to provide complete flexibility to concrete subclasses of triangu-lation_t with respect to how do they internally layout the (topology and geometry) data to be provided. To this end, triangulation_t is an abstract class that defers this decision to its subclasses. There is a clear separation among how the data is handled (i.e., stored and accessed) by the private data structures (member variables) underlying triangulation_t subclasses, and the conceptual/abstract view of triangulation_t exposed to FEMPAR users. This view renders triangulation_t easily accessible and understandable. Whereas the public interface of triangulation_t being used by client codes is designed to be stable over time, the internals of triangulation_t subclasses, however, are allowed to (and are subject to) change over time (e.g., in order to accommodate further optimizations, additional requirements, etc.). At the price of dynamic run-time polymorphism, triangulation_t subclasses might be designed such that they strongly strive to preserve encapsulation and data hiding while still storing and accessing to data efficiently. Thus, e.g., a triangulation_t subclass in charge of handling structured/uniform meshes of simple domains may decide to not explicitly store the cellwise global vef identifiers, nor the vertex coordinates of the mesh, but instead to provide them implicitly on demand as a function of the global cell identifier.
Apart from encompassing the logic underlying the actual traversal over objects of the set at hand, iterators also have the following crucial responsibility. Following the software concept of ''accessors'' presented in [17], they are able to tease out the data related to the current object on which they are seated from the global arrays and rest of private data structures that comprise the internals of the corresponding triangulation_t subclass. They therefore do not explicitly store, e.g., the global vef identifiers of the current cell. Instead, they know how to fetch 16 For completeness, let us mention that triangulation_t also offers traversals over subsets of objects conveniently selected for acceleration purposes. For example, triangulation_t provides an iterator over vertices, edges, and faces that lay on the interface among subdomains, called itfc_vef_iterator_t (i.e., a subset of the set of objects traversed by vef_iterator_t) for those subclasses suitable for parallel distributed-memory environments. 17 Thus, e.g., a triangulation_t subclass that internally labels the global identifiers of vefs by their dimension in increasing order would result in a traversal with such an order. This is however a potentially changing over time low-level implementation detail that user programs relying on triangulation_t and its associated iterators should not assume nor rely on.  them from the corresponding triangulation_t subclass into data structures suitable for the user needs. Provided that it is the responsibility of triangulation_t subclasses to decide how to internally layout data, iterators are abstract data types as well, and most of its TBPs are deferred/virtual. This also justifies why the methods in the Lines 11-18 of Listing 9 are deferred, and why the corresponding iterator dummy arguments, polymorphic allocatable. It is ultimately the responsibility of the concrete subclass of triangulation_t to decide on execution time the dynamic type of the polymorphic variable being created. Let us next discuss the rationale underlying the design of iterators over cells and vefs. These data types are defined in Listing 10, where set must be actually replaced by the corresponding name uniquely identifying the set of objects to be traversed by the iterator at hand, i.e., either cell or vef. In Fig. 7, we illustrate the implementation of a partial (selected) subset of the bindings of these data types.
The create binding of set_iterator_t takes as input a polymorphic triangulation_t instance to be traversed, and leaves the iterator positioned in the first object of the set, i.e., in a state ready to start the sequential traversal over all of its objects; see Fig. 7. This method (like free) is not intended to be directly called by the user. Instead, triangulation_t clients should rely on the deferred bindings of triangulation_t presented in Listing 9. The init, next, and has_finished bindings let clients to position the iterator on the first object of the set, move to its next object, and check whether all of its objects have been already traversed or not, respectively; see Fig. 7.
The actual set of (deferred) TBPs of a triangula-tion_t iterator highly depends on the type of object being pointed. We now briefly discuss those TBPs in the set corresponding to cell and vef iterators that provide support to the subset of the triangulation conceptual representation we are focusing on. These are in particular enumerated in Listing 11. The TBPs in Lines 8-12 of Listing 11 are in charge of providing data related to the composition relationship F 3 i hFi. In particular, the get_num_vefs binding returns the number of vefs on the boundary of the mesh (i.e., the cardinality of the composition relationship). Given the local index of a vef in a cell (within the range 1; . . .; num vefs), get_vef positions the vef_itera-tor_t instance on input such that it points to this vef, while get_vef_gid, returns its global identifier; get_vef_lid performs the inverse translation to the one of get_vef_gid. Finally, get_vefs_gid let the client obtain the global identifier of all vefs of the current cell in one shot provided a user-space pointer to integer array. The semantics of this last TBP are such that subclasses of cell_iterator_t are not allowed to allocate the provided pointer, but to associate it with existing (internal) memory (for increased performance and memory leaks avoidance).
The TBP in Line 15 of Listing 11 provides support to the implementation of the transformation procedure described in Sect. 3.16. In particular, this binding has to be invoked on a cell_iterator_t instance positioned in the source cell, and given a cell_iterator_t positioned on the target cell, and the n-face local identifier within the former and latter cells, returns the permutation index; see Sect. 3.16. We stress that both the rotation and orientation indices can be always computed using the TBPs in the previous paragraph. For example, in order to determine the rotation index, one can extract the global id of the anchor vertex of the n-face in the target cell (by calling Listing 11. A subset of the deferred TBPs of the cell_iterator_t and vef_iterator_t data types (follow-up to Listing 10).
FEMPAR: An Object-Oriented Parallel Finite Element Framework 225 get_vef_gid), and then searching for this global id in the set of vertices that comprise the n-face in the target cell (using an iterator over the corresponding sublist in ver-tices_n_face; see Sect. 6.1). However, we preferred to provide a specialized deferred binding for such purpose in order to leave room for optimizations in triangu-lation_t subclasses. For example, in the case of a subclass that works with oriented meshes, then get_permutation_index may be implemented such that it always returns the permutation index corresponding to the identity transformation. In the case of a subclass of triangulation_t that is intended to remain static (or to be adapted very infrequently) during the course of the simulation process (see, e.g., Sect. 7.2), then it might be beneficial for performance to precalculate all possible permutation indices during set up into lookup tables, and re-use them all the way through without having to perform the aforementioned searches over and over again. The TBPs in Lines 18-20 are in charge of providing the cell geometry related-data. In particular, get_refer-ence_fe returns a polymorphic pointer to the refer-ence_fe_t instance that describes the space of functions to which the mapping U K belongs. get_num_nodes and get_nodes_coordinates return the number of nodes describing the geometry of the cell, and its associated coordinates in physical space, respectively. Instead of a pointer to an user-space array to be associated with internal storage (as get_vef_gids), get_nodes_coordinates takes a user-space (pre-allocated) array of type point_t instances, and fills it (because of reasons made clear in Sect. 8.3). Assuming that reference_fe_t is a bi-linear Lagrangian FE on a quadrilateral, then get_-num_nodes would return 4 (one node per cell-vertex), while get_nodes_coordinates the coordinates in physical space of its vertices.
Any triangulation_t subclass should let its clients to classify the cells into sets. Each set is globally identified by an integer number, named set_id. The methods get_set_id and set_set_id let the caller to associate a set to the current cell, or to retrieve the set to which the cell is currently associated. Cells set identifiers are primarily (although not only) used by fe_space_t during its set-up; see Sect. 10. In particular, they instruct the latter to determine which reference_fe_t instances to use on top of the cells belonging to the same set. For example, assuming that we want to solve a scalar, singlefield PDE problem on a subdomain of our original domain (that we assume to be aligned with the cells boundaries), we would use two different sets. The first for the cells that are interior to the subdomain, and the second for those that are exterior. Then we could associate e.g., a linear Lagrangian reference FE to cells in the first set, and void_reference_fe_t on those cells of the second set; see Sect. 6.5.
Sitting on a given vef, the TBPs in Lines 36-37 are in charge of providing data related to the adjacency relationship F i hF 3 i. In particular, get_num_-cells_around returns its cardinality, while get_cell_around returns a cell in this set. To be more precise, the latter TBP positions the instance of cell_iterator_t on input such that it points to a cell in this set identified with an index within the range 1; . . .; get num cells aroundðÞ. The order in which the cells around a vef are listed can be arbitrary, so that codes relying on triangulation_t should not assume, e.g., that they are ordered increasingly by their global cell identifiers. On the other hand, get_num_nodes and get_nodes_coordinates return the number of points on top of the vef (including those on top of the lowerdimensional ones on its boundary), and its associated coordinates in physical space, respectively; see Lines 40-41. We adopt the convention that these nodes are (locally) labeled (within the input/output array of point coordinates to be filled) according to the reference coordinate system of the first cell around the vef, i.e., the cell obtained as vef%get_cell_around(1,cell).
The TBPs in Lines 44-48 let the client to determine whether the vef is at the interior of the domain or on its boundary, the vef dimension (e.g., in 3D, it would return 0, 1, and 2 for vertices, edges, and faces, respectively) and to retrieve the set to which the vef is currently associated, or associate a new set to it, respectively. Sets in the case of vefs are primarily used to codify the boundary conditions of the PDE problem at hand, as discussed in Sect. 10.4. At this point we are already in position to show userlevel code that exploits the software design covered so far. In particular, Listing 12 splits the whole set of triangulation cells into two disjoint sets, those that are in contact to the boundary of the domain, and those that are in its interior.

An Example triangulation_t Subclass and Rationale
In this section, we discuss how a particular subclass of triangulation_t is internally organized in order to efficiently provide triangulation-related data by means of the software abstractions presented in Sect. 7.1. This subclass is static_triangulation_t. A static_-triangulation_t codifies a conforming mesh, which is set up from scratch at the beginning of the simulation, and remains unaltered during the whole process. On the other hand, static_cell_iterator_t and static_vef_iterator_t are two non-abstract data type extensions of cell_iterator_t and vef_it-erator_t, respectively. By overriding the set of deferred methods of the former ones, the latter ones tease out the data related to the current object on which they are seated from the global arrays and rest of private data structures that comprise the internals of static_ triangulation_t.
There is no single approach to layout the data within a given triangulation subclass. The seek of an acceptable trade-off among memory consumption, computational time required to set up, update (if it applies), access to triangulation data, and the frequency on which these operations are performed should guide its internal organization. For example, in [87], two storage layouts are presented, and its memory and computational cost for the computation of any possible adjacency relationship is evaluated in 3D. The first one, called one-level representation, is defined by F 1 i hF 0 i, F 2 i hF 1 i, and F 3 i hF 2 i, and by F 0 i hF 1 i, F 1 i hF 2 i, and F 2 i hF 3 i (neighbourhood information).
In other words, it stores vertices of each edge, edges of each face, and faces of each cell, together with edges around vertices, faces around edges, and cells around faces. The second one, called circular representation, is defined by the composition information F 1 i hF 0 i, F 2 i hF 1 i, F 3 i hF 2 i (as above), together with the neighbourhood information F 0 i hF 3 i (cells around vertices). An important property of these two storage layouts is their completeness, i.e., the possibility to determine any adjacency without a loop over the entire mesh. The storage requirements for a uniform mesh of a cube domain with N c cells are 48N c (for hexahedra) and 24N c (for tetrahedra) in the former, and 32N c (for hexahedra) and 16N c (for tetrahedra) in the latter. However, the operation count for determining some adjacencies, although independent of N c , is high. For example, in the case of the one-level representation, to obtain the cells around a vertex requires 48 (for hexahedra) and 140 (for tetrahedra) operations, whereas only one operation is needed to obtain cells around facets. In the case of the circular representation, these queries involve one and 148 (for hexahedra) or 299 (for tetrahedra) operations, respectively [87]. (We recall that both kind of adjacencies are required by FEMPAR as presented in Sect. 7.1.) Another quite different storage data layout is the one followed by the triangulation in the deal.II library [17], essentially defined by the composition data F 1 i hF 0 i, F 2 i hF 1 i, and F 3 i hF 2 i (referred as hierarchical cell representation by the authors of the library), and the neighbourhood data F 3 i hF 2 i stored cell-wise (i.e., a given cell stores the identifiers of its cell neighbours across each face within the cell). Besides, the (potentially non-conforming) triangulation in this library is conceived (and explicitly Listing 12. User-level code illustrating the usage of the data types and its associated TBPs supporting FEMPAR conceptual triangulation representation. represented) as a collection of trees, where the cells of a coarsest conforming mesh (generated by deal.II itself for simple domains, or read from a file from several file formats) form the roots, and the children branch off their parent cells, thus forming binary-trees, quad-trees and octtrees in d ¼ 1; 2; and 3 spatial dimensions, respectively [17]. While both the ancestors (i.e., the so-called ''inactive'' cells) and leaf cells of the tree (i.e., the so-called ''active'' cells) are stored, only the latter ones actually form the partition of the domain. Apart from a hierarchy of cells, the deal.II triangulation also maintains a hierarchy of kfaces for k ¼ 1; . . .; d À 1. Such quite complex data structure is justified by the authors for two reasons. First, it allows for an efficient implementation of adaptive mesh adaptation (including coarsening and refinement). The hierarchy of n-faces aids in the process of handling the socalled hanging node constraints required to build conforming FE spaces on top of non-conforming meshes. The second reason is the implementation of (geometric) multigrid preconditioners grounded on the adaptivity tree.
In particular, such preconditioners require that DOFs are also associated to inactive cells. Thus, also inactive n-faces have to explicitly exist in the triangulation. In any case, such structure is hard to generate and maintain, and does not fit well when integrated with parallel octree libraries [89], like p4est [90]. The whole hierarchy must be generated from scratch on each mesh adaptivity step. However, based on our own experience, such hierarchy is not really needed for an efficient implementation of adaptive refinement. The second reason, i.e., the implementation of a serial hierarchical multigrid solver in deal.II, would probably be more complicated without such a hierarchical representation of the mesh. While the hierarchical cell representation in deal.II has been proven to be successful in the implementation of highly complex hp-adaptive FE discretization [82] and reduces memory consumption over F 3 i hFi, the restriction of the global vef identifiers to a cell (a very frequent operation in FE codes), becomes significantly more expensive in this storage layout as this operation requires permutations among the reference coordinate system of the cell that owns the vef to the one to which we are restricting to; the same applies to the restriction of global DOF identifiers to a cell when the DOFs are stored n-face-wise. Furthermore, it is a non-complete storage layout. In particular, neighbourship data F i hF 3 i has to be computed by the user by means of a loop over all cells. Besides, it prevents library support to loops over the facets of the mesh, and access to the neighbouring cells, a natural operation in the implementation of DG methods. In our experience, facet-loop based integration of DG terms (versus cell-loop based) leads to a software that is significantly easier to use, as it might be designed such that most of the complexity underlying facet integration can be hidden to the user (see Sect. 9). Finally, although it is very efficient for hierarchical and local mesh adaptation (within each subdomain), the most severe drawback is its costly set up (from scratch) for a given initial conforming coarse mesh (this can be mitigated by reducing the coarse mesh resolution, at the price of potentially losing geometry modelling accuracy), and, in a distributed-memory environment, the even more costly regeneration of an adapted non-conforming forest of trees after a re-distribution step among processes for dynamic load-balancing [90]. Indeed, in [89], the latter is reported as the second more costly operation in the simulation pipeline, only below the linear solver step.
The static_triangulation_t data type explicitly stores the composition data F 3 i hFi, and the neighbourship data F i hF 3 i within its internal (private) member variables. 18 The memory consumption of such complete storage layout is 52N c (hexahedra) and 28N c (tetrahedra), which is less than twice the one of the one-sided and circular representations [87]. At the price of this increased memory consumption, static_triangulation_t is able to provide the required adjacency data with Oð1Þ arithmetic complexity. Besides, the cell-based storage of the composition relationship is perfectly suited for its migration in parallel distributed-memory environments. On the other hand, the amount of permanent storage of this data layout can be reduced if one exploits the fact that neighbourship data is only required in very specific parts of the code. For example, unstructured mesh generators usually provide only the composition data F 3 i hF 0 i. In such a case, static_triangulation_t requires the neighbourship data F 0 i hF 3 i (plus the reference cell topology data encompassed within the reference_fe_t instance mapped to each cell; see Sect. 6.1) in order to set up the composition data F 3 i hF 1 i and F 3 i hF 2 i. It is also needed in triangulation_t subclasses suitable for distributedmemory computers, among others, to set up the data structures required to perform nearest neighbour exchanges of DOFs nodal values among subdomains. (We stress that this process requires to globally identify interface DOFs consistently among subdomains sharing such DOFs .) In this latter scenario, this adjacency data is only required for n-faces that lay on the inter-subdomain interface (and not for those on the interior). The evaluation of facet integrals (as designed in FEMPAR, see Sect. 9) also requires at least F 2 i hF 3 i and F 1 i hF 2 i, in 2D and 3D, respectively. The use of the full adjacency data can be needed for the implementation of advanced numerical discretization schemes, e.g., for the implementation of nodal-based shock detectors for monotonic FEs [58,59]. Due to the aforementioned reasons, we decided to design static_triangula-tion_t such that it permanently stores such data, but we stress that our software design is such that a triangulation subclass is always free to offer methods that set up and destroy these data on demand to reduce the amount of permanent data storage.
The static_triangulation_t data type, together with a selected set of its bindings, is defined as shown in Listing 13. Before going into more detail, there are two main points to remark with respect to how this type internally layouts its data. First, it relies all the way through on intrinsic Fortran allocatable arrays. These sort of data structures are perfectly suited for the particular case of static_triangulation_t, due to its static nature. We stress, however, that more efficient data structures (i.e., able to mitigate the effect of frequent/costly allocatable array re-allocations) would be convenient if it also had to support mesh adaptation (e.g., a linked list, or even better for data locality, a data structure with semantics close to std:vector of the C?? standard template library, which in fact is already in FEMPAR but not included for brevity). Second, for increased data locality during cell and vef sequential traversals (and thus a more efficient on the memory hierarchy of modern computer architectures) the data is not stored into cell-wise or vefwise local arrays, but into global arrays that are indexed either by the global cell or vef identifiers.
A collection of reference_fe_t polymorphic instances is stored in the reference_fes(:) array (see Line 4 of Listing 13). These instances are uniquely identified (within the local scope of static_triangula-tion_t) by their position in this array. For a given cell with global identifier cell_gid, the FE space of functions to which the cell mapping U K belongs, is described by the reference_fe_t instance with identifier cell_to_ref_fes(cell_gid) in the collection; see Line 7. The member variables used to store the composition data F 3 i hFi are encompassed within Lines 10-11 of Listing 13. As stated above, the global vef identifiers are stored cell-wise, in the lst_vefs_gids(:) array, which is in turn (indirectly) addressed by the ptr_vefs_x_cell(:) array. In particular, the ones assigned to the vefs on cell cell_gid start and end in position ptr_vefs_x_cell(cell_id) and ptr_ vefs_x_cell(cell_id?1)-1 of lst_vefs_ gids(:), respectively. Thus, e.g., the implementation of the (overridden) get_num_vefs TBP in static_ cell_accessor (see Listing 12), just determines the Listing 13. The internals of static_triangulation_t and a selected set of its bindings. Finally, the static_triangulation_create binding sets up a new static_triangulation_t instance. There are two options for creating a static_triangulation_t in FEMPAR, depending on whether the mesh is structured or unstructured. In the first case, FEMPAR provides the machinery for the automatic generation of a triangulation on simple domains (e.g., a unit cube), currently of brick (quadrilateral or hexahedral) cells. This function is implemented exploiting a tensor product structure of the space, numbering cells and vefs using lexicographical order. The second way to create a static_triangulation_t instance is from a mesh data file, e.g., using the GiD mesh generator [91].

Evaluation of Cell Integrals
In this section, we describe the data structures required to perform the numerical integration of the local matrices. In order to compute cell integrals (12), one needs (among others) functionality to evaluate the shape functions and their derivatives at the quadrature points in the physical cell and the determinant of the Jacobian at the quadrature points in the reference cell. In turn, the evaluation of the shape functions and derivatives in the physical cell rely on their evaluation (and possibly the evaluation of the Jacobian) in the reference cell (see, e.g., (13) and (14)). We note that the evaluation ofŴ does not require any additional information; it is the identity for Lagrangian elements and only requires the Jacobian in the reference cell for vector-valued shape functions (see (17) and (18)). In the following, we present a set of data types that contain all this information.
The evaluation of cell integrals involves the data type quadrature_t that represents the quadrature Q, in-terpolation_t, that stores the values of the shape functions and its first derivatives (either in the reference or physical space) at the quadrature points of Q, and a cell_map_t that describes the mapping from a reference to a physical cell U K (e.g., Jacobian-related data). Additionally, the data type cell_integrator_t provides the machinery to compute the interpolation_t corresponding to the physical space from the one at the reference space and the cell_map_t at every cell of the triangulation. In the following sections, we cover in detail these software abstractions.

Numerical Quadrature
The data type that in FEMPAR represents an arbitrary quadrature rule is called quadrature_t and is defined as shown in Listing 14.
Listing 14. The quadrature_t data type.
In Listing 14, coordinates(:,gp) and weights(gp) store, respectively,x gp 2 R num dims and w gp , for gp ¼ 1; . . .; num quadrature points. It might readily be observed from the interface of its create binding that quadrature_t is designed to be simply a placeholder for the quadrature points coordinates and its associated weights. Indeed, this binding essentially allocates coordinates(:,:) and weights(:). The code that ultimately decides how to distribute the quadrature points overK and set up its associated weights is actually bounded to the reference_fe_t implementors through the deferred binding with interface shown in Listing 15.
All reference_fe_t subclasses currently available in FEMPAR select by default a Gaussian quadrature that exactly integrates mass matrix terms (within their implementation of the binding in Listing 15) by invoking fill_*_gauss_legendre methods at lines Lines 13 and 14 in Listing 14. This quadrature can be solely determined from the attributes of the reference_fe_t implementor at hand (its topology and order). 19 However, in other more demanding situations, e.g., the integration of a trilinear weak form, the user can provide the desired quadrature degree through the degree optional dummy argument. If more general scenarios to the ones currently covered (e.g., a non-Gaussian quadrature) are to be addressed, then the interface might be modified such that an optional parameter dictionary is passed instead.

Evaluation of Reference Cell Shape Functions
As commented in the introduction of this section, to compute cell integrals (12), one needs to evaluate shape functions and their derivatives in the physical cell, which in turn rely on their evaluation in the reference cell (see, e.g., (13) and (14)). The values of the shape functions and their first derivatives at a set of quadrature points provided by a quadrature_t instance are stored in the interpolation_t data type presented below. The same data type can be used to store this data in the reference or physical space.
Let us start with the evaluation of shape function in the reference space. The local FE space on top ofK actually depends on the particular reference_fe_t implementor at hand. Consequently, this functionality has to be offered through a deferred binding of this abstract type. The interface of this binding is declared in Listing 16. The subroutine overriding it in concrete subclasses is in charge of computing the shape functions values and derivatives at quadrature points in the reference space and stores them in a raw-data container of type interpolation_t (to be discussed later in this section).
Let us remark several points related to this interface. First, this binding is typically called only once, and the data pre-computed and stored within the passed interpo-lation_t dummy argument is repeatedly re-used when transforming these values to an actual cell; see Sect. 8.4. Listing 16. The interface of the create_interpolation deferred binding of reference_fe_t. 19 As it is well known, considering n-cube topologies forK, for a Lagrangian reference FE of order p and an affine geometrical map, we need a 1D Gaussian quadrature with p þ 1 points. For tetrahedral meshes with the Duffy transformation, we need to take n ¼ p þ ceilingðd=2Þ to integrate exactly mass matrices (see Sect. 3.5 for more details).
Second, this binding is designed such that all functions are evaluated at all quadrature points within a single call, instead of following a (much) finer granularity approach in which only one function is evaluated at a quadrature point per call. 20 Third, we stress that the actual implementation of this deferred binding in FEMPAR computes shape functions values and first derivatives in the reference space, whereas it lets the caller to selectively decide whether to compute or not the second derivatives of the shape functions, provided that they are expensive to compute and only required in very particular scenarios; see Sect. 3.7. Indeed, the code implementation of this feature is of cross-cutting nature, being reflected in several interfaces and data types in which the cell (and face) integration functionality is split. We will nevertheless omit here (and in the rest of sections) details regarding second derivatives (and its optional computation) in order to keep the presentation simple.
Let us now discuss on the rationale underlying in-terpolation_t. This data type is not exposed at all to the user of FEMPAR. It is instead used as an internal lowlevel container that lets the data types involved in the implementation of cell integrals exchange the sort of data subject to consideration. It is ultimately the responsibility of the concrete reference_fe_t subclass to decide how the data is actually laid out within the member variables of interpolation_t. Thus, reference_fe_t is the only data type that can access or modify inter-polation_t. In its current flavour, interpola-tion_t is a concrete (i.e., non-abstract) data type with a fixed set of multi-rank allocatable array member variables for storing shape function values and derivatives. For example, the one storing shape function values is a 3-rank array, where a reference_fe_t implementor may choose its indices, from left to right, to refer to the component of the shape function, the shape function, and the quadrature point, respectively. The reference_fe_t subclass is, however, completely free to lay out the data in these arrays, and it is in this flexibility where the extensibility of the software design to accommodate several FE space realizations resides. This, indeed has been proven to be sufficient to (efficiently) implement all FE spaces currently available in FEMPAR, including scalar, vector, and tensor-valued Lagrangian FEs (where higher-rank spaces are determined as the tensor product of the scalar spaces, and shape functions have only one non-zero component), and genuinely vector-valued FE spaces (where more than one component of the shape function may be non-zero).

Geometrical Mapping
A basic building block is the mapping U K among the reference cellK coordinate system and the one corresponding to an actual cell K of the triangulation in the physical space; see Sects. 3.2 and 3.3. For example, we are able to pull back the gradients of the shape functions from the reference to the physical space in (14) using the Jacobian of the transformation evaluated at quadrature points, or to evaluate the source term at quadrature points in real space. The Jacobian is also required to the transform the integral from the physical to the reference space in (12) and to compute the Piola transformations in div and curl-conforming FE spaces (see (17) and (18)). The derived type cell_map_t in FEMPAR is designed to be a placeholder for the data required to provide this sort of services. It is declared as shown in Listing 17. The rationale underlying the inheritance relationship among cell_map_t and base_map_t will be made clear in Sect. 9.
The create binding of cell_map_t takes as input a quadrature_t instance with a set of integration points where J K ðx gp Þ, J À1 K ðx gp Þ, and jJ K ðx gp Þj are to be evaluated (see Listing 17). These geometry-related data are stored in the jacobian(:,:,gp), inv_jacobian(:,:,gp), and det_jacobian(gp) allocatable array member variables of cell_map_t, respectively, and allocated during a call to this binding. Apart from a quadrature_t instance, cell_map_t also requires a description of the (discrete) space of functions to which U K belongs. FEMPAR supports mappings U K belonging to abstract FE spaces (e.g., high-order polynomial FE spaces or spline-based spaces). The reference_fe dummy argument of polymorphic type reference_fe_t serves the purpose. (We note that dynamic run-time polymorphism in this particular context let us re-use cell_map_t, e.g., with an arbitrary cell topology.) It turns out that the only information that reference_fe_t has to provide to cell_map_t are its shape functions, first derivatives, and (on demand) second order derivatives at the quadrature points (in the reference space). The interpolation member variable (see Listing 17) is used by reference_fe to exchange this sort of data with cell_map_t via a call to the create_interpolation binding of the former (see Listing 16) during a call to the create binding of the latter.
While the create TBP of cell_map_t is designed to be called once, the update TBP of cell_map_t is, however, designed to be called multiple times, once per every cell K of the triangulation. A pre-condition of update is that the nodes_coordinates(:) scratch member variable (see Listing 17) has been loaded with the coordinates in real space of the nodes describing the geometry of K (stored into point_t instances). Once this pre-condition is fulfilled, U K can be expressed as a linear combination of the reference_fe_t shape functions with nodes_coordinates(:) being the corresponding coefficients in the expansion. At this stage, coor-dinates_quadrature_points(:), which stores the coordinates of quadrature points in real space, and jacobian(:,:,:), can be easily computed. Finally, inv_jacobian(:,:,:) and det_jacobian(:) can be computed from jacobian(:,:,:) using straightforward numerical algorithms.

Evaluation of Shape Functions in the Physical Space
The user code that evaluates cell integrals in (12), may need the value, gradient, curl, and divergence of the shape functions at the integration points in the physical space, provided that we want to unburden FEMPAR users from the complexity of having to explicitly apply mapping transformations. As commented in Sect. 3, the mapping that transforms a shape function/ a ðxÞ in the reference FE space into the one in the physical space / a ðxÞ ¼Ŵ K ð/ a Þ U À1 K , depends on the particular FE Listing 17. The cell_map_t data type.
FEMPAR: An Object-Oriented Parallel Finite Element Framework 233 space at hand; see Sects. 3.8, 3.9, and 3.10 for details. For this reason, the actual code that performs these transformations is not actually bounded to cell_map_t, but to reference_fe_t, through the deferred binding with interface declared in Listing 18.
The interpolation_reference_cell input dummy argument of apply_cell_map (see Listing 18) must have been obtained from a call to the binding in Listing 16 invoked on the same reference_fe_t instance. The output dummy argument interpola-tion_real_cell holds the shape functions and their derivatives evaluated at quadrature points in physical space (see (13) and (14)). It is also assumed that, on input, interpolation_real_cell already contains the data that does not have to be re-computed on each mesh cell, e.g., the value of the shape functions on integration points for Lagrangian FEs; see the discussion related to the update binding below for the strategy that we follow in order to fulfill this requirement. This leaves room for optimization in the implementation of this deferred binding (on subclasses), since these quantities do not have to be recomputed on each cell. The reference_fe_t subclass uses the cell_map_t instance (passed to the apply_-cell_map binding, see Listing 18) as a placeholder for the data required to provide the mapping transformations required.
We stress, however, that interpolation_t is a low level structure that is not designed as a data type that FEMPAR users have to interact with, for reasons made clear in Sect. 8.2. Therefore, we need to introduce an additional data type in our software design, called cell_inte-grator_t, that, among other services, is able to fetch raw data from interpolation_t into field data types (i.e., scalars, vectors, and tensors) the user can be easily familiarized with. This data type is declared as shown in Listing 19. It also clones interpolation_reference_cell into inter-polation_real_cell, and copies the contents of the former into the latter. This lets cell_integrator_t to fulfill later on the pre-condition on the last dummy argument of apply_cell_map. The create binding also associates its polymorphic pointer reference_fe member variable to the reference_fe_t instance provided to it on input. This pointer is required later on by the update and get_* bindings (see discussion in the sequel).
The update binding of cell_integrator_t simply invokes apply_cell_map on its polymorphic reference_fe member variable, using the instance of cell_map_t provided on input to update, and the two interpolation_t member variables as actual arguments, respectively; see Listings 18 and 19. It leaves the cell_integrator_t instance on which it is invoked in a state such that it is able to provide the services it was primarily designed for. These are offered through the get_values, get_gradients, get_divergences, get_curls, etc., generic bindings. We note that cell_integrator_t is designed such that it can handle either scalar, vector, or tensor-valued refer-ence_fe_t instances (see Sect. 6.2). With this purpose in mind, each of the aforementioned generic bindings are overloaded with subroutines that have appropriate interfaces for these three types of FEs. For example, the subroutine overloading get_gradients in the case of scalar-valued FEs is declared and implemented as shown in Listing 20, with vector_field_t representing a d-dimensional rank-1 tensor; the interface of the one corresponding to vector-valued FEs only differs from the one above on the base type of the gradients allocatable array dummy argument, which is of base type tensor_field_t (i.e., data type representing a d-dimensional rank-2 tensor).
Let us remark some important points with respect to the subroutines overloading the generic bindings of cell_integrator_t. First, we note that the actual argument passed in place of, e.g., the gradients(:,:) dummy argument in Listing 20, is intended to be actually declared in code written by the user of FEMPAR. Provided that FEMPAR can support variable degree FEs on top of different triangulation cells (see Sect. 10), the allocatable attribute of the gradients(:,:) dummy argument not only unburdens the user from the complexity of having to (pre)allocate this array, but even from the one associated to variable degree FEs. For example, if on input, the size of gradients(:,:) is not sufficient to hold the data to be provided by the cell_integrator_t instance corresponding to the reference_fe_t on top of the current triangulation cell, then it can be re-allocated to the appropriate size. Second, this binding is designed such that all functions are evaluated at all quadrature points within a single call, justifying why the dummy argument has to be a rank-2 allocatable array. 21 At this point, let us note that all subroutines subject to consideration ultimately rely on (deferred bindings of) reference_fe_t; see, e.g., line 5 in Listing 20. We recall that refer-ence_fe_t must mediate in any process that requires retrieving data from interpolation_t; see Sect. 8.2.

Cell Integration User Code Example
At this point of the discussion, we are already in position to show user code that evaluates the entries of the (current cell) local matrix for the Example 3.1 presented in Sect. 3.1. This code is sketched in Listing 21. This code would be bounded to a subclass of the discrete_integration_t abstract data type presented in Sect. 11.2 suitable for the Galerkin discretization of the Poisson problem. 21 This represents another design decision in the seek of maximizing the granularity of the calls to deferred bindings for code efficiency reasons.
Listing 20. The code implementing the get_gradients_scalar binding of cell_integrator_t ultimately relies on a deferred binding of reference_fe_t with the same name. The reader may note from Listing 21 that FEMPAR also offers an expression syntax that lets its users code weak forms in a way that resembles their mathematical expression. The user is in charge of explicitly writing the expression of the numerical integration in the reference cell, i.e., of explicitly implementing the quadrature point summation (loop) and handling the determinant of the Jacobian and the quadrature point weighting in (12). However, the evaluation of the shape function and their gradients, curls, etc., at the quadrature points in the physical space (e.g., expressions (13) and (14)) are completely hidden to the user. This can be achieved using a feature of modern programming languages called operator overloading. (We refer to [67] for a detailed exposition of this mechanism in Fortran2003.) Common (contraction) operations among tensors are provided by means of overloaded intrinsic and library-defined operators. For example, the operator(*) generic interface (corresponding to the * intrinsic operator) has to be overloaded with the single contraction of rank-1 tensors, and the multiplication of a rank-1 tensor by a scalar to let our code compile. A crucial design requirement in the seek of code efficiency is that no dynamic memory allocation/deallocation is involved as the partial evaluation of sub-expressions proceeds (in the order dictated by operator associativity and priority rules in Fortran). In order to fulfill this requirement, the data types representing vectors and tensors are declared such that their entries are stored in an array member variable of size known at compilation time. This size is stored in the library-level parameter constant SPACE_DIM, defined as the maximum number of space dimensions of the physical space in which the physical problem is posed. By default, FEMPAR is prepared to deal with 3D simulations, but the code is written such that a 2D simulation might also be performed if SPACE_DIM is equal to 3, at the price of extra storage and computation. 22 Higher dimensional problems could be considered by compiling FEMPAR with a larger value for SPACE_DIM. Apart from avoiding dynamic memory allocation/deallocation during the evaluation of weak forms, this solution has the following advantages: (1) there is no need to explicitly have the number of dimensions as a member variable of the data types representing vectors and tensors; (2) the limits of the loops implementing tensor contraction operations are known at compilation time, enabling compiler optimizations. We finally stress that we preferred this solution over the usage of Fortran2003 parameterized data types [67] due to the lack of support of this feature in some of the most popular compilers widely available on high-end computing environments.

Evaluation of Facet Integrals
This section covers the data types (and their interactions) in which the evaluation of integrals over the facets of the triangulation is grounded on. The integration of facet-wise matrices and vectors (see, e.g., (23)) involves the evaluation of shape functions and gradients of the neighbouring cells at the quadrature points within the facet in the physical space and the Jacobian of the facet map at the reference space. As described in Sect. 8, the former quantities are computed at every neighbouring cell from their values at the reference space and the Jacobian of the cell mapping. The evaluation of interior facet also requires the computation of the permutation PðgpÞ (see (25)) provided that the coordinate systems of the cells surrounding the facet might not be aligned in physical space. 22 In fact, 2D problems for PDEs that involve curl operators require SPACE_DIM to be equal to 3. In FEMPAR the assembly process of the global linear system underlying the discrete weak problem (20) involves two loops, over all cells and facets, respectively. In the former loop, a cell-wise matrix A K and vector f K are computed per each cell. These hold the partial contributions of the cell to the corresponding entries of the global coefficient matrix and right-hand side vector, respectively.
The data structures involved in their efficient computation have been already covered in Sect. 8. In the latter loop, and assuming that we are sitting on an interior facet F 2 F X h , four facet-wise matrices, namely A F K þ K þ , A F K þ K À , A F K À K þ , and A F K À K À are computed (see Sect. 3.12). We depict in Fig. 8 a complete UML class diagram of the data types involved in the evaluation of facet integrals and their relationships. The data types the user has to ultimately interact with are quadrature t, which holds the facet quadrature points and weights, facet maps t, which handles (i.e., stores, updates, provides) all the geometrical related data of the facet and neighbouring cells K þ and K À , and, finally, facet integrator t, which stores and updates shape function values and first derivatives, and provides shape function values, gradients, curls, etc., of K þ and K À evaluated at facet quadrature points in real space. The rest of data types in Fig. 8 are auxiliary data types, not exposed to the user, which aid the latter two in the implementation of their corresponding services. The reader might readily observe in Fig. 8 that our software design is such that the data types that provide support to the evaluation of cell integrals, i.e., quadrature_t, cell_-map_t, and cell_integrator_t (see Sect. 8), can be re-used to a large extent for the evaluation of facet integrals. As we will see in the rest of the section, some of the methods to be invoked in order to control their respective life cycles in the context of facet integrals are nevertheless different from the ones to be invoked in the context of cell integrals; see, e.g., the signature of the create_re-stricted_to_facet binding of cell_integra-tor_t in Fig. 8 compared to that of its create binding in Listing 19.

Numerical Quadrature
The data type quadrature_t is designed to be a placeholder for the facet quadrature pointsx q and its associated weights w q . However, the code that ultimately decides how to distributex q over the reference facetF coordinate system, and set up w q , is bounded to refer-ence_fe_t, in particular through the deferred binding with interface shown in Listing 22. We refer to Sect. 8.1 for the rationale underlying the degree optional dummy argument of this deferred binding.

Geometrical Mappings
The facet_maps_t data type in Fig. 8 handles the geometrical facet mapping and the two geometrical cell mappings. The facet mapping is represented by facet_map_t, whereas the cell mappings by cell_-map_t; see Sects. 9.2.1 and 9.2.2, respectively.

Facet Mapping
As illustrated in Fig. 8, facet_maps_t is composed, among others, of a single instance of type facet_map_t. The member variables (and associated code) that are common to facet_map_t and cell_map_t are factored into a superclass base_map_t (see Listing 17). facet_map_t handles all data related to the facet map U F , including the facet outward unit normals (see Fig. 8). An extra 2-rank real allocatable array member variable, outward_unit_normals(:,:), stores the facet outward unit normals (with respect to K þ by convention) evaluated at facet quadrature points in real space, as required by (25); n À ðx gp Þ can be simply obtained as n À ðx gp Þ ¼ Àn þ ðx gp Þ.
Let us now see how facet_maps_t controls the life cycle of its facet_map_t instance. The create binding of facet_map_t takes a quadrature_t instance with the facet quadrature points. J F ðx gp Þ and jJ F ðx gp Þj are evaluated at these quadrature points and stored in the jacobian and det_jacobian member variables, which are allocated during a call to this binding together with outward_unit_normals(:,:). Apart from a quadrature_t instance, facet_map_t also requires a description of the discrete, lower dimensional space of functions on top of the reference facetF to which U F belongs. The ref_fe_geo dummy argument of create, of polymorphic type reference_fe_t, is provided for this purpose; in particular, facet_maps_t sends the reference_fe_t on top of K þ as an actual argument to the ref_fe_geo dummy argument in order to comply with the above described convention for the normals. The interpolation_t member variable of Listing 22. The interface of the create_facet_quadrature deferred binding of reference_fe_t.
facet_map_t (see Listing 17) is used by ref_fe_geo to exchange with facet_map_t the shape function values and their derivatives. To this end, reference_fe_t is equipped with the create_facet_interpolation deferred binding (see its signature in Listing 23) that computes these quantities on top of the reference facetF.
The update binding of facet_map_t is intended to be called once per facet loop iteration, i.e., once per each facet of the triangulation. A pre-condition of this binding is that the nodes_coordinates(:) scratch member array of facet_map_t (see Listing 17) has been loaded with the coordinates in real space of the nodes that lay on the the facet. 23 The update binding takes as input dummy arguments a quadrature_t instance and the real parameter reorientation_factor in order to adjust the sign of the facet normals (see (26)). Within update, quadra-ture_points_coordinates(:) and jacobian(:,:,:) can be easily computed from the basis shape functions and their first derivatives, respectively. On the other hand, det_jacobian(:) and outward_unit_normals(:,:) can be computed from jacobian(:,:,:). The former as stated in (24), while the latter as in (26).

Neighbouring Cells Mappings
The facet_maps_t data type is also composed by two instances of type cell_map_facet_restriction _t; see Fig. 8. These instances handle all data related to U K a , with a being either þ or -. Let us thus refer to these instances as cell_map_facet_restriction a , and to the polymorphic reference_fe_t instances on top of K a as ref_fe_geo a . In turn, cell_map_facet_restriction a are composed by as many cell_map_t instances as facets in K a . Provided that an actual facet F can potentially have local identifier F a in K a within the range  As seen so far, the create binding of facet_maps_t is designed to be called right before the actual loop over all triangulation facets, and it sets up all the scratch data. It does so by covering all possible scenarios corresponding to potential values of local facet identifiers within the two surrounding cells (even if some of these scenarios are not actually exposed in the triangulation). The update binding of facet_maps_t, however, is intended to be called sitting on a particular facet F of the triangulation, and it has to only update those two cell_map_t instances within cell_map_facet_restriction a corresponding to the particular scenario at hand, i.e., to the particular combination of local facet identifiers F þ and F À of the facet on which it is being updated. To this end, the update binding of facet_maps_t receives these local identifiers in facet_lids (see Fig. 8) and then calls the update binding of cell_map_facet_restriction þ and cell_map_facet_restriction À with facet_ lid=facet_lid(1) and facet_lid=facet_lid (2), respectively. The update binding of cell_map_ facet_restriction_t picks up the cell_map_t corresponding to facet_lid and invokes the update binding of the latter. We stress that no specialized version of this binding is required in the context of facet integration, i.e., the same version discussed in Sect. 8.4 for cell integration can be re-used here. 24 During the update process, cell_map_-facet_restriction_t also registers in its cur-rent_facet_lid private member variable, the value supplied to the facet_lid dummy argument. This lets facet_maps_t to extract later on from cell_map_-facet_restriction a the updated cell_map_t instances; see discussion of facet_integrator_t in the sequel.

Evaluation of Shape Functions in the Physical Space
The last data type that remains to be covered is facet_integrator_t; see Fig. 8. This data type is the counterpart of cell_integrator_t (see Sect. 8.4) for the case of facet integrals. In particular, it stores and updates shape function values and derivatives, and provides the values, gradients, curls, and divergences of the respective fields for both K þ and K À evaluated at facet quadrature points in real space. As can be observed from Fig. 8, its overall design is very close to the one of facet_maps_t, with cell_integrator_facet_restriction_ t and the cell_integrator_t instances it is composed of, playing the role of its counterparts in the scope of facet_maps_t (i.e., cell_map_facet_restric-tion_t and cell_map_t, respectively). There are, however, two major differences among these two. First, facet_integrator_t deals with (e.g., it is created from) the two polymorphic reference_fe_t instances (see ref_fe a dummy arguments of its create binding in Fig. 8) on which the global FE spaces of functions X h , Y h are grounded on. For example, the create binding of cell_ integration_facet_restriction þ invokes the create_restricted_to_facet binding of the cell_integrator_t for all facets F þ within K þ . The latter computes at a given facet/ a K þ ðx þ gp Þ, r/ a K þ ðx þ gp Þ through the deferred binding create_interpolation...to_facet of reference_fe_t presented in Listing 24. Second, facet_integrator_t has to unburden the user from the complexity underlying the fact that the coordinate systems of K þ and K À might not be aligned in real space. To this end, it is equipped with a private lookup permutation table, called qpoints_perm(:,:) in Fig. 8, that lets it translate facet quadrature points identifiers from the local numbering space of K þ into the one of K À . This table is allocated and filled during the create binding of facet_integrator_t, in particular by reference_fe_t through a deferred binding called fill_qpoints_permutations. Given the facet quadrature identifier gp and the facet permutation index pi (see Sect. 3.16), qpoints_perm(gp,pi) stores the value of PðgpÞ (see (25)). The permutation index is stored within the current_permutation_index of facet_integrator_t, extracted from the permuta-tion_index dummy argument of the update binding. In turn, this parameter is extracted from the array facet_permutation_indices(:) of fe_spa-ce_t in Listing 27 (see Sect. 10). We note that for n-simplices, we consider a renumbering such that all facets have the same orientation on both cells that share it, as commented in Sect. 3.16. In this case, fill_qpoints_permutations fills the table with the identity permutation in all columns. We note that the re-orientation of the n-simplices can lead to mappings U K such that jJ K j\0, but this is not a problem as soon as one takes its absolute value, e.g., in (12).

Facet Integration User Code Example
In order to grasp how the data structures covered so far are actually used together in practice, the Fortran pseudocode snippet at Listing 25 shows user's space code in charge of evaluating the first integral in (22) for each interior facet in a loop over all facets. It would be bounded to a subclass of the discrete_integration_t abstract data type presented in Sect. 11.2 suitable for the non-conforming DG discretization of the Poisson problem.
There are a pair of worth noting remarks about Listing 25. First, the call to the get_values() binding of facet_integrator_t in Line 14 already returns the permuted K À shape function values, i.e., shape_val-ues_ K À (b,gp) actually stores / b K À ðx À PðgpÞ Þ. Second, it is the so-called fe_space_t abstraction (to be covered in Sect. 10) the one in charge of creating the facet integration data structures on loop initialization and to update them at each facet loop iteration (see Line 9). Therefore, the user does not actually directly deals with all the data types bindings and their interactions illustrated in Fig. 8. In this example, it becomes evident that facet-loop based integration is very convenient for the implementation of DG methods, since it very much resembles the blackboard expressions (see, e.g., (20)).

Change-of-Basis Implementation in a reference_fe_t Subclass
In this section, we provide a detailed presentation of how the change-of-basis required to compute the shape functions basis is implemented in a reference_fe_t subclass. In particular, we show the implementation for the Raviart-Thomas div-conforming FE on n-cubes in Sect. 3.5 (see also Sect. 3.9 for details). The pre-basis, e.g., Q ðkþ1;k;kÞ Â Q ðk;kþ1;kÞ Â Q ðk;k;kþ1Þ in 3D, has to be generated before this subroutine is called; see, e.g., the evaluation of the pre-basis in Line 31 of Listing 26. We also present how to compute the boundary moments in (16) in Listing 26; interior moments are simpler and omitted for the sake of brevity. The implementation of the boundary moments requires: (1) to create the refer-ence_fe_t that implements ½Q k1 dÀ1 in Line 16, (2) a facet quadrature on the reference facet in Line 24, and (3) the evaluation of the reference FE in the quadrature points in the interpolation_t in Line 25. We also require a Lagrangian (first order) FE that represents the geometry in Line 20. Next, we loop over all the facets of the cell and compute the values of the shape functions of the cell in the facet quadrature, stored in the interpolation_t instance in Line 31. With all these ingredients, we can compute the boundary moments for the pre-basis functions (see line 43) and assemble them in the change-of-basis matrix. After doing the same for interior moments, we just need to invert the change-of-basis matrix in Line 54. At this point, we have the shape functions basis as a linear combination of pre-basis functions. Thus, when one calls the fill_interpolation binding of the corresponding reference FE, it creates the pre-basis interpolation_t instance and next applies the change-of-basis matrix to compute the one for the shape functions basis, i.e., the Listing 26. Implementation of the change-of-basis required for Raviart-Thomas divconforming FEs on n-cubes, following the procedure presented in Sect. 3.5. placeholder where the evaluation of the shape functions and its derivatives (at the set of quadrature points for which the interpolation has been created) are stored. We note that the ownership of DOFs also changes in this process. The boundary moments (integrals of functions on facets) belong to the corresponding facet, whereas interior moments belong to the cell. Vertices and edges do not have DOFs in this case. The definition of the ownership is skipped for brevity.
10 Integration and Global DOF Handling: The fe_space_t Abstraction In this section, we introduce a software abstraction, referred to as fe_space_t, which represents (in the most general scenario) the mathematical concept of a global FE space X h ¼ X 1 h Â . . . Â X n h obtained by means of the Cartesian product of global FE spaces X i h corresponding to each of the i ¼ 1; . . .; n field field unknowns involved in a system of PDEs; see Sects. 3.6 and 3.11. Each X i h is described as a combination of: (1) an approximation X h of the physical domain X provided by triangulation_t, i.e., a mesh-like container for the cells on which X h is partitioned, their boundary lower-dimensional objects, and their adjacency relationships; see Sect. 7; (2) a description of the n field reference FEs associated to each triangulation cell grounded on reference_fe_t; see Sect. 6.
These two basic building blocks equip fe_space_t with the tools required to provide the following two crucial services. 25 On the one hand, it is in charge of handling (i.e., generating, storing, fetching) a global enumeration of the DOFs corresponding to each X i h taking into account the notion of conformity; see e.g., Sects. 3.6 and 6.2. On the other hand, it handles the data structures that are required to evaluate integrals over cells and facets (see Sects. 8 and 9, respectively). In particular, it judiciously sets up them, and orchestrates their respective life cycles and interactions, while unburdening the user (to a large extent) from the complexity (among others) inherent to high order FEs.
The OO design of fe_space_t (as the one of many other data types in FEMPAR, e.g., triangulation_t) strongly strives to preserve encapsulation and data hiding while still storing and accessing data efficiently (i.e., in a way that leverages data locality for the efficient exploitation of modern computer memory architectures). The userfriendly view of fe_space_t is implicitly (re)constructed by the data types (associated interfaces and interactions) that will be covered in Sect. 10.2. We now move on the approach that we follow for the internals of fe_space_t.

The Internal Organization of fe_space_t
In this section, we sketch how the internals of fe_spa-ce_t are organized in order to efficiently deliver the two services outlined above. For simplicity, we restrict ourselves to a simplified version of fe_space_t that, to a large extent, captures the spirit of its actual counterpart in FEMPAR. The declaration of this simplified data type is shown in Listing 27. 26 25 We stress, however, that the full set of services provided by fe_space_t is not actually restricted to only these two. 26 We note that fe_space_t is not actually in FEMPAR. It is a whole data type hierarchy rooted at base_fe_space_t, not included here for simplicity. Within this hierarchy, we have, e.g., FE space concretizations suitable for either serial or parallel distributedmemory environments. The one shown in the listing very much resembles serial_fe_space_t.
A collection of reference_fe_t polymorphic instances is stored in the reference_fes(:) array. These instances are uniquely identified (within the local scope of fe_space_t) by their position in this array. The global FE space corresponding to a given field, with identifier f_id in the range 1; . . .; num fields (with Listing 27. The internals of fe_space_t and a selected set of its bindings. num_fields equal to n field above), is described by: (1) the triangulation member variable (the rationale underlying it being polymorphic is made clear in Sect. 10.2; (2) its restriction to each cell provided by the reference FE space defined by the reference_fe_t instance with identifier field_cell_to_ref_fes(f_id,c_id) in the collection; c_id is assumed to be a positive integer in 1; . . .; triangulation%get num cellsðÞ that uniquely identifies each cell.
The member variables used to handle the global DOF numbering are encompassed within Lines 18-27 of Listing 27. The global DOF identifiers are stored cell-wise, and field-wise within each cell, in the lst_dofs_gids(:) array, which is in turn (indirectly) addressed by the ptr_dofs_x_fe(:,:) array. In particular, the ones assigned to the local nodes related to field f_id on cell c_id start and end in position ptr_dofs_x_fe(-f_id,c_id) and ptr_dofs_x_fe(f_id?1, c_id)-1 of lst_dofs_gids(:), respectively, if f id\num fields, and in position ptr_dofs_x_fe(f_id,c_id) and ptr_dofs_x_fe(1,-c_id?1), respectively, if f id ¼ num fields. The number of DOFs of the global FE space corresponding to each field (excluding those that are subject to strong boundary conditions) is stored in the num_dofs_x_field(:) array.
The member variable in Line 15 stores a reference to a data type that describes the block layout currently selected (i.e., it can be changed on demand) for the global matrix and right-hand side vector of the linear system (or a sequence of them) required for the solution of the PDE system at hand. The role of block_layout_t in the global DOF numbering generation process will be illustrated in Sect. 10.3.
The data structures that let fe_space_t handle the evaluation of cell integrals are declared in Lines 23-29 of Listing 27. The set_up_cell_integration binding sets up them. The method is intended to be called by the user's program right before any cell integration loop. It ensures that any (scratch) data that can be computed on its final form in the reference cell is pre-computed for any of the triangulation cells while minimizing the number of integration data structures required for the particular scenario at hand. To this end, fe_space_t is equipped with three array containers of quadrature_t, cell_map_t and cell_integra-tor_t objects (see Lines 24, 26, and 28, respectively), which are indirectly addressed by the hash_table_t member variables with corresponding names. 27 This is required because fe_space_t supports, e.g., non-conforming FE spaces with variable order per cell. A unique identifier (dynamically generated within the scope of fe_space_t) is assigned to each of the integration objects that must be created. The hash_table_t instances let fe_space_t transform these unique identifiers into container array positions from which the integration objects can be fetched.
The set_up_cell_integration method loops over all cells. Sitting on a cell, it determines an appropriate quadrature to be used on that cell and its associated unique identifier. (See discussion in the next paragraph for more details.) If this quadrature has not been generated yet (i.e., if the hash table lookup fails), then a new quadrature is created on the next free position of the cells_quadratures(:) array container, and a new identifier-position pair is inserted into the hash table. Otherwise, the quadrature is fetched from this array. The same process is repeated for the cell_map_t and cell_integrator_t instances. The former ones are uniquely determined by the combination of the unique identifier quadrature_t just created/fetched and that of the reference_fe_t instance on top of the current cell (see Sect. 7). On the other hand, a cell_integrator_t instance has to be associated to each field within the current cell; the cell_integra-tor_t instance corresponding to a field is uniquely determined by the unique identifier of the quadrature_t just created/fetched and the one of the reference_fe_t associated to that field (see Sect. 8.4). Therefore, the unique identifiers of the cell_map_t and cell_integra-tor_t instances required for the evaluation of cell integrals over the current cell can be easily determined combining the ones corresponding to the instances from which they are created. We recall that the unique identifier of the refer-ence_fe_t instance on top of the current cell, c_id, for a given field, f_id, can be retrieved from refer-ence_fe_id=field_cell_to_ref_fes(f_id, c_id), while the reference_fe_t instance itself from reference_fes(reference_fe_id).
The allocatable array member variable in line 23 (with as many entries as triangulation cells) can be used by the user in order to (optionally) determine the degree of the quadrature to be used on each triangulation cell. This member variable is allocated and initialized (during fe_space_t creation) to a reserved flag that instructs set_up_cell_integration to use an automatic (default) strategy to decide the degree of the quadrature to be used on each cell. This default strategy relies on a deferred binding of reference_fe_t, named get_default_quadrature_degree, which typically returns the quadrature degree for which mass matrix terms are integrated exactly (see Sect. 8.1). 28 The strategy, in particular, walks over all reference_fe_t instances on top of the cell, and the one for which its (polynomial) reference cell functional space is of maximum order becomes ultimately responsible of creating the quadrature via an invocation to its create_quadrature deferred binding. Alternatively, the user may explicitly select the quadrature degree to be used on each cell. In such a case, create_quadrature is invoked to create a quadrature with the degree given by the corresponding entry in the cell_quadratures_degree(:); see Sect. 8.1. In any case (i.e., default or explicit quadrature degree), both the unique identifier of the reference_fe_t instance on top of the current cell and the quadrature degree are used to generate a unique identifier of the quadrature to be created/fetched.
On the other hand, Lines 32-38 of Listing 27 encompass those data structures required for the evaluation of (both boundary and interior) facet integrals; see Sect. 9. A very close rationale to the one underlying their cell counterparts is followed to set up these data structures. The set_up_facet_integration binding loops over all facets. Sitting on a facet, it determines an appropriate facet quadrature_t rule. The quadrature degree is either the default or a user-defined one (via the allocatable array member variable in Line 32). It also determines the unique identifier of the quadrature and of the rest of the facetintegration data structures, which are created as necessary, while handling their interactions. Both the topology of the two cells sharing the facet and the quadrature degree are used to generate a unique identifier of facet quadratures. The member variables in Lines 41-42 provide support to the implementation on the so-called fe_facet_iter-ator_t data type and will be covered in detail in Sect. 10.2. Finally, the member variable num_-fixed_dofs in Listing 27 is used by fe_space_t to count how many DOFs are subject to strong boundary conditions; see Sect. 10.4.

A Conceptual View of fe_space_t
Following the ideas presented in Sect. 7.1, fe_space_t offers a number of iterators to provide traversals over its objects, and uniform data access to its internals. Apart from iterators over cells and vefs, fe_space_t also provides traversals over facets by means of the so-called fe_-facet_iterator_t data type. This iterator is essentially required to implement the evaluation of jump terms in, e.g., error estimators or DG methods in a user-friendly manner. For reasons made clear in the course of this section, a design goal to be fulfilled by fe_space_t iterators is that they are able to provide access to the same data as their counterpart triangulation_t iterators (see Sect. 7.1), and that they are able to do so efficiently while avoiding duplication of code bounded to the latter ones. For example, fe_cell_iterator_t should be designed such that it is also able to provide the coordinates (in physical space) of the nodes describing the geometry of the cell, apart from the global DOF identifiers on top of it.
Let us first discuss the design of iterators over cells and vefs (as the one of both follows the same lines). These data types are defined in Listing 28, where set must be actually replaced by either cell or vef.
As shown in Listing 28, fe_set_iterator_t holds a polymorphic pointer to the fe_space_t instance to which it has to provide data access. Dynamic polymorphism is exploited here with extensibility and code reuse in mind. Any type extension of fe_space_t (e.g., the one suitable for distributed-memory environments), can also become the target of this polymorphic pointer, thus enabling reuse of data and code bounded to Listing 28. fe_space_t ''set'' (either cell or vef) iterators and the composition relationship with their counterpart triangula-tion_t iterators (set_iterator_t). fe_set_iterator_t with these extensions. Of special relevance in Listing 28 is the composition relationship among the data type being defined and set_itera-tor_t, i.e., its triangulation_t iterator counterpart (see Sect. 7.1). This lets fe_set_iterator_t to fulfill the aforementioned design goal, i.e., to provide a superset of data over the class it is composed of, while still being able to access to any data stored within the triangulation scope. fe_set_iterator_t also reuses from set_iterator_t the code underlying the sequential traversal over all objects of the set. Indeed, as many other TBPs of fe_set_iterator_t, init, next, and has_finished TBPs of fe_set_iterator_t are simply implemented as wrappers of their counterparts in set_iterator_t. (We remark that this is possible provided that fe_space_t is deliberately set up such that it shares with triangulation_t a consistent global numbering for cells and lower-dimensional objects.) At this point it is important to remark that the set_iterator_t instance that fe_set_itera-tor_t aggregates is also polymorphic (see Line 3 in Listing 28). As stated in Sect. 10.1 (in particular, see Line 12 of Listing 27), a fe_space_t instance is created from a polymorphic triangulation_t instance. The create binding of fe_set_iterator_t extracts the latter from fe_space_t, and then calls its create_cell_iterator binding (see Sect. 7.1), which becomes ultimately in charge of determining the dynamic type of the set_it-erator_t member variable of fe_set_iterator_t (apart from leaving the iterator positioned in the first object of the set). This lets fe_space_t (and its associated iterators) to be re-used with any type extension of trian-gulation_t (e.g., the one suitable for distributed-memory computers and/or h-adaptivity). Likewise, the free binding of fe_set_iterator_t relies on the free_cell_iterator binding of triangula-tion_t in order to safely deallocate any dynamic memory allocation performed during creation. We stress that, as in the case of triangulation_t iterators, both the create and free TBPs are not intended to be directly called by the user. Instead, triangulation_t provides a set of (public) TBPs (as many as different iterators) for this purpose.
The implementation of fe_facet_iterator_t is based on a very close rationale to the one of cell and vefs iterators, with subtle differences though; see Listing 29. Provided that fe_facet_iterator_t is a kind of fe_vef_iterator_t, it should provide the same set of data access methods of the latter (e.g., the cells sharing the facet). However, it should restrict the traversal to those vefs that are actually facets, and to be able to provide all data required for the implementation of jump terms over facets. As shown in Listing 29, fe_facet_iterator_t extends fe_vef_iterator_t. This automatically equips the former with the data access methods of the latter. On the other hand, it overrides those methods controlling the sequential traversals over the items in the set such that it restricts to facets, i.e., create/free/first/ next/has_finished in Listing 29. The implementation of these methods relies on its member variable facet_gid, and the facet_gids (  The actual set of TBPs of a fe_space_t iterator highly depends on the type of object being pointed to. For completeness, we now briefly discuss those TBPs in the set corresponding to cell and facet iterators, which provide support for the implementation of the two services of fe_space_t we are focusing on. These are in particular shown in Listing 30. This listing also includes the generic TBPs in Lines 35 and 68, although they will be discussed in Sect. 11.1. The TBPs in Lines 18-28, and 50-61 of Listing 30 let the user fetch from fe_space_t the integration data associated to the current cell and facet being pointed to, respectively. On the other hand, the update_integration bindings in Lines 6 and 47 perform those Listing 30. The fe_cell_iterator_t and its facet counterpart. computations required to update these data structures such that they hold shape function values and derivatives evaluated at (current) cell and facet (quadrature points) in the physical space. The former binding is implemented as shown in Listing 31. Finally, the get_permuta-tion_index TBP of fe_facet_iterator_t lets the caller to obtain the permutation index (see Sects. 3.16 and 9.3 for further details). The implementation of this method relies on the facet_permutation_indices(:) member variable of fe_space_t; see Line 42 of Listing 27. For a given facet with global identifier facet gid, facet_permutation_indices(facet_gid) holds the permutation index corresponding to the facet. We have decided to permanently store facet permutation indices for performance reasons. These can be reused over and over again (e.g., in a transient and/or nonlinear PDE problem) without the overhead associated to its computation on each traversal over the facets of the triangulation.
An update of the cell_map_t instance (associated to the cell pointed by the fe_cell_iterator_t instance on which this subroutine is invoked) is performed in Line 12 of Listing 31. It is followed by a loop over the number of fields of the PDE system at hand in order to update the cell_integrator_t for every field in Line 17. The update of the former requires that its nodes_coordinates(:) scratch member variable has been loaded with the coordinates in the physical space of the nodes describing the geometry of the cell at hand (see Sect. 8.3). This is in particular fulfilled in Line 10. The coordinates fetched by this call are actually stored within the triangulation. However, fe_cell_iterator_t can satisfy this query provided that it is composed of a cell_iterator_t instance; see Listing 28 and accompanying discussion. At this point, the reader should be already capable to grasp how the fe_facet_iter-ator_t counterpart of this subroutine is implemented, so that it is omitted here in order to keep the presentation short.
Going back to Listing 30, the binding in Line 16 lets the user fetch the field-wise global DOF identifiers that fe_space_t has associated to the node functionals on the current cell interior and its vefs. (The bindings in Lines 9-13 of Listing 30, however, assist fe_space_t on the generation of the global DOF numbering and their usage will be illustrated in Sect. 10 In Listing 32, p_1D_ip_array_t is assumed to be a data type with a single member variable, called p, declared as a pointer to a rank-1 integer(ip) array. For each field, the subroutine locates the region within the lst_dofs_gids(:) member variable corresponding to that field within the current cell, and then it associates to it the corresponding pointer in fe_dofs(:). At the expense of sacrificing type safety (in Fortran there is no mechanism to declare a pointer to be read-only), we avoid the costly re-allocation of user-level allocatable arrays that would be needed in the case of non-conforming FE spaces with highly varying degree polynomial spaces among cells.
To end up, the get_vef binding in Listing 30 sets up a fe_vef_iterator_t instance to point to the corresponding vef within the cell. As a consequence, one may navigate over the cells, its vefs, cells around these vefs, etc., using fe_space_t iterators all the way round.

Global DOF Numbering Generation
In this section, we discuss how fe_space_t coordinates the building blocks covered so far in order to generate a global enumeration of the DOFs describing the global FE space X h ¼ : X 1 h Â . . . Â X n h for general multi-field systems of PDEs. This process is encompassed within the gen-erate_global_dof_numbering binding of fe_s-pace_t (see Listing 27). The code of this method is shown in Listing 33. The block_layout dummy argument lets the caller to customize the global DOF numbering to be generated. 29 On the one hand, this data type specifies in how many blocks the user wants to split the (discrete) PDE system at hand. In particular, the user may select to generate a DOF numbering suitable for monolithic or blocked storage linear algebra data structures, with block_layout%get_num_blocks() returning one and a number larger than one, respectively. On the other hand, block_layout_t specifies the mapping of fields into blocks, with block_layout%get_block_id(-field_id) returning the block identifier the field with identifier field_id is mapped to. Provided that blocked linear algebra data structures in FEMPAR are addressed using row/column identifiers that are local to each block, block_layout equips the subroutine with the input necessary to generate a block-aware global DOF numbering, in which the DOFs belonging to fields of the first block are numbered first, followed by the ones of the second, and so on. We note that block_layout_t also holds inside how many DOFs are there per block (see Sect. 11.3). These latter quantities are computed within generate_-global_dof_numbering (see discussion in the sequel).
The   30 The last requirement has been introduced to include the concept of void FEs for multi-field problems in which some fields are not defined on the whole domain (see Sect. 6.5).
the vefs of the current cell. If the current vef has not been visited yet, then the current cell becomes its owner, and both the cell and the local identifier of this vef within the cell are registered in the corresponding work arrays. The global DOF identifiers associated to node functionals on this vef within the owner cell are allocated in Line 32 (as above starting from fields_current_dof(-field_id)). On the other hand, if the current vef has been visited, then the global DOF identifiers associated to node functionals on this vef within the current cell are fetched from the corresponding ones within the owner cell in Line 39. The binding called in this line encodes the permutations described in Sect. 3 reference_fe_t instance used in the current cell for field_id. This iterator lets it to traverse those node functionals owned by the vef with local identifier vef_lid (see Sect. 6.2), and thus determine the (relative) position in lst_dofs_gids(:) of the global DOF identifiers to be allocated for such node functionals. We note that the logical predicate in Line 16 is evaluated to .true. if the DOF at hand is actually free, i.e., not subject to boundary conditions imposed in strong form; see Sect. 10.4. Finally, we would like to stress that error checking statements and a major optimization that can be applied for the single-field single-block case are not shown in the code listings of this section in order to keep the presentation as simple as possible. Both are present in FEMPAR. In particular, for the aforementioned case, the global DOF numbering can be generated with a single loop over all cells (instead of two). The call in Line 14 of Listing 33 can be avoided, deferring the computation of the number of DOFs per field and block to the call in Line 15.
On the other hand, there is no need to generate a global DOF numbering from scratch when there is already one available, a permutation from the old to the new numbering could be computed and applied to lst_dofs_gids(:) by a single sweep over all cells. This optimization, however, is not present in FEMPAR, as indeed we did not find frequent the case where an application requires to change on-the-fly the block-layout of the system of PDEs at hand.

Strong Imposition of Boundary Conditions
In this section, we discuss the mechanisms that fe_s-pace_t provides in order to support the strong imposition of boundary conditions. In order to grasp why these mechanisms are needed and how fe_space_t is designed to provide them, we must first briefly introduce the approach chosen by FEMPAR in order to handle this type of boundary conditions. We will use the term ''fixed DOFs'' to refer to those DOFs sitting on the boundary whose values are constrained (i.e., subject to strong boundary conditions), and the term ''free DOFs'' to refer to the remaining ones. For simplicity, let us restrict ourselves to the Laplacian problem with inhomogeneous Dirichlet boundary conditions uðxÞ ¼ u D ðxÞ on C D discretized with grad-conforming FEs. 31 The discrete solution u h 2 X h can be split into two parts as The nodal values u a are the actual unknowns of the problem at hand. E h u D is a discrete Dirichlet data extension, which can be understood as the projection of a Dirichlet data extension Eu D ðxÞ introduced in Sect. 3.1. Its nodal values u D a are selected such that E h u D becomes a suitable boundary FE approximation of u D ðxÞ (e.g., a boundary FE interpolation). 32 The linear system to be solved in order to compute the nodal values of u h can be written as: Listing 36. Implementation of the generate_own_dofs_vef_numbering binding of fe_cell_iterator_t. 31 We stress, however, that the approach discussed in the sequel to handle the strong imposition of boundary conditions is applicable to more complex problems and discretizations, e.g., the Maxwell equations discretized with curl-conforming FE spaces. 32 It is assumed that the discrete Dirichlet data extension is zero on free DOFs, but other more general situations can also be accommodated.
where its coefficient matrix has as many rows as free DOFs, and its right-hand side is the FE discretization of the linear form in (3); see Sect. 3.1. In order to assemble (33), the process described in Sect. 8 has to be slightly modified. A sweep over all triangulation cells is still required. Sitting on a given cell K, the element matrix A K and vector f K are computed as usual. However, the rows/columns corresponding to fixed DOFs in A K are not assembled into the global matrix. The same applies to the entries of f K . However, f K has to be updated before assembly in order to reflect the contributions of strong boundary conditions (see the right-hand side of (33)). Fortunately, the users of FEMPAR are unburdened from these subtleties. These are hidden within the assembly generic binding of fe_cell_iterator_t; see Listing 30 and 39. Apart from adding the contributions of the current cell to the global coefficient linear system and right-hand side, this binding is in charge of computing the contribution to f K from strong Dirichlet boundary conditions. This poses two additional requirements on fe_space_t. In particular, (1) it should handle a global enumeration of free and fixed DOFs, while being able to distinguish among both kinds of DOFs; and (2) it should offer a suitable set of bindings to project/interpolate u D ðxÞ on the boundary to get E h u D .
In order to satisfy (1), fe_space_t splits the whole set of DOFs into free and fixed DOFs, and the DOFs within each subset are labeled separately from each other as f1; 2; . . .; jffree DOFsgjg, and fÀ1; À2; . . .; À jffixed DOFsgjg, respectively. (This is nevertheless an implementation detail that is never exposed to FEMPAR users.) In turn, free and fixed DOF values are actually stored into different arrays, so that they can be addressed separately using the corresponding global identifiers in the former and latter set, respectively; see Sect. 10.5.
The process that associates global identifiers to free DOFs has been already covered in Sect. 10.3. The one corresponding to fixed DOFs very much resembles the one for free DOFs. It is, however, restricted to vertices, edges, and faces of the triangulation that lay at the boundary, and it has to be equipped with support from the user that lets the process become aware of which DOFs sitting on the boundary are actually fixed. The fixed DOFs global enumeration process occurs during the initial set-up of fe_space_t; see create generic binding in Listing 27. This process is grounded on two different ingredients. On the one hand, the user can determine C sub-regions through the sets associated to vefs sitting on the boundary (see Sect. 7.1). For example, the user may decide to use set identifier 1 and 2 to split the vefs in C into those which belong to C D and C N , respectively. On the other hand, an abstract data type, called condi-tions_t, to be extended by FEMPAR users, lets users to customize the strong imposition of boundary conditions. In particular, with regard to the fixed DOFs global enumeration process, this data type offers a deferred binding that given a set identifier, provides a logical component mask. For each component of the PDE system, this mask provides whether the DOFs associated to vefs marked with this set identifier are fixed or free. For those FE spaces for which there is no DOF-to-component association (e.g., Raviart-Thomas or Nédélec FEs), only the first component in the mask is taken into account, and the rest neglected.
On the other hand, for 2), fe_space_t provides a set of methods that let the user interpolate/project u D ðxÞ on the boundary to get E h u D in a number of suitable ways. E h u D is ultimately stored within an instance of the fe_function_t data type; see Sect. 10.5. Boundary projectors involve the solution of a boundary mass matrix problem where integrals over boundary facets have to be evaluated; see Sect. 9. Again, all these bindings rely on the conditions_t abstract data type. In particular, given a boundary vef set identifier, a deferred binding of this data type returns a user-defined (scalarvalued) function to be imposed for each component of the PDE system at hand. In the case of Raviart-Thomas or Nédélec FEs, the d scalar-valued functions corresponding to its components are used to reconstruct the vector-valued function, whose tangential or normal component, respectively is to be imposed.

Global FE Functions and Their Restriction to Triangulation Cells/Facets
In this section, we introduce a convenient software abstraction in our OO design, referred to as fe_func-tion_t, which represents a global FE function u h 2 X ¼ : This data type and a subset of its TBPs (in particular, those that are relevant for the present section) are presented in Listing 37.
In Listing 37, the free_dof_values and fixed_dof_values are used to store u h and E h u D , respectively; see Sect. 10.4. The former is a polymorphic member variable of type array_t; see Sect. 11.1. Relying on the set of deferred bindings offered by array_t, the code bounded to fe_function_t can be written independently of how the entries within the concrete implementation of array_t are laid out in memory, enabling code re-use to a large extent. For example, scalar_array_t is a concrete realization of array_t that uses monolithic storage, while block_array_t stores the entries organized into blocks (see Sect. 11.1 for more details). On the other hand, fixed_dof_values is a member variable of static type scalar_array_t; see Sect. 11.1. 33 Fixed DOFs belonging to different fields might be indeed assigned intermixed global identifiers, significantly simplifying the enumeration process. In particular, a single sweep over all boundary objects suffices, in contrast to Listing 33, where two sweeps over all cells are required in order to generate a block-aware global numbering. From our experience, it turns out that neither blocked storage nor a data structure suitable for distributedmemory environments are strictly required to store E h u D , so that we can prevent the overhead associated to run-time polymorphism when dealing with fixed_dof_values. 34 A fe_function_t instance is created from a fe_space_t instance (to which it belongs); see signature of the create binding in Listing 37. This binding selects the dynamic type of free_dof_values, and therefore its storage layout, according to the one currently selected for the PDE system at hand; see block_layout member variable in Listing 27. The entries of free_dof_values can be determined in a number of ways. They might become the unknowns of a problem to be solved (e.g., by a preconditioned iterative linear solver or sparse direct solver), or computed from an expression involving other fe_function_t instances, e.g., offers an expression syntax for global FE functions grounded on overloaded operators.) Apart from these, fe_space_t offers a pair of generic bindings, referred to as interpolate and project, to compute the DOFs nodal values of u h by either interpolation (using the expression in (9)) or projection (e.g., a global L 2 projection) into the FE space of a user-defined function u(x). 35 Each of these generic bindings is overloaded with three different regular bindings suitable for scalar, vector, and tensor-valued functions, respectively. The interpolate bindings in fe_space_t can be written independently of the reference FE by using a TBP associated to refer-ence_fe_t that computes the local interpolator in (6).
Apart from the software representation of a global FE function, FE codes typically need a mechanism that, sitting on a cell or facet of the triangulation, provides the values, gradients, etc. of a global FE function u h ¼ u 1 h Â . . . Â u n h evaluated at quadrature points in the physical space. To this end, FEMPAR offers a set of data types, referred to as cell_fe_function_type_t and facet_fe_-function_type_t, with type=scalar,vector,tensor, that represent the restriction of u i h to a given triangulation cell and facet, respectively. The two code snippets in Fig. 9 illustrate the usage of these data Listing 37. The fe_function_t data type. 33 In parallel environments, every processor only stores the fixed DOF values that belong to its associated subdomain. 34 Some of the algorithms in charge of computing E h u D may require a different storage layout from the one of scalar_array_t (e.g., blocked storage and/or suitable for distributed-memory computers), and/or restrict themselves to those fixed DOFs of E h u D corresponding to a given field (or set of fields). In such a case, E h u D is scattered in place back and forth into temporary work space with the appropriate layout for the algorithm at hand in charge of computing its entries (e.g., a serial or parallel distributed-memory boundary mass problem iterative solver). It turns out that it is not such a high performance penalty provided that such algorithms already require to perform a sweep over boundary facets (e.g., in order to assemble a boundary mass matrix). During this sweep, the fixed DOFs in question can be already counted and identified. 35 Analytical scalar, vector, and tensor-valued functions are also supported in FEMPAR through the classes scalar_function_t, vector_function_t, and tensor_function_t, respectively. To implement an analytical scalar function f ðxÞ in FEMPAR, the user has to extend scalar_function_t methods get_value, get_gradient (if used), etc., with the analytical expression, for a given point t that represents x. We proceed analogously for vector and tensor fields. These data types are very simple and we omit their description here. types, where we are assuming that u i h belongs to a global FE space of vector-valued functions.
There are three worth noting remarks in these two code snippets. First, the update binding of both data types rely on the gather_nodal_values binding of fe_-function_t; see Listing 37. The latter equips cell/facet FE functions with the ability to restrict (gather) the nodal values of u i h from global to local arrays (stored as private scratch data within cell/facet FE function data types), while taking care of strong boundary conditions. Second, the update bindings require a procedure that, given the shape functions, first derivatives, etc., evaluated at quadrature points in physical space, and the nodal values u i h restricted to the current cell, provides the shape function values, gradients, curls, etc., of the FE function at these quadrature points. This service is provided by reference_fe_t by the set of evaluate_fe_function... deferred bindings in Lines 63-68 of Listing 6. We note that fe_-function_t can extract the first set of data from the cell_integrator_t and facet_integrator_t instances accessible through fe_cell_iterator_t and fe_facet_iterator_t (provided on input to update), respectively. Third, facet FE functions provide u i h values, gradients, etc., at facet quadrature points from the perspective of its two surrounding cells. This make sense for functions u i h belonging to non-conforming FE spaces, which might be discontinuous across cell boundaries. Facet FE functions should also cope with the fact that the coordinate systems of its surrounding cells might not be aligned in physical space, so that a different local numbering might be assigned to facet quadrature points from the perspective of either cell; see Sect. 9.3 for an exposition of the strategy followed to solve this issue.

Building FE Affine Operators
In this section, we introduce the software abstractions on which the construction of the algebraic problem (10) in Sect. 3 relies. These software abstractions, and their relationship, are depicted in Fig. 10. The main design goal underlying the proposed software architecture is as follows. In the seek of code reusability and extensibility, FEMPAR users should have at their disposal a unique entry point data type and associated bindings in order to build their FE linear system, no matter whether a scalar or a system of PDEs, no matter whether the linear algebra data structures holding the linear system entries are either scalar (monolithic) or blocked, and no matter how they are laid out in memory (centralized, distributed-memory). In FEMPAR, this unique entry point data type is referred to as fe_affine_operator_t.
Mathematically, fe_affine_operator_t represents the affine operator in (5), obtained from the discrete weak formulation of the linear(ized) problem (4). As introduced in Sect. 3.6, the operator can be represented (after defining bases for trial and test FE spaces) with A and f defined in (10). The solution of the FE problem is the only root of this operator (as soon as the FE problem is nonsingular).
In order to seek the aforementioned goal, fe_affine_operator_t relies on an abstract data type, referred to as assembler_t (see Fig. 10). In a nutshell, assembler_t offers a set of FE-assembly tailored, data structure neutral, deferred TBPs, e.g., to assemble the contributions of a cell or facet integral into the linear system coefficient matrix A and/or right-hand side f. The subclasses of assembler_t are the ones ultimately responsible to deal with the details underlying the Fig. 9 User-level code snippets illustrating the usage of the cell_fe_function_type_t (left) and facet_fe_function_type_t (right) data types particular linear algebra data structures at hand. The latter ones offer FE-assembly neutral interfaces to inject new entries or add contributions to them, such that this software piece becomes reusable and separable, e.g., to be used in third party software projects (not necessarily FE-oriented) as a standalone software subsystem. FEMPAR offers a rich set of linear algebra data structures, e.g., data structures organized by blocks, which enable the implementation of block preconditioners for multiphysics problems (see, e.g., [43][44][45]). Apart from those required to handle the linear coefficient matrix and right-hand side of the system, fe_affine_operator_t also interacts with other data types required to deliver its life cycle (i.e., its auto-generation). In particular, A and f entries are computed according to the expressions in (10). These expressions involve a FE space (fe_space_t) and the discrete (bi)linear forms of the problem at hand. To express in software this second ingredient, we introduce the dis-crete_integration_t abstraction; see Fig. 10.
We have structured this section as follows. In Sect. 11.1, we first present the assembler_t abstract data type, and the rationale underlying the design of the linear algebra structures it is grounded on. Next, in Sect. 11.2, we introduce the discrete_integration_t abstract data type that ultimately is in charge of performing the integration of the (bi)linear forms and assembly of the discrete affine operator. We show a particular implementation of this data type (i.e., a subclass) for the Galerkin approximation of the Stokes problem. Finally, the fe_affine_-operator_t data type is described in Sect. 11.3.

Linear Algebra Data Structures and Associated Assemblers
Linear algebra in FEMPAR relies on a pair of data type hierarchies rooted at the mathematical abstractions of a linear algebra operator and a vector, and represented in software by means of the linear_operator_t and vector_t abstract data types, respectively. These abstract data types let a number of linear algebra algorithms within FEMPAR (e.g., iterative linear solvers and block preconditioners for PDE systems) to be expressed independently from the actual implementation of the concrete matrix and vector data structures being used, such as block layout (if any), storage (e.g., dense or sparse storage format) or memory layout (e.g., local or distributed-memory), enabling code re-use and extensibility to a large extent. An abstract expression syntax that allows the construction of complex expressions involving operations among operators and/or vectors is also provided. This enables the implementation of new algorithms in a compact manner. However, because these linear algebra algorithms are not discussed herein but postponed to a further work, the description of the data types and associated methods in these hierarchies will be restricted to what is necessary to describe the assembly of the FE affine operator.
The sparse_matrix_t data type can be found at an intermediate level in the hierarchy rooted at linear_-operator_t. This is a crucial data type in FEMPAR, which represents a scalar, non-distributed, sparse matrix. Its design follows the ideas presented in [92]. This design (re)uses the state OO design pattern [88] to hide the actual sparse matrix storage format to the user. Following this pattern, sparse_matrix_t is composed of a polymorphic member variable of (declared) type base_sparse_matrix_t. Its dynamic type can be thus changed at runtime (via re-allocation). This dynamic type represents the storage at hand being used. Current subclasses of base_sparse_matrix_t include coo_sparse_matrix_t, csr_sparse_matrix_t, csc_sparse_matrix_t, corresponding to the coordinate list (COO), the compressed sparse row (CSR), and the compressed sparse column (CSC) sparse matrix storage formats [93], respectively.
The life cycle of a sparse_matrix_t instance is as follows. The user first invokes its create TBP, in which one solely specifies the size of the matrix, i.e., the number of rows and columns. This method, however, triggers a number of subsequent actions. In particular, it allocates its dynamic type to the one corresponding to the COO format, and leaves it ready for the injection or addition of contributions to the entries of the matrix. Although not compressed, this format is ideally shaped for the injection or addition of contributions to the entries of the matrix. These are simply pushed back into member arrays that can grow dynamically during the integration/assembly loop (via a judiciously reallocation strategy to trade off cost and memory). Other sparse storage formats, as the CSR storage implemented in the csr_sparse_matrix_t data type (also a type extension of base_sparse_matrix_t), although more memory efficient, require a predefined sparsity pattern, which has to be precomputed. They are not thus well suited for the dynamic build up process of the matrix. At this point the reader should note that, for such inflexible storage formats, one typically needs an accurate estimation of the maximum number of nonzeros per each row (or column) to be memory efficient. This estimation, however, can only be a quite large upper bound in complex scenarios (e.g., hp-adaptive methods in 3D, among others).
Once the build up process finishes, the user can call a method specially designed to leave the sparse_ma-trix_t instance ready for being used (e.g., to perform operations with it). This involves a compression process, in which duplicated entries are either summed up, or filtered (as selected by the user) and a transformation of the COO storage format into the storage format that the user actually requires (e.g., CSR). For simplicity, we refer to this stage as the ''compression'' of the matrix. Once the sparse_matrix_t instance is in this final state, it is still possible to insert or add contributions to its entries, as far as they belong to the sparsity pattern resulting from the first build up process. Thus, e.g., if a transient and/or nonlinear problem is to be solved and the triangulation of the domain does not change, the assembly in COO format will only be performed at the first nonlinear iteration of the first time step.
As shown so far, the software architecture of sparse_matrix_t is such that several (current and future) storage formats are possible within a single framework. This flexibility is convenient for two main reasons. First, no given storage format is likely to be uniformly better in performance across all possible operations and computer architectures. Second, FEMPAR interoperability with external software dramatically increases. If a new library, that uses its own storage format, is to be integrated, only a new extension of base_sparse_-matrix_t has to be added, while leveraging dozens of thousands of lines of code already written. Apart from sparse_matrix_t, there are other sparse matrix data types available, suitable to handle blocks and/or distributed-memory computers. All these data types are essentially composed in some way or another of sparse_matrix_t instances. For example, block_sparse_matrix_t is composed of nblocks 2 sparse_matrix_t instances; see Fig. 10. It, however, provides a set of specialized TBPs that only apply in the blocked case, e.g., the get_block TBP that lets a client to retrieve the sparse_matrix_t instance corresponding to a given block of the matrix.
The counterpart of sparse_matrix_t in the vector case is referred to as scalar_array_t. It represents a scalar, non-distributed, linear algebra vector, with its entries stored explicitly in a simple (Fortran intrinsic) allocatable array. However, provided that it does not have to exploit sparsity, the code bounded to this data type is significantly simpler to the one bounded to sparse_-matrix_t. It is equipped with a pair of generic bindings, with signatures coming in different flavours, in order to insert or add contributions to the vector. Likewise, there are other vector-like data types available suitable to handle blocks and/or distributed-memory computers. For example, block_array_t is composed of nblocks scalar_array_t instances; see Fig. 10.
Apart from the linear algebra data structures so far, we need the additional data type assembler_t, which offers FE-assembly tailored signatures to fe_affine_-operator_t. The interface of its deferred TBPs, which its extensions, e.g., scalar_assembler_t and block...assembler_t, implement, are shown in Listing 38. assembler_t has to be ''general enough'' to handle many storage layouts and it is in charge to isolate fe_affine_operator_t from implementation details. With that purpose in mind, it is composed of a (polymorphic) matrix_t and a (polymorphic) array_t instance. These are in turn abstract data types rooted at all the matrix and array data types seen so far, respectively. The set of deferred TBPs of these two abstract data types is designed (on purpose) to be not sufficiently rich to handle the whole life cycle of the concrete matrix and array instances. The high heterogeneity of the concrete subclasses of ma-trix_t and array_t precludes it. This set of TBPs is, in particular, restricted to allocation of memory for its entries, initialization of its entries to a given value (e.g., initialization to zero), and deallocation of any internal memory. These three operations are required by fe_affine_operator_t during the deployment of its life cycle. The bulk of the life cycle of the concrete subclasses of matrix_t and array_t is handled by the subclasses of assembler_t. This is how it should be, provided that assembler_t subclasses are the ones aware of the concrete details of the corresponding ma-trix_t and array_t subclasses. Besides, by doing this, we can overcome the overhead associated to dynamic runtime polymorphism, provided that the binding of fine-grain calls to those TBPs injecting or adding contributions to the matrix or the array can be determined at compilation time.
Going back to Listing 38, observe that assem-bly_array (resp., assembly_matrix) takes an intrinsic Fortran array (resp., rank-2 array) as dummy argument for the element vector (resp., matrix). Besides, it also gets the global DOFs identifiers on top a single cell, or those corresponding to cells surrounding the facet (see Lines 23,35 and 36 in Listing 38). In the case of scalar_assembler_t, the implementation is made using the TBPs provided by scalar_array_t in order to add contributions to its entries and the corresponding TBPs of sparse_matrix_t. In the case of block_assembler_t, the implementation is made by looping through the blocks, obtaining a reference to the current block with the get_block TBP, and using the corresponding TBPs as in the previous case. The assembly_array and assembly_matrix TBPs are used by the fe_cell_iterator_t and fe_ facet_iterator_t data types to implement their assembly TBPs (see Lines  signature of the latter TBPs. These are the ones actually used by the user in the type extension of dis-crete_integration_t, as described in Sect. 11.2. Finally, the compress_storage deferred TBP of assembler_t lets fe_affine_operator_t to signal that the build up process of the linear algebra data structures has already finished and that they can already be ''compressed'' into its final stage. We stress that the software architecture presented in this section provides uniform assembly interfaces to the client that are completely independent of the underlying implementation of linear algebra data structures. The subclasses of assembler_t are in charge of the management of blocks (if any), whereas sparse_matrix_t is in charge of the management of the storage schemes.

Discrete Integration of FE Operators
In this section, we introduce the abstract data type discrete_integration_t (see Listing 40). It defines the generic integrate binding, which is overloaded by the integrate_galerkin and inte-grate_petrov_galerkin deferred TBPs, depending on the number of fe_space_t instances being passed to them (see, e.g., Line 8 of Listing 40 for the interface corresponding to the Galerkin case). A user that wants to implement a FE problem must extend this data type and overwrite the TBP to be used (Galerkin or Petrov-Galerkin) in the user-defined subclass. In the overridden method, the user must implement the evaluation of the entries of A and f as the numerical integration of the discrete bilinear and linear forms as in (10) (see Sect. 3).
Based on our experience, the integration part of a FE code must exhibit a huge level of flexibility. Every time one wants to consider a new set of PDEs or a new expression of the discrete bilinear form, this component must be modified. It must also have the ability to integrate general time integration schemes that can require functions in an arbitrary number of steps, deal with nonlinear problems that involve the need to evaluating FE functions in the integration of the discrete forms, or including variable physical coefficients of body force terms determined through analytical functions. As a result, any rigidity at this level must be eliminated. Indeed, the discrete_in-tegration_t abstract data type only forces its subclasses to adhere to the signatures of the deferred TBPs overloading integrate, and has no member variables that subclasses are forced to handle. Using the design previously sketched, the user has absolute flexibility to design its own discrete_integration_t subclass, adding the attributes and methods that can be required to integrate and assemble the discrete forms, e.g., by adding an arbitrary number of fe_function_t and *_func-tion_t instances (and corresponding setters to be used at the driver level) that can describe physical properties, previous time step values, the solution at the previous nonlinear iteration, etc.
The integration of cell-wise terms of the (bi)linear forms is accomplished by traversing through all the cells using a fe_cell_iterator instance (see Sect. 10.2), which has access to (1) all the cell integration data (see Sect. 8) required to compute the local cell contributions in (11) and (2) the local-to-global DOF numbering needed for the assembly in the global linear algebra data structures. Analogously, the integration of facet terms, e.g., the ones in (20) for DG formulations, requires the use of a fe_-facet_iterator_t instance to traverse through the facets and integrate the corresponding facet terms. The method integrate is called during the execution of the numerical_setup TBP of fe_affine_opera-tor_t. It is in fact the fe_affine_operator_t the one that decides whether to invoke the Galerkin or Petrov-Galerkin integration, depending on whether one or two FE spaces have been passed as actual arguments (the second one being optional) in its create binding (see Line 15 of Listing 42). Analogously, the FE space(s) are also passed as actual argument(s) to the integrate_* bindings, since they will be needed at any integration step (see Line 8 of Listing 40 for the Galerkin case).
For illustration purposes, we present in Listing 41 an example extension of discrete_integration_t. It shows the implementation of the deferred procedure in-tegrate_galerkin for the approximation of the Stokes problem using a Galerkin method. This data types will be used in the example driver presented in Sect. 12 for the inf-sup stable Taylor-Hood mixed FE method (see Listing 41). 36 As commented above, the integration of the (bi)linear forms requires the cell integration machinery, which is provided by fe_space_t through the creation of the fe_cell_iterator_t in Line 19 of Listing 41. Apart from controlling the loop over cells (Lines 24 and 62), fe_cell_iterator_t provides the numerical quadrature, which is in turn required to get the number of integration points (line 31), and its associated weights (line 32). It also provides the determinant of the Jacobian of the cell map (line 32), and the shape functions and gradients at Lines 28 to 30 (see (13) and (14)). The implementation of the (bi)linear forms is very close to the blackboard expression, making it compact, simple, and intuitive. This is possible through the definition of the vector_field_t, and tensor_-field_t data types, together with their corresponding expression syntax available in FEMPAR. As it was carefully discussed in Sect. 8.5, it is achieved using operator overloading for different vector and tensor operations, e.g., the contraction and scaling operations. The symmetric_part (used at Lines 35 and 38), double_contract (used at line 40) and trace helper stand-alone functions (used at Lines 36 and 49) are also offered to make tensor operations easy. We also note that this implementation is also efficient, since all these operations are made without any dynamic memory allocation/deallocation.
Finally, the fe_cell_iterator_t also offers a TBP to assemble the element matrix and vector into the assembler and to impose strong Dirichlet conditions (line 66) using the perturbation in (3) (See Sect. 10.4). The Dirichlet data is extracted from a fe_function_t that represents E h u D , which must be an attribute of the concrete discrete_integration_t. For non-conforming FE spaces, the formulation requires also a loop over the facets to integrate DG terms. It can be written in a similar fashion using the tools described in Sect. 9. In this example, the stokes_galerkin_integration_t extension has the attribute force, which is used in Line 56 to integrate Listing 40. The abstract data type discrete_integration_t and its deferred TBPs. 36 We note that the Stokes subclass of discrete_integra-tion_t in Listing 41 implements the Galerkin approximation for this problem but it is independent of the FE space being used. It can be re-used for any conforming inf-sup stable mixed FE method, e.g., Taylor-Hood, conformal Crouzeix-Raviart, MINI element, etc. The choice of the mixed FE space will be determined by the user in the driver, when building the Cartesian two-field FE space. the right-hand side. It is a vector field described by an instance of the vector_function_t data type.

The FE Affine Operator Abstraction
A (simplified) declaration of the fe_affine_opera-tor_t data type is shown in Listing 42. The fe_affine_operator_t is created from a single fe_space_t instance, or even two for Petrov-Galerkin formulations; the second instance is optional and, when it is not passed, the Galerkin method is used, i.e., the same FE space is used for trial and test spaces. The user can (optionally) configure a desired block layout. Given a Cartesian product FE space X 1 h Â . . . Â X n field h for a multifield problem with n field fields (see Sect. 3.11), the block layout represents a partition of fields into subsets. 37 It is described through the argument array field_blocks of size num_fields equal to n field , which indicates the block to which each field is assigned; by default, the one-block case is used. e.g., For the Stokes problem in Example 3.2, one can consider a monolithic block layout with only one block that includes both the velocity and pressure field (field_blocks= [1,1]), or two one-field blocks (field_blocks= [1,2] or [2,1]). Additionally, the user must provide additional information about the diagonal blocks, namely (1) whether the block is symmetric or not, (2) whether symmetric storage wants to be used for the block or not, and (3) whether the block is positive definite, semi-positive definite, or indefinite. The user can optionally provide the array of logicals field_coupling (of size num_fields Â num_fields); the position ði; jÞ determines whether the matrix entries related to trial/test functions of the FE space i and FE space j are always zero (in this case, the coupling is false) or not. For the Stokes problem and the Galerkin method, the only entry that is false (no coupling) is the pressure-pressure entry. When this array is not provided, the case by default is that all fields are coupled. It only implies more memory consumption, e.g., to store the zero entries in the pressurepressure block for the Stokes problem.
The block layout information is stored in the data type block_layout_t, sketched in Listing 43, which stores the arrays field_blocks and field_coupling. It is created in the binding that creates the fe_affine_op-erator_t. It also stores a block-wise DOF numbering generated by the fe_space_t instance, which is instructed to do so by passing the block_layout_t 38 when calling its TBP generate_global_dof_numbering, described in Sect. 10.3.
The fe_affine_operator_t also holds a polymorphic pointer to an assembler_t instance. Its dynamic type is selected during the creation phase depending on the number of blocks, the storage layout required, and the (parallel or serial) environment. Finally, a polymorphic pointer to an instance of declared type discrete_in-tegration_t is also stored (see line 11 of Listing 42). After the creation phase, the fe_affine_operator_t is ready for its setup. Thanks to the design of the linear algebra data structures in FEMPAR, it does not require a symbolic setup, i.e., to precompute a (potential) sparsity Listing 42. The fe_affine_operator_t data type. 37 The actual ordering of the fields in the Cartesian FE space is determined by the user in the creation of the multi-field FE space, which must be consistent with the implementation of the discrete weak form. See, e.g., the creation of the mixed Taylor-Hood FE space in Lines 11-21 of Listing 46, where the first field is the velocity field and the second one is the pressure field, and the integration of the weak form, e.g., in Lines 34, 37, and 42 of Listing 41, where this numbering is respected. 38 The block-wise numbering creates independently the DOF numbering of every block. Thus, DOFs of different blocks can have the same block-wise DOF label. pattern. The numerical_setup TBP at line 17 of Listing 42 calls the integrate_galerkin TBP of discrete_integration when the pointer to trial_fe_space is not associated or inte-grate_petrov_galerkin otherwise, as discussed in Sect. 11.2.

Driver Example for the Stokes Problem
In this section, we describe the software architecture of a driver program that approximates the solution of the Stokes problem. To this end, it implements a Galerkin FE method grounded on a ''static'' (i.e., non-adaptable) conforming mesh and inf-sup stable FE spaces. In particular, we consider a conforming FE space V h Â Q h , where V h is a gradconforming Lagrangian space of order k þ 1, and Q h , a grad-conforming Lagrangian space of order k, i.e., the mixed Taylor-Hood FE [5]. 39 It is up to FEMPAR users to decide how to design the software architecture of their main driver program. Any driver program has nevertheless to follow the typical stages needed in a simulation pipeline based on FEs. In the seek of uniformity, the architecture presented in Listing 44 and 45 is recommended to FEMPAR users. The main program unit relies on a number of driver-level module units, which are not part of the FEMPAR library but developed by the user specifically for the problem at hand. Each of these modules defines a driver-level derived data type and its TBPs. A central derived data type, called stokes_-driver_t in this example, is designed to drive all the necessary steps. In particular, it offers a public TBP, called run_simulation, on which the driver program relies to perform the actual simulation. The driver program is therefore as simple and concise as shown in Listing 44.
The main data type of the driver, stokes_driver_t, is shown in Listing 45. It is equipped with a set of member variables of type already described in previous sections; see comments on the right-hand side of each member variable. The data type solver_t in Line 11 does not exist in FEMPAR as such. There is actually a complete set of data types that provide interfaces to high-end third party sparse direct solvers. Besides, we have developed our own abstract implementation of iterative linear solvers (including, e.g., the conjugate gradient or GMRES Krylov subspace solvers). The convergence of these solvers can be accelerated using advanced preconditioners grounded on the Multilevel Balancing Domain Decomposition by Constraints (MLBDDC) preconditioner [34,37]. The description of the linear solvers software subsystem deserves considerable space and is postponed to a future work. In this example, it has to be understood as a data type that provides the necessary services required to implement the solve_system TBP at Line 20 of Listing 45. The data type stokes_conditions_t at Line 9 extends con-ditions_t in Sect. 10.4. It encodes the strong Dirichlet boundary conditions data for this particular operator. The member variable parameter_list (see Line 4) is a parameter dictionary of \key; value [ pairs. Its implementation is provided as a stand-alone external software library called FPL [86]. The member variable stokes_parameters (see Line 3) is a user-defined data type that encapsulates the interaction with a command line parser provided by the FLAP software package [94]. Both of them are used to implement the TBP in Line 14, which parses the arguments given by the user in the command line, and transfers them into the aforementioned parameter_list member variable.
The run_simulation TBP (called from the main program in Line 8 of Listing 44) is implemented with the help of the private TBPs in Lines 17-21 of Listing 45. The setup_triangulation TBP invokes the create TBP of static_triangulation_t. Depending on the command-line parameter values, the user may select to automatically generate a structured/uniform triangulation Listing 43. The block_layout_t data type.
for simple domains (e.g., a unit cube), currently of brick (quadrilateral or hexahedral) cells, or read it from a mesh data file, e.g., using the GiD unstructured mesh generator [91]. The FE space is built in setup_fe_space TBP, sketched in Listing 46.
An array with base type p_reference_fe_t, a data type that wraps a polymorphic pointer to a refer-ence_fe_t instance, is allocated in Line 8 of Listing 46. The reference_fe_t instances for the velocity and pressure fields are created by calling make_refer-ence_fe in Lines 11 and 21, respectively; see Sect. 6.4. The interpolation order of the numerical scheme is read from command-line in Line 5. We select order equal to k þ 1 and k in Lines 11 and 21, respectively. The dummy argument continuity determines whether X admits a trace operator. In this particular example, we could consider continuity=.false. if we wanted to use a discontinuous pressure space. The create TBP of fe_space_t (Line 35) performs the composition of the reference FEs to build the Cartesian product space X h . Finally, we call the set_up_cell_integration TBP of fe_space_t in Line 38 to set up all the data structures required to evaluate cell integrals in Listing 40.
The implementation of the setup_fe_affine_operator binding is shown in Listing 47. It first invokes the create TBP of fe_affine_operator_t in Line 6. We state monolithic storage for the global coefficient matrix (Line 13), that it is symmetric (Line 9), that we want symmetric storage, i.e., to only store its upper triangle (Line 8), and the fact that it is indefinite (Line 10). The definition of field_coupling in Line 14 reflects that the pressure diagonal block is null. We also pass an instance of fe_space_t in Line 11 and an instance of the subclass stokes_integration_t in Line 12.
Before we set up the operator in Line 27, we create a fe_function_t instance in Line 18. In Line 19, by means of the services provided by fe_space_t, we interpolate the analytical function to be prescribed on the boundary for the velocity field (retrieved from stokes_conditions), fixing the strong Dirichlet DOFs of the fe_function_t instance at hand. As a result, this FE function represents E h u D , with the zero extension to free DOFs; see Sect. 10.4. This FE function is passed to the stokes_integration_t instance in Line 24. Finally, we trigger the operator auto-construction in Line 27.  The solve_system TBP (see Line 20 of Listing 45) invokes either a direct or preconditioned iterative solver to obtain the free DOFs nodal values of our FE function (see Sect. 10.5). Provided that this%solution on input to solve_system is such that it vanishes on free DOFs (see discussion in previous paragraph), a common practice used in FEMPAR drivers to save space is to re-use the space devoted for free DOFs in this%solution to store the free DOFs nodal values of the solution of the problem at hand. We stress that all solvers in FEMPAR are such that they only solve for free DOFs. In our experience, this decision dramatically simplifies the development of some preconditioners, provided that they can be developed without taking care of strong Dirichlet boundary conditions. Finally, the write_solution TBP (see Line 21 of Listing 45) is in charge of the generation of simulation results in data files for later visualization using, e.g., VisIt [95] or Paraview [96]. To this end, write_solution relies on a format independent, extensible abstraction, referred to as output_handler_t. It lets the user to register an arbitrary number of FE functions (together with the corresponding FE space these functions were generated from) and cell data arrays (e.g., material properties or error estimator indicators), to be output in the appropriate format for later visualization. Among its responsibilities, this (abstract) data type generates the data to be written to the (potentially parallel-distributed) file system in neutral, cell-oriented data structures, dealing with (potentially) non-conforming (discontinuous), and variable degree FE spaces among cells. The user may also select to apply a differential operator to the FE function, such as divergence, gradient or curl, which involve further calculations to be performed on each cell, or to customize those cells to be output (e.g., only those that belong to the interior of the geometry in unfitted FE simulations) via their own implementation of cell iterators.
The generation of the actual data files in the appropriate format is in charge of the implementations (extensions) of output_handler_t. FEMPAR currently offers two implementations of output_handler_t (although many others could be implemented as well by the growing community of FEMPAR developers given the extensible software architecture designed). vtk_output_han-dler_t lets the user to generate their data in the standardopen model VTK [97]. It currently relies on Lib_VT-K_IO [98], which (by now) does not actually exploit parallel MPI I/O but instead uses a naive single file per MPI task scheme. vtk_output_handler_t is Listing 46. The implementation of the setup_fe_space binding for the Stokes problem. therefore the recommended option for serial computations or parallel computations on a moderate number of processors. The second one, xh5_output_handler_t, lets the user generate their data in XDMF [99]. XDMF separates the description of the raw data, referred to as ''light data'', from the data itself, referred to as ''heavy data''. The light data is expressed using a set of XMLbased constructs that are suited to represent the distributedmemory data structures in FEMPAR. XDMF in turn supports the heavy data to be stored using HDF5 [100]. HDF5 is, among others, a data model and file format designed with the parallel I/O data challenge in mind. By means of a set of supporting open source libraries, referred to as parallel HDF5 libraries, FEMPAR takes advantage of the underlying distributed file system without having to deal with the high complexity of other lower-level implementations, such as raw MPI I/O. In particular, the latter service is provided by XH5For [101], a stand-alone software library, which we developed from scratch, and lets the user to read/write parallel partitioned FEM meshes taking advantage of the Collective/Independent MPI-IO provided by the PHDF5 library for the efficient generation of the vast amount of data typically resulting from a large-scale scientific computing simulation.

Conclusions
In this work, we have thoroughly described the approach that we have followed in FEMPAR in order to abstract in software the numerical approximation of problems governed by PDEs using FE methods. The mathematical framework of FEs has been split into a number of (mathematically motivated) derived data types and their interaction, resulting into a well-separated, robust, and stable set of customizable software abstractions for the development of widely applicable FE solvers. These tools equip FEMPAR users with the machinery needed to perform all the steps in the simulation pipeline, including mesh import/generation, DOFs enumeration, evaluation/ assembly of the algebraic system of linear equations via FE integration, solution of the linear system, and output of computational results in the appropriate format for later visualization. In order to achieve this goal, the software architecture of FEMPAR has been thoroughly designed by means of advanced OO software re-engineering techniques (including the recurrent application of OO design patterns [85,88]) in order to increase its ease of use, extensibility, flexibility, and reusability. FEMPAR software architecture has been implemented using the latest OO features of the Fortran03/08 standard, namely, information hiding and data encapsulation, inheritance via type extension, and dynamic run-time polymorphism. This version of the Fortran standard is already widely (and robustly) supported by most of the compilers typically available on high-end computing environments. A judiciously set of programming techniques let us achieve a reasonable trade-off among extensibility and performance, while avoiding in most cases the computational overheads frequently associated with abstract OO software libraries.  The software abstractions covered in this work include: • The definition of reference FEs, which relies on the concept of polytopes to define the cell topology in arbitrary dimensions, a machinery to define multi-dimensional polynomial functions of arbitrary order in an easy and automatic way, and a general procedure for the generation of the shape function bases and local DOFs. • The global FE space abstraction, which relies on reference FE(s) and a triangulation of the physical domain. It is responsible to define the local-to-global DOF numbering, which must respect conformity (if needed). The FE space also provides tools for the numerical integration of (bi)linear forms, e.g., mappings from the reference to the physical space, etc., in cells and facets (for DG methods). • The FE affine operator generated after the discretization of the original problem (probably after a linearization step). The FE solution is the only root (as soon as the problem is well-posed) of this operator. This operator, once the trial and test functions and the discrete (bi)linear forms of the problem at hand are defined, is represented through a matrix and a vector whose entries can be computed by numerical integration using the FE space.
FEMPAR has been used for more than 4 years now by a team of about 10 researchers of different research institutions and universities. During the initial OO re-design, derived data types (attributes and bindings) were gradually modified to accommodate new features that had not been considered, to fix expressivity limitations or even dependency knots of the original design. The software architecture to which we have converged, although certainly subject to future change, has been already proven to be capable to satisfy a number of users' software requirements, even when the application problems involved complex and advanced features (e.g., the development of growing geometries in 3D printing technology). We consider that this steady regime, which has been attained after years of development, and a tremendous man-month power effort, is the proof that the software abstraction in FEMPAR is of practical relevance not only for prospective users and developers, but also for researchers that want to learn about the OO implementation of FE methods. It has motivated the decision of the authors to promote the library as a community software project, to open it to external users and new collaborators, to publish the library in an public git repository [42], and to write this article. In particular, the architecture described here corresponds to the first public release of FEMPAR, to which we assigned the git tag FEMPAR-1.0.0. The first public release of FEMPAR has almost 300K lines of (mostly) Fortran code. Thus, a document like this one, with a quite detailed description of the services provided by the library and the motivation underlying our software design, can be a very valuable resource to complement the source code, which can become overwhelming in itself. In this paper, we have restricted ourselves to the construction of FE operators for body-fitted FE spaces. However, a major (and unique compared to other FE scientific software packages available on the Internet) cornerstone of FEMPAR is an abstract OO framework for the implementation of widely applicable highly scalable multilevel DD solvers. 40 By letting this framework to be highly coupled with the numerical integration data structures of the application, on the one hand, and to be highly customizable, on the other, one can derive optimal preconditioners for the particular structure of the discrete operator at hand, and tackle new problems and challenges, while leveraging the distributed-memory implementation ideas [37] on which the framework is grounded on. Customizable building blocks in the framework include the fine-grid to coarse-grid DOFs aggregation, the constraint matrix underlying the imposition of continuity of coarse DOFs functionals across coarse objects, the weighting operator underlying the injection among the continuous and discontinuous spaces, and the kind of solvers to be used for the Dirichlet, Neumann constrained local problems, and the coarsest-grid global problem [103]. However, we postpone the discussion about solvers, preconditioners, data structures suitable for parallel distributed-memory computers, and other more exotic discretization techniques in FEMPAR, like B-splines and XFEM methods, to subsequent works.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://crea tivecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.