
1 Introduction

In the context of the numerical approximation of partial differential equations (PDEs) for scientific and engineering applications alike, the generation of appropriate computational meshes remains one of the most severe bottlenecks. This has given rise to isogeometric analysis (IGA) [18] on the one hand and unfitted finite element and meshfree methods [2] on the other. Although unfitted finite element methods encompass several classes, including the extended finite element method (XFEM) [3], cutFEM [5] and the finite cell method (FCM) [10, 22], their common goal is to solve the PDE without the need for a boundary-conforming discretization. As an unfitted finite element method, the finite cell method combines adaptive quadrature and high-order FEM with the weak imposition of boundary conditions.

Although mesh generation is essentially circumvented, unfitted finite element methods face several challenges, the most conspicuous of which are the ill-conditioning of the system matrix and the imposition of essential boundary conditions [27]. The former issue limits the usability of many iterative solvers, which has led the majority of studies to focus on direct solvers. While direct solvers based on LU factorization have proven to be robust, their scalability suffers greatly from unfavorable complexity and limited concurrency [25]. Recently, a geometric multigrid preconditioner with a penalty formulation has been studied for the finite cell method [23] to formulate an efficient iterative solver.

On the other hand, unfitted FEM possesses characteristics that can be exploited to its advantage, especially for parallel computing. For instance, the computational mesh in unfitted FEM can normally be regular and Cartesian, which in turn permits the efficient computation and precomputation of finite element values. A parallel implementation of multi-level hp-adaptive finite elements with a shared mesh was recently applied to the finite cell method, employing a CG solver with an additive Schwarz preconditioner in [19] and AMG preconditioning in [20].

The main contributions of the present work can be summarized as follows:

  • We employ a fully distributed, space-tree-based discretization of the computational domain with a low memory footprint to allow the storage and manipulation of large problems and adaptive mesh refinement (AMR)

  • We present the parallelization of the finite cell method with adaptive refinement, focusing on the scalability of different aspects of the computation via exploiting space-tree data structures and the regularity of the discretization

  • We formulate a scalable hybrid Schwarz-type smoother for the treatment of cut cells to use in our geometric multigrid solver

  • We employ parallel adaptive geometric multigrid to solve large-scale finite cell systems and focus on the process-local generation of the required spaces and favorable communication patterns

  • We present the strong and weak scalability of different computational components of our methods

In Sect. 2, the FCM formulation of a model problem is set up. The geometric multigrid solver is formulated in Sect. 3. The developed methods are applied to a number of numerical experiments in Sect. 4. Finally, conclusions are drawn in Sect. 5.

2 Finite Cell Method

In the context of unfitted finite element methods, a given physical domain \(\varOmega \) with essential and natural boundaries \(\varGamma _{D}\) and \(\varGamma _{N}\), respectively, is commonly placed in an embedding domain \(\varOmega _{e}\) with favorable characteristics, such as axis alignment, as shown in Fig. 1. Consequently, appropriate techniques are required for integration over \(\varOmega \) and for the imposition of boundary conditions on \(\varGamma _{D}\) and \(\varGamma _{N}\). In this work, we use the Poisson equation as a model problem, given by

$$\begin{aligned} \begin{aligned} -\varvec{\varDelta } u = f \quad \quad&\text {in} \; \varOmega ,\\ u = g \quad \quad&\text {on} \; \varGamma _{D},\\ \nabla u \cdot \textit{\textbf{n}} = t \quad \quad&\text {on} \; \varGamma _{N}, \end{aligned} \end{aligned}$$
(1)

where \(\varOmega \) is the domain, \(\varGamma = \varGamma _{D} \cup \varGamma _{N}\) is the boundary, \(\textit{\textbf{n}}\) is the normal vector to the boundary and u is the unknown solution.

Fig. 1.

Illustration of a typical unfitted finite element setting, where the physical domain \(\varOmega \) is embedded in an embedding domain \(\varOmega _{e}\). \(\varGamma _{D}\) and \(\varGamma _{N}\) are essential and natural boundaries, respectively. Adaptive refinement in both FCM and integration spaces is demonstrated for a cut cell

2.1 Boundary Conditions

Natural Boundary Conditions. In the standard finite element method, natural boundary conditions are commonly integrated over the surfaces of those elements that coincide with the natural part of the physical boundary \(\varGamma _{N}\); in the finite cell method, however, the physical boundary generally does not coincide with cell boundaries. Therefore, a separate description of the boundary is necessary for the integration of natural boundary conditions. Apart from an appropriate Jacobian transformation from the surface space to the volume space, the integration of natural boundary conditions does not require special treatment.

Essential Boundary Conditions. The imposition of essential boundary conditions is a challenging task in unfitted finite element methods. Penalty methods [1, 4, 30], Lagrange multipliers [6, 12, 13, 14] and Nitsche’s method [7, 9, 11, 17, 21] are commonly used for this purpose. We use a stabilized symmetric Nitsche’s method with a local estimate for the stabilization parameter, which has the advantages of retaining the symmetry of the system, introducing no additional unknowns and being variationally consistent. The weak form is therefore given by

$$\begin{aligned} \begin{aligned} \int _{\varOmega } \varvec{\nabla }v \cdot \varvec{\nabla }u \; d\textit{\textbf{x}} - \int _{\varGamma _{D}} v (\varvec{\nabla }u \cdot \textit{\textbf{n}}) \; d\textit{\textbf{s}}\\ -\int _{\varGamma _{D}} (u - g) (\varvec{\nabla } v \cdot \textit{\textbf{n}}) \; d\textit{\textbf{s}} + \int _{\varGamma _{D}} \lambda v (u - g) \; d\textit{\textbf{s}}\\ =\,\int _{\varOmega } v f \; d\textit{\textbf{x}} +\int _{\varGamma _{N}} v t \; d\textit{\textbf{s}},\\ \end{aligned} \end{aligned}$$
(2)

where \(\lambda \) is the stabilization parameter. The computation of \(\lambda \) is further explained in Sect. 2.3.
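To make the role of the individual terms in Eq. (2) concrete, the following is a minimal sketch of how the symmetric Nitsche terms contribute to the local matrix and load vector of a cut cell at a single Dirichlet-boundary quadrature point. It is written in Python/NumPy purely for illustration; the function and variable names (`nitsche_boundary_terms`, `phi`, `grad_phi`) are ours and not part of the implementation described in this work.

```python
import numpy as np

def nitsche_boundary_terms(phi, grad_phi, normal, g, lam, w):
    """Contribution of a single Dirichlet-boundary quadrature point to the local
    matrix and load vector, following the symmetric Nitsche terms of Eq. (2).

    phi      : (n,)   shape function values at the quadrature point
    grad_phi : (n, d) shape function gradients in physical coordinates
    normal   : (d,)   outward unit normal of the boundary
    g        : prescribed Dirichlet value, lam : stabilization parameter
    w        : quadrature weight including the surface Jacobian
    """
    dn = grad_phi @ normal                       # normal derivatives of the shape functions
    K = w * (-np.outer(phi, dn)                  # -v (grad u . n)
             - np.outer(dn, phi)                 # -(grad v . n) u
             + lam * np.outer(phi, phi))         # +lambda v u
    f = w * (-g * dn + lam * g * phi)            # boundary terms moved to the right-hand side
    return K, f
```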

2.2 Spatial Discretization

Unfitted finite element methods normally permit the use of a structured grid as the embedding domain. We employ distributed linear space trees [8] for the discretization of the finite cell space. Space-tree data structures not only require minimal work for setup and manipulation, but also allow for distributed storage, efficient load balancing and adaptive refinement, and have a small memory footprint. We make use of Morton ordering as illustrated in Fig. 2.
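As an illustration of the Morton (z-curve) ordering in Fig. 2, the following sketch computes the Morton index of a cell from its integer coordinates by bit interleaving. This is a simplified 2D version for illustration only; the actual implementation relies on the distributed linear space trees of [8].

```python
def morton_index_2d(x: int, y: int, level: int) -> int:
    """Interleave the bits of the integer cell coordinates (x, y) on a given
    refinement level to obtain the z-curve (Morton) index of the cell."""
    index = 0
    for bit in range(level):
        index |= ((x >> bit) & 1) << (2 * bit)       # x-bits occupy the even positions
        index |= ((y >> bit) & 1) << (2 * bit + 1)   # y-bits occupy the odd positions
    return index

# The four cells of a 2x2 grid (level 1) are visited in z-order:
# (0,0) -> 0, (1,0) -> 1, (0,1) -> 2, (1,1) -> 3
print([morton_index_2d(x, y, 1) for x, y in [(0, 0), (1, 0), (0, 1), (1, 1)]])
```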

Fig. 2.

(a) The z-curve (Morton) ordering on a 2D example with one level of refinement and (b) the tree representation of the domain in (a)

An attractive aspect of computation on structured grids is the optimization opportunities it provides, which is exactly where unfitted methods can benefit compared to their boundary-conforming counterparts. For example, we compute element size, coordinates and the Jacobian transformation efficiently on the fly without caching during integration.
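Since the embedding cells are axis-aligned, the mapping from the reference cell \([-1,1]^d\) is affine with a constant, diagonal Jacobian, which is what makes the on-the-fly evaluation cheap. The following sketch illustrates this under that assumption; the function and its arguments are illustrative and not part of the actual code base.

```python
import numpy as np

def cell_geometry(anchor, root_size, level, xi):
    """Map a reference point xi in [-1, 1]^d to physical coordinates for an
    axis-aligned space-tree cell and return the constant Jacobian factors."""
    h = root_size / 2.0 ** level          # edge length of a cell on this level
    x = anchor + 0.5 * h * (xi + 1.0)     # affine map from the reference cell
    det_J = (0.5 * h) ** len(xi)          # constant Jacobian determinant
    grad_scale = 2.0 / h                  # maps reference gradients to physical gradients
    return x, det_J, grad_scale

# Center of a level-2 cell anchored at (0.25, 0.5) in a unit root cell
x, det_J, s = cell_geometry(np.array([0.25, 0.5]), root_size=1.0, level=2, xi=np.zeros(2))
```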

A natural repercussion of adaptive refinement on space tree data structures is the existence of hanging nodes in the discretized space as shown in Fig. 1. To ensure the continuity of the solution, we treat hanging nodes by distributing their contribution to their associated non-hanging nodes and removing them from the global system. The influence of hanging nodes is thereby effectively local and no additional constraint conditions or unknowns appear in the solution space.
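The elimination of hanging nodes can be viewed as a condensation of the global system with a constraint matrix that maps the non-hanging degrees of freedom to all degrees of freedom. The following dense sketch illustrates this view; it assumes, for example, that a hanging midpoint node on a bilinear edge depends on the two edge endpoints with weight 1/2, and it is not the distributed implementation used in this work.

```python
import numpy as np

def eliminate_hanging_nodes(A, b, constraints, n_nodes):
    """Condense hanging nodes out of the global system A u = b.

    'constraints' maps each hanging node to {constraining node: weight}, e.g. the
    midpoint node of a bilinear edge depends on the two edge endpoints with weight 1/2.
    Returns the reduced system in terms of the non-hanging nodes only."""
    free = [i for i in range(n_nodes) if i not in constraints]
    col = {node: j for j, node in enumerate(free)}
    C = np.zeros((n_nodes, len(free)))            # maps free DoFs to all DoFs
    for i in free:
        C[i, col[i]] = 1.0
    for hang, deps in constraints.items():
        for node, w in deps.items():
            C[hang, col[node]] = w
    return C.T @ A @ C, C.T @ b                   # reduced matrix and right-hand side

# Example: node 4 hangs on the edge between nodes 1 and 2
A, b = np.eye(5), np.ones(5)
A_red, b_red = eliminate_hanging_nodes(A, b, {4: {1: 0.5, 2: 0.5}}, n_nodes=5)
```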

2.3 Volume Integration

The physical domain is free to intersect the embedding domain. During volume integration, the portion of the embedding domain that lies outside of the physical domain, \(\varOmega _{e} \setminus \varOmega \), is penalized by a factor \(\alpha \ll 1\). This stage is essentially where the physical geometry is recovered from the structured embedding mesh. Therefore, cells that are cut by the physical boundary must be integrated with sufficient accuracy in order to resolve the geometry. On the other hand, the accuracy of standard Gaussian quadrature deteriorates markedly in the presence of discontinuities in the integrand. Thus, methods such as Gaussian quadrature with modified weights and points [24] and uniform [22] and adaptive [10] refinement, also known as composed Gaussian quadrature, have been proposed for numerical integration in the face of discontinuities in the integrand.

We use adaptive quadrature for volume integration within the finite cell discretization. A number of adaptive integration layers are thereby added on top of the function space of \(\varOmega _{e}\) for cut cells as shown in Fig. 1. Space-tree data structures lend themselves naturally to adaptive quadrature, as the integration space can readily be generated by refinement towards the boundary intersection. Furthermore, the integration space retains the regularity of the parent discretization. This scheme is especially suitable for our parallel implementation, where a given cell is owned by a unique process; therefore, the adaptive quadrature procedure is performed entirely process-locally, and duplicate computations on the ghost layer are avoided.
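The following is a minimal sketch of such a composed (adaptive) Gaussian quadrature rule for a single cut cell in 2D: subcells intersected by the boundary are bisected recursively up to a fixed depth, and quadrature points outside the physical domain keep the small factor \(\alpha \ll 1\) mentioned above. The point-membership test `inside` and the fixed 2-point Gauss rule are illustrative assumptions, not the interface of the actual implementation.

```python
import numpy as np

GAUSS_2 = [(-1 / np.sqrt(3), 1.0), (1 / np.sqrt(3), 1.0)]    # 2-point Gauss rule on [-1, 1]

def adaptive_cell_quadrature(center, h, inside, depth, alpha=1e-8):
    """Composed Gaussian quadrature for one (possibly cut) 2D cell of edge length h.

    Subcells that are cut by the physical boundary are bisected recursively up to
    'depth' levels; points outside the physical domain are weighted by alpha."""
    offsets = [(-1, -1), (1, -1), (-1, 1), (1, 1)]
    flags = [inside(center + 0.5 * h * np.array(s)) for s in offsets]
    if depth > 0 and any(flags) and not all(flags):           # cell is cut: refine the rule
        points = []
        for s in offsets:
            points += adaptive_cell_quadrature(center + 0.25 * h * np.array(s),
                                               0.5 * h, inside, depth - 1, alpha)
        return points
    jac = (0.5 * h) ** 2                                      # constant Jacobian of the subcell
    rule = []
    for x1, w1 in GAUSS_2:
        for x2, w2 in GAUSS_2:
            p = center + 0.5 * h * np.array([x1, x2])
            rule.append((p, w1 * w2 * jac * (1.0 if inside(p) else alpha)))
    return rule

# Quadrature rule for a cell cut by a circle of radius 0.4 centered at the origin
circle = lambda x: float(x @ x) <= 0.4 ** 2
rule = adaptive_cell_quadrature(center=np.array([0.35, 0.35]), h=0.2, inside=circle, depth=3)
```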

Introducing a finite-dimensional function space \(V_{h} \subset H^{1}(\varOmega _{e})\), the finite cell formulation of the model problem can be written as

$$\begin{aligned} \text {Find }u_{h} \in V_{h} \subset H^1(\varOmega _e)&\text { such that for all } v_{h} \in V_{h} \nonumber \\ a_{h}(u_{h},v_{h})&= b_{h}(v_{h}) \end{aligned}$$
(3)

with

$$\begin{aligned} a_{h}(u_{h},v_{h}) :=&\int _{\varOmega _e} \alpha \varvec{\nabla }v_{h} \cdot \varvec{\nabla }u_{h} \; d\textit{\textbf{x}} - \int _{\varGamma _{D}} v_{h} (\varvec{\nabla }u_{h} \cdot \textit{\textbf{n}}) \; d\textit{\textbf{s}} \nonumber \\&- \int _{\varGamma _{D}} u_{h} (\varvec{\nabla } v_{h} \cdot \textit{\textbf{n}}) \; d\textit{\textbf{s}} + \int _{\varGamma _{D}} \lambda v_{h} u_{h} \; d\textit{\textbf{s}},\end{aligned}$$
(4)
$$\begin{aligned} b_{h}(v_{h}) :=&\int _{\varOmega _e} \alpha v_{h} f \; d\textit{\textbf{x}} + \int _{\varGamma _{N}} v_{h} t \; d\textit{\textbf{s}} \nonumber \\&- \int _{\varGamma _{D}} g (\varvec{\nabla } v_{h} \cdot \textit{\textbf{n}}) \; d\textit{\textbf{s}} + \int _{\varGamma _{D}} \lambda v_{h} g \; d\textit{\textbf{s}}, \end{aligned}$$
(5)

where

$$\begin{aligned} {\left\{ \begin{array}{ll} \alpha = 1, \quad \quad &{}\text { in}\;\varOmega ,\\ \alpha \ll 1, \quad \quad &{}\text { in}\;\varOmega _e\!\setminus \!\varOmega . \end{array}\right. } \end{aligned}$$
(6)

The stabilization parameter drastically affects the solution behavior, and its proper identification is vital to achieving both convergence in the solver and the correct imposition of the boundary conditions. There are several methods, including local and global estimates, for the determination of the stabilization parameter  [9, 15]. We employ a local estimate based on the coercivity condition of the bilinear form that can be formulated as a generalized eigenvalue problem of the form

$$\begin{aligned} \textit{\textbf{A}} \textit{\textbf{X}} = \textit{\textbf{B}} \textit{\textbf{X}} \varvec{\varLambda } , \end{aligned}$$
(7)

where the columns of \(\textit{\textbf{X}}\) are the eigenvectors, \(\varvec{\varLambda }\) is the diagonal matrix of the eigenvalues, and \(\textit{\textbf{A}}\) and \(\textit{\textbf{B}}\) are formulated as

$$\begin{aligned} {\left\{ \begin{array}{ll} \textit{\textbf{A}}_{ij} := \int _{\varGamma _{D}^{c}} (\varvec{\nabla } \phi _{j} \cdot \textit{\textbf{n}})(\varvec{\nabla } \phi _{i} \cdot \textit{\textbf{n}}) \; d\textit{\textbf{s}}, \\[12pt] \textit{\textbf{B}}_{ij} := \int _{\varOmega ^{c}} \alpha \varvec{\nabla }\phi _{j} \cdot \varvec{\nabla }\phi _{i} \; d\textit{\textbf{x}}, \end{array}\right. } \end{aligned}$$
(8)

where \(\varGamma _{D}^{c}\) and \(\varOmega ^{c}\) are the portion of the essential boundary that intersects a given cell and the cell domain, respectively. The stabilization parameter can be chosen as \(\lambda > \text {max}(\varvec{\varLambda })\). This formulation leads to a series of relatively small generalized eigenvalue problems. Global estimates, on the other hand, assemble a single, large generalized eigenvalue problem by integration over the entire domain. The local estimate is more desirable in the context of parallel computing since it allows for the process-local assembly and solution of each problem. Moreover, most generalized eigensolver algorithms have non-optimal complexity, so smaller systems are preferable in any case.
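As an illustration, the cell-local choice of \(\lambda \) could be realized along the following lines, assuming the cell-local matrices of Eq. (8) have already been assembled. The SciPy-based routine and the safety factor of 2 are assumptions made for this sketch, not prescriptions of the method.

```python
import numpy as np
from scipy.linalg import eigvals

def local_stabilization_parameter(A_c, B_c, safety=2.0):
    """Cell-local Nitsche stabilization parameter from the generalized eigenvalue
    problem A x = lambda B x of Eqs. (7)-(8).

    A_c, B_c : cell-local matrices assembled according to Eq. (8)
    safety   : factor > 1 so that lambda > max(Lambda) holds strictly (our choice)
    """
    lams = eigvals(A_c, B_c)                      # generalized eigenvalues
    lams = lams.real[np.isfinite(lams)]           # discard infinite/undefined modes (e.g. constants)
    return safety * lams.max()
```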

3 Geometric Multigrid

We employ a geometric multigrid solver [16] for the resultant system of the finite cell formulation. Unfitted finite element methods in general, and the finite cell method in particular, usually lead to an ill-conditioned system matrix due to the existence of cut cells, i.e., cells of the embedding domain that are intersected by the physical boundary [27]. Small cut fractions exacerbate this problem. Therefore, an efficient multigrid formulation requires special treatment of this issue. Nevertheless, the main components of geometric multigrid remain unaltered.

Fig. 3.

A sample four-level grid hierarchy generated using Algorithm 1

3.1 Grid Hierarchy

The hierarchical nature of space-tree data structures allows for the efficient generation of the hierarchical grids required by geometric multigrid methods [26, 29]. We generate the grid hierarchy top-down from the finest grid. In order to keep the coarsening algorithm process-local, sibling cells (cells that belong to the same parent) are kept on the same process on all grids. While the coarsening rules are trivial in the case of uniform grids, adaptively refined grids require elaboration. Starting from a fine grid \(\varOmega _{e,h_{l}}\), we generate the coarse grid \(\varOmega _{e,h_{l-1}}\) according to Algorithm 1. Aside from keeping cell families on the same process, the only other major constraint is 2:1 balancing, which means that no two neighboring cells may differ by more than one level of refinement. In practice, load balancing and the enforcement of these constraints are carried out in a single step. Figure 3 shows a sample four-level grid hierarchy with the finest grid adaptively refined towards a circle in the middle of the square domain.

Algorithm 1. Generation of the coarse grid \(\varOmega _{e,h_{l-1}}\) from the fine grid \(\varOmega _{e,h_{l}}\)
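Algorithm 1 is not reproduced above; the following is a hedged sketch of one plausible top-down coarsening step that is consistent with the constraints described in this section, assuming a Morton-ordered list of (level, index) cell keys and externally provided `balance_2to1` and `repartition` routines (both hypothetical names).

```python
def coarsen_one_level(cells, balance_2to1, repartition):
    """Hedged sketch of one top-down coarsening step (cf. Algorithm 1).

    'cells' is a Morton-ordered list of (level, morton_index) keys of the local
    leaves. Complete sibling families (four children in 2D) are replaced by their
    parent; afterwards, 2:1 balance is restored and the grid is repartitioned such
    that sibling families remain on the same process."""
    coarse, i = [], 0
    while i < len(cells):
        level, idx = cells[i]
        family = [(level, idx + k) for k in range(4)]
        if idx % 4 == 0 and cells[i:i + 4] == family:   # complete sibling family found
            coarse.append((level - 1, idx // 4))        # replace the family by its parent
            i += 4
        else:
            coarse.append((level, idx))                 # keep cells with incomplete families
            i += 1
    return repartition(balance_2to1(coarse))
```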

3.2 Transfer Operators

Transfer operators provide mobility through the grid hierarchy, i.e., restriction from level l to \(l-1\) and prolongation from level \(l-1\) to l. In order to minimize communication and avoid costly cell lookup queries, we perform these operations in two steps. Restriction starts by transferring entities from the distributed fine grid \(\varOmega _{e,h_{l}}\) to an intermediate coarse grid \(\varOmega _{e,h_{l-1}}^{i}\) followed by a transfer to the distributed coarse grid \(\varOmega _{e,h_{l-1}}\). Conversely, prolongation starts by transferring entities from the distributed coarse grid \(\varOmega _{e,h_{l-1}}\) to the intermediate coarse grid \(\varOmega _{e,h_{l-1}}^{i}\) followed by a transfer to the distributed fine grid \(\varOmega _{e,h_{l}}\). The intermediate grids are generated and accessed entirely process locally and only store minimal information regarding the Morton ordering of the local part of the domain. A similar approach is taken in  [29]. The restriction and prolongation operations of a vector \(\textit{\textbf{v}}\) can be summarized as

$$\begin{aligned} \textit{\textbf{v}}_{l-1}&= \mathcal {T}_{l} \textit{\textbf{R}}_{l} \textit{\textbf{v}}_{l} \end{aligned}$$
(9)
$$\begin{aligned} \textit{\textbf{v}}_{l}&= \textit{\textbf{P}}_{l-1} \mathcal {T}_{l-1}^{-1} \textit{\textbf{v}}_{l-1} \end{aligned}$$
(10)

where \(\textit{\textbf{R}}_{l} = \textit{\textbf{P}}_{l}^{T}\). \(\mathcal {T}\) represents the transfer operator between intermediate and distributed grids.

This scheme allows \(\textit{\textbf{R}}\) and \(\textit{\textbf{P}}\) to be applied in parallel, process-locally and without the need for cell lookup queries. Additionally, flexible load balancing is achieved, which is especially important for adaptively refined grids. The only additional component needed to establish communication between grids is the transfer operator \(\mathcal {T}\), which accounts for the majority of the required communication.
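Algebraically, the two-step transfers of Eqs. (9) and (10) amount to the following; the matrix-vector representation is purely illustrative, since in the implementation the operators are resolved process-locally on the intermediate grids.

```python
def restrict(v_fine, P, T):
    """Two-step restriction, Eq. (9): geometric restriction R = P^T on the process-local
    intermediate grid, followed by the redistribution T to the distributed coarse grid."""
    return T @ (P.T @ v_fine)

def prolongate(v_coarse, P, T_inv):
    """Two-step prolongation, Eq. (10): gather onto the intermediate grid via T^{-1},
    followed by the geometric prolongation P to the distributed fine grid."""
    return P @ (T_inv @ v_coarse)
```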

Fig. 4.

Subdomain designation for cut cells and parallel application of the hybrid Schwarz smoother on a sample domain. Every node that does not appear in any of the designated cut cell subdomains composes a subdomain with the functions supported only on that node. All nodal subdomains are applied multiplicatively. Nodes are colored based on their owner process

3.3 Parallelized Hybrid Schwarz Smoother

Special treatment of cut cells is crucial to the convergence of the solver for finite cell systems. In the context of geometric multigrid solvers, this special treatment mainly manifests itself in the smoother operator \(\mathcal {S}\). We employ a Schwarz-type smoother (e.g.  [28], cf. also [19, 23]), where subdomains are primarily determined based on cut configurations: for every cut cell, a subdomain is designated that includes all the functions supported on that cell. The remaining nodes, which do not appear in any cut cell, each compose a subdomain with only the functions supported on that node. The selection of subdomains is illustrated in Fig. 4. The Schwarz-type smoother can be applied in two manners, additively and multiplicatively, as given by

$$\begin{aligned} \textit{\textbf{u}}^{k+1} = \textit{\textbf{u}}^{k} + \mathcal {S} (\textit{\textbf{f}} - \textit{\textbf{A}}\textit{\textbf{u}}^{k}) \end{aligned}$$
(11)

with

$$\begin{aligned} \mathcal {S}^{add}&= \big [ (\textit{\textbf{R}}_{s,n}^{T}\textit{\textbf{A}}_{n}^{-1}\textit{\textbf{R}}_{s,n}) + \dots + (\textit{\textbf{R}}_{s,1}^{T}\textit{\textbf{A}}_{1}^{-1}\textit{\textbf{R}}_{s,1}) \big ] \end{aligned}$$
(12)
$$\begin{aligned} \mathcal {S}^{mult}&= \big [ (\textit{\textbf{R}}_{s,n}^{T}\textit{\textbf{A}}_{n}^{-1}\textit{\textbf{R}}_{s,n}) \dots (\textit{\textbf{R}}_{s,1}^{T}\textit{\textbf{A}}_{1}^{-1}\textit{\textbf{R}}_{s,1}) \big ] \end{aligned}$$
(13)

where \(\textit{\textbf{R}}_{s,i}\) are the Schwarz restriction operators, \(\textit{\textbf{A}}_{i} = \textit{\textbf{R}}_{s,i}\textit{\textbf{A}}\textit{\textbf{R}}_{s,i}^{T}\) are the subdomain matrices and n is the number of subdomains. The Schwarz restriction operator \(\textit{\textbf{R}}_{s,i}\) essentially extracts the rows corresponding to the functions of subdomain i, and its transpose, the Schwarz prolongation operator, takes a vector from the subdomain space to the global space by padding it with zeros.
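To make Eqs. (11)–(13) concrete, the following is a minimal dense sketch of one smoother sweep: in the multiplicative variant the residual is refreshed after every subdomain correction, while in the additive variant all corrections are computed from the same residual. Dense NumPy arrays and the per-subdomain recomputation of the full residual are used for clarity only and do not reflect the actual implementation.

```python
import numpy as np

def schwarz_sweep(A, u, f, subdomains, multiplicative=True):
    """One Schwarz smoother sweep, Eq. (11): u <- u + R_i^T A_i^{-1} R_i r.

    'subdomains' is a list of index arrays: one subdomain per cut cell containing
    all DoFs supported on that cell, and one single-entry subdomain per remaining node.
    Multiplicative: the residual is refreshed after every subdomain correction.
    Additive: all corrections are computed from the same initial residual."""
    r = f - A @ u
    correction = np.zeros_like(u)
    for idx in subdomains:
        A_i = A[np.ix_(idx, idx)]                    # subdomain matrix A_i = R_i A R_i^T
        if multiplicative:
            u[idx] += np.linalg.solve(A_i, (f - A @ u)[idx])
        else:
            correction[idx] += np.linalg.solve(A_i, r[idx])
    if not multiplicative:
        u = u + correction
    return u
```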

Parallelization of the first approach is relatively straightforward. Each process can simultaneously apply the corrections from the subdomains it owns, and within each process, subdomain corrections can be applied concurrently. Since any given cell is owned by a unique process, no communication is required during this stage. The only communication takes place when the correction is synchronized over process interfaces at the end.

Parallel realization of the latter approach, however, is challenging. A strict implementation of the multiplicative Schwarz method requires substantial communication, not only for exchanging updated residual values but also for synchronizing the application of subdomain corrections, which is clearly undesirable for the parallel scalability of the algorithm. We therefore employ a compromise approach that adheres to the multiplicative application of the smoother as much as possible while minimizing the required communication. To this end, subdomains whose support lies completely within their owner process are applied multiplicatively, while the additive approach is taken at process interfaces. This application strategy is illustrated in Fig. 4.
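A hedged, process-local sketch of this hybrid application is given below: interior subdomains are swept multiplicatively with a refreshed residual, whereas interface subdomains use a frozen residual, and their accumulated correction would subsequently be summed across neighboring processes. The MPI synchronization itself is omitted, and all names are illustrative.

```python
import numpy as np

def hybrid_schwarz_sweep(A, u, f, interior_subdomains, interface_subdomains):
    """One process-local sweep of the hybrid smoother: interior subdomains are
    applied multiplicatively with a refreshed residual, interface subdomains
    additively with a frozen residual. In the parallel setting, the accumulated
    interface correction is subsequently summed over neighboring processes."""
    for idx in interior_subdomains:                              # multiplicative part
        A_i = A[np.ix_(idx, idx)]
        u[idx] += np.linalg.solve(A_i, (f - A @ u)[idx])
    r = f - A @ u                                                # frozen residual
    interface_correction = np.zeros_like(u)
    for idx in interface_subdomains:                             # additive part
        A_i = A[np.ix_(idx, idx)]
        interface_correction[idx] += np.linalg.solve(A_i, r[idx])
    return u + interface_correction                              # + sum over interfaces (omitted)
```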

4 Numerical Studies

We perform a number of numerical studies to investigate the performance of the methods outlined in the previous sections. We use the finite cell formulation developed in Sect. 2 and employ the geometric multigrid method from Sect. 3 as a solver. We consider both uniform and adaptive grids and present the weak and strong scaling of different components of the computation. The computations are performed on a distributed-memory cluster with dual-socket Intel Xeon Skylake Gold 6148 CPUs with 20 cores per socket at 2.4 GHz per core, 192 GB DDR4 main memory per node and a 100 Gbit/s Intel Omni-Path network interconnect via PCIe x16 Gen 3. All nodes run Red Hat Enterprise Linux (RHEL) 7, and GCC 7.3 with the -O2 optimization flag is used to compile the project. All computations employ MPI parallelization without additional shared-memory parallelization, utilizing up to 40 MPI processes per node, which equals the number of cores per node.

Fig. 5.

Strong scaling of different components of the computation on a uniformly refined grid with 268,468,225 degrees of freedom (a and b) and an adaptively refined grid with 16,859,129 degrees of freedom (c and d)

The physical domain considered throughout this section is a circle embedded in a unit square embedding domain (see Fig. 3). The finite cell formulation of the Poisson equation is imposed on the embedding domain. An inhomogeneous Dirichlet boundary condition is imposed on an arc on the left of the circle, and a homogeneous Neumann boundary condition is imposed on the remaining part. This example is chosen to act as a reproducible benchmark. The conditioning of finite cell matrices directly depends on the configuration of cells cut by the physical boundary. Due to its curvature, the circular domain covers a wide variety of cut configurations on each grid level. Therefore, the resultant matrices include the ill-conditioning associated with the finite cell method and are representative of more general geometries. Furthermore, other computational aspects, e.g., volume integration, are virtually independent of the geometry and mainly vary with problem size.

The geometric multigrid solver is set up with three steps of pre- and post-smoothing each, employing a combination of the hybrid multiplicative Schwarz smoother as in Sect. 3.3 and a damped Jacobi smoother. The Schwarz smoother is applied only to the three finest grids in each problem, and the damped Jacobi smoother is applied to the remaining grids. A tolerance of \(10^{-9}\) for the residual is used as the convergence criterion.
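For reference, the solver cycle described above can be summarized by the following V-cycle sketch with three pre- and post-smoothing steps; the `smooth`, `restrict` and `prolongate` callbacks as well as the coarse-grid direct solve are placeholders and do not mirror the actual parallel implementation.

```python
import numpy as np

def v_cycle(level, u, f, A, smooth, restrict, prolongate, n_smooth=3):
    """One geometric multigrid V-cycle with n_smooth pre- and post-smoothing steps.

    A[level] is the system matrix on that level; smooth(level, u, f) applies one
    smoother step (e.g. the hybrid Schwarz smoother on the three finest levels and
    damped Jacobi below); restrict/prolongate are the transfer operators of Sect. 3.2."""
    if level == 0:
        return np.linalg.solve(A[0], f)                  # coarse-grid solve
    for _ in range(n_smooth):
        u = smooth(level, u, f)                          # pre-smoothing
    r_coarse = restrict(level, f - A[level] @ u)         # restrict the residual
    e_coarse = v_cycle(level - 1, np.zeros_like(r_coarse), r_coarse,
                       A, smooth, restrict, prolongate, n_smooth)
    u = u + prolongate(level, e_coarse)                  # coarse-grid correction
    for _ in range(n_smooth):
        u = smooth(level, u, f)                          # post-smoothing
    return u
```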

Fig. 6.

Weak scaling of different components of the computation using grids that range in size from roughly 16.7 million degrees of freedom to 1.1 billion degrees of freedom

Fig. 7.

The convergence behavior of the geometric multigrid solver for the grids in the weak scaling study

The scaling studies report the runtime of different components of the computation. System setup \(T_{sys\_setup}\) includes setup and refinement of the main discretization, load balancing, setup of the finite cell space and resolution of the physical boundary. Assembly \(T_{assembly}\) is the time required for the assembly of the global system, i.e., integration and distribution of all components of the weak form. Solver setup \(T_{solver\_setup}\) concerns the generation and setup of the hierarchical spaces for geometric multigrid and includes the grid hierarchy, transfer operators and smoothers. Finally, \(T_{solver}\) and \(T_{iteration}\) refer to the total runtime and the runtime of a single iteration of the geometric multigrid solver, respectively.

A model with roughly 268 million degrees of freedom with uniform refinement and another model with roughly 16.8 million degrees of freedom with adaptive refinement towards the boundary are chosen to investigate the strong scalability of the computation, as shown in Fig. 5. In both cases, the speedup of all components is compared to the ideal parallel performance. Ideal or perfect speedup is defined as a speedup that scales linearly with the number of processing units, normalized to the smallest configuration that can accommodate the memory demand, i.e., 16 processes for the uniform grid and 4 processes for the adaptively refined grid. Except for \(T_{assembly}\), which virtually coincides with the ideal speedup, the components show slightly smaller speedups; however, these minor deviations are practically inevitable in most scientific applications due to communication overhead and load imbalance. The strong scalability of all components can be considered excellent, as there are no breakdowns or plateaus and the deviations from ideal speedup remain small.

On the other hand, weak scalability is investigated for a number of uniformly refined grids, ranging in size from approximately 16.7 million to 1.07 billion degrees of freedom, as shown in Fig. 6. In addition to keeping the number of degrees of freedom per core roughly constant, the size of the coarse problem is kept constant on all grids in order to study the scalability of the geometric multigrid solver; therefore, a deeper hierarchy is employed for larger problems, as detailed in Table 1. The convergence behavior of the multigrid solver is shown in Fig. 7. Within the weak scaling study, each problem encounters many different cut cell configurations on each level of the grid hierarchy. The observed boundedness of the iteration count is therefore a testament to the robustness of the approach. All components exhibit good weak scalability throughout the entire range. While \(T_{sys\_setup}\) and \(T_{assembly}\) are virtually constant for all grid sizes, \(T_{solver\_setup}\) and \(T_{solver}\) increase slightly for larger problems. The difference in \(T_{solver\_setup}\) can be attributed to the difference in the number of grid levels, i.e., larger problems with deeper multigrid hierarchies have heavier workloads in this step. \(T_{solver}\), on the other hand, has to be considered in conjunction with the iteration count. Although the multigrid solver is overall scalable in terms of the iteration count, there are still minor differences in the necessary number of iterations between problems (see Fig. 7). \(T_{iteration}\) can be considered a normalized metric in this regard and remains virtually constant. Nevertheless, the differences in runtime remain small for all components and are negligible in practical settings.

Although a direct comparison is not possible due to differences in formulation, problem type and setup, hardware, etc., we give a high-level discussion of some aspects of the methods with respect to closely related works. In [23], a multigrid preconditioner with Schwarz smoothers was presented, showing bounded iteration counts; however, a parallelization strategy was not reported. In [20], a PCG solver with an AMG preconditioner was used. Similarly, a PCG solver with a Schwarz preconditioner was used in [19]. In both studies, a shared base mesh was employed. The examples in [20] and [19] were smaller than the ones considered here, which further hinders a direct comparison; nevertheless, the multigrid solver presented in this work shows promising results both in terms of parallel scalability and absolute runtime for similarly sized problems. The multigrid solver is furthermore robust with respect to broad variations in problem size, whereas the iteration count of the PCG solver in [19] increased significantly for larger problems, which was directly reflected in the runtime. Geometric multigrid is used as a solver in this work; it is expected that even more robustness and performance can be gained if it is used in conjunction with a Krylov subspace accelerator, such as the conjugate gradient (CG) method.

Table 1. Problem and solver configuration for the weak scaling study. \(n_{proc}\) is the number of processes, \(n_{DoF}\) is the number of degrees of freedom, \(n_{hierarchy}^{gmg}\) is the depth of the geometric multigrid hierarchy, \(n_{iteration}^{gmg}\) is the number of required iterations and \(n_{DoF}^{coarse}\) is the number of degrees of freedom on the coarsest grid

5 Conclusions

A parallel adaptive finite cell formulation along with an adaptive geometric multigrid solver is presented in this work. Numerical benchmarks indicate that the core computational components of FCM as well as the GMG solver scale favorably in both the weak and the strong sense. The use of distributed space-tree-based meshes allows not only the scalable storage and manipulation of extremely large problems but also effective load balancing, which is most evident in the near-perfect scalability of the integration of the weak form. Furthermore, the suitability of the space-tree-based algorithms for the parallel generation of the multigrid spaces is demonstrated by the scalability of the solver setup. The geometric multigrid solver with the Schwarz-type smoother exhibits robustness and scalability both in terms of the required iteration count for different problem sizes and in terms of parallelization. We strive to minimize communication in the parallelization of the multigrid components, especially for the application of the Schwarz smoother; nevertheless, the iteration counts do not suffer from parallel execution, and the solver shows good weak and strong scalability. The ability to solve problems with more than a billion degrees of freedom and the scalability of the computations are promising results for the application of the finite cell method with geometric multigrid to large-scale problems on parallel machines. Nevertheless, further examples and problem types are necessary to extend the applicability of the presented methods. Moreover, the main algorithms and underlying data structures used in the presented methods are well suited to hardware accelerators such as GPUs and FPGAs, and we expect that a scalable implementation should be achievable for such architectures given optimized data paths and communication patterns. In particular, the semi-structuredness of the adaptive octree approach is conducive to a hardware-oriented implementation compared to unstructured meshing approaches. We intend to explore these opportunities in future work.