A parallel Newton multigrid framework for monolithic fluid-structure interactions

We present a monolithic parallel Newton-multigrid solver for nonlinear three dimensional fluid-structure interactions in Arbitrary Lagrangian Eulerian formulation. We start with a finite element discretization of the coupled problem, based on a remapping of the Navier-Stokes equations onto a fixed reference framework. The strongly coupled fluid-structure interaction problem is discretized with finite elements in space and finite differences in time. The resulting nonlinear and linear systems of equations are large and show a very high condition number. We present a novel Newton approach that is based on two essential ideas: First, a static condensation of the solid deformation by exploiting the velocity-deformation relation $d_t u = v$. Second, the Jacobian of the fluid-structure interaction system is simplified by neglecting all derivatives with respect to the ALE deformation, an approximation that has been shown to have little impact. The resulting system of equations decouples into a joint momentum equation and into two separate equations for the deformation fields in solid and fluid. Besides a reduction of the problem sizes, the approximation has a positive effect on the conditioning of the systems, such that multigrid solvers with simple smoothers like a parallel Vanka iteration can be applied. We demonstrate the efficiency of the resulting solver infrastructure on a well-studied 2d test-case and we also introduce a challenging 3d problem. For 3d problems we achieve a substantial acceleration as compared to established approaches found in the literature.


Introduction
Fluid-structure interactions appear in various problems, ranging from classical engineering applications like the design of ships, aircraft, or wind turbines, to bio-medical systems describing the blood flow in the heart or, more generally, problems involving the cardiovascular system. The typical challenge of fluid-structure interactions is two-fold. First, the special coupling character that stems from the coupling of a hyperbolic-type equation (the solid problem) with a parabolic-type equation (the Navier-Stokes equations). Second, the moving domain character brings along severe nonlinearities that have a non-local character, as geometrical changes close to the moving fluid-solid interface can have a large impact on the overall solution.
Numerical approaches can usually be classified into monolithic approaches, where the coupled fluid-structure interaction system is taken as one entity, and partitioned approaches, where two separate problems (for fluid and solid) are formulated and where the coupling between them is incorporated in terms of an outer (iterative) algorithm. The second approach has the advantage that difficulties are isolated and that perfectly suited numerical schemes can be used for each of the subproblems. There are, however, application classes where partitioned approaches either fail or lack efficiency. The added mass effect [8] exactly describes this special stiffness connected to fluid-structure interactions. It is typical for problems with similar densities in the fluid and the solid, as it happens in the interaction of blood and tissue or in the interaction of water and the solid structure of a vessel. Here, monolithic approaches are considered to be favourable.
Monolithic approaches all give rise to strongly coupled, usually very large and nonlinear algebraic systems of equations. Although there has been substantial progress in designing efficient numerical schemes for tackling the nonlinear problems [21,19,14] (usually by Newton's method) and the resulting linear systems [17,33,30,26,2,9,11], the computational effort is still immense and numerically accurate results for 3d problems are still rare.
In this contribution we present an approximated Newton scheme for solving nonstationary fluid-structure interactions in a strictly monolithic formulation. The idea is based on the observation that the Newton convergence rate does not significantly worsen if we neglect the derivatives with respect to the ALE deformation, see [31, Section 5.2.3]. Although convergence rates slightly suffer, overall computational times can be reduced due to the lesser effort for assembling the matrix. Here, we exploit this structure of the reduced Jacobian to achieve an exact splitting of the monolithic Jacobian into a coupled problem for the velocities of fluid and solid and into a second step, where separate update problems are solved for the solid and fluid deformation. Apart from the approximation of the Jacobian, no further splitting error is introduced. The benefit of this approach is twofold: First, instead of one large system with 7 coupled unknowns (pressure, velocity field and deformation field in 3d) we solve one coupled system of four unknowns (pressure and velocities) and two separate problems involving the deformations. Second, separating a reduced velocity problem has a positive effect on the system matrices, such that efficient preconditioners and smoothers can be applied that are suitable for easy parallelization. Finally, we use the newly developed solver to introduce and test a new three dimensional benchmark configuration that is based on the configurations described by Hron and Turek [21].
In the following section we give a brief presentation of the fluid-structure interaction problem in a variational Arbitrary Lagrangian Eulerian formulation. Section 3 shortly presents the discretization of the equations in space and time. As formulation and discretization are based on established techniques, these two sections are rather concise. The nonlinear and linear solution framework is described in Section 4, where we start with an approximation of the Jacobian that results in a natural partitioning of the linear systems, which in turn are approximated by parallel multigrid methods. Numerical test-cases demonstrate the efficiency and scalability in Section 5. Here, we also present a new and challenging 3d configuration for benchmarking fluid-structure interactions. We conclude in Section 6.

Governing equations
We shortly present the monolithic formulation for fluid-structure interactions, coupling the incompressible Navier-Stokes equations and a hyperelastic solid, based on the St. Venant Kirchhoff material. For details we refer to [31].
On the d-dimensional domain, partitioned in reference configuration Ω = F ∪ I ∪ S, where F is the fluid domain, S the solid domain and I the fluid-structure interface, we denote by v the velocity field, split into fluid velocity v_f := v|_F and solid velocity v_s := v|_S, and by u the deformation field, again with u_s := u|_S and u_f := u|_F. The boundary of the fluid domain Γ_f := ∂F \ I is split into an inflow boundary Γ_f^in and a wall boundary Γ_f^wall, where we usually assume Dirichlet conditions, and a possible outflow boundary Γ_f^out, where we enforce the do-nothing outflow condition [20]. The solid boundary Γ_s = ∂S \ I is split into a Dirichlet part Γ_s^D and a Neumann part Γ_s^N. We formulate the coupled fluid-structure interaction problem in a strictly monolithic scheme by mapping the moving fluid domain onto the reference state via the ALE map T_f(t) : F → F(t), constructed by a fluid domain deformation T_f(t) = id + u_f(t). In the solid domain, the map T_s(t) = id + u_s(t) denotes the Lagrange-Euler mapping, and as the deformation field u is defined globally on Ω we simply use the notation T(t) = id + u(t) with the deformation gradient F := ∇T and its determinant J := det(F). We seek the global (in fluid and solid domain) velocity and deformation fields v and u and the pressure p in suitable function spaces, with test functions taken in the corresponding test spaces. By u_D(t) ∈ H^1(Ω)^d and v_D(t) ∈ H^1(Ω)^d we denote extensions of the Dirichlet data into the domain. The Cauchy stress tensor of the Navier-Stokes equations in ALE coordinates is given in terms of the kinematic viscosity ν_f and the densities ρ_f and ρ_s. In the solid we consider the St. Venant Kirchhoff material with its Piola Kirchhoff stress tensor, depending on the shear modulus µ_s and the Lamé coefficient ν_s. In (1) we construct the ALE extension u_f = u|_F by a simple harmonic extension. A detailed discussion and further literature on the construction of this extension is found in [31].
For shorter notation, we denote by U := (v, u, p f ) the solution and by Φ := (ξ, φ, ψ f , ψ s ) the test functions.

Discretization
We give a very brief presentation on the numerical approximation of System (1). In time, we use the theta time stepping scheme, which includes the backward Euler method, the Crank-Nicolson scheme and variants like the fractional step theta method, see [34]. In space we use conforming finite elements.

Temporal discretization
For the discretization in time we split the temporal interval I = [0, T] into discrete time steps 0 = t_1 < t_2 < · · · < t_N = T with the step size k := t_n − t_{n−1}. For simplicity we assume that the subdivision is uniform. By U_n ≈ U(t_n) we denote the approximation at time t_n. We choose the theta time stepping method for temporal discretization with θ ∈ [0, 1]. One time step t_{n−1} → t_n of the theta scheme couples the states at t_{n−1} and t_n, with the averages J̄_n = 1/2 (J_{n−1} + J_n) and F̄_n = 1/2 (F_{n−1} + F_n). Note that the ALE extension equation A_ALE, the divergence equation A_div and the pressure coupling A_p are treated completely implicitly. A discussion of this scheme and results on its stability for fluid-structure interactions are found in [31]. Usually we consider θ = 1/2 + O(k) to get second order convergence and good stability properties. The last equation in (3) gives a relation for the new deformation at time t_n,

u_n = u_{n−1} + kθ v_n + k(1 − θ) v_{n−1}  in S,

and we will use this representation to statically condense the unknown deformation and base the solid stresses purely on the last time step and the unknown velocity, i.e. by expressing the deformation gradient as

F_n = I + ∇u_{n−1} + kθ ∇v_n + k(1 − θ) ∇v_{n−1}.

Removing the solid deformation from the momentum equation will help to reduce the algebraic systems in Section 4. A similar technique within a Eulerian formulation and using a characteristics method is presented in [29,28].
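The condensation step can be illustrated on a hypothetical 1d scalar field (our own toy example; the actual scheme acts on vector-valued finite element functions): since the deformation update is affine in the new velocity, the deformation gradient at t_n can be built from the old state and the velocity unknown alone.

```python
import numpy as np

# Static condensation sketch (hypothetical 1d scalar fields):
# the solid deformation at t_n is eliminated via
#   u_n = u_{n-1} + k*theta*v_n + k*(1-theta)*v_{n-1},
# so the deformation gradient F_n = 1 + du_n/dx can be expressed
# through the (unknown) velocity v_n alone.
k, theta = 0.01, 0.5 + 0.01          # theta = 1/2 + O(k)
x = np.linspace(0.0, 1.0, 101)
u_old = 0.1 * np.sin(np.pi * x)      # deformation at t_{n-1}
v_old = 0.2 * np.cos(np.pi * x)      # velocity at t_{n-1}
v_new = 0.3 * np.cos(np.pi * x)      # velocity at t_n (the Newton unknown)

# condensed update: the deformation never enters as a separate unknown
u_new = u_old + k * theta * v_new + k * (1.0 - theta) * v_old

# deformation gradient assembled directly from u_new ...
F_direct = 1.0 + np.gradient(u_new, x)
# ... and equivalently from old state plus new velocity only
F_condensed = (1.0 + np.gradient(u_old, x)
               + k * theta * np.gradient(v_new, x)
               + k * (1.0 - theta) * np.gradient(v_old, x))

assert np.allclose(F_direct, F_condensed)
```

Both expressions agree because the update is linear in the fields, which is exactly what allows the solid stresses to be based on the velocity unknown.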

Finite elements
In space, we discretize with conforming finite elements by choosing discrete function spaces U_h ∈ X_h and Φ_h ∈ Y_h. We only consider finite element meshes that resolve the interface I in the reference configuration, such that the ALE formulation always exactly tracks the moving interface. In our setting, implemented in the finite element library Gascoigne 3D [5], we use quadratic finite elements for all unknowns and add stabilization terms based on local projections [4,16,27,31] to satisfy the inf-sup condition. Where transport is dominant, additional stabilization terms of streamline upwind type [35,32,21] or of local projection type [31,12] are added. As the remainder of this manuscript only considers the fully discrete setting, we refrain from indicating spatially or temporally discrete variables with the usual subscripts. Further details on the discrete setting are given in [31]. For each time step t_{n−1} → t_n we introduce a short notation for the system of algebraic equations, based on the splitting of the solution into unknowns acting in the fluid domain (v_f, u_f), on the interface (v_i, u_i) and in the solid (v_s, u_s); the pressure variable p acts in the fluid and on the interface. Here, D describes the divergence equation, M the two momentum equations, E the ALE extension (which depends on the interface deformation u_i, as u_f is its extension) and U the relation between solid velocity and solid deformation. Note that M, the term describing the momentum equations, does not directly depend on the solid deformation u_s, as we base the deformation gradient on the velocity, see (4). We will also use the notation M_f, M_i, M_s to account for the influence of the test functions in the fluid domain, on the interface and in the solid domain. The same notation is used for D, E and U.

Solution of the algebraic systems
In fluid-structure interactions the solid and fluid problems are coupled via interface conditions. The forces in normal direction along the interface have to be equal (dynamic coupling condition) and the fluid domain has to follow the solid motion (kinematic and geometric coupling conditions). If the solid motion is rather small and slow, the energy exchange happens mainly via the dynamic coupling condition. This allows the use of explicit time-stepping schemes for the mesh motion and the ALE transformation in such examples. We follow a different approach and use a fully implicit time stepping with an inexact Jacobian in the Newton algorithm: we neglect the derivatives with respect to the ALE deformation. Thereby, in every Newton step we have to solve a linear system of the same complexity as in the case of a partitioned time-stepping scheme.
In [31, chapter 5] we give a numerical study on different linearization techniques. It is found that the overall computational time can be reduced by neglecting the ALE derivatives in the Jacobian. Even for the fsi-3 benchmark problem of Hron and Turek [22] it is more efficient (in terms of overall computational time) to omit these derivatives at the cost of some additional Newton steps. Neglecting the ALE derivatives will be crucial for the reduction step described in the following section.
As we only change the Jacobian, we still apply a fully implicit time-stepping scheme and take advantage of its stability properties. Furthermore, the transport due to the mesh motion is well approximated. For small time-step sizes we still observe super-linear convergence as with an exact Newton algorithm. In addition, the simplified structure of the matrix considerably eases the development of preconditioners, as we will see later.

Linearization and splitting
Each time step of the fully discrete problem is solved by Newton's method. Evaluating the Jacobian is cumbersome due to the moving-domain character of the fluid problem. First presentations of the derivatives of the fsi problem with respect to the mesh motion, based on the concept of shape derivatives, have been given by Fernandez and Moubachir [15]. Later, van der Zee and co-workers discussed two approaches for the evaluation of the Jacobian, using shape derivatives [37] and using the ALE map [36]; the latter is similar to the approach used in our framework. Details are given in [31, Section 5.2.2]. Based on the notation (5), let U^(0) be an initial guess (usually taken from the last time step); for l = 0, 1, 2, . . . we iterate
with a line search parameter ω^(l) > 0 and the Jacobian A'(U) evaluated at U^(l). Each linear problem inherits the block structure of the system (5). By M we denote the contributions of the joint momentum equation, the superscripts f, i, s indicating the affiliation of the test function to fluid, interface and solid, and the subscript indicating the direction of the derivative. Due to the velocity-based representation of the solid stresses by (4), several blocks are either completely zero, M^i_{u_s} = M^s_{u_s} = M^s_{u_i} = 0, or partially zero, such as M^i_{u_i} and M^f_{u_i}, as these terms also include the ALE derivatives of the Navier-Stokes equations that depend on the fluid and interface deformations u_f and u_i. We have highlighted all entries of the Jacobian that correspond to derivatives with respect to the ALE extension u_f. These matrix entries will be set to zero, and we note once more that this is the only approximation within our Newton-multigrid scheme. The dropped ALE derivatives are the most costly parts of the matrix assembly. While skipping these terms does worsen the Newton convergence rate, the overall computational time still benefits, see [31, Section 5.2.3]. Sorting the unknowns accordingly, the reduced linear system decomposes into three sub-steps. First, the coupled momentum equation, living in the fluid and solid domains and acting on pressure and velocity, is solved. Second, the update equation for the deformation on the interface and within the solid domain is solved, which, as a finite element discretization of the zero-order equation u_n = u_{n−1} + k(1 − θ)v_{n−1} + kθ v_n, only involves the mass matrix on both sides, such that this update can be performed by one vector addition.
Finally, it remains to solve the ALE extension equation, a single simple equation, usually either a vector Laplacian or a linear elasticity problem, see [31, Section 5.2.5].
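The three-substep solve of the block lower triangular (approximated) Jacobian can be sketched with a small numpy example. The blocks below are randomly filled stand-ins for the discrete operators of the momentum equation (9), the deformation update (10) and the ALE extension (11); sizes and content are our own toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, s, e = 8, 5, 4   # toy sizes: momentum (v, p), deformation update, ALE extension

# After dropping the ALE derivatives, the Jacobian is block lower triangular
# in the ordering (velocity/pressure, deformation, extension):
M  = rng.standard_normal((m, m)) + m * np.eye(m)   # coupled momentum block, cf. (9)
Uv = rng.standard_normal((s, m))                   # velocity coupling in the update, cf. (10)
Mu = np.diag(rng.uniform(1.0, 2.0, s))             # mass matrix of the update equation
Eu = rng.standard_normal((e, s))                   # interface data entering the extension
E  = rng.standard_normal((e, e)) + e * np.eye(e)   # harmonic/elastic extension operator, cf. (11)

A = np.block([
    [M,                np.zeros((m, s)), np.zeros((m, e))],
    [Uv,               Mu,               np.zeros((s, e))],
    [np.zeros((e, m)), Eu,               E               ],
])
b = rng.standard_normal(m + s + e)

# three sequential substeps instead of one monolithic solve
x_vp  = np.linalg.solve(M, b[:m])                       # 1) momentum equation
x_def = np.linalg.solve(Mu, b[m:m + s] - Uv @ x_vp)     # 2) deformation update
x_ext = np.linalg.solve(E, b[m + s:] - Eu @ x_def)      # 3) ALE extension

# forward substitution reproduces the monolithic solution exactly
assert np.allclose(np.concatenate([x_vp, x_def, x_ext]), np.linalg.solve(A, b))
```

Because the approximated Jacobian is block triangular, the splitting introduces no additional error: forward substitution is an exact solve.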
The main effort lies in the momentum equation (9), which is still a coupled fluid-solid problem with saddle-point character due to the incompressibility constraint.
Details on the derivatives appearing in (9) are given in [15,36,37] and, in the framework of this work, in [31, Section 5.2.2]. Note however that most of these terms, including all derivatives of the Navier-Stokes equations in direction of the fluid domain deformation u_f, are skipped, such that the resulting fluid problem is a weighted (due to the domain deformation) variant of the Navier-Stokes equations.

Solution of the linear problems
The efficient solution of the linear systems arising in Newton approximations to nonlinear fluid-structure interaction problems is still an open problem. Lately, some progress has been made in the direction of multigrid preconditioners for the monolithic problem [17,30,2,31]. In all these contributions it has proven essential to apply a partitioning into fluid problem and solid problem within the smoother. The authors of [7] analyzed a simplified fluid-structure interaction problem and showed that a partitioned (exact) inversion of the fluid and solid problems within the multigrid solver acts as a perfect smoother, with convergence rates tending to zero on finer meshes.
We shortly present the linear algebra framework used in the software library Gascoigne 3D [5]. We use equal-order finite elements for all unknowns, namely pressure, velocity and deformation, such that we can block all degrees of freedom locally, and the solution U_h is stored as a block vector. By N_h we denote the number of degrees of freedom (for every unknown), by d the dimension. Likewise, the system matrix A is a matrix with block structure. Considering the approximation scheme described in (9), (10) and (11), the momentum problem has n_c^M = d + 1 components and the extension problem consists of n_c^E = d components. In general, the complete linear algebra module acts on matrices and vectors with a block structure and local blocks of size n_c × n_c and n_c, respectively. The linear solver is designed by the following approach: • As outer iteration we employ a GMRES method. Usually very few (< 10) iterations are required, such that restarting strategies are not needed.
• The GMRES solver is preconditioned by a geometric multigrid method in V-cycle [3,24]. The finite element mesh of each multigrid level resolves the fluid-solid interface.
• As smoother in the multigrid solver we use a Vanka type iteration which we will outline in some detail.
The smoothers for the velocity problem and for the ALE extension problem are of Vanka type. Let N_h be the set of degrees of freedom of the discretization on the mesh level Ω_h. By P = {P_1, . . . , P_{n_P}} with P_i ⊂ N_h we denote a partitioning of the unknowns into local patches. In the most simple case, P_i includes all degrees of freedom in one element of the mesh. Larger patches, e.g. by combining 4 adjacent elements in 2d or 8 elements in 3d, are possible. By n_P we denote the number of patches and by n_p the size of each patch, which is the number of degrees of freedom in the patch. For simplicity, we assume that all patches in P have the same size. By R_i : R^N → R^{n_p} we denote the restriction of a global vector to the degrees of freedom of one patch, by R_i^T the prolongation. Given a block vector x ∈ R^{N_h n_c} and a block matrix A ∈ R^{N_h n_c × N_h n_c}, we denote by A_i := R_i A R_i^T and b_i := R_i b the restrictions to the degrees of freedom of one patch P_i. We iterate over all patches simultaneously with a damping parameter ω_V ≈ 0.8. This smoother can also be considered as a domain decomposition iteration with minimal overlap. Numerical tests have shown that this simple Jacobi coupling is more efficient than a corresponding Gauss-Seidel iteration. The local matrices A_i are inverted exactly using the library Eigen [18]. The local matrices are of substantial size: for d = 3, the local matrices corresponding to the momentum equation (9) are of size 108 × 108 if small patches are used and 500 × 500 if the smoother is based on large patches.

Parallelization
Basic features of Gascoigne 3D [5] are parallelized based on OpenMP [25]. For parallelization of the assembly of residuals and the matrix as well as application of the Vanka smoother (12) we use a coloring of the patches P such that no collisions appear. The usual memory bottleneck of finite element simulations will limit the parallel efficiency of matrix vector product and Vanka smoother. We will present some data on the parallel performance in Section 5.5.3.
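Such a collision-free coloring of the Vanka patches can be sketched with a simple greedy scheme (our own minimal illustration, not the algorithm implemented in Gascoigne 3D):

```python
def color_patches(patches):
    """Greedy coloring: two patches that share a degree of freedom get
    different colors, so all patches of one color can be smoothed in
    parallel without write conflicts."""
    dof_owner = {}                      # dof -> indices of patches containing it
    for i, p in enumerate(patches):
        for dof in p:
            dof_owner.setdefault(dof, []).append(i)
    colors = []
    for i, p in enumerate(patches):
        taken = {colors[j] for dof in p for j in dof_owner[dof] if j < i}
        c = 0
        while c in taken:               # smallest color not used by a neighbor
            c += 1
        colors.append(c)
    return colors

# overlapping 1d patches: patch i covers dofs [2i, 2i+3] (overlap of 2 dofs)
patches = [list(range(2 * i, 2 * i + 4)) for i in range(10)]
colors = color_patches(patches)

# no two overlapping patches share a color
for i, p in enumerate(patches):
    for j, q in enumerate(patches):
        if i < j and set(p) & set(q):
            assert colors[i] != colors[j]
```

For the chain of overlapping patches above, two colors suffice; patches of equal color can then be processed in one parallel loop.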

Problem configuration
Two different test-cases are considered to study the performance of the discretization and the solvers that have been presented in Sections 3 and 4. First, we perform a numerical study based on the fsi-2d-3 benchmark problem that has been defined by Hron and Turek [22]. Second, we present a new 3d benchmark configuration that is based on the Hron & Turek problem.

2d configuration
As the two dimensional configuration we solve the nonstationary fsi-2d-3 benchmark problem that has been introduced by Hron and Turek [22] and has since been revisited in many contributions, e.g. [19,32] or [31, Chapter 7]. We present results for this well established benchmark problem in order to validate the discretization and to compare the performance of the solver with results published in the literature. The material parameters are given in Table 1; they yield a Reynolds number (where we choose L = 0.1 m as the diameter of the cylinder) for which a periodic flow pattern develops.

3d configuration
The midpoint of the cylinder is placed slightly non-symmetrically to allow for a stable oscillatory flow at low Reynolds numbers. Attached to the cylinder is an elastic beam of approximate dimensions 0.35 × 0.02 × 0.2, given in its initial state at time t = 0. The reference fluid domain at time t = 0 is the channel (0, 2.8) × (0, 0.41) × (0, 0.41) with cylinder and beam removed. Boundary conditions: The boundary of the domain is split into the inflow boundary Γ_in at x = 0, the outflow boundary Γ_out at x = 2.8, the wall boundaries at z = 0 and z = 0.41 as well as y = 0 and y = 0.41, and the cylinder boundary Γ_cyl at (x − 0.5)² + (y − 0.2)² = 0.05². On the inflow boundary Γ_in we prescribe a bi-parabolic profile

v_in = v̄ · 36 y(0.41 − y) z(0.41 − z) / 0.41⁴,

which satisfies |Γ_in|⁻¹ ∫_{Γ_in} v_in ds = v̄, where v̄ is the average velocity. For regularization we suggest to introduce a transient start-up of the inflow v(t). On the remaining boundaries Γ_wall ∪ Γ_cyl the no-slip condition v = 0 is prescribed. For the deformation u (both the solid deformation and the ALE extension), the condition u = 0 is prescribed on all boundaries. On the outer boundaries Γ_wall, Γ_in and Γ_out this condition can be relaxed to allow for larger mesh deformations, see [31, Section 5.3.5].
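The claim that the bi-parabolic profile has mean value v̄ over the cross section can be checked numerically; the sketch below uses a simple midpoint quadrature, with H = 0.41 taken from the configuration and v̄ = 1 as an arbitrary test value.

```python
import numpy as np

# Bi-parabolic inflow profile of the 3d configuration:
#   v_in(y, z) = vbar * 36 * y*(H - y) * z*(H - z) / H**4,  H = 0.41.
H, vbar = 0.41, 1.0
n = 400
h = H / n
pts = (np.arange(n) + 0.5) * h          # midpoint rule nodes in [0, H]
Y, Z = np.meshgrid(pts, pts, indexing="ij")
v_in = vbar * 36.0 * Y * (H - Y) * Z * (H - Z) / H**4

# mean over the square cross section |Gamma_in| = H^2
mean = v_in.sum() * h * h / H**2
assert abs(mean - vbar) < 1e-4
```

Analytically, ∫₀ᴴ y(H − y) dy = H³/6 in each direction, so the factor 36/H⁴ normalizes the mean of the product to exactly v̄; the quadrature confirms this up to discretization error.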
Material parameters: Material parameters similar to the 2d setting are chosen; the values are given in Table 1. These parameters yield a Reynolds number for which a periodic flow pattern arises.

Quantities of interest
We evaluate drag and lift by a residual representation, where 1_{Γ_cyl} is a finite element test function which is one on the cylinder Γ_cyl and zero elsewhere. Thereby we can compute the mean drag and lift values on every time interval I_n = [t_n, t_{n+1}] with very high precision. Details on the evaluation of such surface integrals for flow problems are given in [6] and, for the case of fluid-structure interactions, in [31, Section 6.6.2].
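The residual trick can be illustrated on a 1d model problem (a sketch of the general technique, not the paper's fsi functional): for −u'' = f on (0, 1) with u(0) = u(1) = 0, the boundary flux u'(1) has the residual representation u'(1) = a(u_h, φ_N) − (f, φ_N), where φ_N is the finite element function that is one at x = 1 and zero at all other nodes, playing the role of 1_{Γ_cyl}.

```python
import numpy as np

n = 100                        # number of linear elements
h = 1.0 / n
f = 1.0                        # constant right-hand side of -u'' = f
x = np.linspace(0.0, 1.0, n + 1)
u = x * (1.0 - x) / 2.0        # nodal values of the exact solution
                               # (linear FE is nodally exact for this problem)

# residual representation of the flux: only the last element supports
# the boundary test function phi_N, so
#   a(u_h, phi_N) = (u[-1] - u[-2]) / h   and   (f, phi_N) = f * h / 2
flux = (u[-1] - u[-2]) / h - f * h / 2.0

# exact value: u'(1) = -1/2, reproduced here without differentiating u_h
assert abs(flux - (-0.5)) < 1e-12
```

Naively differentiating u_h at the boundary would be only first-order accurate; the residual representation is what yields the "very high precision" claimed above.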

Approximate Newton scheme
We perform the computation of the fsi-2d-3 benchmark by Hron and Turek on the time interval I = [5, 5.5] with the same parameters and discretization as in [31, Section 5.2.3]. The comparison with the results in [31] enables us to evaluate the effects of the presented inexact Jacobian on the Newton scheme. On the time interval I = [5, 5.5] the oscillations are fully developed, such that significant deformations appear and the geometric nonlinearities coming from the ALE mapping have to be taken into account. We only update the matrix in (9) if the nonlinear convergence rate, which can easily be measured as the ratio of consecutive nonlinear residual norms, is above a given threshold γ_nt. The Jacobian in (11), used to solve the mesh motion, is only assembled once in the first time step, as we use a linear elasticity law. As in [31], we investigate the behavior for the parameters γ_nt ∈ {0, 0.05, 0.2, 0.5}.

Matrix ass. tolerance    γ = 0.0   γ = 0.05   γ = 0.2   γ = 0.5
Total Newton steps         460       559        741       800
Jacobians assembled        460       164        110        85
Total time (seconds)      1753       950        899       936

Table 2: Accumulated number of Newton steps, assemblies of the Jacobian in Equation (9) and total computational time.

We can see in Table 2 that we need 460 Newton steps if we assemble the Jacobian in (9) in every Newton step. As we neglect the sensitivity information with respect to the mesh motion, we still have an inexact Newton scheme. Nevertheless, we need fewer Newton steps compared to the use of an exact Jacobian as in [31], where 532 Newton steps were required for the same setting. This is in line with the numerical tests on inexact Jacobians for the fsi-2d-3 benchmark in [31], where in first numerical studies no disadvantages due to the inexact Jacobian could be observed. Nevertheless, the better convergence rate is surprising. The direct solver UMFPACK [10] has difficulties solving the exact Jacobian accurately enough, as reported in [30,31], which could be the reason for the higher number of Newton steps.
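The reassembly strategy controlled by γ_nt can be sketched as follows (our own toy 2×2 nonlinear system and function names; the paper applies the same logic to the fsi Newton iteration):

```python
import numpy as np

def newton_lazy_jacobian(F, J, x, gamma_nt=0.2, tol=1e-10, maxit=50):
    """Newton iteration that reassembles the Jacobian only when the
    measured convergence rate ||r_new|| / ||r_old|| exceeds gamma_nt."""
    A = J(x)                              # initial Jacobian assembly
    r = F(x)
    assemblies, steps = 1, 0
    while np.linalg.norm(r) > tol and steps < maxit:
        x = x - np.linalg.solve(A, r)     # Newton step with current (possibly old) A
        r_new = F(x)
        rate = np.linalg.norm(r_new) / max(np.linalg.norm(r), 1e-300)
        if rate > gamma_nt:               # convergence too slow: refresh the Jacobian
            A = J(x)
            assemblies += 1
        r = r_new
        steps += 1
    return x, steps, assemblies

# toy nonlinear system: x0^2 + x1 = 1, x0 + x1^2 = 1
F = lambda x: np.array([x[0]**2 + x[1] - 1.0, x[0] + x[1]**2 - 1.0])
J = lambda x: np.array([[2.0 * x[0], 1.0], [1.0, 2.0 * x[1]]])

x, steps, assemblies = newton_lazy_jacobian(F, J, np.array([0.8, 0.3]))
assert np.linalg.norm(F(x)) < 1e-10
assert assemblies <= steps + 1
```

A larger γ_nt trades extra Newton steps for fewer (expensive) matrix assemblies, which is exactly the balance reported in Table 2.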
A similar study in [2] shows better robustness of the linear solver MUMPS [1]. The condition numbers for the matrices of the subproblems (9), (10) and (11) are much better than for the exact Jacobian, as already analyzed in [31].
The behavior with respect to the parameter γ_nt is comparable to the results in [31]. For the pure Newton scheme a maximum of 5 Newton steps is required, in comparison to 20 Newton steps for γ_nt = 0.5. With respect to computational time, Table 2 shows that γ = 0.2 is most efficient, as the reduced time to assemble the Jacobian and the increased time due to more Newton steps balance best. The inexact Jacobian only has minor influence on the sensitivity of the Newton scheme with respect to the parameter γ_nt.

Table 4: Results of the fsi-2d-3 benchmark with time step sizes k = 0.004s, k = 0.002s and k = 0.001s.

Reference values
Table 5: Results of the fsi-3d-3 benchmark with time step sizes k = 0.004s, k = 0.002s and k = 0.001s.

The corresponding solutions at time t = 8s act as initial values for further computations on the interval I = [8, 10] based on the time step sizes k = 0.004s, k = 0.002s and k = 0.001s; the mesh levels are indicated in Table 3. To avoid inaccuracies in the reference values due to a rapid change of the numerical discretization parameters, we only present results on the interval I = [9, 10]. A similar approach based on adaptive time-stepping schemes is demonstrated in [13] and shows accurate results.

Reference values for the 2d configuration
We summarized the maximal and minimal values for the functionals on various refinement levels and time step sizes in Table 4. The values indicate convergence of the algorithm in space and a dominance of the spatial discretization error on the coarse grids in comparison to the temporal discretization error.

Reference values for the 3d configuration
To make our results comparable, we computed the displacement at the point B and the drag and lift around the whole cylinder and the flag. In Figure 4 we show the different functionals as functions over the time interval I = [9, 10]. In addition, we summarize the maximal and minimal values for the different meshes indicated in Table 5 and for different time step sizes k. To draw a conclusion on convergence or to present reference values, the computation has to be repeated on even finer meshes in the future.

Performance of the linear solver
To test the linear iterative solver presented in Section 4, we recomputed the solution on different mesh levels for the 2d and 3d benchmark configurations on the time interval I = [9, 9.5] with time-step size k = 0.002s (250 steps). The beam oscillates in this time interval. Hence, due to the strong coupling, the solution of the Newton system is very challenging and the fluid as well as the solid elasticity problem both have to be solved very accurately. The Newton algorithm in every time step terminates if the residual is reduced by eight orders of magnitude (relative tolerance 10⁻⁸) or if the absolute value of the residual falls below 10⁻⁸. In every Newton step, the iterative solver for the linear problem (9) reduces the residual by four orders of magnitude. The parameter γ = 0.05 is chosen as in Section 5.3 to decide if the Jacobian in (9) is reassembled in the next Newton step. The mesh motion subproblem (11) is a linear elasticity problem and hence can be solved very efficiently with the geometric multigrid solver. Nevertheless, as we have to solve (11) after every Newton step, the solution of this linear system still contributes significantly to the computational time. The matrix for the linear mesh motion problem (11) only has to be assembled once in the first step.

Table 6: Vanka blocking strategy for the 2d test case. We either choose every element as one block or combine 4 elements into one block on the fluid (F) and solid (S) domain. We present the number of Newton steps, matrix assemblies and number of GMRES steps per linear solve of (9) on mesh level 6, and the computational time relative to the F: 27/S: 27 case.
In the following, we will only present averaged values. By "mean time per Newton step" we denote the average time of each step, measured over all 250 time steps. Hence, this average value also includes the time to reassemble the Jacobian, whose assembly frequency depends on the Newton rate, see Section 5.3. To make the values comparable with other solution approaches, we additionally present the mean time to assemble one Jacobian of (9). In the case of the direct solver, this includes the times for preparation and computation of the LU decomposition. In the case of the ILU and Vanka smoothers, the assembly times include the time to compute the ILU decomposition or the LU decompositions of the block matrices A_i.

Dependency on the Vanka patch size
For the Vanka smoother the question arises how large we should choose the patches P_i to solve the linear system (9) most efficiently. The simple structure of the Vanka solver enables the use of different patch sizes on the fluid and solid domains. To test different blocking strategies, we recorded the computational time for the fsi-2d benchmark on the finest mesh level 6 and present the mean number of Newton steps and matrix assemblies per time step in Table 6. We either choose patches consisting of one element (n_p = 3² · 3 = 27) or patches stretching over four adjacent elements (n_p = 5² · 3 = 75). This yields local matrices of size A_i ∈ R^{27×27} or, if the large patches are used, A_i ∈ R^{75×75}.
We observe that the minimal number of GMRES steps to solve (9) in every Newton step is obtained if we use n_p = 75. If we only use the degrees of freedom of one element as a block on the solid domain, the number of GMRES steps increases and the Newton convergence suffers. This effect cannot be observed if we use smaller patches only within the fluid domain but large patches in the solid. Therefore, in the following sections we will always use n_p = 75 in the Vanka smoother for computations in 2d.
In 3d the same blocking strategy would mean combining 8 elements into one block, resulting in n_p = 5³ · 4 = 500 and matrices of size A_i ∈ R^{500×500}. This strategy is prohibitively expensive, with increasing memory and time consumption for each block-LU. As the results in Table 6 show that it is sufficient to use small patches in the fluid domain, we will combine large patches with n_p = 500 in the solid with smaller patches of size n_p = 3³ · 4 = 108 in the fluid domain for all 3d computations to follow.

Figure 5: Memory consumption in kB for the 2d (left) and 3d (right) configuration using either the direct solver or geometric multigrid with ILU or Vanka smoother.

Geometric multigrid performance in 2d and 3d
All computations in this section have been carried out on an Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz, using a single core only. In Figure 5, the memory consumption of the direct linear solver UMFPACK and of the geometric multigrid solver with ILU and Vanka smoothers is given for different refinement levels. In Figure 6, the mean time per Newton step and the time to assemble the Jacobian in (9) can be found. The upper row shows the results for the 2d benchmark problem, the lower row corresponds to the 3d problem. The mean number of Newton steps and the mean number of Jacobians assembled per time step are given in Table 7. In addition, we present the mean number of GMRES steps required to solve the linear problem (9) once. Figure 7 shows the average number of GMRES steps required for both the Vanka and the ILU smoother in every time step. The values fluctuate due to the oscillatory motion of the beam. To solve the mesh-motion subproblem, about 2 to 6 GMRES steps were needed per linear solve. In 2d, all solvers perform similarly. In 3d, the time to construct the LU decomposition rises quickly and dominates the computational time, which shows the necessity of iterative solvers. The geometric multigrid solver behaves very similarly with the Vanka and the ILU smoother in 2d and 3d, and it is almost h-independent for both smoothers. In 3d, the Vanka smoother requires 11 GMRES steps on the finest mesh level, versus an average of 15.3 GMRES steps for the ILU smoother. Despite the higher number of GMRES steps, the computational times are very similar.
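The near h-independence of geometric multigrid can be illustrated on a toy problem. The sketch below, a two-grid cycle for a 1d Poisson problem with damped Jacobi smoothing, full-weighting restriction, and an exact Galerkin coarse solve, is not the FSI solver of this paper; it merely shows that the per-cycle residual reduction rate barely changes when the mesh is refined. All names and parameter values are illustrative.

```python
import numpy as np

def laplace(n):
    # 1d Dirichlet Laplacian on n interior points (unscaled stencil -1, 2, -1)
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def interpolation(n):
    # linear interpolation from the coarse grid, assuming n = 2*nc + 1
    nc = (n - 1) // 2
    P = np.zeros((n, nc))
    for j in range(nc):
        i = 2 * j + 1          # fine index of coarse point j
        P[i, j] = 1.0
        P[i - 1, j] = 0.5
        P[i + 1, j] = 0.5
    return P

def two_grid(A, b, x, P, nu=3, omega=2.0 / 3.0):
    """One two-grid cycle: nu damped-Jacobi pre-/post-smoothing steps,
    full-weighting restriction R = 0.5 * P^T, exact Galerkin coarse solve."""
    D = np.diag(A)
    for _ in range(nu):
        x = x + omega * (b - A @ x) / D      # pre-smoothing
    R = 0.5 * P.T
    Ac = R @ A @ P                           # Galerkin coarse operator
    x = x + P @ np.linalg.solve(Ac, R @ (b - A @ x))
    for _ in range(nu):
        x = x + omega * (b - A @ x) / D      # post-smoothing
    return x

rates = []
for n in (31, 127):
    A, b, x = laplace(n), np.ones(n), np.zeros(n)
    P = interpolation(n)
    r0 = np.linalg.norm(b)
    for _ in range(10):
        x = two_grid(A, b, x, P)
    rates.append((np.linalg.norm(b - A @ x) / r0) ** 0.1)  # mean rate per cycle
```

Both mesh sizes yield nearly the same small reduction rate per cycle, which is the qualitative behavior reported for the ILU and Vanka smoothers above.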
On mesh level 6 in the 2d configuration, we need 43.88s per Newton step according to Figure 6 and an average of 5.1 Newton steps according to Table 7. We want to highlight that this configuration is not directly comparable with the one used in our work.

Parallelization
The Vanka smoother, being based on a Jacobi iteration, has the advantage that it can be easily parallelized. We introduced a cell-wise coloring of the domain and an additional coloring of the Vanka patches. Thereby, we can parallelize the loops over all cells to compute the Newton residual and to assemble the matrix. In addition, we can apply the Vanka smoother to all patches of one color in parallel. As different patch sizes for the fluid and solid domains are used in 3d, a different color is always allocated to fluid and solid patches, such that a good load balancing is achieved. Furthermore, we parallelized the matrix-vector product. All parallelization is done in OpenMP [25]. Similar to Section 5.5.2, we recompute the 2d and 3d problems on the time interval I = [9, 9.5] with the step size k = 0.002s, using the finest refinement levels 6 (in 2d) and 3 (in 3d). The mean computational time per time step on an Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz is given in Figure 8 in a strong scalability test. In 3d, we can observe that the parallelization of all ingredients scales rather well: if we double the number of cores, the computational time reduces by a factor of 0.57. With 32 threads we achieve a speedup of factor 10 in comparison to single-core performance.

Conclusion
The Newton solver even benefits from the approximation of neglecting the ALE derivatives in the Jacobian, as the computational time for assembling these derivatives is very high. The reduction has two positive effects: first, the large system of 7 unknowns (in 3d) decomposes into one fluid-solid problem in pressure and velocity with 4 unknowns and two partitioned systems with 3 unknowns each for solving the solid and fluid deformations. The second effect is the better conditioning of the coupled system, which allows for the use of very simple multigrid smoothers that are easy to parallelize.
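The patch coloring described in the Parallelization section can be sketched as follows. This is a minimal Python illustration of the idea only (the paper's implementation uses OpenMP); it performs a greedy coloring in which patches that share a degree of freedom, or that belong to different subdomains, never receive the same color, so that every color class can be processed concurrently with uniform per-patch cost.

```python
def color_patches(patches, domains):
    """Greedy coloring of Vanka patches: two patches that share a dof get
    different colors, and patches from different subdomains (fluid/solid)
    never share a color, mirroring the load-balancing choice in the text."""
    sets = [set(p) for p in patches]
    colors = [-1] * len(patches)
    for i in range(len(patches)):
        forbidden = {colors[j] for j in range(i)
                     if sets[i] & sets[j] or domains[i] != domains[j]}
        c = 0
        while c in forbidden:
            c += 1
        colors[i] = c
    return colors

# four overlapping patches, two on the fluid and two on the solid domain
patches = [[0, 1, 2], [2, 3, 4], [4, 5, 6], [6, 7, 8]]
domains = ["fluid", "fluid", "solid", "solid"]
colors = color_patches(patches, domains)
```

All patches of one color touch pairwise disjoint degrees of freedom, so the corresponding local block solves can safely run in parallel threads without write conflicts.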
In comparison to our past approaches, which were based on a monolithic solution of the complete pressure-velocity-deformation system with partitioned smoothers, and also in comparison to approaches presented in the literature, we could significantly reduce the computational time.
As a basis for future benchmarking of 3d fluid-structure interactions, we presented an extension of the 2d benchmark problems by Hron and Turek [22] that is by far more challenging (due to larger deformations and a strong dynamic behavior) than a first test case introduced in our past work [30]. It will still require further effort to establish reference values for this new 3d benchmark case.
Our work includes some first simple steps of parallelization, which will have to be extended in future work.