1 Introduction

Chemo-mechanics problems have gained increasing attention in the past decades, as a more refined understanding of processes in man-made and natural materials as well as living tissue can only be obtained by incorporating mechanical and chemical loading conditions and their mutual interactions. Research in various fields of chemo-mechanics has emerged and is concerned, for example, with the coupling of deformation and diffusion in gels [18, 19, 28, 42, 58], the prediction of the transition from pitting corrosion to crack formation in metals [17, 20], hydrogen diffusion [22, 55] and embrittlement [1, 4, 45], functional degradation in Li-ion batteries [53, 54], chemical reaction-induced degradation of concrete [52, 64], diffusion-mediated tumor growth [29, 65], or the modeling of lithium-ion batteries [51].

As all of these examples involve a strong coupling of mechanical balance relations and mass diffusion, either of Fickian or gradient-extended Cahn-Hilliard type, we adopt an established benchmark problem of swelling of hydrogels [14, 43, 57] that, on the one hand, accounts for this coupling and, on the other hand, is simple enough to develop efficient, problem-specific numerical solution schemes.

In this paper, we are interested in the model presented in [13], which is derived from an incremental variational formulation and can therefore easily be recast into a minimization formulation as well as into a saddle point problem. The different variational formulations also have consequences for the solver algorithms to be applied.

In this contribution, as a first step, we consider the minimization formulation. The discretization of our three-dimensional model problem by finite elements is carried out using the deal.II finite element software library [2]. We solve the arising nonlinear system by means of a monolithic Newton–Raphson scheme; the linearized systems of equations are solved using the Fast and Robust Overlapping Schwarz (FROSch) solver [30, 37], which is part of the Trilinos software [63]. The FROSch framework provides a parallel implementation of GDSW [26, 27] and RGDSW-type [25] overlapping Schwarz preconditioners. These preconditioners have shown good performance for problems ranging from the fluid–structure interaction of arterial walls and blood flow [7] to land ice simulation [39]. Monolithic preconditioners of GDSW type for fluid flow problems, where the coarse problem is, again, a saddle point problem, were presented in [34, 36]. These preconditioners make use of the block structure. Within this project, FROSch was first applied in [44]. A recent extension of the preconditioner to more than two levels has been tested on up to 220,000 cores [41].

Let us note that, in this paper, our focus is on obtaining robustness for a challenging nonlinear coupled problem rather than on achieving a wide range of parallel scalability for linear benchmark problems. We will therefore apply only two-level methods. The FROSch preconditioners considered in this paper are constructed algebraically, i.e., from the assembled finite element matrix, without the use of geometric information and without explicit knowledge of the block structure. The construction of the preconditioners therefore needs to make use of certain approximations, as in [33]. Also note that, in this work, a node-wise numbering is used in the case of the \(Q_{1} Q_{1}\) discretization and a block-wise numbering is used for the \(Q_{1} RT_{0}\) discretization; see Sect. 4.2. In this paper, we will always apply the preconditioners from FROSch in fully algebraic mode; this implies that no explicit information on the block structure is provided in the construction of the preconditioner. Note that recent RGDSW methods with adaptive coarse spaces, e.g., [40], are not fully algebraic and cannot be used here.

For our benchmark problem of the swelling of hydrogels, we consider two sets of boundary conditions. We also compare the consequences of two different types of finite element discretizations for the fluid flux: Raviart-Thomas finite elements and standard Lagrangian finite elements. We then evaluate the numerical and parallel performance of the iterative solver applied to the monolithic system, discussing strong and weak parallel scalability.

2 Variational framework of fully coupled chemo-mechanics

In order to evaluate the performance of the FROSch framework in the context of multi-physics problems, the variational framework of chemo-mechanics is adopted without further modifications, as outlined in [11, 13]. This framework is suitable for modeling hydrogels.

This setting is employed here to solve some representative model problems involving full coupling between mechanics and mass diffusion in a finite deformation setting.

The rate-type potential

$$\begin{aligned} \Pi \left( \dot{{\varvec{\varphi }}},\dot{v},\varvec{J}_{v}\right) = \frac{{\mathrm d}{}}{{\mathrm d}{t}} E\left( \dot{{\varvec{\varphi }}},\dot{v}\right) +D\left( \varvec{J}_{v}\right) - P_{\textrm{ext}}\left( \dot{{\varvec{\varphi }}},\varvec{J}_{v}\right) \end{aligned}$$
(1)

serves as a starting point for our description of the coupled problem, where \({\varvec{\varphi }}\) denotes the deformation, v the swelling volume fraction, and \(\varvec{J}_{v}\) the fluid flux, consistent with the notation introduced in [11]. The stored energy functional E of the body \(\mathcal{B}\) is computed from the free-energy density \(\widehat{\psi }\) as

$$\begin{aligned} E\left( {\varvec{\varphi }},v\right) =\int \nolimits _{\mathcal{B}}\widehat{\psi }\left( \nabla {\varvec{\varphi }},v\right) \textrm{d}V. \end{aligned}$$
(2)

Furthermore, the global dissipation potential functional is defined as

$$\begin{aligned} D\left( \varvec{J}_{v}\right) =\int \nolimits _{\mathcal{B}}\widehat{\phi }\left( \varvec{J}_{v};\nabla {\varvec{\varphi }},v\right) \textrm{d}V, \end{aligned}$$
(3)

involving the local dissipation potential \(\widehat{\phi }\).

Note that the dissipation potential possesses an additional dependency on the deformation via its material gradient and on the swelling volume fraction. However, this dependency is not taken into account in the variation of the potential \(\Pi \) when determining the corresponding Euler-Lagrange equations, as indicated by the semicolon in the list of arguments. Lastly, the external load functional is split into a solely mechanical and a solely chemical contribution of the form

$$\begin{aligned} P_{\textrm{ext}}\left( \dot{{\varvec{\varphi }}},\varvec{J}_{v}\right) =P_{\textrm{ext}}^{{\varvec{\varphi }}}\left( \dot{{\varvec{\varphi }}}\right) + P_{\textrm{ext}}^{\mu }\left( \varvec{J}_{v}\right) , \end{aligned}$$
(4)

where the former includes the vector of body forces per unit reference volume \(\varvec{R}_{{\varvec{\varphi }}}\) and the prescribed traction vector \(\bar{\varvec{T}}\) at the surface of the body \(\partial \mathcal{B}^{\varvec{T}}\) such that

$$\begin{aligned} P_{\textrm{ext}}^{{\varvec{\varphi }}}\left( \dot{{\varvec{\varphi }}}\right) =\int \nolimits _{\mathcal{B}} \varvec{R}_{{\varvec{\varphi }}}\cdot \dot{{\varvec{\varphi }}}\,\textrm{d}V + \int \nolimits _{\partial \mathcal{B}^{\varvec{T}}}\bar{\varvec{T}}\cdot \dot{{\varvec{\varphi }}}\,\textrm{d}A. \end{aligned}$$
(5)

The latter contribution in (4) incorporates the prescribed chemical potential \(\bar{\mu }\) and the normal component of the fluid flux \(H_{v}\) at the surface \(\partial \mathcal{B}^{\mu }\) as

$$\begin{aligned} P_{\textrm{ext}}^{\mu }\left( \varvec{J}_{v}\right) =- \int \nolimits _{\partial \mathcal{B}^{\mu }} \bar{\mu }\underbrace{\varvec{J}_{v}\cdot {{{\textbf{N}}}}}_{H_{v}}\textrm{d}A. \end{aligned}$$
(6)

Along the disjoint counterparts of the mentioned surfaces, namely \(\partial \mathcal{B}^{{\varvec{\varphi }}}\) and \(\partial \mathcal{B}^{H_{v}}\), the deformation and the normal component of the fluid flux are prescribed, respectively. Taking into account the balance of solute volume

$$\begin{aligned} \dot{v}=-{\text {Div}}[{\varvec{J}_{v}}] \end{aligned}$$
(7)

in (1) allows one to derive the two-field minimization principle

$$\begin{aligned}{} & {} \Pi \left( \dot{{\varvec{\varphi }}},\varvec{J}_{v}\right) \nonumber \\{} & {} \quad =\int \nolimits _{\mathcal{B}} \underbrace{ \partial _{{{{\textbf{F}}}}} \widehat{\psi }:\nabla \dot{{\varvec{\varphi }}}- \partial _{v} \widehat{\psi }{\text {Div}}[{\varvec{J}_{v}}]+\widehat{\phi }\left( \varvec{J}_{v};\nabla {\varvec{\varphi }},v\right) }_{\pi \left( \nabla \dot{{\varvec{\varphi }}},\varvec{J}_{v},{\text {Div}}[{\varvec{J}_{v}}]\right) }\textrm{d}V \nonumber \\{} & {} \quad - P_{\textrm{ext}}\left( \dot{{\varvec{\varphi }}},\varvec{J}_{v}\right) , \end{aligned}$$
(8)

which solely depends on the deformation and the fluid flux. Herein, (7) is accounted for locally to capture the evolution of v and update the corresponding material state. To summarize, the deformation map and the flux field are determined from

$$\begin{aligned} \left\{ \dot{{\varvec{\varphi }}},\varvec{J}_{v}\right\} = \textrm{Arg}\left\{ \underset{\dot{{\varvec{\varphi }}}\in \mathcal {W}_{\dot{{\varvec{\varphi }}}}}{\textrm{inf}}\ \underset{\varvec{J}_{v}\in \mathcal {W}_{\varvec{J}_{v}}}{\textrm{inf}} \Pi \left( \dot{{\varvec{\varphi }}},\varvec{J}_{v}\right) \right\} , \end{aligned}$$
(9)

using the following admissible function spaces.

$$\begin{aligned} \mathcal {W}_{\dot{{\varvec{\varphi }}}}&=\left\{ \dot{{\varvec{\varphi }}}\in H^{1}\!\left( \mathcal{B}\right) \vert \ \dot{{\varvec{\varphi }}}=\dot{\bar{{\varvec{\varphi }}}} \ \text {on} \ \partial \mathcal{B}^{{\varvec{\varphi }}}\right\} \end{aligned}$$
(10)
$$\begin{aligned} \mathcal {W}_{\varvec{J}_{v}}&=\left\{ \varvec{J}_{v}\in H\left( {\text {Div}},\mathcal{B}\right) \vert \ \varvec{J}_{v}\!\cdot {{{\textbf{N}}}}=H_{v} \ \text {on} \ \partial \mathcal{B}^{H_{v}}\right\} \end{aligned}$$
(11)
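For orientation, taking the variation of (8) with respect to its two arguments and integrating by parts yields, as a sketch of the standard argument (the detailed derivation can be found in [11, 13]), the Euler-Lagrange equations

$$\begin{aligned} {\text {Div}}[\partial _{{{{\textbf{F}}}}}\widehat{\psi }]+\varvec{R}_{{\varvec{\varphi }}}=\varvec{0}\quad \text {and}\quad \nabla \mu +\partial _{\varvec{J}_{v}}\widehat{\phi }=\varvec{0}\quad \text {in }\mathcal{B}, \end{aligned}$$

with the chemical potential \(\mu {:}{=}\partial _{v}\widehat{\psi }\), together with the natural boundary conditions \(\partial _{{{{\textbf{F}}}}}\widehat{\psi }\cdot {{{\textbf{N}}}}=\bar{\varvec{T}}\) on \(\partial \mathcal{B}^{\varvec{T}}\) and \(\mu =\bar{\mu }\) on \(\partial \mathcal{B}^{\mu }\), i.e., mechanical equilibrium and a generalized Darcy/Fick-type law.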

2.1 Specific free-energy function and dissipation potential

Following [11, 13], the choice of the free-energy function employed in this study is motivated by the fact that it accurately captures the characteristic nonlinear elastic response of a certain class of hydrogels [48] with moderate water content as well as the swelling-induced volume changes. In principle, it also incorporates softening due to pre-swelling, despite its rather simple functional form. The isotropic, Neo-Hookean type free-energy reads as

$$\begin{aligned} \widehat{\psi }\left( {{{\textbf{F}}}}, v\right)= & {} \frac{\upgamma }{2 J_{0}}\left[ J_{0}^{2/3}I_{1}^{{{{\textbf{C}}}}} -3 -2 \ln \left( JJ_{0}\right) \right] \nonumber \\{} & {} + \frac{\uplambda }{2J_{0}}\left[ JJ_{0}-1- v\right] ^{2}\nonumber \\{} & {} +\frac{\upalpha }{J_{0}}\left[ v \ln \left( \frac{ v}{1+ v}\right) +\frac{\upchi v}{1+ v}\right] , \end{aligned}$$
(12)

in which the first invariant of the right Cauchy-Green tensor is defined as \(I_{1}^{{{{\textbf{C}}}}}={\text {tr}}\left( {{{\textbf{C}}}}\right) ={\text {tr}}\left( {{{\textbf{F}}}}^{\textrm{T}}\!\cdot {{{\textbf{F}}}}\right) \), while the determinant of the deformation gradient \({{{\textbf{F}}}}=\nabla {\varvec{\varphi }}\) is given as \(J={\text {det}}\left( {{{\textbf{F}}}}\right) \). The underlying assumption of the particular form of this energy function is the multiplicative decomposition of the deformation gradient

$$\begin{aligned} {{{\textbf{F}}}}^{\textrm{d}}={{{\textbf{F}}}}\cdot {{{\textbf{F}}}}_{\textrm{0}}=J_{0}^{1/3}{{{\textbf{F}}}}\end{aligned}$$
(13)

which splits the map from the reference configuration (dry hydrogel) to the current configuration into a purely volumetric deformation gradient \({{{\textbf{F}}}}_{\textrm{0}}\), associated with the pre-swelling of the hydrogel, and the deformation gradient \({{{\textbf{F}}}}\), accounting for elastic and diffusion-induced deformations. Clearly, (12) describes the energy relative to the pre-swollen state of the gel. In its derivation it is additionally assumed that the pre-swelling is stress-free and that the energetic states of the dry state and the pre-swollen state are equivalent, which gives rise to the scaling \(J_{\textrm{0}}^{-1}\) of the individual terms of the energy. Although the incompressibility of both the polymer, forming the dry hydrogel, and the fluid is widely accepted, its exact enforcement is beyond the scope of the current study; instead, a penalty formulation is employed here, utilizing a quadratic function that approximately enforces the coupling constraint

$$\begin{aligned} JJ_{0}-1- v=0 \end{aligned}$$
(14)

for a sufficiently high value of the first Lamé constant \(\uplambda \). Thus, in the limit \(\uplambda \rightarrow \infty \), the volume change in the hydrogel is solely due to diffusion and determined by the volume fraction v, which characterizes the amount of fluid present in the gel. On the other hand, relaxing the constraint by choosing a small value of \(\uplambda \) allows for additional elastic volume changes.

Table 1 Material parameters of the coupled hyperelastic model

Energetic contributions due to the change in fluid concentration in the hydrogel are accounted for by the Flory-Rehner type energy, in which the affinity between the fluid and the polymer network is controlled by the parameter \(\upchi \). Finally, demanding that the pre-swollen state is stress-free requires the determination of the initial swelling volume fraction from

$$\begin{aligned} v_{0}= \frac{\upgamma }{\uplambda }\left[ J_{0}^{-1/3}-\frac{1}{J_{0}}\right] +J_{0}-1. \end{aligned}$$
(15)

A convenient choice of the local dissipation potential, in line with [13], is given as

$$\begin{aligned} \widehat{\phi }\left( \varvec{J}_{ v};{{{\textbf{C}}}}, v\right) = \frac{1}{2 \textrm{M} v}{{{\textbf{C}}}}:\left( \varvec{J}_{ v}\otimes \varvec{J}_{ v}\right) , \end{aligned}$$
(16)

which is formulated with respect to the pre-swollen configuration. It is equivalent to an isotropic, linear constitutive relation between the spatial gradient of the chemical potential and the spatial fluid flux, in the current configuration. Again, the state dependence of the dissipation potential through the right Cauchy–Green tensor and the swelling volume fraction is not taken into account in the course of the variation of the total potential. The material parameters employed in the strong and weak scaling studies in Sects. 7.2 and 7.3 are summarized in Table 1. Note that the chosen value of the pre-swollen Jacobian is problem dependent and indicated in the corresponding section.
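For illustration, inserting (16) into the flux stationarity condition \(\nabla \mu =-\partial _{\varvec{J}_{v}}\widehat{\phi }\) sketched above gives

$$\begin{aligned} \partial _{\varvec{J}_{v}}\widehat{\phi }=\frac{1}{\textrm{M}\,v}\,{{{\textbf{C}}}}\cdot \varvec{J}_{v}=-\nabla \mu \quad \Longleftrightarrow \quad \varvec{J}_{v}=-\textrm{M}\,v\,{{{\textbf{C}}}}^{-1}\cdot \nabla \mu , \end{aligned}$$

i.e., a Darcy/Fick-type law in which the mobility scales with the swelling volume fraction v and the deformation enters through \({{{\textbf{C}}}}^{-1}\); this is our reading of the equivalence stated above.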

2.2 Incremental two-field potential

Although the rate-type potential (1) allows for valuable insight into the variational structure of the coupled problem, e.g., a minimization formulation for the model at hand, an incremental framework is typically required for the implementation into a finite element code. This necessitates the integration of the total potential over a finite time step \(\Delta t = t_{n+1}-t_{n}\). Thus, the incremental potential takes the form

$$\begin{aligned}{} & {} \Pi ^{\Delta t}\left( {\varvec{\varphi }},\varvec{J}_{v}\right) \nonumber \\{} & {} \quad =\int \nolimits _{\mathcal{B}} \underbrace{\widehat{\psi }\left( \nabla {\varvec{\varphi }},v_{n}-\Delta t {\text {Div}}[{\varvec{J}_{v}}]\right) \! + \!\Delta t \widehat{\phi }\left( \varvec{J}_{v};\nabla {\varvec{\varphi }}_{n},v_{n}\right) }_{\pi ^{\Delta t}\left( \nabla {\varvec{\varphi }},\varvec{J}_{v},{\text {Div}}[{\varvec{J}_{v}}]\right) }\textrm{d}V \nonumber \\{} & {} \quad - \int \nolimits _{\mathcal{B}} \varvec{R}_{{\varvec{\varphi }}}\cdot \left( {\varvec{\varphi }}-{\varvec{\varphi }}_{n}\right) \textrm{d}V - \int \nolimits _{\partial \mathcal{B}^{\varvec{T}}} \bar{\varvec{T}}\cdot \left( {\varvec{\varphi }}-{\varvec{\varphi }}_{n}\right) \textrm{d}A\nonumber \\{} & {} \quad + \int \nolimits _{\partial \mathcal{B}^{\mu }} \Delta t\, \bar{\mu }\, \varvec{J}_{v}\cdot \varvec{N} \,\textrm{d}A, \end{aligned}$$
(17)

in which an implicit Euler time integration scheme is applied to approximate the global dissipation potential (3) as well as the external load functional (4). Furthermore, the balance of solute volume is also integrated numerically by means of the implicit backward Euler scheme, yielding an update formula for the swelling volume fraction

$$\begin{aligned} v=v_{n}-\Delta t{\text {Div}}[{\varvec{J}_{v}}], \end{aligned}$$
(18)

which is employed to evaluate the stored energy functional (2) at \(t_{n+1}\). Note that quantities given at the time step \(t_{n}\) are indicated by the subscript \({}_{n}\), while the subscript is dropped for all quantities at \(t_{n+1}\) to improve readability. Additionally, it is remarked that the stored energy functional at \(t_{n}\) is excluded from (17), as it only changes the absolute value of the potential and does not appear in the first variation of \(\Pi ^{\Delta t}\), because it depends exclusively on quantities at \(t_{n}\). Finally, the state dependence of the local dissipation potential is only accounted for in an explicit manner, as recommended in [11, 13, 58, 61], in order to ensure consistency with the rate-type potential (1) and thus guarantee the symmetry of the tangent operators in the finite element implementation. An alternative approach based on a predictor-corrector scheme that also ensures the symmetry of the tangent operators and employs a fully implicit time integration scheme has recently been proposed in [59]; it is, however, not pursued in this contribution as the corrector step generates additional computational overhead.
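To make the local update (18) and the explicit treatment of the state dependence concrete, the following is a minimal quadrature-point-level sketch in C++ using deal.II tensor classes; it directly encodes (12), (16) and (18) but is not a verbatim excerpt of our implementation.

```cpp
#include <deal.II/base/tensor.h>
#include <cmath>

using dealii::Tensor;

// Material parameters of (12) and (16); values according to Table 1.
struct Parameters { double gamma, lambda, alpha, chi, M, J0; };

// Free-energy (12), evaluated at the deformation gradient F and swelling volume fraction v.
double psi_hat(const Tensor<2,3> &F, const double v, const Parameters &p)
{
  const Tensor<2,3> C  = transpose(F) * F;
  const double      J  = determinant(F);
  const double      I1 = trace(C);
  return p.gamma / (2.0 * p.J0) * (std::pow(p.J0, 2.0 / 3.0) * I1 - 3.0 - 2.0 * std::log(J * p.J0))
         + p.lambda / (2.0 * p.J0) * std::pow(J * p.J0 - 1.0 - v, 2)
         + p.alpha / p.J0 * (v * std::log(v / (1.0 + v)) + p.chi * v / (1.0 + v));
}

// Dissipation potential (16); the state (F_n, v_n) is treated explicitly.
double phi_hat(const Tensor<1,3> &J_v, const Tensor<2,3> &F_n, const double v_n, const Parameters &p)
{
  const Tensor<2,3> C_n = transpose(F_n) * F_n;
  return 1.0 / (2.0 * p.M * v_n) * (J_v * (C_n * J_v));
}

// Integrand of the incremental potential (17) at a quadrature point:
// the swelling volume fraction is updated locally via (18).
double pi_dt(const Tensor<2,3> &F, const Tensor<1,3> &J_v, const double div_Jv,
             const double dt, const Tensor<2,3> &F_n, const double v_n, const Parameters &p)
{
  const double v = v_n - dt * div_Jv;              // backward Euler update (18)
  return psi_hat(F, v, p) + dt * phi_hat(J_v, F_n, v_n, p);
}
```

In particular, the dissipation potential is evaluated with the frozen state \(({{{\textbf{F}}}}_{n},v_{n})\), which is what renders the resulting tangent operators symmetric.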

For symmetric systems, we can hope for better convergence of the Krylov methods applied to the preconditioned system. On the other hand, we are restricted to small time steps.

3 Fast and Robust Overlapping Schwarz (FROSch) Preconditioner

Domain decomposition solvers [62] are based on the idea of constructing an approximate solution to a problem, defined on a computational domain, from the solutions of parallel problems on small subdomains and, typically, of an additional coarse problem, which introduces the global coupling. Originally, the coarse problem of classical overlapping Schwarz methods was defined on a coarse mesh. In the FROSch software, however, we only consider methods where no explicit coarse mesh needs to be provided.

Domain decomposition methods are typically used as a preconditioner in combination with Krylov subspace methods such as conjugate gradients or GMRES.

In overlapping Schwarz domain decomposition methods [62], the subdomains have some overlap. A large overlap increases the size of the subdomains but typically improves the speed of convergence.

The C++ library FROSch [30, 37], which is part of the Trilinos software library [63], implements versions of the Generalized Dryja-Smith-Widlund (GDSW) preconditioner, which is a two-level overlapping Schwarz domain decomposition preconditioner [62] using an energy-minimizing coarse space introduced in [26, 27]. This coarse space is inspired by iterative substructuring methods such as FETI-DP and BDDC methods [46, 47, 62]. An advantage of GDSW-type preconditioners, compared to iterative substructuring methods and classical two-level Schwarz domain decomposition preconditioners, is that they can be constructed in an algebraic fashion from the fully assembled stiffness matrix.

Fig. 1

Decomposition of a cube into non-overlapping (top) and overlapping subdomains (bottom) on a structured grid with \(\delta =1h\). The GDSW preconditioner uses the overlapping subdomains to define the local solvers and the non-overlapping subdomains to construct the second level, which ensures global transport of information

Therefore, they do not require a coarse triangulation (as in classical two-level Schwarz methods) nor access to local Neumann matrices for the subproblems (as in FETI-DP and BDDC domain decomposition methods).

For simplicity, we will describe the construction of the preconditioner in terms of the computational domain, although the construction is fully algebraic in FROSch, i.e., subdomains arise only implicitly from the algebraic construction: the computational domain \(\Omega \) is decomposed into non-overlapping subdomains \(\lbrace \Omega _i\rbrace _{i = 1 \ldots N}\); see Fig. 1. Extending each subdomain by k layers of elements, we obtain the overlapping subdomains \(\lbrace \Omega _i' \rbrace _{i = 1 \ldots N}\) with an overlap \(\delta = k h\), where h is the size of the finite elements. We denote the size of a non-overlapping subdomain by H. The GDSW preconditioner can be written in the form

$$\begin{aligned} M_\textrm{GDSW}^{-1} ={\Phi K_0^{-1} \Phi ^T}+\sum \nolimits _{i = 1}^N R_i^T K_i^{-1} R_i, \end{aligned}$$
(19)

where \(K_i = R_i K R_i^T, i = 1, \ldots , N,\) represent the local overlapping subdomain problems. The coarse problem is given by the Galerkin product \(K_0 = \Phi ^T K \Phi \). The matrix \(\Phi \) contains the coarse basis functions spanning the coarse space \(V^0\). For the classical two-level overlapping Schwarz method these functions would be nodal finite element functions on a coarse triangulation.

The GDSW coarse basis functions are chosen as energy-minimizing extensions of the interface functions \(\Phi _{\Gamma }\) to the interior of the non-overlapping subdomains. These extensions can be computed from the assembled finite element matrix. The interface functions are typically chosen as restrictions of the nullspace of the global Neumann matrix to the vertices \(\vartheta \), edges \(\xi \), and faces \(\sigma \) of the non-overlapping decomposition, forming a partition of unity. Figure 2 illustrates the interface components for a small 3D decomposition of a cube into eight subdomains. In terms of the interior degrees of freedom (I) and the interface degrees of freedom (\(\Gamma \)) the coarse basis functions can be written as

$$\begin{aligned} \Phi = \begin{bmatrix} \Phi _I \\ \Phi _{\Gamma } \end{bmatrix} = \begin{bmatrix} -K_{II}^{-1}K_{I\Gamma }\Phi _\Gamma \\ \Phi _\Gamma \end{bmatrix}, \end{aligned}$$
(20)

where \(K_{II}\) and \(K_{I\Gamma }\) are submatrices of K. Here \(\Gamma \) corresponds to degrees of freedom on the interface of the non-overlapping subdomains \(\lbrace \Omega _i\rbrace _{i = 1 \ldots N}\) and I corresponds to degrees of freedom in the interior. The algebraic construction of the extensions is based on the partitioning of the system matrix K according to the \(\Gamma \) and I degrees of freedom, i.e.,

$$\begin{aligned} K = \begin{bmatrix} K_{II} &{} K_{I\Gamma } \\ K_{\Gamma I} &{} K_{\Gamma \Gamma } \end{bmatrix}. \end{aligned}$$

Here, \(K_{II} = \hbox {diag}(K_{II}^{(i)})\) is a block-diagonal matrix, where \(K_{II}^{(i)}\) corresponds to the interior degrees of freedom of the i-th non-overlapping subdomain. The computation of its inverse \(K_{II}^{-1}\) can thus be performed independently and in parallel for all subdomains.

Fig. 2

Illustration of the interface components of the non-overlapping decomposition into eight subdomains

By construction, the number of interface components determines the size of the coarse problem. This number is smaller for the more recent RGDSW methods, which use a reduced coarse space [25, 31, 41].

For scalar elliptic problems and under certain regularity conditions the GDSW preconditioner allows for a condition number bound

$$\begin{aligned} \kappa (M_\textrm{GDSW}^{-1} K) \le C \left( 1+ \frac{H}{\delta }\right) \left( 1+ \log \left( \frac{H}{h}\right) \right) , \end{aligned}$$
(21)

where C is a constant independent of the other problem parameters; cf. [26, 27]; also cf. [24] for three-dimensional compressible elasticity. Here, H is the diameter of a subdomain, h the diameter of a finite element, and \(\delta \) the overlap.

For three-dimensional almost incompressible elasticity, using adapted coarse spaces, a bound of the form

$$\begin{aligned} \kappa \le C \left( 1+\frac{H}{\delta }\right) ^3 \left( 1+ \log \left( \frac{H}{h}\right) \right) ^2 \end{aligned}$$

was established for the GDSW coarse space [23] and also for a reduced dimensional coarse space [24].

The more recent reduced dimensional GDSW (RGDSW) coarse spaces [25] are constructed from nodal interface functions, forming a different partition of unity on the interface. The parallel implementation of the RGDSW coarse spaces is also part of the FROSch framework; cf. [32]. The RGDSW basis functions can be computed in different ways. Here, we use the fully algebraic approach (Option 1 in [25]), where the interface values are determined by the multiplicity [25]; see [32] for a visualization. Alternatives can lead to a slightly lower number of iterations and a faster time to solution [32], but these use geometric information [25, 32].

For problems on up to 1000 cores, the GDSW preconditioner with an exact coarse solver is a suitable choice. The RGDSW method is able to scale up to 10,000 cores. For even larger numbers of cores and subdomains, a multi-level extension [35, 38] is available in the FROSch framework. Although it is not covered by theory, the FROSch preconditioner is sometimes able to scale even if certain dimensions of the coarse space are neglected [30, 33]; e.g., for linear elasticity the linearized rotations can sometimes be neglected.

4 Parallel software environment

4.1 Finite element implementation

The implementation of the coupled problem by means of the finite element method is based on the incremental two-field potential (17), in which the arguments of the local incremental potential \(\pi ^{\Delta t}\) and the external load functional are expressed by the corresponding finite element approximations. Introducing the generalized B- and N-matrices in the following manner

$$\begin{aligned} \mathcal {Q}= & {} \begin{bmatrix} \nabla {\varvec{\varphi }}\\ \varvec{J}_{v}\\ {\text {Div}}[{\varvec{J}_{v}}] \end{bmatrix} = \begin{bmatrix} \underline{{{{\textbf{B}}}}}^{{\varvec{\varphi }}} &{} \underline{{{{\textbf{0}}}}}\\ \underline{{{{\textbf{0}}}}} &{} \underline{{{{\textbf{N}}}}}^{\varvec{J}_{v}}\\ \underline{{{{\textbf{0}}}}} &{} \underline{{{{\textbf{B}}}}}^{{\text {Div}}[{\varvec{J}_{v}}]} \end{bmatrix} \begin{bmatrix} \underline{\widetilde{{\varvec{\varphi }}}}\\ \underline{\widetilde{\varvec{J}}}_{v} \end{bmatrix} =\underline{{{{\textbf{B}}}}}\,\underline{{{{\textbf{d}}}}} \end{aligned}$$
(22)
$$\begin{aligned} \mathcal {R}= & {} \begin{bmatrix} {\varvec{\varphi }}\\ \varvec{J}_{v} \end{bmatrix}= \begin{bmatrix} \underline{{{{\textbf{N}}}}}^{{\varvec{\varphi }}}&{}\underline{{{{\textbf{0}}}}}\\ \underline{{{{\textbf{0}}}}}&{}\underline{{{{\textbf{N}}}}}^{\varvec{J}_{v}} \end{bmatrix} \begin{bmatrix} \underline{\widetilde{{\varvec{\varphi }}}}\\ \underline{\widetilde{\varvec{J}}}_{v} \end{bmatrix}=\underline{{{{\textbf{N}}}}}\,\underline{{{{\textbf{d}}}}} \end{aligned}$$
(23)

and denoting the degrees of freedom of the finite elements by \(\widetilde{()}\), gives rise to the rather compact notation of (17)

$$\begin{aligned} \Pi ^{\Delta t,h}\left( \underline{{{{\textbf{d}}}}}\right) =\int \nolimits _{\mathcal{B}} \pi ^{\Delta t}\left( \underline{{{{\textbf{B}}}}}\,\underline{{{{\textbf{d}}}}}\right) \textrm{d}V - P^{\Delta t}_{\textrm{ext}}\left( \underline{{{{\textbf{N}}}}}\,\underline{{{{\textbf{d}}}}}\right) . \end{aligned}$$
(24)

Upon the subdivision of the domain \(\mathcal{B}\) into finite elements and the inclusion of the assembly operator A, the necessary condition to find a stationary value of the incremental potential is expressed as

$$\begin{aligned} \Pi ^{\Delta t, h}_{,\underline{{{{\textbf{d}}}}}}= \underline{{{{\textbf{0}}}}}, \end{aligned}$$
(25)

which represents a system of nonlinear equations

$$\begin{aligned} \underline{{{{\textbf{R}}}}}\left( \underline{{{{\textbf{d}}}}}\right) = \begin{bmatrix} \underline{{{{\textbf{r}}}}}_{{\varvec{\varphi }}}\left( \underline{\widetilde{{\varvec{\varphi }}}},\underline{\widetilde{\varvec{J}}}_{v}\right) \\ \underline{{{{\textbf{r}}}}}_{\varvec{J}_{v}}\left( \underline{\widetilde{{\varvec{\varphi }}}},\underline{\widetilde{\varvec{J}}}_{v}\right) \end{bmatrix} {\mathop {=}\limits ^{!}} \underline{{{{\textbf{0}}}}} \end{aligned}$$
(26)

with

$$\begin{aligned} \underline{{{{\textbf{r}}}}}_{{\varvec{\varphi }}}=\mathop {\textsf{A}}\limits _{e=1}^{n_{\textrm{el}}}\left[ \,\int \nolimits _{\mathcal{B}^{e}}\left[ \underline{{{{\textbf{B}}}}}^{{\varvec{\varphi }}}\right] ^{\textrm{T}}\partial _{{{{\textbf{F}}}}}\widehat{\psi }\,\textrm{d}V-\int \nolimits _{\mathcal{B}^{e}}\left[ \underline{{{{\textbf{N}}}}}^{{\varvec{\varphi }}}\right] ^{\textrm{T}}\varvec{R}_{{\varvec{\varphi }}}\,\textrm{d}V-\int \nolimits _{\partial \mathcal{B}^{e,\varvec{T}}}\left[ \underline{{{{\textbf{N}}}}}^{{\varvec{\varphi }}}\right] ^{\textrm{T}}\bar{\varvec{T}}\,\textrm{d}A\right] \end{aligned}$$
(27)

and

$$\begin{aligned} \underline{{{{\textbf{r}}}}}_{\varvec{J}_{v}}=\mathop {\textsf{A}}\limits _{e=1}^{n_{\textrm{el}}}\left[ \,\int \nolimits _{\mathcal{B}^{e}}\left( \Delta t\left[ \underline{{{{\textbf{N}}}}}^{\varvec{J}_{v}}\right] ^{\textrm{T}}\partial _{\varvec{J}_{v}}\widehat{\phi }-\Delta t\left[ \underline{{{{\textbf{B}}}}}^{{\text {Div}}[{\varvec{J}_{v}}]}\right] ^{\textrm{T}}\partial _{v}\widehat{\psi }\right) \textrm{d}V+\int \nolimits _{\partial \mathcal{B}^{e,\mu }}\Delta t\,\bar{\mu }\left[ \underline{{{{\textbf{N}}}}}^{\varvec{J}_{v}}\right] ^{\textrm{T}}{{{\textbf{N}}}}\,\textrm{d}A\right] . \end{aligned}$$
(28)

Equation (26) is solved efficiently by means of a monolithic Newton–Raphson scheme. The corresponding linearization is inherently symmetric and is computed as

$$\begin{aligned} \underline{{{{\textbf{K}}}}}=\Pi ^{\Delta t, h}_{,\underline{{{{\textbf{d}}}}}\,\underline{{{{\textbf{d}}}}}}= \begin{bmatrix} \underline{{{{\textbf{K}}}}}_{{\varvec{\varphi }}\,{\varvec{\varphi }}}&{}\underline{{{{\textbf{K}}}}}_{{\varvec{\varphi }}\,\varvec{J}_{v}}\\ \underline{{{{\textbf{K}}}}}_{\varvec{J}_{v}\,{\varvec{\varphi }}}&{}\underline{{{{\textbf{K}}}}}_{\varvec{J}_{v}\,\varvec{J}_{v}} \end{bmatrix}, \end{aligned}$$
(29)

in which the individual contributions take the form

$$\begin{aligned} \underline{{{{\textbf{K}}}}}_{{\varvec{\varphi }}\,{\varvec{\varphi }}}=\mathop {\textsf{A}}\limits _{e=1}^{n_{\textrm{el}}}\int \nolimits _{\mathcal{B}^{e}}\left[ \underline{{{{\textbf{B}}}}}^{{\varvec{\varphi }}}\right] ^{\textrm{T}}\partial ^{2}_{{{{\textbf{F}}}}{{{\textbf{F}}}}}\widehat{\psi }\;\underline{{{{\textbf{B}}}}}^{{\varvec{\varphi }}}\,\textrm{d}V \end{aligned}$$
(30)
$$\begin{aligned} \underline{{{{\textbf{K}}}}}_{{\varvec{\varphi }}\,\varvec{J}_{v}}=-\mathop {\textsf{A}}\limits _{e=1}^{n_{\textrm{el}}}\int \nolimits _{\mathcal{B}^{e}}\Delta t\left[ \underline{{{{\textbf{B}}}}}^{{\varvec{\varphi }}}\right] ^{\textrm{T}}\partial ^{2}_{{{{\textbf{F}}}}v}\widehat{\psi }\;\underline{{{{\textbf{B}}}}}^{{\text {Div}}[{\varvec{J}_{v}}]}\,\textrm{d}V \end{aligned}$$
(31)
$$\begin{aligned} \underline{{{{\textbf{K}}}}}_{\varvec{J}_{v}\,{\varvec{\varphi }}}=-\mathop {\textsf{A}}\limits _{e=1}^{n_{\textrm{el}}}\int \nolimits _{\mathcal{B}^{e}}\Delta t\left[ \underline{{{{\textbf{B}}}}}^{{\text {Div}}[{\varvec{J}_{v}}]}\right] ^{\textrm{T}}\partial ^{2}_{v{{{\textbf{F}}}}}\widehat{\psi }\;\underline{{{{\textbf{B}}}}}^{{\varvec{\varphi }}}\,\textrm{d}V \end{aligned}$$
(32)
$$\begin{aligned} \underline{{{{\textbf{K}}}}}_{\varvec{J}_{v}\,\varvec{J}_{v}}=\mathop {\textsf{A}}\limits _{e=1}^{n_{\textrm{el}}}\int \nolimits _{\mathcal{B}^{e}}\left( \Delta t^{2}\left[ \underline{{{{\textbf{B}}}}}^{{\text {Div}}[{\varvec{J}_{v}}]}\right] ^{\textrm{T}}\partial ^{2}_{vv}\widehat{\psi }\;\underline{{{{\textbf{B}}}}}^{{\text {Div}}[{\varvec{J}_{v}}]}+\Delta t\left[ \underline{{{{\textbf{N}}}}}^{\varvec{J}_{v}}\right] ^{\textrm{T}}\partial ^{2}_{\varvec{J}_{v}\varvec{J}_{v}}\widehat{\phi }\;\underline{{{{\textbf{N}}}}}^{\varvec{J}_{v}}\right) \textrm{d}V \end{aligned}$$
(33)

where \(\widehat{\psi }\) is the hyperelastic energy associated with the mechanical problem and \(\widehat{\phi }\) the dissipation potential corresponding to the diffusion problem; see Sect. 2.1. The implementation of the model is carried out using the finite element library deal.II [2] and some already implemented functions for standard tensor operations available from [50]. The finite elements employ tri-linear Lagrange ansatz functions for the deformation, while two different approaches have been chosen to approximate the fluid flux: First, a tri-linear Lagrange ansatz is also used for the flux variable, which is not the standard conforming discretization but has nevertheless been successfully applied in the context of diffusion-induced fracture of hydrogels [12]. Second, the lowest-order, conforming Raviart-Thomas ansatz is selected, ensuring the continuity of the normal trace of the flux field across element boundaries.

In the following, we denote the tri-linear Lagrange ansatz functions by \(Q_1\) and the Raviart-Thomas ansatz functions of lowest order by \(RT_0\). The combinations of deformation and flux elements are then denoted by \(Q_1 Q_1\) and \(Q_1 RT_0\), respectively. Both element combinations are fully integrated numerically by means of Gauss quadrature. They are depicted in Fig. 3.
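In deal.II, the two element combinations can be defined as FESystem objects; the following is a minimal sketch and not necessarily identical to the element setup in our code.

```cpp
#include <deal.II/fe/fe_q.h>
#include <deal.II/fe/fe_raviart_thomas.h>
#include <deal.II/fe/fe_system.h>

using namespace dealii;

// Q1Q1: tri-linear Lagrange ansatz for the deformation (3 components)
// and for the fluid flux (3 components).
FESystem<3> fe_q1q1(FE_Q<3>(1), 3, FE_Q<3>(1), 3);

// Q1RT0: tri-linear Lagrange ansatz for the deformation combined with the
// lowest-order Raviart-Thomas element for the (vector-valued) fluid flux.
FESystem<3> fe_q1rt0(FE_Q<3>(1), 3, FE_RaviartThomas<3>(0), 1);
```

The vector-valued Raviart-Thomas element is added as a single block, whereas the Lagrange flux approximation consists of three scalar \(Q_1\) components.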

Fig. 3

Two different finite element combinations employed in the current study: (i) Lagrange/Lagrange ansatz functions \(Q_{1} Q_{1}\) and (ii) Lagrange/Raviart–Thomas ansatz functions \(Q_{1} RT_{0}\). Note that vectorial degrees of freedom associated with the deformation are indicated by \(\bullet \), while vectorial fluid flux degrees of freedom are illustrated as \(\bigcirc \) and scalar normal traces of the flux field are shown as thick solid lines

4.2 Linearized monolithic system

For completeness we state the linearized monolithic system of equations that has to be solved at each iteration k of the Newton-Raphson scheme as

$$\begin{aligned} \begin{bmatrix}\underline{{{{\textbf{K}}}}}_{{\varvec{\varphi }}\,{\varvec{\varphi }}}&{}\underline{{{{\textbf{K}}}}}_{{\varvec{\varphi }}\,\varvec{J}_{v}}\\ \underline{{{{\textbf{K}}}}}_{\varvec{J}_{v}\,{\varvec{\varphi }}}&{}\underline{{{{\textbf{K}}}}}_{\varvec{J}_{v}\,\varvec{J}_{v}} \end{bmatrix}_{k} \begin{bmatrix}\vartriangle \!{\varvec{\varphi }}\\ \vartriangle \!\varvec{J}_{v} \end{bmatrix} =- \begin{bmatrix}\underline{{{{\textbf{r}}}}}_{{\varvec{\varphi }}}\\ \underline{{{{\textbf{r}}}}}_{\varvec{J}_{v}} \end{bmatrix}_{k}, \end{aligned}$$
(34)

where \(\underline{{{{\textbf{K}}}}}_{\varvec{J}_{v}\,{\varvec{\varphi }}}\!=\underline{{{{\textbf{K}}}}}^{\textrm{T}}_{{\varvec{\varphi }}\,\varvec{J}_{v}}\), to update the degrees of freedom associated with the deformation as well as the flux field according to

$$\begin{aligned} \begin{aligned} {\varvec{\varphi }}_{k+1}&= {\varvec{\varphi }}_{k} + \vartriangle \!{\varvec{\varphi }}\\ \varvec{J}_{v \,k+1}&= \varvec{J}_{v \, k} + \vartriangle \!\varvec{J}_{v}. \end{aligned} \end{aligned}$$
(35)

The convergence criteria employed in this study are outlined in Table 3.
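A compact sketch of the resulting Newton-Raphson loop for one time step is given below; the assembly and solve routines are placeholders for the deal.II/FROSch machinery described in Sects. 4.3 and 4.4, and the criteria are shown for a single combined residual for brevity, whereas Table 3 distinguishes the deformation and flux residuals.

```cpp
#include <deal.II/lac/vector.h>

using namespace dealii;

// Placeholders for the problem-specific assembly (Sect. 4.1) and the
// preconditioned GMRES solve of (34) (Sect. 4.4).
void assemble_system(const Vector<double> &d, Vector<double> &residual);
void solve_linearized_system(const Vector<double> &residual, Vector<double> &increment);

// One time step of the monolithic Newton-Raphson scheme, cf. (34)-(35).
void newton_raphson(Vector<double> &d, const double tol_abs, const double tol_rel,
                    const unsigned int max_iter)
{
  Vector<double> residual(d.size()), increment(d.size());
  assemble_system(d, residual);
  const double r0 = residual.l2_norm();

  for (unsigned int k = 0; k < max_iter; ++k)
    {
      const double rk = residual.l2_norm();
      if (rk < tol_abs || rk / r0 < tol_rel)        // convergence criteria, cf. Table 3
        break;
      solve_linearized_system(residual, increment); // solves K * increment = -residual, Eq. (34)
      d += increment;                               // update (35)
      assemble_system(d, residual);
    }
}
```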

Since our preconditioner is constructed algebraically, it is important to note that in deal.II the ordering of the degrees of freedom for the two-field problem is different for different discretizations.

In the case of a \(Q_{1} Q_{1}\) discretization, a node-wise numbering is used, and the global vector thus has the form

$$\begin{aligned} \underline{{{{\textbf{d}}}}} = [\dots ,\underbrace{\varphi _{1},\varphi _{2},\varphi _{3}, J_{v1}, J_{v2}, J_{v3}}_{\text{ node }\,\, p},\dots ]^{\textrm{T}}. \end{aligned}$$
(36)

In contrast, in the \(Q_{1} RT_{0}\) discretization, all degrees of freedom associated with the deformation are arranged first, followed by the flux degrees of freedom. The global vector thus takes the form

$$\begin{aligned} \underline{{{{\textbf{d}}}}}=[\dots ,\underbrace{\varphi _{1},\varphi _{2},\varphi _{3}}_{\text{ node }\,\, p}\, ,\dots ,\underbrace{H_{v}}_{\text{ face }\,\, q } ,\dots ]^{\textrm{T}}. \end{aligned}$$
(37)
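If a block-wise ordering as in (37) is desired for a given DoFHandler, it can, for instance, be enforced with the renumbering utilities of deal.II; this is a generic sketch and not necessarily the renumbering strategy used in our code.

```cpp
#include <deal.II/dofs/dof_handler.h>
#include <deal.II/dofs/dof_renumbering.h>

using namespace dealii;

// Group the degrees of freedom by vector component, i.e., all deformation
// degrees of freedom first and all flux degrees of freedom afterwards.
void renumber_block_wise(DoFHandler<3> &dof_handler)
{
  DoFRenumbering::component_wise(dof_handler);
  // Alternatively: DoFRenumbering::block_wise(dof_handler);
}
```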

4.2.1 Free-swelling boundary value problem

The boundary value problem of a free-swelling cube is studied in the literature, cf. [13] for 2D and [49] for 3D results, and adopted here as a benchmark problem for the different finite element combinations. Considering a cube with edge length \(2\textrm{L}\), the actual simulation of the coupled problem is carried out employing only one eighth of the domain, as shown in Fig. 4, due to the intrinsic symmetry of the problem. Therefore, symmetry conditions are prescribed along the three symmetry planes, i.e. \(X_{1}=0, X_{2}=0, X_{3}=0\), which correspond to a vanishing normal component of the displacement vector and the fluid flux. At the outer surface the mechanical boundary conditions are assumed as homogeneous Neumann conditions, i.e. \(\bar{\varvec{T}}=\varvec{0}\), while two different boundary conditions are used for the diffusion problem, namely

  (i) Dirichlet conditions, i.e., the normal component of the fluid flux \(H_{v}\) is prescribed, or

  (ii) Neumann conditions, i.e., the chemical potential \(\bar{\mu }\) is specified, as shown in Fig. 5 and Table 2.

Note that, due to the coupling of mechanical and diffusion fields, the boundary conditions (i) and (ii) result in different temporal evolution of the two fields. However, in both cases a homogeneous, stress-free state is reached under steady state conditions.

Fig. 4

One eighth of the cube domain, highlighted in dark gray, with edge length \(\textrm{L}=1\,\hbox {mm}\) employed in the parallel scalability study

Type (i) boundary conditions are used for the strong scalability study outlined in Sect. 7.2.2, while type (ii) boundary conditions are employed in the weak parallel scalability study described in Sect. 7.3.

Fig. 5

Time-dependent boundary conditions for the free-swelling problem: (i) flux control and (ii) control through chemical potential

Table 2 Problem specific parameters associated with the boundary conditions in the free-swelling problem illustrated in Fig. 5. The common parameters are \(t_{1}=0.25\,\hbox {s}\) and \(t_{4}=4\,\hbox {s}\)

4.2.2 Mechanically induced diffusion boundary value problem

Similar to the free-swelling problem, the mechanically induced diffusion problem is also solved on a unit cube domain with appropriate symmetry conditions applied along the planes \(X_{1}=0, X_{2}=0, X_{3}=0\), as shown in Fig. 6. Along the subset \(\left( X_{1},X_{3}\right) \in \left[ -\frac{\textrm{L}}{3},\frac{\textrm{L}}{3}\right] \times \left[ -\frac{\textrm{L}}{3},\frac{\textrm{L}}{3}\right] \) of the plane \(X_{2}=\textrm{L}\), the coefficients of the displacement vector are prescribed as \(u_{i}=[0,-\hat{u},0]\), mimicking the indentation of the body with a rigid flat punch under very high friction conditions. The non-vanishing displacement coefficient is increased incrementally and subsequently held constant, similar to the function for the chemical potential illustrated in Fig. 5. The corresponding parameters read as \(t_{1}=1\,\hbox {s}\), \(t_{4}=6\,\hbox {s}\) and \(\hat{u}=0.4\,\hbox {mm}\). Additionally, the normal component of the fluid flux is set to zero on the complete outer surface of the cube, together with traction-free conditions on the remaining part of the outer boundary.
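Assuming a linear ramp as suggested by Fig. 5, the prescribed displacement coefficient can be represented by a simple ramp-and-hold function of time; this is only an illustration of the loading program, not an excerpt of the code.

```cpp
// Ramp-and-hold loading: linear increase until t1 = 1 s, then constant until
// the end of the simulation at t4 = 6 s. The applied vertical displacement
// component is -prescribed_displacement(t), cf. u_i = [0, -u_hat, 0].
double prescribed_displacement(const double t)
{
  const double t1    = 1.0; // end of the ramp in s
  const double u_hat = 0.4; // final displacement magnitude in mm
  return (t < t1) ? u_hat * t / t1 : u_hat;
}
```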

Fig. 6

One eighth of the cube domain, highlighted in dark gray, with edge length \(\textrm{L}=1\,\hbox {mm}\) employed in the parallel scalability study. The shaded dark gray surface indicates the area with prescribed vertical displacement, mimicking the indentation of the body with a rigid flat punch

4.3 Distributed memory parallelization using deal.II, p4est, and Trilinos

In early versions of our software, the assembly of the system matrix and the Newton steps were performed on a single core, and after distribution of the system matrix to all cores, it was solved by FROSch in fully algebraic mode [44].

In this work, the simulation is fully MPI-parallel. We assemble the system matrix in parallel using the deal.II classes from the parallel::distributed  namespace. In deal.II, parallel meshes for distributed memory machines are handled by a parallel::distributed::Triangulation  object, which calls the external library p4est [16] to determine the parallel layout of the mesh data.

As a result, each process owns a portion of the cells of the global mesh (called locally owned cells in deal.II). Each process stores one additional layer of cells surrounding the locally owned cells, which are denoted as ghost cells. Using the ghost cells, two MPI ranks corresponding to neighboring non-overlapping subdomains can both access the (global) degrees of freedom on the interface.

Each local stiffness matrix is assembled by the process which owns the associated cell (i.e., the finite element); thus the processes work independently and concurrently. The handling of the parallel data distribution (cells and degrees of freedom) is performed by a DoFHandler object. A more detailed description of the MPI parallelization in deal.II can be found in [8].

For the parallel linear algebra, deal.II interfaces to either PETSc [6] or Trilinos [63]. In this work, we make use of the classes in the dealii::LinearAlgebraTrilinos::MPI namespace, such that we obtain Trilinos Epetra vectors and matrices, which can be processed by FROSch. Similarly to the DoFHandler, the Trilinos Map object handles the data distribution of the parallel linear algebra objects.
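A condensed sketch of this parallel setup (p4est-backed mesh, degrees of freedom, and Trilinos linear algebra) is given below; it follows the standard deal.II tutorial structure rather than reproducing our code.

```cpp
#include <deal.II/base/index_set.h>
#include <deal.II/base/mpi.h>
#include <deal.II/distributed/tria.h>
#include <deal.II/dofs/dof_handler.h>
#include <deal.II/dofs/dof_tools.h>
#include <deal.II/fe/fe_q.h>
#include <deal.II/fe/fe_system.h>
#include <deal.II/grid/grid_generator.h>
#include <deal.II/lac/trilinos_vector.h>

using namespace dealii;

int main(int argc, char **argv)
{
  Utilities::MPI::MPI_InitFinalize mpi_init(argc, argv, 1);

  // p4est-backed distributed mesh of the (eighth of the) cube.
  parallel::distributed::Triangulation<3> tria(MPI_COMM_WORLD);
  GridGenerator::hyper_cube(tria, 0.0, 1.0);
  tria.refine_global(4);

  // Q1Q1 element pair as an example; cf. Sect. 4.1.
  FESystem<3>   fe(FE_Q<3>(1), 3, FE_Q<3>(1), 3);
  DoFHandler<3> dof_handler(tria);
  dof_handler.distribute_dofs(fe);

  // Locally owned and ghosted index sets define the parallel data layout.
  const IndexSet &locally_owned = dof_handler.locally_owned_dofs();
  IndexSet        locally_relevant;
  DoFTools::extract_locally_relevant_dofs(dof_handler, locally_relevant);

  // Epetra-based Trilinos vectors that can be handed to FROSch.
  TrilinosWrappers::MPI::Vector owned_solution(locally_owned, MPI_COMM_WORLD);
  TrilinosWrappers::MPI::Vector ghosted_solution(locally_owned, locally_relevant, MPI_COMM_WORLD);

  return 0;
}
```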

To construct the coarse level, FROSch needs information on the interface between the subdomains. The FROSch framework uses a repeatedly decomposed Map to identify the interface components. In this Map the degrees of freedom on the interface are shared among the relevant processes. This Map can be provided as an input by the user. However, FROSch also provides a fully algebraic construction of the repeated Map [37], which is what we use here.

4.4 Solver settings

We use the deal.II software library (version 9.2.0) [2, 3] to implement the model in variational form and to perform the finite element assembly in parallel. The parallel decomposition of the computational domain is performed in deal.II by using the p4est software library [16]. We remark that, using p4est, small changes in the number of finite elements and the number of subdomains may result in decompositions with very different subdomain shapes; see Fig. 7. A bad subdomain shape will typically degrade the convergence of the domain decomposition solver. We always choose an overlap of two elements. However, since the overlap is constructed algebraically, in some positions there can be deviations from a geometric overlap of \(\delta =2h\).

Fig. 7

Decomposition of the computational domain with 4096 finite elements into 8 (top), 27 (middle) and 64 (bottom) non-overlapping subdomains

In the Newton–Raphson scheme, we use absolute and relative tolerances for the deformation \({\varvec{\varphi }}\) and the fluid flux \(\varvec{J}_{v}\) according to Table 3, where \(\Vert r_k \Vert \) is the residual at the k-th Newton step and \(\Vert r_0 \Vert \) the initial residual.

Table 3 Tolerances for the Newton–Raphson scheme. Here, \(r_k\) is the k-th residual

FROSch is part of Trilinos [63] and makes heavy use of the parallel Trilinos infrastructure. We use the master branch of Trilinos from October 2021 [63].

On the first level of overlapping subdomains we always apply the restrictive additive Schwarz method. A one-to-one correspondence between subdomains and cores is employed.

The linearized systems are solved using the parallel GMRES implementation provided by the Trilinos package Belos using the relative stopping criterion of \(\Vert r_k \Vert /\Vert r_0 \Vert \le 10^{-8}\). We use a vector of all zeros as the initial vector for the iterations.

The arising subproblems in the FROSch framework are solved by Trilinos’ built-in KLU sparse direct linear solver.
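For orientation, the solver settings described above can be collected in a Teuchos::ParameterList; the Belos parameter names below are standard, whereas the FROSch key is an assumption and needs to be checked against the FROSch documentation of the Trilinos version used.

```cpp
#include <Teuchos_ParameterList.hpp>
#include <Teuchos_RCP.hpp>

// Illustrative collection of the solver settings of this section.
Teuchos::RCP<Teuchos::ParameterList> make_solver_parameter_list()
{
  auto params = Teuchos::rcp(new Teuchos::ParameterList("MonolithicSolver"));

  // Belos GMRES settings; relative stopping criterion 1e-8 as stated above.
  Teuchos::ParameterList &belos = params->sublist("Belos");
  belos.set("Convergence Tolerance", 1.0e-8);
  belos.set("Maximum Iterations", 1000); // assumed upper bound, not given in the text

  // FROSch settings: two layers of elements of algebraic overlap on the first
  // level. Further options (coarse space type, KLU as subdomain/coarse solver)
  // are configured through additional FROSch sublists not reproduced here.
  Teuchos::ParameterList &frosch = params->sublist("FROSch");
  frosch.set("Overlap", 2); // assumed parameter key

  return params;
}
```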

All parallel experiments are performed on the Compute Cluster of the Fakultät für Mathematik und Informatik at Technische Universität Freiberg. A cluster node has two Intel Xeon Gold 6248 processors (20 cores, 2.50 GHz).

5 Limitations

This study has some limitations which are described in more detail in Appendix A:

  • The coupling constraint may, depending on the discretization, lead to stability problems. However, in our parallel experiments the penalty parameter for the coupling constraint is rather low, and no stability problems are observed experimentally.

  • A penalty formulation can lead to ill-conditioning of the stiffness matrix and slow convergence of the iterative solver. As the penalty parameter is mild, the convergence of the solver is acceptable in our parallel experiments.

  • In some of our experiments standard \(H^1\)-conforming finite elements are used for the fluid. This may give bad results if the solution cannot be approximated well in \(H^1\).

  • We make use of the sparse direct solver KLU, which belongs to the Trilinos library, for the solution of the local problems and the coarse problem of the preconditioner. Other sparse direct solvers should also be tested in the future.

6 Stability of the finite element formulations

We briefly discuss the stability of the finite element discretizations, namely the \(Q_{1} Q_{1}\) and \(Q_{1} RT_{0}\) ansatz; see also Appendix A.3. Here, the mechanically induced diffusion problem, described in Sect. 4.2.2, is solved for the complete loading history, employing a discretization with \(24^{3}\) finite elements, resulting in 93,750 degrees of freedom for the \(Q_{1} Q_{1}\) and 90,075 degrees of freedom for the \(Q_{1} RT_{0}\) ansatz functions. Note that this discretization corresponds to only one uniform refinement step less than the discretizations considered in Sects. 7.2 and 7.3. The material parameters in this study are taken as \(\uplambda =10\,\hbox {N}/\hbox {mm}^{2}\) and \(J_{0}=4.5\), while all the remaining parameters are chosen according to Table 1. The penalty parameter \(\uplambda \) is thus larger by a factor of 50 compared to Table 1.

Fig. 8

Evolution of the spatial distribution of (i) the chemical potential \(\mu \), (ii) the swelling volume fraction v and (iii) the Jacobian J obtained with the \(Q_{1} RT_{0}\) discretization at \(t=1\,\hbox {s}\), \(t=2\,\hbox {s}\) and \(t=6\,\hbox {s}\) (from left to right)

Inspecting the spatial distributions of the chemical potential, the swelling volume fraction and the Jacobian in Fig. 8 obtained with the \(Q_{1} RT_{0}\) ansatz function, it becomes apparent that the mechanical deformation leads to a significant redistribution of the fluid inside the body. In particular, it can be seen that after the initial stage of the loading history (\(0\le t \le 1\,\hbox {s}\)), the chemical potential just below the flat punch has increased considerably due to the rather strict enforcement of the penalty constraint by the choice of material parameters \(\frac{\uplambda }{\upgamma }=100\). Given the definition of the chemical potential, which specializes to

$$\begin{aligned}{} & {} \mu {:}{=} \partial _{v} \widehat{\psi } = -\frac{\uplambda }{J_{0}}\left[ JJ_{0}-1-v\right] \nonumber \\{} & {} +\frac{\alpha }{J_{0}}\left[ \ln \left( \frac{v}{1+v}\right) +\frac{1}{1+v}+\frac{\chi }{\left[ 1+v\right] ^{2}}\right] \end{aligned}$$
(38)

for the free-energy function given in (12), the contribution associated with the constraint can readily be identified as the first term on the right hand side of (38).

During the subsequent holding stage of the loading history, in which the displacement coefficient \(\hat{u}\) is held constant, a relaxation of the body can be observed, which results in a balanced chemical potential field along with a strong reduction of the swelling volume fraction below the flat punch. The spatial distribution of the Jacobian depicted in Fig. 8 (iii) is closely tied to the distribution of the swelling volume fraction.

In principle, similar observations are made in the simulation of the mechanically induced diffusion problem when it employs the \(Q_{1} Q_{1}\) ansatz functions. However, significant differences occur during the holding stage of the loading history, in which deformations are due to the diffusion of the fluid. In particular, a checkerboard pattern develops below the flat punch, which is clearly visible in all three fields depicted in Fig. 9 at the end of the simulation at \(t=6\,\hbox {s}\). This is a result of the \(Q_{1}\) discretization; see the brief discussion in Appendix A.3.

For the \(Q_{1}\) ansatz for the fluid flux \(\varvec{J}_{v}\), the swelling volume fraction is not constant within a finite element. Due to the rather strict enforcement of the incompressibility constraint, this heterogeneity is further amplified. Of course, a selective reduced integration technique is able to cure this problem, as shown in [13] for the two-dimensional case: therein, (18) is solved at a single integration point per element and the current value of v is subsequently transferred to the remaining Gauss points, as sketched below. Other choices, such as a three-field formulation, are also possible. The use of the \(RT_{0}\) ansatz functions, however, may be more appropriate as they yield both a conforming discretization and a lower number of degrees of freedom per element compared to the standard Lagrange approximation.
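A minimal sketch of such a selective reduced integration of the update (18), with a hypothetical element-level data container, is given below; it is not part of the implementation used for the results in this paper.

```cpp
#include <vector>

// Minimal container for the element-level data needed in this sketch
// (hypothetical; the actual code stores history data per quadrature point).
struct ElementData
{
  double              v_n_center;    // swelling volume fraction at t_n, element center
  double              div_Jv_center; // Div[J_v] evaluated at the element center
  std::vector<double> v;             // current v at the quadrature points
};

// Selective reduced integration of the update (18): evaluate the update once
// at the element center and broadcast the value to all Gauss points.
void update_swelling_volume_fraction(ElementData &cell, const double dt)
{
  const double v_new = cell.v_n_center - dt * cell.div_Jv_center; // Eq. (18)
  for (double &v_q : cell.v)
    v_q = v_new;
}
```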

Note, however, that in the subsequent parallel simulations, the penalty parameter is smaller by more than an order of magnitude such that the problems described in this section were not observed.

Fig. 9

Evolution of the spatial distribution of (i) the chemical potential \(\mu \), (ii) the swelling volume fraction v and (iii) the Jacobian J obtained with the \(Q_{1} Q_{1}\) discretization at \(t=1\,\hbox {s}\), \(t=2\,\hbox {s}\) and \(t=6\,\hbox {s}\) (from left to right)

7 Numerical results

7.1 Performance of the iterative solver

To evaluate the numerical and parallel performance of the FROSch framework applied to the monolithic system in fully algebraic mode we consider the boundary value problems described in Sects. 4.2.1 and 4.2.2. We refer to the one in Sect. 4.2.1 as the free-swelling problem and denote the problem specified in Sect. 4.2.2 as the mechanically induced problem.

We compare the use of \(Q_1 RT_0\) and \(Q_1 Q_1\) ansatz functions regarding the consequences for the numerical and parallel performance of the simulation.

The different ansatz functions result in different numbers of degrees of freedom per node. For the \(Q_1 Q_1\) ansatz, each node has six degrees of freedom. The usage of \(Q_{1} RT_0\) elements leads to three degrees of freedom per node and one per element face. If not noted otherwise, the construction of the coarse spaces uses the nullspace of the Laplace operator. The computing times are always sums over all time steps and Newton steps. We denote the time to assemble the tangent matrix in each Newton step by Assemble Matrix Time.

By Solver Time we denote the time to build the preconditioner (Setup Time) and to perform the Krylov iterations (Krylov Time).

For the triangulation, we executed four refinement cycles on an initial mesh with 27 finite elements, resulting in a structured mesh of 110,592 finite elements.

Table 4 Strong scaling for the linear elasticity model problem in 3 dimensions using \(Q_1\) elements. Dirichlet boundary conditions on the complete boundary. We operate on a structured mesh with 110,592 finite elements such that we have 352,947 degrees of freedom. We use an overlap of two elements. We use the standard GDSW coarse space, without rotations
Table 5 Detailed cost of the overlapping subdomain problems \(K_i\) contained in the Setup Time in Table 4

7.2 Strong parallel scalability

For the strong scalability, we consider our problem on the structured mesh of 110,592 cells, which results in 691,635 degrees of freedom for the \(Q_1 RT_0\) elements and 705,894 degrees of freedom for the \(Q_1 Q_1\) discretization. We then increase the number of subdomains and cores and expect the computing time to decrease.
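These numbers can be verified directly from the structured mesh: \(48^{3}=110{,}592\) tri-linear elements have \(49^{3}=117{,}649\) nodes and \(3\cdot 49\cdot 48^{2}=338{,}688\) faces, so that

$$\begin{aligned} Q_{1} Q_{1}:\ 6\cdot 117{,}649=705{,}894, \qquad Q_{1} RT_{0}:\ 3\cdot 117{,}649+338{,}688=691{,}635 \end{aligned}$$

degrees of freedom result, while the purely elastic \(Q_1\) problem of Sect. 7.2.1 has \(3\cdot 117{,}649=352{,}947\) degrees of freedom.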

7.2.1 Linear elasticity benchmark problem

To provide a baseline to compare with, we first briefly present strong scaling results for a linear elastic benchmark problem on the unit cube \((0,1)^3\), using homogeneous Dirichlet boundary conditions on the complete boundary and discretized using \(Q_1\) elements; see Table 4. Here, the 110,592 finite elements result in only 352,947 degrees of freedom since the diffusion problem is missing. We use a generic right-hand-side vector of ones \((1,\ldots ,1)^T\).

We will use this simple problem as a baseline to evaluate the performance of our solver for our nonlinear coupled problems. Note that due to the homogeneous boundary conditions on the complete boundary, this problem is quite well conditioned, and a low number of Krylov iterations should be expected.

In Table 4, we see that, using the GDSW coarse space for elasticity (with the three displacements but without rotations), we observe numerical scalability, i.e., the number of Krylov iterations does not increase and stays below 30. Note that this coarse space is algebraic; however, it exploits knowledge of the numbering of the degrees of freedom. In contrast to FETI-DP and BDDC methods, the GDSW theory does not guarantee numerical scalability for this coarse space missing the three rotations; however, numerical scalability has been observed experimentally for certain simple linear elasticity problems [30, 33].

In Table 4, the strong scalability is good when scaling from 64 (28.23 s) to 216 cores (7.43 s). The Solver Time increases when scaling from 216 to 512 cores (41.66 s), indicating that the coarse problem of size 26,109 is too large to be solved efficiently by Amesos2 KLU. The sequential coarse problem starts to dominate the solver time.

Note that, as we increase the number of cores and subdomains, the subdomain sizes decrease. In our strong parallel scalability experiments, we thus profit from the superlinear complexity of the sparse direct subdomain solvers.

We also provide results for the fully algebraic mode. Here, the number of Krylov iterations is slightly higher and increases slowly as the number of cores increases. This is not surprising since, in fully algebraic mode, we make use of the space spanned by the constant vector \((1,1,\ldots ,1)\) in the construction of the second level of the preconditioner, which is only suitable for Laplace problems.

Table 6 Strong scalability results for the free-swelling problem corresponding to Figs. 10 and 11. We operate on a triangulation with 110,592 finite elements resulting in 691,635 degrees of freedom for the \(Q_1 RT_0\) ansatz functions and 705,894 degrees of freedom for the \(Q_1 Q_1\) ansatz functions. We choose the boundary conditions (i) described in Sect. 4.2.1. We apply the FROSch framework in fully algebraic mode with the GDSW and the RGDSW coarse space. We perform two time steps with \(\Delta t=0.05\,\hbox {s}\). By Avg. Krylov we denote the average number of Krylov iterations over all time and Newton steps. The time measurements are taken over the whole computation
Table 7 Detailed cost of the overlapping subdomain problems \(K_i\) contained in the Setup Time in Table 6

However, the Solver Time is comparable for both coarse spaces for 64 (28.23 s vs. 27.35 s), 125 (16.25 s vs. 13.35 s), and 216 (7.43 s vs. 5.69 s) cores. Notably, the Solver Time is better for the fully algebraic mode for 512 cores (41.66 s vs 6.59 s) as a result of the smaller coarse space in the fully algebraic mode (8594 vs. 26,109).

In Table 5, more details of the computational cost are presented. These timings show that for 64, 125, and 216 cores the cost is dominated by the factorizations of the subdomain matrices \(K_i\). Only for 512 cores this is not the case any more.

Interestingly, the fully algebraic mode is thus preferable within the range of processor cores discussed here, although numerical scalability is not achieved.

7.2.2 Free swelling problem

We now discuss the strong scalability results for the free-swelling problem; see Sect. 4.2.1. Here, the pre-swollen Jacobian \(J_0\) is chosen as \(J_0=1.01\). The other material parameters are chosen according to Tables 1 and 2.

For the parallel performance study, we perform two time steps for each test run. In each time step, again, 5 Newton iterations are needed for convergence.

For a numerically scalable preconditioner, we would expect the number of Krylov iterations to be bounded. In Table 6, we observe that we do not obtain good numerical scalability, i.e., the number of iterations increases by 50 percent when scaling from 64 to 512 cores. This can be attributed to the fully algebraic mode, whose coarse space is not quite suitable to obtain numerical scalability; see also Sect. 7.2.1. Interestingly, the results are very similar for GDSW and RGDSW, with the exception of 512 cores, where the smaller coarse space of the RGDSW method results in a slightly better Solver Time. This is interesting, since the RGDSW coarse space is typically significantly smaller. It indicates that the RGDSW coarse space should be preferred in our future work.

Table 8 Results for the mechanically induced problem on 216 cores using different time step sizes \(\Delta t\) to reach \(t = 0.1\,\hbox {s}\). We apply FROSch with the GDSW coarse space

The number of iterations is smaller for the \(Q_1RT_0\) discretization compared to the \(Q_1Q_1\) discretization. Since, in addition, the local subdomain problems are significantly larger when using \(Q_1Q_1\) (see Table 7), the Solver Times are better by (approximately) a factor of two when using the \(Q_1RT_0\) discretization.

Strong parallel scalability is good when scaling from 64 to 216 cores. Only incremental improvements are obtained for 512 cores indicating that the problem is too small.

If we relate these results to our linear elasticity benchmark problem in Sect. 7.2.1, we see that the (average) number of Krylov iterations is higher by a factor of 1.5 to 2 for the coupled problem compared to the linear elastic benchmark. We believe that this is an acceptable result.

If we compare the Solver Time, we need to multiply the Solver Time in Table 4 by a factor of 10, since 10 linearized systems are solved in the nonlinear coupled problem. Here, we see that the Solver Time is higher by a factor slightly more than 3 when using \(Q_1RT_0\) compared to solving 10 times the linear elastic benchmark problem of Sect. 7.2.1. For \(Q_1Q_1\), this factor is closer to 6 or 7. Interestingly, in both cases, this is mostly a result of larger factorization times for the local subdomain matrices (see Table 7) and only to a small extent a result of the larger number of Krylov iterations.

Table 9 Strong scalability results for the mechanically induced diffusion problem corresponding to Figs. 13 and 14. We operate on a triangulation with 110,592 finite elements resulting in 691,635 degrees of freedom for the \(Q_1 RT_0\) ansatz functions and 705,894 degrees of freedom for the \(Q_1 Q_1\) ansatz functions. We apply the FROSch framework in fully algebraic mode with the GDSW and the RGDSW coarse space. We perform two time steps with \(\Delta t=0.05\,\hbox {s}\). By Avg. Krylov we denote the average number of Krylov iterations computed over all time and Newton steps. The time measurements are taken over the complete computation
Table 10 Detailed cost of the overlapping subdomain problems \(K_i\) contained in the Setup Time in Table 9

7.2.3 Mechanically induced diffusion problem

For the mechanically induced problem, we chose a value of \(J_0=4.5\) for the pre-swollen Jacobian \(J_0\). The other problem parameters are chosen according to Tables 1 and 2.

Effect of the time step size Let us note that in our simulations the time step size \(\Delta t\) has only a small influence on the convergence of the preconditioned GMRES method. Using different choices of the time step \(\Delta t\), in Table 8 we show the number of Newton and GMRES iterations. The model problem is always solved on 216 cores until the time \(t=0.1\,\hbox {s}\) is reached. The number of Newton iterations for each time step differs slightly; see Table 8. The small effect of the choice of the time step size on the Krylov iterations is explained by the lack of a mass matrix in the structure part of our model. However, the diffusion part of the model does contain a mass matrix. Moreover, time stepping is needed as a globalization technique for the Newton method, i.e., large time steps will result in a failure of Newton convergence. A different formulation including a mass matrix for the mechanical part of the model should be considered in the future as a possibility to improve solver convergence.

Strong scalability for the mechanically induced diffusion We, again, perform two time steps for each test run. In each time step, 5 Newton iterations are needed for convergence. In Table 9, we present results using 64 to 512 processor cores.

First, we observe that the average number of Krylov iterations is significantly higher compared to Sect. 7.2.2, indicating that this problem is significantly harder as a result of the boundary conditions.

Next, we observe that the average number of Krylov iterations is similar for the \(Q_1 RT_0\) and the \(Q_1 Q_1\) case, in contrast to Sect. 7.2.2.

For both discretizations, the average number of Krylov iterations increases by about 50 percent for a larger number of cores. This holds for the GDSW as well as for the RGDSW coarse space.

We also see that the Solver Time for \(Q_1 RT_0\) is significantly better in all cases. Indeed, both the time for the Krylov iteration (Krylov Time) and the time for the setup of the preconditioner (Setup Time) are larger for \(Q_1Q_1\), and the Setup Times are often drastically higher. As illustrated in Table 10, this is a result of larger factorization times for the sparse matrices arising from \(Q_1Q_1\) discretizations.

To explain this effect, the sparsity patterns for \(Q_1 RT_0\) and \(Q_1Q_1\) are displayed in Fig. 12. Although the difference is visually not pronounced, the number of nonzeros almost doubles for \(Q_1Q_1\). Specifically, for our small example with 27 finite elements, the tangent matrix has size 300 with 8688 nonzero entries for \(Q_1RT_0\), compared to size 384 with 16,480 nonzero entries for \(Q_1 Q_1\). Therefore, it is not surprising that the factorizations are computationally more expensive for \(Q_1Q_1\).
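To quantify this from the numbers above, the average number of nonzeros per row is \(8688/300 \approx 29\) for \(Q_1RT_0\) versus \(16{,}480/384 \approx 43\) for \(Q_1Q_1\); that is, the \(Q_1Q_1\) tangent is not only larger but also roughly 1.5 times denser per row, which tends to increase the fill-in and thus the cost of the sparse factorizations.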

In Table 9, good strong parallel scalability is generally observed when scaling from 64 to 216 cores. Again, only incremental improvements are visible when using 512 cores. This is, again, an indication that the problem is too small for 512 cores. A three-level approach could also help here.

If we relate the results to the linear elastic benchmark problem in Sect. 7.2.1, we see that the number of Krylov iterations is now larger by a factor of 4 to 6, which is significant. This is not reflected in the Solver Time, since it is, again, dominated by the local factorization. However, for a local solver significantly faster than KLU, the large number of Krylov iterations could be very relevant for the time-to-solution.

Next, we therefore investigate if we can reduce the number of iterations by improving the preconditioner.

Fig. 10 Strong scalability of the Solver Time (Setup Time + Krylov Time) for the free-swelling problem; see Table 6 for the data

Fig. 11 Detailed timers for the free-swelling problem; see Table 6 for the data

Making more use of the problem structure in the preconditioner We have observed in Table 9 that the number of Krylov iterations increases by roughly 50 percent when scaling from 64 to 512 processor cores. We explain this by the use of the fully algebraic mode of FROSch, which applies the null space of the Laplace operator to construct the second level.

To improve the preconditioner, we use, for the \(Q_1Q_1\) discretization, the three translations (in x, y, and z direction) for the construction of the coarse problem for both the structure and the diffusion problem: we use the six basis vectors \((1,0,0,0,0,0),\,(0,1,0,0,0,0),\ldots ,(0,0,0,0,0,1)\) for the construction of the coarse space. Here, the first three components refer to the structure problem and the last three components to the diffusion problem. Note that the rotations are missing from the coarse space since they would require access to the point coordinates.
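The following C++ sketch (not the authors' code) illustrates how such a six-column coarse basis could be assembled as a Tpetra MultiVector. It assumes the node-wise numbering of the \(Q_1Q_1\) discretization with six degrees of freedom per node (three structure components followed by three diffusion components); the handover of the basis to FROSch is omitted, since the exact interface depends on the Trilinos/FROSch version.

```cpp
// Sketch: build the six translational coarse-basis vectors for the Q1Q1
// discretization as a Tpetra MultiVector (assumed node-wise numbering).
#include <Teuchos_RCP.hpp>
#include <Tpetra_Core.hpp>
#include <Tpetra_Map.hpp>
#include <Tpetra_MultiVector.hpp>

int main(int argc, char *argv[])
{
  Tpetra::ScopeGuard tpetraScope(&argc, &argv);
  {
    using map_t = Tpetra::Map<>;
    using mv_t  = Tpetra::MultiVector<double>;

    auto comm = Tpetra::getDefaultComm();

    // In the real code this is the row map of the assembled monolithic
    // tangent; here, a contiguous map of the matching global size (Table 9).
    const Tpetra::global_size_t numGlobalDofs = 705894;
    auto rowMap = Teuchos::rcp(new map_t(numGlobalDofs, 0, comm));

    const int dofsPerNode = 6;                  // assumption: node-wise numbering
    auto nullSpace = Teuchos::rcp(new mv_t(rowMap, dofsPerNode));
    nullSpace->putScalar(0.0);

    const size_t numLocal = rowMap->getLocalNumElements();
    for (size_t lcl = 0; lcl < numLocal; ++lcl) {
      const auto lclRow = static_cast<map_t::local_ordinal_type>(lcl);
      const auto gid    = rowMap->getGlobalElement(lclRow);
      // Column 'component' carries the translation of that component:
      // columns 0-2 span the structure translations, columns 3-5 the
      // constant modes of the diffusion unknowns.
      const int component = static_cast<int>(gid % dofsPerNode);
      nullSpace->replaceLocalValue(lclRow, component, 1.0);
    }

    // 'nullSpace' would then replace the default Laplace null space when
    // constructing the GDSW/RGDSW coarse problem (interface not shown).
  }
  return 0;
}
```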

Table 11 A lower number of Krylov iterations and better numerical scalability are obtained when a better coarse space is used, spanned by the three translations in x, y, and z direction for both the structure and the diffusion problem

In Table 11, we see that, by using this enhanced coarse space, we can reduce the number of Krylov iterations and, more importantly, avoid the increase in the number of iterations with growing core counts. This means that, experimentally, we observe numerical scalability in Table 11 within the range of processor cores considered here.

Fig. 12 Sparsity pattern of the tangent matrix for the mechanically induced problem using the \(Q_1 RT_0\) ansatz functions (top) and the \(Q_1 Q_1\) ansatz functions (bottom). Here, 27 finite elements are employed, resulting in 300 degrees of freedom for \(Q_1 RT_0\) and 384 degrees of freedom for \(Q_1 Q_1\)

Note that the larger coarse space resulting from this approach is not amortized in terms of computing times for the two-level preconditioner. Therefore, within the range of processor cores considered here, the fully algebraic approach is preferable. Again, a three-level method may help, especially for the GDSW preconditioner, which has a larger coarse space than RGDSW.

A similar approach could be taken for \(Q_1RT_0\), i.e., the three translations could be used for the deformation and the Laplace null space for the diffusion. However, this was not tested here.

Effect of unstructured grids Finally, we also consider the effect of unstructured grids; see Fig. 15 and Table 12. The mesh is regular where the boundary condition is applied. Let us note that even for structured meshes, the domain decomposition is often unstructured; see, e.g., Fig. 7 (middle).

Compared to the results for \(Q_1Q_1\) in Table 9, the problem is substantially larger (1,128,294 d.o.f. in Table 12 compared to 705,894 d.o.f. in Table 9).

For the unstructured grid, the number of Krylov iterations is slightly higher and, as for the structured grid, the average number of Krylov iterations increases from 125 to 512 cores. As a result of the larger problem, the Solver Time is substantially larger than for the problem in Table 9. We only have parallel runs for 125, 216, and 512 cores; nevertheless, the parallel scalability is acceptable.

Fig. 13 Strong scalability of the Solver Time (Setup Time + Krylov Time) for the mechanically induced problem; see Table 9 for the data

Table 12 Strong parallel scalability results for the mechanically induced diffusion problem described in Sect. 4.2.2. On the initial unstructured grid, we performed 4 refinement cycles, resulting in 180,224 finite elements and 1,128,294 degrees of freedom. We apply the FROSch framework in fully algebraic mode with the GDSW and the RGDSW coarse space. We perform two time steps with \(\Delta t = 0.05\,\hbox {s}\). By Avg. Krylov, we denote the average number of Krylov iterations computed over all time and Newton steps. The time measurements are taken over the complete computation

7.3 Weak parallel scalability

For the weak parallel scalability, we only consider the free-swelling problem with type (ii) Neumann boundary condition as described in Sect. 4.2.1. The material parameters are chosen as in Sect. 7.2.2. On the initial mesh we perform different numbers of refinement cycles such that we obtain 512 finite elements per core.

Fig. 14 Detailed timers for the mechanically induced problem; see Table 9 for the data

For the smallest problem on 8 cores, the problem size is 27,795 degrees of freedom, which compares to the largest problem size of 1,622,595 degrees of freedom on 512 cores.

Hence, we increase the problem size as well as the number of processor cores. For this setup, in the best case, the average number of Krylov iterations Avg. Krylov and the Solver Time should remain constant.

We observe that, within the range of 8 to 512 processor cores considered, the number of Krylov iterations grows from 16.1 to 71.4 for GDSW and from 14.6 to 76.6 for RGDSW. This increase is also reflected in the Krylov Time, which increases from \(2.51\,\hbox {s}\) to \(33.52\,\hbox {s}\) for GDSW and from \(2.06\,\hbox {s}\) to \(31.03\,\hbox {s}\) for RGDSW.

A significant part of the increase in the Solver Time, however, comes from a load imbalance in the problems with more than 64 cores: the maximum local subdomain size is 7919 for 8 cores and 14,028 for 512 cores; see also Table 14. Table 14 also shows that the load imbalance does not increase when scaling from 64 to 512 cores, which indicates that the partitioning scheme works well enough.

Again, we see that the coarse level is not fully sufficient to obtain numerical scalability, and more structure should be used, as in Table 11, if numerical scalability is the goal.

Note that even for three-dimensional linear elasticity, we typically see an increase in the number of Krylov iterations when scaling from 8 to 512 cores, even for GDSW with the full coarse space for elasticity (including rotations) [30, Fig. 15], which is covered by the theory. In [30, Fig. 15], the number of Krylov iterations only stays almost constant beyond about 2000 cores. However, the increase in the number of iterations is mild for the full GDSW coarse space in [30, Fig. 15] when scaling from 64 to 512 cores.

Fig. 15 Domain decomposition into eight subdomains of an unstructured grid of a cube. Here, two uniform refinement cycles are visualized; for the computations, four refinement steps were used

In conclusion, using the fully algebraic mode leads, for the range of processor cores considered here, to an acceptable method, although numerical scalability and optimal parallel scalability are not achieved.

Interestingly, the results for RGDSW are very similar in terms of Krylov iterations as well as the Solver Time, although RGDSW has a significantly smaller coarse space. This advantage is not yet visible here; however, for a larger number of cores, RGDSW can be expected to outperform GDSW by a large margin.

Table 13 Weak parallel scalability results for the free-swelling problem with type (ii) boundary conditions described in Sect. 4.2.1 corresponding to Fig. 16. Each core owns approximately 512 finite elements. We use the \(Q_1 RT_0\) ansatz functions and apply FROSch with the GDSW and the RGDSW coarse space. Two time steps with \(\Delta t = 0.05\,\hbox {s}\) are performed, each requiring 5 Newton steps. By Avg. Krylov, we denote the average number of Krylov iterations over all time and Newton steps. The time measurements are taken over the whole computation
Fig. 16 Weak parallel scalability of the Solver Time (Setup Time + Krylov Time) for the model problem of a free-swelling cube with type (ii) boundary conditions described in Sect. 4.2.1; see Table 13 for the data

When increasing the number of finite elements assigned to each core, we expect the weak parallel scalability to improve.

Table 14 Detailed cost of the overlapping subdomain problems \(K_i\) for the weak parallel scalability results in Table 13

7.4 Conclusion and outlook

The FROSch framework has shown good parallel performance when applied algebraically to the fully coupled chemo-mechanics problems. We have compared two benchmark problems with different boundary conditions. The time step size had only a minor influence on solver convergence for our specific benchmark problems.

Our GDSW-type preconditioners implemented in FROSch are a suitable choice when used as a monolithic solver for the linear systems arising from the Newton linearization. They perform well when applied in fully algebraic mode even when numerical scalability is not achieved.

Our strong scalability experiments have shown that, with respect to the average time to solve a monolithic system of equations obtained from the linearization of the nonlinear coupled problem discretized by \(Q_1 RT_0\), we have to invest a factor of slightly more than three in computing time compared to solving a standard linear elasticity benchmark problem discretized using the same number of \(Q_1\) finite elements. This is a good result, considering that the monolithic system is larger by a factor of almost 2.

Using a \(Q_1 Q_1\) discretization, the computing times are much higher. This is mainly a result of the lower sparsity of the finite element matrices.

We have also discussed that, using more structure, we can achieve numerical scalability in our experiments. However, this approach will only be efficient when used with a future three-level extension of our preconditioner.