1 Introduction

This work is devoted to an innovative solution of the adjoint equation in goal-oriented error estimation with the dual weighted residual (DWR) method [3,4,5] (based on former adjoint concepts [19]); we also refer to [1, 7, 22, 41] for some important early work. Since then, the DWR method has been applied to numerous applications such as variational inequalities [54], space-time adaptivity for parabolic problems [51], fluid-structure interaction [20, 23, 47, 57], Maxwell’s equations [12], worst-case multi-objective adaptivity [55], the finite cell method [53], surrogate models in stochastic inversion [37], model adaptivity in multiscale problems [36], mesh and model adaptivity for frictional contact [43], and adaptive multiscale predictive modeling [39]. A summary of theoretical advancements in efficiency estimates and multi-goal-oriented error estimation was recently given in [15]. An important ingredient in these studies is the adjoint problem, as it measures the sensitivity of the primal solution with respect to a single or multiple given goal functionals (quantities of interest). The adjoint solution is usually obtained from global higher-order finite element method (FEM) solutions or from local higher-order approximations [5]. In general, the former is more stable, see e.g. [18], but the latter often works sufficiently well in practice. As the adjoint solution is only required to evaluate the a posteriori error estimator, a cheap computation is of interest.

Consequently, the main objective of this work is to explore alternatives for computing the adjoint. Due to the universal approximation property [42], neural networks are a prime candidate, as they have already been successfully employed for solving ordinary and partial differential equations (PDEs) [6, 10, 24, 25, 30, 31, 33, 34, 44, 45, 50, 52, 58]. A related work aiming to improve goal-oriented computations with the help of neural network data-driven finite elements is [9]. Moreover, a recent summary of the key concepts of neural networks and deep learning was compiled in [26]. An advantage of neural networks is their greater flexibility, as they belong to the class of meshless methods. We follow the methodology of [44, 52] to solve PDEs by minimizing the residual using an L-BFGS (limited memory Broyden-Fletcher-Goldfarb-Shanno) method [32]. We address both linear and nonlinear PDEs and goal functionals in stationary settings. However, a shortcoming of the current approach is that we need to work with strong adjoint formulations, which may limit extensions to nonlinear coupled PDEs such as multiphysics problems and coupled variational inequality systems. If such problems can be restated in an energy formulation, suitable neural network algorithms are again available [14, 50]. Despite this drawback, namely the necessity of working with strong formulations, the current study provides useful insights into whether neural network guided adjoints can be an alternative concept for dual weighted residual error estimation. For this reason, our resulting modified adaptive algorithm and the related numerical simulations are compared side by side in all numerical tests with classical Galerkin finite element solutions (see e.g., [11]) of the adjoint. Our proposed algorithm is implemented in the open-source finite element library deal.II [2] coupled with LibTorch, the PyTorch C++ API [40].

The outline of this paper is as follows: In Sect. 2, we recapitulate the DWR method. Next, in Sect. 3 we gather the important ingredients of the neural network solution. This section also includes an extension of an approximation theorem from Lebesgue spaces to classical function spaces. The algorithmic realization is addressed in Sect. 4. Then, in Sect. 5 several numerical experiments are conducted. Our findings are summarized in Sect. 6.

2 Dual weighted residual method

2.1 Abstract problem

Let U and V be Banach spaces and let \({\mathcal {A}}: U \rightarrow V^*\) be a nonlinear mapping, where \(V^*\) denotes the dual space of V. With this, we can define the problem: Find \(u \in U\) such that

$$\begin{aligned} {\mathcal {A}}(u)(v) = 0 \quad \forall v \in V. \end{aligned}$$
(1)

Additionally, we can look at an approximation of this problem. For subspaces \({\tilde{U}} \subset U\) and \({\tilde{V}} \subset V\) the problem reads: Find \({\tilde{u}} \in {\tilde{U}}\) such that

$$\begin{aligned} {\mathcal {A}}({\tilde{u}})({\tilde{v}}) = 0 \quad \forall {\tilde{v}} \in {\tilde{V}}. \end{aligned}$$

Remark 1

In the following the nonlinear mapping \({\mathcal {A}}(\cdot )(\cdot )\) will represent the variational formulation of a stationary partial differential equation with the associated function spaces U and V. We define the finite element approximation of the abstract problem as follows: Find \(u_h \in U_h\) such that

$$\begin{aligned} {\mathcal {A}}(u_h)(v_h) = 0 \quad \forall v_h \in V_h, \end{aligned}$$
(2)

where \(U_h \subset U\) and \(V_h \subset V\) denote the finite element spaces. Here the operator is given by \({\mathcal {A}}(u_h)(\cdot ) := a(u_h)(\cdot ) - l(\cdot )\) with the semi-linear form \(a(u_h)(\cdot )\) and the linear form \(l(\cdot )\).

2.2 Motivation for adaptivity

In many applications we are not necessarily interested in the whole solution to a given problem, but more specifically only in the evaluation of a certain quantity of interest. This quantity of interest can often be represented mathematically by a goal functional \(J: U \rightarrow {\mathbb {R}}\). The main target is to minimize the approximation error between u and \(u_h\) measured in the given goal functional \(J(\cdot )\) while using the computational resources efficiently. This leads to the approach of [4, 5], the DWR method, which this work follows closely. We also refer to [1, 3] and the prior survey paper using duality arguments for adaptivity in differential equations [19]. We are interested in evaluating the goal functional J at the solution \(u \in U\) of the problem \({\mathcal {A}}(u)(v) = 0\) for all \(v \in V\) and at the solution \(u_h \in U_h\) of the corresponding discrete problem \({\mathcal {A}}(u_h)(v_h) = 0\) for all \(v_h \in V_h\). Under the assumption that both problems admit unique solutions, minimizing the approximation error subject to the given PDE can be restated as the equivalent constrained optimization problem:

$$\begin{aligned} \min _{u \in U} \{J(u) - J(u_h)\} \quad s.t. \quad {\mathcal {A}}(u)(v) = 0 \ \forall v \in V. \end{aligned}$$

For this constrained optimization problem we can introduce the corresponding Lagrangian

$$\begin{aligned} {\mathcal {L}}(u,z) = \{J(u) - J(u_h)\} - {\mathcal {A}}(u)(z) \end{aligned}$$

with the adjoint variable \(z \in V\) acting as a Lagrange multiplier. For this, a stationary point needs to fulfill the first-order necessary conditions

$$\begin{aligned} {\mathcal {L}}^\prime = 0&\Leftrightarrow {\left\{ \begin{array}{ll} {\mathcal {L}}_u^\prime (u,z) = J^\prime (u)(\delta u) - {\mathcal {A}}^\prime (u)(\delta u,z) &{}{\mathop {=}\limits ^{!}}0\\ {\mathcal {L}}_z^\prime (u,z) = -{\mathcal {A}}(u)(\delta z) &{}{\mathop {=}\limits ^{!}}0 \end{array}\right. } \\&\Leftrightarrow {\left\{ \begin{array}{ll} {\mathcal {A}}^\prime (u)(\delta u,z) = J^\prime (u)(\delta u)\\ {\mathcal {A}}(u)(\delta z) = 0 \end{array}\right. } \end{aligned}$$

where \(J^\prime , {\mathcal {A}}^\prime \) denote the Fréchet derivatives. We see that a defining equation for the adjoint variable arises therein: Find \(z \in V\) such that

$$\begin{aligned} {\mathcal {A}}^\prime (u)(\phi ,z) = J^\prime (u)(\phi ) \quad \forall \phi \in U, \end{aligned}$$
(3)

which is known as the adjoint problem. In other words, the adjoint solution measures the variation of the primal solution u with respect to the goal functional \(J(\cdot )\). Similarly to Sect. 2.1, we apply an approximation (for instance a finite element discretization) and obtain the approximate adjoint problem: Find \({\tilde{z}} \in {\tilde{V}}\) such that

$$\begin{aligned} {\mathcal {A}}^\prime ({{\tilde{u}}})({{\tilde{\phi }}},{{\tilde{z}}}) = J^\prime ({{\tilde{u}}})({{\tilde{\phi }}}) \quad \forall {{\tilde{\phi }}} \in {\tilde{U}}. \end{aligned}$$
(4)

These preparations lead to the a posteriori error representation for the approximation error between u and \({{\tilde{u}}}\), measured in terms of the goal functional \(J(\cdot )\), as derived in [46].

Theorem 1

Let \((u,z) \in U \times V\) solve (1) and (3). Further, let \({\mathcal {A}} \in {\mathcal {C}}^3(U,V^*) \) and \(J \in {\mathcal {C}}^3(U,{\mathbb {R}})\). Then for arbitrary approximations \(({\tilde{u}},{\tilde{z}}) \in U \times V\) the error representation

$$\begin{aligned} J(u)-J({\tilde{u}}) = \frac{1}{2}\rho ({\tilde{u}})(z-{\tilde{z}})+\frac{1}{2}\rho ^*({\tilde{u}},{\tilde{z}})(u-{\tilde{u}}) + \rho ({\tilde{u}})({\tilde{z}}) + {\mathcal {R}}^{(3)} \end{aligned}$$
(5)

holds true, where the primal and adjoint residuals are given by

$$\begin{aligned} \rho ({\tilde{u}})(\cdot )&:= -{\mathcal {A}}({\tilde{u}})(\cdot ), \\ \rho ^*({\tilde{u}},{\tilde{z}})(\cdot )&:= J^{\prime }({\tilde{u}})(\cdot ) - {\mathcal {A}}^{\prime }({\tilde{u}})(\cdot ,{\tilde{z}}). \end{aligned}$$

With \(e = u-{\tilde{u}}, \ e^* = z - {\tilde{z}}\), the remainder term reads as follows:

$$\begin{aligned} {\mathcal {R}}^{(3)} := \frac{1}{2}\int _0^1 \Big [J^{\prime \prime \prime }({\tilde{u}}+se)(e,e,e) - {\mathcal {A}}^{\prime \prime \prime }({\tilde{u}}+se)(e,e,e,{\tilde{z}}+se^*) \\ -3{\mathcal {A}}^{\prime \prime }({\tilde{u}}+se)(e,e,e^*)\Big ]s(s-1) \ \mathrm {d}s. \end{aligned}$$

Proof

The proof can be found in [46]. \(\square \)

Remark 2

If \({\tilde{u}} := u_h \in U_h \subset U\) is the Galerkin projection solving (2) and \({\tilde{z}} := z_h \in V_h \subset V \) solves (4), then the iteration error \(\rho ( {\tilde{u}}) ({\tilde{z}})\) vanishes and we recover the error identities presented in the early work [3]. Therefore, from now on we omit the iteration error. The remainder term is usually of third order [5] and can be neglected; detailed computational evidence for this was provided in [17]. In the case of a linear problem and a linear goal functional, it clearly holds that

$$\begin{aligned} \eta = \rho ({\tilde{u}})(z-{\tilde{z}}) = \frac{1}{2}\rho ({\tilde{u}})(z-{\tilde{z}})+\frac{1}{2}\rho ^*({\tilde{u}},{\tilde{z}})(u-{\tilde{u}}). \end{aligned}$$

Remark 3

Theorem 1 motivates the error estimator

$$\begin{aligned} \eta = \frac{1}{2}\rho ({\tilde{u}})(z-{\tilde{z}})+\frac{1}{2}\rho ^*({\tilde{u}},{\tilde{z}})(u-{\tilde{u}}). \end{aligned}$$

This error estimator is exact but not computable. Therefore, the exact solutions u and z are now being approximated by higher-order solutions \(\left( u_h^{(2)},z_h^{(2)}\right) \in U_h^{(2)} \times V_h^{(2)}\). These higher-order solutions can be realised on a refined grid or by using higher-order basis functions. The practical error estimator reads

$$\begin{aligned} \eta ^{(2)}= \frac{1}{2}\rho ({\tilde{u}})\left( z_h^{(2)}-{\tilde{z}}\right) +\frac{1}{2}\rho ^*({\tilde{u}},{\tilde{z}})\left( u_h^{(2)}-{\tilde{u}}\right) . \end{aligned}$$
(6)

2.3 DWR algorithm

In principle, we need to solve four problems, of which especially the computation of \(u_h^{(2)}\) is expensive. It is well-known that different possibilities exist, such as global higher-order finite element solutions or local higher-order interpolations [5, 7, 48]. Moreover, one may consider only the primal part of the error estimator, which is fully justified for linear problems only and yields a second-order remainder term for nonlinear problems [5, Proposition 2.3]:

$$\begin{aligned} \eta _h^{(2)} = \rho (u_h)\left( z_h^{(2)}-{\tilde{z}}\right) . \end{aligned}$$

For many nonlinear problems this version is used, as it reduces the effort to solving only two problems and yields excellent results for mildly nonlinear problems such as incompressible flow in a laminar regime [8]. On the other hand, for quasi-linear problems, there is a strong need to work with the adjoint error parts \(\rho ^*\) as well [16, 17].

In our work, we employ solutions in enriched spaces. We compute the adjoint solution \(z_h^l = i_h z_h^{l,(2)} \in V_h^l \subset V_h^{l,(2)}\) via restriction. For nonlinear problems, we approximate the primal solution in the enriched space \(u_h^{l,(2)} = I_h^{(2)}u_h^l \in U_h^{l,(2)} \supset U_h^l\) via interpolation. Therefore, we only solve two problems in practice: the primal problem and the enriched adjoint problem.

Algorithm 1: Adaptive DWR method (listing not reproduced here)

2.4 Error localization

The error estimator \(\eta ^{(2)}\) must be localized to the corresponding regions of the error contributions. This can be done either by the methods proposed in [3,4,5], which use integration by parts in a backwards manner and result in an element-wise localization employing the strong form of the equations, or by the filtering approach using the weak form [7]. In this work, we use another weak-form technique proposed in [48], where a partition of unity (PU) \(\sum _i \psi _i \equiv 1\) is introduced and the error contribution is localized on a nodal level. To realize this partition of unity, one can simply choose piecewise bilinear elements \(Q_1^c\) (see e.g., [11]) spanning a finite element space \(W_h = \text {span}\{\psi _1,\ldots , \psi _N\}\) with \(\dim (W_h) = N\). Then, the approximated error indicator reads

$$\begin{aligned} \eta ^{(2), PU} =&\sum _{i=1}^{N} \left( \frac{1}{2}\rho ({\tilde{u}})\left( \left( z_h^{(2)}-{\tilde{z}}\right) \psi _i\right) \right. \nonumber \\&\left. +\frac{1}{2}\rho ^*({\tilde{u}},{\tilde{z}})\left( \left( u_h^{(2)}-{\tilde{u}}\right) \psi _i\right) \right) . \end{aligned}$$
(7)

Some recent theoretical work on the effectivity and efficiency of \(\eta ^{(2), PU}\) can be found in [17, 48]. The main objective of the remainder of this paper is to compute the adjoint solution with a feedforward neural network.

2.5 Effectivity index

To evaluate the accuracy of the error estimator we introduce the effectivity index

$$\begin{aligned} I_{eff} = \frac{\left| \eta ^{(2), PU}\right| }{\vert J(u)-J({\tilde{u}})\vert }. \end{aligned}$$

If J(u) is unknown, we approximate it by \(J({\hat{u}})\), where \({\hat{u}}\) is the solution of the PDE on a very fine grid. We desire that the effectivity index converges to 1, which signifies that our error estimator is a good approximation of the error in the goal functional.

3 Neural networks

In order to realize neural network guided DWR, we consider feedforward neural networks \(u_{NN}: {\mathbb {R}}^d \rightarrow {\mathbb {R}}\), where d is the dimension of the domain \(\Omega \) plus the dimension of u and the dimension of all the derivatives of u that are required for the adjoint problem. The neural networks can be expressed as

$$\begin{aligned} u_{NN}({\varvec{x}}) = T^{(L)} \circ \sigma \circ T^{(L-1)} \circ \cdots \circ \sigma \circ T^{(1)}({\varvec{x}}), \end{aligned}$$

where \(T^{(i)}: {\mathbb {R}}^{n_{i-1}} \rightarrow {\mathbb {R}}^{n_{i}},{\varvec{y}} \mapsto W^{(i)} {\varvec{y}}\, +\, {\varvec{b}^{(i)}}\) are affine transformations for \(1 \le i \le L\), with weight matrices \(W^{(i)} \in {\mathbb {R}}^{n_i \times n_{i-1}}\) and bias vectors \({\varvec{b}^{(i)}} \in {\mathbb {R}}^{n_i}\). Here \(n_{i}\) denotes the number of neurons in the i-th layer, with \(n_0 = d\) and \(n_L = 1\). Further, \(\sigma : {\mathbb {R}} \rightarrow {\mathbb {R}}\) is a nonlinear activation function, applied componentwise, which is the hyperbolic tangent function throughout this work. Derivatives of neural networks can be computed with backpropagation (see e.g. [26, 49]), a special case of reverse mode automatic differentiation [38]. Similarly, higher-order derivatives can be calculated by applying automatic differentiation recursively.
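To make this architecture concrete, the following is a minimal sketch in Python with PyTorch; the actual implementation in this work uses LibTorch, the PyTorch C++ API, and the helper name as well as the concrete layer sizes are illustrative assumptions only.

```python
# Minimal PyTorch sketch of the feedforward architecture defined above; the
# implementation in this work uses LibTorch (the PyTorch C++ API) instead.
import torch
import torch.nn as nn

def make_feedforward(d_in, hidden_widths):
    """Builds T^(L) ∘ σ ∘ T^(L-1) ∘ ... ∘ σ ∘ T^(1) with σ = tanh and scalar output."""
    layers = []
    sizes = [d_in] + list(hidden_widths)
    for n_prev, n_next in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(n_prev, n_next), nn.Tanh()]  # affine map T^(i) followed by σ
    layers.append(nn.Linear(sizes[-1], 1))                # final affine map T^(L), no activation
    return nn.Sequential(*layers)

# Example: two hidden layers with 32 neurons each (cf. Sect. 5.1.1), assuming a
# purely space-dependent adjoint so that the input dimension is d = 2.
u_NN = make_feedforward(2, [32, 32])
x = torch.rand(10, 2)            # ten points in a two-dimensional domain
print(u_NN(x).shape)             # torch.Size([10, 1])
```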

3.1 Universal function approximators

Cybenko [13] and Hornik [27] proved a first version of the universal approximation theorem, which states that continuous functions can be approximated to arbitrary precision by single hidden layer neural networks. A few years later Pinkus [42] generalized their findings and showed that single hidden layer neural networks can uniformly approximate a function and its partial derivatives.

This theoretical result motivates the application of neural networks for the numerical approximation of partial differential equations.

3.2 Residual minimization with neural networks

Residual minimization with neural networks has become popular in the last few years through the works of Raissi, Perdikaris and Karniadakis on physics-informed neural networks (PINNs) [44] and the paper of Sirignano and Spiliopoulos on the “Deep Galerkin Method” [52]. In this approach, one considers the strong formulation of the stationary PDE

$$\begin{aligned} \begin{aligned} {\mathcal {N}}(u, {\varvec{x}})&= 0 \quad \text {in}\ \Omega \\ {\mathcal {B}}(u, {\varvec{x}})&= 0 \quad \text {on}\ \partial \Omega \end{aligned} \end{aligned}$$
(8)

where \({\mathcal {N}}\) is a differential operator and \({\mathcal {B}}\) is a boundary operator. An example of the differential operator \({\mathcal {N}}\) is the strong-form counterpart of the semi-linear form \({\mathcal {A}}(u)(v)\) introduced in Sect. 2.1. In the weak formulation, the boundary operator \({\mathcal {B}}\) in the case of Dirichlet conditions is, as usual, realized via the function space U. One then seeks a neural network \(u_{NN}\) that minimizes the loss function

$$\begin{aligned} L(u_{NN}) = \frac{1}{n_\Omega } \sum _{i = 1}^{n_\Omega } {\mathcal {N}}\left( u_{NN},{\varvec{x}}_i^\Omega \right) ^2 + \frac{1}{n_{\partial \Omega }} \sum _{i = 1}^{n_{\partial \Omega }} {\mathcal {B}}\left( u_{NN},{\varvec{x}}_i^{\partial \Omega }\right) ^2, \end{aligned}$$

where \({\varvec{x}}_1^\Omega , \dots , {\varvec{x}}_{n_\Omega }^\Omega \in \Omega \) are collocation points inside the domain and \({\varvec{x}}_1^{\partial \Omega }, \dots , {\varvec{x}}_{n_{\partial \Omega }}^{\partial \Omega } \in \partial \Omega \) are collocation points on the boundary. In [56] it has been shown that the two components of the loss function need to be weighted appropriately to yield accurate results. Therefore, we use a modified version of this method which circumvents these issues.
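For illustration, a hedged sketch of this loss is given below; pde_residual and bc_residual are hypothetical callables (not part of any library API) that evaluate \({\mathcal {N}}(u_{NN},{\varvec{x}})\) and \({\mathcal {B}}(u_{NN},{\varvec{x}})\) pointwise.

```python
# Sketch of the collocation loss L(u_NN) above. 'pde_residual' and 'bc_residual'
# are hypothetical placeholders returning N(u_NN, x) and B(u_NN, x) pointwise.
import torch

def collocation_loss(u_NN, pde_residual, bc_residual, x_interior, x_boundary):
    interior_term = pde_residual(u_NN, x_interior).pow(2).mean()  # (1/n_Ω) Σ N(u_NN, x_i^Ω)^2
    boundary_term = bc_residual(u_NN, x_boundary).pow(2).mean()   # (1/n_∂Ω) Σ B(u_NN, x_i^∂Ω)^2
    # As discussed above (cf. [56]), these two terms generally require a careful,
    # problem-dependent weighting; the ansatz of Sect. 3.3 removes the boundary
    # term altogether.
    return interior_term + boundary_term
```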

3.3 Our approach

Let us again consider the abstract PDE problem in its strong formulation (8). For simplicity, we only consider Dirichlet boundary conditions, i.e. \({\mathcal {B}}(u,{\varvec{x}}):=u({\varvec{x}}) - g({\varvec{x}})\). Additionally, in our work we follow the approach of Berg and Nyström [6], illustrated in Fig. 1, who used the ansatz

$$\begin{aligned} u({\varvec{x}}) := d_{\partial \Omega }({\varvec{x}})\cdot u_{NN}({\varvec{x}}) + {\tilde{g}}({\varvec{x}}) \quad \text {for}\, {\varvec{x}} \in {\bar{\Omega }} \end{aligned}$$
(9)

to fulfill inhomogeneous Dirichlet boundary conditions exactly. Here \({\tilde{g}}\) denotes the extension of the boundary data g to the entire domain \({\bar{\Omega }}\), which is continuously differentiable up to the order of the differential operator \({\mathcal {N}}\). Berg and Nyström [6] used the distance to the boundary \(\partial \Omega \) as their function \(d_{\partial \Omega }\). However, it is sufficient to use a function \(d_{\partial \Omega }\) which is continuously differentiable up to the order of the differential operator \({\mathcal {N}}\) with the properties

$$\begin{aligned} d_{\partial \Omega }({\varvec{x}}) {\left\{ \begin{array}{ll} = 0 &{} \text {for } {\varvec{x}} \in \partial \Omega \\ \ne 0 &{} \text {for } {\varvec{x}} \in \Omega \end{array}\right. }. \end{aligned}$$

Thus, \(d_{\partial \Omega }\) can be interpreted as a level-set function, since

$$\begin{aligned} \Omega = \lbrace {\varvec{x}} \in {\bar{\Omega }} \;\vert \; d_{\partial \Omega }({\varvec{x}}) \ne 0 \rbrace \text { and } \partial \Omega = \lbrace {\varvec{x}} \in {\bar{\Omega }} \;\vert \; d_{\partial \Omega }({\varvec{x}}) = 0 \rbrace . \end{aligned}$$

Obviously, for this kind of ansatz for the solution of the PDE, it holds that

$$\begin{aligned} {\mathcal {B}}(u,{\varvec{x}}) = u({\varvec{x}}) - g( {\varvec{x}}) = \left[ d_{\partial \Omega }({\varvec{x}})\cdot u_{NN}({\varvec{x}}) + {\tilde{g}}({\varvec{x}}) \right] - g({\varvec{x}}) = 0 \quad \text {on}\ \partial \Omega . \end{aligned}$$

Therefore, in contrast to some previous works, we do not need to account for the boundary conditions in our loss function, which is a big benefit of our approach, since a proper weighting of the different residual contributions in the loss function is not required. It might only be a little cumbersome to fulfill the boundary conditions exactly when dealing with mixed boundary conditions, but the form of the ansatz function for such boundary conditions has been laid out in [35].
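As an illustration, the following sketch (Python/PyTorch; our implementation uses LibTorch) realizes the ansatz (9) on the unit square with the level-set-type function \(d_{\partial \Omega }(x,y) = x(1-x)y(1-y)\), which is also the choice used later in Sect. 5.1; the smooth extension \({\tilde{g}}(x,y) = xy\) of the boundary data is a purely hypothetical example.

```python
# Sketch of the ansatz (9) on the unit square Ω = (0,1)^2 with
# d_∂Ω(x, y) = x(1-x)y(1-y) (cf. Sect. 5.1); the smooth extension
# g̃(x, y) = x·y of the Dirichlet data is a hypothetical example.
import torch
import torch.nn as nn

u_NN = nn.Sequential(nn.Linear(2, 32), nn.Tanh(),
                     nn.Linear(32, 32), nn.Tanh(),
                     nn.Linear(32, 1))

def d_boundary(x):
    # vanishes exactly on ∂Ω and is nonzero in the interior
    return x[:, 0:1] * (1 - x[:, 0:1]) * x[:, 1:2] * (1 - x[:, 1:2])

def g_tilde(x):
    # smooth extension of the Dirichlet data g to the closed domain
    return x[:, 0:1] * x[:, 1:2]

def u_ansatz(x):
    return d_boundary(x) * u_NN(x) + g_tilde(x)   # u = d_∂Ω · u_NN + g̃

# The Dirichlet condition is fulfilled exactly, independently of the network weights:
x_bdry = torch.tensor([[0.0, 0.3], [1.0, 0.7], [0.5, 0.0], [0.2, 1.0]])
print(u_ansatz(x_bdry) - g_tilde(x_bdry))         # exactly zero at boundary points
```

Consequently, only the interior residual \({\mathcal {N}}(u,{\varvec{x}})\) remains in the loss function.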

Fig. 1 Section 3.3: Diagram of our ansatz \(u = d_{\partial \Omega }\cdot u_{NN} + {\tilde{g}}\) for the two dimensional Poisson problem. Here we used the abbreviations \(u_i := u(x_i, y_i)\) and \(f_i := f(x_i,y_i)\) for points \(\varvec{x}_i = (x_i,y_i) \in {\bar{\Omega }}\)

3.3.1 Approximation theorem

In the following, we prove that our neural network solutions approximate the analytical solutions well if their loss is sufficiently small. Our neural networks \(u_{NN}\) have been trained with the mean squared error of the residual of the PDE, i.e.

$$\begin{aligned} L(u) = \frac{1}{n} \sum _{i = 1}^n {\mathcal {N}}(u,{\varvec{x}}_i)^2, \end{aligned}$$
(10)

where n is the number of collocation points \({\varvec{x}}_i\) from the domain \(\Omega \). For the sake of generality, let us consider the generalized loss

$$\begin{aligned} {\hat{L}}_p(u) = \frac{1}{\vert \Omega \vert } \int _\Omega \vert {\mathcal {N}}(u,{\varvec{x}})\vert ^p\ \mathrm {d}{\varvec{x}} \end{aligned}$$

for \(p \ge 1\). Then, the loss (10) is just the Monte Carlo approximation of the generalized loss for \(p = 2\). We briefly recall the approximation theorem from [56] and show that the classical solution of the Poisson problem satisfies the assumptions of the approximation theorem.

Lemma 2

(Approximation theorem [56]) Let \(2 \le p \le \infty \). We consider a PDE of the form (8) on a bounded, open domain \(\Omega \subset {\mathbb {R}}^m\) with Lipschitz boundary \(\partial \Omega \) and \({\mathcal {N}}(u,{\varvec{x}}) = N(u,{\varvec{x}}) - {\hat{f}}({\varvec{x}})\), where N is a linear, elliptic operator and \({\hat{f}} \in L^2(\Omega )\). Let there be a unique solution \({\hat{u}} \in H^1(\Omega )\) and let the following stability estimate

$$\begin{aligned} \Vert u \Vert _{H^1(\Omega )} \le C \Vert f \Vert _{L^2(\Omega )} \end{aligned}$$

hold for \(u \in H^1(\Omega ), f \in L^2(\Omega )\) with \(N(u,{\varvec{x}}) = f({\varvec{x}})\) in \(\Omega \). Then we have for an approximate solution \(u\in H^1(\Omega )\) that

$$\begin{aligned} \forall \epsilon> 0\, \exists \delta > 0: \quad {\hat{L}}_p(u)< \delta \Longrightarrow \Vert u - {\hat{u}} \Vert _{H^1(\Omega )} < \epsilon . \end{aligned}$$

Proof

Let

$$\begin{aligned} \delta = \epsilon ^p C^{-p} \vert \Omega \vert ^{-\frac{p}{2}}. \end{aligned}$$

Let \(u = d_{\partial \Omega }\cdot u_{NN} + {\tilde{g}} \in H^1(\Omega )\) be an approximate solution of the PDE with \({\hat{L}}_p(u) < \delta \), which means that there exists a perturbation to the right-hand side \(f_{\text {error}} \in L^2(\Omega )\) such that \(N(u,{\varvec{x}}) = {\hat{f}}({\varvec{x}}) + f_{\text {error}}({\varvec{x}})\). By the stability estimate and the linearity of N, we have

$$\begin{aligned} \Vert u - {\hat{u}}\Vert _{H^1(\Omega )} \le C \Vert ({\hat{f}} + f_\text {error}) - {\hat{f}} \Vert _{L^2(\Omega )} = C \Vert f_\text {error}\Vert _{L^2(\Omega )}. \end{aligned}$$

Applying the Hölder inequality to the norm of \(f_\text {error}\) and using \(2 \le p \le \infty \) yields

$$\begin{aligned} \Vert f_\text {error}\Vert _{L^2(\Omega )} \le \vert \Omega \vert ^{\frac{1}{2} - \frac{1}{p}}\Vert f_\text {error}\Vert _{L^p(\Omega )}. \end{aligned}$$

Combining the last two inequalities gives us the desired error bound

$$\begin{aligned} \Vert u - {\hat{u}}\Vert _{H^1(\Omega )}&\le C \Vert f_\text {error}\Vert _{L^2(\Omega )} \le C \vert \Omega \vert ^{\frac{1}{2} - \frac{1}{p}}\Vert f_\text {error}\Vert _{L^p(\Omega )} \\&= C \vert \Omega \vert ^{\frac{1}{2}} {\hat{L}}_p(u)^{\frac{1}{p}} \\&< C \vert \Omega \vert ^{\frac{1}{2}} \delta ^{\frac{1}{p}} = \epsilon . \end{aligned}$$

In the last inequality, we used that the generalized loss of our approximate solution is sufficiently small, i.e. \({\hat{L}}_p(u) < \delta \). \(\square \)

Let us recapitulate an important result from the Schauder theory [21], which yields the existence and uniqueness of classical solutions of the Poisson problem if we assume higher regularity of our problem, i.e. when we work with Hölder continuous functions and sufficiently smooth domains.

Lemma 3

(Solution in classical function spaces) Let \(0< \lambda < 1\) be such that \(\Omega \subset {\mathbb {R}}^m\) is a domain with \(C^{2,\lambda }\) boundary, \({\tilde{g}} \in C^{2,\lambda }({\bar{\Omega }})\) and \({\hat{f}} \in C^{0,\lambda }({\bar{\Omega }})\). Then Poisson’s problem, which is of the form (8) with \(N(u,{\varvec{x}}) := -\Delta u({\varvec{x}})\), has a unique solution \({\hat{u}} \in C^{2,\lambda }({\bar{\Omega }})\).

Proof

Follows immediately from [21][Theorem 6.14]. \(\square \)

With Lemma 3 we can now show that the approximation theorem holds for the Poisson problem in classical function spaces.

Theorem 4

Let \(0< \lambda < 1\) be such that \(\Omega \subset {\mathbb {R}}^m\) is a bounded, open domain with \(C^{2,\lambda }\) boundary, \({\tilde{g}} \in C^{2,\lambda }({\bar{\Omega }})\) and \({\hat{f}} \in C^{0,\lambda }({\bar{\Omega }})\). Then Poisson’s problem, which is of the form (8) with \(N(u,{\varvec{x}}) := -\Delta u({\varvec{x}})\), has a unique solution \({\hat{u}} \in H^1(\Omega )\). Furthermore, there exists \(u = d_{\partial \Omega }\cdot u_{NN} + {\tilde{g}} \in H^1(\Omega )\) with the estimate

$$\begin{aligned} \forall \epsilon> 0\, \exists \delta > 0: \quad {\hat{L}}_p(u)< \delta \Longrightarrow \Vert u - {\hat{u}} \Vert _{H^1(\Omega )} < \epsilon . \end{aligned}$$

Proof

From Lemma 3 it follows that there exists a unique solution \({\hat{u}} \in C^{2,\lambda }({\bar{\Omega }}) \subset H^1(\Omega )\). Analogously it holds that \(u = d_{\partial \Omega }\cdot u_{NN} + {\tilde{g}} \in H^1(\Omega )\). Furthermore, we have by the Lax-Milgram Lemma that \({\hat{u}} \in H^1(\Omega )\) is the unique weak solution and fulfills the stability estimate

$$\begin{aligned} \Vert {\hat{u}} \Vert _{H^1(\Omega )} \le C \Vert {\hat{f}} \Vert _{L^2(\Omega )}. \end{aligned}$$

By Lemma 2 the estimate

$$\begin{aligned} \forall \epsilon> 0\, \exists \delta > 0: \quad {\hat{L}}_p(u)< \delta \Longrightarrow \Vert u - {\hat{u}} \Vert _{H^1(\Omega )} < \epsilon \end{aligned}$$

then also holds. \(\square \)

Remark 4

Theorem 4 implies that a low loss value of a neural network corresponds, with high probability, to an accurate approximation u of the exact solution \({\hat{u}}\) of the PDE, since the loss (10) is a Monte Carlo approximation of the generalized loss and, for a large number of collocation points, the two should be close in value.

3.3.2 Neural network solution of the adjoint PDE

To make a posteriori error estimates for our FEM solution of the primal problem (1), we now use neural networks to solve the adjoint PDE (3). In an FEM approach, the adjoint PDE would be solved in its variational form as described in Algorithm 1, but we minimize the residual of the strong form using neural networks and hence need to derive the strong formulation of the adjoint PDE first. After training, the neural network is then projected into the FEM ansatz function space of the adjoint problem. Finally, the a posteriori estimates can be made as usual with the DWR method following again Algorithm 1.

Remark 5

For linear goal functionals the Riesz representation theorem yields the existence and uniqueness of the strong formulation. Nevertheless, deriving the strong form of the adjoint PDE might be very involved for complicated PDEs, such as fluid structure interaction, e.g. [47, 57], and nonlinear goal functionals \(J: U \rightarrow {\mathbb {R}}\). In future works, we aim to extend to alternative approaches which do not require the derivation of the strong form.

4 Algorithmic realization

In this section, we describe our final algorithm for the neural network guided dual weighted residual method. In the algorithm, we work with hierarchical FEM spaces, i.e. \(U_h^{l} \subset U_h^{l,(2)}\) and \(V_h^{l} \subset V_h^{l,(2)}\).

Neural network guided DWR algorithm (listing not reproduced here; steps 4 and 5 are discussed below)
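Since the listing itself is not reproduced here, the following heavily hedged Python-style outline indicates the loop structure; every helper routine is a hypothetical placeholder (the actual implementation couples deal.II and LibTorch in C++), and the step numbers refer to the algorithm as discussed below.

```python
# Hedged outline of the neural network guided DWR loop; all helper routines are
# hypothetical placeholders and not part of deal.II or LibTorch.
def nn_guided_dwr(mesh, error_tol, solve_primal, interpolate_to_enriched,
                  train_adjoint_network, project_to_enriched_fem,
                  pu_dwr_estimator, mark_and_refine):
    while True:
        u_h = solve_primal(mesh)                           # primal FEM solution u_h^l
        u_h2 = interpolate_to_enriched(u_h, mesh)          # I_h^(2) u_h^l for nonlinear problems
        z_nn = train_adjoint_network(mesh, u_h2)           # step 4: minimize strong-form residual
        z_h2 = project_to_enriched_fem(z_nn, mesh)         # step 5: evaluate at enriched DoFs
        # The estimator internally restricts z_h^(2) to V_h^l to obtain z̃ (cf. Sect. 2.3)
        # and localizes the error with the PU estimator (7).
        eta, indicators = pu_dwr_estimator(u_h, u_h2, z_h2, mesh)
        if abs(eta) < error_tol:
            return u_h, eta
        mesh = mark_and_refine(mesh, indicators)           # adaptive refinement, next cycle
```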

Here we only consider the Galerkin method, for which the ansatz (trial) function space and the test function space coincide, i.e. \(U=V\), but \(U \ne V\) can be realized in a similar fashion. The novelty compared to the DWR method presented in Sect. 2 lies in steps 4 and 5 of the algorithm. In the following, we describe these parts in more detail.

In step 4, we solve the strong form of the adjoint problem, which for nonlinear PDEs or nonlinear goal functionals also depends on the primal solution \(u_h^{l,(2)}\). However, when the PDE and the goal functional are both linear, the adjoint problem does not depend on the primal solution and it is sufficient to train the neural network only once. Otherwise, the neural network needs to be retrained in each adaptive iteration. The strong form of the adjoint problem is of the form (8) and thus we can find a neural network based solution by minimizing the loss (10) with L-BFGS [32], a quasi-Newton method. To evaluate the partial derivatives of the neural network based solution \(z = d_{\partial \Omega } \cdot z_{NN} + {\tilde{g}}\) inside the loss function \(L(\cdot )\), e.g. the Laplacian of the solution, we employ automatic differentiation as mentioned at the beginning of Sect. 3. We observed that with L-BFGS the loss sometimes exploded or the optimizer got stuck at a saddle point. Consequently, we restarted the training loop with a new neural network when the loss exploded, or performed a few steps with the Adam optimizer [29] when a saddle point was reached; afterwards, L-BFGS can be used as the optimizer again. During training we used the coordinates of the degrees of freedom as our collocation points. We stopped the training when the loss did not decrease by more than \(TOL = 10^{-8}\) within the last \(n = 5\) epochs or when we reached the maximum number of epochs, which we chose to be 400. An alternative stopping criterion on fine meshes could be early stopping, where the collocation points are split into a training and a validation set and the training stops when the loss on the validation set starts deviating from the loss on the training set, i.e. when the neural network begins to overfit the training data.
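A condensed sketch of this training loop is shown below (Python/PyTorch, while the actual implementation uses LibTorch); loss_fn is assumed to return the mean squared strong-form residual (10) of the current ansatz \(z = d_{\partial \Omega } \cdot z_{NN} + {\tilde{g}}\) at the collocation points (i.e. the coordinates of the degrees of freedom), and the restart/Adam fallback is only indicated in the comments.

```python
# Sketch of the L-BFGS training in step 4; 'loss_fn' is assumed to return the mean
# squared strong-form residual (10) of the ansatz z = d_∂Ω · z_NN + g̃ evaluated at
# the given collocation points. The actual implementation uses LibTorch (C++).
import torch

def train_adjoint(z_NN, loss_fn, collocation_points, max_epochs=400, tol=1e-8, window=5):
    optimizer = torch.optim.LBFGS(z_NN.parameters())
    history = []

    def closure():
        optimizer.zero_grad()
        loss = loss_fn(collocation_points)
        loss.backward()
        return loss

    for epoch in range(max_epochs):
        loss = optimizer.step(closure)
        history.append(loss.item())
        if not torch.isfinite(loss):
            # In our runs the L-BFGS iteration occasionally diverged; as described
            # above, one then restarts with a freshly initialized network, or takes
            # a few Adam [29] steps if a saddle point is reached (not shown here).
            break
        # Stop once the loss has not decreased by more than 'tol' within the
        # last 'window' epochs (cf. the stopping criterion described above).
        if len(history) > window and history[-window - 1] - min(history[-window:]) < tol:
            break
    return history
```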

In step 5, we project the neural network based solution into the enriched FEM space by evaluating it at the coordinates of the degrees of freedom, which yields a unique function \(z_h^{l,(2)}\).

5 Numerical experiments

In this section, we apply our proposed approach to three stationary problems (with five numerical tests in total). We consider both linear and nonlinear PDEs and goal functionals. The primal problem, i.e. the original PDE, is solved with bilinear shape functions. The adjoint PDE is solved by minimizing the residual of our neural network ansatz (Sects. 3 and 4), and we project the resulting solution into the biquadratic finite element space. To study the performance, we also compute the adjoint solution with finite elements employing biquadratic shape functions. Finally, the neural network based solution is plugged into the PU DWR error estimator (7), which decides which elements are marked for refinement. To realize the numerical experiments, we couple deal.II [2] with LibTorch, the PyTorch C++ API [40].

5.1 Poisson’s equation

First, we consider the two-dimensional Poisson equation with homogeneous Dirichlet conditions on the unit square. In our ansatz (9), we choose the function

$$\begin{aligned} d_{\partial \Omega }(x,y) = x(1-x)y(1-y) {\quad \text {for } {\varvec{x}} = (x,y) \in \left[ 0,1\right] ^2}. \end{aligned}$$

Poisson’s problem is given by

$$\begin{aligned} -\Delta u&= f \quad \text {in}\ \Omega := (0,1)^2 \\ u&= 0 \quad \text {on}\ \partial \Omega , \end{aligned}$$

with \(f = -1\). For a linear goal functional \(J: V \rightarrow {\mathbb {R}}\) the adjoint problem then reads:

Find \(z \in H^1_0(\Omega )\) such that

$$\begin{aligned} ( \nabla \psi , \nabla z ) = J(\psi ) \quad \forall \psi \in H^1_0(\Omega ). \end{aligned}$$

Here \((\cdot ,\cdot )\) denotes the \(L^2\) inner product, i.e. \((f,g) := \int _{\Omega }\,f\cdot g\ \mathrm {d}{\varvec{x}}\).

5.1.1 Mean value goal functional

As a first numerical example of a linear goal functional, we consider the mean value goal functional

$$\begin{aligned} J(u) = \frac{1}{\vert \Omega \vert }\int _\Omega u\ \mathrm {d} {\varvec{x}}. \end{aligned}$$

The adjoint PDE can be written as

$$\begin{aligned} ( \nabla \psi , \nabla z ) = \left( \psi , \frac{1}{\vert \Omega \vert }\right) \end{aligned}$$

and can be transformed into its strong form

$$\begin{aligned} -\Delta z&= \frac{1}{\vert \Omega \vert } \quad \text {in}\ \Omega \\ z&= 0 \qquad \text {on}\ \partial \Omega . \end{aligned}$$
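The following sketch (Python/PyTorch; the implementation in this work uses LibTorch) shows how the corresponding loss (10) can be assembled with automatic differentiation, using the ansatz \(z = d_{\partial \Omega }\cdot z_{NN}\) with the function \(d_{\partial \Omega }\) chosen above and \(\vert \Omega \vert = 1\); the network size and the collocation points are illustrative.

```python
# Sketch of the residual loss (10) for the adjoint problem -Δz = 1/|Ω| in Ω = (0,1)^2,
# z = 0 on ∂Ω, with the ansatz z = d_∂Ω · z_NN and d_∂Ω(x, y) = x(1-x)y(1-y) as above.
import torch
import torch.nn as nn

z_NN = nn.Sequential(nn.Linear(2, 32), nn.Tanh(),
                     nn.Linear(32, 32), nn.Tanh(),
                     nn.Linear(32, 1))

def z_ansatz(x):
    d = x[:, 0:1] * (1 - x[:, 0:1]) * x[:, 1:2] * (1 - x[:, 1:2])
    return d * z_NN(x)                      # homogeneous Dirichlet data, hence g̃ = 0

def adjoint_loss(x):
    x = x.clone().requires_grad_(True)
    z = z_ansatz(x)
    grad_z = torch.autograd.grad(z.sum(), x, create_graph=True)[0]            # (∂z/∂x, ∂z/∂y)
    z_xx = torch.autograd.grad(grad_z[:, 0].sum(), x, create_graph=True)[0][:, 0]
    z_yy = torch.autograd.grad(grad_z[:, 1].sum(), x, create_graph=True)[0][:, 1]
    residual = -(z_xx + z_yy) - 1.0         # N(z, x) = -Δz - 1/|Ω| with |Ω| = 1
    return residual.pow(2).mean()           # mean squared residual, cf. (10)

x_colloc = torch.rand(1000, 2)              # e.g. 1,000 collocation points in the unit square
print(adjoint_loss(x_colloc))
```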

We trained a fully connected neural network with two hidden layers with 32 neurons each and the hyperbolic tangent activation function for 400 epochs on 1,000 uniformly sampled points. In [44] it has been shown that wider and deeper neural networks can achieve a lower \(L^2\) error between the neural network and the analytical solutions. However, if we use the support points of the FEM mesh as the collocation points, we cannot use bigger neural networks, since we do not have enough training data. Therefore, we decided to use smaller networks.

We compared our neural network based error estimator with a standard finite element based error estimator:

Table 1 Section 5.1.1: Error estimator results for mean value goal functional

In this numerical test, the neural network based estimator refined the mesh in the same way as the finite element based estimator, and both error indicators yield effectivity indices \(I_{eff}\) of approximately 1.0, which means that the exact error and the estimated error were almost identical. The error reduction is of second order, as expected, and the overall results in Table 1 agree well with similar computations presented in [48, Table 1].

5.1.2 Regional mean value goal functional

In the second numerical example, we analyze the mean value goal functional when it is computed only on a subset \(D \subset \Omega \) of the domain. We choose \(D := \left[ 0,\frac{1}{4}\right] \times \left[ 0,\frac{1}{4}\right] \). For the regional goal functional

$$\begin{aligned} J(u) = \frac{1}{\vert D \vert }\int _D u\ \mathrm {d} {\varvec{x}} \end{aligned}$$

the strong form of the PDE is given by

$$\begin{aligned} -\Delta z&= \frac{1_D}{\vert D\vert }, \end{aligned}$$

where \(1_D\) is the indicator function of D. The rest of the training setup is the same as for the previous goal functional.

The computational results for this example can be found in Table 2. Here we can observe that the finite element method and our approach end up with different grid refinements (see Fig. 2); however, both methods show a similar performance and effectivity indices \(I_{eff}\) of approximately 1.0.

Table 2 Section 5.1.2: Error estimator results for regional mean value goal functional
Fig. 2 Section 5.1.2: Grid refinement with regional mean value goal functional

On the grids in Fig. 2, which have been refined with the two different approaches, we can see that the finite element method creates a symmetrical grid refinement. This symmetry cannot be observed in the neural network based refinement. Furthermore, our approach refined a few more elements than the FEM based approach, but overall our methodology still produced a reasonable grid adaptivity.

5.1.3 Mean squared value goal functional

As a third numerical test, we consider an example of a nonlinear goal functional, the mean squared value, which reads

$$\begin{aligned} J(u) = \frac{1}{\vert \Omega \vert }\int _\Omega u^2\ \mathrm {d} {\varvec{x}}. \end{aligned}$$

For a nonlinear goal functional the adjoint problem needs to be modified to (see also (3) in Sect. 2): Find \(z \in H^1_0(\Omega )\) such that

$$\begin{aligned} ( \nabla \psi , \nabla z ) = J'(u)(\psi ) \quad \forall \psi \in H^1_0(\Omega ). \end{aligned}$$

Computing the Fréchet derivative of the mean squared value goal functional, we can rewrite the adjoint problem as

$$\begin{aligned} ( \nabla \psi , \nabla z ) = \left( \psi , \frac{2u}{\vert \Omega \vert }\right) \end{aligned}$$

which can be transformed into its strong form

$$\begin{aligned} -\Delta z&= \frac{2u}{\vert \Omega \vert }. \end{aligned}$$
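Compared with the sketch for the mean value functional in Sect. 5.1.1, only the right-hand side of the residual changes; the sketch below indicates how the primal solution could enter the loss, where u_at_colloc is assumed to hold the (interpolated) primal FEM solution evaluated at the collocation points.

```python
# Sketch of the adjoint residual for the mean squared value functional; compared to
# Sect. 5.1.1 only the right-hand side changes to 2u/|Ω|. 'u_at_colloc' (shape (n,))
# is assumed to hold the interpolated primal FEM solution at the collocation points.
import torch

def adjoint_loss_msv(x, u_at_colloc, z_ansatz):
    x = x.clone().requires_grad_(True)
    z = z_ansatz(x)
    grad_z = torch.autograd.grad(z.sum(), x, create_graph=True)[0]
    z_xx = torch.autograd.grad(grad_z[:, 0].sum(), x, create_graph=True)[0][:, 0]
    z_yy = torch.autograd.grad(grad_z[:, 1].sum(), x, create_graph=True)[0][:, 1]
    residual = -(z_xx + z_yy) - 2.0 * u_at_colloc   # N(z, x) = -Δz - 2u/|Ω|, |Ω| = 1
    return residual.pow(2).mean()
```

Since the right-hand side depends on the primal solution, the network has to be retrained on each adaptively refined grid, as noted below.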

Our training setup also changed slightly. The problem statement has become more difficult, and we decided to use slightly bigger networks to compute a sufficiently good approximation of the adjoint solution. We used three hidden layers with 32 neurons each and retrained the neural network on each grid, since the primal solution enters the adjoint PDE.

Table 3 Section 5.1.3: Error estimator results for mean squared value goal functional

In Table 3 it can be seen that our neural network approach consistently underestimates the error and produces slightly worse results than the FEM solution. Nevertheless, the effectivity index is still sufficiently close to 1 and the grid refinement, see Fig. 3, looks reasonable. Moreover, as in the previous tests, the effectivity indices \(I_{eff}\) are stable without major oscillations.

Fig. 3 Section 5.1.3: Grid refinement with mean squared value goal functional

5.2 Poisson’s problem with analytical primal and adjoint solutions

To be better able to assess our methodology, we now consider a Poisson problem with analytical primal and adjoint solutions.

5.2.1 Problem statement

For this we choose the source function of the primal Poisson problem to be

$$\begin{aligned} f(x,y) = 2\pi ^2 \sin (\pi x) \sin (\pi y) {\quad \text {for } \varvec{x} = (x,y) \in \left( 0,1\right) ^2}, \end{aligned}$$

and we choose the goal functional \(J (u) = \int _\Omega f\cdot u\ \mathrm {d} {\varvec{x}}\). For this problem the strong form of the adjoint PDE is given by

$$\begin{aligned} -\Delta z = f. \end{aligned}$$

Then it holds that the primal and adjoint solution read

$$\begin{aligned} u(x,y) = z(x,y) = \sin (\pi x) \sin (\pi y) {\quad \text {for } \varvec{x} = (x,y) \in \left[ 0,1\right] ^2.} \end{aligned}$$
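Indeed, inserting \(\sin (\pi x)\sin (\pi y)\) into the primal and adjoint equations confirms this, and a direct calculation (included here only for reference) also gives the exact value of the goal functional:

$$\begin{aligned} -\Delta \left( \sin (\pi x) \sin (\pi y)\right)&= 2\pi ^2 \sin (\pi x) \sin (\pi y) = f(x,y), \\ J(u)&= \int _\Omega f\cdot u\ \mathrm {d} {\varvec{x}} = 2\pi ^2 \left( \int _0^1 \sin ^2(\pi s)\ \mathrm {d}s\right) ^2 = \frac{\pi ^2}{2} \approx 4.9348. \end{aligned}$$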

5.2.2 Setups for performance analysis

In the following, we analyze the performance of our proposed approach while varying the number of hidden layers and the number of neurons therein. As in the previous numerical experiments, we consider fully connected neural networks with the hyperbolic tangent activation function. For the hyperparameters of the neural networks, i.e. the number of hidden layers and the number of neurons, we performed a grid search over \(\lbrace 1,2,4,6,8 \rbrace \) hidden layers and \(\lbrace 10,20,40 \rbrace \) neurons per layer.

For a fair comparison with the FEM guided adjoint computations, we chose to reuse the neural network and continue its training in each refinement cycle. However, since the PDE and the goal functional are linear, it would have been sufficient to train the neural network based adjoint solution once prior to the entire FEM simulations.

Furthermore, we investigated whether our method can lead to computational improvements over the finite element method on fine meshes, where the number of degrees of freedom is large and FEM simulations are computationally expensive. To reduce the computational effort when dealing with neural networks and a large number of degrees of freedom, we reuse the neural network from the previous refinement cycle. We expect its weights and biases to be good initial values for the weights and biases in the current refinement cycle. Additionally, instead of working with a large number of collocation points, on fine meshes we randomly sample 10,000 collocation points from the coordinates of the degrees of freedom. On the one hand, we chose this restriction since we are using L-BFGS to optimize the parameters of our neural network. This quasi-Newton scheme has a higher order of convergence than Adam or other gradient descent based schemes, but in its original formulation it does not allow batch-wise optimization and hence processes all collocation points at once; due to these memory limitations of the L-BFGS method, we decided to limit the number of collocation points. On the other hand, our neural network based ansatz is a meshless method, which might not require all coordinates of the degrees of freedom as collocation points to achieve sufficient accuracy.
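A sketch of this subsampling is given below (Python/PyTorch; dof_coordinates is a hypothetical tensor of degree-of-freedom coordinates); the warm start simply consists of keeping the trained network object between refinement cycles.

```python
# Sketch of the collocation-point subsampling on fine meshes; 'dof_coordinates'
# is a hypothetical (n_dofs, 2) tensor of degree-of-freedom coordinates.
import torch

def subsample_collocation_points(dof_coordinates, max_points=10_000):
    n = dof_coordinates.shape[0]
    if n <= max_points:
        return dof_coordinates
    idx = torch.randperm(n)[:max_points]    # uniform sampling without replacement
    return dof_coordinates[idx]
```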

5.2.3 Discussion of our findings

To assess the performance of our approach for different hyperparameters of the neural networks, we summarize the effectivity indices in Table 4. Here we start at refinement cycle 0 with 25 degrees of freedom and, through adaptive mesh refinement, end up with close to 500,000 degrees of freedom at refinement cycle 8. Taking a closer look at the effectivity indices of our simulations, we observe that the effectivity indices of the neural network based approach mostly coincide with those of the finite element computations, independently of the number of hidden layers and the number of neurons. This indicates that our approach yields accurate solutions for all neural network hyperparameters considered in our experiment.

Table 4 Section 5.2: Mean of the effectivity indices of 10 independent runs for the Poisson problem with analytical primal and adjoint solution. The standard deviation for the neural network based simulations is 0.00 for all but the last refinement cycle, where we have a standard deviation of 0.01. The neural networks are denoted by tuples, where the first number corresponds to the number of hidden layers and the second number to the number of neurons therein

To investigate whether our method can lead to speed-ups over the finite element method on fine meshes, we inspect the CPU times of the finite element method and the neural network based approach. The FEM based DWR method with 474,153 degrees of freedom in the 8th refinement cycle had a mean CPU time of 169.2 s and a standard deviation of 0.5 s over 10 independent runs. In Table 5 the CPU times of our approach with neural networks with different hyperparameters are reported. We observe that the computational time increases with more hidden layers. Moreover, neural networks with one hidden layer were on average twice as fast as the FEM simulations. Finally, neural networks with up to four hidden layers had on average a shorter CPU time than the FEM based approach. Note that the neural networks with 8 hidden layers and 10 neurons have a large mean and standard deviation due to a statistical outlier: in one of the runs the CPU time amounted to more than 3,000 s because of repeated failures to converge during the training procedure.

Table 5 Section 5.2: CPU times (in seconds) for 10 independent runs of the proposed neural network approach applied to the Poisson problem with analytical primal and adjoint solution. We compare various numbers of hidden layers and numbers of neurons therein
Fig. 4 Section 5.2: Neural network training times (in seconds) per refinement cycle for 10 independent runs for the Poisson problem with analytical primal and adjoint solution

In Fig. 4, we display the training times of the neural networks in each refinement cycle. The solid lines and the shaded regions, which are bounded by dashed lines, represent the mean and one standard deviation of the training times. We observe that for all neural network hyperparameters the 0th refinement cycle is the most CPU time intensive, and the remaining refinement cycles require roughly one order of magnitude less neural network training time. Moreover, deeper neural networks, i.e. models with more hidden layers, in general require a longer training time than shallow neural networks, i.e. models with fewer hidden layers, due to the higher number of weights and biases that need to be optimized. Note that the peak in the training time of the 8-hidden-layer neural networks at refinement cycle 3 was caused by the aforementioned statistical outlier, which restarted training more than 100 times in the 3rd refinement cycle due to an explosion of the loss value or due to being stuck in a local minimum.

Fig. 5 Section 5.2: Number of epochs per refinement cycle needed to train the neural network for 10 independent runs for the Poisson problem with analytical primal and adjoint solution

In Fig. 5, the number of epochs needed to train the neural networks is shown for each refinement cycle. In each epoch the weights and biases of the neural network are trained on the full set of collocation points with the L-BFGS optimizer. As before, the solid lines and the shaded regions, which are bounded by dashed lines, represent the mean and one standard deviation of the number of epochs. Analogously to our observations for the CPU times, the number of epochs after the 0th refinement cycle decreases by an order of magnitude. In the 0th refinement cycle, more hidden layers lead to a higher number of training epochs. Deeper neural networks have a more complex loss surface and are thus more prone to an explosion of the loss or a stagnation of the loss at a suboptimal value. This is reflected in the higher number of epochs in the 0th refinement cycle. Nevertheless, in the remaining refinement cycles the number of epochs does not seem to depend on the depth of the neural networks.

5.3 Nonlinear PDE and nonlinear goal functional

We now consider the case where both the PDE and the goal functional are nonlinear. We add the scaled nonlinear term \(u^2\) to the previous equation, such that the new problem is given by

$$\begin{aligned} -\Delta u + \gamma u^2&= f \quad \text {in}\ \Omega \\ u&= 0 \quad \text {on}\ \partial \Omega , \end{aligned}$$

with \(\gamma > 0\) and \(f = -1\). For our nonlinear goal functional, we choose the mean squared value goal functional from the previous example. The adjoint problem thus reads:

Find \(z \in H^1_0(\Omega )\) such that

$$\begin{aligned} ( \nabla \psi , \nabla z ) + 2\gamma (\psi , zu) = \left( \psi , \frac{2u}{\vert \Omega \vert }\right) \quad \forall \psi \in H^1_0(\Omega ), \end{aligned}$$

with corresponding strong form

$$\begin{aligned} -\Delta z + 2\gamma zu&= \frac{2u}{\vert \Omega \vert }. \end{aligned}$$

The training setup is the same as for the mean squared value goal functional example. For \(\gamma = 50\) we obtain the results shown in Table 6.

Table 6 Section 5.3: Error estimator results for the nonlinear PDE

Our neural network approach produces results that differ from those of the finite element method (see Fig. 6), but from the effectivity indices and the refined grids we observe that our approach still works well for adaptive mesh refinement.

Fig. 6 Section 5.3: Grid refinement for the nonlinear PDE

6 Conclusions and outlook

In this work, we proposed neural network guided a posteriori error estimation with the dual weighted residual method. Specifically, we computed the adjoint solution with feedforward neural networks with two or three hidden layers. To use existing FEM software, we first solved the adjoint PDE with neural networks and then projected the solution into the FEM space of the adjoint PDE. We demonstrated experimentally that neural network based solutions of the strong formulation of the adjoint PDE yield excellent approximations for dual weighted residual error estimates. Therefore, neural networks might be an alternative way to compute adjoint sensitivities within goal-oriented error estimators for certain problems when the number of degrees of freedom is high. Furthermore, they admit greater flexibility, being a meshless method, and it would be interesting to investigate in future works how different choices of collocation points influence the quality of the error estimates. In the current work, we observed an advantage in computing times (in terms of CPU time) when using more than 400,000 degrees of freedom. Additionally, we could establish the same accuracy and robustness as for the pure FEM approach. However, an important current limitation of our methodology is that we work with the strong formulation of the adjoint PDE, whose derivation from the weak formulation can be very involved for more complex problems, e.g. multiphysics. Hence, if an energy minimization formulation exists, it should be a viable alternative to our strong form of the adjoint PDE; this alternative problem can be solved with neural networks via the “Deep Ritz Method” [14, 50]. Nevertheless, an energy minimization formulation does not exist for all partial differential equations. For this reason, in the future we are going to analyze neural network based methods that work with the variational formulation, e.g. VPINNs [28].