Neural network guided adjoint computations in dual weighted residual error estimation

In this work, we are concerned with neural network guided goal-oriented a posteriori error estimation and adaptivity using the dual weighted residual method. The primal problem is solved using classical Galerkin finite elements. The adjoint problem is solved in strong form with a feedforward neural network using two or three hidden layers. The main objective of our approach is to explore alternatives for solving the adjoint problem with greater potential of a numerical cost reduction. The proposed algorithm is based on the general goal-oriented error estimation theorem including both linear and nonlinear stationary partial differential equations and goal functionals. Our developments are substantiated with some numerical experiments that include comparisons of neural network computed adjoints and classical finite element solutions of the adjoints. In the programming software, the open-source library deal.II is successfully coupled with LibTorch, the PyTorch C++ application programming interface.

Highlights: Adjoint approximation with feedforward neural network in dual-weighted residual error estimation. Side-by-side comparisons for accuracy and computational cost with classical finite element computations. Numerical experiments for linear and nonlinear problems yielding excellent effectivity indices.


Introduction
This work is devoted to an innovative solution of the adjoint equation in goal-oriented error estimation with the dual weighted residual (DWR) method [4,5,3] (based on former adjoint concepts [18]); we also refer to [7,1,21,37] for some important early work. Since then, the DWR method has been applied to numerous applications such as fluid-structure interaction [49,42,19], Maxwell's equations [11], surrogate models in stochastic inversion [33], model adaptivity in multiscale problems [32], and adaptive multiscale predictive modeling [35]. A summary of theoretical advancements in efficiency estimates and multi-goal-oriented error estimation was recently given in [14]. An important ingredient in these studies is the adjoint problem, as it measures the sensitivity of the primal solution with respect to one or several given goal functionals (quantities of interest). The adjoint solution is usually obtained by global higher-order finite element method (FEM) solutions or by local higher-order approximations [5]. In general, the former is more stable, see e.g. [17], but the latter often works sufficiently well in practice. As the adjoint solution is only required to evaluate the a posteriori error estimator, a cheap solution is of interest.
Consequently, in this work, the main objective is to explore alternatives for computing the adjoint.
Due to the universal approximation property [38], neural networks are a prime candidate, as they are already successfully employed for solving partial differential equations (PDEs) [39,46,6,40,28,48,22,45,23]. A related work aiming to improve goal-oriented computations with the help of neural network data-driven finite elements is [9]. Moreover, a recent summary of the key concepts of neural networks and deep learning was compiled in [24]. The advantage of neural networks is their greater flexibility, as they belong to the class of meshless methods. We follow the methodology of [39,46] and solve PDEs by minimizing the residual using an L-BFGS (limited-memory Broyden-Fletcher-Goldfarb-Shanno) method [29]. We address both linear and nonlinear PDEs and goal functionals in stationary settings. However, a shortcoming of the current approach is that we need to work with strong adjoint formulations, which may limit extensions to nonlinear coupled PDEs such as multiphysics problems and coupled variational inequality systems. If such problems can be restated in an energy formulation, neural network algorithms are again available [13,45]. Despite this drawback, namely the necessity of working with strong formulations, the current study provides useful insights into whether neural network guided adjoints can at all be an alternative concept for dual weighted residual error estimation. For this reason, our resulting modified adaptive algorithm and the related numerical simulations are compared side by side, in all numerical tests, to classical Galerkin finite element solutions (see e.g. [10]) of the adjoint. Our proposed algorithm is implemented in the open-source finite element library deal.II [2] coupled with LibTorch, the PyTorch C++ API [36].
The outline of this paper is as follows: In Section 2, we recapitulate the DWR method. Next, in Section 3, we gather the important ingredients of the neural network solution. This section also includes an extension of an approximation theorem from Lebesgue spaces to classical function spaces.
The algorithmic realization is addressed in Section 4. Then, in Section 5, several numerical experiments are conducted. Our findings are summarized in Section 6.

Abstract problem
Let U and V be Banach spaces and let A : U → V* be a nonlinear mapping, where V* denotes the dual space of V. With this, we can define the problem: Find u ∈ U such that

A(u)(v) = 0 ∀v ∈ V. (1)

Additionally, we can look at an approximation of this problem. For subspaces Ũ ⊂ U and Ṽ ⊂ V the problem reads: Find ũ ∈ Ũ such that A(ũ)(ṽ) = 0 ∀ṽ ∈ Ṽ.
Remark 2.1. In the following, the nonlinear mapping A(·)(·) will represent the variational formulation of a stationary partial differential equation with the associated function spaces U and V. We define the finite element approximation of the abstract problem as follows: Find u_h ∈ U_h such that A(u_h)(v_h) = 0 for all v_h ∈ V_h, where U_h ⊂ U and V_h ⊂ V denote the finite element spaces. Here, the operator A is composed of the semi-linear form a(u_h)(·) and the linear form l(·), i.e. A(u_h)(v_h) := a(u_h)(v_h) − l(v_h).

Motivation for adaptivity
In many applications we are not necessarily interested in the whole solution of a given problem but more specifically only in the evaluation of a certain quantity of interest. This quantity of interest can often be represented mathematically by a goal functional J : U → ℝ. Here, the main target is to minimize the error in this given goal functional and to use the computational resources efficiently. This leads to the approach of [4,5], the DWR method, which this work follows closely. We are interested in the evaluation of the goal functional J at the solution u ∈ U of the problem A(u)(v) = 0 for all v ∈ V. Under the assumption that the problem yields a unique solution, the formulation from above can be rewritten into the equivalent optimization problem

min J(u) s.t. A(u)(v) = 0 ∀v ∈ V.

For this constrained optimization problem we can introduce the corresponding Lagrangian

L(u, z) := J(u) − A(u)(z)

with the adjoint variable z ∈ V. A stationary point needs to fulfill the first-order necessary conditions

L'_u(u, z)(δu) = J'(u)(δu) − A'(u)(δu, z) = 0 ∀δu ∈ U,
L'_z(u, z)(δz) = −A(u)(δz) = 0 ∀δz ∈ V,

where J', A' denote the Fréchet derivatives. We see that a defining equation for the adjoint variable arises therein: Find z ∈ V such that

A'(u)(φ, z) = J'(u)(φ) ∀φ ∈ U, (3)

which is known as the adjoint problem. This leads to the error representation for arbitrary approximations, as derived in [41].
Theorem 2.2 (Error representation, [41]). Let u ∈ U be the solution of the primal problem (1). Then, for arbitrary approximations (ũ, z̃) ∈ U × V, the error representation

J(u) − J(ũ) = 1/2 ρ(ũ)(z − z̃) + 1/2 ρ*(ũ, z̃)(u − ũ) + ρ(ũ)(z̃) + R^(3)

holds true, with the primal residual ρ(ũ)(·) := −A(ũ)(·) and the adjoint residual ρ*(ũ, z̃)(·) := J'(ũ)(·) − A'(ũ)(·, z̃). With e = u − ũ and e* = z − z̃, the remainder term reads as follows:

R^(3) := 1/2 ∫_0^1 [ J'''(ũ + s·e)(e, e, e) − A'''(ũ + s·e)(e, e, e, z̃ + s·e*) − 3 A''(ũ + s·e)(e, e, e*) ] s(s − 1) ds.

Proof. The proof can be found in [41].
Remark 2.3. If ũ = u_h ∈ U_h ⊂ U is the exact Galerkin solution of the discrete problem and z̃ = z_h ∈ V_h ⊂ V, then the iteration error ρ(ũ)(z̃) vanishes by Galerkin orthogonality, and we recover the theorems presented in the early work [3].
Therefore, from now on we omit the iteration error. The remainder term is usually of third order [5] and can be omitted as well, for which detailed computational evidence was demonstrated in [16]. In the case of a linear problem, it clearly holds that R^(3) = 0.

Remark 2.4. Theorem 2.2 motivates the error estimator

η := 1/2 ρ(ũ)(z − z̃) + 1/2 ρ*(ũ, z̃)(u − ũ).

This error estimator is exact but not computable. Therefore, the exact solutions u and z are now approximated by higher-order solutions u_h^(2) ∈ U_h^(2) and z_h^(2) ∈ V_h^(2). These higher-order solutions can be realized by a globally refined grid or by using higher-order basis functions. The practical error estimator reads

η^(2) := 1/2 ρ(ũ)(z_h^(2) − z̃) + 1/2 ρ*(ũ, z̃)(u_h^(2) − ũ). (5)

DWR Algorithm
In principle, we need to solve four problems, where especially the computation of the enriched primal solution u_h^(2) is expensive. It is well known that different possibilities exist, such as global higher-order finite element solutions or local interpolations [5,43,7]. Moreover, we only consider the primal part ρ(ũ)(z_h^(2) − z̃) of the error estimator, which is justified for linear problems only and yields a second-order remainder term in nonlinear problems [5, Proposition 2.3]. For many nonlinear problems this version is used, as it reduces to solving only two problems and yields excellent values for mildly nonlinear problems such as incompressible flow [8]. On the other hand, for quasi-linear problems, there is a strong need to work with the adjoint error parts ρ* as well [15,16].
In our work, we employ solutions in enriched spaces. The adjoint solution is computed in the enriched space, and the low-order adjoint approximation is obtained from it via restriction. For nonlinear problems, we approximate the primal solution in the enriched space via interpolation, u_h^{l,(2)} := I^(2) u_h^l ∈ U_h^{l,(2)} ⊃ U_h^l. Therefore, we only solve two problems in practice: the primal problem and the enriched adjoint problem.

Algorithm 1 DWR algorithm for general nonlinear problems
1: Start with some initial guess u_h^0 and set l = 1.
2: Solve the primal problem: Find u_h^l ∈ U_h^l such that A(u_h^l)(v_h) = 0 ∀v_h ∈ V_h^l, using some nonlinear solver.
3: Interpolate the primal solution into the enriched space: u_h^{l,(2)} := I^(2) u_h^l ∈ U_h^{l,(2)}.
4: Solve the adjoint problem in the enriched space: Find z_h^{l,(2)} ∈ V_h^{l,(2)} such that A'(u_h^{l,(2)})(φ_h, z_h^{l,(2)}) = J'(u_h^{l,(2)})(φ_h) ∀φ_h ∈ U_h^{l,(2)}, using some linear solver.
5: Compute the localized error estimator and mark elements for refinement.
6: If the stopping criterion is not fulfilled, refine the marked elements, set l ← l + 1 and go to step 2.
Algorithm terminates with final output J(u_h^l).

Error localization
The error estimator η^(2) must be localized to the corresponding regions of the error contribution. This can be done by the methods proposed in [4,5,3], which use integration by parts in a backwards manner and result in an element-wise localization employing the strong form of the equations. In this work, however, we use the technique of [43], where a partition-of-unity (PU) ∑_i ψ_i ≡ 1 was introduced, with which the error contribution is localized on the nodal level. To realize this partition-of-unity, one can simply choose piece-wise bilinear elements {ψ_i^h | i = 1, ..., N}. Then, the approximated error indicator reads

η^{(2),PU} := ∑_{i=1}^{N} ρ(ũ)((z_h^(2) − z̃) ψ_i^h). (6)

Some recent theoretical work on the effectivity and the efficiency of η^{(2),PU} can be found in [43] and [16], respectively. The main objective of the remainder of this paper is to compute the adjoint solution with a feedforward neural network.

Effectivity index
To evaluate the quality of the error estimator, we introduce the effectivity index

I_eff := η^(2) / (J(u) − J(u_h)).

If J(u) is unknown, we approximate it by J(û), where û is the solution of the PDE on a very fine grid.
We desire that the effectivity index converges to 1, which signifies that our error estimator is a good approximation of the error in the goal functional.

Neural networks
In order to realize the neural network guided DWR method, we consider feedforward neural networks u_NN : ℝ^d → ℝ, where d is the dimension of the domain Ω plus the dimension of u and the dimensions of all the derivatives of u that are required for the adjoint problem. The neural networks can be expressed as

u_NN(x) = T_L(σ(T_{L−1}(σ(⋯ σ(T_1(x)) ⋯)))),

where the T_i(x) := W_i x + b_i are affine transformations with weight matrices W_i ∈ ℝ^{n_i × n_{i−1}} and bias vectors b_i ∈ ℝ^{n_i}. Here, n_i denotes the number of neurons in the i-th layer with n_0 = d and n_L = 1, and σ : ℝ → ℝ is a nonlinear activation function, which is the hyperbolic tangent function throughout this work. Derivatives of neural networks can be computed with backpropagation (see e.g. [44,24]), a special case of reverse-mode automatic differentiation [34].
Similarly, higher-order derivatives can be calculated by applying automatic differentiation recursively.
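To make this concrete, the following minimal LibTorch (PyTorch C++ API) sketch defines such a network with two hidden layers of 32 neurons and tanh activations, and computes first derivatives with respect to the inputs by reverse-mode automatic differentiation. It is an illustration under our own naming (FeedforwardNet, gradient), not an excerpt from the implementation described later; second derivatives are obtained by applying gradient again to a component of the returned tensor.

#include <torch/torch.h>

// Feedforward network R^d -> R with two hidden layers of 32 neurons and tanh activation.
struct FeedforwardNet : torch::nn::Module {
  torch::nn::Linear fc1{nullptr}, fc2{nullptr}, fc3{nullptr};
  explicit FeedforwardNet(int64_t d) {
    fc1 = register_module("fc1", torch::nn::Linear(d, 32));
    fc2 = register_module("fc2", torch::nn::Linear(32, 32));
    fc3 = register_module("fc3", torch::nn::Linear(32, 1));
  }
  torch::Tensor forward(torch::Tensor x) {
    x = torch::tanh(fc1->forward(x));
    x = torch::tanh(fc2->forward(x));
    return fc3->forward(x);
  }
};

// Gradient of the (summed) network output with respect to the inputs x via reverse-mode
// automatic differentiation; create_graph = true keeps the graph so that higher-order
// derivatives can be obtained by calling this function again on a component of the result.
torch::Tensor gradient(const torch::Tensor &u, const torch::Tensor &x) {
  return torch::autograd::grad({u.sum()}, {x}, /*grad_outputs=*/{},
                               /*retain_graph=*/true, /*create_graph=*/true)[0];
}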

Universal function approximators
Cybenko [12] and Hornik [25] proved a first version of the universal approximation theorem, which states that continuous functions can be approximated to arbitrary precision by single hidden layer neural networks. A few years later, Pinkus [38] generalized their findings and showed that single hidden layer neural networks can uniformly approximate a function and its partial derivatives. The space of single hidden layer neural networks is given by

M(σ) := span{ σ(w · x + b) : w ∈ ℝ^d, b ∈ ℝ }.

In particular, for an activation function σ ∈ C^m(ℝ) that is not a polynomial, for any f ∈ C^m(ℝ^d), any compact set K ⊂ ℝ^d, and any ε > 0, there exists g ∈ M(σ) such that

max_{x ∈ K} |D^α f(x) − D^α g(x)| < ε for all multi-indices α with |α| ≤ m.

This theoretical result motivates the application of neural networks for the numerical approximation of partial differential equations.

Residual minimization with neural networks
Residual minimization with neural networks has become popular in the last few years through the works of Raissi, Perdikaris and Karniadakis on physics-informed neural networks (PINNs) [39] and the paper of Sirignano and Spiliopoulos on the "Deep Galerkin Method" [46]. In their approach, one considers the strong formulation of the stationary PDE

N(u, x) = 0 for x ∈ Ω,
B(u, x) = 0 for x ∈ ∂Ω, (7)

where N is a differential operator and B is a boundary operator. An example for the differential operator N is given by the semi-linear form A(u)(v) introduced in Section 2.1. The boundary operator B in the case of Dirichlet conditions is realized in the weak formulation as usual in the function space U.
One then needs to find a neural network u_NN which minimizes the loss function

L(u_NN) := (1/n_Ω) ∑_{i=1}^{n_Ω} |N(u_NN, x_i^Ω)|² + (1/n_∂Ω) ∑_{i=1}^{n_∂Ω} |B(u_NN, x_i^∂Ω)|²,

where x_1^Ω, ..., x_{n_Ω}^Ω ∈ Ω are collocation points inside the domain and x_1^∂Ω, ..., x_{n_∂Ω}^∂Ω ∈ ∂Ω are collocation points on the boundary. In [47] it has been shown that the two components of the loss function need to be weighted appropriately to yield accurate results. Therefore, we use a modified version of this method which circumvents these issues.

Our approach
Let us again consider the abstract PDE problem in its strong formulation (7). For simplicity, we only consider Dirichlet boundary conditions, i.e. B(u, x) := u(x) − g(x). Additionally, in our work we use the approach of Berg and Nyström [6], who used the ansatz

u(x) := d_∂Ω(x) · u_NN(x) + ḡ(x) (8)

to fulfill inhomogeneous Dirichlet boundary conditions exactly. Here, ḡ denotes an extension of the boundary data g to the entire domain Ω which is continuously differentiable up to the order of the differential operator N. Berg and Nyström [6] used the distance to the boundary ∂Ω as their function d_∂Ω. However, it is sufficient to use a function d_∂Ω which is continuously differentiable up to the order of the differential operator N and has the properties

d_∂Ω(x) = 0 for x ∈ ∂Ω and d_∂Ω(x) ≠ 0 for x ∈ Ω.

Thus, d_∂Ω can be interpreted as a level-set function, since its zero level-set coincides with the boundary ∂Ω. Obviously, for this kind of ansatz for the solution of the PDE, it holds that

u(x) = ḡ(x) = g(x) for x ∈ ∂Ω, i.e. B(u, x) = 0 on ∂Ω.

Therefore, in contrast to some previous works, we do not need to account for the boundary conditions in our loss function, which is a big benefit of our approach, since a proper weighting of the different residual contributions in the loss function is not required. It might only be a little cumbersome to fulfill the boundary conditions exactly when dealing with mixed boundary conditions, but the form of the ansatz function for such boundary conditions has been laid out in [31].
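As an illustration (not the authors' code), the ansatz (8) and the resulting interior-only residual loss can be realized with LibTorch as follows for a Poisson-type problem −Δu = j with homogeneous Dirichlet data on the unit square. The level-set choice d_∂Ω(x, y) = x(1 − x)y(1 − y), the zero extension ḡ ≡ 0, and all identifiers are assumptions made for this sketch; FeedforwardNet and gradient are the helpers sketched in the previous section.

// Ansatz (8) for homogeneous Dirichlet data on the unit square: u = d * u_NN with
// d(x,y) = x(1-x)y(1-y) and extension g_bar = 0 (all names are illustrative).
torch::Tensor ansatz(FeedforwardNet &net, const torch::Tensor &x) {
  auto x0 = x.select(/*dim=*/1, /*index=*/0);
  auto x1 = x.select(/*dim=*/1, /*index=*/1);
  auto d  = x0 * (1.0 - x0) * x1 * (1.0 - x1);     // vanishes exactly on the boundary
  return d * net.forward(x).squeeze(1);
}

// Mean squared residual of -Laplace(u) = j at the interior collocation points;
// no boundary term is needed because the ansatz fulfills the boundary conditions exactly.
torch::Tensor pde_loss(FeedforwardNet &net, torch::Tensor points, const torch::Tensor &j) {
  auto x   = points.detach().clone().set_requires_grad(true);
  auto u   = ansatz(net, x);
  auto du  = gradient(u, x);                        // (u_x, u_y) at every collocation point
  auto uxx = gradient(du.select(1, 0), x).select(1, 0);
  auto uyy = gradient(du.select(1, 1), x).select(1, 1);
  auto residual = -(uxx + uyy) - j;                 // strong-form residual
  return residual.pow(2).mean();                    // mean squared residual, cf. (9)
}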

Approximation theorem
In the following, we prove that our neural network solutions approximate the analytical solutions well if their loss is sufficiently small. Our neural networks u_NN have been trained with the mean squared error of the residual of the PDE, i.e.

L(u_NN) := (1/n) ∑_{i=1}^{n} |N(u, x_i)|² with u = d_∂Ω · u_NN + ḡ, (9)

where n is the number of collocation points x_i from the domain Ω. For the sake of generality, let us consider the generalized loss

L̃_p(u) := ‖N(u, ·)‖_{L^p(Ω)}^p = ∫_Ω |N(u, x)|^p dx, 2 ≤ p ≤ ∞.

Then, the loss (9) is just the Monte Carlo approximation of the generalized loss for p = 2 (up to the constant factor |Ω|). We briefly recall the approximation theorem from [47] and show that the classical solution of the Poisson problem satisfies the assumptions of the approximation theorem.

Lemma 3.2 ([47]). Let N be a linear, elliptic operator and f ∈ L²(Ω). Let there be a unique solution û ∈ H¹(Ω) and let the following stability estimate hold:

‖û‖_{H¹(Ω)} ≤ C ‖f‖_{L²(Ω)}.

Then we have for an approximate solution u = d_∂Ω(x) · u_NN(x) + ḡ(x) ∈ H¹(Ω) with L̃_p(u) < δ the error bound

‖û − u‖_{H¹(Ω)} ≤ C |Ω|^{1/2 − 1/p} δ^{1/p}.

Proof. Let u = d_∂Ω(x) · u_NN(x) + ḡ(x) ∈ H¹(Ω) be an approximate solution of the PDE with L̃_p(u) < δ, which means that there exists a perturbation f_err := N(u, ·) of the right-hand side such that u solves the correspondingly perturbed problem exactly. By the stability estimate and the linearity of N, we have

‖û − u‖_{H¹(Ω)} ≤ C ‖f_err‖_{L²(Ω)}.

Applying the Hölder inequality to the norm of f_err and using 2 ≤ p ≤ ∞ yields

‖f_err‖_{L²(Ω)} ≤ |Ω|^{1/2 − 1/p} ‖f_err‖_{L^p(Ω)}.

Combining the last two inequalities gives us the desired error bound

‖û − u‖_{H¹(Ω)} ≤ C |Ω|^{1/2 − 1/p} ‖f_err‖_{L^p(Ω)} = C |Ω|^{1/2 − 1/p} L̃_p(u)^{1/p} < C |Ω|^{1/2 − 1/p} δ^{1/p}.

In the last inequality, we used that the generalized loss of our approximate solution is sufficiently small, i.e. L̃_p(u) < δ.
Let us recapitulate an important result from Schauder theory [20], which yields the existence and uniqueness of classical solutions of the Poisson problem if we assume higher regularity of our problem, i.e. when we work with Hölder continuous functions and sufficiently smooth domains (Lemma 3.3). Theorem 3.4 then transfers the error bound of Lemma 3.2 to this setting.
Proof. From Lemma 3.3 it follows that there exists a unique solution û ∈ C^{2,λ}(Ω̄) ⊂ H¹(Ω). Analogously, it holds that u = d_∂Ω(x) · u_NN(x) + ḡ(x) ∈ H¹(Ω). Furthermore, we have by the Lax-Milgram lemma that û ∈ H¹(Ω) is the unique weak solution and fulfills the stability estimate

‖û‖_{H¹(Ω)} ≤ C ‖f‖_{L²(Ω)}.

By Lemma 3.2, the estimate ‖û − u‖_{H¹(Ω)} ≤ C |Ω|^{1/2 − 1/p} δ^{1/p} then also holds.
Remark 3.5. Theorem 3.4 implies that a low loss value of a neural network corresponds, with high probability, to an accurate approximation u of the exact solution û of the PDE, since the loss is a Monte Carlo approximation of the generalized loss and, for a large number of collocation points, the two should be close in value.

Neural network solution of the adjoint PDE
To obtain a posteriori error estimates for our FEM solution of the primal problem (1), we now use neural networks to solve the adjoint PDE (3). In an FEM approach, the adjoint PDE would be solved in its variational form as described in Algorithm 1, but we minimize the residual of the strong form using neural networks and hence need to derive the strong formulation of the adjoint PDE first. After training, the neural network is then projected into the FEM ansatz function space of the adjoint problem. Finally, the a posteriori error estimates can be made as usual with the DWR method, again following Algorithm 1.
Remark 3.6. For linear goal functionals, the Riesz representation theorem yields the existence and uniqueness of the strong formulation. Nevertheless, deriving the strong form of the adjoint PDE might be very involved for complicated PDEs, such as fluid-structure interaction, e.g. [42,49], and goal functionals J : U → ℝ. In future works, we aim to extend our approach to alternative formulations which do not require the derivation of the strong form.
Remark 3.7. We use neural networks to trade off accuracy for speed. In general, the neural network approach requires fewer collocation points than the finite element method. Therefore, we would expect the neural networks to be faster than the finite element method on finer grids. In our numerical tests, we used the coordinates of the degrees of freedom as our collocation points, but the collocation points could also be sampled randomly, or one could choose the collocation points adaptively as proposed in [30].

Algorithmic realization
In this section, we describe our final algorithm for the neural network guided dual weighted residual method. In the algorithm, we work with hierarchical FEM spaces, i.e. the spaces on consecutive refinement levels are nested, U_h^l ⊂ U_h^{l+1} ⊂ U.
Algorithm 2 Neural network guided DWR algorithm
1: Start with some initial guess u_h^0 and set l = 1.
2: Solve the primal problem: Find u_h^l ∈ U_h^l such that A(u_h^l)(v_h) = 0 ∀v_h ∈ V_h^l, using some nonlinear solver.
3: Interpolate the primal solution into the enriched space: u_h^{l,(2)} := I^(2) u_h^l ∈ U_h^{l,(2)}.
4: Solve the adjoint problem with a neural network: minimize the loss (9) of the strong form of the adjoint problem to obtain z_NN.
5: Project the neural network solution into the enriched FEM space with a projection π : z_NN ↦ z_h^{l,(2)} ∈ V_h^{l,(2)}.
6: Compute the localized error estimator η^{(2),PU} from (6) and mark elements for refinement.
7: If the stopping criterion is not fulfilled, refine the marked elements, set l ← l + 1 and go to step 2.
Algorithm terminates with final output J(u_h^l).
Here, we only consider the Galerkin method, for which the ansatz function space and the test function space coincide, i.e. U = V; however, U ≠ V can be realized in a similar fashion. The novelty compared to the DWR method presented in Section 2 are step 4 and step 5 of the algorithm. In the following, we describe these parts in more detail.
In step 4, we solve the strong form of the adjoint problem, which for nonlinear PDEs or nonlinear goal functionals also depends on the primal solution u_h^{l,(2)}. The strong form of the adjoint problem is of the form (7), and thus we can find a neural network based solution by minimizing the loss (9) with L-BFGS [29], a quasi-Newton method. We observed that with L-BFGS the loss sometimes exploded or the optimizer got stuck at a saddle point. Consequently, we restarted the training loop with a new neural network when the loss exploded, or used a few steps with the Adam optimizer [27] when a saddle point was reached; afterwards, L-BFGS can be used as the optimizer again. During training we used the coordinates of the degrees of freedom as our collocation points. We stopped the training when the loss did not decrease by more than TOL = 10⁻⁸ within the last n = 5 epochs or when we reached the maximum number of epochs, which we chose to be 400. An alternative stopping criterion on fine meshes could be early stopping, where the collocation points are split into a training and a validation set and the training stops when the loss on the validation set starts deviating from the loss on the training set, i.e. when the neural network begins to overfit on the training data.
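A condensed sketch of this training loop (step 4) with LibTorch's L-BFGS optimizer is given below. It builds on the FeedforwardNet and pde_loss sketches from Section 3; collocation_points (the coordinates of the degrees of freedom) and rhs_values (the right-hand side of the strong adjoint problem evaluated there) are assumed inputs, and the restart/Adam fallback is only indicated by a comment.

// Step 4 (sketch): train the adjoint network by minimizing the mean squared residual (9)
// with L-BFGS; stop when the loss stagnates for 5 epochs or after 400 epochs.
void train_adjoint_network(std::shared_ptr<FeedforwardNet> net,
                           const torch::Tensor &collocation_points,
                           const torch::Tensor &rhs_values)
{
  torch::optim::LBFGS optimizer(net->parameters(), torch::optim::LBFGSOptions(/*lr=*/1.0));
  double best_loss = 1e30;
  int epochs_without_decrease = 0;

  for (int epoch = 0; epoch < 400; ++epoch) {        // maximum number of epochs
    auto closure = [&]() -> torch::Tensor {
      optimizer.zero_grad();
      auto loss = pde_loss(*net, collocation_points, rhs_values);
      loss.backward();
      return loss;
    };
    const double loss = optimizer.step(closure).item<double>();

    // If the loss explodes, restart with a freshly initialized network; if the optimizer
    // is stuck in a saddle point, perform a few Adam steps instead (both omitted here).

    if (best_loss - loss > 1e-8) { best_loss = loss; epochs_without_decrease = 0; }
    else if (++epochs_without_decrease >= 5) break;  // no sufficient decrease in 5 epochs
  }
}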
In step 5, we project the neural network based solution into the enriched FEM space by evaluating it at the coordinates of the degrees of freedom, which yields a unique function z_h^{l,(2)} ∈ V_h^{l,(2)}.
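Step 5 can be realized with standard deal.II functionality by wrapping the trained network in a Function object and interpolating it at the support points of the enriched space. The following sketch again uses our own illustrative names, assumes the two-dimensional setting of the numerical experiments, and builds on the LibTorch sketches above; it is not the authors' exact implementation.

#include <deal.II/base/function.h>
#include <deal.II/numerics/vector_tools.h>

// Step 5 (sketch): the trained network, composed with the ansatz (8), is exposed to
// deal.II as a scalar Function<2> so that it can be interpolated into the enriched space.
class NeuralNetworkAdjoint : public dealii::Function<2> {
public:
  explicit NeuralNetworkAdjoint(std::shared_ptr<FeedforwardNet> net) : net_(net) {}

  virtual double value(const dealii::Point<2> &p,
                       const unsigned int /*component*/ = 0) const override {
    torch::NoGradGuard no_grad;                               // plain evaluation, no autograd
    auto x = torch::tensor({{p[0], p[1]}}, torch::kFloat32);  // one evaluation point
    return ansatz(*net_, x).item<double>();
  }

private:
  std::shared_ptr<FeedforwardNet> net_;
};

// Given DoFHandler<2> dof_handler_enriched and Vector<double> z_enriched, the projection
// pi of step 5 then amounts to
//   dealii::VectorTools::interpolate(dof_handler_enriched,
//                                    NeuralNetworkAdjoint(net), z_enriched);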

Numerical experiments
In this section, we consider two stationary problems (with in total four numerical tests) to which we apply our proposed approach. We consider both linear and nonlinear PDEs and goal functionals. The primal problem, i.e. the original PDE, is solved with bilinear shape functions. The adjoint PDE is solved by minimizing the residual of our neural network ansatz (Sections 3 and 4), and we project the solution into the biquadratic finite element space. To study the performance, we also compute the adjoint solution with finite elements employing biquadratic shape functions. Finally, the neural network solution is plugged into the PU DWR error estimator (6), which decides which elements are marked for refinement. To realize the numerical experiments, we couple deal.II [2] with LibTorch, the PyTorch C++ API [36].

Poisson's equation
At first, we consider the two-dimensional Poisson equation with homogeneous Dirichlet conditions on the unit square. In our ansatz (8), we choose a function d_∂Ω that vanishes exactly on ∂Ω; on the unit square, a natural choice is the level-set function d_∂Ω(x, y) = x(1 − x)y(1 − y). Poisson's problem is given by

−Δu = f in Ω, u = 0 on ∂Ω.

For a linear goal functional J : V → ℝ, the adjoint problem then reads: Find z ∈ H_0^1(Ω) such that

(∇φ, ∇z) = J(φ) ∀φ ∈ H_0^1(Ω).

Here, (·, ·) denotes the L² inner product, i.e. (f, g) := ∫_Ω f · g dx.

Mean value goal functional
As a first numerical example of a linear goal functional, we consider the mean value goal functional

J(u) := (1/|Ω|) ∫_Ω u dx.

The adjoint PDE can be written as: Find z ∈ H_0^1(Ω) such that

(∇φ, ∇z) = (1/|Ω|) ∫_Ω φ dx ∀φ ∈ H_0^1(Ω),

and can be transformed into its strong form

−Δz = 1/|Ω| in Ω, z = 0 on ∂Ω.

We trained a fully connected neural network with two hidden layers of 32 neurons each and the hyperbolic tangent activation function for 400 epochs on 1,000 uniformly sampled points. In [39] it has been shown that wider and deeper neural networks can achieve a lower L² error between the neural network and the analytical solution. However, if we use the support points of the FEM mesh as the collocation points, we cannot use bigger neural networks, since we do not have enough training data. Therefore, we decided to use smaller networks.
We compared our neural network based error estimator with a standard finite element based error estimator. In this numerical test, the neural network refined in the same way as the finite element method, and both error estimators yielded effectivity indices I_eff of approximately 1.0, which means that the exact error and the estimated error were almost identical. The error reduction is of second order, as expected, and the overall results confirm well similar computations presented in [43, Table 1].

Regional mean value goal functional
In the second numerical example, we analyze the regional mean value goal functional

J(u) := (1/|D|) ∫_D u dx,

which is only computed on a subset D ⊂ Ω of the domain. We choose D := (0, 1/4) × (0, 1/4). For this regional goal functional, the strong form of the adjoint PDE is given by

−Δz = (1/|D|) 1_D in Ω, z = 0 on ∂Ω,

where 1_D is the indicator function of D. The rest of the training setup is the same as for the previous goal functional. In this example, the finite element method and our approach end up with different grid refinements, but both had a similar performance and both error estimators had an effectivity index I_eff of approximately 1 (see Figure 4 for the grid refinement with the regional mean value goal functional).
On these grids, which have been refined with the two different approaches, we can see that the finite element method creates a symmetric grid refinement. This symmetry cannot be observed in the neural network based refinement. Furthermore, our approach refined a few more elements than the FEM, but overall our methodology still produced a reasonable grid adaptivity.

Mean squared value goal functional
In this third numerical test, we consider an example of a nonlinear goal functional, the mean squared value, which reads

J(u) := (1/|Ω|) ∫_Ω u² dx.

For a nonlinear goal functional, the adjoint problem needs to be modified to (see also (3) in Section 2): Find z ∈ V such that

A'(u)(φ, z) = J'(u)(φ) ∀φ ∈ U.

Computing the Fréchet derivative of the mean squared value goal functional, J'(u)(φ) = (2/|Ω|) ∫_Ω u φ dx, we can rewrite the adjoint problem as: Find z ∈ H_0^1(Ω) such that

(∇φ, ∇z) = (2/|Ω|) ∫_Ω u φ dx ∀φ ∈ H_0^1(Ω),

which can be transformed into its strong form

−Δz = (2/|Ω|) u in Ω, z = 0 on ∂Ω.

Our training setup also changed slightly. The problem statement has become more difficult, and we decided to use slightly bigger networks to compute a sufficiently accurate approximation of the adjoint solution.
We used three hidden layers with 32 neurons each and retrained the neural network on each grid, since the primal solution enters the adjoint PDE.
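Since the right-hand side of this strong adjoint problem contains the primal solution, the interpolated primal solution u_h^{l,(2)} enters the training as data evaluated at the collocation points. A minimal sketch (illustrative names, reusing the pde_loss sketch from Section 3, and with |Ω| = 1 on the unit square):

// Sketch: right-hand side of the strong adjoint problem -Laplace(z) = (2/|Omega|) u for the
// mean squared value goal functional. u_values holds the (interpolated) primal solution
// u_h^{l,(2)} evaluated at the collocation points.
torch::Tensor adjoint_rhs_mean_squared(const torch::Tensor &u_values, double omega_measure = 1.0) {
  return (2.0 / omega_measure) * u_values;   // from J'(u)(phi) = (2/|Omega|) (u, phi)
}

// Training the adjoint network then amounts to minimizing
//   pde_loss(*net, collocation_points, adjoint_rhs_mean_squared(u_values));
// and has to be repeated on every refinement level, since u_values changes with the grid.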

Nonlinear PDE and nonlinear goal functional
In the second numerical problem, we now consider the case where both the PDE and the goal functional are nonlinear. We add the scaled nonlinear term γu² to the previous equation, such that the new problem is given by

−Δu + γu² = f in Ω, u = 0 on ∂Ω,

with γ > 0. For our nonlinear goal functional, we choose the mean squared value goal functional from the previous example. The adjoint problem thus reads: Find z ∈ H_0^1(Ω) such that

(∇φ, ∇z) + 2γ (u φ, z) = (2/|Ω|) (u, φ) ∀φ ∈ H_0^1(Ω).

Our neural network approach produces different results than the finite element method, but from the effectivity indices and the refined grids we observe that our approach still works well for adaptive mesh refinement.

Conclusions and outlook
In this work, we proposed neural network guided a posteriori error estimation with the dual weighted residual method. Specifically, we computed the adjoint solution with feedforward neural networks with two or three hidden layers. To use existing FEM software, we first solved the adjoint PDE with neural networks and then projected the solution into the FEM space of the adjoint PDE. We demonstrated experimentally that neural network based solutions of the strong formulation of the adjoint PDE yield excellent approximations for dual weighted residual error estimates. Therefore, neural networks might be an effective way to compute adjoint sensitivities within goal-oriented error estimators for certain problems when the number of degrees of freedom is high. Furthermore, they admit greater flexibility, being a meshless method, and it would be interesting to investigate in future work how different choices of collocation points influence the quality of the error estimates. A sophisticated choice of collocation points could lead to a significant speedup over the finite element method for a high number of degrees of freedom. However, an important current limitation of our methodology is that we work with the strong formulation of the PDE, whose derivation from the weak formulation can be very involved for more complex problems, e.g. multiphysics. Hence, if an energy minimization formulation exists, it should be a viable alternative to our strong form of the adjoint PDE. Such an alternative problem can be solved with neural networks via the "Deep Ritz Method" [13,45]. Nevertheless, the energy minimization formulation does not exist for all partial differential equations. For this reason, in the future we are going to analyze neural network based methods which work with the variational formulation, e.g. VPINNs [26].

Figure 1: Section 3.3: Diagram of our ansatz u = d_∂Ω · u_NN + ḡ for the two dimensional Poisson problem. Here we used the abbreviations u_i := u(x_i, y_i) and f_i := f(x_i, y_i).

Figure 2: Section 5.1.1: Error estimator results for the mean value goal functional.

Figure 4: Section 5.1: Grid refinement with regional mean value goal functional.

Figure 6: Section 5.1: Grid refinement with mean squared value goal functional.
Table (Section 5.2): Error estimator results for the nonlinear PDE with γ = 50, where the training setup is the same as for the previous goal functional; columns: DoFs, J(u) − J(u_h), estimated error, and I_eff, for both the FEM based and the neural network based adjoint.