ReLU Neural Network Galerkin BEM

We introduce Neural Network (NN for short) approximation architectures for the numerical solution of Boundary Integral Equations (BIEs for short). We exemplify the proposed NN approach for the boundary reduction of the potential problem in two spatial dimensions. We adopt a Galerkin-formulation-based method in polygonal domains with a finite number of straight sides. Trial spaces used in the Galerkin discretization of the BIEs are built using NNs that, in turn, employ the so-called Rectified Linear Unit (ReLU) as the underlying activation function. The ReLU-NNs used to approximate the solutions to the BIEs depend nonlinearly on the parameters characterizing the NNs themselves. Consequently, computing a numerical solution to a BIE by means of ReLU-NNs boils down to fine-tuning these parameters during network training. We argue that ReLU-NNs of fixed depth and variable width allow us to recover well-known approximation rate results for the standard Galerkin Boundary Element Method (BEM). This observation hinges on well-known results concerning the regularity of the solution of the BIEs on Lipschitz, polygonal boundaries, i.e., accounting for the effect of corner singularities, and on the expressive power of ReLU-NNs over different classes of functions. We prove that shallow ReLU-NNs, i.e., networks having a fixed, moderate depth but increasing width, can achieve optimal-order algebraic convergence rates. We propose novel loss functions for NN training which are obtained using computable, local residual a posteriori error estimators with ReLU-NNs for the numerical approximation of BIEs. We find that weighted residual estimators, which are reliable without further assumptions on the quasi-uniformity of the underlying mesh, can be employed for the construction of computationally efficient loss functions for ReLU-NN training.
The proposed framework allows us to leverage state-of-the-art computational deep learning technologies such as TENSORFLOW and TPUs for the numerical solution of BIEs using ReLU-NNs. Exploratory numerical experiments validate our theoretical findings and indicate the viability of the proposed ReLU-NN Galerkin BEM approach.


Introduction
Following fundamental mathematical developments in the 1990s that established density (resp. "universal approximation capacity") of shallow NNs (see, e.g., the survey [28] and the references therein), computations based on Deep Neural Networks (DNNs for short) have seen, during the past five to ten years, increasing development driven by the advent of data science and deep learning techniques. In these developments, in particular, the quantitative advantages in expressive power furnished by DNNs have moved into the focus of interest.
Mounting computational and theoretical evidence points to significant advantages of the expressive power of DNNs when furnished by possibly large NN depth. We mention only [29,36] for the use of DNNs in the numerical solution of PDEs, and the recent works [27,32] that indicate a scope for the efficacy of DNNs in the approximation of maps between high-dimensional spaces.
In the present work, we propose the use of NNs for the approximation of variational BIEs, in particular of the first kind. We focus on NNs with ReLU activation, of fixed depth. As we shall show, mathematically and computationally, these ReLU-NNs allow for optimal convergence rates of the Galerkin BEM in polygons. We also comment in passing on advantages afforded by NNs with large depth. Here, approximation theory indicates that exponential convergence is possible, in principle.

Contributions
We consider the Laplace equation with appropriate boundary conditions on R 2 \ Γ, where Γ corresponds to a one-dimensional open arc, and on a polygonal Lipschitz domain D, and consider their boundary reduction by means of Boundary Integral Operators (BIOs for short) as described, for example, in [31]. We prove that by using shallow ReLU-NNs as trial spaces in the Galerkin discretization of the resulting BIEs one can recover well-known algebraic convergence rate bounds of the BEM, which are in turn traditionally obtained by means of other methods, such as graded meshes or the adaptive BEM (ABEM) [9,15]. Even though not thoroughly studied in the present work, we remark that by using deep ReLU-NNs (and working under analytic regularity hypotheses on the data) one can recover the exponential convergence rates usually obtained using the so-called hp Galerkin BEM. This can be obtained by combining recent ReLU-NN emulation results from [27] with analytic regularity results, in weighted function spaces, for the solution to BIEs in polygonal Lipschitz domains with a finite number of straight sides [3]. However, we hasten to add that the main goal of the present work is to study the approximation of the solution to BIEs by means of shallow ReLU-NNs.
The insight behind the results presented herein is the interpretation that shallow ReLU-NNs realize approximations in linear subspaces, whose basis elements are realized by the so-called "hidden layers" of the ReLU-NN and are, therefore, subject to optimization during ReLU-NN training. In the present paper, we partially leverage this flexibility and the ability of the hidden layers of ReLU-NNs: (a) to express low-order boundary element spaces with "free knots" (in the terminology of spline approximation), i.e., to adapt the partitions of the boundary to the structure of the unknown solution (which is reminiscent of adaptive meshing strategies in Galerkin BEM), and (b) to leverage adaptive mesh refinement methods in BEM as a rational procedure to "enlarge" the ReLU-NN through the insertion of nodes.
We propose two different algorithms to computationally perform the construction of the ReLU-NNs (thus, two different paths to train the network). The first one is based on the observation that the solution of symmetric, coercive problems, such as the ones arising from the boundary reduction of the Laplace problem, can be characterized as the minimizer of a suitable energy functional.
The second algorithm makes use of well-known a posteriori error estimates, which are commonly used in the ABEM. The definition of an efficiently computable loss function which is based on computable, reliable a-posteriori error estimators for Galerkin discretizations differs from other widely established methods, and is not limited to Galerkin BEM. Numerical experiments show the computational feasibility of these algorithms and a detailed convergence analysis shows that shallow ReLU-NN Galerkin BEM can attain the optimal algebraic convergence rates. NN training can, in particular, compensate for reduced convergence rates due to, e.g., corner singularities of the physical domain.
Outline. This work is structured as follows. In Sect. 2 we briefly review the boundary reduction of potential problems in two space dimensions. We also present a recapitulation of the direct method of boundary reduction and the strong ellipticity of boundary integral operators of the first kind in fractional-order Sobolev spaces. Additionally, we recall results concerning the regularity of the solution to BIEs both in Lipschitz polygons and on open arcs, which will be used later to obtain convergence rates for the ReLU-NN approximation of BIEs.
We begin Sect. 3 by introducing rigorous definitions of DNNs and describing, in detail, the connection between P 1 -spline boundary element spaces and ReLU-NNs. A key observation in our analysis is that low-order BEM spaces with adaptive mesh refinement are shown to admit a representation through ReLU-NN architectures. This property, together with the regularity results for the solutions to BIEs, allows us to establish convergence results with optimal rates in the approximation of the described BIEs by means of ReLU-NNs. These theoretical results provide a benchmark to assess the performance of the different training algorithms proposed in Sect. 4. In Sect. 4, we propose two concrete training algorithms for the construction of ReLU-NNs approximating the solution of the analyzed BIEs. The first one leverages the coercivity properties of the BIEs for the potential problem and the interpretation of its solution as the minimizer of a suitable energy functional. By combining the optimization of the hidden parameters of the ReLU-NN with the computation of the output layer through the solution of the corresponding Galerkin linear system, the proposed algorithm attains the (theoretically proven) optimal convergence rates for different numerical test cases. The second algorithm proposed in Sect. 4 hinges on a posteriori error estimates which, in turn, provide a computable upper bound for the error between the exact solution to the BIE and the ReLU-NN. Consequently, we can use this property to guide the training of the ReLU-NNs at each step of the iterative process. We mention that the proposed numerical methodology does not rely on automatic differentiation of the loss function with respect to the DNN parameters.

Sobolev Spaces in Bounded Domains
Let D ⊂ R 2 be a bounded, connected Lipschitz domain with boundary ∂D. For s ∈ R, we denote by H s (D) the standard Sobolev space of order s on D, equipped with the norm ‖·‖ H s (D) . As is customary, we identify H 0 (D) with L 2 (D). Sobolev spaces on the boundary Γ := ∂D are denoted by H s (Γ), where the range of s ∈ R is restricted in accordance with the regularity of the domain D. The duality pairing between H s (Γ) and H −s (Γ) is denoted by ⟨ψ, φ⟩ for ψ ∈ H −s (Γ) and φ ∈ H s (Γ).

Sobolev Spaces on Open Arcs
where, for a general Banach space X , X ′ denotes its dual space. The duality pairing between H s (Γ) and H −s (Γ) is denoted by ⟨φ, ψ⟩ for ψ ∈ H −s (Γ) and φ ∈ H s (Γ).

Boundary Integral Operators in Lipschitz Domains
Again, let D ⊂ R 2 be a bounded Lipschitz domain with boundary Γ := ∂D. In the following, we introduce the main concepts concerning BIOs and BIEs to be used throughout this manuscript. As we will only discuss boundary value problems in R 2 , we limit our presentation of the aforementioned tools to the two-dimensional case. Let G(x, y) be the fundamental solution of the Laplacian in R 2 (cf. [34, Chapter 5] or [31, Section 3.1]), given by G(x, y) = −(1/(2π)) log |x − y|, for x ≠ y. We define the single and double layer potentials in the usual manner; see (2.1). Moreover, the single layer and hypersingular BIOs are coercive.
Remark 1 Let us define the stabilized bilinear form ǎ(φ, ψ) := ⟨Wφ, ψ⟩ + ⟨φ, 1⟩⟨ψ, 1⟩ for φ, ψ ∈ H 1/2 (Γ). One can also prove (see, e.g., [37, Section 2] and the references therein) that there exists a constant α > 0 such that ǎ(φ, φ) ≥ α ‖φ‖ 2 H 1/2 (Γ) for all φ ∈ H 1/2 (Γ).

Direct Boundary Integral Formulation of the Interior Dirichlet BVP
We consider the interior Laplace problem equipped with Dirichlet boundary conditions in a bounded Lipschitz domain D ⊂ R 2 with boundary Γ := ∂D. As is customary, one may recast Problem 2.4 as an equivalent BIE using the BIOs introduced in (2.2). The starting point is the so-called integral representation formula: we express the weak solution u ∈ H 1 (D) to Problem 2.4 using the layer potentials introduced in (2.1); see (2.5). By applying the Dirichlet trace operator γ 0 : H 1 (D) → H 1/2 (Γ) to (2.5) and using the boundary condition on Γ stated in Problem 2.4, one obtains the following BIE for the unknown datum γ 1 u ∈ H −1/2 (Γ).
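For the reader's convenience, the two displays referenced above can be reconstructed in their standard form (cf. [31,34]); the symbols for the layer potentials are our assumption, and signs and constants should be checked against the cited references:

```latex
% Representation formula (2.5) and the resulting first-kind BIE:
% u is recovered from its Cauchy data via the layer potentials, and
% taking the Dirichlet trace yields Symm's integral equation for the
% unknown Neumann datum \gamma_1 u.
u \;=\; \widetilde{\mathcal S}(\gamma_1 u) \;-\; \widetilde{\mathcal D}(\gamma_0 u)
  \quad \text{in } D,
\qquad
\mathrm V (\gamma_1 u) \;=\; \Big(\tfrac12\,\mathrm{Id} + \mathrm K\Big)\gamma_0 u
  \quad \text{on } \Gamma .
```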
The well-posedness of the BIE in Problem 2.5 follows from the mapping properties of the BIOs in Proposition 2.1, the ellipticity of the single layer BIO V : H −1/2 (Γ) → H 1/2 (Γ) in Proposition 2.2, and the Lax–Milgram lemma.

Direct Boundary Integral Formulation of the Interior Neumann BVP
We consider the interior Laplace problem equipped with nonhomogeneous Neumann boundary conditions in a bounded, simply connected Lipschitz domain D ⊂ R 2 with boundary Γ := ∂D, thus rendering the boundary Γ connected itself.
It is well-established that there exists a unique solution u ∈ H 1 (D)/R to Problem 2.6 provided that one further imposes g ∈ H −1/2 (Γ)/R (see, e.g., [34, Theorem 4.9]). We reduce the boundary value problem stated in Problem 2.6 to an equivalent BIE using the BIOs introduced in (2.2) and the integral representation formula (2.5). The application of the Neumann trace operator γ 1 to (2.5), together with the boundary condition stated in Problem 2.6, yields the following boundary integral formulation of Problem 2.6. Problem 2.7 (Boundary Integral Formulation of Problem 2.6) Let g ∈ H −1/2 (Γ)/R be given. We seek φ := γ 0 u ∈ H 1/2 (Γ)/R such that Wφ = (1/2 Id − K)g on Γ. The well-posedness of Problem 2.7 follows from the mapping properties of the BIOs stated in Proposition 2.1, from Proposition 2.2, from the fact that K preserves the zero integral mean value of densities defined over Γ, and from the Lax–Milgram lemma, as is thoroughly explained in [34, Section 7.2].

Regularity of the Solution to the BIEs in Lipschitz Polygons
We recapitulate results concerning the regularity of solutions to the BIEs stated in Problems 2.5 and 2.7. In the following, we assume that D ⊂ R 2 is a bounded Lipschitz polygon with boundary Γ := ∂D characterized by a finite number J ≥ 3 of vertices {x j } J j=1 ⊂ R 2 . We enumerate cyclically mod J , i.e., x J +1 = x 1 . We denote Γ j := conv(x j , x j+1 ), for j = 1, . . . , J (with the convention Γ 0 = Γ J ), and let ω j ∈ (0, 2π) be the internal angle at x j , i.e., that of the wedge formed by the edges Γ j−1 and Γ j , j = 1, . . . , J .
Let {χ j } J j=1 be a partition of unity on Γ, where each function χ j , j = 1, . . . , J , is constructed by considering the restriction to the boundary Γ of a function in C 2 0 (R 2 ) such that χ j = 1 in a neighborhood of x j and supp{χ j } ⊂ Γ j−1 ∪ {x j } ∪ Γ j . Let ϕ : Γ → R be a function defined on the boundary Γ. Using the previously described partition of unity, we may decompose ϕ accordingly, where by (ϕ − , ϕ + ) we denote the restriction of ϕ to Γ j−1 ∪ {x j } ∪ Γ j with ϕ − := ϕ| Γ j−1 and ϕ + := ϕ| Γ j , for j = 1, . . . , J .
The following result (from [16]) describes the regularity of the solution to Problems 2.5 and 2.7.

Boundary Integral Operators on Open Arcs
We proceed to extend the definitions and results introduced in Sect.

Regularity of the Solution to the BIEs on Open Arcs
For the ensuing analysis of lower-order BEM, in particular on open arcs, we shall invoke a decomposition of solutions into regular and singular parts, from [16,38]. For i = 1, 2, let ρ i (x) denote the Euclidean distance between x ∈ Γ and the endpoint x i ∈ R 2 of Γ, and let ξ i be a C ∞ cut-off function on Γ with 0 ≤ ξ i ≤ 1, ξ i = 1 near x i and ξ i = 0 at the opposite endpoint. For α 1 , α 2 ∈ R and ψ 0 ∈ H s (Γ) for any s < 2, we set the decomposition (2.8) of ψ into the regular part ψ 0 and the endpoint singular parts weighted by α 1 and α 2 . The associated decomposition map is bijective and continuous, and there exists C > 0 such that the corresponding norm equivalence holds, where ψ is as in (2.8).

Galerkin Boundary Element Discretization
We proceed to detail the Galerkin discretization of Problems 2.5, 2.7, 2.11 and 2.12. We remark that other forms of numerical discretizations, such as collocation or Petrov-Galerkin formulations, may be considered as well.
In what follows, we assume that D ⊂ R 2 is a bounded, Lipschitz polygon with boundary Γ := ∂D. As is customary in the h-version Galerkin BEM, we decompose the boundary Γ into N ∈ N straight, disjoint line segments, from now onwards referred to as elements. The vertices of the polygon must match the endpoints of some of the elements, and the mesh T N formed by the N elements covers Γ itself. Equipped with these definitions, we introduce the usual boundary element spaces S p (Γ, T N ) of piecewise polynomial functions on the mesh T N of Γ. In the above, for each element K of T N , P p (K ) denotes the space of polynomials of degree p ∈ N ∪ {0} on K .

Remark 2
The continuity and ellipticity of V and of the bilinear form ǎ(·, ·) (recall its definition in Remark 1) in H −1/2 (Γ) and H 1/2 (Γ), respectively, ensure the existence and uniqueness of solutions to Problems 2.14 and 2.15.

Remark 3 (Weakly Singular Operator V)
For Γ := ∂D a closed curve of arclength L = |Γ|, let r : [0, L) → Γ be its arclength parametrization, and consider the map I given for φ ∈ C 0 (Γ) by the arclength antiderivative (I φ)(x) := ∫ 0 s(x) φ(r(t)) dt for all x ∈ Γ, where s(x) ∈ [0, L) denotes the arclength coordinate of x. One verifies that the map I may be extended to suitable fractional-order Sobolev spaces on Γ. We continue with the Galerkin discretization of Problems 2.11 and 2.12. Recall Γ to be a Jordan arc and let T N be a mesh on Γ consisting, as before, of N straight, disjoint line segments. We introduce the corresponding spaces of piecewise polynomials S 0 (Γ, T N ) and S 1 (Γ, T N ). Problem 2.16 (Boundary element discretization of Problem 2.11) Let T N be a mesh over the Jordan arc Γ and let f ∈ H 1/2 (Γ) be given. We seek ψ N ∈ S 0 (Γ, T N ) satisfying the Galerkin equations associated to Problem 2.11. Problem 2.17 (Boundary element discretization of Problem 2.12) Let T N be a mesh over the Jordan arc Γ and let g ∈ H −1/2 (Γ) be given. We seek ϕ N ∈ S 1 (Γ, T N ) satisfying the Galerkin equations associated to Problem 2.12.

ReLU Neural Networks
We introduce a key ingredient of the present paper, namely the so-called DNNs. Even though in this work we use only the ReLU activation function, other (continuous) activation functions, such as sigmoidal or tanh activations, could be considered in the analysis and numerical construction of DNNs. However, the approximation results presented herein already hold with the simpler, and computationally more efficient, ReLU activation function.
i.e., the subnetwork comprising all "responses from the hidden layers" of a DNN in N N L,M,N 0 ,N L . Moreover, we denote by N N hid the space of all the responses from the hidden layers of such DNNs.
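As a concrete illustration of the preceding definitions, the realization of a depth-2 ReLU-NN is a single hidden layer followed by an affine output layer. The sketch below is ours (the weight and bias names are illustrative, not the paper's notation):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def shallow_relu_nn(t, A1, b1, A2, b2):
    """Realization of a depth-2 (one hidden layer) ReLU-NN:
    t -> A2 @ relu(A1 @ t + b1) + b2."""
    t = np.atleast_2d(t)            # shape (n_inputs, n_points)
    hidden = relu(A1 @ t + b1)      # responses of the hidden layer
    return A2 @ hidden + b2

# A width-3 network with one input and one output; weights illustrative.
A1 = np.array([[1.0], [1.0], [1.0]])
b1 = np.array([[0.0], [-0.5], [-1.0]])
A2 = np.array([[1.0, -2.0, 1.0]])
b2 = np.array([[0.0]])

t = np.linspace(-1.0, 1.0, 5)
y = shallow_relu_nn(t, A1, b1, A2, b2)
```

With these particular weights the network realizes a continuous, piecewise linear "hat" profile with breakpoints at t = 0, 1/2, 1, anticipating the connection to P 1 splines made in the next subsection.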

Structure of Galerkin Approximation Spaces generated by ReLU-NNs
It is an immediate consequence of Definition 3.1 that functions generated by DNNs (understood in the sense that L ≥ 2, as opposed to shallow NNs, where L = 1) depend in a nonlinear fashion on the DNN parameters characterizing the hidden layers, i.e., the weights A ℓ ∈ R N ℓ ×N ℓ−1 and biases b ℓ ∈ R N ℓ . However, by setting the bias in the output layer of the DNN in Definition 3.1 to zero, i.e., b L = 0, we have that DNN functions belong to the linear span of the space of functions generated by the "hidden layers" of the corresponding DNN. We see from this proposition that DNNs: (i) span particular linear subspaces of dimension N L−1 and (ii) span subspaces with basis elements that, in turn, can be chosen in a problem-adapted fashion by adjusting the parameters in the hidden layers of the DNN.
It is clear from this (trivial) observation that shallow ReLU-NNs, which we will consider exclusively in the remainder of this paper, can exactly reproduce spaces of continuous, piecewise linear functions on Γ with a DNN of depth L = 2. Mesh refinement on Γ can be accounted for by adjusting the widths of the hidden layers during training.
As was shown in [27, Sections 4 and 5], for L > 2, ReLU-DNNs can represent hp-boundary element spaces on geometric partitions, which are known to afford exponential convergence rates for piecewise analytic data. We hasten to add, however, that Proposition 3.2 has wider implications: training networks in N N hid to emulate reduced bases (by a greedy search, for example) will imply that the corresponding Galerkin BEM, with subspaces spanned by the responses of the hidden layers, will deliver performance corresponding to reduced-basis BEM. This could be employed to accommodate, for example, Galerkin BEM with basis sets that feature additional properties tailored to particular problem classes.

P 1 -Spline Boundary Element Spaces as ReLU-NNs
Recall that N ∈ N corresponds to the number of elements in the meshes T N on the closed curve Γ and on the open arc, respectively, and let p ∈ N denote the polynomial degree of the boundary element spaces S p (Γ, T N ) of piecewise polynomials of degree p. Moreover, define I := (−1, 1) and let r : I → R 2 be a Lipschitz continuous and piecewise affine parametrization of the closed curve Γ (where Γ is Lipschitz) satisfying r(−1) = r(1) and r ′(t) ≠ 0 for almost every t ∈ I. Define, for φ ∈ C 0 (Γ), the so-called pullback operator as (τ r φ)(t) := φ(r(t)), t ∈ I. Proof Recall the setting described in Sect. 2.7: given N ∈ N we consider a mesh T N of Γ consisting of N + 1 points {x n } N n=0 ⊂ Γ with x 0 = x N (i.e., N distinct points). The set of nodes {x n } N n=0 ⊂ Γ may be uniquely identified with {t j } N j=0 ⊂ I through r(t j ) = x j , for j = 0, . . . , N , and we assume that −1 = t 0 < t 1 < · · · < t N −1 < t N = 1. Set K j := (t j , t j+1 ) for j = 0, . . . , N − 1, and define the so-called "hat functions" ζ j , j = 0, . . . , N , as the continuous functions that are affine on each K j and satisfy ζ j (t i ) = δ i j . Each ζ j can be represented (non-uniquely) using ReLU-NNs: for an interior node t j one may write, with h j−1 := t j − t j−1 and h j := t j+1 − t j , ζ j (t) = (1/h j−1 ) ReLU(t − t j−1 ) − (1/h j−1 + 1/h j ) ReLU(t − t j ) + (1/h j ) ReLU(t − t j+1 ). Finally, we construct the output layer by using (3.2). Let us define C := (c 0 (Φ N ), . . . , c N (Φ N )) ∈ R 1×(N +1) . Then Φ N coincides with the linear combination of the hat functions with weights C. Observe that Φ N ∈ N N 2,N +1,1,1 and that, due to the construction (3.4), the weights of Φ N are bounded in absolute value by an explicit constant of the form max{1, ·}. Observe that the right-hand side of (3.6) defines an element of N N 2,N +1,1,1 , thus concluding the proof.
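The hat-function construction used in the proof admits a direct numerical check. The following sketch (our own naming; a one-hidden-layer ReLU combination with three units per hat) verifies the nodal property on an arbitrary partition:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def hat(t, tl, tm, tr):
    """P1 'hat' function supported on [tl, tr] with peak 1 at tm,
    written as a width-3, one-hidden-layer ReLU combination."""
    hl, hr = tm - tl, tr - tm
    return (relu(t - tl) / hl
            - relu(t - tm) * (1.0 / hl + 1.0 / hr)
            + relu(t - tr) / hr)

# Evaluate one hat function on a non-uniform configuration of nodes.
t = np.linspace(-1.0, 1.0, 201)
z = hat(t, -0.5, 0.1, 0.8)
```

Note that the three ReLU slopes telescope: the function vanishes identically outside the support [tl, tr], so summing such hats with nodal coefficients reproduces any continuous, piecewise linear function on the partition.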

Remark 4
As pointed out in the proof of Proposition 3.3, the ReLU-NN representation of the "hat functions" ζ j spanning S 1 (I, T N ) is not unique. One may also write an alternative representation for j = 0, . . . , N and t ∈ I. Then, there exists a neural network Φ j ∈ N N 4,2,1,1 such that Φ j (t) = ζ j (t), for all t ∈ I and j = 1, . . . , N . This representation leads to ReLU-NNs of width 2 and depth 3.

Approximation Properties of ReLU-NNs: h-Galerkin BEM
Based on the result stated in Proposition 3.3 (concerning the exact emulation of the standard P 1 -BEM spaces), one may conclude that existing results on the convergence rates of Galerkin BEM transfer directly to the ReLU Galerkin BEM framework. In this section, we provide a precise result establishing this connection. First, we recapitulate known approximation results for singular functions on graded partitions.
Proof A proof of this result is provided, for the convenience of the reader, in Appendix A.
We observe that Proposition 3.3 allows for arbitrary locations of the nodes characterizing the mesh T N on Γ, while ensuring exact ReLU-NN emulation of the spaces S 1 (Γ, T N ). The preceding result, therefore, also implies approximation rate bounds for ReLU-NNs, which we describe now. Then, for every ε > 0 there exists C(ε) > 0 (depending on ε > 0) such that for each N ∈ N there exists ϕ N ∈ N N 2,N ,1,1 satisfying the error bound (3.8). Proof (i) Let r : I → Γ be a Lipschitz continuous and piecewise linear parametrization of Γ. The boundary Γ being that of a bounded Lipschitz polygon with straight sides and J ≥ 3 vertices, there exist points −1 = t 0 < t 1 < · · · < t J = 1 satisfying r(t j ) = x j , for j = 0, . . . , J , where we set x 0 = x J . Throughout this proof, we use I j := [t j , t j+1 ] ⊂ I. Additionally, for j = 0, . . . , J − 1, we define the extension-by-zero and restriction operators E j : C 0 (I j ) → C 0 (I) and R j : C 0 (I) → C 0 (I j ), acting as E j (v)(t) = v(t) for a.e. t ∈ I j and E j (v)(t) = 0 otherwise, and R j (v)(t) = v(t) for a.e. t ∈ I j . Equipped with this, and according to Proposition 2.8 item (ii), we have the local decompositions (3.10) and (3.11), where χ j,1 , χ j,2 : I j → R are infinitely differentiable functions which, furthermore, are identically equal to 1 for |t − t j | < L j /4 and |t − t j+1 | < L j /4 (with L j := |t j+1 − t j |), respectively. In addition, χ j,1 and χ j,2 vanish for |t − t j | > L j /2 and |t − t j+1 | > L j /2, respectively. In (3.10) and (3.11), for each k ∈ N and j = 1, . . . , J , the singular exponents and the coefficients β j,k , β j+1,k ∈ R are as stated in Proposition 2.8. Observe that in (3.10) and (3.11) we have singularities arising at t = t j and t = t j+1 . The strength of these singularities is dictated by the inner angles of the polygon at the corresponding vertices.
It follows from Proposition 3.4, with s = 1/2, that the singular parts can be approximated at the optimal rate on suitably graded partitions and that, in addition, there exists φ̃ j,N ∈ S 1 (I j , T N ,β j ) approximating the regular part accordingly. Set φ j,N := η j,1,N + η j,2,N + φ̃ j,N ∈ S 1 (I, T N ,β j ); then the bound (3.12) holds for some positive constant C independent of N . At this point, we make the following observation: following Proposition 3.3, on each side of the polygon we have constructed a ReLU-NN belonging to N N 3,N +2,1,1 that approximates R j (τ r φ) according to (3.12). In addition, Proposition 3.4 implies that these ReLU-NNs interpolate exactly the value of the solution to Problem 2.7 at the vertices of the polygon. Hence, by defining φ N := φ j,N in I j , for j = 1, . . . , J , namely by setting φ N equal to the previously constructed ReLU-NN on each side of the polygon, we obtain a ReLU-NN satisfying (3.8). This ReLU-NN, in particular, belongs to N N 2,J (N +1),1,1 , thus concluding the proof of this statement.
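The role of graded partitions in resolving corner singularities can be illustrated numerically. The sketch below is our own (the singularity exponent and grading parameter are illustrative choices, not the paper's test cases): it compares P 1 interpolation of a corner-type singularity t^α on a uniform mesh and on a β-graded mesh with nodes clustered at the singular point.

```python
import numpy as np

alpha, beta, N = 0.6, 3.0, 64
u = lambda t: t ** alpha          # model corner singularity at t = 0

def interp_error(nodes, n_quad=20000):
    """Max-norm error of the P1 interpolant of u on the given nodes."""
    t = np.linspace(0.0, 1.0, n_quad)
    uh = np.interp(t, nodes, u(nodes))   # piecewise linear interpolant
    return np.max(np.abs(u(t) - uh))

uniform = np.linspace(0.0, 1.0, N + 1)
graded = (np.arange(N + 1) / N) ** beta  # nodes clustered at t = 0
err_u = interp_error(uniform)
err_g = interp_error(graded)
```

With the same number of degrees of freedom, the graded mesh reduces the error by roughly two orders of magnitude here, mirroring the recovery of optimal algebraic rates despite the singularity.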

ReLU Neural Network Galerkin BEM
In this section, we propose two algorithms to construct the ReLU-NNs described in Sect. 3 for the approximation of the solution to BIEs in polygons and arcs, introduced in Sects. 2.3 and 2.5, respectively. Following the representation of the lower-order boundary element spaces as ReLU-NNs elaborated in Sect. 3, we focus only on shallow NNs. The first method, described in Sect. 4.1, aims to construct a ReLU-NN by considering as loss function the total energy of the problem. Indeed, it is well-known that the solution to operator equations involving "elliptic" operators in Hilbert spaces may be characterized through minimization problems. We rely on this observation to mathematically justify this algorithm for the construction of the corresponding ReLU-NN. This approach has been previously described, for example, in [19, Section 7.2] in the context of one-dimensional FEM for the Poisson problem with non-homogeneous Dirichlet boundary conditions, and is in fact the key ingredient of the so-called Deep Ritz Method, described in [13,14,22,24]. The second method proposed in this work consists of the construction of a computable loss function based on a posteriori error estimators for BIEs of the first kind. These tools come in different flavors, and we refer to [15] for an extensive review, including their application to the adaptive BEM. In particular, here we make use of the so-called weighted residual estimators [7–9], which have been proven to be reliable, thus providing a computable upper bound for the error of the Galerkin BEM.
To date, these are the only ones shown to deliver optimal convergence rates when used in the Adaptive BEM algorithm (ABEM for short). Finally, in Sect. 4.3, we describe precise algorithms for the computations of the ReLU-NNs for the approximation of the solutions to the hypersingular BIEs introduced in Problems 2.7 and 2.12.

Energy Minimization
Throughout this section, let X be a real Hilbert space equipped with the inner product (·, ·) X and the induced norm ‖·‖ X = √(·, ·) X . In addition, let X ′ denote the dual space of X and let ⟨·, ·⟩ X represent the duality pairing between X ′ and X . As is customary, we endow X ′ with the dual norm. In addition, we say that an operator A : X → X ′ is X -elliptic if there exists a constant C a > 0 such that ⟨Av, v⟩ X ≥ C a ‖v‖ 2 X for all v ∈ X . (4.1) In the following, we recall a well-established result on continuous, self-adjoint and positive semi-definite operators, i.e., operators satisfying ⟨Av, v⟩ X ≥ 0 for all v ∈ X ; for f ∈ X ′ , Lemma 4.1 states that u ∈ X solves the associated variational problem if and only if u minimizes the corresponding quadratic energy functional. This property has been previously used in the construction of DNNs in the "Deep Ritz Method" framework introduced in [14]. Lemma 4.1 allows us to express Problems 2.7 and 2.12 (as well as their discrete counterparts) as minimization problems over the corresponding Hilbert spaces. Let us define the energy loss functionals (4.2) and (4.3) associated to Problems 2.7 and 2.12, respectively, where g ∈ H −1/2 (Γ)/R. For each N ∈ N, we aim to find ReLU-NNs φ Γ,N ∈ N N 2,N +1,1,1 and ϕ Γ,N ∈ N N 2,N ,1,1 minimizing the loss functions defined in (4.2) and (4.3); this yields the minimization problems (4.4). Since ReLU-NNs realize continuous functions, the discontinuous spaces S 0 (Γ, T N ) in the Galerkin BEM Problem 2.14 for Symm's BIE cannot be exactly realized via ReLU-NNs. However, with Remark 3, the corresponding loss functions for the Galerkin BEM Problem 2.14 can also be realized with ReLU-NNs on Γ. All results on convergence rates for W will have, via Remark 3, analogs for V. For reasons of length, we develop the ReLU Galerkin BEM algorithms only for W and the loss functions introduced in (4.2) and (4.3).
The following result and accompanying corollaries motivate our choice of loss functions. We use this result to address the construction of ReLU-NNs by solving the minimization problems listed in (4.4).

Lemma 4.2 Let A : X → X ′ be a continuous, self-adjoint and X -elliptic operator. Given f ∈ X ′ , let u ∈ X be the unique solution to the following variational problem: find u ∈ X such that ⟨Au, v⟩ X = ⟨ f, v⟩ X for all v ∈ X . (4.5) Define Π : X → R as Π(v) := (1/2)⟨Av, v⟩ X − ⟨ f, v⟩ X . Then, there exist positive constants C 1 and C 2 , both independent of u ∈ X , such that for all v ∈ X : 2 C 2 (Π(v) − Π(u)) ≤ ‖u − v‖ 2 X ≤ 2 C 1 (Π(v) − Π(u)). (4.6) Proof Due to the continuity and X -ellipticity of the operator A : X → X ′ , we may define a norm on X equivalent to ‖·‖ X as follows: |||v||| := √⟨Av, v⟩ X . Indeed, if C c > 0 denotes the continuity constant of A : X → X ′ and C a > 0 is as in (4.1), for all v ∈ X it holds C a ‖v‖ 2 X ≤ |||v||| 2 ≤ C c ‖v‖ 2 X . (4.7) Let u ∈ X be the unique solution to the variational problem (4.5). Then, for v ∈ X , we calculate Π(v) − Π(u) = (1/2) |||v − u||| 2 . (4.8) In (4.8) we use (4.5) and the self-adjointness of the operator A : X → X ′ . The bounds presented in (4.6) follow from (4.8) and (4.7) with C 1 = C a −1 > 0 and C 2 = C c −1 > 0.
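In finite dimensions, the energy identity underlying Lemma 4.2, namely that the energy gap Π(v) − Π(u) equals half the squared energy norm of the error, can be checked directly. A minimal sketch (the SPD matrix stands in for a discretized coercive operator; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5.0 * np.eye(5)     # SPD stand-in for a coercive operator
f = rng.standard_normal(5)

def energy(v):
    """Quadratic energy Pi(v) = 1/2 <Av, v> - <f, v>."""
    return 0.5 * v @ A @ v - f @ v

u = np.linalg.solve(A, f)          # Galerkin solution = energy minimizer
v = rng.standard_normal(5)         # arbitrary competitor

gap = energy(v) - energy(u)                    # energy gap
half_err = 0.5 * (v - u) @ A @ (v - u)         # half squared energy norm
```

The identity `gap == half_err` is exactly what makes the energy a usable loss: decreasing the loss is equivalent to decreasing the (uncomputable) error in the energy norm.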
The next results, relevant for DNN training, follow from Lemma 4.2 and Proposition 3.3.

Corollary 4.3
Let φ ∈ H 1/2 (Γ) be the unique solution to Problem 2.7 with g ∈ H −1/2 (Γ)/R. Then, there exist positive constants C 1 and C 2 (independent of φ and g) such that for all N ∈ N with N + 1 > J and for all φ̃ ∈ N N 2,N +1,1,1 the two-sided bound of Lemma 4.2 holds. Here, as in (3.2), the ReLU-NN φ̃ is identified with its restriction to I via the pullback operator τ r .

Weighted Residual Estimators
We briefly recall the so-called weighted residual estimators for the a posteriori error estimation of the numerical solution to hypersingular BIEs in a bounded Lipschitz polygon and on an open arc in R 2 , namely Problems 2.7 and 2.12, respectively. We proceed to recapitulate the result of [7], in which reliable a posteriori error estimates for first-kind integral equations are analyzed. (i) Let φ ∈ H 1/2 (Γ) and φ N ∈ S 1 (Γ, T N ) be the solutions to Problem 2.7 and Problem 2.15, respectively, with g ∈ L 2 (Γ).
Proof We prove item (i). The Galerkin solution φ N ∈ S 1 (Γ, T N ) to Problem 2.15 satisfies the Galerkin orthogonality, as by construction it holds ⟨φ N , 1⟩ = 0. Recall that for bounded Lipschitz polygons the maps W : H 1 (Γ) → L 2 (Γ) and K : L 2 (Γ) → L 2 (Γ) are continuous. Considering that g ∈ L 2 (Γ), the duality pairings appearing in (4.10) can be interpreted as L 2 (Γ)-inner products. Hence, we may conclude that the residual R N ∈ L 2 (Γ) is orthogonal to S 1 (Γ, T N ). Recall the bilinear form ǎ introduced in Remark 1 and the identity (4.11). In addition, ⟨φ N , 1⟩ = 0 yields that the stabilized bilinear form and W coincide on φ N . In view of this, together with (4.11) and using Theorem 4.5 with f = (1/2 Id − K )g − Wφ N , we get the desired result. The proof of item (ii) follows the exact same steps as item (i); hence we omit it for the sake of brevity.
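While the weighted residual estimators above are formulated for BIEs, their reliability mechanism can be illustrated on a far simpler model. The sketch below is a 1D Poisson P 1 -FEM analog of our own choosing (not the paper's BIE setting): it sums local contributions of the form h_K ‖f‖ L2(K) and checks that they bound the energy-norm error up to a moderate constant.

```python
import numpy as np

# 1D analog: P1 FEM for -u'' = f on (0,1), u(0) = u(1) = 0, with f == 1,
# whose exact solution is u(x) = x(1 - x)/2.
N = 16
h = 1.0 / N

# Assemble the tridiagonal stiffness matrix and the load vector.
K = (np.diag(2.0 * np.ones(N - 1)) - np.diag(np.ones(N - 2), 1)
     - np.diag(np.ones(N - 2), -1)) / h
F = h * np.ones(N - 1)
uh = np.linalg.solve(K, F)
U = np.concatenate(([0.0], uh, [0.0]))   # nodal values incl. boundary

# Energy-norm error ||(u - u_h)'||_{L2} via 2-point Gauss quadrature
# (exact here: the integrand is a quadratic polynomial per element).
gauss = np.array([0.5 - 0.5 / np.sqrt(3.0), 0.5 + 0.5 / np.sqrt(3.0)])
err2 = 0.0
for k in range(N):
    a = k * h
    c = (U[k + 1] - U[k]) / h            # u_h' on element k
    x = a + h * gauss
    err2 += (h / 2.0) * np.sum((0.5 - x - c) ** 2)   # u'(x) = 1/2 - x
err = np.sqrt(err2)

# Local residual contributions eta_K = h_K * ||f||_{L2(K)}; with f == 1
# each contribution equals h * sqrt(h).
eta = np.sqrt(np.sum((h * np.sqrt(h)) ** 2 * np.ones(N)))
```

As in the weighted residual setting, the estimator is both computable from the data and the discrete solution alone, and reliable: the true error is bounded by a fixed multiple of `eta`, which is what licenses its use as a training loss.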

Training Algorithms
In this section, we describe two algorithms devised to construct ReLU-NNs for the approximation of the solution to Problems 2.7 and 2.12. These two approaches rely on the following observation: each space of piecewise linear polynomials defined on a suitable partition of the boundary can be exactly represented by a ReLU-NN according to Proposition 3.3. Moreover, the parameters of these ReLU-NNs, i.e., weights and biases, can be precisely described in terms of the parameters of the partition, namely the positions of the nodes on the boundary. Therefore, we can replace the Galerkin solution to the discrete variational problem by a ReLU-NN as described in Proposition 3.3 and proceed to find its parameters by minimizing a suitable loss function. The algorithms presented here are in the spirit of the "Deep Ritz Method" [14] and rely on the tools introduced in Sects. 4.1 and 4.2.

Training Using Minimization of the Total Energy
Corollaries 4.3 and 4.4 provide a justification for using the loss functions (4.2) and (4.3) in the construction of the sought ReLU-NNs. Indeed, these quantities can be used as surrogates for the exact error in the minimization process. Equipped with this observation, we proceed to describe a two-step scheme to find a realization of a ReLU-NN that approximates the solution to Problems 2.7 and 2.12 using the aforementioned loss functions.
We begin by describing the algorithm for Problem 2.7 only. As usual, we consider a bounded Lipschitz polygon D ⊂ R^2 with boundary Γ := ∂D characterized by a number J ∈ N (J ≥ 3) of vertices. Let N ∈ N be fixed and such that N + 1 > J. We consider a mesh T_N of Γ as described in Sect. 2.7, with N + 1 points {x_n}_{n=0}^N ⊂ Γ and with x_0 = x_N (i.e. N distinct points). Within the set {x_n}_{n=0}^N one may identify two kinds of nodes: (i) a first subset consisting of the vertices of the polygon, referred to in the following as fixed nodes, and (ii) a second set of mesh nodes that are not vertices of the polygon, referred to in the following as free nodes. Moreover, given a Lipschitz continuous, piecewise linear parametrization r : I → Γ of Γ (as in Sect. 3.2), the set of nodes {x_n}_{n=0}^N may be identified with a set of biases t := {t_n}_{n=0}^N ⊂ I such that r(t_n) = x_n for each n = 0, …, N (see Proposition 3.3). We aim to find the positions of the biases in an optimal fashion, while keeping the biases associated with fixed nodes unaltered. Let t_F ∈ R^{N+1−J} be the vector containing the elements of {t_n}_{n=0}^N associated with the free nodes only, and let φ_N ∈ NN_{3,N+2,1,1}. In view of Proposition 3.3, there holds (4.12), where c := (c_0, …, c_N) ∈ R^{N+1} and {ζ_n}_{n=0}^N are the "hat" functions introduced in Sect. 3.2. Then, again in view of Proposition 3.3, the task of finding a ReLU-NN φ_N approximating the solution of Problem 2.7 may be stated as the minimization problem (4.13), where (·) has been introduced in (4.2) and φ_N(t, c) denotes the dependence of φ_N upon the biases t and the set of weights c. Note the slight abuse of notation in referring to both the vector of coefficients of the BEM solution in (4.12) and the ReLU-NN weights as c (see Proposition 3.3).
In Algorithm 1, we propose a two-step scheme for the construction of an optimal ReLU-NN approximation of the solution to Problem 2.7. This algorithm is based on the following observation: if we fix the free biases t_F, the minimization of (4.13) with respect to c boils down to the solution of the Galerkin discretization of Problem 2.7, namely Problem 2.15. Hence, rather than computing the gradient with respect to both t_F and c ∈ R^{N+1} and executing a step of the gradient descent algorithm, we perform a gradient descent step only with respect to the vector of free biases t_F, while keeping the vector of coefficients c ∈ R^{N+1} fixed. Then, in the second step of the algorithm, we fix the biases and compute c ∈ R^{N+1} in (4.13) by solving the discretized boundary integral equation in Problem 2.15.

Algorithm 1 Construction of ReLU-NN by minimizing the total energy
Input: Initial set of biases t ∈ R^{N+2}; maximum number of iterations M ∈ N; tolerance ε > 0.
Output: Optimized network parameters (t, c).
1: procedure
2: Compute c ∈ R^{N+1} by solving Problem 2.15 on the mesh associated with the biases t;
3: for m = 1, …, M do
4: Perform a gradient descent step on the free biases for the loss (φ_N(t, c)), yielding t̃_F;
5: if the decrease of the loss is below the tolerance ε then
6: return (t, c);
7: end if
8: t_F ← t̃_F;
9: Compute c ∈ R^{N+1} by solving Problem 2.15 on the mesh associated with the biases t;
10: end for
11: return (t, c);
12: end procedure

For a given initial set of biases (with a fixed number of biases), Algorithm 1 returns an optimized collection of biases (with the same cardinality) upon which the loss function (·) is minimized. This, in turn, allows us to construct a ReLU-NN approximating the solution to Problem 2.7 according to Proposition 3.3. To construct a sequence of ReLU-NNs with increasing width (corresponding to a sequence of meshes with an increasing number of nodes, according to Proposition 3.3), we propose Algorithm 2. Therein, we widen the network by including neurons with (initial) biases given as the midpoints between two contiguous biases, which are subsequently optimized through Algorithm 1.
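The alternating structure of Algorithm 1 can be sketched on a one-dimensional model problem. The following NumPy sketch replaces the Galerkin solve of Problem 2.15 by a linear least-squares fit and the BEM energy by a discrete L² misfit to a model singular function; all names, the model loss, and the finite-difference gradient are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def hats(t, x):
    """Design matrix whose columns are the hat functions on knots t
    (together a width-len(t) ReLU network, cf. Proposition 3.3),
    evaluated at the sample points x."""
    I = np.eye(len(t))
    return np.stack([np.interp(x, t, I[i]) for i in range(len(t))], axis=1)

def solve_c(t, x, fx):
    """'Galerkin' step: linear least-squares solve for the weights c,
    with the biases/knots t held fixed (the analogue of Problem 2.15)."""
    return np.linalg.lstsq(hats(t, x), fx, rcond=None)[0]

def loss(t, x, fx):
    """Model energy: discrete L2 misfit of the best fit on knots t."""
    return np.mean((hats(t, x) @ solve_c(t, x, fx) - fx) ** 2)

def algorithm1(t, x, fx, M=200, lr=0.5, h=1e-6, tol=1e-14):
    """Alternate a finite-difference gradient step on the free (interior)
    knots with a re-solve for the coefficients, as in Algorithm 1."""
    t = t.copy()
    for _ in range(M):
        base = loss(t, x, fx)
        grad = np.zeros_like(t)
        for i in range(1, len(t) - 1):            # endpoints are fixed nodes
            tp = t.copy(); tp[i] += h
            grad[i] = (loss(tp, x, fx) - base) / h
        t_new = t.copy()
        t_new[1:-1] = np.sort(t[1:-1] - lr * grad[1:-1])  # keep knots ordered
        if base - loss(t_new, x, fx) < tol:       # stopping criterion
            break
        t = t_new
    return t, solve_c(t, x, fx)

x = np.linspace(0.0, 1.0, 400)
fx = x ** 0.6                   # model corner-type singularity at x = 0
t0 = np.linspace(0.0, 1.0, 8)   # uniform initial biases
t_opt, c_opt = algorithm1(t0, x, fx)
# the optimized (nonuniform) knots should lower the loss vs. the uniform mesh
```

Only the interior knots move, mirroring the fixed/free node distinction; the coefficient solve after each bias update corresponds to step 9 of Algorithm 1.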
Through Algorithm 2 we construct a sequence of biases {t_i}_{i∈N} from a given initial configuration t such that the sequence of loss values {(φ_N(t_i, c_i))}_{i∈N} is monotonically decreasing, where c_i is obtained by solving Problem 2.15 on the mesh T_{N_i} associated with the biases t_i. Greedy algorithms aiming to construct shallow networks for function approximation with a variety of activation functions, e.g. the so-called ReLU^k and sigmoidal activation functions, are studied in [33], where improved convergence rates for the Orthogonal Greedy Algorithm (with respect to the well-known results presented in [4,12]), depending on the smoothness properties of the activation function of choice, are proven. Later, in [18], these findings were used for the NN approximation of the solution to elliptic boundary value problems. We mention that Algorithm 2 with a sufficiently large number of inner iterations M ∈ N (i.e. iterations of Algorithm 1) yields, in our numerical examples, the same convergence rates as proved in [18] and the references therein.

Algorithm 2 Construction of a sequence of ReLU-NNs minimizing the total energy
Input: • Initial set of biases t ∈ R^{N+2}; • Maximum number of iterations K ∈ N; • Maximum number of inner iterations M ∈ N; • Tolerance ε > 0.
Output: Sequence of optimal network parameters {(t_j, c_j)}_{j=1}^K.
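The outer loop of Algorithm 2 (widening by midpoint insertion) can be sketched as follows; the "energy" here is a simple interpolation-error surrogate for a model singular function, not the paper's BEM loss, and all names are illustrative:

```python
import numpy as np

f = lambda x: np.sqrt(x)          # model singular function (illustrative)
xq = np.linspace(0.0, 1.0, 1001)  # quadrature / sample points

def energy(t):
    """Surrogate loss: discrete L2 error of the piecewise-linear
    interpolant on knots t (stand-in for the BEM total energy)."""
    return np.sqrt(np.mean((f(xq) - np.interp(xq, t, f(t))) ** 2))

def widen(t):
    """Widen the network: insert the midpoint between every pair of
    contiguous biases (one new neuron per midpoint)."""
    mids = 0.5 * (t[:-1] + t[1:])
    return np.sort(np.concatenate([t, mids]))

t = np.array([0.0, 1.0])
losses = []
for k in range(6):                # K outer iterations
    t = widen(t)
    # ... here Algorithm 1 would re-optimize the free biases of t ...
    losses.append(energy(t))
# the recorded sequence of losses is monotonically decreasing
```

Because the new knot sets are nested, the recorded losses decrease monotonically, which is the property of the sequence {t_i} stated above.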

Training Using Weighted Residual Estimators
The result presented in Corollary 4.6 justifies the use of efficiently computable, weighted residual a-posteriori estimators in the construction of a suitable loss function for the numerical implementation of the ReLU-NN BEM algorithms described herein. More precisely, the computable, local residual a-posteriori error estimates presented in Sect. 4.2 can be used as computable surrogates of the mismatch, in the corresponding norm, between the exact solution and its Galerkin approximation. Unlike the approach presented in Sect. 4.3.1, here we do not construct a ReLU-NN approximation of the solution of the BIEs by finding the optimal positions of the mesh nodes through the minimization of a computable loss function. Instead, the algorithm presented herein greedily selects a set of basis functions to enrich the finite dimensional space upon which an approximation of the solution to the BIEs is built. We remark in passing that this approach is strongly related to the adaptive basis viewpoint elaborated in [11]. By Proposition 3.3, this amounts to increasing the width of the underlying ReLU-NN every time a new neuron, i.e. a new basis function, is added.
As in Sect. 4.2, we restrict our presentation to the case of a bounded Lipschitz polygon in R 2 and, again, point out that the extension to an open arc is straightforward. The technique to be presented here is an adaptation of the orthogonal matching pursuit algorithm [6], and is also motivated by the recent results in [1,18], which strongly leverage the variational structure of the discretization scheme.
Let S ⊂ H^{1/2}(Γ) be a finite set of candidate functions not contained in S^1(Γ, T_N). For each ξ ∈ S, span({ζ_n}_{n=1}^N ∪ {ξ}) is itself a valid finite dimensional space on which a solution to Problem 2.15 may be sought. For a given S ⊂ H^{1/2}(Γ), we aim to determine the element φ ∈ S having the least angle with respect to the residual ϕ − ϕ_N, i.e. to find φ ∈ S such that (4.14) holds. However, in the computation of φ ∈ S in (4.14) we encounter the following difficulty: we cannot directly compute the residual ϕ − ϕ_N ∈ H^{1/2}(Γ). In view of Corollary 4.6, we use the weighted residual estimator as a surrogate of the exact residual ϕ − ϕ_N in the H^{1/2}(Γ)-norm. We proceed to find an element of S whose contribution to (4.15) has the least angle, in the L^2(Γ)-inner product, with the residual (4.15) itself, i.e., to find φ ∈ S such that (4.16) holds. To properly state an algorithm that allows us to construct a ReLU-NN, it remains to define how the set S is constructed at each step. As in Sect. 2.7, we consider a mesh T_N of Γ with N + 1 points {x_n}_{n=0}^N ⊂ R^2, where x_0 = x_N (i.e. N distinct points). Let x̃_n denote the midpoint between x_n and x_{n+1} for each n = 0, …, N − 1, and consider the set of N piecewise linear functions S_N := {ξ_n}_{n=0}^{N−1} ⊂ H^{1/2}(Γ), where ξ_n is the "hat" function associated with the midpoint x̃_n, for n = 0, …, N − 1. In this case, the set S_N gathers the piecewise linear functions one would add to S^1(Γ, T_N) if a uniform refinement of the mesh T_N of Γ were performed. We aim to select, among the candidates in S_N, a function according to (4.16). Furthermore, since the incorporation of a single basis function of the set S_N at each step may result in a needlessly expensive procedure, we allow for the incorporation of a subset of S_N at each step in order to enhance the procedure (as in the ABEM algorithm). We do so by computing the indicators q_n and then selecting a number of elements of S_N having the largest values of q_n.
The number of elements of S_N incorporated at each iteration is controlled by an input parameter θ ∈ (0, 1], specifying the fraction of elements of S_N to be included in the enriched finite dimensional space at each iteration. The procedure is presented in Algorithm 3 for the setting of Problem 2.15 (recall the notation t and c for the biases and weights of the ReLU-NN). Note that Algorithm 3 may be implemented in tandem with Algorithm 1, as presented in Algorithm 4.
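The θ-fraction selection step can be sketched as follows; here the indicators q_n are random placeholders standing in for the computable residual contributions of the candidates in S_N, and the function name is illustrative:

```python
import numpy as np

def select_candidates(q, theta):
    """Return the indices of the ceil(theta * len(q)) candidates with the
    largest indicator values (the 'largest q_n' selection rule)."""
    n_keep = int(np.ceil(theta * len(q)))
    return np.argsort(q)[::-1][:n_keep]

rng = np.random.default_rng(0)
q = rng.random(8)                        # placeholder indicators q_n
idx = select_candidates(q, theta=0.25)   # keep the top quarter: 2 candidates
```

Each selected index corresponds to one new neuron (one midpoint hat function) added to the ReLU-NN; θ = 1 recovers uniform refinement, consistent with the remark below on Algorithm 4.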
Finally, note that setting θ = 1 in Algorithm 4 results in Algorithm 2, while setting M = 0 in Algorithm 4 results in Algorithm 3.

Algorithm 3 Construction of a sequence of ReLU networks minimizing the weighted residual estimator
Input: Initial set of biases t ∈ R^N; cardinality parameter θ ∈ (0, 1]; maximum number of iterations K ∈ N. Output: Optimal numerical approximation (t_F, c).

Numerical Results
In this section we present numerical results obtained using the algorithms described in Sect. 4.

Setting
For the sake of simplicity, we provide results only for the setting described in Problem 2.12 on the so-called slit domain. We consider Algorithm 2 with an initial configuration of only one free bias (N_0 = 1), located at t_{F,1} = 0, and display the evolution of the loss function and of the H^{1/2}(Γ)-error. In Fig. 3a, the difference between the loss function and its minimum value −π/2 (see (5.1) and Lemma 4.1) is compared for the sequence of meshes resulting from Algorithm 2 and for uniform mesh refinement. Substantially faster convergence of the training procedure to the optimum is observed for Algorithm 2 than for uniform mesh refinement. The numerical results display a decay of the convergence rate for the value θ = 0.5, due to the inclusion of suboptimal biases. For the value θ = 0.25, on the other hand, an optimal convergence rate of 1.5 with respect to the number of free biases is observed.
Finally, Fig. 5 portrays the convergence of the estimated H^{1/2}(Γ)-norm of the error for the ReLU-NNs returned by Algorithm 4, with θ ∈ {0.25, 0.5} and an initial configuration of one free bias, as before (N_0 = 1). Algorithm 1 was executed with M = 5000 and ε = 10^{−15} and, on most iterations, fewer than 1000 inner iterations of Algorithm 1 were required thanks to the stopping criterion. The results of Algorithm 4 are compared with those of Algorithm 3, and show how the optimization of the free biases through Algorithm 1 enhances the convergence rates obtained for θ = 0.5, for which the optimal convergence rate of 1.5 (with respect to the number of free biases) is then observed. Both sequences of meshes returned by Algorithm 4 achieve the optimal convergence rate of 1.5, independently of θ (though for θ = 0.25 Algorithm 3 had already achieved a convergence rate close to 1.5); optimizing the positions of the free biases t_F through Algorithm 1 in between successive refinements by Algorithm 3 helps to improve convergence. The locations of the free biases generated by 8 iterations of Algorithm 4 are shown in Fig. 8, where an accumulation towards the endpoints of I may be observed, accounting for the singular behaviour of the solution. In addition, in Fig. 7 we show the computed solution on the meshes generated by Algorithm 4, together with the exact solution of this problem. Figures 7b, 7c, 7d, and 7e portray the solution obtained on the meshes generated by Algorithm 4 with N = 2, 4, 7, 11, respectively.

Example II
Here, we consider Problem 2.12 with the right-hand side defined in (5.2) and repeat the numerical experiments presented in Sect. 5.2. That is to say, Figs. 8, 9, 10 and 11 correspond to our implementations of Algorithms 1, 2, 3 and 4, exactly as in Sect. 5.2, but considering the right-hand side defined in (5.2). However, all initial configurations in this section include a fixed bias at the origin, to accommodate the discontinuity of g at x = (0, 0). Figure 12 shows the locations of the biases generated by 8 iterations of Algorithm 4, where an accumulation at 0 due to the discontinuity of g, besides the expected accumulations at x = (−1, 0) and x = (1, 0), may be observed. All errors and loss differences displayed in the figures referenced in this section are computed with respect to an overkill solution (implemented in HILBERT), for which the loss attains the value ≈ −0.63662, also attained by Algorithm 4.

Concluding Remarks and Outlook
We recapitulate principal findings of the present paper, and indicate extensions and possible directions for further research.
We developed a novel class of Galerkin boundary element methods, on polygonal domains D ⊂ R 2 . They are based on trial spaces comprising deep neural networks with ReLU activation function.
Similar approaches are conceivable on boundaries of polyhedra D ⊂ R 3 , using the fact that also on surfaces, continuous first order Lagrangean boundary elements admit representations as ReLU NNs [19].
We investigated the approximation rates of the corresponding BEM in dependence on the DNN architecture. Essentially, the size and structure of the triangulation are encoded in the NN architecture, with the locations of the nodes being NN weights in the hidden layer of the NN.
We proved, also for singularities due to corners of ∂D, that optimal algebraic convergence rates can be achieved with shallow ReLU DNN BEM, by a suitable choice of NN weights and biases in the hidden layer. Deep ReLU DNN trial spaces could facilitate exponential convergence of the corresponding deep ReLU BEM, by emulating hp-boundary element methods, which can in principle achieve exponential rates of convergence, see [27].

(Captions of Figs. 9 and 10: Figure 9a displays the difference between the loss function and its minimum attained value ≈ −0.63662 for the sequence of meshes resulting from Algorithm 2 and for uniform mesh refinement; again, faster convergence to the optimum is observed for Algorithm 2 than for uniform mesh refinement. Figure 9b shows the convergence in the H^{1/2}(Γ)-norm. Algorithm 4 achieves, for the loss in Fig. 10a and for the H^{1/2}(Γ)-error in Fig. 10b, the optimal convergence rate of 1.5 with respect to the number of degrees of freedom, while Algorithm 3 achieves convergence rates of, approximately, 1.48 and 0.85 for θ = 0.25 and θ = 0.5, respectively.)
We proposed DNN training in the "natural" energy spaces, these being fractional, hilbertian Sobolev spaces on the boundary, which underlie the variational theory of first kind BIEs. While NN based discretizations have been proposed recently for PDEs, the nonlocal nature of the boundary integral operators renders the efficient numerical evaluation of loss functions costly. We leveraged existing, computable local residual a-posteriori error estimators to obtain novel, computationally efficient loss functions, based on local, reliable a-posteriori residual discretization error estimators.

(Figure caption: The sequences of meshes returned by Algorithm 4 both achieve the optimal convergence rate of 1.5, independently of θ (though for θ = 0.25 Algorithm 3 had already achieved a convergence rate close to the optimum). The figure displays how optimizing the mesh nodes through Algorithm 1 in between successive refinements by Algorithm 3 helps improve convergence. Here, Algorithm 1 was called with M = 5·10^3 and ε = 10^{−15} and, on most meshes, fewer than 10^3 iterations were required thanks to the stopping criterion.)

The present exposition was developed for plane, polygonal domains. However, the ReLU DNN expression results extend also to polyhedral domains D ⊂ R^3 with boundaries Γ comprising a finite number of plane faces Γ_j. Here, again, exact expression results of ReLU DNNs for continuous, piecewise affine BEM spaces on Γ are available in [19]. Corresponding approximation results on corner- and edge-graded meshes on Γ (see, e.g., [17]) will hold.
The general principle described in the present work, namely that ReLU DNNs are capable of emulating a wide range of spline-based approximation spaces with, essentially, identical convergence rate bounds, extends well beyond the presently considered setting.
The presently proposed formulation of boundary integral equations with DNN-based approximation spaces can serve as a vehicle to leverage powerful machine learning methodologies for the numerical treatment of boundary integral equations. Here, one single, unifying ReLU DNN based construction of approximation spaces on Γ will allow one to achieve the performance of adaptive mesh refinement and the exponential convergence of hp-BEM without any revision of implementations.
For more general BIEs, arising for example on polyhedral surfaces from the boundary reduction of the time-harmonic Maxwell equations, it is well known that Galerkin discretizations must be based on certain compatible subspaces; see, e.g., [5] and [20]. Also in these cases, structure-preserving DNNs can be constructed. We refer to [23] for a development of DNNs for De Rham compatible Finite Element spaces. The development of details for BIEs of, e.g., electromagnetic scattering is beyond the scope of the present work.
Funding Open access funding provided by EPFL Lausanne. The authors have not disclosed any funding.
Data Availability Enquiries about data availability should be directed to the authors.

Conflict of interest The authors have not disclosed any competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Here, ‖·‖_{H^s(I_k)} is the interpolation norm between L^2(I_k) and H^1_0(I_k), and C > 0 depends on s ∈ [0, 1], but not on h_k; for s = 0, 1 we recover the standard nodal interpolation error estimates. Consider now a singularity, i.e., f(x) = x^λ with λ ∈ {λ_j}, and take γ̃ ∈ R such that 1 − 1/β ≤ γ̃ ≤ 1. Assume first that k ≥ 2, so that dist(0, I_k) > 0. Then f ∈ H^2(I_k) ⊂ C(I_k) and, therefore, the linear nodal interpolant of f on I_k is well defined, with the local interpolation error bounded in terms of ‖f''‖_{L^2(I_k)} for any x ∈ I_k = (x_{k−1}, x_k). Since f''(x) = λ(λ − 1)x^{λ−2}, the latter integral is finite for the admissible range of γ̃ introduced above. To bound the contribution from I_1 = (0, x_1) = (0, h^β), we use a scaling argument. Let 0 < h < 1 be arbitrary. Then, for any g ∈ H^1_0(I), there holds ‖g(·/x_1)‖_{L^2(I_1)} ≤ x_1^{1/2} ‖g‖_{L^2(I)}. We interpolate (or directly estimate the fractional order seminorm) to obtain

‖g(·/x_1)‖_{H^s(I_1)} ≤ x_1^{1/2−s} ‖g‖_{H^s(I)}.

Taking g to be the nodal interpolation error of f yields the second case in (3.7). The first case is proved in the same fashion.
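The endpoint cases of the scaling bound follow by direct substitution; with I = (0, 1) and I_1 = (0, x_1) (interval notation as assumed from the context above), one computes

```latex
\|g(\cdot/x_1)\|_{L^2(I_1)}^2
  = \int_0^{x_1} \bigl|g(x/x_1)\bigr|^2 \,\mathrm{d}x
  = x_1 \int_0^1 |g(y)|^2 \,\mathrm{d}y
  = x_1\,\|g\|_{L^2(I)}^2,
\qquad
\bigl\|\bigl(g(\cdot/x_1)\bigr)'\bigr\|_{L^2(I_1)}^2
  = x_1^{-2}\int_0^{x_1} \bigl|g'(x/x_1)\bigr|^2 \,\mathrm{d}x
  = x_1^{-1}\,\|g'\|_{L^2(I)}^2,
```

and interpolating between the cases s = 0 and s = 1 yields the stated bound ‖g(·/x_1)‖_{H^s(I_1)} ≤ x_1^{1/2−s} ‖g‖_{H^s(I)}.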