Deep solution operators for variational inequalities via proximal neural networks

Following Bauschke and Combettes (Convex analysis and monotone operator theory in Hilbert spaces, Springer, Cham, 2017), we introduce ProxNet, a collection of deep neural networks with ReLU activation which emulate numerical solution operators of variational inequalities (VIs). We analyze the expression rates of ProxNets in emulating solution operators for variational inequality problems posed on closed, convex cones in real, separable Hilbert spaces, covering the classical contact problems in mechanics, and early exercise problems as arise, e.g., in valuation of American-style contracts in Black–Scholes financial market models. In the finite-dimensional setting, the VIs reduce to matrix VIs in Euclidean space, and ProxNets emulate classical projected matrix iterations, such as projected Jacobi and projected SOR methods.


Introduction
Variational Inequalities (VIs for short) in infinite-dimensional spaces arise in variational formulations of numerous models in the sciences. We refer only to [7,17,26] and the references there for models of contact problems in continuum mechanics, [20] and the references there for applications from optimal stopping in finance (mainly option pricing with "American-style," early exercise features) and [4] and the references there for resource allocation and game theoretic models. Two broad classes of approaches toward numerical solution of VIs can be identified: deterministic approaches, which are based on discretization of the VI in function space, and probabilistic approaches, which exploit stochastic numerical simulation and an interpretation of the solution of the VI as conditional expectations of optimally stopped sample paths. The latter approach has been used to design ML algorithms for the approximation of the solution of one instance of the VI in [3].
Deep neural network structures arise naturally in abstract variational inequality problems (VIs) posed on the product of (possibly infinite-dimensional) Hilbert spaces, as reviewed, e.g., in [5]. Therein, the activation functions correspond to proximity operators of certain potentials that define the constraints of the VI. Weak convergence of this recurrent NN structure, in the limit of infinite depth, to feasible solutions of the VI is shown under suitable assumptions. An independent, but related, development in recent years has been the advent of DNN-based numerical approximations which encode known, iterative solvers for discretized partial differential equations and certain fixed point iterations for nonlinear operator equations. We mention only [9], which developed DNNs that emulate the ISTA iteration of [6], and the more recently proposed generalization known as the "deep unrolling/unfolding" methodology [22]. Closer to PDE numerics, [11] recently proposed MGNet, a neural network emulation of multilevel, iterative solvers for linear, elliptic PDEs.
The general idea behind these approaches is to emulate by a DNN a contractive map, say Φ, which is assumed to satisfy the conditions of Banach's Fixed Point Theorem (BFPT), and whose unique fixed point is the solution of the operator equation of interest. Let us denote by Φ̃ the approximate map realized by emulating Φ with a DNN. The universality theorem for DNNs in various function classes implies (see, e.g., [16,25] and the references there) that for any ε > 0 a DNN surrogate Φ̃ of the contraction map Φ exists which is ε-close to Φ, uniformly on the domain of attraction of Φ.

Iteration of the DNN Φ̃ being realized by composition, any finite number K of steps of the fixed point iteration can be realized by K-fold composition of the DNN surrogate Φ̃. Iterating Φ̃ instead of Φ induces an error of order O(ε/(1 − L)), uniformly in the number of iterations K, where L ∈ (0, 1) denotes the contraction constant of Φ. Due to the contraction property of Φ, K may be chosen as O(|log(ε)|) in order to output an approximate fixed point with accuracy ε upon termination. The K-fold composition of the surrogate DNN Φ̃ is, in turn, itself a DNN of depth O(depth(Φ̃)|log(ε)|). This reasoning is valid also in metric spaces, since the notions of continuity and contractivity of the map do not rely on the availability of a norm. Hence, a (sufficiently large) DNN Φ̃ exists which may be used likewise for the iterative solution of VIs in metric spaces. Furthermore, the resulting fixed-point-iteration nets obtained in this manner naturally exhibit a recurrent structure, in the case (considered here) that the surrogate Φ̃ is fixed throughout the K-fold composition (more refined constructions with stage-dependent approximations {Φ̃^{(k)}}_{k=1}^{K} of increasing emulation accuracy could be considered, but shall not be addressed here).
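This perturbation argument can be illustrated with a toy one-dimensional example. The following minimal Python sketch uses a hypothetical contraction phi and an ε-perturbed surrogate phi_tilde (both stand-ins for the DNN emulator); iterating both maps shows that the discrepancy stays of order ε/(1 − L), uniformly in the number of iterations.

```python
import numpy as np

# Toy contraction phi with Lipschitz constant L = 0.5 and fixed point x* = 2.0,
# and an eps-accurate surrogate phi_tilde (hypothetical stand-ins for a DNN emulator).
L, eps = 0.5, 1e-3
phi = lambda x: 0.5 * x + 1.0
phi_tilde = lambda x: phi(x) + eps * np.sin(x)   # uniform perturbation of size <= eps

x, x_t = 0.0, 0.0
for k in range(50):                               # K = O(|log eps|) iterations already suffice
    x, x_t = phi(x), phi_tilde(x_t)

print(abs(x - 2.0))          # exact iteration: geometric convergence to the fixed point
print(abs(x_t - x))          # surrogate iteration: discrepancy stays O(eps / (1 - L))
```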
In summary, with the geometric error reduction in FPIs which is implied by the contraction condition, finite truncation at a prescribed emulation precision ε > 0 requires O(|log(ε)|) iterations, with exact solution representation (of the fixed point of Φ̃) in the infinite-depth limit. In DNN calculus, finitely terminated FPIs can be realized via finite concatenation of the DNN approximation Φ̃ of the contraction map Φ. The corresponding DNNs exhibit depth O(|log(ε)|) and, naturally, a recurrent structure due to the repetition of the net Φ̃ in their construction. Thereby, recurrent DNNs can be built which encode numerical solution maps of fixed point iterations. This idea has appeared in various incarnations in recent work; we refer to, e.g., MGNet for the realization of multigrid iterative solvers of discretized elliptic PDEs [11]. The presently proposed ProxNet architectures are, in fact, DNN emulations of corresponding fixed point iterations of (discretized) variational inequalities.
Recent work has promoted so-called Deep Operator Nets which emulate Data-to-Solution operators for classes of PDEs. We mention only [19] and the references there. To analyze expression rates of deep neural networks (DNNs) for emulating data-to-solution operators for VIs is the purpose of the present paper. In line with recent work (e.g., [19,21] and the references there), we take the perspective of infinite-dimensional VIs, which are set on closed cones in separable Hilbert spaces. The task at hand is then the analysis of rates of expression of the approximate data-to-solution map, which relates the input data (i.e., operator, cone, etc.) to the unique solution of the VI.

Layout
The structure of this paper is as follows. In Sect. 2, we recapitulate basic notions and definitions of proximal neural networks in infinite-dimensional, separable Hilbert spaces. A particular role is taken by so-called proximal activations, and a calculus of ProxNets, which we shall use throughout the rest of the paper to build solution operators of VIs. Section 3 addresses the conceptual use of ProxNets in the constructive solution of VIs. We build in particular ProxNet emulators of convergent fixed point iterations to construct solutions of VIs. Section 3.2 introduces quantitative bounds for perturbations of ProxNets. Section 4 emphasizes that ProxNets may be regarded as (approximate) solution operators to unilateral obstacle problems in infinite-dimensional Hilbert spaces. Section 5 presents DNN emulations of iterative solvers of matrix LCPs which arise from discretization of unilateral problems for PDEs. Section 6 presents several numerical experiments, which illustrate the foregoing developments. More precisely, we consider the numerical solution of free boundary value problems arising in the valuation of American-style options, and in parametric obstacle problems. Section 7 provides a brief summary of the main results and indicates possible directions for further research.

Notation
We use standard notation. By L(H, K), we denote the Banach space of bounded, linear operators from the Banach space H into K (surjectivity will not be required). Unless explicitly stated otherwise, all Hilbert and Banach spaces are infinite-dimensional. By bold symbols, we denote matrices resp. linear maps between finite-dimensional spaces. We use the conventions ∑_{i=1}^{0} (·) = 0 and ∏_{i=1}^{0} (·) = 1 for the empty sum and the empty product, respectively. Vectors in finite-dimensional, Euclidean space are always understood as column vectors, with ⊤ denoting transposition of matrices and vectors.

Proximal neural networks (ProxNets)
We consider the following model for an artificial neural network: For finite m ∈ N, let H and (H_i)_{0≤i≤m} be real, separable Hilbert spaces. For every i ∈ {1, …, m}, let W_i ∈ L(H_{i−1}, H_i) be a bounded linear operator, let b_i ∈ H_i, let R_i : H_i → H_i be a nonlinear, continuous operator, and define the layer maps T_i : H_{i−1} → H_i, x ↦ R_i(W_i x + b_i). Let W_{m+1} ∈ L(H_m, H), W_0 ∈ L(H_0, H), b_{m+1} ∈ H and consider the neural network (NN) model

Φ : H_0 → H, x ↦ W_{m+1}(T_m ∘ ⋯ ∘ T_1)(x) + W_0 x + b_{m+1}. (2)

The operator W_0 ∈ L(H_0, H) allows to include skip connections in the model, similar to deep residual neural networks as proposed in [12,13]. This article focuses in particular on NNs with identical input and output spaces as in [5, Model 1.1], which arise as a special case of model (2) with H_0 = H_m = H, W_{m+1} = id_H, W_0 = 0 and b_{m+1} = 0, and are of the form

Φ : H → H, x ↦ (T_m ∘ ⋯ ∘ T_1)(x). (3)

This enables us to approximate solutions to variational inequality problems as fixed-point iterations of ProxNets and to derive convergence rates. Due to the contraction property of Φ, the fixed-point iteration x_n = Φ(x_{n−1}), n ∈ N, converges to x* = Φ(x*) for any x_0 ∈ H at a linear rate. Moreover, as the set of contractions on H is open, the iteration is stable under small perturbations of the network parameters. As we show in Sect. 5.3 below, the latter property allows us to solve entire classes of variational inequality problems using only one ProxNet with fixed parameters.
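For the finite-dimensional case H_i = R^{d_i}, the layer maps T_i(x) = R_i(W_i x + b_i) and the skip connection W_0 of model (2) can be sketched in PyTorch as follows; the layer widths and the choice of ReLU as proximal activation are illustrative assumptions, not prescribed by the model.

```python
import torch

class ProxNet(torch.nn.Module):
    """Sketch of model (2): layers T_i(x) = R_i(W_i x + b_i), a final affine map
    W_{m+1}(.) + b_{m+1}, and a skip connection W_0 x (dimensions are illustrative)."""
    def __init__(self, dims, d_out):                # dims = [d_0, d_1, ..., d_m]
        super().__init__()
        self.layers = torch.nn.ModuleList(
            [torch.nn.Linear(dims[i], dims[i + 1]) for i in range(len(dims) - 1)])
        self.act = torch.nn.ReLU()                  # proximal activation R_i (here: ReLU)
        self.out = torch.nn.Linear(dims[-1], d_out) # W_{m+1} and b_{m+1}
        self.skip = torch.nn.Linear(dims[0], d_out, bias=False)  # W_0 (skip connection)

    def forward(self, x):
        z = x
        for layer in self.layers:
            z = self.act(layer(z))                  # T_i(z) = R_i(W_i z + b_i)
        return self.out(z) + self.skip(x)

net = ProxNet(dims=[16, 32, 32], d_out=16)          # illustrative widths
y = net(torch.randn(8, 16))                         # batch of 8 inputs
```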

Proximal activations
It is well-known that the proximity operator prox_{ψ_i}(x) := argmin_{y ∈ H_i} ( ψ_i(y) + ½ ‖x − y‖²_{H_i} ), defined for ψ_i in the set Γ₀(H_i) of proper, convex, lower semicontinuous functionals on H_i, is a firmly nonexpansive operator, i.e., 2 prox_{ψ_i} − id is nonexpansive, see, e.g., [2, Proposition 12.28]. As outlined in [5, Section 2], there is a natural relation between proximity operators and activation functions in neural networks: Virtually any commonly used activation function, such as the rectified linear unit, tanh, softmax, etc., may be expressed as a proximity operator on H_i = R^d, d ∈ N, for an appropriate ψ_i ∈ Γ₀(H_i) (see [5, Section 2] for examples). We consider a particular set of such proximity operators as proximal activations, cf. [5, Definition 2.20]; any such activation is, in particular, continuous and nonexpansive.
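A standard instance of this correspondence, which underlies all ReLU-based ProxNets below, is the following identity (a known fact; ι denotes the indicator function of the nonnegative orthant):

```latex
% ReLU as a proximity operator: projection onto the nonnegative orthant.
\operatorname{prox}_{\iota_{[0,\infty)^d}}(x)
   \;=\; \operatorname*{arg\,min}_{y\in[0,\infty)^d}\tfrac12\,\|x-y\|_2^2
   \;=\; \max\{x,0\}
   \;=\; \mathrm{ReLU}(x),
   \qquad x\in\mathbb{R}^d .
```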

ProxNet calculus
Before investigating the relation of Φ in (3) to variational inequality models, we record several useful definitions and results on NN calculus for the more general model from Eq. (2).
Definition 2.3 (Concatenation) For j ∈ {1, 2}, let H^{(j)} and (H^{(j)}_i)_{0≤i≤m_j} be separable Hilbert spaces such that H^{(2)} = H^{(1)}_0, and let Φ_j : H^{(j)}_0 → H^{(j)} be m_j-layer ProxNets as in (2) with W^{(j)}_0 ≡ 0. The concatenation of Φ_1 and Φ_2 is defined by the map Φ_1 ∘ Φ_2 : H^{(2)}_0 → H^{(1)}, x ↦ Φ_1(Φ_2(x)).

Remark 2.4 Due to W^{(j)}_0 ≡ 0, there are no skip connections after the last proximal activation in Φ_j; hence, Φ_1 ∘ Φ_2 is in fact a ProxNet as in (2) with m_1 + m_2 layers and no skip connection.
Definition 2.5 (Parallelization) For j ∈ {1, 2}, let H^{(j)} and (H^{(j)}_i)_{0≤i≤m} be separable Hilbert spaces such that H^{(1)}_0 = H^{(2)}_0, and let Φ_j : H^{(j)}_0 → H^{(j)} be m-layer ProxNets as in (2). The parallelization of Φ_1 and Φ_2 is given for x ∈ H^{(1)}_0 by P(Φ_1, Φ_2)(x) := (Φ_1(x), Φ_2(x)) ∈ H^{(1)} ⊕ H^{(2)}.

Proposition 2.6
The parallelization P(Φ_1, Φ_2) of two ProxNets Φ_1 and Φ_2 as in Definition 2.5 is a ProxNet.
Proof We set H_i := H^{(1)}_i ⊕ H^{(2)}_i for i ∈ {0, 1, …, m}, which is again a separable Hilbert space. We define W_i(x, y) := (W^{(1)}_i x, W^{(2)}_i y), b_i := (b^{(1)}_i, b^{(2)}_i) and R_i(x, y) := (R^{(1)}_i x, R^{(2)}_i y) for i ∈ {1, …, m}.
Note that all W_i are bounded, linear operators. Moreover, if R^{(1)}_i and R^{(2)}_i are proximal activations, then R_i is a continuous, nonexpansive operator on H^{(1)}_i ⊕ H^{(2)}_i (the proximity operator of (x, y) ↦ ψ^{(1)}_i(x) + ψ^{(2)}_i(y)), and it holds that P(Φ_1, Φ_2) is realized by the layer maps T_i := R_i(W_i · + b_i), i ∈ {1, …, m}, which shows the claim.
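In finite dimensions, the construction of this proof amounts to stacking the inputs and using block-diagonal weights with concatenated biases. A minimal NumPy check with illustrative matrices follows.

```python
import numpy as np

# Parallelization of two one-layer ProxNets on R^1 and R^2 (illustrative sizes):
# block-diagonal weights and stacked biases reproduce (Phi1(x1), Phi2(x2)).
relu = lambda x: np.maximum(x, 0.0)
W1, b1 = np.array([[2.0]]), np.array([0.5])                             # Phi1 on R^1
W2, b2 = np.array([[1.0, -1.0], [0.0, 3.0]]), np.array([1.0, -2.0])     # Phi2 on R^2

W = np.block([[W1, np.zeros((1, 2))], [np.zeros((2, 1)), W2]])          # block-diagonal weight
b = np.concatenate([b1, b2])                                            # stacked bias

x1, x2 = np.array([1.0]), np.array([0.3, -0.7])
parallel = relu(W @ np.concatenate([x1, x2]) + b)
assert np.allclose(parallel, np.concatenate([relu(W1 @ x1 + b1), relu(W2 @ x2 + b2)]))
```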

Contractive ProxNets
We formulate sufficient conditions on the neural network model in (3) so that Φ : H → H is a contraction. The associated fixed-point iteration then converges to the unique solution of a variational inequality, which is characterized in the following.

Theorem 3.2 Let Φ be as in (3), let x_0 ∈ H and define the iteration x_{k+1} := Φ(x_k), k ∈ N_0. Under Assumption 3.1, Φ is a contraction with Lipschitz constant L_{Φ,λ} < 1, and the sequence (x_k, k ∈ N_0) converges for any x_0 ∈ H to the unique fixed-point x* ∈ H. For any finite number k ∈ N, the error is bounded by

‖x_k − x*‖_H ≤ (L_{Φ,λ})^k / (1 − L_{Φ,λ}) ‖x_1 − x_0‖_H.

Moreover, x* is the unique solution to the associated variational inequality problem, and its norm is bounded in terms of the network parameters.

Proof By the non-expansiveness of the activations R_i, Φ is Lipschitz continuous with constant L_{Φ,λ}. As λ ∈ (0, 2) and L < min(1, 2/λ − 1) by Assumption 3.1, it follows that L_{Φ,λ} < 1; hence, Φ : H → H is a contraction. Existence and uniqueness of x* ∈ H and the first part of the claim then follow by Banach's fixed-point theorem for any initial value x_0 ∈ H.
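As a concrete illustration of the iteration in Theorem 3.2, the following NumPy sketch runs the fixed-point iteration of a one-layer ProxNet with ReLU activation; the data are random and illustrative, and W is rescaled so that the contraction property holds with L = 0.5.

```python
import numpy as np

# Fixed-point iteration x_{k+1} = Phi(x_k) for a contractive one-layer ProxNet
# Phi(x) = ReLU(W x + b); W is rescaled so that ||W||_2 = 0.5 < 1 (illustrative data).
rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))
W = 0.5 * A / np.linalg.norm(A, 2)                 # enforce Lipschitz constant L = 0.5
b = rng.standard_normal(d)
phi = lambda x: np.maximum(W @ x + b, 0.0)         # ReLU activation is nonexpansive

x = np.zeros(d)
for k in range(40):
    x = phi(x)
print(np.linalg.norm(phi(x) - x))                  # residual ~ L^k: linear convergence
```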

Perturbation estimates for ProxNets
We introduce a perturbed version of the ProxNet Φ in (3) in this subsection. Besides changing the network parameters W_i, b_i and R_i, we also augment the input space H and allow an architecture that approximates each nonlinear operator T_i itself by a multi-layer network. These changes allow us to consider ProxNets as approximate data-to-solution operators for infinite-dimensional variational inequalities and to control perturbations of the network parameters. For instance, we show in Example 3.4 that augmented ProxNets mimic the solution operator to Problem (8). In the perturbed architecture, the operators T_i are allowed to be themselves multi-layer ProxNets: For any i ∈ {1, …, m}, let m_i ∈ N and let H^{(i)}_0, …, H^{(i)}_{m_i} be separable Hilbert spaces. We then define the approximation T̃_i of T_i as an m_i-layer ProxNet, which in turn determines the perturbed network Φ̃ in (9). By construction, Φ̃ is a ProxNet of the form (2) with ∑_{i=1}^m m_i ≥ m layers. As compared to Φ, we have augmented the input and intermediate spaces by additional spaces H̃_i. The composite structure of the maps T̃_i allows to choose input vectors uniformly on a subset of H_{i−1}. As we show in Sect. 5.3 below, this enables us to solve large classes of variational inequalities with only one fixed ProxNet Φ̃, which in turn approximates a data-to-solution operator, instead of employing a different fixed map Φ : H → H for every problem.
To formulate reasonable assumptions on Φ̃, we denote, for any i ∈ {1, …, m − 1}, the projections to the first and second component of an element in H_i ⊕ H̃_i, respectively, and we define closed balls B^{(i)} ⊂ H_i of suitable radius. Assumption 3.3 requires that, on these balls, each T̃_i emulates T_i up to a prescribed tolerance δ ≥ 0.

Before we derive error bounds, we provide an example to motivate the construction of Φ̃ and Assumption 3.3.

Example 3.4 (Bias-to-solution operator) Let Φ be as in Assumption 3.1 with m = 2 layers and network parameters W_1, W_2, b_1, b_2 and R_1, R_2.
We construct a ProxNet Φ̃ that takes the bias vectors b_1, b_2 of Φ as inputs, so that Φ̃(·, b_1, b_2) represents Φ for any choice of b_i ∈ H_i and, therefore, may be concatenated to map any choice of b_1, b_2 to the respective solution (x_1, x_2) of (8).
In other words, we approximate the bias-to-solution operator (b_1, b_2) ↦ (x_1, x_2). To this end, we augment the input space to H ⊕ H_1 ⊕ H_2 and absorb the bias vectors into the input, so that the layers of Φ̃ reproduce those of Φ exactly. Therefore, the last part of Assumption 3.3 holds with δ = 0 for arbitrarily large ε_1 > 0 and, hence, the constants ε_0, ε_1, ε_2 do not play any role in this example. The generalization to m > 2 layers follows by a similar construction of Φ̃. Now, let (x_1, x_2) be the solution to (8) for any choice (b_1, b_2) ∈ H_1 ⊕ H_2. It follows from Theorem 3.2 that the operator (b_1, b_2) ↦ (Φ̃(·, b_1, b_2) ∘ ⋯ ∘ Φ̃(·, b_1, b_2))(x_0) approximates (x_1, x_2) to any desired accuracy, for any fixed x_0 ∈ H and any tuple (b_1, b_2) ∈ H_1 ⊕ H_2, for a sufficiently large number k of concatenations of Φ̃(·, b_1, b_2).
The augmented ProxNet Φ̃ may also be utilized to consider parametric families of obstacle problems, as shown in Example 4.4 below. Therein, the parametrization is with respect to the proximity operators R_i instead of the bias vectors b_i, and we construct an approximate obstacle-to-solution operator in the fashion of Example 3.4. In the finite-dimensional case (where the linear operators W_i correspond to matrices), the input of Φ̃ may even be augmented by a suitable space of operators; see Sect. 5.3 below for a detailed discussion. We conclude this section with a perturbation estimate that allows us to approximate the fixed-point of Φ by the augmented NN Φ̃.
Theorem 3.5 Let Φ be a ProxNet satisfying Assumption 3.1 and let Φ̃ be an augmented ProxNet satisfying Assumption 3.3. Then, there is a constant C > 0, independent of δ > 0 and x_0, such that for any k ∈ N it holds that ‖x̃_k − x*‖_H ≤ C (δ + (L_{Φ,λ})^k), where x* is the unique fixed-point of Φ and x̃_k denotes the kth iterate of Φ̃.

Proof We first show by induction over i ∈ {0, …, m − 1} that the intermediate outputs of Φ̃ remain in the balls B^{(i)} and stay close to the corresponding intermediate outputs of Φ: Assumption 3.3 together with Eq. (10), the triangle inequality and the induction hypothesis yields the claim for i ∈ {1, …, m}. With Assumption 3.3 and Eq. (10), we further obtain for each x ∈ B a layerwise perturbation bound of order δ, and by iterating this estimate over i ∈ {1, …, m} a bound for ‖Φ(x) − Φ̃(x, x̃_0)‖_H. Now, let x* ∈ H be the unique fixed-point of Φ as in Theorem 3.2, let x_k = Φ(x_{k−1}) and x̃_k = Φ̃(x̃_{k−1}, x̃_0) for any k ∈ N and a given initial value x_0 = x̃_0. We obtain, as in the proof of Theorem 3.2, a recursive estimate for ‖x_k − x̃_k‖_H, where we have used that Φ is a contraction with constant L_{Φ,λ} < 1. In the next step, we show that ‖x̃_k‖_H ≤ ε_0 by induction over k. First, we obtain with ‖x_0‖_H ≤ ε_2 ≤ ε_0, (11) and (12) that ‖x̃_1‖_H ≤ ε_0 follows with Assumption 3.3 on the relation of ε_0 and ε_2, since λ(1 − L) < 1. Using the induction hypothesis, the same argument yields ‖x̃_k‖_H ≤ ε_0 for all k; hence, the perturbation bounds remain applicable in every iteration. We apply the bounds from Theorem 3.2 and (11) and conclude the proof by combining the recursive estimate with a geometric-series argument.

Variational inequalities in Hilbert spaces
In the previous sections, we have considered a ProxNet model and derived the associated variational inequalities. Now, we use the variational inequality as a starting point and derive suitable ProxNets for its (numerical) solution. Let (H, (·, ·)_H) be a separable Hilbert space with topological dual space denoted by H′, and let ⟨·, ·⟩ be the associated dual pairing. Let a : H × H → R be a bilinear form, let f : H → R be a functional, and let K ⊂ H be a subset of H. We consider the variational inequality problem: find u ∈ K such that

a(u, v − u) ≥ ⟨ f, v − u ⟩ for all v ∈ K. (13)

Assumption 4.1 The bilinear form a : H × H → R is bounded and coercive, i.e., there are constants C_+ ≥ C_− > 0 such that |a(v, w)| ≤ C_+ ‖v‖_H ‖w‖_H and a(v, v) ≥ C_− ‖v‖²_H for all v, w ∈ H. Moreover, f ∈ H′ and K ⊂ H is nonempty, closed and convex.
Problem (13) arises in various applications in the natural sciences, engineering and finance. It is well-known that there exists a unique solution u ∈ K under Assumption 4.1; see, e.g., [14, Theorem A.3.3] for a proof. We also mention that well-posedness of Problem (13) is ensured under weaker conditions than Assumption 4.1; in particular, the coercivity requirement may be relaxed as shown in [8]. For this article, however, we focus on the bounded and coercive case in order to obtain numerical convergence rates for ProxNet approximations. A ProxNet Φ as in (3), such that u ∈ K is its unique fixed-point, is constructed in the following subsection.

Fixed-point approximation by ProxNets
Theorem 4.2 Let Assumption 4.1 hold and let u ∈ K be the unique solution of (13). Then there exists a ProxNet Φ as in (3) such that u is the unique fixed-point of Φ. Furthermore, for a given u_0 ∈ H define the iteration u_k := Φ(u_{k−1}), k ∈ N. Then, there are constants L_{Φ,λ} ∈ (0, 1) and C = C(u_0) > 0 such that ‖u − u_k‖_H ≤ C (L_{Φ,λ})^k for all k ∈ N.

Proof We recall the fixed-point argument, e.g., in [14, Theorem A.3.3], for proving existence and uniqueness of u, since it is the basis for the ensuing ProxNet construction: Since K is closed and convex, the H-orthogonal projection P_K : H → K onto K is well-defined, and for any ω > 0 the VI (13) is equivalent to u = P_K(u − ω(Au − F)), where A ∈ L(H) and F ∈ H denote the Riesz representations of a and f, respectively. Hence, u is a fixed-point of the mapping T_ω : H → H, v ↦ P_K(v − ω(Av − F)). By Assumption 4.1, it is now possible to choose ω > 0 sufficiently small so that T_ω is a contraction on H, which proves existence and uniqueness of u. The optimal relaxation parameter ω may be expressed in terms of the bounds C_− and C_+ from Assumption 4.1. To transfer this constructive proof of existence and uniqueness to the ProxNet setting, we denote by ι_K the indicator function of K, given by ι_K(v) = 0 if v ∈ K and ι_K(v) = +∞ otherwise. Since K is closed and convex, it holds that ι_K ∈ Γ₀(H) and prox_{ι_K} = P_K (cf. [2, Examples 1.25 and 12.25]). Now, let m = 1, W_1 := id_H − ωA and b_1 := ωF. The ProxNet emulation of the contraction map reads, for a parameter λ ∈ (0, 1], Φ(v) := R_1(W_1 v + b_1), where R_1 is the proximal activation induced by prox_{ι_K} = P_K; for λ = 1 this is Φ(v) = P_K(W_1 v + b_1). Since ‖W_1‖_{L(H)} < 1, Assumption 3.1 is satisfied for every λ ∈ (0, 1]. Theorem 3.2 yields that the iteration u_k := Φ(u_{k−1}) converges for any u_0 ∈ H to a unique fixed-point u* ∈ H with error bounded as in (14). Since Φ coincides with the contraction map, it follows that u* is in turn the unique fixed-point of T_ω; hence, u = u*, which proves the claim.
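For orientation, the one-layer construction in this proof can be summarized as follows; this is a sketch for λ = 1, with A and F denoting the Riesz representations of a and f as above.

```latex
% One-layer ProxNet emulating the contraction T_omega from the proof of Theorem 4.2
% (sketch for lambda = 1; A and F are the Riesz representations of a and f):
T_\omega(v) = P_K\bigl(v - \omega\,(A v - F)\bigr)
            = \operatorname{prox}_{\iota_K}\!\bigl((\mathrm{id}_H - \omega A)\,v + \omega F\bigr)
            = \Phi(v),
\qquad W_1 := \mathrm{id}_H - \omega A, \quad b_1 := \omega F .
```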

Remark 4.3 In the fashion of Example 3.4, we may construct an augmented ProxNet Φ̃ : H ⊕ H → H that takes the Riesz representative F ∈ H of the functional f as an additional input and approximates the solution of (13). The only difference is that F has to be multiplied with ω in the first linear transform to obtain b_1 = ωF instead of F as bias vector. The parameters of Φ̃ in this construction are independent of F; hence, Theorem 3.5 yields that for any f ∈ H′ (resp. F ∈ H) and x_0 ∈ H the concatenated iteration Φ̃(·, F) ∘ ⋯ ∘ Φ̃(·, F)(x_0) converges to the solution u of (13). The previous remark shows that one fixed ProxNet is sufficient to solve Problem (13) for any f ∈ H′.

Example 4.4 (Obstacle-to-solution operator) A similar result is achieved if the set K ⊂ H associated with Problem (13) is parameterized by a suitable family of functions: Assume Φ(v) = P_K(W_1 v + b_1), where W_1 ∈ L(H) and b_1 ∈ H are as in Theorem 4.2, and let K_0 := {v ∈ H | v ≥ 0 almost everywhere}. To obtain a ProxNet that uses the obstacle g ∈ H as input, we define Φ̃ : H ⊕ H → H, (v, g) ↦ P_{K_0}(W_1 v + b_1 − g) + g. It now follows for any given v, g ∈ H and K := {v ∈ H | v ≥ g almost everywhere} that Φ̃(v, g) = P_K(W_1 v + b_1) = Φ(v). As in Example 3.4, we concatenate Φ̃ to obtain, for a fixed choice of x_0 ∈ H, the operator O_obs : H → H, g ↦ Φ̃(·, g) ∘ ⋯ ∘ Φ̃(·, g)(x_0).
Convergence of O_obs(g) to u for any g ∈ H (with an arbitrary, a-priori fixed x_0 ∈ H), with a contraction rate that is uniform with respect to g ∈ H, is again guaranteed as the number of concatenations tends to infinity. Therefore, as in Example 3.4, there exists one ProxNet that approximately solves a family of obstacle problems with obstacle 'parameter' g ∈ H.
A combination of the ProxNets from Remark 4.3 and Example 4.4 enables us to consider both f and K in (13) as input variables of a suitable NN Φ̃ : H ⊕ H ⊕ H → H. This allows, in particular, to construct an approximation of the data-to-solution operator of Problem (13) that maps (F, g) ∈ H ⊕ H to u.

Example: linear matrix complementarity problems
Common examples for Problem (13) arise in financial and engineering applications, where the bilinear form a : H × H → R stems from a second-order elliptic or parabolic differential operator. In this case, H ⊂ H^s(D), where H^s(D) is the Sobolev space of smoothness s > 0 with respect to the spatial domain D ⊂ R^n, n ∈ N. Coercivity and boundedness of a as in Assumption 4.1 often arise naturally in this setting. To obtain a computationally tractable problem, it is necessary to discretize (13), for instance by a Galerkin approximation with respect to a finite-dimensional subspace H_d ⊂ H. To illustrate this, we assume that H_d with dim(H_d) = d ∈ N is a suitable finite-dimensional subspace with basis {v_1, …, v_d} and consider an obstacle problem with K = {v ∈ H | v ≥ g almost everywhere} for a smooth function g ∈ H.
Following Example 4.4, we introduce the set K_0 := {v ∈ H | v ≥ 0 almost everywhere} and note that Problem (13) is equivalent to finding u = u_0 + g ∈ K with u_0 ∈ K_0 such that the shifted variational inequality (15) holds.

Discretization and matrix LCP
To preserve non-negativity of the discrete approximation to (15), we assume that a function v = ∑_{i=1}^d w_i v_i ∈ H_d satisfies v ∈ K_0 if and only if its basis coordinates are nonnegative, i.e., if w ∈ R^d_{≥0}. This property holds, for instance, in finite element approaches. We write the discrete solution as a coefficient vector with respect to {v_1, …, v_d}. Consequently, the discrete version of (15) is the matrix variational inequality (16), where the matrix A ∈ R^{d×d} and the vector c ∈ R^d are assembled from the bilinear form a, the functional f and the obstacle g as in (17). Problem (16) is equivalent to the linear complementarity problem (LCP): find x ∈ R^d such that, for A ∈ R^{d×d} and c ∈ R^d as in (17), it holds that

x ≥ 0, Ax − c ≥ 0, x^⊤(Ax − c) = 0, (18)

where the inequalities are understood componentwise. The matrix A satisfies discrete coercivity and boundedness estimates with constants C_+ ≥ C_− > 0 that stem from Assumption 4.1, where ‖·‖_2 is the Euclidean norm on R^d. This implies in particular that the LCP (18) admits a unique solution.

Solution of matrix LCPs by ProxNets
The purpose of this section is to show that several well-known iterative algorithms to solve (finite-dimensional) LCPs may be recovered as particular cases of ProxNets in the setting of Sect. 2. To this end, we fix d ∈ N and use the decomposition A = D + L + U, (21), where D ∈ R^{d×d} contains the diagonal entries of A, and L, U ∈ R^{d×d} are the (strict) lower and upper triangular parts of A, respectively. The projected Jacobi overrelaxation (PJOR) method to solve LCP (18) is given in Algorithm 1; one PJOR sweep updates all components simultaneously from the previous iterate.
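Since the listing of Algorithm 1 is not reproduced above, the following NumPy sketch shows the standard projected Jacobi overrelaxation sweep and its reformulation as a one-layer ProxNet with ReLU activation, using the weight I_d − ωD^{−1}A and bias ωD^{−1}c that appear in the proofs below; the matrix data are illustrative.

```python
import numpy as np

def pjor_step(x, A, c, omega):
    """One projected Jacobi overrelaxation (PJOR) sweep for the LCP (18)."""
    D_inv = 1.0 / np.diag(A)
    return np.maximum(x - omega * D_inv * (A @ x - c), 0.0)

def pjor_proxnet_step(x, A, c, omega):
    """The same step as a one-layer ProxNet: ReLU(W_1 x + b_1)."""
    d = A.shape[0]
    D_inv = np.diag(1.0 / np.diag(A))
    W1 = np.eye(d) - omega * D_inv @ A          # linear weight
    b1 = omega * D_inv @ c                      # bias vector
    return np.maximum(W1 @ x + b1, 0.0)         # proximal (ReLU) activation

# Illustrative LCP data: tridiagonal matrix A and random right-hand side c.
d, omega = 8, 1.0
A = 2.0 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)
c = np.random.default_rng(1).standard_normal(d)
x = np.zeros(d)
for _ in range(200):
    x = pjor_step(x, A, c, omega)
print(np.minimum(x, A @ x - c))                 # complementarity residual, close to zero
```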
Example 5.4 (PSORNet) Another popular algorithm to numerically solve LCPs is the projected successive overrelaxation (PSOR) method in Algorithm 2.

To represent the PSOR iteration by a ProxNet as in (3), we use the scalar ReLU activation from Definition 5.1 and define, for i ∈ {1, …, d}, one-layer sub-networks that update the ith component while leaving the remaining components unchanged. In contrast to the PJOR case, the updates are applied sequentially: Given the kth iterate x^k and the already updated components x^{k+1}_1, …, x^{k+1}_{i−1} from the inner loop of Algorithm 2, the intermediate state z^{k,i−1} collects the new components in its first i − 1 entries and the old ones in the remaining entries. As z^{k−1,d} = z^{k,0} = x^k for k ∈ N, this shows x^{k+1} = Φ_PSOR(x^k) for the ProxNet Φ_PSOR obtained by concatenating the d sub-networks. Provided (19) holds, we derive, similarly to Example 5.3, a bound on the Lipschitz constant of Φ_PSOR, where A_{[i]} denotes the ith row of A. Hence, ω* := C_−³/(C_+ ‖A‖₂²) is sufficient to ensure that Φ_PSOR is a contraction, and convergence to a unique fixed-point follows as in Theorem 3.2.
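A minimal NumPy sketch of one PSOR sweep in the form used here follows; the sequential componentwise update with relaxation parameter ω is the standard projected SOR method and is shown as an assumption insofar as the listing of Algorithm 2 is not reproduced above.

```python
import numpy as np

def psor_step(x, A, c, omega):
    """One projected SOR (PSOR) sweep for the LCP (18): components are updated
    sequentially, so the i-th update already uses the new values x_1, ..., x_{i-1}."""
    x = x.copy()
    for i in range(len(x)):
        residual = A[i, :] @ x - c[i]
        x[i] = max(x[i] - omega / A[i, i] * residual, 0.0)
    return x
```

In practice, a PSOR sweep typically reduces the error faster than a PJOR sweep for the same ω, at the price of a strictly sequential update order.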

Solution of parametric matrix LCPs by ProxNets
In this section, we construct ProxNets that take arbitrary LCPs (A, c) in finite-dimensional, Euclidean space as input and output approximations of the solution x to (18) with any prescribed accuracy. Consequently, these ProxNets realize an approximate version of the data-to-solution operator O : (A, c) ↦ x defined in (25). The idea is to construct a NN that realizes Algorithm 1 and achieves a prescribed error threshold ε > 0 uniformly for LCP data (A, c) from a set A_ℓ, meaning that the weights of the NN may not depend on A as in the previous section. To this end, we use that the multiplication of real numbers may be emulated by ReLU-NNs with controlled error and with bounds on the depth and size of the ReLU NN. This was first shown in [27] and subsequently extended to the multiplication of an arbitrary number n ∈ N of real numbers in [24].
where ∂_{x_j} denotes the weak derivative with respect to x_j. The multiplication network Π^n_{δ_0,ℓ} uses only ReLUs as in Definition 5.1 as proximal activations. There exists a constant C, independent of δ_0 ∈ (0, 1), n ∈ N and ℓ ≥ 1, such that the number of layers m_{n,δ_0,ℓ} ∈ N of Π^n_{δ_0,ℓ} is bounded by m_{n,δ_0,ℓ} ≤ C (1 + log(n) log(n ℓ^n / δ_0)).
Moreover, we may assume without loss of generality that m_{2,δ_0,ℓ} = m_{3,δ_0,ℓ}, as it is always possible to add ReLU layers that emulate the identity function to the shallower network (see [24, Section 2] for details).
With this at hand, we are ready to prove a main result of this section.

Theorem 5.8 Let ℓ ≥ 2 be a fixed constant, let d ≥ 2, and define for any given ℓ ≥ 2 the set A_ℓ ⊂ R^{d×d} × R^d of admissible LCP data (A, c) as in (28). For the triangular decomposition A = D + L + U as in (21), define z_A := vec(D^{−1} + L + U) ∈ R^{d²}, where vec : R^{d×d} → R^{d²} is the row-wise vectorization of an R^{d×d}-matrix. Let x* be the unique solution to the LCP (A, c), and let x_0 ∈ R^d be arbitrary such that ‖x_0‖_2 ≤ ℓ. For any ε > 0, there exists a ProxNet Φ̃ as in (9) and a k_ε ∈ N such that ‖x̃_k − x*‖_2 ≤ ε for all k ≥ k_ε holds for the sequence x̃_k := Φ̃(x̃_{k−1}, z_A, c) generated by Φ̃ and any tuple (A, c) ∈ A_ℓ. Moreover, k_ε ≤ C_1(1 + |log(ε)|), where C_1 > 0 only depends on ℓ, and Φ̃ has m ≤ C_2(1 + |log(ε)| + log(d)) layers, where C_2 > 0 is independent of ℓ.

Proof Our strategy is to approximate Φ_PJOR from Example 5.3 for given (A, c) ∈ A_ℓ by Φ̃(·, z_A, c). We achieve this by constructing Φ̃ based on the approximate multiplication NNs from Proposition 5.6 and by showing that Φ_PJOR and Φ̃ satisfy Assumption 3.3, so that the error estimate from Theorem 3.5 applies.
For fixed ℓ and ε, the ProxNets Φ̃ emulate one step of the PJOR algorithm for any LCP (A, c) ∈ A_ℓ and a given initial guess x_0. This in turn allows to approximate the data-to-solution operator O from (25) to arbitrary accuracy by concatenation of suitable ProxNets. The precise statement is given in the main result of this section:

Theorem 5.9 Let ℓ ≥ 2 be fixed, let A_ℓ be given as in (28), and let the data-to-solution operator O be given as in (25). Then, for any ε > 0, there is a ProxNet O_ε : A_ℓ → R^d such that for any LCP (A, c) ∈ A_ℓ it holds that ‖O_ε(A, c) − O(A, c)‖_2 ≤ ε. Furthermore, let ‖·‖_F denote the Frobenius norm on R^{d×d}. There is a constant C > 0, depending only on ℓ and d, such that for any ε > 0 and any two tuples (A^{(1)}, c^{(1)}), (A^{(2)}, c^{(2)}) ∈ A_ℓ it holds that ‖O_ε(A^{(1)}, c^{(1)}) − O_ε(A^{(2)}, c^{(2)})‖_2 ≤ C (‖A^{(1)} − A^{(2)}‖_F + ‖c^{(1)} − c^{(2)}‖_2).

We give an explicit construction of the approximate data-to-solution operator O_ε in the proof of Theorem 5.9 at the end of this section. To show the Lipschitz continuity of O_ε with respect to the parametric LCPs in A_ℓ, we derive an operator version of the so-called Strang Lemma:

Lemma 5.10 Let ℓ ≥ 2, d ≥ 2, and let (A^{(1)}, c^{(1)}), (A^{(2)}, c^{(2)}) ∈ A_ℓ. For l ∈ {1, 2}, let A^{(l)} = D^{(l)} + L^{(l)} + U^{(l)} be the decomposition of A^{(l)} as in (21) and define z_{A^{(l)}} := vec((D^{(l)})^{−1} + L^{(l)} + U^{(l)}) ∈ R^{d²}. For target emulation accuracy ε > 0, let Φ̃ be the ProxNet as in (29), let x_0 ∈ R^d be such that ‖x_0‖_2 ≤ ℓ, and define the sequences x̃^{(l),k} := Φ̃(x̃^{(l),k−1}, z_{A^{(l)}}, c^{(l)}), k ∈ N, x̃^{(l),0} := x_0, l ∈ {1, 2}. (31) Then, there is a constant C > 0, depending only on ℓ and d, such that for any k ∈ N_0 and arbitrary, fixed ε > 0 it holds that ‖x̃^{(1),k} − x̃^{(2),k}‖_2 ≤ C (‖A^{(1)} − A^{(2)}‖_F + ‖c^{(1)} − c^{(2)}‖_2).

Proof By construction of Φ̃ in Theorem 5.8, we have for x ∈ R^d, l ∈ {1, 2}, and i ∈ {1, …, d} an explicit expression for the ith component of Φ̃(x, z_{A^{(l)}}, c^{(l)}) in terms of the approximate multiplication networks. Therefore, we estimate by the triangle inequality the componentwise difference of Φ̃(x, z_{A^{(1)}}, c^{(1)}) and Φ̃(x, z_{A^{(2)}}, c^{(2)}). Hence, for any x with ‖x‖_∞ ≤ ℓ, we obtain by ℓ ≥ 2 and the second estimate in (26) a bound in terms of ‖A^{(1)} − A^{(2)}‖_F + ‖c^{(1)} − c^{(2)}‖_2, where we have used the mean-value theorem to obtain the bound in the second last inequality and the Cauchy–Schwarz inequality in the last step. We recall from the proof of Theorem 5.8 that ω = ℓ^{−6} and δ_0 ≤ d^{−3/2}; hence, there is a constant C = C(ℓ, d) > 0, depending only on the indicated parameters, such that for any such x the difference ‖Φ̃(x, z_{A^{(1)}}, c^{(1)}) − Φ̃(x, z_{A^{(2)}}, c^{(2)})‖_2 is bounded by C(‖A^{(1)} − A^{(2)}‖_F + ‖c^{(1)} − c^{(2)}‖_2). Moreover, for any x, y ∈ R^d such that ‖x‖_∞, ‖y‖_∞ ≤ ℓ, it holds by the mean-value theorem and the second estimate in (26) that |Φ̃(x, z_{A^{(1)}}, c^{(1)})_i − Φ̃(y, z_{A^{(1)}}, c^{(1)})_i| may be bounded in terms of ‖x − y‖_2. Hence, Young's inequality yields, for any η > 0, a corresponding estimate, where we have used the Cauchy–Schwarz inequality in the last step. From the proof of Theorem 5.8, we have as before that ω = ℓ^{−6}, δ_0 ≤ d^{−3/2}, and, furthermore, ‖I_d − ωD^{−1}A‖_2 ≤ 1 − ℓ^{−4}. Setting η := ℓ^{−4} therefore shows that Φ̃(·, z_{A^{(1)}}, c^{(1)}) : R^d → R^d is a contraction on (R^d, ‖·‖_2) with Lipschitz constant L_1 > 0 bounded as in (35); note that we have used d ≥ 2 and ℓ ≥ 2 in the last two steps to obtain (35). Now, let (x̃^{(l),k}) for l ∈ {1, 2} and k ∈ N_0 denote the iterates as defined in (31) and recall from the proof of Theorem 3.5 that ‖x̃^{(l),k}‖_∞ ≤ ‖x̃^{(l),k}‖_2 ≤ ℓ. Therefore, we may apply the estimates in (33) and (34) to obtain the claim with C := C′/(1 − L_1) < ∞, since C′ = C′(ℓ, d) and L_1 is bounded independently of ε and k by (35).

Valuation of American options: Black-Scholes model
To illustrate an application for ProxNets, we consider the valuation of an American option in the Black–Scholes model. The associated payoff function of the American option is denoted by g : R_{≥0} → R_{≥0}, and we assume a finite time horizon [0, T] for T > 0. At any time t ∈ [0, T] and for any spot price x ≥ 0 of the underlying stock, the value of the option is denoted by V(t, x) and defines a mapping V : [0, T] × R_{≥0} → R_{≥0}. Changing to time-to-maturity and log-price yields a map v, which is the solution to the free boundary value problem (36); see, e.g., [14, Chapter 5.1]. The parameters σ > 0 and r ∈ R are the volatility of the underlying stock and the interest rate, respectively. We assume that g ∈ H^1(R_{≥0}) and construct in the following a ProxNet approximation to the payoff-to-solution operator at time t ∈ [0, T], given by O_{payoff,t} : g ↦ v(t, ·) in (37). As V and v, and therefore O_{payoff,t}, are in general not known in closed form, a common approach to approximate v for a given payoff function g is to restrict Problem (36) to a bounded domain D ⊂ R and to discretize D by linear finite elements based on d equidistant nodal points. The payoff function is interpolated with respect to the nodal basis, and we collect the respective interpolation coefficients of g in the vector g ∈ R^d. The time interval [0, T] is split into M ∈ N equidistant time steps of size Δt = T/M, and the temporal derivative is approximated by a backward Euler approach. This space-time discretization of the free boundary problem (36) leads to a sequence of discrete variational inequalities: Given g ∈ R^d and u^0 := 0 ∈ R^d, find u^m ∈ R^d such that for m ∈ {1, …, M} the LCP (38) holds. The LCP (38) is defined by the matrices A := M + Δt A^{BS} ∈ R^{d×d}, A^{BS} := (σ²/2) S + (σ²/2 − r) B + r M ∈ R^{d×d} and the right-hand side F^m := −Δt (A^{BS})^⊤ g + M u^{m−1} ∈ R^d. The matrices S, B, M ∈ R^{d×d} represent the finite element stiffness, advection and mass matrices; hence, A is tri-diagonal and asymmetric if σ²/2 ≠ r. The true value of the option at time Δt m is approximated at the nodal points via v(Δt m, ·) ≈ u^m + g. This yields the discrete payoff-to-solution operator at time Δt m defined in (39). Problem (38) may be solved for all m using a single shallow ProxNet Φ̃ : R^{3d} → R^d. The architecture of Φ̃ allows to take g and u^m as additional inputs in each step; hence, we train only one shallow ProxNet that may be used for any payoff function g and every time horizon T. Therefore, we learn the payoff-to-solution operator O_{payoff,t} associated with Problem (36) by concatenating Φ̃. The parameters W_1 ∈ R^{d×3d} and b_1 ∈ R^d are learned in the training process and shall emulate one step of the PJOR Algorithm 1, as well as the linear transformation (g, u^m) ↦ F^m to obtain the right-hand side in (38). Therefore, a total of 3d² + d parameters have to be learned in each example.
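The time-stepping structure of (38) can be sketched as follows. The function name and the use of PJOR sweeps as the inner LCP solver are illustrative; the assembled matrices M, A^BS are assumed to be available, and in the experiments below the trained ProxNet plays the role of this inner iteration.

```python
import numpy as np

def price_american(M_mat, A_bs, g, T, M_steps, pjor_sweeps=200, omega=1.0):
    """Backward-Euler time stepping for the sequence of LCPs (38).

    M_mat, A_bs : finite element mass and Black-Scholes matrices (assumed assembled)
    g           : nodal interpolation coefficients of the payoff
    Returns the final excess-to-payoff vector u^M."""
    dt = T / M_steps
    A = M_mat + dt * A_bs                        # system matrix of each LCP
    D_inv = 1.0 / np.diag(A)
    u = np.zeros_like(g)                         # u^0 = 0
    for m in range(M_steps):
        c = -dt * (A_bs.T @ g) + M_mat @ u       # right-hand side F^m of (38)
        x = u.copy()                             # warm start with previous step
        for _ in range(pjor_sweeps):             # inner solver: PJOR sweeps (Sect. 5.2)
            x = np.maximum(x - omega * D_inv * (A @ x - c), 0.0)
        u = x
    return u                                     # option value ≈ u + g at the nodes
```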
For our experiments, we use the Python-based machine learning package PyTorch. All experiments are run on a notebook with 8 CPUs, each with 1.80 GHz, and 16 GB memory. To train Φ̃, we sample N_s ∈ N input data points x^{(i)} := (x^{(i)}_0, g^{(i)}, u^{(i)}) ∈ R^{3d}, i ∈ {1, …, N_s}, from a 3d-dimensional standard normal distribution. The output training samples y^{(i)} consist of one iteration of Algorithm 1 with ω = 1, initial value x_0 := x^{(i)}_0, with A as in (38) and right-hand side given by c := −Δt (A^{BS})^⊤ g^{(i)} + M u^{(i)} ∈ R^d. We draw a total of N_s = 2·10^4 input-output samples, use half of the data for training, and the other half for validation. In the training process, we use mini-batches of size N_batch = 100 and the Adam optimizer [18] with initial learning rate 10^{−3}, which is reduced by 50% every 20 epochs. As error criterion, we use the mean-squared error (MSE) loss function, which is, for each batch of inputs ((x^{(i_j)}, g^{(i_j)}, u^{(i_j)}), j = 1, …, N_batch) and outputs (y^{(i_j)}, j = 1, …, N_batch), the batch mean of the squared Euclidean distances ‖Φ̃(x^{(i_j)}, g^{(i_j)}, u^{(i_j)}) − y^{(i_j)}‖²_2. We stop the training process if the loss function falls below the tolerance 10^{−12} or after a maximum of 300 epochs. The number of spatial nodal points d, which determines the size of the matrix LCPs, is varied throughout our experiments in d ∈ {200, 400, …, 1000}. We choose the Black–Scholes parameters σ = 0.1, r = 0.01 and T = 1. Spatial and temporal refinement are balanced by using M = d time steps of size Δt = T/M = 1/d. The decay of the loss curves is depicted in Fig. 1, where the reduction in the learning rate every 20 epochs explains the characteristic "steps" in the decay. This stabilizes the training procedure, and we reached a loss of O(10^{−12}) for each d before the 250th epoch. Once training is terminated, we compress the resulting weight matrix of the trained single-layer ProxNet by setting all entries with absolute value lower than 10^{−7} to zero. This speeds up evaluation of the trained network, while the resulting error is negligible. As the matrix W_1 in the trained ProxNet is close to the "true" tri-diagonal matrix A from (38), this eliminates most of the ProxNet's O(d²) parameters, and only O(d) non-trivial entries remain.
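A condensed PyTorch sketch of this training setup follows. The tridiagonal stand-in matrices, the constant time step and the helper pjor_step are illustrative assumptions replacing the assembled finite element matrices of (38); the sampling, batch size, optimizer, learning-rate schedule and stopping rule are as described above.

```python
import torch

d = 200                                        # number of spatial nodes (illustrative)

# Illustrative stand-ins for the matrices in (38): tridiagonal A^BS-like matrix, identity mass matrix.
A = 2.0 * torch.eye(d) - torch.diag(torch.ones(d - 1), 1) - torch.diag(torch.ones(d - 1), -1)
M = torch.eye(d)
dt = 1.0 / d
D_inv = torch.diag(1.0 / torch.diag(A))

def pjor_step(x0, g, u):
    """Target map: one PJOR step (Algorithm 1, omega = 1) with c = -dt*A^T g + M u (batched)."""
    c = -dt * (g @ A) + u @ M.T
    return torch.clamp(x0 - (x0 @ A.T - c) @ D_inv.T, min=0.0)

# Training data: inputs (x_0, g, u) ~ N(0, I_{3d}); targets = one PJOR step of the inputs.
inputs = torch.randn(10_000, 3 * d)
x0, g, u = inputs[:, :d], inputs[:, d:2 * d], inputs[:, 2 * d:]
targets = pjor_step(x0, g, u)

prox_net = torch.nn.Sequential(torch.nn.Linear(3 * d, d), torch.nn.ReLU())  # W_1 in R^{d x 3d}, b_1 in R^d
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(inputs, targets), batch_size=100, shuffle=True)
optimizer = torch.optim.Adam(prox_net.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)  # halve lr every 20 epochs
loss_fn = torch.nn.MSELoss()

for epoch in range(300):                       # at most 300 epochs, as in the experiments
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(prox_net(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()
    if loss.item() < 1e-12:                    # training-loss stopping tolerance
        break
```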
The relative validation error is estimated based on the N_val := 10^4 validation samples via (40). The validation errors and training times for each dimension are found in Table 1; the relative error remains stable with increasing problem dimension. To test the trained ProxNets, we consider a basket of 20 put options with payoff vectors g_i. The network has not been trained on these specific payoffs, but on random samples, and thus we could in principle consider an arbitrary basket containing different types of payoffs; the restriction to put options is for the sake of brevity only. We denote by u^{m,i} for m ∈ {0, …, M} the sequence of solutions to (38) with payoff vector g_i and u^{0,i} = 0 ∈ R^d for each i. Concatenating Φ̃ k times yields an approximation to the discrete operator O_{payoff,Δt m} in (39) for any m ∈ {1, …, M}. An approximating sequence of (u^{m,i}, m ∈ {0, …, M}) is then in turn generated by iterating this construction over the time steps: ũ^{m+1,i} is obtained by iterating Φ̃ k times with initial input x_0 = ũ^{m,i} ∈ R^d and fixed inputs g_i and ũ^{m,i}. We stop for each m after k iterations if two subsequent iterates x̃_k and x̃_{k−1} satisfy ‖x̃_k − x̃_{k−1}‖_2 < 10^{−3}.
The reference solution u^{M,i} is calculated by a Python implementation that uses the Primal-Dual Active Set (PDAS) algorithm from [15] to solve LCP (38) with tolerance ε = 10^{−6} in every time step. Compared to a fixed-point iteration, the PDAS method converges (locally) superlinearly according to [15, Theorem 3.1], but has to be called separately for each payoff function g_i. In contrast, the ProxNet may be iterated for the entire batch of 20 payoffs at once in PyTorch. We measure the relative error of the ProxNet approximation against this reference, averaged over the basket. Sample mean errors and computational times are depicted for d ∈ {200, 400, …, 1000} in Table 2, where we also report the number of iterations k for each d needed to achieve the desired tolerance of 10^{−3}. The results clearly show that ProxNets significantly accelerate the valuation of American option baskets compared to the standard, PDAS-based implementation. This holds true for any spatial resolution, i.e., any number of grid points d, while the relative error remains small, of magnitude O(10^{−3}) or O(10^{−4}). For d ≥ 600, we actually find that the combined time for training and evaluation of the ProxNets is below the runtime of the reference solution. We further observe that the computational times of both approaches scale similarly with increasing d.

Valuation of American options: jump-diffusion model
We generalize the setting of the previous subsection from the Black–Scholes market to an exponential Lévy model. That is, the log-price of the stock evolves as a Lévy process, with jumps distributed according to the Lévy measure ν : B(R) → [0, ∞). The option value v (in log-price and time-to-maturity) is now the solution of the partial integro-differential inequality (42). Introducing jumps in the model hence adds a non-local integral term to Eq. (36). The drift is set to γ := −σ²/2 − ∫_R (e^z − 1 − z) ν(dz) ∈ R in order to eliminate arbitrage in the market. We discretize Problem (42) by an equidistant grid in space and time as in the previous subsection; for details, e.g., on the integration with respect to ν, we refer to [14, Chapter 10]. The space-time approximation yields again a sequence of LCPs of the form (43), where A^L := M + Δt A^{Levy} ∈ R^{d×d} with A^{Levy} := (σ²/2) S + A^J, and the matrix A^J stems from the integration of ν. A crucial difference to (38) is that A^L is no longer tri-diagonal, but a dense matrix, due to the non-local integral term caused by the jumps. The drift γ and interest rate r are transformed into the right-hand side, such that F^m := −Δt (A^{Levy})^⊤ g^m + M u^{m−1} ∈ R^d, where g^m is the nodal interpolation of the transformed payoff g^m(x) := e^{r Δt m} g(x − (γ + r) Δt m). The inverse transformation gives an approximation to the solution v of (42) at the nodal points via v(T, · − (γ + r)T) ≈ e^{−rT} u^M. We refer to [14, Chapter 10.6] for further details on the discretization of American options in Lévy models.
The jumps are distributed according to an asymmetric, double-sided exponential Lévy measure ν with jump intensity λ = ν(R) ∈ (0, ∞). We choose p = 0.7, β_+ = 25, β_− = 20 to characterize the tails of ν and set the jump intensity to λ = 1. We further use σ = 0.1 and r = 0.01 as in the Black–Scholes example. We use the same training procedure and parameters as in the previous subsection to train the shallow ProxNets. As the only difference, we compress the weight matrix with tolerance 10^{−8} instead of 10^{−7} (recall that A^L is dense). This yields slightly better relative errors in this example, while it does not affect the time to evaluate the ProxNets. Training times and validation errors are depicted in Table 3 and indicate again a successful training. The decay of the training loss is for each d very similar to Fig. 1, and training is again stopped in each case before the 300th epoch.
After training, we again concatenate the shallow nets to approximate the operator O_{payoff,t} in (37), which maps the payoff function g to the corresponding option value v(t, ·) at any (discrete) point in time. We repeat the test from Sect. 6.1 in the jump-diffusion model with the identical basket of put options to test the trained ProxNets. The reference solution is again computed by a PDAS-based implementation. The results for American options in the jump-diffusion model are depicted in Table 4. Again, we see that the trained ProxNets approximate the solution v to (42) for any g to an error of magnitude O(10^{−3}) or less. While keeping the relative error small, ProxNets again significantly reduce computational time and are therefore a valid alternative in more involved financial market models. We finally observe that the number of iterations to tolerance in the jump-diffusion model is stable at 6-7 for all d, whereas this number increases with d in the Black–Scholes market (compare the third row in Tables 2 and 4). The explanation for this effect is that the excess-to-payoff vector u^M has a smaller norm in the jump-diffusion case, but the iterations terminate at the (absolute) threshold 10^{−3} in both the Black–Scholes and the jump-diffusion model. Therefore, we require fewer iterations in the latter scenario, although the option prices v and relative errors are of comparable magnitude in both examples.

Parametric obstacle problem
To show an application for ProxNets beyond finance, we consider an elliptic obstacle problem in the two-dimensional domain D := (−1, 1)². We define H := H^1_0(D) and aim to find the solution u ∈ H to the partial differential inequality −Δu ≥ f in D, u ≥ g in D, u = 0 on ∂D.
Therein, f ∈ H′ is a given source term and g ∈ H is an obstacle function, for which we assume g ∈ C(D) ∩ H for simplicity in the following. We introduce the convex set K := {v ∈ H | v ≥ g almost everywhere} and the bilinear form a(v, w) := ∫_D ∇v · ∇w dx, and note that a, f and K satisfy Assumption 4.1. The variational inequality problem associated with (45) is then (46): find u ∈ K such that a(u, v − u) ≥ f(v − u) for all v ∈ K. As for (15) at the beginning of Sect. 5, we introduce K_0 := {v ∈ H | v ≥ 0 almost everywhere}, and Problem (46) is equivalent to finding u = u_0 + g ∈ K with u_0 ∈ K_0 such that (47) holds. We discretize D = [−1, 1]² for d_0 ∈ N by a (d_0 + 2)²-dimensional nodal basis of linear finite elements, based on d_0 + 2 equidistant points in every dimension. Due to the homogeneous Dirichlet boundary conditions in (45), we only have to determine the discrete approximation of u within D and may restrict ourselves to a finite element basis {v_1, …, v_d}, with d := d_0², associated with the interior nodal points. Following the procedure outlined in Sect. 5.1, we denote by g ∈ R^d again the nodal interpolation coefficients of g (recall that we have assumed g ∈ C(D)) and by A ∈ R^{d×d} the finite element stiffness matrix with entries A_{ij} := a(v_j, v_i) for i, j ∈ {1, …, d}. This leads to the matrix LCP (49): find x ∈ R^d such that x ≥ 0, Ax − c ≥ 0 and x^⊤(Ax − c) = 0, where c ∈ R^d is in turn given by c_i := f(v_i) − (A^⊤ g)_i for i ∈ {1, …, d}. Given a fixed spatial discretization based on d nodes, we again approximate the discrete obstacle-to-solution operator by concatenating shallow ProxNets Φ̃ : R^d ⊕ R^d → R^d. The training process of the ProxNets for the obstacle problem is the same as in Sects. 6.1 and 6.2 and is, thus, not further outlined here. The only difference is that we draw the input data for training from a 2d-dimensional standard normal distribution. The output samples again correspond to one PJOR iteration with A and c as in (49) and ω = 1, where the initial value and g are both replaced by the 2d-dimensional random input vector. After training, we again compress the weight matrices by setting all entries with absolute value lower than 10^{−7} to zero. We test the ProxNets for LCPs of dimension d ∈ {10², 20², 30², 40²} and report training times and validation errors in Table 5. As before, training is successful and stopped early for each d, since the loss function falls below 10^{−12} before the 300th epoch. Once Φ̃ : R^d ⊕ R^d → R^d is trained for a given d, we use the initial value x_0 = 0 ∈ R^d and concatenate Φ̃ k times to obtain, for any g, the approximate discrete obstacle-to-solution operator g ↦ Φ̃(·, g) ∘ ⋯ ∘ Φ̃(·, g)(x_0).
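A minimal NumPy sketch of the discrete obstacle problem (49) follows. The five-point finite-difference stencil, the function name and the constant data in the usage example are illustrative stand-ins for the finite element stiffness matrix and load vector; the PJOR sweeps correspond to the inner iteration that the trained ProxNet emulates.

```python
import numpy as np

def solve_obstacle(d0, f, g, omega=1.0, sweeps=500):
    """Solve the LCP (49) on a d0 x d0 interior grid of (-1,1)^2.
    Illustrative stand-in: five-point finite differences instead of the FE stiffness matrix."""
    h = 2.0 / (d0 + 1)
    d = d0 * d0
    A = np.zeros((d, d))
    for i in range(d0):                           # assemble the discrete Laplacian
        for j in range(d0):
            k = i * d0 + j
            A[k, k] = 4.0 / h**2
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ii, jj = i + di, j + dj
                if 0 <= ii < d0 and 0 <= jj < d0:
                    A[k, ii * d0 + jj] = -1.0 / h**2
    c = f - A @ g                                 # right-hand side: load minus A applied to the obstacle
    x = np.zeros(d)
    D_inv = 1.0 / np.diag(A)
    for _ in range(sweeps):                       # PJOR sweeps; a trained ProxNet emulates this step
        x = np.maximum(x - omega * D_inv * (A @ x - c), 0.0)
    return x + g                                  # u = u_0 + g at the interior nodes

d0 = 20
g = -0.5 * np.ones(d0 * d0)                       # constant obstacle (illustrative)
f = -np.ones(d0 * d0)                             # constant load (illustrative)
u = solve_obstacle(d0, f, g)
```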

Conclusions
We proposed deep neural networks which realize approximate input-to-solution operators for unilateral, inequality problems in separable Hilbert spaces. Their construction was based on realizing approximate solution constructions in the continuous (infinite-dimensional) setting via proximal and contractive maps. As particular cases, several classes of finite-dimensional projected iteration methods (PJOR, PSOR) were shown to be representable by the proposed ProxNet DNN architecture. The general construction principle behind the ProxNets introduced in the present paper can be employed to realize further DNN architectures, also in more general settings. We refer to [1] for multilevel and multigrid methods to solve (discretized) variational inequality problems. The algorithms in this reference may also be realized as concatenations of ProxNets, similarly to the PJOR-Net and PSOR-Net from Examples 5.3 and 5.4. The analysis and representation of multigrid methods as ProxNets will be considered in a forthcoming work.