Deep neural network expressivity for optimal stopping problems

This article studies deep neural network expression rates for optimal stopping problems of discrete-time Markov processes on high-dimensional state spaces. A general framework is established in which the value function and continuation value of an optimal stopping problem can be approximated with error at most $\varepsilon$ by a deep ReLU neural network of size at most $\kappa d^{\mathfrak{q}} \varepsilon^{-\mathfrak{r}}$. The constants $\kappa,\mathfrak{q},\mathfrak{r} \geq 0$ do not depend on the dimension $d$ of the state space or the approximation accuracy $\varepsilon$. This proves that deep neural networks do not suffer from the curse of dimensionality when employed to solve optimal stopping problems. The framework covers, for example, exponential L\'evy models, discrete diffusion processes and their running minima and maxima. These results mathematically justify the use of deep neural networks for numerically solving optimal stopping problems and pricing American options in high dimensions.


Introduction
In the past years, neural network-based methods have been used ubiquitously in all areas of science, technology, economics and finance. In particular, such methods have been applied to various problems in mathematical finance such as pricing, hedging and calibration. We refer, for instance, to the articles Buehler et al. (2019), Becker et al. (2020), Becker et al. (2021), Cuchiero et al. (2020) and to the survey papers Ruf and Wang (2020), Germain et al. (2021), Beck et al. (2020) for an overview and further references. The striking computational performance of these methods has also raised questions regarding their theoretical foundations. Towards a complete theoretical understanding, there have been recent results in the literature which prove that deep neural networks are able to approximate option prices in various models without the curse of dimensionality; for deep neural network expressivity results for option prices and associated PDEs we refer, for instance, to the works discussed in the survey papers mentioned above.

In this article we establish a general framework in which the value function and the continuation value of optimal stopping problems for discrete-time Markov processes on high-dimensional state spaces can be approximated by deep ReLU neural networks without the curse of dimensionality, i.e., an approximation error of size at most ε can be achieved by a deep ReLU neural network of size at most κ d^q ε^{-r} for constants κ, q, r ≥ 0 which do not depend on the dimension d or the accuracy ε. The framework, in particular, provides deep neural network expressivity results for prices of American and Bermudan options. Our conditions cover most practically relevant payoffs and many popular models such as exponential Lévy models and discrete diffusion processes. The constants κ, q, r are explicit and thus the obtained results yield bounds for the approximation error component in any algorithm for optimal stopping and American option pricing in high dimensions which is based on approximating the value function or the continuation value by deep neural networks.

The remainder of the paper is organized as follows. In Section 2 we formulate the optimal stopping problem, recall its solution by dynamic programming and introduce the notation for deep neural networks. In Section 3 we formulate the assumptions and main results. Specifically, in Section 3.1 we formulate a basic framework, Assumptions 1 and 2, and prove that the value function can be approximated by deep neural networks without the curse of dimensionality, see Theorem 3.4. In Section 3.2 we then provide more refined assumptions on the considered Markov processes and extend the approximation result to this refined framework, see Theorem 3.8, which is the main result of the article. In Sections 3.3, 3.4 and 3.5 we then apply this result to exponential Lévy models and discrete diffusion processes and show that barrier options can also be covered via the running maximum or minimum of such processes. In order to make the presentation more streamlined, most proofs, in particular the proofs of Theorems 3.4 and 3.8, are postponed to Section 4.

Notation
Throughout, we fix a time horizon T ∈ N and a probability space (Ω, F, P) on which all random variables and processes are defined. For d ∈ N, x ∈ R^d, A ∈ R^{d×d} we denote by ‖x‖ the Euclidean norm of x and by ‖A‖_F the Frobenius norm of A. For a Lipschitz continuous function f : R^d → R^m we denote by Lip(f) its smallest Lipschitz constant.

Preliminaries
In this section we first formulate the optimal stopping problem and recall its solution in terms of the value function. Then we introduce the required notation for deep neural networks.

The optimal stopping problem
For each d ∈ N consider a discrete-time R^d-valued Markov process X^d = (X^d_t)_{t∈{0,...,T}} and a function g^d : {0, . . ., T} × R^d → R. Assume for each t ∈ {0, . . ., T} that E[|g^d(t, X^d_t)|] < ∞ and let F = (F_t)_{t∈{0,...,T}} be the filtration generated by X^d. Denote by T the set of F-stopping times τ : Ω → {0, . . ., T} and by T_t the set of τ ∈ T with t ≤ τ. For notational simplicity we omit the dependence on d in F, T and T_t. The optimal stopping problem consists in computing

$\sup_{\tau \in \mathcal{T}} \mathbb{E}\big[ g^d(\tau, X^d_\tau) \big]. \qquad (2.1)$

Consider the value function V^d defined by the backward recursion V^d(T, x) = g^d(T, x) and

$V^d(t, x) = \max\big\{ g^d(t, x), \, \mathbb{E}\big[ V^d(t+1, X^d_{t+1}) \,\big|\, X^d_t = x \big] \big\} \qquad (2.2)$

for t = T − 1, . . ., 0 and P ∘ (X^d_t)^{-1}-a.e. x ∈ R^d. Then knowledge of V^d also allows to compute a stopping time τ* ∈ T which is a maximizer in (2.1). Indeed, by backward induction and the Markov property we obtain that V^d(t, X^d_t) = U^d_t, P-a.s., where U^d is the Snell envelope defined by the backward recursion U^d_T = g^d(T, X^d_T) and $U^d_t = \max\big( g^d(t, X^d_t), \, \mathbb{E}[U^d_{t+1} \,|\, \mathcal{F}_t] \big)$ for t = T − 1, . . ., 0. Then, for instance by (Föllmer and Schied, 2016, Theorem 6.18), for all t ∈ {0, . . ., T}, P-a.s. the stopping time τ*_t = min{s ≥ t : U^d_s = g^d(s, X^d_s)} attains the supremum of E[g^d(τ, X^d_τ) | F_t] over τ ∈ T_t. In particular, τ* = τ*_0 is a maximizer of (2.1) and, in the case when X^d_0 is constant, V^d(0, X^d_0) is the optimal value in (2.1). The idea of our approach is as follows: in many situations the function g^d is in fact a neural network or can be approximated well by a deep neural network. Then the recursion (2.2) also yields a recursion for a neural network approximation. This argument will be made precise in the proof of Theorem 3.4 below.
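To make the backward recursion (2.2) concrete, the following minimal sketch computes the value function and the associated stopping rule for a toy finite-state Markov chain, where the conditional expectation in (2.2) reduces to a matrix-vector product. All names (`P`, `payoff`, etc.) are hypothetical and not part of the framework above.

```python
import numpy as np

# Toy example: a 3-state Markov chain observed at times t = 0, ..., T.
T = 4
P = np.array([[0.6, 0.3, 0.1],          # one-step transition matrix
              [0.2, 0.5, 0.3],
              [0.1, 0.3, 0.6]])
states = np.array([0.8, 1.0, 1.3])

def payoff(t, s):
    # discounted put-style reward g(t, x) = e^{-0.05 t} (1.1 - x)_+
    return np.exp(-0.05 * t) * np.maximum(1.1 - s, 0.0)

# Backward recursion (2.2): V(T, .) = g(T, .), V(t, .) = max(g(t, .), E[V(t+1, X_{t+1}) | X_t = .])
V = np.zeros((T + 1, len(states)))
V[T] = payoff(T, states)
for t in range(T - 1, -1, -1):
    continuation = P @ V[t + 1]
    V[t] = np.maximum(payoff(t, states), continuation)

# Stopping rule: stop at the first time the reward attains the value function.
stop = V <= payoff(np.arange(T + 1)[:, None], states[None, :]) + 1e-12
print(V[0], stop[0])
```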
Remark 2.2. The conditional expectation in (2.4) is defined in terms of the transition kernels μ^d_{s,t}, 0 ≤ s < t ≤ T, of the Markov process X^d (see (Kallenberg, p. 143)). In fact, formally we start with transition kernels μ^d on R^d which then allow us to construct a family of probability measures P_x on the canonical path space ((R^d)^{T+1}, B((R^d)^{T+1})) such that, under P_x, the coordinate process is a Markov process starting at x and with transition kernels μ^d. We refer to (Kallenberg, Theorem 8.4) or (Peskir and Shiryaev, 2006, Chapter II.4.1); see also (Revuz, 1984, Chapter 1).

Deep neural networks
In this article we will consider neural networks with the ReLU activation function ̺ : R → R given by ̺(x) = max(x, 0). For each d ∈ N we also denote by ̺ : R^d → R^d the componentwise application of the ReLU function. A deep ReLU neural network with L layers, layer dimensions N_0, . . ., N_L ∈ N and parameters Φ = ((A_1, b_1), . . ., (A_L, b_L)), where A_ℓ ∈ R^{N_ℓ × N_{ℓ-1}} and b_ℓ ∈ R^{N_ℓ}, is the function φ : R^{N_0} → R^{N_L} given by

$\phi(x) = A_L \,\varrho\big( A_{L-1}\, \varrho( \cdots \varrho(A_1 x + b_1) \cdots ) + b_{L-1} \big) + b_L, \qquad (2.6)$

and we denote by size(φ) the total number of non-zero entries in the parameter matrices and vectors of the neural network. In most cases the number of layers, the activation function and the parameters of the network are not mentioned explicitly and we simply speak of a deep neural network φ : R^d → R^{N_L}. We say that a function f : R^d → R^m can be realized by a deep neural network if there exists a deep neural network φ : R^d → R^m such that f(x) = φ(x) for all x ∈ R^d. In the literature a deep neural network is often defined as the collection of parameters Φ = ((A_1, b_1), . . ., (A_L, b_L)) and φ in (2.6) is called the realization of Φ, see for instance Petersen and Voigtlaender (2018), Opschoor et al. (2020), Gonon and Schwab (2021a). In order to simplify the notation we do not distinguish between the neural network realization and its parameters here, since the parameters are always (at least implicitly) part of the definition. Note that in general a function f may admit several different realizations by deep neural networks, i.e., several different choices of parameters may result in the same realization. However, in the present article this is not an issue, because pathological cases are excluded by bounds on the size of the networks.
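The following minimal sketch (with hypothetical names, not taken from the text above) evaluates the realization (2.6) of a parameter collection Φ = ((A_1, b_1), . . ., (A_L, b_L)) and counts size(φ) as the number of non-zero parameter entries.

```python
import numpy as np

def realize(params, x):
    """Evaluate the ReLU network realization (2.6) at x.
    params is a list of (A, b) pairs; ReLU is applied between layers only."""
    z = np.asarray(x, dtype=float)
    for A, b in params[:-1]:
        z = np.maximum(A @ z + b, 0.0)   # affine map followed by componentwise ReLU
    A_L, b_L = params[-1]
    return A_L @ z + b_L                 # no activation after the last layer

def size(params):
    """Total number of non-zero entries in all weight matrices and bias vectors."""
    return sum(np.count_nonzero(A) + np.count_nonzero(b) for A, b in params)

# Example: a network R^3 -> R with one hidden layer of width 4.
rng = np.random.default_rng(1)
params = [(rng.standard_normal((4, 3)), rng.standard_normal(4)),
          (rng.standard_normal((1, 4)), rng.standard_normal(1))]
print(realize(params, [0.5, -1.0, 2.0]), size(params))
```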

DNN Approximations for Optimal Stopping Problems
This section contains the main results of the article, the deep neural network approximation results for optimal stopping problems. We start by formulating in Assumption 1 a general Markovian framework. In Assumption 2 we introduce the hypotheses on the reward functions. We then formulate in Theorem 3.4 the approximation result for this basic framework. Subsequently, we provide a more refined framework, see Assumption 1' below, and prove the main result of the article, Theorem 3.8. This theorem proves that the value function can be approximated by deep neural networks without the curse of dimensionality. Corollary 3.9 shows that an analogous approximation result also holds for the continuation value. Subsequently, in Sections 3.3, 3.4 and 3.5 we specialize the result to the case of exponential Lévy models and discrete diffusion processes and show that also barrier options can be covered by including the running maximum or minimum.

Basic framework
Let p ≥ 0 be a fixed rate of growth. For instance, in financial applications typically p = 1. We start by formulating in Assumption 1 a collection of hypotheses on the Markov processes X^d. These hypotheses will be weakened later on in Assumption 1'.
Assumption 1 involves constants c > 0, q ≥ 0 and α ≥ 0 which control the bounds in its individual hypotheses. Assumption 1(i) requires a recursive updating of the Markov processes X^d according to update functions f^d_t and noise processes Y^d. Assumption 1(ii) asks that the noise random variables and the initial condition are independent. Assumption 1(iii) imposes that the update functions f^d_t can be approximated well by deep neural networks. Finally, Assumption 1(iv) requires that certain moments of the noise random variables and the "constant parts" of the update functions exhibit at most polynomial growth.
Remark 3.1. In Assumption 1(iii)-(iv) we could also put different constants c and q in each of the hypotheses. But then Assumption 1(iii)-(iv) still holds with c and q chosen as the respective maximum, and so for notational simplicity we choose to directly work with the same constants for all these hypotheses.

Remark 3.2. Let s ≥ t and consider a measurable function ḡ^{d,s} : R^d → R. The right hand side of (3.5) is defined for any x ∈ R^d for which the expectation is finite, and so in what follows we will also consider the conditional expectation E[ḡ^{d,s}(X^d_s) | X^d_t = x] to be defined for all such x ∈ R^d (by (3.5)). Note that the relevant expectations are also finite, and so by backward induction this reasoning allows to define in our framework the value function V^d(t, ·) on all of R^d, for each t.
Next, we formulate a collection of hypotheses on the reward (or payoff) functions g^d. For instance, fix a strike price K > 0, an interest rate r ≥ 0 and consider the payoff of a max-call option g^d(t, x) = e^{-rt}(max_{i=1,...,d} x_i − K)^+. Then (see, e.g., (Grohs et al., 2018, Lemma 4.12)) for each t the payoff g^d(t, ·) can be realized exactly by a neural network φ^{d,t} with size(φ^{d,t}) ≤ 6d^3. In addition, Lip(g^d(t, ·)) = 1 and therefore, setting φ^{ε,d,t} = φ^{d,t} for all ε ∈ (0, 1], we get that Assumption 2 is satisfied with c = 6, α = 0, q = 3. Further examples include basket call options, basket put options, call on min options and, by similar techniques, also put on min options, put on max options and many related payoffs.
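As a quick sanity check of this example (a sketch only; the helper names are hypothetical and the sequential construction below is less size-efficient than the one cited from Grohs et al. (2018)), the max-call payoff can be evaluated using ReLU operations alone via the exact identity max(a, b) = ̺(a − b) + ̺(b) − ̺(−b):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_max(a, b):
    # exact ReLU identity: max(a, b) = relu(a - b) + relu(b) - relu(-b)
    return relu(a - b) + relu(b) - relu(-b)

def max_call(x, K, r, t):
    # e^{-rt} (max_i x_i - K)_+ built from pairwise ReLU maxima;
    # the discount factor only rescales the output layer.
    m = x[0]
    for xi in x[1:]:
        m = relu_max(m, xi)
    return np.exp(-r * t) * relu(m - K)

x = np.array([3.1, 2.7, 3.4, 2.9])
expected = np.exp(-0.05 * 1.0) * max(x.max() - 3.0, 0.0)
assert np.isclose(max_call(x, K=3.0, r=0.05, t=1.0), expected)
```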

Assumption 2. [Assumptions on g^d]
We now state the main deep neural network approximation result under the assumptions introduced above.
Theorem 3.4. Suppose Assumptions 1 and 2 are satisfied. Let c > 0, q ≥ 0 and assume for all d ∈ N that ρ^d is a probability measure on R^d with $\int_{\mathbb{R}^d} \|x\|^{2\max(p,2)} \rho^d(dx) \le c d^q$. Then there exist constants κ, q, r ∈ [0, ∞) and neural networks ψ^{ε,d,t}, ε ∈ (0, 1], d ∈ N, t ∈ {0, . . ., T}, such that for any ε ∈ (0, 1], d ∈ N, t ∈ {0, . . ., T} the number of neural network weights grows at most polynomially and the approximation error between the neural network ψ^{ε,d,t} and the value function is at most ε, that is, size(ψ^{ε,d,t}) ≤ κ d^q ε^{-r} and

$\Big( \int_{\mathbb{R}^d} \big| V^d(t, x) - \psi^{\varepsilon,d,t}(x) \big|^2 \, \rho^d(dx) \Big)^{1/2} \le \varepsilon. \qquad (3.9)$

The proof of Theorem 3.4 will be given in Section 4.4 below. Theorem 3.4 shows that under Assumptions 1 and 2 the value function V^d can be approximated by deep neural networks without the curse of dimensionality: an approximation error of at most ε can be achieved by a deep neural network whose size is at most polynomial in ε^{-1} and d. The approximation error in Theorem 3.4 is measured in the L^2(ρ^d)-norm. Theorem 3.4 can also be used to deduce further properties of V^d. In the basic framework we obtain for instance the following corollary, which shows that under Assumptions 1 and 2 for each t the value function satisfies a certain average Lipschitz property with a constant growing at most polynomially in d.
Corollary 3.5. Suppose Assumptions 1 and 2 are satisfied. Let ν^d_0 be the standard Gaussian measure on R^d. Then for any R > 0 there exist constants κ, q ∈ [0, ∞) such that (3.10) holds.

The proof of Corollary 3.5 will be given at the end of Section 4.4.

Refined framework
We now introduce a refined framework, in which the approximation hypothesis (3.2) and the Lipschitz condition (3.4) in Assumption 1(iii) are weakened, see (3.11) and (3.13) below. Due to these weaker hypotheses we need to introduce potentially stronger moment assumptions on the noise variables Y^d_t. Note that the additional growth conditions (3.14) and (3.15) are satisfied automatically under Assumption 1 (see Lemma 4.5 and Remark 3.6 below).
Assumption 1'. [Weaker assumptions on X^d] Assume that (i), (ii) and (iv) in Assumption 1 are satisfied. Furthermore, assume that there exist constants c > 0, h > 0 and q, q̄ ≥ 0 such that the weakened conditions (3.11)-(3.15) hold.

Remark 3.6. A sufficient condition for (3.14) is that there exist c > 0 and q ≥ 0 such that for all d ∈ N, x, y ∈ R^d we have E[‖Y^d_t‖^{2m max(2,p)}] ≤ c d^q and ‖f^d_t(x, y)‖ ≤ c d^q (1 + ‖x‖ + ‖y‖). Then the left hand side of (3.14) can be bounded by a constant times (1 + ‖x‖^{2m max(p,2)} + c d^q), so that (3.14) follows.

Remark 3.7. While in many relevant applications the number of time steps T is only moderate (e.g. around 10 in (Becker et al., 2019, Sections 4.1-4.2)), it is also important to analyse the situation when T is large. To this end, in Assumption 1' we have introduced the constants h and q̄ instead of using the common upper bounds c, q. This makes it possible to get first insights about the situation in which T is large from the proofs in Section 4. Indeed, if h = 1 + h̃ and h̃ is sufficiently small (as is the case for instance in certain discretized diffusion models), then the constants in Lemma 4.6 and Lemma 4.8 are small also for large T.
Examples of processes that satisfy Assumption 1' are provided further below. These include, in particular, the Black-Scholes model, more general exponential Lévy processes and discrete diffusions. We now state the main theorem of the article.
Theorem 3.8. Suppose Assumptions 1' and 2 are satisfied. Let c > 0, q ≥ 0 and assume for all d ∈ N that ρ^d is a probability measure on R^d with $\int_{\mathbb{R}^d} \|x\|^{2m\max(p,2)} \rho^d(dx) \le c d^q$. Furthermore, assume that ζ < min(1, βm − θ)/(T − 1), where m, β, ζ, θ are the constants appearing in Assumption 1'. Then there exist constants κ, q, r ∈ [0, ∞) and neural networks ψ^{ε,d,t}, ε ∈ (0, 1], d ∈ N, t ∈ {0, . . ., T}, such that for any ε ∈ (0, 1], d ∈ N, t ∈ {0, . . ., T} the number of neural network weights grows at most polynomially and the approximation error between the neural network ψ^{ε,d,t} and the value function is at most ε, that is, size(ψ^{ε,d,t}) ≤ κ d^q ε^{-r} and

$\Big( \int_{\mathbb{R}^d} \big| V^d(t, x) - \psi^{\varepsilon,d,t}(x) \big|^2 \, \rho^d(dx) \Big)^{1/2} \le \varepsilon. \qquad (3.16)$

The proof of Theorem 3.8 will be given in Section 4.5 below. Theorem 3.8 shows that for Markov processes satisfying Assumption 1' and for reward functions satisfying Assumption 2 the value function of the associated optimal stopping problem can be approximated by deep neural networks without the curse of dimensionality.
In other words, an approximation error of at most ε can be achieved by a deep neural network whose size is at most polynomial in ε^{-1} and d.
The condition ζ < min(1, βm − θ)/(T − 1) in Theorem 3.8 can be viewed as a condition on m, which needs to be sufficiently large. This means that sufficiently high moments of Y^d_t need to exist and grow only polynomially in d.
A key step in the proof consists in constructing a deep neural network approximating the continuation value. Therefore, we immediately obtain the following corollary.

Exponential Lévy models
In this subsection we apply Theorem 3.8 to exponential Lévy models. Recall that an R^d-valued stochastic process L^d = (L^d_t)_{t ≥ 0} with L^d_0 = 0 is a Lévy process if it is stochastically continuous, its sample paths are almost surely right continuous with left limits and it has stationary and independent increments. By the Lévy–Khintchine formula, the distribution of L^d is characterized by its Lévy triplet (A^d, γ^d, ν^d), consisting of a symmetric positive semidefinite matrix A^d ∈ R^{d×d}, a drift vector γ^d ∈ R^d and a Lévy measure ν^d. We refer, e.g., to Sato (1999), Applebaum (2009) for more detailed statements of these definitions, proofs of this characterization and further details on Lévy processes.
A stochastic process X^d is said to follow an exponential Lévy model if it is obtained by componentwise exponentiation of a Lévy process, see (3.18). We refer to Cont and Tankov (2004), Eberlein and Kallsen (2019) for more details on financial modelling using exponential Lévy models. From Theorem 3.8 we now obtain the following deep neural network approximation result. This result includes the case of a Black-Scholes model (ν^d = 0) as well as pure jump models (A^d_{i,j} = 0) with sufficiently integrable tails. In particular, Corollary 3.10 applies to prices of American / Bermudan basket put options, put on min or put on max options in such models (cf. Example 3.3 for payoffs that satisfy Assumption 2).
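For illustration (a sketch under the assumption that (3.18) has the standard componentwise form X^{d,i}_t = X^{d,i}_0 exp(L^{d,i}_t); all names below are hypothetical), one Markov update of such a model in the Black-Scholes special case ν^d = 0 can be simulated as follows, which also exhibits the recursive update structure with i.i.d. noise required by Assumption 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def bs_exp_levy_step(x, gamma, A_sqrt, rng=rng):
    """One step x -> x * exp(Levy increment) in the Black-Scholes special case,
    i.e. the Levy increment over one unit of time is Gaussian with mean gamma
    and covariance A = A_sqrt @ A_sqrt.T.  Sketch only; the exact
    parametrisation of the model is given in (3.18)."""
    y = rng.standard_normal(len(x))          # the noise variable Y
    increment = gamma + A_sqrt @ y           # Gaussian Levy increment
    return x * np.exp(increment)             # componentwise exponentiation

d = 5
x0 = np.ones(d)
gamma = -0.02 * np.ones(d)
A_sqrt = 0.2 * np.eye(d)
print(bs_exp_levy_step(x0, gamma, A_sqrt))
```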
Corollary 3.10. Let X^d follow an exponential Lévy model with Lévy triplet (A^d, γ^d, ν^d) and assume the triplets are bounded in the dimension, that is, there exists B > 0 such that the corresponding bounds hold for any d ∈ N, i, j = 1, . . ., d. Suppose the payoff functions g^d satisfy Assumption 2. Let c > 0, q ≥ 0 and assume for all d ∈ N that ρ^d is a probability measure on R^d with suitable moment bounds. Then there exist constants κ, q, r ∈ [0, ∞) and neural networks ψ^{ε,d,t}, ε ∈ (0, 1], d ∈ N, t ∈ {0, . . ., T}, such that size(ψ^{ε,d,t}) ≤ κ d^q ε^{-r} and the value function is approximated with error at most ε.

Proof. This follows directly from Theorem 3.8 and Lemma 4.2 with a choice of β of order 1/(T − 1).

Discrete diffusion models
Let T > 0 and let X^d follow a discrete diffusion model with given drift and diffusion coefficients. Consider the following assumption on the drift and diffusion coefficients.

Assumption 3. Assume that there exist constants C > 0, q, α, ζ ≥ 0 and, for each ε ∈ (0, 1], d ∈ N and t ∈ {0, . . ., T − 1}, neural networks approximating the drift coefficient and neural networks σ^{ε,d,t,i}, i = 1, . . ., d, approximating the rows of the diffusion coefficient, satisfying suitable approximation, Lipschitz and growth bounds. Here we denote by σ^{ε,d,t}(x) ∈ R^{d×d} the matrix with i-th row σ^{ε,d,t,i}(x).
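Since the precise model equation is not reproduced above, the following sketch assumes the usual Euler-type form of a discrete diffusion, X^d_{t+1} = X^d_t + μ(X^d_t) Δ + σ(X^d_t) √Δ Z_{t+1} with i.i.d. standard normal Z_{t+1}; all names are hypothetical. It illustrates how such a model fits the recursive update structure of Assumption 1(i).

```python
import numpy as np

rng = np.random.default_rng(2)

def euler_step(x, mu, sigma, dt, rng=rng):
    """One update x -> f(x, z) = x + mu(x)*dt + sigma(x) @ z * sqrt(dt).
    Hypothetical Euler-type discrete diffusion step; mu: R^d -> R^d,
    sigma: R^d -> R^{d x d}, z is the Gaussian noise variable."""
    z = rng.standard_normal(len(x))
    return x + mu(x) * dt + np.sqrt(dt) * sigma(x) @ z

# toy coefficients: linear drift, constant diffusion
d = 4
mu = lambda x: 0.05 * x
sigma = lambda x: 0.2 * np.eye(d)
x = np.ones(d)
for _ in range(10):                # simulate 10 time steps
    x = euler_step(x, mu, sigma, dt=0.1)
print(x)
```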
Corollary 3.11. Let X^d follow a discrete diffusion model with coefficients satisfying Assumption 3 with ζ < 1/(T − 1). Suppose p ≥ 2 and the payoff functions g^d satisfy Assumption 2. Let c > 0, q ≥ 0 and assume for all d ∈ N that ρ^d is a probability measure on R^d with suitable moment bounds. Then the conclusion (3.22) holds.

Proof. By Lemma 4.3 it follows that Assumption 1' is satisfied. In addition, the constant β > 0 in Assumption 1' may be chosen arbitrarily, and the constants ζ and θ from Assumption 1' can both be taken equal to β + ζ, where ζ denotes the constant from Assumption 3. Thus, we may select β = 1/(T − 1) − ζ − δ for some δ > 0; then β > 0 and the condition of Theorem 3.8 is satisfied. Theorem 3.8 hence implies (3.22).

Running minimum and maximum
In this subsection we show that our framework can also cover barrier options. This follows from the next proposition, which proves that for processes satisfying Assumption 1' also the processes augmented by their running maximum or minimum satisfy Assumption 1'.

Proposition 3.12. Suppose Assumption 1' holds. Then the processes obtained by augmenting X^{d-1} by one additional component carrying either its running minimum or its running maximum again satisfy Assumption 1'.

The proof is given at the end of Section 4.2 below.
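The construction behind Proposition 3.12 (detailed in the proof at the end of Section 4.2) augments the state by one coordinate tracking the running minimum over all components and time steps. The following minimal sketch, with hypothetical names, shows the corresponding augmented update map:

```python
import numpy as np

def augmented_update(f_base, x_aug, y):
    """Update map for the process augmented by its running minimum.
    x_aug = (x, m): x is the state of the base process, m the running minimum
    over all components and past time steps; f_base is the base update map."""
    x, m = x_aug[:-1], x_aug[-1]
    x_next = f_base(x, y)
    m_next = min(m, np.min(x_next))   # running minimum; use max(...) for the running maximum
    return np.append(x_next, m_next)

# toy base update: multiplicative noise
f_base = lambda x, y: x * np.exp(0.1 * y)
x_aug = np.array([1.0, 1.2, 1.0])     # two base components plus running minimum
y = np.array([-0.5, 0.3])
print(augmented_update(f_base, x_aug, y))
```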

Proofs
This section contains the remaining proofs of the results in Section 3. The section is split into several subsections. In Section 4.1 we provide a refined result on deep neural network approximations of the product function R × R → R, (x, y) → xy. Section 4.2 then contains two lemmas in which this approximation result is applied to verify that suitable exponential Lévy and discrete diffusion models satisfy Assumption 1'. Subsequently, Section 4.3 contains auxiliary results needed for the proofs of Theorem 3.4 and Theorem 3.8. The proofs of these two results are then contained in Sections 4.4 and 4.5.

Deep neural network approximation of the product
Based on (Yarotsky, 2017, Proposition 3) we provide here a refined result regarding deep neural network approximations of the product function R × R → R, (x, y) → xy.
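For orientation, constructions of this type combine the polarization identity

$xy = \tfrac{1}{2}\big( (x+y)^2 - x^2 - y^2 \big)$

with Yarotsky's ReLU approximation of the squaring function, which achieves accuracy ε on a compact interval with a network whose size and depth are of order log(1/ε).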

Sufficient conditions
In this subsection we prove Lemma 4.2 and Lemma 4.3, which show that the exponential Lévy and discrete diffusion models considered above satisfy Assumption 1'. We also provide a proof of Proposition 3.12.
Lemma 4.2. Let X^d follow an exponential Lévy model (cf. (3.18)) for each d ∈ N and assume that the Lévy triplets (A^d, γ^d, ν^d) are bounded in the dimension, that is, there exists B > 0 such that the corresponding bounds hold, where p̄ = 2m max(2, p). Then Assumption 1' is satisfied with the constant β > 0 in Assumption 1' chosen arbitrarily and with ζ = θ = β.
Proof. Firstly, (3.18) shows for each d ∈ N that the recursive update structure (3.1) is satisfied. Since L^d has independent increments, it follows that Assumption 1(ii) is satisfied. Next, we can employ an argument from the proof of (Gonon and Schwab, 2021a, Theorem 5.1) (which uses (Sato, 1999, Theorem 25.17) and (4.2)) to obtain the required moment bounds. Combined with Minkowski's inequality and the stationary increments property of L^d this yields the growth estimates, and Minkowski's integral inequality and (4.4) imply the remaining bounds.

Proof (of Lemma 4.3). Firstly, (3.1) holds with the update maps given by the model, and Assumption 1(ii) holds by the independent increments property of Brownian motion. Next, for all d ∈ N, t ∈ {0, . . ., T − 1} the noise variables are Gaussian; the fact that ‖Z‖^2 ∼ χ^2(d) for a d-dimensional standard normal vector Z and the upper and lower bounds for the gamma function (see, e.g., (Gonon et al., 2021, Lemma 2.4)) thus yield the required moment bounds. Next, we use Assumption 3 to estimate the approximation error of the neural network update maps. Furthermore, Assumption 3 and the Lipschitz property of n^{ε,M} yield, for all x, x', y, y', the required Lipschitz bounds. Finally, for all x, y ∈ R^d, (4.5) implies a polynomial growth bound (3.14) and the corresponding estimate for the approximating networks.

Moreover, the independence and moment conditions on Ȳ^d are satisfied and ‖f^d_t(0, 0)‖ ≤ 2 ‖f^{d-1}_t(0, 0)‖. Thus, (i), (ii) and (iv) in Assumption 1 are satisfied. Furthermore, by the identity x = x^+ − (−x)^+ and (Grohs et al., 2018, Lemma 4.12) the function min_k : R^k → R, z → min_{j=1,...,k} z_j can be realized by a deep neural network with size at most 12k^3. We now set η^{ε,d,t}(x, y) accordingly. Then the 1-Lipschitz property of min_k, which follows from the fact that the pointwise minimum of 1-Lipschitz functions is again 1-Lipschitz, implies the required bound on Lip(η^{ε,d,t}). The bound on size(η^{ε,d,t}) follows from the bound on size(η^{ε,d-1,t}) and bounds for the operations composition, parallelization and the realization of the identity, so that all the required bounds follow from the corresponding properties of X^{d-1}. In the case of the running maximum one proceeds analogously, except that the growth bounds are now slightly different, which still allows us to deduce the claimed statement.

Auxiliary results
This section contains auxiliary results that are needed for the proofs of Theorems 3.4 and 3.8. We start with Lemma 4.4, which establishes growth properties of the payoff function and its neural network approximation.
Lemma 4.4. Suppose Assumption 2 is satisfied. Then for all ε ∈ (0, 1] the corresponding growth bounds hold.

Proof. First note that from (3.6), (3.8) and the growth assumption on g^d we obtain for every ε ∈ (0, 1] the bound (4.9). Letting ε tend to 0 we thus obtain (4.7). Next, note that the same properties of g^d and φ^{ε,d,t} imply (4.10).

The next result, Lemma 4.5, establishes growth properties of the Markov update function and its neural network approximation.
Proof. Assume w.l.o.g. that c ≥ 1. Consider first the case when Assumption 1 holds. Then (4.11) can be used to prove inductively that (4.16) holds for all s ≥ t. Indeed, for s = t this directly follows from the definition. Assume now that s > t and that (4.16) holds for s − 1, s − 2, . . ., t; then (4.11), the induction hypothesis and independence yield (4.17), as claimed. This shows that (4.16) holds for all s ≥ t. From (4.16) and Assumption 1(iv) we obtain (4.18).

In the case when Assumption 1' holds we first note that independence, Jensen's inequality and (3.14) yield (4.19). We can now apply this estimate instead of (4.11) to get from the first to the second line in (4.17) and arrive at an analogous bound, hence the conclusion follows.
The next result ensures that the optimal value (2.1) is finite in our setting.
Lemma 4.7. Suppose Assumption 2 holds and Assumption 1 or 1' is satisfied. Then the optimal value (2.1) is finite.

Proof. Let d ∈ N, t ∈ {0, . . ., T}. Then Lemma 4.4, Lemma 4.6 and Assumption 1 or 1' ensure that the relevant expectations are finite.

The next lemma proves that the value function grows at most linearly. Recall from Remark 3.2 that Lemma 4.7 allows us to recursively define the value function for all x ∈ R^d as the right hand side of (2.2).
Lemma 4.9 mathematically proves the intuitively obvious fact that a neural network in which some input arguments are held at fixed values is still a neural network with at most as many non-zero parameters as the original neural network.
Lemma 4.9. Let d_0, d_1, m ∈ N and let φ : R^{d_0+d_1} → R^m be a neural network. Let y ∈ R^{d_1}. Then the function R^{d_0} ∋ x → φ(x, y) ∈ R^m can be realized by a neural network with at most size(φ) non-zero parameters.
The next lemma will allow us to construct a realization of a random neural network and at the same time obtain a bound on the neural network weights.
Lemma 4.10. Let U be a nonnegative random variable and let N ∈ N.

Proof. Firstly, by the i.i.d. assumption it follows that (4.25) holds. Next, note that Bernoulli's inequality implies (2/3)^{1/N} ≤ 1 − 1/(3N) and therefore, by Markov's inequality, the corresponding probability bound holds; inserting this in (4.25) yields (4.26). Furthermore, Markov's inequality implies (4.27). Combining (4.26) and (4.27) with the elementary fact that P(A ∩ B) = P(A) + P(B) − P(A ∪ B) ≥ P(A) + P(B) − 1 for A, B ∈ F then shows the claim.

Proof of Theorem 3.4 and Corollary 3.5

With these preparations we are now ready to prove Theorem 3.4. The proof is divided into several steps, which are highlighted in bold in order to facilitate reading. Let us first provide a brief sketch of the proof. The proof proceeds by backward induction. This entails some subtleties regarding the probability measure ρ^d, which we will not discuss here. We refer to the proof below for details. Here we rather provide an easy-to-follow overview.
The starting point is the backward recursion (2.2). Our goal is to provide a neural network approximation of the right hand side of (2.2). At time t we first aim to bound the expected L^2(ρ^d)-approximation error E[E^d_t] incurred when the continuation value is replaced by Γ^{ε,d,t}. We derive a bound on E[E^d_t], which we can then use to obtain the existence of a realization γ^{ε,d,t} of Γ^{ε,d,t} satisfying a slightly worse bound and such that the realization of max_{i=1,...,N} ‖Y^{d,i}_t‖ can also be bounded suitably. This last point is necessary to control the growth of γ^{ε,d,t}(x). Then γ^{ε,d,t}(x) is an approximation of the continuation value and so we naturally define the approximate value function at time t by

$v^{\varepsilon,d,t}(x) = \max\big( \phi^{\varepsilon,d,t}(x) - \delta, \, \gamma^{\varepsilon,d,t}(x) \big) \qquad (4.28)$

for a suitably chosen δ (depending on ε). We then consider the continuation region and the approximate continuation region and decompose the error into four terms, see (4.31). The L^2(ρ^d)-error of the last term has already been analysed, and it remains to analyse the remaining terms. The first term is small due to Assumption 2. The second and third term may not necessarily be small, but we will be able to show that ρ^d(C_t ∩ Ĉ^c_t) and ρ^d(C^c_t ∩ Ĉ_t) are small. Hence, the overall L^2(ρ^d)-error can be controlled. The proof is then completed by showing that the neural network (4.28) satisfies the growth, size and Lipschitz properties required to carry out the induction argument.
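To illustrate the objects in this sketch (under the assumption, suggested by the bias-variance argument in Step 6 below, that Γ^{ε,d,t} is an empirical average of the approximate value function at time t + 1 over N i.i.d. noise samples; all names are hypothetical), the construction of the approximate continuation value and of (4.28) looks roughly as follows:

```python
import numpy as np

def approximate_continuation(v_next, eta, x, noise_samples):
    """Empirical average Gamma(x) = (1/N) sum_i v_next(eta(x, Y_i)) of the
    time-(t+1) approximate value function over N i.i.d. noise samples."""
    return np.mean([v_next(eta(x, y)) for y in noise_samples])

def approximate_value(phi_t, v_next, eta, x, noise_samples, delta):
    """Approximate value function (4.28): max(phi_t(x) - delta, Gamma(x))."""
    gamma = approximate_continuation(v_next, eta, x, noise_samples)
    return max(phi_t(x) - delta, gamma)

# toy illustration in dimension d = 2
rng = np.random.default_rng(3)
eta = lambda x, y: x * np.exp(0.2 * y - 0.02)            # approximate Markov update
phi_t = lambda x: max(1.0 - x.min(), 0.0)                # approximate payoff at time t
v_next = lambda x: max(1.0 - x.min(), 0.0)               # stand-in for the time-(t+1) network
x = np.array([0.95, 1.05])
samples = rng.standard_normal((500, 2))
print(approximate_value(phi_t, v_next, eta, x, samples, delta=1e-3))
```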

Stronger statement:
We will now proceed to prove the following stronger statement, which shows that the constants κ_t, q_t, r_t can be chosen essentially independently of the probability measure ρ^d and, in addition, ρ^d may be allowed to depend on t. Specifically, we will prove that for any t ∈ {0, . . ., T} there exist constants κ_t, q_t, r_t ∈ [0, ∞) such that for any family of probability measures satisfying (4.32) and for all d ∈ N, ε ∈ (0, 1] there exists a neural network ψ^{ε,d,t} such that (4.33)-(4.36) hold. Choosing ρ^d_t = ρ^d for all t and noting that (4.32) is satisfied due to q ≤ q_t, c ≤ c_t, the statement of Theorem 3.4 then follows. In order to prove the stronger statement for each fixed t, we now proceed by backward induction.

Start of the induction step:
The remainder of the proof will now be dedicated to the induction step. To improve readability we will again divide it into several steps. For the induction step we now assume that the stronger statement formulated in Step 2 above holds for time t + 1 and aim to prove it for time t. To this end, let ρ^d_t be a probability measure satisfying (4.32) and denote by ν^d_t the distribution of Y^d_t.

5. Induction hypothesis: Let κ_{t+1}, q_{t+1}, r_{t+1} ∈ [0, ∞) denote the constants with which the stronger statement formulated in Step 2 above holds for time t + 1. Consider the probability measure given by the distribution of the updated state when X^d_t ∼ ρ^d_t and Y^d_t ∼ ν^d_t. Then, using the change-of-variables formula, (4.11), (4.32) and Assumption 1(iv) and writing p̄ = 2 max(p, 2), this measure satisfies (4.37). Hence, by the induction hypothesis, for any ε ∈ (0, 1], d ∈ N there exists a neural network ψ^{ε,d,t+1} such that the corresponding size and approximation bounds hold, in particular (4.38), and Lip(ψ^{ε,d,t+1}) ≤ κ_{t+1} d^{q_{t+1}}. (4.41)

Now let ε ∈ (0, 1], d ∈ N be given. The remainder of the proof consists in selecting κ_t, q_t, r_t (only depending on c, α, p, q, t, T, κ_{t+1}, q_{t+1}, r_{t+1}) and constructing a neural network ψ^{ε,d,t} such that (4.33)-(4.36) are satisfied. This will complete the proof.
In what follows we fix an auxiliary accuracy ε̄ ∈ (0, 1) and choose the parameters of the construction in terms of ε̄. The value of ε̄ will be chosen later (depending on ε and d).

Approximation of the continuation value
We now estimate the expected L^2(ρ^d_t)-error that arises when Γ^{ε,d,t} is used to approximate the continuation value. Denote Z^{ε,d,i}(x) = v^{ε,d,t+1}(η^{ε,d,t}(x, Y^{d,i}_t)) and recall that Γ^{ε,d,t}(x) is the average of the Z^{ε,d,i}(x); thus the bias-variance decomposition and independence show (4.43). The term corresponding to the first integral in the last line of (4.43) can be estimated as in (4.44).

6.a) Applying the error estimate from t + 1: Now consider the first term in the right hand side of (4.44). Then Jensen's inequality, (3.1), Assumption 1(ii) and (4.38) yield (4.45).

6.b) Applying the Lipschitz property of the network at t + 1: For the second term in the right hand side of (4.44), note that by the induction hypothesis (4.41) we have Lip(v^{ε,d,t+1}) ≤ κ_{t+1} d^{q_{t+1}} and hence (3.2), the assumption E[‖Y^d_t‖^p] ≤ c d^q and (4.32) imply (4.46).

6.c) Applying the growth property of the network at t + 1: For the last term in (4.43), note that the induction hypothesis, the assumption E[‖Y^d_t‖^p] ≤ c d^q, Hölder's inequality and (4.32) yield (4.47).

6.d) Bounding the overall error and constructing a realization: We can now insert the estimates from (4.45) and (4.46) into (4.44) and subsequently insert the resulting bound and (4.47) into (4.43). We obtain a bound on the expected error. Hence, Assumption 1(iv), the fact that the Y^{d,i}_t are i.i.d. and Lemma 4.10 allow us to select a realization γ^{ε,d,t} of Γ^{ε,d,t}; we then define v^{ε,d,t} by (4.28) and claim that ψ^{ε,d,t} = v^{ε,d,t} satisfies all the properties required in (4.33)-(4.36).
(4.56) In addition, (3.8) implies Lip(φ^{ε,d,t} − δ) = Lip(φ^{ε,d,t}) ≤ c d^q, and the pointwise maximum of two Lipschitz continuous functions is again Lipschitz continuous with Lipschitz constant given by the maximum of the two Lipschitz constants. Combining this with (4.56) yields

Lip(ψ^{ε,d,t}) ≤ max(c d^q, κ_{t+1} d^{q_{t+1}+q} c) ≤ c max(κ_{t+1}, 1) d^{q_{t+1}+q}. (4.57)

10. Bounding the overall approximation error: We now work towards verifying the approximation error bound (4.33). To achieve this, let C_t = {x ∈ R^d : g^d(t, x) < E[V^d(t+1, X^d_{t+1}) | X^d_t = x]} be the continuation region and let Ĉ_t = {x ∈ R^d : φ^{ε,d,t}(x) − δ < γ^{ε,d,t}(x)} be the approximate continuation region. Then we obtain the decomposition (4.58). We now estimate (the integral of) each of these four terms separately. For the first term, from (3.6) we directly get the required bound, and so we proceed with analysing the second term.

Proof of Theorem 3.8
This subsection is devoted to the proof of Theorem 3.8. It is based on the proof of Theorem 3.4 given in the previous subsection.
This, combined with the Lipschitz, growth and approximation properties that we already established, implies a polynomial bound (3.15) with θ = β + ζ. Altogether, this proves that Assumption 1' is satisfied with the claimed choices of ζ and θ.

Proof of Proposition 3.12. Consider first the case of the running minimum. For z ∈ R^d partition z = (z_{1:d-1}, z_d) into the first d − 1 components and the last component. Define the transition map for the augmented process, f^d_t : R^d × R^d → R^d, by f^d_t(x, y) = (f^{d-1}_t(x_{1:d-1}, y_{1:d-1}), min(min_{j=1,...,d-1} [f^{d-1}_t(x_{1:d-1}, y_{1:d-1})]_j, x_d)).